Depression is a severe mental health disorder with high societal costs. Current clinical practice depends almost exclusively on self-report and clinical opinion, risking a range of subjective biases. The long-term goal of our research is to develop assistive technologies to support clinicians and sufferers in the diagnosis and monitoring of treatment progress in a timely and easily accessible format. In the first phase, we aim to develop a diagnostic aid using affective sensing approaches. This paper describes the progress to date and proposes a novel multimodal framework comprising of audio-video fusion for depression diagnosis. We exploit the proposition that the auditory and visual human communication complement each other, which is well-known in auditory-visual speech processing; we investigate this hypothesis for depression analysis. For the video data analysis, intra-facial muscle movements and the movements of the head and shoulders are analysed by computing spatio-temporal interest points. In addition, various audio features (fundamental frequency f0, loudness, intensity and mel-frequency cepstral coefficients) are computed. Next, a bag of visual features and a bag of audio features are generated separately. In this study, we compare fusion methods at feature level, score level and decision level. Experiments are performed on an age and gender matched clinical dataset of 30 patients and 30 healthy controls. The results from the multimodal experiments show the proposed framework’s effectiveness in depression analysis.