Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization

Komal Chugh, Parul Gupta, Abhinav Dhall, Ramanathan Subramanian

Research output: A Conference proceeding or a Chapter in BookConference contributionpeer-review

18 Citations (Scopus)

Abstract

We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed as the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to dis-harmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT Datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.

Original languageEnglish
Title of host publicationMM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
EditorsChang Wen Chen, Rita Cucchiara, Xian-Sheng Hua, Guo-Jun Qi, Elisa Ricci, Zhengyou Zhang, Roger Zimmermann
Place of PublicationUnited States
PublisherAssociation for Computing Machinery (ACM)
Pages439-447
Number of pages9
ISBN (Electronic)9781450379885
DOIs
Publication statusPublished - 12 Oct 2020
Externally publishedYes
Event28th ACM International Conference on Multimedia, MM 2020 - Virtual, Online, United States
Duration: 12 Oct 202016 Oct 2020

Publication series

NameMM 2020 - Proceedings of the 28th ACM International Conference on Multimedia

Conference

Conference28th ACM International Conference on Multimedia, MM 2020
Country/TerritoryUnited States
CityVirtual, Online
Period12/10/2016/10/20

Fingerprint

Dive into the research topics of 'Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization'. Together they form a unique fingerprint.

Cite this