TY - GEN
T1 - Not made for each other: Audio-Visual Dissonance-based Deepfake Detection and Localization
AU - Chugh, Komal
AU - Gupta, Parul
AU - Dhall, Abhinav
AU - Subramanian, Ramanathan
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/10/12
Y1 - 2020/10/12
N2 - We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync and unnatural facial and lip movements. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
AB - We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync and unnatural facial and lip movements. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
KW - contrastive loss
KW - deepfake detection and localization
KW - modality dissonance
KW - neural networks
UR - http://www.scopus.com/inward/record.url?scp=85101205270&partnerID=8YFLogxK
UR - https://dl.acm.org/doi/proceedings/10.1145/3394171
U2 - 10.1145/3394171.3413700
DO - 10.1145/3394171.3413700
M3 - Conference contribution
AN - SCOPUS:85101205270
T3 - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
SP - 439
EP - 447
BT - MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia
A2 - Chen, Chang Wen
A2 - Cucchiara, Rita
A2 - Hua, Xian-Sheng
A2 - Qi, Guo-Jun
A2 - Ricci, Elisa
A2 - Zhang, Zhengyou
A2 - Zimmermann, Roger
PB - Association for Computing Machinery (ACM)
CY - United States
T2 - 28th ACM International Conference on Multimedia, MM 2020
Y2 - 12 October 2020 through 16 October 2020
ER -