Abstract
We propose detecting deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync or unnatural facial and lip movements. MDS is computed as the mean of the dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state of the art by up to 7%. We also demonstrate temporal forgery localization and show how our technique identifies the manipulated video segments.
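The chunk-wise scoring described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the distance metric, the exact form of the contrastive loss, or how segments are flagged, so the Euclidean distance, the margin-based loss, the `margin` value, and the `threshold` used for localization are all assumptions, and every function name here is hypothetical. The per-chunk embeddings are taken as given; the networks that would produce them are out of scope.

```python
import math


def chunk_dissimilarity(audio_feat, visual_feat):
    """Euclidean distance between one audio and one visual chunk embedding.

    Assumption: the abstract only says "dissimilarity"; Euclidean distance
    is used here for illustration.
    """
    if len(audio_feat) != len(visual_feat):
        raise ValueError("audio and visual embeddings must share a dimension")
    return math.sqrt(sum((a - v) ** 2 for a, v in zip(audio_feat, visual_feat)))


def modality_dissonance_score(audio_chunks, visual_chunks):
    """Return (MDS, per-chunk scores), where MDS is the mean chunk dissimilarity."""
    if not audio_chunks or len(audio_chunks) != len(visual_chunks):
        raise ValueError("need the same non-zero number of audio and visual chunks")
    scores = [chunk_dissimilarity(a, v) for a, v in zip(audio_chunks, visual_chunks)]
    return sum(scores) / len(scores), scores


def contrastive_loss(distance, is_real, margin=1.0):
    """Margin-based contrastive loss for one chunk (assumed form).

    Real pairs are pulled together (distance squared); fake pairs are pushed
    apart until they are at least `margin` away.
    """
    if is_real:
        return distance ** 2
    return max(margin - distance, 0.0) ** 2


def localize(scores, threshold):
    """Indices of chunks whose dissimilarity exceeds a hypothetical threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy example: three chunks of 2-D embeddings; only the last pair disagrees.
audio = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
visual = [[0.0, 0.0], [1.0, 0.0], [3.0, 5.0]]

mds, per_chunk = modality_dissonance_score(audio, visual)
flagged = localize(per_chunk, threshold=1.0)  # chunk index 2 is flagged
```

In this sketch a video would be called fake when `mds` exceeds a threshold chosen on held-out data, and the per-chunk scores give the temporal localization. The cross-entropy terms for the individual modalities mentioned in the abstract are omitted.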
| Original language | English |
|---|---|
| Title of host publication | MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 439-447 |
| Number of pages | 9 |
| ISBN (Electronic) | 9781450379885 |
| DOIs | |
| Publication status | Published - 12 Oct 2020 |
| Externally published | Yes |
| Event | 28th ACM International Conference on Multimedia, Virtual, Online, United States. Duration: 12 Oct 2020 → 16 Oct 2020. Conference number: 28th |
Publication series
| Name | MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia |
|---|---|
Conference
| Conference | 28th ACM International Conference on Multimedia |
|---|---|
| Abbreviated title | MM '20 |
| Country/Territory | United States |
| City | Virtual, Online |
| Period | 12/10/20 → 16/10/20 |
| Other | This year's conference, for the first time in its history, will run as a large-scale online meeting due to COVID-19. Face-to-face gatherings have been replaced by all-hands live sessions, parallel live question-answering (Q&A) sessions, and presentations from the authors via pre-recorded videos. However, the distance needed to prevent the spread of the virus has not dampened the enthusiasm of participants, presenters, chairs, reviewers, and volunteers. Under today's challenging circumstances, we are pleased that we are still able to deliver the most cutting-edge research results, covering the latest findings in the field, for which the ACM Multimedia conference series is widely known. |
Keywords
- contrastive loss
- deepfake detection and localization
- modality dissonance
- neural networks
Paper title: 'Not made for each other: Audio-Visual Dissonance-based Deepfake Detection and Localization'