TY - JOUR
T1 - Glitch in the matrix
T2 - A large scale benchmark for content driven audio–visual forgery detection and localization
AU - Cai, Zhixi
AU - Ghosh, Shreya
AU - Dhall, Abhinav
AU - Gedeon, Tom
AU - Stefanov, Kalin
AU - Hayat, Munawar
PY - 2023/11
Y1 - 2023/11
N2 - Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes and are centered around the binary classification task of detecting whether a video is real or fake. This is because available benchmark datasets contain mostly visual-only modifications present in the entirety of the video. However, a sophisticated deepfake may include small segments of audio or audio–visual manipulations that can completely change the meaning of the video content. To addresses this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual and audio–visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture which effectively captures multimodal manipulations. We further improve (i.e. BA-TFD+) the baseline method by replacing the backbone with a Multiscale Vision Transformer and guide the training process with contrastive, frame classification, boundary matching and multimodal boundary matching loss functions. The quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks using several benchmark datasets including our newly proposed dataset. The dataset, models and code are available at https://github.com/ControlNet/LAV-DF.
AB - Most deepfake detection methods focus on detecting spatial and/or spatio-temporal changes in facial attributes and are centered around the binary classification task of detecting whether a video is real or fake. This is because available benchmark datasets contain mostly visual-only modifications present in the entirety of the video. However, a sophisticated deepfake may include small segments of audio or audio–visual manipulations that can completely change the meaning of the video content. To addresses this gap, we propose and benchmark a new dataset, Localized Audio Visual DeepFake (LAV-DF), consisting of strategic content-driven audio, visual and audio–visual manipulations. The proposed baseline method, Boundary Aware Temporal Forgery Detection (BA-TFD), is a 3D Convolutional Neural Network-based architecture which effectively captures multimodal manipulations. We further improve (i.e. BA-TFD+) the baseline method by replacing the backbone with a Multiscale Vision Transformer and guide the training process with contrastive, frame classification, boundary matching and multimodal boundary matching loss functions. The quantitative analysis demonstrates the superiority of BA-TFD+ on temporal forgery localization and deepfake detection tasks using several benchmark datasets including our newly proposed dataset. The dataset, models and code are available at https://github.com/ControlNet/LAV-DF.
KW - Datasets
KW - Deepfake
KW - Detection
KW - Localization
UR - http://www.scopus.com/inward/record.url?scp=85169906520&partnerID=8YFLogxK
U2 - 10.1016/j.cviu.2023.103818
DO - 10.1016/j.cviu.2023.103818
M3 - Article
AN - SCOPUS:85169906520
SN - 1077-3142
VL - 236
JO - Computer Vision and Image Understanding
JF - Computer Vision and Image Understanding
M1 - 103818
ER -