Frequency-Domain Multi-feature Fusion for Deepfake Video Detection Based on Key Frames

Wang Jinwei (王金伟), Zhang Meigui (张玫瑰), Zhang Jiawei (张家伟), Luo Xiangyang (罗向阳), Ma Bin (马宾)

doi:10.3969/j.issn.0255-8297.2025.03.007

Journal of Applied Sciences (应用科学学报)

2025, Vol. 43, Issue 3: 451-462

Computer Science and Applications

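Funding

National Natural Science Foundation of China (No. 62472229, No. 62371145, No. 62172435, No. 62272255, No. 62302248, No. U24B20179, No. U23A20305, No. U23B2022); National Key Research and Development Program of China (No. 2021QY0700); Zhongyuan Science and Technology Innovation Leading Talent Project of China (No. 214200510019)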

Frequency-Domain Multi-feature Fusion for Deepfake Video Detection Based on Key Frames

1. School of Computer Science, School of Cyber Science and Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China;
2. Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China;
3. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, Henan, China;
4. Shandong Provincial Key Laboratory of Computer Networks, Qilu University of Technology, Jinan 250353, Shandong, China

Received date: 2022-04-11

Online published: 2025-06-23

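Cite this article

Wang Jinwei, Zhang Meigui, Zhang Jiawei, Luo Xiangyang, Ma Bin. Frequency-Domain Multi-feature Fusion for Deepfake Video Detection Based on Key Frames[J]. Journal of Applied Sciences, 2025, 43(3): 451-462. DOI: 10.3969/j.issn.0255-8297.2025.03.007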

Abstract

To save computing resources and avoid data redundancy, most existing Deepfake video detection methods randomly select multiple frames or partial segments of a video as the detection objects. This selection strategy weakens the representation ability of the detection objects and limits detection performance. Moreover, while existing algorithms perform well on a single dataset, their performance degrades severely in cross-dataset detection, so their generalization ability needs further improvement. To address these problems, we propose a frequency-domain multi-feature fusion algorithm for Deepfake video detection based on key frames. The mean square error in the frequency domain is used to extract key frames as the detection objects. The artifact features of the main frame and the temporal inconsistency features between key frames are then learned in the frequency domain. These features are fused and passed through a fully connected layer to obtain the final detection result. Experimental results show that the proposed algorithm outperforms existing methods in cross-dataset detection and generalizes well.
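
As an illustration of the pipeline described in the abstract, the following Python sketch selects key frames by the mean square error between frequency spectra and then fuses two feature vectors through one fully connected layer. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which transform, selection rule, or fusion operator is used, so the 2D FFT log-magnitude spectrum, the "largest change from the previous frame" rule, concatenation as the fusion step, and the sigmoid output are all assumptions. The names `log_spectrum`, `select_key_frames`, and `fuse_and_classify` are hypothetical, and the artifact and temporal features are stand-in vectors rather than learned frequency-domain features.

```python
import numpy as np


def log_spectrum(frame: np.ndarray) -> np.ndarray:
    """Log-magnitude 2D FFT spectrum of a grayscale frame.

    Assumption: the paper's frequency transform is not given in the abstract;
    the FFT is used here only for illustration.
    """
    return np.log1p(np.abs(np.fft.fft2(frame.astype(np.float64))))


def select_key_frames(frames: np.ndarray, num_key: int) -> list[int]:
    """Return sorted indices of `num_key` key frames from a (T, H, W) array.

    Hypothetical rule: always keep frame 0, then keep the frames whose
    spectrum has the largest MSE relative to the preceding frame's spectrum.
    """
    spectra = [log_spectrum(f) for f in frames]
    # mse[i - 1] is the spectral MSE between frame i and frame i - 1.
    mse = np.array(
        [np.mean((spectra[i] - spectra[i - 1]) ** 2) for i in range(1, len(spectra))]
    )
    # Take the (num_key - 1) largest changes and shift back to frame indices.
    top = np.argsort(mse)[::-1][: num_key - 1] + 1
    return sorted([0, *top.tolist()])


def fuse_and_classify(
    artifact_feat: np.ndarray,
    temporal_feat: np.ndarray,
    weight: np.ndarray,
    bias: float,
) -> float:
    """Concatenate two feature vectors and apply one fully connected layer.

    Assumption: fusion is plain concatenation and the output is a sigmoid
    score; the abstract only states that fused features feed a fully
    connected layer.
    """
    fused = np.concatenate([artifact_feat, temporal_feat])
    logit = float(weight @ fused + bias)
    return 1.0 / (1.0 + np.exp(-logit))
```

Calling `select_key_frames(frames, num_key)` on a stack of grayscale frames returns the indices to keep. `fuse_and_classify` expects `weight` to have as many entries as the two feature vectors combined. In a real detector the weights would be trained rather than supplied by hand.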

Key words: Deepfake video detection; key frames; frequency domain; multi-feature fusion
