Unmasking Sophisticated Video Deepfakes: A Spatio-Temporal Approach for Multi-Identity and Size-Varying Scenarios
Kongara Harsha Deep1, Mogili Karunakar2, Mullapudi Bhargava Satya Narendra3,
Dr. K. Siva Kumar4
1,2,3,4 Department of Computer Science and Engineering, R.V.R & J.C College of Engineering, Guntur, India
Abstract:
Deepfake video generation techniques are evolving rapidly, producing highly realistic manipulated content that poses significant societal risks. Detecting these fakes, especially in complex, real-world videos, remains a major challenge. Current detection methods often struggle when a video features multiple individuals or when faces appear at widely varying sizes. Many approaches rely on simple frame averaging or focus only on the most prominent face, so they can miss subtle or localized manipulations and ignore crucial temporal inconsistencies. To overcome these limitations, we present a video deepfake detection framework designed for robustness in these challenging scenarios. Our approach integrates a Convolutional Neural Network (CNN) backbone, which captures fine-grained spatial details, with a Spatio-Temporal Transformer that models temporal dynamics. Critically, we introduce an Identity-Aware Attention mechanism that allows the Transformer to process the face sequences of different individuals independently, enabling effective analysis of multi-person videos without resorting to naive post-hoc aggregation. Furthermore, we incorporate two specialized embedding strategies: Temporal Coherence Embeddings, which preserve the temporal ordering and relationships of faces even when different identities appear concurrently, and Relative Size Embeddings, which explicitly encode the scale of each detected face relative to the video frame. Our experiments, particularly on the diverse ForgeryNet dataset, demonstrate state-of-the-art performance, with a marked improvement (up to 14% in AUC) over existing methods on videos containing multiple people. The framework also generalizes well across different forgery types and datasets, highlighting its potential for practical deployment. We plan to release our implementation to facilitate further research.
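To make two of the ideas above concrete, the following is a minimal illustrative sketch and not the implementation evaluated in this work. It assumes that Identity-Aware Attention can be approximated by a boolean mask that lets a face token attend only to tokens of the same identity, and that the relative size of a face is the ratio of its bounding-box area to the frame area. The function names (identity_attention_mask, relative_face_size, masked_attention) and the use of plain NumPy single-head attention are assumptions made for illustration only.

```python
import numpy as np


def identity_attention_mask(identity_ids):
    """Boolean (T, T) mask: True where two face tokens share an identity.

    Assumption: attention is restricted to tokens of the same person.
    """
    ids = np.asarray(identity_ids)
    return ids[:, None] == ids[None, :]


def relative_face_size(box, frame_w, frame_h):
    """Face bounding-box area as a fraction of the frame area.

    Assumption: box is (x0, y0, x1, y1) in pixels; a real model would
    likely map this scalar to a learned embedding vector.
    """
    x0, y0, x1, y1 = box
    return ((x1 - x0) * (y1 - y0)) / float(frame_w * frame_h)


def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Disallowed pairs get -inf so that they receive zero weight.
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps its diagonal entry, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Five face tokens from two people (identities 0 and 1).
    ids = [0, 0, 1, 1, 0]
    tokens = rng.normal(size=(5, 8))
    mask = identity_attention_mask(ids)
    out, weights = masked_attention(tokens, tokens, tokens, mask)
    print(weights.round(3))  # cross-identity entries are exactly zero
    print(relative_face_size((0, 0, 64, 36), 640, 360))
```

Under these assumptions, the tokens of one person never mix with those of another inside the attention step, which is one simple way to avoid averaging over identities after the fact.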