DeepGuard: A Comprehensive Multimodal Deepfake Detection Framework with Attention-Based Fusion, Explainability, and Scalable Deployment
Nikhil Yadav, Mayur Raval, Om Yadav, Atharav Chougule, Ritesh Upadhye
Department of Computer Science and Engineering (AIML)
Shivaji University, Kolhapur, India
Abstract—
The rapid advancement of generative artificial intelligence has led to the widespread creation of highly realistic deepfake content across images, videos, audio, and text. While such technologies offer innovative applications, they also pose significant risks to digital trust, cybersecurity, and information integrity. Existing deepfake detection methods often rely on unimodal analysis, which limits their ability to detect sophisticated multimodal manipulations. To address this limitation, this paper proposes DeepGuard, a comprehensive multimodal deepfake detection framework that integrates image, video, audio, and textual analysis using an attention-based fusion strategy.
The proposed system employs pretrained MobileNetV2 models for feature extraction from images, video frames, and audio spectrograms, ensuring computational efficiency and robust representation learning. Textual features are extracted using TF–IDF vectorization and classified through a Multinomial Naïve Bayes model. The modality-specific embeddings are projected into a shared latent space and adaptively fused using a learnable attention mechanism that dynamically assigns importance weights based on contextual relevance.
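The fusion step described above — projecting each modality's embedding into a shared latent space, scoring it with a learnable attention vector, and taking a softmax-weighted sum — can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the authors' implementation: the function and parameter names (`attention_fuse`, `proj_weights`, `attn_vector`) and the dimensions are assumptions for demonstration.

```python
import numpy as np

def attention_fuse(embeddings, proj_weights, attn_vector):
    """Fuse modality-specific embeddings via learnable attention.

    embeddings   : list of 1-D arrays, one per modality (image, video,
                   audio, text), possibly of different sizes
    proj_weights : list of (d, k_i) projection matrices mapping each
                   modality into a shared d-dimensional latent space
    attn_vector  : (d,) learned vector used to score each projection
    """
    # Project each modality embedding into the shared latent space.
    projected = [W @ e for W, e in zip(proj_weights, embeddings)]
    # Score each modality by its alignment with the attention vector.
    scores = np.array([attn_vector @ z for z in projected])
    # Softmax converts scores into importance weights summing to 1,
    # so the weighting adapts to the content of each input.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The fused representation is the attention-weighted sum.
    fused = sum(w * z for w, z in zip(weights, projected))
    return fused, weights

# Toy usage with assumed embedding sizes for the four modalities.
rng = np.random.default_rng(0)
dims = [1280, 1280, 1280, 300]  # e.g. MobileNetV2 features + TF-IDF
d = 256                         # assumed shared latent dimension
embs = [rng.standard_normal(k) for k in dims]
Ws = [rng.standard_normal((d, k)) * 0.01 for k in dims]
a = rng.standard_normal(d)
fused, w = attention_fuse(embs, Ws, a)
```

In the full system the projection matrices and attention vector would be trained end-to-end with the classifier, so the weights `w` shift toward whichever modality carries the strongest manipulation cues for a given input.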
Experimental results demonstrate that the proposed multimodal approach outperforms unimodal baselines and static fusion methods across standard evaluation metrics. The lightweight architecture further supports scalable deployment in cloud and edge environments. The DeepGuard framework provides an efficient and practical solution for detecting evolving deepfake threats in real-world multimedia systems.
Keywords—
Deepfake Detection, Multimodal Fusion, CNN, LSTM, Transformer, Attention, Explainable AI