A Multimodal Bangla Fake News Detection Framework Using Bangla-BERT, ViT, and Co-Attention Fusion
Shahbaz Akhtar
M.Tech Student, Department of Computer Science and Engineering, All Saints College of Technology, Bhopal, India
Affiliated to Rajiv Gandhi Proudyogiki Vishwavidyalaya (RGPV)
Shahbazs2s224@gmail.com
Prof. Sarwesh Site
Associate Professor, Department of Computer Science and Engineering, All Saints College of Technology, Bhopal, India
Affiliated to Rajiv Gandhi Proudyogiki Vishwavidyalaya (RGPV)
er.sarwesh@gmail.com
ABSTRACT
The rise of misleading and fabricated news content on digital platforms has created a critical need for reliable misinformation detection systems, particularly in under-resourced languages such as Bangla, where research and datasets remain limited. Traditional fake news detection approaches often rely on either textual or visual features in isolation, which restricts their ability to capture the cross-modal inconsistencies commonly found in multimedia misinformation. To address these challenges, this study proposes MBM-CTNet, a multimodal and multitask learning framework for comprehensive fake news detection, evaluated on the MultiBanFakeDetect dataset of 9,600 Bangla text–image pairs. The model integrates a Bangla-BERT text encoder, a Vision Transformer (ViT) image encoder, and a cross-modal co-attention fusion mechanism to jointly model semantic relationships across modalities. Additionally, a text–image consistency head trained with contrastive learning explicitly detects mismatched or manipulated visuals, a common characteristic of clickbait and rumor-based content. The proposed framework performs three simultaneous tasks: binary fake news detection, fake-news type classification (misinformation, rumor, clickbait), and category prediction across 12 news domains. Experimental evaluations show that MBM-CTNet surpasses established baselines, including text-only transformers, image-only classifiers, and traditional multimodal fusion models, achieving 94.5% accuracy, 94.2% F1-score, 94.8% precision, 93.9% recall, and 96.2% AUC-ROC on the benchmark dataset. These results demonstrate the effectiveness of co-attention-based multimodal fusion and multitask learning for improving misinformation detection in low-resource settings.
Overall, this work offers a robust and scalable solution for Bangla multimodal fake news detection and provides a strong foundation for future extensions to other low-resource and code-mixed languages. It further highlights the importance of cross-modal consistency modeling as a key component for detecting modern, visually manipulated misinformation.
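To make the architecture described above concrete, the following is a minimal PyTorch sketch of the cross-modal co-attention fusion and the three multitask heads. All layer sizes, the mean-pooling choice, and the smoke-test shapes are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of co-attention fusion + multitask heads (illustrative only).
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Bidirectional cross-modal attention: text tokens attend to image
    patches and image patches attend to text tokens; the pooled, attended
    sequences are concatenated into one fused vector."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, dim) token embeddings, e.g. from Bangla-BERT
        # image_feats: (B, P, dim) patch embeddings, e.g. from ViT
        t_attended, _ = self.txt2img(text_feats, image_feats, image_feats)
        i_attended, _ = self.img2txt(image_feats, text_feats, text_feats)
        # Mean-pool each attended sequence, then fuse by concatenation.
        return torch.cat([t_attended.mean(dim=1),
                          i_attended.mean(dim=1)], dim=-1)  # (B, 2 * dim)

class MultitaskHeads(nn.Module):
    """Three task heads on one shared fused representation: binary fake/real,
    fake-news type (misinformation/rumor/clickbait), and 12 news domains."""
    def __init__(self, fused_dim=1536):
        super().__init__()
        self.binary = nn.Linear(fused_dim, 2)
        self.fake_type = nn.Linear(fused_dim, 3)
        self.category = nn.Linear(fused_dim, 12)

    def forward(self, fused):
        return self.binary(fused), self.fake_type(fused), self.category(fused)

# Smoke test with random features standing in for encoder outputs.
fusion, heads = CoAttentionFusion(), MultitaskHeads()
text = torch.randn(4, 128, 768)   # batch of 4, 128 text tokens
image = torch.randn(4, 197, 768)  # 196 ViT patches + [CLS]
logits = heads(fusion(text, image))
print([l.shape for l in logits])  # shapes: (4, 2), (4, 3), (4, 12)
```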
Keywords: Multimodal Fake News Detection; Bangla Language Processing; MBM-CTNet; Cross-Modal Co-Attention; Text–Image Consistency; Contrastive Learning; Multitask Learning; Vision Transformer (ViT); Bangla-BERT; Under-Resourced Languages; MultiBanFakeDetect Dataset; Misinformation Classification; Clickbait Detection; Rumor Identification; Deep Learning.
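The abstract states that the text–image consistency head is trained with contrastive learning. A common instantiation of such an objective is a symmetric InfoNCE-style loss over in-batch pairs, sketched below; the temperature value, embedding dimension, and symmetric formulation are assumptions, and the paper's exact loss may differ.

```python
# Hedged sketch of a contrastive text-image consistency objective.
import torch
import torch.nn.functional as F

def consistency_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Pulls matching text-image pairs together and pushes apart mismatched
    in-batch pairs, a common proxy for flagging manipulated or
    out-of-context visuals."""
    # L2-normalise so dot products become cosine similarities.
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = t @ v.T / temperature                    # (B, B) similarities
    targets = torch.arange(t.size(0), device=t.device)
    # Symmetric cross-entropy over text->image and image->text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Example: projected embeddings standing in for the two encoders' outputs.
loss = consistency_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```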