A Multimodal Approach for Classification of Text-Embedded Images Based on CLIP and BERT-Based Models
Guide: Dr. S China Venkateswarlu, Professor, ECE & IARE
Dr. V Siva Nagaraju, Professor, ECE & IARE
Gulshan Kumar1
1Electronics and Communication Engineering, Institute of Aeronautical Engineering
---------------------------------------------------------------------***-----------------------------------------------------------------
Abstract -- With the rapid rise of social media platforms, communities can share their passions and interests with the world far more conveniently. This has also made it easier for individuals to spread hateful messages through memes. Classifying such material requires looking not only at the individual images but also at the associated text in tandem; examining the images or the text in isolation does not provide the full context. In this paper, we describe our approach to hateful meme classification for the Multimodal Hate Speech Shared Task at CASE 2024. We used the same approach for both subtasks: a classification model based on text and image features obtained using Contrastive Language-Image Pre-training (CLIP), alongside BERT-based models. The predictions of the two models are then combined in an ensemble. This approach ranked second in both subtasks.
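To make the described pipeline concrete, the following is a minimal sketch (not the authors' released code) of how CLIP image and text features can feed a classification head and be ensembled with a BERT-based text classifier. The model checkpoints, the linear head, and the probability-averaging scheme are illustrative assumptions.

```python
# Sketch: fuse CLIP image/text features for classification and ensemble
# the result with a BERT-based text classifier (assumed setup).
import torch
from PIL import Image
from transformers import (
    CLIPModel, CLIPProcessor,
    AutoTokenizer, AutoModelForSequenceClassification,
)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical BERT-style text classifier with 2 labels (hateful / not hateful).
bert_name = "bert-base-uncased"
bert_tok = AutoTokenizer.from_pretrained(bert_name)
bert_clf = AutoModelForSequenceClassification.from_pretrained(bert_name, num_labels=2)

# Simple linear head over the concatenated CLIP image + text embeddings
# (assumed fusion strategy; it would be trained on the task data).
clf_head = torch.nn.Linear(clip.config.projection_dim * 2, 2)

def predict(image: Image.Image, caption: str) -> int:
    """Return the ensembled class index for one meme (image + embedded text)."""
    with torch.no_grad():
        # CLIP branch: encode the image and its text, concatenate, classify.
        inputs = clip_proc(text=[caption], images=image,
                           return_tensors="pt", padding=True)
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
        clip_logits = clf_head(torch.cat([img_emb, txt_emb], dim=-1))

        # BERT branch: classify the extracted text on its own.
        bert_inputs = bert_tok(caption, return_tensors="pt", truncation=True)
        bert_logits = bert_clf(**bert_inputs).logits

        # Ensemble: average the two models' class probabilities.
        probs = (clip_logits.softmax(-1) + bert_logits.softmax(-1)) / 2
    return int(probs.argmax(-1))
```

In practice, both branches would be fine-tuned on the shared-task training data before their predictions are averaged; the uniform average shown here is one simple ensembling choice.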
Keywords: Multimodal learning, text-embedded images, CLIP (Contrastive Language–Image Pre-training), BERT (Bidirectional Encoder Representations from Transformers), vision-language models, image classification, text representation, feature fusion, deep learning, semantic understanding, cross-modal retrieval, scene text recognition, visual contextualization, natural language processing (NLP), computer vision.