Zero-Shot Image Classification Using CLIP
Dr. A. V. S. Siva Rama Rao
Department of CSE – AIML
Sasi Institute of Technology and Engineering

Cherukuri Meenakshi Devi
Department of CSE – AIML
Sasi Institute of Technology and Engineering
meenakshi.cherukuri@sasi.ac.in

Gummadi Mohan Krishna
Department of CSE – AIML
Sasi Institute of Technology and Engineering
mohan.gummadi@sasi.ac.in

Godavari Adhi Vardhini Yadav
Department of CSE – AIML
Sasi Institute of Technology and Engineering
vardhini.godavari@sasi.ac.in

Bhavanam Pavan Kalyan Reddy
Department of CSE – AIML
Sasi Institute of Technology and Engineering
kalyan.bhavanam@sasi.ac.in
Abstract
Image classification plays a crucial role in applications such as image search, content moderation, healthcare imaging, and e-commerce platforms. In this paper, a Zero-Shot Image Classification system using CLIP (Contrastive Language–Image Pretraining) is proposed, which classifies images without requiring labeled training data or model retraining. The system aligns images and natural language descriptions in a shared semantic embedding space. Input images are processed using CLIP’s Vision Encoder, while class descriptions are processed using the Text Encoder to generate embeddings.
Classification is performed by computing the cosine similarity between the image embedding and each text embedding; the class with the highest similarity score is selected as the predicted category. The system is implemented in Python using libraries such as PyTorch, OpenCLIP, and NumPy. Experimental evaluation shows that the proposed approach achieves competitive performance compared to traditional CNN-based classifiers while offering flexibility, scalability, reduced labeling cost, and the ability to classify unseen categories, demonstrating the effectiveness of vision–language models for real-world image classification.
Keywords: Zero-Shot Learning, CLIP, Vision–Language Models, Image Classification, Cosine Similarity, Semantic Embedding, Prompt-Based Classification
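The classification step described in the abstract — comparing one image embedding against a set of class-description embeddings by cosine similarity and taking the argmax — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings below are random stand-ins for the outputs of CLIP's vision and text encoders, and the class names are hypothetical.

```python
import numpy as np

def classify(image_emb, text_embs, class_names):
    """Return the class whose text embedding is most cosine-similar
    to the image embedding, along with all similarity scores."""
    # L2-normalize so that dot products equal cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # one cosine similarity per candidate class
    return class_names[int(np.argmax(sims))], sims

# Stand-in embeddings; in the real system these would come from
# CLIP's vision encoder (image) and text encoder (class prompts).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)          # one image embedding
text_embs = rng.normal(size=(3, 512))     # one embedding per class
label, sims = classify(image_emb, text_embs, ["cat", "dog", "car"])
```

Because both sets of vectors are L2-normalized, the matrix–vector product yields cosine similarities directly; swapping in new class prompts requires only re-encoding their text, which is what makes the approach zero-shot.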