Zero-Shot Image Classification Using CLIP
Dr. A. V. S. Siva Rama Rao
Department of CSE – AIML
Sasi Institute of Technology and Engineering

Cherukuri Meenakshi Devi
Department of CSE – AIML
Sasi Institute of Technology and Engineering
meenakshi.cherukuri@sasi.ac.in

Gummadi Mohan Krishna
Department of CSE – AIML
Sasi Institute of Technology and Engineering
mohan.gummadi@sasi.ac.in

Godavari Adhi Vardhini Yadav
Department of CSE – AIML
Sasi Institute of Technology and Engineering
vardhini.godavari@sasi.ac.in

Bhavanam Pavan Kalyan Reddy
Department of CSE – AIML
Sasi Institute of Technology and Engineering
kalyan.bhavanam@sasi.ac.in
Abstract
Image classification plays a crucial role in applications such as image search, content moderation, healthcare imaging, and e-commerce platforms. In this paper, a Zero-Shot Image Classification system using CLIP (Contrastive Language–Image Pretraining) is proposed, which classifies images without requiring labeled training data or model retraining. The system aligns images and natural language descriptions in a shared semantic embedding space. Input images are processed using CLIP’s Vision Encoder, while class descriptions are processed using the Text Encoder to generate embeddings.
Classification is performed by computing the cosine similarity between the image embedding and each text embedding; the class with the highest similarity score is selected as the predicted category. The system is implemented in Python using libraries such as PyTorch, OpenCLIP, and NumPy. Experimental evaluation shows that the proposed approach achieves competitive performance compared to traditional CNN-based classifiers while offering flexibility, scalability, reduced labeling cost, and the ability to classify unseen categories, demonstrating the effectiveness of vision–language models for real-world image classification.
Keywords: Zero-Shot Learning, CLIP, Vision–Language Models, Image Classification, Cosine Similarity, Semantic Embedding, Prompt-Based Classification
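The classification step described in the abstract — comparing one image embedding against a set of class-description embeddings by cosine similarity and taking the argmax — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embeddings below are random stand-ins for the outputs of CLIP's vision and text encoders, and the class names are hypothetical.

```python
import numpy as np

def classify(image_emb, text_embs, class_names):
    """Return the class whose text embedding is most cosine-similar
    to the image embedding, along with all similarity scores."""
    # L2-normalize so that dot products equal cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # one cosine similarity per candidate class
    return class_names[int(np.argmax(sims))], sims

# Stand-in embeddings; in the real system these would come from
# CLIP's vision encoder (image) and text encoder (class prompts).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)          # one image embedding
text_embs = rng.normal(size=(3, 512))     # one embedding per class
label, sims = classify(image_emb, text_embs, ["cat", "dog", "car"])
```

Because both sets of vectors are L2-normalized, the matrix–vector product yields cosine similarities directly; swapping in new class prompts requires only re-encoding their text, which is what makes the approach zero-shot.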