A Comprehensive Survey on Vision–Language Models for Image–Text Retrieval
Dr. Goldi Soni
Assistant Professor, AUC
gsoni@rpr.amity.edu
Ch. Kirti Yadav
B.Tech CSE, AUC
chatti.yadav@s.amity.edu
Bikash Bardhan
B.Tech CSE, AUC
bikash.bardhan@s.amity.edu
Abstract
This review paper provides a comprehensive analysis of recent research developments in Vision–Language Models (VLMs) and Large Vision–Language Models (LVLMs), which enable machines to understand and reason over both visual and textual information. The study surveys 30 scientific articles published between 2023 and 2026 that cover recent progress in multimodal learning, cross-modal retrieval, visual document understanding, and multimodal reasoning systems. Among the most significant architectures and frameworks examined are CLIP-inspired models, BLIP-2, SigLIP-2, and various adapter and prompt-learning techniques that improve model efficiency. The study also analyzes newly proposed benchmarks and datasets for evaluating multimodal models on tasks such as image retrieval, visual document understanding, captioning, and visual reasoning. At the same time, it considers important open issues, including hallucinations in generated outputs, weak visual grounding, cultural biases, security vulnerabilities, and the computational complexity of multimodal models, along with remedies proposed in the surveyed work such as better multimodal alignment, token compression, contrastive learning approaches, and instruction-based feature fusion. Finally, several surveyed works describe practical applications of vision–language models in domains such as biomedical imaging, education, autonomous systems testing, and multimodal data augmentation. Overall, this paper serves as a consolidated reference for understanding recent progress in vision–language models and for guiding future research in multimodal AI systems.
Keywords: Vision–Language Models (VLMs), Multimodal Learning, Cross-Modal Retrieval, Large Language Models (LLMs), Multimodal Artificial Intelligence
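The cross-modal retrieval setting that recurs throughout the surveyed papers can be illustrated with a minimal sketch of CLIP-style retrieval: images and captions are projected into a shared embedding space, and matching reduces to cosine similarity followed by a temperature-scaled softmax over candidate texts. The toy embeddings, noise level, and temperature value below are illustrative assumptions only; they are not taken from any surveyed model.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so that the dot
    # product between two vectors equals their cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(image_embs, text_embs, temperature=0.07):
    # Cosine-similarity matrix between every image and every caption.
    sim = l2_normalize(image_embs) @ l2_normalize(text_embs).T
    # Temperature-scaled softmax over captions (as in CLIP-style
    # contrastive training) turns each image's similarity row into
    # a probability distribution over candidate texts.
    logits = sim / temperature
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Index of the best-matching caption for each image.
    return probs.argmax(axis=1), probs

# Toy data: 3 images and 3 captions embedded in a shared 4-d space,
# constructed so that image i lies near caption i (hypothetical
# stand-ins for real encoder outputs).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 4))
image_embs = anchors + 0.05 * rng.normal(size=(3, 4))
text_embs = anchors + 0.05 * rng.normal(size=(3, 4))

best, probs = retrieve(image_embs, text_embs)
print(best)
```

In a real system the toy arrays would be replaced by the outputs of trained image and text encoders; the retrieval step itself is unchanged.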