A Comprehensive Survey on Vision–Language Models for Image–Text Retrieval
Dr. Goldi Soni
Assistant Professor, AUC
gsoni@rpr.amity.edu
Ch. Kirti Yadav
B.Tech CSE, AUC
chatti.yadav@s.amity.edu
Bikash Bardhan
B.Tech CSE, AUC
bikash.bardhan@s.amity.edu
Abstract
This review paper provides a comprehensive analysis of recent research developments in Vision–Language Models (VLMs) and Large Vision–Language Models (LVLMs), which enable machines to understand and reason over both visual and textual information. The study surveys 30 scientific articles published between 2023 and 2026 that cover recent progress in multimodal learning, cross-modal retrieval, visual document understanding, and multimodal reasoning systems. Among the most significant architectures and frameworks examined are CLIP-inspired models, BLIP-2, SigLIP-2, and various adapter and prompt-learning techniques that improve model efficiency. The study also analyzes newly proposed benchmarks and datasets for evaluating multimodal models on tasks such as image retrieval, visual document understanding, captioning, and visual reasoning. At the same time, it considers important open issues, including hallucinations in generated outputs, weak visual grounding, cultural biases, security vulnerabilities, and the computational complexity of multimodal models, along with remedies proposed in the surveyed work such as better multimodal alignment, token compression, contrastive learning approaches, and instruction-based feature fusion. Finally, several surveyed works describe practical applications of vision–language models in domains such as biomedical imaging, education, autonomous systems testing, and multimodal data augmentation. Overall, this paper serves as a consolidated reference for understanding recent progress in vision–language models and for guiding future research in multimodal AI systems.
Keywords: Vision–Language Models (VLMs), Multimodal Learning, Cross-Modal Retrieval, Large Language Models (LLMs), Multimodal Artificial Intelligence
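The cross-modal retrieval setting that recurs throughout the surveyed papers can be illustrated with a minimal sketch of CLIP-style retrieval: images and captions are projected into a shared embedding space, and matching reduces to cosine similarity followed by a temperature-scaled softmax over candidate texts. The toy embeddings, noise level, and temperature value below are illustrative assumptions only; they are not taken from any surveyed model.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so that the dot
    # product between two vectors equals their cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(image_embs, text_embs, temperature=0.07):
    # Cosine-similarity matrix between every image and every caption.
    sim = l2_normalize(image_embs) @ l2_normalize(text_embs).T
    # Temperature-scaled softmax over captions (as in CLIP-style
    # contrastive training) turns each image's similarity row into
    # a probability distribution over candidate texts.
    logits = sim / temperature
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Index of the best-matching caption for each image.
    return probs.argmax(axis=1), probs

# Toy data: 3 images and 3 captions embedded in a shared 4-d space,
# constructed so that image i lies near caption i (hypothetical
# stand-ins for real encoder outputs).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 4))
image_embs = anchors + 0.05 * rng.normal(size=(3, 4))
text_embs = anchors + 0.05 * rng.normal(size=(3, 4))

best, probs = retrieve(image_embs, text_embs)
print(best)
```

In a real system the toy arrays would be replaced by the outputs of trained image and text encoders; the retrieval step itself is unchanged.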