Context Aware Visual Analysis for Dynamic Audio Narration
Dinakar E J
Department of Artificial Intelligence and Data Science, Panimalar Institute of Technology, Chennai, India
dina212ka7n@gmail.com
Suryaprakash M
Department of Artificial Intelligence and Data Science, Panimalar Institute of Technology, Chennai, India
suryaprakashaadhi2003@gmail.com
Arunkumar P
Department of Artificial Intelligence and Data Science, Panimalar Institute of Technology, Chennai, India
Arunparthiban2519@gmail.com
Babisha A
Department of Information Technology, Panimalar Institute of Technology, Chennai, India
babisha15@gmail.com
Suma Christal Mary S
Department of Information Technology, Panimalar Institute of Technology, Chennai, India
ithod@pit.ac.in
Saranya K
Department of Artificial Intelligence and Data Science, Panimalar Institute of Technology, Chennai, India
kansarcse@gmail.com
Abstract—Due to the exponential growth of multimedia content, there is a growing demand for advanced image captioning systems that go beyond static descriptions and provide deep, dynamic audio narratives. This paper introduces "Context-Aware Visual Analysis for Dynamic Audio Narration," a pipeline in which computer vision and natural language processing synergize to convert images into contextually informed, user-controlled audio descriptions. The system leverages the robust architecture of `salesforce/BLIP-image-captioning-large` alongside a fine-tuned `google/FLAN-T5-large` model, integrating feature extraction, contextualization, and prompt-driven captioning into a single, unified framework. Compared with conventional models, this methodology allows users to personalize narratives by focusing on affective, kinetic, or contextual content, making the system highly beneficial for visually impaired users, educators, and content creators.
The system performs multi-scale visual feature extraction, aligns visual features with linguistic context using a multimodal transformer, and produces grammatically rich, coherent captions. These captions are then synthesized into speech by an integrated text-to-speech (TTS) engine powered by gTTS. Users can download the image, its caption, and the narrated audio as a single bundle. Evaluations on the Flickr8k dataset demonstrate competitive results on the BLEU, METEOR, ROUGE, and CIDEr metrics, showing strong accuracy and fluency compared with previous approaches. With a user-friendly interface supporting real-time changes, the system enhances content accessibility and engagement, advancing the frontier of interactive, AI-based storytelling.
Keywords: Multimodal AI, Vision-Language Models, Context-Aware Image Captioning, Audio Narration, Accessibility, BLIP, FLAN-T5, gTTS, Gradio.
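As a rough illustration of the pipeline summarized in the abstract, and not the authors' exact implementation, the sketch below chains an off-the-shelf BLIP captioner, FLAN-T5 prompt-driven refinement, and gTTS speech synthesis using the Hugging Face `transformers` and `gTTS` libraries. The prompt template, the `focus` options, and the function name are assumptions introduced here for illustration only.

```python
# Minimal sketch: BLIP caption -> FLAN-T5 refinement -> gTTS narration.
# Assumes: pip install transformers torch pillow gTTS
from PIL import Image
from gtts import gTTS
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    AutoTokenizer, AutoModelForSeq2SeqLM,
)

# Vision-language captioner (Hugging Face hub IDs for the models named in the abstract).
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Language model used to rewrite the base caption according to the user's focus.
t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def narrate_image(image_path: str, focus: str = "contextual", audio_path: str = "narration.mp3"):
    """Generate a base caption, rewrite it with the requested focus, and synthesize audio."""
    image = Image.open(image_path).convert("RGB")

    # Step 1: base caption from BLIP.
    blip_inputs = blip_processor(images=image, return_tensors="pt")
    blip_ids = blip_model.generate(**blip_inputs, max_new_tokens=50)
    base_caption = blip_processor.decode(blip_ids[0], skip_special_tokens=True)

    # Step 2: prompt-driven refinement (hypothetical prompt template; focus could be
    # "affective", "kinetic", or "contextual", mirroring the personalization in the abstract).
    prompt = f"Rewrite this image caption as a vivid narration with a {focus} focus: {base_caption}"
    t5_inputs = t5_tokenizer(prompt, return_tensors="pt")
    t5_ids = t5_model.generate(**t5_inputs, max_new_tokens=80)
    narration = t5_tokenizer.decode(t5_ids[0], skip_special_tokens=True)

    # Step 3: text-to-speech with gTTS.
    gTTS(text=narration, lang="en").save(audio_path)
    return base_caption, narration, audio_path
```

In a full system, a front end such as Gradio could wrap `narrate_image` so users select the focus interactively and download the image, caption, and audio together, as described above.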