Context Aware Visual Analysis for Dynamic Audio Narration
Dinakar E J
Department of Artificial Intelligence and Data Science, Panimalar Institute of Technology, Chennai, India
dina212ka7n@gmail.com
Suryaprakash M
Department of Artificial Intelligence and Data Science, Panimalar Institute of Technology, Chennai, India
suryaprakashaadhi2003@gmail.com
Arunkumar P
Department of Artificial Intelligence and Data Science, Panimalar Institute of Technology, Chennai, India
Arunparthiban2519@gmail.com
Babisha A
Department of Information Technology, Panimalar Institute of Technology, Chennai, India
babisha15@gmail.com
Suma Christal Mary S
Department of Information Technology, Panimalar Institute of Technology, Chennai, India
ithod@pit.ac.in
Saranya K
Department of Artificial Intelligence and Data Science, Panimalar Institute of Technology, Chennai, India
kansarcse@gmail.com
Abstract—Due to the exponential growth of multimedia content, there is a growing demand for advanced image captioning systems that go beyond static descriptions and provide deep, dynamic audio narratives. This paper introduces "Context-Aware Visual Analysis for Dynamic Audio Narration," a pipeline in which computer vision and natural language processing synergize to convert images into contextually informed, user-controlled audio descriptions. The system leverages the robust architecture of `salesforce/BLIP-image-captioning-large` alongside a fine-tuned `google/FLAN-T5-large` model, integrating feature extraction, contextualization, and prompt-driven captioning into a single, unified framework. Compared with conventional models, this methodology allows users to personalize narratives by focusing on affective, kinetic, or contextual content, making the system highly beneficial for visually impaired users, educators, and content creators.
The system performs multi-scale visual feature extraction, aligns visual features with linguistic context using a multimodal transformer, and produces grammatically rich, coherent captions. These captions are then synthesized into speech by an integrated text-to-speech (TTS) engine powered by gTTS. Users can download the image, its caption, and the narrated audio as a single bundle. Evaluations on the Flickr8k dataset demonstrate competitive results on the BLEU, METEOR, ROUGE, and CIDEr metrics, showing strong accuracy and fluency compared with previous approaches. With a user-friendly interface supporting real-time changes, the system enhances content accessibility and engagement, advancing the frontier of interactive, AI-based storytelling.
Keywords: Multimodal AI, Vision-Language Models, Context-Aware Image Captioning, Audio Narration, Accessibility, BLIP, FLAN-T5, gTTS, Gradio.
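As a rough illustration of the pipeline summarized in the abstract, and not the authors' exact implementation, the sketch below chains an off-the-shelf BLIP captioner, FLAN-T5 prompt-driven refinement, and gTTS speech synthesis using the Hugging Face `transformers` and `gTTS` libraries. The prompt template, the `focus` options, and the function name are assumptions introduced here for illustration only.

```python
# Minimal sketch: BLIP caption -> FLAN-T5 refinement -> gTTS narration.
# Assumes: pip install transformers torch pillow gTTS
from PIL import Image
from gtts import gTTS
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    AutoTokenizer, AutoModelForSeq2SeqLM,
)

# Vision-language captioner (Hugging Face hub IDs for the models named in the abstract).
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

# Language model used to rewrite the base caption according to the user's focus.
t5_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def narrate_image(image_path: str, focus: str = "contextual", audio_path: str = "narration.mp3"):
    """Generate a base caption, rewrite it with the requested focus, and synthesize audio."""
    image = Image.open(image_path).convert("RGB")

    # Step 1: base caption from BLIP.
    blip_inputs = blip_processor(images=image, return_tensors="pt")
    blip_ids = blip_model.generate(**blip_inputs, max_new_tokens=50)
    base_caption = blip_processor.decode(blip_ids[0], skip_special_tokens=True)

    # Step 2: prompt-driven refinement (hypothetical prompt template; focus could be
    # "affective", "kinetic", or "contextual", mirroring the personalization in the abstract).
    prompt = f"Rewrite this image caption as a vivid narration with a {focus} focus: {base_caption}"
    t5_inputs = t5_tokenizer(prompt, return_tensors="pt")
    t5_ids = t5_model.generate(**t5_inputs, max_new_tokens=80)
    narration = t5_tokenizer.decode(t5_ids[0], skip_special_tokens=True)

    # Step 3: text-to-speech with gTTS.
    gTTS(text=narration, lang="en").save(audio_path)
    return base_caption, narration, audio_path
```

In a full system, a front end such as Gradio could wrap `narrate_image` so users select the focus interactively and download the image, caption, and audio together, as described above.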