Real-Time Multimodal Synchronization of TTS, Lip Sync, and Caption Generation Using Deep Learning
Abhay Gupta
Dept. of Computer Science & Engineering
Babu Banarasi Das Institute of Technology & Management
(Dr. A.P.J. Abdul Kalam Technical University)
Lucknow, India
abhaymidas099@gmail.com
Aditi Kesarwani
Dept. of Computer Science & Engineering
Babu Banarasi Das Institute of Technology & Management
(Dr. A.P.J. Abdul Kalam Technical University)
Lucknow, India
aditikesarwaniak@gmail.com
Ashish Kumar Mishra
Dept. of Computer Science & Engineering
Babu Banarasi Das Institute of Technology & Management
(Dr. A.P.J. Abdul Kalam Technical University)
Lucknow, India
workwith.ak0@gmail.com
Guided by: Shubha Mishra
Assistant Professor
Dept. of Computer Science & Engineering
Babu Banarasi Das Institute of Technology & Management
(Dr. A.P.J. Abdul Kalam Technical University)
Lucknow, India
iamshubha@bbdnitm.ac.in
Abstract: Real-time multimodal communication systems require seamless synchronization between speech generation, lip movements, and textual captions to create natural, accessible, and interactive digital experiences. This work proposes a unified deep-learning framework for real-time multimodal synchronization of Text-to-Speech (TTS), lip-sync animation, and caption generation. The system integrates a streaming neural TTS model, an audio/text-driven lip-sync module, and a low-latency caption generator built on streaming ASR. A central synchronization engine aligns phoneme timestamps, viseme transitions, and caption token timings using adaptive buffering and drift-correction strategies, ensuring that all three modalities (audio, visual articulation, and text output) remain synchronized within a perceptually acceptable threshold (<40 ms). The proposed pipeline improves temporal coherence, reduces caption lag, and enhances the user experience in applications such as virtual presenters, digital avatars, assistive technologies, and human-AI communication. Experimental evaluation demonstrates significant improvements in alignment accuracy and latency over baseline systems in which each modality runs independently. The framework provides a scalable foundation for future work on expressive avatars, multilingual communication, and low-resource real-time deployment.
Keywords: Real-time synchronization, Text-to-Speech (TTS), Lip-sync, Caption generation, Multimodal deep learning, Phoneme-viseme alignment, Streaming ASR, Neural vocoder, Human-computer interaction, Virtual avatar systems.
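To make the synchronization strategy concrete, the sketch below illustrates the drift-correction idea described in the abstract: non-audio event streams (visemes, caption tokens) are nudged back toward the audio clock whenever their skew exceeds the 40 ms perceptual threshold. This is a minimal illustrative sketch; the names and structures here (Event, SyncEngine, MAX_SKEW_MS) are assumptions made for exposition, not the system's actual implementation.

```python
# Illustrative sketch of per-modality drift correction against the audio clock.
# Event, SyncEngine, and MAX_SKEW_MS are hypothetical names, not the paper's API.

from dataclasses import dataclass, field
from typing import Dict, List

MAX_SKEW_MS = 40.0  # perceptual synchronization threshold cited in the abstract


@dataclass
class Event:
    """A timestamped unit from one modality (phoneme, viseme, or caption token)."""
    modality: str   # "audio", "viseme", or "caption"
    label: str      # e.g. phoneme symbol, viseme id, or caption token
    t_ms: float     # intended presentation time in milliseconds


@dataclass
class SyncEngine:
    """Keeps non-audio streams aligned to the audio clock via cumulative offsets."""
    drift_ms: Dict[str, float] = field(
        default_factory=lambda: {"viseme": 0.0, "caption": 0.0}
    )

    def correct(self, events: List[Event], audio_anchor_ms: float) -> List[Event]:
        """Shift viseme/caption events so their skew vs. the anchor stays < MAX_SKEW_MS."""
        corrected = []
        for ev in events:
            if ev.modality == "audio":
                corrected.append(ev)  # audio is the reference clock; never shifted
                continue
            skew = (ev.t_ms + self.drift_ms[ev.modality]) - audio_anchor_ms
            if abs(skew) > MAX_SKEW_MS:
                # Accumulate a correction that pulls this stream back to the anchor.
                self.drift_ms[ev.modality] -= skew
            corrected.append(
                Event(ev.modality, ev.label, ev.t_ms + self.drift_ms[ev.modality])
            )
        return corrected


if __name__ == "__main__":
    engine = SyncEngine()
    # A viseme that has drifted 65 ms ahead of the audio anchor is pulled back.
    out = engine.correct([Event("viseme", "AA", 1065.0)], audio_anchor_ms=1000.0)
    print(out[0].t_ms)  # 1000.0 after correction
```

In a streaming deployment, such a corrector would run alongside adaptive buffering, with the accumulated offsets smoothed over time rather than applied as instantaneous jumps.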