Voicebridge: An AI-Based Multi-Modal Voice Assistant Using Whisper, GTTS and GPT
Author: DHANIMIREEDI GREESHMANTH (MCA student),
M. BALA NAGA BHUSHANAMU2 (Asst. Prof)
Department of Information Technology & Computer application, Andhra University College of Engineering, Visakhapatnam, AP.
Corresponding Author: Dhanimireedi Greeshmanth
(email-id:greeshmanthdhanimireddi@gmail.com)
ABSTRACT
In recent years, voice assistants have emerged as powerful tools for enabling human-machine interaction through natural spoken language. These systems, powered by advances in artificial intelligence and speech processing, offer users the convenience of hands-free control, instant information retrieval, and intelligent dialogue management. However, many existing voice assistants are highly dependent on cloud infrastructure and continuous internet access, limiting their functionality in rural or offline scenarios.
This project introduces VoiceBridge, a multi-modal AI-powered voice assistant that integrates OpenAI Whisper for speech-to-text conversion, gTTS (Google Text-to-Speech) for voice synthesis, and GPT-4o for intelligent conversational replies. The system is implemented using a Python Flask backend and a browser-based frontend, offering users a complete speech-driven interaction experience.
Unlike traditional assistants, VoiceBridge emphasizes modularity, privacy, and future support for offline capabilities. It serves as an efficient, scalable, and platform-independent solution for personalized AI communication. The assistant is capable of transcribing audio, generating text responses using GPT, and converting those responses into speech, creating a complete input-output cycle.
This paper presents the system architecture, functional modules, implementation workflow, and observed performance characteristics. The solution is intended for integration into educational, accessibility, and personal productivity applications with minimal resource consumption.
Keywords
Voice Assistant, Whisper, GPT-4o, Text-to-Speech, Flask, Artificial Intelligence, Speech Recognition, gTTS, Natural Language Processing, Conversational AI