User-Centric Voice Cloning Platform for Scalable Audiobook Narration
Sharath S¹
¹M.Tech in Software Engineering, Dept. of Information Science & Engineering, R.V. College of Engineering, Bangalore, India.
Email: sharaths0901@gmail.com
Dr. Vanishree K²
²Associate Professor, Dept. of Information Science & Engineering, R.V. College of Engineering, Bangalore, India.
Email: vanishreek@rvce.edu.in
Abstract: Personalized speech synthesis is emerging as a transformative technology in human–computer interaction, particularly for audiobook narration. This paper presents a deep learning-based voice cloning system that generates speaker-specific, expressive speech using neural text-to-speech (TTS) techniques. The proposed system integrates XTTS v2 for high-fidelity, multilingual synthesis with a pre-trained speaker encoder that extracts voice characteristics from short user-provided samples. Because the pipeline operates fully offline, inference is private, runs in real time, and requires no internet connectivity. Given a text input, the system produces speech that closely mimics the target speaker’s vocal timbre and prosody. It also allows control over pitch, speed, and expressiveness, supporting personalized narration styles. A Streamlit-based graphical interface lets users upload voice samples, enter text, play back audio in real time, view waveforms, and download the generated audio. The modular design supports multiple speaker presets and can be extended to emotion-aware synthesis and multi-speaker narration. Experimental results, validated through subjective listening tests and waveform analyses, show that the system consistently generates intelligible, natural-sounding speech. The solution demonstrates the feasibility of secure, offline voice cloning for personalized audiobook creation. Future work will focus on improving speaker embedding fidelity, emotional control, and deployment on mobile and edge platforms.
Key Words: Voice Cloning, XTTS v2, Neural Text-to-Speech, Speaker Embedding, Deep Learning, Personalized Speech Synthesis, Local Inference, Audio Generation, Speech Processing.