AI Voice Cloning using Deep Learning
Akshay Kumar1, Dr. Amandeep2, Ritu3
M.Sc. Computer Science1,3, Artificial Intelligence and Data Science, GJUS&T Hisar,
Assistant Professor2, Artificial Intelligence and Data Science, GJUS&T Hisar,
Email: akshay9068s@gmail.com
Abstract— In this project, we built a voice cloning system using deep learning. The goal was a model that takes speech from one person and converts it into another person's voice so that the result sounds real and natural. We trained the model on the LibriSpeech dataset because it contains a large number of recordings from many different speakers, which exposes the model to a wide range of speaking styles. To process the audio, we first convert each recording into features such as mel spectrograms and pitch (F0), which capture the sound and style of a speaker's voice. These features are then used to train a neural network that learns the target speaker's voice style and applies it to a new voice. We used a multi-speaker training method so that the system is not limited to one or two speakers and can handle many different voices.
After training, we tested the model by giving it new voice samples and asking it to convert them into different speaker styles. The converted voices sounded very close to the target speakers and were easy to understand.
We also inspected the waveforms and ran listening tests to compare the original and cloned voices. The output was smooth and clear, which indicates that the model learned speaker characteristics effectively.
Overall, this project shows that voice cloning using deep learning is feasible and can give good results even without a huge amount of data. Possible applications include helping people who cannot speak, making virtual assistants more personal, and dubbing videos in different voices. In future work, we plan to explore adding emotion and real-time voice conversion.
Keywords: Voice Cloning, Deep Learning, Mel Spectrogram, Speaker Conversion, Speech Synthesis, LibriSpeech.
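To make the feature-extraction step named in the abstract concrete, the following is a minimal sketch of how one audio frame might be turned into log-mel energies and a pitch (F0) estimate. It is an illustration under stated assumptions, not the authors' pipeline: all function names are hypothetical, the 512-sample frame, 20 mel filters, and 60 to 400 Hz pitch search range are illustrative choices, and a synthetic 200 Hz tone stands in for real speech. The 16 kHz rate matches LibriSpeech. A practical system would normally use an FFT-based audio library rather than the naive DFT shown here.

```python
import math

SAMPLE_RATE = 16000  # LibriSpeech audio is sampled at 16 kHz


def hz_to_mel(hz):
    """Convert frequency in Hz to the mel scale (common 2595*log10 formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of one Hann-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(windowed))
        spec.append(re * re + im * im)
    return spec


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sr / 2)
    edges_hz = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sr / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_hz])
    return bank


def log_mel_frame(frame, bank):
    """One column of a log-mel spectrogram: log energy in each mel filter."""
    spec = power_spectrum(frame)
    return [math.log(sum(w * p for w, p in zip(filt, spec)) + 1e-10)
            for filt in bank]


def estimate_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate pitch as the lag with the highest autocorrelation in [fmin, fmax]."""
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)

    def autocorr(lag):
        return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))

    best_lag = max(range(lag_min, lag_max + 1), key=autocorr)
    return sr / best_lag


# Synthetic stand-in for a voiced speech frame: a pure 200 Hz tone.
frame = [math.sin(2 * math.pi * 200.0 * i / SAMPLE_RATE) for i in range(512)]
bank = mel_filterbank(n_mels=20, n_fft=512, sr=SAMPLE_RATE)
features = log_mel_frame(frame, bank)
f0 = estimate_f0(frame, SAMPLE_RATE)
print(len(features), round(f0, 1))
```

Repeating `log_mel_frame` over overlapping frames of a recording would yield the mel spectrogram, and repeating `estimate_f0` would yield the pitch contour. Together these are the kind of inputs the abstract describes feeding to the neural network.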