Speech Enhancement Using Spectrogram Denoising with Deep U-Net Architectures
Guide: Dr. S China Venkateswarlu, Professor, ECE & IARE
Dr. V Siva Nagaraju, Professor, ECE & IARE
Aennam Ashritha Patel¹
¹Aennam Ashritha Patel, Electronics and Communication Engineering, Institute of Aeronautical Engineering
Abstract -- Acoustic noise degrades speech quality and intelligibility in applications ranging from telecommunications to voice assistants. In this paper, we address this problem with a deep-learning-based speech enhancement system. Our approach relies on spectrogram denoising: audio signals are represented as 2D magnitude spectrograms, which preserve the time-frequency structure of the signal and allow Convolutional Neural Networks (CNNs) to be applied directly.
The backbone of our system is a U-Net, a deep convolutional autoencoder, trained to estimate the noise component of a noisy speech spectrogram. We built a heterogeneous dataset by mixing clean English speech from SiSEC and LibriSpeech with 10 classes of environmental noise drawn from ESC-50 and other sources, applying data augmentation and randomized noise levels to improve generalization. We trained the U-Net with the Adam optimizer and a Huber loss, reaching a training loss of 0.002129 and a validation loss of 0.002406.
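The data mixing and loss described above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names `mix_at_random_level` and `huber_loss`, the scale range, and the `delta` value are all assumptions made for illustration, since the text does not specify how the noise level is randomized or which Huber threshold is used.

```python
import numpy as np

def mix_at_random_level(clean, noise, rng, min_scale=0.2, max_scale=0.8):
    """Add noise to clean speech at a randomly drawn level.

    The scale range is illustrative, not taken from the paper.
    Returns the noisy mixture and the scaled noise that was added.
    """
    # Repeat the noise clip if it is shorter than the speech, then crop.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    scaled_noise = rng.uniform(min_scale, max_scale) * noise
    return clean + scaled_noise, scaled_noise

def huber_loss(pred, target, delta=1.0):
    """Standard Huber loss: quadratic for small errors, linear for large ones."""
    err = np.abs(pred - target)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return float(np.mean(np.where(err <= delta, quadratic, linear)))
```

In a setup like the one described, the scaled noise (after conversion to a magnitude spectrogram) would plausibly serve as the regression target for the network, with the Huber loss limiting the influence of occasional large errors.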
At inference time, the trained U-Net estimates the noise magnitude spectrogram, which is subtracted from the noisy magnitude spectrogram. The denoised magnitude is then combined with the phase of the original noisy signal, and the enhanced audio is reconstructed with an inverse short-time Fourier transform (ISTFT). Qualitative evaluations, including visual comparisons of time-series plots and spectrograms and audio demonstrations, indicate that the system suppresses a variety of noises while preserving speech fidelity, even at high noise levels. This project demonstrates a practical and scalable deep learning approach to improving speech quality in noisy environments.
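The enhancement steps above (estimate the noise, subtract it, reuse the noisy phase, invert the STFT) can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the frame size, hop, and Hann window are illustrative choices, `predict_noise` is a hypothetical callable standing in for the trained U-Net, and clamping negative magnitudes to zero is an added safeguard that the text does not mention.

```python
import numpy as np

N_FFT = 256                          # frame length (illustrative)
HOP = N_FFT // 2                     # 50% overlap
WINDOW = np.hanning(N_FFT + 1)[:-1]  # periodic Hann: shifted copies sum to 1

def stft(x):
    """Short-time Fourier transform, returned as (freq_bins, frames)."""
    n_frames = int(np.ceil(len(x) / HOP)) + 1
    # Zero-pad so every input sample is covered by two overlapping windows.
    padded = np.zeros((n_frames + 1) * HOP)
    padded[HOP:HOP + len(x)] = x
    frames = np.stack([padded[i * HOP:i * HOP + N_FFT] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def istft(spec, length):
    """Inverse STFT by overlap-add, trimmed back to the input length."""
    frames = np.fft.irfft(spec.T, n=N_FFT, axis=1)
    out = np.zeros((frames.shape[0] + 1) * HOP)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame
    return out[HOP:HOP + length]

def enhance(noisy, predict_noise):
    """Subtract a predicted noise magnitude and resynthesize the audio.

    predict_noise is a hypothetical stand-in for the trained U-Net: it maps
    a magnitude spectrogram to an estimated noise magnitude spectrogram.
    """
    spec = stft(noisy)
    magnitude, phase = np.abs(spec), np.angle(spec)
    # Clamp at zero so the result is a valid magnitude (an assumption here).
    clean_mag = np.maximum(magnitude - predict_noise(magnitude), 0.0)
    # Reuse the phase of the noisy signal, as described in the text.
    return istft(clean_mag * np.exp(1j * phase), len(noisy))
```

With a predictor that returns all zeros, `enhance` reproduces its input, which is a convenient sanity check of the STFT/ISTFT round trip before a real model is plugged in.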
Key Words: speech enhancement, deep learning, spectrogram denoising, U-Net, convolutional neural networks, noise reduction, audio processing.