Voice Conversion using Hybrid CNN BiLSTM-WaveNet Deep Learning Models


A. Bala Raju, S. P Singh, Dhiraj Sunehra

Abstract

Voice conversion is an active area of speech processing in which deep learning models are developed to modify the vocal qualities of one speaker to resemble the voice of another without altering the linguistic content of the utterance. Its significance is hard to overstate: voice conversion is employed in a wide range of applications, including entertainment, vocal communication, and privacy enhancement. However, traditional methods have fallen short when handling large datasets and preserving subtle emotional cues, limiting the realism of simulated voices. To address these limitations, we present a novel approach that fuses speech-to-text (STT) technology with a text-to-speech (TTS) system powered by a deep learning architecture. The system incorporates phoneme embedding layers, bidirectional Long Short-Term Memory (BiLSTM) networks, and a WaveNet vocoder, which together make the converted voice more accurate and natural-sounding. The proposed model uses Python speech recognition packages and deep neural network methods to improve naturalness and clarity. Moreover, it sets a high bar for processing cost, efficiency, and conversion performance.
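The pipeline outlined above (phoneme embedding, BiLSTM layers, then a WaveNet vocoder) can be sketched as a small acoustic model. The following is a minimal illustrative sketch in PyTorch, not the authors' implementation: the class name, phoneme inventory size, and all layer dimensions are assumptions, and the WaveNet vocoder stage is represented only by a comment, since it would be a separately trained network consuming the predicted spectrogram frames.

```python
import torch
import torch.nn as nn

class PhonemeBiLSTMAcousticModel(nn.Module):
    """Hypothetical acoustic model: phoneme embedding -> BiLSTM -> mel frames.

    The predicted mel-spectrogram frames would then be passed to a WaveNet
    vocoder to synthesize the waveform in the target voice. All sizes below
    are illustrative assumptions, not the paper's published configuration.
    """

    def __init__(self, n_phonemes=70, emb_dim=128, hidden=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        # 2 * hidden because the BiLSTM concatenates both directions
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, seq_len) integer phoneme indices from STT
        x = self.embed(phoneme_ids)   # (batch, seq_len, emb_dim)
        x, _ = self.bilstm(x)         # (batch, seq_len, 2 * hidden)
        return self.proj(x)           # (batch, seq_len, n_mels) mel frames

model = PhonemeBiLSTMAcousticModel()
ids = torch.randint(0, 70, (1, 12))   # one utterance of 12 phonemes
mels = model(ids)                     # frames a WaveNet vocoder would consume
```

Running the sketch on a 12-phoneme utterance yields one mel-spectrogram frame per phoneme step; in a real system, durations would be upsampled before vocoding.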
