Voice conversion systems convert speech from a source speaker so that it sounds as if it were produced by a target speaker. Most existing voice conversion techniques consist of the following steps: split the speech signal into overlapping segments (frames) at regular intervals; extract features, such as mel-cepstral (MCEP) coefficients, that capture the spectral characteristics of each segment; time-align the sentences spoken by the source and target speakers; and learn a function that estimates the equivalent target MCEP coefficients for a given frame of source MCEP coefficients. In approaches based on deep neural networks (DNNs), a DNN serves as the model of the estimation function. To learn this function, the DNN is trained using the source MCEP coefficients as input and the corresponding time-aligned MCEP coefficients from the target speaker as the desired output. Once trained, the DNN generates the equivalent target MCEP coefficients when source MCEP coefficients are fed as input.
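The framing and time-alignment steps above can be illustrated with a minimal Python sketch. It rests on several assumptions: dynamic time warping (DTW) is used as the alignment method, although the text does not name one; the per-frame feature vectors (for example, MCEP coefficients) are taken as already extracted; and the function names split_into_frames, dtw_align, and training_pairs are hypothetical, chosen only for illustration.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a signal into frames; a hop smaller than frame_len gives overlap."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dtw_align(src, tgt):
    """Align two feature sequences with DTW (an assumed alignment method).

    Returns a list of (source_index, target_index) pairs along the
    minimum-cost warping path.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning src[:i] with tgt[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from the end of both sequences to recover the path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def training_pairs(src, tgt):
    """Pair each source frame with its aligned target frame."""
    return [(src[i], tgt[j]) for i, j in dtw_align(src, tgt)]


# Toy one-dimensional "features" standing in for MCEP vectors.
source_feats = [[0.0], [1.0], [2.0]]
target_feats = [[0.0], [1.0], [1.0], [2.0]]
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
alignment = dtw_align(source_feats, target_feats)
pairs = training_pairs(source_feats, target_feats)
```

In this sketch, each (source frame, target frame) pair produced by training_pairs would serve as one input/output example for training the estimation function.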
Given a recording of a test sentence from the source speaker and the trained model, the system estimates the sequence of target MCEP coefficients for that sentence. A speech signal is then generated from the sequence of estimated MCEP coefficients. This approach is effective for voice conversion, but the resulting speech has low acoustic quality and naturalness. There is therefore a need for a technique that improves the naturalness and quality of voice conversion.