Voice conversion seeks to convert speech from a source speaker so that it sounds as if it were produced by a target speaker. A central aspect of voice conversion is mapping the spectral characteristics of speech sounds from the source speaker to those of the target speaker. Mel cepstral (MCEP) coefficients are among the features commonly used to capture these spectral characteristics. MCEP is a representation of the short-term power spectrum, obtained as a linear cosine transform of the log power spectrum on the nonlinear Mel frequency scale. In a voice conversion method based on a deep neural network (DNN), the MCEP coefficients (cc) for each segment of speech from the source speaker are replaced by the equivalent MCEP coefficients for the target speaker, as estimated by a trained DNN model. The model is trained on recordings of the same sentences from the source and target speakers, using the MCEP coefficients from the source speaker (ccsrc) as the input and the corresponding coefficients from the target speaker (cctgt) as the output.
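The frame-wise mapping from ccsrc to cctgt described above can be illustrated with a minimal sketch. This is not the method of any particular system. It assumes that the source and target frames have already been time-aligned, a step the text does not specify. It uses synthetic random vectors in place of real MCEP features and a single hidden layer trained by plain gradient descent. The function names, layer sizes, learning rate and feature dimension (24) are all hypothetical choices made for illustration.

```python
import numpy as np


def init_mlp(n_in, n_hidden, n_out, rng):
    """Create weights for a one-hidden-layer network (illustrative sizes)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, cc_src):
    """Map source MCEP frames to estimated target MCEP frames."""
    hidden = np.tanh(cc_src @ params["W1"] + params["b1"])
    return hidden @ params["W2"] + params["b2"], hidden


def train(params, cc_src, cc_tgt, lr=0.2, epochs=300):
    """Fit the network by gradient descent on the mean squared error."""
    losses = []
    for _ in range(epochs):
        pred, hidden = forward(params, cc_src)
        err = pred - cc_tgt
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean-squared-error gradient through both layers.
        g_pred = 2.0 * err / err.size
        g_hidden = (g_pred @ params["W2"].T) * (1.0 - hidden ** 2)
        params["W2"] -= lr * (hidden.T @ g_pred)
        params["b2"] -= lr * g_pred.sum(axis=0)
        params["W1"] -= lr * (cc_src.T @ g_hidden)
        params["b1"] -= lr * g_hidden.sum(axis=0)
    return losses


# Synthetic stand-ins for aligned MCEP frames (NOT real speech features):
# 500 frames, 24 coefficients per frame, with a made-up nonlinear relation.
rng = np.random.default_rng(0)
cc_src = rng.normal(size=(500, 24))
cc_tgt = np.tanh(cc_src @ rng.normal(0.0, 0.3, (24, 24)))

params = init_mlp(24, 32, 24, rng)
losses = train(params, cc_src, cc_tgt)

# Conversion step: each source frame is replaced by the network's estimate.
converted, _ = forward(params, cc_src)
print(f"MSE before training: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

In an actual system the converted coefficients would be passed to a vocoder to synthesize speech. That stage, along with feature extraction and alignment, is outside the scope of this sketch.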
The relationships between the source and target MCEP coefficients {f: ccsrc→cctgt} depend on the linguistic content of the sounds (e.g., phonemes) and are highly nonlinear. Mapping the source-target relationships in all possible linguistic contexts requires a large network. When trained on a small corpus, however, such a large network is likely to overfit. Overfitting can be reduced by making the network smaller, but doing so also reduces the network's ability to learn the complex nonlinear relationships between the source and target features in different linguistic contexts. There is therefore a need for a technique for robustly learning a wide variety of linguistic content from both large and small corpora.