It is known to electronically convert one voice to another. In such a voice conversion process, a training phase is performed in which speech training data from source and target speakers is collected and used to train a voice conversion model. Next, a usage phase is entered in which the trained voice conversion model is used to convert a voice.
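By way of illustration only, the two phases may be sketched as follows. The sketch assumes a deliberately simple model that is not taken from any particular known system: each voice is reduced to a sequence of pitch values, and conversion is a mean and variance mapping from source statistics to target statistics. The names `VoiceConversionModel`, `train` and `convert` and the toy data are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class VoiceConversionModel:
    # Hypothetical model: per-speaker pitch statistics only.
    src_mean: float
    src_std: float
    tgt_mean: float
    tgt_std: float


def train(source_pitch, target_pitch):
    """Training phase: estimate statistics from collected speech training data."""
    return VoiceConversionModel(
        src_mean=mean(source_pitch),
        src_std=pstdev(source_pitch),
        tgt_mean=mean(target_pitch),
        tgt_std=pstdev(target_pitch),
    )


def convert(model, pitch_values):
    """Usage phase: map source-speaker pitch values toward the target speaker."""
    # Guard against a constant source signal (zero spread).
    scale = model.tgt_std / model.src_std if model.src_std > 0 else 1.0
    return [model.tgt_mean + (p - model.src_mean) * scale for p in pitch_values]


# Toy pitch values in Hz (assumed data, for illustration only).
source_training = [100.0, 110.0, 120.0, 130.0, 140.0]
target_training = [200.0, 220.0, 240.0, 260.0, 280.0]

model = train(source_training, target_training)   # training phase
converted = convert(model, [120.0, 130.0])        # usage phase
print(converted)
```

In this sketch `convert` cannot be called until `train` has been run on data from both speakers, which mirrors the ordering of the two phases.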
In general, the training phase is separate and distinct from the usage phase, meaning that the user must spend time providing speech training data before being able to use the voice conversion function. The better the quality of the speech training data, the better the quality of the voice conversion model. In practice, to obtain high-quality speech training data, a user typically spends a considerable amount of time speaking to train the system. Typically, the user is asked to speak a set of pre-defined sentences or a large amount of free speech in a dedicated collection mode. Alternatively, the user may provide speech training data from a pre-stored source recorded under controlled conditions. However, it is unreasonable and inconvenient to expect the user to speak or otherwise collect large amounts of training material for the sake of training the voice conversion model. If the source voice is generated using text-to-speech (TTS) technology, then only the target speech corpus need be collected. Nonetheless, such training remains burdensome and inconvenient to the user.