The modern communications era has brought about a tremendous expansion of wireline and wireless networks. Computer networks, television networks, and telephony networks are experiencing an unprecedented technological expansion, fueled by consumer demand. Wireless and mobile networking technologies have addressed related consumer demands, while providing more flexibility and immediacy of information transfer.
Current and future networking technologies continue to facilitate ease of information transfer and convenience to users. One area in which there is a demand to increase ease of information transfer relates to the delivery of services to a user of a mobile terminal. The services may be in the form of a particular medium or communication application desired by the user, such as a music player, a game player, an electronic book, short messages, email, etc. The services may also be in the form of interactive applications in which the user may respond to a network device in order to perform a task or achieve a goal. The services may be provided from a network server or other network device, or even from the mobile terminal itself, such as, for example, a mobile telephone, a mobile television, a mobile gaming system, etc.
In many applications, it is necessary for the user to receive audio information such as oral feedback or instructions from the network. Examples of such applications include paying a bill, ordering a program, and receiving driving instructions. Furthermore, in some services, such as audio books, the application is based almost entirely on receiving audio information. It is becoming more common for such audio information to be provided by computer-generated voices. Accordingly, the user's experience in using such applications will largely depend on the quality and naturalness of the computer-generated voice. As a result, much research and development has gone into speech processing techniques in an effort to improve the quality and naturalness of computer-generated voices.
Examples of speech processing include speech coding and voice conversion related applications. Voice conversion is a technique that can be used to modify the speech of a source speaker in such a way that it sounds as if it had been spoken by a different target speaker. Gaussian mixture models (GMMs) have been found to offer a good approach for performing transformations from source speech to target speech. More precisely, source vectors extracted from the source speech may be combined with target vectors extracted from the target speech, and the combined vectors may be used to estimate the GMM parameters for the joint density of source and target features. A GMM-based conversion function may then be used to minimize the mean squared error between converted vectors and target vectors.
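By way of illustration only, the following is a minimal sketch of one common form of a GMM-based conversion function, namely the conditional expectation of the target vector given the source vector under a joint-density GMM, which is the estimator that minimizes mean squared error. The function names, parameter names, and toy values are assumptions introduced for this example, and the GMM parameters are assumed to have been estimated beforehand from joint source and target vectors.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal distribution evaluated at x."""
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Map one source vector x to a converted vector.

    Hypothetical parameter layout (one entry per mixture component):
      weights : mixture weights
      mu_x    : source-part means
      mu_y    : target-part means
      cov_xx  : source covariance blocks
      cov_yx  : target/source cross-covariance blocks
    """
    # Posterior probability of each component given the source vector.
    lik = np.array([w * gaussian_pdf(x, m, c)
                    for w, m, c in zip(weights, mu_x, cov_xx)])
    post = lik / lik.sum()

    # Posterior-weighted sum of per-component linear regressions.
    y = np.zeros_like(mu_y[0], dtype=float)
    for p, mx, my, cxx, cyx in zip(post, mu_x, mu_y, cov_xx, cov_yx):
        y += p * (my + cyx @ np.linalg.solve(cxx, x - mx))
    return y

# Toy single-component example with made-up parameters.
weights = [1.0]
mu_x = [np.zeros(2)]
mu_y = [np.ones(2)]
cov_xx = [np.eye(2)]
cov_yx = [2.0 * np.eye(2)]
converted = convert_frame(np.array([1.0, 2.0]), weights, mu_x, mu_y, cov_xx, cov_yx)
```

With a single component the posterior is 1, so the mapping reduces to one linear regression from the source vector to the target space. With several components, each source vector is converted by a blend of regressions weighted by how likely each component is to have produced it.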
Recently, interest in voice conversion has risen considerably, at least in part due to its application to the cost-efficient individualization of text-to-speech (TTS) systems. Another common application for voice conversion is speech-to-speech translation, where a standard voice of a text-to-speech module speaking a target language is converted to sound like the voice of an input speaker who speaks the source language. There are also many other potential applications for voice conversion, for example in entertainment applications and games.
Conventional voice conversion techniques convert feature vectors from the source speaker to match the characteristics of the target speaker on a frame-by-frame basis. Because each frame is converted independently of its neighbors, temporal information is not typically utilized and the timing structure across multiple frames is not well addressed. As a result, the quality of voice conversion is compromised and the output of voice conversion techniques may be perceived as lacking naturalness or smoothness. Accordingly, a need exists for a mechanism for improving the quality and naturalness of speech produced as a result of voice conversion.
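The limitation of frame-by-frame conversion can be illustrated with a short sketch. The affine per-frame mapping and all names below are hypothetical stand-ins for whatever conversion function is actually used; the point is only that when each frame is converted in isolation, the result for a frame does not depend on the frames around it.

```python
import numpy as np

def convert_frame(x, A, b):
    """Hypothetical per-frame mapping (an affine transform for illustration)."""
    return A @ x + b

def convert_sequence(frames, A, b):
    """Convert each frame independently; no neighboring frames are consulted."""
    return np.stack([convert_frame(x, A, b) for x in frames])

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 3))   # 5 frames of 3-dimensional features (made up)
A = rng.normal(size=(3, 3))
b = rng.normal(size=3)

out = convert_sequence(frames, A, b)

# Reordering the input frames merely reorders the output frames, which
# shows that no timing structure across frames enters the conversion.
order = np.array([4, 2, 0, 3, 1])
out_shuffled = convert_sequence(frames[order], A, b)
order_independent = np.allclose(out[order], out_shuffled)
```

A conversion that used temporal context would generally not have this property, since the converted value of a frame would change when its neighbors changed.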