The modern communications era has brought about a tremendous expansion of wireline and wireless networks. Computer networks, television networks, and telephony networks are experiencing an unprecedented technological expansion, fueled by consumer demand. Wireless and mobile networking technologies have addressed related consumer demands, while providing more flexibility and immediacy of information transfer.
Current and future networking technologies continue to facilitate ease of information transfer and convenience to users. One area in which there is a demand to increase ease of information transfer relates to the delivery of services to a user of a mobile terminal. The services may be in the form of a particular media or communication application desired by the user, such as a music player, a game player, an electronic book, short messages, email, etc. The services may also be in the form of interactive applications in which the user may respond to a network device in order to perform a task or achieve a goal. The services may be provided from a network server or other network device, or even from the mobile terminal itself, which may be, for example, a mobile telephone, a mobile television, a mobile gaming system, etc.
In many applications, it is necessary for the user to receive audio information such as oral feedback or instructions from the network. Examples of such applications include paying a bill, ordering a program, receiving driving instructions, etc. Furthermore, in some services, such as audio books, the application is based almost entirely on receiving audio information. It is becoming more common for such audio information to be provided by computer generated voices. Accordingly, the user's experience in using such applications will largely depend on the quality and naturalness of the computer generated voice. As a result, much research and development has gone into speech processing techniques in an effort to improve the quality and naturalness of computer generated voices.
Examples of speech processing include speech coding and voice conversion related applications. Voice conversion, for example, may be used to modify speaker identity. In this regard, speech uttered by a source speaker may be converted in order to sound as though a different speaker (e.g., a target speaker) uttered the speech. Algorithms have been developed for the performance of voice conversion using a conversion function having parameters that are estimated based on a corpus of matching words or phrases (i.e., a parallel corpus) that are spoken by both speakers. This may be accomplished, for example, by asking the source speaker and the target speaker to each recite the same sentences. However, depending upon the target speaker's identity and other factors, it may sometimes be difficult or impossible to collect a parallel corpus sufficient for voice conversion between a particular pair of source and target speakers. Furthermore, free speech (i.e., unscripted speech) recorded from either the source or target speaker is often not useful for voice conversion, since there is not necessarily a match between the words and/or phrases spoken by the source and target speakers in free speech.
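As an illustration only, the estimation of conversion function parameters from a parallel corpus may be sketched as follows. The sketch assumes that each utterance has already been reduced to a sequence of numeric feature vectors and that the source and target frames have already been time aligned, and it assumes a very simple conversion function (an independent gain and bias per feature dimension, fitted by least squares). The names Frame, fit_conversion, and convert are hypothetical and chosen for this example; the conversion functions used in practice are generally more elaborate.

```python
from typing import List, Sequence, Tuple

# Hypothetical type: one feature vector extracted from one speech frame.
Frame = Sequence[float]


def fit_conversion(source: Sequence[Frame],
                   target: Sequence[Frame]) -> List[Tuple[float, float]]:
    """Estimate a per-dimension gain and bias (y ~ a*x + b) by least squares.

    Assumes source[i] and target[i] are time-aligned frames taken from the
    same sentence spoken by the source and target speakers (parallel corpus).
    """
    if not source or len(source) != len(target):
        raise ValueError("parallel corpus needs equal, nonzero frame counts")
    n = len(source)
    params: List[Tuple[float, float]] = []
    for d in range(len(source[0])):
        xs = [frame[d] for frame in source]
        ys = [frame[d] for frame in target]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        # Fall back to unit gain when the source dimension is constant.
        gain = cov_xy / var_x if var_x > 0 else 1.0
        bias = mean_y - gain * mean_x
        params.append((gain, bias))
    return params


def convert(frame: Frame, params: Sequence[Tuple[float, float]]) -> List[float]:
    """Apply the estimated conversion function to one source frame."""
    return [gain * x + bias for x, (gain, bias) in zip(frame, params)]


if __name__ == "__main__":
    # Toy, made-up aligned frames; real features would come from speech analysis.
    src = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    tgt = [[3.0, 1.0], [5.0, 2.0], [7.0, 3.0]]
    trained = fit_conversion(src, tgt)
    print(convert([4.0, 8.0], trained))
```

The sketch also makes the dependence on the parallel corpus concrete: the fit is only meaningful when the i-th source frame and the i-th target frame correspond to the same spoken content, which is the pairing that free speech does not provide.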
Due to the difficulties described above, attempts have been made to develop voice conversion techniques that do not rely upon a parallel corpus for training. Mechanisms that require a parallel corpus for training are often referred to as being text dependent, since the sentences spoken for the training data are restricted to those that make up the parallel corpus. Text independent voice conversion, by contrast, generally refers to voice conversion in which there is no limitation on the sentences that the source and/or target speakers read or speak for the training. However, to date, voice conversion techniques that do not rely on a parallel corpus for training typically perform worse than parallel corpus schemes. Furthermore, such schemes typically require linguistic knowledge for system tuning and very large databases in order to find parallel subunits from both source and target speakers within a certain context.
Particularly in mobile environments, increases in memory consumption, such as those resulting from storing very large databases, directly affect the cost of devices employing such methods. Thus, it may be desirable to develop an improved mechanism for performing voice conversion without a need for a parallel corpus and without a need for large databases for identifying parallel subunits. Moreover, even in non-mobile environments, an improved mechanism for performing voice conversion without a need for a parallel corpus may be desirable.