The modern communications era has brought about a tremendous expansion of wireline and wireless networks. Computer networks, television networks, and telephony networks are experiencing an unprecedented technological expansion, fueled by consumer demand. Wireless and mobile networking technologies have addressed related consumer demands, while providing more flexibility and immediacy of information transfer.
Current and future networking technologies continue to facilitate ease of information transfer and convenience to users. One area in which there is a demand to increase ease of information transfer relates to the delivery of services to a user of a mobile terminal. The services may be in the form of a particular media or communication application desired by the user, such as a music player, a game player, an electronic book, short messages, email, etc. The services may also be in the form of interactive applications in which the user may respond to a network device in order to perform a task or achieve a goal. The services may be provided from a network server or other network device, or even from the mobile terminal itself, such as a mobile telephone, a mobile television, a mobile gaming system, etc.
In many applications, it is necessary for the user to receive audio information such as oral feedback or instructions from the network. Examples of such applications include paying a bill, ordering a program, receiving driving instructions, etc. Furthermore, in some services, such as audio books, the application is based almost entirely on receiving audio information. It is becoming more common for such audio information to be provided by computer generated voices. Accordingly, the user's experience in using such applications will largely depend on the quality and naturalness of the computer generated voice. As a result, much research and development has gone into improving the quality and naturalness of computer generated voices.
One specific application of such computer generated voices that is of interest is known as text-to-speech (TTS). TTS is the creation of audible speech from computer readable text. TTS is often considered to consist of two stages. First, a computer examines the text to be converted to audible speech in order to determine specifications for how the text should be pronounced, which syllables to accent, what pitch to use, how fast to deliver the sound, etc. Next, the computer tries to create audio that matches the specifications.
With the development of improved means for delivery of natural sounding and high quality speech via TTS, a desire has arisen to further enhance the user's experience when receiving TTS output. One way to improve the user's experience is to deliver the TTS output in a familiar or desirable voice. For example, the user may prefer to hear the TTS output delivered in his or her own voice, or in another desirable target voice, rather than in the source voice of the TTS output. Conversion of source speech to target speech in this manner is an example of feature transformation.
Gaussian mixture model (GMM) based techniques have been found to be efficient in the transformation of features that can be represented as scalars or vectors. In GMM based transformation, a combination of source and target vectors is used to estimate GMM parameters for a joint density, from which a GMM based conversion function may be created. For example, a set of training data including samples of source and target vectors may be used to train a transformation model. Once trained, the transformation model may be used to produce transformed vectors given input source vectors. Since it is desirable to minimize the mean squared error (MSE) between transformed and target vectors, a set of testing or validation data is used to compare the transformed vectors with the target vectors. However, it is often necessary to include large amounts of both training and testing data in order to have an effective transformation. For example, a database may include source and target speech corresponding to a relatively large number of sample sentences, of which 60% are used as training data and 40% are used as testing data. Accordingly, there may be an increased consumption of resources such as memory and power.
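By way of illustration only, the GMM based transformation described above may be sketched as follows. The sketch assumes that the joint density parameters (mixture weights, joint means, and joint covariances) have already been estimated from training data, for example by expectation-maximization, which is not shown. It further assumes the commonly used regression form of the conversion function, in which each mixture component contributes its conditional mean of the target given the source, weighted by the posterior probability of that component. The text above does not specify this form. All function and parameter names (gaussian_pdf, convert_vector, mean_squared_error, split_samples, dx, train_fraction) are hypothetical and chosen for this example.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Multivariate normal density evaluated at a single vector x."""
    d = x.shape[0]
    diff = x - mean
    solved = np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * (diff @ solved)) / norm)


def convert_vector(x, weights, means, covs, dx):
    """Transform one source vector with a joint-density GMM (assumed form).

    means[i] stacks the source mean (first dx entries) and the target mean;
    covs[i] is the matching joint covariance matrix. dx is the source
    dimension. The parameters are assumed to be already trained.
    """
    x = np.asarray(x, dtype=float)
    # Posterior probability of each component given the source vector only.
    post = np.array([
        w * gaussian_pdf(x, m[:dx], c[:dx, :dx])
        for w, m, c in zip(weights, means, covs)
    ])
    post = post / post.sum()
    # Posterior-weighted sum of per-component conditional means.
    y = np.zeros(means[0].shape[0] - dx)
    for p, m, c in zip(post, means, covs):
        y += p * (m[dx:] + c[dx:, :dx] @ np.linalg.solve(c[:dx, :dx], x - m[:dx]))
    return y


def mean_squared_error(transformed, target):
    """MSE between transformed vectors and target vectors of equal shape."""
    diff = np.asarray(transformed, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))


def split_samples(samples, train_fraction=0.6):
    """Split samples into training and testing portions (60%/40% by default)."""
    n_train = int(round(train_fraction * len(samples)))
    return samples[:n_train], samples[n_train:]


# Example with a single scalar-valued component (illustrative values only).
weights = [1.0]
means = [np.array([0.0, 1.0])]
covs = [np.array([[1.0, 0.5], [0.5, 1.0]])]
transformed = convert_vector([2.0], weights, means, covs, dx=1)
```

In such a sketch, the memory and processing cost grows with the number of stored source and target samples and with the number of mixture components, which is consistent with the resource concerns noted above.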
Particularly in mobile environments, increases in memory and power consumption directly affect the size and cost of devices employing such methods. However, even in non-mobile environments, such methods may result in long processing times for the algorithms used to train or test the model. Thus, a need exists for feature transformation of sufficient quality that can be employed efficiently.