The modern communications era has brought about a tremendous expansion of wireline and wireless networks. Computer networks, television networks, and telephony networks are experiencing an unprecedented technological expansion, fueled by consumer demand. Wireless and mobile networking technologies have addressed related consumer demands, while providing more flexibility and immediacy of information transfer.
Current and future networking technologies continue to facilitate ease of information transfer and convenience to users. One area in which there is a demand to increase ease of information transfer relates to the delivery of services to a user of a mobile terminal. The services may be in the form of a particular media or communication application desired by the user, such as a music player, a game player, an electronic book, short messages, or email. The services may also be in the form of interactive applications in which the user may respond to a network device in order to perform a task or achieve a goal. The services may be provided from a network server or other network device, or even from the mobile terminal itself, such as a mobile telephone, a mobile television, or a mobile gaming system.
In many applications, it is necessary for the user to receive audio information such as oral feedback or instructions from the network or mobile terminal. Examples of such applications include paying a bill, ordering a program, and receiving driving instructions. Furthermore, some services, such as audio books, are based almost entirely on receiving audio information. It is becoming more common for such audio information to be provided by computer-generated voices. Accordingly, the user's experience in using such applications will largely depend on the quality and naturalness of the computer-generated voice. As a result, much research and development has gone into speech processing techniques in an effort to improve the quality and naturalness of computer-generated voices.
Speech processing may generally include applications such as text-to-speech (TTS) conversion, speech coding, voice conversion, language identification, and numerous other like applications. In many speech processing applications, a computer-generated voice, or synthetic speech, may be provided. In one particular example, TTS, which is the creation of audible speech from computer-readable text, may be performed by selection and concatenation of acoustical units. However, such forms of TTS often require very large amounts of stored speech data and are not readily adaptable to different speakers and/or speaking styles. In an alternative example, a hidden Markov model (HMM) approach may be employed, in which speech is generated from smaller amounts of stored data. However, current HMM systems often suffer from degraded naturalness. In other words, many may consider that current HMM systems tend to oversimplify signal generation and therefore do not properly mimic natural speech pressure waveforms.
Particularly in mobile environments, increases in memory consumption can directly affect the cost of devices employing such methods. Thus, HMM systems may be preferred in some cases due to the potential for speech synthesis with relatively lower resource requirements. Even in non-mobile environments, however, increases in application footprint and memory consumption may be undesirable. Accordingly, it may be desirable to develop an improved speech synthesis mechanism that may, for example, provide more natural sounding synthetic speech in an efficient manner.