The design of handheld portable computing devices is driven by ergonomics, for user convenience and comfort. A main goal of handheld portable device design is maximizing portability. This has resulted in minimized form factors and, because the power source must also be small, a limited power budget for computing resources. Compared with general purpose computing devices, for example personal computers, desktop computers, laptop computers and the like, handheld portable computing devices have relatively limited processing power (to prolong the usage duration of the power source) and limited storage capacity.
Limitations in processing power, storage capacity and memory (RAM) capacity restrict the number of applications that may be available in the handheld portable computing environment. An application that is suitable in the general purpose computing environment may be unsuitable in a portable computing device environment because of the application's demands on processing resources, power resources or storage capacity. One such application is high-quality text-to-speech processing. Text-to-speech synthesis applications have been implemented on handheld portable computers; however, the text-to-speech output achievable is of relatively low quality when compared with the output achievable in computing environments with significantly more processing power and storage capacity.
There are different approaches to text-to-speech synthesis. One approach is articulatory synthesis, in which the movements of the articulators and the acoustics of the vocal tract are modeled. However, this approach has high computational requirements, and the output of articulatory synthesis is not natural-sounding fluent speech. Another approach is formant synthesis, which starts from a replication of the acoustics and creates rules/filters to generate each formant. Formant synthesis generates highly intelligible, but not completely natural-sounding, speech, although it does have a low memory footprint with moderate computational requirements. A further approach is concatenative synthesis, in which stored speech is used to assemble new utterances. Concatenative synthesis uses actual snippets of recorded speech cut from recordings and stored in a voice database inventory, either as waveforms (uncoded) or encoded by a suitable speech coding method. The inventory can contain thousands of examples of a specific diphone/phone, and the system concatenates selected examples to produce synthetic speech. Since concatenative systems use snippets of recorded speech, they have the highest potential for sounding natural.
One aspect of concatenative systems relates to use of unit selection synthesis. Unit selection synthesis uses large databases of recorded speech. During database creation, each recorded utterance is segmented into some or all of the following: individual phones, diphones, half-phones, syllables, morphemes, words, phrases, and sentences. Typically, the division into segments is done using a specially modified speech recognizer set to a “forced alignment” mode with some manual correction afterward, using visual representations such as the waveform and spectrogram. An index of the units in the speech database is then created based on the segmentation and acoustic parameters like the fundamental frequency (pitch), duration, position in the syllable, and neighboring phones. At runtime, the desired target utterance is created by determining the best chain of candidate units from the database (unit selection).
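The runtime search for the best chain of candidate units can be illustrated with a small sketch. The text above does not say how the best chain is determined; the sketch assumes a Viterbi-style dynamic program that minimizes the sum of a target cost (how well a unit matches the desired pitch and duration) and a join cost (how smoothly adjacent units connect). The names `Unit`, `target_cost`, `join_cost` and `select_units`, and the cost formulas themselves, are hypothetical and chosen only for illustration. The sketch also assumes the database holds at least one candidate for every requested phone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One indexed snippet in a (hypothetical) speech database."""
    phone: str       # phone label the snippet realizes
    pitch: float     # fundamental frequency in Hz
    duration: float  # length in seconds


def target_cost(unit, want_pitch, want_duration):
    # Assumed cost: mismatch between the unit and the desired acoustics.
    return abs(unit.pitch - want_pitch) / 100.0 + abs(unit.duration - want_duration)


def join_cost(prev_unit, next_unit):
    # Assumed cost: pitch discontinuity at the concatenation point.
    return abs(prev_unit.pitch - next_unit.pitch) / 100.0


def select_units(targets, database):
    """Return the lowest-cost chain of units.

    targets:  list of (phone, desired_pitch, desired_duration)
    database: dict mapping phone -> non-empty list of candidate Units
    """
    if not targets:
        return []

    candidates = []  # candidates[i] = candidate units for target i
    best = []        # best[i][j] = (cheapest chain cost ending at j, backpointer)
    for i, (phone, pitch, duration) in enumerate(targets):
        cands = database[phone]
        candidates.append(cands)
        row = []
        for unit in cands:
            cost = target_cost(unit, pitch, duration)
            if i == 0:
                row.append((cost, None))
            else:
                # Cheapest way to reach this unit from any previous candidate.
                prev_cost, back = min(
                    (best[i - 1][k][0] + join_cost(prev, unit), k)
                    for k, prev in enumerate(candidates[i - 1])
                )
                row.append((prev_cost + cost, back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(candidates[i][j])
        j = best[i][j][1]
    chain.reverse()
    return chain
```

In this formulation the join cost can override a locally good match: when two candidates fit the target equally well, the one that connects more smoothly to its neighbor is preferred.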
Attempts have been made to raise the quality of text-to-speech output in handheld portable devices. In a media management system discussed in United States Patent Application Publication No. 2006/0095848, a host personal computer has a text-to-speech conversion engine. During connection with a media player device, the host performs a synchronization operation that identifies, and copies to the personal computer, any text strings that do not have an associated audio file on the media player device. The personal computer converts each such text string to a corresponding audio file and sends the audio file to the media player. Because the text-to-speech conversion is performed entirely on the personal computer, which has significantly more processing power and storage capacity than the media player device, higher quality text-to-speech output from the media player is possible. However, because the complete audio file is sent from the host personal computer to the media player device, the amount of data transferred is relatively large; the transfer may take a long time, and the audio files may occupy a large proportion of the media player's storage capacity. Additionally, for each new text string on the media player, the media player must connect to the personal computer for conversion of the text string to an audio file (regardless of whether the exact text string has been converted previously).
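The synchronization flow described above might be pictured roughly as follows. This is only an illustrative sketch and not the implementation of the cited publication: the function names, the dictionary standing in for the media player's audio files, and the `synthesize` callback are all assumptions. It shows why the transfer cost grows with the full size of each synthesized audio file.

```python
def strings_needing_audio(text_strings, audio_files):
    """Return the text strings that have no associated audio file.

    text_strings: iterable of strings stored on the (hypothetical) player
    audio_files:  dict mapping text string -> audio bytes already on the player
    """
    return [text for text in text_strings if text not in audio_files]


def synchronize(player_strings, player_audio, synthesize):
    """Host-side sync: convert missing strings and send complete audio files.

    synthesize is an assumed callback standing in for the host's
    text-to-speech engine; it takes a string and returns audio bytes.
    Returns the number of bytes transferred to the player.
    """
    transferred = 0
    for text in strings_needing_audio(player_strings, player_audio):
        audio = synthesize(text)    # conversion happens entirely on the host
        player_audio[text] = audio  # the complete audio file is stored on the player
        transferred += len(audio)
    return transferred
```

Under this model, every new string forces another connection to the host, and every byte of synthesized audio crosses the link and occupies storage on the player.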
Thus, there is a need for a text-to-speech synthesis system that enables high quality, natural-sounding text-to-speech output from a handheld portable device while minimizing the size of the data transferred to and from the handheld portable device. There is a need to limit the dependency of the handheld portable device on a separate text-to-speech conversion device while maintaining high quality text-to-speech output from the handheld portable device. There is also a need to enable high intelligibility of the text-to-speech output from the handheld portable device.