As shown in FIG. 1, numeral 100, text-to-speech synthesis is the conversion of written or printed text (102) into speech (110). Text-to-speech synthesis offers the possibility of providing voice output at a much lower cost than recording speech and playing it back. Speech synthesis is often employed in situations where the text is likely to vary a great deal and where it is simply not possible to record it beforehand.
In a language like English, where the pronunciation of a word is often not obvious from its spelling, it is important to convert orthographies (102) into unambiguous phonetic representations (106) by means of a linguistic module (104) before an acoustic module (108) generates speech waveforms (110) from those representations. To produce phonetic representations from orthography, rule-based systems, pronunciation dictionaries, or automatic orthography-pronunciation conversion procedures trained on such pronunciation dictionaries may be employed.
Pronunciation lexicons, and therefore automatic procedures trained on pronunciation lexicons, employ lexical pronunciations. Lexical pronunciations are underspecified, generalized pronunciations that may or may not be modified into different postlexical pronunciations in natural speech. For example, the English word "foot" might be listed in a pronunciation dictionary as /fuht/. Pronunciations here are given in TIMIT (Texas Instruments-Massachusetts Institute of Technology) notation, described in Garofolo, John S., "The Structure and Format of the DARPA TIMIT CD-ROM Prototype". In natural speech, the final /t/ might surface either as [t], for example when "foot" ends a sentence, or as a flap, [dx], when "foot" comes before another word in the same sentence that starts with a vowel, as in "my foot is . . ."
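The "foot" example can be made concrete with a minimal sketch. It assumes a toy lexicon (the entries for "my" and "is" are illustrative additions, not from the text) and a single hand-written rule that mirrors only the condition stated above: a word-final /t/ becomes [dx] when the next word in the same sentence begins with a vowel. The names LEXICON, VOWELS, lexical and postlexical are hypothetical, and real flapping depends on further context that this sketch ignores.

```python
# Toy lexicon in TIMIT-style phones. Only "foot" comes from the text;
# the other entries are assumed for illustration.
LEXICON = {
    "my": ["m", "ay"],
    "foot": ["f", "uh", "t"],
    "is": ["ih", "z"],
}

# A subset of TIMIT vowel symbols, sufficient for this sketch.
VOWELS = {"iy", "ih", "eh", "ae", "aa", "ah", "ao", "uh", "uw",
          "er", "ax", "ey", "ay", "oy", "aw", "ow"}


def lexical(words):
    """Look up each word in isolation, as a pronunciation dictionary does."""
    return [list(LEXICON[w]) for w in words]


def postlexical(words):
    """Apply one illustrative cross-word rule: final /t/ -> [dx] before a vowel."""
    prons = lexical(words)
    for i in range(len(prons) - 1):
        if prons[i][-1] == "t" and prons[i + 1][0] in VOWELS:
            prons[i][-1] = "dx"
    return prons


print(postlexical(["my", "foot"]))        # "foot" ends the sentence: [t] is kept
print(postlexical(["my", "foot", "is"]))  # "foot" precedes a vowel: [dx]
```

The lookup step sees one word at a time, while the rule needs the following word, which is why the two stages are separate functions in this sketch.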
Adding postlexical pronunciations to dictionaries instead of lexical pronunciations is not a viable solution to this problem for two reasons. The first reason is that pronunciation dictionaries would dramatically expand in size. The second reason is that pronunciation dictionaries are used to determine the pronunciations for words in isolation, while postlexical phenomena are encountered across words in sentences. So, at the time when a lexicon is consulted, there may or may not be sufficient information available to determine the appropriate postlexical pronunciation.
In neural network and other data-driven forms of speech synthesis, a learning procedure is employed to learn to generate speech spectral information from phonetic information. This constitutes the acoustic parameter neural network training. This is performed by labeling speech waveforms with phonetic information and then training, for example, a neural network or other data-driven system to learn the spectral characteristics associated with the time slices labeled with particular phones.
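The labeling-then-training procedure can be sketched in a few lines. This is not a neural network: as a deliberately simple stand-in for a data-driven learner, it averages the spectral vectors of all time slices that carry the same phone label. The function names train and synthesize and the two-dimensional "spectral" vectors are assumptions made for illustration only.

```python
from collections import defaultdict


def train(frames):
    """Learn one mean spectral vector per phone label.

    frames: iterable of (phone_label, spectral_vector) pairs, one per
    labeled time slice. Per-phone averaging stands in for a neural
    network or other data-driven learner.
    """
    sums = {}
    counts = defaultdict(int)
    for phone, vec in frames:
        if phone not in sums:
            sums[phone] = [0.0] * len(vec)
        for k, value in enumerate(vec):
            sums[phone][k] += value
        counts[phone] += 1
    return {p: [s / counts[p] for s in sums[p]] for p in sums}


def synthesize(model, phones):
    """Produce one spectral vector per input phone from the trained model."""
    return [model[p] for p in phones]


# Invented training data: time slices labeled with the phones heard in speech.
labeled_frames = [
    ("t", [1.0, 3.0]),
    ("t", [3.0, 5.0]),
    ("dx", [0.0, 2.0]),
]
model = train(labeled_frames)
print(synthesize(model, ["dx", "t"]))
```

Because the model only knows the phone labels it was trained on, the phone sequence supplied at synthesis time has to use the same label inventory and conventions as the training labels.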
When such a neural network system is actually used, the neural network must produce appropriate spectral information for given phonetic information. As mentioned above, such phonetic information is derived from text by means of an orthography-phonetics lexicon or an automatic procedure trained on such a lexicon.
Since the object of data-driven speech synthesis methods is to produce testing data that is analogous to the training data, and thus similar to natural speech, it is important that the phonetic representations developed in the testing phase substantially match those that were used in the training phase. This helps ensure that the most reliable performance is obtained.
Unfortunately, there is always likely to be some mismatch between the lexical pronunciations found in dictionaries and the pronunciations used to label speech. This mismatch may stem from at least four different sources: speaker idiosyncrasies, dictionary idiosyncrasies, labeler idiosyncrasies, and differences between lexical and postlexical pronunciations.
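One way to quantify such a mismatch, offered here only as an illustrative sketch, is a phone-level edit distance between the dictionary pronunciation and the label sequence for the same word. The function names phone_edit_distance and mismatch_rate are hypothetical, and the second word pair (a dropped final /d/) is an invented example rather than data from the text.

```python
def phone_edit_distance(lexical, labeled):
    """Levenshtein distance between two phone sequences.

    Counts the substitutions, insertions and deletions needed to turn
    the dictionary pronunciation into the labeled pronunciation.
    """
    prev = list(range(len(labeled) + 1))
    for i, lex_phone in enumerate(lexical, start=1):
        curr = [i]
        for j, lab_phone in enumerate(labeled, start=1):
            cost = 0 if lex_phone == lab_phone else 1
            curr.append(min(prev[j] + 1,          # delete a lexical phone
                            curr[j - 1] + 1,      # insert a labeled phone
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def mismatch_rate(pairs):
    """Total edits divided by total lexical phones over (lexical, labeled) pairs."""
    errors = sum(phone_edit_distance(lex, lab) for lex, lab in pairs)
    total = sum(len(lex) for lex, _ in pairs)
    return errors / total


pairs = [
    (["f", "uh", "t"], ["f", "uh", "dx"]),  # flapped /t/, as in "my foot is"
    (["ae", "n", "d"], ["ae", "n"]),        # invented case: final /d/ not labeled
]
print(mismatch_rate(pairs))
```

A measure of this kind does not distinguish among the four sources of mismatch; it only shows how far the dictionary and the labels diverge.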
While rule-based approaches to generating postlexical pronunciations from lexical pronunciations might be successful for a given language, rule-based approaches will not be able to automatically deal with dictionary and labeler idiosyncrasies at the same time. That is, a new rule set would need to be developed for each possible combination of speaker, labeler and dictionary, resulting in an unwieldy situation.
Hence, there is a need for an automatic procedure for generating postlexical pronunciations from lexical pronunciations, both to increase the naturalness of synthetic speech and to reduce the cost and time required to develop high-quality speech synthesis systems. A method, device and article of manufacture for neural-network-based generation of postlexical pronunciations from lexical pronunciations are needed.