As shown in FIG. 1, numeral 100, text-to-speech synthesis is the conversion of written or printed text (102) into speech (110). Text-to-speech synthesis offers the possibility of providing voice output at a much lower cost than recording speech and playing that speech back. Speech synthesis is often employed in situations where the text is likely to vary a great deal and where it is simply not possible to record the corresponding speech beforehand.
Speech synthesizers need to convert text (102) to a phonetic representation (106) that is then passed to an acoustic module (108) which converts the phonetic representation to a speech waveform (110).
In a language like English, where the pronunciation of words is often not obvious from the orthography of words, it is important to convert orthographies (102) into unambiguous phonetic representations (106) by means of a linguistic module (104). These phonetic representations are then submitted to an acoustic module (108) for the generation of speech waveforms (110). In order to produce the most accurate phonetic representations, a pronunciation lexicon is required. However, it is simply not possible to anticipate all possible words that a synthesizer may be required to pronounce. For example, many names of people and businesses, as well as neologisms and novel blends and compounds, are created every day. Even if it were possible to enumerate all such words, the storage requirements would exceed what is feasible for most applications.
In order to pronounce words that are not found in pronunciation dictionaries, prior researchers have employed letter-to-sound rules, more or less of the following form: orthographic c becomes phonetic /s/ before orthographic e and i, and phonetic /k/ elsewhere. As is customary in the art, pronunciations will be enclosed in slashes: //. For a language like English, several hundred such rules associated with a strict ordering are required for reasonable accuracy. Such a rule-set is extremely labor-intensive to create and difficult to debug and maintain. Moreover, such a rule-set cannot be used for a language other than the one for which the rule-set was created.
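By way of illustration only, the example rule above may be sketched as an ordered rule list. The names RULES and apply_rules are hypothetical and do not appear in the prior art described here. The pass-through of letters that match no rule is a placeholder assumption, not a real pronunciation rule.

```python
# Hypothetical ordered rule list: (letter, following-letter context, phone).
# A context of None means "elsewhere". The first matching rule wins.
RULES = [
    ("c", {"e", "i"}, "s"),  # orthographic c -> /s/ before e or i
    ("c", None, "k"),        # orthographic c -> /k/ elsewhere
]


def apply_rules(word, rules=RULES):
    """Map each letter of `word` to a phone using the first matching rule."""
    phones = []
    for i, letter in enumerate(word):
        # Letter to the right, or "" at the end of the word.
        nxt = word[i + 1] if i + 1 < len(word) else ""
        for target, context, phone in rules:
            if letter == target and (context is None or nxt in context):
                phones.append(phone)
                break
        else:
            # Placeholder assumption: letters with no rule pass through unchanged.
            phones.append(letter)
    return phones


print(apply_rules("cent"))
print(apply_rules("cat"))
# Reversing the rule order lets the "elsewhere" rule fire first.
print(apply_rules("cent", rules=list(reversed(RULES))))
```

With the rules reversed, the "elsewhere" rule captures every c, which suggests why a strict ordering must be maintained across a rule-set of several hundred rules.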
Another solution that has been put forward is a neural network that is trained on an existing pronunciation lexicon and that learns to generalize from the lexicon in order to pronounce novel words. Previous neural network approaches have suffered from the requirement that letter-phone correspondences in the training data be aligned by hand. In addition, such prior neural networks failed to associate letters with the phonetic features of which the letters might be composed. Finally, evaluation metrics were based solely on insertions, substitutions and deletions, without regard to the featural composition of the phones involved.
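The difference between the two kinds of evaluation metric may be illustrated by the following minimal sketch. It is not the metric of any particular prior system. The FEATURES table is a hypothetical, highly simplified feature assignment for three phones, and the function names unit_cost, feature_cost and edit_distance are likewise assumptions made for this example.

```python
# Hypothetical, highly simplified feature sets (illustration only).
FEATURES = {
    "s": {"consonant", "fricative", "alveolar"},
    "z": {"consonant", "fricative", "alveolar", "voiced"},
    "k": {"consonant", "stop", "velar"},
}


def unit_cost(a, b):
    """Substitution cost that ignores featural composition."""
    return 0.0 if a == b else 1.0


def feature_cost(a, b):
    """Assumed cost: fraction of features on which the two phones differ."""
    fa, fb = FEATURES[a], FEATURES[b]
    return len(fa ^ fb) / len(fa | fb)


def edit_distance(ref, hyp, sub_cost=unit_cost):
    """Dynamic-programming edit distance over two phone sequences.

    Insertions and deletions cost 1.0; substitutions cost sub_cost(a, b).
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = float(i)  # delete all of ref[:i]
    for j in range(1, cols):
        d[0][j] = float(j)  # insert all of hyp[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                # deletion
                d[i][j - 1] + 1.0,                                # insertion
                d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),  # substitution
            )
    return d[-1][-1]


# Under unit costs, /z/ and /k/ are equally wrong as substitutes for /s/.
print(edit_distance(["s"], ["z"]), edit_distance(["s"], ["k"]))
# Under the assumed feature cost, /z/ is scored as closer to /s/ than /k/ is.
print(edit_distance(["s"], ["z"], feature_cost),
      edit_distance(["s"], ["k"], feature_cost))
```

Under the unit-cost metric, every substitution is penalized equally. Under the assumed feature-based cost, a substitution between phones that share most of their features is penalized less than one between dissimilar phones.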
Therefore, there is a need for an automatic procedure for learning to generate phonetics from orthography that does not require rule-sets or hand alignment, that takes advantage of the phonetic featural content of orthography, and that is evaluated, and whose error is backpropagated, on the basis of the featural content of the generated phones. A method, device and article of manufacture for neural-network based orthography-phonetics transformation is needed.