Phonemisation is a technology that converts an input text into pronunciations. Even prior to the information appliance era, worldwide analysts had long predicted the application of the audio-based human-computer interface to reach booming highs over the information industry. The phonemisation technology has been widely used in systems related to speech synthesis as well as speech recognition.
Conventionally, the fastest way to get the pronunciation of a word is through direct dictionary lookup. The problem is no single dictionary can include all words/pronunciations. When a word lookup system cannot find a particular word, the technique of phonemisation can be employed to generate the pronunciations of the word. In speech synthesis, phonemisation provides an audio system with the pronunciations for a missing word and avoids the audio output error due to the lack of pronunciation for missing words. In speech recognition, it is a common process to expand the trained audio vocabulary set/database by adding new words/pronunciations to enhance the accuracy of the speech recognition. With phonemisation, a speech recognition system can easily process the missing pronunciation and minimize the difficulty for the audio vocabulary set/database expansion.
A conventional phonemisation is rule-based which maintains a large rule set prepared by linguistic specialists. But no matter how many rules you have, exceptions always happen. There is also no guarantee not to conflict to the existing rules by adding a new rule. With the growing of the rule-database, the cost for the rule-database refinement and maintenance is also getting high. Other than this, since rule-databases differ from language to language, it is hard to expand the same rule-database to a different language without major efforts to redesign a new rule-database. In general, a rule-based text-to-pronunciation conversion system has limited expandability due to its lacking of reusability and portability.
To overcome the aforementioned drawbacks, more and more text-to-pronunciation conversion systems gear to data-driven methods, such as pronunciation by analogy (PbA), neural-network model, decision tree model, joint N-gram model, automatic rule learning model, and multi-stage text-to-pronunciation conversions model, etc.
A data-driven text-to-pronunciation conversion system has the advantage of minimum involvement of manual labor and specialty knowledge, and is language-independent. Compared with a conventional rule-based system, a data-driven text-to-pronunciation conversion system is superior, from the perspectives of system construction, future maintenance, and reusability, etc.
Pronunciation by analogy decomposes an input text into a plurality of strings of variable lengths. Each string is then compared with the words in a dictionary to identify the most representative phoneme for each string. After that, it constructs an associate graph composed of the strings accompanied with the corresponding phonemes. The optimal path in the graph is selected to represent the pronunciation of the input text. U.S. Pat. No. 6,347,295 disclosed a computer method and apparatus for grapheme-to-phoneme conversion. This technology uses the PbA method, and requires a pronouncing dictionary. In the pronouncing dictionary, it searches for each segment that has ever occurred, as well as its occurrence count as a score to construct the whole phoneme graph.
A text-to-pronunciation conversion with neural-network model is exampled by the method disclosed in the U.S. Pat. No. 5,930,754. This prior art disclosed a technology of manufacture for neural-network based orthography-phonetics transformation. This technique requires a predetermined set of input letter feature to train a neural-network-model to generate a phonetic representation.
A text-to-pronunciation conversion technique with decision tree model is exampled by the method disclosed in the U.S. Pat. No. 6,029,132. This prior art disclosed a method for letter-to-sound in text-to-speech synthesis. This technique is a hybrid approach, using decision trees to represent the established rules. The phonetic transcription of an input text is also represented by a decision tree. Another U.S. Pat. No. 6,230,131, also disclosed a decision tree method for phonetics-to-pronunciation conversion. In this prior art, the decision tree is utilized to identify the phonemes, and probability models are followed to identify the optimum path to generate the pronunciation for the spelled-word letter sequence.
A text-to-pronunciation conversion with joint N-gram model is done by first decomposing all text/phonetic transcriptions into grapheme-phoneme pairs. A probability model is built with all grapheme-phoneme pairs from all words/phonetic transcriptions. After that, any input text is also decomposed into grapheme-phoneme pairs. The optimum path of the grapheme-phoneme pair sequence for the input text is obtained by comparing the grapheme-phoneme pairs of the input text with the pre-built grapheme-phoneme probability model to generate the final pronunciation of the input text.
Multi-stage text-to-speech conversion is an improving process, which emphasizes on graphemes (vowels) that are easily mispronounced, with more prefix/postfix information for further verification before the final pronunciation is generated. This text-to-speech conversion technique is disclosed in U.S Pat. No. 6,230,131.
The aforementioned data-driven techniques all need a training set of pronunciation information, which is usually a dictionary with sets of word/phonetic transcriptions. Amongst these techniques, PbA and joint N-gram models are the two methods referenced the most, while the multi-stage text-to-speech conversion model is the one with the best functionality.
PbA has good execution efficiency, but the accuracy is not satisfactory. The joint N-gram model although has good accuracy, the associate decision graph composed of grapheme-phoneme mapping pairs is too large when n=4, and its execution efficiency the worst amongst all methods. The multi-stage model although yields the highest resulting pronunciation, the overhead process for the further verification on easily mispronounced graphemes limits the enhancement to its overall execution efficiency.
Since audio is an important media for man-machine interface in the mobile information appliance era, and the text-to-pronunciation technique plays a critical role in speech-synthesis and speech-recognition, researching and developing superior techniques for text-to-pronunciation techniques is essentially necessary.