The present invention relates generally to speech processing. More particularly, the invention relates to a system for generating pronunciations of spelled words. The invention can be employed in a variety of different contexts, including speech recognition, speech synthesis and lexicography.
Spelled words accompanied by their pronunciations occur in many different contexts within the field of speech processing. In speech recognition, phonetic transcriptions for each word in the dictionary are needed to train the recognizer prior to use. Traditionally, phonetic transcriptions are manually created by lexicographers who are skilled in the nuances of phonetic pronunciation of the particular language of interest. Developing a good phonetic transcription for each word in the dictionary is time-consuming and requires a great deal of skill. Much of this labor and specialized expertise could be dispensed with if there were a reliable system that could generate phonetic transcriptions of words based on their letter spelling. Such a system could extend current recognition systems to recognize words, such as geographic locations and surnames, that are not currently found in existing dictionaries.
Spelled words are also encountered frequently in the speech synthesis field. Present-day speech synthesizers convert text to speech by retrieving digitally sampled sound units from a dictionary and concatenating these sound units to form sentences.
As the above examples demonstrate, both the speech recognition and the speech synthesis fields of speech processing would benefit from the ability to generate accurate pronunciations from spelled words. The need for this technology is not limited to speech processing, however. Lexicographers have by now completed fairly large and accurate pronunciation dictionaries for many of the major world languages. However, there still remain many hundreds of regional languages for which good phonetic transcriptions do not exist. Because the task of producing a good phonetic transcription has heretofore been largely a manual one, it may be years before some regional languages are transcribed, if they are transcribed at all. The transcription process could be greatly accelerated if there were a good computer-implemented technique for scoring transcription accuracy. Such a scoring system would use an existing language transcription corpus to identify those entries in the transcription prototype whose pronunciations are suspect. This would greatly enhance the speed at which a quality transcription is generated.
Heretofore, most attempts at spelled word-to-pronunciation transcription have relied solely upon the letters themselves. These techniques leave a great deal to be desired. For example, a letter-only pronunciation generator would have great difficulty properly pronouncing the word "Bible." Based on the sequence of letters alone, the letter-only system would likely pronounce the word as "Bib-l," much as a grade school child learning to read might do. The fault in conventional systems lies in the inherent ambiguity imposed by the pronunciation rules of many languages. The English language, for example, has hundreds of different pronunciation rules, making it difficult and computationally expensive to approach the problem on a word-by-word basis.
The present invention addresses the problem from a different angle. The invention uses a specially constructed mixed-decision tree that encompasses both letter sequence and phoneme sequence decision-making rules. More specifically, the mixed-decision tree embodies a series of yes-no questions residing at the internal nodes of the tree. Some of these questions involve letters and their adjacent neighbors in a spelled word sequence; others involve phonemes and their neighboring phonemes in the word sequence. The internal nodes ultimately lead to leaf nodes that contain probability data about which phonetic pronunciations of a given letter are most likely to be correct in pronouncing the word defined by its letter sequence.
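By way of illustration only, the structure of such a mixed-decision tree may be sketched in Python as follows. The sketch assumes that each letter is aligned with exactly one phoneme (with "-" standing for a silent letter) and that "#" pads positions outside the word. The class names, the helper functions, the toy tree for the letter "i", and all probability values are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Sequence, Union

PAD = "#"  # hypothetical symbol for positions outside the word


@dataclass
class Leaf:
    """Leaf node: probability of each phoneme for the letter being examined."""
    probs: Dict[str, float]


@dataclass
class Question:
    """Internal node: a yes-no question about a neighboring letter or phoneme."""
    kind: str                 # "letter" or "phoneme"
    offset: int               # position relative to the letter being examined
    values: FrozenSet[str]    # the answer is "yes" if the symbol is in this set
    yes: "Node"
    no: "Node"


Node = Union[Leaf, Question]


def symbol_at(seq: Sequence[str], i: int) -> str:
    """Return the symbol at index i, or PAD when i falls outside the word."""
    return seq[i] if 0 <= i < len(seq) else PAD


def leaf_for(tree: Node, letters: Sequence[str],
             phonemes: Sequence[str], i: int) -> Leaf:
    """Answer the questions from the root downward until a leaf is reached."""
    node = tree
    while isinstance(node, Question):
        seq = letters if node.kind == "letter" else phonemes
        answer = symbol_at(seq, i + node.offset) in node.values
        node = node.yes if answer else node.no
    return node


# Toy tree for the letter "i"; the probabilities are invented for illustration.
toy_tree_i = Question(
    "letter", +1, frozenset({"b"}),
    yes=Question(
        "phoneme", +2, frozenset({"l"}),
        yes=Leaf({"ay": 0.7, "ih": 0.3}),
        no=Leaf({"ay": 0.4, "ih": 0.6}),
    ),
    no=Leaf({"ay": 0.2, "ih": 0.8}),
)

leaf = leaf_for(toy_tree_i, list("bible"), ["b", "ay", "b", "l", "-"], 1)
```

In this toy case the first question concerns a neighboring letter and the second concerns a neighboring phoneme, so the leaf reached depends on both the spelling and the candidate pronunciation.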
The pronunciation generator of the invention uses this mixed-decision tree to score different pronunciation candidates, allowing it to select the most probable candidate as the best pronunciation for a given spelled word. Generation of the best pronunciation is preferably a two-stage process in which a letter-only tree is used in the first stage to generate a plurality of pronunciation candidates. These candidates are then scored using the mixed-decision tree in the second stage to select the best candidate.
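A minimal sketch of the two-stage process, under stated assumptions, is given below. The letter-only and mixed models are represented by plain functions that return a phoneme probability table for the letter at position i. These stand in for the trees and are hypothetical, as are the function names, the exhaustive candidate enumeration, the n-best cutoff, and every probability value.

```python
import heapq
from itertools import product
from math import prod


def generate_candidates(letters, letter_only, n_best=5):
    """Stage one: build candidates from letter-only probabilities, keep the n best."""
    per_letter = [letter_only(letters, i) for i in range(len(letters))]
    scored = []
    for combo in product(*(table.items() for table in per_letter)):
        phonemes = [p for p, _ in combo]
        scored.append((prod(pr for _, pr in combo), phonemes))
    best = heapq.nlargest(n_best, scored, key=lambda item: item[0])
    return [phonemes for _, phonemes in best]


def rescore(letters, candidates, mixed):
    """Stage two: score each candidate with the mixed model and return the best."""
    def score(phonemes):
        return prod(
            mixed(letters, phonemes, i).get(phonemes[i], 0.0)
            for i in range(len(letters))
        )
    return max(candidates, key=score)


# Toy stand-ins for the two trees; all numbers are invented for illustration.
def toy_letter_only(letters, i):
    table = {
        "b": {"b": 1.0},
        "i": {"ih": 0.6, "ay": 0.4},
        "l": {"l": 1.0},
        "e": {"-": 1.0},  # "-" marks a silent letter
    }
    return table[letters[i]]


def toy_mixed(letters, phonemes, i):
    if letters[i] == "i":
        # Phoneme question: is the phoneme two positions to the right "l"?
        later = phonemes[i + 2] if i + 2 < len(phonemes) else "#"
        return {"ay": 0.7, "ih": 0.3} if later == "l" else {"ay": 0.2, "ih": 0.8}
    return toy_letter_only(letters, i)


word = list("bible")
candidates = generate_candidates(word, toy_letter_only)
best = rescore(word, candidates, toy_mixed)
```

With these toy tables, the letter-only stage ranks the "ih" reading first, while the mixed stage, which can also consult neighboring phonemes, selects the "ay" reading.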
Although the mixed-decision tree is advantageously used in a two-stage pronunciation generator, the mixed tree is useful in solving some problems that do not require letter-only first stage processing. For example, the mixed-decision tree can be used to score pronunciations generated by linguists using manual techniques.
For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.