The present invention relates generally to speech processing. More particularly, the invention relates to a system for generating pronunciations of spelled words. The invention can be employed in a variety of different contexts, including speech recognition, speech synthesis and lexicography.
Spelled words are also encountered frequently in the speech synthesis field. Present day speech synthesizers convert text to speech by retrieving digitally-sampled sound units from a dictionary and concatenating these sound units to form sentences.
Heretofore, most attempts at spelled word-to-pronunciation transcription have relied solely upon the letters themselves. These techniques leave a great deal to be desired. For example, a letter-only pronunciation generator would have great difficulty properly pronouncing the word "read" used in the past tense. Based on the sequence of letters only, the letter-only system would likely pronounce the word as "reed", much as a grade school child learning to read might do. The fault in conventional systems lies in the inherent ambiguity imposed by the pronunciation rules of many languages. The English language, for example, has hundreds of different pronunciation rules, making it difficult and computationally expensive to approach the problem on a word-by-word basis.
The present invention addresses the problem from a different angle. The invention uses a specially constructed mixed-decision tree that encompasses letter sequence, syntax, context and dialect decision-making rules. More specifically, the letter-syntax-context-dialect mixed-decision trees embody a series of yes-no questions residing at the internal nodes of the tree.
Some of these questions involve letters and their adjacent neighbors in a spelled word sequence (i.e., letter-related questions); other questions examine what words precede or follow a particular word (i.e., context-related questions); other questions examine what part of speech the word has within a sentence as well as what syntax other words have in the sentence (i.e., syntax-related questions); still other questions examine which dialect is to be spoken (i.e., dialect-related questions).
The internal nodes ultimately lead to leaf nodes that contain probability data about which phonetic pronunciations and stress of a given letter are most likely to be correct in pronouncing the word defined by its letter and word sequence.
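By way of illustration only, the traversal described above may be sketched in Python as follows. The names `Context`, `Node`, `Leaf` and `lookup`, the part-of-speech tag "VBD", the phoneme labels and every probability value are assumptions introduced for this sketch and are not taken from the specification. The example tree asks one letter-related question and one syntax-related question about the letter "e" in "read"; context-related and dialect-related questions would be expressed in the same way.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class Context:
    """Everything a yes-no question may inspect (hypothetical layout)."""
    word: str            # spelled word
    index: int           # position of the letter being pronounced
    part_of_speech: str  # syntax tag of the word, e.g. "VBD" for past tense
    prev_word: str       # preceding word in the sentence
    dialect: str         # desired dialect


@dataclass
class Leaf:
    probs: Dict[str, float]  # phoneme label -> probability


@dataclass
class Node:
    question: Callable[[Context], bool]  # yes-no question at an internal node
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def lookup(tree: Union[Node, Leaf], ctx: Context) -> Dict[str, float]:
    """Follow yes/no answers from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(ctx) else tree.no
    return tree.probs


# Illustrative tree for the letter "e"; the probabilities are invented.
tree_for_e = Node(
    # Letter-related question: is the next letter "a"?
    question=lambda c: c.word[c.index + 1:c.index + 2] == "a",
    yes=Node(
        # Syntax-related question: is the word a past-tense verb?
        question=lambda c: c.part_of_speech == "VBD",
        yes=Leaf({"eh": 0.9, "iy": 0.1}),
        no=Leaf({"iy": 0.9, "eh": 0.1}),
    ),
    no=Leaf({"eh": 0.6, "iy": 0.4}),
)

past = Context(word="read", index=1, part_of_speech="VBD",
               prev_word="had", dialect="general")
present = Context(word="read", index=1, part_of_speech="VB",
                  prev_word="to", dialect="general")

past_probs = lookup(tree_for_e, past)
present_probs = lookup(tree_for_e, present)
```

In this sketch the same spelling reaches different leaves depending on the answer to the syntax-related question, which is the behavior a letter-only system cannot reproduce for "read".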
The pronunciation generator of the invention uses mixed-decision trees at the word level to score different pronunciation candidates, allowing it to select the most probable candidate as the best pronunciation for a given spelled word. Generation of the best pronunciation is preferably a two-stage process in which a set of letter-syntax-context-dialect mixed-decision trees is used in the first stage to generate a plurality of pronunciation candidates with scores indicating an order of preference. These candidates are then rescored using a second set of mixed-decision trees in the second stage to select the best candidate. This second set of mixed-decision trees examines the word at the phoneme level.
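The two-stage process may be sketched, under stated assumptions, as follows. The first stage here simply enumerates every combination of per-letter phoneme choices and scores each by the product of its probabilities; the second stage is represented by a caller-supplied function standing in for the phoneme-level trees. The function names `generate_candidates` and `rescore`, the exhaustive enumeration, the "-" label for a silent letter and all numeric values are illustrative assumptions and do not describe the actual scoring of the invention.

```python
import heapq
from itertools import product
from math import prod
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[float, Tuple[str, ...]]  # (first-stage score, phoneme sequence)


def generate_candidates(letter_dists: List[Dict[str, float]],
                        n_best: int) -> List[Candidate]:
    """Stage 1: score every phoneme sequence and keep the n best.

    letter_dists holds one phoneme -> probability mapping per letter,
    such as the leaf data reached in the letter-level trees.
    """
    scored = []
    for combo in product(*(d.items() for d in letter_dists)):
        phonemes = tuple(p for p, _ in combo)
        score = prod(q for _, q in combo)
        scored.append((score, phonemes))
    return heapq.nlargest(n_best, scored)


def rescore(candidates: List[Candidate],
            phoneme_scorer: Callable[[Tuple[str, ...]], float]) -> Tuple[str, ...]:
    """Stage 2: pick the candidate the phoneme-level scorer rates highest."""
    return max(candidates, key=lambda c: phoneme_scorer(c[1]))[1]


# Invented per-letter distributions for "read"; "-" marks a silent letter.
dists = [
    {"r": 1.0},
    {"iy": 0.6, "eh": 0.4},
    {"-": 1.0},
    {"d": 1.0},
]
candidates = generate_candidates(dists, n_best=2)


def toy_phoneme_scorer(seq: Tuple[str, ...]) -> float:
    """Hypothetical stand-in for the second set of trees."""
    return 0.8 if "eh" in seq else 0.2


best = rescore(candidates, toy_phoneme_scorer)
```

With these invented numbers the first stage ranks the "iy" reading ahead of the "eh" reading, and the stand-in second stage reverses that order, showing how rescoring can overturn the first-stage preference.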
For a more complete understanding of the invention, its objects and advantages, reference may be had to the following specification and to the accompanying drawings.