This invention relates to text-to-pronunciation systems and more particularly to rule-based learning of word pronunciations from a training corpus or set of pronunciations.
In this document we will frequently refer to phonemes and graphemes (letters). Graphemes are enclosed in single quotation marks (e.g. ‘abc’). In fact, any symbol(s) within single quotation marks refer to graphemes, or grapheme sequences.
Phonemes or phoneme sequences are enclosed in parentheses, e.g. (′m uw). We will use the ASCII representation for the English phoneme set. See Charles T. Hemphill, EPHOD, Electronic PHOnetic Dictionary, Texas Instruments, Dallas, Tex., USA, Edition 1.1, May 12, 1995. Stress levels are not marked in most examples, as they are not important to the discussion. In fact, we will assume that the stress information directly belongs to the vowels, so the above phoneme sequence will be denoted as (m ′uw) or simply (m uw). Schwas are represented either as unstressed vowels (.ah) or using their special symbol (ax) or (.ax).
Grapheme-phoneme correspondences (partial or whole pronunciations) are represented by connecting the graphemes to the phoneme sequence (e.g. ‘word’→(w er d)). The grapheme-phoneme correspondences usually do not contain stress marks.
Grapheme or phoneme contexts are represented by listing the left and right contexts, and representing the symbol of interest with an underscore (e.g. (b_l)). Word boundaries in contexts are denoted with a dollar sign (e.g. ‘$x_’).
In this decade, speech as a medium is becoming a more prevalent component in consumer computing. Games, office productivity and entertainment products use speech as a natural extension to visual interfaces. Some programs use prerecorded digital audio files to produce speech, while other programs use speech synthesis systems. The advantage of the latter is that they can generate a broad range of sentences and can thus be used for presenting dynamic information. Nevertheless, their speech quality is usually lower than that of prerecorded audio segments.
Speech recognition systems are also becoming more and more accessible to average consumers. A drawback of these systems is that speech recognition is a computationally expensive process and requires a large amount of memory; nonetheless, powerful computers are becoming available for everyday people.
Both speech synthesis and speech recognition rely on the availability of pronunciations for words or phrases. Earlier systems used pronunciation dictionaries to store word pronunciations. However, it is possible to generate word pronunciations from language-specific pronunciation rules. In fact, systems starting from the early stages have been using algorithms to generate pronunciations for words not in their pronunciation dictionary. Also, since pronunciation dictionaries tend to be large, it would be reasonable to store pronunciations only for words that are difficult to pronounce, namely, for words that the pronunciation generator cannot correctly pronounce.
Speech recognizers are becoming an important element of communication systems. These recognizers often have to recognize arbitrary phrases, especially when the information to be recognized is from an on-line, dynamic source. To make this possible, the recognizer has to be able to produce pronunciations for arbitrary words. Because of space requirements, speech systems need a compact yet robust method to make up word pronunciations.
Myriad approaches have been proposed for text-to-pronunciation (TTP) systems. In addition to using a simple pronunciation dictionary, most systems use rewrite rules, which have proven to be well adapted to the task. Unfortunately, these rules are handcrafted; thus, the effort put into producing them must be repeated for each new language. To solve this problem, more recent methods use machine-learning techniques, such as neural networks, decision trees, instance-based learning, Markov models, analogy-based techniques, or data-driven solutions to automatically extract pronunciation information for a specific language. See François Yvon, Grapheme-to-Phoneme Conversion Using Multiple Unbounded Overlapping Chunks, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON NEW METHODS IN LANGUAGE PROCESSING, No. 2, Ankara, Turkey, 1996. Also at internet address xxx.lanl.gov/list/cmp-lg/9608#cmp-lg/9608006.
A review of some of the approaches follows. It is difficult to objectively compare the performance of these methods, as each is trained and tested using different corpora and different scoring functions. Nevertheless, an overall assessment is presented of each approach.
The simplest way to generate word pronunciations is to store them in a pronunciation dictionary. The advantage of this solution is that the lookup is very fast. In fact, we can have a constant lookup time if we use a hash table. It is also capable of capturing multiple pronunciations for words with no additional complexity. The major drawback of dictionaries is that they cannot seamlessly handle words that are not in them. They also take up a lot of space: O(N), where N is the number of words. Here O(f(x)) is the mathematical notation for order of magnitude.
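By way of illustration only, the dictionary approach can be sketched as a hash-table lookup. The entries for ‘word’ and ‘manual’ follow the examples used in this document (stress marks omitted); the entry for ‘read’, the table name `PRONUNCIATIONS`, and the function name `lookup` are hypothetical and are not part of any system described here.

```python
from typing import Dict, List, Optional

# Hypothetical toy dictionary: each word maps to one or more pronunciations.
# A Python dict is a hash table, so a lookup takes constant time on average.
PRONUNCIATIONS: Dict[str, List[str]] = {
    "word": ["w er d"],
    "manual": ["m ae n y ah w ah l"],
    "read": ["r iy d", "r eh d"],  # illustrative entry with two pronunciations
}


def lookup(word: str) -> Optional[List[str]]:
    """Return every stored pronunciation, or None if the word is not stored."""
    return PRONUNCIATIONS.get(word.lower())


if __name__ == "__main__":
    print(lookup("word"))
    print(lookup("unseenword"))  # not in the dictionary, so None
```

The `None` result for an unseen word shows the drawback noted above: a dictionary alone has no way to pronounce words it does not contain.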
A somewhat more flexible solution is to generate pronunciations for words based on their spelling. In a pronunciation system developed by the Advanced Research Projects Agency (ARPA), each letter (grapheme) is pronounced based on its grapheme context. An example for English would be to
pronounce ‘e’ in the context ‘_r$’ as (er).    (1.1)
The system consists of a set of rules, each containing a letter context and a phoneme sequence (pronunciation) corresponding to the letter of interest underlined. The representation of the above rule (1.1) would be:
‘er$’→(er).    (1.2)
These pronunciation rules are generated by a human expert for the given language. The advantage of this system is that it can produce pronunciations for unknown words; in fact, every word is treated as unknown. Also, this method can encapsulate pronunciation dictionaries, as entire words can be used as contexts. Furthermore, this method can produce multiple pronunciations for words, since the phoneme sequences in the rules can be arbitrary. The disadvantage of the system is that it cannot take advantage of phonetic features; thus, it requires an extensive rule set. Also, a human expert is needed to produce the rules; therefore, it is difficult to switch to a different language. Moreover, it pronounces each letter as a unit, which seems counter-intuitive.
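The context-based scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the ARPA system itself: only the first rule corresponds to rule (1.2); every other rule, the `Rule` type, and the first-match-wins ordering are hypothetical choices made for the example.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    left: str            # grapheme context to the left of the letter
    letter: str          # the letter of interest
    right: str           # grapheme context to the right of the letter
    phonemes: List[str]  # phoneme sequence produced for the letter


# Rule (1.2) from the text, followed by hypothetical toy rules.
# Rules are tried in order, so specific contexts are listed first.
RULES = [
    Rule("", "e", "r$", ["er"]),  # 'e' in the context '_r$' -> (er)
    Rule("e", "r", "$", []),      # toy rule: the final 'r' is covered by (er)
    Rule("", "b", "", ["b"]),
    Rule("", "a", "", ["ey"]),
    Rule("", "k", "", ["k"]),
    Rule("", "e", "", ["eh"]),
    Rule("", "r", "", ["r"]),
]


def pronounce(word: str) -> List[str]:
    """Pronounce each letter with the first rule whose context matches."""
    padded = "$" + word + "$"  # '$' marks the word boundaries
    result: List[str] = []
    for i in range(1, len(padded) - 1):
        for rule in RULES:
            if (padded[i] == rule.letter
                    and padded[:i].endswith(rule.left)
                    and padded[i + 1:].startswith(rule.right)):
                result.extend(rule.phonemes)
                break  # letters with no matching rule are silently skipped
    return result


if __name__ == "__main__":
    print(pronounce("baker"))  # ['b', 'ey', 'k', 'er']
```

Because whole words can serve as contexts, a rule such as `Rule("$", "w", "ord$", [...])` could in principle encode a dictionary entry, which is how this scheme can encapsulate a pronunciation dictionary.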
The rule-based transliterator (RBT) uses transformation rules to produce pronunciations. See Caroline B. Huang et al., Generation of Pronunciations from Orthographies Using Transformation-Based Error-Driven Learning, INTERNATIONAL CONFERENCE ON SPEECH AND LANGUAGE PROCESSING, pp 411-414, Yokohama, Japan, 1994. It was written in the framework of the theory of phonology by Chomsky and Halle, and it uses phonetic features and phonemes. See Noam Chomsky and M. Halle, The Sound Pattern of English, Harper and Row, New York, New York, USA, 1968. Rewrite rules are formulated as
α→β/γ_δ    (1.3)
which stands for
α is rewritten as β in the context of γ (left) and δ (right).
Here, α, β, γ, and δ can each be either graphemes or phonemes. Each phoneme is portrayed as a feature bundle; thus, rules can refer to the phonetic features of each phoneme. Rewrite rules are generated by human experts, and are applied in a specific order.
This method is similar to the simple context-based ARPA method described above. One improvement is that this system can make use of phonetic features to generalize pronunciation rules. Also, it can capture more complex pronunciation rules because applied rules change the pronunciations which become the context for future rules. The major disadvantage of this solution is that a human expert is still needed to produce the rules; thus, it is difficult to switch to a different language. Another disadvantage is that in contrast with the simple context-based model, this method cannot produce multiple pronunciations for words. Nevertheless, it can be extended to handle multiple pronunciations if we specify how contexts are matched when the phonetic representation is a directed graph and contains multiple pronunciations.
The transformation-based error-driven learner is an extension of the rule-based transliterator. This approach uses similar rewrite rules to produce pronunciations; however, it derives these rules by itself. Also, the context in the rewrite rules can contain corresponding graphemes and phonemes, as this extra information helps the error-driven learner to discover rules.
The learning process consists of the following steps. First, the spelling and the pronunciation of each word in the training set are aligned with each other. Then, an initial pronunciation (guess) is produced for each word. After that, the learner produces transformations that bring the pronunciation guesses closer to the true pronunciations. The most successful transformation is applied to the training set, generating a new guess for each word, and then the process is repeated until there are no more transformations that improve the word pronunciations.
This method, based on Eric Brill's part-of-speech tagging system, can achieve very high accuracy. See Eric Brill, Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging, PhD Thesis, The Johns Hopkins University, Department of Computer Science, 1995. Also in COMPUTATIONAL LINGUISTICS, December 1995. See Caroline B. Huang et al., Generation of Pronunciations from Orthographies Using Transformation-Based Error-Driven Learning, INTERNATIONAL CONFERENCE ON SPEECH AND LANGUAGE PROCESSING, pp 411-414, Yokohama, Japan, 1994. The main advantage of this approach is that it is completely automatic, and it needs only a little knowledge about the language (phoneme-grapheme mappings). The disadvantage is that it is hard to produce multiple pronunciations with it, and it is prone to overlearning, in which case it memorizes word pronunciations as opposed to extracting meaningful pronunciation rules.
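The learning loop of the transformation-based error-driven learner can be sketched on a toy scale. The sketch below rests on several simplifying assumptions that are not taken from the cited work: the training words are already aligned one phoneme per letter (with "-" for a silent letter), the initial guess is the most frequent phoneme of each letter, and only a single rule template is used (change a letter's phoneme when the next letter is a given one). The training data and all names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical aligned training set: one phoneme per letter, "-" = silent.
TRAIN = [
    ("cat", ["k", "ae", "t"]),
    ("cot", ["k", "aa", "t"]),
    ("cent", ["s", "eh", "n", "t"]),
    ("cell", ["s", "eh", "l", "-"]),
]

# (letter, from_phoneme, to_phoneme, next_letter); "$" is the word end.
Transform = Tuple[str, str, str, str]


def initial_guess(train) -> List[List[str]]:
    """Guess the most frequent phoneme for every letter."""
    counts: Dict[str, Counter] = {}
    for word, phones in train:
        for g, p in zip(word, phones):
            counts.setdefault(g, Counter())[p] += 1
    return [[counts[g].most_common(1)[0][0] for g in word] for word, _ in train]


def apply_transform(t: Transform, word: str, guess: List[str]) -> List[str]:
    letter, src, dst, nxt = t
    padded = word + "$"
    return [dst if (g == letter and p == src and padded[i + 1] == nxt) else p
            for i, (g, p) in enumerate(zip(word, guess))]


def count_errors(train, guesses) -> int:
    return sum(p != q for (_, truth), guess in zip(train, guesses)
               for p, q in zip(truth, guess))


def learn(train):
    guesses = initial_guess(train)
    rules: List[Transform] = []
    while True:
        # Propose one candidate transformation per remaining error.
        candidates = set()
        for (word, truth), guess in zip(train, guesses):
            padded = word + "$"
            for i, (g, p, q) in enumerate(zip(word, truth, guess)):
                if p != q:
                    candidates.add((g, q, p, padded[i + 1]))
        # Keep the candidate that leaves the fewest errors on the whole set.
        best, best_err, best_new = None, count_errors(train, guesses), guesses
        for t in sorted(candidates):
            new = [apply_transform(t, w, g) for (w, _), g in zip(train, guesses)]
            err = count_errors(train, new)
            if err < best_err:
                best, best_err, best_new = t, err, new
        if best is None:  # no transformation improves the guesses: stop
            return rules, guesses
        rules.append(best)
        guesses = best_new


if __name__ == "__main__":
    learned_rules, final_guesses = learn(TRAIN)
    print(learned_rules)
```

On this toy set the learner first discovers that ‘c’ before ‘e’ is pronounced (s) rather than (k), then that a word-final ‘l’ after another ‘l’ is silent. With so few words the second rule also illustrates the overlearning risk mentioned above, since it merely memorizes one training word.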
So far, we have not used any information, besides phonetic context, to produce word pronunciations. As one would expect, word morphology can have a large effect on word pronunciations, especially when it comes to stress. Predicting unstressed syllables is important for speech recognition, as these vowels tend to have very different characteristics than their stressed counterparts. The Spoken Language Systems Group at MIT proposed a system that uses word morphology information to generate word pronunciations with stress information. See Helen Meng et al., Reversible Letter-to-Sound/Sound-to-Letter Generation Based on Parsing Word Morphology, SPEECH COMMUNICATION, Vol. 18, No. 1, pp 47-63, North-Holland, 1996.
In their method, they generate a layered parse tree (shown in FIG. 2) to examine several linguistic layers of the words in the training set. Then, they examine various conditional probabilities along the tree, such as the probability that a symbol follows a column, etc. During pronunciation generation, they try to generate a parse tree for the word while maximizing its overall probability. The advantage of this method is that it generates very accurate pronunciations while also producing morphological structure. However, it needs the full morphological structure of words in the training set, which can be very expensive to provide when training for a new language. Also, this method cannot produce multiple pronunciations in its current form.
The overlapping chunks method tries to relate to human intelligence, as it mimics how people pronounce unseen words. The method uses multiple unbounded overlapping chunks to generate word pronunciations. Chunks are corresponding grapheme and phoneme sequences that are cut out from word pronunciations. For example, ‘anua’→(′ae n y .ah w .ah) is a chunk derived from ‘manual’→(m ′ae n y .ah w .ah l). In this method, first, all possible chunks are generated for the words in the knowledge base. When a new pronunciation is requested, these chunks are recombined in all possible ways to produce the requested word. During this process, chunks can overlap if the overlapping phonemes and graphemes are equivalent. After finding all possible recombinations, the best pronunciation candidate is selected. In general, candidates with fewer and longer chunks are favored, especially if those chunks are largely overlapping.
The advantage of this system is that it is language independent, and it can truly produce multiple pronunciations. Also, very little language-specific information is needed aside from the word pronunciations, although the words have to be aligned with their pronunciations. The main disadvantage of the system is that it requires a lot of run-time memory during pronunciation generation to speed up the recombination process. There are other theoretical deficiencies of this algorithm. See François Yvon, Grapheme-to-Phoneme Conversion Using Multiple Unbounded Overlapping Chunks, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON NEW METHODS IN LANGUAGE PROCESSING, No. 2, Ankara, Turkey, 1996. Also in xxx.lanl.gov/list/cmp-lg/9608#cmp-lg/9608006.
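The chunk-generation step of the overlapping chunks method can be illustrated with a short sketch. It covers only the extraction of chunks from one aligned word, not their recombination or scoring. The letter-by-letter alignment of ‘manual’ used below is an assumption made for the example (the text gives only the whole-word pronunciation), stress marks are omitted, and the function name `chunks` is hypothetical.

```python
from typing import List, Tuple

# Assumed alignment of 'manual' -> (m ae n y ah w ah l): each letter is
# paired with the phonemes it is taken to produce.
MANUAL: List[Tuple[str, str]] = [
    ("m", "m"), ("a", "ae"), ("n", "n"),
    ("u", "y ah w"), ("a", "ah"), ("l", "l"),
]


def chunks(aligned: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return every contiguous grapheme/phoneme chunk of an aligned word."""
    result = []
    n = len(aligned)
    for start in range(n):
        for end in range(start + 1, n + 1):
            piece = aligned[start:end]
            graphemes = "".join(g for g, _ in piece)
            phonemes = " ".join(p for _, p in piece if p)
            result.append((graphemes, phonemes))
    return result


if __name__ == "__main__":
    for graphemes, phonemes in chunks(MANUAL):
        print(f"'{graphemes}' -> ({phonemes})")
```

A word of n letters yields n(n+1)/2 chunks, so the six letters of ‘manual’ give 21 chunks, among them ‘anua’→(ae n y ah w ah). This quadratic growth per word is one reason the knowledge base of chunks demands substantial memory.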
Neural networks are used in many areas of automated learning, including the generation of word pronunciations. The most popular network type is the multilayer perceptron network (MLP), where the processing units are arranged in several layers, and only adjacent layers are connected. During processing, activations are propagated from the input units to the output units. See Joe Picone, Advances in Automatic Generation of Multiple Pronunciations for Proper Nouns, Technical Report, Institute for Signal and Information Processing, Mississippi State, Mississippi, USA, September 1997. There is one input unit at each grapheme slot (e.g., first letter, second letter, etc. of the word) for each possible grapheme, and similarly, there is one output unit at each phoneme slot (e.g., first phoneme, second phoneme, etc. of the pronunciation) for every phoneme of the language. During pronunciation generation, the input units at the appropriate graphemes of the word are activated. The pronunciation is composed of the phonemes whose output units hold the largest activation value at each phoneme slot. Neural networks are trained using an iterative backpropagation algorithm.
The advantages of this approach are that it is language independent, very little language-specific side information is needed, and it can produce multiple pronunciations. One can further improve this method by also assigning input units to phonetic features, so that it can make use of phonetic features. The disadvantage of this method is that the resulting neural network is large. Also, it is hard to capture the phonetic knowledge efficiently from the activation properties. For example, it is hard to separate the important information from the non-important information to balance the neural network size and performance.
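The input and output layout of such a network can be sketched without the network itself. The code below shows only how a word might activate the input units and how a pronunciation might be read off the output activations; the hidden layers and backpropagation training are omitted. The slot count, the lowercase English grapheme set, and the function names are assumptions made for the illustration.

```python
from typing import List

GRAPHEMES = "abcdefghijklmnopqrstuvwxyz"  # assumed grapheme inventory
MAX_LETTERS = 4                           # hypothetical number of grapheme slots


def encode(word: str) -> List[float]:
    """One input unit per (slot, grapheme) pair; the word's letters are set to 1."""
    units = [0.0] * (MAX_LETTERS * len(GRAPHEMES))
    for slot, letter in enumerate(word[:MAX_LETTERS]):
        units[slot * len(GRAPHEMES) + GRAPHEMES.index(letter)] = 1.0
    return units


def decode(activations: List[float], phonemes: List[str]) -> List[str]:
    """Pick the phoneme with the largest activation in each phoneme slot."""
    size = len(phonemes)
    result = []
    for start in range(0, len(activations), size):
        slot = activations[start:start + size]
        result.append(phonemes[slot.index(max(slot))])
    return result


if __name__ == "__main__":
    inputs = encode("word")
    print(len(inputs), int(sum(inputs)))  # number of input units, active units
    # Made-up output activations for two phoneme slots over three phonemes.
    print(decode([0.1, 0.9, 0.2, 0.7, 0.1, 0.3], ["w", "er", "d"]))
```

Even this toy layout needs 4 × 26 = 104 input units, which hints at why the resulting networks are large.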
An overview of the complexities and performances of the different approaches is shown in FIG. 1A with footnotes on FIG. 1B. Our goal was to find an automatic method that is fast, uses a small amount of space, and is effective at producing correct pronunciations for words.
In accordance with one embodiment of the present invention, a text-to-pronunciation system is provided that is able to generate pronunciations for arbitrary words, wherein the system extracts language-specific pronunciation information from a large training corpus (a set of word pronunciations) and is able to use that information to produce good pronunciations for words not in the training set.