The present invention relates to letter-to-sound conversion systems. In particular, the present invention relates to generating graphonemes used in letter-to-sound conversion.
In letter-to-sound conversion, a sequence of letters is converted into a sequence of phones that represent the pronunciation of the sequence of letters.
In recent years, an n-gram based system has been used for letter-to-speech conversion. The n-gram system utilizes “graphonemes” which are joint units representing both letters and the phonetic pronunciation of those letters. In each graphoneme, there can be zero or more letters in the letter part of the graphoneme and zero or more phones in the phoneme part of the graphoneme. In general, the graphoneme is denoted as l*:p*, where l* means zero or more letters and p* means zero or more phones. For example, “tion:sh&ax&n” represents a graphoneme unit with four letters (tion) and three phones (sh, ax, n). The delimiter “&” is added between phones because phone names can be longer than one character.
The graphoneme n-gram model is trained based on a dictionary that has spelling entries for words and phoneme pronunciations for each word. This dictionary is called the training dictionary. If the letter to phone mapping in the training dictionary is given, the training dictionary can be converted into a dictionary of graphoneme pronunciations. For example, assume
phone ph:f o:ow n:n e:# is given somehow. The graphoneme definitions for each word are then used to estimate the likelihood of sequences of “n” graphonemes. For example, in a graphoneme trigram, the probability of sequences of three graphonemes, Pr(g3|g1g2), are estimated from the training dictionary with graphoneme pronunciations.
Under many systems of the prior art that use graphonemes, when a new word is provided to the letter-to-sound conversion system, a best first search algorithm is used to find the best or n-best pronunciations based on the n-gram scores. To perform this search, one begins with a root node that contains the beginning symbol of the graphoneme n-gram model, typically denoted by <s>. <s> indicates the beginning of a sequence of graphonemes. The score (log probability) associated with the root node is log(Pr(<s>)=1)=0. In addition, each node in the search tree keeps track of the letter location in the input word. Let's call it the “input position”. The input position of <s> is 0 since no letter in the input word is used yet. To sum up, a node in the search tree contains the following information for the best-first search:
struct node {  int score, input_position;  node *parent;  int graphoneme_id;};
Meanwhile a heap structure is maintained in which the highest scoring of search nodes is found at the top of the heap. Initially there is only one element in the heap. This element points to the root node of the search tree. At any iteration of the search, the top element of the heap is removed, which gives us the best node so far in the search tree. One then extends child nodes from this best node by looking up the graphoneme inventory those graphonemes whose letter parts are a prefix of the left-over letters in the input word starting from the input position of the best node. Each such graphoneme generates a child node of the current best node. The score of a child node is the score of the parent node (i.e. the current best node), plus the n-gram graphoneme score to the child node. The input position of the child node is advanced to be the input position of the parent node plus the length of the letter part of the associated graphoneme in the child node. Finally the child node is inserted into the heap.
Special attention has to be paid when all the input letters are consumed. If the input position of the current best node has reached the end of the input word, a transition to the end symbol of the n-gram model, </s>, is added to the search tree and the heap.
If the best node removed from the heap contains </s> as its graphoneme id, a phonetic pronunciation corresponding to the complete spelling of the input word has been obtained. To identify the pronunciation, the path from the last best node </s> all the way back to the root node <s> is traced and the phoneme parts of the graphoneme units along that path are output.
The first best node with </s> is the best pronunciation according to the graphoneme n-gram model, as the rest of the search nodes have scores that are worse than this score already and future paths to </s> from any of the rest of search nodes are going to make the scores only worse (because log(probability) <0). If elements continue to be removed from the heap, the 2nd best, 3rd best, etc. pronunciations can be identified until either there are no more elements in the heap or the n-th best pronunciation is worse than the top 1 pronunciation by a threshold. The n-best search then stops.
There are several ways to train the n-gram graphoneme model, such as maximum likelihood, maximum entropy, etc. The graphonemes themselves can also be generated in different ways. For example, some prior art uses hidden Markov models to generate initial alignments between letters and phonemes of the training dictionary, followed by merging of frequent pairs of these l:p graphonemes into larger graphoneme units. Alternatively a graphoneme inventory can also be generated by a linguist who associates certain letter sequences with particular phone sequences. This takes a considerable amount of time and is error-prone and somewhat arbitrary because the linguist does not use a rigorous technique when grouping letters and phones into graphonemes.