Multilingual aspects are becoming increasingly important in automatic speech recognition systems. A speech recognition system of this kind comprises a speech recognition engine, which may for example comprise units for automatic language identification, on-line pronunciation modeling (text-to-phoneme) and multilingual acoustic modeling. The speech recognition engine operates on the assumption that the vocabulary items are given in textual form. First, the language identification module identifies the language based on the written representation of the vocabulary item. Once the language has been determined, an appropriate on-line text-to-phoneme modeling scheme is applied to obtain the phoneme sequence associated with the vocabulary item. The phoneme is the smallest unit that differentiates the pronunciation of one word from the pronunciation of another. Any vocabulary item in any language can be presented as a set of phonemes that correspond to the changes in the human speech production system.
The multilingual acoustic models are concatenated to construct a recognition model for each vocabulary item. Using these basic models the recognizer can, in principle, automatically cope with multilingual vocabulary items without any assistance from the user. Text-to-phoneme mapping has a key role in providing accurate phoneme sequences for the vocabulary items in both automatic speech recognition and text-to-speech. Usually neural network or decision tree approaches are used for the text-to-phoneme mapping. In solutions for language- and speaker-independent speech recognition, the decision tree based approach has provided the most accurate phoneme sequences. One example of a method for arranging a tree structure is presented in U.S. Pat. No. 6,411,957B1.
In the decision tree approach, the pronunciation of each letter in the alphabet of the language is modeled separately, and a separate decision tree is trained for each letter. To find the pronunciation of a word, the word is processed one letter at a time, and the pronunciation of the current letter is determined from the decision tree text-to-phoneme model of that letter.
An example of a decision tree is shown in FIG. 1. It is composed of nodes, which can be either internal nodes I or leaves L. A branch is a collection of nodes that are linked together from the root R to a leaf L. A node can be either a parent node or a child node. A parent node is a node from which the tree can be traversed further; in other words, it has a child node. A child node is a node that can be reached from a parent node. An internal node I can be both a parent and a child node, but a leaf is only a child node. Every node in the decision tree stores information. The stored information varies depending on the context of the decision tree.
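The terminology above can be illustrated with a minimal sketch. The names Node, is_leaf, is_parent and branches, as well as the five-node tree and its labels, are hypothetical and are not taken from FIG. 1; the sketch only assumes that each node stores some information and links to its child nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Node:
    """One node of a decision tree; `info` stands in for whatever the node stores."""
    info: str
    children: Dict[str, "Node"] = field(default_factory=dict)

    def is_leaf(self) -> bool:
        # A leaf has no child nodes, so it can only ever be a child node.
        return not self.children

    def is_parent(self) -> bool:
        # A parent is any node from which the tree can be traversed further.
        return bool(self.children)


def branches(root: Node) -> Iterator[List[Node]]:
    """Yield every branch: the chain of linked nodes from the root to one leaf."""
    if root.is_leaf():
        yield [root]
        return
    for child in root.children.values():
        for tail in branches(child):
            yield [root] + tail


# Hypothetical tree: root R, one internal node I and three leaves.
tree = Node("R", {
    "x": Node("I", {"p": Node("L1"), "q": Node("L2")}),
    "y": Node("L3"),
})

paths = [[node.info for node in branch] for branch in branches(tree)]
print(paths)  # [['R', 'I', 'L1'], ['R', 'I', 'L2'], ['R', 'L3']]
```

In this sketch the internal node I is both a child (of R) and a parent (of L1 and L2), whereas each leaf appears only at the end of a branch.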
In speech recognition systems the internal nodes I usually have information about a word being recognized and the pronunciation of the word. The pronunciations of the letters of the word can be specified by the phonemes (pi) in certain contexts. Context refers, for example, to the letters in the word to the right and to the left of the letter of interest. The type of context information can be specified by an attribute (ai) (also called attribute type), which indicates which context is considered when climbing in the decision tree. Climbing can be implemented with the help of an attribute value, which defines the branch into which the searching algorithm should proceed given the context information of the given letter.
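The relation between an attribute type and its attribute value can be sketched as follows. The function name context_letter and the string encoding of attribute types (a side, 'l' or 'r', followed by a distance, so that 'r1' means the first letter to the right) are assumptions made for illustration; the use of '_' for positions outside the word follows the grapheme epsilon used in the worked example further on.

```python
def context_letter(word: str, pos: int, attribute: str, pad: str = "_") -> str:
    """Return the attribute value for the letter at index `pos` of `word`.

    `attribute` is assumed to be a side ('l' or 'r') followed by a distance,
    e.g. 'r1' is the first letter to the right, 'l2' the second to the left.
    """
    side, distance = attribute[0], int(attribute[1:])
    if side not in ("l", "r") or distance < 1:
        raise ValueError(f"unsupported attribute type: {attribute!r}")
    index = pos + distance if side == "r" else pos - distance
    # Positions outside the word yield the padding symbol.
    return word[index] if 0 <= index < len(word) else pad


# Context of the first 'a' in 'ada' (index 0):
print(context_letter("ada", 0, "r1"))  # d
print(context_letter("ada", 0, "r2"))  # a
print(context_letter("ada", 0, "l1"))  # _
```

The value returned for a given attribute type is what selects the branch to follow at a node.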
The tree structure is climbed starting from the root node R. At each node the attribute type (ai) is examined and the corresponding information is taken to determine the context of the current letter. Based on this information, the branch that matches the context information is followed to the next node in the tree. The tree is climbed until a leaf node L is found or there is no matching attribute value in the tree for the current context.
A simplified example of decision tree based text-to-phoneme mapping is illustrated in FIG. 2. The decision tree in the figure is for the letter ‘a’, wherein the nodes represent the phonemes of the letter ‘a’. It should be noted that the illustration is simplified and does not include all the phonemes of the letter ‘a’. In the root node there is information about the attribute type, which is the first letter on the right, denoted by r1. For the two other internal nodes, the attribute types are the first letter on the left, denoted by l1, and the second letter on the right, denoted by r2. For the leaf nodes, no attribute types are assigned.
When searching for the pronunciation of the word ‘Ada’, the phoneme sequence for the word can be generated with the decision tree presented in the example and a decision tree for the letter ‘d’. In the example, the tree for the letter ‘d’ is composed of the root node only, and the phoneme assigned to the root node is the phoneme /d/.
When generating the phoneme sequence, the word is processed from left to right one letter at a time. The first letter is ‘a’, therefore the decision tree for the letter ‘a’ is considered first (see FIG. 2). The attribute r1 is attached to the root node. The next letter after ‘a’ is ‘d’, therefore we proceed to the branch after the root node that corresponds to the attribute value ‘d’. This node is an internal node to which the attribute r2 is attached. The second letter to the right is ‘a’, so we proceed to the corresponding branch, and further to the corresponding node, which is a leaf. The phoneme corresponding to the leaf is /el/. Therefore the first phoneme in the sequence is /el/.
The next letter in the example word is ‘d’. The decision tree for the letter ‘d’ is, as mentioned, composed of the root node, where the most frequent phoneme is /d/. Hence the second phoneme in the sequence is /d/.
The last letter in the word is ‘a’, and the decision tree for the letter ‘a’ is considered once again (see FIG. 2). The attribute attached to the root node is r1. Because this is the last letter in the word, the next letter to the right of the letter ‘a’ is the grapheme epsilon ‘_’. The tree is climbed along the corresponding branch to the node that is a leaf. The phoneme attached to the leaf node is /V/, which is the last phoneme in the sequence.
Finally the complete phoneme sequence for the word ‘Ada’ is /el/ /d/ /V/. The phoneme sequence for any word can be generated in a similar fashion after the decision trees have been trained for all the letters in the alphabet.
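The climbing procedure and the ‘Ada’ walkthrough can be reproduced with a short sketch. The names Node, climb, TREES and text_to_phonemes are hypothetical. Only the branches of FIG. 2 that the text walks through are encoded: the internal node with the first-letter-on-the-left attribute is omitted because its branches are not described, and the phoneme of an internal node is left as None where the text does not state it. Lower-casing the input word is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    phoneme: Optional[str] = None      # None where the text states no phoneme
    attribute: Optional[str] = None    # e.g. "r1"; None when no attribute is assigned
    children: Dict[str, "Node"] = field(default_factory=dict)


def context_letter(word: str, pos: int, attribute: str) -> str:
    """Letter at the given side ('l'/'r') and distance from `pos`, or '_' outside the word."""
    side, distance = attribute[0], int(attribute[1:])
    index = pos + distance if side == "r" else pos - distance
    return word[index] if 0 <= index < len(word) else "_"


def climb(root: Node, word: str, pos: int) -> Node:
    """Climb from the root until a leaf is found or no attribute value matches."""
    node = root
    while node.attribute is not None:
        child = node.children.get(context_letter(word, pos, node.attribute))
        if child is None:          # no matching attribute value: stop here
            break
        node = child
    return node


# One tree per letter; only the parts described in the text are present.
TREES = {
    "a": Node(attribute="r1", children={
        "d": Node(attribute="r2", children={"a": Node(phoneme="el")}),
        "_": Node(phoneme="V"),
    }),
    "d": Node(phoneme="d"),        # root node only
}


def text_to_phonemes(word: str) -> List[Optional[str]]:
    word = word.lower()            # assumption: trees are indexed by lower-case letters
    return [climb(TREES[letter], word, pos).phoneme
            for pos, letter in enumerate(word)]


print(text_to_phonemes("Ada"))     # ['el', 'd', 'V']
```

For a context that the sketched tree does not cover, for example the letter ‘a’ followed by ‘b’, climb stops at the last node reached, which here is the root.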
The decision tree training is done on a pronunciation dictionary that contains words and their pronunciations. The strength of the decision tree lies in its ability to learn a compact mapping from a training lexicon by using information-theoretic principles.
As stated above, the decision tree based implementations have provided the most accurate phoneme sequences, but the drawback is the large memory consumption when the decision tree solution is used for the text-to-phoneme mapping. The large memory consumption is due to the numerous pointers used in the linked list decision tree approach. The amount of memory required increases especially with languages such as English, where pronunciation irregularities occur frequently.
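The effect of the pointers on memory consumption can be illustrated with a rough calculation. The node layout below is purely hypothetical and is not taken from the text: it assumes one byte each for the attribute type, the attribute value and the phoneme, and two 32-bit links (first child and next sibling) per node in a linked list representation.

```python
import struct

# Hypothetical packed layouts ("=" selects standard sizes without padding).
PAYLOAD_FORMAT = "=BBB"      # attribute type, attribute value, phoneme
NODE_FORMAT = "=BBBII"       # payload plus first-child and next-sibling links

PAYLOAD_BYTES = struct.calcsize(PAYLOAD_FORMAT)
NODE_BYTES = struct.calcsize(NODE_FORMAT)
POINTER_BYTES = NODE_BYTES - PAYLOAD_BYTES


def pointer_overhead(node_count: int) -> int:
    """Bytes spent on links alone for a tree with `node_count` nodes."""
    return node_count * POINTER_BYTES


print(PAYLOAD_BYTES, NODE_BYTES, POINTER_BYTES)
print(f"share of each node used by links: {POINTER_BYTES / NODE_BYTES:.0%}")
```

Under these assumed field widths the links occupy more space than the information they connect, which is the kind of overhead the paragraph above refers to. Actual figures depend on the real node layout and pointer width.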
The prior art solutions to this problem can be categorized into lossy and lossless methods. When attempts are made to reduce the memory requirement of decision trees, lossy methods are mostly used. These approaches include, for example, grouping the attribute values of the decision trees, optimizing the stopping criterion of the decision tree training process, pruning the decision tree based on error counts, and other similar methods.
With the prior art low-memory decision tree methods, performance always decreases when the system is optimized for memory. There is always a trade-off between accuracy and memory consumption. In contrast, with the approach according to the invention, there is hardly any degradation in accuracy while the memory consumption is optimized. Memory requirements can be significantly reduced without degradation in performance.