The present application is a national stage application of and claims priority from PCT Application PCT/GB99/03812 filed Nov. 16, 1999 and published under PCT Article 21(2) in English.
This invention relates to speech recognition systems and, in particular, to network models, language models and search methods for use in such systems.
One technique which is widely used in speech recognition systems is based upon the representation of speech units using probabilistic models known as hidden Markov models (HMMs). An HMM consists of a set of states connected by transitions. The HMMs are used to model units in the speech recognition system which are usually individual speech sounds, referred to as phones. By using individual HMMs, models for complete words can be formed by connecting together individual phone models according to pronunciation rules for the language being recognised.
Given a segment of speech and a set of HMMs that may correspond to the speech segment, the likelihood that each set of HMMs corresponds to the speech segment can be calculated. If this likelihood is calculated for all words in a language vocabulary, then the most probable word can be chosen. One technique for doing this likelihood assessment employs the well-known Viterbi algorithm.
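By way of illustration only, the Viterbi computation referred to above can be sketched for a small discrete HMM as follows. The function name `viterbi`, the helper `logs`, the two-state model and its probabilities are all assumptions made for the example and are not taken from the present application. Log probabilities are used so that path scores can be added rather than multiplied.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (best log-likelihood, best state path) for an observation sequence."""
    # best[s] holds the log-likelihood of the best path ending in state s.
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # one back-pointer table per observation after the first
    for o in obs[1:]:
        prev, best, ptr = best, {}, {}
        for s in states:
            # Choose the predecessor state that maximises the path score into s.
            p = max(states, key=lambda q: prev[q] + log_trans[q][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][o]
            ptr[s] = p
        back.append(ptr)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return best[last], path

def logs(table):
    """Convert a table of (non-zero) probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical two-state model with observation symbols "x" and "y".
states = ["a", "b"]
log_start = logs({"a": 0.6, "b": 0.4})
log_trans = {"a": logs({"a": 0.7, "b": 0.3}), "b": logs({"a": 0.4, "b": 0.6})}
log_emit = {"a": logs({"x": 0.9, "y": 0.1}), "b": logs({"x": 0.2, "y": 0.8})}

score, path = viterbi(["x", "y", "y"], states, log_start, log_trans, log_emit)
print(path, score)
```

In a recogniser the same computation would be run against each candidate word model and the word whose best path scores highest would be chosen.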
One approach to tackling the above problem has been to form a network model in which each word in a vocabulary is represented by a path through the model. Since the composite model is also an HMM, the most likely path, and hence the word, can be computed using the Viterbi algorithm. Such a model for single word recognition can be extended to the case of sentences by allowing connections from the end of words to the start of other words. So that language model probabilities, which are based upon the likelihood of one word being adjacent to another, can also be considered, a probability is provided for each inter-word connection in such models.
Such network models can work well, but are often large and require considerable processing power if they are to be employed in real time. Furthermore, such models which only use a single HMM for each phone are often not particularly accurate. Accuracy can be improved by considering not only the identity of the phone the model represents but also the identity of the preceding and the following phone when determining the appropriate HMM parameters. Such an approach is often called a triphone approach. However, if the phonetic context is considered across word boundaries this approach increases the network complexity considerably. At word boundaries such a system requires that for each different cross-word boundary context a different HMM is used for the first and last phone of each word. This leads to a considerable increase in network size and hence high computational requirements on the system employing such a model.
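The effect of cross-word context can be illustrated with a short sketch. The function `triphones`, the `l-p+r` labelling convention, the `sil` boundary symbol and the phone symbols are assumptions chosen for the example rather than details of the present application.

```python
def triphones(phones, left="sil", right="sil"):
    """Label each phone with its left and right neighbours as 'l-p+r'."""
    padded = [left, *phones, right]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]

cat = ["k", "ae", "t"]  # hypothetical pronunciation of one word
print(triphones(cat, right="d"))  # next word begins with "d"
print(triphones(cat, right="s"))  # next word begins with "s"
```

The within-word model `k-ae+t` is the same in both cases, whereas the model for the last phone changes with the following word. A network that represents every such boundary context explicitly therefore needs a separate first-phone and last-phone model for each context.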
A number of approaches have been proposed in attempts to employ triphone models without excessive computational requirements. However, these approaches typically use approximate models and/or make multiple passes through a network, reducing accuracy and/or increasing processing time.
As mentioned above, speech recognition systems usually require the calculation of likelihoods which must be computed to compare individual word hypotheses and determine the most likely word. If such a system employs word context as an assessment criterion, this usually means that such likelihoods are composed of two parts, the acoustic model likelihood (dependent upon the detected sound) and a language model probability. This language model probability is normally determined from a reference language model which forms part of the speech recognition system, with the language model being accessed from the system network model as the network model is passed through during speech recognition. Given the large vocabulary and high complexity of typical languages, an accurate statistical model of general word sequences can be very large. The time taken to access this language model whilst carrying out the recognition process can be considerable, significantly affecting the system's ability to operate in real time and the overall data processing demands of the system.
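As a simple illustration of a two-part likelihood, the acoustic and language model contributions can be added in the log domain. In the sketch below the function `total_log_score`, the `lm_scale` weighting, the bigram table and all numerical values are assumptions made for the example.

```python
import math

def total_log_score(acoustic_loglik, lm_prob, lm_scale=1.0):
    """Combine an acoustic log-likelihood with a language model probability."""
    # lm_scale is a hypothetical weighting of the language model term.
    return acoustic_loglik + lm_scale * math.log(lm_prob)

# Hypothetical bigram probabilities P(word | previous word).
bigram = {("the", "cat"): 0.02, ("the", "cap"): 0.001}
# Hypothetical acoustic log-likelihoods for two competing words.
acoustic = {"cat": -12.0, "cap": -11.5}

scores = {w: total_log_score(acoustic[w], bigram[("the", w)]) for w in acoustic}
best = max(scores, key=scores.get)
print(best, scores)
```

Here "cap" has the slightly better acoustic score, but the language model term makes "cat" the preferred hypothesis after "the". Every such comparison requires a language model lookup, which is why the speed of access matters.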
The present invention seeks to provide a network model which can use such cross-word context dependent triphone HMMs yet which overcomes the above and other problems.
The present invention seeks to provide a language model structure which stores all the necessary language model data, yet which is capable of being accessed quickly and efficiently.
According to a first aspect of the present invention, we provide a language model structure for use in a speech recognition system employing a tree-structured network model, the language model being structured such that identifiers associated with each word and contained therein are arranged such that each node of the network model with which the language model is associated spans a continuous range of identifiers.
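One way in which such an arrangement of identifiers might be obtained is to number the words in depth-first order over a pronunciation prefix tree. The sketch below is illustrative only: the node layout, the functions `build_tree` and `assign_ids`, and the small lexicon are assumptions, not the structure actually claimed.

```python
def build_tree(lexicon):
    """Build a phone prefix tree; each node is {'children': {}, 'words': []}."""
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node["children"].setdefault(ph, {"children": {}, "words": []})
        node["words"].append(word)  # word ends at this node
    return root

def assign_ids(node, word_id, next_id=0):
    """Number words depth-first; store the inclusive id range on each node."""
    lo = next_id
    for w in sorted(node["words"]):
        word_id[w] = next_id
        next_id += 1
    for ph in sorted(node["children"]):
        next_id = assign_ids(node["children"][ph], word_id, next_id)
    node["range"] = (lo, next_id - 1)  # all words reachable from this node
    return next_id

# Hypothetical lexicon mapping words to phone sequences.
lexicon = {
    "cat": ["k", "ae", "t"], "cab": ["k", "ae", "b"], "can": ["k", "ae", "n"],
    "dog": ["d", "ao", "g"], "do": ["d", "uw"],
}
root = build_tree(lexicon)
word_id = {}
assign_ids(root, word_id)
print(word_id)
print(root["children"]["k"]["range"])
```

Because every node spans one unbroken range, language model data stored in identifier order can be addressed for a whole node with a pair of bounds rather than a list of individual words.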
According to a second aspect of the present invention, we provide a tree-structured network for use in a speech recognition system, the tree-structured network comprising:
a first tree-structured section representing the first phone of each word having two or more phones;
a second tree-structured section representing within-word phones, wherein within-word phones include any phone between the first phone and the last phone of a word;
a third tree-structured section representing the last or only phone of each word;
a fourth tree-structured section representing inter-word silences; and,
a number of null nodes for joining each tree-structured section to the following tree-structured section.
Each tree-structured section is joined to the next by a set of null nodes. These reduce the total number of links required and, in the layer before the final phone model of each word, also mark the point at which the token history is updated to indicate the recognised word.
According to a third aspect of the present invention, we provide a method of transferring tokens through a tree-structured network in a speech recognition process, each token including a likelihood which indicates the probability of a respective path through the network representing a respective word to be recognised, and wherein each token further includes a history of previously recognised words, the method comprising:
i) combining tokens at each state of the network to form a set of tokens, the set including a main token having the highest likelihood and one or more relative tokens;
ii) for each set of tokens, merging tokens having the same history;
iii) transferring the set of tokens to subsequent nodes in the network;
iv) updating the likelihood of at least the main token of each set of tokens; and,
v) repeating steps i) to iv) at each respective node.
Thus the present invention allows the tokens to be combined and then handled as sets of tokens. This helps reduce the amount of processing required to transfer the tokens through the tree-structured network.
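Steps i) to iv) can be sketched as follows. The sketch assumes, for illustration, that the relative tokens are stored as log-likelihood offsets from the main token, so that only the main likelihood needs updating when the set is transferred. That representation, together with the names `TokenSet`, `combine` and `propagate`, is an assumption and is not stated in the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TokenSet:
    main_loglik: float      # log-likelihood of the best (main) token
    main_history: tuple     # word history of the main token
    relatives: tuple = ()   # ((offset <= 0, history), ...) relative to main

def combine(token_sets):
    """Steps i) and ii): pool the tokens of several sets, merging equal histories."""
    best = {}
    for ts in token_sets:
        entries = [(0.0, ts.main_history), *ts.relatives]
        for offset, hist in entries:
            ll = ts.main_loglik + offset
            # Tokens with the same history are merged by keeping the best one.
            if hist not in best or ll > best[hist]:
                best[hist] = ll
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    (main_hist, main_ll), rest = ranked[0], ranked[1:]
    return TokenSet(main_ll, main_hist, tuple((ll - main_ll, h) for h, ll in rest))

def propagate(ts, log_score):
    """Steps iii) and iv): transfer the set, updating only the main likelihood."""
    return TokenSet(ts.main_loglik + log_score, ts.main_history, ts.relatives)

# Two hypothetical token sets arriving at the same state.
a = TokenSet(-10.0, ("the",), ((-2.0, ("a",)),))
b = TokenSet(-11.0, ("a",), ((-3.0, ("the",)),))
merged = combine([a, b])
moved = propagate(merged, -0.5)
print(merged)
print(moved)
```

In this representation, moving a set to the next node touches a single number however many tokens the set contains.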
According to a fourth aspect of the present invention, we provide a method of merging sets of tokens in a speech recognition process, each token including a likelihood which indicates the probability of a respective path through the network representing a respective word to be recognised, and wherein each token further includes a history of previously recognised words, the method comprising:
i) assigning an identifier to each set of tokens, the identifier representing the word histories of each of the tokens in the set of tokens;
ii) comparing the identifiers of different sets of tokens; and,
iii) merging sets of tokens having the same identifiers.
The present invention allows identifiers to be assigned to sets of tokens, based on the histories of the tokens within the set. This allows different sets of tokens to be compared without requiring the comparison of the history of each token within each set, thereby reducing the level of computation required.
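A minimal sketch of this identifier scheme is given below. It assumes that each set of tokens is held as a mapping from word history to log-likelihood and that an identifier is a small integer issued once per distinct collection of histories. The class `HistoryInterner`, the functions `make_set` and `merge_sets`, and the rule of keeping the better likelihood for each history are assumptions made for the example.

```python
class HistoryInterner:
    """Issue one small integer per distinct collection of word histories."""
    def __init__(self):
        self._ids = {}

    def identifier(self, histories):
        key = frozenset(histories)
        return self._ids.setdefault(key, len(self._ids))

def make_set(interner, tokens):
    """Step i): attach an identifier when the set is formed."""
    return interner.identifier(tokens), dict(tokens)

def merge_sets(token_sets):
    """Steps ii) and iii): compare identifiers only; merge sets that match."""
    merged = {}
    for ident, tokens in token_sets:
        if ident not in merged:
            merged[ident] = dict(tokens)
        else:
            target = merged[ident]
            # Same identifier means same histories, so merge token by token.
            for hist, ll in tokens.items():
                target[hist] = max(target[hist], ll)
    return merged

interner = HistoryInterner()
s1 = make_set(interner, {("the", "cat"): -5.0, ("a", "cat"): -7.0})
s2 = make_set(interner, {("a", "cat"): -6.0, ("the", "cat"): -9.0})
s3 = make_set(interner, {("the", "dog"): -4.0})
result = merge_sets([s1, s2, s3])
print(result)
```

The histories are examined once, when the identifier is issued. Each later comparison between sets is then a single integer comparison.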