A. Field of the Invention
The present invention relates generally to speech recognition and, more particularly, to speech recognition models.
B. Description of Related Art
Speech has not traditionally been valued as an archival information source. As effective as the spoken word is for communicating, archiving spoken segments in a useful and easily retrievable manner has long been a difficult proposition. Although the act of recording audio is not difficult, automatically transcribing and indexing speech in an intelligent and useful manner can be difficult.
Speech recognition systems are generally based on statistical models. The models are trained on a speech signal and a corresponding transcription of the speech signal. The models “learn” how the speech signal corresponds to the transcription. Conventional models are frequently implemented based on Hidden Markov Models (HMMs).
FIG. 1 is a diagram illustrating a conventional speech recognition system. A content transcription component 102 receives an input audio stream. The content transcription component 102 converts speech in the input audio stream into text based on language and acoustic model(s) 101. Model(s) 101 are pre-trained based on a training audio stream that is expected to be similar to the run-time version of the input audio stream.
FIG. 2 is a diagram illustrating training of models 101 in additional detail. When training, models 101 receive the input audio stream 210 and a corresponding transcription 211 of the input audio stream. Transcription 211 may be meticulously generated by a human based on the input audio stream 210. Transcription 211 may be converted into a stream of phonemes 213 by system dictionary 212. System dictionary 212 includes correspondences between the written orthographic representation of a word and the phonemes that correspond to the word. A phoneme is generally defined as the smallest acoustic event that distinguishes one word from another.
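As an illustration of the dictionary lookup described above, the following minimal sketch converts a transcription into a stream of phonemes. The dictionary entries, the phoneme symbols, and the function name are hypothetical and are not taken from any particular system dictionary.

```python
# Hypothetical toy dictionary: each written word maps to its phonemes.
# The entries and phoneme symbols are illustrative assumptions only.
SYSTEM_DICTIONARY = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}


def transcription_to_phonemes(transcription, dictionary):
    """Convert a transcription (text) into a flat list of phonemes."""
    phonemes = []
    for word in transcription.lower().split():
        if word not in dictionary:
            # A practical system would need some fallback for unknown words.
            raise KeyError(f"word not in dictionary: {word!r}")
        phonemes.extend(dictionary[word])
    return phonemes


phoneme_stream = transcription_to_phonemes("The cat sat", SYSTEM_DICTIONARY)
```

In this sketch, a word that is missing from the dictionary raises an error rather than being silently skipped, so that gaps in the dictionary are visible during training.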
Models 101 may model each individual phoneme in the audio stream in the context of its surrounding phonemes. Thus, model 101 may model a target sound using the phonemes that immediately precede and succeed the target sound. A more complex model may use two preceding and two succeeding phonemes. In English, there are approximately 50 phonemes. Accordingly, for a model using one phoneme on each side of the target phoneme, there are approximately 125,000 (50^3) possible phoneme groups to model. For a model using two phonemes on each side, there are approximately 312,500,000 (50^5) possible phoneme groups. Because of the large number of phoneme groups, individually modeling each of the phoneme groups could require impractically complex models. Additionally, a training set that fully covered each possible phoneme group would be inordinately large.
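The growth in the number of phoneme groups can be checked with a short calculation. The sketch below assumes an inventory of exactly 50 phonemes and treats every ordered combination as a possible group; the function names are hypothetical. It also shows one way a phoneme stream might be split into context groups.

```python
def phoneme_group_count(num_phonemes, context_per_side):
    """Number of possible groups: target phoneme plus context on each side."""
    window = 2 * context_per_side + 1
    return num_phonemes ** window


def context_groups(phonemes, context_per_side):
    """Slide a window over the stream, yielding one group per full window."""
    window = 2 * context_per_side + 1
    return [
        tuple(phonemes[i:i + window])
        for i in range(len(phonemes) - window + 1)
    ]


one_each_side = phoneme_group_count(50, 1)   # 50**3 = 125,000
two_each_side = phoneme_group_count(50, 2)   # 50**5 = 312,500,000
groups = context_groups(["K", "AE", "T", "S"], 1)
```

Only groups with a full window are produced here; a practical system would likely pad the start and end of an utterance, for example with a silence symbol.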
Accordingly, conventional speech recognition models may be based on phoneme clusters in which each phoneme cluster corresponds to a number of individual phoneme groups. One phoneme cluster, for example, may correspond to all phoneme groups in which the middle sound is a hard “C” sound. A model may use only 2,000-5,000 phoneme clusters instead of the 125,000 or more possible phoneme groups. The reduced number of phoneme clusters relative to the original phoneme groups allows for less complex speech recognition models and smaller training data sets.
One problem associated with phoneme clustering is determining which phoneme groups to assign to which phoneme clusters. One conventional technique for mapping groups to clusters is based on a statistical analysis of the training data. A problem with this technique is that it does not generalize well to phoneme groups that are not present in the training data.
A second conventional technique for mapping groups to clusters is based on a clustering decision tree. A decision tree uses a predetermined series of questions to categorize a phoneme group into one of the phoneme clusters. For example, a decision tree may first determine whether the left-most phoneme in a group corresponds to a vowel sound. If so, a second question, such as whether the right-most phoneme corresponds to a fricative sound, is asked. If not, a different second question, such as whether the left-most phoneme corresponds to a silence, is asked. Eventually, the answer to a question will lead to a node of the decision tree that assigns the phoneme grouping to a cluster. In this manner, each phoneme group is assigned to a cluster based on a sequence of questions where the particular sequence is determined by nodes in the clustering tree.
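The question sequence described above can be sketched as a small binary tree. The following is a minimal illustration only: the phoneme classes, the cluster numbers, and the names `Node` and `assign_cluster` are assumptions made for the example, and a real clustering tree would contain many more questions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Group = Tuple[str, str, str]  # (left phoneme, target phoneme, right phoneme)

# Hypothetical phoneme classes, for illustration only.
VOWELS = {"AE", "AH", "IY"}
FRICATIVES = {"S", "F", "Z"}
SILENCE = "SIL"


@dataclass
class Node:
    # Internal nodes carry a question; leaf nodes carry a cluster number.
    question: Optional[Callable[[Group], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    cluster: Optional[int] = None


def assign_cluster(node, group):
    """Follow yes/no answers down the tree until a leaf assigns a cluster."""
    while node.cluster is None:
        node = node.yes if node.question(group) else node.no
    return node.cluster


# Root: is the left-most phoneme a vowel?
#   yes -> is the right-most phoneme a fricative?
#   no  -> is the left-most phoneme a silence?
TREE = Node(
    question=lambda g: g[0] in VOWELS,
    yes=Node(
        question=lambda g: g[2] in FRICATIVES,
        yes=Node(cluster=0),
        no=Node(cluster=1),
    ),
    no=Node(
        question=lambda g: g[0] == SILENCE,
        yes=Node(cluster=2),
        no=Node(cluster=3),
    ),
)

cluster_id = assign_cluster(TREE, ("AE", "K", "S"))
```

Because the questions test phoneme classes rather than specific groups, a group that never appeared in training still reaches a leaf and receives a cluster.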
In contrast to mapping groups to clusters based on a statistical analysis of the training data, classifying phoneme groups based on a decision tree tends to lead to more accurate clustering of phoneme groups that are not in the training data. Phoneme groups that were not in the training data but that are similar to groups in the training data will tend to follow the same path down the clustering tree as the similar phoneme groups. Accordingly, the clustering tree generalizes well when dealing with new phoneme groups.
A problem associated with clustering trees is that the questions in the trees and the topology of the trees are manually designed by a speech expert. A tree can be complex and is often designed for a specific speech recognition model. Speech recognition systems often include multiple models and, therefore, require the separate creation and maintenance of multiple clustering trees. This can be costly, and the independent nature of the clustering trees and the models can lead to non-optimal overall results.
Thus, there is a need in the art for improved clustering trees.