The present invention relates to speech recognition systems, and more particularly to the recognition models they use.
State-of-the-art speech recognition systems use context-dependent sub-word models to represent the system vocabulary. These models represent phones in the context of other phones, so as to capture the effects of coarticulation between adjacent phones in spoken language. Dealing with coarticulation effectively is crucial: systems that do not do so, and rely instead on context-independent models, grossly underperform systems with context-dependent models.
One type of model frequently used to deal with coarticulation is the triphone. A triphone model for a particular phone P will be conditioned on both the preceding and following phonetic context. For example, A−P+B would be the triphone model for phone P with the left context of phone A and the right context of phone B. It is effectively impossible to train and use triphones involving all phone combinations. Nevertheless, the repertory of such units used in typical speech recognition systems is large.
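By way of illustration only, the left−center+right labelling described above can be sketched in Python. The function names, the use of an ASCII hyphen in place of the minus sign, and the idea of counting every ordered triple of phones are assumptions made for this sketch and are not taken from any particular recognizer.

```python
from typing import List


def triphone(left: str, center: str, right: str) -> str:
    """Format one triphone label as left-center+right (ASCII hyphen)."""
    return f"{left}-{center}+{right}"


def interior_triphones(phones: List[str]) -> List[str]:
    """Label every phone that has both a left and a right neighbour.

    Edge phones are skipped here because their outer context depends on
    the neighbouring word, which this sketch does not know.
    """
    return [
        triphone(phones[i - 1], phones[i], phones[i + 1])
        for i in range(1, len(phones) - 1)
    ]


def possible_triphones(inventory_size: int) -> int:
    """Upper bound on distinct triphones: every ordered triple of phones."""
    return inventory_size ** 3


print(interior_triphones(["A", "P", "B"]))
print(possible_triphones(50))  # 50 is a hypothetical inventory size
```

With a hypothetical inventory of 50 phones the upper bound is 125,000 distinct triphones, which gives a sense of why training a model for every combination is impractical.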
A particular problem arises in the recognition of continuous speech, where words are spoken without pauses between them. Coarticulation effects cross word boundaries, and to maximize system performance, models should be utilized which reflect the effect that the phones in a preceding word have on the phones in the following word and vice versa.
Such “cross-word-boundary” units have a significant effect on the computational load of a continuous speech recognition system. In principle, in a dictation system, each vocabulary word must be able to connect to every other vocabulary word. Thus, at the end of each hypothesized word, the system must consider all the words in the vocabulary as potential successors, and must connect the current word to all these potential followers using the appropriate connecting units. Inter-word connections present a particularly serious computational challenge to large vocabulary continuous speech recognition (LVCSR) systems because, at this point in the extension of hypotheses, little acoustic information about the identity of the following word is available. It is therefore difficult to apply the aggressive thresholding and pruning schemes that are typically used to limit the overall computation within words.
Consider the following example involving within-word and cross-word triphone models. To connect a word which ends with the phones B and C to a following word which begins with the phones D and E:
. . . A B C → D E F . . . ,
means that the last phone model of the first word and the first phone model of the second word have to be cross-word triphones:
. . . A−B+C B−C+#D → C#−D+E D−E+F . . . ,
where # denotes a word boundary. Thus, the last triphone of the first word, and the first triphone of the second word depend on the second and first words respectively. The full set of connecting units for a given vocabulary word can be expressed as follows:
1.) A first set of cross-word triphones connecting the given word to all possible following phonetic contexts, of which there are P (B−C+#D in the above example).
2.) For each of these units there is a further set connecting the last phone of the first word to all the valid pairs of the first two phones of following words in the vocabulary, of which there are p (C#−D+E in the above example).
Thus, in a full triphone model system, each vocabulary word requires P(1+p) segments to connect it to all following vocabulary words. In a typical system with a vocabulary of several tens of thousands of words, P may be on the order of 50 and p on the order of 15, resulting in an average of about 800 connecting units (50 × 16) requiring activation for each vocabulary word.
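The boundary example and the P(1+p) count above can be reproduced with a short Python sketch. The function names are hypothetical, the labels use an ASCII hyphen in place of the minus sign, and the sketch assumes every word has at least two phones; it is meant only to make the arithmetic concrete.

```python
from typing import List, Tuple


def boundary_triphones(first: List[str], second: List[str]) -> Tuple[str, str]:
    """Return the two cross-word triphones joining `first` to `second`.

    '#' marks the word boundary, as in the example in the text.
    """
    if len(first) < 2 or len(second) < 2:
        raise ValueError("this sketch assumes at least two phones per word")
    # Last phone of the first word, with the next word's first phone as right context.
    last_unit = f"{first[-2]}-{first[-1]}+#{second[0]}"
    # First phone of the second word, with the previous word's last phone as left context.
    first_unit = f"{first[-1]}#-{second[0]}+{second[1]}"
    return last_unit, first_unit


def connecting_units_per_word(P: int, p: int) -> int:
    """P units for the word's last phone, plus p follow-on units for each of them."""
    return P * (1 + p)


print(boundary_triphones(["A", "B", "C"], ["D", "E", "F"]))
print(connecting_units_per_word(50, 15))
```

For the figures quoted in the text (P of about 50, p of about 15), the second call evaluates 50 × 16 = 800.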
A preferred embodiment of the present invention provides a speech recognition system for recognizing an input utterance of spoken words. The system includes a set of word models for modeling vocabulary to be recognized, each word model being associated with a word in the vocabulary, each word in the vocabulary considered as a sequence of phones including a first phone and a last phone, wherein each word model begins in the middle of the first phone of its associated word and ends in the middle of the last phone of its associated word; a set of word connecting models for modeling acoustic transitions between the middle of a word's last phone and the middle of an immediately succeeding word's first phone; and a recognition engine for processing the input utterance in relation to the set of word models and the set of word connecting models to cause recognition of the input utterance.
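One way to picture the arrangement is the following Python sketch, which enumerates the word connecting models a small lexicon might need. It assumes, purely for illustration, that a connecting model can be identified by the pair (last phone of the preceding word, first phone of the following word); the text does not state how connecting models are keyed, and the lexicon and function name are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def connector_keys(lexicon: Dict[str, List[str]]) -> Set[Tuple[str, str]]:
    """Collect (last phone, first phone) pairs that word transitions may require.

    Assumes every pronunciation in the lexicon has at least one phone.
    """
    last_phones = {phones[-1] for phones in lexicon.values()}
    first_phones = {phones[0] for phones in lexicon.values()}
    return {(last, first) for last in last_phones for first in first_phones}


# Hypothetical three-word lexicon with made-up phone symbols.
lexicon = {
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
    "tack": ["t", "ae", "k"],
}

keys = connector_keys(lexicon)
print(sorted(keys))
print(len(keys), "connectors for", len(lexicon) ** 2, "ordered word pairs")
```

Under that assumption the number of connecting models is bounded by the phone inventory rather than by the vocabulary size, because words sharing a final or initial phone can share connectors.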
In a further embodiment, each word model uses context-dependent phone models, e.g., triphones, to represent the sequence of phones. The acoustic transitions modeled may include a pause, a period of silence, or a period of noise. Each word connecting model may further include a previous word identification field which represents the word associated with the word model immediately preceding the word connecting model, an ending score field which represents a best score from the beginning of the input utterance to reach the word connecting model, or a type field which represents specific details of the word connecting model.
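A minimal Python sketch of a record holding the three optional fields named above follows. The class and member names, the enumeration of connector types (taken here from the pause, silence, and noise transitions mentioned above), and the convention that a larger score is better are all assumptions made for illustration; the text specifies only what each field represents.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ConnectorType(Enum):
    """Hypothetical values for the type field."""
    DIRECT = "direct"
    PAUSE = "pause"
    SILENCE = "silence"
    NOISE = "noise"


@dataclass
class WordConnectingModel:
    left_phone: str                          # last phone of the preceding word
    right_phone: str                         # first phone of the following word
    kind: ConnectorType = ConnectorType.DIRECT
    previous_word_id: Optional[int] = None   # previous word identification field
    ending_score: float = float("-inf")      # best score so far (assumed: larger is better)

    def offer(self, word_id: int, score: float) -> bool:
        """Record `word_id` as the predecessor if `score` beats the stored best."""
        if score > self.ending_score:
            self.previous_word_id = word_id
            self.ending_score = score
            return True
        return False


connector = WordConnectingModel("C", "D")
connector.offer(word_id=7, score=-12.5)
connector.offer(word_id=9, score=-20.0)   # worse score, ignored
print(connector.previous_word_id, connector.ending_score)
```

Keeping only the best-scoring predecessor in each connector is one common way to allow the recognized word sequence to be traced back at the end of the utterance; whether the described system does exactly this is not stated.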
A preferred embodiment also includes a method of a speech recognition system for recognizing an input utterance of spoken words. The method includes modeling vocabulary to be recognized with a set of word models, each word model being associated with a word in the vocabulary, each word in the vocabulary being considered as a sequence of phones including a first phone and a last phone, wherein each word model begins in the middle of the first phone of its associated word and ends in the middle of the last phone of its associated word; modeling acoustic transitions between the middle of a word's last phone and the middle of an immediately succeeding word's first phone with a set of word connecting models; and processing with a recognition engine the input utterance in relation to the set of word models and the set of word connecting models to cause recognition of the input utterance.
In a further embodiment, each word model uses context-dependent phone models, e.g., triphones, to represent the sequence of phones. The acoustic transitions may further include a pause, a period of silence, or a period of noise. Each word connecting model may further include a previous word identification field which represents the word associated with the word model immediately preceding the word connecting model, an ending score field which represents a best score from the beginning of the input utterance to reach the word connecting model, or a type field which represents specific details of the word connecting model.