The present invention relates generally to automated speech recognition. More particularly, the invention relates to a re-estimation technique for acoustic models used in automated speech recognition systems.
Speech recognition systems that handle medium-sized and large vocabularies usually take as their basic units phonemes or syllables, or phoneme sequences within a specified acoustic context. Such units are typically called context-dependent acoustic models or allophone models. An allophone is a specialized version of a phoneme defined by its context. For instance, all the instances of ‘ae’ pronounced before ‘t’, as in “bat,” “fat,” etc., define an allophone of ‘ae’.
For most languages, the acoustic realization of a phoneme depends very strongly on the preceding and following phonemes. For instance, an ‘eh’ preceded by a ‘y’ (as in “yes”) is quite different from an ‘eh’ preceded by ‘s’ (as in “set”).
For a variety of reasons, it can be beneficial to separate or subdivide the acoustic models into separate speaker dependent and speaker independent parts. Doing so allows the recognition system to be quickly adapted to a new speaker by using the speaker dependent part of the acoustic model as a centroid to which transformations corresponding to the speaker independent part may be applied. In our copending application entitled “Context-Dependent Acoustic Models For Medium And Large Vocabulary Speech Recognition With Eigenvoice Training,” Ser. No. 09/450,392, filed Nov. 29, 1999, we described a technique for developing context dependent models for automatic speech recognition in which an eigenspace is generated to represent a training speaker population and a set of acoustic parameters for at least one training speaker is then represented in that eigenspace. The representation in eigenspace comprises a centroid, associated with the speaker dependent components of the speech model, and transformations, associated with the speaker independent components of the model. When adapting the speech model to a new speaker, the new speaker's centroid within the eigenspace is determined, and the transformations associated with that new centroid may then be applied to generate the adapted model.
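The adaptation procedure just described can be sketched in code. The following is a minimal, hypothetical illustration (not the claimed implementation): the new speaker's centroid is located by a least-squares projection of observed acoustic parameters into the eigenspace, and the speaker independent transformations, here simple translation vectors, are then applied to that centroid. All names, dimensions, and values are illustrative assumptions.

```python
# Hypothetical sketch of eigenspace-based speaker adaptation.
# The eigenspace basis, dimensions, and offsets below are toy values.
import numpy as np

def locate_centroid(eigenvectors, mean, observation):
    """Estimate eigenvoice weights for the new speaker by least squares,
    then reconstruct the speaker dependent centroid from those weights."""
    # eigenvectors: (k, d) basis of the eigenspace; mean: (d,) average voice.
    weights, *_ = np.linalg.lstsq(eigenvectors.T, observation - mean, rcond=None)
    return mean + eigenvectors.T @ weights

def adapt(centroid, allophone_offsets):
    """Apply the speaker independent transformations (here, translation
    vectors) to the centroid to obtain context-dependent allophone models."""
    return {ctx: centroid + delta for ctx, delta in allophone_offsets.items()}

# Toy example: 2 eigenvoices in a 3-dimensional parameter space.
mean = np.zeros(3)
E = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])            # eigenspace basis (k=2, d=3)
obs = np.array([2.0, 3.0, 0.5])            # new speaker's observed parameters
centroid = locate_centroid(E, mean, obs)   # unmodelled third dimension -> 0
models = adapt(centroid, {"ae+t": np.array([0.1, 0.0, 0.0])})
```

Note that only the low-dimensional weight vector must be estimated from the new speaker's data; the allophone offsets are reused unchanged, which is what makes the adaptation fast.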
The technique of separating the variability into speaker dependent and speaker independent parts enables rapid adaptation because typically the speaker dependent centroid contains fewer parameters and is thus quickly relocated in the eigenspace without extensive computation. The speaker independent transformations typically contain far more parameters (corresponding to the numerous different allophone contexts). Because these speaker independent transformations may be readily applied once the new centroid is located, very little computational effort is expended.
While the foregoing technique of separating speaker variability into constituent speaker dependent and speaker independent parts shows much promise, we have more recently discovered a re-estimation technique that greatly improves performance of the aforesaid method. According to the present invention, a set of maximum likelihood re-estimation formulas may be applied: (a) to the eigenspace, (b) to the centroid vector for each training speaker and (c) to the speaker-independent part of the speech model. The re-estimation procedure can be applied once or iteratively. The result is a speech recognition model (employing the eigenspace, centroid and transformation components) that is well tuned to separate the speaker dependent and speaker independent parts. As will be more fully described below, each re-estimation formula augments the others: one formula provides feedback to the next. Also, as more fully explained below, the re-estimation technique may be used at adaptation time to estimate the location of a new speaker, regardless of what technique is used in constructing the original eigenspace at training time.
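The alternating structure of such a re-estimation procedure, where each quantity is updated using the latest values of the others, can be illustrated with a simplified least-squares stand-in for the maximum likelihood formulas. This hypothetical sketch alternately re-estimates per-speaker centroids and shared speaker independent offsets from observed allophone means; it is not the invention's actual formulas, merely an illustration of the coordinate-wise feedback pattern.

```python
# Illustrative coordinate-ascent sketch of alternating re-estimation.
# obs[s, c] holds an observed mean vector for speaker s in allophone
# context c; we alternately fit speaker centroids and shared offsets.
import numpy as np

def reestimate(obs, iters=5):
    S, C, d = obs.shape
    centroids = np.zeros((S, d))   # speaker dependent part
    offsets = np.zeros((C, d))     # speaker independent part
    for _ in range(iters):
        # Re-estimate each speaker's centroid given the current offsets.
        centroids = (obs - offsets[None, :, :]).mean(axis=1)
        # Re-estimate the shared offsets given the current centroids.
        offsets = (obs - centroids[:, None, :]).mean(axis=0)
    return centroids, offsets

# Synthetic data where every observation is exactly centroid + offset.
mu = np.array([[1.0, 0.0], [0.0, 1.0]])                 # true centroids
delta = np.array([[0.5, 0.0], [0.0, 0.5], [0.2, 0.2]])  # true offsets
obs = mu[:, None, :] + delta[None, :, :]
cent, off = reestimate(obs)
```

On such additively generated data the alternating updates recover a decomposition that reconstructs every observation exactly (up to a constant shift traded between centroids and offsets), mirroring how each re-estimation formula feeds the next.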
Let MU(S,P) be the portion of the eigencentroid for speaker S that pertains to phoneme P. To get a particular context-dependent variant of the model for P—that is, an allophone model for P in the phonetic context C—apply a linear transformation T(P,C) to MU(S,P). This allophone model can be expressed as: M(S,C,P)=T(P,C)*MU(S,P).
In our currently preferred embodiment, T is the simple linear transformation given by a translation vector δ. Thus, in this embodiment: M(S,C,P)=MU(S,P)+δ(P,C).
For instance, allophone 1 of MU(S,P) might be given by MU(S,P)+δ1, allophone 2 might be given by MU(S,P)+δ2, and so on.
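The translation-vector embodiment above reduces to a simple vector addition per context. The following minimal sketch, with purely illustrative names and values, computes a family of allophone models from one speaker's phoneme centroid and a table of context-specific delta vectors.

```python
# Minimal illustration of the translation-vector embodiment:
# M(S,C,P) = MU(S,P) + delta(P,C). All values are toy examples.
import numpy as np

mu = np.array([1.0, 2.0, 3.0])            # MU(S,P): centroid for phoneme P, speaker S
deltas = {                                # delta(P,C): one translation per context C
    1: np.array([0.1, 0.0, -0.1]),
    2: np.array([-0.2, 0.3, 0.0]),
}
allophones = {c: mu + d for c, d in deltas.items()}
```

Because only mu changes from speaker to speaker while the delta table is shared, adapting every allophone model to a new speaker costs one centroid estimate plus an addition per context.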
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.