Small vocabulary speech recognition systems have as their basic units the words in the small vocabulary to be recognized. For instance, a system for recognizing the English alphabet will typically have 26 models, one model per letter of the alphabet. This approach is impractical for medium and large vocabulary speech recognition systems. These larger systems typically take as their basic units the phonemes or syllables of a language. If a system contains one model (e.g., a Hidden Markov Model) per phoneme of a language, it is called a system with “context-independent” acoustic models.
If a system employs different models for a given phoneme, depending on the identity of the surrounding phonemes, the system is said to employ “context-dependent” acoustic models. An allophone is a specialized version of a phoneme defined by its context. For instance, all the instances of ‘ae’ pronounced before ‘t’, as in “bat,” “fat,” etc., define an allophone of ‘ae’.
For most languages, the acoustic realization of a phoneme depends very strongly on the preceding and following phonemes: for instance, an ‘eh’ preceded by a ‘y’ (as in “yes”) is quite different from an ‘eh’ preceded by ‘s’ (as in “set”). Thus, for a system with a medium-sized or large vocabulary, the performance of context-dependent acoustic models is much better than that of context-independent models. Most practical applications of medium and large vocabulary recognition systems today employ context-dependent acoustic models.
Many context-dependent recognition systems today employ decision tree clustering to define the context-dependent, speaker-independent acoustic models. A tree-growing algorithm finds questions about the phonemes surrounding the phoneme of interest and splits apart acoustically dissimilar examples of the phoneme of interest. The result is a decision tree of yes-no questions for selecting the acoustic model that will best recognize a given allophone. Typically, the yes-no questions pertain to how the allophone appears in context (i.e., what its neighboring phonemes are).
The conventional decision tree defines for each phoneme a binary tree containing yes/no questions in the root node and in each intermediate node (children, grandchildren, etc. of the root node). The terminal nodes, or leaf nodes, contain the acoustic models designed for particular allophones of the phoneme. Thus, in use, the recognition system traverses the tree, branching ‘yes’ or ‘no’ based on the context of the phoneme in question until the leaf node containing the applicable model is identified. Thereafter the identified model is used for recognition.
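The tree traversal described above may be sketched as follows. The node structure, question set, and model identifiers here are illustrative assumptions, not taken from any particular recognizer:

```python
# A minimal sketch of decision-tree allophone model selection.
# Nodes, questions, and model names are invented for illustration.

class Node:
    def __init__(self, question=None, yes=None, no=None, model=None):
        self.question = question  # predicate on (left, right) phoneme context
        self.yes = yes            # subtree taken when the question answers 'yes'
        self.no = no              # subtree taken when the question answers 'no'
        self.model = model        # acoustic model identifier, set only at a leaf

def select_model(node, left, right):
    """Branch 'yes' or 'no' on context questions until the leaf node
    containing the applicable acoustic model is reached."""
    while node.model is None:
        node = node.yes if node.question(left, right) else node.no
    return node.model

# Toy tree for phoneme 'eh': one question on the preceding phoneme.
tree = Node(
    question=lambda left, right: left == "y",
    yes=Node(model="eh_after_y"),   # e.g. the 'eh' in "yes"
    no=Node(model="eh_generic"),    # e.g. the 'eh' in "set"
)

print(select_model(tree, "y", "s"))  # context of 'eh' in "yes" -> eh_after_y
print(select_model(tree, "s", "t"))  # context of 'eh' in "set" -> eh_generic
```

In a real system the leaves would hold trained Hidden Markov Model parameters rather than string labels, and the question set would be produced by the tree-growing algorithm described above.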
Unfortunately, conventional allophone modeling can go wrong. We believe this is because current methods do not take into account the particular idiosyncrasies of each training speaker. Current methods assume that individual speaker idiosyncrasies will be averaged out if a large pool of training speakers is used. However, in practice, we have found that this assumption does not always hold. Conventional decision tree-based allophone models work fairly well when a new speaker’s speech happens to resemble the speech of the training speaker population. However, conventional techniques break down when the new speaker’s speech lies outside the domain of the training speaker population.
The present invention addresses the foregoing problem through a reduced dimensionality speaker space assessment technique that allows individual speaker idiosyncrasies to be rapidly identified and removed from the recognition equation, resulting in allophone models that are far more universally applicable and robust. The reduced dimensionality speaker space assessment is performed in a reduced dimensionality space that we call the eigenvoice space or eigenspace. One of the important advantages of our eigenvoice technique is speed. When a new speaker uses the recognizer, his or her speech is rapidly placed or projected into the eigenspace derived from the training speaker population. Even the very first utterance by the new speaker can be used to place the new speaker into eigenspace. In eigenspace, the allophones may be represented with minimal influence by irrelevant factors such as each speaker’s position in speaker space.
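The placement of a new speaker into eigenspace can be sketched as a projection onto a small set of basis vectors. The basis vectors (“eigenvoices”) below stand in for a basis derived offline from the training population (e.g., by principal component analysis over speaker-dependent model parameters); the dimensions and numbers are invented for illustration:

```python
# A hedged sketch of projecting a new speaker into eigenspace.
# The 4-dimensional "supervector" and the two orthonormal eigenvoices
# are invented stand-ins for quantities a real system would derive
# from the training speaker population.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project(supervector, eigenvoices, mean):
    """Center the speaker's supervector on the population mean, then
    take its coordinates along each eigenvoice. The short coordinate
    vector returned is the speaker's position in eigenspace."""
    centered = [x - m for x, m in zip(supervector, mean)]
    return [dot(centered, e) for e in eigenvoices]

mean = [0.0, 0.0, 0.0, 0.0]          # population mean supervector (toy)
eigenvoices = [                      # two orthonormal basis vectors (toy)
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

new_speaker = [0.8, -0.3, 0.1, 0.05]  # supervector from a first utterance
print(project(new_speaker, eigenvoices, mean))  # [0.8, -0.3]
```

Because the projection is just a handful of dot products, it is fast enough to perform from a single utterance, which is the speed advantage noted above.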
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings. In the following detailed description, two basic embodiments are illustrated. Different variations of these embodiments are envisioned as will be appreciated by those skilled in this art.