Automatic speech recognition techniques are known that convert a speech signal into a sequence of speech feature vectors (and/or classes) and then identify segments of the sequence that correspond to specific words. The segmentation and recognition process typically relies on a set of speech recognition models (wherein each such model typically corresponds to a given sound, such as a specific word or sub-word unit). As is also well understood, each model can provide a basis for computing a likelihood that a particular set of speech feature values (or classes) is properly associated with a corresponding acoustic unit such as a given specific sound.
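By way of illustration only, the likelihood computation described above can be sketched as follows. The text does not specify the form of the models; the diagonal-covariance Gaussian used here, the unit names "ah" and "ee", and all numeric values are assumptions chosen solely to keep the example small.

```python
import math

def log_likelihood(features, mean, var):
    """Log-density of a feature vector under a diagonal-covariance Gaussian.

    The Gaussian form is an assumption made for this sketch only.
    """
    total = 0.0
    for x, m, v in zip(features, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

# Hypothetical models: one (mean, variance) pair per acoustic unit.
models = {
    "ah": ([1.0, 0.0], [0.5, 0.5]),
    "ee": ([-1.0, 2.0], [0.5, 0.5]),
}

def best_unit(features, models):
    """Score a feature vector against every model and return the best match."""
    scores = {
        unit: log_likelihood(features, mean, var)
        for unit, (mean, var) in models.items()
    }
    return max(scores, key=scores.get), scores

# A feature vector lying close to the hypothetical "ah" model.
unit, scores = best_unit([0.8, 0.1], models)
print(unit, scores)
```

In this sketch the unit whose model yields the highest log-likelihood is taken as the best match for the feature vector; a practical recognizer would combine such scores across a whole sequence of vectors rather than deciding on each vector in isolation.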
Such speech recognition models are often created during overall system development and are usually based on a large corpus of speech data from many speakers representing a given language (or dialect). During use, however, speech recognition accuracy depends on the ability of the models to provide accurate estimates of feature likelihoods for a given user's voice. Since the statistics of speech feature usage and occurrence in fact differ significantly among the various speakers of a given language, models trained on many speakers will usually not provide completely accurate likelihood estimates for any given individual user unless the set of permitted words is purposefully and significantly limited and those words are significantly audibly distinct from one another.
Known methods exist to adapt models during use to attempt to better represent the characteristics of a given individual speaker's voice. These methods tend to require, however, that the speech recognition system be able to correctly recognize a sufficient amount of the user's speech to provide reliable supervisory information for the adaptation process. Upon receiving a speech sample, these processes utilize speech recognition to ascertain the verbal content of the speech and then assign that content to corresponding acoustic classes, models, and/or other classes such as phonemes (note that “acoustic classes” can include phonemes as such, but more typically comprise more abstract categories, such as sub-phonemes or context dependent phonemes where the sound depends upon what precedes and/or what follows the sound). When the initial models are not sufficiently accurate, however, initial recognition performance will be poor and will tend to significantly hamper the adaptation process.
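The dependency of such adaptation on correct recognition can be illustrated with a minimal sketch. The nearest-mean "recognizer", the interpolation weight, and the names `nearest_unit` and `adapt_means` are all hypothetical and are not drawn from any particular known method; the sketch only shows how a misrecognized sample is credited to the wrong model.

```python
def nearest_unit(features, means):
    """Stand-in recognizer: label a vector with the unit whose mean is closest."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(means, key=lambda unit: sq_dist(features, means[unit]))

def adapt_means(means, samples, weight=0.5):
    """Shift each unit's mean toward the average of the vectors recognized as it.

    The recognized label is the only supervisory information used, so any
    recognition error is carried directly into the adapted models.
    """
    grouped = {}
    for features in samples:
        grouped.setdefault(nearest_unit(features, means), []).append(features)
    adapted = dict(means)
    for unit, vectors in grouped.items():
        avg = [sum(col) / len(vectors) for col in zip(*vectors)]
        adapted[unit] = [
            (1 - weight) * m + weight * a for m, a in zip(means[unit], avg)
        ]
    return adapted

# Hypothetical speaker-independent means for two acoustic units.
si_means = {"ah": [1.0, 0.0], "ee": [-1.0, 2.0]}

# Suppose all three vectors are this user's "ah"; the last lies far enough
# from the speaker-independent "ah" mean that the stand-in recognizer
# labels it "ee" instead.
user_ah = [[0.6, 0.4], [0.4, 0.6], [-0.2, 1.2]]
adapted = adapt_means(si_means, user_ah)
print(adapted)
```

Under these assumptions the third sample, although spoken as "ah", pulls the "ee" mean toward the user's "ah" region, which is the kind of degradation that poor initial recognition can introduce into the adaptation process.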
At least one prior art suggestion has been made to attempt to avoid these problems by modifying vector quantization-based models through clustering of the speaker's feature data. This suggestion proposes that speech recognition based upon vector quantization codebooks might benefit from such an approach. Other specifics of this approach are described more completely further herein for the convenience of the reader. For the moment, it may be noted that this approach tends to require the storing of feature vectors while the intended clustering occurs, and utilizes only a simple vector distance to facilitate the alignment of speaker independent classes to speaker dependent classes. These processing requirements and relatively unsophisticated constituent elements may have contributed to the general lack of usage of such a technique in speech recognition today (although a general shift in the technology away from vector quantization methodology may factor into this situation as well).
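The general shape of such a clustering-based approach can be sketched as follows. This is not the procedure of the prior art suggestion itself, whose specifics are not given in this passage; batch k-means, squared Euclidean distance, and the names `kmeans`, `align`, `si_codebook`, and `stored` are assumptions used only to make the two noted characteristics concrete (stored feature vectors and a simple vector distance for alignment).

```python
def sq_dist(a, b):
    """Squared Euclidean distance, standing in for 'a simple vector distance'."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, init_centroids, iterations=10):
    """Plain batch k-means; every stored vector is revisited on each pass."""
    centroids = [list(c) for c in init_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            idx = min(range(len(centroids)), key=lambda i: sq_dist(v, centroids[i]))
            clusters[idx].append(v)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def align(si_codebook, sd_centroids):
    """Map each speaker independent class to its nearest speaker dependent centroid."""
    return {
        name: min(range(len(sd_centroids)), key=lambda i: sq_dist(vec, sd_centroids[i]))
        for name, vec in si_codebook.items()
    }

# Hypothetical speaker independent codebook and stored user feature vectors.
si_codebook = {"c0": [0.0, 0.0], "c1": [4.0, 4.0]}
stored = [[0.5, 0.5], [0.7, 0.3], [4.5, 3.5], [4.7, 3.9]]

sd_centroids = kmeans(stored, list(si_codebook.values()))
mapping = align(si_codebook, sd_centroids)
print(sd_centroids, mapping)
```

Note that the `stored` list must be held in memory for the duration of the clustering, and that the alignment step considers nothing beyond the distance between centroids, which corresponds to the two limitations noted above.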
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of various embodiments of the present invention. Also, common but well-understood elements that are useful or necessary in a commercially feasible embodiment are typically not depicted in order to facilitate a less obstructed view of these various embodiments of the present invention.