Automated Speech Recognition (ASR) systems traditionally fall into three broad categories: 1) speaker independent ASR, 2) speaker dependent ASR, and 3) speaker adaptive ASR. Speaker independent ASR does not involve the user training the system in any form. Although named “speaker independent”, these systems generally categorize the speaker as male or female (males and females generally have markedly different vocal tract and vocal cord physiology and therefore produce sounds that are characteristically male or female). After this categorization, additional processing is usually performed to normalize the result to a “nominal male” or “nominal female” characteristic (this process is called speaker normalization). Although both of these steps are generally performed, they are performed without user intervention or knowledge, and therefore this type of ASR system is characterized as “speaker independent” ASR.
For speaker dependent ASR, the system is trained to recognize how a particular user pronounces words, phrases, or phonemes. In general, these systems perform better than speaker independent systems and are therefore preferred when higher confidence ASR is required. However, speaker dependent systems require the user to “train/teach” the ASR how they pronounce phonemes (or di-phones, tri-phones, or whatever phonetic unit the particular ASR design uses). The training involves each user dictating a precise text to the ASR system. Depending on the system design, this text may or may not be the same for all users, but it will generally contain a balanced set of phonemes (or of the phonetic units used by the system), with the relative frequency of each phoneme falling within a certain bound defined by the system training design. Thus speaker dependent ASR has the advantage, in the training stage, of having both the audio and the “text transcription” of the audio, because the user has read the text.
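The “balanced set of phonemes” property of a training text can be illustrated with a short check. The following is only a minimal sketch under stated assumptions: the function names, the four-phoneme inventory, and the frequency bounds are hypothetical and chosen for illustration, and the text is assumed to have already been converted to a phoneme sequence by some pronunciation lexicon.

```python
from collections import Counter


def phoneme_frequencies(phonemes):
    """Return the relative frequency of each phoneme in a sequence."""
    if not phonemes:
        return {}
    counts = Counter(phonemes)
    total = len(phonemes)
    return {p: c / total for p, c in counts.items()}


def is_balanced(phonemes, inventory, min_freq, max_freq):
    """Check that every phoneme in the inventory occurs in the text with a
    relative frequency inside [min_freq, max_freq] (hypothetical bounds)."""
    freqs = phoneme_frequencies(phonemes)
    return all(min_freq <= freqs.get(p, 0.0) <= max_freq for p in inventory)


# Hypothetical toy inventory and training script (already phonemized).
inventory = {"AH", "B", "K", "T"}
script = ["K", "AH", "T", "B", "AH", "T"]

print(is_balanced(script, inventory, 0.1, 0.4))  # every phoneme within bounds
print(is_balanced(script, inventory, 0.2, 0.4))  # "K" and "B" are too rare
```

A real system would apply the same idea to whatever phonetic unit it uses (di-phones, tri-phones) and to a much larger inventory.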
Most speaker adaptive ASR systems (e.g., DRAGON SYSTEMS) have an optional stage in which they can first train to a particular user by having the user read a text and performing speaker dependent ASR training. Whether or not this optional step is used, speaker adaptive ASR depends on the user “watching the ASR generated text transcription of their spoken words” and then “correcting the mistakes” the ASR system makes in real time. Whenever the user instructs the system to correct its ASR output (an “ASR mistake”), the system learns the characteristics of this “mistake” and “adapts its understanding” as to how that particular user pronounces words. In this way it modifies its future decoding of similar words/phonemes in an attempt not to make the same “mistake” again. An ASR system that learns in this way is called a “speaker adaptive” ASR system. Thus, this type of ASR also depends on having both a “text transcript” (in this case, the ASR output as corrected by the user) and the “audio” to obtain performance improvement.
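The correct-and-adapt loop described above can be illustrated with a toy model. This is a deliberately simplified sketch, not how any particular product works: the class name, the two-dimensional “feature vectors”, the nearest-template decoding rule, and the adaptation rate are all assumptions made for illustration. Each word has a stored template, decoding picks the closest template, and a user correction moves the template of the correct word toward the audio features the user actually produced.

```python
import math


class AdaptiveRecognizer:
    """Toy nearest-template recognizer that adapts on user corrections."""

    def __init__(self, templates, rate=0.5):
        # templates: word -> feature vector; copied so caller data is not mutated
        self.templates = {w: list(v) for w, v in templates.items()}
        self.rate = rate  # hypothetical adaptation rate in (0, 1]

    def decode(self, features):
        """Return the word whose template is closest to the audio features."""
        return min(
            self.templates,
            key=lambda w: math.dist(self.templates[w], features),
        )

    def correct(self, features, correct_word):
        """Adapt: pull the correct word's template toward this user's audio.
        Assumes correct_word is already in the vocabulary."""
        old = self.templates[correct_word]
        self.templates[correct_word] = [
            (1 - self.rate) * a + self.rate * b for a, b in zip(old, features)
        ]


# Hypothetical example: this user's "pen" sounds close to the nominal "pin".
recognizer = AdaptiveRecognizer({"pin": [1.0, 0.0], "pen": [0.0, 1.0]})
user_audio = [0.6, 0.4]

first = recognizer.decode(user_audio)   # misrecognized as "pin"
recognizer.correct(user_audio, "pen")   # user corrects the transcript
second = recognizer.decode(user_audio)  # now decoded as "pen"
print(first, second)
```

The sketch shows why both pieces are needed: the correction supplies the transcript (“pen”) and the stored features supply the audio, and only the pair together lets the system adapt.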
In general, both speaker dependent and speaker adaptive ASR perform better than speaker independent ASR because they have both the “user audio” and the “transcription” of the spoken utterances for a particular individual. However, both require users to train the system to obtain this performance improvement.