In conventional methods for training acoustic models in an automatic speech recognizer ("ASR"), the ASR must be trained before it can recognize random acoustic data (i.e., spoken words and phrases). In order to train the ASR, a correct script must be manually created by a user and input to the ASR. The user must generate acoustic data by physically and sequentially saying hundreds of sentences. After uttering each sentence, the user must input transcribed data corresponding to the sentence by typing the words of the spoken sentence. As a result of this operation, the ASR learns which voice patterns of the user correspond to the sounds and phonemes of the words in the transcribed data. In other words, the ASR learns the sounds of particular words by aligning the acoustic data with the transcribed data (i.e., the correct script).
After the ASR has been trained, it is capable of decoding randomly spoken words into a decoded script based on the voice patterns it learned while it was trained. Since the ASR can only understand random acoustic data based on the learned voice patterns, the accuracy of the ASR increases with the amount of acoustic data and corresponding transcribed data that it received when it was trained. Naturally, a very large amount of acoustic data and transcribed data is needed to train speaker-independent and continuous ASRs or telephone ASRs. However, manually generating the acoustic data and inputting the corresponding transcribed data is time-consuming and expensive, and thus, training an ASR so that it can accurately produce decoded scripts from randomly spoken acoustic data is likewise expensive and inefficient.
Alternative training methods, such as unsupervised training methods, do not rely explicitly on transcribed data. Instead, unsupervised training methods update the acoustic model parameters by optimizing criterion functions that are computed over non-transcribed acoustic data. These methods are less efficient than the supervised training methods, which try to explicitly relate spoken words to the previously stored transcribed words.
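By way of illustration only, the following minimal sketch shows one way model parameters can be updated from non-transcribed data by optimizing a criterion function. It assumes, purely for the example, one-dimensional features, a two-component Gaussian mixture standing in for an acoustic model, and a maximum-likelihood criterion optimized by expectation-maximization (EM). None of these choices, nor the hypothetical names `em_step` and `train_unsupervised`, come from the text above.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Criterion function: total log-likelihood of the unlabeled data."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-3):
    """One EM update of the mixture parameters; no transcription is used."""
    # E-step: posterior probability of each component for each sample.
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        norm = sum(joint)
        resp.append([j / norm for j in joint])

    # M-step: re-estimate weights, means, and variances from the posteriors.
    new_w, new_m, new_v = [], [], []
    for c in range(len(weights)):
        occ = sum(r[c] for r in resp)  # soft count for component c
        mean = sum(r[c] * x for r, x in zip(resp, data)) / occ
        var = sum(r[c] * (x - mean) ** 2 for r, x in zip(resp, data)) / occ
        new_w.append(occ / len(data))
        new_m.append(mean)
        new_v.append(max(var, var_floor))  # floor is an assumed safeguard
    return new_w, new_m, new_v


def train_unsupervised(data, weights, means, variances, iterations=20):
    """Run EM and record the criterion value before and after each update."""
    history = [log_likelihood(data, weights, means, variances)]
    for _ in range(iterations):
        weights, means, variances = em_step(data, weights, means, variances)
        history.append(log_likelihood(data, weights, means, variances))
    return weights, means, variances, history


# Synthetic, unlabeled "feature" values drawn from two clusters (made-up data).
rng = random.Random(0)
data = ([rng.gauss(-2.0, 0.7) for _ in range(200)]
        + [rng.gauss(3.0, 0.7) for _ in range(200)])

weights, means, variances, history = train_unsupervised(
    data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0], iterations=20
)
print("estimated means:", sorted(means))
print("log-likelihood before and after:", history[0], history[-1])
```

In this sketch the criterion is evaluated only on the data itself. Nothing ties a mixture component to a particular word or phoneme, which is one way to see why such methods do not explicitly relate the spoken words to transcribed words.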
In light of the above disadvantages of conventional ASRs, a need exists for a training method that eliminates the requirement of manually generating acoustic data and manually inputting corresponding transcribed data to the ASR. Specifically, what is needed is a method that automatically transcribes a large amount of acoustic data by repeatedly comparing and refining the acoustic data via an iterative procedure.