Dictation devices have been in common use in many fields in which it is inconvenient or undesirable to make handwritten or typewritten notes. It was common in these fields for a user of a dictation device to make voice recordings and provide these recordings to transcriptionists, who transcribed the recordings in order to generate written transcripts for the speaker's review or for record keeping purposes. More recently, dictation technology has developed significantly and includes such tools as automated speech recognition (ASR) systems, which eliminate much of the need for transcriptionists to transcribe recordings. ASR systems take input in the form of speech and derive the sequence of words that is most likely to have been spoken. Besides dictation systems, ASR is also becoming increasingly important in applications such as voice activated telephone menus and voice messaging systems. However, a recognized and pervasive problem in the art is that speech recognition can be unreliable, and it is often easier and even cheaper to use transcription services or live operators instead because faulty speech recognition often wastes more time than it saves. There is therefore a need for improved speech recognition systems that are more reliable and less error prone.
There are several methods of speech recognition known to those skilled in the art. Most of these methods are based on combinations of acoustic (speech) models and language models. Acoustic models can be constructed using speech from many different speakers or using speech from a particular speaker for whom the model is intended. An acoustic model constructed either a priori or using many different speakers is called a speaker independent (SI) acoustic model because the parameters of the model are not tuned to any particular speaker. Speaker independent acoustic models are usually substantially less accurate than speaker dependent (SD) acoustic models because the parameters of an SD acoustic model are adapted to a particular speaker's pronunciation and manner of speaking whereas those of an SI acoustic model are not. In addition to pronunciation and manner of speaking, a speaker's vocal cord length, the size and shape of the speaker's mouth, and other physical characteristics can affect speech recognition. Furthermore, differences in signal quality due to differences in microphone input or signal transmission routes between when a recognition engine is initially programmed and when it is actually used can also affect the accuracy of automatic speech recognition. A speaker dependent acoustic model can account for these differences whereas a speaker independent acoustic model cannot. Therefore, in order to increase the accuracy of speech recognition engines, it is desirable to adapt the acoustic models on which the engines are based to particular speakers and to the particular conditions under which they are speaking.
There are several ways known to those skilled in the art to adapt a speech model to a particular speaker, including Bayesian maximum a posteriori (MAP) estimates, eigenvoice methods, and maximum-likelihood methods. Many of these methods are reviewed by Neumeyer et al., “A Comparative Study of Speaker Adaptation Techniques,” 4th European Conf. on Speech Communication and Technology, pages 1127-1130, incorporated by reference.
The maximum-likelihood linear regression (MLLR) method is described in detail in C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech and Language, vol. 9, pages 171-185, incorporated by reference. This reference describes two methods of adapting a speech recognition engine to a particular speaker. The first method is supervised adaptation, which requires an exact or literal transcript of the recorded speech and refines the acoustic model to match the literal transcript. The second method is unsupervised adaptation, which does not require a literal transcript. In this method, the speech recognition engine transcribes the speech and that transcription is used to refine the acoustic model. As expected, Leggetter and Woodland report that the supervised adaptation method was superior to the unsupervised adaptation method, although both methods approached the performance of an SD-based acoustic model as the number of adaptation utterances increased.
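The difference between supervised and unsupervised adaptation can be illustrated with a deliberately small sketch. This is not the MLLR procedure of Leggetter and Woodland. It assumes a hypothetical one-dimensional acoustic model with one mean per word and adapts it by estimating a single shared offset for the speaker. The names SI_MEANS, decode, estimate_bias, and the observation values are all invented for illustration. The only point shown is where the adaptation labels come from: a literal transcript in the supervised case, and the recognizer's own output in the unsupervised case.

```python
# Toy illustration only: a one-dimensional "acoustic model" with one mean per word.
# All names and numbers below are hypothetical and are not taken from the cited reference.
SI_MEANS = {"left": 0.0, "right": 4.0}  # speaker independent means


def decode(observation, means):
    """Recognize an observation as the word whose mean is closest to it."""
    return min(means, key=lambda word: abs(observation - means[word]))


def estimate_bias(observations, labels, means):
    """Average offset between each observation and the mean of its assigned word."""
    residuals = [obs - means[label] for obs, label in zip(observations, labels)]
    return sum(residuals) / len(residuals)


def supervised_bias(observations, literal_transcript, means):
    """Supervised adaptation: labels come from a literal transcript."""
    return estimate_bias(observations, literal_transcript, means)


def unsupervised_bias(observations, means):
    """Unsupervised adaptation: labels come from the recognizer's own output."""
    recognized = [decode(obs, means) for obs in observations]
    return estimate_bias(observations, recognized, means)


# A speaker whose observations sit well above the speaker independent means.
OBSERVATIONS = [2.4, 6.6, 1.8, 6.4]
LITERAL_TRANSCRIPT = ["left", "right", "left", "right"]

sup = supervised_bias(OBSERVATIONS, LITERAL_TRANSCRIPT, SI_MEANS)
unsup = unsupervised_bias(OBSERVATIONS, SI_MEANS)
print(f"supervised offset:   {sup:.2f}")
print(f"unsupervised offset: {unsup:.2f}")
```

In this toy data the first observation is misrecognized as "right", so the unsupervised estimate of the offset (about 1.3) falls short of the supervised estimate (about 2.3). Recognition errors in the adaptation transcript degrade the adapted model in the same way, which is consistent with the reported advantage of supervised adaptation.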
Supervised adaptation approaches require literal transcripts of the recorded speech. Thus either transcriptionists are required to literally transcribe the recorded speech in order to create a text document for comparison, or a speaker is required to read a pre-written adaptation speech. Both of these methods require time on the part of a human, making them more costly and time-consuming than unsupervised methods. Furthermore, supervised adaptation methods are usually not an option in telephony applications in which speakers interact only briefly with the system. There is therefore a need for an unsupervised adaptive speech recognition system that does not rely on literal transcriptions and does not rely on speakers reading pre-written adaptation speeches.
One approach to adaptive speech recognition without literal transcription is found in K. Shinoda and C.-H. Lee, “Unsupervised Adaptation Using Structural Bayes Approach,” Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, Seattle, Wash. (1998), incorporated herein by reference. In this approach, an automatic speech recognition system was used to generate semi-literal transcripts for adaptation. However, because the generated semi-literal transcripts contained many recognition errors, the adapted acoustic models developed using this approach did not perform as well as adapted acoustic models generated using literal transcripts or combinations of supervised adaptation followed by unsupervised adaptation. There is therefore a need for an unsupervised adaptive speech recognition system that performs better than this generated semi-literal approach.
In addition to acoustic models, speech recognition engines also typically employ language models. Language models allow speech recognition engines to take into account the fact that there are correlations between nearby words in sentences. This concept is described in detail by F. Jelinek, “Self-Organized Language Modeling for Speech Recognition,” in Language Processing for Speech Recognition at pages 450-503, incorporated herein by reference. Language models attempt to characterize the probability that a particular word will appear at a particular location in a sentence given the identity and locations of other nearby words.
One popular family of language models is the n-gram language model. N-gram language models attempt to calculate the probability that the nth word in a group of words will be a particular word given the identity of the previous n−1 words. For example, the trigram language model (n=3) attempts to calculate the probability that the third word in a group of words will be a particular word given the identity of the previous two words in that group. N-gram probabilities are typically calculated based on large quantities of text, called a training corpus.
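As a minimal sketch of the trigram case, the probability of a third word given the previous two can be estimated from relative frequencies in a training corpus: the number of times the three-word sequence occurs, divided by the number of times its two-word history occurs followed by any word. This unsmoothed maximum-likelihood estimate is an assumption made here for illustration, since the text above does not specify an estimator, and practical systems generally add smoothing for unseen sequences. The function names and the tiny corpus are hypothetical.

```python
from collections import Counter


def train_trigram(tokens):
    """Count each trigram and each two-word history that is followed by a word."""
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    history_counts = Counter()
    for (w1, w2, _w3), count in trigram_counts.items():
        history_counts[(w1, w2)] += count
    return trigram_counts, history_counts


def trigram_probability(trigram_counts, history_counts, w1, w2, w3):
    """Unsmoothed relative-frequency estimate of P(w3 | w1, w2)."""
    history = history_counts[(w1, w2)]
    if history == 0:
        return 0.0  # history never seen in the corpus; a real model would back off
    return trigram_counts[(w1, w2, w3)] / history


# Hypothetical toy training corpus in a medical register.
CORPUS = (
    "the patient denies chest pain "
    "the patient denies fever "
    "the patient reports chest pain"
).split()

trigram_counts, history_counts = train_trigram(CORPUS)
p_denies = trigram_probability(trigram_counts, history_counts, "the", "patient", "denies")
p_reports = trigram_probability(trigram_counts, history_counts, "the", "patient", "reports")
print(f"P(denies | the patient)  = {p_denies:.3f}")
print(f"P(reports | the patient) = {p_reports:.3f}")
```

In this corpus the history "the patient" occurs three times, twice followed by "denies" and once by "reports", so the two estimates are 2/3 and 1/3. A corpus this small only shows the counting, whereas useful estimates require large quantities of text.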
A training corpus can come from any text, and a broad range of texts can be used to generate a general language model. A topic language model can be created from texts using language in the same way as the speaker whose language is being modeled. For example, if the speaker is a physician and is speaking about medical matters, then medical texts would be the most appropriate elements of a training corpus. Or if the speaker is a lawyer speaking about legal matters, then legal texts would be most appropriate. Topic language models can be powerful predictive tools for speech recognition, yet they have not been used in speaker adaptation methods.
Another adaptive speech recognition approach involving semi-literal transcripts is set forth in S. S. Pakhomov and M. J. Schonwetter, “A Method and System for Generating Semi-Literal Transcriptions for Speech Recognition Systems,” U.S. patent application Ser. No. 09/487,398, filed Jan. 18, 2000, incorporated herein by reference. This approach generates semi-literal transcripts from pre-existing partial transcripts. Partial transcripts reflect a speaker's intended text rather than what was actually spoken. Thus, in the text of a partial transcript, pause-fillers (e.g., “um . . . ”, “uh . . . ”) are removed, dictation instructions are removed, and spoken corrections (e.g., “right . . . no, I mean left . . . ”) are corrected. To generate a semi-literal transcript, Pakhomov et al. interpolate pause-fillers with a filled pause language model and interpolate other omitted words (such as corrections and dictation instructions) using a background language model. The semi-literal transcript so generated is then used in combination with the original audio file from which the partial transcript was generated to generate a speaker dependent model. However, the semi-literal transcripts thus generated cannot provide the accuracy required for good adaptive speech recognition, because the method augments the partial transcript using a relatively simple probabilistic finite state model that limits the power of the language model. There is therefore a need for an adaptive speech recognition system that does not suffer from this limitation and can fully benefit from a sophisticated language model.