The present invention relates to the field of speech recognition. More particularly, the present invention relates to the field of adaptation of a speech recognition system across multiple remote sessions with a speaker.
Speech recognition systems are known which permit a user to interface with a computer system using spoken language. The speech recognition system receives spoken input from the user, interprets the input, and then translates the input into a form that the computer system understands.
Speech recognition systems typically recognize spoken words or utterances based upon an acoustic model of a person who is speaking (the speaker). Acoustic models are typically generated based upon samples of speech. When the acoustic model is constructed based upon samples of speech obtained from a number of persons rather than a specific speaker, this is called speaker-independent modeling. When a speaker-independent model is then modified for recognizing speech of a particular person based upon samples of that person's speech, this is called adaptive modeling. When a model is constructed based solely on the speech of a particular person, this is termed speaker-dependent modeling.
Speaker-independent modeling generally enables a number of speakers to interface with the same recognition system without having obtained prior samples of the speech of the particular speakers. In comparison to speaker-independent modeling, adaptive modeling and speaker-dependent modeling generally enable a speech recognition system to more accurately recognize a speaker's speech, especially if the speaker has a strong accent, has a phone line which produces unusual channel characteristics, or for some other reason is not well modeled by speaker-independent models.
FIG. 1 illustrates a plurality of speaker-dependent acoustic models M1, M2, and Mn, in accordance with the prior art. For each speaker, 1 through n, a corresponding speaker-dependent acoustic model M1 through Mn, is stored. Thus, speech 10 of speaker 1 is recognized using the model M1 and the results 12 are outputted. Similarly, speech 14 of speaker 2 is recognized using the model M2 and the results 16 are outputted. And, speech 18 of speaker n is recognized using the model Mn and the results are outputted.
A speech recognition application program called NaturallySpeaking™, which adapts to a particular user, is available from Dragon Systems, Inc. This application program enables a user to enter text into a written document by speaking the words to be entered into a microphone attached to the user's computer system. The spoken words are interpreted and translated into typographical characters which then appear in the written document displayed on the user's computer screen. To adapt the application program to the particular user and to background noises of his or her environment, the user is asked to complete two initial training sessions during which the user is prompted to read textual passages aloud. A first training session requires that the user read several paragraphs aloud, while a second training session requires 25 to 30 minutes for speaking and 15 to 20 minutes for processing the speech.
Other speech recognition systems are known which adapt to an individual speaker based upon samples of speech obtained while the speaker is using the system, without requiring a training session. The effectiveness of this type of adaptation, however, is diminished when only a small sample of speech is available.
Speech recognition systems are known which provide a telephonic interface between a caller and a customer service application. For example, the caller may obtain information regarding flight availability and pricing for a particular airline and may purchase tickets utilizing spoken language and without requiring assistance from an airline reservations clerk. Such customer service applications are typically intended to be accessed by a diverse population of callers and with various background noises. In such applications, it would be impractical to ask the callers to engage in a training session prior to using the customer service application. Accordingly, an acoustic model utilized for such customer service applications must be generalized so as to account for variability in the speakers. Thus, speaker-independent modeling is utilized for customer service applications. A result of using speaker-independent modeling is that the recognition system is less accurate than may be desired. This is particularly true for speakers with strong accents and those who have a phone line which produces unusual channel characteristics.
Therefore, what is needed is a technique for improving the accuracy of speech recognition for a speech recognition system.
The invention is a method and apparatus for adaptation of a speech recognition system across multiple remote sessions with a speaker. The speaker can remotely access a speech recognition system, such as via a telephone or other remote communication system. An acoustic model is utilized for recognizing speech utterances made by the speaker. Upon initiation of a first remote session with the speaker, the acoustic model is speaker-independent. During the first remote session, the speaker is uniquely identified and speech samples are obtained from the speaker. In the preferred embodiment, the samples are obtained without requiring the speaker to engage in a training session. The acoustic model is then modified based upon the samples, thereby forming a modified model. The model can be modified during the remote session or after the session is terminated. Upon termination of the remote session, the modified model is then stored in association with an identification of the speaker. Alternatively, rather than storing the modified model, statistics that can be used to modify a pre-existing acoustic model are stored in association with an identification of the speaker.
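The first-session flow described above can be sketched as follows. This is a minimal illustration only: the source does not name an adaptation algorithm, so the interpolation used here (a MAP-style weighted average of a single model mean) is an assumption, as are the names `AdaptationStats`, `adapt_mean`, `SI_MEAN`, the prior weight `tau`, and the identifier `"caller-001"`. A real acoustic model would have many parameters rather than one scalar.

```python
from dataclasses import dataclass


@dataclass
class AdaptationStats:
    """Hypothetical per-speaker statistics: sample count and running sum."""
    count: int = 0
    total: float = 0.0

    def add(self, samples):
        # Fold each observed feature value into the running statistics.
        for x in samples:
            self.count += 1
            self.total += x


def adapt_mean(si_mean, stats, tau=10.0):
    """Blend the speaker-independent mean with the speaker's samples.

    tau is an assumed prior weight: with no samples the result equals
    si_mean; with many samples it approaches the speaker's sample mean.
    """
    return (tau * si_mean + stats.total) / (tau + stats.count)


SI_MEAN = 0.0                      # assumed speaker-independent parameter
stats = AdaptationStats()
stats.add([2.0, 2.0, 2.0, 2.0, 2.0])  # samples obtained during the session
adapted = adapt_mean(SI_MEAN, stats)

# On termination, store the statistics under the speaker's identification.
stored = {"caller-001": stats}
```

Storing the statistics rather than the adapted value corresponds to the alternative described above, in which the modified model is rebuilt from a pre-existing model when needed.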
During a subsequent remote session, the speaker is identified and the modified acoustic model is then utilized to recognize speech utterances made by the speaker. Additional speech samples are obtained during the subsequent session and are then utilized to further modify the acoustic model. In this manner, an acoustic model utilized for recognizing the speech of a particular speaker is cumulatively modified according to speech samples obtained during multiple remote sessions with the speaker. As a result, the accuracy of the speech recognition system improves for the speaker even when the speaker only engages in relatively short remote sessions.
For each speaker who remotely accesses the speech recognition system, a modified acoustic model, or a set of statistics that can be used to modify the acoustic model or incoming acoustic speech, is formed and stored along with the speaker's unique identification. Accordingly, multiple different acoustic models or sets of statistics are stored, one for each speaker.
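The cumulative effect of several short sessions can be illustrated with a small sketch. It assumes, purely for illustration, that the model is a single scalar mean and that adaptation is a weighted average with an assumed prior weight `TAU`; the function names, the constants, and the identifier `"caller-001"` are hypothetical and do not come from the source.

```python
SI_MEAN = 0.0   # assumed speaker-independent parameter
TAU = 10.0      # assumed weight given to the speaker-independent model


def adapted_mean(count, total):
    """Model mean implied by the statistics accumulated so far."""
    return (TAU * SI_MEAN + total) / (TAU + count)


def run_session(store, speaker_id, samples):
    """Simulate one remote session for an identified speaker.

    Returns the model mean that was in use during the session, then
    folds the session's samples into the speaker's stored statistics
    so that the next session starts from a further-modified model.
    """
    count, total = store.get(speaker_id, (0, 0.0))
    mean_used = adapted_mean(count, total)
    store[speaker_id] = (count + len(samples), total + sum(samples))
    return mean_used


store = {}
# Four short sessions of five samples each from the same speaker.
means = [run_session(store, "caller-001", [2.0] * 5) for _ in range(4)]
```

In this sketch the first session uses the speaker-independent value, and each later session starts from a mean closer to the speaker's own samples, even though no single session contributes more than five samples.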
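A per-speaker store of this kind might be sketched as below. The class `SpeakerModelStore`, the dictionary-based model representation, the JSON serialization, and the caller identifiers are all assumptions made for illustration; the source does not specify how models or statistics are persisted.

```python
import json

# Hypothetical stand-in for the speaker-independent model's parameters.
SPEAKER_INDEPENDENT = {"mean": 0.0}


class SpeakerModelStore:
    """Keeps one modified model (or set of statistics) per speaker ID."""

    def __init__(self, models=None):
        self._models = dict(models or {})

    def save(self, speaker_id, model):
        # Store a copy so later changes by the caller do not leak in.
        self._models[speaker_id] = dict(model)

    def load(self, speaker_id):
        # A speaker with no stored entry starts from the
        # speaker-independent model, as in a first remote session.
        return dict(self._models.get(speaker_id, SPEAKER_INDEPENDENT))

    def to_json(self):
        return json.dumps(self._models, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text))


store = SpeakerModelStore()
store.save("caller-001", {"mean": 0.8})
store.save("caller-002", {"mean": -0.3})

# Round-trip through a serialized form, as if reloaded for a later session.
restored = SpeakerModelStore.from_json(store.to_json())
```

Keying every entry by the speaker's identification keeps each speaker's adaptation separate, so one caller's samples never alter the model used for another.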