The invention relates to enrollment in speech recognition.
A speech recognition system analyzes a user's speech to determine what the user said. Most speech recognition systems are frame-based. In a frame-based system, a processor divides a signal descriptive of the speech to be recognized into a series of digital frames, each of which corresponds to a small time increment of the speech.
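The framing step described above can be illustrated with a minimal sketch. The function name `split_into_frames`, the 10 ms default frame duration, and the choice to drop a trailing partial frame are assumptions made for illustration; the text specifies only that each frame covers a small time increment.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Divide a sampled signal into consecutive fixed-length frames.

    frame_ms is an illustrative assumption; the text only calls for a
    "small time increment" per frame.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    if frame_len <= 0:
        raise ValueError("frame duration too short for this sample rate")
    # Drop a trailing partial frame, if any (an assumed policy).
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]
```

A real front end would typically convert each frame into a feature vector before matching it against acoustic models; that step is outside this sketch.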
A continuous speech recognition system can recognize spoken words or phrases regardless of whether the user pauses between them. By contrast, a discrete speech recognition system recognizes discrete words or phrases and requires the user to pause briefly after each discrete word or phrase. Continuous speech recognition systems typically have a higher incidence of recognition errors in comparison to discrete recognition systems due to complexities of recognizing continuous speech. A more detailed description of continuous speech recognition is provided in U.S. Pat. No. 5,202,952, entitled “LARGE-VOCABULARY CONTINUOUS SPEECH PREFILTERING AND PROCESSING SYSTEM,” which is incorporated by reference.
In general, the processor of a continuous speech recognition system analyzes “utterances” of speech. An utterance includes a variable number of frames and may correspond to a period of speech followed by a pause of at least a predetermined duration.
The processor determines what the user said by finding acoustic models that best match the digital frames of an utterance, and identifying text that corresponds to those acoustic models. An acoustic model may correspond to a word, phrase or command from a vocabulary. An acoustic model also may represent a sound, or phoneme, that corresponds to a portion of a word. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word. Acoustic models also may represent silence and various types of environmental noise.
The words or phrases corresponding to the best matching acoustic models are referred to as recognition candidates. The processor may produce a single recognition candidate (i.e., a single sequence of words or phrases) for an utterance, or may produce a list of recognition candidates.
An acoustic model generally includes data describing how a corresponding speech unit (e.g., a phoneme) is spoken by a variety of speakers. To increase the accuracy with which an acoustic model represents a particular user's speech, and thereby to decrease the incidence of recognition errors, the speech recognition system may modify the acoustic models to correspond to the particular user's speech. This modification may be based on samples of the user's speech obtained during an initial enrollment session and during use of the system.
Enrollment sessions for previous speech recognition systems typically required a user to read from a list of words or to read specific words in response to prompts. For example, DragonDictate® for Windows®, available from Dragon Systems, Inc. of Newton, Mass., included a quick enrollment session that prompted a new user to speak each word of a small set of words, and then adapted the acoustic models based on the user's speech.
Other enrollment approaches also have been used. For example, NaturallySpeaking®, also available from Dragon Systems, Inc. of Newton, Mass., includes an interactive enrollment session in which a new user recites a selected enrollment text. An associated display (e.g., an arrow) indicates the user's position in the text.
The invention provides non-interactive techniques for enrolling a user in a speech recognition system. Since the techniques are not interactive, the user may record enrollment speech using, for example, a portable recording device, and may later download the speech for processing to refine acoustic models of a speech recognition system. The techniques require the speech to generally correspond to an enrollment text, but permit the user to skip or repeat words, phrases, sentences, or paragraphs of the enrollment text. The techniques involve analyzing the user's speech relative to the enrollment text to identify portions of the speech that match portions of the enrollment text, and updating acoustic models corresponding to the matched portions of the enrollment text using the matching portions of the user's speech. The techniques promise to provide increased flexibility to the enrollment process, and to thereby simplify enrollment.
In one general aspect, a computer enrolls a user in a speech recognition system by obtaining data representing speech by the user and generally corresponding to an enrollment text. The computer analyzes acoustic content of a user utterance, and, based on the analysis, determines whether the user utterance matches a portion of the enrollment text. If the user utterance matches a portion of the enrollment text, the computer uses the acoustic content of the user utterance to update acoustic models corresponding to the portion of the enrollment text. A determination that the user utterance matches a portion of the enrollment text is permitted even when the user has skipped or repeated words, sentences, or paragraphs of the enrollment text.
Implementations may include one or more of the following features. The enrollment procedure is not performed interactively. This means that the data representing the user's speech may be data recorded using a recording device physically separate from the computer. For example, the recording device may be a digital recording device, and obtaining data may include receiving a file from the digital recording device. Obtaining data also may include receiving signals generated by playing back the user's speech using a recording device, such as an analog recording device.
Prior to analyzing a user utterance, the computer may divide the data into groups, with each group representing an utterance by the user.
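One way to picture this grouping step is a pause-based segmenter, sketched below. It assumes that each frame has already been reduced to a single energy value and that a frame counts as silence when its energy does not exceed a threshold; the names `split_into_utterances`, `silence_threshold`, and `min_pause_frames` are hypothetical and are not taken from the described system.

```python
def split_into_utterances(frame_energies, silence_threshold, min_pause_frames):
    """Return (start, end) frame ranges, end exclusive, one per utterance.

    An utterance ends once at least min_pause_frames consecutive
    silent frames follow the most recent speech frame.
    """
    utterances = []
    start = None        # first frame of the utterance being built
    last_speech = None  # index of the most recent speech frame
    for i, energy in enumerate(frame_energies):
        if energy > silence_threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_pause_frames:
            # The pause is long enough: close the current utterance.
            utterances.append((start, last_speech + 1))
            start = None
    if start is not None:
        # The recording ended while an utterance was still open.
        utterances.append((start, last_speech + 1))
    return utterances
```

Short silences inside an utterance (fewer than `min_pause_frames` frames) do not split it, which mirrors the idea of a pause of at least a predetermined duration.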
The computer may designate an active portion of the enrollment text, and may analyze acoustic content of an utterance relative to the active portion of the enrollment text. The computer may identify a position of a previously analyzed utterance in the enrollment text, and may designate the active portion of the enrollment text based on the identified position. The active portion may include text preceding and following the identified position. For example, the active portion may include a paragraph including the position, as well as paragraphs preceding and following that paragraph.
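The paragraph-window example above can be sketched as follows, assuming the enrollment text is held as a list of paragraphs and the position of the previously analyzed utterance is known as a paragraph index. The function name and the `before` and `after` parameters are illustrative assumptions.

```python
def active_paragraphs(paragraphs, current_index, before=1, after=1):
    """Return the paragraph at current_index plus its neighbors.

    The window is clipped at the start and end of the enrollment text.
    """
    if not 0 <= current_index < len(paragraphs):
        raise IndexError("current_index is outside the enrollment text")
    first = max(0, current_index - before)
    last = min(len(paragraphs) - 1, current_index + after)
    return paragraphs[first:last + 1]
```

Limiting analysis to such a window keeps the search small while still tolerating a user who jumps back or ahead by a paragraph.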
The computer may attempt to match the utterance to models for words included in the active portion of the enrollment text. To this end, the computer may employ an enrollment grammar corresponding to the active portion of the enrollment text.
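As a rough, text-level illustration only, the sketch below treats an utterance as a hypothesized word sequence and checks whether it appears as a contiguous run of words anywhere in the active portion. This is a simplifying assumption: the described system matches acoustic content against word models rather than comparing text, and the function name `find_word_run` is hypothetical.

```python
def find_word_run(active_words, utterance_words):
    """Return the index in active_words where utterance_words begins
    as a contiguous run, or None if there is no such run."""
    n = len(utterance_words)
    if n == 0:
        return None
    for start in range(len(active_words) - n + 1):
        if active_words[start:start + n] == utterance_words:
            return start
    return None
```

Because a run may begin at any word of the active portion, an utterance that skips ahead or repeats earlier text can still be located, provided it stays within that portion.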
A rejection grammar may be used to determine whether the user utterance matches a portion of the enrollment text. The rejection grammar may be a phoneme grammar and may model an utterance using a set of phonemes that is smaller than a set of phonemes used by the enrollment grammar.
The enrollment text may be selected from a group of enrollment texts, with each of the enrollment texts having a corresponding enrollment grammar. An enrollment text also may be received from a user. An enrollment grammar corresponding to the received enrollment text may be generated for use in determining whether the user utterance matches a portion of the enrollment text.
The user utterance may be ignored if it does not match a portion of the enrollment text.
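The accept-or-ignore decision can be sketched under a stated assumption: each utterance has already been scored against both the enrollment grammar and the rejection grammar, with higher scores (for example, log-likelihoods) indicating a better fit. The comparison rule, the `margin` parameter, and both function names are assumptions for illustration, since the text does not specify how the two grammars are compared.

```python
def matches_enrollment_text(enrollment_score, rejection_score, margin=0.0):
    """Accept an utterance when the enrollment grammar explains it at
    least as well as the rejection grammar (higher score = better fit)."""
    return enrollment_score >= rejection_score + margin


def select_for_adaptation(scored_utterances, margin=0.0):
    """Keep only utterances judged to match the enrollment text.

    scored_utterances holds (utterance_id, enrollment_score,
    rejection_score) tuples; non-matching utterances are ignored.
    """
    return [
        utterance_id
        for utterance_id, enrollment_score, rejection_score in scored_utterances
        if matches_enrollment_text(enrollment_score, rejection_score, margin)
    ]
```

Only the utterances returned by `select_for_adaptation` would go on to update the acoustic models; the rest are dropped.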
In another general aspect, a user may be enrolled in a speech recognition system by recording the user's speech using a portable recording device and transferring the recorded speech to a computer. The computer then is used to analyze acoustic content of the recorded speech, identify, based on the analysis, portions of the speech that match portions of the enrollment text, and update acoustic models corresponding to matched portions of the enrollment text using acoustic content of matching portions of the speech. The recorded speech may skip or repeat portions of the enrollment text.
Other general aspects include obtaining data corresponding to enrollment text using a physically separate recording device, as well as designating an active portion of the enrollment text and analyzing acoustic content of an utterance relative to the active portion of the enrollment text.
Other features and advantages will be apparent from the following description, including the drawings and the claims.