The invention relates to position manipulation in speech recognition.
A speech recognition system analyzes a user's speech to determine what the user said. Most speech recognition systems are frame-based. In a frame-based system, a processor divides a signal descriptive of the speech to be recognized into a series of digital frames, each of which corresponds to a small time increment of the speech.
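By way of illustration only, the division of a signal into frames can be sketched as follows. The function name `split_into_frames`, the 10 ms frame length, and the decision to drop a trailing partial frame are assumptions made for the example and are not taken from the description above.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Divide a sampled signal into consecutive fixed-length frames.

    frame_ms is a hypothetical default; the text only says each frame
    covers "a small time increment" of the speech.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Assumption: a trailing partial frame is simply dropped.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


# 35 samples at 1000 Hz with 10 ms frames gives 10 samples per frame.
frames = split_into_frames(list(range(35)), sample_rate_hz=1000)
print(len(frames))
```

A practical front end would typically also compute a feature vector for each frame; that step is omitted here.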
A continuous speech recognition system can recognize spoken words or phrases regardless of whether the user pauses between them. By contrast, a discrete speech recognition system recognizes discrete words or phrases and requires the user to pause briefly after each discrete word or phrase. Continuous speech recognition systems typically have a higher incidence of recognition errors in comparison to discrete recognition systems due to complexities of recognizing continuous speech. A more detailed description of continuous speech recognition is provided in U.S. Pat. No. 5,202,952, entitled “LARGE-VOCABULARY CONTINUOUS SPEECH PREFILTERING AND PROCESSING SYSTEM,” which is incorporated by reference.
In general, the processor of a continuous speech recognition system analyzes “utterances” of speech. An utterance includes a variable number of frames and may correspond to a period of speech followed by a pause of at least a predetermined duration.
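The grouping of frames into utterances may be sketched, under stated assumptions, as follows. The sketch assumes that each frame has already been labeled as speech or silence, that the predetermined pause duration is expressed as a frame count (`min_pause_frames`, a hypothetical parameter), and that speech still in progress at the end of the input is closed as a final utterance.

```python
def split_utterances(frame_is_speech, min_pause_frames=3):
    """Return (start, end) frame ranges, end exclusive, one per utterance.

    An utterance ends once at least min_pause_frames consecutive
    non-speech frames follow it.
    """
    utterances = []
    start = None        # index of the first frame of the open utterance
    silence_run = 0     # consecutive silent frames seen inside it
    for i, is_speech in enumerate(frame_is_speech):
        if is_speech:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_pause_frames:
                # The utterance ends just after its last speech frame.
                utterances.append((start, i - silence_run + 1))
                start = None
                silence_run = 0
    if start is not None:
        # Assumption: input ended before a full pause was observed.
        utterances.append((start, len(frame_is_speech) - silence_run))
    return utterances


flags = [0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
print(split_utterances(flags))
```

Note that the single silent frame at index 3 is shorter than the required pause, so it does not split the first utterance.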
The processor determines what the user said by finding acoustic models that best match the digital frames of an utterance, and identifying text that corresponds to those acoustic models. An acoustic model may correspond to a word, phrase or command from a vocabulary. An acoustic model also may represent a sound, or phoneme, that corresponds to a portion of a word. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word. Acoustic models also may represent silence and various types of environmental noise.
The words or phrases corresponding to the best matching acoustic models are referred to as recognition candidates. The processor may produce a single recognition candidate (i.e., a single sequence of words or phrases) for an utterance, or may produce a list of recognition candidates.
Correction mechanisms for previous discrete speech recognition systems displayed a list of choices for each recognized word and permitted a user to correct a misrecognition by selecting a word from the list or typing the correct word. For example, DragonDictate for Windows, available from Dragon Systems, Inc. of Newton, Mass., displayed a list of numbered recognition candidates (“a choice list”) for each word spoken by the user, and inserted the best-scoring recognition candidate into the text being dictated by the user. If the best-scoring recognition candidate was incorrect, the user could select a recognition candidate from the choice list by saying “choose-N”, where “N” was the number associated with the correct candidate. If the correct word was not on the choice list, the user could refine the list, either by typing in the first few letters of the correct word, or by speaking words (e.g., “alpha”, “bravo”) associated with the first few letters. The user also could discard the incorrect recognition result by saying “scratch that”.
Dictating a new word implied acceptance of the previous recognition. If the user noticed a recognition error after dictating additional words, the user could say “Oops”, which would bring up a numbered list of previously-recognized words. The user could then choose a previously-recognized word by saying “word-N”, where “N” was a number associated with the word. The system would respond by displaying a choice list associated with the selected word and permitting the user to correct the word as described above.
In one general aspect, an action position in computer-implemented speech recognition is manipulated in response to received data representing a spoken command. The command includes a command identifier and a designation of at least one previously-spoken word. Speech recognition is performed on the data to identify the command identifier and the designation. Thereafter, an action position is established relative to the previously-spoken word based on the command identifier.
Implementations may include one or more of the following features. The designation may include a previously-spoken word or words, or may include a shorthand identifier for a previously-spoken selection or utterance (e.g., “that”).
The command identifier may indicate that the action position is to be before (e.g., “insert before”) or after (e.g., “insert after”) the previously-spoken word, words, or utterance. When this is the case, the action position may be established immediately prior to, or immediately following, the previously-spoken word, words, or utterance.
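As a minimal sketch of how an action position might be derived from such a command, the following example operates on already-recognized text rather than on audio. The helper names, the representation of the action position as a word index, and the choice to match the most recent occurrence of the designated words are all assumptions made for illustration.

```python
def find_last(words, phrase):
    """Index of the most recent occurrence of phrase in words, or -1."""
    n = len(phrase)
    for start in range(len(words) - n, -1, -1):
        if words[start:start + n] == phrase:
            return start
    return -1


def action_position(words, command):
    """Return the word index at which subsequent text would be inserted."""
    tokens = command.lower().split()
    if tokens[:2] == ["insert", "before"]:
        before = True
    elif tokens[:2] == ["insert", "after"]:
        before = False
    else:
        raise ValueError("unrecognized command identifier")
    phrase = tokens[2:]
    if not phrase:
        raise ValueError("command designates no previously-spoken words")
    start = find_last([w.lower() for w in words], phrase)
    if start < 0:
        raise ValueError("designated words were not previously spoken")
    # Before: position just ahead of the phrase. After: just past it.
    return start if before else start + len(phrase)


text = "the quick brown fox".split()
print(action_position(text, "insert before brown"))
print(action_position(text, "insert after quick brown"))
```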
The designation may include one or more previously-spoken words and one or more new words. In this case, any words following the previously-spoken words included in the command may be replaced by the new words included in the command. The action position then is established after the new words. This command may be implemented, for example, as a “resume with” command in which the words “resume with” are followed by one or more previously-recognized words and one or more new words.
The “resume with” command does not rely on the presentation of information on the display. For that reason, the command is particularly useful when the user records speech using a portable recording device, such as an analog or digital recorder, and subsequently transfers the recorded speech to the speech recognition system for processing. In that context, the “resume with” command provides the user with a simple and efficient way of redirecting the dictation and eliminating erroneously-spoken words.
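The effect of a “resume with” command on previously-recognized text can be sketched as follows. The sketch works on word lists after recognition. The rule used to divide the command into previously-recognized words and new words (the longest leading run of command words found in the text, matched at its most recent occurrence, with at least one new word remaining) is an assumption of this example and not a method stated above.

```python
def resume_with(text_words, command_words):
    """Apply the words that follow "resume with" to the dictated text.

    Returns (new_text_words, action_position), where action_position is
    the word index just past the newly inserted words.
    """
    # Try the longest leading run first, keeping at least one new word.
    for k in range(len(command_words) - 1, 0, -1):
        prefix = command_words[:k]
        for start in range(len(text_words) - k, -1, -1):
            if text_words[start:start + k] == prefix:
                new_words = command_words[k:]
                # Words after the matched run are replaced by the new words.
                result = text_words[:start + k] + new_words
                return result, len(result)
    raise ValueError("no previously-recognized words found in command")


text = "I will meet you today at noon".split()
command = "meet you tomorrow at nine".split()
new_text, position = resume_with(text, command)
print(" ".join(new_text), position)
```

In this example the words “today at noon” are discarded, and dictation would continue after “nine”.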
The data representing the command may be generated by recording the command using a recording device physically separate from a computer implementing the speech recognition. When the recording device is a digital recording device, the data may be in the form of a file generated by the digital recording device. The data also may be in the form of signals generated by playing back the spoken command using the recording device, such as when an analog recording device is used.
In another general aspect, a block of text is selected in computer-implemented speech recognition in response to data representing a spoken selection command. The command includes a command identifier and a text block identifier identifying a block of previously-recognized text. At least one word included in the block of text is not included in the text block identifier. Speech recognition is performed on the data to identify the command identifier and the text block identifier. Thereafter, the block of text corresponding to the text block identifier is selected.
Implementations may include one or more of the following features. The text block identifier may include at least a first previously-recognized word of the block of text and at least a last previously-recognized word of the block of text. For example, the command identifier may be “select” and the text block identifier may include the first previously-recognized word of the block of text, “through”, and the last previously-recognized word of the block of text (i.e., “select X through Y”). Alternatively, the text block identifier may be a shorthand notation (e.g., “that”) for a previously-spoken selection or utterance.
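A “select X through Y” command may be sketched, for illustration, as a lookup over previously-recognized words. The sketch assumes that X and Y are each a single word, that the earliest occurrence of X is used, and that the block ends at the first occurrence of Y at or after X; none of these choices is specified above.

```python
def select_block(words, command):
    """Return (start, end) word indices, end exclusive, of the block."""
    tokens = command.lower().split()
    if len(tokens) != 4 or tokens[0] != "select" or tokens[2] != "through":
        raise ValueError("expected: select X through Y")
    first, last = tokens[1], tokens[3]
    lowered = [w.lower() for w in words]
    # list.index raises ValueError if a word was not previously recognized.
    start = lowered.index(first)
    end = lowered.index(last, start)
    return start, end + 1


words = "The quick brown fox jumps".split()
start, end = select_block(words, "select quick through fox")
print(words[start:end])
```

The selected block includes “brown”, which does not appear in the text block identifier.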
Speech recognition may be performed using a constraint grammar. The constraint grammar may permit the block of text to start with any word in a set of previously-recognized words and to end with any word in the set of previously-recognized words. The set of previously-recognized words may include previously-recognized words displayed on a display device when the selection command is spoken.
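One highly simplified way to picture such a constraint grammar is as the set of all phrases of the form “select X through Y” in which X and Y are drawn from the displayed words. The enumeration below is an assumption made for illustration; an actual grammar would more likely be expressed as rules than as an explicit phrase list, and the function name is hypothetical.

```python
from itertools import product


def build_select_grammar(displayed_words):
    """Enumerate every "select X through Y" phrase the grammar permits."""
    # Deduplicate while preserving order; comparison is case-insensitive.
    vocab = list(dict.fromkeys(w.lower() for w in displayed_words))
    return {f"select {first} through {last}"
            for first, last in product(vocab, repeat=2)}


grammar = build_select_grammar(["The", "cat", "sat"])
print("select cat through sat" in grammar)
print("select dog through sat" in grammar)
```

Because any displayed word may start or end the block, this grammar on its own also admits phrases whose start word follows the end word in the text.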
Performing speech recognition may include generating multiple candidates for the text block identifier, and eliminating candidates for which the block of text starts with a previously-recognized word spoken after a previously-recognized word with which the block of text ends.
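The elimination step can be sketched as a filter over candidates that carry the positions of their start and end words. The `Candidate` type and its field names are hypothetical, and the positions are assumed to be word indices in the order in which the words were spoken.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    first_index: int   # position of the word that starts the block
    last_index: int    # position of the word that ends the block
    score: float


def eliminate_reversed(candidates):
    """Drop candidates whose start word was spoken after their end word."""
    return [c for c in candidates if c.first_index <= c.last_index]


candidates = [
    Candidate(first_index=2, last_index=5, score=41.0),
    Candidate(first_index=6, last_index=3, score=39.5),  # reversed
    Candidate(first_index=4, last_index=4, score=50.0),  # single word
]
kept = eliminate_reversed(candidates)
print([(c.first_index, c.last_index) for c in kept])
```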
Performing speech recognition may include associating a score with each of the multiple candidates. Generally, a score for a candidate is based on scores for components of the candidate. When components of different candidates are homophones, the scores for the candidates may be adjusted so that the portion of each score attributable to one of the homophones equals the score of the best-scoring one of the homophones.
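The homophone adjustment may be sketched as follows, under several assumptions that are not stated above: candidates are position-aligned lists of (word, score) components, the score of a candidate is the sum of its component scores, lower scores are better (so the best score is the minimum), and homophones are supplied as explicit sets. Pass `best=max` if higher scores are better.

```python
def adjust_homophone_scores(candidates, homophone_sets, best=min):
    """Return (components, total) per candidate after the adjustment."""
    if len({len(c) for c in candidates}) > 1:
        raise ValueError("this sketch assumes position-aligned candidates")

    def group_of(word):
        for group in homophone_sets:
            if word in group:
                return frozenset(group)
        return None  # the word has no listed homophones

    adjusted = [list(c) for c in candidates]
    for pos in range(len(adjusted[0]) if adjusted else 0):
        # Find the best score per homophone group at this position.
        best_by_group = {}
        for cand in adjusted:
            word, score = cand[pos]
            g = group_of(word)
            if g is not None:
                best_by_group[g] = best(best_by_group.get(g, score), score)
        # Give every homophone in the group that best score.
        for cand in adjusted:
            word, _ = cand[pos]
            g = group_of(word)
            if g is not None:
                cand[pos] = (word, best_by_group[g])
    return [(c, sum(s for _, s in c)) for c in adjusted]


candidates = [
    [("select", 10), ("there", 40)],
    [("select", 10), ("their", 55)],
]
result = adjust_homophone_scores(candidates, [{"there", "their", "they're"}])
print([total for _, total in result])
```

After the adjustment the two candidates tie, so the choice between “there” and “their” is no longer decided by acoustic scores that cannot meaningfully distinguish them.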
In another general aspect, a computer-based technique for use in working with text includes receiving a command including an utterance designating a portion of the text, performing speech recognition on the utterance to identify the portion of the text, and establishing an action position in the text at a location relative to the identified portion of the text, the location being determined by the command.