1. Field of the Invention
The present invention relates to speech recognition and creation of a second generation session file for text that has been corrected using a software speech editor.
2. Background Information
Speech recognition programs that automatically convert speech into text have been under continuous development since the 1980s. The first programs required the speaker to speak with clear pauses between each word to help the program separate one word from the next. One example of such a program was DragonDictate, a discrete speech recognition program originally produced by Dragon Systems, Inc. (Newton, Mass.).
In 1994, Philips Dictation Systems of Vienna, Austria introduced the first commercial, continuous speech recognition system. See, Judith A. Markowitz, Using Speech Recognition (1996), pp. 200-06. Currently, the two most widely used off-the-shelf continuous speech recognition programs are Dragon NaturallySpeaking™ (now produced by ScanSoft, Inc., Peabody, Mass.) and IBM ViaVoice™ (manufactured by IBM, Armonk, N.Y.). The focus of the off-the-shelf Dragon NaturallySpeaking™ and IBM ViaVoice™ products has been direct dictation into the computer and correction by the user of misrecognized text. Both the Dragon NaturallySpeaking™ and IBM ViaVoice™ programs are available in a variety of languages and versions and have a software development kit (“SDK”) available for independent speech vendors.
Conventional continuous speech recognition programs are speaker dependent and require creation of an initial speech user profile by each speaker. This “enrollment” generally takes about a half-hour for each user. It usually includes calibration, text reading (dictation), and vocabulary selection. With calibration, the speaker adjusts the microphone output to ensure an adequate audio signal and minimal background noise. Then the speaker dictates a standard text provided by the program into a microphone connected to a handheld recorder or computer. The speech recognition program correlates the spoken word with the pre-selected text excerpt. It uses the correlation to establish an initial speech user profile based on that user's speech characteristics.
If the speaker uses different types of microphones or handheld recorders, an enrollment must be completed for each, since the acoustic characteristics of each input device differ substantially. In fact, it is recommended that a separate enrollment be performed on each computer having a different manufacturer's or type of sound card, because the different characteristics of the analog-to-digital conversion may substantially affect recognition accuracy. For this reason, many speech recognition manufacturers advocate a speaker's use of a single microphone that can digitize the analog signal external to the sound card, thereby obviating the problem of dictating at different computers with different sound cards.
Finally, the speaker must specify the reference vocabulary that will be used by the program in selecting the words to be transcribed. Various vocabularies like “General English,” “Medical,” “Legal,” and “Business” are usually available. Sometimes the program can add additional words from the user's documents or analyze these documents for word use frequency. Adding the user's words and analyzing the word use pattern can help the program better understand what words the speaker is most likely to use.
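By way of illustration only, the analysis of a user's documents for word use frequency may be sketched as follows. The function names `word_frequencies` and `out_of_vocabulary`, the tokenizing rule, and the sample vocabulary are assumptions made for this sketch; they do not describe how any particular speech recognition product performs the analysis.

```python
import re
from collections import Counter


def word_frequencies(documents):
    """Count how often each word appears across a user's documents."""
    counts = Counter()
    for text in documents:
        # Assumed tokenizing rule: lowercase alphabetic runs, allowing
        # internal apostrophes (e.g. "patient's").
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))
    return counts


def out_of_vocabulary(counts, vocabulary):
    """List words the user writes that the reference vocabulary lacks,
    most frequent first."""
    return [word for word, _ in counts.most_common() if word not in vocabulary]


if __name__ == "__main__":
    docs = ["The patient has bronchitis.", "Bronchitis was treated."]
    counts = word_frequencies(docs)
    # Hypothetical reference vocabulary that lacks the medical term.
    general = {"the", "patient", "has", "was", "treated"}
    print(counts["bronchitis"], out_of_vocabulary(counts, general))
```

In this sketch, words found in the user's documents but absent from the selected vocabulary are candidates for addition, and the counts indicate which words the speaker is most likely to use.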
Once enrollment is completed, the user may begin dictating into the speech recognition program or applications such as conventional word processors like MS Word™ (Microsoft Corporation, Redmond, Wash.) or WordPerfect™ (Corel Corporation, Ottawa, Ontario, Canada). Recognition accuracy is often low, for example, 60-70%. To improve accuracy, the user may repeat the process of reading a standard text provided by the speech recognition program. The speaker may also select a word and record the audio for that word into the speech recognition program. In addition, written-spokens may be created. The speaker selects a word that is often incorrectly transcribed and types in the word's phonetic pronunciation in a special speech recognition window.
Most commonly, “corrective adaptation” is used whereby the system learns from its mistakes. The user dictates into the system. It transcribes the text. The user corrects the misrecognized text in a special correction window. In addition to seeing the transcribed text, the speaker may listen to the aligned audio by selecting the desired text and depressing a play button provided by the speech recognition program. Listening to the audio, the speaker can make a determination as to whether the transcribed text matches the audio or whether the text has been misrecognized. With repeated correction, system accuracy often gradually improves, sometimes to as high as 95-98%. Even with 90% accuracy, the user must correct about one word per sentence, a process that slows down a busy dictating lawyer, physician, or business user. Due to the long training time and limited accuracy, many users have given up using speech recognition in frustration. Many current users are those who have no other choice, for example, persons who are unable to type, such as paraplegics or patients with severe repetitive stress disorder.
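The correction burden at a given accuracy follows from simple arithmetic, sketched below. The function name `expected_corrections` and the sentence length of ten words are assumptions made for illustration; the accuracy figures are those stated above.

```python
def expected_corrections(word_count, accuracy):
    """Expected number of misrecognized words in a dictation.

    accuracy is the fraction of words recognized correctly (0.0 to 1.0).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return word_count * (1.0 - accuracy)


if __name__ == "__main__":
    words_per_sentence = 10  # assumed average sentence length
    for accuracy in (0.60, 0.90, 0.98):
        errors = expected_corrections(words_per_sentence, accuracy)
        print(f"{accuracy:.0%} accuracy: about {errors:.1f} corrections per sentence")
```

Under the ten-word assumption, 90% accuracy corresponds to roughly one correction in every sentence, while an initial accuracy of 60% corresponds to roughly four.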
In the correction process, whether performed by the speaker or editor, it is important that verbatim text is used to correct the misrecognized text. Correction using the wrong word will incorrectly “teach” the system and result in decreased accuracy. Very often the verbatim text is substantially different from the final text for a printed report or document. Any experienced transcriptionist will attest that text frequently must be edited to correct errors that the speaker made or to make other changes necessary to improve grammar or content. For example, the speaker may say “left” when he or she meant “right,” or add extraneous instructions to the dictation that must be edited out, such as, “Please send a copy of this report to Mr. Smith.” Consequently, the final text often cannot be used as verbatim text to train the system.
With conventional speech recognition products, generation of verbatim text by an editor during “delegated correction” is often not easy or convenient. First, after a change is made in the speech recognition text processor, the audio-text alignment in the text may be lost. If a change was made to generate a final report or document, the editor does not have an easy way to play back the audio and hear what was said. Once the selected text in the speech recognition text window is changed, the audio-text alignment may not be maintained. For this reason, the editor often cannot select the corrected text and listen to the audio to generate the verbatim text necessary for training. Second, current and previous versions of off-the-shelf Dragon NaturallySpeaking™ and IBM ViaVoice™ SDK programs, for example, do not provide separate windows to prepare and separately save verbatim text and final text. If the verbatim text is entered into the text processor correction window, this is the text that appears in the application window for the final document or report, regardless of how different it is from the verbatim text. Similar problems may be found with products developed by independent speech vendors using, for example, the IBM ViaVoice™ speech recognition engine and providing for editing in commercially available word processors such as Word™ or WordPerfect™.
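To illustrate what maintaining audio-text alignment through a correction involves, the following is a minimal sketch only. It assumes each recognized word carries start and end times in seconds, and it uses the general-purpose sequence comparison in Python's standard `difflib` module. The function name `map_audio_to_verbatim` and the data layout are hypothetical; this is not the method of any product named above.

```python
from difflib import SequenceMatcher


def map_audio_to_verbatim(recognized, verbatim):
    """Carry audio time spans from recognized words over to verbatim words.

    recognized: list of (word, start_sec, end_sec) tuples (assumed layout).
    verbatim:   list of corrected words.
    Returns a list of (verbatim_text, start_sec, end_sec) tuples.
    """
    rec_words = [word for word, _, _ in recognized]
    matcher = SequenceMatcher(a=rec_words, b=verbatim, autojunk=False)
    mapping = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged words keep their own audio span.
            for k in range(i2 - i1):
                _, start, end = recognized[i1 + k]
                mapping.append((verbatim[j1 + k], start, end))
        elif tag == "replace":
            # Corrected words inherit the span of the words they replace.
            start, end = recognized[i1][1], recognized[i2 - 1][2]
            mapping.append((" ".join(verbatim[j1:j2]), start, end))
        elif tag == "insert":
            # Words with no recognized counterpart have no known span.
            mapping.append((" ".join(verbatim[j1:j2]), None, None))
        # "delete": recognized words absent from the verbatim text are dropped.
    return mapping


if __name__ == "__main__":
    recognized = [("the", 0.0, 0.2), ("patient", 0.2, 0.7), ("has", 0.7, 0.9),
                  ("a", 0.9, 1.0), ("cute", 1.0, 1.4), ("bronchitis", 1.4, 2.1)]
    verbatim = ["the", "patient", "has", "acute", "bronchitis"]
    for entry in map_audio_to_verbatim(recognized, verbatim):
        print(entry)
```

In the usage example, the misrecognized words “a cute” are corrected to “acute,” and the corrected word retains the combined audio span of the two words it replaced, so an editor could still select it and play back the audio.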
Another problem with conventional speech recognition programs is the large size of the session files. As noted above, session files include text and aligned audio. By opening a session file, the text appears in the application text processor window. If the speaker selects a word or phrase to play the associated audio, the audio can be played back using a hot key or button. For Dragon NaturallySpeaking™ and IBM ViaVoice™ SDK session files, the session files reach about a megabyte for every minute of dictation. For example, if the dictation is 30 minutes long, the resulting session file will be approximately 30 megabytes. These files cannot be substantially compressed using standard software techniques. Even if the task of correcting a session file could be delegated to an editor in another city, state, or country, there would be substantial bandwidth problems in transmitting the session file for correction by that editor. The problem is obviously compounded if there are multiple, long dictations to be sent. Until a sufficiently high-speed Internet connection or other transfer protocol becomes available, it may be difficult to transfer even a single dictation session file to a remote editor. A similar problem would be encountered in attempting to implement the remote editing features using the standard session files available in the Dragon NaturallySpeaking™ and IBM ViaVoice™ SDK.
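The scale of the transmission problem can be estimated with the short calculation sketched below. The figure of about one megabyte per minute of dictation is taken from the text above. The link speed of 56 kilobits per second, the decimal definition of a megabyte, and the function names are assumptions chosen for illustration, and protocol overhead is ignored.

```python
MB_PER_MINUTE = 1.0  # approximate session file growth stated in the text


def session_file_mb(dictation_minutes):
    """Approximate session file size in megabytes."""
    return dictation_minutes * MB_PER_MINUTE


def transfer_minutes(size_mb, link_kbps):
    """Ideal transfer time in minutes, ignoring protocol overhead.

    Assumes 1 MB = 1,000,000 bytes and 1 kbps = 1,000 bits per second.
    """
    bits = size_mb * 8_000_000
    return bits / (link_kbps * 1000) / 60


if __name__ == "__main__":
    size = session_file_mb(30)           # 30-minute dictation
    minutes = transfer_minutes(size, 56)  # assumed 56 kbps link
    print(f"{size:.0f} MB takes about {minutes:.1f} minutes at 56 kbps")
```

Under these assumptions, the 30 megabyte session file for a 30-minute dictation would take more than an hour to send, which is longer than the dictation itself.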
Another limitation concerns completion of forms using speech recognition. Currently, there are a variety of structured reporting formats available to the speech recognition user. These systems generally use a “fill-in-the-blank” format. The dictating user views a standard form displayed on the monitor, and skips from blank to blank, dictating the word or phrase to complete the form. This dictation is transcribed using real-time speech recognition, and the user usually can correct mistakes that the speech engine makes. At the end of the process, the user has a transcribed report or document that has been completed using a standard template. This type of structured reporting system requires that the user view the form on the screen and dictate directly into a microphone attached to the computer. Among other potential disadvantages, this approach is not practical for dictation and form completion using a telephone, handheld recorder, or other portable device. In these settings, the standard template typically is not displayed on a computer monitor, and the “blank” to be completed that corresponds to the audio cannot be selected by the user.
Accordingly, it is an object of the present invention to provide a system that offers training of the speech recognition program transparent to the end-users by performing an enrollment for them. It is an associated object to develop condensed session files for rapid transmission to remote editors. An additional associated object is to develop a convenient system for generation of verbatim text for speech recognition training through use of multiple linked windows in a text processor. It is another associated object to facilitate speech recognition training by use of a word mapping system for transcribed and verbatim text that has the effect of permanently aligning the audio with the verbatim text. Another associated object is to use speech recognition and text comparison to permit speakers to dictate into a telephone, handheld recorder, or other device without having to view the form on a screen, and to use comparison with a previously created form to speed up the back end completion of the report or document by a transcription editor for the busy dictating user.
These and other objects will be apparent to those of ordinary skill in the art having the present drawings, specifications, and claims before them.