The invention relates generally to the field of speech recognition, and more specifically to a multi-source input and playback utility for a personal computer.
Since the advent of the personal computer, human interaction with the computer has been primarily through the keyboard. Typically, when a user wants to input information into a computer, he types the information on a keyboard attached to the computer. Other input devices have supplemented the keyboard, including the mouse, touch-screen displays, integrated pointer devices, and scanners. Use of these other input devices has decreased the amount of user time spent in entering data or commands into the computer.
Computer-based speech recognition systems have also been used for data or command input into personal computers. Speech recognition systems convert human speech into a format understood by the computer. When a computer is equipped with a voice or speech recognition system, data input may be performed by merely speaking the data into a computer input device. The speed at which the user can speak is typically faster than conventional data entry. Therefore, the inherent speed in disseminating data through human speech is an advantage of incorporating speech recognition systems into personal computers. The increased efficiency of users operating personal computers equipped with speech recognition systems has encouraged the use of such systems in the workplace. Many workers in a variety of industries now utilize speech recognition systems for numerous applications. For example, computer software programs utilizing speech recognition technologies have been created by Dragon Systems, Inc. (Newton, Mass.), IBM Corporation (Armonk, N.Y.), and Lernout and Hauspie (Burlington, Mass.). When a user reads a document aloud or dictates to a speech recognition program, the program may enter the user's spoken words directly into a word processing program or other application operating on a personal computer.
Generally, computer-based speech recognition programs convert human speech into a series of digitized frequencies. These frequencies are matched against a previously stored set of words or speech elements, called phonemes.
A phoneme is the smallest unit of speech that distinguishes one sound from another in a spoken language. Each phoneme may have one or more corresponding allophones. An allophone is an acoustic manifestation of a phoneme. A particular phoneme may have many allophones, each sounding slightly different due to the position of the phoneme in a word or variant pronunciations in a language of the same letter set. For example, the phoneme /b/ is pronounced differently in the words “boy” and “beyond.” Each pronunciation is an allophone of the phoneme /b/.
The utility processes these phonemes and converts them to text based on the most likely textual representation of the phoneme in a manner well known to those skilled in the art. The text is then displayed within a word processor or other application, such as a spreadsheet, database, web browser, or any program capable of receiving a voice input and converting it into display text or a program command. The multi-source input and playback utility may store the audio data. The audio data may be stored in a variety of formats on various storage media, including in volatile RAM, on long-term magnetic storage, or on optical media such as a CD-ROM. The audio data may be further compressed in order to minimize storage requirements. The utility may also link the stored audio data to the text generated by the audio data for future playback. When the computer determines correct matches for the series of frequencies, computer recognition of that portion of human speech is accomplished. The frequency matches are compiled until sufficient information is collected for the computer to react. The computer can then react to certain spoken words by storing the speech in a memory device, transcribing the speech as text in a document manipulable by a word processing program, or executing a command in an application program.
Natural speech input systems are expected to ultimately reach the marketplace. Such systems will not require the user to speak in any particular way for the computer to understand, but instead will be able to understand the difference between a user's command to the computer and information to be entered into the computer.
Lacking this technological advance, contemporary speech recognition systems are not completely reliable. Even with hardware and software modifications, the most proficient speech recognition systems attain no greater than 97-99% reliability. Internal and external factors may affect the reliability of speech recognition systems. Factors dependent upon the recognition technology itself include the finite set of words or phonemes inherent in the speaker's language, and the vocabulary of words to which the speech recognition software may compare the speaker's input. Environmental factors such as regional accents, external noise, and microphone quality may degrade the quality of the input, thus affecting the frequency of the user's words and introducing potential error into the word or phoneme matching.
Consequently, dictated documents transcribed by speech recognition software often contain recognition errors. Unlike typing errors, where simple mistakes such as the transposition of letters are easily identifiable and correctable, recognition errors are often more severe. Recognition errors typically are not the substitution or transposition of letters, but instead tend to be the wholesale substitution of similar-sounding words. For example, a classic speech recognition error is the transcription of the phrase “recognize speech” as “wreck a nice beach.” While these phrases sound similar, they have totally different meanings. Further, an editor proofreading a document containing this recognition error may not immediately recall the intended phrase, leading to unnecessary confusion.
Traditionally, users have attempted to minimize this confusion by reading words aloud as they proofread the document. This practice assists in identifying intended phrases, since the vocal similarities are apparent when the document is read aloud. However, where significant time elapses between dictating and editing a document, the user may forget what the intended phrase was.
Known current speech recognition products attempt to solve this problem by storing the dictation session as audio data, and linking the stored audio data to the individual transcribed words. Users may select single words or text sequences and request playback of the audio corresponding to the selected portion.
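The word-level linking described above can be pictured with a small sketch. It is illustrative only: the names `DictatedWord` and `audio_span_for_selection`, and the use of millisecond offsets into a single stored recording, are assumptions made for this example and are not drawn from any particular product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DictatedWord:
    """A transcribed word linked to a slice of the stored dictation audio."""
    text: str
    start_ms: int  # assumed: offset where the word begins in the recording
    end_ms: int    # assumed: offset where the word ends in the recording


def audio_span_for_selection(
    words: List[DictatedWord], first: int, last: int
) -> Tuple[int, int]:
    """Return the (start_ms, end_ms) audio slice covering words[first..last]."""
    if not 0 <= first <= last < len(words):
        raise IndexError("selection is outside the document")
    return words[first].start_ms, words[last].end_ms


# Hypothetical dictation session; the offsets are made up for illustration.
session = [
    DictatedWord("wreck", 0, 400),
    DictatedWord("a", 400, 500),
    DictatedWord("nice", 500, 900),
    DictatedWord("beach", 900, 1400),
]

# Selecting "nice beach" maps to one contiguous slice of the recording.
span = audio_span_for_selection(session, 2, 3)
```

Because every word in this sketch carries its own offsets, a selection of dictated words always resolves to a playable slice. The sketch has no way to represent a word that was never dictated.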
While this aids a user in recognizing the intended transcription, a severe problem arises in the event that the user has edited the document in the time between dictation and requesting audio playback. A user is then presented with the prospect of requesting playback for a portion of a document generated through mixed input sources.
For example, a user may have dictated “I wish my computer could recognize speech,” which the speech recognition system transcribed as “I wish my computer could wreck a nice beach.” If the user then types the word “really” between “I” and “wish,” the document has mixed input sources. Thus, when a user selects the sentence as it appears on the screen (“I really wish my computer could wreck a nice beach”) and requests playback, no audio data is linked to the word “really,” since it was typed and not dictated.
Known current speech recognition platforms disable the playback option in this situation. Instead, the speech recognition system returns an error message to the user, stating that playback is not available because audio data does not exist for all of the selected text. This forces a user to attempt to recall which portions of a document were typed and which dictated, and then reselect text accordingly. This solution is inherently frustrating, since it requires a user to attempt to recall a dictation session already unclear in the user's memory in order to access any audio playback whatsoever. Thus, there is a general need in the art for a method and system for reliably playing back audio in an intuitive format corresponding to a selected portion of a document. There is also a need for a method and system for filling in gaps in audio playback of a document wherein no audio data is available for portions of the document.
Generally stated, the invention is a multi-source input and playback utility for a personal computer. The multi-source input and playback utility accepts inputs from multiple input sources, converts these inputs into text, and displays the text on a display screen. When a user dictates text, the speech input is stored on a storage medium or in system memory as audio data. Text transcribed from speech input is linked to this stored audio data. Text transcribed from a writing tablet, or typed with a keyboard, has no link to any audio data. A user may edit the text as required through the use of a keyboard, mouse, or other input device. Typically, editorial changes are made by directly typing the changes into the text and so have no associated stored audio data.
The multi-source input and playback utility also vocalizes a text portion selected by a user. In the event that all of the selected text is linked to stored audio data, the audio data is played back. If a portion of the selected text has no associated audio data, then the utility retrieves a text-to-speech (“TTS”) audio entry and fills in any gaps in stored audio playback with the retrieved entry. Thus, where a user selects a multi-source text portion for playback, the text vocalization will consist of a mix of played-back audio data as available and text-to-speech introduced as necessary.
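The mixed playback just described might be sketched as follows. This is a minimal illustration under stated assumptions, not the utility's actual implementation: the `Word` type, the `build_playback_plan` function, the `"audio"` and `"tts"` step labels, and the millisecond offsets are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

AudioSpan = Tuple[int, int]  # assumed: (start_ms, end_ms) in the stored recording


@dataclass
class Word:
    text: str
    audio: Optional[AudioSpan] = None  # None means typed or written, not dictated


def build_playback_plan(
    selection: List[Word],
) -> List[Tuple[str, Union[AudioSpan, str]]]:
    """Plan one playback step per selected word.

    Dictated words replay their stored audio; words with no linked audio
    fall back to text-to-speech so that playback has no gaps.
    """
    plan: List[Tuple[str, Union[AudioSpan, str]]] = []
    for word in selection:
        if word.audio is not None:
            plan.append(("audio", word.audio))  # replay the speaker's own voice
        else:
            plan.append(("tts", word.text))     # synthesize the missing word
    return plan


# "I really wish": "really" was typed after dictation, so it has no audio link.
selection = [
    Word("I", (0, 200)),
    Word("really"),
    Word("wish", (200, 600)),
]
plan = build_playback_plan(selection)
```

In this sketch the selection is never rejected: a word without stored audio simply yields a text-to-speech step in place of a recorded one, so the whole selection can be vocalized in document order.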
The present invention meets the identified needs by providing a simple method for providing vocalization of text inputted through the use of multiple input methods, including non-speech inputs. By retrieving text-to-speech entries for words lacking any associated audio data, multi-source documents may be played aloud by a computer in their entirety rather than resorting to an error message. Further, continuous playback of all selected text minimizes user confusion otherwise caused by skipping non-dictated text portions.