1. Field of the Invention
The present invention relates to speech and language processing.
2. Background Information
Speech recognition programs include Dragon NaturallySpeaking® (ScanSoft, Inc., Peabody, Mass., now Nuance Communications, Inc.), IBM ViaVoice® (IBM, Armonk, N.Y.), and SpeechMagic® (Philips Speech Processing, Vienna, Austria). Microsoft® Speech Software Development Kit (Microsoft Corporation, Redmond, Wash.) includes Microsoft® Speech Application Programming Interface (SAPI) v.5.x (Microsoft Corporation, Redmond, Wash.) and speech recognition and text-to-speech engines. NaturalVoices® (AT&T®, New York, N.Y.) is another SAPI-compliant text-to-speech engine. Language Weaver (Marina del Rey, Calif.) is an example of machine translation using statistical, probabilistic models.
The speech recognition representational model may be termed a speech user profile and may consist of an acoustic model, language model, lexicon, and other speaker-related data. Other speech and language applications may share some or all of these components.
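As a rough illustration only, the components named above can be pictured as fields of a single record. The sketch below is a minimal Python data structure, assuming dictionary-based stand-ins for each component. The class name `SpeechUserProfile`, its field names, and the `components` helper are hypothetical and do not correspond to any particular product's profile format.

```python
from dataclasses import dataclass, field


@dataclass
class SpeechUserProfile:
    """Hypothetical container for one speaker's recognition data."""
    speaker_id: str
    acoustic_model: dict = field(default_factory=dict)      # stand-in for acoustic data
    language_model: dict = field(default_factory=dict)      # stand-in for word statistics
    lexicon: dict = field(default_factory=dict)             # word -> pronunciations
    other_speaker_data: dict = field(default_factory=dict)  # other speaker-related data

    def components(self, *names):
        """Return only the named components, e.g. for another application."""
        return {name: getattr(self, name) for name in names}


profile = SpeechUserProfile(
    speaker_id="speaker_1",
    lexicon={"fever": ["F IY V ER"]},
)

# A hypothetical second application might reuse just the lexicon.
shared_view = profile.components("lexicon")
print(sorted(shared_view))
```

Selecting components by name reflects the point that another speech or language application may need only some parts of the profile.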
Most commonly, speech recognition is used for large-vocabulary, free-form, continuous dictation of letters, reports, or other documents. Some court reporters and other transcriptionists redictate speech input using real-time speech recognition. Redictation in the transcriber's voice may be recognized more accurately than the primary speaker's speech input, and it may reduce keystrokes and the risk of carpal tunnel syndrome. With structured dictation using data categories or fill-in-the-blank forms, a speaker may also use speech recognition to enter text into fields or blanks in a form.
Speech recognition may also be used for synchronizing audio and text data, e.g., in the form of electronic files, representing audio and text expressions of the same information. See Heckerman et al., “Methods and Apparatus for Automatically Synchronizing Electronic Audio Files with Electronic Text Files,” U.S. Pat. No. 6,260,011 B1, issued Jul. 10, 2001.
While speech and language pattern recognition technologies are common, manual techniques are still widely used. Examples include manual transcription of dictation or handwritten notes with a word processor, court reporting or real-time television captioning with a steno machine designed for rapid transcription, and manual translation by a trained professional. Steno machines are available from a variety of manufacturers, including Stenograph, L.L.C. (Mount Prospect, Ill.).
One problem with prior speech recognition options is that they do not provide effective methods for an operator, e.g., a second speaker, to correct another speaker's pattern recognition results, e.g., speech recognition text, using the same or a different pattern recognition program while saving training data for the respective speech user profiles of the first and second speakers. For instance, currently, when a second, redictating speaker corrects, modifies, or appends to text using speech recognition in a session file created by another user, the second speaker may open the original session file in the speech recognition application, select his or her own speech user profile, dictate the correction, and save the text changes. The corrected session file then has the first speaker's speech input aligned to the corrected text, so this audio-aligned text cannot be used to train the second speaker's speech user profile. If the second speaker instead opens the first speaker's speech user profile to dictate corrections, use of the newly dictated audio-aligned text as training data would degrade the first speaker's profile. Consequently, in the prior art, one speech recognition user cannot effectively use speech recognition to correct the speech recognition dictation of another speaker. The operator must follow other strategies, e.g., creating a text file of the recognized text from the first speaker and opening it in the speech recognition user interface.
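The mismatch described above can be illustrated with a small Python sketch. It is a simplified model for illustration only, assuming a session file is a list of segments that each pair stored audio with text. The names `Segment`, `correct_text`, and `training_pairs`, and the sample strings, are hypothetical and are not taken from any speech recognition product.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One audio-aligned unit of a hypothetical session file."""
    audio_speaker: str  # whose voice is in the stored audio
    audio: bytes        # stand-in for the recorded speech
    text: str           # text currently aligned to that audio


def correct_text(segment, new_text):
    """Prior-art style correction: only the text changes, the audio stays."""
    return Segment(segment.audio_speaker, segment.audio, new_text)


def training_pairs(segments, speaker):
    """Collect (audio, text) pairs whose audio was spoken by `speaker`."""
    return [(s.audio, s.text) for s in segments if s.audio_speaker == speaker]


original = Segment("speaker_1", b"<speaker 1 audio>", "the patient has a favor")
corrected = correct_text(original, "the patient has a fever")

# The corrected text is still aligned to speaker 1's audio, so the session
# file yields no audio-aligned training data for the correcting speaker 2.
pairs_1 = training_pairs([corrected], "speaker_1")
pairs_2 = training_pairs([corrected], "speaker_2")
print(len(pairs_1), len(pairs_2))
```

In this model, the second speaker's dictated correction leaves no trace in the session file except the changed text, which is why no training pair exists for that speaker.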
Accordingly, a technique is needed that supports creation of training data for both users and otherwise supports modification of a session file with a speech recognition, text-to-speech, or other pattern recognition program.
Another limitation of the prior art concerns changing or modifying nontext components of a session file, for example audio. Using a typical speech recognition or text-to-speech application, a user cannot change, modify, or substitute the audio where the original audio is of poor quality and the session file is being accessed for its audio and not its text content. For example, a blind user may listen to session file audio on a local computer, or a remote user may access a session file by telephone for playback of dictation. In these circumstances, it would be desirable to replace poor quality audio with a recording of a human voice, synthetic speech from a text-to-speech application, or audio enhanced with noise reduction, voice enhancement, or other similar techniques.
Another problem with prior speech recognition options concerns structured dictation, e.g., where a speaker is directed to dictate “name,” “date,” or other specified information. With structured entry, the document, the data, or both may be saved. Structured dictation may also be part of a document assembly program that includes dialogs for selection from alternative boilerplate or other text. Different off-the-shelf programs will extract stored data and generate web-accessible and other electronic reports with searchable fields for health care, law, business, insurance, and other activities. See, e.g., Crystal Reports (Business Objects SA, Paris, France).
As with free-form dictation, prior speech recognition programs do not provide the ability to easily gather training data for both a primary speaker and a secondary, correcting speaker. Among other potential problems, the graphical user interfaces of off-the-shelf speech recognition programs do not support easy end-user creation of structured dictation forms for completion by data category that would permit the ordinary end user to use the speech recognition or text-to-speech annotation techniques disclosed herein. For example, with Dragon® NaturallySpeaking®, forms creation for speech recognition requires extensive knowledge of the speech recognition application and the available software development kit.
Moreover, alignment of pre-existing text to audio has been inefficient using speech recognition. Opportunities to synchronize the text of books, lecture notes, speeches, board meeting minutes, courtroom presentations, and other materials with speech input are not fully exploited because of limitations of conventional speech recognition. These include the failure to support second-speaker correction, the failure to save training data for both the primary speaker and the secondary, correcting speaker, the need for considerable speech recognition training and correction time, and the difficulty of aligning audio and text in complex electronic files that include verbatim and nonverbatim text and other nondictated elements, such as punctuation (periods, commas, colons, and quotation marks), tables of contents, bibliographies, indexes, page numbers, graphics, and images.