1. Field of the Invention
The present invention relates to speech and language processing.
2. Background Information
Speech and language technologies use pattern recognition approaches found in a variety of applications. Generally, there are separate programs that perform various captures of speech, audio, text, image, video, or other data, boundary definition of a segment, region, volume, or space of interest, elimination of unneeded data, feature extraction, comparison with stored representational models, and conversion, analysis, or interpretation of extracted features. See, e.g., Andrew R. Webb, Statistical Pattern Recognition (2nd ed. 2002).
Speech and language processing includes speech recognition for dictation, command and control (voice activation), interactive voice response in telephony, and text-based or phoneme-based audio mining (word spotting); speaker recognition for detection, identification, or verification; text to speech; phonetic generation; natural language understanding; and machine translation. The various speech and language applications frequently use common or similar representational models and software algorithms. See, e.g., Lawrence Rabiner & Biing-Hwang Juang, Fundamentals of Speech Recognition (1993); Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing (2001); Daniel Jurafsky & James H. Martin, Speech and Language Processing (2000).
For speech recognition, the representational model may be termed a speech user profile, or speaker model, and may consist of an acoustic model, language model, lexicon, and other speaker-related data. Other types of speech and language applications may share some or all of these components of the speech user profile.
Speech input reflects speaker-specific differences, such as physical vocal-tract size, age, sex, dialect, health, education, emotion, and personal style, including word use and expression. Recorded speech also reflects physical characteristics of the speech wave, as well as recording device, background noise, audio file format, and postprocessing artifacts.
The speech wave is composed of smaller packets of sounds. Words are separated by short pauses. Longer pauses separate phrases or sentences, which are often termed “utterances.” In general, utterances may contain dozens or more phonemes. Phonemes are sound-subunits that convey different meanings, such as the perceptually different initial sounds in “mop,” “hop,” and “top.” Phonemes may be further subdivided into a series of triphones that express phoneme variability based upon left and right context. Prosody refers to aspects of pronunciation not described by the text sequence of phonemes, such as stress, rhythm, and pitch.
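The relationship between phonemes and context-dependent triphones can be sketched briefly. The following is a minimal illustration only: the `left-center+right` label notation, the `sil` boundary symbol, and the phoneme symbols are assumptions chosen for the example and are not taken from any particular speech engine.

```python
def to_triphones(phonemes, boundary="sil"):
    """Expand a phoneme sequence into triphone labels.

    Each label names a center phoneme together with its left and right
    neighbors. The boundary symbol (assumed here to be "sil") pads the
    start and end of the sequence.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        "{}-{}+{}".format(padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical phoneme symbols for the word "mop".
print(to_triphones(["m", "aa", "p"]))
```

Because each phoneme yields a distinct label for every pair of neighbors, the number of possible triphones grows roughly with the cube of the phoneme inventory, which is one source of the data sparsity discussed later in this section.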
The acoustic model used in speech-to-text decoding and other speech and language processing commonly represents a collection of probabilistic models for the audio characteristics of words, or small speech units, such as syllables or demisyllables, phonemes, triphones, or other word-subunits. The acoustic properties of speech units vary depending upon the position in a word, sentence, speech rate, or other factors. Improved performance can be obtained by modeling the variation explicitly, such as with context-dependent or speech-rate dependent models.
The speech input is commonly entered into a graphical user interface or other application and segmented in a signal processing stage prior to conversion, interpretation, or analysis. After segmentation, acoustic features may be extracted by observation through a finite-length analysis window that is regularly shifted along the speech sample and processed to produce a sequence of acoustic vectors that define the time evolution of the speech signal.
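The sliding analysis window described above may be sketched as follows. This is a simplified illustration, not a production front end: the function names, the window and shift sizes, and the use of log energy as the only feature are assumptions for the example. Practical systems typically compute richer features such as cepstral coefficients.

```python
import math


def frame_signal(samples, window_len, shift):
    """Split samples into overlapping frames, advancing by `shift` each time."""
    return [
        samples[start:start + window_len]
        for start in range(0, len(samples) - window_len + 1, shift)
    ]


def log_energy(frame):
    """One toy acoustic feature: the log of the frame's energy."""
    energy = sum(x * x for x in frame)
    return math.log(energy + 1e-10)  # small floor avoids log(0) on silence


def extract_features(samples, window_len=4, shift=2):
    """Return one (single-element) acoustic vector per analysis window."""
    return [[log_energy(f)] for f in frame_signal(samples, window_len, shift)]


vectors = extract_features(list(range(10)))
print(len(vectors))  # number of acoustic vectors produced
```

The resulting list of vectors is ordered in time, so it describes how the chosen feature evolves over the course of the speech sample.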
With speech recognition, an acoustic decoding block searches for the sequence of words whose corresponding sequence of acoustic models is closest to the observed sequence of acoustic vectors. This search is constrained by a language model (e.g., a grammar) and a lexicon. During the decoding phase, the acoustic data is typically processed with Hidden Markov Models, probabilistic methods that include parameters defined for states, transition probabilities, and observation likelihoods.
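A search over Hidden Markov Model states is commonly carried out with the Viterbi algorithm. The sketch below is a toy version under stated assumptions: observations are discrete symbols rather than continuous acoustic vectors, the state names, symbols, and probabilities are invented for illustration, and no language model or lexicon constraint is applied.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a discrete-observation HMM."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)

    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(best) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    path.reverse()
    return path


# Hypothetical two-state model with three quantized acoustic symbols.
states = ["ph_a", "ph_b"]
start_p = {"ph_a": 0.6, "ph_b": 0.4}
trans_p = {"ph_a": {"ph_a": 0.7, "ph_b": 0.3},
           "ph_b": {"ph_a": 0.4, "ph_b": 0.6}}
emit_p = {"ph_a": {"v1": 0.5, "v2": 0.4, "v3": 0.1},
          "ph_b": {"v1": 0.1, "v2": 0.3, "v3": 0.6}}

print(viterbi(["v1", "v2", "v3"], states, start_p, trans_p, emit_p))
```

The three parameter tables correspond to the state, transition, and observation parameters named above. A practical decoder would work with log probabilities to avoid numerical underflow on long inputs.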
After speech recognition decoding for dictation, the speech engine output undergoes postprocessing whereby results are commonly converted to a user-compatible format. For instance, output text “January seventeenth <comma> two thousand five” may be postprocessed to “Jan. 17, 2005” using regular-expression algorithms or similar techniques. A session file consisting of audio-aligned text and other data may be assembled at this stage. After postprocessing, the results are displayed in a graphical user interface or may be used to activate other applications. Other speech and language applications may process speech and other language input using a similar, layered process.
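The date example above can be reproduced with a small regular-expression sketch. The lookup tables here are deliberately incomplete and the pattern handles only years of the form “two thousand N”; the names `MONTHS`, `ORDINALS`, `UNITS`, and `postprocess` are assumptions for illustration.

```python
import re

# Deliberately tiny lookup tables; a real postprocessor would cover all cases.
MONTHS = {"January": "Jan.", "February": "Feb."}
ORDINALS = {"first": 1, "seventeenth": 17}
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}

DATE_RE = re.compile(
    r"\b(?P<month>" + "|".join(MONTHS) + r") "
    r"(?P<day>" + "|".join(ORDINALS) + r") <comma> "
    r"two thousand (?P<unit>" + "|".join(UNITS) + r")\b"
)


def format_date(match):
    """Build the display form from the named groups of a date match."""
    return "{} {}, {}".format(
        MONTHS[match.group("month")],
        ORDINALS[match.group("day")],
        2000 + UNITS[match.group("unit")],
    )


def postprocess(text):
    """Replace every spoken-form date in the engine output with display form."""
    return DATE_RE.sub(format_date, text)


print(postprocess("January seventeenth <comma> two thousand five"))
```

Text that does not match the pattern passes through unchanged, so additional rules can be layered on in sequence.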
In speech recognition for dictation, confidence scores are assigned to alternative hypotheses as to what the speaker said during speech-to-text decoding. The display text in the graphical user interface in a speech recognition system for dictation represents a “best result” based upon the highest-scoring hypothesis. Alternative hypotheses may be listed in a drop-down window or other dialog for user reference and typically may be selected for substitution in the display text. Accuracy may be improved by exploiting differences in the nature of errors made by multiple speech recognition systems through an automatic rescoring or “voting” process that selects an output sequence with the lowest score to identify the correct word. See, e.g., Jonathan G. Fiscus, “A Post-Processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER),” pp. 347-354 0-7805-3695-4/97/$10.00 © 1997 IEEE.
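The general idea of combining outputs from several recognizers can be illustrated with a simple majority vote. This sketch is a simplification and is not the scoring used in the cited reference: it assumes the hypotheses have already been aligned into slots of equal length, with an empty string marking a slot where a recognizer produced no word, and it ignores confidence scores.

```python
from collections import Counter


def vote(aligned_hypotheses):
    """Pick the most frequent word in each aligned slot.

    `aligned_hypotheses` is a list of equal-length word lists, one per
    recognizer. An empty string means "no word in this slot".
    """
    result = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word:  # drop slots where the majority produced no word
            result.append(word)
    return result


hypotheses = [
    ["the", "cat", "sat"],
    ["the", "cap", "sat"],
    ["a", "cat", "sat"],
]
print(vote(hypotheses))
```

Each recognizer errs in a different slot in this example, so the vote recovers a sequence that no single system is required to have produced in full.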
Many hours of computational time are required to create an acoustic model. For a large set of training data, which may consist of several hundred hours of speech, single-processor training may take months to complete. Consequently, parallel training with distributed computing techniques on multiple computers has frequently been used to reduce production time. During the training process, update equations for state-transition and state-observation probability distributions and other computed values may be modified to permit parallel training on multiple processors. After completion of the processing on the individual computers, accumulator files from the different processors may be combined to create a final acoustic model. See, e.g., Institute for Signal Information and Processing, Department of Electrical and Computer Engineering, Mississippi State University, “Acoustic Modeling: Parallel Training” (website tutorial 2005). However, due to exacting requirements for data synchronization, parallel processing is not within the reach of the typical end user or independent speech developer.
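The accumulator approach can be illustrated for the simplest possible case, the mean of a single one-dimensional Gaussian state. In this sketch each worker reports only a count and a sum for its share of the data, and the partial results are merged afterward. The function names and data are assumptions for the example; real accumulator files also carry second-order statistics and state-occupancy weights.

```python
def accumulate(frames):
    """One worker's pass over its data shard: return (count, sum)."""
    return (len(frames), sum(frames))


def combine(accumulators):
    """Merge per-worker accumulators into a single mean estimate."""
    total_count = sum(count for count, _ in accumulators)
    total_sum = sum(total for _, total in accumulators)
    return total_sum / total_count


# Three hypothetical shards, as if processed on three separate machines.
shards = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
partials = [accumulate(shard) for shard in shards]
print(combine(partials))
```

Because counts and sums are additive, the combined estimate equals the one that a single processor would compute over all of the data, regardless of how the data is divided.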
The acoustic model may be trained using supervised or unsupervised techniques. In supervised training, word output corresponding to the speech input is known. In unsupervised techniques, the known word output is not submitted. Training occurs through use of confidence scoring or other techniques. Forced alignment is a form of supervised training whereby segmented speech input, verbatim text, and phonetic pronunciation are submitted to the speech engine for feature extraction and Hidden Markov Model processing. During the training, examples of each word or word subunit are presented to the speech engine to generate a statistical model for each subunit modeling the distribution of the acoustic vectors. With Hidden Markov Models, word models have been used for recognizing strings of connected digits. See Lawrence R. Rabiner, Jay G. Wilpon, Frank K. Soong, “High Performance Connected Digit Recognition Using Hidden Markov Models,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 8, August 1989 0096-3518/89/0800-1214$01.00 © 1989 IEEE. However, word models have not generally been used for large vocabulary tasks, such as large vocabulary continuous speech recognition, because the data to train one word is not shared by others.
Selection of the basic speech unit may represent a tradeoff between trainability and specificity. The more specific acoustic models, such as those based upon word models, tend to perform better than the more general models, based upon triphones or other word subunits, but, because of their specificity, the acoustic data may occur relatively rarely and may be difficult to train well. General models may be trained well, because of the wide occurrence of phonemes or triphones in a language, but are less likely to provide a good match to any particular token.
Consequently, a small set of units, such as phonemes or triphones, can be used to cover all possible utterances, but the data available to train each model may be small. This sparse data problem can be handled by training models only for those units for which training data are available, or by merging contexts to increase the training data per model and reduce the number of models applied during recognition.
Commonly, in cases of triphone sparsity, linguistic questions may be used to group triphones into a statistically estimable number of clusters for creation of the acoustic model. This technique is frequently referred to as “state-tying.” While a professional linguist usually prepares the linguistic questions, automated generation of linguistic questions has been described. The methods require creation of one or more acoustic models and testing of accuracy to determine the best automatically generated question set. See R. Singh, B. Raj, and R. M. Stern, “Automatic Clustering and Generation of Contextual Questions for Tied States in Hidden Markov Models,” Research sponsored by the Department of the Navy, Naval Research Laboratory under Grant No. N00014-93-1-2005 (ICASSP 1999) and K. Beulen and H. Ney, “Automatic Question Generation for Decision Tree Based State Tying,” pp. 805-808 0-7803-4428-6/98/$10.00 © 1998 IEEE.
During the decoding phase, the search for the best sequence of words using the acoustic model is constrained by the language model and lexicon. The language model defines the most likely sequence of words based upon prior experience. This is based upon a frequency analysis of word combinations in a specified text corpus. While trigram modeling is common, other N-gram techniques may be used. The language model may provide information on the probability of a word or phrase following another, possible branching patterns given a word or phrase, word use frequency, and likelihood that a particular trigram or other N-gram is present in the text corpus.
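The frequency analysis behind a trigram language model can be sketched with simple counting. This is a minimal, unsmoothed maximum-likelihood estimate over an invented three-sentence corpus; the `<s>` and `</s>` padding symbols and function names are assumptions for the example, and practical language models add smoothing or backoff for unseen word combinations.

```python
from collections import Counter


def train_trigrams(corpus_sentences):
    """Count trigrams and their two-word histories in a list of sentences."""
    trigrams, histories = Counter(), Counter()
    for sentence in corpus_sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(words)):
            histories[(words[i - 2], words[i - 1])] += 1
            trigrams[(words[i - 2], words[i - 1], words[i])] += 1
    return trigrams, histories


def trigram_prob(trigrams, histories, w1, w2, w3):
    """Estimate P(w3 | w1 w2) as count(w1 w2 w3) / count(w1 w2)."""
    seen = histories[(w1, w2)]
    return trigrams[(w1, w2, w3)] / seen if seen else 0.0


corpus = [
    "the patient is stable",
    "the patient is improving",
    "the patient was seen",
]
tri, hist = train_trigrams(corpus)
print(trigram_prob(tri, hist, "the", "patient", "is"))
```

In this corpus “the patient” is followed by “is” in two of three sentences, so the estimate for that trigram is two thirds, while a word sequence never seen in the corpus receives a probability of zero.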
In preparation of the language model, text from day-to-day dictation may be converted to a speech engine compatible format. For instance, punctuation may be expressed as tokens, such as <comma> or <period>. During decoding, the lexicon defines the recognizable words and typically includes a phonetic representation for each word in the language model with a single symbol for each distinctive speech sound. See, e.g., International Phonetic Association (IPA), A Guide to the Use of the International Phonetic Alphabet (1999). The SAPI Universal Phone Set (UPS) is one example of a machine-readable phone set based on the IPA pronunciation. Microsoft® UPSWhitePaper3.2 © 2004.
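Conversion of ordinary text to a tokenized, engine-compatible form might look like the sketch below. It handles only commas and periods, and the mapping table and function name are assumptions for illustration. It would mishandle abbreviations such as “Dr.” and decimal numbers, which a practical converter would need to treat separately.

```python
import re

# Assumed token spellings, following the <comma> and <period> examples above.
PUNCT_TOKENS = {",": "<comma>", ".": "<period>"}


def to_engine_format(text):
    """Replace commas and periods with spoken-form tokens."""
    def replace(match):
        # Surround the token with spaces so it becomes a separate word.
        return " " + PUNCT_TOKENS[match.group(0)] + " "

    tokenized = re.sub(r"[,.]", replace, text)
    return " ".join(tokenized.split())  # collapse runs of whitespace


print(to_engine_format("The lungs are clear, and the heart is normal."))
```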
Selection of the basic acoustic unit, e.g., triphone, phoneme, or word, is reflected in the lexicon. If phonemes are used, for example, a word is represented phonetically as a series of phonemes separated by a space. If word models are the basic units, a word is represented as a series of characters without a space.
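The two lexicon conventions described above can be shown side by side. The phoneme symbols in this sketch are hypothetical and are not drawn from any particular phone set; the dictionary and function names are likewise assumptions for illustration.

```python
# Phoneme-based entries: units separated by a space (symbols are hypothetical).
PHONEME_LEXICON = {"mop": "m aa p", "hop": "hh aa p", "top": "t aa p"}

# Word-model entries: the whole word is a single unit, with no spaces.
WORD_LEXICON = {"mop": "mop", "hop": "hop", "top": "top"}


def units(lexicon, word):
    """Return the basic acoustic units the lexicon assigns to a word."""
    return lexicon[word].split()


print(units(PHONEME_LEXICON, "mop"))  # three phoneme units
print(units(WORD_LEXICON, "mop"))     # one word unit
```

Splitting on spaces recovers the basic units in either case, so the same lookup routine can serve both kinds of lexicon.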
The lexicon is usually prepared by a phonetic expert using the IPA representation or other system, or is derived from pronunciation data from the Linguistic Data Consortium (University of Pennsylvania, Philadelphia, Pa.) or similar sources. The LDC data does not offer pronunciation data for every particular language, dialect, or specialized vocabulary, such as medicine, engineering, law, and others. A lexicon may include pronunciation for more than one language for multilingual speakers. See, e.g., Giorgio Micca, Enrico Palme, Alessandra Frasca, “Multilingual Vocabularies in Automatic Speech Recognition,” (MIST-1999).
Speech recognition and other speech and language processing may be speaker-independent, speaker-adaptive (speaker-independent systems made more speaker-dependent by speaker adaptation), or speaker-dependent. Speaker-dependent systems are based upon training data sets from a single speaker that reflect the speaker's speaking style and vocabulary, specific recording device, and local background noise. They are highly accurate and reliable for that single speaker, but are infrequently used. See, e.g., Yu Shi and Eric Chang, “Studies in Massively Speaker-Specific Speech Recognition,” pp. I-825-828 0-7893-8484-9/04/$20.00 © IEEE ICASSP 2004.
Speaker-independent technology is commonly used for small vocabulary tasks. Speaker-independent systems should not require any training by the end user to work.
Speaker-adaptive systems are frequently used for large vocabulary continuous speech recognition, such as that arising from dictation. Sometimes these speaker-adaptive systems are referred to as “speaker-dependent.” Unlike a speaker-dependent system, which requires data only from a single speaker, speaker-independent or speaker-adaptive technology requires many hours of data from many speakers to create an accurate composite speaker-independent model. With speaker-adaptive systems, the composite model is adapted to a particular end user's speaking style by an enrollment and ongoing correction of recognition errors.
The enrollment may consist of reading from a prepared script to adapt the speaker-independent model to the speaker's speaking style, input device, and background noise. Correction of recognition errors is often done by selecting the audio-aligned text from the temporary buffer file and entering verbatim, corrected text into a pop-up window in the speech recognition text processor. Unfortunately, speaker-adaptive systems also show large performance fluctuations among some new speakers. This is frequently due to mismatches between the composite speech user profile and speaker accent, style, input device, and background noise.
In speaker-adaptive systems, acoustic model updates are commonly performed with relatively small amounts of data compared to the many hours of speech from different speakers initially used to create the speaker-independent model. In this setting, data-sparse techniques such as maximum likelihood linear regression (MLLR), maximum a posteriori (MAP) estimation, and other approximation methods have been used for adaptive training with the limited data. The purpose of these speaker-adaptive methods is to use as small an amount of adaptation data as possible to calibrate and change the recognition system so that it models as much of the speaker-specific information as possible. See, e.g., P. C. Woodland, “Speaker Adaptation for Continuous Density HMMs: A Review,” ISCA Archive (http://www.isca-speech.org/archive), ITRW on Adaptation Methods for Speech Recognition, Sophia Antipolis, France, Aug. 29-30, 2001, pp. 1-19.
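How a small amount of adaptation data shifts a speaker-independent model can be illustrated with the commonly used MAP update for a single Gaussian mean. This is a one-dimensional sketch under stated assumptions: `tau` is a prior weight chosen arbitrarily for the example, every frame is assumed to belong fully to the state being adapted, and variances and mixture weights are left untouched.

```python
def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """MAP estimate of a Gaussian mean.

    The result interpolates between the speaker-independent prior mean and
    the adaptation data, with `tau` acting as the weight of the prior:
        (tau * prior_mean + sum(frames)) / (tau + number_of_frames)
    """
    n = len(adaptation_frames)
    return (tau * prior_mean + sum(adaptation_frames)) / (tau + n)


# Hypothetical values: prior mean 0.0, new speaker's frames centered on 2.0.
print(map_adapt_mean(0.0, [2.0] * 10))    # limited data: partial shift
print(map_adapt_mean(0.0, [2.0] * 1000))  # ample data: near the speaker's mean
```

With no adaptation frames the function returns the prior mean unchanged, and as more speaker data arrives the estimate moves progressively toward the speaker's own statistics.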
Examples of real-time speaker-adaptive programs for dictation include Dragon NaturallySpeaking® (ScanSoft, Inc., Peabody, Mass.) and IBM ViaVoice® (IBM, Armonk, N.Y.). Dragon NaturallySpeaking® Systems Server or SpeechMagic® (Philips Speech Processing, Vienna, Austria) also provide offline, server-based solutions.
Despite improvements in technology, there has been limited adoption of speaker-adaptive speech recognition. This apparently reflects frustration with training time and frequent and inevitable recognition errors.
In view of limited use of adaptive speech recognition systems, assignee of the current invention has previously disclosed methods for text comparison to generate verbatim text for speech user training that improve upon prior systems. See, U.S. Pat. No. 6,704,709 “System and Method for Improving the Accuracy of a Speech Recognition Program,” U.S. Pat. No. 6,490,598 “System and Method for Improving the Accuracy of a Speech Recognition Program Through Repetitive Training,” U.S. Pat. No. 6,122,614, “System and Method for Automating Transcription Services.” Even with these improvements, there are unmet needs to further improve speech and language processing.
This may be why speech recognition is still less commonly used than manual transcription for free-form dictation recorded by microphone, personal digital assistant or other handheld recorder, or telephone. There an operator plays back the recorded audio with a foot pedal and transcribes into a word processor, such as Word™ (Microsoft Corporation, Redmond, Wash.), WordPerfect® (Corel Corporation, Ottawa, Canada), or StarOffice™ (Sun Microsystems, Inc., Palo Alto, Calif.). In some cases, doctors, lawyers, and others may also use “structured reporting” whereby the speaker dictates, segment by segment, into a “fill-in-the-blank” form that is transcribed manually. As a result of its continued prevalence, manual transcription generates a large amount of text from recorded audio both from free-form dictation and structured reporting.
Yet, the dictation audio that could be used to create a highly accurate speaker-dependent speaker model, or other representational speaker model, is generally discarded as a useless byproduct of the manual dictation-to-text process. The development of speaker-dependent applications has not been pursued because there is no cost-effective system to acquire and process hours of speaker-dependent data available from business or professional dictation, let alone conversational and other speech generated at home, in a car, or other places.
Accordingly, there is a need for a system that can capture the currently discarded dictation audio and create a robust and accurate speaker-dependent speech user profile that can be shared across a number of related speech and language applications. There is an associated need for improvements in manual processing of day-to-day dictation to enable quick generation of training data sets for a speaker-dependent speech user profile with little added end-user effort. A further benefit would be the availability of large amounts of acoustic data to generate highly accurate word models for a large vocabulary system.
In the prior art, one limiting factor in the creation of a speaker-dependent model is the cost of obtaining lexical pronunciation data. Limited data is freely available, but may be restricted to certain languages, dialects, categories, or topics. In general, the prior art has relied upon expensive phonetic expertise or purchase from various sources to generate a lexicon. The expert often uses a sophisticated phonetic alphabet, such as that available from the International Phonetic Association, that would not be understood by the average end user or software developer.
In cases of data sparsity, a professional linguist may also be used to create linguistic questions for the acoustic model for “state tying.” While automated questions file generation has been previously described, the prior disclosures do not permit the end user to automatically select a best questions file among those created from a variety of different parameter values. Instead, the prior art relies upon creation of different acoustic models using different linguistic questions, determining the word error rate with the different models, and selecting the acoustic model for use with highest accuracy. This process may take many hours and requires significant computational resources.
The prior art is also characterized by general lack of portability, sharing, and interchangeability of a trained speaker model or its separate components between related speech and language processing applications. The hours or days spent improving a user profile for one speech recognition application often cannot be used to create or update the speaker model for another application. As such, there is a need for a system and method that can provide a portable speech model.
While there is also a proliferation of software programs for conversion, interpretation, and analysis of speech, audio, text, image, or other data input, there is continued reliance upon manual processes for interpretation of the same data input. Among other drawbacks, none of these programs have a single graphical user interface to process or compare results from computers, humans, or between both computers and humans.
One prime example where this approach would be useful is transcription where manual transcription and speech recognition are usually processed with separate applications. In no small part the absence of such programs may be due to the unavailability of systems and methods to support synchronized comparison where a user dictates and records using a real-time speech recognition program.
In the prior art, session files from different speech recognition programs also cannot be synchronized for segment-by-segment comparison. These different programs usually generate text with different numbers of segments, even though the same audio files were used, due to differences in proprietary segmentation techniques. Moreover, a program such as Dragon NaturallySpeaking® may make utterance start/duration times available, but another program, such as IBM ViaVoice®, may only provide audio tags for individual words.
In many instances, notwithstanding the potential utility, the prior art fails to provide a session file for review and correction of speech recognition errors. In one example, command and control (voice activation) programs typically activate a feature of a software program. If the voice command fails to achieve its desired result, the end user does not know if this was due to a misrecognition and has no opportunity to update and train the speech user profile. Similar limitations may be seen in other speech and language applications.
The frequent lack of audio-aligned text generated by manual transcription in the prior art also makes it difficult for a transcriptionist, quality assurance personnel, or other parties to quickly find the specific audio associated to a particular word or phrase. Instead, the entire audio file must be played back and searched. The lack of audio-aligned, segmented text also makes it difficult to sequentially track document changes and translations while associating the audio to the proper text segment.
Adaptation of preexisting speech processing programs for professional manual transcription for generation of training data sets for speech recognition would also be inefficient. The correction process in large vocabulary programs for dictation, such as Dragon® NaturallySpeaking® and IBM® ViaVoice®, requires the selection of an incorrectly transcribed word or phrase, opening a correction window, manual entry of verbatim text into a text box in the correction window, acceptance of the entered text as verbatim, and closing the window. These multiple steps and mouse clicks often slow down the busy transcriptionist, increase turnaround time, and reduce acceptance of speech recognition.
Standard word processors and speech recognition text editors also include a generic spell check and grammar check based upon general, speaker-independent rules. In many instances, it would be of benefit to have tools to check transcription accuracy based upon a preferably speaker-dependent acoustic model, language model, or lexical data.
In manual dictation, the final text distributed in a letter, document, or report may include grammatical or factual corrections to the dictation, deletion of extraneous remarks or redundant material, and inclusion of non-dictated elements such as headers, footers, tables, graphs, and images. This is different from “verbatim text,” which represents what the speaker actually said. Verbatim text is required for accurate training of a speech user profile.
Currently, word and text processors for speech recognition do not include techniques for simultaneous creation of verbatim and distributable final text, nor do they provide convenient methods to mark elements as “nondictated.” Text processors for speech recognition also generally require saving the entire session file for speech user profile updates. They generally do not support selective exclusion of audio-aligned text unsuitable for training due to poor diction, increased background noise, or other factors.
Off-the-shelf speaker-adaptive programs such as Dragon® NaturallySpeaking®, IBM® ViaVoice®, or Philips® SpeechMagic® are designed for use by a single speaker and single channel. They may be modified for multispeaker, multichannel settings whereby each speaker is isolated with use of a separate microphone or recording device. The prior art does not provide for training and update of speaker-dependent models in cases, such as depositions, trials, or regular business meetings, where two or more speakers share the same microphone or other recording device.
The prior art of other, non-speech-related pattern recognition technologies also shows continued reliance upon manual processes. For instance, print copy may be converted to electronic text manually with a word processor or automatically with an optical character recognition program. In yet another example, handwriting may be converted to electronic text using a word processor or handwriting recognition. Electrocardiograms (EKGs), blood cells, chest x-rays, mammograms, and other radiographic images may be interpreted and analyzed by humans, machines, or both. In radiology, for example, the doctor frequently dictates a mammography or other report at a computerized workstation that is part of a hospital or radiology information system. The radiologist increasingly uses speech recognition and must correct the report. With system integration and growing use of multimedia, a structured report with reporting categories may include a hyperlink to a thumbnail image for the referring physician and electronic medical record.
Despite this convergence and integration of technology and increased data output in medical imaging and other fields, the prior art does not teach automated comparison of results from human sources, computerized pattern recognition, or both to expedite throughput. The prior art also does not teach use of a single application that would support manual conversion, interpretation, analysis, and comparison of output of processing of speech, text, audio, or image source data processed by manual or automated methods.