This invention relates to speech recognition methods and systems. More particularly, this invention relates to computerized methods and systems for generating semi-literal transcripts that may be used as source data for acoustic and language models for a speech recognition system and for other purposes where literal transcripts could be used.
Speech recognition systems, or speech recognizers, use recorded speech as an input and generate, or attempt to generate, a transcript of the spoken words in the speech recording. The recorded speech may come in a variety of forms; one common form for recorded speech is a digital recording that may be a mu-law encoded 8-bit audio digital signal.
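By way of illustration only, a mu-law encoded 8-bit sample of the kind mentioned above may be expanded to a linear 16-bit value before further processing. The following is a minimal sketch that assumes the standard G.711 mu-law expansion; the function names `mulaw_decode_byte` and `mulaw_decode` are hypothetical and are not part of any recognizer described herein.

```python
# Minimal sketch: expand G.711 mu-law bytes to linear PCM samples.
# Function names are hypothetical; the bias value 0x84 comes from G.711.

MULAW_BIAS = 0x84  # bias added during encoding, removed after expansion


def mulaw_decode_byte(code: int) -> int:
    """Expand one 8-bit mu-law code (0..255) to a signed linear sample."""
    code = ~code & 0xFF               # mu-law bytes are stored bit-inverted
    sign = code & 0x80                # top bit carries the sign
    exponent = (code >> 4) & 0x07     # 3-bit segment number
    mantissa = code & 0x0F            # 4-bit step within the segment
    magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
    return -magnitude if sign else magnitude


def mulaw_decode(data: bytes) -> list:
    """Expand a whole mu-law byte string to a list of linear samples."""
    return [mulaw_decode_byte(b) for b in data]
```

Under this scheme the byte 0xFF expands to silence (zero), and the bytes 0x80 and 0x00 expand to the largest positive and negative magnitudes respectively.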
Speech recognizers are commonly available. Speech recognizers use models of previous speech to assist in decoding a given utterance in a speech recording. One such commercial speech recognizer is the Truetalk product developed by Entropic Inc. This speech recognizer, which runs on a computer, in general comprises an experience base and pattern recognition code to drive the speech recognizer. The experience base contains important components of the speech recognizer, and may use a variety of models in speech recognition. The primary categories of models are acoustic models and language models.
The acoustic models of the speech recognizer may contain a set of models of sounds (sometimes called phonemes) and sound sequences (triphones). Each sound used in common speech may therefore be represented by a model within the acoustic models. For instance, the sounds "k," "ae" and "t" (which together form the word "cat") may be represented within the acoustic models. The acoustic models are used to assist in the recognition of the phonetic sequences that support the speech recognizer's selection of the most likely words of a given utterance, and the acoustic models use statistical representations to accomplish this task.
The language models may aid in determining the occurrence of words by applying known patterns of occurrences of words within speech. For instance, the language model may be able to determine the words from the context or from patterns of occurrence of certain words in spoken language.
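One simple way to capture such patterns of occurrence is a bigram model, which estimates how likely a word is given the word before it. The sketch below is illustrative only: it is not the language model of any particular speech recognizer, and the function names and the `<s>` and `</s>` sentence-boundary markers are assumptions made for the example.

```python
# Minimal sketch of a bigram language model built from raw counts.
# All names here are hypothetical and chosen for illustration.
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(words, words[1:]):
            counts[prev][curr] += 1
    return counts


def bigram_probability(counts, prev, curr):
    """Relative-frequency estimate of P(curr | prev); 0.0 if prev is unseen."""
    following = counts.get(prev)
    if not following:
        return 0.0
    return following[curr] / sum(following.values())
```

A model trained this way on a handful of sentences would assign a higher probability to word pairs it has seen often, which is the sense in which a language model may use context to help determine the words spoken. A practical model would also smooth these estimates so that unseen word pairs do not receive a probability of zero.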
The Truetalk speech recognizer contains three inter-connected modules within the experience base: a set of acoustic models, a language model, and a pronunciation dictionary. The three modules function together to recognize words in spoken speech. The pronunciation dictionary may be a set of models that is capable of combining the sounds within the acoustic models to form words. For example, the pronunciation dictionary may include models that can combine the "k," "ae" and "t" sounds from the acoustic models to form the word "cat." Although the speech recognizer described herein will be described with reference to the English language, the modules may be adapted to perform word recognition for other languages.
Commercial speech recognizers generally come with generic versions of the experience base. Some of these speech recognizers, such as the Truetalk product by Entropic, Inc., allow the user to train, modify and add to the models. The models, for instance, may be modified so that filled pause "words," such as "um" or "ah," are represented in the data used to train the models and so that patterns of occurrence are modeled for these "words." A large number of words (on the order of between 2 million and 500 million) may be used to train the language model and the acoustic models. The models may be person-specific, such as for specific users with different accents or grammatical patterns, or specific to certain contexts, such as the medical field. If the models are limited by person or context, the models may require less training to determine patterns of occurrence of words in speech. The models, however, need not be person or context specific. The significant point is that the models, and in particular the acoustic models and language models, may be trained or modified so that they perform better to recognize speech for a given speaker or context.
Literal transcripts have traditionally been used to train and modify acoustic models and language models. The literal transcript and the recorded speech are submitted to software that generates an acoustic model or language model or that modifies a given acoustic model or language model for transcribed words. This software is well established and commonly used by those skilled in the art. One problem with this method of producing acoustic models or language models, however, is that a literal transcript must be generated for use in building the model. A "literal transcript" of recorded speech, as used in this specification, means a transcript that includes all spoken words or utterances in the recorded speech, including filled pause words (such as "um" and "ah"), repair instructions in dictated speech (such as "go left, no, I mean go right"), grammatical errors, and any pleasantries and asides dictated for the benefit of the human transcriptionist (such as "end of dictation; thank you," or "new paragraph"). Such literal transcripts are generated by human transcriptionists, which is a labor intensive and expensive task, especially when the end product of a literal transcript is not the desired output in the transcription business.
The commercial transcription business produces partial transcripts as the desired output. These partial transcripts typically remove filled pause words, repairs, pleasantries and asides, and grammatical errors. A "partial transcript," as used throughout this specification, is what the dictator of the speech desires for the outcome, rather than a literal transcript of the dictated speech. It is, in other words, what the human transcriptionist generates from recorded speech, which typically includes correcting grammatical errors, repetitive speech, partial sentences, and other speech that should not be included in a commercial transcript. Unlike literal transcripts, which have no real commercial value, partial transcripts are the desired end product of the transcription business. Although partial transcripts, unlike literal transcripts, are commonly generated in the transcription business, they omit and alter much of the spoken speech in a recording and are therefore commonly of limited value as a data source for building or modifying the models of a speech recognizer.
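The role of a pronunciation dictionary can be sketched as a lookup from a sequence of sounds to a word. The example below is a deliberately simplified assumption: a real pronunciation dictionary is a set of models rather than a fixed table, and the name `PRONUNCIATIONS`, the function `lookup_word`, and every entry other than "cat" are hypothetical.

```python
# Minimal sketch: a pronunciation dictionary as a phoneme-sequence lookup.
# The table and function are hypothetical; only the "cat" entry is taken
# from the example in the text.
from typing import Optional, Sequence

PRONUNCIATIONS = {
    ("k", "ae", "t"): "cat",
    ("b", "ae", "t"): "bat",   # illustrative entry, an assumption
    ("k", "ae", "b"): "cab",   # illustrative entry, an assumption
}


def lookup_word(phonemes: Sequence[str]) -> Optional[str]:
    """Return the word formed by the given sounds, or None if unknown."""
    return PRONUNCIATIONS.get(tuple(phonemes))
```

Calling `lookup_word(["k", "ae", "t"])` returns the word "cat", mirroring the way the dictionary combines sounds from the acoustic models into words.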
A need exists for a method and system that can use commonly available partial transcripts of recorded speech to develop or modify the models of a speech recognizer.
One embodiment of the invention is a method for generating a semi-literal transcript from a partial transcript of recorded speech. In this embodiment, the method includes augmenting the partial transcript with words from one of a filled pause model and a background model to build an augmented probabilistic finite state model for the partial transcript, inputting the recorded speech and the augmented probabilistic finite state model to a speech recognition system, and generating a hypothesized output for the recorded speech using the speech recognition system, whereby the hypothesized output may be used as the semi-literal transcript. In another embodiment, the method may further include integrating the hypothesized output with the partial transcript to generate the semi-literal transcript of the recorded speech.
In another embodiment of a method for generating a semi-literal transcript from a partial transcript of recorded speech, the invention comprises augmenting the partial transcript with words from a filled pause model and a background model to build an augmented probabilistic finite state model for the partial transcript, inputting the recorded speech and the augmented probabilistic finite state model to a speech recognition system, generating a hypothesized output for the recorded speech using the speech recognition system, and integrating the hypothesized output with the partial transcript to generate a semi-literal transcript of the recorded speech.
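The augmentation step may be pictured, in very simplified form, as follows. This is a sketch under stated assumptions and not the claimed method itself: it treats the partial transcript as a linear chain of states, adds a self-loop for each filled pause word and a single self-loop standing in for the background model, and assigns fixed probabilities. The function name, the `<background>` label, and the values of `p_pause` and `p_background` are all hypothetical.

```python
# Minimal sketch of an augmented probabilistic finite state model.
# Each arc is (from_state, to_state, label, probability).
# All names and probability values are assumptions for illustration.


def build_augmented_model(partial_transcript,
                          filled_pauses=("um", "ah"),
                          p_pause=0.1,
                          p_background=0.1):
    """Build arcs for a word chain with optional pause/background loops."""
    words = partial_transcript.split()
    p_word = 1.0 - p_pause - p_background  # mass left for the transcript word
    arcs = []
    for state, word in enumerate(words):
        # Advance along the partial transcript by recognizing its next word.
        arcs.append((state, state + 1, word, p_word))
        # Stay in place while recognizing a filled pause word.
        for pause in filled_pauses:
            arcs.append((state, state, pause, p_pause / len(filled_pauses)))
        # Stay in place while recognizing any word of the background model.
        arcs.append((state, state, "<background>", p_background))
    # Simplification: no loops are added on the final state, so trailing
    # filled pauses after the last word are not modeled in this sketch.
    return arcs
```

For the partial transcript "go right", this sketch yields eight arcs, and the probabilities of the arcs leaving each state sum to one. A recognizer constrained by such a model could follow the transcript words while remaining free to hypothesize filled pauses or other words between them.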
In another embodiment, the invention is a method for using a partial transcript in a speech recognition system. This embodiment of the invention comprises augmenting the partial transcript of recorded speech with words from a filled pause model and a background model to build an augmented probabilistic finite state model for the partial transcript, inputting the recorded speech and the augmented probabilistic finite state model to a speech recognition system, generating a hypothesized output for the recorded speech using the speech recognition system, integrating the hypothesized output with the partial transcript to generate a semi-literal transcript of the recorded speech, and using the semi-literal transcript as a substitute for a literal transcript of the recorded speech.
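One possible way to picture the integration step is as a word-level alignment of the hypothesized output against the partial transcript. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment. The merge policy is an assumption made only for illustration: words found only in the hypothesis are kept, the partial transcript is preferred where the two disagree, and words found only in the partial transcript are dropped. The function name `integrate` is hypothetical.

```python
# Minimal sketch: merge a recognizer hypothesis with a partial transcript.
# The merge policy below is an illustrative assumption, not the method
# described in this specification.
from difflib import SequenceMatcher


def integrate(partial_words, hypothesis_words):
    """Return a semi-literal word list from two aligned word sequences."""
    matcher = SequenceMatcher(a=partial_words, b=hypothesis_words,
                              autojunk=False)
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(partial_words[i1:i2])
        elif tag == "insert":
            # Spoken material missing from the partial transcript,
            # such as filled pauses or repairs.
            merged.extend(hypothesis_words[j1:j2])
        elif tag == "replace":
            # Assumed policy: trust the human transcript on disagreement.
            merged.extend(partial_words[i1:i2])
        # tag == "delete": words only in the partial transcript are
        # assumed to have been added by the transcriptionist; drop them.
    return merged
```

For example, given the partial transcript "the patient has fever" and a hypothesis "the patient um has beaver", this policy keeps the recognized "um" while retaining the transcriptionist's "fever" in place of the misrecognized word.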
The above embodiments of the invention have a number of advantages over the prior art. The invention may allow for adaptation of the models of a speech recognizer without having to generate literal transcripts of recorded speech. Instead, a partial transcript, which may be generated for commercial purposes anyway, may be used for modeling purposes. The use of partial transcripts for the generation of semi-literal transcripts may save significant amounts of time and money that would otherwise be spent constructing literal transcripts.
In another embodiment, the invention is a method for using a partial transcript of recorded speech. In this embodiment, the invention comprises producing a semi-literal transcript from the partial transcript using speech recognition technology and augmentation of a speech model derived from the partial transcript, and using the semi-literal transcript as a substitute for a literal transcript of the recorded speech. This embodiment offers the advantage over the prior art that partial transcripts may be used to create semi-literal transcripts, which may have numerous uses in speech recognition applications.
In yet another embodiment, the invention is an apparatus for generating a semi-literal transcript from a partial transcript of recorded speech. In this embodiment, the invention comprises an interpolator containing programs for augmenting the partial transcript with words from one of a filled pause model and a background model to build an augmented probabilistic finite state model for the partial transcript, a speech recognizer containing programs for generating a hypothesized output for the recorded speech using the augmented probabilistic finite state model and the recorded speech as inputs, and an integrator containing instructions for integrating the hypothesized output with the partial transcript to generate a semi-literal transcript of the recorded speech. Like the above embodiments of the invention, this embodiment allows for the adaptation of the models of a speech recognizer without having to generate literal transcripts of recorded speech.
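The arrangement of the three components may be sketched as a simple pipeline in which each component is supplied as a callable. This is an architectural illustration only; the class name `SemiLiteralTranscriber`, its field names, and the type signatures are assumptions, and no particular interpolator, speech recognizer, or integrator is implied.

```python
# Minimal sketch: wiring an interpolator, recognizer, and integrator.
# Class, field names, and signatures are hypothetical.
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class SemiLiteralTranscriber:
    # Builds an augmented model from the partial transcript words.
    interpolator: Callable[[Sequence[str]], Any]
    # Produces a hypothesized word sequence from audio and the model.
    recognizer: Callable[[bytes, Any], List[str]]
    # Merges the partial transcript with the hypothesized output.
    integrator: Callable[[Sequence[str], Sequence[str]], List[str]]

    def transcribe(self, audio: bytes, partial: Sequence[str]) -> List[str]:
        """Run the three stages in order and return the merged words."""
        model = self.interpolator(partial)
        hypothesis = self.recognizer(audio, model)
        return self.integrator(partial, hypothesis)
```

Passing the components in as callables keeps the sketch independent of any specific recognizer, so a stand-in can be substituted for each stage when experimenting with the overall flow.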
Other features and advantages of the present invention will become more fully apparent and understood with reference to the following description and drawings, and the appended claims.