1. Technical Field of the Invention
The present invention relates to speech recognition and, more particularly, to selectively merging segments that were separated in response to a break in an utterance.
2. Background Art
One component in a speech recognizer is the language model. A popular way to capture the syntactic structure of a given language is to use conditional probabilities to capture the sequential information embedded in the word strings of sentences. For example, if the current word is W1, a language model can be constructed indicating the probabilities that certain other words W2, W3, . . . Wn, will follow W1. The probabilities of the words can be expressed such that P21 is the probability that word W2 will follow word W1, where P21=P(W2|W1). In this notation, P31 is the probability that word W3 will follow word W1; P41 is the probability that word W4 will follow word W1; and so forth, with Pn1 being the probability that Wn will follow word W1. The maximum of P21, P31, . . . Pn1 can be identified and used in the language model. The preceding examples are for bi-gram probabilities. Computation of tri-gram conditional probabilities is also well known.
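The bi-gram probabilities described above can be illustrated with a minimal sketch. The corpus, function name, and the use of maximum-likelihood counting (count of the pair divided by the count of the first word) are illustrative assumptions, not part of the invention:

```python
from collections import defaultdict

def train_bigram_model(sentences):
    """Estimate bi-gram conditional probabilities P(w2 | w1) from a corpus."""
    bigram_counts = defaultdict(lambda: defaultdict(int))
    unigram_counts = defaultdict(int)
    for sentence in sentences:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            bigram_counts[w1][w2] += 1
            unigram_counts[w1] += 1
    # P(w2 | w1) = count(w1, w2) / count(w1)
    return {w1: {w2: c / unigram_counts[w1] for w2, c in nexts.items()}
            for w1, nexts in bigram_counts.items()}

# Toy corpus standing in for written literature such as newspapers
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)
# P(cat | the) = 2/3, so "cat" is the maximum-probability successor of "the"
best_successor = max(model["the"], key=model["the"].get)
```

The same counting scheme extends to tri-grams by conditioning on the two preceding words rather than one.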
Language models are often created by analyzing written literature (such as newspapers) and determining the conditional probabilities of the vocabulary words with respect to other vocabulary words.
In speech recognition systems, complex recognition tasks, such as long utterances, are typically handled in stages. Usually these stages include a segmentation stage, which involves separating a long utterance into shorter segments. A first-pass within-word decoding is used to generate hypotheses for the segments. A final-pass cross-word decoding generates the final recognition results with detailed acoustic and language models.
In the segmentation stage, long segments are typically chopped first at the sentence boundary and then at the word boundary (detected by a fast word recognizer). A typical way to detect sentence beginnings and endings is to locate boundaries of silence (pauses in speaking) detected by, for example, a mono-phone decoder. The assumption is that people momentarily stop speaking at the end of a sentence. The resulting segments are short enough (about 4 to 8 seconds) to ensure that they can be handled by the decoder given the constraints of the real-time pipeline and memory size. In the traditional decoding procedure, each short segment, which can be any part of a sentence, is decoded, and each transcription is merged to give the complete recognition result.
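Silence-based segmentation of the kind described above can be sketched as follows. The frame rate, the minimum-pause threshold, and the frame-level speech/silence labels (assumed to come from something like a mono-phone decoder) are illustrative assumptions:

```python
def segment_on_silence(frame_labels, min_silence_frames=30, frame_rate_hz=100):
    """Split per-frame labels ('speech'/'silence') into (start_sec, end_sec)
    speech segments, closing a segment once a pause is long enough."""
    segments = []
    start = None        # frame index where the current segment began
    silence_run = 0     # length of the current run of silence frames
    for i, label in enumerate(frame_labels):
        if label == 'speech':
            if start is None:
                start = i
            silence_run = 0
        else:
            silence_run += 1
            # A sufficiently long pause ends the current segment
            if start is not None and silence_run >= min_silence_frames:
                end = i - silence_run + 1
                segments.append((start / frame_rate_hz, end / frame_rate_hz))
                start = None
    if start is not None:  # utterance ended mid-segment
        segments.append((start / frame_rate_hz, len(frame_labels) / frame_rate_hz))
    return segments

# 0.5 s of speech, a 0.4 s pause, then 0.6 s of speech
labels = ['speech'] * 50 + ['silence'] * 40 + ['speech'] * 60
segments = segment_on_silence(labels)
```

In a complete system, each resulting segment would then be passed independently to the decoder, which is precisely why the language model is not applied across the boundary between them.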
Another way in which segment (e.g., sentence) boundaries can be created is in response to unrecognizable non-speech noise, such as background noise.
The problem noticed by the inventor of the invention in this disclosure is that, in existing systems, the language model is not applied across segment boundaries. Accordingly, if a sentence is ended because of an unintended break (e.g., a pause or noise), the language model will not be applied between the last word of the ending sentence and the first word of the beginning sentence. In the case in which the last word of the ending sentence and the first word of the beginning sentence are intended to be part of a continuous word stream, the benefits that the language model could provide are not realized by present recognition systems.