1. Technical Field
This invention relates to the field of speech recognition and dialog based systems, and more particularly, to the use of language models to convert speech to text.
2. Description of the Related Art
Speech recognition is the process by which an acoustic signal received by a microphone is converted to a set of text words, numbers, or symbols by a computer. These recognized words may then be used in a variety of computer software applications for purposes such as document preparation, data entry, and command and control. Improvements to speech recognition systems provide an important way to enhance user productivity.
Speech recognition systems can model and classify acoustic signals to form acoustic models, which are representations of basic linguistic units referred to as phonemes. Upon receipt of the acoustic signal, the speech recognition system can analyze the acoustic signal, identify a series of acoustic models within the acoustic signal, and derive a list of potential word candidates for the given series of acoustic models.
Subsequently, the speech recognition system can contextually analyze the potential word candidates using a language model as a guide. Specifically, the language model can express restrictions imposed on the manner in which words can be combined to form sentences. The language model is typically a statistical model which can express the likelihood of a word appearing immediately adjacent to another word or words. The language model can be specified as a finite state network, where the permissible words following each word are explicitly listed, or can be implemented in a more sophisticated manner making use of a context sensitive grammar. Other exemplary language models can include, but are not limited to, n-gram models and maximum entropy language models, each of which is known in the art. The n-gram model is a common example; in particular, the bigram and trigram models are n-gram models commonly used within the art.
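By way of illustration only, the following Python sketch shows how a bigram model can express the likelihood of a word appearing immediately after another word. The function names, the sentence boundary tokens, and the toy corpus are assumptions introduced for this example; the estimate is a plain maximum-likelihood ratio of counts without smoothing.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left-hand contexts over whitespace-tokenized sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        # <s> and </s> are assumed sentence boundary markers.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return contexts, bigrams

def bigram_prob(contexts, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 when prev was never seen."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

# Toy training corpus (hypothetical).
corpus = ["recognize speech", "recognize speech well", "wreck a nice beach"]
contexts, bigrams = train_bigram(corpus)
p_well_after_speech = bigram_prob(contexts, bigrams, "speech", "well")
```

In this toy corpus "speech" is followed once by "well" and once by the end of a sentence, so the estimate for "well" following "speech" is one half. A practical system would smooth these estimates so that unseen word pairs do not receive zero probability.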
Conventional language models can be derived from an analysis of a training corpus of text. A training corpus contains text which reflects the ordinary manner in which human beings speak. The training corpus can be processed to determine the statistical language models used by the speech recognition system for converting speech to text, also referred to as decoding speech. It should be appreciated that such methods are known in the art. For example, for a more thorough explanation of language models and methods of building language models, see Statistical Methods for Speech Recognition by Frederick Jelinek (The MIT Press ed., 1997).
Currently within the art, speech recognition systems can use a combination of language models to convert a user spoken utterance to text. Each language model can be used to determine a resulting text string. The resulting text strings from each language model can be statistically weighted to determine the most accurate or likely result. For example, speech recognition systems can incorporate a general or generic language model included within the system as well as a user specific language model derived from the first several dictation sessions or documents dictated by a user. Some speech recognition systems can continue to enhance an existing language model as a user dictates new documents or initiates new dictation sessions. Thus, in many conventional speech recognition systems, the language models can be continually updated.
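The statistical weighting of results from a general language model and a user specific language model can be sketched as a linear interpolation of the two scores. The following Python example is a minimal sketch under stated assumptions: the candidate strings, the per-model probabilities, and the weight values are hypothetical, and conventional systems may combine models differently.

```python
def interpolate(prob_general, prob_user, weight_user):
    """Linear interpolation of two model probabilities; weight_user lies in [0, 1]."""
    return (1.0 - weight_user) * prob_general + weight_user * prob_user

def best_candidate(candidates, general, user, weight_user):
    """Return the candidate text string with the highest interpolated score."""
    def score(text):
        return interpolate(general.get(text, 0.0), user.get(text, 0.0), weight_user)
    return max(candidates, key=score)

# Hypothetical scores assigned to two competing text strings by each model.
candidates = ["call Dr. Wright", "call Dr. Right"]
general = {"call Dr. Wright": 0.3, "call Dr. Right": 0.7}
user = {"call Dr. Wright": 0.95, "call Dr. Right": 0.05}

with_user_model = best_candidate(candidates, general, user, weight_user=0.5)
general_only = best_candidate(candidates, general, user, weight_user=0.0)
```

With equal weighting the user specific model tips the result toward "call Dr. Wright", whereas the general model alone selects "call Dr. Right".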
Unfortunately, as the language models continue to grow, the importance of subject specific user dictation can be reduced. In particular, the effect of the more recent speech sessions can be diminished by the growing mass of data within the language model. Similarly, more recent user dictations, whether subject specific or not, also can be diminished in importance within the growing language model. This occurs primarily with regard to statistical language models where the statistical importance of one particular session or document which can be used to enhance the language model is lessened by an ever expanding data set. This statistical effect can be significant, for example, in the case where the user's speech patterns change as the user becomes more familiar and accustomed to interacting with the speech recognition or dialog based system. Notably, any enhancement of a language model resulting from a single session or document, which can produce a limited amount of data especially in light of the entire data set corresponding to the language model, will not likely alter the behavior of a statistical speech based system. In consequence, the language model may not accurately reflect a user's changing dictation style.
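The dilution effect described above can be shown with simple arithmetic. In the following Python sketch, all counts and corpus sizes are hypothetical figures chosen for illustration; the function merely computes the relative frequency of a word after a single session is pooled into an accumulated data set.

```python
def merged_relative_frequency(corpus_count, corpus_size, session_count, session_size):
    """Relative frequency of a word after pooling one session into the accumulated data."""
    return (corpus_count + session_count) / (corpus_size + session_size)

# Hypothetical session: the word occurs 50 times in 1,000 words (5 percent).
session_count, session_size = 50, 1_000
session_frequency = session_count / session_size

# The same word occurs 1 time per 10,000 words in the accumulated data.
small_model = merged_relative_frequency(10, 100_000, session_count, session_size)
large_model = merged_relative_frequency(1_000, 10_000_000, session_count, session_size)
```

Under these assumed figures the word accounts for 5 percent of the session, roughly 0.06 percent of the smaller pooled data set, and roughly 0.01 percent of the larger one, so the session's influence shrinks as the accumulated data grows.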
Similar problems can exist within the context of dialog based systems such as natural language understanding systems where a user can verbally respond to one or more system prompts. Though such systems can include one or more language models for processing user responses, the language models tailored to specific prompts can be built using an insufficient amount of data. Consequently, such language models can be too specific to accurately process received speech. Specifically, the language models can lack the ability to abstract out from the language model to process a more generalized user response.
The invention disclosed herein concerns a method of creating a hierarchy of contextual models and using those contextual models for converting speech to text. The method of the invention can be utilized within a speech recognition system and within a natural language understanding dialog based system. In particular, the invention can create a plurality of contextual models from different user speech sessions, documents, portions of documents, or user responses in the form of user spoken utterances. Those contextual models can be organized or clustered in a bottom up fashion into related pairs using a known distance metric. The related pairs of contextual models can be merged repeatedly until a tree-like structure is constructed. The tree-like structure of contextual models, or hierarchy of contextual models, can expand outwardly from a single root node. The hierarchy of contextual models can be interpolated using a held out corpus of text using techniques known in the art such as deleted interpolation or the back-off approach. Notably, the invention is not limited to the specific smoothing techniques disclosed herein. Rather, any suitable smoothing technique which is known in the art can be used.
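As a rough illustration of smoothing with a held out corpus, the following Python sketch interpolates a child contextual model with its parent and selects the interpolation weight that best predicts held out text. It is a simplified stand-in and not deleted interpolation proper: the weight is chosen by a grid search rather than by iterative re-estimation, the models are unigram counts, and all names and data are assumptions made for this example.

```python
import math
from collections import Counter

def mle(counts, word):
    """Maximum-likelihood unigram probability of word under the given counts."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def held_out_log_likelihood(child, parent, held_out_tokens, lam, floor=1e-9):
    """Log-likelihood of held out tokens under lam * child + (1 - lam) * parent."""
    # floor is an assumed small constant that keeps the logarithm defined.
    return sum(
        math.log(lam * mle(child, t) + (1.0 - lam) * mle(parent, t) + floor)
        for t in held_out_tokens
    )

def choose_weight(child, parent, held_out_tokens, steps=10):
    """Grid search over [0, 1] for the weight that maximizes held out likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda lam: held_out_log_likelihood(child, parent, held_out_tokens, lam))

# Hypothetical child model and a parent that pools the child with a sibling.
child = Counter("heart rate heart rhythm".split())
parent = child + Counter("invoice payment due heart".split())

weight_in_domain = choose_weight(child, parent, "heart heart rate".split())
weight_mixed = choose_weight(child, parent, "heart invoice".split())
```

When the held out text resembles the child model, the search favors the child; when the held out text contains words that only the parent has seen, the weight shifts toward the parent.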
After the hierarchy of contextual models is determined and smoothed, received user spoken utterances can be processed using the resulting hierarchy of contextual models. One or more contextual models within the hierarchy of contextual models can be identified which correspond to one or more received user spoken utterances. The identified contextual models can be used to process subsequent received user spoken utterances.
One aspect of the invention can include a method of converting speech to text using a hierarchy of contextual models. The hierarchy of contextual models can be statistically smoothed into a language model. The method can include (a) processing text with a plurality of contextual models wherein each one of the plurality of contextual models can correspond to a node in a hierarchy of the plurality of contextual models. The processing of text can be performed serially or in parallel. Also included in the method can be (b) identifying at least one of the contextual models relating to the received text and (c) processing subsequent user spoken utterances with the identified at least one contextual model.
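Steps (a) through (c) can be sketched as follows. This Python example is a minimal sketch under stated assumptions: each node of the hierarchy is represented by unigram counts, the received text is scored against every node by log-likelihood with add-one smoothing, and the highest scoring node is the identified contextual model. The scoring criterion, the node names, and the data are hypothetical; the description above does not prescribe them.

```python
import math
from collections import Counter

def log_likelihood(counts, tokens, vocab_size):
    """Add-one smoothed unigram log-likelihood of tokens under one contextual model."""
    total = sum(counts.values())
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

def identify_model(models, text):
    """Score the text with every node (step a) and return the best node name (step b)."""
    tokens = text.split()
    vocab = set(tokens)
    for counts in models.values():
        vocab.update(counts)
    return max(models, key=lambda name: log_likelihood(models[name], tokens, len(vocab)))

# Hypothetical two-leaf hierarchy; the root pools the data of both leaves.
cardiology = Counter("heart rate heart rhythm valve".split())
billing = Counter("invoice payment balance due invoice".split())
models = {"cardiology": cardiology, "billing": billing, "root": cardiology + billing}

first = identify_model(models, "heart valve")
second = identify_model(models, "invoice due")
```

The identified model, for example models[first], would then be used to process subsequent user spoken utterances (step c). The loop over nodes is serial here; because each node is scored independently, the scoring could also be performed in parallel.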
At least one of the plurality of contextual models can correspond to a document or a portion of a document, a section of a document, at least one user response received in a particular dialog state in a dialog based system, or at least one user response received at a particular location within a particular transaction within a dialog based system. Still, the at least one of the plurality of contextual models can correspond to the syntax of a dialog based system prompt, a particular, known dialog based system prompt, or a received electronic mail message.
Another embodiment of the invention can include a method of creating a hierarchy of contextual models. In that case the method can include (a) measuring the distance between each of a plurality of contextual models using a distance metric. Notably, at least one of the plurality of contextual models can correspond to a portion of a document or a user response within a dialog based system. Also included can be (b) identifying two of the plurality of contextual models which can be closer in distance than other ones of the plurality of contextual models. Also included can be (c) merging the identified contextual models into a parent contextual model. The merging step (c) can include interpolating between the identified contextual models wherein the interpolation can result in a combination of the identified contextual models. Alternatively, the merging step (c) can include building a parent contextual model using data corresponding to the identified contextual models. Also included can be step (d) wherein steps (a), (b), and (c) can be repeated until a hierarchy of the plurality of contextual models can be created. In that case, the hierarchy can include a root node. Still, the hierarchy of the plurality of contextual models can be statistically smoothed resulting in a language model. For example, the hierarchy of contextual models can be interpolated using a held out corpus of text using techniques known in the art such as deleted interpolation, the back-off approach, or another suitable smoothing technique.
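Steps (a) through (d) can be sketched as a bottom up clustering loop. The following Python example is a minimal sketch under stated assumptions: contextual models are unigram counts, the distance metric is cosine distance (the description above requires only a distance metric), and the parent model is built by pooling the data of the two identified models. The class name, labels, and sample text are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

class Node:
    """One contextual model in the hierarchy; leaves have no children."""
    def __init__(self, counts, children=(), label=None):
        self.counts = counts
        self.children = list(children)
        self.label = label

def cosine_distance(a, b):
    """1 - cosine similarity of two count vectors; 1.0 if either vector is empty."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def build_hierarchy(leaves):
    """Merge the closest pair repeatedly until a single root node remains."""
    nodes = list(leaves)
    if not nodes:
        raise ValueError("at least one contextual model is required")
    while len(nodes) > 1:
        # Steps (a) and (b): measure all pairwise distances and pick the closest pair.
        left, right = min(
            combinations(nodes, 2),
            key=lambda pair: cosine_distance(pair[0].counts, pair[1].counts),
        )
        # Step (c): build the parent from the pooled data of the identified pair.
        parent = Node(left.counts + right.counts, children=(left, right))
        nodes = [n for n in nodes if n is not left and n is not right]
        nodes.append(parent)  # Step (d): repeat with the parent in place of the pair.
    return nodes[0]

leaves = [
    Node(Counter("heart rate heart rhythm".split()), label="visit"),
    Node(Counter("heart valve rhythm".split()), label="echo"),
    Node(Counter("invoice payment due".split()), label="billing"),
]
root = build_hierarchy(leaves)
```

In this example the two leaves that share vocabulary are merged first, and the resulting parent is then merged with the remaining leaf to form the root node. Interpolating between the identified models, the other merging option described above, would replace the pooling line.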
The plurality of contextual models, or the initial contextual models, can be built from speech sessions, document templates, documents, and portions of documents such as paragraphs, or any part of a document that can be subdivided into one or more parts, such as a section of a document. In the case of a dialog based system such as a natural language understanding system, the initial contextual models can be built from one or more user responses to all or a subset of the various system prompts.