Recognizing and understanding spoken human speech is believed to be integral to future computing environments. To date, the tasks of recognizing and understanding speech have been addressed by speech recognition systems and spoken language understanding (SLU) systems. An SLU system is a type of natural language understanding (NLU) system in which the input is specifically spontaneous speech utterances, which are noisy and full of disfluencies such as false starts, hesitations, repetitions, repairs, etc.
Current speech recognition systems receive a speech signal indicative of a spoken language input. Acoustic features are identified in the speech signal and the speech signal is decoded, using both an acoustic model and a language model, to provide an output indicative of words represented by the input speech signal.
Spoken language understanding addresses the problem of extracting the semantic meaning conveyed by a user's utterance. This problem is often addressed with a knowledge-based approach. To a large extent, many implementations have relied on manual development of domain-specific grammars. The task of manually developing such grammars is time consuming and error prone, and it requires a significant amount of expertise in the domain.
Other approaches involve different data-driven statistical models. Statistical grammars (models) can be used in the development of speech-enabled applications and services that use example-based grammar authoring tools. These tools ease grammar development by taking advantage of many different sources of prior information. They allow a developer with little linguistic knowledge to build a semantic grammar for spoken language understanding.
In speech recognition and natural language processing, Hidden Markov Models (HMMs) have been used extensively to model the acoustics of speech or the observations of text. HMMs are generative models that use the concept of a hidden state sequence to model the non-stationarity of the generation of observations from a label. At each frame of an input signal (or word), the HMM determines the probability of generating that frame from each possible hidden state. This probability is determined by applying a feature vector derived from the frame of speech (or text) to a set of probability distributions associated with the state. In addition, the HMM determines a probability of transitioning from a previous state to each of the states in the Hidden Markov Model. Using the combined transition probability and observation probability, the Hidden Markov Model selects a state that is most likely to have generated a frame.
In the field of sequence labeling, conditional random field models have been used that avoid some of the limitations of Hidden Markov Models. In particular, conditional random field models allow observations taken across an entire utterance to be used at each frame when determining the probability for a label in the frame. In addition, different labels may be associated with different features, thereby allowing a better selection of features for each label.
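The two properties noted above, features that can draw on the entire utterance and features that are specific to a label, can be sketched with a small linear-chain conditional random field scorer. This is an illustrative sketch under stated assumptions, not a particular system's implementation: the label set, the feature names, the weights, and the function names are all hypothetical, and the normalizer is computed by brute-force enumeration, which is only practical for very short sequences.

```python
import itertools
import math

LABELS = ["O", "CITY"]  # hypothetical label set


def features(prev_label, label, words, t):
    """Hypothetical feature functions; each may inspect the whole word sequence."""
    feats = {
        f"trans:{prev_label}>{label}": 1.0,
        f"word:{words[t]}|{label}": 1.0,
    }
    # Label-specific feature that looks at the previous word
    if label == "CITY" and t > 0 and words[t - 1] == "to":
        feats["prev_word_is_to|CITY"] = 1.0
    # Label-specific feature that looks across the entire utterance
    if label == "CITY" and "fly" in words:
        feats["utterance_has_fly|CITY"] = 1.0
    return feats


def sequence_score(labels, words, weights):
    """Sum of weighted feature values over every position in the sequence."""
    score = 0.0
    prev = "<s>"  # assumed start-of-sequence marker
    for t, label in enumerate(labels):
        for name, value in features(prev, label, words, t).items():
            score += weights.get(name, 0.0) * value
        prev = label
    return score


def conditional_probability(labels, words, weights):
    """P(labels | words), normalized by enumerating every label sequence."""
    z = sum(
        math.exp(sequence_score(candidate, words, weights))
        for candidate in itertools.product(LABELS, repeat=len(words))
    )
    return math.exp(sequence_score(labels, words, weights)) / z


# Hypothetical weights; in practice these would be learned from training data
WEIGHTS = {
    "word:fly|O": 1.0,
    "word:to|O": 1.0,
    "word:boston|CITY": 1.0,
    "prev_word_is_to|CITY": 2.0,
    "utterance_has_fly|CITY": 0.5,
}

words = ["fly", "to", "boston"]
best = max(
    itertools.product(LABELS, repeat=len(words)),
    key=lambda candidate: conditional_probability(candidate, words, WEIGHTS),
)
print(best)
```

Unlike the generative Hidden Markov Model, the score here is conditioned on the full observation sequence, so a feature at one position is free to examine any other word in the utterance.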
Current statistical learning approaches for training statistical models exploit the generative models used for spoken language understanding. However, data sparseness is a problem associated with such approaches. In other words, without a great deal of training data, purely statistical spoken language understanding models can lack robustness and exhibit brittleness.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.