For numerous speech interface applications, recognized speech needs to be mined for information relevant to the task to which it is applied. An example application is automated technical phone help, where a virtual operator directs the call based on the natural language utterance of a caller. The virtual operator, like existing interactive voice response (IVR) systems, might ask “Please state the nature of your problem” and the system must be capable of directing the caller to the appropriate resource. Another example is closed-domain canonical speech-to-text or speech-to-speech machine translation, where the various ways of expressing the same idea are grouped together, the utterance is mapped to the appropriate group via either a grammar or a classifier, and a canonical translation is the output. When no resource exists to handle the utterance, the system must be capable of correctly rejecting the utterance and, in the example of the virtual operator, either asking further questions or redirecting the caller to a human operator.
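The routing-with-rejection behavior described above can be sketched as follows. This is a minimal illustration only: the `ROUTES` table, the cue words, the word-overlap score, and the `min_score` threshold are all hypothetical and are not taken from any particular system.

```python
from typing import Optional

# Hypothetical routing table: each destination lists illustrative cue words.
ROUTES = {
    "billing": {"bill", "charge", "payment", "invoice"},
    "connectivity": {"internet", "connection", "wifi", "offline"},
}


def route(utterance: str, min_score: float = 0.2) -> Optional[str]:
    """Return the best-matching destination, or None to reject the utterance."""
    words = set(utterance.lower().split())
    if not words:
        return None
    best_label, best_score = None, 0.0
    for label, cues in ROUTES.items():
        # Score = fraction of distinct utterance words that are cue words.
        score = len(words & cues) / len(words)
        if score > best_score:
            best_label, best_score = label, score
    # Reject when no destination is supported strongly enough.
    return best_label if best_score >= min_score else None
```

A `None` result corresponds to the rejection case, in which the virtual operator would ask a further question or hand the call to a human operator.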
The task of classifying an utterance properly is complicated by the introduction of recognition error, which is inherent to any recognition system. The challenge for information extraction from recognized speech is to be robust to that error.
A recognizer converts an input speech signal into a text stream. The output text may be a “one-best” recognition, an “N-best” recognition, or a word-recognition lattice, with associated recognition confidence scores. Recognitions are based upon both an acoustic model, which models the conversion of an acoustic signal into phonemes, and a language model, which models the probabilistic distribution of word sequences in a language. The broader the domain an automatic speech recognition (ASR) engine is trained to recognize, the lower its recognition accuracy. The balance between recognition coverage and recognition accuracy must therefore be addressed in the creation of an ASR system.
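To illustrate how a language model can influence the choice among N-best hypotheses, the sketch below trains a toy bigram model and uses it to rescore a list of hypotheses. Everything here is an assumption made for illustration: the function names, the add-one smoothing, the log-linear combination with weight `lm_weight`, and the use of a single score per hypothesis standing in for the acoustic model. Real recognizers combine these scores in more elaborate ways.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenized text."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability (a rough score, not normalized
    over words outside the training vocabulary)."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens[:-1], tokens[1:])
    )


def rescore(nbest, model, lm_weight=1.0):
    """Pick the hypothesis text maximizing acoustic score + weighted LM score.

    nbest is a list of (text, acoustic_log_score) pairs.
    """
    best = max(nbest, key=lambda h: h[1] + lm_weight * log_prob(h[0], model))
    return best[0]
```

With a model trained on in-domain text, a hypothesis that scores slightly worse acoustically can still be selected because its word sequence is far more probable under the language model. A narrow training domain makes that preference sharper, at the cost of coverage.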
The text of an utterance, which may be processed linguistically to aid in the labeling of semantic information, is then mined for the information relevant to the task for which the system is designed.
The text of the utterance can be mined via a rule-based approach, wherein “grammars” are applied to an input text stream. Grammars in this context refer to manually or (semi-)automatically generated rules that attempt to predict structural patterns of an input text stream.
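A hand-written extraction grammar of this kind can be sketched with regular expressions. The rule labels and patterns below are hypothetical examples, not rules from any actual system, and regular expressions are only one of several possible rule formalisms.

```python
import re

# Hypothetical extraction grammar: each rule pairs a label with a pattern
# describing a structural pattern expected in the input text.
GRAMMAR = [
    ("reset_password", re.compile(r"\b(reset|forgot|change)\b.*\bpassword\b")),
    ("no_connection", re.compile(r"\b(can't|cannot|unable to)\s+connect\b")),
]


def apply_grammar(text):
    """Return the labels of all rules whose pattern matches the text."""
    text = text.lower()
    return [label for label, pattern in GRAMMAR if pattern.search(text)]
```

An empty result means that no rule covers the input, which is the low-recall case that arises when the rule set is insufficient.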
The advantage of manually created extraction grammars is that there is no requirement for large amounts of training data. The method, however, does require human expertise to create these grammars and is therefore labor intensive and susceptible to low recall or, conversely, low precision. On the other hand, the more automatically the grammar is created (that is, the less it depends upon human expertise), the more training data is necessary. Training data, depending on the task, may not be readily available.
In addition to insufficient rules, rule ambiguity and recognition error reduce the accuracy and coverage of rule-based extraction. Rule ambiguity occurs when multiple rules apply to an input text stream, and there is no reason (statistical or otherwise) to choose one over the other. Recognition error makes the extraction of the information less accurate. Though rule-based approaches tend to be robust to recognition error, their coverage and accuracy are still diminished.
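The three failure modes named above (insufficient rules, rule ambiguity, and recognition error) can be made concrete with a small sketch. The `RULES` table, the `Outcome` names, and the sample utterances are hypothetical and chosen only to trigger each case.

```python
import re
from enum import Enum


class Outcome(Enum):
    MATCH = "match"          # exactly one rule applies
    AMBIGUOUS = "ambiguous"  # several rules apply, with no basis for choosing
    NO_RULE = "no_rule"      # no rule applies (insufficient coverage)


# Hypothetical rules; the first two deliberately overlap.
RULES = {
    "cancel_order": re.compile(r"\bcancel\b"),
    "cancel_subscription": re.compile(
        r"\bcancel\b.*\bsubscription\b|\bunsubscribe\b"
    ),
    "track_order": re.compile(r"\b(track|where is)\b.*\border\b"),
}


def interpret(text):
    """Classify the rule-matching outcome and list the matching rule labels."""
    text = text.lower()
    labels = sorted(
        label for label, pattern in RULES.items() if pattern.search(text)
    )
    if not labels:
        return Outcome.NO_RULE, labels
    if len(labels) == 1:
        return Outcome.MATCH, labels
    return Outcome.AMBIGUOUS, labels
```

For example, "please cancel my subscription" triggers both cancel rules and is ambiguous, while a misrecognition such as "wear is my odor" for "where is my order" matches no rule at all, so the recognition error surfaces as lost coverage.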