1. Technical Field
This invention relates to the field of speech recognition and natural language understanding, and more particularly, to an improved Monte Carlo method for generating training data.
2. Description of the Related Art
Computer-based systems capable of interacting with users in a conversational manner typically include a speech recognition (SR) system and a natural language understanding (NLU) system. The SR system can convert speech to text and the NLU system can extract information from the resulting text. Currently within the art, conversational systems can be implemented using statistical, rather than linguistic, methods. Such statistical methods utilize high quality statistical models, such as a language model, for processing information. Notably, both SR and NLU systems can utilize statistical models for processing information. Oftentimes, both systems rely upon the same statistical model.
Language and understanding models can express restrictions imposed on the manner in which words can be combined to form sentences and can express the likelihood of a word appearing immediately adjacent or proximate to another word or words. Language models can be expressed as statistical models, grammatically based models, or lists of allowable phrases. Examples of statistical language models can include n-gram models such as the bigram and trigram models. Exemplary grammatical models can include context free grammars which can provide a formal specification of the structures allowable in a language. Context free grammars can be specified using Backus-Naur Form (BNF). Still, hybrid language models, such as the probabilistic context free grammar, can incorporate features of both grammatical and statistical models.
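As an illustration of the statistical language models mentioned above, the following is a minimal sketch of a bigram model estimated by relative frequency. The function name, the sentence boundary tokens “<s>” and “</s>”, and the second sample sentence are assumptions made for the example and do not come from any particular system.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(next_word | word) by relative frequency.

    Illustrative only: no smoothing, whitespace tokenization, lowercased text.
    """
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        # "<s>" and "</s>" are assumed start/end markers for this sketch.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for left, right in zip(tokens, tokens[1:]):
            pair_counts[(left, right)] += 1
            context_counts[left] += 1
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


# The second sentence is a made-up sample used only to give "fly" two continuations.
corpus = ["I want to fly on May fifth", "I want to fly to Boston"]
probs = bigram_probabilities(corpus)
```

In this toy corpus the word “fly” is followed once by “on” and once by “to”, so each continuation receives half of the probability mass. A practical model would be estimated from a far larger corpus and would typically apply smoothing.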
Presently, the development of both statistically based conversational systems and speech recognition language models requires a large corpus of annotated sentences, called a training corpus. Artificial data creation methods can be used to increase the size of a training corpus in an effort to produce a higher quality language model. In particular, a Monte Carlo method can be used to generate additional training sentences from a set of actual training sentences. The Monte Carlo method entails making multiple copies of actual sentences and replacing phrases within the sentences with alternate phrases, thereby creating multiple permutations of the actual training sentence. For example, a training sentence including dates can be “I want to fly on May fifth”. Using a Monte Carlo method, this sentence can be copied wherein the date phrase “May fifth” is replaced with another date phrase.
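The copy-and-replace procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a reference implementation: the function name, the fixed random seed, and the list of alternate date phrases are hypothetical, and the phrase to be replaced is located by plain string matching rather than by annotation.

```python
import random


def monte_carlo_expand(sentence, phrase, alternates, n_copies, seed=0):
    """Create n_copies of `sentence`, each with `phrase` replaced by a
    randomly chosen alternate phrase (hypothetical helper for illustration)."""
    if phrase not in sentence:
        raise ValueError(f"phrase {phrase!r} not found in sentence")
    rng = random.Random(seed)  # seeded so repeated runs give the same copies
    return [
        sentence.replace(phrase, rng.choice(alternates), 1)
        for _ in range(n_copies)
    ]


# Assumed alternate date phrases sharing the "<month> <ordinal>" syntax.
alternate_dates = ["June first", "March third", "July second"]
generated = monte_carlo_expand(
    "I want to fly on May fifth", "May fifth", alternate_dates, n_copies=4
)
for line in generated:
    print(line)
```

Because every alternate in this example shares the syntax of the replaced phrase, each generated sentence remains well formed.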
Conventional Monte Carlo methods, however, can have disadvantages. In particular, the substitution of a phrase having a syntax which differs from the replaced phrase can lead to grammatically incorrect training sentences. For example, substitution of the date expression “May fifth” with the alternate date expression “fifth of May” results in the grammatically incorrect training sentence “I want to fly on fifth of May”. In addition to the phrase syntax, the text and characters surrounding the date phrase, referred to as the boundary conditions, also can affect which alternate phrase results in a well-formed sentence. In this case, the boundary condition of the actual well-formed training sentence lacked the article “the” before the date phrase. After insertion of the alternate phrase, the article “the” was needed to form a well-formed sentence. Boundary conditions can be particularly significant with regard to other languages wherein gender is enforced.
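The failure case described above can be reproduced with a short sketch. The table of required preceding words and all function names are hypothetical, introduced only to make the boundary condition concrete; the check shown is a toy illustration and is not presented as the method of any particular system.

```python
# Hypothetical table: the word each date phrase needs immediately before it
# (None means the phrase imposes no requirement on its left boundary).
REQUIRED_LEFT_WORD = {
    "May fifth": None,
    "fifth of May": "the",
}


def naive_substitute(sentence, phrase, alternate):
    """Replace `phrase` with `alternate` without inspecting the surrounding text."""
    return sentence.replace(phrase, alternate, 1)


def left_boundary(sentence, phrase):
    """Return the word immediately preceding `phrase`, or None if there is none."""
    prefix, found, _ = sentence.partition(phrase)
    if not found:
        raise ValueError(f"phrase {phrase!r} not found in sentence")
    words = prefix.split()
    return words[-1] if words else None


def is_boundary_compatible(sentence, phrase, alternate):
    """Check whether the text before `phrase` satisfies what `alternate` needs."""
    required = REQUIRED_LEFT_WORD[alternate]
    return required is None or left_boundary(sentence, phrase) == required


sentence = "I want to fly on May fifth"
ill_formed = naive_substitute(sentence, "May fifth", "fifth of May")
compatible = is_boundary_compatible(sentence, "May fifth", "fifth of May")
```

Here the naive substitution produces “I want to fly on fifth of May”, and the boundary check reports an incompatibility because the word preceding the date phrase is “on” rather than “the”.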
The inclusion of grammatically incorrect training sentences within a training corpus can result in a less accurate statistical model. Accordingly, the ability of a conversational program to extract meaning from text, or of a speech recognizer to decode an utterance, can be diminished due to the inaccuracy of the statistical model relied upon by the NLU system. Because SR systems often rely upon the same flawed statistical model, the reduction in system performance can be even greater. This can lead to compound errors within a conversational computer-based system, wherein speech is inaccurately converted to text and the resulting text is subsequently inaccurately interpreted.