Natural language understanding involves converting a string of characters into a semantic structure that represents the meaning of that string. Such processing can involve a number of natural language components, including a segmentation component that assigns characters to individual words, a part-of-speech tagger that identifies the part of speech of each word, a syntactic parser that assigns a structure to a sentence or group of sentences so that the syntactic relationships between the words can be understood, and a semantic interpreter that analyzes the syntactic parse to produce a semantic structure.
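The pipeline described above can be sketched in simplified form. The tiny lexicon and the grouping rule below are hypothetical stand-ins chosen only to make each stage concrete; they are not a real tagger or parser.

```python
def segment(text):
    """Segmentation: assign characters to individual words (here, whitespace split)."""
    return text.split()

# Hypothetical miniature lexicon for the tagger.
TAG_LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB"}

def tag(words):
    """Part-of-speech tagging: identify the part of speech of each word."""
    return [(w, TAG_LEXICON.get(w.lower(), "UNK")) for w in words]

def parse(tagged):
    """Syntactic parsing: assign structure, e.g. group DET+NOUN into an NP."""
    tree, i = [], 0
    while i < len(tagged):
        if (tagged[i][1] == "DET" and i + 1 < len(tagged)
                and tagged[i + 1][1] == "NOUN"):
            tree.append(("NP", tagged[i], tagged[i + 1]))
            i += 2
        else:
            tree.append(tagged[i])
            i += 1
    return tree

def interpret(tree):
    """Semantic interpretation: analyze the parse to produce a semantic structure."""
    meaning = {}
    for node in tree:
        if node[0] == "NP":
            meaning["entity"] = node[2][0]  # head noun of the noun phrase
        elif node[1] == "VERB":
            meaning["event"] = node[0]
    return meaning

meaning = interpret(parse(tag(segment("the dog barks"))))
```

Each stage consumes the previous stage's output, which is why, as discussed below, every component must be trained before the pipeline as a whole is usable.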
Each component in a natural language system must be trained before it can be used. In the past, such training has largely been done by hand. For example, the rules used by syntactic parsers to parse sentences were derived by hand. However, training by hand is a laborious process of trial and error. Because of this, more recent systems have attempted to develop natural language components automatically, using supervised machine learning techniques.
For example, in supervised training of a parser, a corpus of input sentences is created that is annotated to indicate the syntactic structure of each sentence. Such a corpus of annotated sentences is referred to as a tree bank in the art. During training, proposed changes to the parsing rules, known as candidate learning sets, are tested by repeatedly parsing the tree bank, using a different candidate learning set for each parse. The candidate learning set that produces the best parses, as measured against the annotations in the tree bank, is then used to change the parser rules.
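The candidate-selection loop described above can be sketched as follows. The toy tree bank, the single-parameter "rule" (`np_len`, the assumed length of the subject noun phrase), and the exact-match scoring function are all hypothetical simplifications; a real system would use richer rules and a graded parse metric.

```python
# Toy tree bank: sentences annotated with their gold syntactic structure.
TREE_BANK = [
    ("the dog barks", ("S", ("NP", "the dog"), ("VP", "barks"))),
    ("a cat sleeps",  ("S", ("NP", "a cat"),  ("VP", "sleeps"))),
]

def parse_with(rules, sentence):
    """Parse under a candidate rule set: split the sentence after np_len words."""
    words = sentence.split()
    split = rules["np_len"]
    return ("S", ("NP", " ".join(words[:split])),
                 ("VP", " ".join(words[split:])))

def score(rules):
    """Count tree-bank sentences that the candidate rules parse correctly."""
    return sum(parse_with(rules, s) == gold for s, gold in TREE_BANK)

# Each candidate learning set proposes a different rule change; the one
# scoring best against the tree bank annotations is kept.
candidates = [{"np_len": 1}, {"np_len": 2}, {"np_len": 3}]
best = max(candidates, key=score)
```

The expense noted below follows directly from this scheme: every candidate is evaluated against hand-annotated gold structures, so the quality of training is bounded by the size and cost of the tree bank.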
One problem with supervised training is that it is expensive and time-consuming. For example, tree banks are so expensive and time-consuming to create that very few exist in the world.
Thus, a less expensive and less time-consuming method is needed for training natural language processing components.