Text analysis is an area of computer science that focuses on processing text to extract information through pattern recognition. The decade of the 1990s has seen an unprecedented explosion of work on learning methods for text analysis. Prior text analysis methods rely on unsupervised learning, where the system is responsible for teasing generalizations from texts or samples. One such system is the HASTEN system described in “SRA: Description of the SRA System as Used for MUC-6,” Krupka, George R., pp. 221-235, Proceedings Sixth Message Understanding Conference (MUC-6), November 1995 (referred to herein as Krupka). Krupka teaches a system that groups text samples supplied and labeled by users and creates data structures called e-graphs. The system in Krupka then uses a similarity metric to decide whether portions of an input text are related to the e-graphs that have been created. It applies collections of e-graphs, called collectors, as sequential processing phases in order to match each sample set to the input text. Generalization of the elements of e-graphs is performed manually by the developer. Krupka has no notion of generating grammar rules from e-graphs, does not establish a method for converting the collectors to rule-based passes of a text analyzer, and does not describe a way to automatically generate substantial portions of a text analyzer. The system in Krupka requires a large amount of manual user interaction beyond adding and labeling samples, and was applied specifically to create an event-level pattern for MUC text analysis. Krupka's system therefore does not teach a general and fully automated text analyzer capability.
Another text analysis system is disclosed in Huffman (U.S. Pat. Nos. 5,796,926 and 5,841,895). The Huffman patents deal with text extraction at the event level and teach methods for locating potential event patterns of interest. In essence, Huffman teaches an inflexible method of searching for specific patterns such as “actor acts on object.”
There is a need for a system that automatically generates text analysis systems from minimal training samples, that retains sufficient intelligence to recognize patterns beyond those described by the training samples, and that is sufficiently flexible to be adapted to a variety of applications.