1. Field of the Invention
The present invention generally relates to extracting formatted information from unformatted text files, where the appropriate formatting processor is determined by categorizing the textual input into one or more predefined categories.
2. Background Description
Natural language computer interfaces require a natural language analysis engine that can analyze user input text and then extract and format the information that drives some back-end application or process. User input text could be derived, for example, from the output of a speech recognizer or other system that generates text, e.g., an optical character recognition (OCR) system. There is no solution to the general problem of understanding natural language via a computer program. There are two main approaches to the problem of computer-based natural language analysis:
(1) Use a general purpose grammar/parser of a particular language and then interpret the output of the parser with a semantic interpreter that uses domain specific knowledge to build an internal, formatted representation of the information needed by the back-end applications or processes. General English parsers are described, for example, by Michael C. McCord in “Slot Grammar: A system for simpler construction of practical natural language grammars”, pp. 118-145 in Natural Language and Logic: International Scientific Symposium, Lecture Notes in Computer Science, R. Studer, Editor, Springer Verlag, Berlin (1990). The problem with this approach is that general purpose natural language grammars/parsers will typically deliver a large number of parses or structures, representing high level syntactic information; e.g., subject-verb-object-modifier patterns, all but one or a few of which must then be eliminated by the post-parsing semantic interpretation process. This can be extremely computationally inefficient.
(2) Build special purpose so-called semantic grammars that are much less ambiguous than general grammars and support very simple semantic interpretation processes. Semantic grammars are discussed by J. S. Brown, R. R. Burton and J. De Kleer in “Pedagogical, Natural Language and Knowledge Engineering Techniques in Sophie I, II, and III”, in Intelligent Tutoring Systems, D. Sleeman and J. S. Brown, Editors, Academic Press, London (1982). The problem with semantic or domain-specific grammars is that a new one must be built for each domain; i.e., there is a portability issue.
There are significant practical problems with both approaches in many real world applications that use natural language interfaces for input. In many such applications, e.g., electronic mail (e-mail) auto-response or auto-routing systems, or self-service product and service ordering applications on the World Wide Web (WWW) portion of the Internet (or simply “the Web”), a user input could be about any of a variety of topics and, even worse, a single input might refer to a number of topics. For a general purpose parser-based system, the issue is how to invoke the right semantic processing routines in an efficient manner. For a special-purpose semantic grammar-based system, the issue is how to invoke the right grammar(s) for interpretation. Running all the grammars on the data is in general extremely inefficient and can lead to errors in interpretation.
David D. Lewis and Richard M. Tong in “Text Filtering in MUC-3 and MUC-4”, pp. 51-66, in Fourth Message Understanding Conference (MUC-4), McLean, Va., Jun. 16-18, 1992, describe the emergence of text filtering as an explicit topic of discussion. The processes described, however, do not lend themselves to a solution to the problem of how to invoke the right semantic processing routines in an efficient manner. In the processes described, text documents are categorized into only two types: relevant versus non-relevant. Documents considered relevant are then processed by natural language processing algorithms. There is no suggestion of invoking non-linguistic processes based on categorization; e.g., invoking database queries to gather information for back-end applications or for humans is not part of the message understanding work. Nor does the message understanding work suggest dynamically categorizing an input document into zero, one or more categories, or assigning confidence labels.
What is needed is a configurable system that can efficiently and effectively determine for a given electronically represented text document (e-mail, Web form, scanned facsimile, output of speech recognition, etc.) which linguistic analysis and extraction processes, and even other application specific processes, should be invoked.
It is therefore an object of the invention to provide a configurable system that can efficiently and effectively determine for a given electronically represented text document which linguistic analysis and extraction processes should be invoked.
It is another object of the present invention to provide a rules based system that can efficiently and effectively determine for a given electronically represented text document which application specific processes should be invoked to provide more accurate answers to a user's query.
The preferred embodiment of the invention assumes a rules based classifier, where each category or topic is represented by a set of rules. In applications such as routing, the categorization that effects the routing can then be effectively combined with processes that extract other information. For example, if a user sends an e-mail asking to “apply for new home mortgage”, the categorization component would identify the general topic for routing as “Home Mortgage” and also invoke extractors that extract the name and other information of relevance for new home mortgage applications. Such information may include, for example, any information indicating the amount of the desired mortgage, whether the person is a current bank customer, location of the property, and the like. In contrast, if the person specifies an interest in “refinancing their current home mortgage”, the categorizer might also place this in the “Home Mortgage” category but invoke extractors specific to refinancing inquiries.
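By way of illustration only, the combination of rule-based categorization and category-specific extraction described above might be sketched as follows in Python. This is a minimal sketch under stated assumptions and not the claimed implementation: the `Rule` structure, the regular-expression patterns, the extractor functions, and the `current_rate` field are all hypothetical names chosen for the example. Each rule carries a routing category and its own set of extractors, so two inputs routed to the same “Home Mortgage” category can still trigger different extraction processes, and an input matching no rule receives zero categories.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# An extractor takes the raw text and returns a value, or None if absent.
Extractor = Callable[[str], Optional[str]]


def extract_name(text: str) -> Optional[str]:
    # Hypothetical extractor: looks for "my name is <Capitalized Words>".
    m = re.search(r"[Mm]y name is ([A-Z][a-z]+(?: [A-Z][a-z]+)*)", text)
    return m.group(1) if m else None


def extract_amount(text: str) -> Optional[str]:
    # Hypothetical extractor: a dollar amount such as "$250,000".
    m = re.search(r"\$\s?([\d,]+(?:\.\d{2})?)", text)
    return m.group(1).replace(",", "") if m else None


def extract_current_rate(text: str) -> Optional[str]:
    # Hypothetical refinancing-specific extractor: a percentage such as "6.5%".
    m = re.search(r"(\d+(?:\.\d+)?)\s?%", text)
    return m.group(1) if m else None


@dataclass
class Rule:
    category: str                      # routing category assigned on a match
    pattern: re.Pattern                # assumed rule form: one regular expression
    extractors: Dict[str, Extractor]   # extractors invoked only when the rule fires


# Illustrative rule set; both rules route to the same category
# but invoke different extractors.
RULES: List[Rule] = [
    Rule(
        "Home Mortgage",
        re.compile(r"\brefinanc\w*\b.*\bmortgage\b", re.I | re.S),
        {"name": extract_name, "current_rate": extract_current_rate},
    ),
    Rule(
        "Home Mortgage",
        re.compile(r"\b(apply|new)\b.*\bmortgage\b", re.I | re.S),
        {"name": extract_name, "amount": extract_amount},
    ),
]


def categorize_and_extract(text: str) -> List[Tuple[str, Dict[str, Optional[str]]]]:
    """Return (category, extracted fields) for every rule that matches the text."""
    results = []
    for rule in RULES:
        if rule.pattern.search(text):
            fields = {key: fn(text) for key, fn in rule.extractors.items()}
            results.append((rule.category, fields))
    return results
```

In this sketch, a new-application inquiry and a refinancing inquiry are both routed to “Home Mortgage”, but only the extractors attached to the matching rule are run, which avoids applying every extraction process to every input.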