There are many situations in which it is desirable to extract key pieces of information from documents. For example, in transcribed voice mail messages, the name of the caller and any return numbers that were left are crucial for summarizing the call. Similarly, when resumes are submitted to a company along with a cover letter, it is desirable to extract the job objective and salary requirements of the applicant in order to determine whether a suitable match exists. This invention comprises a simple, efficient, and effective way of extracting such information.
In the past, several techniques have been developed to solve the somewhat simpler problem of “named entity extraction.” In this task, the goal is to identify all occurrences of certain classes of words in a document. For example, all the person-names, city-names, dates, and times might be identified. One way of identifying such entities is to train a statistical classification system to tag each word in the document as either a “person-name,” “city-name,” “date,” “time,” or “other” word. Examples of this sort of approach are disclosed in, e.g., Ratnaparkhi, “A Maximum Entropy Part of Speech Tagger,” Conference on Empirical Methods in Natural Language Processing, University of Pennsylvania, 1996; and Brill, “Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging,” Computational Linguistics, December 1995, the disclosures of which are incorporated by reference herein. The problem of extracting key pieces of information can also be viewed as a tagging problem, in which each word is tagged as either “key information” or “irrelevant.” However, such a tagging approach labels words one at a time and therefore fails to explicitly identify the portions of text that are important.
Thus, there is a need for data processing techniques which explicitly identify the portions of data being sought, rather than only implicitly identifying such portions by tagging individual words.
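The limitation of word-level tagging noted above can be illustrated with a minimal Python sketch. It is not taken from the cited references or from the invention itself. The function name `spans_from_tags`, the sample sentences, and the tag assignments are all hypothetical, and the tagger output is simply assumed rather than computed. The sketch recovers portions of text from per-word tags by grouping adjacent words that share a tag, which is the only boundary information such tags provide.

```python
from itertools import groupby


def spans_from_tags(tagged_words):
    """Group adjacent words that share a non-"other" tag.

    Returns a list of (tag, start, end) tuples, where start is inclusive
    and end is exclusive. This is a hypothetical post-processing step; the
    tags themselves carry no explicit boundary information.
    """
    spans = []
    index = 0
    for tag, group in groupby(tagged_words, key=lambda pair: pair[1]):
        length = len(list(group))
        if tag != "other":
            spans.append((tag, index, index + length))
        index += length
    return spans


# Assumed tagger output for a transcribed voice mail message.
message = [
    ("this", "other"), ("is", "other"),
    ("john", "person-name"), ("smith", "person-name"),
    ("calling", "other"), ("from", "other"),
    ("boston", "city-name"),
    ("on", "other"),
    ("monday", "date"),
]

# Two different people named back to back: the per-word tags are identical,
# so nothing marks where the first name ends and the second begins.
two_callers = [
    ("john", "person-name"), ("smith", "person-name"),
    ("mary", "person-name"), ("jones", "person-name"),
]

print(spans_from_tags(message))
print(spans_from_tags(two_callers))
```

In the second input, the four words collapse into a single span because the word-level tags do not say where one name stops and the next starts. This is one concrete sense in which tagging identifies the important portions only implicitly.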