Information extraction may be described as the task of identifying facts about a given entity in documents. Information retrieval, on the other hand, returns a subset of documents that are relevant to a given query. Many situations exist in which it is desirable to extract key pieces of information from databases, such as collections of documents. For example, in transcribed voice mail messages, the name of the caller and any return numbers that were left are crucial for summarizing the call. Also, when résumés are submitted to a company along with a cover letter, it is desirable to extract the job objective and salary requirements of the applicant, in order to determine if a suitable match exists.
Information extraction is made difficult because many ways exist of expressing the same fact. For example, the following three sentences contain the same information in different forms:
BNC Holdings Inc named Ms G Torretta as its new chairman.
Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.
Ms. Gina Torretta took the helm at BNC Holdings Inc.
When the information to be extracted is present in a single sentence, that information may be referred to as “localized information”. In contrast, the information to be extracted may be spread across several sentences. For example:
After a long boardroom struggle, Mr Andrews stepped down as chairman of BNC Holdings Inc. He was succeeded by Ms. Torretta.
The spread of information across several sentences adds further difficulty to the task of information extraction.
Hitherto, methods for the extraction of facts from documents using templates have been proposed. Such methods collect contextual clues (both syntactic and semantic) around hand-picked “patterns”, and then generalize those patterns. These generalized patterns are usually represented using regular expression constructs. Considerable manual effort and time have to be spent in constructing new templates and marking patterns in free text, and as a result these methods are time-consuming. Furthermore, such methods cannot be readily re-used for extracting new types of facts.
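The brittleness of hand-written patterns can be illustrated with a minimal sketch. The pattern below is a hypothetical one, written for this illustration only and not taken from any prior method; it is tuned to the first of the three example sentences given earlier and is applied to all three.

```python
import re

# Hypothetical hand-written pattern covering ONE phrasing of the fact
# "<company> named <person> as its new chairman".
PATTERN = re.compile(
    r"(?P<company>[A-Z][\w ]*?) named "
    r"(?P<person>(?:Ms|Mr)\.? [\w ]+?) as its new chairman"
)

SENTENCES = [
    "BNC Holdings Inc named Ms G Torretta as its new chairman.",
    "Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.",
    "Ms. Gina Torretta took the helm at BNC Holdings Inc.",
]


def extract(sentence):
    """Return the captured fields as a dict, or None if the pattern misses."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None


results = [extract(s) for s in SENTENCES]
for sentence, result in zip(SENTENCES, results):
    print(result, "<-", sentence)
```

Only the first sentence is matched; the other two state the same fact but would each need a further hand-written pattern, which is the manual effort referred to above.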
In many situations, it is sufficient to identify a sentence or group of sentences that contains the required pieces of information. As an example, when information pertaining to “management changes in companies” is to be identified, it may be sufficient to identify sentences that describe new appointments. A sentence or a group of sentences may be termed a “snippet”. A snippet that contains some fact is termed a “factoid”.
Factoids may be categorized based on what information they convey. For example, factoids that describe new appointments in companies may be grouped together under a “Change in Management” category. Thus, “Change in Management” is an example of a factoid category.
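The terms introduced above can be pictured as a simple data structure. The sketch below is purely illustrative: the class names `Snippet` and `Factoid`, the helper `group_by_category`, and the assignment of categories by hand are assumptions made for this example and do not describe any particular method of identifying factoids.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """A sentence or a group of sentences (hypothetical representation)."""
    sentences: tuple


@dataclass(frozen=True)
class Factoid:
    """A snippet that contains some fact, labeled with a factoid category."""
    snippet: Snippet
    category: str


def group_by_category(factoids):
    """Group factoids under their category names."""
    groups = defaultdict(list)
    for factoid in factoids:
        groups[factoid.category].append(factoid)
    return dict(groups)


# Categories are assigned by hand here, for illustration only.
factoids = [
    Factoid(
        Snippet(("BNC Holdings Inc named Ms G Torretta as its new chairman.",)),
        "Change in Management",
    ),
    Factoid(
        Snippet((
            "After a long boardroom struggle, Mr Andrews stepped down "
            "as chairman of BNC Holdings Inc.",
            "He was succeeded by Ms. Torretta.",
        )),
        "Change in Management",
    ),
]

groups = group_by_category(factoids)
```

Here both a single-sentence snippet and a two-sentence snippet fall under the same “Change in Management” category.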
Different methods have been proposed to identify portions of data in a document that a user deems relevant or important in terms of information content. The method disclosed in U.S. Pat. No. 6,842,796 entitled “Information extraction from documents with regular expression matching” provides techniques for exploiting the readily-identifiable structure of language to explicitly identify portions of data in a document that a user seeks to be identified, such as relevant or important information. “Regular expressions” are used to identify the information-bearing portions of a document. However, this method requires considerable manual effort for generating these expressions.
A need therefore exists for an improved method for identification and extraction of factoids.