The present invention relates to information extraction, and in particular, systems and methods for modular information extraction.
Generally speaking, search engines provide an interface that allows a user to query a group of items based on specific criteria about an item of interest. FIG. 1 illustrates an exemplary search engine. Search engine 120 receives criteria 110. Criteria 110 may include keywords or other descriptors describing the item of interest. Search engine 120 performs a query on items residing in or across information base 121 based on criteria 110 and returns a list of items as results 130. Depending on the criteria provided and the algorithm performed, different items prioritized in different manners may be returned to the user. The group of items may range from information on the World Wide Web to documents within a company database.
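The query flow described above can be illustrated with a minimal sketch. All names here are hypothetical and chosen for illustration only; the sketch simply matches criteria 110 against items in information base 121 and returns matching items as results 130.

```python
def search(criteria, information_base):
    """Return items whose text contains every keyword in the criteria."""
    keywords = [k.lower() for k in criteria]
    results = []
    for item in information_base:
        text = item.lower()
        # An item matches only if all criteria keywords occur in it.
        if all(k in text for k in keywords):
            results.append(item)
    return results

information_base = [
    "modular information extraction system",
    "compositional framework for analytics",
]
print(search(["information", "extraction"], information_base))
# prints ['modular information extraction system']
```

A real engine would rank results rather than merely filter them; this sketch omits ranking for brevity.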
To improve the quality of the search, items across information base 121 may be preprocessed. For example, an extraction engine may parse items and extract keywords or phrases that describe portions of the items. These keywords or phrases may be used by search engine 120 in the future to match search criteria provided by a user. FIG. 2 illustrates an example of preprocessing a document. Information extraction engine 220 generates keywords 230 based on common phrases within document 210. These keywords are a form of annotations as set forth below. Once the keywords are generated, they are stored (e.g., within document 210 or in a repository) for matching with user-specified search criteria.
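As a rough sketch of this preprocessing step, the extraction engine's common-phrase analysis can be stood in for by a simple term-frequency count. The function name, stopword list, and frequency heuristic below are all assumptions for illustration, not the engine's actual method.

```python
import re
from collections import Counter

def extract_keywords(document, top_n=3):
    """Annotate a document with its most frequent content terms,
    a crude stand-in for common-phrase extraction."""
    words = re.findall(r"[a-z]+", document.lower())
    # A tiny illustrative stopword list; real systems use larger ones.
    stopwords = {"the", "a", "an", "of", "and", "to", "in", "is"}
    counts = Counter(w for w in words if w not in stopwords)
    return [word for word, _ in counts.most_common(top_n)]
```

The returned keywords would then be stored alongside the document (or in a repository) for later matching against search criteria.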
As the group of items in information base 121 increases, it becomes increasingly difficult to generate accurate search results. For example, a query for a desired document may be successful within an information base of 100 items. However, when the information base is expanded to 10,000 items or more, the same query may not return the desired document. This has led to increasing demand for more precise search engines. One method of improving search engines is to improve the extraction of information from the items.
One particular problem with existing information extraction technology is the lack of precision for keyword-based search queries. For instance, the same search keywords may be used, with varying frequency, to find the same or different items of information. However, even though the correct result for a number of given queries may have been available in the information base, none of the right results may have been returned or identified as more relevant (e.g., returned within the first three answer pages or 30 result links) by the search engine.
The ultimate goal of any search system is to answer the intention behind the query. Unfortunately, building extraction systems based on intention recognition technology remains a time- and cost-intensive project. First, common intentions need to be unraveled. Next, possible answer sources need to be identified. Third, for each source, specific information extraction technologies need to be developed to capture entities and relationships from relevant documents. Last, user intentions and content sources change over time; thus, the search engine not only needs to be adjusted but also extended to new search intentions and content sources.
The problem of extracting and matching entities is difficult. Sophisticated approaches for named entity recognition (or record linkage, entity matching, reference reconciliation, duplicate detection, and fuzzy matching) have typically been based on rules or on learning techniques. Unfortunately, typical approaches to developing information extraction (“IE”) programs have not been very satisfying. Perhaps the most straightforward approach is to employ an off-the-shelf, monolithic IE ‘blackbox’. Monolithic IE techniques are used to spot relevant entities and relationships in the underlying document corpus. However, this approach is cost-intensive, difficult to maintain, and difficult to extend. In particular, this approach severely limits the expressiveness of IE programs that can be developed. Hence, the most popular approach today is to decompose an IE task into smaller subtasks, apply off-the-shelf IE blackboxes or write hand-crafted code to solve each subtask, ‘stitch’ them together (e.g., using Perl, Java, C++), and perform any necessary final processing. This approach has the problem of generating large IE programs that are difficult to understand, debug, modify, and optimize.
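The decompose-and-stitch approach can be sketched as follows. Each subtask below is a toy stand-in (the function names and the capitalized-token heuristic are hypothetical), and the glue loop represents the hand-written stitching code that chains subtasks together with final processing at the end.

```python
def tokenize(document):
    """Subtask 1: split the raw document into tokens."""
    return document.split()

def spot_capitalized(tokens):
    """Subtask 2: a toy entity spotter that keeps capitalized tokens."""
    return [t for t in tokens if t[:1].isupper()]

def finalize(entities):
    """Final processing: deduplicate while preserving order."""
    return list(dict.fromkeys(entities))

def extract(document):
    # Hand-written glue code 'stitching' the subtasks together.
    result = document
    for stage in (tokenize, spot_capitalized, finalize):
        result = stage(result)
    return result

print(extract("Alice met Bob and Alice in Paris"))
# prints ['Alice', 'Bob', 'Paris']
```

As the section notes, real programs built this way grow large, with the glue logic entangled with the subtasks, which is what makes them hard to debug, modify, and optimize.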
One approach has proposed compositional frameworks for developing IE programs. A prime example of such frameworks is UIMA. UIMA stands for Unstructured Information Management Architecture. UIMA is a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies. UIMA proposes an ‘object-oriented’ language with standard object APIs. This language allows developers to code each IE subtask as an extraction object, then compose new extraction objects from existing objects. Such languages can make writing, debugging, and modifying IE programs much easier. To generate new extraction objects for such UIMA-like languages, ‘generic’ information extraction and composition operators would be extremely valuable.
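The ‘object-oriented’ composition idea can be illustrated with a minimal sketch. The class and method names below are hypothetical and do not reflect UIMA's actual APIs; the point is only that each subtask is an extraction object exposing a common interface, and new extraction objects are composed from existing ones.

```python
import re

class Annotator:
    """Common API that every extraction object exposes (hypothetical)."""
    def process(self, text, annotations):
        raise NotImplementedError

class NumberAnnotator(Annotator):
    """One subtask coded as an extraction object: spot numbers."""
    def process(self, text, annotations):
        annotations["numbers"] = re.findall(r"\d+", text)
        return annotations

class WordCountAnnotator(Annotator):
    """Another subtask: annotate the document with its word count."""
    def process(self, text, annotations):
        annotations["word_count"] = len(text.split())
        return annotations

class Aggregate(Annotator):
    """A new extraction object composed from existing objects."""
    def __init__(self, *parts):
        self.parts = parts
    def process(self, text, annotations):
        for part in self.parts:
            annotations = part.process(text, annotations)
        return annotations

composed = Aggregate(NumberAnnotator(), WordCountAnnotator())
print(composed.process("ordered 3 copies of 2 books", {}))
# prints {'numbers': ['3', '2'], 'word_count': 6}
```

Because every object honors the same `process` interface, composites can themselves be composed, which is what makes such programs easier to write, debug, and modify than monolithic or hand-stitched code.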
Thus, there is a need for improved systems and methods for information extraction. The present invention solves these and other problems by providing systems and methods for modular information extraction.