The present invention relates to information extraction rules, and in particular, to systems and methods of generating information extraction rules.
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
A vast amount of information is accessible on the internet and much more is stored in other repositories. Moreover, a large amount of new information is produced every day. The sheer volume of this information makes human processing of it impossible. In most cases, computer processes such as information extraction provide the only practical way to process a large volume of information in a timely manner. Information extraction is a computer process in which the computer applies certain rules to a piece of text and extracts the pieces of information of interest from the text according to those rules. Developing information extraction rules is a complicated process that requires special skills as well as time and effort. What makes a particular extraction rule a good rule to use is often subjective and depends upon the needs of the actual application. One criterion of a good rule is accuracy, which may be measured by precision (the fraction of extracted items that are correct) and recall (the fraction of correct items that are extracted). Another criterion of a good rule is speed, since a rule is not useful if it takes too long to apply to a document. Another criterion of a good rule is complexity; a rule should be simple enough to run on a reasonable machine (e.g., it does not require too many system resources) in a reasonable period of time (e.g., it does not require too much processing time). Even for experienced rule developers, developing good, accurate rules can be difficult and time consuming.
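As an illustration of applying an extraction rule and measuring its accuracy, the following minimal Python sketch may be helpful. It is not taken from any particular system; the rule `BIRTH_YEAR_RULE`, the sample text, and the hand-labeled `gold` set are all hypothetical and chosen only to show how precision and recall could be computed for a single rule.

```python
import re

# Hypothetical rule: a capitalized name followed by "was born in <year>".
BIRTH_YEAR_RULE = re.compile(r"([A-Z][a-z]+) was born in (\d{4})")


def extract(text):
    """Apply the rule to the text and return a set of (name, year) pairs."""
    return set(BIRTH_YEAR_RULE.findall(text))


def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sample text and hand-labeled correct answers.
text = (
    "Mozart was born in 1756. Beethoven, born 1770, moved to Vienna. "
    "Vienna was born in 1900 as a modern city."
)
gold = {("Mozart", "1756"), ("Beethoven", "1770")}

extracted = extract(text)
precision, recall = precision_recall(extracted, gold)
```

In this sample the rule finds one correct pair and one spurious pair, and it misses the differently phrased Beethoven sentence, so both precision and recall are 0.5. This illustrates why a single hand-written rule is rarely sufficient.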
There have been a number of systems that automatically generate extraction rules using Machine Learning techniques. In general, a Machine Learning system automatically produces (induces) models, such as rules and patterns, from data. For example, a Machine Learning system may be configured to perform a task based on a set of rules created through user parameterization of heuristic rules, whether by direct parameter input, a training period, or both. As with most Machine Learning-based systems, these systems require large amounts of tagged data to generate each extraction rule. For most applications, finding tagged samples requires a significant effort and is a large and difficult task by itself. Therefore, while these systems do not require the specialized skills needed to write extraction rules manually, the amount of effort required is often comparable to or in excess of that of creating the rules manually.
There have also been a number of systems that can generate extraction rules automatically using untagged data. Virtually all of these systems are bootstrapping-based. Using “seed knowledge” provided by the user, the system automatically generates tagged samples from a corpus (a collection of documents). Extraction rules are then derived automatically from the generated samples. For example, the user can provide the system with the person name “Mozart” and the year of his birth. The system then finds all the documents on the internet, or in some other large corpus, that contain this piece of “target” information. From these documents, the system determines the different ways a person's birth year can be encoded in a piece of text, and from these the system creates the rules for extracting a person's birth year from text.
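The bootstrapping procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method of any particular system: the functions `induce_patterns` and `apply_patterns`, the slot expressions, and the four-sentence corpus are hypothetical, and a pattern here is simply the literal text found between a seed name and a seed year.

```python
import re

# Hypothetical slot expressions for a one-word name and a four-digit year.
NAME_SLOT = r"(?P<name>[A-Z][a-z]+)"
YEAR_SLOT = r"(?P<year>\d{4})"


def induce_patterns(seeds, corpus):
    """Derive patterns from sentences that mention a seed name then its year."""
    patterns = set()
    for name, year in seeds:
        for sentence in corpus:
            start = sentence.find(name)
            if start == -1:
                continue
            end = sentence.find(year, start + len(name))
            if end == -1:
                continue
            # Keep the literal text between the name and the year as context.
            infix = sentence[start + len(name):end]
            patterns.add(NAME_SLOT + re.escape(infix) + YEAR_SLOT)
    return patterns


def apply_patterns(patterns, corpus):
    """Apply every induced pattern to every sentence in the corpus."""
    found = set()
    for pattern in patterns:
        for sentence in corpus:
            for match in re.finditer(pattern, sentence):
                found.add((match.group("name"), match.group("year")))
    return found


# Hypothetical seed knowledge and corpus.
seeds = {("Mozart", "1756")}
corpus = [
    "Mozart was born in 1756.",
    "Mozart (b. 1756) composed over 600 works.",
    "Beethoven was born in 1770.",
    "Haydn (b. 1732) wrote over 100 symphonies.",
]

patterns = induce_patterns(seeds, corpus)
new_facts = apply_patterns(patterns, corpus) - seeds
```

With this corpus the single seed yields two patterns, one for each phrasing, and those patterns recover the birth years of Beethoven and Haydn. The sketch works only because the seed fact appears in the corpus in more than one phrasing; if it did not appear at all, no patterns would be induced.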
However, these bootstrapping-based systems are limited in terms of applicability. The systems use methods based on the assumption that the information sought (such as a named entity, fact or event) occurs abundantly in the corpus. For example, there would be thousands if not millions of web pages stating some person's birth year, and using a small number of well-known persons as seeds allows the system to collect all the data needed to generate the rules for extracting the birth year of a person. This assumption holds only for basic or common entities, facts and events. For entities, facts and events that are less common, the assumption does not hold, and these bootstrapping-based systems do not work.
Thus, there is a need for improved systems and methods of generating extraction rules. The present invention solves these and other problems by providing systems and methods of generating information extraction rules.