1. Technical Field
This invention relates to extracting information from text using regular expressions. More specifically, the invention relates to building a general-purpose index to handle complex regular expressions at the character level.
2. Description of the Related Art
A regular expression, hereinafter referred to as a regex, is a set of pattern matching rules encoded in a string according to certain syntax rules. More specifically, a regex is a string which defines a set of strings satisfying a pattern. A regex can be specified using a number of syntactic methods. Regexes are widely used as the pattern specification language in applications such as information extraction, network packet scanning and filtering, information dissemination, and document search tools. Regex evaluation has become a fundamental operation for information searching, mining, and extraction over a text database.
One prior approach for extracting structured information from text executes extraction rules over individual documents. The effectiveness of this form of information extraction depends upon the quality of the rules employed. An information extraction rule developer hypothesizes some initial rules, followed by an iterative trial and error process for modifying the initial rules. In such an exploratory task, multiple arbitrary regexes are evaluated over a large text collection. However, problems arise with large text collections, wherein the time employed for the information extraction increases significantly. One solution for reducing the time for executing the extraction rules is to pre-process the documents and create an index so that for any specific regex only the documents that may contain at least one match are examined.
FIG. 1 is a prior art block diagram (100) of an architecture for exploiting indexes in regex evaluation. There are two primary modules, an offline indexing module (110) to digest the document collection and to create an index (130), and a run-time module (120) to exploit the index and filter out documents guaranteed not to contain a match for a given query. The indexing module (110) receives documents (105), and the run-time module (120) receives regex queries (140) and, after consulting the index (130), returns documents containing a match (150). The offline indexing module (110) is employed to create an index that can support regex queries to properly filter returned documents. In a prior art multigram index, the following regex query: \p{alpha}{1,12}@\p{alpha}{1,10}\.edu, is efficiently supported. More specifically, the regex is properly supported by the index because of the presence of the .edu string in the expression, which is a multigram present in the index and can be used to filter documents. When the input regex does not contain any such multigram string, such as: \p{alpha}{1,12}@\p{alpha}{1,10}\.\p{alpha}{1,10}, the index cannot properly filter documents for the input query. In general, for complex regular expressions that may not contain a multigram string, the index cannot filter documents effectively.
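By way of illustration only, the two-module architecture described above may be sketched as follows. This is a minimal sketch under stated assumptions, not the prior art implementation: it indexes every fixed-length character n-gram rather than a selected set of multigrams, it substitutes the character class [A-Za-z] for \p{alpha} because the Python re module does not support \p classes, and it requires the caller to supply the literal string that every match must contain. The function names build_index, candidates, and search, and the sample documents, are hypothetical.

```python
import re
from collections import defaultdict


def build_index(docs, n=3):
    """Offline step: map each character n-gram to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add(doc_id)
    return index


def candidates(index, all_ids, required_literal, n=3):
    """Run-time step: ids of documents that may contain a match.

    required_literal is a string every match must contain (an assumption:
    it is supplied by the caller, not derived from the regex here).
    """
    if len(required_literal) < n:
        # No usable n-gram in the query, so no document can be ruled out.
        return set(all_ids)
    result = set(all_ids)
    for i in range(len(required_literal) - n + 1):
        result &= index.get(required_literal[i:i + n], set())
    return result


def search(docs, index, pattern, required_literal="", n=3):
    """Evaluate the regex only over candidate documents.

    Returns (sorted ids of matching documents, number of documents examined).
    """
    regex = re.compile(pattern)
    ids = candidates(index, docs.keys(), required_literal, n)
    matches = sorted(d for d in ids if regex.search(docs[d]))
    return matches, len(ids)


# Hypothetical sample collection.
docs = {
    1: "contact alice@wisc.edu",
    2: "no address here",
    3: "mail bob@example.org",
}
index = build_index(docs)

# Query containing the literal ".edu": the index can rule documents out.
edu_result = search(docs, index, r"[A-Za-z]{1,12}@[A-Za-z]{1,10}\.edu", ".edu")

# Query with no literal of usable length: every document must be examined.
any_result = search(docs, index, r"[A-Za-z]{1,12}@[A-Za-z]{1,10}\.[A-Za-z]{1,10}")
```

In this sketch the first query is evaluated against a single candidate document, whereas the second query falls back to scanning the whole collection, which mirrors the limitation noted above for regexes that contain no indexed string.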
Applications, such as information extraction, evaluate complex regex queries consisting of regex constructs including, but not limited to, character classes, groups, quantifiers, and disjunctions. Neither the prior art offline indexing module (110) nor the run-time module (120) is configured to address the challenges associated with processing complex regex queries. Fully exploiting an index while ensuring that the system can handle arbitrary regexes is challenging.
Accordingly, there is a need to build a filter index that supports complex regex queries by eliminating documents guaranteed not to contain a match for the query. Such a filter index allows a complex regex to be evaluated over fewer documents, thereby improving overall execution time in query evaluation.