1. Field of the Invention
The present invention relates to matching regular expressions (regex) in data repositories such as the World Wide Web.
2. Description of the Related Art
Enormous amounts of information are available via computer networks, notably the World Wide Web. The information explosion has resulted in over a billion Web pages being available, with the number growing every day.
Perhaps the biggest challenge posed by this information deluge is how to sift through the mountain of data to find specific information. Not surprisingly, much effort has been devoted to efficiently finding sought-after information on the Web. Typically, such efforts focus on key word searching, i.e., enabling a user to type in one or more key words and subsequently returning Web pages that contain some or all of the words, in some ordered fashion.
As recognized herein, however, key word searching often requires a considerable amount of query result analysis by the user. That is, it remains the responsibility of the user to sift through the pages and extract the information being sought. To illustrate, suppose a user wants to find the middle name of Thomas Edison. Simply inputting the query xe2x80x9cThomas Edisonxe2x80x9d to a key word search engine will result in many hundreds or perhaps thousands of pages being returned, with the user being forced to wade through the result set until he or she happens upon the middle name.
In addition, key word search cannot match patterns which occur within words in text. Consider, for instance, a user who wants to find all references to .mp3 files stored on Stanford University computer systems. Searches of this kind are not possible using key word search methods.
As further recognized herein, examples such as the two postulated above can be more efficiently addressed using what is known in the art as xe2x80x9cregular expressionsxe2x80x9d, or xe2x80x9cregexesxe2x80x9d. A regex is distinct from a key word search. Essentially, a regex is a string which defines a set of strings satisfying a pattern. A regex can be specified using a number of syntactic methods, including, e.g., that used by the tools flex, awk, and grep, or as set forth in the textbook Automata Theory, Languages, and Computation. 
In the context of the above example, a user could use the following regex to efficiently find the middle name of Thomas Edison: xe2x80x9cThomas  a+ Edisonxe2x80x9d, with xe2x80x9c axe2x80x9d meaning xe2x80x9cany letter in the alphabetxe2x80x9d and xe2x80x9c+xe2x80x9d meaning xe2x80x9cone or more repetitions of the previous characterxe2x80x9d. Thus, the expression xe2x80x9c a a+xe2x80x9d means xe2x80x9cfind strings matching the pattern Thomas [at least two letters of the alphabet] Edisonxe2x80x9d, and would return only strings that have at least two letters between the words xe2x80x9cThomasxe2x80x9d and xe2x80x9cEdisonxe2x80x9d. In this way, strings that have only the middle initial of Edison or no middle name at all are not returned. Only strings having at least two letters (likely to be the name xe2x80x9cAlvaxe2x80x9d) between xe2x80x9cThomasxe2x80x9d and xe2x80x9cEdisonxe2x80x9d are returned.
The present invention still further recognizes, however, that although regex matching can be used in the context of Web searching, for very large repositories like the Web, matching a regular expression using conventional regex principles can take days. Having recognized the above-noted considerations and problems, the present invention provides solutions to one or more of them as disclosed below.
To address the issues noted above, a regex query system is disclosed in which regex-based indices are first built using multigrams. When a regex query is received, the indices are used to return a set of potentially matching pages, which are then quickly and efficiently searched for matches to the regex query.
A general purpose computer is programmed according to the inventive steps herein. The invention can also be embodied as an article of manufacturexe2x80x94a machine componentxe2x80x94that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to execute the present logic. This invention is realized in a critical machine component that causes a digital processing apparatus to perform the inventive method steps herein.
Accordingly, a general purpose computer includes a regular expression (regex) index generator that generates at least one multigram index, and a run time engine that receives a regex query and that accesses the index to respond to the query. The run time engine returns data from a repository of, e.g., Web pages, in response to the query. In a preferred non-limiting embodiment, a Web crawler is associated with the data repository, and the data repository includes Web pages.
As set forth in detail below, the run time engine parses the regex query and generates an execution plan based on the parsed query. Using the execution plan, potentially matching documents in the repository are found using the index. Then, the query is executed against the matching pages.
Preferably, the index includes plural keys, with each key having an associated posting list representing documents in the repository containing the key. In a preferred implementation, the keys are multigrams. The index preferably contains only minimal useful multigrams relative to a user-defined usefulness factor xe2x80x9ccxe2x80x9d. Further, the index can be further limited to include only suffix-free minimal useful multigrams. If desired, the index can be a predicate index including multigrams representing binary numbers.
In the presently preferred embodiment the run time engine generates an execution plan based on a parsed query by rewriting a query to have only xe2x80x9cORxe2x80x9d or xe2x80x9cSTARxe2x80x9d logical connectives. The run time engine constructs a parse tree and replaces STAR nodes in the tree with NULL, and then removes NULL nodes. In one optimization, the run time engine executes the query by performing a fall regex match only in a predetermined vicinity of an anchor in a document.
In another aspect, a computer program device includes a computer program storage device that can be read by a digital processing apparatus. A program is on the program storage device. The program includes instructions that are executable by the digital processing apparatus for executing a regex query The program can include computer readable code means for generating at least one index, and computer readable code means for accessing the index to execute a regex query.
In still another aspect, a computer-implemented method for executing a regular expression (regex) query against a data repository includes generating at least one multigram index, and receiving a regex query. The method includes accessing the index based on the query. A set of Web pages is returned based on the accessing act. The Web pages are then searched for matches to the regex query.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which: