This invention relates generally to Web crawlers, and more particularly to learning character patterns in queries to control the scope of Web crawler searches for Web pages.
In the context of the Internet and the World Wide Web (Web), an application layer of the Internet, Web crawlers are software programs that search for resources, such as Web pages, by following hyperlinks that connect the pages. This activity is sometimes known as "walking" the Web. Search services, such as Digital Equipment Corporation's AltaVista service, employ Web crawlers to build indexes of the words contained in the pages located by the search. Typically, the scope of the search is specified in a query.
In general, prior art Web crawlers search and walk the Web without much discrimination on where the pages are located. The location or address of each Web page is specified by a string of characters called a Universal Resource Locator (URL), e.g.,
"http://www.digital.com/homepage/topic/topic.html".
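For illustration only, the following minimal Python sketch decomposes such a URL string into the structural components referred to later in this description (host, directory path, filename, and file extension); the function name is illustrative and only the standard library is used:

    import posixpath
    from urllib.parse import urlsplit

    def decompose(url):
        """Split a URL string into its main structural components."""
        parts = urlsplit(url)
        directory, filename = posixpath.split(parts.path)
        _, extension = posixpath.splitext(filename)
        return {
            "scheme": parts.scheme,      # e.g. "http"
            "host": parts.netloc,        # e.g. "www.digital.com"
            "directory": directory,      # e.g. "/homepage/topic"
            "filename": filename,        # e.g. "topic.html"
            "extension": extension,      # e.g. ".html"
        }

    print(decompose("http://www.digital.com/homepage/topic/topic.html"))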
Hyperlinks are URLs embedded in Web pages to identify connected or "linked" pages. Potentially, the scope of most prior art Web crawlers is the entire Web. During a search, the links are followed to locate as many connected pages as possible.
On the other hand, an end-user crawler performs custom searches on behalf of an end-user. For example, an end-user crawler may locate and copy a subset of pages to a local disk. End-user crawlers have a restricted scope and typically walk only a small portion of the Web. For example, an end-user crawler may confine its walk to a single domain or "host" computer. That is, the scope of the search is restricted to a single site or a small number of sites.
Customizing an end-user crawler involves, in part, specifying the portion of the Web to be walked, that is, the scope of the search. This specification can be called a walking rule. It is possible to program very effective walking rules based on the URL character strings alone. Unfortunately, even this amount of programming is too complicated for many end-users.
In another approach, the search scope is manually defined. However, hand-coded rules are tedious to write, error-prone, and often sub-optimal. Learning walking rules from examples is another possibility. However, users typically are unable or unwilling to provide enough examples to infer a perfect walking rule, i.e., a rule that exactly matches the desired crawl scope.
Several end-user crawlers provide only broad control over crawl scope. The MacroBot crawler (http://www.ipgroup.com/macrobot/) has no restrictions on individual pages; it supports only limits on crawl depth or on the number of pages retrieved.
The Mapuccino/WebCutter crawler (http://www.ibm.com/java/mapuccino/) supports page restriction based on keyword matching against the textual content of the document. This approach can be time-consuming, because pages must be fetched before it can be decided whether or not their content is useful to the end-user. Also, content-based walking rules do not exploit syntactic patterns in the URL character strings that name pages within a given site.
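For comparison, a content-based walking rule of this kind can be sketched as follows (the keyword list is hypothetical); note that each candidate page must be fetched before the test can be applied, which is the source of the cost just described:

    from urllib.request import urlopen

    KEYWORDS = ["crawler", "walking rule"]   # hypothetical keywords of interest

    def page_matches(url, keywords=KEYWORDS):
        """Fetch a page and test whether its text contains any of the keywords.
        The fetch itself dominates the cost of this kind of rule."""
        with urlopen(url) as response:
            text = response.read().decode("utf-8", errors="replace").lower()
        return any(keyword.lower() in text for keyword in keywords)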
The HotCargo Express crawler (http://www.documagix.com/products/hotcargo_express/) uses regular expressions to identify the URLs of pages that should be downloaded. However, the regular expressions need to be hand-coded.
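Such a hand-coded rule might, for example, look like the following sketch; the regular expression shown is hypothetical and merely restricts the crawl to HTML pages under one directory of one host:

    import re

    # Hypothetical hand-coded rule: HTML pages under /homepage/ on www.digital.com only.
    RULE = re.compile(r"^http://www\.digital\.com/homepage/.*\.html?$")

    def should_download(url):
        """Return True if the URL matches the hand-coded regular expression."""
        return RULE.match(url) is not None

    print(should_download("http://www.digital.com/homepage/topic/topic.html"))  # True
    print(should_download("http://www.digital.com/jobs/index.html"))            # False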
There are prior art methods for inferring grammars and regular expressions in character strings from examples, such as in natural language processing. These inference methods are designed to work on a large training set without the need for much generalization.
These methods do not usually make use of negative examples, i.e., character strings that specify portions of the Web not to be searched. The grammars learned by these methods are usually too complex to be understood or edited by end-users. Hence, given the small set of examples that users are likely to provide, the prior art natural language processing methods might not work well.
Some methods attempt to infer a regular grammar from positive examples in the context of specific types of Web content, e.g., phone numbers or e-mail addresses in Web pages. These methods assume a large training set and learn patterns that are hard for humans to verify or edit, see Goan et al., "A Grammar Inference Algorithm for the World Wide Web", AAAI Spring Symposium, 1996 (http://www.parc.xerox.com/istl/projects/mlia/papers/goan.ps). See http://www-cse.ucsd.80/users/rik/MLIA.html for a listing of work applying machine learning to information retrieval.
Another approach to rule learning uses a decision tree. There, a tree of decision nodes is constructed from training data based on various attributes. For each node in the tree, an attribute is tested. Based on the result of the test, either a "yes" or "no" result is returned, or the decision process moves to a descendant node in the tree, where another attribute is examined similarly. In constructing such a tree, when a new node is added, the method chooses the attribute which is likely to yield the largest information gain at that stage.
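The attribute-selection step can be illustrated by the following short sketch, which computes the information gain of a candidate Boolean attribute over a labeled training set; the attribute and training examples are hypothetical:

    from math import log2

    def entropy(labels):
        """Entropy of a list of Boolean labels (positive/negative examples)."""
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        if p in (0.0, 1.0):
            return 0.0
        return -p * log2(p) - (1 - p) * log2(1 - p)

    def information_gain(examples, labels, attribute):
        """Reduction in entropy obtained by splitting the examples on a Boolean attribute."""
        yes = [lab for ex, lab in zip(examples, labels) if attribute(ex)]
        no = [lab for ex, lab in zip(examples, labels) if not attribute(ex)]
        remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(labels)
        return entropy(labels) - remainder

    # Hypothetical training set: does the URL end in ".html"?
    urls = ["a.html", "b.html", "c.gif", "d.html", "e.gif"]
    wanted = [True, True, False, True, False]
    print(information_gain(urls, wanted, lambda u: u.endswith(".html")))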
An example of such a method is described by Quinlan in "Induction of decision trees," Machine Learning, Vol. 1, pp. 81-106, 1986. For an overview of decision tree learning, see Russell et al., Section 18.3, "Learning Decision Trees," in Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence, 1995.
Decision tree learning is optimized to learn trees that are either concise in their description, using a "minimum description length" optimization, or fast to evaluate. Neither of these optimizations is an important criterion in the context of Web crawlers. The time required to fetch and process Web pages significantly exceeds the time required to match URLs. The decision trees learned by these methods also tend to be highly nested and hard for end-users to comprehend or modify.
None of the known learning and grammar inference methods can be "tuned" to match the characteristic textual grammar of URLs. Hence, none of them incorporates biases suited to the way URLs are named.
Hence, an automated approach to generating walking rules is desirable. One goal of such an automated approach is to allow end-users who are non-programmers to interactively train a crawler to walk a specific portion of the Web by giving a small set of simple examples. From the training set, a walking rule would be inferred that is comprehensible to the end-user. The user could then modify the input, or the rule directly, to develop a walking rule that better matches the desired crawl scope. Thus, a second goal is to specifically generate rules that are comprehensible to end-users, and to support iterative refinement. A third goal is to tune the rule inference to the specific domain of URLs and to exploit knowledge about URL naming. Hence, in building a walking rule based on character patterns, the system could preferentially select patterns that match actual structures of the character strings that specify a target Web site, such as the directories, paths, filenames, and file extensions typically found in URLs.
As a first goal, the invention provides a method for interactively training a Web crawler to search the Web according to a desired scope. During training, the input comprises examples of URLs that should be walked (positive examples) and URLs that should not be walked (negative examples) by the crawler. This input is used to infer a walking rule.
Users typically are unable or unwilling to provide enough examples to infer a perfect walking rule, i.e., a rule that exactly matches the desired crawl scope. Hence, it is preferable that the inferred rule be comprehensible to the end-user. This allows the user to modify the input, or the rule directly, and to develop a rule that better matches the scope of the desired crawl. Thus, a second goal is to specifically generate rules that are comprehensible to end-users, and to support iterative refinement.
The third goal is to allow the user to tune the inferred rule to the specific domain of URLs, and to exploit knowledge about URL naming. Hence, in building a rule based on character patterns, the system should preferentially select patterns that match actual structures on the target Web site, such as directories, paths, filenames, and file extensions.
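One way to realize such a preference, sketched below under the assumption that host names, directory prefixes, and file extensions are the structures of interest, is to enumerate candidate patterns that follow directory boundaries rather than arbitrary character positions; the helper name is illustrative and is not taken from the claimed method:

    from urllib.parse import urlsplit

    def candidate_patterns(url):
        """Enumerate candidate patterns aligned with the structure of a URL:
        the host, each successive directory prefix, and the file extension."""
        parts = urlsplit(url)
        prefix = parts.scheme + "://" + parts.netloc + "/"
        patterns = [prefix]
        segments = [s for s in parts.path.split("/") if s]
        for directory in segments[:-1]:          # directory components
            prefix += directory + "/"
            patterns.append(prefix)
        if segments and "." in segments[-1]:     # filename with an extension
            patterns.append("*." + segments[-1].rsplit(".", 1)[1])
        return patterns

    print(candidate_patterns("http://www.digital.com/homepage/topic/topic.html"))
    # ['http://www.digital.com/', 'http://www.digital.com/homepage/',
    #  'http://www.digital.com/homepage/topic/', '*.html']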
Therefore, a search for Web resources according to a rule includes the step of defining a scope for an initial Web search by supplying settings. The settings limit the scope of the search to produce a set of qualifying resources. Positive and negative examples are selected from the set, and a rule is inferred from the selected examples. The rule is rendered, along with the subset of resources that match the rule. The selecting, inferring, and rendering can be repeated while searching the network.
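How these steps might fit together can be suggested by the following minimal sketch, which assumes a deliberately simple prefix-based rule representation; the function names and the inference scheme are illustrative only and are not the method claimed here:

    def infer_rule(positive, negative):
        """Infer a set of URL prefixes that cover the positive examples and
        exclude the negative examples (a deliberately simple illustration)."""
        rule = set()
        for url in positive:
            prefix = url.rsplit("/", 1)[0] + "/"   # directory of the positive example
            if any(neg.startswith(prefix) for neg in negative):
                prefix = url                        # fall back to the exact URL
            rule.add(prefix)
        return rule

    def matches(rule, url):
        """A URL qualifies if it starts with any prefix in the rule."""
        return any(url.startswith(prefix) for prefix in rule)

    # Hypothetical training examples selected by the end-user:
    positive = ["http://www.digital.com/homepage/topic/topic.html"]
    negative = ["http://www.digital.com/jobs/index.html"]
    rule = infer_rule(positive, negative)
    print(matches(rule, "http://www.digital.com/homepage/topic/other.html"))  # True
    print(matches(rule, "http://www.digital.com/jobs/index.html"))            # False

In keeping with the goals above, the inferred rule would then be rendered to the end-user, who could edit the examples or the rule itself before the crawl is repeated.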