1. Field of the Invention
This invention generally relates to the field of search engine technology, and more particularly relates to crawlers, robots, and spiders, and to a method of improving the performance of a crawler-based search engine using a proxy-type device to modify hyperlink requests and HTML pages.
2. Description of Related Art
Currently, searches on the Internet, and more specifically on the World Wide Web, are performed by users using a number of commercial search engines. These search engines are accessed at various web sites maintained by the operators of the search engines. Typically, to perform a search the user will enter terms to be searched into a form, and may also make selections from pull-down menus and checkboxes, to enter a search request on a search engine's web site. Then, the search engine will return a listing of web sites that contain the entered terms.
Search engines perform many complex tasks which can be generally categorized as front-end and back-end tasks. For example, when the user enters the terms and executes a search, the search engine service does not immediately search the Internet or World Wide Web for web sites containing data matching the search terms. This method would be slow and cumbersome given the huge number of web sites that must be searched in order to find potential matches. Instead, the search engine service will search its own internal database of cataloged terms and corresponding web sites to find matches for the entered terms. The process of accepting the user's input, searching the internal database, and displaying the results for the user are examples of front-end tasks.
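By way of illustration only, the lookup against an internal database of cataloged terms can be sketched as an inverted index. The names build_index, search, PAGES, and INDEX, and the example.com addresses, are hypothetical and chosen for this sketch; actual search engines use far more elaborate storage and ranking.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return, in sorted order, the URLs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)


# Hypothetical catalog built in advance by back-end tasks.
PAGES = {
    "http://example.com/a": "patent search engine",
    "http://example.com/b": "web crawler search",
}
INDEX = build_index(PAGES)
```

In this sketch a query is answered entirely from INDEX, without contacting any web site at search time.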
However, the search engine must perform back-end tasks unseen by the user in order to create and maintain its database of terms and corresponding web sites. These back-end tasks include searching for common terms on the Internet or World Wide Web, and cataloging their locations in the search engine's internal database so that the data can be provided quickly and efficiently to users in response to a search request.
Among the devices used by search engines to find data on the Internet and the World Wide Web are robots, crawlers, and spiders. Crawlers, spiders, and robots all work in a similar manner. These devices start by issuing a hyperlink request to a web site of interest. A hyperlink request contains a Uniform Resource Locator, or URL, which indicates the address of a particular web page containing data. In response to the hyperlink request, the web site will send data back to the crawler. This data may be Hyper Text Markup Language pages, known as HTML pages, or other documents. Once the crawler has received an HTML page, it will look for other hyperlinks contained within the HTML page itself. These new hyperlinks will be indexed and cataloged in the search engine's database. Then the crawler will follow the new hyperlinks and repeat the process, collecting more hyperlinks.
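The request, extract, and follow cycle described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function crawl, the regular expression HREF_RE, and the in-memory FAKE_WEB dictionary standing in for real hyperlink requests are all hypothetical, and a breadth-first order is assumed only for simplicity.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Simplistic pattern for double-quoted href attributes (illustration only).
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def crawl(seed_url, fetch, max_pages=100):
    """Fetch a page, catalog its hyperlinks, then follow each new hyperlink."""
    frontier = deque([seed_url])
    seen = {seed_url}
    catalog = {}  # URL -> list of hyperlinks found on that page
    while frontier and len(catalog) < max_pages:
        url = frontier.popleft()
        html = fetch(url)  # the "hyperlink request"
        if html is None:  # no data returned for this URL
            continue
        links = [urljoin(url, href) for href in HREF_RE.findall(html)]
        catalog[url] = links
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return catalog


# Hypothetical in-memory "web" used in place of network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": "no links here",
}

catalog = crawl("http://example.com/", FAKE_WEB.get)
```

The seen set keeps the crawler from requesting the same URL twice, and max_pages bounds the crawl.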
One significant limitation of current crawlers is that they only detect and follow static hyperlinks. Static hyperlinks are links in which the entire URL is plainly visible in the HTML page and easily extractable by the crawler. Examples include URLs such as “http://www.upsto.gov” that appear within an HTML tag. HTML tags are commands written in the HTML language. Tags that carry static hyperlinks include “<A>” anchor tags, “<IMG>” image tags, and “<FRAME>” child frame tags, among others. Thus, the crawler will look for URLs within these tags, and extract them from the HTML document for further processing.
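As a minimal sketch of static hyperlink extraction, the following uses the HTMLParser class from the Python standard library to collect URLs from anchor, image, and frame tags. The names STATIC_LINK_ATTRS, StaticLinkExtractor, extract_static_links, and SAMPLE, and the example.com addresses, are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser

# Tag name -> attribute that holds the static URL.
STATIC_LINK_ATTRS = {"a": "href", "img": "src", "frame": "src"}


class StaticLinkExtractor(HTMLParser):
    """Collect URLs that are plainly visible in A, IMG, and FRAME tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        wanted = STATIC_LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_static_links(html):
    parser = StaticLinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE = (
    '<A HREF="http://example.com/page.html">link</A>'
    '<IMG SRC="http://example.com/logo.gif">'
    '<FRAME SRC="http://example.com/child.html">'
    '<FORM ACTION="/search"><INPUT NAME="q"></FORM>'
)
```

Applied to SAMPLE, the extractor returns the three URLs from the anchor, image, and frame tags and ignores the form, since a form does not carry a complete URL for the data it leads to.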
However, the content on the Internet and World Wide Web that is accessible through static hyperlinks is dwarfed by the volume of content accessible via non-static hyperlinks such as those constructed from HTML forms. For example, many web pages contain a form requiring the user to enter either a selection or a keyword, and the user may also make selections via pull-down menus, checkboxes, and other selectable items. The user enters search terms and other values, collectively referred to herein as parameter values, into a search engine in a web site, such as by utilizing any of the above-mentioned mechanisms. In response to an input by the user, the web site will return additional data which may be in the form of an HTML page or other documents. Since existing crawlers are unable to supply this selection or keyword to the HTML form, the crawler cannot reach this additional data.
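To illustrate why such hyperlinks are non-static, the sketch below shows how a request URL might be constructed from a form's action and a set of parameter values. It assumes a form submitted with the GET method; the function build_form_request, the field names q and category, and the example.com addresses are hypothetical.

```python
from urllib.parse import urlencode, urljoin


def build_form_request(page_url, action, parameter_values):
    """Combine a form's action with user-supplied parameter values.

    Assumes the form uses the GET method, so the values are encoded
    into the query string of the resulting URL.
    """
    base = urljoin(page_url, action)
    return base + "?" + urlencode(parameter_values)


url = build_form_request(
    "http://example.com/catalog/index.html",
    "/search",
    {"q": "web crawler", "category": "books"},
)
# url == "http://example.com/search?q=web+crawler&category=books"
```

The resulting URL appears nowhere in the HTML page itself; it exists only once the parameter values have been supplied.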
Similarly, many web sites require the use of a client-side script. For instance, many web sites keep track of users who visit the site by requiring a user's identification, sometimes known as a user name. A web site may also require other information such as cookies, session identifiers, catalog names, and shopping cart identifiers, to name a few. Typically, this information is combined with the user's own input to the form or selection by the use of a client-side script. A client-side script is a set of instructions that is executed by the user's computer. Examples of such scripting languages are VBScript and JavaScript. For example, when a user visits a web site and enters data in a form, if the web site requires a user identification, a JavaScript program can intercept the request and piggyback the user identification and additional information onto the request. Many web sites will not allow a user to access areas of the web site without this information. Since existing crawlers do not have the capability to handle these requests for information, they are precluded from searching content deeper within that web site, resulting in the extraction of less data and fewer hyperlinks than possible.
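The piggybacking described above can be illustrated, purely as a sketch, by a function that appends extra parameters to an outgoing request URL. Python is used here for illustration in place of a client-side scripting language; the function piggyback and the parameter names user and session are hypothetical, and real sites may carry this information in cookies or request bodies instead.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def piggyback(request_url, extra):
    """Append identification parameters to the query string of a request."""
    parts = urlsplit(request_url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    params.extend(extra.items())  # user input first, then the added values
    return urlunsplit(parts._replace(query=urlencode(params)))


augmented = piggyback(
    "http://example.com/search?q=shoes",
    {"user": "alice", "session": "abc123"},
)
# augmented == "http://example.com/search?q=shoes&user=alice&session=abc123"
```

A crawler that issues only the original request, without the added parameters, would be refused by a web site that requires them.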
Therefore, a need exists to overcome the problems with existing crawlers, as discussed above, in order to access a larger amount of potentially important data on the Internet and the World Wide Web.