1. Field of the Invention
The present invention relates to a method, system, and program for gathering indexable metadata on content at an electronic data repository.
2. Description of the Related Art
To locate documents on the Internet, users typically use an Internet search engine. Internet users enter one or more key search terms, which may include boolean operators for the search, and transmit the search request to a server including a search engine. The search engine maintains an index of information from web pages on the Internet. This index provides search terms for a particular Web address or Uniform Resource Locator (URL). If the index terms for a URL in the search engine database satisfy the Internet user search query, then that URL is returned in response to the query.
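By way of illustration only, the matching of a query against per-URL index terms described above may be sketched as follows. The sketch assumes a simple in-memory mapping from URL to index terms and a query in which every term is joined by a boolean AND; the URLs, the terms, and the function name `search` are hypothetical and are not taken from any particular search engine.

```python
# Hypothetical index: each URL maps to the set of index terms stored for it.
index = {
    "http://example.com/cats": {"cats", "pets", "care"},
    "http://example.com/dogs": {"dogs", "pets", "training"},
    "http://example.com/cars": {"cars", "repair"},
}


def search(index, required_terms):
    """Return the URLs whose index terms include every required term (boolean AND)."""
    wanted = {term.lower() for term in required_terms}
    # A URL satisfies the query when the wanted terms are a subset of its index terms.
    return sorted(url for url, terms in index.items() if wanted <= terms)


pets = search(index, ["pets"])
pets_and_care = search(index, ["pets", "care"])
no_match = search(index, ["pets", "repair"])
```

In this sketch a URL is returned only when all query terms appear among its index terms; a production index would typically be inverted (term to URLs) and would support additional operators.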
Search engine providers need to constantly update their URL database to provide a more accurate and larger universe of potential search results that may be returned to the user. Search engine companies sometimes employ a robot that searches and categorizes Web pages on the basis of metatags and content in the located HTML pages. A robot is a program that automatically traverses the Web's hypertext structure by retrieving an HTML page, and then recursively retrieving all documents referenced from the retrieved page. Web robots released by search engines to access and index Web pages are referred to as Web crawlers and Web spiders.
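The recursive traversal performed by such a robot may be sketched, purely as an illustration, as shown below. To keep the sketch self-contained, the "Web" is assumed to be a small in-memory dictionary of hypothetical pages rather than live network content; the names `PAGES`, `LinkExtractor`, and `crawl` are illustrative assumptions and do not describe any particular crawler.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical stand-in for the Web: URL -> HTML content.
PAGES = {
    "/index.html": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "/a.html": '<a href="/b.html">B</a> <a href="/index.html">home</a>',
    "/b.html": "<p>No links here.</p>",
}


def crawl(url, fetch, visited=None):
    """Retrieve a page, then recursively retrieve every page it references."""
    if visited is None:
        visited = []
    if url in visited:
        return visited  # already retrieved; avoids looping on cyclic links
    html = fetch(url)
    if html is None:
        return visited  # page could not be retrieved
    visited.append(url)
    extractor = LinkExtractor()
    extractor.feed(html)
    for link in extractor.links:
        crawl(link, fetch, visited)
    return visited


order = crawl("/index.html", PAGES.get)
```

The `fetch` parameter is passed in so that the same traversal could be pointed at an HTTP client instead of the dictionary; the record of visited URLs is what keeps the recursion from cycling between pages that link to each other.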
Search engines having a database of indexable terms for URLs generated by robots are quite common and popular. However, one noticeable disadvantage of such robot-generated URL databases is that periodic updates to a web site may render the URL database inaccurate and outdated until the robot rechecks a previously indexed page. Further, search engine robots are currently designed to search for HTML pages and parse HTML content into index search terms in the search engine database. However, many web pages provide content in formats that are not accessible to or parseable by prior art search engine robots that are designed to traverse HTML pages, such as content encoded in various multi-media formats, e.g., MPEG, SHOCKWAVE, ZIP files, etc. Further, web site content may be dynamic and accessible only by providing a search term that is then used by a program, e.g., a Common Gateway Interface (CGI) program, a Java program, Microsoft Active Server Pages, etc., to query a database and return search results. Such dynamic data accessible through queries is typically not identified by prior art search engine robots and is therefore not indexed in the search engine URL database.
A still further disadvantage is that Web robots have been known to overload web servers and present security hazards. For this reason, many web sites use a firewall that restricts the search engine web robot from accessing and cataloging the content, even when the web site provider would want its information publicly available. Web site providers may also limit a web robot's access to a site by creating a “robots.txt” file that indicates URLs on the site that the robot is not permitted to access and index. Such limitations on search engine web robots may prevent the web robot from accessing relevant web pages that would be of significant interest to search engine users.
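As an illustration of how a robot may honor such an exclusion file, the following sketch uses the `urllib.robotparser` module of the Python standard library to evaluate a set of rules. The rule text, the robot name "ExampleBot", and the example.com URLs are hypothetical and are supplied inline so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical exclusion rules a site provider might publish.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the lines of the file directly, so nothing is fetched here.
parser.parse(rules.splitlines())

# A well-behaved robot checks each URL before retrieving it.
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
```

Under these assumed rules, `blocked` evaluates to `False` and `allowed` to `True`, so any page under `/private/` would never reach the search engine index regardless of its relevance to users.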
Some search engines use a manual taxonomist. For instance, Yahoo receives a manual submission of a web page and then categorizes the web page for inclusion in its database. This approach may be very time-consuming. Further, the manual taxonomical approach cannot catalog as many pages as a robot approach that continually traverses the Internet, i.e., the World Wide Web, for new pages and that is not limited to content that is submitted by users.
Thus, there is a need in the art for an improved technique for cataloging web pages.