In the last several years, the Internet has experienced exponential growth in the number of web sites and corresponding web pages. Countless individuals and corporations have established web sites to market products, promote their firms, provide information on a specific topic, or merely provide access to the family's latest photographs for friends and relatives. As a result of the rapid growth in web sites on the Internet, it has become increasingly difficult to locate pertinent information.
Search engines, such as Inktomi, Excite, Lycos, Infoseek, or FAST, are typically utilized to locate information on the Internet. Upon inquiry from a user, the search engine software searches the millions of records contained in a central index. The search engine software finds matches to the search query and may rank them in terms of relevance according to predefined ranking algorithms. While most search engines accept submissions of sites for indexing, even upon such a submission, the site may not be indexed in a timely manner, if at all.
An inherent shortcoming of the method of indexing utilized in the conventional search engine is that only documents stored with a mark-up language such as SGML, HTML or XML are utilized in generating the central index. Due to the format of a mark-up language web page, certain types of information may not be placed in the mark-up language tags. For example, conceptual information such as the intended audience's demographics and geographic information may not be placed in an assigned tag in the document. Such information would be extremely helpful in generating a more useful index. For example, a person might want to search in a specific geographical area, or within a certain industry. Assume a person is searching for a red barn manufacturer in a specific geographic area. Because mark-up language pages have no standard tags for identifying industry type or geographical area, the spider on the server in the conventional search engine does not have such information to utilize in generating the central index. As a result, the conventional search engine would typically list not only manufacturers but also the location of picturesque red barns in New England that are of no interest to the searcher.
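The red barn example can be illustrated with a minimal sketch. It assumes each index record could carry two extra fields, `industry` and `region`, which are hypothetical names chosen for illustration and are not part of any mark-up standard or of any search engine named above. The sample records and URLs are invented.

```python
from dataclasses import dataclass


@dataclass
class Record:
    url: str
    text: str
    industry: str = ""  # hypothetical conceptual field, absent from plain mark-up
    region: str = ""    # hypothetical geographic field, absent from plain mark-up


def keyword_search(records, query):
    """Return records whose text contains every query term (keyword matching only)."""
    terms = query.lower().split()
    return [r for r in records if all(t in r.text.lower().split() for t in terms)]


def conceptual_search(records, query, industry, region):
    """Keyword matching narrowed by the assumed industry and region fields."""
    return [
        r for r in keyword_search(records, query)
        if r.industry == industry and r.region == region
    ]


RECORDS = [
    Record("http://example.com/barn-builders",
           "We build red barn kits to order", "manufacturing", "Ohio"),
    Record("http://example.com/scenic-tour",
           "Photos of a picturesque red barn in New England", "tourism", "Vermont"),
]
```

With keyword matching alone, both sample records match "red barn". Once the assumed fields are available, only the manufacturer in the requested region remains.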
Some Internet search engines, such as Infoseek, have proposed a distributed search engine approach to assist their spidering programs in finding and indexing new web pages. Infoseek has proposed that each web site on the Internet create a local file named “robots1.txt” containing a list of all files on the web site that have been modified within the last twenty-four hours. Files that have not been modified will not be indexed, saving bandwidth on the Internet otherwise consumed by the spidering program and thus increasing the efficiency of the spidering program.
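A site-side script for producing such a list might look like the following sketch. The file format (one relative path per line) and the function names are assumptions made for illustration; the text above specifies only the file name and the twenty-four hour window.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def recently_modified(site_root, now=None, window=SECONDS_PER_DAY):
    """Return sorted relative paths of files modified within the window."""
    now = time.time() if now is None else now
    changed = []
    for dirpath, _dirnames, filenames in os.walk(site_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) <= window:
                changed.append(os.path.relpath(path, site_root))
    return sorted(changed)


def write_change_list(site_root, list_name="robots1.txt"):
    """Write the change list at the site root; one path per line (assumed format)."""
    # Exclude the list file itself so it does not report its own update.
    changed = [p for p in recently_modified(site_root) if p != list_name]
    with open(os.path.join(site_root, list_name), "w", encoding="utf-8") as fh:
        fh.write("\n".join(changed) + ("\n" if changed else ""))
    return changed
```

A spidering program reading this file would then fetch only the listed paths and skip everything else on the site.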
There is a need for a method and system of indexing or cataloging remotely stored data that allows conceptual information and other human generated information about web sites to be utilized in generating the index to allow sites to be found in a search, to make search results more meaningful, and to allow sites to be more accurately rated.
Full-text search and indexing systems such as web search engines typically have two distinct means of organizing the presentation of documents. The first means is usually a categorization system (hierarchical or otherwise) which presents the documents in groups or “clusters” related by topic, content or origin. The second is dynamically generated as a result of a search process of some sort such as a matching keyword search. Normally, this second means is presented as a linear list in which matching documents are sorted either alphabetically by title, by date of change, or by a ranking value based on a calculation whose input may come in part from the document content. For example, in searching for the word “car” in a set of documents, the resulting list of matching documents might be sorted by the number of times the word occurred in each document.
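The occurrence-count ranking in the “car” example can be sketched as follows. This is one simple way to compute such a ranking value, not the method of any particular search engine; the function names and sample documents are invented for illustration.

```python
import re


def count_occurrences(text, word):
    """Count whole-word, case-insensitive occurrences of word in text."""
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def rank_by_term_count(documents, word):
    """Return (title, count) pairs for matching documents, highest count first.

    Ties are broken alphabetically by title so the ordering is deterministic.
    """
    counts = {title: count_occurrences(text, word) for title, text in documents.items()}
    matches = [(title, n) for title, n in counts.items() if n > 0]
    return sorted(matches, key=lambda item: (-item[1], item[0]))


DOCS = {
    "Used car listings": "car for sale, car parts, car loans",
    "Commuting guide": "take the train or share a car",
    "Carpet care": "how to clean a carpet",
}
```

Calling `rank_by_term_count(DOCS, "car")` lists "Used car listings" (three occurrences) ahead of "Commuting guide" (one occurrence). "Carpet care" is omitted because whole-word matching does not count "carpet" as "car".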