1. Field of the Invention
The present invention relates generally to methods for categorizing and searching for information on a network and, more specifically, to categorizing and searching Web pages on the Internet.
2. Description of the Related Art
The Internet contains over two billion Web pages. It has been estimated that two million Web pages are added to the Internet each day (The Industry Standard, Feb. 28, 2000). This vast amount of information is a tremendous resource for the public to use. However, there is no effective way for a user to obtain relevant information. Although 85 percent of users use search engines to find information on the Internet, “a mind-boggling 92 percent of searches fail to find relevant information or to arrange the results in a meaningful order.” (The Industry Standard, Apr. 17, 2000, referring to a Forrester Research review of Web sites.)
There are two fundamental problems. First, there is no standardized international categorization system or catalog of the information contained on the Internet. A group of librarians and others has been working on a cataloging system for the Internet for the last few years. This work is referred to as the Dublin Core Metadata Element Set. This system suffers from a number of problems: it requires a high degree of cataloging knowledge, and it is time-consuming and very expensive to apply. In addition, the size of the Internet makes such a system unworkable.
Second, because there is no standardized categorization system or catalog, the existing search methods, which primarily include directories and search engines, are often cumbersome, ineffective, and inefficient.
Directories or indices are human-compiled databases of Web sites or pages. Most directories use editors to review and categorize Web sites. Some rely on contributions from their visitors. A user searches a directory by reviewing lists of categories and subcategories, or by typing in keywords. The result is a list of documents that the user can access by links. Directories are helpful to familiarize a user with the scope of a subject, but are not very useful in finding specific information. Also, directories can be slow, and the results may be haphazard. Another major problem is that directories review and categorize only a small percentage of pages and sites. Commonly used directories include Yahoo! and LookSmart.
Search engines are huge databases that automatically index large portions of the Internet and continually update that index. Search engines typically include a Web crawler or spider (also called a worm, robot, or bot) that automatically crawls through the Internet by following hyperlinks and indexing Web pages; a database, which is the index compiled by the crawler; and a search tool, which the user can use to search the database. The databases of the existing search engines differ in how they are created. Some Web crawlers index each word in a document, some index only keywords, including META tags, and some index other parts of a Web page, such as the title and headings. Most search engines require a search to be conducted by typing in keywords. The search query may be formulated by Boolean logic, where keywords are combined with operators such as AND, OR, and NOT, or by natural language, where keywords are used in the form of a question. Although natural language searches may be easier for a user to formulate, both types of formulations rely on keywords.
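By way of illustration only, the three components described above (crawler, index, and search tool) can be sketched as follows. The sketch assumes a small in-memory collection of pages in place of the Internet; the page contents, the `.example` addresses, and the function names are hypothetical and are not taken from any existing search engine. The crawler here indexes each word in a document, which is only one of the indexing approaches mentioned above.

```python
import re
from collections import deque

# Hypothetical in-memory "Web": address -> (page text, outgoing hyperlinks).
PAGES = {
    "a.example": ("Frank Lloyd Wright designed Fallingwater", ["b.example", "c.example"]),
    "b.example": ("American architecture in the twentieth century", ["a.example"]),
    "c.example": ("Fallingwater is a house in Pennsylvania", []),
    "d.example": ("A page that no hyperlink points to", []),
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crawl(start, pages):
    """Follow hyperlinks breadth-first from `start` and index every word.

    Returns the index: a dict mapping each word to the set of addresses
    of the pages that contain it.
    """
    index = {}
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        for word in tokenize(text):
            index.setdefault(word, set()).add(url)
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search_and(index, *keywords):
    """Boolean AND: pages containing every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


def search_or(index, *keywords):
    """Boolean OR: pages containing at least one keyword."""
    return set().union(*(index.get(k.lower(), set()) for k in keywords))


index = crawl("a.example", PAGES)
print(sorted(search_and(index, "Fallingwater", "house")))        # ['c.example']
print(sorted(search_or(index, "architecture", "Pennsylvania")))  # ['b.example', 'c.example']
```

In this sketch the page at `d.example` is never indexed, because no hyperlink leads to it from the starting page. Both query forms match on keywords alone.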
Most search engines use mathematical algorithms to weight or rank the results, with the most relevant items listed first. These rankings may be based on the number of times a keyword is used on a page or the location of the keyword on the page. Some search engines also allow the user to organize or group the results by category, date, or another variable, such as the folders used by Northern Light (U.S. Pat. No. 5,924,090 to Krellenstein). Another search engine, known as the Clever Project, by IBM, analyzes hyperlinks between pages, in addition to text and citations, in order to develop algorithms that are intended to increase the relevancy of search results. This method is a marginal improvement over other search engines, but has its own set of problems. “A shortcoming of Clever has been that for a narrow topic, such as Frank Lloyd Wright's house Fallingwater, the system sometimes broadens its search and retrieves information on a general subject, such as American architecture.” (“Hypersearching the Web,” Scientific American, June 1999.)
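By way of illustration only, a ranking based on the number of times a keyword is used and on its location on the page can be sketched as follows. The weighting of three for a title occurrence, the sample pages, and the function names are assumptions made for this example; they do not reproduce the algorithm of any particular search engine.

```python
import re

# Assumed weight: a keyword in the title counts three times as much
# as the same keyword in the body of the page.
TITLE_WEIGHT = 3


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(page, keyword):
    """Score a page by keyword frequency, weighted by keyword location."""
    kw = keyword.lower()
    in_title = tokenize(page["title"]).count(kw)
    in_body = tokenize(page["body"]).count(kw)
    return TITLE_WEIGHT * in_title + in_body


def rank(pages, keyword):
    """Return addresses of matching pages, highest score first."""
    hits = [(score(p, keyword), p["url"]) for p in pages]
    hits = [(s, u) for s, u in hits if s > 0]
    hits.sort(key=lambda su: (-su[0], su[1]))  # ties broken by address
    return [u for _, u in hits]


# Hypothetical sample pages.
DOCS = [
    {"url": "a.example", "title": "Fallingwater",
     "body": "Fallingwater is a house by Frank Lloyd Wright."},
    {"url": "b.example", "title": "American architecture",
     "body": "Fallingwater, Fallingwater, and more Fallingwater trivia."},
    {"url": "c.example", "title": "Gardening",
     "body": "Nothing about houses here."},
]

print(rank(DOCS, "Fallingwater"))  # ['a.example', 'b.example']
```

Here the first page scores 4 (one title occurrence and one body occurrence) and the second scores 3 (three body occurrences). A score of this kind measures only how often and where a keyword appears, not whether the page is relevant to what the user is seeking.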
Search engines do not index the entire Internet. Most have indexed about one-third of the available or publicly indexable Web pages (i.e., excluding Web pages with authorization requirements). Examples of search engines are Google, FAST, AltaVista, Inktomi, and Northern Light. A greater portion of the Internet can be searched using a meta-search. This technology allows the user to search several search engines at the same time and presents all the results in a single list, but exacerbates the problems inherent in existing search engines.
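By way of illustration only, the merging step of a meta-search can be sketched as follows. The sketch assumes that each underlying search engine has already returned a ranked list of addresses; the round-robin interleaving, the sample lists, and the function name are hypothetical, and actual meta-search tools may combine results differently.

```python
from itertools import zip_longest


def meta_search(result_lists):
    """Merge several ranked result lists into a single list.

    Results are interleaved by rank position (all first results, then all
    second results, and so on), and duplicate addresses are dropped.
    """
    merged, seen = [], set()
    for tier in zip_longest(*result_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical ranked results from three search engines for one query.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example"]
engine_c = ["e.example", "a.example", "f.example"]

merged = meta_search([engine_a, engine_b, engine_c])
print(merged)
# ['a.example', 'b.example', 'e.example', 'd.example', 'c.example', 'f.example']
```

The merged list covers more pages than any one engine, but it is also longer than any single result list, and the rankings of the individual engines are no longer directly comparable.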
Because they contain such huge databases, existing search engines often produce search results too voluminous for the user to review. Also, the search results typically contain a vast amount of irrelevant or unrelated items. As stated previously, it has been found that 92 percent of searches did not yield relevant information or did not organize the results in a usable fashion (The Industry Standard, Apr. 17, 2000). Another problem is that search engines are more likely to index pages with more links, pages with commercial information, and pages in the United States, rather than lesser known, educational, or non-United States pages.
Another major problem of existing search engines is that they may allow minors access to pornography on the Internet. Current filtering software is an ineffective and often clumsy tool that fails to limit access to many pornographic sites, but blocks other sites that are educational or medical in nature. In addition, the controversy surrounding this issue has created enormous difficulties for public institutions, such as schools and libraries, with respect to allowing minors access to the Internet.
Lastly, it is often difficult for a user to determine the copyright status of material on the Internet. There is also no easy way for owners of content to indicate the copyright status of their material. This problem has hampered the flow of information and left both the owners of content and users confused and potentially in legal jeopardy.