Internet search engines, database management systems and other information retrieval systems are designed to retrieve information that corresponds to a user's query. Typically, the user's query is specified as a set of keywords and the information retrieval system's job is to retrieve documents, files, etc. that contain as many of the specified keywords as possible. Information retrieval systems that operate on large databases can often produce large quantities of search results corresponding to the keywords in a query. Naturally, many of the search results will use the specified keywords in contexts other than those intended by the user. Therefore, many of the search results will be irrelevant to and unwanted by the user. Recognizing this issue, many current information retrieval systems attempt to assess or rank the relevancy of their search results before presenting them to the user.
For example, some information retrieval systems approximate a relevancy score for retrieved documents based upon factors such as how recently each document has been updated, the proximity of the specified keywords to the beginning of each document, or whether the specified keywords are included in the title, links (e.g., anchor text) or metadata. Other systems approximate a relevancy score based on the apparent popularity or importance of a retrieved document, which may be measured by counting the number of other documents that link to the retrieved document. Some systems assess the importance of the retrieved document based not only on the number of links to it, but also on the importance of the documents that link to it. See, for example, U.S. Pat. No. 6,285,999. While such methods for approximating relevancy of search results can produce useful information, they remain unable to assess whether the retrieved documents contain content that the user is actually looking for.
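By way of illustration only, the link-based importance measure described above can be sketched as a simple iterative computation in which each document passes a share of its current score to the documents it links to. The sketch below is not taken from the cited patent or from any particular system. The function name `link_importance`, the damping value of 0.85, the fixed iteration count, and the even redistribution of score from documents that have no outgoing links are all assumptions chosen for the example.

```python
def link_importance(links, damping=0.85, iterations=50):
    """Approximate an importance score for each document in a link graph.

    links: dict mapping a document id to the list of document ids it links to.
    damping and iterations are illustrative assumptions, not source values.
    """
    # Collect every document that appears as a source or as a link target.
    docs = set(links)
    for targets in links.values():
        docs.update(targets)
    docs = sorted(docs)
    n = len(docs)
    if n == 0:
        return {}

    # Start with a uniform score; the scores always sum to 1.
    score = {d: 1.0 / n for d in docs}

    for _ in range(iterations):
        # Every document receives a small baseline share.
        new_score = {d: (1.0 - damping) / n for d in docs}
        for d in docs:
            targets = links.get(d, [])
            if targets:
                # A document splits its damped score among its link targets,
                # so a link from an important document is worth more.
                share = damping * score[d] / len(targets)
                for t in targets:
                    new_score[t] += share
            else:
                # Assumption: a document with no outgoing links spreads
                # its damped score evenly over all documents.
                share = damping * score[d] / n
                for t in docs:
                    new_score[t] += share
        score = new_score
    return score


# Hypothetical four-document collection: "c" is linked to by three documents,
# and "a" is linked to only by "c".
example_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
example_scores = link_importance(example_links)
```

In this hypothetical collection, document "c" scores highest because it has the most inbound links, and document "a" outscores "b" and "d" because its single inbound link comes from the most important document. The scores reflect link structure alone, which illustrates why such measures cannot indicate whether a document's content matches what the user is looking for.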
It is commonly recognized that performing searches on information sets that have been categorized according to content is more likely to produce larger quantities of relevant search results, as compared to performing searches on uncategorized data sets. However, categorizing vast databases (e.g., the World Wide Web) is generally thought to require a painstaking process that involves considerable human intervention. The difficulty of categorizing large data sets that are accessed by a large number of users is compounded by the fact that the included data is constantly being changed. Accordingly, what is needed are efficient systems and methods for assigning content category scores to large and dynamic data sets with minimal human involvement, to thereby enhance the relevancy of search results produced by an information retrieval system.