Many search engine services, such as Google and Overture, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service identifies web pages that may be related to the search request based on how well the keywords of a web page match the words of the query. The search engine service then displays to the user links to the identified web pages in an order that is based on a ranking that may be determined by their relevance to the query, popularity, importance, and/or some other measure.
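The mapping of keywords to web pages described above can be illustrated with a minimal sketch. The page URLs, keyword lists, and the function names `build_keyword_index` and `search` are hypothetical and chosen only for illustration. Ranking here simply counts matching query terms, which is a stand-in for whatever relevance, popularity, or importance measure a search engine service may actually use.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> keywords identified for that page.
PAGES = {
    "court-a.example/hours": ["la", "superior", "court", "time"],
    "parks.example/tennis": ["la", "sport", "court", "time", "reservation"],
    "news.example/weather": ["la", "weather", "forecast"],
}

def build_keyword_index(pages):
    """Map each keyword to the set of pages on which it was identified."""
    index = defaultdict(set)
    for url, keywords in pages.items():
        for keyword in keywords:
            index[keyword].add(url)
    return index

def search(index, query):
    """Return matching pages, ordered by the number of query terms matched."""
    scores = defaultdict(int)
    # Simplified normalization (an assumption): lowercase and drop periods.
    for term in query.lower().replace(".", "").split():
        for url in index.get(term, ()):
            scores[url] += 1
    # Highest score first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda url: (-scores[url], url))

INDEX = build_keyword_index(PAGES)
RESULTS = search(INDEX, "L.A. court time")
```

In this sketch the legal page and the sports page match the query equally well, which is the ambiguity discussed in the following paragraph.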
Because of the vast number of web sites and web pages, a search engine service may identify hundreds of thousands of web pages that may match a query. A user, however, may be interested in web pages on one topic, but the search engine service may return web pages on many different topics. For example, an attorney who submits the query “L.A. court time” may get the same query result as an athlete who submits the same query. In such a case, web pages related to superior court times in Los Angeles County may be relevant to the attorney, but irrelevant to the athlete who may be interested in web pages related to sport court times of the Los Angeles Parks and Recreation Department. A search engine service may not know whether the user is interested in law or sports and thus cannot always rank the web pages based on their relevance to the user. If the search engine service does not appropriately rank the web pages that are of interest to the user, then it can be difficult for the user to review the textual excerpts displayed with a large number of query results to determine whether the described web pages are of interest. Moreover, a user may need to actually view many web pages before finding one of interest because the textual excerpts may not provide enough information to determine the relevance of the web pages. For example, the textual excerpt of a query result may state, “This web page helps you check on L.A. court times at your convenience . . . if you need to arrange a court time . . . . Please arrive 15 minutes before your scheduled court time.” In such a case, the user may not know whether the web page is about legal or sport courts.
Some search engine services provide a classification hierarchy for web pages to assist in locating web pages of interest to a user. FIG. 1 illustrates a portion of an example classification hierarchy. In this example, a classification hierarchy 100 includes a service classification 101 corresponding to web pages related to services. The service classification has a recreation classification 110 and a business classification 150 as sub-classifications. The recreation classification has a sports classification 120 and a dancing classification 130 as sub-classifications. The sports classification has a baseball classification 121 and a football classification 122 as sub-classifications, and the dancing classification has a folk dance classification 131 and a rock 'n roll classification 132 as sub-classifications. The business classification has an insurance classification 160 and a financial classification 170 as sub-classifications. The financial classification has a stock market classification 171 and a bonds classification 172 as sub-classifications. Each web page within the service classification is associated with a classification path leading to a leaf classification such as classifications 121-122, 131-132, 160, and 171-172. For example, a web page relating to baseball would be classified into the service classification, the recreation classification, the sports classification, and the baseball classification. As another example, a web page relating to insurance would be classified into the service classification, the business classification, and the insurance classification. When a search engine service crawls the web, it may identify the classifications of the web pages that it encounters and create an index that maps classifications to the web pages within the classifications.
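The portion of the classification hierarchy of FIG. 1 and its classification paths can be sketched as follows. The child-to-parent dictionary and the function names are assumptions made for illustration; the source does not specify how a search engine service stores the hierarchy.

```python
# Child -> parent links for the portion of the hierarchy shown in FIG. 1.
PARENT = {
    "recreation": "service",
    "business": "service",
    "sports": "recreation",
    "dancing": "recreation",
    "baseball": "sports",
    "football": "sports",
    "folk dance": "dancing",
    "rock 'n roll": "dancing",
    "insurance": "business",
    "financial": "business",
    "stock market": "financial",
    "bonds": "financial",
}

def classification_path(classification):
    """Return the path from the root classification down to the given one."""
    path = [classification]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))

def leaf_classifications():
    """Return the classifications that have no sub-classifications."""
    parents = set(PARENT.values())
    return sorted(c for c in PARENT if c not in parents)

BASEBALL_PATH = classification_path("baseball")
INSURANCE_PATH = classification_path("insurance")
```

The two computed paths correspond to the baseball and insurance examples given above.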
To assist a user in searching, a search engine service may allow the user to specify a classification of interest as part of the query. For example, a user who is interested in superior court times of Los Angeles County may enter the query “L.A. court times” and specify the classification of “criminal justice.” The search engine service may search for only web pages within the specified classification (e.g., criminal justice) and related classifications (e.g., legal). Alternatively, a search engine service may search for web pages in all classifications and then present the search results organized by classification of the web pages. In such a case, a user could then fairly quickly select the classification of interest and review the web pages within that classification.
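The two alternatives just described, restricting a search to specified classifications or organizing all results by classification, might look like the following sketch. The result data and function names are hypothetical, and each page is assumed to carry a single classification label for simplicity.

```python
from collections import defaultdict

# Hypothetical query results for "L.A. court times": (URL, classification).
QUERY_RESULTS = [
    ("court-a.example/hours", "criminal justice"),
    ("parks.example/tennis", "sports"),
    ("lawfirm.example/filing", "legal"),
]

def filter_by_classification(results, wanted):
    """Keep only pages within the specified or related classifications."""
    return [url for url, classification in results if classification in wanted]

def group_by_classification(results):
    """Organize all results by the classification of each page."""
    groups = defaultdict(list)
    for url, classification in results:
        groups[classification].append(url)
    return dict(groups)

# Specified classification plus a related one, as in the example above.
FILTERED = filter_by_classification(QUERY_RESULTS, {"criminal justice", "legal"})
GROUPED = group_by_classification(QUERY_RESULTS)
```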
Although the classification of web pages is a specific type of classification within the field of Text Classification (“TC”), the classification of web pages presents many challenges not encountered with traditional text classification. A significant challenge is the efficient classification of large numbers of web pages. Traditional text classification techniques have used supervised learning to develop a classifier to classify documents (e.g., published papers and news articles) into non-hierarchical classifications. These supervised learning techniques, however, cannot effectively be used to train a classifier for the hundreds of thousands of classifications used by some search engine services. These traditional supervised learning techniques include Support Vector Machines (“SVMs”), k-Nearest Neighbor (“k-NN”), Naïve Bayes (“NB”), and other algorithms. These supervised learning techniques input training data (e.g., documents with their corresponding classifications), generate a feature vector for each document, and generate a classifier that can be used to classify other documents represented by their feature vectors. A feature vector may, for example, contain the number of occurrences of each term or keyword in the document. An SVM is a supervised learning technique that operates by finding a hyper-surface in the space of possible inputs. The hyper-surface attempts to split the positive examples from the negative examples by maximizing the distance between the nearest of the positive and negative examples to the hyper-surface. This allows for correct classification of data that is similar to but not identical to the training data. Various techniques can be used by a support vector machine. One technique uses a sequential minimal optimization algorithm that breaks the large quadratic programming problem down into a series of small quadratic programming problems that can be solved analytically. 
(See Sequential Minimal Optimization, available at Microsoft Research web site as “~iplatt/smo.html.”)
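The general flow of supervised text classification (training documents with classifications, a term-count feature vector per document, and a classifier applied to new documents) can be sketched with k-NN, one of the techniques named above. This is not an SVM and does not implement sequential minimal optimization; k-NN is used only because it fits in a few lines. The training documents, the cosine similarity measure, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical training data: (document text, classification).
TRAINING = [
    ("judge court hearing court docket", "legal"),
    ("attorney filing court motion", "legal"),
    ("tennis court reservation racket", "sports"),
    ("basketball court league game", "sports"),
]

def feature_vector(text):
    """Count the number of occurrences of each term in the document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count feature vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_knn(text, training, k=3):
    """Assign the majority classification of the k most similar documents."""
    query = feature_vector(text)
    ranked = sorted(
        training,
        key=lambda example: cosine(query, feature_vector(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the classifier must compare against, or be trained on, examples for every classification, this flat approach becomes impractical as the number of classifications grows into the hundreds of thousands.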
The use of a hierarchical classifier has been proposed to classify documents in general and web pages in particular using a classification hierarchy with many thousands of classifications. A hierarchical classifier typically has a classifier for each classification. Each classifier is trained to classify documents within a certain classification into its sub-classifications. For example, a classifier for the sports classification 120 of FIG. 1 would classify sports related web pages into the sub-classifications of baseball and football. Because a hierarchical classifier can comprise hundreds of thousands of classifiers (e.g., one for each non-leaf classification), it can be particularly time-consuming to effectively train such a large number of classifiers.
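The top-down structure of such a hierarchical classifier can be sketched for the portion of the hierarchy in FIG. 1. Each non-leaf classification gets its own classifier that chooses among its sub-classifications. The keyword-overlap scoring below is a hypothetical stand-in for trained classifiers, and the keyword sets and function names are assumptions made for illustration.

```python
# Sub-classifications of each non-leaf classification (portion of FIG. 1).
CHILDREN = {
    "service": ["recreation", "business"],
    "recreation": ["sports", "dancing"],
    "sports": ["baseball", "football"],
    "dancing": ["folk dance", "rock 'n roll"],
    "business": ["insurance", "financial"],
    "financial": ["stock market", "bonds"],
}

# Hypothetical stand-in for trained classifiers: keywords per classification.
KEYWORDS = {
    "recreation": {"game", "team", "pitcher", "touchdown", "dance", "music"},
    "business": {"policy", "premium", "stock", "bond", "market", "yield"},
    "sports": {"game", "team", "pitcher", "touchdown"},
    "dancing": {"dance", "music", "folk", "rock"},
    "baseball": {"pitcher", "inning"},
    "football": {"touchdown", "quarterback"},
    "folk dance": {"folk"},
    "rock 'n roll": {"rock"},
    "insurance": {"policy", "premium"},
    "financial": {"stock", "bond", "market", "yield"},
    "stock market": {"stock", "market"},
    "bonds": {"bond", "yield"},
}

def classify_node(node, words):
    """Classifier for one non-leaf classification: pick a sub-classification."""
    return max(CHILDREN[node], key=lambda child: len(words & KEYWORDS[child]))

def classify_hierarchical(text, root="service"):
    """Apply one classifier per level until a leaf classification is reached."""
    words = set(text.lower().split())
    path = [root]
    while path[-1] in CHILDREN:
        path.append(classify_node(path[-1], words))
    return path

# One classifier is needed per non-leaf classification.
NUM_CLASSIFIERS = len(CHILDREN)
```

Even this small portion of a hierarchy needs six classifiers, which suggests why training becomes time-consuming when the hierarchy has hundreds of thousands of non-leaf classifications.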