Web search has become an indispensable tool for locating information, particularly as the amount of information available on the Internet continues to increase rapidly. Users typically submit a query consisting of only a few words to a search engine. However, because these queries are short and often ambiguous, interpreting them in terms of a set of target categories is a difficult problem. For example, a user issuing the web query “apple” might expect to see web pages related to the fruit, or might prefer to see products or news related to the record label or the computer company.
The various problems and solutions related to interpreting web queries in terms of categories along a taxonomy are referred to as query classification. In general, a taxonomy comprising hierarchically arranged categories is used to process a web query into results. For example, online advertisement services rely on query classification results to promote different products more accurately, and search result pages can be grouped according to the categories predicted by a query classification algorithm.
Previous solutions for query classification generally required human-labeled training data. However, providing enough training examples is a very difficult and time-consuming task, especially when the target taxonomy is complicated. Another potential problem related to the training data is caused by ongoing changes in the query stream, which make it hard to systematically cover the space of queries. Likewise, if changes are made to a defined taxonomy, re-training is needed to handle the changes.
In another previous type of solution, an input query is first mapped to an intermediate category, and a second mapping is then applied to map the query from the intermediate category to a target category. This method suffers from a number of problems. One problem is that the classifier for the second mapping function needs to be re-trained whenever the target category structure changes. Because in real applications the target categories often change depending on the needs of the service providers, as well as the distribution of web content, this re-training solution is not sufficiently flexible. Another problem is that this solution uses the Open Directory Project (ODP) taxonomy, in which web content is classified by human volunteers, as the intermediate taxonomy. Because the ODP contains more than 590,000 different categories, handling the mapping operations is also costly.
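The two-stage mapping described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the category names, the two lookup tables, and the function `classify` are hypothetical, and a plain dictionary stands in for the trained classifier that the prior solution uses for the second mapping.

```python
from collections import Counter

# Hypothetical first mapping: query -> intermediate categories
# (a tiny stand-in for a large intermediate taxonomy such as the ODP).
QUERY_TO_INTERMEDIATE = {
    "apple": ["Computers/Hardware", "Home/Cooking/Fruits", "Arts/Music/Labels"],
}

# Hypothetical second mapping: intermediate category -> target category.
# In the prior solution this stage is a trained classifier; a lookup
# table is used here only to keep the sketch short.
INTERMEDIATE_TO_TARGET = {
    "Computers/Hardware": "Computers",
    "Home/Cooking/Fruits": "Food",
    "Arts/Music/Labels": "Entertainment",
}


def classify(query, first=QUERY_TO_INTERMEDIATE, second=INTERMEDIATE_TO_TARGET):
    """Map a query to target categories through intermediate categories."""
    # Stage 1: look up the intermediate categories for the normalized query.
    intermediate = first.get(query.strip().lower(), [])
    # Stage 2: map each intermediate category to a target category and
    # count how often each target category is reached.
    votes = Counter(second[c] for c in intermediate if c in second)
    # Return target categories ordered by vote count, highest first.
    return [category for category, _ in votes.most_common()]


if __name__ == "__main__":
    print(classify("apple"))
```

In this sketch, a change to the target taxonomy means the whole `second` table has to be rebuilt, which mirrors the re-training burden noted above; the size of the intermediate taxonomy determines how many entries that table must cover.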