With the explosion of electronic information brought about by the Internet, a huge amount of diverse information has accumulated on the Web and continues to grow at a staggering rate. Helping users find useful information in this enormous pool is a challenging task.
Information retrieval (IR) is the science of searching for information in a set of objects (e.g., documents). It can be further divided into searching for a piece of information contained in documents, searching for the documents themselves, searching for metadata that describe documents, and searching within databases, whether stand-alone relational databases or hypertext networked databases such as the Internet or intranets, for text, sound, images or data. Originating from this long-established research discipline, a web search engine (e.g., Google or Baidu) is a document retrieval system designed specifically to help find information stored on the Web. It allows a user to ask for content that meets specific criteria (typically content containing a given word or phrase) and to retrieve a list of items that match those criteria.
Object classification is the activity of labeling objects (e.g., documents or natural language texts) with thematic categories from a predefined set. It can be applied in many IR and text data mining scenarios, e.g., word sense disambiguation, document organization, text filtering, and web page retrieval. Object clustering is the grouping of objects or, more precisely, the partitioning of an object set, such as a document set, into subsets (clusters) so that the objects in each subset share some common trait.
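The distinction between the two activities can be illustrated with a minimal sketch. It is only an illustration under stated assumptions and is not the method of any reference discussed in this section: the category set `CATEGORY_KEYWORDS`, the whitespace tokenizer, the Jaccard similarity measure, and the `threshold` value are all hypothetical choices made for the example. Classification assigns each text a label from a predefined set, whereas clustering partitions the texts into groups without any predefined labels.

```python
# Hypothetical predefined category set with illustrative keywords.
CATEGORY_KEYWORDS = {
    "sports": {"match", "team", "score"},
    "finance": {"stock", "market", "bank"},
}


def tokenize(text):
    """Split a text into a set of lowercase whitespace-separated tokens."""
    return set(text.lower().split())


def classify(text, category_keywords=CATEGORY_KEYWORDS):
    """Label a text with the predefined category that shares the most keywords."""
    tokens = tokenize(text)
    best, best_overlap = None, 0
    for category, keywords in category_keywords.items():
        overlap = len(tokens & keywords)
        if overlap > best_overlap:
            best, best_overlap = category, overlap
    return best  # None when no keyword of any category matches


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.2):
    """Greedy single-pass clustering with no predefined labels.

    Each text joins the first cluster whose seed text is similar enough;
    otherwise it starts a new cluster. Returns lists of text indices.
    """
    clusters = []  # each entry: (seed_tokens, [member indices])
    for index, text in enumerate(texts):
        tokens = tokenize(text)
        for seed, members in clusters:
            if jaccard(tokens, seed) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((tokens, [index]))
    return [members for _, members in clusters]


docs = [
    "the team won the match",
    "stock market falls",
    "the team lost the match",
    "bank stock market rises",
]
labels = [classify(d) for d in docs]   # labels drawn from the predefined set
groups = cluster(docs)                 # partition of document indices
```

In this sketch, `classify` can only ever return one of the predefined category names (or `None`), while `cluster` discovers its groups from the documents themselves, so the groups carry no names until someone labels them.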
Because these popular search engines return a large number of results, it is still difficult for web users to find what they really want. Object clustering and classification techniques offer great potential as an effective way to organize search results, allowing a user to navigate to relevant documents quickly.
As described above, the rapid growth of electronic media content gives search engines (for web pages or desktop documents) a critical role in helping people find useful information. However, the large number of returned results, which are often heterogeneous in topic and genre, also makes it burdensome for users to find the information they are interested in.
There are many existing automatic information classification algorithms in the prior art. For example, the paper by XuanHui Wang and ChengXiang Zhai, “Learn from Web Search Logs to Organize Search Results”, SIGIR 2007, pp. 87-94 (hereinafter referred to as Reference 1), provides a search result classification method in which search results are organized by aspects learned from search engine logs. As another example, Japanese patent application 2005-182280 (hereinafter referred to as Reference 2) provides another method for organizing search results, which first extracts object categories based on pre-stored ontological information and then organizes the search results according to the extracted categories.
In the query log-based object classification methods, category selection does not take background knowledge (i.e., an ontology) into account, so the classification accuracy is insufficient. In addition, because the solution depends heavily on history information, the discovered categories might not be familiar to users. Therefore, the classification result is not user-friendly.
On the other hand, the ontological information-based object classification method is restricted by the pre-stored ontological information, so its set of search result categories is inflexible and cannot reflect changes in users' interests.