At present, the World Wide Web (WWW) has become a popular and important medium for disseminating and acquiring information. Web information is vast in amount, diverse, heterogeneous, and distributed, and much of it is implicit. Web information extraction and mining technologies are therefore important in helping people make the most of the web and its information. In fact, web information extraction and mining has become a hot research area, and applications and products based on these technologies are already popular in the market.
Document clustering is a general information mining technology used to exploit the similarities and relationships among documents. The purpose of document clustering is to organize documents into several meaningful groups so that documents within the same group have high similarity or strong relations, while documents belonging to different groups are far from each other. The grouping process is automatic and requires no pre-defined groups. Because clustering results are organized document sets, document clustering is widely used to increase the efficiency and effectiveness of information retrieval and other information extraction systems, and also to organize retrieval results for convenient browsing. Because of the large amount of web information, clustering plays a particularly important role in enabling efficient and accurate information extraction in the web domain.
The goal of web document clustering is to automatically divide a pre-selected web document set into several meaningful groups, which are not pre-defined, while guaranteeing that the similarities or relations of the documents within the same group are much stronger than those of documents in different groups. On the other hand, because similarities and relations can be defined by different measurement standards, different cluster analysis results may be obtained for the same document set from different aspects. For example, clustering can be used to group the product-related web pages of a company website into news pages, advertisement pages, shopping pages, etc., according to content type, or to group them according to product category into several product clusters, i.e., a cluster representing all the pages about the same product. Thus, the general problem of web document clustering is how to design an appropriate clustering method that meets the practical requirement accurately and efficiently.
From a technical point of view, the primary process for designing a document clustering method is first to select proper and efficient document features for the specific clustering purpose, and then to model the clustering mechanism based on those document features. We therefore review the existing technical solutions from these two aspects.
From the aspect of feature selection, the existing solutions for web document clustering can generally be divided into the following four categories, which consider different kinds of features for clustering: (1) document content based clustering; (2) hyperlink information based (context based) clustering; (3) web usage information based clustering; and (4) hybrid clustering. Among traditional document clustering solutions, the most common approach clusters documents by content-related features, i.e., the textual information within the documents. For web document clustering, the content-related features include not only the textual information of the content but also the HTML structure of the web pages. Furthermore, since the hyperlink is the primary feature of the web, link-related information is as important as, or even more important than, content-related information for web document clustering. Therefore, document clustering based on hyperlink information is becoming more and more popular. Also, because web users' usage information, such as browsing histories and browsing paths, can be recorded, some solutions use this kind of usage information to assess the relationships among web documents. Certainly, in the general case, considering only web document contents does not provide enough information, because many web pages include little textual information and have irregular HTML structure. On the other hand, considering only hyperlink information or web usage information does not provide sufficiently meaningful information, because many links and browsing actions are random and subjective. Thus, hybrid solutions are usually designed for general web document clustering.
From the aspect of clustering mechanism modeling, almost all the existing solutions are based on peer-to-peer similarity analysis models. In more detail, these solutions design algorithms that analyze the similarities (usually represented by similarity scores) between each pair of documents, directly or indirectly, and then cluster the documents according to the results, i.e., a group in which every two documents have high similarity becomes a cluster. The concrete model for similarity analysis is either set by rules or obtained from machine learning.
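As a minimal, hypothetical sketch of such a peer-to-peer similarity model (not taken from any of the cited documents), the following code scores each pair of documents with a bag-of-words cosine similarity and forms clusters as the connected components of the thresholded similarity graph. The function names and the threshold value are illustrative assumptions:

```python
from collections import Counter
from math import sqrt

def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def pairwise_cluster(docs, threshold=0.5):
    """Group documents whose pairwise similarity exceeds a threshold.
    Clusters are the connected components of the similarity graph
    (single-link behavior), tracked with a union-find structure."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    # peer-to-peer step: score every pair, link the similar ones
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine_similarity(docs[i], docs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Note that this flat, threshold-based grouping exhibits exactly the single-aspect behavior discussed later: the result reflects one similarity measure at one level only.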
Several representative technical solutions in the prior art are introduced as follows.
In non-patent document [1] (V. Crescenzi, P. Merialdo, P. Missier. Clustering web pages based on their structure. Data & Knowledge Engineering 54 (2005) 279-299), a solution is given for clustering the pages of a data-intensive website by analyzing link collections (sets of links with the same layout and presentation properties within one page) and the page document object model (DOM) structure. The entry point to the site is a single seed page, which becomes the first member of the first class; the link collections of the seed page are extracted and pushed into a priority queue. Then the following steps are iterated until the queue is empty: one of the link collections is selected from the queue, and a subset of the pages pointed to by its links is fetched. The fetched pages are clustered according to their page structure similarity (which is defined with respect to their DOM trees). The Minimum Description Length (MDL) principle is adopted to determine whether each candidate class is a new class to be added to the model, or whether it should be merged with an existing class.
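The queue-driven loop described above can be sketched roughly as follows. This is a loose approximation, not the authors' implementation: the DOM-tree similarity is replaced by a Jaccard overlap of tag paths, page fetching is simulated by dictionary lookup, and the MDL decision is approximated by a simple similarity threshold against each class representative. All names and parameters are illustrative assumptions:

```python
from collections import deque

def structure_similarity(page_a, page_b):
    """Simplified stand-in for DOM-tree similarity: Jaccard overlap of tags."""
    a, b = set(page_a["tags"]), set(page_b["tags"])
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_site(site, seed_url, merge_threshold=0.6):
    """Start from a seed page, expand its link collections via a queue,
    and cluster fetched pages by structural similarity.  The MDL test of
    [1] is approximated here by a threshold: a page joins the most similar
    existing class, or founds a new class if no class is similar enough."""
    seed = site[seed_url]
    classes = [[seed_url]]                    # seed is the first member of the first class
    queue = deque(seed["link_collections"])   # collections of same-layout links
    visited = {seed_url}
    while queue:
        collection = queue.popleft()
        for url in collection:
            if url in visited or url not in site:
                continue
            visited.add(url)
            page = site[url]
            best = max(classes, key=lambda c: structure_similarity(site[c[0]], page))
            if structure_similarity(site[best[0]], page) >= merge_threshold:
                best.append(url)              # "merge with an existing class"
            else:
                classes.append([url])         # "new class added to the model"
            queue.extend(page.get("link_collections", []))
    return classes
```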
In non-patent document [2] (X. He, H. Zha, C. H. Q. Ding, et al. Web document clustering using hyperlink structures. Computational Statistics & Data Analysis 41 (2002): 19-45), the basic feature for web page clustering is the hyperlink structure, combined with textual information and co-citation information. The kernel idea for clustering is that pages that are more strongly inter-linked are more similar, so the clustering problem is transformed into a link graph partitioning problem. The similarity weight derived from the link structure is adjusted by textual similarity information, and is enhanced if two pages are co-cited.
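The weight-construction step of this idea can be illustrated as follows. This is only a sketch of how link, text, and co-citation evidence might be combined into edge weights; the mixing parameters `alpha` and `cocite_bonus` are illustrative assumptions, and the subsequent graph partitioning of [2] (e.g., spectral partitioning of the weighted graph) is not shown:

```python
def edge_weights(links, text_sim, alpha=0.5, cocite_bonus=0.5):
    """Combine hyperlink, textual, and co-citation evidence into a single
    similarity weight per page pair.  `links` maps each page to the set of
    pages it points to; `text_sim` maps page pairs to textual similarity."""
    pages = sorted(links)
    w = {}
    for i, p in enumerate(pages):
        for q in pages[i + 1:]:
            # base weight from the link structure, adjusted by textual similarity
            linked = 1.0 if q in links[p] or p in links[q] else 0.0
            sim = text_sim.get((p, q), text_sim.get((q, p), 0.0))
            weight = linked * (alpha + (1 - alpha) * sim)
            # co-citation enhancement: some page r links to both p and q
            if any(p in links[r] and q in links[r] for r in pages):
                weight += cocite_bonus
            if weight > 0:
                w[(p, q)] = weight
    return w
```

The resulting weighted graph would then be partitioned so that heavily weighted edges stay inside clusters, which is the essence of casting clustering as graph partitioning.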
Furthermore, Japanese patent document [3], i.e., [JP2004-341942], clusters web documents by analyzing the similarity of each pair of documents through comparison of their respective domain names, directory names, and file names, which are retrieved from their URLs.
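Such a URL-component comparison can be sketched as below; the equal-weight scoring of the three components is an illustrative assumption, not the scheme of [3]:

```python
from urllib.parse import urlparse
import posixpath

def url_similarity(url_a, url_b):
    """Score how many URL components two pages share: domain name,
    directory name, and file name, each contributing equally."""
    a, b = urlparse(url_a), urlparse(url_b)
    dir_a, file_a = posixpath.split(a.path)
    dir_b, file_b = posixpath.split(b.path)
    score = int(a.netloc == b.netloc)   # same domain name
    score += int(dir_a == dir_b)        # same directory name
    score += int(file_a == file_b)      # same file name
    return score / 3.0
```

For example, two pages in the same directory of the same site but with different file names would score 2/3, while a dynamic URL whose path is a parameter string would defeat this comparison entirely, which is the weakness noted below.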
In order to better understand the present invention, the disclosures of the above-mentioned documents are hereby incorporated entirely by reference for all purposes.
However, some problems remain unaddressed by the existing solutions. First, with respect to non-patent document [1], the method can cluster pages only for strictly data-intensive websites. For websites with even slightly irregular structure, it is not applicable, because structural similarity does not imply topic or content similarity outside the strictly data-intensive setting. Thus, this method is too specific, and its accuracy in the general case cannot be guaranteed. As for non-patent document [2], the solution uses learning-based clustering algorithms, so the manual collection and tagging of a sample corpus remains a bottleneck that limits efficiency. Moreover, the results are biased by the sample corpus, and this clustering method is too general to guarantee sufficient accuracy for specific situations. Furthermore, the patent document [JP2004-341942] is too limited to handle usual situations, because the URLs of the great mass of websites are neither normative nor meaningful, especially for dynamic websites with parameter-based URLs. Based on the observations above, we can find that deficiencies in accuracy and efficiency remain the common disadvantage of the existing solutions.
On the other hand, with respect to the efficiency requirements of clustering, there is another unaddressed problem in the existing solutions. Because the existing solutions are all based on peer-to-peer similarity analysis, the resulting clusters have only a flat structure, i.e., there are no relations among different clusters except that the documents in different clusters are much less similar than the documents within the same cluster. Thus, the clustering result can only reflect the similarities of the documents from a single aspect or a single level, and it would take much work to modify the features and models of the clustering in order to change the similarity aspect or level. For example, in a clustering analysis of the product pages within a company website, we can group the pages by individual product, i.e., a cluster represents an individual product, or we can group the pages by product category, i.e., a cluster represents a product category. The second clustering goal is at a higher similarity level than the first one, and the two can be hierarchically related. However, the existing solutions cannot achieve the two clustering results at the same time, and although the results can be obtained successively, they cannot be related together automatically; therefore, the clustering methods lack efficiency from the overall point of view.