Knowledge collaboration includes contributing to, authoring within, discussing, sharing, exploring, and deploying a collective knowledge base. The World Wide Web (WWW) opens up new possibilities for people to share knowledge, exchange information, and conduct knowledge collaboration. Numerous kinds of collaborative online knowledge communities are now available as a result of knowledge input on the WWW. For example, Weblog and Wiki have become well-known sources of collaborative information and are now common words in daily life. Despite the influx of data to these and other collaborative knowledge bases, computers have no understanding of the content and meaning of the submitted information. In these knowledge collaboration systems, assessing and classifying the information has mainly relied on the manual work of a few experienced people (e.g., wiki editors, discussion board moderators, etc.). As a community grows, the workload of manual sorting can become monumentally complex and difficult to manage. Generally, the more people who join a discussion or contribute knowledge and information, the heavier the workload placed on experienced information sorters.
Document clustering and classification have long been studied as post-retrieval document visualization techniques. Document clustering algorithms attempt to group documents together based on their similarities, such that documents that are relevant to a certain topic will typically be allocated to a single cluster (e.g., topic, idea, concept, response, etc.). A document clustering algorithm that categorizes WWW documents in an online community, for example, can substantially reduce the reliance on human information sorters and can provide efficiency, speed, and accuracy advantages over manual sorting. Automated clustering can therefore be very helpful in speeding up knowledge collaboration in online communities. For example, where automated sorting has already provided relevant clusters for additional refinement, experienced members and editors can focus more easily and efficiently on identifying and assessing high quality documents. As another example, an efficient online searching service using automated clustering can easily provide a categorical index of a whole forum, which can aid users, especially novice users or forum guests, when looking for topics of interest within a plurality of knowledge sources.
Research into document clustering can generally be classified into two divergent areas: graphical and vector document modeling. Vector data models generally reduce a document to a vector of its words, and these vectors can then be compared for similarity. Graphical data modeling generally includes tree-based data modeling techniques. These tree techniques can include suffix tree modeling, wherein phrases are placed into a representative tree structure to generate a compact model of the phrases in a document, allowing similarity calculations by traversing branches of the tree.
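As a minimal sketch of the vector approach described above, the following Python example reduces each document to a term-frequency vector and compares vectors by cosine similarity. The function names `word_vector` and `cosine_similarity`, the whitespace tokenization, and the sample documents are assumptions made for illustration only; practical systems typically add stop-word removal and term weighting.

```python
import math
from collections import Counter


def word_vector(text):
    """Term-frequency vector: maps each lowercase word to its count."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample documents (not taken from any real corpus).
docs = [
    "suffix tree clustering of web documents",
    "clustering web documents with a suffix tree",
    "wiki editors moderate discussion boards",
]
vectors = [word_vector(d) for d in docs]

sim_related = cosine_similarity(vectors[0], vectors[1])    # share five words
sim_unrelated = cosine_similarity(vectors[0], vectors[2])  # share no words
print(sim_related, sim_unrelated)
```

The first two documents share most of their words and so score well above the third, which shares none.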
Text document clustering has traditionally been investigated as a means of improving the performance of search engines by pre-clustering an entire corpus of documents. The methods used for document clustering cover several research areas, such as databases, information retrieval, and artificial intelligence, including machine learning and natural language processing. The Agglomerative Hierarchical Clustering (AHC) algorithm is generally considered to be the most commonly used of the numerous document clustering algorithms. There are several variants of this algorithm, e.g., single-link, group-average, and complete-link. In practice, the AHC algorithm can often generate a high quality clustering result, with the tradeoff of higher computational complexity.
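The linkage variants mentioned above differ only in how the distance between two clusters is defined. The sketch below is a naive illustration, assuming one-dimensional numeric points stand in for documents and absolute difference stands in for a document dissimilarity measure; the function name `agglomerative` and the sample points are hypothetical. Because every merge rescans all pairs of clusters, it also hints at why AHC can be computationally expensive.

```python
def agglomerative(points, k, linkage="single"):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point

    def cluster_distance(a, b):
        # All pairwise distances between members of the two clusters.
        dists = [abs(x - y) for x in a for y in b]
        if linkage == "single":    # closest pair of members
            return min(dists)
        if linkage == "complete":  # farthest pair of members
            return max(dists)
        if linkage == "average":   # group average over all pairs
            return sum(dists) / len(dists)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > k:
        best = None
        # Naive search over every pair of clusters for the closest pair.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical data: numbers standing in for documents.
points = [1.0, 1.5, 5.0, 5.4, 9.0]
single = agglomerative(points, 2, "single")
complete = agglomerative(points, 2, "complete")
print(single)
print(complete)
```

On this data the single-link and complete-link variants produce different two-cluster partitions, which illustrates that the choice of linkage can change the clustering result.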
In traditional vector document models, words or characters are considered to be atomic elements in statistical feature analysis and extraction. Clustering methods mostly make use of single word/term analysis of a document data set. In order to achieve more accurate document clustering, the development of more informative features (e.g., bigrams, trigrams, and much longer n-grams) needs to receive considerable attention in future information retrieval research. Vector document model clustering fails to account for the sequence order of the words that make up phrases in documents. The sequence of words can convey additional information beyond the mere presence and frequency of a word; however, this information is typically discarded in vector data model clustering techniques, to the detriment of clustering quality.
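The loss of word order can be seen in a small example. In the sketch below, two sentences with different meanings have identical single-word counts, while their bigram counts differ. The helper name `ngrams` and both sentences are assumptions chosen for illustration.

```python
from collections import Counter


def ngrams(text, n):
    """Count the n-word sequences (as tuples) that occur in the text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


# Hypothetical sentences: same words, different order and meaning.
a = "the editor reviews the moderator"
b = "the moderator reviews the editor"

same_unigrams = ngrams(a, 1) == ngrams(b, 1)  # single-word model
same_bigrams = ngrams(a, 2) == ngrams(b, 2)   # two-word phrases
print(same_unigrams, same_bigrams)
```

A single-word vector model cannot tell these two sentences apart, whereas features that retain some word sequence can.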
A suffix tree document model was first proposed in 1997. It differed from vector document models, which treat a document as a set of words and ignore the sequence order of the words, by considering a document to be a set of suffix substrings. In the suffix tree document model, the common prefixes of the suffix substrings are selected as phrases to label the edges of a suffix tree. Numerous derivative suffix tree document models were developed based on this generic suffix tree model, and they work well in clustering WWW document snippets returned from several specific search engines. Generally, however, these derivatives are essentially based only on fusion heuristics that evaluate the suffix tree document model with graph-based similarity measures to compute document similarities for large document collections. Little attention is given to effective quality measurements of cluster phrases or to the agglomeration of clusters under these cluster phrases, to the detriment of clustering quality.
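The idea of treating a document as a set of suffix substrings can be sketched as follows. This example is not the 1997 model itself: it assumes a simple uncompressed word-level trie rather than a compact suffix tree, and the names `build_suffix_trie` and `shared_phrases`, along with the three sample documents, are hypothetical. Every word-level suffix of every document is inserted, each node records which documents pass through it, and phrases shared by at least two documents are then read off as candidate cluster labels.

```python
def build_suffix_trie(docs):
    """Insert every word-level suffix of every document into a trie.

    Each node holds its children and the ids of the documents whose
    suffixes pass through it.
    """
    root = {"children": {}, "docs": set()}
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node["children"].setdefault(
                    word, {"children": {}, "docs": set()}
                )
                node["docs"].add(doc_id)
    return root


def shared_phrases(root, min_docs=2, min_words=2):
    """Map each phrase found in at least min_docs documents to their ids."""
    found = {}
    stack = [(root, ())]
    while stack:
        node, phrase = stack.pop()
        for word, child in node["children"].items():
            # A longer phrase cannot occur in more documents than its
            # prefix, so branches below the threshold are skipped.
            if len(child["docs"]) >= min_docs:
                new_phrase = phrase + (word,)
                if len(new_phrase) >= min_words:
                    found[" ".join(new_phrase)] = set(child["docs"])
                stack.append((child, new_phrase))
    return found


# Hypothetical sample documents.
docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
phrases = shared_phrases(build_suffix_trie(docs))
print(phrases)
```

Each shared phrase, together with the set of documents containing it, can serve as a base cluster. How such base clusters are scored and merged is where the derivative models differ.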