Document clustering is a process by which textual documents are analyzed and grouped according to some predetermined criterion, such as topic. Document clustering usually involves topic detection and tracking, and it is particularly beneficial when dealing with large collections of documents. Such collections might include, for example, the news stories of major news providers. Document clustering is also important because of the large number of documents currently available in a wide variety of contexts, such as on the World Wide Web.
Arranging these large collections of documents by topic, for instance, allows users to browse easily by moving from one document on a given topic to another document on the same topic. Unless the documents are arranged by topic, this kind of browsing cannot be done.
Some current clustering systems treat each document simply as a group of words (or “bag of words”) and generate a vector having features that indicate the presence or absence of words in the bag. Also, some current approaches identify named entities in the documents and give them preferential treatment in the vector, with respect to other words in the “bag of words”.
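By way of illustration only, the bag-of-words representation described above can be sketched as follows. The tokenizer, the function names, the sample sentences, and the `entity_weight` parameter are hypothetical and are not taken from any particular system; the sketch assumes that the set of named entities has already been identified by some separate step.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocabulary(documents):
    """Map every distinct word in the collection to a vector index."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def bag_of_words_vector(text, vocabulary, named_entities=frozenset(), entity_weight=1.0):
    """Build a presence/absence vector for one document.

    A feature is 0.0 when the word is absent and 1.0 when it is present.
    Words listed in named_entities (an assumed, precomputed set) receive
    entity_weight instead of 1.0, which models preferential treatment.
    """
    vector = [0.0] * len(vocabulary)
    for word in set(tokenize(text)):
        if word in vocabulary:
            weight = entity_weight if word in named_entities else 1.0
            vector[vocabulary[word]] = weight
    return vector

# Hypothetical two-document collection.
docs = ["Paris hosts the summit", "The summit ends"]
vocabulary = build_vocabulary(docs)  # index order: ends, hosts, paris, summit, the

plain = bag_of_words_vector(docs[0], vocabulary)
weighted = bag_of_words_vector(docs[0], vocabulary,
                               named_entities={"paris"}, entity_weight=2.0)
print(plain)     # [0.0, 1.0, 1.0, 1.0, 1.0]
print(weighted)  # [0.0, 1.0, 2.0, 1.0, 1.0]
```

In this sketch the word order of the document is discarded entirely; only membership in the bag, and whether a word is a named entity, affects the resulting vector.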
In such systems, an incoming document that is to be clustered has a vector generated for it. That vector is compared with the representative vectors, called centroids, associated with the previously defined clusters. The document is assigned to the cluster whose centroid is closest to the vector for the incoming document. Where named entities are identified and given preferential treatment by increasing their weights in the corresponding vectors, two documents that have numerous named entities in common will typically have vectors that are closer to each other in the induced vector space than to the vectors of documents that do not contain the same named entities.
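The assignment step described above can be sketched, under stated assumptions, as a nearest-centroid search. The text does not specify how closeness is measured, so Euclidean distance is assumed here; the feature layout, the cluster labels "X" and "Y", and the helper names are likewise hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_to_cluster(vector, centroids):
    """Return the label of the centroid closest to the given vector."""
    return min(centroids, key=lambda label: euclidean_distance(vector, centroids[label]))

def scale_entity(vector, entity_index, weight):
    """Return a copy of the vector with the named-entity feature multiplied by weight."""
    scaled = list(vector)
    scaled[entity_index] *= weight
    return scaled

# Hypothetical features: [paris, summit, leaders, talks].
# "paris" (index 0) is treated as the only named entity.
incoming = [1.0, 0.0, 0.0, 1.0]           # contains "paris" and "talks"
centroids = {
    "X": [1.0, 1.0, 1.0, 0.0],            # shares the named entity "paris"
    "Y": [0.0, 0.0, 1.0, 1.0],            # shares only the ordinary word "talks"
}

def cluster_for(weight):
    """Assign the incoming vector after applying the same entity weight everywhere."""
    scaled_centroids = {label: scale_entity(c, 0, weight) for label, c in centroids.items()}
    return assign_to_cluster(scale_entity(incoming, 0, weight), scaled_centroids)

print(cluster_for(1.0))  # Y
print(cluster_for(3.0))  # X
```

With no preferential weighting, the incoming vector in this sketch lies closer to centroid Y, with which it shares an ordinary word. Once the named-entity feature is weighted by 3.0, the shared entity dominates the distance and the vector is assigned to cluster X instead.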
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.