The exemplary embodiment relates to clustering of documents and finds particular application in connection with a threshold-based clustering algorithm, suited to clustering news articles.
Clustering algorithms are useful tools for analyzing data. Many algorithms exist for this task, although their suitability for a particular problem depends heavily on the data. For example, in the case of news article clustering, clustering may be based on the detection of events inside a given collection of news articles coming from multiple sources. However, since the events themselves are often unpredictable and the articles often arrive in small batches, the identification of clusters is challenging.
In this setting, so-called k-based algorithms tend to perform poorly. These algorithms take as input the expected number of clusters, k, and try to fit the given points into k clusters guided by a selected score function. Typically, a user specifies several possible values for k (or an interval of values) and the score function is extended with a complexity-penalizing term so that the best value for k can be chosen. However, the expected number of events at a given moment is generally not evident, and this number may change over time. In addition, there may be outlier articles, i.e., documents which do not relate to any particular event, and k-based algorithms are very sensitive to the presence of outliers.
Threshold-based clustering algorithms tend to be better suited to clustering in such a setting. Here, the given input is a threshold on the similarity (denoted by τ) that relates to how close documents in the same cluster should be to each other. Popular algorithms in this setting include DBSCAN (Martin Ester, et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, pages 226-231, 1996), Star Clustering (Javed A. Aslam, et al., “The Star Clustering Algorithm for Static and Dynamic Information Organization,” Journal of Graph Algorithms and Applications, 8(1):95-129, 2004, hereinafter “Aslam”), Quality-Threshold (L. J. Heyer, “Exploring Expression Data: Identification and Analysis of Coexpressed Genes,” Genome Research, 9(11):1106-1115, November 1999), and Correlation Clustering (Nikhil Bansal, et al., “Correlation Clustering,” Machine Learning, November 2004). However, these algorithms do not deal well with sequential data.
The standard approach for sequential data is given by the single-pass (or fully-incremental) algorithm, one of the earliest and simplest threshold-based algorithms (J. Allan, et al., “Taking Topic Detection From Evaluation to Practice,” in Proc. 38th Annual Hawaii International Conference on System Sciences, IEEE (2004); C. J. van Rijsbergen, “Information Retrieval” (Butterworths 1979)). In this approach, the data points are processed one by one. For any point p, its similarities to all existing clusters are computed and the data point is assigned to the closest one. In the case that no cluster is closer than τ, a new singleton cluster is created for p. This algorithm considers each point only once and thus it may be very sensitive to the order of the data. A purely incremental approach is thus often not entirely suited to a variety of real applications, for example, when the items arrive in batches in which there is no specified order.
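By way of illustration only, the single-pass procedure described above may be sketched as follows. The sketch makes several assumptions that are not specified in the cited references: documents are represented as dense numeric vectors, similarity is measured by cosine similarity, and a cluster is represented by the sum of its member vectors (which, for cosine similarity, points in the same direction as the centroid). The names `cosine` and `single_pass` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass(points, tau):
    """Assign each point, in arrival order, to its most similar cluster.

    A new singleton cluster is created when no existing cluster reaches
    the similarity threshold tau. Returns a list of clusters, each a list
    of indices into `points`.
    """
    sums = []      # assumption: each cluster is summarized by the sum of its vectors
    members = []   # indices of the points in each cluster
    for i, p in enumerate(points):
        best, best_sim = None, float("-inf")
        for c, s in enumerate(sums):
            sim = cosine(p, s)
            if sim > best_sim:
                best, best_sim = c, sim
        if best is not None and best_sim >= tau:
            # Join the closest cluster and update its summary vector.
            sums[best] = [a + b for a, b in zip(sums[best], p)]
            members[best].append(i)
        else:
            # No cluster is close enough: start a new singleton cluster.
            sums.append(list(p))
            members.append([i])
    return members
```

Under these assumptions, the order sensitivity noted above is easy to reproduce. With τ = 0.7 and the three vectors (1, 0), (0.7, 0.7) and (0, 1), the middle vector is grouped with whichever of the two other vectors arrives before it, so presenting the same three items in reverse order yields a different partition.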