In many clustering applications, the characteristics of the objects to be clustered change over time. Typically, such change comprises both a long-term trend due to concept drift and short-term variation due to noise. For example, in the blogosphere, where blog sites are to be clustered (e.g., for community detection), the overall interests of a blogger and the blogger's friendship network may drift slowly over time while, simultaneously, short-term variation may be triggered by external events. As another example, in a ubiquitous computing environment, moving objects equipped with GPS sensors and wireless connections are to be clustered (e.g., for traffic jam prediction or for animal migration analysis). The coordinates of a moving object may follow a certain route in the long term, but its estimated coordinates at a given time may vary due to limitations on bandwidth and sensor accuracy.
These application scenarios, in which the objects to be clustered evolve with time, raise new challenges for traditional clustering algorithms. On one hand, the current clusters should depend mainly on the current data features; aggregating all historic data features makes little sense in non-stationary scenarios. On the other hand, the current clusters should not deviate too dramatically from the most recent history, because in most dynamic applications the system does not expect data to change too quickly and consequently expects a certain level of temporal smoothness between clusters in successive time steps. This point can be illustrated using the evolutionary clustering scenario in FIG. 1. In this example, assume the system wants to partition 5 blogs into 2 clusters. FIG. 1 shows the relationship among the 5 blogs at time t−1 and time t, where each node represents a blog and the numbers on the edges represent the similarities (e.g., the number of links) between blogs. Obviously, the blogs at time t−1 should be clustered by Cut I. The clusters at time t are less clear-cut: both Cut II and Cut III partition the blogs equally well. However, according to the principle of temporal smoothness, Cut III should be preferred because it is more consistent with the recent history (time t−1).
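The preference for Cut III can be made concrete with a toy computation. The sketch below uses illustrative edge weights and cluster labels (the actual values in FIG. 1 are not reproduced here): both candidate cuts at time t have the same cut cost, but only one of them preserves the co-membership relations from time t−1.

```python
def cut_cost(weights, labels):
    """Sum of similarities across the partition boundary."""
    return sum(w for (i, j), w in weights.items() if labels[i] != labels[j])

def disagreement(labels_now, labels_prev):
    """Fraction of node pairs whose co-membership changed since t-1."""
    nodes = list(labels_now)
    pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
    changed = sum(
        (labels_now[a] == labels_now[b]) != (labels_prev[a] == labels_prev[b])
        for a, b in pairs
    )
    return changed / len(pairs)

# Illustrative similarity graph over 5 blogs at time t (assumed weights).
weights = {("1", "2"): 3, ("2", "3"): 1, ("3", "4"): 1,
           ("4", "5"): 3, ("2", "4"): 1}
prev = {"1": 0, "2": 0, "3": 0, "4": 1, "5": 1}   # Cut I at time t-1
cut2 = {"1": 0, "2": 0, "3": 1, "4": 1, "5": 1}   # Cut II at time t
cut3 = {"1": 0, "2": 0, "3": 0, "4": 1, "5": 1}   # Cut III at time t
```

With these assumed weights, `cut_cost` is identical for Cut II and Cut III, while `disagreement` with the time t−1 clustering is zero only for Cut III, which is why temporal smoothness breaks the tie in its favor.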
In time series analysis, moving averages are often used to smooth out short-term fluctuations. Because similar short-term variations also exist in clustering applications, whether due to data noise or to non-robust behaviors of clustering algorithms (e.g., converging to different locally suboptimal modes), new clustering techniques are needed to handle evolving objects and to obtain stable and consistent clustering results.
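For reference, a trailing moving average of the kind mentioned above can be sketched in a few lines. This is a generic time-series smoother, given only as a point of comparison, not a clustering method from the text:

```python
def moving_average(series, window):
    """Trailing moving average; early points average over the partial
    prefix so the output has the same length as the input."""
    out = []
    for i in range(len(series)):
        lo = max(0, i - window + 1)
        chunk = series[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```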
In clustering data streams, the large amounts of data that arrive at a high rate make it impractical to store all the data in memory or to scan them multiple times. This data model raises issues such as how to efficiently cluster a massive data set using limited memory and a single pass over the data, and how to cluster evolving data streams at multiple resolutions so that a user can query any historic time period with guaranteed accuracy.
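One standard answer to the one-pass, limited-memory requirement, used for example in BIRCH-style clustering features (an assumption about technique; the text above does not name a specific method), is to keep a constant-size summary per cluster that can absorb new points and be merged without revisiting raw data:

```python
class MicroCluster:
    """Constant-memory summary of a set of 1-D points: count, linear
    sum, and squared sum. These suffice to recover mean and variance,
    and two summaries merge without rescanning the original points."""

    def __init__(self):
        self.n = 0
        self.ls = 0.0   # linear sum of points
        self.ss = 0.0   # squared sum of points

    def add(self, x):
        self.n += 1
        self.ls += x
        self.ss += x * x

    def merge(self, other):
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def mean(self):
        return self.ls / self.n

    def variance(self):
        return self.ss / self.n - self.mean() ** 2
```

The same three statistics generalize to d dimensions by keeping vector sums, which is what makes multi-resolution historic queries feasible: summaries for adjacent time windows can be merged after the fact.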
Incremental clustering algorithms have been used to efficiently apply dynamic updates to cluster centers, medoids, or hierarchical trees when new data points arrive. However, newly arrived data points have no direct relationship with existing data points, other than that they probably share similar statistical characteristics. For example, moving objects can be clustered based on micro-clustering, and an incremental spectral clustering algorithm has been applied to similarity changes among objects that evolve with time. However, the focus of these systems is to improve computational efficiency, at the cost of lower cluster quality. Constrained clustering has also been used, in which either hard constraints, such as cannot-links and must-links, or soft constraints, such as prior preferences, are incorporated into the clustering task.
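The dynamic update to a cluster center mentioned above can be sketched with the standard running-mean formula, given here as a generic illustration rather than as any cited system's method:

```python
def add_point(centroid, count, point):
    """Fold one new point into an existing centroid without rescanning
    previously assigned points: c_new = (n * c + x) / (n + 1).
    Returns the updated centroid and the updated point count."""
    new_centroid = [(count * c + x) / (count + 1)
                    for c, x in zip(centroid, point)]
    return new_centroid, count + 1
```

This is the efficiency appeal of incremental clustering: each arrival costs O(d) per affected center, independent of how many points are already in the cluster.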
Evolutionary clustering is an emerging research area essential to important applications such as clustering dynamic Web and blog contents and clustering data streams. In evolutionary clustering, a good clustering result should fit the current data well while not deviating too dramatically from the recent history. To fulfill this dual purpose, a measure of temporal smoothness is integrated into the overall measure of clustering quality. In Chakrabarti et al., Evolutionary clustering, In Proc. of the 12th ACM SIGKDD Conference, 2006, an evolutionary hierarchical clustering algorithm and an evolutionary k-means clustering algorithm are discussed. Chakrabarti et al. propose to measure temporal smoothness by a distance between the clusters at time t and those at time t−1. The cluster distance is defined by (1) pairing each centroid at time t with its nearest peer at time t−1 and (2) summing the distances between all pairs of centroids. However, the pairing procedure is based on heuristics and can be unstable (a small perturbation of the centroids may change the pairing dramatically). Additionally, because Chakrabarti et al. ignore the fact that the same data points are to be clustered at both t and t−1, this distance may be sensitive to movements of the data points such as shifts and rotations (e.g., consider a fleet of vehicles that move together while the relative distances among them remain the same).
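The centroid-pairing distance described above can be sketched as follows. This is a minimal reading of steps (1) and (2); the nearest-peer pairing is exactly the heuristic criticized as unstable, and the shift sensitivity is visible directly in the numbers:

```python
def history_cost(centroids_t, centroids_prev):
    """Pair each centroid at time t with its nearest centroid at time
    t-1 and sum the pair distances (sketch of the Chakrabarti et al.
    heuristic; note the pairing is recomputed greedily per centroid)."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    return sum(min(dist(c, p) for p in centroids_prev)
               for c in centroids_t)
```

For instance, with previous centroids (0, 0) and (2, 0), identical current centroids give a cost of 0, but translating both current centroids by 5 units along the x-axis (the "fleet of vehicles" case, where relative geometry is unchanged) raises the cost to 8, illustrating the shift sensitivity noted above.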