In general, a large volume of continuously evolving data, which may be stored, is referred to as a data stream. Data streams have received increased attention in recent years due to technological innovations that have facilitated the creation, maintenance and storage of such data. A number of data mining studies have been conducted in the data stream context in recent years, see, e.g., C. C. Aggarwal, “A Framework for Diagnosing Changes in Evolving Data Streams,” ACM SIGMOD Conference, 2003; B. Babcock et al., “Models and Issues in Data Stream Systems,” ACM PODS Conference, 2002; P. Domingos et al., “Mining High-Speed Data Streams,” ACM SIGKDD Conference, 2000; S. Guha et al., “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” Proceedings of the International Conference on Data Engineering, 1999; and L. O'Callaghan et al., “Streaming-Data Algorithms for High-Quality Clustering,” ICDE Conference, 2002.
Clustering is the partitioning of a given set of objects, such as data points, into one or more groups (clusters) of similar objects. The similarity of one data point to another is typically defined by a distance measure or objective function. Data points that do not naturally fit into any particular cluster are referred to as outliers. Clustering has been widely studied in the database and data mining communities because of its applicability to a wide range of problems, see, e.g., P. Bradley et al., “Scaling Clustering Algorithms to Large Databases,” SIGKDD Conference, 1998; S. Guha et al., “CURE: An Efficient Clustering Algorithm for Large Databases,” ACM SIGMOD Conference, 1998; R. Ng et al., “Efficient and Effective Clustering Methods for Spatial Data Mining,” Very Large Data Bases Conference, 1994; A. Jain et al., “Algorithms for Clustering Data,” Prentice Hall, New Jersey, 1988; L. Kaufman et al., “Finding Groups in Data: An Introduction to Cluster Analysis,” Wiley Series in Probability and Math Sciences, 1990; E. Knorr et al., “Algorithms for Mining Distance-Based Outliers in Large Data Sets,” Proceedings of the VLDB Conference, September, 1998; E. Knorr et al., “Finding Intensional Knowledge of Distance-Based Outliers,” Proceedings of the VLDB Conference, September, 1999; S. Ramaswamy et al., “Efficient Algorithms for Mining Outliers from Large Data Sets,” Proceedings of the ACM SIGMOD Conference, 2000; and T. Zhang et al., “BIRCH: An Efficient Data Clustering Method for Very Large Databases,” ACM SIGMOD Conference, 1996.
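By way of illustration only, the notions of a distance measure and of an outlier described above may be sketched as follows. The sketch assumes Euclidean distance, a fixed set of cluster centroids, and a hypothetical threshold named outlier_radius; none of these choices is taken from the cited works, and other distance measures or objective functions could be substituted.

```python
import math

def assign_points(points, centroids, outlier_radius):
    """Assign each point to its nearest centroid (Euclidean distance).

    Points farther than outlier_radius (an illustrative, assumed threshold)
    from every centroid are returned separately as outliers.
    """
    clusters = {i: [] for i in range(len(centroids))}
    outliers = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        nearest = min(range(len(centroids)), key=distances.__getitem__)
        if distances[nearest] <= outlier_radius:
            clusters[nearest].append(p)
        else:
            outliers.append(p)
    return clusters, outliers

# Illustrative data: two tight groups and one point far from both.
points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5), (50.0, 50.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters, outliers = assign_points(points, centroids, outlier_radius=2.0)
```

In this example the point (50.0, 50.0) lies outside the assumed radius of both centroids and is therefore reported as an outlier rather than forced into a cluster.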
The problem of categorical data clustering has also been recently studied, see, e.g., V. Ganti et al., “CACTUS—Clustering Categorical Data Using Summaries,” Proceedings of the ACM SIGKDD Conference, 1999; D. Gibson et al., “Clustering Categorical Data: An Approach Based on Dynamical Systems,” Proceedings of the VLDB Conference, 1998; and S. Guha et al., “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” Proceedings of the International Conference on Data Engineering, 1999. However, these techniques cannot be utilized for clustering data streams, since they do not naturally scale well with increasing data size. Furthermore, a data stream clustering technique requires the appropriate mechanisms to deal with the temporal issues created by the evolution of the data stream.
Clustering and outlier monitoring present a number of unique challenges in an evolving data stream environment. For example, because clusters evolve continuously, new patterns in the data must be identified quickly. It is also important to provide end users with the ability to analyze the clusters in an offline fashion.
In the data stream environment, outlier and abnormality monitoring is especially problematic, since the temporal component of the data stream influences whether an outlier is defined as an abnormality. For example, the first arriving data point of a cluster may be considered an outlier at the moment of its arrival. However, as time passes, further data points may join the newly created cluster, thereby initiating a new pattern of activity resulting from the evolution of the data stream. In many other cases, no further data points join the outlier or newly created cluster over time, in which case the outlier defines an abnormality. An important aspect of the data stream clustering process is the ability to identify and label such events effectively.
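The temporal distinction described above, between an outlier that grows into a new cluster and one that remains an abnormality, may be sketched as follows. This is a minimal illustration under stated assumptions and not a description of any particular technique: the names MicroCluster, process and label, and the parameters radius, min_points and patience, are hypothetical, and Euclidean distance with a running-mean centroid is assumed.

```python
import math
from dataclasses import dataclass

@dataclass
class MicroCluster:
    centroid: tuple      # running mean of the points absorbed so far
    count: int           # number of points absorbed
    created_at: float    # arrival time of the first point

def process(stream, radius):
    """Consume (timestamp, point) pairs in arrival order.

    A point within `radius` of the nearest centroid joins that cluster;
    otherwise it starts a new singleton cluster (an outlier on arrival).
    """
    clusters = []
    for t, p in stream:
        best = min(clusters, key=lambda c: math.dist(p, c.centroid), default=None)
        if best is not None and math.dist(p, best.centroid) <= radius:
            n = best.count + 1
            # Incremental mean update of the centroid.
            best.centroid = tuple(m + (x - m) / n for m, x in zip(best.centroid, p))
            best.count = n
        else:
            clusters.append(MicroCluster(tuple(p), 1, t))
    return clusters

def label(cluster, now, min_points, patience):
    """Label a cluster using its size and age (both thresholds are assumed)."""
    if cluster.count >= min_points:
        return "cluster"
    if now - cluster.created_at >= patience:
        return "abnormality"
    return "pending outlier"

stream = [(0, (0.0, 0.0)), (1, (0.2, 0.0)), (2, (0.1, 0.1)),
          (3, (9.0, 9.0)), (4, (20.0, 20.0)),
          (5, (9.1, 9.0)), (6, (9.0, 9.2))]
clusters = process(stream, radius=1.0)
labels = [label(c, now=10, min_points=3, patience=5) for c in clusters]
```

Here the point arriving at time 3 is an outlier on arrival but is later joined by two more points and becomes a cluster, whereas the point arriving at time 4 is never joined and is labeled an abnormality once the assumed patience interval has elapsed.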