The present invention relates generally to data clustering, and more particularly to clustering data according to the relative age of data clusters. Clustering is the classification of objects (e.g., data, documents, articles, etc.) into different groups, that is, the partitioning of a data set into subsets (clusters) so that the objects in each cluster share some common trait. The common trait may be a defined measurement attribute (e.g., a feature vector) such that the feature vector of an object is within a predetermined proximity (e.g., mathematical “distance”) of the feature vector of the cluster into which the object may be grouped. Data clustering is used in news article feeds, machine learning, data mining, pattern recognition, image analysis, and bioinformatics, among other areas.
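The proximity test described above can be illustrated with a minimal sketch. It assumes Euclidean distance and a single representative feature vector per cluster; the names `euclidean`, `assign`, and `max_distance` are hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign(obj_vec, cluster_vecs, max_distance):
    """Return the index of the nearest cluster whose feature vector lies
    within max_distance of obj_vec, or None if no cluster is close enough."""
    best_idx, best_dist = None, float("inf")
    for i, cluster_vec in enumerate(cluster_vecs):
        d = euclidean(obj_vec, cluster_vec)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx if best_dist <= max_distance else None


# Two clusters with illustrative (made-up) feature vectors.
clusters = [[0.0, 0.0], [10.0, 10.0]]
near = assign([1.0, 1.0], clusters, max_distance=2.0)  # close to cluster 0
far = assign([5.0, 5.0], clusters, max_distance=2.0)   # close to neither
```

Here `near` identifies the first cluster, while `far` is `None` because neither cluster is within the predetermined proximity.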
Conventional data clustering can be hierarchical or partitional. Hierarchical data clustering finds successive clusters using previously established clusters, whereas partitional data clustering determines all clusters at once.
Hierarchical algorithms can be agglomerative or divisive. Agglomerative algorithms begin with each object as a separate element or, in some cases, a separate cluster, and merge these into successively larger clusters. Divisive algorithms begin with the whole set and divide it into successively smaller clusters. These algorithms are often iterative. That is, each object and/or each cluster is continually reevaluated to determine whether the current cluster for a particular object is the best cluster for that object (e.g., the cluster with the feature vector nearest the feature vector of the object). As new objects enter the clustering system and/or as objects are clustered into new clusters, the feature vectors of the clusters change, so that each object in each cluster must constantly be reevaluated and/or updated.
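As an illustration of the agglomerative approach, the following sketch starts with each object as its own cluster and repeatedly merges the two clusters whose centroids are closest. Centroid linkage, Euclidean distance, and the stopping parameter `target_clusters` are assumptions made for this example; other linkage criteria are equally common.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, target_clusters):
    """Merge the closest pair of clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > target_clusters:
        # Every merge step re-examines all cluster pairs and recomputes
        # their centroids, which is where the repeated cost arises.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(centroid(clusters[ij[0]]),
                                centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because combinations guarantees i < j
    return clusters


points = [[0.0], [1.0], [10.0], [11.0]]
result = agglomerate(points, 2)
```

Because every merge changes a centroid, all pairwise distances are evaluated again on the next pass, which mirrors the continual reevaluation described above.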
Partitional algorithms, such as k-means and bisecting k-means algorithms, are also conventionally used in clustering. However, such algorithms suffer from deficiencies similar to those of hierarchical algorithms in that they are computationally intensive and require multiple iterations. These iterations require additional memory and slow the clustering rate of the system.
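The iterative character of k-means can be seen in a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm). The initial centroids are supplied by the caller so the run is deterministic; the function name `kmeans` and the `max_iters` parameter are hypothetical choices for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iters=100):
    """Return (centroids, assignments, passes) for a basic k-means run."""
    centroids = [list(c) for c in centroids]
    assignments = None
    for passes in range(1, max_iters + 1):
        # Assignment step: every point is compared with every centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            return centroids, assignments, passes  # no change: converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments, max_iters


points = [[0.0], [1.0], [10.0], [11.0]]
final_centroids, labels, passes = kmeans(points, [[0.0], [1.0]])
```

Even on this four-point data set the loop needs three full passes over all points before the assignments stop changing, and all points must be held in memory throughout.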
The growth of the Internet has allowed rapid dissemination of news articles. News articles produced at a seemingly continuous rate are transmitted from news article producers (e.g., newspapers, wire services, etc.) to news aggregators, such as Google News, Yahoo! News, etc. The news aggregators use combinations of software and human interaction to sort news articles into clusters for display. These clustering methods result in delays in serving articles to users and in inaccurate clustering.
Increased access to numerous databases and rapid delivery of large quantities of information (e.g., high density data streams over the Internet) have overwhelmed such conventional methods of data clustering. Further, end users desire increasingly sophisticated, accurate, and rapidly delivered data clusters. For example, multiple news providers, as well as other content providers such as weblog (e.g., blog) servers, deliver tens of thousands to hundreds of thousands of news articles each day. Each article is evaluated and assigned a measurement attribute, such as one or more feature vectors based on words in the news article. The news articles are streamed to clustering services at such a high rate and volume that the multiple iterations of clustering used in conventional methods would significantly slow down clustering systems.
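One simple way a word-based feature vector might be formed is by counting how often each word of a fixed vocabulary occurs in an article. The sketch below assumes a plain term-frequency representation; the vocabulary, the tokenizing pattern, and the function name `term_frequency_vector` are hypothetical and are not taken from any particular clustering service.

```python
import re
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Count occurrences of each vocabulary word in text (case-insensitive)."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return [counts[word] for word in vocabulary]


# Illustrative vocabulary and article text (both made up).
vocabulary = ["election", "vote", "storm"]
vector = term_frequency_vector(
    "Election day: voters vote, and vote again.", vocabulary
)
```

Each position of `vector` corresponds to one vocabulary word, so articles on similar topics tend to produce vectors that lie close together.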
As clustering progresses, increasingly large numbers of documents are contributed to the system and increasingly large numbers of clusters are created and modified. As the number of clusters grows, clustering delays occur because each incoming article must be compared to each cluster to determine the most appropriate cluster for that article. The increasingly large number of comparisons slows the system and delays the availability of clustered articles to users.
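The growth in comparisons can be made concrete with a small sketch that compares every incoming item against every existing cluster and counts the comparisons. It assumes one-dimensional feature values, a fixed `threshold`, and cluster centers that are not updated after creation; these simplifications, like the name `stream_cluster`, are hypothetical.

```python
def stream_cluster(values, threshold):
    """Assign each value to the nearest center within threshold, creating a
    new cluster otherwise. Returns (centers, number_of_comparisons)."""
    centers = []
    comparisons = 0
    for v in values:
        best = None
        for i, c in enumerate(centers):
            comparisons += 1  # one comparison per existing cluster
            if abs(v - c) <= threshold and (
                best is None or abs(v - c) < abs(v - centers[best])
            ):
                best = i
        if best is None:
            centers.append(v)  # no suitable cluster: start a new one
    return centers, comparisons


# 100 items, each far enough apart to form its own cluster.
centers, comparisons = stream_cluster(range(0, 1000, 10), threshold=1)
```

In this worst case the 100 items require 4950 comparisons, that is n(n-1)/2 for n items, so the work per incoming item rises in step with the number of clusters already present.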
Therefore, alternative methods and apparatus are required to efficiently, accurately, and relevantly cluster objects from continuous high density data streams.