This specification relates to clustering.
Clustering generally involves techniques for assigning data points to clusters. Data points assigned to a cluster may be referred to as cluster members. Clustering techniques aim to assign data points to clusters so that the data points within a cluster share a common attribute or are more similar to each other than to data points assigned to other clusters. Each data point may be assigned to one or more clusters.
A data point can be any appropriate collection of data having one or more features. For example, a data point can be a feature vector having elements whose values represent features of an observation, e.g., of an email message, a network request from a user, or a user selection of a link.
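As an illustration only, the following minimal sketch shows one way an observation of an email message could be represented as a feature vector. The class name `EmailObservation`, the function name `to_feature_vector`, and the three chosen features (word count, link count, and fraction of uppercase letters) are hypothetical and are not taken from this specification; any other appropriate features could be used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmailObservation:
    """Hypothetical raw observation of an email message."""
    subject: str
    body: str
    num_links: int


def to_feature_vector(email):
    """Map an observation to a list of numeric feature values.

    The features are illustrative assumptions:
    [word count, link count, fraction of letters that are uppercase].
    """
    text = f"{email.subject} {email.body}"
    letters = [c for c in text if c.isalpha()]
    # Guard against messages that contain no letters at all.
    upper_fraction = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    return [float(len(text.split())), float(email.num_links), upper_fraction]


# Example: a short, mostly uppercase subject with two links in the body.
vector = to_feature_vector(EmailObservation("WIN NOW", "click here", 2))
```

Each element of the resulting vector is one feature of the observation, so two email messages can be compared by comparing their vectors.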
Online clustering involves assigning data points to clusters as the data points are received. Typically, online clustering processes are designed to assign data points to clusters by making only one pass over the data points. In other words, the data points are assigned to clusters while considering each data point only once.
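The single-pass behavior described above can be sketched as follows. This is a simplified, hypothetical example and not the technique of this specification: it assumes Euclidean distance, a fixed distance threshold, running-mean centroids, and assignment of each data point to exactly one cluster. The names `OnlineClusterer`, `threshold`, and `assign` are illustrative.

```python
import math


class OnlineClusterer:
    """Illustrative single-pass clusterer (assumed design, hard assignment)."""

    def __init__(self, threshold):
        self.threshold = threshold  # maximum distance for joining a cluster
        self.centroids = []         # one running-mean centroid per cluster
        self.counts = []            # number of members per cluster

    def assign(self, point):
        """Assign one arriving data point and return its cluster index.

        The point is considered exactly once and is not stored afterward.
        """
        best_index, best_distance = None, math.inf
        for i, centroid in enumerate(self.centroids):
            distance = math.dist(point, centroid)
            if distance < best_distance:
                best_index, best_distance = i, distance

        # No sufficiently close cluster exists: start a new one.
        if best_index is None or best_distance > self.threshold:
            self.centroids.append(list(point))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Otherwise update the nearest centroid as an incremental mean.
        self.counts[best_index] += 1
        n = self.counts[best_index]
        centroid = self.centroids[best_index]
        for j, value in enumerate(point):
            centroid[j] += (value - centroid[j]) / n
        return best_index


# Data points are processed in arrival order, one at a time.
clusterer = OnlineClusterer(threshold=1.0)
stream = [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 4.9), (0.1, 0.1)]
labels = [clusterer.assign(p) for p in stream]
```

For this small stream the sketch produces the labels `[0, 0, 1, 1, 0]`. Because only the centroids and member counts are retained, the memory used grows with the number of clusters rather than with the number of data points received.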
Online clustering techniques are often used for streaming and real-time applications in which speed is critical and the data points change frequently. For example, some email systems may handle upwards of 20,000 email messages per second. Such email systems can use online clustering techniques to cluster email messages as they arrive in order to detect clusters of spam email messages. The email messages in those clusters can then be quarantined, deleted, or marked as spam.