Various systems have natural groupings. For example, large-scale distributed systems can have groups of virtual and/or physical devices. A system can also have groups of time series datasets collected at different time intervals. Such groups are usually characterized by one or more multidimensional metric (feature) datasets. Clustering groups within these datasets has wide-ranging applications. For example, clustering may help identify anomalous groups and, therefore, anomalous virtual and/or physical devices within a distributed system. The table below illustrates an example of a 2D dataset with metrics, such as Mq for metric q (where q=1, 2, . . . , N), in the headers/columns, and data point observations, such as viq for data point i of metric q (where i=1, 2, . . . , R), in the rows.
M1       M2       . . .    MN
v11      v12      . . .    v1N
v21      v22      . . .    v2N
. . .    . . .    . . .    . . .
vR1      vR2      . . .    vRN
The 2D dataset shown above may, for example, be derived from performance data belonging to a single machine over a certain period of time. In such an example, metrics M1 to MN may be performance metrics belonging to the machine, while the data points in each row correspond to performance metric values at a certain point in time. As an example, row 1 may show metric values measured at t1, row 2 may show metric values at t1+5 mins, row 3 may show metric values at t1+10 mins, etc.
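The row and column layout described above can be sketched in a few lines of Python. This is only an illustration: the helper build_rows, the three metrics, the start time, and all numeric values are hypothetical and do not come from the dataset described here.

```python
from datetime import datetime, timedelta


def build_rows(start, interval_minutes, samples):
    """Pair each row of metric values with the time at which it was sampled."""
    step = timedelta(minutes=interval_minutes)
    return [(start + i * step, values) for i, values in enumerate(samples)]


# Hypothetical metrics M1..M3 for a single machine (values are made up).
metric_names = ["M1", "M2", "M3"]
samples = [
    [12.0, 40.5, 3.1],  # v11, v12, v13 measured at t1
    [15.2, 41.0, 2.9],  # v21, v22, v23 measured at t1 + 5 mins
    [14.8, 43.7, 3.4],  # v31, v32, v33 measured at t1 + 10 mins
]

t1 = datetime(2024, 1, 1, 0, 0)  # assumed start time
rows = build_rows(t1, 5, samples)

for timestamp, values in rows:
    print(timestamp.isoformat(), dict(zip(metric_names, values)))
```

Each entry of rows corresponds to one row of the table, and each position within a row corresponds to one metric column.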
Traditional algorithms, such as the K-means clustering algorithm, are suitable for clustering 2D datasets (i.e., tabular datasets with metric names as columns and per-metric data points forming the rows, as in the table above). In many practical real-world systems and applications, however, datasets are multidimensional (e.g., three or more dimensions), and traditional algorithms are not suitable for clustering such datasets.
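To make the 2D case concrete, the following is a minimal sketch of K-means (Lloyd's algorithm) applied to a small table of rows. The function kmeans, the seeding with the first k rows, the fixed iteration count, and the sample values are assumptions chosen for illustration, not a prescribed implementation.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on a list of equal-length numeric rows."""
    # Assumption: seed the centroids with the first k rows.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each row to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its member rows.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2D dataset: each row is one observation of two metrics.
rows = [
    [10.0, 20.0],
    [90.0, 85.0],
    [11.0, 21.0],
    [12.0, 19.0],
    [88.0, 87.0],
]

labels, centroids = kmeans(rows, k=2)
print(labels)
print(centroids)
```

Note that the sketch expects every observation to be a single flat vector of metric values. A dataset with an additional axis (for example, one such table per machine) does not arrive in that shape, which is the limitation described above.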