Cluster analysis, or clustering, is the task of grouping a set of data points in such a way that data points in the same cluster (i.e. the same group of data points) are more similar, in some sense, to each other than to those in other clusters. Cluster analysis is frequently employed in exploratory data mining and statistical data analysis, among other applications, and is useful in many fields.
In general, a cluster analysis (e.g. K-means, CLARANS, or the like) seeks to collect data points in a data set into groups of similar points. Depending on the clustering algorithm used, a group can be defined by its center, for example a centroid or a medoid for a data set of n dimensions. In DBSCAN, it is assumed that all core points in a cluster are connected, so any point in a cluster can serve as a representative of its cluster.
Two examples of standalone clustering algorithms are CLARANS (Clustering Large Applications based on Randomized Search) and DBSCAN (Density-Based Spatial Clustering of Applications with Noise). CLARANS is a long-running clustering algorithm that randomly searches for the centers of the clusters (e.g. medoids or centroids) and, ideally, converges to a set of centers, at which point the algorithm terminates and produces the clusters. DBSCAN is a density-based clustering algorithm built on the idea that a cluster should grow in any direction as long as the density of the elements remains above a certain threshold.
DBSCAN is especially useful in detecting arbitrarily shaped clusters. The algorithm requires two parameters: a minimum number of points (MinPts) and a neighborhood radius epsilon (Eps). A key idea of DBSCAN is that the neighborhood of a point determined by Eps should contain a number of data points equal to or greater than MinPts so that the point can populate or extend a cluster. Thus, in DBSCAN, the points are grouped into three types: core points, border points, and noise points, which can be defined as follows with reference to the diagram 100 of FIG. 1. The Eps-neighborhood of a data point p in a data set D, denoted by NEps(p), can be defined by:
NEps(p)={q∈D|dist(p,q)≤Eps}  (1)
Based on this definition, a data point p is a core point when the following inequality holds:
|NEps(p)|≥MinPts  (2)
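By way of a non-limiting illustration, equations 1 and 2 can be sketched in a few lines of Python. The function names, the use of Euclidean distance for dist(p,q), and the sample coordinates are assumptions made for this example only.

```python
from math import dist  # Euclidean distance (Python 3.8+); an assumed choice of dist(p, q)

def eps_neighborhood(points, p, eps):
    """Indices q with dist(points[p], points[q]) <= eps, as in equation 1.

    The point p itself is always included, since dist(p, p) = 0.
    """
    return [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]

def is_core(points, p, eps, min_pts):
    """True when |NEps(p)| >= MinPts, as in equation 2."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Hypothetical sample data: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]

print(eps_neighborhood(points, 0, eps=1.0))    # [0, 1, 2]
print(is_core(points, 0, eps=1.0, min_pts=3))  # True
print(is_core(points, 3, eps=1.0, min_pts=3))  # False
```

Note that this brute-force neighborhood query compares the point against every other point in the data set.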
In FIG. 1, m, p, o, and r are core points. A border point does not itself satisfy equation 2 but lies in the Eps-neighborhood of a core point. A data point p is directly density reachable from a point q with respect to Eps and MinPts if the point q qualifies as a core point per equation 2 and if the following relationship is also true:
p∈NEps(q)  (3)
In other words, the points in the Eps-neighborhood of a core point are directly density reachable from that core point. This relation is symmetric for two core points but not symmetric for a core point and a border point. In FIG. 1, q is directly density reachable from m.
A point p is density reachable from a point q with respect to Eps and MinPts if there is a chain of points (p1, . . . , pn), p1=q, pn=p such that pi+1 is directly density reachable from pi. This relation is transitive and, like direct density reachability, is symmetric for two core points but not symmetric for a core point and a border point. In FIG. 1, q is density reachable from p, but the inverse is not true because q is not a core object.
A point p is density connected to a point q with respect to Eps and MinPts if there is a point o such that both p and q are density reachable from o with respect to Eps and MinPts. This relation is symmetric. Points s, o, and r are density connected in FIG. 1.
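The reachability relations described above can be illustrated with a short sketch. It assumes Euclidean distance, hypothetical function names, and a breadth-first search in which a chain may only continue through core points; it is not drawn from any particular DBSCAN implementation.

```python
from collections import deque
from math import dist

def neighbors(points, i, eps):
    """Indices in the Eps-neighborhood of point i (includes i itself)."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

def density_reachable_from(points, q, eps, min_pts):
    """Set of indices density reachable from q.

    Returns an empty set when q is not a core point, because no chain of
    directly density reachable steps can start at a non-core point.
    """
    if len(neighbors(points, q, eps)) < min_pts:
        return set()
    reached = {q}
    queue = deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # border point: reachable, but the chain stops here
        for n in nbrs:
            if n not in reached:
                reached.add(n)
                queue.append(n)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some point o exists from which both p and q are density reachable."""
    return any(
        {p, q} <= density_reachable_from(points, o, eps, min_pts)
        for o in range(len(points))
    )

# Hypothetical data: four points on a line; with Eps=1 and MinPts=3 the two
# inner points (1 and 2) are core points and the two end points are border points.
line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]

print(density_reachable_from(line, 1, 1.0, 3))  # {0, 1, 2, 3}
print(density_reachable_from(line, 0, 1.0, 3))  # set(): point 0 is not a core point
print(density_connected(line, 0, 3, 1.0, 3))    # True
```

The example shows the asymmetry noted above: border point 3 is density reachable from core point 1, but nothing is density reachable from border point 0, while the two border points are nevertheless density connected through a core point.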
A cluster C with respect to Eps and MinPts in a data set D is a non-empty subset of D satisfying both maximality and connectivity. For maximality, the following relationship is generally satisfied: ∀p,q: if p∈C and q is density reachable from p with respect to Eps and MinPts, then q∈C. For connectivity: ∀p,q∈C: p is density connected to q with respect to Eps and MinPts. According to this definition, a cluster is a set of density connected points which is maximal with respect to density reachability.
Noise is defined in DBSCAN as the set of points in a data set D not belonging to any cluster, where C1, C2, . . . , Ck are the clusters of the data set D with respect to parameters Epsi and MinPtsi. In other words, all points that do not belong to a cluster are noise points, and this noise can be expressed as follows:
Noise={p∈D|∀i: p∉Ci}  (4)
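The cluster and noise definitions above can be tied together in a minimal, brute-force DBSCAN sketch. The label values, function name, and sample data are assumptions for illustration, and the quadratic neighborhood search is chosen for brevity rather than efficiency.

```python
from math import dist

NOISE = -1  # assumed label for points that belong to no cluster, per equation 4

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    labels = [None] * n  # None means "not yet visited"

    def region(i):
        # Eps-neighborhood of point i by brute force
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = region(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # provisional; may later become a border point
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster  # former noise is a border point of this cluster
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster
            j_nbrs = region(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is a core point, so the cluster keeps growing
        cluster += 1
    return labels

# Hypothetical data: two well-separated squares and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
print(dbscan(data, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each cluster is grown from a core point until no further density reachable points remain, which reflects the maximality condition, and any point left with the NOISE label belongs to no cluster.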
Advantages of DBSCAN can include the ability to detect arbitrarily shaped clusters, requiring little information about data, handling noise explicitly without requiring any other mechanism, and not requiring a hierarchical structure on data.
Using conventional index structures, the complexity of DBSCAN is generally on the order of the square of the number of values (i.e. O(n^2)). Variations of DBSCAN in currently applied approaches can reduce the complexity of DBSCAN to O(n*log(n)) by using hierarchical structures such as an R-tree or a B-tree. However, these structures are not typically used in column stores because they grow very fast with increasing size of the data set and do not easily support parallel access. Currently available solutions do not address application and optimization of DBSCAN or any other clustering algorithm in a column store environment.
While DBSCAN can typically give comparatively better results than CLARANS, it can also require long running times on large data sets. DBSCAN is not parallelizable (e.g. across multiple parallel computing nodes) in its original definition, and it is not readily parallelizable without some sort of preprocessing. For example, simply splitting a data set manually into some number of partitions and applying DBSCAN to each partition can yield an undesirable result, because merging the resultant partial clusters (which could be produced by DBSCAN as a result of bad partitioning) can be difficult or in some cases impossible.
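The effect of naive partitioning can be seen in a small sketch. It relies on the observation that the number of DBSCAN clusters equals the number of connected groups of core points, where two core points are linked when they lie within Eps of each other. The function name and the sample data are assumptions for illustration.

```python
from math import dist

def count_clusters(points, eps, min_pts):
    """Count DBSCAN clusters as connected components of the core points."""
    n = len(points)
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    seen = set()
    clusters = 0
    for c in sorted(core):
        if c in seen:
            continue
        clusters += 1  # a new component of mutually reachable core points
        seen.add(c)
        stack = [c]
        while stack:
            cur = stack.pop()
            for j in nbrs[cur]:
                if j in core and j not in seen:
                    seen.add(j)
                    stack.append(j)
    return clusters

# Hypothetical data: eight evenly spaced points forming one elongated cluster.
line = [(float(x), 0.0) for x in range(8)]

whole = count_clusters(line, eps=1.0, min_pts=3)
left, right = line[:4], line[4:]  # naive manual split into two partitions
split = (count_clusters(left, eps=1.0, min_pts=3)
         + count_clusters(right, eps=1.0, min_pts=3))

print(whole)  # 1
print(split)  # 2
```

In this example the points adjacent to the cut lose neighbors that lie in the other partition, so they stop being core points within their own partition and the single cluster is reported as two partial clusters that would have to be merged afterwards.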
Approaches are currently available for making DBSCAN parallelizable. For example, an approach known as Enhanced DBSCAN (E-DBSCAN) combines CLARANS and DBSCAN. CLARANS is partitional. In general, partitional clustering algorithms group the points into different sets and then, in every following iteration, optimize the previous sets. In the end, the algorithm converges, albeit often after a large number of iterations, when further iterations no longer result in changes in the result. In E-DBSCAN, a few initial iterations of CLARANS are applied to the data to yield at least a semi-optimal partitioning without creating an exact CLARANS result. DBSCAN is then applied to the partitions given by CLARANS. This approach can improve performance by enabling parallel processing, but for many data sets the results can be less than optimal while still largely acceptable. Partitioning of the data might prevent DBSCAN from calculating the neighborhood of a point properly, which can cause the DBSCAN part of the analysis to produce a semi-optimal result as well. The results of the parallel processed DBSCAN analyses are checked to identify clusters which are split because of the parallelization. The goal is generally to merge such clusters in a manner that yields a result closely resembling what would be produced by a non-partitioned DBSCAN. The E-DBSCAN approach generally uses an “interconnectivity” property, which can be easily calculated and which does not require checking each and every point within each cluster. An E-DBSCAN process ends after the merge operation completes.
In E-DBSCAN, two clusters a and b can be merged if their relative inter-connectivity meets or exceeds a merging threshold αmerge. The relative inter-connectivity is found by dividing the number of edges that connect the two clusters, Nab, by the average of the numbers of edges that connect the points within each of these clusters, Na and Nb respectively, which can be expressed as follows:
Nab / ((Na + Nb) / 2) ≥ αmerge  (5)
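The merge test of equation 5 can be written as a small sketch. The function names, the handling of clusters with no internal edges, and the edge counts in the example are assumptions for illustration.

```python
def relative_interconnectivity(n_ab, n_a, n_b):
    """Nab divided by the average of Na and Nb, as in equation 5."""
    average_internal = (n_a + n_b) / 2
    if average_internal == 0:
        # Assumed convention: with no internal edges the ratio is treated as 0.
        return 0.0
    return n_ab / average_internal

def should_merge(n_ab, n_a, n_b, alpha_merge):
    """True when the relative inter-connectivity reaches the merging threshold."""
    return relative_interconnectivity(n_ab, n_a, n_b) >= alpha_merge

# Hypothetical edge counts: 6 edges between the clusters, 10 and 14 within them.
print(relative_interconnectivity(6, 10, 14))  # 0.5
print(should_merge(6, 10, 14, alpha_merge=0.4))  # True
print(should_merge(6, 10, 14, alpha_merge=0.6))  # False
```

A higher threshold makes merging more conservative, since more connecting edges are then needed relative to the edges inside the two clusters.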
An E-DBSCAN process using this criterion can be time-consuming if all points in a cluster are used to calculate the relative inter-connectivity. To reduce the overhead, only the border points, which are already extracted by DBSCAN, are used in the calculations, relying on the assumption that there is an edge between two border points if their distance is less than Eps, as illustrated in the diagram 200 of FIG. 2.
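The border-point edge counting described above can be sketched as follows. The function names and coordinates are hypothetical, Euclidean distance is assumed, and counting the edges within a cluster over its border points only is one possible reading of the approach.

```python
from itertools import combinations
from math import dist

def cross_edges(border_a, border_b, eps):
    """Number of border-point pairs, one from each cluster, closer than eps."""
    return sum(1 for p in border_a for q in border_b if dist(p, q) < eps)

def internal_edges(border, eps):
    """Number of border-point pairs within one cluster closer than eps."""
    return sum(1 for p, q in combinations(border, 2) if dist(p, q) < eps)

# Hypothetical border points of two partial clusters lying side by side.
border_a = [(0.0, 0.0), (0.0, 1.0)]
border_b = [(0.5, 0.0), (0.5, 1.0)]
eps = 1.2

print(cross_edges(border_a, border_b, eps))  # 4
print(internal_edges(border_a, eps))         # 1
print(internal_edges(border_b, eps))         # 1
```

Because only border points are compared, the cost of the calculation depends on the number of border points rather than on the full size of each cluster.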
Since DBSCAN is applied to separate partitions, some border points might be labeled as noise by mistake. To correct their labels, all noise points can be checked in this step for a core point in their Eps-neighborhood. If a core point is found, the noise point is assigned to the cluster of that core point. Despite its efficiency, there are several problems with this approach. Generally, determining an appropriate number of partitions k requires knowledge about the distribution of the data, which might not be available in large databases. Furthermore, CLARANS initializes the centroids randomly, and an inappropriate choice in the beginning might increase the run time of the algorithm substantially. Additionally, the candidates to replace a center are also chosen randomly. In other words, all points, including the ones that are far away from the centers, have the same chance of being chosen, which causes computational overhead.
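The noise-correction step described at the start of the preceding paragraph can be illustrated with a short sketch. The label convention, the function name, and the sample data are assumptions for this example, and the core-point flags are taken as already available from the per-partition DBSCAN runs.

```python
from math import dist

NOISE = -1  # assumed label for noise points

def relabel_noise(points, labels, is_core, eps):
    """Assign each noise point to the cluster of a core point within eps, if any.

    points  : coordinates of all points across all partitions
    labels  : cluster id per point, or NOISE
    is_core : per-point flag marking the core points found by DBSCAN
    """
    corrected = list(labels)
    for i, label in enumerate(labels):
        if label != NOISE:
            continue
        for j in range(len(points)):
            if is_core[j] and dist(points[i], points[j]) <= eps:
                corrected[i] = labels[j]  # adopt the cluster of the first core point found
                break
    return corrected

# Hypothetical result after partitioned DBSCAN: point 2 was cut off from
# core point 1 by the partitioning, and point 3 is genuinely isolated.
points = [(0.0, 0.0), (1.0, 0.0), (1.5, 0.0), (9.0, 9.0)]
labels = [0, 0, NOISE, NOISE]
is_core = [False, True, False, False]

print(relabel_noise(points, labels, is_core, eps=1.0))  # [0, 0, 0, -1]
```

In this sketch a mislabeled point adopts the cluster of the first core point found in its Eps-neighborhood, while points with no core point nearby remain noise.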