Clustering a set of points into a few groups is frequently used for analysis and classification in numerous applications, including facility location (D. Shmoys, E. Tardos, and K. Aardal, “Approximation algorithms for facility location problems,” Proc. 29th Annu. ACM Sympos. Theory Comput., pages 265-274, 1997), information retrieval (M. Charikar, C. Chekuri, T. Feder, and R. Motwani, “Incremental clustering and dynamic information retrieval,” Proc. 29th Annu. ACM Sympos. Theory Comput., pages 626-635, 1997), data mining (J. Shafer, R. Agrawal, and M. Mehta, “Sprint: A scalable parallel classifier for data mining,” Proceedings of the International Conference on Very Large Databases, pages 544-555. Morgan Kaufmann, 1996), and image processing (P. Schroeter and J. Bigün, “Hierarchical image segmentation by multi-dimensional clustering and orientation adaptive boundary refinement,” Pattern Recogn., 28(5):695-709, 1995). Because of this diversity of applications, several variants of the clustering problem have been proposed and widely studied.
In general, a clustering problem requires partitioning all data points into a set of clusters so as to optimize a given objective function. The points usually lie in a metric space equipped with an Lp distance (usually L1, L2, or L∞), and some clustering measure (such as the maximum cluster radius, or the sum of distances from points to cluster centers) is provided. The objective is then either to minimize the clustering measure, given the total number of clusters, or to minimize the number of clusters, given a bound on the clustering measure. This invention is directed to a different approach, arising from the need to model several practical applications more flexibly.
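The two clustering measures mentioned above can be illustrated with a minimal Python sketch. The function names (`lp_distance`, `max_cluster_radius`, `sum_of_distances`), the tuple representation of points, and the assignment list are assumptions made for illustration; they do not appear in the cited works.

```python
def lp_distance(a, b, p=2):
    """Lp distance between two points given as equal-length tuples.

    p may be any value >= 1, or float("inf") for the L-infinity distance.
    """
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def max_cluster_radius(points, centers, assignment, p=2):
    """Largest distance from any point to the center it is assigned to."""
    return max(lp_distance(pt, centers[a], p) for pt, a in zip(points, assignment))


def sum_of_distances(points, centers, assignment, p=2):
    """Total distance from every point to the center it is assigned to."""
    return sum(lp_distance(pt, centers[a], p) for pt, a in zip(points, assignment))


# Hypothetical data: three points, two centers, assignment[i] = center index of point i.
points = [(0, 0), (3, 4), (10, 0)]
centers = [(0, 0), (10, 0)]
assignment = [0, 0, 1]

radius_l2 = max_cluster_radius(points, centers, assignment, p=2)
radius_l1 = max_cluster_radius(points, centers, assignment, p=1)
radius_linf = max_cluster_radius(points, centers, assignment, p=float("inf"))
total_l2 = sum_of_distances(points, centers, assignment, p=2)
```

The same assignment yields a different radius under each distance, which is why the choice of Lp is stated as part of the problem.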
Recent advances in wireless technology have opened the door to alternative access to customer locations, other than the traditional land lines, leading to the emergence of new companies and new services in the telecommunications industry. A natural question that arises is how to choose the locations for the base stations so as to optimize the coverage while minimizing capital investment costs. Because of technological requirements, each base station covers a fixed-radius circle around it. Hence, one way to formulate the problem is to compute the minimum number of clusters of a given radius that cover all the customer location points. However, financial considerations generally impose additional restrictions, such as limiting the number of clusters, or requiring a cluster to achieve some minimum customer coverage in order to be financially viable. Moreover, it is more important to reach some customers than others, based on their monthly spending or contract stipulations. Hence, we must allow for the possibility of outliers (i.e., locations that remain unclustered), and adjust the objective function to take these additional issues into account. The problem is a general facility location problem and can be of interest in other domains and applications.
One approach to this problem appeared in A. Meyerson, “Profit-earning facility location,” Proc. 33rd Annu. ACM Sympos. on Theory Comput., pages 30-36, 2001, which defined a problem related to problem (PC2), as described later in this specification. The algorithms proposed there return solutions that violate both the radius and the minimum profit constraints, making them difficult to compare with our results.
A widely studied class of problems, often referred to by the generic term facility location, defines the objective function to be a linear combination of the cost to set up facilities and the cost to connect customer locations to open facilities. In this case, however, the connection cost per facility is proportional to the sum of distances from the facility to the customers it serves. While this is a good model for applications in which a cost-per-mile is paid for each customer, it is clearly not appropriate for the problem considered above, in which each facility covers a fixed radius. Moreover, the algorithms developed for this class of problems attempt to connect all the customer locations, not allowing for the possibility of outliers. These issues are discussed in M. Charikar and S. Guha, “Improved combinatorial algorithms for facility location and k-median problems,” Proc. 40th Annu. IEEE Sympos. Found. Comput. Sci., pages 378-388, 1999.
Another related approach is the so-called prize-collecting Steiner tree problem, as discussed in D. S. Johnson, M. Minkoff, and S. Phillips, “The prize collecting Steiner tree problem: theory and practice,” Proc. 11th Annu. ACM-SIAM Symp. on Discrete Algorithms, pages 760-769, 2000. In this approach, each point is associated with a prize, and the goal is to compute a subtree minimizing the sum of the total cost of subtree edges plus the total prize of vertices not contained in the subtree. This problem clearly allows for outliers, and the decision on which points to leave uncovered is based on a notion of how important each point is (i.e., how large its prize is). Yet again, however, the cost function depends on the sum of distances rather than on the maximum radius.
Finally, center clustering problems in the presence of outliers have been considered in M. Charikar, S. Khuller, D. M. Mount, and G. Narasimhan, “Algorithms for facility location problems with outliers,” Proc. 12th Annu. ACM-SIAM Sympos. Discrete Algorithms, pages 642-651, 2001. There, a maximum number of outliers is provided as a parameter. Because of the way the objective function is defined, the optimal solution always leaves unclustered the maximum number of points allowed. Hence, the solution is highly sensitive to the user's estimate of the number of outliers. Moreover, the objective is to minimize the cluster radius, given a fixed number of clusters, which does not correspond to our restriction that the cluster radius be fixed. Database research has also considered the problem of clustering with automatic detection of outliers (e.g., DBSCAN; see M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” Proc. 2nd Intl. Conf. on Knowledge Discovery and Data Mining, pages 226-231, 1996). However, the clusters found there can be arbitrarily shaped, which again does not satisfy our fixed-radius requirement.
Let $P=\{p_1, \ldots, p_n\}$ be a set of points, so that each point is associated with a potential profit $w(p_i) \ge 0$, and let $r>0$ denote the cluster radius. The system and method of the present invention is directed to two problems, (PC1) and (PC2): (PC1) Given $k>0$, compute $k$ clusters of radius $r$ so that the clustering profit, defined as $\sum_{p_i \text{ clustered}} w(p_i)$, is maximized; and (PC2) Given a minimum profit $W>0$, compute the set of clusters of radius $r$ that maximizes the clustering profit, under the restriction that each cluster $C$ of the solution satisfies the minimum profit requirement $\sum_{p_i \text{ assigned to } C} w(p_i) \ge W$.
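To make the two objectives concrete, the following Python sketch evaluates a candidate set of cluster centers; it does not solve (PC1) or (PC2). It assumes points in the plane under the Euclidean (L2) distance and assigns each point to its nearest center within radius r, which is only one possible assignment rule. All function names and the sample data are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign(points, centers, r):
    """For each point, return the index of its nearest center within r, or None (outlier)."""
    assignment = []
    for p in points:
        best, best_d = None, None
        for j, c in enumerate(centers):
            d = dist(p, c)
            if d <= r and (best_d is None or d < best_d):
                best, best_d = j, d
        assignment.append(best)
    return assignment


def clustering_profit(weights, assignment):
    """Sum of w(p_i) over all clustered (non-outlier) points."""
    return sum(w for w, a in zip(weights, assignment) if a is not None)


def cluster_profits(weights, assignment, num_centers):
    """Per-cluster profit: sum of w(p_i) over the points assigned to each cluster."""
    totals = [0] * num_centers
    for w, a in zip(weights, assignment):
        if a is not None:
            totals[a] += w
    return totals


def satisfies_min_profit(weights, assignment, num_centers, W):
    """(PC2) feasibility check: every cluster must earn at least W."""
    return all(t >= W for t in cluster_profits(weights, assignment, num_centers))


# Hypothetical instance: four customers with profits, two candidate centers, r = 1.
points = [(0, 0), (1, 0), (5, 0), (10, 10)]
weights = [3, 2, 4, 1]
centers = [(0.5, 0), (5, 0)]
r = 1

assignment = assign(points, centers, r)
profit = clustering_profit(weights, assignment)
per_cluster = cluster_profits(weights, assignment, len(centers))
```

In this instance the fourth point lies outside both circles and remains an outlier, so its profit is not collected. Under (PC1) with k = 2 the candidate earns the total clustered profit; under (PC2) the same candidate is feasible only if each per-cluster total reaches W.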