Choropleth maps are often used in Geographic Information Systems (GIS). Such a map represents a statistical variable whose value on each area of the map is shown by a color intensity. The statistical variable being displayed can be of different natures (population density, number of cars per inhabitant, rainfall, disease proportion).
In this context, the choice of the set of intensities is crucial to the representation. The two main algorithms of the background art used for this choice are Jenks/Fisher's natural breaks optimization (described in the articles “The Data Model Concept in Statistical Mapping”, by Jenks, 1967, and “On grouping for maximum homogeneity” by Fisher, 1958) and Head-tail breaks (described in the articles “A comparison study on natural and head/tail breaks involving digital elevation models” by Lin and Yue, 2013, and “Head/tail breaks for visualizing the fractal or scaling structure of geographic features” by Jiang and Bing, 2014, or “Ht-index for quantifying the fractal or scaling structure of geographic features”). These solutions are actually based on Cluster Analysis algorithms. Other examples of choropleth map systems that use cluster analysis include documents US20050278182A1, U.S. Pat. No. 8,412,419B1 and U.S. Pat. No. 8,510,080B2.
The design of a choropleth map relates in a way to quantization (i.e. replacement of an input value by the closest one in a predetermined set of values, according to a predetermined distance). Indeed, areas of the map are grouped together and the value of the statistical variable for such areas is replaced accordingly by a representative value for the group. This whole framework relates to the more general field of cluster analysis.
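The quantization step described above can be illustrated by a minimal sketch. It assumes the absolute difference as the predetermined distance; the function name `quantize` and the sample values are hypothetical and serve only as an illustration.

```python
def quantize(value, levels):
    """Return the element of `levels` closest to `value`.

    The distance used here is the absolute difference (an assumption;
    any other predetermined distance could be substituted).
    """
    return min(levels, key=lambda level: abs(level - value))


# Hypothetical predetermined set of representative values (one per color intensity).
levels = [10.0, 50.0, 200.0]

# Hypothetical values of the statistical variable, one per map area.
densities = [8.0, 12.5, 47.0, 180.0, 320.0]

# Each area's value is replaced by the closest representative value.
quantized = [quantize(d, levels) for d in densities]
print(quantized)
```

In this sketch the five areas end up sharing only three distinct values, so three color intensities suffice to draw the map.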
Cluster Analysis concerns the task of partitioning a set of objects into groups (called clusters), so that in each group the data are similar (see the article by Jain et al., “Data Clustering: A Review”). It appears as a central problem in Data Mining (see the article by Chen et al., “Data mining: an overview from a database perspective”), Machine Learning (see the book by Murphy, “Machine Learning, A Probabilistic Perspective”), and Large Scale Search (see the article by Goodrum, “Image Information Retrieval: An Overview of Current Research”). Cluster Analysis is an important tool for Quantization: once a center is assigned to each cluster, a simple quantization consists in replacing each point by the center of its cluster.
The K-means clustering problem is the best-known problem of Cluster Analysis. It was introduced by Stuart Lloyd in 1957 at Bell Laboratories, as a technique for Pulse-Code Modulation. The Lloyd algorithm takes as input a collection of p-dimensional points and outputs a partition of these points that aims to minimize the “total distortion”. This algorithm is only a heuristic (it does not guarantee the optimal clustering). In fact, an efficient exact algorithm cannot be expected in general, since the K-means clustering problem is NP-hard in dimensions greater than one. The Lloyd algorithm is still widely used today. Several variants have also been proposed (see J. A. Hartigan (1975), “Clustering algorithms”, John Wiley & Sons, Inc.).
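The heuristic nature of the Lloyd algorithm can be seen in a short sketch. The code below is a simplified illustration restricted to one-dimensional points (an assumption made for brevity; the algorithm itself applies to p-dimensional points), with the total distortion taken as the sum of squared distances to the cluster centers. The function names and the sample data are hypothetical.

```python
def total_distortion(points, centers, labels):
    """Sum of squared distances between each point and its cluster center."""
    return sum((p - centers[l]) ** 2 for p, l in zip(points, labels))


def lloyd(points, initial_centers, max_iter=100):
    """One-dimensional Lloyd iteration: alternate assignment and center update."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # the partition no longer changes
        labels = new_labels
        # Update step: each center moves to the mean of its cluster.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels


points = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]  # hypothetical data

# Two hypothetical initializations for K = 3 clusters.
poor_centers, poor_labels = lloyd(points, [0.0, 1.0, 15.0])
good_centers, good_labels = lloyd(points, [0.0, 10.0, 20.0])

poor = total_distortion(points, poor_centers, poor_labels)
good = total_distortion(points, good_centers, good_labels)
print(poor, good)
```

With these sample values, the first initialization converges to a partition whose total distortion is much larger than that reached from the second one, which shows that the result depends on the starting centers.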
The one-dimensional case is particularly important. One of the most famous algorithms for this problem is the above-cited Jenks natural breaks optimization, developed in 1967 (see the article by Jenks, “The Data Model Concept in Statistical Mapping”, in International Yearbook of Cartography) and introduced for cartographic purposes, as mentioned above. Like the Lloyd algorithm, it is only a heuristic. In 2011 an exact algorithm, called CKmeans, was developed by Wang and Song (see the article by Wang and Song, “Optimal k-means Clustering in One Dimension by Dynamic Programming”). This algorithm is the cornerstone of document U.S. Pat. No. 1,543,036A. It runs in time O(K*n^2), where K is the requested number of clusters and n is the number of real numbers to cluster. Even more recently (in 2013), Maarten Hilferink developed a more efficient algorithm and provided an implementation of it. This implementation was actually dedicated to cartography, more precisely to choropleth maps; however, the only documentation of this algorithm is a wiki page (Fisher's Natural Breaks Classification, accessible at the following URL at the priority date: wiki.objectvision.nl/index.php/Fisher%27s_Natural_Breaks_Classification).
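To make the O(K*n^2) running time concrete, the following sketch shows a plain dynamic program for exact one-dimensional K-means. It is written in the spirit of the dynamic programming approach mentioned above and is not the CKmeans implementation itself; the function name `optimal_1d_kmeans` and the sample data are hypothetical. It relies on the fact that, once the values are sorted, each cluster of an optimal partition is a contiguous run of values.

```python
def optimal_1d_kmeans(values, k):
    """Return (minimal total distortion, clusters) for 1D K-means."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")

    # Prefix sums of values and squared values give each run's cost in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i, j):
        """Sum of squared deviations from the mean for xs[i:j] (j > i)."""
        s = s1[j] - s1[i]
        return (s2[j] - s2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[c][j]: minimal distortion when xs[:j] is split into c clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    # start[c][j]: index where the last cluster begins in that optimum.
    start = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):          # K values of c
        for j in range(c, n + 1):      # up to n prefixes
            for i in range(c - 1, j):  # up to n start positions
                candidate = best[c - 1][i] + cost(i, j)
                if candidate < best[c][j]:
                    best[c][j] = candidate
                    start[c][j] = i

    # Walk back through the stored start indices to recover the clusters.
    clusters = []
    j = n
    for c in range(k, 0, -1):
        i = start[c][j]
        clusters.append(xs[i:j])
        j = i
    clusters.reverse()
    return best[k][n], clusters


distortion, clusters = optimal_1d_kmeans([21.0, 0.0, 11.0, 1.0, 20.0, 10.0], 3)
print(distortion, clusters)
```

The three nested loops perform on the order of K*n^2 constant-time cost evaluations, which is where the quadratic dependence on n comes from.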
All these existing methods are, however, limited: either they do not produce the optimal K-means clustering, or they are too slow. Within this context, there is still a need for an improved solution to design a choropleth map.