Large business enterprises, including for example health insurance companies or healthcare providers, work with vast amounts of data that are continuously being generated and operated upon. Tools for profiling such vast amounts of data exist, but existing profiling tools lack the capability to efficiently find quantiles, such as an exact median.
Median and other quantiles—such as quartile, percentile and kth smallest value—are common metrics in characterizing data sets and have broad applications in scientific computing, database, data mining, data analysis and machine learning. Median and other quantiles like quartile and percentile are known to be more difficult to compute than several other metrics such as sum, count, mean, etc., and it becomes even more difficult to compute such quantiles in a distributed system where a massive amount of data is stored across multiple machines (sometimes hundreds or thousands of machines).
A median can be characterized as the kth smallest (or largest) element in a set of n elements where k is the [(n+1)/2]th smallest element (if n is odd) or k is the average between the [n/2]th smallest element and the [(n/2)+1]th smallest element (if n is even). The problem of computing a kth smallest (or largest) element of a list is known as a selection problem, and is solved by a selection algorithm. Determining a median is often considered the hardest case for a selection algorithm, and such selection algorithms have been extensively studied over the years, both in academia and in industry. Different selection algorithms have been developed, ranging from simple to complex, to account for different computing properties and constraints such as computation time, complexity, space and memory usage, accuracy, the number of iterations needed, the type of data (e.g., static data or streaming data), etc.
Existing methods of determining median and other quantiles in a distributed dataset can be divided into two main types of methods: 1) exact methods, and 2) approximation methods. The approximation methods are typically used because the exact methods are often resource intensive and are not feasible with given memory or time constraints for very large datasets. The exact methods often include the iterative or recursive steps that involve reading source data multiple times, which is a very expensive operation in conventional systems. The trade-off between the accuracy and runtime performance is a key factor in choosing which method to use.
These existing methods are often rooted in algorithms that were originally intended to be performed on a single computer. For example, a simple and precise algorithm is to collect data into one machine, globally sort the data then select the median and other quantiles. However, sorting is a very expensive operation and collecting data into one machine is often not feasible for a large data set. And although enhancements have been made to improve performance by using various partial sorting strategies and partitioning strategies to select the kth smallest elements instead of sorting a whole list, such algorithms still need memory that is proportional to the amount of input data, and are not suitable for handling large data sets in general.