Percentiles and percentile ranks are relative measures often used to indicate how a specific data element within a data set relates to a larger group within the data set or to the entire data set. For instance, percentiles and percentile ranks may be used to evaluate academic performance. Many nationally standardized test scores are reported as percentile ranks, or in related coarser groupings such as deciles or quartiles. A student may determine how they rank among peers based on the percentile rank of their test score, such as whether the score falls within the top 5 percent, the top 50 percent, and so forth. In general, a percentile typically refers to a value, and a percentile rank typically refers to a percentage. For instance, a percentile may refer to a particular test score or value (e.g., a score of 95), while a percentile rank may indicate where a particular score or value falls within a broader distribution (e.g., within the top 5 percent).
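The distinction between a percentile (a value) and a percentile rank (a percentage) may be illustrated with a minimal sketch. The sketch assumes the nearest-rank definition of a percentile and defines a percentile rank as the percentage of values at or below a given score; other definitions are in common use and would give slightly different results. The function names and the sample scores are hypothetical and are chosen only for illustration.

```python
import math


def percentile(values, p):
    """Return the p-th percentile as a VALUE from the data set.

    Assumes the nearest-rank definition: the smallest value such that
    at least p percent of the data is at or below it.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


def percentile_rank(values, score):
    """Return the percentile rank of score as a PERCENTAGE.

    Assumes the rank is the share of values less than or equal to score.
    """
    if not values:
        raise ValueError("values must be non-empty")
    at_or_below = sum(1 for v in values if v <= score)
    return 100.0 * at_or_below / len(values)


# Hypothetical test scores for ten students.
scores = [55, 62, 70, 71, 78, 80, 84, 88, 91, 95]

median_score = percentile(scores, 50)    # a score taken from the data set
rank_of_88 = percentile_rank(scores, 88) # a percentage between 0 and 100
```

Under these assumptions, `percentile` answers "which score sits at a given percentage of the distribution," while `percentile_rank` answers the inverse question, "what percentage of the distribution sits at or below a given score."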
Finding statistical information such as percentiles or percentile ranks becomes more difficult as the size of the data set increases. In some cases, data sets for commercial applications may be on the order of terabytes or larger. To process such massive data sets efficiently, a single data set is typically distributed across multiple processing nodes communicating over a network. Each of the processing nodes may then process its subset of the data in parallel with the others. This distributed processing approach provides benefits such as reduced processing times and processing loads, at the cost of increased coordination between the distributed processors and the network resources used for such coordination. Such costs may increase when attempting to find percentiles or percentile ranks across a distributed data set. For instance, conventional solutions attempt to move subsets of data from remote processing nodes across a network to a central processing node, which performs sorting and ranking operations in order to locate a specific percentile within the overall data set. This may take a relatively long period of time and consume significant amounts of computing and communications resources, which may be unacceptable for some applications. It is with respect to these and other considerations that the present improvements are needed.
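The cost of the conventional approach described above may be illustrated with a minimal single-process sketch. It is an assumption-laden simulation rather than an actual distributed implementation: each remote processing node is modeled as an in-memory list, the network transfer is modeled as copying records into a central list, and the nearest-rank definition of a percentile is assumed. The function name `centralized_percentile` and the sample data are hypothetical.

```python
import math


def centralized_percentile(node_subsets, p):
    """Simulate the conventional approach: gather everything, then sort.

    node_subsets is a list of lists, one per (simulated) processing node.
    Returns the p-th percentile (nearest-rank) and the number of records
    that had to be moved to the central node.
    """
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")

    gathered = []
    records_moved = 0
    for subset in node_subsets:
        # Stand-in for a network transfer from a remote node.
        gathered.extend(subset)
        records_moved += len(subset)

    if not gathered:
        raise ValueError("no data to process")

    # All sorting and ranking work lands on the single central node.
    gathered.sort()
    rank = math.ceil(p * len(gathered) / 100)  # 1-based nearest rank
    return gathered[rank - 1], records_moved


# Hypothetical data set spread across three simulated nodes.
nodes = [[88, 55, 91], [70, 95, 62, 80], [71, 84, 78]]
value, moved = centralized_percentile(nodes, 90)
```

In this sketch every record is moved and sorted centrally regardless of which percentile is requested, so both the transfer volume and the central sorting work grow with the size of the entire data set.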