The pervasiveness and quantity of electronic data available today in all areas of human endeavor call for new approaches to extracting timely insights and actionable information from the very large data sets encountered in practice. In addition to sheer data volume, research analysts face methodological challenges when encountering poorly described or irregular data, such as continuous data with a non-normal distribution.
Computation of order statistics and statistical data distributions, along with other field summaries, is an important part of robust assessment of data properties, as well as of data preparation for further analyses. These summaries support data preparation and diagnostic features, such as outlier detection, histograms, and box plots, that are based on order statistics and the statistical data distribution. Moreover, non-normal data usually require transformation to normality for exploratory analysis and in preparation for modeling.
The cost of computing order statistics, statistical distributions, and straightening transformations is prohibitive for large and distributed data sets using available computation techniques. Such computation requires either storing impractically large amounts of data in the main computer memory or making multiple data passes. Neither approach is efficient for processing large distributed data sets. This is in contrast to available computation techniques for simple summaries, such as means or standard deviations, which are computed in a single data pass with modest memory storage requirements.
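The contrast between simple summaries and order statistics can be illustrated with a minimal sketch. The function names below are hypothetical, and Welford's update is assumed here as one common single-pass technique for the mean and standard deviation; it is not necessarily the technique the available approaches use. The exact median, by comparison, is computed here by sorting, which holds every value in memory.

```python
import math


def streaming_mean_std(values):
    """One pass, constant memory: mean and sample standard deviation.

    Uses Welford's update (an assumption for illustration).
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    std = math.sqrt(m2 / (count - 1)) if count > 1 else 0.0
    return count, mean, std


def exact_median(values):
    """Exact median by sorting: every value must be held in memory."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2:
        return data[mid]
    return 0.5 * (data[mid - 1] + data[mid])


if __name__ == "__main__":
    sample = [2, 4, 4, 4, 5, 5, 7, 9]
    print(streaming_mean_std(iter(sample)))  # works on a one-shot iterator
    print(exact_median(sample))              # needs the full list
```

The first function accepts a one-shot iterator and keeps only three numbers of state, whereas the second cannot produce its answer until the whole data set has been materialized and ordered.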
Some available computation techniques make the data ready for model building without the need for prior knowledge of the statistical concepts involved. Such techniques do not support computation on distributed data sources, and they are inefficient for very large data sets because they require multiple data passes to accomplish several data transformation steps sequentially.
Some conventional approaches focus on computing quantiles with precision in a specified quantile range. Quantiles may be described as data values taken at regular intervals from a cumulative distribution function of a random variable. Dividing ordered data into q essentially equal-sized data subsets is the motivation for q-quantiles; the quantiles are the data values marking the boundaries between consecutive subsets. Put another way, the k-th q-quantile marks the boundary at the k/q fraction of the ranked data values, and there are q−1 of the q-quantiles, one for each integer k satisfying 0<k<q. Here, a more general φ-quantile specification, where φ is a real number with 0≤φ≤1, is used, and the φ-quantile marks the boundary at the φ fraction of the ranked data values. When queried for a φ-quantile whose precise value is x, these conventional approaches return an element y that is guaranteed to be in the [φ−ε, φ+ε] quantile range. On the other hand, there are no guarantees for the precision of y in terms of x itself. As a result, there can be uncontrolled errors in the location of the computed approximate order statistics, thus invalidating location-based statistical analysis. Moreover, the important information on the tails of the statistical distribution and their possible anomalies may be lost.
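The distinction between a guarantee on the quantile range and a guarantee on the value itself can be made concrete with a small sketch. The function names and the sample data are hypothetical, and the nearest-rank convention (rank = ceil(φ·n)) is an assumption chosen for simplicity; other quantile conventions exist. The values of φ and ε are chosen to be exactly representable in binary floating point so that the rank arithmetic is exact.

```python
import math


def phi_quantile(sorted_data, phi):
    """Element at the phi fraction of the ranked values (nearest-rank, assumed)."""
    n = len(sorted_data)
    rank = max(1, math.ceil(phi * n))  # 1-based rank
    return sorted_data[rank - 1]


def q_quantiles(sorted_data, q):
    """The q-1 boundaries at fractions k/q for each integer k with 0 < k < q."""
    n = len(sorted_data)
    result = []
    for k in range(1, q):
        rank = max(1, -(-k * n // q))  # integer ceil(k * n / q)
        result.append(sorted_data[rank - 1])
    return result


def worst_case_value_error(sorted_data, phi, eps):
    """Largest |y - x| when y may be any element in the [phi-eps, phi+eps] range."""
    x = phi_quantile(sorted_data, phi)
    low = phi_quantile(sorted_data, max(0.0, phi - eps))
    high = phi_quantile(sorted_data, min(1.0, phi + eps))
    return max(abs(x - low), abs(high - x))


if __name__ == "__main__":
    # Hypothetical heavy-tailed sample, already sorted.
    data = [1, 2, 3, 4, 5, 6, 100, 10000]
    print(q_quantiles(data, 4))                       # quartile boundaries
    print(phi_quantile(data, 0.75))                   # exact 0.75-quantile x
    print(worst_case_value_error(data, 0.75, 0.125))  # value error allowed by eps
```

With this sample, the exact 0.75-quantile is 6, yet an answer anywhere in the [0.625, 0.875] quantile range may be as large as 100, so a rank tolerance of 0.125 permits a value error of 94. This is the kind of uncontrolled location error in the tail that the paragraph above describes.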