Distribution functions are widely used in analytical queries of data sets. In particular, a distribution function may be used to determine a data value that corresponds to a desired percentile in the distribution of values in the data set. An inverse distribution function for a data set takes a percentile value and a sort specification, and returns the data value that falls at that percentile when the data set is arranged according to the sort specification. Thus, to determine the data value, the data set is sorted according to the sort specification, and the location(s) in the sorted data set that correspond to the specified percentile value are computed. When a discrete distribution is assumed, the function returns an actual value of the data set found at the computed location. When a continuous distribution is assumed, the returned value may instead be calculated from the values at or adjacent to the computed location(s), and thus need not be a value present in the data set.
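The distinction between the discrete and continuous cases can be illustrated with a minimal sketch. The function names `percentile_disc` and `percentile_cont`, and the specific location formulas, are assumptions chosen for illustration; they follow one common convention (first value whose cumulative position reaches the percentile, and linear interpolation between neighboring values, respectively) and are not prescribed by this description.

```python
import math


def percentile_disc(values, p, descending=False):
    """Discrete case: return an actual value from the data set.

    Assumed convention: the first value in sort order whose cumulative
    position (rank / count) is at least p.
    """
    if not values:
        raise ValueError("data set is empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("percentile must be between 0 and 1")
    ordered = sorted(values, reverse=descending)  # the sort specification
    index = max(math.ceil(p * len(ordered)) - 1, 0)
    return ordered[index]


def percentile_cont(values, p, descending=False):
    """Continuous case: interpolate between the values adjacent to the
    computed location (assumed convention: linear interpolation)."""
    if not values:
        raise ValueError("data set is empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("percentile must be between 0 and 1")
    ordered = sorted(values, reverse=descending)
    position = p * (len(ordered) - 1)  # fractional, zero-based location
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


data = [40, 10, 30, 20]
print(percentile_disc(data, 0.5))  # 20, a value present in the data set
print(percentile_cont(data, 0.5))  # 25.0, interpolated between 20 and 30
```

In both functions the sort dominates the cost, which is the concern raised in the paragraphs that follow.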
Regardless of the distribution assumed, to compute the location(s), the inverse distribution function has to perform a scan and sort operation for ordering values in the data set according to the sort specification. Such a scan and sort operation, particularly on a large data set, may consume substantial resources and may take substantial time to execute.
Furthermore, scanning or sorting a data set using a single process is not scalable. For example, as the data set grows, the sort takes longer, even when the most efficient sort algorithms are used. One solution to reduce the sort time and save resources is to parallelize the sort operation by chunking the data set into subsets and assigning a separate process to execute a sort operation on each subset of data. However, the location of the data value corresponding to the specified percentile depends on the ordering of the whole data set, and cannot be determined from any one sorted subset alone. Thus, either the data set has to be recombined for a full sort of the data set, or complex inter-process communication has to occur between the processes sorting the subsets in order to determine the resultant data value corresponding to the specified percentile.
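The difficulty can be seen in a small sketch. Here the data set is chunked, each chunk is sorted by a separate worker, and a median is taken per chunk. The helper names are hypothetical, a thread pool stands in for the separate processes, and the discrete location formula is the same assumed convention as above. Combining the per-chunk results does not yield the correct answer; only after the sorted chunks are recombined into one ordering is the correct value found.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_chunks):
    """Chunk the data set into roughly equal subsets."""
    size = math.ceil(len(values) / n_chunks)
    return [values[i:i + size] for i in range(0, len(values), size)]


def sort_chunks_in_parallel(chunks):
    """Sort each subset with its own worker (threads stand in for the
    separate processes described in the text)."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(sorted, chunks))


def median_of_sorted(ordered):
    """Discrete 50th percentile of an already sorted list (assumed convention)."""
    return ordered[max(math.ceil(0.5 * len(ordered)) - 1, 0)]


data = [1, 2, 100, 3, 4, 200, 5, 6, 300]
sorted_chunks = sort_chunks_in_parallel(split_into_chunks(data, 3))

# Per-chunk medians only describe each subset in isolation.
local_medians = [median_of_sorted(chunk) for chunk in sorted_chunks]
naive_answer = median_of_sorted(sorted(local_medians))

# Recombining the sorted subsets restores the global ordering.
merged = list(heapq.merge(*sorted_chunks))
correct_answer = median_of_sorted(merged)

print(local_medians)   # [2, 4, 6]
print(naive_answer)    # 4
print(correct_answer)  # 5
```

The merge step is the recombination cost referred to above; avoiding it would require the workers to exchange information about how their values interleave.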
The computation of a percentile function is further complicated when the splitting of the data set for parallelization cannot be readily derived from the query itself. When a query specifies a “group by” clause, the queried data set can be split based on the values of the column specified in the group by clause, and the percentile function can be evaluated on each group independently. But when no group by clause exists in the query, the percentile function is to be applied on the whole data set, as a single set, and parallelizing the execution of the percentile function becomes challenging.
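As a hedged illustration of this contrast, the sketch below partitions rows by a grouping key and evaluates a discrete percentile on each partition independently, then evaluates the same function over the whole data set as a single set. The row layout `(key, value)`, the sample values, and the helper names are hypothetical, and the location formula is an assumed convention rather than one prescribed here.

```python
import math
from collections import defaultdict


def percentile_disc(values, p):
    """Discrete percentile (assumed convention: first value whose
    cumulative position reaches p)."""
    ordered = sorted(values)
    return ordered[max(math.ceil(p * len(ordered)) - 1, 0)]


def partition_by_key(rows):
    """Split (key, value) rows on the grouping column, as a group by would."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)


rows = [("east", 10), ("west", 70), ("east", 30), ("west", 50), ("east", 20)]

# With a grouping column, each partition is an independent unit of work
# that could be handed to its own process.
per_group = {
    key: percentile_disc(values, 0.5)
    for key, values in partition_by_key(rows).items()
}

# Without a grouping column, the query supplies no natural split:
# the function sees one set containing every value.
whole_set = percentile_disc([value for _, value in rows], 0.5)

print(per_group)  # {'east': 20, 'west': 50}
print(whole_set)  # 30
```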
The problem of evaluating a percentile function in a scalable manner is complicated even further when a query contains multiple percentile functions on different columns, or when, in addition to percentile functions, the query contains distinct or non-distinct aggregate functions. Executing each of these additional functions separately would further increase the execution time of such queries.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.