Percentile functions are inverse distribution functions which take a sorted list of values and a measure as inputs, and return a value located at the appropriate position in the sorted list as output. Median is an example of the percentile function where the measure is 0.5. Examples of the percentile functions and formulae may include PERCENTILE_DISC(p) and PERCENTILE_CONT(p).
PERCENTILE_DISC(p): is an inverse distribution function that assumes a discrete distribution model. It takes a percentile value (p) and a sort specification and returns an element from the set. Nulls are ignored in the calculation.
PERCENTILE_CONT(p): is an inverse distribution function that assumes a continuous distribution model. It takes a percentile value (p) and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.
As mentioned above, for the median function, P is 0.5. Stated another way, the median function calls for the value in the list position that follows 50% of the values in a sorted list.
According to the percentile definition, to compute the final percentile result the list must be fully sorted. For this reason, using conventional techniques, the degree of parallelization used to process queries that include percentile functions is usually the same as or less than the number of sorted groups to which the percentile function is applied. For example, assume that a table “emp” includes the salaries of all employees of a company. Such a query may appear as follows (Q1):
SELECTPercentile_disc(0.5) within group (order by salary)FROM emp;
To process a query that requires the median salary of all workers in the company, a single ordered list with the salary values from all rows in the emp table is necessary. Consequently, no parallelization of the aggregation portion of the query is possible, though other portions of the query may still be parallelized. On the other hand, to process a query that requires the median salary for each department in a company that has three departments, three ordered lists need to be created. Such a query may appear as follows (Q2):
SELECTPercentile_disc(0.5) within group (order by salary)FROM empgroup by department;
Assuming that there are three distinct department values (“sales”, “accounting”, “service”), the work associated with deriving the medians for this query can be spread among three processes, one for the ordered list for each department. If each process is assigned to a different processor, then the computational power of three processors can be put to use, in parallel, to execute the percentile calculation required by the query.
Unfortunately, the number of groups involved in a query with a percentile function may be significantly smaller than the number of computing resources (e.g. processors) available to perform the work. For example, there may be 10 processors available to execute Q2, even though there are only three distinct departments. In such cases, some computing resources go unused, and the query processing time of queries with percentile functions may be longer than it would be if it were possible to take advantage of all available computing resources.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Other features of the present embodiments will be apparent from the accompanying drawings and from the detailed description that follows.