An important function performed by a database management system is aggregating data. Typically, an aggregation function is applied to a set of values, such as the values in a specified column of a set of rows in table. One important aggregate function is the count( ) function, which returns the number of values in a specified column.
A distinct aggregate function applies the aggregate function to each unique value in the set of values. For example, even if the value “1” appears multiple times in the specified column, the aggregate function should only be applied to the value “1” once. A non-distinct counting function, represented to hereafter as “count( )”, will return the number of entries in the set. A distinct counting function, represented to hereafter as “count_distinct( )”, will return the number of distinct entries in the set. For example, for the set of values {1, 1, 1, 3}, count_distinct( ) returns 2, while count( ) returns 4.
The number of distinct entries in the set is referred to hereafter as the number of distinct values (NDV). NDV is heavily used in database queries and OLTP workloads. NDV is also valuable for business intelligence applications. Data such as web click logs and social network interactions can easily involve petabytes of data. NDV functions can glean important information from such data, such as number of distinct page views and distinct users.
For a non-distinct count function, a single pass is sufficient to determine the exact number of distinct entries in a set, and the function may be performed in a distributed manner without retaining specific values encountered in the set. However, to determine the NDV of a set of values, the distinct values must be tracked during the calculation in order to determine whether subsequently processed values are unique. Likewise, the distinct values must be shared across multiple processes when the operation is performed in a distributed manner. Thus, memory consumption can be intensive when the number of distinct values is large due to tracking all the distinct values that are observed.
When the memory consumption is so large that the tracked distinct values have to be spilled, or written, to disk, performance is substantially affected. For example, the tracked distinct values may be spilled to disk when the tracked data does not fit in available memory. When data is spilled to disk, the performance of the disk media becomes a bottleneck because the disk access latency limits performance during subsequent computations.
To minimize usage resource such as computation and memory, the NDV may be approximated. One way of calculating an approximate NDV is by sampling. A distinct count is performed on a sample of values in the set. The distinct sample count is then scaled up to represent the entire set. However, sampling techniques can produce inaccurate statistics.
Based on all of the foregoing, it would be desirable for improved techniques for approximating distinct values using a bounded amount of memory.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.