Databases are used to store and process data for a variety of tasks today. Businesses often use databases to manage customers, track sales, manage internal personnel and operations, and so forth. Business Intelligence (BI) refers to the use of databases or other systems to analyze a business. The analysis may include observing historical trends, projecting future performance, slicing collected information (e.g., sales by region or by month), and so on. Databases often provide query functionality that facilitates the implementation of business intelligence using data stored in a database and implemented by information technology (IT) or other personnel within a business.
Two common types of query operations are obtaining distinct counts of items in repetitive data and grouping data by some value. In terms of Structured Query Language (SQL), these operations are invoked by the “DISTINCT COUNT” and “GROUP BY” keywords. A distinct count can be used, for example, to determine how many distinct customers purchased products from a company during a period. Using a database table that contains a row for each purchase transaction and a column with a customer identifier, a distinct count on the customer identifier will return the total number of customers that had transactions matching any other criteria in the query (such as filtering conditions in the WHERE clause). Grouping collects related data together, and can be used to produce subtotals or other sub-collections of data within a larger collection. For example, using the same transactions table, a query can group transactions by region to obtain a revenue per region that contributes to the overall revenue represented in the table. These two concepts can also be used together, such as to obtain a distinct count of customers in each region.
The most common problems with counting and grouping operations are the computational complexity and corresponding time involved with producing the results. Determining a distinct count involves some level of tracking which values have been seen before so that additional data of that value can be ignored, and which values have not been seen so that the count can be incremented when new data values are encountered. Grouping data likewise involves tracking how many separate values of the grouped data value exist, and partitioning rows that are found based on the value of the grouped data value for each row. Performing these tasks may involve sorting the data, creating temporary tables, building hash tables, pushing large data structures to disk or other large (but slow) storage, and so forth. These operations are costly in terms of time and storage resources, and affect how large of a dataset can be used with such operations to achieve results in a reasonable period. In particular, datasets that have columns of high cardinality (i.e., many different values in the set) present computational difficulties for existing analysis systems.