Large databases rely heavily on grouping and aggregation of data to facilitate queries, which makes aggregation ripe for optimization. The demand for database volume and database speed continues to grow. As databases grow in size, they present numerous challenges related to quickly and efficiently performing real-time queries over terabytes (TB) of data. Aggregation represents a large and important part of such queries.
As databases grow in size, modern hardware continues to grow more powerful and to include an increasing number of processors with multiple cores. For example, affordable computing devices include two to eight processors, each having four to eight cores, with each core having two or four hardware contexts. Each core includes one or two levels of primary caches, and many processors include features that increase performance, such as pipelining, instruction-level parallelism, branch prediction, and features for synchronization.
Conventional aggregation techniques struggle with challenges including synchronization, cache utilization, non-uniform memory access (NUMA) characteristics, a plurality of database columns, data skew, and selection of an optimal aggregation operator by an optimizer.
Aggregation is one of the most expensive relational database operators and occurs in many analytical queries. The dominant cost of aggregation is, as with most relational operators, the movement of the data. In the days of disk-based database systems, relational operators were designed to reduce the number of I/Os needed to access the disk, whereas access to main memory was considered free. In today's in-memory database systems, the challenge stays more or less the same but moves one level up in the memory hierarchy: the costly transfers are now those between main memory and the processor caches rather than those between disk and main memory.