In many computing applications, sorting and aggregating are fundamental operations used to manipulate and process data records. A sort operation may involve any process of arranging data records systematically in a sequence or set according to some sorting criterion or category.
An aggregation operation may involve combining values of multiple data records grouped together based on certain criteria in order to form a single value of more significant meaning or measurement. Examples of common aggregate functions include: a sum function that finds the sum of its inputs, an average function that finds the arithmetic mean of its inputs, a median function that finds the median of its inputs, a mode function that finds a mode of its inputs, a maximum function that finds the maximum value from among its inputs, a minimum function that finds the minimum value from among its inputs, a count function that counts values of multiple data items, and others.
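As a minimal sketch of the aggregate functions listed above, the following Python example applies each of them to grouped values. The `records` data, the `aggregate` function name, and the use of a dictionary to form the groups are assumptions made purely for illustration and do not come from the original text.

```python
from collections import defaultdict
from statistics import mean, median, mode

# Hypothetical data records: (grouping key, value) pairs.
records = [("a", 3), ("b", 10), ("a", 5), ("b", 10), ("a", 3), ("b", 4)]

def aggregate(records):
    """Group values by key, then reduce each group to single summary values."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {
        key: {
            "sum": sum(vals),          # sum of the inputs
            "average": mean(vals),     # arithmetic mean of the inputs
            "median": median(vals),    # middle value of the inputs
            "mode": mode(vals),        # most common value among the inputs
            "max": max(vals),          # maximum value among the inputs
            "min": min(vals),          # minimum value among the inputs
            "count": len(vals),        # number of values in the group
        }
        for key, vals in groups.items()
    }

result = aggregate(records)
```

Each group of records collapses to one value per aggregate function, which is the "single value of more significant meaning" described above.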
In conventional systems, data records are fully sorted (e.g., to group the data records based on certain criteria) before aggregation can be performed (e.g., on the sorted groups of data records). In such conventional systems, all of the data records must be loaded into memory in order to perform a full sort. However, if the number of data records is large, a large amount of memory will be needed in order to store all of the data records. If there is not enough memory to store all of the data records, then batches of sorted data records will have to be spilled to disk (e.g., written to external storage), which is orders of magnitude slower than accessing memory. Frequent spilling to disk can make the sort operation more resource-intensive and time-consuming, and the speed of the sort operation will depend heavily on the number of data records and the underlying computing resources available to the data processing system.
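The conventional sort-then-aggregate flow described above can be sketched roughly as follows. This is an illustrative external merge sort, not an implementation taken from any particular system: the names `sort_then_sum`, `_spill`, `_read_run`, and the `max_in_memory` parameter are hypothetical, and the sum function stands in for any aggregate.

```python
import heapq
import itertools
import os
import pickle
import tempfile

def _spill(sorted_batch):
    """Write one sorted batch to a temporary file on disk; return its path."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "wb") as f:
        for record in sorted_batch:
            pickle.dump(record, f)
    return path

def _read_run(path):
    """Stream the records of a spilled batch back from disk, one at a time."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def sort_then_sum(records, max_in_memory):
    """Fully sort (key, value) records, spilling a sorted batch to disk each
    time max_in_memory records are buffered, then sum the values per key."""
    runs, batch = [], []
    for record in records:
        batch.append(record)
        if len(batch) >= max_in_memory:   # memory budget reached: spill
            runs.append(_spill(sorted(batch)))
            batch = []
    batch.sort()                          # final partial batch stays in memory
    try:
        streams = [_read_run(p) for p in runs] + [iter(batch)]
        merged = heapq.merge(*streams)    # full sort order across all batches
        # Aggregation can only start once records arrive in sorted order.
        totals = {
            key: sum(value for _, value in group)
            for key, group in itertools.groupby(merged, key=lambda r: r[0])
        }
    finally:
        for p in runs:
            os.remove(p)
    return totals, len(runs)

data = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
totals, spills = sort_then_sum(data, max_in_memory=2)
```

In this sketch, a smaller `max_in_memory` relative to the input size produces more spilled batches, and every spilled record is written to and read back from disk before it can contribute to an aggregate.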
Further, simply increasing the maximum memory available for the sort operation is not an adequate solution because the overall memory for a data processing system may be fixed and shared with other processes executed by the system. Providing too much memory to the sort operation may starve other processes of essential memory resources, which may adversely impact the overall functioning of the data processing system. Dedicating a larger amount of memory to sorting is also not an adequate solution because it may result in underutilization of the available memory on the data processing system.