In recent years, central processing units (CPUs) of computer processing hardware have generally experienced their greatest performance increases by increasing the number of processor cores rather than through increasing clock rates. Accordingly, to maximize performance, modern software advantageously employs the benefits of multi-core CPUs by allowing parallel execution and with architectures that scale well with the number of cores. For data management systems, taking full advantage of parallel processing capabilities generally requires partitioning of stored data into sections or “partitions” for which the calculations can be executed in parallel.
A database program or database management system generally displays data as two-dimensional tables, of columns and rows. However, data are typically stored as one-dimensional strings. A row-based store typically serializes the values in a row together, then the values in the next row, and so on, while a column-based store serializes the values of a column together, then the values of the next column, and so on.
In general, column-based systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data. Column-based systems can be more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows. Row-based systems can be more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek. Row-based systems can also be more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.
Column-based storage can facilitate execution of operations in parallel using multiple processor cores. In a column store, data are already vertically partitioned, so operations on different columns can readily be processed in parallel. If multiple columns need to be searched or aggregated, each of these operations can be assigned to a different processor core. In addition, operations on one column can be parallelized by partitioning the column into multiple sections that are processed by different processor cores. Column data is typically of uniform type, which can facilitate opportunities for storage size optimizations available in column-based data stores that are not available in row-based data stores. For example, some modern compression schemes can make use of the similarity of adjacent data to compress. To improve compression of column-based data, typical approaches involve sorting the rows. For example, using bitmap indexes, sorting can often improve compression by approximately an order of magnitude. In conventional systems, columnar compression generally achieves a reduction in storage space requirements at the expense of efficiency of retrieval. Retrieving all data from a single row can be more efficient when that data is located in a single location, such as in a row-based architecture. Further, the greater adjacent compression achieved, the more difficult random-access may become, as data typically need to be uncompressed to be read. Therefore, conventional column-based architectures are often enriched by additional mechanisms aimed at minimizing the need for access to compressed data. These additional mechanisms can result in lower compression efficiency and/or increased processing requirements to access the compressed data.
Currently available relational database management systems can accomplish partitioning based on specified criteria applied to split the database. In general, a partitioning key is used to assign a partition based on certain criteria. Commonly used approaches include range partitioning, list partitioning, hash partitioning, round robin partitioning, and composite partitioning. In range partitioning, a partition can be defined by determining if the partitioning key is inside a certain range. For example, a partition can be created to include for all rows in which values in a column of postal codes are between 70000 and 79999. In list partitioning, a partition can be assigned a list of values and the partition can be chosen if the partitioning key has one of the values on the list. For example, a partition built to include data relating to Nordic countries can includes all rows in which a column of country names includes the text string values Iceland, Norway, Sweden, Finland, Denmark, etc. In hash partitioning, the value of a hash function can determine membership in a partition. For example, for a partitioning scheme in which there are four partitions, the hash function can return a value from 0 to 3 to designate one of the four partitions. Round robin partitioning can be used to distribute storage and/or processing loads among multiple data partitions and/or servers or server processes according to a pre-set rotation among the available partitions or servers or server processes. As an example, a first data unit can be directed to a first partition of three partitions, a second data unit to the second partition, a third data unit to the third partition, a fourth data unit to the first partition, and so forth. In composite partitioning, certain combinations of other partitioning schemes can be allowed, for example by first applying a range partitioning and then a hash partitioning.