Databases can be used to maintain a collection of records or other information such that the records can be easily accessed, updated, and managed. Relational database systems store data as tables. Logically, a table is a set, as there is no inherent order to the rows it contains. Relational databases can be managed via database management systems.
Many database management systems have the capacity to build indexes on their tables. A table with an index stores rows in a particular order. Database systems use indexes to optimize data access and to provide fast access to sorted data. For example, in a database that stores sales data, a sales table can be created that stores information related to the particular products that have been purchased by a particular customer. If the sales table has an index on the customer column, the database system can efficiently answer a query such as “find all sales to customer X” without having to process all of the customer records, and can answer a query such as “sort all sales by customer” without actually having to sort the data.
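As a rough illustration (not any particular system's implementation), an index on the customer column can be modeled as the sales rows kept sorted by that column, so that one customer's sales are located by binary search rather than by examining every record. The table contents and customer names below are hypothetical:

```python
import bisect

# Hypothetical sales rows of the form (customer, product). An index on
# "customer" is modeled here by keeping the rows sorted on that column.
sales = sorted([
    ("carol", "keyboard"),
    ("alice", "monitor"),
    ("bob", "mouse"),
    ("alice", "laptop"),
])

def sales_for_customer(rows, customer):
    """Binary-search the sorted rows instead of scanning all of them."""
    lo = bisect.bisect_left(rows, (customer,))
    # chr(0x10FFFF) is a sentinel that sorts after any product name,
    # so [lo, hi) covers exactly this customer's rows.
    hi = bisect.bisect_right(rows, (customer, chr(0x10FFFF)))
    return rows[lo:hi]

print(sales_for_customer(sales, "alice"))
```

Because the rows are already ordered by customer, a query such as “sort all sales by customer” needs no sort step at all: the rows are simply read in index order.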
Database management systems are able to scan a table (with or without an index) using multiple CPUs or threads. This type of scan is referred to as a “parallel scan.” During a parallel scan, a database management system can assign data pages or ranges of values from the table to respective threads. The assignment of the ranges of values to the threads can be dynamic and on-demand. In particular, whenever a thread completes scanning its current assignment of pages, it requests the next available set of pages (or ranges) to continue scanning. A result of this feature is that work can be automatically load balanced across threads.
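The dynamic, on-demand assignment described above can be sketched in simplified form as a shared work queue of page batches from which threads draw; the page numbers, batch size, and thread count here are assumptions for illustration only.

```python
import queue
import threading

PAGES = list(range(100))   # hypothetical page numbers of one table
BATCH = 8                  # pages handed out per request

# Pre-load the queue with batches of pages to be scanned.
work = queue.Queue()
for i in range(0, len(PAGES), BATCH):
    work.put(PAGES[i:i + BATCH])

scanned = []
lock = threading.Lock()

def scan_worker():
    while True:
        try:
            batch = work.get_nowait()   # request the next available batch
        except queue.Empty:
            return                      # no pages remain; this thread is done
        with lock:
            scanned.extend(batch)       # "scan" the assigned pages

threads = [threading.Thread(target=scan_worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(scanned) == PAGES         # every page scanned exactly once
```

Because a thread only requests more work after finishing its current batch, faster threads naturally take on more batches, which is the load-balancing effect noted above.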
Many conventional database management systems support horizontal partitioning (or fragmentation) of database tables. When a table is horizontally partitioned, it is divided into several smaller physical storage units. Each “partition” of a “partitioned table” stores a disjoint set of rows. By combining the rows from all of the partitions, the original data can be reconstructed. To determine to which partition a row belongs, a “partition function” can be applied to one or more columns of the table (the “partition keys”). The output of this function is called the “partition id.”
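One common form of partition function is range partitioning, which can be sketched as follows; the boundary values are hypothetical.

```python
import bisect

# Hypothetical range partition function: boundary values split the key
# space, and the partition id is the index of the range a key falls into.
BOUNDARIES = [100, 200, 300]   # keys < 100 -> partition 0, [100, 200) -> 1, ...

def partition_id(key):
    return bisect.bisect_right(BOUNDARIES, key)

assert partition_id(42) == 0
assert partition_id(100) == 1
assert partition_id(250) == 2
assert partition_id(999) == 3
```

Hash partitioning works analogously, with the partition id computed as, for example, a hash of the partition keys modulo the number of partitions.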
In one conventional system, table partitioning is implemented by adding special case code throughout the query processor. This code provides partitioning capability and the capability to treat partitioned tables differently from non-partitioned tables. As such, different plans are provided for partitioned and non-partitioned tables. The result is more complex query plans and an increase in code complexity.
In addition, in such systems, parallel scans on partitioned tables may not be supported. If a scan of a partitioned table using multiple threads is desired, concurrent serial scans of multiple partitions are performed. Such scans use one worker thread per partition, and partitions are assigned to threads as illustrated in FIG. 1A. This assignment can be dynamic and on-demand. However, the algorithm involved does not always perform well. If the number of partitions to be scanned is small relative to the number of threads involved, all of the threads may not be kept busy (see FIG. 1A). Furthermore, if one partition is larger than the others (see partition 2 in FIG. 1A), the thread assigned to this partition can end up much busier than the other threads and can dominate execution time. However, it is important to note that even where uniform partition sizes are employed, if the number of partitions is not an even multiple of the number of threads, all threads may not be kept busy for the duration of the scan.
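The skew problem can be shown with a toy cost model (the page counts are hypothetical): when each thread scans exactly one partition, elapsed time is governed by the largest partition rather than by the average.

```python
# One-thread-per-partition scheme: each thread scans exactly one
# partition, so elapsed time equals the largest partition's page count.
partition_pages = [10, 40, 10, 10]      # partition 2 is skewed (cf. FIG. 1A)
threads = 4

elapsed = max(partition_pages)          # the skewed partition dominates
ideal = sum(partition_pages) / threads  # what perfect balancing would give
print(elapsed, ideal)
```

Here the scan takes 40 time units even though a perfectly balanced assignment of the same 70 pages across four threads would finish in 17.5.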
Query plans with partitioned tables achieve parallelism by scanning multiple partitions concurrently (e.g., by performing concurrent serial scans of multiple partitions). The maximum degree of parallelism for scans of partitioned tables is effectively limited by the number of partitions to be scanned. In the extreme case, if the query processor dynamically eliminates all but one partition, a serial scan of this partition can result.
To improve performance, many database management systems issue background disk reads before the data is actually needed. This process, which is called “read ahead” (some systems use the term “prefetching”), reduces the likelihood that threads will need to wait for a disk read to complete and enables a single thread to issue reads simultaneously against multiple disks. It should be noted that parallel scans enable central processing unit (CPU) parallelism as multiple threads are allowed to process data from a single table concurrently, while read-ahead operations enable input-output (I/O) parallelism as they allow data pages to be read from multiple disks concurrently.
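A minimal sketch of read ahead, under simplified assumptions, uses a background thread that issues “disk reads” ahead of the consumer and parks the results in a bounded buffer; `read_page` here is a hypothetical stand-in for an actual disk read.

```python
import queue
import threading
import time

def read_page(page_no):
    time.sleep(0.001)              # stand-in for disk read latency
    return f"page-{page_no}"

def prefetcher(page_nos, buffer):
    for n in page_nos:
        buffer.put(read_page(n))   # read ahead of the consumer
    buffer.put(None)               # end-of-scan marker

buffer = queue.Queue(maxsize=16)   # read-ahead depth: at most 16 pages queued
threading.Thread(target=prefetcher, args=(range(50), buffer)).start()

pages = []
while (page := buffer.get()) is not None:
    pages.append(page)             # consume pages as they become available

assert pages == [f"page-{n}" for n in range(50)]
```

The bounded buffer caps how far ahead of the consumer the prefetcher may run, which is the usual way read-ahead depth is limited.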
Conventional database management systems have several shortcomings. Many are only able to perform a parallel scan of a single range or partition at a time. When processing a query plan that involves a parallel scan of multiple ranges or partitions of the same table, some conventional database management systems must wait for all threads participating in the parallel scan to complete the scan of the current range or partition before proceeding to the next one. In particular, one or more threads that complete their assigned portions for a range or partition get blocked and have to wait until other threads complete the scan of their assigned portions of the same range or partition before proceeding. The effect of this is that system utilization drops as the system completes the scan of one range or partition and prepares to begin the scan of the next range or partition.
Database queries such as “find all sales to customers X, Y, or Z” scan multiple ranges of a table. As discussed above, some conventional database management systems employ an index on customer to satisfy such a query. Such systems use the index to perform three lookups, one for each of the customers X, Y, and Z. The three lookups are referred to as ranges since each lookup scans a range of table rows.
In at least one conventional system, a parallel scan for each of these three lookups can be performed. This can be implemented as three independent and parallel scans. First, all of the threads perform a parallel scan of the first range X. Then, all of the threads perform a parallel scan of the second range Y. Finally, all of the threads perform a parallel scan of the third range Z. During each parallel scan, the database management system assigns pages or sub-ranges dynamically to the threads as discussed above. A shortcoming of such scans is illustrated in FIG. 1B, which characterizes a scan of two ranges X and Y. As shown in FIG. 1B, when there are no more pages remaining in a range assigned to one thread, the scanning system waits for the other participating threads to finish scanning pages assigned for that range. When all of the threads are idle and the parallel scan of the range is complete (see FIG. 1B), the parallel scan of the next range can begin.
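Under the stated assumptions (hypothetical ranges X, Y, and Z, with page numbers standing in for index rows), this conventional scheme can be sketched as one parallel scan per range with a barrier, the thread join, between ranges:

```python
import queue
import threading

# Hypothetical pages belonging to each of the three lookup ranges.
ranges = {"X": list(range(10)), "Y": list(range(10, 16)), "Z": list(range(16, 30))}

scanned = []
lock = threading.Lock()

def scan_range(pages, n_threads=4):
    work = queue.Queue()
    for p in pages:
        work.put(p)

    def worker():
        while True:
            try:
                page = work.get_nowait()
            except queue.Empty:
                return              # this thread now idles until the barrier
            with lock:
                scanned.append(page)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                    # barrier: wait for the whole range

for name in ("X", "Y", "Z"):        # ranges are scanned strictly in turn
    scan_range(ranges[name])

assert sorted(scanned) == list(range(30))
```

The `join` calls are the source of the idle time described above: no thread may touch range Y until every thread has drained its share of range X.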
Some conventional database management systems stop I/O read-ahead operations when they reach the end of a range or partition as shown in FIG. 1C. This limitation exacerbates the problems discussed above by increasing the length of time it takes to begin scanning the next range or partition.
If an index is range partitioned on the leading index key, a scanning system can return data from this index in sorted order by scanning the partitions sequentially using a serial scan. However, because the “parallel scan” algorithm used in conventional systems such as described above scans multiple partitions of the table simultaneously (with each partition assigned to a different thread), a “parallel scan” by such systems does not return data in sorted order.
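A small sketch with hypothetical key ranges shows why sequential scanning of range partitions preserves order while interleaving the partitions does not: each partition holds a disjoint, internally ordered key range, so concatenating the partitions in range order is already globally sorted.

```python
# Hypothetical range-partitioned index on its leading key.
partitions = [
    [1, 4, 7],      # keys < 10
    [12, 15],       # 10 <= keys < 20
    [21, 30, 42],   # keys >= 20
]

# Serial scan: partitions visited one after another, in range order.
serial = [k for part in partitions for k in part]
assert serial == sorted(serial)          # output is globally sorted

# A parallel scan that interleaves partitions loses that property.
interleaved = [1, 12, 21, 4, 15, 30, 7, 42]
assert interleaved != sorted(interleaved)
```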