1. Field of the Invention
The present invention relates to an improved data processing system and, in particular, to a method for generating partitioning keys in a multithreaded application to reduce contention when inserting data into a range-partitioned database.
2. Description of the Related Art
Database management systems (DBMSs), especially relational database management systems (RDBMSs), are well known in the art. In such systems, data is organized into tables, or relations, each of which contains one or more rows and one or more columns. Each row (also known as a tuple or record) of a table corresponds to a particular data object (such as an employee), while each column corresponds to a particular attribute of that object, such as name, age, department, salary, or the like. Database systems are described in such online references as Wikipedia, as well as in more traditional references such as Date, C. J., An Introduction to Database Systems, Eighth Edition, Addison Wesley, 2003, incorporated herein by reference.
In many database applications, partitioning of the database is used to improve scalability and performance. Database partitioning is a method that breaks a large database table into smaller table segments, where each table segment independently manages its own locks. To accomplish this, the partitioning boundaries are defined by the user through the use of Data Definition Language (DDL). The partitioning boundaries, also referred to as the partition high key values, are values of one or more columns of the database table that separate one partition of the table from another partition. Once the boundaries are defined, each row in the database table falls into a specific partition based on the value of that row's partitioning column(s). Typically, more attention is given to read access than to insertion of new data. However, multithreaded applications that perform high-volume inserts of new data can achieve performance benefits from partitioning as well, especially when concurrent threads insert into different partitions.
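By way of illustration only, the mapping of a row to a partition by comparison against partition high key values might be sketched as follows. The boundary values, the single integer partitioning column, and the function name `partition_for` are assumptions chosen for the example and are not taken from any particular database product.

```python
from bisect import bisect_left

# Hypothetical partition high key values (inclusive upper bounds), in ascending order.
HIGH_KEYS = [1000, 2000, 3000, 4000]


def partition_for(key, high_keys=HIGH_KEYS):
    """Return the index of the first partition whose high key is >= key."""
    idx = bisect_left(high_keys, key)
    if idx == len(high_keys):
        # The key lies above the last boundary, so no partition can hold it.
        raise ValueError(f"key {key} exceeds the highest partition boundary")
    return idx


if __name__ == "__main__":
    for k in (1, 1000, 1001, 4000):
        print(k, "-> partition", partition_for(k))
```

In this sketch a key equal to a boundary value belongs to the partition that the boundary closes, which is why the boundaries are called high key values.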
Many applications use a generated key for new data, where this key also serves as the partitioning key and the data is partitioned using range partitioning. Range partitioning selects a partition by determining if the partitioning key is inside a certain range of values. One common approach for key generation is to use a monotonically increasing key by incrementing successive key values. This has the advantage of keeping the data organized in sequence, clustered by the generated key. It also has the advantage of assigning space for new data at the end of the partition, which is customarily far more efficient than inserting in the middle of existing data. One drawback with this approach is that data is not immediately spread across partitions. Instead, the generated key must progress through the key ranges chosen for the partitions as data accumulates over time before a good spread of data occurs. Another drawback of using monotonically increasing key values is the formation of “hotspots” in the database when concurrent access occurs in a multithreaded application environment, both during initial insertion of new data and subsequent access of the new data. Collisions on “hotspots” often cause contention between competing threads, due to serialization in locking and space allocation in localized areas of the database. Additionally, if multiple data records are inserted for each generated key, multithreaded applications may not achieve the full benefit of keeping inserted data in sequence, as two competing threads with sequential generated keys may interleave their data. When the underlying database management system uses page-level locking, interleaved data on the same page can cause deadlocks between application threads which are doing concurrent updates to otherwise unrelated records.
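The first drawback described above can be illustrated with a minimal sketch, assuming hypothetical partition boundaries and a hypothetical `MonotonicKeyGenerator` class. Every thread must pass through the same counter, and the first keys issued all fall into the lowest key range.

```python
import itertools
import threading
from bisect import bisect_left
from collections import Counter

# Hypothetical partition high key values, in ascending order.
HIGH_KEYS = [1000, 2000, 3000, 4000]


class MonotonicKeyGenerator:
    """Issues successive integer keys; the lock is shared by all calling threads."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_key(self):
        with self._lock:
            return next(self._counter)


def partition_spread(keys, high_keys=HIGH_KEYS):
    """Count how many of the given keys fall into each partition index."""
    return Counter(bisect_left(high_keys, k) for k in keys)


gen = MonotonicKeyGenerator()
first_keys = [gen.next_key() for _ in range(500)]
spread = partition_spread(first_keys)

if __name__ == "__main__":
    # All 500 keys land in partition 0; the other partitions stay empty.
    print(dict(spread))
```

Under these assumed boundaries, data would begin to reach the second partition only after the generated key passes 1000, so concurrent inserters compete for the same partition until then.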
One method commonly used to overcome both drawbacks above is to reverse the bytes of a monotonically increasing key. This has the effect of continuously spreading new data evenly over the entire key range, placing it in the various partitions in a round-robin fashion. This avoids hotspots, as each subsequent insertion occurs at a different point in the database. However, this approach does not keep the data well organized, potentially requiring more frequent data reorganization. Furthermore, inserts are not done at the end of previously existing data, thus sacrificing efficiency during insert. Finally, simply reversing the bytes of an incremented value does not guarantee that two successive keys fall in separate partitions unless enough partitions are defined, with key ranges chosen to suit the key generation procedure. If successive keys fall in the same partition, multithreaded workloads will compete for resources within that partition.
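The byte-reversal approach, and its dependence on the number of partitions, can be sketched as follows. The 4-byte unsigned key width, the equal-width key ranges, and the function names are assumptions made for the example.

```python
def reverse_bytes(key, width=4):
    """Reverse the byte order of an unsigned integer key of the given width."""
    return int.from_bytes(key.to_bytes(width, "big"), "little")


def partition_of(key, num_partitions, width=4):
    """Map a key to one of num_partitions equal-width ranges of the key space.

    Assumes num_partitions evenly divides the size of the key space.
    """
    span = (1 << (8 * width)) // num_partitions
    return key // span


if __name__ == "__main__":
    successive = [1, 2, 3]
    reversed_keys = [reverse_bytes(k) for k in successive]
    # With 256 partitions the most significant byte selects the partition,
    # so successive keys rotate through partitions 1, 2, 3.
    print([partition_of(k, 256) for k in reversed_keys])
    # With only 4 partitions the same keys all fall into partition 0.
    print([partition_of(k, 4) for k in reversed_keys])
```

In this sketch the reversed keys 0x01000000, 0x02000000, and 0x03000000 are spread across partitions only when the ranges are narrow enough to separate them, which is the limitation noted above.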
Another method used to overcome the drawbacks of a monotonically increasing key is to generate a random key. This has essentially the same advantages and disadvantages as the method of reversing the bytes of a monotonically increasing key noted above.
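A random key generator can be sketched in the same way. The 32-bit key space, the four equal-width partitions, the fixed seed, and the helper names below are assumptions for illustration. The sketch counts how often two successive keys fall in the same partition, which is the case in which concurrent inserters would still compete.

```python
import random

KEY_SPACE = 1 << 32      # assumed 32-bit unsigned key space
NUM_PARTITIONS = 4       # assumed number of equal-width partitions


def random_key(rng):
    """Draw a key uniformly from the assumed key space."""
    return rng.randrange(KEY_SPACE)


def partition_of(key, num_partitions=NUM_PARTITIONS):
    """Map a key to one of the equal-width partitions."""
    return key // (KEY_SPACE // num_partitions)


def adjacent_collisions(keys):
    """Count successive key pairs that land in the same partition."""
    parts = [partition_of(k) for k in keys]
    return sum(1 for a, b in zip(parts, parts[1:]) if a == b)


rng = random.Random(12345)  # fixed seed so the sketch is repeatable
keys = [random_key(rng) for _ in range(1000)]
collisions = adjacent_collisions(keys)

if __name__ == "__main__":
    print("successive pairs sharing a partition:", collisions, "of", len(keys) - 1)
```

With four equally likely partitions, roughly one successive pair in four would be expected to share a partition, so random keys spread the data without guaranteeing isolation between concurrent inserts.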
None of the three key generation mechanisms noted above (using a monotonically increasing key, reversing the bytes of a monotonically increasing key, and generating a random key) accomplishes all of the following goals: (1) keeping newly inserted data organized in key sequence; (2) continuously spreading new data uniformly across the partitions; and (3) maximizing the isolation of inserts from concurrent threads into different partitions.
What is needed is a key generation procedure which achieves all of these goals to help maximize the performance benefit of database partitioning.