In a shared nothing database (SN), data is divided into partitions (also known as shards) and placed over one or more nodes or individual machines which have local disk storage, memory and central processing unit (CPU). The partitions, nodes and machines are connected to each other through a high speed interconnect through which all database interpartition communication is facilitated. Each partition has no direct access to data on other partitions and nodes, other than through the high speed interconnect. Typically there are a fixed number of partitions per node automatically configured as defined by a configuration file.
An SN database manager allows great flexibility in spreading data across multiple database partitions of a partitioned database. Users can choose how to distribute their data by declaring distribution keys, and can determine which and how many database partitions their table data can be spread across by selecting the database partition group and table space in which the data is to be stored. In addition, a distribution map (which is updateable) specifies the mapping of hashed distribution key values to database partitions. This makes it possible for flexible workload parallelization across a partitioned database for large tables, while allowing smaller tables to be stored on one or a small number of database partitions if the application designer so chooses.
In a partitioned database, the distribution key is used to distribute table and index data across a set of database partitions. The distribution key for a table in a table space on a partitioned database partition group is specified in a CREATE TABLE statement or an ALTER TABLE statement. Rows are placed in a database partition as follows: a hashing algorithm (database partitioning function) is applied to all of the columns of the distribution key, which results in the generation of a distribution map index value; and the database partition number at that index value in the distribution map identifies the database partition in which the row is to be stored.
Referring to an example distribution map and distribution key in FIG. 1, a distribution key is the contents of one or more columns for a row of data (c1 and c3 in the example) that map onto a partition in the distribution map (p5 in the example), which is identified through the distribution map index by the hashed distribution key value 2.
The database manager supports partial de-clustering, which means that a table can be distributed across a subset of database partitions in the system (that is, a database partition group). Tables do not have to be distributed across all of the database partitions in the system. The database manager has the capability of recognizing when data being accessed for a join or a sub-query is located at the same database partition in the same database partition group. This is known as table collocation. Rows in collocated tables with the same distribution key values are located on the same database partition. The database manager can choose to perform join or sub-query processing at the database partition in which the data is stored. This has very significant performance advantages.
Collocated tables must: be in the same database partition group and have the same distribution map; have distribution keys with the same number of columns; have the corresponding columns of the distribution key be database partition-compatible (compatible data types for example); and be in a single partition database partition group defined on the same database partitions.
In a data warehouse, the fact table and its largest dimension table are typically collocated. For example, FIG. 2 depicts a join of the two largest tables in a well-known bench marking workload. In this example, an ORDERS table (a dimension table) and a LINEITEM table (a fact table) are both distributed on an orderkey column and a join operation across these two tables is processed locally on each partition of the containing partition group, resulting in a significant performance saving.
The challenge is to maintain performance while expanding a SN partitioned database through the addition of new nodes. When a partitioned database is expanded to include newly provisioned nodes and partitions, the partition groups must be expanded to include the newly provisioned partitions. To maintain a balanced allocation of data for each table across the expanded system, and to allow for future growth on the existing partitions, the data must be redistributed across the system. The redistribution process involves the allocation of a partition group which spans all partitions in the expanded system and then reassignment of each table row in the partition group based on the new partition map. This process can take a significant amount of time, for example, reassigning tens or hundreds of terabytes of data can take several days. Typically, this process is performed as an offline activity and can require several days outage.
One of the main difficulties during the redistribution process is that collocation is not maintained during the operation. As the tables in the partition group are redistributed in sequence one at a time, the partition maps of joined tables will differ until all such joined tables in the partition group have been redistributed. Another challenge is the ability to incrementally redistribute each table and the row data therein, to the new database map, transparently to the end user while maintaining full access to the data and without incurring the need for two copies of the data being moved (existing and new).
There exists in the prior art techniques for the online moving of tables to a new partition group/map. For instance, US20090319581A1 discloses a method for online movement of a table requiring two copies of the table being moved but does not maintain collocation between tables during the move.
Another existing patent publication US20110295907A1 discloses an apparatus and method for expanding a SN database and a technique for redistributing a SN database while minimizing downtime, however, this method does not provide for preservation of collocation or full availability of data and requires two copies of the data.
Another existing patent publication US20120246194A1 provides a method for maintaining collocation through placing data with a previously encountered distribution key in the same database partition, and placing data for distribution keys not previously encountered in a new database partition. In other words, this method attempts to dispense entirely with data redistribution. However, this approach effectively can result in permanently unbalanced partition groups (as only data for new distribution keys are placed on new partitions), invariably leading to performance and tuning problems, and can result in critical scenarios where the existing nodes exhaust available storage, especially in homogenous appliance based clusters.