Massively parallel processing (MPP) is the coordinated processing of a program by multiple processors, with each processor working on a different part of the program. The processors communicate with one another to complete a task, with each of them using its own operating system and memory resources. An MPP database system is based on a shared-nothing architecture, with the tables of its databases being partitioned into segments and distributed to different processing nodes. There is no data sharing among the processing nodes. When database queries arrive, the work of each query is divided, and each portion is assigned to one of the processing nodes according to a data distribution plan and an optimized execution plan. The processing entities in each processing node manage only their portion of the data. However, these processing entities may communicate with one another to exchange necessary information during their work execution. A query may be divided into multiple sub-queries, and the sub-queries may be executed in parallel or in some optimal order in some or all of the processing nodes. The results of the sub-queries may be aggregated and further processed, and subsequently more sub-queries may be executed according to the results.
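By way of illustration only, the division of a query into sub-queries over a shared-nothing layout may be sketched as follows. The sketch assumes a single aggregate query (a sum over one column) and uses in-process threads to stand in for processing nodes; the names Node, distribute, and total are hypothetical and do not correspond to any particular MPP product.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Hypothetical processing node that owns one segment of a table."""

    def __init__(self, rows):
        # Shared-nothing: these rows are visible only to this node.
        self.rows = rows

    def partial_sum(self, column):
        # Sub-query: aggregate over the local segment only.
        return sum(row[column] for row in self.rows)


def distribute(rows, num_nodes, key):
    """Assign each row to exactly one node by taking the key modulo num_nodes."""
    segments = [[] for _ in range(num_nodes)]
    for row in rows:
        segments[row[key] % num_nodes].append(row)
    return [Node(segment) for segment in segments]


def total(nodes, column):
    """Run one sub-query per node in parallel, then aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda node: node.partial_sum(column), nodes))
    return sum(partials)


# Illustrative data: eight orders spread over four nodes.
orders = [{"id": i, "amount": 10 * i} for i in range(1, 9)]
nodes = distribute(orders, num_nodes=4, key="id")
result = total(nodes, "amount")
```

In this sketch the only information exchanged between nodes and the coordinator is the partial result of each sub-query; no node reads another node's rows.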
One challenge in MPP systems is maintaining efficient scaling as data is added to the MPP database. More specifically, an MPP database is generally created by partitioning one or more tables among multiple database partitions (DBpartitions) using an algorithm (e.g., hash, range, etc.). As new data is added to the MPP database, new data entries are made to tables within the DBpartitions according to the algorithm. However, the algorithm for partitioning data in conventional MPP databases is set during creation of the MPP database, and remains the same throughout the life of the MPP database. Hence, the static algorithm may be incapable of adapting to changing conditions, thereby causing the underlying MPP database to become unbalanced and less efficient at processing queries over time.
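As a minimal sketch of the two partitioning algorithms named above, the following assumes integer or string keys, a CRC32-based hash, and range boundaries fixed at creation time. The function names and the boundary values are hypothetical and chosen only for illustration.

```python
import bisect
import zlib
from collections import Counter


def hash_partition(key, num_partitions):
    """Map a key to a DBpartition by hashing.

    zlib.crc32 is used because it is deterministic across runs,
    unlike the built-in hash() for strings.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def range_partition(key, upper_bounds):
    """Map a key to a DBpartition by range.

    upper_bounds is a sorted list of exclusive upper bounds; keys at or
    above the last bound fall into the final partition.
    """
    return bisect.bisect_right(upper_bounds, key)


# Assumed boundaries, fixed when the database is created: three partitions.
BOUNDS = [100, 200]

initial_keys = range(0, 300, 10)   # evenly spread at creation time
later_keys = range(300, 600, 10)   # later growth, all above the last bound

counts = Counter(range_partition(k, BOUNDS) for k in initial_keys)
counts.update(range_partition(k, BOUNDS) for k in later_keys)
```

Because BOUNDS never changes in this sketch, every later key lands in the last partition, which ends up holding 40 of the 60 entries while the other two hold 10 each.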
By way of example, suppose a conventional customer database is partitioned based on the sex of its account holders, with database entries corresponding to male account holders being stored in a different DBpartition than database entries corresponding to female account holders. This algorithm may have been chosen because there was a relatively even ratio of male account holders to female account holders when the MPP database was initially created. However, as time goes on, assume female account holders are added to the database at a much higher rate than male account holders, thereby causing the MPP database to become unbalanced (i.e., the second DBpartition becomes much larger than the first DBpartition). At some point, it becomes desirable to repartition the MPP database in order to rebalance the DBpartitions. Conventionally, repartitioning the MPP database is performed manually by the database administrator (DBA), which typically requires the MPP database to go offline for a period of time. Accordingly, mechanisms that allow MPP databases to be repartitioned without interrupting their runtime operation are desired.
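The imbalance in this example can be quantified with a simple skew measure. The sketch below is an assumption for illustration, not a metric defined in this document: it takes skew to be the size of the largest DBpartition divided by the mean DBpartition size, and the row counts are invented.

```python
def partition_sizes(rows, partition_of):
    """Count how many rows the given partitioning function sends to each DBpartition."""
    sizes = {}
    for row in rows:
        partition = partition_of(row)
        sizes[partition] = sizes.get(partition, 0) + 1
    return sizes


def skew(sizes):
    """Largest partition size divided by the mean size; 1.0 means perfectly balanced."""
    mean = sum(sizes.values()) / len(sizes)
    return max(sizes.values()) / mean


def by_sex(row):
    # Static rule from the example: first DBpartition for male, second for female.
    return 0 if row["sex"] == "M" else 1


# Hypothetical row counts: balanced at creation, then growth at a 1:9 ratio.
initial = [{"sex": "M"}] * 500 + [{"sex": "F"}] * 500
growth = [{"sex": "M"}] * 100 + [{"sex": "F"}] * 900

skew_before = skew(partition_sizes(initial, by_sex))
skew_after = skew(partition_sizes(initial + growth, by_sex))
```

With these assumed counts the skew rises from 1.0 to 1.4, since the second DBpartition grows to 1400 entries against a mean of 1000. A threshold on such a measure is one conceivable way to decide when repartitioning becomes desirable.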