In an ordered distributed database, records are stored in partitions, with each partition covering a particular key range. Each partition is stored on one of many machines. Updating records in such a database may involve several operations. A single record insert is handled by looking up the partition whose key range encompasses the record's key, sending the record to that partition, and inserting it there. If a record insert pushes the partition size beyond a defined limit, the partition may split into two. Global load balancing may additionally shift partitions from machines with many partitions to those with few. When record inserts are repeatedly concentrated on a small key range, only one or a few partitions receive the inserts, and those partitions subsequently execute partition splits and moves to other machines.
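The single-record insert path described above can be illustrated with a minimal single-process sketch. The names `Partition`, `OrderedTable`, `find`, and `max_size` are hypothetical and chosen only for illustration. The sketch assumes unique string keys, identifies each partition by the inclusive lower bound of its key range, and omits machine placement and load balancing entirely.

```python
import bisect


class Partition:
    """A contiguous key range, identified by its inclusive lower bound."""

    def __init__(self, low, keys=None):
        self.low = low
        self.keys = keys if keys is not None else []  # kept sorted


class OrderedTable:
    """Hypothetical ordered table: partitions sorted by lower bound."""

    def __init__(self, max_size=4):
        self.max_size = max_size
        # One initial partition; "" sorts before every other string key.
        self.partitions = [Partition(low="")]

    def find(self, key):
        """Return the index of the partition whose range covers key."""
        lows = [p.low for p in self.partitions]
        return bisect.bisect_right(lows, key) - 1

    def insert(self, key):
        i = self.find(key)
        part = self.partitions[i]
        bisect.insort(part.keys, key)
        # Split into two halves once the size limit is exceeded.
        if len(part.keys) > self.max_size:
            mid = len(part.keys) // 2
            right = Partition(low=part.keys[mid], keys=part.keys[mid:])
            part.keys = part.keys[:mid]
            self.partitions.insert(i + 1, right)


table = OrderedTable(max_size=4)
for k in "abcdefg":  # inserts concentrated at the upper end of the key space
    table.insert(k)
```

In this sketch, every split in the loop above lands on the partition currently holding the highest keys, which mirrors the concentrated-insert behavior described in the paragraph.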
However, one-at-a-time insertion is a poor solution when a large number of records must be inserted. Bulk loading the records to update a large-scale distributed database is significantly more efficient than inserting them one at a time. For instance, a planning phase may identify the new partitions that need to be created by splitting existing partitions, partitions may then be moved for global load balancing, and finally the records may be inserted into the partitions. A significant challenge for efficient bulk loading of records is to create the new set of partitions from the old while minimizing the number of records written to new disk locations.
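As a rough illustration of what a planning phase might compute before any record is written, the sketch below counts incoming records per existing partition and reports which partitions would exceed the size limit. The function `plan_bulk_load` and its parameters are hypothetical. This is a deliberately naive planner: it only decides how many pieces each overflowing partition needs, and it makes no attempt to minimize the number of records written to new disk locations, which is the challenge identified above.

```python
import bisect
from collections import Counter


def plan_bulk_load(lows, sizes, new_keys, max_size):
    """Return {partition_index: pieces_needed} for partitions that must split.

    lows     -- sorted inclusive lower bounds of the existing partitions
    sizes    -- current record count of each existing partition
    new_keys -- keys of the records to be bulk loaded
    max_size -- maximum number of records allowed per partition
    """
    # Route every incoming key to the partition whose range covers it.
    incoming = Counter(bisect.bisect_right(lows, k) - 1 for k in new_keys)
    plan = {}
    for i, added in incoming.items():
        total = sizes[i] + added
        if total > max_size:
            plan[i] = -(-total // max_size)  # ceiling division
    return plan


lows = ["", "h", "p"]
sizes = [3, 2, 1]
new_keys = ["a", "b", "c", "i", "q", "r", "s", "t", "u", "v", "w"]
plan = plan_bulk_load(lows, sizes, new_keys, max_size=4)
```

With a plan of this kind in hand, splits and moves could be carried out once per partition rather than once per overflowing insert.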
What is needed is a way to efficiently perform bulk loading of a large number of records into partitioned data tables while maintaining a balanced load across storage units in a large-scale distributed database. Such a system and method should execute partition translation between existing and new partitions efficiently by minimizing the number of records moved between partitions of data.