1. Field of the Invention
This invention relates generally to parallel database systems and more particularly to a method and apparatus for distributing data in a table across a group of nodes of a parallel database system. The invention is useful in relational database systems, particularly in statically partitioned systems.
2. Description of Related Art
Database systems increasingly rely upon parallelism to achieve high performance and large capacity. Rather than relying upon a single monolithic processor, parallel systems exploit fast and inexpensive microprocessors to achieve high cost effectiveness and improved performance. The popular shared-memory architecture of symmetric multiprocessors is relatively easy to parallelize, but cannot scale to hundreds or thousands of nodes, due to contention for the shared memory by those nodes.
Shared-nothing parallel systems, on the other hand, interconnect independent processors via high-speed networks. Each processor stores a portion of the database locally on its disk. These systems can scale up to hundreds or even thousands of nodes, and are the architecture of choice for today's data warehouses, which typically range from tens of terabytes to over one hundred terabytes of online storage. High throughput and short response times can be achieved not only from inter-transaction parallelism, but also from intra-transaction parallelism for complex queries.
Relational tables in such databases are distributed over one or more database partitions, where each partition resides on one physical machine in a cluster of physical machines. The location of rows in a table partitioned in this way is determined by a distribution function that maps row data to a partition number. In such a system, it may occasionally be desirable to modify this distribution function. The most common reasons for doing so are (1) the current database manager capacity is inconsistent with current or future business requirements, and thus physical machines need to be added to or removed from the database cluster, and (2) the existing distribution of data across database partitions has become non-uniform or inconsistent with the processing power of the physical machines on which the database partitions are stored.
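By way of illustration only, a distribution function of this kind can be sketched as a hash of a row's distribution key followed by a lookup in a partition map. The sketch below is an assumption made for clarity, not the function used by any particular database manager: the names `partition_for` and `rows_to_move`, the use of CRC-32 as the hash, and the eight-entry maps are all hypothetical.

```python
import zlib

def partition_for(key: str, partition_map: list[int]) -> int:
    """Map a row's distribution key to a partition number.

    The key is hashed to a bucket, and the bucket is looked up in the
    partition map. CRC-32 is an arbitrary choice for this sketch.
    """
    bucket = zlib.crc32(key.encode("utf-8")) % len(partition_map)
    return partition_map[bucket]

def rows_to_move(keys: list[str], old_map: list[int], new_map: list[int]) -> list[str]:
    """Return the keys whose partition changes when the map is replaced."""
    return [
        k for k in keys
        if partition_for(k, old_map) != partition_for(k, new_map)
    ]

# Hypothetical maps: two partitions before, three after a machine is added.
old_map = [0, 1, 0, 1, 0, 1, 0, 1]
new_map = [0, 1, 2, 0, 1, 2, 0, 1]
keys = [f"row-{i}" for i in range(20)]
moved = rows_to_move(keys, old_map, new_map)
```

In this sketch, changing the distribution function amounts to replacing the partition map, and only the rows returned by `rows_to_move` need to be sent to another partition.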
Whenever the data distribution function is modified, it is necessary to redistribute existing table data among the database partitions according to the new distribution function. When large-object (LOB) data is present in a table that needs to undergo data redistribution, ideally one would like to avoid sending the LOB data for each outgoing record along with the record data itself, as doing so produces a slower data redistribution operation that can be CPU intensive. In some cases, the CPU may even stall while large amounts of data input/output are performed to handle the sending of LOB data.
A method has been proposed in application Ser. No. 11/847,306 to “de-couple” the sending of LOB data from the sending of regular table data so that the LOB data is essentially handled in a parallel stream that does not stall the mainline data processing stream. This method is specifically for architectures where, for each LOB value associated with a row, references to the location of the data chunks that comprise the LOB value are embedded in the row itself. These references will hereinafter be referred to as “LOB descriptors”.
One significant drawback of the method described in application Ser. No. 11/847,306 is that it requires receiving database partitions to allocate space for an incoming LOB value before the sending partition can begin to send the LOB data for that LOB value. Furthermore, the receiver must send the resulting LOB allocation descriptor for the LOB value to the sender before the sender can begin sending, so that when the sender sends LOB data to the receiver, it can include information about where to store each block of data on the receiver. By having the sender send the information that indicates where each block on the sender should be located on the receiver, the LOB data processing is effectively de-coupled from the non-LOB data processing.
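The ordering constraint described above can be sketched, for illustration only, as a small single-process simulation. It is not the protocol of application Ser. No. 11/847,306 itself: the `Receiver` class, the `send_lob` function, the four-byte `BLOCK_SIZE`, the message counting, and the assumption that the sender initiates the exchange with an allocation request are all hypothetical.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # hypothetical block size in bytes, kept tiny for illustration

@dataclass
class Receiver:
    next_free: int = 0
    storage: dict = field(default_factory=dict)

    def allocate(self, lob_length: int) -> list[int]:
        """Reserve blocks for an incoming LOB value and return its descriptor."""
        n_blocks = -(-lob_length // BLOCK_SIZE)  # ceiling division
        descriptor = list(range(self.next_free, self.next_free + n_blocks))
        self.next_free += n_blocks
        return descriptor

    def store(self, address: int, block: bytes) -> None:
        """Write one block at the address named by the sender."""
        self.storage[address] = block

def send_lob(lob: bytes, receiver: Receiver) -> int:
    """Send one LOB value and return the number of messages exchanged."""
    messages = 1  # assumed: sender asks the receiver to allocate space
    descriptor = receiver.allocate(len(lob))
    messages += 1  # receiver returns the LOB allocation descriptor
    # Only now can the sender start sending: each block carries the
    # receiver-side address taken from the descriptor.
    for i, address in enumerate(descriptor):
        receiver.store(address, lob[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE])
        messages += 1
    return messages

receiver = Receiver()
message_count = send_lob(b"abcdefghij", receiver)
```

In this sketch, every LOB value costs two messages before any LOB data can move, and the sender cannot proceed until the descriptor arrives.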
It is important to note that this method involves a rather complicated protocol between sender and receiver that results in extra communication steps. It is also not optimal from a performance standpoint, as the sender may have resources available to send LOB data but may be stalled waiting for the receiver to send the LOB descriptor for a particular LOB value.
Thus, there is a need for an improved method and apparatus for de-coupling LOB data from regular data during data redistribution.