The present invention relates to parallel execution of task by a set of processes, and in particular, dividing a task into subtasks and assigning the subtasks to processes.
To fully utilize the computing power of a multi-processing system, a larger task (a xe2x80x9cparent taskxe2x80x9d) may be divided into smaller tasks (xe2x80x9cwork granulesxe2x80x9d) which are then distributed to processes (xe2x80x9cslave processesxe2x80x9d) running on one or more processing nodes. Each node in a multi-processing system may contain multiple processors and multiple concurrent processes. The process that divides parent tasks into work granules and distributes the work granules to slave processes on the various processing nodes is referred to herein as the coordinator process.
Databases that run on multi-processing systems typically fall into two categories: shared disk databases and shared nothing databases. A shared disk database expects all disks to be visible to all processing nodes on the computer system on which the database runs. Consequently, a coordinator process in a shared disk database may assign any work granule to a slave process on any node, regardless of the location of the disk that contains the data that will be accessed during the work granule. Shared disk databases may be run on both shared nothing and shared disk computer systems. To run a shared disk database on a shared nothing computer system, software support may be added to the operating system or additional hardware may be provided to allow processes to have direct access to remote disks.
A shared nothing database assumes that a process can only access data if the data is contained on a disk that belongs to the same node as the process. Consequently, a coordinator process in a shared nothing database can only assign a work granule to a slave process if the data to be processed in the work granule resides on a disk in the same node as the process. Shared nothing databases may be run on both shared disk and shared nothing multi-processing systems. To run a shared nothing database on a shared disk machine, a mechanism may be provided for logically dividing the disks so that each of the disks is assigned to a particular node.
The power of database systems that run on multi-processing systems stems from the fact that many processors can be working in parallel on the same task. This power would be wasted if all work is assigned to a process running on a single node within the system, while processes on the other nodes remained idle. In general, the more evenly work is distributed, the greater the benefit derived from the parallelism provided by the system architecture. The more skewed the workload distribution, the less efficient the use of the multi-processing system. Ideally, work granules are distributed so all slave processes working on the work granules of the same parent task will complete their work granules at the same time.
There are numerous causes for skewed workload distributions. For example, in a shared disk database, a coordinator process may assign a work granule that may be performed more efficiently by a slave process on one node to a slave process on another node due to the location of data accessed by the work granule. Specifically, in a shared nothing computer system, the amount of time required for a process within a node to access data on a disk within the same node (a xe2x80x9clocal accessxe2x80x9d) is significantly less than the amount of time required for the same process to access data on a disk within another node (a xe2x80x9cremote accessxe2x80x9d). However, under the assumption that processes on all nodes have equal access to all disks, the coordinator process in some shared databases may assign a work granule that accesses data in a first node to a slave process running on a second node even if a slave process on the first node is available to perform the work granule. Because a slave process that is executing an assigned work granule may have to remotely access data, the slave process may be executing the granule after other slave processes have completed execution of work granules assigned to the slave processes.
The process of managing work granules to reduce skewed work load distributions is referred to herein as work load skew management. One approach to work load skew management is referred to as the 9-3-1 technique. Under the 9-3-1 technique, work is divided among a set of slave processes into equal parts, one part for each slave process. Each part is then subdivided into work granules according to the ratio 9-3-1. For example, assume that an amount of work W is to be shared between three slave processes. The work is divided into three parts, one for each of the slave processes. Each part is further subdivided according to the ratio 9-3-1. Thus, work W is ultimately apportioned to a slave process as follows: 9/13*1/3 W, 3/13*1/3 W, and 1/13*1/3 W.
When the coordinator process assigns a work granule to a slave process, it first assigns the largest of the work granules for the slave process, i.e. the 9/13 portion. When a slave process completes a work granule, the coordinator process assigns to the slave process one of the remaining smaller work granules for the slave process. If a slave process completes all of its work granules, and another slave process has not completed its work granules, the coordinator xe2x80x9cstealsxe2x80x9d the work granule from the busy slave process and assigns it to the slave process that has completed its own work granules. Consequently, both slave processes continue to perform work, and the work load distribution skew is reduced, if not avoided.
A disadvantage of the 9-3-1 technique is that it is usually difficult to both split a task into the dictated portions and split the task in a manner that accounts for other efficiency considerations. For example, a task AB entails access to data that is equally distributed at a location A and location B. Data at location A is locally accessible from a node A, data at location B is locally accessible from a node B. Three slave processes are available to execute the task. A process A1 and A2 executes on node A, a process B1 executes on node B. To distribute the task according to the 9-3-1 technique, the largest of the work granules designated for process A2 is assigned a portion of the task that entails access to data at location A and location B. Data at location B is remotely accessed by process A2, causing A2 to execute its largest work granule much more slowly. The smaller work granules for process A2 are redistributed to process A1 and B1. However, due to slowness of the execution of process A2, the redistribution of work load does not significantly offset the disparity in execution times between process A1 and B1, and process B2. Thus, in applying the 9-3-1 technique, not only were other efficiency considerations ignored, but as a result of ignoring them, the work distribution skew may have been actually worsened.
Another disadvantage of the 9-3-1 technique is in allocating work to a work granule that is substantially larger than other work granules, the technique puts xe2x80x9ctoo many eggs in one basketxe2x80x9d. If the largest work granule is executed slowly, the work load skew is not only created, but may be substantially worsened. This may occur even when work is divided in a manner that accounts for efficiency considerations, such as locality of data.
For example, a task C entails access to data, in the form of rows in a table that reside at a location A. Data at location A is locally accessible from a node A. Three slave processes on node A are available to execute task A, and are assigned work according to the 9-3-1 technique. The largest work granules are assigned to each of the three slave processes. The slave processes apply a filter to the rows they read. The rows that pass the filter are stored by the slave processes. For two of the slave processes, relatively few of the rows read by them pass the filter, and relatively few of the read rows are stored. However, for a third slave process, most rows read pass the filter. The third slave process thus expends a relatively large amount of work storing the rows, writing them to a buffer on disk, and, as a consequence, executes its larger work granule very slowly, causing significant disparity in execution times. The disparity is further aggravated by a relatively large portion of work originally designated for the third slave process being assigned to its largest work granule.
Based on the foregoing, it desirable to provide a work load skew management mechanism that dynamically transfers work designated for a process to another process and that allows work to be divided and assigned to processes in a manner that accounts for multiple efficiency considerations.
The foregoing needs and objects, and other needs and objects that will become apparent from the following description, are achieved by the present invention, which comprises, in one aspect, a mechanism for managing work granules executed in parallel.
According to an aspect of the present invention, a task is evenly divided between a number of work granules. The number of work granules falls between a threshold minimum and a threshold maximum. The threshold minimum and maximum may be configured to balance a variety of efficiency factors affected by the number of work granules, including workload skew and overhead incurred in managing a larger number of work granules.
According to another aspect of the present invention, work granules are distributed to processes on nodes according to which of the nodes, if any, may execute the work granule more efficiently than other nodes. A variety of factors may be used to determine where a work granule may be performed most efficiently, including which nodes have local access to the data accessed during the execution of a work granule. Finally, work granules are classified according to whether they access sections of tables referred to as partitions.