The present invention relates to accessing data in a multi-processing system, and more specifically, to a method and apparatus for distributing work granules to multiple processes within a computer system.
To fully utilize the computing power of a multi-processing system, a larger task (a “parent task”) may be divided into smaller tasks (“work granules”) which are then distributed to processes running on one or more processing nodes. Each node may contain multiple processors and multiple concurrent processes. The process that divides parent tasks into work granules and distributes the work granules to processes on the various processing nodes is referred to herein as the coordinator process.
Multi-processing computer systems typically fall into three categories: shared everything systems, shared disk systems, and shared nothing systems. The constraints placed on the coordinator process during the work granule distribution process vary based on the type of multi-processing system involved.
In shared everything systems, processes on all processors have direct access to all dynamic memory devices (hereinafter generally referred to as “memory”) and to all static memory devices (hereinafter generally referred to as “disks”) in the system. Consequently, a coordinator process in a shared everything system has few constraints with respect to how work granules may be assigned. However, a high degree of wiring between the various computer components is required to provide shared everything functionality. In addition, there are scalability limits to shared everything architectures.
In shared disk systems, processors and memories are grouped into nodes. Each node in a shared disk system may itself constitute a shared everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within the particular node. Shared disk systems generally require less wiring than shared everything systems. However, shared disk systems are more susceptible to unbalanced workload conditions. For example, if a node has a process that is working on a work granule that requires large amounts of dynamic memory, the memory that belongs to the node may not be large enough to simultaneously store all required data. Consequently, the process may have to swap data into and out of its node's local memory even though large amounts of memory remain available and unused in other nodes.
In shared nothing systems, all processors, memories and disks are grouped into nodes. In shared nothing systems, as in shared disk systems, each node may itself constitute a shared everything system or a shared disk system. Only the processes running on a particular node can directly access the memories and disks within the particular node. Of the three general types of multi-processing systems, shared nothing systems typically require the least amount of wiring between the various system components. However, shared nothing systems are the most susceptible to unbalanced workload conditions. For example, all of the data to be accessed during a particular work granule may reside on the disks of a particular node. Consequently, only processes running within that node can be used to perform the work granule, even though processes on other nodes remain idle.
To more evenly distribute the processing workload in a shared nothing system, the data can be redistributed such that the data that must be accessed for a particular parent task will be evenly distributed among many nodes. However, when data is redistributed to spread the data evenly among nodes based on one criterion, the redistribution may distribute the data unevenly based on another criterion. Therefore, a redistribution optimized for one parent task may actually decrease the performance of other parent tasks.
Databases that run on multi-processing systems typically fall into two categories: shared disk databases and shared nothing databases. A shared disk database expects all disks in the computer system to be visible to all processing nodes. Consequently, a coordinator process in a shared disk database may assign any work granule to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work granule. Shared disk databases may be run on both shared nothing and shared disk computer systems. To run a shared disk database on a shared nothing computer system, software support may be added to the operating system or additional hardware may be provided to allow processes to have direct access to remote disks.
A shared nothing database assumes that a process can only access data if the data is contained on a disk that belongs to the same node as the process. Consequently, a coordinator process in a shared nothing database can only assign a work granule to a process if the data to be processed in the work granule resides on a disk in the same node as the process. Shared nothing databases may be run on both shared disk and shared nothing multi-processing systems. To run a shared nothing database on a shared disk machine, a mechanism may be provided for logically partitioning the disks so that each of the disks is assigned to a particular node.
When run on a shared nothing computer system, the different types of databases experience different types of problems. For example, a shared nothing database is not able to distribute the work granules of a parent task to processes spread evenly between nodes when the data that will be accessed during the parent task (the “accessed data”) is not evenly distributed among the nodes. Consequently, a process on a node within which resides a large portion of the data for a parent task may work a long period of time on a large work granule while processes on other nodes remain dormant after having completed small work granules. In addition, load skew may result when multiple processes are already executing on a node on which resides the data to be accessed in a large work granule.
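The skew described above can be illustrated with a minimal sketch. The node names, the granule sizes, and the function `shared_nothing_loads` are hypothetical and chosen only for illustration; the sketch assumes that a shared nothing coordinator must assign every work granule to the node that holds its data.

```python
from collections import defaultdict

def shared_nothing_loads(granules):
    """Sum the work each node must perform when every granule is pinned
    to the node that stores its data.

    granules: list of (home_node, size) pairs (hypothetical units of work).
    """
    loads = defaultdict(int)
    for home_node, size in granules:
        loads[home_node] += size
    return dict(loads)

# Assumed example: most of the accessed data resides on one node.
skew_granules = [("node_a", 80), ("node_b", 10), ("node_c", 10)]
skew_loads = shared_nothing_loads(skew_granules)

# The parent task finishes only when the busiest node finishes.
skew_busiest = max(skew_loads.values())
skew_average = sum(skew_loads.values()) / len(skew_loads)
```

In this assumed example the busiest node carries 80 units of work while the average per node is roughly 33, so processes on the other two nodes sit idle for most of the parent task.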
The workload distribution within a shared nothing database system may become even more skewed as a consequence of component failure. Specifically, the processing nodes of a shared nothing system are often grouped so that each given node has a companion node that will assume the responsibilities of the given node if the given node fails. Thus, when the processing or memory component of a node fails, processes on the companion node of the failed node must perform all work granules that access data on disks of the failed node as well as all work granules that access disks that belong to itself. Assuming that work granules were distributed evenly among a set of nodes in an initial distribution, the fact that processes on the companion node of the failed node must perform twice the work of processes on the other nodes significantly skews the workload distribution.
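The effect of a node failure on an initially even distribution can be sketched as follows. The node names, the load figures, and the helper `loads_after_failure` are assumptions made for illustration only; the sketch simply moves all work of the failed node onto its companion node.

```python
def loads_after_failure(loads, failed_node, companion_of):
    """Return per-node loads after `failed_node` fails and its companion
    takes over its work granules.

    loads: dict mapping node name to units of work.
    companion_of: dict mapping each node to its companion node.
    """
    result = dict(loads)                  # leave the caller's dict unchanged
    orphaned = result.pop(failed_node)    # work the failed node can no longer do
    result[companion_of[failed_node]] += orphaned
    return result

# Assumed example: four nodes with an even initial distribution.
even_loads = {"n1": 10, "n2": 10, "n3": 10, "n4": 10}
companions = {"n1": "n2", "n2": "n1", "n3": "n4", "n4": "n3"}

failed_loads = loads_after_failure(even_loads, "n1", companions)
```

Here the companion node `n2` ends up with 20 units of work while `n3` and `n4` still have 10 each, which is the twofold skew described above.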
Shared disk databases have the disadvantage that the coordinator process may assign a work granule that may be performed more efficiently by a process on one node to a process on another node. Specifically, in a shared nothing computer system, the amount of time required for a process within a node to access data on a disk within the same node (a “local access”) is significantly less than the amount of time required for the same process to access data on a disk within another node (a “remote access”). However, under the assumption that processes on all nodes have equal access to all disks, the coordinator process in a shared disk database may assign a work granule that accesses data in a first node to a process running on a second node even if a process on the first node is available to perform the work granule.
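The cost of ignoring data location can be illustrated with a small sketch. The per-unit costs `LOCAL_COST` and `REMOTE_COST` are assumed values, not figures from this description, and the function names are hypothetical.

```python
LOCAL_COST = 1.0   # assumed cost per unit of data for a local access
REMOTE_COST = 3.0  # assumed cost per unit of data for a remote access

def access_cost(process_node, data_node, size):
    """Cost for a process on `process_node` to read `size` units of data
    stored on `data_node`."""
    unit = LOCAL_COST if process_node == data_node else REMOTE_COST
    return unit * size

def total_cost(assignment):
    """assignment: list of (process_node, data_node, size) triples."""
    return sum(access_cost(p, d, s) for p, d, s in assignment)

# A coordinator that ignores location may cross-assign two granules ...
location_blind = [("node_b", "node_a", 10), ("node_a", "node_b", 10)]
# ... whereas a location-aware assignment keeps every access local.
location_aware = [("node_a", "node_a", 10), ("node_b", "node_b", 10)]

blind_cost = total_cost(location_blind)
aware_cost = total_cost(location_aware)
```

Under these assumed costs the location-blind assignment costs 60 units and the location-aware assignment costs 20, although both use the same two processes and the same two work granules.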
Based on the foregoing, it is desirable to provide a system in which work granules are assigned to processes in a way that distributes the work of a parent task evenly and efficiently. It is further desirable to be able to assign to a process on a first node a work granule that accesses data on a second node if the second node is busy executing other processes, and to assign the work granule directly to a process on the second node if the second node is not busy executing other processes.
A method and apparatus for distributing work granules of a parent task among processes on various nodes in a multi-processing computer system is provided. The parent task is divided into work granules of varying sizes based on the location of the data that must be accessed to perform the work granules. Each of the processes that will be assisting in the execution of the parent task is initially assigned a work granule based on efficiency considerations. Such efficiency considerations may include, for example, the location of the data to be accessed relative to the process, the current I/O load of the devices on which data to be accessed is stored, and the relative sizes of the various work granules. When a process completes the work granule assigned to it, it is assigned one of the remaining unassigned work granules. The assignment of subsequent work granules is made based on efficiency considerations. This process continues until all of the work granules have been completed.
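The assignment loop summarized above can be sketched as a small simulation. This is an illustrative sketch under stated assumptions, not the claimed implementation: the names `pick_granule` and `run_coordinator`, the constant `REMOTE_PENALTY`, and the policy of preferring the largest local work granule are all hypothetical. Only two of the efficiency considerations mentioned above (data location and relative granule size) are modeled; current I/O load is omitted.

```python
import heapq

REMOTE_PENALTY = 2.0  # assumed: a remote access takes twice as long as a local one

def pick_granule(node, unassigned):
    """Prefer the largest granule whose data is local to `node`;
    fall back to the largest remaining granule if none is local."""
    local = [g for g in unassigned if g[0] == node]
    pool = local if local else unassigned
    return max(pool, key=lambda g: g[1])

def run_coordinator(processes, granules):
    """Simulate a coordinator handing out work granules.

    processes: list of (process_id, node).
    granules: list of (home_node, size).
    Returns (finish_time, assignments), where assignments lists
    (process_id, granule) in the order the granules were handed out.
    """
    unassigned = list(granules)
    assignments = []
    # Queue of (time_when_free, process_id, node); every process is free at t=0,
    # so the first pass through the loop performs the initial assignment.
    free_at = [(0.0, pid, node) for pid, node in processes]
    heapq.heapify(free_at)
    finish_time = 0.0
    while unassigned:
        now, pid, node = heapq.heappop(free_at)   # next process to become free
        granule = pick_granule(node, unassigned)
        unassigned.remove(granule)
        home, size = granule
        cost = size if home == node else size * REMOTE_PENALTY
        done = now + cost
        finish_time = max(finish_time, done)
        assignments.append((pid, granule))
        heapq.heappush(free_at, (done, pid, node))
    return finish_time, assignments

# Assumed example: three granules on node A, one small granule on node B.
demo_processes = [("p1", "A"), ("p2", "B")]
demo_granules = [("A", 4), ("A", 4), ("A", 4), ("B", 2)]
demo_finish, demo_assignments = run_coordinator(demo_processes, demo_granules)
```

In this assumed example, process `p2` finishes its local granule at time 2 and is then given a granule whose data resides on node A rather than remaining idle. The parent task completes at time 10, compared with time 12 if `p1` alone had to perform all three node A granules.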