The present invention relates to workload balancing, and in particular, to distributing workload between resources used to access a data object.
To fully utilize the computing power of a multi-processing system, a larger task (a “parent task”) may be divided into smaller tasks (“work granules”) which are then distributed to processes (“slave processes”) running on one or more processing nodes. Each node in a multi-processing system may contain multiple processors and multiple concurrent processes. The process that divides parent tasks into work granules and distributes the work granules to slave processes on the various processing nodes is referred to herein as the coordinator process.
Databases that run on multi-processing systems typically fall into two categories: shared disk databases and shared nothing databases. A shared disk database expects all disks to be visible to all processing nodes on the computer system on which the database runs. Consequently, a coordinator process in a shared disk database may assign any work granule to a slave process on any node, regardless of the location of the disk that contains the data that will be accessed during the work granule. Shared disk databases may be run on both shared nothing and shared disk computer systems. To run a shared disk database on a shared nothing computer system, software support may be added to the operating system or additional hardware may be provided to allow processes to have direct access to remote disks.
A shared nothing database assumes that a process can only access data if the data is contained on a disk that belongs to the same node as the process. Consequently, a coordinator process in a shared nothing database can only assign a work granule to a slave process if the data to be processed in the work granule resides on a disk in the same node as the process. Shared nothing databases may be run on both shared disk and shared nothing multi-processing systems. To run a shared nothing database on a shared disk machine, a mechanism may be provided for logically dividing the disks so that each of the disks is assigned to a particular node.
The power of database systems that run on multi-processing systems stems from the fact that many processors can be working in parallel on the same task. This power would be wasted, however, if a resource for accessing data, such as a disk controller, became a bottleneck during the parallel execution of the task. For example, assume that a particular parent task requires operations to be performed on data objects that reside on many disks controlled by many disk controllers. The task would be broken up into work granules, each of which would typically require access to data on one of the disks. If the coordinator process initially assigns to all of the slave processes work granules that require access to disks controlled by the same disk controller, then all of the slave processes would have to contend with each other for use of that disk controller. Consequently, that disk controller would become a bottleneck for the task, while the other disk controllers remain idle. In general, the more evenly workload is distributed among access devices, the greater the benefit derived from the parallelism provided by the system architecture. The more skewed the workload distribution, the less efficient the use of the multi-processing system. Ideally, work granules are distributed so all access devices with the same capacity are used at the same rate.
Many factors affect how efficiently a process may execute one work granule relative to other work granules. For example, in a shared database system implemented in a shared nothing computer system, the amount of time required for a process within a node to access data on a disk within the same node (a “local access”) is significantly less than the amount of time required for the same process to access data on a disk within another node (a “remote access”). However, under the assumption that processes on all nodes have equal access to all disks, the coordinator process in some shared databases may assign to a slave process running on a particular node a work granule that accesses data in a different node, even though an unexecuted work granule may be available for the node on which the slave process resides.
To prevent workload skew, and to improve overall system performance, work granules may be assigned to slave processes in a manner that accounts for the location of the data accessed by each work granule. When a work granule is assigned, the coordinator selects, if one is available, a slave process on a node that can access the needed data locally.
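The locality preference described here can be illustrated with a minimal Python sketch. The function name `pick_slave`, the mapping of slave identifiers to node names, and the fallback to any idle slave when no local one exists are all assumptions made for illustration; the text does not specify these details.

```python
from typing import Dict, Optional


def pick_slave(granule_node: str, idle_slaves: Dict[str, str]) -> Optional[str]:
    """Choose an idle slave for a granule whose data lives on granule_node.

    idle_slaves maps a slave id to the node it runs on (hypothetical layout).
    """
    # Prefer a slave on the node that holds the data (local access).
    for slave_id, node in idle_slaves.items():
        if node == granule_node:
            return slave_id
    # Assumed fallback: no local slave is idle, so accept a remote access.
    return next(iter(idle_slaves), None)
```

For example, with idle slaves `{"s1": "n1", "s2": "n2"}`, a granule whose data resides on node `n2` would go to `s2`, while a granule for a node with no idle slave would fall back to `s1`.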
Location of data to be accessed, however, is not the only factor which affects how efficiently a work granule may be executed by a slave process. Another factor that affects how efficiently a work granule may be processed is contention between processes for a device that supplies the needed data. For example, two slave processes on a node may be assigned work granules that require access to different data objects on different disk drives. Although the data objects reside on different disk drives, they may be controlled by the same disk controller. Thus, when the two slave processes execute their assigned work granules, they contend for the same disk controller, interfering with each other and executing less efficiently.
Contention between processes may be avoided by reducing the number of slave processes that concurrently require use of the same resource. However, reducing the number of slave processes that concurrently require use of the same resource may require the system to know which resources would be used during execution of each work granule. Based on this information, a coordinator process could avoid assigning work granules that lead to contention. While information may be available about some resources used to execute a work granule, information may not be available about all resources used to execute the work granules. For example, information about what particular disk controller controls a disk device, or even what disk drive contains a data object, may not be available to a coordinator process assigning the work granules.
Based on the foregoing, it is desirable to provide a system that reduces contention between slave processes for resources accessed during execution of work granules, and in particular, a method that reduces contention in the absence of information about which resources are accessed during execution of each of the work granules.
A method and mechanism are provided for balancing the workload placed on resources used to access a set of data objects.
According to one aspect of the invention, the work granules of a task are distributed to slave processes in a manner that causes the data objects that must be accessed to perform the task to be accessed in a balanced way, such that the number of slave processes accessing any one data object differs from the number accessing any other data object by no more than one. Distributing the work granules in this manner decreases the likelihood that the resources required to access any particular data object will become a bottleneck in performing the task.
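As a small worked illustration of this balance condition, the following Python sketch checks whether a given distribution satisfies it and shows one even spread of slave processes over data objects. The helper names `is_balanced` and `spread` are hypothetical, and the sketch assumes every data object still has work available.

```python
from typing import Dict, List


def is_balanced(slaves_per_object: Dict[str, int]) -> bool:
    """True if per-object slave counts differ by no more than one."""
    if not slaves_per_object:
        return True
    counts = slaves_per_object.values()
    return max(counts) - min(counts) <= 1


def spread(num_slaves: int, objects: List[str]) -> Dict[str, int]:
    """One balanced way to spread num_slaves over a non-empty object list."""
    base, extra = divmod(num_slaves, len(objects))
    # The first `extra` objects each receive one additional slave.
    return {obj: base + (1 if i < extra else 0) for i, obj in enumerate(objects)}
```

With seven slave processes and three data objects, `spread` yields counts of 3, 2, and 2, which `is_balanced` accepts; a distribution of 5, 1, and 1 would be rejected.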
In this context, “data object” refers to an identifiable set of data. The actual granularity of the data objects that are used as the basis for distributing work granules may vary from implementation to implementation. For example, one implementation may distribute the work granules of a task in a manner that causes the files that must be accessed to perform the task to be accessed in a balanced way. Another implementation may distribute the work granules of a task in a manner that causes the tables that must be accessed to perform the task to be accessed in a balanced way. Yet another implementation may distribute the work granules of a task in a manner that causes the table partitions that must be accessed to be accessed in a balanced way.
According to an aspect of the present invention, a task that requires access to a set of data objects is divided into work granules. For each data object in the set of data objects, a work granule list is maintained. The work granule list of each data object identifies the work granules that require access to the data object.
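The per-object work granule lists can be pictured with a short Python sketch. It assumes, for simplicity, that each work granule accesses exactly one data object and that granules arrive as `(granule_id, data_object)` pairs; the function name `build_granule_lists` and this input layout are illustrative, not taken from the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_granule_lists(
    granules: Iterable[Tuple[int, str]]
) -> Dict[str, List[int]]:
    """Group granule ids by the data object each granule must access."""
    lists: Dict[str, List[int]] = defaultdict(list)
    for granule_id, data_object in granules:
        lists[data_object].append(granule_id)
    return dict(lists)
```

Given granules `[(1, "file_a"), (2, "file_b"), (3, "file_a")]`, the result maps `file_a` to granules 1 and 3 and `file_b` to granule 2.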
A slave process is assigned a work granule selected from a set of work granule lists. To select a work granule for a slave process, an initial work granule list with a remaining unassigned granule is picked at random. If the current load of the data object associated with the selected work granule list satisfies a condition, then the slave process is assigned a currently-unassigned work granule from the selected work granule list. Otherwise, the slave process is assigned a currently-unassigned work granule from another work granule list.
In one embodiment, for example, if the quantity of currently-assigned work granules from the selected work granule list equals a “threshold minimum”, then a work granule from the work granule list is assigned to the slave process. If the quantity of work granules does not match the threshold minimum, then another work granule list is selected. The threshold minimum may be, for example, the minimum number of currently-assigned work granules from the work granule list.
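The selection procedure of this embodiment might be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the class name `Coordinator`, its methods, the interpretation of the threshold minimum as the smallest currently-assigned count among lists that still have unassigned granules, and the choice to scan the remaining lists in order after the random starting point are all hypothetical.

```python
import random
from typing import Dict, List, Optional, Tuple


class Coordinator:
    def __init__(self, granule_lists: Dict[str, List[int]],
                 rng: Optional[random.Random] = None) -> None:
        # Unassigned granules per data object, copied so the input is untouched.
        self.unassigned = {obj: list(g) for obj, g in granule_lists.items()}
        # Number of currently-assigned (in-progress) granules per data object.
        self.assigned = {obj: 0 for obj in granule_lists}
        self.rng = rng or random.Random()

    def next_granule(self) -> Optional[Tuple[str, int]]:
        """Return (data_object, granule_id) for an idle slave, or None."""
        candidates = [obj for obj, g in self.unassigned.items() if g]
        if not candidates:
            return None
        # Assumed threshold minimum: lowest load among lists with work left.
        threshold = min(self.assigned[obj] for obj in candidates)
        # Pick an initial list at random, then move on to other lists
        # until one whose load equals the threshold is found.
        start = self.rng.randrange(len(candidates))
        for offset in range(len(candidates)):
            obj = candidates[(start + offset) % len(candidates)]
            if self.assigned[obj] == threshold:
                self.assigned[obj] += 1
                return obj, self.unassigned[obj].pop(0)
        return None  # Not reached: some candidate always meets the threshold.

    def finish(self, obj: str) -> None:
        """Record that a slave completed a granule of data object obj."""
        self.assigned[obj] -= 1
```

Because a granule is only handed out from a list whose load equals the minimum, the loads of lists that still have unassigned granules never drift more than one apart. Once a list is exhausted it drops out of consideration, so slaves are then directed to the remaining lists.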