1. Technical Field
The present invention generally relates to multi-processor data processing systems and, in particular, to operations on data processing systems configured with multiple independent processing nodes. Still more particularly, the present invention relates to a method and system for completing parallel processing of work items of a single work set distributed across multiple processing units of a multi-node data processing system.
2. Description of the Related Art
Multi-core data processing systems are widely utilized to enable parallel processing of data that can be divided into portions for completion. There are various topologies of multi-core systems, of which the non-uniform memory access (NUMA) system topology is one example. To support process scheduling or work scheduling on distributed processing systems such as the NUMA system, a separate queue is provided for each processing node because it is assumed that the latency between nodes (e.g., communication latency, data transfer latency, etc.) is too great, according to some metric, for the nodes to share a common queue for scheduling work. For example, a memory bus (such as a POWER5™ (P5) bus) may operate at a data transfer rate that yields a data transfer latency too great for multiple nodes to share a common queue. Thus, with these types of multi-node processing systems, work processes and associated data must be divided among the separate work queues ahead of work dispatch and execution. Once execution of the work begins in the different processing nodes, a work stealing system/algorithm is then utilized to rebalance the workload in the separate queues. Implementation of these work stealing algorithms adds substantial complexity to the scheduler. This complexity can often lead to inefficient run scenarios in which work is continuously “balanced” or “re-balanced” between or among two or more nodes.
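By way of illustration only, the conventional arrangement described above may be sketched as follows. This is a minimal single-threaded simulation, not an implementation taken from any particular system. The function names (partition_work, steal, run), the round-robin partitioning, the "steal from the fullest queue" policy, and the per-round speeds parameter are all assumptions introduced here for the example.

```python
from collections import deque


def partition_work(items, num_nodes):
    """Divide work items round-robin among per-node queues ahead of dispatch."""
    queues = [deque() for _ in range(num_nodes)]
    for i, item in enumerate(items):
        queues[i % num_nodes].append(item)
    return queues


def steal(queues, thief):
    """Move one item from the tail of the fullest other queue to the thief.

    Returns True if an item was moved, False if there was nothing to steal.
    The victim-selection policy is an assumption made for this sketch.
    """
    others = (n for n in range(len(queues)) if n != thief)
    victim = max(others, key=lambda n: len(queues[n]), default=None)
    if victim is None or not queues[victim]:
        return False
    queues[thief].append(queues[victim].pop())
    return True


def run(queues, speeds):
    """Simulate execution in rounds; node n completes up to speeds[n] items per round.

    Returns (items completed per node, total number of steals).
    """
    done = [0] * len(queues)
    steals = 0
    while any(queues):  # continue while any per-node queue is non-empty
        for node, speed in enumerate(speeds):
            for _ in range(speed):
                if not queues[node]:
                    # Local queue is empty: rebalance by stealing from another node.
                    if not steal(queues, node):
                        break
                    steals += 1
                queues[node].popleft()
                done[node] += 1
    return done, steals


if __name__ == "__main__":
    node_queues = partition_work(range(12), num_nodes=2)
    completed, steal_count = run(node_queues, speeds=[3, 1])
    print(completed, steal_count)
```

Even in this simplified form, the scheduler needs victim-selection and rebalancing logic in addition to the up-front partitioning, and a faster node repeatedly drains its own queue and returns to steal from the slower one, which is the kind of overhead the preceding discussion attributes to work stealing.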