Parallel processing generally refers to the use of two or more processing cores to execute two or more threads of a software application (or, more generally, to execute two or more of virtually any set of processor-based tasks). For example, a software application may be required to execute a large number of operations or other tasks, such as executing a large number of database queries, performing a large number of mathematical calculations, or performing virtually any other type of software-based task that may be executed at least partially independently of, and therefore in parallel with, other such tasks. Thus, a plurality of execution threads of a software application, or other processor-based tasks, may be divided among a corresponding plurality of processing cores for execution. As is known, such parallel processing of execution threads of an application by a plurality of processing cores may increase the speed of execution of the application as a whole, so that users of the application may be provided with results more quickly and efficiently.
One type of architecture used to implement such parallel processing techniques is known as a uniform memory access (UMA) architecture. As the name implies, such UMA architectures generally assume that each of the processing cores used to implement the parallel processing of the various execution threads has the same access time with respect to one or more memories on which the execution threads (or other related information) are stored.
Thus, parallel processing generally proceeds by executing a plurality of execution threads using a plurality of processing cores. Ideally, the execution threads may be distributed among the available processing cores in a substantially uniform and balanced manner. In practice, however, such load balancing may be difficult to achieve. For example, one processing core may not currently be assigned an execution thread, and thus may be needlessly idle, while one or more other available processing cores are each assigned a plurality of threads for execution thereon. In such situations, efficiencies and other advantages of the parallel processing architecture may be diminished or lost, because completion of the application may be delayed while some execution threads are queued for processing at some processing cores, while other processing cores are idle or otherwise under-utilized.
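By way of non-limiting illustration, the cost of such an imbalance may be sketched as follows. The example assumes, purely for illustration, that each processing core executes its assigned threads one after another and that each thread has a known duration; the function name `makespan` and the sample durations are hypothetical and do not appear in the description above.

```python
def makespan(assignment):
    """Return the completion time of the application as a whole.

    `assignment` is a list with one entry per processing core; each entry
    is a list of (hypothetical) thread durations queued on that core.
    Assumption: a core runs its queued threads back to back, so the
    application finishes when the most heavily loaded core finishes.
    """
    return max((sum(durations) for durations in assignment), default=0)


# Four equal threads queued on one core while three cores sit idle.
imbalanced = [[3, 3, 3, 3], [], [], []]

# The same four threads spread evenly over the four cores.
balanced = [[3], [3], [3], [3]]

print(makespan(imbalanced), makespan(balanced))
```

Under these assumptions, the imbalanced assignment completes only when the single loaded core has worked through its entire queue, whereas the balanced assignment completes in the time of one thread.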
In order to address such difficulties and to attempt to optimize the parallel processing of execution threads of an application, schedulers may be used to remove execution threads from an over-utilized processing core for reassignment to an idle or under-utilized processing core. For example, in such situations, an executing application thread on an over-utilized processing core may be paused, and its associated state information may be transferred to the under-utilized or idle processing core, where execution may resume. In the UMA architecture referenced above, such transfers may be executed substantially without regard for associated latencies caused by the transfer process. That is, since all processing cores have the same (i.e., uniform) access time with respect to the underlying main memory, it is possible for an under-utilized or idle core to receive execution threads from virtually any other available processing core. As a result, for example, such thread transfers may be executed simply by selecting an over-utilized processing core at random from a plurality of over-utilized processing cores, for subsequent transfer of execution threads therefrom to the under-utilized processing core.
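By way of non-limiting illustration, the random selection just described may be sketched as follows. This is a minimal sketch rather than an implementation of any particular scheduler: the per-core queues, the function names `pick_victim_uniform` and `steal_one`, and the rule that a core counts as over-utilized when it holds more than one queued thread are all assumptions made for the example.

```python
import random
from collections import deque


def pick_victim_uniform(queues, idle_core, rng=random):
    """Pick an over-utilized core uniformly at random, or None if none exists.

    Assumption: a core is "over-utilized" when more than one thread is
    queued on it. Memory latency plays no part in the choice, which is
    reasonable only when access times are uniform (UMA).
    """
    candidates = [
        core
        for core, queue in enumerate(queues)
        if core != idle_core and len(queue) > 1
    ]
    if not candidates:
        return None
    return rng.choice(candidates)


def steal_one(queues, idle_core, rng=random):
    """Move one queued thread from a randomly chosen victim to the idle core.

    Returns the index of the victim core, or None if nothing was moved.
    """
    victim = pick_victim_uniform(queues, idle_core, rng)
    if victim is None:
        return None
    thread = queues[victim].pop()      # take the most recently queued thread
    queues[idle_core].append(thread)   # reassign it to the idle core
    return victim


# Core 0 is idle, core 1 holds three threads, core 2 holds one.
queues = [deque(), deque(["a", "b", "c"]), deque(["d"])]
victim = steal_one(queues, idle_core=0)
print(victim, [list(q) for q in queues])
```

In this sketch the thread is represented by a simple label; in practice the transfer would involve the paused thread's associated state information, as noted above.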
Another type of architecture used for parallel processing is known as a non-uniform memory access (NUMA) architecture. In such architectures, the various processing cores may be associated with different memories, may be located on different chip sets or sockets, or may otherwise have variability in their access time with respect to one or more memories in which the various execution threads may be stored for execution. In such architectures, difficulties similar to those described above may occur. Specifically, for example, such NUMA architectures may experience difficulties in achieving a desired load balance, as just described. In NUMA architectures, it is possible to use the same or similar techniques as those referenced above in the context of UMA architectures to address sub-optimalities in load balancing. However, due to the non-uniform nature of memory access in the context of NUMA architectures, an under-utilized or idle processing core may select, for reassignment thereto, a processing core and associated execution thread for which the transfer incurs a large memory latency, relative to other processing cores executing the various execution threads within the NUMA architecture. Consequently, a relative delay may be introduced into the execution of the application as a whole, so that a full benefit of the NUMA architecture is not realized.
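By way of non-limiting illustration, the latency penalty of latency-unaware selection in a NUMA architecture may be sketched as follows. The latency table, its values, and the function names are hypothetical and chosen only for the example; the table assumes four cores on two sockets, with a transfer between sockets costing four times as much as a transfer within a socket.

```python
# Hypothetical relative transfer latencies: LATENCY[i][j] is the assumed
# cost of moving a thread from core j to core i. Cores 0-1 share one
# socket and cores 2-3 share another (an assumption for this example).
LATENCY = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]


def expected_random_latency(idle_core, candidates, latency=LATENCY):
    """Average transfer latency when the victim is chosen uniformly at random."""
    return sum(latency[idle_core][v] for v in candidates) / len(candidates)


def lowest_available_latency(idle_core, candidates, latency=LATENCY):
    """Smallest transfer latency available among the candidate cores."""
    return min(latency[idle_core][v] for v in candidates)


# Core 0 is idle; cores 1, 2 and 3 are all over-utilized.
candidates = [1, 2, 3]
print(expected_random_latency(0, candidates))
print(lowest_available_latency(0, candidates))
```

Under these assumed values, a uniformly random choice frequently lands on a core in the other socket, so its average transfer latency is well above the lowest latency available among the same candidates.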