Computers and computer-based systems are in widespread use today, ranging from desktop personal computers to sophisticated global computer networks that provide the backbone for today's World Wide Web. As a result, computers form an integral part of modern society.
Computers have been developed in a wide variety of electronic manufacturing and system architecture configurations, depending upon the intended use. At the core of each computer is a central processing unit (CPU) that performs the processing of program data and instructions necessary for the execution of a desired task, called a “program” by practitioners of the art. During execution, a CPU communicates with memory units composed of random access memory (RAM), either in main memory or in cache memories. Cache memories typically provide each CPU (or, less commonly, groups of CPUs) with a higher-speed copy of selected portions of the data in main memory. The memory units are used to store and retrieve a program's data and instructions during its execution. Generally, execution of most programs requires millions of memory accesses between a CPU and the memory units. The speed with which a memory unit can provide access to the data for a CPU can therefore greatly affect the overall performance of a computer. As the processing speed of CPUs steadily increases, faster delivery of data to a CPU from its memory units becomes increasingly important. Consequently, modern CPU architectures typically implement a hierarchy of caches, starting with a very small and extremely fast first-level cache and adding further cache levels that are progressively slower but able to hold more program data and instructions.
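The effect of such a hierarchy on access delay can be illustrated with a short calculation. The following is a minimal sketch only, assuming that each access tries the cache levels in order and pays the latency of every level it reaches. The function name and the hit rates and latencies in the usage lines are hypothetical values chosen for illustration, not measurements of any particular system.

```python
def average_access_time(levels, memory_latency):
    """Expected cost of one memory access through a cache hierarchy.

    levels: list of (hit_rate, latency) pairs, ordered from the fastest
            (first-level) cache to the slowest cache.
    memory_latency: cost of going all the way to main memory.
    """
    expected = 0.0
    reach_probability = 1.0  # chance that an access gets as far as this level
    for hit_rate, latency in levels:
        # Every access that reaches this level pays its latency.
        expected += reach_probability * latency
        # Only the misses continue on to the next, slower level.
        reach_probability *= 1.0 - hit_rate
    # Accesses that missed in every cache pay the main-memory latency.
    return expected + reach_probability * memory_latency


# Hypothetical figures: a 1-cycle first-level cache with a 90% hit rate,
# a 10-cycle second-level cache with an 80% hit rate, 100-cycle main memory.
with_caches = average_access_time([(0.9, 1), (0.8, 10)], 100)
without_caches = average_access_time([], 100)
print(with_caches, without_caches)
```

With these assumed figures the average cost per access is roughly 4 cycles with the caches, compared with 100 cycles when every access goes to main memory.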
Currently, methods for reducing the delay time associated with accessing data in memory units are based on a combination of memory access speed and the proximity of each memory unit to the CPU. Cache (faster) memory units can be placed closer to the CPU than main (slower) memory because of their smaller size and because typically only some, and not all, of the CPUs in the system need to access a single cache. In fact, caches are frequently placed inside the CPU unit to minimize the distance between the CPU and the cache. Two factors necessitate access to main memory. First, cache memory can generally store less data than main memory and therefore cannot accommodate all the program data and instructions required for the execution of a typical task. When there is not enough available space in cache memory, some of the program data and instructions contained in the cache need to be relocated (or “evicted”) to make space for new program data and instructions. Second, because a cache memory may not be accessible to all the CPUs in a multiple-CPU system, any portions of its data that a particular CPU has modified need to be written back to main memory before they can be accessed by any of the other CPUs that do not share the same cache. Minimizing these two factors can result in substantial improvements in overall computer performance.
To minimize cache eviction and thereby maximize the effectiveness of cache memory, various well-known techniques are used to estimate the program data and instructions most frequently accessed by a CPU in a given time interval, so that these data can be retained in the faster cache units. The general underlying principle of these techniques is that computer programs tend to access small portions of their data and instructions, which fit in the cache, during a given time interval. The first time a program accesses its data and instructions, they are loaded into the cache and can be accessed rapidly thereafter. When a CPU proceeds to the execution of another sub-program, the pertinent new program data and instructions are also loaded into the cache from main memory for faster access. In this way, a CPU needs to access data in main memory only once in any small interval of time. When the cache becomes full, special hardware evicts (i.e., overwrites) the least-recently-used instructions and data in the cache. Thus, the longer the time since the last access to a given portion of memory, the less likely it is that the data will later be found in the cache.
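The least-recently-used eviction described above is performed by hardware, but its behavior can be modeled in a few lines of software. The following is a simplified sketch, assuming a cache that holds a fixed number of entries keyed by address. The class name, the `capacity` parameter, and the `load_from_memory` callback are hypothetical and stand in for the hardware mechanisms.

```python
from collections import OrderedDict


class LRUCache:
    """Software model of a cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> data, least recently used first
        self.hits = 0
        self.misses = 0

    def access(self, address, load_from_memory=lambda addr: f"data@{addr}"):
        if address in self.lines:
            # Hit: mark this entry as the most recently used.
            self.lines.move_to_end(address)
            self.hits += 1
            return self.lines[address]
        # Miss: the data must be fetched from the slower main memory.
        self.misses += 1
        if len(self.lines) >= self.capacity:
            # Cache is full: evict the least recently used entry.
            self.lines.popitem(last=False)
        self.lines[address] = load_from_memory(address)
        return self.lines[address]


cache = LRUCache(capacity=2)
for address in (1, 2, 1, 3):
    cache.access(address)
# Address 2 was the least recently used entry when address 3 arrived,
# so it was evicted while address 1 stayed in the cache.
print(list(cache.lines), cache.hits, cache.misses)
```

In this model, an address that has not been touched for a long time drifts to the front of the ordering and is the first to be overwritten, which matches the observation that older data is less likely to be found in the cache.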
Modern computer systems utilize CPU “time-slicing” techniques to simulate concurrent processing of multiple tasks, such as those of different users. Time-slicing techniques execute a portion of a first task in a CPU for a fraction of a second before interrupting the CPU and instructing it to execute a second task. This process continues from one task to the next until the first task once again gets a turn to execute. Each subsequent task overwrites some of the first task's data and instructions in the cache, so that when the first task returns for execution, little or none of its program data and instructions may still be in the cache; they must therefore be “reloaded” into the cache from the relatively slower main memory.
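The reload cost of time slicing can be seen in a small simulation. The following sketch assumes a single cache with least-recently-used eviction that is shared by all tasks on one CPU, with each task touching a fixed set of addresses during its time slice. The function name, its parameters, and the sizes in the usage lines are hypothetical.

```python
from collections import OrderedDict


def run_time_slices(working_sets, cache_capacity, rounds):
    """Run tasks round-robin over one shared LRU cache.

    working_sets: one sequence of addresses per task, touched once per slice.
    Returns misses[round][task], the number of main-memory loads per slice.
    """
    cache = OrderedDict()  # (task, address) keys, least recently used first
    misses = [[0] * len(working_sets) for _ in range(rounds)]
    for round_number in range(rounds):
        for task_id, addresses in enumerate(working_sets):
            for address in addresses:
                key = (task_id, address)
                if key in cache:
                    cache.move_to_end(key)  # hit: refresh recency
                    continue
                misses[round_number][task_id] += 1
                if len(cache) >= cache_capacity:
                    cache.popitem(last=False)  # evict least recently used
                cache[key] = True
    return misses


two_tasks = [range(4), range(4)]
# Cache holds only one task's data: every slice reloads everything.
print(run_time_slices(two_tasks, cache_capacity=4, rounds=2))
# Cache holds both tasks' data: the second round needs no reloads.
print(run_time_slices(two_tasks, cache_capacity=8, rounds=2))
```

When the cache is too small for both tasks, each task finds none of its data on returning and pays the full reload cost again. With a cache large enough for both, only the first slice of each task incurs misses.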
As described above, because a cache memory may not be accessible to all the CPUs in a multiple-CPU system, any portions of its data that a particular CPU has modified need to be written back to main memory before they can be accessed by any of the other CPUs that do not share the same cache. This factor becomes particularly evident on multiple-CPU computer systems. In such computer systems, the operating system makes multiple CPUs available for the execution of tasks, which are typically divided amongst a number of CPUs for faster overall processing. One such multi-CPU environment is the Symmetrical Multi-Processor (SMP) environment, in which multiple CPUs share a single main memory. In SMP systems, when two or more CPUs need to access the contents of the same portion of main memory, they must take turns doing so, which reduces the effectiveness of SMP for faster processing. Another multi-CPU environment is a NuMA™ environment, in which each of several groups of CPUs has direct access to a predetermined subset of main memory. In a NuMA™ environment, unlike in an SMP environment, CPUs in one group do not have direct access to the memory units of another group of CPUs. Consequently, while this approach reduces competition among CPUs for a main memory location, it limits the number of CPUs that can work efficiently on a single task.
In a third approach, known as Cellular Multi-Processor (CMP) architecture, all CPUs share a single main memory (as in an SMP environment), but take advantage of special memory caches, known as third-level caches (TLCs), each of which is shared amongst a group (called a “subpod”) of CPUs. The TLC provides a large cache that can store much more program data and instructions than the smaller, faster cache levels. Because each TLC is shared among a group of CPUs, such as a group of four CPUs, other CPUs in the same group can share data in the cache, resulting in more efficient use of the cache.
As with any cache, the performance improvements CMP gains from the use of TLCs depend on program data and instructions staying in the cache as long as possible. In a multiple-CPU system, a task often has an opportunity to run on a different CPU instead of waiting for the CPU on which it was last executed to become available. The benefits of switching to another CPU, however, can be substantially reduced by the added delay associated with reloading all of the task's program data and instructions into a different cache. For this reason, a system's performance may improve if tasks are discouraged from frequently switching CPUs. Likewise, in a CMP environment, system performance may improve if tasks are discouraged from switching from the CPUs of one group of CPUs (such as a subpod) to the CPUs of another group of CPUs. This is because, in a CMP system, all the CPUs in a group of CPUs share a common TLC, so when a task switches to another CPU in the same group of CPUs, the task's data and instructions do not need to be reloaded from one TLC to another TLC.
Another benefit of restricting task switching between subpods in a CMP environment becomes most evident whenever a task splits itself into two or more concurrent sub-tasks called program threads, or, simply, threads. Program threads are sub-tasks that can be performed concurrently with only occasional need to communicate their results to one another. When threads do need to communicate, they often do so through a pre-designated memory location. If the threads that share such a pre-designated memory location are allowed to execute on CPUs in different subpods, then every access to that memory location must be carefully coordinated, because one CPU may have altered the contents of the memory in its own TLC, and such a change would not be visible to the other thread on the other TLC without such coordination. Such coordination among TLCs, however, is time-consuming, and while it takes place some CPUs may sit idle, waiting for the TLCs to determine which CPU will be allowed to modify the data.
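The coordination cost can be approximated by counting how often the right to modify a shared memory location has to move from one TLC to another. The following is a deliberately simplified model, not a description of any actual cache-coordination protocol. It assumes that CPUs are numbered consecutively, that each subpod holds four CPUs (the example group size mentioned earlier), and that a write by a CPU in a different subpod always forces one such transfer. The function name and parameters are hypothetical.

```python
def count_ownership_transfers(writer_cpus, cpus_per_subpod=4):
    """Count TLC-to-TLC hand-offs for a sequence of writes to one location.

    writer_cpus: CPU numbers, in the order in which they write the location.
    Assumes CPU n belongs to subpod n // cpus_per_subpod.
    """
    owner = None  # subpod whose TLC currently holds the modifiable copy
    transfers = 0
    for cpu in writer_cpus:
        subpod = cpu // cpus_per_subpod
        if owner is not None and owner != subpod:
            # The other subpod's TLC must give up the data first.
            transfers += 1
        owner = subpod
    return transfers


# Two threads alternating writes from CPUs 0 and 1 (same subpod).
print(count_ownership_transfers([0, 1, 0, 1]))
# The same threads on CPUs 0 and 4 (different subpods).
print(count_ownership_transfers([0, 4, 0, 4]))
```

Under these assumptions, threads that stay within one subpod never trigger a transfer, while threads on different subpods trigger one on every alternation.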
One existing approach to the foregoing problem is to use task affinitization, wherein a task (and all its program threads) is “affinitized” to a group of CPUs (in this case, a subpod). In other words, a task affinitized to a group of CPUs is executed only within that group of CPUs. While this approach may reduce the time delays associated with the transfer of data amongst the TLCs of different groups of CPUs, it restricts the execution of a task or tasks to a particular group of CPUs (which is necessarily smaller than the total number of available CPUs in the system) and therefore compromises the benefits of having multiple CPUs.
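The trade-off of affinitization can be sketched as a CPU-selection rule. The following illustration assumes consecutively numbered CPUs grouped four to a subpod and a scheduler that simply picks the lowest-numbered idle CPU the task is permitted to use. The function name and parameters are hypothetical and do not represent any particular operating system's scheduler.

```python
def pick_cpu(idle_cpus, affinity_subpod=None, cpus_per_subpod=4):
    """Return the lowest-numbered idle CPU the task may run on.

    affinity_subpod: subpod the task is affinitized to, or None if the
                     task may run anywhere.
    Returns None when no permitted CPU is idle, so the task must wait.
    """
    candidates = [
        cpu
        for cpu in sorted(idle_cpus)
        if affinity_subpod is None or cpu // cpus_per_subpod == affinity_subpod
    ]
    return candidates[0] if candidates else None


idle = {5, 6}  # only CPUs in subpod 1 are idle
# An unrestricted task can start at once on an idle CPU.
print(pick_cpu(idle))
# A task affinitized to subpod 0 must wait even though CPUs are idle.
print(pick_cpu(idle, affinity_subpod=0))
```

The second call shows the limitation noted above: the affinitized task keeps its data in one TLC, but it cannot use idle CPUs outside its subpod.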
An ongoing need thus exists in a CMP system to minimize the sharing of data between program threads executing on CPUs in different groups of CPUs, and to increase the number of CPUs available for executing a given task's program threads.