1. Field Of The Invention
This invention relates generally to improved performance in a multi-processing system and, more particularly, to techniques for reducing access latency by correlating memory allocations with requesting processor location in a multi-processing system.
2. Background Of The Related Art
This section is intended to introduce the reader to various aspects of art which may be related to various aspects of the present invention which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Computer usage has increased dramatically over the past few decades. With the advent of standardized architectures and operating systems, computers have become virtually indispensable for a wide variety of uses from business applications to home computing. Whether a computer system includes a single personal computer or a network of computers, computers today rely on processors, associated chip sets, and memory chips to perform most of the processing of requests throughout the system. The more complex the system architecture, the more difficult it becomes to manage and process the requests efficiently.
Some systems, for example, include multiple processing units or microprocessors connected together via a processor bus. To coordinate the exchange of information among the processors, a cache coherency protocol is generally provided. The cache coherency protocol is further tasked with coordinating the exchange of information between the plurality of processors and the system memory. Cache memory is a special high speed storage mechanism which may be provided as a reserved section of the main memory or as an independent high-speed storage device. Essentially, the cache memory is a portion of the RAM which is made of high speed static RAM (SRAM) rather than the slower and cheaper dynamic RAM (DRAM) which may be used for the remainder of the main memory. When a program needs to access new data, the operating system first checks to see if the data is stored in the main memory before going out to retrieve it from disk. The processor may store a portion of that memory in its cache SRAM. By storing frequently accessed data and instructions in the SRAM, the system can minimize its access to the slower DRAM and thereby increase the request processing speed in the system and improve overall system performance.
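By way of illustration only, the lookup order described above (cache SRAM first, then main memory, then disk) can be sketched as a toy model. The class name `TieredStore`, its capacity parameter, and the least-recently-used eviction rule are assumptions made for this example and are not taken from any particular system.

```python
class TieredStore:
    """Toy model of the lookup order: cache (SRAM), main memory (DRAM), disk."""

    def __init__(self, disk, cache_capacity=2):
        self.disk = dict(disk)        # slowest tier; holds every address
        self.main_memory = {}         # DRAM; filled on demand from disk
        self.cache = {}               # SRAM; small, ordered oldest-to-newest
        self.cache_capacity = cache_capacity  # hypothetical size limit
        self.log = []                 # records which tier served each read

    def read(self, address):
        if address in self.cache:
            self.log.append("cache")
            # Re-insert so this entry becomes the most recently used.
            value = self.cache.pop(address)
            self.cache[address] = value
            return value
        if address in self.main_memory:
            self.log.append("memory")
            value = self.main_memory[address]
        else:
            self.log.append("disk")
            value = self.disk[address]
            self.main_memory[address] = value
        self._fill_cache(address, value)
        return value

    def _fill_cache(self, address, value):
        # Assumed policy: evict the least recently used entry when full.
        if len(self.cache) >= self.cache_capacity:
            oldest = next(iter(self.cache))
            del self.cache[oldest]
        self.cache[address] = value


store = TieredStore({0x10: "a", 0x20: "b", 0x30: "c"})
for addr in (0x10, 0x10, 0x20, 0x30, 0x10):
    store.read(addr)
print(store.log)
```

In this sketch, a repeated read of the same address is served from the cache, while an address evicted from the cache is still served from main memory rather than from disk.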
Each computer generally includes an operating system (O/S), such as DOS, OS/2, UNIX, Windows, etc., to run program applications and perform basic functions, such as recognizing input from the keyboard, sending output to the display screen, keeping track of files and directories stored in memory, and controlling peripheral devices such as disk drives and printers. Operating systems provide a software platform on top of which application programs can run. For large systems, the O/S may allow multiprocessing (running a program on more than one processor), multitasking (allowing more than one program to run with time division scheduling), and multithreading (allowing different parts of a single program to run concurrently on one or more processors). When a computer system is powered up, the O/S generally loads into main memory. The O/S includes a kernel, which is the central module in the operating system. The kernel is the first part of the O/S to load into the main memory, and it remains in main memory while the system is operational. Typically, the kernel is responsible for memory management, process and task scheduling, and disk management. In most systems, the kernel schedules the execution of program segments, or “threads,” for one or more applications.
Regardless of whether the system is a single computer or a network of computers (wherein each individual computer represents a “node” in the system), multiprocessing design schemes are generally implemented for advanced computer systems. Some systems share a single common bus and single memory controller. Others can have one or more memory controllers on a shared bus, while still others can have multiple buses to a single memory controller. As the number of CPUs grows, a single shared resource such as a bus or memory controller becomes a bottleneck to system performance and also creates a volumetric problem, because too many devices must fit near the common resource. A common solution to this problem is to divide the CPUs into small clusters and connect them to each other via some interconnect fabric. Likewise, the system memory can be divided, distributed, and connected via some interconnect fabric. This “distributed memory system” can be implemented through a variety of schemes such as uniform memory access (UMA) or non-uniform memory access (NUMA), as discussed further below.
In a NUMA distributed memory system, all processors in the system are able to access any memory space in the entire system regardless of its proximity to the requesting processor. Each processor makes requests to the memory node containing the specified memory, wherein a caching scheme may be implemented to improve system performance. Regardless of the caching scheme, the distributed system should ensure that all copies of a memory block contain the most recent and correct data. Thus, as soon as a processor writes new data to a cached line, all other cached copies of that line must be invalidated or updated. The method employed to accomplish this is generally referred to as “cache coherency.”
There are two basic categories of cache coherency schemes: “write invalidate,” which invalidates all old cached copies of a changed line, and “write update,” which updates all old cached copies of a changed line. Both cache coherency schemes require sending messages over the memory network to inform the caches of the change. Rather than broadcasting each change to every processor in the system, a shared list is usually provided to track all changes at each corresponding node. Directory-based cache coherency maintains a section of memory which contains memory block sharing information. Snoop-based cache coherency maintains a list with each cached line which denotes which processors are sharing that particular line. Although snooping protocols require more cache memory, the shared list is immediately available without having to perform a directory lookup as required in directory-based protocols.
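By way of illustration only, the difference between the “write invalidate” and “write update” categories can be sketched with a toy model that keeps a sharing list per address. The names `Node` and `CoherentMemory`, the write-through memory, and the message counter are assumptions made for this example. The sketch does not represent any particular protocol or hardware implementation.

```python
class Node:
    """A processing node with a private cache (address -> value)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}   # an absent address means "not cached / invalid"


class CoherentMemory:
    """Toy shared memory with a per-address sharing list (hypothetical model)."""

    def __init__(self, nodes, policy):
        assert policy in ("write_invalidate", "write_update")
        self.nodes = {n.name: n for n in nodes}
        self.policy = policy
        self.memory = {}      # assumed write-through backing store
        self.sharers = {}     # address -> set of node names holding a copy
        self.messages = 0     # coherency messages sent over the network

    def read(self, name, address):
        node = self.nodes[name]
        if address not in node.cache:
            # Miss: fetch from memory and join the sharing list.
            node.cache[address] = self.memory.get(address, 0)
            self.sharers.setdefault(address, set()).add(name)
        return node.cache[address]

    def write(self, name, address, value):
        self.memory[address] = value
        self.nodes[name].cache[address] = value
        sharers = self.sharers.setdefault(address, set())
        sharers.add(name)
        # Notify only the nodes on the sharing list, not every node.
        for other in sorted(sharers - {name}):
            self.messages += 1
            if self.policy == "write_update":
                self.nodes[other].cache[address] = value   # refresh old copy
            else:
                del self.nodes[other].cache[address]       # drop old copy
        if self.policy == "write_invalidate":
            self.sharers[address] = {name}   # writer is now the only sharer


for policy in ("write_invalidate", "write_update"):
    mem = CoherentMemory([Node("A"), Node("B"), Node("C")], policy)
    mem.read("A", 0x40)
    mem.read("B", 0x40)
    mem.write("C", 0x40, 7)
    print(policy, mem.nodes["A"].cache, mem.messages)
```

In both cases one message is sent per sharer. Under write invalidate, a sharer's next read misses and refetches the line. Under write update, the sharer's copy is already current.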
Current distributed memory architectures, such as COMA (cache-only memory architecture) and cc-NUMA (cache coherent NUMA), are generally used in large multi-processor systems wherein the main memories are distributed among the various processing nodes which make up the overall system. cc-NUMA may use a local cache at each node to hold copies of local data, data from other memory nodes, or both. Disadvantageously, moving data from a remote node to a local cache increases access latency. Thus, implementing remote cache architectures, such as cc-NUMA, tends to slow down system performance and increase interconnect utilization. The farther, or more “remote,” the memory segment used in conjunction with a particular processor, the higher the access latency and the lower the system performance.
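By way of illustration only, the relationship between remoteness and access latency can be sketched with a simple linear model. The constants and function names below are hypothetical and chosen purely for the example. They are not measurements of any system, and real interconnects need not scale linearly with hop count.

```python
# Hypothetical figures for illustration only; not measured values.
LOCAL_LATENCY_NS = 100     # assumed latency of an access to local memory
PER_HOP_PENALTY_NS = 60    # assumed extra latency per interconnect hop


def access_latency_ns(hops):
    """Latency of one access to memory `hops` hops away (0 means local)."""
    return LOCAL_LATENCY_NS + PER_HOP_PENALTY_NS * hops


def average_latency_ns(hop_counts):
    """Mean latency over a sequence of accesses, given each one's hop count."""
    return sum(access_latency_ns(h) for h in hop_counts) / len(hop_counts)


all_local = [0, 0, 0, 0]
half_remote = [0, 2, 0, 2]
print(average_latency_ns(all_local), average_latency_ns(half_remote))
```

Under these assumed figures, a workload whose accesses are all local averages lower latency than one in which half of the accesses travel two hops.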
The COMA architecture tries to alleviate some of the overhead involved in the cc-NUMA systems. In a COMA system, additional hardware, including tag and state memory, is added to the DRAM of each processing node to convert it into a kind of cache. This additional hardware enables the disassociation of the actual data location in the machine and the physical address produced by the processors. This enables data to be replicated and migrated automatically upon demand around the system. While this architecture may provide a more flexible platform for applications, it requires complex hardware and data coherence protocols. Thus, the COMA approach generally requires the addition of expensive hardware to handle page migration.
The present invention may address one or more of the problems set forth above.