1. Field of the Invention
The present invention relates, in general, to microprocessor systems, and, more particularly, to systems and methods for providing cache coherency and atomic transactions in a multiprocessor computer system.
2. Relevant Background
Microprocessors manipulate data according to instructions specified by a computer program. The instructions and data in a conventional system are stored in memory which is coupled to the processor by a memory bus. Computer programs are increasingly compiled to take advantage of parallelism. Parallelism enables a complex program to be executed as a plurality of less complex routines run at the same time to improve performance.
Traditionally, microprocessors were designed to handle a single stream of instructions in an environment where the microprocessor had full control over the memory address space. Multiprocessor computer systems were developed to decrease execution time by providing a plurality of data processors operating in parallel. Early multiprocessor systems used special-purpose processors that included features specifically designed to coordinate the activities of the plurality of processors. Moreover, software was often specifically compiled to a particular multiprocessor platform. These factors made multiprocessing expensive to obtain and maintain.
The increasing availability of low-cost high performance microprocessors makes general purpose multiprocessing computers feasible. As used herein the terms “microprocessor” and “processor” include complex instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids. However, general purpose microprocessors are not designed specifically for large scale multiprocessing. Some microprocessors support configurations of up to four processors in a system. To go beyond these limits, special purpose hardware, firmware, and software must be employed to coordinate the activities of the various microprocessors in a system.
Memory management is one of the more difficult coordination problems faced by multiprocessor system designers. Essentially, the problems surround ensuring that each processor has a consistent view of data stored in memory at any given time while each processor is operating simultaneously to access and modify the memory contents. This problem becomes quite complex as the number of processors increases.
Two basic architectures have evolved, sometimes referred to as “shared memory” and “distributed memory”. Distributed memory assigns a unique range of memory to each processor or to a small group of processors. In a distributed memory system only the small number of processors assigned to a given memory space need to coordinate their memory access activities. This assignment greatly simplifies the tasks associated with accessing and manipulating the memory contents. However, because the memory is physically partitioned there is less synergism and therefore less performance gain achieved by adding additional processors.
It is advantageous if all of the processors share a common memory address space. In a shared memory system hardware and software mechanisms are used to maintain a consistent view of the memory from each processor's perspective. Shared memory enables the processors to work on related processes and share data between processes. Shared memory systems offer a potential for greater synergism between processors, but at the cost of greater complexity in memory management.
One of the key advances that have resulted in microprocessor performance improvements is the integration of cache memory on chip with other microprocessor functional units. Cache memory enables a processor to store copies of frequently used portions of data and instructions from the main memory subsystem in a closely coupled, low latency cache memory. When data is supplied by cache memory, the latency associated with main memory access is eliminated. Inherent in a cache memory system is a need to guarantee coherency between the data copy in one or more processor cache(s) and the main memory itself.
In a single processor system coherency is a relatively straightforward matter because only one cache system hierarchy exists for a given memory address space. Likewise in partitioned memory systems one, or at most a few, cache subsystems correspond to a single memory partition. However, in shared memory systems a given memory location may be stored in any cache of any processor in the system. In order for one processor to manipulate the data in its own cache or main memory it must ensure that no other processor can manipulate the data at the same time. In typical systems this requires high bandwidth communication between the cache subsystems and/or processors. Further, conventional microprocessors require specialized hardware support to scale the operating system and application software to operate on processor counts greater than four to eight processors.
Cache systems are organized as a plurality of cache lines that are typically accessed as an atomic unit. Each cache line is associated with a set of state information that indicates, for example, whether the cache line is valid (i.e., coherent with main memory), whether it is “dirty” (i.e., changed) and the like. Shared memory systems often use a (multi-state) protocol where each cache line includes state information. For example, a “MESI” protocol includes state information indicating whether the cache line is modified, exclusive, shared or invalid (MESI). Alternative coherency protocols include “update” protocols that send a new data value to each processor holding a cache copy of the value to update each cache copy. Ownership protocols pass an owner token among caches to indicate which cache has write permission and which cache holds the most recent version of the data.
In a MESI system, when an unshared cache line is accessed it is marked exclusive (E). A subsequent read does not change the state, but a subsequent write to the cache line changes the state to modified (M). If another processor is seen to load the data into that processor's cache, the line is marked shared (S). In order to write data to a shared cache line, an invalidate command must be sent to all processors, or at least to all processors having a copy of the shared data. Before a processor can load data from a modified line the processor having the modified cache line must write the data back to memory and re-mark it as shared. Any read or write to a cache line marked invalid (I) results in a cache miss.
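The MESI transitions described above can be illustrated with a minimal sketch. It models a single memory line as seen by several caches. The names `State`, `Line`, `read` and `write` are hypothetical and chosen only for illustration, and a write miss is simplified to go directly to the modified state.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Line:
    """One memory line as seen by several caches (hypothetical model)."""

    def __init__(self, num_caches):
        # Every cache starts without a copy of the line.
        self.states = [State.INVALID] * num_caches

    def read(self, cpu):
        if self.states[cpu] is not State.INVALID:
            return "hit"  # a read hit does not change the state
        others = [i for i, s in enumerate(self.states)
                  if i != cpu and s is not State.INVALID]
        for i in others:
            # A modified copy would be written back to memory here;
            # every existing holder is then marked shared.
            self.states[i] = State.SHARED
        # An unshared line is marked exclusive, otherwise shared.
        self.states[cpu] = State.SHARED if others else State.EXCLUSIVE
        return "miss"

    def write(self, cpu):
        result = "miss" if self.states[cpu] is State.INVALID else "hit"
        for i in range(len(self.states)):
            if i != cpu:
                # Invalidate command sent to all other caches.
                self.states[i] = State.INVALID
        self.states[cpu] = State.MODIFIED
        return result
```

In this sketch, a first read by one cache leaves its copy exclusive, a write by the same cache marks it modified, and a read by a second cache leaves both copies shared.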
Similar issues exist for any atomic memory operation. An atomic memory operation is one in which a read or write operation is made to a shared memory location. Even when the shared memory location is uncached, the atomic memory operation must be completed in a manner that ensures that any processors that are accessing the shared memory location are prevented from reading the location until the atomic operation is completed.
In shared memory multiprocessor systems in which all processors and main memory are physically connected using a common bus, a processor can query or “snoop” this state information of the other processors. Moreover, the requesting processor can manipulate this state information to obtain desired access to a given memory location by, for example, causing a cache line to be invalidated. However, accessing the cache of each processor by snooping is a time consuming process. Moreover, it is disruptive because the snoop request must arbitrate for cache access with the multiple ongoing cache requests generated by the processor's efforts to execute its own instructions. As the number of processors grows the overhead associated with this type of coherency protocol becomes impractical.
Other coherency systems require that the processors provide replacement hints to the memory management system. These hints provide a mechanism by which the processors cooperate in the indication of the current state of a cache line (e.g., whether the cache line remains exclusive to a particular processor). Although using replacement hints offers advantages, this requirement significantly limits the variety of available microprocessors that can be used for the multiprocessor system. A need exists for a system and method that provide cache coherency and atomic memory operations and that do not require or rely on the processor providing replacement hints.
To alleviate both the engineering and scalability issues associated with snoop-based coherency implementations, a directory approach is used. A data structure is employed to hold information about which processors hold which lines of memory as well as a superset of the MESI state information about the line. Rather than snooping each cache, the central directory structure can be queried to determine the state of each memory line. After examining the directory, only the necessary cache coherency operations are required to allocate the memory line to the requesting processor. A directory may be implemented as a full map, coarse map, partial map or a linked list.
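By way of illustration only, a full-map directory of the kind mentioned above might be sketched as follows. The names `DirectoryEntry`, `Directory`, `request_read` and `request_write`, and the three state labels, are assumptions made for this example and are not taken from any particular implementation. The point of the sketch is that the directory reports exactly which nodes must be contacted, so no cache is snooped unnecessarily.

```python
from dataclasses import dataclass, field


@dataclass
class DirectoryEntry:
    state: str = "uncached"  # "uncached", "shared" or "exclusive"
    holders: set = field(default_factory=set)  # nodes holding a copy


class Directory:
    """Full-map directory: one entry per memory line (hypothetical layout)."""

    def __init__(self):
        self.entries = {}

    def request_read(self, line, node):
        entry = self.entries.setdefault(line, DirectoryEntry())
        # Only an exclusive holder must be contacted (to supply or
        # write back its copy); shared copies are left undisturbed.
        to_contact = set(entry.holders) if entry.state == "exclusive" else set()
        to_contact.discard(node)
        entry.holders.add(node)
        entry.state = "shared" if len(entry.holders) > 1 else "exclusive"
        return to_contact

    def request_write(self, line, node):
        entry = self.entries.setdefault(line, DirectoryEntry())
        # Every other holder must be invalidated; no other node is disturbed.
        to_invalidate = entry.holders - {node}
        entry.holders = {node}
        entry.state = "exclusive"
        return to_invalidate
```

For example, if nodes 0, 1 and 2 have each read a line and node 1 then writes it, `request_write` returns `{0, 2}`, which is the complete set of caches that need an invalidate command.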
Alternatively, host bus locking can be used to ensure atomicity in a very brute force manner. The atomic operation support in the Intel architecture 32 (IA32) instruction set with uncached memory requires two bus operations: a read, followed by a write. While these operations proceed, a bus lock is asserted and thereby prevents other processors from utilizing the bus's unused bandwidth. This is particularly detrimental in computer systems where multiple processors and other components share the host bus. Asserting bus lock by any agent using the host bus will prevent the others from being able to start or complete any bus transaction to memory.
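The cost of bus locking can be seen in a simplified software model. The sketch below is not a description of any real bus protocol; `HostBus`, `lock`, `unlock`, `read` and `write` are hypothetical names, and a stalled transaction is represented simply by a return value of `None` (reads) or `False` (writes).

```python
class HostBus:
    """Toy model of a shared host bus with a bus lock (illustrative only)."""

    def __init__(self):
        self.memory = {}
        self.locked_by = None  # agent currently asserting bus lock

    def lock(self, agent):
        if self.locked_by is not None:
            return False
        self.locked_by = agent
        return True

    def unlock(self, agent):
        if self.locked_by == agent:
            self.locked_by = None

    def _stalled(self, agent):
        # While the lock is asserted, no other agent may use the bus,
        # even for an unrelated address.
        return self.locked_by is not None and self.locked_by != agent

    def read(self, agent, addr):
        if self._stalled(agent):
            return None
        return self.memory.get(addr, 0)

    def write(self, agent, addr, value):
        if self._stalled(agent):
            return False
        self.memory[addr] = value
        return True
```

An atomic increment by one agent then consists of `lock`, `read`, `write` and `unlock`. Between the read and the write, a second agent attempting to read a completely unrelated address is stalled, which is the lost bandwidth noted above.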
More complex multiprocessor architectures combine multiple processor boards where each processor board contains multiple processors coupled together with a shared front side bus. In such systems, the multiple boards are interconnected with each other and with memory using an interconnect network that is independent of the front side bus. In essence, each of the multiprocessing boards has an independent front side bus. Because the front side bus is not shared by all of the system processors, coherency mechanisms such as bus locking and bus snooping, which operate only on the front side bus, are difficult if not impossible to implement.
Accordingly, it is desirable to provide a cache coherency mechanism and method that operates efficiently to minimize overhead. More specifically, a means for providing cache management that does not rely on either conventional cache coherency mechanisms or bus locking mechanisms is needed. A further need exists for cache coherency mechanisms that operate on systems with multiple independent front side buses.
Briefly stated, the present invention involves a system and method of implementing both uncached references (including atomic references) and cached references within a multiprocessor system. The multiprocessor system includes a plurality of independent processor nodes each having one or more processors. A shared memory comprising a plurality of memory locations is accessible by each of the processor nodes. A cache is implemented within at least some of the processor nodes, each cache comprising a plurality of cache entries. A state machine is associated with each memory location, wherein the state machine includes an exclusive state, a locked state, a busy state, a pending state, an uncached state, a shared state and a busy uncached state. A current memory operation is performed including copying contents of a memory location into the cache of a first processor node. In response to a pending memory operation involving the memory location, the state machine for the memory location is manipulated to indicate that the cache line in the first processor node contains a copy of information that is desired by a second processor node. The state machine is then transitioned to the busy state. In response to an indication that the first processor node has terminated the current operation, the state machine is transitioned to the pending state. While in the pending state, the memory location is locked from access by any of the processor nodes.
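As a non-limiting illustration, the sequence of transitions summarized above may be sketched in software. The sketch assumes, for illustration only, that the first cached copy is tracked in the exclusive state; the names `LineState`, `MemoryLocation`, `copy_to_cache`, `request`, `owner_done` and `accessible` are hypothetical. Transitions out of the pending state, and the use of the locked, shared and busy uncached states, are not modeled here.

```python
from enum import Enum, auto


class LineState(Enum):
    UNCACHED = auto()
    SHARED = auto()
    EXCLUSIVE = auto()
    LOCKED = auto()
    BUSY = auto()
    BUSY_UNCACHED = auto()
    PENDING = auto()


class MemoryLocation:
    """State machine for one memory location (illustrative sketch)."""

    def __init__(self):
        self.state = LineState.UNCACHED
        self.owner = None         # node holding the cached copy
        self.reserved_for = None  # node waiting for the location

    def copy_to_cache(self, node):
        # Assumption: the first cached copy is tracked as EXCLUSIVE.
        if self.state is not LineState.UNCACHED:
            raise RuntimeError("location is already in use")
        self.owner = node
        self.state = LineState.EXCLUSIVE

    def request(self, node):
        # A second node wants the location while the first still holds it:
        # record the reservation and move to BUSY.
        if self.state is not LineState.EXCLUSIVE or node == self.owner:
            raise RuntimeError("no competing holder to wait for")
        self.reserved_for = node
        self.state = LineState.BUSY

    def owner_done(self, node):
        # The first node has terminated its operation: move to PENDING.
        if self.state is not LineState.BUSY or node != self.owner:
            raise RuntimeError("only the current holder can finish")
        self.owner = None
        self.state = LineState.PENDING

    def accessible(self):
        # While PENDING, the location is locked from access by any node.
        return self.state is not LineState.PENDING
```

In this sketch, once a first node copies the location and a second node requests it, the state is busy; when the first node finishes, the state becomes pending and `accessible()` returns `False`, with `reserved_for` still identifying the second node.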
In another aspect, the present invention involves a cache coherency directory for a shared memory multiprocessor computer. A data structure is associated with each cacheable memory location. The data structure has locations for storing state values indicating an exclusive state, a shared state, an uncached state, a busy state, a busy uncached state, a locked state, and a pending state. The busy state and pending state cooperate to reserve a cache line for future use by a processor while the cache line is currently being used by one or more other processors.
In yet another aspect the present invention involves a computing system having a plurality of processor nodes coupled to a communication bus. Each processor node contains two or more processors coupled together by a shared front side bus. A cache memory is provided within each of the processor nodes, the cache memory comprising a plurality of cache lines. A shared memory includes a plurality of cacheable locations and is defined by an address space shared amongst all of the processor nodes. A state machine is associated with each cacheable location of the shared memory, the state machine implementing a state indicating that the associated cacheable location is currently in use by a first processor node and reserved for future use by a second processor node.