Not Applicable
1. Field of the Invention
The present invention generally relates to a computer system comprising a plurality of pipelined, superscalar microprocessors and implementing a directory-based cache coherency protocol. More particularly, the invention relates to a method of improving bandwidth and performance in such a system by eliminating redundant directory read and write requests.
2. Background of the Invention
It often is desirable to include multiple processors in a single computer system. This is especially true for computationally intensive applications and applications that otherwise can benefit from having more than one processor simultaneously performing various tasks. It is not uncommon for a multi-processor system to have 2 or 4 or more processors working in concert with one another. Typically, each processor couples to at least one and perhaps three or four other processors.
Such systems usually require data and commands (e.g., read requests, write requests, etc.) to be transmitted from one processor to another. Furthermore, the processors may be executing tasks and working on identical problems which requires that data be shared among the processors. This data is commonly stored in a memory location that may be adjacent to each processor or may be located in a distinctly separate location. In either event, the processor must access the data from memory. If the memory is some distance away from the processor, delays are incurred as the data request is transmitted to a memory controller and the data is transmitted back to the processor. To alleviate this type of problem, a memory cache may be coupled to each processor. The memory cache is used to store “local” copies of data that is “permanently” stored at the master memory location. Since the data is local, fetch and retrieve times are reduced thereby decreasing execution times. The memory controller may distribute copies of that same data to other processors as needed.
Successful implementation of this type of memory structure requires a method of keeping track of the copies of data that are delivered to the various cache blocks. The particular method chosen depends on the cache coherency protocol implemented for that particular multi-processor system. Cache coherency, in part, means that only one microprocessor can modify any part of the data at any one time, otherwise the state of the system would be nondeterministic. In one example of a cache coherency protocol, the memory controller will broadcast requests to each processor in the system, regardless of whether or not the processors have a copy of the data block. This approach tends to require less bookkeeping since the memory controller and processors do not need to keep track of how many copies of data exist in the memory structure. However, bandwidth is hindered because processors must check to see if there is a local copy of the data block each time the processor receives a request.
Another conventional cache coherency protocol is a directory-based protocol. In this type of system, the memory controller keeps a master list, or directory, of the data in main memory. When copies of the data are distributed to the individual processors, the memory controller will note the processor to which the data was sent and the state of that data. In addition, the data that is delivered to a particular data cache will include information about the memory address where the data resides and the directory state for that particular block of data. Since the memory controller tracks the processors that have copies of the same block of data, bandwidth may be conserved by limiting memory read and write requests to only those processors which have a copy of a data block in the local cache.
When a processor makes a memory request for a particular block of data, a read request goes to the directory controller which then makes some decision based on the existence and state of that data block. This decision may be as simple as locating the data block and sending that data back to the requesting processor. In another example, the memory controller may determine that the data block does not exist in the local memory and send or forward a memory request to the actual location of the data block. The owner at this location will then send the data block back to the requestor. In either event, the memory controller follows the read request with a write request that instructs the directory to update the address and cache state information for that data block.
Some multi-processor systems have a means of buffering these memory requests. In such systems, it is highly likely that memory requests to the same block of data may exist in the request buffer at the same time. It may also be the case that only the directory information and not the actual data block is read. This type of scenario is likely in a multiprocessor system executing a spin lock initialization. In a spin lock, a processor will attempt to test-and-set a memory location (the lock) associated with a data block to be accessed. If the test indicates that the set succeeded, the processor may proceed to access or modify the data associated with the lock. If the test indicates that the set failed, that is, the lock is already owned by another processor, then the processor will loop back and execute the test-and-set until the set succeeds and the lock is obtained. This lock is known as a spin lock because the processor “spins” in a tight loop waiting to acquire the lock. Since requests are repeatedly accessing the same memory address, it is quite feasible that the memory request buffer will contain multiple requests for the same memory block.
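By way of illustration only, the spin lock behavior described above may be sketched in software as follows. The class name TestAndSetLock and the function spin_acquire are hypothetical, and an ordinary threading.Lock stands in for the atomicity that hardware would provide for the test-and-set operation.

```python
import threading


class TestAndSetLock:
    """Software model of a test-and-set memory location (hypothetical helper)."""

    def __init__(self):
        self._flag = False
        # Assumption: this guard stands in for the hardware guarantee
        # that test-and-set is a single indivisible operation.
        self._atomic = threading.Lock()

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._atomic:
            old = self._flag
            self._flag = True
            return old

    def clear(self):
        """Release the lock by resetting the flag."""
        with self._atomic:
            self._flag = False


def spin_acquire(lock):
    """Loop on test-and-set until the set succeeds; return the spin count."""
    spins = 0
    # A previous value of True means another owner holds the lock,
    # so the set failed and the caller must try again.
    while lock.test_and_set():
        spins += 1
    return spins
```

Every failed iteration of the loop references the same memory location, which is why several requests for one memory block may be pending in a request buffer at the same time.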
In a conventional system, when a memory request is executed, the current cache directory state must be read and updated for each block of data so cache coherency is maintained. A directory read request is sent to memory to determine the current directory state of the data block. After the memory request is executed, the next state in the directory field for that cache block is updated (if needed) and a directory write request is sent back to memory to update the directory state for the memory block. As transactions are processed, a conventional system completes the directory read and directory write pairs for each transaction.
It is desirable, therefore, to develop a method of eliminating redundancy in memory requests by chaining directory requests using local, known good information about a memory block to execute directory read and write requests locally. Since the latest cache state information is written to directory entries in the memory request buffer, there is no need for subsequent directory requests for a common memory address to go to memory. The information can be read from the directory entry for a request that contains valid directory state information and this information can be applied to all other requests that reference the same memory address. This process eliminates the need to transmit read and write requests between the memory request buffer and memory and may advantageously improve bandwidth and execution speeds.
The problems noted above are solved in large part by a directory-based multiprocessor cache control system that uses a memory transaction buffer called a directory in-flight table. The directory in-flight table may hold up to 32 memory requests, some of which may access the same block of memory. These common memory requests may be placed in a linked list which is used to eliminate unnecessary directory read and directory write requests to memory.
The system includes a plurality of processors, each comprising at least one memory cache and a memory. Each processor also includes a memory controller that includes a front-end section, a middle section and a back-end section. The front-end includes a directory in-flight table and a decoder configured to read and write to each entry in the directory in-flight table. The front-end is also configured to manage a directory based coherence protocol and validate directory information for each transaction in the directory in-flight table. Memory requests from the processors are allocated in the directory in-flight table.
Each memory request entry in the directory in-flight table comprises fields for a memory address, a requestor processor identification, a valid bit, a directory state, a behind entry number, and an end bit. These fields are used, in part, to create the linked list of requests with a common memory address and to eliminate unnecessary memory directory read and write requests. For instance, when the valid bit in a transaction entry is set, the directory state of that entry is assumed valid and the front-end decoder does not send a read request to memory. Conversely, when the valid bit in a transaction entry is not set, the front-end decoder sends a directory read request to memory and updates the directory state field of that entry with directory state information from memory.
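By way of illustration only, the entry fields and the valid bit test described above may be modeled as follows. The names DiftEntry and fetch_directory_state, the field names, and the use of strings for directory states are assumptions made for this sketch and do not describe the hardware encoding.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DiftEntry:
    """One directory in-flight table entry (field names are illustrative)."""
    address: int                      # memory address of the requested block
    requestor: int                    # requestor processor identification
    valid: bool = False               # set: dir_state is assumed valid
    dir_state: Optional[str] = None   # directory state for the block
    behind: Optional[int] = None      # behind entry number
    end: bool = True                  # set: last entry in its list


def fetch_directory_state(entry: DiftEntry,
                          memory_directory: Dict[int, str]) -> bool:
    """Ensure entry.dir_state is usable; return True if memory was read."""
    if entry.valid:
        # Valid bit set: the local state is trusted and no read is sent.
        return False
    # Valid bit not set: model a directory read request to memory and
    # update the directory state field with the result.
    entry.dir_state = memory_directory[entry.address]
    return True
```

In this sketch the dictionary memory_directory plays the role of the directory held in memory, and the boolean return value indicates whether a directory read request was needed.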
Similarly, when the end bit in a transaction entry is not set, the front-end decoder does not send a write request to memory. Instead, the transaction is retired and the directory state is written to the next transaction in the list and the valid bit for that next transaction is set. This retiring process accomplishes two things: it eliminates a directory write request to memory and it validates the directory state for the next transaction which eliminates a subsequent directory read request. A set end bit signifies the end of a list of transactions accessing the same memory block. If the end bit in a transaction entry is set, the front-end decoder sends a directory write request to memory to update the directory state of the memory block corresponding to the address in the memory address field.
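By way of illustration only, the effect of the end bit on directory traffic may be sketched as follows. The names Txn and process_list and the next_state callback are hypothetical. The sketch assumes that the transactions are supplied in list order, that only the last transaction has its end bit set, and that the state passed along is the state computed after each request executes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Txn:
    """Reduced transaction entry holding only the fields used here."""
    address: int
    valid: bool = False
    dir_state: Optional[str] = None
    end: bool = False


def process_list(txns: List[Txn],
                 memory_directory: Dict[int, str],
                 next_state: Callable[[str], str]) -> Dict[str, int]:
    """Process one list in order and count directory reads and writes."""
    traffic = {"reads": 0, "writes": 0}
    for i, txn in enumerate(txns):
        if not txn.valid:
            # No trusted local state: a directory read goes to memory.
            txn.dir_state = memory_directory[txn.address]
            traffic["reads"] += 1
        new_state = next_state(txn.dir_state)
        if txn.end:
            # End of the list: the directory write goes to memory.
            memory_directory[txn.address] = new_state
            traffic["writes"] += 1
        else:
            # Retire: hand the state to the next transaction and mark
            # it valid, so neither a write nor a later read is sent.
            nxt = txns[i + 1]
            nxt.dir_state = new_state
            nxt.valid = True
    return traffic
```

Under these assumptions, a list of three transactions to one address produces one directory read and one directory write, rather than three of each.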
When a new memory request is allocated to the directory in-flight table, that memory request is automatically appended to the end of a list. Hence, the end bit for all new memory requests is set. Each list contains memory requests that reference a common memory block. Thus, the list may contain only the new request or it may include many memory requests with common memory addresses. If the contents of the memory address field for the new request are identical to memory address fields of existing memory requests already in the directory in-flight table, the new memory request is added to an existing list. In this case, the behind entry number field for the new entry is filled with the location of the existing entry in the list that has a set end bit (i.e., the previous end of the list). The end bits for all existing entries containing the same memory addresses (all entries in the list other than the new request) are reset.
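By way of illustration only, the allocation of a new request and its linkage to an existing list may be sketched as follows. The names DiftEntry and allocate are hypothetical, the table is modeled as a dictionary keyed by entry number, and raising an error when the table is full is an assumption of this sketch, not a behavior described herein.

```python
from dataclasses import dataclass
from typing import Dict, Optional

TABLE_SIZE = 32  # the directory in-flight table may hold up to 32 requests


@dataclass
class DiftEntry:
    """Reduced entry holding only the fields used for list linkage."""
    address: int
    requestor: int
    behind: Optional[int] = None   # behind entry number
    end: bool = True               # every new request starts as a list end


def allocate(table: Dict[int, DiftEntry], address: int, requestor: int) -> int:
    """Add a request to the table and return its entry number."""
    free = [i for i in range(TABLE_SIZE) if i not in table]
    if not free:
        # Assumption: behavior when the table is full is not specified.
        raise RuntimeError("directory in-flight table is full")
    slot = free[0]

    previous_end = None
    for idx, entry in table.items():
        if entry.address == address:
            if entry.end:
                previous_end = idx   # the previous end of the list
            entry.end = False        # reset end bits of existing entries

    table[slot] = DiftEntry(address, requestor, behind=previous_end, end=True)
    return slot
```

A request whose address matches no existing entry forms a list of one, with its end bit set and its behind entry number left empty.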
The front-end further comprises control logic that is configured to transmit information in both directions between the decoder and memory or to loop information that is leaving the decoder back to the decoder. The act of retiring a transaction asserts a retire control signal from the decoder. When the retire signal is asserted, the control logic loops information that is leaving the decoder back to the decoder. When the retire signal is not asserted, the control logic permits information to travel between the decoder and memory. Thus, when a memory request entry in the directory in-flight table that does not have a set end bit is processed by the front-end, the request is retired from the directory in-flight table and the retire signal is asserted. The directory state information from the retiring entry is written by the decoder to the entry in the directory in-flight table that has the location of the retiring entry in its behind entry number field (i.e., the next entry in the list) and the valid bit for that next entry is set.
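By way of illustration only, the routing performed by the control logic may be sketched as follows. The names DiftEntry and complete are hypothetical, the table is modeled as a dictionary keyed by entry number, and removing an entry from the table after its directory write is sent to memory is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DiftEntry:
    """Reduced entry holding only the fields used when completing a request."""
    address: int
    valid: bool = False
    dir_state: Optional[str] = None
    behind: Optional[int] = None
    end: bool = True


def complete(table: Dict[int, DiftEntry], idx: int, new_state: str,
             memory_directory: Dict[int, str]) -> bool:
    """Finish the entry at idx and return the value of the retire signal."""
    entry = table.pop(idx)
    retire_signal = not entry.end
    if retire_signal:
        # Retire signal asserted: the state leaving the decoder is looped
        # back to the entry whose behind field names the retiring entry.
        for nxt in table.values():
            if nxt.behind == idx:
                nxt.dir_state = new_state
                nxt.valid = True
                break
    else:
        # Retire signal not asserted: the directory write travels to memory.
        memory_directory[entry.address] = new_state
    return retire_signal
```

In this sketch the directory held in memory is left unchanged while the retire signal is asserted and is updated only when the entry at the end of the list completes.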
Thus, in the manner described above, the preferred embodiment may eliminate unnecessary directory read and directory write traffic, which may advantageously improve memory bandwidth and memory access times.