Given the continually increasing reliance on computers in contemporary society, computer technology has had to advance on many fronts to keep up with increased demand. One particular subject of significant research and development efforts is parallelism, i.e., the performance of multiple tasks in parallel.
A number of computer software and hardware technologies have been developed to facilitate increased parallel processing. From a hardware standpoint, computers increasingly rely on multiple processors to provide increased workload capacity. Furthermore, some processors have been developed that support the ability to execute multiple threads in parallel, effectively providing many of the same performance gains attainable through the use of multiple processors.
A significant bottleneck that can occur in a multi-processor computer, however, is associated with the transfer of data to and from each processor, often referred to as communication cost. Many computers rely on a main memory that serves as the principal working storage for the computer. Retrieving data from a main memory, and storing data back into a main memory, however, typically proceeds at a significantly slower rate than the rate at which data is transferred internally within a processor. Often, intermediate buffers known as caches are utilized to temporarily store data from a main memory when that data is being used by a processor. These caches are often smaller in size, but significantly faster, than the main memory. Caches take advantage of the temporal and spatial locality of data, and as a result, often significantly reduce the number of comparatively slower main memory accesses occurring in a computer and decrease the overall communication cost experienced by the computer. While some caches may serve all of the processors in a computer, in many instances, dedicated caches are used to serve individual processors or subsets of processors. For example, it is often desirable to incorporate a cache directly on a processor chip to provide the fastest possible access to the data stored in the cache.
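The effect of spatial locality described above can be illustrated with a minimal sketch of a direct-mapped cache. The function name `count_hits`, the cache geometry (four lines of eight addresses each), and the two access patterns are assumptions chosen purely for illustration and do not describe any particular hardware.

```python
def count_hits(addresses, num_lines=4, line_size=8):
    """Simulate a hypothetical direct-mapped cache; return (hits, misses)."""
    resident = [None] * num_lines      # memory block currently held in each slot
    hits = misses = 0
    for addr in addresses:
        block = addr // line_size      # memory block containing this address
        slot = block % num_lines       # direct-mapped: one candidate slot per block
        if resident[slot] == block:
            hits += 1                  # served by the cache, no main memory access
        else:
            misses += 1                # whole line is fetched from main memory
            resident[slot] = block
    return hits, misses


# Neighbouring addresses share a cache line (good spatial locality).
sequential = list(range(64))
# Every access lands in a different line (no spatial locality).
strided = [i * 8 for i in range(64)]

print(count_hits(sequential))  # (56, 8)
print(count_hits(strided))     # (0, 64)
```

In this sketch the sequential pattern misses only once per line, while the strided pattern goes to main memory on every access, which is the difference in communication cost that caching is intended to exploit.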
Whenever multiple processors or other devices are permitted to access a particular memory, the memory is required to implement some form of shared memory architecture that is capable of maintaining coherence throughout the memory architecture. In particular, whenever a processor attempts to access a particular memory address, typical shared memory architectures retrieve a block of memory often referred to as a cache line that contains the requested data at that address, and store the cache line in a cache accessible by the processor. If that processor subsequently modifies the data in its locally cached copy of the cache line, the copy of the cache line in the main memory is no longer up to date. As a result, if another processor attempts to access that cache line, the shared memory architecture is required to provide some mechanism by which the most recent copy of that cache line can be forwarded to the other processor. In addition, it is often desirable at that time to update the copy of the cache line in the main memory.
A number of shared memory architectures, for example, implement snoop-based coherency protocols, where each cache coupled to a main memory monitors the memory requests issued by other devices and, in response to those requests, updates the status of any affected cache lines stored in its local cache and/or notifies the other devices of the status of those cache lines. An agreed-upon set of states is often used to designate the status of each cache line. One common coherency protocol, referred to as the MSI protocol, assigns each cache line one of three states: a modified state, which indicates that the cache line is stored in one cache and has been modified, thus rendering the copy in the shared memory out of date; a shared state, which indicates that the cache line may be stored in more than one cache and has not been modified by any cache; and an invalid state, which indicates that the cache holds no valid copy of the cache line and must fetch it before the line can be accessed. Another protocol, referred to as the MESI protocol, adds to the three states of the MSI protocol an exclusive state, which indicates that the cache line is stored in one cache but has not been modified.
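The three MSI states can be sketched as a small state machine, viewed from a single snooping cache. This is a simplified textbook-style model offered as an illustration only; the event names (`local_read`, `local_write`, `remote_read`, `remote_write`) and the `next_state` helper are hypothetical and are not taken from any specific protocol implementation.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"


# (current state, event) -> next state for one cache line in one cache.
# "local_*" events come from the cache's own processor; "remote_*" events
# are requests from other devices that the cache observes by snooping.
TRANSITIONS = {
    (State.INVALID, "local_read"): State.SHARED,      # fetch the line
    (State.INVALID, "local_write"): State.MODIFIED,   # fetch, then modify
    (State.INVALID, "remote_read"): State.INVALID,
    (State.INVALID, "remote_write"): State.INVALID,
    (State.SHARED, "local_read"): State.SHARED,
    (State.SHARED, "local_write"): State.MODIFIED,    # other copies are invalidated
    (State.SHARED, "remote_read"): State.SHARED,
    (State.SHARED, "remote_write"): State.INVALID,    # this copy is invalidated
    (State.MODIFIED, "local_read"): State.MODIFIED,
    (State.MODIFIED, "local_write"): State.MODIFIED,
    (State.MODIFIED, "remote_read"): State.SHARED,    # supply the latest data
    (State.MODIFIED, "remote_write"): State.INVALID,  # supply data, then give up the line
}


def next_state(state, event):
    """Return the next MSI state of a cache line for the given event."""
    return TRANSITIONS[(state, event)]


state = State.INVALID
for event in ["local_read", "local_write", "remote_read", "remote_write"]:
    state = next_state(state, event)
    print(event, "->", state.name)
```

A MESI variant of this sketch would add an exclusive state entered on a local read when no other cache holds the line.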
Of note, using either the MSI or MESI protocol, multiple caches are permitted to hold copies of a cache line when in a shared state, and furthermore, the processors associated with those caches are able to read the copies of the cache line independently and directly from the respective caches. However, if any processor needs to modify its own copy of the cache line, it is necessary to invalidate the other copies of the cache line in the other caches in connection with modifying the data. In effect, the cache line changes from the “shared” state to the “modified” state, whereby the cache line is discarded in every cache except the cache containing the modified version. Should any other processor wish to access that cache line again, it is necessary for the cache line to be written out to main memory or otherwise transferred to the cache for the other processor to ensure that the other processor has the most recent version of the cache line. The cache line then typically transitions back to a “shared” state. If the original processor then wishes to update the cache line again, another transition back to the “modified” state is required to permit the processor to modify any data in the cache line.
Particularly if one device is frequently modifying a cache line, and another device is frequently reading the same cache line, the MSI and MESI protocols will require substantial data traffic and delay in copying the cache line back and forth between multiple caches and/or between caches and main memory. The primary performance benefits of caching arise when frequent accesses to a given cache line are capable of being serviced by a cache, without involvement of the rest of the shared memory system. Requiring frequent state changes and copying of data between caches and main memory, and/or between caches and other caches, thus negates many of the performance advantages of caching.
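The cost of this back-and-forth pattern can be estimated with a minimal two-cache MSI model. The sketch below is an assumption-laden illustration: the `simulate` function, the device names `writer` and `reader`, and the way transfers and invalidations are counted are all hypothetical simplifications rather than a description of any actual system.

```python
def simulate(ops):
    """Count line transfers and invalidations for one cache line under a
    simplified two-cache MSI model. ops is a list of (device, operation)."""
    state = {"writer": "I", "reader": "I"}   # both caches start without the line
    transfers = invalidations = 0
    for who, op in ops:
        other = "reader" if who == "writer" else "writer"
        if op == "read":
            if state[who] == "I":
                transfers += 1               # line copied from memory or the other cache
                if state[other] == "M":
                    state[other] = "S"       # owner downgrades after supplying data
                state[who] = "S"
        else:  # "write"
            if state[who] != "M":
                if state[who] == "I":
                    transfers += 1           # data must be obtained before modifying
                if state[other] != "I":
                    invalidations += 1       # the other copy is discarded
                    state[other] = "I"
                state[who] = "M"
    return transfers, invalidations


# One device writes and the other reads, alternating 100 times.
ping_pong = [("writer", "write"), ("reader", "read")] * 100
# A single write followed by 100 reads of the unchanged line.
read_mostly = [("writer", "write")] + [("reader", "read")] * 100

print(simulate(ping_pong))    # (101, 99)
print(simulate(read_mostly))  # (2, 0)
```

Under these assumptions, the alternating pattern forces roughly one transfer and one invalidation per write/read pair, whereas the read-mostly pattern is served almost entirely from the local cache.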
Despite these drawbacks, caching has been increasingly used in many computer architectures. In addition, as the components of a computer system continue to increase in complexity and performance, caching has been implemented even within individual computer system components. For example, caching is frequently used in the input/output subsystems of many computer systems, e.g., within input/output adapters (IOAs) such as host channel adapters (HCAs) compliant with the InfiniBand architecture. Caching may be used, for example, to accelerate access to thread-specific context information in a multithreaded HCA that supports the concurrent transmission and reception of data by multiple data streams. The context information includes, among other information, control information that tracks the current status of a communication session between two devices, e.g., packet sequence numbers, acknowledgment tracking information, etc.
Transmitter and receiver circuits in an HCA often operate independently to handle outgoing and incoming data packets over an InfiniBand link. To accelerate the handling of packets associated with given data streams, the context information associated with such data streams is typically retrieved by the transmitter or receiver circuit, as appropriate, and modified as necessary by that circuit to manage the data stream. Furthermore, a coherency protocol similar to the MSI and MESI protocols is used to ensure that the context information for a given data stream remains consistent even when it is being accessed by both the transmitter and receiver circuits.
However, it has been found that, particularly when both the transmitter and receiver circuits are accessing the same context information when processing incoming and outgoing packets associated with the same data stream, significant latency may be introduced due to the need to invalidate the copy of the context information in one circuit when the other circuit modifies the information. Often, only one of the transmitter and receiver circuits updates certain values in the context information, while the other circuit only reads those values. Due to the need to maintain coherence, however, significant data traffic overhead and latency are often introduced, thus decreasing the performance of the HCA.
Therefore, a significant need continues to exist for a manner of maintaining coherency in a shared memory architecture in applications where multiple devices frequently attempt to access the same data.