Computer technology continues to advance at a remarkable pace, with numerous improvements being made to the performance of both processing units—the “brains” of a computing system—and the memory that stores the data processed by a computing system.
In general, a processing unit is a microprocessor or other integrated circuit that operates by executing a sequence of instructions that form a computer program. The instructions are typically stored in a memory system having a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a “memory address space,” representing an addressable range of memory regions that can be accessed by a microprocessor.
Both the instructions forming a computer program and the data operated upon by those instructions are often stored in a memory system and retrieved as necessary by the microprocessor when executing the computer program. The speed of microprocessors, however, has increased relative to that of memory devices to the extent that retrieving instructions and data from a memory often becomes a significant bottleneck on the performance of the microprocessor as well as the computing system. To ease this bottleneck, it is often desirable to use the fastest memory devices available. However, both memory speed and memory capacity are typically directly related to cost, and as a result, many computer designs must balance memory speed and capacity against cost.
A predominant manner of obtaining such a balance is to use multiple “levels” of memories in a memory architecture to attempt to decrease costs with minimal impact on performance. Often, a computing system relies on a relatively large, slow and inexpensive mass storage system such as a hard disk drive or other external storage device, an intermediate main memory that uses dynamic random access memory (DRAM) devices or other volatile memory storage devices, and one or more high speed, limited capacity cache memories, or caches, implemented with static random access memory (SRAM) devices or the like. Information from segments of the memory regions, often known as “cache lines” of the memory regions, is often transferred between the various memory levels in an attempt to maximize the frequency with which requested cache lines are stored in the fastest cache memory accessible by the microprocessor. Whenever a memory request from a requester attempts to access a cache line, or entire memory region, that is not cached in a cache memory, a “cache miss,” or “miss,” typically occurs. As a result of a cache miss, the cache line for a memory address typically must be retrieved from a relatively slow, lower level memory, often with a significant performance penalty. Whenever a memory request from a requester attempts to access a cache line, or entire memory region, that is cached in a cache memory, a “cache hit,” or “hit,” typically occurs and the cache line or memory region is supplied to the requester.
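The hit and miss behavior described above can be illustrated with a minimal sketch of a direct-mapped cache. The class name, the 64-byte line size, and the four-line capacity are assumptions chosen purely for illustration and do not describe any particular hardware design.

```python
LINE_SIZE = 64   # bytes per cache line (assumed value for illustration)
NUM_LINES = 4    # number of lines in the toy cache (assumed value)


class DirectMappedCache:
    """Hypothetical direct-mapped cache that only tracks which lines are present."""

    def __init__(self, num_lines=NUM_LINES, line_size=LINE_SIZE):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines  # tag held in each slot; None means empty
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Classify a byte address as a "hit" or "miss" and update the cache."""
        line_number = address // self.line_size   # which cache line holds the byte
        index = line_number % self.num_lines      # slot the line maps to
        tag = line_number // self.num_lines       # identifies the line within its slot
        if self.tags[index] == tag:
            self.hits += 1
            return "hit"
        # Miss: the line would be fetched from a lower memory level and
        # replaces whatever previously occupied the slot.
        self.misses += 1
        self.tags[index] = tag
        return "miss"


cache = DirectMappedCache()
results = [cache.access(a) for a in (0, 8, 64, 0, 256, 0)]
```

In this sketch, addresses 0 and 8 fall in the same cache line, so the second access hits. Address 256 maps to the same slot as address 0 and evicts it, which is why the final access to address 0 misses again.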
Cache misses in particular have been found to significantly limit system performance. In some designs, for example, it has been found that over 25% of a microprocessor's time is spent waiting for retrieval of cache lines after a cache miss. Therefore, any mechanism that can reduce the frequency and/or latency of cache misses can have a significant impact on overall performance.
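The relationship between miss frequency, miss latency, and performance can be sketched with the standard average memory access time calculation (hit time plus miss rate times miss penalty). The function name and the cycle counts below are hypothetical values chosen for illustration, not measurements from any system.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time plus the expected cost of misses."""
    return hit_time + miss_rate * miss_penalty


# Assumed figures: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
baseline = average_access_time(1, 0.05, 100)

# Halving the miss rate (frequency) or the miss penalty (latency)
# lowers the average by the same amount in this model.
fewer_misses = average_access_time(1, 0.025, 100)
faster_misses = average_access_time(1, 0.05, 50)
```

Under these assumed figures, misses account for most of the average access time, which is consistent with the observation that reducing either the frequency or the latency of misses can noticeably improve overall performance.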
One conventional approach for reducing the impact of cache misses is to increase the size of the cache, which in effect reduces the frequency of misses. However, increasing the size of a cache can add significant cost. Furthermore, the size of the cache is often limited by the amount of space available on an integrated circuit device. Particularly when the cache is integrated onto the same integrated circuit device as a microprocessor to improve performance, the amount of space available for the cache is significantly restricted.
Another conventional approach includes decreasing the miss rate by increasing the associativity of a cache, and/or using cache indexing to reduce conflicts. While each approach can reduce the frequency of data cache misses, each approach still incurs an often substantial performance hit whenever cache misses occur.
Moreover, conventional approaches for reducing the impact of cache misses often introduce additional problems in shared memory computing systems. Generally, shared memory computing systems include a plurality of microprocessors that share a common memory. Microprocessors are permitted to obtain exclusive or shared ownership of a cache line, with the former usually required whenever a microprocessor needs to modify data stored in the cache line, and the latter being permitted whenever multiple microprocessors merely need to read the data in the cache line. A coherence protocol, typically using either a central directory or a snooping protocol, is used to coordinate the retrieval of a cache line by a microprocessor, such that a requesting microprocessor always receives a current copy of the data in a cache line. A coherence protocol often requires a microprocessor to broadcast a request over a shared memory bus, which results in a lookup being performed either in a central directory or in each individual node in the shared memory system to determine the status of the requested cache line. The requested cache line is ultimately returned to the requesting processor, and the status of that cache line is updated to reflect its new ownership status. Given that a memory bus is a limited resource, the broadcast of memory requests over the memory bus can result in decreased performance, so it is desirable whenever possible to minimize the number of memory requests that are broadcast over a shared memory bus.
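The distinction between shared and exclusive ownership can be illustrated with a minimal sketch of a central directory. The class, its method names, and the processor identifiers are hypothetical; real coherence protocols track additional states and handle data transfer, which this sketch omits.

```python
class Directory:
    """Toy central directory tracking which processors hold each cache line."""

    def __init__(self):
        self.sharers = {}    # line address -> set of processor ids holding a shared copy
        self.exclusive = {}  # line address -> processor id holding exclusive ownership

    def read(self, cpu, line):
        """Grant shared ownership; any exclusive owner is demoted to a sharer."""
        owner = self.exclusive.pop(line, None)
        holders = self.sharers.setdefault(line, set())
        if owner is not None:
            holders.add(owner)
        holders.add(cpu)
        return "shared"

    def write(self, cpu, line):
        """Grant exclusive ownership; return the processors whose copies are invalidated."""
        invalidated = self.sharers.pop(line, set()) - {cpu}
        previous = self.exclusive.get(line)
        if previous is not None and previous != cpu:
            invalidated.add(previous)
        self.exclusive[line] = cpu
        return sorted(invalidated)


directory = Directory()
directory.read(0, 0x40)                  # processor 0 reads the line
directory.read(1, 0x40)                  # processor 1 reads the same line
invalidated = directory.write(2, 0x40)   # processor 2 needs to modify it
directory.read(0, 0x40)                  # processor 0 reads it again
```

In this sketch, the write by processor 2 invalidates the copies held by processors 0 and 1, and the later read by processor 0 demotes processor 2 from exclusive owner to sharer. Each of these operations stands in for a request that, in a broadcast-based system, would consume shared bus bandwidth.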
To reduce global bandwidth requirements, many modern shared memory multiprocessor systems are clustered. The processors are divided into groups called SMP nodes, where processors in the same node share a cabinet, board, multi-chip module, or even the same chip, enabling low-latency, high-bandwidth communication between processors in the same node. These systems utilize a two-level cache coherence protocol that first broadcasts requests to processors within a node (referred to as a “node pump”), and sends requests to remote nodes only if necessary, i.e., when a request cannot be handled in the local node (referred to as a “global pump”). While this “double pump” reduces the global request traffic, global requests are delayed by checking the local node first.
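The double pump sequence can be sketched as follows. The function name, the modeling of each node as a set of cached line addresses, and the broadcast counter are assumptions made for illustration only.

```python
def double_pump(line, local_node, remote_nodes):
    """Return (source, broadcasts) for a request under a two-level protocol.

    Each node is modeled as a set of the cache line addresses it holds.
    """
    broadcasts = 1                    # node pump: check the local node first
    if line in local_node:
        return "local", broadcasts
    broadcasts += 1                   # global pump: only after the local check fails
    for node_id, node in enumerate(remote_nodes):
        if line in node:
            return f"remote node {node_id}", broadcasts
    return "memory", broadcasts       # no node caches the line


local = {0x100}
remotes = [{0x200}, {0x300}]
local_result = double_pump(0x100, local, remotes)
remote_result = double_pump(0x300, local, remotes)
uncached_result = double_pump(0x400, local, remotes)
```

A request satisfied locally costs a single broadcast and never reaches the global interconnect, whereas a request for a remote or uncached line pays for both pumps in sequence. This is the added delay on global requests noted above.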
One alternative to a conventional double pump is to utilize a special pseudo-invalid coherence state, much like the In and/or Ig states used in the POWER6 microprocessor developed by International Business Machines (“IBM”) of Armonk, N.Y. Those states can be used to predict whether cache lines are remote or local. However, these states displace actual data, occupying as much as about 20% of cache memory and increasing the cache miss rate by an average of about 5%. This, in turn, increases bandwidth and energy requirements for memory subsystems, increases execution time of workloads, and generally imposes time and monetary costs on the design and use of conventional shared memory computing systems.
Consequently, there is a need in the art for determining when particular memory requests are unnecessary and improving microprocessor communications in a shared memory computing system.