One way to increase the speed of a computer program is to divide its operations into separate tasks and execute those tasks concurrently on multiple processors. The multiple processors may or may not share a common memory. One of the slowest operations in a computer system is an operation involving memory-resident data. Fetching operands (associated with one or more processor instructions) from memory may be hundreds or thousands of times slower than fetching those same operands from a register in the processor. Storing data in memory can be comparably slow. Although various common techniques are effective in hiding that latency from an application, the latency still affects the processor design and, thus, the performance of the application.
It is common practice in processor design to attach a high-speed cache that stores the most recently referenced data and data near that recently referenced data. A cache hides the latency of a fetch or read memory access by quickly supplying a copy of memory-resident data that is stored in the cache. A cache also hides the latency of a write memory access by various means.
A cache interacts with the memory in units of cache lines. A cache line is typically on the order of 32-512 bytes and is typically the smallest unit of exchange between the cache and the memory. When a processor requests a value from memory that is not resident in the cache, the memory transfers a cache-line-sized chunk of data containing the data requested by the processor. The entire line of data is stored in the cache. The cache then extracts the requested data from the line transferred from memory and returns it to the processor. When a processor writes data to an address that is represented in its cache, the cache copy is updated and the cache line is marked as “dirty.” When the line is flushed from cache back to memory, the line in memory is updated. When a processor writes data to an address that is not represented in its cache, the line is read from memory as described above for a read and then the new data is written into the cache as described above for writes.
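The read and write behavior described above can be illustrated with a toy model. The following is only a minimal sketch, not a description of any particular processor: the class name `WriteBackCache`, the 64-byte line size, and the use of a `bytearray` as "memory" are assumptions made for illustration, and capacity limits and eviction policy are deliberately left out.

```python
LINE_SIZE = 64  # bytes per line (assumed; the text gives 32-512 bytes as typical)


class WriteBackCache:
    """Toy write-back, write-allocate cache in front of a bytearray 'memory'."""

    def __init__(self, memory, line_size=LINE_SIZE):
        self.memory = memory
        self.line_size = line_size
        self.lines = {}     # line base address -> bytearray copy of that line
        self.dirty = set()  # base addresses of lines modified since they were filled
        self.fills = 0      # count of whole-line transfers from memory

    def _fill(self, addr):
        """Return (base, line) for addr, transferring the whole line on a miss."""
        base = addr - addr % self.line_size
        if base not in self.lines:
            self.lines[base] = bytearray(self.memory[base:base + self.line_size])
            self.fills += 1
        return base, self.lines[base]

    def read(self, addr):
        """Extract the requested byte from the cached line."""
        base, line = self._fill(addr)
        return line[addr - base]

    def write(self, addr, value):
        """Update the cached copy only and mark the line dirty."""
        base, line = self._fill(addr)  # a write miss first reads the line
        line[addr - base] = value
        self.dirty.add(base)

    def flush(self, addr):
        """Drop the line containing addr, writing it back to memory if dirty."""
        base = addr - addr % self.line_size
        line = self.lines.pop(base, None)
        if line is not None and base in self.dirty:
            self.memory[base:base + self.line_size] = line
            self.dirty.discard(base)
```

In this sketch a write to a cached address leaves memory stale until `flush` is called, which mirrors the "dirty" line behavior described in the text.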
The effectiveness of caches in hiding memory latency is attenuated by several factors. First, cache sizes are strongly constrained and the size of a single cache as a percentage of total physical memory continues to shrink. As an example, Sun Microsystems, Inc.'s™ E15K machine or computer can have up to 2^39 bytes of random access memory (RAM), but the UltraSparc-III™ processor typically used in the E15K machine has only 2^23 bytes of cache, a ratio of 1:2^16. Advances in the E15K machine may include incorporating 2^52 bytes of RAM and a per-processor cache size nowhere close to the 2^36 bytes that would be required to maintain a 1:2^16 ratio. For this and other reasons, caches are becoming less effective for programs with a large working set and non-sequential access patterns because the likelihood of a desired value being in cache is dropping with the decline of the percentage of coverage. (Percentage of coverage refers to the percentage of memory that can be covered with a particular cache. For example, on a computer system with 512 megabytes of memory, an 8-megabyte cache covers 8/512 × 100 ≈ 1.6%.)
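The coverage arithmetic can be checked directly. The sketch below assumes the sizes in the example are powers of two (2^39 bytes of RAM, 2^23 bytes of cache, and so on); the function name `coverage_percent` is introduced here only for illustration.

```python
def coverage_percent(cache_bytes, memory_bytes):
    """Percentage of memory that a cache of the given size can cover."""
    return 100.0 * cache_bytes / memory_bytes


MIB = 2**20

# 8 megabytes of cache in front of 512 megabytes of memory.
print(coverage_percent(8 * MIB, 512 * MIB))  # 1.5625, i.e. about 1.6%

# 2^23 bytes of cache against 2^39 bytes of RAM is a ratio of 1:2^16.
print(2**39 // 2**23 == 2**16)  # True

# Holding that ratio with 2^52 bytes of RAM would take 2^36 bytes of cache.
print(2**52 // 2**16 == 2**36)  # True
```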
Second, in conventional systems distance to memory is increasing in cycle terms, which typically causes memory access time to increase, resulting in a performance problem. As the distance to memory goes from hundreds of cycles to thousands of cycles, the difficulty of hiding the latency sharply increases. The increased latency also causes problems in processor design because the processor needs to keep track of an operation until it is complete. Informally speaking, a write is typically considered complete when memory is updated. If memory is 1000 cycles away from the processor and a write can be issued in every cycle, then the processor must keep track of at least 2000 outstanding memory transactions in order to avoid stalling on a write-buffer-full condition (2000 transactions = 1 transaction/cycle × (1000 cycles to send the data to memory + 1000 cycles for the acknowledgement to return)). Keeping track of 2000 independent transactions is typically quite difficult.
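The count of outstanding transactions is the issue rate multiplied by the round-trip latency. The sketch below restates that arithmetic; the function and parameter names are illustrative assumptions, and it assumes the acknowledgement takes as long to return as the data takes to arrive.

```python
def outstanding_transactions(issue_per_cycle, one_way_cycles):
    """Writes in flight when one is issued every cycle and none may stall.

    A write stays outstanding for the full round trip: one_way_cycles for the
    data to reach memory plus one_way_cycles for the acknowledgement to return.
    """
    round_trip_cycles = 2 * one_way_cycles
    return issue_per_cycle * round_trip_cycles


print(outstanding_transactions(1, 1000))  # 2000
print(outstanding_transactions(1, 100))   # 200
```

Under these assumptions the tracking requirement grows linearly with distance to memory, so a move from hundreds to thousands of cycles raises it by an order of magnitude.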
In conventional systems, memory access time is also increased and parallel performance decreased by the way that conventional processors execute programs that use certain data structures. Conventional processors typically interact sequentially with a sequence of homogeneous memory operations. For example, a processor may give preference to reads over writes, but does not usually distinguish between the relative importance of individual reads. A processor executing a thread in a program running in parallel on multiple processors often does a series of memory operations related to computation and then a final memory operation to synchronize its activities with the other processors operating in parallel. It may then issue memory operations to get more work for itself so that it may proceed on an independent part of the computation, and it will block all processing until its requests for more work are satisfied. In many computations, it is much more important to get more work than to report that previous work has completed, but the sequential nature of the way that processors interact with streams of memory operations stalls that important request behind less important requests. As the distance to memory increases and techniques such as caching become less capable of hiding the latencies involved, the problem of having important operations stalled behind ever-slower operations of lower importance becomes more pressing.
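The cost of that stall can be made concrete with a deliberately simplified model. The sketch below assumes operations are serviced strictly in order, one at a time, with no overlap; real processors pipeline memory operations, so the numbers only illustrate the ordering effect. The operation names and the 1000-cycle latencies are hypothetical.

```python
def completion_times(ops):
    """Map each operation name to the cycle at which it completes.

    ops is a list of (name, latency_cycles) pairs serviced strictly in
    order, one at a time (a simplifying assumption).
    """
    cycle = 0
    done = {}
    for name, latency in ops:
        cycle += latency
        done[name] = cycle
    return done


in_order = [
    ("store result A", 1000),
    ("store result B", 1000),
    ("report work done", 1000),
    ("get more work", 1000),
]

# The important request waits behind three less important operations.
print(completion_times(in_order)["get more work"])  # 4000

# For comparison only: the same request if nothing were queued ahead of it.
important_first = [in_order[3]] + in_order[:3]
print(completion_times(important_first)["get more work"])  # 1000
```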
Therefore, a need has long existed for a method and system that overcome the problems noted above and others previously experienced.