1. Field of the Invention
The present invention relates generally to a low latency memory system, particularly in association with a weakly-ordered (loosely synchronized) multiprocessor system, and provides for efficiently synchronizing the activities of multiple processors.
The present invention also provides an efficient and simple method for prefetching non-contiguous data structures.
The present invention relates generally to the field of distributed-memory, message-passing, parallel computer design as applied, for example, to computation in the field of life sciences.
2. Discussion of the Prior Art
A large class of important computations can be performed by massively parallel computer systems. Such systems consist of many identical compute nodes, each of which typically consists of one or more CPUs, memory, and one or more network interfaces to connect it with other nodes.
The computer described in related U.S. provisional application Ser. No. 60/271,124, filed Feb. 24, 2001, for A Massively Parallel Supercomputer, leverages system-on-a-chip (SOC) technology to create a scalable cost-efficient computing system with high throughput. SOC technology has made it feasible to build an entire multiprocessor node on a single chip using libraries of embedded components, including CPU cores with integrated, first-level caches. Such packaging greatly reduces the component count of a node, allowing for the creation of a reliable, large-scale machine. A first-level cache is generally very close to the processor and is generally smaller and faster than a second-level cache, which is farther from the processor and is generally larger and slower, and so on for higher-level caches.
A common problem faced by multiprocessors is the orderly sharing of resources. This is often accomplished by the use of locks, wherein a processor obtains permission to use a resource by acquiring a lock assigned to that resource. The processor retains permission for the resource as long as it holds (owns) the lock, and relinquishes its permission by releasing the lock. A very common type of lock is the test-and-set lock, which is simple to implement and general enough to be widely applicable.
The test-and-set lock generally relies upon a hardware read-modify-write (RMW) operation for its implementation. This operation allows a value to be written to a memory location, and returns the value that was previously in that location (before the write). That is, the operation consists of a read followed immediately, and without interruption, by a write.
The semantics of a test-and-set lock are as follows. Say the unlocked condition is 0 and the locked condition is 1. A processor attempts to acquire the lock by performing an RMW operation to the lock, wherein the value written is 1. If the value returned is 0, then the lock was unlocked before the RMW, and it has been locked due to the write of 1. If the value returned is 1, then the lock was already locked and the write had no effect. To release the lock, a 0 is simply written.
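By way of illustration only, the test-and-set semantics described above can be sketched in software. The following Python sketch is not a hardware implementation: the class `SimulatedMemoryWord` and the functions `try_acquire`, `acquire`, `release`, and `demo` are hypothetical names introduced here, and an internal `threading.Lock` merely stands in for the hardware guarantee that the read and the write of an RMW operation occur without interruption.

```python
import threading
import time

UNLOCKED, LOCKED = 0, 1


class SimulatedMemoryWord:
    """One memory location offering an atomic read-modify-write (RMW).

    Assumption: the private threading.Lock only models the uninterrupted
    read-then-write that real hardware would provide.
    """

    def __init__(self, value=UNLOCKED):
        self._value = value
        self._atomic = threading.Lock()

    def rmw(self, new_value):
        """Write new_value and return the value held before the write."""
        with self._atomic:
            old_value = self._value
            self._value = new_value
            return old_value

    def write(self, new_value):
        """Plain write, used to release the lock."""
        with self._atomic:
            self._value = new_value


def try_acquire(lock_word):
    """One acquisition attempt: write 1; a returned 0 means we now own the lock."""
    return lock_word.rmw(LOCKED) == UNLOCKED


def acquire(lock_word):
    """Spin until an attempt succeeds, yielding to other threads in between."""
    while not try_acquire(lock_word):
        time.sleep(0)


def release(lock_word):
    """Release the lock by simply writing 0."""
    lock_word.write(UNLOCKED)


def demo(n_threads=4, increments=200):
    """Guard a shared counter with the lock and return its final value."""
    lock_word = SimulatedMemoryWord()
    counter = [0]

    def worker():
        for _ in range(increments):
            acquire(lock_word)
            counter[0] += 1  # critical section
            release(lock_word)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

In this sketch a second call to `try_acquire` on a held lock returns False and leaves the lock word at 1, matching the case in which the write has no effect.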
Another aspect of the present invention involves prefetching, which is a well-known technique for enhancing performance of memory systems containing caches, especially when applications exhibit a predictable access pattern. In general, prefetching is accomplished either through the use of software directives, or through special hardware. Some hardware schemes are straightforward, such as sequential prefetching, and some are more sophisticated, such as strided stream buffers. However, all of the hardware techniques rely upon the predictability of the address sequence. See Vanderwiel and Lilja for a thorough survey of conventional prefetching techniques.
Modern virtual memory systems can reduce the effectiveness of hardware prefetching because large data structures that are contiguous in virtual memory need not be contiguous in physical memory, and hardware prefetching usually deals with physical memory addresses. Even if a large data structure is traversed contiguously, as is often the case, the actual memory references will not be contiguous and hence will be difficult to predict. However, many applications have highly repetitive behavior, so a mechanism that can learn the repeating access pattern could prefetch effectively.
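The effect of address translation on a contiguous traversal can be illustrated with a small sketch. The page size, the contents of `PAGE_TABLE`, and the function names below are hypothetical values chosen only for illustration; they do not describe any particular system.

```python
PAGE_SIZE = 4096

# Hypothetical mapping of virtual page number -> physical frame number.
# Consecutive virtual pages land in scattered physical frames.
PAGE_TABLE = {0: 17, 1: 4, 2: 92, 3: 5}


def translate(vaddr):
    """Translate a virtual address to a physical address via PAGE_TABLE."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    return PAGE_TABLE[vpn] * PAGE_SIZE + offset


def physical_strides(start, count, stride):
    """Differences between successive physical addresses of a strided walk."""
    paddrs = [translate(start + i * stride) for i in range(count)]
    return [b - a for a, b in zip(paddrs, paddrs[1:])]


# A walk with a constant virtual stride of 2048 bytes across four pages.
strides = physical_strides(start=0, count=8, stride=2048)
```

With this assumed mapping, the physical stride equals the virtual stride only within a page; at each page boundary it jumps by an unrelated amount, which is what defeats a prefetcher that expects a constant physical stride.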
One such mechanism is described in U.S. Pat. No. 4,807,110, Pomerene et al., for a Prefetching System for a Cache Having a Second Directory for Sequentially Accessed Blocks. The idea is to provide a large, two-level table that stores relationships between consecutively accessed cache lines and allows those relationships to be exploited for prefetching into a cache. Various methods for establishing and maintaining the relationships are described. A significant drawback of this approach is that the table is of fixed size, and eventually fills up. At that point, known relationships must be evicted to make room for new ones. This is not a problem as long as the table is large enough to capture a working set, but the working set of many scientific applications, such as those that will be run on the scalable computer described in related U.S. provisional application Ser. No. 60/271,124, filed Feb. 24, 2001, for A Massively Parallel Supercomputer, can be as large as the main memory. In this case, the table will provide little benefit as follow-on relationships between cache lines will be evicted due to limited capacity long before they can be used for prefetching.
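The capacity limitation can be illustrated with a deliberately simplified sketch. It is not the two-level table of the cited patent: `SuccessorTable` is a hypothetical single-level table, least-recently-used eviction is an assumption made here, and the trace of cache line numbers is invented for illustration.

```python
from collections import OrderedDict


class SuccessorTable:
    """Fixed-capacity table mapping a cache line to the line that followed it.

    Assumption: entries are evicted in least-recently-used order.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._next = OrderedDict()

    def predict(self, line):
        """Return the remembered successor of line, or None if unknown."""
        successor = self._next.get(line)
        if successor is not None:
            self._next.move_to_end(line)  # mark as recently used
        return successor

    def record(self, line, successor):
        """Remember that successor followed line, evicting if the table is full."""
        if line in self._next:
            self._next.move_to_end(line)
        elif len(self._next) >= self.capacity:
            self._next.popitem(last=False)  # evict least recently used
        self._next[line] = successor


def count_correct_predictions(table, trace):
    """Count accesses whose line was predicted when the previous line was accessed."""
    correct = 0
    previous = None
    predicted = None
    for line in trace:
        if previous is not None:
            if predicted == line:
                correct += 1
            table.record(previous, line)
        predicted = table.predict(line)
        previous = line
    return correct


# Hypothetical non-contiguous pattern of 8 lines, repeated 3 times.
PATTERN = [40, 7, 93, 12, 58, 3, 77, 21]
TRACE = PATTERN * 3

fits = count_correct_predictions(SuccessorTable(capacity=8), TRACE)
too_small = count_correct_predictions(SuccessorTable(capacity=4), TRACE)
```

Under these assumptions, a table large enough to hold the whole pattern predicts 15 of the 23 transitions (every one after the first pass), whereas a table half that size predicts none, because each relationship is evicted before the pattern returns to it.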