1. Field of the Invention
This invention relates to the field of distributed-memory message-passing parallel computer design and system software, as applied for example to computation in the field of life sciences.
2. Background Art
Provisional patent application No. 60/271,124, titled “A Novel Massively Parallel Supercomputer,” describes a massively parallel supercomputer architecture in the form of a three-dimensional torus designed to deliver processing power on the order of teraOPS (trillion operations per second) for a wide range of applications. The architecture comprises 65,536 processing nodes organized as a 64×32×32 three-dimensional torus, with each processing node connected to six (6) neighboring nodes.
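By way of illustration only, the torus connectivity described above can be sketched in a few lines of Python. The function names `torus_neighbors` and `node_count` and the (x, y, z) coordinate convention are assumptions made for this sketch and do not appear in the referenced application; only the 64×32×32 dimensions and the six-neighbor connectivity come from the text.

```python
# Illustrative sketch: the dimensions come from the 64x32x32 torus described
# above; the coordinate convention and function names are hypothetical.
DIMS = (64, 32, 32)


def node_count(dims=DIMS):
    """Total number of nodes in a torus with the given dimensions."""
    total = 1
    for extent in dims:
        total *= extent
    return total


def torus_neighbors(node, dims=DIMS):
    """Return the six neighbors of node=(x, y, z), wrapping at the edges."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            coord = list(node)
            # The modulo implements the wraparound link that turns a mesh
            # into a torus.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


corner_neighbors = torus_neighbors((0, 0, 0))
```

Under these assumptions, a node at a corner of the coordinate space, such as (0, 0, 0), still has six distinct neighbors, because the links wrap around to the opposite faces.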
Each processing node of the supercomputer architecture is a semiconductor device that includes two electronic processors (among other components). One of these processors is designated the “Compute Processor” and, in the common mode of operation, is dedicated to application computation. The other processor is the “I/O Processor,” which, in the common mode of operation, is a service processor dedicated to performing activities in support of message-passing communication. Each of these processors contains a separate first-level cache (L1) which may contain a copy of data stored in a common memory accessed by both processors. If one processor changes its L1 copy of a memory location while the other processor holds a copy of the same location, the two copies differ; they become “coherent” when they are made to be the same.
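The situation described above can be illustrated with a toy software model. The sketch below is not the circuitry of the node and not a method taught by the referenced application; the class `L1Cache`, its `flush` and `invalidate` operations, and the address `0x100` are all hypothetical, chosen only to show how two cached copies of one location can differ and then be made the same.

```python
class L1Cache:
    """Toy write-back cache with no hardware coherence (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory  # common memory shared by both processors
        self.lines = {}       # this processor's private cached copies

    def read(self, addr):
        if addr not in self.lines:
            # Miss: fetch the value from common memory.
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        # Only the local copy changes; common memory is not updated yet.
        self.lines[addr] = value

    def flush(self, addr):
        # Write the local copy back to common memory.
        if addr in self.lines:
            self.memory[addr] = self.lines[addr]

    def invalidate(self, addr):
        # Drop the local copy so the next read refetches from common memory.
        self.lines.pop(addr, None)


memory = {0x100: 1}
compute = L1Cache(memory)  # stands in for the Compute Processor's L1
io = L1Cache(memory)       # stands in for the I/O Processor's L1

compute.read(0x100)
io.read(0x100)             # both L1s now hold a copy of the location
compute.write(0x100, 2)    # the copies now differ
stale = io.read(0x100)     # the I/O side still sees the old value

compute.flush(0x100)       # make the copies the same again
io.invalidate(0x100)
fresh = io.read(0x100)
```

In this model the second read by the I/O side returns the old value until the writer flushes and the reader invalidates, which is the condition the term “coherent” is meant to rule out.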
Message passing is a commonly known form of computer communication wherein processors explicitly copy data from their own memory to that of another node. In the dual-processor node disclosed in the above-identified provisional patent application No. 60/271,124, the I/O Processor is principally used to facilitate message passing between the common memory of a node and the common memory of other nodes. Therefore, it both produces data (when a message is received) that is consumed by the Compute Processor, and consumes data (in order to send a message) that is produced by the Compute Processor. As a result, it is very common for both processors to have a copy of the same memory location in their L1s. If the messages passed are small and numerous, the problem is exacerbated. Thus, there is a clear need to find a way to make the L1s of each processor coherent, without extensive circuitry, and with minimal impact on performance.
As massively parallel computers are scaled to thousands of processing nodes, typical application messaging traffic involves an increasing number of messages, where each such message contains information communicated by other nodes in the computer. Generally, one node scatters locally-produced messages to some number of other nodes, while receiving some number of remotely produced messages into its local memory. Overall performance for these large-scale computers is often limited by the message-passing performance of the system.
For such data transfers, a common message-passing interface, described in the literature (see for example http://www.mpi-forum.org/docs/docs.html, under MPI-2), is known as “one-sided communication.” One-sided communication uses a “put/get” message-passing paradigm, where messages carry the source (for get) or the destination (for put) memory address. In parallel supercomputers operating on a common problem, puts and gets are typically assembled in batches and issued together. This keeps the independently operating processors in rough synchronization, maximizing performance. The time during which puts and gets occur is termed the put/get window. This window extends both in time (when it occurs) and in memory (over the range of memory addresses carried by the put or get messages). FIG. 2 shows a put/get window 30 having a number of distinct messages.
Put/get windows extend the concept of coherence to processors on different processing nodes of the massively parallel supercomputer. Implementations of put/get windows must ensure that all messages put to a window during the time it is open are received into the memory of the window before the window is closed. Similarly, a get on the memory of the window is only allowed during the time the window is open. Therefore, put/get windows are simply a mechanism for a node to synchronize with remote processors operating on its memory.
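The two dimensions of a put/get window, time and memory, can be made concrete with a small single-process model. This is a sketch of the access rules only: it does not model messages in flight between nodes, and the class `PutGetWindow`, its method names, and the base address `0x1000` are hypothetical and are not drawn from MPI or from the referenced applications.

```python
class PutGetWindow:
    """Toy model of a put/get window (hypothetical, single process)."""

    def __init__(self, base, size):
        self.base = base            # first address covered by the window
        self.size = size            # number of addresses covered
        self.memory = [0] * size    # local memory exposed by the window
        self.is_open = False

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def _check(self, addr):
        # A put or get is valid only while the window is open (time)
        # and only inside the window's address range (memory).
        if not self.is_open:
            raise RuntimeError("window is closed")
        if not (self.base <= addr < self.base + self.size):
            raise IndexError("address outside window")

    def put(self, addr, value):
        self._check(addr)
        self.memory[addr - self.base] = value

    def get(self, addr):
        self._check(addr)
        return self.memory[addr - self.base]


window = PutGetWindow(base=0x1000, size=4)
window.open()
window.put(0x1002, 42)          # the put carries its destination address
received = window.get(0x1002)   # the get carries its source address
window.close()

try:
    window.put(0x1002, 7)       # arrives after the window has closed
    late_put_rejected = False
except RuntimeError:
    late_put_rejected = True
```

A real implementation faces the harder problem the text identifies: a put issued while the window is open must be known to have arrived before the window closes, which this model sidesteps because every operation completes immediately.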
The management of a put/get window is currently accomplished by either buffering the put/get messages or by using explicit synchronization messages. Buffering the messages consumes memory, which is always in limited supply. Explicit synchronization for each window suffers from the long latency of round-trip messages between all the nodes accessing the window. Therefore, on large-scale machines such as the one described in copending patent application Ser. No. 10/468,993, filed Aug. 22, 2003, these approaches do not scale well because of limited memory for buffering, and because the number of nodes accessing any particular window often scales along with the number of processing nodes in the computer.
A long-standing problem in the field of computer design is how to keep these L1 caches coherent. Typical solutions employ a technique known as “snooping” the memory bus of the other processor, which can be slow and can reduce the performance of each processor. Alternatively, the processor that contains an old copy in L1 of the data in the common memory can request a new copy, or mark the old copy obsolete, but this requires knowledge of when the copy became invalid. Sometimes this knowledge is incomplete, forcing unnecessary memory operations and further reducing performance. Other computers make use of “interlocks,” whereby one processor is granted permission to use certain data while the other processor cannot, but this permission involves interactions between the two processors, which usually require additional complex circuitry in the semiconductor device, reducing the performance of the two processors.
Still other solutions in common practice disable all caching for areas of memory intended to be shared. This practice penalizes all memory accesses to these areas, not just those to the shared data.