1. Field of the Invention
This invention relates generally to a method and an apparatus for reconciling communication and locality in a parallel processor system and particularly to a method and apparatus for enabling a user/programmer to write programs in an extended procedural language such as an extended C programming language which explicitly manipulate locality in order to optimize performance of a parallel multiprocessor.
2. Background of the Related Art
A recurrent problem posed by parallel processing architectures is that of communication latency. Communication latency is the time required for a communication operation to complete. This time may include transfer time, overhead, and the time spent waiting for data synchronization. Communication latency exists in any parallel architecture regardless of whether it supports a shared or a non-shared memory paradigm. Latency in and of itself, however, does not have to result in diminished performance. Rather, performance is diminished whenever a central processing unit (CPU) within a parallel system is forced to wait for some communication (or synchronization) operation. Therefore, latency may be addressed either by decreasing the time cost of communication or by overlapping it with other tasks, i.e., tolerating latency.
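The decomposition above can be sketched in C. This is a minimal illustrative model, not code from any cited system; the structure and function names are assumptions chosen for clarity.

```c
/* Illustrative model of the latency components described above
 * (arbitrary time units); not taken from any cited architecture. */
struct comm_op {
    int transfer_time;   /* time spent moving data */
    int overhead;        /* software/initiation cost */
    int sync_wait;       /* time waiting for data synchronization */
};

/* Total latency of one communication operation. */
int total_latency(const struct comm_op *op)
{
    return op->transfer_time + op->overhead + op->sync_wait;
}

/* Time the CPU actually stalls when 'overlap' units of independent
 * computation can proceed while the operation is in flight: latency
 * diminishes performance only when the CPU is forced to wait. */
int cpu_stall(const struct comm_op *op, int overlap)
{
    int remaining = total_latency(op) - overlap;
    return remaining > 0 ? remaining : 0;
}
```

The two remedies named above map directly onto this model: reducing communication cost shrinks the fields of `comm_op`, while latency tolerance increases `overlap`.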
One common method of decreasing the time cost of communication on shared memory systems is through cache memory. With cache memory, hardware is utilized to bring more frequently accessed data into memories that are closer to each CPU. This process is done automatically and utilizes a principle of locality. By bringing these data into local cache memory, the time cost of most loads and stores is reduced, thereby reducing latency. However, programs do not always exhibit such locality, particularly parallel programs accessing shared data. Further, this locality is hidden from a programmer and therefore is difficult to exploit.
Another technique, more common to distributed memory systems, is to increase communication bandwidth. This decreases communication latency by reducing the time required to send and receive data. Unfortunately, on many existing systems, the amount of time associated with software overhead tends to dominate communication time. While this may be improved by using a hardware coprocessor as in J. M. Hsu and P. Banerjee, "A message passing coprocessor for distributed memory multicomputers," Supercomputing '90, November 1990, pp. 720-729, this solution is not complete because overheads still exist in controlling a coprocessor. Also, this solution does not aid a programmer in finding and exploiting locality. Finally, if data is not ready to be sent, no reduction of communication cost can eliminate associated data synchronization latency.
An alternative to reducing latency is to simply tolerate it. There are several mechanisms that have been utilized to tolerate latency. These have one common aspect: they change the programming paradigm from a control-based model to a data-based one. This is because data movement and synchronization are fundamental to the problem of communication latency (Arvind and R. A. Iannucci, Two Fundamental Issues in Multiprocessing, Tech. Report MIT/LCS/TM-330, MIT Laboratory for Computer Science, 1987).
One approach to latency tolerance is that used in dataflow machines such as Monsoon (G. M. Papadopoulos and D. E. Culler, "Monsoon: an explicit token-store architecture," 17th Annual Symposium on Computer Architecture, May 1990, pp. 82-91). In such dataflow machines, computation follows data movement. When all data for a given computation become ready, the computation takes place. Thus, latency is only reflected in program execution time when there is no data ready to be computed upon. These dataflow machines are most efficient when used with a "dataflow language" such as Id or Sisal.
Another approach to latency tolerance is that used in multithreaded machines such as HEP (B. J. Smith, "Architecture and applications of the HEP multiprocessor system," SPIE Vol. 298 Real-Time Signal Processing IV, 1981, pp. 241-248), Horizon (J. T. Kuehn and B. J. Smith, "The Horizon supercomputing system: architecture and software," Supercomputing '88, November 1988, pp. 28-34), and Tera (R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, "The Tera computer system," 1990 International Conference on Supercomputing, June 1990, pp. 1-6). In these machines, latency is tolerated by keeping a large set of active light-weight processes (threads). When a thread needs to access an operand from shared memory, the thread is placed in a wait queue and another thread is activated. Similarly, threads are placed in a queue when they must wait for some form of data synchronization. Threads re-enter the pool of active threads as their data becomes available. This mechanism is aided by a large set of hardware contexts and a large register file, and thus it adds little overhead. Therefore, if enough threads exist and ready data is available, this mechanism allows latency to be hidden to some extent. However, it does not help the execution time of a single thread. Thus, programs must be broken into many threads to take advantage of this mechanism. Further, the number of threads must grow at a rate that is higher than the growth in the number of processors to maintain latency tolerance.
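The scheduling discipline described above can be illustrated with a short C sketch. This is a simplified model of the wait-queue/active-pool idea, not the actual HEP or Tera hardware design; the data structures and names are assumptions.

```c
/* Simplified model of multithreaded latency hiding: stalled threads
 * wait for their operand to arrive; the scheduler runs any ready
 * thread.  Illustrative only, not the HEP/Horizon/Tera design. */
enum thread_state { ACTIVE, WAITING };

struct thread {
    enum thread_state state;
    int wake_at;   /* cycle at which the awaited operand arrives */
};

/* One scheduling step at time 'cycle': re-activate threads whose data
 * has arrived, then run the first active thread found.  Returns the
 * index of the thread that runs, or -1 if every thread is stalled --
 * the only case in which latency is exposed as lost CPU time. */
int schedule(struct thread *pool, int n, int cycle)
{
    for (int i = 0; i < n; i++)
        if (pool[i].state == WAITING && pool[i].wake_at <= cycle)
            pool[i].state = ACTIVE;
    for (int i = 0; i < n; i++)
        if (pool[i].state == ACTIVE)
            return i;
    return -1;   /* no ready thread: the CPU stalls this cycle */
}
```

The model also shows why the thread count must grow with machine size: with too few threads, the `-1` (stall) case dominates as memory latency rises.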
While both of the above approaches (dataflow machines and multithreaded machines) are capable of tolerating latency, they both require a very high degree of parallelism--one that is much larger than the total machine size. Further, neither of these approaches can utilize off-the-shelf CPUs. Thus, they cannot, for example, take advantage of the cost and performance benefits of new generation microprocessors, such as reduced instruction set computers (RISCs). For example, the Intel i860 and the Inmos T9000 are moderately priced and have peak performance levels in ranges once achieved only by supercomputers (25-100 MFlops). RISCs utilize pipelining to exploit fine-grained parallelism and contain internal caches and floating point support.
While RISCs provide a significant improvement in cost/performance ratio, they accentuate problems associated with communication latency. For example, RISCs have significant memory bottlenecks, because memory speed has not kept up with the higher clock rates in CPUs. Even data accesses to locations within a Processing Element's (PE's) local memory may be costly, because any off-chip accesses add latency due to time required to drive external pins.
Distributed-memory multiprocessors (DMMPs) are the future backbone for large-scale parallel architectures. The programming models for DMMPs generally require explicit data movement among processing elements.
Early generation DMMPs were technologically immature in their communication operations. These machines generally had large interconnection networks and very little hardware to assist in the transmission of messages. Messages were transmitted through a store-and-forward methodology. A characteristic of store-and-forward networks is that when a node receives a message, a CPU interrupt must occur. This requires the CPU to preempt its computation in order to take the entire message off the channel. The message's destination address must then be analyzed: if the message is destined for the receiving node, its data is integrated into the computation; otherwise, the CPU must send the message down the channel toward its destination node. This method of communication requires every node between the source node and the destination node to process the incoming message. The transmission time for store-and-forward communication is the product of the length of the message and the number of hops between the source and destination nodes. Because messages are stored and forwarded, CPU cycles are lost, and communication overheads and latencies are incurred. Architectures with this communication methodology were only well-suited for coarse-grained applications.
One of the first major communication-related advancements in DMMPs was in the development of more sophisticated interconnection networks. These new interconnection networks had route-through capabilities which meant that when a message arrived at an intermediate node, the processor no longer had to check the destination address. Special hardware in the network analyzed the message and then automatically routed it to its destination. This hardware routing was performed without disturbing any of the intermediate nodes. The processor, however, was still responsible for initiating the communication and for servicing the interrupts generated when the message arrived at its destination.
The transmission time of route-through networks is the sum of the length of the message and the number of hops between the source and destination nodes. Route-through networks are much more efficient than store-and-forward networks since only the overheads of initiation, termination, and latency of communication occur. Route-through systems are also better suited for finer-grained parallelism as opposed to store-and-forward systems.
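The two transmission-time models described above can be stated directly in C. This sketch uses unit time per message word and per hop and omits initiation and termination overheads; it is an illustration of the stated formulas, not a model of any particular network.

```c
/* Transmission-time models from the discussion above, in unit time
 * per message word and per hop (overheads omitted for clarity). */

/* Store-and-forward: the whole message is received and re-sent at
 * every intermediate node, so the cost is the PRODUCT of message
 * length and hop count. */
int store_and_forward_time(int msg_len, int hops)
{
    return msg_len * hops;
}

/* Route-through: the header establishes the path and the data streams
 * behind it without disturbing intermediate nodes, so the cost is the
 * SUM of message length and hop count. */
int route_through_time(int msg_len, int hops)
{
    return msg_len + hops;
}
```

For a 100-word message crossing 5 hops, the product model gives 500 time units while the sum model gives 105, which is why route-through networks suit finer-grained parallelism.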
Even with advancements in route-through schemes, the overheads of communication are still not eliminated. Distributed systems need to deal with synchronization of messages and the communication protocol for the nodes at both ends of the channel. The delays incurred from the additional synchronization and communication initiation/termination induce communication latency. Several means for resolving communications latency have been proposed in current parallel systems.
In an effort to dedicate the CPU strictly to computation, communications co-processors have recently been introduced to handle the communication initiation/termination overheads. This passing of responsibility from the CPU to the co-processor can potentially provide for even finer-grained applications. In particular, a smooth interface should exist between the CPU and the co-processor, because initiation of communication requires the CPU to notify the co-processor of communication operations.
Past approaches to tolerating communication latency and synchronization overheads in DMMPs while still providing a high degree of parallelism are many. In fine-grained multicomputers, such as the J-Machine (Da189), there is a large amount of concurrency, but this is achieved at the cost of high communication overheads. Single instruction multiple data (SIMD) machines (e.g., the CM-2 in Thinking Machines, "The Connection Machine CM-2 Technical Summary," Thinking Machines Corporation, Cambridge, Mass., Oct. 1990) used massive parallelism and hard-wired synchronization to overcome the problems of slow processors and long network latency, respectively. In the J-Machine, attempts have been made to decrease the communication overheads inherent in fine-grained parallelism to allow such concurrent operations.
Distributed shared-memory machines provide the user with many of the conveniences of shared memory programming (i.e., a uniform address space) while allowing the machine to be implemented in such a way that it can scale far beyond typical shared memory machines. In distributed shared-memory machines (e.g., the KSR1, Kendall Square Research, Technical Survey, Kendall Square Research Corporation, 1992; Dash, D. Lenoski et al., "The DASH Prototype: Implementation and Performance," The 19th International Symposium on Computer Architecture, Australia, May 1992, pp. 92-103; and Alewife, Agarwal et al., "The MIT Alewife Machine: A Large Scale Distributed Memory Multiprocessor," Tech. Report MIT/LCS TM-54, MIT, 1991), communication latency is tolerated by exploiting locality of reference. Locality of reference is present in parallel programs to some extent without any conscious effort by the programmer. Programmers can increase locality by optimizing memory-reference patterns, or compilers may be used to increase locality automatically.
In machines such as the CM-5 (Thinking Machines, "The Connection Machine CM-5 Technical Summary," Thinking Machines Corporation, Cambridge, Mass., Oct. 1991) and the AP1000 by Fujitsu, the control networks and special network operations take advantage of the type of parallelism inherent in the application in order to tolerate latency. A brief overview of the above machines will demonstrate the way these architectures have handled communication and synchronization overheads.
The J-Machine is able to tolerate communication latency by implementing message-driven processors (MDPs) to handle fine-grained tasks. The processors are message-driven in the sense that they begin execution in response to messages, via the dispatch mechanism. No receive operation is needed, which eliminates some of the communication software overhead. The MDPs create tasks to handle each arriving message. Messages carrying these tasks advance, or drive, each computation. To support fine-grain concurrent programming, the tasks are small, typically 20 instructions. The router and the six two-way network ports integrated into each processing node couple the node to the three-dimensional network topology to provide efficient transmission of messages. The sending of messages does not consume any processing resources on intermediate nodes, and buffer memory is automatically allocated on the receiving nodes. The MDP provides concurrency through fast context-switching, and prefetching is also used to improve locality. The MDPs provide synchronization by using message dispatch and presence tags on all states. In response to an arriving message, the MDP may set the presence tags so that access to a value may not be preempted.
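The dispatch idea described above, where an arriving message directly selects the task that handles it rather than being claimed by an explicit receive, can be sketched in C. The message layout, handler names, and table are hypothetical illustrations, not the MDP instruction set.

```c
/* Sketch of message-driven dispatch in the spirit of the description
 * above: an arriving message drives computation by selecting a small
 * handler task directly.  All names and layouts are hypothetical. */
enum msg_type { MSG_COMPUTE, MSG_SYNC, MSG_TYPES };

struct message {
    enum msg_type type;
    int payload;
};

typedef int (*handler_fn)(int payload);

/* Small handler tasks, stand-ins for the ~20-instruction tasks. */
static int compute_handler(int payload) { return payload * 2; }
static int sync_handler(int payload)    { return payload; }

/* Dispatch table indexed by message type: no explicit receive is
 * executed by the computation; the message itself drives the task. */
static handler_fn dispatch_table[MSG_TYPES] = {
    compute_handler,
    sync_handler,
};

int dispatch(const struct message *m)
{
    return dispatch_table[m->type](m->payload);
}
```

The key property this sketch captures is that the software overhead of matching a send to a waiting receive is replaced by a single table lookup.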
Kendall Square Research's multiprocessor KSR1 has what is referred to as an ALLCACHE memory system. In this system, the behavior of memory is similar to that of familiar caches. The difference is that, unlike typical architectures, the source for data is not main memory but another cache. The tasks of dynamic storage allocation and cache management are inherent in the ALLCACHE hardware mechanisms. The ALLCACHE hardware manages a distributed directory to determine the location of each reference. The directory is also used to provide synchronization semantics by maintaining the state of the information it has stored (i.e., determining which operations the memory system is allowed to perform on that particular reference copy). The KSR1 replicates read-only data to decrease communication, and the directories are responsible for managing the replicated data across the processors by issuing invalidation messages. The direct invalidation of a select number of copies has obvious advantages over the broadcast invalidations more commonly used in other shared-memory architectures. The KSR1 exploits locality of reference by organizing the ALLCACHE memory into a hierarchy and by constructing the node interconnections with a fat-tree topology. With this approach, the aggregate memory bandwidth theoretically increases in direct proportion to the number of processors. This type of architecture can be extended to an arbitrary number of hierarchical levels to allow an increase in the number of processors employed. The KSR1 also incorporates a number of features to assist programmers in their efforts to optimize locality. These features include an Event Monitor Unit, which logs local cache hits/misses, how far in the hierarchy a request had to travel to be satisfied, and the number of cycles involved with such events. Prefetch and post-store instructions are also available and controllable either by the compiler or by the programmer.
The Dash (i.e., Directory architecture for shared memory) and Alewife architectures are similar to the KSR1 in that they both primarily depend on caching to achieve scalability. Some of the differences include the network topology, the node architectures, the memory model, and the means for managing the distributed directories. Dash uses a 2-D mesh, and each node in the system architecture corresponds to a cluster of four processors and a two-level cache implementation. The first-level write-through data cache, inherent on each processor, is interfaced with a larger second-level write-back data cache. The main purpose of the second-level cache is to convert the write-through policy of the first level to a write-back policy, and to provide extra cache tags for bus snooping. The snooping policy is implemented only within a cluster of processors within a node. The distributed directory-based cache coherence protocol is a function of the memory consistency model adopted by the architecture. The sequential consistency model, inherent in the KSR1 and Alewife, essentially represents program execution on a uniprocessor where multi-tasking is available. In Dash, a weaker consistency model called release consistency is used. This approach requires the programmer to specify which memory accesses require sequential consistency, but can hide much of the synchronization overhead of write operations. The distributed directory-based cache coherence protocol is implemented entirely in hardware. Additional latency-tolerating features include a forwarding control strategy and special purpose operations such as prefetching and update writes.
Alewife also implements a 2-D mesh network, and its single processor node employs a modified processor, called SPARCLE, which provides for multi-threading and fast context switching. The rapid context switching is meant to hide communication and synchronization delays and achieve high processor utilization. Alewife uses a directory-based cache coherence protocol, which is implemented as a combination of hardware and software. The hardware only supports a small number of variable copies; for any larger number of copies, system software must be used. The primary architectural difference is that, unlike the KSR1, Alewife and Dash have fixed homes for each address. That is, in addition to the caches, Alewife and Dash employ ordinary memory modules. Therefore, in Alewife, along with Dash, a cache miss must be resolved by referencing the home memory module.
The CM-5 is a distributed system that employs both SIMD and multiple instruction multiple data (MIMD) execution models. The CM-5 employs a control network and a data network, along with a number of control processors to manage partitions. Each partition consists of a control processor, a collection of processing nodes (each implementing a SPARC processor), and dedicated portions of the data and control networks. Each user process executes on a single partition, but may exchange data with processes on other partitions. All partitions utilize UNIX timesharing and security features so that multiple users may have access to the partitions. The CM-5 control network allows for a number of interprocessor communication operations that may be used to reduce communication latency. These include replication, reduction, permutation, and parallel prefix. The control network provides mechanisms that allow data-parallel code to be executed efficiently, as well as supporting MIMD execution for general-purpose applications. The hierarchical nature of a fat-tree topology, as used in the KSR1, is provided to increase the locality of data, which helps to reduce the communication latency.
In the AP1000, three independent networks are employed to route messages. The Torus network (T-net) is used for point-to-point communication between cells, the Broadcast network (B-net) is used for 1-to-N communication, and the Synchronization network (S-net) is used for barrier synchronization. These different networks are used to optimize the use of barrier synchronization and broadcasts. An automatic routing scheme is used for the T-net, which combines wormhole routing with a structured buffer pool algorithm to reduce communication latency. Each cell incorporates a message controller (MSC) for fast message handling, a SPARC processor, and a routing controller (RTC). The message controller supports buffer receiving, index receiving, stride DMA, list/list vector transferring, and line sending. The RTC provides the automatic routing function for the T-net. To provide efficient distribution and collection of data to and from the host processor, each cell has special hardware for scatter and gather functions. These functions are provided so that the time for data transfer setup and message assembling/disassembling at the host computer does not increase as the number of cells increases.
The Paragon (an Intel machine) is a system that employs a communication co-processor to handle communication operations. Each node implements an i860XP processor as an application processor and another i860XP as a communication co-processor. The application processor runs the operating system and application processes while the other processor executes message-passing software. The Network Interface (NIC) connects the node to the network and contains receive and transmit FIFOs which buffer messages between the node and the mesh routing chip (iMRC) on the backplane. The iMRC interfaces a node to the 2-D network through full-duplex communication pathways available in a NEWS grid formation.
U.S. patent application Ser. No. 07/800,530, now U.S. Pat. No. 5,442,797, resolves these issues by providing a multi-processor system made up of a plurality of processing elements. Each of these processing elements includes a locality manager and a central processing unit. The central processing unit in each processing element can be a reduced instruction set computer (RISC) microprocessor, thereby taking advantage of high speed single microprocessor technology.
The user can also specify locality manager code (lmc) statements which are to be executed only by the locality manager. Hence, the multi-processor has a runtime environment which has kernels for both the central processing unit and the locality manager. The multi-processor allows a user to declare certain variables to be of a storage class taggable. In so doing, the user can design code which takes full advantage of parallel processing power. For example, the multi-processor utilizes a "request" operation and a "release" operation which makes it possible to transfer data among a plurality of localities. In addition, the multi-processor allows a user to specify a count field that indicates how many localities must request a data item before the data item may be overwritten.
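The counting semantics of the "request" and "release" operations described above can be modeled in plain C. This is a hypothetical illustration of the count-field behavior only; the struct, function names, and fields are assumptions, and the actual extended-C syntax and taggable storage class of the referenced patent are not reproduced here.

```c
/* Hypothetical model of the "release"/"request" operations and the
 * count field described above.  Illustrative only: the real system
 * uses an extended-C storage class and a locality manager, neither
 * of which is modeled here. */
struct taggable {
    int value;
    int pending_requests;  /* localities that must request this datum
                              before it may be overwritten */
};

/* A locality releases a datum for 'count' consuming localities. */
void release(struct taggable *t, int value, int count)
{
    t->value = value;
    t->pending_requests = count;
}

/* A locality requests the datum, recording the consumption. */
int request(struct taggable *t)
{
    if (t->pending_requests > 0)
        t->pending_requests--;
    return t->value;
}

/* The datum may be overwritten only after every consuming locality
 * has issued its request. */
int may_overwrite(const struct taggable *t)
{
    return t->pending_requests == 0;
}
```

In this model, a count of two means the producer must observe two completed requests before `may_overwrite` permits the datum to be replaced, which is the protection the count field provides.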
Therefore, it is desirable to have a second processor which can serve as a locality manager for each processing element in a parallel multiprocessor.
It is also desirable to provide bus interface logic to monitor a central processing unit bus in each processing element of a parallel multiprocessor.
It is also desirable to provide synchronization components for each processing element of a parallel multiprocessor unit.