1. Field of the Invention
This invention relates generally to a method and an apparatus for reconciling communication and locality in parallel processor systems, and particularly to a method and apparatus for enabling a user/programmer to write programs, in an extended procedural language such as an extended C programming language, that explicitly manipulate locality in order to optimize the performance of a parallel multiprocessor.
2. Background of the Related Art
A recurrent problem posed by parallel processing architectures is that of communication latency. Communication latency is the time required for a communication operation to complete. This time may include transfer time, overhead, and the time spent waiting for data synchronization. Communication latency exists in any parallel architecture, regardless of whether it supports a shared or a non-shared memory paradigm. Latency in and of itself, however, does not have to result in diminished performance. Rather, performance is diminished whenever a central processing unit (CPU) within a parallel system is forced to wait for some communication (or synchronization) operation. Therefore, latency may be addressed either by decreasing the time cost of communication or by overlapping communication with other tasks, i.e., tolerating latency.
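The difference between waiting for communication and overlapping it with other work can be illustrated with a small timing model. The following is a minimal sketch under stated assumptions, not part of the described apparatus: it assumes a task with one fixed compute time and one fixed communication time, it assumes overlap is perfect and free of extra overhead, and the function names and numbers are hypothetical.

```python
def blocking_time(compute, comm):
    """Total time when the CPU waits for communication before computing."""
    return compute + comm


def overlapped_time(compute, comm):
    """Total time when communication proceeds while the CPU does other work.

    Assumption: the overlap is perfect and adds no overhead of its own.
    """
    return max(compute, comm)


if __name__ == "__main__":
    compute, comm = 80.0, 50.0  # hypothetical times, arbitrary units
    print("blocking:  ", blocking_time(compute, comm))
    print("overlapped:", overlapped_time(compute, comm))
```

In this model, latency lengthens execution only when the communication time exceeds the computation available to cover it.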
One common method of decreasing the time cost of communication on shared memory systems is through cache memory. With cache memory, hardware is utilized to bring more frequently accessed data into memories that are closer to each CPU. This process is done automatically and utilizes a principle of locality. By bringing these data into local cache memory, the time cost of most loads and stores is reduced, thereby reducing latency. However, programs do not always exhibit such locality, particularly parallel programs accessing shared data. Further, this locality is hidden from a programmer and therefore is difficult to exploit.
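The dependence of cache effectiveness on locality can be made concrete with a toy simulation. The sketch below assumes a direct-mapped cache with 64 lines of 8 words each; these parameters, the function name, and the two access patterns are illustrative assumptions and do not describe any particular machine.

```python
def count_misses(addresses, num_lines=64, words_per_line=8):
    """Count misses in a direct-mapped cache for a sequence of word addresses."""
    tags = [None] * num_lines          # block currently held by each cache line
    misses = 0
    for addr in addresses:
        block = addr // words_per_line  # memory block containing this word
        index = block % num_lines       # cache line that block maps to
        if tags[index] != block:        # block not resident: fetch it
            tags[index] = block
            misses += 1
    return misses


if __name__ == "__main__":
    n = 1024
    sequential = range(n)         # consecutive words share cache lines
    strided = range(0, n * 8, 8)  # every access touches a new line
    print("sequential misses:", count_misses(sequential))
    print("strided misses:   ", count_misses(strided))
```

Both patterns issue the same number of accesses, but only the sequential one reuses each fetched line, which is the kind of locality the hardware relies on.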
Another technique, more common to distributed memory systems, is to increase communication bandwidth. This decreases communication latency by reducing the time required to send and receive data. Unfortunately, on many existing systems, the amount of time associated with software overhead tends to dominate communication time. While this may be improved by using a hardware coprocessor as in J.-M. Hsu and P. Banerjee, "A message passing coprocessor for distributed memory multicomputers," Supercomputing '90, November 1990, pp. 720-729, this solution is not complete because overheads still exist in controlling a coprocessor. Also, this solution does not aid a programmer in finding and exploiting locality. Finally, if data is not ready to be sent, no reduction of communication cost can eliminate the associated data synchronization latency.
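The limited benefit of added bandwidth when software overhead dominates can be seen in a simple cost model. The sketch below assumes that message time is a fixed per-message overhead plus a size-proportional transfer time; the model, the function name, and all numbers are hypothetical and chosen only for illustration.

```python
def message_time(num_bytes, overhead_s, bandwidth_bytes_per_s):
    """Time to send one message: fixed software overhead plus transfer time."""
    return overhead_s + num_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    size = 1000        # bytes (hypothetical)
    overhead = 100e-6  # 100 microseconds of software overhead (hypothetical)
    base = message_time(size, overhead, 10e6)  # 10 MB/s link
    fast = message_time(size, overhead, 20e6)  # bandwidth doubled
    print(f"base: {base * 1e6:.1f} us")
    print(f"fast: {fast * 1e6:.1f} us")
    print(f"speedup: {base / fast:.2f}x")
```

With these numbers, doubling the bandwidth shortens the message time by only about a quarter, because the overhead term is unaffected.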
An alternative to reducing latency is simply to tolerate it. Several mechanisms have been utilized to tolerate latency. These have one aspect in common: they change the programming paradigm from a control-based model to a data-based one. This is because data movement and synchronization are fundamental to the problem of communication latency (Arvind and R. A. Iannucci, Two Fundamental Issues in Multiprocessing, Tech. Report MIT/LCS/TM-330, MIT Laboratory for Computer Science, 1987).
One approach to latency tolerance is that used in dataflow machines such as Monsoon (G. M. Papadopoulos and D. E. Culler, "Monsoon: an explicit token-store architecture," 17th Annual Symposium on Computer Architecture, May 1990, pp. 82-91). In such dataflow machines, computation follows data movement. When all data for a given computation become ready, the computation takes place. Thus, latency is only reflected in program execution time when there is no data ready to be computed upon. Further, dataflow machines are most efficient when used with a "dataflow language" such as Id or Sisal.
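The rule that a computation takes place only when all of its data are ready can be sketched in a few lines. The following is an illustrative toy, not a description of Monsoon or of any dataflow language: the Node class, the evaluate function, and the example expression (a + b) * (c - d) are assumptions introduced here.

```python
import operator


class Node:
    """A dataflow operator that fires once all of its operands have arrived."""

    def __init__(self, name, op, arity=2):
        self.name = name
        self.op = op
        self.arity = arity
        self.operands = {}   # port number -> value received so far
        self.consumers = []  # (node, port) pairs fed by this node's result

    def receive(self, port, value, log):
        self.operands[port] = value
        if len(self.operands) < self.arity:
            return           # still waiting for data; nothing executes
        # All operands present: fire, record the result, forward it.
        # Single-shot sketch: operands are not cleared for reuse.
        result = self.op(*(self.operands[p] for p in range(self.arity)))
        log.append((self.name, result))
        for node, dst_port in self.consumers:
            node.receive(dst_port, result, log)


def evaluate(arrivals):
    """Compute (a + b) * (c - d), delivering inputs in the given order."""
    add = Node("add", operator.add)
    sub = Node("sub", operator.sub)
    mul = Node("mul", operator.mul)
    add.consumers.append((mul, 0))
    sub.consumers.append((mul, 1))
    ports = {"a": (add, 0), "b": (add, 1), "c": (sub, 0), "d": (sub, 1)}
    log = []
    for name, value in arrivals:
        node, port = ports[name]
        node.receive(port, value, log)
    return log


if __name__ == "__main__":
    # Inputs arrive out of program order; firing order follows the data.
    print(evaluate([("c", 7), ("a", 2), ("d", 3), ("b", 5)]))
```

The order in which operators fire is determined by the order in which operands arrive, not by the textual order of the expression.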
Another approach to latency tolerance is that used in multithreaded machines such as HEP (B. J. Smith, "Architecture and applications of the HEP multiprocessor system," SPIE Vol. 298 Real-Time Signal Processing IV, 1981, pp. 241-248), Horizon (J. T. Kuehn and B. J. Smith, "The Horizon supercomputing system: architecture and software," Supercomputing '88, November 1988, pp. 28-34), and Tera (R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, "The Tera computer system," 1990 International Conference on Supercomputing, June 1990, pp. 1-6). In these machines, latency is tolerated by keeping a large set of active light-weight processes (threads). When a thread needs to access an operand from shared memory, the thread is put in a wait queue and another thread is activated. Similarly, threads are placed in a queue when they need to wait for some form of data synchronization. Threads then re-enter a pool of active threads as data becomes available. This mechanism is aided by a large set of hardware contexts and a large register file, and thus it adds little overhead. Therefore, if enough threads exist and there is ready data, this mechanism allows latency to be hidden to some extent. However, it does not reduce the execution time of a single thread. Thus, programs must be broken into many threads to take advantage of this mechanism. Further, the number of threads must grow at a rate that is higher than the growth in the number of processors to maintain latency tolerance.
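How many threads are "enough" can be estimated with a simple utilization model. The sketch below is an assumption introduced for illustration and is not taken from the cited machines: every thread runs for a fixed number of cycles, then issues a remote access that completes a fixed number of cycles later, and each thread switch has a fixed cost. The function names and parameter values are hypothetical.

```python
def utilization(num_threads, run, latency, switch=0):
    """Fraction of processor cycles spent on useful work.

    Simplified model: each thread runs `run` cycles, then waits `latency`
    cycles for a remote access; each thread switch costs `switch` cycles.
    """
    busy_round = num_threads * (run + switch)  # one turn for every thread
    round_trip = run + switch + latency        # one thread's full cycle
    period = max(busy_round, round_trip)       # processor idles if shorter
    return num_threads * run / period


def threads_to_hide(run, latency, switch=0):
    """Smallest thread count at which the processor never idles."""
    per_thread = run + switch
    return -(-(per_thread + latency) // per_thread)  # ceiling division


if __name__ == "__main__":
    run, latency = 10, 90  # hypothetical cycle counts
    for n in (1, 5, 10, 20):
        print(n, "threads ->", round(utilization(n, run, latency), 2))
    print("threads needed:", threads_to_hide(run, latency))
    print("threads needed at double latency:", threads_to_hide(run, 2 * latency))
```

In this model a single thread gains nothing, and the thread count needed per processor rises with latency. If latency itself grows with machine size, the total number of threads must grow faster than the number of processors, which is one way to read the scaling requirement stated above.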
While both of the above approaches (dataflow machines and multithreaded machines) are capable of tolerating latency, they both require a very high degree of parallelism, one that is much larger than the total machine size. Further, neither of these approaches can utilize off-the-shelf CPUs. Thus, they cannot, for example, take advantage of the cost and performance benefits of new-generation microprocessors, such as reduced instruction set computers (RISCs). For example, the Intel i860 and the Inmos T9000 are moderately priced and have peak performance levels in ranges once achieved only by supercomputers (25-100 MFlops). RISCs utilize pipelining to exploit fine-grained parallelism and contain internal caches and floating point support.
While RISCs provide a significant improvement in cost/performance ratio, they accentuate the problems associated with communication latency. For example, RISCs have significant memory bottlenecks, because memory speed has not kept up with the higher clock rates of CPUs. Even data accesses to locations within a Processing Element's (PE's) local memory may be costly, because any off-chip access adds latency due to the time required to drive external pins.