Various high-speed computer processing systems, sometimes referred to as supercomputers, have been developed to solve a variety of computationally intensive applications, such as weather modeling, structural analysis, fluid dynamics, computational physics, nuclear engineering, real-time simulation, signal processing, etc. The architectures of such present supercomputer systems can be generally classified into one of two broad categories: minimally parallel processing systems and massively parallel processing systems.
The minimally parallel class of supercomputers includes both uniprocessors and shared-memory multiprocessors. A uniprocessor is a very high-speed processor that utilizes multiple functional elements, vector processing, pipeline and look-ahead techniques to increase the computational speed of the single processor. Shared-memory multiprocessors consist of a small number of high-speed processors (typically two, four or eight) that are tightly-coupled to each other and to a common shared-memory using either a bus-connected or direct-connected architecture.
The massively parallel class of supercomputers includes both array processors and distributed-memory multicomputers. Array processors generally consist of a very large array of single-bit or small processors that operate in a single-instruction-multiple-data (SIMD) mode, as used for example in signal or image processing. Distributed-memory multicomputers also have a very large number of computers (typically 1024 or more) that are loosely-coupled together using a variety of connection topologies such as hypercube, ring, butterfly switch and hypertrees to pass messages and data between the computers in a multiple-instruction-multiple-data (MIMD) mode.
As used within the present invention, the term multiprocessor will refer to a tightly-coupled, shared-memory multiple-processor computer processing system. The term multicomputer will refer to a loosely-coupled, multiple-processor computer processing system with distributed local memories. The terms tightly-coupled and loosely-coupled refer to the relative difficulty and time delay in passing messages and data between processors. Tightly-coupled processors share a common connection means and respond relatively quickly to messages and data passed between processors. Loosely-coupled computers, on the other hand, do not necessarily share a common connection means and may respond relatively slowly to messages and data passed between computers. An architectural taxonomy for the existing architectures of modern supercomputers using these definitions is set forth in Hwang, K., Parallel Processing for Supercomputers and Artificial Intelligence, pp. 31-67 (1989).
For most applications for which a supercomputer system would be useful, the objective is to provide a computer processing system with the fastest processing speed and the largest problem solving space, i.e., the ability to process a large variety of traditional application programs. In an effort to increase the problem solving space and the processing speed of supercomputer systems, the minimally parallel and massively parallel architectures previously described have been introduced into supercomputer systems.
It will be recognized that parallel computer processing systems work by partitioning a complex job into processes and distributing both the program instructions and data for these processes among the different processors and other resources that make up the computer processing system. For parallel computer processing systems, the amount of processing to be accomplished between synchronization points in a job is referred to as the granularity of the job. If there is a small amount of processing between synchronization points, the job is referred to as fine grain. If there is a large amount of processing between synchronization points, then the job is referred to as large grain. In general, the finer the granularity of a job, the greater the need for synchronization and communication among processors, regardless of whether the computer processing system is a minimally parallel or massively parallel system. The exception to this situation is the SIMD processor array system that operates on extremely parallel problems where the limited locality of shared data requires communication among only a very few processors.
The approach taken by present massively parallel computer processing systems is to increase the processing speed by increasing the number of processors working on the problem. In theory, the processing speed of any parallel computer processing system should be represented as the number of processors employed in solving a given job multiplied by the processing speed of each processor. In reality, the problems inherent in present parallel computer processing systems prevent them from realizing this full potential. The principal problems of massively parallel computer processing systems are the inability to successfully divide jobs into several generally coequal but independent processes, and the difficulties in the distribution and coordination or synchronization of these processes among the various processors and resources during actual processing. The present architectures for massively parallel computer processing systems cannot perform the interprocessor communication and coordination efficiently enough to justify the large overhead for setting up such a system because inter-processor communication is, at best, indirect. In addition, massively parallel systems sacrifice problem solving space for speed by requiring users to reprogram traditional applications to fit the distributed memory architecture of such systems. By analogy, these problems are similar to the problems that prevent a job requiring 1,000 person-hours of effort from being completed by 1,000 workers in a single hour.
Minimally parallel computer processing systems, on the other hand, attempt to increase problem solving space and processing speed by increasing the speed of the individual processors. Such minimally parallel systems have a larger problem space because a shared-memory system is required to execute traditional application programs. Unfortunately, the clock speed of the individual processors used in present minimally parallel computer processing systems is approaching the practical and theoretical limits that are achievable using current semiconductor technology. While this technique works relatively well for large grain problems where inter-processor communication is limited, the small number of processors limits the number of independent parallel processes that may be simultaneously performed, regardless of the speed of each individual processor. Again, by analogy, a 1,000 person-hour job cannot be completed in less than 125 hours if a maximum of four people can work on the job, even if each person can work twice as fast as a normal person.
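The arithmetic behind the two worker analogies above can be sketched as follows. This is a minimal illustration only, assuming the work is perfectly divisible and that coordination costs nothing, which is the idealization the text argues real systems cannot reach. The function name `min_completion_hours` and its parameters are hypothetical and do not come from the original text.

```python
def min_completion_hours(work_person_hours, max_workers, speed_factor=1.0):
    """Return a lower bound on elapsed hours for a job.

    Assumes (hypothetically) perfectly divisible work and zero
    coordination overhead; real systems can only do worse.
    """
    if max_workers < 1 or speed_factor <= 0:
        raise ValueError("max_workers must be >= 1 and speed_factor > 0")
    # Aggregate rate = number of workers times the speed of each worker.
    aggregate_rate = max_workers * speed_factor
    return work_person_hours / aggregate_rate


# Four workers, each twice as fast as normal: the 125-hour bound.
print(min_completion_hours(1000, 4, speed_factor=2.0))  # 125.0

# 1,000 normal workers: one hour in theory, unattainable in practice.
print(min_completion_hours(1000, 1000))  # 1.0
```

The bound makes the trade-off explicit: raising `speed_factor` with few workers and raising `max_workers` at normal speed both shrink the ideal time, but neither accounts for the division and synchronization costs discussed above.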
Ideally, it would be desirable to extend the direct-connection methods of inter-processor communication of minimally parallel computer processing systems to the numbers of processors used in massively parallel computer processing systems. Unfortunately, the present direct-connection methods of coordinating the processors in minimally parallel systems severely limit the number of processors that may be efficiently interconnected and cannot be extended to serve the numbers of processors utilized in a massively parallel system. For example, in the architecture for the Cray X-MP supercomputer system developed by Cray Research, Inc., that is the subject of U.S. Pat. No. 4,636,942, a deadlock interrupt means is used to coordinate two high-speed processors. While this type of tightly-coupled, direct-connection method is an efficient means for coordinating two high-speed processors, the hardware deadlock interrupt mechanism described in this invention is most effective when the number of processors being coupled together is very small, i.e., eight or fewer.
Because of the inherent limitations of the present architectures for minimally parallel and massively parallel supercomputer systems, such computer processing systems are unable to achieve significantly increased processing speeds and problem solving spaces over current systems. Therefore, a new architecture is needed for interconnecting parallel processors and associated resources that allows the speed and coordination of current minimally parallel multiprocessor systems to be extended to larger numbers of processors, while also resolving some of the synchronization problems associated with massively parallel multicomputer systems. This range between minimally parallel and massively parallel systems will be referred to as highly parallel computer processing systems and can include multiprocessor systems having sixteen to 1024 processors.
Presently, the only attempts to define an architecture suitable for use with such highly parallel computer processing systems have been memory-hierarchy type supercomputers. In these systems, some type of hierarchical or divided memory is built into the supercomputer system.
In the Cedar supercomputer system developed at the University of Illinois, a two-stage switch is used to connect an existing cluster of processors in the form of an Alliant FX/8 eight-processor supercomputer to an external global memory module. In this system, the global memory is separate and distinct from the cluster memory. Coordination among clusters is accomplished by paging blocks of data or instructions in and out of each cluster memory from common blocks of data or instructions in the global memory. Kuck, D., "Parallel Supercomputing Today and the Cedar Approach", Science, Vol. 231, pp. 967-74 (February 1986).
In the ETA-10 supercomputer system developed by Control Data Corporation, but now abandoned, each of eight processors has a register file and a central processor memory. Each processor also has access to a common shared memory and a shared virtual memory existing on disk storage that is accessible through eighteen I/O units. A communication buffer that is not part of the virtual memory system provides fast locking and synchronization functions. ETA10 System Overview; EOS, Tech. Note, Publ. 1006, Rev. B, ETA Systems, Sep. 30, 1988.
In the RP3 supercomputer system developed at the IBM Watson Research Center, 512 32-bit microprocessors are configured together in eight groups of 64 microprocessors. Each microprocessor has its own local memory, a portion of which may be reconfigurable as global memory at the run time for a particular job. In essence, the local/global boundary is dynamically determined at the beginning of each job in an attempt to maximize the granularity of the system while minimizing inter-processor communication bottlenecks. Pfister, G., "The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture", International Conference on Parallel Processing, pp. 764-71, August 1985.
The principal problem with using these kinds of memory-hierarchy type architectures for highly parallel supercomputer systems is that the structure of each software application program must be optimized to fit the particular memory-hierarchy architecture of that supercomputer system. In other words, the software programmer must know how the memory is divided up in the memory-hierarchy in order to similarly divide the job into tasks so as to optimize the processing speed for the particular job. If a job is not optimized for the particular memory-hierarchy, not only will the memory-hierarchy supercomputer not approach its maximum theoretical processing speed, but, in fact, the processing speed may actually be slower than other comparable supercomputers because of the memory thrashing that may occur between the different levels of memory.
While the present architectures for supercomputer systems have allowed such systems to achieve peak performances in the range of 0.2 to 2.4 GFLOPS (billion floating point operations per second), it would be advantageous to provide a method and apparatus for creating a cluster architecture for a highly parallel scalar/vector multiprocessor system that is capable of effectively connecting between sixteen and 1024 processors together in a highly parallel architecture to achieve peak performance speeds in the range of 10 to 1,000 GFLOPS. More importantly, there is a need for a highly parallel architecture for a multiprocessor computer processing system that allows for the symmetric access of all processors to all shared resources and minimizes the need for optimization of software applications to a particular memory-hierarchy.