Since the advent of computing, the time required to access data from memory has steadily increased relative to the computational processing time. Vector processors were developed in the 1970s to minimize the load/store time between processors and memory for repetitive computations by performing the computation in an assembly line manner. With the advent of multiprocessor architectures in the late 80s, processor speed and local memory access were markedly faster than communication between processors, which completely dominated total compute time.
During the late 80s and 90s several companies were established to further the idea of combining the power of emerging microprocessors into a single “supercomputer.” Of particular interest was the Connection Machine (CM) that was built by Thinking Machines Inc.; Hillis (U.S. Pat. No. 4,598,400), Theil, et al. (U.S. Pat. No. 4,805,091), (U.S. Pat. No. 4,814,973), Hillis (U.S. Pat. No. 5,535,408), Hillis (U.S. Pat. No. 5,590,283), and Wade et al. (U.S. Pat. No. 6,219,775). The CM demonstrated the feasibility of a supercomputer based on a massive number of processors but fell victim to what will be referred to as the “Communication Problem” that is discussed below. On paper the CM had a very impressive “peak” speed, but in practice the machine would achieve less than single-digit efficiency as measured by the ratio of computation to communication time. To date, efficiency has not improved for a number of reasons that will be discussed in further detail below. Primarily, this lack of efficiency is because communication speed has not kept pace with processor speeds. Consider the following quote from the Chief Technical Officer (CTO) of Cray Inc., published in the March 2006 issue of the High Performance Wire newsletter:
“. . . but as it is widely recognized, when scientific computing migrated to commodity platforms, interconnect performance, both in terms of bandwidth and latency, became the limiting factor on overall application performance and remains a bottleneck to this day.” Steve Scott, CTO of Cray, in HPCwire, 24 Mar. 2006
This statement was made to explain Cray's shift to building special purpose computers outside the “commercial off-the-shelf” (COTS) framework. This is a remarkable shift by the company whose name has been synonymous with general purpose supercomputing. Nevertheless, it is a reasonable shift considering the extent of the communication “bottleneck.”
Turning to the particular problems facing existing multiprocessor supercomputer systems, an understanding of the general nature and utilization of the systems is required. Multiprocessor systems are composed of a number of interconnected processors. The largest and most powerful systems are often referred to as supercomputers. Supercomputers are used for the most computationally demanding applications, such as weather prediction. These applications are distributed across the interconnected processors by assigning a portion of the overall computation task to each processor. However, to complete even a portion of the overall computation, each processor must communicate intermediate results to other processors. The communication among the processors adversely affects the performance of a multiprocessor for a number of reasons. Specifically, a first issue arises even in a two-processor system, wherein performance is adversely affected because the off-chip communication between the processors is generally much slower than on-processor computation.
A second issue relates to having an increased number of processors participating in the multiprocessor. Essentially, it rapidly becomes impractical to directly connect all processors to all other processors. As such, communication between any two or more processors is likely to be routed through intermediate processors, further degrading communication speed by at least a factor equal to the number of processors on the route.
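The growth in wiring that makes direct connection impractical can be illustrated with a short sketch. It counts the links needed to connect every processor directly to every other and compares that with an indirect topology in which messages pass through intermediate processors. The hypercube is used here only as a familiar example of an indirect topology and is an assumption of the sketch; the function names are likewise hypothetical.

```python
def full_mesh_links(n):
    """Links needed to wire every one of n processors directly to every other."""
    return n * (n - 1) // 2


def hypercube_dimension(n):
    """Dimension d of a hypercube with n = 2**d processors (assumed topology)."""
    if n < 1 or n & (n - 1):
        raise ValueError("hypercube size must be a power of two")
    return n.bit_length() - 1


def hypercube_links(n):
    """Each of n processors has d neighbors; every link is shared by two."""
    return n * hypercube_dimension(n) // 2


def hypercube_max_hops(n):
    """Worst-case number of links a message crosses between two processors."""
    return hypercube_dimension(n)


if __name__ == "__main__":
    for n in (16, 1024, 65536):
        print(n, full_mesh_links(n), hypercube_links(n), hypercube_max_hops(n))
```

For 1,024 processors the direct scheme needs 523,776 links, whereas the assumed hypercube needs 5,120 links at the price of up to 10 hops per message, which is the routing penalty described above.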
A third issue relates to the fact that data must be routed to all processors in parallel to support the parallel computation for which such supercomputer systems are designed. That is, to compute in parallel, all processors must be able to receive data simultaneously. Simultaneous data receipt leads to further degradation of communication because of possible contention or the blocking of transmissions by other transmissions.
A fourth issue relates to blocking and other such contention for resources. To eliminate blocking and otherwise optimize performance, transmissions must be scheduled. Depending on the overall system design, transmission scheduling may require additional time. System software and protocols are utilized to implement schedules that require real time transmission and decoding of the schedule and destination addresses, further contributing to an overall degradation in performance.
A fifth issue relates to the sheer number of processors in a particular multiprocessor system. As the number of processors increases, there is a corresponding increase in the physical system size. The physical size increase also increases signal distance and transmission time within the multiprocessor.
A sixth issue relates to system imbalances, which result in idle circuitry (i.e., an underutilization of the available resources). Examples of imbalances include mismatched local and global communication speeds and/or mismatched communication and computation speeds. The effect of such imbalances is to delay or idle certain system components until the slower components complete their tasks.
A seventh issue further relates to the complexity of the network as processor numbers increase. Fixed off-chip bandwidth has to be shared between an increasing number of channels thereby reducing the individual channel bandwidth.
An eighth issue is that the use of cache and resulting coherency communication associated with shared memory can substantially increase the resources of the multiprocessor that are dedicated to communication.
In light of the issues discussed above, one can understand why it is possible for performance to decrease with increasing numbers of processors. The communication among the additional processors adds more time than is saved by the corresponding decrease in computation time. An architecture or computation that yields such results is said to lack scalability.
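The loss of scalability can be made concrete with a deliberately simple timing model. The sketch assumes that a fixed amount of work divides evenly among p processors and that each additional processor adds a fixed communication cost; both the linear cost model and the parameter values are assumptions chosen for illustration and are not measurements of any actual machine.

```python
def total_time(p, work=1000.0, comm_per_proc=1.0):
    """Modeled run time on p processors.

    work / p is the computation share per processor; comm_per_proc * (p - 1)
    is an assumed communication cost that grows with every added processor.
    """
    return work / p + comm_per_proc * (p - 1)


def best_processor_count(max_p, work=1000.0, comm_per_proc=1.0):
    """Processor count in 1..max_p that minimizes the modeled run time."""
    return min(range(1, max_p + 1),
               key=lambda p: total_time(p, work, comm_per_proc))


if __name__ == "__main__":
    for p in (1, 8, 32, 64, 256):
        print(p, round(total_time(p), 2))
    print("best:", best_processor_count(256))
```

Under these assumed values the modeled run time falls until 32 processors and rises thereafter, so that 64 processors are slower than 32. Beyond that point each added processor costs more in communication than it saves in computation.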
The magnitude of the communication problem is evident in multiple processor supercomputers where communication likely consumes 90% or more of the total compute time, resulting in a decrease of overall computing performance of such supercomputers. As such, in light of the foregoing problems, there exists a need for a system and method to address the relatively poor performance of multiprocessor systems despite the increase in processor and memory speeds. Specifically, there exists a need to provide a solution to the communication delay among processors in a multiprocessor environment.
There exists a further need to provide a scalable multiprocessor system wherein overall performance of the multiprocessor system will increase in relative and direct proportion to the number of individual component processors. In sum, the magnitude of the communication problem as earlier described justifies an effort in both circuitry and software that is comparable to the effort that has been made over the years to improve computation. Indeed, every multiprocessor computational task has an underlying communication task; thus a solution to the “Communication Problem” would have a widespread and significant impact on future multiprocessor performance. It is intuitive that the performance we ultimately seek will come in the form of multiprocessors with truly massive numbers of processors.