The present invention relates generally relates to the formation of a 100 petaflop scale, low power, and massively parallel supercomputer.
This invention relates generally to the field of high performance computing (HPC) or supercomputer systems and architectures of the type such as described in the IBM Journal of Research and Development, Special Double Issue on Blue Gene, Vol. 49, Numbers 2/3, March/May 2005; and, IBM Journal of Research and Development, Vol. 52, 49, Numbers 1 and 2, January/March 2008, pp. 199-219.
Massively parallel computing structures (also referred to as “supercomputers”) interconnect large numbers of compute nodes, generally, in the form of very regular structures, such as mesh, torus, and tree configurations. The conventional approach for the most cost/effective scalable computers has been to use standard processors configured in uni-processors or symmetric multiprocessor (SMP) configurations, wherein the SMPs are interconnected with a network to support message passing communications. Today, these supercomputing machines exhibit computing performance achieving 1-3 petaflops (see http://www.top500.org/ June 2009). However, there are two long standing problems in the computer industry with the current cluster of SMPs approach to building supercomputers: (1) the increasing distance, measured in clock cycles, between the processors and the memory (the memory wall problem) and (2) the high power density of parallel computers built of mainstream uni-processors or symmetric multi-processors (SMPs').
In the first problem, the distance to memory problem (as measured by both latency and bandwidth metrics) is a key issue facing computer architects, as it addresses the problem of microprocessors increasing in performance at a rate far beyond the rate at which memory speeds increase and communication bandwidth increases per year. While memory hierarchy (caches) and latency hiding techniques provide excellent solutions, these methods necessitate the applications programmer to utilize very regular program and memory reference patterns to attain good efficiency (i.e., minimizing instruction pipeline bubbles and maximizing memory locality).
In the second problem, high power density relates to the high cost of facility requirements (power, cooling and floor space) for such peta-scale computers.
It would be highly desirable to provide a supercomputing architecture that will reduce latency to memory, as measured in processor cycles, exploit locality of node processors, and optimize massively parallel computing at ˜100 petaOPS-scale at decreased cost, power, and footprint.
It would be highly desirable to provide a supercomputing architecture that exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single ASIC.
It would be highly desirable to provide a supercomputing architecture that comprises a unique interconnection of processing nodes for optimally achieving various levels of scalability.
It would be highly desirable to provide a supercomputing architecture that comprises a unique interconnection of processing nodes for efficiently and reliably computing global reductions, distribute data, synchronize, and share limited resources.