In serial computing, communication was needed between the processor and memory. Starting around 2003 and into the foreseeable future, most opportunities for performance growth in mainstream computers are based on exploiting the increasing number of processor cores. Communication must play an even bigger role to enable such exploitation, since processors need to exchange information among themselves, and data need to be moved among the many processors and between processors and memory. The need for high communication bandwidth is clear in some important applications (e.g., FFT). However, the need for communication is broader than that:
1. The current capacity of communication switches limits the performance of large machines. Such machines require connecting modules, boards, and/or racks, and many of these connections would benefit from improved bandwidth and/or latency.
2. High-productivity parallel computer systems (i.e., systems that enable both fast application development time and fast runtime) would benefit greatly from a programmer's abstraction that assumes flat memory; namely, that any set of concurrent memory requests can be satisfied in unit time. Even when memory addresses are known ahead of time, it is hard to estimate the latency of accessing them in modern computer systems, so effective support of the flat memory abstraction is helpful. However, such an abstraction is even more desirable in the many applications in which it is impossible to predict the addresses of memory requests ahead of time (e.g., at compile time). Support of such an abstraction generally has the added benefit that it also covers high-bandwidth applications.
Bandwidth and latency of switches are often performance bottlenecks for large parallel computers. Zahavi et al. (2014) point out the interest of switch vendors in reducing the number of chips in a switch, and the corollary that all the available ports on a chip should be used; greatly increasing the number of ports on a chip would improve the overall performance of the switch (E. Zahavi, I. Keslassy and A. Kolodny, “Quasi Fat Trees for HPC Clouds and Their Fault-Resilient Closed-Form Routing”. Presented at Hot Interconnects (HOTI) 2014, Mountain View, Calif., USA).
The approaching end of so-called Dennard scaling is also an important concern, as it implies diminishing improvements in the power consumption of computers. This concern has led to a remarkable consensus in the industry and in the research community: communication avoidance must drive both the design of computer systems and their programming. Consequently, commercial parallel systems have been evolving away from a flat memory abstraction; for example, no multi-core (or GPU) machine on the market today supports a flat memory abstraction. In particular, the impetus to avoid overheating of computer chips due to data movement (“DM”) ended up leaving programmers no choice but to labor hard to minimize such movement. Per the influential report [Fuller, Millett], which is a good representative of the aforementioned consensus, mainstream computer system vendors and researchers are resigned to even stricter restrictions on data movement in the future, their premise being that there is no way to avoid such restrictions while increasing parallelism (S. H. Fuller and L. I. Millett (editors). The Future of Computing Performance: Game Over or Next Level?, National Research Council of the National Academies, National Academies Press, 2011). Vendors preferred to pack more and more functional units into a chip, due to their low energy consumption relative to DM, resulting in increasingly unbalanced architectures.
The viewpoint article [Vishkin 2014] opines that claims that solutions requiring a higher level of DM are not feasible (some use the term “dark silicon”) played a key role in dashing some high hopes of vendors a decade ago, such as: (1) that parallel computing in the form of multi-cores would replace serial computing for single-task general-purpose applications, which did not materialize; and (2) that machines of 500-1,000 cores would be widely deployed by 2014, which gave way to a reality of core counts mostly in the single digits, reaching double digits only in the most advanced machines. [Vishkin 2014] elaborates on these dashed hopes, tying the problem (both for multi-cores and GPUs) to the strict restrictions on DM; namely, the DM problem prevented flat memory altogether and greatly constrained the number of cores in commercial machines (U. Vishkin, “Is Multicore Hardware for General-Purpose Parallel Processing Broken?”, Comm. of the ACM (CACM), Volume 57, No. 4, pages 35-39, April 2014).
A need remains, therefore, for a new integrated circuit architecture that enables and provides for significant data movement and parallel processing, while concurrently providing a cooling architecture in a 3D VLSI structure and avoiding the overheating problems of the prior art.