Since von Neumann and others developed the stored-program electronic computer more than 60 years ago, the fundamental principle of memory access has not changed. Although the processing speeds of computers have increased significantly over the years for a whole range of high performance computing (HPC) applications, this has been accomplished either by advances in device technology or by schemes that avoid memory access (such as using a cache). The memory access time nevertheless still limits performance. Current computer systems use many processors 11 and a large-scale main memory 331, as shown in FIG. 1.
The computer system shown in FIG. 1 includes a processor 11, a cache memory (321a, 321b) and a main memory 331. The processor 11 includes a control unit 111 having a clock generator 113 configured to generate a clock signal, an arithmetic logic unit (ALU) 112 configured to execute arithmetic and logic operations synchronized with the clock signal, an instruction register file (RF) 322a connected to the control unit 111 and a data register file (RF) 322b connected to the ALU 112. The cache memory (321a, 321b) has an instruction cache memory 321a and a data cache memory 321b. A portion of the main memory 331 and the instruction cache memory 321a are electrically connected by wires and/or buses, which limit the memory access time (that is, they form the von Neumann bottleneck) 351. The remaining portion of the main memory 331 and the data cache memory 321b are electrically connected to enable a similar memory access 351. Furthermore, wires and/or buses, which implement memory access 352, electrically connect the data cache memory 321b and the instruction cache memory 321a with the instruction register file 322a and the data register file 322b.
Even though HPC systems are expected to operate at high speed and with low energy consumption, their speed is limited by the memory access bottlenecks 351, 352. The bottlenecks 351, 352 are ascribable to the wiring between the processors 11 and the main memory 331, because the wire length delays memory access and the stray capacitance existing between wires causes additional delay. Such capacitance also increases power consumption, in proportion to the clock frequency of the processor 11.
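The proportionality between wire capacitance, clock frequency and power described above is commonly modeled by the standard switching-power relation P = a·C·V²·f. The following is a minimal sketch of that relation only; the function name, the activity factor and all numeric values are assumptions chosen for illustration and are not taken from this description.

```python
def dynamic_power_watts(capacitance_farads, supply_volts, clock_hz, activity=1.0):
    """Standard switching-power estimate: P = activity * C * V^2 * f."""
    return activity * capacitance_farads * supply_volts ** 2 * clock_hz


# Hypothetical values, chosen only for illustration.
wire_capacitance = 2e-12  # assumed 2 pF of stray wire capacitance
supply = 1.0              # assumed 1.0 V supply voltage

p_1ghz = dynamic_power_watts(wire_capacitance, supply, 1e9)
p_2ghz = dynamic_power_watts(wire_capacitance, supply, 2e9)

# For a fixed capacitance and voltage, doubling the clock doubles the power.
print(p_2ghz / p_1ghz)
```

Under this model, shortening or removing a wire reduces C and therefore reduces the power drawn at any given clock frequency.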
Currently some HPC processors are implemented using several vector arithmetic pipelines. Such a vector processor makes better use of memory bandwidth and is a superior machine for HPC applications that can be expressed in vector notation. The vector instructions are made from loops in a source program, and each of these vector instructions is executed in an arithmetic pipeline in a vector processor or in corresponding units in a parallel processor. These processing schemes give the same results.
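The correspondence between a source-program loop and a vector instruction can be illustrated conceptually as follows. This is only a sketch: the function names and data are hypothetical, and the code models the equivalence of results, not the pipeline hardware.

```python
def scalar_loop(a, b, s):
    """Loop form as it might appear in a source program: one element per iteration."""
    out = []
    for i in range(len(a)):
        out.append(s * a[i] + b[i])
    return out


def vector_multiply_add(a, b, s):
    """Stand-in for a single vector instruction applied to whole operands."""
    return [s * x + y for x, y in zip(a, b)]


# Hypothetical operands, chosen only for illustration.
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

loop_result = scalar_loop(a, b, 2.0)
vector_result = vector_multiply_add(a, b, 2.0)

# Both forms produce the same values.
print(loop_result == vector_result)
```

Both forms read every operand from memory once, so the vector form changes how the work is issued but does not remove the memory traffic itself.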
However, even a vector-processor-based system has the memory bottleneck 351, 352 between all of its units. Even in a single system with a wide memory and a large bandwidth, the same bottleneck 351, 352 appears, and if the system consists of many of the same units, as in a parallel processor, the bottleneck 351, 352 is unavoidable.
There are two essential memory access problems in conventional computer systems. The first problem is the wiring, which lies not only between memory chips and caches, or between these two units even on a single chip, but also inside the memory systems. Between chips, the wiring between these two chips/units results in more dynamic power consumption due to capacitance, as well as in signal time delay along the wires. The same problem extends to the internal wiring within a memory chip, namely the access lines and the remaining read/write lines. Thus, in both the inter-chip and intra-chip wiring of memory chips, energy is consumed because of the capacitance of these wires.
The second problem is the memory bottleneck 351, 352 between the processor chip, the cache and the memory chips. Since the ALU can access any part of the cache or the memory, the access path 351, 352 consists of long global wires. These paths are also limited in the number of wires available. Such a bottleneck appears to be due to hardware such as buses. Especially when there is a high-speed CPU and a large-capacity memory, the apparent bottleneck is basically between these two.
The key to removing the bottleneck is for the memory to have the same clock cycle as the CPU. First, an addressing procedure must be created to improve memory access. Second, the time delay due to longer wires must be significantly reduced both inside and outside the memory.
By solving these two issues, a fast coupling between the memory and the CPU is achieved, which enables a computer without the memory bottleneck.
Because of these problems, the processor consumes 70 percent of the total energy, which is divided into 42 percent for supplying instructions and 28 percent for supplying data, as shown in FIG. 32. The wiring problems cause not only power consumption but also signal time delay. Overcoming the wiring problems implies eliminating the bottlenecks 351, 352 that limit the flow of data and instructions. If the intra-chip and inter-chip wiring could be removed, the problems of power consumption, time delay and the memory bottlenecks 351, 352 would be solved.