The present invention relates generally to a novel VLIW computer processing architecture, and more particularly to a processor having a scalable multi-pipeline processing core and memory fabricated on the same integrated circuit.
Computer architecture designers are constantly trying to increase the speed and efficiency of computer processors. However, conventional "state-of-the-art" CPU designs are predicated on the fact that there is a huge latency inherent in the accompanying memory systems, coupled with limited bandwidth communications between the memory systems and the CPU core. These inherent problems with current processor and memory latencies have led to computer architecture designs with many and large cache layers and highly complex designs, with each additional fraction of design complexity obtaining only a small improvement in performance (i.e., diminishing returns).
For example, computer architecture designers have attempted to increase processing speeds by increasing clock speeds and applying latency-hiding techniques, such as data pre-fetching and cache memories. In addition, other techniques, such as instruction-level parallelism using very long instruction word (VLIW) designs and embedded DRAM, have been attempted.
Combining memory (i.e., DRAM) and logic on the same chip appears to be an excellent way to improve internal memory bandwidth and reduce memory access latencies at a low cost. However, DRAM circuits tend to be sensitive to temperature and thermal gradients across the silicon die. Conventional RISC and CISC CPUs, because they must be clocked at high speeds to attain adequate performance, are necessarily energy inefficient and tend to produce a large amount of heat, which ultimately affects the performance of any DRAM residing on the same chip. Thus, architectures which attain their performance through instruction-level parallelism, instead of maximizing clock speeds, tend to be better suited for use with on-chip DRAM because they can exploit the large communication bandwidth between the processor and memory while operating at lower clock speeds and lower supply voltages. Examples of architectures utilizing instruction-level parallelism include single instruction multiple data (SIMD), vector or array processing, and very long instruction word (VLIW). Of these, VLIW appears to be the most suitable for general purpose computing.
Certain VLIW computer architecture designs are currently known in the art. However, while processing multiple instructions simultaneously may help increase processor performance, it is difficult to process a large number of instructions in parallel because of dependencies between instructions. In addition, most VLIW processors require extremely complex logic to implement the VLIW design, which also slows the performance of VLIW processors. In fact, with VLIW designs which do not take advantage of the memory efficiencies of on-chip DRAM, the average number of instructions per clock (IPC) can drop well below 1 when factors such as branch misprediction, cache misses, and instruction fetch restrictions are taken into account. Thus, what is needed is a novel, high performance computer processing architecture to overcome the shortcomings of the prior art.
One embodiment of the present invention comprises a processor chip including a processing core, at least one bank of DRAM memory, an I/O link configured to communicate with other like processor chips or compatible I/O devices, and a communication and memory controller in electrical communication with the processing core, the at least one bank of DRAM memory, and the I/O link. The communication and memory controller is configured to control the exchange of data between the processor chip and the other processor chips or I/O devices. The communication and memory controller also is configured to receive memory requests from the processing core and from the other processor chips via the I/O link, and process the memory requests with the at least one bank of DRAM memory.
In accordance with another embodiment of the present invention, the communication and memory controller comprises a memory controller in electrical communication with the processing core and the at least one bank of DRAM memory, and a distributed shared memory controller in electrical communication with the memory controller and the I/O link. The distributed shared memory controller is configured to control the exchange of data between the processor chip and the other processor chips or I/O devices. In addition, the memory controller is configured to receive memory requests from the processing core and the distributed shared memory controller, and process the memory requests with the at least one bank of DRAM memory.
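The division of labor in this embodiment can be illustrated with a minimal behavioral sketch, offered only as an aid to reading and not as the claimed implementation. The class and method names, the default bank count, the low-order bank interleaving, and the read-as-zero behavior are all assumptions introduced for illustration; the text above specifies only which component is in communication with which.

```python
class MemoryController:
    """Services memory requests against one or more on-chip DRAM banks."""

    def __init__(self, num_banks=4):
        # num_banks is a hypothetical parameter; the text says only
        # "at least one bank of DRAM memory".
        if num_banks < 1:
            raise ValueError("at least one DRAM bank is required")
        self.banks = [{} for _ in range(num_banks)]

    def handle(self, address, write_value=None):
        # Assumed low-order interleaving across banks; the text does not
        # specify how addresses map to banks.
        bank = self.banks[address % len(self.banks)]
        if write_value is not None:
            bank[address] = write_value
            return write_value
        # Unwritten locations read as 0 in this model (an assumption).
        return bank.get(address, 0)


class DistributedSharedMemoryController:
    """Sits between the I/O link and the memory controller."""

    def __init__(self, memory_controller):
        self.memory_controller = memory_controller
        self.forwarded = 0  # count of remote requests passed along

    def on_remote_request(self, address, write_value=None):
        # A request from another processor chip arrives over the I/O link
        # and is forwarded to the memory controller for servicing.
        self.forwarded += 1
        return self.memory_controller.handle(address, write_value)


class ProcessingCore:
    """Issues local requests directly to the memory controller."""

    def __init__(self, memory_controller):
        self.memory_controller = memory_controller

    def load(self, address):
        return self.memory_controller.handle(address)

    def store(self, address, value):
        return self.memory_controller.handle(address, write_value=value)
```

In this sketch the memory controller is the single point through which both local and remote requests reach the DRAM banks, so a value stored by the core is visible to a later remote read forwarded by the distributed shared memory controller.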
In accordance with yet another embodiment of the present invention, the processor chip may further comprise an external memory interface in electrical communication with the communication and memory controller. In accordance with this aspect of the present invention, the external memory interface is configured to connect the processor chip in electrical communication with external memory. The communication and memory controller is configured to receive memory requests from the processing core and from the other processor chips via the I/O link, determine whether the memory requests are directed to the at least one bank of DRAM memory on the processor chip or the external memory, and process the memory requests with the at least one bank of DRAM memory on the processor chip or with the external memory through the external memory interface.
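The routing decision described in this embodiment can be sketched as a simple address decode. This is a hedged illustration under stated assumptions, not the disclosed circuit: the address map (on-chip DRAM occupying the low addresses), the `ON_CHIP_DRAM_BYTES` size, and every class and method name below are hypothetical, since the text does not say how the controller determines the target of a request.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    CORE = "core"        # request issued by the on-chip processing core
    IO_LINK = "io_link"  # request arriving from another processor chip


class Target(Enum):
    ON_CHIP_DRAM = "on_chip_dram"
    EXTERNAL_MEMORY = "external_memory"


# Hypothetical address map: the text gives no sizes or address ranges.
ON_CHIP_DRAM_BYTES = 4 * 1024 * 1024


@dataclass(frozen=True)
class MemoryRequest:
    source: Source
    address: int
    is_write: bool = False
    data: int = 0


class CommunicationAndMemoryController:
    def __init__(self):
        # Dictionaries stand in for the DRAM banks and the external memory.
        self.on_chip_dram = {}
        self.external_memory = {}

    def route(self, request):
        """Decide which memory a request is directed to (assumed decode)."""
        if request.address < 0:
            raise ValueError("address must be non-negative")
        if request.address < ON_CHIP_DRAM_BYTES:
            return Target.ON_CHIP_DRAM
        return Target.EXTERNAL_MEMORY

    def process(self, request):
        """Service the request with the memory selected by route()."""
        target = self.route(request)
        store = (self.on_chip_dram if target is Target.ON_CHIP_DRAM
                 else self.external_memory)
        if request.is_write:
            store[request.address] = request.data
            return target, request.data
        # Unwritten locations read as 0 in this model (an assumption).
        return target, store.get(request.address, 0)
```

Note that the source of a request (core or I/O link) does not affect the routing in this sketch; only the address does. Whether the actual controller treats local and remote requests differently is not stated in the text.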
A more complete understanding of the present invention may be derived by referring to the detailed description of preferred embodiments and claims when considered in connection with the figures.