Multiprocessor computer systems typically comprise a number of processing element nodes connected together by an interconnect network. Each processing element node typically includes at least one processing element and corresponding local memory, such as dynamic random access memory (DRAM). The interconnect network transmits packets of information or messages between processing element nodes. In a typical multiprocessor system, every processing element can directly address all of memory, including the memory of another (remote) processing element, without involving the processor at that processing element. Instead of treating processing element-to-remote-memory communications as an I/O operation, reads or writes to another processing element's memory are accomplished in the same manner as reads or writes to the local memory.
There is an increasing gap between processing power and memory speed. One proposed solution to compensate for this gap is to have higher integration of processing elements and local DRAM memory. The current level of integration is at the level of the printed circuit board. Proposed integrations are for disposing processing elements and local memory on multi-chip modules (MCM) and for eventually disposing processing elements and local memory on the same integrated circuit chip. Such tightly coupled systems offer advantages, such as providing a substantial increase in the available bandwidth between the processor and its memory, and providing a reduction of the memory access latency. The bandwidth advantage is a result of the vastly improved ability to interconnect the processor with its memory banks. The latency advantage is a result of the elimination of the overhead of crossing chip boundaries.
With improved local memory bandwidth and improved local access latency, it has been proposed that vector units can be implemented on-chip. Such on-chip vector units can exploit significant local memory bandwidth because of their efficient issue and their ability to have deep pipelines. However, providing ample external bandwidth is expensive. This is evident in the design of current vector supercomputers, such as the CRAY C-90 and T-90 vector supercomputers sold by Cray Research, Inc., which employ static random access memory (SRAM) and elaborate interconnection networks to achieve very high performance from their memory systems. With the integration of vector units and memory on the same device (MCM or chip), systems can be built having the potential for significantly better cost-performance than traditional supercomputers.
The importance of vector processing in the high-performance scientific arena is evident from the successful career of the vector supercomputer. One reason for this success is that vector processing is a good fit for many real-life problems. In addition, vector processing's serial programming model is popular among engineers and scientists because the burden of extracting the application parallelism (and hence performance) is borne by the vectorizing compiler rather than by the programmer. This proven vector processing model, now in use for two decades, is supported by significant vectorizing compiler technology and accounts for a very important portion of current scientific computation.
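By way of illustration only, the serial programming model described above can be sketched with a short C routine. The routine name saxpy and its parameters are hypothetical and are not taken from the present specification. The loop is written as ordinary serial code, and a vectorizing compiler, where one is available, may map its independent iterations onto vector instructions without any change to the source.

```c
#include <stddef.h>

/* Hypothetical example, not part of the specification:
 * computes y[i] = a * x[i] + y[i] for every i in [0, n).
 * The loop is expressed serially. The restrict qualifiers assert that
 * x and y do not overlap, so no iteration depends on another and a
 * vectorizing compiler is free to process several elements at once. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether a given compiler actually vectorizes such a loop depends on the compiler, its options, and the target hardware. The sketch shows only that the programmer writes no explicit parallel constructs.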
Nevertheless, vector applications are memory intensive and they would overflow any single device with a limited and non-expandable memory. Such memory intensive applications include weather prediction, crash-test simulations, and physics simulations run with huge data sets. Therefore, these applications require external memory access. Furthermore, processor-memory integration increases the relative cost of external accesses by making on-chip accesses much faster. However, providing a very expensive external memory system to speed up external accesses would negate the cost-performance advantage obtained by the integrated processor/memory device. Cache memory on the integrated device could help alleviate the cost of external accesses, but for a large class of vector applications caches are not as effective as in other applications.
For reasons stated above and for other reasons presented in greater detail in the Description of the Preferred Embodiments section of the present specification, there is a need for an improved distributed vector architecture for a multiprocessor computer system having multiple integrated devices, such as MCMs or chips, where each device includes a processing element, memory, and a vector unit.