1. Field of the Invention
The present invention relates to a computing system and, more particularly, to a computing system that uses computing processors residing in data storage devices to process data in a highly parallel fashion.
2. Description of the Related Art
A computing system generally includes a Central Processing Unit (CPU), a cache, a main memory, a chip set, and a peripheral. The computing system normally receives data input from the peripheral and supplies the data to the CPU, where the data is processed. The processed data can then be stored back to the peripheral. The CPU can, for example, be an Arithmetic Logic Unit (ALU), a floating-point processor, a Single-Instruction-Multiple-Data (SIMD) execution unit, or a special functional unit. The peripheral can be a memory peripheral, such as a hard disk drive or any other nonvolatile device that provides mass data storage, or an I/O peripheral device, such as a printer or graphics subsystem, that provides I/O capabilities. The main memory provides less data storage than the hard disk drive, but with a faster access time. The cache provides even less data storage than the main memory, but with a much faster access time. The chip set contains supporting chips for the computing system and, in effect, expands the small number of I/O pins of the CPU so that the CPU can communicate with many peripherals.
FIG. 1 illustrates a conventional system architecture of a general computing system. In FIG. 1, block 10 is a CPU. Block 11 is a cache that has a dedicated high-speed bus connecting to the CPU 10 for high performance. Block 12 is a chip set that connects the CPU 10 with a main memory 13 and a fast peripheral 14, such as a graphics subsystem. Block 15 is another chip set that expands the bus, for example to an RS-232 or parallel port, for slower peripherals. Note that the components discussed above are very general building blocks of a computing system. Those skilled in the art understand that a computing system may have different configurations and building blocks beyond these general building blocks.
An execution model indicates how a computing system works. FIG. 2 illustrates an execution model of a typical scalar computing system. Between a CPU 10 and a hard disk 17, there are many different levels of data storage devices, such as a main memory 13, a cache 11, and a register 16. The farther a memory device is positioned from the CPU 10, the greater its capacity and the slower its speed. The CPU 10 fetches data from the hard disk 17, processes the data to obtain resulting data, and stores the resulting data into the various intermediate data storage devices, such as the main memory 13, the cache 11, or the register 16, depending on how often and for how long the data will be used. Each level of storage holds a superset of the data held in the smaller and faster devices nearer to the CPU 10. The efficiency of this buffering scheme depends on temporal and spatial locality. Temporal locality means that data accessed now are very likely to be accessed again later. Spatial locality means that data located in the neighborhood of the data accessed now are very likely to be accessed later. In today's technology, the CPU 10, the register 16, and two levels of cache 11 are integrated into a monolithic integrated circuit.
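By way of illustration only, the dependence of such a buffering scheme on temporal and spatial locality can be sketched with a small simulation. The sketch is not taken from any system described herein. It assumes a single hypothetical cache with least-recently-used replacement, and the names hit_rate, cache_lines, and line_size, together with the sizes chosen, are assumptions made solely for the example.

```python
from collections import OrderedDict


def hit_rate(addresses, cache_lines=4, line_size=8):
    """Return the fraction of accesses served by a small LRU cache.

    Hypothetical model: the cache holds `cache_lines` lines, and each line
    covers `line_size` consecutive addresses.
    """
    cache = OrderedDict()  # line tags, ordered from least to most recently used
    hits = 0
    for addr in addresses:
        tag = addr // line_size          # neighbouring addresses share one line
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)       # mark the line as most recently used
        else:
            cache[tag] = True            # miss: bring the line into the cache
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)


# Neighbouring addresses accessed in order: spatial locality.
sequential = list(range(256))
# The same few addresses accessed repeatedly: temporal locality.
repeated = [0, 8, 16, 24] * 64
# Widely spaced addresses cycling through 64 lines: neither kind of locality
# within the capacity of this small cache.
scattered = [(i * 64) % 4096 for i in range(256)]

for name, pattern in [("sequential", sequential),
                      ("repeated", repeated),
                      ("scattered", scattered)]:
    print(name, hit_rate(pattern))
```

Under these assumptions, the sequential and repeated access patterns are served mostly from the cache, whereas the scattered pattern misses on every access and gains nothing from the intermediate storage level.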
FIG. 3 shows an execution model of a vector computer. A vector computer has an array of vector CPUs 210, an array of vector registers 216, a main memory 13, and a hard drive 17. The size of the vector array is usually a power of 2, such as 16 or 32. The vector CPUs 210 fetch the data from the hard drive 17 through the main memory 13 to the vector registers 216 and then process an array of the data at the same time. Hence, the processing speed of the vector computer can be improved by a factor equal to the size of the array. Note that a vector computer can also have a scalar unit, such as the computing system shown in FIG. 2, as well as many vector units, such as those shown in FIG. 3. Some vector computers also make use of caches.
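By way of illustration only, the relationship between the size of the vector array and the number of processing steps can be sketched as follows. The sketch is a simplified software model and not a description of any particular vector computer. It assumes that each step operates on at most vector_size elements at once, it ignores the data bandwidth needed to keep the vector registers filled, and the function name vector_add is hypothetical.

```python
def vector_add(a, b, vector_size):
    """Add two equal-length lists in chunks of `vector_size` elements.

    Each chunk stands in for one step in which all vector CPUs operate at
    the same time. Returns the element-wise sums and the number of steps.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result = []
    steps = 0
    for start in range(0, len(a), vector_size):
        chunk_a = a[start:start + vector_size]
        chunk_b = b[start:start + vector_size]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        steps += 1                       # one step per (possibly partial) chunk
    return result, steps


a = list(range(64))
b = [1] * 64

sums, steps = vector_add(a, b, vector_size=16)
scalar_steps = len(a)                    # a scalar unit handles one element per step
print(steps, scalar_steps / steps)       # 4 steps, a factor of 16.0
```

In this model, 64 elements processed by an array of size 16 take 4 steps instead of 64, a factor equal to the size of the array. When the number of elements is not a multiple of the array size, the last step is only partially filled and the factor is smaller.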
A vector computer is able to exploit data parallelism to speed up those special applications that can be vectorized. However, vector computers replicate many expensive hardware components, such as vector CPUs and vector register files, to achieve high performance. Moreover, vector computers require very high data bandwidth in order to support the vector CPUs. The end result is a very expensive, bulky, and power-hungry computing system.
In recent years, logic has been embedded into memories to provide special purpose computing systems that perform specific processing. Memories that include processing capabilities are sometimes referred to as “smart memory” or intelligent RAM. Research on embedding logic into memories has led to some technical publications, namely: (1) Duncan G. Elliott, “Computational RAM: A Memory-SIMD Hybrid and its Application to DSP,” Custom Integrated Circuit Conference, Session 30.6, 1992, which simply describes a memory chip integrating bit-serial processors, without any system architecture considerations; (2) Andreas Schilling et al., “Texram: A Smart Memory for Texturing,” Proceedings of the Sixth International Symposium on High Performance Computer Architecture, IEEE, 1996, which describes a special purpose smart memory for texture mapping used in a graphics subsystem; (3) Stylianos Perissakis et al., “Scalable Processors to 1 Billion Transistors and Beyond: IRAM,” IEEE Computer, September 1997, pp. 75-78, which describes what is simply a highly integrated version of a vector computer, without any enhancement at the architecture level; (4) Mark Horowitz et al., “Smart Memories: A Modular Configurable Architecture,” International Symposium of Computer Architecture, June 2000, which describes a project that attempts to integrate general purpose multi-processors and multi-threads on the same integrated circuit chip; and (5) Lewis Tucker, “Architecture and Applications of the Connection Machines,” IEEE Computer, 1988, pp. 26-28, which describes massively distributed array processors in which many processors, memories, and routers are interconnected. The granularity of the memory size, the bit-serial processors, and the I/O capability is so fine that these processors end up spending more time communicating than processing data.
Accordingly, there is a need for computing systems with improved efficiency and reduced costs as compared to conventional vector computers.