1. Field of the Invention
The present invention relates generally to computer architectures, and more particularly to computer architectures that are capable of processing more than one instruction during each computer system clock cycle. Still more particularly, the present invention is a computer architecture for processing multiple scalar instructions in parallel using a minimal instruction set.
2. Description of the Background Art
In present-day computing applications, the amount of data that must be processed is often quite large. For example, in image processing, approximately 2.25 Megabytes of data are required to represent an uncompressed 1024 by 768 pixel color image at 24 bits per pixel, and 3 Megabytes at 32 bits per pixel. If a single image requires multiple transformations, or if two or more images are to be combined into a composite image, the data set on which operations are required becomes even larger. As the size of the data set increases, the time required to complete a set of computational operations on the data set eventually becomes unacceptably long. The need to maximize the number of operations that can be performed on a given set of data in the least amount of time has served as the major driving force in the evolution of computer architectures.
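As a rough check on the storage arithmetic, the following sketch computes the uncompressed size of a 1024 by 768 pixel image for several assumed pixel depths. The function name `image_bytes` and the chosen depths are illustrative assumptions, not values taken from the invention.

```python
def image_bytes(width, height, bits_per_pixel):
    """Uncompressed storage for a raster image, in bytes (no header, no padding)."""
    return width * height * bits_per_pixel // 8

MIB = 1024 * 1024  # one Megabyte, taken here as 2**20 bytes

# Assumed pixel depths: 8-bit indexed color, 24-bit RGB, 32-bit RGB plus padding/alpha.
for bpp in (8, 24, 32):
    size = image_bytes(1024, 768, bpp)
    print(f"{bpp:2d} bits/pixel: {size:>9,d} bytes = {size / MIB:.2f} MiB")
```

Combining two such images into a composite requires reading both inputs and writing one output, so the working set is roughly three times the single-image figure.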
Reduced Instruction Set Computing (RISC) architectures attempt to meet this need by increasing the speed at which each supported computational instruction is performed. RISC is based upon the premise that a complex computational instruction can typically be performed through a corresponding sequence of simple instructions. The smallest set of simple instructions required to fully implement a complete instruction set is chosen as the RISC instruction set. A given RISC architecture is designed to execute each simple instruction in its instruction set very quickly, such that a given complex instruction can be executed very rapidly as a corresponding sequence of simple instructions. A block diagram of an exemplary RISC architecture is shown in FIG. 1. This architecture comprises a memory, a data cache, an instruction cache, pipelined instruction decoding, a register file, a floating-point functional unit, and an integer functional unit. In the RISC architecture, either the integer functional unit or the floating-point functional unit can be used for a given computation.
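The premise that a complex instruction can be performed as a sequence of simple instructions can be illustrated with integer multiplication built only from shift, add, and bit-test steps. This is a minimal sketch for illustration; the function name `multiply_via_simple_ops` is hypothetical, and the sequence shown does not correspond to the instruction set of any particular RISC architecture.

```python
def multiply_via_simple_ops(a, b):
    """Multiply two non-negative integers using only shift, add, and bit-test steps."""
    product = 0
    while b:
        if b & 1:          # test the low bit of the multiplier
            product += a   # add the shifted multiplicand
        a <<= 1            # shift the multiplicand left by one bit
        b >>= 1            # shift the multiplier right by one bit
    return product

print(multiply_via_simple_ops(13, 11))  # 143
```

Each loop iteration uses only operations that a simple instruction set would be expected to provide, so the cost of the "complex" multiply is the cost of the simple steps multiplied by the number of iterations.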
All data transferred between the register file and the memory in the RISC architecture must pass through the data cache. The data cache serves as a high-speed, small-capacity memory. As of 1993, a data cache size of 16 kilobytes is common. Hence, in a data-intensive computational environment such as image processing, the data cache can store only a very small portion of the total amount of data that must be processed. Presently, a backing cache is occasionally used between a CPU's internal cache and memory to provide additional cache storage. In the future, cache sizes will increase as improvements are made in microelectronic processing technology. Whether present or future, however, a cache serves as an information "bottleneck" between the memory and the register file. This limits the overall processing speed of the RISC architecture, particularly in data-intensive applications. Therefore, RISC architectures are not well-suited for processing large amounts of data, such as in image or document processing, since access to any data element stored within memory necessitates routing the data element through the data cache.
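The effect of a data set that greatly exceeds the cache capacity can be sketched with a simple model. The following example assumes a direct-mapped cache with 32-byte lines and sequential sweeps over a buffer; the line size, the mapping policy, and the function name `count_line_misses` are assumptions made for illustration and are not taken from the architecture of FIG. 1.

```python
def count_line_misses(data_bytes, cache_bytes=16 * 1024, line_bytes=32, passes=2):
    """Count cache-line misses per pass for sequential sweeps over a buffer,
    using a direct-mapped cache model (all parameters are illustrative)."""
    num_slots = cache_bytes // line_bytes
    num_lines = -(-data_bytes // line_bytes)  # ceiling division
    slots = [None] * num_slots                # which memory line each slot holds
    misses_per_pass = []
    for _ in range(passes):
        misses = 0
        for line in range(num_lines):
            slot = line % num_slots
            if slots[slot] != line:           # line not resident: fetch from memory
                slots[slot] = line
                misses += 1
        misses_per_pass.append(misses)
    return misses_per_pass

image_bytes_24bpp = 1024 * 768 * 3            # assumed 24-bit, uncompressed image
print(count_line_misses(image_bytes_24bpp))   # [73728, 73728]
print(count_line_misses(8 * 1024))            # [256, 0]
```

Under these assumptions, a second transformation pass over the image finds none of its data still resident, so every line must again be fetched from memory through the cache, whereas a buffer that fits within the cache incurs no misses on the second pass.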
Other architectures attempt to rapidly process large amounts of data by performing multiple instructions simultaneously. One such architecture is a vector computer, in which data elements within an array each receive the same operation simultaneously. Vector computers require specialized hardware to function, which increases the cost of system design and manufacture. A vector computer must also include a scalar processing unit to successfully perform computations in situations where individual data elements must each receive a unique operation. A highly specialized compiler must be used to maximize the amount of data that can be operated upon in an array format. Such a compiler requires many person-years to develop, and is therefore prohibitively expensive, making vector computers impractical for most computing needs.
A second architecture that attempts to rapidly process large amounts of data by performing multiple instructions simultaneously is a multiprocessor architecture, where several processing units simultaneously operate upon data, and each processing unit attends to a different portion of a given computation. However, each processing unit significantly increases the cost of manufacturing the system. Moreover, a highly specialized compiler is also necessary in this second architecture, to ensure that computational tasks are optimally distributed among the processing units. Such a compiler is again prohibitively expensive, making the multiprocessor an undesirable option.
A third type of architecture that performs multiple instructions simultaneously is that of a superscalar computer. In superscalar computers, the hardware elements required for data storage and performance of arithmetic and logical operations are duplicated. Each set of duplicated hardware elements operates on a given subset of data in parallel. Yet again, a highly specialized compiler is required to maximize the number of hardware elements that are actively processing data at any given time, and the compiler is prohibitively expensive.
Other architectures are combinations of those mentioned above. These are even more complex and more expensive than those previously mentioned, and are therefore undesirable. What is needed is a computer architecture for rapidly performing computational operations on large sets of data in which data path bottlenecks and the need for a highly specialized compiler are eliminated.