Conventional processing systems utilize parallel processing in an inefficient manner. Example conventional processors include scalar, Very Long Instruction Word (VLIW), superscalar, and vector processors.
A scalar is a single item or value. A scalar processor performs arithmetic computations on scalars, one at a time. For example, on a first clock, an instruction C=A+B is fetched. On a second clock, the instruction is decoded. On a third clock, the instruction operands A and B are retrieved. On a fourth clock, the instruction is executed. On a fifth clock, the result C of the executed instruction is written to memory. This process may proceed in a pipelined manner, with a new instruction fetched on each subsequent clock and processed through the remaining four stages as previously described. However, a scalar processor achieves only limited parallelism, bounded by the number of pipeline stages. Further, although the processor may have multiple execution units for different functions such as add, multiply, and shift, only one execution unit is used during each clock cycle because each scalar instruction specifies a single operation. Thus, although pipelined processing may be implemented with scalar systems, multiple scalar elements are not processed in parallel, which impedes efficient instruction processing.
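The five-clock sequence described above can be sketched as a small timing model. This is an illustrative sketch only, assuming an ideal pipeline with no stalls; the stage names and the function names (pipeline_schedule, total_clocks, max_executes_per_clock) are hypothetical and are not taken from any particular processor.

```python
# Illustrative model of the five-stage scalar pipeline described above.
# Assumption: an ideal pipeline with no stalls or hazards.
STAGES = ["fetch", "decode", "read operands", "execute", "write back"]


def pipeline_schedule(num_instructions):
    """Map each clock (1-based) to the (instruction index, stage) pairs active on it."""
    schedule = {}
    for i in range(num_instructions):
        for s, stage in enumerate(STAGES):
            clock = i + s + 1  # instruction i is fetched on clock i + 1
            schedule.setdefault(clock, []).append((i, stage))
    return schedule


def total_clocks(num_instructions):
    """Clocks needed to finish all instructions in the ideal pipeline."""
    if num_instructions == 0:
        return 0
    return len(STAGES) + num_instructions - 1


def max_executes_per_clock(num_instructions):
    """Largest number of instructions in the execute stage on any one clock."""
    schedule = pipeline_schedule(num_instructions)
    return max(
        sum(1 for _, stage in active if stage == "execute")
        for active in schedule.values()
    )
```

In this model, up to five instructions are in flight at once, yet at most one of them occupies the execute stage on any clock, which mirrors the single-execution-unit limitation noted above.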
VLIW processors have an architecture that processes multiple scalar instructions simultaneously, or in parallel, by packing multiple instructions into a single wide instruction; that is, a very long instruction word (VLIW) includes multiple scalar instructions of the kind previously described.
One example VLIW instruction is a 256-bit VLIW. Multiple independent instructions can be incorporated into a single VLIW instruction. For example, a VLIW instruction may include instruction sections for an adder, a shifter, a multiplier, and other execution units. Thus, the VLIW instruction enables an execution unit such as an adder to proceed in a pipelined fashion and, in addition, enables other components, such as a shifter or multiplier, to proceed in parallel with the adder.
While a VLIW processing system may reduce processing times by executing multiple instructions within a single wide instruction word, this system has a number of shortcomings. For example, larger amounts of wider memory are used to store a series of wide instruction words. As a result, additional logic and interconnect wiring are used to manage the wider memory. This extra logic and wiring consumes additional area and power, and fetching the wider instructions consumes additional bandwidth: on each clock, a 256-bit instruction is fetched.
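The fetch cost described above can be illustrated with simple arithmetic. This is a rough sketch, not a measurement: the 256-bit width comes from the example above, while the 32-bit scalar instruction width used for comparison and the function names are assumptions introduced here for illustration.

```python
# Rough fetch-cost arithmetic for wide instruction words.
VLIW_WIDTH_BITS = 256   # example width from the text
SCALAR_WIDTH_BITS = 32  # assumption: a comparison width, not from the text


def fetch_bits(width_bits, clocks):
    """Total instruction bits fetched when one instruction is fetched per clock."""
    return width_bits * clocks


def fetch_ratio(wide_bits=VLIW_WIDTH_BITS, narrow_bits=SCALAR_WIDTH_BITS):
    """How many times more instruction bits the wide format fetches per clock."""
    return wide_bits / narrow_bits
```

Under these assumed widths, the wide format fetches eight times as many instruction bits per clock, and the instruction memory for a program of a given number of words grows by the same factor.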
In response to the limited parallelism of scalar processing systems, superscalar processors were also developed. Superscalar processors are similar to VLIW systems in executing instructions in parallel, but they operate on two or more smaller, separate instructions rather than on one wide instruction word. Multiple smaller instructions are fetched per clock cycle, and if there are no conflicts or unmet dependencies, multiple instructions can be issued down separate pipelines in parallel. While superscalar processors may utilize narrower or shorter instructions and process multiple instructions in parallel, other problems remain in the complexity of selecting instructions that can issue in parallel without conflicting demands and in accessing operands in parallel. Additionally, concerns remain about interactions between pipelines and about components sitting idle until an instruction is completely executed.
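The selection problem described above can be sketched as a simple in-order issue check. This is a deliberately simplified illustration, assuming a two-wide machine in which each instruction names one destination, its source operands, and the execution unit it needs; the Instr type and the select_issue_group function are hypothetical and do not describe any specific superscalar design.

```python
from collections import namedtuple

# Hypothetical instruction record: destination, source operands, required unit.
Instr = namedtuple("Instr", ["dest", "srcs", "unit"])


def select_issue_group(window, issue_width=2):
    """Pick the leading instructions in the window that can issue together.

    Assumption: in-order issue, so selection stops at the first instruction
    that has an unmet dependency or a conflicting demand.
    """
    group = []
    written = set()     # destinations produced by instructions already in the group
    busy_units = set()  # execution units already claimed this clock
    for instr in window:
        if len(group) == issue_width:
            break
        unmet_dependency = any(src in written for src in instr.srcs)
        conflict = instr.unit in busy_units or instr.dest in written
        if unmet_dependency or conflict:
            break
        group.append(instr)
        written.add(instr.dest)
        busy_units.add(instr.unit)
    return group
```

Even this reduced check compares every candidate against every instruction already selected, which hints at why issue logic grows in complexity as the issue width increases.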
Vector processors process vectors, i.e., linear arrays of data elements or values (e.g., scalar values) arranged in one dimension. Example vector operations include element-by-element arithmetic, dot products, convolution, transforms, matrix multiplications, and matrix inversions. Vector processors typically provide high-level instructions that operate on a vector in a pipelined fashion, element by element. A typical instruction can add two 64-element vectors element by element in a pipeline to produce a 64-element vector result, the same result that a scalar processor would generate with a complete loop that computes one element per iteration. Vector processing units, however, typically provide limited sequential control capacity. For example, a separate scalar unit is typically used to perform scalar computations using sequential decisions.
For example, a vector processor may pass vector operands to a single pipelined functional unit, e.g., an adder. If a vector instruction calls for C=A+B, the elements of vectors A and B are sequentially added with a single functional adder and stored element by element to a vector C. In pipelined fashion, during a first clock, the first element of each vector is processed with an adder, e.g., A1+B1, and stored to C1 of vector C. During a second clock, the second element of each vector is processed with an adder, e.g., A2+B2, and stored to C2 of vector C. During a third clock, the third element of each vector is processed with an adder, e.g., A3+B3, and stored to C3 of vector C, and so on for each element.
Thus, performing an operation on “x” elements may require “x” clock cycles plus additional clock cycles to manage overhead operations. Consequently, conventional vector processors are limited in that they utilize a complex control unit to sequence vector processing element by element, one clock per element, resulting in many clock cycles to execute one vector instruction. This problem is further amplified when more complex instructions are processed. Additionally, when processing of one element is completed, a control system must move the processing from the element just processed to the next element. Further, control of other execution units, such as a multiplier or shifter, is further complicated, and use of these units is delayed until the instruction is completed and each element of the vector has been processed through its respective clock cycles. Thus, other instructions relating to other execution units are unnecessarily delayed or require complex “vector chaining” controls to manage parallel instruction execution with different units.
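The element-by-element sequencing described above can be modeled with a short sketch that counts one clock per element through a single adder. This is only an illustration of the cycle count; the function name and the overhead_clocks parameter are hypothetical, and the overhead value is an assumed input rather than a figure from any real processor.

```python
def sequential_vector_add(a, b, overhead_clocks=0):
    """Add two vectors one element per clock through a single adder.

    Returns the result vector and the number of clocks consumed.
    Assumption: overhead_clocks stands in for sequencing overhead and is
    supplied by the caller.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    c = []
    clocks = overhead_clocks
    for a_i, b_i in zip(a, b):
        c.append(a_i + b_i)  # e.g., A1+B1 is stored to C1 on the first element clock
        clocks += 1          # the single adder handles one element per clock
    return c, clocks
```

For two 64-element vectors, this model consumes 64 clocks plus whatever overhead is assumed, during which any other execution units are not used by this instruction.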
Some processing systems that use co-processors or reconfigurable arrays have synchronization problems with the execution of the application program. Further, some conventional systems utilize one processor to execute an application program with the assistance of a co-processor or a reconfigurable computing array. As a result, such systems utilize an asynchronous request/acknowledge handshake between the separate processor and the co-processor or reconfigurable array. These handshakes result in either the processor waiting for the array, or the array waiting for the processor. In both cases, the result is inefficient use of the processor in performing fine-grain requests because the overhead can exceed the array run time.
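The effect of handshake overhead on fine-grain requests can be illustrated with a simple ratio. This is a sketch under stated assumptions: the function name and the clock counts used as inputs are hypothetical and are chosen only to show the trend.

```python
def array_utilization(run_clocks, handshake_clocks):
    """Fraction of the total request time spent on useful array work.

    Assumption: each request pays the full request/acknowledge handshake
    cost once, in addition to the array run time.
    """
    total = run_clocks + handshake_clocks
    if total <= 0:
        raise ValueError("total clocks must be positive")
    return run_clocks / total
```

With an assumed 30-clock handshake, a 10-clock request spends only a quarter of its time on useful work, while a 1000-clock request spends about 97 percent, which is why the overhead weighs most heavily on fine-grain requests.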
In summary, the shortcomings of conventional processing systems relate to the complexity of issuing parallel instructions, instructions with many bits, the bandwidth and power used in fetching wide instructions, additional instruction memory, logic, and/or area, diminished processing speeds, and asynchronous processor communications.
Accordingly, there is a need in the art for a processing system that executes instructions in a more time-, cost-, and space-efficient manner by enhancing the control and utilization of parallel processing.