The present invention relates generally to improvements in parallel processing, and more particularly to advantageous techniques for providing dynamic very long instruction word (VLIW) sub-instruction selection for execution time parallelism in an indirect VLIW processor.
In a VLIW processor, a typical problem is that it is difficult to make effective use of the full capabilities of the fixed length VLIWs available in the hardware. In previous designs, this design problem led to a very porous VLIW memory containing many No Operation (NOP) instructions within the VLIWs. Some machines have attempted to encode the NOPs to more fully utilize the VLIW memory space. One motivation of such attempts was to make better use of the costly VLIW memory included in these earlier processors. The encoded NOPs were typically assigned to each specific VLIW with no reuse of the VLIW possible in different areas of the program.
There are other needs to be met by a VLIW parallel data processor. For example, it is desirable to pipeline operations in order to achieve a steady state flow of data for maximum throughput. Consider the case of matrix multiplication using a VLIW architecture with four short instruction words (SIWs) per VLIW. In the example of FIG. 1, a 4-element vector 2 and a 4×4 matrix 4 are multiplied. Given a processor with operands stored in a register file and VLIW execution units that operate on register file source data operands and deliver result data to the register file, it can be reasonably assumed that the vector elements are stored in data registers R20=a0, R21=a1, R22=a2, and R23=a3, and the 4×4 matrix 4 is stored in a processor accessible memory. FIG. 2 illustrates how the entire operation is handled in a typical prior art approach. Each row in table 10 represents a unique short instruction word (SIW) or VLIW instruction with the program flow beginning at the top of the table and proceeding time-wise down the page. The Load operation is an indexed load that incrementally addresses memory to fetch the data element listed and load it into the specified register R0 or R1. The Add and Mpy instructions provide the function Rtarget=Rx Operation Ry, where Rtarget is the operand register closest to the function name and the source operands Rx and Ry are the second and third registers specified. Each unique VLIW memory address is identified with a number in the first column. The table 10 of FIG. 2 shows that a minimum of seven VLIWs, each stored in a unique VLIW memory address, and three unique SIWs, are required to achieve the desired results in the prior art. It is important to note that of the seven VLIWs, three VLIWs, namely numbers 1, 2, and 7, use only two SIWs per VLIW; the other four use three SIWs per VLIW. When a four instruction slot VLIW contains only two SIWs, the other two slots contain NOP instructions. When the four instruction slot VLIW contains three SIWs, the other slot contains a single NOP. With a five instruction slot VLIW, as will be described in greater detail below, even poorer usage of the VLIW memory results using prior art techniques. In the vector matrix example, a five slot VLIW will use 7*5=35 VLIW memory locations with 17 NOPs, assuming the fifth slot is not used for this matrix multiplication example. The prior art approach results in a very porous VLIW memory with numerous NOP instructions.
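The NOP counts stated above can be checked with a short calculation. The following Python sketch is illustrative only: the dictionary and function names are hypothetical, and the per-VLIW slot occupancy is taken from the description of FIG. 2 (VLIWs 1, 2, and 7 hold two SIWs each, the remaining four hold three SIWs each).

```python
# Number of SIWs held by each of the seven prior art VLIWs of FIG. 2,
# keyed by VLIW number (occupancy as stated in the text above).
SIWS_PER_VLIW = {1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 3, 7: 2}


def vim_usage(slots_per_vliw):
    """Return (total slot locations, slots filled with NOPs) for a fixed-width VLIW memory."""
    total = len(SIWS_PER_VLIW) * slots_per_vliw
    used = sum(SIWS_PER_VLIW.values())
    return total, total - used


print(vim_usage(4))  # (28, 10): four-slot VLIWs
print(vim_usage(5))  # (35, 17): five-slot VLIWs, fifth slot unused
```

Only 18 of the 35 locations hold useful SIWs in the five slot case, which is the porosity the text refers to.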
It is desirable to reduce the number of unique VLIW memory addresses to accomplish the same task since this makes more efficient use of the available hardware. It is also desirable to reduce duplicate instructions in the VLIW memory storage. This is an important consideration that allows a smaller VLIW memory to be designed into a processor thereby minimizing its cost. Further, if the same VLIW memory address could be shared by multiple sections of code and even multiple programs then the latency cost of loading the VLIW memories can be minimized, as compared to prior art approaches, and amortized over the multiple programs thereby improving overall performance. In addition, it is desirable to extend this concept into multiple Processing Elements (PEs) and to a controller Sequence Processor (SP) of a Single Instruction Multiple Data stream (SIMD) machine.
The present invention is preferably used in conjunction with the ManArray architecture, various aspects of which are described in greater detail in U.S. patent application Ser. No. 08/885,310 filed Jun. 30, 1997, now U.S. Pat. No. 6,023,753, U.S. patent application Ser. No. 08/949,122 filed Oct. 10, 1997, now U.S. Pat. No. 6,167,502, U.S. patent application Ser. No. 09/169,255 filed Oct. 9, 1998, now U.S. Pat. No. 6,343,356, U.S. patent application Ser. No. 09/169,256 filed Oct. 9, 1998, now U.S. Pat. No. 6,167,501, U.S. patent application Ser. No. 09/169,072 filed Oct. 9, 1998, now U.S. Pat. No. 6,219,776, and U.S. patent application Ser. No. 09/187,539 filed Nov. 6, 1998, now U.S. Pat. No. 6,151,668, Provisional Application Serial No. 60/068,021 entitled “Methods and Apparatus for Scalable Instruction Set Architecture” filed Dec. 18, 1997, now expired, Provisional Application Serial No. 60/071,248 entitled “Methods and Apparatus to Dynamically Expand the Instruction Pipeline of a Very Long Instruction Word Processor” filed Jan. 12, 1998, now expired, Provisional Application Serial No. 60/072,915 entitled “Methods and Apparatus to Support Conditional Execution in a VLIW-Based Array Processor with Subword Execution” filed Jan. 28, 1998, now expired, Provisional Application Serial No. 60/077,766 entitled “Register File Indexing Methods and Apparatus for Providing Indirect Control of Register in a VLIW Processor” filed Mar. 12, 1998, now expired, Provisional Application Serial No. 60/092,130 entitled “Methods and Apparatus for Instruction Addressing in Indirect VLIW Processors” filed Jul. 9, 1998, now expired, Provisional Application Serial No. 60/103,712 entitled “Efficient Complex Multiplication and Fast Fourier Transform (FFT) Implementation on the ManArray” filed Oct. 9, 1998, now expired, and Provisional Application Serial No. 60/106,867 entitled “Methods and Apparatus for Improved Motion Estimation for Video Encoding” filed Nov. 3, 1998, now expired, respectively, and incorporated herein in their entirety.
The present invention addresses the need to provide a compressed VLIW memory and the ability to reuse instruction components of VLIWs in a highly advantageous way. In one aspect, the present invention comprises a SIW fetch controller for reading instructions from the SIW memory (SIM), a VLIW memory (VIM) to store composed VLIWs at specified addresses, a VLIW controller for indirectly loading and reading instructions from the VIM, and instruction decode and execution units. VLIWs in the present invention are composed by loading and concatenating multiple SIWs in a VIM address prior to their execution.
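The composition of a VLIW from individual SIWs at a VIM address can be modeled in software. The following Python sketch is a simplified illustration under stated assumptions, not the hardware mechanism: the class `VIM`, the method `load_vliw`, the slot ordering, and the instruction strings are all hypothetical names chosen for the example.

```python
NOP = "nop"
# Assumed slot order for a five-slot VLIW; the ordering is illustrative only.
SLOTS = ("store", "load", "alu", "mau", "dsu")


class VIM:
    """Toy model of a VLIW memory: each address holds one SIW per slot."""

    def __init__(self, slots=SLOTS):
        self.slots = slots
        self.entries = {}  # VIM address -> {slot name: SIW}

    def load_vliw(self, address, siws):
        """Place each (slot, SIW) pair into the entry at `address`.

        Slots that have never been loaded hold a NOP. Returns the
        composed VLIW as a tuple in slot order.
        """
        entry = self.entries.setdefault(address, {s: NOP for s in self.slots})
        for slot, siw in siws:
            if slot not in entry:
                raise ValueError(f"unknown slot: {slot}")
            entry[slot] = siw
        return tuple(entry[s] for s in self.slots)


vim = VIM()
print(vim.load_vliw(3, [("load", "LOAD"), ("mau", "MPY")]))
# ('nop', 'LOAD', 'nop', 'MPY', 'nop')
print(vim.load_vliw(3, [("alu", "ADD")]))
# ('nop', 'LOAD', 'ADD', 'MPY', 'nop')
```

The second call shows that an entry can be modified in place, one slot at a time, without disturbing SIWs already stored at that address.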
In a SIMD machine, the SIW fetch controller resides in the SIMD array controller SP, which dispatches the fetched 32-bit instructions to the array PEs. The SP and the PEs each include a VIM, a VIM controller, and instruction decode and execution units. The concepts discussed in this disclosure apply to the indirect VLIW (iVLIW) apparatus and mechanism located both in the SP controller and in each PE of a multiple PE array SIMD machine.
After at least one VLIW is loaded into VIM, it may be selected by an execute-VLIW (XV) instruction. Two types of XV instructions are described in this invention. The first, XV1, provides sub-VLIW SIW selection across the slots at the same VIM address for execution time parallelism. The second, XV2, provides sub-VLIW SIW selection with independently selectable SIWs from the available SIWs within each slot's VIM section for execution time parallelism. The XV1 instruction is described first with an example demonstrating the advantages of this approach. The XV2 instruction description follows with an example demonstrating its inherent advantages.
The XV1 instruction causes the stored VLIW to be read out indirectly based upon address information that is computed from a VIM base address register and an immediate Offset value that is present in the XV1 instruction. The XV1 instruction contains Mask-Enable-bits which select the instructions from the read-out VLIW that are to be scheduled for execution. In a preferred ManArray embodiment there are eight Mask-Enable-bits, one bit per execution unit, supporting up to eight SIWs in a single VLIW. For the first implementation, five SIWs are preferably used.
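The XV1 behavior described above, indirect read-out followed by mask-based selection, can be sketched in Python as follows. This is a minimal model under stated assumptions: the function name `xv1`, the representation of the VIM as a dictionary, the convention that bit i of the mask enables slot i, and the instruction strings are all hypothetical and are not taken from the instruction encoding.

```python
def xv1(vim, vb, offset, mask):
    """Model an XV1 read-out.

    The VLIW is read from address vb + offset (base register plus
    immediate offset), and only the SIWs whose mask-enable bit is set
    are returned for execution. Assumption: bit i of `mask` enables slot i.
    """
    vliw = vim[vb + offset]
    return [siw for i, siw in enumerate(vliw) if (mask >> i) & 1]


# One stored five-slot VLIW, reused with different masks.
vim = {16: ("LOAD", "MPY", "ADD", "NOP", "NOP")}

print(xv1(vim, vb=16, offset=0, mask=0b00001))  # ['LOAD']
print(xv1(vim, vb=16, offset=0, mask=0b00011))  # ['LOAD', 'MPY']
print(xv1(vim, vb=16, offset=0, mask=0b00111))  # ['LOAD', 'MPY', 'ADD']
print(xv1(vim, vb=16, offset=0, mask=0b00100))  # ['ADD']
```

In this model a single VIM entry serves the start-up, steady state, and wind-down phases of a pipelined loop, where a fixed VLIW would need a separate NOP-padded entry for each phase.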
Due to the use of a VIM base register, Vb, unlimited VIM storage is possible. For each Vb base address, the XV1 instruction preferably supports, in the first implementation, an 8-bit offset thereby allowing 256 VLIWs per Vb address. The preferred ManArray architecture specifies that up to 8 SIWs can be stored per VIM address and a minimum of eight Mask-Enable-bits, one per slot, are supported by the preferred embodiment. Also, because each VIM entry has a unique address, each VIM entry can be loaded, modified, executed, or disabled, independently.
With eight SIW slots available per VIM entry, up to 255 unique combinations of SIW types can be stored in each entry, where, for example, SIW instruction types can include Store, Load, Arithmetic Logic Unit (ALU), Multiply Accumulate Unit (MAU), and Data Select Unit (DSU) instruction types. Each combination represents a unique indirect VLIW (iVLIW) available for XV1 execution. Furthermore, when invoking the execution of SIWs from a previously loaded VIM entry via an XV1 instruction containing the 8-bit mask, up to 255 unique iVLIW operations can be invoked from that VIM entry alone.
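The figure of 255 follows from counting the non-empty subsets of eight slots, that is, 2^8 − 1. A brief Python check is given below; the function name `nonempty_masks` is hypothetical and used only for illustration.

```python
def nonempty_masks(num_slots):
    """Return every mask value that enables at least one of `num_slots` slots."""
    return list(range(1, 1 << num_slots))


print(len(nonempty_masks(8)))  # 255 selections from an eight-slot entry
print(len(nonempty_masks(5)))  # 31 selections when five slots are used
```

The all-zero mask is excluded because it enables no SIW.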
The XV2 instruction provides the capability to remove duplicate instructions within groups of VLIWs within a slot-specific section of the VIM. This capability provides optimum packing of instructions within the VIM, thereby further optimizing its efficiency and minimizing its size for specific applications.
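The packing benefit of slot-specific sections can be illustrated with a small Python sketch. The XV2 encoding is not described in this section, so the model below rests on an assumption: each slot has its own table of SIWs, and a VLIW is selected by one independent index per slot. The function names `pack_by_slot` and `unpack` and the instruction strings are hypothetical.

```python
def pack_by_slot(vliws):
    """Split VLIWs into per-slot tables of unique SIWs.

    Returns (tables, selectors): tables[slot] lists each distinct SIW
    once, and selectors[n][slot] is the index into tables[slot] that
    rebuilds slot `slot` of the n-th VLIW.
    """
    num_slots = len(vliws[0])
    tables = [[] for _ in range(num_slots)]
    selectors = []
    for vliw in vliws:
        selector = []
        for slot, siw in enumerate(vliw):
            if siw not in tables[slot]:
                tables[slot].append(siw)  # store each SIW once per slot
            selector.append(tables[slot].index(siw))
        selectors.append(tuple(selector))
    return tables, selectors


def unpack(tables, selector):
    """Rebuild one VLIW from independently chosen per-slot indices."""
    return tuple(tables[slot][i] for slot, i in enumerate(selector))


vliws = [
    ("LOAD a", "MPY", "NOP"),
    ("LOAD b", "MPY", "ADD"),
    ("LOAD c", "MPY", "ADD"),
]
tables, selectors = pack_by_slot(vliws)
print(tables)     # [['LOAD a', 'LOAD b', 'LOAD c'], ['MPY'], ['NOP', 'ADD']]
print(selectors)  # [(0, 0, 0), (1, 0, 1), (2, 0, 1)]
print(sum(len(t) for t in tables))  # 6 stored SIWs instead of 9
```

Under this assumption the duplicated MPY and ADD instructions are each stored once in their slot's section, and every original VLIW remains recoverable through its selector.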
A more complete understanding of the present invention, as well as other features and advantages of the invention will be apparent from the following Detailed Description and the accompanying drawings.