The present invention relates to data processor architecture and, more particularly, to data processor architecture for digital signal processing.
The term "digital signal processing" (DSP) herein denotes any data processing procedures which operate on digital representations of data, particularly, but not limited to, those which represent analog signals or quantities. Hereinafter, the terms "digital signal processor" and "processor" both denote any device which is capable of processing digital data, including, but not limited to, representations of analog signals or quantities.
Digital signal processing algorithms, such as the Fast Fourier Transform (FFT), usually involve complex and intensive computation. Moreover, many DSP applications must run in real time, so a processor's ability to handle a large number of calculations in a short amount of time is of fundamental importance. It is also important that programs for the processor be easy to code and maintain. Furthermore, system power consumption is important for many DSP applications which require low power drain, for example to maximize battery life in cellular phone handsets, laptop computers, and consumer audio equipment. Thus, there are three aspects of a digital signal processor which influence the overall performance of a system based thereon. The first is the program execution speed, the second is the ease of programming, and the third is the power consumption during execution of the program code. In typical applications of digital signal processing, one or more of these aspects are critical.
Execution speed is highly dependent on the extent of parallel processing employed by the processor during program execution. The terms "parallel processing" and "parallel" denote the execution of a plurality of data operations substantially simultaneously. When comparing two processors with the same clock rate, if the first is capable of executing two instructions in parallel (per cycle), while the second is capable of executing only one instruction per cycle, then the first processor has a clear advantage, because if all other conditions are identical, the first processor will execute a program in half the time required by the second processor.
Ease of programming is also vitally important, since the nature of the instruction set has a major influence on the suitability of a processor for different tasks. Today's processors must provide an instruction set that has the properties of flexibility and orthogonality. Flexibility assures that dependencies between instructions are reduced to a minimum, thus allowing the programmer to write code freely, without restrictions. Orthogonality frees the programmer from concern over which operands are permitted in the current operation, because orthogonality permits most operands to be used in most instructions. Thus, flexibility and orthogonality in an instruction set reduce the restrictions and hence reduce the burden on the programmer.
Power consumption is dependent on the hardware complexity of the processor, such as the width of data buses, the number of computation units employed, and the number of instruction decoders necessary to handle the different fields in an instruction word.
There are currently two "mainstream" architectures for digital signal processors. Both involve design compromises concerning the three issues mentioned above. The first mainstream architecture is referred to as the "regular" architecture, and is characterized by the execution of a single instruction in a machine cycle. The second mainstream architecture is referred to as the "Very Long Instruction Word" (VLIW) architecture, and is characterized by the execution of several instructions in a single machine cycle. An overview of the prior art can be obtained from DSP Processor Fundamentals: Architectures and Features, by Lapsley, Bier, Shoham, and Lee, Berkeley Design Technology, Inc., 1996, and from some of the technical literature pertaining to currently-available digital signal processors, such as the Texas Instruments "C6" series. Technical documentation provided by the manufacturer for this series of digital signal processors includes the TMS320C62xx CPU and Instruction Set, July 1997, Texas Instruments Incorporated, Houston, Tex. A discussion of the architecture of this digital signal processor is found in "The VelociTI Architecture of the TMS320C6x" by Thomas J. Dillon, Jr., in The Proceedings of the 8th International Conference on Signal Processing Applications and Technology, pp. 838-842, September, 1997, Miller Freeman, Inc., San Francisco, Calif.
A regular processor, where a single instruction is executed per machine cycle, features a relatively narrow program data bus, because it is necessary to fetch only one instruction word (typically 32 bits wide) per cycle. In addition, since only one instruction is executed per machine cycle, the number of computation units in the execution unit of the processor is small (typically two: an adder and a multiplier). As noted above, the program bus width and the number of computation units directly influence the power consumption of the processor. Thus, the regular architecture is also characterized by a relatively low power consumption. It is also easier to write program code for a regular processor than for a VLIW processor. The inherent disadvantage of the regular architecture is that the execution speed (usually measured in "MIPS", or Million-Instructions executed Per Second) is lower than that of the VLIW architecture described below.
The second mainstream architecture, the VLIW architecture, implements an instruction set in which a number of simple, noninterdependent operations are packed into the same instruction word. The term "instruction word" herein denotes a set of instructions contained in a single programming step, such that at run-time, the processor executes all the instructions within the instruction word in parallel. Thus, the VLIW architecture requires a plurality of computation units in the processor, along with a corresponding plurality of instruction decoders to analyze the instructions contained in the instruction word fetched from the program memory. VLIW architecture has the advantage of parallel processing, thus increasing the MIPS capability of the processor. VLIW architecture, however, also requires wider memory banks, multiple computation units, and multiple instruction decoders, thus increasing chip area and power consumption. In addition, the skills required of the programmer to write code for a VLIW architecture processor are also inherently higher, in order to exploit the parallel processing capabilities of the processor.
There is thus a widely recognized need for a processor architecture which combines the advantages of the regular architecture and the VLIW architecture, while reducing or eliminating the disadvantages inherent in these two mainstream architectures. It would be highly advantageous to have a processor which has a high execution speed, ease of programming, and low power consumption. These goals are met by the present invention.
The configurable long instruction word (CLIW) architecture described here is an innovative processor architecture and instruction set design which optimizes processing speed, ease of programming, and power consumption to benefit DSP programming. The present invention combines the advantages of both the regular and VLIW architectures in a flexible design which overcomes the limitations of the current technologies. In a CLIW processor, several instructions may be executed in parallel during a single cycle without the need for an enlarged general program memory bus, and without the need for multiple processor instruction decoders to support the parallel instruction execution. The present invention also represents improvements in the ease of programming, both from the standpoint of the instruction set as well as the syntax of the instructions themselves.
The general concept of the present invention may be better understood with reference to the drawings and the accompanying description.
FIG. 1 illustrates the general composition of an instruction word and the instructions contained therein. An instruction word 2 is composed of at least one instruction, and possibly more instructions. The "width" of an object, such as an instruction, refers to the number of bits needed to code the object. Instruction word 2 as illustrated in FIG. 1 contains an instruction 4, an instruction 10, and an instruction 16, and may contain further instructions, as indicated by the ellipsis ( . . . ). Each instruction of instruction word 2 directs the processor to perform some function. These functions can include, for example, arithmetic or logical operations on data, or directions to the processor to branch to different locations of the program or perform other operations related to the execution of the code. Each instruction of instruction word 2 therefore specifies the precise operation, and may contain additional information which further specifies how the operation executes. The common means of specifying the operation is by an operation code, or "op-code". Certain operations require one or more operands, upon which they work, and this is also specified in an instruction. To illustrate this, instruction 4 contains an operation code 6 and an operand 8. For example, operation code 6 might direct the processor to perform a bitwise NOT operation on the contents of a register, and operand 8 might specify which register is involved. In contrast, instruction 10 contains an operation code 12 and two operands, an operand 14 and an operand 15. For example, operation code 12 might direct the processor to move the contents of a first register to a second register, and operand 14 might specify the first register, while operand 15 might specify the second register. In further contrast, instruction 16 contains only an operation code 18 and no operands. For example, operation code 18 might be a "no operation".
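The composition shown in FIG. 1 can be modeled in a few lines of Python. This is only an illustrative sketch: the class names `Instruction` and `InstructionWord`, the mnemonic strings, and the register names are assumptions chosen for the example and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Instruction:
    """An operation code plus zero or more operands (hypothetical model)."""
    opcode: str
    operands: Tuple[str, ...] = ()


@dataclass(frozen=True)
class InstructionWord:
    """One or more instructions that belong to a single programming step."""
    instructions: Tuple[Instruction, ...]

    def __post_init__(self):
        # An instruction word is composed of at least one instruction.
        if not self.instructions:
            raise ValueError("an instruction word needs at least one instruction")


# Mirrors instruction word 2 of FIG. 1; mnemonics and registers are made up.
word_2 = InstructionWord((
    Instruction("NOT", ("r1",)),        # like instruction 4: one operand
    Instruction("MOVE", ("r1", "r2")),  # like instruction 10: two operands
    Instruction("NOP"),                 # like instruction 16: no operands
))

operand_counts = [len(i.operands) for i in word_2.instructions]
```

Here `operand_counts` records how many operands each of the three example instructions carries, matching the one-operand, two-operand, and zero-operand cases described for instructions 4, 10, and 16.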
In a regular architecture, an instruction word consists of a single instruction, whereas in a VLIW architecture, an instruction word contains a number of instructions which are executed in parallel to achieve increased execution speed. In prior art implementations, the number of instructions in an instruction word is fixed. In a processor according to the present invention, however, the number of instructions in an instruction word is variable and may change during program execution. Doing so enables a processor according to the present invention to realize increased execution speed along with savings in power consumption and improved ease of programming.
FIG. 2 is a block diagram showing in general the conceptual high-level instruction flow of a prior art processor with a pipeline 20. The function of pipeline 20 is to provide a steady flow of instruction words from a program memory 30 to the processor's internal logic. The processor's internal logic is able to operate at relatively high speed compared to the fetching of instruction words from program memory 30, so more efficient use of the processor can be made by reducing the time spent waiting for instruction words to be fetched. Once pipeline 20 is filled with instruction words, the processing of an instruction word is completed every cycle, even though it takes a number of cycles to process any single instruction word. In this sense, a pipeline is analogous to a manufacturing assembly line. The general, high-level stages of pipeline 20 are illustrated in descending order as the instruction words propagate from beginning to end. First is a program fetch stage 22, where the instruction word is retrieved from program memory 30, as previously noted. Next is a decoding stage 24, where the instruction word is analyzed by a set of one or more instruction decoders 40, which outputs two sets of controls. The first set of controls goes to an address unit 50 (AU) which fetches the data which the instructions of the instruction word need. For example, an instruction of an instruction word might contain an ADD operation, in which case an addend may be required from data storage. The second set of controls from instruction decoders 40 goes to an execution unit 60 (EU), which utilizes the EU controls to direct the processing of the instructions contained in the instruction word on the data.
Herein, the term "control" denotes a hardware logic level or signal, such as an electrical voltage, as distinct from an instruction. Controls need no further decoding stage to be utilized by execution unit 60, but rather drive execution unit 60 directly to perform operations, which are accomplished by computation units c1 through c4. In contrast, the term "instruction" herein denotes a symbolic entity which represents operations to be performed by the processor, and which must be decoded into controls, for example by instruction decoders 40, in order to be performed by execution unit 60. Since they are symbolic entities, instructions can be arranged, manipulated, and processed independent of the processor, such as by a programmer or a compiler. A finished program ready for execution consists of a sequence of instructions. In summary, then, execution unit 60 does not execute instructions, but rather is driven by controls, such as from instruction decoders 40. It is the processor as a whole which executes instructions of the program.
Execution unit 60 is designed to perform "primitive data operations", which hereinafter denotes basic arithmetic and logical operations, including but not limited to moving data from one location to another, comparing data in one location against data in another location, addition, subtraction, multiplication, division, AND, OR, NOT, and so forth. The specific primitive data operation which is performed depends on the control which is input to execution unit 60. After decoding stage 24, there is a data fetch stage 26, during which address unit 50 fetches the required data. Finally, there is an execution stage 28, where execution unit 60 uses the EU controls to direct one or more computation units (denoted in FIG. 2 as c1, c2, c3, and c4) which are the hardware entities that perform the actual numerical and logical operations of the instructions in the instruction word on the data. Note that the pipeline illustrated in FIG. 2 has only the highest levels of operation shown. Some processors have pipelines with more stages, representing a more detailed breakdown of the basic functions illustrated. Such processors are sometimes referred to as "deep pipelined processors".
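The assembly-line behavior of pipeline 20 can be illustrated with a small timing sketch. It assumes an idealized pipeline with the four high-level stages of FIG. 2, one stage per cycle, and no stalls; the function name `completion_cycles` is hypothetical.

```python
# The four high-level stages of FIG. 2, in order.
STAGES = ("program fetch", "decode", "data fetch", "execute")


def completion_cycles(num_words: int) -> list:
    """Return the 1-based cycle in which each instruction word finishes.

    Assumption: word k (0-based) enters the program fetch stage in cycle
    k + 1 and advances exactly one stage per cycle, with no stalls.
    """
    return [k + len(STAGES) for k in range(num_words)]


cycles = completion_cycles(5)

# Total time with and without overlapping the stages.
pipelined_total = cycles[-1]
unpipelined_total = 5 * len(STAGES)
```

Under these assumptions the first instruction word completes after four cycles, and each later word completes one cycle after its predecessor, so five words take 8 cycles rather than the 20 cycles needed if each word had to finish before the next one started.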
With the configurable long instruction word (CLIW) architecture of the present invention, a set of controls corresponding to a decoded instruction word is associated with, and is executed according to, a regular instruction. A regular instruction which causes the processor to execute such a set of controls is herein denoted as a "reference instruction" because it references a set of controls. Such sets of controls may correspond to decoded instruction words containing different numbers of instructions. Hence, a reference instruction may invoke the execution of different numbers of instructions. This is inherently different from the regular or the traditional VLIW architectures described earlier, where an instruction (or group of instructions within an instruction word) is fetched from the program memory, decoded and executed in a manner which is entirely fixed. In a CLIW processor according to the present invention, an instruction can start at the beginning of the pipeline as a reference instruction read from program memory, but before entering the execution phase, the instruction is transformed into the controls corresponding to a plurality of instructions, using a set of controls stored in a pre-loaded, dedicated array of writable processor memory, herein referred to as a "CLIW array" for convenience. Moreover, the set of controls associated with a particular reference instruction is changeable, not only from one program to another, but also within a program. The CLIW array is illustrated in FIG. 3 and FIG. 4, as described below. The term "dedicated" denotes that this array is intended specifically for the purpose of storing controls corresponding to decoded instructions, and the term "writable" denotes that information contained in the array may be freely changed, as opposed to "read-only" memory, which cannot be rewritten.
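A minimal software analogy of this mechanism is sketched below. It is not the hardware design of the invention: the class `CliwArray`, the function `controls_for`, the `"REF"` opcode, and the control names are all hypothetical, and controls are represented as plain strings standing in for hardware signals.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Control = str  # stand-in for a hardware control signal (assumption)


class CliwArray:
    """Toy model of a dedicated, writable array of pre-decoded control sets."""

    def __init__(self, num_entries: int):
        self._entries: List[Tuple[Control, ...]] = [()] * num_entries

    def write(self, index: int, controls: Sequence[Control]) -> None:
        # Writable: an entry may be replaced at any time, even mid-program.
        self._entries[index] = tuple(controls)

    def read(self, index: int) -> Tuple[Control, ...]:
        return self._entries[index]


def controls_for(instruction: Tuple[str, Optional[int]],
                 decoder: Dict[str, Control],
                 cliw: CliwArray) -> Tuple[Control, ...]:
    """Return the controls that would drive the execution unit."""
    opcode, arg = instruction
    if opcode == "REF":
        # Reference instruction: use the stored entry instead of decoding.
        return cliw.read(arg)
    # Regular instruction: the decoder yields a single control.
    return (decoder[opcode],)


decoder = {"ADD": "ctl_add", "MUL": "ctl_mul"}
cliw = CliwArray(num_entries=4)

cliw.write(0, ["ctl_sub", "ctl_mac"])
first = controls_for(("REF", 0), decoder, cliw)

# The same reference instruction later maps to a wider set of controls.
cliw.write(0, ["ctl_sub", "ctl_mac", "ctl_sub", "ctl_mac"])
second = controls_for(("REF", 0), decoder, cliw)

regular = controls_for(("ADD", None), decoder, cliw)
```

In this sketch the same reference instruction first yields two controls and, after the entry is rewritten, four controls, while a regular instruction still passes through the ordinary decoder.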
The CLIW architecture of the present invention also differs significantly from the architecture of prior art processors which utilize "microinstructions" to decode or perform regular instructions. In such prior art processors, every instruction is associated with a fixed sequence of primitive hardware operations, or microinstructions, which are stored during manufacture of the processor in a read-only memory within the processor. When the instruction is to be executed, its corresponding sequence of microinstructions is invoked and executed in a linear fashion according to the timing of a multi-phase clock. The CLIW architecture of the present invention, however, differs from this in a number of important aspects. First, the intended functions are different. Microinstructions are employed as a decoding mechanism to implement single instructions in a sequential fashion, whereas the set of controls associated with a CLIW reference instruction executes multiple instructions in parallel. Second, the set of controls associated with a CLIW reference instruction is changeable and under programmer control, rather than fixed by the manufacturer, as are microinstructions. Third, the set of controls associated with a CLIW reference instruction is stored in writable memory (the CLIW array) rather than read-only memory, so that it may be easily changed by the programmer.
From the programmer's point of view, the CLIW concept according to the present invention allows full utilization of the processor's hardware by re-defining a dynamic-width, user-configurable instruction set as an extension to the processor's native instruction set. That is, the programmer can define new instructions to accomplish special purposes. These programmer-defined instructions can have the same structure and have the same status as any other instructions, including the instructions of the processor's native instruction set. For example, if the programmer wishes to find the minimum Euclidean distance between an input vector and an array of given vectors (utilizing an algorithm known as "vector quantization"), the following instruction could be defined, using syntactical conventions well-known in the art, and similar to those of the C programming language:
Find_Eucl_distance (r4, r0)    (1)
{
a0L = *(r4++) - *(r0++) ||
a1 += sqr(a0L) ||
a2L = *(r4) - *(r0+rn0) ||
a3 += sqr(a2L);    (2)
}
The instruction Find_Eucl_distance (r4, r0) shown in Expression (1) and defined in Expression (2) above calculates the sum of the squares of the differences between the components of a vector pointed to by r4 and the corresponding components of a vector pointed to by r0. Find_Eucl_distance (r4, r0) can be invoked in a program exactly as if it were an instruction of the processor's native instruction set. The definition of Find_Eucl_distance (r4, r0), as shown in Expression (2), however, is in the form of an instruction word containing four instructions, each of which is written on a separate line and separated by double bars (||). In the first line, the contents of the accumulator a0L is set to the difference between the vector component pointed to by r4 and the vector component pointed to by r0, and afterwards, both r4 and r0 are incremented to point to the next vector component. In the second line, the square of this difference (a0L) is added to the contents of accumulator a1. A similar pair of operations is done simultaneously on the second half of the r0 vector components in the third and fourth lines, utilizing accumulators a2L and a3. If there are n vector components, it is necessary to first clear the accumulators and then execute this instruction n+1 times, because the procedures of the second and fourth lines add results computed (in the first and third lines, respectively) during the previous execution.
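The need for n+1 executions can be checked with a small emulation of Expression (2). This sketch rests on several assumptions that the text does not spell out: all four instructions read register and pointer values as they were before the execution began, memory is a flat Python list indexed by the pointers, the data at offset rn0 from r0 is laid out as a second candidate vector, and one padding element follows each region so that the final, discarded read stays in bounds. The names `find_eucl_distance` and `run` are hypothetical.

```python
def find_eucl_distance(state: dict, mem: list) -> None:
    """Emulate one execution of the four-instruction word of Expression (2).

    Assumption: the four instructions run in parallel, so every read sees
    the register values from before this execution.
    """
    r4, r0, rn0 = state["r4"], state["r0"], state["rn0"]
    new_a0 = mem[r4] - mem[r0]                # a0L = *(r4++) - *(r0++)
    new_a1 = state["a1"] + state["a0L"] ** 2  # a1 += sqr(a0L), old a0L
    new_a2 = mem[r4] - mem[r0 + rn0]          # a2L = *(r4) - *(r0+rn0)
    new_a3 = state["a3"] + state["a2L"] ** 2  # a3 += sqr(a2L), old a2L
    state.update(a0L=new_a0, a1=new_a1, a2L=new_a2, a3=new_a3,
                 r4=r4 + 1, r0=r0 + 1)        # post-increment both pointers


def run(x: list, y0: list, y1: list) -> tuple:
    """Accumulate squared distances from x to y0 (in a1) and to y1 (in a3)."""
    n = len(x)
    # Layout (assumed): x, one pad element, y0, y1, one pad element.
    mem = list(x) + [0] + list(y0) + list(y1) + [0]
    state = {"r4": 0, "r0": n + 1, "rn0": n,
             "a0L": 0, "a1": 0, "a2L": 0, "a3": 0}  # accumulators cleared
    for _ in range(n + 1):  # n + 1 executions flush the last difference
        find_eucl_distance(state, mem)
    return state["a1"], state["a3"]


d0, d1 = run([1, 2, 3], [1, 0, 3], [0, 0, 0])
```

With three components, the fourth execution contributes nothing new from memory but adds the squares of the differences computed in the third execution, which is why running only n times would leave the last component out of both sums.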
It is important to note that, to the programmer, this instruction can be executed with arbitrary parameters in place of r4 and r0, and is therefore a general-purpose instruction. It is furthermore important to note that, with the CLIW architecture, an instruction such as Find_Eucl_distance (r4, r0) is defined in terms similar to that of VLIW architecture, and therefore has the speed and processing efficiency of VLIW architecture. However, in a CLIW processor, an instruction such as Find_Eucl_distance (r4, r0) appears in program memory as a regular instruction, and proceeds through most of the pipeline as a regular instruction with the hardware efficiencies and low power consumption of a regular processor. CLIW architecture therefore combines the advantages of both the regular architecture and the VLIW architecture, and additionally provides improved ease of programming.
Typically, each instruction of a definition such as Find_Eucl_distance (r4, r0) would be coded in 24 bits. In a traditional VLIW architecture, a total of 96 bits would therefore be required for such an instruction word. This would in turn require a program memory bus width of 96 bits in order for a traditional VLIW architecture to execute such an instruction in a single cycle. In contrast, CLIW architecture enables an instruction word such as this, or, for example, even an instruction word with up to six parallel instructions, to require a program memory bus of only 48 bits, which is similar to that of regular architecture. Such an instruction can be used any number of times throughout the program code, and will always occupy the regular 48 bits of program data. The additional information required for execution of the instruction word is stored in the dedicated array of writable processor memory (the CLIW array).
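The bit counts quoted above follow from simple arithmetic, sketched here. The 24-bit instruction width and the 48-bit reference instruction are taken from the text. The storage occupied by the entry in the CLIW array is not counted, because its size is not stated here. The function names are hypothetical.

```python
BITS_PER_INSTRUCTION = 24        # per-instruction width quoted in the text
REFERENCE_INSTRUCTION_BITS = 48  # CLIW program-memory width quoted in the text


def vliw_word_bits(parallel_instructions: int) -> int:
    """Program bus width a traditional VLIW word of this size would need."""
    return parallel_instructions * BITS_PER_INSTRUCTION


def program_memory_bits(uses: int, parallel_instructions: int) -> tuple:
    """Program memory used by `uses` occurrences: (VLIW, CLIW).

    The CLIW figure excludes the CLIW array entry itself, whose size is
    not given in this section.
    """
    vliw = uses * vliw_word_bits(parallel_instructions)
    cliw = uses * REFERENCE_INSTRUCTION_BITS
    return vliw, cliw


four_wide = vliw_word_bits(4)
six_wide = vliw_word_bits(6)
footprint = program_memory_bits(uses=10, parallel_instructions=4)
```

For the four-instruction example this gives the 96 bits mentioned above, and a six-instruction word would need 144 bits in a traditional VLIW encoding, whereas each use of the reference instruction stays at 48 bits of program data.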
In the CLIW processor, the new instruction set is a super-set of the regular instruction set, and can be re-used throughout the code with different operands, as needed in the specific application. Each new instruction is equivalent to a decoded VLIW instruction word and is hereinafter referred to as an "entry" in the CLIW array. As in any instruction, the newly defined instruction, such as Find_Eucl_distance (r4, r0), can be used with different address unit pointers other than r0 and r4 without the need for an additional CLIW array entry, allowing a true programmer-defined instruction set.
Therefore, according to the present invention there is provided a method for defining a new instruction to be executed by a data processor having an instruction set, wherein the new instruction is added to the instruction set, the data processor having an execution unit operative to performing a plurality of primitive data operations according to a plurality of controls, the method including the steps of: (a) providing a dedicated array of writable processor memory having space for at least one entry; (b) providing at least one execution instruction in the instruction set, the execution instruction operative to selecting and executing the at least one entry; (c) determining an operation set which is operative to performing the new instruction, the operation set selected from the plurality of primitive data operations; (d) compiling the operation set into an executable control set selected from the plurality of controls; (e) storing the executable set as an entry in the dedicated array; and (f) providing a pointer to the entry.
Furthermore, according to the present invention there is also provided an improvement to a data processor which executes programs stored in program memory, the processor having an instruction set containing a plurality of instructions, the programs including instructions selected from the plurality of instructions, the processor further having an execution unit operative to executing the instructions of the instruction set, the processor further having at least one instruction decoder operative to decoding instructions of the instruction set that are included in the programs and sending controls to the execution unit for executing the decoded instructions, the improvement including: (a) a dedicated array of writable processor memory for storing entries of predetermined controls for the execution unit, the dedicated array of writable processor memory having space for at least one entry; and (b) at least one reference instruction in the instruction set, the reference instruction operative to directing the sending of an entry of predetermined controls from the dedicated array of writable processor memory to the execution unit, the predetermined controls superseding the controls from the at least one instruction decoder.