1. Field of the Invention
The invention relates to methods and apparatus for reducing code size of instructions on a microprocessor or micro-controller, e.g., on digital signal processing devices, with instructions requiring NOPs (hereinafter “processors”). In particular, the invention relates to methods and apparatus for reducing code size on architectures with an exposed pipeline, such as a very long instruction word (VLIW) architecture, by encoding NOP operations as an instruction operand.
2. Description of Related Art
VLIW describes an instruction-set philosophy in which a compiler packs a number of relatively simple, non-interdependent operations into a single instruction word. When fetched from a cache or memory into a processor, these words are readily broken up and the operations dispatched to independent execution units. VLIW may perhaps best be described as a software- or compiler-based, superscalar technology. VLIW architectures frequently have exposed pipelines.
Delayed effect instructions are instructions in which one or more successive instructions may be executed before the initial instruction's effects are complete. NOP instructions are inserted to compensate for instruction latencies. A NOP instruction is a dummy instruction that has no effect. It may be used as an explicit “do nothing” instruction that is necessary to compensate for latencies in the instruction pipeline. However, such NOP instructions increase code size. For example, a delay may be expressed either as a single multiple-cycle NOP or as a series of individual NOPs.
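The two forms of NOP mentioned above can be compared with a minimal sketch. The textual syntax used here ("NOP n" for an n-cycle NOP, and the LD and ADD operand notation) is an assumption made for illustration only and is not taken from this description.

```python
def expand_multicycle_nops(program):
    """Expand each 'NOP n' entry into n single-cycle 'NOP' entries.

    The 'NOP n' spelling is a hypothetical notation for a multiple-cycle NOP.
    """
    expanded = []
    for instr in program:
        parts = instr.split()
        if len(parts) == 2 and parts[0] == "NOP":
            expanded.extend(["NOP"] * int(parts[1]))
        else:
            expanded.append(instr)
    return expanded


# A load followed by a four-cycle delay, written with one multiple-cycle NOP.
compact = ["LD *A0, A1", "NOP 4", "ADD A1, A2, A3"]
# The same delay written as a series of individual NOPs.
serial = expand_multicycle_nops(compact)

print(len(compact), len(serial))
```

Both forms delay the ADD by the same number of cycles; they differ only in how many instruction words are spent on the delay.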
NOPs occur frequently in code for VLIWs.
Often NOP instructions are executed for multiple sequential cycles. The c6x series architecture has a multi-cycle NOP for encoding a sequence of NOP instructions. The c6000 platform, available from Texas Instruments, Inc., of Dallas, Tex., provides a range of fixed- and floating-point digital signal processors (DSPs) that enable developers of high-performance systems to choose the device suiting their specific application. The platform combines several advantageous features with DSPs that achieve enhanced performance, improved cost efficiency, and reduced power dissipation. As some of the industry's most powerful processors, the c62x fixed-point DSPs of the c6000 platform offer performance levels ranging from 1200 million instructions per second (MIPS) up to 2400 MIPS. The c67x floating-point devices range from 600 million floating-point operations per second (MFLOPS) to above 1 GFLOPS (1 billion floating-point operations per second). To accommodate the performance needs of emerging technologies, the c6000 platform provides a fixed-point and floating-point code compatible roadmap to 5000 MIPS for the c62x generation fixed-point devices and to more than 3 GFLOPS for the floating-point devices.
Load (LD) and branch (B) instructions may have five (5) and six (6) cycle latencies, respectively. A latency may be defined as the period (measured in cycles or delay slots) within which all effects of an instruction are completed. Instruction scheduling is used to “fill” these latencies with other useful operations. Assuming that such other instructions are unavailable for execution during the instruction latency, NOPs are inserted after the instruction issues to maintain correct program execution.
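The insertion of NOPs to cover unfilled latencies can be sketched as follows. This is an illustrative model only: it assumes one instruction issues per cycle, that an instruction with latency L issued at cycle t makes its result usable at cycle t + L, and that no other useful instructions are available. The tuple layout and register names are hypothetical.

```python
LATENCY = {"LD": 5, "B": 6}   # cycles, using the figures given above
DEFAULT_LATENCY = 1           # assumed for all other instructions


def pad_with_nops(program):
    """program: list of (opcode, dest_register_or_None, source_registers).

    Returns the opcode sequence with explicit NOPs inserted.
    """
    padded = []
    ready_at = {}   # register -> first cycle at which its value may be used
    cycle = 0
    for opcode, dest, sources in program:
        earliest = max((ready_at.get(reg, 0) for reg in sources), default=0)
        while cycle < earliest:          # wait for the slowest operand
            padded.append("NOP")
            cycle += 1
        padded.append(opcode)
        latency = LATENCY.get(opcode, DEFAULT_LATENCY)
        if dest is not None:
            ready_at[dest] = cycle + latency
        cycle += 1
        if opcode == "B":                # pad the cycles before the branch takes effect
            padded.extend(["NOP"] * (latency - 1))
            cycle += latency - 1
    return padded


example = [
    ("LD", "A1", ["A0"]),          # load A1 from the address in A0
    ("ADD", "A3", ["A1", "A2"]),   # uses the loaded value
    ("B", None, []),               # branch
]
print(pad_with_nops(example))
```

Under these assumptions the three useful instructions grow to twelve instruction words, nine of which are NOPs.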
Although NOPs are described above as compensating for the delayed effects of load and branch instructions, NOPs may be associated with any type of instruction having a latency greater than one (1). Generally, complex operations, load instructions that read memory, and control flow instructions (e.g., branches) have latencies greater than one (1), and their execute phases may take multiple cycles.
Pipelining is a method for executing instructions in an assembly-line fashion. Pipelining is a design technique for reducing the effective propagation delay per operation by partitioning the operation into a series of stages, each of which performs a portion of the operation. A series of data is typically clocked through the pipeline in sequential fashion, advancing one stage per clock period.
The instruction is the basic unit of programming that causes the execution of one operation. It consists of an op-code and operands along with optional labels and comments. An instruction is encoded by a number of bits, N. N may vary or be fixed depending on the architecture of a particular device. For example, the c6x family of processors, available from Texas Instruments, Inc., of Dallas, Tex., has a fixed, 32-bit instruction word. A register is a small area of high speed memory, located within a processor or electronic device, that is used for temporarily storing data or instructions. Each register is given a name, contains a few bytes of information, and is referenced by programs.
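How an op-code, operands, and a NOP count might share a fixed, 32-bit instruction word can be sketched as follows. The field names, widths, and bit positions are entirely hypothetical; they are not the c6x encoding and serve only to show that a small NOP field can sit alongside the usual fields.

```python
# Hypothetical layout, most significant field first: 8 + 5 + 5 + 5 + 3 = 26 bits,
# leaving the 6 least significant bits of the 32-bit word unused.
FIELDS = [("opcode", 8), ("dst", 5), ("src1", 5), ("src2", 5), ("nop", 3)]
WORD_BITS = 32


def encode(**values):
    """Pack the named field values into one 32-bit instruction word."""
    word, shift = 0, WORD_BITS
    for name, width in FIELDS:
        shift -= width
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
    return word


def decode(word):
    """Unpack a 32-bit instruction word into its named fields."""
    fields, shift = {}, WORD_BITS
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


# A hypothetical load with op-code 0x2A that also requests four idle cycles.
word = encode(opcode=0x2A, dst=1, src1=0, nop=4)
print(f"{word:08x}", decode(word))
```

With a three-bit field, this sketch can express delays of up to seven cycles without spending an additional instruction word.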
In one example of an instruction pipeline, the pipeline may consist of fetch, decode, and execute stages. Each of these stages may take multiple cycles. For example, the instruction-fetch phase is the first phase of the pipeline, in which the instruction is fetched from program memory. The instruction-decode phase is the next phase of the pipeline, in which the instruction is decoded. The operand-fetch phase is the third phase of the pipeline, in which an operand or operands are read from the register file. Operands are the parts of an instruction that designate where the central processing unit (CPU) will fetch or store information. The operands consist of the arguments (or parameters) of an assembly language instruction. Finally, in the instruction-execute phase, the instruction is executed. An instruction register (IREG or IR) is a register that contains the actual instruction being executed, and an instruction cache is an on-chip static RAM (SRAM) that contains current instructions being executed by one of the processors.
Thus, a need has arisen for a method and apparatus for reducing or minimizing code size by reducing the number of NOP instructions, and for a method for reducing the total and average code size of code developed for use on processors with an exposed pipeline. Because the insertion of NOPs as separate instructions increases code size, code size may be reduced by including the NOP as a field within an existing instruction.
Further, the need has arisen to reduce the cost of processors by reducing the memory requirements for such devices. Reducing code size reduces total system cost by lessening or minimizing the amount of physical memory required in the system. Reducing code size also may improve system performance by allowing more code to fit into on-chip memory, i.e., memory that is internal to the chip or device, which is a limited resource.
Moreover, the need has arisen to increase the performance and capabilities of existing processors by reducing the memory requirements to perform current operations. Reducing code size also may improve performance in systems that have program caches.
In addition, the need has arisen for methods for reducing the total power required to perform the signal processing operations on existing and new devices. Reducing code size also reduces the amount of power used by a chip, because the number of instructions that are fetched may be reduced.
In an embodiment, the invention is a method for reducing total code size in a device having an exposed pipeline, e.g., in a processor. The method may comprise the steps of determining a latency between a defining instruction, e.g., a load instruction, and a using instruction and inserting a NOP field into the defining or using instruction or into an intervening instruction. Generally, defining instructions “define” the value of some variable, while using instructions employ a defined variable, e.g., within some mathematical or logical operation. For example, latencies may be determined by searching the code to identify periods (measured in cycles or delay slots) within which all effects of an instruction are to be completed, e.g., branching steps involving the switching of program control to a nonsequential program-memory address. When inserted into the defining instruction, the NOP field defines the latency following the defining instruction. When inserted into the using instruction, the NOP field defines the latency preceding the using instruction. Because the defining or using instruction may have insufficient space to accommodate the NOP field, it may be convenient or desirable to place the NOP field in an intervening instruction. When inserted into an intervening instruction, the NOP field may indicate that the delay occurs before or after the intervening instruction.
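The three placements of the NOP field described above can be compared with a small timing sketch. The dictionary keys "nop_before" and "nop_after", the single-issue timing model, and the MV instruction used as the intervening instruction are all assumptions made for illustration.

```python
def issue_cycles(program):
    """Return (opcode, issue cycle) pairs for a single-issue sequence.

    Each instruction is a dict with an 'op' key and optional hypothetical
    'nop_before' / 'nop_after' fields counting idle cycles.
    """
    cycle = 0
    issued = []
    for instr in program:
        cycle += instr.get("nop_before", 0)      # delay preceding this instruction
        issued.append((instr["op"], cycle))
        cycle += 1 + instr.get("nop_after", 0)   # issue slot plus following delay
    return issued


# NOP field carried by the defining instruction (the load).
in_defining = [{"op": "LD", "nop_after": 4}, {"op": "ADD"}]
# NOP field carried by the using instruction.
in_using = [{"op": "LD"}, {"op": "ADD", "nop_before": 4}]
# NOP field carried by an intervening instruction, which itself fills one slot.
in_intervening = [{"op": "LD"}, {"op": "MV", "nop_after": 3}, {"op": "ADD"}]

for variant in (in_defining, in_using, in_intervening):
    print(issue_cycles(variant))
```

In each variant the using instruction issues five cycles after the load, matching the five-cycle load latency given earlier, and no separate NOP instruction word is needed.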
In another embodiment, the invention is a method for reducing total code size during branching, e.g., in a processor. The method may comprise the steps of determining a latency after a branch instruction for initiating a branch to a new (non-successive) point in an instruction stream, e.g., from a first point to a second point in an instruction stream, and inserting a NOP field into the branch instruction.
In yet another embodiment, the invention is an apparatus having reduced total code size. The apparatus may comprise a processor including at least one defining instruction, e.g., a load instruction, followed by at least one using instruction, wherein a latency exists between the at least one defining instruction and the at least one using instruction. The at least one defining instruction, the at least one using instruction, or an intervening instruction may include a NOP field. As noted above, when inserted into the defining instruction, the NOP field defines the latency following the defining instruction. When inserted into the using instruction, the NOP field defines the latency preceding the using instruction. Further, when inserted into an intervening instruction, the NOP field may indicate that the delay occurs before or after the intervening instruction.
In still another embodiment, the invention is an apparatus for reducing total code size during branching. The apparatus may comprise a processor including at least one branch instruction for branching to a new (non-successive) point in an instruction stream, e.g., from a first point to a second point in an instruction stream. A latency exists in a shift between the first point and the second point, e.g., the latency following a branch instruction. The at least one branch instruction includes a NOP field corresponding to the latency.
In yet a further embodiment, the invention is a method comprising the steps of locating, within a code, at least one delayed effect instruction, such as a load or branch instruction, followed by NOPs (either serially or as a multiple-cycle NOP); deleting the NOPs from the code; and inserting a NOP field into a delaying instruction, such as the at least one delayed effect instruction. Alternatively, the NOPs may be replaced by including a NOP field in an intervening instruction or another appropriately positioned instruction within the code. Further, the NOPs may precede or follow the delaying instruction. In addition, once delayed effect instructions have been located, the code may be reordered to facilitate replacement of NOPs with NOP fields.
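The locate, delete, and insert steps of this embodiment can be sketched as a simple pass over a list of instructions. The textual instruction syntax, the set of delayed effect op-codes, and the maximum value of the NOP field (seven, as for a three-bit field) are assumptions; the sketch handles only NOPs that follow the delayed effect instruction and performs no reordering.

```python
DELAYED = {"LD", "B"}   # op-codes treated as delayed effect instructions (assumed)


def nop_count(instr):
    """Number of idle cycles in a 'NOP' or 'NOP n' entry, or 0 for other entries."""
    parts = instr.split()
    if not parts or parts[0] != "NOP":
        return 0
    return int(parts[1]) if len(parts) > 1 else 1


def fold_nops(code, max_field=7):
    """Replace NOPs that follow a delayed effect instruction with a NOP field.

    Returns (instruction, nop_field) pairs. Cycles beyond max_field remain as
    an explicit multiple-cycle NOP.
    """
    folded = []
    i = 0
    while i < len(code):
        instr = code[i]
        i += 1
        if instr.split()[0] not in DELAYED:
            folded.append((instr, 0))
            continue
        total = 0
        while i < len(code) and nop_count(code[i]) > 0:   # locate and delete the NOPs
            total += nop_count(code[i])
            i += 1
        field = min(total, max_field)
        folded.append((instr, field))                     # insert the NOP field
        if total > field:
            folded.append((f"NOP {total - field}", 0))
    return folded


code = ["LD *A0, A1", "NOP", "NOP", "NOP", "NOP",
        "ADD A1, A2, A3", "B loop", "NOP 5"]
print(fold_nops(code))
```

In this example eight instruction words shrink to three, with the serial NOPs and the multiple-cycle NOP both absorbed into fields of the preceding instructions.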
In still a further embodiment, the invention is an apparatus comprising a processor including a code containing at least one delayed effect instruction. At least one of the at least one delayed effect instruction includes a NOP field, thereby replacing NOPs.
Other objects, features, and advantages will be apparent to persons skilled in the art from the following detailed description.