The demand for high performance computers requires that state-of-the-art microprocessors execute instructions in the minimum amount of time. A number of different approaches have been taken to decrease instruction execution time, thereby increasing processor throughput. One way to increase processor throughput is to use a pipeline architecture in which the processor is divided into separate processing stages that form the pipeline. Instructions are broken down into elemental steps that are executed in different stages in an assembly line fashion.
A pipelined processor is capable of executing several different machine instructions concurrently. This is accomplished by breaking down the processing steps for each instruction into several discrete processing phases, each of which is executed by a separate pipeline stage. Hence, each instruction must pass sequentially through each pipeline stage in order to complete its execution. In general, a given instruction is processed by only one pipeline stage at a time, with one clock cycle being required for each stage. Since instructions use the pipeline stages in the same order and typically only stay in each stage for a single clock cycle, an N stage pipeline is capable of simultaneously processing N instructions. When filled with instructions, a processor with N pipeline stages completes one instruction each clock cycle.
The execution rate of an N-stage pipeline processor is theoretically N times faster than that of an equivalent non-pipelined processor. A non-pipelined processor is a processor that completes execution of one instruction before proceeding to the next instruction. In practice, pipeline overheads and other factors somewhat reduce the execution rate advantage that a pipelined processor has over a non-pipelined processor.
An exemplary seven stage processor pipeline may consist of an address generation stage, an instruction fetch stage, a decode stage, a read stage, a pair of execution (E1 and E2) stages, and a write (or write-back) stage. In addition, the processor may have an instruction cache that stores program instructions for execution, a data cache that temporarily stores data operands that otherwise are stored in processor memory, and a register file that also temporarily stores data operands.
The address generation stage generates the address of the next instruction to be fetched from the instruction cache. The instruction fetch stage fetches an instruction for execution from the instruction cache and stores the fetched instruction in an instruction buffer. The decode stage takes the instruction from the instruction buffer and decodes the instruction into a set of signals that can be used directly by the subsequent pipeline stages. The read stage fetches required operands from the data cache or from registers in the register file. The E1 and E2 stages perform the actual program operation (e.g., add, multiply, divide, and the like) on the operands fetched by the read stage and generate the result. The write stage then writes the result generated by the E1 and E2 stages back into the data cache or the register file.
Assuming that each pipeline stage completes its operation in one clock cycle, the exemplary seven stage processor pipeline takes seven clock cycles to process one instruction. As previously described, once the pipeline is full, an instruction can theoretically be completed every clock cycle.
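The timing described above can be illustrated with a minimal sketch, assuming an ideal pipeline in which every stage takes exactly one clock cycle and no stalls occur. The function names are hypothetical and are used here for illustration only.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles to finish an instruction stream on an ideal, stall-free pipeline."""
    if num_instructions < 1 or num_stages < 1:
        raise ValueError("both arguments must be at least 1")
    # The first instruction needs num_stages cycles to pass through the pipeline;
    # each later instruction completes one cycle after its predecessor.
    return num_stages + (num_instructions - 1)


def non_pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction completes before the next one begins."""
    if num_instructions < 1 or num_stages < 1:
        raise ValueError("both arguments must be at least 1")
    return num_instructions * num_stages


def speedup(num_instructions: int, num_stages: int) -> float:
    """Ratio of non-pipelined to pipelined execution time (idealized)."""
    return (non_pipelined_cycles(num_instructions, num_stages)
            / pipelined_cycles(num_instructions, num_stages))
```

For the exemplary seven stage pipeline, a single instruction takes seven cycles under these assumptions, and the speedup approaches, but never reaches, seven as the instruction stream grows longer.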
The throughput of a processor also is affected by the size of the instruction set executed by the processor and the resulting complexity of the instruction decoder. Large instruction sets require large, complex decoders in order to maintain a high processor throughput. However, large complex decoders tend to increase power dissipation, die size and the cost of the processor. The throughput of a processor also may be affected by other factors, such as exception handling, data and instruction cache sizes, multiple parallel instruction pipelines, and the like. All of these factors increase or at least maintain processor throughput by means of complex and/or redundant circuitry that simultaneously increases power dissipation, die size and cost.
In many processor applications, the increased cost, increased power dissipation, and increased die size are tolerable, such as in personal computers and network servers that use x86-based processors. These types of processors include, for example, INTEL PENTIUM™ processors and AMD ATHLON™ processors.
However, in many applications it is essential to minimize the size, cost, and power requirements of a data processor. This has led to the development of processors that are optimized to meet particular size, cost and/or power limits. For example, the recently developed TRANSMETA CRUSOE™ processor greatly reduces the amount of power consumed by the processor when executing most x86 based programs. This is particularly useful in laptop computer applications. Other types of data processors may be optimized for use in consumer appliances (e.g., televisions, video players, radios, digital music players, and the like) and office equipment (e.g., printers, copiers, fax machines, telephone systems, and other peripheral devices). The general design objectives for data processors used in consumer appliances and office equipment are the minimization of cost and complexity of the data processor.
The presence of multiple functional units in processor pipelines allows multiple operations to be explicitly encoded (or compressed) in a single instruction. These types of processors are generically referred to as wide-issue processors or very long instruction word (VLIW) processors. VLIW processors implement an instruction alignment and dispersal block that is responsible for aligning fetched instructions with the correct execution pipeline and dispersing the aligned instructions for execution.
Some VLIW processors, such as the PHILIPS TRIMEDIA™ processor and the MULTIFLOW TRACE™ processor, fetch compressed instructions from memory and load decompressed instructions into the instruction cache. The advantage of this technique is that decompression does not occur on the critical execution path. The disadvantage is that it wastes instruction cache space. Other VLIW processors, such as TEXAS INSTRUMENTS C6XXX processors and IA-64 processors, store compressed instructions in the instruction cache and perform instruction dispersal and alignment on instruction fetch. In these types of VLIW processors, the decompression hardware is on the critical instruction path and hence temporal efficiency is important. In either case, there are two issues that a compression scheme must address: 1) identifying the boundaries between instructions and 2) compressing the operations within instructions.
Because VLIW instructions may consist of multiple machine words encoding multiple operations, the following terminology is used to clarify different types of instruction-related data quantities. A bundle is a group of operations to be executed simultaneously. A bundle is encoded as a sequence of syllables. A syllable is a fixed-size (e.g., 16 bits, 32 bits, and the like) unit of addressing. Syllable size is specific to each type of processor implementation.
Instruction alignment and dispersal units in prior art processors have taken one of three general approaches in defining how syllables comprising a bundle are decompressed. One approach involves the use of a separate syllable template in the encoded instruction. The MULTIFLOW TRACE™ processor, the earliest commercial VLIW processor, used a bundle format that consisted of one or more mask syllables followed by syllables that encoded zero to N operations. The mask syllables described the layout of the remainder of the bundle. Specifically, the mask syllables defined the functional units required by, and hence the dispersal of, the operations specified by the remaining syllables in the bundle. On instruction fill, a bundle was fully expanded so that the resulting instruction in the cache consisted of one syllable per functional unit in the machine, with a particular syllable encoding denoting a null operation. The instruction fetch stage then consisted of simply reading the contents of a decompressed instruction cache entry. The bundle length was implicit in the mask encoding.
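The mask-based expansion described above can be sketched as follows. This is an illustration under stated assumptions, not the actual MULTIFLOW TRACE™ encoding: it assumes a single mask with one bit per functional unit, and the `NOP` value and the function name are hypothetical.

```python
NOP = 0  # hypothetical syllable encoding chosen here to denote a null operation


def expand_bundle(mask: int, syllables: list, num_units: int) -> list:
    """Expand a compressed bundle into one syllable per functional unit.

    Assumption: bit i of mask is set when functional unit i receives the
    next operation syllable; cleared bits are filled with NOP.
    """
    if mask >> num_units:
        raise ValueError("mask names a functional unit that does not exist")
    if bin(mask).count("1") != len(syllables):
        # The bundle length is implied by the mask: one syllable per set bit.
        raise ValueError("number of syllables must match the bits set in mask")
    remaining = iter(syllables)
    expanded = []
    for unit in range(num_units):
        if mask & (1 << unit):
            expanded.append(next(remaining))
        else:
            expanded.append(NOP)
    return expanded
```

With four functional units, a mask of `0b0101` and two operation syllables expands to a four-syllable cache entry in which units 1 and 3 hold null operations.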
The second approach to defining how syllables comprising a bundle are decompressed involves the use of dispersal bits within each operation. TI C6XXX processors use bits within each operation syllable to determine how an instruction bundle should be dispersed and a bit in each syllable to be used as an end-of-bundle marker. In order to expand a bundle, it is necessary to examine each syllable in order to determine how to route the syllable to a functional unit.
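A minimal sketch of this second approach follows. The bit layout is an assumption made for illustration and is not the actual TI C6XXX encoding: bit 0 is taken as the end-of-bundle marker and bits 1 to 2 as the dispersal field that selects one of four functional units. The helper `encode` and all constant names are hypothetical.

```python
END_OF_BUNDLE_BIT = 0x1  # assumed: bit 0 marks the last syllable of a bundle
UNIT_SHIFT = 1           # assumed: bits 1-2 select one of four functional units
UNIT_MASK = 0x3


def encode(opcode: int, unit: int, last: bool) -> int:
    """Build a syllable in the assumed layout (illustration only)."""
    return (opcode << 3) | ((unit & UNIT_MASK) << UNIT_SHIFT) | int(last)


def split_and_disperse(syllables):
    """Return a list of bundles, each mapping functional unit -> syllable.

    Every syllable must be examined: its dispersal bits give the target
    unit and its end-of-bundle bit closes the current bundle.
    """
    bundles = []
    current = {}
    for syllable in syllables:
        unit = (syllable >> UNIT_SHIFT) & UNIT_MASK
        if unit in current:
            raise ValueError("two operations in one bundle target the same unit")
        current[unit] = syllable
        if syllable & END_OF_BUNDLE_BIT:
            bundles.append(current)
            current = {}
    if current:
        raise ValueError("syllable stream ended in the middle of a bundle")
    return bundles
```

Because the boundary and routing information is spread across the syllables, the expansion logic cannot determine the bundle length without inspecting each syllable in turn.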
The third approach to defining how syllables comprising a bundle are decompressed involves the use of template bits within a block of memory to indicate how words within that block should be combined to form instruction bundles. IA-64 processors use dedicated template bits within each memory block to describe the dispersal of the syllables encoded in the memory block. This template describes both bundle boundaries and the dispersal of operations.
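The template approach can be sketched in the same spirit. The template table below is hypothetical and does not reproduce the IA-64 template encodings; it assumes a memory block of three syllables and assumes, for simplicity, that every template ends a bundle at the block boundary.

```python
# Hypothetical template table: template id -> one (functional unit,
# ends_bundle) pair for each syllable slot in the memory block.
TEMPLATES = {
    0: (("mem", False), ("alu", False), ("alu", True)),     # one bundle of three operations
    1: (("mem", True), ("alu", False), ("branch", True)),   # two bundles in one block
}


def decode_block(template_id: int, syllables):
    """Split a memory block into bundles of (functional unit, syllable) pairs."""
    layout = TEMPLATES[template_id]
    if len(syllables) != len(layout):
        raise ValueError("block must hold exactly one syllable per template slot")
    bundles = []
    current = []
    for (unit, ends_bundle), syllable in zip(layout, syllables):
        current.append((unit, syllable))
        if ends_bundle:
            bundles.append(current)
            current = []
    return bundles
```

In this sketch a single table lookup yields both the bundle boundaries and the dispersal, so the individual syllables do not need to be examined.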
A common implementation issue in VLIW processors is balancing the requirement of efficient instruction encoding with the requirement of efficient instruction issue hardware. Problems arise because the average static parallelism of a program can be significantly lower than the available issue slots. The most extreme solution requires that all instructions provide an operation for each available issue slot, but allows explicit null operations for those issue slots that cannot be effectively used by a program. While such a solution requires no special hardware in the instruction issue unit, it is extremely inefficient in instruction memory utilization.
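The memory cost of padding every instruction with explicit null operations can be estimated with a short sketch. The function name and the sample operation counts are hypothetical and are not measurements from any real program.

```python
def padded_utilization(ops_per_bundle, issue_slots: int) -> float:
    """Fraction of instruction memory that holds useful operations when
    every bundle is padded with null operations up to issue_slots."""
    if not ops_per_bundle:
        raise ValueError("at least one bundle is required")
    if any(n < 0 or n > issue_slots for n in ops_per_bundle):
        raise ValueError("each bundle must hold between 0 and issue_slots operations")
    useful = sum(ops_per_bundle)
    stored = issue_slots * len(ops_per_bundle)
    return useful / stored
```

For example, if four consecutive bundles on a four-issue machine contain 2, 1, 3, and 2 useful operations, `padded_utilization([2, 1, 3, 2], 4)` shows that only half of the stored syllables are useful, with the rest being null operations.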
Therefore, there is a need in the art for a pipelined data processor that provides an improved mechanism for encoding compressed instructions in memory and for decompressing the instructions on either instruction cache fill or instruction fetch. In particular, there is a need for a compression technique that balances the efficiency of the compression algorithm with the space and time complexity of the decompression hardware. More particularly, there is a need for an instruction decompression and dispersal scheme that achieves a good balance between memory utilization and hardware complexity.