This invention relates to methods and apparatus for compressing processor instructions, and more particularly to a method and apparatus for compressing the opcode portion of instructions.
In conventional efforts to optimize processing efficiency, data bandwidth has been addressed more often than instruction bandwidth. This emphasis seems justified, for example, by benchmark programs, which typically show much smaller instruction cache miss rates than data cache miss rates. Such results indicate that off-chip instruction bandwidth requirements are much smaller than data bandwidth requirements. However, for some commercial workloads, such as image processing, data cache miss rates are typically lower than instruction cache miss rates. Accordingly, there is an increasing need to optimize instruction bandwidth.
Two recent trends are increasing the instruction bandwidth, and correspondingly, the need for a larger instruction cache size. The first trend is that very long instruction word (VLIW) architectures are becoming popular in many high-performance processor architectures. A VLIW architecture executes a large number of operations per cycle by taking advantage of its wide instruction bits. This directly translates into significantly increased instruction bandwidth compared to superscalar architectures. For example, VLIW instruction widths of 256 bits (4 to 8 times wider than a typical reduced instruction set computer (RISC) instruction) are not uncommon.
The second trend is the use of deep execution pipelines that have become critical in increasing processor clock frequencies. Deep execution pipelines increase the chance of conflicts in read-after-write dependencies. The conflicts are resolved by inserting NOP instructions or by hardware detection techniques that stall the execution pipeline. In either case, valuable execution cycles are lost, which prevents the processor from achieving peak utilization. Software pipelining has become an important tool in eliminating these read-after-write conflicts in deep execution pipelines. Software pipelining works by unrolling a tight loop several times and overlapping multiple iterations of the tight loop to allow more room for the read-after-write dependencies to be resolved without incurring extra NOPs or processor stall cycles. This has the side effect of increasing the tight loop size, thus increasing instruction cache miss rates. Accordingly, there is a need for techniques which reduce or more effectively handle instruction bandwidth.
In the complex instruction set computer (CISC) architecture and reduced instruction set computer (RISC) architecture, there has been little need for instruction compression due to the effectiveness of an instruction cache. However, in U.S. Pat. No. 5,636,352 issued Jun. 3, 1997 for “Method and Apparatus for Utilizing Condensed Instructions”, Bealkowski et al. introduce an instruction compression technique. An instruction consists of an opcode (i.e., operation code), plus one or more data operands (e.g., source operand field and destination operand field). One or more control bits also are included in the instruction. Bealkowski et al. implement a table, referred to therein as a synonym table, which includes entries for frequently-used instructions. A sequence of instructions is compressed into a single instruction having a previously-undefined special opcode and respective indices into the synonym table (e.g., one per instruction of the sequence being compressed, up to a limit based on the number of bits permitted in the instruction).
A limitation of Bealkowski et al.'s compression technique is that the number of unique instructions in a typical program is quite large. Accordingly, Bealkowski et al. suggest a maximum index width of 12 bits and a synonym table with 4096 entries, each entry holding a 32-bit instruction. Such a table requires 16 kbytes of on-chip memory. This is an expensive solution as the size of such a table is comparable to a first-level instruction cache such as used in high performance processors. Bealkowski et al. suggest one embodiment in which the synonym table is stored in read-only memory, being predetermined at the time of microprocessor design. In another embodiment Bealkowski et al. suggest that the synonym table be loadable during processor initialization. As contemplated, however, the table is of static, unchanging composition. Accordingly, there is a need for a more effective solution for reducing instruction bandwidth.
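By way of illustration only, the 16 kbyte figure stated above follows directly from the index width and entry width given for the synonym table. The short calculation below merely restates that arithmetic; the variable names are chosen here for illustration.

```python
# Storage needed for the synonym table described above:
# a 12-bit index addresses 4096 entries, each holding one 32-bit instruction.
INDEX_BITS = 12
ENTRY_BITS = 32

entries = 2 ** INDEX_BITS                # number of table entries
table_bytes = entries * ENTRY_BITS // 8  # total size in bytes
table_kbytes = table_bytes // 1024       # total size in kbytes

print(entries, table_bytes, table_kbytes)  # 4096 16384 16
```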
According to the invention, instruction bandwidth is reduced by implementing an opcode compression technique. This is distinct from an instruction compression technique in which the entire instruction is compressed. An area of on-chip random access memory is allocated to store one or more tables of commonly-used opcodes. The normal opcode in the instruction is replaced with a code identifying the table and the index into the table. The code includes fewer bits than the uncompressed opcode. As a result, the instruction is compressed.
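By way of illustration only, the replacement of an opcode by a shorter code identifying a table and an index into that table may be sketched as follows. The field widths, the number of tables, the table contents and the function names `compress_opcode` and `expand_code` are assumptions made for this sketch and are not taken from the description.

```python
# Hypothetical widths: none of these numbers come from the description.
OPCODE_BITS = 12      # width of an uncompressed opcode (assumed)
TABLE_ID_BITS = 1     # selects one of two on-chip tables (assumed)
INDEX_BITS = 4        # up to 16 entries per table (assumed)
CODE_BITS = TABLE_ID_BITS + INDEX_BITS  # width of the replacement code

# On-chip tables of commonly-used opcodes (values are made up).
opcode_tables = [
    [0x101, 0x102, 0x2A0, 0x3F1],  # table 0
    [0x410, 0x411, 0x7C3],         # table 1
]

def compress_opcode(opcode):
    """Return the short code for an opcode, or None if no table holds it."""
    for table_id, table in enumerate(opcode_tables):
        if opcode in table:
            return (table_id << INDEX_BITS) | table.index(opcode)
    return None

def expand_code(code):
    """Recover the original opcode from a short code."""
    table_id = code >> INDEX_BITS
    index = code & ((1 << INDEX_BITS) - 1)
    return opcode_tables[table_id][index]

code = compress_opcode(0x411)   # table 1, index 1
restored = expand_code(code)    # 0x411 again
```

In this sketch the 5-bit code stands in for a 12-bit opcode; an opcode absent from every table is simply left uncompressed.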
Although the technique may be implemented for a variety of processor architectures, it is particularly advantageous for VLIW instructions, which include multiple opcodes (i.e., one for each subinstruction). In one embodiment a bit among the special code bits of the instruction is allocated to designate whether the VLIW instruction is compressed or not compressed. For example, in some embodiments opcode compression for a VLIW instruction is all or nothing: either all subinstruction opcodes are compressed or none are. Because adequate methods exist for compressing NOP instruction opcodes, alternative, conventional methods may be used to identify NOP subinstructions among the compressed instruction format of embodiments of this invention.
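By way of illustration only, the all-or-nothing treatment of a VLIW instruction with a single compressed/uncompressed bit may be sketched as follows. The function names `encode_vliw` and `decode_vliw`, the single-table arrangement and the opcode values are assumptions of this sketch; NOP handling is omitted.

```python
def encode_vliw(sub_opcodes, table):
    """Return (flag, fields): flag is 1 only if every subinstruction
    opcode is in the table, in which case fields holds table indices."""
    if all(op in table for op in sub_opcodes):
        return 1, [table.index(op) for op in sub_opcodes]
    return 0, list(sub_opcodes)  # leave the whole instruction uncompressed

def decode_vliw(flag, fields, table):
    """Recover the subinstruction opcodes from the flag and fields."""
    return [table[f] for f in fields] if flag else list(fields)

table = [0x101, 0x102, 0x2A0, 0x3F1]  # made-up opcode values

all_known = encode_vliw([0x2A0, 0x101], table)  # (1, [2, 0])
one_miss = encode_vliw([0x2A0, 0x555], table)   # (0, [0x2A0, 0x555])
```

A single opcode missing from the table causes the entire instruction to remain uncompressed, consistent with the all-or-nothing embodiment.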
According to one aspect of this invention, the table of commonly-used opcodes is dynamically updated, overwritten and replaced during real-time processing. A table can be stored during execution of an application program. An advantage of dynamic updating is that a smaller table size can effectively reduce instruction bandwidth. In some embodiments the table need not be dynamic and may be fixed. To store all the most frequently used opcodes for a broad range of application programs, such a table will be larger than a dynamically updated table. For the preferred dynamic implementation the table is customized to the application and becomes part of the program design. For example, different tasks are programmed with a respective table of opcodes to be stored in the opcode table. The respective tables then are loaded in when task switching. A smaller, dynamic opcode table provides the advantage of an effective selection of opcodes and a low overhead for table loading during task switching. Further, when space is allocated on the processor chip to store multiple tables, the table loading overhead is further reduced as one table is made active and another inactive.
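By way of illustration only, the loading of a task-specific opcode table upon task switching, and the reduced loading overhead when several tables reside on chip, may be sketched as follows. The class name `OpcodeTableBank`, the round-robin replacement of resident tables, and the task names are assumptions of this sketch.

```python
class OpcodeTableBank:
    """Hypothetical on-chip storage holding several opcode tables,
    exactly one of which is active at a time."""

    def __init__(self, slots):
        self.slots = [None] * slots   # resident tables
        self.owner = [None] * slots   # task owning each resident table
        self.active = None            # index of the active slot
        self.loads = 0                # number of table loads performed
        self._next_victim = 0         # round-robin replacement (assumed)

    def switch_to(self, task, table):
        if task in self.owner:
            # Table already resident: make it active without reloading.
            self.active = self.owner.index(task)
            return
        slot = self._next_victim
        self._next_victim = (slot + 1) % len(self.slots)
        self.slots[slot] = list(table)  # load the task's table
        self.owner[slot] = task
        self.loads += 1
        self.active = slot

    def lookup(self, index):
        return self.slots[self.active][index]

bank = OpcodeTableBank(slots=2)
bank.switch_to("task_a", [0x101, 0x102])
bank.switch_to("task_b", [0x410, 0x411])
bank.switch_to("task_a", [0x101, 0x102])  # resident, so no reload occurs
```

With two resident tables, the third switch in this sketch only changes which table is active, so only two loads are performed.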
In some embodiments, one or more specific entries in a given opcode table are updated. A specific instruction is included in which a table index identifies the opcode table entry to be overwritten with an updated value. Further, a CISC-like instruction is included in some embodiments to transfer data from memory into the opcode table faster and to store the table more compactly.
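By way of illustration only, the two kinds of table update described above may be sketched as follows: an instruction overwriting a single entry selected by a table index, and a CISC-like instruction copying a block of entries from memory. The function names `update_entry` and `load_table`, and the modelling of memory as a list of opcode values, are assumptions of this sketch.

```python
def update_entry(table, index, opcode):
    """Model of an instruction that overwrites one opcode table entry."""
    table[index] = opcode

def load_table(table, memory, base, count):
    """Model of a CISC-like instruction that copies `count` consecutive
    entries from memory, starting at `base`, into the opcode table."""
    for i in range(count):
        table[i] = memory[base + i]

opcode_table = [0] * 4
memory = [0x000, 0x101, 0x102, 0x2A0, 0x3F1, 0x410]  # made-up contents

load_table(opcode_table, memory, base=1, count=4)  # bulk load of 4 entries
update_entry(opcode_table, 2, 0x7C3)               # replace a single entry
```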
In some embodiments the opcode table is preloaded from non-volatile memory early in a function call. Further, a pointer to the prior table is maintained so that after the function is complete and processing returns to the calling routine, the opcode table for the calling routine is restored.
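By way of illustration only, maintaining a pointer to the prior table across a function call may be sketched as follows. The class name `OpcodeTableContext` and the use of a stack of saved references to support nested calls are assumptions of this sketch.

```python
class OpcodeTableContext:
    """Hypothetical tracking of the opcode table in effect for each routine."""

    def __init__(self, table):
        self.current = table
        self._saved = []  # references to prior tables, innermost last (assumed)

    def enter_function(self, table):
        # Keep a reference to the caller's table, then install the callee's.
        self._saved.append(self.current)
        self.current = table

    def leave_function(self):
        # Restore the calling routine's table on return.
        self.current = self._saved.pop()

caller_table = [0x101, 0x102]
callee_table = [0x410, 0x411]

ctx = OpcodeTableContext(caller_table)
ctx.enter_function(callee_table)
during_call = ctx.current   # the callee's table
ctx.leave_function()
after_return = ctx.current  # the caller's table again
```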
These and other aspects and advantages of the invention will be better understood by reference to the following detailed description taken in conjunction with the accompanying drawings.