1. Technical Field
The present invention relates in general to a method and system for data processing, and in particular, to a method and system for processing instructions within a data processing system. Still more particularly, the present invention relates to a method and system for recoding noneffective instructions within a data processing system.
2. Description of the Related Art
In order to satisfy consumer demand for high performance data processing systems, processor designers have developed several architectural improvements to enhance processor performance, including the use of superscalar architecture and pipelined execution units. With reference now to FIG. 1, there is illustrated a typical instruction processing subsystem within a superscalar pipelined processor. As depicted, instruction processing subsystem 100 includes bus interface unit 102, instruction cache 106, instruction queue 108, dispatcher 110, and a number of execution units 114. Instruction processing subsystem 100 typically includes, for example, an integer processing unit, a floating point unit, and a load/store unit among execution units 114.
As will be understood by those skilled in the art, instruction cache 106 is a small block of expensive, high speed memory that stores a subset of the instructions which may be accessed by the processor. In general, instructions are stored within instruction cache 106 in association with an address tag, i.e., a portion of the absolute address at which the instructions are stored at a lower level of memory. When the processor requests an instruction, the processor first searches instruction cache 106 to determine if the requested instruction is resident within instruction cache 106. Those skilled in the art will appreciate that if a requested instruction is not resident within instruction cache 106, a cache miss occurs and the instruction request is forwarded to a lower level of memory via bus interface unit 102 and address bus 120. In response to the instruction request, the lower level memory (e.g., a level two (L2) cache) which stores the requested instruction transmits the memory segment (e.g., cache line) containing the requested instruction to bus interface unit 102 via data bus 118. The instructions within the returned memory segment are then stored within instruction cache 106 in place of other instructions, which are selected according to a least recently used (LRU) replacement algorithm, for example.
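The lookup, miss, and replacement sequence described above can be sketched in software. The following is a minimal illustration only, assuming a fully associative cache, a 32-byte line, and a callable standing in for the lower level memory; the names InstructionCache, LINE_BYTES, and lower_level are hypothetical and do not appear in the figures.

```python
from collections import OrderedDict

LINE_BYTES = 32  # assumed cache-line size; not specified in the text


class InstructionCache:
    """Toy fully associative instruction cache with LRU replacement."""

    def __init__(self, num_lines, lower_level):
        self.num_lines = num_lines
        self.lower_level = lower_level  # hypothetical callable: tag -> line
        self.lines = OrderedDict()      # tag -> line, least recently used first
        self.misses = 0

    def fetch_line(self, address):
        # The address tag is the portion of the address that names the line.
        tag = address // LINE_BYTES
        if tag in self.lines:
            self.lines.move_to_end(tag)  # hit: mark as most recently used
            return self.lines[tag]
        # Miss: forward the request to the lower level of memory.
        self.misses += 1
        line = self.lower_level(tag)
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = line
        return line
```

A real instruction cache is normally set associative and implemented in hardware; the ordered dictionary here only models the order in which lines would be replaced.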
When the processor requests an instruction stored within instruction cache 106, the requested instruction is loaded into instruction queue 108, which sequentially stores several instructions that will be executed within the processor. As the processor executes instructions, the oldest instruction within instruction queue 108 is loaded by dispatcher 110, which includes one decode logic unit 112 for each instruction within the dispatch bandwidth of dispatcher 110. For example, if dispatcher 110 dispatches three instructions each cycle, dispatcher 110 includes three decode logic units 112. Those skilled in the art will appreciate that in order to enhance performance, the instruction dispatch bandwidth of dispatcher 110 is preferably the same as, or approximately the same as, the number of execution units 114.
Typically, each decode logic unit 112 comprises a multi-level logic circuit which partially decodes instructions by comparing bits within the instructions' operation codes (opcodes) with bit patterns corresponding to valid instructions within the processor's instruction set. If the decode of an instruction indicates that the instruction is illegal (i.e., the instruction has an invalid opcode), dispatcher 110 forwards the illegal instruction to a completion buffer (not illustrated) without attempting to execute the illegal instruction. Legal instructions, on the other hand, are dispatched to an execution unit 114 corresponding to the instruction type determined by the decode operation.
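The decode and dispatch behavior described above may be easier to follow as a short sketch. It assumes a hypothetical 32-bit instruction format with a 6-bit opcode in the most significant bits, and the three opcode values are invented for illustration; none of these encodings come from the text or from any real instruction set.

```python
OPCODE_SHIFT = 26   # assumed position of the opcode field
OPCODE_MASK = 0x3F  # assumed 6-bit opcode

# Hypothetical opcode values mapped to instruction types.
VALID_OPCODES = {
    0x01: "integer",
    0x02: "floating_point",
    0x03: "load_store",
}


def decode(instruction):
    """Return the instruction type, or None if the opcode is invalid."""
    opcode = (instruction >> OPCODE_SHIFT) & OPCODE_MASK
    return VALID_OPCODES.get(opcode)


def dispatch(instructions):
    """Route legal instructions by type; set illegal ones aside unexecuted."""
    execution_units = {unit: [] for unit in VALID_OPCODES.values()}
    completion_buffer = []
    for instruction in instructions:
        unit = decode(instruction)
        if unit is None:
            completion_buffer.append(instruction)  # illegal opcode
        else:
            execution_units[unit].append(instruction)
    return execution_units, completion_buffer
```

In hardware this comparison is a multi-level logic network replicated once per dispatch slot, which is why its gate delay and area matter; the dictionary lookup only models the outcome of the comparison.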
As will be understood by those skilled in the art, each execution unit 114 comprises a multiple stage execution pipeline, including, for example, fetch, decode, execution, and completion stages. By dividing the execution of each instruction into several discrete steps, each execution unit 114 is able to process an instruction at each of its multiple stages during each processor clock cycle. Typically, after legal instructions that precede an illegal instruction in the instruction stream have completed execution within an execution unit 114, a selected exception (interrupt) handler is executed to process the illegal instruction.
Although processors utilizing an instruction processing subsystem like that depicted in FIG. 1 provide enhanced performance compared with conventional scalar processors, processor efficiency and performance remain less than optimal due to the processor's mechanism for detecting and handling illegal instructions. Because illegal instruction detection and handling are performed in a critical timing path, the processing delay generated by each logic gate within the multi-level logic network in each decode logic unit 112 slows overall processor performance. In addition, as the dispatch bandwidth of a processor increases, the amount of processor area allocated to illegal instruction detection logic within dispatcher 110 concomitantly increases, which in turn dramatically increases the cost of processor fabrication.
Referring now to FIG. 2, there is depicted an improved conventional instruction processing subsystem of a superscalar pipelined processor. As indicated by like reference numerals, instruction processing subsystem 130 includes bus interface unit 102, instruction cache 106, instruction queue 108, and a number of execution units 114, which operate like corresponding components within instruction processing subsystem 100 illustrated in FIG. 1. Instruction processing subsystem 130, however, also includes predecode logic 104 which predecodes the instructions within a memory segment returned from lower level memory prior to storage of the instructions within instruction cache 106. Typically, predecode logic 104 compares selected bits within the opcode portion of each instruction with a bit pattern corresponding to valid instruction types within the processor's instruction set. Utilizing the results of this comparison, predecode logic 104 sets flag bits 107, which specify an instruction type for each instruction. In addition, if the comparison indicates that an instruction has an invalid opcode, predecode logic 104 sets a bit within flag bits 107 to indicate that the instruction has an illegal opcode. As illustrated, flag bits 107 associated with each instruction are stored within instruction cache 106.
Continuing along the instruction processing path, dispatcher 111 contains one flag detection logic unit 113 for each instruction within the dispatch bandwidth of dispatcher 111. When an instruction is loaded by dispatcher 111, one of flag detection logic units 113 examines flag bits 107 associated with the instruction and assigns the instruction to an execution unit 114 corresponding to the instruction type indicated by flag bits 107. Partially decoding instructions prior to loading the instructions into dispatcher 111 reduces the complexity of the decode logic required within dispatcher 111 since only flag bits 107 must be analyzed to assign instructions to the appropriate execution unit 114. This decrease in the decode logic within dispatcher 111 decreases the logic gate delays within dispatcher 111, thereby enhancing overall processor performance.
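The division of work between predecode logic 104 and flag detection logic units 113 can be sketched as follows. This is an illustration under stated assumptions: the opcode field layout, the opcode values, and the three-bit flag encoding are all hypothetical, and routing an instruction flagged as illegal to a completion buffer is assumed by analogy with the subsystem of FIG. 1.

```python
OPCODE_SHIFT = 26   # assumed position of the opcode field
OPCODE_MASK = 0x3F  # assumed 6-bit opcode
VALID_OPCODES = {0x01: "integer", 0x02: "floating_point", 0x03: "load_store"}

# Hypothetical flag encoding: two type bits plus one illegal-opcode bit.
FLAG_ILLEGAL = 0b100
TYPE_FLAGS = {"integer": 0b000, "floating_point": 0b001, "load_store": 0b010}
UNIT_FOR_FLAGS = {flags: unit for unit, flags in TYPE_FLAGS.items()}


def predecode(line):
    """Run once per cache fill: pair each instruction with its flag bits."""
    tagged = []
    for instruction in line:
        opcode = (instruction >> OPCODE_SHIFT) & OPCODE_MASK
        unit = VALID_OPCODES.get(opcode)
        flags = FLAG_ILLEGAL if unit is None else TYPE_FLAGS[unit]
        tagged.append((instruction, flags))
    return tagged


def dispatch(tagged_instruction):
    """Dispatch-time step: inspect only the flag bits, never the opcode."""
    _, flags = tagged_instruction
    if flags & FLAG_ILLEGAL:
        return "completion_buffer"  # assumed handling of illegal instructions
    return UNIT_FOR_FLAGS[flags]
```

The opcode comparison happens once, off the critical path, when a line enters the cache; the dispatcher then performs only a small lookup on the stored flags each time the instruction is dispatched.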
Although instruction processing subsystem 130 enjoys enhanced performance versus instruction processing subsystem 100 of FIG. 1 due to the decrease in logic gate delays in a critical timing path, the improvement in processor performance entails an increase in processor cost since storing flag bits within the instruction cache reduces the number of instructions which may be stored within a given cache size. In order to avoid an increase in the instruction cache miss ratio (and consequent performance degradation) due to the reduction in resident instructions, the instruction cache size must be increased to compensate for the storage consumed by the flag bits. However, those skilled in the art will appreciate that substantially increasing the size of the instruction cache may render the processor prohibitively expensive to consumers.
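The storage cost of the flag bits can be estimated with simple arithmetic. The sketch below assumes 32-bit instructions, 4 flag bits per instruction, and an 8 KiB cache, and it ignores address tag storage; these figures are illustrative assumptions and are not taken from the text.

```python
def instructions_per_cache(cache_bits, instruction_bits=32, flag_bits=0):
    """Number of instructions that fit when each one carries flag bits."""
    return cache_bits // (instruction_bits + flag_bits)


def cache_bits_needed(num_instructions, instruction_bits=32, flag_bits=0):
    """Storage required to hold the given number of flagged instructions."""
    return num_instructions * (instruction_bits + flag_bits)


CACHE_BITS = 8 * 1024 * 8  # assumed 8 KiB instruction cache

without_flags = instructions_per_cache(CACHE_BITS)             # 2048
with_flags = instructions_per_cache(CACHE_BITS, flag_bits=4)   # 1820
grown_cache = cache_bits_needed(without_flags, flag_bits=4)    # 73728 bits
```

Under these assumptions, the flag bits displace 228 of 2048 instructions, and holding the original 2048 instructions requires 73,728 bits (9 KiB), a 12.5 percent increase in cache storage.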
A source of inefficiency within instruction processing subsystems that is unaddressed by either of the prior art systems depicted in FIGS. 1 and 2 is the utilization of processor cycle time to execute instructions which do not alter the state of the processor. In other words, instruction processing subsystems consume processor cycle time processing instructions that do not change the value of any architected register within the processor. Instructions that do not affect the state of the processor include, for example, instructions that add 0 to or subtract 0 from data within a register and instructions that multiply or divide a register value by 1. Although these instructions do not change the state of any architected register, they degrade the performance of the processor since the execution of these instructions may require several processor cycles. Therefore, it would be desirable to detect instructions which do not affect the state of the processor and remove these noneffective instructions from the instruction stream or replace them with no operation (no-op) instructions, which typically execute within a single processor cycle.
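The class of noneffective instructions named above, and their replacement with no-ops, can be illustrated with a short sketch. It is not the method of the invention; it assumes a hypothetical register-immediate instruction form (op, dest, src, imm) with no side effects such as condition flag updates, and the names Instruction, NOOP, and recode are invented for illustration.

```python
from typing import NamedTuple


class Instruction(NamedTuple):
    """Hypothetical register-immediate form: dest = src <op> imm."""
    op: str
    dest: str
    src: str
    imm: int


NOOP = Instruction("nop", "", "", 0)

# Immediate value that leaves the source operand unchanged for each operation.
IDENTITY_IMMEDIATE = {"add": 0, "sub": 0, "mul": 1, "div": 1}


def is_noneffective(instruction):
    """True if the instruction writes a register's own value back to it."""
    return (
        instruction.op in IDENTITY_IMMEDIATE
        and instruction.dest == instruction.src
        and instruction.imm == IDENTITY_IMMEDIATE[instruction.op]
    )


def recode(stream):
    """Replace each noneffective instruction with a no-op."""
    return [NOOP if is_noneffective(i) else i for i in stream]
```

The check that the destination equals the source matters: adding 0 to one register and writing the result to a different register copies a value and therefore does change processor state.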
Consequently, it would be desirable to provide an efficient method and system within a superscalar pipelined processor for detecting noneffective instructions with illegal opcodes and noneffective instructions which do not change the state of architected registers within the processor. Furthermore, it would be desirable for such a method and system to detect noneffective instructions without increasing the size, and therefore the cost, of the processor.