1. Field of the Invention
This invention relates to the field of microprocessors and, more particularly, to loop control mechanisms within superscalar microprocessors.
2. Description of the Relevant Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
Microprocessor designers often design their products in accordance with the x86 microprocessor architecture in order to take advantage of its widespread acceptance in the computer industry. Because the x86 microprocessor architecture is pervasive, many computer programs are written in accordance with the architecture. x86-compatible microprocessors may execute these computer programs, thereby becoming more attractive to computer system designers who desire x86-capable computer systems. Such computer systems are often well received within the industry due to the wide range of available computer programs.
Certain instructions within the x86 instruction set are quite complex, specifying multiple operations to be performed. For example, the PUSHA instruction specifies that each of the x86 registers be pushed onto a stack defined by the value in the ESP register. The corresponding operations are a store operation for each register, and decrements of the ESP register between each store operation to generate the address for the next store operation. Often, complex instructions are classified as MROM instructions. MROM instructions are transmitted to a microcode instruction unit, or MROM unit, within the microprocessor, which decodes the complex MROM instruction and dispatches two or more simpler fast-path instructions for execution by the microprocessor. The simpler fast-path instructions corresponding to the MROM instruction are typically stored in a read-only memory (ROM) within the microcode instruction unit. The microcode instruction unit determines an address within the ROM at which the simpler fast-path instructions are stored, and transfers the fast-path instructions out of the ROM beginning at that address. Multiple clock cycles may be used to transfer the entire set of fast-path instructions corresponding to the MROM instruction. The entire set of fast-path instructions that effect the function of an MROM instruction is called a microcode sequence. Each MROM instruction may correspond to a particular number of fast-path instructions dissimilar from the number of fast-path instructions corresponding to other MROM instructions. Additionally, the number of fast-path instructions corresponding to a particular MROM instruction may vary according to the addressing mode of the instruction, the operand values, and/or the options included with the instruction. The microcode unit issues the fast-path instructions into the instruction processing pipeline of the microprocessor. The fast-path instructions are thereafter executed in a similar fashion to other instructions. 
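By way of illustration only, the decomposition of PUSHA into simpler operations may be sketched as follows. The operation names (`DEC_ESP`, `STORE`), the tuple layout, and the function `expand_pusha` are hypothetical and do not reflect the contents of any actual microcode ROM. The sketch assumes one ESP decrement ahead of each store and uses the label `ESP_original` for the value of ESP prior to the instruction, which is the value the architecture pushes.

```python
# Hypothetical model: operation names and tuple layout are illustrative
# assumptions, not the encoding of any actual microcode ROM.

# Architectural push order of PUSHA with 32-bit operands. "ESP_original"
# stands for the value ESP held before the instruction began.
PUSHA_ORDER = ["EAX", "ECX", "EDX", "EBX", "ESP_original", "EBP", "ESI", "EDI"]


def expand_pusha(operand_bytes=4):
    """Return a list of simple operations that together implement PUSHA."""
    sequence = []
    for reg in PUSHA_ORDER:
        # Decrement ESP to form the address for the next store.
        sequence.append(("DEC_ESP", operand_bytes))
        # Store the register at the new top of the stack.
        sequence.append(("STORE", reg))
    return sequence


if __name__ == "__main__":
    for op in expand_pusha():
        print(op)
```

Under these assumptions a single PUSHA expands into sixteen simple operations, which shows why such a sequence may take multiple clock cycles to transfer out of the ROM.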
It is noted that the fast-path instructions may be instructions defined within the instruction set, or may be custom instructions defined for the particular microprocessor.
Conversely, less complex instructions are decoded by hardware decode units within the microprocessor, without intervention by the microcode unit. The terms "directly-decoded instruction" and "fast-path instruction" will be used herein to refer to instructions which are decoded and executed by the microprocessor without the aid of a microcode unit. As opposed to MROM instructions, which are reduced to simpler instructions that may be handled by the microprocessor, fast-path instructions are decoded and executed via hardware decode and functional units included within the microprocessor.
Fast-path instructions that implement an MROM instruction may include branch instructions. For example, a string instruction may include a loop of instructions. A microcode loop is one or more instructions that are repetitively executed a specific number of times. The specific number of iterations is called a loop count or string count. A microcode loop typically includes a branch instruction and a decrement instruction. With each iteration of the loop, the string count is decremented and a branch instruction tests the string count for a termination condition. If the termination condition is false, the branch instruction branches to the top of the loop and the instructions of the microcode loop are executed again. If the termination condition is true, the loop is exited. Termination conditions include the string count equaling zero and a flag being asserted or unasserted.
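The loop control described above may be modeled, purely as a sketch, in the following manner. The function `run_microcode_loop` and its parameters are hypothetical names introduced for illustration. The model assumes an initial count of at least one, decrements the count once per iteration, and exits once a termination condition holds: the count reaching zero or, optionally, a flag returned by the loop body matching a chosen value.

```python
def run_microcode_loop(count, body, stop_on_flag=None):
    """Model a decrement-and-branch microcode loop.

    count        -- initial string count (assumed to be >= 1)
    body         -- callable taking the iteration index, returning a flag value
    stop_on_flag -- if not None, the loop also ends when the flag equals it
    Returns (remaining_count, iterations_executed).
    """
    if count < 1:
        raise ValueError("this sketch assumes an initial count of at least 1")
    iterations = 0
    while True:
        flag = body(iterations)  # instructions performing the string operation
        count -= 1               # decrement instruction
        iterations += 1
        # Branch instruction: evaluate the termination conditions.
        terminated = count == 0 or (
            stop_on_flag is not None and flag == stop_on_flag
        )
        if terminated:
            break                # fall out of the loop
        # Otherwise branch back to the top of the loop.
    return count, iterations


if __name__ == "__main__":
    print(run_microcode_loop(3, lambda i: False))
    print(run_microcode_loop(5, lambda i: i == 1, stop_on_flag=True))
```

The second call illustrates a flag-based exit of the kind a compare-style string operation might use: the loop ends early with a nonzero remaining count.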
The execution of string instructions is performance critical, and maintaining high throughput of string instructions may be essential to high performance of the microprocessor. String instructions are designed to work on a series of data. The string count or count value determines the number of iterations of the string instruction to perform. When a string operation is performed on a group of data, it is referred to as a "repeated" string instruction. Examples of string instructions in the x86 architecture are MOVS (move string) and CMPS (compare string). The MOVS instruction loads data from a memory location specified by index register ESI, increments/decrements ESI, stores the loaded data to a memory location specified by index register EDI, and increments/decrements EDI. Register ECX stores the number of iterations to repeat the string instruction. Accordingly, on each iteration of MOVS, register ECX is decremented and a termination condition is tested. A direction flag (DF) indicates whether the index registers (ESI and EDI) are incremented or decremented. By incrementing/decrementing the index registers, the string instruction operates on a series of sequential data. For example, MOVS can move a block of data from one memory location to another memory location. The size of the block is determined by the string count stored in register ECX.
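The behavior of a repeated MOVS may be sketched in software as follows. This is a simplified functional model and not a description of any hardware: memory is represented by a `bytearray`, the function `rep_movs` and its parameter `size` (element size in bytes) are assumptions made for illustration, and the count is tested before each iteration so that a count of zero performs no moves.

```python
def rep_movs(memory, esi, edi, ecx, df=0, size=1):
    """Simplified model of a repeated MOVS.

    memory -- bytearray standing in for addressable memory
    esi    -- source index, edi -- destination index
    ecx    -- string count (number of elements to move)
    df     -- direction flag: 0 increments the indexes, 1 decrements them
    size   -- element size in bytes (illustrative parameter)
    Returns the final (esi, edi, ecx).
    """
    step = -size if df else size
    while ecx != 0:
        data = memory[esi:esi + size]    # load from the location given by ESI
        esi += step                      # increment/decrement ESI
        memory[edi:edi + size] = data    # store to the location given by EDI
        edi += step                      # increment/decrement EDI
        ecx -= 1                         # decrement the string count
    return esi, edi, ecx


if __name__ == "__main__":
    mem = bytearray(b"abcdWXYZ")
    print(rep_movs(mem, esi=0, edi=4, ecx=4), mem)
```

Each pass through the loop body corresponds to one iteration of the string instruction: one load with a source index update, one store with a destination index update, and one decrement of the count.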
It is desirable to design the execution of string instructions such that each iteration of a repeated string instruction may be executed in a single clock cycle. A superscalar microprocessor includes a fixed number of functional units for executing instructions that implement the functionality of the string instruction. The string count decrement instruction is executed by one issue position and the branch instruction is executed by a second issue position. The remaining issue position(s) are available for executing microcode instructions that perform the function of the string operation. Unfortunately, in a superscalar microprocessor with only three issue positions, only one issue position is available to perform the function of the string operation. Because most string operations require two or more instructions to perform the function of the string instruction, most string instructions are not able to be executed in one clock cycle with only three functional units. For example, MOVS requires one instruction to load a memory operand and increment/decrement the source index register and another instruction to store the memory operand and increment/decrement the destination index register. Additional instructions must occupy a subsequent line of microcode to be dispatched in a subsequent cycle. Accordingly, the number of clock cycles per loop iteration is increased, which decreases the performance of string operations and the overall throughput of the microprocessor. What is desired is an apparatus and method for executing iterations of string instructions in a single clock cycle.
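The issue-position constraint described above reduces to simple arithmetic, sketched below. The function `cycles_per_iteration` is a hypothetical helper. It assumes that one line of microcode is dispatched per clock cycle, that each line holds at most one instruction per issue position, that the decrement and branch together count as two loop-control instructions, and that no other stalls occur.

```python
import math


def cycles_per_iteration(body_ops, issue_positions=3, control_ops=2):
    """Estimate the microcode lines (clock cycles) needed per loop iteration.

    body_ops        -- instructions performing the string operation itself
    issue_positions -- instructions that can be dispatched per clock cycle
    control_ops     -- loop-control instructions (decrement and branch)
    """
    if issue_positions < 1:
        raise ValueError("at least one issue position is required")
    total_ops = body_ops + control_ops
    # Instructions that do not fit in one line spill into a subsequent line.
    return math.ceil(total_ops / issue_positions)


if __name__ == "__main__":
    # MOVS: one load/update instruction plus one store/update instruction.
    print(cycles_per_iteration(body_ops=2, issue_positions=3))
```

Under these assumptions, MOVS needs four instructions per iteration (two for the move, two for loop control), which exceeds three issue positions and therefore spills into a second line of microcode.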