1. Field of the Invention
This invention relates generally to processors, and more particularly, to instruction decode in a superscalar processor.
2. Description of the Related Art
Computers and many other types of machines are engineered around a "processor." A processor is an integrated circuit that executes programmed instructions on data stored in the machine's memory. There are many types of processors and there are several ways to categorize them. For instance, one may categorize processors by their intended application, such as microprocessors, digital signal processors ("DSPs"), or controllers. One may also categorize processors by the complexity of their instruction sets, such as reduced instruction set computing ("RISC") processors and complex instruction set computing ("CISC") processors. The operational characteristics on which these categorizations are based define a processor and are collectively referred to as the processor's architecture. More particularly, an architecture is a specification defining the interface between the processor's hardware and the processor's software.
One aspect of a processor's architecture is whether it executes instructions sequentially or out of order. Historically, processors executed one instruction at a time in a sequence. A program written in a high-level language was compiled into object code consisting of many individual instructions for handling data. The instructions might tell the processor to load certain data from memory or store it to memory, to move data from one location to another, or to perform any one of a number of other data manipulations. The instructions would be fetched from memory, decoded, and executed in the sequence in which they were stored. This is known as the "sequential programming model." Out of order execution involves executing instructions in some order different from the order in which they are found in the program, i.e., out of order or non-sequentially.
A second aspect of a processor's architecture is whether it "pipelines" instructions. The processor fetches instructions from memory and feeds them into one end of the pipeline. The pipeline is made of several "stages," each stage performing some function necessary or desirable to process instructions before passing the instruction to the next stage. For instance, one stage might fetch an instruction, the next stage might decode the fetched instruction, and the next stage might execute the decoded instruction. Each stage of the pipeline typically moves the instruction closer to completion.
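By way of illustration only, the staged flow described above can be sketched in Python. The three stage names follow the example in the preceding paragraph; the function name pipeline_schedule and the assumption of an idealized pipeline (one instruction enters per clock cycle and nothing stalls) are hypothetical simplifications, not features of any particular processor.

```python
STAGES = ["fetch", "decode", "execute"]  # stage names from the example above


def pipeline_schedule(instructions, stages=STAGES):
    """Return, per clock cycle, which instruction occupies each stage.

    Assumes an idealized pipeline: one instruction enters per cycle and
    no stage ever stalls.
    """
    if not instructions:
        return []
    total_cycles = len(instructions) + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            index = cycle - depth  # instruction that entered `depth` cycles ago
            in_range = 0 <= index < len(instructions)
            row[stage] = instructions[index] if in_range else None
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(["i0", "i1", "i2"])):
        print(cycle, row)
```

In this sketch, by the third clock cycle all three stages are occupied at once, each by a different instruction.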
Some advanced processor pipelines process selected instructions "speculatively." Exemplary speculative execution techniques include, but are not limited to, advanced loads, branch prediction, and predicate prediction. Speculative execution means that instructions are fetched and executed before resolving pertinent control dependencies. Speculative execution requires predicting which instructions will be needed (for instance, depending on whether a branch is taken), executing the instructions fetched on the basis of that prediction, and then verifying both the execution and the prediction. The pipeline executes a series of instructions and, in the course of doing so, makes certain predictions about how control dependencies will be resolved. For instance, if two instructions are to be alternatively executed depending on the value of some quantity, then the pipeline has to guess what that value will be or which instruction will be executed. The pipeline then predicts the next instruction to be executed and fetches the predicted instruction before the previous instruction is actually executed.
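The predict, execute, and verify sequence described above may be sketched as follows. This is a minimal model under stated assumptions: the function name run_with_prediction, the two placeholder instruction names, and the representation of the unresolved quantity as condition_value are all hypothetical and chosen only for illustration.

```python
def run_with_prediction(condition_value, predict_taken):
    """Speculatively pick a path, then verify it against the real outcome.

    `condition_value` stands in for the quantity that is not yet known
    when the next instruction must be fetched (hypothetical model).
    """
    taken_path = "instr_if_taken"
    fallthrough_path = "instr_if_not_taken"

    # 1. Predict and fetch before the control dependency is resolved.
    speculative = taken_path if predict_taken else fallthrough_path

    # 2. Later, the dependency resolves and the prediction is verified.
    actually_taken = bool(condition_value)
    correct = taken_path if actually_taken else fallthrough_path

    if speculative == correct:
        return {"executed": speculative, "squashed": None}

    # 3. Misprediction: discard the speculative work, use the correct path.
    return {"executed": correct, "squashed": speculative}
```

When the prediction is correct, nothing is discarded; when it is wrong, the speculatively fetched instruction is squashed and the alternative is executed instead.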
A pipeline therefore has the tremendous advantage that, while one part of the pipeline is working on a first instruction, a second part of the pipeline can be working on a second instruction. Thus, more than one instruction can be processed at a time, thereby increasing the number of instructions that can be executed in a given time period. This, in turn, increases the processor throughput.
A third aspect of a processor's architecture is whether the processor is "superscalar." Historically, processors executed only one instruction at a time, i.e., in any given clock cycle. Such a processor is called a "scalar" processor. More recently, "superscalar" processors have been designed that execute more than one instruction at a time. More technically, a scalar processor executes one instruction per clock cycle whereas a superscalar processor executes more than one instruction per clock cycle.
Superscalar processors typically use a pipeline as described above where different stages of a pipeline work on different instructions at any given time. Not only do superscalar processors work on several different instructions at a time, but each stage of a superscalar pipeline processes more than one instruction each clock cycle. A superscalar pipeline usually includes one or more stages having several execution units executing instructions in parallel. Each execution unit reads from and writes to storage through "functional unit ports." Thus, a pipeline including N execution units may be described as an N-way pipeline having N functional unit ports.
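The effect of issue width on throughput can be estimated with a short calculation. The following sketch assumes an idealized pipeline in which nothing stalls and every group of issue_width instructions advances one stage per clock cycle; the function name cycles_to_complete and its parameters are hypothetical.

```python
import math


def cycles_to_complete(num_instructions, num_stages, issue_width=1):
    """Idealized cycle count for a pipeline with no stalls.

    issue_width=1 models a scalar pipeline; issue_width=N models an
    N-way superscalar pipeline that always issues N instructions.
    """
    if num_instructions == 0:
        return 0
    groups = math.ceil(num_instructions / issue_width)
    # The first group needs num_stages cycles; each later group adds one.
    return num_stages + groups - 1
```

Under these assumptions, twelve instructions in a three-stage scalar pipeline take 14 cycles, whereas a four-way pipeline of the same depth takes 5. A real pipeline rarely sustains its full issue width, which is why the number of instructions that may issue must be determined each cycle.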
One of the pipeline's challenges is to determine how many instructions can be executed at any given time. Some instructions require greater resources and/or more time to execute than do others. Thus, a pipeline might be able to handle twice as many instructions that are half as hard as other instructions. The trick is to know which instructions are coming down the pipeline so that the pipeline can utilize its resources efficiently. This determination is important because it effectively guards the gate to the pipeline, ensuring that neither too many nor too few instructions enter the pipeline at any given time.
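One simple way to picture this gatekeeping is to assign each instruction a resource cost and admit instructions, in program order, until the available capacity is used up. The sketch below is illustrative only: the function name count_issuable, the numeric cost model, and the single capacity figure are assumptions, and actual issue logic considers far more than one scalar quantity.

```python
def count_issuable(costs, capacity):
    """Count how many leading instructions fit within `capacity`.

    `costs` holds a hypothetical per-instruction resource cost, listed in
    program order. Issue stops at the first instruction that does not fit.
    """
    used = 0
    issued = 0
    for cost in costs:
        if used + cost > capacity:
            break
        used += cost
        issued += 1
    return issued
```

With a capacity of 4, for example, this model admits two instructions of cost 2 but four instructions of cost 1, mirroring the observation that a pipeline might handle twice as many instructions that are half as hard.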
The question of how many instructions can be executed at any given time is particularly important in at least two types of architectures. The first type is the superscalar architecture in which a variable number of fixed length instructions may be issued into the pipeline. The second type is a scalar architecture having variable length instructions. However, there may be other contexts in which the question arises. The following disclosure shall, for the sake of clarity, be presented in the context of a superscalar architecture employing fixed length instructions but capable of issuing a variable number of those instructions depending upon availability of pipeline resources. Nevertheless, the invention is not so limited.
Superscalar processors usually fetch, decode, and issue instructions in a "rotator loop." The loop begins when instructions are fetched and loaded into a queue for the decoder. A pointer points to the next instruction to be decoded. The decoder then decodes the instruction, issues the decoded instruction, and updates the pointer to the next instruction. If the decoder comes to the end of the queue, it rotates around to the beginning of the queue. This completes the loop.
If another instruction may issue, the loop is repeated. The loop may be repeated several times each clock cycle depending on how many of the instructions may issue. However, the fetch, decode, and issuance for every issued instruction must be completed in a single clock cycle so that all issued instructions are issued into the next stage at the next clock cycle. At the next clock cycle, the number of instructions determined by the decoder issues into the pipeline. The pointer is then rotated to point to the next instruction in the queue for the next clock cycle.
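The rotator loop described above may be modeled in software as a circular queue with a rotating pointer. The following is a toy sketch under stated assumptions, not a description of any actual decoder: the class name RotatorDecoder, the can_issue predicate, and the use of upper-casing as a stand-in for decoding are all hypothetical, and refilling of the queue by the fetch logic is not modeled, so entries are simply reused when the pointer wraps.

```python
class RotatorDecoder:
    """Toy model of a decode queue with a rotating pointer."""

    def __init__(self, queue, can_issue):
        self.queue = list(queue)    # fetched instructions awaiting decode
        self.pointer = 0            # index of the next instruction to decode
        # Hypothetical predicate: (decoded, issued_so_far) -> bool
        self.can_issue = can_issue

    def clock_cycle(self):
        """Decode and issue as many instructions as allowed in one cycle."""
        issued = []
        # Make at most one pass around the queue per clock cycle.
        for _ in range(len(self.queue)):
            decoded = self.queue[self.pointer].upper()  # stand-in for decoding
            if not self.can_issue(decoded, issued):
                break  # the pointer stays on the first instruction not issued
            issued.append(decoded)
            # Advance the pointer, wrapping to the start at the end of the queue.
            self.pointer = (self.pointer + 1) % len(self.queue)
        return issued
```

For instance, with a four-entry queue and a predicate that permits three instructions per cycle, the first call to clock_cycle issues the first three entries and leaves the pointer on the fourth; the second call issues the fourth entry and then rotates around to the beginning of the queue. Every iteration of the loop must complete within the same clock cycle, which is the timing pressure discussed in the next paragraph.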
A conventional decoder must therefore receive the queued instructions, decode them, make the determination of how many will issue, and update the pointer in a single clock cycle. This timing constraint is critically important since, by definition, the decoder determines how many bundles will issue in the next clock cycle. The slower the decoder performs its function, the slower the clock cycle must be.
The demand for faster, more powerful processors continually outstrips present technology. The demand pressures all aspects of processor architecture design to become faster, including the decoding and issuance of bundled instructions. Thus, there is a need for a new technique to decode and determine how many bundles of instructions might issue for execution in a pipelined processor.
The present invention is directed to overcoming, or at least reducing the effects of, one or more of the problems set forth above.