1. Field of the Invention
This invention is related to the field of processors and, more particularly, to instruction fetching mechanisms within processors.
2. Description of the Related Art
Superscalar processors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. As used herein, the term “clock cycle” refers to an interval of time accorded to various stages of an instruction processing pipeline within the processor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term “instruction processing pipeline” is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
A popular instruction set architecture is the x86 instruction set architecture. Due to the widespread acceptance of the x86 instruction set architecture in the computer industry, superscalar processors designed in accordance with this architecture are becoming increasingly common. The x86 instruction set architecture specifies a variable byte-length instruction set in which different instructions may occupy differing numbers of bytes. For example, the 80386 and 80486 processors allow a particular instruction to occupy a number of bytes between 1 and 15. The number of bytes occupied depends upon the particular instruction as well as various addressing mode options for the instruction.
Because instructions are variable-length, locating instruction boundaries is complicated. The length of a first instruction must be determined prior to locating a second instruction subsequent to the first instruction within an instruction stream. However, the ability to locate multiple instructions within an instruction stream during a particular clock cycle is crucial to superscalar processor operation. As operating frequencies increase (i.e. as clock cycles shorten), it becomes increasingly difficult to locate multiple instructions simultaneously.
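The serial dependency described above can be illustrated with a minimal sketch. Here `length_of` is a hypothetical helper (not part of any real decoder) that returns the length of the instruction beginning at a given offset; the point is that each boundary cannot be found until the previous instruction's length is known:

```python
def locate_instructions(instr_bytes, length_of):
    """Return the start offsets of instructions in instr_bytes.

    length_of(instr_bytes, offset) is an assumed helper that decodes
    the length of the instruction beginning at offset.
    """
    starts = []
    offset = 0
    while offset < len(instr_bytes):
        starts.append(offset)
        # Serial dependency: the next boundary depends on this length.
        offset += length_of(instr_bytes, offset)
    return starts
```

A hardware implementation that must locate several instructions in one clock cycle effectively has to flatten this loop, which is what makes high-frequency operation difficult.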
Various predecode schemes have been proposed in which a predecoder appends information regarding each instruction byte to the instruction byte as the instruction is stored into the cache. As used herein, the term “predecoding” is used to refer to generating instruction decode information prior to storing the corresponding instruction bytes into an instruction cache of a processor. The generated information may be stored with the instruction bytes in the instruction cache. For example, an instruction byte may be indicated to be the beginning or end of an instruction. By scanning the predecode information when the corresponding instruction bytes are fetched, instructions may be located without actually attempting to decode the instruction bytes. The predecode information may be used to decrease the amount of logic needed to locate multiple variable-length instructions simultaneously. Unfortunately, these schemes become insufficient at high clock frequencies as well. A method for locating multiple instructions during a clock cycle at high frequencies is needed.
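A predecode scan of the kind described might be modeled as follows. This is an illustrative sketch only: it assumes one "start" bit per instruction byte (the source mentions marking a byte as the beginning or end of an instruction) and collects up to a fixed number of instruction pointers in a single pass:

```python
def scan_predecode(start_bits, max_instrs):
    """Collect up to max_instrs instruction start offsets from per-byte
    predecode 'start' bits (assumed encoding), in one linear pass."""
    pointers = []
    for offset, is_start in enumerate(start_bits):
        if is_start:
            pointers.append(offset)
            if len(pointers) == max_instrs:
                break
    return pointers
```

Because the scan reads only one bit per byte rather than decoding opcodes and addressing modes, the locating logic is much shallower, though, as noted above, even this becomes limiting at high clock frequencies.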
The problems outlined above are in large part solved by a line predictor as described herein. The line predictor caches alignment information for instructions. In response to each fetch address, the line predictor provides information for the instruction beginning at the fetch address, as well as alignment information for one or more additional instructions subsequent to that instruction. The alignment information may be, for example, instruction pointers, each of which directly locates a corresponding instruction within a plurality of instruction bytes fetched in response to the fetch address. The line predictor may include a memory having multiple entries, each entry storing up to a predefined maximum number of instruction pointers and a fetch address corresponding to the instruction identified by a first one of the instruction pointers. Fetch addresses may be searched against the fetch addresses stored in the multiple entries, and if a match is detected, the corresponding instruction pointers may be used. Additionally, each entry may include a link to another entry storing instruction pointers to the next instructions within the predicted instruction stream, as well as a next fetch address corresponding to the first instruction within the next entry. The link to the next entry may be provided to the line predictor to fetch the next entry during the next clock cycle, and the next fetch address may be provided to the instruction cache to fetch the corresponding instruction bytes.
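The entry structure just described can be sketched as a simple software model. All names here are illustrative (the patent does not prescribe field names or a lookup implementation); each entry holds a fetch address, instruction pointers, a next fetch address, and a link to the entry for the next instructions:

```python
from dataclasses import dataclass

@dataclass
class LineEntry:
    fetch_address: int       # address of the first instruction located by this entry
    pointers: list           # instruction pointers locating each instruction
    next_fetch_address: int  # fetch address of the first instruction in the next entry
    next_index: int          # link to the entry covering the next instructions

class LinePredictor:
    """Toy model: a flat list of entries searched by fetch address."""

    def __init__(self):
        self.entries = []

    def lookup(self, fetch_address):
        """Search entries by fetch address; return the matching entry or None."""
        for entry in self.entries:
            if entry.fetch_address == fetch_address:
                return entry
        return None
```

In use, a hit yields the instruction pointers immediately, while `next_index` can be used to fetch the linked entry during the next clock cycle and `next_fetch_address` can be sent to the instruction cache, matching the pipelined flow described above.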
Since the line predictor provides alignment information from one entry per fetch, the line predictor may provide a flow control mechanism for the initial portion of the pipeline within a microprocessor. Each entry may store only combinations of instructions which the hardware within the pipeline may handle without stalling. Furthermore, by specifying the allowable combinations within an entry, the pipeline hardware may be simplified. Rather than handling the more complicated combinations within the pipeline hardware (either by stalling some of the instructions while processing others, or by providing hardware to concurrently process the more complicated combinations), complicated combinations may be divided into multiple entries linked in the manner described above. A portion of the combination is fetched during a first clock cycle, and the remainder of the combination is fetched during the next clock cycle.
Since instructions located by a particular entry may be handled by the pipeline hardware, the instructions may flow through various pipeline stages as a unit. When the instructions reach the portion of the pipeline at which instructions are scheduled for execution, they may be treated individually. Additional pipeline hardware simplifications may thereby be realized. Because the pipeline hardware is simplified, higher frequency operation may be achieved than with more complicated pipeline hardware which handles concurrent fetch of the more complicated combinations of instructions.
Broadly speaking, a processor is contemplated, comprising a fetch address generation unit configured to generate a fetch address and a line predictor coupled to the fetch address generation unit. The line predictor includes a first memory comprising a plurality of entries, each entry storing a plurality of instruction pointers. The line predictor is configured to detect a miss of the fetch address in the line predictor. The processor further comprises a decode unit coupled to receive a plurality of instruction bytes fetched in response to the fetch address and further coupled to the line predictor. The decode unit is configured to decode the plurality of instruction bytes in response to the miss to generate the plurality of instruction pointers for a first entry in the first memory, the first entry corresponding to the fetch address. The decode unit is configured to terminate decode and provide the plurality of instruction pointers to the line predictor in response to detecting a termination condition. Additionally, a computer system is contemplated including the processor and an input/output (I/O) device configured to communicate between the computer system and another computer system to which the I/O device is couplable.
Moreover, a method is contemplated. A fetch address is generated. A miss of the fetch address is detected in a line predictor which includes a first memory comprising a plurality of entries, each entry storing a plurality of instruction pointers. Responsive to the miss, a plurality of instruction bytes fetched in response to the fetch address are decoded to generate the plurality of instruction pointers for a first entry in the first memory. The first entry corresponds to the fetch address. A termination condition is detected during the decode, and the decode is terminated and the plurality of instruction pointers are provided to the line predictor in response to detecting the termination condition.
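The miss path described in the method above can be sketched as follows. The termination conditions and the `length_of` helper are illustrative assumptions (the source does not enumerate the termination conditions): here decode stops when the entry's pointer limit is reached or the fetched bytes are exhausted, and the completed pointers are returned for installation into the line predictor entry:

```python
MAX_POINTERS = 4  # assumed per-entry instruction pointer limit

def train_on_miss(instr_bytes, fetch_address, length_of):
    """On a line predictor miss, decode the fetched bytes serially to
    generate instruction pointers for a new entry, terminating when the
    entry is full or the bytes run out (illustrative conditions).

    length_of(instr_bytes, offset) is an assumed helper returning the
    length of the instruction beginning at offset.
    """
    pointers = []
    offset = 0
    while offset < len(instr_bytes) and len(pointers) < MAX_POINTERS:
        pointers.append(offset)
        offset += length_of(instr_bytes, offset)
    # The caller would store this as the entry corresponding to fetch_address.
    return {"fetch_address": fetch_address, "pointers": pointers}
```

Subsequent fetches of the same address would then hit in the line predictor and obtain the instruction pointers without re-decoding, which is the payoff of the training pass.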