Modern microprocessors employ pipelining techniques which allow multiple, consecutive instructions to be prefetched, decoded, and executed in separate stages simultaneously. Accordingly, in any given clock cycle, a first instruction may be executed while the next (second) instruction is simultaneously being decoded, and the instruction after that one (a third instruction) is simultaneously being fetched. Since less processing is performed on each instruction per cycle, cycle time can be made shorter. Thus, while it requires several clock cycles for a single instruction to be prefetched, decoded, and executed, a processor can complete instructions as fast as one instruction per cycle with a very short cycle period, because multiple consecutive instructions are in various stages simultaneously.
Typically, buffers for temporarily holding data are used to define the boundary between consecutive stages of a microprocessor pipeline. The data calculated in a particular stage is written into these buffers before the end of the cycle. When the pipeline advances upon the start of a new cycle, the data is written out of the boundary buffers into the next stage where the data can be further processed during that next cycle.
Most pipelined microprocessor architectures have at least four stages including, in order of flow, 1) a prefetch stage, 2) a decode stage, 3) an execute stage, and 4) a write-back stage. In the prefetch stage, instructions are read out of memory (e.g., an instruction cache) and stored in a buffer. Depending on the particular microprocessor, in any given cycle, the prefetch buffer may receive one to several instructions.
In the decode stage, the processor reads an instruction out of the prefetch buffer and converts it into an internal instruction format which can be used by the microprocessor to perform one or more operations, such as arithmetic or logical operations. In the execute stage, the actual operations are performed. Finally, in the write-back stage, the results of the operations are written to the designated registers and/or other memory locations.
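The overlap of the four stages described above can be illustrated with a small timing model. This is only a sketch under simplifying assumptions that the text does not state: every stage takes exactly one cycle, there are no stalls or branches, and one instruction enters the pipeline per cycle. The function name `pipeline_schedule` and the stage list are hypothetical, chosen for illustration.

```python
# Illustrative stage names taken from the four basic stages in the text.
STAGES = ["prefetch", "decode", "execute", "write-back"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one dict per cycle mapping stage name -> instruction index.

    Assumes an ideal pipeline: one cycle per stage, no stalls, no branches.
    A stage that holds no instruction in a given cycle maps to None.
    """
    total_cycles = num_instructions + len(stages) - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            # Instruction i occupies the stage at depth d during cycle i + d.
            instr = cycle - depth
            row[stage] = instr if 0 <= instr < num_instructions else None
        schedule.append(row)
    return schedule


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_schedule(5)):
        print(cycle, row)
```

Under these assumptions, N instructions pass through S stages in N + S - 1 cycles, so once the pipeline is full one instruction completes per cycle even though each individual instruction still needs S cycles.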
In more complex microprocessors, one or more of the four basic stages can be further broken down into smaller stages to simplify each individual stage and even further improve instruction completion speed.
Generally, instructions are read out of memory in a sequential address order. However, instruction branches, in which the retrieval of instructions from sequential address spaces is disrupted, are common, occurring on average about every six to nine instructions.
The hardware in an instruction prefetch stage typically comprises a prefetch buffer or prefetch queue which can temporarily hold instructions. Each cycle, the decode stage can take in the bytes of an instruction held in the prefetch stage for decoding during that cycle.
The hardware in a decode stage typically comprises at least a program counter and hardware for converting instructions into control lines for controlling the hardware in the execute stage. Alternatively, the decode stage can include a microcode-ROM. The incoming instruction defines an entry point (i.e., an address) into the microcode-ROM at which the stored data defines the appropriate conditions for the execute stage control lines. The execute stage control data for the particular instruction may exist entirely at a single addressable storage location on the microcode-ROM or may occupy several sequentially addressable storage locations. The number of addressable storage locations in the microcode-ROM which must be accessed for a given instruction may be encoded in the instruction itself. Alternatively, one or more data bits in the storage locations in the microcode-ROM may indicate whether or not another storage location should be accessed.
The control data output from the microcode-ROM is written into buffer registers for forwarding to the execute stage on the next cycle transition. The decode stage also includes hardware for extracting the operands, if any, from the instruction or from registers or memory locations and presenting the operands to the appropriate hardware in the execution stage.
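The microcode-ROM lookup described above, in the variant where a bit in each storage location indicates whether another location should be accessed, might be sketched as follows. The ROM contents, the addresses, the control-word values, and the names `MICROCODE_ROM` and `fetch_control_words` are all hypothetical and serve only to illustrate the mechanism.

```python
# Hypothetical ROM image: address -> (control_word, more_follows).
# more_follows models the data bit that says whether the next sequential
# storage location must also be read for the same instruction.
MICROCODE_ROM = {
    0x10: (0b0001, False),  # instruction fully described by one location
    0x20: (0b0010, True),   # instruction spanning three sequential locations
    0x21: (0b0100, True),
    0x22: (0b1000, False),
}


def fetch_control_words(entry_point, rom=MICROCODE_ROM):
    """Collect the control words for one instruction.

    Starts at the entry point defined by the instruction and keeps reading
    sequential locations while the more_follows bit is set.
    """
    words = []
    address = entry_point
    while True:
        control_word, more_follows = rom[address]
        words.append(control_word)
        if not more_follows:
            return words
        address += 1
```

In this sketch each returned control word stands for the control data that would be written into the buffer registers and forwarded to the execute stage on a cycle transition.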
Some microprocessor architectures employ what are known as variable width instruction sets. In such architectures, the instructions are not all the same width. For instance, in the instruction set for the 16/32 bit class x86 family of microprocessors developed by Intel Corporation of Santa Clara, Calif., an instruction can be anywhere from 1 to 16 bytes wide.
Some microprocessor architectures utilize a segmented address space in which the total memory space is broken down into a plurality of independent, protected address spaces. Each segment is defined by a base address and a segment limit. The base address, for instance, may be the lowest numerical address in the segment space. The segment limit defines the size of the segment. Accordingly, the end boundary of the segment is defined by the sum of the base address and the segment limit. Alternatively, the base address may be the highest address, in which case the end boundary of the segment is the difference between the base address and the segment limit.
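The two boundary conventions described above can be expressed as a short calculation. This is a minimal sketch: the function names, the `grows_up` flag, and the choice to treat both the base and the end boundary as belonging to the segment are assumptions made for illustration, not details given in the text.

```python
def segment_end(base, limit, grows_up=True):
    """End boundary of a segment.

    grows_up=True:  base is the lowest address, end = base + limit.
    grows_up=False: base is the highest address, end = base - limit.
    """
    return base + limit if grows_up else base - limit


def in_segment(address, base, limit, grows_up=True):
    """Check whether an address lies within the segment.

    Assumption: both the base and the end boundary are inside the segment.
    """
    end = segment_end(base, limit, grows_up)
    low, high = (base, end) if grows_up else (end, base)
    return low <= address <= high
```

For example, with a base of 0x1000 and a limit of 0x0FFF, `segment_end` gives 0x1FFF in the first convention, and an access to 0x2000 would fall outside the segment.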
Software programs are written, compiled and assembled such that, when a program is running, instructions are normally retrieved from sequential addresses in memory for presentation into the pipeline. Accordingly, once a program is begun, the prefetch stage will normally continue to retrieve consecutive instructions for presentation to the decode stage from consecutive addresses in memory until that flow is interrupted. The most common way by which the sequential addressing of instructions can be interrupted is by a branch instruction. A branch instruction usually specifies, in some manner, the address from which the next instruction to be executed after the branch instruction is to be retrieved. Thus, when a branch instruction is executed in the execute stage, the execute stage halts the normal flow of instructions through the preceding stages of the pipe, i.e., the prefetch and decode stages, and instead supplies the next address for retrieving instructions to the prefetch stage. Accordingly, when a branch occurs, the instructions which were retrieved from the sequential addresses following the branch instruction and which are still in the pipe, i.e., the instructions in the prefetch and decode stages, should not be executed, but should be flushed from the pipe. The flow can also be altered by mechanisms other than an executed branch instruction, such as an interrupt. In this specification, any change in program flow from sequential addressing is referred to as a branch, even if it is not the result of a branch instruction.
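The flush behavior described above can be modeled with a toy front end consisting of a single-entry prefetch stage and a single-entry decode stage. This is a deliberately simplified sketch: the class name `FrontEnd`, the one-instruction-per-address assumption, and the single-entry stages are hypothetical and are not taken from any particular microprocessor.

```python
class FrontEnd:
    """Toy model of the prefetch and decode stages ahead of the execute stage."""

    def __init__(self, start_address):
        self.next_fetch = start_address  # address the prefetch stage reads next
        self.prefetch = None             # address of instruction in prefetch
        self.decode = None               # address of instruction in decode

    def advance(self):
        """Advance one cycle; return the address handed to the execute stage."""
        to_execute = self.decode
        self.decode = self.prefetch
        self.prefetch = self.next_fetch
        self.next_fetch += 1  # assumption: one instruction per address
        return to_execute

    def branch(self, target):
        """Flush the sequentially fetched instructions and redirect prefetch.

        Called when the execute stage resolves a branch (or any other change
        in program flow). Returns the addresses that were flushed.
        """
        flushed = [a for a in (self.decode, self.prefetch) if a is not None]
        self.decode = None
        self.prefetch = None
        self.next_fetch = target
        return flushed
```

In this model, after a branch the execute stage receives nothing for two cycles while the target instruction moves through the prefetch and decode stages, which illustrates why discarded sequential fetches cost pipeline cycles.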
To generate a linear address according to the x86 architecture, at least two quantities must be added together: the base address of the particular segment, as indicated by the segment descriptor, and an offset indicating the distance of the desired data (i.e., instruction) from the base of the segment. The offset itself may comprise up to three parts: a base, an index, and a displacement. If so, those quantities must be added to generate the offset before the offset can be added to the segment base. A more detailed discussion of segmented addressing in the x86 architecture can be found in INTEL486 Microprocessor Family Programmer's Reference Manual, 1992, Intel Corporation.
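The two-step addition described above can be written out as a short calculation. This is a sketch only: the function names are hypothetical, the result is truncated to 32 bits as an assumption about the address width, and any scaling of the index (which the text does not discuss) is omitted.

```python
ADDRESS_MASK = 0xFFFFFFFF  # assumption: 32-bit linear addresses


def effective_offset(base=0, index=0, displacement=0):
    """Step 1: add the parts of the offset that are present (absent parts are 0)."""
    return base + index + displacement


def linear_address(segment_base, base=0, index=0, displacement=0):
    """Step 2: add the offset to the segment base from the segment descriptor."""
    offset = effective_offset(base, index, displacement)
    return (segment_base + offset) & ADDRESS_MASK
```

For instance, a segment base of 0x10000 combined with a base of 0x200, an index of 0x10, and a displacement of 4 yields an offset of 0x214 and a linear address of 0x10214. The offset must be complete before the second addition can begin, so the two additions are inherently sequential.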
It is an object of the present invention to provide an improved pipelined microprocessor.
It is a further object of the present invention to provide a pipelined microprocessor architecture in which the prefetch line buffer has a line width less than the maximum possible instruction width in order to conserve semiconductor area, yet which is wide enough to accommodate the vast majority of instructions in a single line.
It is a further object of the present invention to provide a pipelined microprocessor architecture having a decode stage in which shadow buffers are used to temporarily store portions of instructions which are wider than the prefetch line buffer and therefore require multiple cycles to be loaded into the decode stage.
It is a further object of the present invention to provide a pipelined microprocessor architecture having a tagged prefetch buffer in which instruction bytes are individually tagged so that the prefetch buffer bytes can be cleared individually to allow data to be loaded more efficiently into the prefetch buffer.
It is a further object of the present invention to provide a pipelined microprocessor architecture having a decode stage in which dynamic information is decoded by hardware, while fixed information is decoded by use of an addressable ROM.