1. Field of the Invention
The present invention relates to superscalar processors as would typically be implemented on a single integrated circuit, and more specifically relates to instruction cache arrangements for superscalar processors incorporating a variable byte-length instruction format.
2. Description of Related Art
The term superscalar describes a computer implementation that improves performance by concurrent execution of scalar instructions, the type of instructions typically found in general-purpose microprocessors. Because the majority of existing microprocessor applications are targeted toward scalar computation, superscalar microprocessors are the next logical step in the evolution of microprocessors. Using today's semiconductor processing technology, a single processor chip can incorporate high performance techniques that were once applicable only to large-scale scientific processors. However, many of the techniques applied to large-scale processors are either inappropriate for scalar computation or too expensive to be applied to microprocessors.
Microprocessors by definition must be implemented on one or a very small number of semiconductor chips. Semiconductor technology provides ever increasing circuit densities and speeds for implementing a microprocessor, but the interconnection with the microprocessor's memory is quite constrained by packaging technology. Though on-chip interconnections are extremely cheap, off-chip connections are very expensive; often the processor's package and pins are more expensive than the processor chip itself. Any technique intended to improve microprocessor performance must take advantage of increasing circuit densities and speeds while remaining within the constraints of packaging technology and the physical separation between the processor and its memory. At the same time, though increasing circuit densities provide a path to ever more complex designs, the operation of the microprocessor must remain simple and clear enough that users can understand how to use it.
An application program comprises a group of instructions. The processor fetches and executes instructions in some sequence. There are several steps involved in the execution of a single instruction, including fetching the instruction, decoding it, assembling its operands, performing the operations specified by the instruction, and writing the results of the instruction to storage. The execution of instructions is controlled by a periodic clock signal. The period of the clock signal is the processor cycle time.
The time taken by a processor to complete a program is determined by three factors: (1) the number of instructions required to execute the program; (2) the average number of processor cycles required to execute an instruction; and (3) the processor cycle time. Processor performance is improved by reducing the time taken, which dictates reducing one or more of these factors.
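The relationship among the three factors can be illustrated with a short calculation. The following is a minimal sketch only; the function name and the sample figures (one million instructions, a 20 ns cycle time) are assumptions chosen for illustration and do not come from any measured processor.

```python
def execution_time(instruction_count: int,
                   cycles_per_instruction: float,
                   cycle_time_s: float) -> float:
    """Return program run time in seconds.

    run time = instructions x average cycles per instruction x cycle time
    """
    return instruction_count * cycles_per_instruction * cycle_time_s


# Hypothetical program: one million instructions on a 20 ns clock.
baseline = execution_time(1_000_000, 4.0, 20e-9)   # 4 cycles per instruction
improved = execution_time(1_000_000, 2.0, 20e-9)   # 2 cycles per instruction

print(f"baseline: {baseline * 1e3:.1f} ms, improved: {improved * 1e3:.1f} ms")
```

Because the three factors multiply, halving any one of them while holding the other two fixed halves the run time.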
An obvious way to increase performance is by overlapping the steps of different instructions, using a technique called pipelining. To pipeline instructions, the various steps of instruction execution are performed by independent units called pipeline stages. Pipeline stages are separated by clocked registers (or latches). The steps of different instructions are executed independently in different pipeline stages. The result of each pipeline stage is communicated to the next pipeline stage via the register between the stages. Pipelining reduces the average number of cycles required to execute an instruction, though not the total amount of time required to execute an instruction, by permitting the processor to handle more than one instruction at a time. This is done without increasing the processor cycle time appreciably. Pipelining typically reduces the average number of cycles per instruction by as much as a factor of three. However, when executing a branch instruction, the pipeline may sometimes stall until the result of the branch operation is known and the correct instruction is fetched for execution. This delay is known as the branch-delay penalty. Increasing the number of pipeline stages also typically increases the branch-delay penalty relative to the average number of cycles per instruction.
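The effect of overlapping instruction steps, and of the branch-delay penalty, can be sketched with an idealized cycle count. The model below assumes that every stage takes exactly one cycle, that the only stalls come from branches, and that each stalling branch costs a fixed number of cycles. These are simplifying assumptions for illustration, and the function and parameter names are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every step before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int,
                     n_stalling_branches: int = 0,
                     branch_penalty: int = 0) -> int:
    """Idealized pipeline: fill the pipe once, then finish one instruction
    per cycle, plus a fixed stall for every branch that stalls the pipe."""
    if n_instructions == 0:
        return 0
    fill = n_stages - 1
    return fill + n_instructions + n_stalling_branches * branch_penalty


n, stages = 100, 5
serial = unpipelined_cycles(n, stages)                 # 500 cycles
ideal = pipelined_cycles(n, stages)                    # 104 cycles
with_branches = pipelined_cycles(n, stages, 10, 2)     # 124 cycles

print(serial / n, ideal / n, with_branches / n)        # average cycles per instruction
```

In this idealized model the average number of cycles per instruction approaches one, and every stalling branch adds directly to the total; real pipelines typically fall short of this because of other stalls that the sketch ignores.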
During the development of early microprocessors, instructions took a long time to fetch compared to the execution time. This motivated the development of complex instruction, or CISC, processors. (The acronym CISC stands for "Complex Instruction Set Computer.") CISC processors were based on the observation that, given the available technology, the number of cycles per instruction was determined mostly by the number of cycles taken to fetch the instruction. To improve performance, the two principal goals of the CISC architecture were to reduce the number of instructions needed for a given task and to encode these instructions densely. It was acceptable to accomplish these goals by increasing the average number of cycles taken to decode and execute an instruction because, using pipelining, the decode and execution cycles could be mostly overlapped with a relatively lengthy instruction fetch. With this set of assumptions, CISC processors evolved densely encoded instructions at the expense of decode and execution time inside the processor. Multiple-cycle instructions reduced the overall number of instructions and thus reduced the overall execution time because they reduced the instruction fetch time.
But in the late 1970s and early 1980s, memory and packaging technology changed rapidly. High pin count packages made possible the design of advanced memory interfaces that no longer had quite the same fetch limitations that had applied when CISC processors evolved. Memory densities and speeds increased to the point where high speed local memories called caches could be implemented near the processor. When instructions are fetched more quickly using caches, the performance is limited by the decode and execution time that was previously hidden within the instruction fetch time. The number of instructions does not affect performance as much as the average number of cycles taken to execute an instruction.
The improvement in memory and packaging technology, to the point where instruction fetching did not take much longer than instruction execution, motivated the development of reduced instruction, or RISC, processors. (The acronym RISC stands for "Reduced Instruction Set Computer.") To improve performance, the principal goal of a RISC architecture is to reduce the number of cycles taken to execute an instruction, allowing some increase in the total number of instructions. The trade-off between cycles per instruction and the number of instructions is not one to one. Compared to CISC processors, RISC processors typically reduce the number of cycles per instruction by factors of roughly three to five, while they typically increase the number of instructions by thirty to fifty percent.
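The net effect of this trade-off can be worked out from the ranges just quoted. The sketch below assumes an unchanged processor cycle time, so that only the instruction count and the cycles per instruction differ; the function name is hypothetical.

```python
def net_speedup(cpi_reduction_factor: float, instruction_growth: float) -> float:
    """Speedup of RISC over CISC at equal cycle time.

    cpi_reduction_factor: 3.0 means cycles per instruction are divided by 3.
    instruction_growth:   0.5 means 50 percent more instructions are executed.
    """
    return cpi_reduction_factor / (1.0 + instruction_growth)


# Least favorable corner of the quoted ranges: 3x fewer cycles, 50% more instructions.
worst = net_speedup(3.0, 0.5)
# Most favorable corner: 5x fewer cycles, 30% more instructions.
best = net_speedup(5.0, 0.3)

print(f"net speedup between {worst:.2f}x and {best:.2f}x")
```

Under these assumptions the quoted ranges correspond to a net speedup of roughly two to nearly four, which is why the added instructions are an acceptable cost.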
RISC processors have been characterized by some as a return to the basic rudimentary architectures that were developed very early in the evolution of computers. However, early processors were simple because technology was relatively primitive. RISC processors are simple because simplicity yields better performance. Relative to CISC processors, RISC processors depend heavily on advanced memory technology, advanced packaging technology and advanced compiler technology. Furthermore, RISC processors typically rely very much on auxiliary features such as a large number of general purpose registers, instruction and data caches, and others, that help the compiler reduce the overall instruction count or that reduce the number of cycles per instruction.
A typical RISC processor executes one instruction on every processor cycle and, at first glance, no more improvement seems possible. A superscalar processor reduces the average number of cycles per instruction beyond what is possible in a pipelined scalar RISC processor by allowing concurrent execution of instructions in the same pipeline stage as well as concurrent execution of instructions in different pipeline stages. The term superscalar emphasizes multiple concurrent operations on scalar quantities as distinguished from multiple concurrent operations on vectors or arrays as is common in scientific computing.
Superscalar processors are conceptually simple but there is much more to achieving performance than widening a processor's pipeline. Widening the pipeline makes it possible to execute more than one instruction per cycle but there is no guarantee that any given sequence of instructions can take advantage of this capability. Instructions are not independent of one another but are interrelated; these interrelationships prevent some instructions from occupying the same pipeline stage. Furthermore, the processor's mechanisms for decoding and executing instructions can make a big difference in its ability to discover instructions that can be executed at the same time.
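The way interrelationships among instructions limit the use of a wider pipeline can be sketched as follows. This is a simplified model, assuming in-order issue, register dependencies only, and no register renaming (so write-after-read and write-after-write conflicts are treated conservatively as blocking). The `Instr` type, the register names, and the sample program are all hypothetical.

```python
from typing import FrozenSet, List, NamedTuple


class Instr(NamedTuple):
    name: str
    reads: FrozenSet[str]    # source registers
    writes: FrozenSet[str]   # destination registers


def conflicts(earlier: Instr, later: Instr) -> bool:
    """True if the two instructions cannot share a pipeline stage."""
    raw = earlier.writes & later.reads    # later needs a result of earlier
    waw = earlier.writes & later.writes   # both write the same register
    war = earlier.reads & later.writes    # later overwrites an input of earlier
    return bool(raw or waw or war)


def issue_groups(program: List[Instr], width: int) -> List[List[Instr]]:
    """Greedily pack instructions, in program order, into groups of at most
    `width` mutually independent instructions (one group per cycle)."""
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    for instr in program:
        fits = len(current) < width and not any(conflicts(p, instr) for p in current)
        if fits:
            current.append(instr)
        else:
            groups.append(current)
            current = [instr]
    if current:
        groups.append(current)
    return groups


program = [
    Instr("add", frozenset({"r2", "r3"}), frozenset({"r1"})),
    Instr("sub", frozenset({"r1", "r5"}), frozenset({"r4"})),  # needs r1 from add
    Instr("mov", frozenset({"r7"}), frozenset({"r6"})),
    Instr("or", frozenset({"r6", "r2"}), frozenset({"r8"})),   # needs r6 from mov
]

for group in issue_groups(program, width=2):
    print([i.name for i in group])
```

With a width of two, the four sample instructions still need three issue cycles, because `sub` depends on `add` and the group holding `sub` and `mov` is already full when `or` arrives.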
Superscalar techniques largely concern the processor organization independent of the instruction set and other architectural features. Thus, one of the attractions of superscalar techniques is the possibility of developing a processor that is code compatible with an existing architecture. Many superscalar techniques apply equally well to either RISC or CISC architectures. However, because of the regularity of many of the RISC architectures, superscalar techniques have initially been applied to RISC processor designs.
The attributes of the instruction set of a RISC processor that lend themselves to single cycle decoding also lend themselves well to decoding multiple RISC instructions in the same clock cycle. These include a general three operand load/store architecture, instructions having only a few instruction lengths, instructions utilizing only a few addressing modes, instructions which operate on fixed-width registers, and register identifiers in only a few places within the instruction format. Techniques for designing a superscalar RISC processor are described in Superscalar Microprocessor Design, by William Michael Johnson, © 1991 by Prentice-Hall, Inc. (a division of Simon & Schuster), Englewood Cliffs, N.J.
In contrast to RISC architectures, CISC architectures were defined at a time when the principal implementation technique was microcode interpretation of the instruction set and when pipelining was considered to be an exotic technique. Design goals were oriented more toward deciding which operations should be combined into instructions than designing operations so that they could be overlapped. Because of microcode interpretation, almost anything could be done with the definition of the instruction set, and generally just about everything was done. It is difficult to implement a pipelined version of such an architecture, and extremely difficult to implement a superscalar version.
Most CISC processors use a large number of different instruction formats. As an example, several of the various instruction formats of the X86 architecture are shown in FIGS. 1A and 1B. This architecture, first introduced in the i386™ microprocessor, is also the basic architecture of both the i486™ microprocessor and the Pentium™ microprocessor, all available from the Intel Corporation of Santa Clara, Calif. There are many instruction format variations, and individual instructions vary from 1 to 15 bytes long.
Referring to FIG. 1A, the minimum instruction of the X86 architecture consists of a single byte, which usually contains an 8-bit opcode (for example, field 2). For certain instructions, the opcode field can be up to 16 bits long, while for other instructions, the byte containing the opcode field (the "opcode" byte) also contains a register field (see format (b) in FIG. 1A). Operations can be register-to-register, register-to-memory, or memory-to-register (but not memory-to-memory). An optional MODRM field (for example, field 3 in instruction format (d)), which follows the opcode field, contains the register specifier and indicates how the memory addressing should be performed. In some instructions, the MODRM field is also used (in a slightly different format) to select condition flags. Finally, an instruction optionally contains up to a four-byte immediate data field (for example, field 4 in instruction format (h)).
As illustrated in FIG. 1B, the MODRM field itself has a variety of possible formats. The first byte of the MODRM field always contains a ModR/M field (for example, field 5) that specifies which register and addressing mode to use. In the more complex memory-addressing modes, an 8-bit S-I-B field (for example, field 6) specifies how address computation is to be done. Finally, the MODRM field contains an optional displacement or offset field (for example, field 7), for address computation.
The length of the displacement and immediate fields depends on the data-width mode of the instruction, because an instruction can operate on 8-bit, 16-bit, or 32-bit data. The data width is determined primarily by a segment descriptor in the memory management architecture, but the default for the segment can be overridden by a bit in the instruction (the w-bit) or by a prefix byte which toggles the effect of the w-bit.
Prefix bytes can appear before any instruction. A prefix byte changes the interpretation of the instruction: it can, for example, change the memory address or operand size of the instruction, change the default segment used in memory addressing, or indicate that the instruction should be executed with the external bus locked. More than one prefix may be included before an instruction, as each type of prefix byte is independent of the others, which gives rise to the maximum instruction length of 15 bytes (that is, for non-redundant prefixes).
During execution, a processor executing the X86 instruction set must deal with instructions that can be from 8 to 120 bits long. The actual length of the instruction is a complex function of the opcode and other instruction fields, because many fields specify whether or not other fields are present. For example, the ModR/M and S-I-B fields both indicate the presence and length of the displacement (DISP) field, and this can be further modified by a prefix byte which can change the address size. A similar situation exists for the length of the immediate (IMMED) field.
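The way one field determines the presence and length of another can be sketched for the ModR/M, S-I-B, and displacement fields. The sketch below assumes 32-bit addressing with no address-size prefix and follows the publicly documented X86 encoding rules rather than anything stated in the figures; the function name is hypothetical, and prefixes, opcode bytes, and immediate data are deliberately left out.

```python
from typing import Optional


def modrm_field_length(modrm: int, sib: Optional[int] = None) -> int:
    """Return the byte count of ModR/M + optional S-I-B + optional displacement.

    Assumes 32-bit addressing (no address-size prefix).
    """
    mod = (modrm >> 6) & 0b11
    rm = modrm & 0b111

    if mod == 0b11:               # register operand: ModR/M byte only
        return 1

    has_sib = rm == 0b100         # r/m = 100 means an S-I-B byte follows
    if has_sib and sib is None:
        raise ValueError("this ModR/M value requires an S-I-B byte")

    if mod == 0b01:
        disp = 1                  # 8-bit displacement
    elif mod == 0b10:
        disp = 4                  # 32-bit displacement
    elif rm == 0b101:
        disp = 4                  # mod = 00, r/m = 101: displacement only
    elif has_sib and (sib & 0b111) == 0b101:
        disp = 4                  # mod = 00, S-I-B base = 101: no base register
    else:
        disp = 0

    return 1 + (1 if has_sib else 0) + disp


print(modrm_field_length(0xC0))         # register-to-register
print(modrm_field_length(0x05))         # 32-bit displacement only
print(modrm_field_length(0x44, 0x24))   # S-I-B plus 8-bit displacement
print(modrm_field_length(0x04, 0x25))   # S-I-B whose base selects a 32-bit displacement
```

Even in this restricted case the length cannot be known until both the ModR/M byte and, where present, the S-I-B byte have been examined, which is the serial dependence that makes locating several instructions per cycle difficult.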
It is hard to see how an X86 processor might be able to quickly locate more than one instruction per cycle. At a minimum, it would seem that an additional pipeline stage would be required to locate these instructions before any decoding could be done, adding to the branch-delay penalty because this extra stage must be flushed on a branch. Marking the instruction bytes to aid in subsequent decoding does not by itself remove this difficulty because, for example, the processor executing X86 instructions must be able to execute self-modifying code. Furthermore, the same X86 instruction byte stream can be executed with different alignments. For example, a programmer could write a sequence of instructions that branches to a given instruction opcode at certain times, and branches to a prefix byte immediately ahead of the opcode byte of the given instruction at other times. The beginning byte of an instruction is not necessarily fixed, depending on the execution flow.
It has been observed that, in the 8086 processor, many commercial programs execute a limited subset of available 8086 instructions. It is also likely that an X86 processor executes a relatively small portion of its instruction repertoire most of the time. This is the very realization that motivated RISC architectures in the first place. This phenomenon should hold true in the future, even as new applications are developed, because new applications probably are going to be written in high-level languages. Compilers typically generate instructions in a stylized fashion, using a subset of the instructions available in a CISC architecture, because code generators often cannot recognize cases where complex instructions can be used.
The vast majority of X86 instructions that are typically utilized are very simple, such as move, jump, add, and shift. Others are almost inordinately complex. It has been suggested that a superscalar X86 processor would probably have two modes of execution: a slow, serial mode for the very complex instructions and a faster, superscalar mode for the simpler instructions. The slow, serial mode would likely take advantage of some form of microcode, while the faster mode would likely execute in hardware.
The Pentium™ processor achieves some degree of superscalar operation by utilizing two fairly traditional integer pipelines, called the U pipeline and the V pipeline, to support execution of up to two simultaneous instructions. The processor decodes in hardware as many of the most frequently occurring instructions as possible. If two such instructions have no resource conflicts, one instruction may execute within the U pipeline under hardwired control while the other executes within the V pipeline, again under hardwired control. More complex instructions require a microcode routine, which controls both pipelines in an attempt to optimize the execution of the complex instruction. Since microcoded routines take over all the execution resources, it is not possible for the Pentium processor to pair microinstructions with regular X86 instructions. Instruction fetching and dispatch are stalled during the execution of a complex, microcoded instruction. In general, the U and V pipelines simultaneously execute separate instructions only if the instructions they contain are independent. Otherwise, the instruction execution is serialized. Additionally, if the U pipeline contains any kind of branch instruction, the V pipeline is idle.
The Pentium™ processor utilizes separate instruction and data caches. The instruction cache (I-cache) is an 8-Kbyte two-way set-associative cache using a 32-byte line size. A dedicated ITLB (Instruction Translation Lookaside Buffer) allows the instruction cache to be physically tagged. The array containing these physical tags is triple-ported: one port is for bus snooping (for supporting cache coherency between multiple processors) while the other two are used for a split fetch capability, which gives the processor the ability to fetch a contiguous block of instruction bytes when the block straddles the boundary between two half-cache lines. Instruction bytes read from the cache are stored within one of four 32-byte prefetch buffers.
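The cache geometry implied by these figures can be worked out directly. The sketch below assumes the conventional arrangement in which the low-order address bits select the byte within a line and the next bits select the set; the text above does not state how the cache is indexed, so that arrangement, the constant names, and the sample address are assumptions for illustration.

```python
CACHE_BYTES = 8 * 1024   # 8-Kbyte instruction cache
WAYS = 2                 # two-way set-associative
LINE_BYTES = 32          # 32-byte line size

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 128 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1          # 7 bits select a set


def split_address(addr: int):
    """Split an address into (tag, set index, byte offset).

    Assumes a conventional low-order-bit index, which is not confirmed above.
    """
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


print(NUM_SETS, OFFSET_BITS, INDEX_BITS)
print(split_address(0x12345))
```

With 128 sets of two 32-byte lines each, twelve address bits are consumed by the set index and byte offset, and the remaining upper bits form the tag that is compared in the tag array.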
Alternatively, it has also been suggested that a superscalar X86 processor would have a faster, RISC-like superscalar mode for the simpler instructions, which would likely execute in hardware that follows recent advances in RISC processor design. This technique is based on defining a "RISC core" of instructions that are able to take advantage of even more powerful superscalar techniques, such as register renaming, wider superscalar dispatch, out-of-order instruction issue, and out-of-order instruction completion.
However, fetching and decoding instructions is still a critical bottleneck. It is hard enough to find the instruction boundary of a single X86 instruction and to decode its various fields, but it is all the more difficult to do so for up to four X86 instructions, all within a single clock cycle.