1. Field of the Invention
This invention relates in general to the field of instruction execution in computers, and more particularly to an apparatus and method for predecoding macro instructions prior to translation.
2. Description of the Related Art
The architecture of a present day pipeline microprocessor consists of a path, or channel, or pipeline that is divided into stages. Each of the pipeline stages performs specific tasks related to the accomplishment of an overall operation that is directed by a programmed instruction. Software application programs are composed of sequences of macro instructions. As a macro instruction enters the first stage of the pipeline, certain tasks are accomplished. The macro instruction is then passed to subsequent stages for the execution of subsequent tasks. Following completion of a final task, the instruction completes execution and exits the pipeline. Execution of programmed instructions by a pipeline microprocessor is often likened to the manufacture of items on an assembly line.
The efficiency of any assembly line is beneficially impacted by the following two factors: 1) keeping each stage of the assembly line occupied with work and 2) ensuring that the tasks performed within each stage are equally balanced, that is, optimizing the line so that no one stage creates a bottleneck. These same factors can also be said to affect the efficiency of a pipeline microprocessor. Consequently, it is incumbent upon microprocessor designers 1) to provide logic within each of the stages that maximizes the probability that none of the stages in the pipeline will sit idle and 2) to distribute the tasks among the architected pipeline stages such that no one stage will be the source of a bottleneck in the pipeline. Bottlenecks, or pipeline stalls, cause delays in the execution of application programs.
The first stage of a pipeline microprocessor, the fetch stage, performs the task of retrieving macro instructions from memory devices external to the microprocessor. External memory in a desktop computer system typically takes the form of random access memory (RAM), a device technology that is significantly slower than logic within the microprocessor itself. Hence, to access external memory each time an instruction is required for execution would create an overwhelming bottleneck in the fetch stage. For this reason, a present day microprocessor transfers large blocks of instructions from external memory into a smaller, yet significantly faster, memory device that resides within the microprocessor chip itself. This internal memory device is referred to as an instruction cache. The large blocks of memory, known as cache lines, are transferred in parallel bursts rather than one byte at a time, thus alleviating some of the delays associated with retrieving instructions from external memory. Ideally, when an instruction is required for execution, it has already been transferred into the instruction cache so that it may immediately be forwarded to the next stage in the pipeline. Finding a requested instruction within the instruction cache is referred to as a cache hit. A cache miss occurs when the requested instruction is not found within the cache and the pipeline must be stalled while the requested instruction is retrieved from external memory. Virtually all present day microprocessors have an on-board instruction cache, the average cache size being approximately 64 KB.
The next stage of a present day pipeline, the translate (or decode) stage, deals with converting a macro instruction into a sequence of associated micro instructions for execution by subsequent stages of the microprocessor. Macro instructions specify high-level operations such as arithmetic operations, Boolean logic operations, and data load/store operations: operations that are too complex to be performed within one given stage of a pipeline. Because of this, the macro instructions are decoded and functionally decomposed by logic in the translate stage into the sequence of micro instructions having sub-tasks which can be efficiently executed within each of the pipeline stages, thus precluding bottlenecks in subsequent stages of the pipeline. Decoded micro instructions are then issued sequentially to the subsequent stages for execution.
The format and composition of micro instructions for a particular microprocessor are unique to that particular microprocessor design and are hence tailored to execute very efficiently on that microprocessor. In spite of this, the translation of macro instructions into micro instructions without causing undue pipeline delays persists as a significant challenge to microprocessor designers. More specifically, translation of x86 macro instructions is particularly difficult and time consuming, primarily because x86 instructions can vary from 1 to 15 bytes in length and their opcode bytes (i.e., the bytes that provide the essential information about the format of a particular instruction) can follow up to four optional prefix bytes. One skilled in the art will agree that marking boundaries between macro instructions and designating the bytes containing opcodes is a task that is common to the translation of all macro instructions. This task of determining initial information about macro instructions is referred to as predecoding.
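By way of illustration only, the designation of prefix and opcode bytes described above can be sketched as follows. The sketch assumes that the bytes begin on an instruction boundary and treats only the legacy x86 prefix values as prefixes; the names find_opcode_offset and predecode_flags, and the labels they emit, are hypothetical and are not taken from any particular microprocessor. Determining the full length of an instruction requires considerably more logic and is not attempted here.

```python
# Legacy x86 prefix byte values.
LEGACY_PREFIXES = frozenset({
    0xF0, 0xF2, 0xF3,                    # LOCK, REPNE, REP
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,  # CS, SS, DS, ES, FS, GS overrides
    0x66, 0x67,                          # operand-size, address-size overrides
})

MAX_PREFIXES = 4  # the text cites up to four optional prefix bytes


def find_opcode_offset(instruction_bytes):
    """Return the index of the first non-prefix byte (hypothetical helper).

    Assumes instruction_bytes starts on an instruction boundary.
    """
    offset = 0
    while (offset < len(instruction_bytes)
           and offset < MAX_PREFIXES
           and instruction_bytes[offset] in LEGACY_PREFIXES):
        offset += 1
    return offset


def predecode_flags(instruction_bytes):
    """Label each byte 'prefix', 'opcode', or 'other' (illustrative labels)."""
    opcode_at = find_opcode_offset(instruction_bytes)
    flags = []
    for i in range(len(instruction_bytes)):
        if i < opcode_at:
            flags.append("prefix")
        elif i == opcode_at:
            flags.append("opcode")
        else:
            flags.append("other")
    return flags
```

For example, the byte sequence 66 B8 34 12 would be labeled as one prefix byte, one opcode byte, and two remaining bytes.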
As macro instruction sets continue to grow, exemplified by the addition of MMX instructions to the x86 instruction set in the late 1990s, the operations (and attendant clock cycles) required to decode these instructions have caused attention to be directed again to overcoming bottlenecks in the translate stage. Consequently, to more evenly balance the operations performed within stages of the pipeline, more recent microprocessor designs have shifted the predecoding operation up into the fetch stage.
There are two techniques used today to predecode macro instructions in the fetch stage. The first technique, employed within the Intel Pentium II/III series of microprocessors, performs the predecoding operation following retrieval of the bytes of a macro instruction from the instruction cache. Accordingly, predecode logic generates a predecode field corresponding to each byte of the macro instruction and provides these fields along with the bytes in a macro instruction queue. Translation logic then retrieves the instruction bytes and predecode fields from the queue as required. Under some conditions, the time required to perform predecoding in this manner is actually transparent to the pipeline because the translation logic is still able to access bytes from the queue for translation while subsequent bytes are being predecoded. But when the queue is empty, the pipeline must be stalled until predecoding completes.
A second technique for predecoding is believed to be employed within Advanced Micro Devices' K6 series of microprocessors. This second technique performs predecoding prior to inserting bytes of a macro instruction into the instruction cache. Accordingly, the time required to predecode instruction bytes is absorbed into the time required to retrieve cache lines from external memory. Predecode information fields corresponding to each instruction byte fetched from memory must then be stored alongside each instruction byte in the instruction cache. Hence, although this second predecoding technique may alleviate potential bottlenecks in the fetch stage, it requires a significantly larger instruction cache than would otherwise be needed.
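The storage cost of the second technique can be estimated with simple arithmetic. The following sketch is illustrative only: the width of the predecode field is not given above, so predecode_bits_per_byte is a hypothetical parameter, and both function names are assumptions introduced for this example.

```python
def predecoded_cache_bits(instruction_bytes, predecode_bits_per_byte):
    """Total storage bits when a predecode field accompanies every byte."""
    return instruction_bytes * (8 + predecode_bits_per_byte)


def overhead_fraction(predecode_bits_per_byte):
    """Extra storage relative to a cache that holds instruction bytes only."""
    return predecode_bits_per_byte / 8
```

Under the assumption of a 5-bit predecode field, a cache holding 64 KB of instruction bytes would need 5/8, or 62.5 percent, more storage than a cache that holds the instruction bytes alone.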
Neither of the two above techniques sufficiently addresses the predecoding problem. The first approach still can present stalls in the pipeline because predecoding is not performed in parallel with some other function in the fetch stage. The second approach requires a significantly larger cache, which results in more complex and costly parts.
Therefore, what is needed is a predecoding apparatus in a pipeline microprocessor that performs predecoding in parallel with another operation in the fetch stage.
In addition, what is needed is an apparatus in a pipeline microprocessor for predecoding macro instructions that does not require a larger instruction cache to store predecode information.
Furthermore, what is needed is a macro instruction predecoder that can predecode bytes as they are provided within an instruction cache yet prior to the time that a cache hit is declared.
Finally, what is needed is a method for predecoding macro instructions within instruction cache logic wherein the predecoding operation is performed in parallel with the operations required to determine whether a cache hit has occurred or not.
To address the above-detailed deficiencies, it is an object of the present invention to provide predecoding logic that performs predecoding in parallel with another mandatory operation in the fetch stage of a pipeline microprocessor.
Accordingly, in the attainment of the aforementioned object, it is a feature of the present invention to provide an apparatus in a pipeline microprocessor for predecoding macro instructions. The apparatus includes an instruction cache, predecode logic, way selection logic, a translation lookaside buffer, and tag compare logic. The instruction cache stores instruction bytes. The instruction cache has a plurality of cache ways and a plurality of tag arrays. Each of the plurality of tag arrays is coupled to a corresponding cache way, and stores tags within, where specific tags within each of the plurality of tag arrays denote specific physical page addresses associated with corresponding cache lines within the cache ways. The predecode logic is coupled to the instruction cache. The predecode logic predecodes instruction bytes retrieved from each of the plurality of cache ways. Each of the plurality of cache ways includes cache lines, and the instruction bytes are stored within the cache lines. Low-order bits of a requested linear address are provided to each of the plurality of cache ways to indicate the instruction bytes. The low-order bits are provided to the plurality of tag arrays to retrieve a corresponding plurality of indexed tags, where each of the plurality of indexed tags is retrieved from a corresponding one of the plurality of tag arrays. The way selection logic is coupled to the predecode logic. The way selection logic provides for translation of a set of the instruction bytes and associated predecode data entities that correspond to one of the plurality of cache ways. The translation lookaside buffer is coupled to the way selection logic. The translation lookaside buffer translates upper address bits of the requested linear address into a requested physical page address. The tag compare logic is coupled to the translation lookaside buffer. 
The tag compare logic compares the requested physical page address with the corresponding plurality of indexed tags, and selects one of the plurality of indexed tags that matches the requested physical page address, thereby denoting the one of the plurality of cache ways. The tag compare logic directs the way selection logic to provide the set of the instruction bytes and the associated predecode data entities for translation. The predecode logic predecodes the instruction bytes while the translation lookaside buffer converts the upper address bits of the requested linear address into the requested physical page address.
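The cooperation of the elements described above can be modeled in software, purely as a sketch. The model below assumes 4 KB pages and 32-byte cache lines, represents the translation lookaside buffer as a simple dictionary, and uses a placeholder predecode function; all names (Way, predecode, fetch) are hypothetical. In hardware, the predecoding and the address translation would proceed at the same time, whereas the model simply performs them one after the other.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12   # assumption: 4 KB pages
LINE_BYTES = 32   # assumption: 32-byte cache lines


@dataclass
class Way:
    tags: dict    # set index -> physical page number (models the tag array)
    lines: dict   # set index -> bytes of the stored cache line


def predecode(line):
    """Placeholder predecoder: emits one dummy data entity per byte."""
    return [b & 0x0F for b in line]


def fetch(linear_address, ways, tlb, num_sets):
    """Return (way number, bytes, predecode data) on a hit, else None."""
    index = (linear_address // LINE_BYTES) % num_sets   # low-order bits
    linear_page = linear_address >> PAGE_SHIFT          # upper bits

    # The low-order bits read one tag and one candidate line from every way.
    candidates = [(w.tags.get(index), w.lines.get(index)) for w in ways]

    # Every candidate line is predecoded before the hit is known.
    predecoded = [predecode(line) if line is not None else None
                  for _, line in candidates]

    # Address translation (concurrent with predecoding in hardware).
    physical_page = tlb.get(linear_page)
    if physical_page is None:
        return None  # a TLB miss is treated as a plain miss in this sketch

    # Tag compare picks the matching way; way selection forwards its data.
    for way_number, (tag, line) in enumerate(candidates):
        if line is not None and tag == physical_page:
            return way_number, line, predecoded[way_number]
    return None
```

The predecode work performed for the ways that do not match is discarded, which is the price this arrangement pays for starting before the cache hit is determined.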
An advantage of the present invention is that the time to predecode macro instructions is transparent to the instruction pipeline.
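The sense in which the predecode time is transparent can be illustrated with a simple latency comparison. This is a sketch under stated assumptions: the three delay parameters are hypothetical quantities in arbitrary time units, the tag compare is assumed to follow address translation, and the function names are introduced only for this example.

```python
def serial_fetch_time(t_translate, t_tag_compare, t_predecode):
    """Predecoding performed only after the hit has been determined."""
    return t_translate + t_tag_compare + t_predecode


def overlapped_fetch_time(t_translate, t_tag_compare, t_predecode):
    """Predecoding performed alongside address translation."""
    return max(t_translate, t_predecode) + t_tag_compare
```

Whenever predecoding takes no longer than address translation, the overlapped arrangement adds no time at all for predecoding; if predecoding takes longer, only the excess is exposed.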
Another object of the present invention is to provide a pipeline microprocessor predecoding apparatus that determines initial information about macro instructions without requiring a larger instruction cache to store the initial information.
In another aspect, it is a feature of the present invention to provide an instruction predecoding apparatus within a pipeline microprocessor for predecoding bytes of an instruction associated with a next instruction pointer. The predecoding apparatus has a plurality of predecoders, a translation lookaside buffer, select logic, a plurality of tag arrays, and tag compare logic. The plurality of predecoders generates predecode data entities corresponding to each of a plurality of indexed instruction bytes. The plurality of indexed instruction bytes are received from a plurality of cache ways and are indexed by low-order bits of the next instruction pointer. The plurality of cache ways stores the plurality of indexed instruction bytes as cache lines, where each of the cache lines is 32 bytes in length, and where the low-order bits index 16 of the plurality of indexed instruction bytes from each of the plurality of cache ways. The translation lookaside buffer is coupled to the plurality of cache ways. The translation lookaside buffer receives upper bits of the next instruction pointer and translates the upper bits into a physical page number. The select logic is coupled to the plurality of predecoders. The select logic provides a subset of the plurality of indexed instruction bytes and a corresponding subset of the predecode data entities for decoding, where the subset of the plurality of indexed instruction bytes contains the bytes of the instruction and is provided from one of the plurality of cache ways. The plurality of tag arrays stores tags corresponding to the cache lines, where each of the tags within denotes a specific physical page number corresponding to a specific cache line, and where the low-order bits are employed to retrieve indexed tags. Each of the indexed tags is retrieved from a corresponding one of the plurality of tag arrays. The tag compare logic is coupled to the translation lookaside buffer. 
The tag compare logic compares the physical page number with the indexed tags, and designates that the subset of the plurality of indexed instruction bytes comes from the one of the plurality of cache ways. The tag compare logic directs the select logic to provide the subset of the plurality of indexed instruction bytes and the corresponding subset of predecode data entities for translation. The plurality of predecoders predecode the indexed instruction bytes in parallel with the translation of the upper address bits of the next instruction pointer into the physical page number.
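The division of the next instruction pointer into upper bits and low-order bits can be sketched as follows. The 32-byte line length and the 16 indexed bytes come from the description above; the 4 KB page size, the reading of the 16 bytes as one half of a cache line, and the function and field names are assumptions made only for this illustration.

```python
LINE_BYTES = 32     # cache line length stated above
FETCH_BYTES = 16    # bytes indexed from each cache way, as stated above
PAGE_BYTES = 4096   # assumption: 4 KB pages


def split_next_instruction_pointer(nip):
    """Split a next instruction pointer into illustrative fields."""
    line_offset = nip % LINE_BYTES
    return {
        "linear_page": nip // PAGE_BYTES,            # upper bits, sent to the TLB
        "page_offset": nip % PAGE_BYTES,             # low-order bits, untranslated
        "line_offset": line_offset,                  # byte position within the line
        "fetch_block": line_offset // FETCH_BYTES,   # 0 = lower half, 1 = upper half
        "byte_in_block": nip % FETCH_BYTES,          # position within the 16 bytes
    }
```

Because the low-order bits require no translation, they are available to index the cache ways and tag arrays at the same time that the upper bits are presented to the translation lookaside buffer.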
Another advantage of the present invention is that pipeline delays due to predecoding can be precluded without significantly increasing the complexity, power consumption, or cost of a microprocessor.
A further object of the present invention is to provide a macro instruction predecoder that can predecode bytes as they are provided within an instruction cache yet prior to the time that a cache hit is declared.
In a further aspect, it is a feature of the present invention to provide a pipeline microprocessor that determines associated initial information about bytes of an instruction prior to translating the instruction into associated micro instructions. The pipeline microprocessor includes an instruction cache and TLB logic. The instruction cache stores the instruction. The instruction cache includes cache ways, instruction predecoding logic, a mux, and tag arrays. The cache ways store cache lines retrieved from memory, where the bytes of the instruction are stored within a specific indexed cache line within a specific cache way, and where the specific indexed cache line is one of a plurality of indexed cache lines taken from each of the cache ways. The instruction cache comprises four of the cache ways, and each of the cache ways stores 512 of the cache lines. The instruction predecoding logic is coupled to the cache ways, and determines initial information concerning each of the plurality of indexed cache lines. The mux is coupled to the instruction predecoding logic. The mux selects the specific cache way, and provides the bytes of the macro instruction, along with corresponding specific initial information bytes for translation. The tag arrays are each configured to store tags associated with each of the cache lines. The TLB logic is coupled to the instruction cache. The TLB logic translates a linear address corresponding to the macro instruction into a physical page address, and directs the mux to select the specific cache way. The TLB logic compares the physical page address to indexed tags taken from each of the tag arrays to determine which of the cache ways contains the bytes of the instruction. The TLB logic directs the mux to provide the bytes of the instruction along with the associated initial information for translation. The instruction predecoding logic determines the initial information while the TLB logic translates the linear address.
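The capacity implied by this organization follows from a short calculation. In the sketch below, the four ways and 512 cache lines per way are taken from the description above, while the 32-byte line length is an assumption carried over from the preceding aspect; the function names are hypothetical.

```python
def cache_capacity_bytes(ways, lines_per_way, line_bytes):
    """Instruction bytes held by a set-associative cache."""
    return ways * lines_per_way * line_bytes


def index_bits(lines_per_way):
    """Address bits needed to select one line within a way."""
    return (lines_per_way - 1).bit_length()
```

With four ways of 512 lines each and an assumed 32-byte line, the cache holds 4 x 512 x 32 = 65,536 bytes, or 64 KB, and nine address bits suffice to select a line within each way.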
Yet a further object of the present invention is to provide a method for predecoding macro instructions within instruction cache logic wherein the predecoding operation is performed in parallel with the operations required to determine whether a cache hit has occurred or not.
In yet a further aspect, it is a feature of the present invention to provide a method for predecoding macro instructions in a pipeline microprocessor. The method includes retrieving indexed cache line bytes from each of a plurality of cache ways; translating a linear address into a requested physical page address; in parallel with the translating, predecoding the indexed cache line bytes to generate corresponding bytes of predecode information; and providing, for instruction decoding, a subset of the indexed cache line bytes and a subset of the corresponding bytes of predecode information.
Yet a further advantage of the present invention is that predecoding can be performed in parallel with cache hit determination operations without a requirement to store predecode data within an instruction cache.