1. Field of the Invention
This invention relates in general to the field of pipelined microprocessors, and more particularly to microprocessor data cache operations.
2. Description of the Related Art
Modern microprocessors operate on several instructions at the same time, within different blocks or pipeline stages of the microprocessor. Hennessy and Patterson define pipelining as “an implementation technique whereby multiple instructions are overlapped in execution.” Computer Architecture: A Quantitative Approach, 2nd edition, by John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, San Francisco, Calif., 1996. The authors go on to provide the following excellent illustration of pipelining:
A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, though on a different car. In a computer pipeline, each step in the pipeline completes a part of an instruction. Like the assembly line, different steps are completing different parts of the different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe—instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.
An example of a pipeline stage, typically at the top of the pipeline, is one that fetches instructions from memory for the pipeline to execute. Another example is a stage that calculates addresses of data operands to be loaded from or stored to memory as specified by the instruction in the stage. Another example is a stage that performs arithmetic operations, such as adds or multiplies, on data operands associated with the instruction in the stage. Each of the stages is separated by a pipeline register that saves the output of the pipeline stage above the register at the end of a clock cycle and provides that output to the pipeline stage below the register at the beginning of the next clock cycle.
Typically, each stage performs its function during one processor clock cycle. Thus, every clock cycle each instruction in the pipeline progresses downward one stage along the pipeline. However, certain events or conditions prevent an instruction from executing in a given stage and prevent the instruction from progressing to the next stage in the pipeline on the next clock cycle. These conditions are referred to as “stall conditions” because the pipeline must be “stalled” until the condition is resolved. That is, all instructions above the stalled instruction in the pipeline are held in their current stage by the pipeline registers rather than being allowed to progress to the next stage. Instructions below the stalled instruction stage may continue down the pipeline. There are three main causes of stalls: resource conflicts, data hazards and cache misses.
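The effect of a stall on the instructions behind it can be illustrated with a minimal timing sketch in Python. It assumes a hypothetical single-issue in-order pipeline in which, absent stalls, one instruction leaves the final stage every clock cycle; the function name, the five-stage depth and the stall values are illustrative assumptions rather than details of any particular processor.

```python
def completion_cycles(stall_cycles, depth=5):
    """Return the clock cycle in which each instruction leaves the last stage.

    stall_cycles[i] is the number of extra cycles instruction i is held in
    a stage (0 means no stall). depth is the assumed number of stages.
    Simplified model: instructions enter one per cycle in program order,
    and a stalled instruction delays every instruction behind it.
    """
    finish = []
    done = depth - 1  # so an unstalled first instruction finishes in cycle `depth`
    for stall in stall_cycles:
        # Each instruction finishes one cycle after its predecessor,
        # plus however long it was itself stalled.
        done = done + 1 + stall
        finish.append(done)
    return finish


if __name__ == "__main__":
    print(completion_cycles([0, 0, 0]))  # no stalls
    print(completion_cycles([0, 3, 0]))  # second instruction stalls 3 cycles
```

In this model a three-cycle stall on the second instruction pushes back both that instruction and the one behind it, while the instruction ahead of it is unaffected.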
Resource conflicts occur when the hardware components in the microprocessor cannot service a given combination of instructions in simultaneous overlapped execution within the pipeline. For example, a processor may support an arithmetic instruction, such as a floating point or MMX multiply instruction. The hardware may include a multiplier circuit that requires multiple processor clock cycles to perform the multiply and the multiplier is not itself pipelined, i.e., it cannot receive a second multiply instruction until it has completed the current multiply instruction. In this case, the processor must stall the pipeline at the multiplier stage.
Data hazards, or data dependencies, are another main cause of pipeline stalls. Data hazards occur when an instruction depends on the results of an instruction ahead of it in the pipeline, and therefore cannot be executed until the first instruction executes. One class of data hazards occurs when instructions access input/output (I/O) devices.
I/O devices typically include status and control registers that are read and written by the microprocessor. Some microprocessors, such as x86 processors, have dedicated instructions for accessing the registers of I/O devices, such as the x86 “in” and “out” instructions. These instructions address a separate address space of the processor bus, namely the I/O space. The other way I/O devices are accessed is by mapping them into the memory address space of the processor. Such an I/O device is referred to as a memory-mapped I/O device and the region in which the I/O device is mapped is referred to as a memory-mapped I/O region. Typically, memory mapped I/O regions are specified via registers within the microprocessor.
An example of an I/O related data hazard occurs when a first instruction writes a value to an I/O register and the next instruction reads from an I/O register on the same device, such as a store to a memory-mapped I/O region followed by a load from the same memory-mapped I/O region. Due to the nature of I/O devices, in order to ensure proper operation of the I/O device, the two instructions must be guaranteed to execute in order. That is, the read cannot be executed until the write has completed.
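The ordering condition in the store-then-load example can be sketched as a simple address check. The Python below is an illustration only: representing each memory-mapped I/O region as a hypothetical (base, size) pair, and the function name itself, are assumptions, since the text says only that such regions are specified via registers within the microprocessor.

```python
def in_same_mmio_region(store_address, load_address, mmio_regions):
    """Return True if both addresses fall inside the same memory-mapped I/O region.

    mmio_regions is a list of (base, size) pairs, a hypothetical stand-in
    for the region registers mentioned in the text. When this returns True,
    the load must not execute until the store has completed.
    """
    for base, size in mmio_regions:
        store_hit = base <= store_address < base + size
        load_hit = base <= load_address < base + size
        if store_hit and load_hit:
            return True
    return False


if __name__ == "__main__":
    regions = [(0xF0000000, 0x1000)]  # assumed example region
    print(in_same_mmio_region(0xF0000000, 0xF0000004, regions))
    print(in_same_mmio_region(0x00001000, 0xF0000004, regions))
```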
Cache misses are a third common cause of pipeline stalls. Program execution speed often is affected as much by memory access time as by instruction execution time. This is readily observable from the fact that a typical system memory access might take 40 processor clock cycles, whereas a typical average execution time per instruction in a well-designed pipelined processor is between 1 and 2 processor clock cycles.
Load and store instructions are used to access memory. Load instructions read data from system memory and store instructions write data to system memory. When a memory access instruction reaches a stage in a processor pipeline where the memory access is performed, the pipeline must stall waiting for the memory access to complete. That is, during the typical 40 clock cycles of the memory access, the memory access instruction remains in its current stage until the specified data is written or read. When a stall occurs, all of the other instructions in the pipeline behind the stalled instruction also wait for the stalled memory access instruction to resolve and move on down the pipeline.
Processor designers attempt to alleviate the memory access time problem by employing cache memories within the processor. Data caches, which commonly require only one or two clock cycles per memory access, significantly reduce the negative effects of stalls caused by load and store instructions introduced by the large system memory access times. However, when a cache miss occurs, a pipeline stall must ensue.
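The benefit of a data cache, and the cost of each miss, can be made concrete with the standard average-access-time calculation. The Python sketch below uses the 40-cycle system memory access and a one-cycle cache access from the discussion above; the hit rates passed to it are assumed values chosen only for illustration.

```python
def average_access_cycles(hit_rate, cache_cycles=1, memory_cycles=40):
    """Average clock cycles per memory access with a single-level data cache.

    Every access pays cache_cycles; a miss additionally pays memory_cycles.
    hit_rate is an assumed fraction in [0, 1], not a figure from the text.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return cache_cycles + miss_rate * memory_cycles


if __name__ == "__main__":
    for rate in (1.0, 0.95, 0.5, 0.0):
        print(rate, average_access_cycles(rate))
```

Even a small miss rate moves the average well away from the one-cycle hit time, which is why a run of consecutive misses is so costly.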
Some microprocessor designers have attempted to improve on the pipelined approach by “widening” the processor, i.e., by adding more pipelines within the processor in order to execute multiple instructions in parallel and to execute those instructions out of program order where advantageous and possible. These processors are commonly referred to as “superscalar” or “multiple-issue” processors since they issue multiple instructions at a time into multiple pipelines for parallel execution. Another term associated with the techniques employed by multiple-pipeline processors is instruction level parallelism (ILP).
Typically, processor architectures require the processor to retire instructions in-order. That is, any program-visible processor state changes must be made in the order of the program instruction sequence. However, multiple-issue processors commonly execute instructions out of order by employing reorder buffers. The processor fetches a stream of instructions of a program from memory and places the instructions into the top of the reorder buffer. The processor searches the reorder buffer looking for dependencies between the various instructions, such as data hazards or resource conflicts discussed above.
Instructions that do not have dependencies may be reordered within the reorder buffer for out of order execution. The instructions are then removed from the bottom of the reorder buffer and distributed to different pipelines within the superscalar processor for potential out of order execution.
To illustrate, a superscalar processor might receive a load instruction requiring a memory access followed by an add instruction not requiring a memory access. If the two instructions are independent, the superscalar processor will issue the load instruction to one pipeline and the add instruction to another. Since the add instruction does not require a memory access, it will likely complete before the load instruction, even though the load instruction precedes the add instruction in the program sequence.
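The load and add example can be expressed as a small timing sketch in Python. The model is deliberately simplified and hypothetical: both instructions are assumed to issue in cycle 0 to separate pipelines, the load is assumed to take the 40 cycles of a system memory access, and the add is assumed to take one cycle. The function and parameter names are illustrative.

```python
def dual_issue_finish(load_latency=40, add_latency=1, add_depends_on_load=False):
    """Return (load_finish, add_finish) in cycles for a load followed by an add.

    Assumed model: two pipelines, both instructions available in cycle 0.
    An independent add executes immediately; a dependent add must wait
    for the load's result before it can execute.
    """
    load_finish = load_latency
    if add_depends_on_load:
        add_finish = load_finish + add_latency
    else:
        add_finish = add_latency
    return load_finish, add_finish


if __name__ == "__main__":
    print(dual_issue_finish())                          # independent add
    print(dual_issue_finish(add_depends_on_load=True))  # dependent add
```

When the add is independent it finishes execution long before the load, although under in-order retirement its result still becomes program-visible only after the load retires.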
Out of order execution is a common characteristic of multiple-issue processors facilitated by their reorder buffers. Stated alternatively, out of order processors have the capability to reorder instructions soon after they are fetched into the processor so that the reordered instructions are sent down the pipelines of the processor for execution in different order than specified by the program that they constitute, as illustrated in the previous example. In contrast, an in-order single-pipeline processor sends instructions down its pipeline in program order.
However, superscalar processors have their disadvantages. First, multiple instruction issue and out of order execution add complexity to the processor design that typically results in greater cost in terms of reduced clock speeds, larger die sizes and longer development periods. Furthermore, it has been observed that in practice processor throughput does not scale with the number of pipelines added. For example, a typical dual-pipeline processor may provide on the order of 1.3 times the instruction throughput of a comparable single-pipeline processor in executing typical programs.
Finally, it has been observed that the throughput improvement enjoyed by superscalar processors is largely a function of the degree of parallelism exhibited by the particular software program being executed. Computationally intensive programs, such as CAD programs or graphic-intensive games, exhibit high degrees of parallelism. Superscalar processors generally execute these programs much faster than comparable single-pipeline processors. In contrast, business oriented programs, such as word processors, exhibit low degrees of parallelism and show relatively slight improvement in execution times on superscalar processors over single-pipeline processors.
The most common explanation for these observations is that, as stated above, program execution speed often is dominated by memory access time rather than instruction execution time. That is, the detrimental impact on processor performance that large memory access latencies impose often dominates gains made by multiple instruction issue and out of order execution. Thus, memory access latency hampers both superscalar and single-issue in-order microprocessor performance.
Although data caches help alleviate the memory access latency problem, as described above, they do not address certain situations, such as when a new data set is brought into the cache. For example, a new data set must be brought in when a newly loaded program begins to access its data. Additionally, an already loaded program may begin to access new data, such as a new database or document file, or new records of an already accessed database. In these situations, a relatively long series of load instructions will be executed, often in a program loop, to load the data from memory to be operated upon by the processor. The load instructions generate a series of cache misses.
As mentioned above, the added complexity of superscalar processors has a negative impact on clock speed, die size and development periods. Therefore, single-pipeline in-order processors may be desirable in many contexts.
However, one problem that may be observed from the preceding discussion is that the serialization of memory accesses behind a series of memory access instructions in a single-pipeline in-order processor can have devastating effects on performance. In a common situation, a first cache miss is detected and the pipeline stalls while the missing data is loaded from system memory. The load of the missing data typically requires approximately 40 processor clock cycles. When the data is returned from system memory and placed into the cache, the pipeline stall ends. Then, the next instruction (or perhaps second or third instruction) generates a cache miss and the pipeline stalls while the missing data is loaded from system memory, which requires another 40 clock cycles. This continues until the new data set is loaded into the cache.
Therefore, what is needed is a single instruction issue in-order execution microprocessor that reduces memory access latency by detecting cache misses generated by instructions behind a stalled instruction and overlapping requests for the missing data with resolution of the stalled instruction.
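The difference between serialized and overlapped misses can be estimated with simple arithmetic. The Python sketch below uses the 40-cycle miss latency from the discussion above. The overlapped case is an idealized assumption, not a description of any particular mechanism: it supposes that a new request can be sent to system memory every issue_gap cycles and that all outstanding requests proceed concurrently.

```python
def serialized_miss_cycles(num_misses, miss_latency=40):
    """Stall cycles when each miss waits for the previous one to resolve."""
    return num_misses * miss_latency


def overlapped_miss_cycles(num_misses, miss_latency=40, issue_gap=1):
    """Idealized stall cycles when requests for missing data overlap.

    Assumes (hypothetically) that a new request can be issued every
    issue_gap cycles and that memory services all of them concurrently.
    """
    if num_misses == 0:
        return 0
    return miss_latency + (num_misses - 1) * issue_gap


if __name__ == "__main__":
    misses = 8  # assumed length of the miss sequence
    print(serialized_miss_cycles(misses))
    print(overlapped_miss_cycles(misses))
```

Under these assumptions the serialized cost grows by a full memory latency per miss, whereas the overlapped cost grows only by the gap between requests.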
In addition, most modern microprocessors support virtual memory systems. In a virtual memory system, programs specify data using virtual addresses corresponding to the address space of the processor. The virtual address space is larger than the amount of physical memory in the system. The physical memory is backed by a permanent disk storage system. The physical memory is managed by the operating system as fixed size blocks, typically 4 KB in size, called pages. At any given time, a page may reside in the physical system memory or on disk. As a page is needed and brought into physical memory from disk, another page presently in physical memory, typically the least recently used page, is swapped out to disk and the new page replaces the swapped-out page.
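The least-recently-used replacement just described can be sketched with a toy model in Python. The class and method names are hypothetical, the number of page frames is an assumed parameter, and a real operating system tracks considerably more state than this sketch does.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 KB pages, as in the text


class PhysicalMemory:
    """Toy model of physical memory holding a fixed number of pages."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        # Keys are resident virtual page numbers, least recently used first.
        self.resident = OrderedDict()

    def touch(self, virtual_address):
        """Access an address; return the page number swapped out, or None."""
        page = virtual_address // PAGE_SIZE
        if page in self.resident:
            # Page already resident: mark it most recently used.
            self.resident.move_to_end(page)
            return None
        evicted = None
        if len(self.resident) >= self.num_frames:
            # Memory is full: swap out the least recently used page.
            evicted, _ = self.resident.popitem(last=False)
        self.resident[page] = None
        return evicted


if __name__ == "__main__":
    memory = PhysicalMemory(num_frames=2)
    for address in (0, 4096, 0, 8192):
        print(address, memory.touch(address))
```

In the usage example, the final access evicts page 1 rather than page 0, because page 0 was touched more recently.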
A by-product of the “paging” process and the fact that the physical and virtual memory sizes do not match is that the processor must translate the virtual addresses into physical addresses. This process is referred to as “page translation.” To perform page translation, a processor searches or “walks” data structures in system memory referred to as page tables that provide the necessary address translation information.
Page table walks can be time consuming, because unless the page table data is in the data cache, page table walks involve memory accesses. For this reason, processors typically employ a small hardware cache referred to as a translation lookaside buffer (TLB) to cache already translated physical addresses. When a processor performs a page table walk and translates a virtual memory address into a physical memory address, the processor caches the physical address in the TLB. In single-issue in-order processors, TLB misses are serialized like cache misses and therefore also negatively impact processor performance.
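The interaction between the TLB and the page table walk can be sketched as follows. This Python model is a simplification under stated assumptions: a dictionary stands in for the page tables in system memory, the TLB capacity and its first-in first-out replacement are assumed for illustration, and all names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


class TLB:
    """Toy translation lookaside buffer in front of a page table."""

    def __init__(self, page_table, capacity=4):
        # page_table maps virtual page number -> physical page number and
        # stands in for the page tables that reside in system memory.
        self.page_table = page_table
        self.capacity = capacity
        self.entries = {}  # cached translations, oldest first
        self.walks = 0     # number of page table walks performed

    def translate(self, virtual_address):
        """Return the physical address for virtual_address."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        frame = self.entries.get(page)
        if frame is None:
            # TLB miss: walk the page table (a slow memory access in hardware).
            self.walks += 1
            frame = self.page_table[page]
            if len(self.entries) >= self.capacity:
                # Evict the oldest cached translation (assumed FIFO policy).
                del self.entries[next(iter(self.entries))]
            self.entries[page] = frame
        return frame * PAGE_SIZE + offset


if __name__ == "__main__":
    tlb = TLB({0: 5, 1: 9})
    print(hex(tlb.translate(0x10)), tlb.walks)  # miss, walk required
    print(hex(tlb.translate(0x20)), tlb.walks)  # hit, same page
```

Only the first access to each page incurs a walk; later accesses to the same page are served from the cached translation.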
Therefore, what is needed is a single-issue in-order microprocessor that reduces TLB miss latency by detecting TLB misses generated by instructions behind a stalled instruction and overlapping page table walks with resolution of the stalled instruction.