The present invention relates generally to digital data processing, and more particularly to pipelined operations in a processing unit of a data processing system.
A modern computer system typically comprises a central processing unit (CPU) and supporting hardware necessary to store, retrieve and transfer information, such as communications busses and memory. It also includes hardware necessary to communicate with the outside world, such as input/output controllers or storage controllers, and devices attached thereto such as keyboards, monitors, tape drives, disk drives, communication lines coupled to a network, etc. The CPU is the heart of the system. It executes the instructions which comprise a computer program and directs the operation of the other system components.
From the standpoint of the computer's hardware, most systems operate in fundamentally the same manner. Processors are capable of performing a limited set of very simple operations, such as arithmetic, logical comparisons, and movement of data from one location to another. But each operation is performed very quickly. Programs which direct a computer to perform massive numbers of these simple operations give the illusion that the computer is doing something sophisticated. What is perceived by the user as a new or improved capability of a computer system is made possible by performing essentially the same set of very simple operations, but doing it much faster. Therefore continuing improvements to computer systems require that these systems be made ever faster.
The overall speed of a computer system (also called the “throughput”) may be crudely measured as the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, and particularly the clock speed of the processor. E.g., if everything runs twice as fast but otherwise works in exactly the same manner, the system will perform a given task in half the time. Early computer processors, which were constructed from many discrete components, were susceptible to significant speed improvements by shrinking component size, reducing component number, and eventually, packaging the entire processor as an integrated circuit on a single chip. The reduced size made it possible to increase the clock speed of the processor, and accordingly increase system speed.
Despite the enormous improvement in speed obtained from integrated circuitry, the demand for ever faster computer systems has continued. Hardware designers have been able to obtain still further improvements in speed by greater integration (i.e., increasing the number of circuits packed onto a single chip), by further reducing the size of the circuits, and by various other techniques. However, designers can see that physical size reductions cannot continue indefinitely, and there are limits to their ability to continue to increase clock speeds of processors. Attention has therefore been directed to other approaches for further improvements in overall speed of the computer system.
Without changing the clock speed, it is possible to improve system throughput by using multiple copies of certain components, and in particular, by using multiple CPUs. The modest cost of individual processors packaged on integrated circuit chips has made this practical. While there are certainly potential benefits to using multiple processors, additional architectural issues are introduced. Without delving deeply into these, it can still be observed that there are many reasons to improve the speed of the individual CPU, whether a system uses multiple CPUs or a single CPU. If the CPU clock speed is given, it is possible to further increase the speed of the individual CPU, i.e., the number of operations executed per second, by increasing the average number of operations executed per clock cycle.
Most modern processors employ some form of pipelining to increase the average number of operations executed per clock cycle, as well as one or more levels of cache memory to provide high-speed access to a subset of data in main memory. Pipelined instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished. Ideally, a new instruction begins with each clock cycle, and subsequently moves through a pipeline stage with each cycle. Even though an instruction may take multiple cycles or pipeline stages to complete, if the pipeline is always full, the processor executes one instruction every cycle.
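By way of illustration only, the ideal case described above can be expressed as a small calculation. The following is a minimal sketch, assuming a hypothetical pipeline with a fixed number of stages, one clock cycle per stage, and no stalls; the function names and the example values are illustrative assumptions and are not taken from any particular processor.

```python
def cycles_unpipelined(num_instructions: int, num_stages: int) -> int:
    """Cycles needed if each instruction finishes all stages before the next begins."""
    return num_instructions * num_stages


def cycles_pipelined(num_instructions: int, num_stages: int) -> int:
    """Cycles needed if a new instruction begins every cycle and the pipeline never stalls.

    The first instruction needs num_stages cycles to complete; each later
    instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


# Assumed example values: 1000 instructions, 5 pipeline stages.
print(cycles_unpipelined(1000, 5))  # 5000
print(cycles_pipelined(1000, 5))    # 1004
```

With these assumed values, the pipelined case approaches one instruction per cycle as the number of instructions grows, which is the ideal referred to above.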
Of course, the pipeline being always full is simply an ideal towards which designers strive, knowing that it is impossible to always keep the pipeline full. For various reasons, the pipeline will sometimes stall. For example, the instruction stream may take an unexpected branch to an instruction which is not in the cache, or may load data from a data location which is not in the immediate (lowest level) cache. In these cases, the processor cannot begin a new instruction, and must typically wait until the necessary instruction or data is fetched into the cache, either from a higher-level cache or from main memory.
There are other causes of pipeline stall. Among them are address conflicts between pipeline operations, particularly, between load and store operations. If a store operation stores data to an address X, and a load operation subsequently loads data from address X, care must be taken that the store operation completes before the load operation begins, or incorrect data may be loaded. In order to prevent erroneous operation, a processor using pipelined instruction execution typically compares the address of a target operand of certain operations with similar addresses of operations in the pipeline. If a conflict is detected, the subsequent operation must be delayed or restarted.
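By way of illustration only, the conventional comparison described above may be sketched as follows. This is a minimal sketch under stated assumptions: the `PipelineOp` record and the `has_address_conflict` function are hypothetical names, only the store-followed-by-load case is checked, and each address is taken to be the already-determined address of the target operand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineOp:
    kind: str     # "load" or "store" (assumed encoding)
    address: int  # address of the target operand


def has_address_conflict(new_op: PipelineOp, in_flight: List[PipelineOp]) -> bool:
    """Return True if new_op is a load whose address matches an earlier store
    that is still in the pipeline."""
    if new_op.kind != "load":
        return False
    return any(op.kind == "store" and op.address == new_op.address
               for op in in_flight)


in_flight = [PipelineOp("store", 0x71A4), PipelineOp("load", 0x9000)]
print(has_address_conflict(PipelineOp("load", 0x71A4), in_flight))  # True
print(has_address_conflict(PipelineOp("load", 0x9000), in_flight))  # False
```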
Some system designs, of which UNIX-based systems are an example, employ a form of virtual addressing which has the possibility of address aliasing. I.e., addresses derived from the instructions and generated by the processor, which are often referred to as “virtual addresses” or “effective addresses”, are mapped to addresses in the physical main memory of the system, generally referred to as “real addresses” or “physical addresses”, where it is possible that multiple virtual addresses map to the same real address. Because multiple virtual addresses may map to the same real address, the cache is typically accessed with a real address and only the real address may reliably be used to determine whether there is an address conflict in pipeline operations.
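By way of illustration only, the effect of address aliasing may be sketched as follows. This is a minimal sketch assuming a hypothetical 4096-byte page size and a hypothetical mapping table held in a dictionary; neither value is taken from any particular system. Two different virtual addresses map to the same real address, so a comparison of full virtual addresses would miss the conflict.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only

# Hypothetical mapping of virtual page number -> real page number.
# Virtual pages 0x10 and 0x25 are aliases of real page 0x7.
PAGE_TABLE = {0x10: 0x7, 0x25: 0x7, 0x30: 0x9}


def translate(virtual_address: int) -> int:
    """Map a virtual address to a real address; the offset within the page is unchanged."""
    virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
    return PAGE_TABLE[virtual_page] * PAGE_SIZE + offset


store_va = 0x10 * PAGE_SIZE + 0x1A4
load_va = 0x25 * PAGE_SIZE + 0x1A4

print(store_va == load_va)                        # False: virtual addresses differ
print(translate(store_va) == translate(load_va))  # True: same real address
```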
Typically, in order to obtain a real address of data for a data reference operation (e.g., a load or store operation), a portion of the virtual address is used to access a table called a translation lookaside buffer (TLB). The TLB is typically N-way set associative, providing N possible real address entries corresponding to the virtual address. A TLB lookup requires that the N entries be retrieved from the TLB, that each entry be compared to the virtual address, and that the real address corresponding to the matched entry be selected. These operations may require multiple clock cycles.
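By way of illustration only, the three lookup steps described above (retrieve the N entries, compare each entry, select the match) may be sketched as follows. This is a minimal software sketch, not a description of any actual hardware: the `TLB` class, the 12-bit page offset, the number of sets, the value of N, and the first-in-first-out replacement are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

PAGE_BITS = 12  # assumed width of the untranslated low-order portion
NUM_SETS = 4    # assumed number of sets
N_WAYS = 2      # assumed associativity (N)


class TLB:
    def __init__(self) -> None:
        # Each set holds up to N_WAYS (virtual_page, real_page) entries.
        self.sets: List[List[Tuple[int, int]]] = [[] for _ in range(NUM_SETS)]

    def insert(self, virtual_page: int, real_page: int) -> None:
        entries = self.sets[virtual_page % NUM_SETS]
        # Drop any stale entry for the same virtual page.
        entries[:] = [e for e in entries if e[0] != virtual_page]
        if len(entries) == N_WAYS:
            entries.pop(0)  # first-in-first-out replacement (an assumption)
        entries.append((virtual_page, real_page))

    def lookup(self, virtual_address: int) -> Optional[int]:
        virtual_page = virtual_address >> PAGE_BITS
        offset = virtual_address & ((1 << PAGE_BITS) - 1)
        entries = self.sets[virtual_page % NUM_SETS]  # retrieve up to N entries
        for tag, real_page in entries:                # compare each entry
            if tag == virtual_page:
                return (real_page << PAGE_BITS) | offset  # select the match
        return None  # miss: the translation is not in the TLB


tlb = TLB()
tlb.insert(0x10, 0x7)
print(hex(tlb.lookup(0x101A4)))  # 0x71a4
print(tlb.lookup(0x201A4))       # None
```

In hardware the N comparisons would ordinarily be made in parallel rather than in a loop; the loop here only models the logical result.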
Where an address conflict exists between pipeline operations, it is desirable to detect the conflict as soon as possible. The longer it takes to detect such a conflict, the greater is the potential performance impact. Not only is the conflicting instruction potentially compromised, but instructions occurring after the conflicting instruction may be compromised as well. Late detection of an address conflict requires that all potential data integrity exposures be rectified before proceeding. Since an address conflict cannot be detected until the virtual addresses are translated to real addresses, the time required to perform the TLB lookup directly delays the detection of an address conflict.
As processors grow more capable and more complex, the problem of address conflicts between pipeline operations will be magnified. Some newer processor designs employ so-called “Wide Issue Superscalar” or “Very Long Instruction Word” (VLIW) architectures, in which multiple operations are concurrently executed, and multiple loads and stores can be issued concurrently. Other processor designs also grow in complexity, as the lengths of pipelines increase, multiple pipelines may exist, multiple levels of cache may be supported, etc.
All of this growing complexity increases the number of active pipeline stages at any instant in time, which has two consequences. On the one hand, there is an increased likelihood of an address conflict, while at the same time, there is a greater potential performance impact of restarting the pipelines when an address conflict exists. Thus, address conflicts may become a significant performance bottleneck as pipeline complexity increases in current and future processor designs. Although this trend is not necessarily well understood, there exists a need now and in the future for improved techniques for dealing with pipeline address conflicts.
A low-order portion of a virtual address (“byte address”) for a pipelined operation is compared directly with the corresponding low-order portions of addresses of one or more other operations in the pipeline mechanism to detect an address conflict, without translating the address through an address translation mechanism. If no match of byte addresses is detected, then there is no address conflict and pipeline operations proceed normally.
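By way of illustration only, the byte address comparison may be sketched as follows. This is a minimal sketch under stated assumptions: the byte address is taken to be the low-order 12 bits, a width chosen for the example only; the function names are hypothetical; and only exact equality of byte addresses is tested, with operand widths and partial overlaps ignored.

```python
from typing import Iterable

BYTE_ADDRESS_BITS = 12  # assumed width of the untranslated low-order portion
BYTE_ADDRESS_MASK = (1 << BYTE_ADDRESS_BITS) - 1


def byte_address(address: int) -> int:
    """Extract the low-order portion, which address translation leaves unchanged."""
    return address & BYTE_ADDRESS_MASK


def possible_conflict(new_address: int, in_flight_addresses: Iterable[int]) -> bool:
    """Compare only low-order portions; no translation is performed."""
    new_low = byte_address(new_address)
    return any(byte_address(a) == new_low for a in in_flight_addresses)


in_flight = [0x101A4]  # an earlier operation in the pipeline

# Same byte address, different high-order bits: reported as a match.
# It may be a true conflict (aliased pages) or a false positive.
print(possible_conflict(0x251A4, in_flight))  # True

# Different byte address: no conflict is possible, whatever the translation.
print(possible_conflict(0x251A8, in_flight))  # False
```

Because translation does not alter the low-order portion, two addresses that refer to the same real location necessarily have equal byte addresses, so under these assumptions a mismatch rules out a conflict while a match may be a false positive.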
In the preferred embodiment, if a match is found between byte addresses, it is assumed that an address conflict does exist, and no further verification of an actual address conflict is performed. In this case, the corresponding operations are treated as if an actual address conflict exists, even though the higher-order portions of the addresses may not match. Specifically, if the operations are of a type which requires some minimum time interval between them or requires that one operation complete before the later operation can begin (e.g., a store operation, followed by a load operation), the later operation (and any operations beginning after it in the pipeline) is stalled for a time sufficient to prevent any data integrity exposure. This may be accomplished by stalling a pre-determined number of cycles, by stalling until the earlier conflicting operation completes, or by other means. It is alternatively possible to restart the pipeline.
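By way of illustration only, the two stalling alternatives mentioned above may be sketched as follows. This is a minimal sketch, assuming that a byte address match has already been determined and that cycle numbers are available as plain integers; the function names and the fixed stall length of three cycles are hypothetical and chosen for the example only.

```python
FIXED_STALL_CYCLES = 3  # assumed pre-determined stall length


def stall_fixed(match_found: bool) -> int:
    """Stall the later operation a pre-determined number of cycles on any match."""
    return FIXED_STALL_CYCLES if match_found else 0


def stall_until_complete(match_found: bool, current_cycle: int,
                         earlier_completion_cycle: int) -> int:
    """Stall the later operation until the earlier conflicting operation completes."""
    if not match_found:
        return 0
    return max(0, earlier_completion_cycle - current_cycle)


print(stall_fixed(True))                    # 3
print(stall_until_complete(True, 10, 12))   # 2
print(stall_until_complete(False, 10, 12))  # 0
```

In either case, operations beginning after the stalled operation would be held for the same number of cycles.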
In the preferred embodiment, the CPU has one or more caches, which are addressed using real addresses. An N-way translation lookaside buffer (TLB) in the CPU is used to determine the high-order portion of a real address from the high-order portion of a virtual address. Pipeline stages contain the corresponding real addresses of the operations once the real addresses have been determined. The low-order portion of the virtual address of a new pipeline operation is compared with the low-order address portions of potentially conflicting operations ahead of it in the pipeline, without first translating the virtual address of the new pipeline operation through the TLB.
A pipeline address conflict detection mechanism in accordance with the preferred embodiment of the present invention has several advantages. Pipeline address conflicts are detected at an earlier stage, and as a result, the performance impact of an individual conflict is reduced. Generally, as a result of early detection, data integrity can be preserved by simply stalling the later instruction in the pipeline, rather than restarting the pipeline after the instruction has proceeded well down the pipe. Even though a certain number of “false positives” will be detected, the reduced performance cost of each detected address conflict will typically more than offset the false positives. Stalling a portion of the pipeline is generally simpler and generally requires less hardware than that required for restarting the pipeline after some progress has already been made. Finally, the hardware required to make the address comparisons for purposes of detecting an address conflict is reduced, because only a subset of the entire address need be compared.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which: