1. Field of the Invention
The present invention relates to the field of digital electronic processors; more particularly, to processors designed for high-speed pipelined synchronous operation.
2. Description of the Related Art
Advances in digital electronic data processors have included process, circuit and microarchitecture improvements. Microarchitecture improvements have leveraged the fact that improving processes have allowed the integration of millions of devices on individual integrated circuits (ICs). With millions of devices available for very little cost, microarchitectures have evolved to deliver greater performance, even though individual devices contribute less to overall performance than in earlier microarchitectures.
One approach that researchers and scientists have investigated for improving performance is to increase the frequency at which the device operates. This is accomplished by using more aggressive circuit families such as dynamic gates, and by implementing fewer gates in each pipeline stage. Another approach is to implement multiple execution units such that more than one instruction can be executed at a time. Yet another approach is to allow each instruction to proceed at its own pace, beginning execution when encountered but not completing execution until all operands and resources are available. Pipelining all the functional units is known as super pipelining. Executing more than one instruction at a time is known as super scalarity, and allowing instructions to complete at their own pace is known as out-of-order. Most modern microarchitectures support some degree of each of these innovations. An efficient microarchitecture makes efficient use of its circuits. Super scalarity and out-of-order features reduce circuit efficiency because, in the case of super scalarity, multiple redundant units are present. That is, a super scalar design includes two or more ALUs, and both ALUs are used less frequently than one ALU is used in a uniscalar (non-super scalar) design.
Out-of-order features also reduce circuit utilization efficiency because of the additional control hardware required to track individual instructions and ensure that they appear to complete in the proper order as expected by the executing program. Super pipelining has the potential of increasing circuit utilization because it can operate individual circuits at a higher clock rate, but because of the synchronization requirements, i.e., latches, the percentage of each stage doing actual work is reduced, which tends to offset efficiency gains. In other words, increased instruction latency decreases circuit efficiency.
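By way of illustration only, the effect of synchronization overhead on a super pipelined design can be sketched numerically. The model below is an assumption made for this example and is not taken from any particular implementation: delays are expressed in abstract gate-delay units, every stage carries the same latch overhead, and the function name split_pipeline and its parameters are hypothetical.

```python
def split_pipeline(total_logic: float, stages: int, latch_delays: float):
    """Model a block of logic divided evenly into pipeline stages.

    All quantities are in abstract gate-delay units (an assumption of this
    sketch). Returns (cycle_time, useful_fraction, total_latency).
    """
    if total_logic <= 0 or stages < 1 or latch_delays < 0:
        raise ValueError("invalid pipeline parameters")
    logic_per_stage = total_logic / stages
    # Each cycle must cover the stage logic plus the latch overhead.
    cycle_time = logic_per_stage + latch_delays
    # Fraction of each cycle spent doing actual work rather than synchronizing.
    useful_fraction = logic_per_stage / cycle_time
    # Time for one instruction to traverse every stage.
    total_latency = cycle_time * stages
    return cycle_time, useful_fraction, total_latency


if __name__ == "__main__":
    for stages in (2, 4, 8):
        cycle, useful, latency = split_pipeline(40, stages, latch_delays=2)
        print(f"{stages} stages: cycle={cycle:.1f} useful={useful:.2f} latency={latency:.1f}")
```

Under these assumed numbers, deeper pipelining shortens the cycle but lowers the useful fraction of each stage and lengthens instruction latency. Setting latch_delays to zero models the limiting case in which synchronization adds no delay to the critical path.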
Due to the complexity of modern processors, internal synchronization often requires pipeline stalls. This is because it is not necessarily known whether all resources required for an instruction to execute will be present when the instruction is initiated. For example, a store instruction may begin execution, generating an address and translating it from a virtual to a physical address before the register value to be stored is available. If the instruction gets to the point where the register value is required but is not yet available, the instruction will stall until such point as the register value appears. While these stalls are conceptually simple to implement, they often introduce critical paths into a design. Furthermore, the cost of such stalls is often overlooked as they represent additional circuitry present to manage instruction operation but not do any actual work. A stall, therefore, decreases circuit utilization.
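The store example above can be sketched as a simple cycle count. This is a minimal illustration under stated assumptions, not a description of any actual processor: the function store_timeline is hypothetical, address generation and translation are assumed to take one cycle each (address_stages=2), and the write itself is assumed to take one cycle.

```python
def store_timeline(issue_cycle: int, data_ready_cycle: int, address_stages: int = 2):
    """Return (stall_cycles, completion_cycle) for a single store instruction.

    address_stages is the assumed number of cycles spent generating and
    translating the address before the register value is needed.
    """
    if address_stages < 0:
        raise ValueError("address_stages must be non-negative")
    # Cycle at which the store first needs the register value.
    data_needed_cycle = issue_cycle + address_stages
    # The store waits only if the value arrives after it is needed.
    stall_cycles = max(0, data_ready_cycle - data_needed_cycle)
    # One further (assumed) cycle performs the write once the data is present.
    completion_cycle = data_needed_cycle + stall_cycles + 1
    return stall_cycles, completion_cycle


if __name__ == "__main__":
    print(store_timeline(issue_cycle=0, data_ready_cycle=1))  # value arrives in time
    print(store_timeline(issue_cycle=0, data_ready_cycle=5))  # value arrives late
```

In the second call the register value arrives three cycles after the store needs it, so the store occupies its pipeline stage for three cycles without doing any work.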
Circuit utilization is important when designing processors that must balance performance against other issues such as silicon cost, power, heat dissipation and manufacturability. It is not particularly important when the only important criterion is absolute performance. When it is necessary to deliver the greatest performance for the minimum cost, for example, circuit utilization is an important metric. As such, super scalarity and out-of-order features are unattractive, and super pipelining is attractive only when synchronization elements need not be included in the critical path.
The present invention focuses on high circuit utilization by combining a floating point unit with a graphics unit and an integer unit. It does so by implementing super pipelining in a latchless dynamic logic family, which requires no additional logic levels for synchronization. It avoids super scalarity except where functional units are sufficiently different to justify not building a single consolidated unit. It allows operations to occur at their natural latency. In other words, individual instructions deliver their results as soon as their execution unit produces them, limiting the out-of-order nature of the design to where it occurs naturally. And, it simplifies operand bypass logic by having each subsequent stage of the pipeline pass previously generated results unaffected. Finally, by predicting complex operand conditions, operand availability, execution unit availability, and write ordering, it eliminates all non-calculating stall conditions, which eliminates the need for recirculation circuits.
Additionally, the present invention illustrates the need for locating a combination functional unit at a late stage within a pipeline. Late pipeline functional units have implementation cost and simplicity advantages over a traditional, early pipeline location. If properly designed, late pipeline functional units do not need to support partial or complete cancellation, and can avoid all pipeline stalls. A late pipeline functional unit can therefore be designed without recirculating hold paths such that it is free-running, i.e., once an operation is dispatched, it will proceed in a regular, predictable fashion. There are some costs to placing units late in a pipeline, but these are considered minor.
The present invention comprises a method and apparatus for an enhanced floating point unit that supports floating point, integer, and graphics operations by combining the units into a single functional unit. The enhanced floating point unit comprises a register file coupled to a plurality of bypass multiplexers. Coupled to the bypass multiplexers are an aligner and a multiplier. Coupled to the multiplier is an adder that further couples to a normalizer/rounder unit. The normalizer/rounder unit may comprise a normalizer and a rounder coupled in series and/or a parallel normalizer/rounder. The enhanced floating point unit of the present invention supports both integer operations and graphics operations with one functional unit.
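As an illustration of how data might flow through an aligner, multiplier, adder, and normalizer/rounder of the kind recited above, the following is a toy software model of a multiply-add, a*b + c. It is a sketch under stated assumptions and does not represent the circuits of the invention: numbers are (mantissa, exponent) pairs with value mantissa * 2**exponent, the register file and bypass multiplexers are omitted, rounding is round-to-nearest-even to an assumed 8-bit mantissa, and all function names are hypothetical.

```python
from typing import Tuple

Num = Tuple[int, int]  # (mantissa, exponent); value = mantissa * 2**exponent


def multiply(a: Num, b: Num) -> Num:
    """Multiplier: multiply mantissas, add exponents (exact)."""
    return a[0] * b[0], a[1] + b[1]


def align(addend: Num, product_exp: int) -> Num:
    """Aligner: express the addend at the smaller of the two exponents."""
    mant, exp = addend
    common = min(exp, product_exp)
    return mant << (exp - common), common


def add(product: Num, aligned: Num) -> Num:
    """Adder: bring the product to the common exponent and sum mantissas."""
    common = aligned[1]
    return (product[0] << (product[1] - common)) + aligned[0], common


def normalize_round(value: Num, precision: int = 8) -> Num:
    """Normalizer/rounder: keep at most `precision` mantissa bits,
    rounding to nearest with ties to even."""
    mant, exp = value
    sign = -1 if mant < 0 else 1
    mag = abs(mant)
    excess = mag.bit_length() - precision
    if excess > 0:
        half = 1 << (excess - 1)
        remainder = mag & ((1 << excess) - 1)
        mag >>= excess
        exp += excess
        if remainder > half or (remainder == half and mag & 1):
            mag += 1
            if mag.bit_length() > precision:  # rounding carried out of the top bit
                mag >>= 1
                exp += 1
    return sign * mag, exp


def multiply_add(a: Num, b: Num, c: Num, precision: int = 8) -> Num:
    """Model of a*b + c: multiplier and aligner feed the adder, then round."""
    product = multiply(a, b)
    aligned = align(c, product[1])
    return normalize_round(add(product, aligned), precision)


if __name__ == "__main__":
    mant, exp = multiply_add((3, -1), (1, 1), (1, -2))  # 1.5 * 2 + 0.25
    print(mant, exp, mant * 2.0 ** exp)
```

In this model the aligner and the multiplier operate on independent inputs, which mirrors the description of both being coupled to the bypass multiplexers, and only the final stage discards precision.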
Additionally, the present invention comprises a method and apparatus for a pipeline of functional units with a late pipe functional unit that executes instructions without stalling until the result is available. The present invention comprises one or more earlier functional units coupled to a late pipe functional unit. The late pipe functional unit does not begin executing instructions until all of the input operands are or will be available for execution so that the late pipe functional unit will execute instructions without stalling until the result will be available in a fixed number of cycles. The present invention further comprises a late pipe functional unit that may comprise a floating point unit, a graphics unit, or an enhanced floating point unit. Finally, the late pipe functional unit is non-stalling and/or non-cancelable.
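The dispatch condition described above, in which execution begins only when every input operand is or will be available in time, can be sketched as follows. This is a hedged illustration rather than the claimed mechanism: the function schedule, the constant FIXED_LATENCY, and the representation of each operand as a (ready_cycle, needed_offset) pair are assumptions introduced for this example.

```python
FIXED_LATENCY = 4  # assumed number of cycles from dispatch to result


def schedule(earliest_cycle: int, operands, latency: int = FIXED_LATENCY):
    """Return (dispatch_cycle, result_cycle) for one operation.

    operands is a list of (ready_cycle, needed_offset) pairs, where
    ready_cycle is the cycle in which the operand value becomes available and
    needed_offset is how many cycles after dispatch the unit first reads it.
    Dispatch is delayed until no operand can be late, so the unit never has
    to stall once the operation has started.
    """
    dispatch = earliest_cycle
    for ready_cycle, needed_offset in operands:
        # The operand must be ready by dispatch + needed_offset.
        dispatch = max(dispatch, ready_cycle - needed_offset)
    return dispatch, dispatch + latency


if __name__ == "__main__":
    # The second operand is ready at cycle 12 but is not read until one
    # cycle after dispatch, so dispatch may occur at cycle 11.
    print(schedule(10, [(9, 0), (12, 1)]))
```

Because all waiting is resolved before dispatch, the result cycle is simply the dispatch cycle plus a constant, which is the property that allows such a unit to be built without hold or recirculation paths.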