1. Field of the Invention
This invention relates to a superscalar processor and, more particularly, to a reservation station for a functional unit within a superscalar processor which includes a plurality of functional units.
2. Description of the Relevant Art
As is known in the art, a floating point number may be represented in binary format as an exponent and a mantissa. The exponent represents a power to which a base number such as 2 is raised and the mantissa is a number to be multiplied by the base number raised to that power. Accordingly, the actual number represented by a floating point number is the mantissa multiplied by a quantity equal to the base number raised to a power specified by the exponent. In such a manner, any particular number may be approximated in floating point notation as f × B^e or (f,e) where f is an n-digit signed mantissa, e is an m-digit signed integer exponent and B is the base of the number system. In most computer systems, the base number system used is the binary number system where B=2, although some systems use the decimal number system (B=10) or the hexadecimal number system (B=16) as their base number system. Floating point numbers may be added, subtracted, multiplied, or divided and computing structures for performing these arithmetic operations on binary floating point numbers are well known in the art.
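By way of illustration only, the (f,e) notation with B=2 can be sketched using the standard Python library. The function names decompose and recompose are hypothetical. Note also that math.frexp scales the mantissa so that 0.5 <= |f| < 1, which is only one of several possible conventions.

```python
import math
from typing import Tuple

def decompose(x: float) -> Tuple[float, int]:
    """Return (f, e) such that x == f * 2**e, with 0.5 <= |f| < 1 (or f == 0)."""
    return math.frexp(x)

def recompose(f: float, e: int) -> float:
    """Rebuild the represented value f * 2**e."""
    return math.ldexp(f, e)

f, e = decompose(6.5)        # 6.5 == 0.8125 * 2**3
print(f, e, recompose(f, e))
```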
In a practical computer, the mantissa portion of a number is not of infinite "precision" (i.e., there are not an unlimited number of digits which may be assigned to the mantissa of a floating point number). Instead, floating point numbers are normally processed in a register with a fixed number of digits. Thus, although two input operands to be added, subtracted, multiplied, or divided may each be exact representations, the result of the operation may have more significant digits than the fixed number of digits in the register. As a result, a less precise (but still accurate) representation of the result must be squeezed into the fixed number of digits in the register by the processes of normalization and rounding.
Normalization is the process which assures that all floating point numbers with the same value have the same representation. Typically, normalization of a binary floating point number is accomplished by shifting the bits of the mantissa to the left until the most significant bit is a one. The exponent is decreased so that the value of the product of the mantissa and base number raised to the power of the exponent remains constant. Since the most significant bit in the mantissa of a normalized number is always a one, floating point representations often represent the bit implicitly (effectively freeing up one bit position for use as an additional bit of precision). Together these significant bits, whether they include an explicit or an implicit most significant bit, are known as the significand. The normalization process maximizes the number of significant bits represented in this significand. Rounding a floating point number is the process of reducing the precision of a number, so as to fit a representation of the number into a smaller number of significand bits. For floating point number representations, four rounding modes are typical: round up, round down, round to nearest, and truncate (see Dewar, Microprocessors: A Programmer's View, McGraw-Hill Publishing Co., New York, 1990, pp. 140-143 for a discussion).
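The normalization and rounding steps described above might be sketched as follows. This is a minimal illustration which assumes an unsigned integer mantissa of a fixed bit width; the function names, the mode strings, and the ties-to-even rule used for "round to nearest" are assumptions of the sketch and are not taken from any particular floating point unit.

```python
from typing import Tuple

def normalize(mantissa: int, exponent: int, width: int) -> Tuple[int, int]:
    """Shift a nonzero mantissa left until its top bit (bit width-1) is one.

    The exponent is decremented once per shift so that
    mantissa * 2**exponent is unchanged.
    """
    if not 0 < mantissa < (1 << width):
        raise ValueError("mantissa must be nonzero and fit in `width` bits")
    while not mantissa & (1 << (width - 1)):
        mantissa <<= 1
        exponent -= 1
    return mantissa, exponent

def round_significand(sig: int, drop: int, mode: str, negative: bool = False) -> int:
    """Discard the low `drop` bits of the magnitude `sig` under a rounding mode.

    A result of kept + 1 may carry out of the top bit, in which case the
    caller would have to renormalize (not shown here).
    """
    kept = sig >> drop
    rem = sig & ((1 << drop) - 1)
    if rem == 0 or mode == "truncate":
        return kept
    if mode == "up":                      # toward +infinity
        return kept if negative else kept + 1
    if mode == "down":                    # toward -infinity
        return kept + 1 if negative else kept
    if mode == "nearest":                 # assumed: ties go to the even value
        half = 1 << (drop - 1)
        if rem > half or (rem == half and kept & 1):
            return kept + 1
        return kept
    raise ValueError("unknown rounding mode: " + mode)

print(normalize(0b00101, 0, 5))           # value 5 is preserved: 20 * 2**-2
```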
The finite number of digits in the exponent also places limits on the magnitude of numbers which can be represented. Arithmetic results which fall outside these limits produce overflow or underflow. If the result of an arithmetic operation is greater than the largest positive value representable or less than the most negative value representable, arithmetic overflow occurs. On the other hand, when the magnitude of a nonzero result is too small to be expressed, positive or negative arithmetic underflow (according to the sign of the result) has occurred.
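The overflow and underflow ranges can be illustrated with exact rational arithmetic. The following sketch assumes the limits of a 32-bit single-precision format (largest finite value (2 - 2^-23) × 2^127, smallest normalized value 2^-126). It deliberately ignores denormalized numbers and the effect of rounding near the limits, and the name classify is hypothetical.

```python
from fractions import Fraction

# Assumed limits of a single-precision format (normalized numbers only).
MAX_FINITE = Fraction((2**24 - 1) * 2**104)   # (2 - 2**-23) * 2**127
MIN_NORMAL = Fraction(1, 2**126)              # 2**-126

def classify(exact_result: Fraction) -> str:
    """Classify an exact arithmetic result against the representable range."""
    magnitude = abs(exact_result)
    if magnitude == 0:
        return "zero"
    if magnitude > MAX_FINITE:
        return "overflow"      # beyond the largest positive or most negative value
    if magnitude < MIN_NORMAL:
        return "underflow"     # nonzero but too small to be expressed
    return "in range"

print(classify(Fraction(2) ** 128), classify(Fraction(1, 2**130)))
```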
Floating point exponents are typically represented with bias (i.e., the biased exponent is equal to the sum of the true exponent value and a bias constant). The bias constant, which is typically 2^(n-1) - 1, where n is the number of exponent bits, allows a biased exponent to be represented as an unsigned integer. This unsigned representation simplifies comparison logic by allowing the exponents of two floating point numbers to be compared bitwise from left to right. The first bit position which differs serves to order the numbers and the true exponent can be determined by subtracting the bias from the biased exponent.
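As a brief worked illustration of the bias arithmetic, assuming the typical bias constant 2^(n-1) - 1 given above (the helper names are hypothetical):

```python
def bias(n_bits: int) -> int:
    """Typical bias constant for an n-bit exponent field: 2**(n-1) - 1."""
    return (1 << (n_bits - 1)) - 1

def to_biased(true_exponent: int, n_bits: int) -> int:
    """Biased exponent = true exponent + bias (an unsigned integer)."""
    return true_exponent + bias(n_bits)

def to_true(biased_exponent: int, n_bits: int) -> int:
    """True exponent = biased exponent - bias."""
    return biased_exponent - bias(n_bits)

# With 8 exponent bits the bias is 127, so true exponents -3 and 2 are
# stored as 124 and 129; unsigned comparison preserves their order.
print(bias(8), to_biased(-3, 8), to_biased(2, 8))
```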
A series of floating point formats exist which represent different trade-offs between the precision and range of numbers (largest to smallest) representable, storage requirements, and cycles required for computing arithmetic results. In general, longer formats trade increased storage requirements and decreased speed of arithmetic operations (mainly multiplication and division operations) for greater precision and available range.
ANSI IEEE Standard 754 defines several floating point formats including single-precision, double-precision, and extended double-precision. Referring to FIG. 1a, the format of a 32-bit single-precision floating point number is broken into a one-bit sign field "s," an eight-bit biased exponent field "exp," a so-called "hidden" bit (which, although not explicitly represented, is assumed to be a one just left of the implied binary point 11), and a 23-bit "significand."
Referring next to FIG. 1b, the format of a double-precision floating point number increases the size of the biased exponent field to eleven (11) bits and the size of the significand to fifty-two (52) bits. A hidden bit, which is assumed to be one, is implicit (just to the left of the implied binary point 12) in the double-precision format.
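The single- and double-precision field layouts described above can be inspected with the standard struct module. The sketch below is illustrative only; the function names fields32 and fields64 are hypothetical, and the hidden bit does not appear in the extracted fields because it is not stored.

```python
import struct
from typing import Tuple

def fields32(x: float) -> Tuple[int, int, int]:
    """Split a single-precision encoding into (sign, biased exponent, 23-bit field)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fields64(x: float) -> Tuple[int, int, int]:
    """Split a double-precision encoding into (sign, biased exponent, 52-bit field)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

# -6.5 = -1.625 * 2**2, so the stored fraction is 0.625 in both formats and
# the biased exponents are 2 + 127 and 2 + 1023 respectively.
print(fields32(-6.5), fields64(-6.5))
```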
Referring next to FIG. 1c, the minimum requirements of an extended double-precision (hereinafter extended-precision) floating point format are presented. According to ANSI IEEE Standard 754, at least sixty-four (64) bits of significand and fifteen (15) bits of biased exponent must be provided. In contrast with single- and double-precision formats, extended-precision floating point format places the implied binary point 13 within the significand, and the digit to the left of the binary point is explicitly represented. There is no "hidden" bit; instead, the most significant bit of the significand (shown as "h" in FIG. 1c) is explicit in extended-precision format. Although envisioned in the IEEE standard as an internal format for computation of intermediate results, the format is in practice supported by many floating point units, including the i80387.TM. by Intel Corporation, as an external format (i.e., represented in memory and accessible to the programmer).
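For comparison, an extended-precision encoding with an explicit leading bit might be decoded as sketched below. The sketch assumes an 80-bit layout (one sign bit, fifteen exponent bits with bias 16383, and a 64-bit significand whose top bit is the explicit "h" bit), which meets the stated minimums but is only one possible layout. The function names are hypothetical, and only normalized values within the range of a Python float are handled.

```python
from typing import Tuple

def fields80(raw: int) -> Tuple[int, int, int, int]:
    """Split an assumed 80-bit encoding into (sign, biased exp, h bit, significand)."""
    sign = (raw >> 79) & 1
    biased_exp = (raw >> 64) & 0x7FFF
    significand = raw & 0xFFFFFFFFFFFFFFFF
    return sign, biased_exp, significand >> 63, significand

def value80(raw: int) -> float:
    """Value of a normalized encoding; the binary point follows the explicit h bit."""
    sign, biased_exp, _, significand = fields80(raw)
    return (-1.0) ** sign * (significand / 2**63) * 2.0 ** (biased_exp - 16383)

one = (16383 << 64) | 0x8000000000000000   # 1.0: h bit set, fraction zero
print(fields80(one), value80(one))
```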
To maximize computational throughput, a number of superscalar techniques have been proposed to enable instruction-level parallelism using multiple functional units. Instruction parallelism can be described as the opportunity for simultaneous (parallel) execution of more than one instruction in a processor containing multiple functional units. Pipelining techniques involve exploitation of instruction parallelism within a single functional unit, whereas superscalar techniques involve the exploitation of instruction parallelism across more than one functional unit. The instruction parallelism exploited by superscalar techniques may be contrasted with data parallelism in that superscalar techniques enable the parallel execution of dissimilar instructions, not just identical instructions with independent operands. These techniques, which are known in the art of superscalar processor design, include out-of-order instruction issue, out-of-order instruction completion, and speculative execution of instructions.
Out-of-order instruction issue involves the issuance of instructions to functional units with little regard for the actual order of instructions in executing code. A superscalar processor which exploits out-of-order issue need only be constrained by dependencies between the output (results) of a given instruction and the inputs (operands) of subsequent instructions in formulating its instruction dispatch sequence. Out-of-order completion, on the other hand, is a technique which allows a given instruction to complete (e.g., store its result) prior to the completion of an instruction which precedes it in the program sequence. Finally, speculative execution involves the execution of an instruction sequence based on predicted outcomes (e.g., of a branch). Speculative execution (i.e., execution under the assumption that branches are correctly predicted) allows a processor to execute instructions without waiting for branch conditions to actually be evaluated. Assuming that branches are predicted correctly more often than not, and assuming that a reasonably efficient method of undoing the results of an incorrect prediction is available, the instruction parallelism (i.e., the number of instructions available for parallel execution) will typically be increased by speculative execution (see Johnson, Superscalar Processor Design, Prentice-Hall, Inc., New Jersey, 1991, pp. 63-77 for an analysis).
Architectural designs for exploiting the instruction parallelism associated with each of these techniques have been proposed in a variety of articles and texts. For a discussion, see Johnson, pp. 127-146 (out-of-order issue), pp. 103-126 (out-of-order completion and dependency), and pp. 87-102 (branch misprediction recovery). Two architectural approaches for exploiting instruction parallelism are the reservation station and the reorder buffer. A reservation station is essentially an instruction and operand buffer for a given functional unit within a processor which includes multiple functional units; however, in addition to buffering instructions and operands, a reservation station provides a means for directly receiving results from other functional units. In this way, an instruction for which operands are not yet available can be dispatched to the reservation station for a given functional unit without waiting for its operands to be stored in and then retrieved from a register. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Execution Units," IBM Journal, vol. 11, January 1967, pp. 25-33, discloses a floating point processor implementation which includes multiple functional units, each with a reservation station. Tomasulo used the term "execution unit" rather than "functional unit," but in this context the concept is similar.
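The behavior of a reservation station, as characterized above, can be modeled in software roughly as follows. This is a simplified sketch and not a description of Tomasulo's or any other actual design: the class names, the use of integer tags to identify pending results, and the oldest-first issue policy are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Operand:
    value: Optional[float] = None   # operand value, once known
    tag: Optional[int] = None       # tag of the pending result, if still awaited

    @property
    def ready(self) -> bool:
        return self.tag is None

@dataclass
class Entry:
    op: str
    a: Operand
    b: Operand
    dest_tag: int                   # tag under which this entry's result is broadcast

class ReservationStation:
    def __init__(self):
        self.entries = []           # buffered instructions, oldest first

    def dispatch(self, entry: Entry) -> None:
        """Buffer an instruction even if some operands are not yet available."""
        self.entries.append(entry)

    def capture(self, tag: int, value: float) -> None:
        """Receive a result directly from another functional unit."""
        for entry in self.entries:
            for operand in (entry.a, entry.b):
                if operand.tag == tag:
                    operand.value, operand.tag = value, None

    def issue(self) -> Optional[Entry]:
        """Remove and return the oldest entry whose operands are all available."""
        for i, entry in enumerate(self.entries):
            if entry.a.ready and entry.b.ready:
                return self.entries.pop(i)
        return None

rs = ReservationStation()
rs.dispatch(Entry("add", Operand(value=1.0), Operand(tag=7), dest_tag=8))
print(rs.issue())                   # None: operand b still awaits the result tagged 7
rs.capture(7, 2.5)
print(rs.issue())                   # the add entry, now holding both operand values
```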
A reorder buffer is a content-addressable store which maintains the speculative (i.e., out-of-order) state of registers in a processor which includes multiple functional units. When each instruction is decoded, a reorder-buffer entry is allocated to store the instruction's result and a temporary identifier, or tag, is created to identify the result. In a normal instruction sequence, a given register may be written many times and therefore multiple reorder buffer entries will be allocated, corresponding to the state of the register at different points in the instruction sequence. As instructions which require register values as operands are dispatched, the most recently allocated reorder buffer entry is referenced, or if no reorder buffer entry corresponds to the required register location, the value stored in the register file is used. Assuming that a corresponding reorder buffer entry has been allocated, the value of an operand required by a given instruction is supplied by the reorder buffer if the instruction which computes the operand value has completed; otherwise, a tag is supplied allowing the instruction to recognize the result when it becomes available. A superscalar processor design which incorporates a reorder buffer also provides facilities to retire reorder buffer entries (i.e., store the entry value to the register file or discard the entry if no longer needed).
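The allocation, operand lookup, and retirement behavior just described might be modeled as sketched below. The class and method names are hypothetical, the register file is represented as a simple dictionary, and retirement strictly from the oldest entry is an assumption of the sketch; entries are always written back here, so the case of discarding an entry is not shown.

```python
class ReorderBuffer:
    def __init__(self, register_file):
        self.regs = register_file   # dict: register name -> committed value
        self.entries = []           # allocated entries, oldest first
        self.next_tag = 0

    def allocate(self, dest_reg):
        """Allocate an entry for a decoded instruction; return its result tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append({"tag": tag, "reg": dest_reg, "value": None, "done": False})
        return tag

    def read_operand(self, reg):
        """Return ("value", v) if the operand is available, else ("tag", t)."""
        for entry in reversed(self.entries):       # most recently allocated first
            if entry["reg"] == reg:
                if entry["done"]:
                    return ("value", entry["value"])
                return ("tag", entry["tag"])
        return ("value", self.regs[reg])           # no entry: use the register file

    def complete(self, tag, value):
        """Record the result of the instruction identified by `tag`."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True

    def retire(self):
        """Write completed entries back to the register file, oldest first."""
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.pop(0)
            self.regs[entry["reg"]] = entry["value"]

regs = {"r1": 1.0}
rob = ReorderBuffer(regs)
t0 = rob.allocate("r1")
t1 = rob.allocate("r1")             # r1 is written twice in the sequence
print(rob.read_operand("r1"))       # a tag for the most recent writer of r1
```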
A reorder buffer implementation facilitates various superscalar techniques including register renaming, branch misprediction exception handling, and out-of-order instruction completion. A superscalar architecture which includes reservation stations and a reorder buffer also facilitates the exploitation of instruction parallelism among functional units which receive operands from, and store their results to, a reorder buffer.
Typically, floating point units have been implemented as a co-processor with special-purpose floating point registers internal to the unit and using internal floating point formats which meet or slightly exceed the minimum requirements of IEEE 754 for extended-precision floating point numbers. Internal floating point registers are often implemented as a register stack (see, e.g., Intel, i486.TM. Microprocessor Family Programmer's Reference Manual, pp. 15-1 through 15-2) or as a series of accumulators (see, e.g., U.S. Pat. No. 5,128,888, "Arithmetic Unit Having Multiple Accumulators" to Tamura, et al.). Such architectures convert operand data from external formats (e.g., single-, double-, and extended-precision floating point) to an internal format when operands are loaded into the internal floating point registers. Subsequent floating point instructions operate on data stored in these registers and intermediate results (represented in internal format) are written back to the internal registers. Finally, results are converted back to an external format and transferred to general purpose registers external to the floating point unit. Non-floating point operations (e.g., branch tests, stores to memory, I/O, etc.) must typically be performed on the floating point values stored in an external format in the general purpose registers.
A design for a floating point unit which includes multiple functional units, reservation stations, and a reorder buffer is shown in Johnson, pp. 44-45. FIG. 2 depicts the block diagram of a processor incorporating such a floating point unit 21 together with an integer unit 22. The processor includes a pair of register files (23 and 24) and a pair of reorder buffers (25 and 26); the first register file/reorder buffer combination is dedicated to the integer unit while the second combination is dedicated to the floating point unit. The processor design shown in FIG. 2 maintains independent integer and floating point registers (and reorder buffers); therefore, results which are computed within one operational unit (integer or floating point) and which are required as operands in the other operational unit must be transferred to the second unit for use in subsequent calculations.