1. Technical Field
The present invention generally relates to the processing of instructions in a microprocessor and, more particularly, to a method and apparatus for sharing storage and execution resources between architectural units in a microprocessor.
2. Description of the Related Art
In modern microprocessors, significant amounts of resources are duplicated between the Single Instruction-Stream, Multiple Data-Stream (SIMD) and scalar data paths. For example, such duplication is more prevalent in microprocessors which have large execution units, such as floating point units, and particularly units for double precision floating point computation. This undesirable duplication of resources leads to increased chip area, reduced yield, slower clock frequencies, higher power dissipation and, overall, higher cost and lower performance.
Thus, it is preferable to reduce chip area and design complexity by sharing resources between similar execution units, such as scalar and vector floating point units. A variety of approaches have been implemented for architectural sharing. For newer microprocessor designs, one such approach is to architect a single scalar/vector unit, which performs execution of both scalar and vector operations from a shared storage resource, and in a shared data path. An exemplary implementation of such a modern instruction set architecture (ISA) is the Cell Broadband Engine Synergistic Processor Unit (SPU), which uses the described approach in accordance with U.S. Pat. No. 6,839,828 to Gschwind et al., entitled “SIMD Datapath Coupled to Scalar/Vector/Address/Conditional Data Register File with Selective Subpath Scalar Processing Mode”, issued on Jan. 4, 2005, and U.S. patent application Ser. No. 11/065,707, to Gschwind et al., entitled “SIMD-RISC Microprocessor Architecture”, filed on Feb. 24, 2005, which are commonly assigned and incorporated by reference herein.
In another approach to architectural sharing, SIMD FP execution units such as the IBM BlueGene/L Double Floating Point Unit, and the INTEL SSE2 instruction set extensions, use a subfield of the SIMD unit for scalar computation using dedicated load and store operations to that subfield.
Alternatively, in yet another approach to architectural sharing, for designs which have architected separate execution primitives and separate storage resources, a large physical storage resource can be employed that stores the architected state for both separate architected storage facilities in a large common array. This approach is employed by recent implementations of the INTEL x86 and AMD64 architectures by INTEL and AMD, wherein traditional stack-oriented x87 FP registers, multi-media extensions (MMX) registers, and Streaming SIMD Extension (SSE) registers are stored in a common physical array.
The sharing of register files is advantageous in the case when dynamic usage in a specific program or program phase uses a specific register file heavily, while making little or no use of another register file, wherein the physical registers can be allocated to the heavily used register file.
Thus, sharing a physical register file between multiple architectural register files leads to better resource utilization. However, while implementations sharing a physical register file, by allocating multiple architectural register files within a single physical register file (such as the x87 FP registers, the MMX registers, and/or the SSE registers in the AMD64 architecture), allow efficient sharing of rename registers, resource wastage accrues from unused but allocated architected registers which are of no use to program execution.
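By way of illustration only, the allocation behavior described above can be modeled with a short Python sketch. The class name SharedPhysicalFile, the register counts, and the free-list policy are assumptions made for this example and are not taken from any particular implementation. The sketch shows that every architected register holds a physical register whether or not the program uses it, so that only the remainder is available for renaming.

```python
class SharedPhysicalFile:
    """Toy model of one physical array backing several architected register files."""

    def __init__(self, num_physical, architected_files):
        # architected_files maps a file name to its architected register count
        # (hypothetical sizes chosen for illustration).
        needed = sum(architected_files.values())
        if needed > num_physical:
            raise ValueError("not enough physical registers for architected state")
        self.free = list(range(num_physical))
        self.mapping = {}
        # Every architected register must hold a physical register at all times,
        # even if the running program never touches that register file.
        for name, count in architected_files.items():
            for index in range(count):
                self.mapping[(name, index)] = self.free.pop(0)

    def rename(self, name, index):
        """Allocate a new physical register for a write to (name, index)."""
        if not self.free:
            raise RuntimeError("no rename registers available; dispatch must stall")
        previous = self.mapping[(name, index)]
        self.mapping[(name, index)] = self.free.pop(0)
        return previous  # the caller releases this once the write has committed

    def release(self, physical):
        """Return a physical register to the free list."""
        self.free.append(physical)


prf = SharedPhysicalFile(48, {"x87": 8, "sse": 16})
print(len(prf.free))  # 24 physical registers remain for renaming
```

In this model, a program that uses only the "sse" file still leaves the eight "x87" entries occupied, which corresponds to the wastage noted above.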
Thus, a large architected register file allocated commonly to all units would be more advantageous, by allowing sharing and dynamic allocation of physical registers according to program usage, under compiler control. In addition, sharing of data between different units becomes effortless, as no move instruction between architected register files (possibly via memory, as is the case in several modern instruction set architectures) is necessary.
Unfortunately, current industry standard instruction set architectures already specify preexisting register files which have to be maintained even if newer execution resources are defined and architected for future microprocessor implementations.
In another aspect of modern microprocessor implementation, execution data paths should advantageously be shared between different execution units. Thus, a floating point operation executed using a scalar floating point instruction should advantageously use the same floating point unit as a floating point instruction executed as part of a SIMD instruction sequence.
For example, a future industry standard processor may want to employ an improved merged scalar/vector floating point unit, yet retain high performance compatibility with legacy architectures.
However, sharing execution units is difficult if data is to be sourced from different physical register files, requiring input operand data routing from several physical register files, and data operand selection from such files, and result data routing to multiple register files. This inevitably will increase chip area and degrade performance.
In the prior art, the sharing of register files has been implemented at the architectural level. Thus, the MMX and INTEL x87 stack-oriented floating point architectures architecturally share the register file in a mutually exclusive fashion. The primary motivation behind this architectural sharing of two register files between a first MMX unit and a second floating point unit was to minimize required changes in operating systems during the process context switch sequence. Effectively, selection of the architectural context stored in the single architectural register file, which stores exclusively either floating point or MMX data, was under user control, with no system provision to store state information upon a context switch.
This use is cumbersome and inefficient, requiring the user code to identify all possible code paths through a user application, and insert explicit architectural register file context switch sequences in the user program. Additionally, static analysis has to be conservative, introducing architectural register file context switch sequences when an architectural register file context switch is possible, but does not actually occur, leading to significant performance reduction due to the execution of extraneous architectural context switch sequences. Finally, such an implementation is not compatible with current industry standard instruction sets which specify the existence of separate and independent architectural register files, such as between the IBM Power Architecture™ floating point and Vector/SIMD Media eXtension (VMX), or the AMD64 instruction set x87 floating point unit and SSE SIMD instructions.
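The cost of conservatively placed switch sequences can be illustrated with a small Python sketch. The trace format and the function name are hypothetical. The model assumes that a state-clearing sequence (for example, the EMMS instruction on x86) is inserted after every MMX routine, because static analysis cannot prove that floating point code does not follow.

```python
def count_switch_sequences(call_trace):
    """Compare conservatively inserted switch sequences with those actually needed.

    call_trace is a list of "mmx" / "fp" labels, one per executed routine
    (a hypothetical trace format used only for this illustration).
    """
    # Conservative placement: one switch sequence after every MMX routine.
    conservative = sum(1 for unit in call_trace if unit == "mmx")
    # Actually needed: only where an MMX routine is directly followed by FP code.
    needed = sum(
        1
        for previous, following in zip(call_trace, call_trace[1:])
        if previous == "mmx" and following == "fp"
    )
    return conservative, needed


print(count_switch_sequences(["mmx", "mmx", "mmx", "fp", "mmx", "mmx"]))  # (5, 1)
```

In this trace, four of the five executed switch sequences are extraneous.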
U.S. Pat. No. 6,178,482 to Sollars, entitled “Virtual register sets”, which is incorporated by reference herein, discloses an implementation of virtual register files. In particular, the use of virtual register sets is disclosed wherein multiple register files are maintained in a cache, and accessed from the cache. While this allows for the maintaining of multiple register files, it involves long access paths to a cache and access to a large array, leading to long access latencies for registers and, hence, performance degradation. This design approach will also increase control complexity by requiring the synchronization of data cache and register file accesses. Further, this requires the management of tags and other aspects of a cache which are not typically required for performing register accesses and, hence, constitute additional overhead in performing a register access.
Butts et al., in “Use-Based Register Caching with Decoupled Indexing”, Proceedings of the 31st annual international symposium on Computer architecture, München, Germany, June 2004, provide a review of prior art hierarchical register file designs. In hierarchical register file designs, a multi-level register file hierarchy is used to store the values of architected registers. According to hierarchical register file designs, frequently used values are stored in a small array in proximity to execution resources, and infrequently accessed values in a larger, slower array. Using a hierarchical storage has a number of costs associated therewith including that the design must identify the presence or non-presence of a value in the fast storage, based on some table-of-content structures, such as a register map, tags associated with registers, or another indirection or content-addressable structure. This analysis must be performed based on the specific register name, and when multiple operands are present, for each of the multiple operands. This analysis has a significant cost in power, and possibly latency. Furthermore, design complexity is increased.
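The per-operand presence check described above may be sketched as follows. This is a minimal model and not the design of Butts et al.: the class name RegisterCache is hypothetical, a write-through policy is assumed, and simple least-recently-used replacement is substituted for the use-based policy of the cited paper. The sketch shows that every operand read performs a tag check, whether or not the value is present in the fast array.

```python
from collections import OrderedDict


class RegisterCache:
    """Two-level register file: a small tagged fast array over a large slow array."""

    def __init__(self, capacity, backing_size):
        self.capacity = capacity
        self.fast = OrderedDict()       # register name (tag) -> value
        self.slow = [0] * backing_size  # full architected state
        self.tag_checks = 0
        self.misses = 0

    def read(self, reg):
        # Every operand read pays for a tag comparison, hit or miss.
        self.tag_checks += 1
        if reg in self.fast:
            self.fast.move_to_end(reg)
            return self.fast[reg]
        self.misses += 1
        value = self.slow[reg]          # slower access to the large array
        self._fill(reg, value)
        return value

    def write(self, reg, value):
        self.slow[reg] = value          # write-through (an assumption)
        self._fill(reg, value)

    def _fill(self, reg, value):
        self.fast[reg] = value
        self.fast.move_to_end(reg)
        if len(self.fast) > self.capacity:
            self.fast.popitem(last=False)  # evict the least recently used entry
```

An instruction with several source operands repeats the check in read() once for each operand.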
In addition, using hierarchical register files to implement multiple architected register files does not mitigate some of the inefficiencies associated with implementing a monolithic array to store multiple architected register files, such as that found in recent implementations of the x86 and AMD64 architectures. Specifically, all register files need to be mapped into a common address space, with internal register specifiers having a suitably large number of register specifier bits in a unified register address space. This increases the latency of operations such as determining bypass and/or dependence conditions, map table access, and so forth.
Mapping multiple architectural register files into a single large common physical register file thus has several disadvantages. One such disadvantage is that, at a minimum, physical registers for all architectural registers have to be allocated to maintain their architected state even if they are otherwise unused. Another disadvantage is that mapping multiple architectural register files into a single physical register file to be simultaneously resident therein requires all architectural names to be mapped to a common internal register specifier name space, leading to the requirement for long internal register specifiers, thereby degrading performance. Yet another disadvantage is that providing simultaneous storage for multiple architected register files and their rename registers may lead to a large physical register file, thereby leading to long access latencies.
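The specifier-width penalty can be made concrete with simple arithmetic. The following Python sketch assumes two architected files of 32 registers each, as is the case for the Power Architecture floating point and VMX register files, plus 64 rename registers. The rename register count and the function name are hypothetical values chosen only for illustration.

```python
def specifier_bits(num_registers):
    """Minimum number of bits needed to name one of num_registers registers."""
    return max(1, (num_registers - 1).bit_length())


separate_file = specifier_bits(32)            # one 32-entry architected file
unified_file = specifier_bits(32 + 32 + 64)   # both files plus rename registers

print(separate_file, unified_file)  # 5 7
```

Under these assumptions, each internal register specifier grows from 5 to 7 bits, and every comparator used for bypass and dependence checking widens accordingly.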
Building multi-level register files or register file caches is not advantageous, because register file caches still require the use of long register specifier names for determining if a specific register is in the top-level register file hierarchy, as well as requiring expensive CAM-like accesses to the register file.
In reducing design area, complexity and power consumption by sharing execution resources between architectural units, it is preferable to allow a common data path to execute similar operations specified for different architectural units. Thus, a floating point operation executed using a scalar floating point instruction should advantageously use the same floating point unit as a floating point instruction executed as part of a SIMD instruction sequence.
Unfortunately, the architectural specification of similar operations often differs in semantic details. For example, in the AMD64 instruction set, multiple definitions of floating point operations are present in the form of instructions from the x87 legacy floating point unit operating on an extended range 80 bit floating point definition, the AMD 3DNow! SIMD instructions, and the SSE/SSE2/SSE3 extensions. Similarly, the Power Architecture™ defines floating point operations in a scalar FP unit having either 32 or 64 bits of data width and supporting multiple rounding modes, a number of IEEE specified floating point status bits specified and maintained in the Power Architecture™ FPSCR status register, and optionally precise exceptions, whereas the Power Architecture™ VMX instruction set extensions specify 32 bit floating point operations with a single default rounding mode, de-normalized number handling specified in a separate VSCR vector status and control register, and no exception support. Similarly, the IBM zSeries ESAME architecture specification specifies two different floating point instruction families, using a first IBM System/360 compliant hexadecimal floating point representation, and a second IEEE-compliant binary floating point representation.
In one implementation of a common data path, instruction characteristics, such as the use of 80 bit or 64 bit floating point representation, the use of Floating Point Status and Control Register (FPSCR) specified FP rounding or VMX default rounding, and the use of hexadecimal or binary floating point computation formats, are specified with each operation passed to the common data path.
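As a rough sketch of such per-operation specification, the following Python fragment packs the named characteristics into every internal operation word. The field layout, the enumeration names, and the functions encode_op and decode_op are assumptions made for this illustration and do not describe any actual internal instruction format.

```python
from enum import Enum


class Width(Enum):
    BITS32 = 0
    BITS64 = 1
    BITS80 = 2


class Rounding(Enum):
    FPSCR = 0        # rounding mode taken from the FPSCR
    VMX_DEFAULT = 1  # single default rounding mode


class Radix(Enum):
    BINARY = 0
    HEXADECIMAL = 1


OPTION_FIELDS = (Width, Rounding, Radix)


def field_bits(field):
    """Bits needed to encode every member of an option field."""
    return max(1, (len(field) - 1).bit_length())


# Extra bits carried by every single operation sent to the data path.
OPTION_BITS = sum(field_bits(field) for field in OPTION_FIELDS)


def encode_op(opcode, width, rounding, radix):
    """Append the option fields to the opcode, most significant field first."""
    word = opcode
    for value in (width, rounding, radix):
        word = (word << field_bits(type(value))) | value.value
    return word


def decode_op(word):
    """Recover the opcode and option fields; the data path does this per operation."""
    values = []
    for field in reversed(OPTION_FIELDS):
        bits = field_bits(field)
        values.append(field(word & ((1 << bits) - 1)))
        word >>= bits
    radix, rounding, width = values
    return word, width, rounding, radix


print(OPTION_BITS)  # 4
```

Even with only three option fields, each internal operation in this sketch grows by four bits, and each field must drive a selector in the data path for every operation received.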
Thus, it is preferable to have a methodology that allows similar, but not identical, operations to be executed on a common data path by reconfiguring the data path. Unfortunately, this leads to long internal representations of instructions specifying a variety of options, as well as slow cycle time, as these options have to be dynamically selected using a variety of selectors embedded in the data path. However, it should be noted that the use of different architectural specifications is usually not interleaved in a fine-grained manner in application programs.
Thus, a typical program might use either the legacy x87-based 80 bit floating point specification or the SSE2-based 64 bit floating point specification on an AMD64-instruction set processor. In such an execution environment, the processor would see either exclusively x87 floating point or SSE2 floating point operations for a given program, until the user application program context is switched by the operating system. In another aspect of programs that use multiple floating point specifications, some modules may use a legacy x87 floating point specification, while other modules have been upgraded. In such an execution environment, the processor would see either exclusively x87 floating point or SSE2 floating point operations for a given module, until control is transferred to a module using the other representation.
Similarly, Power Architecture™ environments in use today typically include programs which either use the floating point architecture or the VMX architecture. In some applications, some compute critical kernels with long execution times have been rewritten to exploit the VMX specification, while other modules use the scalar FP instruction set.
Similarly, zSeries environments in use today typically execute either MVS code using preexisting applications exploiting the IBM System/360 hexadecimal floating point execution environment, or Linux code using newly compiled UNIX applications exploiting the IEEE binary floating point execution environment.
Thus, it is preferable to reduce the size of internal operation codes and eliminate the need to select operation specifics in response to every single operation received.
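The coarse-grained usage pattern noted above suggests how rarely operation specifics would need to change. The following Python sketch counts how often the instruction class changes along an execution trace. The trace contents and the function name are hypothetical and are chosen only to illustrate the ratio between operations executed and class changes observed.

```python
def count_reconfigurations(trace):
    """Count how many times the active instruction class changes along a trace."""
    reconfigurations = 0
    active = None
    for instruction_class in trace:
        if instruction_class != active:
            reconfigurations += 1
            active = instruction_class
    return reconfigurations


# A module using x87, a call into an upgraded SSE2 module, then a return to x87.
trace = ["x87"] * 1000 + ["sse2"] * 500 + ["x87"] * 200
print(len(trace), count_reconfigurations(trace))  # 1700 3
```

In this trace, selecting operation specifics per operation would be done 1700 times, whereas the instruction class changes only three times, counting the initial selection.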
In the prior art, the use of field-programmable gate array (FPGA) configurable function units has been proposed. Hauck et al., in “The Chimaera Reconfigurable Functional Unit”, IEEE Symposium on FPGAs for Custom Computing Machines, 1997, the disclosure of which is incorporated by reference herein, describe the use of an FPGA based functional unit. In accordance therewith, instructions are decoded by instruction decoding logic, and then transmitted to the FPGA. Moreover, different FPGA configurations are loaded into a reconfigurable function unit (RFU) which is managed as a cache of recently used configurations.
While the approach by Hauck et al. allows access to a working set of FPGA configurations, the approach is inadequate for the efficient processing of general purpose instruction sets. First, FPGA configurations are inefficient in terms of area usage, power and speed, because multiple physical gates must switch to simulate a single logical gate in an FPGA configuration. Customized logic, which directly implements functions such as floating point or integer data paths as used in microprocessors using advanced circuit techniques, leads to better area, power and performance efficiency. Thus, it is preferable to implement logic such that it is reflected in the manufacture of a processor to eliminate the inefficiencies associated with field-programmable gate arrays. Second, the described approach does not support the concept of instructions being part of an instruction repertoire associated with a particular architectural unit, where one or another unit is typically used at a given time. Thus, there is no provision for loading and unloading state information for different register files associated with different architectural units. Furthermore, there is no concept of shared primitives which need to be configured to match the semantics of a particular architectural specification for a unit. Finally, the RFU proposed by Hauck et al. loads configurations which each define a specific instruction, and does not reconfigure the unit for an architectural unit supporting a repertoire of instructions. Thus, each instruction in a sequence would require the separate overhead of reloading.
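The difference between per-instruction and per-unit configuration can be illustrated with a small Python model of a configuration cache. This is a sketch under stated assumptions and not a description of the Chimaera design: the cache capacity, the least-recently-used replacement policy, and the function name are hypothetical, and the instruction mnemonics serve only as example labels.

```python
from collections import OrderedDict


def count_configuration_loads(keys, capacity):
    """Count configuration loads for a cache of recently used configurations."""
    cache = OrderedDict()
    loads = 0
    for key in keys:
        if key in cache:
            cache.move_to_end(key)         # configuration already resident
        else:
            loads += 1                     # configuration must be loaded
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used one
    return loads


# Each entry is (architectural unit, instruction).
sequence = [
    ("fp", "fadd"), ("fp", "fmul"), ("fp", "fsub"), ("fp", "fdiv"),
    ("vmx", "vaddfp"), ("vmx", "vmaddfp"),
]

per_instruction = count_configuration_loads([insn for _, insn in sequence], capacity=2)
per_unit = count_configuration_loads([unit for unit, _ in sequence], capacity=2)
print(per_instruction, per_unit)  # 6 2
```

When configurations are keyed by individual instruction, every new instruction in this sequence incurs a load; when keyed by architectural unit, a load occurs only when the unit in use changes.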
While the above and related prior art have suggested the use of configurable FPGA logic to implement different types of user defined instructions, the purpose of a polymorphic unit is different. For example, one purpose of a polymorphic unit is to provide an optimized implementation of a specific set of functionality, where the implementation includes loading a specific configuration to control the semantics of these predefined operations, including, but not limited to, single or double precision operations, rounding, de-normalized number handling, saturation, overflow handling, exception handling, tracking of exception events in status and control registers, so as to operate in accordance with a selected set of instructions. In a polymorphic unit as utilized in accordance with an embodiment herein, the configurations are limited to related classes of instructions defined when the polymorphic execution unit is architected, and optimizing the implementation for this set of operations to allow the efficient implementation of hardware specific to the instruction functions. In comparison, FPGA function is defined to implement general logic gates, to allow users of FPGA technology to define operations after the manufacture of FPGAs, and under the specification of the FPGA user. The result of this overly general flexibility is low computing density, and low operation frequency.
While FPGAs offer flexible logic gates to be defined by users, state management in accordance with proposed FPGA extensions is limited. Operand state is considered to be maintained in the FPGA logic, or delivered from a fixed register file. As a result, proposals for FPGA configurable units are limited in how program state can be used by user-defined instructions. Specifically, these extensions do not include the ability to dynamically associate a polymorphic register file with a first or a second class of instructions, and to reload said state in response to encountering an instruction of a specific instruction class.
In another aspect of prior art, APU ports are used in embedded Power Architecture™ processor cores to provide designer-specified application specific processing units, as described in “PowerPC® 440 Processor Core”, Product Brief, available at http://www-306.ibm.com/chips/techlib/techlib.nsf/techdocs/F72367F770327F8A87256E63006CB7EC/$file/PPC440Core3-24.pdf, March 2004, the disclosure of which is incorporated by reference herein. Unfortunately, application-specific processing units can only be selected during design time and, hence, do not provide the flexibility to provision different capabilities according to program usage of dynamically loaded programs.
In the context of Power Architecture cores embedded in FPGAs, Ansari et al., in “Accelerated System Performance with APU-Enhanced Processing”, Xcell Journal First Quarter, 2005, available at http://www.xilinx.com/publications/xcellonline/xcell—52/xc_pdf/x c_v4acu52.pdf, the disclosure of which is incorporated by reference herein, disclose the use of the PowerPC® 405 APU port in the PowerPC® cores embedded in the Xilinx Virtex-4 FX family. The Auxiliary Processor Unit (APU) controller is a key embedded processing feature in the Virtex-4 FX family. However, APU reconfiguration must be done either in the FPGA configuration bitstream, e.g., at FPGA logic design time, or using a DCR interface, which requires programs to explicitly re-provision the APU logic, not unlike program based reconfiguration between MMX and legacy x87 function use.