1. Field of the Invention
The present invention relates to a method and apparatus for reducing encoding needs and reducing the number of ports to shared resources in a multi-operation (wide-issue) processor, and more particularly to a mechanism based on a set of identifier fields which are shared among operations (the consumers of a shared resource).
2. Description of Related Art
Wide-issue processors are characterized by their ability to specify multiple "operations" that are carried out simultaneously and which may share certain resources in the processor. This set of operations, or "packet," can be created either when the program is generated (static generation by a programmer, compiler or other means), or by some mechanism invoked while the operations are carried out (dynamic generation, for example, performed at the time instructions are fetched from main memory into an instruction cache or instruction buffer, or at the time when instructions are decoded, or in some other stage in the processor pipeline).
Typically, the format of the multiple operations specified in a packet 100 contains a separate field for identifying the arguments used by each one of the operations, which are extracted from a collection of shared resources (for example, the various registers in a buffer or register file), as illustrated in FIG. 1. Furthermore, each of the identifier fields 111, 112 and 113 is associated with an independent port to access the shared resource, so that there is no conflict among the different operations 121-124 for accessing the shared resource. As a result, the number of ports to a shared resource needed in an implementation corresponds to the maximum number of identifiers that can be encoded in a packet 100. This format of a packet 100 is the approach used to specify the registers used by the primitive operations in Very-Long Instruction Word (VLIW) processors such as TRACE, CYDRA 5, ITANIUM, and Philips TRIMEDIA, among others. This format is also the approach implicitly used in processors which dynamically construct long-instructions such as those described in U.S. Pat. No. 5,442,760 and Franklin, M. and Smotherman, M., A Fill-unit Approach to Multiple Instruction Issue, Proceedings of the 27th International Conference on Microarchitecture, 1994, at 162-171.
However, a disadvantage of the above packet format is that, for packets with many primitive operations, large shared structures result from having independent ports to a shared resource for each operation. Moreover, some primitive operations actually use fewer than the maximum possible number of arguments or results. For example, a register-to-register primitive operation such as add or subtract uses three register fields and consequently three ports in a register file: two read ports to access the operands, and one write port to save the result of the operation. On the other hand, a load operation specifying a base register and a displacement uses only one read and one write port in the register file, whereas a store operation does not use a write port.
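The port arithmetic described above can be illustrated with a short sketch. The counts for add, subtract and load follow the text; the two read ports assumed for a store (a base register plus the value being stored), the four-operation packet, and the function names are illustrative assumptions rather than figures taken from the cited processors.

```python
# Register-file port needs per primitive operation: (read ports, write ports).
# add/sub/load follow the text; the store read count is an assumption.
PORT_NEEDS = {
    "add": (2, 1),
    "sub": (2, 1),
    "load": (1, 1),
    "store": (2, 0),  # assumed: base register + value to store, no write port
}


def ports_provisioned(num_ops):
    """Ports required when every operation slot has its own worst-case fields."""
    max_reads = max(reads for reads, _ in PORT_NEEDS.values())
    max_writes = max(writes for _, writes in PORT_NEEDS.values())
    return num_ops * max_reads, num_ops * max_writes


def ports_used(packet):
    """Ports the operations in a given packet actually need."""
    reads = sum(PORT_NEEDS[op][0] for op in packet)
    writes = sum(PORT_NEEDS[op][1] for op in packet)
    return reads, writes


packet = ["add", "load", "store", "load"]  # hypothetical four-operation packet
provisioned = ports_provisioned(len(packet))
used = ports_used(packet)
print("provisioned (reads, writes):", provisioned)
print("used (reads, writes):", used)
```

For this hypothetical packet, the per-operation format provisions more read and write ports than the operations consume, which is the inefficiency the paragraph above describes.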
Therefore, a need exists for a method and system having efficient use of identifier fields for specifying arguments accessed in the shared resource.
Attempts have been made to reduce the number of ports to the register file in a wide-issue processor. One such attempt is the Power2 processor, available commercially from IBM, Inc., which provides the number of ports needed by replicating the register file. More specifically, the fixed-point execution unit contains two register files with 4-read and 4-write ports; each of two functional units reads operands from one of the register files, but write ports are common to both register files. In other words, read ports are distributed across the register files whereas write ports are replicated in both modules.
In the context of VLIW processors, providing the needed ports in the register file has been addressed by the use of partitioned register files. Registers and ports are distributed across different modules, and data are either moved or copied among the modules through the execution of specific instructions, as in TRACE and Cydra 5. A variation on this approach includes replicating registers throughout some of the modules so that read ports are distributed and write ports are replicated across the corresponding modules.
U.S. Pat. No. 5,129,067 describes a group of instructions (primitive operations), fetched from the cache memory, potentially in some predecoded state. The patent is based on arbitration logic to dynamically resolve contention for the ports to the register file. Generally, the patent provides (1) arbitration logic for arbitrating conflicts among the operations in accessing the register file, based on arbitration data corresponding to each of the operations, and (2) a multiplexing unit for selectively supplying the N register identifiers to the M available ports in response to control signals generated by the arbitration logic. More specifically, the patent addresses the problem of long-instructions with N register-operand identifiers on a processor having M ports to the register file, wherein M is less than N; the values of N and M considered in the embodiments described are 4 to 8 and 2 to 4, respectively. Such an approach is not adequate for the case of executing many primitive operations simultaneously (N greater than 8), as is the trend nowadays, due to the exponentially increasing hardware complexity involved; in addition, the delay across the arbitration logic grows very fast for a larger number of possible operands.
A solution related to the one proposed in U.S. Pat. No. 5,129,067 (described above) is further developed by Johnson, M., Superscalar Microprocessor Design, (Prentice Hall 1991), indicating that a four-operation decoder suffers minor degradation when there are only four read ports in the register file. The publication also relates to a superscalar processor, for the case of a four-operation decoder. The scheme proposes a long-instruction format with a separate register access field which specifies the register identifiers for four source operands and four destination registers. Destination-register identifiers are positionally assigned to each operation, so the operations do not need to identify their corresponding destination register. On the other hand, each operation identifies source operands by selecting among the source-register identifiers and destination-register identifiers in the register access field. This scheme also allows identifying the destination register of one operation as a source register of another operation (in left-to-right order).
The solution proposed by Johnson, M., Superscalar Microprocessor Design, supra, has as many destination-register identifiers as primitive operations, so that the associated ports and fields are used inefficiently whenever there is an operation in the long-instruction which does not generate a result to be placed in the register file (such as a store operation, or some forms of compare operations which place the result in a condition register instead of the register file). Moreover, any of the register-identifiers in the register access field is used as source for any of the operations in the long instruction, leading to a rather complex network for routing operands from the register file to the functional units. This aspect is briefly mentioned by the Johnson publication, but no solution for it is described.
Partitioned register files have been addressed by Colwell, Robert P., et al., A VLIW Architecture for a TRACE Scheduling Compiler, Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, 1987, at 180-192, and by Beck, G., Yen, D., and Anderson, T., The Cydra 5 Minisupercomputer: Architecture and Implementation, The Journal of Supercomputing, 1993, Vol. 7 at 143-180. The partitioned register file used by Colwell et al. and by Beck, Yen, and Anderson, is a feasible solution regarding the implementation of a register file with many ports. However, such an approach introduces additional complexities in the code generation process. For example, (1) the compiler/programmer needs to ensure that operands are available in the corresponding register file module at the right moment, and (2) overhead is introduced by the extra operations needed to move/copy the operands across the different register file modules. In the case of replicated registers, the approach is more costly because it needs larger hardware resources (area, transistors, wires), uses more power, etc.
Therefore, a need exists for a system and method for efficient use of identifiers to reduce encoding needs and ports to shared resources in a processor.
The present invention relates to an operation decoder having a defined scheme for processing an instruction packet with shared identifier fields. The operation decoder includes a consumer signal for controlling a plurality of consumers, indicating the operation to be performed by each consumer. The operation decoder also includes a routing signal for controlling selectors, the selectors for routing values read from, and written to, a shared resource and the consumers. Further, the operation decoder includes an enable signal for enabling write ports in the shared resource for saving inputs to the shared resource.
According to one embodiment of the present invention, a method for accessing elements from a shared resource to be used by consumers that perform actions according to corresponding operations is disclosed. The method creates a packet of operations to be processed simultaneously, wherein the elements from the shared resource used by the operations are specified by source and destination identifier fields that are shared among the operations in such a way that the sum of all the elements from the shared resource used by the operations does not exceed a total number of identifiers available in the packet. The method also reads the elements from the shared resource according to the shared identifier fields specified in the packet. The method decodes a number of elements from the shared resource needed by each operation, by passing the operations to an operation decoder having a defined routing scheme based on the needs of the operations. The method routes the elements to the consumers performing operations and resulting values to the shared resource, according to a routing signal of the operation decoder.
The operation decoder described above specifies operations to be performed by the consumers according to a consumer signal from the operation decoder, as determined by the operation decoder.
According to the method, write ports of the shared resource are enabled to save the values according to an enable signal of the operation decoder, as determined by the operation decoder.
The method compiles the packet of operations so that the elements share identifier fields. The method determines the number of source identifier fields for elements, based on the operation to be performed, and further, determines the number of destination identifier fields for elements, based on the operation to be performed. The method also determines the needs of the operations to be placed in the packet and groups the operations so that the needs for elements are within the number of identifier fields available in the packet.
Further, if the needs of the operations to be grouped exceed the number of identifier fields available, the method removes one operation and inserts another operation having no needs. Alternatively, if the needs of the operations to be grouped exceed the number of identifier fields available, the method removes one operation and inserts another operation having needs which fit within the number of identifier fields available when inserted in the packet.
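The grouping step described above may be sketched as a simple greedy packer. This is only an illustration under stated assumptions: the NEEDS table, the function name group_operations, the packet width of four operations, and the budget of eight source and three destination identifier fields are all hypothetical, and the operations are assumed to be mutually independent so that they may be reordered freely.

```python
# (source identifiers, destination identifiers) needed by each operation.
# Hypothetical values chosen for illustration only.
NEEDS = {"add": (2, 1), "load": (1, 1), "store": (2, 0), "nop": (0, 0)}


def group_operations(ops, slots, max_src, max_dst):
    """Greedily pack operations into packets of `slots` operations whose
    combined needs stay within the shared identifier fields."""
    for op in ops:
        s, d = NEEDS[op]
        if s > max_src or d > max_dst:
            raise ValueError(f"{op} can never fit in a packet")
    pending = list(ops)
    packets = []
    while pending:
        packet, src, dst = [], 0, 0
        while len(packet) < slots:
            # Look for the first pending operation whose needs still fit.
            pick = None
            for i, op in enumerate(pending):
                s, d = NEEDS[op]
                if src + s <= max_src and dst + d <= max_dst:
                    pick = i
                    break
            if pick is None:
                # Nothing fits: insert an operation having no needs.
                packet.append("nop")
            else:
                op = pending.pop(pick)
                packet.append(op)
                src += NEEDS[op][0]
                dst += NEEDS[op][1]
        packets.append(packet)
    return packets


ops = ["add", "add", "add", "load", "store", "load"]
packets = group_operations(ops, slots=4, max_src=8, max_dst=3)
for p in packets:
    print(p)
```

In this example the first load would exceed the destination identifier budget of the first packet, so the sketch passes over it and places the store, whose needs fit; the second packet is padded with operations having no needs.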
The method routes the elements as follows. The method signals a first selector disposed between the shared resource and the consumers, the signal being based upon the defined routing scheme saved in the operation decoder. The method also signals a second selector disposed between the consumers and shared resource, the signal being based upon the defined routing scheme saved in the operation decoder.
According to the method, the packet with shared identifier fields includes a source registers field, including source address information for the elements read concurrently from the shared resource, a destination registers field, including destination address information for the values from the consumers to be saved in the shared resource, and an operations field, including the operations to be performed using the elements.
The method uses the defined scheme of the operation decoder including a consumer lookup table having possible routes for elements from the shared resource to the consumers, based on each operation's individual needs. The scheme also has a write port lookup table having possible routes for the values from the consumers to write ports in the shared resource where the values are to be saved, based on each operation's individual needs. The scheme further includes a set of logic equations for controlling the routing of the elements from the shared resource to the consumers, and the values from the consumers to the shared resource.
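One way such lookup tables might drive the routing and enable signals is sketched below. The table contents, the function name decode, and in particular the positional, left-to-right assignment of shared fields to consumers are assumptions made for illustration; the text above states only that the tables hold possible routes based on each operation's needs.

```python
# Hypothetical lookup tables: operands read and results written per operation.
CONSUMER_TABLE = {"add": 2, "load": 1, "store": 2, "nop": 0}
WRITE_PORT_TABLE = {"add": 1, "load": 1, "store": 0, "nop": 0}


def decode(operations, num_src_fields, num_dst_fields):
    """Derive routing and enable signals for one packet.

    Returns (read_routes, write_routes, enables):
      read_routes[i]  - shared source fields routed to consumer i
      write_routes[i] - write ports receiving results of consumer i
      enables[p]      - True if write port p is enabled
    Fields are assigned left to right (an assumed allocation policy).
    """
    read_routes, write_routes = [], []
    next_src = next_dst = 0
    for op in operations:
        n_src = CONSUMER_TABLE[op]
        n_dst = WRITE_PORT_TABLE[op]
        read_routes.append(list(range(next_src, next_src + n_src)))
        write_routes.append(list(range(next_dst, next_dst + n_dst)))
        next_src += n_src
        next_dst += n_dst
    if next_src > num_src_fields or next_dst > num_dst_fields:
        raise ValueError("packet exceeds the shared identifier fields")
    # Only write ports that actually receive a value are enabled.
    enables = [port < next_dst for port in range(num_dst_fields)]
    return read_routes, write_routes, enables


read_routes, write_routes, enables = decode(
    ["add", "store", "load", "nop"], num_src_fields=6, num_dst_fields=3
)
print(read_routes)
print(write_routes)
print(enables)
```

Under this assumed policy the store consumes two source fields but no write port, so the load that follows it is routed to the second write port and the third write port remains disabled.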
According to the method the shared resource can be a register rename buffer or a register file. Further, consumers can be renaming engines or functional units. According to the method the elements are operands.
Alternatively, the method may be carried out by a computer usable medium having computer readable program code.
In a preferred embodiment of the present invention, a processor for executing an instruction packet having a plurality of operations sharing identifier fields is disclosed. The processor includes an instruction register for accepting the instruction packet. The processor includes a register file with a reduced number of ports accessed by a source register field from the instruction packet. The register file accepts an enable signal from an operation decoder for enabling write ports on the register file. The processor further includes the operation decoder having a defined scheme for routing operands from the register file to the consumers and routing values from the consumers to the register file. The processor includes a first selector for accepting the operands from the register file and for routing the operands from the first selector to the corresponding consumer according to a routing signal from the operation decoder. The processor includes the consumers for accepting the operands from the first selector and performing an operation according to an operation signal from the operation decoder. Also included in the processor is a second selector for accepting the values from the consumers and for routing the values to a corresponding write port in the register file according to the routing signal from the operation decoder.
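The flow of one instruction packet through such a processor can be approximated in software. The sketch below is a behavioral illustration only, not a description of the hardware: the Packet class, the OPS table with its add, neg and nop operations, the eight-entry register file, and the left-to-right assignment of shared fields are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ops: list       # operations field
    src_regs: list  # shared source registers field
    dst_regs: list  # shared destination registers field


# Hypothetical operation set: name -> (sources, results, function).
OPS = {
    "add": (2, 1, lambda a, b: a + b),
    "neg": (1, 1, lambda a: -a),
    "nop": (0, 0, None),
}


def execute(packet, regfile):
    """Model one packet: read, route, operate, route back, write."""
    # Read ports: one access per shared source identifier field.
    operands = [regfile[r] for r in packet.src_regs]
    values = []
    next_src = 0
    for op in packet.ops:
        n_src, n_dst, fn = OPS[op]
        # First selector: route the next n_src operands to this consumer.
        args = operands[next_src:next_src + n_src]
        if len(args) != n_src:
            raise ValueError("not enough source identifier fields")
        next_src += n_src
        if n_dst:
            values.append(fn(*args))
    if len(values) > len(packet.dst_regs):
        raise ValueError("not enough destination identifier fields")
    # Second selector and enable signal: only ports that receive a value
    # are written; the remaining destination fields are ignored.
    result = list(regfile)
    for port, value in enumerate(values):
        result[packet.dst_regs[port]] = value
    return result


regfile = [0, 10, 20, 30, 0, 0, 0, 0]
packet = Packet(
    ops=["add", "neg", "nop", "add"],
    src_regs=[1, 2, 3, 1, 3],  # five shared fields instead of 4 x 2
    dst_regs=[4, 5, 6],        # three shared fields instead of 4 x 1
)
after = execute(packet, regfile)
print(after)
```

Here four operations are served by five source and three destination identifier fields, whereas a format with separate fields per operation would reserve eight and four, respectively, for the same hypothetical operation set.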
According to the above embodiment, the processor further includes an instruction fetch unit for fetching the instruction packet from an instruction cache interfaced with the processor.
The register file further includes enable ports for accepting the enable signal from the operation decoder, and write ports for accepting the values from the second selector, the values to be stored in the register file.
The first selector and the second selector apply a combinational logic to the operands and the values respectively.
The consumers can be renaming engines or functional units.
The processor can be a dynamically-scheduled out-of-order execution processor or a statically-scheduled in-order execution processor.
According to another embodiment of the present invention, a method for accessing elements from a shared resource to be used by consumers that perform actions according to corresponding operations is disclosed. The method creates a packet of operations to be processed simultaneously, wherein the elements from the shared resource used by the operations are specified by source and destination identifier fields that are shared among the operations, in such a way that the sum of all the elements from the shared resource used by the operations does not exceed a total number of identifiers available in the packet. The method reads the elements from the shared resource according to the shared identifier fields specified in the packet. The method also decodes a number of elements from the shared resource needed by each operation, by passing the operations to an operation decoder having a defined routing scheme based on the needs of the operations. Further, the method routes the elements read from the shared resource to the corresponding consumers, as determined by the decoding of the operations in the operation decoder having the defined routing based on the needs of the individual operations. The method specifies to the consumers the specific operation to be performed by each consumer with the corresponding elements from the shared resource, as determined by the operation decoder. The method routes a plurality of values generated by the consumers to the shared resource, as determined by the operation decoder and specifies to the shared resource the placement of the values generated by the consumers according to the destination identifier fields specified in the packet of operations. The method also enables the shared resource to save the results from the consumers, as determined by the operation decoder.
According to the above embodiment, the method compiles the packet of operations so that the elements share the identifier fields. The method determines a number of source identifier fields for elements, based on the operation to be performed. The method determines a number of destination identifier fields for elements, based on the operation to be performed. Further, the method determines the needs of the operations to be placed in the packet, and groups the operations so that the needs for the elements are within the number of identifier fields available in the packet.
During compiling, if the needs of the operations to be grouped exceed the number of identifier fields available, the method removes one operation and inserts another operation having no needs. Alternatively, if the needs of the operations to be grouped exceed the number of identifier fields available, the method removes one operation and inserts another operation having needs which fit within the number of identifier fields available when inserted in the packet.
The step of routing the elements further includes signaling a first selector disposed between the shared resource and the consumers, the signal being based upon the defined routing scheme saved in the operation decoder, and signaling a second selector disposed between the consumers and the shared resource, the signal being based upon the defined routing scheme saved in the operation decoder.
The packet with shared identifier fields includes a source registers field, including source address information for the elements read concurrently from the shared resource, a destination registers field, including destination address information for the values from the consumers to be saved in the shared resource, and an operations field, including the operations to be performed using the elements.
The defined scheme of the operation decoder includes a consumer lookup table having possible routes for elements from the shared resource to the consumers, based on each operation's individual needs, a write port lookup table having possible routes for the values from the consumers to a plurality of write ports in the shared resource where the values are to be saved, based on each element's individual needs, and a set of logic equations for controlling the routing of the elements from the shared resource to the consumers, and the values from the consumers to the shared resource.
The shared resource is a register rename buffer or a register file. The consumers are renaming engines or functional units. According to the method the elements are operands.
These and other objects, features, and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be used in connection with the accompanying drawings.