The present invention relates to a processor, and more particularly, to a clustered superscalar processor and a method for controlling communication between clusters in a clustered superscalar processor.
In the prior art, for example, S. Palacharla, N. P. Jouppi, and J. E. Smith, “Complexity-Effective Superscalar Processors,” Proceedings of 24th International Symposium on Computer Architecture, pp. 206-218, June 1997 describes a clustered superscalar processor that addresses problems arising when instructions are executed in parallel to improve the performance of a processor. Clustering a superscalar processor divides the functional units, which execute instructions, and the instruction windows, which temporarily store instructions, into a plurality of groups referred to as clusters. Each cluster includes a functional unit, an instruction window, and a register file for storing the execution result of a calculation. The execution results of the functional units in every cluster are written to the register files of every cluster. Thus, the register files of all the clusters hold the same contents. FIG. 1 shows a clustered superscalar processor 100 of the prior art.
When processing instructions, the processor 100 first stores instructions from a main memory (not shown) in an instruction cache 110. A fetch unit 120 reads the instructions from the instruction cache 110 and provides the instructions to a decoder 130. The decoder 130 decodes the instructions. A steering unit 140 analyzes the data dependency relationships of the instructions and allocates the instructions, in accordance with the dependency relationships, to instruction windows 151 and 161 of the clusters 150 and 160. Instructions whose data dependencies are satisfied are read from the instruction windows 151 and 161, and data used to execute the instructions is read from the register files 152 and 162. Functional units 153, 154, 163, and 164 use the read data to execute the instructions. The execution results of the instructions are written to the register files 152 and 162. During the execution of the instructions, data is read from the main memory or a data cache 170 when necessary.
When the execution result of an instruction executed by one of the functional units 153, 154, 163, and 164 is used in an immediately subsequent instruction, the execution result is transferred to the subsequent instruction before the execution result is written to the register files 152 and 162. The transfer path of the execution result is hereafter referred to as a bypass route. The bypass route is configured by intra-cluster bypasses CIB, formed inside the clusters, and an inter-cluster bypass CBB, formed between the clusters.
In this manner, by clustering a superscalar processor, the quantity of functional units in each cluster may be reduced in comparison to a superscalar processor that is not clustered. The reduction in the quantity of the functional units shortens the wire length for the intra-cluster bypass routes and reduces wire delays.
However, in a clustered processor, the size of the register file required in each cluster is substantially the same as that of a register file in a processor that is not clustered. Thus, the wire length and wire delay of the register file are not shortened. J. L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham, “Multiple-Banked Register File Architectures,” Proceedings of 27th International Symposium on Computer Architecture, pp. 316-325, June 2000 describes a hierarchical register file as an example of a technique for reducing the delay of the register file. FIG. 2 shows a processor 200 incorporating a hierarchical register file. The hierarchical register file is configured by a register cache RC (upper level register file) and a main register file MRF (lower level register file). The register cache RC is incorporated in a data path. The capacity of the register cache RC is smaller than that of the main register file MRF, and the register cache RC may be accessed at high speeds. The main register file MRF holds every calculation result of functional units 251 to 254. The register cache RC holds some of the values of the main register file MRF.
When a value necessary for an instruction exists in a register cache RC, the functional units 251, 252, 253, and 254 access the register cache RC to retrieve a register value within a shorter access time than when accessing the main register file MRF. When a necessary value does not exist in the register cache RC, the functional units 251, 252, 253, and 254 retrieve a register value only after the register value is transferred from the main register file MRF to the register cache RC. This requires a long access time.
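The two-level lookup described above may be sketched as follows. This is only an illustrative model, not the circuit of FIG. 2: the class name `HierarchicalRegisterFile`, the cycle counts `HIT_TIME` and `MISS_PENALTY`, and the least-recently-used replacement policy are all assumptions introduced for the example.

```python
from collections import OrderedDict

HIT_TIME = 1      # assumed cycles to read the register cache RC
MISS_PENALTY = 3  # assumed extra cycles to transfer a value from the MRF


class HierarchicalRegisterFile:
    """Hypothetical model of a register cache RC in front of a main register file MRF."""

    def __init__(self, rc_capacity):
        self.mrf = {}            # MRF: holds every calculation result
        self.rc = OrderedDict()  # RC: holds only some values (LRU order, assumed)
        self.rc_capacity = rc_capacity

    def _install(self, reg, value):
        # Place the value in the RC and evict the least recently used entry if full.
        self.rc[reg] = value
        self.rc.move_to_end(reg)
        if len(self.rc) > self.rc_capacity:
            self.rc.popitem(last=False)

    def write(self, reg, value):
        self.mrf[reg] = value
        self._install(reg, value)

    def read(self, reg):
        """Return (value, cycles spent)."""
        if reg in self.rc:                     # hit: served by the RC
            self.rc.move_to_end(reg)
            return self.rc[reg], HIT_TIME
        value = self.mrf[reg]                  # miss: transfer MRF -> RC first
        self._install(reg, value)
        return value, HIT_TIME + MISS_PENALTY
```

With a two-entry register cache, writing three registers in succession evicts the first one, so a later read of that register pays the longer access time, while a read of a recently written register does not.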
A state in which the data requested by the processor exists in the register cache RC is referred to as a hit, and a state in which it does not exist in the register cache RC is referred to as a miss. Further, the percentage of accesses in which the accessed data is found in the register cache RC is referred to as a hit rate, and the percentage of accesses in which the accessed data is not found in the register cache RC is referred to as a miss rate. The time required to access the register cache RC is referred to as a hit time. The time required to access the main register file MRF is referred to as a miss penalty.
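The terms defined above are commonly combined into an average access time. The sketch below uses the conventional cache model, in which every access pays the hit time and each miss additionally pays the miss penalty; treating the miss penalty as additional to the hit time is an assumption of this example, and the function name `average_access_time` is hypothetical.

```python
def average_access_time(hit_rate, hit_time, miss_penalty):
    """Average cycles per register read under the assumed model:
    hit_time + miss_rate * miss_penalty."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must lie between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return hit_time + miss_rate * miss_penalty
```

Under this model, a drop in hit rate translates directly into a longer average access time in proportion to the miss penalty.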
The register cache RC is small and fast. Thus, the hit time (e.g., one clock cycle) is shorter than the access time of the register file prior to hierarchization.
To further reduce the access time of the register file, the above-described hierarchical register file and the clustered superscalar processor may be combined. More specifically, the main register file MRF shown in FIG. 2 is added to the processor 100 shown in FIG. 1, and the register files 152 and 162 of the clusters 150 and 160 are replaced with register caches RC. However, incorporation of the above hierarchical register file in the clustered superscalar processor would lead to the problems described below.
In the prior art method, the execution result of an instruction is written to the register cache RC of every cluster. The register cache RC can hold only some of the values of the main register file MRF. Thus, to use the register cache RC effectively, it is preferred that the register cache RC hold only the register values needed by instructions that have yet to be executed. It is known that the execution result of an instruction is referred to by only a small number of calculations. Accordingly, among the copies of an execution result written to each cluster, only some are referred to. The remaining copies, which are never referred to, needlessly consume the storage area of the register cache RC. This raises the possibility that useful register values, which are stored in the register cache RC and may still be referred to, are deleted. As a result, the miss rate of the register cache RC increases, and the miss penalty lowers performance.
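The effect described above can be illustrated with a small simulation. It is a sketch under stated assumptions, not a model of any actual processor: two clusters, a six-entry register cache per cluster with least-recently-used replacement, and a synthetic instruction stream in which instruction i runs on cluster i mod 2 and reads the result produced six instructions earlier (which, in this stream, was produced on the same cluster). The names `RegisterCache` and `hit_rate` are hypothetical.

```python
from collections import OrderedDict


class RegisterCache:
    """Hypothetical LRU register cache that tracks only which registers it holds."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def write(self, reg):
        self.entries[reg] = True
        self.entries.move_to_end(reg)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # delete least recently used value

    def read(self, reg):
        if reg in self.entries:
            self.entries.move_to_end(reg)
            return True    # hit
        self.write(reg)    # miss: value is filled from the main register file
        return False


def hit_rate(broadcast, n=1000, distance=6, capacity=6, clusters=2):
    caches = [RegisterCache(capacity) for _ in range(clusters)]
    hits = reads = 0
    for i in range(n):
        c = i % clusters                 # cluster executing instruction i
        src = i - distance               # register read by instruction i
        if src >= 0:
            reads += 1
            hits += caches[c].read(src)
        # Prior art: write the result to every cluster's register cache.
        # Alternative: write it only to the cluster that produced it.
        targets = caches if broadcast else [caches[c]]
        for cache in targets:
            cache.write(i)
    return hits / reads


broadcast_rate = hit_rate(broadcast=True)
local_rate = hit_rate(broadcast=False)
```

In this synthetic stream, writing every result to every cluster gives a lower hit rate than writing results only where they are consumed, because copies that are never read in a cluster push out values that are about to be read there.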