The present invention generally relates to computer systems and, more specifically, to an asymmetric clustered processor architecture.
Most conventional clustered processor architectures are symmetric systems. The scalability of high-performance processor architectures in new system designs has been limited by various factors, including increasing clock frequencies, wider issue widths, and greater wire delays. In addition, many high-performance processor families have extended their Instruction Set Architectures (ISAs), or have introduced new ones, to handle 64-bit integers, which further exacerbates the above design factors.
Clustering is generally viewed as a possible solution to these problems. Clustered processors have many advantages, including improved implementation and scalability, reduced power consumption, and potentially faster clock speeds. However, a design difficulty is encountered in assigning instructions to clusters so as to minimize the effect of inter-cluster communication latency.
There is shown in FIG. 1 a functional diagram of a conventional 64-bit clustered processor organization 10 with a 4-wide issue width and two integer clusters sharing a front end and a data cache. The conventional 64-bit clustered processor organization 10 may include a first conventional 64-bit integer cluster 11 and an essentially identical second conventional 64-bit integer cluster 21. A first 64-bit register file 13 in the first conventional 64-bit integer cluster 11 may be replicated in a second 64-bit register file 23 in the second conventional 64-bit integer cluster 21. Each “write” is sent to both a local register file and to a remote register file, with posting to the remote register file at least one clock cycle later than the posting to the local register file. The term “64-bit,” as used herein, refers to the size of the operand of an instruction as generally understood in the relevant art.
The first conventional 64-bit integer cluster 11 may also include a first instruction queue 15, with a first pair of 64-bit arithmetic logic units (ALUs) 17 and 19, to provide a combined issue width of two integer instructions. Similarly, the second conventional 64-bit integer cluster 21 may also include a second instruction queue 25, and a second pair of 64-bit ALUs 27 and 29, for an issue width of an additional two integer instructions.
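By way of illustration only, the replicated register files described above may be modeled as in the following minimal sketch. The class names, the method names, and the fixed one-cycle remote delay are assumptions made for the example; the description states only that posting to the remote register file occurs at least one clock cycle later than posting to the local register file.

```python
from dataclasses import dataclass, field

# Assumption: exactly one cycle of delay. The description says "at least one".
REMOTE_WRITE_DELAY = 1


@dataclass
class Cluster:
    name: str
    regs: dict = field(default_factory=dict)     # this cluster's register file copy
    pending: list = field(default_factory=list)  # (ready_cycle, reg, value) from the other cluster


class ReplicatedRegisterFiles:
    """Hypothetical model of two clusters that each hold a register file copy."""

    def __init__(self):
        self.cycle = 0
        self.clusters = [Cluster("cluster 11"), Cluster("cluster 21")]

    def write(self, local_index, reg, value):
        """Post a result locally now and to the remote copy after the delay."""
        local = self.clusters[local_index]
        remote = self.clusters[1 - local_index]
        local.regs[reg] = value
        remote.pending.append((self.cycle + REMOTE_WRITE_DELAY, reg, value))

    def tick(self):
        """Advance one clock cycle and post any remote writes that are due."""
        self.cycle += 1
        for cluster in self.clusters:
            still_waiting = []
            for ready, reg, value in cluster.pending:
                if ready <= self.cycle:
                    cluster.regs[reg] = value
                else:
                    still_waiting.append((ready, reg, value))
            cluster.pending = still_waiting

    def read(self, index, reg):
        """Return the value visible in the given cluster, or None if not yet posted."""
        return self.clusters[index].regs.get(reg)
```

In this sketch, a value written from the first cluster is readable there immediately but becomes visible in the second cluster only after `tick()` has been called, which is the inter-cluster communication latency that instruction assignment seeks to hide.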
Operation of the conventional 64-bit clustered processor organization 10 may be described with additional reference to a baseline pipeline 30, shown in FIG. 2. An Instruction Fetch stage 31 may be executed by an instruction cache 61. The fetched instruction may move to a decode logic 63 for decoding at a Decode stage 33, and may be assigned physical registers via a rename logic 65 at a subsequent Rename stage 35. The fetched instruction may then enter either the first conventional 64-bit integer cluster 11 or the second conventional 64-bit integer cluster 21, as determined by steering.
Steering may be performed by a steering logic 67 after the register renaming performed by the rename logic 65. The fetched instruction may pass through a first Queue stage 39 and a second Queue stage 41, and may then issue, at the Issue stage 43, from either the first instruction queue 15 or the second instruction queue 25. If the fetched instruction issues from the first instruction queue 15, the first 64-bit register file 13 receives the fetched instruction at the Register File Read stage 45. Alternatively, if the fetched instruction issues from the second instruction queue 25, the second 64-bit register file 23 receives the fetched instruction at the Register File Read stage 45.
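The description does not specify the policy applied by the steering logic 67. Purely as a hypothetical example, one commonly discussed approach is to steer an instruction toward the cluster that produces most of its source operands, with queue occupancy used to break ties. The function name, its parameters, and the policy itself are assumptions and are not taken from the description.

```python
def steer(sources, producer_cluster, queue_occupancy):
    """Choose cluster 0 or 1 for a renamed instruction (hypothetical policy).

    sources:          physical source register names of the instruction
    producer_cluster: dict mapping a register name to the cluster producing it
    queue_occupancy:  [entries in queue of cluster 0, entries in queue of cluster 1]
    """
    votes = [0, 0]
    for reg in sources:
        cluster = producer_cluster.get(reg)
        if cluster is not None:
            votes[cluster] += 1

    # Prefer the cluster that already holds more of the operands locally,
    # so fewer values must wait for the delayed remote register file write.
    if votes[0] != votes[1]:
        return 0 if votes[0] > votes[1] else 1

    # Tie: fall back to the less occupied instruction queue.
    return 0 if queue_occupancy[0] <= queue_occupancy[1] else 1
```

For instance, `steer(["p1", "p2"], {"p1": 1, "p2": 1}, [0, 5])` selects cluster 1 under this policy even though its queue is fuller, because both operands are produced there.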
For example, if the fetched instruction has been sent to the second 64-bit register file 23, the instruction may proceed to either the 64-bit ALU 27 or the 64-bit ALU 29 at an Execute stage 47. The following Memory I stage 49, Memory II stage 51, and Write-Back stage 53 function largely as generally understood in the relevant art, except for address translation in the Narrow cluster and for testing result value type. A Commit stage 55 functions as generally understood in the relevant art.
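The stage sequence of the baseline pipeline 30 may be summarized, for reference, in the following sketch. The stage names and reference numerals are those given in the description; the treatment of steering as a separate unnumbered step, the assumption of one clock cycle per stage with no stalls, and the names `BASELINE_PIPELINE` and `trace` are illustrative assumptions only.

```python
# Stage names and reference numerals as given in the description of FIG. 2.
BASELINE_PIPELINE = [
    ("Instruction Fetch", 31),
    ("Decode", 33),
    ("Rename", 35),
    ("Steer", None),  # steering follows renaming; the text gives no stage numeral
    ("Queue", 39),
    ("Queue", 41),
    ("Issue", 43),
    ("Register File Read", 45),
    ("Execute", 47),
    ("Memory I", 49),
    ("Memory II", 51),
    ("Write-Back", 53),
    ("Commit", 55),
]


def trace(start_cycle=0):
    """Return (cycle, stage name) pairs, assuming one cycle per stage and no stalls."""
    return [
        (start_cycle + offset, name)
        for offset, (name, _numeral) in enumerate(BASELINE_PIPELINE)
    ]
```

Calling `trace()` yields the order in which a single instruction would visit the stages under these simplifying assumptions; an actual implementation may stall in the queue stages while awaiting operands.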
As can be appreciated, there is a need for an improved apparatus and method for accommodating increasing clock frequencies, mitigating wire delays, and addressing the problem of issue widths encountered in past processor architecture designs.