Dedicated pipeline queues have been used in multi-pipeline execution units of processors in order to achieve faster processing speeds. In particular, dedicated queues have been used for execution (EX) units having multiple EX pipelines that are configured to execute different subsets of a set of supported micro-instructions. Dedicated queuing has generated various bottlenecking problems, as well as problems in scheduling micro-instructions that require both numeric manipulation and retrieval/storage of data.
Additionally, processors are conventionally designed to process operations (Ops) that are typically identified by operation codes (OpCodes), i.e., instruction codes. In the design of new processors, it is important to be able to process all of a standard set of Ops so that existing computer programs based on the standardized codes will operate without the need for translating Ops into an entirely new code base. Processor designs may further incorporate the ability to process new Ops, but backwards compatibility with older instruction sets is often desirable.
Execution of micro-instructions/Ops is typically performed in an execution unit of a processor. To increase speed, multi-core processors have been developed. Furthermore, to facilitate faster execution throughput, “pipeline” execution of Ops within an execution unit of a processor core is used. Cores having multiple execution units for multi-thread processing are also being developed. However, there is a continuing demand for faster throughput for processors.
One type of standardized set of Ops is the instruction set compatible with “x86” chips, (e.g., 8086, 286, 386, and the like), that have enjoyed widespread use in many personal computers. The micro-instruction sets, such as the “x86” instruction set, include Ops requiring numeric manipulation, Ops requiring retrieval and/or storage of data, and Ops that require both numeric manipulation and retrieval/storage of data. To execute such Ops, execution units within processors have included two types of pipelines: arithmetic logic pipelines (“EX pipelines”) to execute numeric manipulations and address generation (AG) pipelines (“AG pipelines”) to facilitate load and store Ops.
In order to quickly and efficiently process Ops as required by a particular computer program, the program commands are decoded into Ops within the supported set of micro-instructions and dispatched to the execution unit for processing. Conventionally, an OpCode is dispatched that specifies the Op/micro-instruction to be performed along with associated information that may include items such as an address of data to be used for the Op and operand designations.
Dispatched instructions/Ops are conventionally queued for a multi-pipeline scheduler of an execution unit. Queuing is conventionally performed with some type of decoding of a micro-instruction's OpCode in order for the scheduler to appropriately direct the instructions for execution by the pipelines with which it is associated within the execution unit.
FIG. 1 shows an example of a block diagram of a conventional processor 10, which may be one of many processors residing in an integrated circuit (IC). The processor 10 includes a decoder 15 that decodes and dispatches micro-instructions to a fixed point execution unit 20. Multiple fixed point execution units may be provided for multi-thread operation. Optionally, a second fixed point execution unit (not shown) may be provided for dual thread processing.
The conventional processor 10 further includes a floating point unit 25 for execution of floating point instructions. Preferably, the decoder 15 dispatches instructions in information packets over a common bus to both the fixed point execution unit 20 and the floating point unit 25.
The fixed point execution unit 20 includes a mapper 30 associated with a scheduler queue 35 and pickers 40. These components control the selective distribution of Ops among a plurality of arithmetic logic (EX) pipelines 45 and address generation (AG) pipelines 50. The pipelines 45 and 50 execute Ops queued in the scheduler queue 35 by the mapper 30 that are picked therefrom by the pickers 40 and directed to an appropriate pipeline 45 or 50. In executing a micro-instruction, the pipelines 45 and 50 identify the specific kind of Op to be performed by a respective OpCode assigned to that kind of micro-instruction.
In the example shown in FIG. 1, the fixed point execution unit 20 includes four pipelines for executing queued Ops. A first arithmetic logic pipeline 45₁ (EX0) and a first address generation pipeline 50₁ (AG0) are associated with a first set 55₁ of physical registers in which data is stored relating to execution of specific Ops by the two pipelines 45₁ and 50₁. A second arithmetic logic pipeline 45₂ (EX1) and a second address generation pipeline 50₂ (AG1) are associated with a second set 55₂ of physical registers in which data is stored relating to execution of specific Ops by those two pipelines 45₂ and 50₂. Preferably, there are 96 physical registers in each of the first and second sets of registers 55₁ and 55₂.
In the example fixed point execution unit 20 shown in FIG. 1, the arithmetic logic pipelines 45 (EX0, EX1) have asymmetric configurations. The first arithmetic pipeline 45₁ (EX0) is preferably the only pipeline configured to process divide (DIV) Ops 60 and count leading zero (CLZ) Ops 65 within the fixed point execution unit 20. The second arithmetic pipeline 45₂ (EX1) is preferably the only pipeline configured to process multiplication (MULT) Ops 70 and branch Ops 75 within the fixed point execution unit 20.
DIV and MULT Ops generally require multiple clock cycles to execute. The complexity of both arithmetic pipelines is reduced by not requiring either arithmetic pipeline to perform all possible arithmetic Ops, and by dedicating each multi-cycle arithmetic Op to execution by only one of the two arithmetic pipelines. This saves chip real estate while still permitting a substantial overlap in the sets of Ops that can be executed by the respective arithmetic pipelines EX0, EX1.
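The asymmetric assignment described above can be illustrated with a short sketch. Only the DIV, CLZ, MULT and branch assignments come from the description; the function name, the string labels, and the rule that every other arithmetic Op may run on either pipeline are assumptions made for illustration.

```python
# Ops that, per the description, only one arithmetic pipeline can execute.
EX0_ONLY = {"DIV", "CLZ"}
EX1_ONLY = {"MULT", "BRANCH"}


def eligible_ex_pipelines(op):
    """Return the set of arithmetic pipelines that may execute `op`.

    Assumption: any arithmetic Op not listed above is supported by both
    pipelines (the "substantial overlap" mentioned in the text).
    """
    if op in EX0_ONLY:
        return {"EX0"}
    if op in EX1_ONLY:
        return {"EX1"}
    return {"EX0", "EX1"}
```

Under these assumptions, an ADD would be eligible for either pipeline, while a DIV would have to wait for EX0.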
The processing speed of the fixed point execution unit 20 may be affected by the operation of any of the components. Since all the micro-instructions that are processed must be mapped by the mapper 30 into the scheduler queue 35, any delay in the mapping/queuing process can adversely affect the overall speed of the fixed point execution unit 20.
There are three kinds of Ops requiring retrieval and/or storage of data; namely, load (LD), store (ST) and load/store (LD-ST). These Ops are performed by the address generation pipelines 50 (AG0, AG1) in connection with a load/store unit 80 of the fixed point execution unit 20.
The pickers 40 of the conventional processor 10 may include at least one fixed priority encoder 85. Typical priority encoders, which are used for age order picks in any scheduler-like logic, depend on the occurrence of an allocation in a fixed order (top-to-bottom or bottom-to-top). A fixed priority encoder works on a set of requesters, which are the Ops having all sources available and ready to be picked. The fixed priority encoder also works on age arbitrates, which indicate the relative age information for all of the Ops in the queue. Based on the foregoing, the fixed priority encoder identifies at least one requester that is granted the request for an entry to be picked.
FIG. 2A shows a plurality of queue positions QP1 . . . QPn in the scheduler queue 35. The scheduler queue 35 preferably has 40 positions. Generally, it is preferable to have at least five times as many queue positions as there are pipelines to prevent bottlenecking of the unified scheduler queue 35. However, when a unified queue that services multiple pipelines has too many queue positions, scanning Ops may become time prohibitive and impair the speed at which the scheduler operates. The scheduler queue 35 is sized such that queued instructions for each of the four pipelines can be picked and directed to the respective pipeline for execution in a single cycle. The full effect of the speed of the scheduler queue 35 directing the execution of queued instructions can be realized because there is no impediment in having instructions queued into the scheduler queue due to the mapper's speed in queuing instructions based on OpTypes, which may signify whether an instruction is an EX operation or an AG operation.
Referring again to FIG. 1, the mapper 30 is configured to queue a micro-instruction into an open queue position based on the micro-instruction's information packet received from the decoder 15. Preferably, the mapper 30 is configured to receive two instruction information packets in parallel, which the mapper 30 preferably queues in a single clock cycle. The decoder 15 is preferably configured to dispatch four instruction information packets in parallel. Two of the packets are preferably flagged for potential execution by the fixed point execution unit 20 and the other two are flagged for potential execution by the second, similar fixed point execution unit (not shown).
Preferably, the floating point unit 25 scans the OpType of all four packets dispatched in a given clock cycle. Any floating point instruction components indicated by the scan of the OpType field data of the four packets are then queued and executed in the floating point unit 25.
The mapper 30 is preferably configured to make a top-to-bottom scan and a bottom-to-top scan of the queue positions QP1-QPn in parallel to identify a topmost open queue position and a bottommost open queue position, one for each of the two micro-instructions corresponding to the two packets received in a given clock cycle.
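The two scans can be sketched as follows. This is a sequential software model of what the mapper would do in parallel hardware; the function name and the list-of-booleans occupancy representation are assumptions. The description does not say what happens when only one position is open, in which case both scans in this sketch return the same index.

```python
def find_open_positions(occupied):
    """Return (topmost_open, bottommost_open) indices into `occupied`.

    `occupied[i]` is True when queue position i holds an instruction.
    Either result is None when no position is open.
    """
    n = len(occupied)
    # Top-to-bottom scan: first open position from index 0.
    top = next((i for i in range(n) if not occupied[i]), None)
    # Bottom-to-top scan: first open position from the last index.
    bottom = next((i for i in range(n - 1, -1, -1) if not occupied[i]), None)
    return top, bottom
```

For an occupancy pattern of `[True, False, True, False, True]`, the sketch returns positions 1 and 3, one for each of the two incoming micro-instructions.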
Where the OpType field data of a dispatched packet indicates OpType FP, the micro-instruction corresponding to that packet is not queued because it only requires execution by the floating point unit 25. Accordingly, even when two instruction information packets are received from the decoder 15 in one clock cycle, one or both micro-instructions may not be queued in the scheduler queue 35 for this reason.
One of the primary goals for the scheduler queue 35 is to try to pick operations from a pool of Ops in their age order. Once a plurality of Ops are stored in the scheduler queue 35, it is desirable to pick those entries that are ready to be executed in the order in which they arrived in the scheduler queue 35 to provide the best possible scheduling of the Ops. However, in order to do that traditionally, the entries in the queue are always maintained in age order. Thus, the top entry is the oldest, and the bottom entry is the newest, and a significant amount of logic and power is required to maintain the scheduler queue 35 in this manner.
As shown in FIG. 2A, each queue position QP1 . . . QPn is associated with memory fields for an arithmetic logic instruction (ALU payload) 235, an address generation instruction (AG payload) 240, four wakeup content-addressable memories (CAMs) 205, 210, 215 and 220 (sources A-D) that identify addresses of physical registers that contain source data for the instruction, and a destination CAM 225 (destination) that identifies a physical register where the data resulting from the execution of the micro-instruction is to be stored.
A separate data field 230 (immediate/displacement) is provided for accompanying data that an instruction is to use. Such data is sent by the decoder 15 in the dispatched packet for that instruction. For example, a load operation Ld is indicated in queue position QP1 that seeks to have the data stored at the address 6F3D, indicated in the immediate/displacement data field, loaded into the physical register identified as P5. In this case, the address 6F3D was data contained in the instruction's information packet dispatched from the decoder 15, which information was transferred to the immediate/displacement data field 230₁ for queue position QP1 in connection with queuing that instruction to queue position QP1.
The ALU payload fields 235 and the AG payload fields 240 are configured to contain the specific identity of an instruction as indicated by the instruction's OpCode, along with relative address indications of the instruction's required sources and destinations that are derived from the corresponding dispatched data packet. In connection with queuing, the mapper 30 translates relative source and destination addresses received in the instruction's information packet into addresses of physical registers associated with the pipelines 45 and 50 of FIG. 1.
The mapper 30 tracks relative source and destination address data received in the instruction information packets so that it can assign the same physical register address to a respective source or destination where two instructions reference the same relative address. For example, P5 is indicated as one of the source operands in the ADD instruction queued in queue position QP2, and P5 is also identified as the destination address of the result of the Ld operation queued in queue position QP1. This indicates that the dispatched packet for the Ld instruction indicated the same relative address for the destination of the Ld operation as the dispatched packet for the ADD instruction had indicated for one of the ADD source operands.
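The tracking behavior described above can be modeled with a small lookup table. This is a minimal sketch under stated assumptions: the class name, the free-list allocation policy, and the relative address labels are hypothetical, and the description does not say how or when physical registers are allocated or released. The sketch only shows that two references to the same relative address resolve to the same physical register.

```python
class RegisterMapper:
    """Toy model of relative-to-physical register address translation."""

    def __init__(self, num_physical=96):
        # Assumption: physical registers are handed out from a simple free list.
        self.free = list(range(num_physical))
        self.table = {}  # relative address -> physical register number

    def physical_for(self, relative):
        """Return the physical register for `relative`, allocating on first use."""
        if relative not in self.table:
            self.table[relative] = self.free.pop(0)
        return self.table[relative]


mapper = RegisterMapper()
ld_destination = mapper.physical_for("rel_a")  # destination of the Ld
add_source = mapper.physical_for("rel_a")      # source operand of the ADD
other_source = mapper.physical_for("rel_b")
```

Here `ld_destination` and `add_source` refer to the same physical register, which mirrors the way P5 appears both as the Ld destination in QP1 and as an ADD source in QP2.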
Referring to FIGS. 1 and 2A, flags are provided in the scheduler queue 35 to indicate eligibility for picking the instruction for execution in the respective pipelines 45 and 50 (EX0, EX1, AG0, and AG1). The pickers 40 preferably include an individual picker for each of the ALU pipelines 45 (EX0, EX1) and the AG pipelines 50 (AG0, AG1). Each respective pipeline's picker scans the respective pipeline picker flags of the queue positions to find queued operations that are eligible for picking. Upon finding an eligible queued operation, the picker checks to see if the instruction is ready to be picked. If it is not ready, the picker resumes its scan for an eligible instruction that is ready to be picked. Preferably, the EX0 and AG0 pickers scan the flags from the top queue position QP1 to the bottom queue position QPn, and the EX1 and AG1 pickers scan the flags from the bottom queue position QPn to the top queue position QP1 during each cycle. A picker will stop its scan when it finds an eligible instruction that is ready for execution, and then direct that instruction to its respective pipeline. Preferably this occurs in a single clock cycle.
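The per-pipeline scan can be sketched as below. The `Entry` structure and function name are assumptions for illustration, and readiness is supplied as a plain boolean. The sketch does not model how a conflict is resolved when two pickers select the same entry in one cycle, since the description does not address it.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    eligible: frozenset  # pipelines whose picker flag is set for this entry
    ready: bool          # True when all needed sources are available


def pick(queue, pipeline, top_to_bottom):
    """Return the index of the first eligible and ready entry, or None.

    `queue` is a list of Entry objects, with None marking an empty position.
    """
    n = len(queue)
    order = range(n) if top_to_bottom else range(n - 1, -1, -1)
    for i in order:
        entry = queue[i]
        # Skip empty positions, ineligible entries, and entries not yet ready.
        if entry is not None and pipeline in entry.eligible and entry.ready:
            return i
    return None


AG = frozenset({"AG0", "AG1"})
EX = frozenset({"EX0", "EX1"})
queue = [Entry(AG, True), Entry(EX, False), None, Entry(EX, True)]
```

With this queue, the EX0 picker scanning from the top skips the not-ready entry at position 1 and selects position 3, while the AG1 picker scanning from the bottom selects position 0.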
Readiness for picking is indicated by the source wakeup CAMs 205, 210, 215 and 220 for the particular operation component being awake indicating a ready state. Where there is no wake up CAM being utilized for a particular instruction component, the instruction is automatically ready for picking. For example, the Ld operation queued in queue position QP1 does not utilize any source CAMs so that it is automatically ready for picking by either of the AG0 or AG1 pickers upon queuing. In contrast, the ADD instruction queued in queue position QP2 uses the queue position's wakeup CAMs sources A and B. Accordingly, that ADD instruction is not ready to be picked until the physical registers P1 and P5 have been indicated as ready by queue position QP2's wakeup CAMs source A and source B being awake.
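The readiness rule can be expressed compactly. In this sketch, which uses hypothetical class and method names, each used source has a wakeup flag keyed by its physical register tag, and an entry with no sources is ready immediately.

```python
class QueueEntry:
    """Toy model of one queue position's source wakeup CAMs."""

    def __init__(self, sources):
        # One flag per used source CAM: physical register tag -> awake?
        self.cams = {tag: False for tag in sources}

    def wake(self, tag):
        """Handle a broadcast of a physical register tag."""
        if tag in self.cams:
            self.cams[tag] = True

    def ready(self):
        # all() over an empty collection is True, so an instruction that
        # uses no source CAMs is ready as soon as it is queued.
        return all(self.cams.values())


ld = QueueEntry([])             # the Ld in QP1 uses no source CAMs
add = QueueEntry(["P1", "P5"])  # the ADD in QP2 waits on P1 and P5
```

The Ld entry reports ready at once, whereas the ADD entry becomes ready only after both `add.wake("P1")` and `add.wake("P5")` have been called.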
Where one of the arithmetic pipelines is performing a multi-cycle operation, the pipeline preferably provides its associated picker with an instruction to suspend picking operations until the arithmetic pipeline completes execution of that multi-cycle operation. In contrast, the address generation pipelines are preferably configured to commence execution of a new address generation instruction without awaiting the retrieval of load data for a prior instruction. Accordingly, the pickers will generally attempt to pick an address generation instruction for each of the address generation pipelines AG0, AG1 for each clock cycle when there are available address generation instructions that are indicated as ready to pick.
In some cases, the CAMs may wake before the required data is actually stored in the designated physical register. Typically, when a load instruction is executed where a particular physical register is indicated as the load destination, that physical register address is broadcast after four cycles to the wakeup CAMs in order to wake up all of the CAMs designated with the physical register's address. Four cycles is a preferred nominal time it takes to complete a load operation. However, it can take much longer if the data is to be retrieved by the load/store unit 80 from a remote location. Where an instruction is picked before the physical register actually contains the required data, the execution unit is preferably configured to replay the affected instructions, which are retained in their queue positions until successful completion.
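The interaction between the nominal four-cycle wakeup and replay can be illustrated with a simple timing model. The four-cycle figure comes from the description; the function name and the assumption that a replayed instruction is retried on every following cycle are hypothetical, since the description does not specify replay timing.

```python
NOMINAL_LOAD_CYCLES = 4  # nominal load completion time given in the text


def pick_attempts(load_issue_cycle, actual_latency):
    """Return the cycles at which a dependent instruction is picked.

    The last cycle in the list is the attempt that succeeds. Assumption:
    after a failed attempt the instruction is replayed on the next cycle.
    """
    wake_cycle = load_issue_cycle + NOMINAL_LOAD_CYCLES
    data_cycle = load_issue_cycle + actual_latency
    attempts = []
    cycle = wake_cycle
    while True:
        attempts.append(cycle)
        if cycle >= data_cycle:
            return attempts  # data is present, so execution completes
        cycle += 1           # data not yet present, so replay
```

Under this model, a load that completes in the nominal four cycles yields a single successful pick, while a load that takes seven cycles causes three replays before the dependent instruction completes.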
FIG. 2B shows an example of conventional priority encoding in a six-entry scheduler queue having entry numbers 0-5 with corresponding requests (i.e., operations ready to be picked) and results (i.e., the output of a priority encoder) for both a top-to-bottom fixed priority encoder 250 and a bottom-to-top fixed priority encoder 260 located in the scheduler queue 35. The priority encoder 250 generates a “one-hot vector” (a vector having no more than one bit having a logic one value) based on a “multi-hot vector” (a vector that may have more than one bit having a logic one value). In this example, the top entry number is 0 (i.e., the oldest entry), the bottom entry number is 5 (i.e., the youngest entry), and the entry numbers 1, 2 and 4 are occupied (i.e., the six-entry queue currently has an occupancy of 3 entries). In accordance with the example shown in FIG. 2B, the first entry from the top (entry 1) that requests to be picked is granted a result when the top-to-bottom fixed priority encoder 250 is used, and the first entry from the bottom (entry 4) that requests to be picked is granted a result when the bottom-to-top fixed priority encoder 260 is used. Multiple pickers are implemented because more than one operation is issued in each cycle. The priority encoders for different pickers may be configured to scan in different directions.
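The conversion from a multi-hot request vector to a one-hot result vector can be sketched as follows. The function name is an assumption, and the request vector assumes that the three occupied entries (1, 2 and 4) are the ones requesting to be picked.

```python
def priority_encode(requests, top_to_bottom=True):
    """Return a one-hot result vector granting the first request found.

    `requests` is a multi-hot list of 0/1 values indexed by entry number.
    """
    n = len(requests)
    result = [0] * n
    order = range(n) if top_to_bottom else range(n - 1, -1, -1)
    for i in order:
        if requests[i]:
            result[i] = 1  # grant the first requester in scan order
            break
    return result


requests = [0, 1, 1, 0, 1, 0]  # entries 1, 2 and 4 request to be picked
top_result = priority_encode(requests, top_to_bottom=True)
bottom_result = priority_encode(requests, top_to_bottom=False)
```

The top-to-bottom scan grants entry 1 and the bottom-to-top scan grants entry 4, consistent with the FIG. 2B example.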
In order to perform an age pick through the typical priority encoder logic, extra hardware is required to continually re-arrange the scheduler queue 35 such that holes are either at the top or the bottom of the queue, depending on the priority encoder's scan direction. Such holes result when an entry is picked, issued and executed. The entry that is picked is then cleared so that it will not be picked again in the next cycle.
FIG. 2C shows an example of a top-to-bottom age scheduler queue having entry numbers 0-5 with corresponding pick requests and results for a top-to-bottom fixed priority encoder 270 located in the scheduler queue 35. As shown in the example of FIG. 2C, there are five valid entries in cycle N, and entry numbers 1, 2 and 4 are requesting to be picked. As indicated in the result column of FIG. 2C, only entry 1 will be granted its request (i.e., receive a result of 1) to be picked, issued and executed. However, in cycle N+1, it is necessary for the scheduler queue 35 to be age ordered again, whereby the hole, which is created when entry number 1 is picked, issued and executed, has to be “collapsed” by shifting entry numbers 2, 3 and 4 to entry numbers 1, 2 and 3.
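The collapse step can be sketched in a few lines. The function name and the letter labels for the queued Ops are placeholders; the sketch models only the re-ordering of entries, not the hardware that performs the shift.

```python
def collapse(queue, picked):
    """Remove the entry at index `picked` and shift younger entries up.

    Empty positions are represented by None and end up at the bottom,
    so the queue stays in age order with the oldest entry at index 0.
    """
    return queue[:picked] + queue[picked + 1:] + [None]


cycle_n = ["A", "B", "C", "D", "E", None]  # five valid entries, 0 through 4
cycle_n_plus_1 = collapse(cycle_n, 1)      # entry 1 is picked in cycle N
```

After the collapse, the Ops previously held in entries 2, 3 and 4 occupy entries 1, 2 and 3.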
This shifting and collapsing process becomes even more complicated when the picker can pick more than one entry per cycle. For example, if entry numbers 1 and 3 shown in FIG. 2C are picked, then there is a variable shift amount for each entry (i.e., entry number 2 will shift by 1 entry, but entry numbers 4 and 5 will shift by 2 entries). This shifting and collapsing process is complicated and slows down the cycle time for other logic, and it causes a significant power drain as well. For example, consider a scenario where more than 400 bits are shifted each time an Op (entry) is picked.
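The variable shift amounts can be computed as in the sketch below, where each surviving entry moves up by the number of picked entries above it. The function name is an assumption, and the example assumes all six entries are occupied so that entry 5 has something to shift.

```python
def shift_amounts(valid, picked):
    """Return {entry index: positions moved up} for entries that must shift.

    `valid[i]` is True when entry i is occupied; `picked` is the set of
    entry indices picked in this cycle.
    """
    shifts = {}
    holes_above = 0
    for i in range(len(valid)):
        if i in picked:
            holes_above += 1         # this entry leaves a hole behind
        elif valid[i] and holes_above:
            shifts[i] = holes_above  # move up past every hole above
    return shifts


shifts = shift_amounts([True] * 6, {1, 3})
```

The result maps entry 2 to a shift of 1 and entries 4 and 5 to a shift of 2, so the shifting logic must support a different shift distance per entry in the same cycle.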
It would be desirable to eliminate the shifting and collapsing behavior of the scheduler queue. Doing so would remove the associated logic, cycle-time and power costs and thereby enhance the efficiency of the processor.