1. Technical Field
Embodiments of the present invention generally relate to computer processors. More particularly, embodiments relate to the fusing of multiple operations into a single micro-operation.
2. Discussion
Computers have become an integral part of modern society, and the demand for more functionality, lower costs and greater efficiency continues to grow. In order for computers to continue to meet the needs of the marketplace, a number of software as well as hardware issues must be addressed. For example, compiling programs into low-level macro-instructions, decoding the macro-instructions into even lower-level micro-operations (uops), reassigning logical registers to physical registers based on the uops, processing the uops, and retiring the uops after execution are but a small sampling of the processes that must be considered when improving computer efficiency.
A conventional uop has one operational code (opcode) field and two source fields. The opcode field specifies the operation to be performed and the source fields provide the data to be used in the operation. Traditional approaches to decoding macro-instructions, such as method 20 shown in FIG. 1, involve transferring data relating to a first operation from the macro-instruction to a first uop at processing block 22. Data relating to a second operation is transferred from the macro-instruction to a second uop at block 24. Thus, a single macro-instruction is often broken into multiple uops.
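The conventional uop format and the two-step transfer described above can be pictured with a minimal sketch. The field names (`opcode`, `src1`, `src2`), the helper `decode_conventional`, and the placeholder opcodes are hypothetical and chosen only for illustration; they do not describe any particular decoder.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Uop:
    """Conventional uop: one opcode field and two source fields."""
    opcode: str
    src1: Optional[str] = None
    src2: Optional[str] = None


def decode_conventional(first_op: Tuple[str, str, str],
                        second_op: Tuple[str, str, str]) -> List[Uop]:
    """Transfer each operation of one macro-instruction into its own uop."""
    first = Uop(*first_op)    # corresponds to processing block 22 of FIG. 1
    second = Uop(*second_op)  # corresponds to block 24 of FIG. 1
    return [first, second]


# One macro-instruction carrying two operations yields two uops.
uops = decode_conventional(("OP_A", "a0", "a1"), ("OP_B", "b0", "b1"))
```

In this sketch each uop can name at most two sources, which is why a macro-instruction that needs more than two inputs ends up spread over more than one uop.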
For example, a typical read-modify (or load-op) macro-instruction involves the first operation of reading a first operand from a particular address in memory, and the second operation of generating a final result based on the first operand and a second operand. Thus, the first uop is dedicated to the read operation and the second uop is dedicated to the modify operation. The opcode field of the first uop receives the appropriate opcode for the read operation, and the source fields receive the address data that specifies the memory location of the first operand. As will be discussed below, address data typically includes an address base, an address displacement, and an address index that incorporates a scaling factor. One approach to specifying memory addresses is discussed in U.S. Pat. No. 5,860,154 to Abramson, et al., although other approaches may also be used. The opcode field of the second uop receives the appropriate opcode for the modify operation, and the source fields receive the first operand (resulting from execution of the first uop) and the second operand. It should be noted that since the first operand results from execution of the first uop, one of the source fields in the second uop is left blank at the decoder stage. The first operand is typically copied from the memory location to the second uop at the reservation station stage of the pipeline (discussed below).
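The read-modify case can be traced with a small sketch. Several details here are assumptions made for illustration and are not stated in the text: the address components are combined as base + index × scale + displacement, the two source fields of the read uop hold the base and the scaled index, the displacement is carried alongside the uop, and the modify operation is an addition. The opcode names and memory contents are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Uop:
    opcode: str
    src1: Optional[int] = None
    src2: Optional[int] = None


def execute_load(uop: Uop, disp: int, memory: Dict[int, int]) -> int:
    # Assumed address formation: base + (index * scale) + displacement.
    return memory[uop.src1 + uop.src2 + disp]


def execute_add(uop: Uop) -> int:
    return uop.src1 + uop.src2


base, index, scale, disp = 0x1000, 2, 4, 8
memory = {base + index * scale + disp: 7}  # hypothetical memory contents

# Decoder stage: the read uop receives the address data; the first source
# of the modify uop is left blank because the first operand is not yet known.
load = Uop("LOAD", src1=base, src2=index * scale)
modify = Uop("ADD", src1=None, src2=5)

# Later stage: the loaded value is copied into the blank source field,
# after which the modify uop can execute.
first_operand = execute_load(load, disp, memory)
modify.src1 = first_operand
result = execute_add(modify)
```

The modify uop cannot execute until its blank source has been filled, so the two uops run strictly one after the other.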
When the macro-instruction implements the storage of data, the first operation is to calculate the address of the store, and the second operation is to store the data to the calculated address. Thus, the first uop is dedicated to the address calculation operation and the second uop is dedicated to the data storage operation. The opcode field of the first uop receives the appropriate opcode for the address calculation operation, and the source fields receive the address data that specifies the destination memory location of the store. The opcode field of the second uop receives the appropriate opcode for the data storage operation, and the source fields receive the first operand (resulting from execution of the first uop) and the second operand (representing the data to be stored). Unlike the case of the read-modify macro-instruction, both uops may have all the necessary values at the decoder stage.
One reason for breaking instructions into two uops has been the limited number of source fields available in traditional uops. For example, in a read-modify instruction two source fields are needed for the address data, and two source fields are needed for the operands. Since conventional uops only have two source fields, two uops have been required to implement the entire macro-instruction. A more important reason for breaking instructions into two uops has been the desire to reduce latencies through out-of-order execution. Under this well documented approach, uops are executed when all of the necessary dependencies are resolved (and the execution resources are available) instead of in the order in which they are encountered. Unfortunately, there are a number of instructions, such as read-modify, with atomic operations that are inherently serial. In other words, the second operation cannot start until the first operation has completed. As a result, the benefits of out-of-order execution are lost with regard to certain instructions. Furthermore, the use of more uops than necessary reduces the number of instructions that can be executed in a clock cycle. There is therefore a need to improve efficiency and performance with regard to processor macro-instructions that have inherently serial operations.
In the store case, there is a need to separate the data from the address in order to resolve the store-address operation such that future memory accesses will not be delayed. The memory order buffer (MOB) enforces serial accesses to the memory due to unresolved store addresses (i.e., loads cannot bypass stores to the same address). This serialization of future loads is performed based on the physical addresses of the cycles. If the address is not ready, all subsequent memory operations are held until the address is resolved. As it turns out, in most cases the operands for the address calculation are ready much earlier than the data of the store. In other words, the address is often a pointer to an element in a table, while the data is a result of a complex calculation. By breaking the store operation into two uops, the store-address operation is able to dispatch earlier, resolve all address conflicts and open the memory pipeline for other loads (in which any delay greatly affects performance).
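The benefit of splitting a store can be illustrated with a simplified timing sketch. It assumes a toy model in which a younger load is held until every older store address is resolved, and it ignores execution latencies and same-address checks. The class `Store`, the function names, and the cycle numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Store:
    addr_operands_ready: int  # cycle at which the address operands are available
    data_ready: int           # cycle at which the data to be stored is available


def store_address_resolved(store: Store, split: bool) -> int:
    """Cycle at which the store address becomes known to the MOB."""
    if split:
        # A separate store-address uop dispatches once its own operands are ready.
        return store.addr_operands_ready
    # A single uop has to wait for all of its sources, including the data.
    return max(store.addr_operands_ready, store.data_ready)


def earliest_load_dispatch(older_stores: List[Store], split: bool) -> int:
    """A younger load is held until all older store addresses are resolved."""
    return max((store_address_resolved(s, split) for s in older_stores), default=0)


# Address is a table pointer (ready early); data comes from a long calculation.
stores = [Store(addr_operands_ready=2, data_ready=30)]
split_cycle = earliest_load_dispatch(stores, split=True)
single_cycle = earliest_load_dispatch(stores, split=False)
```

Under these assumptions the younger load becomes eligible at cycle 2 when the store is split, versus cycle 30 when address and data travel in a single uop.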