1. Field
The present application relates generally to computer processors and, more specifically, to mechanisms for storing and referencing transient operands that are produced and consumed by the computer processors.
2. State of the Art
Computer processors execute operations on data. An individual data value (an operand) is produced by some producer operation, recorded, and then used later by one or more consumer operations. The time between production and consumption by the last consumer is the lifetime of the operand. Operands vary widely in lifetime, but lifetimes can usually be loosely categorized into persistent (or global) lifetimes that last for an appreciable fraction of total program execution; local lifetimes that last for the duration of a function or several statements in the program; and transient lifetimes that last for only portions of a single expression in the program. These categories are not sharp, and programs exhibit a continuum of lifetimes, but the rough grouping is strong enough that computer hardware usually contains different storage means for operands of each category. For example, persistent operands may use a software-provided heap in memory, while local operands may use a hardware-assisted stack and transient operands may use a register bank implemented wholly in hardware.
Transient operands are ubiquitous. For example, if the source program contains the expression “A+B+C” then the computer will execute a first add operation of A and B, and then a second add operation of the result of the first add operation and C. The A+B result is typically transient and will be discarded as soon as it is consumed by the second add operation, although it may have a longer lifetime if the same A+B calculation appears elsewhere and the intermediate result can be reused.
Many prior art computer processors employ a set of general registers, which are storage devices that can hold a single operand each. Machine operations like addition take their arguments from and deliver their result to registers. Thus, a register is the holding place for transient operands. When the lifetime of an operand ends, the register holding it can simply be overwritten by some other newly computed operand. Register usage by a program is very high because there are so many transients. Consequently, computer processor designers go to great lengths to ensure that access to registers is very fast and that there are enough registers to hold any reasonable transient population. Operands that do not fit in the available registers must be kept elsewhere, typically in memory, and access to such spilled operands takes tens to hundreds of times longer than access to a register. Because of the speed advantage of registers, registers not needed for transients are commonly used for frequently referenced operands with more-than-transient lifetimes, even very long-lived global operands. Each extra operand that can reside in the registers improves the speed of the program by avoiding a lengthy memory access. This design force tends to cause designers to increase the number of registers in a design, so that more operands can be register resident. Balancing this force are two other effects of increased register count: instruction entropy and hardware complexity.
Entropy refers to the information-theoretic density of the machine representation (the encoding) of instructions to be executed. Each instruction must encode an indicator of the operation to be performed (the opcode) and the places that data arguments for the instruction must come from and results go to (the addresses for the source and result operands). Typical computational operations (such as an add) require two source operand addresses and one result operand address, in addition to the opcode. The operand addresses are register numbers when the arguments and results are in registers. When a design increases the number of addressable registers, it necessarily also increases the size of the address required to indicate which register to use. Thus, if there are eight registers (as in some early machines), an operand address occupies three bits and a register-based add operation uses nine bits for addressing, whereas if there are 128 registers (as in some recent machines), an operand address occupies seven bits and an add requires 21 bits of address.
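The arithmetic in the preceding paragraph can be reproduced with a short sketch. The function names (address_bits, operand_address_bits) are hypothetical and chosen only for illustration, and the sketch assumes a three-address operation (two sources and one result) in which every operand is named by a plain binary register number.

```python
def address_bits(num_registers: int) -> int:
    """Bits needed to name one register out of num_registers (hypothetical helper)."""
    if num_registers < 1:
        raise ValueError("a machine needs at least one register")
    # The highest register number is num_registers - 1; its bit length is
    # the width of a plain binary register address.
    return (num_registers - 1).bit_length()


def operand_address_bits(num_registers: int, operands: int = 3) -> int:
    """Total address bits in one operation; 3 operands assumes two sources and one result."""
    return operands * address_bits(num_registers)


if __name__ == "__main__":
    for count in (8, 32, 128):
        print(count, address_bits(count), operand_address_bits(count))
```

Under these assumptions, eight registers give three bits per operand and nine bits per add, and 128 registers give seven bits per operand and 21 bits per add, matching the figures stated above.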
Unfortunately, other considerations often dictate that the size of an instruction must be a power-of-two number of bits, such as 16 or 32. Increasing the number of registers (and hence the number of address bits in an instruction) then necessarily reduces the number of bits available for the opcode and other purposes. As a result, it is generally impractical to have more than 32 registers while retaining a fixed 32-bit instruction length. Moreover, extra registers increase the total size of a program even if the design uses a wider or variable-length instruction to admit more than 32 registers. The increased program size and decode complexity may cause problems with the memory bandwidth and instruction cache of the machine.
Besides the entropy effect, increasing the number of registers also increases the complexity, chip area, and power requirements of the machine. Each potential functional unit consumer of an operand and each functional unit producer of a result operand must be able to communicate with each register, which requires connections whose chip area and power grow super-linearly. Moreover, modern processors typically include a bypass network whose complexity increases non-linearly. The bypass network is used to deal with pairs of operations that have a producer-consumer relationship, i.e., the transient result of the first is immediately used by the second. The bypass network avoids the latency of moving the result operand from the first operation into a register and then fetching it again as a consumer operand for the second operation. Instead, special hardware circuitry detects the producer-consumer relation and the bypass network routes the transient operand directly from the producing functional unit (such as an adder) to the consumer without waiting for the operand to reach the register. However, the bypass network is often the critical timing path of the whole machine, so any slowdown of the bypass network slows down the execution of every operation. Consequently, the design of a register-based machine reflects a balance between the storage performance advantages of extra registers and the encoding and execution performance costs of those registers.
However, a designer is not necessarily restricted to using registers for transients. There are other architectural categories that avoid many of the register problems by not using general registers in the first place. Two of these alternative approaches are accumulator machines and stack machines.
In an accumulator machine there is exactly one register for transient operands, although there may be other registers for longer-lived operands as well. All operations take one of their inputs from the accumulator, and place their result in the accumulator. Because there is only one, addressing the accumulator is implicit and does not require any address bits in the operation. Consequently, a computational operation contains only a single address, for the second argument, not three as in a general register machine. Of course, the first operand of an expression must be placed into the accumulator by an extra operation to start things off, which adds some extra cost to the use of an accumulator. In practice accumulator designs eliminate any entropy problems, and accumulator machines frequently have very small instructions, with a net gain even allowing for the extra operations to load the accumulator. Such designs also eliminate the bypass network because the producer and the following consumer are necessarily the same, namely the accumulator. This makes expressions such as “A+B+C” have a compact encoding and rapid execution.
However, an accumulator machine is optimal only if the most recent transient is immediately needed in the expression. In an expression like “(A*B)+(C*D)” there are two multiplies, both of which must be done before the add can sum their results. On an accumulator machine, the second multiply and the add can be done using the accumulator, but the result of the first multiply must be saved somewhere or it will be overwritten by the result of the second.
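The contrast between the two expressions can be illustrated with a minimal sketch of a hypothetical accumulator machine. The operation names (LOAD, STORE, ADD, MUL), the memory model (a dictionary keyed by operand name), and the temporary location T are assumptions made for illustration and do not describe any particular processor.

```python
def run_accumulator(program, memory):
    """Execute (operation, address) pairs on a hypothetical single-accumulator machine."""
    acc = 0
    for op, name in program:
        if op == "LOAD":        # extra operation that starts an expression
            acc = memory[name]
        elif op == "STORE":     # save the accumulator so it is not overwritten
            memory[name] = acc
        elif op == "ADD":       # the accumulator is the implicit first input and the result
            acc = acc + memory[name]
        elif op == "MUL":
            acc = acc * memory[name]
        else:
            raise ValueError(f"unknown operation: {op}")
    return acc


# A+B+C: every intermediate result is consumed by the very next operation.
SUM_CHAIN = [("LOAD", "A"), ("ADD", "B"), ("ADD", "C")]

# (A*B)+(C*D): the first product must be stored to T before the second
# multiply overwrites the accumulator.
SUM_OF_PRODUCTS = [
    ("LOAD", "A"), ("MUL", "B"), ("STORE", "T"),
    ("LOAD", "C"), ("MUL", "D"), ("ADD", "T"),
]

if __name__ == "__main__":
    values = {"A": 2, "B": 3, "C": 4, "D": 5}
    print(run_accumulator(SUM_CHAIN, dict(values)))
    print(run_accumulator(SUM_OF_PRODUCTS, dict(values)))
```

In this sketch the first expression needs three operations and no temporary, while the second needs six operations, two of which (the STORE to T and the later reference to T) exist only to preserve the first product.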
In a stack machine, transient operands are stored in a last-in-first-out (LIFO) stack, so that temporaries not needed immediately can simply be pushed onto the stack. In such a stack machine, the computational operations contain no addresses at all, but operate on the top two operands in the stack by removing or popping them from the stack and pushing the result onto the top of the stack. As in the accumulator machine, the encoding requires extra operations to preload the stack with any operands that are not transients. However, operation encoding density is very good even allowing for these costs, and no bypass network is required.
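A minimal sketch of a hypothetical stack machine evaluating “(A*B)+(C*D)” follows. The operation names (PUSH, ADD, MUL) and the dictionary used as memory are assumptions made for illustration only; the point is that only the preload operations carry an address, while the computational operations carry none.

```python
def run_stack(program, memory):
    """Execute a program on a hypothetical LIFO stack machine and return the final top of stack."""
    stack = []
    for instruction in program:
        op = instruction[0]
        if op == "PUSH":
            # Preload operation: the only kind of instruction with an address.
            stack.append(memory[instruction[1]])
        elif op in ("ADD", "MUL"):
            # Computational operations pop the top two operands and push the result.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
        else:
            raise ValueError(f"unknown operation: {op}")
    return stack.pop()


# (A*B)+(C*D): the first product simply waits on the stack while the
# second product is computed above it.
STACK_SUM_OF_PRODUCTS = [
    ("PUSH", "A"), ("PUSH", "B"), ("MUL",),
    ("PUSH", "C"), ("PUSH", "D"), ("MUL",),
    ("ADD",),
]

if __name__ == "__main__":
    print(run_stack(STACK_SUM_OF_PRODUCTS, {"A": 2, "B": 3, "C": 4, "D": 5}))
```

In this sketch no explicit temporary location is named anywhere in the program; the intermediate product is held implicitly by its position in the stack.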
Despite their advantages, accumulator and stack designs are rarely used where performance is a concern because they are inherently sequential in execution. Because there is only one accumulator (or one top of stack), they can execute only one operation at a time, whereas most modern processor designs try very hard to execute more than one operation in parallel.
Note that it is possible to put more than one accumulator machine or stack machine into a single computer or chip, but that approach gains little because each must have its own instruction decoder and other components. It is also possible to put more than one accumulator into a single machine, but the result is called a general register machine with the drawbacks noted above.