1. Field
The present disclosure relates to computer processors.
2. Related Art
A computer processor (or central processing unit or CPU) executes a sequence of instructions, typically obtained from main memory, which are executed in positional order except when redirected by a branch, jump, call, or similar control-flow operation. The order is important because there are often semantic dependencies between pairs of instructions and the machine state would be different if the instructions were executed in a different order; that is, instruction execution is not commutative. However, strict order is not always required for a particular pair of instructions, and an important class of CPU architectures (called out-of-order execution (OOO) machines) detects the presence of semantic dependencies and reorders the execution of instructions in ways that preserve semantics while improving execution performance. Nevertheless, for nearly all CPU architectures, the original program instruction order is used as an implicit specification of the intended program semantics, whether reordered later or not.
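The notion of a semantic dependency between a pair of instructions can be illustrated with a minimal sketch. It assumes a simplified register machine in which each instruction writes one destination register and reads a set of source registers. The `Instr` type, the register names, and the `independent` helper are hypothetical and are used only for illustration; real OOO hardware must also track memory and flag dependencies.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    # Hypothetical, simplified instruction: one register written, several read.
    dest: str
    srcs: Tuple[str, ...]


def independent(first: Instr, second: Instr) -> bool:
    """Return True if `second` may be reordered or run in parallel with `first`."""
    raw = first.dest in second.srcs    # second reads what first writes
    war = second.dest in first.srcs    # second overwrites what first reads
    waw = first.dest == second.dest    # both write the same register
    return not (raw or war or waw)


add1 = Instr("r1", ("r2", "r3"))
add2 = Instr("r4", ("r5", "r6"))
mul = Instr("r7", ("r1", "r4"))

print(independent(add1, add2))  # the two adds share no registers
print(independent(add1, mul))   # mul reads r1, which add1 writes
```

In this sketch the two adds can be executed in either order or simultaneously, while the multiply must follow both adds.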
There is little to be gained by reordering when the CPU can execute only one instruction at a time. After all, if every instruction has to be executed individually, then any ordering should take as long to execute as any other. However, in the quest for CPU performance, computer designers have created CPUs that are capable of performing more than one operation simultaneously, in parallel. Clearly, if the program calls for two instructions to be executed in sequence, but they are actually executed simultaneously, then any semantic dependency between them will be violated. An OOO-architecture CPU can detect when two instructions, while sequential in the program, are independent and thus can be executed simultaneously. This permits the CPU to perform both instructions in parallel, shortening the total execution time of the program. The hardware to perform OOO reordering is large, difficult to design, and costly in chip area, power, and clock rate. Nevertheless, it can yield significant gains when the program instruction set interface specifies a single, nominally sequential, instruction stream. However, there are ways to obtain parallel execution by using a different approach to specifying instruction semantics.
One common approach to obtain parallel execution is referred to as “multi-threading,” where the program is specified not as a single sequential stream of instructions, but as several such streams. Sequential semantics are enforced within any single stream of instructions, but the streams themselves are considered to be independent and instructions between streams can be executed in any order except for certain specialized instructions which serve to synchronize the streams. Each stream may be executed by its own sub-CPU or pipeline, or the streams may be interleaved on a single CPU such that each uses resources left idle by the others.
In another approach to obtain parallel execution, typified by Very Long Instruction Word (VLIW) architectures, there is only one instruction stream, but each instruction may have several operations which are executed in parallel. In essence, a VLIW sees multiple operation streams rather than multiple instruction streams, where operations from multiple operation streams are concatenated together to form a single instruction in a single instruction stream. Each position at which an operation can reside within the instruction is called a slot. Because the operations of each slot are in a shared instruction, the multiple operation streams are synchronized at every cycle and advance in lock step. Consequently, an operation executed in a given cycle may be semantically dependent on any operation executed earlier and operations that are executed in later cycles may be semantically dependent on it, but operations (from a single instruction) executed in the same cycle cannot be dependent on each other. So long as there are at least as many independent operations in a cycle as there are slots, then all slots can be kept busy; if not, then some slots must remain idle. Code generation software such as a compiler analyzes the program and assigns individual operations to the slots so as to maximize performance. This task, called static scheduling, is similar to what an OOO machine does in dynamic scheduling hardware during execution. But because it is done once, in advance, and by software able to statically analyze and optimize future execution, the result is a much cheaper CPU and generally better performance for a large class of programs.
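Static scheduling into slots can be sketched as follows. This is a minimal greedy illustration, not the method of any particular compiler. It assumes that every operation completes in one cycle, that operations are supplied in program order, and that dependencies are given explicitly. The `schedule` function, the operation names, and the `"idle"` marker are hypothetical.

```python
def schedule(ops, deps, num_slots):
    """Greedily pack operations into VLIW-style instructions.

    ops:       operation names in program order
    deps:      dict mapping an operation to the operations it depends on
    num_slots: number of slots per instruction
    Assumes unit latency: a dependent operation may issue one cycle later.
    """
    cycle_of = {}       # operation name -> cycle in which it was placed
    instructions = []   # instructions[c] is the list of operations in cycle c
    for op in ops:
        # Earliest legal cycle is one past the latest operation depended on.
        c = max((cycle_of[d] + 1 for d in deps.get(op, ())), default=0)
        while True:
            while len(instructions) <= c:
                instructions.append([])
            if len(instructions[c]) < num_slots:
                break
            c += 1      # this cycle is full, try the next one
        instructions[c].append(op)
        cycle_of[op] = c
    # Slots that could not be filled remain idle.
    return [ins + ["idle"] * (num_slots - len(ins)) for ins in instructions]


deps = {"c": {"a", "b"}, "d": {"c"}}
for instruction in schedule(["a", "b", "c", "d"], deps, num_slots=2):
    print(instruction)
```

With two slots, the independent operations `a` and `b` share the first instruction, while `c` and `d` each occupy an instruction with one idle slot because no other independent operation is available in those cycles.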
The instruction and operation streams described here are abstract notions, which must be encoded as a sequence of primitive operations defined by bits in memory that are fetched and executed by the CPU. The encodings used by different CPU architectures vary greatly from each other, but all seek to balance ease of interpretation by hardware decode machinery against compactness of representation. In most architectures, the instructions are intended to be executed in a particular order as an instruction stream, where the execution order is usually determined by the address order of the instructions in memory but may be changed as a consequence of the execution of flow of control operations in the instruction stream as described above with respect to OOO machines.
Broadly, there are two sorts of encodings used for instructions: fixed-length encodings and variable-length encodings. In a fixed-length encoding, each instruction uses a single fixed number of bits for its representation, for example 32 bits. In a variable-length encoding, different instructions use different bit-lengths where the bit-length for a particular instruction is typically selected by minimizing the number of bits required to convey the semantics of that particular instruction. Thus, some instructions may be 8 bits in length, others 16 bits, 56 bits, or some other length. The fixed-length encoding approach is commonly associated with RISC (Reduced Instruction Set Computer) designs typified by the SPARC instruction set architecture, while the variable-length encoding approach is commonly associated with CISC (Complex Instruction Set Computer) designs typified by x86 instruction set architectures.
In general, fixed-length encodings are relatively easy to decode, and it is especially easy to decode several operations simultaneously in parallel because it is known a priori where in memory each operation starts. Parallel decode reads in a block of operations, breaks them at operation boundaries, and gives each of them to independent decoders. However, fixed-length encodings are not compact, because the semantics of many kinds of operation can be represented in fewer bits than the fixed length. Other kinds of operation need more bits than the encoding length, and so a single logical operation must be represented awkwardly as two or more fixed-length operations.
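The boundary-finding step for a fixed-length encoding can be sketched as below. The sketch assumes a 4-byte (32-bit) operation width, as in the example above; the `split_fixed` helper is hypothetical. The point is that every boundary is computed from the offset alone, without examining any operation's contents, so each piece could be handed to an independent decoder.

```python
WIDTH = 4  # assumed fixed operation width in bytes (32 bits)


def split_fixed(block: bytes, width: int = WIDTH):
    """Break a fetched block at operation boundaries known a priori."""
    if len(block) % width != 0:
        raise ValueError("block is not a whole number of operations")
    # Each slice depends only on its index, never on earlier operations.
    return [block[i:i + width] for i in range(0, len(block), width)]


block = bytes(range(12))        # three 4-byte operations
for op in split_fixed(block):
    print(op.hex())
```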
By contrast, variable-length encodings tend to be quite compact, which is economical of memory space and reduces the load on memory pathways arising from instruction fetch. However, the decode machinery does not know the length of a particular variable-length operation until it has examined it, a process called parsing the operation. This is a problem for modern architectures that execute several operations in parallel. While the decode hardware that parses operations can fetch a block of memory that contains several operations, it cannot know where any operation after the first begins until after it has parsed all prior operations. This serializes operation parse, whereas the fixed length encodings can be easily parsed in parallel. Schemes for parallel decode of variable length operations (despite the serial parse) exist, but are difficult to realize and very expensive in hardware and power consumption.
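The serial nature of variable-length parsing can be seen in a minimal sketch. It assumes a made-up encoding, not that of any real architecture, in which the first byte of each operation gives the operation's total length in bytes; the `parse_variable` helper is hypothetical. The start of each operation is known only after the previous operation's length has been read.

```python
def parse_variable(block: bytes):
    """Serially parse a block under a hypothetical length-prefixed encoding."""
    ops = []
    pos = 0
    while pos < len(block):
        length = block[pos]  # assumed: first byte holds the total length
        if length == 0 or pos + length > len(block):
            raise ValueError(f"malformed operation at offset {pos}")
        ops.append(block[pos:pos + length])
        pos += length        # next start depends on this operation's length
    return ops


block = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC])
for op in parse_variable(block):
    print(op.hex())
```

Unlike the fixed-length case, the loop carries `pos` from one iteration to the next, which is the serialization described above.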
Furthermore, there are two prior art approaches to instruction semantics. In one approach, typically referred to as sequential semantics, each instruction presumes that all prior instructions in the instruction stream have been executed to completion before the present instruction begins, and so all consequences of those prior instructions are fully reflected in machine state. If a prior instruction takes a long time to execute, then subsequent instructions simply wait for it to complete, a condition called a stall. In the other approach, typically referred to as timed semantics, some fixed number (typically one) of instructions are begun every time period whether prior instructions have completed or not. On a wide issue machine, each instruction may contain several operations that issue together (when the instruction issues) but complete independently. Each operation sees only the consequences of prior operations that have actually completed. There may be other in-flight operations that have begun execution but not yet completed, and the effects of these in-flight operations are invisible. If a prior operation takes a long time to execute, then there may be many subsequent instructions executed before the lengthy operation's results are available.
Clearly, if every instruction took exactly one time period to execute, then the two approaches would be the same in their effect. However, the natural execution time of different operations (called the latency) varies considerably in practice. For example, a double-precision floating point multiply instruction may take ten times as many cycles to perform as does a simple integer add instruction.
Early instruction designs nearly always used sequential semantics because doing so simplified the hardware, despite the limit of doing only one instruction and its single operation at a time. Modern designs accept increased CPU complexity in order to execute several operations in parallel, and so many designs (especially VLIW designs) use timed semantics.
Timed semantics instruction designs permit more than one operation to be executed in parallel. However, there are times when the program has no more operations to execute. For example, if a floating-point product is to be an argument of a function then the CPU cannot make the call until the product is ready, and there may not be anything else to do but wait. While an ability to wait is inherent in sequential semantics, in timed semantics the hardware expects to start an instruction every period, even when there's nothing to do. For this purpose, instruction sets with timed semantics always define a nop (no-operation) operation, which executes in one issue cycle and has no machine state consequences at all. The compiler or other instruction-generating software then fills any idle waiting periods with nop operations. Thus in the example, the actual instruction stream would contain the multiply operation, then some number of nop operations sufficient to let the multiply complete, and then the call.
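The nop filling described above can be sketched as follows. The sketch assumes a machine that issues one operation per cycle and that each operation's latency is known to the code generator. The `pad_with_nops` function, the operation names, and the latency values are hypothetical and chosen only for illustration.

```python
def pad_with_nops(ops):
    """Build a timed-semantics stream, one operation issued per cycle.

    ops: list of (name, latency, deps) tuples in program order, where deps
         names earlier operations whose results this operation needs.
    """
    stream = []     # position in the list is the issue cycle
    ready_at = {}   # operation name -> first cycle its result is visible
    for name, latency, deps in ops:
        need = max((ready_at[d] for d in deps), default=0)
        while len(stream) < need:
            stream.append("nop")    # nothing useful to issue yet
        ready_at[name] = len(stream) + latency
        stream.append(name)
    return stream


program = [
    ("fmul", 4, ()),         # assumed 4-cycle floating-point multiply
    ("call", 1, ("fmul",)),  # the call needs the product as its argument
]
print(pad_with_nops(program))
```

In this sketch the multiply issues in cycle 0, three nop operations occupy cycles 1 through 3, and the call issues in cycle 4, when the product is available.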
The nop operation thus preserves the benefit of timed semantics even when there are not enough useful operations to fill the necessary wait periods. The drawback of nop operations is that they must exist in the instruction stream and be processed as if they were useful, which costs memory and power.