1. Field
The present application relates to computer processors.
2. Related Art
A computer processor (or central processing unit or CPU) executes a sequence of instructions, typically obtained from main memory, which are executed in positional order except when redirected by a branch, jump, call, or similar control-flow operation. The order is important because there are often semantic dependencies between pairs of instructions and the machine state would be different if the instructions were executed in a different order; that is, instruction execution is not commutative. However, strict order is not always required for a particular pair of instructions, and an important class of CPU architectures (called out-of-order execution (OOO) machines) detects the presence of semantic dependencies and reorders the execution of instructions in ways that preserve semantics while improving execution performance. Nevertheless, for nearly all CPU architectures, the original program instruction order is used as an implicit specification of the intended program semantics, whether reordered later or not.
There is little to be gained by reordering when the CPU can execute only one instruction at a time. After all, if every instruction has to be executed individually, then any ordering should take as long to execute as any other. However, in the quest for CPU performance, computer designers have created CPUs that are capable of performing more than one operation simultaneously, in parallel. Clearly, if the program calls for two instructions to be executed in sequence, but they are actually executed simultaneously, then any semantic dependency between them will be violated. An OOO-architecture CPU can detect when two instructions, while sequential in the program, are independent and thus can be executed simultaneously. This permits the CPU to perform both instructions in parallel, shortening the total execution time of the program. The hardware to perform OOO reordering is large, difficult to design, and costly in chip area, power, and clock rate. Nevertheless, it can yield significant gains when the program instruction set interface specifies a single, nominally sequential, instruction stream. However, there are ways to obtain parallel execution by using a different approach to specifying instruction semantics.
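The independence test described above can be illustrated with a minimal sketch. The `Instr` record, the register names, and the `independent` function are hypothetical and chosen only for illustration. The sketch treats every register hazard conservatively and ignores dependencies through memory; actual OOO hardware commonly uses register renaming to remove some of these hazards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: one destination register, some source registers."""
    dest: str
    srcs: tuple


def independent(first: Instr, second: Instr) -> bool:
    """Return True if `second` may execute in parallel with `first`."""
    raw = first.dest in second.srcs   # second reads what first writes
    war = second.dest in first.srcs   # second overwrites what first reads
    waw = first.dest == second.dest   # both write the same register
    return not (raw or war or waw)


a = Instr("r1", ("r2", "r3"))  # r1 = r2 + r3
b = Instr("r4", ("r1", "r5"))  # r4 = r1 * r5, reads r1, so it depends on a
c = Instr("r6", ("r7", "r8"))  # r6 = r7 - r8, touches nothing that a uses
```

Under these assumptions, `independent(a, b)` is false, so `a` and `b` must stay in program order, while `independent(a, c)` is true, so `a` and `c` may be executed simultaneously.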
One common approach to obtain parallel execution is referred to as “multi-threading,” where the program is specified not as a single sequential stream of instructions, but as several such streams. Sequential semantics are enforced within any single stream of instructions, but the streams themselves are considered to be independent, and instructions in different streams can be executed in any order relative to each other, except for certain specialized instructions which serve to synchronize the streams. Each stream may be executed by its own sub-CPU or pipeline, or the streams may be interleaved on a single CPU such that each uses resources left idle by the others.
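As a software analogy only, the stream semantics described above can be sketched with operating-system threads. The stream names, the operation labels, and the `run_stream` helper are hypothetical; the sketch shows that order is preserved within each stream while the interleaving between streams is left unspecified, with `join` standing in for a synchronizing instruction.

```python
import threading


def run_stream(name, ops, log, lock):
    """Execute one stream's operations in order, recording each in a shared log."""
    for op in ops:
        with lock:  # protects the shared log, not the ordering between streams
            log.append((name, op))


streams = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]}
log = []
lock = threading.Lock()

threads = [
    threading.Thread(target=run_stream, args=(name, ops, log, lock))
    for name, ops in streams.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # synchronization point: both streams have finished

# Within each stream the original order survives, whatever the interleaving was.
per_stream = {name: [op for s, op in log if s == name] for name in streams}
```

The combined `log` may interleave the two streams differently from run to run, but `per_stream` always matches `streams`.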
In another approach to obtain parallel execution, typified by Very Long Instruction Word (VLIW) architectures, there is only one instruction stream, but each instruction may have several operations which are executed in parallel. In essence, a VLIW machine sees multiple operation streams rather than multiple instruction streams, where operations from the multiple operation streams are concatenated together to form a single instruction in a single instruction stream. Each position at which an operation can reside within the instruction is called a slot. Because the operations of each slot are in a shared instruction, the multiple operation streams are synchronized at every cycle and advance in lock step. Consequently, an operation executed in a given cycle may be semantically dependent on any operation executed earlier, and operations that are executed in later cycles may be semantically dependent on it, but operations (from a single instruction) executed in the same cycle cannot be dependent on each other. So long as there are at least as many independent operations in a cycle as there are slots, all slots can be kept busy; if not, some slots must remain idle. Code generation software, such as a compiler, analyzes the program and assigns individual operations to the slots so as to maximize performance. This task, called static scheduling, is similar to what an OOO machine does in dynamic scheduling hardware during execution. But because it is done once, in advance, and by software able to statically analyze and optimize future execution, the result is a much cheaper CPU and generally better performance for a large class of programs.
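Static scheduling of operations into slots can be illustrated with a minimal greedy sketch. The `schedule` function, the operation names, and the assumption that every operation completes in one cycle are hypothetical simplifications; production compilers use considerably more elaborate heuristics.

```python
def schedule(deps, num_slots):
    """Greedily pack operations into instructions of at most num_slots operations.

    deps maps each operation name to the set of operations it depends on.
    An operation is placed only after everything it depends on has been
    placed in an earlier instruction (that is, an earlier cycle).
    """
    done = set()
    remaining = list(deps)  # preserves program order
    instructions = []
    while remaining:
        ready = [op for op in remaining if deps[op] <= done]
        if not ready:
            raise ValueError("cyclic dependency")
        bundle = ready[:num_slots]  # unused slots in this instruction stay idle
        instructions.append(bundle)
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
    return instructions


deps = {
    "a": set(),
    "b": set(),
    "c": {"a", "b"},
    "d": {"c"},
    "e": set(),
}
program = schedule(deps, num_slots=2)
```

For this input, `program` is `[["a", "b"], ["c", "e"], ["d"]]`: the first two instructions fill both slots, while the third leaves one slot idle because no independent operation remains.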
The instruction and operation streams described here are abstract notions, which must be encoded as a sequence of primitive operations defined by bits in memory that are fetched and executed by the CPU. The encodings used by different CPU architectures vary greatly from each other, but all seek to balance ease of interpretation by hardware decode machinery against compactness of representation. In most architectures, the instructions are intended to be executed in a particular order as an instruction stream, where the execution order is usually determined by the address order of the instructions in memory but may be changed as a consequence of the execution of control-flow operations in the instruction stream, as described above.
Broadly, there are two sorts of encodings used for instructions: fixed-length encodings and variable-length encodings. In a fixed-length encoding, each instruction uses a single fixed number of bits for its representation, for example 32 bits. In a variable-length encoding, different instructions use different bit-lengths, where the bit-length for a particular instruction is typically selected by minimizing the number of bits required to convey the semantics of that particular instruction. Thus, some instructions may be 8 bits in length, others 16 bits, 56 bits, or some other length. The fixed-length encoding approach is commonly associated with RISC (Reduced Instruction Set Computer) designs typified by the SPARC instruction set architecture, while the variable-length encoding approach is commonly associated with CISC (Complex Instruction Set Computer) designs typified by the x86 instruction set architecture.
In general, fixed-length encodings are relatively easy to decode, and it is especially easy to decode several operations simultaneously in parallel because it is known a priori where in memory each operation starts. Parallel decode reads in a block of operations, breaks them at operation boundaries, and gives each of them to independent decoders. However, fixed-length encodings are not compact, because the semantics of many kinds of operation can be represented in fewer bits than the fixed length. Other kinds of operation need more bits than the encoding length, and so a single logical operation must be represented awkwardly as two or more fixed-length operations.
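The boundary-splitting step of fixed-length decode can be sketched as follows. The 32-bit width, the field layout (an 8-bit opcode followed by a 24-bit operand field), and the function names are assumptions made for illustration and do not correspond to any particular instruction set.

```python
import struct

WIDTH = 4  # assumed fixed operation length in bytes (32 bits)


def decode_one(word: bytes):
    """Decode a hypothetical format: 8-bit opcode, 24-bit operand field."""
    (value,) = struct.unpack(">I", word)
    return value >> 24, value & 0xFFFFFF


def decode_block(block: bytes):
    # Every boundary is known before any operation has been examined.
    starts = range(0, len(block), WIDTH)
    # Each slice could be handed to an independent decoder, because no
    # slice needs a result from any other slice.
    return [decode_one(block[s:s + WIDTH]) for s in starts]


block = bytes.fromhex("01000005" "02000007" "ff000000")
decoded = decode_block(block)
```

The sketch decodes the slices one after another in software, but the point is that `starts` is computed from the block length alone, which is what lets hardware decode all operations at once.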
By contrast, variable-length encodings tend to be quite compact, which is economical of memory space and reduces the load on memory pathways arising from instruction fetch. However, the decode machinery does not know the length of a particular variable-length operation until it has examined it, a process called parsing the operation. This is a problem for modern architectures that execute several operations in parallel. While the decode hardware that parses operations can fetch a block of memory that contains several operations, it cannot know where any operation after the first begins until after it has parsed all prior operations. This serializes operation parsing, whereas fixed-length encodings can easily be parsed in parallel. Schemes for parallel decode of variable-length operations (despite the serial parse) exist, but are difficult to realize and very expensive in hardware and power consumption.
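The serial nature of variable-length parsing can be illustrated with a minimal sketch. The encoding used here, in which the first byte of every operation holds that operation's total length in bytes, is hypothetical and far simpler than any real variable-length instruction set; `parse_boundaries` is likewise an illustrative name.

```python
def parse_boundaries(block: bytes):
    """Return the start offset of each operation in the block.

    Hypothetical encoding: the first byte of every operation holds that
    operation's total length in bytes, including the length byte itself.
    """
    starts = []
    offset = 0
    while offset < len(block):
        length = block[offset]  # known only after examining this operation
        if length == 0 or offset + length > len(block):
            raise ValueError(f"malformed operation at offset {offset}")
        starts.append(offset)
        offset += length  # the next start depends on the parse just completed
    return starts


# Three operations of 1, 3, and 2 bytes.
block = bytes([1, 3, 0xAA, 0xBB, 2, 0xCC])
starts = parse_boundaries(block)
```

Each iteration of the loop needs the `offset` produced by the previous iteration, so the start of the third operation (offset 4) cannot be found without first parsing the two operations before it.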