Instructions within a group of processor instructions in a sequential instruction stream are said to be independent of each other when the input values of the processor instructions are not generated by other processor instructions within the group. In other words, the resulting output of execution of the instruction stream remains the same whether the instructions within the group of processor instructions are evaluated sequentially in order or in parallel or in an out-of-order manner.
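The independence condition above can be sketched as a small check. This is a minimal illustration, not a definitive implementation: the `Instr` record, the function name, and the register operands given to muli are assumptions made for the example. It tests only the condition stated above (no instruction reads a value produced by another instruction in the group) and does not check for two instructions writing the same destination.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: opcode, destination, source registers."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def is_independent_group(group):
    """Return True if no instruction reads a register written by another
    instruction in the same group (the condition stated in the text)."""
    for i, consumer in enumerate(group):
        for j, producer in enumerate(group):
            if i != j and producer.dest in consumer.srcs:
                return False
    return True


# Assumed operands: muli reads r5 in the first group and r0 in the second.
independent = [Instr("addi", "r0", ("r2",)), Instr("muli", "r4", ("r5",))]
dependent = [Instr("addi", "r0", ("r2",)), Instr("muli", "r4", ("r0",))]
```

Under these assumptions, the first group may be evaluated in any order, while the second may not because muli consumes the value that addi writes to r0.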
Because independent instructions may be evaluated in any order, their execution may overlap; this potential overlap is known as instruction level parallelism (ILP). ILP may be defined as a measure of the degree to which processor instructions may be evaluated simultaneously.
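One common way to put a number on this measure, offered here as an assumption rather than as the definition used in this description, is the ratio of the instruction count to the length of the longest chain of dependent instructions. The sketch below considers only read-after-write dependencies through registers; the function name and the `(dest, srcs)` tuple format are hypothetical.

```python
def ilp_estimate(program):
    """Estimate ILP as instruction count divided by the longest dependency chain.

    program: list of (dest, srcs) tuples in program order.
    Only read-after-write dependencies through registers are considered.
    """
    depth = {}    # register -> chain depth of the instruction that last wrote it
    longest = 0
    for dest, srcs in program:
        d = 1 + max((depth.get(s, 0) for s in srcs), default=0)
        depth[dest] = d
        longest = max(longest, d)
    return len(program) / longest if program else 0.0


chain = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",))]        # fully serial
parallel = [("r1", ("r0",)), ("r2", ("r0",)), ("r3", ("r0",))]     # fully independent
```

A fully dependent chain yields an estimate of 1.0, while a group of mutually independent instructions yields an estimate equal to the group size.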
The factors that limit ILP in an instruction stream are data dependencies and control dependencies between instructions. Current processor architectures attempt to mitigate the effects of these dependencies by employing micro-architectural techniques such as dynamic instruction scheduling, static instruction scheduling, and multiple instruction issue. Dynamic instruction scheduling techniques have been widely used in the domain of general purpose processors. However, due to real-time and power constraints, processors in the embedded systems domain have typically used static instruction scheduling techniques.
Processors that evaluate instructions in a pipelined microarchitecture typically pass each instruction through the same sequence of processing stages. FIG. 1 shows a set of instruction stages that may be employed by a typical pipelined processor. The first instruction, addi, adds the immediate value 8 to the contents of register r2. In the first stage (IF), the instruction may be fetched from memory and decoded. In the second stage (RD), the operands may be read from a register file. In the third stage (EX), the addition may be performed. Finally, in the fourth stage (WB), the result may be written back into the register file at location r0. Each of the pipeline stages (IF, RD, EX, and WB) is generally executed on a clock boundary. The second instruction, muli, may be started on the second clock cycle without requiring additional hardware. The processing of the second instruction may be overlapped with that of the first instruction because the resources used by the IF stage are not needed by the RD stage. The IF, RD, EX, and WB stages may thus be shared but shifted in time.
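The overlap described above can be tabulated with a short sketch. It assumes an ideal four-stage pipeline in which every stage takes one clock cycle and a new instruction starts on every cycle; the function names are hypothetical.

```python
STAGES = ("IF", "RD", "EX", "WB")


def pipeline_schedule(num_instructions):
    """Return, per instruction, the clock cycle (1-based) of each stage,
    assuming one new instruction enters IF on every cycle and no stalls."""
    return [
        {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    ]


def total_cycles(num_instructions):
    """Cycles needed to drain the pipeline: fill time plus one per instruction."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1


schedule = pipeline_schedule(2)   # index 0 models addi, index 1 models muli
```

Under these assumptions, the second instruction is fetched in cycle 2 while the first is in its RD stage, and two instructions complete in five cycles rather than eight.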
FIG. 2 shows a complication with the pipeline implementation of FIG. 1. In this example, the muli instruction requires the result computed by the addi instruction as an operand. Since the write-back (WB) stage of the addi instruction has not yet occurred when the muli instruction is ready to read its operands, the processor must stall and wait for the addi instruction to complete. The empty cycles during which the muli instruction waits for its operands to become available are typically called “bubbles” in the pipeline.
Typical general purpose processors dynamically determine, using complex dependency checking hardware, whether a decoded instruction has inter-instruction dependencies with the instructions currently executing in a pipeline. This hardware is sometimes known as interlock hardware. If a dependency (also known as a hazard) is detected, instruction issue is stalled until the instruction on which the stalled instruction depends completes execution and the dependency is resolved. It should be noted that if a processor is capable of forwarding a result from the current instruction's EX stage directly to the next instruction's EX stage, then no bubbles form in the pipeline. This is termed bypassing or, sometimes, result forwarding.
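The effect of interlocking and of result forwarding on the IF, RD, EX, WB pipeline can be sketched with a simple in-order timing model. The model is an illustration under stated assumptions: each stage takes one cycle, a register written in WB cannot be read by RD until the following cycle, and a stall is modeled as a delay of the instruction's IF cycle. The function name and the `(dest, srcs)` format are hypothetical.

```python
def issue_cycles(program, forwarding=False):
    """Return the cycle in which each instruction enters IF.

    program: list of (dest, srcs) tuples in program order.
    Stage timing per instruction: IF = c, RD = c + 1, EX = c + 2, WB = c + 3.
    """
    ready = {}      # register -> first cycle in which a consumer may use its value
    cycles = []
    next_if = 1
    for dest, srcs in program:
        if_cycle = next_if
        # The consumer uses the value in EX when forwarding, otherwise in RD.
        use_offset = 2 if forwarding else 1
        for src in srcs:
            if src in ready:
                if_cycle = max(if_cycle, ready[src] - use_offset)
        cycles.append(if_cycle)
        # Forwarding: usable the cycle after EX. Otherwise: the cycle after WB.
        ready[dest] = if_cycle + 3 if forwarding else if_cycle + 4
        next_if = if_cycle + 1
    return cycles


# addi writes r0; muli reads r0 (operands assumed for illustration).
program = [("r0", ("r2",)), ("r4", ("r0",))]
```

In this model the dependent instruction starts two cycles late without forwarding (two bubbles) and starts on schedule when the EX result is forwarded.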
Typical embedded processors may employ exposed pipelines where the programmer or compiler is responsible for ensuring code executes correctly by scheduling the instructions prior to execution such that all inter-instruction dependencies are satisfied. This way of scheduling instructions at compile time is known as static scheduling. With static scheduling and no interlock hardware, NOP (no operation) instructions may need to be inserted into the instruction stream such that an appropriate number of cycles elapse before the register file is read.
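A static scheduler for an exposed pipeline might pad the instruction stream as sketched below. This is a minimal sketch, not a description of any particular compiler: the function name, the `(op, dest, srcs)` format, and the default of two intervening slots between a producer and its consumer are assumptions chosen to match a four-stage pipeline without forwarding.

```python
def insert_nops(program, latency=2):
    """Insert NOPs so that at least `latency` instruction slots separate each
    consumer from the most recent producer of any of its source registers.

    program: list of (op, dest, srcs) tuples in program order.
    """
    out = []
    last_write = {}   # register -> index in `out` of its most recent writer
    for op, dest, srcs in program:
        needed = 0
        for src in srcs:
            if src in last_write:
                gap = len(out) - last_write[src] - 1   # slots already in between
                needed = max(needed, latency - gap)
        out.extend([("nop", None, ())] * needed)
        if dest is not None:
            last_write[dest] = len(out)
        out.append((op, dest, srcs))
    return out


# Operands are assumed for illustration.
padded = insert_nops([("addi", "r0", ("r2",)), ("muli", "r4", ("r0",))])
```

If an independent instruction can be moved between the producer and the consumer, it fills a slot that would otherwise hold a NOP, which is why static schedulers generally try to reorder code before padding.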
Whether referring to general purpose processors with dynamic hardware interlocks or embedded processors with visible pipelines, pipelined overlapping instruction execution may be referred to as vertical parallelism.
Another type of parallelism may be referred to as horizontal parallelism. With horizontal parallelism, multiple independent operations may be executed concurrently if there are no dependencies. This is often the case in digital signal processing code and high performance computing programs.
If these multiple independent operations perform the same operation on different data elements, this type of parallelism is sometimes termed data level parallelism. Processors that exploit data level parallelism may execute a single instruction that performs the same operation on multiple data elements. When the same operation is performed simultaneously on multiple elements, the operations are often referred to as vector/SIMD operations. Vector operations are a subset of single instruction multiple data (SIMD) operations, but for many classes of algorithms they are equivalent. In contrast, instructions on a scalar processor operate on single data elements. FIG. 3 is a simplified circuit diagram of a Vector/SIMD processor.
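The difference between scalar and vector/SIMD execution can be illustrated by counting issued instructions. The sketch below is a software model only; the function names and the lane count of 4 are assumptions, and the inner loop over lanes stands in for hardware that would process all lanes at once.

```python
def scalar_add(a, b):
    """Add element by element; each element costs one issued instruction."""
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1
    return out, issued


def simd_add(a, b, lanes=4):
    """Add `lanes` elements per issued instruction (modeled sequentially here)."""
    out, issued = [], 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        issued += 1
    return out, issued


a = list(range(8))
b = [10] * 8
```

Both functions produce the same result; under the assumed lane count, the SIMD model issues two instructions for eight elements where the scalar model issues eight.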
Processors may exploit horizontal parallelism by issuing multiple independent instructions simultaneously in a single clock cycle. Multiple issue processors come in two basic flavors: superscalar processors and very long instruction word (VLIW) processors. The basic difference between the two types is the manner in which inter-instruction dependencies are resolved and instructions are issued for execution.
In superscalar architectures, the inter-instruction hazards are detected dynamically at run time in hardware. Once a dependency is resolved, the instruction(s) are issued dynamically. Current general purpose processors are typically speculative superscalar processors, which issue a varying number of instructions per clock cycle, perform inter-instruction hazard detection in hardware, issue instructions dynamically once dependencies are resolved, perform out-of-order instruction execution followed by in-order commit, and perform speculative execution of the instruction stream to mitigate the effects of control dependencies.
In contrast, VLIW and exposed pipeline architectures typically issue a fixed number of operations formatted as one large instruction. Instruction scheduling is performed statically at compile time or by the programmer. This approach relies on the compiler or programmer to identify hazards and schedule operations since the implementation does not contain hardware to check explicitly for hazards. Embedded processors and digital signal processors typically use very long instruction word (VLIW) techniques.
In the basic VLIW approach, all operations within an instruction are executed simultaneously. As a refinement, some architectures have employed one or more bits within the VLIW instruction field to indicate parallelism explicitly. This permits the operations within an instruction to be issued serially rather than in parallel.
The Sandbridge Sandblaster SB3500 microprocessor, as described in M. Moudgill, J. Glossner, S. Agrawal, and G. Nacer, “The Sandblaster 2.0 Architecture and SB3500 Implementation,” in Proceedings of the Software Defined Radio Technical Forum (SDR Forum '08), Washington D.C., October, 2008, uses a single bit, called the “serial bit,” within its 3-operation compound instruction bundle to indicate whether the operations within the instruction bundle are required to be executed sequentially. If the serial bit is set, then the operations within the bundle are executed serially. If the serial bit is not set, then the operations must execute in parallel. If one of the operations is a taken branch and the serial bit is set, then the remaining operations in the bundle are not executed.
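The serial-bit behavior described above can be modeled as follows. This is a sketch of the stated rules only, not of the SB3500 implementation: the function name and the `(name, is_taken_branch)` operation format are hypothetical, and the behavior of a taken branch when the serial bit is clear is not modeled beyond issuing all operations together.

```python
def execute_bundle(ops, serial_bit):
    """Return the issue steps for one compound instruction bundle.

    ops: list of (name, is_taken_branch) tuples in bundle order.
    Each step is a list of operation names that execute together.
    """
    if not serial_bit:
        # Serial bit clear: all operations execute in parallel.
        return [[name for name, _ in ops]]
    steps = []
    for name, is_taken_branch in ops:
        steps.append([name])
        if is_taken_branch:
            break   # remaining operations in the bundle are not executed
    return steps


# Hypothetical 3-operation bundle with a taken branch in the middle.
bundle = [("add", False), ("br", True), ("mul", False)]
```

With the serial bit set, the model executes add, then the branch, and skips mul; with the serial bit clear, all three operations form a single step.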
The original Texas Instruments TI C6201 microprocessor, as described in “TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide,” Literature Number: SPRU732J, Texas Instruments, July 2010 (hereinafter “Texas Instruments”), required 256 bits to be fetched for each VLIW instruction. Texas Instruments termed these 256-bit instructions fetch packets. Texas Instruments subdivided the operation fields into separate 32-bit fields that Texas Instruments referred to as instructions. The execution of the operations within a fetch packet is controlled partially by a bit in each “instruction” called the “p-bit.” The p-bit determines whether the instruction is to be executed in parallel with another instruction. If the p-bit of instruction i is 1, then instruction i+1 is to be executed in parallel with (in the same cycle as) instruction i. If the p-bit of instruction i is 0, then instruction i+1 is executed in the cycle after instruction i. The p-bit patterns within a VLIW fetch packet permit the execution of the instructions in a fully parallel, fully serial, or partially serial fashion.
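The p-bit rule above amounts to splitting a fetch packet into groups of instructions that issue in the same cycle. The sketch below applies that rule to a list of p-bits; the function name is hypothetical, and the p-bit of the last instruction is ignored on the assumption that a group does not continue into the next fetch packet.

```python
def execute_packets(p_bits):
    """Group instruction indices of one fetch packet into parallel issue groups.

    p_bits[i] == 1 means instruction i + 1 issues in the same cycle as
    instruction i; p_bits[i] == 0 means instruction i + 1 issues one cycle later.
    The last p-bit is ignored (assumption: groups do not cross fetch packets).
    """
    if not p_bits:
        return []
    packets, current = [], [0]
    for i, p in enumerate(p_bits[:-1]):
        if p == 1:
            current.append(i + 1)
        else:
            packets.append(current)
            current = [i + 1]
    packets.append(current)
    return packets


fully_parallel = execute_packets([1, 1, 1, 1, 1, 1, 1, 0])
fully_serial = execute_packets([0, 0, 0, 0, 0, 0, 0, 0])
partially_serial = execute_packets([1, 1, 0, 1, 0, 0, 1, 0])
```

The three example patterns correspond to the fully parallel, fully serial, and partially serial cases: one group of eight, eight groups of one, and a mixture of group sizes, respectively.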
Instruction chaining, as originally introduced in the IBM 650 computer, meant that a current instruction contained a next instruction address, as described in Gerrit A. Blaaw and Frederick P. Brooks, Jr., “Computer Architecture: Concepts and Evolution,” Addison Wesley, first edition, February 1997 (hereinafter “Blaaw”). In the original TI design, if an execute packet crossed a fetch packet boundary, the execute packet was required to be placed in the next fetch packet with the current packet padded with NOPs. Later versions relaxed this restriction, but still required the total parallel length to be 8. In later revisions, branching into the middle of execute packets was permitted; however, all instructions at lower addresses were ignored, and in some cases, this may produce erroneous results. The p-bit was used only for parallel or serial instruction issue, did not encode dependencies, and did not generate any pipeline controls.
U.S. Pat. No. 6,260,189, titled, “Compiler-controlled dynamic instruction dispatch in pipelined processors,” describes a technique of encoding pipeline dependencies within a code block in a pipeline dependency field in the code block. A compiler or programmer needed to identify pipeline dependencies in a plurality of instructions such as Read-After-Write (RAW) hazards, Write-After-Write (WAW) hazards, Write-After-Read (WAR) hazards or predication hazards. These instructions were grouped together into a code block with a pipeline dependency field associated with the code block, which indicates the types of pipeline dependencies, if any, in the code block. If no hazards were identified in a given code block, the instructions within that block could be executed without stalls or other corrective measures. If one or more hazards were detected in the given code block, the processor implemented only those stalls or other corrective measures that were required for hazards actually present in a given block of code as specified in the pipeline dependency field, thereby avoiding the performance degradations which can result when using conventional techniques.
FIGS. 4A and 4B show examples of pipeline dependency fields for two different code blocks. The code block in FIG. 4A has a RAW dependency and the pipeline dependency field indicates this. The code block in FIG. 4B does not have any data or predication dependencies.
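The idea of summarizing a code block's hazards in a dependency field can be sketched as a set of flags computed at compile time. The flag layout, the names, and the `(dest, srcs)` format below are hypothetical and are not taken from U.S. Pat. No. 6,260,189; predication hazards are given a flag but are not detected in this sketch.

```python
from enum import Flag, auto


class Hazard(Flag):
    """Hypothetical encoding of a pipeline dependency field."""
    NONE = 0
    RAW = auto()    # read-after-write
    WAW = auto()    # write-after-write
    WAR = auto()    # write-after-read
    PRED = auto()   # predication hazard (not detected in this sketch)


def dependency_field(block):
    """Compute the hazard flags for a code block.

    block: list of (dest, srcs) tuples in program order.
    """
    field = Hazard.NONE
    for i, (dest_i, srcs_i) in enumerate(block):
        for dest_j, srcs_j in block[i + 1:]:
            if dest_i in srcs_j:
                field |= Hazard.RAW   # later instruction reads an earlier result
            if dest_i == dest_j:
                field |= Hazard.WAW   # both instructions write the same register
            if dest_j in srcs_i:
                field |= Hazard.WAR   # later instruction overwrites an earlier source
    return field


# Registers are assumed for illustration.
block_with_raw = [("r0", ("r2",)), ("r4", ("r0",))]
block_without_hazards = [("r0", ("r2",)), ("r4", ("r5",))]
```

A processor consuming such a field would apply stalls only for the flags that are set, so a block whose field is `Hazard.NONE` could run without any corrective measures.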