Application programs can generally be divided into two broad categories: a first category having sequential code segments with low levels of instruction-level parallelism (ILP), such as control code, and a second category having code segments with high levels of ILP, referred to herein as parallel code. Both of these code types are often intermixed in an application program, and both need to be processed efficiently to achieve high performance for the whole program.
Sequential code may be defined as an ordered sequence of instructions that, when executed on a processor, completes in the order specified. Given an existing processor with a fixed instruction set operating at its maximum clock frequency, one of the primary means available for executing instructions faster is to increase the clock frequency in a new processor implementation. This increased clock frequency requires a larger number of pipeline stages so that each stage has a smaller latency, in effect increasing temporal parallelism through deeper pipelining. Since sequential code often has a relatively high number of branch instructions, increasing the pipeline depth can lead to longer branch latencies. To address such long latencies, prior techniques have included the use of conditional or predicated execution to minimize the use of branches, complex branch prediction strategies, or some combination of such approaches, all of which increase hardware complexity. Adding stages to an existing pipeline has the additional, and possibly most serious, problem of potentially changing the programming model. Adding pipeline stages in a new processor implementation while avoiding programming model changes can result in much higher hardware complexity and power consumption, reduced performance gain, or both.
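The trade-off between a shorter clock period and a longer branch latency can be illustrated with a minimal sketch. The model below is a simplified textbook approximation and not part of the described processor: it assumes one instruction issues per cycle, that every branch incurs a fixed penalty in cycles, and that no branch prediction or predication is used. The function name, parameters, and all numeric values are hypothetical.

```python
def time_per_instruction_ns(period_ns, branch_fraction, branch_penalty_cycles):
    """Average time per instruction under a simplified model (an assumption):
    one cycle per instruction plus a fixed penalty for every branch."""
    cycles_per_instruction = 1.0 + branch_fraction * branch_penalty_cycles
    return cycles_per_instruction * period_ns


# Hypothetical shallow pipeline: 4 ns clock period, 1-cycle branch penalty.
shallow_ns = time_per_instruction_ns(4.0, branch_fraction=0.25,
                                     branch_penalty_cycles=1)

# Hypothetical deeper pipeline: clock period halved, branch penalty tripled.
deep_ns = time_per_instruction_ns(2.0, branch_fraction=0.25,
                                  branch_penalty_cycles=3)

# The clock is 2x faster, but the achieved speedup is noticeably smaller.
speedup = shallow_ns / deep_ns
```

Under these assumed numbers, halving the clock period yields a speedup of roughly 1.43 rather than 2, because the branch-heavy code pays the longer branch latency more often.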
A processor architecture can be considered to consist of a variety of executable function instructions, such as addition, subtraction, multiplication, division, compares, ANDs, ORs, ExclusiveORs, NOTs, shifts, rotates, permutes, bit operations, moves, loads, stores, communications, and variations and combinations of the like. These diverse instructions exhibit a wide range of execution delays. In developing a processor pipeline, the execution delays for each instruction, as it flows through the pipeline, are examined, and typically the longest delay path through any stage sets the pipeline clock period and the consequent operating frequency. The longest delay path is a worst-case path and is typically referred to as a critical timing path. A pipeline stage logic path typically begins with output signals from a previous stage's storage device, such as a latch, register, or memory. The signals then flow through function-specific combinatorial logic defining the functional portion of the pipeline stage. The pipeline stage then ends by saving that stage's output in a storage device. Increasing the clock frequency reduces the clock period and typically causes the addition of pipeline stages, since there is less time to accomplish the function-specific combinatorial logic of the longest path.
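The relationship between stage delays, the critical timing path, and the number of pipeline stages can be sketched as follows. This is an illustrative calculation only: the function names, the optional storage-device overhead parameter, and the delay values are assumptions introduced here, not figures from any particular processor.

```python
import math


def clock_period_ns(stage_delays_ns, storage_overhead_ns=0.0):
    """The slowest stage (the critical timing path) sets the clock period."""
    return max(stage_delays_ns) + storage_overhead_ns


def max_frequency_mhz(period_ns):
    """Operating frequency that follows from a clock period in nanoseconds."""
    return 1000.0 / period_ns


def stages_needed(logic_delay_ns, period_ns, storage_overhead_ns=0.0):
    """Number of stages a block of combinatorial logic must be split into
    so that each piece fits within one clock period."""
    usable_ns = period_ns - storage_overhead_ns
    if usable_ns <= 0:
        raise ValueError("clock period must exceed the storage overhead")
    return math.ceil(logic_delay_ns / usable_ns)


# Hypothetical combinatorial logic delays for four pipeline stages.
delays_ns = [2.0, 3.5, 5.0, 2.5]

period_ns = clock_period_ns(delays_ns)         # set by the 5.0 ns stage
frequency_mhz = max_frequency_mhz(period_ns)   # 200 MHz under these numbers

# Halving the clock period forces the 5.0 ns path to span two stages,
# while the 2.0 ns path still fits in a single stage.
split_long = stages_needed(5.0, period_ns / 2)
split_short = stages_needed(2.0, period_ns / 2)
```

In this sketch the shorter paths leave part of each clock period unused, which is the variation in critical path lengths across instructions noted above.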
Given such observations, a need exists for a processor pipeline that achieves increased instruction throughput by taking advantage of the variation in critical path lengths of different processor instructions and of the tendency for different processor instructions to be used together in different contexts. Further, a need exists for any such advantageous changes to the pipeline to affect the programming model, implementation silicon area, and power minimally, if at all.