1. Field of the Invention
This invention relates to a pipeline architecture and, in particular, to an alternate multi-threaded pipeline architecture.
2. Description of the Related Art
The pipelining of instructions and data is well known in the processor art as one technique for achieving higher processing speed. In a typical pipelined processor design, the data is passed through the processor in various stages, referred to as a pipeline, so that large amounts of data can be processed more quickly and a final output obtained at higher speeds.
The processor speed depends on the time to finish each task in every pipeline stage. In order to achieve higher speeds, the pipeline stages are increased in number. For very high speeds, a deep pipeline having as many as 9-12 stages may be used. Such a deep pipeline requires substantial refinement and precise timing between the circuits.
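The relationship between stage count and clock speed can be illustrated with a minimal sketch. It assumes the total logic delay divides evenly across the stages and that each stage adds a fixed latch overhead; the function name and all numeric values (18 ns of logic, 0.5 ns of overhead) are hypothetical and are not taken from the specification.

```python
def clock_period_ns(total_logic_ns: float, stages: int, latch_overhead_ns: float) -> float:
    """Clock period for an idealized pipeline.

    Assumes (hypothetically) that the logic splits evenly across the stages
    and that every stage pays the same fixed latch overhead.
    """
    if stages < 1:
        raise ValueError("a pipeline needs at least one stage")
    return total_logic_ns / stages + latch_overhead_ns


if __name__ == "__main__":
    # Hypothetical figures: 18 ns of total logic, 0.5 ns of latch overhead.
    for stages in (3, 6, 9, 12):
        period = clock_period_ns(18.0, stages, 0.5)
        print(f"{stages:2d} stages: {period:.2f} ns per cycle, {1000.0 / period:.0f} MHz")
```

Under these assumptions the period shrinks as stages are added, but the fixed overhead term does not shrink with it, which is one reason a deep pipeline demands precise timing between the circuits.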
A deep pipeline design usually introduces pipeline stalls and branch penalties. Static and/or dynamic branch schemes are then required in order to reduce the branch penalties. In addition, multi-threading is sometimes used to reduce memory latencies, and VLIW/superscalar designs are used to improve ILP (instruction-level parallelism). These attempts solve only a few of the problems introduced by using a deep pipeline, and the improvement is not significant.
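The cost of branch penalties can be estimated with the standard textbook model of cycles per instruction (CPI). The sketch below is illustrative only: the function name and the sample figures (20% branches, 10% mispredicted, 8 cycles lost per misprediction) are assumptions, not values from the specification.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: int) -> float:
    """Average cycles per instruction once branch penalties are included.

    Textbook model: every mispredicted branch costs `penalty_cycles`
    extra cycles on top of the base CPI.
    """
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


if __name__ == "__main__":
    # Hypothetical workload: 20% branches, 10% of them mispredicted.
    # The penalty grows with pipeline depth, so compare a shallow and a deep case.
    for penalty in (3, 8):
        cpi = effective_cpi(1.0, 0.20, 0.10, penalty)
        print(f"penalty {penalty} cycles -> CPI {cpi:.2f}")
```

In this model a better branch scheme lowers `mispredict_rate`, while a deeper pipeline raises `penalty_cycles`, so the two effects pull in opposite directions.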
FIGS. 1 and 2A-2C illustrate current attempts at pipeline management. As is known in the art, a pipeline may have many stages. In the embodiments shown, 9 pipeline stages are illustrated, though the use of 6 pipeline stages, or other lengths, is well known in the art. The execution of the various pipeline stages of FIG. 1 is illustrated in FIG. 2A.
A pipeline 10 of data is illustrated in FIG. 1. A circuit capable of executing the pipeline of FIG. 1 includes 9 stages, the first stage being labeled stage 12 and the last being labeled 14. The steps 13 executed in each stage are shown below the name of each pipeline stage. One step 13 in the pipeline is executed on one clock cycle. The first step, Instruction Fetch 1 (IF1), is executed on two parts of the clock cycle, C1 and C2. The steps 13 executed on these clock cycles are shown below the pipeline 10 and include I Addr, Program Memory Read, etc.
The execution of the pipeline 10 is described with respect to FIG. 2A. An ALU is one form of a logic circuit 16 which executes the data from the pipeline stages 10. The ALU 16 has clocked inputs 18 and 20, which receive the data and transfer it for execution by the ALU 16 when the clock cycle is enabled.
In the embodiment of FIG. 2A, the clocks C1 and C2 represent opposite phases of the same clock driving the logic circuit 16. On a first clock C1, the data presented at inputs S0 and S1 is provided to the ALU 16 for execution. The execution is completed, and the data is output from the ALU 16 and presented at the input of the multiplexer and drive circuit 22. On the opposite phase, the data is clocked out of the multiplexer 22 for presentation 23. It is also fed back for presentation to the ALU 16 on the subsequent clock cycle C1.
In the prior art embodiment of FIG. 2A, a single thread of pipeline data is being processed. The time of a single clock cycle is 2.5 nanoseconds, as shown in FIG. 2A. The first part of the cycle introduces the data to the ALU for execution, while the second part of the cycle operates on the same thread and the same data to provide an output. The same pipeline thread 10 continues through the ALU 16 for execution one stage at a time.
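The two-phase, single-thread behavior described for FIG. 2A can be sketched as a small behavioral model. This is an illustration under stated assumptions, not the circuit itself: the class and function names are hypothetical, addition stands in for whatever operation the ALU performs, and the fed-back output is assumed to arrive on input S0.

```python
from dataclasses import dataclass

CYCLE_NS = 2.5  # single clock cycle time given in the text


@dataclass
class SingleThreadDatapath:
    """Hypothetical behavioral model of the ALU plus multiplexer/driver."""
    alu_result: int = 0  # value waiting at the multiplexer input after C1
    output: int = 0      # value driven out, and fed back, after C2

    def phase_c1(self, s0: int, s1: int) -> None:
        # C1: the clocked inputs pass S0 and S1 to the ALU, which executes.
        # Addition is an arbitrary stand-in for the ALU operation.
        self.alu_result = s0 + s1

    def phase_c2(self) -> int:
        # C2: the multiplexer clocks the result out; the same value is
        # also fed back for the next C1.
        self.output = self.alu_result
        return self.output


def run(operands):
    """Run one thread through the datapath, one operand per clock cycle."""
    datapath = SingleThreadDatapath()
    feedback = 0
    trace = []
    for cycle, operand in enumerate(operands, start=1):
        datapath.phase_c1(feedback, operand)  # assumed: feedback enters on S0
        feedback = datapath.phase_c2()
        trace.append((cycle * CYCLE_NS, feedback))
    return trace


if __name__ == "__main__":
    for time_ns, value in run([1, 2, 3]):
        print(f"t = {time_ns} ns, output = {value}")
```

In this model both phases of every cycle serve the same thread, so the datapath produces one result per 2.5 ns cycle.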
The operation of steps 13 can be seen in FIG. 1. After the first step of IF1, a program memory read is executed, which requires two clock cycles, followed by subsequent execution on each clock cycle of steps such as instruction drive, IR latch and pre-decode ARF address, ARF read, etc. This particular approach makes use of a highly pipelined processor in order to obtain an output stream.
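How instructions overlap in a 9-stage pipeline can be sketched with the usual idealized timing model. The sketch assumes one stage per clock cycle, one new instruction issued per cycle, and no stalls; the function names and the zero-based stage numbering are hypothetical and do not correspond to the stage labels of FIG. 1.

```python
NUM_STAGES = 9  # pipeline length used in the illustrated embodiment


def stage_at(instruction_index: int, cycle: int, num_stages: int = NUM_STAGES):
    """Stage (0-based) occupied by an instruction on a given cycle, or None.

    Idealized model: instruction i enters stage 0 on cycle i and advances
    one stage per cycle with no stalls.
    """
    stage = cycle - instruction_index
    return stage if 0 <= stage < num_stages else None


def total_cycles(num_instructions: int, num_stages: int = NUM_STAGES) -> int:
    """Cycles needed to drain the pipeline for a stall-free instruction stream."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    # Occupancy chart for 4 instructions; '.' means not in the pipeline.
    for i in range(4):
        row = []
        for cycle in range(total_cycles(4)):
            stage = stage_at(i, cycle)
            row.append("." if stage is None else str(stage))
        print(f"instr {i}: " + " ".join(row))
```

Once the pipeline is full, one instruction completes per cycle, which is the source of the output stream; any stall breaks the one-per-cycle assumption made here.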
Traditionally, a pipeline architecture is carried out as shown in FIG. 2B, in which the processors are physically separate from each other. Processors 25 and 27 are each a physically separate processor. The processors operate in parallel, each using one or more state machines in order to track the data.
There is a potential that the different instructions or data may cause delays in execution, and thus dedicated processors, each with its own state machine, are often used to avoid errors. This traditional dual processor system results in time delays and a large use of surface area. Achieving fast clock beats becomes more difficult. One prior art technique is described in an article in Microprocessor Report, Volume 16, Archive 2, Feb. 2002, pages 4-9 titled “Technology 2001: On a Clear Day You Can See Forever,” by Max Barron, which illustrates an improvement in which the processor core remains the same and two execution logic circuits 15 are used which share processor core resources. While this results in some improvement in operational time, it fails to provide a substantial advance across the entire system level.