Modern computer systems utilize a variety of different microprocessor architectures to perform program execution. Each microprocessor architecture is configured to execute programs made up of a number of macro instructions and micro instructions. Many macro instructions are translated or decoded into a sequence of micro instructions before processing. Micro instructions are simple machine instructions that can be executed directly by a microprocessor.
To increase processing power, most microprocessors use multiple pipelines, such as an integer pipeline and a load/store pipeline, to process the macro and micro instructions. Typically, each pipeline consists of multiple stages. Each stage in a pipeline operates in parallel with the other stages, but each stage operates on a different macro or micro instruction. Pipelines are usually synchronous with respect to the system clock signal. Therefore, each pipeline stage is designed to perform its function in a single clock cycle. Thus, the instructions move through the pipeline with each active edge of the clock signal. Some microprocessors use asynchronous pipelines. Rather than a clock signal, handshaking signals are used between pipeline stages to indicate when the various stages are ready to accept new instructions. The present invention can be used with microprocessors using either (or both) synchronous or asynchronous pipelines.
FIG. 1 shows an instruction fetch and issue unit, having an instruction fetch stage (I stage) 105 and a pre-decode stage (PD stage) 110, coupled via an instruction buffer 115 to a typical four stage integer pipeline 120 for a microprocessor. Integer pipeline 120 comprises a decode stage (D stage) 130, an execute one stage (E1 stage) 140, an execute two stage (E2 stage) 150, and a write back stage (W stage) 160. Instruction fetch stage 105 fetches instructions to be processed. Pre-decode stage 110 predecodes instructions and stores them into instruction buffer 115. It also groups instructions so that they can be issued in the next stage to one or more pipelines. Ideally, instructions are issued into integer pipeline 120 every clock cycle. Each instruction passes through the pipeline and is processed by each stage as necessary. Thus, during ideal operating conditions integer pipeline 120 is simultaneously processing four instructions. However, many conditions, as explained below, may prevent the ideal operation of integer pipeline 120.
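By way of illustration only, the ideal flow of instructions through a four stage pipeline such as integer pipeline 120 can be modeled with the following minimal Python sketch. The function name `simulate` and the instruction labels are hypothetical and are not taken from the figures; the sketch assumes one instruction is issued per clock cycle and that no stage ever stalls.

```python
STAGES = ["D", "E1", "E2", "W"]  # decode, execute one, execute two, write back


def simulate(instructions):
    """Return one snapshot per clock cycle showing which instruction is in each stage.

    Assumes ideal operation: one instruction issues per cycle and nothing stalls.
    """
    pipeline = [None] * len(STAGES)
    pending = list(instructions)
    snapshots = []
    while pending or any(slot is not None for slot in pipeline):
        # On each active clock edge every instruction advances one stage;
        # the instruction leaving the W stage retires.
        next_instr = pending.pop(0) if pending else None
        pipeline = [next_instr] + pipeline[:-1]
        if any(slot is not None for slot in pipeline):
            snapshots.append(dict(zip(STAGES, pipeline)))
    return snapshots


snaps = simulate(["I0", "I1", "I2", "I3", "I4"])
for cycle, snap in enumerate(snaps, start=1):
    print(cycle, snap)
```

In this model the pipeline is full from the fourth cycle onward, with four different instructions occupying the four stages at once, until the supply of instructions runs out.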
FIG. 2 shows a typical four stage load/store pipeline 200 for a microprocessor coupled to a memory system 270, instruction fetch stage 105 and pre-decode stage 110. Load/store pipeline 200 includes a decode stage (D stage) 230, an execute one stage (E1 stage) 240, an execute two stage (E2 stage) 250, and a write back stage (W stage) 260. In one embodiment, memory system 270 includes a data cache 274 and main memory 278. Other embodiments of memory system 270 may be configured as scratch pad memory using SRAMs. Because memory systems, data caches, and scratch pad memories are well known in the art, the function and performance of memory system 270 are not described in detail. Load/store pipeline 200 is specifically tailored to perform load and store instructions. Decode stage 230 decodes the instruction and reads the register file (not shown) for the needed information regarding the instruction. Execute one stage 240 calculates memory addresses for the load or store instructions. Because the address is calculated in execute one stage 240 and load instructions only provide the address, execute one stage 240 configures memory system 270 to provide the appropriate data on the next active clock edge for loads from memory. However, for store instructions, the data to be stored is typically not available at execute one stage 240. For load instructions, execute two stage 250 retrieves information from the appropriate location in memory system 270. For store instructions, execute two stage 250 prepares to write the data to the appropriate location. For example, for stores to memory, execute two stage 250 configures memory system 270 to store the data on the next active clock edge. For register load operations, write back stage 260 writes the appropriate value into a register file. By including both a load/store pipeline and an integer pipeline, overall performance of a microprocessor is enhanced because the load/store pipeline and the integer pipeline can operate in parallel.
While pipelining can increase overall throughput in a processor, pipelining also introduces data dependency issues between instructions in the pipeline. For example, if instruction “LD D0, [A0]”, which means to load data register D0 with the value at memory address A0, is followed by “MUL D2, D0, D1”, which means to multiply the value in data register D0 with the value in data register D1 and store the result into data register D2, “MUL D2, D0, D1” cannot be executed until after “LD D0, [A0]” is complete. Otherwise, “MUL D2, D0, D1” may use an outdated value in data register D0. However, stalling the pipeline to delay the execution of “MUL D2, D0, D1” would waste processor cycles. Many data dependency problems can be solved by forwarding data between pipeline stages. For example, the pipeline stage holding the value loaded from [A0] and targeting data register D0 could forward the value to the pipeline stage holding “MUL D2, D0, D1” to solve the data dependency issue without stalling the pipeline.
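The LD/MUL example above can be sketched in Python as follows. This is a simplified model under stated assumptions, not a description of any particular forwarding network: the helper `read_operand`, the `in_flight` list, and the numeric register values are all hypothetical and chosen only to make the stale and forwarded results differ.

```python
def read_operand(reg, register_file, in_flight):
    """Return the value of `reg`, preferring a result still in the pipeline.

    `in_flight` lists (destination_register, value) pairs ordered from oldest
    to youngest; the youngest matching entry is forwarded (assumed policy).
    """
    for dest, value in reversed(in_flight):
        if dest == reg:
            return value  # forwarded from a later pipeline stage
    return register_file[reg]  # no pending writer, so the register file is current


# Hypothetical state: D0 still holds an old value in the register file.
register_file = {"D0": 7, "D1": 3, "D2": 0}
# "LD D0, [A0]" has obtained the value 10 but has not yet written it back.
in_flight = [("D0", 10)]

# "MUL D2, D0, D1" without forwarding reads the outdated D0.
stale = register_file["D0"] * register_file["D1"]

# With forwarding, the multiply sees the loaded value without stalling.
forwarded = (read_operand("D0", register_file, in_flight)
             * read_operand("D1", register_file, in_flight))

print(stale, forwarded)  # prints: 21 30
```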
Ideally, integer pipeline 120 and load/store pipeline 200 can execute instructions every clock cycle. However, many situations may occur that cause parts of integer pipeline 120 or load/store pipeline 200 to stall, which degrades the performance of the microprocessor. A common problem that causes pipeline stalls is latency in memory system 270 caused by cache misses. For example, a load instruction “LD D0, [A0]” loads data from address A0 of memory system 270 into data register D0. If the value for address A0 is in data cache 274, the value in data register D0 can be simply replaced by the data value for address A0 in data cache 274. However, if the value for address A0 is not in data cache 274, the value needs to be obtained from main memory 278. Thus, memory system 270 may cause load/store pipeline 200 to stall as the cache miss causes a refill operation. Furthermore, if the cache has no empty set and the previous cache data are dirty, the refill operation would need to be preceded by a write back operation.
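The sequence of memory operations described above (hit, miss with refill, and miss with a write back preceding the refill) can be illustrated with the following minimal Python sketch. The class `DataCache`, its direct-mapped organization, its write-allocate store policy, and the addresses and values used are all assumptions made for illustration; they are not features of data cache 274.

```python
class DataCache:
    """Toy direct-mapped, write-back cache holding one word per line (assumed)."""

    def __init__(self, num_lines, memory):
        self.num_lines = num_lines
        self.memory = memory  # stands in for main memory: address -> value
        self.lines = {}       # line index -> {"addr", "value", "dirty"}

    def load(self, addr):
        """Return (value, events), where events names the operations required."""
        index = addr % self.num_lines
        line = self.lines.get(index)
        if line is not None and line["addr"] == addr:
            return line["value"], ["hit"]
        events = ["miss"]
        if line is not None and line["dirty"]:
            # The line being replaced is dirty: write it back before the refill.
            self.memory[line["addr"]] = line["value"]
            events.append("write_back")
        self.lines[index] = {"addr": addr, "value": self.memory[addr], "dirty": False}
        events.append("refill")
        return self.memory[addr], events

    def store(self, addr, value):
        """Write `value` into the cache, allocating the line first on a miss."""
        index = addr % self.num_lines
        line = self.lines.get(index)
        events = []
        if line is None or line["addr"] != addr:
            _, events = self.load(addr)
        self.lines[index]["value"] = value
        self.lines[index]["dirty"] = True
        return events


memory = {0: 11, 4: 44}
cache = DataCache(num_lines=4, memory=memory)
print(cache.load(0))   # first access misses and refills
print(cache.load(0))   # second access hits
cache.store(0, 99)     # line for address 0 becomes dirty
print(cache.load(4))   # address 4 maps to the same line: write back, then refill
```

Each additional event in the returned list corresponds to additional latency during which a load/store pipeline waiting on the data would be stalled.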
Rather than stalling the pipeline and wasting processor cycles, some processors (called multithreaded processors) can switch from a current thread to a second thread that can use the processor cycles that would have been wasted in single threaded processors. Specifically, in multithreaded processors, the processor holds the state of several active threads, which can be executed independently. When one of the threads becomes blocked, for example due to a cache miss, another thread can be executed so that processor cycles are not wasted. Furthermore, thread switching may also be caused by timer interrupts and progress-monitoring software in a real-time kernel. Because the processor does not have to waste cycles waiting on a blocked thread, overall performance of the processor is increased. However, different threads generally operate on different register contexts. Thus, data forwarding between threads should be avoided.
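The benefit of switching threads on a blocking event can be illustrated with the following Python sketch. It is a deliberately coarse model under stated assumptions: the function `run`, the operation labels "op" and "miss", the fixed miss penalty, and the policy of switching only when the current thread is blocked or finished are hypothetical, and pipeline depth is ignored.

```python
def run(threads, miss_penalty):
    """Return (total_cycles, idle_cycles) for the given threads.

    `threads` maps a thread name to a list of operations, each either "op"
    (completes in one cycle) or "miss" (blocks that thread for `miss_penalty`
    further cycles). The processor keeps running the current thread while it
    is runnable and switches only when it is blocked or finished (assumed).
    """
    pc = {name: 0 for name in threads}
    ready_at = {name: 0 for name in threads}
    current = None
    cycle = 0
    idle = 0
    while any(pc[name] < len(ops) for name, ops in threads.items()):
        runnable = [name for name, ops in threads.items()
                    if pc[name] < len(ops) and ready_at[name] <= cycle]
        if not runnable:
            idle += 1  # every unfinished thread is blocked: the cycle is wasted
        else:
            if current not in runnable:
                current = runnable[0]  # thread switch
            op = threads[current][pc[current]]
            pc[current] += 1
            if op == "miss":
                ready_at[current] = cycle + 1 + miss_penalty
        cycle += 1
    return cycle, idle


single = run({"A": ["op", "miss", "op"]}, miss_penalty=3)
multi = run({"A": ["op", "miss", "op"], "B": ["op", "op", "op"]}, miss_penalty=3)
print(single, multi)  # prints: (6, 3) (6, 0)
```

In this example the second thread's three operations fill exactly the three cycles that the single threaded run spends idle, so twice the work completes in the same six cycles.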
Another related problem is caused by traps. Traps are generally caused by error conditions, which lead to a redirection of the program flow to execute a trap handler. The error conditions can occur in different pipeline stages and need to be prioritized in case of simultaneous occurrences. Synchronous traps need to be synchronous to the instruction flow, which means the instruction that caused the trap is directly followed by the trap handler in the program execution. Asynchronous traps are usually handled some cycles after the trap is detected. In a multithreaded processor, a trap handler needs to be able to correlate a trap to the thread that caused the trap. Thus, most conventional processors using data forwarding or supporting synchronous traps do not allow multiple threads to coexist in the same pipeline. In these processors, processing cycles are wasted during a thread switch to allow the pipelines to empty the current thread before switching to the new thread. Other conventional processors allow multiple threads to coexist in the pipeline but do not support data forwarding and synchronous traps.
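One possible way to picture trap prioritization and the correlation of a trap to its thread is sketched below in Python. The priority rule used here, namely that a trap raised in a later pipeline stage belongs to an older instruction and is therefore taken first, is an assumption made for illustration and is not stated in the text above. The names `PendingTrap`, `select_trap`, and the example causes are likewise hypothetical.

```python
from dataclasses import dataclass

STAGE_ORDER = ["D", "E1", "E2", "W"]  # earliest to latest pipeline stage


@dataclass(frozen=True)
class PendingTrap:
    stage: str      # pipeline stage in which the error condition was detected
    thread_id: int  # thread owning the instruction that raised the trap
    cause: str      # illustrative description of the error condition


def select_trap(pending):
    """Choose the trap to service first among simultaneous traps.

    Assumed rule: the trap detected in the latest stage wins, because that
    stage holds the oldest instruction in program order.
    """
    if not pending:
        return None
    return max(pending, key=lambda trap: STAGE_ORDER.index(trap.stage))


pending = [
    PendingTrap(stage="D", thread_id=1, cause="illegal opcode"),
    PendingTrap(stage="E2", thread_id=0, cause="bus error"),
]
chosen = select_trap(pending)
print(chosen.thread_id, chosen.cause)  # prints: 0 bus error
```

Because each pending trap carries a thread identifier in this model, a handler can tell which thread's program flow must be redirected even when instructions from several threads occupy the pipeline at once.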
Another issue with conventional multithreaded processors is that program tracing becomes complicated due to thread switching. Conventional embedded processors incorporate program trace output for debugging and development purposes. Generally, a program trace is a list of entries that tracks the actual instructions issued by the instruction fetch and issue unit with the program counter at the time each instruction is issued. However, for multithreaded processors, a list of program instructions without correlation to the actual thread owning each instruction would be useless for debugging.
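The tracing difficulty can be illustrated with the following Python sketch, in which each trace entry carries a thread identifier in addition to the program counter and the instruction. The entry layout, the function `split_by_thread`, the program counter values, and the instructions shown are hypothetical and serve only to show why the thread correlation matters.

```python
from collections import defaultdict


def split_by_thread(trace):
    """Group (thread_id, program_counter, instruction) entries by thread.

    Issue order is preserved within each thread.
    """
    per_thread = defaultdict(list)
    for thread_id, pc, instr in trace:
        per_thread[thread_id].append((pc, instr))
    return dict(per_thread)


# Interleaved issue order as it might appear after two thread switches.
trace = [
    (0, 0x100, "LD D0, [A0]"),
    (1, 0x400, "ADD D3, D4, D5"),
    (1, 0x404, "ST [A1], D3"),
    (0, 0x104, "MUL D2, D0, D1"),
]

for thread_id, entries in split_by_thread(trace).items():
    print(thread_id, [(hex(pc), instr) for pc, instr in entries])
```

Without the first field of each entry, the four instructions would appear as a single stream whose program counter jumps between unrelated addresses, and the per-thread instruction sequences could not be reconstructed.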
Hence, there is a need for a method or system to allow pipelines to have multiple threads without the limitations of conventional systems with regard to program tracing, data forwarding, and trap handling.