1. Field of the Invention
The present invention relates to detecting and handling an instruction flush in a data processing system. More specifically, the present invention is directed to distributing a flush mechanism across all the execution units in a data processing system.
2. Description of the Related Art
A microprocessor is the heart of a modern computer, a chip made up of millions of transistors and other elements organized into specific functional operating units, including arithmetic units, cache memory and memory management, predictive logic and data movement.
Processors in modern computers have grown tremendously in performance, capabilities and complexity over the past decade. Any computer program consists of many instructions for operating on data. A processor executes the program through four operating stages: fetch, decode, execute and retire (or complete). The fetch stage reads a program's instructions and any needed data into the processor. The decode stage determines the purpose of the instruction and passes it to the appropriate hardware element. The execute stage is where that hardware element, now supplied with an instruction and data, carries out the instruction. The operation performed might be an add, bit-shift, floating-point multiply or vector operation. The retire stage takes the results of the execute stage and places them into other processor registers or the computer's main memory. For example, the result of an add operation might be stored in memory for later use.
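As an illustration only, the four operating stages described above can be sketched as a loop over a toy program. The instruction format (operation, destination, two operands), the `OPS` table and the `run` function are hypothetical names chosen for this sketch and do not correspond to any particular processor.

```python
import operator

# Hypothetical operation table: each entry stands in for a hardware element.
OPS = {"add": operator.add, "mul": operator.mul, "shl": operator.lshift}

def run(program):
    """Step each instruction through fetch, decode, execute and retire."""
    regs = {}
    for pc in range(len(program)):
        raw = program[pc]            # fetch: read the instruction and its data
        op, dst, a, b = raw          # decode: determine the purpose of the instruction
        unit = OPS[op]               #         and select the hardware element
        result = unit(a, b)          # execute: the element carries out the instruction
        regs[dst] = result           # retire: place the result into a register
    return regs

regs = run([("add", "r1", 2, 3), ("shl", "r2", 1, 4), ("mul", "r3", 6, 7)])
```

The sketch processes one instruction at a time; a real processor overlaps the stages of many instructions, which is what makes speculative execution possible.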
Processor circuitry is organized into separate logic elements, perhaps a dozen or more, called execution units. The execution units work in concert to implement the four operating stages. The capabilities of the execution units often overlap among the processing stages. The following are examples of some common processor execution units:
    Arithmetic logic unit: Processes all arithmetic operations. Sometimes this unit is divided into subunits, one to handle all integer add and subtract instructions, and another for the computationally complex integer multiply and divide instructions.
    Floating-point unit (FPU): Deals with all floating-point (non-integer) operations. In earlier times, the FPU was an external coprocessor; today, it's integrated on-chip to speed up operations.
    Load/store unit (LSU): Manages the instructions that read or write to memory.
    Memory-management unit (MMU): Translates an application's addresses into physical memory addresses. This allows an operating system to map an application's code and data in different virtual address spaces, which lets the MMU offer memory-protection services.
    Branch processing unit (BPU): Predicts the outcome of a branch instruction, aiming to reduce disruptions in the flow of instructions and data into the processor when an execution thread jumps to a new memory location, typically as the outcome of a comparison operation or the end of a loop.
    Vector processing unit (VPU): Handles vector-based, single-instruction multiple data (SIMD) instructions that accelerate graphics operations.
A common problem found in high performance microprocessor designs is detecting and handling an instruction flush. When instructions are executed speculatively and the results of the execution are based on a misprediction, the instructions must be re-executed. The most severe penalty for a misprediction is an instruction flush, which causes the results of that instruction and all following instructions to be thrown away. Instruction processing then starts over by refetching the flushed instruction. Instruction flushes occur in high performance microprocessor designs because instructions are fetched and executed speculatively, before it is known that all prior instructions have completed cleanly with no errors. Examples of conditions that cause an instruction flush are a branch mispredict and load/store fault conditions such as page faults. During a branch mispredict, instructions which have been fetched and executed down the mispredicted path are flushed. During a load/store flush, all younger instructions after the faulting instruction are flushed.
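The effect of a flush on the instructions in flight can be illustrated with a minimal sketch. The `Instr` record, its integer age `tag` (smaller meaning older in program order) and the `flush_from` function are assumptions made for this example. Whether the instruction that triggers the flush is itself discarded is modeled by the `inclusive` flag, which is also an assumption.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    tag: int    # hypothetical age tag: a smaller value is older in program order
    name: str

def flush_from(in_flight, flush_tag, inclusive):
    """Split in-flight instructions into (kept, flushed) around flush_tag."""
    kept, flushed = [], []
    for ins in in_flight:
        # Assumption: inclusive=True also discards the triggering instruction.
        discard = ins.tag >= flush_tag if inclusive else ins.tag > flush_tag
        (flushed if discard else kept).append(ins)
    return kept, flushed

window = [Instr(0, "add"), Instr(1, "load"), Instr(2, "branch"),
          Instr(3, "mul"), Instr(4, "store")]

# Branch mispredict at tag 2: instructions down the mispredicted path are flushed.
kept_b, flushed_b = flush_from(window, 2, inclusive=False)

# Load fault at tag 1: the load and all younger instructions are flushed.
kept_l, flushed_l = flush_from(window, 1, inclusive=True)
```

In both cases the older instructions are unaffected, which preserves the appearance of sequential execution.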
Most microprocessor architectures specify that a program will appear to execute in sequential order. A given instruction is younger than the instructions that precede it in program order. Prior high performance designs, such as POWER4™, implement a central flush mechanism in which flush signals are generated from each unit, then collected in a completion unit, then re-distributed back to all units with a global flush signal. In high frequency designs, this central method is limiting because it requires additional pipeline stages to receive flush signals from each unit, collect them, then re-distribute a global flush signal.
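The central flush mechanism described above can be sketched as three sequential steps. This is a simplified model for illustration, not the POWER4 implementation: the `central_flush` function, the unit names, and the rule that the completion unit selects the oldest requested tag are all assumptions of this sketch.

```python
def central_flush(unit_requests):
    """Model a central flush: collect per-unit signals, then redistribute.

    unit_requests maps a unit name to the age tag it wants flushed,
    or None if that unit raised no flush signal.
    """
    # Step 1: each unit sends its flush signal to the completion unit.
    collected = {u: t for u, t in unit_requests.items() if t is not None}
    if not collected:
        return None  # no unit requested a flush

    # Step 2: the completion unit resolves the signals into one flush point.
    # Assumption: the oldest (smallest) tag wins.
    global_tag = min(collected.values())

    # Step 3: the global flush signal is re-distributed back to all units.
    return {u: global_tag for u in unit_requests}

signals = central_flush({"BPU": 5, "LSU": 3, "FPU": None})
```

Each step in the sketch stands for work that must finish before the next begins, which is why the central approach adds pipeline stages between detecting a flush and acting on it.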
Thus, it would be advantageous to provide a method and apparatus to distribute a flush mechanism across all the execution units in a data processing system, and not require a central collection point to re-distribute the flush signals.