1. Field of the Invention
The present invention relates generally to the field of microprocessors and more specifically to pipelined microprocessors.
2. Prior Art
The design of an original microprocessor architecture requires a very large investment of time, money, and engineering effort. In order to maximize the profit realized over the lifetime of an architecture, proliferations of the original design are typically created to appeal to particular markets. A proliferation retains the "core" design of the original architecture, but enhances or adds to that design. For example, an original microprocessor design with a performance of 30 MIPS may be the perfect solution for the largest segment of the microprocessor market. However, other markets, such as laser printers, may demand a higher performance microprocessor. In this case, the original architecture could be enhanced to meet the performance requirements of the laser printer market by adding an on-chip data cache unit, speeding up the floating point unit, etc. The resulting proliferation may have a performance of 50 MIPS at a slightly higher cost than the original. Proliferating an existing architecture is more attractive than creating a new design because the required investment is much smaller.
When an original architecture is defined, design choices and tradeoffs are necessarily made. In the design of a proliferation of the original architecture, problems imposed by the constraints of these original design choices are often encountered. Overcoming these constraints is often a challenging part of the design of a proliferation.
FIG. 1 illustrates a block diagram of a prior art microprocessor. Microprocessor 10 includes a processor core 12, a random access memory (RAM) 16, and bus controller logic (BCL) 18, all coupled to a memory-side machine bus (MMB) 14. The processor core 12 generates an instruction pointer address, fetches the instruction at the instruction pointer address, decodes the instruction, and issues the decoded instruction to the functional units for execution. Bus controller logic 18 and an external memory 22 are coupled to a system bus 20. System bus 20 is used to transfer data between microprocessor 10 and external devices such as external memory 22. Bus controller logic 18 controls data transfers on system bus 20. When executing instructions, microprocessor 10 may operate on data retrieved either from random access memory 16 or from external memory 22.
The memory-side machine bus 14 carries the control information and handles the local data transfers that occur during the fetching, issuing, and executing of memory instructions by the microprocessor pipeline. The fetching, issuing, and executing of instructions is pipelined in a "fixed" sequence of three stages or "pipes" as defined in Table I. Each stage is divided into two phases: phase 1 and phase 2.
TABLE I
______________________________________
Pipeline for Fetching, Issuing, and Executing Instructions
______________________________________
Pipe 0: Phase 1. Generate Instruction Pointer (IP).
Pipe 0: Phase 2. Fetch instruction at IP address.
Pipe 1: Phase 1. Decode instruction.
Pipe 1: Phase 2. Issue instruction.
Pipe 2: Phase 1. Execute instruction.
Pipe 2: Phase 2. Return.
______________________________________
During pipe 0, phase 1, the address of the next instruction is generated by processor core 12. This address is the instruction pointer (IP). During pipe 0, phase 2, the processor core 12 fetches the next instruction from the address indicated by the instruction pointer. The instruction is typically fetched from either external memory or an instruction cache. During pipe 1, phase 1, the instruction is decoded by the processor core 12. The processor core 12 determines, among other things, whether execution of the instruction uses memory-side machine bus 14. If the instruction uses the memory-side machine bus, then it is issued on memory-side machine bus 14 during pipe 1, phase 2. During issuance, the memory-side machine bus carries control information that indicates which of the possible units (RAM, BCL, or other) should execute the issued instruction. During pipe 2, phase 1, the appropriate coprocessor unit executes the issued instruction. For example, bus controller logic 18 executes a LOAD instruction by retrieving the required data from external memory 22. During pipe 2, phase 2, the return phase of the pipeline, data is returned to memory-side machine bus 14 and transferred to the appropriate destination unit. Not all instructions return data during their own pipe 2, phase 2. Therefore, the return phase of the pipeline must be arbitrated to allow, for example, the bus controller to return data from external memory that was requested several instructions earlier in the pipeline.
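The fixed sequence of Table I can be illustrated with a small model. The following is only a sketch under stated assumptions: one instruction enters pipe 0 per cycle, nothing stalls, and the names STAGES, schedule, and describe are hypothetical and do not appear in the design described above.

```python
# Illustrative model of the fixed three-pipe, two-phase sequence of Table I.
# All identifiers are hypothetical; they are not taken from the actual design.

STAGES = [
    (0, 1, "Generate instruction pointer (IP)"),
    (0, 2, "Fetch instruction at IP address"),
    (1, 1, "Decode instruction"),
    (1, 2, "Issue instruction"),
    (2, 1, "Execute instruction"),
    (2, 2, "Return"),
]


def describe(pipe, phase):
    """Return the Table I activity for a given pipe and phase."""
    for p, ph, text in STAGES:
        if (p, ph) == (pipe, phase):
            return text
    raise ValueError(f"no such stage: pipe {pipe}, phase {phase}")


def schedule(instructions):
    """Map each cycle to the (instruction, pipe) pairs active in that cycle.

    Assumption: one instruction enters pipe 0 per cycle and nothing stalls,
    so an instruction that starts in cycle n is in pipe k during cycle n + k.
    """
    timeline = {}
    for start, name in enumerate(instructions):
        for pipe in range(3):
            timeline.setdefault(start + pipe, []).append((name, pipe))
    return timeline


if __name__ == "__main__":
    timeline = schedule(["STORE", "LOAD", "FETCH"])
    for cycle in sorted(timeline):
        for name, pipe in timeline[cycle]:
            print(f"cycle {cycle}: {name} in pipe {pipe} "
                  f"({describe(pipe, 1)} / {describe(pipe, 2)})")
```

In this model, the cycle in which one instruction is executing and returning (pipe 2) is the same cycle in which the following instruction is being decoded and issued (pipe 1), which is why the stage boundaries cannot be moved for one instruction without affecting the others.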
FIG. 1 shows an example of the operation of the pipeline of microprocessor 10 in waveforms PH1 24, MMBQ11 26, and BCLEXEQ21 28. (Note: Q11 means pipe 1, phase 1; Q21 means pipe 2, phase 1.) PH1 24 indicates the phase of the pipeline. The pipeline is in phase 1 when PH1 24 is high and in phase 2 when PH1 24 is low. MMBQ11 26 is the instruction decoding of pipe 1, phase 1. BCLEXEQ21 28 is the execution of pipe 2, phase 1. The waveforms of FIG. 1 illustrate a STORE, LOAD, FETCH sequence of instructions being decoded during pipe 1, phase 1 and executed during pipe 2, phase 1. A STORE instruction stores data to a memory. A LOAD instruction retrieves data from a memory. A FETCH is the loading of an instruction from external memory.
Microprocessor 10 has two possible "targets" that can execute or service an issued LOAD instruction: external memory 22 or internal RAM 16. Bus controller logic 18 is responsible for the detection and handling of this distinction. This detection is done in the same cycle in which the LOAD itself is issued, allowing bus controller logic 18 to begin executing a LOAD from external memory 22 in pipe 2, phase 1. In the case of a LOAD from RAM 16, the RAM will service the LOAD in pipe 2, phase 1.
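The target decision made during the issue cycle can be sketched as a simple address comparison. This is a hedged illustration only: the description above does not state how bus controller logic 18 distinguishes the two targets, and the address range RAM_BASE and RAM_SIZE, as well as the function name select_target, are assumptions introduced for the example.

```python
# Hypothetical address map. The actual range of internal RAM 16 is not given
# in the text; these values are assumptions chosen only for illustration.
RAM_BASE = 0x0000_0000
RAM_SIZE = 0x0000_0400  # assumed 1 KB of on-chip RAM


def select_target(address):
    """Pick the unit that services a LOAD in pipe 2, phase 1.

    The decision is modeled as being made in pipe 1, phase 2, the same
    cycle in which the LOAD is issued on the memory-side machine bus.
    """
    if RAM_BASE <= address < RAM_BASE + RAM_SIZE:
        return "RAM"  # internal RAM 16 services the LOAD
    return "BCL"      # bus controller logic 18 goes to external memory 22


if __name__ == "__main__":
    for addr in (0x0000_0010, 0x0000_0400, 0x8000_0000):
        print(hex(addr), "->", select_target(addr))
```

A range comparison of this kind is cheap enough to finish within the issue cycle, which is consistent with the two-target microprocessor being able to begin execution in pipe 2, phase 1.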
During execution of a typical program, many of the data accesses by microprocessor 10 are data LOAD accesses from external memory 22. Microprocessor 10 may be forced to sit idle for some time waiting for a LOAD instruction to return data since accesses to external memory 22 are relatively slow. Therefore, one way to achieve a higher performance proliferation of the microprocessor 10 would be to include an on-chip data cache unit to circumvent the long access times needed to LOAD data from external memory 22. However, the fixed definition of the pipeline does not allow enough time for "hit or miss" detection by a data cache unit during a cacheable LOAD from external memory.
Addition of a data cache unit provides a third "target," the data cache unit, that can execute a LOAD instruction. During a LOAD, a tag match operation is performed by the data cache unit to determine whether the required data resides in the data cache unit. This detection begins in pipe 1, phase 2 at the same time the LOAD is issued. Unfortunately, the tag match operation, or "hit or miss" detection, of a data cache unit requires too much time to be completed before execution of the LOAD instruction begins in pipe 2, phase 1. Therefore, it is not possible to tell bus controller logic 18 in time whether or not to execute the LOAD in the execution cycle. An additional "dead" cycle is required to give the "hit or miss" detection enough time to properly determine whether the bus controller logic will need to service the LOAD. Creating a permanent "dead" cycle in the pipeline is one way to solve the problem, but a permanent "dead" cycle would seriously degrade the performance of the microprocessor.
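The tag match, or "hit or miss" detection, referred to above can be sketched as follows. This is a minimal illustration assuming a direct-mapped organization; the text does not specify the organization of the data cache unit, and LINE_BYTES, NUM_LINES, DataCache, and load_target are hypothetical names and values chosen for the example.

```python
# Minimal direct-mapped tag match. The cache organization, line size, and
# number of lines are assumptions; the text does not specify any of them.
LINE_BYTES = 16
NUM_LINES = 64


class DataCache:
    def __init__(self):
        # One stored tag per line; None marks a line that holds no data.
        self.tags = [None] * NUM_LINES

    def _split(self, address):
        """Split an address into (tag, index) for a direct-mapped lookup."""
        line_number = address // LINE_BYTES
        return line_number // NUM_LINES, line_number % NUM_LINES

    def is_hit(self, address):
        """Tag match: compare the stored tag with the tag of the address."""
        tag, index = self._split(address)
        return self.tags[index] == tag

    def fill(self, address):
        """Record that the line containing the address is now cached."""
        tag, index = self._split(address)
        self.tags[index] = tag


def load_target(cache, address):
    """Choose who services a cacheable LOAD once hit or miss is known."""
    return "DCACHE" if cache.is_hit(address) else "BCL"


if __name__ == "__main__":
    cache = DataCache()
    print(load_target(cache, 0x1234))  # nothing cached yet, so a miss
    cache.fill(0x1234)
    print(load_target(cache, 0x1234))  # same line, so a hit
```

Unlike a fixed address-range comparison, the result here depends on reading the stored tag and comparing it with the address, and it is this lookup and comparison that the text describes as taking too long to finish before pipe 2, phase 1 begins.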
Therefore, a method and apparatus for dynamically expanding the pipeline of a microprocessor is needed.