The present invention pertains generally to microprocessors, and more particularly to a single-chip microprocessor comprising multiple asymmetrical central processing units executing separate threads.
Single-chip microprocessors have been around for decades and are used extensively in computer systems and other electronically controlled systems. The fundamental structure of a microprocessor includes a central processing unit (CPU), an execution unit, a memory management unit (MMU), and optionally an on-chip cache. The CPU includes a program counter which points to the location in memory from which to fetch program instructions, an instruction fetch unit (IFU) which fetches program instructions from memory and places them into an instruction cache, and an instruction decode unit which decodes the instructions in the instruction cache and facilitates the execution of the decoded instructions by an execution unit. The CPU typically includes a number of fast data/instruction registers for temporarily storing instructions or data on which operations are performed.
In the continual drive for faster and smaller electronics, much research is devoted to developing techniques for increasing the overall speed, or throughput, of the microprocessor. Throughput is measured in terms of the number of operations performed per unit of time. Conceptually, the simplest of all possible improvements to system speed is to increase the clock speeds of the various components, particularly the clock speed of the processor. For example, if everything runs twice as fast but otherwise works in exactly the same manner, the system should generally perform a given task in half the time. Increasing the clock speed indefinitely, however, is not practical due to the inherent RC delay limitations of the data signals.
Another technique for increasing the throughput is to reduce the length of the signal paths within the microprocessor. In other words, by reducing the number of components and the length of wire between the components, the data signals need travel a shorter distance and are subject to less RC delay. This makes it possible to increase the clock speed of the processor, and accordingly increase system speed. Despite the enormous gains in integrated circuit density, however, the amount of circuitry that can be placed on a chip is approaching physical limits; accordingly, RC signal delay can no longer be significantly reduced by merely shortening the data signal path lengths.
Yet another technique for increasing the speed of a microprocessor is to implement switching speed enhancement hardware throughout the data signal paths. Data signal switching speed can be increased through various hardware enhancements such as the use of repeaters along signal trace lines, biasing latches in the direction of the signal transition of interest, and many other artificial enhancements. Data switching enhancement techniques are also problematic in that they increase circuit complexity, require an increased number of circuit components, and increase the total amount of space required to implement the microprocessor.
In view of the above hardware limitations, attention has therefore been directed to architectural approaches for further improvements in overall speed of the microprocessor.
One approach to increasing the average number of operations executed per clock cycle is the implementation of instruction pipelining and cache memories. Pipeline instruction execution allows subsequent instructions to begin execution before previously issued instructions have finished. Cache memories store frequently used instructions and data nearer the processor and allow instruction execution to continue, in most cases, without waiting the full access time of a primary memory. Some improvement has also been demonstrated with multiple execution units with look ahead hardware for finding instructions to execute in parallel.
Multiple functional or execution units are provided in many modern microprocessors to run multiple pipelines in parallel. In a superscalar architecture, instructions may be completed in-order and out-of-order. In-order completion means no instruction can complete before all instructions dispatched ahead of it have been completed. Out-of-order completion means that an instruction is allowed to complete before all instructions ahead of it have been completed, as long as predefined rules are satisfied.
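By way of illustration only, the difference between in-order and out-of-order completion can be sketched as follows. The sketch assumes a simplified model that is not taken from the text above: one instruction is dispatched per cycle, each instruction has a fixed execution latency, and the "predefined rules" for out-of-order completion (such as data dependencies) are always satisfied. The function name and parameters are hypothetical.

```python
def completion_cycles(latencies, in_order):
    """Return the cycle at which each instruction completes.

    latencies: execution latency, in cycles, of each instruction in dispatch order.
    in_order:  if True, no instruction may complete before the one dispatched
               ahead of it; if False, each completes as soon as it finishes.

    Simplifying assumptions: instruction i is dispatched at cycle i, and
    dependencies between instructions are ignored.
    """
    completed = []
    previous = 0  # completion cycle of the previously dispatched instruction
    for dispatch_cycle, latency in enumerate(latencies):
        finished = dispatch_cycle + latency
        if in_order:
            # Wait for the earlier instruction to complete first.
            finished = max(finished, previous)
        completed.append(finished)
        previous = finished
    return completed


# A slow instruction followed by two fast ones.
out_of_order = completion_cycles([5, 1, 1], in_order=False)  # [5, 2, 3]
strictly_ordered = completion_cycles([5, 1, 1], in_order=True)  # [5, 5, 5]
```

In this model the two fast instructions complete at cycles 2 and 3 when out-of-order completion is allowed, but are held until cycle 5 behind the slow instruction when completion must be in order.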
For both in-order and out-of-order execution in superscalar systems, pipelines will stall under certain circumstances. An instruction that is dependent upon the results of a previously dispatched instruction that has not yet completed may cause the pipeline to stall. For instance, instructions dependent on a load/store instruction for which the necessary data is not in the cache, i.e., a cache miss, cannot be executed until the data becomes available in the cache. Maintaining in the cache the data necessary for continued execution, and sustaining a high hit ratio, i.e., the number of times the requested data was readily available in the cache compared to the total number of requests for data, is not trivial, especially for computations involving large data structures. A cache miss can cause the pipelines to stall for several cycles, and the total amount of memory latency will be severe if the data is not available most of the time. Although memory devices used for primary memory are becoming faster, the speed gap between such memory chips and high-end processors continues to widen. Accordingly, a significant amount of execution time in current high-end processor designs is spent waiting for resolution of cache misses, and these memory access delays use an increasing proportion of processor execution time.
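By way of illustration only, the effect of the hit ratio on memory latency can be sketched with the conventional average-access-time estimate, in which every access pays the cache access time and each miss additionally pays a miss penalty. The function names and the cycle counts used below (a 1-cycle hit and a 100-cycle miss penalty) are assumptions chosen for the example, not figures from the text above.

```python
def hit_ratio(hits, requests):
    """Fraction of data requests that were satisfied directly from the cache."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return hits / requests


def average_access_cycles(ratio, hit_cycles, miss_penalty_cycles):
    """Average cycles per memory access.

    Every access pays hit_cycles; the fraction (1 - ratio) that misses
    also pays miss_penalty_cycles while the data is brought into the cache.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    return hit_cycles + (1.0 - ratio) * miss_penalty_cycles


# Hypothetical figures: 1-cycle hit, 100-cycle miss penalty.
high = average_access_cycles(hit_ratio(99, 100), hit_cycles=1, miss_penalty_cycles=100)
low = average_access_cycles(hit_ratio(90, 100), hit_cycles=1, miss_penalty_cycles=100)
```

Under these assumed figures, a drop in hit ratio from 99% to 90% raises the average access time from roughly 2 cycles to roughly 11 cycles, which is why even a modest miss rate can dominate execution time.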
The presence of branch instructions becomes a major impediment to improving processor performance, especially in pipelined superscalar processors, since they control which instructions are executed next. This decision cannot be made until the branch is “resolved” or completed. Branch prediction techniques have been used to guess the correct instruction to execute. However, these techniques are not perfect. This becomes more severe as processors are executing speculatively past multiple branches.
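By way of illustration only, the imperfection of branch prediction can be sketched with a two-bit saturating counter, one well-known prediction scheme. The text above does not name any particular scheme, so the choice of predictor, the function name, and the initial counter state are all assumptions made for this example.

```python
def count_mispredictions(outcomes, initial_state=2):
    """Count wrong guesses made by a two-bit saturating counter predictor.

    outcomes: sequence of booleans, True if the branch was actually taken.
    The counter ranges over 0..3; states 2 and 3 predict "taken",
    states 0 and 1 predict "not taken".
    """
    state = initial_state
    wrong = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            wrong += 1
        # Move toward 3 on a taken branch and toward 0 otherwise, saturating.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return wrong


# A loop branch taken nine times and then not taken once on exit.
loop_errors = count_mispredictions([True] * 9 + [False])  # 1
# A branch that alternates between taken and not taken.
alternating_errors = count_mispredictions([True, False] * 4)  # 4
```

The predictor does well on the regular loop branch, mispredicting only the exit, but guesses wrong on half of the alternating branch's outcomes, illustrating that prediction accuracy depends on how predictable the program's behavior is.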
Another architectural approach to improving system throughput has been the use of multiple processors. This is often implemented by placing multiple identical CPUs in a single computer system, which typically services multiple users simultaneously. Each of the different CPUs can separately execute a different task on behalf of a different user, thus increasing the overall speed of the system to execute multiple tasks simultaneously. Key to this architecture is that each of the multiple CPUs in the system is identical and therefore each CPU can perform any application task.
The above use of multiple processors is problematic, however. Most application programs follow a single path or flow of steps performed by the processor. While it is sometimes possible to break up this single path into multiple parallel paths, a universal technique for doing so is still being researched. Generally, breaking a lengthy task into smaller tasks for parallel processing by multiple processors is done by a software engineer writing code on a case-by-case basis. This ad hoc approach is especially problematic for executing programs which are not necessarily repetitive or predictable.
It should thus be apparent that a need exists for an improved technique for increasing the throughput of a microprocessor.
A microprocessor architecture is presented which includes multiple asymmetrical central processing units (CPUs), including a primary CPU that executes a primary application thread and one or more secondary CPUs that execute secondary threads that monitor the progress of the primary thread and attempt to ensure that instructions are prefetched from main memory and transferred into the instruction cache, or from external storage into the main memory as needed, such that the instruction pipeline on which the execution unit operates is full as much as possible. Each secondary CPU includes a dedicated program counter, instruction fetch unit, and instruction decode unit, just as does the primary CPU, but implements much simpler circuitry, such as providing many fewer registers, if any, and a simpler instruction decode unit operating on a reduced instruction set, in order to conserve chip space.
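By way of illustration only, the cooperation between the primary thread and a prefetching secondary thread can be sketched in software as follows. This is a toy model under several assumptions not found in the text above: the instruction cache is unbounded, a prefetch completes instantly, and the secondary thread is given the upcoming fetch addresses directly, whereas an actual design would have to determine them by monitoring the primary thread. The function name and the lookahead parameter are hypothetical.

```python
def count_fetch_misses(trace, lookahead):
    """Count instruction-cache misses seen by the primary thread.

    trace:     instruction addresses in the order the primary thread fetches them.
    lookahead: number of upcoming addresses the hypothetical secondary thread
               prefetches at each step; 0 models a system with no secondary thread.
    """
    icache = set()  # assumption: unbounded instruction cache, no evictions
    misses = 0
    for position, address in enumerate(trace):
        # Secondary thread: follow the primary's progress and prefetch ahead of it.
        for upcoming in trace[position + 1 : position + 1 + lookahead]:
            icache.add(upcoming)
        # Primary thread: fetch the current instruction, stalling on a miss.
        if address not in icache:
            misses += 1
            icache.add(address)
    return misses


straight_line = [0, 4, 8, 12]
without_helper = count_fetch_misses(straight_line, lookahead=0)  # 4
with_helper = count_fetch_misses(straight_line, lookahead=2)  # 1
```

In this model the primary thread misses on every first-time fetch when it runs alone, but only on the very first instruction when the secondary thread stays two fetches ahead of it.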