Users of data processing systems such as computers and the like continue to demand greater and greater performance from such systems for handling increasingly complex and difficult tasks. Greater performance from the processors that operate such systems may be obtained through faster clock speeds, so that individual instructions are processed more quickly. However, relatively greater performance gains have been achieved through performing multiple operations in parallel with one another.
One manner of parallelization is known as "pipelining", where instructions are fed into a pipeline for an execution unit in a processor that performs, in parallel, the different operations necessary to process the instructions. For example, to process a typical instruction, a pipeline may include separate stages for fetching the instruction from memory, executing the instruction, and writing the results of the instruction back into memory. Thus, for a sequence of instructions fed in sequence into the pipeline, as the results of the first instruction are being written back into memory by the third stage of the pipeline, a next instruction is being executed by the second stage, and yet another instruction is being fetched by the first stage. While each individual instruction may take several clock cycles to be processed, since other instructions are also being processed at the same time, the overall throughput of the processor is much greater.
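The throughput gain described above can be illustrated with a short calculation. The following is a minimal sketch only, assuming an idealized pipeline in which every stage takes exactly one clock cycle and no stalls occur; the function names are hypothetical and chosen for illustration.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Cycles needed when each instruction passes through every stage
    before the next instruction starts (assumes one cycle per stage)."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages):
    """Cycles needed when a new instruction enters the pipeline every cycle.

    The first instruction takes num_stages cycles to complete; each later
    instruction completes one cycle after its predecessor (assumes no stalls).
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


# Three stages, as in the fetch / execute / write-back example.
print(unpipelined_cycles(100, 3))
print(pipelined_cycles(100, 3))
```

Under these assumptions, each instruction still takes three cycles from fetch to write-back, yet for long instruction sequences the pipeline completes close to one instruction per cycle.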
Greater parallelization can also be performed by attempting to execute multiple instructions in parallel using multiple execution units in a processor. Processors that include multiple execution units are often referred to as "superscalar" processors, and such processors include scheduling circuitry that attempts to efficiently dispatch instructions to different execution units so that as many instructions are processed at the same time as possible. Relatively complex decision-making circuitry is often required, however, because oftentimes one instruction cannot be processed until after another instruction is completed. For example, if a first instruction loads a register with a value from memory, and a second instruction adds a fixed number to the contents of the register, the second instruction typically cannot be executed until execution of the first instruction is complete.
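The load-then-add dependency in the example above can be expressed as a simple check. This is an illustrative sketch, not a description of any particular scheduling circuit; the `Instr` record and the `depends_on` helper are hypothetical, and only the read-after-write case from the example is modeled.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    """Hypothetical instruction record: the registers it writes and reads."""
    text: str
    writes: frozenset
    reads: frozenset


def depends_on(later, earlier):
    """Return True if `later` reads a register that `earlier` writes."""
    return bool(later.reads & earlier.writes)


# First instruction loads a register from memory; the second adds a
# fixed number to the contents of that same register.
load = Instr("load r1, [mem]", frozenset({"r1"}), frozenset())
add = Instr("add r1, r1, 4", frozenset({"r1"}), frozenset({"r1"}))
unrelated = Instr("load r2, [mem2]", frozenset({"r2"}), frozenset())

print(depends_on(add, load))        # the add must wait for the load
print(depends_on(unrelated, load))  # these two could be dispatched together
```

A superscalar processor has to make decisions of this kind in hardware, for every group of candidate instructions, while the program is running.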
Relatively complex scheduling circuitry can occupy a significant amount of area on an integrated circuit device, and can slow the overall execution speed of a processor. For these reasons, significant development work has been devoted to Very Long Instruction Word (VLIW) processors, where the decision as to which instructions can be executed in parallel is made when a program is created, rather than during execution. A VLIW processor typically includes multiple execution units, and each VLIW instruction includes multiple primitive instructions, known as parcels, that are known to be executable at the same time as one another. Each primitive instruction in a VLIW may therefore be directly dispatched to one of the execution units without the extra overhead associated with scheduling. VLIW processors rely on sophisticated computer programs known as compilers to generate suitable VLIW instructions for a computer program written by a computer user. VLIW processors are typically less complex and more efficient than superscalar processors given the elimination of the overhead associated with scheduling the execution of instructions.
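The idea of deciding at program creation time which primitive instructions may share a VLIW instruction can be sketched as follows. This is a toy illustration under stated assumptions, not how any actual VLIW compiler works: instructions are kept in program order, a hypothetical `width` gives the number of parcels per VLIW instruction, and a new VLIW instruction is started whenever an instruction reads or writes a register written by a parcel already placed in the current one.

```python
def pack_into_words(instructions, width):
    """Greedily group instructions, in program order, into VLIW words.

    `instructions` is a list of (name, writes, reads) tuples, where
    `writes` and `reads` are sets of register names (a hypothetical format).
    """
    words = []
    current = []
    for instr in instructions:
        _, writes, reads = instr
        # Conflict if this instruction reads or writes a register that a
        # parcel already in the current word writes.
        conflict = any((reads & w) or (writes & w) for _, w, _ in current)
        if current and (conflict or len(current) == width):
            words.append(current)
            current = []
        current.append(instr)
    if current:
        words.append(current)
    return words


program = [
    ("load r1", {"r1"}, set()),
    ("add r1", {"r1"}, {"r1"}),
    ("load r2", {"r2"}, set()),
]
packed = pack_into_words(program, width=2)
names = [[name for name, _, _ in word] for word in packed]
print(names)
```

Because the grouping is fixed before execution, the processor can dispatch each parcel directly to an execution unit without rechecking dependencies at run time.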
Regardless of the type of processor, another bottleneck on computer performance is that of transferring information between a processor and memory. In particular, processor speed has increased much more quickly than the speed of main memory. As a result, cache memories, or caches, are often used in many such systems to increase performance in a relatively cost-effective manner.
A cache is typically a relatively faster memory that is coupled intermediate one or more processors and a relatively slower memory such as implemented in volatile or non-volatile memory devices, mass storage devices, and/or external network storage devices, among others. A cache speeds access by maintaining a copy of the information stored at selected memory addresses so that access requests to the selected memory addresses by a processor are handled by the cache. Whenever an access request is received for a memory address not stored in the cache, the cache typically retrieves the information from the memory and forwards the information to the processor. Moreover, if the cache is full, typically the information related to the least recently used memory address is discarded or returned to the memory to make room for information related to more recently accessed memory addresses.
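The behavior described above, including discarding the least recently used entry when the cache is full, can be sketched as follows. This is a minimal software model under stated assumptions, not a hardware design: the `LRUCache` class is hypothetical, a Python dictionary stands in for the slower memory, and only read accesses are modeled.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical model of a cache with least-recently-used replacement."""

    def __init__(self, capacity, backing):
        self.capacity = capacity
        self.backing = backing      # dict standing in for the slower memory
        self.lines = OrderedDict()  # address -> value, oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:
            # Request handled by the cache; mark the address most recent.
            self.hits += 1
            self.lines.move_to_end(address)
            return self.lines[address]
        # Address not in the cache: retrieve the information from memory.
        self.misses += 1
        value = self.backing[address]
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # discard least recently used
        self.lines[address] = value
        return value


memory = {address: address * 10 for address in range(8)}
cache = LRUCache(capacity=2, backing=memory)
for address in (0, 1, 0, 2, 1):
    cache.read(address)
print(cache.hits, cache.misses)
```

In this access sequence, reading address 2 evicts address 1 (address 0 was used more recently), so the final read of address 1 goes back to the slower memory.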
The benefits of a cache are maximized whenever the number of access requests to cached memory addresses, known as "cache hits", is maximized relative to the number of access requests to non-cached memory addresses, known as "cache misses". Despite the added overhead that typically occurs as a result of a cache miss, as long as the percentage of cache hits is high, the overall access rate for the system is increased.
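The relationship between the hit percentage and the overall access rate can be illustrated with a weighted average. The following is a sketch only; the function names and the cycle counts used in the example are assumptions made for illustration and do not come from any particular system.

```python
def hit_rate(hits, misses):
    """Fraction of access requests that were cache hits."""
    total = hits + misses
    return hits / total if total else 0.0


def average_access_cycles(rate, hit_cycles, miss_cycles):
    """Average cycles per access, weighting hits and misses by frequency.

    `miss_cycles` is the full cost of an access that misses the cache,
    including the added overhead of the miss (hypothetical parameter).
    """
    return rate * hit_cycles + (1.0 - rate) * miss_cycles


# Assumed figures: a hit costs 1 cycle, a miss costs 20 cycles in total.
high = average_access_cycles(hit_rate(95, 5), hit_cycles=1, miss_cycles=20)
low = average_access_cycles(hit_rate(60, 40), hit_cycles=1, miss_cycles=20)
print(high, low)
```

With these assumed figures, the average access cost stays close to the hit cost at a high hit percentage and rises steeply as the hit percentage falls.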
However, it has been found that with much commercial program code such as operating system code and the like, the miss rate for instructions in a cache is often relatively high due to the lack of code reuse and the presence of a large number of branch instructions, which are used to cause a processor to take different instruction paths based upon the result of conditions, or tests, specified in the instructions. Also, a great deal of operating system code is devoted to error and exception handling, and is thus rarely executed, often resulting in a cache temporarily storing a significant number of instructions that are never executed.
It has further been found that for VLIW processors, the miss rate is often even higher because compiling a computer program into a VLIW-compatible format typically expands the program code 2-4 times. Also, the relative frequency of branch instructions in VLIW program code is much higher--typically two branches out of every three instructions versus one branch every 5-6 instructions with a superscalar processor.
One manner of increasing the hit rate for a cache is to increase the size of the cache. However, cache memory is often relatively expensive, and oftentimes is limited by design constraints--particularly if the cache is integrated with a processor on the same integrated circuit device. Internal caches integrated with a processor are typically faster than external caches implemented in separate circuitry. On the other hand, due to design and cost constraints, internal caches are typically much smaller in size than their external counterparts.
One cost-effective alternative is to chain together multiple caches of varying speeds, with a relatively smaller, but faster primary cache chained to a relatively larger, but slower secondary cache. In addition, instructions and data may be separated into separate data and instruction caches. For example, for instructions, some processors implement a relatively small internal level one (L1) instruction cache with an additional external level two (L2) instruction cache coupled intermediate the L1 instruction cache and main memory storage. Typically, an L1 instruction cache has an access time of one clock cycle, and thus, data may be fed to the processor at approximately the same rate as instructions can be processed by the processor. On the other hand, an external L2 instruction cache oftentimes has an access time of at least 5 clock cycles, so if a processor is required to rely extensively on memory accesses to an L2 instruction cache, the processor may often stall waiting for data to be retrieved by the cache, thereby significantly degrading processor performance.
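The cost of falling through from the L1 instruction cache to the L2 instruction cache can be sketched with a chained lookup. This is an illustrative model only: the `fetch` function is hypothetical, the 1-cycle and 5-cycle access times are taken from the typical figures above, the 50-cycle main memory time is an assumed value, lookup costs are assumed to add up, and eviction is not modeled.

```python
def fetch(address, l1, l2, memory, l1_cycles=1, l2_cycles=5, memory_cycles=50):
    """Return (value, cycles) for one fetch through chained caches.

    `l1`, `l2`, and `memory` are dicts mapping address -> contents. On a
    miss, the faster levels are filled so that later fetches hit.
    """
    if address in l1:
        return l1[address], l1_cycles
    if address in l2:
        l1[address] = l2[address]
        return l1[address], l1_cycles + l2_cycles
    value = memory[address]
    l2[address] = value
    l1[address] = value
    return value, l1_cycles + l2_cycles + memory_cycles


memory = {0x100: "instr A", 0x104: "instr B"}
l1, l2 = {}, {0x104: "instr B"}

_, cold = fetch(0x100, l1, l2, memory)     # misses both caches
_, l2_hit = fetch(0x104, l1, l2, memory)   # misses L1, hits L2
_, l1_hit = fetch(0x104, l1, l2, memory)   # now hits L1
print(cold, l2_hit, l1_hit)
```

Under these assumptions, a processor that can consume one instruction per cycle sits idle for the extra cycles on every fetch that is not satisfied by the L1 instruction cache.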
As an attempt to minimize the delays associated with retrieving instructions from memory, many processors include prefetch circuitry that attempts to "predict" what instructions will need to be executed in the immediate future, and then to speculatively retrieve those instructions from memory before they are needed by the processor. Branch instructions present the greatest impediments to prefetching instructions, and as a result, prefetch circuitry typically performs an operation known as "branch prediction" to attempt to speculatively determine whether or not a particular instruction path will be taken after a branch instruction.
One manner of branch prediction relies on a branch history table or cache that maintains a history of whether or not previously-executed branch instructions resulted in branches being taken. In particular, it has been found that, more often than not, a branch instruction will take the same instruction path each time it is executed. By predicting that the same path will be taken the next time a particular branch instruction is executed, the prediction is usually successful.
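A history-based predictor of the kind described above can be sketched in a few lines. This is a minimal software model, assuming a table indexed by branch address that remembers only the most recent outcome; the class name, the default prediction for unseen branches, and the example trace are all hypothetical.

```python
class BranchHistoryTable:
    """Hypothetical table remembering whether each branch was last taken."""

    def __init__(self, default_taken=False):
        self.last_taken = {}          # branch address -> last outcome
        self.default_taken = default_taken

    def predict(self, branch_address):
        """Predict the same path the branch took the last time it executed."""
        return self.last_taken.get(branch_address, self.default_taken)

    def update(self, branch_address, taken):
        """Record the actual outcome once the branch has been resolved."""
        self.last_taken[branch_address] = taken


def prediction_accuracy(trace):
    """Fraction of correct predictions over a list of (address, taken)."""
    table = BranchHistoryTable()
    correct = 0
    for address, taken in trace:
        if table.predict(address) == taken:
            correct += 1
        table.update(address, taken)
    return correct / len(trace) if trace else 0.0


# A loop-closing branch that is taken nine times and then falls through.
loop_trace = [(0x40, True)] * 9 + [(0x40, False)]
print(prediction_accuracy(loop_trace))
```

In this trace only the first execution (no history yet) and the final loop exit are mispredicted, which reflects the observation that a branch usually follows the same path it followed before.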
Conventional branch history tables typically store an indication of whether the condition for a particular branch instruction was met the last time the instruction was executed. However, with a conventional branch history table, often the table must be accessed to determine whether a branch was taken, followed by generating the address for the next instruction, and then fetching the instruction stored at the generated address. If the instruction at the generated address is not in the primary cache, the processor will stall waiting for the secondary cache to handle the fetch request.
Consequently, while conventional branch history tables do reduce the overhead associated with branch instructions, some degree of overhead still exists in many circumstances. As a result, processor performance is adversely affected. Furthermore, with VLIW program code, where branch instructions are encountered more frequently, the adverse impact of branch instructions on processor performance is even greater.
Therefore, a substantial need exists for an improved manner of branch prediction that minimizes the overhead associated with branch instructions and maximizes processor performance, particularly for VLIW and superscalar processors and the like.