1. Field of Invention
This invention relates to microprocessor design, and more particularly, to techniques for more efficient utilization of cache in a microprocessor.
2. Description of Related Art
Advances in semiconductor manufacturing have greatly improved the performance of microprocessors, while at the same time reducing their cost. Consequently, the use of microprocessors has become more and more widespread. Today, they are present not only in computers, but also in various consumer-oriented products, such as VCRs, microwave ovens and automobiles. In many cases, low cost is more important than high performance. For example, a microprocessor-controlled washing machine may require the microprocessor to do nothing more than accept user commands and time the wash cycles. On the other hand, some applications demand the highest performance obtainable. For example, modern telecommunications requires very high speed processing of multiple signals representing voice, video, data, etc., which have been densely combined to maximize the use of available communication channels.
A rough measure of a microprocessor's performance is the speed with which it executes an instruction sequence, sometimes stated in millions of instructions per second, or MIPS. Since there are many applications in which microprocessor speed is extremely important, designers have evolved a number of speed-enhancing techniques and architectural features. Among these techniques and features are the instruction pipeline and the use of cache memory.
A pipeline consists of a sequence of stages through which instructions pass as they are executed, with partial processing of an instruction being performed in each stage. In a typical microprocessor, each instruction comprises an operator and one or more operands. The operator represents a code designating the particular operation to be performed (e.g., MOVE, ADD, etc.), and the operand denotes an address or data upon which the operation is to be performed. Execution of an instruction is actually a process requiring several steps. For example, execution of an instruction may consist of four separate steps, in which the instruction must be decoded, the addresses of the operands computed, the operands fetched, and the operation executed. If each step occurs in one cycle of the microprocessor clock, four clock cycles are needed to execute an instruction. In a non-pipelined microprocessor, only one instruction is processed at a time. Therefore, the instruction rate is based on the time required to perform all of these separate steps. Thus, the instruction execution rate for a non-pipelined microprocessor is one instruction every four clock cycles. In a pipelined microprocessor, however, the four steps are performed concurrently on multiple instructions as they advance through the pipeline. After the first step has been performed on a given instruction, that instruction is passed down the pipeline for the second step. At the same time, another instruction is brought into the pipeline for the first step. Continuing in this manner, instructions advance through the pipeline in assembly line fashion, and emerge from the pipeline at a rate of one instruction every clock cycle. Therefore, once the pipeline is full, the pipeline-equipped microprocessor has an average instruction rate four times that of the non-pipelined microprocessor.
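The cycle counts in the four-step example above can be checked with a short calculation. The following is a minimal sketch under idealized assumptions (one clock cycle per step, no stalls or cache misses); the function names are hypothetical and chosen only for illustration.

```python
def cycles_non_pipelined(n_instructions, n_stages=4):
    """Each instruction occupies the processor for all of its steps."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages=4):
    """The first instruction needs n_stages cycles to emerge; every
    later instruction emerges one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    n = 1000
    slow = cycles_non_pipelined(n)
    fast = cycles_pipelined(n)
    print(slow, fast, slow / fast)
```

For long instruction sequences the ratio approaches the number of stages, which is the factor of four cited above; for short sequences the time needed to fill the pipeline reduces the gain.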
The advantage of the pipeline lies in performing the steps required to execute instructions concurrently. However, to operate efficiently, a pipeline must remain full. If the flow of instructions into and out of the pipeline is disrupted, clock cycles are wasted while the instructions within the pipeline are prevented from proceeding to the next processing step. Prior to execution, the instructions are typically stored in a memory device and must be fetched into the pipeline by the microprocessor. However, since access times for such memory devices are generally much longer than the clock cycle of the microprocessor, instruction flow through the pipeline may be impeded by the length of time required to fetch instructions from memory.
An obvious approach to this problem would seem to be to simply use faster memory devices. Unfortunately, although faster memory devices are available, they are typically costlier and consume more power than conventional memory. In view of these disadvantages, the use of high-speed devices throughout the entire memory is usually infeasible. A more practical alternative for high performance microprocessors is the use of cache memory.
Cache is a secondary memory resource, used in addition to the main memory. Cache generally consists of a limited amount of very high-speed memory. Since the cache is small (relative to the main memory), its cost and power consumption are not significant. Cache memory can improve microprocessor performance whenever the majority of instructions required by the microprocessor are concentrated in a particular region of memory. The principle underlying the use of cache is that, more often than not, the microprocessor will fetch instructions from the same area of memory. This is due to the fact that most instructions are executed by the microprocessor in the sequence in which they are encountered in memory.
The instructions comprising a typical microprocessor program are stored at consecutive addresses in memory. A program counter (or, PC) within the microprocessor keeps track of the address of the next instruction to be executed. For the majority of instructions, the PC is automatically incremented to point to the next instruction after a given instruction is executed. A special class of instructions, known as branch instructions, can modify the PC, causing the microprocessor to execute instructions at a non-sequential memory address. However, in most cases the branch will be to a nearby location. Naturally, some programs will contain frequent branches to non-local memory addresses, but the assumption of “locality” is usually justified.
Assuming the majority of instructions required by the microprocessor may be found in a given area of memory, the entire area is copied (e.g., in a block transfer) to the cache. The microprocessor then fetches the instructions as needed from the cache, rather than from the main memory. Since the cache is faster than the main memory, the pipeline is able to operate at full speed. Thus, cache memory can provide a dramatic improvement in average pipeline throughput, by providing the pipeline faster access to instructions than would be possible by directly accessing the instructions from conventional memory. As long as the instructions are reasonably localized, the use of a cache can significantly improve microprocessor performance.
A “cache miss” occurs when an instruction required by the microprocessor is not present in the cache. In response to a cache miss, the microprocessor typically discards the contents of its cache, and then fetches the necessary instructions as quickly as possible from memory and places them into the cache. Obviously, this is a source of overhead, and if it becomes necessary to empty and refill the cache very frequently, performance may begin to approach that of a microprocessor with no cache.
Note that the performance improvement due to cache depends on the expeditious transfer of the instructions from the targeted memory region into the cache. If the instructions were transferred at the same rate as a conventional memory access (i.e., at the rate at which the microprocessor would fetch them one at a time), there would be no improvement in speed—in fact, it is more likely that overall performance would be worse. Fortunately, it is possible to transfer the contents of a fixed-sized segment of memory (a “cache line”) to cache en masse. For example, an 8-word cache line may be transferred to cache in the same amount of time as it takes for the microprocessor to fetch a single instruction from the memory. There are certain restrictions on this type of transfer, however. In particular, the boundaries of each cache line (i.e., the starting and ending memory address) are pre-determined, and may not be altered for a given transfer. To continue with the previous example, assume the cache line size is 8 words. This implies that cache lines occur on 8-word boundaries—i.e., that the starting address in memory of each cache line is a multiple of 8. Thus, cache lines would occur at (hexadecimal, or base-16) memory addresses of 0000h, 0008h, 0010h . . . etc. In order to place an instruction at a particular memory location into cache, the entire cache line containing the instruction (referred to herein as the “context” of the instruction) must be transferred. For instance, to place the instruction at memory location 4003h into cache, the entire cache line from address 4000h to address 4007h must be transferred. When a block of instructions is transferred from memory to the cache, it may need to be pre-decoded before the instructions can be used by the microprocessor. Pre-decoding refers to detecting the instruction boundaries within the instruction block. 
If every instruction occupied a single memory address, pre-decoding would be trivial, but variable length instructions make demarcation more complicated.
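The cache line boundary rule described above (fixed-size lines whose starting addresses are multiples of the line size) can be expressed in a few lines. This is a minimal sketch assuming word-addressed memory and the 8-word line of the running example; the function name is hypothetical.

```python
LINE_WORDS = 8  # cache line size in words, as in the running example


def cache_line_bounds(addr, line_words=LINE_WORDS):
    """Return (first, last) word address of the cache line containing addr."""
    base = addr - (addr % line_words)  # round down to a line boundary
    return base, base + line_words - 1


if __name__ == "__main__":
    first, last = cache_line_bounds(0x4003)
    print(f"{first:04X}h-{last:04X}h")
```

For the instruction at 4003h this yields the line from 4000h to 4007h, which is the "context" that must be transferred in its entirety.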
Microprocessors frequently employ variable length instructions to make efficient use of instruction memory. Thus, for example, a microprocessor may employ both single-word instructions (i.e., 16 bits long) and double-word instructions (i.e., 32 bits long). A comparison between single-word and double-word instructions is presented in FIG. 1. A typical single-word instruction 26 occupies one memory location (at some address n). A portion of the single-word instruction consists of an op code 30, which is a specific bit pattern identifying the instruction as, for example, an add or shift operation. The remainder of the instruction is an operand, which designates the register, memory location, etc. to which the instruction applies. A typical double-word instruction 28 occupies two consecutive memory locations (addresses n and n+1). The op code 32 for a double-word instruction is generally in the first word (i.e., the word at the lower memory address), and the operand may include the remainder of the first word along with the second word. It is important to realize that, in the absence of any other information, there are generally no criteria by which to recognize an instruction.
TABLE 1
Given a 16-bit word, the following possibilities exist:
1. The word is a single-word instruction.
2. The word is the second word of a double-word instruction.
Given a pair of 16-bit words, the following possibilities exist:
1. The two words are the first and second words of a double-word instruction.
2. The first and second words are two consecutive single-word instructions.
3. The first word is either a single-word instruction or the second word of a double-word instruction, and the second word is either a single-word instruction or the first word of a double-word instruction.

The inherent ambiguity of instructions complicates pre-decoding of an instruction block, as described in detail below.
When an instruction block is fetched from memory, it may contain a combination of single-word and double-word instructions. In this case, the boundaries between the instructions are unknown, so the instruction block must be pre-decoded before the instructions can be entered into the pipeline. The process by which a block of instructions is fetched from memory and prepared for execution by a pipelined microprocessor is illustrated in FIG. 2.
In FIG. 2 an instruction block 12 is fetched from the internal memory 10 of a pipelined microprocessor 18. The size of the instruction block is based on the length of a cache line (typically, from 8 to 32 16-bit words). Given the fixed size of the cache line, the number of instructions in the block depends on the particular combination of single-word and double-word instructions. For example, if the cache line is 8 words long, the instruction block could contain as many as 8 single-word instructions or as few as 4 double-word instructions. To determine the boundaries of the single-word and double-word instructions in the fetched block, the instruction block is pre-decoded 14. The decoded instructions are then placed in the instruction cache 16 for execution by the microprocessor 18.
A problem arises with the conventional method of instruction pre-decoding, in regard to efficient cache utilization. As stated earlier, the instruction block fetched from memory must lie on a cache line boundary. In other words, the starting address and length of the instruction block are based on the cache line size. For example, with a cache line size of 8 words, when the microprocessor requests an instruction at memory location 4003h, an entire 8-word instruction block beginning at address 4000h is fetched. It should be clear that, depending on its address relative to the cache line boundaries, a requested instruction may appear anywhere within the fetched instruction block (i.e., the context of the instruction). A disadvantage of conventional pre-decoding is explained with reference to FIG. 3.
FIG. 3 represents an 8-word instruction block (illustrated as the words 22c, 22b, and 22a preceding the requested instruction, followed by the forward-decoded instructions 20a, 20b, 20c, 20d, and 20e) along a cache line beginning at internal memory address 4000h and extending to address 4007h. In this case, the microprocessor has requested the instruction at address 4003h, resulting in the entire instruction block being fetched from memory. As stated earlier, before the instructions in this block can be transferred to the cache, the block must be pre-decoded. The block contains a combination of single-word and double-word instructions, and the boundaries of these instructions must be known before the instructions can enter the pipeline.
The locations within the instruction block are relative to the address (4003h) of the requested instruction I1 20a, which is the value contained in the program counter (PC) of the microprocessor. Conventional pre-decoding proceeds in the forward direction (i.e., toward higher memory addresses) from requested instruction I1 20a, and establishes the addresses of the instructions subsequent to I1. This is easy to do, since there is no ambiguity concerning the size of instructions in the forward direction. For example, beginning with instruction I1 20a, the address of the next instruction I2 20b can be easily found. The microprocessor can determine from the op code of I1 whether it is a single-word or a double-word instruction. This tells the microprocessor whether instruction I2 20b is located one or two memory locations away (i.e., at PC+1 or PC+2, respectively). In the example of FIG. 3, I1 is a single-word instruction. Therefore, instruction I2 20b begins at address 4004h. Similarly, the op code of I2 indicates that it is also a single-word instruction, so instruction I3 20c must begin at address 4005h. The op code of instruction I3 reveals it to be a double-word instruction. Therefore, the microprocessor looks for instruction I4 20e two locations away, at address 4007h. Note that this technique works only because the pre-decoder is assured of beginning with the first word of an instruction (namely, the requested instruction, I1). Lacking this information, it would not be possible to resolve the ambiguities of Table 1.
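The forward pre-decoding walk just described can be sketched in software as follows. The instruction encoding is an assumption made purely for illustration: the most significant bit of an op code word is taken to mark a double-word instruction. The names DOUBLE_WORD_FLAG, instruction_length, forward_predecode and the word values in FIG3_BLOCK are hypothetical and do not come from any actual instruction set.

```python
DOUBLE_WORD_FLAG = 0x8000  # assumed encoding: MSB set means double-word


def instruction_length(word):
    """Length in words of the instruction whose first word is `word`."""
    return 2 if word & DOUBLE_WORD_FLAG else 1


def forward_predecode(block, base_addr, pc):
    """Return the start addresses of all instructions from pc to the
    end of the block. Valid only because pc is known to hold the
    first word of an instruction."""
    starts = []
    addr = pc
    end = base_addr + len(block)
    while addr < end:
        starts.append(addr)
        addr += instruction_length(block[addr - base_addr])
    return starts


# Invented contents for the line 4000h-4007h, arranged to match FIG. 3:
# I1 and I2 single-word, I3 double-word, I4 at 4007h.
FIG3_BLOCK = [0x1111, 0x9222, 0x1333,   # 4000h-4002h: not decodable
              0x0001, 0x0002,           # I1, I2
              0x8003, 0xBEEF,           # I3 and its second word
              0x0004]                   # I4

if __name__ == "__main__":
    print([f"{a:04X}h" for a in forward_predecode(FIG3_BLOCK, 0x4000, 0x4003)])
```

The second word of I3 happens to have its most significant bit set, yet it is never examined as an op code, because the walk steps over it. This is the sense in which the forward direction is free of ambiguity.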
While pre-decoding in the forward direction is straightforward, backward pre-decoding is generally not possible at all. Starting with a known instruction at the relative location of the PC is of no help in determining the boundaries of instructions at lower addresses (i.e., PC−1, PC−2 . . . etc.). For example, it is impossible to determine whether address 4002h in the instruction block contains a single-word instruction or the second word of a double-word instruction. Since the instructions preceding I1 cannot be decoded, they are not marked valid in the cache and cannot be entered in the pipeline. Thus, the inability to perform backward pre-decoding on the instruction block results in inefficient use of the cache. Although in the example of FIG. 3, only three memory locations (4000h–4002h) were unused, cache underutilization could be much worse, depending on where the requested instruction appeared within the cache line. For instance, if the requested instruction had been at memory address 4007h, the entire rest of the cache line could not be pre-decoded.
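The ambiguity in the backward direction can be made concrete with a small sketch. It uses the same illustrative assumption as before, namely that the most significant bit of an op code word marks a double-word instruction, and it assumes that the second word of a double-word instruction may hold any bit pattern. The function names and word values are hypothetical. The code does not perform backward pre-decoding; it only lists the interpretations of the words below the PC that cannot be ruled out.

```python
DOUBLE_WORD_FLAG = 0x8000  # assumed encoding: MSB set means double-word


def instruction_length(word):
    return 2 if word & DOUBLE_WORD_FLAG else 1


def walk(words, start):
    """Decode forward from `start`; succeed only if the last instruction
    ends exactly where the known instruction at the PC begins."""
    pos, starts = start, []
    while pos < len(words):
        starts.append(pos)
        pos += instruction_length(words[pos])
    return starts if pos == len(words) else None


def backward_parses(words):
    """Return every consistent list of instruction start offsets for the
    words preceding the PC. Word 0 is either the first word of an
    instruction or the second word of a double-word instruction that
    began in the previous cache line."""
    candidates = [walk(words, s) for s in (0, 1) if s <= len(words)]
    return [c for c in candidates if c is not None]


if __name__ == "__main__":
    # Three invented words standing in for addresses 4000h-4002h.
    print(backward_parses([0x8001, 0x8002, 0x0003]))
```

For these three words, two interpretations survive: a double-word instruction at offset 0 followed by a single-word instruction at offset 2, or a double-word instruction at offset 1 preceded by the tail of an earlier instruction. Nothing in the words themselves distinguishes the two cases.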
Under certain circumstances, the inefficient use of the cache described above can result in repeated fetches of the same instruction block from memory. FIG. 4 illustrates a common type of instruction sequence known as a loop. Each of the boxes represents an instruction. To the right of each instruction is its address in memory, and to the left its offset relative to the program counter (PC). Note that all of the instructions are contained within the same instruction block, spanning the address range from 4000h–4007h, and that the PC offset of the requested instruction I1 20a is zero. In a typical loop, instructions I1 20a through I4 20e would be executed in sequence. After executing instruction I4, the microprocessor's program counter would be directed to instruction I-2 24, which might test a condition to determine whether to terminate or continue execution of the loop. If this loop were first entered at instruction I1 20a (e.g., by branching from elsewhere in the program), the instruction block containing instruction I1 would be fetched from memory and be pre-decoded before placing it into the cache. According to the forward pre-decoding process described above, the boundaries of instructions I2–I4 would be readily determined. Instructions I1–I4 would then be placed into the cache for immediate access by the pipeline of the microprocessor. Instruction I-2, however, while present in the same instruction block as I1–I4, could not be pre-decoded. Therefore, following execution of I4, the microprocessor would be forced to request I-2 from memory, whereupon the instruction block would be fetched a second time and the cache refilled.
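The repeated fetch in the loop scenario can be reproduced with a toy model of a one-line instruction cache. This is only a sketch under stated assumptions: validity is tracked per word, a miss discards the previous contents, and the addresses in LOOP_TRACE are hypothetical (I-2 is placed at 4001h, two locations below I1, and is assumed to be followed by a single-word instruction at 4002h). The full_line_valid option models a hypothetical pre-decoder able to validate the whole line.

```python
LINE_WORDS = 8  # cache line size in words, as in the running example


def count_block_fetches(trace, full_line_valid=False):
    """Count how many times an instruction block is fetched from memory
    while executing the instruction addresses in `trace`."""
    valid = set()   # word addresses currently marked valid in the cache
    fetches = 0
    for addr in trace:
        if addr not in valid:
            fetches += 1
            base = addr - (addr % LINE_WORDS)
            # Conventional pre-decoding validates only the requested
            # instruction and the words above it within the line.
            first = base if full_line_valid else addr
            valid = set(range(first, base + LINE_WORDS))
    return fetches


# Loop entered at I1 (4003h); after I4 the PC is directed to I-2.
LOOP_TRACE = [0x4003, 0x4004, 0x4005, 0x4007,   # I1..I4, first pass
              0x4001, 0x4002,                   # I-2 and the next word
              0x4003, 0x4004, 0x4005, 0x4007]   # I1..I4, second pass

if __name__ == "__main__":
    print(count_block_fetches(LOOP_TRACE))
    print(count_block_fetches(LOOP_TRACE, full_line_valid=True))
```

Under these assumptions the conventional scheme fetches the same block twice, whereas a scheme that could validate the entire line would fetch it once.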
Due to the frequent occurrence of loops and other similar structures in software programs, the scenario just described is believed to constitute a major source of inefficiency in the architecture of traditional microprocessors. In microprocessors employing variable length instructions, the inability to perform backward pre-decoding of an instruction block leads to underutilization of the cache. The impact on efficiency of cache underutilization will be software dependent, but it is estimated that cache misses could typically be reduced by 50% if it were possible to pre-decode the entire instruction block.
In view of this problem, it would be desirable to have a means to perform backward pre-decoding of an instruction block, so that as many of the instructions as possible could be delivered to the pipeline of the microprocessor. The method by which this is accomplished should preferably be suitable for implementation in the circuitry of a microprocessor. Furthermore, the method should be capable of high-speed operation, so as not to compromise the performance of the microprocessor.