1. Field of the Invention
This invention relates to microprocessor design, and more particularly to circuits and methods for improving instruction fetch time by determining mapping information relating one or more instructions prefetched from a higher-level memory device to corresponding predecoded instructions stored in a lower-level memory device.
2. Description of the Related Art
The following descriptions and examples are not admitted to be prior art by virtue of their inclusion within this section.
Over the years, the use of microprocessors has become increasingly widespread in a variety of applications. Today, microprocessors may be found not only in computers, but also in a vast array of other products such as VCRs, microwave ovens, and automobiles. In some applications, such as microwave ovens, low cost may be the driving factor in the implementation of the microprocessor. On the other hand, other applications may demand the highest performance obtainable. For example, modern telecommunication systems may require very high speed processing of multiple signals representing voice, video, data, etc. Processing of these signals, which have been densely combined to maximize the use of available communication channels, may be rather complex and time consuming. With an increase in consumer demand for wireless communication devices, such real-time signal processing demands not only high performance but also low cost. To meet the demands of emerging technologies, designers must constantly strive to increase microprocessor performance while maximizing efficiency and minimizing cost.
With respect to performance, greater overall microprocessor speed may be achieved by improving the speed of devices within the microprocessor's circuits, as well as through architectural developments that allow for optimal microprocessor performance and operation. As stated above, microprocessor speed may be extremely important in a variety of applications. As such, designers have evolved a number of speed-enhancing techniques and architectural features. Among these techniques and features are the instruction pipeline, the use of cache memory, and the concept of prefetching.
A pipeline consists of a sequence of stages through which instructions pass as they are executed. In a typical microprocessor, each instruction comprises an operator and one or more operands. Thus, execution of an instruction is actually a process requiring a plurality of steps. In a pipelined microprocessor, partial processing of an instruction may be performed at each stage of the pipeline. Likewise, partial processing may be performed concurrently on multiple instructions in all stages of the pipeline. In this manner, instructions advance through the pipeline in assembly line fashion to emerge from the pipeline at a rate of one instruction every clock cycle.
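The assembly line behavior described above can be illustrated with a short cycle count. The following Python sketch is only an idealized model, assuming every stage takes exactly one clock cycle and no stalls occur; the function names are hypothetical and are not taken from any particular microprocessor.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed to complete all instructions in an ideal, stall-free pipeline."""
    if num_instructions <= 0:
        return 0
    # The first instruction needs num_stages cycles to traverse the pipeline;
    # every later instruction emerges one cycle after its predecessor.
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes all steps before the next begins."""
    return max(num_instructions, 0) * num_stages


if __name__ == "__main__":
    # Hypothetical workload: 100 instructions on a 5-stage pipeline.
    print(pipelined_cycles(100, 5))    # 104
    print(unpipelined_cycles(100, 5))  # 500
```

Under these assumptions, once the pipeline has filled, one instruction completes per clock cycle, so the total approaches the number of instructions rather than the number of instructions multiplied by the number of stages.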
The advantage of the pipeline generally lies in performing each of the steps required to execute multiple instructions in a simultaneous manner. To operate efficiently, however, a pipeline must remain full. If the flow of instructions in the pipeline is disrupted, clock cycles may be wasted while the instructions within the pipeline are prevented from proceeding to the next processing step. Prior to execution, instructions, which are typically stored in a memory device, are fetched into the pipeline by the microprocessor. However, such memory devices are generally much slower than the operating speed of the microprocessor. As such, instruction flow through the pipeline may be impeded by the length of time required to fetch instructions from memory (i.e., memory latency).
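The effect of memory latency on instruction flow can be estimated with a simple model. The sketch below assumes, purely for illustration, that fetches are not overlapped, that each fetch occupies the first pipeline stage for a fixed number of cycles, and that every remaining stage takes one cycle; the function and parameter names are hypothetical.

```python
def cycles_with_fetch_latency(num_instructions: int,
                              num_stages: int,
                              fetch_latency: int) -> int:
    """Total cycles when the fetch stage takes fetch_latency cycles per instruction.

    Assumes one fetch at a time (no overlap) and one cycle for each later stage.
    """
    if num_instructions <= 0:
        return 0
    if num_stages < 1 or fetch_latency < 1:
        raise ValueError("num_stages and fetch_latency must be at least 1")
    # The last instruction leaves the fetch stage after
    # num_instructions * fetch_latency cycles, then passes through the
    # remaining (num_stages - 1) single-cycle stages.
    return num_instructions * fetch_latency + (num_stages - 1)


if __name__ == "__main__":
    # Hypothetical figures: 100 instructions, 5 stages.
    print(cycles_with_fetch_latency(100, 5, fetch_latency=1))  # 104
    print(cycles_with_fetch_latency(100, 5, fetch_latency=4))  # 404
```

In this model the fetch stage becomes the bottleneck as soon as its latency exceeds one cycle, and the later stages sit idle for the difference.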
An obvious approach to the above problem may be to simply use faster memory devices. Unfortunately, although faster memory devices may be available, they are typically more costly and may consume more power than conventional memory. For many applications, the use of high-speed memory devices throughout the entire memory hierarchy is infeasible. Thus, a more practical alternative for high performance microprocessors may be the use of cache memory.
Cache memory is a secondary memory resource that may be used in addition to the main memory, and generally consists of a limited amount of very high-speed memory devices. Since cache memory is typically small relative to the main memory, cost and power consumption of the cache may not be significant factors in some applications. However, factors such as cost and circuit dimension limitations may place constraints on cache size in other applications.
Cache memory may improve microprocessor performance whenever the majority of instructions required by the microprocessor are concentrated in a particular region of memory. The principle underlying the use of cache is that, more often than not, the microprocessor fetches instructions from the same area of memory. This principle follows from the sequential nature in which instructions are stored in memory. In other words, most instructions may be executed by the microprocessor in the sequence in which they are encountered in memory.
Assuming the majority of instructions required by the microprocessor are found in a given area of memory, the entire area may be copied (e.g., in a block transfer) to the cache. In this manner, the microprocessor may fetch the instructions as needed from the cache rather than from the main memory. Since cache memory is generally faster than main memory, the pipeline may be able to operate at full speed. Thus, the use of cache memory provides a dramatic improvement in average pipeline throughput. Such an improvement is achieved by providing the pipeline with faster access to instructions than would be possible by directly accessing the instruction from conventional memory. As long as the instructions are reasonably localized, the use of a cache significantly improves microprocessor performance.
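The throughput benefit described above is commonly estimated as an average access time: the cache hit time plus the miss rate multiplied by the penalty for going to main memory. The Python sketch below applies that general textbook relation; it is not a formula stated in this description, and the cycle counts used are hypothetical.

```python
def average_access_time(hit_time: float,
                        miss_rate: float,
                        miss_penalty: float) -> float:
    """Average cycles per access: hit_time + miss_rate * miss_penalty."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


if __name__ == "__main__":
    # Hypothetical figures: 1-cycle cache hit, 40-cycle main memory penalty.
    well_localized = average_access_time(1, 0.25, 40)
    poorly_localized = average_access_time(1, 0.75, 40)
    print(well_localized, poorly_localized)
```

The more localized the instruction stream, the lower the miss rate, and the closer the average access time comes to the hit time of the cache alone.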
To further improve access time to instructions, one or more levels of cache memory may also be included within the system. The number of cache memory devices may vary in a given memory hierarchy. Typically, the lowest level of cache (i.e. the first to be accessed) may be smaller and faster than the one or more levels above the lowest level in the memory hierarchy. When an instruction is called for execution, the memory address associated with the instruction is typically stored in the lowest level of cache for the fastest possible retrieval. The fastest possible operation may be called a “cache hit.” A cache hit may occur when the memory address corresponding to the instruction called for execution is stored in the level of cache indicated by the instruction address. If a cache hit occurs, the addressed information may be retrieved from the cache without having to access a higher (and often slower) level of memory in the memory hierarchy.
Conversely, a “cache miss” may occur when an instruction required by the microprocessor is not present in the level of cache indicated by the instruction address. In response to a cache miss, the next higher ordered memory structure may be presented with the instruction address. The next higher ordered memory structure may be another cache, such that another hit or miss may occur. If misses occur at each level of cache, the microprocessor may stall all further instruction executions. During the stall period, the microprocessor may discard the contents of the cache. Subsequently, the necessary information may be fetched as quickly as possible from main memory and placed into cache. Obviously, such a process may be a source of overhead, and if it becomes necessary to empty and refill the cache frequently, system performance may begin to approach that of a microprocessor without cache.
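The hit and miss sequence across a memory hierarchy can be sketched as a lookup that tries each level in turn. The Python example below is a simplified illustration: each cache level is modeled as a dictionary keyed by instruction address, capacity limits and eviction are not modeled, and the labels "L1" and "L2" as well as the function name are assumptions made for the example.

```python
def fetch(address, cache_levels, main_memory):
    """Search the cache levels from lowest (fastest) to highest, then main memory.

    Returns a (value, source) pair. On a hit at a higher level, or on a fetch
    from main memory, the value is copied into every level that missed.
    """
    for level, cache in enumerate(cache_levels):
        if address in cache:
            value = cache[address]
            # Fill the faster levels that missed on the way up.
            for lower in cache_levels[:level]:
                lower[address] = value
            return value, f"L{level + 1} hit"
    # Misses occurred at every cache level: fall back to main memory.
    value = main_memory[address]
    for cache in cache_levels:
        cache[address] = value
    return value, "main memory"


if __name__ == "__main__":
    l1, l2 = {}, {0x10: "add"}
    memory = {0x10: "add", 0x14: "sub"}
    print(fetch(0x10, [l1, l2], memory))  # hit in the second level
    print(fetch(0x10, [l1, l2], memory))  # now a hit in the first level
    print(fetch(0x14, [l1, l2], memory))  # misses everywhere, served by memory
```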
Another advancement in microprocessor technology relates to the concept of prefetching information, where such information may be either data or instructions. FIG. 1, for example, illustrates one embodiment of the prefetch concept. As illustrated in FIG. 1, prefetch unit 106 may request a block of information by transmitting one or more instruction addresses 110, via memory bus 108, to memory controller 116 of memory device 114. In some cases, memory device 114 may be an external memory device having a relatively high order in the memory hierarchy. Memory controller 116 may retrieve the block of information from memory space 118 and may transmit retrieved instructions 112, via memory bus 108, to processing unit 102. A processing unit, as described herein, is typically a microprocessor, but may alternatively encompass any circuitry adapted to execute instructions. Subsequently, instructions 112 may be written to a storage device lower in the memory hierarchy, such as a lower order level of cache memory device 104. Prefetching may allow the time spent retrieving the block of information to occur concurrently with other actions of processing unit 102. In this manner, when processing unit 102 requests the prefetched information, there may be little or no delay in having to fetch the information from a nearby cache.
As such, prefetching involves a speculative retrieval of information, where the information may be retrieved from a higher-level memory system, such as external memory device 114, and placed into a lower level memory system, such as cache memory device 104. Such a retrieval may be executed under the expectation that the retrieved information may be needed by the processing unit for an anticipated event at some point after the next successive clock cycle.
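The benefit of speculative retrieval can be illustrated by counting demand misses with and without a prefetcher. The Python sketch below assumes a simple next-block policy, in which every access also brings the following block into the cache; this policy, the block size, and the assumption that a prefetch completes before the block is needed are all hypothetical choices for the example and are not drawn from FIG. 1.

```python
BLOCK_SIZE = 4  # hypothetical number of instructions per block


def count_demand_misses(addresses, prefetch_next_block: bool) -> int:
    """Count accesses that miss the cache, optionally prefetching the next block."""
    cached_blocks = set()  # block numbers currently held in the cache
    misses = 0
    for address in addresses:
        block = address // BLOCK_SIZE
        if block not in cached_blocks:
            # Demand miss: the block must be fetched from the higher-level memory.
            misses += 1
            cached_blocks.add(block)
        if prefetch_next_block:
            # Speculatively retrieve the following block before it is requested.
            cached_blocks.add(block + 1)
    return misses


if __name__ == "__main__":
    sequential = range(32)  # 32 sequential instruction addresses, 8 blocks
    print(count_demand_misses(sequential, prefetch_next_block=False))  # 8
    print(count_demand_misses(sequential, prefetch_next_block=True))   # 1
```

For a purely sequential instruction stream, only the very first block is a demand miss in this idealized model; a real prefetcher would also have to account for retrieval time and for prefetched blocks that are never used.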
In some cases, processing unit 102 may include an internal, or on-chip, cache memory device 104, as illustrated in FIG. 1. An internal cache, often called a primary cache, may be built into the circuitry of the processing unit. Processing unit 102 may further include internal prefetch unit 106, such that the prefetch unit may be coupled to the internal cache memory device via an internal bus. In other cases, however, cache memory device 104 and prefetch unit 106 may be external devices coupled to processing unit 102 via an external bus (not shown). The advantages and disadvantages of including internal versus external devices are well known in the art; thus, only internal devices are illustrated in FIG. 1 for the purpose of simplicity.
Several types of prefetching are known in the art. The most common example of a prefetch may be performed in response to a load operation. A load may occur when the processing unit requests specific information to be retrieved, such that the processing unit may use the retrieved information. In another example, a store operation may prefetch a block of data, such that a portion of the block may be overwritten with current information. Another form of prefetching may occur for certain instructions, such as those involved with certain block or string-related operations. In particular, such instructions may involve strings having numerous words, such as a double word, or some other quantity. For example, the instruction may include a string having a double word, yet the microprocessor is capable of handling only one word (or another quantity) at a time. In such a case, the processing unit may fetch the first word of the string, while concurrently prefetching some or all of the remaining words of the string. Therefore, a prefetch may be performed so that the remaining words in the instruction are more readily accessible for processing after fetching the first word. Thus, prefetching instructions during a time in which the processing unit is occupied with other processing may increase the speed of the processing unit by ensuring the availability of subsequent instructions before the processing unit requests the instructions.
Though prefetching, according to the manners described above, provides the benefit of improved microprocessor performance, the present inventor has recognized various drawbacks resulting from such techniques. Therefore, discussion of such drawbacks is presented below along with various embodiments that reduce the effects of such drawbacks and improve upon the prior art.