1. Field of the Invention
This invention relates to microprocessor design, and more particularly to systems and methods for increasing microprocessor speed through efficient instruction prefetching.
2. Description of the Related Art
The following descriptions and examples are not admitted to be prior art by virtue of their inclusion within this section.
Over the years, the use of microprocessors has become increasingly widespread in a variety of applications. Today, microprocessors may be found not only in computers, but also in devices such as VCRs, microwave ovens, and automobiles. In some applications, such as microwave ovens, low cost may be the driving factor in the design of the microprocessor. On the other hand, other applications may demand the highest performance obtainable. For example, modern telecommunication systems may require very high speed processing of multiple signals representing voice, video, data, etc. Processing of these signals, which have been densely combined to maximize the use of available communication channels, may be rather complex and time consuming. With an increase in consumer demand for wireless communication devices, such real time signal processing requires not only high performance but also low cost. To meet the demands of emerging technologies, designers must constantly strive to increase microprocessor performance while maximizing efficiency and minimizing cost.
With respect to performance, greater overall microprocessor speed may be achieved by improving the speed of various related and unrelated microprocessor circuits and operations. As stated above, microprocessor speed may be extremely important in a variety of applications. As such, designers have evolved a number of speed-enhancing techniques and architectural features. Among these techniques and features may be the instruction pipeline, the use of cache memory, and the concept of prefetching.
A pipeline consists of a sequence of stages through which instructions pass as they are executed. In a typical microprocessor, each instruction comprises an operator and one or more operands. Thus, execution of an instruction is actually a process requiring a plurality of steps. In a pipelined microprocessor, partial processing of an instruction may be performed at each stage of the pipeline. Likewise, partial processing may be performed concurrently on multiple instructions in all stages of the pipeline. In this manner, instructions advance through the pipeline in assembly line fashion to emerge from the pipeline at a rate of one instruction every clock cycle.
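By way of illustration only, the assembly-line behavior described above may be expressed as a simple cycle count. The sketch below assumes an idealized pipeline that never stalls; the function names and the example figures (5 stages, 100 instructions) are hypothetical and are not taken from the foregoing description.

```python
def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles for an ideal, stall-free pipeline (an assumed model).

    The first instruction needs num_stages cycles to emerge; each
    later instruction emerges one cycle after the one before it.
    """
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction finishes all steps before the next starts."""
    return num_instructions * num_stages


if __name__ == "__main__":
    n, k = 100, 5  # hypothetical workload and pipeline depth
    print(pipelined_cycles(n, k), unpipelined_cycles(n, k))
```

Under these assumptions, the longer the pipeline remains full, the closer the average rate comes to one instruction per clock cycle.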
The advantage of the pipeline may lie in performing each of the steps required to execute an instruction in such a simultaneous manner. However, to operate efficiently, a pipeline must remain full. If the flow of instructions into and out of the pipeline is disrupted, clock cycles may be wasted while the instructions within the pipeline may be prevented from proceeding to the next processing step. Prior to execution, the instructions may typically be stored in a memory device, such that instructions may be fetched into the pipeline by the microprocessor. However, access times for such memory devices may generally be much slower than the operating speed of the microprocessor. As such, instruction flow through the pipeline may be impeded by the length of time required to fetch instructions from memory (i.e. memory latency).
An obvious approach to the above problem may seem to be simply to use faster memory devices. Unfortunately, although faster memory devices may be available, they are typically more costly and may consume more power than conventional memory. In view of these disadvantages, the use of high-speed memory devices throughout the entire memory hierarchy may be infeasible. Thus, a more practical alternative for high performance microprocessors may be the use of cache memory.
Cache memory is a secondary memory resource that may be used in addition to the main memory, and generally consists of a limited amount of very high-speed memory. Since cache memory may be small relative to the main memory, cost and power consumption of the cache may not be significant factors in some applications. However, factors such as cost and circuit dimension limitations may place constraints on cache size in other applications.
Cache memory may improve microprocessor performance whenever the majority of instructions required by the microprocessor may be concentrated in a particular region of memory. The principle underlying the use of cache may be that, more often than not, the microprocessor may fetch instructions from the same area of memory. Such a principle may be due to the sequential nature in which instructions may be stored in memory. In other words, most instructions may be executed by the microprocessor in the sequence in which they may be encountered in memory.
Assuming the majority of instructions required by the microprocessor may be found in a given area of memory, the entire area may be copied (e.g., in a block transfer) to the cache. In this manner, the microprocessor may fetch the instructions as needed from the cache rather than from the main memory. Since cache memory may be faster than main memory, the pipeline may be able to operate at full speed. Thus, cache memory may provide a dramatic improvement in average pipeline throughput. Such an improvement may be achieved by providing the pipeline with faster access to instructions than would be possible by directly accessing the instructions from conventional memory. As long as the instructions are reasonably localized, the use of a cache may significantly improve microprocessor performance.
To further improve access time to information, one or more levels of cache memory may also be included within the system. Typically, the lowest level of cache (i.e. the first to be accessed) may be smaller and faster than the one or more levels above the lowest level in the memory hierarchy. Also, the number of caches in a given memory hierarchy may vary. When an instruction is executed, the address associated with the instruction may typically be directed to the lowest level of cache. The fastest possible operation may be called a "cache hit." A cache hit may occur when the information corresponding to the instruction address may be stored in the level of cache indicated by the address. If a cache hit occurs, the addressed information may be retrieved from the cache without having to access a higher level of memory in the memory hierarchy. The higher ordered memory may be slower to access than the lower ordered cache memory.
Conversely, a "cache miss" may occur when an instruction required by the microprocessor is not present in the level of cache indicated by the instruction address. In response to a cache miss, the next higher ordered memory structure may be presented with the instruction address. The next higher ordered memory structure may be another cache, such that another hit or miss may occur. If misses occur at each level of cache, "starvation" of the microprocessor may occur. During starvation, the microprocessor may discard the contents of the cache. Subsequently, the necessary information may be fetched as quickly as possible from main memory and may be placed into cache. Obviously, this may be a source of overhead, and if it becomes necessary to empty and refill the cache frequently, system performance may begin to approach that of a microprocessor without cache.
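By way of illustration only, the hit and miss sequence described above may be sketched as a walk up the memory hierarchy. The level names, the stored addresses, and the latencies of 1, 4, and 20 cycles below are hypothetical, and the model assumes that each level probed adds its full access time to the total.

```python
def lookup(address, cache_levels, main_memory_latency):
    """Probe each cache level, lowest first, and return (cycles, source).

    cache_levels is a list of (name, stored_addresses, latency) tuples,
    ordered from the lowest level (first accessed) upward. This is an
    assumed model for illustration, not a description of real hardware.
    """
    cycles = 0
    for name, stored_addresses, latency in cache_levels:
        cycles += latency
        if address in stored_addresses:
            return cycles, name  # cache hit at this level
    # Misses at every level: fall back to main memory.
    return cycles + main_memory_latency, "main memory"


if __name__ == "__main__":
    levels = [("L1", {0x10}, 1), ("L2", {0x10, 0x20}, 4)]  # hypothetical
    for addr in (0x10, 0x20, 0x30):
        print(hex(addr), lookup(addr, levels, main_memory_latency=20))
```

In this sketch, an address that misses at every level costs the sum of all probe times plus the main memory access, which is the overhead the passage above associates with starvation.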
Another advancement in microprocessor technology relates to the concept of prefetching information, where such information may either be data or instructions. A block of information may be prefetched from a storage device, which may be at a relatively high order in the memory hierarchy, and written to a storage device lower in the memory hierarchy, such as a lower order cache. Prefetching may allow the time spent to retrieve such information to occur concurrently with other actions of the microprocessor. In this manner, when the microprocessor requests the prefetched information, there may be little or no delay in having to fetch the information from a nearby cache. Thus, prefetching may reduce or eliminate some of the cache miss penalty by having such a miss occur while the microprocessor may be occupied with other processing.
Prefetching may involve a speculative retrieval of information, where the information may be retrieved from a higher level memory system, such as an external memory, and may be placed into a lower level memory system, such as a cache memory device. Such a retrieval may be executed under the expectation that the retrieved information may be needed by the microprocessor for an anticipated event at some point after the next successive clock cycle.
Several types of prefetching are known in the art. The most common instance of a prefetch may be performed in response to a load operation. A load may occur when the microprocessor requests specific information to be retrieved, such that the retrieved information may be used by the microprocessor. However, a store operation may prefetch a block of data, such that a portion of the block may be overwritten with current information. Still another form of prefetching may occur for certain instructions, such as those involved with certain block or string-related operations. In particular, such instructions may involve strings having numerous words, such as a double-word, or some other quantity. For example, the instruction may include a string having a double-word, yet the microprocessor is capable of handling only one word (or other quantity) at a time. In such a case, the microprocessor may fetch the first word of the string, while concurrently prefetching some or all of the remaining words of the string. Therefore, a prefetch may be performed so that the remaining words in the instruction are more readily accessible for processing after fetching the first word. Thus, prefetching instructions during such a time that the microprocessor may be occupied with other processing may increase the speed of the microprocessor by ensuring the availability of subsequent instructions before such instructions may be actually requested by the microprocessor.
Though prefetching, according to the manners described above, provides the benefit of improved microprocessor performance, the present inventors have recognized various drawbacks resulting from such techniques. Therefore, discussion of such drawbacks is presented below along with various embodiments that reduce the effects of such drawbacks and improve upon the prior art.
Instruction prefetching may be relatively straightforward in the case of linear programming code (i.e. code without discontinuities or branches). For example, the microprocessor may request an address of an instruction. Once the first cacheline of instructions arrives from the cache memory device, the microprocessor may begin executing the instruction, and at the same time, the prefetch unit may prefetch the next cacheline of instructions. If a cache miss occurs, the execution unit of the microprocessor may temporarily halt execution while the desired instruction may be fetched from main memory and placed into cache memory. In such a case, the occurrence of a cache miss may cause "starvation" of the microprocessor during the time it takes to fetch the missing instructions from memory. In addition, if the prefetch unit continues to fetch instructions during this time, the instructions may be fetched faster than the microprocessor may consume them. Therefore, it may be possible for the prefetch unit to "outlap" the microprocessor and overwrite instructions in the cache that have not yet been read by the microprocessor. Such a case may be avoided by at least temporarily halting the prefetch unit until the microprocessor may read the desired instructions.
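By way of illustration only, the halting condition described above for linear code may be sketched as a guard on a circular buffer. Modeling the cache as a ring of cachelines is an assumption made here for clarity, and the function and parameter names are hypothetical.

```python
def can_prefetch(lines_written: int, lines_read: int, cache_lines: int) -> bool:
    """Return True if one more cacheline may be prefetched safely.

    lines_written and lines_read are running totals since start-up.
    Writing another line is safe only while the number of unread lines
    is below the cache capacity; otherwise the prefetch unit would
    overwrite a line the microprocessor has not yet read.
    """
    unread_lines = lines_written - lines_read
    return unread_lines < cache_lines


if __name__ == "__main__":
    # Hypothetical 4-line cache: full of unread lines, so prefetching halts.
    print(can_prefetch(lines_written=4, lines_read=0, cache_lines=4))
    # One line has been consumed, so prefetching may resume.
    print(can_prefetch(lines_written=4, lines_read=1, cache_lines=4))
```

A guard of this kind treats a line as reusable once it has been read, which holds only for linear code.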
However, an internal (i.e. on chip) cache is almost useless in linear code, since cache may only be needed when the instructions in the cache may be used repeatedly. Such repetitive use may be present, for example, in non-linear program code, which may imply that a backwards discontinuity may occur at some point. A very useful backwards discontinuity construct is the loop construct, which may execute a block of instructions a specified number of times. Since loop counts may be very large, it is judicious to avoid prefetching until a loop may be completed to avoid overwriting instructions in the loop that may still be needed. However, merely halting the prefetch unit until the microprocessor may read the current instruction (as described above in the case of linear code) may not be applicable in the case of non-linear code. For instance, even though a block of instructions contained within the loop may have been read, and may therefore appear to be fair game to be overwritten, the microprocessor may still need the loop instructions on a successive iteration of the loop.
As stated above, halting the prefetch unit may solve the problem of overwriting data not yet read by the microprocessor. For example, halting the prefetch unit may occur simply whenever the "Agn" statement is encountered in an assembly language program. However, resuming the prefetch process before a loop is terminated, such that the instructions immediately following the end of the loop are available as soon as the microprocessor exits the loop, may be a problem not addressed in current methods. Since current methods may not take into consideration the period of a loop iteration, the current methods cannot resume prefetching instructions precisely m cycles before the microprocessor exits a loop (i.e. where m may be the memory latency, or the number of clock cycles needed to fetch an instruction from memory). In this manner, though halting the prefetch unit may solve the problem of overwriting unread data, it may not solve the problem of microprocessor starvation.
Also stated above, starvation is said to occur whenever the microprocessor must wait to fetch an instruction from memory. In high-performance, real-time algorithms such as those used for signal processing applications, the impact of starvation may significantly reduce microprocessor performance. As such, current methods may attempt to solve the problem of starvation in one of two ways. In one embodiment, a current method may simply accept the m cycle penalty for exiting a loop. The disadvantage of accepting the m cycle penalty may surface when the execution time of a loop is not much longer than the memory latency (m). In this case, the effects of starvation may significantly affect microprocessor performance. For instance, if it takes 20 cycles to execute a loop and 4 cycles to fetch instructions from memory, there may be a 20% decrease in performance due to the effects of starvation.
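By way of illustration only, the arithmetic behind the 20-cycle and 4-cycle example above may be written out as follows. The function names are hypothetical. The first function expresses the penalty relative to the loop execution time, which yields the 20% figure; the second expresses the same penalty as a share of total elapsed time, which is an alternative way of measuring it and is not stated in the text above.

```python
def penalty_relative_to_loop(loop_cycles: int, memory_latency: int) -> float:
    """Starvation penalty as a fraction of the loop's own execution time."""
    return memory_latency / loop_cycles


def penalty_share_of_total(loop_cycles: int, memory_latency: int) -> float:
    """Starvation penalty as a fraction of loop time plus stall time."""
    return memory_latency / (loop_cycles + memory_latency)


if __name__ == "__main__":
    loop_cycles, m = 20, 4  # figures from the example above
    print(f"{penalty_relative_to_loop(loop_cycles, m):.1%}")
    print(f"{penalty_share_of_total(loop_cycles, m):.1%}")
```

Either way, the penalty grows as the loop execution time approaches the memory latency m.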
In another embodiment, a current method may increase the size of the cache memory by a factor of m. In this manner, the increased size of the cache may allow a current method to prefetch the next m number of cache lines arranged after a loop, before prefetching may be halted during execution of the loop. As such, once the microprocessor exits the loop, there may be sufficient time to resume prefetching at cacheline m+1 without causing starvation. However, the disadvantages of increasing the size of the cache include increasing area as well as increasing cost.
The above outlined problems may be in large part addressed by an efficient method of instruction prefetching. Thus, the present invention contemplates a system and method for efficient instruction prefetching based on the termination of loops. The method for controlling a prefetch unit may include detecting when a set of instructions within a last iteration of a loop may be encountered. Prior to such detection, the method may disable the prefetch unit when a last instruction within a first iteration of the loop may be detected. The method for controlling the prefetch unit may further include counting a plurality of clock cycles needed to fetch the set of instructions within the loop by the prefetch unit. Such a number of clock cycles may be designated as the loop period. In addition, the method may enable the prefetch unit when a remaining number of clock cycles substantially equals the memory latency (m, or the number of clock cycles needed to access a memory device).
A computer system may also be contemplated herein. The computer system may include, but may not be limited to, a semiconductor memory device, a cache memory device and a prefetch unit. The system may also include a memory bus to couple the semiconductor memory device to the prefetch unit. In one embodiment, the prefetch unit may be included as an internal component of the microprocessor. In another embodiment, the prefetch unit may be included as an external component coupled to the microprocessor. The memory bus may be adapted to receive a sequence of instructions, which may include a first set of instructions followed by a second set of instructions, and may further receive/transmit a plurality of instruction sets. Furthermore, the computer system may also include a circuit coupled to the memory bus. The circuit may detect a branch instruction within the sequence of instructions, such that the branch instruction may target a loop construct. Such a loop construct may include a single set of instructions, or may also include a plurality of sets of instructions, such that the plurality of sets may be formed within one another and may create "nested" loops.
A circuit for controlling a prefetch unit may further be contemplated herein. The circuit may include a detector coupled to detect a loop within a sequence of instructions and to signal a loop count each time a set of instructions within the loop may be encountered. The circuit may also include a period counter coupled to the detector. The period counter may count a number of clock cycles associated with the set of instructions within a loop construct. In addition, a countdown timer may be coupled to the period counter, such that the countdown timer may be enabled to count down the number of clock cycles remaining in a loop. The period counter may initiate the enabling of the countdown timer, such that the period counter may indicate that a single iteration of the loop remains. Furthermore, the circuit may include a programmable logic component coupled to the countdown timer. The countdown timer may note when the number of clock cycles remaining in a loop may substantially equal the number of clock cycles needed to access a memory device. At such a time that the value of the countdown timer may substantially equal the memory latency value of the memory device, the countdown timer may signal the programmable logic component to enable the prefetch unit. In this manner, prefetching may be resumed m number of clock cycles before the microprocessor may exit the loop. As such, the m number of instructions arranged subsequent to the loop may be readily available for execution as soon as the microprocessor exits the loop.
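By way of illustration only, the interplay of the period counter and the countdown timer described above may be sketched as a cycle-by-cycle software simulation. This is a simplified model under stated assumptions, not the circuit itself: the loop count is taken as known and at least 2, every iteration is assumed to take the same number of cycles, "substantially equals" is approximated by a less-than-or-equal comparison, and all names are hypothetical.

```python
def simulate_loop(loop_period: int, loop_count: int, memory_latency: int):
    """Return (resume_cycle, exit_cycle), both counted from loop entry.

    resume_cycle is the cycle at which prefetching is re-enabled;
    exit_cycle is the cycle at which the loop finishes.
    """
    if loop_period < 1 or loop_count < 2 or memory_latency < 1:
        raise ValueError("sketch assumes period >= 1, count >= 2, latency >= 1")

    prefetch_enabled = True
    measured_period = 0   # period counter: cycles seen in the first iteration
    countdown = None      # countdown timer, armed for the last iteration only
    resume_cycle = None
    cycle = 0

    for iteration in range(1, loop_count + 1):
        if iteration == loop_count:
            # A single iteration remains: arm the countdown timer.
            countdown = measured_period
        for _ in range(loop_period):
            if iteration == 1:
                measured_period += 1
            if countdown is not None:
                if not prefetch_enabled and countdown <= memory_latency:
                    prefetch_enabled = True  # resume m cycles before exit
                    resume_cycle = cycle
                countdown -= 1
            cycle += 1
        if iteration == 1:
            # Last instruction of the first iteration: halt prefetching.
            prefetch_enabled = False

    return resume_cycle, cycle


if __name__ == "__main__":
    resume, exit_cycle = simulate_loop(loop_period=20, loop_count=3, memory_latency=4)
    print(resume, exit_cycle, exit_cycle - resume)
```

In this sketch, the gap between the resume cycle and the exit cycle equals the memory latency m whenever m does not exceed the loop period; if m is larger, the model simply resumes at the start of the last iteration, which is a simplification introduced here.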