1. Technical Field
The present invention generally relates to the management of processor instructions. More particularly, the invention relates to the selective bypassing of a trace cache build engine for enhanced performance.
2. Discussion
In the highly competitive computer industry, the trend toward faster processing speeds and increased functionality is well documented. While this trend is desirable to the consumer, it presents significant challenges to processor designers and manufacturers alike. A particular area of concern relates to the management of processor instructions. In modern-day processor architectures, a back end allocation module executes decoded operations, typically termed micro-operations (μops), in order to implement the various features and functions called for in the program code. The front end of the processor architecture provides the μops to the allocation module, in what is often referred to as an instruction or operation pipeline. Generally, it is desirable to ensure that the front end pipeline remains as full as possible in order to optimize the processing time of the back end allocation module. As the processing speed of the allocation module increases, however, keeping the pipeline full becomes more difficult. As a result, a number of instruction management techniques have evolved in recent years.
FIG. 1 illustrates one such approach to managing processor instructions that involves the use of a trace cache 20. Encoded instructions 32 are provided to a decoder 22, which decodes the instructions 32 into basic μops 34 that the execution core in the back end allocation module 24 is able to execute. Since decoding has often been found to be a bottleneck in the execution of instructions, one conventional approach has been to effectively recycle the retired μops 34′ so that decoding is not always necessary. Thus, the retired μops 34′ are sent to a build engine 26 in order to create trace data 36. The building of trace data 36 essentially involves the use of branch prediction logic and knowledge of past program execution to speculate where the program is going to execute next. Trace-based instruction caching is described in a number of sources such as U.S. Pat. No. 6,170,038 to Krick, et al. The trace data 36 is written into the trace cache 20. The trace cache 20 is preferred over the decoder 22 as a source of instructions due to the above-described bottleneck concerns. For example, the time required to read from the decoder 22 is often on the order of four times the time required to read from the trace cache 20. Thus, the back end allocation module 24 typically searches for a given μop in the trace cache 20 first, and resorts to the decoder 22 when the μop is not found in the trace cache 20 (i.e., a trace cache miss occurs). The difficulty with the above-described “build-at-retirement” approach is that loops in the program code may not be detected by the build engine 26 until after the point at which their detection would have been useful.
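The lookup order described above (trace cache first, decoder on a miss, trace data built only at retirement) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuit of FIG. 1: the class name `BuildAtRetirementFrontEnd`, its methods, the dictionary-based cache, and the cost constants are all hypothetical, and the 4:1 cost ratio simply mirrors the "on the order of four times" figure mentioned in the text.

```python
# Hypothetical software model of the build-at-retirement flow.
# All names and cost values are illustrative assumptions.
TRACE_CACHE_READ_COST = 1  # arbitrary time unit
DECODER_READ_COST = 4      # assumed to be about four times the trace cache cost


class BuildAtRetirementFrontEnd:
    def __init__(self, program):
        # program: dict mapping an instruction pointer to an encoded instruction
        self.program = program
        self.trace_cache = {}  # instruction pointer -> decoded uop
        self.elapsed = 0       # accumulated read cost

    def decode(self, ip):
        # Stand-in for the decoder: wrap the encoded instruction as a "uop".
        return f"uop({self.program[ip]})"

    def fetch(self, ip):
        """Return (uop, source), trying the trace cache before the decoder."""
        if ip in self.trace_cache:
            self.elapsed += TRACE_CACHE_READ_COST
            return self.trace_cache[ip], "trace_cache"
        # Trace cache miss: fall back to the slower decoder.
        self.elapsed += DECODER_READ_COST
        return self.decode(ip), "decoder"

    def retire(self, ip, uop):
        """Stand-in for the build engine: cache a uop only after it retires."""
        self.trace_cache[ip] = uop
```

In this model, the first fetch of any instruction pointer always pays the decoder cost, and the trace cache only helps once `retire` has been called for that pointer. A short loop whose iterations are fetched before its first iteration retires would therefore miss on every pass, which is the weakness of building at retirement noted above.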
FIG. 2 illustrates another conventional approach that addresses the concerns of building at retirement, but also leaves considerable room for improvement. Under this approach, the decoded μops 34 are sent directly to a build engine 28 whose controller 29 decides whether to send the trace data directly to the allocation module 24 or to the trace cache 20. Thus, when the controller 29 determines that a trace cache miss has occurred, the trace data 36′ can be sent directly to the allocation module 24 in order to reduce latency. The allocation module 24 can therefore be viewed as being switched from a trace cache reading state into a build engine reading state. As trace data 36′ is sent to the allocation module 24, the controller 29 can use address line 30 to determine whether it is safe to return to the trace cache reading state. Specifically, as μops 34 come into the build engine 28, the controller 29 can search the trace cache 20 for the linear instruction pointer (IP) corresponding to each μop. When a match is found, the controller 29 can re-authorize the transfer of trace data 36 from the trace cache 20 to the allocation module 24. While this approach significantly helps with regard to the detection of program loops, certain difficulties remain. For example, the latency associated with the build engine 28 is part of the μop pipeline regardless of whether the trace cache 20 is being written to. Indeed, the build engine latency can become critical as build heuristics become more advanced.
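The switching behavior of the controller 29 can be viewed as a two-state machine, sketched below. The sketch rests on assumptions: the names `ReadState`, `BuildController`, and `step` are hypothetical, the trace cache is reduced to a fixed set of linear IPs, and writes to the trace cache during building are deliberately left out so that only the state transitions are shown.

```python
from enum import Enum


class ReadState(Enum):
    TRACE_CACHE = "trace_cache"    # allocation module reads from the trace cache
    BUILD_ENGINE = "build_engine"  # allocation module reads from the build engine


class BuildController:
    """Hypothetical model of the controller's state switching."""

    def __init__(self, cached_ips):
        # Assumption: the trace cache is modeled as a fixed set of linear IPs.
        self.cached_ips = set(cached_ips)
        self.state = ReadState.TRACE_CACHE

    def step(self, ip):
        """Process one uop's linear IP and return the state that supplies it."""
        hit = ip in self.cached_ips
        if self.state is ReadState.TRACE_CACHE and not hit:
            # Trace cache miss: send trace data straight to the allocation module.
            self.state = ReadState.BUILD_ENGINE
        elif self.state is ReadState.BUILD_ENGINE and hit:
            # Matching IP found in the trace cache: safe to resume reading from it.
            self.state = ReadState.TRACE_CACHE
        return self.state
```

For example, with linear IPs 0x10 and 0x30 cached, stepping through 0x10, 0x20, 0x24, 0x30 moves the model from the trace cache state to the build engine state at 0x20, keeps it there at 0x24, and returns it to the trace cache state at 0x30. The model does not capture the latency concern raised above; in the arrangement of FIG. 2 every μop passes through the build engine 28 in either state.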