Processors have evolved over recent decades, becoming smaller in size, more sophisticated in design, and faster in performance. This evolution has occurred for numerous reasons, one of which is the portability of systems incorporating processors. Portability places demands on processors such as smaller size, reduced power consumption, and efficient performance.
A processor (such as a microprocessor) processes instructions according to an instruction set architecture. The processing comprises fetching, decoding, and executing the instructions. Some instruction set architectures define a programming model where fetching, decoding, executing, and any other functions for processing an instruction are apparently performed in strict order, beginning after the functions for all prior instructions have completed, and completing before any functions of a successor instruction have begun. Such an instruction set architecture provides a programming model where instructions are executed in program order.
Some processors process instructions in various combinations of overlapped (or non-overlapped), parallel (or serial), and speculative (or non-speculative) manners using, for example, pipelined functional units, superscalar issue, and out-of-order execution. Thus, some processors are enabled to execute instructions and access memory in an order that differs from the program order of the programming model. Nevertheless, the processors are constrained to produce results consistent with results that would be produced by processing instructions entirely in program order.
Processors are used, for example, in personal computers (PCs), workstations, networking equipment, and portable devices. Examples of portable devices include laptops, which are portable PCs, and hand-held devices.
Because hardware and software applications that depend on the x86 instruction set, the x87 instruction set, or both are in wide use, particularly among software programmers who are well accustomed to this code and unlikely to adapt readily to another, backward compatibility of code is key in the architecture of a new processor. That is, the user of a newly designed processor must be able to use the same code used with a previous processor design without experiencing any problems.
In trace-based processor architectures, different trace types are used to significantly optimize execution by the back end, or execution unit, of the processor. Traces are generally built by the front end or trace unit (or instruction processing unit) of a processor. A trace includes one or more sequences of operations, where each operation corresponds to an instruction, or a number of operations correspond to the same instruction.
Different types of traces might include a decoder trace, a basic block trace, a multi-block trace, or a microcode trace. A multi-block trace is made of one or more basic block traces, one or more multi-block traces, or a combination thereof. A microcode trace is used when, for example, a sequence of instructions is either complex or rare. U.S. patent application Ser. No. 11/781,937, entitled “A Trace Unit with a Decoder, A Basic Block Builder, and A Multi-Block Builder” and filed on Jul. 23, 2007, the disclosure of which is incorporated herein by reference as though set forth in full, presents further details of such traces.
A trace, in some trace-based architectures, includes operations that do not necessarily correspond to instructions in the instructions' original program order. That is, knowledge of the original program order of the instructions is lost in a trace. Moreover, an instruction may result in multiple operations. Additionally, there is no clear instruction boundary.
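The trace structure described above can be pictured with a minimal sketch. The instruction names, the expansion table, and the helper functions below are hypothetical and chosen only for illustration; they do not represent the encoding used by any particular trace unit. The sketch shows one instruction expanding into several operations, and a multi-block trace assembled from basic block traces as a flat sequence with no instruction boundaries.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical table mapping an instruction to the operations it expands into.
# Real instruction sets and operation encodings will differ.
EXPANSION = {
    "mov_reg_reg": ["mov"],
    "add_mem_reg": ["load", "add", "store"],  # one instruction, three operations
    "jnz": ["branch"],
}

@dataclass
class Trace:
    kind: str                       # e.g. "basic block" or "multi-block"
    ops: List[str] = field(default_factory=list)

def build_basic_block_trace(instructions):
    """Expand each instruction into its operations and append them in sequence."""
    ops = []
    for instr in instructions:
        ops.extend(EXPANSION[instr])
    return Trace("basic block", ops)

def build_multi_block_trace(*parts):
    """Combine basic block and/or multi-block traces into one flat sequence."""
    ops = [op for part in parts for op in part.ops]
    return Trace("multi-block", ops)

bb1 = build_basic_block_trace(["mov_reg_reg", "add_mem_reg", "jnz"])
bb2 = build_basic_block_trace(["add_mem_reg"])
mb = build_multi_block_trace(bb1, bb2)
print(mb.kind, mb.ops)
```

Because the resulting trace stores only a flat list of operations, nothing in it marks where one instruction ends and the next begins.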
During the 1980s, it became common for processor architectures to use multiple pipelines executing operations in parallel to reduce clocks per executed instruction (CPI). Such parallel pipelined execution of operations is characteristic of superscalar-based processor architectures.
Computer processor performance is determined by the number of instructions executed, the CPI, and the processor clock cycle time. These factors are used to calculate execution time according to the following equation:

Execution Time = (Number of Instructions) * (CPI) * (Clock Cycle Time)

Performance is inversely proportional to execution time. Cycle time has been reduced by improved processor design.
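As a worked illustration of the execution time equation, the following minimal sketch compares two hypothetical configurations that differ only in CPI. The instruction count, CPI values, and clock cycle time are assumed figures chosen for the example, not measurements of any processor.

```python
def execution_time(num_instructions, cpi, clock_cycle_time_s):
    """Execution Time = (Number of Instructions) * (CPI) * (Clock Cycle Time)."""
    return num_instructions * cpi * clock_cycle_time_s

# Assumed values: one million instructions and a 1 ns clock cycle (1 GHz).
baseline = execution_time(1_000_000, cpi=2.0, clock_cycle_time_s=1e-9)
improved = execution_time(1_000_000, cpi=1.0, clock_cycle_time_s=1e-9)

# Performance is inversely proportional to execution time, so the ratio of
# execution times gives the relative performance of the two configurations.
speedup = baseline / improved
print(f"baseline={baseline:.6f}s improved={improved:.6f}s speedup={speedup:.2f}x")
```

With the instruction count and cycle time held fixed, halving the CPI halves the execution time.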
A compiler can reduce the number of operations executed by eliminating operations that are unnecessary along an identified trace. In some usage scenarios, the effectiveness of trace scheduling is related to the accuracy of the compiler in predicting branches within the trace. Computers with Very Long Instruction Words (VLIW) that could express dozens of parallel operations were developed to exploit trace scheduling.
Compilers can increase the Instruction Level Parallelism (ILP) of a trace by eliminating or scheduling operation dependencies to minimize pipeline delays in superscalar-based processor architectures. Such architectures allow a number of operations to execute substantially simultaneously. In general, the number of operations that can be executed in parallel is limited by control, data, and resource dependencies between operations. A control dependency occurs when the execution of an operation depends on the resolution of a previous branch instruction. A data dependency occurs when an operation has a source or destination operand that depends on the result of another operation. A resource dependency occurs when two or more operations require the same hardware resource. Exploiting ILP allows multiple operations that would otherwise have to be executed in series to execute simultaneously. Thus, it is desirable to reduce the number of clock cycles required for executing operations, thereby increasing performance.
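The way data and resource dependencies limit parallel execution can be illustrated with a minimal scheduling sketch. It is a simplified model under stated assumptions, not a description of any actual scheduler: every operation takes one cycle, each operation writes a distinct register (so only dependencies on a producer's result arise), the issue width stands in for the shared hardware resource, and control dependencies are not modeled. The `Op` type and `schedule` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Op:
    name: str
    dest: str                # register written by this operation
    srcs: Tuple[str, ...]    # registers read by this operation

def schedule(ops, issue_width):
    """Place each op, in program order, in the earliest cycle that satisfies
    its data dependencies and still has a free issue slot."""
    ready_cycle = {}   # register -> first cycle in which its value is available
    slots_used = {}    # cycle -> number of ops already issued in that cycle
    placement = {}     # op name -> assigned cycle
    for op in ops:
        # Data dependency: wait until every source operand has been produced.
        cycle = max((ready_cycle.get(s, 0) for s in op.srcs), default=0)
        # Resource dependency: at most issue_width ops may issue per cycle.
        while slots_used.get(cycle, 0) >= issue_width:
            cycle += 1
        slots_used[cycle] = slots_used.get(cycle, 0) + 1
        placement[op.name] = cycle
        ready_cycle[op.dest] = cycle + 1   # assumed one-cycle latency
    return placement

ops = [
    Op("a", "r1", ("r0",)),
    Op("b", "r2", ("r1",)),        # depends on a
    Op("c", "r3", ("r0",)),        # independent of a and b
    Op("d", "r4", ("r2", "r3")),   # depends on b and c
]

wide = schedule(ops, issue_width=2)
narrow = schedule(ops, issue_width=1)
print("2-wide:", wide, "cycles:", max(wide.values()) + 1)
print("1-wide:", narrow, "cycles:", max(narrow.values()) + 1)
```

In this example the independent operation `c` issues alongside `a` when two issue slots are available, so the four operations finish in three cycles instead of four. The chain `a`, `b`, `d` still sets a lower bound of three cycles regardless of issue width.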
Conventional processors have a front end that is not a hardware-based compiler for building and optimizing traces. The back end of a processor would benefit from a faster and more efficient hardware-based compiler when executing traces. While techniques for software compilation are known, such software techniques consume computing resources that would otherwise be available to execute a user application.
The potential performance benefits of hardware trace optimization have been explored; however, no practical structures and techniques for hardware trace optimization in a high-performance microprocessor are known.
In light of the foregoing, there is a need for a processor for performing hardware trace optimization.