1. Field of the Invention
The present invention relates to a computer processing system and particularly to an instruction schedule cache, which is a type of instruction cache that holds schedules of instructions for future execution, such as out-of-order execution, multi-threaded execution (simultaneous or non-simultaneous), in-order execution, VLIW execution, data-flow execution, etc.
2. Description of Background
Before our invention, a typical processor utilized register renaming to break false data dependences (anti- and output dependences); employed multiple-instruction fetch, decode, rename, dispatch, and issue; implemented pipelining and concurrent execution of instructions in multiple units; required a mechanism to recover program-order semantics when applying changes to the processor state after instructions finish execution; and provided for the simultaneous retirement of multiple instructions in a processor cycle, in which the architected state of the machine was modified to reflect the status of the computation. All these features, and more, lead to designs that are complex in nature. The inherent complexity of the design, combined with the limitations of implementation technologies (silicon-based, for example), leads to designs that are hard to operate correctly at higher operating frequencies. Even if a processor could operate correctly at high frequencies, the required supply voltage would be higher, and/or the total amount of logic would be larger, which raises yet another problem: high power consumption, and the related problem of efficient heat dissipation.
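By way of illustration only, the effect of register renaming on false dependences can be sketched as follows. The function name `rename`, the `(dest, sources)` instruction encoding, and the unbounded supply of physical registers `p0`, `p1`, ... are assumptions made for this sketch; a real processor draws from a finite pool and reclaims registers at retirement.

```python
def rename(instructions):
    """Rename architectural registers to fresh physical registers.

    instructions: list of (dest, [sources]) in program order.
    Assumption for this sketch: physical registers are unlimited.
    """
    mapping = {}      # architectural name -> current physical name
    next_phys = 0
    renamed = []
    for dest, sources in instructions:
        # Sources read the physical register that currently holds the value;
        # a register not yet written in this window keeps its own name.
        phys_sources = [mapping.get(s, s) for s in sources]
        # Every write gets a new physical register, so a later write to the
        # same architectural register no longer conflicts with earlier
        # readers (anti-dependence) or earlier writers (output dependence).
        phys_dest = f"p{next_phys}"
        next_phys += 1
        mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_sources))
    return renamed


program = [
    ("r1", ["r2", "r3"]),  # r1 = r2 op r3
    ("r4", ["r1"]),        # true dependence on the first instruction
    ("r1", ["r5"]),        # output dependence on 1st, anti-dependence on 2nd
]
renamed_program = rename(program)
```

After renaming, the third instruction writes `p2` rather than reusing the register written by the first, so only the true dependence between the first two instructions remains.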
Applications vary in the amount of fine-grained, instruction-level parallelism (ILP) they possess. Some applications possess almost no ILP: the bulk of their computation occurs within a tight data-dependence chain. Such applications can often run very efficiently in high-throughput modes, in which a high processor frequency is combined with deeper pipelines, but almost no hardware support is provided to find and expose instruction-level parallelism. Other workloads are ILP workloads, i.e., within a single thread of execution there is inherently a higher amount of parallelism available for the hardware to exploit. Such applications are best suited to processor designs that may have a somewhat lower frequency but exhibit a higher degree of superscalarity, thus carrying out more (parallel) work per cycle.
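The distinction between the two kinds of workload can be made concrete with a rough estimate: the number of instructions divided by the length of the longest chain of true data dependences. The following is a minimal sketch under simplifying assumptions that are not part of the description above: every instruction has unit latency, only true (read-after-write) dependences constrain ordering, and hardware resources are unlimited. The name `ideal_ilp` and the `(dest, sources)` encoding are hypothetical.

```python
def ideal_ilp(instructions):
    """Estimate ILP as instruction count / critical-path length.

    instructions: list of (dest, [sources]) in program order.
    Assumptions: unit latency, true dependences only, unlimited resources.
    """
    if not instructions:
        return 0.0
    depth_of = {}     # register -> depth at which its latest value is ready
    critical_path = 0
    for dest, sources in instructions:
        # An instruction can start once all of its source values are ready;
        # sources never written in this window are ready at depth 0.
        depth = 1 + max((depth_of.get(s, 0) for s in sources), default=0)
        depth_of[dest] = depth
        critical_path = max(critical_path, depth)
    return len(instructions) / critical_path


# A tight data-dependence chain: each instruction needs the previous result.
chain = [("r1", ["r1"]) for _ in range(4)]

# Four mutually independent instructions.
independent = [("r1", ["r5"]), ("r2", ["r6"]), ("r3", ["r7"]), ("r4", ["r8"])]

chain_ilp = ideal_ilp(chain)              # 1.0: no parallelism to exploit
independent_ilp = ideal_ilp(independent)  # 4.0: all four could issue together
```

Under these assumptions the chain yields an estimate of 1.0, so additional issue width would go unused, whereas the independent sequence yields 4.0 and would benefit from a wider superscalar design.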
Most present processor designs target one or the other type of application. Either the processor is a large, complex design that does an excellent job of extracting ILP when it is available, or it is a design that excels at efficiently running a tight data-dependence chain application by virtue of being a simple pipeline that balances the latencies of execution with the latencies of memory access. The ILP-focused designs, however, also try to extract ILP when it is not available, e.g., in tight data-dependence chains. The effort spent trying to extract and exploit ILP in an application that contains little or no ILP is essentially futile. On the other hand, designs that are focused on efficiently executing a tight data-dependence chain application, when presented with applications that contain high amounts of ILP, fail to take advantage of it because they are not well equipped to identify or exploit the available workload ILP.