Aspects of the present invention relate to the design of processors and a processor instruction scheduler as a design automation tool.
State-of-the-art processors comprise a large number of units such as cores, processing units or accelerators. Often so-called execution units are used to execute special instructions. Out-of-order superscalar processors exploit inherent instruction-level parallelism to execute multiple instructions speculatively and in parallel each cycle on multiple execution units in order to improve the instruction throughput. Such out-of-order processors typically have an instruction sequencing unit (ISU), which each cycle schedules the execution of instructions on the multiple execution units of the processor. In addition, the ISU ensures, by a so-called commit process, that speculative execution results become architected state in the order of the program code stream. Area, power and timing constraints limit the ISU instruction scheduling heuristics. For example, the instruction queue with its associated rename and dependency checking will have a certain queue depth. Queues are often split based on execution units and so on, which limits the number of instructions in the code stream that the ISU is able to take into account when scheduling instructions onto multiple execution units. Hence, the order in which instructions are sent to the ISU matters.
An unfavorable ordering of the processor instructions in the code stream can lead to some units running empty while others are overloaded, with the instructions queued up for the overloaded units blocking processor instructions that could be executed on other units. In some cases the available processor registers limit the number of processor instructions that can be handled by the processor simultaneously. All these variables differ between processor families or even between different generations of the same processor family. General-purpose compilers cannot be expected to produce optimized code for each situation, as they need to be able to compile large software packages in an acceptable time.
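The blocking effect described above can be illustrated with a minimal sketch. It is a toy model under stated assumptions, not the scheduling logic of any actual ISU: the two units "A" and "B", the latency of two cycles, the window depth, the absence of data dependencies, and the function name simulate are all hypothetical and chosen only for illustration.

```python
# Toy model (all names and parameters are illustrative assumptions):
# each instruction is represented only by the unit it needs, a unit is
# occupied for LATENCY[unit] cycles per instruction, and the scheduler
# can only see the first `window_depth` not-yet-issued instructions.
LATENCY = {"A": 2, "B": 2}


def simulate(stream, window_depth, latency=LATENCY):
    """Return the cycle in which the last instruction finishes."""
    pending = list(stream)                 # not-yet-issued, in program order
    busy_until = {unit: 0 for unit in latency}
    cycle = 0
    while pending:
        window = pending[:window_depth]    # what the scheduler can see
        for unit in latency:
            if busy_until[unit] <= cycle and unit in window:
                window.remove(unit)
                pending.remove(unit)       # oldest instruction for this unit
                busy_until[unit] = cycle + latency[unit]
        cycle += 1
    return max(busy_until.values())


if __name__ == "__main__":
    print(simulate("AAAABBBB", window_depth=2))  # 13
    print(simulate("ABABABAB", window_depth=2))  # 8
```

In this sketch the same eight instructions take 13 cycles when all instructions for unit A precede those for unit B, because unit B runs empty while the window is filled with instructions for unit A, and 8 cycles when the instructions are interleaved. With a window depth of 8 the first ordering also completes in 8 cycles, which shows that the limited queue depth is what makes the order matter in this model.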
Another issue is that during the definition of the processor architecture, basic decisions have to be made about the units and instructions:
Which accelerators will be implemented (performance vs. hardware tradeoffs)?
Which processor instructions and how many execution units are supported?
What is a suitable pipeline depth for the execution in a certain unit?
For example, K. Atasu et al., “Optimizing Instruction-set Extensible Processors under Data Bandwidth Constraints”, Proc. of the Conference on Design, Automation and Test in Europe, pp. 588-593, 2007, which is hereby incorporated herein by reference in its entirety, describes the use of linear programming to identify custom instructions. Here groups of processor instructions are identified that can be combined to be executed by hardware accelerators.
The alternatives in the design of a processor are usually tested with a code stream that may not be in the optimal order for the given configuration. The respective work flow is shown in FIG. 1: High level code I1, which implements an algorithm, is used as input for a compiler and compiled in step S1, which results in a code stream I2, for example in the form of assembler code. The compiler uses heuristics to schedule the processor instructions in the code stream I2. After machine code, which consists of direct processor instructions, has been generated from the code stream I2, the machine code is used as input to a processor and is executed by the processor in step S2.
Re-ordering the code stream manually for each alternative is very time-consuming, and there is no guarantee that optimal performance is reached. Also, for critical loops in software, the corresponding machine code is manually rewritten in case of performance problems on existing processors. This requires in-depth knowledge of the processor hardware implementation, for example of the instruction scheduling in the processor. Some compilers also offer to automatically instrument the code stream such that profiling information is generated during the execution of the code stream, which is used to gather statistics. This allows the compiler to use the data from the statistics for optimizations in subsequent compilation runs when generating a code stream.