The invention relates generally to a processor, and more specifically, to a processor with multiple execution units for instruction processing and an instruction decode and issue logic for assigning instructions for execution to one of the execution units. The invention relates further to a method for instruction processing with a processor, and to a design structure.
Today, processors, e.g., superscalar processors, allow parallel execution of several instructions during a single processor cycle due to the availability of a plurality of parallel execution units. Generally, this mechanism increases the processor's performance. It may also be possible to issue multiple instructions to parallel execution pipelines in the same cycle. However, two consecutive instructions may be dependent on each other, i.e., a following instruction may require the result of the preceding instruction. Thus, a scheduling or dispatching of the following instruction has to wait for the preceding instruction to finish. Independently of this, instructions may be issued to different execution units without taking particular dependencies into account.
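The effect of such a dependency on issuing may be illustrated by a minimal sketch. It is not taken from any particular processor: the function name `count_issue_cycles`, the register names, the issue width of two, and the assumption that a result becomes available only in the cycle after its producing instruction was issued are all hypothetical simplifications. Only read-after-write dependencies are modeled.

```python
from typing import List, Tuple

# Hypothetical instruction format: (destination register, source registers)
Instr = Tuple[str, Tuple[str, ...]]


def count_issue_cycles(program: List[Instr], width: int = 2) -> int:
    """Count cycles needed by a simplified in-order issue model.

    Assumption: up to `width` instructions issue per cycle, and an
    instruction cannot issue in the same cycle as an earlier instruction
    whose result it reads.
    """
    cycles = 0
    i = 0
    while i < len(program):
        cycles += 1
        written = set()  # registers written by instructions issued this cycle
        issued = 0
        while i < len(program) and issued < width:
            dest, srcs = program[i]
            if any(src in written for src in srcs):
                break  # needs a result produced in this cycle, so it waits
            written.add(dest)
            issued += 1
            i += 1
    return cycles


# Four instructions that only read r0 can be issued pairwise.
independent = [("r1", ("r0",)), ("r2", ("r0",)), ("r3", ("r0",)), ("r4", ("r0",))]
# Four instructions forming a chain: each reads the previous result.
dependent = [("r1", ("r0",)), ("r2", ("r1",)), ("r3", ("r2",)), ("r4", ("r3",))]

print(count_issue_cycles(independent))  # 2
print(count_issue_cycles(dependent))    # 4
```

Under these assumptions, the independent sequence uses both issue slots in every cycle, whereas the dependent chain issues only one instruction per cycle although a second execution unit is available.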
Traditional methods try to maximize the distance between dependent instructions. For example, algorithms and programs may be redesigned to best fit the underlying hardware. However, this may be very expensive and must be redone for each processor architecture on which the software is to be executed. Additionally, the source code may no longer be available, or a user may not be willing to recompile and recertify the system due to the associated costs. In a virtualized environment using virtual machines (VMs), or in multi-threaded environments, other threads may nullify such a single-threaded optimization. In such environments, it is very hard to predict what else will be running on the same core and competing for a hardware resource; hence, any static optimization may be defeated.
On the other hand, transistor device performance and single-thread performance are saturating due to physical limits, while Moore's law still applies and ‘silicon shrinking’ continues. Thus, more and more circuits may be integrated on a chip, e.g., parallel execution units may be used to parallelize the execution of an instruction stream and thus to reduce the average cycles per instruction. The parallelism is limited by the dependencies between instructions, i.e., the result of one instruction may be needed by one or more following dependent instructions, and hence instructions cannot be arbitrarily parallelized. In order to maximize performance, the results from individual execution units need to be forwarded to other execution units. For processor designers, this may result in wiring headaches or additional cycle delay(s) on the forwarding paths. Currently available wire stacks may limit the number of interconnected units, and thus the performance increase, and may require a significant engineering effort to close the integration gap. More metal layers in the processor design are very expensive, if they are available at all.
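The growth of the forwarding network may be illustrated by a rough count. The following sketch assumes a full point-to-point forwarding scheme in which the result of every execution unit is routed to every operand input of every other execution unit; the function name `forwarding_paths` and the default of two operand inputs per unit are hypothetical and chosen only for illustration.

```python
def forwarding_paths(units: int, operands_per_unit: int = 2) -> int:
    """Count point-to-point forwarding paths between execution units.

    Assumption: each unit forwards its result to every operand input of
    every other unit; forwarding within the same unit is not counted.
    """
    return units * (units - 1) * operands_per_unit


for n in (2, 4, 8):
    print(n, forwarding_paths(n))
# 2 4
# 4 24
# 8 112
```

Under this assumption, the number of paths grows roughly with the square of the number of execution units, which is one way to see why the available wiring may limit the number of interconnected units.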