The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to disclosed embodiments.
Within a computer processor, such as a central processing unit (CPU), various operations or stages must be performed for the CPU to perform any beneficial task. Within the CPU, an instruction fetch is the operation of retrieving an instruction from program memory communicatively interfaced with the CPU so that the instruction may undergo further processing (e.g., instruction decode, instruction execute, and write back of the results). Each of these operations consumes time or CPU clock cycles, and thus limits the speed and efficiency of the processor.
The concepts of pipelining and superscalar CPU processing thus implement what is known in the art as Instruction Level Parallelism (ILP) within a single processor or processor core to enable faster CPU throughput of instructions than would otherwise be possible at any given clock rate. One of the simplest methods used to accomplish increased parallelism is to begin the first steps of instruction fetching and decoding before the prior instruction finishes executing, resulting in a pipeline of instructions for processing. Increased parallelism may also be attained through multiple functional units that simultaneously perform multiple “fetch” operations, the results of which are then placed into a pipeline such that an instruction is always available for an execution cycle. In such a way, an opportunity to execute an instruction is less likely to be wasted due to having to wait for an instruction to be fetched.
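By way of a non-limiting illustration, the throughput benefit of overlapping stages described above may be sketched as follows. The sketch assumes an idealized four-stage pipeline (fetch, decode, execute, write back) in which every stage takes exactly one clock cycle and no stalls occur; the function names `sequential_cycles` and `pipelined_cycles` are hypothetical and are introduced only for this example.

```python
def sequential_cycles(num_instructions: int, num_stages: int = 4) -> int:
    """Cycles needed when each instruction completes every stage
    before the next instruction is fetched (assumes 1 cycle per stage)."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int = 4) -> int:
    """Cycles needed in an idealized pipeline with no stalls: the first
    instruction takes num_stages cycles to drain, and each later
    instruction completes one cycle after its predecessor."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, sequential_cycles(n), pipelined_cycles(n))
```

Under these assumptions, ten instructions require 40 cycles when processed strictly sequentially and 13 cycles when pipelined. A real processor would generally fall between the two figures because of stalls and dependencies.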
As the complexity and redundancy of functional units increase, so does the overhead penalty for managing the increased instruction level parallelism of the CPU. When the processor performs a simple fetch, decode, execute, and write back cycle in a continuous sequential cycle, there is no concern about a dependency on a preceding or subsequent statement. Any change required will have already been processed (e.g., executed and written back) such that any data dependency is already satisfied by the time an otherwise dependent instruction seeks the data. For example, if a second instruction depends upon the result of a first instruction, that result is assured to be available in a simple and sequential fetch, decode, execute, and write back cycle, as the subsequent instruction cannot be “fetched” until the prior instruction is “executed,” causing the change, and “written back,” making the change available.
Thus it can be plainly seen that implementing instruction level parallelism within a CPU presents a risk that a subsequent instruction may be “fetched” and presented for execution before the first instruction is executed and “written back.” If the second instruction depends upon the first, the dependency is violated. Other dependency types exist besides the data dependency example set forth above, such as anti-dependency, control dependency, and output dependency.
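As a minimal sketch of the register-based dependency types named above, the following example classifies the relationship between an earlier and a later instruction. It assumes, purely for illustration, that each instruction writes exactly one destination register and reads zero or more source registers; the `Instr` type and the `classify` function are hypothetical names, and control dependency is not modeled because it arises from branches rather than from register usage.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Instr:
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def classify(first: Instr, second: Instr) -> List[str]:
    """Return the dependency types of `second` on the earlier `first`."""
    kinds = []
    if first.dest in second.srcs:
        kinds.append("data")    # second reads what first writes
    if second.dest in first.srcs:
        kinds.append("anti")    # second overwrites what first reads
    if first.dest == second.dest:
        kinds.append("output")  # both write the same register
    return kinds


if __name__ == "__main__":
    add = Instr(dest="r1", srcs=("r2", "r3"))
    use = Instr(dest="r4", srcs=("r1", "r5"))
    print(classify(add, use))
```

In this sketch, an empty result indicates that the two instructions may be reordered or overlapped with respect to register usage, whereas any reported dependency type would need to be respected by the scheduling mechanism.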
Scoreboarding implements a scheduling mechanism by which dependency violations, which would otherwise result in “hazards” or incorrectly processed data, instructions, etc., can be avoided (e.g., via waits, stalls, etc.).
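The stall behavior described above may be illustrated with a greatly simplified sketch. It is not a model of any particular scoreboard implementation; it assumes in-order issue of at most one instruction per cycle, a fixed result latency for every instruction, and operands read at issue time (so anti-dependencies cannot be violated in this model). The function name `schedule` and the `latency` parameter are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Each instruction is (destination register, source registers).
Instruction = Tuple[str, Sequence[str]]


def schedule(instrs: List[Instruction], latency: int = 3) -> List[int]:
    """Return the issue cycle of each instruction under in-order issue.

    The table `ready_at` records, per register, the cycle at which a
    pending write completes. An instruction stalls until none of its
    source registers (data dependency) or its destination register
    (output dependency) has a write still pending.
    """
    ready_at: Dict[str, int] = {}
    issue_cycles: List[int] = []
    cycle = 0  # earliest cycle at which the next instruction may issue
    for dest, srcs in instrs:
        regs = list(srcs) + [dest]
        earliest = max([cycle] + [ready_at.get(r, 0) for r in regs])
        issue_cycles.append(earliest)
        ready_at[dest] = earliest + latency  # result visible after latency
        cycle = earliest + 1                 # in-order, one issue per cycle
    return issue_cycles


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),  # produces r1
        ("r4", ("r1",)),       # needs r1, so it stalls
        ("r5", ("r6",)),       # independent, but waits behind the stall
    ]
    print(schedule(program))
```

Under these assumptions, the second instruction stalls until the first has written its result, and the third instruction, although independent, is held behind the stalled instruction because issue remains in-order.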
Previously known mechanisms allow for instruction level parallelism of the CPU but enforce a requirement that fetch is performed in-order, and thus the extent of instruction level parallelism is limited. Even where superscalar processors permit out-of-order execution, the extent of instruction level parallelism remains constrained by in-order fetch mechanisms and a correspondingly limited scheduling window.
The present state of the art may therefore benefit from techniques, systems, methods, and apparatuses for the scheduling of instructions in a multi-strand out-of-order processor as described herein.