Field of the Invention
The disclosure generally relates to multi-threaded instruction scheduling, and more specifically to methods and apparatus for scheduling instructions using pre-decode data.
Description of the Related Art
Parallel processors have multiple independent cores that enable multiple threads to be executed simultaneously using different hardware resources. SIMD (single instruction, multiple data) architecture processors execute the same instruction on each of the multiple cores where each core processes different input data. MIMD (multiple instruction, multiple data) architecture processors execute different instructions on different cores with different input data supplied to each core. Parallel processors may also be multi-threaded, which enables two or more threads to execute substantially simultaneously using the resources of a single processing core (i.e., the different threads are executed on the core during different clock cycles). Instruction scheduling refers to the technique for determining which threads to execute on which cores during the next clock cycle.
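By way of illustration only, the distinction between the SIMD and MIMD execution models described above may be sketched as follows. The function names `simd_step` and `mimd_step` and the representation of each core as one list element are assumptions introduced for this sketch and do not describe any particular processor.

```python
from typing import Callable, List

Op = Callable[[int], int]


def simd_step(op: Op, inputs: List[int]) -> List[int]:
    """SIMD: every core executes the same instruction on its own input."""
    return [op(x) for x in inputs]


def mimd_step(ops: List[Op], inputs: List[int]) -> List[int]:
    """MIMD: each core executes its own instruction on its own input."""
    if len(ops) != len(inputs):
        raise ValueError("one instruction is required per core")
    return [op(x) for op, x in zip(ops, inputs)]


# Hypothetical usage: three cores doubling their inputs (SIMD), and two
# cores running different operations (MIMD).
simd_result = simd_step(lambda x: x * 2, [1, 2, 3])
mimd_result = mimd_step([lambda x: x + 1, lambda x: x * x], [3, 4])
```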
Typically, instruction scheduling algorithms will decode a plurality of instructions after fetching the instructions from memory to determine the particular resources required for each specific operation and the latencies associated with those resources. The system may then evaluate the latencies to determine the optimal scheduling order for the plurality of instructions. For example, one instruction may specify an operand (i.e., a register value) that is dependent on a calculation being executed by a previous instruction from the same thread. The scheduler then delays execution of the one instruction until the previous instruction completes execution.
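The conventional approach described above may be illustrated by the following minimal sketch, which assumes an in-order, single-issue model. The `Instr` type, the `LATENCY` table and its cycle counts, and the `schedule` function are hypothetical and are introduced only for this example. Each instruction is decoded into its opcode and operands, and an instruction is delayed until every source register written by an earlier instruction is available.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instr:
    opcode: str
    dst: str                 # destination register
    srcs: Tuple[str, ...]    # source registers


# Assumed latencies in clock cycles; real values are hardware specific.
LATENCY: Dict[str, int] = {"load": 4, "mul": 3, "add": 1}


def schedule(instrs: List[Instr]) -> List[int]:
    """Return the issue cycle of each instruction for one thread."""
    ready_at: Dict[str, int] = {}   # register -> cycle its value is available
    issue_cycles: List[int] = []
    cycle = 0
    for ins in instrs:
        # Delay issue until all source operands have been produced.
        # Registers never written in this program are treated as ready at 0.
        earliest = max([cycle] + [ready_at.get(r, 0) for r in ins.srcs])
        issue_cycles.append(earliest)
        ready_at[ins.dst] = earliest + LATENCY[ins.opcode]
        cycle = earliest + 1        # at most one issue per cycle
    return issue_cycles


program = [
    Instr("load", "r1", ("r0",)),
    Instr("add", "r2", ("r1", "r1")),   # depends on the load result
    Instr("mul", "r3", ("r4", "r5")),   # independent of the load
]
issue = schedule(program)
```

In this sketch the `add` is held until the `load` completes, which requires the scheduler to have decoded both instructions and to track per-register readiness state.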
One problem with the above-described systems is that decoding a plurality of instructions, identifying dependencies between the instructions, and analyzing the latencies associated with all of the computations specified by the instructions require substantial management resources in the processor and a large amount of state information storage. The processor may have to determine the specific opcodes specified by the instructions, the resources associated with the operations (e.g., the specific registers passed as operands to each instruction), the interdependencies between instructions, and any other relevant data associated with the instructions. Implementing such algorithms may take many clock cycles and may require a large amount of memory for storing and decoding the instructions.
Accordingly, what is needed in the art is a system and method for performing instruction scheduling without having to determine the latencies for computations that are inputs to other instructions.