1. Technical Field
The technical field of the present specification relates in general to a method and system for data processing and in particular to a method and system for multiscalar data processing.
2. Description of the Related Art
In the development of data processing systems, it became apparent that the performance capabilities of a data processing system could be greatly enhanced by permitting multiple instructions to be executed simultaneously. From this realization, several processor paradigms were developed that each permit multiple instructions to be executed concurrently.
A superscalar processor paradigm is one in which a single processor is provided with multiple execution units that are capable of concurrently processing multiple instructions. Thus, a superscalar processor may include an instruction cache for storing instructions, at least one fixed-point unit (FXU) for executing fixed-point instructions, a floating-point unit (FPU) for executing floating-point instructions, a load/store unit (LSU) for executing load and store instructions, a branch processing unit (BPU) for executing branch instructions, and a sequencer that fetches instructions from the instruction cache, examines each instruction individually, and opportunistically dispatches each instruction, possibly out of program order, to the appropriate execution unit for processing. In addition, a superscalar processor typically includes a limited set of architected registers that temporarily store operands and results of processing operations performed by the execution units. Under the control of the sequencer, the architected registers are renamed in order to alleviate data dependencies between instructions.
State-of-the-art superscalar processors afford a performance of between 1 and 2 instructions per cycle (IPC) by, among other things, permitting speculative execution of instructions based upon the dynamic prediction of conditional branch instructions. Because superscalar processors have no advance knowledge of the control flow graph (CFG) (i.e., the control relationships linking basic blocks) of a program prior to execution, IPC performance is necessarily limited by branch prediction accuracy. Thus, increasing the performance of the superscalar paradigm requires not only improving the accuracy of the already highly accurate branch prediction mechanism, but also supporting a broader instruction issue bandwidth, which requires exponentially complex sequencer circuitry to analyze instructions and resolve instruction dependencies and antidependencies. Because of the inherent difficulty in overcoming the performance bottlenecks of the superscalar paradigm, the development of increasingly aggressive and complex superscalar processors has a diminishing rate of return in terms of IPC performance.
An alternative processing paradigm is that provided by parallel and multiprocessing data processing systems, which, although differing in some respects, share several essential characteristics. Parallel and multiprocessor data processing systems, which each typically comprise multiple identical processors and are therefore collectively referred to hereinafter as multiple processor systems, execute programs out of a shared memory accessible to the processors across a system bus. The shared memory also serves as a global store for processing results and operands, which are managed by a complex synchronization mechanism to ensure that data dependencies and antidependencies between instructions executing on different processors are resolved correctly. Like superscalar processors, multiple processor systems are also subject to a number of performance bottlenecks.
A significant performance bottleneck in multiple processor systems is the latency incurred by the processors in storing results to and retrieving operands from the shared memory across the system bus. Accordingly, in order to minimize latency and thereby obtain efficient operation, compilers for multiple processor systems are required to divide programs into groups of instructions (tasks) between which control and data dependencies are identified and minimized. The tasks are then each assigned to one of the multiple processors for execution. However, this approach to task allocation is not suitable for exploiting the instruction-level parallelism (ILP) inherent in many algorithms. A second source of performance degradation in multiple processor systems is the requirement that control dependencies between tasks be resolved prior to the dispatch of subsequent tasks for execution. The failure of multiple processor systems to provide support for speculative task execution can cause processors within the multiple processor systems to incur idle cycles while waiting for inter-task control dependencies to be resolved. Moreover, the development of software for multiple processor systems is complicated by the need to explicitly encode fork information within programs, meaning that multiple processor code cannot be easily ported to systems having diverse architectures.
Recently, a new aggressive "multiscalar" paradigm, comprising both hardware and software elements, was proposed to address and overcome the drawbacks of the conventional superscalar and multiple processor paradigms described above. In general, the proposed hardware includes a collection of processing units that are each coupled to a sequencer, an interconnect for interprocessor communication, and a single set of registers. According to the proposed multiscalar paradigm, a compiler is provided that analyzes a program in terms of its CFG and partitions a program into multiple tasks, which comprise contiguous regions of the dynamic instruction sequence. In contrast to conventional multiple processor tasks, the tasks created by the multiscalar compiler may or may not exhibit a high degree of control and data independence. Importantly, the compiler encodes the details of the CFG in a task descriptor within the instruction set architecture (ISA) code space in order to permit the sequencer to traverse the CFG of the program and speculatively assign tasks to the processing units for execution without examining the contents of the tasks.
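The role of the task descriptor can be illustrated with a minimal sketch. It assumes a hypothetical descriptor that holds only a task name and the names of its possible successor tasks, plus a caller-supplied prediction function. The names `TaskDescriptor`, `assign_tasks`, and `predict` are illustrative and do not come from the proposed architecture.

```python
from dataclasses import dataclass, field


@dataclass
class TaskDescriptor:
    """Hypothetical descriptor: a task name and its possible successor tasks."""
    name: str
    successors: list = field(default_factory=list)


def assign_tasks(descriptors, start, num_units, predict):
    """Walk the CFG using descriptors only and assign one task per processing unit.

    The contents of the tasks are never inspected; only the successor
    information in each descriptor is used. `predict` picks one successor
    and stands in for whatever task predictor the hardware provides.
    """
    assignments = []
    current = start
    for unit in range(num_units):
        if current is None:
            break  # ran off the end of the CFG
        assignments.append((unit, current))
        successors = descriptors[current].successors
        current = predict(current, successors) if successors else None
    return assignments


# Example CFG: A branches to B or C, and both rejoin at D.
cfg = {
    "A": TaskDescriptor("A", ["B", "C"]),
    "B": TaskDescriptor("B", ["D"]),
    "C": TaskDescriptor("C", ["D"]),
    "D": TaskDescriptor("D", []),
}


def first_successor(task, successors):
    """Trivial stand-in predictor: always choose the first listed successor."""
    return successors[0]


print(assign_tasks(cfg, "A", 4, first_successor))
```

In this sketch every task after the first is assigned speculatively, since its selection depends on a prediction that has not yet been confirmed.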
According to the proposed multiscalar paradigm, register dependencies are resolved statically by the compiler, which analyzes each task within a program to determine which register values each task might possibly create during execution. The compiler then specifies the register values that might be created by each task within an associated register reservation mask within the task descriptor. The register reservations seen by a given task are the union of the register reservation masks associated with concurrently executing tasks that precede the given task in program order. During execution of the program, a processing unit executing an instruction dependent upon a register value that might be created by a concurrently executing task stalls until the register value is forwarded or the reservation is released by the preceding task. Upon release of the register or receipt of a forwarded register value by the stalled processing unit, the reservation for the register is cleared within the register reservation mask of the stalled processing unit and the stalled processing unit resumes execution. In order to trigger the forwarding of register values, the compiler adds tag bits to each instruction within a task. The tag bits associated with the last instruction in a task to create a particular register value indicate that the register value is to be forwarded to all concurrently executing tasks subsequent to the task in program order. Release of a register, on the other hand, is indicated by a special release instruction added to the base ISA or created by overloading an existing instruction within the ISA.
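The reservation scheme described above can be sketched as follows. The sketch assumes that each register reservation mask is an integer bit mask in which bit i stands for register i; the class and method names are hypothetical and chosen only for illustration.

```python
def reservations_seen(masks_of_preceding_tasks):
    """Union (bitwise OR) of the masks of concurrently executing predecessor tasks."""
    seen = 0
    for mask in masks_of_preceding_tasks:
        seen |= mask
    return seen


class ProcessingUnit:
    """Hypothetical model of the reservation state held by one processing unit."""

    def __init__(self, reserved):
        self.reserved = reserved  # bit i set: register i may still be produced upstream
        self.regs = {}

    def must_stall(self, src_reg):
        """An instruction reading src_reg stalls while that register is reserved."""
        return bool(self.reserved & (1 << src_reg))

    def receive_forward(self, reg, value):
        """A preceding task forwarded a value: store it and clear the reservation."""
        self.regs[reg] = value
        self.reserved &= ~(1 << reg)

    def receive_release(self, reg):
        """A preceding task released the register without writing it."""
        self.reserved &= ~(1 << reg)


# Two preceding tasks may write registers {1, 2} and {3} respectively.
unit = ProcessingUnit(reservations_seen([0b0110, 0b1000]))
print(unit.must_stall(1))   # True: register 1 is reserved
unit.receive_forward(1, 42)
print(unit.must_stall(1))   # False: the forwarded value cleared the reservation
```

Forwarding and release both clear the reservation; the difference is that only forwarding delivers a new value.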
In contrast to register dependencies, the proposed multiscalar paradigm does not attempt to statically resolve memory dependencies and permits load and store instructions to be executed speculatively. A dynamic check must then be made to ensure that no preceding task stores to a memory location previously loaded by a subsequent task. If such a dependency violation is detected, execution of the task containing the speculative load and of all subsequent tasks is aborted, and appropriate recovery operations are performed. Further details of the proposed multiscalar architecture may be found in G. S. Sohi, S. E. Breach, and T. N. Vijaykumar, "Multiscalar Processors," Proc. ISCA '95 Int'l Symposium on Computer Architecture, June 1995, pp. 414-425.
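The dynamic memory dependency check can be illustrated with a simplified sketch. It assumes that tasks are identified by integers that increase in program order and that the checker merely records which tasks have loaded each address. The class `MemoryDependencyChecker` is hypothetical and does not describe the hardware structure actually used in the proposed architecture.

```python
class MemoryDependencyChecker:
    """Hypothetical tracker for speculative loads, keyed by memory address."""

    def __init__(self):
        self.loads = {}  # address -> set of task numbers that have loaded it

    def record_load(self, task, address):
        """Note that `task` has speculatively loaded from `address`."""
        self.loads.setdefault(address, set()).add(task)

    def record_store(self, task, address):
        """Check a store by `task` against earlier loads made by later tasks.

        Returns the earliest violating task, which must be aborted together
        with all subsequent tasks, or None if no violation is detected.
        """
        violators = [t for t in self.loads.get(address, ()) if t > task]
        return min(violators) if violators else None


checker = MemoryDependencyChecker()
checker.record_load(task=2, address=0x100)   # task 2 loads early
checker.record_load(task=3, address=0x100)
squash_from = checker.record_store(task=1, address=0x100)
print(squash_from)  # 2: task 2 and every later task must be re-executed
```

Recovery itself (discarding speculative state and restarting the aborted tasks) is outside the scope of this sketch.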
The proposed multiscalar paradigm overcomes many of the deficiencies of other paradigms in that the multiscalar paradigm affords a wide instruction window from which instructions can be dispatched utilizing relatively simple scheduling hardware, is less sensitive to inter-task data dependencies and mispredicted branches, and is capable of exploiting the ILP believed to be present in most sequential programs. However, the proposed multiscalar architecture also has several deficiencies. First, backward compatibility of code binaries is sacrificed due to the insertion of release and other multiscalar instructions into the program to handle task synchronization. Second, multiscalar simulations have shown that the insertion of a large number of multiscalar instructions that do no useful work into a program can actually degrade multiscalar performance to such an extent that better performance may be obtained with a conventional superscalar processor. Third, the attachment of additional bits to each instruction in the program, which was proposed in order to trigger the forwarding of processing results from a predecessor task to subsequent tasks, necessitates an increased instruction path width and additional hardware complexity. Fourth, the proposed multiscalar paradigm has no mechanism for handling dependencies between loads and stores to memory. Fifth, in the proposed multiscalar architecture, all tasks except the oldest are executed speculatively, meaning that even if task prediction accuracy is 90%, the prediction accuracy for tasks beyond the fifth task drops below 60%.
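The arithmetic behind the fifth deficiency can be checked with a short calculation. The sketch assumes that successive task predictions are independent and equally accurate, so that a task reached through n consecutive predictions lies on the correct path with probability p raised to the power n. The function name is illustrative.

```python
def cumulative_accuracy(p, n):
    """Probability that a task reached through n consecutive predictions
    is on the correct path, assuming independent predictions of accuracy p."""
    return p ** n


for depth in range(1, 7):
    print(f"{depth} predictions deep: {cumulative_accuracy(0.9, depth):.3f}")
```

With p = 0.9, four consecutive predictions give about 0.656 and five give about 0.590, which is where the figure of below 60% arises.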
As should thus be apparent, it would be desirable to provide an enhanced multiscalar architecture that overcomes the foregoing and other deficiencies of the proposed multiscalar processor paradigm.
It is therefore one object of the present disclosure to provide an improved method and system for data processing.
It is another object of the present disclosure to provide an improved method and system for multiscalar data processing.
The foregoing objects are achieved as is now described. A processor and method of executing a program within a processor are provided. According to the method, a plurality of program instructions comprising a program and a set of auxiliary instructions are stored. An instruction stream including selected ones of the plurality of program instructions is supplied to the processor. In response to the processor processing a program instruction within the instruction stream that has an associated auxiliary instruction within the set of auxiliary instructions, the associated auxiliary instruction is automatically inserted within the instruction stream and the associated auxiliary instruction is executed within the processor.
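The method summarized above can be pictured with a minimal sketch. It assumes, purely for illustration, that auxiliary instructions are associated with program instructions by address and that each auxiliary instruction is inserted immediately after its associated program instruction; neither detail is stated in this summary. The function name and the instruction strings are hypothetical.

```python
def build_stream(program, auxiliary, fetch_order):
    """Yield the instruction stream, inserting an auxiliary instruction
    whenever the program instruction just supplied has one associated with it.

    program:     address -> program instruction (stored unmodified)
    auxiliary:   address -> auxiliary instruction (stored separately)
    fetch_order: addresses of the selected program instructions, in order
    """
    for address in fetch_order:
        yield program[address]
        if address in auxiliary:
            yield auxiliary[address]


# Hypothetical contents; the mnemonics are placeholders, not a real ISA.
program = {0: "add r1, r2, r3", 4: "load r4, 0(r1)", 8: "branch L1"}
auxiliary = {4: "release r1"}

for instruction in build_stream(program, auxiliary, [0, 4, 8]):
    print(instruction)
```

In this sketch the stored program instructions are left untouched, and the auxiliary instruction appears only in the stream supplied for execution.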
The above as well as additional objects, features, and advantages of an illustrative embodiment will become apparent in the following detailed written description.