Many practical applications require processing of very large amounts of information in a short period of time. One of the basic approaches to minimizing the time to perform such computations is to apply some sort of parallelism, so that tasks which are logically independent can be performed in parallel. This can be done, for example, by executing two or more instructions per machine cycle, i.e., by means of instruction-level parallelism. Thus, in a class of computers using superscalar processing, hardware is used to detect independent instructions and execute them in parallel, often using techniques developed in the early supercomputers.
Another approach to exploiting instruction-level parallelism is used by the Very Long Instruction Word (VLIW) processor architectures, in which the compiler performs most instruction scheduling and parallel dispatching at compile time, reducing the operating burden at run time. By moving the scheduling tasks to the compiler, a VLIW processor avoids both the operating latency problems and the large and complex circuitry associated with on-chip instruction scheduling logic. Both superscalar and VLIW processors take advantage of techniques known as pipelining for instruction scheduling optimization.
As is known, each VLIW instruction includes multiple independent operations for execution by the processor in a single cycle. A VLIW compiler schedules these operations in precise conformance with the structure of the processor, including the number and type of the execution units, as well as execution unit timing and latencies. The compiler groups the operations into a wide instruction for execution in one cycle. At run time, the wide instruction is applied to the various execution units with little decoding. The execution units in a VLIW processor typically include arithmetic units such as floating point arithmetic units. An example of a VLIW processor that includes floating point execution units is described by R. K. Montoye et al. in "Design of the IBM RISC System/6000 floating point execution unit", IBM J. Res. Develop., V. 43, No. 1, pp. 61-62, January 1990. Additional examples are provided in U.S. Pat. No. 5,418,975, which is incorporated herein by reference in its entirety.
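By way of illustration only, the grouping of independent operations into wide instructions may be sketched as a greedy packing pass. The sketch below is not taken from the cited references; the `Op` record, the `pack_wide_instructions` function, the unit names, and the assumption that every operation has a latency of one cycle are all hypothetical simplifications.

```python
from collections import namedtuple

# Hypothetical operation record: a result name, the type of execution unit
# the operation needs, and the names of earlier results it reads.
Op = namedtuple("Op", ["name", "unit", "deps"])


def pack_wide_instructions(ops, units):
    """Greedily pack operations into wide instructions, one per cycle.

    ops   -- list of Op in program order
    units -- dict mapping a unit type to the number of such units
             (an assumed machine model)
    Assumes every operation completes in exactly one cycle.
    """
    done = set()            # results produced in earlier cycles
    pending = list(ops)
    wide_instructions = []
    while pending:
        free = dict(units)  # execution units still free in this cycle
        chosen = []
        for op in pending:
            # An operation is ready only if all of its inputs were produced
            # in an earlier cycle, so no two operations in the same wide
            # instruction depend on each other.
            ready = all(dep in done for dep in op.deps)
            if ready and free.get(op.unit, 0) > 0:
                free[op.unit] -= 1
                chosen.append(op)
        if not chosen:
            raise ValueError("unschedulable: cyclic dependency or missing unit")
        chosen_names = {op.name for op in chosen}
        pending = [op for op in pending if op.name not in chosen_names]
        done |= chosen_names
        wide_instructions.append([op.name for op in chosen])
    return wide_instructions


# Assumed example: two integer units and one floating point unit.
example_ops = [
    Op("a", "alu", ()),
    Op("b", "alu", ()),
    Op("c", "fpu", ("a", "b")),
    Op("d", "alu", ("a",)),
    Op("e", "fpu", ("c", "d")),
]
example_schedule = pack_wide_instructions(example_ops, {"alu": 2, "fpu": 1})
```

In this assumed example the five operations are packed into three wide instructions. A production compiler would additionally account for multi-cycle latencies, register pressure, and pipelining across loop iterations, none of which is modeled here.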
Predicated and speculative computations are known in the art; see, e.g., Parallel and Distributed Computing Handbook, Albert Y. Zomaya, Editor, McGraw-Hill 1996, chapter 21, Superscalar and VLIW Processors, pp. 621-647, incorporated herein by reference. To improve efficiency, certain instructions may be executed speculatively, that is, before it is known whether their results are needed, and their results may then be retired or discarded. It is also known that profile data characterizing program behavior can be obtained by performing test runs of the program. Such a technique is employed, for example, for profiled branch prediction. The profile data so generated enables the compiler to identify the probable alternative of a conditional statement so as to enhance the efficiency of speculative computations.
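By way of illustration only, the use of profile data to guide speculation may be sketched in software as follows. This is a simplified simulation and not a description of any particular processor or compiler; the function names `collect_branch_profile`, `probable_alternative`, and `run_speculatively` are hypothetical.

```python
from collections import Counter


def collect_branch_profile(condition, training_inputs):
    """Count how often a condition holds over a set of test-run inputs."""
    return Counter(bool(condition(x)) for x in training_inputs)


def probable_alternative(profile):
    """Return True if the 'then' alternative was at least as frequent."""
    return profile[True] >= profile[False]


def run_speculatively(x, condition, then_arm, else_arm, predict_then):
    """Evaluate the predicted alternative before the condition is resolved.

    Returns (result, retired), where retired is False when the speculative
    result had to be discarded and the other alternative computed instead.
    """
    # Speculative work: started without knowing the outcome of the condition.
    speculated = then_arm(x) if predict_then else else_arm(x)
    taken = bool(condition(x))
    if taken == predict_then:
        return speculated, True       # prediction correct: result is retired
    # Prediction wrong: the speculative result is discarded.
    return (then_arm(x) if taken else else_arm(x)), False


# Assumed example: a conditional that is usually true in the test runs.
def is_positive(x):
    return x > 0


def double(x):
    return x * 2


def negate(x):
    return -x


profile = collect_branch_profile(is_positive, [1, 2, 3, -1])
predict_then = probable_alternative(profile)
hit = run_speculatively(5, is_positive, double, negate, predict_then)
miss = run_speculatively(-2, is_positive, double, negate, predict_then)
```

In this sketch a discarded result represents wasted work, which is why an accurate profile matters: the more often the predicted alternative is the one actually taken, the less speculative computation is thrown away.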
While these processors are capable of performing a variety of tasks adequately, it is perceived that the performance of VLIW processors can be improved further by improving the optimization techniques employed by compilers that compile programs for VLIW processing. More specifically, redundant speculative computations in the loop body may reduce the effectiveness of loop software pipelining. Thus, it is desirable to provide for program compilation that reduces such redundant speculative computations in the innermost loops.