With improvements in hardware and software technology in recent years, multimedia consumer electronic devices that support video, audio and images have become prevalent. However, there is a never-ending demand for higher video resolutions, better video quality at lower bit-rates, lower power consumption, enhanced overall functionality, and so on. Meeting the computational demands of real-time multimedia-enabled devices requires optimization strategies, including the exploitation of parallelism, that target both hardware and software.
Although performance gains for processors have primarily been achieved by increasing processor clock-rates, significant improvement has also been achieved by utilizing architectures that exploit instruction-level parallelism (ILP). Examples include pipelined, superscalar and very long instruction word (VLIW) architectures. These architectures leverage fine-grained parallelism in computer code in order to execute more than one instruction per machine cycle.
With a superscalar architecture, independent instructions are detected in hardware and then executed in parallel. Specifically, superscalar architectures exploit ILP by utilizing complex logic implemented in hardware to examine software code during run-time, and then reorder the instructions for faster execution. Accordingly, with superscalar architectures, performance gains are achieved at the expense of more complex hardware.
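The run-time detection described above can be illustrated with a minimal software sketch. It is not a model of any particular processor: the function name `issue_order`, the `(name, dst, srcs)` instruction format, the `width` and `window` parameters, and the single-cycle latency are all assumptions made for illustration, and register renaming is ignored.

```python
def issue_order(program, width=2, window=4):
    """Simulate a simplified superscalar issue stage.

    program: list of (name, dst, srcs) tuples in program order (hypothetical format).
    width:   maximum instructions issued per cycle (assumed).
    window:  how many pending instructions the hardware examines per cycle (assumed).
    Every instruction is assumed to complete in one cycle.
    Returns a list of cycles, each a list of instruction names.
    """
    pending = list(program)
    cycles = []
    while pending:
        chosen = []
        for idx in range(min(window, len(pending))):
            if len(chosen) == width:
                break
            _, dst, srcs = pending[idx]
            # An instruction may issue only if it has no read-after-write,
            # write-after-write, or write-after-read hazard against any
            # older instruction that is still pending at the start of the cycle.
            hazard = any(
                o_dst in srcs or o_dst == dst or dst in o_srcs
                for _, o_dst, o_srcs in pending[:idx]
            )
            if not hazard:
                chosen.append(idx)
        cycles.append([pending[i][0] for i in chosen])
        pending = [ins for i, ins in enumerate(pending) if i not in chosen]
    return cycles


program = [
    ("i1", "r1", ("r0",)),
    ("i2", "r2", ("r1",)),       # depends on i1
    ("i3", "r3", ("r0",)),       # independent of i1 and i2
    ("i4", "r4", ("r3", "r2")),  # depends on i2 and i3
]
print(issue_order(program))  # [['i1', 'i3'], ['i2'], ['i4']]
```

In this sketch, `i3` is issued ahead of `i2` because it does not depend on any older pending instruction. The dependency check runs on every cycle, which corresponds to the hardware cost noted above.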
Another approach to increasing ILP is that of very long instruction word (VLIW) technology. With a VLIW architecture, finding ILP and correctly scheduling parallel operations is a software function that occurs prior to run-time, i.e. at compile time, thereby resulting in a simpler and thus more economical hardware solution. However, with VLIW architectures, the challenge becomes one of designing a software compiler that is intelligent enough to decide how to build the very long instruction word so as to utilize the target architecture optimally. Usually, a VLIW compiler first maps the program instructions from a higher-level language construct to the basic instruction set architecture (ISA) of the processor. The instruction scheduler component of the compiler then attempts to identify independent basic operations. Next, the compiler maps the independent operations to appropriate functional units while maintaining the constraints imposed by the algorithm and the architecture. The parallelized basic operations are then packed into very long instruction words. During the execution phase, the processor unpacks the very long instruction words and forwards the basic operations to multiple functional units for simultaneous execution.
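The scheduling and packing steps described above can be sketched as a simple greedy list scheduler. This is only an illustrative sketch and not the scheduler of any actual VLIW compiler: the `Op` record, the `pack_vliw` function, the functional unit names, and the single-cycle latency of every operation are assumptions, and dependencies are handled conservatively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """A basic operation after lowering to the ISA (hypothetical format)."""
    name: str
    unit: str          # kind of functional unit required, e.g. "alu", "mul", "mem"
    dst: str           # destination register
    srcs: tuple = ()   # source registers


def depends_on(later, earlier):
    """True if `later` must execute after `earlier` (RAW, WAW or WAR)."""
    return (earlier.dst in later.srcs
            or earlier.dst == later.dst
            or later.dst in earlier.srcs)


def pack_vliw(ops, units):
    """Greedily pack operations into very long instruction words.

    ops:   operations in program order.
    units: dict mapping a functional unit kind to its slot count per word.
    Returns a list of words, each a list of operation names.
    """
    preds = [[j for j in range(i) if depends_on(ops[i], ops[j])]
             for i in range(len(ops))]
    scheduled = set()              # operations placed in earlier words
    remaining = list(range(len(ops)))
    words = []
    while remaining:
        free = dict(units)         # slots still available in this word
        word = []
        for i in remaining:
            # Ready only if every predecessor sits in an earlier word.
            ready = all(p in scheduled for p in preds[i])
            if ready and free.get(ops[i].unit, 0) > 0:
                free[ops[i].unit] -= 1
                word.append(i)
        if not word:
            raise ValueError("no functional unit available for remaining operations")
        scheduled.update(word)
        remaining = [i for i in remaining if i not in scheduled]
        words.append([ops[i].name for i in word])
    return words


ops = [
    Op("load_a", "mem", "r1", ("p0",)),
    Op("load_b", "mem", "r2", ("p1",)),
    Op("mul",    "mul", "r3", ("r1", "r2")),
    Op("add",    "alu", "r4", ("r1", "r5")),
    Op("store",  "mem", "m0", ("r3",)),
]
print(pack_vliw(ops, {"mem": 2, "alu": 1, "mul": 1}))
# [['load_a', 'load_b'], ['mul', 'add'], ['store']]
```

With only one memory unit per word, the same operations require four words instead of three, which illustrates how the architectural constraints shape the packing. A greedy pass of this kind does not guarantee an optimal schedule.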
Generally, state-of-the-art VLIW instruction schedulers are still not intelligent enough to generate optimally scheduled code. Therefore, hand-scheduling is required to further exploit ILP and the underlying architecture. It is, however, well known that hand coding and scheduling at the assembly language level is an arduous and error-prone task. Hence, a method and apparatus are needed that can make the job of low-level optimization easier and less error-prone.