The development of software applications typically involves writing software code in a high-level programming language and translating the code into a lower-level machine language that can be executed by a computer system. Many so-called “compiler” applications exist to effectuate the translation from the high-level “source code” into lower-level “executable code.” These compilers may implement many different types of functionality, for example, functionality that enhances the efficiency of the generated code through software pipelining, instruction scheduling, and other techniques.
Multi-core and multi-threaded architectures have become very popular in recent years. On these systems, more than one thread (i.e., instruction execution stream) can run simultaneously on a core, so that a core's computing resources can be shared by more than one thread. However, throughput-oriented multi-core and multi-threaded architectures tend to sacrifice single-thread performance due to resource sharing and potentially increased instruction latencies.
Traditional software pipelining and instruction scheduling are tuned to produce optimal code for single-thread execution. Accordingly, instruction sequences are generated in an attempt to use all the resources of the core and keep the pipeline busy by covering the full instruction latencies of the pipeline. However, such optimal single-thread binaries may not be optimal when many threads are sharing the computing resources of a core. There may be little advantage to covering full instruction latencies when core resources are shared by multiple threads, and using the full instruction latencies in the software pipelining and instruction scheduling algorithms may negatively impact performance. For example, using single-thread optimizations in a multi-threaded environment can result in increased register spilling and reloading, excessive loop unrolling, and/or other undesirable side effects.
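By way of a rough illustration only, the relationship between the latency a scheduler tries to cover and the resulting register pressure can be sketched with a simplified model. The model is an assumption and is not taken from this document: a software-pipelined loop that starts a new iteration every `ii` cycles is taken to need ceil(latency / ii) overlapped iterations to cover a given latency, with each overlapped iteration holding its own copy of every long-lived value. All function names and numbers below are hypothetical.

```python
import math


def iterations_in_flight(assumed_latency: int, ii: int) -> int:
    """Overlapped loop iterations needed to cover `assumed_latency` cycles
    when a new iteration starts every `ii` cycles (simplified model)."""
    return math.ceil(assumed_latency / ii)


def live_registers(assumed_latency: int, ii: int, values_per_iteration: int) -> int:
    """Registers needed if every in-flight iteration keeps its own copy
    of each long-lived value (hypothetical simplification)."""
    return iterations_in_flight(assumed_latency, ii) * values_per_iteration


def spilled_values(assumed_latency: int, ii: int,
                   values_per_iteration: int, available_registers: int) -> int:
    """Values that no longer fit in registers and would have to be
    spilled to memory and reloaded."""
    needed = live_registers(assumed_latency, ii, values_per_iteration)
    return max(0, needed - available_registers)


# Hypothetical numbers: a 20-cycle latency covered in full versus a
# scheduler that assumes only 8 cycles because other threads fill the gaps.
full = spilled_values(assumed_latency=20, ii=2,
                      values_per_iteration=4, available_registers=32)
reduced = spilled_values(assumed_latency=8, ii=2,
                         values_per_iteration=4, available_registers=32)
print(full, reduced)
```

Under these assumed numbers, covering the full latency requires ten overlapped iterations and more registers than are available, whereas the reduced assumed latency requires only four overlapped iterations and fits within the register file. An actual compiler would use a far more detailed machine model; the sketch is intended only to show the direction of the effect.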