One key function of a compiler in computer architecture is to generate object code (or machine code) from a particular sequence of program instructions, i.e., any type of software program or module, whether written in a high-level language, in assembler language, or as previously compiled machine code. The processor then reads the object code and executes its instructions.
However, object code sometimes uses computer resources inefficiently, which lengthens the time needed to execute the program. Two areas of inefficiency are particularly problematic and are addressed by this disclosure.
(1) If the processor has an instruction decoder that limits the instructions processed each cycle based on the locations of the instruction bytes, then the decoder might not decode instructions as quickly as the processor can execute them. Examples include AMD and Intel x86-based CPUs, which operate on two aligned 16-byte sets of instruction bytes each cycle.
(2) A processor typically has a load/store unit, which moves data in both directions between the execution unit and the data cache. For certain programs, idiosyncratic behavior of the load/store unit prevents it from achieving the full throughput of which it is capable, and the program is slowed down as a result. An alteration of the program as described herein can avoid this behavior and achieve full throughput. In particular, the AMD Family 10h processors, and other processor families with similar architecture (referred to hereafter as K10), have a significant bottleneck at the load/store unit, for which a specific remedy is disclosed herein. The AMD Family 15h processors, and other processor families with similar architecture (referred to hereafter as K15), are known to have a different bottleneck. However, the general approach given in this disclosure may be used to improve program performance on these processors as well.
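The decoder limitation described in item (1) can be illustrated with a toy model. The sketch below is not taken from this disclosure and does not describe any real decoder in detail. It assumes that each cycle the decoder sees the aligned 16-byte set containing the next undecoded byte plus the following aligned set, and that it decodes only instructions lying entirely within those bytes. The function name `decode_cycles` and the cap `max_per_cycle=4` are hypothetical choices made for illustration.

```python
WINDOW = 16  # bytes in one aligned set of instruction bytes (from item (1))


def decode_cycles(lengths, start=0, windows_per_cycle=2, max_per_cycle=4):
    """Count decode cycles for a sequence of instructions under a toy model.

    lengths           -- byte length of each instruction, in program order
    start             -- address of the first instruction byte
    windows_per_cycle -- aligned 16-byte sets visible per cycle (2 per item (1))
    max_per_cycle     -- assumed cap on instructions decoded per cycle
                         (hypothetical; not specified in the text)
    """
    addr = start
    i = 0
    cycles = 0
    while i < len(lengths):
        # Address of the first byte beyond the sets visible this cycle.
        limit = (addr // WINDOW + windows_per_cycle) * WINDOW
        decoded = 0
        # Decode instructions that fit entirely within the visible bytes.
        while (i < len(lengths) and decoded < max_per_cycle
               and addr + lengths[i] <= limit):
            addr += lengths[i]
            i += 1
            decoded += 1
        if decoded == 0:
            raise ValueError("instruction does not fit in the visible sets")
        cycles += 1
    return cycles


if __name__ == "__main__":
    # Same four 8-byte instructions, placed at two different addresses.
    print(decode_cycles([8, 8, 8, 8], start=0))
    print(decode_cycles([8, 8, 8, 8], start=8))
```

Under these assumptions, four 8-byte instructions starting at address 0 decode in one cycle, while the same four instructions starting at address 8 need two cycles, because the fourth instruction extends past the two visible sets. The placement of instruction bytes alone, with no change to the instructions themselves, changes the decode rate in this model.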
Thus, it would be desirable to have a technique that improves program performance on these processors and maximizes the number of instructions that can be executed during each processor cycle.