Field of the Invention
The present invention generally relates to computer processing, and, more specifically, to an algorithm for vectorization and memory coalescing during compilation.
Description of the Related Art
Developers use compilers to generate executable programs from high-level source code. Typically, a compiler is configured to receive high-level source code of a program (e.g., written in C++ or Java), determine a target hardware platform on which the program will execute (e.g., an x86 processor), and then translate the high-level source code into assembly-level code that can be executed on the target hardware platform. This configuration provides the benefit of enabling the developers to write a single high-level source code program and then target that program for execution across a variety of hardware platforms, such as mobile devices, personal computers, or servers.
In general, a compiler includes three components: a front-end, a middle-end, and a back-end. The front-end is configured to ensure that the high-level source code satisfies programming language syntax and semantics, whereupon the front-end generates a first intermediate representation (IR) of the high-level source code. The middle-end is configured to receive and optimize the first IR, which may involve, for example, removing unreachable code, if any, included in the first IR. After optimizing the first IR, the middle-end generates a second IR for the back-end to process. The back-end then receives the second IR and translates the second IR into assembly-level code.
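By way of illustration only, the three-component arrangement described above can be sketched as a toy pipeline. The statement set ("print" and "return"), the tuple-based IR, the assembly mnemonics, and every function name below are hypothetical and chosen solely for this sketch; a production compiler is considerably more involved.

```python
# Toy three-stage compiler pipeline. The IR format, the statement set,
# and the mnemonics are all assumptions made for this illustration.

def front_end(source):
    """Check a toy syntax rule and emit a first IR as (op, arg) tuples."""
    first_ir = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        op, _, arg = line.partition(" ")
        if op not in {"print", "return"}:
            raise SyntaxError(f"unknown statement: {line!r}")
        first_ir.append((op, arg))
    return first_ir


def middle_end(first_ir):
    """Optimize the first IR by dropping statements that follow a return."""
    second_ir = []
    for op, arg in first_ir:
        second_ir.append((op, arg))
        if op == "return":
            break  # everything after this point is unreachable
    return second_ir


def back_end(second_ir):
    """Translate each IR statement into a made-up assembly mnemonic."""
    templates = {"print": "CALL print, {}", "return": "RET {}"}
    return [templates[op].format(arg) for op, arg in second_ir]


def compile_source(source):
    """Run the front-end, middle-end, and back-end in order."""
    return back_end(middle_end(front_end(source)))
```

In this sketch, a source listing in which a "print" statement follows a "return" statement yields assembly for the first two statements only, because the middle-end discards the unreachable statement before the back-end runs.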
The assembly-level code includes low-level assembly instructions that are directly executable on a processor that is part of the target hardware platform. As is well understood, the number of instructions included in the generated assembly-level code may be significantly larger than the number of statements included in the high-level source code. For example, the simple high-level source code statement “x=y+z” would likely be compiled into a series of assembly instructions that would include instructions for loading values for y and z from a memory subsystem included in the target hardware platform into registers of the processor, executing an addition of the values stored in the registers, and storing the sum of the values into another register. Although the processor is able to execute each of these assembly instructions at a rapid pace, different assembly instructions may reference the same or a similar area of memory, which, as set forth below in an example, introduces execution redundancies and/or inefficiencies within the target hardware platform.
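The expansion of “x=y+z” into several assembly instructions can be sketched as follows. The instruction names, the register names r1 through r3, and the small interpreter are hypothetical and serve only to make the example checkable. The final STORE that writes the sum back to memory is an assumption added for this sketch; the description above states only that the sum is placed in another register.

```python
# Hypothetical lowering of "dest=lhs+rhs" into a toy instruction list.

def lower_assignment(statement):
    """Lower a statement such as 'x=y+z' into toy assembly tuples."""
    dest, expr = statement.split("=")
    lhs, rhs = expr.split("+")
    return [
        ("LOAD", "r1", lhs.strip()),    # r1 <- memory[lhs]
        ("LOAD", "r2", rhs.strip()),    # r2 <- memory[rhs]
        ("ADD", "r3", "r1", "r2"),      # r3 <- r1 + r2
        ("STORE", dest.strip(), "r3"),  # assumed write-back of the sum
    ]


def run(program, memory):
    """Execute the toy instructions against a dict that stands in for memory."""
    registers = {}
    for instr in program:
        op = instr[0]
        if op == "LOAD":
            registers[instr[1]] = memory[instr[2]]
        elif op == "ADD":
            registers[instr[1]] = registers[instr[2]] + registers[instr[3]]
        elif op == "STORE":
            memory[instr[1]] = registers[instr[2]]
    return memory
```

Here, one source-level statement becomes four toy instructions, two of which access memory solely to fetch operands.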
Consider, for example, first, second, third and fourth assembly instructions that cause the processor to interface with the memory subsystem and read data stored in first, second, third and fourth adjacent segments, respectively, of a memory location. Consider also that a single assembly instruction, referred to herein as a “vectorized” assembly instruction, can be used in place of the first, second, third and fourth instructions. In particular, such a single vectorized assembly instruction, when executed, would exploit an available large-bandwidth memory operation that would cause the processor to simultaneously read the data stored in all four segments of the memory location, potentially reducing the number of processor cycles required to perform the reads by as much as a factor of four. Unfortunately, conventional compilers do not include the logic to identify these redundancies and effect such code replacements.
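The replacement described above can be pictured with the following naive sketch, which is offered only as an illustration and is not asserted to be the technique of the present invention. It assumes a hypothetical instruction format ("LOAD", destination register, base, byte offset), a 4-byte segment size, and a hypothetical "LOAD.V4" vectorized instruction. It considers only four back-to-back loads and ignores alignment, aliasing, and intervening instructions.

```python
# Naive peephole sketch: four adjacent scalar loads -> one vector load.
# The instruction format and the LOAD.V4 mnemonic are assumptions.

WORD = 4  # assumed size in bytes of each scalar load


def _is_adjacent_run(window):
    """Return True if window holds four LOADs of consecutive words."""
    if len(window) != 4 or any(instr[0] != "LOAD" for instr in window):
        return False
    dests = [instr[1] for instr in window]
    base, start = window[0][2], window[0][3]
    if len(set(dests)) != 4 or base in dests:
        return False  # repeated destination, or the base gets overwritten
    return all(
        instr[2] == base and instr[3] == start + k * WORD
        for k, instr in enumerate(window)
    )


def coalesce_loads(program):
    """Replace each run of four adjacent LOADs with a single LOAD.V4."""
    out = []
    i = 0
    while i < len(program):
        window = program[i:i + 4]
        if _is_adjacent_run(window):
            dests = tuple(instr[1] for instr in window)
            out.append(("LOAD.V4", dests, window[0][2], window[0][3]))
            i += 4
        else:
            out.append(program[i])
            i += 1
    return out
```

Under these assumptions, four loads from offsets 0, 4, 8 and 12 of the same base collapse into one instruction, while loads from non-adjacent offsets are left unchanged.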
Accordingly, what is needed in the art is a technique for generating more efficient assembly code.