1. Field
The present disclosure is generally directed to automatic vectorization. More particularly, the present disclosure is directed to partial vectorization.
2. Background Art
Modern microprocessors support Single Instruction Multiple Data (SIMD) instructions. SIMD instructions enable microprocessors to exploit data-level parallelism. Specifically, a SIMD instruction performs the same action simultaneously on two or more pieces of data.
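The SIMD semantics described above can be illustrated with a minimal sketch. The lane width and the `simd_add` model below are assumptions for illustration only; real SIMD widths depend on the instruction set (for example, four 32-bit lanes in a 128-bit register).

```python
VECTOR_WIDTH = 4  # assumed lane count for illustration

def simd_add(a, b):
    """Model a single SIMD add: the same operation applied to every lane at once."""
    assert len(a) == len(b) == VECTOR_WIDTH
    return [x + y for x, y in zip(a, b)]

# Scalar code requires four separate add instructions:
scalar_results = [1 + 5, 2 + 6, 3 + 7, 4 + 8]

# One modeled SIMD add produces the same four results in a single operation:
vector_result = simd_add([1, 2, 3, 4], [5, 6, 7, 8])
print(vector_result)  # [6, 8, 10, 12]
```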
There are two ways to utilize the SIMD capabilities of a microprocessor. First, a programmer can write SIMD assembly language instructions. Second, a compiler can perform autovectorization. Autovectorization is a compiler transformation that automatically generates SIMD instructions for a program loop or a sequentially executing block of instructions, e.g., a basic block.
The autovectorization of code other than program loops (e.g., basic blocks) has become increasingly important in maximizing program performance. The autovectorization of basic block code is called partial vectorization or partial “simdization.” Partial vectorization has been demonstrated to improve performance in many independent studies.
Partial vectorization involves analyzing a basic block and identifying groups of identical instructions which can be executed independently of each other. These groups of instructions are converted to one or more vector instructions. The number of vector instructions generated is based on the width of the vector registers of a microprocessor.
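The grouping step described above can be sketched as follows. This is a minimal illustration, not an actual compiler pass: the tuple representation of instructions, the assumption that the listed instructions are independent, and the lane count are all hypothetical.

```python
from collections import defaultdict

VECTOR_LANES = 4  # assumed: e.g., a 128-bit register holding four 32-bit values

def pack_basic_block(instructions):
    """Group identical, independent instructions and pack full groups into
    vector instructions; leftovers remain scalar.

    instructions: list of (opcode, dest, src1, src2) tuples, assumed independent.
    """
    by_opcode = defaultdict(list)
    for inst in instructions:
        by_opcode[inst[0]].append(inst)

    vector_insts, scalar_insts = [], []
    for opcode, group in by_opcode.items():
        # Emit one vector instruction per full group of VECTOR_LANES lanes.
        while len(group) >= VECTOR_LANES:
            lanes, group = group[:VECTOR_LANES], group[VECTOR_LANES:]
            vector_insts.append(("v" + opcode, lanes))
        scalar_insts.extend(group)  # instructions that could not be packed
    return vector_insts, scalar_insts

block = [("add", "a", "x0", "y0"), ("add", "b", "x1", "y1"),
         ("add", "c", "x2", "y2"), ("add", "d", "x3", "y3"),
         ("mul", "e", "x4", "y4")]
vec, sca = pack_basic_block(block)
# Four identical adds become one vector add; the lone mul stays scalar,
# so the resulting block mixes vector and scalar instructions.
```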
If a basic block does not contain groups of identical instructions which can be executed independently of each other, then partial vectorization is not applied to the basic block. Conversely, if a basic block contains groups of identical instructions which can be executed independently of each other, then those instructions are converted to vector instructions. Instructions in the basic block which are not vectorized are called scalar instructions. Thus, a basic block may contain both vector and scalar instructions.
Traditional approaches to partial vectorization have suffered from scalability issues. Specifically, it has been challenging to apply partial vectorization algorithms because analyzing basic blocks with large numbers of instructions is time-consuming. Currently, there are two dominant approaches to partial vectorization: dynamic programming algorithms and greedy algorithms.
Dynamic programming applies a bottom-up approach to partial vectorization. Specifically, the input basic block is represented as a Directed Acyclic Graph (DAG). The DAG is built by connecting every variable's definition to its uses. Dynamic programming is then applied to the DAG. Specifically, the packing and unpacking costs of vectorization are propagated recursively from the leaves of the DAG up through their parents to the roots.
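The bottom-up cost propagation described above can be sketched as follows. The DAG encoding, the unit packing cost, and the unit per-operation savings are all hypothetical values chosen for illustration; a real cost model would come from the target machine.

```python
PACK_COST = 1        # assumed cost to gather scalars into a vector register
SAVINGS_PER_OP = 1   # assumed savings per scalar operation replaced

def vector_benefit(group, memo=None):
    """Estimate the benefit of vectorizing a group of isomorphic DAG nodes.

    Each node is a tuple: ("leaf",) for a scalar input, or (op, child, ...).
    Benefits are memoized and computed for children first, so costs
    propagate from the leaves of the DAG toward the roots.
    """
    if memo is None:
        memo = {}
    if group in memo:
        return memo[group]
    if len({node[0] for node in group}) != 1:
        memo[group] = 0          # not isomorphic: cannot pack together
        return 0
    if group[0][0] == "leaf":
        memo[group] = -PACK_COST  # scalars must be packed into a vector
        return -PACK_COST
    # Replacing len(group) scalar ops with one vector op saves len(group) - 1 ops,
    # plus whatever the corresponding child groups contribute.
    benefit = SAVINGS_PER_OP * (len(group) - 1)
    for i in range(1, len(group[0])):
        child_group = tuple(node[i] for node in group)
        benefit += vector_benefit(child_group, memo)
    memo[group] = benefit
    return benefit

# Two isomorphic three-operation chains: the longer the matched expressions,
# the larger the propagated benefit, which is why this approach favors
# vectorizing the longest expressions in the basic block.
leaf = ("leaf",)
chain_a = ("neg", ("neg", ("neg", leaf)))
chain_b = ("neg", ("neg", ("neg", leaf)))
print(vector_benefit((chain_a, chain_b)))  # 3 ops saved - 1 pack cost = 2
```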
Dynamic programming generates the smallest number of packing and unpacking instructions. In addition, dynamic programming vectorizes the longest expressions in the basic block, thereby maximizing the number of instructions vectorized. However, because all independent and isomorphic expressions are compared with each other, compile time is extremely high, and dynamic programming does not scale to basic blocks containing hundreds of high-level statements.
Greedy programming, on the other hand, makes the locally optimal choice at each stage of vectorization in the hope of finding a global optimum. Greedy algorithms have reasonable compile time and scale to basic blocks containing hundreds of high-level statements. However, greedy programming does not usually produce an optimal solution, e.g., it does not maximize the number of instructions vectorized. Specifically, there is no guarantee that the longest expressions in the basic block will be vectorized, nor that the packing and unpacking costs of vectorization will be minimized.
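The locally optimal choice described above can be sketched as a single pass that pairs the first matching instructions it encounters, without comparing all candidate groupings as dynamic programming does. The instruction representation and pair-of-two packing are assumptions for illustration.

```python
def greedy_pack(instructions):
    """Greedily pack pairs of same-opcode instructions in one pass.

    instructions: list of (opcode, dest) tuples, assumed independent.
    Each instruction is paired with the first earlier unmatched instruction
    having the same opcode -- a locally optimal choice that may miss a
    better global grouping.
    """
    packed, pending = [], {}
    for inst in instructions:
        opcode = inst[0]
        if opcode in pending:
            packed.append(("v" + opcode, [pending.pop(opcode), inst]))
        else:
            pending[opcode] = inst
    scalars = list(pending.values())  # unmatched instructions remain scalar
    return packed, scalars

block = [("add", "a"), ("mul", "b"), ("add", "c"), ("add", "d")]
pairs, scalars = greedy_pack(block)
# The first two adds are paired immediately; the third add and the mul
# are left scalar even if a different grouping would have been better.
```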