1. Field of the Invention
This invention relates generally to the field of computer systems and more particularly to generating optimized vector instructions from high-level programming languages.
2. Description of the Related Art
High performance microprocessors use a variety of techniques to increase their performance. These techniques are designed to allow the microprocessors to execute a greater number of instructions per unit of time. One well-known technique is pipelining. Pipelined microprocessors execute instructions in stages, so that an initial stage of execution of one instruction can be performed while a subsequent stage of execution of an earlier instruction is performed. In this manner, portions of the execution of successive instructions are performed in parallel.
The use of pipelining techniques to increase parallelism in the execution of program instructions does have several drawbacks, however. Because some instructions in a program depend on instructions which precede them in program order, these instructions cannot be executed until the results of the preceding instructions are available. These dependencies may include data dependencies and control dependencies. (These dependencies are well known in the art and will not be described in detail here.) In a pipelined microprocessor, the number of dependencies in the pipeline increases as the depth of the pipeline increases, potentially causing more stalling of the microprocessor and thereby reducing its efficiency. Additionally, as the speed of a pipelined microprocessor is increased, it becomes more and more difficult to fetch and decode instructions rapidly enough to fill the pipeline. This may create a bottleneck in the microprocessor.
Another technique for increasing the performance of a microprocessor is to configure the microprocessor to perform vector processing. Vector processing consists of performing an operation on an array of data rather than on a single datum. For example, where a non-vector microprocessor might multiply a first value by a second value to produce a third value, a vector microprocessor would multiply a first array of values by a second array of values to produce a third array of values. Thus, a single vector operation on one or more n-element vectors (i.e., arrays) can replace an n-iteration loop which executes a non-vector operation.
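By way of illustration only, the contrast between the two forms may be sketched in Python as follows. The function names are hypothetical and do not appear in the description above, and the second function merely models a vector instruction in software; it is not an actual hardware vector operation.

```python
def scalar_multiply(a, b):
    """Non-vector form: an n-iteration loop, one multiply per iteration."""
    result = []
    for i in range(len(a)):
        result.append(a[i] * b[i])
    return result


def vector_multiply(a, b):
    """Models a single vector operation on two n-element vectors.

    Assumption: a software stand-in only. On a vector microprocessor the
    whole element-wise multiply would be specified by one instruction.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of elements")
    return [x * y for x, y in zip(a, b)]


first = [1, 2, 3, 4]
second = [5, 6, 7, 8]

# Both forms specify the same calculation and yield the same third array.
print(scalar_multiply(first, second))  # [5, 12, 21, 32]
print(vector_multiply(first, second))  # [5, 12, 21, 32]
```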
Vector operations can have a number of advantages over non-vector operations. As indicated above, a single vector instruction can specify the same calculations as a loop executing a non-vector instruction. As a result, fewer instructions need to be fetched and decoded, thereby eliminating a potential bottleneck in the microprocessor. Control hazards which may be generated in a loop are also eliminated. Further, execution of the specified operation on each element of the vector is independent of the other elements. Therefore, execution of the vector operation does not create data hazards at runtime. Still further, if the vector operation involves a memory access, the access pattern is typically well-defined and, since the entire vector is accessed at once, the latency of the access may be reduced.
If a loop will perform many iterations, it is clear that larger vectors will tend to maximize the benefit of the vector operations. In other words, the more operations that can be processed as a single vector instruction, the better. Much of the development of vector processors has therefore focused on vectors having a relatively large number of elements (e.g., eight or sixteen). Further, the development of compilers which vectorize software programs has focused on the conversion of loops to one or more vector instructions. For example, if a vector processor handles eight-element vectors, a 50-iteration loop can be processed as seven vector instructions (six operating on full eight-element vectors, and one operating on vectors having only two valid elements).
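The arithmetic of the foregoing example may be sketched as follows. This is a minimal illustration only; the function name and its parameters are hypothetical and are not drawn from any particular compiler.

```python
def vector_instruction_plan(iterations, vector_width):
    """Return the number of valid elements in each vector instruction
    needed to cover a loop of the given iteration count."""
    if iterations < 0 or vector_width <= 0:
        raise ValueError("iterations must be >= 0 and vector_width > 0")
    full, remainder = divmod(iterations, vector_width)
    plan = [vector_width] * full   # instructions operating on full vectors
    if remainder:
        plan.append(remainder)     # final, partially filled vector
    return plan


plan = vector_instruction_plan(50, 8)
print(len(plan))  # 7
print(plan)       # [8, 8, 8, 8, 8, 8, 2]
```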
A number of factors have caused the vectorization of non-loop instructions to remain largely undeveloped. One of these factors is that the values used in vector operations should be “paired” (adjacent in memory). As indicated above, instructions within loops typically have well-ordered memory addresses and well-defined access patterns. Non-loop instructions, however, typically are not so ordered. Another of these factors is that the realignment of the elements in the vectors should be minimized and, while looped instructions typically repetitively access data in the same order, the order in which non-loop instructions access data may vary widely. The optimization of generalized instructions (including non-loop instructions) has therefore been quite difficult.
It should be noted that generalized parallel processing systems do not solve these problems in generating vectorized code. While generalized parallel processing systems are intended to maximize the number of operations performed in parallel, it is not necessary for these systems to manage the storage of data. In other words, it is not necessary to store and retrieve data in a way which is convenient for vector operations (e.g., storing vector data in adjacent memory locations or realigning vector data).
One or more of the problems described above may be solved by the various embodiments of the invention. Broadly speaking, the invention comprises a method for vectorizing code. One embodiment comprises a method for compiling source code to produce vector instructions. The method vectorizes non-loop source code instructions as well as instructions which form loops in the source code. The method is directed to two-element vectorization (i.e., selecting pairs of operations for execution in parallel). Because two operations are executed in parallel (rather than a larger number), the method is well-suited to maximize the number of operations performed in parallel in many different types of code. Based on the operations which are selected for parallel execution, memory locations are assigned to the corresponding operands so that parallel operations operate on data which are adjacent in memory. The memory locations are assigned in a way which minimizes realignment of the data (i.e., swapping positions of two operands). Another embodiment comprises a software program (e.g., a vectorizing compiler) which examines a block of program code, analyzes the operators within the code and generates vectorized code in accordance with the foregoing method. Many additional embodiments are possible, and will be apparent to persons of skill in the art of the invention.
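The assignment of adjacent memory locations to the operands of paired operations may be illustrated, in greatly simplified form, by the following sketch. It is not the method of any particular embodiment. The function names, the representation of an operation as a (destination, source, source) tuple, the use of numbered slots, and the first-come placement policy are all assumptions made for illustration. The sketch assumes the pairs have already been selected, and it only reports operand pairs that would need realignment rather than attempting to minimize them.

```python
def is_aligned_pair(layout, left, right):
    """True if left and right already occupy one two-element vector slot,
    in that order (even slot followed by the next odd slot)."""
    return (left in layout and right in layout
            and layout[left] % 2 == 0
            and layout[right] == layout[left] + 1)


def assign_adjacent_slots(pairs):
    """Assign slots so corresponding operands of paired operations are adjacent.

    pairs: list of (first_op, second_op); each op is (dest, src1, src2).
    Returns (layout, needs_realignment). Hypothetical policy: an operand
    keeps the first slot it receives; later conflicts are only recorded.
    """
    layout = {}
    needs_realignment = []
    next_slot = 0
    for first_op, second_op in pairs:
        for left, right in zip(first_op, second_op):
            if left != right and left not in layout and right not in layout:
                layout[left] = next_slot        # even slot
                layout[right] = next_slot + 1   # adjacent odd slot
                next_slot += 2
            elif not is_aligned_pair(layout, left, right):
                needs_realignment.append((left, right))
    return layout, needs_realignment


# Pair 1: c = a * b  with  f = d * e
# Pair 2: g = c + a  with  h = f + e
pairs = [(("c", "a", "b"), ("f", "d", "e")),
         (("g", "c", "a"), ("h", "f", "e"))]
layout, realign = assign_adjacent_slots(pairs)
print(layout)   # {'c': 0, 'f': 1, 'a': 2, 'd': 3, 'b': 4, 'e': 5, 'g': 6, 'h': 7}
print(realign)  # [('a', 'e')]
```

In this example the operands c and f are reused by the second pair without realignment because the first pair already placed them in adjacent slots, whereas a and e were placed with different partners and are therefore reported.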