Many scientific computer algorithms and applications are dominated by regular, repetitive floating-point calculations on sequences (or vectors) of numbers. For example, let X and Y be sequences of numbers with n elements (vectors of length n):
    X = x1, x2, ..., xn
    Y = y1, y2, ..., yn
The vector addition of these sequences is the pointwise operation that independently adds the corresponding components of X and Y. That is, the vector sum of X and Y is:
    X+Y = x1+y1, x2+y2, ..., xn+yn
Similarly, the vector product is the pointwise product of X and Y:
    X*Y = x1*y1, x2*y2, ..., xn*yn
It is important to note that the pointwise operations are completely independent of each other, meaning that xj*yj does not require the result of (does not depend upon) xi*yi, for any i not equal to j. Contrast this with a recurrence like the vector inner product, where the partial sum sj=xj*yj+s(j-1) directly depends upon the previous partial sum s(j-1). Inner product is also known as a reduction operator, because the result of a vector computation is reduced to a single number, in this case the last partial sum, sn. In practice, a recurrence is much more difficult to compute quickly because the computation cannot be pipelined.
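The difference between a pointwise operation and a recurrence can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and each inner-product term is taken to be xj*yj. In the first function every output element can be computed in any order; in the second, each partial sum needs the one before it.

```python
def vector_product(x, y):
    """Pointwise product: element j depends only on x[j] and y[j]."""
    return [xj * yj for xj, yj in zip(x, y)]


def inner_product_partial_sums(x, y):
    """Recurrence: each partial sum s_j = x_j*y_j + s_(j-1) needs s_(j-1)."""
    partial_sums = []
    s = 0.0
    for xj, yj in zip(x, y):
        s = xj * yj + s  # cannot start until the previous s is known
        partial_sums.append(s)
    return partial_sums


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
print(vector_product(x, y))              # [4.0, 10.0, 18.0]
print(inner_product_partial_sums(x, y))  # [4.0, 14.0, 32.0]
```

The last partial sum, 32.0, is the inner product, which is the single number the reduction produces.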
Vector sum, product, and inner product are two-input or dyadic vector operations. Single-input or monadic vector operations perform calculations on the components of a single vector and, typically, a single number, or scalar. For example, the vector scale operation multiplies each component of a vector by the same scalar:
    a*X = a*x1, a*x2, ..., a*xn
There are also monadic reduction operators, for example, adding all of the elements in a vector, finding the element of the greatest absolute value, or counting the number of non-zero elements. These very simple vector operations can be combined into slightly more complex ones. For example, SAXPY, the inner loop of the Linpack benchmark, is a composition of vector scale and vector sum:
    SAXPY(a, X, Y) = Y+(a*X) = y1+a*x1, y2+a*x2, ..., yn+a*xn
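As a minimal sketch of that composition, SAXPY can be written as a vector scale followed by a vector sum. The helper names below are hypothetical and plain Python lists stand in for vectors; a real vector unit would fuse the two steps rather than build an intermediate vector.

```python
def vector_scale(a, x):
    """Monadic operation: multiply every component of x by the scalar a."""
    return [a * xj for xj in x]


def vector_sum(x, y):
    """Dyadic operation: add corresponding components of x and y."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [xj + yj for xj, yj in zip(x, y)]


def saxpy(a, x, y):
    """SAXPY as a composition: Y + (a * X)."""
    return vector_sum(y, vector_scale(a, x))


print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```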
Generally speaking, vector operations are divided into two broad classes as suggested by the above examples:
Linear Combinations: e.g., vector sum, product, scale, difference, SAXPY.
Recurrences/Reductions: e.g., vector inner product, total, find maximum, count zeros.
An important property of many complex algorithms and applications is that they can be readily decomposed into sequences of simple vector operations. Put another way, simple independent vector operations, like SAXPY, can be used in a sequence of calculations to efficiently compute more complex functions, like the inverse of a matrix.
In practice, many complex scientific algorithms are vectorizable; they can be reformulated into sequences of simple vector operations such that 60-95% of all floating-point operations are performed as vector arithmetic. However, a non-vector, or scalar, processor does not, in general, benefit from vectorization because it still performs about the same amount of work. For example, vector sum Z=X+Y would be compiled on a scalar processor (in a stylized machine code) as:
          j = 1             ;; initialize j
    Loop: fetch X[j]        ;; read xj
          fetch Y[j]        ;; read yj
          add               ;; add them
          store Z[j]        ;; write zj
          j = j + 1         ;; increment j
          compare j, n      ;; test for end of vector
          jump-LEQ Loop     ;; repeat if j <= n
Although the computation is quite regular and repetitive, the scalar processor sequentially performs the memory reads, the arithmetic, store, and loop control. In contrast, a vector processor is a computer architecture that is especially suited to performing vector operations. A vector processor is associated with a scalar processor such that normal scalar processing can be performed but vector operations can be efficiently dispatched to the vector processor. The vector operation is performed very rapidly by the vector processor, and then the scalar processor can resume computation. While most vector processor architectures allow the scalar processor to continue computing during a vector operation, it is still appropriate to think of the vector processor as extending the instruction set of the scalar unit. That is, the scalar processor not only has an "add" instruction for adding two scalar quantities but also a "vadd" instruction for adding two vectors.
Frequently, a vector processor can perform a vector operation at least an order of magnitude faster than if the vector operation were performed by the scalar processor alone. This great increase in performance occurs because the vector processor architecture exploits the regular, repetitive structure of simple vector operations (especially linear combinations) by employing a highly specialized form of parallel processing. There are two basic techniques: pipelining and functional parallelism.
Pipelining is a design technique whereby each component of the vector is computed "assembly-line" fashion, so at any time several operations may be in various states of completion. The number of simultaneous operations is determined by the number of pipeline stages (or the depth). The rate at which results are computed (the throughput) is determined by the rate at which an operation can be advanced from one stage to the next, whereas the total time to complete an operation on a particular datum, called the latency, is directly proportional to the pipeline depth. Pipelining relies on the lack of dependence among elements of the result vector, so it works very well for linear combinations, but actually makes recurrences run more slowly (recurrences are limited by the latency, which is always several times worse than the throughput).
Functional Parallelism is a design technique whereby the different aspects of processing a vector are performed by function units that operate in parallel. The principal function units in a vector processor are:
Address Generation, the computation of memory addresses for vector data, including virtual address translation.
Memory READ/WRITE, the actual control of the system memory.
Numeric Pipeline, the unit that performs the arithmetic operation.
Loop Control, the control of the number of vector operations to perform.
There are two types of vector processors based on whether vector operations take place directly on vectors stored in memory (a memory-to-memory architecture) or whether the vectors are first loaded into vector registers, the operation is performed on the registers and then the result is written back to memory (a register-to-register architecture). Indeed, there are two design camps: the CDC STAR 100, CYBER 205, and ETA 10 are memory-to-memory architectures, whereas all of the CRAY machines and most other vector processors subscribe to the register-to-register philosophy.
It is not immediately obvious why the register-to-register machines should prevail. Indeed, it is widely acknowledged that memory-to-memory architectures are more "expressive" and easier to compile to. Register-to-register machines require explicit loading and unloading of vector registers and, because vector registers have a small, fixed length (typically 64 elements per vector register), long vectors must be broken up into pieces, or stripmined, by the scalar processor.
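Stripmining can be sketched as follows. This is an illustrative model, not any particular machine's code: the names are hypothetical, the 64-element register length is the typical figure quoted above, and Python list slices stand in for vector register loads and stores.

```python
VECTOR_REGISTER_LENGTH = 64  # typical figure from the text; an assumption here


def stripmine(n, register_length=VECTOR_REGISTER_LENGTH):
    """Return (start, length) pairs that cover n elements in register-sized strips."""
    return [(start, min(register_length, n - start))
            for start in range(0, n, register_length)]


def stripmined_vector_sum(x, y, register_length=VECTOR_REGISTER_LENGTH):
    """Compute Z = X + Y one strip at a time, as the scalar processor would direct."""
    z = [0.0] * len(x)
    for start, length in stripmine(len(x), register_length):
        vx = x[start:start + length]           # load a vector register from X
        vy = y[start:start + length]           # load a vector register from Y
        vz = [a + b for a, b in zip(vx, vy)]   # register-to-register add
        z[start:start + length] = vz           # store the result strip
    return z


print(stripmine(150))  # [(0, 64), (64, 64), (128, 22)]
```

A 150-element vector therefore takes three vector operations, the last one on a short strip of 22 elements.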
A well-designed register-to-register architecture overcomes these shortcomings with two techniques. Chaining permits the loading and storing of vectors to and from registers to occur in parallel with operations on vector registers. By allowing the scalar processor to continue execution during a vector operation, the stripmining computation can be "hidden" during the vector operation.
Conversely, memory-to-memory machines have suffered from two hard design problems. First, there is an increased requirement for main memory bandwidth. In a register-to-register machine, intermediate or temporary vectors, during a sequence of vector operations, can often be stored in a vector register; whereas the memory-to-memory machine places the temporary vector back into main memory only to fetch it again, possibly on the very next vector operation. Second, some designs suffer from excessive latency in the memory system. That is, it takes a relatively long time to get the first element of a vector from memory. The same is true of a register-to-register machine, except that when a vector is in a register the latency is much lower and chaining can sometimes be used to help mask memory latency.
Of course, real applications seldom comprise only vector operations. There are always aspects of the computation which do not match the capabilities of the vector processor. For example, the computation may not be regular, may have a significant amount of I/O, may operate on data types that the vector processor cannot handle (like characters) or may be a true sequential process (like first-order recurrences). Those portions of the computation which cannot be vectorized are called scalar or nonvectorizable.
The nonvectorizable portion of a computation sets a fundamental limit on how much a vector processor will speed up an application. The governing relation is called Amdahl's Law, after Gene Amdahl, the architect of the IBM 360. Amdahl's Law is best understood with a simple example. Suppose that a program is 90% vectorizable, that is, 90% of the computation matches the capabilities of a vector processor whereas 10% is nonvectorizable and must be executed by the scalar processor. Now even if the vector unit were infinitely fast, the computation could only be sped up by a factor of ten. The vector unit does not affect the speed at which the processor works on the scalar sections of the program, so the execution time is dominated by the scalar code. If the vector unit is only ten times faster than the scalar processor (a common case), then the program runs only about five times faster, roughly half of the time being devoted to vector processing and the other half to scalar. Amdahl's Law is given by the following formula:
    speedup = 1 / ((1 - V) + V / vspeed)
where
    vspeed = the relative rate of vector vs. scalar processing
    V = the fraction of vector operations (%vectorizable/100)
Thus, the expected performance increase is a nonlinear function of the vectorizability. For instance, with V=0.5 (50% vectorizable) the very fastest vector processor could offer a total speed-up of only 2x. For a program that is 99% vectorizable, an infinitely fast vector unit would offer a hundred-fold performance improvement while a more modest ten-times vector unit would offer almost a ten-fold increase in speed.
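The figures quoted above can be checked with a short calculation. The sketch below assumes the usual form of Amdahl's Law, speedup = 1 / ((1 - V) + V / vspeed), with V and vspeed as defined in the text; the function name is hypothetical.

```python
import math


def amdahl_speedup(v, vspeed):
    """Overall speedup for vectorizable fraction v and vector/scalar speed ratio vspeed."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("v must be a fraction between 0 and 1")
    if vspeed <= 0:
        raise ValueError("vspeed must be positive")
    # Scalar share of the original run time plus the shortened vector share.
    remaining_time = (1.0 - v) + v / vspeed
    return math.inf if remaining_time == 0 else 1.0 / remaining_time


for v, vspeed in [(0.9, math.inf), (0.9, 10), (0.5, math.inf),
                  (0.99, math.inf), (0.99, 10)]:
    print(f"V={v}, vspeed={vspeed}: speedup = {amdahl_speedup(v, vspeed):.2f}")
```

With V=0.9 and vspeed=10 the result is about 5.26, and with V=0.99 and vspeed=10 it is about 9.17, in line with the "five times" and "almost ten-fold" figures in the text.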
Often, the speed-up of an otherwise heavily vectorizable program is not even as good as Amdahl's Law predicts, because the vector unit does not consistently speed up all vector operations. The usual culprit is the average vector length on which the vector unit is asked to operate. All vector processors incur a fairly fixed overhead for starting any vector operation, called the start-up time. If a vector is short, then the start-up time dominates the computation. Hockney has quantified this overhead in terms of the half-power point of a vector unit: the length of vector required such that the start-up time and the time spent actually computing the elements are equal. In other words, the Hockney number is the length of vector required for the vector processor to achieve half of its peak performance.
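The half-power point follows from a simple linear timing model, sketched below. The model is an assumption made for illustration: a vector operation of length n is taken to cost a fixed start-up time plus a constant time per element, and all names and numbers are hypothetical.

```python
def vector_time(n, startup, per_element):
    """Assumed model: fixed start-up cost plus a constant cost per element."""
    return startup + n * per_element


def fraction_of_peak(n, startup, per_element):
    """Achieved rate n / time, relative to the peak rate 1 / per_element."""
    return (n * per_element) / vector_time(n, startup, per_element)


def half_power_length(startup, per_element):
    """Vector length at which start-up time equals time spent computing elements."""
    return startup / per_element


startup, per_element = 20.0, 1.0  # hypothetical cycle counts
n_half = half_power_length(startup, per_element)
print(n_half)                                          # 20.0
print(fraction_of_peak(n_half, startup, per_element))  # 0.5
print(fraction_of_peak(5, startup, per_element))       # 0.2
```

Under this model a 5-element vector reaches only one fifth of peak performance, which shows how short vectors are dominated by start-up time.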
The start-up time can come from several sources. The pass-off time to the vector unit is the amount of time required for the scalar processor to set up a vector operation. If the vectors are being retrieved from memory, then the memory latency, the time between a memory request and a response, can be a significant factor. The fill time is the time required for the first element of a vector to make its way all the way through the numeric pipeline, and is directly proportional to the pipeline depth. The shutdown time must also be considered; it consists mainly of resynchronization with the scalar processor.
A high Hockney number (i.e., only long vectors perform well) may affect the overall program speedup as strongly as the percentage vectorization. In practice, a Hockney number of about 10 elements is considered very good, 20 is usual, and above 30 or so becomes marginal for a number of applications. Traditionally, memory-to-memory machines have had far worse Hockney numbers than register-to-register designs, often experiencing a half-power point at 50 or even 100 elements.
A number of other factors influence the effectiveness of a vector processor, but the most important seems to be the organization and management of the main memory system. Main memory bandwidth and latency are the two important metrics. Insufficient bandwidth will starve the vector pipeline and yield low sustained (long-vector) performance. High latency can have a very negative effect on the Hockney number and thus cause short vectors to suffer.
Obtaining high bandwidth with tolerable latency is the real design challenge of a vector processor, especially when large amounts of memory are required. When little main memory is needed, say for signal processing, then very fast but expensive static RAM can solve both problems. Main memories are often interleaved into separate banks so that several memory requests can be processed simultaneously to increase the effective memory bandwidth. But generally speaking, latency tends to be proportional (relative to bandwidth) to the number of interleaves, and the memory system becomes susceptible to bank conflicts. Bank conflicts arise when one memory interleave (or bank) is accessed before it has finished its previous request. Normally, memory is interleaved so that contiguous locations fall into different banks. Thus, when a vector is accessed sequentially (stride one), each element is retrieved from a different bank, modulo the number of banks. Non-contiguous access can occur in several circumstances. A constant stride operation may pick up every Nth element. If the number of banks is not relatively prime to N, then bank conflicts are guaranteed to arise. Additionally, a scatter/gather operation uses an indirection vector or a bit mask to determine the elements that are to be accessed, causing nearly random requests to be made of the memory system.
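The effect of stride on bank usage can be illustrated with a small model. It assumes simple low-order interleaving, in which word address a falls in bank a mod B for B banks; the function names are hypothetical. Under that assumption a stride-N access visits only B / gcd(N, B) distinct banks, so any common factor between N and B concentrates requests on fewer banks. Whether a request actually stalls also depends on how long each bank stays busy, which this sketch does not model.

```python
from math import gcd


def bank_sequence(stride, num_banks, count):
    """Banks visited by the first `count` elements of a constant-stride access."""
    return [(j * stride) % num_banks for j in range(count)]


def banks_touched(stride, num_banks):
    """Number of distinct banks a constant-stride access ever visits."""
    return num_banks // gcd(stride, num_banks)


print(bank_sequence(1, 8, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(bank_sequence(2, 8, 8))  # [0, 2, 4, 6, 0, 2, 4, 6]
print(banks_touched(3, 8))     # 8
print(banks_touched(8, 8))     # 1
```

With 8 banks, stride 3 still spreads requests over every bank, while stride 8 sends every request to the same bank.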
For scalar processors, caches have been employed to aid both memory bandwidth and latency. Caches rely on locality of data reference, a property that does not seem to hold for a number of important applications. A cache may help when a lot of operations are performed on a small set of short vectors, but has a strong negative effect when the size of the vectors is larger than the cache. In practice, caches are a poor solution to the vector memory bandwidth/latency problem. There is no substitute for honest memory bandwidth and a "tight" low-latency memory protocol.
Although they perform relatively simple and regular operations, vector processors are often very complex. As indicated above, a vector processor comprises several function units: address generation, main memory, numeric pipeline, and loop control. For a register-to-register architecture, the multiported vector register file must also be considered, as well as the added complexity to chain load/store operations with register-to-register operations. In addition, if the scalar processor has a virtual memory system, then the virtual-to-physical translation must be accomplished at vector speed and the possibility of a page fault mid-vector must also be accounted for. Finally, if the scalar processor has a cache, then the scalar cache must be kept consistent, or coherent, with the operation of the vector unit.
An alternative to the vector extension closely integrated with a scalar processor is a separate but attached array processor which performs the vector processes. An array processor is traditionally a distinct, microcoded processor with its own private memory subsystem. To use an array processor, a scalar processor must copy the data on which the array processor is to operate into the array processor memory, configure the array processor to perform the required computation, and then synchronize on the completion of the array processor operation.
The overhead factors in dispatching the vector process to the array processor and subsequent synchronization can contribute to extraordinarily high Hockney numbers. As such, the array processor philosophy is tuned for solving large problems directly at the expense of efficiently performing simple, short vector operations. A sophisticated compiler would be required to identify the complex processes for which the array processor may be used efficiently. As a result, the array processor is usually only selected by explicit programming. To make a problem fit on an array processor, the application writer relies on the foresight of the microcoder to provide just the right function. This often is not the case, and the application writer is often precluded, due to concerns with the cost of overhead relative to the benefits of use of the array processor, from decomposing the problem into simple vector operations to be executed by the array processor. The other choice is for the programmer to write new microcode to solve his special problem, an arduous task at best.
Design complexities have, to date, prevented the deep integration of vector processing with inexpensive processors like those found in personal computers. Thus, specialized vector processing in the personal computer environment has been limited to the attached array processor architecture.