Most general purpose computer systems are built around a general-purpose processor, which is typically an integrated circuit operable to perform a wide variety of operations useful for executing a wide variety of software. The processor is able to perform a fixed set of instructions, which collectively are known as the instruction set for the processor. A typical instruction set includes a variety of types of instructions, including arithmetic, logic, and data instructions.
In more sophisticated computer systems, multiple processors are used, and one or more processors runs software that is operable to assign tasks to other processors or to split up a task so that it can be worked on by multiple processors at the same time. In such systems, the data being worked on is typically stored in memory that is either centralized, or is split up among the different processors working on a task.
Instructions from the instruction set of the computer's processor or processor that are chosen to perform a certain task form a software program that can be executed on the computer system. Typically, the software program is first written in a high-level language such as “C” that is easier for a programmer to understand than the processor's instruction set, and a program called a compiler converts the high-level language program code to processor-specific instructions.
In multiprocessor systems, the programmer or the compiler will usually look for tasks that can be performed in parallel, such as calculations where the data used to perform a first calculation are not dependent on the results of certain other calculations such that the first calculation and other calculations can be performed at the same time. The calculations performed at the same time are said to be performed in parallel, and can result in significantly faster execution of the program. Although some programs such as web browsers and word processors don't consume a high percentage of even a single processor's resources and don't have many operations that can be performed in parallel, other operations such as scientific simulation can often run hundreds or thousands of times faster in computers with thousands of parallel processing nodes available.
Multiple operations can also be performed at the same time using one or more vector processors, which perform an operation on multiple data elements at the same time. For example, rather than instruction that adds two numbers together to produce a third number, a vector instruction may add elements from a 64-element vector to elements from a second 64-element vector to produce a third 64-element vector, where each element of the third vector is the sum of the corresponding elements in the first and second vectors.
In this example, the vector registers each hold 64 elements, so the vector length is said to be 64. The vector processor can handle sets of data smaller than 64 by using a vector length register specifying that some number fewer than 64 elements are to be processed, or can handle sets of data larger than 64 elements by using multiple vector operations to process all elements in the data set, such as by using a program loop.
The vectors in some further examples do not operate on elements that are sequential in memory, but instead operate on elements that are spaced some distance apart, such as on certain elements of a large array for scientific computing and modeling applications. This distance between elements in a vector is referred to as the stride, such that sequential words from memory have a stride of one, whereas a vector comprising every sixteenth element in memory has a stride of 16.
Vector processing provides other benefits to program efficiency, but at the cost of significant load or startup time relative to a scalar operation. Although the vectors must be completely loaded from memory before functions can be performed on the elements, other steps such as checking for variable independence need only be performed once for an entire vector operation. Instruction and coding efficiency are also improved with vector operations, as is memory access where the vector has a known or consistent memory access pattern. Vector processor design choices such as vector length consider these efficiencies and tradeoffs in an attempt to provide both good scalar operation performance and efficient vector operation.