To increase the speed of computations, computer systems often employ some form of parallel processing, such as multiprocessing or vector processing. For example, multiprocessing systems require that the programmer break a computation into multiple tasks that are executed in parallel by different processors. Because each processor executes a separate instruction stream on separate data, multiprocessors are traditionally characterized as utilizing a multiple-instruction, multiple-data (MIMD) model. In contrast to multiprocessing, vector processing often requires that a programmer break the computation's data into arrays (single or multidimensional) and instruct the system to execute a single instruction on multiple elements of the array in parallel. For this reason, vector processing is traditionally characterized as utilizing a single-instruction, multiple-data (SIMD) model.
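The task decomposition that multiprocessing asks of the programmer can be sketched as follows. This is only an illustrative sketch in Python: the function names `sum_of_squares`, `largest`, and `run_tasks` are hypothetical, and a thread pool is assumed as a stand-in for separate processors. It shows the MIMD structure (different instruction streams on different data), not a guarantee that the tasks run on distinct hardware processors.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(values):
    """First task: its own instruction stream, operating on its own data."""
    return sum(v * v for v in values)


def largest(values):
    """Second task: a different instruction stream on different data."""
    return max(values)


def run_tasks():
    """Submit two unrelated tasks and collect both results."""
    # Two workers stand in for two processors (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(sum_of_squares, [1, 2, 3, 4])
        second = pool.submit(largest, [7, 42, 5])
        # result() blocks until the corresponding task has finished.
        return first.result(), second.result()
```

Note that the programmer, not the system, decides how the work is split into tasks and when the results are gathered.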
Vector processing (also known as array processing) often requires that the programmer encode a program using a vector-programming language and execute the program on a vector processing system. A vector processing system may be implemented in different configurations and may include different numbers and/or types of processors. For example, a vector processing system may include one or more vector processors, such as graphics processing units (GPUs), each capable of concurrently executing a single instruction on multiple data elements. A vector processing system may additionally or alternatively include one or more scalar processors or cores configured to collectively implement vector processing.
In a vector processing programming model, the programmer may create a data structure that contains multiple data elements (e.g., an array of numbers) and write a single instruction that instructs the system to perform the same operation on each of the data elements in parallel. For example, the programmer may create two 64-element arrays and, using a single add instruction, instruct the vector processing system to add the corresponding elements of the two arrays. The programming model does not require that the programmer use loops to iterate over the elements, nor does it generally require that the programmer encode explicit communications between different threads of execution and/or processing elements. Instead, communication and synchronization are handled transparently, such as through hardware constructs and/or shared memory regions. The number of elements on which the system may operate in parallel is referred to as the system's vector width.
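The 64-element example above can be sketched in Python with NumPy, which is assumed here only as a convenient stand-in for a vector-programming language. Whether the add is actually carried out in parallel depends on the NumPy build and the underlying processor. The array contents, the name `strip_mined_add`, and the value of `VECTOR_WIDTH` are hypothetical and chosen purely for illustration.

```python
import numpy as np

# Two 64-element arrays, as in the example above (the values are illustrative).
a = np.arange(64, dtype=np.int32)      # 0, 1, ..., 63
b = np.full(64, 10, dtype=np.int32)    # 10, 10, ..., 10

# A single array-level add: no explicit loop over elements, and no explicit
# communication or synchronization written by the programmer.
c = a + b

# Hypothetical vector width; real systems fix this value in hardware.
VECTOR_WIDTH = 16


def strip_mined_add(x, y, width=VECTOR_WIDTH):
    """Add two 1-D arrays one width-sized chunk at a time.

    This mimics how a system with a limited vector width might process
    arrays longer than that width. It is a conceptual sketch only.
    """
    if x.shape != y.shape:
        raise ValueError("arrays must have the same shape")
    out = np.empty_like(x)
    for start in range(0, x.size, width):
        stop = start + width  # slicing clips the final, possibly short, chunk
        out[start:stop] = x[start:stop] + y[start:stop]
    return out
```

With a vector width of 16, the 64-element add is conceptually four chunk-wide operations, and `strip_mined_add(a, b)` yields the same values as `a + b`.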