An ever-increasing number of computational and modeling problems are being solved using parallel computing techniques. Parallel computing is a form of computation in which many calculations are carried out simultaneously using multiple processing resources, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently (“in parallel”). This is typically accomplished by breaking the problem into independent parts so that each processing element can execute its part of the algorithm simultaneously with the others. The processing elements can be diverse and may include a single computer with multiple processors, networked and/or clustered computers, specialized hardware, or any combination of these.
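The decomposition described above can be illustrated with a minimal sketch. The function names (`split`, `sum_of_squares`, `parallel_sum_of_squares`) and the choice of workload are assumptions made purely for illustration. A thread pool is used here to keep the example self-contained; for CPU-bound work in CPython, a process pool would normally be substituted to obtain true simultaneous execution.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into `parts` contiguous, non-overlapping slices."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """Work done by one processing element on its independent part."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Solve the sub-problems concurrently, then combine the partial results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1_000))))
```

Because the chunks do not overlap and no chunk depends on the result of another, the workers need no coordination until the final combining step.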
Various hardware and system architectures are employed for parallel computing tasks, including Symmetric Multiprocessing, multi-core processors, massively-parallel computer systems, and network-based architectures employing multiple locally clustered or geographically dispersed computers. Symmetric Multiprocessing (SMP) is a multiprocessor computer hardware architecture in which two or more identical processors are connected to a single shared main memory and are controlled by a single operating system (OS) instance. Many of today's multiprocessor systems employ an SMP architecture. In the case of multi-core processors, memory accesses by the processor cores and their associated cache levels are managed in a manner similar to SMP, with each core treated as a separate processor. Massively-parallel computer system architectures are often employed by supercomputers. These systems may employ an array of processors and/or multiple processor blades or servers with physically separate memory resources. Network-based architectures typically employ either a local cluster of computers or servers, or a network of geographically dispersed computers (also called distributed computing). For example, a computer cluster may be implemented via a blade server or rack server having blades or modules interconnected via network connections. Meanwhile, hundreds, thousands, or even greater numbers of geographically dispersed computers may be employed to perform specialized tasks in parallel, such as solving problems using a MapReduce framework.
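As an illustration of the MapReduce style of computation mentioned above, the following is a minimal single-process sketch of a word count. It is not the API of any particular MapReduce framework; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and in a real deployment the map and reduce calls would run on separate machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call is independent of the others, so the calls
    # could be distributed across many computers.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    # Each key is reduced independently as well.
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["a b a", "b c"]))
```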
One of the key aspects of parallel processing architectures is data coherency, which is necessary to ensure that during ongoing operations the multiple processing elements are accessing valid copies of data objects. For SMP architectures and multi-core processors, data coherency is typically implemented via memory coherency schemes, such as the well-known MESI (Modified, Exclusive, Shared, Invalid) memory coherency protocol or similar protocols. There are also techniques for assuring data coherency in distributed architectures, although these operate at a coarser level of granularity than those used in SMP or multi-core processor systems.
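The state transitions of a MESI-style protocol can be sketched as follows. This is a deliberately simplified model that tracks a single cache line across several caches and omits bus transactions, write-backs, and timing; the class name `LineStates` and its methods are assumptions for illustration, not part of any hardware specification.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"   # only valid copy, differs from main memory
    EXCLUSIVE = "E"  # only cached copy, matches main memory
    SHARED = "S"     # one of possibly several clean copies
    INVALID = "I"    # copy must not be used


class LineStates:
    """Tracks the MESI state of ONE cache line in each of n caches."""

    def __init__(self, n_caches):
        self.states = [State.INVALID] * n_caches

    def read(self, cache):
        if self.states[cache] is not State.INVALID:
            return  # read hit: no state change
        holders = [i for i, s in enumerate(self.states)
                   if i != cache and s is not State.INVALID]
        # Every other holder drops to Shared (a Modified holder would
        # first write its data back; that step is not modeled here).
        for i in holders:
            self.states[i] = State.SHARED
        self.states[cache] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, cache):
        # A write invalidates all other copies, so no cache can
        # subsequently read stale data.
        for i in range(len(self.states)):
            if i != cache:
                self.states[i] = State.INVALID
        self.states[cache] = State.MODIFIED


if __name__ == "__main__":
    line = LineStates(2)
    line.read(0)
    line.read(1)
    line.write(1)
    print([s.value for s in line.states])
```

The invariant the sketch preserves is the one the paragraph describes: at any time, either one cache holds the line in the Modified or Exclusive state, or any number hold it as Shared, so every valid copy is current.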
Another key aspect of parallel processing is scalability. Generally, the performance of a given architecture scales non-linearly with the number of processing elements, due in large part to data access bottlenecks and data transfer latencies. The fastest architectures (in terms of data access) employ System on a Chip (SoC) or Chip Multi Processor (CMP) configurations with multiple processor cores, such as Intel® Corporation's 48-core Single-chip Cloud Computer (SCC). The data access speed is due in large part to the use of very high speed interconnects and protocols that support coherent memory transactions (such as Intel's QuickPath® Interconnect or QPI). Although SoCs and CMPs support very fast coherent memory transactions, their scalability in terms of processing elements is fundamentally limited by the number of cores they employ.
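The non-linear relationship between performance and the number of processing elements can be illustrated with a simple analytical model. The model below is an assumption introduced for illustration and is not taken from the text: it combines Amdahl's law (a fixed serial fraction) with a hypothetical per-element overhead term standing in for data access bottlenecks and transfer latencies. The parameter names and default values are arbitrary.

```python
def speedup(n, serial_fraction=0.05, comm_cost=0.001):
    """Estimated speedup on n processing elements (illustrative model).

    serial_fraction: share of the work that cannot be parallelized.
    comm_cost: assumed overhead added per additional processing element,
               expressed as a fraction of the single-element run time.
    """
    parallel_time = (
        serial_fraction
        + (1.0 - serial_fraction) / n   # parallelizable work, divided up
        + comm_cost * (n - 1)           # coordination and data transfer
    )
    return 1.0 / parallel_time


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"{n:4d} elements -> speedup {speedup(n):.2f}")
```

Under these assumptions the speedup rises sub-linearly, reaches a maximum, and then declines as the overhead term comes to dominate.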
At the opposite end of the scale are distributed computing architectures. This approach has (theoretically) infinite scalability in terms of the number of processing elements, but has inherent latencies that limit the practical data granularity that can be used for performing parallel operations. As a result, the level of parallelism in distributed computing architectures is relatively low. In between are massively-parallel architectures and clustered blade/server architectures. These have inherent processor scalability (within practical limits due to space and heat considerations), and generally provide high performance. However, the performance of data access and parallelism across blades and/or servers using networked links is significantly lower than that provided by SoCs and CMPs, while the cost of supercomputers is prohibitive for most users.