Prior art programming methods are implemented with parallel processing architectures called “clusters.” Such clusters are generally one of two types: the shared memory cluster and the distributed memory cluster. A shared memory cluster consists of multiple computers connected via RAM through a back plane. Because of scaling issues, the number of processors that may be linked together via a shared back plane is limited. A distributed memory cluster utilizes a network rather than a back plane. Among other problems, one limitation of a distributed memory cluster is the bandwidth of the network switch array.
More particularly, each node of an N-node distributed memory cluster must obtain part of an initial dataset before starting computation. The conventional method for distributing data in such a cluster is to have each node obtain its own data from a central source. For problems that involve a large dataset, this distribution step can represent a significant fraction of the time it takes to solve the problem. Although the approach is simple, it has several deficiencies. First, the central data source is a bottleneck: only one node can access the data source at any given time, while the other nodes must wait. Second, for large clusters, the number of collisions that occur when the nodes attempt to access the central data source leads to significant inefficiency. Third, N separate messages are required to distribute the dataset over the cluster. The overhead imposed by N separate messages represents an inefficiency that grows in direct proportion to the size of the cluster; this is a distinct disadvantage for large clusters.
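The third deficiency can be illustrated with a simple cost model. The following is a minimal sketch, not a measurement of any actual cluster: it assumes the nodes are served strictly one at a time, that each node receives an equal share of the dataset, and that every message carries a fixed overhead. The function names and all parameters (bandwidth, per-message overhead) are hypothetical and introduced only for illustration.

```python
def message_overhead_time(num_nodes, per_message_overhead_s):
    """Total fixed overhead of N separate messages (assumed constant per message)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return num_nodes * per_message_overhead_s


def serialized_distribution_time(num_nodes, dataset_bytes,
                                 bandwidth_bytes_per_s, per_message_overhead_s):
    """Time for N nodes to fetch equal shares, one node at a time, from one source.

    Assumption: the central source serves a single node at any moment, so the
    per-node transfer times simply add up.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    share_bytes = dataset_bytes / num_nodes
    per_node_time = per_message_overhead_s + share_bytes / bandwidth_bytes_per_s
    return num_nodes * per_node_time


if __name__ == "__main__":
    # Hypothetical figures: 4 MB dataset, 1 MB/s link, 10 ms overhead per message.
    for n in (4, 64, 1024):
        total = serialized_distribution_time(n, 4e6, 1e6, 0.01)
        overhead = message_overhead_time(n, 0.01)
        print(f"N={n}: total={total:.2f}s, message overhead={overhead:.2f}s")
```

Under these assumptions the raw transfer time stays fixed at the dataset size divided by the bandwidth, while the message overhead term grows linearly with N, which is the scaling behavior described above.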
Shared memory clusters of the prior art operate to transfer information from one node to another as if the memory is shared. Because the data transfer cost of a shared memory model is very low, the data transfer technique is also used within clustered, non-shared memory machines. Unfortunately, using a shared memory model in non-shared memory architectures yields very low efficiency; the cluster efficiency is approximately three to seven percent of the actual processor power of the cluster.
Although increasing the performance of the central data source can reduce the impact of these deficiencies, adding protocol layers on the communication channel to coordinate access to the data, or increasing the performance of the communication channel, adds cost and complexity. These costs scale directly as the number of nodes increases, which is another significant disadvantage for large clusters in the prior art.
Finally, certain high performance clusters of the prior art also utilize invasive “parallelization” methods. In such methods, a second party is privy to the algorithms used on a cluster. Such methods are, however, commercially unacceptable, as the users of such clusters desire confidentiality of the underlying algorithms.
The prior art is familiar with four primary parallel programming methods: nested-mixed model parallelism, POSIX Pthreads, compiler extensions and work sharing. Nested-mixed model parallelism is where one task spawns subtasks. This has the effect of assigning more processors to assist with tasks that could benefit from parallel processing. It is, however, difficult to predict, a priori, how job processing will occur, as the amount of increase in computational speed remains unknown until after all of the subtasks are created. Further, because only parts of a particular job benefit from the parallel processing, and because of the high computational cost of task spawning, the total parallel activity at the algorithm level is decreased. According to Amdahl's Law, as known in the prior art, even a small percentage reduction in parallel activity produces a large effective computational cost.
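The effect attributed to Amdahl's Law can be made concrete with its standard form, in which the speedup on N processors is 1 / ((1 - p) + p / N) for a parallel fraction p. The following minimal sketch evaluates that formula; the function name and the sample fractions are chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Speedup predicted by Amdahl's Law.

    parallel_fraction: share of the run time that can be parallelized (0.0 to 1.0).
    num_processors:    number of processors applied to the parallel share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)


if __name__ == "__main__":
    # Compare a fully parallel job with jobs that are 1% and 5% serial.
    for p in (1.0, 0.99, 0.95):
        print(f"p={p:.2f}: speedup on 64 processors = {amdahl_speedup(p, 64):.1f}")
```

On sixty-four processors, the formula gives a speedup of 64 for a fully parallel job, roughly 39 when one percent of the job is serial, and roughly 15 when five percent is serial.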
POSIX Pthreads are used in shared memory architectures. Each processor in the shared memory is treated as a separate processing thread that may or may not pass thread-safe messages in communicating with other threads. Although this may work well in a shared memory environment, it does not work well in a distributed processor environment. The inability to scale to large numbers of processors even in a shared memory environment has been well documented in the prior art. Because of bus speed limits, memory contention, cache issues, etc., most shared memory architectures are limited to fewer than sixty-four processors working on a single problem. Accordingly, efficient scaling beyond this number is problematic. The standard method of handling this problem is to have multiple, non-communicating algorithms operating simultaneously. This still limits the processing speedup achievable by a single algorithm.
Compiler extensions, such as distributed pointers, sequence points, and explicit synchronization, are tools that assist in efficiently programming hardware features. Accordingly, compiler extension tools offer little to enhance parallel processing effects as compared to other prior art methods.
Work-sharing models are characterized by how they divide the work of an application as a function of user-supplied compiler directives or library function calls. The most popular instance of work sharing is loop unrolling. Another work-sharing tool is the set of parallel region compiler directives in OpenMP, which again provide for limited parallel activity at the algorithm level.
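Work sharing by library function call can be sketched as follows. This is an illustrative analogue only, not OpenMP itself and not a method taken from the prior art described here: the iterations of a single loop are divided into contiguous chunks and handed to a pool of worker threads. The helper names chunk_ranges and parallel_sum_of_squares are hypothetical. Note that threads in standard CPython do not speed up CPU-bound loops; the sketch shows only how the work is divided.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(n, workers):
    """Split range(n) into at most `workers` contiguous, near-equal chunks."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    base, extra = divmod(n, workers)
    ranges = []
    start = 0
    for w in range(workers):
        # The first `extra` workers take one additional iteration each.
        size = base + (1 if w < extra else 0)
        if size == 0:
            continue
        ranges.append((start, start + size))
        start += size
    return ranges


def parallel_sum_of_squares(n, workers=4):
    """Sum i*i for i in range(n), with the loop iterations shared among workers."""
    def partial(bounds):
        lo, hi = bounds
        return sum(i * i for i in range(lo, hi))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker handles one chunk; the partial sums are combined at the end.
        return sum(pool.map(partial, chunk_ranges(n, workers)))


if __name__ == "__main__":
    print(chunk_ranges(10, 4))
    print(parallel_sum_of_squares(1000))
```

Only the loop body is shared among the workers; the code before and after the loop remains serial, which is consistent with the limited parallel activity at the algorithm level noted above.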
It is interesting to note that prior art parallel processing techniques are injected into the algorithms and are not “intrinsic” to the algorithms. This may be a result of the historic separation between programming and algorithm development in the prior art.