1. Field of the Invention
This invention relates to a method for the execution of plural computer programs being executed by a multiplicity of processors in a parallel configuration, and more specifically to performing multiple state transitions simultaneously.
2. Discussion of the Prior Art
New Parallel Language Needed
I surveyed ways to program parallel machines. Despite contentions to the contrary, I do not consider SIMD (single-instruction, multiple-datastream) machines like the Connection Machine (CM-2) to be parallel processors. SIMD machines are restricted to performing the same operation in many cells and thus can perform only a narrow range of applications. Most applications require much data-dependent branching, which degrades performance on SIMD machines such as the Cray YMP-8 and the CM-2. FORTRAN, with a vectorizing compiler, is appropriate for SIMD machines. Every attempt to wring efficient, massive parallelism from conventional FORTRAN programs has failed. Only programs expressed with a language embodying an inherently parallel model of computation can execute efficiently in parallel.
The challenge to those wishing to exact optimum performance from highly parallel processors is to coordinate activities in processing nodes that are doing different things. Modifications to special-purpose languages like LISP and PROLOG to incorporate futures and guards, respectively, are problematic, addressing restricted application domains of largely academic interest.
Then there are the many extensions of sequential languages. Such extensions fall into two classes: "processes" and "doall". Ada is perhaps the best example of a process-based language. Each Ada task (i.e. process) represents a single thread of control that cooperates with other tasks through message passing. Communicating Sequential Processes, described in "Communicating Sequential Processes," C. A. R. Hoare, Communications of the ACM, Vol. 21, No. 8, August 1978, forms the underlying model of computation for message-passing, process-based languages. However, deciding whether an arbitrary collection of processes is deadlock-free is undecidable, and proving that a particular collection of processes is deadlock-free is onerous. Furthermore, such pairwise communication and synchronization is unnecessarily restrictive, thus limiting the degree of parallelism. Finally, a programmer writes (essentially) a plurality of sequential programs, rather than a single, parallel program. The "doall" is just a FORTRAN do-loop in which all the iterations are instead performed concurrently; again, a collection of sequential programs.
Current, Conventional, Concurrent Computing
Traditional, sequential computing causes a sequence of states to be created so that the last state holds the data desired. FIG. 1 illustrates sequential computing starting at an initial state (101), proceeding through intermediate states (103), and terminating in a final state (102). The values of program variables at a particular instant of time comprise a state, depicted as circles. In state diagrams, time always flows down the page, that is, the state at the top of the figure occurs first followed by the state(s) connected by a line. Broken lines indicate repetition or "more of the same."
In "parallel" execution several sequences of state transitions occur concurrently. FIG. 2 depicts parallel execution in doall style. The state diagram shows state transitions occurring as "parallel" sequences. The conception of parallel processing as a collection of sequential executions characterizes the prior art that this invention supersedes.
Two mechanisms allow communication between sequential processes: message-passing or shared-memory. Load balancing and problem partitioning have produced many academic papers for either message-passing or shared-memory paradigms--nearly all yielding poor performance.
Mutual exclusion (MutEx) via message passing (i.e. for distributed systems with no shared state) is horribly inefficient. If the application needs MutEx often, performance will suffer. See "An Algorithm for Mutual Exclusion in Computer Networks," Glenn Ricart & Ashok K. Agrawala, Communications of the ACM, Vol. 24, No. 1, January 1981, for a discussion of the prior-art O(N) request/reply algorithm, which really needs $O(N^2)$ messages when everyone wants the lock. "A $\sqrt{N}$ Algorithm for Mutual Exclusion in Decentralized Systems," Mamoru Maekawa, ACM Transactions on Computer Systems, Vol. 3, No. 2, May 1985, reduces messages to $O(\sqrt{N})$ with an elegant multidimensional geometry. Sanders generalized the MutEx algorithm in more convoluted terms in her article, "The Information Structure of Distributed Mutual Exclusion Algorithms," Beverly A. Sanders, ACM Transactions on Computer Systems, Vol. 5, No. 1, August 1987. She considers deadlock freedom (absence of the possibility that computation ceases because every process is waiting to acquire a lock held by a different process), but the basic model which she proposes needs further augmentation to detect and recover from deadlock. Of course, only a wordy explanation substitutes for a correctness proof. Therefore the desired interference freedom provided by MutEx is not achieved. Raymond, in "A Tree-Based Algorithm for Distributed Mutual Exclusion," ACM Transactions on Computer Systems, Vol. 7, No. 1, February 1989, further reduced the number of messages to O(logN) by increasing the latency, which could easily become a significant burden if processes need the lock briefly.
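The message counts cited above can be made concrete with a little arithmetic. The sketch below assumes the commonly cited figure of 2(N-1) messages per lock acquisition for the Ricart and Agrawala request/reply scheme (N-1 requests plus N-1 replies). The function names are illustrative only. The square-root and logarithm columns show just the growth rates named in the text, not the constant factors of Maekawa's or Raymond's algorithms.

```python
import math


def ricart_agrawala_messages(n: int) -> int:
    """Messages for ONE lock acquisition: n-1 requests plus n-1 replies (assumed figure)."""
    return 2 * (n - 1)


def all_contend_messages(n: int) -> int:
    """Total messages when each of the n processes acquires the lock once: O(N^2)."""
    return n * ricart_agrawala_messages(n)


def growth_table(sizes):
    """Rows of (N, per-acquisition cost, all-contend cost, sqrt(N), log2(N))."""
    return [
        (n,
         ricart_agrawala_messages(n),
         all_contend_messages(n),
         round(math.sqrt(n), 1),
         round(math.log2(n), 1))
        for n in sizes
    ]


if __name__ == "__main__":
    for row in growth_table([16, 256, 4096]):
        print(row)
```

Even at modest N, the quadratic total dwarfs the square-root and logarithmic growth rates, which is the motivation for the later algorithms.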
Controlling access to shared variables is so difficult that designers have resorted to the message-passing model, in which all state variables are local to some process and processes interact exclusively via exchange of messages. Inevitably, two different kinds of system-wide operations, barrier synchronization and remote state acquisition, are required by the algorithm or application. They are notoriously inefficient. Barrier synchronization ensures that all the parallel computations in one phase are completed before the next phase is begun. Remote state acquisition occurs when the application's data cannot be partitioned disjointly so that each process uses values exclusively from its own partition held in local memory. For example, advanced magnetic resonance imaging (MRI) research seeks to analyze brain scans and call to the attention of the radiologist any anomalies. Three-dimensional fast Fourier transform (FFT) is necessary for filtering and extraction of interesting features. The data consists of a 1 k by 1 k by 1 k array of floating point intensities. The first phase of the algorithm maps nicely by putting a 1 k by 1 k plane in each of 1 k processors. However, integrating the third dimension finds the data in the worst possible distribution--every computation needs values that are possessed by each of the 1 k processors!
Barriers are an artifact of the popular yet problematic process paradigm. A static collection of sequential processes alternately works independently and waits for laggards. FIG. 3 depicts four processes (201) synchronized with barriers (204). Each process actively computes (202) and then waits idly (203) until all other processes reach the barrier. The best known algorithms for barrier synchronization take O(logP) time (with a large constant) for both message-passing and shared-memory models. This corresponds with the depth of a tree using either messages or flags.
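The compute-then-wait pattern of FIG. 3 can be illustrated with a minimal sketch using Python's standard `threading.Barrier`. This illustrates the prior-art process paradigm only, not the invention. The function name `run_phases`, the worker count, and the stand-in workload are assumptions made for the example.

```python
import threading


def run_phases(num_workers: int = 4, num_phases: int = 3):
    """Run num_workers threads through num_phases phases separated by barriers.

    Returns a list of (phase, worker_id) records in the order work completed.
    """
    barrier = threading.Barrier(num_workers)
    log = []
    log_lock = threading.Lock()

    def worker(worker_id: int) -> None:
        for phase in range(num_phases):
            # "Actively computing": stand-in work of uneven length per worker.
            sum(range(1000 * (worker_id + 1)))
            with log_lock:
                log.append((phase, worker_id))
            # "Waiting idly": no worker proceeds until every worker has arrived.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    for record in run_phases():
        print(record)
```

Because each worker records its result before waiting, every record of one phase precedes every record of the next, however unevenly the work is distributed. The fast workers simply sit idle at the barrier.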
A large body of academic work has been produced for both message-passing and parallel random-access machine (PRAM) models, i.e. shared memory. Barrier synchronization in a message-passing model requires O(logP) time, where P is the number of processors, with a big constant proportional to message transmission latency. See "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors," John M. Mellor-Crummey and Michael L. Scott, ACM Transactions on Computer Systems, Vol. 9, No. 1, February 1991.
Consider an ideally parallelizable application that requires synchronization. A single processor would require N steps (time, $T_s$) to perform the application with L iterations of a single loop, each requiring N/L steps. When parallelized, barrier synchronization is required at the end of each iteration, taking $\log P$ steps. Real systems take many steps for each level of the synchronization tree. So the parallel time for each iteration, $T_i$, is $N/(L \cdot P) + \log P$, and the time for the whole application, $T_p$, is $T_i \cdot L$ or $N/P + L \cdot \log P$ steps. Speedup, S, is the ratio of sequential time to parallel time, $T_s/T_p$. Efficiency, E, is the speedup obtained divided by the number of processors used, S/P. In this case, $E = T_s/(T_p \cdot P) = N/(N + P \cdot L \cdot \log P)$.
______________________________________
Total work to be done           $T_s = N$ steps (sequential)
Number of loop iterations       $L$
Sequential time per iteration   $N/L$
Parallel time per iteration     $T_i = N/(L \cdot P) + \log P$
Total parallel time             $T_p = T_i \cdot L = N/P + L \cdot \log P$
Speedup                         $S = T_s/T_p = N \cdot P/(N + P \cdot L \cdot \log P)$
                                $S = P/(1 + (L/N) \cdot P \cdot \log P)$
Efficiency                      $E = S/P = N/(N + P \cdot L \cdot \log P)$
                                $E = 1/(1 + (L/N) \cdot P \cdot \log P)$, i.e. $O(1/\log P)$
______________________________________
When the number of processors is small, $N \gg P \cdot L \cdot \log P$, the efficiency can be quite good, close to 1. But as more processors are used ($N \approx P$) the barrier synchronization overhead begins to dominate even when only a single barrier is needed.
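The efficiency model above is easy to evaluate numerically. The following sketch transcribes the formulas directly. The base-2 logarithm and the sample values of N, L, and P are assumptions chosen for illustration, and the function names are not from the source.

```python
import math


def parallel_time(n: float, l: int, p: int) -> float:
    """T_p = N/P + L*log(P): ideal work split plus one barrier per iteration."""
    return n / p + l * math.log2(p)


def speedup(n: float, l: int, p: int) -> float:
    """S = T_s / T_p, with sequential time T_s = N."""
    return n / parallel_time(n, l, p)


def efficiency(n: float, l: int, p: int) -> float:
    """E = S / P = N / (N + P*L*log(P))."""
    return speedup(n, l, p) / p


if __name__ == "__main__":
    n, l = 1_000_000, 10
    for p in (1, 16, 1024, 65536, 1_000_000):
        print(f"P={p:>8}  S={speedup(n, l, p):12.1f}  E={efficiency(n, l, p):.4f}")
```

With these sample values the efficiency stays near 1 while N is much larger than P*L*logP, and collapses as P approaches N, which is the behavior the paragraph describes.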
Therefore, for an application needing barrier synchronization on any conceivable message-passing architecture ($N \approx P$), the minimum time of execution will be O(logP), the best speedup possible is O(P/logP), and the efficiency can be no better than O(1/logP). Little wonder that efficiencies of even 10% are hard to achieve on parallel processors like CM-2, CM-5, NCube, Paragon, and T3D. They can perform no better than their theoretical model.
______________________________________
Order parallel time   $T_p = O(\log P)$
Order speedup         $S = O(P/\log P)$
Order efficiency      $E = O(1/\log P)$
______________________________________
Parallel Random-Access Machines (PRAMs) suffer a similar fate. When most people think of "shared-memory" they imagine a PRAM--a bunch of sequential processors that can read and write the same memory locations. Many PRAM variations exist. Whether the memory locations can be accessed concurrently or exclusively, and what happens for concurrent accesses, constitute most of the PRAM model differences. Even unrealistic operations like storing the median or geometric mean of all concurrent writes still incur an O(logP) penalty for barrier synchronization, the same as message passing.
Sequential algorithms have been traditionally analyzed using Turing machines. Both the time (number of steps) and the space (number of tape squares) are used to measure the complexity of a computation. Additionally whether the Turing machine has a unique next action (deterministic) or a set of possible next actions (nondeterministic) substantially affects the complexity measure of a computation.
The following paragraphs summarize three chapters of a recent book on parallel algorithms entitled "Synthesis of Parallel Algorithms", John H. Reif, editor, Morgan Kaufmann, 1993, that I believe is state-of-the-art computer science theory for shared-memory computers. The following sections quote heavily those authors (Faith E. Fich, Raymond Greenlaw, and Phillip B. Gibbons) and will be indented to indicate quotation.
PRAMs attempt to model the computational behavior of a shared-memory parallel processor. The alternative is shared-nothing, message-passing parallel processors, which are popular these days. PRAMs have many variations and have a large body of published scholarship. Fich, Greenlaw, and Gibbons cover the main kinds of PRAMs, which I deem represent the conventional wisdom about how to think about parallel computation. This paradigm has, so far, not yielded efficient, general-purpose, parallel processing. Thus the motive for the present invention.
Fich's chapter begins with a concise definition of a "synchronous" PRAM.

    "The PRAM is a synchronous model of parallel computation in which processors communicate via shared memory. It consists of m shared memory cells $M_1, \ldots, M_m$ and p processors $P_1, \ldots, P_p$. Each processor is a random access machine (RAM) with a private local memory. During every step of a computation a processor may read from one shared memory cell, perform a local operation, and write to one memory cell. Reads, local operations, and writes are viewed as occurring during three separate phases. This simplifies analysis and only changes the running time of an algorithm by a constant factor." (Underline mine.)

    "A forking PRAM is a PRAM in which a new processor is created when an existing processor executes a fork operation. In addition to creating the new processor, this operation specifies the task that the new processor is to perform (starting at the next time step)."

    Phase 1: Each processor that wants to write marks its dedicated cell from the p EREW cells used for each PRIORITY cell.
    Phase 2: By evaluating (in parallel) a binary tree of all processors that want to write, the highest priority processor is found.
    Phase 3: The winning processor writes the "real" memory cell.

    "Existing MIMD machines are asynchronous, i.e., the processors are not constrained to operate in lock-step. Each processor can proceed through its program at its own speed, constrained by the progress of other processors only at explicit synchronization points. However in the (synchronous) PRAM model, all processors execute a PRAM program in lock-step with one another. Thus in order to safely execute a PRAM program on an asynchronous machine, there must be a synchronization point after each PRAM instruction. Synchronizing after each instruction is inherently inefficient since the ability of the machine to run asynchronously is not fully exploited and there is a (potentially large) overhead in performing the synchronization.
    Therefore our first modification to the PRAM model will be to permit the processors to run asynchronously and then charge for any needed synchronization." (Underline mine.)

    Additionally shared memory accesses are charged more than local accesses: "To keep the model simple we will use a single parameter, d, to quantify the communication delay to memory. This parameter is intended to capture the ratio of the median time for global memory accesses to the median time for a local operation."

    Given n numbers stored one per global memory location, and the following four types of instructions: L:=G, L:=L+L, G:=L and "barrier," where L is a local cell and G is a global cell, then the sum of n numbers on an Asynchronous PRAM with this instruction set requires $\Omega(B \log n / \log B)$ time regardless of the number of processors. This is interesting because Multitude can sum n integers in O(1) time! Gibbons concludes with "subset synchronization" in which only a subset of processors in the machine need synchronize. Since the barriers involve fewer processors, and thus are shorter, summing numbers runs marginally faster.

    Induction basis: Prove that P holds for n=0.
    Induction step: Prove that P holds for n+1 from the induction hypothesis that P holds for n.

    hundreds, thousands, or more processors may be efficiently used;
    program correctness may be proved;
    special data structures may be created that allow many elements within the data structure to be simultaneously accessed without locking, blocking, or semaphores;
    the meaning of every language construct is precisely and mathematically defined; and
    during program execution, processors may simultaneously self-schedule or post work for other processors to do.
Throughout Fich's paper she assumes a synchronous PRAM: single-instruction, multiple-datastream (SIMD). This allows her to ignore synchronization costs and hassles in her analyses. The synchronous PRAM cannot model MIMD machines with their independent, data-dependent branching and uncertain timing. Having all processors read-execute-write each step sidesteps sticky issues like what happens when a memory cell is read and written simultaneously. Furthermore, the execute part of a step can perform any calculation whatsoever! So the lower bounds derived for various algorithms concentrate solely on shared-memory accesses while totally ignoring the real computation--data transformation in the execute part of the step.
Fich describes "forking" PRAM in which new processors are created with a fork operation (as in Unix). A forking PRAM can dynamically describe parallel activities at run-time. (Regular PRAMs have a fixed set of processes/processors.) But a forking PRAM still is a collection of sequential execution streams, not a single, parallel program.
After defining a forking PRAM Fich never mentions it again. But she does make claims for PRAMs about "programmer's view," "ignoring lower level architectural details such as . . . synchronization," and "performance of algorithms on the PRAM can be a good predictor of their relative performance on real machines," that I completely disagree with. Since PRAMs exemplify the process paradigm (collection of sequential executions) they may adequately predict performance of algorithms and architectures similarly afflicted (all machines built so far), but certainly not what's possible.
Most distinctions between PRAMs involve memory access restrictions. The most restrictive is called exclusive-read exclusive-write (EREW). Ensuring that two processors don't write to the same memory cell simultaneously is trivial with a synchronous PRAM since all operations are under central control. The other four memory access restrictions are: concurrent-read exclusive-write (CREW), COMMON, ARBITRARY, and PRIORITY. The last three restrictions differ in how concurrent writes are handled. With COMMON, values written at the same time to the same cell must have the same value. With ARBITRARY, any value written in the same step may be stored in the memory cell. With PRIORITY, the value written by the highest priority process (lowest process number) is stored in the memory cell.
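The write rules just described can be stated compactly in code. The sketch below resolves the writes aimed at a single memory cell during a single step under each policy. The function name and the representation of writes as (processor number, value) pairs are assumptions for illustration. For ARBITRARY, any contender is a legal winner, and the sketch arbitrarily picks the first one listed.

```python
def resolve_concurrent_writes(writes, policy):
    """Return the value stored in ONE cell after ONE step.

    writes: list of (processor_id, value) pairs targeting the same cell.
    policy: "EREW", "CREW", "COMMON", "ARBITRARY", or "PRIORITY".
    """
    if not writes:
        return None  # nobody wrote; the cell is unchanged
    if policy in ("EREW", "CREW"):
        # Both models demand exclusive writes.
        if len(writes) > 1:
            raise ValueError("exclusive-write rule violated")
        return writes[0][1]
    if policy == "COMMON":
        # Concurrent writers must all agree on the value.
        values = {value for _, value in writes}
        if len(values) != 1:
            raise ValueError("COMMON writers disagree")
        return values.pop()
    if policy == "ARBITRARY":
        # Any written value may be stored; this sketch picks the first listed.
        return writes[0][1]
    if policy == "PRIORITY":
        # Lowest processor number has the highest priority.
        return min(writes, key=lambda w: w[0])[1]
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    contenders = [(3, "c"), (1, "a"), (2, "b")]
    print(resolve_concurrent_writes(contenders, "PRIORITY"))
    print(resolve_concurrent_writes(contenders, "ARBITRARY"))
```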
Obviously an EREW program would run just fine on a CREW machine; the concurrent read hardware would not be used. The five memory restrictions Fich uses nest nicely in a power hierarchy:

    PRIORITY > ARBITRARY > COMMON > CREW > EREW
Fich then spends much of her chapter showing how many steps it takes one PRAM to simulate a single step of another.
Proving that EREW can simulate a step of PRIORITY using O(log p) steps and p*m memory cells takes Fich two pages of wordy prose. Essentially the simulation takes three phases:
Since the height of the binary tree is O(log p) it takes O(log p) steps for an EREW PRAM to emulate a PRIORITY PRAM. Much of Fich's chapter is devoted to convoluted schemes by which one model simulates another without mentioning any applications that benefit from a more powerful and expensive model.
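The three-phase simulation (mark, tree evaluation, winning write) can be sketched as follows. This is a sequential Python illustration rather than Fich's construction: each iteration of the inner loop stands for one processor acting in the same parallel round, and each round touches disjoint cells, so no cell is read or written concurrently. The function name and the dictionary of write requests are assumptions for the example.

```python
def simulate_priority_write(requests, p):
    """Emulate one PRIORITY write to a single cell using exclusive accesses only.

    requests: dict mapping processor_id -> value it wants to write.
    p: number of processors (p >= 1).
    Returns (stored_value, tree_rounds); stored_value is None if nobody wrote.
    """
    # Phase 1: each would-be writer marks its own dedicated cell.
    level = [pid if pid in requests else None for pid in range(p)]

    # Phase 2: evaluate a binary tree; every round halves the candidates.
    rounds = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            left = level[i]
            right = level[i + 1] if i + 1 < len(level) else None
            # The left subtree holds lower processor numbers (higher priority).
            next_level.append(left if left is not None else right)
        level = next_level
        rounds += 1

    # Phase 3: the winning processor writes the "real" memory cell.
    winner = level[0]
    stored = requests[winner] if winner is not None else None
    return stored, rounds


if __name__ == "__main__":
    print(simulate_priority_write({5: "x", 2: "y", 7: "z"}, 8))
```

For p = 8 the tree takes three rounds, matching the O(log p) height noted above.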
Gibbons finally provides a reasonable PRAM model in the very last chapter of this 1000 page book. None of the other authors constrain themselves to reasonable models, thus limiting the value of their work. In Gibbons' PRAM model:
I invented the Layered class of multistage interconnection networks [U.S. Pat. No. 4,833,468, LAYERED NETWORK, Larson et al., May 23, 1989] specifically to keep d small. For the Multitude architecture (interconnected by a Layered network) the ratio of time for global memory accesses to local memory accesses is about 3 to 4. This compares with a ratio of local memory access time to cache hit time of 20 to 30.
Gibbons' asynchronous PRAM model uses four types of instructions: global read, global write, local operation, and synchronization step. The cost measures are:
______________________________________
Instruction                     Cost
______________________________________
local operation                 1
global read or write            d
k global reads or writes        d + k - 1
synchronization barrier         B
______________________________________
The parameter B=B(p), the time to synchronize all the processors, is a nondecreasing function of the number of processors, p, used by the program. In the Asynchronous PRAM model, the parameters are assumed to obey the following constraints: $2 \le d \le B \le p$. However, a reasonable assumption for modeling most machines is that $B(p) \in O(d \log p)$ or $B(p) \in O(d \log p / \log d)$.
Here again is evidence of a logP performance hit for barrier synchronization.
A synchronous EREW PRAM can be directly adapted to run on an asynchronous PRAM by inserting two synchronization barriers, one after the read phase and one after the write phase of the synchronous PRAM time step. Thus a single step of a synchronous PRAM can be simulated in 2B+2d+1 time on an asynchronous PRAM. Therefore a synchronous EREW PRAM algorithm running in t time using p processors will take t(2B+2d+1) time on an asynchronous PRAM--a significant penalty. Gibbons shows how to maintain the O(Bt) running time but use only p/B processors by bunching the operations of B synchronous processors into a single asynchronous processor, thus getting more work done for each barrier synchronization.
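The 2B+2d+1 figure follows directly from the instruction costs of the asynchronous PRAM model: a global read (d), a barrier (B), a local operation (1), a global write (d), and a second barrier (B). The sketch below adds these up. The function names and the sample values of B, d, and t are assumptions for illustration.

```python
def batched_access_cost(k: int, d: int) -> int:
    """Cost of k global reads or writes issued together: d + k - 1 (k >= 1)."""
    return d + k - 1


def simulated_step_cost(b: int, d: int) -> int:
    """One synchronous PRAM step on an asynchronous PRAM: 2B + 2d + 1."""
    read = batched_access_cost(1, d)    # read phase: one global read
    local = 1                           # execute phase: one local operation
    write = batched_access_cost(1, d)   # write phase: one global write
    return read + b + local + write + b  # a barrier follows the read and the write


def simulated_running_time(t: int, b: int, d: int) -> int:
    """A t-step synchronous EREW algorithm takes t(2B + 2d + 1) time."""
    return t * simulated_step_cost(b, d)


if __name__ == "__main__":
    print(simulated_step_cost(b=8, d=4))
    print(simulated_running_time(t=100, b=8, d=4))
```

With the sample values B = 8 and d = 4, each simulated step costs 25 time units where the synchronous model charges one step, and 16 of those units are spent in barriers.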
Gibbons transforms an EREW algorithm for the venerable "all prefix sums" operation for an asynchronous PRAM, yielding an O(B log n/log B) time algorithm which isn't too bad. Gibbons presents asynchronous versions of descaling lemmas (performing the same application with fewer processors) which tack on a term for the barrier synchronizations. Still he uses many barriers in order to obtain behavior similar to synchronous PRAMs. Only when applications are synchronized just when they need it, and those needed synchronizations are really fast, $B \approx d \approx 1$, will scaled-up parallel processing be possible, which is the contribution of this invention and Layered networks.
Gibbons considers several algorithms that run in O(B log n/log B) time. In particular, he proves an interesting theorem about summing numbers:
So the best PRAM model I could find charges PRAM algorithms for their "cheating," but still offers no insights about how to efficiently compute on an MIMD machine.
Temporal Logic
Moszkowski, in his Executing Temporal Logic Programs [Cambridge University Press, 1986], presented a temporal-logic programming language. His language, Tempura, allows description of parallel execution (unsatisfactorily) within a single program expression. The foundation for Tempura is interval temporal logic. A Tempura interval is a non-empty sequence of states. A state is determined by the values of program variables at a particular instant of time. The sequence of states that forms an interval provides a discrete notion of time.
The semantics of Tempura were much different than any I had previously encountered. Instead of programs being a step-by-step recipe for how to do the computation, they were logical formulas that are "satisfied" by constructing a "model" that makes the formula "true." Unfortunately the models constructed are simply sequences of states--the process model underlying PRAMs.
The semantics of what is referred to herein as DANCE programs, although also temporal logic formula based, are satisfied by constructing models with a different and superior mathematical structure that allows efficient execution on a plurality of computational engines. It is this mathematical structure that forms the core of this invention.