This invention relates generally to data processing and more particularly to a binary tree-structured parallel processing machine employing a large number of processors, each such processor incorporating its own I/O device.
Throughout the history of the computer there has been a continuing demand to increase the throughput of the computer. Most of these efforts have concentrated on increasing the speed of operation of the computer so that it is able to process more instructions per unit time. In serial computers, however, these efforts have in certain senses been self-defeating, since all the processing is performed by a single element of the computer, leaving most of the resources of the computer idle at any time.
In an effort to avoid some of these problems, special purpose machines such as array processors have been developed which are especially designed for the solution of special classes of problems. Unfortunately, while commercially successful in the performance of certain computational tasks, such computers fall far short of adequate performance in others.
In recent years substantial efforts have been made to increase throughput by operating a plurality of processors in parallel. See, for example, Chuan-lin Wu and Tse-yun Feng, Interconnection Networks for Parallel and Distributed Processing (IEEE 1984). One such parallel processor is that in which a plurality of processors are connected in a tree-structured network, typically a binary tree. S. A. Browning, "Computations on a Tree of Processors," Proc. VLSI Conf., California Institute of Technology, Pasadena, Jan. 22-24, 1979; A. M. Despain et al., "The Computer as a Component", (1979 unpublished); A. Mago, "A Cellular Language-directed Computer Architecture," Proc. VLSI Conf., California Institute of Technology, Jan. 22-24, 1979; R. J. Swan et al., "Cm*-A Modular, Multimicroprocessor", Proc. 1977 NCC, pp. 645-655 (June 1977); J. R. Goodman et al., "Hypertree: A Multiprocessor Interconnection Topology", IEEE Trans. on Computers, Vol. C-30, No. 12, pp. 923-933 (December 1981), reprinted in Wu and Feng at pp. 46-56; J. L. Bentley and H. T. Kung, "Two Papers on a Tree-Structured Parallel Computer", Technical Report, Dept. of Computer Science, Carnegie-Mellon University, Sept. 1979.
In a binary tree computer, a large number of processors are connected so that each processor, except those at the root and leaves of the tree, has a single parent processor and two child processors. The processors typically operate synchronously on data flowing to them from the parent processor and pass results to descendant processors.
Important problems in the storage and retrieval of data can be analyzed following J. L. Bentley, "Decomposable Searching Problems", Information Processing Letters, Vol. 8, No. 5, pp. 244-250 (June 1978). Bentley defines a static searching problem as one of preprocessing a set F of N objects into an internal data structure D and answering queries about the set F by analyzing the data structure D. Bentley defines three functions of N that characterize the complexity of the searching function: the amount of storage S required by D to store the N objects, the preprocessing time P required to form D in S, and the time Q required to answer a query by searching D.
An illustration of a problem that can be solved by such a database is the membership problem. In this case N elements of a totally ordered set F are preprocessed so that queries of the form "Is x in F?" can be answered quickly. The common solution for serial computers is to store F in a sorted array D and binary search. Thus, the membership problem can be computed on sequential computers with the following complexity: S=N; P=O(N log N); Q=O(log N).
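The serial solution described above can be sketched in a few lines. This is a minimal illustration, not code from any cited reference; the function names `preprocess` and `member` are assumptions chosen for this example.

```python
from bisect import bisect_left

def preprocess(F):
    """Build D as a sorted array of the N elements: S = N, P = O(N log N)."""
    return sorted(F)

def member(x, D):
    """Answer "Is x in F?" by binary search over D: Q = O(log N)."""
    i = bisect_left(D, x)          # leftmost position where x could be inserted
    return i < len(D) and D[i] == x

D = preprocess([17, 3, 42, 8, 23])
print(member(23, D))  # True
print(member(5, D))   # False
```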
Bentley defines a decomposable searching problem as one in which a query asking the relationship of a new object x to a set of objects F can be written as:
$$\mathrm{Query}(x, F) = \mathop{B}_{f \in F} q(x, f)$$
where B is the repeated application of a commutative, associative binary operator b that has an identity and q is a primitive query applied between the new object x and each element f of F. Hence the membership problem is a decomposable searching problem when cast in the form:
$$\mathrm{Member}(x, F) = \mathop{\mathrm{OR}}_{f \in F} \mathrm{equal}(x, f)$$
where OR is the logical function OR and equal is the primitive query "Is x equal to f?" applied between the object x and each element f of F.
The key idea about this type of problem is its decomposability. To answer a query about F, we can combine the answers of the query applied to arbitrary subsets of F.
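As a hedged illustration of decomposability, the query can be written as a fold of the operator b over the primitive answers q(x, f). The helper `query` and its parameter names are assumptions made for this sketch; only the roles of b, q, and the identity come from Bentley's definition.

```python
from functools import reduce
import operator

def query(x, F, q, b, identity):
    """Query(x, F): fold the operator b over q(x, f) for every f in F."""
    return reduce(b, (q(x, f) for f in F), identity)

def member(x, F):
    # Membership: q is "Is x equal to f?", b is logical OR, identity is False.
    return query(x, F, operator.eq, operator.or_, False)

F = [17, 3, 42, 8, 23]
left, right = F[:2], F[2:]        # an arbitrary split of F into two subsets

# The answer over F equals b applied to the answers over the subsets.
whole = member(42, F)
combined = member(42, left) or member(42, right)
print(whole, combined)  # True True
```

Because b is commutative and associative, the split point and the order in which subset answers are combined do not affect the result.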
This type of problem is well suited to quick execution in a parallel processing environment. The set F is partitioned into a number of arbitrary subsets equal to the number of available processors. The primitive query q is then applied in parallel at each processor between the unknown x, which is communicated to all processors, and the locally stored set element f. The results are then combined in parallel by $\log_2 N$ repetitions of the operator b, first performing b computations on N/2 adjacent pairs of processors, then b computations on N/4 pairs of results of the first set of computations, and so on until a single result is obtained.
The complexity of this operation in the parallel processing environment is computed as follows. The N elements of the set F must be distributed among the processors. The number of time steps to do this equals the number of elements of the set. Thus, P=O(N). If each element is stored in a different processor such that S=N, the time required to answer a primitive query is a single time step, and the time required to compute the final answer is the number of time steps required to report back through the binary tree, which is $O(\log_2 N)$. Thus, $Q=O(1)+O(\log_2 N)$. Compared with the complexity of the membership problem when executed on a serial computer, the use of a parallel processor provides substantial savings in the preprocessing time required to build the data structure since there is no need to store the data structure in an ordered array.
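The pairwise combining step can be simulated serially to count the rounds it takes. This is only a sketch under the assumption of one set element per processor; `tree_combine` is a hypothetical helper, and each pass of the loop stands in for one parallel time step.

```python
def tree_combine(values, b, identity):
    """Combine per-processor results pairwise; return (result, rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        if len(level) % 2:             # pad an odd level with the identity of b
            level.append(identity)
        # One parallel step: every adjacent pair is combined at once.
        level = [b(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        rounds += 1
    return (level[0] if level else identity), rounds

F = [17, 3, 42, 8, 23, 15, 4, 16]      # N = 8, one element per processor
x = 23
local = [x == f for f in F]            # primitive query q applied at every processor
answer, rounds = tree_combine(local, lambda a, c: a or c, False)
print(answer, rounds)  # True 3
```

With N = 8 the levels shrink from 8 to 4 to 2 to 1, so three rounds are needed, matching $\log_2 8 = 3$.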
Bentley and Kung proposed a specific tree structure illustrated in FIG. 1 which was designed to achieve throughputs on the order described above. As shown in FIG. 1, their tree structure comprises an array of processors P1-P10 organized into two binary trees that share leaf processors P4-P7. Data flows in one binary tree from root processor P1 to leaf processors P4-P7. Data is operated on at the leaf processors and the results flow in the second tree from leaf processors P4-P7 to root processor P10. Obviously, many more processors can be used in the array if desired.
To load data into each of leaf processors P4-P7 of FIG. 1, the data for each leaf processor is provided to the root processor P1 at successive time steps and is routed through the array to each leaf processor via intermediate processors P2 and P3. Thus, it takes at least one time step per leaf processor to load data into the leaf processors.
The data structure is queried by providing the query to root processor P1 and propagating the query in parallel to each leaf processor. The results of the query are then reported out through processors P8 and P9 to root processor P10 with each of these processors computing a result from two inputs higher up in the tree. As will be apparent, propagation times of the query and the result through the binary tree introduce significant delays in overall throughput comparable to those of a serial computer.
While the time required to answer a single query in the parallel processor is comparable to that in a serial computer, queries can sometimes be processed in pipeline fashion in the parallel processor while they cannot be in the serial computer. Thus, after $O(\log_2 N)$ steps, results begin to flow out of the parallel processor at a rate of one per time step. If the number of queries is large enough that the pipe filling and flushing times can be considered negligible, the complexity can be computed as: S=N, P=O(N) and Q=O(1).
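The effect of pipelining on the amortized query time can be illustrated with a simple step-count model. The constants here are assumptions made only for illustration: a query is taken to need one step per tree level on the way down, one step at the leaves, and one step per level on the way up. The function names are hypothetical.

```python
import math

def latency(n):
    """Assumed steps for one query: broadcast down, one leaf step, report up."""
    depth = math.ceil(math.log2(n))
    return 2 * depth + 1

def pipelined_steps(n, m):
    """m queries entering on successive steps: fill the pipe once, then one result per step."""
    return latency(n) + (m - 1)

def flushed_steps(n, m):
    """m queries where each must wait for the previous one to leave the tree."""
    return m * latency(n)

n = 1024
for m in (1, 10, 10_000):
    # Amortized steps per query, with and without pipelining.
    print(m, pipelined_steps(n, m) / m, flushed_steps(n, m) / m)
```

Under this model the pipelined cost per query approaches one step as the number of queries grows, while the flushed cost per query stays at the full latency.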
There are, however, numerous instances in which pipelining cannot be used to minimize the effect of propagation delays in the binary tree. These include:
1. Decomposable searching problems where the number of "external" queries is small, or a series of queries is generated internally from processing elements within the tree. Internally generated queries would need to migrate to the root in log N steps under Bentley and Kung's scheme, and be "broadcast" down once again in log N steps. Since each query would force pipe flushing, Q=O(log N) for all queries. Artificial intelligence production systems provide an illustration of these kinds of problems.
2. Searching problems where a single data structure D cannot be constructed. That is, for certain sets of complex (or multi-dimensional) objects, searching problems cannot be applied to a single data structure D. Consider relational databases where each element of a set is a complex record structure with possibly numerous fields or keys. To search such records, a data structure D would necessarily be needed for each field. Hence, in this case P(N)=kN for k fields in each record.
3. A set F of first order predicate logic literals, i.e., a knowledge base. We wish to compute a series of unifications of a series of "goal literals" against each of the elements of F. Since logic variables can bind to arbitrary first order terms during unification, a single query can change the entire set of elements in the knowledge base by binding and substituting new values for variable terms. Successive queries, therefore, are processed against successively different sets of literals. (The simpler case involves frequent modifications to a dynamic set F, necessitating frequent reconstruction of D. Relational databases provide a convenient example.) Hence,
$$\mathrm{Query}(x_i, F_i) = \mathop{B}_{f \in F_i} q(x_i, f)$$
where $F_i = \mathrm{function}(F_{i-1}, \mathrm{Query}(x_{i-1}, F_{i-1}))$. (In the case of logic programming, function is substitution after logical unification, while for relational databases function may be insert or delete.)
4. Problems where a single query in a series of queries cannot be computed without knowing the result of the previous query. In dynamic programming approaches to statistical pattern matching tasks, a single match of an unknown against the set of reference templates cannot be computed without knowing the best match(es) of the previous unknown(s). Hence, for a series of unknowns $x_i$, $i = 1, \ldots, M$,
$$\mathrm{Query}(x_i, F) = \mathop{B}_{f \in F} q(x_i, \mathrm{Query}(x_{i-1}, F), f)$$
In this case, the same pipe flushing phenomenon appears as in the first case noted above.
5. Searching problems where we wish to compute a number of different queries about the same unknown x over a set, or possibly different sets. Hence,
$$\mathrm{Query}_i(x, F) = \mathop{B}_{f \in F} q_i(x, f) \quad \text{for } i = 1, \ldots, M$$
Artificial intelligence production systems provide an illustration of this type of problem as well.
We will refer to problems of this type as almost decomposable searching problems.
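The dependency that defeats pipelining in cases 3 and 4 can be seen in a small serial sketch. The set is modeled as a dynamic collection of keys, and the update rule (insert x only when it is absent) is a hypothetical stand-in for the "function" of case 3; the names `update` and `run_series` are likewise assumptions.

```python
def member(x, F):
    """Decomposable query: OR over equal(x, f) for every f in F."""
    return any(x == f for f in F)

def update(F, x, result):
    """Hypothetical 'function': the next set depends on the previous query result."""
    return F if result else F | {x}

def run_series(xs, F0):
    """Process queries in order; each one needs the set left by the one before."""
    F = frozenset(F0)
    answers = []
    for x in xs:
        result = member(x, F)      # must complete before the next set is known
        answers.append(result)
        F = update(F, x, result)
    return answers, F

answers, final = run_series([5, 5, 3, 7], {3})
print(answers)        # [False, True, True, False]
print(sorted(final))  # [3, 5, 7]
```

The second query for 5 returns a different answer than the first only because the first query's result changed the set, so the second cannot enter the tree until the first has been reported out.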
Additional deficiencies of binary tree type parallel processors include limited efficiency and poor fault tolerance. The efficiency of a computation performed by the tree is often reduced since the amount of computation time required by each processor on a particular cycle may be vastly different depending on its local state. Such differences often result in unnecessary waiting and increased computation time. Additionally, it is well known that binary tree processors are inherently not very fault tolerant. Since a fault in any one processor has a tendency to ripple through the binary tree architecture, it is imperative that the fault(s) be not only detected but also compensated for so as to produce an accurate computation despite the fault(s).