This invention relates generally to storage architectures for parallel data processing systems and methods for balancing work load after node failure.
Parallel data processing systems of many kinds are known. Data in parallel data processing systems of the prior art is partitioned into data fragments, which are distributed among a plurality of data storage elements such as hard disks. For purposes herein, the terms "disk subsystem" and "storage subsystem" are defined to be interchangeable. Because the data fragments are distributed over a number of data storage elements, corresponding multiple threads of computation can be created at multiple data processing nodes to access the multiple data fragments in parallel.
A major object of parallel data processing systems is fault tolerance. A fault-tolerant system is defined as a system designed to continue operation in the event of a fault in one or more of its components. Parallel data processing systems are often employed in "mission critical" application environments. As is well known, in any given system, the mean time between failures is inversely proportional to the number of components in the system. In a parallel system with a large number of components, component failures can therefore be expected to occur frequently. Accordingly, one object of the invention herein is graceful performance degradation after occurrence of a component failure.
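The inverse relationship between mean time between failures (MTBF) and component count can be made explicit under a simplifying assumption that the text above does not itself state: the system contains N independent components, each with the same constant failure rate λ, and any single component failure counts as a system fault. The symbols N, λ, MTBF_comp, and MTBF_sys are introduced here for illustration only.

```latex
\lambda_{\text{sys}} = \sum_{i=1}^{N} \lambda = N\lambda,
\qquad
\text{MTBF}_{\text{sys}} = \frac{1}{\lambda_{\text{sys}}}
                         = \frac{1}{N\lambda}
                         = \frac{\text{MTBF}_{\text{comp}}}{N}
```

Under these assumptions, a system of 100 components, each with an MTBF of 100,000 hours, would be expected to see a component failure about every 1,000 hours.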
FIG. 1 shows, in block diagram form, the general scheme of a fault-tolerant parallel data processing system 10. The fault-tolerant parallel data processing system 10 includes a plurality of disk subsystems 11, 12. For simplicity of illustration, two disk subsystems are shown in FIG. 1. The fault-tolerant parallel data processing system 10 further includes a plurality of data processing nodes 14, including, in the example shown, data processing nodes 14a, 14b, 14c, and 14d. These data processing nodes are connected to disk subsystems 11, 12 along busses 15. In particular, data processing nodes 14a and 14b are connected to disk subsystem 11, and data processing nodes 14c and 14d are connected to disk subsystem 12, each along a separate bus 15. Data processing system 10 further includes an interconnection system 16 to which each of data processing nodes 14 is connected to establish an integrated parallel data processing system.
As illustrated in FIG. 1, data processing nodes 14 are grouped into disjoint pairs. Each pair of data processing nodes is connected to a single shared disk subsystem, i.e., one of disk subsystems 11 or 12. Disk subsystems 11 and 12, according to this approach, mirror each other, containing replicated data or information sets. Such disk mirroring protects the data processing system 10 from a single disk failure. In case of a failure of a particular data processing node, for example data processing node 14a, the other data processing node of the pair, i.e., data processing node 14b, can take over the work load of the faulty node, because both data processing nodes have access to the shared disk subsystem, in this case disk subsystem 11. However, data processing node 14b then carries the work load of two nodes, and the performance of the parallel data processing system 10 may drop by as much as half, because the performance of a loosely-coupled parallel system such as data processing system 10 in FIG. 1 typically depends upon the performance of the slowest data processing node in the system.
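The drop of up to one half described for the paired arrangement of FIG. 1 can be illustrated with a small model. This is a sketch under stated assumptions, not part of the disclosed system: every node starts with an equal unit of work, a failed node's work goes entirely to the surviving node or nodes on the same disk subsystem, and system performance is set by the most heavily loaded node. The names `PAIRS`, `loads_after_failure`, and `relative_performance` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical model of FIG. 1: each disk subsystem is shared by one pair of nodes.
PAIRS: Dict[str, List[str]] = {
    "disk_11": ["14a", "14b"],
    "disk_12": ["14c", "14d"],
}


def loads_after_failure(
    pairs: Dict[str, List[str]],
    failed: Optional[str] = None,
    load_per_node: float = 1.0,
) -> Dict[str, float]:
    """Return the work load carried by each surviving node.

    Assumption: work can only move between nodes that share a disk
    subsystem, so a failed node's load stays within its own group.
    """
    loads: Dict[str, float] = {}
    for disk, nodes in pairs.items():
        survivors = [n for n in nodes if n != failed]
        if not survivors:
            raise ValueError(f"no surviving node can reach {disk}")
        group_load = load_per_node * len(nodes)
        for node in survivors:
            loads[node] = group_load / len(survivors)
    return loads


def relative_performance(loads: Dict[str, float], load_per_node: float = 1.0) -> float:
    """Performance relative to the fault-free system.

    Assumption: completion time is governed by the slowest (most loaded) node.
    """
    return load_per_node / max(loads.values())


if __name__ == "__main__":
    degraded = loads_after_failure(PAIRS, failed="14a")
    print(degraded)
    print(relative_performance(degraded))
```

In this model, the failure of node 14a leaves node 14b with twice its original load while nodes 14c and 14d are unchanged, so the slowest-node assumption yields a relative performance of one half.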
The problem of unbalanced work load after component failure in a fault-tolerant system has also been addressed by allowing more than two data processing nodes to be connected to a single, shared disk subsystem. However, the bandwidth of the bus connecting the data processing nodes and the disk subsystem then becomes a limiting factor in performance. Connecting more than two data processing nodes to a larger disk subsystem usually does not scale up system performance very much. Thus the indicated approach appears to contradict the performance motivation behind current parallel system architectures.
Another approach to the problem of fault tolerance in parallel processing systems has been attempted by parallel architectures which rely upon multistage interconnection topologies to provide shared access from every processing node to every device in the storage subsystem. However, this approach is costly and may yield unpredictable input/output performance.
It is an object of the invention to develop a simple yet effective architectural approach to fault-tolerance in parallel processing systems.
Another object is to provide a method for redistributing the workload of a failed processing node to adjacent operating data processing nodes in a manner which minimizes the overall system performance reduction caused by the redistributed workload.