Certain large data processing systems employ a multiplicity of independent computer/disk systems (called "nodes"), which operate in parallel on discrete portions of a problem. In such systems, data files are sometimes distributed across the nodes so as to balance nodal work loads and protect against significant losses of data should one or more nodes malfunction. All nodes continuously interact with one another and are interdependent insofar as successful completion of data processing actions is concerned.
The prior art evidences such systems and describes various methods for inter-nodal control and protection against data loss in the event of a malfunction. In U.S. Pat. No. 4,228,496 to Katzman et al., a multiprocessing system is disclosed wherein memory maps enable the creation of protection functions for individual nodal memory modules. The system also includes a sophisticated parity protection scheme to detect and correct bit errors.
In U.S. Pat. No. 4,412,285 to Neches et al., a multiprocessing system is described which employs various protocols for updating processing states between nodes. Little consideration is given, in this patent, to the possibility of individual nodal failure.
As parallel computing systems have come into vogue through the availability of lower cost and more capable processors, nodal failure problems have attracted more attention. This is due to the difficulty of reestablishing a system's processing states and restarting the system, while avoiding the need to completely reload the entire system's software. In this regard, attention has been given to defining certain system transactions as "atomic" and assuring that a record is kept of all actions within a transaction until all such actions have been completed. The actions which comprise an atomic transaction are tracked through a transaction management system. The system contains a record, or log, of each action performed during the processing of the transaction and enables the system to know when a transaction has either been completed or not completed.
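By way of illustration only, the logging idea described above can be sketched as follows. The class name `TransactionLog`, its methods, and the record layout are hypothetical and are not taken from any of the cited references; the sketch merely shows how a log of actions lets a system determine, after a failure, which transactions were completed and which were not.

```python
from dataclasses import dataclass, field


@dataclass
class TransactionLog:
    """Append-only record of actions, grouped by transaction id (hypothetical)."""
    records: list = field(default_factory=list)

    def begin(self, txn_id):
        self.records.append((txn_id, "BEGIN", None))

    def action(self, txn_id, description):
        self.records.append((txn_id, "ACTION", description))

    def commit(self, txn_id):
        self.records.append((txn_id, "COMMIT", None))

    def incomplete(self):
        """Return ids of transactions that began but never committed."""
        begun, committed = [], set()
        for txn_id, kind, _ in self.records:
            if kind == "BEGIN":
                begun.append(txn_id)
            elif kind == "COMMIT":
                committed.add(txn_id)
        return [t for t in begun if t not in committed]


log = TransactionLog()
log.begin("t1")
log.action("t1", "allocate block")
log.commit("t1")
log.begin("t2")
log.action("t2", "update directory entry")
# A failure at this point leaves t2 without a COMMIT record.
print(log.incomplete())  # ['t2']
```

On restart, only the transactions reported as incomplete would need attention; completed transactions need not be redone.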
In U.S. Pat. No. 4,319,336 to Anderson et al., a transaction-based system is described which employs a user-available keyboard entry device. Each transaction which does not "close" results in an interactive status message that enables the central computer to re-initialize the state of the transaction. Consideration is not given to the failure of a system element during the processing of a transaction.
In U.S. Pat. No. 4,819,159 to Shipley et al., a fault-tolerant, parallel computing system is shown which employs transaction-based processing. In the Shipley et al. system, transactions are handled at the "user program or process" level so that transactions are considered on a macroscopic basis, insofar as the system is concerned. The transaction system stores pre-image and subsequent-image data in a transaction log. If an abort occurs during a transaction, the available pre-image of the data enables the system to reset to the state which existed prior to the transaction failure.
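The general pre-image technique can be illustrated with a minimal sketch. This is not the Shipley et al. implementation; the class `PreImageStore` and its methods are assumed names chosen for illustration. Each write records the value held before the write (the pre-image) together with the new value, so that an abort can restore the prior state by replaying the log in reverse.

```python
_MISSING = object()  # marks a key that did not exist before the write


class PreImageStore:
    """Key/value store that logs pre- and after-images (hypothetical)."""

    def __init__(self, data):
        self.data = dict(data)
        self.log = []  # entries of (key, pre_image, after_image)

    def write(self, key, value):
        pre_image = self.data.get(key, _MISSING)
        self.log.append((key, pre_image, value))
        self.data[key] = value

    def commit(self):
        # The transaction completed; the pre-images are no longer needed.
        self.log.clear()

    def abort(self):
        # Undo in reverse order so the earliest pre-image is restored last.
        while self.log:
            key, pre_image, _after_image = self.log.pop()
            if pre_image is _MISSING:
                del self.data[key]
            else:
                self.data[key] = pre_image


store = PreImageStore({"balance": 100})
store.write("balance", 70)
store.write("note", "pending")
store.abort()
print(store.data)  # {'balance': 100}
```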
Prior art transaction management systems known to the inventors hereof have been mainly applied to relatively macroscopic functions occurring within a computing system. Such transaction logs and the functions they performed were largely under the control of the application system programmer and did not enable transaction logging within the system's "microscopic" actions to be carried out by the operating system in a manner transparent to the application system programmer. Thus, when a prior art macroscopic transaction was "rolled back", a number of microscopic functions which had been properly completed were scrapped during the rollback process. Such rollbacks created an unnecessary loss of processing time and, in a parallel processor, created ripple effects throughout parallel processing nodes.
Notwithstanding the use of transaction systems to assure macroscopic data integrity, when a portion of a file-based system malfunctions, reconstructing and restarting the file system can be extraordinarily complex. For instance, in a system employing a Unix operating system, the reconstruction of a file system involves carrying out complete file system consistency checks that can take hours to complete. Such checks involve determining the consistency of disk directories, file tables, free data block lists, pointers in files to other files, etc. Those comparisons involve detailed and time-consuming data handling and greatly hinder a system's rapid recovery from a malfunction.
To partially avoid the time-consuming consistency checks, the AIX System marketed by the IBM Corporation employs certain logging techniques common to database systems. (See Chang et al., "Evolution of Storage Facilities in the AIX System", IBM RISC System/6000 Technology, IBM document SA23-2619, pp. 138-142, 1990.) The AIX System limits itself to logging "metadata" (internally generated file identifier data) but does not log the file contents per se; nor does the AIX System concern itself with distributed file systems.
Another technique for file construction is described by Mueller et al. in "A Nested Transaction Mechanism for LOCUS", ACM Operating System Review, Proceedings of 9th ACM Symposium on Operating System Principles, pp. 71-89 (1983) and "The Locus Distributed System Architecture", edited by Popek et al., MIT Press, pp. 46-48 (1985). A "shadow paging" technique is there described wherein two copies of image data are always made, with the original data copy not being discarded until the transaction employing the new image copy has successfully completed.
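The shadow paging idea can be sketched in a few lines. The following is a simplified, in-memory illustration and not the LOCUS mechanism itself; the class `ShadowPagedFile`, its page-table layout, and its method names are assumptions. Writes go to freshly allocated pages referenced only by a working page table, and the original pages are discarded only when the transaction commits.

```python
class ShadowPagedFile:
    """File whose pages are updated by copy, not in place (hypothetical)."""

    def __init__(self, pages):
        self.pages = dict(enumerate(pages))  # page id -> contents
        self.committed = list(self.pages)    # page table visible to readers
        self.working = None                  # page table of the open transaction
        self.next_id = len(self.pages)

    def begin(self):
        self.working = list(self.committed)

    def write(self, index, contents):
        current = self.working[index]
        if current != self.committed[index]:
            # This transaction already owns a private copy of the page.
            self.pages[current] = contents
            return
        new_id = self.next_id
        self.next_id += 1
        self.pages[new_id] = contents  # the original page is left untouched
        self.working[index] = new_id

    def commit(self):
        obsolete = set(self.committed) - set(self.working)
        self.committed, self.working = self.working, None
        for page_id in obsolete:
            del self.pages[page_id]  # originals are discarded only now

    def abort(self):
        private = set(self.working) - set(self.committed)
        self.working = None
        for page_id in private:
            del self.pages[page_id]  # new copies are dropped instead

    def read(self):
        return [self.pages[page_id] for page_id in self.committed]


f = ShadowPagedFile(["a", "b"])
f.begin()
f.write(1, "B")
f.abort()
print(f.read())  # ['a', 'b']
```

Because the committed page table is replaced in a single step, a failure before commit leaves the original copy intact.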
Another distributed database management system with fault recovery capabilities is discussed by Lindsay et al. in "Computation and Communication in R*: A Distributed Database Manager," ACM Transactions on Computer Systems, Vol. 2, No. 1, Feb. 1984, pp. 24-38. The Lindsay et al. database employs a distributed transaction system and a nested form of a two-phase commit protocol to ensure that all sites of a transaction come to the same conclusion as to the outcome of the transaction. There is no indication in the paper of the particular log-management strategy employed by Lindsay et al. and how or whether it copes with "microscopic" transactions.
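For orientation, a basic (flat, non-nested) two-phase commit can be sketched as follows. This is a generic illustration and not the nested protocol of Lindsay et al.; the `Site` class, its `can_commit` flag, and the function `two_phase_commit` are hypothetical names. In the first phase a coordinator collects a vote from every site; in the second phase it sends all sites the same outcome.

```python
class Site:
    """One participant in a distributed transaction (hypothetical)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.outcome = None

    def prepare(self):
        # Phase 1: vote on whether the local work can be made permanent.
        return self.can_commit

    def finish(self, outcome):
        # Phase 2: apply the coordinator's decision.
        self.outcome = outcome


def two_phase_commit(sites):
    """Commit only if every site votes yes; otherwise abort everywhere."""
    votes = [site.prepare() for site in sites]
    outcome = "commit" if all(votes) else "abort"
    for site in sites:
        site.finish(outcome)
    return outcome


sites = [Site("node-a"), Site("node-b", can_commit=False), Site("node-c")]
print(two_phase_commit(sites))           # abort
print({s.outcome for s in sites})        # {'abort'}
```

The sketch omits logging, timeouts, and coordinator failure, all of which a practical protocol must address.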
Accordingly, it is an object of this invention to provide an improved transaction logging and control system for a distributed file storage system in a parallel computing system.
It is another object of this invention to provide an improved transaction logging and control system for microscopic-level operations in a distributed file system.
It is yet another object of this invention to provide a transaction logging and control system for data in a distributed file system which is invisible to the application programmer.
It is still another object of this invention to provide a transaction logging and control system which assures distributed file system consistency in the event of a system malfunction and thus provides for rapid file system recovery.