The prior art has disclosed a number of virtual storage data processing systems which employ a single standalone Central Processing Unit (CPU). These systems generally employ a main storage having a plurality of individually addressable storage locations, each of which stores one byte of data, and a secondary storage device, such as a disk file, which includes a plurality of block addressable storage locations, each of which stores a block of data. The virtual storage concept involves what is sometimes referred to as a single-level store. In a single-level store, the maximum address range of the system is generally much larger than the real capacity of the main storage. The main storage is made to appear much larger by the use of a paging mechanism and a secondary storage device which cooperate to keep the data required by the application program in main storage. A page fault occurs whenever a page that is addressed by the application program is not in main storage. The function of the paging mechanism is to transfer that page of data from the secondary storage device to main storage; this transfer is called page fault handling.
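The paging behavior described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the class name `PagedStore`, the choice of least-recently-used eviction, and the read-only treatment of pages are all assumptions made for illustration.

```python
from collections import OrderedDict


class PagedStore:
    """Toy single-level store: a small main storage backed by secondary storage."""

    def __init__(self, secondary, frames):
        self.secondary = secondary   # page number -> block of data (stands in for the disk file)
        self.frames = frames         # number of pages that fit in main storage
        self.main = OrderedDict()    # resident pages, least recently used first
        self.page_faults = 0

    def read(self, page):
        if page not in self.main:    # page fault: the addressed page is not resident
            self.page_faults += 1
            self._handle_page_fault(page)
        self.main.move_to_end(page)  # mark the page as most recently used
        return self.main[page]

    def _handle_page_fault(self, page):
        # Assumption: evict the least recently used page when main storage is full.
        # Pages are read-only in this sketch, so nothing is written back.
        if len(self.main) >= self.frames:
            self.main.popitem(last=False)
        self.main[page] = self.secondary[page]


store = PagedStore({n: f"block-{n}" for n in range(8)}, frames=2)
for n in (0, 1, 0, 2, 1):
    store.read(n)
print(store.page_faults)  # 4
```

The address range (eight pages) is larger than main storage (two frames), yet every read succeeds; the cost shows up only as page faults.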
The prior art has also disclosed a number of different multi-processor system configurations that are sometimes employed to obtain increased data processing power. A multi-processor system configuration may be thought of as a plurality of processing units sharing a logical communication channel. The logical communication channel may take the form of storage shared among the processing units into which messages from one processing unit to another processing unit may be placed. Additionally, the logical communication channel may take the form of a communication network (including shared buses) through which messages may travel from one processing unit to another processing unit.
In some prior art multi-processor system configurations referred to as tightly-coupled multi-processor configurations, the processing units in the configuration share some amount of storage which any of the processing units in the configuration may access. Each processing unit, however, may have some amount of private storage which only it and no other processing unit may access.
Computing systems arranged in a tightly-coupled multi-processor configuration have the benefit of rapid communication via shared storage and may also exploit the shared storage as a disk cache. A page fault may occur when an application program executing on one of the processing units in a tightly-coupled multi-processor configuration addresses a page of data that is not in main storage. During page fault handling, the appropriate secondary storage device connected to the configuration is commanded to place the appropriate page of data into the shared storage. Once the page of data has been placed in the shared storage, it may be addressed by any of the processing units in the configuration.
A practical limit, however, is reached for tightly-coupled multi-processor configurations when the contention for access to shared storage among the processing units in the configuration exceeds the benefit provided by the shared storage when used as a disk cache. For example, one processing unit in the configuration may attempt to change the contents of a page of data while another processing unit is attempting to examine the contents of the same page of data. Some mechanism must normally be provided by the configuration to lock out one of the processing units in favor of the other so that the two processing units see a consistent view of the data. Various methods exist in the prior art to enforce a consistent view of data upon the processing units in a tightly-coupled, multi-processor configuration.
These prior art methods involve idling one of the processing units in the configuration until the other processing unit has completed its access to shared storage. A processing unit that has been idled cannot perform useful work; thus, contention for access to shared storage inevitably results in some loss of processing power for the configuration, when the configuration is considered as a whole. For these reasons, the number of processing units in a single tightly-coupled, multi-processor configuration rarely exceeds six.
In some prior art multi-processor system configurations referred to as closely-coupled or "clustered" multi-processor configurations, the plurality of processing units are connected via a communications network and each processing unit may access its own storage directly and no other processing unit has access to that storage. The processing units in a closely-coupled multi-processor configuration may share data by sending messages via the communications network to other processing units within the configuration.
In a variation on the closely-coupled multi-processor configuration, one of the processing units in the configuration operates as a shared storage processing unit. The main storage attached to the shared storage processing unit is used as a disk cache managed by the shared storage processing unit. The shared storage processing unit is also assigned the function of controlling which of the other processing units can have access to what area of the shared storage at what time and under what conditions.
More recently, the prior art has begun to configure standalone personal computers or standalone engineering work stations into a local area network. In such an arrangement, which is called a loosely-coupled multi-processor configuration or a distributed system configuration, any work station can communicate with another work station employing standard communication protocols. The motivation that exists for establishing such a loosely-coupled configuration is not necessarily more data processing power, but simply one of convenience of exchanging information electronically instead of non-electronically. However, it has been found in many situations that the individual work stations are running the same operating system.
A paper entitled "Memory Coherence in Shared Virtual Storage Systems" authored by Kai Li and Paul Hudak and presented at the 5th Annual Association for Computing Machinery Symposium on Principles of Distributed Computing, 1986, discloses a plurality of virtual-memory data processing units interconnected in a clustered configuration. In this arrangement all units have the same operating system and address the same virtual address space. Each unit is the owner of a different set of files which is stored in that owner's storage system. A non-owner running an application program obtains access to the other unit's storage system through a suitable communication link, which causes requests to the file owner for virtual pages of data which are then returned to the requester. Each processing unit of the clustered configuration therefore shares the set of files in its virtual storage system with the other units in the configuration.
A paper entitled "The Integration of Virtual Memory Management and Interprocess Communication in Accent" authored by R. Fitzgerald and R. F. Rashid and published in the May, 1986 issue of ACM Transactions on Computer Systems 4(2) describes the Accent operating system, developed at Carnegie-Mellon University. The Accent operating system integrates virtual storage management and inter-process communication within the kernel such that large data transfers use storage mapping techniques, rather than data copying, to implement kernel services.
In multi-processor systems employing shared virtual storage, there are two pervasive problems. One is the emergence of partial failures and the resulting level of reliability offered by the system. The other is the added complexity and amount of special-purpose code required in the kernel to distribute its services.
When a uniprocessor system "crashes" or fails, the services supplied by the system and the users of the services crash together, so that a total failure is seen. In a distributed configuration, one processor may crash while others stay up--services supplied by the crashed processor are then seen by their users to have failed, giving rise to partial failures. In order to resume useful work, the system must first bring itself into a consistent state, which may be a difficult task. As a result, most multi-processor operating systems either "kill" and re-start affected applications from the beginning (or from a checkpoint), or they assume that applications and/or subsystems are willing to deal with partial failure on their own, and therefore provide little or no assistance, as discussed in a paper entitled "A Non-Stop Kernel" authored by J. F. Bartlett and published in Proceedings of the Eighth Symposium on Operating System Principles in December, 1981. One goal of a clustered system is transparency, i.e. users and application programs should not be aware of the existence of a plurality of processor units. Thus, steps must be taken to preclude or minimize the effect of partial failures in a clustered system.
In a clustered system of independent processors, communication is necessarily involved, so protocols are needed, and there may be special processes and other related facilities. If a single mechanism can be found which removes or reduces the need for the special facilities, this simplifies the implementation of the system services, which are now distributed, and makes it possible to optimize the underlying mechanism rather than putting effort into each special facility.
In prior art distributed data processing systems, it was common for one unit in the system which needed a particular function to request another processing unit in the distributed system to do the work for it. In effect, one processor shipped the service request to a different processor unit in the system which had been assigned that particular work function, and, accordingly had the necessary data structures available to accomplish the work. Such a "function shipping" implementation required the use of complicated code structures which made recovery from a partial failure difficult.
In addition, loosely-coupled multi-processor configurations disclosed in the prior art were traditionally designed around a message-passing communication model in which individual kernels running on separate processor units sent messages containing requests for services to other processor units within the configuration that managed configuration-wide shared resources. Reliance on such a message-passing model undoubtedly occurred because message passing corresponds naturally to the underlying communications connections among the processing units.
The difficulty of sharing complex data structures in a message-passing implementation is well known and is discussed in a paper entitled "A Value Transmission Method for Abstract Data Types" by M. Herlihy and B. Liskov and published in the ACM Transactions on Programming Languages and Systems, Vol. 4, No. 4 in October 1982, which is herein incorporated by reference. The difficulty of a message-passing model is further discussed in a doctoral dissertation entitled Remote Procedure Call, by B. Nelson and published by Carnegie-Mellon University in May 1981, which is also incorporated herein by reference.
In contrast, prior art operating systems for tightly-coupled multi-processor configurations have not traditionally been implemented around a message-passing model; rather, the processing units in the configuration share some amount of main storage, where kernels share complex data structures and pass among them only pointers to these structures. It is evident that operating systems originally developed for uniprocessors have, with some modifications in the areas of serialization and cache consistency, been modified rather than rewritten to execute efficiently on such tightly-coupled, multi-processor configurations. It would be unusual and difficult to modify an operating system constructed around a message-passing model to execute on such tightly-coupled, multi-processor configurations.
In co-pending U.S. patent application Ser. No. 07/126,820, a novel system and method of "recoverable shared virtual storage (RSVS)" or "cluster" storage in a shared virtual storage, closely-coupled, multi-processor, data processing system is disclosed. Such a system achieves the goal of being a "high availability" data processing system which also allows for horizontal growth by employing a novel method which minimizes loss of data due to aborted transactions. Horizontal growth may be defined as adding processor units to a clustered system and achieving higher performance, either in reduced time to process a set of programs, or to allow more programs to be processed simultaneously without significantly extending the response time of the system.
A "transaction" is a unit of work performed by an application program that may update data stored in virtual storage that is shared among the processing units in a clustered configuration. A transaction runs under the thread of execution of a single process running a single program on a single processing unit in the clustered configuration. The novel system disclosed in the co-pending application maintains copies of data structures that are affected by identified transactions performed by one processor and only updates the copies located on a different processor when a transaction has been committed. Transactions that must be aborted for any reason can therefore be retried, since the information as it existed at the start of the transaction is available in the copy stored on another processor.
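The commit-time update of a remote copy can be sketched as follows. This is an illustrative model only, not the implementation of the co-pending application: the names `Processor` and `Transaction`, and the use of a deep copy to stand in for transferring a data structure between processors, are assumptions.

```python
import copy


class Processor:
    def __init__(self, name):
        self.name = name
        self.storage = {}  # structure name -> data held by this processor


class Transaction:
    """Updates a working copy on one processor; a second processor holds the backup."""

    def __init__(self, primary, backup, key):
        self.primary, self.backup, self.key = primary, backup, key

    def update(self, change):
        change(self.primary.storage[self.key])  # only the working copy changes

    def commit(self):
        # The copy on the other processor is updated only at commit time.
        self.backup.storage[self.key] = copy.deepcopy(self.primary.storage[self.key])

    def abort(self):
        # The backup still holds the state as of the start of the transaction.
        self.primary.storage[self.key] = copy.deepcopy(self.backup.storage[self.key])


a, b = Processor("A"), Processor("B")
a.storage["queue"] = ["m1"]
b.storage["queue"] = ["m1"]

t1 = Transaction(a, b, "queue")
t1.update(lambda q: q.append("m2"))
t1.abort()                                # working copy is restored from B
after_abort = list(a.storage["queue"])

t2 = Transaction(a, b, "queue")           # the aborted work is retried
t2.update(lambda q: q.append("m2"))
t2.commit()
```

After the abort, processor A again holds the pre-transaction state, so the retry starts from consistent data; only the commit makes the change visible in B's copy.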
The co-pending application discloses an implementation of the invention based on the IBM AIX™¹ operating system, which uses a form of shared virtual storage, provides atomic, serialized update semantics and provides Degree 3 consistency, also known as read-write serializability. Transactions are atomic in that either all of the changes made by a given transaction are made visible or none are, and it is possible to undo all changes at any time until they are committed. They are serializable in that the hardware locking support described in said application ensures that, although several transactions may take place "simultaneously", the results are as if the transactions had taken place serially in some order.
¹ AIX is a registered trademark of IBM Corporation.
A paper entitled "801 Storage: Architecture and Programming," by A. Chang and M. Mergen, and published in the ACM Transactions on Computer Systems, February, 1988, describes the concept of "database storage". In order to understand RSVS or cluster storage, it is useful to have some understanding of database storage.
An object, such as a file or a data structure, is mapped into a virtual storage segment. All users of the object access it at the same virtual address, which allows sharing in a natural way. Operations on one or more such objects take place as transactions. When a transaction accesses database storage it implicitly acquires a read or write lock on the storage, as required. If the lock conflicts with those held by other transactions, the transaction is made to wait. Eventually the transaction finishes and it completes by calling either commit or undo. In the former case, the transaction's updates are made permanent, by writing them to secondary storage, while in the latter case they are discarded. In either case, the locks are freed and any processes waiting on them are allowed to continue.
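The locking and commit/undo behavior of database storage described above can be sketched in a few lines. This is a single-threaded illustration, not the 801 implementation: the names `DatabaseStorage` and `Conflict` are hypothetical, and a conflicting request raises an exception at the point where a real system would make the transaction wait.

```python
class Conflict(Exception):
    """Raised where a real system would make the transaction wait."""


class DatabaseStorage:
    def __init__(self):
        self.secondary = {}  # committed values, standing in for secondary storage
        self.pending = {}    # transaction -> {object: uncommitted value}
        self.locks = {}      # object -> (mode, set of owning transactions)

    def _lock(self, txn, obj, mode):
        held = self.locks.get(obj)
        if held is None:
            self.locks[obj] = (mode, {txn})
            return
        held_mode, owners = held
        if owners == {txn}:
            if mode == "write":              # a sole owner may upgrade read to write
                self.locks[obj] = ("write", owners)
        elif mode == "read" and held_mode == "read":
            owners.add(txn)                  # read locks are shared
        else:
            raise Conflict(obj)

    def read(self, txn, obj):
        self._lock(txn, obj, "read")         # lock is acquired implicitly on access
        return self.pending.get(txn, {}).get(obj, self.secondary.get(obj))

    def write(self, txn, obj, value):
        self._lock(txn, obj, "write")
        self.pending.setdefault(txn, {})[obj] = value

    def commit(self, txn):
        self.secondary.update(self.pending.pop(txn, {}))  # updates become permanent
        self._release(txn)

    def undo(self, txn):
        self.pending.pop(txn, None)                       # updates are discarded
        self._release(txn)

    def _release(self, txn):
        for obj in list(self.locks):
            owners = self.locks[obj][1]
            owners.discard(txn)
            if not owners:
                del self.locks[obj]


db = DatabaseStorage()
db.write("T1", "inode", 1)
try:
    db.read("T2", "inode")
    blocked = False
except Conflict:
    blocked = True                 # T2 conflicts with T1's write lock
db.commit("T1")
value_seen_by_t2 = db.read("T2", "inode")
db.write("T2", "inode", 2)
db.undo("T2")                      # T2's update never reaches secondary storage
```

In both the commit and the undo path the locks are freed, which is what would let a waiting transaction proceed.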
Unlike database storage, however, recoverable shared virtual storage (RSVS) is designed for storing computational data which is not needed if the entire cluster crashes. Rather, the data is built up as the system begins and continues operation.
Thus, when changes to recoverable shared virtual storage are made visible, they are not written to secondary storage, as changes to database storage are. So long as at least two copies of a page exist in different processors in the cluster, the page of data is recoverable.
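The recoverability condition stated above can be expressed as a small check. The function names and the representation of page copies as sets of processor names are assumptions made for this sketch.

```python
def is_recoverable(holders, failed=frozenset()):
    """A page is recoverable if a copy survives on a processor that has not failed."""
    return bool(set(holders) - set(failed))


def tolerates_any_single_failure(holders):
    """True when copies exist on at least two different processors."""
    return len(set(holders)) >= 2


# Hypothetical placement of page copies across processors A and B.
copies = {"page-7": {"A", "B"}, "page-9": {"A"}}

for page, holders in copies.items():
    print(page, is_recoverable(holders, failed={"A"}))
```

With processor A failed, `page-7` remains recoverable from B, while `page-9`, which had only one copy, is lost.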
The co-pending application also discloses an implementation based on file structures, i.e., structures which are written to secondary storage when the transaction is committed. It does not address the manner in which recoverable shared virtual storage (RSVS) may be applied to ensure the recoverability of "shared data structures" in the event of a partial failure, or of data structures that are not written to secondary storage.
Shared data structures include data structures for interprocess communication ("IPC") mechanisms, such as message queues, semaphores, and shared memory segments; file system data structures, such as the in-core inode table, the open file table, and the directory cache (for both local and remote directories); and subsystem global data, such as the SNA connection table.
Message queues provide a useful mechanism for interprocess communication in operating systems based on or derived from the UNIX² operating system, such as the IBM AIX™ operating system. Processes can communicate by first creating a message queue, then exchanging messages via the queue. A set of system calls is provided to use this mechanism.
² UNIX is a registered trademark of AT&T Bell Laboratories.
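The create-then-exchange pattern can be illustrated with a toy in-memory model. The method names below echo the System V calls `msgget`, `msgsnd`, and `msgrcv`, but they are Python stand-ins written for this sketch, not bindings to the system calls. In particular, returning `None` from an empty queue instead of blocking is an assumption made to keep the example single-threaded.

```python
from collections import deque


class MessageQueues:
    """Toy model of keyed message queues; not the kernel implementation."""

    def __init__(self):
        self._queues = {}

    def msgget(self, key):
        # Create the queue for this key if it does not exist; return its identifier.
        self._queues.setdefault(key, deque())
        return key

    def msgsnd(self, qid, mtype, text):
        self._queues[qid].append((mtype, text))

    def msgrcv(self, qid, mtype=0):
        # mtype 0 takes the first message; a positive mtype takes the first of that type.
        queue = self._queues[qid]
        for index, (found_type, text) in enumerate(queue):
            if mtype == 0 or found_type == mtype:
                del queue[index]
                return found_type, text
        return None  # assumption: no blocking in this sketch


mq = MessageQueues()
qid = mq.msgget(1234)            # 1234 is an arbitrary key
mq.msgsnd(qid, 1, "hello")
mq.msgsnd(qid, 2, "urgent")
first = mq.msgrcv(qid, 2)        # selects by message type
second = mq.msgrcv(qid)          # takes the oldest remaining message
third = mq.msgrcv(qid)           # queue is now empty
```

Two processes that agree on the key obtain the same queue, which is what allows them to exchange messages without any other shared state.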
Recent prior art developments relating to message queue implementation have taken different approaches. UNIX development has centered primarily on work done by AT&T, called "System V", and by the University of California at Berkeley, called "Berkeley". Both of these versions have had a form of interprocess communication integrated into them. Berkeley provides two versions of IPC called "datagrams" and "virtual circuits", both of which are built on the concept of a "socket." According to B. D. Fleisch in his article entitled "Distributed System V IPC in LOCUS: A Design and Implementation Retrospective" published in the Communications of the ACM in February 1986, "Berkeley's IPC is best suited for 'long-haul' environments". On the other hand, "System V" IPC is built for a single system image of computation. More particularly, Fleisch's article describes the distribution of System V IPC. In the LOCUS system it is possible, for example, to share a message queue between processes running on different processors; if, however, one of the processors crashes, the messages in the queue on that processor are lost, although the identity of the queue is not. Thus, the existence of the distributed system becomes visible to the surviving processes in the event of a partial failure. In order to keep the message queues recoverable in the face of failure, the LOCUS system takes special steps. A queue is referred to by a unique "handle". The handle's value includes identifiers and "boot counts", i.e., the number of times the system has been started, which are checked whenever the handle is used. The "name server", which allocates and assigns handles, must always be available, so there is a mechanism in the operating system kernel to start a second one if the first fails.
The name server and the kernels have to communicate, which is done through a distinguished queue. Messages go from the kernel to the name server using the normal mechanisms; replies from the name server are intercepted by the kernel which recognizes the distinguished queue's handle and routes the reply from the name server's machine to the one where its client resides. When the processor unit containing the name server crashes, a new name server processor unit is elected. Parts of the name server's database have been replicated at each processor unit within the cluster, and the new name server can rebuild the entire database, and reconstruct what was at the failed processor unit, by polling the surviving processor units. This is a fairly complicated and lengthy procedure. It should also be noted that only the queues that had existed at the failed processor unit are lost.
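The role of boot counts in validating a handle can be sketched as follows. The exact layout of a LOCUS handle is not given here, so the field names `site`, `queue_id`, and `boot_count`, and the `Cluster` class, are hypothetical; the sketch shows only the general idea of rejecting a handle issued before a restart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handle:
    site: str        # hypothetical: processor unit where the queue resides
    queue_id: int    # hypothetical: identifier of the queue at that site
    boot_count: int  # number of times the site had been started when issued


class Cluster:
    def __init__(self):
        self.boot_counts = {}  # site -> number of times it has been started

    def boot(self, site):
        self.boot_counts[site] = self.boot_counts.get(site, 0) + 1

    def new_handle(self, site, queue_id):
        return Handle(site, queue_id, self.boot_counts[site])

    def is_valid(self, handle):
        # Checked whenever the handle is used: a restart since issue makes it stale.
        return self.boot_counts.get(handle.site) == handle.boot_count


cluster = Cluster()
cluster.boot("A")
handle = cluster.new_handle("A", queue_id=7)
valid_before_crash = cluster.is_valid(handle)

cluster.boot("A")  # site A crashes and is started again
valid_after_restart = cluster.is_valid(handle)
```

The check detects that the queue's contents did not survive the restart, but it does nothing to preserve them, which is the limitation noted above.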
Although the above-referenced mechanisms may be effective for providing some level of reliability, a system with substantially higher reliability is desirable, especially a system which does not require a set of complex special-purpose mechanisms to provide higher reliability. Therefore, it is desirable to develop a mechanism for implementing shared data structures, such as message queues, which retains not only the existence of the data structures in the event of a processor failure, but also saves any data within the data structures at the time of failure. In particular, it is desirable to implement a form of highly reliable data structures by adapting the concept of recoverable shared virtual storage (RSVS) to the implementation of message queues and other shared data structures. Thus, it would not be necessary to implement special mechanisms for individual data structures to achieve higher reliability.