The prior art has disclosed a number of virtual memory data processing systems which employ a single standalone Central Processing Unit (CPU). These systems generally employ a main memory having a plurality of individually addressable storage locations, each of which stores one byte of data, and a secondary storage device, such as a disk file, which includes a plurality of block addressable storage locations, each of which stores a block of data. For discussion purposes it is convenient to assume that each block address of the disk file stores a page of data comprising, for example, 2K (2048) bytes of data. The virtual memory concept involves what is sometimes referred to as a single-level store. In a single-level store, the maximum address range of the system is generally much larger than the real capacity of the main memory. The main memory is made to appear much larger by the use of a paging mechanism and a secondary storage device which cooperate to keep the data required by the application program in main memory. A page fault occurs whenever a page that is addressed by the application program is not in main memory. The function of the paging mechanism is to transfer that page of data from the disk file to main memory; this transfer is called page fault handling.
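The page fault handling described above may be illustrated by the following minimal sketch. It is not taken from any prior art system: the class name PagedMemory, the method names, and the first-in first-out replacement policy are all assumptions made for illustration, and only the 2048-byte page size comes from the discussion above.

```python
PAGE_SIZE = 2048  # bytes per page, matching the 2K example in the text


class PagedMemory:
    """Toy single-level store: a small main memory backed by a disk file."""

    def __init__(self, disk, num_frames):
        self.disk = disk              # block address -> page contents (bytes)
        self.num_frames = num_frames  # real capacity of main memory, in pages
        self.frames = {}              # resident pages: page number -> bytearray
        self.page_faults = 0

    def _handle_page_fault(self, page_no):
        # Page fault handling: transfer the page from the disk file.
        self.page_faults += 1
        if len(self.frames) >= self.num_frames:
            # Assumed policy: evict the page that has been resident longest.
            victim = next(iter(self.frames))
            self.disk[victim] = bytes(self.frames.pop(victim))
        contents = self.disk.get(page_no, bytes(PAGE_SIZE))
        self.frames[page_no] = bytearray(contents)

    def read_byte(self, address):
        page_no, offset = divmod(address, PAGE_SIZE)
        if page_no not in self.frames:  # the addressed page is not resident
            self._handle_page_fault(page_no)
        return self.frames[page_no][offset]
```

In this sketch an application addresses bytes over a range larger than the number of resident frames, and only the first reference to a non-resident page incurs a disk transfer.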
The prior art has also disclosed a number of multi-processor system configurations that are sometimes employed to obtain increased data processing power. A multi-processor system configuration may be thought of as a plurality of processing units sharing a logical communication channel. The logical communication channel may take the form of memory shared among the processing units into which messages from one processing unit to another processing unit may be placed. Additionally, the logical communication channel may take the form of a communication network through which messages from one processing unit to another processing unit may travel.
In some prior art multi-processor system configurations referred to as tightly-coupled multi-processor configurations, the processing units in the configuration share some amount of memory which any of the processing units in the configuration may access, and each processing unit may have some amount of private memory which only it and no other processing unit may access.
Computing systems arranged in a tightly-coupled multi-processor configuration have the benefit of rapid communication via shared memory and may also exploit the shared memory as a disk cache. A page fault may occur when an application program executing on one of the processing units in a tightly-coupled multi-processor configuration addresses a page of data that is not in main memory. During page fault handling, the appropriate secondary storage device connected to the configuration is commanded to place the appropriate page of data into the shared memory. Once the page of data has been placed in the shared memory it may be addressed by any of the processing units in the configuration.
If the plurality of processing units in a multi-processor configuration are working on a common problem, it is normal for the data they access to be accessed in such a way as to experience "locality of reference". The term locality of reference is used when there is some non-zero probability that a page of data retrieved from secondary storage and placed in shared memory to satisfy a page fault resulting from an access to virtual memory by an application program executing on one processing unit in the configuration, will also be accessed by another application program, executing on another processing unit in the configuration before the page frame in shared memory holding that page of data has been re-used by the configuration to hold another page of data. If such an access by another application program executing on another processing unit in the configuration occurs, the configuration may avoid a disk access by satisfying the page fault with that page of data already in shared memory.
A practical limit however is reached for tightly-coupled multi-processor configurations when the contention for access to shared memory among the processing units in the configuration exceeds the benefit provided by the shared memory when used as a disk cache. For instance, one processing unit in the configuration may attempt to change the contents of a page of data while another processing unit is attempting to examine the contents of the same page of data. Some mechanism must normally be provided by the configuration to lock out one of the processing units in favor of the other so that the two processing units see a consistent view of the data. Various methods exist in the prior art to enforce a consistent view of data upon the processing units in a tightly-coupled multi-processor configuration. These methods involve idling one of the processing units in the configuration until the other processing unit has completed its access to shared memory. The processing unit that has been idled cannot be idle and also perform useful work; thus, contention for access to shared memory inevitably results in some loss of processing power for the configuration when considered as a whole. For these reasons, the number of processing units in a single tightly-coupled multi-processor configuration rarely exceeds six.
In some other prior art multi-processor system configurations referred to as closely-coupled multi-processor configurations, the plurality of processing units are connected via a communications network and each processing unit may access its own memory directly and no other processing unit has access to that memory. The processing units in a closely-coupled multi-processor configuration may share data by sending messages via the communications network to other processing units within the configuration. A variation on the closely-coupled multi-processor configuration distinguishes one of the processing units in the configuration as a shared memory processing unit. The main memory attached to the shared memory processing unit is used as a disk cache managed by the shared memory processing unit. The shared memory processing unit is assigned the function of controlling which of the other processing units can have access to what area of the shared memory at what time and under what conditions. When the shared memory is a virtual memory involving a fast main memory and a relatively slow secondary storage device, the size of the main memory which is required to obtain a respectable hit ratio is directly related to the total number of instructions that are being executed by the multi-processor configuration per second. Individual processing units are sometimes rated in Millions of Instructions Per Second (MIPS). If two 4 MIPS processing units and a third shared memory processing unit are employed in a closely-coupled multi-processor configuration, the main memory associated with the configuration must have approximately 80 megabytes of byte addressable memory to obtain a respectable hit ratio. The rule of thumb that is used is that 10 megabytes of byte addressable main memory per MIPS is required to obtain an 85 percent hit ratio in the shared memory.
Therefore, if another 4 MIPS processing unit is added to the multi-processor configuration, another 40 megabytes of byte addressable memory should be added to the main memory of the shared memory processing unit to maintain the 85 percent hit ratio. A practical limit however is reached in the number of processing units that can be added to the configuration before the cost parameters and performance reach the point of diminishing returns.
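The sizing arithmetic implied by the rule of thumb may be sketched as follows. The function name required_cache_mb is hypothetical, and the sketch assumes, consistent with the 80 megabyte figure given for two 4 MIPS units, that only the processing units executing application work are counted and not the shared memory processing unit itself.

```python
MB_PER_MIPS = 10  # rule of thumb from the text, for roughly an 85 percent hit ratio


def required_cache_mb(unit_mips):
    """Return the main memory, in megabytes, suggested by the rule of thumb.

    unit_mips lists the MIPS rating of each processing unit other than the
    shared memory processing unit (an assumption of this sketch).
    """
    return MB_PER_MIPS * sum(unit_mips)


two_units = required_cache_mb([4, 4])       # the configuration in the text
three_units = required_cache_mb([4, 4, 4])  # after adding a third 4 MIPS unit
growth = three_units - two_units            # memory added for the extra unit
```

Under this rule the required memory grows linearly with the aggregate instruction rate, which is why each added processing unit carries a fixed memory cost.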
More recently the prior art has begun to configure standalone personal computers or standalone engineering work stations into a local area network. In such an arrangement, which is called a loosely-coupled multi-processor configuration or a distributed system configuration or a cluster configuration, any work station can communicate with another work station employing standard communication protocols. The motivation that exists for establishing the cluster configuration is not necessarily more data processing power, but simply the convenience of exchanging information electronically rather than non-electronically. However, it has been found in some situations that the individual work stations are running the same operating system and at times run the same application programs. A paper entitled "Memory Coherence in Shared Virtual Storage Systems" authored by Kai Li and Paul Hudak and presented at the 5th Annual Association for Computing Machinery Symposium on Principles of Distributed Computing 1986, discloses a plurality of virtual memory data processing units interconnected in a cluster configuration. In this arrangement all units have the same operating system and address the same virtual address space. Each unit is the owner of a different set of files which is stored in that owner's memory system. A non-owner running an application program obtains access to the other unit's memory system through a suitable communication link, which causes requests to the file owner for virtual pages of data which are then returned to the requester. Each unit of the cluster configuration therefore shares the set of files in its virtual memory system with the other units in the configuration. Page faults resulting from requests are serviced by the file owner. If the request is local, that is from the owner, the requested page is transferred from the owner's secondary storage directly to the owner's main memory.
If the request is from a remote unit, the page is transferred from the owner's secondary storage to the requester's main memory through the communication link. A system protocol is established to control what happens to pages of data after the requesting unit is finished with them. This protocol addresses such issues as when to return a page to the owner, how to manage concurrent requests for the same page if one unit wants to write to that page while other units want to read from that page, and various other situations that are common to functions that share stored data.
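The servicing of local and remote page faults by the file owner may be sketched as follows. This is an illustrative model only and not the protocol of the cited paper: the class Unit, its counters, and the direct method call standing in for the communication link are all assumptions.

```python
class Unit:
    """One processing unit in a cluster; owns the pages in its secondary storage."""

    def __init__(self, name, owned_pages):
        self.name = name
        self.secondary_storage = dict(owned_pages)  # page id -> page contents
        self.main_memory = {}                       # pages resident at this unit
        self.disk_reads = 0                         # disk accesses made as owner
        self.requests_received = 0                  # remote requests serviced

    def serve_page(self, page_id):
        # Runs at the owner: read the page from the owner's secondary storage.
        self.disk_reads += 1
        return self.secondary_storage[page_id]

    def page_fault(self, page_id, owner):
        # A remote fault is modeled as a request to the owner; the method call
        # below stands in for the communication link (an assumption).
        if owner is not self:
            owner.requests_received += 1
        self.main_memory[page_id] = owner.serve_page(page_id)
        return self.main_memory[page_id]
```

In this simplified model every fault, local or remote, costs the owner one disk access, so concurrent remote requests translate directly into load on the owner's secondary storage device.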
The sharing by each processing unit of its virtual memory with other processing units in the cluster has some potential advantages in that the size or capacity of the secondary storage devices can be reduced since the total number of files available to the cluster is spread out among a number of secondary storage devices. This would permit the use of devices with faster access times and/or lower cost. A potential disadvantage is that concurrent requests from a number of different units to an owning unit will result in a number of disk accesses that occur in sequence. While the requests are generally serviced in an overlapped manner, a disk access is a relatively time consuming operation for the unit and could severely impact the performance of the owning unit, which is perhaps executing an unrelated application program that is competing for the services of the secondary storage device.
The invention disclosed and claimed in the cross-referenced U.S. application Ser. No. 07/126,814 is directed to a novel method for use by a shared virtual memory, cluster configured, data processing system in which the number of page faults requiring access to the secondary storage devices is considerably reduced.
Loosely coupled multi-processor configurations disclosed in the prior art have traditionally been architected around a message passing model in which individual kernels running on separate processing units send messages containing requests for service to other processing units within the configuration that manage configuration-wide shared resources. Reliance on a message passing model has undoubtedly occurred because message passing corresponds naturally to the underlying communications connections among the processing units, which is generally believed to compose the primary performance bottleneck in a loosely coupled configuration; however, message passing as a model for system coupling has several drawbacks.
The difficulty of directly sharing complex data structures (e.g. control blocks containing pointers) among processors in message passing systems is well known.
The difficulty of sharing complex data structures given a message-passing model is discussed in a paper entitled "A Value Transmission Method For Abstract Data Types" by M. Herlihy and B. Liskov and published in the ACM Transactions on Programming Languages and Systems, Vol. 4, No. 4 in October of 1982. This subject is further discussed in a doctoral thesis entitled "Remote Procedure Call", by B. Nelson, and published by Carnegie Mellon University in May of 1981.
In order to share a list of elements between two components of an operating system executing on separate processing units within a multi-processor configuration, which is itself a relatively common requirement, the elements have to be packed into a format suitable for transmission at the sending component, transmitted from the sending component to the receiving component, then unpacked at the receiving component. This sequence of operations is inefficient both in processor utilization and in communication channel utilization.
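The pack, transmit, and unpack sequence may be sketched as follows for a simple linked list of integers. The Node class, the wire format (a 32-bit element count followed by 32-bit signed values in network byte order), and the function names are assumptions chosen for illustration; transmission itself is omitted.

```python
import struct


class Node:
    """Element of a linked list held in the sender's private memory."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def pack_list(head):
    # Pointers are meaningless at the receiver, so the list is flattened
    # into a count followed by the element values.
    values = []
    node = head
    while node is not None:
        values.append(node.value)
        node = node.next
    return struct.pack(f"!I{len(values)}i", len(values), *values)


def unpack_list(message):
    # Rebuild an equivalent list, with fresh pointers, at the receiver.
    (count,) = struct.unpack_from("!I", message, 0)
    values = struct.unpack_from(f"!{count}i", message, 4)
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head
```

Both components traverse every element, and the receiver ends with a copy rather than the original structure, which is the overhead the preceding paragraph describes.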
More important, this sequence of operations is complex and unwieldy. The primary drawback of message passing is that it forces both the sending and receiving components into awkward and complex architectures that tend to be costly and difficult to implement, debug, augment, and maintain. Since the kernel of a typical general purpose operating system tends to be composed of many interacting components, the implications of architecting the operating system of a multi-processor configuration around a message passing model tend to be enormous.
Operating systems disclosed in the prior art for tightly-coupled multi-processor configurations have not traditionally been architected around a message passing model; rather, the processing units in the configuration share some amount of main memory, their kernels share complex data structures in the shared memory, and pass among themselves only pointers to these objects. It is clear that operating systems developed for uniprocessors have, with some modification in the areas of serialization and cache consistency, been modified rather than rewritten to execute efficiently on tightly coupled multi-processor configurations. It would be unusual and difficult to modify an operating system constructed around a message passing model to execute on a tightly coupled multi-processor configuration. This tends to validate the assumption that general purpose operating systems fit more naturally into a shared storage model than a message passing one.
The IBM RT PC virtual memory management hardware provides the capability of implementing an efficient shared virtual memory and provides an ideal environment to use the method of the present invention. The IBM AIX operating system is implemented around the shared virtual memory. The virtual memory manager is of necessity constructed around a message passing model. All higher levels of the AIX operating system, including the file system and interprocess communication, are constructed around a shared memory model provided by the virtual memory manager. The shared memory architectural model will allow the individual components of the AIX operating system to be implemented in such a way as to trade some small amount of performance for simplicity, which, in turn, is the source of many other benefits.
An operating system maintains data in its memory that represents its current state. This data is volatile in the sense that it does not have to survive system restart. Data that is in main memory, and thus can be processed directly by the CPU, is frequently referred to as in-core data. It is desirable to update this data using atomic transactions, so that it is never in an inconsistent state. An example of such data is the system directory which relates files to their current locations. This directory is built up from information on disk as the system runs. While updates to it are being made, or if an update fails, the interim value of the information should not be available.
A transaction is a unit of work performed by an application program that may access (reference and/or update) data stored in virtual memory that is shared among the processing units in a cluster configuration. A transaction runs under the thread of execution of a single process running a single application program on a single processing unit in the configuration.
A transaction executing on a given processing unit may access data accessible to other transactions executing on the same processing unit. A plurality of transactions concurrently executing on the same processing unit might never actually access the same data at the same time, since most conventional processing units can execute only a single stream or thread of instructions; nevertheless, it is a useful abstraction to assume that the transactions are executing concurrently and are accessing data stored in virtual memory that is shared among the processing units in a cluster configuration.
Since multiple concurrent transactions may (at least virtually) access shared data, it is possible that they might update data in a way that would place it in some state that would be impossible to achieve if the transactions had been executed serially (in any order) on the same data. This is an undesirable condition, since the semantics of the multiple concurrent transactions may be timing-dependent, and therefore difficult to predict. It is desirable to ensure that shared data always appears to be in a state which could only have been achieved had the multiple concurrent transactions executed in some (any) serial order. This property is called serializability. The art of coordinating concurrent access to shared data by multiple transactions in such a way as to ensure serializability is called concurrency control.
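A small worked example may make the serializability property concrete. The two transactions, the starting value, and the function names below are hypothetical. The sketch enumerates the states reachable by every serial order and compares them with one uncontrolled interleaving in which both transactions read before either writes.

```python
from itertools import permutations


def add_ten(balance):
    return balance + 10


def double(balance):
    return balance * 2


def serial_outcomes(initial, transactions):
    """Return the set of final values reachable by some serial order."""
    outcomes = set()
    for order in permutations(transactions):
        value = initial
        for txn in order:
            value = txn(value)
        outcomes.add(value)
    return outcomes


def interleaved_lost_update(initial, first, second):
    # Both transactions read the same initial value, then write in turn,
    # so the second write overwrites the first.
    read_by_first = initial
    read_by_second = initial
    shared = first(read_by_first)
    shared = second(read_by_second)
    return shared


allowed = serial_outcomes(100, [add_ten, double])
observed = interleaved_lost_update(100, add_ten, double)
is_serializable = observed in allowed
```

Starting from 100, the serial orders yield 220 or 210, whereas the interleaving yields 200, a state that no serial execution could produce.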
It is desirable to control concurrency in such a way as to provide for efficient use of the resources of a processing unit. A simple way to ensure serializability is to execute the transactions serially (in some order). This is inefficient, since a transaction may require access to a resource that is not immediately available, such as a virtual memory page that might not be in main memory at that time. In this case, the processing unit would have to remain idle until the resource became available, since no other transaction would be allowed to begin executing until the transaction in progress completed. There are other reasons that this approach to concurrency control is undesirable, which we shall not detail here. Typically, it is desirable to dispatch one or more of the other concurrent transactions that can continue useful work until the resource becomes available. Because the resource may become available at a time not easily predictable by the transaction dispatcher, a suitable concurrency control algorithm is often used to ensure serializability of the concurrently executing transactions.
One approach to concurrency control is to lock shared data accessed by a transaction, and prevent other concurrent transactions from locking the same data in a way that would lead to conflict. This approach usually allows for two types of lock to be granted by an abstract entity called a lock manager: read-locks and write-locks, which indicate whether a transaction has the right to read or read-and-write, respectively, the data that the lock protects.
Conflict occurs when a transaction is holding a read-lock protecting some shared data and another transaction requests a write-lock for the same data, or when a transaction is holding a write-lock protecting some shared data, and another transaction requests either a read-lock or a write-lock for the same data. Read-locks protecting the same data may be granted to multiple concurrent transactions without conflict. A transaction requesting a lock that would conflict with lock(s) held by one or more other transaction(s) must be made to wait until the conflict has been resolved (i.e. when the other transaction(s) have freed their conflicting lock(s)).
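The conflict rules for read-locks and write-locks may be sketched as follows. The LockManager class and its methods are hypothetical and do not describe any particular lock manager; a refused request simply returns False, where a real system would make the requester wait.

```python
class LockManager:
    """Grants read-locks and write-locks according to the conflict rules."""

    def __init__(self):
        self.holders = {}  # data item -> {transaction id: "read" or "write"}

    def conflicts(self, txn, item, mode):
        others = {t: m for t, m in self.holders.get(item, {}).items() if t != txn}
        if mode == "read":
            # A read-lock conflicts only with a write-lock held by another.
            return any(m == "write" for m in others.values())
        # A write-lock conflicts with any lock held by another transaction.
        return bool(others)

    def request(self, txn, item, mode):
        """Grant the lock and return True, or return False if txn must wait."""
        if self.conflicts(txn, item, mode):
            return False
        held = self.holders.setdefault(item, {})
        if held.get(txn) != "write":  # never downgrade an existing write-lock
            held[txn] = mode
        return True

    def release_all(self, txn):
        # Free every lock held by the transaction.
        for item in list(self.holders):
            self.holders[item].pop(txn, None)
            if not self.holders[item]:
                del self.holders[item]
```

Any number of readers may share an item, while a writer is admitted only when no other transaction holds a lock on it.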
A well-known theorem is that a two-phase locking protocol, in which all locks needed by a transaction are acquired before any are freed, is sufficient to guarantee serializability.
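The two-phase rule may be expressed as a simple check, sketched below under the assumption that lock conflicts are handled elsewhere. The class TwoPhaseTransaction is hypothetical; it only records that the transaction has begun freeing locks and refuses any later acquisition.

```python
class TwoPhaseTransaction:
    """Tracks the growing and shrinking phases of a transaction's locking."""

    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False  # becomes True once any lock has been freed

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("two-phase rule violated: acquire after release")
        self.locks.add(item)

    def release(self, item):
        self.shrinking = True
        self.locks.discard(item)


txn = TwoPhaseTransaction("T1")
txn.acquire("a")
txn.acquire("b")
txn.release("a")
try:
    txn.acquire("c")
    violation_detected = False
except RuntimeError:
    violation_detected = True
```

Once the first lock has been freed the transaction is in its shrinking phase, and the attempt to lock a further item is rejected.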
A situation in which each of a set of two or more concurrent transactions is waiting to lock data in such a way as to lead to conflict with some other transaction(s) in the set is called deadlock. Since each transaction in the set is waiting, it cannot free the lock(s) some other transaction(s) in the set is (are) waiting for. From the inception of a deadlock onward, no useful work can be performed by any member of the deadlocked set. Nor can any useful work be performed by those transactions that are not in the deadlocked set but that request locks that would conflict with locks held by members of the deadlocked set. We shall call the latter the "set of fringe transactions."
To prevent the persistence of deadlock and its consequences, most computer systems using a two-phase locking protocol periodically check for the existence of deadlock, and if found, abort one or more transactions in the deadlocked set. Aborting a transaction implies that all of the updates to data it modified must be backed out, and all the locks must be freed.
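A periodic deadlock check of the kind described may be sketched with a waits-for graph. The representation (a mapping from each transaction to the transactions whose locks it is waiting for), the function names, and the victim selection policy are assumptions. Transactions that can reach themselves through the graph are treated as the deadlocked set, and transactions waiting directly on a member of that set as the fringe.

```python
def reachable(waits_for, start):
    """Return every transaction that start waits on, directly or indirectly."""
    seen = set()
    stack = list(waits_for.get(start, ()))
    while stack:
        txn = stack.pop()
        if txn not in seen:
            seen.add(txn)
            stack.extend(waits_for.get(txn, ()))
    return seen


def classify(waits_for):
    """Return (deadlocked set, fringe set) for a waits-for graph."""
    txns = set(waits_for) | {t for ws in waits_for.values() for t in ws}
    deadlocked = {t for t in txns if t in reachable(waits_for, t)}
    fringe = {
        t for t in txns - deadlocked
        if set(waits_for.get(t, ())) & deadlocked
    }
    return deadlocked, fringe


def choose_victim(deadlocked):
    # Assumed policy: abort the transaction with the smallest identifier.
    return min(deadlocked)


graph = {"T1": {"T2"}, "T2": {"T1"}, "T3": {"T1"}, "T4": set()}
deadlocked, fringe = classify(graph)
victim = choose_victim(deadlocked)
```

Here T1 and T2 wait on each other and form the deadlocked set, T3 is a fringe transaction, and T4 is unaffected. Aborting the chosen victim frees its locks and allows the others to proceed.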
In practice, there are many other situations in which transactions may be aborted, such as system failure, transaction program error, transaction program design, and operator intervention. Since transactions may be aborted before reaching their endpoint, it is desirable that updates made by a transaction appear to be "atomic," i.e. either all of its updates are applied at the same time, or none of its updates are applied. If this rule is enforced, shared data is always kept in a consistent state, and each transaction sees either all or none of the updates made by any other transaction.
When a transaction reaches its endpoint, it requests that its updates be made visible to other transactions by using the commit service. If the operating system responds affirmatively to the transaction, all of the transaction's updates have been applied; otherwise, none have been. In either case, all of the transaction's locks have been freed. A transaction may use the backout service rather than the commit service in order to have all of its updates backed out, and all of the locks it holds freed.
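The all-or-none behavior of the commit and backout services may be sketched as follows. The Transaction class and the technique of buffering updates privately until commit are assumptions made for illustration and are not stated to be the mechanism of any particular operating system; locking is omitted.

```python
class Transaction:
    """Buffers updates privately so that they become visible all at once."""

    def __init__(self, store):
        self.store = store  # shared data, visible to other transactions
        self.pending = {}   # this transaction's updates, not yet visible

    def read(self, key):
        # A transaction sees its own updates before they are committed.
        return self.pending.get(key, self.store.get(key))

    def write(self, key, value):
        self.pending[key] = value

    def commit(self):
        # Apply every update together, then discard the private buffer.
        self.store.update(self.pending)
        self.pending = {}
        return True

    def backout(self):
        # Discard every update; the shared data is left untouched.
        self.pending = {}


shared = {"x": 1, "y": 2}
t1 = Transaction(shared)
t1.write("x", 10)
t1.write("y", 20)
before_commit = dict(shared)
t1.commit()
after_commit = dict(shared)

t2 = Transaction(shared)
t2.write("x", 99)
t2.backout()
after_backout = dict(shared)
```

Other transactions observe either the state before the commit or the state after it, never a mixture of the two.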
There are five important, essentially different degrees of consistency which may be provided by a concurrency control mechanism based on locking (from a paper written by J. N. Gray entitled, "The Transaction Concept: Virtues and Limitations", Seventh International Conference on Very Large Databases 1981). They are:
Free Access: Readers may freely reference a given object. Writers may freely update a given object.
Degree 0: Readers may freely reference a given object. Writers must lock a given object prior to updating it. Writers conflict with other writers.
Degree 1: Readers may freely reference a given object. Writers must lock a given object prior to updating it. Writers conflict with other writers. Two-phase write-locking is enforced.
Degree 2: Readers must lock a given object prior to referencing it. Writers must lock a given object prior to updating it. Readers conflict with writers. Writers conflict with other writers. Two-phase write-locking is enforced.
Degree 3: Readers must lock a given object prior to referencing it. Writers must lock a given object prior to updating it. Readers conflict with writers. Writers conflict with other writers. Two-phase locking is enforced.
The AIX operating system uses a form of shared virtual memory called cluster storage that provides atomic, serialized update semantics and Degree 3 consistency to various components of the operating system, and to subsystems and application programs . . .
The operating system uses locking to achieve serialized update semantics. A mechanism to implement locking is described in "801 Storage: Architecture and Programming" by Albert Chang and Mark Mergen. The locking mechanism may be extended to function in the distributed cluster environment. An example of an existing distributed lock manager is implemented in the Distributed Services LPP of the AIX Operating System.