The present invention relates to database systems and, more particularly, to a hybrid shared nothing/shared disk database system.
Multi-processing computer systems are systems that include multiple processing units that are able to execute instructions in parallel relative to each other. To take advantage of parallel processing capabilities, different aspects of a task may be assigned to different processing units. The different aspects of a task are referred to herein as work granules, and the process responsible for distributing the work granules among the available processing units is referred to as a coordinator process.
Multi-processing computer systems typically fall into three categories: shared everything systems, shared disk systems, and shared nothing systems. The constraints placed on the distribution of work to processes performing granules of work vary based on the type of multi-processing system involved.
In shared everything systems, processes on all processors have direct access to all dynamic memory devices (hereinafter generally referred to as "memory") and to all static memory devices (hereinafter generally referred to as "disks") in the system. Consequently, in a shared everything system there are few constraints with respect to how work granules may be assigned. However, a high degree of wiring between the various computer components is required to provide shared everything functionality. In addition, there are scalability limits to shared everything architectures.
In shared disk systems, processors and memories are grouped into nodes. Each node in a shared disk system may itself constitute a shared everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within the particular node. Shared disk systems generally require less wiring than shared everything systems. However, shared disk systems are more susceptible to unbalanced workload conditions. For example, if a node has a process that is working on a work granule that requires large amounts of dynamic memory, the memory that belongs to the node may not be large enough to simultaneously store all required data. Consequently, the process may have to swap data into and out of its node's local memory even though large amounts of memory remain available and unused in other nodes.
Shared disk systems provide compartmentalization of software failures resulting in memory corruption. The only exceptions are the control blocks used by the inter-node lock manager, which are virtually replicated in all nodes.
In shared nothing systems, all processors, memories and disks are grouped into nodes. In shared nothing systems as in shared disk systems, each node may itself constitute a shared everything system or a shared disk system. Only the processes running on a particular node can directly access the memories and disks within the particular node. Of the three general types of multi-processing systems, shared nothing systems typically require the least amount of wiring between the various system components. However, shared nothing systems are the most susceptible to unbalanced workload conditions. For example, all of the data to be accessed during a particular work granule may reside on the disks of a particular node. Consequently, only processes running within that node can be used to perform the work granule, even though processes on other nodes remain idle.
Shared nothing systems provide compartmentalization of software failures resulting in memory and/or disk corruption. The only exceptions are the control blocks controlling "ownership" of data subsets by different nodes. Ownership is modified much more rarely than shared disk lock management information. Hence, because they do not have high performance requirements, the ownership techniques are simpler and more reliable than the shared disk lock management techniques.
Databases that run on multi-processing systems typically fall into two categories: shared disk databases and shared nothing databases. Shared disk database systems are systems in which multiple database servers (typically running on different nodes) are capable of reading and writing to any part of the database. Data access in the shared disk architecture is coordinated via a distributed lock manager. Shared disk databases may be run on both shared nothing and shared disk computer systems. To run a shared disk database on a shared nothing computer system, software support may be added to the operating system or additional hardware may be provided to allow processes to have direct access to remote disks.
A shared nothing database assumes that a process can only directly access data if the data is contained on a disk that belongs to the same node as the process. Specifically, the database data is subdivided among the available database servers. Each database server can directly read and write only the portion of data owned by that database server. If a first server seeks to access data owned by a second server, then the first database server must send messages to the second database server to cause the second database server to perform the data access on its behalf.
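The access rule described above can be illustrated with a minimal sketch. The class names (`Cluster`, `Server`), the `owner_of` map, and the use of a direct method call in place of a network message are all assumptions made for illustration; none of them comes from the system described here.

```python
class Cluster:
    """Hypothetical registry: which server owns each data item."""

    def __init__(self, owner_of):
        self.owner_of = dict(owner_of)   # data key -> owning server name
        self.servers = {}                # server name -> Server


class Server:
    """Hypothetical database server in a shared nothing database."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.local_data = {}   # only data owned by this server lives here
        self.forwarded = 0     # count of requests sent to other servers
        cluster.servers[name] = self

    def handle_request(self, op, key, value=None):
        # Only the owner may touch the data directly.
        if self.cluster.owner_of[key] != self.name:
            raise PermissionError(f"{self.name} does not own {key}")
        if op == "read":
            return self.local_data.get(key)
        if op == "write":
            self.local_data[key] = value
            return value
        raise ValueError(f"unknown operation: {op}")

    def access(self, op, key, value=None):
        owner = self.cluster.owner_of[key]
        if owner == self.name:
            return self.handle_request(op, key, value)
        # Not the owner: ask the owning server to perform the access.
        # A real system would send a message; a method call stands in here.
        self.forwarded += 1
        return self.cluster.servers[owner].handle_request(op, key, value)
```

In this sketch, a write issued on server B for a key owned by server A ends up stored only in A's `local_data`, which mirrors the rule that the second database server performs the access on behalf of the first.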
Shared nothing databases may be run on both shared disk and shared nothing multi-processing systems. To run a shared nothing database on a shared disk machine, a software mechanism may be provided for logically partitioning the database, and assigning ownership of each partition to a particular node.
Shared nothing and shared disk systems each have advantages associated with their particular architectures. For example, shared nothing databases provide better performance if there are frequent write accesses (write hot spots) to the data. Shared disk databases provide better performance if there are frequent read accesses (read hot spots). Also, as mentioned above, shared nothing systems provide better fault containment in the presence of software failures.
In light of the foregoing, it would be desirable to provide a single database system that is able to provide the performance advantages of both types of database architectures. Typically, however, these two types of architectures are mutually exclusive.
According to one aspect of the invention, a method is provided for transitioning ownership of a data item. Ownership is transferred by disabling access to the data item, waiting for all transactions that have made changes to the data item to either commit or abort, changing data that indicates ownership of the data item from a first owner to a second owner, and enabling access to the data item.
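The four steps of the ownership transition can be sketched as follows. This is a single-threaded illustration under stated assumptions: `DataItem`, `transfer_ownership`, and the `wait_for_txn` callback (which stands in for blocking until a transaction commits or aborts) are hypothetical names, not part of the described system.

```python
class DataItem:
    """Hypothetical data item with an owner and a set of open transactions."""

    def __init__(self, owner):
        self.owner = owner
        self.access_enabled = True
        self.open_txns = set()   # transactions that have changed the item

    def begin_update(self, txn_id):
        if not self.access_enabled:
            raise RuntimeError("access disabled during ownership transition")
        self.open_txns.add(txn_id)

    def finish(self, txn_id):
        # Called when a transaction has committed or aborted.
        self.open_txns.discard(txn_id)


def transfer_ownership(item, first_owner, second_owner, wait_for_txn):
    if item.owner != first_owner:
        raise ValueError("item is not owned by the expected first owner")
    # Step 1: disable access so no new changes can start.
    item.access_enabled = False
    # Step 2: wait for every transaction that changed the item to
    # commit or abort (wait_for_txn is an assumed blocking callback).
    for txn_id in sorted(item.open_txns):
        wait_for_txn(txn_id)
        item.finish(txn_id)
    # Step 3: change the data that indicates ownership.
    item.owner = second_owner
    # Step 4: enable access again.
    item.access_enabled = True
```

Disabling access before draining the open transactions is what guarantees that the set being waited on cannot grow while the transition is in progress.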
Ownership groups are provided to establish sets of commonly owned data items. When a data item undergoing an ownership change belongs to an ownership group initially owned by the first owner, the step of changing data that indicates ownership of the data item from a first owner to a second owner may be accomplished by changing the owner of the ownership group from the first owner to the second owner, or may be accomplished by changing data that indicates the ownership group to which the data item belongs to reflect that the data item belongs to a second ownership group owned by the second owner.
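The two ways of changing the owner of a data item through ownership groups can be sketched as below. The `OwnershipCatalog` class and its two dictionaries are assumptions chosen for illustration; the text does not prescribe how the ownership data is stored.

```python
class OwnershipCatalog:
    """Hypothetical catalog mapping items to groups and groups to owners."""

    def __init__(self):
        self.group_owner = {}   # ownership group -> owner
        self.item_group = {}    # data item -> ownership group

    def owner_of(self, item):
        # An item's owner is the owner of the group it belongs to.
        return self.group_owner[self.item_group[item]]

    def change_group_owner(self, group, first_owner, second_owner):
        # Option 1: re-own the whole group; every member moves together.
        if self.group_owner[group] != first_owner:
            raise ValueError("group is not owned by the expected first owner")
        self.group_owner[group] = second_owner

    def move_item(self, item, second_group):
        # Option 2: reassign one item to a group owned by the second owner.
        if second_group not in self.group_owner:
            raise KeyError(second_group)
        self.item_group[item] = second_group
```

Under these assumptions, option 1 changes the owner of every item in the group with a single update, while option 2 changes the owner of one item and leaves the other members of its original group untouched.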
The process performing the ownership transition may fail. According to one aspect of the invention, where the transition involves changing the owner of an ownership group, the system responds to such failure by determining whether the process failed before changing the data that indicates ownership of the ownership group. If the process failed before changing the data that indicates ownership of the ownership group, then the first owner is restored as owner of the ownership group. If the process failed after changing the data that indicates ownership of the ownership group, then the second owner is retained as owner of the ownership group.
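The recovery decision for a failed group-owner change can be sketched as follows. The text does not say how the system determines whether the change happened before the failure; this sketch assumes, purely for illustration, that a hypothetical control record holds the currently recorded owner and an `in_transition` flag, and that inspecting the recorded owner is sufficient.

```python
def recover_group_owner(control, first_owner, second_owner):
    """Resolve an interrupted ownership transition for one ownership group.

    `control` is an assumed record of the form
    {"owner": <recorded owner>, "in_transition": True}.
    Returns the owner in effect after recovery.
    """
    if control["owner"] == second_owner:
        # The ownership data was changed before the failure:
        # the second owner is retained.
        pass
    else:
        # The failure happened before the ownership data was changed:
        # the first owner is restored.
        control["owner"] = first_owner
    control["in_transition"] = False
    return control["owner"]
```

Either branch leaves the group with exactly one well-defined owner, so access to the group can be re-enabled once recovery finishes.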
When the transition involves changing the ownership group to which the data item belongs, removing the data item from its current ownership group involves updating a first file, and adding the data item to the new ownership group involves updating a second file. Failure of a process that is performing the ownership transition is responded to by determining whether the process performing the ownership transition died before the change to the second file. If the process performing the ownership transition died before the change to the second file, then the data item is restored as a member of the first ownership group. If the process performing the ownership transition died after the change to the second file but before the change to the first file, then the transition to the second ownership group is completed by updating the first file.
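The two-file recovery rule can be sketched as below. Each file is modeled as a Python set of member items, and `move_item` takes a hypothetical `crash_before` argument that simulates the process dying at a given point; both are assumptions made for illustration. The sketch follows the order implied by the text, in which the second file is updated before the first.

```python
def move_item(item, first_file, second_file, crash_before=None):
    """Move an item between groups; optionally simulate a crash."""
    if crash_before == "second":
        return                      # died before changing the second file
    second_file.add(item)           # add to the new ownership group
    if crash_before == "first":
        return                      # died between the two file changes
    first_file.discard(item)        # remove from the old ownership group


def recover_item_move(item, first_file, second_file):
    """Resolve an interrupted move and report which action was taken."""
    if item not in second_file:
        # Died before the second file changed: restore the item as a
        # member of the first ownership group.
        first_file.add(item)
        return "restored"
    if item in first_file:
        # Second file changed but first did not: finish the transition
        # by updating the first file.
        first_file.discard(item)
        return "completed"
    return "already complete"
```

After recovery the item appears in exactly one of the two files, whichever point the simulated failure occurred at.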