This invention relates to the field of database management, and in particular to a system and method for managing a cluster of nodes that contain partitions of data based on the use of six atomic online clustering operations.
The ever-increasing use of electronic commerce, information access, and other applications that access large sets of data has increased the demand for high reliability and accessibility to the data. The loss of access to data for even a few minutes could cost a large e-commerce provider hundreds of thousands of dollars in lost sales as purchasers choose other vendors during the outage. In like manner, slow access to data will likely cause purchasers to choose other vendor sites for better performance.
Techniques such as load sharing, redundancy, and others have been developed to assure high performance and high reliability. Clusters of servers may be provided to enable multiple users to access the database at the same time with minimal latency; and such servers can be configured to automatically absorb the load of servers that fail. In like manner, clusters of data nodes may be provided to assure distributed access to the database, wherein such data nodes are also configured to provide redundant/back-up access when individual nodes fail, or are taken offline for repair or updating.
FIG. 10 illustrates an example database management system that uses a cluster of nodes to provide high availability and elasticity. The following is a description of the operation of such a database management system in very general terms, to provide a background to the structure of an example system to which this invention applies. One of skill in the art will recognize that this invention may also apply to alternative structures and/or principles of operation.
In a conventional embodiment, a ‘master’ database may be maintained, and downloaded to one or more clusters to support the expected demand for access to the data of the database via the clusters. Each cluster may support a different set of servers, and contain the entire database; or, each cluster may contain different portions of the database and support the entire set of servers for access to its particular subset of the data; or any combination between these extremes. Some embodiments may include a single cluster, and some embodiments may use a set of clusters to store the ‘master’ database. In the example embodiment of FIG. 10, the database is illustrated as being contained in a single cluster, for ease of understanding.
Conventionally, a database is partitioned into logical blocks of data, termed database partitions DBP, and the clusters that store the data in the database are similarly logically partitioned into a plurality of cluster partitions CPs, each cluster partition CP corresponding to a database partition DBP. Typically, the cluster partition is an ordinal that is incremented as each database partition DBP is assigned to the cluster.
Basic database operations include Read, Write, and Delete functions, wherein an input to the system includes an identifier of the operation Op that is to be performed, and an identifier of the database partition DBP that is being addressed by the operation Op. If the operation is a write operation, the input will also include the data [DataIn] to be written to the database partition DBP.
An aggregator 10 receives the input and determines the cluster partition CP corresponding to the database partition DBP. In some embodiments, the cluster partition CP is the same as the database partition DBP and a translation is not required.
Upon determining the cluster partition CP corresponding to the database partition DBP, the aggregator 10 determines which node has been allocated to the partition CP containing the data block identified by the identifier DBP. When the identified DBP is assigned to the cluster, the aggregator 10 allocates a node N for storing the data of the identified cluster partition CP. The aggregator 10 maintains a set of metadata 20 that maps each cluster partition CP to its allocated node(s); each node maintains a mapping between the identified cluster partition CP and the physical location PNx of the database partition within the node. If multiple aggregators 10 provide access to the cluster, each aggregator maintains the same metadata 20, typically via a synchronous replication of all changes to the metadata.
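The two levels of mapping described above can be sketched as follows. This is a minimal illustration only, not the described system's implementation: the class names, the round-robin allocation policy, and the string format used for the physical location PNx are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A data node; maps each cluster partition (CP) to a physical location (PNx)."""
    name: str
    locations: dict[int, str] = field(default_factory=dict)  # CP -> PNx


@dataclass
class Aggregator:
    """Holds the metadata that maps each CP to its allocated node(s)."""
    nodes: list[Node]
    metadata: dict[int, list[str]] = field(default_factory=dict)  # CP -> node names
    dbp_to_cp: dict[str, int] = field(default_factory=dict)       # DBP -> CP

    def assign(self, dbp: str) -> int:
        """Assign a database partition to the cluster; the CP is an ordinal."""
        cp = len(self.dbp_to_cp)
        self.dbp_to_cp[dbp] = cp
        # Assumption: round-robin allocation; the text does not specify a policy.
        node = self.nodes[cp % len(self.nodes)]
        # Hypothetical location format, for illustration only.
        node.locations[cp] = f"{node.name}/part-{cp}"
        self.metadata[cp] = [node.name]
        return cp

    def locate(self, dbp: str) -> tuple[str, str]:
        """Resolve DBP -> CP -> node (aggregator metadata) -> PNx (node mapping)."""
        cp = self.dbp_to_cp[dbp]
        node_name = self.metadata[cp][0]
        node = next(n for n in self.nodes if n.name == node_name)
        return node_name, node.locations[cp]
```

For example, with two nodes named "n0" and "n1", assigning the partitions "orders" and "users" in that order yields cluster partitions 0 and 1, and `locate("users")` resolves to node "n1".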
To provide data availability in the event of a failure, the aggregator 10 may allocate each partition CP to multiple nodes, with a select node being identified as containing a ‘primary’ copy of the partition, and all of the other nodes allocated to the partition CP being identified as containing ‘follower partitions’, which are replications of the partition CP. The metadata at each cluster includes these multiple allocations, as illustrated in FIG. 11 and detailed further below.
During a Read operation, the data of the partition CP is obtained from the primary node and provided as the DataOut 60 from the cluster; if the primary node is not available, the data of the partition CP is obtained from a select follower node.
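The read path with fallback to a follower can be sketched as below. The data layout (plain dictionaries) and the choice of the first available follower are assumptions made for the example; the text does not state how the follower node is selected.

```python
def read_partition(cp, allocations, available, stores):
    """Return the data of cluster partition `cp`.

    allocations: CP -> {"primary": node, "followers": [node, ...]}
    available:   set of node names that are currently reachable
    stores:      node name -> {CP: data}
    """
    alloc = allocations[cp]
    # Prefer the primary; otherwise fall back to a follower.
    # Assumption: the first available follower in list order is selected.
    for node in [alloc["primary"], *alloc["followers"]]:
        if node in available:
            return stores[node][cp]
    raise RuntimeError(f"no available node holds partition {cp}")
```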
During a Write operation, the data block DataIn is written to the partition on the primary node (hereinafter “primary partition”) allocated to the CP. The content of the primary partition is replicated to each corresponding partition of the secondary nodes, i.e. the follower nodes described above (hereinafter “secondary partition”). Depending upon the particular embodiment, the primary nodes may be configured to autonomously “push” the data from the primary partition to each secondary partition; or, the secondary nodes may be configured to autonomously “pull” the data from the primary partition. Alternatively, the aggregator may be configured to maintain the time that each partition associated with each CP was last changed, and prior to using the data on a secondary partition of the CP, these times may be checked to assure that the primary partition has not been updated since the secondary partition was last updated.
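The timestamp-based alternative can be illustrated with a short sketch. It assumes logical (integer) timestamps and dictionary-based storage, and the function names are hypothetical; the text does not prescribe a clock source or data structure.

```python
def write_primary(stores, last_changed, cp, primary, data, now):
    """Write DataIn to the primary partition and record when it changed."""
    stores[primary][cp] = data
    last_changed[(cp, primary)] = now


def pull(stores, last_changed, cp, primary, secondary, now):
    """A secondary node pulls the current content of the primary partition."""
    stores[secondary][cp] = stores[primary][cp]
    last_changed[(cp, secondary)] = now


def secondary_is_current(last_changed, cp, primary, secondary):
    """True if the primary has not changed since the secondary was last updated."""
    return last_changed[(cp, secondary)] >= last_changed[(cp, primary)]
```

In this sketch a secondary partition that was refreshed at time 2 is considered stale once the primary is written at time 3, and becomes usable again after the next pull.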
As detailed further below, in addition to providing read and write access to the data blocks in the cluster, the aggregator 10 has the responsibility for maintaining the metadata 20 to efficiently manage the allocation of nodes 40 to assure high availability and elasticity (expand, contract, balance loads, and so on). High availability and elasticity are key features for database management systems that use clustered data nodes. Providing each of these features requires the ability to move data around a cluster, keep online and up-to-date backups (via replication, described above), and adjust the topology of the cluster in response to events such as new nodes being added to the system or existing nodes failing. Of particular note, three commonly implemented clustered database procedures provided by the aggregator 10 are data failover, auto-healing, and elastic scaling.
Data failover is the process of keeping a database system online in the event of a node failure. When a node fails, the data that it was responsible for serving must be exposed by other nodes in the system. To do so in an online manner, the system must maintain hot backups (replicas) of the data in the steady state that are available for use should their source (e.g. the node containing the ‘primary’ copy of the partition data) fail.
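As a rough illustration of failover at the metadata level, the sketch below removes a failed node from every partition's allocation and, where that node held the primary copy, hands the primary role to a surviving replica. The dictionary layout and the "first replica in list order" selection rule are assumptions for the example, not details taken from the text.

```python
def fail_over(allocations, failed):
    """Update CP allocations after node `failed` goes down.

    allocations: CP -> {"primary": node or None, "followers": [node, ...]}
    Returns the list of CPs left with no copy at all.
    """
    lost = []
    for cp, alloc in allocations.items():
        # The failed node can no longer serve as a replica.
        if failed in alloc["followers"]:
            alloc["followers"].remove(failed)
        if alloc["primary"] == failed:
            if alloc["followers"]:
                # Assumption: the first remaining replica takes over as primary.
                alloc["primary"] = alloc["followers"].pop(0)
            else:
                alloc["primary"] = None
                lost.append(cp)
    return lost
```

A partition with no hot replica ends up in the returned list, which is why the steady-state backups described above are needed for failover to be performed online.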
When a node recovers, it must be reintroduced into the system. This means reusing whatever data is on the node (if it is still valid), and optionally balancing data among the nodes so that the data in the cluster is evenly distributed. Advanced database systems are able to do this automatically when a node is visible to the cluster. The process of reintroducing a node automatically is called auto-healing.
Elastic scaling is the process of redistributing data in a cluster as it scales up or down, including load balancing. This is a common operation for cloud-based systems where hardware can be easily acquired to horizontally scale a cluster.
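One simple way to picture the redistribution step is a greedy planner that moves partitions from the most loaded node to the least loaded node until the counts differ by at most one. This is only a sketch under stated assumptions: it balances by partition count, treats each partition as having a single copy, and ignores data size and replicas, none of which is specified in the text.

```python
def plan_rebalance(placement):
    """Plan moves that even out partition counts across nodes.

    placement: node name -> set of CPs stored on that node
    Returns a list of (cp, source_node, destination_node) moves.
    The input mapping is not modified.
    """
    state = {node: set(cps) for node, cps in placement.items()}
    moves = []
    if not state:
        return moves
    while True:
        # Ties are broken by node name so the plan is deterministic.
        src = max(state, key=lambda n: (len(state[n]), n))
        dst = min(state, key=lambda n: (len(state[n]), n))
        if len(state[src]) - len(state[dst]) <= 1:
            return moves
        cp = min(state[src])  # assumption: move the lowest-numbered CP first
        state[src].remove(cp)
        state[dst].add(cp)
        moves.append((cp, src, dst))
```

For instance, when an empty node "b" joins a cluster in which node "a" holds partitions 0 through 3, the planner proposes moving partitions 0 and 1 to "b".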
Currently, embodiments of these features are provided in an ‘ad hoc’ manner, wherein modules for providing data failover, auto-healing, elastic scaling, and others are custom designed for the particular database embodiment, or the particular database management system. This custom design introduces significant costs to create, test, and support each embodiment, with the accompanying risk of poor or unreliable performance.
It would be advantageous to provide a core group of primitives that can be used to create higher level clustered database features and functions, including data failover, auto-healing, and elastic scaling. It would also be advantageous to minimize the number of primitives in this group, and to optimize the features of each primitive to enable embodiments of the higher level features using a minimum number of these primitives. It would also be advantageous to enable these primitives to operate ‘online’, with minimal interference with users of the database, and to operate in parallel, for optimized performance.
These advantages, and others, can be realized by providing a set of six atomic primitives that are able to be used in combination to provide all of the common features and functions of a clustered database, including data failover, auto-healing, and elastic scaling. The six atomic primitives are CREATE, DROP, DETACH, ATTACH, COPY, and PROMOTE. Of particular note, it is shown that by maintaining appropriate metadata, including the status of each instance of each partition in the cluster, the versatility and reliability of this set of primitives are sufficient to implement each of the aforementioned data failover, auto-healing, and elastic scaling features and functions with high efficiency using a minimal number of these primitives. Each primitive is atomic (such that the cluster is clearly in one state or another) and online (a workload of reads and writes is uninterrupted while the primitive runs), and each primitive is scoped to a single partition of data, thereby enabling parallel processing.
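The consequence of scoping each primitive to a single partition can be sketched as follows: steps addressed to the same partition run in order, while steps addressed to different partitions may run in parallel. Only the six primitive names come from the text. The `run_plan` function, the `apply` callback, and the example sequence of steps are hypothetical and say nothing about what each primitive actually does.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from enum import Enum


class Primitive(Enum):
    CREATE = "CREATE"
    DROP = "DROP"
    DETACH = "DETACH"
    ATTACH = "ATTACH"
    COPY = "COPY"
    PROMOTE = "PROMOTE"


def run_plan(plan, apply):
    """Run a list of (primitive, cp) steps.

    Steps for one partition run sequentially in plan order; the queues of
    different partitions are submitted to a thread pool and may overlap.
    `apply(primitive, cp)` is a caller-supplied (hypothetical) handler.
    Returns CP -> list of handler results.
    """
    queues = defaultdict(list)
    for primitive, cp in plan:
        queues[cp].append(primitive)

    def run_queue(cp):
        return cp, [apply(primitive, cp) for primitive in queues[cp]]

    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run_queue, list(queues)))
```

A call such as `run_plan([(Primitive.COPY, 0), (Primitive.CREATE, 1), (Primitive.PROMOTE, 0)], handler)` would process partition 0 and partition 1 independently, keeping the two steps for partition 0 in their original order.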
Throughout the drawings, the same reference numerals indicate similar or corresponding features or functions. The drawings are included for illustrative purposes and are not intended to limit the scope of the invention.