The present invention relates generally to the field of shared nothing clusters, and more specifically, to a cluster management method and device for recovering faulted data nodes in a shared nothing cluster.
Recently, shared nothing clusters, such as Hadoop, have attracted wide attention and developed rapidly due to their reliability and high efficiency in distributed big data processing. A shared nothing cluster usually comprises a large number of machines and disks for mass data storage, where none of the machines (i.e., nodes) share memory and/or storage. In order to achieve the maximum business value, it is a priority in big data management to ensure the reliability of cluster data, the scalability of the cluster, and cluster management efficiency.
FIG. 7a shows the architecture of a typical existing shared nothing cluster. In general, one or more machines in a shared nothing cluster are designated as masters, for example, the master nodes 710 in FIG. 7a, and the remaining machines in the cluster may be designated as slave nodes or data nodes, such as data nodes 720 in FIG. 7a. In general, each master node is used to supervise one or more data nodes; at least some of those data nodes have operating systems (OS) and applications running thereon, wherein the applications generally comprise cluster applications and/or local applications. As shown in FIG. 7b, a data node generally further comprises local data files, cluster data replicas, and application files.
However, the existing shared nothing cluster architecture suffers from some potential issues that may impair cluster reliability. For example, when the OS on a data node in the cluster has broken down, the applications and the local/cluster data on that data node can no longer be directly accessed.
Although some shared nothing file systems (for example, the Hadoop distributed file system (HDFS)) can ensure the reliability of cluster data by keeping cluster data replicas on other data nodes, the OS and applications on a faulted data node cannot be recovered through the cluster's own functions. That is, at present, automatic restart and failover are not supported for the OS, local data files, and local/cluster applications on a data node in the cluster.
More seriously, if all of the data nodes storing a given part of the cluster data happen to be broken, cluster users lose prompt access not only to the OS, local data files, and local and cluster applications on each of these data nodes, but also to the cluster data commonly stored thereon.
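The failure case described above can be illustrated with a minimal Python sketch. It is an assumption-laden toy model, not part of any real cluster file system: the replica placement map, the node and block names, and the function `unavailable_blocks` are all hypothetical. It merely checks which pieces of cluster data have every one of their replicas on a faulted data node.

```python
from typing import Dict, Set


def unavailable_blocks(replica_map: Dict[str, Set[str]],
                       faulted: Set[str]) -> Set[str]:
    """Return the blocks whose replicas all sit on faulted data nodes.

    replica_map (hypothetical): block id -> names of the data nodes
    holding a replica of that block.
    """
    return {
        block
        for block, nodes in replica_map.items()
        # A block is lost to users only when no healthy node holds a replica.
        if nodes and nodes <= faulted
    }


# Hypothetical placement with three replicas per block.
replica_map = {
    "block-A": {"node1", "node2", "node3"},
    "block-B": {"node2", "node3", "node4"},
    "block-C": {"node4", "node5", "node6"},
}

# One faulted node: every block still has a healthy replica.
print(sorted(unavailable_blocks(replica_map, {"node2"})))
# All three holders of block-A faulted: block-A cannot be accessed.
print(sorted(unavailable_blocks(replica_map, {"node1", "node2", "node3"})))
```

In this model a single faulted node leaves all cluster data reachable through other replicas, whereas the simultaneous failure of every node holding a given block makes that block inaccessible until at least one of those nodes is recovered.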
In order to recover a data node with a broken-down OS, a user has to:
(i) replicate cluster data stored on the faulted data node on another data node in the cluster;
(ii) remove the faulted data nodes (if there are several faulted data nodes) from the cluster manually one by one;
(iii) format some previous data and reinstall an OS on the faulted data node;
(iv) reconfigure several applications on the faulted data node;
(v) deploy the faulted data node as a new data node and manually add the new node to the operating shared nothing cluster; and
(vi) as required, rebalance data over the cluster when adding the new data node leaves the cluster unbalanced.
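Operations (i) to (vi) can be walked through with the following Python sketch. It is only a simplified simulation under stated assumptions: the `Cluster` class, the function `manual_recover`, the node and block names, and the greedy rebalancing rule are hypothetical, each block is modeled as stored on a single node, and no real cluster tool or API is invoked.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Cluster:
    # Hypothetical model: node name -> ids of cluster data blocks on that node.
    nodes: Dict[str, Set[str]] = field(default_factory=dict)


def manual_recover(cluster: Cluster, faulted: str, target: str) -> List[str]:
    """Simulate manual operations (i) to (vi) and return a log of the steps."""
    log = []
    # (i) Replicate the faulted node's cluster data onto another data node.
    cluster.nodes[target] |= cluster.nodes[faulted]
    log.append(f"(i) replicated data of {faulted} to {target}")
    # (ii) Remove the faulted node from the cluster.
    del cluster.nodes[faulted]
    log.append(f"(ii) removed {faulted}")
    # (iii), (iv) Format, reinstall the OS, reconfigure applications
    # (only recorded here; nothing is actually installed).
    log.append(f"(iii) formatted and reinstalled OS on {faulted}")
    log.append(f"(iv) reconfigured applications on {faulted}")
    # (v) Add the machine back as a new, empty data node.
    cluster.nodes[faulted] = set()
    log.append(f"(v) added {faulted} as a new data node")
    # (vi) Rebalance: move blocks from the fullest to the emptiest node
    # until their block counts differ by at most one (assumed rule).
    moves = 0
    while True:
        most = max(cluster.nodes, key=lambda n: len(cluster.nodes[n]))
        least = min(cluster.nodes, key=lambda n: len(cluster.nodes[n]))
        if len(cluster.nodes[most]) - len(cluster.nodes[least]) <= 1:
            break
        block = sorted(cluster.nodes[most] - cluster.nodes[least])[0]
        cluster.nodes[most].remove(block)
        cluster.nodes[least].add(block)
        moves += 1
    log.append(f"(vi) rebalanced with {moves} block moves")
    return log


cluster = Cluster({
    "node1": {"a", "b", "c"},
    "node2": {"d", "e", "f"},
    "node3": {"g", "h", "i"},
})
for line in manual_recover(cluster, faulted="node3", target="node1"):
    print(line)
```

Even in this toy model, recovering a single node requires copying all of its data, removing and re-adding the node, and then moving blocks a second time during rebalancing, which is where the cost of the manual procedure comes from.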
During the above operations (i) to (vi), data on the data node can no longer be directly accessed, and running applications may be lost. Backing up a large amount of existing data in operation (i), and formatting data and installing the OS in operation (iii), may take a considerably long period of time. Removing nodes from, or adding nodes to, the cluster requires interrupting running cluster applications.
Moreover, adding a new data node to the cluster usually causes cluster unbalance and requires rebalancing data in the new cluster, which may impact cluster performance. If the cluster has a large number of data nodes, for example, 100 nodes, then even if the OS on only one node has broken down, rebalancing the complete cluster and recovering all nodes in the original cluster takes about 10 days. As the cluster grows in scale, the above fault recovery process may take an even longer period of time, leading to a high cost.
Thus, a cluster management method and device capable of recovering a faulted data node in a shared nothing cluster rapidly and conveniently are desired.
Furthermore, a cluster management method and device capable of recovering a plurality of faulted data nodes in a shared nothing cluster simultaneously are desired.