1. Field of the Invention
The present invention relates to a cloud data storage system, and in particular to a cloud data storage system capable of data import and cell recovery.
2. Description of the Prior Art
Data storage systems have evolved from the local hard disk, the Redundant Array of Independent Disks (RAID), the central file server in a local area network (LAN), network attached storage (NAS), the storage area network (SAN) and the clustered storage apparatus of earlier times to the cloud storage apparatus. With the rapid development of Internet technology and mobile devices, e.g., the iPhone and iPad, data storage has moved into the “cloud” era. People want to access a cloud data storage system, or to obtain computational capability, from any place where they can log on to the Internet. People usually own a local standalone storage system and computation device; on the other hand, from their PCs or personal mobile devices they also access clustered storage to reach file data on any storage apparatus within the Internet cloud system.
The centralized file server uses basic client/server network technology to resolve the issue of data retrieval. In its simplest form, a file server, which might be PC or workstation hardware running a network operating system (NOS) such as Novell NetWare, UNIX® or Microsoft Windows®, is employed to support controlled file sharing (i.e., sharing with different access privilege levels). The file server can provide gigabytes of storage capacity with hard disks, and tape drives can further expand that capacity.
To set up a cloud data storage system, clustered or cloud file systems such as the Hadoop File System (HDFS) have appeared. In general, such a clustered file system consists of a metadata server node and plural data nodes. The metadata server node provides file attribute data, such as file size, access control flags and file location. The data nodes provide user clients with access to the actual data, as opposed to the metadata. However, for a small office/home office (SOHO) environment that wants its own cloud data storage system, a clustered cloud data storage system of this type is too expensive. The following is a summary of the current technologies.
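The division of labor between the metadata server node and the data nodes can be illustrated with a minimal sketch. All class, field and function names below (FileAttributes, MetadataServer, DataNode, client_read) are hypothetical and chosen only for illustration; they do not correspond to the API of any particular clustered file system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileAttributes:
    """Attributes a metadata server returns for one file (hypothetical layout)."""
    size: int
    access_flags: str        # e.g. "rw-r--r--"
    locations: List[str]     # names of the data nodes holding the file


@dataclass
class MetadataServer:
    """Holds file attributes only; it never stores file contents."""
    table: Dict[str, FileAttributes] = field(default_factory=dict)

    def lookup(self, path: str) -> FileAttributes:
        return self.table[path]


@dataclass
class DataNode:
    """Holds the actual file contents."""
    name: str
    files: Dict[str, bytes] = field(default_factory=dict)

    def read(self, path: str) -> bytes:
        return self.files[path]


def client_read(path: str, mds: MetadataServer, nodes: Dict[str, DataNode]) -> bytes:
    """Ask the metadata server where the file lives, then read it from a data node."""
    attrs = mds.lookup(path)
    for name in attrs.locations:
        node = nodes.get(name)
        if node is not None and path in node.files:
            return node.read(path)
    raise FileNotFoundError(path)


# Usage: one metadata server, one data node, one file.
node_a = DataNode("node-a", {"/docs/report.txt": b"quarterly numbers"})
nodes = {node_a.name: node_a}
mds = MetadataServer({"/docs/report.txt": FileAttributes(17, "rw-r--r--", ["node-a"])})
data = client_read("/docs/report.txt", mds, nodes)
```

In this sketch every read first passes through the single metadata server, which is why the availability and capacity of that node matter so much in the systems summarized below.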
Lustre Clustered File System
Lustre is an open-source clustered file system proposed by Sun Microsystems (since acquired by Oracle). The Lustre metadata management architecture is a typical master-slave architecture. The active metadata server (MDS) serves metadata inquiry requests, while the standby (non-active) MDS monitors the health of the active MDS in order to assure failover in the event of a malfunction.
To provide high availability of metadata services, Lustre permits plural MDSs to operate in standby mode to meet failover needs. All MDSs are connected to the metadata target (MDT), but only the active MDS may access the MDT. Although the other MDSs provide a failover mechanism against malfunctions, the computational capabilities of the standby MDSs are wasted. The single active MDS is a bottleneck for the entire clustered file system. The overall cost of a clustered cloud data storage system of this type is too high.
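The active/standby arrangement described above can be sketched as follows. This is a simplified model under stated assumptions, not Lustre's actual failover implementation: the MDS and MetadataCluster classes, the health flag and the promotion rule (first healthy server wins) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MDS:
    name: str
    healthy: bool = True
    active: bool = False


class MetadataCluster:
    """One active MDS may access the metadata target; the others only stand by."""

    def __init__(self, servers: List[MDS]):
        self.servers = servers
        self.servers[0].active = True

    def active_mds(self) -> Optional[MDS]:
        for server in self.servers:
            if server.active:
                return server
        return None

    def health_check(self) -> MDS:
        """Run by the standby side: promote a standby if the active MDS has failed."""
        current = self.active_mds()
        if current is not None and current.healthy:
            return current
        if current is not None:
            current.active = False
        for server in self.servers:
            if server.healthy:
                server.active = True
                return server
        raise RuntimeError("no healthy MDS available")

    def query(self, path: str) -> str:
        # Every metadata request goes through the single active MDS.
        return f"{self.active_mds().name} served {path}"


# Usage: mds0 fails, and the standby mds1 takes over.
cluster = MetadataCluster([MDS("mds0"), MDS("mds1")])
before = cluster.query("/a")
cluster.servers[0].healthy = False
cluster.health_check()
after = cluster.query("/a")
```

The sketch makes both drawbacks visible: until a failure occurs the standby server does nothing except monitor, and all queries are served by one node regardless of how many servers are installed.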
Hadoop Clustered File System (HDFS)
HDFS is an open-source cloud file system proposed by the Hadoop project. HDFS implements a namenode to service metadata inquiry requests from all clients. The metadata update log (EDITLOG) records all metadata editing request messages and is stored on the hard disk storage apparatus of the master namenode. The secondary namenode periodically merges the metadata from EDITLOG into the file system structure image file (FsImage) on the namenode's hard disk. A new metadata update log (EDITLOG_new) is created to receive update requests that arrive during the merging process. After the merging process is done, the old EDITLOG is deleted and EDITLOG_new is renamed to EDITLOG.
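The merge sequence described above (create EDITLOG_new, merge EDITLOG into FsImage, delete EDITLOG, rename EDITLOG_new) can be sketched with ordinary files. The JSON-lines log format, the "add"/"delete" operation names and the checkpoint function are assumptions made for illustration only; the real HDFS on-disk formats are different.

```python
import json
import os
import tempfile


def checkpoint(directory: str) -> dict:
    """Merge EDITLOG into FsImage while new edits would go to EDITLOG_new."""
    editlog = os.path.join(directory, "EDITLOG")
    editlog_new = os.path.join(directory, "EDITLOG_new")
    fsimage = os.path.join(directory, "FsImage")

    # 1. Open a fresh log so updates arriving during the merge have a home.
    open(editlog_new, "w").close()

    # 2. Load the current image (empty if none exists yet).
    image = {}
    if os.path.exists(fsimage):
        with open(fsimage) as f:
            image = json.load(f)

    # 3. Replay every logged edit onto the image.
    with open(editlog) as f:
        for line in f:
            op = json.loads(line)
            if op["op"] == "add":
                image[op["path"]] = op["meta"]
            elif op["op"] == "delete":
                image.pop(op["path"], None)

    # 4. Persist the merged image.
    with open(fsimage, "w") as f:
        json.dump(image, f)

    # 5. Delete the old log, then rename EDITLOG_new to EDITLOG.
    os.remove(editlog)
    os.rename(editlog_new, editlog)
    return image


# Usage: three logged edits collapse into a one-entry image.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "EDITLOG"), "w") as f:
    f.write(json.dumps({"op": "add", "path": "/a", "meta": {"size": 3}}) + "\n")
    f.write(json.dumps({"op": "add", "path": "/b", "meta": {"size": 5}}) + "\n")
    f.write(json.dumps({"op": "delete", "path": "/a"}) + "\n")
merged = checkpoint(workdir)
```

Because both EDITLOG and FsImage live in one directory in this sketch, losing that directory loses all metadata, which mirrors the single-disk risk of the namenode.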
HDFS is a distributed, scalable and portable file system written in the Java language for the Hadoop architecture. Under the Hadoop architecture, each node in general has a single datanode, and the clustered plural datanodes constitute the HDFS clustered file system. However, it is not required that every node within the HDFS cloud file system have a datanode. Each datanode provides data block access service over the network using a block protocol specific to HDFS. HDFS uses the TCP/IP layer for communication, and user clients use remote procedure calls (RPC) to communicate with each other. HDFS can replicate data across datanodes on different platforms, and therefore there is no need for RAID-type hard disks on the datanodes.
The current HDFS does not have a failover mechanism for malfunctions. Once the namenode malfunctions, the administrator must reset the namenode. Once the hard disk of the namenode malfunctions, the EDITLOG and FsImage metadata will be lost. The secondary namenode wastes its server computational capability, since it only executes the checkpoint mechanism.
The file system update log (EDITLOG) records all file system operations for the namenode. When files are deleted from or added to the system, these operations are recorded in EDITLOG, which can be stored on the local disk of the namenode; the recorded contents include operation timestamps, file metadata and additional information. The contents of EDITLOG are merged into FsImage periodically. At the moment of each merge, FsImage includes the metadata for all files in the system; the metadata contents include the file owners, the file access privileges, the numbering of the file blocks and the datanode where each file block resides.
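The contents of an EDITLOG record and of a per-file FsImage entry, as listed above, can be modeled with two small record types. The field names and the "add"/"delete" operation labels are hypothetical; only the categories of information (timestamp, owner, access privileges, block numbering, datanode per block) come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileMeta:
    """Per-file entry of the image: owner, privileges, and block placement."""
    owner: str
    permissions: str
    blocks: Dict[int, str]   # block number -> datanode where that block resides


@dataclass
class EditRecord:
    """One logged operation with its timestamp and, for additions, the file metadata."""
    timestamp: float
    operation: str           # "add" or "delete" (assumed labels)
    path: str
    meta: Optional[FileMeta] = None


def merge_into_image(image: Dict[str, FileMeta], records: List[EditRecord]) -> Dict[str, FileMeta]:
    """Apply the records in timestamp order so the image reflects the latest state."""
    for rec in sorted(records, key=lambda r: r.timestamp):
        if rec.operation == "add":
            image[rec.path] = rec.meta
        elif rec.operation == "delete":
            image.pop(rec.path, None)
    return image


# Usage: records arrive out of order; the timestamps restore the real sequence.
records = [
    EditRecord(2.0, "delete", "/tmp/old.log"),
    EditRecord(1.0, "add", "/tmp/old.log", FileMeta("alice", "rw-------", {0: "datanode-1"})),
    EditRecord(3.0, "add", "/data/set.bin",
               FileMeta("bob", "rw-r--r--", {0: "datanode-1", 1: "datanode-2"})),
]
image = merge_into_image({}, records)
```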
Ceph Metadata Server Cluster Architecture
There are five different types of metadata management architecture. According to the literature, HDFS and Lustre belong to the type in which metadata is separated from the actual data, while Ceph belongs to the metadata server cluster type and adopts a subtree partitioning algorithm. The number of subdirectories limits the number of metadata servers that can be used, and once the number of subdirectories grows to a certain point, this method is no longer sufficient.
However, since the data servers used in the above-mentioned architectures are standard data file servers (i.e., central file servers) rather than standard network attached storage (NAS) apparatus, several drawbacks exist. First, clients have to use a specific user client to access files or data on the data server, which makes operation difficult for members of the general public without specific training. Second, the hardware, infrastructure and operational costs of these standard file servers are higher than those of standard NAS. Third, the management workload for a standard file server is higher than that for standard NAS.
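The idea of subtree partitioning can be sketched as a lookup that assigns each path to the metadata server owning the deepest assigned directory above it. The static map and the owner_of function below are illustrative assumptions; this is a simplification and not a description of Ceph's actual algorithm.

```python
from typing import Dict


def owner_of(path: str, subtree_map: Dict[str, str]) -> str:
    """Return the metadata server owning the deepest assigned subtree containing path."""
    parts = [p for p in path.split("/") if p]
    # Walk from the full path up toward the root, looking for an assigned subtree.
    for depth in range(len(parts), -1, -1):
        prefix = "/" + "/".join(parts[:depth])
        if prefix in subtree_map:
            return subtree_map[prefix]
    raise KeyError(f"no metadata server assigned for {path}")


# Usage: four subtrees, hence at most four metadata servers can be put to work.
subtree_map = {
    "/": "mds0",
    "/home": "mds1",
    "/home/alice": "mds2",
    "/var": "mds3",
}
alice = owner_of("/home/alice/notes.txt", subtree_map)
bob = owner_of("/home/bob/todo.txt", subtree_map)
other = owner_of("/etc/hosts", subtree_map)
servers_in_use = len(set(subtree_map.values()))
```

In this sketch the number of servers in use can never exceed the number of assigned subtrees, which illustrates how the directory structure bounds the number of metadata servers.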