1. Field of the Invention
The present invention relates to a RAID (redundant array of inexpensive disks) apparatus in a distributed object sharing system, and more particularly, to a distributed object sharing system capable of preventing data loss during error recovery in the system and a method thereof. This work was supported by the IT R&D program of MIC/IITA [Project No. 2005-S-405-02, Project Name: A Development of the Next Generation Internet Server Technology].
2. Description of the Related Art
In general, a distributed object sharing system constructed with a plurality of original storage apparatuses includes a RAID apparatus having an error recovery function in consideration of system availability and system performance. The RAID apparatus is implemented in a mirroring scheme or a striping scheme.
In the mirroring scheme, one or more copies of original data are generated, and the copies are stored in different storage apparatuses. A RAID apparatus using the mirroring scheme is referred to as RAID Level 1. In the mirroring scheme, storage space is wasted, and the time required for a writing operation is long. However, in a case where multiple reading requests for the same data are generated simultaneously, the reading operations can be distributed over a plurality of storage apparatuses. In addition, even in a case where as many errors occur as there are copies of the data, the system can continue to operate consistently.
In the striping scheme, data is divided into units of a predetermined size, and the divided data sections are distributed and stored in a plurality of storage apparatuses. A RAID apparatus using the striping scheme is referred to as RAID Level 5. In the striping scheme, in a case where the input/output (I/O) size of data is small, parallel processing can be performed over the different storage apparatuses. In addition, in a case where the I/O size is large, a simultaneous reading operation can be performed on the storage apparatuses. In addition, parity, that is, a kind of redundant information, is used, so that the system can continue to operate consistently when one error occurs. However, since the parity needs to be updated every time the data is updated, overhead may be added to writing performance.
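The division of data into units and the use of parity described above can be illustrated by the following minimal sketch. It is not taken from any particular RAID implementation; the function names, the zero padding of a short final unit, and the rotating position of the parity unit are assumptions made only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise exclusive-OR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data: bytes, unit: int, n_devices: int):
    """Divide data into units and lay them out as stripes.

    Each stripe holds (n_devices - 1) data units plus one parity unit.
    The rotating parity position is an assumption of this sketch.
    """
    per_stripe = n_devices - 1
    # Divide the data into units of a predetermined size (zero-padded).
    units = [data[i:i + unit].ljust(unit, b"\0") for i in range(0, len(data), unit)]
    # Pad with empty units so that the last stripe is complete.
    while len(units) % per_stripe:
        units.append(bytes(unit))
    stripes = []
    for s in range(len(units) // per_stripe):
        chunk = units[s * per_stripe:(s + 1) * per_stripe]
        parity = xor_blocks(chunk)
        parity_dev = (n_devices - 1 - s) % n_devices
        row = list(chunk)
        row.insert(parity_dev, parity)  # row[i] is the block stored on device i
        stripes.append(row)
    return stripes


stripes = stripe_with_parity(b"ABCDEFGHIJKL", unit=2, n_devices=4)
# Any single lost block of a stripe equals the XOR of the remaining blocks.
lost = stripes[0][1]
rebuilt = xor_blocks(stripes[0][:1] + stripes[0][2:])
```

Because the parity unit is the exclusive-OR of the data units of its stripe, the exclusive-OR of all blocks of a stripe is zero, which is why one missing block can be rebuilt from the others.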
The distributed object sharing system has a centralized structure in which metadata is managed by a metadata server and a plurality of file servers operate based on metadata information received from the metadata server.
In the centralized structure, the metadata server using a Level-1 or Level-5 RAID apparatus exclusively recovers errors of the RAID apparatus. In this case, data loss may occur.
FIG. 1 is a block diagram illustrating a configuration of a conventional distributed object sharing system using a Level-1 RAID apparatus.
Referring to FIG. 1, the distributed object sharing system includes a metadata server 110 having a RAID error recovery unit 111 for recovering an error of a RAID apparatus 130, at least one file server 120 having a RAID driving unit 121 for performing reading and writing commands for an object by using the RAID apparatus 130, and the RAID apparatus 130 having an original storage apparatus 131 for storing original data, a copy storage apparatus 132 for storing a copy of the data, and a recovery storage apparatus 133 for storing data recovered by the RAID error recovery unit 111.
When the file server 120 generates an object generating command, the original storage apparatus 131 of the RAID apparatus 130 generates the corresponding object and transmits the object generating command including an identifier of the object to the copy storage apparatus 132. In response to the object generating command, the copy storage apparatus 132 generates the same object.
As a result, the original storage apparatus 131 and the copy storage apparatus 132 can generate the object having the same identifier. Namely, in the original storage apparatus 131 and the copy storage apparatus 132, the same object A, object B, and object C exist.
In the case of the Level-1 RAID apparatus 130, a writing command is performed on all the storage apparatuses 131 and 132, but a reading command is performed on only one of the storage apparatuses 131 and 132.
For example, in a case with multiple file servers, when a file server 1 intends to update contents of an object A, the same writing command is transmitted to the objects A of the original storage apparatus 131 and the copy storage apparatus 132.
However, when a file server 2 intends to read contents of an object B and a file server 3 intends to read contents of an object C, the file server 2 reads the contents of the object B from the original storage apparatus 131, and the file server 3 reads the contents of the object C from the copy storage apparatus 132.
Namely, the reading command can be subject to a load distribution process. In this case, an I/O distribution algorithm determines from which storage apparatus the contents of the object are read.
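The Level-1 behavior described above, in which a writing command is applied to every storage apparatus and a reading command is served by one of them, may be sketched as follows. The class name MirroredStore and the round-robin selection are hypothetical; the I/O distribution algorithm is not specified here, and round-robin is assumed only for illustration.

```python
from itertools import cycle


class MirroredStore:
    """Hypothetical model of a Level-1 RAID apparatus holding object copies."""

    def __init__(self, n_replicas=2):
        # replicas[0] plays the role of the original storage apparatus,
        # replicas[1] the role of the copy storage apparatus.
        self.replicas = [dict() for _ in range(n_replicas)]
        self._next = cycle(range(n_replicas))  # assumed round-robin distribution

    def write(self, object_id, data):
        # A writing command is performed on all the storage apparatuses.
        for replica in self.replicas:
            replica[object_id] = data

    def read(self, object_id):
        # A reading command is performed on only one storage apparatus.
        index = next(self._next)
        return index, self.replicas[index][object_id]


store = MirroredStore()
for name in ("A", "B", "C"):
    store.write(name, f"contents of {name}")
first = store.read("B")   # served by replica 0 in this sketch
second = store.read("C")  # served by replica 1 in this sketch
```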
FIG. 2 is a block diagram for explaining an error recovery method in the distributed object sharing system of FIG. 1.
When an error occurs in the copy storage apparatus 132, the RAID error recovery unit 111 of the metadata server 110 forcibly generates an error in the copy storage apparatus 132. Next, all the objects stored in the original storage apparatus 131 are sequentially read out (S11), and the objects are sequentially stored in the recovery storage apparatus 133 to recover the error (S12).
During these operations, when the file server 120 generates an update command for an object whose recovery is in progress (S13), data loss occurs.
Namely, when a recovery command for the object B and an update command for the object B are generated simultaneously, the metadata server 110 first performs the reading operation on the object B and then the writing operation on the object B. In contrast, the file server 120 performs only the writing operation on the object B.
The original storage apparatus 131 first performs the reading command of the metadata server 110 and then the update command of the file server 120. The recovery storage apparatus 133 first performs the update command of the file server 120 and then the writing command of the metadata server 110, which carries the contents read before the update.
Therefore, the non-updated object written by the metadata server 110, rather than the updated object, is stored in the recovery storage apparatus 133, so that different objects are stored in the original storage apparatus 131 and the recovery storage apparatus 133. As a result, data loss occurs in the recovery storage apparatus 133.
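The ordering of commands that leads to this data loss can be replayed with a small deterministic sketch. The dictionaries, the function name, and the string values are hypothetical stand-ins for the storage apparatuses and the contents of the object B; the sketch models only the order of operations, not real I/O.

```python
def simulate_level1_race():
    """Replay the interleaving of recovery and update commands on object B."""
    original = {"B": "old contents"}   # stands in for the original storage apparatus
    recovery = {}                      # stands in for the recovery storage apparatus

    # Metadata server: reading operation on object B for recovery.
    snapshot = original["B"]

    # File server: update command is applied to the original storage
    # apparatus and to the recovery storage apparatus.
    original["B"] = "new contents"
    recovery["B"] = "new contents"

    # Metadata server: writing operation with the contents read earlier,
    # which overwrites the update on the recovery storage apparatus.
    recovery["B"] = snapshot

    return original, recovery


original, recovery = simulate_level1_race()
```

After the replay, the original side holds the updated contents while the recovery side holds the non-updated contents, which is the inconsistency described above.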
FIG. 3 is a block diagram illustrating a configuration of a conventional distributed object sharing system using a Level-5 RAID apparatus.
Referring to FIG. 3, similar to FIG. 1, the distributed object sharing system includes a metadata server 210, at least one file server 220, and a RAID apparatus 230. The RAID apparatus 230 includes a plurality of storage apparatuses 231 to 234 for storing data of objects in a distributed manner and a recovery storage apparatus 235 for storing data of an object which is recovered by the RAID error recovery unit 211 of the metadata server 210.
The number of tolerable errors of the RAID apparatus 230 is determined based on the number of pieces of parity information. However, since the parity information needs to be synchronized with data updating, overhead may be added to writing performance.
For example, when the file server 220 intends to update a data 5 of an object A, data of the same size as the to-be-updated data 5 is read out from the object A of the storage apparatus 231 in which the data 5 is stored.
Next, parity information corresponding to the same offset and the same size is read out from a parity 1 of the object A of the storage apparatus_3 233, which stores the parity information of a stripe 1 to which the data 5 belongs.
Next, new parity information is calculated by performing a bitwise exclusive-OR operation on the previously written data, the previously written parity, and the new data.
The writing commands for the new parity information and the new data 5 are transmitted to the storage apparatuses 231 to 234, so that the writing commands are performed.
More specifically, in order to perform the writing command, a total of four input and output operations, that is, two input and output operations for the reading command and two input and output operations for the writing command are performed.
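The four input and output operations of this parity update can be sketched as follows. The function names and the representation of each storage apparatus as a bytearray are assumptions for illustration; the parity calculation itself is the exclusive-OR of the old data, the old parity, and the new data described above.

```python
def xor_bytes(*blocks):
    """Return the bitwise exclusive-OR of equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def small_write(devices, data_dev, parity_dev, offset, new_data):
    """Update data and parity in place; returns the number of I/O operations."""
    end = offset + len(new_data)
    old_data = bytes(devices[data_dev][offset:end])      # I/O 1: read old data
    old_parity = bytes(devices[parity_dev][offset:end])  # I/O 2: read old parity
    new_parity = xor_bytes(old_data, old_parity, new_data)
    devices[data_dev][offset:end] = new_data             # I/O 3: write new data
    devices[parity_dev][offset:end] = new_parity         # I/O 4: write new parity
    return 4


# Three hypothetical data devices and one parity device for a single stripe.
d0, d1, d2 = bytearray(b"\x01\x02"), bytearray(b"\x10\x20"), bytearray(b"\x0f\xf0")
devices = [d0, d1, d2, bytearray(xor_bytes(d0, d1, d2))]
io_count = small_write(devices, data_dev=0, parity_dev=3, offset=0, new_data=b"\xaa\xbb")
```

Only the updated data and its parity are touched, so the other data devices of the stripe need not be read; the cost is the two reads and two writes counted above.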
FIG. 4 is a block diagram for explaining an error recovery method in the distributed object sharing system of FIG. 3.
When the metadata server 210 detects an occurrence of an error in the storage apparatus_2 232, the RAID error recovery unit 211 of the metadata server 210 sequentially recovers the objects of the storage apparatus_2 232 to the recovery storage apparatus 235 in units of a stripe in the order of object identifiers.
Namely, the RAID error recovery unit 211 stores the objects of the original storage apparatus_2 232 in the recovery storage apparatus 235 in the order of the stripes 0, 1, and 2 of the object A of the original storage apparatus_2 232 and in the order of the stripes 0, 1, and 2 of the object B of the original storage apparatus_2 232.
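The recovery order described above, by object identifier and then by stripe, may be expressed by the following sketch. The function name and the mapping from object identifier to stripe count are hypothetical and serve only to make the ordering concrete.

```python
def recovery_order(stripe_counts):
    """Yield (object identifier, stripe number) pairs in recovery order.

    stripe_counts maps each object identifier to its number of stripes.
    """
    for object_id in sorted(stripe_counts):
        for stripe in range(stripe_counts[object_id]):
            yield object_id, stripe


order = list(recovery_order({"B": 3, "A": 3}))
```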
However, if the file server 220 generates an update command for an object whose recovery is in progress in this state, data loss occurs.
Now, the recovery of the stripe 1 of the object A that is performed by the metadata server 210 during the update operation on the data 5 of the object A that is performed by the file server 220 is described with reference to FIG. 4.
The file server 220 reads out a data 5, a data 3, and a data 4 for the parity update (S21, S22, and S23). The file server 220 recovers the parity 1 (S24). After that, the file server 220 calculates a new parity 1 (S25).
At the same time, the metadata server 210 also reads out the data 5, the data 3, and the data 4 for the recovery of the parity 1 of the object A (S26, S27, and S28). The metadata server 210 recovers the parity 1.
When completing the parity update, the file server 220 stores a new data 5 in the storage apparatus_1 231 (S29). The file server 220 stores a new parity 1 in the recovery storage apparatus 235 (S30). On the other hand, when completing the recovery of the parity 1, the metadata server 210 stores the recovered parity 1 in the recovery storage apparatus 235 (S31).
In this manner, if the metadata server 210 performs the recovery operation of the parity 1 after the file server 220 performs the writing operation of the parity 1, the non-updated parity 1, rather than the updated parity 1, is stored in the recovery storage apparatus 235.
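The sequence S21 to S31 can be replayed with the following sketch, which shows that the parity finally stored no longer matches the data of the stripe. The one-byte block values, the function names, and the dictionaries are hypothetical; only the order of the reads and writes follows the description above.

```python
def xor_blocks(*blocks):
    """Return the bitwise exclusive-OR of equally sized byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


def simulate_level5_race():
    """Replay the interleaved parity update and parity recovery of stripe 1."""
    data = {3: b"\x03", 4: b"\x04", 5: b"\x05"}  # data blocks of stripe 1
    recovery = {}                                 # recovery storage apparatus

    # S21 to S23 and S26 to S28: both servers read before either one writes.
    fs_view = {k: data[k] for k in (5, 3, 4)}
    mds_view = {k: data[k] for k in (5, 3, 4)}

    new_data5 = b"\x50"
    fs_parity = xor_blocks(new_data5, fs_view[3], fs_view[4])         # S25
    mds_parity = xor_blocks(mds_view[5], mds_view[3], mds_view[4])    # recovered

    data[5] = new_data5                # S29: file server stores new data 5
    recovery["parity 1"] = fs_parity   # S30: file server stores new parity 1
    recovery["parity 1"] = mds_parity  # S31: metadata server overwrites it

    expected = xor_blocks(data[3], data[4], data[5])
    return recovery["parity 1"], expected


stored_parity, expected_parity = simulate_level5_race()
```

In this replay the stored parity was computed from the old data 5, so a later reconstruction that uses it would return erroneous data.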
In the conventional distributed object sharing system, after the recovery is completed or when the recovery is performed due to an erroneous storage apparatus, the recovery may be performed with an erroneous data, and data loss may occur. Namely, the conventional distributed object sharing system has a problem in that the data loss may occur during the error recovery of the RAID apparatus.