The present invention generally relates to distributed data storage systems. Typically, such distributed storage systems are targeted at storing large amounts of data, such as objects or files, in a distributed and fault-tolerant manner with a predetermined level of redundancy. The present invention relates more particularly to a distributed object storage system.
The advantages, in terms of scalability and flexibility, of object storage systems, which store data objects referenced by an object identifier, over file systems, such as for example US2002/0078244, which store files referenced by an inode, or block-based systems, which store data blocks referenced by a block address, are well known. Object storage systems are in this way able to surpass the maximum storage capacity limits of file systems in a flexible way, such that for example storage capacity can be added or removed as needed, without degrading performance as the system grows. This makes such object storage systems excellent candidates for large-scale storage systems.
Such large-scale storage systems are required to distribute the stored data objects in the object storage system over multiple storage elements, such as for example hard disks, or multiple components such as storage nodes comprising a plurality of such storage elements. However, as the number of storage elements in such a distributed object storage system increases, the probability of failure of one or more of these storage elements increases equally. To cope with this, it is required to introduce a level of redundancy into the distributed object storage system. This means that the distributed object storage system must be able to cope with a failure of one or more storage elements without data loss. In its simplest form, redundancy is achieved by replication, which means storing multiple copies of a data object on multiple storage elements of the distributed object storage system. In this way, when one of the storage elements storing a copy of the data object fails, this data object can still be recovered from another storage element holding a copy. Several schemes for replication are known in the art, but in general replication is costly as far as storage capacity is concerned. In order to survive two concurrent storage element failures in a distributed object storage system, at least two replica copies of each data object are required in addition to the original, which results in a storage capacity overhead of 200%, meaning that for storing 1 GB of data objects a storage capacity of 3 GB is required. Another well-known scheme is referred to as RAID, of which some implementations are more efficient than replication as far as storage capacity overhead is concerned. However, RAID systems often require a form of synchronisation of the different storage elements, require them to be of the same type, and in the case of drive failure require immediate replacement, followed by a costly and time-consuming rebuild process.
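The replication arithmetic above can be illustrated with a minimal sketch. The function names `replication_overhead` and `raw_capacity_gb` are hypothetical and chosen only for this illustration; the sketch assumes that each tolerated concurrent failure costs one additional full copy of the data object.

```python
def replication_overhead(tolerated_failures: int) -> float:
    """Storage overhead, as a fraction of the data size, when each data
    object is stored as one original plus one extra full copy per
    tolerated concurrent storage element failure (assumption)."""
    if tolerated_failures < 0:
        raise ValueError("tolerated_failures must be non-negative")
    return float(tolerated_failures)


def raw_capacity_gb(data_gb: float, overhead: float) -> float:
    """Raw storage capacity needed to hold data_gb of data objects."""
    return data_gb * (1.0 + overhead)


# Two tolerated failures: two extra copies, i.e. 200% overhead.
overhead = replication_overhead(2)
print(f"overhead: {overhead:.0%}")
print(f"raw capacity for 1 GB: {raw_capacity_gb(1.0, overhead):.1f} GB")
```

Under this assumption the overhead grows linearly with the number of failures to be tolerated, which is why replication becomes costly for higher redundancy levels.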
Therefore, known systems based on replication or known RAID systems are generally not configured to survive more than two concurrent storage element failures. For this reason, it has been proposed to use distributed object storage systems that are based on erasure encoding, such as for example described in WO2009135630, US2007/0136525 or US2008/313241. Such a distributed object storage system stores the data object in encoded sub blocks that are spread amongst the storage elements in such a way that for example a concurrent failure of six storage elements can be tolerated with a corresponding storage overhead of 60%, which means that 1 GB of data objects only requires a storage capacity of 1.6 GB.
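The erasure encoding figures above can be reproduced with a short sketch. The parameters used here are assumptions for illustration only: the text does not state how many sub blocks are used, so the sketch picks a hypothetical scheme in which any 10 out of 16 sub blocks suffice to recover the data object (k = 10 data-equivalent sub blocks, m = 6 redundant sub blocks). The function names are likewise hypothetical.

```python
def erasure_overhead(k: int, m: int) -> float:
    """Storage overhead, as a fraction of the data size, for an erasure
    code in which any k of the k + m equally sized sub blocks suffice
    to recover the data object, so up to m losses are tolerated."""
    if k <= 0 or m < 0:
        raise ValueError("k must be positive and m non-negative")
    return m / k


def raw_capacity_gb(data_gb: float, k: int, m: int) -> float:
    """Raw storage capacity needed to hold data_gb of data objects."""
    return data_gb * (k + m) / k


# Hypothetical parameters: 16 sub blocks, any 10 of which are sufficient.
K, M = 10, 6
print(f"tolerated concurrent failures: {M}")
print(f"overhead: {erasure_overhead(K, M):.0%}")
print(f"raw capacity for 1 GB: {raw_capacity_gb(1.0, K, M):.1f} GB")
```

For comparison, tolerating six concurrent failures with plain replication would require six extra full copies, i.e. an overhead of 600% rather than 60%.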
Such an erasure encoding based distributed object storage system for large-scale data storage also requires a form of self-healing functionality in order to restore the required redundancy policy after, for example, the failure of a storage element. However, in most known systems these self-healing methods lack efficiency and consume considerable amounts of processing power and/or network bandwidth in order to, for example, restore the redundancy for the data objects stored on a failed storage element. One system that tries to improve efficiency is for example described in WO2010/091101; however, this system could result in data loss after subsequent generations of node failure. Furthermore, this system is only able to handle the restore of a complete storage element and all objects on it. It is further not able to handle simultaneous replacement of a plurality of storage elements reliably and efficiently, as for every failing storage element a new storage element needs to be provided for the restore operation.
In general, during maintenance of a large-scale distributed object storage system, adding, removing and/or replacing storage elements or even complete storage nodes is an activity that is performed almost constantly. However, in prior art systems the efficiency of repair activity during normal operation does not suffice to reliably cope with these maintenance activities, with the result that manual configuration or supplementary restore operations must be performed in order to sufficiently safeguard the reliability of the distributed object storage system.
Therefore, there still exists a need for an efficient and reliable monitoring and repair process for a distributed object storage system that does not result in data loss in the long term and is able to realize a large-scale, self-healing distributed object storage system. Further, there exists a need for the self-healing efficiency to be sufficiently high that the need for manual configuration or supplementary restore operations is reduced, even during extensive changes to the available storage elements or storage nodes.