The present invention relates to increased efficiency of data de-duplication. More specifically, the invention relates to selecting data containers for placement into a de-duplication domain.
De-duplication is a method to reduce the number of data storage devices needed to store a given amount of information. It operates by detecting repetition of identical chunks of data and, in some instances, replacing a repeated copy with a reference to another copy of the same content. A de-duplication system also provides for reconstructing the original form of content that has been stored in this reduced form. The references are used to locate the original copies of the data so that the full-length form of the desired content can be delivered.
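The replace-with-reference and reconstruct steps described above can be illustrated with a minimal sketch. It is not taken from the invention itself: the class name `DedupStore`, the fixed chunk size, and the use of SHA-256 digests as references are all assumptions chosen for illustration, and a practical system may chunk and index data quite differently.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory store that keeps one copy of each distinct chunk."""

    def __init__(self, chunk_size=4):
        # chunk_size is an illustrative assumption; real systems use far larger chunks.
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes (the single stored copy)

    def write(self, data):
        """Split data into fixed-size chunks and return a list of references."""
        refs = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if identical content has not been seen before.
            self.chunks.setdefault(digest, chunk)
            refs.append(digest)
        return refs

    def read(self, refs):
        """Follow the references to rebuild the full-length content."""
        return b"".join(self.chunks[ref] for ref in refs)


store = DedupStore(chunk_size=4)
original = b"ABCDABCDABCDWXYZ"
refs = store.write(original)
restored = store.read(refs)
```

In this sketch the sixteen-byte input yields four references but only two stored chunks, since the chunk `ABCD` repeats three times, and `read` recovers the original bytes by following the references.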
Systems employing de-duplication can experience performance issues when applied to large-scale storage systems. To perform well at scale, systems built for large-scale storage are generally designed to adopt a scale-out strategy in which separate hardware operates independently on separate sub-regions of the storage. Independent operation is necessary so that messaging overheads, lock delays, and blocking waits do not grow too large. However, de-duplication requires dependent operation across its entire span. This requirement means that all nodes involved in the full span of de-duplication must be in frequent messaging contact and are subject to blocking waits, both of which can degrade scalability.