Modern computer systems hold vast quantities of data, and that data is increasing rapidly; so rapidly, in fact, that in many cases the increase threatens to outstrip the capacity of storage systems. This growth not only requires continuing investment in newer and bigger storage systems, it also brings a corresponding increase in the cost of managing those systems. It is highly desirable to decrease the amount of storage required within a company, as a reduction in storage can significantly reduce the capital and operational expenditure of the company.
One characteristic of the data stored in most mass storage systems is that there is a tremendous amount of duplication of data. Examples include duplicate files, files that are slightly different (e.g. multiple drafts of a document), the same images being stored in multiple documents, the same templates or stationery being applied to multiple presentations, etc. While there are some systems that can detect identical files and store them only once, typical systems still store large amounts of duplicate data. For example, practically every document in a company has the company logo embedded within it, but today's storage techniques are unable to recognize that the same data for the logo is being repeated in every document and so cannot save the storage it consumes.
There is increased emphasis on sub-file data de-duplication, which detects duplicate data at a sub-file level in order to reduce the storage and network footprint for primary storage as well as for secondary storage uses such as backup and archive. In recent times, various systems have been designed that can detect duplicate data at a sub-file level. De-duplication systems typically create one or more ‘chunks’ out of the file or block storage unit being analyzed for de-duplication and then employ one or more methods of comparison to detect whether a duplicate chunk has been produced.
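By way of illustration only, the chunk-and-compare approach described above can be sketched as follows. The sketch assumes fixed-size chunks and comparison by SHA-256 digest; neither choice is taken from any particular system discussed here, and the names chunk_fixed, deduplicate, and reconstruct, as well as the 8-byte chunk size, are hypothetical and chosen only to keep the example small.

```python
import hashlib


def chunk_fixed(data: bytes, size: int = 8) -> list:
    """Split data into consecutive chunks of at most `size` bytes (assumed scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(data: bytes, store: dict, size: int = 8):
    """Store only chunks not already present in `store`.

    Returns the list of chunk digests needed to rebuild `data`
    and the number of chunks that were newly stored.
    """
    recipe = []
    new_chunks = 0
    for chunk in chunk_fixed(data, size):
        # Comparison method assumed here: equal digests imply duplicate chunks.
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk
            new_chunks += 1
        recipe.append(digest)
    return recipe, new_chunks


def reconstruct(recipe: list, store: dict) -> bytes:
    """Rebuild the original data from its list of chunk digests."""
    return b"".join(store[digest] for digest in recipe)


# Two hypothetical documents that embed the same 8-byte "logo".
logo = b"LOGODATA"
doc_a = logo + b"report-A"
doc_b = logo + b"report-B"

store = {}
recipe_a, new_a = deduplicate(doc_a, store)
recipe_b, new_b = deduplicate(doc_b, store)
```

In this sketch the shared logo chunk is stored once, so the second document adds only one new chunk to the store. A fixed chunk size is the simplest possible choice; it detects duplicates only when the repeated data falls on the same chunk boundaries in each file.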