Modern computer systems hold vast quantities of data, and the volume is increasing rapidly; so rapidly, in fact, that in many cases the increase threatens to outstrip the capacity of storage systems. For some companies, data growth can be as high as 30-40% per year. This growth not only requires continuing investment in newer and bigger storage systems, it also brings a corresponding increase in the cost of managing those systems. It is highly desirable to decrease the amount of storage a company needs, as doing so can significantly reduce the company's capital and operational expenditure.
One characteristic of the data stored in most mass storage systems is that there is a tremendous amount of duplication. Examples include duplicate files, files that are slightly different (e.g., multiple drafts of a document), the same images being stored in multiple documents, and the same templates or stationery being applied to presentations. While some systems can detect identical files and store them only once, typical systems still store large amounts of duplicate data. For example, practically every document in a company has the company logo embedded within it, but today's storage techniques are unable to recognize that the same logo data is repeated in every document, and so cannot save storage on it.
There is increased emphasis on sub-file data de-duplication, which detects duplicate data at a sub-file level, to reduce the storage and network footprint for primary storage as well as for secondary storage uses such as backup and archive. In recent times, various systems have been designed that can detect duplicate data at the sub-file level. Essentially all de-duplication systems create one or more ‘chunks’ out of the file or block storage unit being analyzed for de-duplication and then employ one or more methods of comparison to detect whether a duplicate chunk has been produced.
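The chunk-and-compare pattern described above can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-256 fingerprints as the comparison method; the function names (`chunk_fixed`, `deduplicate`, `restore`) and the 4096-byte chunk size are hypothetical choices made for illustration, not a description of any particular system.

```python
import hashlib

def chunk_fixed(data: bytes, size: int = 4096):
    """Split data into fixed-size chunks (the last chunk may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def deduplicate(blobs, size: int = 4096):
    """Store each distinct chunk once; describe each blob as a list of fingerprints."""
    store = {}    # fingerprint -> chunk bytes (each distinct chunk kept once)
    recipes = []  # one list of fingerprints per input blob
    for blob in blobs:
        recipe = []
        for chunk in chunk_fixed(blob, size):
            fp = hashlib.sha256(chunk).hexdigest()  # comparison by fingerprint
            store.setdefault(fp, chunk)             # keep only the first copy
            recipe.append(fp)
        recipes.append(recipe)
    return store, recipes

def restore(store, recipe):
    """Rebuild the original bytes from a recipe of fingerprints."""
    return b"".join(store[fp] for fp in recipe)

# Two hypothetical documents that embed the same 4096-byte "logo".
logo = bytes(range(256)) * 16
doc_a = logo + b"A" * 4096
doc_b = logo + b"B" * 4096
store, recipes = deduplicate([doc_a, doc_b])
```

In this example the two documents yield four chunks in total, but the shared logo chunk is stored only once. The sketch works only because the logo happens to start on a chunk boundary in both documents; with fixed-size chunks, inserting a single byte before the logo would shift every later boundary and hide the duplication.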
Current methods of partitioning, or chunking, data are often ineffective at finding common sub-objects in the absence of ancestry information about the digital data units being evaluated for de-duplication. For example, if one is aware that file B is derived from file A, one can do a delta comparison between the two files to find common sub-objects, or use a “sticky bits” method to partition the data. However, in the absence of any ancestry knowledge, finding common sub-objects is computationally very expensive, especially in today's highly distributed computer systems, where millions of files are spread across thousands of computers and ancestry information is scarce.
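As a rough illustration of the delta-comparison case, the sketch below finds byte ranges shared by two versions of a file when it is already known that one was derived from the other. It uses Python's standard `difflib.SequenceMatcher` as a stand-in for a real delta encoder; the function name `common_regions`, the `min_len` threshold, and the sample data are assumptions made for this example.

```python
from difflib import SequenceMatcher

def common_regions(a: bytes, b: bytes, min_len: int = 8):
    """Return (offset_in_a, offset_in_b, length) for shared runs of at least min_len bytes."""
    # autojunk=False disables difflib's heuristic that ignores very frequent elements.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len  # also drops the zero-length sentinel block
    ]

# Hypothetical file A and a file B known to be derived from it.
file_a = b"HEADER-0123456789-body of the first draft"
file_b = b"HEADER-0123456789-body of the second draft"
regions = common_regions(file_a, file_b)
```

A pairwise comparison of this kind is practical only when the pair is known in advance. Without ancestry information, every file would in principle have to be compared against every other file, which is why the approach does not scale to millions of files.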