1. Field of the Invention
The present invention relates, in general, to data storage and reduction or control of redundant data, and, more particularly, to a method of performing data deduplication that includes storing deduplicated instance information within, or in a manner similar to that used for, a standard file system. For example, the data deduplication method may include building an instance repository provided with disk storage that is configured with a conventional file system arrangement to store instances as files located using file system conventions. This differs from existing techniques, which store instance access information in a database, in-memory index, or other indirect access storage and therefore require a two-step process (e.g., an instance lookup in a first device/memory structure followed by retrieval of the instance data based on the obtained location/pointer information).
2. Relevant Background
The amount and type of data storage is rapidly expanding, and data management is rapidly becoming a significant cost for many businesses or enterprises. Particularly, enterprise data is growing exponentially, and today's businesses need a way to dramatically reduce costs associated with data storage and management. Enterprises also have to provide proper data backup to meet their needs such as servicing clients and complying with regulations and laws regarding maintaining data for relatively long periods of time. A complication for most businesses is that the enterprise data may be highly dispersed over many machines, data storage centers, and interconnected networks/systems.
Data deduplication may be used to lower overall costs of physical data storage by storing only a single instance of unique data (e.g., only one copy of particular data such as a file or data object is stored) for an enterprise or group sharing access to data. Deduplication is fast becoming a standard feature in many data storage systems, but existing data deduplication techniques have a number of limitations, including the use of a database, in-memory index, or similar mechanism to store the information that is needed to retrieve a specific instance of unique data.
Data deduplication generally is used to refer to the elimination of redundant data. In the deduplication process, duplicate data is deleted to leave only one copy or instance of the data to be stored. For example, a single copy of a document, an image, an e-mail, a spreadsheet, and other data objects for which there may have been numerous copies on a system may be stored in one or more data stores/data storage devices accessible by workers or operators in an enterprise such as a typical business or the like. Indexing of all the data or copies of the data is still retained during deduplication so that the data may be later retrieved with the index providing a unique name and location for the data object. In many deduplication processes, a database of the indexed data is provided that includes key-value pairs providing a key for identifying the data and a value that provides a location of the data (or a pointer/reference to the remote data location). The key or unique identifier for a data object (e.g., a file for a file system) is often generated by creating a hash of the object. Then, deduplication may involve comparing a hash of a new or ingested file with hashes of existing files in data storage. When files/objects with identical hashes are identified, the copy is removed and a new file (or reduced data file) is stored that points to the old or single stored instance.
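By way of illustration only, the hash-indexed approach described above may be sketched as follows. This is a minimal in-memory model and not any particular product's implementation. The class name `DedupStore`, its attribute and method names, and the choice of SHA-256 as the hash function are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy model of hash-indexed deduplication (hypothetical names throughout)."""

    def __init__(self):
        self.index = {}      # key (content hash) -> location of the stored instance
        self.instances = []  # stand-in for the instance repository
        self.files = {}      # file name -> key of the single stored instance

    def ingest(self, name, data):
        """Store data under name; return True only if a new instance was written."""
        # SHA-256 is an assumed choice; the text only says "a hash of the object".
        key = hashlib.sha256(data).hexdigest()
        is_new = key not in self.index
        if is_new:
            # First time this content is seen: store the single instance.
            self.instances.append(data)
            self.index[key] = len(self.instances) - 1
        # Duplicate or not, the file becomes a reference to the stored instance.
        self.files[name] = key
        return is_new

    def retrieve(self, name):
        """Two-step access: look up the location, then fetch the instance data."""
        key = self.files[name]
        location = self.index[key]       # step 1: index lookup
        return self.instances[location]  # step 2: retrieval from the repository


store = DedupStore()
store.ingest("report_v1.doc", b"quarterly figures")
store.ingest("report_copy.doc", b"quarterly figures")  # duplicate content
print(len(store.instances))  # number of instances actually stored
```

In this sketch, `retrieve` makes the two-step nature of index-based access explicit: the key is first resolved to a location in the index, and only then is the instance data read.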
Deduplication is useful as it is able to reduce the required storage capacity as only unique data is stored. In an e-mail example for an enterprise, a typical e-mail system may contain one thousand instances of the same one megabyte file attachment. If the e-mail system is backed up or archived without deduplication, all one thousand instances of the attachment are saved in data storage, which requires one thousand megabytes. However, with data deduplication, only one instance of the attachment is actually stored in an instance repository, and each subsequent instance identified during the data ingestion step of deduplication is simply stored as a reference to the one saved copy, e.g., with a key-value pair in an index file of a database or with information of an in-memory index. In this example, data deduplication reduces storage requirements from one thousand megabytes to about one megabyte.
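The arithmetic of the e-mail example may be checked with a short calculation. The function name `storage_bytes` and the 64-byte size assumed for each reference are hypothetical; the text gives no figure for reference overhead and says only that the result is "about" one megabyte.

```python
def storage_bytes(copies, size_bytes, deduplicated, ref_bytes=64):
    """Estimate storage for `copies` identical objects of `size_bytes` each.

    ref_bytes is an assumed per-reference overhead, not a value from the text.
    """
    if not deduplicated:
        return copies * size_bytes
    # One stored instance plus a small reference for each remaining copy.
    return size_bytes + (copies - 1) * ref_bytes


MB = 1_000_000  # decimal megabyte, assumed for simplicity
without_dedup = storage_bytes(1000, 1 * MB, deduplicated=False)
with_dedup = storage_bytes(1000, 1 * MB, deduplicated=True)
print(without_dedup / MB, with_dedup / MB)
```

Under these assumptions the reference overhead is a small fraction of the single stored instance, which is consistent with the "about one megabyte" figure given above.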