1. The Field of the Invention
The present invention relates to data storage and backup solutions for archiving data. More particularly, embodiments of the invention relate to hardware, software, systems, and methods for efficiently backing up and/or restoring data by localizing storage of data referenced in a composite or directory element with the composite or directory element in a hash file system and content addressed storage.
2. The Relevant Technology
The need for reliable backup and archiving of information is well known. Businesses that produce and rely upon digital information devote large amounts of time and money to information system (IS) resources that back up and archive the information resident in the computers and servers within their organizations. Customers of the data storage industry increasingly demand not only that their data be properly backed up, but also that such data protection be done in a cost effective manner with a reduced cost per bit for stored data sets.
To address these demands, Content Addressed Storage (CAS) has been developed to provide a more cost effective approach to data backup and archiving. Generally, CAS applications involve a storage technique for content that is in its final form, i.e., fixed content, or that is not changed frequently. CAS assigns an identifier to the data so that it can be accessed no matter where it is located. For example, a hash value may be assigned to each portion or subset of a data set that is to be data protected or backed up. Presently, CAS applications are provided in distributed or networked storage systems designed for CAS, and storage applications use a CAS application programming interface (API) or the like to store and locate CAS-based files in the distributed system or network.
The usage of CAS enables data protection systems to store, online, multi-year archives of backup data by eliminating the storage of redundant data: complete copies of data sets do not have to be stored as long as their content is already stored and available. The use of CAS removes the challenges of maintaining a centralized backup index and also provides a high level of data integrity. CAS-based backup and archive applications have also improved the usage of network and data storage resources with better distribution of data throughout a multi-node data storage system.
CAS-based backup and archive applications are also desirable because multi-year or other large backup archives can be stored easily, since only a single instance of any particular data object (i.e., content) is stored regardless of how many times the object or content is discovered within the data set being protected or backed up. With CAS, the storage address for any data element or content is generated by an analysis of the contents of the data element itself. Since an exclusive storage address is generated for each unique data element (which is matched with a unique identifier) and the storage address points to the location of the data element, CAS-based architectures have found favor in the storage industry because they reduce the volume of data stored: each unique data object is stored only once within the data storage system.
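By way of illustration only, the single-instance behavior described above can be sketched as follows. The class name `ContentStore`, its methods, the in-memory dictionary, and the choice of SHA-256 as the hash function are assumptions made for this example and are not taken from any particular CAS product.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}  # maps storage address -> stored bytes

    def put(self, data: bytes) -> str:
        # The address is derived solely from the content itself.
        # SHA-256 is an assumption; the text only says "a hash value".
        address = hashlib.sha256(data).hexdigest()
        # Identical content yields the same address, so it is kept only once.
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
a1 = store.put(b"quarterly report")
a2 = store.put(b"quarterly report")  # same content discovered a second time
a3 = store.put(b"annual report")
```

In this sketch, `a1` and `a2` are the same address and the store holds two objects rather than three, which mirrors the reduction in stored volume described above.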
In a CAS-based architecture, directories, files, and other large sequences of digital data are broken down into multiple unique data elements. In this way, when a small modification is made to a large digital sequence, only a few (as few as one) affected data elements of the large digital sequence have to be added to the CAS system, rather than adding the entire modified large digital sequence to the CAS system. In order to reconstruct each of the large sequences of digital data from multiple individual data elements, a CAS system creates and stores recipes (such as composites or directory elements), each recipe referencing two or more corresponding data elements making up the larger digital sequence and including instructions for combining the data elements.
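The relationship between a large digital sequence, its data elements, and its recipe can be sketched as shown below. This is a simplified sketch under stated assumptions: the fixed `CHUNK_SIZE`, the helper names `put`, `store_sequence`, and `restore_sequence`, and the use of SHA-256 are hypothetical, and the recipe is reduced to an ordered list of addresses whose only combining instruction is concatenation.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical element size, kept tiny for illustration


def put(store: dict, data: bytes) -> str:
    """Store data under its content hash and return the address."""
    address = hashlib.sha256(data).hexdigest()
    store.setdefault(address, data)
    return address


def store_sequence(store: dict, data: bytes) -> list:
    """Split data into fixed-size elements, store each one, and return
    a recipe: the ordered list of element addresses."""
    return [
        put(store, data[i:i + CHUNK_SIZE])
        for i in range(0, len(data), CHUNK_SIZE)
    ]


def restore_sequence(store: dict, recipe: list) -> bytes:
    """Rebuild the sequence by concatenating elements in recipe order."""
    return b"".join(store[address] for address in recipe)


store = {}
original = b"AAAABBBBCCCCDDDD"
recipe_v1 = store_sequence(store, original)
count_v1 = len(store)  # four distinct elements

# A small modification changes only one element of the sequence.
modified = b"AAAABBBBCCCXDDDD"
recipe_v2 = store_sequence(store, modified)
added = len(store) - count_v1  # only the changed element is new
```

Here the second version adds a single new data element to the store, while both versions remain fully restorable from their respective recipes.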
In a conventional CAS system, storage addresses generated for a recipe and for the unique data elements referenced by the recipe may exist on different storage nodes. As a result, restoring a corresponding large sequence of digital data can require performing multiple seeks across numerous storage nodes to retrieve each of the unique data elements. These multiple seeks, in turn, affect the performance of the CAS system, as each seek increases the total time required to complete a restore process.
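The scattering effect described above can be illustrated with a short sketch. The placement rule used here (the hash value modulo the number of nodes), the constant `NUM_NODES`, and the helper names are assumptions chosen for the example; an actual CAS system may map addresses to storage nodes differently.

```python
import hashlib

NUM_NODES = 4  # hypothetical number of storage nodes


def address_of(data: bytes) -> str:
    """Content-derived address (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def node_for(address: str) -> int:
    """Assumed placement rule: node index is determined by the hash value."""
    return int(address, 16) % NUM_NODES


# Sixteen data elements referenced by a single recipe.
elements = [f"element-{i}".encode() for i in range(16)]
recipe = [address_of(e) for e in elements]

# The recipe is itself stored by the hash of its own content.
recipe_node = node_for(address_of("".join(recipe).encode()))

# Each element lands on whichever node its own hash selects,
# independent of where the recipe is stored.
element_nodes = [node_for(address) for address in recipe]

# Elements that do not share a node with the recipe require a seek elsewhere.
remote_seeks = sum(1 for node in element_nodes if node != recipe_node)
nodes_touched = len(set(element_nodes) | {recipe_node})

print(f"recipe on node {recipe_node}; "
      f"{remote_seeks} of {len(recipe)} elements on other nodes; "
      f"{nodes_touched} nodes touched during restore")
```

Because placement in this sketch depends only on each element's hash, nothing keeps the elements near the recipe that references them, so a restore may touch several nodes.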
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.