Organizations and businesses that need to retain data assets for prolonged periods of time have migrated from traditional location-based file storage systems to more efficient content addressed storage (CAS) systems. CAS systems keep data objects (also termed files, binary large objects, or blobs) in a single flat directory, rather than the tiered directory used by traditional location-based file systems. Additionally, CAS systems rename data objects and do not refer to them by user-provided file names. Instead, the system creates names for stored data objects based upon their content and context. Any file name provided by the user is stored as metadata associated with the data object, along with other information, such as the data object's date of creation, creator name, date of last modification, project name, location in the CAS system repository, etc. Data objects are recovered from a CAS system repository by referring to identifiers that the system associates with the requested data object. To interact with a CAS system and its associated repository, a user may use a backup utility software application, also known as a data mover agent (DMA). Instructions to initiate backup to, archive from, and recover an archived data object from a CAS repository can be executed using a DMA.
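The naming scheme described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the system-generated name is the SHA-256 digest of the object's content alone; the class name FlatCasStore, its methods, and the metadata fields are hypothetical and do not describe any particular CAS product.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FlatCasStore:
    """Toy in-memory CAS: one flat namespace keyed by content-derived names."""

    objects: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def put(self, content: bytes, user_file_name: str, creator: str) -> str:
        # Assumption: the address is derived from the content only (SHA-256).
        # A real system may also mix in context, as the text notes.
        address = hashlib.sha256(content).hexdigest()
        self.objects[address] = content
        # The user-provided file name survives only as metadata.
        self.metadata[address] = {"file_name": user_file_name, "creator": creator}
        return address

    def get(self, address: str) -> bytes:
        # Objects are recovered by identifier, never by path or file name.
        return self.objects[address]


store = FlatCasStore()
object_address = store.put(b"quarterly report", "report.doc", "alice")
recovered = store.get(object_address)
```

In this sketch the caller keeps the returned address and uses it for every later request; the original file name plays no part in locating the object.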
To further protect valuable data assets, an organization may utilize a secondary backup or archive of the primary CAS system. Secondary backup devices, such as magnetic or optical tape drives, may be used to mirror data objects stored on the primary CAS system. The secondary storage system may be connected to the primary CAS system over a network, with appropriate hardware and software mechanisms for enabling backup, recovery and archive.
One method for archiving content addressed data objects to a secondary storage device involves identifying those data objects for archiving, then issuing an appropriate computer instruction or command to the backup utility software application. The current protocol standard in the industry is the open Network Data Management Protocol (NDMP). Commands issued to the backup and recovery system comply with NDMP. NDMP supports interaction between the DMA that a user uses to interface with the content addressed storage system, the backup and recovery software module (BRM) that manages or resides within the host NDMP server associated with the CAS system, and the secondary backup storage system.
NDMP supports data transfer in two formats: single stream and multiple stream. In single stream format, data objects are transmitted one at a time. In multiple stream format, data objects are transmitted simultaneously. The number of data streams may depend upon the limitations of the hardware and software configuration, the limitations of the network, and the limitations of the ultimate destination where data objects are streamed.
One process for streaming data objects selected for archiving to a secondary storage device, such as a tape, requires that data objects be packaged according to a certain byte size. One skilled in the relevant art will recognize that such packaging will involve copying or moving the data objects into memory buffers, or blocks, before streaming them to the secondary storage device. If a data object exceeds the size of the data block buffer, then the data object may be divided up before being streamed to the secondary storage device. As a result, the data object may be fragmented or non-sequentially placed in multiple locations on the secondary storage device. In the case of a tape, a large data object will be apportioned across multiple sections of tape. While this method may be efficient for some older backup and archiving purposes, it proves inefficient for data object restoration operations.
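The fragmentation effect described above can be sketched as follows. The sketch assumes a hypothetical layout in which blocks from several data objects are written to the tape in round-robin order; the function names and the list-based tape model are illustrative assumptions, not part of NDMP or of any specific backup product.

```python
from itertools import zip_longest


def split_into_blocks(data_object: bytes, block_size: int) -> list[bytes]:
    """Divide a data object into buffers of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        data_object[offset:offset + block_size]
        for offset in range(0, len(data_object), block_size)
    ]


def write_interleaved(objects: dict[str, bytes], block_size: int) -> list[tuple[str, bytes]]:
    """Model a tape as a list of (object name, block) entries written round-robin."""
    per_object = [
        [(name, block) for block in split_into_blocks(data, block_size)]
        for name, data in objects.items()
    ]
    tape = []
    for row in zip_longest(*per_object):
        # Shorter objects run out of blocks first; skip their empty slots.
        tape.extend(entry for entry in row if entry is not None)
    return tape


def restore(tape: list[tuple[str, bytes]], name: str) -> bytes:
    """Restoration must scan the tape and collect every fragment of the object."""
    return b"".join(block for owner, block in tape if owner == name)


tape = write_interleaved({"a": b"AAAAAA", "b": b"BB"}, block_size=4)
positions_of_a = [index for index, (owner, _) in enumerate(tape) if owner == "a"]
restored_a = restore(tape, "a")
```

Here object "a" exceeds the block size, so it is split in two and its fragments end up at non-adjacent tape positions, with a block of object "b" between them. Restoring "a" therefore requires locating and rejoining both fragments in order.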
Because large data objects may be divided across multiple locations on the secondary storage device medium, the restoration process is delayed by the task of locating each portion of the fragmented file. The chance of data loss also increases, because the system may be unable to locate all of the fragmented pieces. Additionally, restoration is slowed by the necessary step of verifying that all of the fragmented pieces correctly correspond. In many systems, recovery often fails because the system is unable to locate all of the fragmented pieces of a data object. Large recovery requests for multiple data objects compound the problem, resulting in mass data recovery failure and defeating the purpose of having a viable backup, recovery and archive system.
What is needed is a novel process for archiving data objects stored in a content addressed storage system that avoids recovery failure and/or recovery delay. What is needed is a process that will reduce the inefficiency of restoring fragmented data objects. What is further needed is a process that works with existing NDMP-compatible content addressed storage systems and that is easy to adopt and implement.