This invention is directed towards data storage systems, and more particularly towards backup and restoration of data residing in data storage systems.
Computer systems process increasingly vast quantities of data for a variety of purposes. As the ability of computer systems to process data increases, so does their need for data storage systems which provide very large data storage capabilities and high speed access to the data by computer host systems. Businesses which operate globally typically require round-the-clock access to databases which may be stored in one or more data storage systems. The data that is stored in these data storage systems is changing at an incredible rate. For example, data is changed and updated many times per second in transaction processing applications, reservation systems and data mining applications.
Data must be periodically backed up (copied to another storage medium) for archival purposes and in case of system failure. Data backups are performed frequently because data losses can be catastrophic to businesses or institutions. However, the need for system backups often conflicts with the need for continuous access to data because many computer systems must deny applications access to data while a backup is performed.
A system of providing redundant data storage known as "mirroring" allows fault tolerant data storage, as well as the ability for applications to access data while data backups and restores are being performed. Two or more separate storage devices (which may be physical or virtual disks) store identical data. These mirrors may be located close together in a system, or may be in different geographical locations. This provides a high level of fail safe fault tolerance, and also allows data access during a backup or restore operation. Typically, a backup is performed by stopping the mirroring process (referred to as splitting the mirrors), taking one of the storage devices (mirrors) off line and backing up the data from that mirror. The other mirror remains online and available. When the first mirror is completely backed up, the two mirrors are resynchronized (so that the data is again identical on both), and the data storage system returns to full operation.
Still further problems arise from the need to back up data in data storage systems. For example, backup operations which require substantial participation by applications and operating systems on host computer systems consume resources of those systems and reduce their performance. Components of data files or other data objects are typically scattered or stored in non-contiguous segments which may span multiple disks within a data storage system. Host operating systems and applications maintain maps such as file allocation tables to identify where each part of each data object is stored. The host operating system knows how to access the files in a logically contiguous format. Therefore, a backup operation that is performed at the application level (a logical backup) requires host system involvement, which substantially slows the host system because the applications must first read specific data objects from the data storage system and then write the data files to a backup device such as a magnetic tape drive. Large quantities of individual data objects are typically backed up, thereby requiring a host to perform extensive data transfer operations. Further, the backup device is often connected to the host system by a low bandwidth data path, such as an Ethernet connection. This process places a large load of streaming data on the low bandwidth data path, degrading performance. Therefore, application level backup can archive data in a logical order so that data objects such as individual files may be individually accessed from the backup media, but at the cost of lowered efficiency.
A faster method uses a backup system which streams large quantities of data from a data storage system to a backup device via a high speed direct connection between a backup server and the host storage device without routing the data through a host computer system. The host computer system is not involved in this process. The backup system copies physical segments of data which contain the desired data from the data storage system over the high speed direct connection to the backup device (a physical backup). This high speed direct connection can use any of various types of interfaces, such as a SCSI or Fibre Channel connection.
The physical backup is analogous to a snapshot of the physical segments of data as they were stored on the data storage system. The identical segments of data are read back to the data server in their entirety in order to restore data. Mapping information giving the locations of individual data objects is not available to the physical backup system, so such high speed backup systems typically cannot retrieve specific data objects such as individual files or directory structures from the backup device.
The present invention includes a system and method for high speed external data backup and restore which allows improved access to individual files and data objects. A backup system, for example EMC Data Manager (EDM) by EMC Corporation of Hopkinton, Mass., acquires a description of data to be backed up from an operating system or application that is running on a host computer. The data description includes mapping information which defines where data objects are stored in a data storage system and may be communicated to the backup system through a network connection such as a standard Ethernet connection. The backup system then causes data to be transferred in logical order from a data server to backup media such as magnetic tape. The transfer of data occurs over high speed data channels such as Symmetric Connect SCSI or fiber by EMC Corporation. The system of the present invention thereby provides the combined advantages of (1) high speed direct data transfer and (2) uniform access to individual data objects on backup media.
The present invention includes a data storage backup system to back up and restore data from a data storage system which is coupled to a host system. It includes an application interface component, running on the host system, to acquire mapping information for data objects stored in said data storage system. It also includes a backup system component coupled to the data storage system, the backup system component receiving the mapping information from the application interface component, to directly access the data objects in the data storage system based on storage locations as indicated by the mapping information. The backup system component reads the data objects in the data storage system and transfers the read data to a backup storage medium. The backup system component can read the data objects in the data storage system in a sequence to access the data objects in a contiguous format.
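By way of illustration only, the reading of data objects according to mapping information may be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the `Extent` record, the `backup_object` function, and the use of in-memory byte buffers to stand in for storage devices and backup media are all hypothetical names and simplifications introduced for this example.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Dict, List


@dataclass(frozen=True)
class Extent:
    """One contiguous piece of a data object (hypothetical record layout)."""
    device: str   # identifier of the storage device holding the piece
    offset: int   # byte offset of the piece on that device
    length: int   # number of bytes in the piece


def backup_object(devices: Dict[str, BinaryIO],
                  extents: List[Extent],
                  media: BinaryIO) -> int:
    """Copy one object's extents to the media in logical order.

    The extents list plays the role of the mapping information supplied
    by the host; the host itself reads none of the data.
    Returns the number of bytes written to the media.
    """
    written = 0
    for ext in extents:
        dev = devices[ext.device]
        dev.seek(ext.offset)
        chunk = dev.read(ext.length)
        if len(chunk) != ext.length:
            raise IOError(f"short read on {ext.device} at {ext.offset}")
        media.write(chunk)
        written += len(chunk)
    return written


# Usage: an object scattered across two devices lands contiguously on the media.
devices = {
    "disk0": io.BytesIO(b"xxxxHELLxxxx"),
    "disk1": io.BytesIO(b"yyO WORLDyy"),
}
mapping = [Extent("disk0", 4, 4), Extent("disk1", 2, 7)]
media = io.BytesIO()
backup_object(devices, mapping, media)
print(media.getvalue())  # b'HELLO WORLD'
```

In this sketch the order of the extent list alone determines the logical order on the media, which is why the backup media can later be read without reference to the original physical layout.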
For operations including restore operations, the application interface component requests the host system to allocate storage locations in the data storage system, and the application interface component acquires mapping information for those allocated storage locations. The backup system component writes data objects into the data storage system based on mapping information for the allocated storage locations.
Components of data files or other data objects are typically scattered or stored in non-contiguous segments which may span multiple disks within a data storage system. Host operating systems, logical volume managers, and applications maintain maps such as file allocation tables to identify where each part of each data object is stored. The backup system described in the present invention is capable of interpreting a provided map of data which describes a host operating system (including logical volumes) or application and using that map to selectively read data objects in logical order from a data storage system. In an illustrative embodiment, this information is obtained and provided by a specialized application. The host system is thereby relieved of the task of reading and assembling data before it is written to backup media.
The present invention also includes an improved method and apparatus for restoring data which is integral to the backup system. Once data has been backed up in logical order according to the method of the present invention, any individual files or data objects may be efficiently retrieved.
A request to restore particular data may be scheduled or generated by a host computer system and communicated through a network connection to the backup system. The backup system may directly locate the required data objects on the backup media because the data objects are stored there in logical order. Also, the backup system does not need to retain a map of the data (as it had been stored in the data storage system prior to being backed up) for restore time. Furthermore, data is frequently moved from one location to another within a data storage system between backup operations. Therefore, data is not restored to a data storage system according to its original map. Rather, a new disk space on the data storage system is allocated to the restored data and a new map of that data is provided by the host system.
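The restore flow described above may be sketched in the same illustrative manner. The sketch assumes that the host has already allocated new space and returned a new map as a list of extents; `Extent`, `restore_object`, and the in-memory buffers are hypothetical stand-ins introduced for this example and are not taken from the invention.

```python
import io
from dataclasses import dataclass
from typing import BinaryIO, Dict, List


@dataclass(frozen=True)
class Extent:
    """One contiguous piece of newly allocated space (hypothetical layout)."""
    device: str
    offset: int
    length: int


def restore_object(devices: Dict[str, BinaryIO],
                   new_extents: List[Extent],
                   media: BinaryIO) -> int:
    """Write a logically ordered object from the media into new extents.

    The original map is never consulted: the data is read sequentially
    from the media and placed wherever the new map says.
    Returns the number of bytes restored.
    """
    restored = 0
    for ext in new_extents:
        chunk = media.read(ext.length)
        if len(chunk) != ext.length:
            raise IOError("backup media ended before the new map was filled")
        dev = devices[ext.device]
        dev.seek(ext.offset)
        dev.write(chunk)
        restored += len(chunk)
    return restored


# Usage: the object is restored to locations unrelated to where it once lived.
media = io.BytesIO(b"HELLO WORLD")
devices = {"diskA": io.BytesIO(b"." * 8), "diskB": io.BytesIO(b"." * 8)}
new_map = [Extent("diskB", 1, 6), Extent("diskA", 3, 5)]
restore_object(devices, new_map, media)
print(devices["diskB"].getvalue())  # b'.HELLO .'
print(devices["diskA"].getvalue())  # b'...WORLD'
```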
Data backup and restoration procedures according to the present invention may be employed by systems having redundant data storage systems, such as RAID-1. Such systems may also use mirror splitting methods to provide a host system with continuous access to data during backup and restore operations.
An advantage of the present invention is that it combines the restore granularity provided by file level logical backup systems with the speed provided by physical data streaming backup systems.
Another advantage is the ability to do "smart" or finer grained backups without tying up host computer or network resources. The backup system performs many of the tasks which previously had to be performed by the host systems.
Yet another advantage of the present invention is the ability to give a backup and restore system much more control and flexibility over the process of accessing and restoring data in data storage systems, without requiring the backup and restore system to work directly with the complexity of database or operating system file hierarchies. The present invention has the ability to use knowledge of the physical arrangement of data to discern the logical order of the data despite disk striping or fragmenting.
Further, the present invention allows files stored on arbitrary storage geometries to be backed up and restored to other geometries. For example, a Logical Volume Manager may be used to allocate a "virtual volume" across several Symmetrix disks. An EDM backup may be restored to a logical volume which uses a different underlying disk geometry, such as a change in stripe size, or number of disks in the striped set. Further, the present invention allows a backup system to pre-allocate empty space on the data storage systems by requesting the host system to handle the details of allocating space on a data storage system (which the host knows how to do), and then the backup system can put data into the allocated space.
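The independence of the backup from the underlying disk geometry may be illustrated with a short sketch. It assumes a simple round-robin striping layout, which is an assumption made for this example and not a layout taken from any particular volume manager; the functions `locate`, `write_striped`, and `read_striped` are likewise hypothetical.

```python
from typing import List, Tuple


def locate(logical_offset: int, stripe_size: int, n_disks: int) -> Tuple[int, int]:
    """Map a logical byte offset to (disk index, byte offset on that disk)."""
    stripe, within = divmod(logical_offset, stripe_size)
    row, disk = divmod(stripe, n_disks)
    return disk, row * stripe_size + within


def write_striped(data: bytes, stripe_size: int, n_disks: int) -> List[bytearray]:
    """Lay logically ordered data out across a striped set of disks."""
    disks = [bytearray() for _ in range(n_disks)]
    for start in range(0, len(data), stripe_size):
        disk, _ = locate(start, stripe_size, n_disks)
        # Stripes arrive in logical order, so appending keeps each disk in order.
        disks[disk] += data[start:start + stripe_size]
    return disks


def read_striped(disks: List[bytearray], length: int, stripe_size: int) -> bytes:
    """Reassemble the logical byte order from a striped set of disks."""
    out = bytearray()
    for start in range(0, length, stripe_size):
        disk, offset = locate(start, stripe_size, len(disks))
        take = min(stripe_size, length - start)
        out += disks[disk][offset:offset + take]
    return bytes(out)


# Usage: back up from a 3-disk set with 4-byte stripes, then restore to a
# 2-disk set with 8-byte stripes. Only the logical stream crosses over.
data = bytes(range(50))
old_set = write_striped(data, stripe_size=4, n_disks=3)
logical_stream = read_striped(old_set, len(data), stripe_size=4)
new_set = write_striped(logical_stream, stripe_size=8, n_disks=2)
print(read_striped(new_set, len(data), stripe_size=8) == data)  # True
```

Because the backup media holds only the logical stream, the stripe size and disk count appear solely in the layout functions applied at backup time and at restore time, and the two need not agree.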