1. Field of the Invention
The present invention relates generally to storage systems, and, in particular, to storage systems composed of a plurality of content addressable storage (CAS) systems.
2. Description of Related Art
Each of the known methods for archiving data has certain disadvantages. Until recently, tape has been the most commonly used archival medium, but tape archives are often difficult to access and do not allow quick and easy retrieval of archived data. Because of this, disk arrays have become more common as the archival medium of choice, since they allow archived data to be accessed more quickly and easily. However, prior art disk array archiving schemes suffer from several limitations, such as users being unable to locate the correct file and the storage of large amounts of duplicate data, leading to increased costs for archiving.
Fixed Content Aware Storage, Content Addressed Storage
Fixed Content Aware Storage (FCAS) is generally defined by SNIA (Storage Networking Industry Association) as storage of unchanging data (fixed content) and associated metadata based on a variety of naming schemas including Content Addressed Storage (CAS) and global content-independent identifiers. In the storage industry, CAS is also sometimes referred to as Content Addressable Storage, Content Aware Storage or Content Archive Storage.
In CAS, users and host computers store and retrieve data as an object, composed of content and metadata, rather than treating data as a standard file. The data (content) is appended with metadata (attributes of the content) and is assigned a unique object designator known as a “content address” or “object ID”. For archiving, the object is stored to a permanent location on a hard disk. Since each object is unique, multiple copies of the same file are not stored. This reduces the storage of duplicate data and reduces the total storage requirements, thereby overcoming a major limitation of disk array archiving discussed above.
In a CAS object, the content is a byte stream and the metadata is additional information regarding attributes of the content, such as creation time, retention period, size of the content, comments regarding the content, and the like. By attaching detailed and well-thought-out metadata to the object, data may be indexed, classified or searched without knowing specific filenames, dates or other traditional file designations. Thus, enhanced metadata may include background information, comments, and other information on the data that can aid someone who accesses the data in the future in understanding or applying the data, and that can also aid in searching for and locating desired data.
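As an illustration only, an object of this kind and a metadata-based search might be sketched as follows. The names `CASObject`, `find_by_metadata`, and the metadata keys are hypothetical and are not taken from any particular CAS product.

```python
from dataclasses import dataclass, field


@dataclass
class CASObject:
    """Hypothetical object: a byte stream plus attributes describing it."""
    content: bytes
    metadata: dict = field(default_factory=dict)


def find_by_metadata(objects, **criteria):
    """Return the objects whose metadata matches every given key/value pair."""
    return [
        obj for obj in objects
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    ]


# Example archive; the metadata keys are assumptions made for this sketch.
archive = [
    CASObject(b"Q3 revenue figures", {"retention_years": 7, "comment": "finance"}),
    CASObject(b"scan of signed contract", {"retention_years": 10, "comment": "legal"}),
]

# Locate an object by an attribute, without knowing any filename.
matches = find_by_metadata(archive, comment="legal")
```

The search relies only on the attached attributes, which is the property the paragraph above attributes to well-chosen metadata.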
When an object is stored to CAS, the CAS system generates identification information (an object ID or object identifier) and returns the object ID to the hosts/users. Hosts/users are then able to access the object using the object ID. In some implementations of CAS, the object ID is generated using a hash algorithm such as SHA-1 (Secure Hash Algorithm 1), SHA-256, or MD5 (Message Digest 5). At present, however, neither the method for generating the object ID nor the access protocol has been standardized. Therefore, to enable object access in CAS, host computers use whatever API (Application Programming Interface) is provided by the vendor of the particular CAS system that the user has purchased.
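A minimal sketch of hash-based object ID generation, using Python's standard `hashlib` module, is given below. The function name `make_object_id` is hypothetical, and hashing only the content bytes is an assumption; an actual CAS system may also include metadata or other inputs in the calculation.

```python
import hashlib


def make_object_id(content: bytes, algorithm: str = "sha256") -> str:
    """Derive a hexadecimal object ID from the content with the named hash."""
    return hashlib.new(algorithm, content).hexdigest()


content = b"archived record"

# The same content yields a different ID under each algorithm.
ids = {name: make_object_id(content, name) for name in ("sha1", "sha256", "md5")}
```

Because the ID depends on the algorithm chosen, a host must use the identifier issued by the particular system that stored the object.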
CAS also enables data reduction by locating commonality among data and by eliminating duplication of stored data. When data is stored to the CAS system, a hash algorithm is applied to the data, which produces a unique value according to the content. The CAS system compares that unique value against an index of unique values of other saved objects. If the unique hash value is not in the index, then the data and its metadata are stored and the hash value is added to the index. However, if the hash value is already in the index, then that data has already been stored, and only the metadata and a pointer to the already stored data are stored. This can result in substantial savings of archival storage space. Additionally, in many applications, less expensive SATA hard disks may be used in CAS systems, rather than more expensive Fibre Channel disks, which can result in additional cost savings.
US Patent Application Publication No. US 20050125625 to Kilian et al., which is incorporated by reference herein in its entirety, discusses one application of CAS for parsing a content address to facilitate selection of a physical storage location in a data storage system. This application discloses a method for data placement in a CAS storage system, but fails to teach or suggest any method for the migration of data from one CAS storage system to another.
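The store-time duplicate check described above (hash the content, consult the index, and store the content only if its hash is new) might be sketched as follows. The class `DedupStore`, the use of SHA-256, and the integer handle returned by `put` are assumptions made for illustration, not a description of any vendor's implementation.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory store that keeps each distinct content once."""

    def __init__(self):
        self.index = {}    # content hash -> stored content bytes
        self.objects = []  # one (metadata, content hash) record per stored object

    def put(self, content: bytes, metadata: dict) -> int:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self.index:
            # First time this content is seen: store it and index its hash.
            self.index[digest] = content
        # In every case, store the metadata and a pointer (the hash) to the content.
        self.objects.append((dict(metadata), digest))
        return len(self.objects) - 1  # handle for this sketch only

    def get(self, handle: int):
        metadata, digest = self.objects[handle]
        return self.index[digest], metadata


store = DedupStore()
first = store.put(b"annual report", {"owner": "alice"})
second = store.put(b"annual report", {"owner": "bob"})  # duplicate content
```

After the two calls, the store holds two metadata records but a single copy of the content, which is the space saving the paragraph describes.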
Data Migration between Storage Systems
Because the lifetime of data is generally longer than that of the storage apparatus on which it is stored, there sometimes arises the need to migrate data from a current storage system to a new replacement storage system. During data migration, one important concern is to shorten any downtime required during the migration. U.S. Pat. No. 6,108,748 to Ofek et al. discloses one method for migrating data in storage systems while a host computer is online. Ofek et al. disclose a data migration method that is applicable to a mainframe storage system or a block access storage system, in which the same access method is applicable to both the donor storage device (storage system to be migrated) and the donee storage device (newly installed storage system). However, as the method described in this disclosure is directed to data migration between CAS systems, in which the data access methods are different from each other, the technique disclosed by Ofek et al. cannot be used.
Migrating Data between CAS Systems
When migrating data between CAS systems, it is likely that the object ID information will change, and, accordingly, current data migration methods that are applicable to block-based storage systems cannot be used. Since there is no industry standard for CAS systems, when moving data between an old CAS system sold by one vendor and a new CAS system sold by a different vendor, the object ID (content address) of each object will be different in the new system than it was in the old system. This is because the access method and the object ID generating method will be different. For example, one vendor may use one or more hash algorithms for creating the object IDs, while another vendor might use different hash algorithms. Furthermore, even if both CAS systems were manufactured by the same vendor, the method of generating object IDs might be different when the new CAS system is an upgraded model. Therefore, in CAS data migration, another method is required for transferring data between a first CAS system having one data access method or object ID calculation scheme and a second CAS system having a different data access method or object ID calculation scheme.
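The difficulty can be illustrated with a small sketch in which two hypothetical CAS systems differ only in the hash algorithm used to generate object IDs. The class `HypotheticalCAS` and the choice of SHA-1 and SHA-256 are assumptions made for this example; real systems may differ in access method as well.

```python
import hashlib


class HypotheticalCAS:
    """Toy CAS system whose object IDs come from a configurable hash."""

    def __init__(self, algorithm: str):
        self.algorithm = algorithm
        self.objects = {}  # object ID -> (content, metadata)

    def store(self, content: bytes, metadata: dict) -> str:
        object_id = hashlib.new(self.algorithm, content).hexdigest()
        self.objects[object_id] = (content, metadata)
        return object_id

    def retrieve(self, object_id: str):
        # Returns None when the ID is unknown to this system.
        return self.objects.get(object_id)


old_cas = HypotheticalCAS("sha1")
new_cas = HypotheticalCAS("sha256")

# The host stores an object and remembers the ID issued by the old system.
old_id = old_cas.store(b"archived record", {"comment": "example"})

# Copying the object to the new system produces a different ID.
content, metadata = old_cas.retrieve(old_id)
new_id = new_cas.store(content, metadata)
```

In this sketch the object exists in the new system, yet `new_cas.retrieve(old_id)` finds nothing, because the host still holds an ID computed under the old scheme.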