The present invention is generally related to storage systems and in particular to a system and method for providing reliable long term retention of data.
Recent events have highlighted the need for long-term storage of data. Businesses and data users in general need to archive data for long periods of time. Enterprises are interested in long-term data preservation, motivated in large part by government-imposed regulations. For example, under Securities Exchange Act of 1934 Rule 17a-4, the U.S. Securities and Exchange Commission (SEC) requires exchange members, brokers, and dealers to preserve records of trading accounts until the end of the account plus 6 years, and to keep records of all communications, such as email with their customers, for a period of not less than 6 years. The National Association of Securities Dealers Inc. (NASD) has similar regulations under Rules 3010 and 3110. See, for example, the SEC web site http://www.sec.gov for further detail.
Another example of an industry where long-term data retention is important is the healthcare industry. Under the Health Insurance Portability and Accountability Act (HIPAA), regulations require hospitals to retain medical records for a patient's life plus 2 years. See, for example, the web site http://www.cms.hhs.gov/hipaa/ for further detail.
There are several key issues for long-term data preservation, such as the frequency of backups, the storage media, the location of the data vault, and so on. One of the most important considerations is faithful data recovery after many years of storage; i.e., providing users with exactly the same data as was originally saved, after a long period of time has passed. Generally, users preserve (or archive) data using lower cost storage systems than were used for production data. Examples of lower cost storage systems include tape libraries, optical disk libraries, and ATA-based disk storage systems. Compare those systems with typical higher performance, higher reliability production data storage systems, such as a RAID system using FC/SCSI-based disks. Because archive storage systems are lower in cost, they are also less reliable than production systems. Consequently, data loss can occur over a long period of storage.
A conventional technique for increasing the reliability and reproducibility of long-term data is to use a checksum. Each file is “analyzed” to determine a checksum that is associated with the file. For example, each byte (or group of bytes) of data in the file can be summed to produce a total called the checksum. The checksum is saved along with the file. Later, the file can be validated by repeating the checksum computation and comparing the result with the stored checksum to determine whether the file has been corrupted over time. Other similar techniques have also been used, e.g., hash codes. While these methods can detect that a file has been corrupted, they cannot undo the corruption.
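The checksum validation described above can be sketched as follows. This is only a minimal illustration in Python, not an implementation taken from any particular system: the function names, the 32-bit modulus, the choice of SHA-256 as the example hash code, and the sample data are all assumptions made for the sake of the example.

```python
import hashlib

# Hypothetical helper: add up every byte of the data, as in the simple
# checksum described in the text. The 32-bit modulus is an assumption.
def byte_sum_checksum(data: bytes, modulus: int = 2**32) -> int:
    return sum(data) % modulus

# Hypothetical helper: a hash code as an alternative to a plain sum.
# SHA-256 is chosen here only as an example.
def hash_code(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Validation repeats the computation and compares it with the saved value.
def is_intact(data: bytes, stored_checksum: int) -> bool:
    return byte_sum_checksum(data) == stored_checksum

original = b"trade record 0001"
stored = byte_sum_checksum(original)   # saved along with the file

corrupted = b"trade record 0002"       # simulated corruption over time
still_good = is_intact(original, stored)
detected = not is_intact(corrupted, stored)

# A plain byte sum ignores byte order, whereas a hash code does not.
same_sum = byte_sum_checksum(b"ab") == byte_sum_checksum(b"ba")
same_hash = hash_code(b"ab") == hash_code(b"ba")
```

In this sketch the corrupted copy is detected, but nothing in the stored checksum allows the original bytes to be reconstructed, which is the limitation noted above.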
Another conventional technique is to create one or more replicas of the file and save the file and its replicas on different storage devices. For example, PCT International Publication No. WO 99/38093 discloses a method of content addressable information encapsulation, representation, and transfer. As understood, hash values are generated and used as the file descriptor, and the file is replicated in several storage resources. The hash value is used to access these replicas uniquely. Since the replicas exist in other storage systems, the file is recoverable even if the hash value reveals that the original has been corrupted. However, this method has the problem that the replicas require extra capacity in the storage systems. As a result, this solution is relatively expensive.
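The replica-based approach can be illustrated with a small sketch. It is not the method of the cited publication; it is a simplified Python illustration in which in-memory dictionaries stand in for separate storage devices, SHA-256 is assumed as the hash function, and all function names and sample data are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional

# Hypothetical: the hash value of the content serves as the file descriptor.
def content_address(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Save the file and its replicas, one copy per storage device.
def store_with_replicas(data: bytes, stores: List[Dict[str, bytes]]) -> str:
    key = content_address(data)
    for store in stores:
        store[key] = data
    return key

# Return the first copy whose recomputed hash still matches its descriptor.
def recover(key: str, stores: List[Dict[str, bytes]]) -> Optional[bytes]:
    for store in stores:
        candidate = store.get(key)
        if candidate is not None and content_address(candidate) == key:
            return candidate
    return None

stores: List[Dict[str, bytes]] = [{}, {}, {}]   # three stand-in devices
record = b"patient chart 42"
key = store_with_replicas(record, stores)

stores[0][key] = b"patient chart 4?"            # simulate corruption of one copy
recovered = recover(key, stores)

# Capacity consumed across all devices: one full copy per replica.
total_bytes = sum(len(value) for store in stores for value in store.values())
```

The sketch recovers the record from an intact replica, but it consumes three times the capacity of a single copy, which reflects the cost drawback noted above.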
A need exists for reliable long-term data retention. It is desirable to achieve this in a low-cost implementation.