1. Field of the Invention
The present invention relates to a distributed data storage system and method for storing data, and more particularly, to a system and method for storing subsets of an original set of data on multiple digital data storage devices in one or more locations, such that the individual data subsets on each digital data storage device are unrecognizable and unusable except when combined with data subsets from other digital data storage devices, and in which the data subsets are selected by way of information dispersal algorithms so that, even if there is a failure of one or more digital data storage devices, the original data can be reconstructed.
2. Description of the Prior Art
Various data storage systems are known for storing data. Normally such data storage systems store all of the data associated with a particular data set, for example, all the data of a particular user or all the data associated with a particular software application or all the data in a particular file, in a single data space (i.e., a single digital data storage device). Critical data is known to be initially stored on redundant digital data storage devices. Thus, if there is a failure of one digital data storage device, a complete copy of the data is available on the other digital data storage device. Examples of such systems with redundant digital data storage devices are disclosed in U.S. Pat. Nos. 5,890,156; 6,058,454; and 6,418,539, hereby incorporated by reference. Although such redundant digital data storage systems are relatively reliable, there are other problems with such systems. First, such systems essentially double the cost of digital data storage. Second, all of the data in such redundant digital data storage systems is in one place, making the data vulnerable to unauthorized access.
In order to improve the security and thus the reliability of the data storage system, the data may be stored across more than one storage device, such as a hard drive, or removable media, such as a magnetic tape or a so-called “memory stick,” as set forth in U.S. Pat. No. 6,128,277, hereby incorporated by reference. Data may also be stored across more than one storage device for reasons relating to performance improvements or capacity limitations. For example, recent data in a database might be stored on a hard drive while older data that is less often used might be stored on a magnetic tape. Another example is storing data from a single file that would be too large to fit on a single hard drive on two hard drives. In each of these cases, the data subset stored on each data storage device does not contain all of the original data, but does contain a generally continuous portion of the data that can be used to provide some usable information. For example, if the original data to be stored was the string of characters in the following sentence:
        The quick brown fox jumped over the lazy dog.
and that data was stored on two different data storage devices, then either one or both of those devices would contain usable information. If, for example, the first 26 characters of that 45 character string were stored on one data storage device and the remaining 19 were stored on a second data storage device, then the sentence may be stored as follows:
        The quick brown fox jumped (Stored on the first storage device)
        over the lazy dog. (Stored on the second storage device)
In each case, the data stored on each device is not a complete copy of the original data, but each of the data subsets stored on each device provides some usable information.
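The contiguous split described above can be sketched in a few lines of Python. This is an illustration only: the variable names `first_device` and `second_device` are hypothetical stand-ins for the two storage devices, and no real storage API is involved.

```python
# Illustrative sketch: a contiguous split of the example sentence.
# The names below are hypothetical and stand in for two storage devices.
original = "The quick brown fox jumped over the lazy dog."

split_point = 26  # the first 26 of the 45 characters go to the first device
first_device = original[:split_point]
second_device = original[split_point:]

# Each subset is a readable, contiguous portion of the original data.
print(repr(first_device))
print(repr(second_device))

# Concatenating the subsets in order restores the original data.
restored = first_device + second_device
```

Because each subset is a readable run of the original characters, anyone with access to a single device obtains usable information, which is the weakness noted in the preceding paragraph.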
Typically, the actual bit pattern of data stored on a device, such as a hard drive, is structured with additional values to represent file types, file systems and storage structures, such as hard drive sectors or memory segments. The techniques used to structure data in particular file types using particular file systems and particular storage structures are well known and allow individuals familiar with these techniques to identify the source data from the bit pattern on the physical media.
In order to make sure that stored data is only available to authorized users, data is often stored in an encrypted form using one of several known encryption techniques, such as DES, AES or several others. These encryption techniques store data in some coded form that requires a mathematical key that is ideally known only to authorized users or authorized processes. Although these encryption techniques are difficult to “break”, instances of encryption techniques being broken are known, making the data on such data storage systems vulnerable to unauthorized access.
In addition to securing data using encryption, several methods for improving the security of data storage using information dispersal algorithms have been developed, for example as disclosed in U.S. Pat. No. 6,826,711 and U.S. Patent Application Publication No. US 2005/0144382, hereby incorporated by reference. Such information dispersal algorithms are used to “slice” the original data into multiple data subsets and distribute these subsets to different storage nodes (i.e., different digital data storage devices). Individually, each data subset or slice does not contain enough information to recreate the original data; however, when a threshold number of subsets (i.e., fewer than the original number of subsets) are available, all the original data can be exactly recreated.
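The threshold property can be illustrated with a deliberately simple sketch, assuming three slices of which any two suffice. It uses plain XOR parity, which is not the algorithm of the cited references: here the first two slices still hold readable halves of the data, so the sketch shows only threshold reconstruction, not unrecognizability. The function names `make_slices` and `rebuild` are hypothetical.

```python
def make_slices(data: bytes) -> tuple[list[bytes], int]:
    """Return three slices (two halves plus XOR parity) and the original length."""
    half = (len(data) + 1) // 2
    a = data[:half]
    b = data[half:].ljust(half, b"\x00")  # pad the second half to equal length
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity], len(data)


def rebuild(slices: dict[int, bytes], length: int) -> bytes:
    """Rebuild the data from any two slices, keyed by slice index 0, 1 or 2."""
    if len(slices) < 2 or not set(slices) <= {0, 1, 2}:
        raise ValueError("need at least two slices with indices in {0, 1, 2}")
    if 0 in slices and 1 in slices:
        a, b = slices[0], slices[1]
    elif 0 in slices:
        a = slices[0]
        b = bytes(x ^ y for x, y in zip(a, slices[2]))  # b = a XOR parity
    else:
        b = slices[1]
        a = bytes(x ^ y for x, y in zip(b, slices[2]))  # a = b XOR parity
    return (a + b)[:length]  # drop any padding byte


data = b"The quick brown fox jumped over the lazy dog."
slices, length = make_slices(data)

# Simulate the loss of the first storage node: slices 1 and 2 remain.
recovered = rebuild({1: slices[1], 2: slices[2]}, length)
```

In this sketch the threshold is two of three slices, so the loss of any single storage node leaves the data recoverable.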
The use of such information dispersal algorithms in data storage systems is also described in various publications. For example, “How to Share a Secret”, by A. Shamir, Communications of the ACM, Vol. 22, No. 11, November 1979, describes a scheme for sharing a secret, such as a cryptographic key, based on polynomial interpolation. Another publication, “Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance”, by M. Rabin, Journal of the Association for Computing Machinery, Vol. 36, No. 2, April 1989, pgs. 335-348, also describes a method for information dispersal using an information dispersal algorithm. Unfortunately, these methods and other known information dispersal methods are computationally intensive and are thus not applicable for general storage of large amounts of data using the kinds of computers in broad use by businesses, consumers and other organizations today. Thus, there is a need for a data storage system that reliably and securely protects data without requiring the use of computationally intensive algorithms.
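By way of illustration, secret sharing based on polynomial interpolation, in the manner generally described by Shamir, may be sketched as follows. This is a minimal sketch under stated assumptions, not a reproduction of the published scheme's parameters: the prime modulus, the function names `split_secret` and `recover_secret`, and the use of consecutive integers as share indices are all choices made here for illustration.

```python
import secrets

# Assumption for this sketch: arithmetic is done modulo the Mersenne prime 2**127 - 1.
PRIME = 2**127 - 1


def split_secret(secret: int, threshold: int, num_shares: int) -> list[tuple[int, int]]:
    """Return num_shares points on a random polynomial whose constant term is the secret."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    if not 1 <= threshold <= num_shares:
        raise ValueError("threshold must be between 1 and num_shares")
    # A polynomial of degree threshold - 1 with random higher-order coefficients.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Evaluate the Lagrange interpolating polynomial at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8 or later).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares = split_secret(123456789, threshold=3, num_shares=5)
recovered_secret = recover_secret(shares[:3])  # any three of the five shares suffice
```

Each share requires a polynomial evaluation and each recovery requires modular multiplications and inversions for every value protected, which suggests why such methods are regarded as computationally intensive when applied to large amounts of data.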