1. Technical Field of the Invention
The present invention relates generally to systems, apparatus, and methods for distributed data storage, and more particularly to systems, apparatus, and methods for distributed data storage using an information dispersal algorithm so that no one location will store an entire copy of stored data, and more particularly still to systems, apparatus, and methods for reading data from and writing data to a dispersed data storage network.
2. Description of Related Art
Storing data in digital form is a well-known problem associated with all computer systems, and numerous solutions to this problem are known in the art. The simplest solution involves merely storing digital data in a single location, such as a punch film, hard drive, or FLASH memory device. However, storage of data in a single location is inherently unreliable. The device storing the data can malfunction or be destroyed through natural disasters, such as a flood, or through a malicious act, such as arson. In addition, digital data is generally stored in a usable file, such as a document that can be opened with the appropriate word processing software, or a financial ledger that can be opened with the appropriate spreadsheet software. Storing an entire usable file in a single location is also inherently insecure, as a malicious hacker need only compromise that one location to obtain access to the usable file.
To address reliability concerns, digital data is often “backed-up,” i.e., an additional copy of the digital data is made and maintained in a separate physical location. For example, a backup tape of all network drives may be made by a small office and maintained at the home of a trusted employee. When a backup of digital data exists, the destruction of either the original device holding the digital data or the backup will not compromise the digital data. However, the existence of the backup exacerbates the security problem, as a malicious hacker can choose between two locations from which to obtain the digital data. Further, the site where the backup is stored may be far less secure than the original location of the digital data, such as in the case when an employee stores the tape in her home.
Another method used to address reliability and performance concerns is the use of a Redundant Array of Independent Drives (“RAID”). RAID refers to a collection of data storage schemes that divide and replicate data among multiple storage units. Different configurations of RAID provide increased performance, improved reliability, or both increased performance and improved reliability. In certain configurations of RAID, when digital data is stored, it is split into multiple stripes, each of which is stored on a separate drive. Data striping is performed in an algorithmically certain way so that the data can be reconstructed. While certain RAID configurations can improve reliability, RAID does nothing to address security concerns associated with digital data storage.
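The striping described above can be illustrated with a minimal sketch. It assumes simple round-robin placement with no parity or mirroring; the function names `stripe` and `unstripe`, and the drive count and stripe size used, are hypothetical and chosen only for illustration.

```python
def stripe(data: bytes, drives: int, stripe_size: int) -> list[list[bytes]]:
    """Split data into fixed-size stripes and place them round-robin on drives."""
    layout: list[list[bytes]] = [[] for _ in range(drives)]
    for offset in range(0, len(data), stripe_size):
        stripe_number = offset // stripe_size
        layout[stripe_number % drives].append(data[offset:offset + stripe_size])
    return layout


def unstripe(layout: list[list[bytes]]) -> bytes:
    """Rebuild the data by reading the drives in the same round-robin order."""
    rows = max((len(drive) for drive in layout), default=0)
    chunks = []
    for row in range(rows):
        for drive in layout:
            if row < len(drive):
                chunks.append(drive[row])
    return b"".join(chunks)


layout = stripe(b"ABCDEFGHIJ", drives=3, stripe_size=2)
restored = unstripe(layout)
```

Because placement is deterministic, the reader needs no extra bookkeeping to reassemble the data. Each drive still holds readable fragments of the original, however, which is consistent with the observation that striping alone does not address security.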
One way in which prior art solutions have addressed security concerns is through the use of encryption. Encrypted data is mathematically coded so that only users with access to a certain key can decrypt and use the data. Common forms of encryption include DES, AES, RSA, and others. While modern encryption methods are difficult to break, numerous instances of successful attacks are known, some of which have resulted in valuable data being compromised.
Files are usually organized in file systems, which are software components usually associated with an operating system. Typically, a file system provides means for creating, updating, maintaining, and hierarchically organizing digital data. A file system accepts digital data of arbitrary size, segments the digital data into fixed-size blocks, and maintains a record of precisely where on the physical media data is stored and what file the data is associated with. In addition, file systems provide hierarchical directory structures to better organize numerous files.
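The segmentation and record keeping described above can be sketched as a toy in-memory model. This is an illustrative assumption rather than the layout of any real file system; the class `ToyFileSystem`, its methods, and the 4-byte block size are hypothetical.

```python
BLOCK_SIZE = 4  # hypothetical block size, kept tiny for readability


class ToyFileSystem:
    """Stores files as fixed-size blocks and records which blocks belong to which file."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks: list[bytes | None] = [None] * num_blocks
        # file name -> (indices of the blocks holding the file, true length in bytes)
        self.table: dict[str, tuple[list[int], int]] = {}

    def write(self, name: str, data: bytes) -> None:
        # Release blocks held by a previous version of the file, if any.
        for index in self.table.pop(name, ([], 0))[0]:
            self.blocks[index] = None
        needed = -(-len(data) // BLOCK_SIZE)  # ceiling division
        free = [i for i, block in enumerate(self.blocks) if block is None]
        if needed > len(free):
            raise OSError("not enough free blocks")
        used = free[:needed]
        for n, index in enumerate(used):
            chunk = data[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE]
            # Pad the final block so that every block has the same size.
            self.blocks[index] = chunk.ljust(BLOCK_SIZE, b"\0")
        self.table[name] = (used, len(data))

    def read(self, name: str) -> bytes:
        used, length = self.table[name]
        return b"".join(self.blocks[i] for i in used)[:length]


fs = ToyFileSystem(num_blocks=8)
fs.write("ledger", b"0123456789")
```

The table plays the role of the record of where data is stored and what file it is associated with; the stored length lets the reader discard the padding in the final block.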
Various interfaces to storage devices are also well known in the art. For example, Small Computer System Interface (“SCSI”) is a well-known family of interfaces for connecting and transferring data between computers and peripherals, including storage. There are also a number of standards for transferring data between computers and storage area networks (“SAN”). For example, Fibre Channel is a networking technology that is primarily used to implement SANs. Fibre Channel SANs can be accessed through SCSI interfaces via Fibre Channel Protocol (“FCP”), which effectively bridges Fibre Channel to higher level protocols within SCSI. Internet Small Computer System Interface (“iSCSI”), which allows the use of the SCSI protocol over IP networks, is an alternative to FCP, and has been used to implement lower cost SANs using Ethernet instead of Fibre Channel as the physical connection. Interfaces for both FCP and iSCSI are available for many different operating systems, and both protocols are widely used. The iSCSI standard is described in “Java iSCSI Initiator,” by Volker Wildi, and Internet Engineering Task Force RFC 3720, both of which are hereby incorporated by reference.
In 1979, two researchers independently developed a method for splitting data among multiple recipients called “secret sharing.” One of the characteristics of secret sharing is that a piece of data may be split among n recipients, but cannot be known unless at least t recipients share their data, where n ≥ t. For example, a trivial form of secret sharing can be implemented by assigning a single random byte to every recipient but one, who would receive the actual data byte after it had been bitwise exclusive-ORed with the random bytes. In other words, for a group of four recipients, three of the recipients would be given random bytes, and the fourth would be given a byte calculated by the following formula:
s′ = s ⊕ ra ⊕ rb ⊕ rc,
where s is the original source data, ra, rb, and rc are random bytes given to three of the four recipients, and s′ is the encoded byte given to the fourth recipient. The original byte s can be recovered by bitwise exclusive-ORing all four bytes together.
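The trivial exclusive-OR scheme described above can be sketched as follows. The function names `split_byte` and `recover_byte` are hypothetical, and the use of Python's `secrets` module as the source of random bytes is an assumption of this sketch.

```python
import secrets
from functools import reduce


def split_byte(s: int, n: int = 4) -> list[int]:
    """Split one byte into n shares; every share is needed to recover it."""
    if not 0 <= s <= 255:
        raise ValueError("s must be a single byte (0-255)")
    if n < 2:
        raise ValueError("need at least two recipients")
    # n - 1 recipients receive purely random bytes (ra, rb, rc when n is 4).
    randoms = [secrets.randbelow(256) for _ in range(n - 1)]
    # The last recipient receives s XORed with all of the random bytes.
    encoded = reduce(lambda acc, r: acc ^ r, randoms, s)
    return randoms + [encoded]


def recover_byte(shares: list[int]) -> int:
    """XOR all shares together; the random bytes cancel and s remains."""
    return reduce(lambda acc, share: acc ^ share, shares, 0)


shares = split_byte(0x5A)
recovered = recover_byte(shares)
```

Recovery works because each random byte appears exactly twice in the combined exclusive-OR and cancels itself. In this trivial form t equals n, so the loss of any single share makes the byte unrecoverable.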
The problem of reconstructing data stored on a digital medium that is subject to damage has also been addressed in the prior art. In particular, Reed-Solomon and Cauchy Reed-Solomon coding are two well-known methods of dividing encoded information into multiple slices so that the original information can be reassembled even if all of the slices are not available. Reed-Solomon coding, Cauchy Reed-Solomon coding, and other data coding techniques are described in “Erasure Codes for Storage Applications,” by Dr. James S. Plank, which is hereby incorporated by reference.
Traditional disk-oriented file systems offer the ability to store and retrieve user-visible files, directories and their metadata. In addition to this data, and transparent to the file system user, is the file system metadata, which comprises various elements of concern to the file system itself or its immediate execution context of the operating system kernel. File system metadata (often called the superblock in UNIX parlance) is composed of such things as the magic number identifying the file system, vital numbers describing geometry, statistics, and behavioral tuning parameters, and a pointer to the tree's root. This has various implications, the most crucial of which is that a file system cannot “bootstrap” itself, or bring itself online, if the superblock were ever to become corrupt.
Schemes for implementing dispersed data storage networks (“DDSNs”), which are also known as dispersed data storage grids, are also known in the art. In particular, U.S. Pat. No. 5,485,474, issued to Michael O. Rabin, describes a system for splitting a segment of digital information into n data slices, which are stored in separate devices. When the data segment must be retrieved, only m of the original data slices are required to reconstruct the data segment, where n>m.
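The property that only m of n slices are needed can be illustrated with a deliberately simplified stand-in. The sketch below is not the algorithm of the cited patent; it assumes n = 3 and m = 2 and uses a single exclusive-OR parity slice. The function names `disperse` and `reconstruct` are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def disperse(segment: bytes) -> dict[int, bytes]:
    """Produce n = 3 slices: two halves of the segment and their XOR parity."""
    if len(segment) % 2:
        raise ValueError("this sketch assumes an even-length segment")
    half = len(segment) // 2
    first, second = segment[:half], segment[half:]
    return {0: first, 1: second, 2: xor_bytes(first, second)}


def reconstruct(slices: dict[int, bytes]) -> bytes:
    """Rebuild the segment from any m = 2 of the three slices."""
    if len(slices) < 2:
        raise ValueError("need at least m = 2 slices")
    if 0 in slices and 1 in slices:
        first, second = slices[0], slices[1]
    elif 0 in slices:
        first = slices[0]
        second = xor_bytes(first, slices[2])  # second = first XOR parity
    else:
        second = slices[1]
        first = xor_bytes(second, slices[2])  # first = second XOR parity
    return first + second


slices = disperse(b"dispersed!")
without_first = {k: v for k, v in slices.items() if k != 0}
rebuilt = reconstruct(without_first)
```

Each slice here would be stored on a separate device, and the loss of any one device leaves the segment recoverable. Unlike a general dispersal algorithm, this stand-in supports only a single fixed choice of n and m.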
In October of 2007, Cleversafe, Inc. of Chicago, Ill. released the first dispersed data storage network to incorporate an iSCSI device interface. This release allowed users of non-specialized computer systems, such as Windows Workstations, Macintosh Workstations, and Linux Workstations, to make use of dispersed data storage network technology. This constituted a significant step forward in making dispersed data storage technology widely accessible. However, subsequent use and testing have revealed that additional improvements could be made in providing better performing and more accessible dispersed data storage technology.
Most importantly, users of Cleversafe's iSCSI interface must rely on a standard file system, such as FAT, NTFS, HFS+, or ext3, to store their files. These file systems were designed to work with localized storage solutions, and accordingly, do not account for the specific requirements of file storage and retrieval on a dispersed data storage network. For example, file systems store not only files, but also information about the files called metadata. One type of metadata is used to establish an organization for the files known as a directory structure. Most file systems require that each access traverse the directory structure from the root node to the particular file to be accessed. With a localized storage solution, the performance impact of such a traversal is minimal.