1. Field of the Invention
The present invention generally relates to the secure storage and retrieval of information and, more particularly, to a method and apparatus that guarantee the integrity and confidentiality of the stored information.
2. Description of the Prior Art
The problem addressed by this invention is the secure storage and retrieval of information. Consider a user who stores files on a workstation. Random failures (such as a hard disk crash) could cause the loss or temporary unavailability of the data. Malicious intrusions may also occur, which would compromise both the confidentiality and the integrity of the data. Ideally, the user would like a fully secure system that protects against these and possibly other kinds of faults without overburdening the system with memory and computational requirements.
Typically, protection against random failures is obtained via replication. That is, the data is stored in multiple locations so that failures in some of them can be tolerated. One such example is the Redundant Array of Inexpensive Disks (RAID) standard commonly used on servers in a Local Area Network (LAN). Obtaining a significant degree of protection in this way carries a high cost in memory requirements.
The notion of information dispersal was introduced by M. Rabin in his well-known Information Dispersal Algorithm (IDA) described in "Efficient Dispersal of Information for Security, Load Balancing, and Fault Tolerance", Journal of the ACM, Vol. 36(2), pp. 335-348, 1989. The basic approach taken in IDA is to distribute the information F being stored among n active processors in such a way that the retrieval of F is possible even in the presence of up to t failed (inactive) processors. The salient point was to achieve this goal while incurring a small overhead in needed memory. In fact, Rabin's result is space optimal: retrieval of F is possible out of n-t pieces, where each piece is of length $|F|/(n-t)$.
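As a worked illustration of this space bound, take n = 5 and t = 2 (values assumed here so that they match the m = 3, n = 5 example used in this section). The piece size and the total storage then work out as follows:

```latex
\[
  |F_i| = \frac{|F|}{n-t} = \frac{|F|}{3},
  \qquad
  \sum_{i=1}^{n} |F_i| = \frac{n}{n-t}\,|F| = \frac{5}{3}\,|F| \approx 1.67\,|F| .
\]
```

By comparison, tolerating the same t = 2 failures through plain replication requires t + 1 = 3 full copies, that is, a total of 3|F|.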
The Information Dispersal Algorithm uses a linear transformation to convert m = n-t bytes of input into n bytes of output. This transformation is given by an m x n matrix T over $GF(2^8)$. Moreover, the matrix T has the property that every (n-t) columns of T are linearly independent. Each input and output byte is viewed as an element of $GF(2^8)$. The block size is m bytes and the operation is repeated for every m bytes.
Let the (i,j)-th entry of T be represented by $T_{i,j}$. Let $P_0, P_1, \ldots, P_{m-1}$ be a block of input. Then the output bytes $Q_0, Q_1, \ldots, Q_{n-1}$ are given by

\[
  Q_i = T_{0,i} \cdot P_0 + T_{1,i} \cdot P_1 + \cdots + T_{m-1,i} \cdot P_{m-1},
\]

where the arithmetic is performed in the field $GF(2^8)$.
Given any m output bytes, the input can be recovered because every m columns of T are linearly independent. In other words, the matrix S formed by taking the columns of T which correspond to these m output bytes is invertible. Again, the inverse of this matrix is computed over $GF(2^8)$.
As an example, let m=3 and n=5. The following matrix T has the property that every three columns of T are linearly independent. Note that we are using polynomials in x for representing elements of $GF(2^8)$. The polynomial arithmetic can be done modulo $x^8 + x^6 + x^5 + x^4 + 1$, which is an irreducible polynomial over GF(2). ##EQU2##
If, for example, only the first, second and fifth byte of a coded text are known, the plaintext (or original text) can be retrieved by applying the following transformation to the three bytes of coded text: ##EQU3##
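The dispersal and recovery steps described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the matrix or implementation of the original text: the field is $GF(2^8)$ modulo $x^8 + x^6 + x^5 + x^4 + 1$ as stated above, but the matrix returned by the hypothetical helper `make_T` is a Vandermonde matrix chosen here only because any m of its columns are linearly independent. All function names are illustrative.

```python
# Illustrative sketch: IDA-style dispersal over GF(2^8).
# ASSUMPTION: T is a Vandermonde matrix, not the matrix of the original text.
MOD = 0x171  # bit pattern of x^8 + x^6 + x^5 + x^4 + 1


def gf_mul(a, b):
    """Multiply two field elements (shift-and-add with modular reduction)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= MOD
        b >>= 1
    return r


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def gf_inv(a):
    """Inverse of a nonzero element, using a^255 = 1, so a^254 = a^-1."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    return gf_pow(a, 254)


def make_T(m, n):
    """Assumed m x n Vandermonde matrix with T[i][j] = (j+1)^i."""
    return [[gf_pow(j + 1, i) for j in range(n)] for i in range(m)]


def encode(block, T):
    """Q_i = T[0][i]*P_0 + ... + T[m-1][i]*P_{m-1}; addition is XOR."""
    m, n = len(T), len(T[0])
    out = []
    for i in range(n):
        q = 0
        for k in range(m):
            q ^= gf_mul(T[k][i], block[k])
        out.append(q)
    return out


def decode(pieces, T):
    """Recover the m input bytes from a dict {output index: byte} of size m."""
    m = len(T)
    cols = sorted(pieces)
    # One equation per known output byte, as an augmented row [coeffs | Q_c].
    A = [[T[k][c] for k in range(m)] + [pieces[c]] for c in cols]
    for col in range(m):  # Gauss-Jordan elimination over GF(2^8)
        piv = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        inv = gf_inv(A[col][col])
        A[col] = [gf_mul(inv, v) for v in A[col]]
        for r in range(m):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [v ^ gf_mul(f, w) for v, w in zip(A[r], A[col])]
    return [A[r][m] for r in range(m)]


T = make_T(3, 5)
block = [0x41, 0x42, 0x43]
coded = encode(block, T)
# Recover from the first, second and fifth coded bytes only.
recovered = decode({0: coded[0], 1: coded[1], 4: coded[4]}, T)
```

Solving the linear system directly, rather than precomputing the inverse of S, keeps the sketch short; a practical implementation would likely cache the inverse for each set of surviving pieces and use table-based field multiplication.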
In addition to its optimal space complexity, the IDA technique has the following very attractive properties:
- it permits any party in the system to retrieve the distributed information (by communicating with the piece holders);
- it does not require a central authority;
- it is symmetric with respect to all participants; and
- no secret cryptographic keys are involved.

- distributed implementation of the storing device,
- tolerance of faults (inactive or maliciously active) during the process of storing and retrieval of the information,
- tolerance of faults as above, where all servers can be faulty during the lifetime of the system but only up to t servers can be faulty during each time interval (herein referred to as proactive SSRI),
- transparency of the distributed implementation from the user's point of view, and
- space optimality.

- Electronic Vault. A robust distributed repository (a.k.a. E-Vault, strong box, safety box, secure back-up, secure archive) of users' information.
- A mechanism for the delivery and distribution of files in a communication network robust against malicious failures and break-ins.
- Regular and anonymous electronic P.O. Boxes with the same robustness and resiliency properties.
- Secure distributed file system. We view the SSRI as implemented at the application Layer. However, the concepts described above can be broadened to apply to a distributed file system, with a richer functionality and security properties over Sun's Network File System (NFS) and the DCE-based Distributed File System (DFS).
However, this combination of very desirable properties is achieved at the expense of limiting the kind of faults against which the algorithm is robust, namely, by assuming that available pieces are always unmodified.
An enhanced mechanism to reconstruct the information when more general faults occur was presented by H. Krawczyk, in "Distributed Fingerprints and Secure Information Dispersal", Proc. 20th Annual ACM Symp. on Principles of Distributed Computing, pp. 207-218, Ithaca, N.Y., 1993, who called this problem, and its solution, the Secure Information Dispersal problem/algorithm (SIDA). This mechanism is able to tolerate malicious parties that can intentionally modify their shares of the information, and is also space optimal (asymptotically). In a nutshell, SIDA makes use of a cryptographic tool called distributed fingerprints: each processor's share is hashed to produce its fingerprint, and the fingerprints are then distributed among all processors using the coding function of an error correcting code that is able to reconstruct from altered pieces (e.g., the Reed-Solomon code). In this way, the correct processors are able to reconstruct the fingerprints using the code's decoding function, check whether pieces of the file were correctly returned, and finally reconstruct F from the correct pieces using the IDA algorithm.
A shortcoming of these methods is that they assume that faults occur only at reconstruction time, after the dispersal of the shares has been properly done.
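The fingerprint check described above can be illustrated with a simplified sketch. Two assumptions are made here that are not in the text: SHA-256 stands in for the unspecified hash function, and the fingerprint list is simply replicated to every processor and recovered by majority vote, as a stand-in for the error correcting code (such as Reed-Solomon) that SIDA actually uses. Replication is not space optimal, so this shows only the verification logic, not SIDA's space efficiency. All function names are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(piece: bytes) -> bytes:
    """Hash one share. ASSUMPTION: SHA-256; the text names no hash function."""
    return hashlib.sha256(piece).digest()


def disperse_fingerprints(pieces):
    """Give each processor a copy of the full fingerprint list.

    ASSUMPTION: plain replication replaces the error correcting code.
    """
    fps = tuple(fingerprint(p) for p in pieces)
    return [fps for _ in pieces]


def recover_fingerprints(copies):
    """Majority vote over the (possibly altered) copies held by processors."""
    winner, _count = Counter(copies).most_common(1)[0]
    return winner


def filter_pieces(returned, fps):
    """Keep only the returned pieces whose hash matches their fingerprint."""
    return {i: p for i, p in returned.items() if fingerprint(p) == fps[i]}


pieces = [b"piece-0", b"piece-1", b"piece-2", b"piece-3", b"piece-4"]
copies = disperse_fingerprints(pieces)

# One processor alters its fingerprint copy, another alters its piece.
copies[1] = tuple(b"\x00" * 32 for _ in pieces)
returned = dict(enumerate(pieces))
returned[3] = b"tampered"

fps = recover_fingerprints(copies)
good = filter_pieces(returned, fps)  # pieces that pass the fingerprint check
```

The surviving entries of `good` are the pieces that would then be passed to the IDA reconstruction step, provided at least n-t of them remain.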