1. Field of Invention
The present invention relates to the field of storage network security, and more particularly, to a Provable Data Integrity (PDI) verifying method, apparatuses and system.
2. Description of Prior Art
The Internet is fast evolving toward outsourcing data from one's local storage to global-scale persistent storage services. The Amazon Simple Storage Service (Amazon S3) (Reference 1: Amazon Simple Storage Service (Amazon S3), http://aws.amazon.com/s3) is one such storage system for the Internet. Amazon S3 provides a web services interface that can be used to store and retrieve data. The service of Amazon S3 is global-scale and business-class, while its pricing is quite reasonable: US $0.15 per GB/month of storage used, US $0.10 per GB for all data transfer in, and US $0.18 per GB for the first 10 TB/month of data transfer out. For those seeking free global-scale storage services, such services exist as well. MediaMax (Reference 2: MediaMax Free Online Storage, http://www.mediamax.com) provides 25 GB of free online storage, and the Gmail File System (Reference 3: Gmail Drive Shell Extension, http://www.viksoe.dk/code/gmail.htm) project has translated free Gmail accounts into one consistent free network storage space.
With these public storage space services, a client can discard its local storage subsystem and retrieve data at any time and from anywhere over the Internet. This prospect has attracted considerable industry effort, and such effort has made storage outsourcing an inevitable trend.
The IETF Network WG captured the trend, and RFC 4810 “Long-Term Archive Service Requirements” was thus released (Reference 4: RFC 4810, Long-Term Archive Service Requirements, IETF Network WG. http://www.ietf.org/rfc/rfc4810.txt). RFC 4810 describes requirements on a long-term archive service that is responsible for preserving data for long periods. Supporting non-repudiation of data existence, integrity, and origin is a primary purpose of a long-term archive service. As RFC 4810 states, a long-term archive service must be capable of providing evidence that can be used to demonstrate the integrity of data for which it is responsible, from the time it received the data until the expiration of the archival period of the data.
Outsourcing data from client storage to an archive service has two fundamental steps: one is submitting data and the other is retrieving data. The naïve solution to verify data integrity involves retrieving the data from the archive. However, provision of high bandwidth from a remote archive to the client verifier is impractical at present and in the near future as well. In particular, it is hard for a mobile client to enjoy a high-bandwidth connection. Moreover, as RFC 4810 states, it may be a third-party verifier that checks the integrity of user data. In such a case, the third-party verifier should not have access to the user data; otherwise it may violate the user's data privacy. In order to verify data integrity while avoiding retrieval of the data from the archive, prior work adopts an operation model of three steps, as illustrated in FIG. 1. Notice that for notational simplicity (and without loss of generality), hereafter we mostly take as an example the case in which the client, i.e. the data owner, is the verifier of user data integrity. But as discussed before, the verifier in practice may be a third party other than the data owner.
At step 0, digital fingerprints of the data are generated by the client and are sent to the archive along with the data. The archive needs to store the fingerprints of the data in addition to the data itself. At step 1, the client sends a challenge on data integrity to the archive. The archive then uses the content of the data, the fingerprints of the data, and the client challenge together to compute a data integrity proof, which is returned to the client for verification at step 2. Step 1 and step 2 may be repeated multiple times until the expiration of the archival period of the data.
Based on the above operation model, below is a list of the key factors that should be considered by any technical solution to the provable data integrity issue.
(I). The time it takes for the client to generate the fingerprint of data
(II). The size of the archive storage that the fingerprint of data consumes
(III). The size of the challenge that the verifier sends to the archive
(IV). The time for the archive to compute the data integrity proof
(V). The size of the data integrity proof that the archive sends to the verifier
(VI). The time for the verifier to check the data integrity proof
There is a simple solution that seemingly tackles the data integrity issue. Initially, the data owner divides the data into multiple fragments and pre-computes a message authentication code (MAC) for each fragment. Whenever a verifier, whether the data owner or a third party, needs a data integrity proof, it retrieves a number of randomly selected fragments from the archive service and re-computes the MAC of each fragment for comparison.
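The fragment-and-MAC approach described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: HMAC-SHA256 as the MAC, a 4096-byte fragment size, binding of the fragment index into the MAC, and the names `split_fragments`, `fragment_mac`, `precompute_macs` and `spot_check` are all hypothetical choices that do not come from the cited work. The sketch also assumes the verifier keeps the key and the pre-computed MACs.

```python
import hashlib
import hmac
import secrets

FRAGMENT_SIZE = 4096  # bytes; assumed for illustration only


def split_fragments(data: bytes, size: int = FRAGMENT_SIZE) -> list:
    """Divide the data into fixed-size fragments (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fragment_mac(key: bytes, index: int, fragment: bytes) -> bytes:
    """MAC of one fragment; the index is included so fragments cannot be swapped."""
    message = index.to_bytes(8, "big") + fragment
    return hmac.new(key, message, hashlib.sha256).digest()


def precompute_macs(key: bytes, data: bytes) -> list:
    """Owner side: one MAC per fragment, computed before the data is outsourced."""
    return [fragment_mac(key, i, f) for i, f in enumerate(split_fragments(data))]


def spot_check(key: bytes, macs: list, fetch_fragment, sample_size: int) -> bool:
    """Verifier side: retrieve randomly selected fragments and compare MACs.

    `fetch_fragment(i)` stands in for a download of fragment i from the archive.
    """
    rng = secrets.SystemRandom()
    indices = rng.sample(range(len(macs)), min(sample_size, len(macs)))
    return all(
        hmac.compare_digest(fragment_mac(key, i, fetch_fragment(i)), macs[i])
        for i in indices
    )
```

Note that every checked fragment travels over the network in full, so the traffic grows with the amount of data queried, and the verifier sees the fragment contents.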
Deswarte et al. (Reference 5: Y. Deswarte, J. J. Quisquater, A. Saidane, Remote integrity checking, In Proc. of Conference on Integrity and Internal Control in Information Systems (IICIS'03), 2003) and Filho et al. (Reference 6: D. L. G. Filho, P. S. L. M. Barreto. Demonstrating Data Possession and Uncheatable Data Transfer. http://eprint.iacr.org/2006/150.pdf) proposed to verify that an archive correctly stores a file using an RSA-based hash function.
Most recently, Ateniese et al. (Reference 7: G. Ateniese, R. Burns, R. Curtmola, J. Herring, L. Kissner, Z. Peterson, D. Song, Provable Data Possession at Untrusted Stores. http://eprint.iacr.org/2007/202.pdf) proposed an RSA-based provable data possession scheme, the S-PDP scheme, where “S” stands for “sampling”. Sampling means that the client selects a portion of the data at random and asks the archive to provide evidence that the randomly selected data is in a healthy state, i.e. that the integrity of the selected data holds. The S-PDP scheme does not require exponentiating the entire file, and its communication complexity is constant, which makes the S-PDP scheme the most efficient of all the prior art.
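The effect of sampling can be quantified with a standard approximation. The sketch below is not taken from Reference 7; it assumes that each sampled block is drawn independently and that a fixed fraction of the blocks is damaged, so that the probability of hitting at least one damaged block among c samples is 1 - (1 - f)^c. The function names and the example figures are hypothetical.

```python
import math


def detection_probability(corrupt_fraction: float, samples: int) -> float:
    """Probability that at least one of `samples` random blocks is damaged,
    assuming independent draws (sampling with replacement)."""
    return 1.0 - (1.0 - corrupt_fraction) ** samples


def samples_needed(corrupt_fraction: float, target: float) -> int:
    """Smallest number of samples whose detection probability reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - corrupt_fraction))


if __name__ == "__main__":
    # 1% of the blocks damaged, 99% detection probability wanted.
    c = samples_needed(0.01, 0.99)
    print(c, detection_probability(0.01, c))
    # Same number of samples when only 1 block in 10,000 is damaged.
    print(detection_probability(1 / 10_000, c))
```

Under this approximation the number of samples depends only on the damaged fraction and the target probability, not on the file size, which is why the challenge and the proof can stay small.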
The naïve solution has a drawback in that its communication complexity is linear with respect to the queried data size. Moreover, in the case of a third-party verifier, sending user data to the verifier is unacceptable because it violates the data owner's privacy. To avoid retrieving data from the archive service, one may improve the simple solution by choosing multiple secret keys and pre-computing multiple keyed-hash MACs for the data. The verifier can then reveal one secret key to the archive service each time and ask for a fresh keyed-hash MAC for comparison. However, the number of times particular data can be verified in this way is limited by the number of secret keys, which has to be fixed a priori. When the keys are exhausted, retrieving the data from the archive service is inevitable in order to compute new MACs.
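The multiple-key improvement and its limitation can be sketched as follows. This is a minimal illustration, assuming HMAC-SHA256 as the keyed-hash MAC and 32-byte random keys; `precompute_tokens`, `archive_respond` and `Verifier` are hypothetical names introduced only for this example.

```python
import hashlib
import hmac
import secrets


def precompute_tokens(data: bytes, count: int) -> list:
    """Owner side: choose `count` secret keys and pre-compute one MAC per key."""
    tokens = []
    for _ in range(count):
        key = secrets.token_bytes(32)
        tokens.append((key, hmac.new(key, data, hashlib.sha256).digest()))
    return tokens


def archive_respond(stored_data: bytes, revealed_key: bytes) -> bytes:
    """Archive side: compute a fresh keyed-hash MAC over the stored data."""
    return hmac.new(revealed_key, stored_data, hashlib.sha256).digest()


class Verifier:
    """Holds the pre-computed (key, MAC) pairs; each pair is usable only once."""

    def __init__(self, tokens):
        self._tokens = list(tokens)

    def remaining(self) -> int:
        return len(self._tokens)

    def verify(self, respond) -> bool:
        """Reveal one unused key and compare the archive's MAC with the stored one."""
        if not self._tokens:
            raise RuntimeError("keys exhausted: the data must be retrieved to compute new MACs")
        key, expected = self._tokens.pop()
        return hmac.compare_digest(respond(key), expected)
```

Each call to `verify` consumes one key, because a revealed key lets the archive pre-compute and cache the matching MAC. After `count` verifications the verifier has nothing left to challenge with.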
The proposals of References 5 and 6 have a drawback in that the archive has to exponentiate the entire file. For reference, given a 2048-bit RSA modulus, one full-exponent exponentiation takes 61.325 milliseconds on an Intel Core Duo 2.16 GHz processor. Exponentiation would therefore take 251.3 seconds per megabyte, which means that to test the integrity of a 64 MB file, the archive has to spend 16083.8 seconds before the client can receive the data integrity proof.
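The order of magnitude of these figures can be checked with a back-of-the-envelope calculation. The sketch below rests on assumptions that the text does not state: that exponentiating the file costs about one full-exponent exponentiation per 2048 bits of file data, and that one megabyte is 2^20 bytes. Under these assumptions the result lands within about 0.1% of the quoted values; the small gap presumably reflects details of the original measurement.

```python
EXP_MS = 61.325        # one full-exponent exponentiation, from the text
MODULUS_BITS = 2048    # RSA modulus size, from the text
MEGABYTE = 2 ** 20     # bytes per megabyte (assumption)


def exponentiation_seconds(file_bytes: int) -> float:
    """Estimated time to exponentiate a file, assuming one full
    exponentiation per MODULUS_BITS bits of file data."""
    exponentiations = file_bytes * 8 / MODULUS_BITS
    return exponentiations * EXP_MS / 1000.0


if __name__ == "__main__":
    print(exponentiation_seconds(MEGABYTE))        # roughly 251.2 seconds
    print(exponentiation_seconds(64 * MEGABYTE))   # roughly 16076 seconds, about 4.5 hours
```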
The S-PDP scheme has one problem in that its design goal, i.e. sampling, may sometimes be meaningless to the data owner. By sampling, the S-PDP scheme tries to tolerate file block failure at a seemingly high detection probability. For example, Reference 7 discusses how to reach a detection probability of 99% in the case of 1% file block failure. However, there are many types of files that cannot withstand even a one-bit error. For example, loss of the head of a media file, where the codec configuration parameters reside, will cause difficulty in rendering. As another example, damage to the (public key encrypted) symmetric encryption key that is embedded in an encrypted file results in garbage ciphertext from which no one can recover the plaintext. In general, what the data owner demands is 100% data safety, with no compromise for any reason.
The S-PDP scheme has another problem in that it is extremely inefficient when adopted by a third-party verification (or so-called public verifiability) system. In order to be publicly verifiable, the S-PDP scheme mandates that each file block must be smaller than the RSA public key e. Take a 2048-bit RSA modulus as an example. The public key e can be at most 1024 bits. Hence, a solution according to the publicly verifiable S-PDP scheme has to logically divide a file into multiple 1024-bit file blocks. The consequence is a huge number of file blocks, for each of which a tag must be generated. In other words, the tags together are twice the size of the file itself, and the time it costs the client to tag a file is too long to be practical.
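The storage overhead stated above follows from simple arithmetic, sketched below. The sketch assumes that each tag is as long as the RSA modulus (2048 bits), which is an assumption consistent with the stated two-to-one ratio but not spelled out in the text; the function name `tag_overhead` and the 64 MB example file are hypothetical.

```python
def tag_overhead(file_bytes: int, block_bits: int = 1024, tag_bits: int = 2048):
    """Return (number of blocks, total tag size in bytes) for a file that is
    divided into `block_bits` blocks with one `tag_bits` tag per block."""
    file_bits = file_bytes * 8
    blocks = -(-file_bits // block_bits)   # ceiling division
    tag_bytes = blocks * tag_bits // 8
    return blocks, tag_bytes


if __name__ == "__main__":
    size = 64 * 2 ** 20                    # a 64 MB file
    blocks, tag_bytes = tag_overhead(size)
    print(blocks)                          # 524288 blocks, hence 524288 tags
    print(tag_bytes / size)                # 2.0, i.e. tags are twice the file size
```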