In recent times, there has been an increase in the use of massive on-line storage and backup systems, which allow (generally many) users to store files in a service provided on the Internet. An example of such a system is the LifeCache Digital Vault product from NewBay Software Ltd. (www.newbay.com). Such storage system products can improve storage efficiency via so-called “de-duplication.” Where many users use such a service for backup purposes, there will often be many copies of the same file stored, for example system files or popular music files. De-duplication means having the service store only a single copy (or a few copies) of such files, thus consuming less raw storage. De-duplication is a well-known feature of such storage systems.
As an additional aspect of de-duplication, some storage systems can detect that a user is attempting to upload a file that is a duplicate, and indicate to the user's client software that that upload is unnecessary, since a copy of the file is already present in the overall store. This improves the bandwidth efficiency of the service and the service's responsiveness to the user, since only one full copy of the file needs to be uploaded to the service. For example, Douceur, J. et al. (“Reclaiming space from duplicate files in a serverless distributed file system,” 22nd International Conference on Distributed Computing Systems, pp. 617-624, July 2002) describes a distributed file system called Farsite that automatically detects duplication and stores only a single version of a file.
One mechanism that achieves this is to have the user's client software compute a hash of the file content using a cryptographic hash algorithm, such as SHA-256, and for the client software to send the hash value (which is a short, fixed-length value) to the service. The service can then check if any other file with the same hash value is already stored and, if so, the upload can be avoided and the service can simply note that that user also has a copy of the file in question. The service will associate various pieces of user meta-data with the stored file, but does not require the user to upload the actual file content a second time, nor will the service have to store the file a second time.
However, this creates a security vulnerability—if an attacker can guess a hash value, or if an actual hash value becomes known to the attacker, then the above process would allow the attacker to pretend to upload the file, in which case the service would associate that file content with the attacker, and subsequently allow the attacker to download a copy of the file content. The net effect is that the attacker would have gained access to the file content, thanks to the de-duplication scheme, even though the attacker never actually had a copy of the file. The attacker would have essentially stolen the file content.
With the selection of a proper hash algorithm, there is no realistic probability of simply guessing the relevant hash value. However, hash functions that were previously considered cryptographically strong (e.g. MD5) have been broken by cryptanalysis in various ways, so a system using such a weak hash function for de-duplication could be vulnerable to guessed or colliding hash values. Of course, as cryptographic techniques improve over time, what was once considered a secure hash function may become insecure.
Even with what is currently considered a good hash function, such as SHA-256, actual hash values could leak out of the system, whether via users colluding with one another, via operator error, or via misbehaviour by operator staff. In that case the attacker does have the correct hash value (but not the file content), and without an appropriate challenge-response scheme the attacker could again steal the file content.
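A minimal sketch of the kind of challenge-response check alluded to above, assuming a simple nonce-plus-content hash (the function names are illustrative and this is not a description of any particular scheme): because each challenge uses a fresh nonce, an attacker holding only a leaked content hash, but not the content itself, cannot produce a valid response.

```python
import hashlib
import os

def make_challenge() -> bytes:
    """Server side: generate a fresh random nonce for each upload attempt."""
    return os.urandom(16)

def prove_possession(nonce: bytes, content: bytes) -> str:
    """Client side: hash the nonce together with the full file content.
    A leaked hash of the content alone is not enough to compute this."""
    return hashlib.sha256(nonce + content).hexdigest()

def verify(nonce: bytes, content: bytes, response: str) -> bool:
    """Server side: recompute the expected response from the stored copy
    and compare it with what the client sent."""
    return prove_possession(nonce, content) == response
```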
Examples of known challenge-response schemes can be found in the following:
Kaufman, Perlman & Speciner, “Network Security—PRIVATE Communication in a PUBLIC World”, Second edition, Prentice Hall, 2002, ISBN-13: 9780130460196.
Lee & Yeh, “A Self-Concealing Mechanism for Authentication of Portable Communication Systems”, International Journal of Network Security, Vol. 6, No. 3, pp. 285-290, May 2008.
However, the above schemes involve the use of shared secrets such as passwords, or a so-called ‘warrant’ (which is akin to a Kerberos ticket-granting-ticket) that is used to authenticate the user.
In addition, there is a related problem with efficient file downloads in a more general context. For example, when downloading a file using the HTTP protocol, if the file content is in any way sensitive, then clients typically have to authenticate and be authorized for access to the file. That process can be relatively resource-intensive in large-scale applications. In cases where the subsequent download operation is interrupted, the HTTP protocol supports the client requesting that the download be resumed at the relevant point (more generally, the client can request that only certain byte-ranges be downloaded), but such requests generally have to incur all the authentication and authorization overhead of the initial request, which imposes a burden on large-scale Internet services. For example, this can prevent making best use of load-balancing techniques or content delivery networks.
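For illustration, a resumed download over HTTP uses the standard `Range` request header and `Content-Range` response header (RFC 7233). The helper names below are hypothetical; they only show how the byte-range headers involved in a resume are formed and read.

```python
def resume_headers(bytes_received: int) -> dict:
    """Build the request header asking for the remainder of a file,
    starting at the first byte the client does not yet have."""
    return {"Range": f"bytes={bytes_received}-"}

def parse_content_range(header: str) -> tuple:
    """Parse a 'Content-Range: bytes start-end/total' response value
    into (start, end, total) integers."""
    unit, _, rng = header.partition(" ")
    span, _, total = rng.partition("/")
    start, _, end = span.partition("-")
    return int(start), int(end), int(total)
```

A client that has already received 1024 bytes would send `Range: bytes=1024-`; the server's `206 Partial Content` response then carries a `Content-Range` header identifying which bytes were returned. The point made above is that each such follow-up request ordinarily repeats the full authentication and authorization steps.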
As a result, it is an object of the invention to provide a validation scheme which validates that a user actually has a copy of the file content during the upload process, while still gaining the benefits of the de-duplication scheme. It is a further object of the invention to provide a scheme for block-level resume operations with less overhead than applying full authentication and authorisation.