Privacy Preserving Storage
When data are outsourced by a client to a server for storage, it is often desirable to “hide” the individual data entries from the server in a secure manner. In other words, the server is treated as an untrusted third party, and the data are not revealed to it.
The reason for this is to preserve the privacy of client information and to prevent the server from gaining access to sensitive information about the processes used to acquire and generate the data. For these reasons, the data are often modified in a secure manner before being outsourced to the server.
Another reason for “hiding” the individual data entries is to allow the server, or any untrusted party accessing the server, to compute aggregate measures (such as mean, variance, or other moments) from the data without revealing individual data entries. In this way, remote “global” data analysis can be enabled while preserving the privacy of sensitive data.
Random Number Generation
One information-theoretically secure way to hide the data is to add random values drawn from a probability distribution to the data. For numeric data, for example, privacy can be obtained by masking the data using numbers sampled from a uniform distribution. The numbers can be sampled using a Cryptographically Secure Pseudorandom Number Generator (CS-PRNG). The CS-PRNG uses a seed to generate a pseudorandom sequence of bits, which, in turn, can be used to generate numbers from a desired probability distribution. Typically, the numbers are integers.
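The masking step described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the modulus `MODULUS`, the use of modular (wrap-around) addition, and the function names `mask` and `unmask` are assumptions that do not appear in the text. Python's `secrets` module is used as the CS-PRNG; it draws from the operating system's generator rather than from an explicit seed.

```python
import secrets

# Assumed size of the integer domain; the text does not specify one.
MODULUS = 2**32


def mask(value: int, modulus: int = MODULUS) -> tuple[int, int]:
    """Return (masked_value, mask), where mask is uniform on [0, modulus)."""
    if not 0 <= value < modulus:
        raise ValueError("value lies outside the assumed integer domain")
    r = secrets.randbelow(modulus)  # uniform integer from the OS CS-PRNG
    # Modular addition keeps the result in the same domain as the input.
    return (value + r) % modulus, r


def unmask(masked: int, r: int, modulus: int = MODULUS) -> int:
    """Remove the mask r from a masked value."""
    return (masked - r) % modulus


if __name__ == "__main__":
    masked, r = mask(12345)
    print(masked, unmask(masked, r))
```

Because the mask is uniform over the whole domain and is combined by modular addition, the masked value is itself uniformly distributed and, taken alone, carries no information about the original entry.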
A CS-PRNG is preferred over an ordinary PRNG because of two beneficial properties.
(1) Given the bits output so far by the CS-PRNG, no polynomial-time algorithm can predict the next bit with probability non-negligibly greater than 0.5.
(2) If the state of the CS-PRNG is compromised at any time, it is infeasible to reconstruct the sequence of bits generated before that time.
Aggregate Statistics
Even though the data are hidden from the server, it is often beneficial to enable the server to determine aggregate statistics that provide summary information about portions of the data. For example, it may be desired to determine a number of pages printed on a given printer on a given day. As another example, it may be desired to determine a total number of “trades” that are performed by a trader during a given time interval.
Common aggregate statistics include sum, weighted sum, average, weighted average, higher moments, weighted higher moments, etc. Techniques such as randomized response hide individual data but allow determination of estimates of aggregate statistics on the data. Randomized response is one method that allows respondents to respond to sensitive issues while maintaining confidentiality of the response.
A straightforward method for implementing randomized response is to additively mask individual data entries using random numbers before transmitting the data entries to the server. By carefully tuning the distribution of the masking values, the client enables the storage server to determine an estimate of the aggregate statistic. However, enabling the server to compute the exact aggregate statistic while hiding individual data entries is very difficult, especially when the client side can perform only limited or no buffering of the data as the data are produced.
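As a rough illustration of additive masking that still permits an aggregate estimate, the sketch below adds zero-mean noise to each entry on the client and lets the server average the masked values. The noise distribution (uniform integers on [-bound, bound]) and the names `mask_entry` and `estimate_mean` are assumptions chosen for the example; the text does not prescribe a particular distribution.

```python
import secrets
from statistics import fmean

# SystemRandom draws from the operating system's CS-PRNG.
_rng = secrets.SystemRandom()


def mask_entry(value: int, bound: int) -> int:
    """Client side: add zero-mean noise drawn uniformly from [-bound, bound]."""
    return value + _rng.randint(-bound, bound)


def estimate_mean(masked_entries: list[int]) -> float:
    """Server side: the noise has zero mean, so the sample mean of the
    masked entries is an unbiased estimate of the true mean."""
    return fmean(masked_entries)


if __name__ == "__main__":
    true_values = [i % 100 for i in range(20000)]
    masked = [mask_entry(v, bound=100) for v in true_values]
    print("true mean:     ", fmean(true_values))
    print("estimated mean:", estimate_mean(masked))
```

Each entry is masked independently as it is produced, so the client needs no buffering. The server obtains only an estimate whose error shrinks as the number of entries grows, which is the limitation noted above.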
Audits
From time to time, the client may want to conduct audits on portions of the data stored at the server. An audit refers to recovering a portion of the stored data and verifying the integrity of that data. This information is typically contained in an audit report. To generate the audit report, the client requests some of the modified data from the server. To be able to interpret the audit report, the client must be able to reverse the modification of the data in the report exactly.
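One way the reversibility requirement could be met is for the client to regenerate each mask from a secret seed and the entry's index, so that no masks need to be stored. The sketch below is an assumption made for illustration, not a method stated in the text: it uses HMAC-SHA256 as a keyed pseudorandom function (which is computationally rather than information-theoretically secure), and the names `mask_for`, `mask_entry`, and `audit` are hypothetical. It covers only recovery of the data, not integrity verification.

```python
import hashlib
import hmac

# Assumed integer domain; a power of two keeps the derived mask uniform.
MODULUS = 2**32


def mask_for(seed: bytes, index: int) -> int:
    """Derive the mask for entry `index` from the client's secret seed."""
    digest = hmac.new(seed, index.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")  # 32 bits, i.e. in [0, 2**32)


def mask_entry(seed: bytes, index: int, value: int) -> int:
    """Client side: mask one entry before sending it to the server."""
    return (value + mask_for(seed, index)) % MODULUS


def audit(seed: bytes, records: dict[int, int]) -> dict[int, int]:
    """Client side: recover original values from {index: masked_value}."""
    return {i: (m - mask_for(seed, i)) % MODULUS for i, m in records.items()}


if __name__ == "__main__":
    seed = b"example-seed-known-only-to-client"
    data = [7, 42, 1000]
    stored = {i: mask_entry(seed, i, v) for i, v in enumerate(data)}
    # The client requests entries 1 and 2 from the server for an audit.
    print(audit(seed, {1: stored[1], 2: stored[2]}))
```

Since every mask is a deterministic function of the seed and the index, the client can audit any subset of entries without retrieving or unmasking the rest.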