This section is intended to introduce the reader to various aspects of the art that may be related to various aspects of the present invention. The following discussion is intended to provide information to facilitate a better understanding of the present invention. Accordingly, it should be understood that statements in the following discussion are to be read in this light, and not as admissions of prior art.
Increasingly, data management is outsourced to third parties. This trend is driven by growth and advances in cheap, high-speed communication infrastructures as well as by the fact that the total cost of data management is 5-10 times higher than the initial acquisition costs [34].
Outsourcing has the potential to minimize client-side management overheads and to let clients benefit from a service provider's global expertise consolidation and bulk pricing. Providers such as Yahoo Briefcase, Amazon Elastic Compute Cloud, Amazon Simple Storage Service, Amazon Web Services, Google App Engine, Sun Utility Computing, and others, ranging from corporate-level services such as the IBM Data Center Outsourcing Services to personal-level database hosting, are rushing to offer increasingly complex storage and computation outsourcing services.
Yet significant challenges lie in the path of successful large-scale adoption. In business, health care, and government frameworks, clients are reluctant to place sensitive data under the control of a remote, third-party provider without practical assurances of privacy and confidentiality. Today's privacy guarantees for such services are at best declarative and often subject customers to unreasonable fine-print clauses, e.g., allowing the server operator (or malicious attackers gaining access to its systems) to use customer behavior and content for commercial, profiling, or governmental surveillance purposes [29]. These services are thus fundamentally insecure and vulnerable to illicit behavior.
Existing research addresses several important outsourcing aspects, including direct searches on encrypted data and techniques for querying remotely hosted encrypted structured data in a unified client model [32, 61]. These efforts are based on the assumption that, to achieve confidentiality, data will need to be encrypted before outsourcing to an untrusted provider. Once the data is encrypted, however, inherent limitations in the types of primitive operations that untrusted hosts can perform on it lead to fundamental constraints on the expressiveness of the allowable queries. Specifically, reasonably practical mechanisms exist only for simple selection and range queries or variants thereof.
The present invention is directed to a mechanism for collaborative transaction processing with durability guarantees supported by an untrusted service provider under assurances of confidentiality and access privacy. In effect, the cost benefits of standard outsourcing techniques (durability, transaction processing, availability) are achieved while preserving the privacy guarantees of local data storage. Clients interact through an untrusted service provider that offers durability and transaction serializability support, while being able to check whether the service provider is misbehaving.
In this context data outsourcing becomes a setting in which data is hosted permanently off-site where it is securely encrypted, yet clients access it through their locally-run database effectively acting as a data cache. If local data is lost, it can be retrieved from the offsite repository. Inter-client interaction and transaction management (primarily for transactions that change data) is intermediated by the untrusted provider which also ensures durability by maintaining a client-encrypted and authenticated transaction log with full confidentiality.
In the present invention, each client maintains its own cache of (all or portions of) the database in client-local storage, allowing it to perform reads efficiently and with privacy, while relieving local system administrators of backup obligations. Additional benefits include achieving data and transaction privacy while (1) avoiding the requirement for persistent client storage (clients are now allowed to fail or be wiped out at any time), and (2) avoiding the need to keep any single client-side machine online as a requirement for availability.
Previous methods, while containing many excellent ideas, fail to accomplish the principal goals of the present invention: (i) allowing arbitrary database queries and (ii) ensuring access privacy, meaning the untrusted outsourcing agent cannot infer which data is accessed or even whether two transactions access the same data, (iii) while the outsourcing agent provides durability and serializability services.
As mentioned above, outsourcing database administration and backup has emerged as a technique to reduce costs in the last several years. A principal impediment to this idea is the perceived lack of security at the outsourced sites where both external hackers and internal industrial spies might steal data. For this reason, several groups propose enhancing security through encryption.
Hacigumus et al. [41] propose a method to execute SQL queries over partly obfuscated outsourced data. The data is divided into secret partitions and queries over the original data can be rewritten in terms of the resulting partition identifiers; the server can then perform parts of the queries directly. The information leaked to the server is claimed to be 1-out-of-s where s is the partition size. This balances a trade-off between client-side and server-side processing, as a function of the data segment size. At one extreme, privacy is completely compromised (small segment sizes) but client processing is minimal. At the other extreme, a high level of privacy can be attained at the expense of the client processing the queries in their entirety. Moreover, in [44] the authors explore optimal bucket sizes for certain range queries. Similarly, data partitioning is deployed in building “almost”-private indexes on attributes considered sensitive. An untrusted server is then able to execute “obfuscated range queries with minimal information leakage”. An associated privacy-utility trade-off for the index is discussed. The main drawbacks of these solutions lie in their computational impracticality and inability to provide strong confidentiality.
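By way of illustration only, the partition-based query rewriting described above may be sketched as follows. This is a minimal sketch and not the implementation of [41]: the bucket boundaries, the function names, and the XOR-based `toy_encrypt` stand-in (which is not secure and merely takes the place of a real cipher) are all assumptions introduced for the example.

```python
import bisect

# Hypothetical bucket boundaries chosen by the client and kept secret.
# Bucket i covers the half-open interval [BOUNDARIES[i], BOUNDARIES[i + 1]).
BOUNDARIES = [0, 100, 200, 300, 400]

def toy_encrypt(value):
    """Stand-in for a real cipher (NOT secure); XOR is its own inverse."""
    return value ^ 0x5A5A

toy_decrypt = toy_encrypt

def bucket_id(value):
    """Map a plaintext value to the identifier of the bucket containing it."""
    return bisect.bisect_right(BOUNDARIES, value) - 1

def rewrite_range(lo, hi):
    """Rewrite the plaintext range [lo, hi] as the set of bucket ids it touches."""
    return set(range(bucket_id(lo), bucket_id(hi) + 1))

def server_select(rows, bucket_ids):
    """Server side: filter on bucket ids only; the payloads stay opaque."""
    return [row for row in rows if row[0] in bucket_ids]

def client_filter(candidates, lo, hi):
    """Client side: decrypt the candidates and discard false positives."""
    values = (toy_decrypt(payload) for _, payload in candidates)
    return sorted(v for v in values if lo <= v <= hi)

def outsource(values):
    """Produce the (bucket id, ciphertext) rows that the server stores."""
    return [(bucket_id(v), toy_encrypt(v)) for v in values]
```

In this sketch the server learns only which buckets a query touches, and the client discards the false positives after decryption. Coarser buckets leak less to the server but return more candidates for the client to filter, which is the trade-off noted above.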
Recently, Ge et al. [73] discuss executing aggregation queries (a special type of query) with confidentiality on an untrusted server. Unfortunately, due to the use of extremely expensive homomorphisms (Paillier [64, 65]), this scheme leads to impractically large processing times for any reasonable security parameter setting (e.g., for 1024 bits of security, processing would take over 12 days per query). Current homomorphisms are simply not fast enough to be usable for practical data processing.
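For illustration, the additive homomorphism on which such aggregation relies may be sketched with a toy Paillier construction. This is a sketch under stated assumptions and not the scheme of [73]: the primes are tiny and insecure, the generator is fixed to n + 1 for simplicity, and all function names are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def keygen(p, q):
    """Toy key generation from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt m (0 <= m < n) as (n + 1)^m * r^n mod n^2 for a random r."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def add_encrypted(n, c1, c2):
    """Server side: multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)

def decrypt(n, private_key, c):
    """Recover the plaintext as L(c^lam mod n^2) * mu mod n, L(x) = (x - 1) / n."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n
```

An untrusted server holding only ciphertexts can thus compute an encrypted sum by repeated calls to `add_encrypted`, and the client decrypts the single result. Each operation is a modular exponentiation or multiplication over n squared, which at realistic key sizes is the source of the cost discussed above.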
In a publish-subscribe model (a specialized form of a database), Devanbu et al. deployed Merkle trees to authenticate data published at a third party's site [32], and then explored a general model for authenticating data structures [56, 57]. Hard-to-forge verification objects are provided by publishers to prove the authenticity and provenance of query results.
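As an illustrative sketch of Merkle-tree authentication in general, and not the specific construction of [32], a publisher may hash its records into a tree and distribute the root over an authenticated channel. A third party can then return any record together with a short verification object, here the list of sibling hashes on the path to the root. SHA-256, the domain-separation prefixes, and the function names are assumptions made for this example.

```python
import hashlib

def _h(data):
    return hashlib.sha256(data).digest()

def build_levels(records):
    """Return all tree levels, from the leaf hashes up to the single root."""
    level = [_h(b"\x00" + r) for r in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level = level + [level[-1]]
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Verification object: sibling hashes on the path from leaf to root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof

def verify(root, record, proof):
    """Client side: recompute the root from the record and its proof."""
    node = _h(b"\x00" + record)
    for sibling, sibling_is_left in proof:
        if sibling_is_left:
            node = _h(b"\x01" + sibling + node)
        else:
            node = _h(b"\x01" + node + sibling)
    return node == root
```

The verification object grows logarithmically with the number of records, and a forged or altered record fails verification unless a hash collision is found.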
Mykletun et al. [61] introduce mechanisms for efficient integrity and origin authentication for simple selection predicate query results (a special form of query). The authors explore different signature schemes (DSA, RSA, Merkle trees [59] and BGLS [25]) as potential alternatives for data authentication primitives.
Mykletun et al. [33] introduce signature immutability for aggregate signature schemes. The goal is to defeat a frequent querier that could eventually gather enough signature data to answer other (un-posed) queries. The authors explore the applicability of signature-aggregation schemes for efficient data authentication and integrity of outsourced data. The considered query types are simple selection queries.
Similarly, in [54], digital signature aggregation and chaining mechanisms are deployed to authenticate simple selection and projection operators. While these are important to consider, their expressiveness is limited. A more comprehensive, query-independent approach is desirable. Moreover, the use of strong cryptography renders this approach less useful because of the expense. Often simply transferring the data to the client side will be faster.
In [66], verification objects (VOs) are deployed to authenticate simple data retrieval in “edge computing” scenarios, where application logic and data are pushed to the edge of the network with the aim of improving availability and scalability. Lack of trust in edge servers mandates validation of their results, which is achieved through verification objects. Authentication ensures integrity of the data but not privacy.
In [45], Merkle tree and cryptographic hashing constructs are deployed to authenticate the result of simple range queries in a publishing scenario in which data owners delegate the role of satisfying user queries to a third-party untrusted publisher (a very special form of a database). Additionally, in [55], virtually identical mechanisms are deployed in database outsourcing scenarios.
The work in [31] proposes an approach for signing XML documents that allows untrusted servers to answer certain types of path and selection queries.
The drawbacks of these efforts include the fact that they operate in unrealistic “semi-honest” adversarial models. As a result, for example, data updates are not handled properly and the mechanisms are vulnerable to forking attacks. A forking attack occurs when a dishonest server effectively creates different data worlds for different clients. One way to do this is to show some transactions to one client and other transactions to the other.
Sion has explored query correctness by considering the query expressiveness problem in [70], where a novel method for proofs of actual query execution in an outsourced database framework for arbitrary queries is proposed. The solution is based on a mechanism of runtime query “proofs” in a challenge-response protocol built around the ringer concept first introduced in [39]. For each batch of client queries, the server is “challenged” to provide a proof of query execution that offers assurance that the queries were actually executed with completeness, over their entire target data set. This proof is then checked at the client site as a prerequisite to accepting the actual query results as accurate. This gives a probabilistic assurance of correct behavior but does not preserve privacy.
In a different adversarial and deployment model, researchers have also proposed techniques for protecting critical DBMS structures against errors [53, 67]. These techniques deal with corruptions caused by software errors. Work on tamper-proof audit logs by Snodgrass et al. [51, 69] introduces the idea of hashing transactional data with cryptographically strong one-way hash functions. This hash is periodically signed by a trusted external digital notary and stored within the DBMS. A separate validator checks the database state against these signed hashes to detect any compromise of the audit log. If tampering is detected, a separate forensic analyzer springs into action, using other hashes that were computed during previous validation runs to pinpoint when the tampering occurred and roughly where in the database the data was tampered with. The use of a notary prevents an adversary, even an auditor or a buggy DBMS, from silently corrupting the database. This is meant to ensure the integrity of a database.
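The hash-and-notarize idea may be illustrated, in a highly simplified form, as follows. This sketch is not the system of [51, 69]: the external notary's digital signature is simulated here with an HMAC under a hypothetical notary-held key, and the function names are assumptions introduced for the example.

```python
import hashlib
import hmac

# Hypothetical key held only by the external notary. A real notary would
# issue a digital signature; an HMAC stands in for it in this sketch.
NOTARY_KEY = b"hypothetical-notary-key"

def chain(previous_digest, record):
    """Fold one transaction record into the running one-way hash."""
    return hashlib.sha256(previous_digest + record).digest()

def digest_of(records):
    """Running hash over the whole log, starting from a fixed 32-byte seed."""
    digest = b"\x00" * 32
    for record in records:
        digest = chain(digest, record)
    return digest

def notarize(digest):
    """Notary side: bind the current digest (simulated signature)."""
    return hmac.new(NOTARY_KEY, digest, hashlib.sha256).digest()

def validate(records, notarized):
    """Validator side: recompute the digest and compare with the notarized one."""
    return hmac.compare_digest(notarize(digest_of(records)), notarized)
```

Because each digest depends on every earlier record, altering, removing, or reordering any logged transaction after notarization causes validation to fail.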
Encrypted Storage. Encryption is one of the most common techniques used to protect the confidentiality of stored data. Several existing systems encrypt data before storing it on potentially vulnerable storage devices or network nodes. Blaze's CFS [22], TCFS [27], StegFS [58], and NCryptfs [75] are file systems that encrypt data before writing to stable storage. NCryptfs is implemented as a layered file system [42] and is capable of being used even over network file systems such as NFS. Encryption file systems are designed to protect the data at rest, yet are insufficient to solve the outsourcing problem. For one thing, they do not allow for complex retrieval queries. For another, a malicious server is able to detect which data a client is accessing, violating client access privacy.
Integrity-Assured Storage. Tripwire [48, 49] is a user-level tool that verifies file integrity at scheduled intervals of time. File systems such as I3FS [47], GFS [35], and Checksummed NCryptfs [71] perform online real-time integrity verification. Venti [68] is an archival storage system that performs integrity assurance on read-only data. Mykletun et al. [62, 63] explore the applicability of signature aggregation schemes to provide computation- and communication-efficient data authentication and integrity of outsourced data.
Keyword Searches on Encrypted Data. Song et al. [72] propose a scheme for performing simple keyword search on encrypted data in a scenario where a mobile, bandwidth-restricted user wishes to store data on an untrusted server. The scheme requires the user to split the data into fixed-size words and perform encryption and other transformations. Drawbacks of this scheme include the fixed word size, the complexity of encryption and search, and the inability to support access pattern privacy or retrieval correctness. Eu-Jin Goh [36] proposes to associate indexes with documents stored on a server. A document's index is a Bloom filter [23] containing a codeword for each unique word in the document. Chang and Mitzenmacher [28] propose a similar approach, where the index associated with each document consists of a string of bits of length equal to the total number of words used (dictionary size). Boneh et al. [24] propose an alternative in which senders encrypt e-mails with recipients' public keys and store these e-mails on untrusted mail servers. Golle et al. [38] extend the above idea to conjunctive keyword searches on encrypted data. The scheme requires users to specify the exact positions where the search matches have to occur, and hence is impractical. Brinkman et al. [26] deploy secret splitting of polynomial expressions to search in encrypted XML.
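By way of example, a per-document Bloom filter index in the spirit of [36] may be sketched as follows. This is a simplified illustration and not Goh's exact construction: the filter size `M`, the number of hash functions `K`, the use of HMAC-SHA-256 as the keyed function, and all function names are assumptions.

```python
import hashlib
import hmac

M = 256  # number of bits in each per-document filter (assumed)
K = 3    # number of keyed hash functions per word (assumed)

def trapdoor(key, word):
    """Client side: keyed codewords for a word; reveals nothing without the key."""
    return [hmac.new(key, bytes([i]) + word.encode(), hashlib.sha256).digest()
            for i in range(K)]

def positions(doc_id, word_trapdoor):
    """Derive document-specific bit positions from a trapdoor."""
    return [int.from_bytes(
                hmac.new(t, doc_id.encode(), hashlib.sha256).digest(), "big") % M
            for t in word_trapdoor]

def build_index(key, doc_id, words):
    """Client side: set the filter bits for each unique word in the document."""
    bits = 0
    for word in set(words):
        for p in positions(doc_id, trapdoor(key, word)):
            bits |= 1 << p
    return bits

def server_test(index_bits, doc_id, word_trapdoor):
    """Server side: test membership using only the trapdoor, never the word."""
    return all((index_bits >> p) & 1 for p in positions(doc_id, word_trapdoor))
```

A word that is present in the document always matches. A word that is absent may still match with small probability (a false positive), which is inherent to Bloom filters. Note that the server still observes which documents match each trapdoor, so access pattern privacy is not provided by this kind of index.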