With the proliferation of the Internet, more and more organizations are relying on hosted services to provide certain information technology (IT) resources. For example, a company may contract for email services from an outside email service provider rather than maintaining the servers, agents, and infrastructure necessary to provide its own email services. Hosted services enable these organizations to take advantage of those having specific knowledge and experience with different resources. Furthermore, provision of these services by an outside service provider can lower costs because the service provider can take advantage of an ever-expanding client base to provide these services more efficiently.
Many organizations, such as companies and nonprofits, use hosted electronic document management systems to archive electronic documents. These electronic document management systems may provide, among other things, storage, indexing, searching, and processing services for a large collection of documents. For example, a hospital may generate and collect thousands of documents every day, such as patients' medical records, doctors' reports, payroll records, vendor purchase orders, and so on. It may become cumbersome for the hospital to manage this large set of documents since the hospital may not be equipped or staffed to do so efficiently. To efficiently handle these documents, the hospital may employ the services of an electronic document management system. In some cases, the electronic document management system may be located remotely from its users and hosted by an electronic document management service provider. Hosting the electronic document management system remotely provides each end user with centralized access to the system. Centralization can alleviate many of the problems associated with distributed systems, such as coherency and maintenance issues.
Because some organizations, such as hospitals and banks, generate a number of documents containing confidential information, such as an individual's medical or financial records, it is important that electronic document management systems provide certain security guarantees, such as services that do not rely on the use of plaintext (or unencrypted) information associated with the documents. Some systems allow a user to download and decrypt the encrypted information and sift through the information locally. However, this can be inefficient as the user may be looking only for a specific subset of the documents. As an added level of protection, organizations may also desire that an electronic document management service provider have access only to encrypted versions of the documents.
Some electronic document management systems maintain an encrypted document index, which provides centralized access to a plurality of users without allowing the electronic document management system to determine the contents of the documents directly. However, as discussed below, these encrypted indexes can present certain risks, such as vulnerability to frequency-based attacks.
Some electronic document management services provide an encrypted keyword index that maps an encrypted version of a keyword to the documents containing that keyword. Placing the index with the electronic document management service provides centralized access to the index and better performance of the indexing and searching services. The electronic document management service may also provide a central repository of encrypted versions of the documents. When a user performs a query for a keyword, the keyword is encrypted using a predetermined encryption algorithm (or cipher) and encryption key, and then the encrypted keyword is passed to the electronic document management service. The electronic document management system uses the encrypted keyword index located at the electronic document management system to identify documents containing the keyword and provides an encrypted indication of these documents, such as an encrypted document identifier, to the user for decryption. When a user selects a document identifier, an encrypted version of the relevant document may be retrieved from the electronic document management server for decryption at the user's computer. Because these systems use a 1:1 mapping between keywords and encrypted keywords, however, they are susceptible to frequency-based attacks, such as a histogram-based attack. If the frequency with which words appear in a set of documents is known or can be reasonably estimated, some information about the documents can be inferred by comparing the frequency of encrypted keywords in a set of documents to the known or estimated frequency of unencrypted keywords in the documents.
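As a rough illustration of the 1:1 mapping described above, the following Python sketch builds a small encrypted keyword index. It is not taken from any particular system: HMAC-SHA256 is assumed here as a stand-in for the predetermined cipher and key, the names `encrypt_keyword`, `build_index`, and `query` are hypothetical, and document identifiers are left unencrypted for brevity.

```python
import hashlib
import hmac
from collections import defaultdict


def encrypt_keyword(key: bytes, keyword: str) -> str:
    """Deterministic stand-in for the keyword cipher (assumption: HMAC-SHA256).

    The same keyword under the same key always yields the same token,
    which is the 1:1 mapping the text describes.
    """
    return hmac.new(key, keyword.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, documents: dict) -> dict:
    """Map each encrypted keyword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in set(text.split()):
            index[encrypt_keyword(key, word)].add(doc_id)
    return dict(index)


def query(index: dict, token: str) -> list:
    """Server-side lookup: only the encrypted token is seen, never the keyword."""
    return sorted(index.get(token, set()))


if __name__ == "__main__":
    key = b"client-side secret"  # hypothetical key held only by the users
    docs = {"d1": "research budget", "d2": "research payroll", "d3": "payroll"}
    index = build_index(key, docs)
    print(query(index, encrypt_keyword(key, "research")))
```

Because the token for a given keyword never changes, the number of documents listed under each token is visible to whoever holds the index, even though the keywords themselves are not.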
As an example, if the word “research” is known or estimated to be the most common word in a set of documents, then the most common encrypted keyword in the documents is likely to be the encrypted version of the word “research.” In the case of an electronic document management system, an electronic document management service provider, which has access to the encrypted index but may have access only to the encrypted documents, can analyze the index to determine the frequency of encrypted keywords associated with the documents. As another example, an attacker may, over time, be able to determine the frequency of encrypted keywords associated with a set of documents by monitoring communications (e.g., queries and results) between clients and a server.
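The rank-matching idea behind such a histogram-based attack can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the attacker is assumed to hold a list of observed encrypted keywords (opaque strings such as `"t9"`) and an estimated frequency ranking of plaintext words, and the function name `rank_match` is hypothetical.

```python
from collections import Counter


def rank_match(observed_tokens, plaintext_by_frequency):
    """Pair the most frequent encrypted keywords with the most frequent words.

    observed_tokens: encrypted keywords seen in the index or in traffic.
    plaintext_by_frequency: plaintext words, most common first (an estimate).
    Returns a dict of guesses {encrypted keyword: guessed plaintext word}.
    """
    counts = Counter(observed_tokens)
    ranked_tokens = [token for token, _ in counts.most_common()]
    return dict(zip(ranked_tokens, plaintext_by_frequency))


if __name__ == "__main__":
    observed = ["t9"] * 5 + ["t2"] * 3 + ["t7"]
    guesses = rank_match(observed, ["research", "budget", "payroll"])
    print(guesses["t9"])  # the most common token is guessed to be "research"
```

The guesses are only as good as the frequency estimate, but no key material is needed to produce them, which is why the 1:1 mapping is a concern.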
In some cases, a document index may map documents to a value in a sequence, such as a range of dates corresponding to a Date Created or Last Modified attribute of each document. When these document indexes are encrypted, queries for exact matches may succeed but queries that rely on order, such as less than or equal to (“≦”) or greater than or equal to (“≧”), may fail unless the encryption algorithm used to encrypt the index is order-preserving. In a paper titled “Anti-Tamper Database Research: Inference Control Techniques,” G. Bebek proposes a solution to this problem where a sequence of encrypted values is generated using a random number generator (G. Bebek, Anti-Tamper Database Research: Inference Control Techniques. Technical Report EECS 443 Final Report, Case Western Reserve University, November 2002). For each plaintext value, an encrypted value is generated by adding the next random number to the previously generated encrypted value. Because this technique maps a single encrypted value to each plaintext value, however, one may be able to infer information about the plaintext sequence from the sequence of encrypted values based on the distance between encrypted values. Furthermore, the 1:1 mapping between the plaintext values and the encrypted values opens Bebek's technique up to the frequency-based attacks previously described.
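The cumulative-sum construction summarized above can be sketched as follows. This is a reading of the description in this section, not Bebek's actual implementation: the seeded `random.Random` generator (which is not cryptographically secure), the `max_step` bound, and the function names are all assumptions made for illustration.

```python
import random
from datetime import date


def build_order_preserving_map(plaintext_values, seed, max_step=1000):
    """Assign each distinct plaintext value an increasing encrypted value.

    Each encrypted value is the previous one plus the next random number,
    so plaintext order is preserved and the mapping is 1:1.
    """
    rng = random.Random(seed)  # illustration only; seed plays the role of a secret
    mapping = {}
    current = 0
    for value in sorted(set(plaintext_values)):
        current += rng.randint(1, max_step)  # strictly positive step keeps order
        mapping[value] = current
    return mapping


def encrypted_at_most(encrypted_values, encrypted_bound):
    """Range query on encrypted values only: everything <= the encrypted bound."""
    return sorted(v for v in encrypted_values if v <= encrypted_bound)


if __name__ == "__main__":
    created = [date(2002, 11, 5), date(2001, 1, 1), date(2002, 3, 9)]
    mapping = build_order_preserving_map(created, seed=7)
    bound = mapping[date(2002, 3, 9)]
    print(encrypted_at_most(mapping.values(), bound))
```

A "less than or equal to" query succeeds here without decrypting anything, but the gaps between consecutive encrypted values, and the single encrypted value per plaintext value, remain visible to anyone holding the index.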