While private and semi-private information on the Internet has grown rapidly in recent years, mechanisms for searching this information have failed to keep pace. A user faced with the problem of locating an access-controlled document would typically identify and individually search each relevant repository, assuming of course the user knows and remembers which repositories are relevant.
For example, company XYZ wishes to share some but not all of their internal research documents with company ABC. The documents that company XYZ wishes to share might refer to a collaborative project between the two companies. Company XYZ would like to be able to offer a search facility for that data, where company ABC can only search for documents to which they have access. However, company XYZ does not want company ABC to be able to determine what company XYZ is sharing with company Q. Currently, no method exists for uniformly searching data in this format between companies and individuals wishing to share data in an access-controlled format.
The lack of tools for searching access-controlled content on the network stems from the considerable difficulty in creating a search engine that indexes the content while respecting the security and privacy requirements of the content providers. Contemporary search engines build inverted indexes that map a keyword to its precise locations in an indexed document.
Conventional inverted indexes represent an indexed document in its virtual entirety, so the indexed document can be easily reconstructed from the index. The trust and security required of any host providing such an index over access-controlled content is therefore enormous. Because the host is conferred with knowledge of every searchable document, the trust required of a search engine over access-controlled content grows rapidly with each participating provider. This enormous trust requirement, coupled with the potential for a complete breach of access control by way of malicious index disclosure, renders such an approach impractical.
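The reconstruction risk described above can be illustrated with a minimal sketch. It assumes a simplified positional index that stores one (document, position) pair per term occurrence; the function names `build_positional_index` and `reconstruct` and the sample text are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_positional_index(doc_id, text):
    """Map each term to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for position, term in enumerate(text.split()):
        index[term].append((doc_id, position))
    return index

def reconstruct(index, doc_id):
    """Rebuild a document's word sequence using only the index."""
    placed = {}
    for term, postings in index.items():
        for posting_doc, position in postings:
            if posting_doc == doc_id:
                placed[position] = term
    return " ".join(placed[i] for i in sorted(placed))

# Hypothetical document text, used only to exercise the sketch.
original = "merger terms for project falcon remain confidential"
index = build_positional_index("d1", original)
recovered = reconstruct(index, "d1")
```

In this sketch the host of the index never needs the source document: the postings alone suffice to recover the word sequence, which is why disclosure of such an index amounts to disclosure of the content.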
Conventional search solutions include centralized indexing, query broadcasting, distributed indexing, and centralized fuzzy indexing. The most common scheme for supporting efficient search over distributed content is centralized indexing, in which a centralized inverted index is built. The index maps each term to a set of documents that contain the term. The index is queried by the searcher to obtain a list of matching documents. This is the scheme of choice of web search engines and mediators.
Centralized indexing can be extended to support access-controlled search by propagating access policies along with content to the indexing host. The index host applies these policies for each searcher to filter search results appropriately. Since only the indexing host needs to be contacted to completely execute a search, searches are highly efficient. However, a centralized index may allow anyone who has access to the index structure to “provably expose” content providers. A provable exposure occurs when an adversary (e.g., a hacker) can provide irrefutable evidence that provider p is sharing document d. In cases where the index host is completely trusted by all content providers, this violation of access control may be tolerable. Finding such a trusted host, however, is immensely difficult. Further, compromise of the index host by hackers could lead to a complete and devastating privacy loss should the index be revealed publicly.
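A minimal sketch of access-controlled centralized indexing follows, assuming that each access policy is simply a set of permitted searchers propagated with each document. The class `CentralIndex`, its methods, and the sample documents are hypothetical; companies XYZ, ABC, and Q are taken from the example given earlier.

```python
from collections import defaultdict

class CentralIndex:
    """Hypothetical central index host that filters results per searcher."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {(provider, doc_id)}
        self.acl = {}                     # (provider, doc_id) -> allowed searchers

    def add(self, provider, doc_id, text, allowed):
        key = (provider, doc_id)
        self.acl[key] = set(allowed)
        for term in text.lower().split():
            self.postings[term].add(key)

    def search(self, searcher, term):
        # The host applies each document's policy before returning results.
        matches = self.postings.get(term.lower(), set())
        return sorted(k for k in matches if searcher in self.acl[k])

index = CentralIndex()
index.add("XYZ", "d1", "joint turbine design", allowed={"ABC"})
index.add("XYZ", "d2", "turbine pricing", allowed={"Q"})

abc_results = index.search("ABC", "turbine")
# Anyone holding the raw structure bypasses the filter entirely.
raw_postings = index.postings["turbine"]
```

The filter protects only the query interface. In this sketch, `raw_postings` lists both documents, so a party that obtains the index structure can show that XYZ shares a document with company Q, which is the provable exposure described above.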
At the other end of the search efficiency spectrum lies query broadcasting, in which the query is sent to all participating content providers. Such schemes include a network of content providers, where providers locally evaluate each query and directly provide any matching documents to the searcher. The query broadcasting search protocol may be augmented to implement access control. In such a protocol, the query is broadcast along with the identity and IP address of the query originator. Providers can then securely deliver search results back to the authenticated searcher over an encrypted connection to avoid interception.
Since content shared by a provider p resides at the provider's database alone, providers are assured absolute privacy and the goal of content privacy is naturally preserved. However, while this adaptation to query broadcasting has excellent privacy characteristics, it suffers from poor scalability and severe performance penalties. Consequently, the protocols for query broadcasting adopt heuristics (e.g., time-to-live fields) that limit search horizons and compromise search completeness.
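The effect of a time-to-live heuristic on search completeness can be sketched as follows. The sketch assumes a simple flooding model in which each hop decrements the time-to-live by one; the function `broadcast_search`, the four-node topology, and the sample documents are hypothetical.

```python
from collections import deque

def broadcast_search(neighbors, documents, start, term, ttl):
    """Flood a query from `start`; each hop consumes one unit of ttl."""
    seen = {start}
    queue = deque([(start, ttl)])
    hits = []
    while queue:
        node, remaining = queue.popleft()
        # Each provider evaluates the query against its own local content.
        hits.extend((node, d) for d in documents.get(node, []) if term in d.split())
        if remaining == 0:
            continue  # search horizon reached; the query is not forwarded
        for peer in neighbors.get(node, []):
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, remaining - 1))
    return hits

# Hypothetical chain of providers: a -> b -> c -> d
neighbors = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
documents = {"b": ["turbine design"], "d": ["turbine pricing"]}

near = broadcast_search(neighbors, documents, "a", "turbine", ttl=2)
full = broadcast_search(neighbors, documents, "a", "turbine", ttl=3)
```

With a time-to-live of 2 the query never reaches provider d, so its matching document is silently omitted; raising the limit restores completeness at the cost of contacting more providers.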
The performance limitations of query broadcasting have led to work on distributed indexing methods that support efficient search without the need for a single centralized index provider. For example, a peer-to-peer network may leverage “super-peers” (machines with above-average bandwidth and processing power) by having them host sub-indexes of content shared by several less capable machines.
Another system distributes a search index using a distributed hash table. In these systems, the distributed index is used to identify a set of documents (or machines that host the documents) matching the searcher's query. These machines are then contacted directly by the searcher to retrieve the matching documents.
Access control for distributed indexing systems can be supported by simply having the providers enforce their access policies before providing the documents. However, much as in the case of a centralized index, any node with access to a portion of the distributed index can provably expose any of the providers indexed by that portion.
Further, indexes are typically hosted by untrusted machines over which the providers themselves have no control. Even an active adversary that does not host a portion of the index can search the distributed index to inflict privacy breaches. For example, the adversary can determine the precise list of providers sharing a document with a particular keyword by issuing a search on that keyword, breaching content privacy with provable exposure. Content privacy can also be breached by mounting phrase attacks. Such attacks take advantage of the observation that most documents have characteristic sets of words that are unique to them.
To identify a provider sharing some document, the adversary need only compose a query consisting of such terms for the document. The sites in the resulting list are then known to share the document, although with possible innocence. Possible innocence occurs when the claim of an adversary about provider p sharing document d can be false with a non-trivial probability. By choosing an appropriate set of terms, the adversary can achieve a near provable exposure.
Some search applications do not maintain precise inverted index lists, but instead maintain structures that allow mapping of a query to a “fuzzy” set of providers that may contain matching documents; this approach is called centralized fuzzy indexing. A Bloom filter index, which is a type of fuzzy index, can be probed by a searcher to identify a list of all providers that may contain documents matching the query. The list, however, is not necessarily precise, since Bloom filters may produce false positives due to hash collisions. Given such a list, the searcher contacts each provider to accumulate results. These schemes can be extended to support access-controlled searches by having the providers enforce their access policies at the point a searcher requests matching documents.
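One way to picture such a fuzzy index is a sketch in which the central host holds one Bloom filter per provider. This layout, the filter parameters (64 bits, 3 hash functions), the use of SHA-256 to derive bit positions, and the names `BloomFilter` and `candidate_providers` are all assumptions made for illustration and are not drawn from any particular system.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter; the bit array is stored in a Python integer."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, term):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        # True for every added term; may also be True for others (collisions).
        return all((self.bits >> pos) & 1 for pos in self._positions(term))

def candidate_providers(filters, query):
    """Return providers whose filter matches every term of the query."""
    terms = query.lower().split()
    return sorted(p for p, f in filters.items()
                  if all(f.might_contain(t) for t in terms))

filters = {"XYZ": BloomFilter(), "Q": BloomFilter()}
for term in "joint turbine design".split():
    filters["XYZ"].add(term)
for term in "quarterly sales report".split():
    filters["Q"].add(term)

candidates = candidate_providers(filters, "turbine design")
```

A provider that actually holds the terms always appears in `candidates`, while a provider that does not may appear as well because of hash collisions. The searcher must therefore contact each candidate to obtain actual results.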
Bloom filter indexes do offer limited privacy characteristics by virtue of potential false positives in the list of providers. Each provider in the list is thus possibly innocent of sharing a document matching the query. However, this privacy is spurious. An active adversary can perform a dictionary-based attack on the Bloom filter index to identify the term distribution of any indexed provider.
Dictionary-based attacks take advantage of the fact that sentences in natural language (e.g., English) use words from a restricted vocabulary that is easily compiled (e.g., from an Oxford or Webster dictionary). Thus, the adversary can compute a hash for each word in the vocabulary. A provider in the Bloom filter entry for such a hash is, with some probability, sharing a document with the corresponding word. In addition, the scheme remains prone to phrase attacks.
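A dictionary-based attack of this kind can be sketched as follows, under the assumption that the adversary knows the hash construction and can read the published filter of a provider. The filter is modeled as a set of bit positions; the parameters, the SHA-256 construction, the vocabulary, and the function names are hypothetical.

```python
import hashlib

NUM_BITS, NUM_HASHES = 256, 3  # illustrative parameters, not from any real system

def positions(term):
    """Bit positions that a term sets in the filter."""
    return {
        int.from_bytes(hashlib.sha256(f"{seed}:{term}".encode()).digest()[:8], "big")
        % NUM_BITS
        for seed in range(NUM_HASHES)
    }

def make_filter(terms):
    """Provider side: the published filter is the union of all term positions."""
    bits = set()
    for term in terms:
        bits |= positions(term)
    return bits

def dictionary_attack(published_filter, vocabulary):
    """Adversary side: test every vocabulary word against the filter."""
    return [word for word in vocabulary if positions(word) <= published_filter]

published = make_filter(["turbine", "pricing"])
vocabulary = ["apple", "pricing", "river", "turbine", "window"]
guessed = dictionary_attack(published, vocabulary)
```

Every word the provider indexed is guaranteed to appear in `guessed`; a word the provider did not index appears only through a collision. With a sparsely populated filter such collisions are rare, so the guessed list closely approximates the term distribution of the provider.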
While these conventional search solutions might be adapted to support searches over access-controlled content, such adaptations fail to adequately address privacy and efficiency. Any search mechanism that relies on a conventional search index allows a provider to be “provably exposed” because of the precise information that the index itself conveys. Efficient privacy-preserving search therefore requires an index structure that prevents breaches of “content privacy” even in the event that the index is made public.
What is needed is a system and associated method that will allow searchers privileged access to access-controlled documents without exposing the contents of the document, the provider of the document, or even the existence of the document to unauthorized searchers. The need for such a system and method has heretofore remained unsatisfied.