1. Introduction
Information integration has long been an area of active database research [e.g. see references 12, 16, 21, 27, 48]. So far, this literature has tacitly assumed that the information in each database can be freely shared. However, there is now an increasing need for computing queries across databases belonging to autonomous entities in such a way that no more information than necessary is revealed from each database to the other databases. This need is driven by several trends:
    End-to-end Integration: E-business on demand requires end-to-end integration of information systems, from the supply chain to the customer-facing systems. This integration occurs across autonomous enterprises, so full disclosure of information in each database is undesirable.
    Outsourcing: Enterprises are outsourcing tasks that are not part of their core competency. They need to integrate their database systems for purposes such as inventory control.
    Simultaneously compete and cooperate: It is becoming common for enterprises to cooperate in certain areas and compete in others, which requires selective information sharing.
    Security: Government agencies need to share information for devising effective security measures, both within the same government and across governments. However, an agency cannot indiscriminately open up its database to all other agencies.
    Privacy: Privacy legislation and stated privacy policies place limits on information sharing. However, it is still desirable to mine across databases while respecting privacy limits.
1.1 Motivating Applications
We give two prototypical applications to make the above paradigm concrete.
Application 1: Selective Document Sharing
Enterprise R is shopping for technology and wishes to find out if enterprise S has some intellectual property it might want to license. However, R would not like to reveal its complete technology shopping list, nor would S like to reveal all its unpublished intellectual property. Rather, they would like to first find the specific technologies for which there is a match, and then reveal information only about those technologies. This problem can be abstracted as follows.
We have two databases D_R and D_S, where each database contains a set of documents. The documents have been preprocessed to include only the most significant words, using some measure such as term frequency times inverse document frequency [41]. We wish to find all pairs of similar documents d_R ∈ D_R and d_S ∈ D_S, without revealing the other documents. In database terminology, we want to compute the join of D_R and D_S using the join predicate f(|d_R ∩ d_S|, |d_R|, |d_S|) > τ, for some similarity function f and threshold τ. The function f could be |d_R ∩ d_S| / (|d_R| + |d_S|), for instance.
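As a rough illustration of this join predicate, the following Python sketch evaluates it in the clear, with no privacy protection at all. It only shows which pairs a private protocol would be expected to return. It reads the example similarity function as the size of the word overlap divided by the sum of the two document sizes. The function names, the sample documents, and the threshold 0.3 are assumptions made for illustration and do not come from the source.

```python
from itertools import product

def similarity(d_r: frozenset, d_s: frozenset) -> float:
    """Example f: (number of shared words) / (|d_r| + |d_s|)."""
    total = len(d_r) + len(d_s)
    if total == 0:
        return 0.0  # two empty documents: treat as not similar (assumption)
    return len(d_r & d_s) / total

def similar_pairs(db_r, db_s, tau):
    """Plaintext, NON-private join: index pairs (i, j) with f > tau."""
    return [
        (i, j)
        for (i, d_r), (j, d_s) in product(enumerate(db_r), enumerate(db_s))
        if similarity(d_r, d_s) > tau
    ]

# Hypothetical documents, already reduced to their significant words.
DB_R = [
    frozenset({"oblivious", "transfer", "protocol"}),
    frozenset({"laser", "optics"}),
]
DB_S = [
    frozenset({"protocol", "oblivious", "cipher"}),
    frozenset({"battery", "anode"}),
]

pairs = similar_pairs(DB_R, DB_S, tau=0.3)
print(pairs)
```

With this particular f, two identical non-empty documents score exactly 0.5, so a useful threshold τ lies below 0.5.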
Many applications map to this abstraction. For example, two government agencies may want to share documents, but only on a need-to-know basis. They would like to find similar documents contained in their repositories in order to initiate their exchange.
Application 2: Medical Research
Imagine a future where many people have their DNA sequenced. A medical researcher wants to validate a hypothesis connecting a DNA sequence D with a reaction to drug G. People who have taken the drug are partitioned into four groups, based on whether or not they had an adverse reaction and whether or not their DNA contained the specific sequence; the researcher needs the number of people in each group. DNA sequences and medical histories are stored in databases in autonomous enterprises. Due to privacy concerns, the enterprises do not wish to provide any information about an individual's DNA sequence or medical history, but still wish to help with the research.
Assume that the table TR(person_id, pattern) stores whether a person's DNA contains pattern D and TS(person_id, drug, reaction) captures whether a person took drug G and whether the person had an adverse reaction. TR and TS belong to two different enterprises. The researcher wants to get the answer to the following query:
    select pattern, reaction, count(*)
    from TR, TS
    where TR.person_id = TS.person_id and TS.drug = "true"
    group by TR.pattern, TS.reaction
We want the property that the researcher should get to know the counts and nothing else, and the enterprises should not learn any new information about any individual.
1.2 Current Techniques
We discuss next some existing techniques that one might use for building the above applications, and why they are inadequate.
    Trusted Third Party: The main parties give the data to a "trusted" third party and have the third party do the computation [7, 30]. However, the third party has to be completely trusted, both with respect to intent and competence against security breaches. The level of trust required is too high for this solution to be acceptable.
    Secure Multi-Party Computation: Given two parties with inputs x and y respectively, the goal of secure multi-party computation is to compute a function f(x,y) such that the two parties learn only f(x,y), and nothing else. See [26, 34] for a discussion of various approaches to this problem.
Yao [49] showed that any multi-party computation can be solved by building a combinatorial circuit and simulating that circuit. In a variant of Yao's protocol, the number of oblivious transfers is proportional to the number of inputs rather than to the size of the circuit. Unfortunately, the communication costs for circuits make them impractical for many problems.
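To make the target function f(x,y) concrete for Application 2, the sketch below takes x to be table TR and y to be table TS and computes the grouped counts directly. This is a plain reference computation with no privacy guarantee: it is what a fully trusted party holding both tables would output, and what a secure protocol would have to reproduce without either enterprise seeing the other's rows. The sample rows, the use of Python booleans for the pattern, drug, and reaction columns, and the assumption that person_id is unique within TR are all illustrative and not taken from the source.

```python
from collections import Counter

# Hypothetical rows. TR: (person_id, pattern); TS: (person_id, drug, reaction).
TR = [(1, True), (2, False), (3, True), (4, False)]
TS = [
    (1, True, True),
    (2, True, False),
    (3, True, True),
    (4, False, False),  # did not take the drug: excluded by the filter
    (5, True, True),    # no matching row in TR: dropped by the join
]

def grouped_counts(t_r, t_s):
    """NON-private reference for the query: join on person_id, keep rows
    where drug is true, and count per (pattern, reaction) group."""
    # Assumes each person_id appears at most once in t_r.
    pattern_by_id = {pid: pattern for pid, pattern in t_r}
    counts = Counter()
    for pid, drug, reaction in t_s:
        if drug and pid in pattern_by_id:
            counts[(pattern_by_id[pid], reaction)] += 1
    return dict(counts)

result = grouped_counts(TR, TS)
print(result)
```

As with the SQL group by, groups with no members simply do not appear in the result.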
There is therefore an increasing need for sharing information across autonomous entities such that no information apart from the answer to the query is revealed.