The inference problem in databases (and in social networks too, in a slightly different guise) occurs when sensitive information is disclosed indirectly, via a series of ostensibly secure answers to queries. Even though each individual query answer may be properly authorized for disclosure (i.e., the user's clearance level may permit her to receive the answer), the answers may nevertheless collectively compromise sensitive information, in that the user may be able to infer from these answers information that she is not authorized to have, particularly when she combines the answers with some additional knowledge, e.g., metadata such as integrity constraints or functional dependencies, or domain-specific knowledge.
The problem has attracted a great deal of attention. Most approaches fall into two camps, static and dynamic. Static approaches analyze a database prior to querying and try to detect so-called “inference channels” that could result in inference-based leaks of sensitive information. When such channels are identified, the database is modified in order to eliminate them; typically the security levels of various attributes are raised accordingly. This usually results in over-classification: large portions of the data are classified as sensitive, and overall data availability is thereby decreased, making the database less useful. As a rather simplistic example, consider a database with three attributes, Name, Rank, and Salary. Suppose that we wish to keep secret the association between names and salaries, but we freely disclose the association between names and ranks, and between ranks and salaries. Given a functional dependency Rank→Salary that may be widely known, it is clear that the user could come to infer salaries from ranks. In the static approach, the solution would be to make the Rank a sensitive attribute.
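The Name/Rank/Salary example can be made concrete with a short sketch. The tables, names, and values below are hypothetical toy data chosen only for illustration. The point is that composing the two freely disclosed associations through the functional dependency Rank→Salary reconstructs the protected Name-to-Salary association.

```python
# Hypothetical toy data; names, ranks, and salaries are illustrative only.
name_rank = {"Tom": "G", "Ann": "H"}        # disclosed: Name -> Rank
rank_salary = {"G": 70_000, "H": 85_000}    # disclosed: Rank -> Salary


def infer_salaries(name_rank, rank_salary):
    """Compose the two disclosed associations via the FD Rank -> Salary."""
    return {
        name: rank_salary[rank]
        for name, rank in name_rank.items()
        if rank in rank_salary
    }


# The protected Name -> Salary association falls out of the composition.
inferred = infer_salaries(name_rank, rank_salary)
print(inferred)
```

Raising the classification of Rank removes one of the two inputs to this composition, which is why the static remedy works but also why it hides more data than strictly necessary.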
Dynamic approaches, by contrast, attempt to detect potential inferences of sensitive information at query time. If no inference is detected, the regular answer to the query can be released. But if it is determined that potentially compromising inferences could be made on the basis of the answer (and other knowledge, such as previous answers, metadata, etc.), then the answer is not released; it is withheld, or suppressed, or generalized, and so on. Dynamic approaches have the benefit of being considerably more precise than static approaches. On average, data is more available under a dynamic approach because there is no need to be overly conservative ahead of time; protective measures are taken only if and when needed. The main drawbacks of dynamic approaches have been incompleteness and inefficiency. Incompleteness means that only a very restricted class of inferences could be detected; and inefficiency typically means that the detection was computationally expensive.
Given that the issue at hand is information inference, it would appear that logic-based techniques such as theorem proving might be of use. Indeed, theorem proving techniques could be (and have been) used to tackle the inference problem, roughly along the following lines: For any given time point $t$, let $A_t = \{a_1, \ldots, a_t\}$, $t \geq 1$, be the answers to all the queries that a given user has previously posed (up to time $t$). Further, let $B$ be a set of background knowledge that the user can be reasonably expected to have. For instance, $B$ could be a set of functional dependencies for the underlying database. Now, let $q_{t+1}$ be a new query, and let $a_{t+1}$ be the answer to it. An inference-blocking information-management system will decline to disclose $a_{t+1}$ if $A_t \cup B \cup \{a_{t+1}\} \vdash p$, where $p$ is a sensitive proposition that should not be made available to this particular user (e.g., because her security clearance level is insufficient). Typically $p$ is an atomic proposition that reveals the value of a sensitive attribute for a given individual, such as
$$\mathrm{salary}(\mathrm{Tom}) = 70\mathrm{K} \qquad (1.1)$$
In other words, the new query will not be answered if the answer could be used, in tandem with previous answers and background knowledge, to deduce some sensitive information item about someone.
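The release test just described can be sketched as follows. This is a minimal illustration, not a full theorem prover: it assumes that answers are ground atoms written as strings and that the background knowledge has already been grounded into if-then rules, and it uses simple forward chaining in place of a general proof procedure. The function names `closure` and `may_release` and all sample atoms are hypothetical.

```python
def closure(facts, rules):
    """Forward-chain to a fixed point.

    rules: iterable of (premises, conclusion), premises being a frozenset of atoms.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return known


def may_release(previous_answers, background, new_answer, sensitive):
    """Return False if previous answers, background rules, and the new answer derive `sensitive`."""
    derived = closure(set(previous_answers) | {new_answer}, background)
    return sensitive not in derived


# Hypothetical grounded background rule: rank G together with G's pay yields Tom's salary.
background = [
    (frozenset({"rank(Tom)=G", "pay(G)=70K"}), "salary(Tom)=70K"),
]
previous_answers = {"rank(Tom)=G"}
sensitive = "salary(Tom)=70K"

blocked = not may_release(previous_answers, background, "pay(G)=70K", sensitive)
harmless = may_release(previous_answers, background, "pay(H)=85K", sensitive)
print(blocked, harmless)
```

Forward chaining is complete only for rules of this restricted if-then form; richer background knowledge would call for a more general procedure such as resolution.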
There are two main drawbacks to this approach. First, there is usually not a single proposition $p$ that we wish to protect but many. For instance, we wish to prevent the disclosure of all sensitive attribute values for all individuals in a given database. At least in principle, there are a couple of ways of handling this problem. One option is to run a theorem-proving procedure such as resolution to completion, deriving not one but very many conclusions from $A_t \cup B \cup \{a_{t+1}\}$, and then check whether any sensitive proposition is among the conclusions. A slightly more targeted approach is to formulate a disjunctive proposition $p_1 \vee \cdots \vee p_n$ containing all the propositions whose secrecy we wish to maintain under the circumstances, and check whether
$$A_t \cup B \cup \{a_{t+1}\} \vdash p_1 \vee \cdots \vee p_n \qquad (1.2)$$
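As a rough illustration of the disjunctive check (1.2), the sketch below decides propositional entailment by enumerating truth assignments. It assumes that previous answers, background knowledge, and the new answer have all been encoded as propositional clauses in conjunctive normal form; that encoding, the function name `entails_disjunction`, and the atom names are hypothetical. Enumeration takes time exponential in the number of atoms, so it is practical only for very small examples.

```python
from itertools import product


def entails_disjunction(clauses, goals):
    """True if every model of `clauses` makes at least one atom in `goals` true.

    clauses: list of clauses, each a list of (atom, polarity) literals (CNF).
    goals:   atoms p_1..p_n, read as the disjunction p_1 v ... v p_n.
    """
    goals = list(goals)
    atoms = sorted({atom for clause in clauses for atom, _ in clause} | set(goals))
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        theory_holds = all(
            any(model[atom] == polarity for atom, polarity in clause)
            for clause in clauses
        )
        if theory_holds and not any(model[g] for g in goals):
            return False  # countermodel: theory holds but no sensitive atom does
    return True


# Hypothetical encoding of previous answers, background, and the new answer.
previous = [[("rank_tom_G", True)]]
rule = [[("rank_tom_G", False), ("pay_G_70K", False), ("salary_tom_70K", True)]]
new_answer = [[("pay_G_70K", True)]]
goals = ["salary_tom_70K", "salary_ann_85K"]

leaks_with_answer = entails_disjunction(previous + rule + new_answer, goals)
leaks_without_answer = entails_disjunction(previous + rule, goals)
print(leaks_with_answer, leaks_without_answer)
```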
Neither formulation is particularly practical or elegant. But there is a second serious problem, namely, the answer $a_{t+1}$ might amount to a partial information disclosure. That is, it might not allow the user to deduce a particular sensitive proposition such as (1.1), i.e., a specific value for some individual's sensitive attribute, but it may nevertheless provide helpful information in that it might eliminate certain alternatives, thereby narrowing the pool of possible values for the attribute in question. For instance, suppose that we are dealing with a company database and that company rank is a sensitive attribute. Suppose further that company rank is either E, F, G, or H; and that the user knows that the company rank of a certain employee $x$ is either F, G, or H. In reality, it is F. Now if the new query answer allows the user to eliminate H as a possibility, it is clear that it has given her some sensitive information, even though she remains unable to derive the actual database entry, $\mathrm{rank}(x) = \mathrm{F}$. The upshot is that a security breach may well occur even though (1.2) does not hold.
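The rank example can be sketched by tracking the set of values the user still considers possible. This is only an illustration under simplifying assumptions: the user's knowledge is modeled as a list of predicates over a finite domain, and the helper names `remaining_candidates` and `narrows` are hypothetical.

```python
def remaining_candidates(domain, constraints):
    """Values in `domain` consistent with everything the user knows."""
    return {value for value in domain if all(ok(value) for ok in constraints)}


def narrows(domain, prior_constraints, new_constraint):
    """Report whether the new knowledge strictly shrinks the candidate set."""
    before = remaining_candidates(domain, prior_constraints)
    after = remaining_candidates(domain, list(prior_constraints) + [new_constraint])
    return after < before, before, after


ranks = {"E", "F", "G", "H"}
prior = [lambda r: r != "E"]      # user already knows the rank is F, G, or H
new = lambda r: r != "H"          # the new answer rules out H

narrowed, before, after = narrows(ranks, prior, new)
exact_value_known = len(after) == 1
print(narrowed, sorted(before), sorted(after), exact_value_known)
```

Here the candidate set shrinks from three values to two, so information has leaked, yet no single value is determined and a check of the form (1.2) would not fire.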