Oftentimes it is desirable to analyze a database to learn statistical information about the population represented by the database. Typically, a query to such a database is of the form “How many members of a set S of entries/rows in the database satisfy a particular property P?”, where such property P may be expressed as a Boolean formula or as some more complex form of formula.
For example, it may be desirable with regard to a particular database to determine statistically whether, within the population represented thereby, a correlation may be found between two factors or sets of factors, such as whether, with regard to a medical database, patients who have heart disease also have a history of smoking tobacco. In particular, a query to a medical database might be fashioned to answer a question such as: “How many individuals as represented within the database are tobacco smokers?”, “How many individuals as represented within the database have heart disease?”, “How many individuals as represented within the database are tobacco smokers who suffer from heart disease?”, and the like.
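By way of illustration only, a counting query of the foregoing form may be sketched as follows. The record layout, the field names `smoker` and `heart_disease`, the toy data, and the function `count_query` are hypothetical and are assumed here solely for purposes of the example. The property P is modeled as a Boolean-valued function of a row.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, bool]

# Hypothetical toy database: one row per individual.
DATABASE: List[Row] = [
    {"smoker": True,  "heart_disease": True},
    {"smoker": True,  "heart_disease": False},
    {"smoker": False, "heart_disease": True},
    {"smoker": False, "heart_disease": False},
    {"smoker": True,  "heart_disease": True},
]


def count_query(rows: List[Row],
                subset: Iterable[int],
                prop: Callable[[Row], bool]) -> int:
    """Return how many rows indexed by `subset` (the set S) satisfy `prop` (P)."""
    return sum(1 for i in subset if prop(rows[i]))


everyone = range(len(DATABASE))

# The three example questions, each expressed as a Boolean property.
smokers = count_query(DATABASE, everyone, lambda r: r["smoker"])
diseased = count_query(DATABASE, everyone, lambda r: r["heart_disease"])
both = count_query(DATABASE, everyone,
                   lambda r: r["smoker"] and r["heart_disease"])
```

Note that the third query is simply the Boolean conjunction of the properties used in the first two.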
However, and significantly, it is oftentimes necessary based on a legal or moral standard or otherwise to protect the privacy of individuals as represented within a database under statistical analysis. Thus, a querying entity should not be allowed to directly query for information in the database relating to a particular individual, and also should not be allowed to indirectly query for such information either.
Given a large database, then, perhaps on the order of hundreds of thousands of entries where each entry corresponds to an individual, a need exists for a method to learn statistical information about the population as represented by such a database without compromising the privacy of any particular individual within such population. More particularly, a need exists for such a method by which an interface is constructed between the querying entity and the database, where such interface obscures each answer to a query to a large-enough degree to protect privacy, but not to such a large degree so as to substantively affect statistical analysis of such database.
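By way of illustration only, an interface of the general kind described above may be sketched as a wrapper that computes the true count and perturbs it before releasing it to the querying entity. The class name `NoisyInterface`, the use of zero-mean Gaussian noise, and the parameter `noise_scale` are assumptions made solely for the example; the foregoing does not specify the form or magnitude of the obscuring, and the sketch should not be taken as a statement of how much noise suffices to protect privacy.

```python
import random
from typing import Callable, Dict, Iterable, List, Optional

Row = Dict[str, bool]


class NoisyInterface:
    """Hypothetical interface between a querying entity and a database.

    The querying entity never sees the rows; it receives only a
    perturbed count for each query.
    """

    def __init__(self, rows: List[Row], noise_scale: float,
                 seed: Optional[int] = None) -> None:
        self._rows = rows
        self._noise_scale = noise_scale      # assumed standard deviation
        self._rng = random.Random(seed)

    def count(self, subset: Iterable[int],
              prop: Callable[[Row], bool]) -> float:
        true_count = sum(1 for i in subset if prop(self._rows[i]))
        # Zero-mean noise obscures the exact answer for any one query
        # while leaving the answer correct on average.
        return true_count + self._rng.gauss(0.0, self._noise_scale)


# Hypothetical population of 1000 individuals, half of whom smoke.
rows = [{"smoker": i % 2 == 0} for i in range(1000)]
interface = NoisyInterface(rows, noise_scale=5.0, seed=0)
answer = interface.count(range(len(rows)), lambda r: r["smoker"])
```

In this sketch the perturbation is small relative to a count on the order of hundreds, which reflects the stated goal that the obscuring not substantively affect statistical analysis.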
In at least some instances, the aforementioned large database is vertically partitioned in that a first portion of information with regard to each entry is in a first location and a second portion of the information with regard to each entry is in a second location. For example, it may be that the first location of the database has information regarding particular individuals that suffer from heart disease, and the second location of the database has information regarding which of such particular individuals are tobacco smokers.
As may be appreciated, reasons for such a partition are many and varied, and can include the portions of information having been collected by different entities, at different times, from different sources, and the like. As may also be appreciated, performing statistical analysis on such a vertically partitioned database may be difficult, especially if cross-referencing between the locations based on indicia identifying particular individuals is prohibited due to privacy concerns.
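By way of illustration only, the difficulty may be sketched with two hypothetical stores, each keyed by an assumed identifier for an individual. All names and values below are invented for the example. Each location can answer a count over its own attribute, but a direct count over both attributes requires matching entries by identifier, which is the very cross-referencing that may be prohibited.

```python
from typing import Dict

# Hypothetical first location: whether each individual has heart disease.
LOCATION_A: Dict[str, bool] = {"id1": True, "id2": False, "id3": True}

# Hypothetical second location: whether each individual smokes tobacco.
LOCATION_B: Dict[str, bool] = {"id1": True, "id2": True, "id3": False}


def local_count(store: Dict[str, bool]) -> int:
    """Count over a single attribute; needs only one location."""
    return sum(1 for value in store.values() if value)


def joint_count_by_cross_reference(a: Dict[str, bool],
                                   b: Dict[str, bool]) -> int:
    """Count over both attributes by matching identifiers.

    Shown only to make the problem concrete: this joins the two
    locations on the identifying indicia.
    """
    return sum(1 for key, value in a.items() if value and b.get(key, False))
```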
A need exists, then, for a method for statistically analyzing the database based on attributes that are stored in both locations while still satisfying such privacy concerns. In particular, a need exists for such a method where statistics for any Boolean combination of attributes stored in both locations can be learned. Thus, and to continue with the aforementioned example, a statistic such as the increase in risk of heart disease due to smoking can be computed in a privacy-preserving manner. Indeed, all statistics based on any two properties/attributes can be computed without violating privacy concerns.
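By way of illustration only, once aggregate counts for two attributes are available, the counts for every Boolean combination of the two follow by inclusion-exclusion, and a statistic such as an increase in risk can be derived from them. The sketch below is plain arithmetic on aggregate counts and is not itself a privacy-preserving method. The function names, the example counts, and the choice of relative risk as the measure of increased risk are assumptions made solely for the example.

```python
from typing import Dict


def contingency_counts(n: int, count_a: int, count_b: int,
                       count_both: int) -> Dict[str, int]:
    """Derive all four two-attribute cells from n, |A|, |B| and |A and B|."""
    return {
        "a_and_b": count_both,
        "a_not_b": count_a - count_both,
        "b_not_a": count_b - count_both,
        "neither": n - count_a - count_b + count_both,
    }


def relative_risk(n: int, smokers: int, diseased: int, both: int) -> float:
    """Risk of disease among smokers divided by risk among non-smokers."""
    non_smokers = n - smokers
    if smokers == 0 or non_smokers == 0:
        raise ValueError("both groups must be non-empty")
    risk_smokers = both / smokers
    risk_non_smokers = (diseased - both) / non_smokers
    if risk_non_smokers == 0:
        raise ValueError("relative risk is undefined when the baseline risk is zero")
    return risk_smokers / risk_non_smokers


# Hypothetical aggregate counts for a population of 1000.
cells = contingency_counts(n=1000, count_a=300, count_b=150, count_both=90)
risk_ratio = relative_risk(n=1000, smokers=300, diseased=150, both=90)
```

With the hypothetical counts shown, the relative risk evaluates to approximately 3.5. If the input counts were perturbed answers rather than exact ones, the derived cells and ratio would inherit that perturbation.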