Research projects in many fields often require access to confidential information to successfully fulfill stated research objectives. For example, medical researchers may need detailed information about demographics, treatments, disease progression, outcomes, and side-effects in a diverse patient population to plan scientifically valid tests of new treatments. At the same time, ethical and legal requirements to protect privacy rights are important conditions for enabling access to confidential data, particularly in the medical field with respect to patient information. An individual's employment, access to housing, or insurability might be adversely affected by disclosure of private healthcare records. Unfettered access to confidential information may result in the unintentional disclosure of information in a manner that would violate the privacy interests of the data subjects that contributed the information. Therefore, techniques are needed to protect the privacy or confidentiality of data subjects, while still allowing sufficient access to confidential information to enable socially desirable research objectives.
One such technique is the elimination or modification of data fields in a database that would observably identify a specific individual, such as a name field, full address field, or social security number field. This method has limited utility since it only addresses individual data fields, whereas in some instances, combinations of data fields could lead to a violation of a data subject's privacy. Consider a database query that searches a patient database for a specific combination of a birth date, zip code, and disease status. This combination of information may be enough to indirectly identify specific individuals, even if information that directly contains the identity of a unique data subject is obscured. For example, if the chosen zip code for this database query is a zip code that only applies to the White House in Washington, DC, then the results of this query would almost certainly allow identification of a specific data subject due to the limited number of individuals in that zip code. If the disease status of that data subject is highly confidential, then allowing this type of query could violate the data subject's privacy rights.
However, excluding any data in the database that might, in combination with other data, lead to a privacy or confidentiality disclosure would render many databases useless. Date of birth may be important for studying time-related effects of diseases and zip code important for studying geographic distributions of diseases. Simply disallowing any access to such data is often too inflexible to accomplish many research goals.
Another approach is to group related data subjects into cells, such as grouping data subjects by race, age group, and zip code, and disallowing access to any cell that has a "small" number of subjects. This method is inflexible in that it does not consider the specific query that may reference the cell, which means that some queries may be needlessly disallowed. Known methods also fail to provide systematic methods for identifying relevant cells, usually assuming that demographic variables are the principal or only attributes of concern.
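The cell-based approach described above can be illustrated with a minimal Python sketch. The field names, the sample records, and the minimum cell size of 3 are assumptions chosen for illustration only; they are not taken from any particular known system.

```python
from collections import Counter

def cell_counts(records, cell_keys):
    """Count the data subjects that fall into each cell (one tuple of key values)."""
    return Counter(tuple(r[k] for k in cell_keys) for r in records)

def suppressed_cells(records, cell_keys, min_cell_size):
    """Return the cells that hold fewer than min_cell_size subjects."""
    counts = cell_counts(records, cell_keys)
    return {cell for cell, n in counts.items() if n < min_cell_size}

def accessible_records(records, cell_keys, min_cell_size):
    """Drop every record that belongs to a suppressed cell."""
    blocked = suppressed_cells(records, cell_keys, min_cell_size)
    return [r for r in records if tuple(r[k] for k in cell_keys) not in blocked]

# Hypothetical cell definition and sample data (illustrative assumptions).
CELL_KEYS = ("race", "age_group", "zip_code")
RECORDS = [
    {"race": "A", "age_group": "30-39", "zip_code": "20001", "disease_status": "negative"},
    {"race": "A", "age_group": "30-39", "zip_code": "20001", "disease_status": "positive"},
    {"race": "A", "age_group": "30-39", "zip_code": "20001", "disease_status": "negative"},
    {"race": "B", "age_group": "50-59", "zip_code": "20500", "disease_status": "positive"},
]

if __name__ == "__main__":
    print(suppressed_cells(RECORDS, CELL_KEYS, min_cell_size=3))
    print(len(accessible_records(RECORDS, CELL_KEYS, min_cell_size=3)))
```

In this sketch the suppression decision depends only on the cell size and never on the query being asked, so a query that would not reveal anything sensitive about the single subject in the small cell is blocked all the same. That is the inflexibility noted above.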
Yet another approach is to provide access only to summaries of confidential information while withholding the detailed information used to compile those summaries. However, certain types of research questions can only be answered with access to the detailed data and simply disallowing this access would frustrate a researcher's ability to accomplish research goals.
Another approach is an "ad hoc" method of determining whether certain types of information should be withheld from a researcher. This method relies upon the judgement of a person responsible for managing the confidential information to decide upon the available scope of use or disclosure for that information. Many drawbacks exist with this ad hoc method, including lack of a systematic and consistent approach to classifying data, inability to consider the classification of data in light of specific database queries, and total reliance upon the judgement of persons designated with responsibility to classify the information, who may or may not be qualified or adequately trained.
Therefore, it is advantageous to implement a system and method to intelligently and systematically classify data items and/or data queries based upon their potential to violate privacy or confidentiality terms under which the data was originally assembled. Such a system is applicable to fields other than just the health-care example discussed above, including but not limited to, census data, financial records, motor vehicle information, and national security data.
In view of the above, the present invention is advantageous in that it provides a method to selectively evaluate data items and search queries for privacy violations. According to an embodiment of the invention, attributes of a data item are identified and quantified to evaluate their potential to violate privacy interests. Search queries can be evaluated before accessing the records according to an embodiment, which improves operating efficiency and provides additional privacy protections. The query evaluation determines whether to disallow a query or withhold a query result if an individual or a small group of individuals can be identified by the results or if variables revealed will violate privacy policies. Also, the invention enables a provider of information to systematically evaluate the selectivity and visibility of attributes that are analyzed to allow or disallow queries and to set thresholds for different combinations of attributes. This allows implementation of different privacy policies, which may vary based on the database, the attributes, and/or the source of the query.
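As a rough illustration of evaluating a query before any records are accessed, the following Python sketch scores a query using only per-attribute metadata and compares the result with a policy threshold. The numeric scores, the scoring rule (sum of selectivity times visibility), the threshold values, and all names are hypothetical assumptions made for this sketch; they do not represent the scoring used by any embodiment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttributeScore:
    # Both scales are assumptions for this sketch, not definitions from the text.
    selectivity: float  # 0.0 (narrows nothing) to 1.0 (nearly unique per subject)
    visibility: float   # 0.0 (rarely known to outsiders) to 1.0 (publicly known)

# Hypothetical per-attribute scores assigned by the provider of the information.
SCORES = {
    "birth_date": AttributeScore(selectivity=0.9, visibility=0.8),
    "zip_code": AttributeScore(selectivity=0.7, visibility=0.9),
    "disease_status": AttributeScore(selectivity=0.3, visibility=0.1),
    "age_group": AttributeScore(selectivity=0.2, visibility=0.8),
}

def query_risk(attributes, scores=SCORES):
    """Combine attribute scores into one value (assumed rule: sum of products)."""
    return sum(scores[a].selectivity * scores[a].visibility for a in attributes)

def evaluate_query(attributes, threshold, scores=SCORES):
    """Decide from metadata alone, without touching any records."""
    return "allow" if query_risk(attributes, scores) <= threshold else "disallow"

# Hypothetical policy: thresholds may differ by the source of the query.
THRESHOLDS = {"internal_researcher": 1.0, "external_requester": 0.5}

if __name__ == "__main__":
    risky = ["birth_date", "zip_code", "disease_status"]
    coarse = ["age_group", "disease_status"]
    print(evaluate_query(risky, THRESHOLDS["internal_researcher"]))
    print(evaluate_query(coarse, THRESHOLDS["internal_researcher"]))
```

Because the decision uses only attribute metadata, a query can be rejected without reading the underlying records, and a different policy can be expressed by changing the scores or the threshold for a given database or query source.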
Further aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims.