Vast quantities of information about individuals are currently collected and analyzed by a broad spectrum of organizations. While these data clearly hold great potential for analysis, they are commonly collected on the understanding that they will be kept private. Careless disclosures can cause harm to the data's subjects and jeopardize future access to such sensitive information.
Data disclosure horror stories about “anonymized” and “de-identified” data typically refer to non-interactive approaches in which certain kinds of information in each data record have been suppressed or altered. A famous example is America Online's release of a set of “anonymized” search query logs. People search for many obviously disclosive things, such as their full names, their own social security numbers (to see if their numbers are publicly available on the web, possibly with a goal of assessing the threat of identity theft), and even the combination of mother's maiden name and social security number. AOL carefully redacted such obviously disclosive “personally identifiable information” and replaced each user ID with a random string. However, search histories can be very idiosyncratic, and with the help of data from sources other than the AOL database, a New York Times reporter correctly connected one of the “anonymized” search histories to a specific resident of Georgia.
In these so-called linkage attacks, an attacker (sometimes called an adversary) links the “anonymized” data to auxiliary information found in other databases or other sources of information. Although each of these databases may be innocuous by itself, the combination of information can allow enough inferences to identify subjects in the “anonymized” data and thereby violate their privacy. Examples like the AOL database have shown that even when great care is taken to anonymize statistical data, auxiliary information can defeat the anonymization. This realization has led to interest in data analysis techniques, such as differential privacy, that can mathematically guarantee that the inclusion of a subject's sensitive data in a database does not discernibly increase the likelihood of that sensitive data becoming public.
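The mechanics of a linkage attack can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records, the field names (`pseudonym`, `zip`, `birth_year`), and the helper `link_records` are hypothetical and do not come from the AOL data or any real database. The sketch joins an “anonymized” table to an auxiliary table on shared attributes and reports a match only when exactly one auxiliary record fits.

```python
def link_records(anonymized, auxiliary, quasi_identifiers):
    """Return (pseudonym, name) pairs for records whose quasi-identifier
    values match exactly one record in the auxiliary data."""
    # Index the auxiliary data by the combination of quasi-identifier values.
    index = {}
    for person in auxiliary:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for record in anonymized:
        key = tuple(record[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        # Only an unambiguous match counts as a re-identification here.
        if len(candidates) == 1:
            matches.append((record["pseudonym"], candidates[0]))
    return matches


# Fictional "anonymized" release: user IDs replaced with random strings.
anonymized = [
    {"pseudonym": "u-7f3a", "zip": "30047", "birth_year": 1944},
    {"pseudonym": "u-19c2", "zip": "30301", "birth_year": 1980},
]

# Fictional auxiliary source that is innocuous on its own.
auxiliary = [
    {"name": "Alice Example", "zip": "30047", "birth_year": 1944},
    {"name": "Bob Example", "zip": "30301", "birth_year": 1980},
    {"name": "Carol Example", "zip": "30301", "birth_year": 1980},
]

reidentified = link_records(anonymized, auxiliary, ["zip", "birth_year"])
print(reidentified)
```

In this toy data, the first pseudonym is linked to a single name, while the second stays ambiguous because two auxiliary records share the same attribute values. Neither table contains a name next to a pseudonym, yet the combination still identifies a subject.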
In database technology, a query plan (or query execution plan) is an ordered set of steps used to access data in a Structured Query Language (SQL) relational database management system (RDBMS). Query plans can also be thought of as a specific case of the relational model concept of access plans. Since SQL is declarative, there are typically a large number of alternative ways to execute a given query, although the alternatives can have widely varying performance. When a query is submitted to the database, a query optimizer evaluates some of the possible plans for executing the query and returns what it considers the “best” option under the given constraints. Typically, a query plan is designed to optimize the use of computing resources available to a database server (e.g., processing power, memory, and bandwidth) and to execute a query as fast as possible.
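As a concrete illustration of how an optimizer chooses among alternative plans, the following sketch uses SQLite through Python's standard `sqlite3` module. The choice of SQLite, the `users` table, and the helper `plan_details` are assumptions made for this example only; the text above does not name a particular database. The same declarative query is planned twice, once before and once after an index is created.

```python
import sqlite3

# In-memory database with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, zip TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (zip, age) VALUES (?, ?)",
    [("30047", 62), ("30301", 28), ("30301", 41)],
)

query = "SELECT COUNT(*) FROM users WHERE zip = '30301'"


def plan_details(connection, sql):
    """Return the human-readable step descriptions of the plan SQLite chose."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The description text is the last column of each plan row.
    return [row[-1] for row in rows]


# Without an index, the optimizer typically has to scan the whole table.
plan_before = plan_details(conn, query)

# With an index on the filtered column, a cheaper plan becomes available.
conn.execute("CREATE INDEX idx_users_zip ON users (zip)")
plan_after = plan_details(conn, query)

# The query text and its result are unchanged; only the plan differs.
result = conn.execute(query).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("count: ", result)
```

The exact wording of the plan descriptions varies between SQLite versions, so the sketch prints them rather than assuming a fixed format. The point is that the query itself never says how to find the rows; the optimizer selects the steps based on what access paths exist.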