Today's globally networked society places great demand on the dissemination and sharing of person-specific data for many new and exciting uses. Even situations where aggregate statistical information was once the reporting norm now rely heavily on the transfer of microscopically detailed transaction and encounter information. This happens at a time when more and more historically public information is also electronically available. When these data are linked together, they provide an electronic shadow of a person or organization that is as identifying and personal as a fingerprint, even when the information contains no explicit identifiers such as name or phone number. Other distinctive data, such as birth date or zip code, often combine uniquely and can be linked to publicly available information to reidentify individuals. Producing anonymous data that remain specific enough to be useful is often a very difficult task, and practice today tends either to assume incorrectly that confidentiality is maintained when it is not or to produce data that are practically useless.
One type of commonly shared data is electronic medical records. Analysis of the detailed information contained within electronic medical reports promises advantages to society, including improvements in medical care, reduced institution cost, the development of predictive and diagnosis support systems and the integration of applicable data from multiple sources into a unified display for clinicians. These benefits, however, require sharing the contents of medical records with secondary viewers, such as researchers, economists, statisticians, administrators, consultants, and computer scientists, to name a few. The public would probably agree that these secondary parties should know some of the information buried in the records, but that such disclosures should not risk identifying patients.
There are three major difficulties in providing anonymous data. One of the problems is that anonymity is in the eye of the beholder. Consider an HIV testing center located in a heavily populated community within a large metropolitan area. If the table shown in FIG. 1 shows the results for two days of testing, then it may not appear very anonymous if the left-most column is the date, the middle column is the patient's phone number and the right-most column holds the results. An electronic phone directory can match each phone number to a name and address. Although this does not identify the specific member of the household tested, the possible choices have narrowed to a particular address.
Alternatively, if the middle column in the table of FIG. 1 holds random numbers assigned to samples, then identifying individuals becomes more difficult, but this still cannot guarantee that the data are anonymous. If a person with inside knowledge (e.g., a doctor, a patient, a nurse, an attendant, or even a friend of the patient) recognizes a patient and recalls that the patient was the second person tested that day, then the results are not anonymous to the insider. In a similar vein, medical records distributed with a provider code assigned by an insurance company are often not anonymous because thousands of administrators often have directories that link the provider's name, address and phone number to the assigned code.
As another example, consider the table of FIG. 2. If the contents of this table are a subset of an extremely large and diverse data source, then the three records listed in the table of FIG. 2 may appear anonymous. Suppose the zip code 33171 primarily consists of a retirement community; then there are very few people of such a young age living there. Likewise, 02657 is the zip code for Provincetown, Mass., where there may be only about five black women living year-round. The zip code 20612 may have only one Asian family. In these cases, information outside the data identifies the individuals.
Most towns and cities sell locally collected census data or voter registration lists that include the date of birth, name and address of each resident. This information can be linked to medical data that include a date of birth and zip code, even if the names, social security numbers and addresses of the patients are not present. Of course, census data are usually not very accurate in college towns and in areas that have a large transient community, but for much of the adult population in the United States, local census information can be used to reidentify deidentified data, since other personal characteristics, such as gender, date of birth and zip code, often combine uniquely to identify individuals.
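By way of illustration only, the linking described above can be sketched as a join between a deidentified medical table and a public list on the shared fields. The field names (`birth_date`, `zip`, `gender`, `name`, `diagnosis`), the function names, and every record below are hypothetical and invented for this sketch; they are not taken from any actual data source.

```python
from collections import defaultdict

# Hypothetical field names assumed to appear in both tables.
LINK_KEYS = ("birth_date", "zip", "gender")

def link_records(medical_rows, public_rows, keys=LINK_KEYS):
    """Pair each medical row with the public rows sharing its key values."""
    index = defaultdict(list)
    for row in public_rows:
        index[tuple(row[k] for k in keys)].append(row)
    return [(m, index.get(tuple(m[k] for k in keys), [])) for m in medical_rows]

def reidentified(medical_rows, public_rows, keys=LINK_KEYS):
    """Return (name, medical_row) for medical rows matching exactly one person."""
    return [
        (matches[0]["name"], m)
        for m, matches in link_records(medical_rows, public_rows, keys)
        if len(matches) == 1
    ]

# Invented toy data: no names or addresses appear in the medical table.
medical = [
    {"birth_date": "1960-01-01", "zip": "00001", "gender": "F", "diagnosis": "condition A"},
    {"birth_date": "1975-05-05", "zip": "00002", "gender": "M", "diagnosis": "condition B"},
]
public = [
    {"name": "Resident 1", "birth_date": "1960-01-01", "zip": "00001", "gender": "F"},
    {"name": "Resident 2", "birth_date": "1975-05-05", "zip": "00002", "gender": "M"},
    {"name": "Resident 3", "birth_date": "1975-05-05", "zip": "00002", "gender": "M"},
]

if __name__ == "__main__":
    for name, row in reidentified(medical, public):
        print(name, "->", row["diagnosis"])
```

In this toy example the first medical record matches a single resident and is therefore reidentified, while the second matches two residents and remains ambiguous between them.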
A second problem with producing anonymous data concerns unique and unusual information appearing within the data themselves. Consider the data source shown in the table of FIG. 3. It is not surprising that the social security number is uniquely identifying, or given the size of the illustrated data source, that the birth date is also unique. To a lesser degree, the zip code identifies individuals since it is almost unique for each record. Importantly, what may not have been known without close examination of the particulars of this data source is that the designation of Asian ethnicity is uniquely identifying. Any single uniquely occurring value can be used to identify an individual. Remember that the unique characteristic may not be known beforehand. It could be based on diagnosis, achievement, birth year, visit date, or some other detail or combination of details available to the memory of a patient or a doctor, or knowledge about the data source from some other source.
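One simple way to surface such values, offered here only as a sketch, is to count how often each value occurs in each field and report those that occur exactly once. The function name, field names, and rows below are hypothetical and do not reproduce the table of FIG. 3.

```python
from collections import Counter

def unique_values(rows):
    """Map each field name to the values that occur in exactly one row."""
    result = {}
    fields = rows[0].keys() if rows else []
    for field in fields:
        counts = Counter(row[field] for row in rows)
        singles = [value for value, n in counts.items() if n == 1]
        if singles:
            result[field] = singles
    return result

# Invented toy data with placeholder values.
rows = [
    {"ethnicity": "group A", "birth_year": 1964, "zip": "00001"},
    {"ethnicity": "group A", "birth_year": 1964, "zip": "00001"},
    {"ethnicity": "group B", "birth_year": 1964, "zip": "00002"},
    {"ethnicity": "group A", "birth_year": 1965, "zip": "00002"},
]

if __name__ == "__main__":
    for field, values in unique_values(rows).items():
        print(field, values)
```

This sketch examines single fields only. Combinations of fields can also be unique, and detecting those requires counting tuples of values rather than individual values.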
Measuring the degree of anonymity in released data poses a third problem when producing anonymous data for practical use. The Social Security Administration (SSA) releases public-use files based on national samples with small sampling fractions (usually less than 1 in 1,000). The files contain no geographic codes, or at most regional or size-of-place designators. The SSA recognizes that data containing individuals with unique combinations of characteristics can be linked or matched with other data sources. Thus, the SSA's general rule is that any subset of the data that can be defined in terms of combinations of characteristics must contain at least five individuals. This notion of a minimal bin size, which reflects the smallest number of individuals matching the characteristics, is useful in providing a degree of anonymity within data: the larger the bin size, the more anonymous the data. As the bin size increases, the number of people to whom a record may refer also increases, thereby masking the identity of the actual person.
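The minimal bin size can be illustrated with a short sketch that groups records by their combination of characteristics and reports the smallest group. The function names, field names, and records are hypothetical; the default threshold of five mirrors the SSA rule stated above.

```python
from collections import Counter

def min_bin_size(rows, fields):
    """Smallest number of rows sharing one combination of values in `fields`.

    Returns 0 when there are no rows.
    """
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return min(counts.values()) if counts else 0

def meets_bin_size(rows, fields, k=5):
    """True if every combination of values in `fields` covers at least k rows."""
    return min_bin_size(rows, fields) >= k

# Invented toy data: five records in one bin, two in another.
rows = (
    [{"zip": "00001", "gender": "F"} for _ in range(5)]
    + [{"zip": "00001", "gender": "M"} for _ in range(2)]
)

if __name__ == "__main__":
    print(min_bin_size(rows, ("zip", "gender")))
    print(meets_bin_size(rows, ("zip", "gender")))
    print(meets_bin_size(rows, ("zip",)))
```

Because a bin defined on fewer fields is a union of bins defined on more fields, checking the full set of characteristics is sufficient: if that check passes, every coarser combination passes as well.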
In medical data sources, the minimum bin size should be much larger than the SSA guidelines suggest for three reasons: (1) most medical data sources are geographically located, so one can presume, for example, the zip codes of a hospital's patients; (2) the fields in a medical data source provide a tremendous amount of detail, and any field can be a candidate for linking to other data sources in an attempt to reidentify patients; and (3) most releases of medical data are not randomly sampled with small sampling fractions, but instead include most, if not all, of the data source.
Determining the optimal bin size to ensure anonymity is not a simple task. It depends on the frequencies of characteristics found within the data as well as within other sources for reidentification. In addition, the motivation and effort required to reidentify released data in cases where virtually all possible candidates can be identified must be considered. For example, if data are released that map each record to ten possible people, and the ten people can be identified, then all ten candidates may be contacted or visited in an effort to locate the actual person. Likewise, if the mapping is 1 in 100, all 100 could be phoned because visits may be impractical, and if the mapping is 1 in 1,000, a direct mail campaign could be employed. The amount of effort the recipient is willing to spend depends on the recipient's motivation. Some medical files are quite valuable, and valuable data will merit more effort. In these cases, the minimum bin size must be further increased or the sampling fraction reduced to render those efforts useless.
The anonymity concerns described above, which arise from the dissemination and sharing of person-specific data, must be weighed against the fact that there is presently unprecedented growth in the number and variety of person-specific data collections and in the sharing of this information. The impetus for this explosion has been the proliferation of inexpensive, fast computers with large storage capacities operating in ubiquitous network environments.
There is no doubt that society is moving toward an environment in which it could have almost all the data on all the people. As a result, data holders are finding it increasingly difficult to produce anonymous and declassified information in today's globally networked society. Most data holders do not even realize the jeopardy in which they place financial, medical, or national security information when they erroneously rely on security practices of the past. Technology has eroded previous protections, leaving the information vulnerable. In the past, a person seeking to reconstruct private information was limited to visiting disparate file rooms and engaging in labor-intensive review of printed material in geographically distributed locations. Today, one can access voluminous worldwide public information using a standard hand-held computer and ubiquitous network resources. Thus, from seemingly anonymous data and available public and semi-public information, one can often draw damaging inferences about sensitive information. However, one cannot seriously propose that all information with any links to sensitive information be suppressed. Society has developed an insatiable appetite for all kinds of detailed information for many worthy purposes, and modern systems tend to distribute information widely.