1. Field of the Invention
The present invention relates generally to data management, and more particularly, to a system and method for anonymously linking a plurality of data records.
2. Relevant Technology
Recent years have seen an increased expectation of confidentiality in personally identifiable information stored in computer databases. For example, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) requires the maintenance of appropriate security measures to preserve the confidentiality of health information. In particular, HIPAA establishes severe penalties for “wrongful disclosure” of health information that is individually identifiable.
However, for legitimate purposes, it is important to be able to associate multiple records (from one or more data sources) related to a single individual. For example, the linking of related data records is crucial in performing certain research studies, such as horizontal and longitudinal studies. A horizontal study is conducted at one point in time, across multiple lines of service, such as services provided by physicians, pharmacists, hospitals, laboratories, and the like. A longitudinal study, on the other hand, is conducted on services provided over time to the same patient or population.
Consequently, the removal of personally identifying information from data records (hereinafter referred to as “de-identification” or “anonymization”) to comply with privacy regulations, without degrading the ability to conduct both horizontal and longitudinal studies, is of critical importance to researchers.
However, the linking of anonymized data records presents at least two problems not fully solved by conventional approaches. First, all of the records pertaining to a single individual need to be identified. As explained below, this can be difficult because records from different data sources may use different types of personal identifiers, many of which may not uniquely identify an individual, or may even change over time for the same individual.
Second, the records of an individual should be anonymously linked in such a way that it would be difficult or impossible to discover the identity of the individual to whom the records pertain. As explained hereafter, conventional approaches, such as a Master Patient Index (MPI), do not protect an individual's privacy, since they may be used to recover personally identifying information from an assigned identification code.
Unfortunately, the lack of a single, universal identifier in the United States complicates the problem of identifying all of the records of a single individual. For example, the U.S. healthcare system does not have a unique identifier for each patient. As such, a multitude of different identification schemes have developed.
For instance, a person is typically issued one healthcare identifier by a medical plan, another by a dental plan, and yet another from a pharmacy benefits manager, and possibly several more from secondary coverage sources. Some identifiers may be based on a Social Security Number (SSN) with the optional addition of a “person code.” However, in many cases, the healthcare identifiers are independently created, alphanumeric codes of between 5 and 18 digits.
The identification problem is further complicated by the volatility of personal identifiers. For example, people may change their names through marriage, divorce, adoption, or the like. Other personal identifiers, such as an individual's address, city, and ZIP code, are even more volatile than the individual's name. Even the once stable characteristic of gender is now subject to change. The date of birth is frequently the only stable characteristic, although it is not useful, alone, in uniquely identifying an individual.
As noted above, the associated problem of anonymously linking a plurality of related records lies in the fact that conventional approaches typically do not prevent identification of the individual to whom the records pertain. For example, hospitals often assign a unique number or code to each patient for purposes of linking patient transaction records from different sources, e.g., radiological reports, pathology reports, physician's notes, and the like.
Although the records may include a variety of personal identifiers, such as a name and date of birth, the patient code is generally the only identifier used within the hospital to correlate the patient's records. One purpose of exclusively relying on the patient code is to prevent confusion between two or more patients sharing the same name or other personal identifiers.
Typically, the hospital maintains a Master Patient Index (MPI), which associates the patient's personal identifiers with the assigned patient code. By providing one or more personal identifiers, doctors or hospital staff may obtain from the MPI the patient's assigned code.
Unfortunately, the MPI may also be used to reverse the process and obtain the patient's personal information from the patient code. As such, the MPI is not well suited for anonymously linking the patient's records.
Industries other than healthcare would similarly benefit from a system for anonymously linking data records. For example, consumer studies could be conducted using large clearinghouses of consumer data without jeopardizing the privacy of consumers by the misuse of such data. Moreover, even where anonymization of the data records is not required, a system for identifying records related to the same individual would be highly desirable.
Accordingly, what is needed is a system and method for identifying a plurality of data records related to the same individual. Moreover, what is needed is a system and method for anonymously linking the plurality of related data records, such that the records may be de-identified and used in research studies and the like.
The present invention solves many or all of the foregoing problems by providing an anonymized linking system and method. In one aspect of the invention, a first identity reference encoding module encodes a first encoded identity reference from a first subset of the identifying elements of a data record. The first encoded identity reference may comprise, for example, a one-way hash of the first subset of identifying elements.
In another aspect, a second identity reference encoding module encodes a second encoded identity reference from a second subset of the identifying elements of the data record. The first and second subsets may be disjoint, or may include one or more identifying elements in common.
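By way of illustration only, the encoding of the first and second encoded identity references might be sketched as follows. The choice of SHA-256, the field names, the normalization step, and the function name encode_identity_reference are all assumptions made for this example; the description above requires only a one-way hash of a subset of identifying elements.

```python
import hashlib


def encode_identity_reference(record, fields):
    """Return a one-way hash of the selected identifying elements.

    SHA-256 and the normalization below are assumptions for this sketch.
    """
    # Normalize so that case and surrounding whitespace do not alter the hash.
    parts = [str(record[name]).strip().lower() for name in fields]
    # Join with a separator character unlikely to occur inside the values.
    material = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha256(material).hexdigest()


# Hypothetical data record; every field name here is illustrative.
record = {
    "last_name": "Doe",
    "first_name": "Jane",
    "dob": "1970-01-31",
    "ssn": "000-00-0000",
    "service": "laboratory",
}

# The two subsets share one identifying element (the date of birth).
first_ref = encode_identity_reference(record, ["last_name", "first_name", "dob"])
second_ref = encode_identity_reference(record, ["ssn", "dob"])
```

In this sketch the two subsets overlap in one element, which the description permits; disjoint subsets would be encoded in the same way.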
In yet another aspect, an anonymization code assignment module assigns to each of the first and second encoded identity references an identical anonymization code for anonymously representing the individual associated with the data record. In one embodiment, the anonymization code may include a unique serial number.
In still another aspect, an anonymization code lookup module determines whether each of the first and second encoded identity references has an assigned anonymization code stored within an anonymization code database. In one embodiment, the anonymization code database comprises an “anonymous index” for linking at least one encoded identity reference to an assigned anonymization code.
If neither the first nor the second encoded identity reference is found to have an assigned anonymization code, the anonymization code assignment module may provide a new anonymization code and assign the new anonymization code to both the first and second encoded identity references. If, however, one encoded identity reference is found to have an assigned anonymization code while the other encoded identity reference does not, the anonymization code assignment module may assign the same anonymization code found to be associated with the one encoded identity reference to the other encoded identity reference.
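The lookup and assignment behavior just described might be sketched, purely as an example, with an in-memory dictionary standing in for the anonymization code database. The class name AnonymousIndex, its methods, and the use of a serial counter starting at 1 are assumptions of this sketch. The case in which both references already carry different codes is not addressed in the passage above, so the sketch simply raises an error there.

```python
import itertools


class AnonymousIndex:
    """Hypothetical in-memory stand-in for the anonymization code database."""

    def __init__(self):
        self._codes = {}                   # encoded identity reference -> code
        self._serial = itertools.count(1)  # source of new serial-number codes

    def lookup(self, ref):
        """Return the code assigned to ref, or None if none is stored."""
        return self._codes.get(ref)

    def assign(self, first_ref, second_ref):
        """Assign one identical code to both references and return it."""
        found = {
            code
            for code in (self.lookup(first_ref), self.lookup(second_ref))
            if code is not None
        }
        if not found:
            # Neither reference is known: provide a new anonymization code.
            code = next(self._serial)
        elif len(found) == 1:
            # One reference (or both, consistently) is known: reuse its code.
            code = found.pop()
        else:
            # Assumption: conflicting codes are outside the scope of this sketch.
            raise ValueError("references map to different anonymization codes")
        self._codes[first_ref] = code
        self._codes[second_ref] = code
        return code


index = AnonymousIndex()
# Placeholder strings stand in for real one-way hashes.
code_a = index.assign("hash-of-name-and-dob", "hash-of-ssn-and-dob")
# A later record with a changed name still links through the shared reference.
code_b = index.assign("hash-of-new-name-and-dob", "hash-of-ssn-and-dob")
# An unrelated individual receives a fresh code.
code_c = index.assign("hash-of-other-name", "hash-of-other-ssn")
```

Because the index stores only encoded identity references, a lookup by anonymization code yields hashes rather than the identifying elements themselves.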
In another aspect, an anonymization code insertion module may insert the assigned anonymization code into the data record, while an identifying element removal module optionally removes the plurality of identifying elements from the data record, thus anonymizing or de-identifying the data record.
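As a minimal sketch of the insertion and removal steps, assuming a data record represented as a dictionary, the two modules might be combined in a single function. The function name anonymize_record, the key "anonymization_code", and the field names are hypothetical and appear nowhere in the description.

```python
def anonymize_record(record, anonymization_code, identifying_fields,
                     remove_identifiers=True):
    """Return a copy of record carrying the code, optionally de-identified."""
    result = dict(record)  # leave the caller's record unmodified
    # Insert the assigned anonymization code into the data record.
    result["anonymization_code"] = anonymization_code
    if remove_identifiers:
        # Removal is optional, mirroring the description above.
        for name in identifying_fields:
            result.pop(name, None)
    return result


# Hypothetical data record and identifying elements.
original = {
    "last_name": "Doe",
    "dob": "1970-01-31",
    "ssn": "000-00-0000",
    "service": "laboratory",
}
deidentified = anonymize_record(original, 1, ["last_name", "dob", "ssn"])
```

Here deidentified retains only the non-identifying content and the anonymization code, while original is left unchanged.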
These and other objects, features, and advantages of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.