An increasing amount of patient healthcare data regarding disease and treatment is being electronically entered and recorded. For example, a healthcare provider may electronically submit healthcare data for the purpose of receiving payment for services rendered. The data generally flows from the healthcare provider to a clearinghouse or a provider of electronic data interchange and related services. Healthcare data submitted can include standardized codes to describe the diagnosis made, services performed, or products used.
As patient data regarding disease and treatment becomes more widely recorded and available, linking data for individual patients from different data sources created at different times would be advantageous, for example, when a researcher wants to study certain variables, such as patients' diagnoses, procedures performed, or drugs prescribed.
However, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) restricts entities covered under HIPAA from disclosing protected health information (“PHI”). The disclosure of PHI is regulated because it is healthcare data with personally identifiable information (“PII”). Many data sources would be considered covered entities because they produce information that may contain PHI, and PHI, through its associated PII, can be used to positively identify a person. Such information, which contains PII and concerns individual privacy, is strictly protected by HIPAA. Under HIPAA, covered entities cannot disclose PII to third parties, except in limited circumstances, such as to other authorized entities for billing purposes. Thus, healthcare data used by non-covered entities for research, analysis, and/or reporting needs to be de-identified so that the data no longer contains PII. Consequently, direct identifiers, such as names, elements of addresses (except zip codes if they cover a sufficiently large population), birth dates, social security numbers, insurance policy numbers, license numbers, or any other unique identifier that may allow patient identification, must be removed. As a result, researchers are limited to data that may not include a particular desired variable, such as the demographic variables needed to study the prevalence of a particular disease in a particular area, because any demographic data, even indirect identifiers, appended to de-identified patient data increases the risk of identifying an individual.
Thus, under HIPAA, the healthcare data transmitted by covered entities must be de-identified so that it no longer contains PII. HIPAA stipulates two methods for de-identifying data. The first method is based on the safe harbor provision, which directs the removal of 18 enumerated identifiers, such as names, geographic subdivisions smaller than a state, dates directly related to an individual, phone numbers, fax numbers, email addresses, social security numbers, medical record numbers, health plan beneficiary numbers, account numbers, certificate/license numbers, vehicle identifiers and serial numbers, device identifiers and serial numbers, web universal resource locators, Internet protocol address numbers, biometric identifiers, full face photographic and comparable images, and other unique identifiers. The second method is based on statistical de-identification. An entity covered under HIPAA may determine that the health information is not individually identifiable health information only if a person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information individually unidentifiable, applying such principles and methods, (1) determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is the subject of the information, and (2) documents the methods and results of the analysis that justify such a determination, as described in “HIPAA Certification for SDI's De-Identification Technology” by Fritz Scheuren, Ph.D. and Patrick Baier, D.Phil., dated Jun. 4, 2007.
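By way of illustration only, the removal step of the safe harbor method can be sketched as a filter that drops identifier fields from a structured record. The following Python sketch is a simplified assumption and not a complete or certified de-identification procedure: the field names in SAFE_HARBOR_FIELDS and the function strip_direct_identifiers are hypothetical, the record is assumed to be a flat dictionary, and identifiers embedded in free text or in date values are not handled.

```python
# Illustrative sketch only. The field names below are hypothetical; a real
# data source would map its own schema onto the enumerated identifier types.
SAFE_HARBOR_FIELDS = frozenset({
    "name", "street_address", "city", "zip_code", "birth_date",
    "phone", "fax", "email", "ssn", "medical_record_number",
    "health_plan_beneficiary_number", "account_number", "license_number",
    "vehicle_identifier", "device_identifier", "url", "ip_address",
    "biometric_identifier", "photo",
})


def strip_direct_identifiers(record):
    """Return a copy of the record without the listed identifier fields.

    The input record is left unchanged.
    """
    return {
        field: value
        for field, value in record.items()
        if field not in SAFE_HARBOR_FIELDS
    }


# Example with made-up values: only the coded clinical fields remain.
claim = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "zip_code": "00000",
    "diagnosis_code": "E11.9",
    "procedure_code": "99213",
}
deidentified_claim = strip_direct_identifiers(claim)
```

In this sketch the zip code is dropped entirely for simplicity. Note that once the identifier fields are removed, nothing remains in the record that would allow it to be linked to another record for the same patient.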
As described by Scheuren and Baier, known methods append additional information to the de-identified patient data. One method appends additional information in a non-specific way, such as with the zip code or other grouping information, as discussed in the “Description of the Related Art” in U.S. Patent Application Pub. No. 2004/0199781, entitled “Data Source Privacy Screening Systems and Methods,” by Erickson et al. Another method appends only limited variables in order to minimize the risk of identification, as also discussed in U.S. Patent Application Pub. No. 2004/0199781. The disadvantages of these approaches are that (1) they assume that all individuals in a particular group share the same appended characteristic data, (2) they limit the number of discrete variables that can be included in any analysis, (3) they require a very high degree of oversight and review by an approved statistician, and/or (4) they carry a risk of re-identification, as the party that holds the merged data may have enough data available to re-identify an individual, in violation of HIPAA, by combining the data with demographic or other available variables.
Thus, there continues to be a need for a system and a method that allow patient healthcare data from different data sources, created at different times, to be associated without using PII that can identify the patient.