The worlds of database coordination, data rights and data usage are inherently paradoxical, since privacy-preserving legal rights restrict the use of technical functions in some circumstances while permitting these same technical functions in other circumstances. Simply stated, functions such as sort, search, merge, and Boolean logical operators are the pith and marrow of database operations—except when one of the database fields, or a combination of several fields, may lead to identification of a person.
Identifiable data need not come from a single field and need not be explicit. For example, a study of US Census data demonstrated that 87% of the US population can be uniquely identified based on Date-Of-Birth, Sex and ZIP code alone. There is also the issue of re-identifying someone by means of an external public database (such as voter registration records that include DOB, Sex and ZIP). In short, the real issue is the level of uniqueness of a record and not necessarily any specific field. It is with this very concern in mind that data providers bundle their information goods into identity camouflaged collections, or otherwise aggregate records or “trim” down the data to create more “same” records (e.g. report only the first three digits of a ZIP code or report only year of birth)—so that one cannot know, at a certain level of probability, whether some particular John Doe is present in one category of an eventual statistical report, nor any specific details about him; even though this report is based on information goods where John Doe is explicitly labeled, quantitatively described and categorically characterized.
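The effect of such “trimming” on record uniqueness can be illustrated with a minimal sketch. The records, the helper names `unique_fraction` and `generalize`, and the choice of fields are assumptions made only for illustration; they are not taken from the cited Census study or from any particular data provider.

```python
from collections import Counter

# Hypothetical toy records: (date_of_birth, sex, zip_code).
records = [
    ("1961-03-14", "F", "10027"),
    ("1961-07-02", "F", "10025"),
    ("1961-11-30", "F", "10024"),
    ("1975-01-09", "M", "94110"),
    ("1975-06-21", "M", "94114"),
    ("1988-12-05", "F", "60614"),
]

def unique_fraction(rows):
    """Share of rows whose field combination occurs exactly once."""
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] == 1) / len(rows)

def generalize(row):
    """Keep only year of birth and the first three ZIP digits."""
    dob, sex, zip_code = row
    return (dob[:4], sex, zip_code[:3])

coarse = [generalize(r) for r in records]

raw_share = unique_fraction(records)         # every raw record is unique here
coarse_share = unique_fraction(coarse)       # fewer unique records after trimming
smallest_group = min(Counter(coarse).values())  # size of the smallest group of "same" records
```

In this toy data one record remains unique even after trimming, which is why the size of the smallest group, and not only the average, matters when judging whether a particular John Doe can still be singled out.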
Numerous fields of endeavor come to mind wherein this data privacy paradox prohibits making best use of the information—especially for applications that are not concerned with any particular John Doe. For example, healthcare organizations such as physician practices, labs, hospitals and health maintenance organizations (HMOs) keep extensive medical records including data on each specific patient and on each specific doctor. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) in the USA and similar legislation in other jurisdictions prevents HMOs and healthcare providers from sharing data at full transparency—since the privacy of individuals must be preserved. (see FIGS. 1 & 2 for further details) Nevertheless, without any interest in specific individuals, pharmaceutical companies could greatly improve many technical and mercantile aspects of their operation—if they were given unrestricted access to the HMO raw data. Similar data opacity exists between banks and insurance companies, between sellers of goods and credit card companies, between the census bureau and other government agencies (e.g. tax authorities, public health systems, etc.).
Just for example, the HIPAA related section talking about de-identification says:
§ 164.514 Other requirements relating to uses and disclosures of protected health information.
(a) Standard: de-identification of protected health information. Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.
(b) Implementation specifications: requirements for de-identification of protected health information. A covered entity may determine that health information is not individually identifiable health information only if:
  (1) A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:
    (i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and
    (ii) Documents the methods and results of the analysis that justify such determination; or
  (2)(i) The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:
    (A) Names;
    (B) All geographic subdivisions smaller than a State, including street address, city, county, precinct, zip code, and their equivalent geo-codes, except for the initial three digits of a zip code if, according to the current publicly available data from the Bureau of the Census:
      (1) The geographic unit formed by combining all zip codes with the same three initial digits contains more than 20,000 people; and
      (2) The initial three digits of a zip code for all such geographic units containing 20,000 or fewer people is changed to 000.
    (C) All elements of dates (except year) for dates directly related to an individual, including birth date, admission date, discharge date, date of death; and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older;
    (D) Telephone numbers;
    (E) Fax numbers;
    (F) Electronic mail addresses;
    (G) Social security numbers;
    (H) Medical record numbers;
    (I) Health plan beneficiary numbers;
    (J) Account numbers;
    (K) Certificate/license numbers;
    (L) Vehicle identifiers and serial numbers, including license plate numbers;
    (M) Device identifiers and serial numbers;
    (N) Web Universal Resource Locators (URLs);
    (O) Internet Protocol (IP) address numbers;
    (P) Biometric identifiers, including finger and voice prints;
    (Q) Full face photographic images and any comparable images; and
    (R) Any other unique identifying number, characteristic, or code; and
  (ii) The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
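As a rough illustration of how items (B) and (C) of the quoted list translate into data handling, the following sketch coarsens a ZIP code, a date and an age, and drops every field not explicitly carried over. It is not a complete or authoritative implementation of the rule. The population table, the record layout and all function names are hypothetical, and real population figures would have to come from current Census Bureau data.

```python
# Hypothetical population per three-digit ZIP prefix (illustrative numbers only).
ZIP3_POPULATION = {"100": 1_500_000, "036": 12_000}

def safe_harbor_zip(zip_code, populations=ZIP3_POPULATION):
    """Return the 3-digit prefix, or '000' when the area holds 20,000 or fewer people.

    Assumption: an unknown prefix is treated as small and mapped to '000'.
    """
    prefix = zip_code[:3]
    return prefix if populations.get(prefix, 0) > 20_000 else "000"

def safe_harbor_year(iso_date):
    """Keep only the year of a YYYY-MM-DD date."""
    return iso_date[:4]

def safe_harbor_age(age):
    """Aggregate all ages over 89 into a single '90+' category."""
    return "90+" if age > 89 else str(age)

def deidentify(record):
    """Copy only coarsened fields; names and other identifiers are not carried over."""
    return {
        "zip3": safe_harbor_zip(record["zip"]),
        "admission_year": safe_harbor_year(record["admission_date"]),
        "age": safe_harbor_age(record["age"]),
        "diagnosis": record["diagnosis"],
    }

example = deidentify({
    "name": "John Doe",
    "zip": "10027",
    "admission_date": "2003-05-17",
    "age": 93,
    "diagnosis": "asthma",
})
```

Building the output from an explicit list of permitted fields, instead of deleting known identifiers from the input, is a deliberate choice in this sketch: a field that nobody thought about is left out by default.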
Again, with specific reference to the non-limiting example of health care related information systems, some additional Background Factors are worth noting:
(A) The rising cost of health care—Health care expenses and utilization are growing at an alarming, unprecedented rate. In 2000 Americans spent $1.3 trillion on health care. That is more than was spent on food, housing, automobiles or national defense. By 2010, health care expenditures are expected to double to $2.6 trillion, or 15.9 percent of our Gross Domestic Product, according to the Centers for Medicare and Medicaid Services. There are many reasons for the significant increase in cost. While addressing this challenge is a hot political, social and ethical issue, there is agreement that healthcare information can be used to guide a more effective and efficient use of healthcare resources.
(B) The role of data in healthcare—analyses of adequate healthcare data can be used for a wide range of applications, including: identifying ways to improve the effectiveness, safety and efficiency of health care delivery; retrospective population studies to understand risk factors and therapeutic options; public health and epidemiological studies; understanding healthcare errors and compliance issues; and understanding the effectiveness of communicating healthcare innovations to healthcare professionals and consumers (healthcare marketing). Many of these applications contribute to a better and more efficient healthcare system.
(C) Health transaction data sources—healthcare claims data, transaction data and medical data are created, stored and communicated by various healthcare organizations. Healthcare providers frequently generate large amounts of data as they diagnose, perform various clinical tests, perform medical procedures, and prescribe treatment. Elements of the clinical information also exist with the laboratories, pharmacies, HMOs and other healthcare payers, as well as a range of other service organizations such as clearinghouses and PBMs. Health transaction data is protected by privacy standards such as HIPAA in the USA. In many areas of the healthcare system, data is used either for internal applications within the organization that generated the data or, after the transaction data has been properly stripped of patient identifiers, for external applications.
(D) Aggregated de-identified data, physician level—In the pharmaceutical industry, data is commonly used to direct pharmaceutical companies' promotional efforts. Pharmacy datasets are typically aggregated to the physician (or prescriber) level and include share and volume data (Total Rx and New Rx, or TRx and NRx). In generating these datasets, the original identifiable and complete data is de-identified and aggregated, and therefore only a “lower resolution” of data is available as an output; in other words, a portion of the original dataset is lost and no longer available for analyses.
(E) Longitudinal patient-level data—A second-generation level of data is now also available for pharmaceutical applications. Frequently called anonymous (or de-identified) patient-level data, these datasets link several records of the same person over time, therefore providing a better understanding of both consumers and physicians. These datasets never include identifiable patient information and sometimes also lack physician identifiers. In generating these datasets, the original identifiable and complete data is de-identified and aggregated, and therefore only a “lower resolution” of data is available as an output; in other words, a portion of the original dataset is lost and no longer available for analyses. In addition, methods such as one-way hash encryption are at times used in order to identify the same entity over time and across datasets. The use of a constant one-way hash to link or match records for the same person or entity may have substantial drawbacks in terms of risk of downstream re-identification (e.g. access to the one-way hash and a set of personal information may allow generation of an individual's encrypted identifier and therefore re-identification) as well as significantly reduced matching and/or linking capability.
(F) Direct-to-consumer (DTC) as a trend—The pharmaceutical industry in particular (and sometimes medical device manufacturers) communicates directly with consumers to drive awareness of various medical conditions and of specific products. Direct-to-Consumer marketing has grown significantly since the FDA relaxed its regulation of such activities in 1997. DTC initiatives range from advertising initiatives to initiatives that are very well targeted through a one-to-one dialog. Some initiatives are specifically aimed at users of a particular medication, to encourage them to use the product correctly, or as prescribed, and, for chronic conditions, to encourage users to use the medication for a long period of time (persistency). DTC promotional activities are examples of Health Programs as defined herein.
(G) Adherence to therapies (compliance) as a major health issue—many healthcare stakeholders appreciate the need to enhance compliance with medical treatments prescribed by doctors. The World Health Organization published a study under the name “Adherence to Long-Term Therapies: Evidence for Action”. As part of the introduction to the study the WHO wrote—Adherence to therapies is a primary determinant of treatment success. Poor adherence attenuates optimum clinical benefits and therefore reduces the overall effectiveness of health systems. “Medicines will not work if you do not take them”—Medicines will not be effective if patients do not follow prescribed treatment, yet in developed countries only 50% of patients who suffer from chronic diseases adhere to treatment recommendations. Improving compliance is one area in which substantially more progress is needed, with benefits to all healthcare stakeholders. Various sophisticated Health Programs, as defined herein, are launched by various sponsors with the goal of improving compliance.
(H) Nature of health programs and data collected; type of intervention and possible combinations—There are many different types of health programs and likewise different entities who may be interested in sponsoring and delivering these programs. Goals can vary based on sponsors (government, HMOs, employers, pharmaceutical companies, etc.). Health programs can have the goals of raising product awareness, acquiring new customers, encouraging patient compliance with a medication regimen, expanding the overall diagnosed market, improving healthcare outcomes, improving quality of life, reducing overall cost to the healthcare system, etc. Other health programs, not sponsored by pharmaceutical manufacturers, may include public health efforts or disease/care management as well as other health promotion programs promoted by healthcare associations, payers and others.
(I) Insufficiency of target consumer program measurement, even though data exist, because of privacy issues—The challenge of measuring the impact of a consumer health program becomes significant whenever the health program sponsor does not have the full healthcare information of the target population at its disposal. Blocked by both data-access and privacy challenges, sponsoring organizations have to assess the impact of their efforts with very limited methods. As described above in this section, HIPAA places substantial limitations on Personal Health Information, and existing de-identification methods may render the information useless for the purpose of measuring the impact of health programs. Naturally, with limited measurement abilities, fewer resources are directed by sponsors to valuable health programs such as compliance programs.
(J) “Soft” measurement of health programs, activity or self-reported measurement—As a result of the above mentioned limitations, existing methods for assessing health programs and marketing programs that affect a subset of the consumer/patient population include self-reported data such as patient surveys, or activity measurement such as the number of messages sent to the consumer, etc. Other approaches include: (i) consumer panels where consumers are surveyed on some regular basis. (ii) regionally or otherwise focused initiatives can be measured by a regional analysis if (iii) other fairly complex and limited methods to infer patient behavior.
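The two drawbacks of a constant one-way hash noted under item (E) above can be made concrete with a short sketch. The field choice, the normalization and the function name `constant_hash` are assumptions for illustration only; they do not describe any particular vendor's linking scheme.

```python
import hashlib

def constant_hash(first, last, dob):
    """Unkeyed, unsalted one-way hash of lightly normalized identifying fields."""
    key = "|".join(s.strip().lower() for s in (first, last, dob))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# De-identified longitudinal records, keyed only by the hash token.
deidentified = {
    constant_hash("John", "Doe", "1961-03-14"): ["Rx A, Jan", "Rx A, Feb"],
    constant_hash("Jane", "Roe", "1975-06-21"): ["Rx B, Mar"],
}

# Drawback 1: anyone who knows the hash function and holds a list of personal
# details (for example from a public register) can regenerate the tokens.
candidates = [("John", "Doe", "1961-03-14"), ("Ann", "Poe", "1988-12-05")]
reidentified = {
    person: deidentified[constant_hash(*person)]
    for person in candidates
    if constant_hash(*person) in deidentified
}

# Drawback 2: a small spelling variation yields an unrelated token,
# so records for the same person no longer link.
link_survives_typo = constant_hash("Jon", "Doe", "1961-03-14") in deidentified
```

Here `reidentified` ends up containing John Doe's prescription history, while the misspelled "Jon" fails to match at all. Keyed or salted hashing would address the first problem only for parties who do not hold the key, and does nothing for the second.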
Now, in these and countless other (non-health system related) examples, many useful advances in understanding would occur if the data privacy restrictions were lifted—since records could be aligned according to name and/or ID—thereby presenting to researchers a portrait of reality at substantially higher resolution. However, if this merger were allowed, then countless opportunities to breach personal privacy would arise, in violation of laws and regulations—eventually causing many individuals to stop providing accurate information to their HMOs and healthcare providers or to the census bureau, and/or to stop using their credit cards, etc. Accordingly, there is a long felt need in the art for a protocol that will allow higher resolution query and manipulation of privacy sensitive data while simultaneously allowing individual privacy to be preserved. Furthermore, it is reasonable to consider that any step toward better data utilization while maintaining privacy would constitute progress.
Key Definitions:
Data Source Entity—organizations that generate, capture or store (for example, in the health care industry) medical and claims data that includes identifiable personal health information. This includes physician offices, hospitals, labs and other healthcare providers; pharmacies; and HMOs, MCOs, self-insured employers, insurance companies, PBMs and other such entities. It also includes claims clearinghouses and any other “Covered Entities” as defined under HIPAA. Conceptually, the source-entity also includes other entities operating as a vendor for the source-entity under a privacy agreement (such as a HIPAA Business Associate Agreement). Furthermore, there are non-health care data source entities, such as credit card companies, credit bureaus, insurance companies, banks, the census bureau, social service agencies, law enforcement agencies, and the like, all of which share common functionality as collectors and maintainers of large volumes of data including therein personal identifiable data.
Data Consumer Entity—organizations that would like analytics services to answer marketing, operational, quality, health outcome (for example) or other business related questions regarding, for example, a specific health program or initiative, or a subset or all of the marketplace, etc. Data Consumer Entities are interested in strategic and tactical analyses to help them optimize their resource investment to achieve their objectives. Examples can be the government, researchers, product and service companies (for example, healthcare companies), etc. Specifically in healthcare, detailed population information can have a remarkable role in the identification of public health trends, retrospective health outcomes, clinical research and development, medical errors and other valuable healthcare applications.
Data Originator Entity—organizations that generate, capture or store personal identifiable data (“originating information”), from which can be generated a list of instances that satisfy a condition or conditions in a query. The query is related, of course, to the question that the Data Consumer Entity wants answered. Data Originator Entities can include health care organizations like physician offices, hospitals, labs and other healthcare providers, pharmacies, HMOs, MCOs, self-insured employers, insurance companies, PBMs, claims clearinghouses, and other such entities. Data Originator Entities can also include other entities operating as a vendor for the Data Source Entity under a privacy agreement. There are also non-health care Data Originator Entities, such as credit card companies, credit bureaus, MSOs, cable TV companies, insurance companies, banks, the census bureau, social service agencies, law enforcement agencies, and the like, all of which share common functionality as collectors and maintainers of data including therein personal identifiable data. The Data Originator Entity can be the same as the Data Consumer Entity (i.e., where the Data Consumer Entity has access to suitable originating information), or the two entities can be different (i.e., where the Data Consumer Entity does not have access to suitable originating information).
One example of a non-health care Data Originator Entity is a cable TV company with detailed records of household cable-box channel settings, household billing information and advertising schedules. The cable company information reveals what TV show or other entertainment content a particular household was watching at a particular time, and through this information it can be deduced which advertisements that particular household was exposed to. Such originating information is suitable for handling queries such as, but not limited to, “all households who had the opportunity to see commercial advertising X between date A and date B”. The objective of such a query is to link advertisement exposure to transactional purchasing information, in order to answer the question of a Data Consumer Entity (which might be a health care company, a consumer products company, etc.), concerning how many households that saw a particular advertisement subsequently purchased the advertised product or service.
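A query of the kind quoted above can be sketched as a join between an advertising schedule and a household tuning log. The record layouts, identifiers and the function name `households_exposed` are hypothetical and chosen only for illustration; an actual cable operator's data model may differ considerably.

```python
from datetime import datetime

# Hypothetical advertising schedule: (ad_id, channel, air_time).
ad_airings = [
    ("X", "ch7", datetime(2004, 3, 1, 20, 15)),
    ("X", "ch9", datetime(2004, 3, 9, 21, 0)),
    ("Y", "ch7", datetime(2004, 3, 2, 20, 15)),
]

# Hypothetical cable-box log: (household_id, channel, tuned_from, tuned_until).
tuning_log = [
    ("H1", "ch7", datetime(2004, 3, 1, 20, 0), datetime(2004, 3, 1, 21, 0)),
    ("H2", "ch9", datetime(2004, 3, 1, 20, 0), datetime(2004, 3, 1, 21, 0)),
    ("H3", "ch9", datetime(2004, 3, 9, 20, 30), datetime(2004, 3, 9, 21, 30)),
    ("H4", "ch7", datetime(2004, 3, 2, 20, 0), datetime(2004, 3, 2, 21, 0)),
]

def households_exposed(ad_id, start, end):
    """Households tuned to the airing channel at an air time of ad_id within [start, end]."""
    airings = [(ch, t) for a, ch, t in ad_airings if a == ad_id and start <= t <= end]
    return sorted({
        household
        for household, channel, tuned_from, tuned_until in tuning_log
        for air_channel, air_time in airings
        if channel == air_channel and tuned_from <= air_time <= tuned_until
    })

march = households_exposed("X", datetime(2004, 3, 1), datetime(2004, 3, 31))
first_week = households_exposed("X", datetime(2004, 3, 1), datetime(2004, 3, 5))
```

The result is a list of household identifiers, that is, a list of instances satisfying the query condition, which is exactly the kind of identifiable output that must then be linked to purchasing information without exposing the households themselves.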
Crossix—an expression that includes the instant protocol according to any of its embodiments—and derivative uses thereof (see FIGS. 4 & 5 for preferred embodiment details)
Health Program—a program (used as specific example for the preferred embodiment of the instant invention) that affects a subset of the overall potential population. Typically patients, consumers or healthcare professionals will opt-in to participate in such a program and if the organization sponsoring it is not covered by HIPAA, the sponsoring organization will adhere to its published privacy policy. Typically Health Programs capture personal identifying information. Health Programs may include for example compliance programs or may include a broadcast advertising component (such as TV commercials) encouraging consumers to call a toll-free number or go to a web-site for further information. Frequently, at the call center or web-site, some consumer information is captured.
Typical Identifiable Data Captured in a Health Program—Some combination of the following fields or similar to those: First Name; Last Name; Date of Birth Or Year of Birth; Zip Code; Full Address; Phone Number(s); Fax Number(s); E-Mail; Prescribing Doctor Name, Address or Other Identifiers; Medical Condition or Drug Prescribed; Gender; Social Security. NOTE: Variability of data discussion—personal data frequently changes. (See discussion on this in U.S. Pat. No. 6,397,224 and Math, Myth & Magic of Name Search & Matching by SearchSoftwareAmerica) A subset of this data jointly can serve as an identifier with high probability of uniqueness. For example, Date of Birth and phone number could serve jointly as unique identifiers. Data Source Entity information structure (of typical health care related identifiers) may include all or some of the above plus a unique member ID. (Note: See U.S. Pat. No. 5,544,044; U.S. Pat. No. 5,835,897 and U.S. Pat. No. 6,370,511 for detailed description of healthcare data structure.)
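To illustrate how a subset of these fields can serve jointly as an identifier despite variability in how the data is written, the following sketch builds a key from Date of Birth and phone number after normalizing both. The accepted date formats, the assumption of ten-digit US-style phone numbers and the function names are illustrative choices, not part of the cited references.

```python
import re
from datetime import datetime

def normalize_phone(phone):
    """Strip non-digits and keep the last ten digits (assumes US-style numbers)."""
    digits = re.sub(r"\D", "", phone)
    return digits[-10:]

def normalize_dob(dob):
    """Accept YYYY-MM-DD or MM/DD/YYYY and return YYYYMMDD."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(dob, fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {dob!r}")

def joint_key(dob, phone):
    """Combine the normalized fields into a single match key."""
    return normalize_dob(dob) + ":" + normalize_phone(phone)

# The same person captured by a health program and by a data source entity,
# with the fields written differently in each system.
program_key = joint_key("03/14/1961", "(212) 555-0142")
source_key = joint_key("1961-03-14", "+1 212.555.0142")
```

Both calls produce the same key, so the two records can be matched. Normalization handles only differences in formatting; it does not address data that has genuinely changed, such as a new phone number.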