1. Field of the Invention
The present invention relates to an apparatus and method for determining population size and overlap within information sources. More specifically, the present invention relates to a statistical technique for measuring population overlap in plural data sets without reliance on unique identifiers. It provides an alternative and superior method for determining population size, using a technique in which the holder of the information summarizes the features of the information and forwards the summary to a processor that determines the overlap, thereby enhancing the privacy of the original information.
2. Description of the Related Art
Government and private industry need to know the number of people who are involved in more than one institution, program, group, or activity, either concurrently or in sequence, for purposes of management, monitoring, and evaluation.
The measurement of population overlap has been hampered by both the complexity of the social institutions and the lack of unique personal identifiers across existing data sets. Until now, the determination of the number of individuals shared across sub-populations has relied on one or more of three approaches to the problem: (1) the construction of detailed case registries (single data sets); (2) implementation of a true unique ID system (e.g., a national ID card) across multiple data sets; and (3) case by case matching of records from multiple data sets that describe the members and/or activities of various organizations and service sectors.
Traditionally, the problem of data set overlap has been most commonly approached by the development of case registries. The Gulf War Registry, designed to allow medical researchers to determine the prevalence and distribution of Gulf War Syndrome, is one current example. The National Breast Cancer Registry is another. In the 1960s, a number of states established psychiatric case registries in order to determine the prevalence and distribution of mental illness. In every case the problem was the same: existing fragmented information systems could not support the critical epidemiological function of determining the relationship among existing data sets. There are three important shortcomings to this approach. First, the creation of case registries is a very expensive undertaking. Second, the completeness of a registry is always in question, especially when participation is voluntary. The incompleteness of the Gulf War Registry is notorious. Finally, because they necessarily include personal identifiers, case registries raise important issues about personal privacy and confidentiality of personal records.
The implementation of universal true unique personal identifier systems provides a second solution to the problem of determining the number of people involved in different subpopulations. While the implementation of such identification systems has been successfully accomplished for specific organizations (e.g. individual hospitals, correctional facilities, and insurance companies), these identification systems do not constitute the kind of universal identification systems that allow for analysis of membership overlap. In the United States, the social security number comes close to providing a universal identification system, but concerns about personal privacy severely limit the availability of these identifiers in settings not directly related to the social security system.
Case by case matching of records from multiple data sets based on the names of people or other identifiers that may be shared by more than one data set is a third approach to the problem. Case by case database integration on a patient specific basis has been utilized in a number of fields. From a practical point of view, this approach has two major shortcomings. First, it is tedious, time consuming, and expensive. Second, it includes an unquantifiable degree of error. This approach also depends on personal identifiers, so concerns about privacy and confidentiality are likely to limit its utilization.
The problem of measuring the overlap between populations where no unique person identifier exists is related to the problem of measuring population size (the number of distinct individuals) without a unique person identifier. The problem of estimating population size may, in fact, be seen as a constituent part of the larger problem of estimating population overlap. In the past, the measurement of the number of people represented in a single data set that does not include a unique person identifier has relied on either of two statistical approaches. One statistical approach applies the capture-recapture sampling technique to the problem. This approach is illustrated by Abeni et al., “Capture-Recapture to Estimate the Size of the Population with Human Immunodeficiency Virus Type 1 Infection,” Epidemiology, Volume 5 Number 4, July 1994 (pp. 410–414). The other statistical technique is based on a classical occupancy theory, as discussed by Feller, “An Introduction to Probability Theory and Its Applications,” Volume 1, Second Edition, 1957. The classical occupancy theory is described on pages 210–211 and 224 of Feller's text. One implementation of the classical occupancy theory has been provided by Larsen, “Estimation of the Number of People in a Register from the Number of Birthdates,” Statistics in Medicine, Volume 13, 1994 (pp. 177–183). The present invention uses a fundamentally different, and far superior, implementation of the classical occupancy theory.
The capture-recapture technique is, in essence, case by case matching of small samples of larger populations. It avoids the cost associated with complete case-by-case matching, but still raises issues of personal privacy and confidentiality because it relies on personal identifiers for a subset of the population. Capture-recapture was originally developed by ecologists to estimate the size of wildlife populations. In the simplest setting, a sample of wildlife is captured, tagged, and released. At a later time, a second sample is drawn and overlap with the first sample is determined. The sizes of the two samples and their overlap are used to statistically determine the size of the total population and the confidence interval associated with the estimate. In applications to human populations, capture-recapture draws samples from lists of members of subpopulations. Personal identifiers are used to measure overlap of the samples and statistical computations are used to determine the size of the overall population. The greatest shortcoming of the capture-recapture approach is the large confidence intervals associated with the measure. It is not unusual to find confidence intervals of 50% of the population parameter as illustrated by Abeni et al.
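The two-sample computation described above can be sketched as follows. This is an illustrative sketch only, not the procedure used by Abeni et al.: it assumes the textbook Lincoln-Petersen estimator, and for the interval it assumes Chapman's bias-corrected variant with a normal approximation. The function names and the 1.96 multiplier (a nominal 95% interval) are chosen here for illustration.

```python
import math


def lincoln_petersen(n1, n2, m):
    """Estimate total population size from two samples.

    n1: size of the first sample (all "tagged")
    n2: size of the second sample
    m:  number of individuals found in both samples
    """
    if m <= 0:
        raise ValueError("the samples must share at least one individual")
    # If the tagged fraction in the second sample (m / n2) mirrors the
    # tagged fraction in the population (n1 / N), then N = n1 * n2 / m.
    return n1 * n2 / m


def chapman_with_interval(n1, n2, m, z=1.96):
    """Chapman's bias-corrected estimate with a normal-approximation interval."""
    n_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    variance = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)) / (
        (m + 1) ** 2 * (m + 2)
    )
    half_width = z * math.sqrt(variance)
    return n_hat, (n_hat - half_width, n_hat + half_width)
```

For example, with samples of 100 and 80 individuals that share 20 members, `lincoln_petersen(100, 80, 20)` gives an estimated population of 400. The interval returned by `chapman_with_interval` widens rapidly as the overlap `m` shrinks, which is consistent with the large confidence intervals noted above.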
A statistical procedure that addresses the problem of estimating the size of a population without a unique personal identifier has been provided by Larsen's maximum likelihood estimate of the solution to the classical occupancy problem. Larsen applied his solution to the estimation of the number of people represented in an anonymous Chlamydia registry in one county in Denmark. His solution provides less precise estimates and contains greater error than the solution provided by the present invention. In addition, his solution does not address the population overlap problem.
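The classical occupancy idea behind birthdate-based estimation can be illustrated with a simple calculation. The sketch below is neither Larsen's maximum likelihood estimate nor the method of the present invention. It is a method-of-moments inversion of the expected number of occupied cells, and it assumes that birthdates fall uniformly and independently on 365 days of the year. The function names are hypothetical.

```python
import math


def expected_distinct_birthdates(n_people, n_days=365):
    """Expected number of distinct birthdates among n_people.

    Classical occupancy result: each of n_days cells stays empty with
    probability (1 - 1/n_days) ** n_people.
    """
    return n_days * (1 - (1 - 1 / n_days) ** n_people)


def estimate_people(distinct_birthdates, n_days=365):
    """Invert the expectation to estimate the number of people.

    Solves n_days * (1 - (1 - 1/n_days) ** n) = distinct_birthdates for n.
    Undefined when every day is occupied, because the observed count
    then carries no upper bound on the population.
    """
    if not 0 <= distinct_birthdates < n_days:
        raise ValueError("distinct_birthdates must be in [0, n_days)")
    return math.log(1 - distinct_birthdates / n_days) / math.log(1 - 1 / n_days)
```

Because only the count of distinct birthdates is needed, a calculation of this kind does not require any record-level personal identifier. The estimate becomes unstable as the number of distinct birthdates approaches the number of available days.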
In recent years, the proliferation of electronic databases in conjunction with dramatic advances in data processing technology has led to increasing concern about threats to personal privacy and, more specifically, about the confidentiality of medical records. (See Secretary's Advisory Committee on Automated Personal Data Systems. Records, Computers, and Rights of Citizens: Report of the Advisory Committee on Automated Personal Data Systems, US Department of Health, Education, and Welfare. Washington D.C.: U.S. Government Printing Office, 1973; Hendricks E, Hayden T, Novik J D. Your Right to Privacy: A Basic Guide to Legal Rights in an Information Society. Carbondale, Ill.: Southern Illinois University Press, 1990; and Donaldson M S, Lohr K N. Health Data in the Information Age: Use, Disclosure, and Privacy. Institute of Medicine. Washington D.C.: National Academy Press; 1994. p 51.)
In 1996, the United States Congress passed the Health Insurance Portability and Accountability Act (PL 104–191), which required the Secretary of Health and Human Services to promulgate federal regulations that protect the privacy of health information if Congress had not enacted legislation in this area by August of 1999. In December 2000, the Standards for Privacy of Individually Identifiable Health Information; Final Rule (45 CFR 164) were promulgated. In this rule, § 164.514, “Other requirements relating to uses and disclosures of protected health information,” includes the requirement that: “(2)(i) The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed: . . . (C) All elements of dates (except year) for dates directly related to an individual, including birth date . . . ”.
The current debate regarding medical records privacy involves a classic confrontation of public and private goods. (See Pandiani J A, Banks S M, Schacht L M. Personal Privacy vs. Public Accountability: A Technological Solution to an Ethical Dilemma. Journal of Behavioral Health Services and Research. 1998 November;25(4):456–63.) Personal privacy is an important value in contemporary American society. Legislation in support of personal privacy and the confidentiality of medical records would be likely to receive widespread support if it did not have the potential to undermine another set of important national values. These values favor rational administration of public programs and the right of people to be able to make informed choices regarding their health care.
The Institute of Medicine offers a broad definition of personally identifiable information when it includes: “ . . . items of information (e.g. the fact of a physician visit on a given day) that will allow identification of an individual when combined with other facts (e.g. zip code, date of birth, and gender).” (See Hendricks E, Hayden T, Novik J D. Your Right to Privacy: A Basic Guide to Legal Rights in an Information Society. Carbondale, Ill.: Southern Illinois University Press, 1990.)
The Public Health Service endorsed an even stricter definition of personally identifiable information when it promulgated guidelines that prohibited the release of data tapes that contain: “ . . . any detailed information about the subject that could facilitate identification . . . (e.g., exact date of the subject's birth).” (See National Center for Health Statistics. NCHS Staff Manual on Confidentiality. Hyattsville Md.: Department of Health and Human Services; 1984.)
What is needed is an approach to determining overlap that enhances protection of the privacy of the original information.
Government and private industry also need to know the unduplicated number of individuals shared by more than two data sets. For instance, government agencies want to know if people served by both substance abuse and mental health agencies have better outcomes than people served by either alone. Relevant outcomes include levels of criminal justice involvement, employment, hospitalization, and mortality. Many government agencies and consumer advocates want to know about the degree of integration among multiple agencies, especially with regard to child-serving agencies. Such measures address efficiency concerns as well as concerns about the adequacy of care. (See Stroul, B. A. & Friedman, R. M. (1986). A system of care for children and youth with severe emotional disturbances. (rev. ed.). Washington D.C.: Georgetown University Child Development Center, CASSP Technical Assistance Center; and Pandiani, J. A., Banks, S. M., & Schacht, L. S.: (1999) Caseload Segregation/Integration: A Measure of Shared Responsibility for Children and Adolescents, Journal of Emotional and Behavioral Disorders 7(2)66–71.)
The difficulty of measuring the overlap between data sets tends to increase with the number of data sets involved. This is because multiple data sets are less likely to share person identifiers, and because state and federal regulations are more likely to limit access to one or more of the data sets on privacy grounds.
What is needed is an approach that will allow the size and overlap in three or more data sets to be determined.