A computer source code listing containing a preferred embodiment of the present invention is included in a microfiche appendix, appended hereto, having one microfiche and fourteen frames.
1. Field of the Invention
The present invention relates to an apparatus and method for determining population size and overlap within information sources. More specifically, the present invention relates to a statistical technique for measuring population overlap without reliance on unique identifiers, and provides an alternative and superior method for determining population size.
2. Description of the Related Art
Government and private industry need to know the number of people who are involved in more than one institution, program, group, or activity, either concurrently or in sequence, for purposes of management, monitoring, and evaluation.
The measurement of population overlap has been hampered by both the complexity of the social institutions and the lack of unique personal identifiers across existing data sets. Until now, the determination of the number of individuals shared across sub-populations has relied on one or more of three approaches to the problem: (1) the construction of detailed case registries (single data sets); (2) implementation of a true unique id system (e.g. National ID card) across multiple data sets; and (3) case by case matching of records from multiple data sets that describe the members and/or activities of various organizations and service sectors.
Traditionally, the problem of data set overlap has been most commonly approached by the development of case registries. The Gulf War Registry, designed to allow medical researchers to determine the prevalence and distribution of Gulf War Syndrome, is one current example. The National Breast Cancer Registry is another. In the 1960s, a number of states established psychiatric case registries in order to determine the prevalence and distribution of mental illness. In every case the problem was the same: existing fragmented information systems could not support the critical epidemiological function of determining the relationships among existing data sets. There are three important shortcomings to this approach. First, the creation of case registries is a very expensive undertaking. Second, the completeness of a registry is always in question, especially when participation is voluntary. The incompleteness of the Gulf War Registry is notorious. Finally, because they necessarily include personal identifiers, case registries raise important issues about personal privacy and the confidentiality of personal records.
The implementation of universal true unique personal identifier systems provides a second solution to the problem of determining the number of people involved in different subpopulations. While the implementation of such identification systems has been successfully accomplished for specific organizations (e.g. individual hospitals, correctional facilities, and insurance companies), these identification systems do not constitute the kind of universal identification systems that allow for analysis of membership overlap. In the United States, the social security number comes close to providing a universal identification system, but concerns about personal privacy severely limit the availability of these identifiers in settings not directly related to the social security system.
Case by case matching of records from multiple data sets based on the names of people or other identifiers that may be shared by more than one data set is a third approach to the problem. Case by case database integration on a patient specific basis has been utilized in a number of fields. From a practical point of view, this approach has two major shortcomings. First, it is tedious, time consuming, and expensive. Second, it includes an unquantifiable degree of error. This approach also depends on personal identifiers, so concerns about privacy and confidentiality are likely to limit its utilization.
The problem of measuring the overlap between populations where no unique person identifier exists is related to the problem of measuring population size (the number of distinct individuals) without a unique person identifier. The problem of estimating population size may, in fact, be seen as a constituent part of the larger problem of estimating population overlap. In the past, the measurement of the number of people represented in a single data set that does not include a unique person identifier has relied on either of two statistical approaches. One statistical approach applies the capture-recapture sampling technique to the problem. This approach is illustrated by Abeni et al., “Capture-Recapture to Estimate the Size of the Population with Human Immunodeficiency Virus Type 1 Infection,” Epidemiology, Volume 5 Number 4, July 1994 (pp. 410-414). The other statistical technique is based on a classical occupancy theory, as discussed by Feller, “An Introduction to Probability Theory and Its Applications,” Volume 1, Second Edition, 1957. The classical occupancy theory is described on pages 210-211 and 224 of Feller's text. One implementation of the classical occupancy theory has been provided by Larsen, “Estimation of the Number of People in a Register from the Number of Birthdates,” Statistics in Medicine, Volume 13, 1994 (pp. 177-183). The present invention uses a fundamentally different, and far superior, implementation of the classical occupancy theory.
The capture-recapture technique is, in essence, case by case matching of small samples of larger populations. It avoids the cost associated with complete case by case matching, but still raises issues of personal privacy and confidentiality because it relies on personal identifiers for a subset of the population. Capture-recapture was originally developed by ecologists to estimate the size of wildlife populations. In the simplest setting, a sample of wildlife is captured, tagged, and released. At a later time, a second sample is drawn and overlap with the first sample is determined. The sizes of the two samples and their overlap are used to statistically determine the size of the total population and the confidence interval associated with the estimate. In applications to human populations, capture-recapture draws samples from lists of members of subpopulations. Personal identifiers are used to measure overlap of the samples and statistical computations are used to determine the size of the overall population. The greatest shortcoming of the capture-recapture approach is the large confidence intervals associated with the measure. It is not unusual to find confidence intervals of ±50% of the population parameter as illustrated by Abeni et al.
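The two-sample setting described above can be illustrated with a short sketch. It is not taken from the microfiche appendix or from Abeni et al. It assumes the textbook Lincoln-Petersen estimator and, for the interval, Chapman's bias-corrected variant with a normal approximation. The function names are hypothetical.

```python
import math


def lincoln_petersen(n1, n2, m):
    """Textbook two-sample estimate of total population size.

    n1: size of the first (tagged) sample
    n2: size of the second sample
    m:  number of members of the second sample already tagged
    """
    if m <= 0:
        raise ValueError("at least one recaptured member is required")
    return n1 * n2 / m


def chapman_interval(n1, n2, m, z=1.96):
    """Chapman's bias-corrected estimate with an approximate 95% interval.

    Assumption: a normal approximation is adequate; with a small overlap m
    the interval becomes very wide, which is the shortcoming noted above.
    """
    estimate = (n1 + 1) * (n2 + 1) / (m + 1) - 1
    variance = ((n1 + 1) * (n2 + 1) * (n1 - m) * (n2 - m)
                / ((m + 1) ** 2 * (m + 2)))
    half_width = z * math.sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width


if __name__ == "__main__":
    print(lincoln_petersen(100, 80, 20))
    print(chapman_interval(100, 80, 20))
```

Both functions need only the two sample sizes and the size of their overlap, but measuring that overlap still requires personal identifiers for the sampled members.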
A statistical procedure that addresses the problem of estimating the size of a population without a unique personal identifier has been provided by Larsen's maximum likelihood estimate of the solution to the classical occupancy problem. Larsen applied his solution to the estimation of the number of people represented in an anonymous Chlamydia registry in one county in Denmark. His solution provides less precise estimates and contains greater error than the solution provided by the present invention. In addition, his solution does not address the population overlap problem.
Accordingly, it is an object of the present invention to identify an accurate quantity of unique individuals (entities, objects, items, etc.) in a data source containing potentially multiple records pertaining to a particular individual.
It is a further object of the present invention to identify an accurate quantity of unique individuals (entities, objects, items, etc.) overlapping across multiple data sources which may contain multiple records pertaining to a particular individual within a single data source or within multiple data sources.
It is another object of the present invention to determine a more precise and smaller range of variance of the quantity of unique individuals (or items) at a specified confidence interval (e.g. 95%) in a single data source or overlapping across multiple data sources which contain potentially multiple records regarding a particular individual in a single data source and/or across multiple data sources.
Objects of the present invention are achieved by providing an apparatus and method for probabilistic population size determination. The apparatus and method use a computer to probabilistically calculate the population size of unique entities in data that contain records on unique entities without unique identifiers for the unique entities and that have at least one common type of information with a known distribution of finite expectation, using decomposed probabilistic calculations based on values of the information with the known distribution. The decomposed probabilistic calculations used in determining population size include calculating a first probabilistic number of unique entities needed to satisfy a first value of the information with the known distribution, successively calculating a probabilistic incremental number of unique entities needed for a previous value of the information with the known distribution to increase to a subsequent value of the information with the known distribution, until the probabilistic number of unique entities needed to increase to a predetermined value of the information with the known distribution is calculated, and summing the first probabilistic number of unique entities and the probabilistic incremental numbers of unique entities.
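The decomposed calculation described above can be sketched as follows. This is an illustration, not the listing of the microfiche appendix. It assumes that the information with the known distribution is uniformly distributed over d possible values (for example, 365 birth dates in a year), so that the expected number of additional entities needed to go from j to j+1 observed values is d/(d-j). The function names are hypothetical.

```python
def incremental_entities(j, d=365):
    """Expected number of additional entities needed to raise the count of
    distinct observed values from j to j + 1.

    Assumption: each entity's value is drawn uniformly from d possibilities,
    so a new value appears with probability (d - j) / d.
    """
    if not 0 <= j < d:
        raise ValueError("j must satisfy 0 <= j < d")
    return d / (d - j)


def estimate_population(k, d=365):
    """Sum of the first number (j = 0, always 1) and every incremental
    number up to the observed count k of distinct values."""
    if not 0 <= k <= d:
        raise ValueError("k must satisfy 0 <= k <= d")
    return sum(incremental_entities(j, d) for j in range(k))


if __name__ == "__main__":
    # 200 distinct birth dates observed in one cohort with 365 possible dates
    print(round(estimate_population(200), 1))
```

Under this uniformity assumption the estimate is never smaller than k, and it grows rapidly as k approaches d, because each remaining unfilled value becomes harder to fill.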
Objects of the present invention are also achieved by providing an apparatus and method for probabilistic population size and population overlap determination. Data is found in first and second data sets containing records on the unique entities without unique identifiers for the unique entities and having at least one common type of information with a known distribution of finite expectation. A computer combines the first and second data sets into a combined data set and probabilistically calculates the population size for the first and second data sets in a fashion similar to the apparatus and method discussed in the preceding paragraph. The population overlap of the unique entities between the first and second data sets is then determined by subtracting a probabilistic incremental number of unique entities needed for a larger total number of values of the information with the known distribution from either of the first and second data sets to increase to a total number of values of the information with the known distribution in the combined data set from a smaller of the population size of the first and second data sets.
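The subtraction described above can be sketched as follows. This is an illustration under the assumption of a uniform distribution over d possible values, not the listing of the microfiche appendix. The function and parameter names are hypothetical.

```python
def expected_entities(k, d=365):
    """Expected number of entities needed to observe k distinct values,
    assuming values are uniform over d possibilities."""
    return sum(d / (d - j) for j in range(k))


def increment_between(k_from, k_to, d=365):
    """Expected number of additional entities needed for the count of
    distinct values to rise from k_from to k_to."""
    return sum(d / (d - j) for j in range(k_from, k_to))


def estimate_overlap(k_first, k_second, k_combined, d=365):
    """Overlap = smaller population size minus the increment needed to go
    from the larger data set's distinct count to the combined count.

    k_first, k_second: distinct values seen in each data set
    k_combined:        distinct values seen in the combined data set
    """
    if k_combined < max(k_first, k_second):
        raise ValueError("combined count cannot be below either data set")
    smaller_size = expected_entities(min(k_first, k_second), d)
    increment = increment_between(max(k_first, k_second), k_combined, d)
    return smaller_size - increment


if __name__ == "__main__":
    # 150 and 90 distinct birth dates, 180 in the combined data set
    print(round(estimate_overlap(150, 90, 180), 1))
```

The reasoning behind the subtraction is that the combined population consists of the larger population plus those members of the smaller population who are not shared, so the increment from the larger count to the combined count estimates the unshared part of the smaller population.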
Objects of the present invention are also achieved by providing an apparatus (and a corresponding method) for probabilistic determination of population size and overlap having a storage medium, a data preparation unit, a population size measurement unit, a population overlap measurement unit, and a total population information generator. The storage medium stores first and second data sets containing the records on individuals without unique identifiers for the individuals, and the records have information on gender and birth date for each individual. The data preparation unit subdivides the records in the first, second, and combined data sets into multiple gender/year of birth cohorts, and determines a total number of unique birth dates in the first, second, and combined data sets for each gender/year of birth cohort. The population size measurement unit probabilistically calculates the population size for the first and second data sets in each gender/year of birth cohort by calculating a first probabilistic number of individuals needed to fill one date of birth, successively calculating a probabilistic incremental number of individuals needed for a previous number of dates of birth to increase to a subsequent number of dates of birth, until the probabilistic number of unique entities needed to increase to the total number of unique dates of birth is calculated, and summing the first probabilistic number of individuals and the probabilistic incremental numbers of individuals. The population overlap measurement unit determines the population overlap of the individuals between the first and second data sets by subtracting a probabilistic incremental number of individuals needed for a larger total number of unique birth dates from either of the first and second data sets to increase to a total number of unique birth dates in the combined data set from a smaller of the population size of the first and second data sets. 
The total population information generator calculates a total population size and a total population overlap by summing the population size of the first and second data sets and the population overlap across the multiple gender/year of birth cohorts.
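The data preparation and totaling steps described above can be sketched as follows. This is an illustration only. It assumes that each record is a (gender, birth date) pair, that birth dates are uniform within a year, and that a leap year offers 366 possible dates. The record layout and function names are hypothetical and are not taken from the microfiche appendix.

```python
import calendar
from collections import defaultdict
from datetime import date


def expected_people(k, d):
    """Expected number of individuals needed to fill k of d birth dates."""
    return sum(d / (d - j) for j in range(k))


def unique_dates_by_cohort(records):
    """Group records into gender/year of birth cohorts and keep the set of
    distinct birth dates seen in each cohort."""
    cohorts = defaultdict(set)
    for gender, birth_date in records:
        cohorts[(gender, birth_date.year)].add(birth_date)
    return cohorts


def total_size_and_overlap(records_a, records_b):
    """Return (total size of A, total size of B, total overlap), each summed
    across all gender/year of birth cohorts."""
    a = unique_dates_by_cohort(records_a)
    b = unique_dates_by_cohort(records_b)
    total_a = total_b = total_overlap = 0.0
    for cohort in sorted(set(a) | set(b)):
        d = 366 if calendar.isleap(cohort[1]) else 365  # assumed date count
        dates_a = a.get(cohort, set())
        dates_b = b.get(cohort, set())
        k_a, k_b = len(dates_a), len(dates_b)
        k_ab = len(dates_a | dates_b)  # combined data set
        size_a, size_b = expected_people(k_a, d), expected_people(k_b, d)
        increment = expected_people(k_ab, d) - expected_people(max(k_a, k_b), d)
        total_a += size_a
        total_b += size_b
        total_overlap += min(size_a, size_b) - increment
    return total_a, total_b, total_overlap


if __name__ == "__main__":
    first = [("F", date(1970, 1, 1)), ("F", date(1970, 3, 5)),
             ("M", date(1980, 7, 4))]
    second = [("F", date(1970, 1, 1)), ("M", date(1981, 2, 2))]
    print(total_size_and_overlap(first, second))
```

Working cohort by cohort keeps the number of observed birth dates well below the number of possible dates in each cell, which is where the estimate is most stable.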
Objects of the present invention are further achieved by providing a data preparation unit which performs diagnosis to identify discrepancies between the observed value and the maximum expected value of the information with the known distribution and to generate a warning or recommend solutions to overcome the discrepancy.
The population size measurement unit also calculates a population size variance corresponding to the population size determined for the first and second data sets for each gender/year of birth cohort. The population overlap measurement unit also calculates a variance corresponding to the probabilistic incremental number of individuals needed for the larger total number of unique birth dates from either of the first and second data sets to increase to the total number of unique birth dates in the combined data set, and calculates lower and upper population overlap variance values corresponding to the population overlap for each gender/year of birth cohort. The total population information generator calculates a total population size variance by summing the population size variance for the first and second data sets across the multiple gender/year of birth cohorts, and calculates total lower and upper population overlap variance values by summing the lower and upper population overlap variance values across the multiple gender/year of birth cohorts.
The population overlap measurement unit also calculates a ninety-five percent confidence interval for the population overlap for each gender/year of birth cohort, and the total population information generator calculates a ninety-five percent confidence interval for the total population overlap based on the total lower and upper population overlap variance values and by eliminating impossible ranges in variance.
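The variance and confidence interval calculations described above can be sketched as follows. The formulas actually used by the preferred embodiment are not reproduced in this section, so the sketch rests on stated assumptions: birth dates are uniform over d possible dates, each incremental number is treated as a geometric waiting time with variance j·d/(d−j)², the interval uses a normal approximation, and only the variance of the increment is used for the overlap interval. Impossible ranges are eliminated by clipping. All names are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def expected_people(k, d=365):
    """Expected number of individuals needed to fill k of d birth dates."""
    return sum(d / (d - j) for j in range(k))


def people_variance(k, d=365):
    """Sum of geometric waiting-time variances (1 - p) / p**2 with
    p = (d - j) / d, which simplifies to j * d / (d - j)**2."""
    return sum(j * d / (d - j) ** 2 for j in range(k))


def size_interval(k, d=365):
    """(lower, estimate, upper) for the population size of one cohort.
    Fewer than k individuals cannot produce k dates, so lower is clipped."""
    estimate = expected_people(k, d)
    half = Z_95 * math.sqrt(people_variance(k, d))
    return max(float(k), estimate - half), estimate, estimate + half


def overlap_interval(k_a, k_b, k_ab, d=365):
    """(lower, estimate, upper) for the overlap of one cohort.
    Assumption: only the variance of the increment is used. The overlap
    cannot be negative or exceed the smaller population size."""
    k_large, k_small = max(k_a, k_b), min(k_a, k_b)
    smaller_size = expected_people(k_small, d)
    increment = sum(d / (d - j) for j in range(k_large, k_ab))
    variance = sum(j * d / (d - j) ** 2 for j in range(k_large, k_ab))
    estimate = smaller_size - increment
    half = Z_95 * math.sqrt(variance)
    return (max(0.0, estimate - half), estimate,
            min(smaller_size, estimate + half))


if __name__ == "__main__":
    print(size_interval(200))
    print(overlap_interval(100, 50, 120))
```

If the cohorts are treated as independent, the per-cohort variances can be summed before taking the square root to obtain an interval for the total, in line with the summation across cohorts described above.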
Objects of the present invention are further achieved by providing an apparatus and method for probabilistic population overlap determination when the population size is known for a first and second data set containing unique entities identified by unique identifiers, but where the type of unique identifier used in the first data set is different from the type of unique identifier used in the second data set. A storage medium stores first and second data sets containing records on the unique entities with first unique identifiers for the unique entities in the first data set different from second unique identifiers for the unique entities in the second data set, and having at least one common type of information with a known distribution of finite expectation. A computer combines the first and second data sets into a combined data set and determines the population overlap of the unique entities between the first and second data sets by subtracting a probabilistic incremental number of unique entities needed for a larger total number of values of the information with the known distribution from either of the first and second data sets to increase to a total number of values of the information with the known distribution in the combined data set from a smaller of the population size of the first and second data sets.