A recurring problem in the construction and use of databases and the data stored therein is that the data from which a database is to be constructed is often not available in the necessary or desired form. For example, the data may not be available in its original form, but only in one or more processed forms in other databases. These other databases, in turn, may have been created to a different model than that of the intended database, that is, to a different set of criteria or for different purposes. The available databases may thereby contain data or represent data relationships in forms that are different both from the original forms of the data or data relationships and from the forms desired or necessary for the new database.
To illustrate, microdata records are machine readable data files issued by a public or governmental agency, such as the U.S. Census Bureau, and contain records representing information gathered and processed by the agency. The original data from the U.S. Census is not available to the public, but databases constructed from the census data are publicly available in various forms. Examples are the 1% and 5% Public Use Microdata Samples (PUMSs) issued by the U.S. Census Bureau for the 1980 Census and the 1990 Census. Another source of census information is the Summary Tape File (STF) tabulations of census long-form population, household and housing characteristics for census tracts, block groups and other census defined areas.
PUMSs and STF tabulations are both created from the original census master file data records, each of which contains detailed individual data for each census subject and identifies each subject by name and residence, that is, by address or housing unit. PUMSs and STF tabulations, however, are each constructed on models, that is, according to sets of criteria, which represent properties which are different from both the original data records and from each other, wherein the properties of a database may be defined as comprised of the types or classes of information represented by the values stored in the databases. As such, PUMSs and STF tabulations are constructed in a manner which modifies or destroys certain of the properties of the data elements of the original data records or the relationships between the data elements of the original data records, or both.
For example, while PUMSs are subsets of records in Census master files, not all data entities, such as exact location of housing unit and type of group quarters, are recorded in PUMS records, and some PUMS data entities are recoded from multiple fields of the Census master files or from one or more records in a household. In addition, in order to protect the privacy of census respondents, all personal identification information, such as names and addresses, is deleted from PUMSs, and extreme incomes and house values are recoded to a uniform topcode.
Further, all of the data records in a set of data records comprising a PUMS are drawn from the same geographic area, but the location and extent of each such geographic area, referred to as a Public Use Microdata Area (PUMA), are determined by the area necessary to meet a minimum size requirement for the population from which the set is drawn. To protect respondents' privacy, PUMAs are constructed so that they contain a population of no less than 100,000 persons. Therefore, while the sets are delineated on a per state basis and, when possible, on a per county basis, the primary criterion for delineating a PUMA and constructing a set of data records for a PUMS for the PUMA is the size of the population from which the set is drawn, and any correlation between the geographic delineation of a PUMA and any geographic area of analytic interest is a secondary consideration.
The data records in a given PUMS database may therefore be identified only as regards state and, in many instances, a single set of large subdivisions within a state, each having a population of at least 100,000. A given PUMS therefore contains data records identifiable to a large county, a part of a large county or multiple counties, and will often merge data records from demographically disparate areas, such as two distinct neighborhoods or areas of a city, or a part of a city and an adjacent suburban or rural area.
Therefore, while the data records contained in a PUMS are a statistically accurate representation of the original data for the PUMA, and include or reflect all of the original census data records in the PUMA, the data records have been divorced from any general relationship with defined geographic areas of social, economic, political or commercial significance and comprised of specific counties, cities, towns or neighborhoods.
Further, and as mentioned above, the data records included in a PUMS do not include a data record for each person, housing unit or household in a PUMA, but contain representative samples of the persons, housing units or households in the PUMA. In order to expand the data records to represent the total population and number of households and housing units in a PUMA, it is therefore necessary to apply a statistical weight, equal to the inverse of the sampling ratio, to each data record. These weights may vary across persons, households and housing units, in which case the data records are further processed by encoding the statistical weight in each individual data record. In other cases, a uniform weight is a characteristic of the data set as a whole.
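The weighting described above can be illustrated with a minimal Python sketch. The record layout, field names and values below are hypothetical and are not taken from any actual PUMS file; the sketch only shows how a weight equal to the inverse of the sampling ratio expands sample records to estimated totals.

```python
# Hypothetical sample records; field names and values are illustrative only.
# Each record carries its own weight, since weights may vary across records.
sample_households = [
    {"household_id": 1, "persons": 3, "weight": 20.0},
    {"household_id": 2, "persons": 1, "weight": 20.0},
    {"household_id": 3, "persons": 4, "weight": 25.0},
]


def weight_from_sampling_ratio(sampling_ratio):
    """Return the expansion weight, i.e. the inverse of the sampling ratio."""
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling ratio must be in the interval (0, 1]")
    return 1.0 / sampling_ratio


def estimated_total(records, field=None):
    """Weighted count of records, or weighted sum of `field` if one is given."""
    if field is None:
        return sum(r["weight"] for r in records)
    return sum(r["weight"] * r[field] for r in records)


# A 5% sample corresponds to a uniform weight of about 20 per record.
uniform_weight = weight_from_sampling_ratio(0.05)

# Estimated number of households and persons represented by the sample.
estimated_households = estimated_total(sample_households)
estimated_persons = estimated_total(sample_households, "persons")
```

With the illustrative values shown, the three sample households stand for an estimated 65 households and 180 persons.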
An STF tabulation contains one or more sets of data records wherein each set of data records is compiled from the original census data records for a standard, systematic hierarchy of census geographic areas. This standard hierarchy includes the following classes of areas: block, within block group (or part), within census tract or block numbering area (or part), within place (or part), within county subdivision, within county, within State, within division, within region, within the United States. In addition, and independently of this standard hierarchy of areas, STF tabulations are produced for entities which may be split in a hierarchical presentation, such as place, census tract/block numbering area or block group. For example, FIGS. 1A, 1B and 1C show the Summary Level Area Sequence Chart for STF 3A in the U.S. Census of 1990. Therefore the STF data can be selected and identified for all of these socially, economically or politically significant areas. Data records in an STF tabulation contain data elements which represent the distribution of the values of one or more elements of the household, person or housing unit records represented in the tabulation of an STF area. These distributions may be one-way distributions, that is, simple statistical frequencies, or two or more way distributions, that is, cross tabulations, and may be based upon single data entities or composites, that is, recodings, of multiple data entities. Some aggregates, or sums, and median values are also reported in STF tabulations, but these are an exception.
The data records in an STF tabulation, therefore, all belong to elements in one or more hierarchies of geographic areas, such as counties, towns, cities, minor civil divisions, neighborhoods and subdivisions thereof, rather than to geographic areas arbitrarily defined to exceed a minimum population size. The geographic properties of the STF data records, that is, the relationships between the data records and defined areas of social, economic, political or commercial significance, are thereby retained in STF tabulations. The individual, detailed properties of the persons, housing units or households in any geographic area of an STF tabulation, however, are submerged in the aggregates of these properties, thereby destroying the properties of and the relationships within and between the individual data records but protecting the confidentiality of individual responses by combining the responses of many households or persons into the aggregates.
Further, the STF records for different areas can be combined to obtain aggregated tabulations for defined areas comprised of counties, towns, cities, minor civil divisions, places, Census tracts, block groups or combinations of such STF reporting areas. These records can be aggregated because almost all of their elements are additive, that is, the distribution of the values for a variable characteristic field for area C, which is comprised of STF reporting areas A and B, is the sum of the separate distributions of the values of the field for areas A and B, or of the aggregates of such properties as total income, house value, and so on, with the exception of medians, which are not additive. Therefore, STF tabulation data can in general be derived for any and all larger areas which can be defined as aggregates of STF reporting areas. Such areas may be of general economic, social or political significance or of significance for a particular commercial, organizational or analytic purpose, such as a service area or market territory or trading area, analytic zone or PUMA.
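The additivity of STF distributions, and the exception for medians, can be sketched as follows. The category names, counts and income values are hypothetical and chosen only for illustration; in particular, the individual income values would not be available from an actual STF tabulation and appear here only to show why a combined median cannot be obtained from the reported medians.

```python
from collections import Counter
from statistics import median

# Hypothetical one-way distributions (household counts by income bracket)
# for two STF reporting areas A and B.
area_a = Counter({"under_25k": 40, "25k_to_75k": 50, "over_75k": 10})
area_b = Counter({"under_25k": 10, "25k_to_75k": 30, "over_75k": 60})

# Area C is comprised of A and B, so its distribution is the
# category-by-category sum of the two distributions.
area_c = area_a + area_b

# Aggregates such as total income are additive in the same way.
total_income_a = 2_500_000
total_income_b = 6_000_000
total_income_c = total_income_a + total_income_b

# Medians are not additive. Hypothetical underlying values show that the
# combined median is not the sum of the two medians.
incomes_a = [10, 20, 30]
incomes_b = [25, 100, 200, 300, 400]
median_a = median(incomes_a)
median_b = median(incomes_b)
median_c = median(incomes_a + incomes_b)
```

In this illustration the medians for A and B are 20 and 200, while the median for the combined area is 65, which cannot be recovered from the two reported medians alone.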
Therefore, and in illustration of the general problem, while it is frequently useful and desirable to have a database representing detailed information on individual persons, housing units and households in user definable geographic areas of social, economic, political or commercial significance, this information is not accessible from either PUMSs or STF tabulations. That is, the original census data records are not publicly available for such purposes and, while PUMSs retain the detailed data elements and properties of the original, individual data records, the relationship between the data records and the geographic areas of interest is lost. STF tabulations, in turn, preserve the relationships between the data records and user definable geographic areas of interest, but the properties of the individual data records are lost.
To illustrate various approaches of the prior art to this general problem, using the example discussed above, a number of ad hoc methods have been employed to allow use of PUMS data for particular territories which cross PUMA boundaries.
For example, one approach approximates a particular territory with data for the set of PUMAs which best matches the territory boundaries. This approach can give accurate results when the match between the territory and the PUMAs is close, but does not provide a general solution to the problem because of the large minimum size of PUMAs.
Another approach approximates a particular territory with data for the set of PUMAs which best matches the territory boundaries and ratio adjusts either tabulated frequencies or tabulated weighted frequencies by the ratio of the territory population to total PUMA population, which is known from STF tabulations. This adjustment to the data corrects for gross differences in the population size of the PUMA set and that of the particular territory, but not for systematic differences, that is, non-homogeneities, between the household, housing or population characteristics of the territory and non-territory portions of the PUMA set, that is, the portions of the PUMA set which are respectively included and not included in the territory.
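A minimal sketch of such a ratio adjustment is given below. The function name, category names and population figures are assumptions made for illustration and do not come from any particular prior art implementation.

```python
def ratio_adjust(puma_tabulation, territory_population, puma_population):
    """Scale each tabulated frequency by territory / PUMA-set population."""
    if puma_population <= 0:
        raise ValueError("PUMA population must be positive")
    return {
        category: count * territory_population / puma_population
        for category, count in puma_tabulation.items()
    }


# Hypothetical weighted frequencies tabulated for the best-matching PUMA set.
puma_tabulation = {"owner": 30000, "renter": 20000}

# Populations assumed to be known from STF tabulations.
adjusted = ratio_adjust(
    puma_tabulation,
    territory_population=60000,
    puma_population=150000,
)

# The adjustment changes the overall size but not the mix of categories.
owner_share_before = puma_tabulation["owner"] / sum(puma_tabulation.values())
owner_share_after = adjusted["owner"] / sum(adjusted.values())
```

Because every category is scaled by the same factor, the owner share is the same before and after the adjustment. This is the limitation noted above: if the territory differs systematically from the rest of the PUMA set, the adjusted tabulation still carries the composition of the whole PUMA set.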
Yet another approach approximates the particular territory with data for the set of PUMAs which best matches the territory boundaries and adjusts tabulated frequencies or joint distributions by marginal or joint frequency distributions for the territory as obtained from STF tabulations. This adjustment corrects for systematic differences between the household, housing or population characteristics of the territory and non-territory portions of the PUMAs insofar as these characteristics are reported in STF tabulations. Not all characteristics are reported, however, and those that are reported are only reported in a particular format which may not conform to the desired tabulation. This approach therefore requires a high degree of analytic skill, sophisticated data processing and labor and is not effective for tabulations of user-specific recodes of variables or for other multivariate statistical procedures.
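One simple form of such an adjustment, in which a two-way PUMA tabulation is scaled so that its row totals agree with a single marginal distribution for the territory, might be sketched as follows. This is an illustrative assumption about how the adjustment could be carried out, not a description of any specific prior art procedure; the function name, categories and counts are hypothetical.

```python
def adjust_to_marginal(cross_tab, territory_marginal):
    """Scale each row of a two-way tabulation to match a target row total."""
    adjusted = {}
    for row_key, row in cross_tab.items():
        row_total = sum(row.values())
        target = territory_marginal[row_key]
        adjusted[row_key] = {
            # Rows with no observations cannot be scaled and stay at zero.
            col: (value * target / row_total if row_total else 0.0)
            for col, value in row.items()
        }
    return adjusted


# Hypothetical tenure-by-income tabulation for the best-matching PUMA set.
puma_cross_tab = {
    "owner": {"low_income": 100, "high_income": 300},
    "renter": {"low_income": 400, "high_income": 200},
}

# Hypothetical tenure marginal for the territory, as from an STF tabulation.
territory_tenure = {"owner": 200, "renter": 120}

territory_estimate = adjust_to_marginal(puma_cross_tab, territory_tenure)
```

The sketch corrects only for the one characteristic whose marginal is supplied. The distribution of income within each tenure category is still that of the PUMA set, which reflects the limitation described above for characteristics that are not reported in the STF tabulations or are not reported in the required form.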
Still another approach of the prior art is to create a territory subset from the PUMS by randomly drawing, for each partially included PUMA, a subsample of the correct size as determined by population. This approach relaxes the constraints noted for the previous approach but leads to biases due to systematic differences, that is, non-homogeneities, between the household, housing or population characteristics of the territory and non-territory portions of the PUMAs.
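The random subsampling approach can be sketched as follows, assuming that the subsample size is taken in proportion to the share of the PUMA population that lies within the territory. The function name, record layout and population figures are hypothetical.

```python
import random


def draw_subsample(puma_records, territory_population, puma_population, seed=None):
    """Randomly draw a subsample sized by the territory's population share."""
    if not 0 <= territory_population <= puma_population:
        raise ValueError("territory population must lie within the PUMA population")
    rng = random.Random(seed)
    size = round(len(puma_records) * territory_population / puma_population)
    # Sampling without replacement; every record is equally likely to be
    # chosen, regardless of where in the PUMA it actually originated.
    return rng.sample(puma_records, size)


# Hypothetical PUMS records for one partially included PUMA.
puma_records = [{"record_id": i} for i in range(1000)]

territory_subset = draw_subsample(
    puma_records,
    territory_population=30000,
    puma_population=120000,
    seed=0,
)
```

The draw yields a subset of the expected size, here one quarter of the records, but since the selection is independent of the characteristics of the territory, any systematic difference between the territory and the remainder of the PUMA is carried into the subset as bias.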
The present invention provides a solution to these and other related problems of the prior art.