Today, media researchers and management information workers have at their disposal a wide range of information about consumers. Such data includes some sets collected by obtrusive, active measures, as well as some from passive, unobtrusive observation. Examples of the former include surveys, warranty registrations, active data collection through internet-connected devices, and frequent shopper programs. Examples of the latter include data taken from transaction streams, coupon redemptions, credit card transactions, TV viewing behavior recorded by digital set-top boxes, observed internet behaviors (such as interactions tracked through cookies), IP tracking, and so forth. In the past, most of the data used for population estimates has come from data sets having some well-defined, known relationship to the population, e.g., probability samples. Driven by demand, by the failure of some of the old paradigms, and by new technologies that produce diverse and potentially useful pieces of information, more and more of the available data comes from data sets that do not have well-defined, known relationships to the population and are not directly representative of the population to be measured. While both types of data are potentially useful, current technology provides very few tools for improving the accuracy of population estimates based on data that lacks a well-defined, known relationship between the elements of the data set and the elements of the population.
When the size of a population to be analyzed (the “target population”) is large, researchers who need to analyze information about the target population (herein “primary data”) rarely perform a systematic measurement of the primary data for all members of the population (that is, a “complete census”) because the cost of gathering so much information is often too high, the time it takes to collect the data is too long, or it is impractical for some other reason. A well-known example of high data collection cost is the Decennial Census. In the United States, the 2010 Census cost approximately $13 billion to collect data on approximately 308 million US residents, according to the U.S. Government Accountability Office publication, “Preliminary Lessons Learned Highlight the Need for Fundamental Reforms.”
Because of the high cost of performing a census of a large target population, researchers will typically collect data from only a subset of the population (i.e., sample the population), and will then estimate characteristics of the overall population based on that sample and its relationship to the population. One problem with this method is that the sample can be misleading due to the presence of known as well as unknown biases in the sample selection process itself. For example, a sample can fall victim to self-selection bias because some members of the sampled population refuse to be observed or cannot be observed. Samples, including those drawn from transactional data sets (such as purchases made at a cash register by credit card, or television viewing transactions in a household subscribing to a television service that monitors such transactions), are often selective of participants in such a manner that the resulting sample is biased: it inaccurately represents the overall target population in substantial and unknown ways. The presence of these selection biases can make the resulting estimates of population characteristics systematically inaccurate, i.e., biased in a particular direction.
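The effect of self-selection bias, and the probability-based correction that becomes possible only when selection probabilities are known, can be illustrated with a small simulation. All distributions and probabilities below are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical population of purchase amounts: a large group of low
# spenders and a smaller group of high spenders.
population = ([random.gauss(50, 10) for _ in range(100_000)]
              + [random.gauss(150, 20) for _ in range(25_000)])

def inclusion_prob(amount):
    # Assumed self-selection behavior: high spenders are far less
    # likely to agree to be observed.
    return 0.5 if amount < 100 else 0.1

sample = [x for x in population if random.random() < inclusion_prob(x)]

true_mean = sum(population) / len(population)     # about 70
naive_mean = sum(sample) / len(sample)            # biased low, about 55

# Horvitz-Thompson style correction: weight each observation by the
# inverse of its selection probability.
weights = [1 / inclusion_prob(x) for x in sample]
ht_mean = sum(w * x for w, x in zip(weights, sample)) / sum(weights)
```

The correction recovers the population mean here only because the simulation knows each member's inclusion probability; for the biased transactional samples described above, those probabilities are precisely what is unknown.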
For convenience in this description, some terminology will now be defined. Collected data that is the primary data set used to make a population estimate, whether obtained obtrusively or unobtrusively, is hereinafter referred to as “subject data.” A data set that is to be used to derive properties of a target population will be referred to as “reference data.” Unobtrusively obtained data that represents specific events (such as a credit card transaction, a channel change on a television set-top box, a click on a URL in a web browser, a frequent flier transaction, or a loyalty program transaction with a merchant) is hereinafter referred to as “transactional data.” A distinction is made herein between “personally-identifiable information”—which is to say, data of sufficient specificity that it can be used to identify a particular individual person or household, such as a social security number, a name/address combination, a credit card number, etc.—and “personal information” which, while not necessarily sufficient to identify a particular individual or household, is nevertheless considered to be private information, such as income, religious preference, age, etc. There are many legal restrictions on the use of personally-identifiable information. Furthermore, many companies are sensitive to the use of personal information, even in the absence of specific legal restrictions.
Media research has historically been carried out in situations where the researcher controls the sample, the sampling frame, and the survey questions asked. Statistical methods and estimation procedures have been developed to account for differences between the properties of the sample and those of the overall population that is the target of the study. Through combinations of techniques—such as careful sample frame design, probability sampling, over-sampling, optimal allocation, and sample balancing—a rich toolbox of methodologies has been developed. Most of these techniques make use of probability theory to construct estimates of the population characteristics from sample data. Some, like sample balancing, do not use probability mechanisms but assume, at a minimum, that the data to be analyzed has sufficient detail to enable the researcher to construct calibration-type estimates, using the values for the calibration variables collected directly from the sample elements, to make estimates for the desired population. In summary, these calibration estimates and processes require that the variables used for calibration be present in the data collected from the sample.
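One of the sample-balancing techniques mentioned above, raking (iterative proportional fitting), can be sketched as follows. The respondents, calibration variables, and population margins are invented for illustration; note that the respondent-level values of the calibration variables must be present in the sample, exactly as required:

```python
# Each respondent record carries the calibration variables natively.
sample = [
    {"age": "18-34", "region": "urban"},
    {"age": "18-34", "region": "rural"},
    {"age": "35+",   "region": "urban"},
    {"age": "35+",   "region": "urban"},
]

# Known population margins (proportions) for each calibration variable.
targets = {
    "age":    {"18-34": 0.5, "35+": 0.5},
    "region": {"urban": 0.6, "rural": 0.4},
}

weights = [1.0] * len(sample)

for _ in range(50):                    # iterate until the margins settle
    for var, margin in targets.items():
        total = sum(weights)
        for level, share in margin.items():
            current = sum(w for w, r in zip(weights, sample)
                          if r[var] == level)
            factor = share * total / current
            weights = [w * factor if r[var] == level else w
                       for w, r in zip(weights, sample)]

# The weighted sample now reproduces both population margins.
```

Each pass rescales the weights so that one variable's weighted distribution matches its population margin, and the passes are repeated until all margins agree simultaneously.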
The usual estimation techniques based on probability sampling are often inadequate when the sample is subject to selection bias. Because the subject data set is not necessarily based on a probability sample drawn from a defined sampling frame that completely covers the target population, the rates or incidences of some variables or their values may not be good approximations of the corresponding rates in the population at large. Estimators using this data do not have sample selection probabilities available to adequately correct for the resulting biases in the subject data set. If variables suitable for calibrating the estimators are available in the subject data set, then calibration or related techniques can be used to adjust the estimates. However, these conventional statistical techniques require that respondent-level information for the balancing variables be present in the subject data set. In other words, sample balancing techniques require that all the variables used for balancing be present in (i.e., native to) the data set, such that the balancing variables may actually be observed or measured for each respondent in the data set. For example, in the course of making a population estimate, conventional balancing techniques can be applied to improve the representativeness of a data set in order to match a target population's demographic statistics, but only when the subject data set contains the necessary demographic data for every respondent.
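A minimal post-stratification sketch, with invented respondents and target shares, makes the requirement concrete: the weighting below is possible only because each record carries the demographic variable ("gender" here) natively:

```python
respondents = [
    {"gender": "F", "viewed_ad": 1},
    {"gender": "F", "viewed_ad": 0},
    {"gender": "F", "viewed_ad": 1},
    {"gender": "M", "viewed_ad": 0},
]  # the sample is 75% F / 25% M

population_shares = {"F": 0.5, "M": 0.5}   # known target demographics

# Weight each respondent so the weighted sample matches the population.
counts = {}
for r in respondents:
    counts[r["gender"]] = counts.get(r["gender"], 0) + 1
n = len(respondents)
for r in respondents:
    r["w"] = population_shares[r["gender"]] / (counts[r["gender"]] / n)

unweighted = sum(r["viewed_ad"] for r in respondents) / n          # 0.50
weighted = (sum(r["w"] * r["viewed_ad"] for r in respondents)
            / sum(r["w"] for r in respondents))                    # about 0.33
```

If the demographic field were stripped from each record for privacy reasons, these weights could not be formed at all.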
The fact that such conventional survey estimating techniques require that all balancing variables be present in the data set is, however, a severe limitation in view of modern data privacy requirements. For example, many consumers are nowadays averse to allowing a survey, data collection, or marketing company to collect (or combine) personal information along with the primary survey or transactional data that is intended to be analyzed. Many consumers object to providing (or may be unable to provide) information such as: whether they viewed a particular television advertisement, program, or channel; whether they viewed a particular internet Web site or otherwise consumed other internet content, such as through a smartphone application; whether they purchased or would purchase a particular product, and under which conditions the actual or potential purchase took place; whether they used a particular service, and under what conditions; and the like.
In some cases, a member of a target population may choose to participate in a survey (or in a transaction) that collects personal information only upon receipt of adequate compensation (such as monetary compensation, a product discount coupon, or a place at the front of the line to try a new product) for the perceived risk of their information being used or their privacy being potentially compromised. Other members of a target population may choose not to participate in any survey that collects any personal information at all. Therefore, the very fact that a researcher asks a respondent for personal information (as distinct from personally-identifiable information), or offers a respondent compensation in exchange for such information (or offers no compensation, or the wrong type or level of compensation), affects which portion of the population will become survey respondents, and thereby may introduce a bias into the survey results that would not exist without the collection of the personal information. Furthermore, any of the other conditions under which the survey or data collection takes place (such as the time of day, the day of the week, the location, or a variety of other conditions under which different members of a target population may be more or less likely to respond) can affect the representativeness of the sample and which portion of a population's potential respondents decide to participate, thus introducing a bias into the sample.
Moreover, consumers about whom information is collected in many transactional databases (which could be analyzed similarly to how survey information relating to transactions is analyzed) may be entitled to receive notification about the conditions under which their personal information is used or shared with other companies. It is now often legally or culturally unacceptable even to ask a customer (who may be a member of a target population for which a statistical analysis is desired) for their race, sex, age, height, weight, religion, family status, marriage status, disability status, mobility, home ownership, location, employment status, industry, income, education level, political affiliation, sexual preference, any other demographic information, or any other information that may be limited by a privacy policy (whether personally-identifiable information or personal information).
At the same time, such consumers are also generally empowered by law to demand that a company refrain from using or sharing their personal information in specific ways, thereby limiting the manner in which the company can use the consumer's personal information. For example, the Financial Services Modernization Act of 1999 (the “Gramm-Leach-Bliley Act”) requires financial institutions to provide each consumer with a “privacy notice” at the time the consumer relationship is established, as well as every year thereafter. This and other privacy laws now exist in the US, as well as in other countries, and affect a wide array of industries and markets.
One of the problems with eliminating personal information from a database, however, is that prior art survey analysis techniques (such as sample balancing) for projecting the statistics of a survey or other data collection effort onto a desired target population (such as a target market segment for the most profitable sales of a commercial product) require the presence of personal information in the data set in order to make the survey statistics more representative of the target population (in other words, to reduce bias).
Much of the past art has concentrated on estimation techniques that rely on probability sampling and the building of probability-based estimators. In recent times, however, a greater need has arisen to make use of data sets that are not collected from strict probability samples (because, for example, of sample non-response or coverage problems with the frame, or because the data has been harvested from some other process designed for another purpose). As described above, such data often does not contain the essential calibration variables needed to make reasonable estimates of population characteristics. The variables measured during data collection are often not as well selected as in a carefully planned sample survey that uses a probability sample and data collection instruments designed with targeted uses in mind, e.g., the Current Population Survey conducted by the U.S. Census Bureau. The end result is that the variables present in many data sets cannot be pre-determined by the researcher, and even if calibration variables are present, some or all may be excluded from use by privacy restrictions attached to the data. Examples include internet ad-server logs, television set-top box viewing data, and credit card transactional data. Some of this data may, for example, be the remnants of a transaction, or a piece of a transaction or internet interaction, collected within a transactional “pipeline.” Such data sets often include many millions or billions of data points, but each individual respondent data point may lack supplemental information such as traditional demographic data; furthermore, the sample may be of unknown quality and likely to be unrepresentative of the overall population due to inherent selection or other biases. There is therefore a need for estimation techniques that can leverage these data sets despite the absence of usable calibration variables from the collected data.
As a result of at least the aforementioned problems, a need exists for statistical estimation techniques that allow estimates from survey or other sample data, transactional data, or statistics to be adjusted to be more representative of a desired target population without requiring that the variables used for the adjustment (such as the personal information of the participants) be present in the data set. Doing so would allow more accurate estimation of the characteristics of a target population without requiring that personal information for the participants be stored, or even directly known.