1. Field
The present inventions relate generally to computer systems for large-scale data analytics and, more specifically, to computer systems for evaluating the effects of content distributed over networks on behaviors of groups sharing some attribute as indicated by network traffic.
2. Description of the Related Art
Geolocation analytics platforms are generally used to understand human behavior. Such systems map data about places to geographic locations and then use this mapping to analyze patterns in human behavior based on people's presence in those geographic locations. For example, researchers may use such systems to understand patterns in health, educational, crime, or political outcomes in geographic areas. And some companies use such systems to understand the nature of their physical locations, analyzing, for instance, the demographics of customers who visit their stores, restaurants, or other facilities. Some companies use such systems to measure and understand the results of TV advertising campaigns, detecting changes in the types of customers who visit stores following a campaign. Some companies use geolocation analytics platforms to target content to geolocations, e.g., selecting content like business listings, advertisements, billboards, mailings, restaurant reviews, and the like, based on human behavior associated with locations to which the content is directed. In many contexts, location can be a useful indicator of human behavior.
Traditional geolocation analytics platforms are not well suited for performing complex analyses on large data sets, as often arise in the context of analyzing web-scale data sets describing user behavior on a network. In many cases, simplifying assumptions are made to render the analysis more tractable for available computers and software, but these assumptions can give rise to various biases that skew the analyses and yield misleading results.
One noteworthy example of such a misleading result is Simpson's paradox, where an analysis may reveal a particular effect in a population, but when the analysis is repeated for groups within that population, the effect can disappear or even reverse. In some cases, the group-to-group variation overwhelms the effect caused by a treatment, making the effects of the treatment on the various groups appear different from what is actually happening.
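By way of illustration only, the reversal described above can be reproduced with a small numerical sketch. The group names, arm names, and counts below are hypothetical values chosen to exhibit the effect; they are not measurements from any system described herein. In this sketch, the treated segment outperforms the control segment within each group, yet appears to underperform when the groups are pooled, because the treated and control segments are distributed unevenly across the groups.

```python
from fractions import Fraction

# Hypothetical counts, chosen only to exhibit the reversal.
# Each entry is (members observed, members exhibiting the behavior).
counts = {
    "group_a": {"treated": (87, 81), "control": (270, 234)},
    "group_b": {"treated": (263, 192), "control": (80, 55)},
}


def rate(observed, positive):
    """Exact fraction of observed members who exhibited the behavior."""
    return Fraction(positive, observed)


def lift_by_group(counts):
    """Treated rate minus control rate, computed separately per group."""
    return {
        group: rate(*arms["treated"]) - rate(*arms["control"])
        for group, arms in counts.items()
    }


def pooled_lift(counts):
    """Treated rate minus control rate after pooling all groups together."""
    pooled = {}
    for arm in ("treated", "control"):
        observed = sum(arms[arm][0] for arms in counts.values())
        positive = sum(arms[arm][1] for arms in counts.values())
        pooled[arm] = rate(observed, positive)
    return pooled["treated"] - pooled["control"]


per_group = lift_by_group(counts)
pooled = pooled_lift(counts)

for group, lift in per_group.items():
    print(f"{group}: lift = {float(lift):+.3f}")
print(f"pooled: lift = {float(pooled):+.3f}")
```

With these counts, the lift is positive in each group and negative for the pooled population, so an analysis that does not disaggregate by group would report the opposite sign from the one observed within every group.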
Yet it is common to ignore this issue and other sources of bias because, particularly for stochastic analyses of large data sets that reveal themselves over time, it can be difficult to consistently and reliably disaggregate control and treatment segments of the population. The difficulty is compounded when the groups at issue have intersecting sets of members, when the number of groups is relatively large, and when the members of the population appear inconsistently over time. Further challenges arise from efforts to avoid selection bias, as often happens when users' behaviors on networks make certain groups more likely to be represented in a sample.