Web site operators typically track user interactions with their web sites in order to determine the effectiveness of the web pages of a web site and their layout. This tracking often includes recording one or more particular user interactions related to a given web page. The operators typically prefer to obtain as much information as possible about these interactions, often tracking such metrics as the number of clicks on specific hyperlinks or advertisements on a web page, identifiers of the feature or features clicked on, the time spent viewing a web page, or the number of times an ad was displayed on a web page. To record this information, each individual user interaction is monitored and information describing it is stored in a record of some type. A record of an individual user interaction may be referred to as an “event”. An event may include such information as an indication that a hyperlink or an advertisement was displayed to, clicked on by, or otherwise interacted with by a user; an identifier of the user, of the item clicked on, viewed, or otherwise interacted with, or of the web page; the date and time of the user interaction; the software or equipment used by the user; and one or more metrics associated with the user interaction, such as an amount paid in a purchase transaction or the time spent in an activity. For example, one event may be a single user click on a hyperlink and another event may be a display of a specific advertisement. Each event is recorded as raw event data for later analysis to determine the effectiveness of the web page.
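By way of illustration only, an event record of the kind described above might be sketched as follows. The class name `Event` and every field name are hypothetical and chosen for this example; the description above does not prescribe any particular record layout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class Event:
    """One recorded user interaction (hypothetical layout, for illustration)."""
    event_type: str        # e.g. "click" or "impression"
    user_id: str           # identifier of the user
    item_id: str           # identifier of the hyperlink or advertisement
    page_url: str          # identifier of the web page
    timestamp: datetime    # date and time of the interaction
    user_agent: str        # software or equipment used by the user
    # Optional metrics such as an amount paid or time spent in an activity.
    metrics: Dict[str, float] = field(default_factory=dict)


# One event for a single click on a hyperlink that led to a purchase.
click = Event(
    event_type="click",
    user_id="user-123",
    item_id="link-42",
    page_url="/products",
    timestamp=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    user_agent="ExampleBrowser/1.0",
    metrics={"amount_paid": 19.99},
)

# Another event for the display of a specific advertisement.
impression = Event(
    event_type="impression",
    user_id="user-123",
    item_id="ad-7",
    page_url="/products",
    timestamp=datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc),
    user_agent="ExampleBrowser/1.0",
)
```

Each such record corresponds to one unit of raw event data; a large web site may accumulate a very large number of them.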
Storage of the raw event data represents a significant burden on operators of large web sites, as the number of events can be quite large and the data is often stored for long periods of time in order to run many different analyses on the data.
In addition to the storage burden, the processing of the raw event data is also time-consuming, as the raw event data is typically reprocessed for each analysis. Several different approaches have been adopted for processing this raw event data. One approach is to process the raw event data directly, which retains the native resolution of the data because no intermediate processing is performed. However, each analysis requires a reprocessing of the entire data set. Furthermore, if processing is done in real time, the intermediate calculations become progressively more expensive as new data are received.
Another typical approach is a random partitioning of the raw event data. In this approach, the events in the raw event data for a specified period of time are randomly selected and aggregated together into several partitions for that time period. Averages and other metrics for each partition are then determined. This partition data, and not the raw event data, is then used to characterize the distribution of the data for the time period, so reprocessing is not required when performing subsequent analyses. However, depending on the number of partitions used (typically on the order of 30 to 40), this represents a significant loss of resolution from the raw event data, where thousands or tens of thousands of individual samples may have been taken.
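The random partitioning approach can be illustrated with the following minimal sketch. The function name `random_partition_means`, the use of a shuffle to assign events to partitions, and the synthetic metric values are all assumptions made for this example; the description above does not specify how events are assigned or which per-partition metrics are computed.

```python
import random
from statistics import mean
from typing import List, Sequence


def random_partition_means(
    values: Sequence[float],
    num_partitions: int = 30,
    seed: int = 0,
) -> List[float]:
    """Randomly assign values to partitions and return each partition's mean.

    Hypothetical helper: shuffling and then striding is one simple way to
    obtain a random assignment of events to near-equal-sized partitions.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    # Deal the shuffled values round-robin into num_partitions groups.
    partitions = [shuffled[i::num_partitions] for i in range(num_partitions)]
    # Skip empty partitions, which occur only if there are fewer values
    # than partitions.
    return [mean(p) for p in partitions if p]


# Synthetic stand-in for one metric taken from 10,000 raw events.
raw_values = [float(v) for v in range(10_000)]
partition_means = random_partition_means(raw_values, num_partitions=30)
```

In this sketch, 10,000 individual samples are reduced to 30 partition averages, which is the loss of resolution noted above: later analyses see only the 30 summary values and not the underlying samples.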
Thus, the operator is left with the choice of storing and processing large sets of raw event data, which yield higher-resolution results, or storing and processing smaller aggregated data partitions, with a potential loss of resolution in the results.