Methods and systems designed for analyzing relatively small data sets begin to break or become non-functional as the size of the date sets increase. The analysis of larger amounts of data (colloquially called “big data”) using may require extensive computing resources, including processors and memory. Such methods may require data to be loaded into locally accessible memory, such as system memory or memory cache, where it can be processed to obtain results. However, as the amount of data increases, processing the entirety of the big data in locally accessible memory can become impossible. In such situations, there is a need for generating results for complete analysis of the big data that do not require as much computational cost and latency. To achieve such an outcome, the accuracy of the results may be traded for less computational resources, such as memory.
Big data analysis presents a significant problem, in particular, for large website operators, such as Yahoo! Inc. A large website operator may generate terabytes of data per day describing the traffic to its website, or the content and advertisements displayed. While this vast pipeline of data can be mined for insights into the characteristics and behavior of its users, those insights are simply not available unless the pipeline of data can be analyzed, thereby permitting questions to be asked and answered within a relatively short period of time. For example, if the answer to a question about how many users visited a given website today takes until tomorrow to answer, then the answer may be of little use. Providing faster and more accurate answers to questions such as the unique number of visitors to a given website, or the number of clicks on a given item of content or advertisement, are technical problems of the utmost importance to website operators.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.