Methods and systems designed for analyzing smaller data sets begin to break down or become non-functional as the size of the data set increases. The analysis of larger amounts of data (colloquially called “big data”) using conventional methods can require extensive computing resources, including processors and memory. Conventional methods may require data to be loaded into locally accessible memory, such as system memory or memory cache, where it can be processed to obtain results. However, as the amount of data increases, this can become impossible. In such situations, there is a need to generate results for a complete analysis of the big data at lower computational cost and latency. To achieve such an outcome, the accuracy of the results may be traded for reduced use of computational resources, such as memory.
Big data analysis also frequently involves processing of data along multiple dimensions. Such dimensions could be time or data type. For example, big data may contain a log of user IDs and timestamps for users who have requested a particular web application. The big data may also contain user IDs of blacklisted users for each month. The data analysis may require a monthly count report of all unique non-blacklisted users that have visited the web application. To achieve such a result, not only must the large log of user IDs be extracted from the big data and processed, but a set difference operation must also be performed between the log data and the blacklisted user data. Such requirements for set operations further complicate the data analysis for big data performed with limited computational resources.
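The monthly report described above can be illustrated with an exact, in-memory computation. This is a minimal sketch under stated assumptions: the function name `monthly_unique_visitors`, the `(user_id, month)` record layout, and the sample data are hypothetical and chosen only for illustration. It holds every unique user ID in memory, which is precisely what becomes impractical at big data scale.

```python
from collections import defaultdict


def monthly_unique_visitors(access_log, blacklist_by_month):
    """Count unique non-blacklisted users per month (exact, in memory).

    access_log: iterable of (user_id, month) pairs (hypothetical layout).
    blacklist_by_month: dict mapping month -> set of blacklisted user IDs.
    """
    visitors = defaultdict(set)
    for user_id, month in access_log:
        visitors[month].add(user_id)  # duplicates collapse inside the set

    # Set difference removes that month's blacklisted users before counting.
    return {
        month: len(users - blacklist_by_month.get(month, set()))
        for month, users in visitors.items()
    }


# Hypothetical sample data.
log = [
    ("u1", "2024-01"), ("u2", "2024-01"), ("u1", "2024-01"),
    ("u3", "2024-02"), ("u2", "2024-02"),
]
blacklist = {"2024-01": {"u2"}, "2024-02": set()}

report = monthly_unique_visitors(log, blacklist)
print(report)  # {'2024-01': 1, '2024-02': 2}
```

Memory use here grows with the number of unique users per month, so the approach stops working once the per-month sets no longer fit in memory.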
Big data analysis presents a significant problem, in particular, for large website operators, such as Yahoo! Inc. A large website operator may generate terabytes of data per day describing the traffic to its website, or the content and advertisements displayed. While this vast pipeline of data can be mined for insights into the characteristics and behavior of its users, those insights are simply not available unless the pipeline of data can be analyzed in a way that permits questions to be asked and answered within a relatively short period of time. For example, if a question about how many users visited a given website today takes until tomorrow to answer, then the answer may be of little use. Providing faster and more accurate answers to questions such as the unique number of visitors to a given website, or the number of clicks on a given item of content or advertisement, is a technical problem of the utmost importance to website operators.
More specifically, many big data analysis scenarios, such as user segment analysis, require set operations (e.g., intersection, union and difference) on sets of unique identifiers. When the data is larger than can be normally handled in memory, the unique counting as well as the set operations can be very expensive to compute exactly. If approximate answers are acceptable, then sketching technology can significantly reduce both the computational cost and the latency of obtaining results.
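One well-known way to trade accuracy for memory when counting unique identifiers is a "k minimum values" (KMV) estimator, shown below as a minimal sketch. It is offered as an illustration of approximate unique counting in general, not as the specific sketch contemplated here. The names `hash01` and `kmv_estimate`, the choice of SHA-256, and the parameter `k` are assumptions made for the example.

```python
import hashlib
import heapq


def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) via SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    # Keep 53 bits so the result is an exactly representable float below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def kmv_estimate(values, k=1024):
    """Estimate the number of distinct values, retaining at most k hashes."""
    kept = set()  # the k smallest distinct hashes seen so far
    heap = []     # the same hashes, negated, so heap[0] is the largest kept
    for value in values:
        h = hash01(value)
        if h in kept:
            continue  # already retained, so this is a repeat
        if len(kept) < k:
            kept.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            # Replace the largest retained hash with the smaller new one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(kept) < k:
        return float(len(kept))  # fewer than k distinct values: exact count
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest


# 100,000 records drawn from 50,000 distinct hypothetical user IDs.
stream = (f"user-{i % 50_000}" for i in range(100_000))
estimate = kmv_estimate(stream, k=1024)
print(round(estimate))
```

Memory is bounded by `k` regardless of stream size, and the relative error of this kind of estimator shrinks roughly in proportion to 1/sqrt(k).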
A sketch can be more than just a mechanism to approximate unique counts. It can be thought of as a data structure that approximates a larger set of values. A sketch, in fact, may be a substantially uniform and random reservoir sample of all the unique values presented to it. It is then reasonable to ask: given two sketches, can one determine, approximately, the number of unique values that form the intersection of the two large data sets represented by the sketches? Or, perhaps, could sketches represent other set operations, such as difference, that is, the number of unique values that are present in only one of two large data sets?
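The questions above can be explored with a simplified model in which a sketch is the set of hashed values falling below a fixed threshold. Because every data set is sampled with the same hash function and threshold, ordinary set operations on the samples yield samples of the corresponding set operations on the full data. This is a minimal sketch under stated assumptions: `hash01`, `build_sketch`, `estimate`, and the fixed threshold `THETA` are hypothetical names, and unlike a bounded reservoir this variant does not cap memory.

```python
import hashlib

THETA = 0.05  # assumed sampling threshold: keep roughly 5% of unique values


def hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) via SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def build_sketch(values, theta=THETA):
    """Retain the hashes below theta: a uniform sample of the unique values."""
    return {h for h in map(hash01, values) if h < theta}


def estimate(sample, theta=THETA):
    """Scale a sample size back up to an estimated unique count."""
    return len(sample) / theta


# Hypothetical segments: 60,000 users each, with 20,000 users in common.
segment_a = (f"user-{i}" for i in range(0, 60_000))
segment_b = (f"user-{i}" for i in range(40_000, 100_000))

sketch_a = build_sketch(segment_a)
sketch_b = build_sketch(segment_b)

est_union = estimate(sketch_a | sketch_b)  # true value: 100,000
est_inter = estimate(sketch_a & sketch_b)  # true value: 20,000
est_diff = estimate(sketch_a - sketch_b)   # true value: 40,000 (A without B)

print(round(est_union), round(est_inter), round(est_diff))
```

The same pair of sketches answers all three questions, so no second pass over the raw data is needed. Accuracy is lowest for small results, since few retained hashes then support the estimate.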
For systems that generate millions of sketches or where query latency is critical, the ability to perform set operations on the same sketches that do the unique counting is a huge benefit and eliminates the need for separate processes. Example applications where set operations are intrinsic include segment overlap analysis (intersection), segment rollup analysis (union), retention analysis (intersection), and blacklist removal (set difference). Examples of segments for a website operator might be users (defined by login), impressions on items of content, clicks on advertisements, or other large-scale sets of data relating to website traffic.
Getting faster and more accurate answers to questions about website traffic is an important technical problem facing many website operators.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.