When analyzing massive data sets, even simple operations such as computing a sum or a mean are costly and time-consuming. These simple operations are frequently performed by people investigating the data interactively, asking a series of questions about the data. In addition, automated systems must monitor or collect a multitude of statistics.
Data sketching algorithms enable the information in these massive data sets to be efficiently processed, stored, and queried. This allows them to be applied, for example, in real-time systems, both for ingesting massive data streams and for interactive analysis.
In order to achieve this efficiency, sketches are designed to answer only a specific class of questions, and there is typically error in the answer. In other words, a data sketch is a form of lossy compression on the original data. A person must choose what information to lose from the original data. A good sketch makes efficient use of the data so that the errors are minimized while having the flexibility to answer a broad range of questions of interest. Some sketches, such as HyperLogLog, are constrained to answer very specific questions with very little memory. On the other end of the spectrum, sampling-based methods, such as coordinated sampling, are able to answer many questions about the original data, but at the cost of far more space to achieve the same approximation error.
Many data analysis problems consist of a simple aggregation over some filtering and group by conditions, such as

    SELECT sum(metric), dimensions
    FROM table
    WHERE filters
    GROUP BY dimensions
This problem has several variations that depend on what is known about the possible queries and about the data before the sketch is constructed. For problems in which there is no GROUP BY clause and the set of possible filter conditions is known before the sketch is constructed, counting sketches such as the CountMin sketch and the AMS sketch are appropriate. When the filters and group by dimensions are unknown and arbitrary, the problem is the subset sum estimation problem. Sampling methods such as priority sampling can be used in some cases. These methods work by exploiting a measure of importance for each row and sampling important rows with higher probability. For example, when computing a sum, the rows containing large values contribute more to the sum and should be retained in the sample.
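The priority sampling approach described above can be illustrated with a minimal Python sketch. This is an illustrative implementation, not the specific method of any cited work: each row's weight is divided by an independent uniform draw to produce a priority, the k rows with the largest priorities are retained, and the (k+1)-th largest priority becomes a threshold used to form unbiased subset sum estimates. The function names (`priority_sample`, `estimate_subset_sum`) are hypothetical.

```python
import random

def priority_sample(rows, k):
    """Draw a priority sample of size k from (key, weight) rows.

    Each row gets priority q = w / u with u ~ Uniform(0, 1]; the k rows
    with the largest priorities are kept. The (k+1)-th largest priority
    becomes the threshold tau used at estimation time (0 if all rows fit).
    """
    # 1 - random.random() lies in (0, 1], avoiding division by zero.
    prioritized = [(w / (1.0 - random.random()), key, w) for key, w in rows]
    prioritized.sort(reverse=True)
    tau = prioritized[k][0] if len(prioritized) > k else 0.0
    sample = [(key, w) for _, key, w in prioritized[:k]]
    return sample, tau

def estimate_subset_sum(sample, tau, predicate):
    """Unbiased estimate of the sum of weights over rows matching predicate."""
    return sum(max(w, tau) for key, w in sample if predicate(key))
```

When k is at least the number of rows, the threshold is zero and the estimate is exact; for smaller k, large-weight rows are kept with high probability, which is why this style of sampling suits skewed sums.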
The disaggregated subset sum estimation problem is a more difficult variant, in which there is little to no information about row importance and only a small amount of information about the queries. For example, many user metrics, such as the number of clicks, are computed as aggregations over some event stream where each event has the same weight (i.e., 1) and hence, the same importance. Filters and group by conditions can be arbitrary, except for a small restriction that one cannot query at a granularity finer than the specified unit of analysis. In the click example, the finest granularity may be at the user level. One is allowed to query over arbitrary subsets of users but cannot query a subset of a single user's clicks. The data is "disaggregated" because the relevant per-unit metric is split across multiple rows. As used herein, something at the smallest unit of analysis may be referred to as an "item" to distinguish it from one row in the data source.
Because pre-aggregating to compute per-unit metrics does not discard any information relevant to the allowed queries, the best accuracy one can achieve is to first pre-aggregate and then apply a sketch for subset sum estimation. This operation, however, is extremely expensive, especially as the number of units is often large. Examples of units include (user, advertisement id) pairs for ad click prediction and (source IP, destination IP) pairs for network flow metrics. Each of these has trillions or more possible units.
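The pre-aggregation step can be sketched in a few lines of Python, assuming the disaggregated data arrives as (unit, value) events; for click counting each value is simply 1. The function name `pre_aggregate` is hypothetical. The point of the example is the cost: the intermediate dictionary holds one entry per distinct unit, which is exactly why this step is infeasible when units number in the trillions.

```python
from collections import defaultdict

def pre_aggregate(events):
    """Collapse a disaggregated event stream to one (unit, total) pair per unit.

    Each event is (unit_key, value). The result has one entry per distinct
    unit, so memory grows with the number of units seen -- the bottleneck
    noted above when there are trillions of possible units.
    """
    totals = defaultdict(float)
    for unit, value in events:
        totals[unit] += value
    return dict(totals)
```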
Several sketches based on sampling have been proposed that attempt to address the disaggregated subset sum problem. These include the bottom-k sketch, which samples items uniformly at random, the class of "Net-Flow" sketches, and the Sample and Hold sketches. These proposed solutions, however, are not always accurate, and can also be slow and/or resource-intensive. Therefore, an alternative solution is needed that is efficient and produces accurate results.
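For context, the bottom-k idea mentioned above can be sketched briefly in Python. This is a generic illustration, not any cited system's implementation, and the function name `bottom_k` is hypothetical: each item is hashed to a pseudo-random value in [0, 1), and the k items with the smallest hashes form the sample. Because the hash of an item is the same everywhere, this yields a uniform random sample of distinct items that is deterministic per item and therefore mergeable across streams.

```python
import hashlib

def bottom_k(items, k):
    """Bottom-k sketch: keep the k items with the smallest hash values.

    Hashing replaces an explicit random draw, so the same item always
    receives the same value; sketches built over different partitions of
    the data select consistent items and can be combined.
    """
    def h(item):
        digest = hashlib.sha256(item.encode()).digest()
        # Map the first 8 bytes of the digest to a float in [0, 1).
        return int.from_bytes(digest[:8], "big") / 2**64
    return sorted(set(items), key=h)[:k]
```

Note that the sample is uniform over distinct items regardless of how many rows each item occupies, which is the property that makes this family relevant to the disaggregated setting, but also why it ignores row weights entirely.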