It is often useful to provide a summary of a high volume stream of unaggregated weighted items that arrive faster and in larger quantities than can be saved, so that only a sample can be stored efficiently. Preferably, we would like to provide a generic sample of a certain limited size that we can later use to estimate the total weight of arbitrary subsets of the data.
Many data sets occur as unaggregated data sets, where multiple data points are associated with each key. The weight of a key is the sum of the weights of data points associated with the key and the aggregated view of the data, over which aggregates of interest are defined, includes the set of keys and the weight associated with each key.
In greater detail, this invention is concerned with the problem of summarizing a population of data points, each of which takes the form (k, x), where k is a key and x ≥ 0 is called a weight. Generally, in an unaggregated data set, a given key occurs in multiple data points of the population. An aggregate view of the population is provided by the set of key weights: the weight of a given key is simply the sum of the weights of the data points with that key within the population. This aggregate view would support queries that require selection of sub-populations with arbitrary key predicates.
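The aggregate view and a subpopulation query can be illustrated with a minimal Python sketch. The function names `aggregate` and `subpopulation_weight` and the sample data are hypothetical and chosen only for illustration; the sketch assumes the whole population fits in memory, which is exactly what the application scenarios discussed in this section rule out.

```python
from collections import defaultdict

def aggregate(points):
    """Return the aggregate view: each key mapped to the sum of its weights."""
    key_weights = defaultdict(float)
    for key, x in points:
        if x < 0:
            raise ValueError("weights must be non-negative")
        key_weights[key] += x
    return dict(key_weights)

def subpopulation_weight(key_weights, predicate):
    """Total weight of the keys selected by an arbitrary key predicate."""
    return sum(w for k, w in key_weights.items() if predicate(k))

# Hypothetical unaggregated population: key "a" occurs in two data points.
points = [("a", 1), ("b", 3), ("a", 2), ("c", 5)]
key_weights = aggregate(points)
selected = subpopulation_weight(key_weights, lambda k: k in {"a", "c"})
```

The predicate is supplied only at query time, which mirrors the requirement that the keys of interest need not be known when the data is summarized.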
However, in many application scenarios, it is not feasible to compute the aggregate key weights directly; we describe some of these application scenarios below. In these applications time and processing constraints prohibit direct queries and make it necessary to first compute a summary of the aggregate view over the data points, and then to process the query on the summary. A crucial requirement of such summaries is that they must also support selection of subpopulations with arbitrary key predicates. Since the keys of interest are not assumed to be known at the time of summarization, the summarization process must retain per-key statistical estimates of the aggregate weights.
Turning now to applications of interest, communications networking provides a fertile area for developing summarization methods. In the Internet Protocol (IP) suite, routers forward packets between high speed interfaces based on the contents of their packet headers. The header contains the source and destination addresses of the packet, and usually also the source and destination port numbers, which are used by end hosts to direct packets to the correct application executing within them. These and other fields in the packet header together constitute a key that identifies the IP flow that the packet belongs to. In our context, we can think of the set of keys of packets arriving at the router in some time interval, each paired with the byte size of the corresponding packet, as a population of unaggregated data points.
Routers commonly compile summary statistics on the traffic passing through them, and export them to a collector which stores the summaries and supports query functions. Export of the unaggregated data is infeasible due to the expense of the bandwidth, storage and computation resources that would be required to support queries. On the other hand, direct aggregation of byte sizes over all distinct flow keys at a measuring router is generally infeasible at present due to the amount of (fast) memory that would be required to maintain and update, at line rate, the summaries for the large number of distinct keys present in the data. Thus some other form of summarization is required.
Common queries for network administrators would include: (i) calculating the traffic matrix, i.e., the weight between source-destination address pairs; (ii) the application mix, as indicated by the weight in various port numbers; and (iii) popular websites, as indicated by destination address using certain ports. Although some queries are routine, in exploratory and troubleshooting tasks the keys of interest are not known in advance.
Other network devices that serve content or mediate network protocols generate logs comprising records of each transaction. Examples include web servers and caches; content distribution servers and caches; electronic libraries for software, video, music, books, papers; DNS and other protocol servers. Each record may be considered as a data point, keyed, e.g., by requester or item requested, with weight being unity or the size or price of the item requested if appropriate. Offline libraries can produce similar records. Queries include finding the most popular items or the heaviest users, requiring aggregation over keys with common user and/or item. Another example is sensor networks, which comprise a distributed set of devices, each of which generates monitoring events in certain categories.
All of these application examples, to a greater or lesser extent, share the feature that the approximate aggregation is subjected to physical resource constraints on the information that can be carried through time or between locations. For example, there are multiple distinct devices that produce data points, and from which information flows to a single ultimate collector and bandwidth is limited. If data points arrive as a data stream, then storage is limited. In the network traffic statistics application, measurements may be aggregated in mediation devices (e.g. one per geographic router center) which in turn export to a central collector. Sensor networks may deploy a large number of sensor nodes with limited capabilities that can collaborate locally to aggregate their measurements before relaying messages more widely. Physical layout aside, when summarizing data that resides on external memory or when exploiting parallel processing to speed up the computation, the computation is subjected to similar data flow constraints imposed by the underlying model.
There has been a considerable amount of work in past years devoted to finding efficient data summarization schemes.
Summarizing Aggregated Data. In aggregated data sets, each data point has a unique key. There are many summarization methods for such data sets in the literature that produce summaries that support unbiased estimates for subpopulation weight. Reservoir sampling from a single stream is the basis of the stream database of Johnson et al. [T. Johnson, S. Muthukrishnan, and I. Rozenbaum, SAMPLING ALGORITHMS IN A STREAM OPERATOR, In Proc. ACM SIGMOD, pages 1-12, 2005]. Classic algorithms for offline, data streams, and distributed settings include: Weighted sampling with replacement (probability proportional to size) (the k-mins framework) [E. Cohen, SIZE-ESTIMATION FRAMEWORK WITH APPLICATIONS TO TRANSITIVE CLOSURE AND REACHABILITY, J. Comput. System Sci., 55:441-453, 1997; E. Cohen and H. Kaplan, SPATIALLY-DECAYING AGGREGATION OVER A NETWORK: MODEL AND ALGORITHMS, J. Comput. System Sci., 73:265-288, 2007]; the stronger bottom-k framework [E. Cohen and H. Kaplan, BOTTOM-K SKETCHES: BETTER AND MORE EFFICIENT ESTIMATION OF AGGREGATES, In Proceedings of the ACM SIGMETRICS '07 Conference, 2007, poster; E. Cohen and H. Kaplan, SUMMARIZING DATA USING BOTTOM-K SKETCHES, In Proceedings of the ACM PODC '07 Conference, 2007; E. Cohen and H. Kaplan, TIGHTER ESTIMATION USING BOTTOM-K SKETCHES, In Proceedings of the 34th VLDB Conference, 2008] that includes priority sampling [N. Duffield, M. Thorup, and C. Lund, Priority sampling for estimating arbitrary subset sums, J. Assoc. Comput. Mach., 54(6), 2007] and the classic weighted sampling without replacement; and the recently proposed VAROPT [E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup, VARIANCE OPTIMAL SAMPLING BASED ESTIMATION OF SUBSET SUMS, In Proc. 20th ACM-SIAM Symposium on Discrete Algorithms, ACM-SIAM, 2009] that achieves variance optimality.
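As an illustration of one of the aggregated-data methods cited above, the following is a minimal Python sketch of priority sampling in its usual textbook formulation: each key with weight w receives priority w/u for an independent uniform u in (0, 1], the k keys of highest priority are kept, and each kept key is assigned the adjusted weight max(w, τ), where τ is the (k+1)-th highest priority. The function name `priority_sample`, the `rng` parameter, and the sample data are assumptions made for illustration. Note that the input is already aggregated, with one exact weight per key.

```python
import heapq
import random

def priority_sample(key_weights, k, rng):
    """Keep the k keys of highest priority w/u and return adjusted weights."""
    prioritized = []
    for key, w in key_weights.items():
        if w <= 0:
            continue  # zero-weight keys never contribute to a subset sum
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
        prioritized.append((w / u, key, w))
    # The k+1 highest priorities: k sampled keys plus the threshold tau.
    top = heapq.nlargest(k + 1, prioritized)
    tau = top[k][0] if len(top) > k else 0.0
    return {key: max(w, tau) for _, key, w in top[:k]}

# Hypothetical aggregated input: exact weight per key.
key_weights = {"a": 3.0, "b": 3.0, "c": 5.0, "d": 0.5}
sample = priority_sample(key_weights, k=2, rng=random.Random(7))
# Estimate of the weight of the subpopulation {"a", "c"} from the sample.
estimate = sum(w for key, w in sample.items() if key in {"a", "c"})
```

When k is at least the number of keys, the threshold is zero and the sample reproduces the exact weights.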
These summarizations, however, cannot be computed over the unaggregated data unless the data is first aggregated, which is prohibited by application constraints. Firstly, the best estimators for summaries derived from aggregated data utilize the exact weight of each key that is included in the summary. Secondly, the distribution itself of keys that are included in the summary cannot be computed under the IFT-constraints. (The only exception is weighted sampling (with or without replacement), but even though we can efficiently determine the keys to include in the summary over the unaggregated data, we need a “second pass” (or another communication round) to obtain the total weight of each included key in order to compute the estimators.)
These methods can be applied to produce data-point-level summaries, by effectively treating each data point as having a unique key. These summaries, however, have large multiplicities of the same key and they are considerably less accurate than key-level summaries. This prompted the development of methods that compute key-level summaries over the unaggregated data.
Summarizing Unaggregated Data. Summarization of unaggregated data sets has been extensively studied [N. Alon, Y. Matias, and M. Szegedy, THE SPACE COMPLEXITY OF APPROXIMATING THE FREQUENCY MOMENTS, J. Comput. System Sci. 58:137-147, 1999; P. Indyk and D. P. Woodruff, OPTIMAL APPROXIMATIONS OF THE FREQUENCY MOMENTS OF DATA STREAMS, In Proc 37th Annual ACM Symposium on Theory of Computing, pages 202-208, ACM, 2005; M. Charikar, K. Chen, and M. Farach-Colton, FINDING FREQUENT ITEMS IN DATA STREAMS, In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP), pages 693-703, 2002; R. Fagin, A. Lotem, and M. Naor, OPTIMAL AGGREGATION ALGORITHMS FOR MIDDLEWARE, In Proceedings of the 24th ACM Symposium on Principles of Database Systems, ACM-SIGMOD, 2001; P. Cao and Z. Wang, EFFICIENT TOP-k QUERY CALCULATION IN DISTRIBUTED NETWORKS, In Proc 23rd Annual ACM Symposium on Principles of Distributed Computing, ACM-SIGMOD, 2004; G. Manku and R. Motwani, APPROXIMATE FREQUENCY COUNTS OVER DATA STREAMS, In International Conference on Very Large Databases (VLDB), pages 346-357, 2002; G. Cormode and S. Muthukrishnan, WHAT'S HOT AND WHAT'S NOT: TRACKING MOST FREQUENT ITEMS DYNAMICALLY, In Proceedings of ACM Principles of Database Systems, 2003] for applications that include data streams, distributed data, and in-network aggregation (sensor networks) [A. Manjhi, S. Nath, and P. B. Gibbons, TRIBUTARIES AND DELTAS: EFFICIENT AND ROBUST AGGREGATION IN SENSOR NETWORK STREAMS, In SIGMOD 2005, ACM, 2005]. We are specifically interested in summaries that support estimating the weight of selected subpopulations, specified using arbitrary selection predicates, and we compare our methods against alternative methods that do that. (We do not consider methods restricted to estimating an aggregate over the full data or geared for different aggregates such as top-k, heavy hitters, or frequency moments of the full data set.)
Concise samples [P. Gibbons and Y. Matias, NEW SAMPLING-BASED SUMMARY STATISTICS FOR IMPROVING APPROXIMATE QUERY ANSWERS, In SIGMOD, ACM, 1998] refer to independent sampling of data points (this assumes that data points have uniform weights). The key idea is to combine in the sample all data points with the same key, and therefore obtain a larger effective sample using the same storage. This is also the flow counting mechanism deployed by Cisco's sampled NetFlow (NF) in routers [Cisco NetFlow, described in materials found at www.cisco.com/en/US/docs/ios/12_2sb/feature/guide/sbrsnf.html]. When sampling is performed at a fixed rate, we obtain a variable-size summary. In many applications, a fixed-size summary is desirable, which is obtained by adaptively decreasing the sampling rate. We refer to this adaptive version as ANF.
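The fixed-rate variant can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the cited implementations: the names `concise_sample` and `estimate_subpopulation`, the `rng` parameter, and the sample stream are hypothetical, and the estimator shown is the ordinary scaling of sampled counts by 1/p for independent fixed-rate sampling. The adaptive version (ANF), which lowers the rate to bound the summary size, is not shown.

```python
import random
from collections import Counter

def concise_sample(keys, p, rng):
    """Sample each unit-weight data point independently with probability p,
    storing one counter per sampled key instead of one entry per data point."""
    counts = Counter()
    for key in keys:
        if rng.random() < p:
            counts[key] += 1
    return counts

def estimate_subpopulation(counts, p, predicate):
    """Scale the sampled count of the selected keys by 1/p."""
    return sum(c for key, c in counts.items() if predicate(key)) / p

# Hypothetical stream of keys with uniform (unit) weights.
stream = ["a", "b", "a", "c", "a", "b"]
counts = concise_sample(stream, p=0.5, rng=random.Random(42))
estimate = estimate_subpopulation(counts, 0.5, lambda key: key == "a")
```

Because the number of distinct sampled keys depends on the random draws, the summary size varies from run to run, which is the motivation for the adaptive version.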
Counting samples [P. Gibbons and Y. Matias, NEW SAMPLING-BASED SUMMARY STATISTICS FOR IMPROVING APPROXIMATE QUERY ANSWERS, In SIGMOD, ACM, 1998] (also developed as sample-and-hold (SH) [C. Estan and G. Varghese, NEW DIRECTIONS IN TRAFFIC MEASUREMENT AND ACCOUNTING, In Proceeding of the ACM SIGCOMM '02 Conference, ACM, 2002]) is a summarization algorithm applicable to an unaggregated stream of data points with uniform weights. The algorithm samples all data points at a fixed rate, but once a key is sampled, all subsequent data points with the same key are counted. Similarly, there is an adaptive version of the algorithm that produces fixed-size summaries (ASH).
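The fixed-rate sample-and-hold rule described above can be sketched in Python as follows. The function name `sample_and_hold`, the `rng` parameter, and the sample stream are assumptions made for illustration; the sketch returns only the raw counters and omits both the estimators and the adaptive, fixed-size version (ASH).

```python
import random

def sample_and_hold(keys, p, rng):
    """Process a stream of unit-weight data points: a key that is not yet held
    is sampled with probability p; once held, every later data point with that
    key is counted with certainty."""
    counts = {}
    for key in keys:
        if key in counts:
            counts[key] += 1      # key already held: always count
        elif rng.random() < p:
            counts[key] = 1       # key enters the summary
    return counts

# Hypothetical stream of keys with uniform (unit) weights.
stream = ["a", "b", "a", "c", "a", "b"]
held = sample_and_hold(stream, p=0.5, rng=random.Random(42))
```

Compared with independent sampling at the same rate, a held key loses only the data points that arrived before it was first sampled.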
Subpopulation-weight estimators for ASH and ANF have been proposed and evaluated [E. Cohen, N. Duffield, H. Kaplan, C. Lund and M. Thorup, SKETCHING UNAGGREGATED DATA STREAMS FOR SUBPOPULATION-SIZE QUERIES, In Proc of the 2007 ACM Symp. on Principles of Database Systems (PODS 2007), ACM, 2007; E. Cohen, N. Duffield, H. Kaplan, C. Lund and M. Thorup, ALGORITHMS AND ESTIMATORS FOR ACCURATE SUMMARIZATION OF INTERNET TRAFFIC, In Proceedings of the 7th ACM SIGCOMM conference on Internet measurements (IMC), 2007]. ASH dominates ANF on any sub-population and distribution. ANF (and NF), however, are applicable on general IFTs whereas ASH (and SH) are limited to streams. In addition, ASH does not support multiple-objectives unbiased estimation for other additive (over data points) weight functions [E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup, ALGORITHMS AND ESTIMATORS FOR ACCURATE SUMMARIZATION OF INTERNET TRAFFIC, MANUSCRIPT, 2007] whereas ANF and our summarization algebra support multiple objectives. ASH is applicable to uniform weights, and its extension to general weights neither utilizes nor benefits from a higher level of aggregation. For example, in terms of the produced summary and estimate quality, it treats the sequence (i1, 1), (i2, 3), (i1, 2) as (i1,1), (i2,1), (i2,1), (i2,1), (i1,1), (i1,1).
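The expansion in the example above can be reproduced with a short Python sketch. The function name `expand_to_unit_weights` is hypothetical, and the sketch assumes non-negative integer weights, as in the example.

```python
def expand_to_unit_weights(points):
    """Replace each data point (key, w) by w consecutive unit-weight points."""
    unit_points = []
    for key, weight in points:
        if weight < 0 or weight != int(weight):
            raise ValueError("sketch assumes non-negative integer weights")
        unit_points.extend((key, 1) for _ in range(int(weight)))
    return unit_points

# The sequence from the text: (i1, 1), (i2, 3), (i1, 2).
expanded = expand_to_unit_weights([("i1", 1), ("i2", 3), ("i1", 2)])
```

The expanded stream has as many data points as the total weight, so a data point of weight 3 costs three processing steps and gains nothing from having been pre-aggregated.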
Step-counting SH (SSH) is another summarization scheme for unaggregated data streams that improves over ASH by exploiting the memory hierarchy structure at high speed IP routers. As a pure data stream algorithm, however, SSH utilizes larger storage to produce the same size summary as ASH.
Propagation of Summaries on Trees. Multistage aggregation for threshold sampling [N. G. Duffield, C. Lund, and M. Thorup, LEARN MORE, SAMPLE LESS: CONTROL OF VOLUME AND VARIANCE IN NETWORK MEASUREMENTS, IEEE Transactions on Information Theory, 51(5):1756-1775, 2005] is represented on a tree [E. Cohen, N. Duffield, C. Lund, and M. Thorup, CONFIDENT ESTIMATION FOR MULTISTAGE MEASUREMENT SAMPLING AND AGGREGATION, In ACM SIGMETRICS, 2008, Jun. 2-6, 2008, Annapolis, Md., USA] for the purpose of developing exponential bounds on summary error. Applications include Sampled NetFlow, Counting Samples, and Sample and Hold. Some earlier work [N. Duffield and C. Lund, PREDICTING RESOURCE USAGE AND ESTIMATION ACCURACY IN AN IP FLOW MEASUREMENT COLLECTION INFRASTRUCTURE, In ACM SIGCOMM Internet Measurement Workshop, 2003, Miami Beach, Fla., Oct. 27-29, 2003] had analyzed variance for Sampled NetFlow, exploiting relationships similar to Lemma 8 set forth below for multistage sampling.
From the foregoing discussion, it will be apparent that a summarization method for unaggregated data sets desirably will work on massive data streams in the face of processing and storage constraints that prohibit full processing; will produce a summarization with low variance for accurate analysis of data; will be one that is efficient in its application (will not require inordinate amounts of time to produce); will provide unbiased summaries for arbitrary analysis of the data; and will limit the worst case variance for every single (arbitrary) subset.
The prior art summarization methods described above have been unable to satisfy all of these desiderata.
Accordingly, there is a need to provide a summarization method for unaggregated data that produces results better than those attainable by prior art methods.