1. Technical Field
The present invention relates to data stream analysis and, more particularly, to a system and method for analyzing streams and counting stream items.
2. Description of the Related Art
Recent technological advances have led to a proliferation of applications that generate and process data streams. Data streams are sequences of data items that can be generated continuously at dynamically varying rates and need to be processed at equivalent rates as soon as they are received by the processing elements. Such data streaming applications often process large quantities of data that can grow rapidly and without limit, placing an enormous burden on the computational and memory resources of the underlying system.
Many issues exist in networks that process streams of information. One of the key data streaming applications involves determining frequency statistics of the stream items in real time. Examples of such statistics include frequency moments, heavy hitters, and order statistics. One common form of frequency query asks for the number of occurrences, or the frequency, of an item in the section of the stream observed so far. Formally, this stream frequency counting problem can be defined as follows: Let stream S=(s1, . . . , sN) be a sequence of items, where each si is a member of a domain D=(1, . . . , d). Estimate the frequency of a unique item sj in the sub-sequence S(t)=(s1, . . . , st), where t≦N. (This type of query is also referred to as a point query.) Clearly, the values of N and d can be very large, and the item frequency can vary over time.
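The point query defined above can be illustrated with a minimal sketch that stores the whole stream and counts exactly. This is only a reference for what the query returns, not a practical streaming method; the function name point_query and the sample stream are hypothetical and chosen for illustration.

```python
from collections import Counter


def point_query(stream, item, t):
    """Exact frequency of `item` in the prefix S(t) = (s1, ..., st).

    Assumes the whole stream is held in memory, which the streaming
    setting generally does not allow; shown only to pin down the query.
    """
    if not 0 <= t <= len(stream):
        raise ValueError("t must satisfy 0 <= t <= N")
    # Count every item in the first t positions, then look up one of them.
    return Counter(stream[:t])[item]


# Hypothetical stream over the domain D = (1, 2, 3), with N = 6.
stream = [3, 1, 3, 2, 3, 1]

# The first four items are 3, 1, 3, 2, so item 3 has frequency 2 at t = 4.
frequency_at_4 = point_query(stream, 3, 4)
```

The same item queried at a later t can return a different value, which is the sense in which the item frequency varies over time.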
For example, in web click streams or phone call streams, the number of possible unique items (i.e., web pages or phone numbers) could easily be on the order of hundreds of millions or even billions. In many cases, the processing of data collected in large sensor networks is performed on the sensor nodes, which have limited memory and power budgets. To satisfy the memory and real-time execution constraints, the input stream data cannot be stored in its entirety. Therefore, the counting applications employ algorithms that strive to maximize the computational performance while minimizing the memory usage.
In stream processing on a cell processor, for example, a typical core may contain only 256 KB of memory. The distributed counting problem concerns counting the frequencies of items in a data stream in order to answer point and range queries. Multiple streams may be processed in parallel, and each stream is potentially distributed over the multiple processing cores in which it needs to be stored. A well-known memory-efficient technique for counting items from data streams uses probabilistic data structures, e.g., sketches. The sketch method is essentially a random projection based approach which uses either linear projections or hash functions to condense the input stream data into a summary. The results for a point query can then be extracted from these condensed summaries. While the sketch-based approach reduces the space complexity of the counting process, additional modifications are needed to improve the computational performance.
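As one illustration of a hash-based sketch, the following is a minimal Count-Min style summary. It is offered as an example of the general idea only; the text above does not name a specific sketch, and the class name, the width and depth parameters, and the use of salted BLAKE2b hashing are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size summary: `depth` rows of `width` counters each."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Condense the item into the summary: one counter per row.
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def estimate(self, item):
        # Point query: collisions can only inflate a counter, so the
        # smallest counter across rows is the tightest estimate.
        return min(
            self.rows[row][self._index(item, row)]
            for row in range(self.depth)
        )


sketch = CountMinSketch(width=64, depth=4)
for value in [3, 1, 3, 2, 3, 1]:  # hypothetical stream items
    sketch.update(value)
estimated = sketch.estimate(3)  # at least the true frequency of 3
```

With non-negative updates, such an estimate never falls below the true frequency, and the memory used depends on width and depth rather than on the stream length N or the domain size d.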
A traditional approach for improving computational performance involves partitioning the work across multiple processing entities that execute in parallel. In recent times, such parallel approaches have become even more practical due to the availability of systems that use multi-core processors. Multi-core processors support multiple, potentially heterogeneous, on-chip processing cores connected via a high-bandwidth, low-latency interconnection fabric. Such features enable multi-core processors to provide very high computational performance at relatively low power consumption. These capabilities make multi-core processors potentially suitable platforms for streaming data processing. In recent years, distributed processing has been widely studied as a method for accelerating the performance of stream processing algorithms.
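The partition-and-execute-in-parallel idea can be sketched as follows, under simplifying assumptions: the stream is a finite list, items are dealt round-robin to workers, and threads stand in for processing cores. The function names and the round-robin scheme are hypothetical choices for illustration and are not taken from the text above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(stream, num_workers):
    """Deal items round-robin so each worker sees a disjoint share."""
    return [stream[w::num_workers] for w in range(num_workers)]


def parallel_count(stream, num_workers=4):
    """Count each partition independently, then merge the partial counts."""
    parts = partition(stream, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Each worker builds a local frequency table for its own share.
        partials = list(pool.map(Counter, parts))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


stream = [3, 1, 3, 2, 3, 1, 2, 2]  # hypothetical stream items
counts = parallel_count(stream, num_workers=3)
```

Because the partitions are disjoint and counting is additive, the merged totals equal the counts a single worker would produce over the whole stream.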