This application relates generally to sampling of data. More specifically, the disclosure provided herein relates to sampling from distributed streams of data.
In the communication age, some data management systems must collect and analyze large quantities of data. These data may be observed at data collection devices distributed at geographically diverse locations, and may be communicated from the data collection devices to the data management systems. The data collection devices may track and report increasingly detailed data due to enhanced measurement devices, increasing numbers of sensors and sensor networks, and/or increased measurement granularity. Thus, tracking and reporting of data may require storage and transmission of large volumes of data. In light of these developments, collecting all observed data at a single data management system and/or performing analysis with respect to all of the observed data may present a challenge.
Additionally, some data management systems rely upon various queries and other operations that may be generated and/or iterated numerous times. These queries and other operations may request information or trigger various activities based upon all data that has been collected across the data collection devices at any particular time. Performing such queries or other operations presents additional challenges for data management systems. These challenges have led to the formalization of the continuous, distributed, streaming model of data management systems. In the continuous, distributed, streaming model of data management systems, a number of distributed sites each observe a stream of data and collaborate with a centralized coordinator device to answer queries over the union of the observed data.
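By way of illustration only, the continuous, distributed, streaming model described above may be sketched as follows. The sketch assumes a naive baseline in which every site forwards every observed item to the coordinator. The names Coordinator, Site, count_where, and run_naive_demo are hypothetical and do not appear in the disclosure. The message counter makes the communication cost of the baseline visible, since the number of messages equals the total number of observed items.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Coordinator:
    """Hypothetical central device answering queries over the union of streams."""
    items: List[int] = field(default_factory=list)
    messages_received: int = 0

    def receive(self, item: int) -> None:
        # Store the item and count the message that carried it.
        self.items.append(item)
        self.messages_received += 1

    def count_where(self, predicate: Callable[[int], bool]) -> int:
        # Exact answer to a counting query over everything received so far.
        return sum(1 for x in self.items if predicate(x))


@dataclass
class Site:
    """Hypothetical distributed site observing its own local stream."""
    coordinator: Coordinator

    def observe(self, item: int) -> None:
        # Naive baseline (an assumption of this sketch): forward every item.
        self.coordinator.receive(item)


def run_naive_demo() -> Coordinator:
    coordinator = Coordinator()
    sites = [Site(coordinator) for _ in range(3)]
    streams = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # made-up local streams
    for site, stream in zip(sites, streams):
        for item in stream:
            site.observe(item)
    return coordinator
```

In this baseline the coordinator can answer any query exactly, but only because it has received one message per observed item, which is the cost that protocols in this model seek to reduce.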
Protocols have been defined in the continuous, distributed, streaming model of data management systems for a number of classes of queries. These protocols seek to minimize the communication, storage space, and/or time needed by each participant in the model. The protocols, however, do not address how to produce a sample drawn from the distributed streams. A sample is a powerful tool because it can be used to approximately answer many queries. Various statistics computed over a sample can indicate the current distribution of data in a system, and a sample therefore can prove valuable in the data management setting.
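As a non-limiting illustration of how a sample can approximately answer a query, the following sketch estimates the fraction of items satisfying a predicate from a uniform random sample drawn without replacement. The population, the predicate, the sample size, and the names estimate_fraction and demo are assumptions made for this example. The sketch draws the sample from data held in one place and does not show a distributed sampling protocol.

```python
import random
from typing import Callable, Sequence, Tuple


def estimate_fraction(items: Sequence[int],
                      predicate: Callable[[int], bool]) -> float:
    """Return the fraction of items for which predicate holds."""
    if not items:
        raise ValueError("items must be non-empty")
    return sum(1 for x in items if predicate(x)) / len(items)


def demo(seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)            # seeded for repeatability
    population = list(range(100_000))    # stand-in for the union of all streams
    sample = rng.sample(population, 1_000)  # uniform, without replacement

    def is_small(x: int) -> bool:
        return x < 30_000                # hypothetical query predicate

    exact = estimate_fraction(population, is_small)   # scans all items
    approx = estimate_fraction(sample, is_small)      # scans the sample only
    return exact, approx
```

The estimate is computed from 1,000 items rather than 100,000, and its error generally shrinks as the sample size grows.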