This invention relates to compression of large data sets, and in particular to selection of a coreset of points to represent sensor data.
The wide availability of networked sensors such as GPS receivers and cameras is enabling the creation of sensor networks that generate huge amounts of data. For example, vehicular sensor networks where in-car GPS sensor probes are used to model and monitor traffic can generate on the order of gigabytes of data in real time. These huge amounts of data are potentially usable for applications such as query-based searching and spatiotemporal data mining, but such applications can face storage-space problems as well as problems in efficiently accessing the motion data.
One approach to processing such huge amounts of data is to first compress it, for example as it is streamed from distributed sensors. However, such compression can be a challenging technical problem. Technical challenges include managing the computation required for such compression, as well as managing and characterizing the accuracy (e.g., compression error) introduced by using the compressed data as representing the originally acquired data.
One class of problems involves trajectory data, in which the data represents a time-indexed sequence of location coordinates. The challenge of compressing such trajectories (also known as “line simplification”) has been attacked from various perspectives: geographic information systems, databases, digital image analysis, computational geometry, and especially in the context of sensor networks. The input to this problem is a sequence P of n points that describes the coordinates of a path over time. The output is a set Q of k points (usually a subset of P) that approximates P. More precisely, the k-spline S that is obtained by connecting every two consecutive points in Q with a segment should be close to P according to some distance function. The set Q is sometimes called a coreset since it is a small set that approximates P. (Note that in the Description below, the term “coreset” is not used according to a definition in and is used in a manner more similar to usage in computational geometry.)
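As an illustration of the input/output relationship described above, the following minimal Python sketch measures how well a candidate set Q approximates P. The text only says "some distance function"; the sketch assumes one common choice, the largest Euclidean distance from any point of P to the k-spline S through Q, restricted to two-dimensional points. The function names `point_segment_distance` and `spline_error` are hypothetical and chosen only for this example.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Parameter of the orthogonal projection of p, clamped onto the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def spline_error(P, Q):
    """Largest distance from any point of P to the k-spline through Q.

    Assumes Q has at least two points, listed in path order.
    """
    segments = list(zip(Q, Q[1:]))
    return max(
        min(point_segment_distance(p, a, b) for a, b in segments)
        for p in P
    )

P = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
print(spline_error(P, [P[0], P[2]]))  # the dropped middle point lies 1.0 away
print(spline_error(P, P))             # keeping every point gives error 0.0
```

Other distance functions (for example, a sum of squared distances) would fit the same description; the maximum is used here only because it matches the ε-style guarantee discussed for the Douglas-Peucker heuristic.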
Several books have been written about the line simplification problem. The oldest heuristic for line simplification is the Douglas-Peucker heuristic (DPH). DPH takes as input a threshold ε>0 and returns a set Q that represents a k-spline S as defined above. DPH guarantees that the Euclidean distance from every p∈P to S is at most ε. This guarantee is what makes DPH attractive compared to other lossy data compression techniques such as wavelets. DPH is very simple, easy to implement, and has a very good running time in practice. The guaranteed ε-error allows two compressed sets Q1 and Q2 to be merged in the streaming model while keeping the ε-error for Q1∪Q2. The DPH has been characterized as “achiev[ing] near-optimal savings at a far superior performance”.
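For readers unfamiliar with DPH, the following is a minimal recursive sketch of the heuristic in its commonly described form, not a reproduction of any particular published implementation. It assumes two-dimensional points and uses point-to-segment distance so that the stated ε bound holds with respect to the spline itself. The names `dist_to_segment` and `douglas_peucker` are hypothetical.

```python
import math

def dist_to_segment(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def douglas_peucker(P, eps):
    """Return a subsequence Q of P whose spline stays within eps of every point of P."""
    if len(P) < 3:
        return list(P)
    a, b = P[0], P[-1]
    # Find the interior point farthest from the segment joining the endpoints.
    idx, dmax = 0, -1.0
    for i in range(1, len(P) - 1):
        d = dist_to_segment(P[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= eps:
        # Every interior point is already within eps: keep only the endpoints.
        return [a, b]
    # Otherwise keep the farthest point and simplify both halves independently.
    left = douglas_peucker(P[: idx + 1], eps)
    right = douglas_peucker(P[idx:], eps)
    return left[:-1] + right  # drop the duplicated split point

P = [(0, 0), (1, 0.05), (2, 0), (3, 2), (4, 0)]
print(douglas_peucker(P, 0.1))   # [(0, 0), (2, 0), (3, 2), (4, 0)]
print(douglas_peucker(P, 10.0))  # [(0, 0), (4, 0)]
```

Note that the caller chooses ε, not k: the number of points returned is whatever the recursion happens to produce. The recursive form is kept for readability; on very long inputs it may exceed Python's default recursion limit.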
While DPH has a guaranteed ε-error, it suffers from serious space problems due to its local (ad-hoc, greedy) optimization technique. In particular, the size k of its output is unbounded, and might be arbitrarily larger than the smallest set Q⊂P that achieves such an ε-error. While merging two sets preserves the error, it is not clear how to reduce the merged set again, so the size of the compressed output will increase linearly with the input stream. Choosing a larger ε may result in a set Q that is too small, or even empty, for the first compressions. The worst-case running time for the basic (and practical) implementation is O(n²). More modern versions of the DPH appear to have similar technical characteristics.
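The lack of a bound on k can be seen on a simple synthetic input. The sketch below is an assumption-laden illustration, not an experiment from the source: it uses a hypothetical iterative variant of the heuristic (`dph_indices`, which returns the indices of the kept points) and a hypothetical zigzag trajectory whose points alternate between y=0 and y=1. With ε=0.25, every point is kept, so the output grows linearly with the input.

```python
import math

def dist_to_segment(p, a, b):
    """Euclidean distance from point p to the segment with endpoints a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def dph_indices(P, eps):
    """Iterative Douglas-Peucker sketch: sorted indices of the points it keeps."""
    n = len(P)
    if n < 3:
        return list(range(n))
    keep = {0, n - 1}
    stack = [(0, n - 1)]  # index ranges still to be examined
    while stack:
        lo, hi = stack.pop()
        best, dmax = -1, eps
        for i in range(lo + 1, hi):
            d = dist_to_segment(P[i], P[lo], P[hi])
            if d > dmax:
                best, dmax = i, d
        if best != -1:
            # Some interior point exceeds eps: keep it and split the range there.
            keep.add(best)
            stack.append((lo, best))
            stack.append((best, hi))
    return sorted(keep)

def zigzag(n):
    """Hypothetical worst-case-style input: y alternates 0, 1, 0, 1, ..."""
    return [(float(i), float(i % 2)) for i in range(n)]

for n in (10, 100, 500):
    kept = dph_indices(zigzag(n), 0.25)
    print(n, len(kept))  # every point is kept, so both numbers match
```

This particular input only demonstrates that k is not controlled by the caller and can equal n; it does not by itself show a gap between the DPH output and the smallest possible set, since here no smaller subset achieves the same ε. The nested scan over each index range also makes the quadratic worst-case running time visible.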