1. Technical Field
This disclosure is directed to methods for accessing relational databases containing time-series data.
2. Discussion of Related Art
Efficiently storing and querying time-series data in relational databases is challenging. On the one hand, the relational data model does not directly support a notion of order, but does so only indirectly through timestamps. This makes operations such as interpolation very complex. On the other hand, time-series data sets, especially if derived from sensor data, can be extremely large. This is primarily due to the fact that time-series data are stored as pairs of timestamp and value. As a consequence, queries can be expensive due to high I/O costs. One way to reduce I/O and computational cost is to store the data in the database in a compressed form. However, this approach is challenging, as different data and queries might require different compression. In addition, decompression can significantly slow query response time. Finally, compression must ensure that existing DBMS optimizations are fully exploited, as otherwise the benefits over non-compressed storage might not materialize.
More specifically, a compressed representation should satisfy the following requirements:
1. It should be well-suited for time and value-series, taking the implicit ordering of values into account to allow for appropriate compression.
2. Alternative representations should be possible, so that for any given query, the best representation can be chosen before executing the query.
3. It should be possible to answer queries directly on the compressed representation, as otherwise decompression would become the bottleneck, potentially negating the advantage of reducing I/O through compression. In addition, needing to decompress the data requires sufficient memory to hold the actual non-compressed data. In most computer systems, however, main memory is the main bottleneck driving the I/O need and most of the computation time.
4. It should make full use of the security, optimization and parallelization features of the underlying database, as these facilities have been extensively optimized and re-inventing them would be prohibitively expensive.
While there are general methods to compress data in databases, as well as methods to compress value series outside databases, none of these methods fulfills all of the above requirements.
Two current techniques for general data compression in databases are row compression and vertical/key-value databases.
Row-based compression utilizes patterns in the values of individual records that can be used to compress the content of a row using techniques such as the Lempel-Ziv-Welch (LZW) algorithm. These techniques are not applicable to time series because a time-series row, by nature, contains only a timestamp and a single value, whereas the compressible information in a time series spans several rows, not just a single one.
Vertical or key-value databases are well-suited if different records use different columns. In this case, the relational model would be ill-suited, as it assumes a fixed schema of columns over all records. Therefore, storing such data in a relational model would imply a large number of missing values to force all records to the same schema. This method is of limited use for time-series data, as all records usually share the same number of values.
One way to reduce the cost of storing time series in databases is to eliminate the need for a timestamp for each value by storing values in an array, such that each time series is encoded in a single row in the underlying database system, as employed by Informix. However, with this layout many of the indexing and optimization capabilities of the database cannot be exploited. In particular, all queries along the time axis are bound to be very slow. Another drawback of these generic methods is that they usually need to decode or decompress the data before applying the query.
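The array layout described above can be illustrated with a minimal sketch. It assumes regularly sampled data, so that each timestamp can be recomputed from the position of the value in the array; the names `ArrayRow`, `to_array_row` and `timestamp_at` are hypothetical and do not reflect the storage format of any particular database product.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ArrayRow:
    """One row holding a whole series (hypothetical layout)."""
    series_id: str
    start: int            # timestamp of the first sample
    step: int             # assumed fixed sampling interval
    values: List[float]   # samples in time order, no per-value timestamp


def to_array_row(series_id: str, pairs: List[Tuple[int, float]]) -> ArrayRow:
    """Collapse regularly sampled (timestamp, value) pairs into one row."""
    if not pairs:
        raise ValueError("empty series")
    pairs = sorted(pairs)
    start = pairs[0][0]
    step = pairs[1][0] - start if len(pairs) > 1 else 1
    for i, (t, _) in enumerate(pairs):
        if t != start + i * step:
            raise ValueError("sketch assumes a fixed sampling interval")
    return ArrayRow(series_id, start, step, [v for _, v in pairs])


def timestamp_at(row: ArrayRow, index: int) -> int:
    """Recompute the timestamp implied by a position in the array."""
    return row.start + index * row.step


pairs = [(0, 1.5), (60, 1.7), (120, 1.6)]
row = to_array_row("meter-1", pairs)
```

In this sketch the three timestamps are replaced by a single start value and a step. The timestamps no longer exist as column values, however, so a conventional index on a timestamp column has nothing to work with.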
There is some work on compressing time and value series, mostly based on wavelet or Fourier transformations. In practice, most time-series signals do not compress well using the Fourier transformation; for example, the metered data coming from electricity, water or transportation metering systems typically have a low sampling frequency and shapes that are very different from sine or cosine curves.
Another way to compress temporal data is not to store the actual values but only the changes over time. If large portions of the data are constant, this can lead to a significant compression. However, it is hard to apply queries directly to the compressed data, making it necessary to decompress the data first.
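As a minimal sketch of this change-only storage, assuming the series is held as a plain list of values, the following keeps a value only at the positions where it differs from its predecessor. The function names `encode_changes` and `decode_changes` are hypothetical.

```python
from typing import List, Tuple


def encode_changes(values: List[float]) -> List[Tuple[int, float]]:
    """Keep only (index, value) pairs where the value changes."""
    changes = []
    for i, v in enumerate(values):
        if i == 0 or v != values[i - 1]:
            changes.append((i, v))
    return changes


def decode_changes(changes: List[Tuple[int, float]], length: int) -> List[float]:
    """Rebuild the full series by repeating each value until the next change."""
    values: List[float] = []
    for k, (start, v) in enumerate(changes):
        end = changes[k + 1][0] if k + 1 < len(changes) else length
        values.extend([v] * (end - start))
    return values


readings = [5.0, 5.0, 5.0, 7.0, 7.0, 5.0, 5.0, 5.0]
encoded = encode_changes(readings)   # three entries instead of eight
restored = decode_changes(encoded, len(readings))
```

Here eight readings shrink to three stored changes because long runs are constant. An aggregate such as a sum over the full series still has to account for how long each value persisted, which is one reason queries are awkward to run on this form without reconstructing the series.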
Another structural issue of the above-mentioned compression techniques is that the compression/accuracy trade-off must be fixed once and then used throughout the application. Usually, the higher the compression, the lower the reconstruction accuracy. Depending on the use of the queries, different accuracy/compression trade-offs may be required:
for exploratory queries that need to quickly obtain a rough estimate of the query's result, a low accuracy and high compression that lead to a shorter query response time would be appropriate; and
on the other hand, queries requiring an exact result for business-critical applications might prefer to target a high accuracy, therefore selecting a low compression with a higher query answering latency.
Usually businesses need to answer both kinds of questions on the same data, making compression techniques that enable dynamically choosing the right compression/accuracy trade-off very valuable.
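The trade-off can be made concrete with a small sketch. It uses segment averaging, chosen here only for illustration and not taken from any of the techniques discussed above: each block of `segment_len` consecutive values is replaced by its mean, so a longer segment gives higher compression and a larger reconstruction error. All function names are hypothetical.

```python
from typing import List


def segment_means(values: List[float], segment_len: int) -> List[float]:
    """Replace each block of segment_len values by its mean."""
    means = []
    for i in range(0, len(values), segment_len):
        block = values[i:i + segment_len]
        means.append(sum(block) / len(block))
    return means


def reconstruct(means: List[float], segment_len: int, length: int) -> List[float]:
    """Expand the means back to a series of the original length."""
    out: List[float] = []
    for m in means:
        out.extend([m] * segment_len)
    return out[:length]


def max_abs_error(a: List[float], b: List[float]) -> float:
    """Largest pointwise deviation between two equally long series."""
    return max(abs(x - y) for x, y in zip(a, b))


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

coarse = segment_means(values, 4)   # 2 stored numbers, high compression
fine = segment_means(values, 2)     # 4 stored numbers, lower compression

coarse_error = max_abs_error(values, reconstruct(coarse, 4, len(values)))
fine_error = max_abs_error(values, reconstruct(fine, 2, len(values)))
```

On this toy series the coarse representation stores two numbers with a maximum error of 1.5, while the fine one stores four numbers with a maximum error of 0.5. An exploratory query could read the coarse form and an exact one a finer form, provided both are available for the same data.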
A recently proposed set of techniques for compressing signals is dictionary-based compression. Dictionary compression has been extensively used for image and video compression. However, it is not obvious how to implement this technique for representing and processing time series in a relational database.