With the rapid development of database technologies, there is growing interest in how to extract valuable information from large volumes of data. This process may be referred to as big data analysis. In practice, big data analysis is often directed to time series data. Time series data refers to data sequences recorded in chronological order under a unified index, such as transaction data of the stock market, status data collected over sensor networks, consumption statistics of shops, and traffic statistics of telephone communication.
The volume of time series data is very large. To facilitate storage and retrieval, the time series data is processed by dimensionality reduction, i.e., data with more time points is compressed into data with fewer time points. Piecewise Linear Approximation (PLA) is a common method for dimensionality reduction. In PLA, the time series data is partitioned into short time segments, and the data in each time segment is approximated by a line segment with a certain slope. The space for storing the processed time sequence can thus be reduced effectively, because only the start time point and end time point of the line segment corresponding to each time segment and the corresponding linear parameter (a coefficient of the linear equation to which the line segment pertains) need to be stored.
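The PLA procedure described above can be sketched as follows. This is a minimal illustration only, under several assumptions that do not come from the description: samples are taken at integer time points 0, 1, 2, ..., each line segment is fitted by ordinary least squares, and both the slope and the intercept are stored as the linear parameters. The function names `fit_line`, `pla_fixed`, and `reconstruct` are hypothetical.

```python
def fit_line(ts, ys):
    """Least-squares line y = a*t + b through the points (ts, ys)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    if var_t == 0:  # a single point: use a flat line through it
        return 0.0, mean_y
    a = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / var_t
    b = mean_y - a * mean_t
    return a, b


def pla_fixed(values, seg_len):
    """Partition values (assumed sampled at t = 0, 1, 2, ...) into time
    segments of seg_len points and return one tuple
    (start, end, slope, intercept) per segment."""
    segments = []
    for start in range(0, len(values), seg_len):
        end = min(start + seg_len, len(values)) - 1
        ts = list(range(start, end + 1))
        a, b = fit_line(ts, values[start:end + 1])
        segments.append((start, end, a, b))
    return segments


def reconstruct(segments):
    """Rebuild an approximate series from the stored line segments."""
    out = []
    for start, end, a, b in segments:
        out.extend(a * t + b for t in range(start, end + 1))
    return out


values = [0, 1, 2, 3, 10, 8, 6, 4]
segments = pla_fixed(values, seg_len=4)
approx = reconstruct(segments)
```

In this toy input, eight samples are reduced to two stored tuples, one per time segment, and `reconstruct` recovers an approximation of the original series from them.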
Similarity retrieval of time series data is an analysis means commonly used in big data analysis. It includes the following steps: dividing the large time series data into a large number of time sequences of the same time length for storage, and querying, from the stored time sequences, a time sequence that matches a target time sequence to be retrieved (the target time sequence has the same time length as the stored time sequences). For example, in an electrocardiogram, the frequency of occurrence of a certain characteristic waveform may be used to identify a disease. The characteristic waveform may be retrieved from the recorded electrocardiogram, and disease analysis may be carried out based on the retrieval result. For ease of retrieval, the stored time sequences and the target time sequence are generally processed based on fixed-length PLA, in which the time sequence to be processed is partitioned into a plurality of time segments of the same time length.
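The two retrieval steps above (dividing into equal-length time sequences, then querying for a match) can be illustrated with a short sketch. It rests on assumptions that are not stated in the description: matching is decided by Euclidean distance against a threshold, and raw sample values are compared directly rather than their PLA representations. The names `split_fixed`, `euclidean`, and `retrieve` are hypothetical.

```python
import math


def split_fixed(series, length):
    """Cut series into consecutive, non-overlapping time sequences of
    `length` points; an incomplete tail is dropped."""
    return [series[i:i + length]
            for i in range(0, len(series) - length + 1, length)]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(stored, target, threshold):
    """Return (index, distance) for every stored sequence whose distance
    to target is at most threshold, nearest first."""
    hits = []
    for i, seq in enumerate(stored):
        d = euclidean(seq, target)
        if d <= threshold:
            hits.append((i, d))
    return sorted(hits, key=lambda h: h[1])


# Toy recording in which a spike-shaped waveform occurs twice.
recording = [0, 0, 0, 0, 1, 5, 1, 0, 0, 0, 0, 0, 1, 5, 2, 0]
stored = split_fixed(recording, 4)
hits = retrieve(stored, target=[1, 5, 1, 0], threshold=1.5)
occurrences = len(hits)
```

Counting the hits gives the frequency of occurrence of the target waveform in the recording, in the spirit of the electrocardiogram example.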
In the course of implementing the present disclosure, the inventors have identified that the prior art has at least the following problem:
In the prior art, a time sequence is processed based on fixed-length PLA when the time sequence is stored. With fixed-length PLA, however, the precision of the data has to be ensured by shortening the time length of the time segments, which increases the volume of data to be stored and consumes more storage space.
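The trade-off between precision and storage can be made concrete with a small sketch. It is illustrative only and assumes least-squares fitting, integer time points, and total squared error as the precision measure, none of which is specified in the description. The names `line_sse` and `fixed_pla_cost` and the toy series are hypothetical.

```python
def line_sse(ys):
    """Sum of squared residuals of the least-squares line through
    (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    if n < 3:
        return 0.0  # one or two points are always fitted exactly
    mean_t = (n - 1) / 2
    mean_y = sum(ys) / n
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(ys)) / var_t
    intercept = mean_y - slope * mean_t
    return sum((y - (slope * t + intercept)) ** 2 for t, y in enumerate(ys))


def fixed_pla_cost(values, seg_len):
    """Return (segment_count, total_squared_error) for fixed-length PLA
    with segments of seg_len points."""
    chunks = [values[i:i + seg_len] for i in range(0, len(values), seg_len)]
    return len(chunks), sum(line_sse(c) for c in chunks)


# A flat stretch followed by a short spike.
series = [0.0] * 8 + [0.0, 4.0, 8.0, 4.0, 0.0, 0.0, 0.0, 0.0]
for seg_len in (8, 4, 2):
    count, err = fixed_pla_cost(series, seg_len)
    print(f"segment length {seg_len}: {count} segments, squared error {err:.2f}")
```

In this toy series, each halving of the segment length doubles the number of segments to be stored while the error shrinks. The flat stretch is cut as finely as the spike, even though a single line segment already represents it exactly.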