The problem of data string classification has been widely studied in the data mining, artificial intelligence, and machine learning communities. Typically, a set of records called the training data is established, in which each record is labeled with a class.
This training data is used to construct a model which relates the features in the data records to a class label. If the class label for a given record is unknown, the model may be used to predict a class label. This problem often arises in the context of customer profiling, target marketing, medical diagnosis, and speech recognition.
Techniques and/or mechanisms which are often used for classification in the data mining domain include decision trees, rule-based classifiers, nearest neighbor techniques, and neural networks. See, e.g., R. Duda et al., “Pattern Classification and Scene Analysis,” Wiley, 1973; J. Gehrke et al., “Optimistic Decision Tree Construction,” SIGMOD Conference, 1999; J. Gehrke et al., “RainForest—A Framework for Fast Decision Tree Construction of Large Data Sets,” VLDB Conference, 1998; and J. Gehrke et al., “Data Mining with Decision Trees,” ACM SIGKDD Conference Tutorial, 1999.
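By way of illustration only, the relationship between training data, a model, and a predicted class label may be sketched with a nearest neighbor technique, one of the techniques listed above. The function name `predict_label`, the feature values, and the class labels below are hypothetical and are not taken from any of the cited reports.

```python
from math import dist

def predict_label(training_data, record):
    """Return the class label of the training record closest to `record`.

    This is a 1-nearest-neighbor rule using Euclidean distance; it is only
    one of many possible classification techniques.
    """
    _, nearest_label = min(
        training_data, key=lambda pair: dist(pair[0], record)
    )
    return nearest_label

# Hypothetical training data: (feature vector, class label) pairs.
training_data = [
    ((1.0, 1.2), "low_risk"),
    ((0.9, 1.0), "low_risk"),
    ((4.8, 5.1), "high_risk"),
    ((5.2, 4.9), "high_risk"),
]

# A record whose class label is unknown.
print(predict_label(training_data, (5.0, 5.0)))
```

In this sketch the "model" is simply the stored training data together with the distance rule; decision trees or neural networks would instead construct an explicit model from the same labeled records.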
Time series data classification is important with respect to financial, medical, and scientific databases. A time series is a set of data records comprising a succession of real-valued numbers. Each real number corresponds to the value of the time series at a moment in time. Examples of time series data appear in applications concerning the stock market and biological data.
In many cases, the classification behavior of a time series may reside in portions of the time series which cannot be easily determined a priori. Often, the compositional characteristics of a time series reflect its classification behavior. Typically, techniques used to classify time series data utilize either an event-based or a global classification system, but not both. However, the important characteristics may be hidden in local portions of the series or in more global portions. The data is also typically stored in a compressed form (e.g., GZIP). Using current classification techniques, the compressed format makes it unclear which subset of the series to pick. It is also unclear which granularity to pick and which shapes result in the corresponding characteristics. Therefore, the data must be decompressed before it can be used with these techniques. Thus, a need exists for improved time series data classification techniques which overcome these and other limitations.
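The distinction between global and local characteristics, and the effect of the chosen granularity, may be illustrated with a minimal sketch. It assumes, purely for illustration, that a characteristic is summarized by a mean value; the functions `global_feature` and `local_features` and the sample series are hypothetical and do not represent any particular classification technique.

```python
from statistics import mean

def global_feature(series):
    """Summarize the whole series with a single value (its mean)."""
    return mean(series)

def local_features(series, window):
    """Summarize consecutive, non-overlapping windows of length `window`.

    The window length plays the role of the granularity; trailing values
    that do not fill a complete window are ignored in this sketch.
    """
    return [
        mean(series[i:i + window])
        for i in range(0, len(series) - window + 1, window)
    ]

# Hypothetical series with a short burst in the middle.
series = [1.0, 1.0, 1.0, 1.0, 9.0, 9.0, 1.0, 1.0]

print(global_feature(series))      # one value for the whole series
print(local_features(series, 2))   # fine granularity
print(local_features(series, 4))   # coarse granularity
```

In this example the global summary averages the burst away, the fine-grained windows isolate it, and the coarse windows only partially reveal it, which is why the choice of subset and granularity matters.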