1. Field of the Invention
The present invention relates generally to data processing, and more particularly to "computer database mining" in which similar time sequences are discovered. In particular, the invention concerns discovering, in a large database, similarities in the patterns between time sequences of data.
2. Description of the Related Art
Sequences of events over time, hereinafter "time sequences", can be and often are electronically recorded in databases. As recognized by the present invention, the capability to identify time sequences that are similar to each other has many applications, including, e.g., identifying companies with similar patterns of earnings and sales growth. As further examples, it would be advantageous to identify similar time sequences in product sales patterns, and to discover stocks that have similar price movements over time. Likewise, discovering similar and/or dissimilar time sequences in seismic waves has many useful applications, such as identifying geological irregularities.
Mining systems to discover similar time sequences have been disclosed in Agrawal et al., "Database Mining: A Performance Perspective", Proc. of the Fourth Int'l Conf. on Foundations of Data Organization and Algorithms, Chicago, 1993, and in Faloutsos et al., "Fast Sub-sequence Matching in Time-series Databases", Proc. of the ACM Sigmod Conf. on Management of Data, May, 1994. The systems and methods disclosed in the above-mentioned publications, however, share several drawbacks which limit their practical application. One drawback is that the methods are inherently overly sensitive to a few data anomalies. Further, the methods referred to above do not address the problems of amplitude scaling and translation of sequences. Consequently, they are effectively unable to identify, e.g., similarities in the price sequences of two stocks if one stock fluctuates around $10 and the other stock fluctuates around $75.
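The amplitude scaling and translation problem described above can be illustrated with a minimal sketch. The price values below are hypothetical, and the normalization shown (shifting each sequence to zero mean and scaling it to unit standard deviation) is a common technique assumed here purely for illustration; it is not presented as the method of the cited publications or of the present invention.

```python
from statistics import mean, pstdev


def normalize(seq):
    """Shift a sequence to zero mean and scale it to unit standard deviation.

    Assumed illustration only: one common way to remove differences in
    offset (translation) and amplitude scaling between two sequences.
    """
    mu = mean(seq)
    sigma = pstdev(seq)
    if sigma == 0:
        # A constant sequence has no shape; map it to all zeros.
        return [0.0 for _ in seq]
    return [(x - mu) / sigma for x in seq]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical price sequences: stock_a fluctuates around $10 and
# stock_b around $75, with the same shape (stock_b = 5 * stock_a + 25).
stock_a = [9.0, 10.0, 11.0, 10.0, 9.0, 10.0]
stock_b = [70.0, 75.0, 80.0, 75.0, 70.0, 75.0]

# Comparing raw values is dominated by the price level, not the pattern.
raw_distance = euclidean(stock_a, stock_b)

# After normalization the two sequences coincide.
normalized_distance = euclidean(normalize(stock_a), normalize(stock_b))

print(f"raw distance:        {raw_distance:.2f}")
print(f"normalized distance: {normalized_distance:.2f}")
```

In this sketch the raw distance is large even though the two sequences move identically, which is the failure mode attributed above to methods that do not account for amplitude scaling and translation.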
Still further, the methods referred to above are unable to effectively ignore small non-matching regions of two otherwise similar time sequences. Consequently, the methods can fail to identify certain actually similar time sequences as being similar.
In addition, prior methods for data processing in time sequence similarity discovery models suffer several drawbacks. Among the disadvantages of prior data processing regimes, which are used to index the time sequences incident to matching similar time sequences, are that many false matches tend to be identified. Also, the previous methods tend to be computationally intensive, and the methods inherently make it difficult for the user to vary the criteria that are used to define the conditions for time sequence similarity.
Accordingly, it is an object of the present invention to provide a system and method for discovering similar time sequences stored in a large database, in which the criteria that define similarity can be easily varied. Another object of the present invention is to provide a system and method for discovering similar time sequences which identify similar time sequences in the presence of a few data anomalies and non-matching regions. Still another object of the present invention is to provide a system and method for discovering similar time sequences which can identify two time sequences as being similar when the amplitude scaling of one time sequence differs significantly from the amplitude scaling of the other time sequence. Yet another object of the present invention is to provide a system and method for discovering similar time sequences which are easy to use and cost-effective.