1. Field of the Invention
The present invention relates generally to data processing, and more particularly to "mining" of computerized databases in which a user can analyze database contents to discern particular shapes of time sequences for retrieval. In particular, the invention concerns a shape definition language for mining subsequences of time sequences that are stored in a computerized database, wherein the subsequences have user-defined shapes.
J. T. Wang et al., in "Combinatorial Pattern Discovery for Scientific Data: Some Preliminary Results," Proc. of the ACM SIGMOD Conference on Management of Data (May 1994), define the mining of a database as "the activity of finding structural or topological patterns in data that can lead to important conclusions or prediction of new phenomena." In the present application, one such topological pattern is imposed by time, and the invention is directed to obtaining this pattern from a computerized database.
2. Description of the Related Art
Sequences of events over time, hereinafter "time sequences", often are implicit in the contents of databases, making them accessible to computers for a number of advantageous purposes. As recognized by the present invention, computers can provide a vehicle to define particular patterns of interest in such time sequences. These patterns can be graphically depicted in, e.g., Cartesian coordinates, with the y-axis representing a magnitude, for example, a stock price, and the x-axis representing time, such that the pattern established by the time sequence is characterized by a shape. Although the above example assumes time histories that have tuples that map to (or imply) two dimensions for the purposes of explaining what is meant by "shape", the principles of the present invention can be applied to time histories that have tuples that map to "n" dimensions. It is the purpose of the present invention to provide a means for a user to define a desired time history shape, and then to access an index of actual time histories to retrieve from contents of a database subsequences of time sequences which conform to the defined shape.
The ability to define a desired time history shape and then identify actual time sequences that conform to the shape has many applications in science, industry, and business. As but one example, it might be desirable to identify stocks whose closing price, when plotted in Cartesian coordinates against time, resembles a head-and-shoulders shape. As another example, it might be desirable to identify user-defined shapes in time sequences of product sales, or to identify predefined patterns in time sequences of seismic waves for identifying geological irregularities. As recognized by the present invention, it is often desirable to match a time sequence to a predefined shape even though the time sequence does not exactly match the shape. Such matches are referred to herein as "blurry" matches.
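By way of illustration only, the notion of a blurry match can be sketched as follows. The sketch assumes, purely for the example, that a shape is judged by labeling each step between consecutive values as "up", "down", or "flat", and that a rising shape is blurrily matched when at least a chosen number of steps are "up". The function names, the tolerance parameter, and the min_ups threshold are hypothetical and do not represent the language of the invention.

```python
def transitions(values, tolerance=0.0):
    """Label each step between consecutive values as 'up', 'down', or 'flat'.

    `tolerance` is a hypothetical parameter: changes whose magnitude does
    not exceed it are treated as 'flat'.
    """
    labels = []
    for prev, curr in zip(values, values[1:]):
        delta = curr - prev
        if delta > tolerance:
            labels.append("up")
        elif delta < -tolerance:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


def blurry_rising(values, min_ups, tolerance=0.0):
    """Return True if at least `min_ups` steps in `values` are 'up'.

    An exact rising match would require every step to be 'up'; the
    blurry version tolerates some steps that are not.
    """
    return transitions(values, tolerance).count("up") >= min_ups


prices = [10, 11, 12, 11.5, 13, 14]    # illustrative data only
exact = blurry_rising(prices, min_ups=len(prices) - 1)
blurry = blurry_rising(prices, min_ups=4)
```

With the illustrative prices above, one of the five steps is a decline, so the exact test fails while the blurry test requiring four rising steps succeeds.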
Database languages and models have been introduced for identifying time sequences having particular attributes, but such languages and models have several drawbacks which it is the intent of the present invention to address. An example of a database language for specifying composite events in databases is disclosed in Gehani et al., "Composite Event Specification in Active Databases: Model and Implementation", Proc. of the VLDB Conf., pp. 327-338, Vancouver, 1992. The language disclosed in Gehani et al. uses finite automata and, hence, regular expressions. Consequently, the language of Gehani et al. is somewhat cumbersome in effecting blurry matches, because it requires the generation and merging of many automata to express the desired blurry shape. Furthermore, the Gehani et al. language focuses on finding the endpoints of predefined events, rather than identifying time sequence intervals that conform to a predefined shape.
A time sequence pattern detection algorithm is disclosed in Berndt et al., "Using Dynamic Time Warping to Find Patterns in Time Series", KDD-94: AAAI Workshop on Knowledge Discovery in Databases, pp. 359-370, Seattle, 1994. Unfortunately, the Berndt et al. algorithm does not provide the capability to impose arbitrary conditions on blurry matches. Still further, previous algorithms such as the Berndt et al. algorithm which use regular expressions tend to identify time sequences that are essentially duplicative of each other, and hence generate much of what might be termed unnecessary clutter.
More particularly, time sequence detection techniques that use regular expressions tend to return both maximal subsequences (i.e., subsequences that are not proper subsequences of other shape-matching subsequences) and non-maximal subsequences that are largely overlapping, and in many applications the non-maximal subsequences represent useless data clutter. Additionally, previous time sequence detection techniques in general do not automatically rewrite user shape queries as comparatively more efficient alternate expressions. Moreover, previous time sequence detection techniques do not reconfigure databases for efficient examination of the time sequences in response to user queries.
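The distinction between maximal and non-maximal subsequences can be illustrated with a minimal sketch. It assumes, for the example only, that each shape-matching subsequence is contiguous and is represented by a (start, end) pair of inclusive positions within one time sequence, so that one match is a proper subsequence of another when its interval lies within the other interval and differs from it. The function name and the representation are hypothetical.

```python
def maximal_intervals(matches):
    """Keep only matches that are not properly contained in another match.

    `matches` is an iterable of (start, end) pairs with inclusive bounds.
    Duplicate pairs are collapsed, and the result is sorted.
    """
    unique = set(matches)
    return sorted(
        m for m in unique
        if not any(
            other != m and other[0] <= m[0] and m[1] <= other[1]
            for other in unique
        )
    )


# Illustrative matches only: (1, 4) and (2, 5) lie inside (0, 5),
# and (8, 9) lies inside (7, 9).
found = [(0, 5), (1, 4), (2, 5), (7, 9), (8, 9), (0, 5)]
kept = maximal_intervals(found)
```

In this illustration only (0, 5) and (7, 9) are retained. Matches that merely overlap without one containing the other are both retained. The pairwise comparison is quadratic in the number of matches and is chosen for clarity rather than efficiency.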
Accordingly, it is an object of the present invention to provide a system and method for matching time sequences in databases to predefined shapes that enable a user to express the desired shapes simply, naturally, and powerfully.
Another object of the present invention is to provide a system and method for matching time sequences in database contents to predefined shapes which allow for imposing arbitrary conditions on blurry matches.
Yet another object of the present invention is to provide a system and method for matching time sequences in database contents to predefined shapes which return only maximal subsequences that match a predefined shape.
Still another object of the present invention is to provide a system and method for matching time sequences in database contents to predefined shapes with ease of use, efficiency, and cost-effectiveness.