The present invention relates to data series. More particularly, the present invention relates to retrieval of data contained in large sequences of data.
In many industries, large stores of data are used to track variables over relatively long expanses of time or space. For example, several environments, such as chemical plants, refineries, and building control, use records known as process histories to archive the activity of a large number of variables over time. Process histories typically track hundreds of variables and are essentially high-dimensional time series. The data contained in process histories is useful for a variety of purposes, including, for example, process model building, optimization, control system diagnosis, and incident (abnormal event) analysis.
Large data sequences are also used in other fields to archive the activity of variables over time or space. In the medical field, valuable insights can be gained by monitoring certain biological readings, such as pulse, blood pressure, and the like. Other fields include, for example, economics, meteorology, and telemetry.
In these and other fields, events are characterized by data patterns within one or more of the variables, such as a sharp increase in temperature accompanied by a sharp increase in pressure. Thus, it is desirable to extract these data patterns from the data sequence as a whole. Data sequences have conventionally been analyzed using such techniques as database query languages. Such techniques allow a user to query a data sequence for data associated with process variables of particular interest, but fail to adequately incorporate time-based features as query criteria. Further, many data patterns are difficult to describe using conventional database query languages. Moreover, the lack of an intuitive interface impairs efficiency for many users.
To facilitate querying data sequences, so-called graphical query languages have been developed that offer a graphical user interface (GUI) for entering standard query language commands. Even with these graphical query languages, however, it is difficult to specify temporal feature sets or patterns that characterize events of interest.
Another obstacle to efficient analysis of data sequences is their volume. Because data sequences track many variables over relatively long periods of time, they are typically both wide and deep. As a result, the size of some data sequences is on the order of gigabytes. Further, most of the recorded data tends to be irrelevant to any given query. Due to these challenges, existing techniques for extracting data patterns from data sequences are both time-consuming and tedious.
According to one aspect of the present invention, a graphical user interface (GUI) is used to quickly and easily find data patterns within a data sequence that match a target data pattern representing an event of interest. The user first uses the GUI to specify the target data pattern. Search criteria, such as a match threshold and amplitude and duration constraints, are then specified. A pattern recognition technique is then applied to the data sequence to find data patterns within the data sequence that satisfy the search criteria. Thus, the user avoids the need to sift through large amounts of data not relevant to the current query.
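By way of illustration only, the search step described above might be sketched as a sliding-window comparison against a user-selected target. The similarity measure is not specified here; the sketch assumes Pearson correlation, and the names `correlation`, `find_matches`, and `match_threshold` are hypothetical. The list slice stands in for the region a user would mark in the GUI.

```python
from math import sqrt

def correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / norm

def find_matches(sequence, target, match_threshold):
    """Return (start_index, score) for each window whose score meets the threshold."""
    width = len(target)
    matches = []
    for start in range(len(sequence) - width + 1):
        score = correlation(sequence[start:start + width], target)
        if score >= match_threshold:
            matches.append((start, score))
    return matches

# Assumed toy data: one variable containing a small and a large spike.
sequence = [0, 0, 1, 3, 1, 0, 0, 2, 6, 2, 0, 0]
target = sequence[2:5]  # region the user would select: [1, 3, 1]
matches = find_matches(sequence, target, match_threshold=0.99)
print([start for start, _ in matches])
```

Because correlation ignores scale, the larger spike beginning at index 7 matches as well as the selected one at index 2. This is one reason separate amplitude and duration constraints are useful as additional search criteria.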
According to one embodiment, the present invention is directed to a method for finding, within a data sequence, matching data patterns that satisfy a similarity criterion with respect to a target data pattern. A graphical representation of at least a portion of the data sequence is displayed using a GUI. The GUI is then used to define the target data pattern within the data sequence and the similarity criterion. A pattern recognition algorithm is then applied to the data sequence to find the matching data patterns that satisfy the similarity criterion with respect to the target data pattern.
In another embodiment, a target data pattern within the data sequence and at least one search constraint are defined using a GUI. A pattern recognition algorithm is applied to the data sequence to find matching data patterns that satisfy the search constraint with respect to the target data pattern. These matching data patterns are then presented to the user.
Still another embodiment is directed to a method for finding, within a data sequence, matching data patterns that satisfy a similarity criterion with respect to a target data pattern. A graphical representation of at least a portion of the data sequence is displayed using a graphical user interface. The target data pattern within the data sequence and the similarity criterion are then defined using the graphical user interface. Next, a plurality of temporally warped versions of at least a portion of the target data pattern are prepared. At least one of these temporally warped versions is compared to at least a portion of the data sequence to determine a plurality of candidate data patterns within the data sequence that satisfy a match threshold with respect to the compared at least one temporally warped version. Candidate data patterns that violate amplitude limits are rejected.
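The warping-and-rejection sequence of this embodiment can be sketched as follows, under stated assumptions. Temporal warping is approximated by linear-interpolation resampling of the target to several candidate durations, the match score is Pearson correlation, and the amplitude limits are applied to the peak-to-peak range of each candidate. None of these choices is mandated by the description above, and all function and parameter names are hypothetical.

```python
from math import sqrt

def resample(pattern, new_length):
    """Stretch or compress a pattern (two or more samples) to new_length >= 2 samples."""
    scale = (len(pattern) - 1) / (new_length - 1)
    warped = []
    for i in range(new_length):
        pos = i * scale
        lo = min(int(pos), len(pattern) - 2)
        frac = pos - lo
        warped.append(pattern[lo] * (1 - frac) + pattern[lo + 1] * frac)
    return warped

def correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return 0.0 if norm == 0.0 else sum(x * y for x, y in zip(da, db)) / norm

def search_warped(sequence, target, lengths, match_threshold, amp_min, amp_max):
    """Return (start, length) of candidates that pass the threshold and amplitude limits."""
    accepted = []
    for length in lengths:
        warped = resample(target, length)  # one temporally warped version
        for start in range(len(sequence) - length + 1):
            window = sequence[start:start + length]
            if correlation(window, warped) < match_threshold:
                continue  # not a candidate
            amplitude = max(window) - min(window)  # assumed: peak-to-peak range
            if amp_min <= amplitude <= amp_max:
                accepted.append((start, length))  # candidate survives rejection
    return accepted

# Assumed toy data: a spike, a slower copy of it, and a much larger copy.
sequence = [0, 0, 2, 4, 2, 0, 0, 1, 2, 3, 4, 3, 2, 1, 0, 0, 20, 40, 20, 0, 0]
target = sequence[1:6]  # [0, 2, 4, 2, 0]
accepted = search_warped(sequence, target, lengths=[5, 9],
                         match_threshold=0.999, amp_min=3, amp_max=8)
print(accepted)
```

In this toy run, the original spike and its slower copy are accepted. The spike with a peak of 40 and the narrow peak of the slower copy both satisfy the match threshold but are rejected for violating the amplitude limits.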
Other embodiments are directed to computer-readable media and computer arrangements for performing these methods.