1. Field of the Invention
The present invention relates to discovering significant patterns in long sequences of items, and more particularly for a system and method for identifying significant patterns in sequences which include noise.
2. Description of the Related Art
With the large amounts of data being stored and used, discovering and understanding significant patterns in large data sets has become increasingly important. Significant pattern discovery has taken on greater importance in a plurality of new fields as well as new applications for existing technologies. Support (the number of occurrences) of a pattern has been proposed as the metric of significance in an article by R. Agrawal et al., xe2x80x9cMining association rules between sets of items in large databases.xe2x80x9d Proc. ACM SIGMOD Conf. on Management of Data, 207-216, 1993. As discussed in Agrawal et al., an input is a set of transactions, and each transaction includes a set of items. The significance of a set of items is determined by the number of transactions which contain this set of items.
Due to the presence of noise, a symbol may be misrepresented by some other symbols. This substitution may prevent an occurrence of a pattern from being recognized and, in turn, slashes the support of that pattern. As a result, a frequent pattern may be xe2x80x9cconcealedxe2x80x9d by the noise. This phenomenon commonly exists in many applications.
For example, in bio-medical study, mutation of amino acids is a common phenomenon studied in the context of biology. Some mutations are proven to occur with a non-negligible probability under normal circumstances and incur little change to an organism""s biological functions. For example, the amino acid N in the human body is likely to mutate to D with little impact on behavior. In this sense, the amino acids should not be considered as a-totally independent.
In the area of performance analysis, many system-monitoring applications involve collecting and analyzing attributes that take continuous numerical values. A common approach to process the data is to quantize the domain into categories. If the true value of an attribute is close to the boundary of the quantization, there is a fair chance that the observed value may fall into the adjacent bin and be represented by a different label. It would be desirable if such kind of distortion can be taken into account during a data mining process. In the area of consumer behavior, for example, in the supermarket, consumers frequently buy a slightly different product or brand from their original intent due to various reasons, such as, the desired product was out of stock or misplaced. Allowing obscurity in item matching may unveil the customer""s real purchase intention.
This problem becomes critical when the pattern is substantially long because an occurrence of a long pattern is much more subject to distortion caused by noise. In general, the length of a gene expression can range up to a few hundreds of links, if amino acids are taken as the granularity of the analysis. Some clinical studies show that, the amino acids N, K, and V are relatively more likely to mutate to amino acids D, R, and I, respectively. The corresponding gene expressions after the mutation may differ from the standard one. It is more equitable to treat them as possible (degraded) occurrences of the standard expression than to consider them as totally independent gene expressions.
Therefore, a need exists for a system and method which discovers significant patterns while accounting for noise effects. A further need exists for a new measure that accounts for mutation or naturally occurring changes in data in discovering significant patterns.
A system and method for determining patterns in a data sequence constructs a compatibility matrix which provides a probability between an actual occurrence of an item and an observed occurrence of that or another item between each item in the data sequence. Candidate patterns are generated. The candidate patterns include items in the data sequence. The candidate patterns are checked against the data sequence to determine a match value based on the compatibility matrix, and significant matches are determined based on candidate patterns having the match value above a threshold.
In alternate systems and methods, the items may include symbols and a compatibility matrix may be constructed which includes constructing a matrix such that a match is determined between any two symbols in the data sequence. The compatibility matrix may include rows and columns and each entry in the compatibility matrix corresponds to a row and a column. The match value between two items may include a number between 0 and 1.
The candidate patterns may be checked against the data sequence to determine a match value based on the compatibility matrix. This may include, for a pattern p and a sequence s of symbols, determining an overall match value of p with respect to s by aggregating p with respect to each subsequence sxe2x80x2 with 1 symbols in s. The determining of an overall match value of p with respect to s may include determining a match value between p and sxe2x80x2 by taking a product of the match value between symbols at each position in the data sequence.
The probability between an actual occurrence of an item may determined by experiment or expert opinion. The items may include symbols and candidate patterns may be generated using a level-wise approach wherein at each level one additional symbol is added to a total number of symbols considered in a candidate pattern. The candidate patterns may include a significant pattern if all sub patterns of the candidate pattern satisfy the threshold. Significant matches may be determined based on candidate patterns having the match value above a threshold such that for each candidate pattern, a match value of that candidate pattern is verified against the input sequence to determine a set of patterns that satisfy the threshold.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.