mining and knowledge discovery today is to discover interesting relationships in complex and diverse high-dimensional data sets. Interesting information and relevant patterns may be scattered across, entangled in, and spanning various data subspaces. Currently, there are several problems in the area of data mining and knowledge discovery, some of which are discussed below.
First, technologies exist for discovering patterns in data sets, including pattern discovery and association rule mining. Pattern mining aims to discover previously unknown patterns/rules from the raw or pre-processed data. Pattern mining is used in the knowledge discovery process, for example in business and commercial applications, and in other uses that support discovering useful knowledge from data. However, very often, the number of patterns discovered is overwhelming. In fact, the number of discovered patterns is often so large that the patterns cannot be presented to the users in a meaningful way.
Currently, to handle the problem of having too many patterns, additional specification is obtained from the users to select the more interesting patterns. For example, the system may ask users to specify their existing knowledge and then search for patterns that are unexpected given that knowledge. Another example is to use templates or constraints to specify the required patterns. Another approach to deal with the problem of too many patterns is to prune uninteresting patterns based on certain criteria. Some common criteria are minimum improvement in confidence or the coverage of the patterns over the entire data set. Some systems group patterns using a nonparametric density estimation method. Others select a subset of associations to form a summary of the discovered associations, while the rest of the patterns are grouped under the selected associations accordingly. However, all of the attempts in the prior art to deal with this issue have some limitations. For example, these systems may require user input to select desired patterns or be limited to receiving one type of pattern. Another example is that interesting patterns may be pruned by these systems since the measure of interestingness is rather ad hoc.
Thus, post-processing of the discovered patterns is needed to enable further analysis. For example, pattern pruning removes uninteresting patterns, and pattern summarization builds a summary of the patterns. A fundamental problem of all post-processing tasks is to describe the relationship among discovered patterns.
One method to analyze the discovered patterns and to understand the meaning of the large number of patterns is to calculate the distance between the patterns existent within the data.
However, existing distancing methods offer limited insight into the patterns. One method to calculate distances between patterns within corresponding data groups is to count the number of common primary events (or items in the terminology of association rule mining) shared by them. For example, in a text mining application, the patterns [computer, science] and [computer, language] share the event [computer] and so their distance is 1. This approach may be disadvantageous in two respects. First, related patterns sometimes may not contain common primary events. For instance, the patterns [computer, science] and [programming, language] do not share any common events, but programming language is related to, and is a subject in, computer science. Second, unrelated patterns may contain common primary events. For instance, [computer, science] and [social, science] share one primary event. However, computer science and social science are two separate fields. Hence, counting the number of common primary events may miss certain subtle relationships between patterns and may produce misleading and undesirable results.
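The shared-event count described above can be illustrated with a minimal sketch. The function name shared_event_count and the representation of a pattern as a list of event strings are assumptions made for illustration only; they are not taken from any particular prior art system.

```python
def shared_event_count(pattern_a, pattern_b):
    """Count the primary events (items) common to two patterns.

    Each pattern is treated as an unordered collection of events;
    this representation is an assumption made for illustration.
    """
    return len(set(pattern_a) & set(pattern_b))


# The three pattern pairs discussed in the text.
related_shared = shared_event_count(["computer", "science"], ["computer", "language"])
related_disjoint = shared_event_count(["computer", "science"], ["programming", "language"])
unrelated_shared = shared_event_count(["computer", "science"], ["social", "science"])

print(related_shared, related_disjoint, unrelated_shared)
```

Under these assumptions, the related pair [computer, science] and [programming, language] receives a count of 0, while the unrelated pair [computer, science] and [social, science] receives a count of 1, which is the same value as the pair sharing [computer]. The count alone therefore cannot separate the related case from the unrelated one.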
Other methods to calculate distances between patterns are based on the number of samples in which different patterns either match or mismatch. For example, this may involve counting the number of samples that the patterns share or the number of samples on which they differ. However, these sample matching distances are one-dimensional and account for either the differences or the similarities between samples, but not both. This may not be sufficient for all types of data clusters.
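A sample matching count of this kind might be sketched as follows. The text does not define when a pattern matches a sample, so this sketch assumes that a pattern matches a sample when every event of the pattern occurs in that sample. The helper names and the small data set are hypothetical and serve only as illustration.

```python
def matches(pattern, sample):
    """Assumed matching rule: every event of the pattern occurs in the sample."""
    return set(pattern) <= set(sample)


def sample_match_counts(pattern_a, pattern_b, samples):
    """Return (shared, differing) sample counts for two patterns.

    shared:    samples matched by both patterns.
    differing: samples matched by exactly one of the two patterns.
    """
    shared = sum(1 for s in samples if matches(pattern_a, s) and matches(pattern_b, s))
    differing = sum(1 for s in samples if matches(pattern_a, s) != matches(pattern_b, s))
    return shared, differing


# Hypothetical samples, e.g. the sets of words found in four documents.
samples = [
    {"computer", "science", "programming"},
    {"computer", "science", "programming", "language"},
    {"programming", "language"},
    {"social", "science"},
]

shared, differing = sample_match_counts(
    ["computer", "science"], ["programming", "language"], samples
)
print(shared, differing)
```

In this hypothetical data the two patterns share one sample and differ on two. A distance built on only one of these two numbers discards the other, which illustrates the one-dimensional limitation noted above.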
Accordingly, there is a need for a method and system to discover and analyze pattern information and corresponding data so as to obviate or mitigate at least some of the above-presented disadvantages.