The present invention relates to data mining, and more specifically to modeling sequences of events arranged in a taxonomy.
Markov models are fundamental mathematical structures widely used in the natural and physical sciences, computer science, and engineering systems for describing and predicting processes. A Hidden Markov Model (HMM) is an extension of a Markov chain in which observable symbols are emitted in each of the states, but it is not possible to know exactly the current state from the symbol observed. Markov models with or without hidden states, first-order or higher-order, with or without lumping of states, have been extensively applied to sequence mining in the past.
The parameters of an HMM can be adjusted by the well-known Baum-Welch method to increase the likelihood of observed sequences of symbols. Another class of techniques for learning HMM parameters from data is based on model merging. These approaches start with a maximum likelihood HMM that directly encodes all the observable samples. At each step, more general models are produced by merging previous, simpler submodels. The submodel space is explored using a greedy search strategy, and the states to be merged are chosen to maximize the data likelihood.
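By way of illustration only, the likelihood that Baum-Welch seeks to increase can be computed with the standard forward recursion. The following is a minimal sketch, not the method of any particular prior work; the function name forward_likelihood, the symbols pi, A and B, and the toy two-state parameter values are assumptions introduced here for the example.

```python
def forward_likelihood(pi, A, B, observations):
    """Return P(observations | model) for a discrete HMM.

    pi[i]   : probability of starting in state i
    A[i][j] : probability of moving from state i to state j
    B[i][k] : probability of emitting symbol k while in state i
    observations : non-empty list of symbol indices
    """
    n_states = len(pi)
    # alpha[i] = P(symbols seen so far, current state = i)
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model (illustrative values only).
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.9, 0.1],
     [0.2, 0.8]]

likelihood = forward_likelihood(pi, A, B, [0, 1, 0])
```

A parameter update would be accepted by a likelihood-increasing procedure only if this quantity, accumulated over the training sequences, does not decrease.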
State aggregation in Markov models. Many of the processes that can be represented by HMMs suffer from the state space explosion problem. State space explosion occurs when the number of states grows so quickly that only trivial cases can be solved computationally. As the number of states increases, computers run out of time and/or memory to complete the computation. For example, problems that grow exponentially or combinatorially with the size of the input suffer from state space explosion. As a result, minimizing memory and time requirements is crucial for most applications of HMMs. Aggregation techniques for reducing the number of states have been extensively studied.
Many approaches are based on the notion of lumpability, a property of a Markov chain whereby there exists a partition of the original state space into aggregated states such that the aggregated Markov chain maintains the characteristics of the original. A different approach reduces the structure of an HMM by partitioning the states using bisimulation equivalence, so that equivalent states can be aggregated to obtain a minimal set that does not significantly affect model performance. A simple heuristic for HMMs is to merge the states that have the most similar emission probabilities. This approach has been applied to the domain of gesture recognition.
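For illustration, the lumpability property described above can be checked mechanically using the standard strong-lumpability condition: for every pair of blocks of the partition, each state in the first block must have the same total probability of moving into the second block. The sketch below is an assumption-laden example; the function names is_lumpable and lump, the tolerance, and the three-state matrix P are hypothetical and introduced only for this example.

```python
def is_lumpable(P, partition, tol=1e-9):
    """Check strong lumpability of transition matrix P for a partition.

    P         : row-stochastic matrix as a list of lists
    partition : list of blocks, each a list of state indices
    """
    for block in partition:
        for target in partition:
            # Probability of jumping into `target` from each state of `block`.
            mass = [sum(P[s][t] for t in target) for s in block]
            if max(mass) - min(mass) > tol:
                return False
    return True


def lump(P, partition):
    """Build the aggregated chain; meaningful only if is_lumpable holds."""
    return [
        [sum(P[block[0]][t] for t in target) for target in partition]
        for block in partition
    ]


# Hypothetical three-state chain (illustrative values only).
P = [[0.5, 0.25, 0.25],
     [0.2, 0.4, 0.4],
     [0.2, 0.3, 0.5]]

good_partition = [[0], [1, 2]]   # states 1 and 2 behave alike toward each block
bad_partition = [[0, 1], [2]]    # states 0 and 1 differ in their mass into {2}

lumped = lump(P, good_partition) if is_lumpable(P, good_partition) else None
```

Under these assumed values, states 1 and 2 can be aggregated into a single state, reducing the chain from three states to two, whereas aggregating states 0 and 1 would not preserve the Markov property.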
Sequence clustering. Sequence clustering is one of the most common tasks in sequence mining. This task has been handled by using frequent subsequences or n-gram statistics as features, or by considering the edit distances among all the candidate sequences. Traditional methods often require sequence alignment and do not efficiently handle variable-length sequences.
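As an example of the edit-distance approach mentioned above, the following minimal sketch computes the Levenshtein distance between two sequences of symbols and builds a pairwise distance matrix that could be passed to a clustering routine. The function name edit_distance and the sample sessions are assumptions introduced for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of comparable symbols."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical event sequences of different lengths.
sessions = [
    ["home", "search", "item", "cart"],
    ["home", "search", "item"],
    ["home", "help"],
]

# Symmetric matrix of pairwise distances between all sequences.
distances = [[edit_distance(s, t) for t in sessions] for s in sessions]
```

Computing all pairs requires a number of comparisons that is quadratic in the number of sequences, and each comparison costs time proportional to the product of the two sequence lengths, which is one reason such methods scale poorly.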
One of the first works using HMMs for sequence clustering computed the pairwise distance matrix for all the observed sequences by training an HMM for each sequence. The log-likelihood of each model given each sequence is used to cluster the sequences into K clusters using an Expectation-Maximization (EM) algorithm. A Markov-chain-based clustering method without hidden states using EM has also been implemented in commercial applications. In another approach, the HMMs are used as cluster prototypes. The clustering is computed by combining the HMMs with a rival-penalized competitive learning procedure. In an extension to the pairwise distance approach, HMMs are used to build a new representative space, where the features are the log-likelihoods of each sequence to be clustered with respect to a predefined number of HMMs trained over a set of reference sequences.
Sequence clustering can also be used to build probabilistic user behavior models that describe and predict user actions. In such models, user actions are described by the conditional probability of performing an action given the previous action, plus binary features that indicate the presence of a certain action in the user's history.
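The two ingredients named above, the conditional probability of an action given the previous action and binary history features, can be sketched as follows. This is an illustrative example only; the function names, the action labels, and the sample sessions are assumptions and are not taken from any referenced system.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next action | previous action) from observed sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(following.values())
               for nxt, c in following.items()}
        for prev, following in counts.items()
    }


def history_features(session, vocabulary):
    """Binary indicators: 1 if the action occurs anywhere in the session."""
    seen = set(session)
    return [1 if action in seen else 0 for action in vocabulary]


# Hypothetical user sessions (illustrative only).
sessions = [
    ["search", "view", "buy"],
    ["search", "view", "view"],
    ["view", "buy"],
]
vocabulary = ["search", "view", "buy"]

probabilities = transition_probabilities(sessions)
features = [history_features(s, vocabulary) for s in sessions]
```

The estimates are simple relative frequencies; a practical model would typically add smoothing so that unseen transitions do not receive zero probability.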
Sequence mining applications. There are many applications of sequence mining. Two areas of interest are web usage mining and spatio-temporal data mining. Sequential pattern mining is one of the most common data mining techniques for web data analysis. Markov models have been applied to modeling user web navigation sessions, describing user behavior, mining web access logs, and recommending queries. Mobility data analysis is a research area that is rapidly gaining attention, as witnessed by the number of spatio-temporal data mining techniques that have been developed in recent years.
Non-Markov-based methods are also known in the art. Taxonomy-driven data mining has mainly been considered in the context of frequent pattern extraction: originally, taxonomies were used for mining association rules and sequential patterns of itemsets in market-basket data, where each item is a member of a hierarchy of product categories. More recently, taxonomy-based methods were used for mining frequent-subgraph patterns in biological pathways, where graphs of interacting proteins annotated with functionality concepts form a very large taxonomy.
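To illustrate how a taxonomy can enter frequent pattern extraction, the sketch below extends each transaction with the ancestors of its items in a category hierarchy and then counts how many transactions support each item or category. This is one common way of generalizing items, offered under stated assumptions; the function names, the parent map, and the sample transactions are hypothetical.

```python
from collections import Counter


def ancestors(item, parent):
    """Return all categories above `item` in the taxonomy, nearest first."""
    chain = []
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain


def generalized_support(transactions, parent):
    """Count transactions containing each item or any of its descendants."""
    support = Counter()
    for transaction in transactions:
        extended = set(transaction)
        for item in transaction:
            extended.update(ancestors(item, parent))
        # Each item or category is counted at most once per transaction.
        support.update(extended)
    return support


# Hypothetical product taxonomy: child -> parent category.
parent = {
    "jacket": "outerwear",
    "ski pants": "outerwear",
    "outerwear": "clothes",
    "shirt": "clothes",
}
transactions = [["jacket"], ["ski pants", "shirt"], ["shirt"]]

support = generalized_support(transactions, parent)
```

In this assumed data no single product appears in every transaction, yet the category "clothes" does, which shows how patterns that are infrequent at the item level can become frequent at a higher level of the taxonomy.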