Field of the Invention
This invention relates to sequential data mining and, more particularly, to an approach for extracting maximal repeat patterns from a large number of tagged sequences and computing their frequency distribution using a MapReduce framework.
Description of the Prior Art
The frequency distribution of repeat patterns plays an important role in many practical applications, such as genomic pattern (biomarker) mining in bioinformatics, trend analysis in text mining, event (log) analysis in internet security, user behavior analysis, and production line analysis. The suffix tree and the suffix array are two well-known data structures for extracting maximal repeats from sequences. However, computation based on these two data structures suffers from a memory limitation, especially when the volume of the sequences exceeds the main memory available in one computing node. To overcome this memory limitation, the String B-Tree (SB-Tree), an external-memory approach, was proposed to handle a larger amount of sequences. However, computation via the SB-Tree is time-consuming because input/output on external memory (e.g., disk) is slow. It is therefore desirable to improve the computation of repeat pattern extraction by using multiple computing nodes, especially with a MapReduce framework, which is expected to be scalable.