The invention relates generally to determining n-gram frequency in a sequence. In particular, the invention relates to counting sequences of characters based on the de Bruijn graph.
Counting the frequency of n-grams in a sequence is a routine problem in text mining and genomic studies. However, calculating these frequencies is costly in terms of time. The typical algorithm makes a first pass to enumerate all the n-grams in the sequence. For each n-gram found, the algorithm then makes a second pass over the sequence to count each matching occurrence. Such an algorithm calculates the frequency of n-grams of a single length n and must be repeated for each other length. This conventional algorithm has complexity of order O(kn²) for k sets of n-grams. Because of this complexity, the algorithm is typically executed only a few times.
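By way of illustration only, the conventional two-pass approach described above may be sketched as follows. The function names count_ngrams_two_pass and count_all_lengths are hypothetical and chosen for this example; the sketch assumes the sequence is a string and is not taken from any particular prior implementation.

```python
def count_ngrams_two_pass(sequence, n):
    """Count n-grams of one fixed length n using the two-pass scheme."""
    windows = len(sequence) - n + 1

    # First pass: enumerate the distinct n-grams in order of first appearance.
    distinct = []
    seen = set()
    for i in range(windows):
        gram = sequence[i:i + n]
        if gram not in seen:
            seen.add(gram)
            distinct.append(gram)

    # Second pass, repeated for every distinct n-gram: rescan the whole
    # sequence and count the matching occurrences. This nested scan is
    # the source of the quadratic cost.
    counts = {}
    for gram in distinct:
        total = 0
        for i in range(windows):
            if sequence[i:i + n] == gram:
                total += 1
        counts[gram] = total
    return counts


def count_all_lengths(sequence, lengths):
    """Repeat the whole procedure once per n-gram length (the k sets)."""
    return {n: count_ngrams_two_pass(sequence, n) for n in lengths}
```

For example, count_all_lengths("abab", [1, 2]) runs the full two-pass procedure twice, once for each length, which reflects the factor k in the stated complexity.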
Conventional text miners limit themselves to n-grams of length 1 to 5 because further analysis is too expensive. Among genomic researchers, the solution is to chop larger sequences into smaller pieces. That kind of analysis is called “de novo”. Literature in that field specifically states that de novo analysis is a workaround for the O(n²) complexity of the algorithm. The de novo analysis limits the size of n so that the running time does not exceed real-time constraints.