The longest common subsequence (LCS) problem is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences). It differs from problems of finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The LCS problem is a classic computer science problem; it is the basis of data comparison programs such as the diff utility and has applications in bioinformatics.
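To make the distinction between subsequences and substrings concrete, the following minimal Python sketch computes the length of the longest common subsequence and the length of the longest common substring for the same pair of strings. The function names and the sample strings are illustrative assumptions, not part of the original text; both functions use the textbook dynamic programming recurrences and keep only two rows of the table.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (elements need not be adjacent)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence ending before both matched elements.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one element from either a or b, whichever is better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def longest_common_substring_length(a, b):
    """Length of the longest common substring of a and b (elements must be adjacent)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # A mismatch resets the run, because substrings must be contiguous.
            run = prev[j - 1] + 1 if x == y else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


# Hypothetical sample inputs, chosen only for illustration.
first, second = "AGGTAB", "GXTXAYB"
print(lcs_length(first, second))
print(longest_common_substring_length(first, second))
```

For these sample strings the common subsequence "GTAB" skips over the unmatched characters, whereas no two adjacent characters of one string appear adjacently in the other, so the two measures differ.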
Finding the LCS of two sequences is a common requirement. The well-known Hirschberg's algorithm solves this problem in quadratic time and linear space. It is a dynamic programming algorithm that finds the optimal sequence alignment between two strings, and it is commonly used in computational biology to find maximal global alignments of DNA and protein sequences. Other LCS algorithms also exist.
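The divide-and-conquer idea behind Hirschberg's algorithm can be sketched in Python as follows. This is a simplified illustration specialized to LCS (match score 1, no penalties), not a definitive or optimized implementation; the function names are assumptions made for this sketch, and the slices are used for readability even though they copy data.

```python
def lcs_last_row(a, b):
    """Last row of the LCS-length table for a versus b, using O(len(b)) space."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev


def hirschberg_lcs(a, b):
    """Return one longest common subsequence of a and b (strings or lists)."""
    if not a or not b:
        return a[:0]  # empty sequence of the same type as a
    if len(a) == 1:
        return a if a[0] in b else a[:0]

    mid = len(a) // 2
    # Forward pass: LCS lengths of the first half of a against every prefix of b.
    left = lcs_last_row(a[:mid], b)
    # Backward pass: LCS lengths of the second half of a against every suffix of b.
    right = lcs_last_row(a[mid:][::-1], b[::-1])

    # Split b where the two halves together give the largest total length.
    n = len(b)
    k = max(range(n + 1), key=lambda j: left[j] + right[n - j])

    # Solve the two independent subproblems and concatenate the results.
    return hirschberg_lcs(a[:mid], b[:k]) + hirschberg_lcs(a[mid:], b[k:])


# Hypothetical sample inputs, chosen only for illustration.
print(hirschberg_lcs("AGGTAB", "GXTXAYB"))
```

Only two rows of the dynamic programming table are held at any time, which is what gives the linear space bound; the total work remains proportional to the product of the two sequence lengths.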
However, when applied to a superscale data sequence (SSDS), such as two sequences of 10 billion events each occurring in a cloud data center, or two DNA sequences of 3 billion bases each, the conventional LCS techniques, including Hirschberg's algorithm, are not practical. Because Hirschberg's algorithm is designed to run on a single machine, several problems emerge for an SSDS: the program cannot load all of the events and sequences into memory; the quadratic time bound makes the algorithm slow and time consuming; and the result contains too much noise for practical use if the sequences are not naturally well aligned.
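A rough back-of-envelope calculation illustrates the scale of the problem. The sketch below is only an estimate under stated assumptions: the throughput of one billion table-cell updates per second and the size of 8 bytes per stored element are hypothetical figures chosen for illustration, not measurements from the original text.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def quadratic_dp_estimate(n, m, cells_per_second=1e9, bytes_per_element=8):
    """Estimate work and input size for a quadratic-time LCS on one machine.

    cells_per_second and bytes_per_element are assumed values, not measured ones.
    """
    cell_updates = n * m  # one table-cell update per pair of positions
    years = cell_updates / cells_per_second / SECONDS_PER_YEAR
    input_gigabytes = (n + m) * bytes_per_element / 1e9
    return cell_updates, years, input_gigabytes


# Two sequences of 10 billion events each, as in the example above.
cells, years, gigabytes = quadratic_dp_estimate(10**10, 10**10)
print(f"{cells:.1e} cell updates, about {years:,.0f} years, {gigabytes:,.0f} GB of input")
```

Under these assumptions, the quadratic table requires 10^20 cell updates, which is on the order of three thousand years of single-machine computation, and the two input sequences alone occupy well over a hundred gigabytes.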