Over the past three decades, the suffix tree has served as a fundamental data structure in text or data string processing. However, its widespread applicability has been hindered by the fact that suffix tree construction is believed to not scale well with the size of the input string. With advances in data collection and storage technologies, large strings have become ubiquitous, especially across emerging applications involving text, time series, and biological sequence data. To benefit from these advances, it is imperative that a scalable suffix tree construction algorithm be realized.
Several disk-based suffix tree construction algorithms have recently emerged that attempt to index strings that do not fit in memory. However, construction times remain daunting; for example, indexing the entire human genome still takes over 30 hours on a system with 2 gigabytes of physical memory.
Existing disk-based suffix tree construction algorithms are limited in the following regards: 1) To achieve reasonable disk I/O efficiency, the algorithms require the input string to fit in main memory. Although existing “partition-and-merge”-based approaches, such as those described in the reference to Phoophakdee, B. and Zaki, M. entitled “Genome-scale disk-based suffix tree indexing”, in Proceedings of the ACM International Conference on Management of Data, 2007, and the reference to Tian, Y., Tata, S., Hankins, R., and Patel, J. entitled “Practical methods for constructing suffix trees”, in VLDB Journal 14, 3 (2005), do attempt to remove this restriction, they access the input string in a near-random fashion during a merge phase. As a consequence, when the input string does not fit in main memory, disk I/O latency dominates. 2) If one were to employ the parallel processing offered by modern high-performance computing systems to reduce operation times, existing techniques would require that each processor hold the entire input string. This is simply not possible, given that most state-of-the-art massively parallel systems have a small, fixed amount of memory (e.g., 512 MB) per processing element. More often than not, these systems are diskless and do not offer virtual memory support. Consequently, large-scale parallel suffix tree construction using existing algorithms is not trivial.
That is, existing suffix tree construction algorithms cannot be trivially parallelized on such systems for the following reasons: (1) Due to limited main memory per processor, the input string being indexed cannot always be kept in-core and must instead be stored on, and read from, the network file system. Accessing the suffix tree during the tree construction and link recovery processes requires accessing the input string (using start and end indices). These accesses are near-random, and hence the processes are extremely I/O inefficient when the input string does not fit in main memory. Parallel operations become latency bound. (2) The link recovery task requires all processors to simultaneously have both read and write access to nearly all suffix sub-trees. On massively parallel systems, this quickly leads to I/O contention and limits scalability. (3) Naive parallelization results in a significant amount of redundant work being performed, which also limits scalability.
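The access pattern described in reason (1) can be illustrated with a minimal sketch. It assumes a naive quadratic insertion procedure, not any of the cited algorithms, and every name in it (`Node`, `build_naive_suffix_tree`, `count_non_sequential`, `count_leaves`) is hypothetical. Each edge stores only a (start, end) pair of offsets into the input string, so comparing a new suffix against an existing edge forces reads at two unrelated positions of the string.

```python
class Node:
    """Suffix tree node; each edge is stored as (start, end, child)."""
    def __init__(self):
        self.children = {}  # first character of edge label -> (start, end, child)


def build_naive_suffix_tree(text, access_log):
    """Insert every suffix of `text` one at a time (naive, quadratic time).

    Assumes `text` ends with a unique terminator such as '$'.
    Every offset of `text` read while matching is appended to `access_log`.
    """
    def read(i):
        access_log.append(i)
        return text[i]

    root = Node()
    n = len(text)
    for i in range(n):
        node, pos = root, i
        while pos < n:
            c = read(pos)
            if c not in node.children:
                # No edge starts with c: attach a leaf labelled text[pos:n].
                node.children[c] = (pos, n, Node())
                break
            start, end, child = node.children[c]
            # Walk along the edge; the label lives in text[start:end],
            # so each step reads two distant offsets of the input string.
            k = 0
            while start + k < end and pos + k < n and read(start + k) == read(pos + k):
                k += 1
            if start + k == end:
                node = child  # whole edge matched, descend
            else:
                # Mismatch inside the edge: split it at offset k.
                mid = Node()
                mid.children[text[start + k]] = (start + k, end, child)
                node.children[c] = (start, start + k, mid)
                node = mid
            pos += k
    return root


def count_non_sequential(access_log):
    """Count consecutive reads whose offsets differ by more than one."""
    return sum(1 for a, b in zip(access_log, access_log[1:]) if abs(b - a) > 1)


def count_leaves(node):
    """Number of leaves below `node` (one per suffix when a terminator is used)."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, _, child in node.children.values())


if __name__ == "__main__":
    log = []
    tree = build_naive_suffix_tree("banana$", log)
    print("leaves:", count_leaves(tree))
    print("reads:", len(log), "non-sequential:", count_non_sequential(log))
```

When the string is held in memory, these scattered reads are cheap. When it must be fetched from disk or a network file system, each non-sequential read can incur a seek or a round trip, which is the latency problem described above.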
Due to the aforementioned limitations, suffix trees have fallen out of favor when it comes to indexing and querying large input strings.
It would be highly desirable to provide an approach that affords improvements of several orders of magnitude when indexing large strings.
Furthermore, it would be highly desirable to provide a locality-conscious algorithm for suffix tree construction to efficiently build very large suffix trees for strings that are significantly larger than the size of main memory, in both serial and parallel settings.