A “stringome” can be defined as a family of strings that can be obtained by concatenating a small number of shorter elemental strings (e.g., “stringlets”), which can additionally share many common structures, patterns, similarities or homologies with the stringome. In particular, these elemental strings can be partitioned into several classes such that the stringlets within a class are few in number and may differ from each other by small edit-distances, thus giving rise to a small number of “allelic forms” within each class. In a dynamic setting, a stringome can evolve by adding, deleting or mutating elemental strings in its set, but can also evolve by changing the rules by which the elemental strings can be concatenated. The study of such combinatorial objects can be referred to as “stringomics,” and can raise many combinatorial and algorithmic questions related to efficient pattern matching, query processing and statistical analysis, which would otherwise be prohibitively expensive in the absence of any imposed indexing structures and associated assumptions.
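The definition above can be illustrated with a minimal sketch. It assumes one particular concatenation rule (exactly one stringlet is chosen from each class, in a fixed class order), which is only one of many rules a stringome could use. The toy data and the names `STRINGLET_CLASSES`, `enumerate_stringome` and `max_class_diameter` are hypothetical and are introduced here for illustration only.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical toy data: each class holds a few "allelic forms" of a stringlet.
STRINGLET_CLASSES = [
    ["ACGT", "ACGA"],
    ["TTG"],
    ["CCA", "CGA", "CCT"],
]


def enumerate_stringome(classes):
    """Assumed rule: concatenate one stringlet per class, in class order."""
    return ["".join(choice) for choice in product(*classes)]


def max_class_diameter(classes):
    """Largest edit distance between two allelic forms of the same class."""
    return max(
        (edit_distance(a, b) for cls in classes for a in cls for b in cls),
        default=0,
    )


if __name__ == "__main__":
    for s in enumerate_stringome(STRINGLET_CLASSES):
        print(s)
    print("max within-class edit distance:", max_class_diameter(STRINGLET_CLASSES))
```

Under this assumed rule, the number of strings in the family is the product of the class sizes, while the storage needed for the stringlets grows only with the sum of their lengths, which is the kind of structure an index can exploit.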
Stringomics can be similar to “pattern matching on hypertext,” which was introduced by Manber and Wu in 1992 (see e.g., reference 20). In that scenario, the hypertext was modeled as a graph of n nodes and m edges, with each node storing one string and each edge indicating an alternative text-node that can follow the current text-node. The pattern, however, was still a simple (e.g., linear) string of length p. A pattern occurrence was defined as a path of text-nodes containing the pattern. Therefore, an occurrence can be internal to a single text-node, or can span a path of several text-nodes.
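The hypertext model described above can be sketched as follows. This is a naive brute-force search written for illustration only; it is not the algorithm of reference 20 and does not achieve its time bound. The function name `find_occurrences`, the dictionary-based graph representation and the toy data are assumptions, and every text-node is assumed to store a non-empty string.

```python
def find_occurrences(texts, edges, pattern):
    """Naive search for a pattern in a hypertext graph.

    texts:   dict mapping node id -> non-empty string stored in that node
    edges:   dict mapping node id -> list of successor node ids
    pattern: the linear string to search for

    Returns a sorted list of ((start_node, start_offset), path), where path
    is the tuple of text-nodes traversed by the occurrence.
    """
    if not pattern:
        return []
    occurrences = []

    def extend(node, pos, matched, path, start):
        text = texts[node]
        # Compare as much of the remaining pattern as fits in this node.
        take = min(len(text) - pos, len(pattern) - matched)
        if text[pos:pos + take] != pattern[matched:matched + take]:
            return
        matched += take
        if matched == len(pattern):
            occurrences.append((start, tuple(path)))
            return
        # The node's text is exhausted: continue into every successor.
        for nxt in edges.get(node, ()):
            extend(nxt, 0, matched, path + [nxt], start)

    for node, text in texts.items():
        for offset in range(len(text)):
            extend(node, offset, 0, [node], (node, offset))
    return sorted(occurrences)


if __name__ == "__main__":
    # Hypothetical toy hypertext: node 0 can be followed by node 1 or node 2.
    texts = {0: "GAT", 1: "TACA", 2: "CACA"}
    edges = {0: [1, 2]}
    print(find_occurrences(texts, edges, "ACA"))   # occurrences internal to a node
    print(find_occurrences(texts, edges, "ATCA"))  # occurrence spanning nodes 0 and 2
```

In the toy graph, "ACA" occurs inside node 1 and inside node 2, while "ATCA" occurs only along the path from node 0 to node 2, which illustrates the two kinds of occurrence.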
In reference 20, an acyclic graph was considered, and all occ pattern occurrences were reported in O(N + pm + occ log log p) time, where N is the total length of the strings stored in the graph's nodes. Akutsu (see e.g., reference 3) improved the solution to optimal time for the case of a tree structure, while Park and Kim (see e.g., reference 17) extended this result to an O(N + pm) time algorithm for directed acyclic graphs (“DAG”), as well as for graphs with cycles, but only under the assumption that no text-node can match the pattern in two places. Subsequently, other researchers (see e.g., references 1, 23, 24) dealt with the problem of approximate pattern matching on hypertexts, showing that the problem can be solved in O(pN) time and O(N) space for cyclic and acyclic graphs.
Currently deployed genomics analysis algorithms are simple reincarnations of generic stringology algorithms that were developed in the most general unconstrained setting, and have failed to take advantage of the genome structure (e.g., what can be discerned in diploid eukaryotic genomes). For example, while devising algorithms to study a population of human genomes, computational genomicists have not yet taken noticeable advantage of the genome architecture, including such structural elements as haplotype blocks, sparsity in variations, allelic frequencies, haplotype phasing, and/or population stratification. New approaches that exploit these architectural structures to dramatically improve the algorithmic and communication complexity can benefit numerous genomic applications, for instance, Bayesian base calling, genome assembly (both genotypic and haplotypic) (see e.g., reference 21), resequencing, and most importantly, the embryonic field of clinical genomics (see e.g., reference 22). Another specialized, but enormously critical, application comes from the field of onco-genomics analysis, which studies the genomes of a heterogeneous population of tumor cells and stroma, all undergoing rapid somatic evolution choreographed through cancer hallmarks, but also sculpted by chemotherapy- and radiation-based therapeutic interventions.
Although combinatorial objects (e.g., stringomes) are somewhat specialized, they are likely to enable many future algorithmic innovations. For example, just in the field of genomics, these can deliver many desperately needed tools to curb the data deluge, improve data-compression (both in transmission and storage), and ultimately, accelerate disease studies through intelligent pooling strategies. (See e.g., reference 22). Similar improvements can be foreseen in other related fields, for example, metagenomics, epigenomics, transcriptomics, microbiomics, and many others, as would be apparent to a person possessing skills in the related arts.
Generally, the above methods are batch solutions that need to scan the entire graph, and the strings contained in its nodes, in order to count or report the pattern occurrences. Thus, it may be beneficial to provide exemplary systems, methods and computer-accessible mediums that are index-based solutions, which can count or report the pattern occurrences with a time complexity that is as independent as possible of the graph and string sizes while remaining succinct or compressed in the occupied space, which can exploit the structural properties of stringomics problems, and which can overcome at least some of the problems described above.
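The contrast between a batch scan and an index-based query can be sketched with a deliberately simple structure. The example below assumes a fixed-length q-gram inverted index built once over the strings stored in the nodes; it answers only queries of exactly length q that fall inside a single node, and it is neither succinct nor compressed. It is a hypothetical illustration of the query model, not the exemplary index itself, and the names `build_qgram_index` and `count_occurrences` are introduced here for illustration only.

```python
from collections import defaultdict


def build_qgram_index(texts, q):
    """Build once: map every length-q substring to its (node, offset) positions."""
    index = defaultdict(list)
    for node, text in texts.items():
        for offset in range(len(text) - q + 1):
            index[text[offset:offset + q]].append((node, offset))
    return dict(index)


def count_occurrences(index, pattern):
    """Query: a single dictionary lookup, with no scan of the graph or its strings."""
    return len(index.get(pattern, ()))


if __name__ == "__main__":
    # Hypothetical toy node strings.
    texts = {0: "GAT", 1: "TACA", 2: "CACA"}
    index = build_qgram_index(texts, q=2)
    print(count_occurrences(index, "CA"))
    print(index.get("CA"))
```

After the one-time construction, counting costs one hash lookup regardless of how many nodes or characters the graph holds, whereas the batch methods revisit the whole graph for every pattern.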
Accordingly, there is a need to address and/or overcome at least some of the deficiencies described herein above.