Metagenomics is the study of genetic material recovered directly from an environmental sample and characterized by sequencing. Metagenomics provides information pertaining to the taxonomic diversity and physiology of the various organisms present in the environmental sample.
A facility involved in genomic study, such as a research laboratory or a clinic, typically uses high-capacity platforms, such as next generation sequencing (NGS) platforms, capable of generating huge volumes of metagenomic data every year. The metagenomic data thus generated may be further analyzed, for example, to determine the various organisms represented in the metagenomic data and to identify the functional roles of the genes those organisms carry. Generally, the metagenomic data is also stored for further analysis and future studies. Thus, each year, metagenomic data in the range of hundreds of terabytes (TB) is generated and stored in repositories.
In order to analyze the metagenomic data, the nucleotide sequences constituting the metagenomic data, such as DNA or RNA sequences, are generally assembled into larger sequences called contigs. The process of assembly typically involves performing a pairwise comparison of the nucleotide sequences, which number in the millions, and therefore requires huge computational resources and infrastructure. Furthermore, an attempt to assemble nucleotide sequences originating from the genomes of a large number of organisms belonging to diverse taxonomic groups may result in the formation of erroneous chimeric sequences, which may affect the results of analyses of the metagenomic data.
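The cost of the pairwise comparison described above can be illustrated with a minimal sketch. It assumes a naive all-against-all suffix/prefix overlap check, which is only one simple way to compare sequences and is not taken from any particular assembler; the function names `pairwise_comparison_count`, `longest_overlap`, and `all_overlaps`, and the `min_len` threshold, are hypothetical and chosen for illustration.

```python
from itertools import combinations


def pairwise_comparison_count(n_reads: int) -> int:
    """Number of unordered pairs among n_reads sequences: n * (n - 1) / 2."""
    return n_reads * (n_reads - 1) // 2


def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    (Illustrative exact-match check; real assemblers tolerate mismatches.)
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_overlaps(reads: list[str], min_len: int = 3) -> dict[tuple[int, int], int]:
    """Compare every pair of reads in both orientations (a->b and b->a)."""
    overlaps = {}
    for i, j in combinations(range(len(reads)), 2):
        for x, y in ((i, j), (j, i)):
            k = longest_overlap(reads[x], reads[y], min_len)
            if k:
                overlaps[(x, y)] = k
    return overlaps


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GGATTT"]
    print(all_overlaps(reads))
    # Pair count for one million reads, the scale mentioned in the text.
    print(pairwise_comparison_count(1_000_000))
```

Because the number of pairs grows quadratically with the number of sequences, one million sequences already imply roughly 5 x 10^11 pairs, which is consistent with the computational burden noted above.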