Genetic information of living organisms (e.g., animals, plants, microorganisms, viruses) is encoded in deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). Genetic information is a succession of nucleotides or modified nucleotides representing the primary structure of nucleic acids. The nucleic acid content (e.g., DNA) of an organism is often referred to as a genome. In most humans, the complete genome typically contains about 30,000 genes located on twenty-three pairs of chromosomes. Most genes encode a specific protein, which after expression via transcription and translation fulfills one or more biochemical functions within a living cell.
Many medical conditions are caused by, or its risk of occurrence is influenced by, one or more genetic variations within a genome. Some genetic variations may predispose an individual to, or cause, any of a number of diseases such as, for example, diabetes, arteriosclerosis, obesity, various autoimmune diseases and cancer (e.g., colorectal, breast, ovarian, lung). Such genetic variations can take the form of an addition, substitution, insertion or deletion of one or more nucleotides within a genome.
Genetic variations can be identified by analysis of nucleic acids. Nucleic acids of a genome can be analyzed by various methods including, for example, methods that involve massively parallel sequencing. Massively parallel sequencing (MPS) techniques often generate thousands, millions or even billions of small sequencing reads. To determine genomic sequences, each read is often mapped to a reference genome and collections of reads are assembled into a sequence representation of an individual's genome, or portions thereof. The process of mapping and assembly of reads is carried out by one or more computers (e.g., microprocessors and memory) and is driven by a set of instructions (e.g., software instructions, code and/or algorithms). Such mapping and assembly processes often fail when a genetic variation is encountered in a genome of a subject. For example, existing software and programs sometimes incorrectly map reads, fail to map reads and/or fail to correctly assemble regions of a gene of interest where another highly similar gene exists in the same genome, thereby diminishing the ability to successfully identify genetic variations in such a gene of interest. This is especially problematic where it is desired to quickly and accurately detect the presence or absence of known variants in highly similar genes using data generated by high throughput MPS methods that can rapidly generate thousands, millions or even billions of small sequencing reads from multiple subjects.
Methods, systems and processes herein offer significant advances and improvements to current nucleic acid analysis techniques. Such advances and improvements can help expedite screening of MPS-generated data for genetic variations that may exist in one or more genes of a set of two or more highly similar genes.