1. Field of the Invention
The present invention relates to a computer system and method of computer-facilitated data analysis for providing reliable DNA alignments. More specifically, the invention relates to the automation of alignment and naming of mitochondrial DNA sequences, particularly for use in forensic analysis.
2. Description of Related Art
The process of DNA sequencing encompasses the use of biochemical methods for determining the order of the nucleotide bases (adenine, guanine, cytosine, and thymine) in a DNA oligonucleotide. The advent of rapid and efficient DNA sequencing techniques has significantly accelerated biological research and discovery (Maxam et al., “A New Method For Sequencing DNA”, 1977, Proc. Natl. Acad. Sci. (USA), 74(2):560-4; Braslavsky et al., “Sequence Information Can Be Obtained From Single DNA Molecules”, 2003, Proc. Natl. Acad. Sci. (USA) 100: 3960-3964; Venter, J. C., et al., “The Sequence Of The Human Genome”, 2001, Science 291 (5507): 1304-51). The speed of sequencing attainable with modern DNA sequencing technology has been instrumental in the large-scale sequencing of the human genome in the Human Genome Project. Related projects have generated the complete DNA sequences of many animal, plant, and microbial genomes (Blattner et al., “The Complete Genome Sequence of Escherichia coli K-12”, 1997, Science 277 (5331): 1453-1462; The C. elegans Sequencing Consortium, “Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology”, 1998, Science 282 (5396): 2012-2018; Neil Hall, “Advanced Sequencing Technologies And Their Wider Impact In Microbiology”, 2007, The Journal of Experimental Biology 209: 1518-1525).
The sequence of a DNA molecule constitutes the heritable genetic information in nuclei, plasmids, mitochondria, and chloroplasts that forms the basis for the developmental programs of all living organisms. Determining the DNA sequence is therefore useful in basic research studying fundamental biological processes, as well as in applied fields such as diagnostic research or forensic analysis. Single nucleotide polymorphisms (SNPs) are an abundant form of nucleic acid sequence variation, occurring at a rate of approximately one per 500 nucleotides in coding sequences, and more abundantly in non-coding sequences (U.S. Pat. No. 7,273,699; Wang D. G., et al. (1998) Science 280:1077-1082). As many as a million SNPs may exist in the human genome. The accurate identification and characterization of SNPs has important utility in genome-wide association studies and genetic linkage studies (U.S. Pat. No. 7,361,468) as well as in personalized therapeutic and diagnostic protocols (U.S. Pat. Nos. 7,488,813; 7,127,355; 7,470,513; 7,461,048; 7,461,006; 6,931,326).
In bioinformatics, a sequence alignment is a way of arranging two or more primary sequences of DNA, RNA, or proteins to identify regions of similarity between a “query” or “sample” sequence and a “reference” sequence that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. An asterisk is often used to show identity within a column; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this involves assigning each nucleotide its own color. For multiple sequences, the last row of the alignment is often the consensus sequence determined by the alignment; the consensus sequence is also often represented in graphical format with a sequence logo in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation (Schneider et al., “Sequence Logos: A New Way To Display Consensus Sequences”, 1990, Nucleic Acids Res. 18: 6097-6100).
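The text representation described above can be illustrated with a minimal Python sketch. The function name `conservation_line`, the use of a hyphen as the gap character, and the example sequences are assumptions made for illustration only; they are not taken from any particular alignment program.

```python
def conservation_line(row_a: str, row_b: str) -> str:
    """Return a marker row with '*' under every column whose residues are identical.

    Both rows must already be aligned (equal length, '-' marking gaps).
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return "".join(
        "*" if a == b and a != "-" else " "
        for a, b in zip(row_a, row_b)
    )


# Hypothetical aligned rows: one gap in the reference, one substitution.
reference = "ACGT-ACGT"
sample = "ACGTTACCT"

print(reference)
print(sample)
print(conservation_line(reference, sample))
```

Each printed row occupies one line, so identical residues line up vertically with an asterisk beneath them, while the gap column and the substituted column are left unmarked.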
Such pairwise alignments of reference and query/sample sequences may be performed globally, locally or in a hybrid (“glocal”) fashion. (Needleman et al., “A General Method Applicable To The Search For Similarities In The Amino Acid Sequence Of Two Proteins”, 1970, J Mol Biol. 48(3):443-53; Smith et al., “Identification Of Common Molecular Subsequences”, 1981, Journal of Molecular Biology 147: 195-197; Brudno et al., “Glocal Alignment: Finding Rearrangements During Alignment”, 2003, Bioinformatics 19 Suppl 1: i54-62). Global alignments, which attempt to align every residue of the reference sequence with every residue of the query/sample sequences, are most useful when the sequences in the query/sample set are similar and are of approximately equal size to the reference sequence. A general global alignment technique is called the Needleman-Wunsch algorithm and is based on dynamic programming (Phillips, A. J., “Homology Assessment And Molecular Sequence Alignment,” 2006, J. Biomed. Inform. 39(1):18-33). Local alignments have particular utility for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm, a general local alignment method also based on dynamic programming, is often employed (Bucher, P. et al., “A Sequence Similarity Search Algorithm Based On A Probabilistic Interpretation Of An Alignment Scoring System,” 1996, Proc. Int. Conf. Intell. Syst. Mol. Biol. 4:44-51). Instead of looking at the total sequence, the Smith-Waterman algorithm compares segments of all possible lengths and optimizes the similarity measure. Hybrid methods, known as semiglobal or “glocal” methods, attempt to find the best possible alignment that includes the start and end of one or the other sequence.
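The difference between the global and local dynamic-programming approaches can be sketched as follows. This is a simplified illustration rather than a reproduction of any cited implementation: the scoring values (match +1, mismatch -1, linear gap penalty -2) and the function names are assumptions, and only the optimal score is computed, not the alignment itself.

```python
def global_score(ref, query, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: every residue of both sequences is aligned."""
    rows, cols = len(ref) + 1, len(query) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalized because the alignment must span both sequences.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == query[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in the query
            left = score[i][j - 1] + gap    # gap in the reference
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


def local_score(ref, query, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score: best-scoring pair of segments of any length."""
    rows, cols = len(ref) + 1, len(query) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == query[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Flooring at zero lets a new local alignment start at any cell.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


print(global_score("ACGT", "AGT"))            # one gap is unavoidable
print(local_score("TTTACGTTT", "GGACGGG"))    # shared motif "ACG"
```

The only structural differences are the boundary initialization and the floor at zero, which is what allows the local variant to ignore dissimilar flanking sequence.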
One application of the currently available DNA sequencing and alignment technologies is in the area of genetic fingerprinting. Genetic fingerprinting is a technique used to distinguish between individuals of the same species using only samples of their DNA. Although two individuals of the same species will have the vast majority of their DNA sequence in common, certain highly variable repeat sequences, known as variable number tandem repeat (VNTR) loci, vary within a population. These loci are variable enough that two unrelated humans are unlikely to have the same alleles (Jeffreys et al., “Hypervariable ‘Minisatellite’ Regions In Human DNA”, 1985, Nature 314: 67-73). The technique is now the basis of several national DNA identification databases. These databases serve as an invaluable resource for researchers, investigators, and forensic scientists. Genetic fingerprinting has been widely applied to provide detailed genetic information for researchers in fields as varied as population genetics, microbiology and legal ethics (McClelland et al., “A Genetic Linkage Map For Coho Salmon (Oncorhynchus kisutch)”, 2008, Anim Genet., 39(2):169-79; Minder et al., “A Population Genomic Analysis Of Species Boundaries: Neutral Processes, Adaptive Divergence And Introgression Between Two Hybridizing Plant Species”, 2008, Mol Ecol., 17(6):1552-63; Daeid, N., “DNA-What's Next?”, 2007, Sci. Justice, 47(4): 149; Stellaroli et al., “Shared Genetic Data And The Rights Of Involved People”, 2007, Law Hum Genome Rev. (26):193-231; Kronforst et al., “The Population Genetics Of Mimetic Diversity In Heliconius Butterflies”, 2008, Proc Biol Sci., 275(1634):493-500).
Methods used for DNA fingerprinting include restriction fragment length polymorphism typing of variable number tandem repeat (VNTR) loci (Budowle et al., “Modifications To Improve The Effectiveness Of Restriction Fragment Length Polymorphism Typing”, Appl. Theor. Electrophoresis, 1990, 1: 181-187; Jeffreys et al., “Hypervariable Minisatellite Regions In Human DNA”, Nature, 1985, 314:67-73; Jeffreys et al., “Individual-Specific Fingerprints Of Human DNA”, Nature, 1985, 316:76-79; Wyman et al., “A Highly Polymorphic Locus In Human DNA”, Proc. Natl. Acad. Sci. (USA), 1980, 77:6754-6758), polymerase chain reaction (PCR)-based systems to analyze single nucleotide polymorphisms (SNPs) (Comey et al., “Validation Studies On The Analysis Of The HLA-DQ Alpha Locus Using The Polymerase Chain Reaction”, J. Forensic Sci., 1991, 36: 1633-1648; Saiki et al., “Genetic Analysis Of Amplified DNA With Immobilized Sequence-Specific Oligonucleotide Probes”, Proc. Natl. Acad. Sci. (USA), 1989, 86: 6230-6234), VNTR loci (Budowle et al., “Analysis Of The VNTR Locus D1S80 By The PCR Followed By High-Resolution PAGE”, Am. J. Hum. Genet., 1991, 48:137-144; Kasai et al., “Amplification Of A Variable Number Of Tandem Repeat (VNTR) Locus (pMCT118) By The Polymerase Chain Reaction (PCR) And Its Application To Forensic Science”, J. Forensic Sci., 1990, 35:1196-1200), short tandem repeat (STR) loci (Edwards et al., “DNA Typing And Genetic Mapping With Trimeric And Tetrameric Tandem Repeats”, Am. J. Hum. Genet., 1991, 49:746-756), Y chromosome analysis (Koyama et al., “Utility Of Y-STR Haplotype And mtDNA Sequence In Personal Identification Of Human Remains”, 2002, Am J Forensic Med Pathol. 23(2):181-185) and mitochondrial DNA analysis (Holland et al., “Mitochondrial DNA Analysis Of Human Skeletal Remains: Identification Of Remains From The Vietnam War”, J. Forensic Sci., 1993, 38:542-553; Hopgood et al., “Strategies For Automated Sequencing Of Human Mitochondrial DNA Directly From PCR Products”, Biotechniques, 1992, 13:82-92; Sullivan et al., “Automated Amplification And Sequencing Of Human Mitochondrial DNA”, Electrophoresis, 1991, 12:17-21; Wilson et al., “Validation Of Mitochondrial DNA Sequencing For Forensic Casework Analysis”, Int. J. Leg. Med., 1995, 108:68-74; Wilson et al., “Extraction, PCR Amplification, And Sequencing Of Mitochondrial DNA From Human Hair Shafts”, Biotechniques, 1995, 18:662-669).
Particularly for forensic scientists, DNA fingerprinting based on sequence analysis is an important tool for identifying genetic material of unknown origin (Pretty, I., “Forensic Dentistry: 1. Identification Of Human Remains”, 2007, Dent Update, 34(10):621-2, 624-6, 629-30; Baker et al., “Reuniting Families: An Online Database To Aid In The Identification Of Undocumented Immigrant Remains”, 2008, J Forensic Sci., 53(1):50-3; Irwin et al., “Application Of Low Copy Number STR Typing To The Identification Of Aged, Degraded Skeletal Remains”, 2007, J Forensic Sci., 52(6):1322-7; LeClair et al., “Bioinformatics And Human Identification In Mass Fatality Incidents: The World Trade Center Disaster”, 2007, J Forensic Sci. 52(4):806-19). Common sources of genetic material may include, but are not limited to, skin, hair, saliva, tissue, bone, and blood. Preservation of sequence integrity is vital to assure proper sequence identification; however, this is not always possible, and often samples are contaminated or partially degraded.
Mitochondria are organelles that supply energy to cells and contain their own DNA. Mitochondrial DNA (mtDNA) is a particularly preferred source of DNA for sequence analysis due to its high copy number and resistance to degradation (Remualdo et al., “Analysis Of Mitochondrial DNA From The Teeth Of A Cadaver Maintained In Formaldehyde”, 2007, Am J Forensic Med Pathol., 28(2):145-6; Andreasson et al., “Nuclear And Mitochondrial DNA Quantification Of Various Forensic Materials”, 2006, Forensic Sci Int., 164(1):56-64; Matsuda et al., “Identification Of DNA Of Human Origin Based On Amplification Of Human-Specific Mitochondrial Cytochrome b Region”, 2005, Forensic Sci Int., 152(2-3):109-14).
Human mtDNA contains approximately 16,500 base pairs, including two regions that contain significant sequence variation and can provide distinguishing profiles useful in identifying individuals or genetic samples. Sample identifications are based on the observation that mtDNA from an individual and from that individual's maternal relatives have the same sequences present in these two regions of mtDNA sequence variation (Case et al., “Maternal Inheritance Of Mitochondrial DNA Polymorphisms In Cultured Human Fibroblasts”, Somat. Cell Genet., 1981, 7:103-108; Giles et al., “Maternal Inheritance Of Human Mitochondrial DNA”, Proc. Natl. Acad. Sci. (USA), 1980, 77:6715-6719). In contrast, non-related individuals will have different sequences in these two regions of mtDNA. Forensic analysis using mtDNA is thus performed by comparing an individual's mtDNA with mtDNA isolated from an individual suspected of being a maternal relative. By obtaining genetic material from a candidate maternal relative, an investigator may compare the mtDNA sequences for purposes of sample identification. If the sample mtDNA differs from that of the presumed maternal relative at more than two positions, then a familial connection may be excluded (Bender et al., “Application Of MtDNA Sequence Analysis In Forensic Casework For The Identification Of Human Remains”, 2000, Forensic Sci Int., 113(1-3):103-7; Szibor et al., “Efficiency Of Forensic mtDNA Analysis. Case Examples Demonstrating The Identification Of Traces”, 2000, Forensic Sci Int., 113(1-3):71-8).
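The comparison described above can be sketched in Python as follows. This sketch encodes only the threshold stated in the text (more than two differing positions) and assumes that the two sequences have already been aligned to equal length; the function names and the example sequences are hypothetical, and actual casework interpretation involves considerations that are not modeled here.

```python
def differing_positions(sample: str, relative: str) -> list[int]:
    """Return the 1-based positions at which two aligned sequences differ."""
    if len(sample) != len(relative):
        raise ValueError("sequences must be aligned to equal length")
    return [
        position
        for position, (a, b) in enumerate(zip(sample, relative), start=1)
        if a != b
    ]


def may_exclude(sample: str, relative: str, max_differences: int = 2) -> bool:
    """True when the differences exceed the threshold stated in the text."""
    return len(differing_positions(sample, relative)) > max_differences


# Hypothetical short fragments used purely for illustration.
sample = "ACGTAC"
candidate_a = "ATGTCC"   # differs at two positions
candidate_b = "TTGTCC"   # differs at three positions

print(differing_positions(sample, candidate_a), may_exclude(sample, candidate_a))
print(differing_positions(sample, candidate_b), may_exclude(sample, candidate_b))
```

Reporting the positions rather than only a count keeps the result inspectable, which matters when different laboratories may describe the same differences in different ways.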
Despite the many advances in DNA sequencing and alignment technology, current methods for performing forensic DNA analysis are labor-intensive, and often result in the same sequence being typed differently by different investigators due to the necessity for human evaluation of the sequencing data. Further compounding this issue is the database contamination resulting from human error in sequence evaluation (Carracedo et al., “Reproducibility Of mtDNA Analysis Between Laboratories: A Report Of The European DNA Profiling Group (EDNAP)”, 1998, Forensic Sci Int., 97(2-3):165-70; Yao et al., “A Call For mtDNA Data Quality Control In Forensic Science”, 2004, Forensic Sci Int., 141(1):1-6; Budowle et al., “Addressing The Use Of Phylogenetics For Identification Of Sequences In Error In The SWGDAM Mitochondrial DNA Database”, 2004, J Forensic Sci., 49(6):1256-61).
In particular, the manner in which sequence alignments are described can vary from laboratory to laboratory, thereby creating ambiguities in future analyses. To address this problem, Wilson et al. (Wilson et al., “Recommendations For Consistent Treatment Of Length Variants In The Human Mitochondrial DNA Control Region”, Forensic Sci. Int., 2002, 129:35-42; Wilson et al., “Further Discussions Of The Consistent Treatment Of Length Variants In The Human Mitochondrial DNA Control Region”, 2002, Forensic Sci. Comm., 4:4) developed an approach to standardize alignments using differential weighting of transitions, transversions, insertions and deletions. The Wilson method provided recommendations for forensic scientists to assist them in interpreting and naming the mtDNA sequence data generated. By utilizing this method, inconsistencies in nomenclature could be reduced, and from the higher quality databases generated by the use of this method, more reliable profile searches could be performed.
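The idea of weighting the four kinds of difference unequally can be sketched as follows. The numeric costs below are hypothetical placeholders chosen only to make the example concrete; they are not the weights recommended by Wilson et al., and the function names are likewise assumptions.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")

# Hypothetical costs for illustration only; not the published recommendations.
HYPOTHETICAL_COST = {
    "match": 0,
    "transition": 1,
    "transversion": 2,
    "insertion": 3,
    "deletion": 3,
}


def classify(ref_base: str, sample_base: str) -> str:
    """Classify one aligned column relative to the reference ('-' marks a gap)."""
    if ref_base == sample_base:
        return "match"
    if ref_base == "-":
        return "insertion"   # sample has a base the reference lacks
    if sample_base == "-":
        return "deletion"    # sample lacks a base the reference has
    pair = {ref_base, sample_base}
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def alignment_cost(ref_row: str, sample_row: str, cost=HYPOTHETICAL_COST) -> int:
    """Sum the per-column costs of one candidate alignment."""
    if len(ref_row) != len(sample_row):
        raise ValueError("aligned rows must have equal length")
    return sum(cost[classify(r, s)] for r, s in zip(ref_row, sample_row))


print(classify("A", "G"), classify("A", "C"))
print(alignment_cost("ACG-T", "GCTAT"))
```

Under such a scheme, when several alignments of the same sample are possible, the one with the lowest total cost would be selected, which is one way a fixed weighting can make the choice reproducible between analysts.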
Although the goal of these rules was to standardize sequence nomenclature in order to provide greater consistency in forensic DNA data, several issues have become apparent. First, the Wilson Rules are currently performed manually, and are extremely time-consuming to implement, resulting in fewer samples being analyzed by forensic scientists, and thus, fewer entries into reference databases. Additionally, the rules as written do not guarantee consistent identification with the traditional nomenclature due to the necessity for human interpretation of sequence data. Therefore, a need exists for a computer-facilitated method of aligning mtDNA sample sequences with reference samples that allows for rapid sequence alignment, provides for absolute stability and consistency of nomenclature and provides increased database accuracy. The present invention is directed to this and other goals.