1. Field of Invention
This invention relates to a system and methods for the characterization of organisms and more particularly, to the characterization of the identities and relative populations of organisms in a sample. The characterization may be specific to the species and/or sub-species and/or strain level and may rely on probabilistic methods that compare sequencing information from metagenomic fragment reads to sequencing information of one or more genomic databases.
2. Discussion of the Background
Current nucleic acid methods for identifying populations of organisms are specific only to the genus level (16S rDNA) for bacteria, viruses, and other microbial organisms and do not identify the populations down to the species, sub-species and strains of organisms within the sample. Current techniques for detecting and identifying one or a small number of bacteria to the genus, species and/or sub-species level rely on static methods, such as polymerase chain reaction (PCR) and microchip arrays, to detect signatures of a pre-specified organism or plurality of pre-specified organisms. Current methods for detecting and identifying viruses require specific nucleic acid tests and do not produce data on population diversity. Complete sequencing of virus genomes has been used to demonstrate virus diversity. Conventional methods generally employ laboratory culturing methods for bacteria, fungi and parasites and are time consuming and expensive.
In none of these cases can relative populations be precisely determined, nor can any of these methods simultaneously detect and identify the organisms present in microbial populations with respect to the specific taxa (genus, species, sub-species, and strain) of bacteria, viruses, parasites, fungi, or nucleic acid fragments including plasmids and mobile genomic components. In addition, given the rapid rate of genomic mutation, and growing evidence of horizontal gene transfer, static methods that rely on predetermined signatures produce false negative results if (a) mutation has occurred in the nucleic acid sequence of the sample relative to the signature, (b) the target signature was horizontally transferred, or (c) genomic near neighbors are present in the sample.
In the diagnosis of infectious disease, conventional microbiology still relies on time consuming and laborious culturing methods and cumbersome tests for bacteria, viruses, parasites, and fungi, and also on immunological and molecular nucleic acid tests. Additional nucleic acid assays are generally used to determine the presence of specific bacteria, viruses or unculturable bacteria. In a significant fraction of samples, up to 25% of all samples, no causative agent specific to the symptoms of the patient is identified. Moreover, it is generally assumed that an infectious disease is always caused by a single microbial agent or plurality of the agent, collectively inducing the observed symptoms, when more than one or a few cells of the agent are present.
Background microbial populations, namely microbiomes (bacteria), mycobiomes (fungi) and viromes (viruses), cannot be rapidly or easily determined at the species and strain levels (i.e., within hours or less and with a single method) by current methods. Determining the cause of disease may require normalizing results to background populations, but current methods lack the ability to do this. In food science, such relative comparisons to microbial background, down to the sub-species and/or strain level, are required to determine the source of food contamination and degree of pathogenicity.
For example, benign strains of Escherichia coli are abundant in nature, but such strains can mutate, acquire genes encoding pathogenic properties and/or toxin production and become toxigenic (e.g., E. coli O157:H7). Recently, six (6) new pathogenic strains of E. coli non-O157 (O111, O121, O26, O45, O103 and O145) have been identified. These new pathogenic strains of E. coli are much less known than E. coli O157 but are just as capable of causing serious illness, including kidney failure, which is usually fatal. These new pathogenic strains of E. coli are much harder to identify in a microbiome using conventional methods because, although these new strains have been identified from their whole DNA genomes, tests that use conventional static-method technology and rely on a genetic signature have not been developed.
Given the frequency and propensity of genetic mutation in nature, it is probable that additional pathogenic strains of E. coli and other bacteria will continue to develop and evolve and, therefore, cause disease. Such continuing genetic mutation, a naturally occurring phenomenon, requires a universal method to facilitate identification even when mutation has occurred. Accordingly, there is a need in the art both for a universal method capable of microbial identification at the species and strain levels and for a method that accounts for biodiversity and mutation. Since DNA and RNA base pairs characterize all living organisms, including microorganisms and nucleic acid fragments like plasmids, and direct DNA sequencing is the standard for DNA base-pair identification, there is a need in the art for universal genomic identification at the sub-species and strain level using direct DNA and RNA sequencing of metagenomic samples.
Moreover, monitoring populations of microorganisms in the environment (e.g., in the water supply) and tracing them to infectious disease in patients (e.g., cholera) requires specificity of identification to the sub-species and/or strain level to diagnose the disease and its source. In addition, analysis of microbiomes in nature is required to understand antibiotic resistance and to monitor and prevent epidemic outbreaks or pandemics. Since microorganisms are ubiquitous and many, if not most, exist both in environmentally friendly (non-toxigenic commensal) forms and also in forms that are a biothreat to humans (highly toxigenic and/or invasive pathogens), they cannot be completely eradicated; the only way to minimize or prevent infection is to minimize exposure to pathogenic forms of microbes when their concentrations are high, and to identify and track specific pathogenic species and strains infecting patients.
Basic Local Alignment Search Tool (BLAST) analysis has become a ubiquitous method of interrogating sequence data. Many data-search methods have been developed that are based on improvements to BLAST. These include systems and methods for generating indexes and fast searching of “approximate”, “fuzzy”, or “homologous” (non-perfect-matching) matches for a large quantity of data. The data are indexed to generate a search tree taxonomy. Once the index is generated, a query can be provided to report hits within a certain neighborhood of the query. In BLAST, a local distance of a local sequence space is used to generate local search tree branches.
However, there are limitations to using BLAST output E-values as criteria for data parsing. E-values describe the number of hits one can expect to see by chance when searching a database of a particular size and are used to gauge the significance of a match. While this measurement is possible, the output is often skewed by both the database used for comparison and the length of the match. Small regions of high similarity can generate an artificially low E-value and negate the global level of similarity exhibited by the sequence. The value of the BLAST score varies with the length of the nucleotide sequence queried, and hence is not suitable alone for comparative analysis using universal cutoffs.
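The dependence of the E-value on database size and match length can be illustrated with the standard Karlin-Altschul relation, E = K·m·n·e^(−λS), where m is the query length, n is the database length, and S is the raw alignment score. The following is a minimal sketch only: the function names are hypothetical, and the values of λ and K are illustrative assumptions rather than parameters taken from this disclosure or from any particular BLAST configuration.

```python
import math

# Illustrative statistical parameters (assumed values, not authoritative).
LAMBDA = 1.28
K = 0.46

def bit_score(raw_score, lam=LAMBDA, k=K):
    """Normalize a raw alignment score into a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def expected_hits(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Karlin-Altschul E-value: expected number of chance hits
    scoring at least raw_score in a search space of query_len * db_len."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

if __name__ == "__main__":
    # The same alignment (same raw score) looks less significant
    # simply because the database is larger.
    for db_len in (1_000_000, 10_000_000, 100_000_000):
        e = expected_hits(raw_score=30, query_len=100, db_len=db_len)
        print(f"db_len={db_len:>11,d}  E={e:.3e}")

    # A short, high-scoring local match can yield a lower E-value than a
    # longer but weaker alignment, regardless of global similarity.
    print(expected_hits(raw_score=40, query_len=50, db_len=1_000_000))
    print(expected_hits(raw_score=30, query_len=500, db_len=1_000_000))
```

Because E scales linearly with the search space and exponentially with the score, a fixed E-value cutoff does not mean the same thing across databases of different sizes or queries of different lengths, which is the limitation described above.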
Previously, direct application of sequencing for rapid, multiplex diagnostics has not been possible. Direct analysis of samples has been considered too complex to interpret, and selective methods (e.g., culture) are employed to minimize the number of organisms (mostly to one type) for analysis. The capability to detect all pathogens using a single platform has not been available. Biodefense, Force Protection, agriculture and global health will benefit from sequence-based identification of all pathogens in a sample and pathogen profiling for medical decision-making.
Pathogen identification techniques that do not necessarily rely on conventional culture methods include immunological techniques, whereby molecules unique to the pathogen (generally proteins) are detected using antibodies that specifically bind to the unique molecules, and a variety of techniques that target specific DNA or RNA sequences, collectively known as nucleic acid techniques (NAT) or molecular diagnostic techniques. Current immunological and NAT methods are useful for recognizing a limited range of pathogens under highly specific conditions, but each of these methods is subject to inadequacies.
Immunological techniques, or immunoassays, are known to suffer several critical weaknesses that limit their effectiveness in medical diagnosis. These include the volume and specificity of the reagents, cross-reactivity, and poor immunogenicity of some organisms. For example, it is often difficult to produce antibodies that will react specifically with the target pathogen without reacting to other pathogens (i.e., cross-reactivity). There are many pathogens that vary the molecules on their surfaces (e.g., Neisseria gonorrhoeae), making it impossible to detect all members of a given target group. Many immunoassay formats, such as rapid agglutination tests for Group A Strep, require large numbers of organisms for detection. This makes early detection of infections difficult or requires culture of the microorganisms prior to immunological detection.
Nucleic acid methods are much more specific than immunological methods because they target genetic material of the pathogen. Almost all NAT methods require amplification of the target nucleic acid by the Polymerase Chain Reaction (PCR). Limitations of PCR include: (a) the library of DNA primers to recognize sequences on pathogen genomes is limited; (b) mutants, strains, and engineered pathogens are not always readily detected if at all; (c) because of the limited DNA primer libraries, there is little or no recognition redundancy to exclude false positive or negative reactions; (d) primer/signature erosion occurs; and (e) unknown pathogens cannot be recognized because recognition of any pathogen requires previous knowledge of the nucleotide sequence of the particular pathogen's genetic material.
Sequencing-based methods have been used for whole genome analysis, but not for characterizing and identifying populations of microorganisms or as a predictive and forensic tool for decision making. For example, methods for identifying species and subspecies in a biological sample through selective amplification of segments of nucleic acid have been developed. Such methods use a primer or code for a specific target region (usually a gene, genes, or fragments of genes, including mitochondrial DNA) present in a minute fraction of all the populations in a sample. The methods involve DNA extraction from a sample, amplification of divergent segments of the target by PCR or an equivalent technique, using primers of regions with high evolutionary conservation between species and subspecies, analysis of the amplified segment by comparison of its size in base pairs with a pre-established standard of sizes and/or analysis of the amplified segment by DNA sequence comparison of the resulting sequence with a subset of specific sequences of fractions of a group of species or subspecies queried on a computer database.
These methods have been used for genetic analysis of a biological species employing a sample (biological material) derived from single isolates or from samples containing dual or heterogeneous mixtures. Amplification of a region of DNA of the sample, corresponding to a pre-determined and narrow genome position(s), is done to determine the size in base-pairs and/or the precise DNA sequence followed by mapping that region via taxonomic identification. The mapping is done against a reference database of organisms of amplified regions containing pre-established sizes and/or DNA sequences of the corresponding region of a plurality of species and/or subspecies.
All of these methods have limited use when a sample comprises a mixture of organisms. They can only confirm the presence of a pre-known or suspected organism, but they cannot identify each of the organisms present in the sample and cannot identify to the species, sub-species, and/or strain level. In addition, if a pre-known organism were present, but had undergone mutation in the pre-specified sequence, such methods would indicate a false negative. The natural processes of mutation, genetic deletions and alterations, and engineered mutations are all part of the creation of biodiversity, which cannot be detected or even addressed with existing prior art methods.
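The false-negative behavior described above can be illustrated with a toy example. The following is a minimal sketch under stated assumptions: the sequences are invented for illustration, the function names are hypothetical, and the shared k-mer comparison is only a simple stand-in for sequence comparison in general; it is not presented as the method of this disclosure.

```python
def signature_present(sample, signature):
    """Static detection: report a hit only on an exact signature match."""
    return signature in sample

def kmers(seq, k):
    """Return the set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_fraction(sample, reference, k):
    """Fraction of the sample's k-mers that also occur in the reference."""
    sample_kmers = kmers(sample, k)
    if not sample_kmers:
        return 0.0
    return len(sample_kmers & kmers(reference, k)) / len(sample_kmers)

# Invented reference sequence and a pre-specified signature taken from it.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACGGA"
signature = reference[10:22]

# The same organism after a single-base substitution inside the signature.
mutated = reference[:15] + "A" + reference[16:]

if __name__ == "__main__":
    print(signature_present(reference, signature))  # True
    print(signature_present(mutated, signature))    # False: a false negative
    # Most of the mutated sequence still matches the reference.
    print(shared_kmer_fraction(mutated, reference, k=8))
```

A single substitution defeats the exact signature test even though most of the sequence is unchanged, whereas a comparison that draws on the whole sequence still indicates a close relationship to the reference.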
Therefore, a mechanism is needed to identify simultaneously a plurality of organisms in a given sample with a single test without having to use multiple probes and without prior knowledge of organisms present in the sample. It is also desirable to distinguish very similar or interrelated species, sub-species and strains for medical, agricultural, and industrial applications.
There are many life-threatening circumstances in which it would be useful to analyze and sequence the DNA and/or RNA in a sample, for example, in response to an act of bioterrorism where a fatal pathogenic agent had been released into the environment. In the past, obtaining such results has required the involvement of many people, which demands too much time. As a result, rapidity and accuracy may suffer.
A bioterrorist attack or an emerging epidemic requires first responders, i.e., physicians in the emergency room (their options or bed-side treatments), to make immediate decisions for treatment, and requires food manufacturers, distributors, retailers, and public health personnel throughout the country to identify rapidly, accurately, and reliably the pathogenic agents and the disease(s) they cause. Pathogenic agents can be transmitted in food, air, soil, water, and animal, plant and human tissue and by clinical presentation in emergency rooms. Because the agent(s) and/or potential disease(s) can be immediately life-threatening and/or highly contagious, identification must be both rapid and accurate. If this is not possible, it represents a significant weakness in infectious disease control, homeland security, and bioterrorism response.
A method and system are needed to identify rapidly and accurately more than a single organism (multiplexing) in a sample and indicate if a species, strain and/or substrain are present employing genome comparison of nucleic acids present in the sample to nucleic acids present in a reference genomic database.
Rapid advances in biological engineering have dramatically impacted the design and capabilities of DNA sequencing tools, including high-throughput sequencing, a method of determining the order of bases in DNA, and mapping the genetic variation that reveals the genetic underpinning of human disease. This approach is useful when sequencing many different DNA templates with any number of primers. Despite these important advances in biological engineering, little progress has been made in building devices to quickly identify sequence information and transfer data more efficiently and effectively.
Traditionally, DNA sequencing was accomplished by a dideoxy method, commonly referred to as the Sanger method [Sanger et al., 1977], that used chain-terminating inhibitors to stop the extension of the DNA chain during DNA synthesis.
Methods for sequencing strategies continue to be developed. For example, it is possible to build an array of DNA sequences (microarrays) and hybridize complementary sequences in a process commonly referred to as sequencing-by-hybridization. Another technique considered state-of-the-art employs primer extension, followed by cyclic addition of a single nucleotide, with each cycle followed by detection of the incorporation event. The technique referred to as sequencing-by-synthesis or pyrosequencing, including fluorescent in situ sequencing (FISSEQ), is reiterative in practice and involves a serial process of repeated cycles of primer extension while the target nucleotide sequence is sequenced. These sequencing methods cannot rapidly identify an organism from the data of an isolate, and no tools currently exist for identifying a mixture of organisms based on metagenomic data created by these sequencing methods. Moreover, conventional methods and systems for identifying organisms in metagenomic samples based on nucleotide data generated by sequencers do not exist.
Despite these advances, there is a need for rapid genome identification methods and systems, including multidirectional electronic communication of nucleic acid sequence data, clinical data, therapeutic intervention, and tailored delivery of therapeutics to targeted populations to streamline responses and speed diagnosis of infectious disease, conserve valuable medical supplies, and contain bioterrorism, inadvertent release, and emerging pathogenic epidemics. In addition, a mechanism is needed to identify simultaneously a plurality of organisms in a given sample with a single test without having to use multiple probes, and it is desirable to distinguish very similar or interrelated species, sub-species and strains for medical, agricultural, and industrial applications.