1. Field of the Invention
The present invention relates to automated selection systems, including automated sub-sequence selection systems for use with any group of nucleic acid or protein sequences, whether generated by resequencing DNA microarrays or by alternative methods. The system is configured to automatically select sets of sub-sequences from incomplete nucleotide sequence data obtained from resequencing DNA microarrays, according to parameters predetermined by the system or determined by a user, so that the selected sequence subsets are optimally suitable for comparison against a collection of predetermined database sequences using one or more similarity search algorithm(s). Embodiments of the invention also enable the further analysis and presentation of relevant results returned by a similarity search resulting from submission of one or more subsequences. Aspects of the invention described herein distinguish between combinations of sequence signatures that arise from a mixture of multiple sequence targets (e.g. microbial organisms) and those that arise from a rearrangement of sequences within a single target. Embodiments of the method are also capable of assigning relative abundances of mixed target sequences based on relative signal intensity values from the DNA microarray itself. Moreover, an aspect of the invention is an integral component of an iterative process for designing resequencing DNA microarrays using “prototype” sequence tiles to represent a range of related target sequences (e.g. pathogens).
2. Description of the Related Art
The convergence of biology with engineering and computer science has led to the emergence of biotechnology and bioinformatics which, among many other goals, aim to rapidly obtain and analyze genomic and proteomic sequence information for diagnosis of disease. The experimental viability and widespread availability of such methodologies are due in no small part to the emergence of DNA microarrays (Stenger et al., 2002).
Generally, microarray fabrication applies methods of microprocessor manufacturing to create “gene chips” capable of rapidly and reliably identifying sequences of DNA or proteins that are present in a biological sample. Here, the term “microarray” refers to any type of planar substrate or alternative matrix presenting a high multiplicity (10^2 to 10^6) of individual sites, each presenting probes (immobilized nucleic acids or antibodies) designed to selectively capture complementary strands of target analytes (i.e. genes or gene transcripts) in solution. By design, DNA microarrays enable the simultaneous interrogation of thousands of gene or gene transcript elements.
In using a resequencing DNA microarray for genetic analysis, a solution containing amplified and fluorescently-tagged genetic targets is passed across the microarray, which comprises a plurality of oligonucleotide probes in a “tiled” format (Kozal et al., 1996). Complementary sequences in the sample bind to the corresponding probes contained on the microarray. The microarray is then analyzed using, for example, a laser scanner that records the intensity of light emission from the microarray's probes. The recorded intensities are then analyzed by array-specific software used to make “base calls,” which is a term describing an algorithmic method of identifying, to a certain degree of probabilistic certainty, the sequence of nucleic acids (adenine; A, thymine; T, cytosine; C, or guanine; G) contained in the biological sample of interest. A broader IUPAC definition code is also used to describe less precise base calls (see U.S. Provisional Application Ser. No. 60/590,931 filed on July 2, 2004 entitled “Resequencing Pathogen Microarray”, supplemental data, Appendix J “gdas_manual pdf” page 255). If the target sequence is sufficiently homologous to the appropriate tile region of the resequencing microarray (fewer than 1-2 base substitutions per 25 bases) then a complete resequencing of the target is possible. However, the hybridization to the tile region is interrupted when the target sequence contains insertions, deletions, or base substitutions at frequencies of greater than 2 substitutions per 25 bases of target sequence. This results in the “no [base] call” (N) being made from the corresponding sequences on the microarray tile region. N calls also result when the concentration of the target nucleic acid in solution is low or when there are interfering levels of competing background nucleic acids in the hybridization solution. Incomplete biological sequence information can also be generated by a number of other nucleic acid and protein sequencing technologies.
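By way of illustration only, the relationship between probe intensities and base calls described above may be sketched as follows. The sketch assumes that each tiled position yields one intensity per candidate base and that a call is made only when the strongest signal exceeds the runner-up by a chosen ratio; the function name call_base and the min_ratio threshold are hypothetical and do not represent the algorithm used by any particular array-specific software.

```python
def call_base(intensities, min_ratio=1.5):
    """Return 'A', 'C', 'G' or 'T' for one tiled position, or 'N' (no call).

    intensities: dict mapping each candidate base to its probe intensity.
    min_ratio:   hypothetical threshold; the strongest signal must be at
                 least this many times the second strongest to be called.
    """
    # Rank the four candidate bases from strongest to weakest signal.
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best_base, top), (_, second) = ranked[0], ranked[1]
    # Weak or poorly separated signals are reported as a no-call.
    if top <= 0 or top < min_ratio * second:
        return "N"
    return best_base


def call_sequence(positions, min_ratio=1.5):
    """Concatenate the calls for a list of per-position intensity dicts."""
    return "".join(call_base(p, min_ratio) for p in positions)
```

Under these assumptions, a position at which two probes give nearly equal signals, as may occur with low target concentration or competing background nucleic acids, is reported as N rather than as a specific base.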
The primary intended application of resequencing microarrays is to detect low probability single nucleotide polymorphisms (SNPs) or mutations within a limited range of target sequences. However, although not conventionally performed currently in industry, sequence output of the microarray can also be compared against sequence databases to allow identification of target sequences. The most prevalent comparison method, or similarity search algorithm, for sequence data currently in use is Basic Local Alignment Search Tool, commonly known as and referred to herein as “BLAST.” Numerous variants exist, including Washington University BLAST (WU-BLAST), NCBI-BLAST, FASTA, MPsrch, Scanps, and BestFit (Korf, Yandell & Bedell, 2003). Such comparisons generally yield a number of possible matches in terms of certainty (measured probabilistically) that the tested sample includes the matched biological subject for which a sequence is known. The sequence output by the intensity analysis of the microarray is then often compared to a database that includes known sequences of biological subjects which could include pathogenic microbes. However, one normally skilled in the art of molecular biology would not be capable of visually determining the best sequence sections from a tiled region containing A, C, T and G base calls punctuated and in some cases dominated by varying numbers of no-calls (N).
The use of microarrays for the purposes of genetic sequencing and identification has drastically increased the capability of even a single researcher to extract a large amount of sequence data from a biological sample for comparison against an even larger number of previously sequenced organisms and biological substances. However, the researcher is unable to utilize the information in a time-effective manner. Ambiguous results are also problematic for a researcher submitting sample sequences for comparison due to excessive wait times and poor (inconclusive or conflicting) results associated with attempts to match ambiguous subsequences. Accordingly, a widely-practiced method of obtaining more relevant results from sequence comparison is for a researcher to review sequence output searching for subsequences that appear to have a higher probability of returning a relevant result. In particular, many researchers often find themselves manually and subjectively selecting, or visually parsing, certain subsequences for comparison against those in the sequence database. As a result, a researcher expends time and resources for relatively slowly and subjectively optimizing the sequence data for submission to the similarity search. Thus, the current solution for the above-noted resource utilization problem leads to additional time and resource requirements demanded of the researcher. Moreover, as the current solution is subjective as well as time-intensive, the net gain with respect to facilitating the advancement (and acceleration) of genomic research is ambiguous at best.
However, as noted above, the vast repositories of known biological sequences are often contained in shared computing resources. These shared computing resources require vast amounts of data storage capacity, as well as robust and powerful tools with which to compare a submitted sequence to those contained within the database. As the amount of sequence data produced (and submitted) by researchers increases with the improvement and increasing availability of microarrays for general research use, the burden placed on shared databases (and associated systems) in terms of bandwidth and processing requirements increases dramatically. In other words, the increase in data made possible by widespread use of microarrays often leads to less efficient utilization of shared bioinformatics computing resources.
For example, if sequences containing a large percentage of ambiguous sequence data (Ns) are submitted, the sequence database's computing resources will be spent trying to find matches for inherently ambiguous sequences, resulting in all possible similarity search results with low certainty values. FIG. 1(a) is an exemplary flowchart illustrating a process that might currently be performed with methods available to the industry. In this example, nucleotide or amino acid sequence data 103 corresponding to a sequence of interest is submitted for comparison against a known sequence database using a similarity search 109.
The submitted sequence(s) 103, when compared to database records 109, might or might not return statistically significant or meaningful results. Here, by definition, to “compare” means to perform a similarity search of a query sequence against a database of sequence records using any one of a large number of algorithms for determining similarity (e.g. BLAST). Sequences that are said to be “comparable” have a sufficient degree of similarity to at least one sequence in a database to result in the return of at least one statistically significant (user defined) result. It is straightforward for an end user to visually identify and select contiguous stretches of nucleotide base calls (comprised of only A, T, C, or G residues) or amino acids that might be comparable. However, as the number or percentage of “Ns” contained within target sequences increases, it becomes exponentially more difficult for the end user to visually determine the comparability of either the entire sequence or subsequences within it.
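As a minimal sketch of the visual task described above, the contiguous stretches of unambiguous base calls in a nucleotide sequence may be located programmatically. The function name unambiguous_runs and the min_length parameter are hypothetical and are introduced here for illustration only.

```python
import re


def unambiguous_runs(sequence, min_length=1):
    """Return (start, end, run) for each maximal stretch of A/C/G/T calls.

    start and end are 0-based, end-exclusive offsets into the sequence.
    Runs shorter than min_length are discarded.
    """
    runs = []
    # Each regex match is a maximal run containing no N or other code.
    for match in re.finditer(r"[ACGT]+", sequence.upper()):
        if match.end() - match.start() >= min_length:
            runs.append((match.start(), match.end(), match.group()))
    return runs
```

For example, unambiguous_runs("ACGTNNACNNNGGGTTT", 3) returns the runs ACGT and GGGTTT and discards the two-base run AC. Such a listing locates the clean stretches but does not by itself indicate whether a stretch interrupted by a few Ns would still be comparable.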
The results 111 include high probability matches 111a, lower probability matches 111b, and a significant number of statistically insignificant results 111c that can be attributed to a chance match with the database. Ns are treated as “aNy” (wild card) characters by similarity search algorithms meaning the N could be any of the four base residues or gap when the default parameters are used. In the case of a resequencing DNA output, an N indicates the resequencing algorithm could not resolve the call and can correspond to any of the four base residues (A, T, C or G) or to empty space (Korf et al., 2003). In the case that too many non-calls (Ns) are included in the submitted sequence, then the similarity search (e.g. BLAST) will calculate E (expect) values higher than the acceptable E (expect) value (e.g. 1.0e−9) indicating the chances are greater that the returned sequence is not unique. Similarly, shorter sequences may have higher E values indicating their lack of use to the end user in determining the presence of unique DNA material. The results 111, including the numerous ambiguous results 111c, are then left to be analyzed 113 by the researcher.
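The dependence of the E value on alignment score and sequence length may be sketched using the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length, and S is the raw alignment score. The values of lambda and K below are illustrative assumptions for a simple nucleotide scoring scheme, and treating each matched base as contributing one unit of score is a simplification; the actual values used by a given similarity search depend on its scoring parameters.

```python
import math


def expect_value(raw_score, query_len, db_len, lam=1.374, k=0.711):
    """Karlin-Altschul expect value E = K * m * n * exp(-lambda * S).

    lam and k are illustrative assumptions, not values taken from the
    source; real values depend on the scoring scheme in use.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Hypothetical comparison against a database of 1e9 residues, assuming
# a score of +1 per matched base and no score for positions called N.
DB_LEN = 1e9
short_query = expect_value(raw_score=20, query_len=20, db_len=DB_LEN)
long_query = expect_value(raw_score=100, query_len=100, db_len=DB_LEN)
mostly_n = expect_value(raw_score=20, query_len=100, db_len=DB_LEN)
```

Under these assumptions, the 20-base query and the 100-base query containing 80 Ns both yield E values above a cutoff of 1.0e-9, whereas the fully called 100-base query falls well below it.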
In the case of FIG. 1(a), other users are shown submitting their sequences of base calls to the shared sequence database 109, which handles these additional requests for local alignment searches. As described above, the submission of ambiguous sequences by multiple users to a shared sequence alignment resource often results in available computing resources being spent to serve only a small number of sequence submissions.
FIG. 1(b) illustrates an alternative case, often found in practice in the industry, that is problematic with regard to researcher time consumption. In contrast to the previously illustrated case, the sequence data 103 is altered in a cut and paste operation 119 performed manually by a human researcher. More specifically, the human researcher often visually scans the raw data output and subjectively copies and pastes subsets of the raw data output 119 that appear to contain fewer “Ns” and submits these subjective selections 121 for comparison 109. However, as the selection of subsets is performed subjectively and repetitively for a large amount of raw data, the human-selected submissions 121 often include comparable 121a and non-comparable 121b data. Consequently, the results from the BLAST comparison 123 still include a wide array of possible matches, ranging from high probability matches 123a to low probability matches 123b, which are often caused by selections in which there are too many non-calls 123c, as opposed to the anticipated result of a low probability match caused by a less similar sequence match.
FIG. 1(c) is a schematic drawing of a general system layout for interaction with sequence database servers through computer terminals over a wired or wireless network 128. In some cases, the sequence database (and associated server) 127 is located remotely from a researcher's terminal 129. Alternatively, some facilities have built custom sequence databases 133 which are accessible through a local terminal 131. However, the above-noted problems with time and shared resource consumption are significant in either configuration, with a greater increase in time consumption at the public database level.
A variety of different factors can contribute to the inability of a resequencing DNA microarray to make non-ambiguous base calls. In pure target samples, the hybridization patterns necessary for base calling (Cutler et al., 2001; Kozal et al., 1996) are interrupted whenever a stretch of target sequence is sufficiently dissimilar from the probe sequences that are tiled on the microarray surface. This results in the introduction of N calls into the interrupted positions of the resequencing microarray output file. The same effect occurs when the target molecules are present in low concentration and/or when the target sample is not pure but contains varying amounts of other nucleic acid molecules that can bind non-specifically to the tiled probes with low affinity, resulting in a lowered signal-to-noise ratio of hybridization (fluorescence) signals across the probe sets. To illustrate how these factors can determine whether sequences are comparable or non-comparable data, FIG. 1(d) shows an example of a resequencing DNA microarray output file that results when incomplete hybridization occurs. In the illustrated case, the sequences 135 are in FASTA format; however, alternative sequence data formats are equally suitable, including, but not limited to, plain, EMBL, GCG, GenBank, and IG. Within the example sequence 136 are sequence subsets 140 (subsequences). Example subsequences 140 include a subsequence with an excessive number of non-calls (Ns) 137, a subsequence that is too short to return meaningful results from a similarity search such as BLAST 139, and a subsequence that is likely to produce a meaningful result 143. Additionally, multiple sequences are set off by aliases, located in the sequence header 138, referencing the probe tile set that is physically present on the microarray surface.
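The three kinds of subsequence distinguished above may be illustrated with a short sketch that reads FASTA-formatted output and labels a candidate subsequence. The function names, the min_length of 50 bases, and the max_n_fraction of 0.2 are hypothetical values chosen for illustration only and are not parameters taken from this disclosure.

```python
def parse_fasta(text):
    """Map each FASTA header (without the leading '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The header carries the alias of the probe tile set.
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def classify_subsequence(subseq, min_length=50, max_n_fraction=0.2):
    """Label a subsequence using hypothetical length and N-content limits."""
    if len(subseq) < min_length:
        return "too_short"
    if subseq.count("N") / len(subseq) > max_n_fraction:
        return "too_many_n"
    return "likely_comparable"
```

With these illustrative limits, an 80-base subsequence containing no Ns is labeled likely_comparable, a 4-base subsequence is labeled too_short, and a 70-base subsequence in which 30 positions are Ns is labeled too_many_n.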
Overall, the above-noted problems with the current state of industry practice are fundamentally related to researcher time consumption and shared resource allocation. More specifically, the increased amount of subsequence data obtained from samples results in rapid increases in the utilization of shared resources such as sequence comparison databases. Such rapid increases necessitate efficient use for supporting a growing community (in terms of researchers and data). With the aim of using shared resources more effectively, researchers are now often faced with the need to devote time and resources to subjectively and manually selecting sequence subsets for comparison.
As stated above, there is a critical need for advanced diagnostic systems that can rapidly detect both known and unanticipated sequences. More particularly, there remains a critical demand for DNA microarray techniques that reduce the need for human work input and increase the efficiency of shared resource utilization, especially in the case of shared similarity search databases and systems.
In addition to the above-described problems in the industry regarding more effective use of researcher and shared computing resources, the evolution of world events and the emergence of infectious disease and bioterrorism in mainstream society have led to a growing sentiment amongst the scientific community and lay people alike that new, rapid, and accurate techniques for threat identification and eradication must be developed. The concept of broad-spectrum pathogen identification has considerable and obvious appeal to both medical practice and national defense. It is within this framework that the present inventors have endeavored. Furthermore, there remains a need for more ready and robust determination of mixtures and recombinants in biological samples from biological sequence data, regardless of the source of the sequence data.