This invention relates in general to DNA data processing and in particular to an algorithm for reducing cross-talk between DNA data streams.
The structural analysis of DNA has an increasingly important role in modern molecular biology and is needed to support many research programs, including searching for clues to certain diseases. Accordingly, extensive research into DNA structure is ongoing. One of the most complex programs is the Human Genome Project which has the goal of determining the content of human DNA.
DNA is a nucleic acid consisting of chains of nucleotide monomers, or oligomers, that occur in a specific sequence. The structural analysis of DNA involves determining the sequence of the oligomers. Currently, DNA sequencing begins with the separation of a DNA segment into DNA fragments comprising a stochastic array of the oligomers. The separation involves electrophoresis in DNA sequencing gels, such as denaturing polyacrylamide gels. One of two methods is typically used for the electrophoresis: either a chemical method is used that randomly cleaves the DNA segment, or dideoxy terminators are used to halt the biosynthesis process of replication.
Each of the oligomers in the resulting stochastic array terminates in one of four identifying nitrogenous bases that are typically referred to by a letter. The bases are: adenine (A), cytosine (C), guanine (G) and thymine (T). Thus, the sequencing of the DNA can be accomplished by identifying the order of the bases A, C, G and T. This process is often referred to as “base calling”. However, DNA is extremely complex. For example, there are 3.1 billion biochemical letters in human DNA that spell out some 50,000 genes. Accordingly, automated base calling is highly desirable.
One method of automated base calling involves fluorescence detection of the DNA fragments. A schematic drawing of an apparatus for fluorescence detection is shown generally at 10 in FIG. 1. The apparatus 10 includes an upper buffer reservoir 12 connected to a lower buffer reservoir 14 by a gel tube 16. The gel tube 16 is formed from glass or quartz and has an inside diameter within the range of one to two mm. A detector 18 is mounted near the bottom of the tube 16. The detector 18 monitors the gel passing through the tube 16 and transmits the data to a computer 20.
The chemical method described above is used to separate a DNA segment into its base oligomers. A different colored fluorophore dye is used for each of the chemical reactions for the bases A, C, G and T. One of the fluorophore dyes attaches to each of the oligomers as a marker. The reaction mixtures are recombined in the upper reservoir 12 and co-electrophoresed down the gel tube 16. As the fluorophore dye labeled DNA fragments pass by the detector 18, they are excited by an argon ion laser that causes the dye to fluoresce. The dye emits a spectrum of light energy that falls within a range of wavelengths. A photo-multiplier tube in the detector 18 scans the gel and records data for the spectrum for each of the dyes. The resulting fluorescent bands of DNA are separated into one of four channels, each of which corresponds to one of the bases. The real time detection of the bases in their associated channels is transferred to the computer 20 which assembles the data into the sequence of the DNA fragment.
FIG. 2 illustrates an ideal data stream generated by the apparatus 10. As shown in FIG. 2, a color is associated with each of the four bases, with green identifying A; blue, C; black, G; and red, T. The data in each of the channels is shown as a horizontal line with the detection of a base appearing in real time as a pulse. The resulting time sequence of pulses received, and hence the DNA sequence, is shown as the top line in FIG. 2. However, the actual data stream differs from the ideal data stream because of several factors. First of all, the emission spectra of the different dyes overlap substantially. Because of the overlap, peaks corresponding to the presence of a single fluorophore dye can be detected in more than one channel. Additionally, the different dye molecules impart non-identical electrophoretic mobilities to the DNA fragments. Furthermore, as the photo-multiplier tube in the detector 18 scans the gel, data detection does not occur at the same time for the four signals. Finally, imperfections of the chemical separation method can result in substantial variations in the intensity of bands in a given reaction. Thus, a set of typical actual raw data streams is shown in FIG. 3. The notations along the vertical axis in FIG. 3 refer to wavelengths for the detected colors. As in FIG. 2, four data streams are shown with each data stream corresponding to one of the base identifiers, as indicated by the letters in parentheses.
As illustrated by the flow chart shown in FIG. 4, it is known to enhance the raw data streams by a series of operations following the sampling of the DNA data in functional block 32. First, in functional block 34, high frequency noise is removed with a low-pass Fourier filter. Typically, each of the four data streams has a different base line level that varies slowly over time. These variations are corrected by passing the data through a high-pass Fourier filter in functional block 35.
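The two Fourier filtering steps of functional blocks 34 and 35 can be illustrated with a minimal sketch. It assumes a naive discrete Fourier transform written out in plain Python and hard frequency cutoffs expressed in cycles per record; the function names and cutoff values are hypothetical and are not taken from the known method, which does not specify them.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform (kept dependency-free for the sketch)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse transform; returns the real part, assuming a conjugate-symmetric spectrum."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def _keep_bins(samples, keep):
    """Zero every frequency bin for which keep(frequency_index) is false."""
    n = len(samples)
    spectrum = dft(samples)
    # Bin k and bin n - k hold the same physical frequency, so treat them together.
    filtered = [spectrum[k] if keep(min(k, n - k)) else 0 for k in range(n)]
    return idft(filtered)


def low_pass(samples, cutoff):
    """Remove high-frequency noise (components above `cutoff` cycles per record)."""
    return _keep_bins(samples, lambda freq: freq <= cutoff)


def high_pass(samples, cutoff):
    """Remove the slowly varying base line (components below `cutoff` cycles per record)."""
    return _keep_bins(samples, lambda freq: freq >= cutoff)


# Hypothetical single-channel record: constant offset + slow peaks + fast noise.
n = 64
peaks = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
raw = [5.0 + peaks[t] + 0.5 * math.cos(2 * math.pi * 20 * t / n) for t in range(n)]
cleaned = high_pass(low_pass(raw, cutoff=10), cutoff=1)
```

In this toy record the offset and the fast component are removed and only the slow peak component remains. A production implementation would more likely use a fast Fourier transform and smoother cutoffs.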
The data streams are corrected with respect to signal strength, or magnitude, in functional block 36. This process is referred to as baseline adjustment. The data signal in each of the four channels is divided into a number of windows, with each of the windows including approximately 30 signal peaks. The minimum signal strength is determined within each of the windows. A succession of segments is constructed connecting the consecutive minimum signal strengths. The absolute minimum is determined for the consecutive segments. The minimum in each segment is then set to zero, and the non-minimum points in the segment are adjusted by subtracting the difference between the absolute minimum and the minimum value for the segment.
Next, a multicomponent analysis, or data filtering, is performed on each set of four data points, as shown in functional block 38. The filtering determines the amount of each of the four dyes present in the detector as a function of time. After filtering, the mobility shift introduced by the dyes is corrected in functional block 40 with empirically determined correction factors. Following this, the peaks present in the data are located in functional block 42. The application of the above series of operations to the raw data streams shown in FIG. 3 results in processed data streams in functional block 44 where the DNA sequence is read. The processed data streams are shown in FIG. 5. The corresponding DNA sequence is shown below the processed data streams in FIG. 5 and consists of the sequential combination of the four processed data streams A, T, G and C.
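The peak location and sequence reading of functional blocks 42 and 44 can be sketched as follows. This is a simplified illustration that assumes a peak is any local maximum above a fixed threshold; the helper names, the threshold, and the synthetic pulse data are hypothetical.

```python
def find_peaks(stream, threshold):
    """Return sample indices that are local maxima above `threshold`."""
    return [
        t
        for t in range(1, len(stream) - 1)
        if stream[t] > threshold and stream[t - 1] < stream[t] >= stream[t + 1]
    ]


def read_sequence(streams, threshold):
    """Merge the peaks of all channels in time order and return the base letters."""
    calls = [
        (t, base)
        for base, stream in streams.items()
        for t in find_peaks(stream, threshold)
    ]
    return "".join(base for _, base in sorted(calls))


def pulses(length, centers):
    """Hypothetical test signal: a small triangular pulse at each center index."""
    stream = [0.0] * length
    for c in centers:
        stream[c - 1], stream[c], stream[c + 1] = 0.5, 1.0, 0.5
    return stream


# One processed data stream per base, keyed by the base letter.
streams = {
    "A": pulses(22, [2, 14]),
    "C": pulses(22, [6]),
    "G": pulses(22, [10]),
    "T": pulses(22, [18]),
}
sequence = read_sequence(streams, threshold=0.75)  # "ACGAT"
```

The sequence is simply the time-ordered combination of the peaks found in the four processed streams, which is why residual cross-talk peaks in the wrong channel lead directly to miscalled bases.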
For the data processing described above, it is assumed that the transformation from raw data to filtered data is linear in order to develop the filter for removing the cross-talk. Assuming a linear transformation, the filtering step, shown in functional block 38 in FIG. 4, utilizes a transformation matrix, M, and involves a multi-component analysis that is embodied in the matrix M. With a multi-component analysis, the relationship between the measured signals $s_i$ and the actual fluorescence intensities $f_j$, with i, j = 1, 2, 3 and 4, is given by the relationship:
$$ s_i = \sum_{j=1}^{4} m_{i,j} \cdot f_j , $$
where $m_{i,j}$ is a constant coefficient indicating the cross talk between intensity signals i and j. Writing the above relationship in matrix form results in:
$$ s = M \cdot f , $$
where s and f are vectors with four elements and M is a 4×4 matrix.
Typically, the transformation matrix M is determined by a conventional method that includes an iterative process in which known raw data streams are processed through the matrix M and the matrix coefficients adjusted to provide the best signal separation possible for the data streams. The adjustment of the coefficients of the transformation matrix M is necessary because the data transformation is actually non-linear in nature.
To determine the actual intensities of the fluorescence, the matrix M is used to deconvolute the measured signals s into the actual fluorescence f by the following relationship:
$$ f = M^{-1} \cdot s $$
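The linear model and its inversion can be illustrated numerically. The sketch below assumes a hypothetical 4×4 cross-talk matrix whose values are invented for illustration, and it recovers f by solving the linear system with Gaussian elimination rather than forming the inverse explicitly; both give the same result for a non-singular M.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        # Move the row with the largest pivot candidate into position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


# Hypothetical cross-talk matrix: row i is detection channel i, column j is dye j.
M = [
    [1.00, 0.30, 0.05, 0.00],
    [0.25, 1.00, 0.30, 0.05],
    [0.05, 0.25, 1.00, 0.30],
    [0.00, 0.05, 0.25, 1.00],
]

f_true = [0.0, 10.0, 0.0, 0.0]   # only the second dye is present
s = mat_vec(M, f_true)           # measured signal leaks into the other channels
f_recovered = solve(M, s)        # deconvoluted intensities, close to f_true
```

Although only one dye is present, the measured vector s is non-zero in three channels; solving the system attributes the whole signal back to the single dye.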
In addition to the non-linearity of the data transformation, use of the transformation matrix M requires that the baseline adjustment of the data be applied to the data streams before filtering the data. The baseline adjustment is necessary because, as described above, the baseline within each fluorescent signal collected at the four different wavelengths typically varies with time. Also, each signal can have a different signal level. The algorithm typically used for the baseline adjustment first divides the entire data sequence in each channel into a number of windows. The baseline adjustment algorithm then finds a minimum value within each of the windows and constructs a line connecting the minimum values for each channel. Finally, the line connecting the minimum values is subtracted from the raw data at each data point in each channel. Unfortunately, the baseline adjustment can result in loss of information contained in the raw data and distort the signals. To regain the original data, additional steps, such as a Fourier-based filter for adjusting the base line or even a baseline cutoff, are required. This adds complexity to the data processing. Accordingly, it would be desirable to both compensate for the non-linear nature of the cross-talk filtering process and to eliminate the baseline adjustment of the raw data.
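The window-minimum baseline adjustment described above can be sketched for a single channel as follows. The sketch assumes that the line connecting the minima is piecewise linear and is held constant before the first and after the last minimum; the function names, window size, and synthetic data are hypothetical.

```python
def window_minima(signal, window):
    """Return (index, value) of the minimum sample in each consecutive window."""
    anchors = []
    for start in range(0, len(signal), window):
        stop = min(start + window, len(signal))
        t_min = min(range(start, stop), key=lambda t: signal[t])
        anchors.append((t_min, signal[t_min]))
    return anchors


def baseline_at(t, anchors):
    """Piecewise-linear line through the minima, held constant beyond the ends."""
    if t <= anchors[0][0]:
        return anchors[0][1]
    if t >= anchors[-1][0]:
        return anchors[-1][1]
    for (t0, v0), (t1, v1) in zip(anchors, anchors[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def baseline_adjust(signal, window):
    """Subtract the line connecting the window minima from every data point."""
    anchors = window_minima(signal, window)
    return [value - baseline_at(t, anchors) for t, value in enumerate(signal)]


# Hypothetical channel: drifting base line with one sharp peak per window.
raw = [2.0 + 0.1 * t + (5.0 if t % 10 == 5 else 0.0) for t in range(40)]
adjusted = baseline_adjust(raw, window=10)
```

In this example the drift is removed exactly between the first and last minima, but the samples after the last minimum keep a residual slope, a small instance of the distortion that this kind of adjustment can introduce.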
This invention relates to an algorithm for reducing cross-talk between DNA data streams.
The present invention is directed toward a multi-component analysis that is applied to the difference of the signal intensity on each of the four channels. This is done before any baseline adjustment of the raw data. Instead, baseline adjustment occurs after the raw data has been filtered. The present invention also adds an additional processing step to account for the non-linear nature of the cross talk filtering. The additional processing step includes combining the signals with their derivatives and accounts for the correlation of each of the data signals with the other three data signals.
The present invention contemplates a method for enhancing DNA raw data that includes providing an apparatus for collecting DNA data from dye-labeled DNA fragments, the DNA data being divided between a plurality of channels. The DNA data is passed through a first filter to reduce any cross-talk between data contained in the channels. The data is then passed through a second filter to reduce any non-linearity remaining after the first filtering process has been applied.
The reduction of cross talk between the channels includes determining difference values for the signals in each channel by subtracting the magnitudes of the signals in each channel at two consecutive sampling instants. A first multi-component analysis is applied to the difference values to deconvolute the data contained in the signals. The first multi-component analysis includes multiplying the data by a constant coefficient transformation matrix M.
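A short sketch may help show why applying the multi-component analysis to difference values removes the need for a prior baseline adjustment. It assumes that each channel carries a constant offset, that the unmixing uses the inverse of the cross-talk matrix as in the relationship f = M⁻¹·s, and, for brevity, a two-channel toy system instead of four channels. All names and numerical values are hypothetical.

```python
def apply_matrix(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def consecutive_differences(samples):
    """Difference of the channel vectors at two consecutive sampling instants."""
    return [
        [cur - prev for cur, prev in zip(samples[t], samples[t - 1])]
        for t in range(1, len(samples))
    ]


def inverse_2x2(m):
    """Closed-form inverse, sufficient for the two-channel toy example."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def unmix_differences(samples, unmixing):
    """Apply the unmixing matrix to the difference values, not to the raw samples."""
    return [apply_matrix(unmixing, d) for d in consecutive_differences(samples)]


M = [[1.0, 0.5], [0.25, 1.0]]    # hypothetical cross-talk matrix
offset = [100.0, 50.0]           # hypothetical constant base line per channel
f = [[0.0, 0.0], [4.0, 0.0], [8.0, 0.0], [4.0, 0.0],
     [0.0, 6.0], [0.0, 12.0], [0.0, 6.0]]

# Measured samples: mixed intensities plus the unadjusted base line.
s = [[m + b for m, b in zip(apply_matrix(M, ft), offset)] for ft in f]
df_recovered = unmix_differences(s, inverse_2x2(M))
```

Because the offset is identical at consecutive sampling instants, it cancels in the subtraction, and the recovered values match the differences of the true intensities without any baseline adjustment of the raw data.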
The second filtering process to reduce the non-linearity remaining after the first filtering process includes determining derivative values for the signals obtained from the cross talk reduction filter. A multi-component analysis is applied to the derivative values to remove non-linear effects remaining after the first filtering process and the resulting data is then reconstructed to obtain the signal intensity. Similar to the first filter, the second multi-component analysis includes multiplying the data by a constant coefficient matrix T.
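The second filtering process can be sketched under several assumptions that the summary does not settle: derivative values are taken as differences between consecutive samples, the constant coefficient matrix T is applied to the derivative values at each sampling instant, and the signal intensity is reconstructed by a running sum starting from the first sample. The function names and the coefficients of T are hypothetical.

```python
def derivative(stream):
    """Finite-difference derivative between consecutive samples (an assumed choice)."""
    return [stream[t] - stream[t - 1] for t in range(1, len(stream))]


def reconstruct(first_value, derivative_values):
    """Rebuild the intensity by a running sum of the derivative values."""
    intensity = [first_value]
    for d in derivative_values:
        intensity.append(intensity[-1] + d)
    return intensity


def second_filter(channels, t_matrix):
    """Apply T to the per-channel derivatives, then reconstruct each channel."""
    derivs = [derivative(channel) for channel in channels]
    count = len(channels)
    steps = len(derivs[0])
    filtered = [
        [sum(t_matrix[i][j] * derivs[j][t] for j in range(count)) for t in range(steps)]
        for i in range(count)
    ]
    return [reconstruct(channel[0], row) for channel, row in zip(channels, filtered)]


# Hypothetical output of the first filter: channel 0 still follows channel 1.
channels = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 2.0, 4.0, 2.0, 0.0],
]
T = [[1.0, -0.5], [0.0, 1.0]]    # hypothetical coefficients
corrected = second_filter(channels, T)
```

With these invented coefficients the residual peak that tracks channel 1 is removed from channel 0, while channel 1 is returned unchanged.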