1. Field:
This invention relates generally to signal processing analysis of chromatographic migration patterns such as are commonly used in chemistry and in biology to analyze mixtures of molecules, and further to such analysis applied to the determination of DNA sequences.
2. State of the Art:
Electrophoretic migration patterns may be visualized by a number of different techniques. However, most of these techniques depend upon an electromagnetic radiation-emitting tag attached to the molecule(s) of interest. Such a tag may be a radioactive isotope emitting X-ray photons, beta particles, alpha particles, etc.; a fluorescent tag emitting photons of UV, visible or infrared light; or the like. The emissions of these tags are generally detected by photographic film or by a detector sensitive to the emission, and converted into a visual image indicating the amount of label migrating in different regions of the electrophoretic medium. The electrophoretic medium is often a porous gel on which a film may be overlaid or which may be scanned by a detector. More recently, methods for performing electrophoresis in a liquid medium have been developed; here the tag is usually detected as the molecule carrying it passes adjacent to a fixed detector.
Whatever the method of visualization, the resulting image depicts the amounts of tagged molecules migrating at different linear positions in the electrophoretic medium. The methods described in this application are suitable for analysis of migration patterns obtained by any of the foregoing means.
At present, the analysis of DNA sequences of large DNA segments is particularly important in connection with the Human Genome Project, as well as for many other research and industrial purposes. Well-known biochemical methods for sequencing DNA involve electrophoresing four replicate sets of DNA fragments generated from the DNA molecule being sequenced. Each set contains a series of labelled fragments varying in length and terminating in a single respective one of the four standard DNA nucleotides A (adenine), T (thymine), G (guanine) and C (cytosine). That is, all of the tagged fragments in one sample terminate in A residues, in another sample all fragments terminate in T residues, etc. The samples are electrophoresed in one dimension and the migration patterns visualized by one of the methods mentioned hereinabove. The result is a pattern of four lanes in which each lane has a series of bands corresponding to the positions of fragments terminating in a particular one of the four bases adenine, thymine, guanine and cytosine.
The biochemical portions of these methods have been automated and thus greatly speeded up. However, a serious bottleneck remains in the determination of the sequence from the set of four ladders. Most often, a skilled individual reads the sequence by aligning the lanes of the four samples and judging which band images are "true" bands representing a tagged fragment, and which are due to noise or to overlapping of small features resulting from noise or variations in the biochemical portion of the assay. This process is tedious, time-consuming, and not as accurate as is desired: a skilled human reader requires at least 2 hours to analyze a film containing sequences totalling about 5000 nucleotides. For comparison, the human genome is estimated to require reading of an absolute minimum of about one to two million films, or two to four million man-hours. The error rate even for skilled readers is generally above 1%, which is unacceptably high.
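The man-hour estimate above follows directly from the figures quoted in the text. A minimal Python sketch of that arithmetic, using only those figures (2 hours per film, about 5000 nucleotides per film, one to two million films); the function and variable names are illustrative assumptions:

```python
# Figures taken from the text above; names are illustrative only.
HOURS_PER_FILM = 2            # "at least 2 hours" per film
NUCLEOTIDES_PER_FILM = 5000   # "about 5000 nucleotides" per film


def reading_hours(films, hours_per_film=HOURS_PER_FILM):
    """Lower-bound reader time, in man-hours, for a number of films."""
    return films * hours_per_film


def nucleotides_read(films, per_film=NUCLEOTIDES_PER_FILM):
    """Total nucleotides read across the given number of films."""
    return films * per_film


# "about one to two million films"
low_films, high_films = 1_000_000, 2_000_000

low_hours = reading_hours(low_films)     # 2,000,000 man-hours
high_hours = reading_hours(high_films)   # 4,000,000 man-hours

print(f"{low_hours:,} to {high_hours:,} man-hours")
print(f"{nucleotides_read(low_films):,} to "
      f"{nucleotides_read(high_films):,} nucleotides read")
```

Because the 2-hour figure is stated as a minimum, the result is a lower bound on reading time.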
It is highly desirable to analyze the patterns automatically by computer. The visualized patterns can readily be digitized to provide a signal that can be analyzed by signal processing technology and/or by computer. However, several problems complicate the analysis. First, the spacing of bands produced by fragments differing in length by a single nucleotide tends to change with the size of the fragment. There may also be differences in the spacing of bands among the four lanes. Additionally, since detection of these bands is essentially detection of the label by means of electromagnetic radiation, there is a spread or dispersion due to the stochastic nature of the electromagnetic radiation emissions. There is background noise resulting both from factors in the biochemical technique and from the detection of electromagnetic radiation, which produces a generally low but variable pattern of visual darkening or visual signal over the lane. The general intensity of labelling often varies between the four lanes, and there is furthermore a tendency for bands within a given lane to vary in relative intensity in an unpredictable manner.
All of the above factors serve to confuse the identification of individual peaks and the correct ordering of the peaks, leading to errors in the determined sequence. Furthermore, all of these factors vary from one quartet of ladders to the next, so that it is not possible to determine a single blurring "noise characterization" which may be applied to any quartet of visualized band patterns. Instead, the visual images are commonly interpreted by a skilled human reader.
To compensate for the differing distances between fragments which differ by one base in length, the reader commonly makes some comparative measurement of the spacing between fragments at two or more points along the vertical length of the lane and then enters this into the computer. Similarly, the background level in different regions of each lane of a quartet must be determined and entered into the computer. These values apply only to the single image being analyzed at that time and may not be useful for analyzing a different image. Thus, a great deal of operator input and tedious work is required simply to set up the visual image for the computerized analysis, slowing the process.
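To illustrate why such operator-entered values are specific to one image, the following Python sketch shows one way a program might use spacing measurements taken at two or more points along a lane. The text does not say how the computer uses the entered values; the linear interpolation, the function name and the sample numbers here are assumptions made purely for illustration.

```python
def interpolate_spacing(position, measurements):
    """Estimate band spacing at a lane position.

    measurements: list of (lane_position, spacing) pairs entered by the
    operator, with distinct positions. Linear interpolation between
    neighbouring points is an assumption, not a method from the text.
    Positions outside the measured range use the nearest measurement.
    """
    pts = sorted(measurements)
    if position <= pts[0][0]:
        return pts[0][1]
    if position >= pts[-1][0]:
        return pts[-1][1]
    for (x0, s0), (x1, s1) in zip(pts, pts[1:]):
        if x0 <= position <= x1:
            t = (position - x0) / (x1 - x0)
            return s0 + t * (s1 - s0)


# Hypothetical operator measurements for ONE image: spacing of
# 12 units at position 100 and 4 units at position 500.
operator_points = [(100, 12.0), (500, 4.0)]

mid_spacing = interpolate_spacing(300, operator_points)
print(mid_spacing)
```

A different film would need its own `operator_points`, which is the per-image setup burden described above.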
Therefore, a need remains for a method of computerized analysis of the visual images of DNA sequence ladders which does not require an operator to input or determine specific parameters for each individual autoradiograph, and which can rapidly and accurately produce sequences for a number of different autoradiographs. Furthermore, because the amount of data to be analyzed for any one sequence is substantial (for example, it may consist of several thousand bases), it is desirable to have such a method which will utilize or conform to the most rapid methods of signal processing, thus reducing the amount of computer time required for the analysis and saving both time and money.