1. Technical Field
The present disclosure is directed to systems and methods for facilitating DNA spectral analysis and, more particularly, to systems and methods that employ image processing techniques and/or signal processing methods to automate and/or expedite, in whole or in part, the processing of DNA sequence data. According to exemplary embodiments of the present disclosure, systems and methods are provided to support one or more of the following DNA spectral analysis techniques: (i) comparative histogram methodologies; (ii) selection/classification using support vector machines and genetic algorithms; and (iii) spectrovideo methodologies based on spectrogram extractions from DNA sequence data.
2. Background Art
Bioinformatics seeks to organize tremendous volumes of biological data into comprehensible information which can be used to derive useful knowledge. In the field of bioinformatics, techniques for spectral analysis of DNA sequences have been developed. Spectral analysis techniques generally represent an improvement over manual DNA pattern analysis techniques which aim at identifying DNA patterns serving as biological markers related to important life processes. Traditionally, automatic analyses are performed directly on strings of DNA sequences composed of the four characters A, T, C and G, which represent the four nucleotide bases. However, due to the tremendous length of DNA sequences (e.g., the length of the shortest human chromosome is 46.9 Mb), the wide range of pattern spans associated with the limited character set, and the statistical nature of the problem, such an intuitive/manual approach is inefficient, if not impossible, for achieving the desired purpose.
DNA spectral analysis offers an approach to systematically tackle the problem of deriving useful information from DNA sequence data. Generally, DNA spectral analysis involves identifying the occurrences of each nucleotide base in a DNA sequence as an individual digital signal, and transforming each of the four nucleotide signals into the frequency domain. The magnitude of a frequency component can then be used to reveal how strongly a nucleotide base pattern is repeated at that frequency. A larger magnitude/value usually indicates a stronger presence of the repetition. To improve the readability of the results, the prior art discloses systems wherein each nucleotide base is represented by a color and the frequency spectra of the four bases are combined and presented as a color spectrogram. These techniques are described by:
    D. Anastassiou, “Frequency-Domain Analysis of Biomolecular Sequences,” Bioinformatics, Vol. 16, No. 12, December 2000, pp. 1073-1081; and
    D. Sussillo, A. Kundaje and D. Anastassiou, “Spectrogram Analysis of Genomes,” EURASIP Journal on Applied Signal Processing, Special Issue on Genomic Signal Processing, Vol. 2004, No. 1, January 2004, pp. 29-42.
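The frequency-domain idea described above can be illustrated with a minimal sketch. It is not taken from the cited publications: the toy sequence, the function names `indicator` and `dft_magnitudes`, and the use of a direct O(N²) transform (chosen only to keep the example free of external libraries) are all assumptions made for illustration.

```python
import cmath


def indicator(seq, base):
    """Digital signal for one base: 1 where `base` occurs, 0 elsewhere."""
    return [1 if ch == base else 0 for ch in seq]


def dft_magnitudes(signal):
    """Magnitudes |U[k]| of a direct O(N^2) DFT for k = 0 .. N//2."""
    n_samples = len(signal)
    mags = []
    for k in range(n_samples // 2 + 1):
        total = sum(
            value * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n, value in enumerate(signal)
        )
        mags.append(abs(total))
    return mags


# Hypothetical toy sequence: the motif "ACG" repeated 8 times (N = 24).
seq = "ACG" * 8
mags = dft_magnitudes(indicator(seq, "A"))

# Ignore k = 0, which only counts how many times the base occurs.
peak_k = max(range(1, len(mags)), key=lambda k: mags[k])
print(peak_k, len(seq) / peak_k)  # prints: 8 3.0
```

The strongest non-zero frequency is k = 8, which corresponds to a period of N/k = 3 bases, matching the length of the repeated motif.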
The translation of the magnitudes/values for nucleotide bases into a visual image, i.e., a spectrogram, is a powerful visualization tool for DNA analysis. The resultant pixel color is indicative of the relative intensity of the four bases at a particular frequency, and the representation of DNA sequences as color images allows patterns to be more easily identified by visual inspection. In general, the hue in a spectrogram region reflects its overall nucleotide composition, and bright lines and patches in a spectrogram reveal the existence of special repetitive patterns.
An algorithm or technique for the generation of DNA spectrograms can be summarized in five steps as follows.
    (i) Formation of binary indicator sequences (BISs) uA[n], uT[n], uC[n] and uG[n] for the four nucleotide bases. The BIS for a particular base takes the value of “1” at positions where the base exists and “0” otherwise. Thus, for an exemplary DNA sequence having the nucleotide sequence “AACTGGCATCCGGGAATAAGGTCT”, the BISs are as depicted in FIG. 1. Based on the foregoing exemplary DNA sequence, the BIS values may be plotted as depicted in FIG. 2.
    (ii) Discrete Fourier Transform (DFT) on BISs. The frequency spectrum of each base is then obtained by computing the DFT of its corresponding BIS using Equation (1):
$$U_X[k] = \sum_{n=0}^{N-1} u_X[n]\, e^{-j\frac{2\pi}{N}kn}, \quad k = 0, 1, \ldots, \lfloor N/2 \rfloor + 1, \quad X = A, T, C, \text{ or } G \qquad (1)$$
The sequence $U_X[k]$ provides a measure of the frequency content at frequency k, which is equivalent to an underlying period of N/k samples as depicted in FIG. 3.
    (iii) Mapping of DFT Values to RGB Colors. The four DFT sequences are reduced to three sequences in the RGB space by the following set of linear equations, collectively designated as Equation (2):
$$\begin{aligned}
X_r[k] &= a_r U_A[k] + t_r U_T[k] + c_r U_C[k] + g_r U_G[k] \\
X_g[k] &= a_g U_A[k] + t_g U_T[k] + c_g U_C[k] + g_g U_G[k] \\
X_b[k] &= a_b U_A[k] + t_b U_T[k] + c_b U_C[k] + g_b U_G[k]
\end{aligned} \qquad (2)$$
where $(a_r, a_g, a_b)$, $(t_r, t_g, t_b)$, $(c_r, c_g, c_b)$ and $(g_r, g_g, g_b)$ are the color mapping vectors for the nucleotide bases A, T, C and G, respectively. The resultant pixel color $(X_r[k], X_g[k], X_b[k])$ is thus a superposition of the color mapping vectors weighted by the magnitude of the frequency component of their respective nucleotide base as depicted in FIG. 4.
FIGS. 5 and 6 further illustrate the mapping of DFT values to colors according to exemplary embodiments of the present disclosure. With reference to FIG. 5, color vectors are selected for the nucleotide bases A, T, C and G, respectively. In selecting color vectors, it is generally desirable to improve and/or enhance the color contrast of the DNA features. Based on exemplary color vectors, the DFT values are combined in color space, as shown in FIG. 6. Alternative mapping techniques and/or protocols may be employed, e.g., DFT values may be mapped to Hue-Saturation-Value (HSV) space, YCrCb space, etc.
    (iv) Normalizing the Pixel Values. Before rendering the color spectrograms, the RGB values of each pixel are generally normalized such that they fall between 0 and 1. There are numerous ways to implement the normalization function. The simplest approach is to divide all values by the global maximum. However, such a one-step approach may degrade the overall color contrast of the image. A better method is to perform normalization at two levels. At the first level, all pixel values are divided by a statistical maximum, e.g., the overall mean plus one standard deviation, such that a majority of pixels then have RGB values between 0 and 1. At the second level, each remaining pixel with any RGB value greater than one is individually normalized by dividing its values by its local maximum max(xr, xg, xb). This two-level approach prevents the overall intensity of the image from being excessively reduced by the more extreme pixel values and, as a result, the color contrast of the spectrogram image can be better preserved. FIG. 7 presents exemplary normalized plots of the combined DFT values of FIG. 6.
    (v) Short-Time Fourier Transform (STFT). Up to now, only a single Discrete Fourier Transform (DFT) window has been considered.
For long DNA sequences, however, it may be necessary to repeat steps (i) to (iv) for DFT windows that are shifted along the sequence. This results in consecutive strips of color pixels, with each of the strips depicting the frequency spectrum of a local DNA segment. A DNA spectrogram is then formed by a concatenation of these strips. Exemplary spectrogram images are reproduced in FIGS. 8 and 9 hereto.
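Steps (i) to (v) can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited publications: the color mapping vectors in `COLOR_VECTORS` are hypothetical, Equation (2) is applied to DFT magnitudes, the k = 0 component is skipped, the standard deviation is the population standard deviation, and a direct O(N²) DFT stands in for an FFT so that the example needs only the standard library. All function names are illustrative.

```python
import cmath
import statistics

# Hypothetical (r, g, b) color mapping vectors; the text does not fix their values.
COLOR_VECTORS = {
    "A": (0.0, 0.0, 1.0),
    "T": (1.0, 0.0, 0.0),
    "C": (0.0, 1.0, 0.0),
    "G": (1.0, 1.0, 0.0),
}


def dft_magnitudes(signal):
    """Step (ii): |U[k]| for k = 1 .. N//2, computed with a direct O(N^2) DFT.

    Assumption: k = 0 is skipped because it only counts occurrences of the base.
    """
    n_samples = len(signal)
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * n / n_samples)
                for n, v in enumerate(signal)))
        for k in range(1, n_samples // 2 + 1)
    ]


def window_strip(window):
    """Steps (i)-(iii): unnormalized (r, g, b) pixels for one DFT window."""
    mags = {
        base: dft_magnitudes([1 if ch == base else 0 for ch in window])
        for base in COLOR_VECTORS
    }
    return [
        tuple(
            sum(COLOR_VECTORS[base][channel] * mags[base][k] for base in COLOR_VECTORS)
            for channel in range(3)
        )
        for k in range(len(window) // 2)
    ]


def normalize(strips):
    """Step (iv): two-level normalization of all pixel values."""
    values = [v for strip in strips for pixel in strip for v in pixel]
    stat_max = statistics.mean(values) + statistics.pstdev(values)
    normalized = []
    for strip in strips:
        row = []
        for pixel in strip:
            scaled = [v / stat_max for v in pixel]   # first level
            local_max = max(scaled)
            if local_max > 1.0:                      # second level, outliers only
                scaled = [v / local_max for v in scaled]
            row.append(tuple(scaled))
        normalized.append(row)
    return normalized


def spectrogram(seq, window_size, step):
    """Step (v): slide the DFT window along the sequence and stack the strips."""
    starts = range(0, len(seq) - window_size + 1, step)
    return normalize([window_strip(seq[s:s + window_size]) for s in starts])


image = spectrogram("AACTGGCATCCGGGAATAAGGTCT", window_size=12, step=4)
print(len(image), len(image[0]))  # prints: 4 6
```

For the 24-base exemplary sequence, a 12-base window shifted by 4 bases yields four strips of six pixels each, with every RGB value lying between 0 and 1 after normalization.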
It is noted that the set of equations designated as equations (8) in the publication by D. Anastassiou (“Frequency-Domain Analysis of Biomolecular Sequences,” Bioinformatics, Vol. 16, No. 12, December 2000, pp. 1073-1081) suggests that the order of steps (ii) and (iii) is reversible, i.e. it is possible to first reduce the four binary indicator sequences to three numerical sequences (xr, xg, xb) and then perform a Discrete Fourier Transform (DFT). This, however, needs further proof because the binary indicator sequences are not independent functions.
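The question of ordering can be explored numerically. The sketch below is an illustration rather than a proof, and it rests on assumptions: the weights in `RED_WEIGHTS` are hypothetical, only one color channel is computed, and a direct O(N²) DFT is used. It compares (a) combining the complex DFT values after the transform, (b) combining the indicator sequences before a single transform, and (c) combining the DFT magnitudes.

```python
import cmath

BASES = "ATCG"
# Hypothetical weights for a single color channel (e.g., the red components).
RED_WEIGHTS = {"A": 0.2, "T": 1.0, "C": 0.5, "G": 0.8}


def dft(signal):
    """Direct O(N^2) DFT returning all N complex coefficients."""
    n_samples = len(signal)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * n / n_samples)
            for n, v in enumerate(signal))
        for k in range(n_samples)
    ]


seq = "AACTGGCATCCGGGAATAAGGTCT"
bis = {b: [1.0 if ch == b else 0.0 for ch in seq] for b in BASES}
spectra = {b: dft(bis[b]) for b in BASES}

# (a) Transform each indicator sequence, then combine the complex spectra.
dft_then_map = [
    sum(RED_WEIGHTS[b] * spectra[b][k] for b in BASES) for k in range(len(seq))
]

# (b) Combine the indicator sequences into one numerical sequence, then transform.
x_r = [sum(RED_WEIGHTS[b] * bis[b][n] for b in BASES) for n in range(len(seq))]
map_then_dft = dft(x_r)

# (c) Transform each indicator sequence, then combine the magnitudes.
mag_then_map = [
    sum(RED_WEIGHTS[b] * abs(spectra[b][k]) for b in BASES) for k in range(len(seq))
]

complex_gap = max(abs(a - b) for a, b in zip(dft_then_map, map_then_dft))
magnitude_gap = max(abs(m - abs(b)) for m, b in zip(mag_then_map, map_then_dft))
print(complex_gap < 1e-9, magnitude_gap > 1e-6)  # prints: True True
```

Orders (a) and (b) agree to within rounding error because the DFT is linear. Order (c) differs, so whether the steps can be swapped depends on whether the color mapping is applied to complex values or to magnitudes.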
The appearance of a spectrogram is very much affected by the choice of the Short-Time Fourier Transform (STFT) window size, the length of the overlapping sequence between adjacent windows, and the color mapping vectors. Basically, the window size determines the effective range of a pixel value in a spectrogram. A larger window results in a spectrogram that reveals statistics collected from longer local DNA segments and may be useful in identifying wider patterns. In general, the window size should be made several times larger than the length of the repetitive pattern of interest and smaller than the size of the region that contains the pattern. The window overlap determines the length of the DNA segment common to two adjacent STFT windows. Therefore, the larger the overlap, the more gradual the transition of the frequency spectrum from one STFT window to the next. Smaller window intervals yield higher image resolutions, thereby making it easier to extract features by image processing or visual inspection. However, smaller intervals also generally demand more computational resources.
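The trade-off between overlap and computational cost can be made concrete with a short sketch. The function name `window_starts` and the figures used (a 10,000-base segment and a 120-base window) are hypothetical and chosen only for illustration.

```python
def window_starts(seq_len, window_size, overlap):
    """Start offsets of STFT windows that share `overlap` bases with their neighbor."""
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must satisfy 0 <= overlap < window_size")
    step = window_size - overlap  # the window interval
    return list(range(0, seq_len - window_size + 1, step))


# Hypothetical figures: a 10,000-base segment analyzed with a 120-base window.
for overlap in (0, 60, 119):
    starts = window_starts(10_000, 120, overlap)
    print(overlap, len(starts))
# prints:
# 0 83
# 60 165
# 119 9881
```

Each window produces one strip of the spectrogram, so moving from no overlap to the maximum overlap raises the number of DFTs from 83 to 9,881 for the same segment.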
With reference to U.S. Pat. No. 6,287,773 to Newell, a method for detecting known blocks of functionally aligned protein sequences in a test nucleic acid sequence, e.g., in an uncharacterized EST, is disclosed. The Newell '773 method involves: (a) reverse translating the set of protein sequences to a set of functionally aligned nucleic acid sequences using codon-usage tables and creating a profile from the set of functionally aligned nucleic acid sequences; (b) constructing a first indicator function (adenine) for the profile; (c) constructing a second indicator function (adenine) for the test nucleic acid sequence; (d) computing the Fourier transform of each of the indicator functions; (e) complex conjugating the Fourier transform of the second indicator function; (f) multiplying the Fourier transform of the first indicator function and the complex conjugated Fourier transform of the second indicator function to obtain a Fourier transform of the number of matches of adenine bases; (g) repeating steps (b)-(f) for guanine, thymine, and cytosine; (h) summing the Fourier transforms of the number of matches for each base, respectively, to obtain the total Fourier transform; (i) computing the inverse Fourier transform of the total Fourier transform to obtain a complex series; and (j) taking the real part of the series to determine the total number of base matches for the variety of possible lags of the profile relative to the test sequence. The first indicator function allows the value at a given position to be continuous between 0 and 1 as a function of the percentage presence of adenine at a particular position. The method can then detect the presence of known blocks of functionally aligned protein sequences in a test nucleic acid sequence based on the total number of base matches for the variety of possible lags, i.e., to facilitate sequence matching.
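Steps (b) through (j) of the method summarized above amount to a cross-correlation computed in the frequency domain, which can be sketched as follows. This is a simplified illustration, not the Newell '773 implementation: both sequences use binary indicator functions (the continuous-valued profile and the reverse-translation step (a) are omitted), the sequences `profile` and `test` are hypothetical, the inputs are zero-padded to avoid circular wrap-around, and a direct O(N²) DFT replaces an FFT to keep the example self-contained.

```python
import cmath

BASES = "ATCG"


def dft(signal, inverse=False):
    """Direct O(N^2) DFT; the inverse form includes the 1/N factor."""
    size = len(signal)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * k * m / size)
            for m, v in enumerate(signal))
        for k in range(size)
    ]
    return [v / size for v in out] if inverse else out


def match_counts(profile, test):
    """Base matches per circular lag; index (-d) % size aligns profile[0] with test[d]."""
    size = len(profile) + len(test)  # zero-padding keeps distinct lags separate
    total = [0j] * size
    for base in BASES:
        first = [1.0 if ch == base else 0.0 for ch in profile]
        second = [1.0 if ch == base else 0.0 for ch in test]
        f_first = dft(first + [0.0] * (size - len(first)))
        f_second = dft(second + [0.0] * (size - len(second)))
        for k in range(size):
            # Multiply by the complex conjugate and sum over the four bases.
            total[k] += f_first[k] * f_second[k].conjugate()
    series = dft(total, inverse=True)
    return [round(v.real) for v in series]  # real part = match counts


# Hypothetical sequences for illustration only.
profile = "GATTACA"
test = "CCGATTACATT"
counts = match_counts(profile, test)
size = len(counts)

offsets = range(len(test) - len(profile) + 1)
best = max(offsets, key=lambda d: counts[(-d) % size])
print(best, counts[(-best) % size])  # prints: 2 7
```

The highest count, seven matches, occurs when the profile is placed at offset 2 of the test sequence, which is where the profile appears verbatim.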
Despite efforts to date, a need remains for systems and methods that facilitate expeditious visualization of genomic information. In addition, a need remains for systems and methods that facilitate identification of repetitive DNA patterns, e.g., CpG islands, Alu repeats, non-coding RNAs, tandem repeats and various types of satellite repeats. A need remains for tools that can identify structurally or compositionally similar patterns that exhibit similar spectral properties. Such tools are to be contrasted with sequence alignment tools that seek to align sequences in linear order of nucleotide appearance. Still further, a need remains for systems and methods for facilitating rapid, full-scale analysis of spectral images using supervised and/or unsupervised machine learning techniques. Moreover, a need remains for systems and methods for increasing the resolution of spectral image sequences, e.g., to permit rapid visualization of an entire genome at a desired resolution. These and other needs are met by the systems and methods disclosed herein.