The invention relates to a novel optical feature extractor and a method for symbolically encoding DNA bases to permit detection of a class of DNA sequences based on their symmetry.
The degree of computational parallelism available using optics has generated great interest in optical approaches to pattern recognition and computation in general. The performance of optical pattern recognition techniques has in some cases been quite remarkable, while in others the results have been less than satisfactory. The level of performance often depends on how well suited a problem is to an optical approach or architecture. For example, problems in image interpretation may be complicated by scale, rotation, and perspective variabilities or distortions that reduce the degree of correlation between reference objects and images under examination. Assembly line inspection, optical character recognition of printed text, and other problems in which image content and structure are constrained are more tractable due to the reduced number of degrees of freedom.
In some cases transformations exist that allow the offending properties of images to be removed (e.g., use of the Mellin transform to convert scale-variant objects to translation-variant features). In others the investigator is able to exploit the structure or representation of a problem by using optical methods. DNA sequence analysis is one such problem: its structure and representation lend themselves to optical methods.
The sequence of bases along each strand of DNA forms the fundamental genetic information in each individual and organism. While only 4 types of subunits are used in DNA (denoted by the symbols A, C, G, and T in FIG. 1), the number of different possible sequences of length N is 4^N. As the length N of the human genome is approximately 3×10^9, this allows for a rather large number of possible human DNA sequences. An important structural element of DNA is that the sequences on each of the two strands of which DNA is composed are complementary; an A base on one strand is hydrogen bonded to a T base on the opposite strand of the double helix. Likewise, a G on one strand indicates a C on the opposite strand at the same position along the DNA. Thus, knowledge of the sequence of one strand immediately defines the sequence along the complementary strand.
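The complementarity rule and the sequence count described above can be illustrated with a short software sketch. It is offered only as an illustration; the names `COMPLEMENT`, `complementary_strand`, and `sequence_count` are hypothetical and are not part of the invention.

```python
# Base pairing as described in the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complementary_strand(strand: str) -> str:
    """Return the base found opposite each position of `strand`.

    The result is listed position by position (same left-to-right order
    as the input), not reversed.
    """
    return "".join(COMPLEMENT[base] for base in strand)


def sequence_count(n: int) -> int:
    """Number of distinct sequences of length n over the 4-letter alphabet."""
    return 4 ** n


print(complementary_strand("GATTACA"))  # CTAATGT
print(sequence_count(6))                # 4096
```

Knowing one strand therefore fixes the other, and even for length 6 there are 4096 candidate sequences.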
In order to understand the genetic basis of disease and evolutionary relationships between species and individuals, molecular biologists have found it useful to sift through a vast collection of such sequences, looking for themes and relationships between subsequences constituting genes and other genetic landmarks. Their efforts have been impressive yet modest, given the scale of the problem and the computational resources available.
A set of sequences known as “palindromes” are important landmarks, both to the biochemical machinery of the living cell and for navigational purposes to the investigator. They are a class of DNA sequences known to have special regulatory functions in biological systems and distinguished by the antisymmetric arrangement of bases in palindromic sequences. They are DNA sequences that have a 2-fold inversion symmetry (i.e., sequences that conform to the scheme $XY = \overline{YX}$).
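In software terms, the symmetry condition above is commonly read as "the sequence equals its own reverse complement." The following minimal sketch assumes that reading; the function names are hypothetical and the code is a conventional string-based check, not the optical method of the invention.

```python
# Translation table implementing the A<->T, C<->G pairing.
_PAIR = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order of the bases."""
    return seq.translate(_PAIR)[::-1]


def is_palindrome(seq: str) -> bool:
    """True when the sequence reads the same as its reverse complement."""
    return seq == reverse_complement(seq)


print(is_palindrome("GAATTC"))  # True
print(is_palindrome("GAATCT"))  # False
```

GAATTC, the well-known EcoRI recognition site, passes the check, while GAATCT, which contains exactly the same bases in a different order, does not.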
A system of enzymes (“restriction enzymes”) able to recognize and act on palindromic sequences has evolved in many living organisms. The symmetric composition of the DNA restriction sites has been exploited in the evolution of enzymes having in many cases a 2-fold dyadic structure. These enzymes are used by researchers to cut and splice segments of DNA at the sites of the palindromic sequences recognized by the restriction enzymes.
Currently, genetic sequence databases contain on the order of 10^8 bases of sequence data gathered from organisms of all types, determined primarily in the last 10 years. The rate of sequence determination is increasing exponentially and is expected to do so for years to come. With each newly logged sequence a time-consuming search for palindromes is begun. The scientific community needs new approaches to reduce the search time required.
The need for faster searching of DNA sequences has resulted in the development of a multichannel optical architecture which can rapidly search for symbolically encoded sequences of DNA base information and detect palindromic sequences of length-6, a common object of sequence searches. This approach uses video display, spatial light modulation, and detection components in conjunction with microlenslet replicating optics, to expedite the recognition of symbol sequences based on their symmetry properties. Multichannel operation is achieved through the replication of input scenery, making possible a higher throughput rate than for single channel systems. The method of the invention uses a symbolic representation of the DNA subunits to facilitate the classification of sequences as palindromic or nonpalindromic.
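For comparison, the search task that the optical architecture is meant to accelerate can be sketched serially in software. This is a hedged illustration only: it examines one length-6 window at a time, whereas the invention examines replicated inputs in parallel channels. The function name `find_palindromes` and the test sequence are hypothetical.

```python
def find_palindromes(seq: str, width: int = 6) -> list[tuple[int, str]]:
    """Return (start index, window) for every palindromic window in `seq`.

    A window counts as palindromic when it equals its reverse complement.
    """
    pair = str.maketrans("ACGT", "TGCA")
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if window == window.translate(pair)[::-1]:
            hits.append((start, window))
    return hits


print(find_palindromes("CCGAATTCGGATCCA"))
# [(2, 'GAATTC'), (8, 'GGATCC')]
```

The serial loop makes one comparison per window position, which is the cost that grows with database size.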
A notable feature of this optical approach is the exchanged positions of input scenery and the filter set. The conventional treatment has been to display the input scene on a monitor for projection onto a set of feature extraction vectors realized as amplitude modulated LCTV devices or lithographically prepared masks. Instead, the invention provides the filter set as input to the system and correspondingly places the sequence data in the filter plane of the system, relying on the commutativity of projection to allow this role reversal.
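The role reversal relies on the fact that the projection (inner product) of one vector onto another gives the same value whichever vector is treated as the input. The sketch below illustrates this with a one-hot encoding of the bases. That encoding is an assumption made purely for illustration; it is not the symbolic encoding of the invention, and the names `ONE_HOT`, `encode`, and `inner` are hypothetical.

```python
# Hypothetical one-hot code for each base (illustration only).
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def encode(seq: str) -> list[int]:
    """Concatenate the one-hot codes of the bases into a single vector."""
    return [bit for base in seq for bit in ONE_HOT[base]]


def inner(u: list[int], v: list[int]) -> int:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


sequence_vec = encode("GAATTC")  # plays the role of the sequence data
filter_vec = encode("GGATCC")    # plays the role of a filter vector

# The result does not depend on which vector is displayed as the input
# and which is placed in the filter plane.
print(inner(sequence_vec, filter_vec), inner(filter_vec, sequence_vec))  # 4 4
```

With this encoding the inner product simply counts the positions at which the two sequences agree, and swapping the two arguments leaves that count unchanged.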
The optical feature extractor of the invention can classify short (6 bases in length) sequences of DNA as palindrome or nonpalindrome. This classification is made on the basis of the sequence symmetry, independent of base composition.