An associative memory is commonly used to detect cache “hits” in a computer system by comparing an address word with a memory of previously accessed address words. A “hit” occurs when there is a match between the input address word and an entry in this memory. The output of a hit is the cache line into which the address was previously read. An associative memory is therefore, in essence, a parallel recognition process in which a new input is compared with the entire database of prior experiences to detect any match and, in the case of a hit, to output the reference or location.
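The hit detection described above can be illustrated with a minimal software sketch. The function name `associative_lookup` and the representation of the memory as a list of (address, cache line) pairs are assumptions made for illustration only. A hardware associative memory would perform all of the comparisons simultaneously, whereas this sketch merely models that step with a list of match flags.

```python
from typing import List, Optional, Tuple


def associative_lookup(entries: List[Tuple[int, int]], address: int) -> Optional[int]:
    """Return the cache line stored for `address`, or None on a miss.

    `entries` is a hypothetical list of (stored_address, cache_line) pairs.
    """
    # Compare the input address against every stored address. In hardware
    # these comparisons would happen in parallel; here they are sequential.
    match_flags = [stored == address for stored, _ in entries]

    # On a hit, output the cache line of the first matching entry.
    for is_match, (_, cache_line) in zip(match_flags, entries):
        if is_match:
            return cache_line
    return None  # miss: no stored address matched the input


# Example: two previously accessed addresses and their cache lines.
memory = [(0x1F40, 3), (0x2A00, 7)]
hit = associative_lookup(memory, 0x2A00)   # 7
miss = associative_lookup(memory, 0x3000)  # None
```

In this sketch a miss is signalled by `None`; a real cache controller would instead trigger a fetch from main memory.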
Parallel recognition processes are conceptually simple, but in existing practice they grow exponentially in complexity and are infeasible for all but the smallest database applications. One potential application of parallel recognition processes is the searching of DNA/RNA sequences. Such an application of computers to solve information processing problems in the life sciences falls within the general field known as “bioinformatics.” However, searches of DNA/RNA sequences typically involve very large databases potentially containing millions to hundreds of millions of bases. These size ranges are inconsistent with the small database sizes suitable for existing parallel recognition processes.
The bioinformatics field is currently in a stage of rapid growth. In a broad sense, the field includes any use of computers in solving information problems in the life sciences and, more particularly, the creation and use of extensive electronic databases on genomes, proteomes, etc. To better appreciate some of the concepts in the bioinformatics field, it is helpful to discuss some of the basic principles of cells.
A cell relies on proteins for a variety of its functions. Producing energy, biosynthesizing all component macromolecules, maintaining cellular architecture, and acting upon intra- and extra-cellular stimuli are all protein-dependent activities. Almost every cell within an organism contains the information necessary to produce the entire repertoire of proteins that the organism can specify. This information is stored as genes within the organism's DNA genome. Different organisms have different numbers of genes to define them. The number of human genes, for example, is estimated to be approximately 25,000.
Genetic information of all life forms is encoded by four basic nucleotides (adenine, thymine, cytosine, and guanine, which are designated by the letters “A”, “T”, “C”, and “G”, respectively). The nucleotide bases pair as A-T and G-C, and a DNA sequence refers to the ordering or pattern of the nucleotide bases in the gene. A DNA sequence can be very long; for instance, a DNA sequence may have between 2,000 and two million base pairs. The make-up of all life forms is determined by the sequence of these nucleotides. DNA is the molecule that encodes this sequence of nucleotides.
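As a brief illustration of the A-T and G-C pairing, a DNA sequence can be represented as a string over the letters A, T, C, and G, and the bases on the paired strand can be derived by substituting each base with its partner. The function name `complement` is an assumption introduced for this sketch, which handles only upper-case sequences.

```python
# Translation table mapping each base to its pairing partner (A-T, G-C).
_PAIRING = str.maketrans("ATCG", "TAGC")


def complement(sequence: str) -> str:
    """Return the paired base for each position of an upper-case DNA string."""
    invalid = set(sequence) - set("ATCG")
    if invalid:
        raise ValueError(f"unexpected symbols in sequence: {sorted(invalid)}")
    return sequence.translate(_PAIRING)


paired = complement("ATGC")  # "TACG"
```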
Each gene typically provides biochemical instructions on how to construct a particular protein. In some cases multiple genes are required to create a single protein, and multiple proteins can be produced through alternative splicing and post-transcriptional modification of a single gene.
Only a portion of the genome is composed of genes, and the set of genes expressed as proteins varies between cell types. Some of the proteins present in a single cell are likely to be present in all cells because they serve functions required in every type of cell. These proteins can be thought of as “housekeeping” proteins. Other proteins serve specialized functions that are only required in particular cell types. Such proteins are generally produced only in limited types of cells. Given that a large part of a cell's specific functionality is determined by the genes that it is expressing, it is logical that transcription, the first step in the process of converting the genetic information stored in an organism's genome into protein, would be highly regulated by the control network that coordinates and directs cellular activity.
The human genome contains approximately three billion DNA base pairs, and the particular DNA sequences that each person has are located in 23 pairs of chromosomes that contain an estimated 25,000 individual genes. Significantly, faulty genes can be linked to a large variety of human afflictions. An ability to relate an individual gene directly to a particular medical problem can lead to predictive tests, treatments, and potential cures for a wide variety of medical problems and hereditary ailments.
Currently, about 2,000 human DNA sequences are known and identified, and these DNA sequences are stored in available databases. The number of known and identified human DNA sequences is only a small fraction of the enormous total number of human DNA sequence combinations, and the number of such known and identified DNA sequences is growing rapidly. In addition, the number of DNA sequences of other organisms that have been identified and that are available in databases is also large and likewise growing with time.
The DNA sequence information contained in these growing databases will be a major instrument for basic medical and biological research activities for many years. This information will also be a basis for developing curative techniques for medical and hereditary afflictions. To use the information in these enormous and growing databases effectively, it is necessary to provide an efficient means to access that information. In particular, it is necessary to provide an efficient and reliable means to compare a given DNA sequence to the library of known DNA sequences in the databases. Such a comparison is useful to identify, analyze, and interpret that given DNA sequence.
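To make the comparison task concrete, the following is a minimal sketch of a straightforward sequential search that looks for exact occurrences of a query sequence in each entry of a library. It is a baseline for illustration only, not the method of any particular tool; the function name `find_matches` and the dictionary layout of the library are assumptions. Its running time grows with the total size of the library, which is the scaling problem at issue for very large databases.

```python
from typing import Dict, List


def find_matches(query: str, library: Dict[str, str]) -> Dict[str, List[int]]:
    """Return, for each library sequence containing `query`, the 0-based
    start positions of every exact (possibly overlapping) occurrence."""
    if not query:
        raise ValueError("query must not be empty")

    hits: Dict[str, List[int]] = {}
    for name, sequence in library.items():
        positions: List[int] = []
        start = sequence.find(query)
        while start != -1:
            positions.append(start)
            # Resume one base further on so overlapping matches are found.
            start = sequence.find(query, start + 1)
        if positions:
            hits[name] = positions
    return hits


# Hypothetical two-entry library for illustration.
library = {"seq1": "ACGTACGT", "seq2": "TTTT"}
result = find_matches("CGT", library)  # {"seq1": [1, 5]}
```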
Current procedures for making such comparisons are slow and impractical. As the amount of stored information increases, current search methods will be unable to complete searches within practical processing times. Existing technology is not practical for searching large-scale DNA databases, which may have three billion or more base pair data items.
In addition to the above limitations in searching DNA sequence databases, drug discovery research involving the analysis of genome structure and function is currently limited by the need to perform wet DNA hybridization assays, because accurate “in silico” simulations are not available. Further, existing sequence matching tools, such as BLAST, often miss important sequence motifs since they lack the resolution to detect short sequences (e.g., fewer than 14 bases in length).
Accordingly, it would be desirable to have an improved solution that overcomes the exponentially growing complexities and combinatorial explosion associated with existing parallel recognition processes in bioinformatics and in other technical fields, and that dramatically reduces associative memory search and retrieval effort. It would be further desirable to have systems and methods to perform DNA/RNA sequence matching with convenient database access, high-speed processing, improved resolution, accuracy, and cost efficiency.