The present invention relates in general to identifying nucleic acid sequences and in particular to an automated method for identifying nucleic acid sequences and electronically storing information related to the nucleic acid sequences.
The present invention is useful, for example, for researchers using the subtraction library technique to determine regulation of mRNA, researchers using a high throughput technique for identification of DNA or cDNA nucleotide sequences, and researchers whose data contain many unknown DNA sequences and who must therefore revisit a nucleic acid identification database on a regular basis.
In the United States, the National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH) maintains databases with information about each nucleotide sequence that has been submitted to it. The NCBI database is accessible to the general public. There is one record for each sequence in the non-redundant (NR) database or multiple matching records in the expressed sequence tags (EST) database. The NCBI database is updated daily and has become one of the world's largest repositories of protein and genetic data. Other publicly available databases are located in Europe and Japan. In addition, some private entities maintain nucleic acid identification databases that are not generally available to the public.
An example of the use of a nucleic acid identification database involves the subtraction library technique. Using a subtraction library technique, one can produce hundreds of cDNA fragments that are either up-regulated or down-regulated in response to a stimulus defined by different experimental conditions. The sequence of base pairs for each fragment can be determined using DNA sequencers, producing files of “raw” sequences, generally in an electronic format. To make use of these data, each raw sequence needs to be identified as a subset of a known protein, mRNA, gene, or DNA sequence for use in further analysis. The identification can be done by requesting that NCBI match the sequence against all of the known sequences in its database and return information about the most similar matching items. There will usually be many possible matches, with reams of data returned for each match, so the amount of data generated becomes unmanageable very quickly. The present invention helps a researcher organize and use data obtained from a nucleic acid identification database.
In the past, when using a publicly available database such as the NCBI database, the identification of each nucleic acid sequence involved: 1) visually scanning the nucleic acid sequence; 2) deleting the vector and adaptor sequences; 3) electronically pasting the edited sequence into a web-based search request form for submission to the Basic Local Alignment Search Tool (BLAST) page on the NCBI website; 4) waiting online for data analysis and transfer; 5) printing the search results for later review; and 6) selecting certain sequence identifiers from the search results and typing them into a spreadsheet for specific data capture, archiving and subsequent sequence analysis. During review of the hard (paper) copy sequence alignments, it was common to revisit the BLAST site on the web to obtain further information. This further information was available through hyperlinks embedded in the original output, but was not accessible when reviewing a paper copy.
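By way of illustration only, the manual step of deleting the vector and adaptor sequences might be sketched in Python as follows. The function name `trim_flanks`, the two flanking sequences and the sample read are hypothetical and are not taken from this specification. The sketch assumes that the flanks appear as exact substrings of the raw read, which real sequencer output will not always satisfy.

```python
def trim_flanks(raw, left_flank, right_flank):
    """Return the part of `raw` lying between two known flanking sequences.

    Matching is exact and case-insensitive. If a flank is not found,
    that end of the read is left untrimmed.
    """
    # Normalize case and drop whitespace that raw sequence files often contain.
    seq = "".join(raw.split()).upper()
    left = left_flank.upper()
    right = right_flank.upper()

    # Remove everything up to and including the first copy of the left flank.
    start = seq.find(left)
    if start != -1:
        seq = seq[start + len(left):]

    # Remove everything from the last copy of the right flank onward.
    end = seq.rfind(right)
    if end != -1:
        seq = seq[:end]
    return seq


# Hypothetical vector/adaptor flanks and raw read, for illustration only.
LEFT_FLANK = "GAATTCGG"
RIGHT_FLANK = "CCGAATTC"
raw_read = "nnacGAATTCGGatgcatgcatgcCCGAATTCttgn"

insert_sequence = trim_flanks(raw_read, LEFT_FLANK, RIGHT_FLANK)
print(insert_sequence)
```

The trimmed string is what would then be pasted into the search request form in step 3. A practical implementation would likely need approximate matching to tolerate sequencing errors within the flanks.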
The present invention automates all of the steps that were previously done by hand, starting from the raw sequence files (produced by the nucleic acid sequencers) through to the creation of a complete library file that contains identification of the nucleic acid sequences in an individual nucleic acid library sample set. It is estimated that the invention reduces the data capture and review time required for nucleic acid sequence identification by as much as 90 percent.
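As a hedged illustration of what a library file of this kind could look like, the following Python sketch writes one row per query sequence, recording its best match. The record fields (`accession`, `description`, `evalue`), the function names, the CSV layout and the sample records are all assumptions made for this example, and the placeholder accessions are not real database entries. Choosing the hit with the lowest E-value is likewise an assumed selection rule, not one stated in this specification.

```python
import csv
import io


def best_hit(hits):
    """Return the hit with the lowest E-value, or None if the list is empty."""
    return min(hits, key=lambda h: h["evalue"]) if hits else None


def write_library(results, out):
    """Write one CSV row per query sequence, recording its best match."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["query_id", "accession", "description", "evalue"])
    for query_id, hits in results.items():
        hit = best_hit(hits)
        if hit is None:
            # Keep unmatched sequences in the file so they can be re-queried later.
            writer.writerow([query_id, "", "no match", ""])
        else:
            writer.writerow(
                [query_id, hit["accession"], hit["description"], hit["evalue"]]
            )


# Made-up placeholder search results, for illustration only.
results = {
    "clone_001": [
        {"accession": "ACC0001", "description": "hypothetical match A", "evalue": 1e-20},
        {"accession": "ACC0002", "description": "hypothetical match B", "evalue": 1e-50},
    ],
    "clone_002": [],
}

buffer = io.StringIO()
write_library(results, buffer)
library_text = buffer.getvalue()
print(library_text)
```

Retaining a row for each unmatched sequence reflects the need, noted above, to revisit the identification database on a regular basis as new records are added.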