Without limiting the scope of the invention, its background is described in connection with uses of functional genomics and bioinformatics, as an example.
The present invention relates generally to methods and systems for searching and identifying functional nucleic acid sequences and proteins encoded by genes in the multitude of nucleic acid and protein databases presently available. These biological databases store information that is searchable and from which biological information may be retrieved. More particularly, the present invention relates to systems and methods for identifying biologically relevant sequences of biological molecules using an integrated approach that specifically identifies sequences for cloning.
Generally, informatics may be defined as the study and application of computer and statistical techniques to the management of information. In projects related to biological information, the term “bioinformatics” has been coined to include the development of methods to, e.g., search databases, analyze nucleic acid sequence information, and predict protein sequence, protein structure, and protein function from nucleic acid sequence data.
The widespread use and availability of molecular biological techniques have allowed for the rapid development and identification of nucleic acid derived sequences. With the widespread availability of advanced computer systems and the integration of laboratory equipment with computer software, researchers are able to conduct advanced quantitative analyses and database comparisons, and to apply computational algorithms, to seek and identify gene sequences with homology to known sequences.
Large-scale sequencing efforts have made genetic information available for a number of organisms, and this information has been cataloged in a number of public and private computer databases. Genetic databases for organisms such as Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium, and Mycoplasma pneumoniae, to name a few, are publicly available. At present, however, complete sequence data is available for relatively few species, and the ability to manipulate sequence data within and between species and databases is greatly limited by the extent to which these public databases can be searched for functional significance.
One example of a system for comparing relational databases of sequences is disclosed in U.S. Pat. No. 5,966,712, issued to Sabatini, et al. The system disclosed is a relational database system for storing and manipulating biomolecular sequence information and includes a database of genomic libraries for a plurality of types of organisms. These libraries are taught to have multiple genomic sequences, at least some of which represent open reading frames located along a contiguous sequence in each of the plurality of organisms' genomes. A user interface is provided and is capable of receiving a selection of two or more of the genomic libraries for comparison and displaying the results of the comparison. The system also provides a user interface capable of receiving a selection of one or more probe open reading frames for use in determining homologous matches between such probe open reading frame(s) and the open reading frames in the genomic libraries, and displaying the results of the determination.
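By way of illustration only, the general idea of storing open reading frames in relational tables and comparing a probe open reading frame against selected genomic libraries may be sketched as follows. This is a minimal sketch and not the system of U.S. Pat. No. 5,966,712: the table names, column names, function names and sequences are hypothetical, and the similarity score from Python's difflib is a crude stand-in for a true homology search.

```python
import sqlite3
from difflib import SequenceMatcher


def build_demo_db():
    """Create an in-memory database with hypothetical libraries and ORFs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE library (id INTEGER PRIMARY KEY, organism TEXT NOT NULL);
        CREATE TABLE orf (
            id INTEGER PRIMARY KEY,
            library_id INTEGER NOT NULL REFERENCES library(id),
            name TEXT NOT NULL,
            sequence TEXT NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO library (id, organism) VALUES (?, ?)",
        [(1, "Escherichia coli"), (2, "Haemophilus influenzae")],
    )
    # Made-up toy sequences, not real genes.
    conn.executemany(
        "INSERT INTO orf (library_id, name, sequence) VALUES (?, ?, ?)",
        [
            (1, "orfA", "ATGGCTAAAGGTTAA"),
            (1, "orfB", "ATGCCCGGGTTTTGA"),
            (2, "orfC", "ATGGCTAAAGGATAA"),
        ],
    )
    return conn


def similarity(a, b):
    """Crude string similarity in [0, 1]; an assumption, not a homology score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def find_matches(conn, probe, library_ids, min_similarity=0.8):
    """Return (organism, orf_name, score) for ORFs in the selected libraries
    whose similarity to the probe meets the threshold, best match first."""
    placeholders = ",".join("?" for _ in library_ids)
    rows = conn.execute(
        "SELECT l.organism, o.name, o.sequence "
        "FROM orf AS o JOIN library AS l ON l.id = o.library_id "
        f"WHERE o.library_id IN ({placeholders})",
        list(library_ids),
    ).fetchall()
    hits = [(org, name, similarity(probe, seq)) for org, name, seq in rows]
    hits = [h for h in hits if h[2] >= min_similarity]
    return sorted(hits, key=lambda h: h[2], reverse=True)
```

Calling find_matches(build_demo_db(), "ATGGCTAAAGGTTAA", [1, 2]) would return the stored open reading frames that resemble the probe, ordered by score. A practical system would replace the similarity function with an alignment-based search.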
Also needed are fully integrated systems that take advantage of functional observations in the identification of biologically relevant and functional gene sequences. The current disconnect between genotype and phenotype leads to the pursuit of many genes of doubtful relevance, or even of mere artifacts. Thus, researchers are presently unable to use available computer resources to explore, identify and study relevant gene sequences, gene expression, and molecular structure without extensive experimentation.
Another use of bioinformatics involves studying an organism's genome to determine the sequence and placement of its genes and their relationship to other sequences and genes within the genome or to genes in other organisms. The study of the relationship between introns and exons, for example across species, allows for a scientific understanding of many underlying substructures of the protein or proteins being expressed. It also allows for the identification of sequences that are involved in the regulation of the gene or genes at a particular gene locus. Such information may be of significant interest in biomedical and pharmaceutical research to assist in the evaluation of potential drug efficacy and resistance for genes that are well studied and for which significant structure-function studies have been conducted. In one such database system (Incyte Pharmaceuticals, Inc., U.S.A.), software has been developed that searches the annotated information that is part of genomic sequence data in publicly available sequence databases. Examples of such databases include the widely available public databases GenBank (NCBI) and TIGR. Unfortunately, not all electronically recorded sequences contain annotated information. Some contain information that is not functional, that is not accurate, or that has no relation to function. Therefore, the search results obtained from these databases often have no bearing on the cellular biological function of a particular protein or gene regulatory element.
Although genetic data processing and relational database systems such as those developed by Incyte Pharmaceuticals, Inc. provide great power and flexibility in analyzing genetic information, this area of technology is still in its infancy and further improvements in genetic data processing and relational database systems will help accelerate biological research for numerous applications.