Informatics is the study and application of computer and statistical techniques to the management of information. In genome projects, bioinformatics includes the development of methods to search databases quickly, to analyze nucleic acid sequence information, and to predict protein sequence and structure from DNA sequence data. Increasingly, molecular biology is shifting from the laboratory bench to the computer desktop. Advanced quantitative analyses, database comparisons, and computational algorithms are needed to explore the relationships between sequence and phenotype.
One use of bioinformatics is the study of genes that are differentially or commonly expressed in different tissues or cell lines, such as normal and cancerous tissue. Such expression information is of significant interest in pharmaceutical research. A sequence tag method is used to identify and study such gene expression. Complementary DNA (cDNA) libraries from different tissue or cell samples are available. cDNA clones, or expressed sequence tags (ESTs), that cover different parts of the mRNA(s) of a gene are derived from the cDNA libraries. The sequence tag method generates large numbers of clones, such as thousands, from the cDNA libraries. Each cDNA clone can include about 100 to 800 nucleotides, depending on the cloning and sequencing method. Assuming that the number of sequences generated is directly proportional to the number of mRNA transcripts in the tissue or cell type used to make the cDNA library, the relative frequency of occurrence of those sequences can be stored in computer databases, and variations in that frequency can be used to detect the differential expression of the corresponding genes.
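The frequency comparison described above can be sketched as follows. This is a minimal illustration only: the gene identifiers, clone lists, and function names are hypothetical, and it assumes each clone has already been assigned to a gene.

```python
from collections import Counter

def relative_frequencies(clone_gene_ids):
    """Return each gene's share of the clones sequenced from one cDNA library."""
    counts = Counter(clone_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

def expression_ratio(freq_a, freq_b, gene):
    """Ratio of a gene's relative frequency in library B to that in library A.

    Returns None when the gene was not observed in library A, because the
    ratio is undefined in that case.
    """
    base = freq_a.get(gene, 0.0)
    if base == 0.0:
        return None
    return freq_b.get(gene, 0.0) / base

# Hypothetical data: one gene identifier per sequenced clone.
normal_clones = ["G1"] * 2 + ["G2"] * 4 + ["G3"] * 2
tumor_clones = ["G1"] * 4 + ["G2"] * 2 + ["G3"] * 2

freq_normal = relative_frequencies(normal_clones)
freq_tumor = relative_frequencies(tumor_clones)

for gene in sorted(freq_normal):
    print(gene, expression_ratio(freq_normal, freq_tumor, gene))
```

A ratio well above or below 1 suggests differential expression, but only under the proportionality assumption stated above; with small clone counts the ratio is noisy and would need a statistical test before any conclusion is drawn.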
Sequences are compared with other sequences using heuristic search algorithms such as the Basic Local Alignment Search Tool (BLAST). BLAST compares a sequence of nucleotides with all sequences in a given database. BLAST looks for similarity matches, or "hits", that indicate the potential identity and function of the gene. BLAST is employed by programs that assign a statistical significance to the matches using the methods of Karlin and Altschul (Karlin, S. and Altschul, S. F. (1990) Proc. Natl. Acad. Sci. U.S.A. 87(6): 2264-2268; Karlin, S. and Altschul, S. F. (1993) Proc. Natl. Acad. Sci. U.S.A. 90(12): 5873-5877). Homologies between sequences are electronically recorded and annotated with information available from public sequence databases such as GenBank. Homology information derived from these and other comparisons provides a basis for assigning function to a sequence.
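As a rough illustration of how a significance value can be attached to a hit, the sketch below uses the commonly cited Karlin-Altschul expectation, E = K·m·n·e^(−λS), where S is the raw alignment score and m and n are the query and database lengths. The function names are hypothetical, and the values of λ and K are placeholders chosen for illustration; in practice they depend on the scoring system and sequence composition.

```python
import math

def expect_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)

def bit_score(raw_score, lam, k):
    """Normalized score, so that E = m * n * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Placeholder statistical parameters (assumptions, not taken from the source).
LAMBDA = 1.37
K = 0.71

query_len = 300          # nucleotides in the query sequence
db_len = 1_000_000       # total nucleotides in the database searched

e_direct = expect_value(50, query_len, db_len, LAMBDA, K)
e_from_bits = query_len * db_len * 2 ** (-bit_score(50, LAMBDA, K))
print(e_direct, e_from_bits)
```

The two printed values agree up to rounding, since the bit score is only a rescaling of the raw score. A smaller E means the match is less likely to have arisen by chance.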
Typically, computer systems use relational databases to store, process, and manipulate nucleotide and amino acid sequences, expression information, chromosomal location, and protein function information. As the amount of biomolecular information increases, the computer systems and their databases must accommodate the storage, retrieval, comparison, and display of very large amounts of data. Typically, the data is stored in multiple files making up a relational database. Each file has records with predefined fields. Records are accessed using at least one field that is designated as a key or index. Relational databases typically use a join operation to cross-reference data stored in different files based on a common key field. In effect, the join operation combines data stored in multiple files. However, if one of the files being joined, such as the cDNA or clone file, is unusually large, then even a simple join operation with a small file is time-consuming.
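The cost of the join described above can be seen in a small sketch. The records, field names, and function name below are hypothetical, and the hash join shown is only one way a join may be carried out. The point is that the whole clone file is scanned each time the join is run, whereas a result compiled once can be consulted by key afterwards.

```python
# Hypothetical records standing in for two files of a relational database.
clones = [
    {"clone_id": "c1", "library_id": "lib1"},
    {"clone_id": "c2", "library_id": "lib1"},
    {"clone_id": "c3", "library_id": "lib2"},
]
libraries = [
    {"library_id": "lib1", "tissue": "liver"},
    {"library_id": "lib2", "tissue": "brain"},
]

def join_on_key(left, right, key):
    """Join two lists of records on a common key field (simple hash join)."""
    index = {}
    for row in right:                      # index the small file by the key
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:                       # every row of the large file is visited
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined

# Running the join at query time repeats the scan of the clone file.
joined = join_on_key(clones, libraries, "library_id")

# Compiling the combined records once allows later lookups without a join.
precompiled = {row["clone_id"]: row for row in joined}
print(precompiled["c3"]["tissue"])
```

With three clones the difference is invisible, but the scan in `join_on_key` grows with the size of the clone file, which is the file the text identifies as unusually large.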
In addition, relational databases use separate data and index files. Index files are used to access information in corresponding data files. Typically, a large amount of storage is needed to store both the data and index files. Therefore, in a system with many data and index files, it is unlikely that all data can be stored at any one time in the main memory of the computer system. The data that remains on the disk must be swapped into main memory when it is needed. This swapping further contributes to the slow response time of data retrieval systems having unusually large database tables.
Therefore, there is a need for a biomolecular database that eliminates the need for using join operations. In addition, there is a need for a biomolecular database that can be stored in the main memory of ordinary desktop computer systems. There is also a need for a biomolecular database with a common set of tissue classes that can assign a cDNA library to many different tissue classes.
The present invention provides a self-sufficient modular database that organizes and precompiles data to eliminate the need for using join operations. In addition, the database is organized such that the entire database can be stored in the main memory of the computer system. The present invention also provides a way to associate a cDNA library with multiple tissue classes.
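One way to picture the association of a cDNA library with multiple tissue classes is an in-memory mapping compiled ahead of time in both directions. The sketch below is an assumption made for illustration and is not the layout used by the invention; the library names, tissue class names, and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical assignment of each cDNA library to one or more tissue classes.
library_tissue_classes = {
    "lib1": {"liver", "digestive system"},
    "lib2": {"liver", "tumor"},
    "lib3": {"brain"},
}

def index_by_tissue_class(mapping):
    """Precompile the reverse lookup: tissue class -> set of libraries."""
    by_class = defaultdict(set)
    for library, tissue_classes in mapping.items():
        for tissue_class in tissue_classes:
            by_class[tissue_class].add(library)
    return dict(by_class)

libraries_by_class = index_by_tissue_class(library_tissue_classes)

# Both directions are now answered by a dictionary lookup held in memory.
print(sorted(libraries_by_class["liver"]))
print(sorted(library_tissue_classes["lib2"]))
```

Because both mappings are built once and kept in memory, a query in either direction needs no join at lookup time.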