The present invention relates generally to a system and method for storing and retrieving biomolecular sequence information. More particularly, the invention relates to a system and method for storing biomolecular sequence information in a precompiled, modular format which allows for rapid retrieval of the information.
Informatics is the study and application of computer and statistical techniques to the management of information. In genome projects, bioinformatics includes the development of methods to search databases quickly, to analyze nucleic acid sequence information, and to predict protein sequence and structure from DNA sequence data. Increasingly, molecular biology is shifting from the laboratory bench to the computer desktop. Advanced quantitative analyses, database comparisons, and computational algorithms are needed to explore the relationships between sequence and phenotype.
One use of bioinformatics involves studying genes differentially or commonly expressed in different tissues or cell lines, such as in normal or cancerous tissue. Such expression information is of significant interest in pharmaceutical research. A sequence tag method is used to identify and study such gene expression. Complementary DNA (cDNA) libraries from different tissue or cell samples are available. cDNA clones, or expressed sequence tags (ESTs), that cover different parts of the mRNA(s) of a gene are derived from the cDNA libraries. The sequence tag method generates large numbers, such as thousands, of clones from the cDNA libraries. Each cDNA clone can include about 100 to 800 nucleotides, depending on the cloning and sequencing method. Assuming that the number of sequences generated is directly proportional to the number of mRNA transcripts in the tissue or cell type used to make the cDNA library, variations in the relative frequency of occurrence of those sequences can be stored in computer databases and used to detect the differential expression of the corresponding genes.
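The frequency-based reasoning above can be illustrated with a minimal sketch. It assumes each sequenced clone has already been assigned to a gene; the gene labels, the two example libraries, and the function name `relative_frequencies` are hypothetical and are not taken from the specification.

```python
from collections import Counter

def relative_frequencies(clone_gene_ids):
    """Return each gene's share of the clones sequenced from one library."""
    counts = Counter(clone_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}

# Hypothetical clone-to-gene assignments for two cDNA libraries.
normal = ["geneA", "geneA", "geneB", "geneC"]
tumor = ["geneA", "geneB", "geneB", "geneB"]

freq_normal = relative_frequencies(normal)
freq_tumor = relative_frequencies(tumor)

# Difference in relative frequency between the libraries;
# a gene with no clones in a library is treated as frequency 0.
genes = sorted(set(freq_normal) | set(freq_tumor))
difference = {
    g: freq_tumor.get(g, 0.0) - freq_normal.get(g, 0.0) for g in genes
}
```

In this toy data, a positive difference suggests higher relative expression in the second library. Real libraries contain thousands of clones, and any such difference would need a statistical test before being interpreted.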
Sequences are compared with other sequences using heuristic search algorithms such as the Basic Local Alignment Search Tool (BLAST). BLAST compares a sequence of nucleotides with all sequences in a given database. BLAST looks for similarity matches, or 'hits', that indicate the potential identity and function of the gene. BLAST is employed by programs that assign a statistical significance to the matches using the methods of Karlin and Altschul (Karlin, S. and Altschul, S. F. (1990) Proc. Natl. Acad. Sci. USA 87(6): 2264-2268; Karlin, S. and Altschul, S. F. (1993) Proc. Natl. Acad. Sci. USA 90(12): 5873-5877). Homologies between sequences are electronically recorded and annotated with information available from public sequence databases such as GenBank. Homology information derived from these and other comparisons provides a basis for assigning function to a sequence.
Typically, computer systems use relational databases to store, process, and manipulate nucleotide and amino acid sequences, expression information, chromosomal location, and protein function information. As the amount of biomolecular information increases, the computer systems and their databases must accommodate the storage, retrieval, comparison, and display of very large amounts of data. Typically, the data is stored in multiple files making up a relational database. Each file has records with predefined fields. Records are accessed using at least one field that is designated as a key or index. Relational databases typically use a join operation to cross-reference data stored in different files based on a common key field. The join operation combines data stored in multiple files. However, if one of the files being joined, such as the cDNA or clone file, is unusually large, then even a simple join operation with a small file is time-consuming.
In addition, relational databases use separate data and index files. Index files are used to access information in corresponding data files. Typically, a large amount of storage is needed to store both the data and index files. Therefore, in a system with many data and index files, it is unlikely that all data can be held in the main memory of the computer system at any one time. The data that remains on the disk must be swapped into main memory. The swapping of data into main memory further contributes to the slow response time of data retrieval systems having unusually large database tables.
Therefore, there is a need for a biomolecular database that eliminates the need for using join operations. In addition, there is a need for a biomolecular database that can be stored in the main memory of ordinary desktop computer systems. There is also a need for a biomolecular database with a common set of tissue classes that can assign a cDNA library to many different tissue classes.
The present invention provides a self-sufficient modular database that organizes and precompiles data to eliminate the need for using join operations. In addition, the database is organized such that the entire database can be stored in the main memory of the computer system. The present invention also provides a way to associate a cDNA library with multiple tissue classes.
A computer system stores biomolecular data in a database in a memory. The biomolecular database has a set of entities. Each entity stores attributes for a plurality of entries. At least one attribute is stored in an array. Data associated with an entry is stored at a location in the array. An entity offset designates the location of the data in the array. The same entity offset value is used to access data associated with a particular entry for all attributes within the entity.
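The entity layout described above can be sketched as a set of parallel arrays. This is an illustrative sketch only: the class name `LibraryEntity`, its two attributes, and the sample values are assumptions chosen for the example, not structures named in the specification.

```python
from array import array

class LibraryEntity:
    """Hypothetical entity: one array per attribute, all sharing one offset."""

    def __init__(self):
        self.names = []                  # attribute: library name
        self.clone_counts = array("l")   # attribute: number of clones

    def add(self, name, clone_count):
        """Append one entry and return its entity offset."""
        self.names.append(name)
        self.clone_counts.append(clone_count)
        return len(self.names) - 1

    def get(self, offset):
        """Read every attribute of one entry using the same offset."""
        return self.names[offset], self.clone_counts[offset]

libraries = LibraryEntity()
liver = libraries.add("liver", 1200)
brain = libraries.add("brain", 800)
```

Because every attribute array is indexed by the same entity offset, reading all attributes of an entry requires no key lookup, only direct indexing into each array.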
The modular database allows extremely rapid search, comparison, and retrieval of information from very large databases. In the database, joins are eliminated through the use of a set of predefined addressing techniques. The data is organized into entities, and the relationships of the data between entities are precompiled. Offsets or pointers define relationships between entities. Although entity offsets may be stored in multiple locations, the biomolecular data is stored once.
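One way to picture offsets replacing joins is the following minimal sketch. It assumes two entities, clones and libraries, each held as attribute arrays; all names and values are hypothetical and chosen only to illustrate the idea.

```python
# Library entity: attribute array indexed by library offset.
library_name = ["normal_liver", "tumor_liver"]

# Clone entity: attribute arrays indexed by clone offset.
clone_id = ["c001", "c002", "c003"]
# Precompiled relationship: each clone stores the offset of its library.
clone_library_offset = [0, 1, 1]

def library_of_clone(clone_offset):
    """Follow the stored offset directly; no key comparison or join is needed."""
    return library_name[clone_library_offset[clone_offset]]

# Precompiled reverse relationship: the clone offsets belonging to each library.
# Only offsets are duplicated here; names and identifiers are stored once.
library_clone_offsets = [[] for _ in library_name]
for clone_offset, lib_offset in enumerate(clone_library_offset):
    library_clone_offsets[lib_offset].append(clone_offset)
```

In this sketch the reverse relationship is built once, ahead of any query, so both directions of the clone-to-library relationship are resolved by array indexing at query time.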
The addressing technique allows for rapid searching and comparisons of very large amounts of sequence data. Such rapid processing of sequence information provides the capability for significant analysis of the biological function of the huge numbers of sequences currently residing in public and private databases.
The present invention also provides a database and system that allows for comparison of libraries. Library comparison techniques include direct comparisons of sequence expression between libraries that were derived from normal and diseased tissues to provide expression information useful for identifying target molecules for pharmaceutical therapy.
In addition, the present invention provides a database that is structured so as to facilitate quick access to expression level information for a specified cluster or set of clusters of sequences in a specified set of libraries (each of which represents a specific source of expression information). The present invention also provides tools for quickly determining the sensitivity and specificity of expression level values.
The invention also provides an improved technique for assessing similarities between sequences and clustering multiple sequences.
The modular database increases the speed of sequence analysis, which will help accelerate biomolecular research for numerous applications.