The present invention relates to systems and methods useful for annotating biomolecular sequences. More particularly, the present invention relates to computational approaches, which enable systemic characterization of biomolecular sequences and identification of differentially expressed biomolecular sequences such as sequences associated with a pathology.
In the postgenomic era, data analysis rather than data collection presents the biggest challenge to biologists. Efforts to ascribe biological meaning to genomic data, whether by identification of function, structure or expression pattern, are lagging behind sequencing efforts [Boguski M S (1999) Science 286:453-455].
It is well recognized that elucidation of spatial and temporal patterns of gene expression in healthy and diseased states may contribute immensely to further understanding of disease mechanisms.
Therefore, any observational method that can rapidly, accurately and economically observe and measure the pattern of expression of selected individual genes or of whole genomes is of great value to scientists.
In recent years, a variety of techniques have been developed to analyze differential gene expression. However, current observation and measurement methods are inaccurate, time consuming, labor intensive or expensive, oftentimes requiring complex molecular and biochemical analysis of numerous gene sequences.
For example, observation methods for individual mRNA or cDNA molecules such as Northern blot analysis, RNase protection, or selective hybridization to arrayed cDNA libraries [see Sambrook et al. (1989) Molecular cloning, A laboratory manual, Cold Spring Harbor press, NY] depend on specific hybridization of a single oligonucleotide probe complementary to the known sequence of an individual molecule. Since a single human cell is estimated to express 10,000-30,000 genes [Liang et al. (1992) Science 257:967-971], single probe methods to identify all sequences in a complex sample are ineffective and laborious.
Other approaches for high throughput analysis of differential gene expression are summarized infra.
EST sequencing—The basic idea is to create cDNA libraries from tissues of interest, pick clones randomly from these libraries and then perform a single sequencing reaction from a large number of clones. Each sequencing reaction generates 300 base pairs or so of sequence that represents a unique sequence tag for a particular transcript. An EST sequencing project is technically simple to execute since it requires only a cDNA library, automated DNA sequencing capabilities and standard bioinformatics protocols.
To generate meaningful amounts of data, however, high throughput template preparation, sequencing and analysis protocols must be applied. As such, the number of new genes identified as well as the statistical significance of the data is proportional to the number of clones sequenced as well as the complexity of the tissue being analyzed [Adams et al. (1995) Nature 377:3-173; Hillier et al. (1996) Genome Res. 6:807-828].
Subtractive cloning—Subtractive cloning offers an inexpensive and flexible alternative to EST sequencing and cDNA array hybridization. In this approach, double-stranded cDNA is created from the two cell or tissue populations of interest, linkers are ligated to the ends of the cDNA fragments and the cDNA pools are then amplified by PCR. The cDNA pool from which unique clones are desired is designated the “tester”, and the cDNA pool that is used to subtract away shared sequences is designated the “driver”. Following initial PCR amplification, the linkers are removed from both cDNA pools and unique linkers are ligated to the tester sample. The tester is then hybridized to a vast excess of driver DNA and sequences that are unique to the tester cDNA pool are amplified by PCR.
The primary limitation of subtractive methods is that they are not always comprehensive. The cDNAs identified are typically those that differ significantly in expression level between cell populations, and subtle quantitative differences are often missed. In addition, each experiment is a pairwise comparison, and since subtractions are based on a series of sensitive biochemical reactions, it is difficult to directly compare a series of RNA samples.
Differential display—Differential display is another PCR-based differential cloning method [Liang and Pardee (1992) Science 257:967-70; Welsh et al. (1992) Nucleic Acids Res. 20:4965-70]. In classical differential display, reverse transcription is primed with either oligo-dT or an arbitrary primer. Thereafter an arbitrary primer is used in conjunction with the reverse transcription primer to amplify cDNA fragments and the cDNA fragments are separated on a polyacrylamide gel. Differences in gene expression are visualized by the presence or absence of bands on the gel and quantitative differences in gene expression are identified by differences in the intensity of bands. Adaptation of differential display methods for fluorescent DNA sequencing machines has enhanced the ability to quantify differences in gene expression [Kato (1995) Nucleic Acids Res. 18:3685-90].
A limitation of the classical differential display approach is that false positive results are often generated during PCR or in the process of cloning the differentially expressed PCR products. Although a variety of methods have been developed to discriminate true from false positives, these typically rely on the availability of relatively large amounts of RNA.
Serial analysis of gene expression (SAGE)—This DNA sequence based method is essentially an accelerated version of EST sequencing [Velculescu et al. (1995) Science 270:484-8]. In this method a digestible unique sequence tag of 13 or more bases is generated for each transcript in the cell or tissue of interest, thereby generating a SAGE library.
Sequencing each SAGE library creates transcript profiles. Since each sequencing reaction yields information for twenty or more genes, it is possible to generate data points for tens of thousands of transcripts in modest sequencing efforts. The relative abundance of each gene is determined by counting or clustering sequence tags. The advantages of SAGE over many other methods include the high throughput that can be achieved and the ability to accumulate and compare SAGE tag data from a variety of samples; however, the technical difficulties concerning the generation of good SAGE libraries and data analysis are significant.
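The tag-counting step described above can be illustrated with a minimal sketch. It is not taken from the cited SAGE work: the function names, the toy tag strings and the pseudocount used to avoid division by zero are all assumptions made for illustration, and clustering of near-identical tags is not attempted.

```python
from collections import Counter


def count_tags(tags):
    """Count how many times each sequence tag occurs in one library."""
    return Counter(tags)


def relative_abundance(counts):
    """Convert raw tag counts into fractions of the whole library."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in counts.items()}


def abundance_ratio(counts_a, counts_b, pseudocount=1.0):
    """Ratio of relative abundance (library A over library B) per tag.

    The pseudocount is an illustrative assumption; it keeps the ratio
    finite for tags that are absent from one of the two libraries.
    """
    tags = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + pseudocount * len(tags)
    total_b = sum(counts_b.values()) + pseudocount * len(tags)
    ratios = {}
    for tag in tags:
        freq_a = (counts_a.get(tag, 0) + pseudocount) / total_a
        freq_b = (counts_b.get(tag, 0) + pseudocount) / total_b
        ratios[tag] = freq_a / freq_b
    return ratios


# Toy libraries; real tags would be 13 or more bases long.
library_a = count_tags(["AAAC", "AAAC", "AAAC", "GGGT"])
library_b = count_tags(["AAAC", "GGGT", "GGGT", "GGGT"])
ratios = abundance_ratio(library_a, library_b)
```

In this sketch a ratio above 1 suggests a tag that is more abundant in the first library than in the second. Whether such a difference is meaningful would depend on library size and on a statistical test that is outside the scope of the example.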
Altogether, it is clear from the above that laboratory bench approaches are ineffective, time consuming, expensive and oftentimes inaccurate in handling and processing the vast amount of genomic information which is now available.
It is appreciated that much of the analysis can be effected by developing computational algorithms, which can be applied to mining data from existing databases, thereby retrieving and integrating valuable biological information.
To date, there are more than a hundred major biomolecule databases and application servers on the Internet and new sites are being introduced at an ever-increasing rate [Ashburner and Goodman (1997) Curr. Opin. Genet. Dev. 7:750-756; Karp (1998) Trends Biochem. Sci. 23:114-116].
However, these databases are organized in extremely heterogeneous formats. These reflect the inherent complexity of biological data, ranging from plain-text nucleic acid and protein sequences, through the three-dimensional structures of therapeutic drugs and macromolecules and high-resolution images of cells and tissues, to microarray-chip outputs. Moreover, data structures are constantly evolving to reflect new research and technology development.
The heterogeneous and dynamic nature of these biological databases presents major obstacles in mining data relevant to specific biological queries. Clearly, simple retrieval of data is not sufficient for data mining; efficient data retrieval requires flexible data manipulation and sophisticated data integration. Specifically, it requires the use of complex queries across multiple heterogeneous data sources; data warehousing by merging data derived from multiple public sources and local (i.e., private) sources; and multiple data-analysis procedures that require feeding subsets of data derived from different sources into various application programs for gene finding, protein-structure prediction, functional domain or motif identification, phylogenetic tree construction, graphic presentation and so forth.
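The merging of data from heterogeneous sources can be sketched as follows. This is only an illustration under stated assumptions: the two input formats (a FASTA-like sequence file and a tab-delimited table of accession and tissue), the function names and the accession identifiers are all hypothetical and do not correspond to any particular database or retrieval system.

```python
def parse_fasta(text):
    """Parse FASTA-like text into {accession: {"sequence": str}}."""
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The first word of the header is taken as the accession.
            accession = line[1:].split()[0]
            records[accession] = {"sequence": ""}
        elif accession is not None:
            records[accession]["sequence"] += line
    return records


def parse_tissue_table(text):
    """Parse 'accession<TAB>tissue' lines into {accession: {"tissues": list}}."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        accession, tissue = line.split("\t")
        records.setdefault(accession, {"tissues": []})["tissues"].append(tissue)
    return records


def integrate(*sources):
    """Merge per-source records into one record per accession."""
    merged = {}
    for source in sources:
        for accession, fields in source.items():
            merged.setdefault(accession, {}).update(fields)
    return merged


# Hypothetical inputs standing in for a public and a local source.
fasta_text = ">X1 demo entry\nACGT\nTT\n>X2\nGG\n"
table_text = "X1\tliver\nX1\tbrain\nX3\tlung\n"
merged = integrate(parse_fasta(fasta_text), parse_tissue_table(table_text))
```

Each source is first normalized to a common shape keyed by accession, so that the merge step does not need to know the original format. A record such as the hypothetical X3, which appears in only one source, is retained with the fields that are available.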
Current biological data retrieval systems are not fully up to the demand of smooth and flexible data integration [Etzold et al. (1996) Methods Enzymol. 266:114-128; Schuler et al. (1996) Methods Enzymol. 266:141-162; Chung and Wong (1999) Trends Biotech. 17:351-355].
There is thus a widely recognized need for, and it would be highly advantageous to have, systems and methods which can be used for efficient retrieval and processing of data from biological databases thereby enabling annotation of previously un-annotated biomolecular sequences.