Computational methods and systems are essential in virtually every area of human endeavor. Often, the limits of a computational system define the ability to acquire, manipulate, analyze, understand or utilize information. This effect is becoming increasingly important in many areas of biochemical research, such as genomics research and research on gene expression. And, although bioinformatics methods and systems are developing rapidly, the volume and complexity of information from sequencing and profiling projects, to mention just two areas, are increasing even more rapidly. Thus, improving the capabilities of computational methods and systems remains essential to many of the most important areas of biochemical research. For instance, progressively more powerful molecular biology and biochemistry techniques are being used to determine the complete sequences of increasingly large genomes. Acquiring, storing and making sense of the data produced by genome sequencing in itself poses a challenge to current computational methods and systems. Recently developed methods of profiling gene expression on a large scale likely will produce even more challenging amounts of data. Furthermore, using this information to understand physiology, determination, differentiation, response to environmental factors, embryogenesis, development, aging, dysfunction, disease, transformation and death, in simple and complex cells and in complex multicellular organisms, poses perhaps the greatest challenge of all.
The challenges to computational methods and systems in biochemistry and molecular biology are illustrated briefly in the following discussion of work on genome sequences and work on expression profiling.
The techniques of molecular biology and biochemistry are being applied to determine the structures and mechanisms of genetic information and control of physiology, determination, differentiation, response to environmental factors, embryogenesis, development, aging, dysfunction, disease, transformation and death in cells and in multicellular organisms, to name just a few areas of research.
Powerful techniques for accurately determining the exact sequence of bases in fragments of DNA several hundred bases in length have been available since the late 1970s. Initially, these techniques were laboriously applied to determine the complete sequences of relatively small plasmid and viral genomes, such as pBR322, polyoma virus and SV40, as well as individually isolated genes, such as immunoglobulin and globin genes of various species, interferons and cytokines, to name just a few.
More recently, DNA sequencing methods have been automated and applied to sequencing entire genomes of more complex organisms. For instance, the complete sequences of a mycoplasma, of one prokaryote from each of three major evolutionary branches, of E. coli and of a yeast all have been determined. Work to determine the complete sequences of much more complex genomes, including those of simple plants and animals and, especially, the human genome, is progressing rapidly, and it is likely that the complete sequence of the human genome will be known within a relatively short span of years.
More powerful sequencing techniques under development will make it possible to determine DNA sequences rapidly and accurately in the clinical laboratory, as a diagnostic tool. And, it is likely that miniaturized implementations of one or more sequencing techniques will provide automated sequencers with throughputs on the order of millions of base pairs per hour. Such devices could be used to determine, very quickly, the entire sequence of mitochondrial DNA in a patient blood sample, for instance, or to determine the profile of a million polymorphisms in a patient's genome. It is even conceivable that the time is not far off when the DNA of every patient will be sequenced in its entirety as a matter of course, allowing rapid identification of genetic predispositions.
Similarly, the miniaturization of these automated techniques will make it possible to screen other organisms for many purposes, using a complete genome sequence. In the health care setting this will make it possible to identify infectious agents and to identify tumor types with unparalleled speed and accuracy. It will allow testing for drug resistance, and for response to therapeutic agents and regimens. It will make possible a greater degree of tailoring of therapies to individual patients and disease circumstances. Moreover, it will play an important role in drug discovery and the development of new therapies for disease, to mention just a few of the implications of the widespread deployment of such techniques in clinical practice and biomedical research.
However, DNA sequencing provides basically structural information. The vast amount of information that now can be acquired about DNA sequences thus increasingly highlights the need for equally powerful techniques for assessing the activity of genes; i.e., gene expression. Efforts thus are underway to develop methods for determining all the active genes in a cellular sample and for simultaneously classifying the activities.
One type of technique developed toward this end involves hybridization to oligonucleotide arrays. Illustrative of this approach is the method for “Expression monitoring by hybridization to high-density oligonucleotide arrays” described by Lockhart et al. in Nature Biotechnology 14: 1675–1680 (1996).
Another approach to expression monitoring is the technique of differential display, described by Liang and Pardee in their paper “Differential Display of Eukaryotic Messenger RNA by Means of the Polymerase Chain Reaction” in Science 257: 967–971 (1992) and in U.S. Pat. No. 5,599,672 of Liang et al. (See also Nuc. Acids Res. 19: 4009 et seq. (1994) or Trends Genet. 11: 242 et seq. (1995)). Numerous variations and improvements on the basic differential display technique have been published, as well. For instance, Luehrsen et al. described the “Analysis of Differential Display RT-PCR Products Using Fluorescent Primers and “GENESCAN” Software” in BioTechniques 22: 168–174 (1997). And Jones et al. described the “Generation of Multiple mRNA Fingerprints Using Fluorescence-Based Differential Display and an Automated DNA Sequencer” in BioTechniques 22: 536–227 (1997), to name just two examples in this regard.
These techniques presently suffer from several disadvantages. Perhaps most important is their limited quantitative reproducibility, which leads to a significant incidence of false positive signals, as much as 80% in the case of differential display (as described regarding differential display in Trends Genet. 11: 242 et seq. (1995), Nuc. Acids Res. 22: 5763 et seq. (1994) and FEBS Lett. 351: 231 et seq. (1994), among others). Furthermore, the hybridization techniques generally are useful only to analyze expression of genes whose sequences already are known; hybridization chip technology has not yet produced arbitrary sequence arrays that could assay the expression complement even of relatively small genomes. Differential display and oligo-chip techniques furthermore suffer from irreproducibility caused by unpredictable hybridization of random short sequence probe sets to high complexity targets under low stringency conditions.
A reproducible technique for profiling gene expression with great quantitative accuracy has been described, more recently. In its most common format, this technique quantitatively determines mRNA abundance in samples by measuring the amounts of 3′-end fragments of cDNAs generated by specific primers and specific cleavage reactions. The technique is described in detail in pending PCT patent application number PCT/US96/12468, published as WIPO publication WO97/05286, dated 13 Feb. 1997, the disclosure of which is incorporated herein, in its entirety.
However, the very power of the gene expression profiling techniques poses a challenge to data handling methods and systems, to methods and systems for manipulating and analyzing the data, and to methods and systems for displaying the data so that it can be comprehended, studied and used. That is, the DNA sequence and gene monitoring techniques produce enormous amounts of data, so much so that new techniques are necessary for handling the data, for making it useful and for manipulating, analyzing and displaying it.
A single copy of a human genome, for instance, contains about 3,300,000,000 base pairs, encoding about 100,000 genes. Very roughly, about 15,000 genes are active in any given type of cell. Each active gene gives rise to at least one corresponding mRNA. The expression of 15,000 different active genes thus gives rise to at least 15,000 different corresponding mRNAs. Also, genes vary greatly in activity. In consequence, the different types of mRNA are present in greatly differing amounts. Thus, gene expression in a given human cell results in a very large pool of mRNA molecules, made up of greatly varying amounts of about 15,000 different types of mRNA, corresponding to roughly 15,000 different active genes. Gene expression can be assessed by determining the different types of mRNA that are present in a cell or tissue and the amount of each type. One complete profile of gene expression in a cell is provided by determining all the types of mRNA in the cell and the amount of each one. As noted above, for a human cell this involves about 15,000 different mRNAs. To determine differences in gene expression in a cell or tissue, or changes in gene expression over time, numerous profiles must be determined and compared.
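By way of illustration only, the comparison of two expression profiles described above might be sketched as follows. The sketch assumes that a profile can be represented as a mapping from an mRNA identifier to a measured amount; the function name compare_profiles, the identifiers, the amounts and the floor value are all hypothetical and are not taken from any technique described herein.

```python
from math import log2

def compare_profiles(baseline, treated, floor=1.0):
    """Return the log2 ratio (treated / baseline) for every mRNA in either profile.

    Amounts below `floor` (including mRNAs absent from a profile) are raised
    to `floor` so that the ratio is always defined. The floor is an assumption
    of this sketch, not a property of any particular profiling technique.
    """
    changes = {}
    for mrna_id in sorted(set(baseline) | set(treated)):
        b = max(baseline.get(mrna_id, 0.0), floor)
        t = max(treated.get(mrna_id, 0.0), floor)
        changes[mrna_id] = log2(t / b)
    return changes

# Hypothetical profiles: mRNA identifier -> measured amount (arbitrary units).
baseline = {"mRNA_A": 100.0, "mRNA_B": 8.0, "mRNA_C": 50.0}
treated = {"mRNA_A": 400.0, "mRNA_B": 8.0, "mRNA_D": 16.0}

changes = compare_profiles(baseline, treated)
for mrna_id, change in changes.items():
    print(f"{mrna_id}: {change:+.2f}")
```

A real profile would hold on the order of 15,000 such entries, and a study of changes over time would require one such comparison for each pair of profiles of interest.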
Moreover, information from profiling experiments can have its greatest effect only when it is combined with information from other research. Thus, for greatest effect, information from profiling experiments should be linked to information in DNA sequence data banks and to information about characterized genes, mRNAs, proteins and the like. For greatest utility, the linkage between profiling information, DNA sequence information and other information about genes, gene expression, expressed proteins, cell phenotypes and genotypes and organismal physiology and disease (to name just a few categories) should be made on an mRNA-by-mRNA basis, so that all the information pertaining to a given mRNA (or representative thereof) in a gene expression profile can be retrieved readily and interactively.
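The mRNA-by-mRNA linkage described above might be sketched, purely for illustration, as an index keyed by mRNA identifier. The sketch assumes that every information source shares a common identifier for each mRNA; the function name build_linked_index, the source names and all record contents are hypothetical placeholders.

```python
def build_linked_index(profile, annotation_sources):
    """Collect, for each mRNA in a profile, the records from every source
    that are keyed by the same identifier.

    `profile` maps mRNA identifier -> measured amount.
    `annotation_sources` maps a source name -> {mRNA identifier -> record}.
    """
    index = {}
    for mrna_id, amount in profile.items():
        entry = {"amount": amount}
        for source_name, records in annotation_sources.items():
            if mrna_id in records:
                entry[source_name] = records[mrna_id]
        index[mrna_id] = entry
    return index

# Hypothetical data: placeholders only, not real data bank entries.
profile = {"mRNA_A": 100.0, "mRNA_B": 8.0}
sources = {
    "sequence": {"mRNA_A": "SEQ-0001"},
    "protein": {"mRNA_A": "hypothetical protein A"},
}

linked = build_linked_index(profile, sources)
print(linked["mRNA_A"])  # every record pertaining to mRNA_A, retrieved by one lookup
```

Keying every source on the same identifier is what allows all the information about a given mRNA to be retrieved in a single lookup; an mRNA with no matching records, such as mRNA_B here, simply carries its measured amount alone.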
A variety of techniques have been developed for storing, manipulating, analyzing and displaying DNA sequence information. Such techniques are described in many journal articles and books, such as SEQUENCE ANALYSIS PRIMER, Gribskov and Devereux, Eds., UWBC Biotechnical Resource Series, Stockton Press, New York (1991) and GUIDE TO HUMAN GENOME COMPUTING, Martin J. Bishop, Ed., Academic Press, Inc. Harcourt Brace & Company, Publishers, San Diego (1994). For the most part these techniques relate to the acquisition, formatting, storage, manipulation and analysis of DNA sequence information. These genomics methods and systems are designed particularly for managing experimental data from sequencing projects, for sequence assembly, for assembling sequences into “contigs,” for developing physical maps of DNAs and whole genomes, for sequence comparisons, for sequence functional inference and for genetic linkage analysis. Although powerful, these techniques generally are not useful for many purposes associated with expression profiling.
In contrast to the many bioinformatics techniques that have been developed for DNA sequencing and sequence analysis applications, presently there are few or no such techniques available for many purposes of gene expression profiling. Furthermore, the few techniques that are available suffer from many limitations and disadvantages, such as the lack of a flexible user interface; the lack of a flexible display paradigm that effectively portrays to users the results of profiling experiments; the lack of an interactive display that enables users to carry out and to visualize the results of a wide variety of data analysis approaches to gene expression profiling data; the lack of an interactive display that allows users to access information about mRNAs and genes represented in displays through a graphical user interface that links defined parts of a graphical display, e.g., peaks, to information in linked records; and the lack of effective tools for generating records and databases and linking them to profiles and profile displays. These are but a few shortcomings of present bioinformatics methods and systems for acquiring, storing, manipulating, analyzing, comparing, linking and displaying gene expression profiling data. There is therefore a need for efficient, flexible, powerful, interactive methods and systems for storing, organizing, retrieving, manipulating, analyzing and displaying gene expression profiling data and related information.