The present invention relates to the fields of relational databases, database use, graphical presentation of data, genomics, and gene discovery.
Relational databases continue to grow in number and complexity, and making sense of the data they contain is becoming a correspondingly more daunting task. The data in these databases are often packaged or tagged in alphanumeric form to facilitate handling and sorting. Examples can be found in many of the databases that contain chemical and biological information, such as databases of Expressed Sequence Tags, or ESTs. Indeed, the databases in this field illustrate the general problem that has arisen in database management, namely, how a researcher can effectively use the massive amount of information that is available.
Research into gene discovery, for example, often focuses on ESTs, which in general reflect the diversity of gene expression in living organisms.
These sequences result from an established path in the laboratory: pieces of single-stranded messenger ribonucleic acid (mRNA) are isolated from organic tissue, converted into double-stranded complementary deoxyribonucleic acid (cDNA), cloned into vector replicons, and then transformed into Escherichia coli or other expression systems for replication. Deoxyribonucleic acid (DNA) is extracted from these clones and sequenced using high-throughput methods, resulting in pools of EST data (Adams et al., 1992, Sequence Identification of 2375 Human Brain Genes, Nature 355:632-34). Such methods mean that a given set of sequences, often called a “library”, shares a common origin, i.e., its members have the same species, cultivar, tissue, condition, and stress attributes. Their characteristics represent a snapshot of the organism, captured at the moment in time when the researcher isolated the mRNA.
The abundance of EST data has increased dramatically in the past few years. The plant tribe Triticeae, for example, includes several closely related crop plants of major economic importance, including wheat, barley and rye (Barkworth et al., 1992, Taxonomy of the Triticeae, A Historical Perspective, Hereditas 116:1-14; and Kellogg, 2001, Evolutionary History of the Grasses, Plant Physiology 125:1198-1205). In 1998, only a handful of ESTs from Triticum sp. plants were available; now the number of ESTs for Triticum sp. exceeds 750,000 (NCBI dbEST, 2003). This information has been assembled into vast databases, which are growing exponentially from year to year.
Managing such massive amounts of data is difficult and labor intensive. For example, to help process this overload of information and to remove redundancy from within an EST data set, sequences can be aligned and clustered using various assembly algorithms, some of the more popular being CAP3 (Huang and Madan, 1999, CAP3: A DNA Sequence Assembly Program, Genome Res. 9:868-877), phrap (Green, 2003, The Phrap Program, www.phrap.org) and d2_cluster (Burke et al., 1999, D2 Cluster: a Validated Method for Clustering EST and Full-Length cDNA Sequences, Genome Res. 9:1135-1142). Moreover, in building such an assembly, a set of unique gene sequences can be assembled into “unigenes”, essentially representing the range of genes present in an organism (Liang et al., 2000, An Optimized Protocol for Analysis of EST Sequences, Nucleic Acids Res. 28:3657-65; and Quackenbush et al., 2000, The TIGR Gene Indices: Reconstruction and Representation of Expressed Gene Sequences, Nucleic Acids Res. 28:141-45). The success of such an assembly relies on the quality of the sequence data and on the parameters within the software that govern the sequence-by-sequence comparisons, all of which has become quite difficult to manage because of the sheer mass of information that must be evaluated and processed.
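By way of illustration only, the general idea of clustering ESTs to remove redundancy can be sketched in a few lines of Python. The sketch below is not CAP3, phrap, or d2_cluster, and it does not reproduce their algorithms or parameters; it is a simplified stand-in that links two sequences whenever they share a minimum number of fixed-length words (k-mers) and then groups linked sequences together. The function names `kmers` and `cluster_ests`, the parameters `k` and `min_shared`, and the example sequences are all hypothetical and chosen for this illustration.

```python
def kmers(seq, k=8):
    """Return the set of all length-k substrings (words) of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=3):
    """Group ESTs by single linkage on shared k-mers.

    ests: dict mapping an EST identifier to its nucleotide sequence.
    Two ESTs are linked when they share at least `min_shared` k-mers;
    clusters are the connected groups of linked ESTs.
    (Illustrative assumption: real assemblers use alignment-based
    overlap criteria and quality values, not a bare word count.)
    """
    ids = list(ests)
    parent = {i: i for i in ids}  # union-find forest, one node per EST

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    words = {i: kmers(seq, k) for i, seq in ests.items()}

    # Pairwise comparison: quadratic in the number of ESTs, which is
    # exactly why large data sets become costly to process.
    for n, a in enumerate(ids):
        for b in ids[n + 1:]:
            if len(words[a] & words[b]) >= min_shared:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical example data: EST1 and EST2 overlap, EST3 is unrelated.
example_ests = {
    "EST1": "ATGGCGTACGTTAGCCTAGGCTA",
    "EST2": "GCGTACGTTAGCCTAGGCTAAACGT",
    "EST3": "TTTTCCCCGGGGAAAATTTTCCCC",
}
example_clusters = cluster_ests(example_ests)
```

In this toy data set, EST1 and EST2 fall into one cluster and EST3 remains on its own. The outcome depends on the chosen word length and threshold, which mirrors the point made above that the result of an assembly is sensitive to the parameters of the software used.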
As these databases have grown larger and larger, the amount of time and labor needed to use them has also grown. What is needed is a type of search and analytical tool that can be used with large databases in general, and particularly with those that use alphanumeric characters to identify underlying information, which is to say virtually all databases.