1. Field of the Invention
The present invention relates to a computer research tool for searching and displaying biological data. More specifically, the invention relates to a computer research tool utilizing a novel graphical user interface (GUI) for performing computerized research of biological data from various databases and for providing enhanced graphical representation of biological data, progressive querying, and cross-navigation of relational data.
2. Description of the Related Art
Trillions of pieces of information generated by emerging technologies in molecular biology and genetics are stored digitally in computer databases worldwide. In fact, every day approximately 2000 nucleotide sequences are deposited in publicly accessible databases. With such a large amount of multidimensional data, researchers rely on complex information systems to find, summarize and interpret this biological information. This has resulted in the creation of a new field of science known as bioinformatics, which combines the power of biochemistry, mathematics, and computers. Bioinformatics has allowed the development of new databases and computational technologies to help in understanding the biological meaning encoded in vast collections of sequence data.
A. Biochemistry Overview
An understanding of the biological meaning of sequence data begins with a study of the 20 amino acids that make up proteins. Deoxyribonucleic acid (DNA) contains the blueprints for these structures. DNA is composed of very long polymers of chemical sub-units known as nucleotides. Each nucleotide includes one of four nitrogenous bases: adenine (A), thymine (T), cytosine (C) and guanine (G). DNA serves as a template for ribonucleic acid (RNA), which serves as a template for proteins. Like DNA, RNA is also composed of nucleotides. Each RNA nucleotide includes one of four nitrogenous bases. The bases of RNA differ from those of DNA only in the substitution of uracil (U) for thymine (T). Three nucleotides of DNA encode three nucleotides of RNA, which in turn encode one amino acid of a protein.
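By way of illustration only, the three-nucleotide relationship described above can be sketched in Python. The function names `transcribe` and `translate` and the deliberately partial `CODON_TABLE` are assumptions made for this sketch. The sketch also simplifies transcription by reading the DNA coding strand directly, so that the RNA differs from the DNA only by the T to U substitution.

```python
# Illustrative subset of the genetic code (RNA codon -> one-letter amino acid).
# This table is intentionally partial; "*" marks a stop codon.
CODON_TABLE = {
    "AUG": "M",                                        # methionine (start)
    "UUU": "F", "UUC": "F",                            # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",    # glycine
    "AAA": "K", "AAG": "K",                            # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",                # stop
}


def transcribe(dna_coding_strand: str) -> str:
    """Return RNA with the same sequence as the coding strand (T replaced by U)."""
    return dna_coding_strand.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read the RNA three bases at a time and map each codon to an amino acid."""
    protein = []
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "X")  # X: not in partial table
        if amino_acid == "*":
            break  # stop codon ends translation
        protein.append(amino_acid)
    return "".join(protein)


rna = transcribe("ATGTTTGGCAAATAA")
protein = translate(rna)
```

Here five DNA triplets yield five RNA codons, the last of which is a stop codon, so the resulting peptide has four residues.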
Proteins are macromolecules (i.e., polymers of covalently bonded amino acids) which show great diversity in physical properties, thereby fulfilling a broad range of biological functions. A protein's structure and function depend upon its amino acid sequence, which is determined by the nucleotide sequence of the RNA which produced it, which in turn is determined by the nucleotide sequence of the DNA that produced the RNA. Hence, the great diversity observed in the sequence of amino acids is the direct result of the many possible permutations of DNA and RNA. Protein structure is described at four levels. The primary structure is the sequence of amino acids covalently bonded together. The secondary structure is the result of hydrogen bonding between amino acids of the polypeptide; this bonding causes the chain to develop specific shapes (alpha helix, beta sheet). The tertiary structure is the 3-dimensional folding of the alpha helix or the pleated sheet. The quaternary structure is the spatial relationship between the different polypeptides in the protein.
B. Sequence Comparison
Sequence comparison is a very powerful tool in molecular biology, genetics and protein chemistry. Frequently, it is unknown which proteins a new DNA sequence codes for, or whether it codes for any protein at all. When a new coding sequence is compared with all known sequences, there is a high probability of finding a similar sequence. Usually one tries to determine what level of similarity is shared between the proteins in terms of structural and functional characteristics. This determination is made by comparing the amino acid sequences of the proteins. It has been observed that the primary structures of a given protein from related species closely resemble one another. Comparisons of the primary structures of homologous proteins (evolutionarily related proteins) indicate which of the proteins' amino acid residues or domains (i.e., stretches of amino acids) are essential to their function, which are of lesser significance, and which have little specific function. Sequences which are found in similar positions of functionally similar proteins are said to be homologous, conservatively substituted or highly conserved. A popular computational tool for rapid comparison of a search sequence to a database of known sequences is the BLAST search. The advantage of a BLAST search is the ability to find matches to distantly related sequences. The disadvantage is that the searches become computationally intensive and may take an inordinate length of time.
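As a minimal sketch of what comparing amino acid sequences can mean in practice, the following Python function computes the percent identity of two sequences that are assumed to be already aligned to equal length, with "-" denoting a gap. This is not the BLAST algorithm; it is a much simpler measure chosen for illustration, and the name `percent_identity` and the example sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of identical residues over aligned, non-gap positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(seq_a, seq_b):
        if res_a == "-" or res_b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if res_a == res_b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two hypothetical eight-residue fragments differing at one position.
identity = percent_identity("MKTAYIAK", "MKTAHIAK")
```

A measure of this kind gives one number per pair of sequences; a database search repeats such a comparison, in far more elaborate form, against every known sequence, which is why search time grows with database size.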
C. Biological Databases
In order to perform these sequence comparisons, databases of known biological data need to be accessed. There are many different databanks (databases) where biological information such as DNA and protein sequence data is stored, including general biological databanks such as EMBL/GENBANK (nucleotide sequences), SWISS-PROT (protein sequences), and PDB/Protein Data Bank (protein structures). [See, e.g., “Comprehensive, Comprehensible, Distributed and Intelligent Databases: Current Status” by Frishman, et al. Bioinformatics Review, Vol. 14, No. 7, 1998, pgs. 551-561, incorporated herein by reference]. Specifically, GenBank is an annotated collection of all publicly available DNA sequences. As of August 1999, there were approximately 3,400,000,000 bases in 4,610,000 sequence records. GenBank at NCBI, together with the DNA DataBank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL), makes up the International Nucleotide Sequence Database Collaboration. SWISS-PROT is an annotated protein sequence database maintained by the Department of Medical Biochemistry of the University of Geneva. The PDB/Protein Data Bank, maintained by Brookhaven National Laboratory, contains all publicly available solved protein structures. These databases contain large amounts of raw sequence data which can be cumbersome to use.
In an effort to provide a more useful form of biological data, there are a number of derived or structured databases which integrate information from multiple primary sources, and may include relational/cross-referenced data with respect to sequence, structure, function, and evolution. A derived database generally contains added descriptive materials on top of the primary data or provides novel structuring of the data based on certain defined relationships. Derived/structured databases typically structure the protein sequence data into usable sets of data (tables), grouping the protein sequences by family or by homology domains. A protein family is a group of sequences that can be aligned from end to end and that are less than 55% different globally. A homologous domain is a subsequence of a protein that is distinguished by a well-defined set of properties or characteristics and may also occur in at least two different subfamilies.
An example of a structured database is ProDom, a protein domain database, consisting of an automatic compilation of homologous domains. The database was designed as a tool to help analyze domain arrangements of proteins and protein families. Current versions of the ProDom database are built using a procedure based on recursive PSI-BLAST searches. ProDom contains 57,976 domain families, sorted by decreasing number of protein sequences in the families. ProDom is generated from the SWISS-PROT database by automated sequence comparison.
Similarly, DOMO is a database of homologous protein domain families. DOMO was obtained from successive sequence analysis steps including similarity search, domain delineation, multiple sequence alignment, and motif construction. DOMO has analyzed 83,054 nonredundant protein sequences from SWISS-PROT and the PIR-International Sequence DataBase, yielding a database of 99,058 domain clusters grouped into 8,877 multiple sequence alignments.
Another derived protein sequence database is the Block Database. Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. The blocks for the Block Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in the Prosite Database. The blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches.
D. Researching Biological Databases
Typically, biological databases may be searched by either an unstructured (keyword) search or a structured (field based) search. An unstructured search of the database is performed by searching for a keyword or the ID of records. For example, a keyword search of “ecoli” retrieves a list of protein sequences that are identified by the keyword “ecoli”. A structured search is a more deliberate search, allowing, for example, the searching of the database for protein sequences which contain a particular sequence of interest.
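The distinction between the two search types can be sketched as follows, assuming a small in-memory list of records. The `Record` fields, the record IDs, and the sequences below are all invented for illustration and do not correspond to entries in any actual database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    record_id: str
    keywords: tuple
    sequence: str


# Hypothetical records used only to demonstrate the two search styles.
RECORDS = [
    Record("P001", ("ecoli", "kinase"), "MKTAYIAKQR"),
    Record("P002", ("yeast", "kinase"), "MSTAYIGGQR"),
    Record("P003", ("ecoli", "transport"), "MLLAVGGQRK"),
]


def keyword_search(records, term):
    """Unstructured search: match the term against record IDs and keywords."""
    term = term.lower()
    return [r for r in records
            if r.record_id.lower() == term or term in r.keywords]


def sequence_search(records, motif):
    """Structured search: match a subsequence against the sequence field only."""
    return [r for r in records if motif in r.sequence]


ecoli_hits = [r.record_id for r in keyword_search(RECORDS, "ecoli")]
motif_hits = [r.record_id for r in sequence_search(RECORDS, "TAYI")]
```

The keyword search returns whatever was annotated with the term, whereas the field-based search targets one named field and can therefore answer a more specific question.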
A well known search engine for conducting research on the GenBank database is ENTREZ, which utilizes keyword searching. If a search results in too many hits, ENTREZ allows the addition of new search terms to progressively narrow the number of hits. A researcher may then select all or a subset of the entries that match the search for display to generate a summary page that reports on each of the selected entries. The search results may be displayed in a variety of formats or standardized reports. The form that most biologists are familiar with, the GenBank report, shows the raw GenBank entry. Other familiar formats include FASTA and ASN.1. The genomes division of ENTREZ has a graphic interface based on alignments among multiple maps. The display image shows a series of genetic and physical maps published from a variety of sources, roughly aligned, with diagonal lines connecting common features.
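The idea of narrowing a hit list by adding search terms can be sketched generically as shown below. This is not the ENTREZ implementation or its interface; the `narrow` function, the entry IDs, and the keyword sets are assumptions made solely to show how each added term filters the previous result.

```python
# Hypothetical entries: entry ID -> set of keywords attached to the entry.
ENTRIES = {
    "E1": {"ecoli", "kinase", "membrane"},
    "E2": {"ecoli", "kinase"},
    "E3": {"ecoli", "transport"},
    "E4": {"yeast", "kinase"},
}


def narrow(entries, terms):
    """Apply terms one at a time, recording the surviving hits after each."""
    hits = set(entries)
    history = []
    for term in terms:
        # Each new term filters the hits left by the previous term.
        hits = {entry_id for entry_id in hits if term in entries[entry_id]}
        history.append((term, sorted(hits)))
    return history


steps = narrow(ENTRIES, ["ecoli", "kinase", "membrane"])
```

With these sample entries the hit list shrinks from three entries to two and then to one as each term is added.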
Another search system is SRS (Sequence Retrieval System) which is a Web-based system for searching among multiple sequence databases supported by EMBL. The SRS cross-references sequence information from approximately 40 other sequence databases including ones that hold protein and nucleotide sequence information, 3D structure, disease and phenotype information, and functional information. The SRS search allows structured queries on one or more databases with common fields (e.g., ID, AccNumber, Description). SRS displays the results as a series of hypertext links. The search can be broadened to other databases by bringing in cross-references.
A number of patents exist which relate to display and analysis of biological sequence data, including U.S. Pat. Nos. 5,891,632; 5,884,230; 5,878,373; 5,873,052; 5,864,488; 5,856,928; 5,842,151; 5,799,301; 5,795,716; 5,724,605; 5,724,253; 5,706,498; 5,701,256; 5,600,826; 5,598,350; 5,595,877; 5,577,249; 5,557,535; 5,524,240; 5,453,937; 5,187,775; 4,939,666; 4,923,808; 4,771,384; and 4,704,692; and PCT Patent No. WO96/23078; all of which are incorporated herein by reference.
E. Graphical User Interfaces (GUIs)
The development and proliferation of GUIs has greatly enhanced the ease with which users interact with biological databases both in the searching stage and in the display of information. A conventional GUI display includes a desktop metaphor upon which one or more icons, application windows, or other graphical objects are displayed. Typically, a data processing system user interacts with a GUI display utilizing a graphical pointer, which the user controls with a graphical pointing device, such as a mouse, trackball, or joystick. For example, depending upon the actions allowed by the active application or operating system software, the user can select icons or other graphical objects within the GUI display by positioning the graphical pointer over the graphical object and depressing a button associated with the graphical pointing device. In addition, the user can typically relocate icons, application windows, and other graphical objects on the desktop utilizing the well known drag-and-drop techniques. By manipulating the graphical objects within the GUI display, the user can control the underlying hardware devices and software objects represented by the graphical objects in a graphical and intuitive manner.
User interfaces used with multi-tasking processors also allow the user to work on many tasks at once, each task being confined to its own display window. The interface allows the presentation of multiple windows in potentially overlapping relationships on a display screen. The user can thus retain a window on the screen while temporarily superimposing a further window entirely or partially overlapping the retained window. This enables the user to divert attention from a first window to one or more secondary windows for assistance and/or references, so that overall user interaction may be improved. There may be many windows with active applications running at once. Oftentimes, the windows may be (dynamically or statically) related such that modifying a query in one window results in changes to the displayed data in the other related windows, thereby “propagating” the changes throughout.
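One common way to obtain such propagation between related windows is a shared selection object that notifies every attached view, sketched below without any actual GUI toolkit. The class names `SharedSelection` and `View`, and the sequence identifiers, are hypothetical; the sketch shows only the notification logic, not rendering.

```python
class View:
    """Stand-in for a display window showing some subset of entries."""

    def __init__(self, name, visible_ids):
        self.name = name
        self.visible_ids = set(visible_ids)
        self.highlighted = set()

    def on_selection(self, selected):
        # Highlight only those selected entries this window actually shows.
        self.highlighted = self.visible_ids & selected


class SharedSelection:
    """Holds the current selection and propagates changes to all views."""

    def __init__(self):
        self._views = []
        self.selected = set()

    def attach(self, view):
        self._views.append(view)

    def select(self, ids):
        self.selected = set(ids)
        for view in self._views:
            view.on_selection(self.selected)


selection = SharedSelection()
tree_view = View("tree", ["seqA", "seqB", "seqC"])
alignment_view = View("alignment", ["seqB", "seqC", "seqD"])
selection.attach(tree_view)
selection.attach(alignment_view)

# A selection made anywhere is reflected in every attached window.
selection.select(["seqB", "seqD"])
```

Routing every change through one shared object keeps the windows decoupled from one another: a window needs to know only how to react to a selection, not which other windows exist.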
A number of patents relate to Graphical User Interfaces. For example, the following patents, albeit not directed to biological data, are incorporated by reference herein: U.S. Pat. No. 5,926,806 to Marshall et al.; U.S. Pat. No. 5,544,352 to Egger; U.S. Pat. No. 5,777,616 to Bates et al.; U.S. Pat. No. 5,812,804 to Bates et al.; U.S. Pat. No. 5,146,556 to Hullot et al.; U.S. Pat. No. 5,893,082 to McCormick; U.S. Pat. No. 5,815,151 to Argiolas; U.S. Pat. No. 5,911,138 to Li; U.S. Pat. No. 5,761,656 to Ben-Shachar; U.S. Pat. No. 5,404,442 to Foster et al.; U.S. Pat. No. 5,917,492 to Bereiter et al.; U.S. Pat. No. 4,710,763 to Franke et al.; U.S. Pat. No. 5,828,376 to Solimene et al.; U.S. Pat. No. 5,748,927 to Stein et al.; U.S. Pat. No. 5,452,416 to Hilton et al.; and U.S. Pat. No. 5,721,900 to Banning.
F. Graphical User Interfaces for Biological Data Systems
As in most industries, software user interfaces for biological data have evolved from the former DOS text and command line interfaces to intuitive screen graphics which represent data in a user friendly manner. In order to evaluate and analyze data sequences from various biological databases, researchers often utilize graphical user interfaces to view biological data in a variety of ways, including multiple sequence alignments (MSAs), secondary structure predictions, two-dimensional graphical representations of sequences, and phylogenetic trees.
A multiple sequence alignment displays the alignment of homologous residues among a set of sequences in columns. In a 2D graphical representation, sequences are displayed as schematic boxes wherein each box is spatially oriented. Phylogenetic trees are genealogical trees which are built up with information gained from the comparison of the amino acid sequences of proteins. The phylogenetic tree (rooted or unrooted) is a graphical representation of the evolutionary distance between individual protein sequences in a family of proteins. The branch lengths of the phylogenetic tree represent evolutionary distances derived from the PAM matrix, an evolutionary model that assumes that mutation rates estimated for closely related proteins can be extrapolated to distant relationships.
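Because a multiple sequence alignment places homologous residues in the same column, simple per-column checks become possible. The sketch below, offered only as an illustration, reports the columns in which every sequence carries the same residue. The function name `conserved_columns` and the three short aligned sequences are hypothetical, and no substitution matrix such as PAM is applied.

```python
def conserved_columns(alignment):
    """Return 0-based indices of columns where all rows share one residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for col in range(length):
        residues = {row[col] for row in alignment}
        # A column of gaps is not counted as conserved.
        if len(residues) == 1 and "-" not in residues:
            conserved.append(col)
    return conserved


# Hypothetical alignment of three five-residue sequences.
columns = conserved_columns(["MKTAY", "MKSAY", "MRTAY"])
```

A graphical display could use such indices to shade the fully conserved columns, drawing the eye to the most highly conserved positions.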
A good example of a graphical user interface can be found in the ProDom interface. The output from a ProDom query for proteins sharing a homologous domain with a particular sequence may be displayed as 2D graphic representations, summarized alignments and trees, alignment in MSF format, and 3D structures. Specifically, the 2D graphical view presents domain arrangements for proteins sharing homology by showing each protein on a single line, starting with its name, hypertext-linked to SWISS-PROT, followed by a 2D view of schematic boxes, each box hypertext-linked to corresponding ProDom entries.
The limitation of most of these systems is that the graphical displays are both static and unrelated. A graphical display is static when a user is unable to refine or modify the search criteria from within the graphical display. Graphical displays are unrelated when a user's modification of one graphical display for a particular search is not correspondingly reflected in the remaining graphical displays for that search (i.e., no propagation). These limitations can make the analysis of protein sequences cumbersome and time consuming. Additionally, the inflexibility of these systems could result in an incorrect or incomplete analysis by limiting a user's ability to view all possible relationships.
Accordingly, there is a need in the art for a user friendly computerized research tool for biological data to provide more effective ways to retrieve and view interrelated information from a database. This system needs to provide a usable display for representing vast amounts of discrete information, permitting researchers to focus on the most relevant materials and discover new functional relationships. To effectively and efficiently analyze the information, there remains a need for a graphical user interface which provides increased flexibility by permitting the user to view any number of related or unrelated data displays from one or more databases at the same time. These displays need to be interlinked such that the selection of one or more entries in one of the display windows causes the other display windows to distinguishably display and act on those entries related to the selection. Progressive querying is also needed to allow the user to quickly discover new relationships based on the results of previous queries.
The present invention is designed to address these needs.