The advent of new experimental technologies that support molecular biology research has resulted in an explosion of data and a rapidly increasing diversity of biological measurement data types. Examples of such biological measurement types include gene expression from DNA microarray or Taqman experiments, protein identification from mass spectrometry or gel electrophoresis, cell localization information from flow cytometry, phenotype information from clinical data or knockout experiments, and genotype information from association studies and DNA microarray experiments. This data is rapidly changing. New technologies frequently generate new types of data.
High-throughput techniques are generating huge amounts of biological data which are readily available, but which must still be interpreted. Experiments that measure thousands of genes and proteins (microarray, imminent protein-array technologies, etc.) simultaneously and under different conditions are becoming the norm in both academia and pharmaceutical/biotech companies. The ability to visualize this data in a meaningful way is an important consideration in making this data more accessible and widely used.
Tools are needed that allow a user to visualize data meaningfully, by sorting and/or presenting the data in a way that is more ordered or more easily interpreted by a human viewer, so that the data becomes readily accessible and widely usable by researchers. Such tools and visualization methods could also prove useful to clinician-researchers in accessing such data to classify different tissue types, e.g., tumor types, with the goal of deciding upon a therapeutic regimen. Such visualization tools may ultimately be used as aids by physicians to explain test results to patients.
The Entrez Map Viewer provided by the National Center for Biotechnology Information (NCBI) provides graphical displays of features on NCBI's assembly of human genomic sequence data as well as cytogenetic, genetic, physical, and radiation hybrid maps. The Entrez Map Viewer integrates human sequence and map data from a variety of sources, and displays various types of maps including sequence, cytogenetic, genetic linkage, radiation hybrid and YAC contig.
Among the mapping or visual display features available in the Entrez Map Viewer, the sequence maps provide localization of FISH-mapped clones. When a clone insert sequence is used for contig assembly and spans a large region (e.g., >1 Mb), the clone is also marked to span the same region. Component maps show components of the human genome assembly by showing the placement of individual GenBank sequence entries that were used to generate the genomic contigs. This represents a tiling path for the human genome sequence, based on the relationship of overlapping clones. Contig maps show the chromosomal placement of contigs that have been assembled at NCBI using finished and draft high-throughput genomic (HTG) sequence data. Any individual contig can be assembled from finished sequence (phase 3 HTG), draft sequence (phase 1 and 2 HTG), or a mixture of both. CpG Island maps show regions of high G+C content on the assembled genome sequence.
The dbSNP haplotype map displays intervals of contig sequence when there is information in dbSNP to define chromosomal haplotypes. Typically, there is more than one haplotype allele at any sequence location, and the set of alleles revealed in a particular experiment is referred to as a dbSNP Haplotype Set. The verbose text for the dbSNP_Hap map gives the submitter handle and local haplotype set IDs for the interval. Haplotype set features are linked to the dbSNP Hapset display page where all haplotypes in the set are displayed. Haplotype sets from different labs may be presented side by side when they span a common interval of contig sequence.
The GenBank DNA map shows the placement of human genomic DNA sequences from GenBank that were not used in the assembly of contigs. The placement is based on the alignment of the sequences to the components of the contigs. The GenBank DNA map includes human genomic sequences longer than 500 bp that have at least 97% identity to the components for at least 98 base pairs. If a sequence extends beyond a contig, that portion of sequence is not shown. A “hits” link is provided which, when selected, leads to a tabular display that shows the matching regions (base spans) of the assembly component and the GenBank genomic DNA record that has been aligned to it.
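The inclusion criteria described above for the GenBank DNA map can be pictured as a simple filter. The following is only an illustrative sketch, not NCBI's implementation: the `Alignment` record, its field names, the function names, and the 1-based inclusive coordinate convention are all assumptions made for this example. Only the three thresholds (longer than 500 bp, at least 97% identity, at least 98 aligned base pairs) and the rule that sequence extending beyond a contig is not shown come from the description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Alignment:
    """Hypothetical record: one GenBank sequence aligned to an assembly component."""
    accession: str
    sequence_length: int     # total length of the GenBank sequence, in bp
    aligned_length: int      # length of the aligned region, in bp
    percent_identity: float  # identity over the aligned region, 0-100


# Thresholds taken from the description of the GenBank DNA map.
MIN_SEQUENCE_LENGTH = 500   # sequence must be LONGER than this
MIN_IDENTITY = 97.0         # at least this percent identity
MIN_ALIGNED_LENGTH = 98     # over at least this many base pairs


def qualifies_for_genbank_dna_map(a: Alignment) -> bool:
    """Return True if the alignment meets all three stated criteria."""
    return (
        a.sequence_length > MIN_SEQUENCE_LENGTH
        and a.percent_identity >= MIN_IDENTITY
        and a.aligned_length >= MIN_ALIGNED_LENGTH
    )


def clip_to_contig(start: int, end: int, contig_length: int) -> Optional[Tuple[int, int]]:
    """Keep only the part of [start, end] that lies on the contig.

    Coordinates are assumed to be 1-based and inclusive. Returns None when
    nothing of the sequence falls on the contig.
    """
    clipped_start = max(start, 1)
    clipped_end = min(end, contig_length)
    if clipped_start > clipped_end:
        return None
    return (clipped_start, clipped_end)
```

Under these assumptions, a 501 bp sequence aligned over 98 bp at 97% identity would be placed on the map, while a 500 bp sequence with the same alignment would not.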
Gene_Sequence maps identify genes that have been annotated on the genomic contigs. These include known and putative genes placed as a result of alignments of mRNAs to the contigs. If multiple models exist for a single gene, corresponding to splicing variants, the Gene_Sequence map presents a flattened view of all the exons that can be spliced together in various ways. A GenomeScan map uses models generated by GenomeScan, in which mRNA alignments were used to segment the genomic sequence by putative gene boundaries, and GenomeScan was executed on these segments to predict genes. SAGE Tag maps show SAGE tag sequences aligned to the genomic contigs, and provide connections to SAGEmap, NCBI's resource for serial analysis of gene expression data.
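One way to picture the flattened view of exons described above for the Gene_Sequence map is as a merge of overlapping intervals across all transcript models of a gene. The sketch below is an assumption about how such a view could be computed, not a description of the Map Viewer's actual method; the function name `flatten_exons`, the representation of a model as a list of `(start, end)` tuples, and the inclusive coordinates are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive genomic coordinates (assumed)


def flatten_exons(transcript_models: List[List[Exon]]) -> List[Exon]:
    """Merge the exons of all models into sorted, non-overlapping spans."""
    # Pool every exon from every model and sort by start coordinate.
    exons = sorted(exon for model in transcript_models for exon in model)

    flattened: List[Exon] = []
    for start, end in exons:
        if flattened and start <= flattened[-1][1]:
            # Overlaps the previous span: extend that span if needed.
            prev_start, prev_end = flattened[-1]
            flattened[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap: start a new span.
            flattened.append((start, end))
    return flattened


# Two hypothetical splice variants of one gene.
models = [
    [(100, 200), (300, 400)],
    [(100, 250), (300, 400), (500, 600)],
]
flat = flatten_exons(models)  # [(100, 250), (300, 400), (500, 600)]
```

Exons that merely abut without overlapping are kept separate in this sketch; a real display might choose to join them.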
STS maps provide placement of STSs from a variety of sources onto the genomic data using electronic-PCR (e-PCR). The markers are from RHdb, GDB, GeneMap'99 (gene-based markers), Stanford G3 RH map (both gene and non-gene markers), TNG map, Whitehead RH map and YAC maps (both gene and non-gene markers), Genethon genetic map, Marshfield genetic map, and several chromosome-specific maps, such as the NHGRI map for chromosome 7 and the Washington University map of chromosome X. Transcript (RNA) maps are diagrams of the RNAs that are predicted on the genomic contigs. The Transcript map shows the combinations of exons (i.e., splice variants) that are valid, based on mRNA sequences.
The UniGene_Human map displays human mRNA and EST sequences aligned to the assembled human genomic sequence that has been repeat-masked and dusted. Only ESTs supplied with orientation are used. Each alignment is the single best placement for that sequence in the current build of the human genome. The display of the UniGene_Human map varies according to the span of sequence being displayed. For large spans of sequence (greater than 10 million bases), the Map Viewer displays histograms that show the density of ESTs and mRNAs aligned to a region, the UniGene clusters to which they belong, and the number of sequences from each UniGene cluster. For smaller spans of sequence (i.e., higher resolutions, showing less than 10 million bases), the Map Viewer displays the above information, plus blue lines that indicate exon/intron structure.
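The span-dependent behavior described above for the UniGene_Human map amounts to choosing display layers from the width of the region being viewed. The following minimal sketch illustrates that rule only; the function name, the layer labels, and the 1-based inclusive coordinates are hypothetical. The description does not say how a span of exactly 10 million bases is handled, so treating it as a large span here is an assumption.

```python
from typing import List

# Threshold stated in the description: 10 million bases.
SPAN_THRESHOLD_BASES = 10_000_000


def unigene_display_layers(span_start: int, span_end: int) -> List[str]:
    """Return the (hypothetical) layers to draw for the given sequence span."""
    span = span_end - span_start + 1  # inclusive coordinates assumed

    # Shown at every resolution.
    layers = [
        "est_mrna_density_histogram",
        "unigene_clusters",
        "sequences_per_cluster",
    ]

    # Smaller spans (higher resolution) additionally show exon/intron structure.
    # Assumption: a span of exactly the threshold is treated as a large span.
    if span < SPAN_THRESHOLD_BASES:
        layers.append("exon_intron_structure")
    return layers
```

For example, a 50 million base view would return only the three histogram-related layers, while a 1 million base view would also include the exon/intron layer.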
A current implementation for visualizing genomic gene expression data from microarrays is provided by GeneSpring 5.0 (available from Silicon Genetics of San Carlos, Calif.). GeneSpring 5.0 provides a genome browser for displaying genes of interest, or all of the genes, in a genome view. The genes may be displayed as vertical bars along horizontal lines, corresponding to their positions along the chromosomes where they are located. The browser provides zooming capabilities wherein a portion of the genome map may be expanded to view in greater detail, but this view does not maintain focus and context. Thus, only one mapping view at a time is available, and the user must continually switch between screens if any comparisons among different views are to be made.
A genome-wide transcriptome map of non-small cell lung carcinomas based on gene-expression profiles generated by serial analysis of gene expression (SAGE) was constructed as reported in Fujii et al., “A Preliminary Transcriptome Map of Non-Small Cell Lung Cancer”, Cancer Research 62, 3340-3346, Jun. 15, 2002. An overlay of four experiments (4 representative sets) was displayed on a transcriptome map, with the expression information being extracted and mapped via SAGE.
While some limited mapping of information to transcriptome maps has been attempted, it would be further desirable to provide tools and techniques that not only present gene expression data and other types of data on genomic maps representing the loci of the genes responsible for the expression data, or map biomolecule abundance to corresponding gene/chromosome locations (including mapping of CGH gene amplifications/deletions, mRNA expression, protein abundance, and the like), but also correlate such data more transparently and automatically with appropriate placement on the maps. Further, correlation with other map formats and automatic linking between such formats would be desirable. Still further, the ability to visualize more than just the bare expression data would be desirable, whether that additional information appears on the face of the visualization or is otherwise embedded but accessible at the will of the user. Further desirable would be the provision of such maps having interactive capability so that the user can alter the data that is displayed.