With the completion of the Human Genome Project, genetic research is now being directed towards understanding complex multigenic diseases such as cancer and cardiac disease. Microarray technology has proven very useful for studying the expression patterns of thousands of genes simultaneously. With the availability of the entire genome, many tools have also been developed to generate inferences and predictions on a genome-wide basis, such as POMPOUS (Fondon et al, PNAS, 95(13)7514–9, 1998), which looks for potentially polymorphic genes. Efforts like the Program For Genome Application (PGA) are now being undertaken to study hundreds of genes associated with particular diseases or phenotypes. As a result, researchers frequently need to compile large lists of genes associated with certain diseases, phenotypes, keywords and their synonyms. The selection of array elements for large gene collections typically involves: finding possible gene candidates, generally by running a series of keyword searches on different databases; assembling the lists obtained from the various databases and eliminating redundancies; and annotating every gene on the tentative list in detail so that the researcher knows as much as possible about it.
The NCBI website provides a keyword search engine for various databases like GenBank, UniGene and LocusLink; however, the keyword search has to be done separately on each database. The resulting lists then need to be combined and, more importantly, the sequence redundancy needs to be eliminated. Eliminating the redundancy manually is not an easy task, since each database has its own unique identifier. It is primarily done based on the researcher's experience, and not all sequence redundancies are eliminated, especially for a large collection of genes. Additionally, the annotation for all the candidates on the list is not available in one place, so the researcher has to look up individual genes, which is a very laborious and time-consuming task.
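The redundancy problem described above can be illustrated with a minimal sketch. It assumes a hypothetical lookup table, `ACCESSION_TO_CLUSTER`, that maps each database-specific identifier to a shared gene-level key (for example a UniGene cluster ID). The identifiers, the table contents and the function name are invented for illustration and are not taken from any database or from the tool itself.

```python
from collections import defaultdict

# Hypothetical mapping from database-specific identifiers to a shared
# gene-level key. All identifiers here are made up for illustration.
ACCESSION_TO_CLUSTER = {
    "ACC0001": "Hs.1001",  # e.g. a GenBank-style accession
    "ACC0002": "Hs.1001",  # a second fragment of the same gene
    "LOC0042": "Hs.1001",  # e.g. a LocusLink-style identifier
    "ACC0003": "Hs.2002",
}


def merge_hit_lists(hit_lists):
    """Merge per-database hit lists and group them by gene-level key.

    Identifiers with no known mapping are kept under their own name so
    that nothing is silently dropped.
    """
    merged = defaultdict(set)
    for hits in hit_lists:
        for identifier in hits:
            key = ACCESSION_TO_CLUSTER.get(identifier, identifier)
            merged[key].add(identifier)
    return {key: sorted(ids) for key, ids in merged.items()}


genbank_hits = ["ACC0001", "ACC0002", "ACC0003"]
locuslink_hits = ["LOC0042", "LOC0099"]
gene_list = merge_hit_lists([genbank_hits, locuslink_hits])
```

In this sketch five hits collapse to three entries, because three of the identifiers map to the same gene-level key.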
Websites like Genecards (Rebhan, M et al, Bioinformatics 14(8)656–64, 1998; <http://nciarray.nci.nih.gov/cards/>) provide a database of human genes, their products and their involvement in diseases. However, Genecards only offers information about the functions of all human genes that have an approved symbol, and a few selected others. Again, this information can only be accessed one gene at a time, and the annotation cannot be downloaded in any useful format for working with a large gene collection. DRAGON (Bouton CM et al, Bioinformatics 16(11)1038–9, 2000; <http://207.123.190,10/dragon.htm>) lets the researcher do a keyword search on multiple databases at one time, but the output is a list of accession numbers and definitions in text format, which is not linked to any of its annotations. The tool does not let the researcher select entries from the keyword search, move between pages, or merge lists obtained from different keyword searches. As a result, DRAGON does not help in systematically compiling a large gene collection. Further, DRAGON does not include important databases like GenBank and LocusLink, which are the most commonly used databases for searching candidate genes. None of these tools helps in eliminating sequence redundancies within the lists. Databases like LocusLink and Genecards attempt to integrate the unique characteristics from various databases and provide a broad summary on a single-gene basis. Nevertheless, they do not help in annotating a large gene collection. There is a need for a tool that comprehensively gathers annotation related to all these elements in one place. The annotation tool of DRAGON only combines information from UniGene, Swissprot, Pfam and the KEGG pathway database, with 17 fields of annotation. However, these do not include important fields such as repeat, SNP, pathways and clones, which would be of great value.
Additionally, including a numeric value (for example, expression data for microarrays or purity of repeats for polymorphism) in the final annotation table would make it convenient for the user to extract information from the table. As the number of gene collections grows, it also becomes necessary to combine several collections of genes obtained from different sources.
The production of DNA microarrays can be divided into four stages: a. Selection of array elements and design of the probe DNA; b. Preparation of the probe DNA; c. Preparation of a suitable design substrate to spot the probes on; d. Deposition of array elements. The selection of array elements for microarrays involves assembling a large gene collection. It would be very valuable if the same tool used to compile a large gene collection could also be used to design primers, look for commercially available clones (expression microarrays) and design resequencing probes (resequencing microarrays). Once the genes are spotted on the microarray and hybridized to fluorescently labeled probes, a number of software programs, e.g. Genepix (<http://www.axon.com/GN GenePixSoftware.html>) and ArrayVision (<http://imaging.brocku.ca/products/Arrayvision.htm>), help convert the fluorescence of the scanned image to numbers, using complex mathematical corrections to extract signal from background noise. These numbers indicate the level of expression. Other programs, such as GeneSpring (Silva et al, HMS Beagle: The BioMedNet Magazine issue 82, 2000), Cluster Treeview (Eisen MB et al, Proc Natl Acad Sci U S A 95) and Spotfire (<http://www.spotfire.com>), help in the analysis by clustering the data using various methods based on K-means, hierarchical clustering or self-organizing maps. Clustering algorithms use the expression level data to group the various elements on the array. It would also be very useful to view the elements of the array with their complete annotation and overlay the expression level data on top of it. The data could further be viewed selectively by sorting on various annotation fields and on the expression level data. This approach could be useful for viewing any large gene collection in general.
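The idea of overlaying expression levels on an annotation table and then sorting on annotation fields can be sketched as follows. This is only an illustration under stated assumptions: the field names (`element`, `gene`, `pathway`), the values and the helper functions `overlay` and `sort_rows` are hypothetical and do not describe the layout used by any of the programs named above.

```python
# Hypothetical annotation records for array elements; field names and
# values are invented for illustration.
annotation = [
    {"element": "E1", "gene": "GENE_A", "pathway": "glycolysis"},
    {"element": "E2", "gene": "GENE_B", "pathway": "apoptosis"},
    {"element": "E3", "gene": "GENE_C", "pathway": "apoptosis"},
]

# Hypothetical expression levels, as produced by image-analysis software.
expression = {"E1": 2.5, "E2": 0.4, "E3": 3.1}


def overlay(annotation_rows, expression_levels):
    """Return copies of the annotation rows with an added 'expression' field."""
    return [
        {**row, "expression": expression_levels.get(row["element"])}
        for row in annotation_rows
    ]


def sort_rows(rows, annotation_field):
    """Sort by an annotation field, then by descending expression level.

    Rows with no expression value are treated as 0.0 for ordering only.
    """
    return sorted(
        rows,
        key=lambda r: (r[annotation_field], -(r["expression"] or 0.0)),
    )


table = sort_rows(overlay(annotation, expression), "pathway")
```

Sorting on `pathway` groups the two apoptosis elements together and lists the more highly expressed one first.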
With the increasing number of microarray experiments, it would be valuable to compare elements between different microarrays, considering that fragments of the same gene might be represented by different sequence identifiers. For example, two different accession numbers might belong to the same UniGene cluster, representing the same gene. An artifact sometimes observed in the results of an expression profiling microarray experiment is that some sequences hybridize to other sequences to which they are significantly similar. This leads to false positive results. Although Human Cot DNA is often used to prevent non-specific hybridization by blocking simple repetitive elements in genomic DNA, experiments studying cross-hybridization have shown that it is not very effective in preventing cross-hybridization. ARROGANT computationally estimates the amount of cross-hybridization for each sequence and tags potential genes as possible candidates for cross-hybridization.
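The comparison of two arrays at the gene level, rather than by raw sequence identifier, might look like the short sketch below. The accession-to-cluster table and all identifiers are invented, and `shared_genes` is a hypothetical helper, not a function of ARROGANT.

```python
def shared_genes(array_a, array_b, accession_to_cluster):
    """Return the cluster IDs represented on both arrays.

    Accessions without a cluster assignment are ignored in this sketch.
    """
    def clusters(accessions):
        return {
            accession_to_cluster[a]
            for a in accessions
            if a in accession_to_cluster
        }

    return clusters(array_a) & clusters(array_b)


# Hypothetical accession-to-cluster table; identifiers are invented.
mapping = {"ACC0001": "Hs.1001", "ACC0002": "Hs.1001", "ACC0003": "Hs.2002"}
common = shared_genes(["ACC0001", "ACC0003"], ["ACC0002"], mapping)
```

Here the two arrays share no accession number, yet the comparison still reports one gene in common because `ACC0001` and `ACC0002` fall in the same cluster.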
Several computational tools and databases are available which may be used in the development of the code for working with large gene collections. Some of them are discussed here in brief.
1. PRIMO: PRIMO (Li et al, Genomics 40(3) 476–85, 1997) is a program that was developed to design primers for large-scale DNA sequencing projects. PRIMO designs primers (short sequences, typically 20 bases long) that are used to amplify sequences (0.4 kb–2 kb) using PCR. PRIMO can be made to design primers to amplify a specific region. It can be run in batch mode, and the region for primer design can be specified separately for each sequence. The parameters file (including parameters such as oligo length and melting temperatures) can be altered. The code is written in ANSI C and is available locally on an HP/UX computer. It has been used successfully to design primers for the past couple of years and is available on the web at <http://atlas.swmed.edu>. This makes PRIMO a very important tool for designing primers to amplify a large number of sequences simultaneously.
2. BLAST: BLAST (Basic Local Alignment Search Tool) is an alignment tool, developed by NCBI, for searching for similar sequences (protein or DNA) (Altschul et al, Journal of Molecular Biology 215(3)403–10, 1990). It is available at <http://www.ncbi.nlm.nih.gov/BLAST/>. ARROGANT uses the BLAST output to estimate cross-hybridization for microarrays. Each element on the array is BLASTed against the entire UniGene database, and the BLAST output is parsed to detect overlaps of 65 contiguous hydrogen bonds, which is used as the threshold for cross-hybridization.
3. Rep-X: Rep-X (Wren et al, American Journal of Human Genetics 67(2)345–56, 2000) uses the UniGene database and generates a list of repeats, hairpin and palindrome sequences. This code runs on an HP/UX computer. The output of Rep-X is incorporated into ARROGANT to look for repeats, hairpins and palindrome sequences.
4. NCBI Databases: NCBI provides databases used by ARROGANT (downloaded and implemented locally) to annotate gene collections and find potential candidates associated with keywords. The databases include: a. GenBank (Benson D A et al, Nucleic Acids Res 28(1)15–18, 2000): An annotated collection of all publicly available DNA sequences provided by NIH; b. UniGene (Schuler, J Mol Med 75(10)694–8, 1997): Partitions GenBank EST sequences into a non-redundant set of gene oriented clusters; c. LocusLink (Pruitt et al, Nucleic Acids Res 29(1)137–40, 2001): Integrates and provides a single query interface to cluster sequences and makes available descriptive information about genetic loci; d. HomoloGene (Zhang et al, J. Comp. Biol. 7(1–2)203–14, 2000): The database of calculated orthologs and homologs between all UniGene clusters by each pair of organisms.
5. KEGG Databases: KEGG (Kyoto Encyclopedia of Genes and Genomes) (Kanehisa, M., Oxford University Press 2000) provides genome and pathway databases for a large number of organisms. ARROGANT uses these databases (downloaded and implemented locally) to look for potential gene candidates and their pathways, and to annotate gene collections.
6. Clone Databases: Commercially available clone databases include the IMAGE (G. Lennon et al, Genomics 33(1)151–2, 1996) Consortium, which shares high quality arrayed cDNA libraries and provides sequence, map, and expression data on the clones in these arrays to the public domain; vendors include Research Genetics, Incyte Genomics, etc.
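The cross-hybridization threshold mentioned under BLAST (item 2 above) can be made concrete with a minimal sketch. The text does not say how hydrogen bonds are counted or how the BLAST output is parsed, so the following is one plausible reading under stated assumptions: an A-T pair contributes 2 hydrogen bonds and a G-C pair contributes 3, and a run is broken by any mismatch, gap or ambiguous base. The function names are hypothetical, and the inputs are taken to be the two rows of a single pairwise alignment that has already been extracted from the BLAST report.

```python
# Assumed hydrogen-bond counts per Watson-Crick base pair (A-T: 2, G-C: 3).
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def max_contiguous_h_bonds(query, subject):
    """Largest hydrogen-bond total over any gap-free run of identical bases.

    `query` and `subject` are the two rows of one pairwise alignment, of
    equal length, with '-' marking gaps.
    """
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    best = current = 0
    for q, s in zip(query.upper(), subject.upper()):
        if q == s and q in H_BONDS:
            current += H_BONDS[q]
            best = max(best, current)
        else:
            current = 0  # a mismatch, gap or ambiguous base ends the run
    return best


def may_cross_hybridize(query, subject, threshold=65):
    """Flag an alignment whose best run meets the hydrogen-bond threshold."""
    return max_contiguous_h_bonds(query, subject) >= threshold
```

Under these assumptions a perfectly matched run of 22 G or C bases (66 hydrogen bonds) would be flagged, while a run of 21 (63 hydrogen bonds) would not. Counting bonds rather than bases makes GC-rich overlaps reach the threshold sooner than AT-rich ones.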