A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the xerographic reproduction by anyone of the patent document or the patent disclosure in exactly the form it appears in the U.S. Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
The present invention relates generally to the field of bioinformatics, and more particularly to techniques for facilitating the identification of candidate genes.
With recent advances in the identification of expressed sequence tags (ESTs) and the sequencing of the human genome, a number of researchers are now directing their efforts towards analyzing the data from the genome maps and sequences. A significant portion of this research is being directed towards identifying genes which might trigger, prevent, ameliorate, or somehow affect a variety of diseases or physiological states. Such genes are commonly referred to as "candidate" genes.
The identification of candidate genes is critical to entities such as drug companies who may use the information related to the candidate genes to identify better drug targets in the drug development process. The early identification of candidate genes could reduce the number of potential therapeutics moving through a company's clinical testing pipeline, significantly reducing overall costs and reducing the time taken by the company to market the drugs.
However, conventional techniques do not facilitate easy identification of candidate genes. This is due to the enormous amount of information being generated by researchers, and the lack of adequate tools to organize the information in a manner which facilitates its analysis. For example, techniques such as parallel expression and analysis using cDNA arrays, as described in U.S. Pat. No. 5,807,522, and synthetic DNA array technology, as described in U.S. Pat. Nos. 5,593,839 and 5,571,639, have been developed to study large-scale gene expression profiles (e.g., time-courses of a disease process, or comparisons between an altered physiologic or metabolic state and an untreated biological sample). Databases and algorithms have also been developed to analyze the results of the above-mentioned array technologies. Public databases of metabolic, genetic and physiological pathways of yeast (e.g., Munich Information Center for Protein Sequences (MIPS)) and some mammalian genes (e.g., Kyoto Encyclopedia of Genes and Genomes (KEGG)) have been developed largely from the published literature of many traditional low-throughput experimental studies. However, the information provided by the sources identified above and by other sources has not been integrated in a coherent manner conducive to identification of candidate genes.
Based on the foregoing, there is a need for techniques which can facilitate the identification of candidate genes. It is desirable that these techniques be able to correlate various types of information and store it in a format which can be easily accessed or queried by researchers interested in identifying candidate genes.
The present invention provides techniques for facilitating identification of candidate genes from a plurality of DNA sequences. According to an aspect of the present invention, techniques are provided for extracting and integrating information from various information sources and results of various analyses, and storing the integrated information in a form which facilitates identification of candidate genes.
According to an embodiment, the present invention accesses results of a homology search for the plurality of DNA sequences, annotative information for the plurality of DNA sequences indicating the biochemical functions and physiological roles of the plurality of DNA sequences, gene expression profile data for the plurality of DNA sequences describing behavioral patterns of the plurality of DNA sequences, results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data, and other information. The information accessed by the present invention is stored in a format, e.g. a database, which facilitates identification of candidate genes.
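By way of illustration only, the integration described in this embodiment might be sketched as follows. The sketch assumes a relational store (here an in-memory SQLite database) keyed on a sequence identifier; the table names, column names, the `build_database` helper, and the sample records are all hypothetical and are not taken from the disclosure.

```python
import sqlite3

def build_database(homology, annotations, expression, clusters):
    """Load four kinds of per-sequence records into one SQLite database.

    All table and column names are illustrative assumptions.
    """
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE homology   (seq_id TEXT, hit TEXT, e_value REAL);
        CREATE TABLE annotation (seq_id TEXT, biochemical_function TEXT,
                                 physiological_role TEXT);
        CREATE TABLE expression (seq_id TEXT, time_point REAL, level REAL);
        CREATE TABLE clustering (seq_id TEXT, cluster_id INTEGER);
    """)
    db.executemany("INSERT INTO homology VALUES (?, ?, ?)", homology)
    db.executemany("INSERT INTO annotation VALUES (?, ?, ?)", annotations)
    db.executemany("INSERT INTO expression VALUES (?, ?, ?)", expression)
    db.executemany("INSERT INTO clustering VALUES (?, ?)", clusters)
    db.commit()
    return db

# Made-up sample records, one list per information source.
db = build_database(
    homology=[("seq1", "kinase_A", 1e-40), ("seq2", "unknown", 0.5)],
    annotations=[("seq1", "protein kinase", "signal transduction")],
    expression=[("seq1", 0.0, 1.0), ("seq1", 4.0, 3.2), ("seq2", 0.0, 1.1)],
    clusters=[("seq1", 3), ("seq2", 7)],
)

# Because every table shares seq_id, the homology hit, annotation, and
# cluster assignment for one sequence can be read back with a single join.
row = db.execute(
    """
    SELECT h.hit, a.biochemical_function, c.cluster_id
    FROM homology h
    LEFT JOIN annotation a ON a.seq_id = h.seq_id
    LEFT JOIN clustering c ON c.seq_id = h.seq_id
    WHERE h.seq_id = ?
    """,
    ("seq1",),
).fetchone()
print(row)
```

Keying every table on a common sequence identifier is one possible design choice; it keeps each information source separately loadable while still allowing the combined view that candidate gene identification relies on.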
According to another embodiment, the present invention receives queries identifying criteria for the candidate genes. In response to the queries, the present invention searches the database storing information for the plurality of DNA sequences to identify a set of DNA sequences which satisfy the query criteria. The set of DNA sequences is then output as a result of the query.
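As a minimal sketch of such a query, the example below assumes the integrated information has been flattened into a single hypothetical `sequences` table and that a query is expressed as optional criteria on cluster membership, physiological role, and homology E-value. The `find_candidates` function, its parameters, the schema, and the sample rows are illustrative assumptions, not the claimed implementation.

```python
import sqlite3

def find_candidates(db, cluster_id=None, role_like=None, max_e_value=None):
    """Return the sorted seq_ids that satisfy every criterion that is given."""
    clauses, params = [], []
    if cluster_id is not None:
        clauses.append("cluster_id = ?")
        params.append(cluster_id)
    if role_like is not None:
        clauses.append("role LIKE ?")
        params.append(f"%{role_like}%")
    if max_e_value is not None:
        clauses.append("e_value <= ?")
        params.append(max_e_value)
    # With no criteria, "WHERE 1" matches every stored sequence.
    where = " AND ".join(clauses) or "1"
    sql = f"SELECT seq_id FROM sequences WHERE {where} ORDER BY seq_id"
    return [r[0] for r in db.execute(sql, params)]

# Hypothetical integrated table with made-up rows.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sequences "
    "(seq_id TEXT, e_value REAL, role TEXT, cluster_id INTEGER)"
)
db.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?, ?)",
    [
        ("seq1", 1e-40, "signal transduction", 3),
        ("seq2", 0.5, "unknown", 7),
        ("seq3", 1e-12, "signal transduction", 3),
        ("seq4", 1e-30, "lipid metabolism", 3),
    ],
)

# Query: sequences in cluster 3, with a signalling role and a strong homology hit.
candidates = find_candidates(db, cluster_id=3, role_like="signal", max_e_value=1e-10)
print(candidates)
```

Only the fixed clause text is interpolated into the SQL string; the user-supplied values are passed as bound parameters, which is the usual way to keep such criteria-driven queries safe.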
According to yet another embodiment of the present invention, a user may configure a query identifying criteria for the candidate genes and communicate the query to a server storing information related to a plurality of DNA sequences. According to this embodiment, the information related to the plurality of DNA sequences may comprise results of a homology search for the plurality of DNA sequences, annotative information for the plurality of DNA sequences describing the biochemical functions and physiological roles of the plurality of DNA sequences, gene expression profile data for the plurality of DNA sequences describing behavioral patterns of the plurality of DNA sequences, results from clustering the plurality of DNA sequences based on the behavioral patterns of the plurality of DNA sequences as described by the gene expression profile data, and other information. In response to the query, the user receives a first set of DNA sequences which satisfy the criteria for the candidate genes identified in the query.
The invention will be better understood by reference to the following detailed description and the accompanying figures.