1. Field of the Invention
The present invention relates generally to the field of bioinformatics, and specifically to systems, methods, and computer program products that make use of clustering of data, statistical natural language processing, and machine learning, for purposes of analyzing data acquired using DNA arrays that have been hybridized with cDNA probes.
2. Description of Prior Art
Disease processes, as well as physiological responses to agents such as drugs and radiation, are often investigated by measuring the amounts of different messenger RNA (mRNA) species in a tissue specimen or in a cultured cell population. The present invention is concerned with analyzing such data, in particular, data acquired through use of a recently developed tool known as microarrays [DUGGAN et al., Nature Genetics 21, Suppl. 1:10-14 (1999)].
Microarrays consist of hundreds or thousands of spots of different DNA sequences, corresponding to many different genes, arranged in a grid pattern on a glass substrate or nylon membrane. Complementary DNA (cDNA) prepared from the mRNA of a tissue specimen is hybridized to the microarray, and the hybridized cDNA is then detected by fluorescence or autoradiographic methods. The signal detected at each of the many spots on the microarray is then used as an indication of the relative amount of the corresponding mRNA species in the specimen.
Microarray experiments are often performed to compare mRNA levels from tissues under two conditions (e.g., cancerous vs. normal cells; before vs. after administration of a drug), in which case the ratio of estimated mRNA levels for each microarray spot under the two conditions is also ordinarily calculated. The construction or interpretation of such ratio estimates may benefit from the application of statistical corrections, especially when spot values are close to the threshold of measurement detectability [CHEN et al., patent U.S. Pat. No. 6,245,517 (2001); NEWTON et al. (1999) from the Web site having the following domain name: top level domain=edu, second level domain=wisc, third level domain=stat, fourth level domain=www, path=/˜newton/papers/publications].
Microarrays have also been used to monitor the time course of mRNA levels in a cell population that had been subjected to an intervention, such as a shift in serum concentration in the growth medium, which alters the concentration of hormones and other factors needed for cell growth [IYER et al., Science 283: 83-87 (1999)]. Those microarray measurements are typically made from mRNA collected at short time intervals (on the order of several minutes) immediately after application of the intervention, and at longer intervals (hours) thereafter. cDNA prepared from each of these mRNA samples is ordinarily hybridized to a separate array. Ratios are then constructed for each time point, as mentioned above, by dividing the measurement at each microarray spot at that time point by the measurement corresponding to time-zero for the same spot. After inspecting the time course of estimated mRNA levels for all the genes on the arrays in those experiments, investigators noted that the mRNA levels for certain groups of genes tend to fluctuate up and down together. Subsequently, computer algorithms were used to group together sets of genes (known as “clusters”, produced by a clustering algorithm) according to the similarity of the time-course of their estimated mRNA levels, making the groupings more objective and relieving investigators of the burden of grouping the genes by eye [EISEN et al., Proc. Natl. Acad. Sci. USA 95: 14863-14868 (1998); TAVAZOIE et al., Nature Genetics 22: 281-285 (1999); TAMAYO et al., Proc. Natl. Acad. Sci. USA 96: 2907-2912 (1999); BEN-DOR et al., J. Computational Biol. 6: 281-297 (1999); GETZ et al. (1999), arXiv:physics/9911038 from the Web site having the following domain name: top level domain=gov, second level domain=lanl, third level domain=xxx; ZHENG et al., patent U.S. Pat. No. 6,263,287 (2001)].
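By way of illustration only, the time-zero ratios and the notion of time-course similarity described above might be sketched as follows. The function names, the use of base-2 logarithms, and the choice of Pearson correlation as the similarity measure are assumptions made for this sketch; the cited references use a variety of normalizations and distance measures.

```python
import math

def log_ratios(series):
    """Log2 ratio of each time point to the time-zero measurement.

    Assumes every measurement is positive; the log transform is an
    illustrative choice, not something prescribed by the text above.
    """
    t0 = series[0]
    return [math.log2(x / t0) for x in series]

def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical spot intensities at three time points (time-zero first).
gene_a = log_ratios([100.0, 200.0, 400.0])   # rises over time
gene_b = log_ratios([50.0, 100.0, 200.0])    # rises in the same pattern
gene_c = log_ratios([100.0, 50.0, 25.0])     # falls over time

similar = pearson(gene_a, gene_b)    # candidates for the same cluster
opposite = pearson(gene_a, gene_c)   # unlikely to be grouped together
```

A clustering algorithm would then group genes whose profiles are mutually similar under whatever measure is chosen.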
Microarrays have also been used to measure cell responses to several different types of interventions, at a single time point, rather than the response to a single intervention at a series of time points. In those experiments, groups of genes were also observed to exhibit similar mRNA levels in response to the various interventions, and groupings of those genes were also produced automatically by using clustering algorithms [PEROU et al., Proc. Natl. Acad. Sci. USA 96: 9212-9217 (1999); TIBSHIRANI et al. (1999) from the Web site having the following domain name: top level domain=edu, second level domain=stanford, third level domain=www-stat, path=/˜tibs/lab/publications.html].
The similarity of estimated mRNA levels—observed among genes in individual clusters—could in some instances be coincidental, but most investigators attribute the similarity of mRNA levels to unknown biological control mechanisms, whereby functionally related genes are transcribed in a coordinated fashion in order to participate stoichiometrically in a biochemical or cell-physiological process. Thus, the clustering of genes on the basis of the similarity of their mRNA levels is viewed by investigators as an initial step in identifying functionally significant biochemical pathways or cell-physiological processes and their mechanisms of transcriptional control. For example, genes involved in mediating progression through the cell cycle may be found in the same cluster [IYER et al., supra]. However, it has also been observed that genes with supposedly similar known functions do not always appear together in the same clusters [TAVAZOIE et al., supra]. This may be due in part to inadequacy of the particular clustering algorithm that was used. If a different clustering algorithm were applied to the data, it would generally produce different clusters and may be more successful at grouping together functionally related genes.
Initially, investigators applied hierarchical clustering algorithms to array data [EISEN et al., supra]. Later investigators used self-organizing maps to perform clustering [TAMAYO et al., supra]. Other investigators have performed clustering of microarray data using the k-means algorithm [TAVAZOIE et al., supra], a graph theoretical algorithm [BEN-DOR et al., supra], super-paramagnetic clustering [GETZ et al., supra], as well as grid and σ-τ clustering [ZHENG et al., supra]. Variations of these algorithms have also been implemented by using various normalizations and distance measures. Additional clustering algorithms were described for situations in which data are parameterized by two or more variables [TIBSHIRANI et al., supra]. Considering that hundreds of other general-purpose clustering algorithms have been described [KAUFMAN and ROUSSEEUW, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley (1990) and references contained therein], many of which may eventually be applied to microarray data, and considering that all of these clustering algorithms may group microarray data in different ways, investigators have the problem of deciding which of those algorithms is most useful for analyzing their data.
One therefore needs to compare results made with different clustering algorithms, in order to decide which algorithm is most useful for the data under investigation. The comparison may be made first in terms of the statistics of how well members of each cluster resemble their corresponding centroid (i.e., tightness of clustering), or in terms of a figure-of-merit obtained using a resampling approach [YEUNG et al. (2000) from the Web site having the following domain name: top level domain=edu, second level domain=washington, third level domain=cs, fourth level domain=www, path=/homes/kayee/research.html]. However, such goodness-of-fit comparisons do not assess the quality of clustering in terms of the biological reasonableness of the results, which must be based on the physiological functions of the genes in the clusters.
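As a rough illustration of a goodness-of-fit statistic of the first kind, the tightness of a cluster might be summarized as the mean Euclidean distance of its members from their centroid. The function names and the choice of Euclidean distance are assumptions made for this sketch and are not taken from the cited references.

```python
import math

def centroid(profiles):
    """Component-wise mean of the expression profiles in one cluster."""
    n = len(profiles)
    return [sum(column) / n for column in zip(*profiles)]

def tightness(profiles):
    """Mean Euclidean distance from each member to the cluster centroid.

    Smaller values indicate a tighter cluster. Requires Python 3.8+
    for math.dist.
    """
    center = centroid(profiles)
    return sum(math.dist(p, center) for p in profiles) / len(profiles)

# Hypothetical two-gene cluster with two measurements per gene.
example_cluster = [[0.0, 0.0], [2.0, 0.0]]
example_tightness = tightness(example_cluster)
```

Such a statistic measures only how compact the clusters are; it says nothing about whether the genes grouped together share a function.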
There is, however, little prior art that can assist investigators in evaluating the extent to which genes in clusters are functionally related, which has been taken to be a primary criterion upon which the quality of clustering is judged. The main difficulty in establishing functional relations among genes in clusters lies in the unavailability or incompleteness of factual databases that explicitly link the known functions of genes with one another. TAVAZOIE et al., supra, indexed yeast genes using the 199 functional categories in the Martinsried Institute of Protein Sciences functional classification scheme database (ribosomal, mitochondrial, TCA pathway, etc.). For each cluster of genes, they then calculated probabilities (P values) of the frequency of observed genes in the various functional categories, to determine whether particular clusters are significantly composed of genes associated with particular functional categories. However, such functional classification databases are available to characterize the genes of only a limited number of organisms, or they may not contain a complete list of known genes. Furthermore, those databases force genes into a predetermined classification scheme that may contain overly-broad or overly-narrow classifications, or classifications that are not mutually exclusive. Possibly for this reason, TAVAZOIE et al., supra, found that genes with supposedly similar known functions, as defined by the Martinsried Institute of Protein Sciences functional classification scheme database for yeast, do not preferentially appear together in the same microarray cluster.
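A P value of the kind mentioned above might, for example, be computed as a hypergeometric tail probability: the chance of seeing at least the observed number of category members in a cluster if the cluster had been drawn at random from all genes. The sketch below assumes that model; the function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(total_genes, category_genes, cluster_size, overlap):
    """P(X >= overlap) under a hypergeometric model (an assumed model).

    total_genes    -- number of genes on the array
    category_genes -- number of those genes in the functional category
    cluster_size   -- number of genes in the cluster
    overlap        -- number of cluster genes that fall in the category
    """
    upper = min(category_genes, cluster_size)
    denominator = comb(total_genes, cluster_size)
    tail = sum(
        comb(category_genes, k)
        * comb(total_genes - category_genes, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denominator

# Hypothetical numbers: 10 genes on the array, 3 in the category, and a
# cluster of 3 genes, all of which belong to the category.
p_all_three = enrichment_p_value(10, 3, 3, 3)
```

A small P value suggests that the cluster is enriched for the category, but the calculation depends entirely on the existence and completeness of the category annotations.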
Consequently, many investigators simply review the lists of clustered genes manually and then offer expert commentary about the functional significance of genes in the various clusters, based on their reading of the literature about those genes. For example, IYER et al., supra, describe one cluster as being enriched for genes “involved in mediating progression through the cell cycle”, describe another cluster as containing genes encoding “proteins involved in cellular signaling”, and for other clusters they offer no description. At the present state of the art, expert human judgement may well be the best method for evaluating the relatedness of functions of genes in clusters. However, this method is limited by the expertise of its practitioners, as well as by the considerable labor involved in manually reviewing literature concerning the many genes that may be present in the clusters. In fact, even the task of identifying the relevant articles in the scientific literature is arduous.
Therefore, a practical impediment in interpreting microarray data is the time and effort needed to acquire literature about genes represented on microarrays, as well as the time and effort needed to manually review and integrate the information contained within that literature. It is consequently an objective of the present invention to produce automatically generated, quantitative indices (figures-of-merit) of the extent to which genes in a cluster are functionally related to one another, based solely on information within the scientific literature concerning genes present on a microarray. Investigators may use the indices to evaluate the functional relatedness of genes in clusters that were made using a particular clustering algorithm, as well as to compare the performance of different clustering algorithms. In so doing, investigators may use the figure-of-merit indices that are generated by the method to evaluate the quality of different clustering algorithms, based solely on the content of the literature about the genes associated with the clusters. An advantage of the present method and system is that it does not presuppose the existence of a structured database of gene annotations, such as the Martinsried Institute of Protein Sciences functional classification scheme database for yeast, which was mentioned above.
There is little prior art that can assist an artisan in automatically generating a useful corpus of literature about an individual gene, which might then be used to analyze the literature about the genes that constitute a microarray cluster. PubMed/MEDLINE is the most widely used on-line source for gene-related abstracts and literature, which might be used to generate such a corpus, but few investigators have described its use for any similar purpose. SHATKAY et al., Internat. Conf. on Intelligent Systems in Molecular Biology 8: 317-323 (2000), explain that PubMed provides for literature search and retrieval by two methods: boolean query and similarity query (also known as “neighboring”). They describe how there are well-known deficiencies with any attempt to use the method of boolean queries to generate a text corpus. For example, CHAUSSABEL and SHER, Genome Biology 3(10):research0055.1-0055.16 (2002) attempted to use boolean queries consisting of gene names taken from a list, and ultimately found it necessary to manually edit or correct the unacceptably large number of errors that resulted from use of boolean queries. Accordingly, SHATKAY et al. advocate using only the neighboring feature of PubMed to acquire a set of documents about a gene, after first selecting a “kernel” citation for that gene, possibly from within a curated database about the genes under investigation. The literature in PubMed that “neighbors” this kernel citation is then generated by PubMed after providing it the kernel citation as the neighboring query. The method of SHATKAY et al. then seeks to find similarities within the documents so generated for different genes. This method can be automatic only if there already exists a curated citation list from which to obtain the “kernel” documents, as was the case with the yeast genes investigated by SHATKAY et al. Otherwise, and in general, an expert human would need to select the kernel documents. Furthermore, SHATKAY et al. teach that when a clustering of genes is already available from microarray expression experiments, then that clustering should be ignored, except for purposes of manually comparing with results obtained independently by their method. Thus, unlike the present invention, their method does not use the actual clustering of microarray data, and it provides no figure-of-merit for the quality of microarray clustering.
Another method for generating a corpus of text using MEDLINE was described by ANDRADE and VALENCIA, Bioinformatics 14: 600-607 (1998), but it was used to generate a text corpus only for protein domain families, rather than for individual, arbitrarily selected genes. In their method, protein families in the PDBSELECT database pointed to entries in the SwissProt database, which pointed to articles in MEDLINE, which were then taken to be the corpus for the corresponding protein domain family. This method is not generally applicable to the problem of generating a literature corpus for an arbitrarily selected gene, because a gene may not belong to a known protein family. Furthermore, the size of the literature corpus generated by their method would be limited by the number of pointers in the SwissProt database. Unlike the present invention, their method does not make use of information from the clustering of microarray data, and it provides no figure-of-merit for the quality of microarray clustering.
The present method and system also produces text for each cluster in a different manner than what was described by MASYS et al., Bioinformatics 17: 319-326 (2001). Unlike the present invention, their method does not provide a figure of merit for the quality of microarray clustering. Their method also has the disadvantage that the words and phrases it produces are voluminous and generally non-specific, placing a significant burden of interpretation on the investigator, because it links sets of genes to the published literature by way of keyword hierarchies using the entire set of descriptors contained in MeSH and Enzyme Commission nomenclature.
Given all the above-mentioned limitations of the prior art, it was therefore an aim of the present invention to provide a method for automatically generating a substantial literature text corpus for an arbitrarily selected known gene, which could then be used to generate a text corpus for clusters obtained from microarray experiments, suitable for generating a figure-of-merit index indicating whether each particular microarray clustering has support in the scientific literature.
In one embodiment, the present invention obtains text in the scientific literature about genes on a microarray by following links about each gene in databases of the National Institutes of Health (NIH) to a corresponding entry in the NIH database “Online Mendelian Inheritance in Man” (OMIM). The latter database contains literature citations about the corresponding genes. Text for those citations is downloaded from another NIH literature database and is used to construct a text corpus for each grouping of genes that constitutes a microarray cluster.
The present invention then makes use of concepts of machine learning and statistical natural language processing [MITCHELL, Machine Learning, McGraw-Hill (1997); MANNING et al., Foundations of Statistical Natural Language Processing, MIT Press (1999)] by treating the problem of interpreting literature about microarray clusters as one of text categorization, the goal of which is to classify the theme of any particular document. The rationale of the disclosed method is that if the microarray clusters correspond to distinctions that would be meaningful to a biologist reading the literature about genes in the clusters, then such distinctions can be made quantitatively in terms of the frequent appearance of words or phrases that are contained within the text corpus for each cluster and simultaneously in terms of the infrequent appearance of such words or phrases within the text corpus for other clusters. To the extent that quantitatively distinguishing words and phrases can be found for each cluster, we may use them to construct a text classifier that predicts whether a gene is associated with a particular cluster, based on the quantitative word/phrase composition of literature documents about that gene. Conversely, if the microarray clusters do not correspond to distinctions that would be meaningful to a biologist reading the literature about genes in the clusters, then it should not be possible to construct a text classifier that reliably predicts whether a gene is associated with a particular cluster, based on the quantitative word/phrase composition of literature documents about that gene.
The figure-of-merit indices of the present invention relate to the percentage of times that tested classifications are made correctly, as compared with classifications performed on text corresponding to genes placed randomly into clusters. Although the invention makes use of some conventional procedures of supervised machine learning, as implemented in the open source software “rainbow” (described in documentation at the Web site for the “bow” software, top level domain=edu, second level domain=cmu, third level domain=cs, fourth level domain=www, path=/˜mccallum/bow [McCALLUM (1998)]), the invention differs from the existing supervised machine learning art in that the categories into which documents are to be classified are not pre-selected by the artisan to be meaningful, but correspond instead to automatically generated microarray clusters, which may or may not be biologically meaningful entities. In fact, an objective of the invention is conversely to analyze the extent to which the microarray clusters (supervised machine-learning classification categories) are actually biologically meaningful.
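The comparison described above, between classification accuracy for the actual clusters and accuracy for genes placed randomly into clusters, might be sketched as follows. This is a minimal illustration and not the “rainbow” implementation: the multinomial naive Bayes classifier with add-one smoothing, the leave-one-out testing scheme, and all function names are assumptions made for the sketch.

```python
import math
import random
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (cluster_label, list_of_words), one entry per gene."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for label, words in docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(model, words):
    """Most probable cluster label under naive Bayes, add-one smoothing."""
    word_counts, label_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / n_docs)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def loo_accuracy(docs):
    """Fraction of genes classified correctly when each is held out."""
    correct = 0
    for i, (label, words) in enumerate(docs):
        model = train(docs[:i] + docs[i + 1:])
        correct += classify(model, words) == label
    return correct / len(docs)

def figure_of_merit(docs, n_shuffles=100, seed=0):
    """Accuracy for the real clusters and mean accuracy for shuffled ones."""
    rng = random.Random(seed)
    observed = loo_accuracy(docs)
    labels = [label for label, _ in docs]
    baseline = 0.0
    for _ in range(n_shuffles):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # genes placed randomly into clusters
        baseline += loo_accuracy(
            [(lab, words) for lab, (_, words) in zip(shuffled, docs)])
    return observed, baseline / n_shuffles

# Hypothetical word lists for six genes in two clusters.
toy_docs = [
    ("A", ["cyclin", "mitosis"]),
    ("A", ["cyclin", "kinase", "mitosis"]),
    ("A", ["mitosis", "cyclin"]),
    ("B", ["ribosome", "translation"]),
    ("B", ["ribosome", "rrna"]),
    ("B", ["translation", "ribosome"]),
]
observed, baseline = figure_of_merit(toy_docs)
```

If the observed accuracy is not appreciably higher than the shuffled baseline, the literature provides little support for the clustering.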
In another embodiment of the invention, figure-of-merit indices are calculated by obtaining text in the scientific literature about genes on a microarray; by putting that literature in groups defined by microarray clustering of the corresponding genes; and by then evaluating the extent to which genes within a cluster are associated with the same literature citations, irrespective of the actual text within those literature citations. Indices are then generated by calculating the average fraction of times that pairs of genes in a cluster are associated with the same literature citation. Because the indices are based on literature citations rather than text within the citations, the indices are related to concepts that arise in connection with the analysis of literature co-citation frequencies [EGGHE and ROUSSEAU, Introduction to Informetrics, Elsevier Science Publ. (1990)], but which have nothing to do with the analysis of microarray data.
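One possible reading of this citation-based index, offered only as a sketch, is the fraction of gene pairs within a cluster that share at least one literature citation. The function name, the data layout, and the citation identifiers below are hypothetical.

```python
from itertools import combinations

def cocitation_index(cluster_genes, citations):
    """Fraction of gene pairs in a cluster sharing at least one citation.

    cluster_genes -- list of gene identifiers in one cluster
    citations     -- dict mapping gene identifier to a set of citation IDs
    """
    pairs = list(combinations(cluster_genes, 2))
    if not pairs:
        return 0.0  # a single-gene cluster has no pairs to compare
    shared = sum(
        1 for a, b in pairs
        if citations.get(a, set()) & citations.get(b, set())
    )
    return shared / len(pairs)

# Hypothetical genes and citation identifiers.
example_citations = {
    "gene1": {101, 102},
    "gene2": {102, 103},
    "gene3": {104},
}
example_index = cocitation_index(
    ["gene1", "gene2", "gene3"], example_citations)
```

As with the text-based index, the value for the actual clusters would be interpreted relative to the value obtained for genes placed randomly into clusters.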