The present invention relates to computer systems and more particularly to computer systems for mining information about gene expression levels.
Devices and computer systems have been developed for collecting information about gene expression or expressed sequence tags (EST) in large numbers of samples. For example, PCT application WO92/10588, incorporated herein by reference for all purposes, describes techniques for sequence checking nucleic acids and other materials. Probes for performing these operations may be formed in arrays according to the pioneering techniques disclosed in U.S. Pat. Nos. 5,143,854 and 5,571,639, for example. Both of these U.S. Patents are incorporated herein by reference for all purposes.
According to one aspect of the techniques described in these patents, an array of nucleic acid probes is fabricated at known locations on a chip or substrate. A nucleic acid carrying a fluorescent label is then brought into contact with the chip, and a scanner generates an image file indicating the locations where the labeled nucleic acids bound to the chip. Based upon the identities of the probes at these locations, information such as the monomer sequence of DNA or RNA can be extracted.
Computer-aided techniques for gene expression monitoring using such arrays of probes have been developed as disclosed in EP Pub. No. 0848067 and PCT publication No. WO 97/10365, the contents of which are herein incorporated by reference. Many diseases are characterized by differences in the degree to which various genes are expressed, either through changes in the copy number of the genetic DNA or through changes in levels of transcription (e.g., through control of initiation, provision of RNA precursors, RNA processing, etc.) of particular genes. For example, losses and gains of genetic material play an important role in malignant transformation and progression. Furthermore, changes in the expression (transcription) levels of particular genes (e.g., oncogenes or tumor suppressors) serve as signposts for the presence and progression of various cancers.
Information on expression of genes or expressed sequence tags may be collected on a large scale in many ways, including the probe array techniques described above. One of the objectives in collecting this information is the identification of genes or ESTs whose expression is of particular importance. Researchers use such techniques to answer questions such as: 1) Which genes are expressed in cells of a malignant tumor but not expressed in either healthy tissue or tissue treated according to a particular regime? 2) Which genes or ESTs are expressed in particular organs but not in others? 3) Which genes or ESTs are expressed in particular species but not in others?
Collecting vast amounts of expression data from large numbers of samples including many tissue types is useful in answering these questions. However, in order to derive full benefit from the investment made in collecting and storing expression data, techniques enabling one to efficiently mine the data to find items of particular relevance are highly desirable.
The present invention provides techniques for organizing expression or concentration information in a way that facilitates mining. A database model is provided which may organize information relating to, e.g., sample preparation, expression analysis of experiment results, and intermediate and final results of mining gene expression measurements, gene sets and the like. The model is readily translatable into database languages such as SQL and the like. The database model can scale to permit mining of gene expression measurements collected from large numbers of samples.
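As an illustration of how such a database model might translate into SQL, the following is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample values are hypothetical assumptions chosen for this example; they are not taken from the model described here.

```python
import sqlite3

# Hypothetical schema: every table and column name below is an illustrative
# assumption, not the schema of the described database model.
SCHEMA = """
CREATE TABLE sample (
    sample_id   INTEGER PRIMARY KEY,
    tissue      TEXT NOT NULL,
    preparation TEXT
);
CREATE TABLE experiment (
    experiment_id INTEGER PRIMARY KEY,
    sample_id     INTEGER NOT NULL REFERENCES sample(sample_id),
    chip_design   TEXT NOT NULL
);
CREATE TABLE expression_result (
    experiment_id    INTEGER NOT NULL REFERENCES experiment(experiment_id),
    gene             TEXT NOT NULL,
    expression_value REAL NOT NULL,
    PRIMARY KEY (experiment_id, gene)
);
CREATE TABLE gene_set (
    gene_set_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE gene_set_member (
    gene_set_id INTEGER NOT NULL REFERENCES gene_set(gene_set_id),
    gene        TEXT NOT NULL,
    PRIMARY KEY (gene_set_id, gene)
);
"""


def create_database(path=":memory:"):
    """Create the illustrative tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


conn = create_database()
# Made-up rows, inserted only to show that the tables join as intended.
conn.execute("INSERT INTO sample VALUES (1, 'liver', 'total RNA')")
conn.execute("INSERT INTO experiment VALUES (1, 1, 'design_A')")
conn.execute("INSERT INTO expression_result VALUES (1, 'GENE_X', 250.0)")

# Join each expression measurement back to the tissue it came from.
rows = conn.execute(
    "SELECT s.tissue, r.gene, r.expression_value "
    "FROM expression_result r "
    "JOIN experiment e ON e.experiment_id = r.experiment_id "
    "JOIN sample s ON s.sample_id = e.sample_id"
).fetchall()
print(rows)
```

Keeping samples, experiments, results, and gene sets in separate tables is one plausible way to let mining results be stored alongside the measurements they were derived from.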
According to an embodiment of the present invention, a computer based method for mining a plurality of experiment information is provided. The method includes a variety of steps such as collecting information from experiments and chip designs. The method can include steps of selecting experiments to be mined. Experiment results and other information can be organized by experimental analysis, and the like. A step of defining one or more groupings for the experiments to be mined is also part of the method. The method also includes a step of selecting, based upon the groupings, information about the experiments to be mined to form a plurality of resulting information. This resulting information can include one or more resulting gene sets, and the like. Finally, the method formats the resulting information for viewing by a user. The combination of these steps can provide to the user the ability to access experiment information.
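The grouping and selection steps can be sketched as follows. This is a minimal illustration only: the function names, the grouping labels ("tumor", "healthy"), the expression threshold, and the data are all assumptions made for the example, and the selection rule (expressed in every experiment of one grouping and in none of another) is just one possible rule.

```python
def expressed_genes(results, threshold):
    """Return the genes in one experiment whose value meets the threshold."""
    return {gene for gene, value in results.items() if value >= threshold}


def mine_by_grouping(experiments, groupings, include, exclude, threshold):
    """Return genes expressed in every experiment of the `include` grouping
    and in no experiment of the `exclude` grouping."""
    include_sets = [expressed_genes(experiments[e], threshold)
                    for e in groupings[include]]
    exclude_sets = [expressed_genes(experiments[e], threshold)
                    for e in groupings[exclude]]
    common = set.intersection(*include_sets) if include_sets else set()
    excluded = set.union(*exclude_sets) if exclude_sets else set()
    return common - excluded


# Hypothetical experiment results: experiment name -> {gene: expression value}.
experiments = {
    "exp1": {"GENE_A": 150, "GENE_B": 20, "GENE_C": 300},
    "exp2": {"GENE_A": 180, "GENE_B": 10, "GENE_C": 250},
    "exp3": {"GENE_A": 5, "GENE_B": 15, "GENE_C": 400},
}

# Hypothetical groupings of the experiments selected for mining.
groupings = {"tumor": ["exp1", "exp2"], "healthy": ["exp3"]}

result = mine_by_grouping(experiments, groupings, "tumor", "healthy", 100)

# Format the resulting gene set for viewing.
print(sorted(result))
```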
In some embodiments, visualization techniques can be used in conjunction with the steps of the method to enable users to more easily understand the results of the data mining. Further, in some embodiments, a step of recording conclusions about the results of the data mining can also be part of the method.
In another aspect according to the present invention, a method for working with expression information is provided. The method includes a variety of steps such as collecting information about results of experiments. A step of gathering information about samples and information about the experiments, which can comprise an experimental analysis and the like, is also part of the method. A step of adding one or more attributes to the information about the experiments can also be performed. The method then transforms the plurality of results of experiments into a plurality of transformed information. Transformations can include normalizing, de-normalizing, aggregation, scaling, and the like. Steps of mining the plurality of transformed information and visualizing the plurality of transformed information can also be part of the method.
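Two of the named transformations, scaling and aggregation, can be sketched as follows. This is an illustration under stated assumptions: scaling is taken here to mean multiplying one experiment's values so their mean equals a common target, and aggregation is taken to mean averaging each gene across experiments. The function names and data are hypothetical, and the text above does not specify these particular formulas.

```python
from statistics import mean


def scale_to_target(values, target_mean):
    """Scale one experiment's values so their mean equals target_mean."""
    factor = target_mean / mean(values.values())
    return {gene: value * factor for gene, value in values.items()}


def aggregate_mean(experiments):
    """Average each gene across experiments, using only genes present in all."""
    genes = set.intersection(*(set(e) for e in experiments))
    return {gene: mean(e[gene] for e in experiments) for gene in genes}


# Hypothetical results from two experiments with different overall intensity.
exp_a = {"GENE_A": 100.0, "GENE_B": 300.0}
exp_b = {"GENE_A": 50.0, "GENE_B": 50.0}

# Bring both experiments to a common mean of 100 before combining them.
scaled = [scale_to_target(e, 100.0) for e in (exp_a, exp_b)]
aggregated = aggregate_mean(scaled)
print(aggregated)
```

Scaling before aggregating keeps an experiment with a brighter overall signal from dominating the averaged values.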
Numerous benefits are achieved by way of the present invention over conventional techniques. Some embodiments according to the present invention can provide better access to genetic experiment information than methods known in the prior art. Embodiments can provide answers to queries such as, “show all genes where the gene expression value is greater than or equal to 100, where at least three genes out of four respond to the query,” as well as answers to many other and varied useful queries. Another advantage provided by this approach is that the results of numerous experiments can be mined effectively using visualization techniques and set theory queries.
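A query of the kind quoted above can be sketched as follows. The sketch assumes one possible reading of the query: a gene is returned when its expression value is greater than or equal to 100 in at least three of four experiments. The function name, parameter names, and data are hypothetical.

```python
def genes_meeting_threshold(experiments, threshold=100, min_hits=3):
    """Return genes whose value is >= threshold in at least min_hits experiments."""
    counts = {}
    for results in experiments:
        for gene, value in results.items():
            if value >= threshold:
                counts[gene] = counts.get(gene, 0) + 1
    return sorted(gene for gene, hits in counts.items() if hits >= min_hits)


# Hypothetical expression values from four experiments.
four_experiments = [
    {"GENE_A": 120, "GENE_B": 90, "GENE_C": 100},
    {"GENE_A": 130, "GENE_B": 150, "GENE_C": 100},
    {"GENE_A": 80, "GENE_B": 95, "GENE_C": 101},
    {"GENE_A": 110, "GENE_B": 200, "GENE_C": 99},
]

hits = genes_meeting_threshold(four_experiments)
print(hits)
```

In this made-up data, GENE_A and GENE_C each meet the threshold in three experiments, while GENE_B meets it in only two and is left out.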
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.