1. Field of the Invention
The present invention, in the fields of molecular biology and computer image analysis, relates to methods and computer systems for analyzing proteomes of organisms, organs, tissues, biopsies, primary, secondary or established cell lines, and body fluids (serum, plasma, cerebrospinal fluid, urine etc. or culture media) (hereinafter referred to as ‘cells’) to characterize cellular and secreted proteins and/or nucleic acids that are up- or down-regulated in affected or unaffected conditions, for diagnostic or therapeutic applications. Proteins characterized using such methods or computer systems are also provided, as well as peptide fragments, and nucleic acids encoding the proteins or fragments, for use in diagnostic applications.
2. Related Art
Proteins and Two Dimensional Gel Electrophoresis. Two-dimensional gel electrophoresis (2-DGE) is a particularly effective tool for separating mixtures of proteins. Cell protein extracts are put onto a gel, and the individual proteins are separated first by charge and then by size. The result is a characteristic picture of as many as 1000 to 5000 spots, each usually a single protein. Resolution can be improved by increasing gel size, and by enhancing the sensitivity through the use of radiolabel methods, silver staining, and the reduction in thickness of the gels to 1.5 mm and less. Jungblut et al., Journal of Biotechnology 41:111-120 (1995), have reported that up to 5000 protein spots were run from mouse brain cell extracts on gels of size 23×30 cm.
High resolution 2-DGE has been used for analyzing basic as well as acidic proteins. Isoelectric focusing (IEF) in the first dimension can be combined with sodium dodecylsulfate (SDS) gel electrophoresis in the second dimension (IEF-SDS). Alternatively, NonEquilibrium pH Gradient Electrophoresis (NEPHGE) in the first dimension can be combined with SDS gel electrophoresis in the second dimension (NEPHGE-SDS). Such procedures are described in O'Farrell, J. Biol. Chem. 250:4007-4021 (1975) and O'Farrell et al., Cell, 12:1133-1142 (1977), which are entirely incorporated herein by reference. NEPHGE gels cannot be used for the determination of isoelectric points of proteins. The isoelectric point of a protein is usually determined in a stable pH gradient with reference to known proteins. As discussed in O'Farrell (1977), good resolution of acidic proteins is obtained with equilibrium IEF. Good resolution of basic proteins can be obtained with a pH 7-10 NEPHGE gel. For the highest resolution of the entire range of proteins, two gels are used: (1) an IEF gel for acidic proteins; and (2) a NEPHGE gel for basic proteins. An alternate method for separating proteins according to pI is to use immobilized pH gradient gel electrophoresis (IPG), according to known method steps.
Once a 2-DGE gel is run, the proteins may be visualized in a variety of ways including staining (Coomassie blue, silver or gold), fluorescence (if the sample has been appropriately prepared, e.g., with monobromobimane), or an image captured on X-ray film or phosphorimaging plates (if the sample is radioactively labelled, e.g., with [35S]-methionine, [14C]-amino acids, or [32P]-phosphate). Stained and fluorescent images are captured electronically, e.g., using a camera, while X-ray film and phosphorimaging plates are scanned in appropriate devices to yield the electronic image. For example, after electrophoresis, a 2-DGE gel can be fixed with methanol and acetic acid, treated with AMPLIFY® (Amersham), and dried. The gel is then placed in contact with X-ray film and exposed. The gel can be exposed for multiple time periods to compensate for the lack of dynamic range of X-ray films. Each film image contains a multiplicity of “spots” of differing position, size, shape, and optical density. The spots on the image are analyzed to determine the correspondence between spots and proteins. The use of phosphorimaging technology is preferred because the response of the phosphorimaging plates is linear and covers a range of 1:100,000, obviating the need for multiple exposures and avoiding the non-linear response of film.
Analysis of 2DGE Gels. Manual visual inspection and analysis of gel images is limited in the number of spots resolvable (Jungblut et al., In: Neuhoff, V. (ed.) Electrophoresis, Verlag Chemie GmbH, Weinheim, p. 301-303 (1984); Andersen et al., Diabetes, Vol. 44:400-407 (April, 1995)). Additionally, increasing gel size makes visual analysis laborious and time consuming. Analysis of one film can take at least eight to 20 hours, even for one having an expert level of skill and experience in this art. Further, quantification by visual analysis is limited. Typically, visual analysis only detects changes in protein amounts of a factor greater than or equal to 2.
Various computer programs and computer evaluation systems have been developed to improve quantification and assist in evaluation of individual gel films, e.g., PDQUEST (Protein Database Inc., New York), BioImage (Ann Arbor, Mich., USA), Phoretix (Phoretix International, Newcastle, UK), and Kepler (Large Scale Biology Corporation, Rockville, Md.). To use a computer program such as BioImage, the image on the gel film is usually scanned or captured using a digital camera and the digital image entered into the memory or storage of a computer. The digitized gel image is analyzed by the computer program. Each spot is assigned an intensity value, such as an integrated optical density percentage (IOD%), and a position on the gel, such as an “X,Y” Cartesian-type coordinate. Computer programs such as BioImage require the highest qualities in resolution and the highest reproducibility of the spot position. Because the gel medium is so elastic, gel patterns are not identical, i.e., two gels, run under essentially identical conditions, will not have each protein spot located in exactly the same position. If two gels are run under conditions that are not essentially the same, then the variations in position of corresponding protein spots will be even greater.
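By way of illustration only, the assignment of an intensity value and a position to each spot can be sketched as follows. The sketch assumes that IOD% is a spot's integrated optical density expressed as a percentage of the summed integrated optical density of all spots on the gel; the text above does not define the calculation, and the names `Spot` and `iod_percent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Spot:
    number: int   # spot number assigned by the analysis program
    x: float      # X coordinate on the gel image
    y: float      # Y coordinate on the gel image
    iod: float    # integrated optical density, arbitrary units


def iod_percent(spots):
    """Return {spot number: IOD%}.

    Assumption: IOD% is each spot's share of the total integrated
    optical density of all spots in the list, times 100.
    """
    total = sum(s.iod for s in spots)
    if total <= 0:
        raise ValueError("total integrated optical density must be positive")
    return {s.number: 100.0 * s.iod / total for s in spots}


# Hypothetical three-spot list for demonstration.
spot_list = [
    Spot(1, 10.0, 20.0, 50.0),
    Spot(2, 30.0, 40.0, 150.0),
    Spot(3, 55.0, 12.0, 300.0),
]
percentages = iod_percent(spot_list)
```

Expressing intensity as a percentage in this way makes values from gels with different total loading or exposure comparable, which is presumably why a relative measure is used rather than a raw density.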
Computer evaluation systems such as those described above have improved the quantification of spot intensities and IOD% for generation of a “spot list” for a gel image. However, computer evaluation systems such as those described above still require significant operator effort for editing. A gel image to be evaluated is input to a computer, such as by scanning. The digitized image is searched to locate spots having an intensity or optical density above a sensitivity threshold. The operator must then edit the gel image. For example, if two very big spots are close together, the computer may have identified the two spots as one elongated spot. The computer may not be able to resolve that there are actually two spots. The operator would then be required to manually edit the image to divide the spot into two spots. As another example, the computer may incorrectly identify as a protein spot a non-protein spot on the gel image, such as a high intensity streak. The operator would then be required to manually edit the image to delete the non-protein spot. It can take from six to eight hours for a skilled operator to edit a gel image evaluated using a conventional computer evaluation system. This manual editing introduces a considerable degree of subjectivity into the analysis and this is the major drawback to the analysis of 2D gel images. Even though attempts can be made to reduce this by having the same operator carry out the entire analysis, there are bound to be differences in how he/she defines spots and how the computer does. This will introduce a degree of error into the analysis.
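The threshold search described above, and the reason two adjacent spots can be reported as one, can be illustrated with a minimal sketch. It assumes the image is a grid of optical densities on a scale from 0 (white) to 1 (black) and groups above-threshold pixels by 4-connectivity; this is a generic connected-component search, not the algorithm of any of the named programs, and `find_spots` is a hypothetical name.

```python
from collections import deque


def find_spots(image, threshold):
    """Group pixels whose density exceeds `threshold` into connected spots.

    image: list of rows of optical densities (0 = white, 1 = black).
    Returns a list of spots, each a list of (row, col) pixel coordinates.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    spots = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] <= threshold:
                continue
            # Breadth-first flood fill from this unvisited dark pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for nr, nc in ((pr - 1, pc), (pr + 1, pc), (pr, pc - 1), (pr, pc + 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and not seen[nr][nc] and image[nr][nc] > threshold:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            spots.append(pixels)
    return spots


# Hypothetical 3x5 image containing two separate dark regions.
demo_image = [
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.7],
]
demo_spots = find_spots(demo_image, threshold=0.5)
```

Because any two dark regions that touch are merged into one component, a search of this kind cannot by itself separate two large overlapping spots, and a dark streak is indistinguishable from a protein spot. These are the cases that conventionally fall to the operator.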
As reported in Jungblut et al. (1995), numerous researchers have used conventional computer evaluation systems to produce 2-DGE databases for various tissues or cell types. However, these systems require significant effort on the part of the operator to produce an accurate spot list for a new gel image. More importantly, conventional computer evaluation systems do not provide an analysis and interpretation tool that uses information from other gel images of the same cell type to allow an operator to quickly and efficiently analyze and interpret a new gel image. Conventional computer evaluation systems cannot be used to reliably detect proteins only present in small amounts. Thus, there is a need in the art for a computer-based analysis system that reduces the effort required by the operator, and increases the speed with which new gel images can be analyzed and interpreted. There is a further need in the art for a computer-based analysis system for analyzing and interpreting new gel images that uses information from other gel images of the same cell type.
Most conventional computer evaluation systems also do not provide an analysis tool for statistical comparison between groups of gel images. Thus, there is a further need in the art for a computer-based analysis system that is capable not only of analyzing and interpreting a new gel image, but also of executing statistical comparisons between various groups of gel images.
Accordingly, there is a need to provide methods and analysis systems for determining which proteins or nucleic acids are up or down regulated in diseases, as well as methods and systems for testing potential diagnostic or therapeutic compositions and methods for diagnosing or treating such diseases.
The present invention relates to methods and computer systems for analyzing images of specific cell type proteomes of organisms, organs, tissues, biopsies, primary, secondary or established cell lines, and body fluids (serum, plasma, cerebrospinal fluid, urine etc. or culture media) (hereinafter referred to as ‘cells’). The proteomes are analyzed to characterize proteins or nucleic acids that are up- or down-regulated in treated, diseased or immunologically affected conditions. The present invention thus provides such proteins and nucleic acids in purified or isolated form, as well as fragments, probes and related diagnostic and therapeutic compositions and methods.
The invention, in one aspect, provides methods and computer systems for identifying or characterizing unaffected proteins and affected proteins that distinguish normal cells from treated, diseased or immunologically affected cells, in vitro or in vivo, the cells derived from a sample of a specific cell type, or cell lines derived therefrom. The sample can be subjected to two dimensional gel electrophoresis (2DGE) to provide a 2DGE gel comprising the unaffected or affected proteins, as well as recorded images thereof.
These images can be colored or black and white (a colored image can have three grey scale ranges for the primary colors and can thus be analyzed in the same way as described below). For the purposes of this description only, one grey scale is considered although for one skilled in the art, there would be no difficulty to extend the description to the three primary colors, or combinations thereof.
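As a minimal sketch of the extension to color mentioned above, a colored image can be separated into three grey scale planes, each of which can then be treated like a single grey scale image. The sketch assumes 8-bit RGB pixels and maps each channel onto a scale from 0 (white) to 1 (black); the function name `split_channels` is hypothetical.

```python
def split_channels(image):
    """Split an RGB image into three grey scale planes.

    image: list of rows of (r, g, b) tuples with values 0..255.
    Returns (red_plane, green_plane, blue_plane); in each plane a value of
    0.0 means full channel intensity (light) and 1.0 means none (dark).
    """
    planes = []
    for channel in range(3):
        plane = [[1.0 - pixel[channel] / 255.0 for pixel in row] for row in image]
        planes.append(plane)
    return tuple(planes)


# Hypothetical 1x2 image for demonstration.
red, green, blue = split_channels([[(255, 0, 51), (0, 255, 255)]])
```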
In biotechnology, applications can include, but are not limited to, Northern, Southern or Western blots, one-dimensional gel electrophoresis (1DGE) gels and/or 2DGE gels. The present invention is described below with respect to analyzing gel electrophoresis images to identify proteins and encoding nucleic acids, and to compare gel images to identify changes in protein or nucleic acid expression.
In one aspect of the invention, a method for analyzing images is provided. The method comprises at least one of the following steps, such as, but not limited to (1) to (3), (4), (5), (6), (7), (8), (9), (10), (11) or (12):
(1) capturing a new image, wherein the new image contains a plurality of new image spots corresponding to one or more proteins in an electrophoresis gel, each new image spot having a spot number, an integrated optical density percentage (IOD%) and a position;
(2) generating a master composite image for use in analyzing the new image, wherein the master composite image contains a plurality of master composite spots, each master composite spot having a spot number, an IOD% and a position;
(3) generating a master composite spot data list, wherein the master composite spot data list comprises the spot number, the IOD%, the position, the variability of the spot (for example, the standard deviation expressed as a percentage) for the position and IOD%, and a saturation value (corresponding to the value of the maximum pixel intensity found in any of the spots from the original images which were used to derive the spot in question, this value being expressed as a fraction on a scale from white (0) to black (1)) for each of the plurality of master composite spots;
(4) generating a database which contains information which might be necessary to interpret the gel images in a meaningful way. This information might include, but is not limited to: the type of sample analysed (including whether it is an organism, an organ, a tissue sample, a biopsy, a body fluid, isolated cells, primary, secondary or from established cell culture); whether it is a total cell extract, a protein containing supernatant or medium produced by cells; the type of cells (including origin, species, age); whether the sample is from a diseased organism or is a control sample for a disease; whether the individual organism or sample has been infected with another organism including any form of microorganism, virus, bacterium, bacteriophage, prion or other infectious agent (and if so which and how and to what extent the infection has progressed); whether the individual organism or sample has been treated with any form of drug or chemical compound (and if so which and how and at what amount); whether the individual organism or sample has been treated with any form of stress or environmental factor which could be expected to influence its response (and if so which and how and at what amount); the manner in which the sample has been collected and treated; information concerning the experiment's execution; characteristics of the proteins that have been entered manually or imported from various sources (including the internet), e.g., the protein identity, cellular localisation etc.; or other data that has been generated by analysing some or all of other gel images;
(5) aligning the new image with the master composite image;
(6) selecting a set of anchor points from the master composite spot data list;
(7) detecting new image spots that have a position that is within a position tolerance of the position of corresponding anchor points and that have an IOD% that is within an IOD% tolerance of the IOD% of corresponding anchor points, and matching the detected new image spots to the corresponding anchor points to form a set of matched new image spots;
(8) calculating a set of vectors linking spots of the same number in the master composite image and in the new gel image; and determining for each vector the length and angle;
(9) calculating a vector difference for each of the set of matched new image spots, corresponding to the difference between the vector in question and the vectors originating from a number (for example, 2-500) of the nearest spots to the spot in question, thereby generating a vector difference for each of the matched new image spots; and, in a subsequent step, removing from the set of matched new image spots those matches for which the vector differences are greater than a predetermined percentage of the best (shortest length and numerically smallest angle) vector differences (these vector differences can also be used to quality check the alignment of the images and to guide the correction of mismatches in a reiterative manner until an optimal match is obtained);
(10) selecting a set of well-defined spots from the master composite spot data list, detecting new image spots that have a position that is within a position tolerance of the position of corresponding well-defined spots, matching the detected new image spots to the corresponding well-defined spots, and adding the matched new image spots to the set of matched new image spots;
(11) selecting a set of saturated spots from the master composite spot data list, detecting new image spots that have a position that is within a position tolerance of the position of corresponding saturated spots, matching the detected new image spots to the corresponding saturated spots, and adding the matched new image spots to the set of matched new image spots;
(12) selecting a set of weak spots from the master composite spot data list, detecting new image spots that have a position that is within a position tolerance of the position of corresponding weak spots, matching the detected new image spots to the corresponding weak spots, and adding the matched new image spots to the set of matched new image spots; and
(13) (optionally replacing step (5) above) searching the new image outside the set of matched new image spots to locate unidentified new image spots.
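Steps (8) and (9) above can be illustrated by the following minimal sketch. It is offered under stated assumptions and is not the implementation of the invention: each match is represented as a pair of (x, y) positions (master composite position, new image position); the vector difference of a match is taken as the length of the difference between its displacement vector and the mean displacement of its `k` nearest neighbours; and, as a simplification of the "predetermined percentage of the best vector differences" criterion, matches are rejected when that length exceeds a fixed limit `max_difference`. All function and parameter names are hypothetical.

```python
import math


def displacement(match):
    """Vector from the master composite position to the new image position."""
    (mx, my), (nx, ny) = match
    return (nx - mx, ny - my)


def length_and_angle(vector):
    """Length and angle (degrees, measured from the X axis) of a vector."""
    dx, dy = vector
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))


def vector_differences(matches, k=3):
    """For each match, compare its vector with those of its k nearest spots.

    matches: {spot number: ((master_x, master_y), (new_x, new_y))}.
    Returns {spot number: length of (own vector - mean neighbour vector)}.
    """
    result = {}
    for num, (master_pos, _) in matches.items():
        others = [n for n in matches if n != num]
        others.sort(key=lambda n: math.dist(matches[n][0], master_pos))
        neighbours = others[:k]
        if not neighbours:
            result[num] = 0.0
            continue
        dx, dy = displacement(matches[num])
        mean_dx = sum(displacement(matches[n])[0] for n in neighbours) / len(neighbours)
        mean_dy = sum(displacement(matches[n])[1] for n in neighbours) / len(neighbours)
        result[num] = math.hypot(dx - mean_dx, dy - mean_dy)
    return result


def filter_matches(matches, max_difference, k=3):
    """Keep only matches whose vector difference is within max_difference."""
    diffs = vector_differences(matches, k)
    return {n: m for n, m in matches.items() if diffs[n] <= max_difference}


# Hypothetical data: spots 1-4 shift consistently by (2, 1); spot 5 is a mismatch.
example_matches = {
    1: ((10.0, 10.0), (12.0, 11.0)),
    2: ((20.0, 10.0), (22.0, 11.0)),
    3: ((10.0, 20.0), (12.0, 21.0)),
    4: ((20.0, 20.0), (22.0, 21.0)),
    5: ((15.0, 15.0), (35.0, 0.0)),
}
kept = filter_matches(example_matches, max_difference=10.0, k=3)
```

The rationale is that gel distortion is locally smooth, so correctly matched neighbouring spots should be displaced in nearly the same direction and by nearly the same amount; a match whose vector disagrees sharply with its neighbours is likely a mismatch. Note that a mismatch also inflates the differences of its neighbours, which is one reason an iterative re-check after each removal may be useful.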
In another aspect of the present invention, the master composite spot data list or master composite image optionally further comprises at least one characteristic of at least one of said proteins, said characteristic selected from the group comprising pI, molecular weight, amino acid sequence, mass spectra and a post-translational modification.
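One possible representation of an entry in the master composite spot data list, including optional protein characteristics, is sketched below. The sketch assumes that the composite position and IOD% are arithmetic means over the original images, that variability is the sample standard deviation expressed as a percentage of the mean, and that the saturation value is the largest maximum pixel intensity (0 = white, 1 = black) seen in any contributing spot. These aggregation rules, and the names `MasterSpot` and `build_master_spot`, are illustrative assumptions.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Optional


@dataclass
class MasterSpot:
    number: int
    x: float
    y: float
    iod_pct: float
    x_sd_pct: float       # variability of X position, SD as % of mean
    y_sd_pct: float       # variability of Y position, SD as % of mean
    iod_sd_pct: float     # variability of IOD%, SD as % of mean
    saturation: float     # 0 (white) .. 1 (black)
    pI: Optional[float] = None                 # optional characteristic
    molecular_weight: Optional[float] = None   # optional characteristic


def sd_percent(values):
    """Sample standard deviation as a percentage of the mean (0.0 if undefined)."""
    m = mean(values)
    if len(values) < 2 or m == 0:
        return 0.0
    return 100.0 * stdev(values) / m


def build_master_spot(number, observations):
    """observations: list of (x, y, iod_pct, max_pixel), one per original image."""
    xs, ys, iods, peaks = zip(*observations)
    return MasterSpot(
        number=number,
        x=mean(xs),
        y=mean(ys),
        iod_pct=mean(iods),
        x_sd_pct=sd_percent(xs),
        y_sd_pct=sd_percent(ys),
        iod_sd_pct=sd_percent(iods),
        saturation=max(peaks),
    )


# Hypothetical observations of one spot in three original gel images.
master = build_master_spot(7, [
    (100.0, 50.0, 1.0, 0.5),
    (102.0, 50.0, 1.2, 0.7),
    (98.0, 50.0, 0.8, 0.6),
])
```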
In another aspect of the present invention, the method further includes comparing a first set of images to a second set of images.
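A comparison between two sets of images can be sketched, purely for illustration, as a per-spot comparison of mean IOD% between the two sets. The sketch assumes that spots have already been matched so that the same spot number refers to the same protein in every image, and it uses a fold-change threshold of 2 as an arbitrary default; neither the threshold nor the names used here are prescribed by the invention.

```python
from statistics import mean


def compare_groups(group_a, group_b, fold_threshold=2.0):
    """Classify each spot present in both groups by its change in mean IOD%.

    group_a, group_b: {spot number: [IOD% in each image of the group]}.
    Returns {spot number: (ratio of mean B to mean A, "up"/"down"/"unchanged")}.
    Spots whose mean in group A is not positive are reported as "undefined".
    """
    report = {}
    for num in sorted(set(group_a) & set(group_b)):
        mean_a = mean(group_a[num])
        mean_b = mean(group_b[num])
        if mean_a <= 0:
            report[num] = (None, "undefined")
            continue
        ratio = mean_b / mean_a
        if ratio >= fold_threshold:
            change = "up"
        elif ratio <= 1.0 / fold_threshold:
            change = "down"
        else:
            change = "unchanged"
        report[num] = (ratio, change)
    return report


# Hypothetical IOD% values: group A (e.g., unaffected), group B (e.g., affected).
unaffected = {1: [1.0, 1.2, 0.8], 2: [2.0, 2.0], 3: [4.0, 4.0]}
affected = {1: [2.5, 2.5, 2.5], 2: [2.2, 1.8], 3: [1.0, 1.0]}
comparison = compare_groups(unaffected, affected)
```

A fuller statistical comparison would also take the within-group variability of each spot into account; the fold-change rule here is only the simplest possible criterion.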
In yet a further aspect of the present invention, the new image is aligned with the master composite image through the use of a common anchor point. Common anchor points correspond to spots present in both the new image and the master composite image. Anchor points selected from the master composite spot data list can include primary anchor points and secondary anchor points. Primary and secondary anchor points are obtained at different stages in the image processing using different selection criteria to select the master composite spot data list proteins to be used.
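Matching of new image spots to anchor points within a position tolerance and an IOD% tolerance can be sketched as follows. The sketch assumes Euclidean distance for the position tolerance, an absolute difference in percentage points for the IOD% tolerance, and a nearest-candidate rule when several new image spots qualify; these choices and the name `match_anchor_points` are assumptions made for illustration.

```python
import math


def match_anchor_points(anchors, new_spots, pos_tol, iod_tol):
    """Match each anchor point to the nearest qualifying new image spot.

    anchors, new_spots: {spot number: (x, y, iod_pct)}.
    A new spot qualifies if it lies within pos_tol of the anchor position
    and its IOD% differs from the anchor's by at most iod_tol.
    Returns {anchor spot number: matched new image spot number}.
    """
    matches = {}
    used = set()
    for a_num, (ax, ay, a_iod) in anchors.items():
        best, best_dist = None, None
        for n_num, (nx, ny, n_iod) in new_spots.items():
            if n_num in used:
                continue
            dist = math.hypot(nx - ax, ny - ay)
            if dist <= pos_tol and abs(n_iod - a_iod) <= iod_tol:
                if best is None or dist < best_dist:
                    best, best_dist = n_num, dist
        if best is not None:
            matches[a_num] = best
            used.add(best)  # each new image spot is matched at most once
    return matches


# Hypothetical anchor points and new image spots.
anchor_points = {1: (10.0, 10.0, 2.0), 2: (50.0, 50.0, 1.0)}
new_image_spots = {
    101: (11.0, 10.0, 2.1),   # close in position and IOD% to anchor 1
    102: (50.0, 52.0, 3.0),   # close to anchor 2 in position, not in IOD%
    103: (80.0, 80.0, 1.0),   # IOD% matches anchor 2, but too far away
}
anchor_matches = match_anchor_points(anchor_points, new_image_spots, pos_tol=3.0, iod_tol=0.5)
```

Requiring agreement in IOD% as well as position means that a nearby spot of very different intensity, such as spot 102 in the example, is not accepted as an anchor match.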
In still a further aspect of the present invention, well-defined spots have a saturation value S in the range of 0.2 < S < 0.8. Saturated spots have a saturation value S ≥ 0.8. Weak spots have a saturation value S ≤ 0.2.
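The three saturation classes defined above translate directly into a small classification routine. The thresholds are those stated in the text; the function name and the string labels are hypothetical.

```python
def classify_spot(saturation):
    """Classify a spot by its saturation value S (0 = white, 1 = black)."""
    if not 0.0 <= saturation <= 1.0:
        raise ValueError("saturation value must lie between 0 and 1")
    if saturation >= 0.8:
        return "saturated"
    if saturation <= 0.2:
        return "weak"
    return "well-defined"   # 0.2 < S < 0.8


# Hypothetical saturation values for demonstration.
classes = [classify_spot(s) for s in (0.1, 0.2, 0.5, 0.8, 0.95)]
```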
In another aspect, a related method of the invention comprises
(a) providing at least one recorded image of at least a portion of the 2DGE gel comprising the unaffected or affected proteins, the proteins being resolvable as spots in the protein image;
(b) analyzing the image to identify (i) at least one of the unaffected or affected proteins; (ii) qualitative or quantitative changes in at least one of the affected proteins; (iii) at least one identifying characteristic of at least one of the affected proteins; or (iv) at least one marker protein present in each 2DGE gel from the normal, treated, diseased or immunologically affected cells.
In this method, at least one of the proteins can be selected from the group consisting of unaffected proteins, affected proteins or marker proteins.
The invention, in another aspect, provides methods and computer systems for identifying or characterizing unaffected proteins and affected proteins that distinguish normal cells from treated, diseased or immunologically affected cells, in vitro or in vivo. The sample can be subjected to two dimensional (2D) gel electrophoresis to provide a 2DGE gel comprising the unaffected or affected proteins.
In another aspect, the computer-based system comprises
(a) a computer readable medium having stored thereon at least one protein image or protein composite image of at least a portion of the 2DGE gel comprising the unaffected or affected proteins, the proteins being resolvable as spots in the protein image or in the protein composite image;
(b) at least one computing subroutine that, when executed on a computer, causes the computer to analyze the protein image or protein composite image to provide output data representing at least one of the unaffected or affected proteins, the output data optionally further comprising at least one marker image or marker composite image representing at least one marker protein present in each 2DGE gel from the affected and unaffected cells, wherein the protein image or protein composite image, when used to compare images or composite images of the unaffected and affected proteins, identifies (i) qualitative or quantitative changes in at least one of the affected proteins; or (ii) at least one identifying characteristic of at least one of the affected proteins; and
(c) retrieval means for recording the output data comprising the protein image or protein composite image, and optionally further comprising (1) data for the marker image or marker composite image; (2) data for the qualitative or quantitative changes; or (3) data for said at least one characteristic.
In a further aspect, the invention provides a computer method, comprising
(a) providing a computer readable medium having stored thereon at least one protein image or protein composite image of at least a portion of the 2DGE gel comprising the unaffected or affected proteins, the proteins being resolvable as spots in the protein image or in the protein composite image;
(b) analyzing, on a computer using at least one computing subroutine executed in the computer, the at least one protein image or protein composite image to provide output data representing at least one of the unaffected or affected proteins, the output data optionally further comprising at least one marker image or marker composite image representing at least one marker protein present in each 2DGE gel from the normal, treated, diseased or immunologically affected cells, wherein the protein image or protein composite image, when used to compare images or composite images of the unaffected and affected proteins, identifies (i) qualitative or quantitative changes in at least one of the affected proteins; or (ii) at least one identifying characteristic of at least one of the affected proteins; and
(c) obtaining the output data comprising the protein image or protein composite image, and optionally further comprising at least one of (1) data for the marker image or marker composite image; (2) data for the qualitative or quantitative changes; or (3) data for the at least one characteristic.
In the above computer system or method, the at least one characteristic of at least one of said proteins can be characterised in a number of ways including but not limited to protein identity, pI, molecular weight, amino acid sequence, IOD%, mass spectra or a protein modification.
The invention also provides computer readable media comprising output data provided by the above computer system or method.
In preferred embodiments, computer systems or methods of the present invention are provided where the treated cells have been treated with at least one compound prior to providing the cell sample. The compound, such as a chemical or a biological molecule, can be a potential diagnostic or therapeutic compound.
In methods, computer systems or gels of the present invention, qualitative changes can be changes in the structure of at least one of said proteins in said 2DGE gel, and quantitative changes can be changes in the amount of at least one of said proteins in said 2DGE gel.
In methods, computer systems or gels of the present invention, at least one characteristic can be selected from the group consisting of pI, molecular weight, %IOD, amino acid sequence, mass spectra and a protein modification.
In methods, computer systems or gels of the present invention, the cell type or cell line can be derived from a prokaryotic or eukaryotic cell, and it is preferred that the eukaryotic cell is a mammalian cell or bird cell.
In methods, computer systems or gels of the present invention, the treated cells can have been treated with at least one compound prior to providing said cell sample, where a preferred compound is selected from the group consisting of a protein, a nucleic acid and a chemical compound, and a more preferred compound can be a potential drug.
According to the present invention, at least one purified protein is provided, where the protein corresponds to a protein identified or characterized by methods, computer systems or gels of the present invention.
It is a feature of the present invention that it can analyze and interpret new gel images, and also conduct statistical comparisons between groups of gel images.
It is a further feature of the present invention that it uses information from a single gel (using default tolerance values) or a master composite image to guide the analysis and interpretation of new gel images.
It is yet a further feature of the present invention that it uses the integrated optical density percentage, as well as the position, in locating spots in new gel images.
It is an advantage of the present invention that new gel images can be analyzed and interpreted with minimal input from an operator.
It is a further advantage of the present invention that new gel images can be analyzed and interpreted quickly and efficiently.
It is a still further advantage of the present invention that it can reliably detect proteins that are present in small amounts.
It is yet a further advantage of the present invention that it is not limited to analysis and interpretation of two-dimensional gel electrophoresis images, and can be used to compare any two similar images, whether black and white or color or in any situation where image interpretation and recognition is involved. This process could include the comparison of “freshly derived images” from any image capture device with an image recovered from a computer memory device.
Other objects of the invention will be apparent to skilled practitioners from the following detailed description and examples relating to the present invention.