The invention relates to methods and means for obtaining, storing and using an index or catalog of proteins. The catalog can be specific for, for example, an organelle, cell, tissue, organ, organism or population.
Proteins are the working parts of living cells. With the near completion of the Human Genome Project, there is now a need for an integrated system and program for obtaining, organizing, searching and experimentally using global information on the protein composition of cells, and on how that composition varies in development, in disease, and in response to drugs, toxic agents and other experimental variables.
The human genome is estimated to code for up to 100,000 different proteins. Most if not all are post-translationally modified, and/or are transported from the site of synthesis to the site of function. Many are elements of signaling or communication pathways. The protein composition of cells changes in an organized manner during development, and many cell-specific proteins are known.
Methods for separating or identifying proteins by immunochemical means are widely used and well understood. However, no large-scale systematic means for producing protein-specific antibodies has been described. Hence, a library of antibodies to match the ever-increasing number of isolated proteins or the genomic data from the Human Genome Project does not exist.
The final proof that a given protein is present in a given cell type, and in a specific organelle of that cell type, can be provided by immunochemical studies on carefully prepared cell and tissue sections. Many instances of such studies have been reported; however, systematic use of such procedures to confirm the localization of multiple proteins, much less large numbers of proteins, has not been described. Such studies cannot proceed in the absence of a library of well-characterized antibodies to a library of specific proteins.
While many of the elements of the multi-dimensional Human Genome Project now exist, at least in part, the extension of that information to systematic large-scale studies requires innovation, automation and integration. Tissue and protein samples and fractions rapidly degrade; hence, it is not feasible to organize a project aimed at characterizing all of the proteins in a fashion similar to the Human Genome Project based on cooperative efforts at many sites. To further handle perishable samples, automation is best developed in intimate contact with an existing operating system. In addition, the elements of an integrated system must match each other in throughput and in time requirements. For example, cell fractionation of sets of tissues obtained at the same time must match the requirements of the next step in the fractionation process. Thus, the hierarchical disassembly of a freshly obtained tissue to cells, subcellular fractions, separation and analysis at the protein level, and data acquisition and analysis must match and must include quality control elements so that key steps may be repeated while the samples are still in good condition and available.
To organize, search and experimentally manipulate information relating to such a large number of functional entities will require a theoretical framework in which new knowledge can be organized, means for obtaining the wide range of data required, and means for doing the experimental studies required to test new hypotheses. Such means did not exist previously in an integrated or integratable form.
The human body is composed of approximately 252 different cell types, all descended through different intermediate cells from the three germ layers, and ultimately from a single fertilized human egg. While all diploid cells contain the same genetic information, different genes are expressed in different cell types and at different times during development and during the cell cycle. A protein gene product expressed in several cell types may differ in abundance. In addition, most, if not all, proteins are post-translationally modified. Further, proteins are synthesized in one set of structures (ribosomes), but target themselves into other subcellular structures.
Proteins are the working parts of living cells. All are parts of self-assembling machines, all can change in abundance in response to experimental and physiological variables, and all turn over constantly, but at different rates. Under starvation conditions the total cell mass may decrease without loss of any individual function of the resting state, and the cell will regain, but not exceed, a predetermined mass when returned to conditions of normal nutrition, suggesting that the proteome, with its tens of thousands of proteins, is a highly coordinated system.
While collections of proteins are well known, they have not been previously integrated into a unified system able to acquire, organize and sort the data now required to understand both the molecular anatomy and the molecular physiology of man in terms of the human proteome. It is evident that such a system would make possible the detailed description of diseased states, contribute to understanding aging, redefine cancer, and allow both pharmacology and toxicology to be rewritten.
There is therefore an evident need for a catalog of all of the known proteins that can serve both the passive anatomical function of a data repository and the active physiological function of a search engine for new data and discoveries. An essential attribute of an index is searchability. There is a need for a system, means and organization to create an index that allows the data contained therein to be searched for new information and relationships.
It is evident that although some of the data required for such an active index can be acquired from the scientific literature, only an integrated program, analogous to those in atomic physics and space research, can provide and manage the vast amounts of data that can and should be acquired.
A Human Protein Index was hypothesized in Anderson and Anderson, Journal of Automatic Chemistry 2(4):177-178 (1980) and Anderson and Anderson, Clinical Chemistry 28(4):739-748 (1982), and, in conjunction with the Human Genome Project, in Anderson and Anderson, American Biotechnology Laboratory, September/October 1985. However, heretofore, the materials and methods to allow for the development of such a resource of information were not available.
The instant invention relates to a method and means for systematically studying proteins to provide data thereon to enable making a catalog of proteins. The method of interest accounts for intertissue and interindividual variability. The method of interest enables the rapid provisional identification of proteins between and among samples. That provisional identification, which later can be confirmed, then can be relied on to develop further provisional identifications of other proteins in the same or other samples. The method reveals sample-specific markers, such as tissue-specific markers. The method provides a protein reference standard, be it for an individual protein, a set of proteins or a pattern of polypeptide spots appearing on a 2-D gel. That sort of reference standard can be applied across organelles, tissues, organs, individuals and so on. The catalog of proteins thus is useful for identifying and comparing similar and identical proteins from other sources, such as other tissues, other individuals of a population and other species. The catalog and patterns will reveal relationships between and among proteins, for example, expression thereof under defined conditions, coregulation of proteins and so on. Therefore, proteins that are coordinately expressed or regulated will be revealed, as will proteins with a reciprocal or antagonistic pattern of expression wherein expression of one protein wanes or does not occur when another is expressed. The method yields a reference point for determining the reaction of an individual or a cell, and the proteins thereof, to a stimulus. The method provides a reference point to distinguish manifestations arising from an abnormal state, such as a disease state. The catalog of proteins is useful for identifying sequences of nucleotides, or clones from a genomic or cDNA bank, that could or do encode a particular protein.
As to clones from a genomic bank, knowing the protein will enable determination of what processing of the genomic sequence occurs to obtain expression of the open reading frame. The protein index or database can be aligned, for example, with a chromosomal map or with a morbid gene map to reveal associations with a particular protein and with a particular disease, respectively. Identification of such markers will lead to the development of particular diagnostic and therapeutic materials and methods.
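By way of illustration only, the search for coordinately and reciprocally expressed proteins described above might be sketched as follows. The sketch assumes that each catalog entry carries a vector of spot abundances measured across the same ordered set of samples, and it uses the Pearson correlation with an arbitrary cutoff of 0.9 as the measure of co-variation. The spot names, abundance values, cutoff and function names are hypothetical and are not taken from the specification, which does not prescribe a particular statistic.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant profile carries no co-variation signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def classify_pairs(abundance, threshold=0.9):
    """Split protein pairs into coordinate and reciprocal sets.

    `abundance` maps a protein (spot) identifier to its abundances
    across the same ordered samples. `threshold` is an assumed cutoff.
    """
    coordinate, reciprocal = [], []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if r >= threshold:
            coordinate.append((a, b))   # rise and fall together
        elif r <= -threshold:
            reciprocal.append((a, b))   # one wanes as the other rises
    return coordinate, reciprocal


# Hypothetical spot abundances across four samples (arbitrary units).
abundance = {
    "spot_A": [1.0, 2.0, 3.0, 4.0],
    "spot_B": [2.0, 4.1, 5.9, 8.0],
    "spot_C": [4.0, 3.0, 2.0, 1.0],
    "spot_D": [2.0, 2.0, 2.0, 2.0],
}
coordinate, reciprocal = classify_pairs(abundance)
print("coordinate:", coordinate)
print("reciprocal:", reciprocal)
```

In this hypothetical data set, spot_A and spot_B are reported as coordinately expressed, spot_C is reported as reciprocal to both, and the invariant spot_D pairs with nothing. A rank-based statistic or a different cutoff could be substituted without changing the overall approach.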