Proteins are the working parts of living cells. With the near completion of the Human Genome Project there is now a need for an integrated system and program for obtaining, organizing, searching, and experimentally using global information on the protein composition of cells, and on how that composition varies in development, in disease, and in response to drugs, toxic agents, and other experimental variables.
The human genome is estimated to code for up to 100,000 different proteins. Most if not all are post-translationally modified, and/or are transported from the site of synthesis to the site of function. Many are elements of signaling or communication pathways. The protein composition of cells changes in an organized manner during development, and many cell-specific proteins are known.
Methods for separating or identifying proteins by immunochemical means are widely used and well understood. However, no large-scale systematic means for producing protein-specific antibodies has been described; hence, a library of antibodies to match the ever-increasing number of isolated proteins or the genomic data from the Human Genome Project does not exist.
The final proof that a given protein is present in a given cell type, and in a specific organelle of that cell type, can be provided by immunochemical studies on carefully prepared cell and tissue sections. Many instances of such studies have been reported; however, systematic use of such procedures to confirm the localization of multiple proteins, much less large numbers of proteins, has not been described. Such studies cannot proceed in the absence of a library of well-characterized antibodies to a library of specific proteins.
While many of the elements of the multi-dimensional Human Genome Project now exist, at least in part, the extension of that information to systematic large-scale studies requires innovation, automation and integration. Tissue and protein samples and fractions rapidly degrade; hence, it is not feasible to organize a project aimed at characterizing all of the proteins in a fashion similar to the Human Genome Project, based on cooperative efforts at many sites. Further, because the samples are perishable, automation is best developed in intimate contact with an existing operating system. In addition, the elements of an integrated system must match each other in throughput and in time requirements. For example, cell fractionation of sets of tissues obtained at the same time must match the requirements of the next step in the fractionation process. Thus, the steps in the hierarchical disassembly of a freshly obtained tissue (into cells and subcellular fractions, followed by separation and analysis at the protein level and by data acquisition and analysis) must match one another and must include quality control elements so that key steps may be repeated while the samples are still in good condition and available.
To organize, search and experimentally manipulate information relating to such a large number of functional entities will require a theoretical framework in which new knowledge can be organized, means for obtaining the wide range of data required, and means for doing the experimental studies required to test new hypotheses. Such means did not exist previously in an integrated or integratable form.
The human body is composed of approximately 252 different cell types, all descended through different intermediate cells from the three germ layers, and ultimately from a single fertilized human egg. While all diploid cells contain the same genetic information, different genes are expressed in different cell types and at different times during development and during the cell cycle. A protein gene product expressed in several cell types may differ in abundance. In addition, most, if not all, proteins are post-translationally modified. Further, proteins are synthesized in one set of structures (ribosomes), but target themselves into other subcellular structures.
It has been estimated that between 28,000 and 120,000 genes are present in a human. The present consensus estimate is between 30,000 and 70,000 genes. However, each gene does not necessarily correspond to one protein. Many genes are expressed in only one gender, at only one developmental stage, or only in response to certain stimuli. Thus, the number of protein “gene products” present is considerably smaller.
However, a single gene may produce several different protein forms as the result of alternative splicing, cleavage of signal sequences, post-translational glycosylation, phosphorylation, cleavage, complexing with cofactors, metal ions, or other proteins, and other modifications. For example, the well-characterized protein insulin may be found as the C chain, or as the A chain linked to the B chain by disulfide bonds. If a separation or purification is performed under reducing conditions, the A and B chains will be separated. Thus, a single “gene product” may be visualized as up to three different “proteins” depending on the conditions.
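The insulin example can be restated as a small worked sketch. The function name and the species labels below are hypothetical, chosen only for illustration; the sketch simply counts how many distinct species would be seen for the one gene product under reducing and non-reducing conditions, as described in the text.

```python
from __future__ import annotations


def observed_species(reducing: bool) -> set[str]:
    """Return the distinct species seen for the insulin gene product.

    Illustrative only: the labels are hypothetical names for the forms
    described in the text, not output from any real separation.
    """
    # The C chain is a separate species under either condition.
    species = {"C chain"}
    if reducing:
        # Reducing conditions break the link between the A and B chains,
        # so each chain appears on its own.
        species |= {"A chain", "B chain"}
    else:
        # Without reduction the A and B chains stay linked and appear
        # as a single species.
        species.add("A chain linked to B chain")
    return species


non_reduced = observed_species(reducing=False)
reduced = observed_species(reducing=True)
print(len(non_reduced), "species without reduction")
print(len(reduced), "species under reducing conditions")
```

Under these assumptions one gene product yields two species without reduction and three under reducing conditions, which is the "up to three" case mentioned above.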
Proteins are the working parts of living cells. All are parts of self-assembling machines, all can change in abundance in response to experimental and physiological variables, and all turn over constantly, but at different rates. Under starvation conditions the total cell mass may decrease without loss of any individual function of the resting state, and the cell will regain but not exceed a predetermined mass when returned to conditions of normal nutrition, suggesting that the proteome, with its tens of thousands of proteins, is a highly coordinated system.
While collections of proteins are well known, they have not been previously integrated into a unified system able to acquire, organize and sort the data now required to understand both the molecular anatomy and the molecular physiology of man in terms of the human proteome. It is evident that such a system would make possible the detailed description of diseased states, contribute to understanding aging, redefine cancer, and allow both pharmacology and toxicology to be rewritten.
There is therefore an evident need for a catalog of all of the known proteins that can serve both the passive anatomical function of a data repository and an active physiological function as a search engine for new data and discoveries. An essential attribute of an index is searchability. There is a need for a system, means, and organization to create an index that can be searched for new information and relationships in the data it contains.
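The distinction between a passive repository and a searchable index can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `ProteinIndex`, the field names `cell_type` and `organelle`, and the placeholder records are hypothetical and do not describe the system of the source or any real data.

```python
from collections import defaultdict


class ProteinIndex:
    """Toy index: stores records and supports search by attribute."""

    def __init__(self):
        # Repository role: protein name -> its recorded attributes.
        self._records = {}
        # Search role: (field, value) -> names of proteins with that value.
        self._by_attribute = defaultdict(set)

    def add(self, name, **attributes):
        """Store one record; this sketch assumes each name is added once."""
        self._records[name] = attributes
        for field, value in attributes.items():
            self._by_attribute[(field, value)].add(name)

    def get(self, name):
        """Look up a single record by name (the repository role)."""
        return self._records[name]

    def search(self, **criteria):
        """Return names of proteins matching every given criterion."""
        if not criteria:
            return set(self._records)
        matches = [
            self._by_attribute.get((field, value), set())
            for field, value in criteria.items()
        ]
        return set.intersection(*matches)


# Placeholder records, not real proteins or localizations.
index = ProteinIndex()
index.add("protein_A", cell_type="cell_type_1", organelle="organelle_X")
index.add("protein_B", cell_type="cell_type_1", organelle="organelle_Y")
index.add("protein_C", cell_type="cell_type_2", organelle="organelle_X")

print(sorted(index.search(cell_type="cell_type_1")))
print(sorted(index.search(cell_type="cell_type_1", organelle="organelle_X")))
```

Keeping a second mapping from attribute values to names is one simple way to make the stored data searchable by any recorded property; a real index would also need the experimental variables discussed above, which this sketch omits.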
It is evident that although some of the data required for such an active index can be acquired from the scientific literature, only an integrated program, analogous to those in atomic physics and space research, can provide and manage the vast amounts of data that can and should be acquired.
A Human Protein Index was hypothesized in Anderson & Anderson, Journal of Automatic Chemistry 2 (4): 177–178 (1980) and Anderson & Anderson, Clinical Chemistry 28 (4): 739–748 (1982), and, in conjunction with the Human Genome Project, in Anderson & Anderson, American Biotechnology Laboratory, September/October 1985. However, heretofore, the materials and methods to allow for the development of such a resource of information were not available.