The present invention relates to the analysis of life science data. More particularly, the present invention relates to computer based interpretation of biological sequences. Even more particularly, the present invention relates to a client/server or Internet based computer tool and method for identification of potential protein, DNA or RNA sites of interest based upon the underlying amino acid or DNA or RNA nucleic acid sequence characteristics.
Currently, life scientists and molecular biologists are working with a wide variety of manual and automated tools to determine particular characteristics regarding molecular biology data. While automated protein sequencing tools presently generate large volumes of protein amino acid sequence data, tools for easily handling and interpreting the new data have yet to become commonplace. Scientists have been attempting to manage their molecular biology data in a wide variety of ways, from expensive, dedicated and proprietary computer systems to manually reviewing data placed into common word processors or text editors not optimized for handling large amounts of life science information.
The challenge of the first approach lies primarily in its limited accessibility by the average life scientist. Dedicated proprietary computer systems for sequencing and interpretation of molecular biology information often cost far beyond what the budgets of small research operations will permit. Other drawbacks exist in addition to acquisition price, such as closed and user unfriendly proprietary system architecture which does not facilitate cross-platform sharing of molecular biology sequence information. Often researchers using state of the art proprietary systems purchased at great expense will encounter difficulty sharing molecular biology data with other researchers on different computer systems in the same lab, let alone with colleagues in another institution or country.
The difficulty encountered at the other end of the spectrum is just as common, if not more so. Researchers unable to gain access to high-cost dedicated molecular biology computer systems may resort to the most rudimentary of toolsets to interpret their genetic sequence or corresponding protein data. Manually screening volumes of protein sequence data using basic text editors and word processors is not unheard of, despite the fact that these tools were never designed or optimized to handle genetic data in any form. In addition, despite the high degree of sophistication the average life scientist may have in his or her particular field, a commensurate level of computer ability is often not present. User interfaces currently designed for state of the art molecular biology computer systems can be so user-unfriendly that a life scientist may actually prefer to work with a simple, easy-to-use text editor instead of an inflexible proprietary system. As for the problem of collaborative work on related sequence data, neither approach facilitates remote access to lab-generated or public domain sequence library information.
In the end, a technologically robust and user-friendly system for remotely interpreting and managing life science data is truly needed. An improved and broadly accessible tool for interpreting life science data would not only aid the research process itself, but would also bring about its end products, the discoveries, more quickly. Data brought closer to understanding by the life scientist consequently means accelerated medical breakthroughs, improved drug therapies, and better understood systematic models of disease and regulatory processes.
By combining a powerful biological sequence site scoring tool with remote computer access functionality, a web-based tool for the identification of molecular biology sequence sites is hereby disclosed. An example of the present system and method functionality is provided using the identification of Caspase cleavage sites as a working example. Scoring as applied to potential protein modification sites is based on amino acid sequence characterization, and is readily adapted for use with nucleic acid sequences.
Disclosed is an objective, quantitative method and apparatus for searching and evaluating biological sequence data relative to a selected functional characteristic, such as enzyme cleavage site, binding site, secondary structure, or potential modification site. Software is used to scan known target sequences of amino acids, DNA or RNA base pairs, searching for sequence regions exhibiting composition characteristics derived from scoring matrices provided by user input. Characteristics may include number of residues, presence of specific residues, or specific sequences of residues. Sequence regions exhibiting characteristics similar to the predetermined characteristics are identified, flagged and quantitatively scored for closeness of fit to the group of all predetermined characteristics, including quantitative scoring for mandatory characteristics and exclusionary characteristics. Scoring takes place based upon one or more scoring matrices which detail the individual predetermined characteristics and their respective quantitative scores. The scoring matrices can be used to predict the relative functional effect of individual biological sequences within a potential sequence site and help interpret combinations of sequences relative to the specific functional characteristic of interest. The invention further provides the user with the ability to select threshold cutoff values to be used by the software for evaluation of scoring matrix results, thereby assisting the user in the site location identification process, and providing the user the ability to evaluate the effect of substitutions of characteristics.
To practice the claimed invention on a particular protein amino acid sequence cleavage site, one or more scoring matrices are developed. These scoring matrices are derived by comparing the cleavage sites from known protein targets and determining the frequency of amino acid content at each position. A score for each possible amino acid is then set for each position based on this frequency. For example, a particular cleavage site in a protein may contain 5 amino acids. If it were found that Aspartic acid occurred 50% of the time at the first position and Leucine 50% of the time at this position, then each of these amino acids would have a score of 0.5 and the remaining amino acids would have a score of 0 at this position. To ensure the return of particular results, such as when particular amino acids must be present (weighted score greater than or equal to 1) or must not be present (negative weighted score), scores outside of the anticipated frequency range can also be inserted. Thus, for each of the five positions in a protein cleavage site a score for each of the 20 amino acids is created, and this information is stored in the scoring matrix. Each possible cleavage site in a target protein is assigned a cumulative score based on this matrix. All possible cleavage sites can be listed, sorted by this score. A threshold can be set such that only scores above a certain level of identity are returned when queried. This search can be performed on a single protein or on large public protein databases residing on the initial client computer, on the central server computer, or remotely in public databases accessible via the Internet. The data searched can be resident on the server undertaking the analysis, or remotely retrieved from public or private sequence databases. In addition, the results returned can be sent to a single remote client computer, or to a plurality of remote systems.
Due to the pervasive nature of the Internet, it is an intended consequence of the claimed invention that multi-user collaboration is made possible under the client/server computer model, with data sets and stored queries easily shared among users.
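The derivation of a scoring matrix described above can be sketched as follows. This is a minimal illustration in Python and not the disclosed implementation: the function name `build_scoring_matrix`, the `required` and `excluded` override arguments, and the two toy input sites are assumptions introduced here for illustration only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def build_scoring_matrix(known_sites, required=None, excluded=None):
    """Return one {amino_acid: score} dict per site position.

    known_sites: aligned, equal-length strings of known sites.
    required:    {position_index: {amino_acid: score}} overrides with
                 scores of 1 or more (hypothetical argument format).
    excluded:    {position_index: {amino_acid: score}} overrides with
                 negative scores (hypothetical argument format).
    Position indices are 0-based.
    """
    if not known_sites:
        raise ValueError("at least one known site is needed")
    length = len(known_sites[0])
    if any(len(site) != length for site in known_sites):
        raise ValueError("all known sites must have the same length")

    matrix = []
    for pos in range(length):
        # Frequency of each amino acid at this position across known sites.
        counts = Counter(site[pos] for site in known_sites)
        matrix.append({aa: counts[aa] / len(known_sites) for aa in AMINO_ACIDS})

    # Scores outside the 0..1 frequency range mark mandatory or
    # forbidden residues, as the text describes.
    for overrides in (required or {}, excluded or {}):
        for pos, scores in overrides.items():
            matrix[pos].update(scores)
    return matrix


# Toy input chosen to mirror the 50% / 50% example in the text; these
# are not real cleavage-site data.
toy_sites = ["DEVDG", "LEVDG"]
toy_matrix = build_scoring_matrix(
    toy_sites,
    required={3: {"D": 2.0}},   # residue that must be present at position 4
    excluded={2: {"F": -1.0}},  # residue that must not appear at position 3
)
print(toy_matrix[0]["D"], toy_matrix[0]["L"], toy_matrix[0]["A"])
```

With the toy input, Aspartic acid and Leucine each score 0.5 at the first position and every other amino acid scores 0 there, matching the example in the text.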
In the working example of Caspases, the Caspases are a family of proteases that are known to play a key role in the regulation of programmed cell death (apoptosis). These proteins have a high degree of substrate specificity and Caspase cleavage of specific key regulatory proteins is thought to play an integral role in cell death. This specificity is achieved by recognition of specific amino acid patterns in target proteins. For example, the amino acid sequence DEADG (aspartic acid, glutamic acid, alanine, aspartic acid and glycine) in Retinoblastoma protein (pRb) is recognized and cleaved by Caspase 3 during apoptosis. The aspartic acid in the 4th position is absolutely required for all Caspase cleavage, while the other 4 amino acids determine whether the sequence is cleaved by Caspases and if so by which Caspase. Thus, while aspartic acid residues are required for a Caspase cleavage site, the surrounding amino acids will determine whether Caspases can cleave the protein at that particular aspartic acid. Additionally, those surrounding amino acids can determine which Caspase acts on that site. Recently, it has become clear that Caspases can also regulate other cellular processes such as proliferation and differentiation. Thus, Caspases are critical regulators of cell fate and may play roles in the pathogenesis of diseases such as cancer, autoimmune disease, AIDS and Alzheimer's Disease. The identification of Caspase substrates may therefore provide insight into the regulatory pathways involved in these diseases, and advances in characterizing potential Caspase cleavage sites would clearly advance medical discoveries in these areas.
The claimed invention as applied to this working example scans a protein's amino acid sequence for potential cleavage sites and scores them using user-defined scoring matrices based on the consensus sites for several different proteases. In the working example of Caspases, the evaluated Caspase variants are Caspase 3, Caspase 6 and Caspase 8. These scoring matrices are derived by comparing the cleavage sites from known Caspase targets and determining the frequency of amino acid content at each position. A score for each possible amino acid is then set for each position based on this frequency. A score can also reflect particular user-defined characteristics. For example, Caspase cleavage sites contain 5 amino acids. If it were found that Aspartic acid occurred 50% of the time at the first position and Leucine 50% of the time at this position, then each of these amino acids would have a score of 0.5 and the remaining amino acids would have a score of 0 at this position. For event determinative characteristics, a required amino acid at a particular position can be assigned a frequency score greater than one to guarantee inclusion of this indicator in the returned results. Also, amino acids requiring exclusion at a particular position can be assigned a score less than zero to ensure that potential sites including this amino acid at this position are not returned. Thus, for each of the five positions in a Caspase site a score for each of the 20 amino acids is created, and this information is stored in the scoring matrix. Each possible cleavage site in a target protein is assigned a cumulative score based on this matrix. For Caspase cleavage, every aspartic acid is considered to be a potential cleavage site. Aspartic acid is in fact required at the fourth sequence location, and is consequently assigned a value of '2' at the fourth position to guarantee inclusion in the results returned. All possible cleavage sites can be listed, sorted by this score.
A threshold can be set such that only scores above a certain level of identity are returned. This search can be performed on a single protein or on large public protein databases, located on the client computer, on the central server computer, or in a remote public database accessible through the Internet. In the following tables, the exemplary scoring values for amino acids in Caspase cleavage sites are presented as Caspase 3, Caspase 6 and Caspase 8 tables, followed by the union table Caspaseall.
Table 1: Caspase 3 cleavage site scoring; length of site is 5, and 4 is the position of the required amino acid D (Aspartic Acid)
Table 2: Caspase 6 cleavage site scoring; length of site is 5, and 4 is the position of the required amino acid D (Aspartic Acid)
Table 3: Caspase 8 cleavage site scoring; length of site is 5, and 4 is the position of the required amino acid D (Aspartic Acid)
Table 4: Caspaseall cleavage site scoring; length of site is 5, and 4 is the position of the required amino acid D (Aspartic Acid)
Summary of cleavage table and explanation.
Once the scoring matrices have been developed (examples provided as Tables 1-4 above), sequence analysis can take place. In reviewing a particular protein sequence through the working model of potential Caspase cleavage sites, the presence of Aspartic acid at the fourth position is required. Consequently, in each of the detailed scoring tables, a value higher than one (2 in this working example) is assigned to Aspartic acid at the fourth position to account for its mandatory inclusion in any returned sequence. In addition, mandatory sequence information can be placed in a header to a given matrix which describes the length of the site followed by the location of the required amino acid. This would be '5 4' according to the Caspase working example, since the cleavage site is five amino acids long, with the required amino acid at the fourth position. A consequence of Caspase cleavage sites requiring an Aspartic acid at the fourth position is that sequence scoring may be optimized based upon this characteristic. While this optimization is clearly available when searching for Caspase cleavage sites, the presence of any absolutely required residue can be used to similarly optimize other sequence searches according to the following method. Since the Aspartic acid is required for a cleavage site, there is little benefit to scoring sequence data until an Aspartic acid is found. Since it is clearly easier for a computer to scan for a particular amino acid than to read sets of five amino acids and perform scoring calculations based upon a selected matrix, the present embodiment of the claimed invention reads through the sequences until an Aspartic acid is found. Only then does scoring take place, based upon the sequences surrounding the Aspartic acid.
Since Aspartic acid is required for the fourth sequence position in the cleavage site, scoring then takes place on the three amino acids prior to the Aspartic acid, as well as on the amino acid after the Aspartic acid. Throughput is thus optimized above and beyond that which would have been obtained if each and every amino acid in a protein had been scored.
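The anchored search described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the function name `scan_anchored`, its parameters, the whole-number `demo_matrix` and the sequence `MKDEVDGAAD` are all hypothetical and were chosen only to make the control flow easy to follow.

```python
def scan_anchored(sequence, matrix, required_pos=4, required_residue="D",
                  threshold=4.0):
    """Return (start_index, site, score) tuples, highest score first.

    matrix:       list of {amino_acid: score} dicts, one per site position;
                  residues missing from a dict are scored as 0 (assumption).
    required_pos: 1-based position of the required residue within the site,
                  as in the '5 4' header convention described in the text.
    Only windows containing the required residue at required_pos are scored.
    """
    site_length = len(matrix)
    offset = required_pos - 1
    hits = []
    # Jump from one occurrence of the required residue to the next
    # instead of scoring every window in the sequence.
    anchor = sequence.find(required_residue)
    while anchor != -1:
        start = anchor - offset
        end = start + site_length
        if start >= 0 and end <= len(sequence):
            site = sequence[start:end]
            score = sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(site))
            if score > threshold:  # strictly greater than the cutoff
                hits.append((start, site, score))
        anchor = sequence.find(required_residue, anchor + 1)
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical whole-number matrix, not taken from the scoring tables.
demo_matrix = [{"D": 1.0}, {"E": 1.0}, {"V": 1.0}, {"D": 2.0}, {"G": 1.0}]
demo_hits = scan_anchored("MKDEVDGAAD", demo_matrix)
print(demo_hits)
```

In this toy sequence only the Aspartic acid at index 5 has three residues before it and one after it, so a single window is scored and the other two Aspartic acids are skipped without any matrix lookups.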
In parallel with or subsequent to development of the scoring matrices, the threshold for returning results must be decided upon. This cutoff threshold will determine the specificity of potential characterization sites which will be returned. In the Caspase model, a threshold value of 4 was selected. This means that sequences scored with a particular matrix must have a value greater than four to be returned in a search as a putative Caspase cleavage site. Applying the Caspase 3 scoring table to the known Caspase cleavage site of DEVDG listed in FIG. 3, this site would return a score of 4.667, which is well above the threshold cutoff value, and is in fact the highest score possible according to this scoring table. The value of 4.667 is the sum of 1 for Aspartic Acid at the first position (which is required for Caspase 3 cleavage, hence the score of 1), 0.397 for Glutamic Acid at the second position, 0.270 for Valine at the third position, 2 for Aspartic Acid at the fourth position (the required amino acid in this example), and 1 for Glycine at the fifth position (which is one of three possible required amino acids at this position). If a particular amino acid sequence did not have Aspartic Acid at the fourth position, its score would drop by two, since all other amino acids have a score of zero at the fourth position. The sequence would then fall below the threshold cutoff score of four and would not be returned. Similarly, if a particular amino acid was expressly not desired at a particular sequence position, assigning that amino acid a negative score such as negative one would select against any result containing that amino acid at the specified position.
In the working example described, substituting phenylalanine for valine at the third position would drop the score by 1.270, since the 0.270 contributed by valine at the third position would not be added, and phenylalanine has a score of negative one for the third position. Consequently, the score for the five amino acid sequence would become 3.397, and the sequence would be excluded as a potential Caspase 3 cleavage site since its score is less than the threshold cutoff value of 4.
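The arithmetic of this worked example can be checked with a short Python sketch. Only the Caspase 3 scores quoted in the text are entered; treating every other entry as 0 is an assumption made here, and the names `CASPASE3_QUOTED`, `THRESHOLD` and `site_score` are hypothetical.

```python
# Only the Caspase 3 scores quoted in the text. Every residue not listed
# is treated as scoring 0, which is an assumption of this sketch and does
# not reproduce the full scoring table.
CASPASE3_QUOTED = [
    {"D": 1.0},               # position 1
    {"E": 0.397},             # position 2
    {"V": 0.270, "F": -1.0},  # position 3
    {"D": 2.0},               # position 4, the required residue
    {"G": 1.0},               # position 5
]
THRESHOLD = 4.0


def site_score(site, matrix):
    """Cumulative score of a site: the sum of its per-position scores."""
    total = sum(matrix[i].get(aa, 0.0) for i, aa in enumerate(site))
    return round(total, 3)  # the text quotes scores to three decimals


devdg = site_score("DEVDG", CASPASE3_QUOTED)
defdg = site_score("DEFDG", CASPASE3_QUOTED)
print(devdg, devdg > THRESHOLD)  # 4.667 True
print(defdg, defdg > THRESHOLD)  # 3.397 False
```

The difference between the two scores is 1.270, the 0.270 lost by removing valine plus the penalty of 1 for phenylalanine.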
For the web-based implementation of the described tool and method, a programming language such as the Perl programming language may be used, in conjunction with Apache (open source web server software) and MySQL (an open source relational database) running under the Linux operating system. Key components can be implemented using a module written in the C programming language. This tool and method can readily be extended to search for any user-defined protein motif in a protein. For example, to search for potential phosphorylation sites, a scoring matrix reflecting a user-defined phosphorylation consensus sequence would be substituted for the Caspase cleavage specific scoring matrices used in the example presented. Minor modifications to the user interface would allow the user to select from all matrices available (e.g. through a pull-down menu). Similarly, other public or private protein databases could be substituted or added to those shown. Though protein databases are used in this example, the method could be extended to nucleotide databases provided these nucleotide sequences were translated into the appropriate amino acid sequence using a standard codon table prior to application of this method.
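The translation step mentioned above can be sketched in Python as follows. This is an illustrative sketch, not the disclosed implementation (which the text describes in terms of Perl and C): the names `CODON_TABLE` and `translate`, the `frame` parameter, the use of "X" for unrecognized codons, and the example nucleotide string are assumptions introduced here.

```python
from itertools import product

# Standard genetic code, written with the bases in the order T, C, A, G
# for each codon position; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(nucleotides, frame=0):
    """Translate a DNA or RNA string into one-letter amino acid codes.

    frame: 0, 1 or 2, the offset of the first codon (hypothetical parameter).
    Codons containing unrecognized characters are translated as "X", and a
    trailing partial codon is ignored (both are choices of this sketch).
    """
    seq = nucleotides.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(frame, len(seq) - 2, 3):
        protein.append(CODON_TABLE.get(seq[i:i + 3], "X"))
    return "".join(protein)


# Example nucleotide string constructed for this sketch so that it
# encodes the DEVDG site discussed in the text.
print(translate("GATGAAGTTGATGGT"))
```

The resulting amino acid string could then be scored against a matrix exactly as a protein database entry would be. A fuller treatment would presumably translate all reading frames of both strands, which this sketch does not attempt.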