As is well known, amino acids are the building blocks of proteins. Proteins make up the bulk of cellular structures, and some proteins serve as enzymes that facilitate cellular reactions. Twenty different amino acids are known to occur in proteins. The properties of each protein are dictated in part by the precise sequence of its component amino acids.
Databases of amino acids and proteins are maintained by a variety of research organizations, including, for example, the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine, and the Influenza Sequence Database at the Los Alamos National Laboratory. These databases are typically accessible via the Internet through web pages that provide a researcher with capabilities to search for and retrieve specific proteins. These databases may also be accessible to researchers via local-area and wide-area networks. Additionally, researchers may directly access amino acid and protein databases stored on peripheral devices, such as magnetic disks, optical disks, static memory devices, and a variety of other digital storage media known in the art.
In amino acid and protein databases, amino acids are typically encoded as alphabetic characters. FIG. 1 lists each amino acid known to occur in proteins and provides a typical 3-letter abbreviation and single-letter code by which the amino acids may be represented in databases, according to a standard supplied by the International Union of Pure and Applied Chemistry (IUPAC).
A given protein may be described by its sequence of amino acids. For example, using the single-letter code given in FIG. 1, the character string “crvpsgvdla” corresponds to the protein defined by the following sequence of amino acids: cysteine, arginine, valine, proline, serine, glycine, valine, aspartic acid, leucine, and alanine.
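By way of illustration only, the decoding described above may be sketched in Python as follows. The dictionary restates the standard IUPAC single-letter codes for the twenty amino acids; the names AMINO_ACID_NAMES and decode_sequence are hypothetical and chosen solely for this example.

```python
# Single-letter IUPAC codes for the twenty standard amino acids.
# The names AMINO_ACID_NAMES and decode_sequence are illustrative assumptions.
AMINO_ACID_NAMES = {
    "a": "alanine", "r": "arginine", "n": "asparagine", "d": "aspartic acid",
    "c": "cysteine", "q": "glutamine", "e": "glutamic acid", "g": "glycine",
    "h": "histidine", "i": "isoleucine", "l": "leucine", "k": "lysine",
    "m": "methionine", "f": "phenylalanine", "p": "proline", "s": "serine",
    "t": "threonine", "w": "tryptophan", "y": "tyrosine", "v": "valine",
}


def decode_sequence(sequence):
    """Return the amino acid names for a string of single-letter codes."""
    return [AMINO_ACID_NAMES[code] for code in sequence.lower()]


print(", ".join(decode_sequence("crvpsgvdla")))
```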
When a protein database is searched for proteins that satisfy certain criteria (for example, those proteins relating to cancer in humans), the protein database search engine may respond by identifying hundreds or thousands of matching proteins. This set of matching proteins may be narrowed by supplying additional search criteria. At any point during the search process, specific proteins may be selected and reviewed. In FIG. 2, a printout describes a specific protein identified from an NCBI search for proteins relating to human cancer.
As can be seen in FIG. 2, a protein description may include, among other identifying information, the title of the protein (“Differential expression of a novel serine protease homologue in squamous cell carcinoma of the head and neck”), the authors of the protein description (“Lang, J. C. and Schuller, D. E.”), and the organism from which the protein was isolated (“Homo sapiens”).
Protein descriptions may include a specific sequence of amino acids that define the protein. For example, in FIG. 2, amino acid sequence data is found at the end of the description of the protein, in a section of the printout prefaced by the word “ORIGIN.” In this example, the first few amino acids are “myrpdvvrar,” which correspond to methionine, tyrosine, arginine, proline, aspartic acid, valine, valine, arginine, alanine, and arginine.
Some protein descriptions may include a sequence of nucleic acid bases, rather than a sequence of amino acids, to define the protein. As is known, a sequence of three nucleic acid bases (i.e., a nucleic acid base triplet) may correspond to an amino acid according to a mapping provided by the table found in FIG. 3. Each nucleic acid base triplet identified in the table represents or corresponds to a specific amino acid. For example, the nucleic acid triplet GCT (guanine-cytosine-thymine) corresponds to the amino acid Alanine. Similarly, the nucleic acid triplet GCA (guanine-cytosine-adenine) also corresponds to the amino acid Alanine. As another example, the nucleic acid triplets AAA and AAG (adenine-adenine-adenine and adenine-adenine-guanine, respectively) each correspond to the amino acid Lysine.
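The triplet-to-amino-acid mapping may be illustrated by the following minimal Python sketch. The table is deliberately partial: it contains only the alanine and lysine triplets of the standard genetic code, not the full table of FIG. 3, and the names CODON_TO_AMINO_ACID and translate are assumptions made for this example.

```python
# Partial codon table (an assumption for illustration): the triplets named in
# the text plus the two remaining alanine triplets of the standard genetic code.
CODON_TO_AMINO_ACID = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "AAA": "K", "AAG": "K",                          # lysine
}


def translate(bases):
    """Translate a base string, three bases at a time, into one-letter codes.

    Triplets missing from the partial table are reported as '?'.
    Trailing bases that do not form a complete triplet are ignored.
    """
    bases = bases.upper()
    usable = len(bases) - len(bases) % 3
    codons = (bases[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TO_AMINO_ACID.get(codon, "?") for codon in codons)


print(translate("GCTGCAAAAAAG"))
```

A complete implementation would carry all 64 triplets, including the stop triplets, which this sketch omits.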
The Replikin Pattern
In previous patent applications, the inventors have identified and described a pattern of amino acids that has been designated a “Replikin pattern” or simply a “Replikin.” A Replikin pattern comprises a sequence of about 7 to about 50 contiguous amino acids that includes the following three (3) characteristics:
(1) the sequence has at least one lysine residue located six to ten amino acid residues from a second lysine residue;
(2) the sequence has at least one histidine residue; and
(3) the sequence has at least 6% lysine residues.
Replikins have been shown to be associated with rapid replication in fungi, yeast, viruses, bacteria, algae, and cancer cells. Based on this association, it is believed that Replikins may be an indicator of disease. Additionally, an increase in concentration of Replikins over time may be an indicator of the imminent onset of disease. For example, before each of the three influenza pandemics of the last century (identified as H1N1, H2N2 and H3N2), there was a significant increase in the concentration of Replikins in the corresponding influenza virus. With respect to the H5N1 influenza, FIG. 4 illustrates a rapid increase in the concentration of Replikins per 100 amino acids just prior to epidemics in 1997 (indicated as E1), 2001 (indicated as E2) and 2004 (indicated as E3). Replikin patterns have been found in a variety of disease-related proteins, including proteins associated with cancers of the lung, brain, liver, soft tissue, salivary gland, nasopharynx, esophagus, stomach, colon, rectum, gallbladder, breast, prostate, uterus, cervix, bladder, eye, and kidney, as well as with forms of melanoma, lymphoma, and leukemia. Importantly, Replikin patterns appear to be absent from the normal healthy human genome. FIG. 5 lists selected examples of Replikin patterns that have been found in various organisms.
For example, the 13-residue pattern “hyppkpgcivpak,” occurring in Hepatitis C (which is the last entry in the Tumor Virus Category of FIG. 5) is a Replikin pattern because: (1) it contains two lysine residues that are 8 positions apart; (2) it contains a histidine residue; and (3) the percentage of lysine residues is 2/13, which is 15.4%.
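The three-part test applied in the preceding example may be sketched in Python as follows. This is a minimal illustration and not the inventors' implementation. It assumes that the distance between two lysines is the difference of their positions (consistent with the "8 positions apart" reading above) and that the "about 7 to about 50" length range may be treated as hard limits. The function name is_replikin and its parameters are hypothetical.

```python
def is_replikin(sequence, min_gap=6, max_gap=10, min_lysine_pct=6.0):
    """Check one candidate sequence against the three Replikin characteristics.

    Assumptions (not stated in the source): the lysine-to-lysine distance is
    the difference in positions, and the length limits 7 and 50 are hard.
    """
    seq = sequence.lower()
    if not 7 <= len(seq) <= 50:
        return False

    # (1) at least one pair of lysines located six to ten positions apart
    lysines = [i for i, residue in enumerate(seq) if residue == "k"]
    spaced = any(
        min_gap <= later - earlier <= max_gap
        for n, earlier in enumerate(lysines)
        for later in lysines[n + 1:]
    )

    # (2) at least one histidine residue
    has_histidine = "h" in seq

    # (3) at least 6% lysine residues
    lysine_pct = 100.0 * len(lysines) / len(seq)

    return spaced and has_histidine and lysine_pct >= min_lysine_pct


print(is_replikin("hyppkpgcivpak"))
```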
Amino Acid Search Tools
As is known in the art, databases of proteins and amino acids may be searched using a variety of database tools and search engines. Using these publicly available tools, patterns of amino acids may be described and located in many different proteins corresponding to many different organisms. Several methods and techniques are available by which patterns of amino acids may be described. One popular format is the PROSITE pattern. A PROSITE pattern description may be assembled according to the following rules:
(1) The standard International Union of Pure and Applied Chemistry (IUPAC) one-letter codes for the amino acids are used (see FIG. 1).
(2) The symbol ‘x’ is used for a position where any amino acid is accepted.
(3) Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets ‘[ ]’. For example: [ALT] would stand for Alanine or Leucine or Threonine.
(4) Ambiguities are also indicated by listing between a pair of curly brackets ‘{ }’ the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Alanine and Methionine.
(5) Each element in a pattern is separated from its neighbor by a ‘-’.
(6) Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range in parentheses. Examples: x(3) corresponds to x-x-x, and x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
(7) When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a ‘<’ symbol or respectively ends with a ‘>’ symbol.
(8) A period ends the pattern.
Examples of PROSITE patterns include:
PA [AC]-x-V-x(4)-{ED}. This pattern is translated as: [Alanine or Cysteine]-any-Valine-any-any-any-any-{any but Glutamic Acid or Aspartic Acid}
PA <A-x-[ST](2)-x(0,1)-V. This pattern, which must be in the N-terminal of the sequence (‘<’), is translated as: Alanine-any-[Serine or Threonine]-[Serine or Threonine]-(any or none)-Valine.
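For illustration, the PROSITE rules listed above may be mechanically translated into the regular expression syntax of Python's standard re module, as in the following sketch. The function name prosite_to_regex is hypothetical, the leading "PA" line code shown in the examples is assumed to have been removed beforehand, and only the eight rules given above are handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (rules 1-8 above) into a Python regex string.

    Hypothetical helper for illustration; the "PA" line code is not handled.
    """
    body = pattern.strip().rstrip(".")      # rule 8: a period ends the pattern
    prefix = suffix = ""
    if body.startswith("<"):                # rule 7: N-terminal restriction
        prefix, body = "^", body[1:]
    if body.endswith(">"):                  # rule 7: C-terminal restriction
        suffix, body = "$", body[:-1]

    parts = []
    for element in body.split("-"):         # rule 5: '-' separates elements
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError("empty pattern element in %r" % pattern)
        core, repeat = match.group(1), match.group(2)
        if core in ("x", "X"):              # rule 2: any amino acid
            core = "."
        elif core.startswith("{"):          # rule 4: excluded amino acids
            core = "[^" + core[1:-1] + "]"
        # rule 3 ('[...]') and rule 1 (single letters) are already valid regex
        if repeat is not None:              # rule 6: repetition count or range
            core += "{" + repeat + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix


for prosite in ("[AC]-x-V-x(4)-{ED}.", "<A-x-[ST](2)-x(0,1)-V."):
    print(prosite, "->", prosite_to_regex(prosite))
```

Because the PROSITE examples use capital letters while database sequences are often stored in lower case, a caller would typically pass re.IGNORECASE when searching with the translated expression.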
Another popular format for describing amino acid sequence patterns is the regular expression format that is familiar to computer scientists. In computer science, regular expressions are typically used to describe patterns of characters for which finite automata can be automatically constructed to recognize tokens in a language. Possibly the most notable regular expression search tool is the Unix utility grep.
In the context of describing amino acid sequence patterns, a simplified set of regular expression capabilities is typically employed. Amino acid sequence patterns defined by these simple regular expression rules end up looking quite similar to PROSITE patterns, both in appearance and in result. A regular expression description for an amino acid sequence may be created according to the following rules:
(1) Use capital letters for amino acid residues and put a “-” between two amino acids (not required).
(2) Use “[ . . . ]” for a choice of multiple amino acids in a particular position. [LIVM] means that any one of the amino acids L, I, V, or M can be in that position.
(3) Use “{ . . . }” to exclude amino acids. Thus, {CF} means C and F should not be in that particular position. In some systems, the exclusion capability can be specified with a “^” character. For example, ^G would represent all amino acids except Glycine, and [^ILMV] would represent all amino acids except I, L, M, and V.
(4) Use “x” or “X” for a position that can be any amino acid.
(5) Use “(n)”, where n is a number, for multiple positions. For example, x(3) is the same as “xxx”.
(6) Use “(n1,n2)” for multiple or variable positions. Thus, x(1,4) represents “x” or “xx” or “xxx” or “xxxx”.
(7) Use the symbol “>” at the beginning or end of the pattern to require the pattern to match the N or C terminus. For example, “>MDEL” finds only sequences that start with MDEL. “DEL>” finds only sequences that end with DEL.
The regular expression “[LIVM]-[VIC]-x(2)-G-[DENQTA]-x-[GAC]-x(2)-[LIVMFY](4)-x(2)-G” describes a 17-amino-acid peptide that has: an L, I, V, or M at position 1; a V, I, or C at position 2; any residue at positions 3 and 4; a G at position 5; and so on.
Other similar formats are in use as well. For example, the Basic Local Alignment Search Tool (BLAST) is a well-known system available on the Internet, which provides tools for rapid searching of nucleotide and protein databases. BLAST accepts input sequences in three formats: FASTA sequence format, NCBI Accession numbers, or GenBank sequence numbers. However, these formats are even simpler in structure than regular expressions or PROSITE patterns. An example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLL
LNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTI
MAGLVFHSQKYNLRLRQAWCHFPSNWKGAWKEVKEEIVNLPKERYRGTND
PKRIFFQRQWGDPETANLWFNCHGEFFYCKMDWFLNYLNNLTVDADHNEC
KNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKKTYAPPREGHL
ECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLL
AGILQQQKNLLAAVEAQQQMLKLTIWGVK
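A FASTA record of this kind consists of a description line beginning with “>” followed by one or more lines of sequence data. The following Python sketch shows one way such text might be read. It is an illustrative assumption and not part of BLAST; the function name parse_fasta is hypothetical, and the sample holds only the first 100 residues of the record above.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Truncated excerpt of the record shown above (first two sequence lines only).
sample = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLL
LNGSYSENRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTI
"""

for description, sequence in parse_fasta(sample):
    print(description, len(sequence))
```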
Features of the BLAST system include sequence comparison algorithms that are used to search sequence databases for regions of local alignment in order to detect relationships among sequences that share regions of similarity. However, the BLAST tools are limited in terms of the structure of amino acid sequences that can be discovered and located. For example, BLAST is not capable of searching for a sequence that has “at least one lysine residue located six to ten amino acid residues from a second lysine residue,” as required by a Replikin pattern. Nor is BLAST capable of searching for amino acid sequences that contain a specified percentage or concentration of a particular amino acid, such as a sequence that has “at least 6% lysine residues.”
Need for Replikin Search Tools
As can be seen from its definition, a Replikin pattern description cannot be represented as a single linear sequence of amino acids. Thus, PROSITE patterns and regular expressions, both of which are well suited to describing ordered strings obtained by following logical set-constructive operations such as negation, union and concatenation, are inadequate for describing Replikin patterns.
In contrast to linear sequences of amino acids, a Replikin pattern is characterized by attributes of amino acids that transcend simple contiguous ordering. In particular, the requirement that a Replikin pattern contain at least 6% lysine residues, without more, means that the actual placement of lysine residues in a Replikin pattern is relatively unrestricted. Thus, in general, it is not possible to represent a Replikin pattern description using a single PROSITE pattern or a single regular expression.
Accordingly, there is a need in the art for a system and method to scan a given amino acid sequence and identify all instances of a Replikin pattern. Similarly, there is a need in the art for a system and method to search protein databases and amino acid databases for amino acid sequences that match a Replikin pattern. Additionally, there is a need in the art for a generalized search tool that permits researchers to locate amino acid sequences of arbitrary specified length that include any desired combination of the following characteristics: (1) a first amino acid residue located more than N positions and less than M positions away from a second amino acid residue; (2) a third amino acid residue located anywhere in the sequence; and (3) at least R percent of a fourth amino acid residue. Finally, the shortcomings of the prior art are even more evident in research areas relating to disease prediction and treatment. There is a significant need in the art for a system to predict in advance the occurrence of disease (for example, to predict strain-specific influenza epidemics) and similarly to enable synthetic vaccines to be designed based on amino acid sequences or amino acid motifs that are discovered to be conserved over time and that have not been detectable by prior art methods of searching proteins and amino acid sequences.
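Purely to make the three generalized characteristics concrete, a brute-force search may be sketched in Python as follows. This is an assumption-laden illustration and not the system or method of the invention. Every window of every permitted length is tested; "more than N and less than M positions away" is read as a strict inequality on the difference in positions; and the function name find_patterns and all parameter names are hypothetical.

```python
def find_patterns(sequence, min_len, max_len,
                  first, second, n, m, third, fourth, r_percent):
    """Yield (start, window) for each subsequence meeting all three criteria.

    Brute-force sketch under stated assumptions; all names are hypothetical.
    """
    seq = sequence.lower()
    first, second = first.lower(), second.lower()
    third, fourth = third.lower(), fourth.lower()
    for length in range(min_len, max_len + 1):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            # (2) the third residue occurs anywhere in the window
            if third not in window:
                continue
            # (3) at least r_percent of the window is the fourth residue
            if 100.0 * window.count(fourth) / length < r_percent:
                continue
            # (1) a first residue more than n and less than m positions
            #     away from a second residue
            firsts = [i for i, res in enumerate(window) if res == first]
            seconds = [i for i, res in enumerate(window) if res == second]
            if any(n < abs(i - j) < m for i in firsts for j in seconds):
                yield start, window


# Replikin-style parameters: lysines six to ten positions apart (n=5, m=11),
# one histidine anywhere, and at least 6% lysine.
hits = list(find_patterns("hyppkpgcivpak", 7, 13,
                          "k", "k", 5, 11, "h", "k", 6.0))
print(hits)
```

The exhaustive enumeration is chosen here for clarity rather than efficiency; its cost grows with the product of the sequence length and the range of permitted window lengths.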