Rapid replication is characteristic of virulence in certain bacteria, viruses and malignancies, but no chemistry common to rapid replication in different organisms has been described. The inventors have found a family of conserved small protein sequences related to rapid replication, Replikins. Such Replikins offer new targets for developing effective detection methods and therapies. There is a need in the art for methods of identifying patterns of amino acids such as Replikins.
Bioinformatic Identification of Amino Acid Sequences
Identification of amino acid sequences, nucleic acid sequences and other biological structures may be aided by bioinformatics. Publicly available databases containing amino acid and nucleic acid sequence information may be searched to identify and define Replikins, Replikin Scaffolds and Exoskeleton Scaffolds within representative proteins or protein fragments or genomes or genome fragments.
Databases of amino acids and proteins are maintained by a variety of research organizations, including, for example, the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine, and the Influenza Sequence Database at the Los Alamos National Laboratory. These databases are typically accessible via the Internet through web pages that provide a researcher with capabilities to search for and retrieve specific proteins.
Amino Acid Search Tools
As is known in the art, databases of proteins and amino acids may be searched using a variety of database tools and search engines. Using these publicly available tools, patterns of amino acids may be described and located in many different proteins corresponding to many different organisms. Several methods and techniques are available by which patterns of amino acids may be described. One popular format is the PROSITE pattern. A PROSITE pattern description may be assembled according to the following rules:
(1) The standard International Union of Pure and Applied Chemistry (IUPAC) one-letter codes for the amino acids are used (see FIG. 12).
(2) The symbol ‘x’ is used for a position where any amino acid is accepted.
(3) Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets ‘[ ]’. For example: [ALT] would stand for Alanine or Leucine or Threonine.
(4) Ambiguities are also indicated by listing between a pair of curly brackets ‘{ }’ the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Alanine and Methionine.
(5) Each element in a pattern is separated from its neighbor by a ‘-’.
(6) Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
(7) When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a ‘<’ symbol or respectively ends with a ‘>’ symbol.
(8) A period ends the pattern.
Examples of PROSITE patterns include:
PA [AC]-x-V-x(4)-{ED}. This pattern is translated as: [Alanine or Cysteine]-any-Valine-any-any-any-any-{any but Glutamic Acid or Aspartic Acid}.
PA <A-x-[ST](2)-x(0,1)-V. This pattern, which must be in the N-terminal of the sequence (‘<’), is translated as: Alanine-any-[Serine or Threonine]-[Serine or Threonine]-(any or none)-Valine.
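The PROSITE rules listed above map fairly directly onto conventional regular expression syntax. The following Python sketch illustrates one possible translation; it is not part of PROSITE itself. The function name prosite_to_regex is hypothetical, the leading "PA" line code is assumed to have been removed before the call, and ‘x’ is assumed to mean any capital letter.

```python
import re

# One pattern element: x, a single residue, [..] or {..}, with optional (n) or (n,m)
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern into a Python regular expression (sketch)."""
    body = pattern.strip().rstrip(".")        # rule 8: a period ends the pattern
    prefix = suffix = ""
    if body.startswith("<"):                  # rule 7: N-terminal restriction
        prefix, body = "^", body[1:]
    if body.endswith(">"):                    # rule 7: C-terminal restriction
        suffix, body = "$", body[:-1]
    parts = []
    for element in body.split("-"):           # rule 5: '-' separates elements
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                       # rule 2: any amino acid (assumed A-Z)
            core = "[A-Z]"
        elif core.startswith("{"):            # rule 4: excluded residues
            core = "[^" + core[1:-1] + "]"
        # rules 1 and 3: single letters and [..] sets pass through unchanged
        if low is not None:                   # rule 6: repetition count or range
            core += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix


first = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
second = prosite_to_regex("<A-x-[ST](2)-x(0,1)-V.")
```

Under these assumptions the two example patterns become "[AC][A-Z]V[A-Z]{4}[^ED]" and "^A[A-Z][ST]{2}[A-Z]{0,1}V", which can be passed to re.search against a one-letter amino acid string.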
Another popular format for describing amino acid sequence patterns is the regular expression format that is familiar to computer scientists. In computer science, regular expressions are typically used to describe patterns of characters for which finite automata can be automatically constructed to recognize tokens in a language. Possibly the most notable regular expression search tool is the Unix utility grep.
In the context of describing amino acid sequence patterns, a simplified set of regular expression capabilities is typically employed. Amino acid sequence patterns defined by these simple regular expression rules end up looking quite similar to PROSITE patterns, both in appearance and in result. A regular expression description for an amino acid sequence may be created according to the following rules:
(1) Use capital letters for amino acid residues and put a “-” between two amino acids (not required).
(2) Use “[ . . . ]” for a choice of multiple amino acids in a particular position. [LIVM] means that any one of the amino acids L, I, V, or M can be in that position.
(3) Use “{ . . . }” to exclude amino acids. Thus, {CF} means C and F should not be in that particular position. In some systems, the exclusion capability can be specified with a “^” character. For example, ^G would represent all amino acids except Glycine, and [^ILMV] would represent all amino acids except I, L, M, and V.
(4) Use “x” or “X” for a position that can be any amino acid.
(5) Use “(n)”, where n is a number, for multiple positions. For example, x(3) is the same as “xxx”.
(6) Use “(n1,n2)” for multiple or variable positions. Thus, x(1,4) represents “x” or “xx” or “xxx” or “xxxx”.
(7) Use the symbol “>” at the beginning or end of the pattern to require the pattern to match the N or C terminus. For example, “>MDEL” (SEQ ID NO: 13) finds only sequences that start with MDEL (SEQ ID NO: 13). “DEL>” finds only sequences that end with DEL.
The regular expression “[LIVM]-[VIC]-x(2)-G-[DENQTA]-x-[GAC]-x(2)-[LIVMFY](4)-x(2)-G” describes a 17-amino-acid peptide that has an L, I, V, or M at position 1; a V, I, or C at position 2; any residue at positions 3 and 4; a G at position 5; and so on.
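As a quick check of the 17-residue reading, the expression can be translated by hand into the syntax of Python's re module and applied to a test string. This is a sketch only: the peptide below is hypothetical, constructed so that each position satisfies the pattern, and “x” is written as “[A-Z]” on the assumption that sequences use capital one-letter codes.

```python
import re

# Hand translation of [LIVM]-[VIC]-x(2)-G-[DENQTA]-x-[GAC]-x(2)-[LIVMFY](4)-x(2)-G
# "x" (any residue) is rendered as [A-Z]; this alphabet is an assumption.
PATTERN = re.compile(
    r"[LIVM][VIC][A-Z]{2}G[DENQTA][A-Z][GAC][A-Z]{2}[LIVMFY]{4}[A-Z]{2}G"
)

# Hypothetical peptide, not taken from any database
peptide = "LVAAGDAGAALIVMAAG"

match = PATTERN.fullmatch(peptide)
matched_length = len(match.group()) if match else 0

# Replacing the required G at position 5 with P should break the match
mutant = peptide[:4] + "P" + peptide[5:]
mutant_match = PATTERN.fullmatch(mutant)
```

Because every element of the pattern has a fixed repeat count, any match is exactly 17 residues long.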
Other similar formats are in use as well. For example, the Basic Local Alignment Search Tool (BLAST) is a well-known system available on the Internet, which provides tools for rapid searching of nucleotide and protein databases. BLAST accepts input sequences in three formats: FASTA sequence format, NCBI Accession numbers, or GenBank sequence numbers. However, these formats are even simpler in structure than regular expressions or PROSITE patterns. An example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
(SEQ ID NO: 15).
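A FASTA record consists of a single description line beginning with “>” followed by one or more lines of sequence data. The following Python sketch shows how such a record might be read into a header and a contiguous sequence string. The function name parse_fasta is hypothetical, and the example text is a truncated excerpt containing only the first 120 residues of the sequence shown above.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text (sketch)."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):          # a new description line starts a record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # sequence line: drop any embedded spaces left by line wrapping
            chunks.append(line.replace(" ", ""))
    if header is not None:                # flush the last record
        records.append((header, "".join(chunks)))
    return records


# Truncated excerpt of the record above (first two sequence lines only)
example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

records = parse_fasta(example)
header, sequence = records[0]
fields = header.split("|")                # "|" separates the identifier fields
```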
Features of the BLAST system include sequence comparison algorithms that are used to search sequence databases for regions of local alignments in order to detect relationships among sequences which share regions of similarity. However, the BLAST tools are limited in terms of the structure of amino acid sequences that can be discovered and located. For example, BLAST is not capable of searching for a sequence that has “at least one lysine residue located six to ten amino acid residues from a second lysine residue,” as required by a Replikin pattern. Nor is BLAST capable of searching for amino acid sequences that contain a specified percentage or concentration of a particular amino acid, such as a sequence that has “at least 6% lysine residues.”
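Both of the quoted constraints are simple to state as predicates over a sequence string, even though they fall outside what an alignment search expresses. The Python sketch below is illustrative only: the function names are hypothetical, and “located six to ten amino acid residues from” is assumed to mean that the two lysine positions differ by 6 to 10 inclusive, which is one possible reading of the phrase.

```python
def has_lysine_pair(seq: str, min_gap: int = 6, max_gap: int = 10) -> bool:
    """True if two lysines (K) lie min_gap..max_gap positions apart.

    Assumption: the distance is the difference between the two indices.
    """
    positions = [i for i, aa in enumerate(seq) if aa == "K"]
    return any(
        min_gap <= later - earlier <= max_gap
        for n, earlier in enumerate(positions)
        for later in positions[n + 1:]
    )


def lysine_percent(seq: str) -> float:
    """Percentage of residues in seq that are lysine (K)."""
    return 100.0 * seq.count("K") / len(seq) if seq else 0.0


# Hypothetical peptides, used only to exercise the two predicates
spaced = has_lysine_pair("KAAAAAKA")      # lysines at indices 0 and 6
too_close = has_lysine_pair("KAK")        # lysines only 2 positions apart
percent = lysine_percent("KAAAAAAAAA")    # one lysine in ten residues
```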
Need for Replikin Search Tools
As can be seen from its definition, a Replikin pattern description cannot be represented as a single linear sequence of amino acids. Thus, PROSITE patterns and regular expressions, both of which are well suited to describing ordered strings obtained by following logical set-constructive operations such as negation, union and concatenation, are inadequate for describing Replikin patterns.
In contrast to linear sequences of amino acids, a Replikin pattern is characterized by attributes of amino acids that transcend simple contiguous ordering. In particular, the requirement that a Replikin pattern contain at least 6% lysine residues, without more, means that the actual placement of lysine residues in a Replikin pattern is relatively unrestricted. Thus, in general, it is not possible to represent a Replikin pattern description using a single PROSITE pattern or a single regular expression.
Accordingly, there is a need in the art for a system and method to scan a given amino acid sequence and identify and count all instances of a Replikin pattern. Similarly, there is a need in the art for a system and method to search protein databases and amino acid databases for amino acid sequences that match a Replikin pattern. Additionally, there is a need in the art for a generalized search tool that permits researchers to locate amino acid sequences of arbitrary specified length that include any desired combination of the following characteristics: (1) a first amino acid residue located more than N positions and less than M positions away from a second amino acid residue; (2) a third amino acid residue located anywhere in the sequence; and (3) the sequence contains at least R percent of an amino acid residue. Finally, the shortcomings of the prior art are even more evident in research areas relating to disease prediction and treatment. There is a significant need in the art for a system to predict the occurrence of disease in advance (for example, to predict strain-specific influenza epidemics) and, similarly, to enable synthetic vaccines to be designed based on amino acid sequences or amino acid motifs that are discovered to be conserved over time and that have not been detectable by prior art methods of searching proteins and amino acid sequences.
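The generalized search described above can be sketched as a sliding-window scan over a sequence. The following Python example is a minimal illustration under stated assumptions and is not the system contemplated here: the names Criteria, window_matches and scan are hypothetical, the sketch requires all three characteristics at once rather than an arbitrary combination, and the distance between two residues is assumed to be the difference between their indices. The third residue “H” and the test sequence are arbitrary choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criteria:
    length: int            # length of each candidate subsequence
    first: str             # first residue of the spaced pair
    second: str            # second residue of the spaced pair
    n: int                 # pair must be MORE than n positions apart
    m: int                 # pair must be LESS than m positions apart
    third: str             # residue that may appear anywhere
    percent_residue: str   # residue subject to the percentage floor
    r: float               # minimum percentage of percent_residue


def window_matches(window: str, c: Criteria) -> bool:
    """Check characteristics (1), (2) and (3) against one window."""
    firsts = [i for i, aa in enumerate(window) if aa == c.first]
    seconds = [i for i, aa in enumerate(window) if aa == c.second]
    spaced = any(c.n < abs(i - j) < c.m for i in firsts for j in seconds)
    has_third = c.third in window
    dense = 100.0 * window.count(c.percent_residue) / len(window) >= c.r
    return spaced and has_third and dense


def scan(seq: str, c: Criteria) -> list[tuple[int, str]]:
    """Return (start index, subsequence) for every matching window."""
    hits = []
    for start in range(len(seq) - c.length + 1):
        window = seq[start:start + c.length]
        if window_matches(window, c):
            hits.append((start, window))
    return hits


# Illustrative values: n=5 and m=11 give a spacing of 6 to 10 positions
criteria = Criteria(length=8, first="K", second="K", n=5, m=11,
                    third="H", percent_residue="K", r=6.0)
hits = scan("AAKHAAAAKAA", criteria)      # hypothetical test sequence
count = len(hits)
```

Scanning every window of a single fixed length keeps the sketch short; a practical tool would presumably iterate over a range of lengths and report overlapping hits in whatever form the researcher needs.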