In the analysis of macromolecules (such as proteins, DNA, or RNA), discovering sequence patterns with variations may reveal the underlying function of a protein family. Protein motifs or patterns (including RNA/DNA patterns) are conserved regions, maintained with variations in the amino acids or nucleotide residues respectively, whose significance may be structural, functional, or evolutionary.
Discovering such sequence patterns with variations is used, for example, in drug discovery.
Functional patterns can be altered through mutation, and therefore they do not repeat precisely at the same location for each occurrence of the protein, which poses a challenge in discovering and analyzing these patterns.
Various prior art bioinformatics techniques may be used for functional pattern discovery. These are generally based on one of two approaches: (1) multiple sequence alignment, or (2) motif finding.
Multiple sequence alignment can align a set of protein sequences from the same protein family in order to identify important regions and sites in the resulting alignment. Common multiple sequence alignment tools include Clustal Omega, T-Coffee, DIALIGN, and HMMER. However, finding the globally optimal alignment is expensive to compute, and the problem is known to be NP-complete. Even with approximate heuristics added, multiple sequence alignment is not efficient in handling large datasets. Moreover, this approach is appropriate only for highly similar sequences, not for sequences with considerable dissimilarity. For dissimilar sequences, it is therefore more suitable to identify local similarities than to align the entire sequences globally. Thus, the suspected consensus regions may need to be located and pre-processed ahead of time.
Motif finding generally involves using combinatorial and probabilistic methods to identify protein function segments. Furthermore, these prior art solutions are generally based on finding patterns. For example, many combinatorial methods exhaustively enumerate all possible sequence patterns and derive the best consensus pattern from the enumerated results. One prior solution builds a graph in which the vertices are sequence patterns and the arcs connect similar sequence patterns. The cliques in this graph then represent the consensus patterns.
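By way of illustration only, the exhaustive enumeration described above can be sketched as follows. This is a minimal sketch under simplifying assumptions and does not reproduce any particular prior art tool: it considers only exact, fixed-length patterns (no variations), and the function names `enumerate_kmers` and `best_pattern` and the toy sequences are hypothetical.

```python
from collections import Counter


def enumerate_kmers(sequences, k):
    """Count, for every length-k substring, how many sequences contain it."""
    support = Counter()
    for seq in sequences:
        # A set ensures each sequence contributes at most once per pattern.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)
    return support


def best_pattern(sequences, k):
    """Return the (pattern, support) pair contained in the most sequences."""
    support = enumerate_kmers(sequences, k)
    # Highest support first; ties are broken alphabetically for determinism.
    return min(support.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy sequences sharing the segment "ACDE".
toy_sequences = ["MKACDEFL", "GGACDEFP", "TACDEYQW"]
pattern, count = best_pattern(toy_sequences, 4)
print(pattern, count)
```

Even in this simplified form, the number of enumerated patterns grows with the total sequence length and with each pattern length considered, which is consistent with the large solution sets such methods are known to generate.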
Furthermore, prior art probabilistic methods generally calculate the amino acid distribution at each fixed position to form an array of sequence patterns. One example involves a position-specific weight matrix, which estimates the amino acid at each position while assuming that each position is independent. An alternative method, known as random sequence synthesis, takes frame-shifted positions into consideration by optimally aligning amino acids to create a probabilistic sequence representation known as random sequences. Other probabilistic methods make use of a Markov model, where the current state depends only on a pre-specified set of past states. This is the case for example with the popular pFAM™ database (referred to below), which builds a profile Hidden Markov Model (HMM) from the multiple sequence alignment of a protein family for classifying proteins and predicting their functionality. In general, the probabilistic models compress the data into probability distributions and express amino acid associations as a sequence of independent random variables. With such a method, although each position has its own amino acid distribution, there is no specific way to express the complex amino acid associations, with statistical support, within the sequence patterns.
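By way of illustration only, a position-specific matrix of the kind described above can be sketched as follows. This is a minimal sketch, not the method of any named tool. It assumes pre-aligned segments of equal length, a uniform background distribution, and an add-one pseudocount; the names `build_pwm` and `log_odds_score` and the example segments are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pwm(aligned_segments, pseudocount=1.0):
    """Estimate an amino acid distribution for each aligned position."""
    length = len(aligned_segments[0])
    if any(len(seg) != length for seg in aligned_segments):
        raise ValueError("all segments must have the same length")
    total = len(aligned_segments) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(length):
        # Each column is estimated on its own: positions are assumed independent.
        counts = Counter(seg[pos] for seg in aligned_segments)
        pwm.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return pwm


def log_odds_score(pwm, segment, background=1.0 / len(AMINO_ACIDS)):
    """Sum per-position log-odds (base 2) against a uniform background."""
    return sum(math.log2(col[aa] / background) for col, aa in zip(pwm, segment))


# Hypothetical aligned segments from a toy protein family.
segments = ["ACDE", "ACDF", "ACDE", "GCDE"]
pwm = build_pwm(segments)
print(log_odds_score(pwm, "ACDE"), log_odds_score(pwm, "WWWW"))
```

Because the score is a sum of independent per-position terms, the matrix cannot record that, for example, a particular residue at the first position tends to co-occur with a particular residue at the fourth, which is the limitation noted above.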
Examples of known protein annotation databases include pFAM (already mentioned) and PROSITE™. Also, various computer systems and computer programs are known that incorporate motif finding features or functions, for example CONSENSUS™, MEME™, Gibbs™, and BLOCKS™.
A common problem is that these technologies and methods generate large solution sets. In part to manage these large solution sets, prior art technologies are constrained to, or are usually used so as to, limit analysis to the same or similar macromolecule families.
Furthermore, probabilistic motif finding requires a more elaborate representation of amino acid associations than is available in prior art solutions.
What is needed is a computer system and method that addresses some of these limitations.