I. Analyzing Polynucleotide Sequences by Clustering
The increasing amounts of polynucleotide sequence data present an analytical challenge. On the one hand, such large amounts of data provide an opportunity for extensive research; on the other hand, they are difficult to analyze by conventional analytical methods. One method that has been found to be generally effective for analyzing such large amounts of sequence data is clustering.
Clustering may be performed by a variety of methods. Hierarchical clustering, for example, seeks to create a hierarchy of segments or clusters by steps of either mergers or divisions. Agglomerative approaches build the hierarchy from the bottom up, by successive mergers of smaller clusters. Divisive approaches build it from the top down, by successive divisions of larger clusters. Some approaches combine the two1.
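The agglomerative approach described above can be illustrated with a minimal sketch. The choices below are assumptions made for illustration only and are not taken from the cited references: the distance between two sequences is the Hamming distance (so the sequences must have equal length), clusters are merged by single linkage, and the function names `hamming` and `agglomerative` and the four toy sequences are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def agglomerative(seqs, n_clusters):
    """Single-linkage agglomerative clustering (illustrative assumption).

    Starts with one cluster per sequence and repeatedly merges the two
    closest clusters until n_clusters remain. Returns clusters as sorted
    lists of indices into seqs.
    """
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(hamming(seqs[a], seqs[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # one merger step
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy sequences, invented for this example.
seqs = ["AAAAUGGC", "AAACUGGC", "GCCAUGGG", "GCCGUGGG"]
result = agglomerative(seqs, 2)
```

Recording the order of the merger steps, rather than only the final clusters, would yield the hierarchy itself. A divisive method would run in the opposite direction, starting from one cluster that contains every sequence.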
There are also non-hierarchical methods, which do not seek to create a hierarchy of segments or clusters. The K-Means clustering algorithm is an example of such a technique. It has been used in combination with other techniques, for example, to explore protein structure2. It has also been used to identify recurring local sequence motifs in proteins3.
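As a non-hierarchical counterpart, the following is a minimal sketch of K-Means (Lloyd's algorithm) applied to sequences. Several things here are assumptions for illustration and do not come from the cited studies: each sequence is represented by its dinucleotide frequency vector, the initial centroids are supplied by the caller, and the names `kmer_vector`, `sq_dist`, and `k_means` and the toy sequences are hypothetical.

```python
from itertools import product


def kmer_vector(seq, k=2, alphabet="ACGU"):
    """Relative k-mer frequencies of seq, in a fixed k-mer order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip k-mers with symbols outside the alphabet
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[w] / total for w in kmers]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(points, init_centroids, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids)."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy sequences, invented for this example.
seqs = ["AAAAACAAAA", "AACAAAAACA", "GCCGCCGCCG", "GCCGCCACCG"]
points = [kmer_vector(s) for s in seqs]
labels, centroids = k_means(points, init_centroids=[points[0], points[2]])
```

Unlike the hierarchical methods, the number of clusters is fixed in advance and no nesting of clusters is produced. The result generally depends on the initial centroids, which is why they are passed in explicitly here.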
II. Context Polynucleotide Sequence Analysis
Heidecker and Messing4 found the NNANNAUGGC (SEQ ID NO:1) motif in the AUG context. Joshi5 identified the consensus sequence of AAAAACAA[A/C]AAUGGC (SEQ ID NO:2).
More recently, a survey of 5074 plant genes demonstrated that higher plants have an AC-rich consensus sequence, aaaaacaA(A/C)aAUGGCg (SEQ ID NO:3), as the context of AUG6. These findings were recently supported7.
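For readers unfamiliar with how a consensus sequence is derived, the sketch below takes the most frequent nucleotide at each position of a set of aligned AUG contexts. It is a simplified illustration, not the procedure used in the cited surveys. The case rule (upper case when the most frequent nucleotide reaches a frequency threshold, lower case otherwise) is an assumed convention, and the function name `consensus`, the `strong` parameter, and the four aligned sequences are hypothetical.

```python
from collections import Counter


def consensus(aligned, strong=0.5):
    """Position-wise consensus of equal-length aligned sequences.

    The most frequent nucleotide in each column is reported in upper case
    if its frequency is at least `strong`, and in lower case otherwise
    (an assumed convention for marking weaker conservation).
    """
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    out = []
    for column in zip(*aligned):
        base, count = Counter(column).most_common(1)[0]
        if count / len(column) >= strong:
            out.append(base.upper())
        else:
            out.append(base.lower())
    return "".join(out)


# Toy alignment, invented for this example; the AUG sits at positions 6-8.
aligned = ["AAACAAUGGC", "AAAAAAUGGC", "CAACAAUGGC", "AAACAAUGGA"]
loose = consensus(aligned)               # threshold 0.5
strict = consensus(aligned, strong=0.8)  # threshold 0.8
```

A position such as (A/C), where two nucleotides are reported together, would need an extra rule for near-ties, which this sketch omits.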
Analyses of the 5′ untranslated region of vertebrate mRNA initially focused on conserved consensus sequence signals that accommodate translation initiation8. Subsequent studies attempted to analyze the consensus sequence around this translation initiation signal9. The latter study demonstrated conserved purines at position −3 and at position +4. The same study identified the following conserved sequence: (GCC)GCC(A/G)CCAUGG (SEQ ID NO:4).
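The position numbering used above can be made concrete with a short sketch that scans an mRNA string for AUG codons and reports whether a purine (A or G) occupies positions −3 and +4. It assumes the usual convention in which the A of AUG is position +1 and the nucleotide immediately upstream is −1, with no position 0. The function name `aug_contexts` and the test strings are hypothetical, and the first string is simply SEQ ID NO:4 with one alternative chosen at the (A/G) position and two arbitrary trailing nucleotides.

```python
PURINES = {"A", "G"}


def aug_contexts(mrna):
    """List (index, purine_at_minus3, purine_at_plus4) for every AUG.

    `index` is the 0-based offset of the A of the AUG. Position -3 is then
    index - 3 and position +4 is index + 3. Positions that fall outside
    the sequence are reported as False.
    """
    results = []
    start = mrna.find("AUG")
    while start != -1:
        minus3 = mrna[start - 3] if start >= 3 else None
        plus4 = mrna[start + 3] if start + 3 < len(mrna) else None
        results.append((start, minus3 in PURINES, plus4 in PURINES))
        start = mrna.find("AUG", start + 1)  # continue past this match
    return results


strong_context = aug_contexts("GCCGCCACCAUGGCU")
weak_context = aug_contexts("UUUUUUAUGCAUGA")
```

A scan of this kind only tests the two positions named in the text. Scoring a full match against the consensus would require comparing every position, for example with a position weight matrix.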
Consensus sequences are useful in research for locating the translation initiator codon. The untranslated leader sequence may additionally influence gene expression levels10. It was previously recognized that Kozak-like elements in the context of the initiator codon affect expression levels11,12,13,14. Accordingly, in U.S. Pat. No. 7,253,342, a leader sequence was used to directly influence the expression of the specifically attached gene, either by increasing expression or by maintaining stable mRNA levels15.