1. Field of the Invention
The present invention is in the fields of bioinformatics and molecular biology. More specifically, the invention provides a method, a system and an apparatus to predict, recognize and/or classify biological sequences, especially poorly conserved binding site recognition motifs, using rules extracted from the learning process of neural networks. The invention is particularly useful for the prediction, recognition and/or classification of promoters.
2. Prior Art
Promoter prediction and recognition in silico is a crucial open issue in molecular biology and a challenge in bioinformatics. Promoters are cis-acting elements located upstream of the transcription start site (TSS) of the Open Reading Frame (ORF). Gene expression begins with the recognition of the promoter by the RNA polymerase enzyme (RNAP). In bacteria, the RNAP holoenzyme consists of a five-subunit core (2α, β, β′, ω) and an additional sigma (σ) factor subunit (Borukhov and Nudler, 2003; Thiyagarajan et al., 2005). The σ subunit of RNAP is a key regulator of bacterial gene expression because it is responsible for the specific interaction of RNAP with the promoter region. The σ factors control transcription initiation by directing RNAP binding to specific promoter sequences and "melting" the double-stranded DNA; the transcription of a given gene therefore depends on the σ factor associated with the RNAP (Doucleff, 2007; Borukhov and Nudler, 2003; Hook-Barnard et al., 2006).
Bacterial cells use alternative σ factors, each specific for a different subset of promoters, in order to adapt to environmental changes (Borukhov and Nudler, 2003). E. coli has several σ factors, the most prevalent of which are σ24, σ28, σ32, σ38, σ54 and σ70 (the number indicates the molecular weight of each factor in kDa). Each σ family has a role in the bacterial response to environmental conditions and recognizes a different consensus promoter sequence. For example, σ32 has a role in the heat shock response, σ28 is associated with the expression of flagellar genes during normal growth, and σ70 is the major factor responsible for the bulk of transcription activity in the cell (Lewin, 2008; Borukhov and Nudler, 2003). Regardless of the family, all promoters have two important binding sites for RNAP, the −35 and −10 regions relative to the TSS nucleotide. These motifs are poorly conserved, particularly among σ families. The canonical consensus sequences for the −35 and −10 regions and the number of nucleotides between them are (Lewin, 2008):
σ32: CCCTTGAA 13-15 bp CCCGATNT (SEQ ID NO: 1)
σ28: CTAAA 15 bp GCCGATAA (SEQ ID NO: 2)
σ70: TTGACA 16-18 bp TATAAT (SEQ ID NO: 3)
σ54: CTGGNA 6 bp TTGCA (SEQ ID NO: 4)
The consensus motifs recognized by σ24 and σ38 have not been described, due to their low conservation or the small number of confirmed promoters.
The variation among the consensus sequences recognized by each σ factor, particularly in the relative position of the conserved motifs, limits the efficiency of a global analysis approach. Promoter prediction should be done for each σ family separately, since analyzing a given promoter by comparison with the σ70 promoter consensus motif can lead to incorrect results.
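As an illustration of why each σ family needs its own analysis, the consensus layouts listed above can be turned into one search pattern per family. The following Python fragment is only a minimal sketch and is not part of the invention: the names `CONSENSUS`, `build_pattern` and `scan` are hypothetical, and the code reports exact consensus matches only, whereas real promoters are poorly conserved and rarely match the consensus exactly.

```python
import re

# Consensus layouts copied from the list above:
# (-35 motif, (min spacer, max spacer) in bp, -10 motif). N = any nucleotide.
# The dictionary and function names are hypothetical, chosen for this sketch.
CONSENSUS = {
    "sigma32": ("CCCTTGAA", (13, 15), "CCCGATNT"),
    "sigma28": ("CTAAA", (15, 15), "GCCGATAA"),
    "sigma70": ("TTGACA", (16, 18), "TATAAT"),
    "sigma54": ("CTGGNA", (6, 6), "TTGCA"),
}


def build_pattern(family):
    """Compile a regular expression for one sigma family's consensus layout."""
    upstream, (lo, hi), downstream = CONSENSUS[family]

    def to_regex(motif):
        # Expand the ambiguity code N into a character class.
        return motif.replace("N", "[ACGT]")

    return re.compile(
        f"{to_regex(upstream)}[ACGT]{{{lo},{hi}}}{to_regex(downstream)}"
    )


def scan(sequence, family):
    """Return 0-based start offsets of non-overlapping exact consensus matches."""
    pattern = build_pattern(family)
    return [match.start() for match in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Synthetic sequence built for the example: a sigma70 consensus
    # with a 17 bp spacer, flanked by two bases on each side.
    toy = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
    for family in CONSENSUS:
        print(family, scan(toy, family))
```

On the synthetic sequence only the σ70 pattern matches, which mirrors the point made above: a sequence evaluated against the wrong family's consensus is simply not recognized.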
Promoter compilations and analyses have allowed the development of computer programs that predict the location of promoter sequences on the basis of their homology to consensus sequences or to a reference list of promoters (Polat and Günes, 2007). The classical approach to promoter prediction involves algorithms that use position weight matrices (PWMs). This methodology aligns example sequences and estimates the base preference at each position of a matrix (Gordon et al., 2006; Stormo, 2000; Hannenhalli and Wang, 2005).
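The PWM idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited work: the function names, the pseudocount of 1, the uniform background frequency of 0.25 and the log-odds scoring are all choices made for the example, and the aligned hexamers are made-up toy data rather than confirmed promoters.

```python
import math

BASES = "ACGT"


def build_pwm(aligned, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length, pre-aligned sequences.

    Assumptions of this sketch: a flat pseudocount per base and a
    uniform background frequency.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    pwm = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(BASES)
        # Base preference at this position, expressed as log2(frequency / background).
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for a window as long as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_window(pwm, sequence):
    """Return (score, 0-based offset) of the highest-scoring window."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PWM")
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    # Made-up aligned hexamers resembling a -10 region (toy data only).
    examples = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]
    pwm = build_pwm(examples)
    print(best_window(pwm, "GGCGTATAATGCC"))
```

Unlike an exact consensus search, the matrix assigns a graded score to every window, so a sequence that deviates from the consensus at one position still scores higher than an unrelated one.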
In recent years, machine learning approaches have been applied to promoter recognition and prediction. Among these, Support Vector Machines (SVMs) and Neural Networks (NNs) have given promising results. SVM methods use a training algorithm and can represent complex nonlinear functions. The algorithm aims to separate the data set into two classes by a hyperplane (Kapetanovich et al., 2004). SVMs can be applied to identify important biological elements, such as transcription factors (Holloway, 2007), promoters (Polat and Günes, 2007; Liang and Li, 2006) and transcription start sites (Gordon et al., 2006; Gao et al., 2009), among others.
NNs are computational tools that can represent complex nonlinear functions. They have been used for many biological applications, such as promoter prediction (Demeler and Zhou, 1991; Burden et al., 2005; Rani et al., 2007), gene expression (Tan and Pan, 2005; Janga and Collado-Vides, 2007) and protein analysis (Helles and Fonseca, 2009; Chae et al., 2009). NNs are well suited to promoter prediction and recognition because of their ability to identify degenerate, imprecise and incomplete patterns embedded in those sequences, and they have achieved high performance when processing large genome sequences (Cotik et al., 2005; Kalate et al., 2003). Moreover, the NN methodology allows rules to be extracted from trained networks, which can assist in uncovering the biological rules learned from the input data (Andrews et al., 1995).
The literature contains some papers describing promoter predictors, such as BDGP (Reese, 2001); however, none of them uses rules extracted from neural network training as described herein.
Some patent-related documents describe prediction tools that use biological information, as described below.
Document US 2010/0057419 describes a fold-wise classification of proteins in which the fold pattern of a protein of interest with an unknown fold pattern is predicted by training a system, preferably using SVMs, to correlate structural or sequence features with known protein fold patterns. The present invention describes the use of neural network (NN) rules to classify, predict and/or recognize poorly conserved bacterial biological sequences, which is not cited in the above document, and does not include the specific prediction of protein fold patterns.
Document US 2009/0111099 describes a method for promoter detection and analysis comprising the insertion of a candidate sequence into a vector comprising a TAG sequence. The present invention describes the use of neural network (NN) rules to classify, predict and/or recognize poorly conserved bacterial biological sequences, which is not cited in the above document.
Document US 2008/0147369 describes methods, systems and software for identifying functional biomolecules comprising the generation of a model through the identification of cross-product terms using genetic algorithms. The present invention describes the use of neural network (NN) rules to classify, predict and/or recognize poorly conserved bacterial biological sequences, which is not cited in the above document.
Document WO 2007/059119 describes systems and methods for identifying diagnostic indicators using neural network rules to determine responsiveness to a therapy. The present invention relates to the use of neural network rules to classify, predict and/or recognize poorly conserved bacterial biological sequences, which is not described in any previous document, and is not applied to the identification of diagnostic indicators.
In view of the prior art cited above, no relevant prior art was found that discloses a mathematical approach to validate protein mutations as disclosed herein.
Objects and advantages of the invention are set forth herein and will also be readily appreciated herefrom, or may be learned by practice of the invention. These objects and advantages are realized and obtained by means of the instrumentalities and combinations pointed out in the specification and claims.