1. Field of the Invention
The present invention relates to a software product that uses unique characterizing residues to automatically identify strains from partial or complete capsid sequences of picornaviruses and caliciviruses, two of the most highly diverse single-stranded RNA (ssRNA) virus families.
More particularly, the present invention relates to a computer program product stored on a computer readable medium for predicting strains of some ssRNA viruses from their limited sequence data, said computer program product including: (i) graphical user interface (GUI) code operable to carry out all data input-output (I/O) operations; (ii) storage codes operable to store virus sequence databases in the form of multiple data arrays containing information about phylogenetic trees, sequence groups and characteristic residues of these groups; (iii) sequence comparison codes operable to compare input virus sequences with the stored database sequences on a residue-by-residue basis; and (iv) identification codes operable to identify the strains of input virus sequences based on the comparisons of (iii) and subsequent decision-making algorithms.
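By way of illustration only, the residue-by-residue comparison and the subsequent identification step may be sketched as follows in Python. This is a minimal sketch under stated assumptions and not the claimed implementation: the database layout, the strain names, the positions and residues, and the function names `score_strain` and `identify_strain` are all hypothetical. The sketch further assumes that the input sequence has already been aligned to the reference coordinate system and that positions missing from a partial sequence are marked with a gap character.

```python
from typing import Dict, Optional, Tuple

# Hypothetical database: strain name -> {alignment position: characteristic residue}.
# Positions and residues are invented for illustration only.
CHARACTERISTIC_RESIDUES: Dict[str, Dict[int, str]] = {
    "strain_A": {2: "K", 5: "T", 9: "D"},
    "strain_B": {2: "R", 5: "S", 9: "D"},
    "strain_C": {2: "K", 5: "S", 9: "E"},
}

GAP = "-"  # marks positions not covered by a partial input sequence


def score_strain(aligned_input: str, residues: Dict[int, str]) -> Tuple[int, int]:
    """Return (matches, positions compared) for one strain."""
    matches = 0
    compared = 0
    for pos, expected in residues.items():
        if pos >= len(aligned_input) or aligned_input[pos] == GAP:
            continue  # partial sequence: this position is not available
        compared += 1
        if aligned_input[pos] == expected:
            matches += 1
    return matches, compared


def identify_strain(
    aligned_input: str,
    database: Dict[str, Dict[int, str]] = CHARACTERISTIC_RESIDUES,
) -> Optional[str]:
    """Return the single strain whose available characteristic residues all match.

    Returns None when no strain, or more than one strain, is consistent
    with the input (an assumed, deliberately conservative decision rule).
    """
    hits = []
    for strain, residues in database.items():
        matches, compared = score_strain(aligned_input, residues)
        if compared > 0 and matches == compared:
            hits.append(strain)
    return hits[0] if len(hits) == 1 else None
```

In this sketch a partial sequence such as `"--R--S----"` still yields a single candidate because two of the three stored positions are covered, whereas an input covering only position 9 is reported as undetermined because two strains share the residue at that position.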
2. Description of the Related Art
Currently, no equipment or kit can unambiguously distinguish among a particularly important class of viruses that cause alarming epidemic outbreaks all over the world and consequently pose high bio-terrorism related threats. This class comprises the single-stranded RNA (ssRNA) viruses, which include diverse virus families ranging from those that cause flu (including the cruise-ship flu and the bird flu) to the AIDS-causing viruses. The main difficulty in distinguishing among these viruses and detecting them lies in their molecular details. As the name suggests, each of these viruses consists of a single genomic RNA strand enclosed within a protein shell called the capsid. This encapsidated RNA strand undergoes rapid sequence mutations to generate a large number of virus strains that are often associated with different epidemic outbreaks. Sequence differences among these strains are so subtle and intricate that they appear to be almost random. There are no reliable methods to unambiguously distinguish among the strains by systematically tracking these variations. Most often, the capsid sequences show the maximum variation relative to the other genomic regions for a given ssRNA virus family. This is mainly because the capsid residues undergo the most mutations in response to host immunity. Consequently, capsid sequences provide the best regions for identifying strains, as they truly represent the diversity of ssRNA virus families. Thus, any reliable method to uniquely identify the ssRNA virus strains would be based on capsid sequences, even though the problem of strain identification on the basis of these sequences may appear intractable.
Two diagnostic methods are most widely used to detect ssRNA viruses. One relies on immuno-based techniques, while the other is based on reverse transcriptase polymerase chain reaction (RT-PCR) assays. The immuno-based assays distinguish strains on the basis of epitope differences in the capsid protein, while the RT-PCR based assays use nucleotide primers to amplify differences in the viral genome. However, experimental constraints limit the ability of both assays to distinguish among the strains. Most often, these methods are useful in detecting only significant sequence differences, but they fail to detect subtle sequence variations among the strains that occur, for example, when the sequence identity falls to approximately 10% or below, as in the case of noroviruses.
Accurate strain recognition in uncharacterized target capsid sequences is essential for understanding the epidemiology and diagnostics of these viruses and for efficient vaccine development. Experimental techniques to detect ssRNA virus strains are inadequate when the number of strains is very large, as is the case with picornaviruses and caliciviruses, or when the strains are non-cultivable, like those of the human caliciviruses. Additionally, existing homology comparison based computational methods to recognize strains are of limited use, as they most often rely on similarity scores between target sequences and sequences of homology matched reference strains. Methods based on such scores are often time consuming and ambiguous, especially if only partial target sequences are available or if different ssRNA virus families are jointly analyzed. In such cases, knowledge of residues that uniquely distinguish among known reference strains is critical for rapid and unambiguous strain recognition of target capsid sequences. Conventional sequence comparisons are unable to identify such capsid residues due to high sequence divergence among the ssRNA virus reference strains. Consequently, automated general methods to predict strains from sequence data of such viruses on the basis of strain-distinguishing residues are not available.
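The notion of residues that uniquely distinguish among known reference strains may be illustrated by the following Python sketch. It is offered only as one possible reading under stated assumptions: the reference sequences are assumed to be pre-aligned and of equal length, the group names and sequences are invented, and the function name `distinguishing_residues` is hypothetical. A position is treated as distinguishing for a group when every member of that group carries the same residue there and no member of any other group does.

```python
from typing import Dict, List


def distinguishing_residues(groups: Dict[str, List[str]]) -> Dict[str, Dict[int, str]]:
    """Map each group to {alignment position: residue} for columns where the
    group is internally conserved and the residue occurs in no other group."""
    length = len(next(iter(groups.values()))[0])
    result: Dict[str, Dict[int, str]] = {name: {} for name in groups}
    for pos in range(length):
        # Residues observed at this column, per group.
        column = {name: {seq[pos] for seq in seqs} for name, seqs in groups.items()}
        for name, residues in column.items():
            if len(residues) != 1:
                continue  # not conserved within the group
            (residue,) = residues
            if residue == "-":
                continue  # a conserved gap is not treated as a residue
            others = set().union(*(r for n, r in column.items() if n != name))
            if residue not in others:
                result[name][pos] = residue
    return result


# Invented, pre-aligned toy sequences (assumption for illustration).
toy_groups = {
    "G1": ["MKTAD", "MKTSD"],
    "G2": ["MRTAE", "MRTAE"],
}
toy_markers = distinguishing_residues(toy_groups)
```

For the toy input, positions 1 and 4 separate the two groups, while position 3 is rejected because the first group is not conserved there and the residue of the second group also occurs in the first.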
One of the main challenges, therefore, in making efficient detection systems for ssRNA viruses is to devise methods to unambiguously distinguish subtle sequence variations among the different ssRNA virus strains using strain-distinguishing residues. This challenge becomes significantly harder when only partial sequences of these viruses are available. The only feasible way to address this problem is through computational techniques. However, all such known techniques are based on criteria that allow them to distinguish only significantly different strains. In contrast, the intellectual property described here is a software product based on different computational criteria. The product successfully demonstrates a way to distinguish among very closely related virus strains of two important ssRNA virus families. The only requirement of the software is the availability of accurately known (complete or partial) genomic or protein capsid sequences of these viruses. Given the rapidly improving sequencing techniques, this should not be a major problem, and it should therefore be possible to design and manufacture efficient ssRNA virus detection systems based on the described software. It is anticipated that the software will reduce both the time and the cost of identifying closely related ssRNA viruses from their sequences by substantially reducing the throughput time.
Most non-bacterial epidemic outbreaks are caused by single stranded RNA (ssRNA) viruses. Typically, these viruses undergo rapid genetic mutations that result in a large and dynamic population diversity seen as different virus strains utilizing multiple hosts [1]. Caliciviruses and picornaviruses are two of the most highly divergent ssRNA virus families each containing several hundred reference strains showing very low sequence identity even within families [2-4]. The software determines relationships among the strains using a unique algorithm in contrast to existing methods. Relationships among the strains are usually inferred through conventional homology based comparisons using complete capsid sequences or other genomic regions. These comparisons seek to identify clusters of similar sequences that comprise the major sequence groups (genogroups or genera) and their sub-groups leading to various diagnostics [5-16] and classification schemes for these viruses.
The four calicivirus genera (noroviruses, sapoviruses, lagoviruses and vesiviruses) [3, 4, 17] and the nine picornavirus genera (aphthoviruses, cardioviruses, enteroviruses, erboviruses, hepatoviruses, kobuviruses, parechoviruses, rhinoviruses and teschoviruses) [2] are classified using such schemes [2, 4]. Further divisions of these genera reflect more detailed sequence relatedness among these viruses. For example, among the diverse caliciviruses [17], noroviruses are divided into two genogroups, GI and GII [18-20], each of which contains seven sequence clusters (GI.1-GI.7 and GII.1-GII.7) [4]; sapovirus sequences are grouped into 2-5 genogroups, each of which contains several clusters [21-23]; vesivirus sequences are known to contain at least 40 immune response related antigenic serotypes; and lagovirus sequences cluster into proposed sero-specific groups [24]. Similarly, classification of the nine picornavirus genera into species, each of which consists of several serotypes [2, 25], reflects finer relations among these virus sequences (Table 1).
TABLE 1
Species and Serotypes of All Picornavirus Genera

Genera          Species (Abbreviation) [Number of serotypes]
Aphthoviruses   Foot-and-mouth disease virus (FMDV) [7]
                Equine rhinitis A virus (ERAV) [1]
Cardioviruses   Encephalomyocarditis virus (EMCV) [1]
                Theilovirus [2 or 3]
Enteroviruses   Human enterovirus A (HEV-A) [12]
                Human enterovirus B (HEV-B) [36]
                Human enterovirus C (HEV-C) [11]
                Human enterovirus D (HEV-D) [2]
                Bovine enterovirus (BEV) [2]
                Porcine enterovirus A (PEV-A) [2]
                Porcine enterovirus B (PEV-B) [2]
Hepatoviruses   Hepatitis A virus (HAV) [1]
Kobuviruses     Aichi kobuviruses (AKV) [1]
                Bovine kobuviruses (BKV) [1]
Parechoviruses  Human parechoviruses (HPeV) [3]
                Ljungan viruses [1]
Rhinoviruses    Human rhinovirus A (HRV-A) [75]
                Human rhinovirus B (HRV-B) [25]
Teschoviruses   Porcine Teschoviruses (PTEV) [11]

Abbreviations for species are shown within parentheses and the number of serotypes for a given species is shown within square brackets.

The available crystal structures of several calici and picornavirus capsids [26-32] further help understand such sequence relationships, including those among the four subunits (VP1-VP4) of the picornavirus capsids [2].
Strain and genogroup predictions in uncharacterized target sequences of calici and picornaviruses depend critically on their sequence relationships. Most often, such predictions use conventional homology comparisons between the target and a large number of known reference sequences. However, there are difficulties in these approaches when applied to caliciviruses. Most prediction methods for these viruses are based on sequence similarity cut-off values that are arbitrarily derived from the homology based sequence comparisons [19]. Although recent reports indicate statistically significant estimation of such cut-off values in distinguishing the major norovirus genogroups [33], no uniform criteria exist to accurately estimate these values for the other caliciviruses. In addition, homology based sequence similarity cut-off values are even more difficult to estimate when different virus genera need to be analyzed together, for example, in situations where the genus of the target sequences may not be known. These difficulties are compounded when determining the strains of partial sequences, mainly because experimental considerations usually restrict these partial sequences to smaller and relatively more conserved regions [15, 19, 34-37], whose comparison may often introduce ambiguities in strain identification.
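The dependence of such predictions on the chosen cut-off value may be illustrated by the following Python sketch of a percent-identity comparison. It is a simplified illustration and not any of the cited methods: the sequences are assumed to be pre-aligned and of equal length, the cut-off values are arbitrary, and the function names `percent_identity` and `same_cluster` are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0  # no overlapping positions to compare
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


def same_cluster(a: str, b: str, cutoff: float) -> bool:
    """Assign two sequences to the same cluster when identity reaches the cut-off.

    The cut-off is an arbitrary, user-chosen value in this sketch.
    """
    return percent_identity(a, b) >= cutoff
```

With the invented sequences `"MKTAD"` and `"MKTSD"`, which are 80% identical, `same_cluster` returns `True` at a cut-off of 80 and `False` at a cut-off of 85, so the assignment depends entirely on a value for which, as noted above, no uniform estimation criteria exist.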
Even if complete sequences of target virus capsids are compared [33, 38], strain determination using homology based similarity scores is still computationally challenging. This is because comparisons of a large number of complete capsid sequences demand significant computation time, which increases exponentially with increasing sequence lengths and with the number of sequences that are compared together. Such limitations may severely reduce the number of usable reference capsid sequences, thereby creating major computational bottlenecks.
Recent methods to genotype sequences belonging to certain virus families [39] suggest ways to reduce such bottlenecks. These methods efficiently align sliding windows of target sequences with databases of reference sequences and genotype the target sequences essentially using the highest overall alignment scores. However, such methods, primarily designed to detect recombination breakpoints within virus genomes, depend critically on parameters such as window size and the choice of reference sequences. Smaller windows may significantly increase the computation time, while larger windows may overlook fine sequence variations. Similarly, incorrect choices of reference sequences may introduce error-inducing biases. Time consuming repetitive runs using different trial settings of these parameters may be necessary to correctly genotype virus strains in such cases [39].
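The sensitivity of window-based genotyping to the window size may be illustrated by the following simplified Python sketch. It does not reproduce the method of [39]: a plain ungapped identity comparison over pre-aligned, equal-length sequences stands in for a true alignment score, and the reference names, sequences and function names `window_identity` and `genotype_by_windows` are hypothetical.

```python
from typing import Dict, List, Tuple


def window_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length segments."""
    if not a:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def genotype_by_windows(
    target: str, references: Dict[str, str], window: int, step: int
) -> List[Tuple[int, str]]:
    """For each window, return (start position, best-scoring reference name).

    Ties are resolved in favour of the reference listed first (an arbitrary
    choice made for this sketch).
    """
    calls: List[Tuple[int, str]] = []
    for start in range(0, len(target) - window + 1, step):
        segment = target[start:start + window]
        best = max(
            references,
            key=lambda name: window_identity(segment, references[name][start:start + window]),
        )
        calls.append((start, best))
    return calls


# Invented toy data: the target resembles "GI" at the start and "GII" at the end.
toy_references = {"GI": "AAAAAAAAAA", "GII": "CCCCCCCCCC"}
toy_target = "AAAAAACCCC"
small_windows = genotype_by_windows(toy_target, toy_references, window=5, step=5)
one_window = genotype_by_windows(toy_target, toy_references, window=10, step=10)
```

On the toy data, two windows of five residues assign the first half to "GI" and the second half to "GII", whereas a single window of ten residues reports only "GI", which illustrates how a larger window may overlook the variation in the second half.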
Thus, strain recognition methods using sequence identity based scores have not been easily amenable to reliable and robust automation across ssRNA virus families. Based on earlier analysis of noroviruses [40], we describe here the generalized implementation of a residue-wise comparison based approach to automate strain predictions in complete and partial amino acid capsid sequences of calici and picornaviruses.