Accurate genotyping (or subtyping) is critical to understanding the evolution of divergent viruses. The number of viral sequences in public databases has grown rapidly in recent years. For example, HIV-1 and HCV sequence entries in NCBI GenBank have doubled almost every three years. These viruses also show great genotypic diversity and have therefore been classified into groups, so-called genotypes and subtypes (Robertson et al., 2000; Simmonds et al., 2005).
Consequently, genotyping (or subtyping) these virus strains based on their sequence similarities has become one of the most basic steps in understanding their evolution and epidemiology and in developing antiviral therapies or vaccines.
The conventional subtyping methods include the following: (1) the nearest neighbor methods that look for the best match of the query to the representatives of each subtype, so-called references; (2) the phylogenetic methods that look for the monophyletic group to which the query branches. Since the subtypes have been defined originally as separately clustered groups, these intuitively sound methods have been widely used and quite successful for many cases.
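The nearest neighbor approach described above can be illustrated with a minimal sketch. It assumes the query and references are already aligned to the same length and uses the simple p-distance (fraction of differing sites) as the similarity measure; the function names and the toy sequences are hypothetical and are not taken from any of the cited tools.

```python
def p_distance(a, b):
    """Fraction of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore alignment columns where either sequence has a gap.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_neighbor_subtype(query, references):
    """Return (subtype, distance) of the reference closest to the query.

    `references` is a list of (subtype, aligned_sequence) pairs.
    """
    return min(
        ((subtype, p_distance(query, seq)) for subtype, seq in references),
        key=lambda item: item[1],
    )


# Toy, made-up sequences for illustration only (not real viral data).
references = [
    ("A", "ACGTACGTAC"),
    ("A", "ACGTACGTAT"),
    ("B", "TTGTACCTAC"),
    ("B", "TTGTACCTGC"),
]
best = nearest_neighbor_subtype("ACGTACGTAA", references)
```

Such a procedure always returns some best match, but by itself it gives no indication of how confident the resulting subtype assignment is.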
However, as the number of sequences increases, outliers are being observed that cannot be clearly subtyped or on which these methods do not agree. A recent report that compared these different automatic subtyping methods with HIV-1 sequences showed less than 50% agreement among them except for subtypes B and C (Gifford R, de Oliveira T, Rambaut A, Myers R E, Gale C V, Dunn D, Shafer R, Vandamme A M, Kellam P, Pillay D: UK Collaborative Group on HIV Drug Resistance: Assessment of automated genotyping protocols as tools for surveillance of HIV-1 genetic diversity. AIDS 2006, 20: 1521-1529). The disagreement was attributed in part to the increasing divergence and complexity caused by recombination. It was also noted that closely related subtypes (B and D) and subtypes sharing a common origin (A and CRF01_AE) showed poor concordance rates among those methods.
The present inventor considers the root of this problem to be that the number of reference sequences per subtype has been too small. These methods have used two to four hand-picked reference sequences. Carefully chosen by experts from among the high-quality whole-genome sequences, these references are intended to cover the diversity of each subtype as much as possible. However, with intrinsically small numbers of references per subtype, these methods cannot address the confidence of subtype predictions: a low E-value of a pairwise alignment or a high bootstrap value of a phylogenetic tree indicates the reliability of the individual operation, but does not necessarily guarantee a confident subtype classification as a whole.
Recognition that a statistical confidence measure was lacking led to the introduction of STAR, a method based on statistical models of position-specific scoring matrices built from a multiple sequence alignment (MSA) of each subtype. However, its current implementation has several limitations: it has been applied to HIV-1 amino acid sequences only, is based on a small number of references (141 altogether for 11 subtypes), and was tested with fewer than 1,000 sequences.
Recently, a new genotyping (or subtyping) method based on nucleotide composition strings has been introduced. It is unique in that it bypasses multiple sequence alignment and still achieves high accuracy. However, it also uses only 42 reference sequences and has been tested with 1,156 sequences. Considering the explosive increase in the numbers of these viral sequences, the test sets of these conventional methods were rather small, numbering in the tens of thousands at most.
Therefore, the object of the present invention is to provide a novel method for classifying the genotype or subtype of query sequences which are known to the public. It is critical to evaluate how well each subtype population is clustered before attempting to classify a query sequence. Consider a case where the reference sequences are mostly well segregated by subtype except for two or more subtypes that overlap at least partially: methods that rely on a few references may not notice this problem and may assign an apparent subtype with a high score. Because the mutation rate varies along the sequence, the phylogenetic power of each gene segment may also vary. This is particularly critical for relatively short partial sequences. In other words, even well-characterized references that are otherwise distinctly clustered may not be resolved if only part of the sequence region is considered in genotyping (or subtyping).
The nearest neighbor methods do not evaluate the validity of the background classification models, since they consider only query-to-reference alignments, not reference-to-reference alignments. REGA, one of the tree-based methods, considers whether the query is inside or outside the cluster formed by a group of references (de Oliveira T, Deforche K, Cassol S, Salminen M, Paraskevis D, Seebregts C, Snoeck J, van Rensburg E J, Wensing A M, van de Vijver D A, Boucher C A, Camacho R, Vandamme A M: An automated genotyping system for analysis of HIV-1 and other microbial sequences. Bioinformatics 2005, 21: 3797-3800). However, as far as the present inventor knows, no tool reports such a measure quantitatively.
Therefore, the present inventor presents a method which develops the background classification models based on the distances among the reference sequences, re-evaluates their validity for each query, and reports the statistical significance of genotype (or subtype) assignment in terms of posterior probabilities.
As such, the method of the present invention is suited for cases where many reference sequences are available. The present invention achieves these goals by combining principal coordinate analysis (PCoA) with linear discriminant analysis (LDA), both of which are well-established statistical tools widely used in the biological sciences. PCoA, also known as classical multidimensional scaling (MDS), maps the sequences to a high-dimensional principal coordinate space while trying to preserve the distance relationships among them as much as possible. PCoA has been widely applied to the discovery of global trends in a sequence set, complementing tree-based methods in phylogenetic analysis.
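For illustration, the PCoA step can be sketched as follows. This is the textbook classical MDS procedure (double-centering of the squared distance matrix followed by eigen-decomposition), written with NumPy; it is not asserted to be the exact implementation of the present invention, and the function name `pcoa` and the toy points are assumptions made for this example.

```python
import numpy as np


def pcoa(dist, n_dims):
    """Classical MDS: embed points so Euclidean distances approximate `dist`."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances (Gower's transformation).
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # Eigen-decompose the symmetric matrix and keep the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Negative eigenvalues (from non-Euclidean distances) are clipped to zero.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals


# Toy check: distances among four planar points are recovered in two dimensions.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords, eigvals = pcoa(dist, n_dims=2)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

In a genotyping setting, `dist` would instead hold pairwise distances among the reference sequences, and the number of retained dimensions would be chosen so that the subtypes are adequately separated.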
Since subtypes have been defined as distinct monophyletic groups in a phylogenetic tree, each subtype should form a well-separated cluster in an MDS space if an appropriately high dimension is chosen. In such cases, a set of hyperplanes that separate these clusters may be found, and a query may be classified by its position relative to the hyperplanes. For this purpose, the present invention applies LDA, a straightforward and powerful classification method, to the MDS coordinates and assigns a query to the genotype (or subtype) that shows the highest posterior probability of membership.
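The classification step can be sketched in the same spirit. The example below is a minimal, generic LDA written with NumPy, assuming Gaussian classes with a shared (pooled) covariance matrix and priors estimated from class frequencies; the function names `fit_lda` and `posterior` and the toy coordinates are hypothetical and do not represent the inventor's own code.

```python
import numpy as np


def fit_lda(coords, labels):
    """Estimate class priors, class means and the pooled covariance."""
    x = np.asarray(coords, dtype=float)
    y = np.asarray(labels)
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    # Pooled within-class covariance, shared by all classes (the LDA assumption).
    centered = x - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(x) - len(classes))
    return classes, priors, means, np.linalg.pinv(cov)


def posterior(model, query):
    """Posterior probability of each class for one query coordinate vector."""
    classes, priors, means, cov_inv = model
    diff = np.asarray(query, dtype=float) - means
    # Discriminant score: log prior minus half the squared Mahalanobis distance.
    scores = np.log(priors) - 0.5 * np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    scores -= scores.max()  # stabilise the softmax
    probs = np.exp(scores)
    return dict(zip(classes.tolist(), (probs / probs.sum()).tolist()))


# Toy MDS-like coordinates for two well-separated, made-up subtypes.
coords = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2],
          [3.0, 3.1], [3.2, 2.9], [2.9, 3.0], [3.1, 3.2]]
labels = ["A"] * 4 + ["B"] * 4
model = fit_lda(coords, labels)
post = posterior(model, [0.3, 0.2])         # query close to cluster A
subtype = max(post, key=post.get)
ambiguous = posterior(model, [1.55, 1.55])  # query midway between clusters
```

A query near one cluster receives a posterior close to 1 for that subtype, whereas a query lying midway between two clusters receives roughly equal posteriors, which is the kind of ambiguous case the probability is meant to expose.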
This probability can be useful in detecting ambiguous cases, for which careful examination is required. The method of the present invention tests the LDA models by leave-one-out cross-validation (LOOCV), which can be used to assess model validity by examining the misclassification rate. As the sequences are represented by coordinates, a simple measure can also be developed for detecting genotype (or subtype) outliers.
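The LOOCV procedure itself is independent of the classifier, which the following sketch makes explicit. To keep the example short and dependency-free, a nearest-centroid classifier is used as a stand-in for LDA; this substitution, the function names, and the toy data are assumptions made purely for illustration.

```python
import math


def nearest_centroid(train_x, train_y, query):
    """Stand-in classifier (not LDA): label of the closest class centroid."""
    groups = {}
    for point, label in zip(train_x, train_y):
        groups.setdefault(label, []).append(point)
    centroids = {
        label: [sum(col) / len(pts) for col in zip(*pts)]
        for label, pts in groups.items()
    }
    return min(centroids, key=lambda label: math.dist(centroids[label], query))


def loocv_error(xs, ys, classify):
    """Leave-one-out misclassification rate for a classifier `classify`."""
    errors = 0
    for i in range(len(xs)):
        # Hold out sample i, train on the rest, then predict the held-out sample.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if classify(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)


# Toy coordinates; the last point sits in cluster B but is labelled "A" on purpose.
xs = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
      [3.0, 3.1], [3.2, 2.9], [2.9, 3.0], [2.8, 2.9]]
ys = ["A", "A", "A", "B", "B", "B", "A"]
rate = loocv_error(xs, ys, nearest_centroid)
```

In this toy set only the deliberately mislabelled point is misclassified when held out, giving a rate of one in seven. A high misclassification rate on real references would suggest that the background model does not separate the subtypes well for the sequence region under consideration.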
The present inventor has tested the present invention with virtually all the HIV-1 and HCV sequences available from NCBI GenBank (nucleotide) and GenPept (protein).