A Virus Taxonomy Classification Framework


Densovirinae analysis

The ViCTree pipeline is currently set up for the Densovirinae subfamily. Trees and alignments are updated when new sequences are submitted to GenBank and the newly added sequences form a cluster that was not previously identified.

The Densovirinae subfamily analysis is based on non-structural protein 1 sequences. Twenty-one seed sequences were used for the initial analysis and for all subsequent analyses in the subfamily-level taxonomic classification of the viruses.

A clustering threshold of 1.0 was applied so that only sequences that were 100% identical were clustered together. An optional parameter, -u, specifies a file listing the protein accession numbers that the ICTV accepts as representative species for the family or subfamily. This option lets users keep the established branches consistent as the phylogeny expands.

A multiple sequence alignment and pairwise distances are calculated for the sequences in the final set after the clustering step. Sequence metadata is collected for this set of sequences: the genome accession, scientific name, taxonomy ID, taxonomy lineage, genus and NCBI URL. This information enables users to customise the tip labels of the tree in the visualisation module. It also provides a link-out to the NCBI genome sequence page for each sequence represented in the tree. A tree with rapid bootstrap analysis and a best-scoring maximum likelihood search under the PROTGAMMAJTT (default) model is generated using RAxML.
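The pairwise distance step can be illustrated with a minimal sketch. It assumes an uncorrected p-distance (the percentage of differing residues) computed over ungapped alignment columns; the distance measure that ViCTree itself uses is not stated here, and the function names and toy sequences are hypothetical.

```python
from typing import Dict, Tuple


def p_distance(seq_a: str, seq_b: str, gap_chars: str = "-.") -> float:
    """Return the percentage of differing residues between two aligned sequences.

    Assumption: columns where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue  # ignore gapped columns
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


def distance_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Compute the distance for every unordered pair of sequences."""
    names = sorted(alignment)
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for i, x in enumerate(names)
        for y in names[i + 1:]
    }


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    toy = {"seqA": "MKT-LLV", "seqB": "MKTALLI", "seqC": "MRT-LAV"}
    for pair, dist in distance_matrix(toy).items():
        print(pair, round(dist, 1))
```

Expressing the result as a percentage keeps it directly comparable with percentage thresholds such as the 15% value used in the visualisation.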

The tree, the pairwise distance matrix and the metadata tables are then automatically submitted to the ViCTreeView module of the pipeline for visualisation.

Parameter Optimisation

BLAST - Hit length and Coverage

Pre-optimised BLAST parameters were specified: a hit length of 100 (-l 100) and a coverage of 50 (-c 50).

These parameters were calculated prior to running the ViCTree pipeline. This is an important step that needs to be carried out before setting up the ViCTree pipeline for a family of viruses. To obtain the BLAST parameters, a multiple sequence alignment (MSA) was performed on the seed set of sequences. This allows the user to inspect the conserved regions of the protein sequence and also helps to identify the variation across the seed sequences used in the pipeline.
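As a rough illustration of how hit-length and coverage cutoffs of this kind act on BLAST results, the sketch below filters a list of hits. It assumes that coverage means the aligned length as a percentage of the query (seed) length and that both cutoffs are inclusive; ViCTree's exact definitions are not given here, and the `Hit` fields and identifiers are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One BLAST hit; field names are hypothetical, chosen for this sketch."""
    subject_id: str
    alignment_length: int  # number of aligned positions
    query_length: int      # length of the seed (query) protein


def coverage_percent(hit: Hit) -> float:
    # Assumption: coverage = aligned length as a percentage of the query length.
    return 100.0 * hit.alignment_length / hit.query_length


def filter_hits(hits: Iterable[Hit], min_length: int = 100,
                min_coverage: float = 50.0) -> List[Hit]:
    """Keep hits that meet both cutoffs (defaults mirror -l 100 and -c 50)."""
    return [
        h for h in hits
        if h.alignment_length >= min_length
        and coverage_percent(h) >= min_coverage
    ]


if __name__ == "__main__":
    # Placeholder identifiers, not real accessions.
    hits = [
        Hit("hit_1", 450, 500),  # long hit, 90% coverage: kept
        Hit("hit_2", 80, 500),   # shorter than 100 positions: dropped
        Hit("hit_3", 120, 500),  # long enough but only 24% coverage: dropped
    ]
    print([h.subject_id for h in filter_hits(hits)])
```

Inspecting the seed alignment first gives a sense of how long the conserved region is, which is what makes a particular length and coverage cutoff defensible for a given family.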

CD-HIT - Identity

CD-HIT clustering parameters were determined using the pairwise distance criteria currently used to classify species and genera within the virus family or subfamily. For the Densovirinae subfamily, species- and genus-level classification is based on pairwise distances of 15% and 30% respectively, where pairwise distances are expressed as percentages. Hence, sequences below these thresholds were clustered together using a CD-HIT identity criterion of 1.0. This step also helps to reduce the complexity of the tree by choosing the longest or a user-defined (-u) representative for each cluster generated by CD-HIT.
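The choice of a representative per cluster can be sketched as follows. This is not ViCTree's code: it assumes clusters are already available as lists of accession strings, that a user-supplied accession always wins over a longer sequence, and that ties are broken alphabetically. All names and identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Set


def pick_representatives(clusters: Iterable[List[str]],
                         lengths: Dict[str, int],
                         preferred: Optional[Set[str]] = None) -> List[str]:
    """Return one representative per cluster.

    A member listed in `preferred` (for example, accessions read from a
    user-supplied list) is chosen if present; otherwise the longest member is.
    """
    preferred = preferred or set()
    representatives = []
    for members in clusters:
        favoured = [m for m in members if m in preferred]
        pool = favoured or members
        # Longest first; ties broken alphabetically so the result is deterministic.
        representatives.append(min(pool, key=lambda m: (-lengths[m], m)))
    return representatives


if __name__ == "__main__":
    # Placeholder identifiers and lengths, invented for illustration.
    clusters = [["A1", "A2", "A3"], ["B1", "B2"]]
    lengths = {"A1": 600, "A2": 580, "A3": 600, "B1": 300, "B2": 550}
    print(pick_representatives(clusters, lengths))
    print(pick_representatives(clusters, lengths, preferred={"A2"}))
```

Giving the user-defined list priority is what keeps accepted reference sequences at the tips of the tree even when a longer sequence joins the same cluster later.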

RAxML - Model

The best model for RAxML was determined using ProtTest, with the MSA of the seed sequences as input. PROTGAMMAJTT is also set as the default model for the RAxML maximum likelihood tree calculations. However, any RAxML-compatible model can be specified using the -m parameter in ViCTree.
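For orientation, a RAxML run that combines rapid bootstrapping with a best-scoring maximum likelihood search could be assembled as in the sketch below. How ViCTree actually invokes RAxML is not shown in this document, so the executable name, random seed, number of bootstrap replicates and file names are all assumptions; only the model name comes from the text above.

```python
import shlex
from typing import List


def raxml_command(alignment: str, run_name: str,
                  model: str = "PROTGAMMAJTT",
                  bootstraps: int = 100,     # assumed replicate count
                  seed: int = 12345,         # assumed random seed
                  executable: str = "raxmlHPC") -> List[str]:
    """Build an argument list for a rapid-bootstrap plus ML-search run."""
    return [
        executable,
        "-f", "a",               # rapid bootstrap analysis and best-scoring ML tree search
        "-m", model,             # substitution model
        "-p", str(seed),         # parsimony random seed
        "-x", str(seed),         # rapid bootstrap random seed
        "-N", str(bootstraps),   # number of bootstrap replicates
        "-s", alignment,         # input alignment file
        "-n", run_name,          # suffix used for the output files
    ]


if __name__ == "__main__":
    # Hypothetical file and run names.
    cmd = raxml_command("final_set.aln.phy", "densovirinae")
    print(" ".join(shlex.quote(part) for part in cmd))
```

Building the command as a list keeps the model a single swappable argument, so a different RAxML-compatible model can be passed without touching the rest of the call.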


| Filename | Contents |
| --- | --- |
| Seed set | All protein sequences used as the seed set for the Densovirinae example. |
| Final set | The final set of sequences selected after the BLAST and clustering steps. This set is used for multiple sequence alignment. |
| Alignments | Protein sequence alignments for the final set of sequences. |
| Pairwise distance matrix | The pairwise distance matrix for the final set of sequences. |
| Phylogenetic tree | Phylogenetic tree generated by the ViCTree pipeline in Newick format. |

New Species

Using the ViCTree pipeline, we identified all previously classified genera and species in the subfamily Densovirinae (Cotmore et al., 2014), as well as six new species that have been submitted to the ICTV for approval. Five of the new species belong to the genus Ambidensovirus and one has not been assigned to a genus. Further details of these species are provided in the table below.

| Name of new species | Representative isolate | GenBank accession number | Genus |
| --- | --- | --- | --- |
| Asteroid ambidensovirus 1 | Sea star-associated densovirus | KM052275 | Ambidensovirus |
| Decapod ambidensovirus 1 | Cherax quadricarinatus densovirus | KP410261 | Ambidensovirus |
| Hemipteran ambidensovirus 2 | Dysaphis plantaginea densovirus 1 | FJ040397 | Ambidensovirus |
| Hemipteran ambidensovirus 3 | Myzus persicae densovirus 1 | AY148187 | Ambidensovirus |
| Hymenopteran ambidensovirus 1 | Solenopsis invicta densovirus | KC991097 | Ambidensovirus |
| Orthopteran densovirus 1 | Acheta domestica mini ambidensovirus | KF275669 | Unassigned |

The following figure shows a snapshot of the Densovirinae tree generated by the ViCTree analysis, visualised in the ViCTreeView module of the framework.

ViCTreeView visualisation on the CVR Bioinformatics website. It shows the Densovirinae subfamily-level taxonomic classification tree generated by the ViCTree pipeline. Distinct colours highlight the clusters generated when the pairwise distance value is set to 15%. Grey lines indicate sequences with a pairwise distance higher than 15%.
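One way to picture how a pairwise distance threshold splits sequences into clusters is the sketch below. It uses single-linkage grouping with a union-find structure, which is an assumption made for illustration: the grouping rule that ViCTreeView applies is not described here, and the names and distances are invented.

```python
from typing import Dict, Iterable, List, Tuple


def threshold_clusters(names: Iterable[str],
                       distances: Dict[Tuple[str, str], float],
                       threshold: float = 15.0) -> List[List[str]]:
    """Group sequences linked by any pairwise distance <= threshold.

    Assumption: single linkage, with the threshold treated as inclusive.
    """
    names = list(names)
    parent = {n: n for n in names}

    def find(x: str) -> str:
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        if dist <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    # Toy distances in percent, invented for illustration.
    distances = {
        ("A", "B"): 5.0, ("B", "C"): 12.0, ("A", "C"): 18.0,
        ("A", "D"): 40.0, ("B", "D"): 42.0, ("C", "D"): 39.0,
    }
    for cluster in threshold_clusters("ABCD", distances):
        label = "coloured cluster" if len(cluster) > 1 else "singleton"
        print(cluster, label)
```

Under single linkage, A and C fall in the same cluster even though they are 18% apart, because both are within 15% of B. A stricter rule such as complete linkage would separate them, so the choice of linkage affects which tips share a colour.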

The ViCTree framework is developed by:

Sejal Modha (@sejmodha), Anil Thanki (@anilthanki) and Joseph Hughes (@josephhughes).