Monthly Archives: June 2015

Bioinformatics tools to analyse viral genomics data


We have recently written a review article entitled “Bioinformatics tools to analyse viral genomics data” for the OIE. In the review, we were unable to provide direct hyperlinks and references to all available tools, simply because there are too many, so we included them here. These commonly used bioinformatics tools are split into the following categories:

Quality Control – Adapter Removal

AdapterRemoval, CutAdapt, FASTX-Toolkit, Scythe, TagCleaner, Trimmomatic, TrimGalore

Quality Control – Trimming / Filtering

ConDeTri, FastQC, FASTX-Toolkit, PRINSEQ, Sickle, TrimGalore

Quality Control – Non FASTQ formats

454/Torrent (sff): PyroCleaner, seq_crumbs, sff tools; PacBio (hd5): pbh5tools; Oxford Nanopore (fast5): PoreTools

Error Correction – 454/Torrent

AmpliconNoise, Coral, PyroHMMvar, RC454

Reference Mapping – Hash Based

Mosaik, NextGenMap, Novoalign, Stampy, Tanoti

Reference Mapping – Burrows-Wheeler

BarraCUDA, Bowtie, BWA, Cushaw2, GEM, SOAP3-dp

Reference Mapping – Long Reads


Variant Calling

DiversiTools, FluxSimulator, LoFreq, Segminator, V-Phaser, VarScan

Quasispecies Reconstruction

HaploClique, PredictHaplo, Qcolors, QuasiRecomb, QuRe, ShoRAH, ViQuaS

De novo assembly – OLC

Edena, Forge, Newbler, SGA, Shorty

De novo assembly – de Bruijn

ABySS, CLC, Cortex, EULER-SR, IDBA-UD, MIRA, SOAP2, SPAdes, Velvet, Vicuna

De novo assembly – Scaffolders

Abacas, Bambus2, BESST, GRASS, MIP, Scaffold Builder, SCARPA, SOPRA, SSPACE

De novo assembly – Gap Filling

GapCloser, GapFiller, IMAGE

Metagenomics – Homology

MEGAN, Naïve Bayes Classifier, PhymmBL

Metagenomics – Abundance

Kraken, MetaPhlAn, RIEMS, SIGMA

Metagenomics – Pipelines

IMSA, MetaAMOS, VirusFinder2

Metagenomics – De Novo

MetaVelvet, Ray Meta, also see de novo section above

RNA-Seq – Mapping

TopHat, GSNAP, OLego, SOAPsplice, STAR

RNA-Seq – Transcript assembly

Cufflinks, baySeq, edgeR, DESeq, limma

RNA-Seq – de novo

Trinity, SOAPdenovo-Trans, Trans-ABySS

If you think there is a tool missing that should be included, or a link is not working, leave a comment below.


Case studies of HTS applications

Here are my slides from the BBSRC WestBio DTP skills training session at the University of Glasgow, Friday 26th June 2015. The talk was entitled “Case studies of HTS applications” and presents a number of case studies on the application of high-throughput sequencing (HTS), also known as next generation sequencing (NGS), to biological problems including human genome sequencing, identification of disease mutations, metagenomics, virus discovery, epidemics, transmission chains and viral population analyses.



Convert NCBI Protein GI to Genome Accession

A few days back I posted a question on BioStars about getting genome accession numbers for a list of protein GIs. I had a long list of protein GIs and wanted the genome accession number for each one (if there is one in the NCBI databases), but without downloading a GenBank or XML file for each protein GI.

One way to do this is to use db2db. However, you can only use db2db if you have a list of protein accession numbers for the protein GIs of interest. I also wanted to include this step in a pipeline and automate it; db2db is web-based, so it doesn’t allow for easy automation.

I wrote the following script that first uses NCBI utilities to convert the list of protein GI to nucleotide GI and then fetches genome accession numbers for those nucleotide GIs.
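The two-step idea (protein GI to nucleotide GI, then nucleotide GI to accession) can be sketched in shell with NCBI's E-utilities. This is a minimal illustration, not the script described in this post: the function names are made up for the example, it assumes `curl` and `awk` are installed, and the XML parsing assumes ELink returns one tag per line.

```shell
#!/usr/bin/env bash
# Sketch only: map protein GIs to nucleotide accessions with NCBI E-utilities.
# All function names here are hypothetical helpers, not part of any NCBI tool.

EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Build the ELink URL: protein GI -> linked nuccore GIs.
elink_url() {
    printf '%s/elink.fcgi?dbfrom=protein&db=nuccore&id=%s\n' "$EUTILS" "$1"
}

# Build the EFetch URL: nuccore GI -> accession.version as plain text.
efetch_acc_url() {
    printf '%s/efetch.fcgi?db=nuccore&id=%s&rettype=acc&retmode=text\n' "$EUTILS" "$1"
}

# Read ELink XML on stdin and print only the Ids inside <Link> elements,
# skipping the query Id listed in <IdList>.
# Assumption: one tag per line, as in NCBI's pretty-printed XML.
linked_ids() {
    awk '/<Link>/   { inside = 1 }
         /<\/Link>/ { inside = 0 }
         inside && /<Id>/ { gsub(/[^0-9]/, ""); print }'
}

# For every protein GI in the file given as $1, print a tab-separated line:
# protein GI, nucleotide GI, nucleotide accession. Needs network access.
protein_gi_to_accession() (
    while read -r gi; do
        [ -z "$gi" ] && continue
        for nuc in $(curl -s "$(elink_url "$gi")" | linked_ids); do
            acc=$(curl -s "$(efetch_acc_url "$nuc")")
            printf '%s\t%s\t%s\n' "$gi" "$nuc" "$acc"
            sleep 1   # pause between requests to be gentle on NCBI's servers
        done
    done < "$1"
)
```

After sourcing the file, `protein_gi_to_accession test_gi_list` would print one line per linked nucleotide record. A protein can link to more than one nucleotide record, so a further filtering step may be needed to keep only genome entries.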

This script will take a file with the list of protein GI as an input and can be run as

where test_gi_list file contains the following protein GIs


This command should provide the following output.

Setting up automatic BLAST database update on linux servers

Basic Local Alignment Search Tool (BLAST) is one of the most commonly used programs for sequence classification using similarity search.

Standalone BLAST can be set up easily on a local server. More info about how to set it up on a local Linux server can be found here:

In our lab, all our servers run the BioLinux operating system and BLAST is pre-installed on the server. With local BLAST, it is important to update local BLAST databases regularly to include new sequences submitted to NCBI. However, sometimes it does become a bit tricky to install and regularly update these databases.

Here is a small tutorial on how to set up local BLAST databases and update them regularly.

In BioLinux, the BLASTDB variable path is usually set up to /var/lib/blastdb and is specified in the file in /etc/profile.d/

The standard file looks like this.

BLASTDB path can be updated to /your/blastdb/location by changing details in the “if” statement of the file.

The following example shows how I will change the location to my customized blastdb in my home directory /home/sejalmodha/blastdb

On a standard linux server you can specify the BLASTDB path variable in /etc/bash.bashrc or in your local ~/.bashrc
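As a rough sketch of what such a line in ~/.bashrc could look like: the helper name `use_blastdb` is invented for this example, and the directory check is a precaution of mine rather than anything BLAST requires. The path is the home-directory location mentioned above.

```shell
# Lines that could go in ~/.bashrc (or /etc/bash.bashrc for all users).
# use_blastdb is a hypothetical helper: it exports BLASTDB only if the
# directory exists, so a typo in the path does not silently break BLAST.
use_blastdb() {
    if [ -d "$1" ]; then
        export BLASTDB="$1"
    fi
}

use_blastdb /home/sejalmodha/blastdb
```

After opening a new shell, `echo $BLASTDB` should print the chosen directory, and the BLAST+ programs will look there for databases given by name.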

To update these databases regularly on the server, use NCBI’s update_blastdb script and wrap it in a cronjob.

I have a script that downloads the nr, nt and refseq_protein databases from the NCBI website and changes the permissions of those files so that all users can use them.
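A minimal sketch of such a script is shown here; it is not the original. It assumes the updater is installed as `update_blastdb.pl` (on some systems the command has no `.pl` suffix), uses its `--decompress` option, and falls back to /var/lib/blastdb when BLASTDB is not set. The function and variable names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: download BLAST databases and make them readable by all users.

# Where the databases live; falls back to the BioLinux default location.
BLASTDB_DIR="${BLASTDB:-/var/lib/blastdb}"

# Name of NCBI's update script; an assumption, adjust to your installation.
UPDATER="${UPDATER:-update_blastdb.pl}"

# Download and unpack each named database into BLASTDB_DIR, then open up
# read permissions. The body runs in a subshell so the caller's working
# directory is left unchanged.
update_databases() (
    cd "$BLASTDB_DIR" || exit 1
    for db in "$@"; do
        "$UPDATER" --decompress "$db" || exit 1
    done
    # a+r: everyone can read files; X: directories stay traversable.
    chmod -R a+rX .
)
```

Calling `update_databases nr nt refseq_protein` at the end of the script would fetch the three databases named above. These are large downloads, so it is worth checking free disk space first.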

To schedule the downloading of these databases monthly, put it in a cronjob called blast_cronrun and save the log to download.log file.

The last step is to submit the cronjob using the crontab command.
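The two cron steps might look roughly like this. The schedule (02:00 on the first day of each month) and both paths are placeholders chosen for the example, and `monthly_cron_line` is a hypothetical helper that just prints the crontab entry.

```shell
# Print a crontab entry that runs a script at 02:00 on the 1st of every
# month and appends both stdout and stderr to a log file.
# Fields: minute hour day-of-month month day-of-week command
monthly_cron_line() {
    printf '0 2 1 * * %s >> %s 2>&1\n' "$1" "$2"
}

# Placeholder paths: substitute your own update script and log location.
monthly_cron_line /home/sejalmodha/blastdb/update_blast_databases.sh \
                  /home/sejalmodha/blastdb/download.log
```

Redirecting that output to a file named blast_cronrun and running `crontab blast_cronrun` installs it. Note that `crontab <file>` replaces the user's existing crontab, so if other jobs are already scheduled, add the line with `crontab -e` instead; `crontab -l` shows what is currently installed.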


Recombination detection programs

A critical step before phylogenetic analysis and molecular selection analysis is to detect recombination and either remove recombinant sequences or partition the alignment into spans that are recombination-free. The problem is that there are so many different recombination detection programs available. These have been nicely reviewed by Posada (2002), and some of the programs have been benchmarked by Kosakovsky Pond and Frost (2005).

I decided to compile all the programs that are available but there are so many that I am sure I have missed some. If you know of any others, please leave a comment below.

I have split the programs into the same four categories as Posada (2002), but some programs implement multiple methods, so it gets a bit tricky. The size of the font relates to the number of citations each program/publication has received in Google Scholar. As you can see, there are some clear favourites.

You can click on the names of the programs and this should either take you to the publication abstract on Pubmed or to the website for the software.