Setting up automatic BLAST database updates on Linux servers

Basic Local Alignment Search Tool (BLAST) is one of the most commonly used programs for sequence classification using similarity search. Standalone BLAST can be set up easily on a local server. More information about how to set it up on a local Linux server can be found at http://www.ncbi.nlm.nih.gov/books/NBK52640/ In our lab, all our servers run […]

Recombination detection programs

A critical step before phylogenetic analysis and molecular selection analysis is to detect recombination and either remove recombinant sequences or partition the alignment into different spans that are recombination-free. The problem is that there are so many different recombination detection programs available. These have been nicely reviewed by Posada (2002) and some of the programs […]

Essential AWK Commands for Next Generation Sequence Analysis

Here are a few essential awk command-line scripts for next generation sequence analysis. Users need the latest version of gawk to run commands with bitwise operations. Most Linux distributions come with gawk. However, OSX users have to install it from http://rudix.org/packages/gawk.html Count the number of reads in a FastQ file: awk 'END{print NR/4}' […]
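For readers who prefer Python, the read-counting one-liner above can be sketched as follows. This is only an illustrative equivalent, not part of the original post: the function name `count_fastq_reads` is hypothetical, and it assumes a well-formed FastQ file with exactly four lines per record (no wrapped sequence or quality lines).

```python
def count_fastq_reads(lines):
    """Count reads in an iterable of FastQ lines.

    Assumes exactly four lines per record (header, sequence, '+', quality),
    the same assumption the awk one-liner NR/4 makes.
    """
    n_lines = sum(1 for _ in lines)  # equivalent of awk's NR at END
    return n_lines // 4              # integer division; awk would print a fraction


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python count_reads.py reads.fastq
    with open(sys.argv[1]) as handle:
        print(count_fastq_reads(handle))
```

Unlike `NR/4` in awk, the integer division here silently drops a trailing partial record, so a truncated file will not be flagged by this sketch.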

Why and how to use biomaRt?

Bioinformatics work includes gene annotation. In recent years, more and more biological data has become available. Knowing how to access these valuable data resources and analyse the data is therefore important for comprehensive bioinformatics analysis. biomaRt is a very useful tool for this. Now there are two questions: […]

A simple method to distinguish low frequency variants from Illumina sequence errors

RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences. Next generation sequencing has revolutionised the study of viral populations by enabling the ultra deep sequencing of their genomes, and the subsequent identification of the full spectrum of variants […]

Illumina adapter and primer sequences

Illumina libraries are normally constructed by ligating adapters to short fragments (100–1000 bp) of DNA. The exception to this is if Nextera is used (see end of this post) or where PCR amplicons have been constructed that already incorporate the P5/P7 ends that bind to the flowcell. Illumina Paired […]

Trie Data Structure

In computer science, a trie is a data structure that is also known as a digital search tree or a prefix tree. It can be used for fast retrieval on large data sets, such as looking up words in a dictionary. The term trie was derived from the phrase ‘Information Retrieval’ by Fredkin (1960). As a […]
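As a rough illustration of the dictionary-lookup use case mentioned above, here is a minimal trie sketch in Python. It is not taken from the original post; the class and method names (`Trie`, `insert`, `contains`, `starts_with`) are chosen here purely for illustration.

```python
class Trie:
    """Minimal prefix tree; each node is itself a Trie."""

    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if a stored word ends at this node

    def insert(self, word):
        """Add a word, creating one node per character as needed."""
        node = self
        for ch in word:
            if ch not in node.children:
                node.children[ch] = Trie()
            node = node.children[ch]
        node.is_word = True

    def _walk(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True only if word was inserted as a complete word."""
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        """True if any stored word begins with prefix."""
        return self._walk(prefix) is not None


trie = Trie()
for word in ["gene", "genome", "gel"]:
    trie.insert(word)
```

Lookup cost depends on the length of the query word rather than on the number of stored words, which is what makes the structure attractive for large dictionaries.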

Calculating dNdS for NGS datasets

Our upcoming tool vNvS calculates the dN/dS ratio at each site and codon, and also for the sample as a whole. Here is an explanation of the theory behind it. vNvS is currently in development; for more information, email Richard.Orton@glasgow.ac.uk. dN/dS is the ratio of the number of nonsynonymous substitutions per non-synonymous site (pN) […]

Parsing PubMed for email addresses in author affiliations

USE THE FOLLOWING RESPONSIBLY PLEASE! Recently, we wanted to send out a survey for the International Committee on Taxonomy of Viruses (ICTV) to a large number of authors who have recently published in a virology journal. Fortunately, PubMed stores author affiliations and the email address is also sometimes present in the affiliation. We decided to […]
