Author Archives: Joseph Hughes

A compilation of conversion tools for BED, SAM/BAM, psl, pslx, blast tabular and blast xml

A wide range of formats exists for representing the comparisons of different sequences to each other: blast tabular, blast xml, psl, pslx, SAM/BAM and BED.
Most of these formats can be converted from one to another. Sometimes the conversion is lossless, allowing the original data to be perfectly reconstructed
without loss of information. Other times, the conversion is lossy, carrying over only part of the original data and resulting in the loss
of some information.

Here, I have compiled the tools or UNIX commands necessary for converting from one file format to another. As you can see, I still need to complete some of the gaps, so please let me know of any other tools that are missing.

The commands are shown in full below the table.

From \ To | blast-xml          | blast-tab     | psl           | pslx                | SAM/BAM           | BED
blast-xml | N/A                | blast2tsv.xsl | blastXmlToPsl | blastXmlToPsl -pslx | blast2bam         | BLAST_to_BED
blast-tab | perl blast2xml.pl  | N/A           | ??            | ??                  | ??                | blast2bed
psl       | ??                 | ??            | N/A           | pslToPslx           | psl2sam.pl        | pslToBed, psl2bed
pslx      | ??                 | ??            | ??            | N/A                 | ??                | pslToBed
SAM/BAM   | ??                 | ??            | sam2psl.py    | sam2psl.py -s       | samtools view     | bedtools bamtobed
BED       | ??                 | ??            | bed2psl       | ??                  | bedtools bedtobam | N/A

Convert from SAM to psl using sam2psl.py

Available from: https://github.com/ndaniel/fusioncatcher/blob/master/bin/sam2psl.py

Example command:
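Something along these lines (file names are placeholders; the -i/-o flags are assumptions, so confirm with the script's help):

python sam2psl.py -i input.sam -o output.psl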

This is a lossless format conversion with the -s option (which writes pslx-style sequence columns); the read sequence is not part of the standard psl format.

Usage:
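The full set of options can be printed with:

python sam2psl.py --help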

Convert from psl to SAM

Available from the samtools legacy scripts: https://github.com/lh3/samtools-legacy/blob/master/misc/psl2sam.pl

Example command:
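For example (file names are placeholders):

psl2sam.pl input.psl > output.sam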

This ends up being a lossy conversion as the read sequence is not in the output.

Usage
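Run without arguments, the script prints a usage line along these lines:

psl2sam.pl [-a 1] [-b 3] [-q 5] [-r 2] <in.psl>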

The options are used to calculate a BLAST-like score; see this post:
https://www.biostars.org/p/44281/


Convert psl to pslx

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion. For usage:
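For example (assuming the query and target sequences are supplied as, e.g., fasta files; names are placeholders):

pslToPslx input.psl query.fa target.fa output.pslx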

 

Convert SAM to fasta

 

Convert psl to BED

Option 1:

Using pslToBed from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion as the standard psl doesn’t have the sequence and so the bed file doesn’t either.

Usage:
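For example (file names are placeholders):

pslToBed input.psl output.bed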

 

Option 2 as suggested by Alex Reynolds:

Using psl2bed from http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html

This is also lossless when used with --keep-header:

Example:
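For example (file names are placeholders):

psl2bed --keep-header < input.psl > output.bed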

As a bonus, it uses sort-bed to make a sorted BED file, so that it is ready to use with bedops, bedmap, etc.

Usage:
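The full set of options can be listed with:

psl2bed --help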

 

Convert pslx to BED

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossy conversion, as the sequence is lost.

Usage:
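For example, pslToBed takes the pslx directly (file names are placeholders):

pslToBed input.pslx output.bed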

 

Convert BAM to BED

Using bedtools: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html

This is a lossy conversion as the sequence data is lost.

Usage:
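For example (file names are placeholders):

bedtools bamtobed -i input.bam > output.bed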

 

Convert BED to BAM

First, create the genome (chromosome sizes) file needed by bedtools.
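A common way is to take it from a fasta index of the reference (file names are placeholders):

samtools faidx reference.fa
cut -f1,2 reference.fa.fai > genome.txt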

 

Using the genome file and BED file to produce the BAM file.

Usage:
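For example (file names are placeholders):

bedtools bedtobam -i input.bed -g genome.txt > output.bam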

 

The sequence is not present in the BED file, so it is absent from the BAM as well; this is a lossy format conversion.
Additionally, there are differences in the number of reads compared to the original file.

Convert BED to psl

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion, as neither BED nor psl contains sequence information.

 

Usage:
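Note the kentUtils binary is named bedToPsl; it takes a chromosome sizes file (a sketch; file names are placeholders):

bedToPsl genome.txt input.bed output.psl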

 

Preparing blast-xml format
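With BLAST+, the xml output comes from -outfmt 5 (query and database names are placeholders):

blastn -query query.fa -db mydb -outfmt 5 -out blast_output.xml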

Preparing blast-tab format
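And with BLAST+, the tabular output comes from -outfmt 6 (same placeholders):

blastn -query query.fa -db mydb -outfmt 6 -out blast_output.tab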

Blast-xml to psl

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossy conversion.

However, if you use the -pslx option, you can get a lossless conversion.

Usage:
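For example (file names are placeholders):

blastXmlToPsl blast_output.xml output.psl
blastXmlToPsl -pslx blast_output.xml output.pslx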

 


Converting from blast-xml to SAM/BAM

Using https://github.com/guyduche/Blast2Bam

This is a lossless conversion, with the sequence and read qualities included.

 

Usage
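The general shape of the command is below; the flags and argument order are assumptions, so check the Blast2Bam README:

blast2bam -o results.sam blast_output.xml reference.dict reads.fastq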

Subsequently converted to BAM using samtools
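For example:

samtools view -bS results.sam > results.bam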

 

Blast-xml to BED

Using https://github.com/mojaveazure/BLAST_to_BED

Command
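Something along these lines (the script name and options are assumptions; see the repository README):

python BLAST_to_BED.py -i blast_output.xml -o output.bed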

This is a lossy conversion as the sequence information is lost.

 

Converting Blast tabular to blast-xml

This is not completely possible due to missing information, such as the alignment, but see this post: https://www.biostars.org/p/7981/#7983
Alternatively, use the script blast2xml.pl from the blast2go google group: https://11302289526311073549.googlegroups.com/attach/ed2c446e1b1852a9/blast2xml.pl?part=0.1&view=1&vt=ANaJVrEJYYa7SZC-uvOtoKb6932qlMJWltc2p_5GrTK5Wi7jo-hw14zFroKEcLhdNcJUcQweoUJOuXk2H7wQB5q6mzDTTn211hC2OvwiWw0b5PZev-HQ7Qg

Command:
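Presumably something like the following (the arguments are assumptions, as the script is only documented in the google group post):

perl blast2xml.pl blast_output.tab > blast_output.xml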

Usage

 

Convert blast-xml to blast tabular

Several approaches have been suggested here https://www.biostars.org/p/7290/ but the most straightforward I have found is using the style sheet blast2tsv.xsl from here:
https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2tsv.xsl. This is a lossless conversion, but it is not a standard blast-tabular output, as it contains the sequence and the aligned-site information in the last two columns.

Command:
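The stylesheet can be applied with xsltproc (file names are placeholders):

xsltproc blast2tsv.xsl blast_output.xml > blast_output.tsv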

 

Blast tabular to BED

Using https://github.com/nterhoeven/blast2bed

The output will be in test.blasttab.bed. This is a lossless conversion, as neither blast-tabular nor BED holds the sequence.

Usage
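For example, with the blast tabular output in test.blasttab:

./blast2bed test.blasttab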

 


Converting SAM to BAM

Using samtools http://quinlanlab.org/tutorials/samtools/samtools.html
This is a lossless format conversion.

Command
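For example (file names are placeholders):

samtools view -bS input.sam > output.bam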


Converting BAM to SAM using samtools

This is a lossless format conversion.
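For example, keeping the header with -h (file names are placeholders):

samtools view -h input.bam > output.sam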

 

PS: I dedicate this tutorial to Sej, a great bioinformatician and friend 😉

Exploring the FAST5 format

FAST5 format from Oxford Nanopore (ONT) is in fact HDF5, which is a very flexible data model, library, and file format for storing and managing data. It is able to store an unlimited variety of datatypes.

A number of tools have been developed for handling HDF5, available from the HDF Group. The most useful are:

  • hdfview, a Java visual tool for viewing HDF5 files, with some limited plotting functionality and the option of exporting subsets in HDF5 (extension .h5)
  • h5ls, for listing specified entries of the HDF5 file
  • h5dump, to examine the HDF5 file and export specified groups or datasets in ASCII.

Here’s a run through exploring the lambda phage control run. First off, looking at the FAST5 file produced by the MinION.

At this stage, the FAST5 file only has one dataset which is the “Signal” dataset.

The same inspection of a FAST5 file that has been processed by Metrichor shows a lot more associated information, notably Fastq, Events and various Log files for the different analyses, while still containing the raw Signal dataset.

 

To list all groups recursively using h5ls use -r:
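For example (the read file name is a placeholder throughout):

h5ls -r read_example.fast5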

Similar information can be obtained using h5dump -n:
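h5dump -n read_example.fast5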

To get all data and metadata for a given group /Raw/Reads/Read_939:
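h5dump -g /Raw/Reads/Read_939 read_example.fast5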

Or, the following is similar without the group tags. The -d option is used for printing a specified dataset.
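h5dump -d /Raw/Reads/Read_939/Signal read_example.fast5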

Removing the array indices using option -y:
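h5dump -y -d /Raw/Reads/Read_939/Signal read_example.fast5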

Saving the raw Signal dataset to file “test”:
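h5dump -y -d /Raw/Reads/Read_939/Signal -o test read_example.fast5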

The same as the above but specifying that the column width of the dataset is 1 with the option -w 1:
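h5dump -y -w 1 -d /Raw/Reads/Read_939/Signal -o test read_example.fast5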

Dumping the whole FAST5 into XML format:
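h5dump --xml read_example.fast5 > read_example.xml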

O.K., that's it for now.

Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools

ONT sequencing runs produce results in the FAST5 format, which is a variant of HDF5.

“HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.” from HDF group

A number of tools have been developed for converting the fast5 files produced by ONT to the more commonly used FASTA/FASTQ file formats. This is my attempt at determining which one to use based on functionality and runtime. The tools that I have looked at so far are nanopolish, poretools, poreseq and poRe.

Nanopolish is developed by Jared Simpson in C++ with some Python utility scripts. The extract command comes with useful help information.

nanopolish extract --help
Usage: nanopolish extract [OPTIONS] <fast5|dir>...
Extract reads in fasta format

--help display this help and exit
--version display version
-v, --verbose display verbose output
-r, --recurse recurse into subdirectories
-q, --fastq extract fastq (default: fasta)
-t, --type=TYPE read type: template, complement, 2d, 2d-or-template, any
(default: 2d-or-template)
-o, --output=FILE write output to FILE (default: stdout)

Report bugs to https://github.com/jts/nanopolish/issues

poretools is a toolkit by Nick Loman and Aaron Quinlan written in python. The poretools fasta command has many options for filtering the sequences.

poretools fasta -h
usage: poretools fasta [-h] [-q] [--type STRING] [--start START_TIME]
[--end END_TIME] [--min-length MIN_LENGTH]
[--max-length MAX_LENGTH] [--high-quality]
[--normal-quality] [--group GROUP]
FILES [FILES ...]

positional arguments:
FILES The input FAST5 files.

optional arguments:
-h, --help show this help message and exit
-q, --quiet Do not output warnings to stderr
--type STRING Which type of FASTQ entries should be reported?
Def.=all
--start START_TIME Only report reads from after start timestamp
--end END_TIME Only report reads from before end timestamp
--min-length MIN_LENGTH
Minimum read length for FASTA entry to be reported.
--max-length MAX_LENGTH
Maximum read length for FASTA entry to be reported.
--high-quality Only report reads with more complement events than
template.
--normal-quality Only report reads with fewer complement events than
template.
--group GROUP Base calling group serial number to extract, default
000

PoreSeq has been developed by Tamas Szalay and is written in Python. The poreseq extract command also has a help argument.

poreseq extract -h
usage: poreseq extract [-h] [-p] dirs [dirs ...] fasta

positional arguments:
dirs fast5 directories
fasta output fasta

optional arguments:
-h, --help show this help message and exit
-p, --path use rel. path as fasta header (instead of just filename)

poRe is a library for R written by Mick Watson. poRe has a very basic script extract2D which you can run from the command line. Unfortunately, I could not get it to print out the converted files and there were no error messages. I also tried to use it in the R client, but without luck.

The following table compares the runtime of each program using the perf stat command with 10 replicates for extracting 4,000 .fast5 files to fasta (perf stat -r 10 -d). The time represents an average over the 10 replicates. All the tools compared produced identical sequences as output, but the headers, and thus the file sizes, are different. The differences are illustrated in the table.
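For example, for the nanopolish run (command as in the table below):

perf stat -r 10 -d nanopolish extract --type 2d /home3/ont/toledo_fc1/pass/batch_1489737203645/ -o batch_1489737203645.fa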

Command | Time (sec) | MB | FASTA | FASTQ
nanopolish extract --type 2d /home3/ont/toledo_fc1/pass/batch_1489737203645/ -o batch_1489737203645.fa | 10.8 | 85.4 | Default | nanopolish extract -q
poretools fasta --type 2D /home3/ont/toledo_fc1/pass/batch_1489737203645/ > batch_1489737203645_poretools.fa | 4.5 | 93.2 | poretools fasta | poretools fastq
poreseq extract /home3/ont/toledo_fc1/pass/batch_1489737203645/ batch_1489737203645.fa | 0.2 | 92.5 | poreseq extract | N/A

So although the sequences extracted were identical for all three tools, there is quite a difference in the speed and in the size of the files. The size of the file is obviously related to the identifier/description line, where nanopolish has the longest identifier with the addition of “:2D_000:2d”.

The output for nanopolish contains three parts (an identifier, the name of the run, and the complete path to the fast5 file):
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d:2D_000:2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

The output for poretools contains three parts:
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

For poreseq, the output identifier is the name of the fast5 file:
>vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

Whilst it would be great to standardise the identifier and description information, some extraction tools have downstream tools which expect the identifier and description to be in a certain format. For example, I was unable to run “nanopolish variants --consensus” on the file I had extracted using poretools. I haven’t yet looked in detail at why that is.

That’s it.

Update:
Other tools to check out: Fast5-to-Fastq by Ryan Wick https://github.com/rrwick/Fast5-to-Fastq

2nd Viral Bioinformatics and Genomics Training Course (1st – 5th August 2016)

We have shared our knowledge on Viral bioinformatics and genomics with yet another clever and friendly bunch of researchers. Sixteen delegates from across the world joined us for a week of intensive training. The line-up of instructors changed slightly due to the departure of Gavin Wilkie earlier in the year.

Instructors:
Joseph Hughes (Course Organiser)
Andrew Davison
Sejal Modha
Richard Orton (co-organiser)
Sreenu Vattipally
Ana Da Silva Filipe

The timetable changed a bit, with more focus on advanced bash scripting (loops and conditions), as we asked the participants to have basic Linux command experience (ls, mkdir, cp), which saved us a lot of time. Rick Smith-Unna’s Linux bootcamp was really useful for the students to check their expertise before the course: http://rik.smith-unna.com/command_line_bootcamp.

The timetable for this year follows. As in the previous year, we had plenty of time for discussion at lunchtime and tea breaks, and the traditional celebratory cake at the end of the week.

9:00-9:45                 Tea & Coffee in the Barn – Arrival of participants

DAY 1 – SEQUENCING TECHNOLOGIES AND A UNIX PRIMER

The first day will start with an introduction to the various high-throughput sequencing (HTS) technologies available and the ways in which samples are prepared, with an emphasis on how this impacts the bioinformatic analyses. The rest of the first day and the second day will aim to familiarize participants with the command line and useful UNIX commands, in order to empower them to automate various analyses.

9:45-10:00           Welcome and introductions – Massimo Palmarini and Joseph Hughes
10:00-10:45        Next-generation sequencing technologies – Ana Da Silva Filipe
10:45-11:15        Examples of HTS data being used in virology – Richard Orton
11:15-11:30            Short break
11:30-11:45        Introduction to Linux and getting started – Sreenu Vattipally
11:45-12:30        Basic commands – Sreenu Vattipally
12:30-13:30            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
13:30-14:30        File editing in Linux – Sreenu Vattipally & Richard Orton
14:30-15:30        Text processing – Sreenu Vattipally & Richard Orton
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Advanced Linux commands – Sreenu Vattipally
DAY 2 – UNIX CONTINUED AND HOW TO DO REFERENCE ASSEMBLIES

The second day will continue with practicing UNIX commands and learning how to run basic bioinformatic tools. By the end, participants will be able to analyse HTS data using various reference assemblers and will be able to automate the processing of multiple files.

9:30-11:00           BASH scripting (conditions and loops) – Sreenu Vattipally
11:00-11:30            Tea & Coffee in the Barn Room
11:30-12:15        Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally & Richard Orton
12:15-13:00        Sequence quality checks – Sreenu Vattipally & Richard Orton
13:00-14:00            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
14:00-14:45        Introduction to assembly (BWA and Bowtie2)– Sreenu Vattipally & Richard Orton
14:45-15:30        More reference assembly (Novoalign, Tanoti and comparison of mapping methods) – Sreenu Vattipally & Sejal Modha
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Post-processing of assemblies and visualization (working with Tablet and Ugene and consensus sequence generation) – Sreenu Vattipally & Sejal Modha
DAY 3 – HOW TO DO VARIANT CALLING AND DE NOVO ASSEMBLY

The third day will start with participants looking at variant calling and quasi-species characterisation. In the afternoon, we will use different approaches for de novo assembly and also provide hands-on experience.

9:30-11:00           Error detection and variant calling – Richard Orton
11:00-11:30            Tea & Coffee in Barn Room
11:30-13:00        Quasi-species characterisation – Richard Orton
13:00-14:00            Lunch break in the Lomond Room with an informal presentation of Pablo Murcia’s research program.
14:00-14:45        De novo assemblers – Sejal Modha
14:45-15:30        Using different de novo assemblers (e.g. idba-ud, MIRA, Abyss, Spades) – Sejal Modha
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Assembly quality assessment, merging contigs, filling gaps in assemblies and correcting errors (e.g. QUAST, GARM, scaffold builder, ICORN2, spades) – Sejal Modha
DAY 4 – METAGENOMICS AND HOW TO DO GENOME ANNOTATION

On the fourth day, participants will look at their own assemblies in more detail, and will learn how to create a finished genome with gene annotations. A popular metagenomic pipeline will be presented, and participants will learn how to use it. In the afternoon, the participants will build their own metagenomic pipeline putting in practice the bash scripting learnt during the first two days.

9:30-10:15           Finishing and annotating genomes – Andrew Davison & Sejal Modha
10:15-11:00        Annotation transfer from related species – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-12:15        The MetAMOS metagenomic pipeline – Sejal Modha & Sreenu Vattipally
13:00-14:00            Lunch break in Lomond Room with informal presentation of Roman Biek’s research program.
14:00-15:30        Practice in building a custom de novo pipeline – Sejal Modha & Sreenu Vattipally
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Practice in building a custom de novo pipeline – Sejal Modha
17:30                         Group photo followed by social evening and Dinner at the Curler’s Rest (http://www.thecurlersrestglasgow.co.uk). 
DAY 5 – PRIMER IN APPLIED PHYLOGENETICS

On the final day, participants will combine the consensus sequences generated during day two with data from GenBank to produce phylogenies. The practical aspects of automating phylogenetic analyses will be emphasised to reinforce the bash scripting learnt over the previous days.

9:30-10:15           Downloading data from GenBank using the command line – Joseph Hughes & Sejal Modha
10:15-11:00        Introduction to multiple sequence alignments – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-13:00         Introduction to phylogenetic analysis – Joseph Hughes
13:00-14:00            Lunch break in the Lomond Room with a celebratory cake
14:00-15:30        Analysing your own data or developing your own pipeline – Whole team available
15:30-16:00            Tea & Coffee in the Barn
16:00-17:00        Analysing your own data or developing your own pipeline – Whole team available
17:00                       Goodbyes
We wish all the participants lots of fun with their new bioinformatic skills.

If you are interested in finding out about future courses that we will be running, please fill in the form with your details.

Submitting a job to run on another server and retrieving the results

Imagine having two different servers called darwin and linnaeus. Imagine that darwin is a great server with loads of RAM for doing de-novo assembly, and that linnaeus has loads of nodes, so it is a great server for splitting jobs up and running lots of them in parallel. To make good use of all these resources, it would make sense to do part of the processing on one server and then automatically send jobs to be processed on the other.

So this is how you do that. On linnaeus you run:
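This is the standard ssh-keygen routine, along the lines of:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub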

You copy the key and, on darwin, put the key in .ssh/authorized_keys2.

The reverse also needs to be done by putting a darwin key on linnaeus.

Now to test it out create the shell script that will be executed on linnaeus e.g. linnaeusshell:
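A minimal sketch of what linnaeusshell could look like, based on the description below (the paths and the darwin hostname are assumptions):

#!/bin/bash
# Uncompress the transferred file
gunzip Pf3D7_01.embl.gz
# Return the uncompressed file to darwin
scp Pf3D7_01.embl darwin:~/results/
# Report back that the job is finished
echo "Done" > linnaeus.log
scp linnaeus.log darwin:~/results/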

This small script will uncompress a file, return the uncompressed file and return a “Done” log to darwin once the script is finished.

Now create a command shell on darwin, e.g. darwinshell:
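And a matching sketch for darwinshell (again, the paths are assumptions):

#!/bin/bash
# Send the compressed file to linnaeus and run the remote script there
scp Pf3D7_01.embl.gz linnaeus:~/
ssh linnaeus 'bash ~/linnaeusshell'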

And finally execute the darwinshell:
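For example:

sh darwinshell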

This will transfer the Pf3D7_01.embl.gz compressed file over to linnaeus where the file will be uncompressed and transferred back to darwin.

Big thanks to Sreenu who helped me a lot to sort this out.

Setting up an Amazon ftp server to receive big files

Sharing large files with collaborators has rarely been a problem: we usually just compress them, put them on our web server and send the link to our collaborator, who can then download the file.
However, we have struggled to find a solution for receiving large files. We usually run out of space in Dropbox or Google Drive. We have tried infinit.io, but this has failed on a few occasions, we think due to firewall issues either on our side or on the side of the collaborator. So when we managed to get an Amazon cloud account set up through Arcus Global (see my previous blog on how we got that organised), an obvious thing to try was to set up an ftp server to receive large files.

A bit of googling around provided us with a very useful post on Stackoverflow.

First, we launched an instance through the Amazon web interface. We selected an Ubuntu instance from Amazon EC2 and specified 60Gb of storage. We generated a new key called “ftp” and saved the key locally. The .txt extension was added to the file so we renamed it and changed the permissions.
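The renaming and permission change were along these lines (the key file name is assumed):

mv ftp.pem.txt ftp.pem
chmod 400 ftp.pem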

Using the ip address of the instance we then logged into the instance using ssh.
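For example (the IP is a placeholder):

ssh -i ftp.pem ubuntu@xx.yy.zz.qq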

We installed the ftp server
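We used vsftpd:

sudo apt-get install vsftpd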

but we did not provide a password for ftp as we decided to use the ubuntu username and password for login.

Then we opened up the FTP ports on our EC2 instance as described on Stackoverflow.

We changed the ssh configuration as explained in the previous blog and changed the ubuntu user’s password:
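For example:

sudo passwd ubuntu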

Then we changed the ftp configuration file in /etc/vsftpd.conf

The following lines were changed:
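The usual suspects are the following (the exact lines are assumptions based on the Stackoverflow post):

anonymous_enable=NO
local_enable=YES
write_enable=YES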

And we added the following, with the IP of our instance. If restarting an image, the IP will be different, so this will need to be changed:
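(The port range follows the Stackoverflow recipe; the IP is a placeholder.)

pasv_enable=YES
pasv_min_port=1024
pasv_max_port=1048
pasv_address=xx.yy.zz.qq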

Restart vsftpd
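On Ubuntu:

sudo service vsftpd restart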

We had it up for about 24 hours and it cost approximately £0.56.

BEAST in the cloud

We recently managed to set up an account on Amazon Web Services (AWS) that will enable us to test how practical and efficient it is to do computationally intensive analyses in the cloud. We often wonder whether it is worth spending loads of money on new servers and lots of time managing the hardware and software on them, or whether it is worthwhile offloading most of it to a private company like Amazon. Do we save time? Do we save money?

The cloud is not impervious to problems; recently Amazon was affected by a significant outage. However, it is still worth thinking about using cloud computing and storage for the future.

The following post is a walk through of my first experiment using AWS EC2 to benchmark different instances for BEAST phylogenetic analyses.

1) Setting-up an amazon AWS account paid using a purchase order number from the CVR.

Arcus Global Ltd is a company that has been set-up to help Universities use AWS. The following video explains the background and how it works.

It wasn’t too difficult to set up, and the people at Arcus Global Ltd were quite helpful. They don’t charge more than the AWS pricing but try to make money by offering support for $100 a month. They aren’t pushy though, so we decided to go it alone.

2) Trouble getting going

This was by far the hardest step. AWS terminology is quite alien and there is a big step to make to get started; there is still an awful lot that I don’t understand. The hard part is installing the CUDA toolkit and the beagle library on an instance. I found the following blogs very useful:

  1. http://blog.faircloth-lab.org/beast-in-the-cloud/
  2. http://francoismichonneau.net/2014/05/how-to-install-beagle-on-ubuntu/
  3. and the most useful: http://tleyden.github.io/blog/2014/10/25/cuda-6-dot-5-on-aws-gpu-instance-running-ubuntu-14-dot-04/

Using the EC2 CentOS 5.5 GPU HVM AMI (ami-aa30c7c3), I didn’t manage to install the beagle library. I kept getting the following error message:

./autogen.sh
Putting files in AC_CONFIG_AUX_DIR, `.config'.
configure.ac: installing `.config/install-sh'
configure.ac: installing `.config/missing'
examples/complextest/Makefile.am: installing `.config/depcomp'
Makefile.am: installing `./INSTALL'
configure.ac:388: required file `hmsbeagle-${GENERIC_API_VERSION}.pc.in' not found
autoreconf: automake failed with exit status: 1

However, starting from the following AMI (ami-9eaa1cf6), I was successful.

3) Setting-up an AWS instance.

I found the AWS command line tools really useful so I would recommend installing them.

a) Launch the instance and login

ec2-run-instances --key my-key ami-9eaa1cf6 -t g2.8xlarge
ssh -i my-key.pem ubuntu@xx.yy.zz.qq

b) Change the settings to allow ssh (optional)

sudo bash
adduser bob

In /etc/ssh/sshd_config change to:

PasswordAuthentication yes

Add:

AllowUsers bob

Then:

service ssh restart
exit

c) Verify the CUDA library is installed properly

cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery

d ) Install all the necessary libraries for the beagle installation

sudo apt-get update
sudo apt-get install build-essential autoconf automake libtool subversion pkg-config openjdk-7-jdk

sudo apt-get install git
git clone --depth=1 https://github.com/beagle-dev/beagle-lib.git
cd beagle-lib
./autogen.sh
./configure --prefix=$HOME
make install
make check

And add to .profile

export LD_LIBRARY_PATH=$HOME/lib:$LD_LIBRARY_PATH
source ~/.profile

e) Finally install BEAST

I did this by transferring BEAST v1.8.2 using CyberDuck, because I could not find the appropriate link to use for wget. Uncompress it and then check that BEAST can find the beagle resources:

tar xvf BEASTv1.8.2.tar
beast -beagle_info

Output for instance type cg1.4xlarge:

BEAGLE resources available:
0 : CPU
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_SSE VECTOR_NONE THREADING_NONE PROCESSOR_CPU FRAMEWORK_CPU
1 : Tesla M2050
Global memory (MB): 2687
Clock speed (Ghz): 1.15
Number of cores: 448
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA
2 : Tesla M2050
Global memory (MB): 2687
Clock speed (Ghz): 1.15
Number of cores: 448
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA

Or for the instance type g2.2xlarge

BEAGLE resources available:
0 : CPU
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_SSE VECTOR_NONE THREADING_NONE PROCESSOR_CPU FRAMEWORK_CPU
1 : GRID K520
Global memory (MB): 4096
Clock speed (Ghz): 0.80
Number of cores: 1536
Flags: PRECISION_SINGLE PRECISION_DOUBLE COMPUTATION_SYNCH EIGEN_REAL EIGEN_COMPLEX SCALING_MANUAL SCALING_AUTO SCALING_ALWAYS SCALERS_RAW SCALERS_LOG VECTOR_NONE THREADING_NONE PROCESSOR_GPU FRAMEWORK_CUDA

4) Benchmarking different instance types and my personal computers.

Time (min) | Java JVM | Operating System | Processor model | Processor speed | Computer model | Notes
2.65 | 1.8.0_40 | MacOS X 10.10.5 | 4 GHz Intel Core i7 | 4 GHz | iMac | ./beast -beagle_gpu -seed 666 ../examples/Benchmarks/benchmark1.xml
2.71 | 1.8.0_40 | MacOS X 10.10.5 | 4 GHz Intel Core i7 | 4 GHz | iMac | ./beast -beagle_opencl -seed 666 ../examples/Benchmarks/benchmark1.xml
2.72 | 1.8.0_40 | MacOS X 10.10.5 | 4 GHz Intel Core i7 | 4 GHz | iMac | ./beast -seed 666 ../examples/Benchmarks/benchmark1.xml
3.69 | java-7-openjdk-amd64 | Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64) | Intel Xeon E5-2680 v2 (Ivy Bridge) | 2.8 GHz (Max Turbo 3.6 GHz) | ami-a596b8d2 on instance type c3.8xlarge | ./beast -beagle -beagle_cpu -seed 666 ../examples/Benchmarks/benchmark1.xml
3.71 | java-7-openjdk-amd64 | Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64) | Intel Xeon E5-2680 v2 (Ivy Bridge) | 2.8 GHz (Max Turbo 3.6 GHz) | ami-a596b8d2 on instance type c3.8xlarge | ./beast -beagle -beagle_gpu -seed 666 ../examples/Benchmarks/benchmark1.xml
4.63 | java-7-openjdk-amd64 | Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64) | Intel Xeon E5-2670 (20M Cache, 2.60 GHz, 8.00 GT/s QPI) | 2.6 GHz (Max Turbo 3.3 GHz) | ami-2cbf3e44 on instance type g2.2xlarge | ./beast -beagle -beagle_cuda -seed 666 ../examples/Benchmarks/benchmark1.xml
4.71 | java-7-openjdk-amd64 | Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64) | Intel Xeon E5-2670 (20M Cache, 2.60 GHz, 8.00 GT/s QPI) | 2.6 GHz (Max Turbo 3.3 GHz) | ami-2cbf3e44 on instance type g2.2xlarge | ./beast -beagle -beagle_GPU -seed 666 ../examples/Benchmarks/benchmark1.xml
4.72 | java-7-openjdk-amd64 | Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64) | Intel Xeon E5-2670 (20M Cache, 2.60 GHz, 8.00 GT/s QPI) | 2.6 GHz (Max Turbo 3.3 GHz) | ami-2cbf3e44 on instance type g2.2xlarge | ./beast -seed 666 ../examples/Benchmarks/benchmark1.xml
5.06 | java-7-openjdk-amd64 | Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64) | Intel Sandy Bridge 2.6 GHz with two NVIDIA Tesla “Fermi” M2050 GPUs | 2.6 GHz | ami-a596b8d2 on instance type cg1.4xlarge | ./beast -beagle -beagle_GPU -seed 666 ../examples/Benchmarks/benchmark1.xml
5.35 | 1.7.0_25 | MacOS X 10.9.5 | 2 x 2.4 GHz Quad-Core Intel Xeon | 2 x 2.4 GHz | Mac Pro | ./beast -beagle_opencl ../examples/Benchmarks/benchmark1.xml
31.16 | java-7-openjdk-amd64 | Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64) | Intel Sandy Bridge 2.6 GHz with two NVIDIA Tesla “Fermi” M2050 GPUs | 2.6 GHz | ami-a596b8d2 on instance type cg1.4xlarge | ./beast -beagle -beagle_GPU -seed 666 ../examples/Benchmarks/benchmark2.xml

I still have loads to learn when it comes to using the cloud but this was a useful experiment to start off with.

Thanks for reading this far,

Joseph

 

1st Viral Bioinformatics and Genomics Training Course


The first Viral Bioinformatics and Genomics training course held at the University of Glasgow was completed successfully by 14 delegates (nine external and five internal) on 10-14 August 2015. The course took place in the McCall Building computer cluster, and the adjacent Lomond and Dumgoyne Rooms were used for refreshments and lunch.

Instructors:
Joseph Hughes (Course Organiser)
Andrew Davison
Robert Gifford
Sejal Modha
Richard Orton
Sreenu Vattipally
Gavin Wilkie

9:00-10:00 Tea & Coffee in Dumgoyne Room – Arrival of participants
DAY 1 – SEQUENCING TECHNOLOGIES AND A UNIX PRIMER

Day one introduced the participants to the different sequencing technologies available and the ways in which samples are prepared, with an emphasis on how this impacts the bioinformatic analyses. The rest of the day aimed to familiarize the researcher with the command line and useful UNIX commands.


10:00-10:45 Next-generation sequencing technologies – Gavin Wilkie
10:45-11:30 Examples of HTS data being used in virology – Richard Orton
11:30-11:45 Brief introduction and history of Unix/Linux – Sreenu Vattipally and Richard Orton
11:45-12:30 The command line anatomy and basic commands – Sreenu Vattipally and Richard Orton

12:30-13:30 Lunch break in Dumgoyne Room

13:30-14:30 Essential UNIX commands – Sreenu Vattipally and Sejal Modha
14:30-15:30 Text processing with grep – Sreenu Vattipally and Sejal Modha
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:30 sed and awk – Sreenu Vattipally and Sejal Modha

DAY 2 – UNIX CONTINUED AND HOW TO DO REFERENCE ASSEMBLIES

The participants continued to practice UNIX commands and learned how to run basic bioinformatic tools. By the end of the day, the participants were able to analyse HTS data with different reference assemblers and automate the processing of multiple files.

9:30-10:15 Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally and Richard Orton
10:15-11:00 Quality scores, quality control and trimming (Prinseq and FastQC and trim_galore) – Sreenu Vattipally and Richard Orton
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-13:00 Introduction to alignment and assembly programs (Blast, BWA, Bowtie) – Sreenu Vattipally and Richard Orton

13:00-14:00 Lunch break in Dumgoyne Room

14:00-14:45 Continuation of alignment and assembly programs (Novoaligner, Tanoti) – Sreenu Vattipally and Richard Orton
14:45-15:30 Post-assembly processing and alignment visualization (Tablet and UGENE) – Sreenu Vattipally and Richard Orton
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-16:45 Workflow design and automation – Sreenu Vattipally and Richard Orton
16:45-17:30 Loops and command line arguments – Sreenu Vattipally and Richard Orton

DAY 3 – HOW TO DO DE NOVO ASSEMBLY AND USE METAGENOMICS PIPELINES

The third day covered the different approaches used for de-novo assembly and provided hands-on experience. A popular metagenomic pipeline was presented and participants learned how to use it as well as create their own pipeline. (Slides and practical)


9:30-10:15 De-novo assemblers – Sejal Modha and Gavin Wilkie
10:15-11:00 Using different de-novo assemblers (e.g. Meta-idba, Edena, Abyss, Spades) – Sejal Modha and Gavin Wilkie

11:00-11:30 – Tea & Coffee in Lomond Room
11:30-13:00 Scaffolding contigs, filling gaps in assemblies and correcting errors (e.g. phrap, gap_filler, scaffold builder, ICORN2, spades) – Gavin Wilkie and Sejal Modha

13:00-14:00 Lunch break in Lomond Room

14:00-14:45 Practice building a custom de-novo pipeline (BLAST, KronaTools)
14:45-15:30 the MetAmos metagenomic pipeline – Sejal Modha and Sreenu Vattipally
15:30-16:00 Tea & Coffee in Lomond Room
16:00-17:30 Analysis using the pipeline – Sejal Modha and Sreenu Vattipally

DAY 4 – SEQUENCING TECHNOLOGIES AND HOW TO DO GENOME ANNOTATION

Day four gave the participants the opportunity to look at their assemblies in more detail; they learned how to create a finished, curated full genome (with an emphasis on finishing) with gene annotations and analysed the variation within their samples.

9:30-11:00 Finishing and Annotating genomes – Andrew Davison, Gavin Wilkie and Sreenu Vattipally
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-13:00 Annotation transfer from related species – Gavin Wilkie, Andrew Davison and Sreenu Vattipally

13:00-14:00 Lunch break in Dumgoyne Room


14:00-15:30 Error detection and variant calling – Richard Orton
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:30 Quasi-species characterisation – Richard Orton

17:30 Social evening in the pub/restaurant at Curlers on Byres Road organised by Richard Orton (http://www.thecurlersrestglasgow.co.uk). 

 

DAY 5 – PRIMER IN INVESTIGATING PATHWAYS OF VIRAL EVOLUTION

Researchers worked through several practical examples of using sequences in virology and spent the remaining time analysing their own data with the teachers' help.

9:30-10:15 Identifying mutations of interest for individual sequences and within a set of sequences – Robert Gifford
10:15-11:00 Combining phylogenies with traits – Robert Gifford
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-12:15 Investigating epidemiology (IDU example) – Robert Gifford
12:15-13:00 Investigate transmission of drug resistance – Robert Gifford

13:00-14:00 Lunch break in Dumgoyne Room

14:00-15:30 Analysing your own data or developing your own pipeline
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:00 Analysing your own data or developing your own pipeline
17:00 Goodbyes


 

If you would like to be informed of similar training courses run in the future, please fill in your details here.

 

 

Recombination detection programs

A critical step before phylogenetic analysis and molecular selection analysis is to detect recombination and either remove recombinant sequences or partition the alignment into different spans that are recombination-free. The problem is that there are so many different recombination detection programs available. These have been nicely reviewed by Posada (2002), and some of the programs have been benchmarked by Kosakovsky Pond and Frost (2005).

I decided to compile all the programs that are available but there are so many that I am sure I have missed some. If you know of any others, please leave a comment below.

I have split the programs into the same four categories as Posada (2002), but some programs implement multiple methods, so it gets a bit tricky. The size of the font relates to the number of citations each program/publication has received in Google Scholar. As you can see, there are some clear favourites.

You can click on the names of the programs, and this should take you either to the publication abstract on PubMed or to the website for the software.
