Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools

Oxford Nanopore Technologies (ONT) sequencers write the results of a sequencing run in the FAST5 format, which is a variant of HDF5.

“HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.” from HDF group

A number of tools have been developed for converting the fast5 files produced by ONT to the more commonly used FASTA/FASTQ file formats. This is my attempt at determining which one to use based on functionality and runtime. The tools that I have looked at so far are nanopolish, poretools, poreseq and poRe.

Nanopolish is developed by Jared Simpson in C++ with some Python utility scripts. The extract command comes with useful help information.

nanopolish extract --help
Usage: nanopolish extract [OPTIONS] <fast5|dir>...
Extract reads in fasta format

--help display this help and exit
--version display version
-v, --verbose display verbose output
-r, --recurse recurse into subdirectories
-q, --fastq extract fastq (default: fasta)
-t, --type=TYPE read type: template, complement, 2d, 2d-or-template, any
(default: 2d-or-template)
-o, --output=FILE write output to FILE (default: stdout)

Report bugs to https://github.com/jts/nanopolish/issues

poretools is a toolkit by Nick Loman and Aaron Quinlan written in Python. The poretools fasta command has many options for filtering the sequences.

poretools fasta -h
usage: poretools fasta [-h] [-q] [--type STRING] [--start START_TIME]
[--end END_TIME] [--min-length MIN_LENGTH]
[--max-length MAX_LENGTH] [--high-quality]
[--normal-quality] [--group GROUP]

positional arguments:
FILES The input FAST5 files.

optional arguments:
-h, --help show this help message and exit
-q, --quiet Do not output warnings to stderr
--type STRING Which type of FASTQ entries should be reported?
--start START_TIME Only report reads from after start timestamp
--end END_TIME Only report reads from before end timestamp
--min-length MIN_LENGTH
Minimum read length for FASTA entry to be reported.
--max-length MAX_LENGTH
Maximum read length for FASTA entry to be reported.
--high-quality Only report reads with more complement events than
--normal-quality Only report reads with fewer complement events than
--group GROUP Base calling group serial number to extract, default

PoreSeq has been developed by Tamas Szalay and is written in Python. The poreseq extract command also has a help option.

poreseq extract -h
usage: poreseq extract [-h] [-p] dirs [dirs ...] fasta

positional arguments:
dirs fast5 directories
fasta output fasta

optional arguments:
-h, --help show this help message and exit
-p, --path use rel. path as fasta header (instead of just filename)

poRe is a library for R written by Mick Watson. poRe has a very basic script extract2D which you can run from the command line. Unfortunately, I could not get it to print out the converted files and there were no error messages. I did also try to use it in the R client but without luck.

The following table compares the runtime of each program, measured with the perf stat command over 10 replicates (perf stat -r 10 -d), for extracting 4000 .fast5 files to FASTA. The time is the average over the 10 replicates. All the tools compared produced identical sequences, but the headers, and thus the file sizes, differ. The differences are illustrated in the table.

Command | Time (sec) | MB | FASTA | FASTQ
nanopolish extract --type 2d /home3/ont/toledo_fc1/pass/batch_1489737203645/ -o batch_1489737203645.fa | 10.88 | 5.4 | Default | nanopolish extract -q
poretools fasta --type 2D /home3/ont/toledo_fc1/pass/batch_1489737203645/ > batch_1489737203645_poretools.fa | 4.59 | 3.2 | poretools fasta | poretools fastq
poreseq extract /home3/ont/toledo_fc1/pass/batch_1489737203645/ batch_1489737203645.fa | 0.29 | 2.5 | poreseq extract | N/A

So although the sequences extracted were identical for all three tools, there is quite a difference in the speed and the size of the files. The size of the file is obviously related to the identifier/description line, where nanopolish has the longest identifier with the addition of “:2D_000:2d”.

The output for nanopolish contains 3 parts (an identifier, the name of the run, the complete path to the fast5 file):
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d:2D_000:2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

The output for poretools contains three parts:
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

For poreseq, the output identifier is the name of the fast5 file:

Whilst it would be great to standardise the identifier and description information, some extraction tools have downstream tools which expect the identifier and description to be in a certain format. For example, I was unable to run “nanopolish variants --consensus” on the file I had extracted using poretools. I haven’t yet looked in detail at why that is.
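
To compare identifiers across tools, the headers can be split and the nanopolish-specific suffix stripped. This is only a minimal sketch: the function names are my own, and the assumption that the suffix always looks like “:2D_000:2d” is based solely on the example headers above (abbreviated in the usage note).

```python
import re

# Suffix seen in the nanopolish example header (":2D_000:2d").
# Assumption: other read types follow the same ":<group>:<type>" pattern.
NANOPOLISH_SUFFIX = re.compile(r":[^:\s]+:[^:\s]+$")


def split_header(header_line):
    """Split a FASTA header line into (identifier, description)."""
    text = header_line.lstrip(">").rstrip("\n")
    identifier, _, description = text.partition(" ")
    return identifier, description


def normalise_identifier(identifier):
    """Drop a trailing nanopolish-style suffix, if there is one."""
    return NANOPOLISH_SUFFIX.sub("", identifier)
```

With this, normalise_identifier(split_header(">c8ea266c_Basecall_2D_2d:2D_000:2d run /path/read.fast5")[0]) gives the same identifier as the corresponding poretools header. Whether that is enough for downstream tools to accept the file is something I have not tested.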

That’s it.

2nd Viral Bioinformatics and Genomics Training Course (1st – 5th August 2016)

We have shared our knowledge on Viral bioinformatics and genomics with yet another clever and friendly bunch of researchers. Sixteen delegates from across the world joined us for a week of intensive training. The line-up of instructors changed slightly due to the departure of Gavin Wilkie earlier in the year.

Joseph Hughes (Course Organiser)
Andrew Davison
Sejal Modha
Richard Orton (co-organiser)
Sreenu Vattipally
Ana Da Silva Filipe

The timetable changed a bit, with more focus on advanced bash scripting (loops and conditions), as we asked the participants to have basic Linux command experience (ls, mkdir, cp), which saved us a lot of time. Rik Smith-Unna’s Linux bootcamp was really useful for the students to check their expertise before the course: http://rik.smith-unna.com/command_line_bootcamp.

This year’s timetable follows. As in the previous year, we had plenty of time for discussion at lunchtime and tea breaks, and the traditional celebratory cake at the end of the week.

9:00-9:45                 Tea & Coffee in the Barn – Arrival of participants


The first day will start with an introduction to the various high-throughput sequencing (HTS) technologies available and the ways in which samples are prepared, with an emphasis on how this impacts the bioinformatic analyses. The rest of the first day and the second day will aim to familiarize participants with the command line and useful UNIX commands, in order to empower them to automate various analyses.

9:45-10:00           Welcome and introductions – Massimo Palmarini and Joseph Hughes
10:00-10:45        Next-generation sequencing technologies – Ana Da Silva Filipe
10:45-11:15        Examples of HTS data being used in virology – Richard Orton
11:15-11:30            Short break
11:30-11:45        Introduction to Linux and getting started – Sreenu Vattipally
11:45-12:30        Basic commands – Sreenu Vattipally
12:30-13:30            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
13:30-14:30        File editing in Linux – Sreenu Vattipally & Richard Orton
14:30-15:30        Text processing – Sreenu Vattipally & Richard Orton
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Advanced Linux commands – Sreenu Vattipally

The second day will continue with practicing UNIX commands and learning how to run basic bioinformatic tools. By the end, participants will be able to analyse HTS data using various reference assemblers and will be able to automate the processing of multiple files.

9:30-11:00           BASH scripting (conditions and loops) – Sreenu Vattipally
11:00-11:30            Tea & Coffee in the Barn Room
11:30-12:15        Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally & Richard Orton
12:15-13:00        Sequence quality checks – Sreenu Vattipally & Richard Orton
13:00-14:00            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
14:00-14:45        Introduction to assembly (BWA and Bowtie2)– Sreenu Vattipally & Richard Orton
14:45-15:30        More reference assembly (Novoalign, Tanoti and comparison of mapping methods) – Sreenu Vattipally & Sejal Modha
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Post-processing of assemblies and visualization (working with Tablet and Ugene and consensus sequence generation) – Sreenu Vattipally & Sejal Modha

The third day will start with participants looking at variant calling and quasi-species characterisation. In the afternoon, we will use different approaches for de novo assembly and also provide hands-on experience.

9:30-11:00           Error detection and variant calling – Richard Orton
11:00-11:30            Tea & Coffee in Barn Room
11:30-13:00        Quasi-species characterisation – Richard Orton
13:00-14:00            Lunch break in the Lomond Room with an informal presentation of Pablo Murcia’s research program.
14:00-14:45        De novo assemblers – Sejal Modha
14:45-15:30        Using different de novo assemblers (e.g. idba-ud, MIRA, Abyss, Spades) – Sejal Modha
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Assembly quality assessment, merging contigs, filling gaps in assemblies and correcting errors (e.g. QUAST, GARM, scaffold builder, ICORN2, spades) – Sejal Modha

On the fourth day, participants will look at their own assemblies in more detail, and will learn how to create a finished genome with gene annotations. A popular metagenomic pipeline will be presented, and participants will learn how to use it. In the afternoon, the participants will build their own metagenomic pipeline putting in practice the bash scripting learnt during the first two days.

9:30-10:15           Finishing and annotating genomes – Andrew Davison & Sejal Modha
10:15-11:00        Annotation transfer from related species – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-12:15        The MetAMOS metagenomic pipeline – Sejal Modha & Sreenu Vattipally
13:00-14:00            Lunch break in Lomond Room with informal presentation of Roman Biek’s research program.
14:00-15:30        Practice in building a custom de novo pipeline – Sejal Modha & Sreenu Vattipally
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Practice in building a custom de novo pipeline – Sejal Modha
17:30                         Group photo followed by social evening and Dinner at the Curler’s Rest (http://www.thecurlersrestglasgow.co.uk). 

On the final day, participants will combine the consensus sequences generated during day two with data from GenBank to produce phylogenies. The practical aspects of automating phylogenetic analyses will be emphasised to reinforce the bash scripting learnt over the previous days.

9:30-10:15           Downloading data from GenBank using the command line – Joseph Hughes & Sejal Modha
10:15-11:00        Introduction to multiple sequence alignments – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-13:00        Introduction to phylogenetic analysis – Joseph Hughes
13:00-14:00            Lunch break in the Lomond Room with a celebratory cake
14:00-15:30        Analysing your own data or developing your own pipeline – Whole team available
15:30-16:00            Tea & Coffee in the Barn
16:00-17:00        Analysing your own data or developing your own pipeline – Whole team available
17:00                       Goodbyes
We wish all the participants lots of fun with their new bioinformatic skills.

If you are interested in finding out about future courses that we will be running, please fill in the form with your details.

How to generate a Sample Sheet from sample/index data in BaseSpace

If you are using BaseSpace for sample entry but demultiplexing your data manually, you may have been frustrated that there is no facility to download your sample names and index tag data from BaseSpace as a sample sheet. This means you have to enter the same data twice – with the possibility of errors creeping in especially for large projects with many samples and dual index tags.

We have found a way to avoid typing the same information twice and instead fetch the sample names, index IDs and index tag sequences from BaseSpace straight to a sample sheet. This saves a huge amount of time for large projects with many samples.

Log in to BaseSpace, and navigate to the ‘Libraries’ page within the ‘Prep Libraries’ tab. Each line is a set of libraries with complete information on index names and tag sequences. Clicking a set of libraries will bring up the following screen – this example has 24 samples with TruSeqLT tags (only 7 are visible without scrolling down the list).


Clicking the ‘EXPORT’ button will download a comma-separated file (csv) that can be opened in Excel. This file has all the sample names, index IDs and index sequences (but not in quite the correct format to paste into a sample sheet).


Open the file in Excel, select the entire Index1 Column and click the ‘Text to Columns’ function (under the ‘Data’ menu in Excel). Choose the ‘Delimited’ option, then tick ‘Other’ and enter a hyphen (-) in the box. This will split the Index1 Column into two, with the name of the Index and the actual Tag sequence in two separate columns, as below.


If using dual indexing (e.g. TruSeqHT or NexteraXT) then do the same for the second column with Index 2 to split the index2 names and sequences into two separate columns.
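
If you would rather not do the split by hand in Excel, the same step can be scripted. The sketch below is only an illustration: the column names (Index1, Index2) and the index values in the usage note are assumptions, so check them against the header row of your own export.

```python
import csv
import io


def split_index(value):
    """Split 'NAME-SEQUENCE' at the last hyphen into (name, sequence).

    The last hyphen is used because the tag sequence never contains one.
    """
    name, _, sequence = value.rpartition("-")
    return name.strip(), sequence.strip()


def split_index_columns(csv_text, columns=("Index1", "Index2")):
    """Return the rows with each index column replaced by <col>Name and <col>Sequence."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column in columns:
            if row.get(column):
                name, sequence = split_index(row.pop(column))
                row[column + "Name"] = name
                row[column + "Sequence"] = sequence
        rows.append(row)
    return rows
```

For a line such as "s1,IdxA-ACGTAC,IdxB-TTGCAA" this yields Index1Name "IdxA" and Index1Sequence "ACGTAC", and likewise for Index2. Single-indexed exports simply have no Index2 column to split.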

Now open a blank or used sample sheet that is set up for the correct library chemistry and sequencing instrument (see previous blog post), then copy and paste the sample IDs, index IDs and index sequences into the sample sheet. Save it as a comma-separated file (csv) and it’s ready to use for demultiplexing and fastq generation, or your next MiSeq run. The above example looks like this…


How to demultiplex Illumina data and generate fastq files using bcl2fastq

Sequence runs on NGS instruments are typically carried out with multiple samples pooled together. An index tag (also called a barcode) consisting of a unique sequence of between 6 and 12bp is added to each sample so that the sequence reads from different samples can be identified.

On the Illumina MiSeq, the process of demultiplexing (dividing your sequence reads into separate files for each index tag/sample) and generating the fastq data files required for downstream analysis is carried out automatically using the onboard PC. However, on the higher-throughput NextSeq500 and HiSeq models this process is carried out on BaseSpace – Illumina’s cloud-based resource.

Whilst there are many advantages to having your sequence data in the cloud (e.g. monitoring a sequence run from home, ease of sharing data with collaborators, etc) there are also some drawbacks to this system. In particular the process of demultiplexing and fastq file generation in BaseSpace can be very slow. It takes up to 8 hours to demultiplex the data from a high output NextSeq500 run on BaseSpace, and if the fastq files then have to be downloaded to your local computer or server for analysis this requires a further 3 hours.

If your data is urgent you may not want to wait 11 hours or more after your sequence run has finished to begin your analysis! We have found that demultiplexing and fastq file generation from a high output NextSeq500 run can instead be carried out in about 30 minutes on our in-house UNIX server. This also has the advantage of avoiding the rather slow step of downloading your fastq files from BaseSpace.

In order to do this, you need to install a free piece of software from Illumina called bcl2fastq on your UNIX server. Demultiplexing NextSeq500 data (or any Illumina system running RTA version 1.18.54 and later) requires bcl2fastq version 2.16 or newer (the latest version at the time of writing is v2.17 and can be downloaded here).

Importantly, we have checked that the results obtained from bcl2fastq and BaseSpace are equivalent – the fastq files generated are exactly the same. BaseSpace is set to remove adapter sequences by default, meaning that the sequence reads may not all be the same length (any reads from short fragments with adapter read-through will have those sequences removed). In bcl2fastq you have the option to either remove adapter sequences or leave them in so that all reads are the same length.

In order to demultiplex the data, first copy the entire run folder from the sequencer to your UNIX server. On the NextSeq500, the run folder will be inside the following directory on the hard disc –
D:\Illumina\NextSeq Control Software\Temp\
It ought to be the ONLY folder here as the NextSeq only retains data from the most recent run – as soon as you start a new sequence run the data from the previous run is deleted. Copy the entire folder, including all its subdirectories. This folder contains the raw basecall (bcl) files. Do not change the name of the folder, which will be named as per the following convention – YYMMDD_InstrumentID_RunID_FlowcellID
For example, the 10th run carried out on a NextSeq500 with serial number 500999, on 14th April 2016 and using flowcell number AHLFNLBGXX would be named as follows –
160414_NB500999_0010_AHLFNLBGXX
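
Because the folder name follows a fixed convention, it is easy to pull the run details back out of it in a script. A minimal sketch, assuming exactly four underscore-separated fields as in the convention above (the function name and dictionary keys are my own):

```python
from datetime import datetime


def parse_run_folder(name):
    """Split 'YYMMDD_InstrumentID_RunID_FlowcellID' into its four parts."""
    date_text, instrument, run_id, flowcell = name.split("_")
    return {
        "date": datetime.strptime(date_text, "%y%m%d").date(),
        "instrument": instrument,
        "run_number": int(run_id),
        "flowcell": flowcell,
    }
```

For the run folder 160414_NB500999_0010_AHLFNLBGXX this gives the date 2016-04-14, instrument NB500999, run number 10 and flowcell AHLFNLBGXX. A name with a different number of fields raises a ValueError, which is a cheap way to notice a renamed folder.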

The other requirement is a sample sheet – a simple comma separated file (csv) with the library chemistry, sample names and the index tag used for each sample, in addition to some other metrics describing the run. Anyone running a MiSeq will already be familiar with these, but NextSeq and HiSeq users may only have used BaseSpace to enter these values. Unfortunately there is no way to automatically download a sample sheet from BaseSpace (although we have figured out a way round this to avoid double data entry, see the next blog post). Sample sheets can be made and modified using MS Excel or any other software that can read csv files, but the easiest way to make one is to use a free wizard-type program for the PC called Illumina Experiment Manager, which guides you through the process. The latest version at the time of writing is v1.9, which is available here.

Open Illumina Experiment Manager, and click on ‘Create Sample Sheet.’ Then, make certain that you choose the correct sequencer (essential since the NextSeq and MiSeq use opposite reverse complements during index reads). Select ‘Fastq only’ output. Enter any value (numbers or text) for the Reagent Kit Barcode – this will become the filename. Ensure correct library chemistry is selected (e.g. TruSeqLT, TruSeqHT, NexteraXT, etc). If there are custom/non-standard tags these will need to be manually entered in the csv file. Tick adapter trimming for read1 and read2 if required, select either paired or single end reads and enter the read length as appropriate (add one base, so for 150bp reads enter 151). Then either follow the instructions in the next blog post to import sample names and tags from BaseSpace, or enter them manually by adding a blank row for each sample, entering the sample names and selecting the index tag(s) for each sample. It is wise to double check that the sample names and indexes are correct, as mistakes will cause data to be allocated to the wrong file. Change the name of the file to ‘SampleSheet.csv’ and copy it into the top directory inside the sequence run folder on the server. The sample sheet file should resemble the example below – this is for a paired end 2x151bp NextSeq run with four samples, TruSeqLT index tags, and adapter trimming selected.


Now use the command line below on the server to run bcl2fastq. For speed, we use 12 threads for processing the data on our UNIX server (-p 12); however, the optimal number will depend on your system architecture, resources and usage limits. It is important to set a limit to the number of threads, otherwise bcl2fastq will use 100% of the CPUs on the server. We usually invoke the no-lane-splitting option, otherwise each output file from our NextSeq is divided into four (one for each lane on the flowcell). Here we are using the NextSeq run folder mentioned above as an example (160414_NB500999_0010_AHLFNLBGXX) and sending the output to a subdirectory within it called ‘fastq_files.’ For other bcl2fastq options please see Illumina’s manual on the software.
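
If you launch bcl2fastq from a script rather than typing it by hand, the call can be wrapped along these lines. This is a sketch only: -p and the no-lane-splitting option are the ones described above, while --runfolder-dir and --output-dir are the long option names as I understand them from the bcl2fastq2 documentation, so check them against bcl2fastq --help for your version. The function names are hypothetical.

```python
import subprocess


def bcl2fastq_command(run_folder, threads=12, output_subdir="fastq_files"):
    """Build the argument list for the bcl2fastq call described in the text."""
    return [
        "bcl2fastq",
        "--runfolder-dir", run_folder,
        "--output-dir", f"{run_folder}/{output_subdir}",
        "-p", str(threads),      # cap the number of processing threads
        "--no-lane-splitting",   # do not split each output file by lane
    ]


def run_bcl2fastq(run_folder, **options):
    """Run bcl2fastq and raise CalledProcessError if it exits with an error."""
    subprocess.run(bcl2fastq_command(run_folder, **options), check=True)
```

Building the argument list separately from running it makes it easy to print and check the command before committing the server to a demultiplexing job.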

In this example, there should be two fastq files generated for each sample (one each for forward R1 and reverse R2 reads, since this is a paired end 2x151bp run) plus a forward and reverse file for ‘Undetermined’ reads where the index tag did not match any of the tags in the sample sheet. The Undetermined file will contain all of the reads from the PhiX spike-in if used (as PhiX does not have a tag) and also any other reads where there was a basecalling error during the index read. Depending on the PhiX spike-in % and the total number of samples on the run, the size of the Undetermined file should normally be smaller than the other files. If there is a problem suspected with demultiplexing or tagging always check the ‘index.html’ file within the ‘Reports/html’ subdirectory. This file will open on a standard web browser, and clicking the ‘unknown barcode’ option will display the top unknown barcodes and allow problems to be diagnosed. Common issues are that one or more samples were omitted from the sample sheet, errors entering the barcodes, incorrect library chemistry (e.g. selecting NexteraXT instead of TruSeqHT) or that the barcodes (especially sometimes index 2 on dual-indexed samples) need to be reverse-complemented on the sample sheet.
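
When the report hints at a reverse-complement problem, a quick check is to reverse-complement the top unknown barcodes and see whether they match anything on the sample sheet. A minimal sketch, with hypothetical function names and made-up sample names in the usage note:

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def samples_matching_reverse_complement(unknown_barcode, sample_indexes):
    """List the samples whose index equals the reverse complement of the barcode.

    sample_indexes maps sample name -> index sequence as entered on the sample sheet.
    """
    flipped = reverse_complement(unknown_barcode)
    return [name for name, index in sample_indexes.items() if index.upper() == flipped]
```

For instance, reverse_complement("ATCACG") is "CGTGAT", so an unknown barcode CGTGAT would be matched to a sample entered with index ATCACG, suggesting that the index needs to be reverse-complemented on the sample sheet.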

NCBI Entrez Direct UNIX E-utilities

I use NCBI Entrez Direct UNIX E-utilities regularly for sequence and data retrieval from NCBI. These UNIX utils can be combined with any UNIX commands.

They are available to download from the NCBI website: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/

Here are a few useful examples using the NCBI edirect utilities.

Download a sequence in fasta format from NCBI using accession number
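
For use inside a Python script, the same kind of retrieval can be sketched by calling the E-utilities efetch endpoint directly, since Entrez Direct is a command-line front end to E-utilities. This is not an edirect command. The function names and the accession in the usage note are only illustrative, and fetch_fasta needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="nuccore"):
    """Build the efetch URL that returns a record as FASTA text."""
    query = urlencode({"db": db, "id": accession, "rettype": "fasta", "retmode": "text"})
    return f"{EFETCH_URL}?{query}"


def fetch_fasta(accession, db="nuccore"):
    """Download the FASTA record for an accession (requires network access)."""
    with urlopen(efetch_fasta_url(accession, db)) as response:
        return response.read().decode()
```

Calling efetch_fasta_url("NC_001477") builds a nucleotide query, and passing db="protein" does the same for a protein accession. For bulk downloads the edirect tools remain the more convenient route.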

Batch retrieval for all proteins for taxon ID. This example will download all proteins for viruses in fasta format.

Download sequences in fasta format from NCBI with edirect using isolate information

Download sequences from NCBI with edirect using a BioProject accession or ID

Get all CDS from a genome

Get taxonomy ID from protein accession number

Get taxonomy ID from accession number using esummary

Get full lineage from accession number
Tip : xtract can be used to fetch any element from the xml output

Get scientific name from accession number

Download all refseq protein sequences for viruses

Download reference genome sequence from taxonomy ID
Note: Using efilter command

Get all proteins from a genome accession

Extract genome accession from protein accession – uses the DBSOURCE attribute in the GenBank file and is an alternative to the script mentioned in one of my earlier blog posts.
Note: The following command works with both protein accessions and GIs passed as the -id parameter of the elink command.

More info about NCBI Entrez Direct E-utilities is available on the NCBI website: http://www.ncbi.nlm.nih.gov/books/NBK179288/

NGS Data Formats and Analyses

Here are my slides from a session on NGS data formats and analyses that I gave as part of the EPIZONE Workshop on Next Generation Sequencing applications and Bioinformatics in Brussels in April 2016. It covers file formats such as FASTA, FASTQ, SAM, BAM and VCF, and also goes over IUPAC nucleotide ambiguity codes, read names, quality scores, error probabilities and CIGAR strings.
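
As a small taste of two of these topics, here is a sketch of how Phred quality scores map to error probabilities and how a CIGAR string can be read. The function names are my own, and the quality decoding assumes the Phred+33 encoding used by Sanger and current Illumina FASTQ files.

```python
import re


def phred_to_error_probability(quality):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def decode_quality_string(quality_string, offset=33):
    """Turn a FASTQ quality line into a list of Phred scores (Phred+33 by default)."""
    return [ord(character) - offset for character in quality_string]


# One CIGAR operation: a length followed by an operation code.
CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Number of reference bases covered by an alignment.

    Only M, D, N, = and X consume the reference; I, S, H and P do not.
    """
    return sum(int(length) for length, op in CIGAR_OPERATION.findall(cigar) if op in "MDN=X")
```

So a quality of 20 corresponds to a 1 in 100 chance of a wrong base call, the quality character "I" decodes to 40, and the CIGAR string 5S90M2I3D covers 93 bases of the reference.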


How to Import data for libraries with index tags into BaseSpace

In this blog we describe how to import lists of sample data with defined index tags into BaseSpace, and provide templates for TruSeqLT and TruSeqHT libraries. We have found this saves a lot of time and eliminates errors associated with manual entry.
The Illumina NextSeq500 sequencer requires all users to complete sample data entry on BaseSpace (Illumina’s cloud-based resource) including sample names, species, project names, index tags and sample pools. Whilst there are many advantages to having this data in the cloud, the BaseSpace interface is not always the most convenient or user-friendly system for data entry and management.
Our experience has been that for large projects with many samples, it is impractical to use the manual method of entering sample names in the ‘Biological Samples’ tab, then individually assigning an index tag in the ‘Libraries’ tab by dragging each sample onto an image of a 96-well plate of barcodes. To make matters worse, BaseSpace always mixes up the order of the samples (even if they are named 1-96), so it becomes all too easy to make an error when faced with a long list of sample names in a random order that each require a tag to be assigned.
It is quite easy to import a csv file created in Excel (or similar) with the sample names, species, project and nucleic acid into the ‘Biological Samples’ tab, and thus avoid a large part of the manual data entry. However this still requires the user to individually assign an index tag to each sample using the cumbersome and error-prone interface pictured below, dragging each sample on the list to the correct well on the index plate.
It is possible to avoid this by importing a csv file with the sample names, species, project, nucleic acid, index name and also the index tags into the ‘Libraries’ tab on BaseSpace. However, there is very little guidance on how to do this – and Illumina only provide an example template for libraries made using Nextera XT with none of the sequence tags themselves.
We are mainly using TruSeq indexes, so we have generated our own import templates with all 24 TruSeqLT tags, and all 96 dual-indexed TruSeqHT tags. This took quite a bit of trial and error, plus fetching the sequences of all 216 index tags. We have therefore made our own templates for importing TruSeqLT and TruSeqHT libraries available here for others to use.
Simply open the csv file in Excel (or similar) and insert the names of your own samples in the first two columns. Copy and paste the index tags you have used to the correct sample lines (each sample requires the Well, Index1Name, Index1Sequence, Index2Name and Index2Sequence). Change the name of the ContainerID from ‘Platename’ to your own name and delete any lines you don’t need (e.g. if you have fewer than 24 or 96 samples). Here we are using the template to import 24 samples called apples 1-24 with TruSeqHT dual tags. If using 96 samples, use this.
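
Before importing, it can be worth checking the edited file for the slips that are easy to make when copying and pasting. The sketch below is a hypothetical checker, not part of BaseSpace: it assumes the column headers are the first line of the file and only looks at the five per-sample columns named above.

```python
import csv
import io

REQUIRED = ("Well", "Index1Name", "Index1Sequence", "Index2Name", "Index2Sequence")


def check_library_rows(csv_text):
    """Return a list of human-readable problems found in the import file."""
    problems = []
    seen_wells = set()
    # Data rows start on line 2 because line 1 holds the column headers.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                problems.append(f"line {line_no}: missing {field}")
        for field in ("Index1Sequence", "Index2Sequence"):
            value = (row.get(field) or "").strip().upper()
            if value and set(value) - set("ACGT"):
                problems.append(f"line {line_no}: {field} is not a DNA sequence")
        well = (row.get("Well") or "").strip()
        if well and well in seen_wells:
            problems.append(f"line {line_no}: well {well} used twice")
        seen_wells.add(well)
    return problems
```

An empty list means none of these particular problems were found. It does not guarantee that BaseSpace will accept the file.
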
Save the csv file, navigate to the ‘Libraries’ tab in your BaseSpace account and then click the ‘Import’ button on the top-right corner. Choose your csv file, and after a minute you should see your libraries successfully imported with the correct index tags as below, ready to pool for a sequence run.
Now, if Illumina would just allow us to import pools of samples we could also avoid having to individually drag each sample into a small dot in the ‘Pools’ tab. This is rather tiresome when there are large numbers of samples in a pool!

How to make a BioLinux Live USB Stick – with persistent data storage

These are the steps I used to create a batch of bootable BioLinux Live USB sticks – with persistent data so that any data files created/downloaded would be preserved. This was used for a course so that each stick had the same NGS data and the same additional (non-BioLinux) programs pre-installed and already configured.

Step 1 – Download the BioLinux ISO file for use with DVD/USB media

The downloaded .iso file is an archive file that contains the whole BioLinux operating system – it can be used later to either install BioLinux onto a machine, or to create a bootable BioLinux USB Live disk. The bio-linux-8-latest.iso image is currently (March 2016) 3.58GB in size.

Step 2 – Download and install UNetbootin

UNetbootin allows you to create bootable Live USB drives for Ubuntu and other Linux distributions without burning a CD.

It is simple to install: on a Mac, you just move the downloaded unetbootin.app file into /Applications.

Step 3 – Create an initial BioLinux Live USB disk with persistent data

As the .iso file is 3.58GB in size, a USB stick of at least 4GB is needed, but that is a little too close for comfort, so it is best to go for a USB stick of at least 8GB; these days 8GB sticks are very cheap (£2.99) and are the same price as (if not cheaper than) 4GB sticks. To play safe, the USB stick should probably be in FAT32 format. FAT32 has a limitation of 4GB for file sizes, and this includes the overall casper-rw BioLinux file where all the persistent data is stored, so if you are going to be storing more than 4GB of data then you will probably need the NTFS file system on the USB stick.

Insert your blank USB key into your computer. Launch unetbootin. Select the “Diskimage” toggle button, select “ISO” from the drop down list, and then navigate to and select the BioLinux .iso file from your computer downloaded in Step 1. Next, in the field entitled “Space used to preserve files across reboots (Ubuntu only)” enter “3500” into the MB textfield (3.5 GB) – you could increase this above 4GB if you have a bigger USB stick and if it is using the NTFS file system. Next, select “USB Drive” from the “Type” drop down list, and then select your actual USB stick from the “Drive” drop down list and then click “OK” to create your bootable BioLinux Live USB stick with persistent data storage.
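
The size arithmetic behind the “3500” value can be written down as a quick check. This is a rough sketch with hypothetical function names. It assumes the UNetbootin field is in units of 1024 × 1024 bytes, and the second check ignores file system overhead and the difference between GB and GiB.

```python
# FAT32 cannot hold a single file of 4 GiB or more.
FAT32_MAX_FILE_BYTES = 4 * 1024**3 - 1


def persistence_fits_fat32(size_mb):
    """True if a casper-rw file of size_mb megabytes is below the FAT32 file size limit."""
    return size_mb * 1024**2 <= FAT32_MAX_FILE_BYTES


def fits_on_stick(iso_gb, persistence_mb, stick_gb):
    """Rough check that the ISO contents plus the persistence file fit on the stick."""
    return iso_gb * 1000 + persistence_mb <= stick_gb * 1000
```

With the numbers used here, persistence_fits_fat32(3500) is true while 4096 would not fit, and a 3.58GB image plus 3500MB of persistent storage fits on an 8GB stick but not on a 4GB one.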

Step 4 – Boot into your BioLinux

Next step is to boot into the BioLinux Live USB disk from a machine – this will need to be a Windows or Linux machine, a modern Mac is unlikely to boot up from it. Turn the computer off, insert the BioLinux Live USB stick into the computer, turn the computer back on, and get ready. As soon as the first screen appears – which normally has the computer manufacturer logo – it should say something like “Press F12 to Choose Boot Device” at the bottom of the screen – so press F12 quickly before the screen disappears. Sometimes it is not F12, sometimes it is F10 or F2 or another key, but it should say on the screen what button to press. This will launch the BIOS menu. Enter the “Boot Device Select” menu, and move your USB Stick up the boot order to the top, so that the computer will now boot from the USB stick before its own hard drive. Exit the BIOS menu, saving any changes, and the computer should now boot into the BioLinux Live USB stick.

Step 5 – Customise your BioLinux – add data and programs

Now you will be inside your own BioLinux OS on the USB stick. So install any extra programs you want, configure PATHs, and download any data files you want. The programs, configs and data will be saved onto the USB stick and preserved – due to the persistent data storage and the casper-rw file.

Now shut down BioLinux, remove the USB stick, and boot back into your normal operating system.

Step 6 – Make an image copy of your customised BioLinux disk

Once inside your normal operating system, insert the BioLinux USB stick back in. The next step will only work on a Mac or a Linux machine as it uses the dd command.

This copies the BioLinux Live USB stick (located at /dev/disk2 on my machine – on a mac run “diskutil list” to see where yours is) and it creates a single biolinux.img file in the Documents folder which contains the entire operating system along with all the extra data and programs I installed.

The original customised BioLinux Live USB stick can now be ejected and removed.

Step 7 – Copy Copy Copy

Insert a new blank USB stick into the computer (obviously it needs to be at least the same size as the original one). Now we want to copy the original BioLinux Live USB stick onto the new USB stick using the dd command:

The dd command copies the biolinux.img file that we created in Step 6, located in the Documents folder, onto the new blank USB stick (located at /dev/disk2 – check where yours is). On a Mac, I first had to go into Disk Utility and unmount the FAT32 partition of the USB stick before dd would work – not eject the USB stick itself, just unmount its FAT32 partition. The key thing here is that you can insert multiple blank USB sticks into all the available USB ports and run the dd command in parallel:
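The original commands are not shown here. As a rough sketch, assuming the image from Step 6 and one device name per stick, the parallel copy could be written like this. The write_image_parallel helper and the device names in the example are hypothetical.

```shell
#!/bin/sh
# Sketch only: write one image to several targets at the same time.
# write_image_parallel is a hypothetical helper; each dd runs in the
# background and "wait" blocks until all of them have finished.
write_image_parallel() {
    img=$1
    shift
    for dev in "$@"; do
        dd if="$img" of="$dev" bs=1048576 &
    done
    wait
}

# Example with two sticks (run from a root shell; device names are
# assumptions, so check yours first):
#   write_image_parallel "$HOME/Documents/biolinux.img" /dev/disk2 /dev/disk3
```

Everything on the target devices is overwritten, so it is worth listing the disks immediately before each batch.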

For an 8GB USB stick, this copying process took almost exactly 1 hour. Then you can eject the USB sticks and put new ones in and copy another batch.


Many thanks to Paul Capewell and Willie Weir for a tip on the dd command.

Submitting a job to run on another server and retrieving the results

Imagine having two different servers called darwin and linnaeus. Imagine that darwin is a great server with loads of RAM for doing de-novo assembly, and that linnaeus has loads of nodes, so it is a great server for splitting up jobs and running lots of them in parallel. To make good use of all these resources, it would make sense to do part of the processing on one server and then automatically send jobs to be processed on the other.

So this is how you do that. On linnaeus you run:

Copy the public key and, on darwin, add it to .ssh/authorized_keys2.

The reverse also needs to be done by putting a darwin key on linnaeus.
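The key exchange described above can be sketched as follows. This is an illustration under assumptions rather than the commands originally used: the helper names, the key file paths and the choice of an RSA key without a passphrase (so that scripts can log in unattended) are all hypothetical.

```shell
#!/bin/sh
# Sketch only; the helper names and file paths are hypothetical.

# Create an RSA key pair without a passphrase (an assumption, chosen so
# that scripts can log in unattended).
make_key() {
    ssh-keygen -t rsa -N "" -q -f "$1"
}

# Append a public key to an authorized-keys file, skipping it if the
# same key is already listed.
authorize_key() {
    pubkey_file=$1
    auth_file=$2
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    grep -qxF -- "$(cat "$pubkey_file")" "$auth_file" || cat "$pubkey_file" >> "$auth_file"
}

# On linnaeus:
#   make_key "$HOME/.ssh/id_rsa"
# On darwin, after copying id_rsa.pub across:
#   authorize_key id_rsa.pub "$HOME/.ssh/authorized_keys2"
```

The same two steps, with the host names swapped, cover the reverse direction.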

Now, to test it out, create the shell script that will be executed on linnaeus, e.g. linnaeusshell:

This small script will uncompress a file, return the uncompressed file and return a “Done” log to darwin once the script is finished.
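The script itself is not reproduced here. A minimal sketch of the behaviour just described, assuming the compressed file has already arrived on linnaeus and that the key exchange allows scp back to darwin, might be the following. The function name, argument order and log file name are assumptions.

```shell
#!/bin/sh
# Sketch of what a script like linnaeusshell might do; the function name,
# argument order and log file name are assumptions, not the author's script.
uncompress_and_return() {
    gz_file=$1      # compressed file already copied to linnaeus
    return_to=$2    # scp destination on darwin, e.g. darwin:results/
    gunzip -f "$gz_file"
    echo "Done" > "${gz_file%.gz}.log"
    # Send the uncompressed file and the log back to darwin.
    scp "${gz_file%.gz}" "${gz_file%.gz}.log" "$return_to"
}

# Example:
#   uncompress_and_return Pf3D7_01.embl.gz darwin:results/
```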

Now create a command shell on darwin, e.g. darwinshell:

And finally execute the darwinshell:

This will transfer the Pf3D7_01.embl.gz compressed file over to linnaeus where the file will be uncompressed and transferred back to darwin.
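The darwin side of this round trip can be sketched in the same spirit. The send_job helper and the remote script path are hypothetical names, and the sketch assumes the remote script already exists on linnaeus and takes the file name as its argument.

```shell
#!/bin/sh
# Sketch of what a script like darwinshell might do; send_job and the
# remote script path are hypothetical names.
send_job() {
    file=$1            # e.g. Pf3D7_01.embl.gz
    remote_host=$2     # e.g. linnaeus
    remote_script=$3   # e.g. ./linnaeusshell (must already exist remotely)
    # Copy the compressed file into the remote home directory...
    scp "$file" "$remote_host:"
    # ...then run the remote script on it over ssh.
    ssh "$remote_host" "$remote_script $(basename "$file")"
}

# Example:
#   send_job Pf3D7_01.embl.gz linnaeus ./linnaeusshell
```

Because the keys were exchanged in both directions, neither the scp nor the ssh call should prompt for a password.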

Big thanks to Sreenu who helped me a lot to sort this out.

Setting up an Amazon ftp server to receive big files

Sharing large files with collaborators has rarely been a problem: we usually just compress them, put them on our web server and send the link to our collaborator, who can then download the file.
However, we have struggled to find a way to receive large files. We usually run out of space in Dropbox or Google Drive. We have tried infinit.io, but this has failed on a few occasions, we think due to firewall issues either on our side or on the side of the collaborator. So when we managed to get an Amazon cloud account set up through Arcus Global (see my previous blog on how we got that organised), an obvious thing to try was to set up an FTP server to receive large files.

A bit of googling around turned up a very useful post on Stack Overflow.

First, we launched an instance through the Amazon web interface. We selected an Ubuntu instance from Amazon EC2 and specified 60 GB of storage. We generated a new key called “ftp” and saved the key locally. The saved key file had a .txt extension added, so we renamed it and changed its permissions.

Using the IP address of the instance, we then logged into it using ssh.
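The key preparation and login could be sketched as below. The prepare_key helper and the ftp.pem.txt file name are assumptions (the source only says the key was called “ftp” and had a .txt extension added), and the instance address is left as a placeholder.

```shell
#!/bin/sh
# Sketch only: prepare_key is a hypothetical helper and the file names
# are assumptions.
prepare_key() {
    downloaded=$1               # e.g. ftp.pem.txt
    key=${downloaded%.txt}      # strip the unwanted .txt extension
    mv "$downloaded" "$key"
    chmod 400 "$key"            # ssh refuses private keys that others can read
    echo "$key"
}

# Example (replace <instance-ip> with the address shown in the EC2 console):
#   key=$(prepare_key ftp.pem.txt)
#   ssh -i "$key" ubuntu@<instance-ip>
```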

We installed the ftp server

but we did not provide a password for ftp as we decided to use the ubuntu username and password for login.

Then we opened up the FTP ports on our EC2 instance as described on Stack Overflow.

We changed the ssh configuration as explained in the previous blog and changed the ubuntu user’s password:

Then we edited the FTP configuration file, /etc/vsftpd.conf.

The following lines were changed:

We also added the following, using the IP of our instance. If the instance is restarted, its IP will be different, so this setting will need to be updated.
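The lines that were added are not reproduced here. As an illustration only, passive-mode settings of the kind usually needed behind the EC2 firewall could be appended as follows. The add_passive_settings helper is hypothetical, and the port range is an assumption that has to match whatever ports were opened for the instance.

```shell
#!/bin/sh
# Sketch only: append passive-mode settings to a vsftpd configuration file.
# add_passive_settings is a hypothetical helper; the port range is an
# assumption and must match the ports opened for the EC2 instance.
add_passive_settings() {
    conf=$1        # e.g. /etc/vsftpd.conf
    public_ip=$2   # public IP of the instance
    cat >> "$conf" <<EOF
pasv_enable=YES
pasv_min_port=1024
pasv_max_port=1048
pasv_address=$public_ip
EOF
}

# Example (needs root to edit the real file):
#   add_passive_settings /etc/vsftpd.conf <instance-ip>
```

On Ubuntu the service can then be restarted with, for instance, `sudo service vsftpd restart`.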

Restart vsftpd

We had it up for about 24 hours and it cost approximately £0.56.