Category Archives: software

Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools

The ONT produces results from sequencing run in the FAST5 format which is a variant of HDF5.

“HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.” from HDF group

A number of tools have been developed for converting the fast5 files produced by ONT to the more commonly used FASTA/FASTQ file formats. This is my attempt at determining which one to use based on functionality and runtime. The tools that I have looked at so far are nanopolish, poretools, poreseq and poRe.

Nanopolish is developed by Jared Simpson in C++ with some python utility script. The extract command comes with useful help informations.

nanopolish extract --help
Usage: nanopolish extract [OPTIONS] <fast5|dir>...
Extract reads in fasta format

--help display this help and exit
--version display version
-v, --verbose display verbose output
-r, --recurse recurse into subdirectories
-q, --fastq extract fastq (default: fasta)
-t, --type=TYPE read type: template, complement, 2d, 2d-or-template, any
(default: 2d-or-template)
-o, --output=FILE write output to FILE (default: stdout)

Report bugs to https://github.com/jts/nanopolish/issues

poretools is a toolkit by Nick Loman and Aaron Quinlan written in python. The poretools fasta command has many options for filtering the sequences.

poretools fasta -h
usage: poretools fasta [-h] [-q] [--type STRING] [--start START_TIME]
[--end END_TIME] [--min-length MIN_LENGTH]
[--max-length MAX_LENGTH] [--high-quality]
[--normal-quality] [--group GROUP]
FILES [FILES ...]

positional arguments:
FILES The input FAST5 files.

optional arguments:
-h, --help show this help message and exit
-q, --quiet Do not output warnings to stderr
--type STRING Which type of FASTQ entries should be reported?
Def.=all
--start START_TIME Only report reads from after start timestamp
--end END_TIME Only report reads from before end timestamp
--min-length MIN_LENGTH
Minimum read length for FASTA entry to be reported.
--max-length MAX_LENGTH
Maximum read length for FASTA entry to be reported.
--high-quality Only report reads with more complement events than
template.
--normal-quality Only report reads with fewer complement events than
template.
--group GROUP Base calling group serial number to extract, default
000

PoreSeq has been developed by Tamas Szalay and is written in python. The poreseq extract also has a help argument.

poreseq extract -h
usage: poreseq extract [-h] [-p] dirs [dirs ...] fasta

positional arguments:
dirs fast5 directories
fasta output fasta

optional arguments:
-h, --help show this help message and exit
-p, --path use rel. path as fasta header (instead of just filename)

poRe is a library for R written by Mick Watson. poRe has a very basic script extract2D which you can run from the command line. Unfortunately, I could not get it to print out the converted files and there were no error messages. I did also try to use it in the R client but without luck.

The following table compares the runtime for each program using the perf stat command with 10 replicate for extracting 4000 .fast5 to fasta files (perf stat -r 10 -d). The time represents an average over the 10 replicates. All tools compared produced the identical sequences as an output but the headers and thus file sizes are different. The differences are illustrated in the table.

CommandTime (sec)MBFASTAFASTQ
nanopolish extract --type 2d /home3/ont/toledo_fc1/pass/batch_1489737203645/ -o batch_1489737203645.fa10.885.4Defaultnanopolish extract -q
poretools fasta --type 2D /home3/ont/toledo_fc1/pass/batch_1489737203645/ > batch_1489737203645_poretools.fa4.593.2poretools fastaporetools fastq
poreseq extract /home3/ont/toledo_fc1/pass/batch_1489737203645/ batch_1489737203645.fa0.292.5poreseq extractN/A

So although the sequences extracted were identical for all three tools, there is quite a difference in the speed and the size of the the files. The size of the file is obviously related to the identifier/description line where nanopolish has the longest identifier with the addition of “:2D_000:2d”.

The output for nanopolish contains 3 parts (an identifier, the name of the run, the complete path to the fast5 file):
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d:2D_000:2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

The output for porteools contains three parts:
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

For poreseq, the output identifier is the name of the fast5 file:
>vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

Whilst it would be great to standardise the identifier and description information, some extraction tool have downstream tools which expect the identifier and description to be in a certain format. For example, I was unable to run “nanopolish variants --consensus” on the file I had extracted using poretools. I haven’t yet looked into detail why that is.

That’s it.

Update:
Other tools to checkout: Fast5-to-Fastq by Ryan Wick‏ https://github.com/rrwick/Fast5-to-Fastq

How to generate a Sample Sheet from sample/index data in BaseSpace

If you are using BaseSpace for sample entry but demultiplexing your data manually, you may have been frustrated that there is no facility to download your sample names and index tag data from BaseSpace as a sample sheet. This means you have to enter the same data twice – with the possibility of errors creeping in especially for large projects with many samples and dual index tags.

We have found a way to avoid typing the same information twice and instead fetch the sample names, index ID’s and index tag sequences from BaseSpace straight to a sample sheet. This saves a huge amount of time for large projects with many samples.

Log in to BaseSpace, and navigate to the ‘Libraries’ page within the ‘Prep Libraries’ tab. Each line is a set of libraries with complete information on index names and tag sequences. Clicking a set of libraries will bring up the following screen – this example has 24 samples with TruSeqLT tags (only 7 are visible without scrolling down the list).

libraries_for_export

Clicking the ‘EXPORT’ button will download a comma separated file (csv) that can be opened in Excel. This file has all the sample names, index ID’s and index sequences (but not in quite the correct format to paste into a sample sheet).

excel_for_export

Open the file in Excel, select the entire Index1 Column and click the ‘Text to Columns’ function (under the ‘Data’ menu in Excel). Choose the ‘Delimited’ option, then tick ‘Other’ and enter a hyphen (-) in the box. This will split the Index1 Column into two, with the name of the Index and the actual Tag sequence in two separate columns, as below.

excel_for_indexes

If using dual indexing (e.g. TruSeqHT or NexteraXT) then do the same for the second column with Index 2 to split the index2 names and sequences into two separate columns.

Now open a blank or used sample sheet that is set up for the correct library chemistry and sequencing instrument (see previous blog post) then copy and paste the sample ID’s, Index ID’s and Index sequences into the sample sheet. Save as a comma separated file (csv) and its ready to use for demultiplexing and fastq generation, or your next MiSeq run. The above example looks like this…

sample_sheet

Recombination detection programs

A critical step before phylogenetic analysis and molecular selection analysis is to detect recombination and either remove recombinant sequences or partition the alignment into different spans that a recombination-free. The problem is that there are so many different recombination detection programs available. These have been nicely review by Posada (2002) and some of the programs have been benchmarked by Kosakovsky Pond and Frost (2005).

I decided to compile all the programs that are available but there are so many that I am sure I have missed some. If you know of any others, please leave a comment below.

I have split the programs into the same four different categories as Posada (2002) but some programs implement multiple methods so it gets a bit tricky. The size of the font relates to the number of citations each program/publication has received in Google Scholar. As you can see, there are some clear favourites.

You can click on the names of the programs and this should either take you to the publication abstract on Pubmed or to the website for the software.

alt-text