Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools

Post by: Joseph Hughes
March 22, 2017
3 Comments

The ONT produces results from sequencing run in the FAST5 format which is a variant of HDF5.

“HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.” from HDF group

A number of tools have been developed for converting the fast5 files produced by ONT to the more commonly used FASTA/FASTQ file formats. This is my attempt at determining which one to use based on functionality and runtime. The tools that I have looked at so far are nanopolish, poretools, poreseq and poRe.

Nanopolish is developed by Jared Simpson in C++ with some python utility script. The extract command comes with useful help informations.

nanopolish extract --help Usage: nanopolish extract [OPTIONS] <fast5|dir>... Extract reads in fasta format


--help display this help and exit

--version display version

-v, --verbose display verbose output

-r, --recurse recurse into subdirectories

-q, --fastq extract fastq (default: fasta)

-t, --type=TYPE read type: template, complement, 2d, 2d-or-template, any

(default: 2d-or-template)

-o, --output=FILE write output to FILE (default: stdout)

Report bugs to https://github.com/jts/nanopolish/issues
poretools is a toolkit by Nick Loman and Aaron Quinlan written in python. The poretools fasta command has many options for filtering the sequences.
poretools fasta -h usage: poretools fasta [-h] [-q] [--type STRING] [--start START_TIME] [--end END_TIME] [--min-length MIN_LENGTH] [--max-length MAX_LENGTH] [--high-quality] [--normal-quality] [--group GROUP] FILES [FILES ...]


positional arguments:

FILES The input FAST5 files.

optional arguments: -h, --help show this help message and exit -q, --quiet Do not output warnings to stderr --type STRING Which type of FASTQ entries should be reported? Def.=all --start START_TIME Only report reads from after start timestamp --end END_TIME Only report reads from before end timestamp --min-length MIN_LENGTH Minimum read length for FASTA entry to be reported. --max-length MAX_LENGTH Maximum read length for FASTA entry to be reported. --high-quality Only report reads with more complement events than template. --normal-quality Only report reads with fewer complement events than template. --group GROUP Base calling group serial number to extract, default 000
PoreSeq has been developed by Tamas Szalay and is written in python. The poreseq extract also has a help argument.
poreseq extract -h usage: poreseq extract [-h] [-p] dirs [dirs ...] fasta


positional arguments:

dirs fast5 directories

fasta output fasta

optional arguments: -h, --help show this help message and exit -p, --path use rel. path as fasta header (instead of just filename)
poRe is a library for R written by Mick Watson. poRe has a very basic script extract2D which you can run from the command line. Unfortunately, I could not get it to print out the converted files and there were no error messages. I did also try to use it in the R client but without luck.

The following table compares the runtime for each program using the perf stat command with 10 replicate for extracting 4000 .fast5 to fasta files (perf stat -r 10 -d). The time represents an average over the 10 replicates. All tools compared produced the identical sequences as an output but the headers and thus file sizes are different. The differences are illustrated in the table.

Command	Time (sec)	MB	FASTA	FASTQ
nanopolish extract --type 2d /home3/ont/toledo_fc1/pass/batch_1489737203645/ -o batch_1489737203645.fa	10.88	5.4	Default	nanopolish extract -q
poretools fasta --type 2D /home3/ont/toledo_fc1/pass/batch_1489737203645/ > batch_1489737203645_poretools.fa	4.59	3.2	poretools fasta	poretools fastq
poreseq extract /home3/ont/toledo_fc1/pass/batch_1489737203645/ batch_1489737203645.fa	0.29	2.5	poreseq extract	N/A

So although the sequences extracted were identical for all three tools, there is quite a difference in the speed and the size of the the files. The size of the file is obviously related to the identifier/description line where nanopolish has the longest identifier with the addition of “:2D_000:2d”.

The output for nanopolish contains 3 parts (an identifier, the name of the run, the complete path to the fast5 file):
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d:2D_000:2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

The output for porteools contains three parts:
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

For poreseq, the output identifier is the name of the fast5 file:
>vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

Whilst it would be great to standardise the identifier and description information, some extraction tool have downstream tools which expect the identifier and description to be in a certain format. For example, I was unable to run “nanopolish variants --consensus” on the file I had extracted using poretools. I haven’t yet looked into detail why that is.

That’s it.

Update:
Other tools to checkout: Fast5-to-Fastq by Ryan Wick‏ https://github.com/rrwick/Fast5-to-Fastq

Categories: ONT, software

Tagged: FAST5, FASTA, ONT

3 Comments on “Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools”

jts says:

March 23, 2017 at 11:22 pm

Hi,

fast5 files can contain more than one set of basecalled reads (for example if basecalling was performed twice). nanopolish extract adds the extra fields to the header to make it unambiguous which dataset a read was extracted from. We need to make a few hdf5 queries to support this feature, which might explain the longer runtime.

Jared

1. bljog says:
  
  March 30, 2017 at 3:12 pm
  
  Hi Jared, Thanks for the info. Even if “nanopolish extract” is slower, it is my preferred tool.
  
Roberto Gajardo Alister says:

July 5, 2017 at 12:45 am

Hi,

In fact, when I’ve used poRe there were several errors (for example, in get_fasta() function, the arguments failed randomly when change options 1D or 2D ). I came here since I’m strongly interested in buy a minION for bacterial genomics and to make identification of plasmids sequence. This tool is helping me, thanks!

Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools

3 Comments on “Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools”

Leave a Reply Cancel reply