Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools
The ONT produces results from sequencing run in the FAST5 format which is a variant of HDF5.
“HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.” from HDF group
A number of tools have been developed for converting the fast5 files produced by ONT to the more commonly used FASTA/FASTQ file formats. This is my attempt at determining which one to use based on functionality and runtime. The tools that I have looked at so far are nanopolish, poretools, poreseq and poRe.
Nanopolish is developed by Jared Simpson in C++ with some python utility script. The extract command comes with useful help informations.
nanopolish extract --help
Usage: nanopolish extract [OPTIONS] <fast5|dir>...
Extract reads in fasta format
--help display this help and exit
--version display version
-v, --verbose display verbose output
-r, --recurse recurse into subdirectories
-q, --fastq extract fastq (default: fasta)
-t, --type=TYPE read type: template, complement, 2d, 2d-or-template, any
-o, --output=FILE write output to FILE (default: stdout)
Report bugs to https://github.com/jts/nanopolish/issues
poretools is a toolkit by Nick Loman and Aaron Quinlan written in python. The
poretools fasta command has many options for filtering the sequences.
poretools fasta -h
usage: poretools fasta [-h] [-q] [--type STRING] [--start START_TIME]
[--end END_TIME] [--min-length MIN_LENGTH]
[--max-length MAX_LENGTH] [--high-quality]
[--normal-quality] [--group GROUP]
FILES [FILES ...]
FILES The input FAST5 files.
-h, --help show this help message and exit
-q, --quiet Do not output warnings to stderr
--type STRING Which type of FASTQ entries should be reported?
--start START_TIME Only report reads from after start timestamp
--end END_TIME Only report reads from before end timestamp
Minimum read length for FASTA entry to be reported.
Maximum read length for FASTA entry to be reported.
--high-quality Only report reads with more complement events than
--normal-quality Only report reads with fewer complement events than
--group GROUP Base calling group serial number to extract, default
PoreSeq has been developed by Tamas Szalay and is written in python. The poreseq extract also has a help argument.
poreseq extract -h
usage: poreseq extract [-h] [-p] dirs [dirs ...] fasta
dirs fast5 directories
fasta output fasta
-h, --help show this help message and exit
-p, --path use rel. path as fasta header (instead of just filename)
poRe is a library for R written by Mick Watson. poRe has a very basic script extract2D which you can run from the command line. Unfortunately, I could not get it to print out the converted files and there were no error messages. I did also try to use it in the R client but without luck.
The following table compares the runtime for each program using the
perf stat command with 10 replicate for extracting 4000 .fast5 to fasta files (
perf stat -r 10 -d). The time represents an average over the 10 replicates. All tools compared produced the identical sequences as an output but the headers and thus file sizes are different. The differences are illustrated in the table.
|nanopolish extract --type 2d /home3/ont/toledo_fc1/pass/batch_1489737203645/ -o batch_1489737203645.fa
|nanopolish extract -q
|poretools fasta --type 2D /home3/ont/toledo_fc1/pass/batch_1489737203645/ > batch_1489737203645_poretools.fa
|poreseq extract /home3/ont/toledo_fc1/pass/batch_1489737203645/ batch_1489737203645.fa
So although the sequences extracted were identical for all three tools, there is quite a difference in the speed and the size of the the files. The size of the file is obviously related to the identifier/description line where nanopolish has the longest identifier with the addition of “:2D_000:2d”.
The output for nanopolish contains 3 parts (an identifier, the name of the run, the complete path to the fast5 file):
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d:2D_000:2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5
The output for porteools contains three parts:
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5
For poreseq, the output identifier is the name of the fast5 file:
Whilst it would be great to standardise the identifier and description information, some extraction tool have downstream tools which expect the identifier and description to be in a certain format. For example, I was unable to run “
nanopolish variants --consensus” on the file I had extracted using
poretools. I haven’t yet looked into detail why that is.
Other tools to checkout: Fast5-to-Fastq by Ryan Wick https://github.com/rrwick/Fast5-to-Fastq