Tag Archives: FAST5

Exploring the FAST5 format

FAST5 format from Oxford Nanopore (ONT) is in fact HDF5, which is a very flexible data model, library, and file format for storing and managing data. It is able to store an unlimited variety of datatypes.

A number of tools have been developed for handling HDF5 available from here.  The most useful are:

  • hdfview, a java visual tool for viewing HDF5 files with some limited functionality of plotting data and the option of exporting subsets in HDF5 (extension .h5)
  • h5ls, for listing specified entries of the HDF5 file
  • h5dump, to examine the HDF5 file and export specified groups or datasets in ASCII.

Here’s a run through exploring the lambda phage control run. First off, looking at the FAST5 file produced by the MinION.

At this stage, the FAST5 file only has one dataset which is the “Signal” dataset.

The same thing, on a FAST5 file, which has been processed by Metrichor, now has a lot more associated information, notably Fastq, Events, various Log files for the different analyses and still contains the raw Signal dataset.

 

To list all groups recursively using h5ls use -r:

Similar information can be obtained using h5dump -n:

To get all data and metadata for a given group /Raw/Reads/Read_939:

Or, the following is similar without the group tags. The -d option is used for printing a specified dataset.

Removing the array indices using option -y:

Saving the raw Signal dataset to file “test”:

The same as the above but specifying that the column width of the dataset is 1 with the option -w 1:

Dumping the whole FAST5 into XML format:

O.K., that it for now.

Extraction of FASTA sequences from Oxford Nanopore fast5 files – a comparison of tools

The ONT produces results from sequencing run in the FAST5 format which is a variant of HDF5.

“HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.” from HDF group

A number of tools have been developed for converting the fast5 files produced by ONT to the more commonly used FASTA/FASTQ file formats. This is my attempt at determining which one to use based on functionality and runtime. The tools that I have looked at so far are nanopolish, poretools, poreseq and poRe.

Nanopolish is developed by Jared Simpson in C++ with some python utility script. The extract command comes with useful help informations.

nanopolish extract --help
Usage: nanopolish extract [OPTIONS] <fast5|dir>...
Extract reads in fasta format

--help display this help and exit
--version display version
-v, --verbose display verbose output
-r, --recurse recurse into subdirectories
-q, --fastq extract fastq (default: fasta)
-t, --type=TYPE read type: template, complement, 2d, 2d-or-template, any
(default: 2d-or-template)
-o, --output=FILE write output to FILE (default: stdout)

Report bugs to https://github.com/jts/nanopolish/issues

poretools is a toolkit by Nick Loman and Aaron Quinlan written in python. The poretools fasta command has many options for filtering the sequences.

poretools fasta -h
usage: poretools fasta [-h] [-q] [--type STRING] [--start START_TIME]
[--end END_TIME] [--min-length MIN_LENGTH]
[--max-length MAX_LENGTH] [--high-quality]
[--normal-quality] [--group GROUP]
FILES [FILES ...]

positional arguments:
FILES The input FAST5 files.

optional arguments:
-h, --help show this help message and exit
-q, --quiet Do not output warnings to stderr
--type STRING Which type of FASTQ entries should be reported?
Def.=all
--start START_TIME Only report reads from after start timestamp
--end END_TIME Only report reads from before end timestamp
--min-length MIN_LENGTH
Minimum read length for FASTA entry to be reported.
--max-length MAX_LENGTH
Maximum read length for FASTA entry to be reported.
--high-quality Only report reads with more complement events than
template.
--normal-quality Only report reads with fewer complement events than
template.
--group GROUP Base calling group serial number to extract, default
000

PoreSeq has been developed by Tamas Szalay and is written in python. The poreseq extract also has a help argument.

poreseq extract -h
usage: poreseq extract [-h] [-p] dirs [dirs ...] fasta

positional arguments:
dirs fast5 directories
fasta output fasta

optional arguments:
-h, --help show this help message and exit
-p, --path use rel. path as fasta header (instead of just filename)

poRe is a library for R written by Mick Watson. poRe has a very basic script extract2D which you can run from the command line. Unfortunately, I could not get it to print out the converted files and there were no error messages. I did also try to use it in the R client but without luck.

The following table compares the runtime for each program using the perf stat command with 10 replicate for extracting 4000 .fast5 to fasta files (perf stat -r 10 -d). The time represents an average over the 10 replicates. All tools compared produced the identical sequences as an output but the headers and thus file sizes are different. The differences are illustrated in the table.

CommandTime (sec)MBFASTAFASTQ
nanopolish extract --type 2d /home3/ont/toledo_fc1/pass/batch_1489737203645/ -o batch_1489737203645.fa10.885.4Defaultnanopolish extract -q
poretools fasta --type 2D /home3/ont/toledo_fc1/pass/batch_1489737203645/ > batch_1489737203645_poretools.fa4.593.2poretools fastaporetools fastq
poreseq extract /home3/ont/toledo_fc1/pass/batch_1489737203645/ batch_1489737203645.fa0.292.5poreseq extractN/A

So although the sequences extracted were identical for all three tools, there is quite a difference in the speed and the size of the the files. The size of the file is obviously related to the identifier/description line where nanopolish has the longest identifier with the addition of “:2D_000:2d”.

The output for nanopolish contains 3 parts (an identifier, the name of the run, the complete path to the fast5 file):
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d:2D_000:2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

The output for porteools contains three parts:
>c8ea266c-b9ab-4f87-9c54-810a75a53fdf_Basecall_2D_2d vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand /home3/ont/toledo_fc1/pass/batch_1489737203645/vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

For poreseq, the output identifier is the name of the fast5 file:
>vgb_20170316_FNFAB46402_MN19940_sequencing_run_hcmvrun2_20232_ch436_read24793_strand.fast5

Whilst it would be great to standardise the identifier and description information, some extraction tool have downstream tools which expect the identifier and description to be in a certain format. For example, I was unable to run “nanopolish variants --consensus” on the file I had extracted using poretools. I haven’t yet looked into detail why that is.

That’s it.

Update:
Other tools to checkout: Fast5-to-Fastq by Ryan Wick‏ https://github.com/rrwick/Fast5-to-Fastq