Exploring the FAST5 format
- Post by: Joseph Hughes
- July 19, 2017
The FAST5 format from Oxford Nanopore Technologies (ONT) is in fact HDF5, a very flexible data model, library, and file format for storing and managing data. It can store an unlimited variety of datatypes.
A number of tools for handling HDF5 files are available from the HDF Group. The most useful are:
- hdfview, a Java visual tool for viewing HDF5 files, with some limited functionality for plotting data and the option of exporting subsets in HDF5 (extension .h5)
- h5ls, for listing specified entries of the HDF5 file
- h5dump, to examine the HDF5 file and export specified groups or datasets in ASCII.
Here’s a run-through of exploring the lambda phage control run. First off, let’s look at the FAST5 file produced by the MinION.
hdfview /home3/ont/lambda_fc1/uploaded/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read984_strand.fast5
At this stage, the FAST5 file has only one dataset, the “Signal” dataset.
Doing the same on a FAST5 file that has been processed by Metrichor shows a lot more associated information, notably Fastq, Events and various Log files for the different analyses. It still contains the raw Signal dataset.
hdfview /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5 &
To list all groups recursively with h5ls, use -r:
h5ls -r /home3/ont/lambda_fc1/uploaded/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read984_strand.fast5
Similar information can be obtained using h5dump -n:
h5dump -n /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5
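The recursive listing mixes groups and datasets, and on a basecalled file it gets long. Here is a minimal sketch of a filter that keeps only the dataset paths. The function name datasets_only is made up for illustration, and it assumes each h5ls -r line has the form "path type ..." with no spaces in the path.

```shell
# Hypothetical helper: keep only the dataset paths from an `h5ls -r` listing.
# Assumption: each input line looks like "<path> <type> ...", where <type>
# is "Group" or "Dataset" and <path> contains no spaces.
datasets_only() {
    # Print the first field of every line whose second field is "Dataset".
    awk '$2 == "Dataset" { print $1 }'
}

# Usage (needs h5ls and a real FAST5 file):
#   h5ls -r file.fast5 | datasets_only
```

Reading from standard input keeps the helper independent of h5ls itself, so any saved listing can be piped through it as well.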
To get all data and metadata for a given group /Raw/Reads/Read_939:
h5dump -g /Raw/Reads/Read_939 /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5
Or, to get something similar without the group tags, use the -d option, which prints a specified dataset:
h5dump -d /Raw/Reads/Read_939/Signal /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5
Removing the array indices using option -y:
h5dump -y -d /Raw/Reads/Read_939/Signal /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5
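In the example above, the read number in the file name (read939) matches the group name inside the file (Read_939). If that holds for your files, the dataset path can be derived from the file name rather than typed by hand. This is only a sketch: signal_path is a hypothetical helper, and it assumes the file name ends in _ch<channel>_read<number>_strand.fast5 as in the examples here.

```shell
# Hypothetical helper: derive the Signal dataset path from a FAST5 file name.
# Assumptions: the name ends in _ch<channel>_read<number>_strand.fast5, and
# the read group inside the file carries the same number as the file name.
signal_path() {
    name=${1##*/}               # drop any leading directories
    read_no=${name##*_read}     # keep what follows the last "_read"
    read_no=${read_no%%_*}      # cut at the next underscore
    case $read_no in
        ''|*[!0-9]*)
            # No purely numeric read number was found in the name.
            echo "cannot find read number in $name" >&2
            return 1
            ;;
    esac
    printf '/Raw/Reads/Read_%s/Signal\n' "$read_no"
}

# Usage (needs h5dump and a real FAST5 file in $f):
#   h5dump -y -d "$(signal_path "$f")" "$f"
```

It is worth checking the path against the h5ls -r listing for one file before relying on it in a loop.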
Saving the raw Signal dataset to file “test”:
h5dump -o test -y -d /Raw/Reads/Read_939/Signal /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5
The same as above, but with the option -w 1, which limits the output width to one column so that each value ends up on its own line:
h5dump -w 1 -o test -y -d /Raw/Reads/Read_939/Signal /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5
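With one value per line, the dump is easy to feed into standard text tools. As a rough sketch, the following summarises such a file with awk. The name signal_stats is hypothetical, and it assumes the file holds integer values, one per line, possibly with leading spaces and a trailing comma.

```shell
# Hypothetical helper: summarise a one-value-per-line signal dump.
# Assumptions: integer values, one per line, optionally padded with
# spaces and followed by a comma; blank lines are ignored.
signal_stats() {
    awk '
        { gsub(/[ \t,]/, "") }            # strip whitespace and commas
        $0 == "" { next }                 # skip lines left empty
        n == 0 { min = max = $0 + 0 }     # first value seeds min and max
        {
            v = $0 + 0
            if (v < min) min = v
            if (v > max) max = v
            sum += v
            n++
        }
        END {
            if (n) printf "n=%d min=%d max=%d mean=%.2f\n", n, min, max, sum / n
        }
    ' "$1"
}

# Usage, on the file written by the command above:
#   signal_stats test
```

The count is a quick sanity check against the dataset size reported by h5ls for the Signal dataset.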
Dumping the whole FAST5 into XML format:
h5dump --xml /home3/ont/Toledo_DeltaMerlin/pass/vgb_20170201_FNFAB45374_MN19940_sequencing_run_Toledo_DeltaMerlin_010217_3_98936_ch99_read985_strand.fast5
O.K., that’s it for now.