Exploring the FAST5 format

The FAST5 format from Oxford Nanopore Technologies (ONT) is in fact HDF5, a very flexible data model, library, and file format for storing and managing data. It can store an unlimited variety of datatypes.

A number of tools have been developed for handling HDF5 files, available from The HDF Group. The most useful are:

  • hdfview, a Java visual tool for viewing HDF5 files, with limited functionality for plotting data and the option of exporting subsets in HDF5 (extension .h5)
  • h5ls, for listing specified entries of the HDF5 file
  • h5dump, to examine the HDF5 file and export specified groups or datasets in ASCII.

Here’s a run-through exploring the lambda phage control run. First off, let’s look at a FAST5 file produced by the MinION:

hdfview /home3/ont/lambda_fc1/uploaded/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read984_strand.fast5

At this stage, the FAST5 file has only one dataset, the “Signal” dataset.

The same view of a FAST5 file that has been processed by Metrichor shows a lot more associated information, notably Fastq, Events, and various Log files for the different analyses. It still contains the raw Signal dataset.

hdfview /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5 &

 

To list all groups recursively with h5ls, use -r:

h5ls -r /home3/ont/lambda_fc1/uploaded/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read984_strand.fast5
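The recursive listing mixes groups and datasets. As a minimal sketch, the output can be filtered down to dataset paths only. The function name list_datasets is made up for illustration, and the filter assumes h5ls -r prints one object per line as "path type ..." with no spaces in the path, which holds for the files shown here but is not guaranteed in general.

```shell
# Print only the dataset paths from `h5ls -r` output read on stdin.
# Assumption: each line looks like "<path> <type> ...", where the second
# whitespace-separated field is "Group" or "Dataset".
list_datasets() {
    awk '$2 == "Dataset" { print $1 }'
}
```

Usage would be along the lines of: h5ls -r some_read.fast5 | list_datasets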

Similar information can be obtained using h5dump -n:

h5dump -n /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5

To get all data and metadata for a given group /Raw/Reads/Read_939:

h5dump -g /Raw/Reads/Read_939 /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5

Or, the following is similar without the group tags. The -d option prints a specified dataset, here the raw Signal:

h5dump -d /Raw/Reads/Read_939/Signal /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5

Removing the array indices using option -y:

h5dump -y -d /Raw/Reads/Read_939/Signal /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5

Saving the raw Signal dataset to file “test”:

h5dump -o test -y -d /Raw/Reads/Read_939/Signal /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5
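The command above hard-codes the read number in the dataset path. A rough sketch of how the same export could be repeated over a whole directory follows. It assumes every file name contains "_read<N>_" and that the matching group is /Raw/Reads/Read_<N>, as in the files used here. The function names and the .signal.txt suffix are invented for this example.

```shell
# Derive the raw Signal dataset path from a FAST5 file name.
# Assumption: the name contains "_read<N>_" and the group is
# /Raw/Reads/Read_<N>; names that do not match return a non-zero status.
signal_dataset_path() {
    read_no=$(basename "$1" | sed -n 's/.*_read\([0-9][0-9]*\)_.*/\1/p')
    [ -n "$read_no" ] || return 1
    printf '/Raw/Reads/Read_%s/Signal\n' "$read_no"
}

# Dump the raw signal of every .fast5 file in directory $1 into
# <name>.signal.txt in the current directory.
dump_signals() {
    for f in "$1"/*.fast5; do
        [ -e "$f" ] || continue   # directory had no .fast5 files
        ds=$(signal_dataset_path "$f") || { echo "skipping $f" >&2; continue; }
        # -o writes the data to the file; the header still goes to stdout
        h5dump -y -o "$(basename "$f" .fast5).signal.txt" -d "$ds" "$f" > /dev/null
    done
}
```

Calling dump_signals /home3/ont/lambda_fc1/downloads/pass would then write one text file per read.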

The same as above, but setting the output width to 1 column with the option -w 1, so that each value is written on its own line:

h5dump -w 1 -o test -y -d /Raw/Reads/Read_939/Signal /home3/ont/lambda_fc1/downloads/pass/vgb_20170110_FNFAB46402_MN19940_sequencing_run_lambdacontrol_10012017_23602_ch9_read939_strand.fast5
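One value per line makes the file easy to feed into standard UNIX tools. As an illustration, here is a small awk sketch that summarises the exported signal. The function name signal_stats is hypothetical. The parsing strips spaces and commas and ignores any line that is not a plain integer, because the exact layout of the output file may differ between h5dump versions.

```shell
# Summarise a file of raw signal values, one per line, such as the
# "test" file written by `h5dump -w 1 -o test -y -d ...`.
# Assumption: values are integers, possibly padded with spaces and
# followed by a comma; anything else is skipped.
signal_stats() {
    awk '
        { gsub(/[ \t,]/, "") }            # drop padding and separators
        $0 !~ /^-?[0-9]+$/ { next }       # skip blank or non-numeric lines
        {
            v = $0 + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v
            n++
        }
        END {
            if (n > 0)
                printf "n=%d min=%d max=%d mean=%.2f\n", n, min, max, sum / n
        }
    ' "$1"
}
```

Running signal_stats test prints the number of samples with their minimum, maximum and mean.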

Dumping the whole FAST5 into XML format:

h5dump --xml /home3/ont/Toledo_DeltaMerlin/pass/vgb_20170201_FNFAB45374_MN19940_sequencing_run_Toledo_DeltaMerlin_010217_3_98936_ch99_read985_strand.fast5

OK, that’s it for now.
