How to demultiplex Illumina data and generate fastq files using bcl2fastq
- Post by: Gavin Wilkie
- April 25, 2016
- 4 Comments
Sequence runs on NGS instruments are typically carried out with multiple samples pooled together. An index tag (also called a barcode) consisting of a unique sequence of between 6 and 12bp is added to each sample so that the sequence reads from different samples can be identified.
On the Illumina MiSeq, the process of demultiplexing (dividing your sequence reads into separate files for each index tag/sample) and generating the fastq data files required for downstream analysis is carried out automatically using the onboard PC. However, on the higher-throughput NextSeq500 and HiSeq models this process is carried out on BaseSpace – Illumina’s cloud-based resource.
Whilst there are many advantages to having your sequence data in the cloud (e.g. monitoring a sequence run from home, ease of sharing data with collaborators, etc) there are also some drawbacks to this system. In particular the process of demultiplexing and fastq file generation in BaseSpace can be very slow. It takes up to 8 hours to demultiplex the data from a high output NextSeq500 run on BaseSpace, and if the fastq files then have to be downloaded to your local computer or server for analysis this requires a further 3 hours.
If your data is urgent you may not want to wait 11 hours or more after your sequence run has finished to begin your analysis! We have found that demultiplexing and fastq file generation from a high output NextSeq500 run can instead be carried out in about 30 minutes on our in-house UNIX server. This also has the advantage of avoiding the rather slow step of downloading your fastq files from BaseSpace.
In order to do this, you need to install a free piece of software from Illumina called bcl2fastq on your UNIX server. Demultiplexing NextSeq500 data (or any Illumina system running RTA version 1.18.54 and later) requires bcl2fastq version 2.16 or newer (the latest version at the time of writing is v2.17 and can be downloaded here.
Importantly, we have checked that the results obtained from bcl2fastq and BaseSpace are equivalent – the fastq files generated are exactly the same. BaseSpace is set to remove adapter sequences by default, meaning that the sequence reads may not all be the same length (any reads from short fragments with adapter read-through will have those sequences removed). In bcl2fastq you have the option to either remove adapter sequences or leave them in so that all reads are the same length.
In order to demultiplex the data, first copy the entire run folder from the sequencer to your UNIX server. On the NextSeq500, the run folder will be inside the following directory on the hard disc –
D:\Illumina\NextSeq Control Software\Temp\
It ought to be the ONLY folder here as the NextSeq only retains data from the most recent run – as soon as you start a new sequence run the data from the previous run is deleted. Copy the entire folder, including all its subdirectories. This folder contains the raw basecall (bcl) files. Do not change the name of the folder, which will be named as per the following convention – YYMMDD_InstrumentID_RunID_FlowcellID
For example, the 10th run carried out on a NextSeq500 with serial number 500999, on 14th April 2016 and using flowcell number AHLFNLBGXX would be named as follows –
160414_NB500999_0010_AHLFNLBGXX
The other requirement is a sample sheet – a simple comma separated file (csv) with the library chemistry, sample names and the index tag used for each sample, in addition to some other metrics describing the run. Anyone running a MiSeq will already be familiar with these, but NextSeq and HiSeq users may only have used BaseSpace to enter these values. Unfortunately there is no way to automatically download a sample sheet from BaseSpace (although we have figured out a way round this to avoid double data entry, see the next blog post). Sample sheets can be made and modified using MS Excel or any other software that can read csv files, but the easiest way to make one is to use a free wizard-type program for the PC called Illumina Experiment Manager, which guides you through the process. The latest version at the time of writing is v1.9, which is available here.
Open Illumina Experiment Manager, and click on ‘Create Sample Sheet.’ Then, make certain that you choose the correct sequencer (essential since the NextSeq and MiSeq use opposite reverse complements during index reads). Select ‘Fastq only’ output. Enter any value (numbers or text) for the Reagent Kit Barcode – this will become the filename. Ensure correct library chemistry is selected (e.g. TruSeqLT, TruSeqHT, NexteraXT, etc). If there are custom/non-standard tags these will need to be manually entered in the csv file. Tick adapter trimming for read1 and read2 if required, select either paired or single end reads and enter the read length as appropriate (add one base, so for 150bp reads enter 151). Then either follow the instructions in the next blog post to import sample names and tags from BaseSpace, or enter them manually by adding a blank row for each sample, entering the sample names and selecting the index tag(s) for each sample. It is wise to double check that the sample names and indexes are correct, as mistakes will cause data to be allocated to the wrong file. Change the name of the file to ‘SampleSheet.csv’ and copy it into the top directory inside the sequence run folder on the server. The sample sheet file should resemble the example below – this is for a paired end 2x151bp NextSeq run with four samples, TruSeqLT index tags, and adapter trimming selected.
Now use the command line below on the server to run bcl2fastq. For speed, we use 12 threads for processing the data on our UNIX server (-p 12), however the optimal number will depend on your system architecture, resources and usage limits. It is important to set a limit to the number of threads, otherwise bcl2fastq will use 100% of the CPU’s on the server. We usually invoke the no-lane-splitting option, otherwise each output file from our NextSeq is divided into four (one for each lane on the flowcell). Here we are using the NextSeq run folder mentioned above as an example (160414_NB500999_0010_AHLFNLBGXX) and sending the output to a subdirectory within it called ‘fastq_files.’ For other bcl2fastq options please see Illumina’s manual on the software.
bcl2fastq --run-folder-dir 160414_NB500999_0010_AHLFNLBGXX -p 12 --output-dir 160414_NB500999_0010_AHLFNLBGXX/fastq_files --no-lane-splitting
In this example, there should be two fastq files generated for each sample (one each for forward R1 and reverse R2 reads, since this is a paired end 2x151bp run) plus a forward and reverse file for ‘Undetermined’ reads where the index tag did not match any of the tags in the sample sheet. The Undetermined file will contain all of the reads from the PhiX spike-in if used (as PhiX does not have a tag) and also any other reads where there was a basecalling error during the index read. Depending on the PhiX spike-in % and the total number of samples on the run, the size of the Undetermined file should normally be smaller than the other files. If there is a problem suspected with demultiplexing or tagging always check the ‘index.html’ file within the ‘Reports/html’ subdirectory. This file will open on a standard web browser, and clicking the ‘unknown barcode’ option will display the top unknown barcodes and allow problems to be diagnosed. Common issues are that one or more samples were omitted from the sample sheet, errors entering the barcodes, incorrect library chemistry (e.g. selecting NexteraXT instead of TruSeqHT) or that the barcodes (especially sometimes index 2 on dual-indexed samples) need to be reverse-complemented on the sample sheet.
Dear Gavin,
Thanks for this detailed post. Can I ask what would be the minimum HD configuration on this UNIX server such as the one described here, if we had one NextSeq500 and one MiSeq connected to it and wanted to be able to plan back to back runs with minimum downtime between them. We can assume the highest capacity runs (120 GB / run on NextSeq and 15 GB/run on the Miseq)
This is assuming we will also have a separate NAS storage device for longer term storage of the FASTQ files thus generated, and also any other run related files such as bcl if they need to be stored.
I found the required server config for bcl2fastq (in a document from Illumina) but this does not specify the HD capacity.
Thanks!
-Nandita
Tiny bug, at least with the current bcl2fastq software. ‘–run-folder-dir’ should read ‘–runfolder-dir’ Hope this helps anyone running into it.
There is a way know which adapters are present in my sequences by using blc files?