Category Archives: illumina

Extensive but not comprehensive compilation of de-novo assemblers

This figure is an update of Figure 1 in “A practical comparison of de novo genome assembly software tools for next-generation sequencing technologies”, published by Zhang et al. (2011).
The figure was produced in SVG, so you should be able to click on the name of an assembler to go straight to its PubMed abstract. The size of each assembler name is relative to its number of citations in PubMed. You can see that 2012 was the year of the de Bruijn graph assemblers.


2nd Viral Bioinformatics and Genomics Training Course (1st – 5th August 2016)

We have shared our knowledge on Viral bioinformatics and genomics with yet another clever and friendly bunch of researchers. Sixteen delegates from across the world joined us for a week of intensive training. The line-up of instructors changed slightly due to the departure of Gavin Wilkie earlier in the year.

Joseph Hughes (Course Organiser)
Andrew Davison
Sejal Modha
Richard Orton (co-organiser)
Sreenu Vattipally
Ana Da Silva Filipe

The timetable changed a bit, with more focus on advanced bash scripting (loops and conditions). We asked the participants to have basic Linux command experience (ls, mkdir, cp) beforehand, which saved us a lot of time. Rick Smith-Unna’s Linux bootcamp was really useful for the students to check their expertise before the course.

This year’s timetable follows. As in the previous year, we had plenty of time for discussion at lunchtime and tea breaks, and the traditional celebratory cake at the end of the week.

9:00-9:45                 Tea & Coffee in the Barn – Arrival of participants


The first day will start with an introduction to the various high-throughput sequencing (HTS) technologies available and the ways in which samples are prepared, with an emphasis on how this impacts the bioinformatic analyses. The rest of the first day and the second day will aim to familiarize participants with the command line and useful UNIX commands, in order to empower them to automate various analyses.

9:45-10:00           Welcome and introductions – Massimo Palmarini and Joseph Hughes
10:00-10:45        Next-generation sequencing technologies – Ana Da Silva Filipe
10:45-11:15        Examples of HTS data being used in virology – Richard Orton
11:15-11:30            Short break
11:30-11:45        Introduction to Linux and getting started – Sreenu Vattipally
11:45-12:30        Basic commands – Sreenu Vattipally
12:30-13:30            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
13:30-14:30        File editing in Linux – Sreenu Vattipally & Richard Orton
14:30-15:30        Text processing – Sreenu Vattipally & Richard Orton
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Advanced Linux commands – Sreenu Vattipally

The second day will continue with practicing UNIX commands and learning how to run basic bioinformatic tools. By the end, participants will be able to analyse HTS data using various reference assemblers and will be able to automate the processing of multiple files.

9:30-11:00           BASH scripting (conditions and loops) – Sreenu Vattipally
11:00-11:30            Tea & Coffee in the Barn Room
11:30-12:15        Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally & Richard Orton
12:15-13:00        Sequence quality checks – Sreenu Vattipally & Richard Orton
13:00-14:00            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
14:00-14:45        Introduction to assembly (BWA and Bowtie2)– Sreenu Vattipally & Richard Orton
14:45-15:30        More reference assembly (Novoalign, Tanoti and comparison of mapping methods) – Sreenu Vattipally & Sejal Modha
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Post-processing of assemblies and visualization (working with Tablet and Ugene and consensus sequence generation) – Sreenu Vattipally & Sejal Modha

The third day will start with participants looking at variant calling and quasi-species characterisation. In the afternoon, we will use different approaches for de novo assembly and also provide hands-on experience.

9:30-11:00           Error detection and variant calling – Richard Orton
11:00-11:30            Tea & Coffee in Barn Room
11:30-13:00        Quasi-species characterisation – Richard Orton
13:00-14:00            Lunch break in the Lomond Room with an informal presentation of Pablo Murcia’s research program.
14:00-14:45        De novo assemblers – Sejal Modha
14:45-15:30        Using different de novo assemblers (e.g. IDBA-UD, MIRA, ABySS, SPAdes) – Sejal Modha
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Assembly quality assessment, merging contigs, filling gaps in assemblies and correcting errors (e.g. QUAST, GARM, scaffold builder, ICORN2, SPAdes) – Sejal Modha

On the fourth day, participants will look at their own assemblies in more detail, and will learn how to create a finished genome with gene annotations. A popular metagenomic pipeline will be presented, and participants will learn how to use it. In the afternoon, the participants will build their own metagenomic pipeline putting in practice the bash scripting learnt during the first two days.

9:30-10:15           Finishing and annotating genomes – Andrew Davison & Sejal Modha
10:15-11:00        Annotation transfer from related species – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-12:15        The MetAMOS metagenomic pipeline – Sejal Modha & Sreenu Vattipally
13:00-14:00            Lunch break in Lomond Room with informal presentation of Roman Biek’s research program.
14:00-15:30        Practice in building a custom de novo pipeline – Sejal Modha & Sreenu Vattipally
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Practice in building a custom de novo pipeline – Sejal Modha
17:30                         Group photo followed by social evening and dinner at the Curler’s Rest

On the final day, participants will combine the consensus sequences generated during day two with data from GenBank to produce phylogenies. The practical aspects of automating phylogenetic analyses will be emphasised to reinforce the bash scripting learnt over the previous days.

9:30-10:15           Downloading data from GenBank using the command line – Joseph Hughes & Sejal Modha
10:15-11:00        Introduction to multiple sequence alignments – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-13:00        Introduction to phylogenetic analysis – Joseph Hughes
13:00-14:00            Lunch break in the Lomond Room with a celebratory cake
14:00-15:30        Analysing your own data or developing your own pipeline – Whole team available
15:30-16:00            Tea & Coffee in the Barn
16:00-17:00        Analysing your own data or developing your own pipeline – Whole team available
17:00                       Goodbyes
We wish all the participants lots of fun with their new bioinformatic skills.

If you are interested in finding out about future courses that we will be running, please fill in the form with your details.

How to generate a Sample Sheet from sample/index data in BaseSpace

If you are using BaseSpace for sample entry but demultiplexing your data manually, you may have been frustrated that there is no facility to download your sample names and index tag data from BaseSpace as a sample sheet. This means you have to enter the same data twice – with the possibility of errors creeping in especially for large projects with many samples and dual index tags.

We have found a way to avoid typing the same information twice and instead fetch the sample names, index IDs and index tag sequences from BaseSpace straight into a sample sheet. This saves a huge amount of time for large projects with many samples.

Log in to BaseSpace, and navigate to the ‘Libraries’ page within the ‘Prep Libraries’ tab. Each line is a set of libraries with complete information on index names and tag sequences. Clicking a set of libraries will bring up the following screen – this example has 24 samples with TruSeqLT tags (only 7 are visible without scrolling down the list).


Clicking the ‘EXPORT’ button will download a comma separated file (csv) that can be opened in Excel. This file has all the sample names, index IDs and index sequences (but not in quite the correct format to paste into a sample sheet).


Open the file in Excel, select the entire Index1 Column and click the ‘Text to Columns’ function (under the ‘Data’ menu in Excel). Choose the ‘Delimited’ option, then tick ‘Other’ and enter a hyphen (-) in the box. This will split the Index1 Column into two, with the name of the Index and the actual Tag sequence in two separate columns, as below.


If using dual indexing (e.g. TruSeqHT or NexteraXT) then do the same for the second column with Index 2 to split the index2 names and sequences into two separate columns.
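For those who would rather script the split than use Excel, the same hyphen split can be sketched in Python. This is only a minimal sketch: the column headers `Index1` and `Index2`, the sample names and the tag values in the example are assumptions for illustration, so check them against the header row of your own export. Unlike ‘Text to Columns’, this version splits on the first hyphen only.

```python
import csv
import io


def split_index_columns(csv_text, columns=("Index1", "Index2")):
    """Split 'NAME-SEQUENCE' cells into separate name and sequence columns.

    `columns` lists the headers to split; these names are assumed here and
    may differ in a real BaseSpace export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        new_row = {}
        for key, value in row.items():
            if key in columns:
                # Split on the first hyphen only: index name, then tag sequence.
                name, _, sequence = (value or "").partition("-")
                new_row[key + "Name"] = name
                new_row[key + "Sequence"] = sequence
            else:
                new_row[key] = value
        rows.append(new_row)
    return rows


# Illustrative input only; a real export will contain more columns.
example = "SampleName,Index1\napple1,A001-ATCACG\napple2,A002-CGATGT\n"
rows = split_index_columns(example)
for row in rows:
    print(row)
```

The resulting rows can then be written back out with `csv.DictWriter` or pasted into the sample sheet as described in the next step.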

Now open a blank or used sample sheet that is set up for the correct library chemistry and sequencing instrument (see previous blog post), then copy and paste the sample IDs, index IDs and index sequences into the sample sheet. Save it as a comma separated file (csv) and it’s ready to use for demultiplexing and fastq generation, or your next MiSeq run. The above example looks like this…


How to demultiplex Illumina data and generate fastq files using bcl2fastq

Sequence runs on NGS instruments are typically carried out with multiple samples pooled together. An index tag (also called a barcode) consisting of a unique sequence of between 6 and 12bp is added to each sample so that the sequence reads from different samples can be identified.

On the Illumina MiSeq, the process of demultiplexing (dividing your sequence reads into separate files for each index tag/sample) and generating the fastq data files required for downstream analysis is carried out automatically using the onboard PC. However, on the higher-throughput NextSeq500 and HiSeq models this process is carried out on BaseSpace – Illumina’s cloud-based resource.

Whilst there are many advantages to having your sequence data in the cloud (e.g. monitoring a sequence run from home, ease of sharing data with collaborators, etc) there are also some drawbacks to this system. In particular the process of demultiplexing and fastq file generation in BaseSpace can be very slow. It takes up to 8 hours to demultiplex the data from a high output NextSeq500 run on BaseSpace, and if the fastq files then have to be downloaded to your local computer or server for analysis this requires a further 3 hours.

If your data is urgent you may not want to wait 11 hours or more after your sequence run has finished to begin your analysis! We have found that demultiplexing and fastq file generation from a high output NextSeq500 run can instead be carried out in about 30 minutes on our in-house UNIX server. This also has the advantage of avoiding the rather slow step of downloading your fastq files from BaseSpace.

In order to do this, you need to install a free piece of software from Illumina called bcl2fastq on your UNIX server. Demultiplexing NextSeq500 data (or data from any Illumina system running RTA version 1.18.54 or later) requires bcl2fastq version 2.16 or newer (the latest version at the time of writing is v2.17 and can be downloaded here).

Importantly, we have checked that the results obtained from bcl2fastq and BaseSpace are equivalent – the fastq files generated are exactly the same. BaseSpace is set to remove adapter sequences by default, meaning that the sequence reads may not all be the same length (any reads from short fragments with adapter read-through will have those sequences removed). In bcl2fastq you have the option to either remove adapter sequences or leave them in so that all reads are the same length.

In order to demultiplex the data, first copy the entire run folder from the sequencer to your UNIX server. On the NextSeq500, the run folder will be inside the following directory on the hard disc –
D:\Illumina\NextSeq Control Software\Temp\
It ought to be the ONLY folder here as the NextSeq only retains data from the most recent run – as soon as you start a new sequence run the data from the previous run is deleted. Copy the entire folder, including all its subdirectories. This folder contains the raw basecall (bcl) files. Do not change the name of the folder, which will be named as per the following convention – YYMMDD_InstrumentID_RunID_FlowcellID
For example, the 10th run carried out on a NextSeq500 with serial number 500999, on 14th April 2016 and using flowcell number AHLFNLBGXX, would be named as follows – 160414_NB500999_0010_AHLFNLBGXX
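The naming convention can be sketched in a few lines of Python, which is handy for checking that a copied folder still has its original name. This is a sketch under assumptions: the four-digit zero padding of the run number and the ‘NB’ prefix on the instrument ID are inferred from the example run folder used in this post (160414_NB500999_0010_AHLFNLBGXX), and the function names are our own.

```python
from datetime import date


def run_folder_name(run_date, instrument_id, run_number, flowcell_id):
    """Build a name of the form YYMMDD_InstrumentID_RunID_FlowcellID."""
    # Assumption: the run number is zero-padded to four digits, as in the example.
    return "{}_{}_{:04d}_{}".format(
        run_date.strftime("%y%m%d"), instrument_id, run_number, flowcell_id
    )


def parse_run_folder_name(name):
    """Split a run folder name back into its four underscore-separated parts."""
    run_date, instrument_id, run_id, flowcell_id = name.split("_")
    return {
        "date": run_date,
        "instrument": instrument_id,
        "run": int(run_id),
        "flowcell": flowcell_id,
    }


name = run_folder_name(date(2016, 4, 14), "NB500999", 10, "AHLFNLBGXX")
print(name)
print(parse_run_folder_name(name))
```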

The other requirement is a sample sheet – a simple comma separated file (csv) with the library chemistry, sample names and the index tag used for each sample, in addition to some other metrics describing the run. Anyone running a MiSeq will already be familiar with these, but NextSeq and HiSeq users may only have used BaseSpace to enter these values. Unfortunately there is no way to automatically download a sample sheet from BaseSpace (although we have figured out a way round this to avoid double data entry, see the next blog post). Sample sheets can be made and modified using MS Excel or any other software that can read csv files, but the easiest way to make one is to use a free wizard-type program for the PC called Illumina Experiment Manager, which guides you through the process. The latest version at the time of writing is v1.9, which is available here.

Open Illumina Experiment Manager and click on ‘Create Sample Sheet’. Then make certain that you choose the correct sequencer (essential, since the NextSeq and MiSeq read index 2 in opposite orientations, so one needs the reverse complement of the sequence used by the other). Select ‘Fastq only’ output. Enter any value (numbers or text) for the Reagent Kit Barcode – this will become the filename. Ensure the correct library chemistry is selected (e.g. TruSeqLT, TruSeqHT, NexteraXT, etc). If there are custom/non-standard tags, these will need to be entered manually in the csv file. Tick adapter trimming for read1 and read2 if required, select either paired or single end reads and enter the read length as appropriate (add one base, so for 150bp reads enter 151). Then either follow the instructions in the next blog post to import sample names and tags from BaseSpace, or enter them manually by adding a blank row for each sample, entering the sample names and selecting the index tag(s) for each sample. It is wise to double-check that the sample names and indexes are correct, as mistakes will cause data to be allocated to the wrong file. Change the name of the file to ‘SampleSheet.csv’ and copy it into the top directory inside the sequence run folder on the server. The sample sheet file should resemble the example below – this is for a paired end 2x151bp NextSeq run with four samples, TruSeqLT index tags, and adapter trimming selected.
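Part of the double-checking can be automated. The sketch below reads the [Data] section of a sample sheet and reports any sample ID or index combination that appears more than once. It assumes the usual `Sample_ID`, `index` and (for dual indexing) `index2` column headers. The tiny sheet in the example is illustrative rather than a complete sample sheet, and the function names are our own.

```python
import csv
import io
from collections import Counter


def read_data_rows(sheet_text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = sheet_text.splitlines()
    # The column header row is the first line after the [Data] marker.
    start = next(i for i, line in enumerate(lines) if line.strip().startswith("[Data]"))
    reader = csv.DictReader(io.StringIO("\n".join(lines[start + 1:])))
    # Ignore rows without a sample ID (e.g. trailing lines of commas from Excel).
    return [row for row in reader if (row.get("Sample_ID") or "").strip()]


def find_duplicates(rows):
    """Return (duplicated sample IDs, duplicated (index, index2) pairs)."""
    ids = Counter(row["Sample_ID"].strip() for row in rows)
    tags = Counter(
        ((row.get("index") or "").strip(), (row.get("index2") or "").strip())
        for row in rows
    )
    dup_ids = sorted(s for s, n in ids.items() if n > 1)
    dup_tags = sorted(t for t, n in tags.items() if n > 1)
    return dup_ids, dup_tags


# Illustrative, deliberately incomplete sheet with one repeated tag.
sheet = """[Header]
Workflow,GenerateFASTQ

[Reads]
151
151

[Data]
Sample_ID,Sample_Name,index
S1,apple1,ATCACG
S2,apple2,CGATGT
S3,apple3,ATCACG
"""

dup_ids, dup_tags = find_duplicates(read_data_rows(sheet))
print("Duplicate sample IDs:", dup_ids)
print("Duplicate index tags:", dup_tags)
```

A check like this only catches repeated entries; it cannot tell whether a tag has been assigned to the wrong sample, so a visual check against the lab records is still worthwhile.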


Now use the command line below on the server to run bcl2fastq. For speed, we use 12 threads for processing the data on our UNIX server (-p 12); however, the optimal number will depend on your system architecture, resources and usage limits. It is important to set a limit on the number of threads, otherwise bcl2fastq will use 100% of the CPUs on the server. We usually invoke the --no-lane-splitting option, otherwise each output file from our NextSeq is divided into four (one for each lane on the flowcell). Here we are using the NextSeq run folder mentioned above as an example (160414_NB500999_0010_AHLFNLBGXX) and sending the output to a subdirectory within it called ‘fastq_files’. For other bcl2fastq options please see Illumina’s manual on the software.
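If you prefer to launch bcl2fastq from a script (for example, to process several run folders in turn), the call described above can be assembled in Python. This is a sketch rather than our exact command: it assumes bcl2fastq is on the PATH and that the script is started from the directory containing the run folder, and the helper names are our own. The options used are --runfolder-dir, --output-dir, -p and --no-lane-splitting.

```python
import subprocess


def bcl2fastq_command(run_folder, output_subdir="fastq_files", threads=12,
                      lane_splitting=False):
    """Assemble the bcl2fastq argument list for one run folder."""
    cmd = [
        "bcl2fastq",
        "--runfolder-dir", run_folder,
        # Send the fastq files to a subdirectory inside the run folder.
        "--output-dir", "{}/{}".format(run_folder, output_subdir),
        # Limit the number of processing threads.
        "-p", str(threads),
    ]
    if not lane_splitting:
        # Produce one file per sample and read instead of one per lane.
        cmd.append("--no-lane-splitting")
    return cmd


def run_bcl2fastq(run_folder):
    """Run bcl2fastq and raise an error if it exits with a non-zero status."""
    subprocess.run(bcl2fastq_command(run_folder), check=True)


cmd = bcl2fastq_command("160414_NB500999_0010_AHLFNLBGXX")
print(" ".join(cmd))
```

Printing the command first, as here, is a cheap way to check the paths before committing the server to a run; call `run_bcl2fastq` once it looks right.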

In this example, there should be two fastq files generated for each sample (one each for forward R1 and reverse R2 reads, since this is a paired end 2x151bp run) plus a forward and reverse file for ‘Undetermined’ reads where the index tag did not match any of the tags in the sample sheet. The Undetermined file will contain all of the reads from the PhiX spike-in if used (as PhiX does not have a tag) and also any other reads where there was a basecalling error during the index read. Depending on the PhiX spike-in % and the total number of samples on the run, the Undetermined file should normally be smaller than the other files. If you suspect a problem with demultiplexing or tagging, always check the ‘index.html’ file within the ‘Reports/html’ subdirectory. This file will open in a standard web browser, and clicking the ‘unknown barcode’ option will display the top unknown barcodes and allow problems to be diagnosed. Common issues are that one or more samples were omitted from the sample sheet, that the barcodes were entered incorrectly, that the wrong library chemistry was selected (e.g. NexteraXT instead of TruSeqHT) or that the barcodes (especially index 2 on dual-indexed samples) need to be reverse-complemented on the sample sheet.
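The reverse-complement problem is easy to test for once you have the list of top unknown barcodes from the report. The sketch below reverse-complements each unknown barcode and looks for it among the tags in the sample sheet. The function names and example barcodes are illustrative, and the unknown barcodes are typed in by hand rather than parsed from the report.

```python
# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(tag):
    """Return the reverse complement of an index tag sequence."""
    return tag.upper().translate(_COMPLEMENT)[::-1]


def find_reverse_complemented(unknown_barcodes, sheet_tags):
    """Map each unknown barcode to the sample sheet tag it matches
    once reverse-complemented; barcodes with no match are left out."""
    sheet = {tag.upper() for tag in sheet_tags}
    matches = {}
    for barcode in unknown_barcodes:
        candidate = reverse_complement(barcode)
        if candidate in sheet:
            matches[barcode] = candidate
    return matches


# Illustrative values: two tags from a sample sheet and two unknown barcodes.
sheet_tags = ["ATCACG", "CGATGT"]
unknown = ["CGTGAT", "GGGGGG"]
suspects = find_reverse_complemented(unknown, sheet_tags)
print(suspects)
```

If most of the top unknown barcodes turn up in the result, the tags on the sample sheet probably need reverse-complementing before demultiplexing is rerun.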

How to Import data for libraries with index tags into BaseSpace

In this blog we describe how to import lists of sample data with defined index tags into BaseSpace, and provide templates for TruSeqLT and TruSeqHT libraries. We have found this saves a lot of time and eliminates errors associated with manual entry.
The Illumina NextSeq500 sequencer requires all users to complete sample data entry on BaseSpace (Illumina’s cloud-based resource) including sample names, species, project names, index tags and sample pools. Whilst there are many advantages to having this data in the cloud, the BaseSpace interface is not always the most convenient or user-friendly system for data entry and management.
Our experience has been that for large projects with many samples, it is impractical to use the manual method of entering sample names in the ‘Biological Samples’ tab, then individually assigning an index tag in the ‘Libraries’ tab by dragging each sample onto an image of a 96-well plate of barcodes. To make matters worse, BaseSpace always mixes up the order of the samples (even if they are named 1-96), so it becomes all too easy to make an error when faced with a long list of sample names in a random order that each require a tag to be assigned.
It is quite easy to import a csv file created in Excel (or similar) with the sample names, species, project and nucleic acid into the ‘Biological Samples’ tab, and thus avoid a large part of the manual data entry. However this still requires the user to individually assign an index tag to each sample using the cumbersome and error-prone interface pictured below, dragging each sample on the list to the correct well on the index plate.
It is possible to avoid this by importing a csv file with the sample names, species, project, nucleic acid, index name and also the index tags into the ‘Libraries’ tab on BaseSpace. However, there is very little guidance on how to do this – and Illumina only provide an example template for libraries made using Nextera XT with none of the sequence tags themselves.
We are mainly using TruSeq indexes, so we have generated our own import templates with all 24 TruSeqLT tags, and all 96 dual-indexed TruSeqHT tags. This took quite a bit of trial and error, plus fetching the sequences of all 216 index tags. We have therefore made our own templates for importing TruSeqLT and TruSeqHT libraries available here for others to use.
Simply open the csv file in Excel (or similar) and insert the names of your own samples in the first two columns. Copy and paste the index tags you have used to the correct sample lines (each sample requires the Well, Index1Name, Index1Sequence, Index2Name and Index2Sequence). Change the name of the ContainerID from ‘Platename’ to your own name and delete any lines you don’t need (e.g. if you have fewer than 24 or 96 samples). Here we are using the template to import 24 samples called apples 1-24 with TruSeqHT dual tags. If using 96 samples, use this.
Save the csv file, navigate to the ‘Libraries’ tab in your BaseSpace account and then click the ‘Import’ button on the top-right corner. Choose your csv file, and after a minute you should see your libraries successfully imported with the correct index tags as below, ready to pool for a sequence run.
Now, if Illumina would just allow us to import pools of samples we could also avoid having to individually drag each sample into a small dot in the ‘Pools’ tab. This is rather tiresome when there are large numbers of samples in a pool!