Category Archives: Deep Sequencing

2nd Viral Bioinformatics and Genomics Training Course (1st – 5th August 2016)

We have shared our knowledge of viral bioinformatics and genomics with yet another clever and friendly bunch of researchers. Sixteen delegates from across the world joined us for a week of intensive training. The line-up of instructors changed slightly following the departure of Gavin Wilkie earlier in the year.

Instructors:
Joseph Hughes (Course Organiser)
Andrew Davison
Sejal Modha
Richard Orton (co-organiser)
Sreenu Vattipally
Ana Da Silva Filipe

The timetable changed a bit, with more focus on advanced bash scripting (loops and conditions), as we asked the participants to have basic Linux command-line experience (ls, mkdir, cp), which saved us a lot of time. Rick Smith-Unna’s Linux bootcamp was really useful for the students to check their expertise before the course: http://rik.smith-unna.com/command_line_bootcamp.

This year’s timetable follows. As in the previous year, we had plenty of time for discussion at lunchtime and tea breaks, and the traditional celebratory cake at the end of the week.

9:00-9:45                 Tea & Coffee in the Barn – Arrival of participants

DAY 1 – SEQUENCING TECHNOLOGIES AND A UNIX PRIMER

The first day will start with an introduction to the various high-throughput sequencing (HTS) technologies available and the ways in which samples are prepared, with an emphasis on how this impacts the bioinformatic analyses. The rest of the first day and the second day will aim to familiarize participants with the command line and useful UNIX commands, in order to empower them to automate various analyses.

9:45-10:00           Welcome and introductions – Massimo Palmarini and Joseph Hughes
10:00-10:45        Next-generation sequencing technologies – Ana Da Silva Filipe
10:45-11:15        Examples of HTS data being used in virology – Richard Orton
11:15-11:30            Short break
11:30-11:45        Introduction to Linux and getting started – Sreenu Vattipally
11:45-12:30        Basic commands – Sreenu Vattipally
12:30-13:30            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
13:30-14:30        File editing in Linux – Sreenu Vattipally & Richard Orton
14:30-15:30        Text processing – Sreenu Vattipally & Richard Orton
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Advanced Linux commands – Sreenu Vattipally
DAY 2 – UNIX CONTINUED AND HOW TO DO REFERENCE ASSEMBLIES

The second day will continue with practicing UNIX commands and learning how to run basic bioinformatic tools. By the end, participants will be able to analyse HTS data using various reference assemblers and will be able to automate the processing of multiple files.

9:30-11:00           BASH scripting (conditions and loops) – Sreenu Vattipally
11:00-11:30            Tea & Coffee in the Barn Room
11:30-12:15        Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally & Richard Orton
12:15-13:00        Sequence quality checks – Sreenu Vattipally & Richard Orton
13:00-14:00            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
14:00-14:45        Introduction to assembly (BWA and Bowtie2)– Sreenu Vattipally & Richard Orton
14:45-15:30        More reference assembly (Novoalign, Tanoti and comparison of mapping methods) – Sreenu Vattipally & Sejal Modha
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Post-processing of assemblies and visualization (working with Tablet and Ugene and consensus sequence generation) – Sreenu Vattipally & Sejal Modha
DAY 3 – HOW TO DO VARIANT CALLING AND DE NOVO ASSEMBLY

The third day will start with participants looking at variant calling and quasi-species characterisation. In the afternoon, we will use different approaches for de novo assembly and also provide hands-on experience.

9:30-11:00           Error detection and variant calling – Richard Orton
11:00-11:30            Tea & Coffee in Barn Room
11:30-13:00        Quasi-species characterisation – Richard Orton
13:00-14:00            Lunch break in the Lomond Room with an informal presentation of Pablo Murcia’s research program.
14:00-14:45        De novo assemblers – Sejal Modha
14:45-15:30           Using different de novo assemblers (e.g. idba-ud, MIRA, Abyss, Spades) – Sejal Modha
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Assembly quality assessment, merging contigs, filling gaps in assemblies and correcting errors (e.g. QUAST, GARM, scaffold builder, ICORN2, spades) – Sejal Modha
DAY 4 – METAGENOMICS AND HOW TO DO GENOME ANNOTATION

On the fourth day, participants will look at their own assemblies in more detail, and will learn how to create a finished genome with gene annotations. A popular metagenomic pipeline will be presented, and participants will learn how to use it. In the afternoon, the participants will build their own metagenomic pipeline, putting into practice the bash scripting learnt during the first two days.

9:30-10:15           Finishing and annotating genomes – Andrew Davison & Sejal Modha
10:15-11:00        Annotation transfer from related species – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-12:15        The MetAMOS metagenomic pipeline – Sejal Modha & Sreenu Vattipally
13:00-14:00            Lunch break in Lomond Room with informal presentation of Roman Biek’s research program.
14:00-15:30        Practice in building a custom de novo pipeline – Sejal Modha & Sreenu Vattipally
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Practice in building a custom de novo pipeline – Sejal Modha
17:30                         Group photo followed by social evening and Dinner at the Curler’s Rest (http://www.thecurlersrestglasgow.co.uk). 
DAY 5 – PRIMER IN APPLIED PHYLOGENETICS

On the final day, participants will combine the consensus sequences generated during day two with data from GenBank to produce phylogenies. The practical aspects of automating phylogenetic analyses will be emphasised to reinforce the bash scripting learnt over the previous days.

9:30-10:15           Downloading data from GenBank using the command line – Joseph Hughes & Sejal Modha
10:15-11:00        Introduction to multiple sequence alignments – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-13:00         Introduction to phylogenetic analysis – Joseph Hughes
13:00-14:00            Lunch break in the Lomond Room with a celebratory cake
14:00-15:30        Analysing your own data or developing your own pipeline – Whole team available
15:30-16:00            Tea & Coffee in the Barn
16:00-17:00        Analysing your own data or developing your own pipeline – Whole team available
17:00                       Goodbyes
We wish all the participants lots of fun with their new bioinformatic skills.

If you are interested in finding out about future courses that we will be running, please fill in the form with your details.

How to Import data for libraries with index tags into BaseSpace

In this blog we describe how to import lists of sample data with defined index tags into BaseSpace, and provide templates for TruSeqLT and TruSeqHT libraries. We have found this saves a lot of time and eliminates errors associated with manual entry.
The Illumina NextSeq500 sequencer requires all users to complete sample data entry on BaseSpace (Illumina’s cloud-based resource) including sample names, species, project names, index tags and sample pools. Whilst there are many advantages to having this data in the cloud, the BaseSpace interface is not always the most convenient or user-friendly system for data entry and management.
Our experience has been that for large projects with many samples, it is impractical to use the manual method of entering sample names in the ‘Biological Samples’ tab, then individually assigning an index tag in the ‘Libraries’ tab by dragging each sample onto an image of a 96-well plate of barcodes. To make matters worse, BaseSpace always mixes up the order of the samples (even if they are named 1-96), so it becomes all too easy to make an error when faced with a long list of sample names in a random order that each require a tag to be assigned.
It is quite easy to import a csv file created in Excel (or similar) with the sample names, species, project and nucleic acid into the ‘Biological Samples’ tab, and thus avoid a large part of the manual data entry. However this still requires the user to individually assign an index tag to each sample using the cumbersome and error-prone interface pictured below, dragging each sample on the list to the correct well on the index plate.
It is possible to avoid this by importing a csv file with the sample names, species, project, nucleic acid, index name and also the index tags into the ‘Libraries’ tab on BaseSpace. However, there is very little guidance on how to do this – and Illumina only provide an example template for libraries made using Nextera XT with none of the sequence tags themselves.
We are mainly using TruSeq indexes, so we have generated our own import templates with all 24 TruSeqLT tags, and all 96 dual-indexed TruSeqHT tags. This took quite a bit of trial and error, plus fetching the sequences of all 216 index tags. We have therefore made our own templates for importing TruSeqLT and TruSeqHT libraries available here for others to use.
Simply open the csv file in Excel (or similar) and insert the names of your own samples in the first two columns. Copy and paste the index tags you have used to the correct sample lines (each sample requires the Well, Index1Name, Index1Sequence, Index2Name and Index2Sequence). Change the name of the ContainerID from ‘Platename’ to your own name and delete any lines you don’t need (e.g. if you have fewer than 24 or 96 samples). Here we are using the template to import 24 samples called apples 1-24 with TruSeqHT dual tags. If using 96 samples, use this.
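For illustration only, a filled-in fragment of such a csv might look like the sketch below. The column names, their order, and the D7xx/D5xx index names and sequences are assumptions for this example; the authoritative header row and tag sequences are the ones in the downloadable template, so edit that file rather than retyping this.

```
SampleID,Name,Species,Project,NucleicAcid,ContainerID,Well,Index1Name,Index1Sequence,Index2Name,Index2Sequence
apples1,apples1,human,Apples,DNA,MyPlate,A1,D701,ATTACTCG,D501,TATAGCCT
apples2,apples2,human,Apples,DNA,MyPlate,B1,D702,TCCGGAGA,D501,TATAGCCT
```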
Save the csv file, navigate to the ‘Libraries’ tab in your BaseSpace account and then click the ‘Import’ button on the top-right corner. Choose your csv file, and after a minute you should see your libraries successfully imported with the correct index tags as below, ready to pool for a sequence run.
Now, if Illumina would just allow us to import pools of samples we could also avoid having to individually drag each sample into a small dot in the ‘Pools’ tab. This is rather tiresome when there are large numbers of samples in a pool!

The dark arts of Ion Torrent Sequencing

All technologies have their advantages and disadvantages, and next generation sequencing (NGS) is no different. For the time being, the dominant forces in NGS are Illumina, which is based upon detecting a flash of light as fluorescent nucleotides are incorporated, and Ion Torrent, where nucleotide incorporation is detected as a change in pH. Illumina has made its technology user-friendly and produces high-quality data, but is expensive and sensitive to misalignment of the lens used to photograph the flow cell. Ion Torrent is (relatively) cheaper and less sensitive to shakes, but has historically produced poorer-quality data, and its user-friendliness is far behind Illumina’s. Ion Torrent’s answer to consistency and user-friendliness has been to release the Ion Chef: effectively a robot which performs the steps from library to chip. Life without an Ion Chef is very different. Like any method, the instructions that come with the supplies tell you what to do, but they don’t tell you the ‘tricks’: those little touches that make you hit the sweet spot. In the non-Chef protocol for the Ion Proton, there are various points throughout the procedure, from RNA to sequence data, where little touches help.
My experience has been using the Ion Proton for transcriptomics, so any ‘recommendations’ when it comes to Ion Torrent sequencing refer to RNAseq.
Library preparation begins with good quality RNA. The standard way to determine RNA quality is to assess the integrity and ratio of the ribosomal RNA peaks. Originally this was done on an agarose gel, but nowadays the Bioanalyzer or TapeStation from Agilent are commonplace and provide a ‘RIN’ value indicating the RNA quality. For anyone thinking of buying one, given the choice between the Bioanalyzer and the TapeStation, personally I’d choose the Bioanalyzer every time, as the reagents seem to be much more stable over time. For quantification, it is widely accepted that dye-based methods of quantitation, e.g. Qubit, are much more reliable than spectrophotometry, e.g. NanoDrop.


Good quality RNA: Total RNA (RIN 10) as analysed using the Bioanalyzer.

The amount of RNA used to generate the library can be very small, but should be sufficient that the resulting library is representative of the RNA population in the sample. Currently, I poly(A) enrich 1.5µg of total RNA and use the entire eluate as the input for the RNAseq V2 kit from Ion Torrent. The protocol provided suggests 3 minutes of fragmentation for <100ng of poly(A)-enriched RNA as an input. Fragmenting for this long, though, will produce short reads, whereas a shorter incubation, e.g. 1 min 30 sec, will result in longer reads, which may be beneficial when it comes to aligning the reads to the reference. When it comes to the clean-up after fragmentation (and this applies to all of the clean-ups), big emphasis is placed upon precise volumes. It’s true, this is important, but just as important, and something not in the manual, is that the ethanol MUST be fresh. As soon as the bottle is opened, the quality drops. It’s prudent to make aliquots from a freshly opened bottle, especially if you want batch-to-batch consistency among your libraries.

When it comes to barcoding, if possible plan ahead so that successive runs on the Proton use different barcodes. This helps to identify and remove inter-run contamination of the data.
Library QC is also important: again, use the Qubit. Once amplified, using 1.5µg of total RNA input results in libraries that need to be diluted considerably to reach the concentration required for the OT2: either 100pM (using 8µl in the OT2 reaction) or 8pM (using 100µl in the OT2 reaction). More DNA than this will result in an increase in polyclonal spheres. It may be tedious vortexing the ISPs for a full minute prior to their addition to the OT2 reaction, but it’s important, and adding the ISPs is the last thing I do prior to starting the OT2.

After enrichment and loading (adding 10µl of 50% annealing buffer prior to adding to the chip), the foam washes aren’t to be underestimated. The wells may be perfectly loaded with perfectly templated ISPs, but unless the surface is washed properly, the amount of data will be reduced. Whereas the protocol states two washes, I perform a minimum of 4, some run slowly, some faster. The goal each time is to generate a fine foam. For this the pipette tip makes a surprising amount of difference; unless there is a sufficiently narrow opening to the tip, it is difficult to generate a good foam. It is also sometimes possible to ‘over-foam’, whereby pipetting too much will result in the small bubbles becoming bigger.


Chip loading: an image of the Ion Proton PI chip loaded with ISPs

I’m not sure how much of a difference it makes, but I try to add the polymerase as slowly as possible and leave the chip to sit still for at least 5 minutes before moving it.

These are simply some of the habits I’ve picked up. The protocol is much improved compared to previous versions but, ultimately, the best option would be to buy an Ion Chef!

1st Viral Bioinformatics and Genomics Training Course


The first Viral Bioinformatics and Genomics training course held at the University of Glasgow was completed successfully by 14 delegates (nine external and five internal) on 10-14 August 2015. The course took place in the McCall Building computer cluster, and the adjacent Lomond and Dumgoyne Rooms were used for refreshments and lunch.

Instructors:
Joseph Hughes (Course Organiser)
Andrew Davison
Robert Gifford
Sejal Modha
Richard Orton
Sreenu Vattipally
Gavin Wilkie

9:00-10:00 Tea & Coffee in Dumgoyne Room – Arrival of participants
DAY 1 – SEQUENCING TECHNOLOGIES AND A UNIX PRIMER

Day one introduced participants to the different sequencing technologies available and the ways in which samples are prepared, with an emphasis on how this impacts the bioinformatic analyses. The rest of the day aimed to familiarize researchers with the command line and useful UNIX commands.


10:00-10:45 Next-generation sequencing technologies – Gavin Wilkie
10:45-11:30 Examples of HTS data being used in virology – Richard Orton
11:30-11:45 Brief introduction and history of Unix/Linux – Sreenu Vattipally and Richard Orton
11:45-12:30 The command line anatomy and basic commands – Sreenu Vattipally and Richard Orton

12:30-13:30 Lunch break in Dumgoyne Room

13:30-14:30 Essential UNIX commands – Sreenu Vattipally and Sejal Modha
14:30-15:30 Text processing with grep – Sreenu Vattipally and Sejal Modha
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:30 sed and awk – Sreenu Vattipally and Sejal Modha

DAY 2 – UNIX CONTINUED AND HOW TO DO REFERENCE ASSEMBLIES

Participants continued to practise UNIX commands and learned how to run basic bioinformatic tools. By the end of the day, they were able to analyse HTS data with different reference assemblers and to automate the processing of multiple files.

9:30-10:15 Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally and Richard Orton
10:15-11:00 Quality scores, quality control and trimming (Prinseq and FastQC and trim_galore) – Sreenu Vattipally and Richard Orton
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-13:00 Introduction to alignment and assembly programs (Blast, BWA, Bowtie) – Sreenu Vattipally and Richard Orton

13:00-14:00 Lunch break in Dumgoyne Room

14:00-14:45 Continuation of alignment and assembly programs (Novoalign, Tanoti) – Sreenu Vattipally and Richard Orton
14:45-15:30 Post-assembly processing and alignment visualization (Tablet and UGENE) – Sreenu Vattipally and Richard Orton
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-16:45 Workflow design and automation – Sreenu Vattipally and Richard Orton
16:45-17:30 Loops and command line arguments – Sreenu Vattipally and Richard Orton

DAY 3 – HOW TO DO DE NOVO ASSEMBLY AND USE METAGENOMICS PIPELINES

The third day covered the different approaches used for de-novo assembly and provided hands-on experience. A popular metagenomic pipeline was presented and participants learned how to use it as well as create their own pipeline. (Slides and practical)


9:30-10:15 De-novo assemblers – Sejal Modha and Gavin Wilkie
10:15-11:00 Using different de-novo assemblers (e.g. Meta-idba, Edena, Abyss, Spades) – Sejal Modha and Gavin Wilkie

11:00-11:30 – Tea & Coffee in Lomond Room
11:30-13:00 Scaffolding contigs, filling gaps in assemblies and correcting errors (e.g. phrap, gap_filler, scaffold builder, ICORN2, spades) – Gavin Wilkie and Sejal Modha

13:00-14:00 Lunch break in Lomond Room

14:00-14:45 Practice building a custom de-novo pipeline (BLAST, KronaTools)
14:45-15:30 The MetAMOS metagenomic pipeline – Sejal Modha and Sreenu Vattipally
15:30-16:00 Tea & Coffee in Lomond Room
16:00-17:30 Analysis using the pipeline – Sejal Modha and Sreenu Vattipally

DAY 4 – SEQUENCING TECHNOLOGIES AND HOW TO DO GENOME ANNOTATION

Day four gave participants the opportunity to look at their assemblies in more detail. They learned how to create a finished, curated full genome (with an emphasis on finishing) with gene annotations, and analysed the variation within their samples.

9:30-11:00 Finishing and Annotating genomes – Andrew Davison, Gavin Wilkie and Sreenu Vattipally
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-13:00 Annotation transfer from related species – Gavin Wilkie, Andrew Davison and Sreenu Vattipally

13:00-14:00 Lunch break in Dumgoyne Room


14:00-15:30 Error detection and variant calling – Richard Orton
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:30 Quasi-species characterisation – Richard Orton

17:30 Social evening in the pub/restaurant at Curlers on Byres Road organised by Richard Orton (http://www.thecurlersrestglasgow.co.uk). 

 

DAY 5 – PRIMER IN INVESTIGATING PATHWAYS OF VIRAL EVOLUTION

Researchers worked through several practical examples of using sequences in virology and spent the remaining time analysing their own data with the teachers’ help.

9:30-10:15 Identifying mutations of interest for individual sequences and within a set of sequences – Robert Gifford
10:15-11:00 Combining phylogenies with traits – Robert Gifford
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-12:15 Investigating epidemiology (IDU example) – Robert Gifford
12:15-13:00 Investigating transmission of drug resistance – Robert Gifford

13:00-14:00 Lunch break in Dumgoyne Room

14:00-15:30 Analysing your own data or developing your own pipeline
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:00 Analysing your own data or developing your own pipeline
17:00 Goodbyes


 

If you would like to be informed of similar training courses run in the future, please fill in your details here.

 

 

A simple method to distinguish low frequency variants from Illumina sequence errors

RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences.

Next generation sequencing has revolutionised the study of viral populations by enabling the ultra deep sequencing of their genomes, and the subsequent identification of the full spectrum of variants within the population.

However, the current challenge is to accurately model the errors in the sequence data and distinguish real viral variants, particularly those that exist at low frequency, from errors introduced during sequencing.

A simple method to do this is to utilise the Illumina quality scores. Every sequenced base in every read is assigned a quality score that is a measure of the uncertainty of the base call, using the Phred scale. An example FASTQ read (an illustrative record: a header line, the sequence, a separator, and one quality character per base) is:

@read1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

A Phred quality score (Q) is logarithmically related to its corresponding base-calling error probability (P), and they can be calculated from each other using the following formulas:

Q = -10 × log10(P)
P = 10^(-Q/10)

In FASTQ files, quality scores are typically displayed with an ASCII character that is a representation of the quality score – one reason for this is to save space – instead of writing “40” (2 characters) one can simply write “I” (1 character). A range of quality scores and their corresponding error probabilities and ASCII characters is presented below (NB: on the Sanger Phred scale the ASCII characters have a +33 offset to the quality score).
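The two mappings (ASCII character to Phred score, and Phred score to error probability) are one-liners in code; a minimal Python sketch, assuming the Sanger +33 offset described above:

```python
def ascii_to_phred(ch):
    """Sanger FASTQ encoding: the ASCII code minus the +33 offset."""
    return ord(ch) - 33

def phred_to_prob(q):
    """P = 10^(-Q/10): the probability that the base call is wrong."""
    return 10 ** (-q / 10)

print(ascii_to_phred("I"))   # 40
print(phred_to_prob(40))     # 0.0001
```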

Phred Quality (Q) Score   Probability (P) of Base Calling Error   Q + 33 (Sanger)   ASCII Q Character
40                        0.0001                                  73                I
39                        0.000125893                             72                H
38                        0.000158489                             71                G
37                        0.000199526                             70                F
36                        0.000251189                             69                E
35                        0.000316228                             68                D
34                        0.000398107                             67                C
33                        0.000501187                             66                B
32                        0.000630957                             65                A
31                        0.000794328                             64                @
30                        0.001                                   63                ?
20                        0.01                                    53                5
10                        0.1                                     43                +
1                         0.794328235                             34                "
0                         1                                       33                !

At one of the highest quality scores, Q40, there is a 1 in 10,000 probability of a base being called incorrectly by the sequencing machine. Although this is a very low probability, the ultra-deep coverage of viral samples can result in a coverage of well over 10,000 at each position in the genome. This implies that even if every base were of the highest quality, we would still expect errors to be introduced via base miscalls from the Illumina machine.

Therefore, in order to identify true viral variants present in the population at low frequency, one must handle base miscall errors from the Illumina machine. A simple way to do this is as follows:

  1. For a given genome position, retrieve all the quality scores of every individual base aligned at that position from the reads.
  2. Convert each quality score to a probability of error using the formulas above.
  3. Calculate an average (mean) probability of error at the given genome position.
  4. Test whether the observed number of mutations is above that expected from Illumina sequence errors alone. This can be done statistically by using a binomial distribution where the probability of observing M or more mutations is given by Binom(M; AvP/3, C), where C is the coverage, AvP is the average probability of error calculated from the quality scores, and AvP/3 represents the probability of the specific mutation observed in the reads. Using a 5% threshold, if [1 – Binom(M; AvP/3, C)] is less than 0.05 then one can say the variant is validated.

For example, at genome position 1034 the reference base is an A, we observe a coverage of 35,000 composed of 34,993 As and 7 G mutations, and evaluating the quality scores at this position gives an average probability of error of 0.00032. Using the [1 – Binom(M; AvP/3, C)] formula (where C = 35,000, AvP/3 = 0.00032/3, and M = 7) gives a P value of  0.08476547 which is not below the 0.05 threshold and so is not a validated low frequency mutation.
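The steps above can be sketched in Python using only the standard library; plugging in the numbers from the worked example reproduces the quoted P value:

```python
from math import comb

def variant_pvalue(M, C, avg_err):
    """Probability of observing M or more copies of one specific mutation
    by sequencing error alone: 1 - BinomCDF(M - 1; C, avg_err / 3)."""
    p = avg_err / 3.0  # probability of one *specific* wrong base being called
    cdf = sum(comb(C, k) * (p ** k) * ((1.0 - p) ** (C - k)) for k in range(M))
    return 1.0 - cdf

# Worked example from the text: genome position 1034, coverage 35,000,
# 7 G mutations, average error probability 0.00032.
pval = variant_pvalue(M=7, C=35000, avg_err=0.00032)
print(round(pval, 4))  # ~0.0848, above the 0.05 threshold, so not validated
```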

This is a relatively simple approach that was proposed by Morelli et al. 2013 and is also incorporated into our own tool DiversiTools. However, validation of low frequency variants is further complicated by issues such as Strand Bias and Illumina Systematic Errors, as well as RT-PCR errors and sample contamination.

A good website with a full table of quality scores and error probabilities is available here, along with information on older Illumina quality scores (which utilise slightly different formulas).

Illumina adapter and primer sequences

Illumina Adapter and Primer Sequences

Illumina libraries are normally constructed by ligating adapters to short fragments (100 – 1000bp) of DNA. The exception to this is if Nextera is used (see end of this post) or where PCR amplicons have been constructed that already incorporate the P5/P7 ends that bind to the flowcell.

Illumina Paired End Adapters (cannot be used for multiplexing)

Top adapter
5′ ACACTCTTTCCCTACACGACGCTCTTCCGATC*T 3’

Bottom adapter
5′ P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG 3’

Note that the last 12nt of the adapters are complementary (when the bottom adapter is viewed 3’-5’ as below) hence the name ‘forked adapters’. The adapters are annealed together then ligated to both ends of the library DNA. The bottom adapter is 5’-phosphorylated in order to promote ligation. The top adapter has a phosphorothioate bond (*) before the terminal T, ensuring that exonucleases cannot digest the T overhang that pairs to the A-tail added to library fragments.
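This pairing is easy to verify programmatically; a short Python check using the adapter sequences copied from above (with the 5’ phosphate and the phosphorothioate mark dropped):

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

top = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"    # top adapter, '*' omitted
bottom = "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG"  # bottom adapter, 5' P omitted

# The 12 nt before the top adapter's T overhang pair with the
# first 12 nt of the bottom adapter, forming the stem of the fork.
print(top[-13:-1] == revcomp(bottom[:12]))  # True
```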

Structure of Illumina forked PE adapter

PCR with partially complementary primers then extends the ends and resolves the forks, adding unique termini that bind to the oligos on the surface of the flow cell (P5 blue/P7 red, also see diagram at foot of page).

PE PCR Primer 1.0 (P5) (same as universal adapter)
5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T 3’

PE PCR Primer 2.0 (P7)
5' CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC*T 3’

Structure of Illumina TruSeq™ indexed forked adapters

The last 12nt of the adapters are complementary, allowing them to anneal and form the forked structure. The adapter is ligated to both ends of the A-tailed DNA library, generating larger floppy overhangs than with the paired-end adapters above. Note that while the top adapter is identical to the Illumina Universal oligo, the bottom adapter differs from the PE adapter in the purple highlighted section. The adapter already has the index and complete P7/P5 ends.

PCR with the following primers resolves the forked ends to generate products with no floppy overhangs. The sequences that bind to the flow cell (P5 blue/P7 red) finish up at opposite ends of the library fragments.

PCR Primer 1.0 (P5)
5’ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGA 3’

PCR Primer 2.0 (P7)
5’ CAAGCAGAAGACGGCATACGAGAT 3’

The following oligos (provided in the MiSeq reagent cartridge) are used to prime the sequence reads. Note that the index read primer is complementary to the Read 2 sequencing primer (see diagram below). This is used to sequence the hexamer index tag in the forward direction after read 1 is complete, before the reverse strand is synthesised by bridge amplification.

Multiplexing Read 1 Sequencing Primer
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

Multiplexing Index Read Sequencing Primer
5' GATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Multiplexing Read 2 Sequencing Primer
5' GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

When ordering primers for use in Illumina libraries, make certain to include the modifications (e.g. 5’-phosphorylation and phosphorothioate bonds on the 3' terminal nucleotide) and ensure the oligos are PAGE purified. Even small amounts of n-1 primers will lead to messy out-of-phase sequencing and cause clusters to fail filtering. Costs per oligo for 0.2µmole synthesis scale and PAGE purification are in the region of £40.

TruSeq™ DNA Sample Prep Kit v2

There are currently two versions of the kit, each with 12 different adapters that incorporate unique index tags – allowing samples to be multiplexed on the same sequencing run.

Kit A contains indexes: 2, 4, 5, 6, 7, 12, 20, 21, 22, 23, 25, 27.
Kit B contains indexes: 1, 3, 8, 9, 10, 11, 13, 14, 15, 16, 18, 19.

NOTE that all the indexed adapters should be 5’-phosphorylated. For unknown reasons, adapters 13-27 have an extra 2 bases (these are not used for the indexing). Illumina also reserve certain numbers, e.g. 17, 24 and 26. The 6-base index tag sequences sit between the common flanking sequences in the adapters below.

TruSeq Universal Adapter
5’ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T

TruSeq Adapter, Index 1
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 2
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 3
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACTTAGGCATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 4
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 5
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 6
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 7
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 8
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTTGAATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 9
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 10
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACTAGCTTATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 11
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 12
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 13
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTCAACAATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 14
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCCGTATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 15
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGTCAGAATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 16
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 18
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTCCGCACATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 19
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGAAACGATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 20
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTGGCCTTATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 21
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 22
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGTACGTAATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 23
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTGGATATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 25
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACACTGATATATCTCGTATGCCGTCTTCTGCTTG
TruSeq Adapter, Index 27
5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTTTATCTCGTATGCCGTCTTCTGCTTG
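Since all the indexed adapters above share the same stem and 3’ end, the index tag can be extracted programmatically as a sanity check. A minimal Python sketch (the function name is ours):

```python
# Common TruSeq stem and 3' tail flanking the index tag (from the sequences above)
STEM = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
TAIL = "ATCTCGTATGCCGTCTTCTGCTTG"

def extract_index(adapter):
    """Return the index tag sandwiched between the common stem and tail."""
    if not (adapter.startswith(STEM) and adapter.endswith(TAIL)):
        raise ValueError("not a TruSeq indexed adapter")
    return adapter[len(STEM):-len(TAIL)]

# Index 1 yields the 6-base tag ATCACG; Index 13 yields AGTCAACA
# (6-base tag plus the two extra bases mentioned above).
print(extract_index("GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG"))  # -> ATCACG
```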

NEBNext® DNA Library Prep

The NEB kit uses a short adapter which is supplied as a single self-complementary oligo forming a stem-loop. It has a uracil base that is later cleaved and removed by the uracil glycosylase and base excision repair enzyme mix (USER, Uracil-Specific Excision Reagent).

NEBNext adaptor for Illumina

5’ P-GATCGGAAGAGCACACGTCTGAACTCCAGTC-U-ACACTCTTTCCCTACACGACGCTCTTCCGATC*T 3’

The oligo is designed to self-anneal, forming a stem-loop structure. This may help to prevent the formation of adapter dimers during ligation.

The Index tags and the P5/P7 ends are added by PCR using universal and tagged primers. The end result is exactly the same as TruSeq.

NEBNext Universal primer
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T

NEBNext Indexed primers 1–12 (6-base indexes)

Index 1 CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 2 CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 3 CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 4 CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 5 CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 6 CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 7 CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 8 CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 9 CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 10 CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 11 CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T
Index 12 CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCAGACGTGTGCTCTTCCGATC*T

SURESELECT (POST-CAPTURE INDEXING)

This begins with a shorter bottom adapter that is extended to add the P5 end in the pre-capture PCR. The post-capture PCR step adds the index and the P7 end. Note: in my comparative tests, the NEB adapter is more efficient than the InPE adapter.

InPE adapter (indexing paired end adapter)

PRE-CAPTURE PCR

PCR Primer 1.0 [Tm 70deg]
5’ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACG*A

Multiplex PCR Primer 2.0 [Tm 67deg]
5' GTGACTGGAGTTCAGACGTGTGCTCTTCCGAT*C

POST-CAPTURE PCR Indexing Primers
2nd PCR reaction (post-capture amplification) adds indexes and P7 sequence.

Universal Primer [Tm 75deg]
5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T

PCR Primer, Index 1 5’ CAAGCAGAAGACGGCATACGAGATCGTGATGTGACTGGAGTT*C
PCR Primer, Index 2 5’ CAAGCAGAAGACGGCATACGAGATACATCGGTGACTGGAGTT*C
PCR Primer, Index 3 5’ CAAGCAGAAGACGGCATACGAGATGCCTAAGTGACTGGAGTT*C
PCR Primer, Index 4 5’ CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTT*C
PCR Primer, Index 5 5’ CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTT*C
PCR Primer, Index 6 5’ CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTT*C
PCR Primer, Index 7 5’ CAAGCAGAAGACGGCATACGAGATGATCTGGTGACTGGAGTT*C
PCR Primer, Index 8 5’ CAAGCAGAAGACGGCATACGAGATTCAAGTGTGACTGGAGTT*C
PCR Primer, Index 9 5’ CAAGCAGAAGACGGCATACGAGATCTGATCGTGACTGGAGTT*C
PCR Primer, Index 10 5’ CAAGCAGAAGACGGCATACGAGATAAGCTAGTGACTGGAGTT*C
PCR Primer, Index 11 5’ CAAGCAGAAGACGGCATACGAGATGTAGCCGTGACTGGAGTT*C
PCR Primer, Index 12 5’ CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTT*C

Guidelines for Low-Level Pooling
Some sequencing experiments require the use of fewer than 12 index sequences in a lane with a high cluster density. In such cases, a careful selection of indexes is required to ensure optimum cluster discrimination by having different bases at each cycle of the index read. Illumina recommends the following sets of indexes for low-level pooling experiments:

Pool of 2 samples:
Index 6 GCCAAT
Index 12 CTTGTA

Pool of 3 samples:
Index 4 TGACCA
Index 6 GCCAAT
Index 12 CTTGTA

Pool of 6 samples:
Index 2 CGATGT
Index 4 TGACCA
Index 5 ACAGTG
Index 6 GCCAAT
Index 7 CAGATC
Index 12 CTTGTA
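The "different bases at each cycle" rule above is easy to verify programmatically. A small sketch (a hypothetical helper, not an Illumina tool) that checks whether a pool of index tags is base-balanced at every cycle:

```python
def bases_per_cycle(indexes):
    """Distinct bases seen at each index-read cycle across a pool of tags."""
    return [set(cycle) for cycle in zip(*indexes)]

pool_of_2 = ["GCCAAT", "CTTGTA"]  # Index 6 and Index 12, as recommended above
ok = all(len(seen) > 1 for seen in bases_per_cycle(pool_of_2))
print("balanced at every cycle:", ok)  # -> balanced at every cycle: True
```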

Nextera Sample Preparation

The sequences in Nextera libraries are different from those in all the other workflows.

Nextera® transposase sequences (FC-121-1031, FC-121-1030)

The Tn5 transposase cuts the sample DNA and adds one of the following sequences at either end of each fragment, with the shared 19-base 3’ portion (AGATGTGTATAAGAGACAG) next to the library insert.

5’ TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
Read 1 >

5’ GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
Read 2 >

Nextera® Index Kit – PCR primers (FC-121-1012, FC-121-1011)

PCR with the following primers adds the P5 and P7 termini that bind to the flowcell, and also the dual 8-bp index tags (denoted i5 and i7 below).

5’ AATGATACGGCGACCACCGAGATCTACAC[i5]TCGTCGGCAGCGTC

5’ CAAGCAGAAGACGGCATACGAGAT[i7]GTCTCGTGGGCTCGG

When trimming adapters from Nextera runs, the reads should be cut at CTGTCTCTTATACACATCT instead of the usual AGATCGGAAGAGC. Use of cutadapt, Trim Galore or a similar program with this custom adapter specified is recommended.
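For illustration only, a bare-bones Python version of this trimming step (real runs should use cutadapt or Trim Galore, which also handle mismatches and partial adapter matches at read ends):

```python
# Nextera adapter read-through sequence (reverse complement of the mosaic end)
NEXTERA_ADAPTER = "CTGTCTCTTATACACATCT"

def trim_nextera(read):
    """Clip the read at the first exact occurrence of the Nextera adapter."""
    pos = read.find(NEXTERA_ADAPTER)
    return read if pos == -1 else read[:pos]

print(trim_nextera("ACGTACGT" + NEXTERA_ADAPTER + "GGCCTTAA"))  # -> ACGTACGT
```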

Calculating dNdS for NGS datasets

DiversiTools

Our upcoming tool DiversiTools calculates the dN/dS ratio at each site and codon, and also for the sample as a whole. Here is an explanation of the theory behind it.

dN/dS

dN/dS is the ratio of the number of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS), which can be used as an indicator of the selective pressure acting on a protein-coding gene. A ratio greater than one implies positive or Darwinian selection, less than one implies purifying selection, and a ratio of one indicates neutral evolution.

A synonymous substitution is a nucleotide mutation that does not alter the amino acid (AA) of a protein; also called a silent mutation. However, synonymous substitutions can still potentially affect transcription, splicing, mRNA transport or translation, any of which could alter the phenotype and render the synonymous mutation non-silent.

A nonsynonymous substitution is a nucleotide mutation that alters the AA of a protein; also called a missense mutation. A nonsynonymous substitution is instead called a nonsense mutation if it changes the codon into a (premature) stop codon (i.e. TAA, TAG or TGA).

We will now consider two example AA codons and evaluate the types of mutations for each: Tryptophan, which is mutationally fragile, and Arginine, which is mutationally robust. A codon consists of three nucleotides (the so-called triplet code), and as there are four possible nucleotides (A, C, G, T), there are a total of 9 possible single-base mutations to consider for each codon.

TGG - Tryptophan - Trp - W

#   Mutation            Codon   AA     Type                       N-Sites
–   (original codon)    TGG     Trp    –                          –
1   Position 1 (T->A)   AGG     Arg    Nonsynonymous (missense)   1/3
2   Position 1 (T->C)   CGG     Arg    Nonsynonymous (missense)   1/3
3   Position 1 (T->G)   GGG     Gly    Nonsynonymous (missense)   1/3
4   Position 2 (G->A)   TAG     Stop   Nonsynonymous (nonsense)   1/3
5   Position 2 (G->C)   TCG     Ser    Nonsynonymous (missense)   1/3
6   Position 2 (G->T)   TTG     Leu    Nonsynonymous (missense)   1/3
7   Position 3 (G->A)   TGA     Stop   Nonsynonymous (nonsense)   1/3
8   Position 3 (G->C)   TGC     Cys    Nonsynonymous (missense)   1/3
9   Position 3 (G->T)   TGT     Cys    Nonsynonymous (missense)   1/3

Any mutation at any of the three codon positions for Tryptophan results in a nonsynonymous mutation. This emphasises why the expected number of nonsynonymous mutations  – or the number of nonsynonymous sites – should be taken into account in dN/dS analyses, as observed counts alone would lead to biased results. We will come to the N-Sites column later.

CGG - Arginine - Arg - R

#   Mutation            Codon   AA     Type                       N-Sites
–   (original codon)    CGG     Arg    –                          –
1   Position 1 (C->A)   AGG     Arg    Synonymous                 0
2   Position 1 (C->G)   GGG     Gly    Nonsynonymous (missense)   1/3
3   Position 1 (C->T)   TGG     Trp    Nonsynonymous (missense)   1/3
4   Position 2 (G->A)   CAG     Gln    Nonsynonymous (missense)   1/3
5   Position 2 (G->C)   CCG     Pro    Nonsynonymous (missense)   1/3
6   Position 2 (G->T)   CTG     Leu    Nonsynonymous (missense)   1/3
7   Position 3 (G->A)   CGA     Arg    Synonymous                 0
8   Position 3 (G->C)   CGC     Arg    Synonymous                 0
9   Position 3 (G->T)   CGT     Arg    Synonymous                 0

The Arginine codon CGG has 4 possible synonymous mutations and 5 nonsynonymous mutations. It can withstand any mutation at the 3rd codon position and still code for Arginine and also withstand one substitution at the 1st codon position. However, any mutation at the 2nd codon position is nonsynonymous as it results in an AA substitution.

This is true for all AA codons – any mutation at the 2nd codon position results in an AA substitution, whilst mutations at the 3rd codon position either cause a silent mutation or a nonsynonymous mutation that typically maintains the hydrophilicity/hydrophobicity of the AA. This is one reason why observed substitutions accumulate most readily at the 3rd codon position.
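Tables like the two above can be generated automatically from the standard genetic code. A minimal sketch (the one-line codon table encodes the standard code with bases in TCAG order; function name is ours):

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG x TCAG x TCAG order ("*" = stop)
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AA[16*i + 4*j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def classify_mutations(codon):
    """List all 9 single-base mutations of a codon, labelling each as
    synonymous, missense or nonsense."""
    rows = []
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos+1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                kind = "Synonymous"
            elif CODON_TABLE[mutant] == "*":
                kind = "Nonsynonymous (nonsense)"
            else:
                kind = "Nonsynonymous (missense)"
            rows.append((pos + 1, mutant, CODON_TABLE[mutant], kind))
    return rows

for pos, mutant, aa, kind in classify_mutations("TGG"):
    print(f"Position {pos}  {mutant}  {aa}  {kind}")
```

All nine mutations of TGG come back nonsynonymous, and CGG comes back with four synonymous mutations, matching the tables above.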

Expected Sites for a Reference Sequence

Central to the calculation of the dN/dS ratio is the expected number of nonsynonymous sites for each codon – this is why the dN/dS ratio is described in part as “nonsynonymous substitutions per nonsynonymous site”.

For any particular codon, we denote f_i to be the fraction of the three possible substitutions at the i-th codon position (i = 1, 2, 3) that are nonsynonymous. The number of nonsynonymous (n) and synonymous (s) sites for a codon is then calculated with the formulas:

n = f_1 + f_2 + f_3,    s = 3 − n

Using our previous examples, and the values already populated in the N-Sites column of the tables:

  • TGG – Tryptophan – Trp – W
  • n = (1/3 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) = 3
  • s = 3 – n = 3 – 3 = 0
  • The number of nonsynonymous sites for Tryptophan TGG is 3, whilst the number of synonymous sites is 0.
  • CGG – Arginine – Arg – R
  • n = (0 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) + (0 + 0 + 0) = 1.66666
  • s = 3 – n = 3 – 1.66666 = 1.33333
  • The number of nonsynonymous sites for Arginine CGG is 1.66666, whilst the number of synonymous sites is 1.33333.

For a protein-coding DNA sequence of length r codons, the total number of nonsynonymous (N) and synonymous (S) sites for the whole sequence can therefore be calculated by summing the individual n and s values from all codons using the formulas:

N = Σ_{j=1…r} n_j,    S = Σ_{j=1…r} s_j    (so that N + S = 3r)

As an example, consider the hypothetical reference sequence ATG – AAA – CCC – GGG – TTT – TAA. Given this sequence, N and S are calculated as:

Reference Expected Sites

Number (i)   Codon   Nonsynonymous (n)   Synonymous (s)
1            ATG     3                   0
2            AAA     2.666               0.333
3            CCC     2                   1
4            GGG     2                   1
5            TTT     2.666               0.333
6            TAA     2.333               0.666
Total                14.666 (N)          3.333 (S)
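The expected-site calculation above can be reproduced with a short script. A minimal sketch, counting stop-codon changes as nonsynonymous to match the tables above (the codon table string encodes the standard genetic code in TCAG order):

```python
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AA[16*i + 4*j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def codon_sites(codon):
    """n = sum over the 3 positions of the fraction of changes that are
    nonsynonymous (f_i); s = 3 - n."""
    n = sum(sum(CODON_TABLE[codon[:p] + b + codon[p+1:]] != CODON_TABLE[codon]
                for b in BASES if b != codon[p]) / 3
            for p in range(3))
    return n, 3 - n

def sequence_sites(seq):
    """Sum n and s over all codons of an in-frame sequence."""
    pairs = [codon_sites(seq[i:i+3]) for i in range(0, len(seq), 3)]
    return sum(n for n, _ in pairs), sum(s for _, s in pairs)

N, S = sequence_sites("ATGAAACCCGGGTTTTAA")
print(round(N, 3), round(S, 3))  # -> 14.667 3.333
```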

Observed Substitutions for a Sample Sequence

Now consider that we have a new sequence, our sample sequence, and we want to compare it to our original reference sequence, so we must count the number of observed nonsynonymous (nd) and synonymous (sd) mutations that our sample sequence has at each codon compared to the reference.

Sample Observed Mutations

Number (i)   Reference   Sample   Nonsynonymous (nd)   Synonymous (sd)
1            ATG         ATG      0                    0
2            AAA         AAA      0                    0
3            CCC         CGC      1                    0
4            GGG         GGC      0                    1
5            TTT         TAC      1                    1
6            TAA         TAA      0                    0
Total                             2 (Nd)               2 (Sd)

In the above, there are three codons that have mutations:

Codon 3: CCC (pro) -> CGC (arg): This has a single nucleotide mutation (distance = 1) resulting in a nonsynonymous AA substitution, and therefore nd = 1 and sd = 0 for this codon.

Codon 4: GGG (gly) -> GGC (gly): This has a single nucleotide mutation (distance = 1) that is synonymous (no AA substitution), and therefore nd = 0 and sd = 1 for this codon.

Codon 5: TTT (phe) -> TAC (tyr): This has two nucleotide mutations (distance = 2) and is a nonsynonymous AA substitution. However, due to the double mutation it is a more complex situation, as one must consider the two mutation pathways that could have led to this state (you cannot assume that because the AA change is nonsynonymous, both mutations were nonsynonymous):

  1. TTT (phe) –> TAT (tyr) -> TAC (tyr): 1 nonsynonymous and 1 synonymous mutation
  2. TTT (phe) –> TTC (phe) -> TAC (tyr): 1 synonymous and 1 nonsynonymous mutation
  • As we consider the two above pathways to occur with equal probability, for this codon nd = 1 and sd = 1 (sums to two as we have two mutations in the codon).

Hypothetically, let us expand the example above to a three-mutation (distance = 3) situation from TTT (phe) to GAC (asp). A three-mutation situation results in 6 possible mutation pathways:

  1. TTT (phe) -> TTC (phe) -> TAC (tyr) -> GAC (asp): 2n and 1s
  2. TTT (phe) -> TTC (phe) -> GTC (val) -> GAC (asp): 2n and 1s
  3. TTT (phe) -> TAT (tyr) -> TAC (tyr) -> GAC (asp): 2n and 1s
  4. TTT (phe) -> TAT (tyr) -> GAT (asp) -> GAC (asp): 2n and 1s
  5. TTT (phe) -> GTT (val) -> GTC (val) -> GAC (asp): 2n and 1s
  6. TTT (phe) -> GTT (val) -> GAT (asp) -> GAC (asp): 2n and 1s
  • As we consider the six above pathways to occur with equal probability, and each contains 2 nonsynonymous and 1 synonymous mutation, for this codon nd = 2 and sd = 1 (sums to three as we have three mutations in the codon).
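The pathway averaging generalises to any number of differences by enumerating the permutations of the mutated positions. A minimal sketch (the codon-table construction encodes the standard genetic code in TCAG order):

```python
from itertools import permutations

BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AA[16*i + 4*j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def observed_counts(ref, obs):
    """Average nd and sd over all equally weighted mutation pathways
    from the reference codon to the observed codon."""
    diffs = [p for p in range(3) if ref[p] != obs[p]]
    if not diffs:
        return 0.0, 0.0
    nd = sd = 0
    orders = list(permutations(diffs))
    for order in orders:
        current = ref
        for p in order:
            step = current[:p] + obs[p] + current[p+1:]
            if CODON_TABLE[step] == CODON_TABLE[current]:
                sd += 1
            else:
                nd += 1
            current = step
    return nd / len(orders), sd / len(orders)

print(observed_counts("TTT", "TAC"))  # -> (1.0, 1.0)
print(observed_counts("TTT", "GAC"))  # -> (2.0, 1.0)
```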

So, for a protein-coding DNA sequence of length r codons, the total number of observed nonsynonymous (Nd) and synonymous (Sd) mutations between two sequences can therefore be calculated by summing the individual nd and sd values from all codons using the formulas:

Nd = Σ_{j=1…r} nd_j,    Sd = Σ_{j=1…r} sd_j

dN/dS ratio

One then calculates the proportion of nonsynonymous (pN) and synonymous (pS) differences with the following equations:

pN = Nd / N,    pS = Sd / S

Then to estimate the number of nonsynonymous substitutions (dN) and synonymous substitutions (dS) per site, and the dN/dS ratio itself, we use the formulas:

dN = −(3/4) ln(1 − (4/3) pN),    dS = −(3/4) ln(1 − (4/3) pS),    dN/dS = dN / dS

(These are the Jukes–Cantor corrected distances, which account for multiple substitutions at the same site.)

So using our reference and sample sequence examples above we now have the following values:

  • N = 14.666
  • S = 3.333
  • Nd = 2
  • Sd = 2
  • pN = 0.1364
  • pS = 0.6001
  • dN = 0.1505
  • dS = 1.2074
  • dN/dS = 0.1247
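Putting the pieces together, a self-contained sketch that reproduces this worked example (any small differences from the values above are rounding only):

```python
import math
from itertools import permutations

BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AA[16*i + 4*j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def codon_sites(codon):
    """Expected nonsynonymous (n) and synonymous (s) sites of a codon."""
    n = sum(sum(CODON_TABLE[codon[:p] + b + codon[p+1:]] != CODON_TABLE[codon]
                for b in BASES if b != codon[p]) / 3
            for p in range(3))
    return n, 3 - n

def observed_counts(ref, obs):
    """Observed nd and sd, averaged over all equally weighted pathways."""
    diffs = [p for p in range(3) if ref[p] != obs[p]]
    if not diffs:
        return 0.0, 0.0
    nd = sd = 0
    orders = list(permutations(diffs))
    for order in orders:
        current = ref
        for p in order:
            step = current[:p] + obs[p] + current[p+1:]
            if CODON_TABLE[step] == CODON_TABLE[current]:
                sd += 1
            else:
                nd += 1
            current = step
    return nd / len(orders), sd / len(orders)

def dn_ds(ref_seq, sample_seq):
    """pN = Nd/N, pS = Sd/S, then the Jukes-Cantor correction."""
    N = S = Nd = Sd = 0.0
    for i in range(0, len(ref_seq), 3):
        n, s = codon_sites(ref_seq[i:i+3])
        nd, sd = observed_counts(ref_seq[i:i+3], sample_seq[i:i+3])
        N += n; S += s; Nd += nd; Sd += sd
    dN = -0.75 * math.log(1 - 4 * (Nd / N) / 3)
    dS = -0.75 * math.log(1 - 4 * (Sd / S) / 3)
    return dN, dS, dN / dS

dN, dS, ratio = dn_ds("ATGAAACCCGGGTTTTAA", "ATGAAACGCGGCTACTAA")
print(round(dN, 4), round(dS, 4), round(ratio, 4))  # -> 0.1505 1.2071 0.1247
```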

NGS Datasets

The expansion of all of the above to NGS datasets is relatively straightforward. The calculation of the number of nonsynonymous and synonymous sites in the reference sequence proceeds exactly as above.

One must then calculate the number of observed nonsynonymous and synonymous mutations in the reads when compared to the reference (this will need information on where the open reading frame [ORF] starts and stops). There are two options when calculating the observed numbers:

  1. Consider all observed mutations in the reads covering any part of the codon.
  2. Consider only those mutations where the read fully covers the codon the mutation occurs in, i.e. ignore partially covered codons at the read ends, or partially covered codons due to indels.

To adapt to NGS datasets, one must consider read coverage. The approach taken by Morelli et al. (2013) adjusts the formula for pN (and likewise pS) as follows to take the read coverage (c) at each codon into account:

pN = Σ_{j=1…r} [ (1/c_j) Σ_{k=1…c_j} (nd_{jk} / n_j) ]    (and likewise for pS, with sd and s)

where c_j is the read coverage at codon j and nd_{jk} is the number of nonsynonymous mutations observed in read k at codon j.

Essentially, for each of the c reads covering a particular codon, the observed number of nonsynonymous mutations in the read compared to the reference codon is calculated and divided by the expected number of sites. The values for all reads at the codon are then summed and averaged, and the values for all codons are summed to give a single value for the whole ORF. These values of pN and pS can then be plugged into the same dN and dS formulas to calculate the dN/dS ratio.
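A hedged sketch of this coverage-weighted calculation; the input format (per-codon lists of per-read mutation counts) is our own illustration, not DiversiTools’ actual interface:

```python
def coverage_weighted_p(per_codon_reads, per_codon_sites):
    """Coverage-weighted pN (or pS) in the spirit of Morelli et al. (2013).

    per_codon_reads: for each codon, a list with one observed mutation
      count per read covering that codon (hypothetical input format).
    per_codon_sites: for each codon, the expected number of sites (n or s).
    """
    total = 0.0
    for counts, sites in zip(per_codon_reads, per_codon_sites):
        if counts and sites:
            # average the per-read observed/expected ratio over the c reads
            total += sum(x / sites for x in counts) / len(counts)
    return total

# Toy example: one codon covered by 4 reads, one of which carries a
# nonsynonymous change; expected nonsynonymous sites n = 2.5.
print(coverage_weighted_p([[1, 0, 0, 0]], [2.5]))  # -> 0.1
```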

References

Nei M & Gojobori T (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular Biology and Evolution 3(5): 418–426.

Morelli MJ et al. (2013). Evolution of foot-and-mouth disease virus intra-sample sequence diversity during serial transmission in bovine hosts. Veterinary Research 44: 12.