Calculating dNdS for NGS datasets

Post by: Richard Orton
November 16, 2014
22 Comments

vNvS

Our upcoming tool vNvS calculates the dN/dS ratio at each site, codon and also for the sample as a whole, here is an explanation of the theory behind it. vNvS is currently in development – for more information email Richard.Orton@glasgow.ac.uk

dN/dS

dN/dS is the ratio of the number of nonsynonymous substitutions per non-synonymous site (p_N) to the number of synonymous substitutions per synonymous site (p_S), which can be used as an indicator of selective pressure acting on a protein coding gene. A ratio greater than one implies positive or darwinian selection, less than one implies purifying selection, and a ratio of one indicates neutral selection.

A synonymous substitution is a nucleotide mutation that does not alter the amino acid (AA) of a protein; also called a silent mutation. However, synonymous substitutions can still potentially affect transcription, splicing, mRNA transport or translation, any of which could alter the phenotype and render the synonymous mutation non-silent.

A nonsynonymous substitution is a nucleotide mutation that alters the AA of a protein; also called a missense mutation. However, nonsynonymous substitutions are called nonsense mutations if the codon changes into a (premature) stop codon (i.e. TAA, TAG, TGA).

We will now consider two example AA codons and evaluate the types of mutations for each: Tryptophan which is mutationally fragile and Arginine which is mutationally robust. A codon consists of three nucleotides (the so called triplet code), and as there are four possible nucleotides (A, C, G, T), there are a total of 9 possible mutations to consider for each codon.

TGG - Tryptophan - Trp - W

	Mutation	Codon	AA	Type	N-Sites
		TGG	Trp

1	Position 1 (T->A)	AGG	Arg	Nonsynonymous (missense)	1/3
2	Position 1 (T->C)	CGG	Arg	Nonsynonymous (missense)	1/3
3	Position 1 (T->G)	GGG	Gly	Nonsynonymous (missense)	1/3

4	Position 2 (G->A)	TAG	Stop	Nonsynonymous (nonsense)	1/3
5	Position 2 (G->C)	TCG	Ser	Nonsynonymous (missense)	1/3
6	Position 2 (G->T)	TTG	Leu	Nonsynonymous (missense)	1/3

7	Position 3 (G->A)	TGA	Stop	Nonsynonymous (nonsense)	1/3
8	Position 3 (G->C)	TGC	Cys	Nonsynonymous (missense)	1/3
9	Position 3 (G->T)	TGT	Cys	Nonsynonymous (missense)	1/3

Any mutation at any of the three codon positions for Tryptophan results in a nonsynonymous mutation. This emphasises why the expected number of nonsynonymous mutations – or the number of nonsynonymous sites – should be taken into account in dN/dS analyses, as observed counts alone would lead to biased results. We will come to the N-Sites column later.

CGG - Arginine - Arg - R

	Mutation	Codon	AA	Type	N-Sites
		CGG	Arg

1	Position 1 (C->A)	AGG	Arg	Synonymous	0
2	Position 1 (C->G)	GGG	Gly	Nonsynonymous (missense)	1/3
3	Position 1 (C->T)	TGG	Trp	Nonsynonymous (missense)	1/3

4	Position 2 (G->A)	CAG	Gln	Nonsynonymous (missense)	1/3
5	Position 2 (G->C)	CCG	Pro	Nonsynonymous (missense)	1/3
6	Position 2 (G->T)	CTG	Leu	Nonsynonymous (missense)	1/3

7	Position 3 (G->A)	CGA	Arg	Synonymous	0
8	Position 3 (G->C)	CGC	Arg	Synonymous	0
9	Position 3 (G->T)	CGT	Arg	Synonymous	0

The Arginine codon CGG has 4 possible synonymous mutations and 5 nonsynonymous mutations. It can withstand any mutation at the 3^rd codon position and still code for Arginine and also withstand one substitution at the 1^st codon position. However, any mutation at the 2^nd codon position is nonsynonymous as it results in an AA substitution.

This is true for all AA codons – any mutation at the 2^nd codon position will result in an AA substitution, whilst mutations at the 3^rd codon position either cause a silent mutation or a nonsynonymous mutation that typically maintains the hydrophilicity/hydrophobicity of the AA. This is one reason why the 3^rd codon position is most liable to mutation.

Expected Sites for a Reference Sequence

Central to the calculation of the dN/dS ratio is calculating the expected number of nonsynonymous sites for each codon – this is why the dN/dS ratio is described in part as “nonsynonymous substitutions per non-synonymous site”.

For any particular codon, we denote f_i to be the fraction of substitutions that are nonsynonymous at the i^th codon position (i = 1, 2, 3). The number of nonsynonymous (n) and synonymous (s) sites for a codon is then calculated with the formulas:

eq1

Using our previous examples, and the values already populated in the N-Sites column of the tables:

TGG – Tryptophan – Trp – W
n = (1/3 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) = 3
s = 3 – n = 3 – 3 = 0
The number of nonsynonymous sites for Tryptophan TGG is 3, whilst the number of synonymous sites is 0.

CGG – Arginine – Arg – R
n = (0 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) + (0 + 0 + 0) = 1.66666
s = 3 – n = 3 – 1.66666 = 1.33333
The number of nonsynonymous sites for Arginine CGG is 1.66666, whilst the number of synonymous sites is 1.33333.

For a protein encoding DNA sequence of length r codons, the total number of nonsynonymous (N) and synonymous (S) sites for the whole sequence can therefore be calculated by summing up the individual n and s values from all codons using the formulas:

eq2

As an example, consider the hypothetical reference sequence ATG – AAA – CCC – GGG – TTT – TAA. Given this sequence, N and S are calculated as:

Reference Expected Sites

Number (i)	Codon	Nonsynonymous (n)	Synonymous (s)
1	ATG	3	0
2	AAA	2.666	0.333
3	CCC	2	1
4	GGG	2	1
5	TTT	2.666	0.333
6	TAA	2.333	0.666

Total		14.666 (N)	3.333 (S)

Observed Substitutions for a Sample Sequence

Now consider that we have a new sequence, our sample sequence, and we want to compare it to our original reference sequence, so we must count the number of observed nonsynonymous (n_d) and synonymous (s_d) mutations that our sample sequence has at each codon compared to the reference.

Sample Observed Mutations

Number (i)	Reference	Sample	Nonsynonymous (nd)	Synonymous (sd)
1	ATG	ATG	0	0
2	AAA	AAA	0	0
3	CCC	CGC	1	0
4	GGG	GGC	0	1
5	TTT	TAC	1	1
6	TAA	TAA	0	0

Total			2 (Nd)	2 (Sd)

In the above, there are three codons that have mutations:

Codon 3: CCC (pro) -> CGC (arg): This has a single nucleotide mutation (distance = 1) resulting in a nonsynonymous AA substitution, and therefore n_d = 1 and s_d = 0 for this codon.

Codon 4: GGG (gly) -> GGC (gly): This has a single nucleotide mutation (distance = 1) resulting in a synonymous AA substitution, and therefore n_d = 0 and s_d = 1 for this codon.

Codon 5: TTT (phe) -> TAC (tyr): This has two nucleotide mutations (distance = 2) and is a nonsynonymous AA substitution. However, due to the double mutation it is a more complex situation, as one must consider the two mutation pathways that could have led to this state (you can not assume that as the AA is nonsynonymous then both mutations are nonsynonymous):

TTT (phe) –> TAT (tyr) -> TAC (tyr): 1 nonsynonymous and 1 synonymous mutation
TTT (phe) –> TTC (phe) -> TAC (tyr): 1 synonymous and 1 nonsynonymous mutation

As we consider the two above pathways to occur with equal probability, for this codon n_d = 1 and s_d = 1 (sums to two as we have two mutations in the codon).

Hypothetically, let us expand the example above to a three mutation (distance = 3) situation from TTT (phe) to GAC (asp). A three mutation situation results in 6 possible mutation pathways:

TTT (phe) -> TTC (phe) -> TAC (tyr) -> GAC (asp): 2n and 1s
TTT (phe) -> TTC (phe) -> GTC (val) -> GAC (asp): 2n and 1s
TTT (phe) -> TAT (tyr) -> TAC (val) -> GAC (asp): 3n
TTT (phe) -> TAT (tyr) -> GAT (val) -> GAC (asp): 3n
TTT (phe) -> GTT (val) -> GTC (val) -> GAC (asp): 2n and 1s
TTT (phe) -> GTT (val) -> GAT (val) -> GAC (asp): 2n and 1s

As we consider the six above pathways to occur with equal probability, for this codon n_d = 2.333 and s_d = 0.666 (sums to three as we have three mutations in the codon).

So, for a protein encoding DNA sequence of length r codons, the total number of observed nonsynonymous (N_d) and synonymous (S_d) mutations between two sequences can therefore be calculated by summing up the individual n_d and s_d values from all codons using the formulas:

eq3

dN/dS ratio

One then calculates the proportion of nonsynonymous (p_n) and synonymous (p_s) differences with the following equations:

eq4

Then to estimate the number of nonsynonymous substitutions (d_N) and synonymous substitutions (d_S) per site, and the dN/dS ratio itself, we use the formulas:

So using our reference and sample sequence examples above we now have the following values:

N = 14.666
S = 3.333
N_d = 2
S_d = 2
p_N = 0.1364
p_S = 0.6001
d_N = 0.1505
d_S = 1.2074
dN/dS = 0.1247

NGS Datasets

The expansion of all of the above to NGS datasets is relatively straightforward. The calculation of the number of nonsynonymous and synonymous sites in the reference sequences proceeds the same as above.

One must then calculate the number of observed nonsynonymous and synonymous mutations in the reads when compared to the reference (this will need information on where the open reading frame [ORF] starts and stops). There are two options when calculating the observed numbers:

Consider all observed mutations in the reads covering any part of the codon.
Consider only those mutations where the read fully covers the codon the mutation occurs in, i.e. ignore partially covered codons at the read ends, or partially covered codons due to indels.

To adapt to NGS datasets, one must consider read coverage. The approach taken by Morelli et al (2013) adjusts the formula for p_N (and likewise p_S) as follows to take into account the read coverage (c) at each codon:

Essentially, for each read (c) that covers a particular codon, the observed number of nonsynonymous mutations in the read compared to the reference codon is calculated, and divided by the expected number. The values for all reads at the codon are then summed and averaged. Then the value for all codons is summed to give a single value for the whole ORF. This value of p_N and p_S can they be plugged in to the same dN and dS formulas to calculate the dN/dS ratio.

References

Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions

Evolution of foot-and-mouth disease virus intra-sample sequence diversity during serial transmission in bovine hosts

Categories: Deep Sequencing

22 Comments on “Calculating dNdS for NGS datasets”

Quan Gu says:

November 17, 2014 at 10:31 am

Powerful!!!

bljog says:

December 2, 2014 at 9:43 am

Great post! Thanks

Eric Enns says:

June 1, 2015 at 4:54 pm

In your “Reference Expected Sites” table the first row for ATG is incorrect it should be 3 for NS and 0 for S.

Richard Orton says:

June 2, 2015 at 9:17 am

Thanks Eric – have updated the values now

Eric Enns says:

June 2, 2015 at 6:13 pm

One more thing the final equation in your blog is it missing a 1/r at the beginning, so before the first summation. I believe you are referring to the last equation in Materials and Methods section of the paper, which includes a 1/ncod.

Richard Orton says:

June 29, 2015 at 11:14 am

I think that is actually a typo in the paper, for NGS you must correct for the coverage, but then there is no need to give an average per codon value, simply as this is not done for normal datasets – there is no averaging per codon there.

Richard Orton says:

June 29, 2015 at 11:23 am

Actually – I was thinking about a different formula there, for that last one there is a correction for codon number for NGS

Hugo Araújo says:

September 8, 2015 at 2:59 pm

In your final results for the example, aren’t the pN and pS values wrong?

Richard Orton says:

September 21, 2015 at 2:50 pm

Hi – Yes they were! They weren’t updated after changing the ATG correction – updated now

Kai says:

April 21, 2016 at 12:46 pm

Hi, thanks for such a clear description! Could you point me to a table somewhere listing the N-sites for each amino?

Sharlie says:

March 28, 2017 at 12:45 pm

Hi
In your sample observed mutations, and then codon 5, there were 2 possible options:
1. TTT (phe) –> TAT (tyr) -> TAC (tyr): 2 nonsynonymous mutations
2. TTT (phe) –> TTC (phe) -> TAC (tyr): 1 nonsynonymous and 1 synonymous mutation

But i thought that the first option would be 1 nonsynonymous mutation and 1 synonymous mutation, because TTT (phe) -> TAT (tyr) is a nonsynonymous one and TAT (tyr) -> TAC (tyr) is synonymous.

Or is it like: TTT (phe) -> TTC (tyr) is nonsynonymous and TTC (phe) -> TAC (tyr) is nonsynonymous?

1. invisible says:
  
  April 7, 2017 at 6:52 pm
  
  I am having problem.
  
  In case of 3 mutations. Amino acids are wrongly annotated.too.
  
  1. Richard Orton says:
    
    August 1, 2017 at 4:58 pm
    
    Fixed it
    
2. Richard Orton says:
  
  August 1, 2017 at 4:58 pm
  
  Hi – Fixed it – should of been 1N and 1S on each
  
Philip Ashton says:

August 25, 2017 at 7:00 am

This is a great post! Thanks for sharing.

Sucheta says:

September 12, 2017 at 12:33 pm

Is there a difference between dN/dS ration vs Ka/ks ratio?

1. Richard Orton says:
  
  October 23, 2017 at 10:28 pm
  
  They are used interchangeably and refer to the same ratio: https://en.wikipedia.org/wiki/Ka/Ks_ratio
  
2. Richard Orton says:
  
  October 24, 2017 at 11:36 am
  
  They are used interchangeably and refer to the same ratio: https://en.wikipedia.org/wiki/Ka/Ks_ratio
  
Jan Kimble says:

October 16, 2017 at 2:08 pm

Hi! I’m a little confused regarding translation of the codons. For example, in your tryptophan table you state that a change from TGG to AGG is a change from Trp to Arg. The mRNA sequence for TGG is ACC and the mRNA for AGG is UCC, so wouldn’t this be a change from amino acids threonine to serine?

1. Richard Orton says:
  
  October 23, 2017 at 10:26 pm
  
  If the DNA is TGG, then the corresponding mRNA is UGG, as the mRNA relates to the sense strand. Although the mRNA is generated through complementary binding to the anti-sense reverse strand (in this case ACC), the mRNA coded seq is a representation of the sense strand, so we are talking about the original TGG and the mutations related to that.
  
  https://en.wikipedia.org/wiki/Sense_(molecular_biology)
  
2. Richard Orton says:
  
  October 24, 2017 at 11:37 am
  
  If the DNA is TGG, then the corresponding mRNA is UGG, as the mRNA relates to the sense strand. Although the mRNA is generated through complementary binding to the anti-sense reverse strand (in this case ACC), the mRNA coded seq is a representation of the sense strand, so we are talking about the original TGG and the mutations related to that.
  
  https://en.wikipedia.org/wiki/Sense_(molecular_biology)
  
irene z.b. says:

February 7, 2018 at 10:11 am

Hi
is this tools still available?
I can’t manage to download it

any advice?

thanks
ib