A simple method to distinguish low frequency variants from Illumina sequence errors

RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences.

Next generation sequencing has revolutionised the study of viral populations by enabling the ultra deep sequencing of their genomes, and the subsequent identification of the full spectrum of variants within the population.

However, the current challenge is to accurately model the errors in the sequence data and distinguish real viral variants, particularly those that exist at low frequency, from errors introduced during sequencing.

A simple method to do this is to utilise the Illumina quality scores. Every sequenced base in every read is assigned a quality score that is a measure of the uncertainty of the base call using the Phred scale. An example FASTQ read is:

@SRR1036477.23 23 length=323
GTGTAACGCCCCTGATGCACTGAGGGACATGGGACTCTTTGGGCAAAATATGTACTACCACTACCTAGGTAGGTCCGGGTACACCGTGCATGTACAGTGTAACGCCCCAGATGCACTGAGGGACATGGGACTCTTTGGGCAAAATATGTACTACCACTACCTAGGCAGGTCCGGGTACACCGTGCATGTACAGGGTAAAGCCCCTGATGNACNGNNGNNNNNNGNNCNNTNNNNNNNNNNNNNNNNNNNNNNNNACNNAGGTGGNNCCGGNNNNANCNCGCATGGACGGTAGAATGTATGGAGTCAGTGTCGAGCACGAATGC
+SRR1036477.23 23 length=323
?????BB?DDDDDDDDFFFFFFIHIHHIIIIHFFH@FHHIHIHIIIIHFHIHHHHHIIIIIIIIIIHIIAFFHFGFHHH<CHHHHHHHHHHHHHHHFFH@DH.C4:A===D==..5=DD==A*;;,5=C,A3B,===,?A08A:***00*:::A:?EECEAC?CE?E?2*:*'4<.4:CA1'008C*?E*:?:0)*1:*0*0888)0*1#0.#0##.######.##0##0########################))##).))))##)))0####)#)#))).))*1*'')*)*0**1**1****0**1*11)...4*'0.00*

A Phred quality score (Q) is logarithmically related to it’s corresponding base-calling error probability (P), and they can be calculated from each other using the following formulas:

Q = -10 x log10 P
P = 10 ^ (-Q/10)

e.g. if P = 0.001, then Q = -10 x log10 0.001 = 30
e.g. if Q = 30, then P = 10 ^ (-30/10) = 0.001

In FASTQ files, quality scores are typically displayed with an ASCII character that is a representation of the quality score – one reason for this is to save space – instead of writing “40” (2 characters) one can simply write “I” (1 character). A range of quality scores and their corresponding error probabilities and ASCII characters is presented below (NB: on the Sanger Phred scale the ASCII characters have a +33 offset to the quality score).

Phred Quality (Q) ScoreProbability (P) of Base Calling ErrorQ + 33 (Sanger)ASCII Q Character
400.000173I
390.00012589372H
380.00015848971G
370.00019952670F
360.00025118969E
350.00031622868D
340.00039810767C
330.00050118766B
320.00063095765A
310.00079432864@
300.00163?
200.01535
100.143+
10.79432823534"
0133!

At one of the highest quality scores Q40, there is a 1 in 10,000 probability of a base being called by the sequence machine incorrectly. Although this is a very low probability, the ultra-deep coverage of viral samples can result in a coverage of well over 10K at each position in the genome. This implies that even if every base were of the highest quality, we would still expect errors to be introduced via base miscalls from the Illumina machine.

Therefore, in order to identify true viral variants present in the population at low frequency, one must handle base miscall errors from the Illumina machine. A simple way to do this is as follows:

  1. For a given genome position, retrieve all the quality scores of every individual base aligned at that position from the reads.
  2. Convert each quality score to a probability of error using the formulas above
  3. Calculate an average (mean) probability of error at the given genome position.
  4. Test whether the observed number of mutations is above that expected from Illumina sequence errors alone. This can be done statistically by using a binomial distribution where the probability of observing M or more mutations is given by Binom(M; AvP/3, C), where C is the coverage, AvP is the average probability of error calculated from the quality scores, and AvP/3 represents the probability of the specific mutation observed in the reads. Using a 5% threshold, if [1 – Binom(M; AvP/3, C)] is less than 0.05 then one can say the variant is validated.

For example, at genome position 1034 the reference base is an A, we observe a coverage of 35,000 composed of 34,993 As and 7 G mutations, and evaluating the quality scores at this position gives an average probability of error of 0.00032. Using the [1 – Binom(M; AvP/3, C)] formula (where C = 35,000, AvP/3 = 0.00032/3, and M = 7) gives a P value of  0.08476547 which is not below the 0.05 threshold and so is not a validated low frequency mutation.

This is a relatively simple approach that was proposed by Morelli et al. 2013 and is also incorporated into our own tool DiversiTools. However, validation of low frequency variants is further complicated by issues such as Strand Bias and Illumina Systematic Errors, as well as RT-PCR errors and sample contamination.

A good website with a full table of quality scores and error probabilities is available here, along with information on older Illumina quality scores (which utilise slightly different formulas).