# Trie Data Structure

In Computer Science, a trie is a data structure that is also known as a digital search tree or a prefix tree. It can be used for fast retrieval on large data sets such as looking up words in a dictionary. The term trie was invented from the phrase ‘Information Retrieval’ by Fredkin(1960).

As a prefix tree, the trie is an ordered tree used to represent a set of words (or strings) over a finite alphabet set. It allows words with common prefixes to use the same prefix data and stores only the end of the words to indicate different words. Each node in the trie is labeled with one character except the root node which is an empty node. The maximum number of children that a particular node in the trie can have is equal to the number of characters in the alphabet set used for the trie. The children of a node are alphabetically ordered and sibling nodes must represent distinct characters. The node which contains the last character of a word is called a leaf node.  The path from the root to a leaf node represents a word in the trie.

Consider the set C = { ‘A’, ’C’, ’G’, ’T’}, then a trie for this set can potentially have 4 children for each node; one corresponding to each letter. The following figure shows a trie for 6 different strings composed from C. The strings are: ‘AAAA’, ‘ACGT’, ‘CGA’, ‘GTA’, TA’, and ‘TTTT’. Note that ‘A’, ‘AA’, ‘AAA’, ‘AC’, ‘CG’, ‘ACG’, ‘C’, ‘G’, ‘GT’, ‘T’, ‘TT’, ‘TTT’ are all common prefixes. Also, the height of the trie is the same as the longest word in the trie which is 4 in this case.

A standard trie data structure can be developed from a single node class with 3 attributes:

1. A character ch to store the letter representing the node
2. An array of nodes children to represent the children of the node, and
3. A Boolean leafNode to mark the end of the word.

To implement a trie for the previous example, we start with the root node of the trie. Therefore, we instantiate a node object root  by using the following Node constructor:

We also need to determine the position for each character in the array so that they are alphabetically ordered. I.e.

Two main functions are needed with a trie data structure: Search and Insert. The following pseudo codes provide the algorithms for both functions.

Void INSERT (String word)

1. Make current node points to the root of the trie.
2. For each character c in word:
1. Get the pre-determined position for c in the children array
2. If the children node of the current node has nothing at that position, insert c
3. Make current node points to its child node whose index is the same as position.
3. Set leafNode to true to indicate insertion of word into the trie.

Boolean SEARCH (String word)

1. Make current node points to the root of the trie.
2. For each character c in word:
1. If there is nothing in current node, return false.
2. Get the pre-determined position for c in the children array
3. Make current node points to its child node whose index is the same as position.
3. If current node is not empty and leafNode is false, return false.
4. Return true to indicate that word was found in the trie.

The trie data structure is often overlooked by developers because of the time it takes to build. For a text of length n, inserting words into the trie requires (n2) time overall. Space needed for a trie is also quadratic which can be another drawback. On the other hand, searching a trie is fast. In the worst case it is  O(m)where m is the length of the word to search for.

# Calculating dNdS for NGS datasets

vNvS

Our upcoming tool vNvS calculates the dN/dS ratio at each site, codon and also for the sample as a whole, here is an explanation of the theory behind it. vNvS is currently in development – for more information email Richard.Orton@glasgow.ac.uk

dN/dS

dN/dS is the ratio of the number of nonsynonymous substitutions per non-synonymous site (pN) to the number of synonymous substitutions per synonymous site (pS), which can be used as an indicator of selective pressure acting on a protein coding gene. A ratio greater than one implies positive or darwinian selection, less than one implies purifying selection, and a ratio of one indicates neutral selection.

A synonymous substitution is a nucleotide mutation that does not alter the amino acid (AA) of a protein; also called a silent mutation. However, synonymous substitutions can still potentially affect transcription, splicing, mRNA transport or translation, any of which could alter the phenotype and render the synonymous mutation non-silent.

A nonsynonymous substitution is a nucleotide mutation that alters the AA of a protein; also called a missense mutation. However, nonsynonymous substitutions are called nonsense mutations if the codon changes into a (premature) stop codon (i.e. TAA, TAG, TGA).

We will now consider two example AA codons and evaluate the types of mutations for each: Tryptophan which is mutationally fragile and Arginine which is mutationally robust. A codon consists of three nucleotides (the so called triplet code), and as there are four possible nucleotides (A, C, G, T), there are a total of 9 possible mutations to consider for each codon.

## TGG - Tryptophan - Trp - W

MutationCodonAATypeN-Sites
TGGTrp
1Position 1 (T->A)AGGArgNonsynonymous (missense)1/3
2Position 1 (T->C)CGGArgNonsynonymous (missense)1/3
3Position 1 (T->G)GGGGlyNonsynonymous (missense)1/3
4Position 2 (G->A)TAGStopNonsynonymous (nonsense)1/3
5Position 2 (G->C)TCGSerNonsynonymous (missense)1/3
6Position 2 (G->T)TTGLeuNonsynonymous (missense)1/3
7Position 3 (G->A)TGAStopNonsynonymous (nonsense)1/3
8Position 3 (G->C)TGCCysNonsynonymous (missense)1/3
9Position 3 (G->T)TGTCysNonsynonymous (missense)1/3

Any mutation at any of the three codon positions for Tryptophan results in a nonsynonymous mutation. This emphasises why the expected number of nonsynonymous mutations  – or the number of nonsynonymous sites – should be taken into account in dN/dS analyses, as observed counts alone would lead to biased results. We will come to the N-Sites column later.

## CGG - Arginine - Arg - R

MutationCodonAATypeN-Sites
CGGArg
1Position 1 (C->A)AGGArgSynonymous0
2Position 1 (C->G)GGGGlyNonsynonymous (missense)1/3
3Position 1 (C->T)TGGTrpNonsynonymous (missense)1/3
4Position 2 (G->A)CAGGlnNonsynonymous (missense)1/3
5Position 2 (G->C)CCGProNonsynonymous (missense)1/3
6Position 2 (G->T)CTGLeuNonsynonymous (missense)1/3
7Position 3 (G->A)CGAArgSynonymous0
8Position 3 (G->C)CGCArgSynonymous0
9Position 3 (G->T)CGTArgSynonymous0

The Arginine codon CGG has 4 possible synonymous mutations and 5 nonsynonymous mutations. It can withstand any mutation at the 3rd codon position and still code for Arginine and also withstand one substitution at the 1st codon position. However, any mutation at the 2nd codon position is nonsynonymous as it results in an AA substitution.

This is true for all AA codons – any mutation at the 2nd codon position will result in an AA substitution, whilst mutations at the 3rd codon position either cause a silent mutation or a nonsynonymous mutation that typically maintains the hydrophilicity/hydrophobicity of the AA. This is one reason why the 3rd codon position is most liable to mutation.

Expected Sites for a Reference Sequence

Central to the calculation of the dN/dS ratio is calculating the expected number of nonsynonymous sites for each codon – this is why the dN/dS ratio is described in part as “nonsynonymous substitutions per non-synonymous site”.

For any particular codon, we denote fi to be the fraction of  substitutions that are nonsynonymous at the ith codon position (i = 1, 2, 3). The number of nonsynonymous (n) and synonymous (s) sites for a codon is then calculated with the formulas:

Using our previous examples, and the values already populated in the N-Sites column of the tables:

• TGG – Tryptophan – Trp – W
• n = (1/3 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) = 3
• s = 3 – n = 3 – 3 = 0
• The number of nonsynonymous sites for Tryptophan TGG is 3, whilst the number of synonymous sites is 0.
• CGG – Arginine – Arg – R
• n = (0 + 1/3 + 1/3) + (1/3 + 1/3 + 1/3) + (0 + 0 + 0) = 1.66666
• s = 3 – n = 3 – 1.66666 = 1.33333
• The number of nonsynonymous sites for Arginine CGG is 1.66666, whilst the number of synonymous sites is 1.33333.

For a protein encoding DNA sequence of length r codons, the total number of nonsynonymous (N) and synonymous (S) sites for the whole sequence can therefore be calculated by summing up the individual n and s values from all codons using the formulas:

As an example, consider the hypothetical reference sequence ATG – AAA – CCC – GGG – TTT – TAA. Given this sequence, N and S are calculated as:

## Reference Expected Sites

 Number (i) Codon Nonsynonymous (n) Synonymous (s) 1 ATG 3 0 2 AAA 2.666 0.333 3 CCC 2 1 4 GGG 2 1 5 TTT 2.666 0.333 6 TAA 2.333 0.666 Total 14.666 (N) 3.333 (S)

Observed Substitutions for a Sample Sequence

Now consider that we have a new sequence, our sample sequence, and we want to compare it to our original reference sequence, so we must count the number of observed nonsynonymous (nd) and synonymous (sd) mutations that our sample sequence has at each codon compared to the reference.

## Sample Observed Mutations

 Number (i) Reference Sample Nonsynonymous (nd) Synonymous (sd) 1 ATG ATG 0 0 2 AAA AAA 0 0 3 CCC CGC 1 0 4 GGG GGC 0 1 5 TTT TAC 1 1 6 TAA TAA 0 0 Total 2 (Nd) 2 (Sd)

In the above, there are three codons that have mutations:

Codon 3: CCC (pro) -> CGC (arg): This has a single nucleotide mutation (distance = 1) resulting in a nonsynonymous AA substitution, and therefore nd = 1 and sd = 0 for this codon.

Codon 4: GGG (gly) -> GGC (gly): This has a single nucleotide mutation (distance = 1) resulting in a synonymous AA substitution, and therefore nd = 0 and sd = 1 for this codon.

Codon 5: TTT (phe) -> TAC (tyr): This has two nucleotide mutations (distance = 2) and is a nonsynonymous AA substitution. However, due to the double mutation it is a more complex situation, as one must consider the two mutation pathways that could have led to this state (you can not assume that as the AA is nonsynonymous then both mutations are nonsynonymous):

1. TTT (phe) –> TAT (tyr) -> TAC (tyr): 1 nonsynonymous and 1 synonymous mutation
2. TTT (phe) –> TTC (phe) -> TAC (tyr): 1 synonymous and 1 nonsynonymous mutation
• As we consider the two above pathways to occur with equal probability, for this codon nd = 1 and sd = 1 (sums to two as we have two mutations in the codon).

Hypothetically, let us expand the example above to a three mutation (distance = 3) situation from TTT (phe) to GAC (asp). A three mutation situation results in 6 possible mutation pathways:

1. TTT (phe) -> TTC (phe) -> TAC (tyr) -> GAC (asp): 2n and 1s
2. TTT (phe) -> TTC (phe) -> GTC (val) -> GAC (asp): 2n and 1s
3. TTT (phe) -> TAT (tyr) -> TAC (val) -> GAC (asp): 3n
4. TTT (phe) -> TAT (tyr) -> GAT (val) -> GAC (asp): 3n
5. TTT (phe) -> GTT (val) -> GTC (val) -> GAC (asp): 2n and 1s
6. TTT (phe) -> GTT (val) -> GAT (val) -> GAC (asp): 2n and 1s
• As we consider the six above pathways to occur with equal probability, for this codon nd = 2.333 and sd = 0.666 (sums to three as we have three mutations in the codon).

So, for a protein encoding DNA sequence of length r codons, the total number of observed nonsynonymous (Nd) and synonymous (Sd) mutations between two sequences can therefore be calculated by summing up the individual nd and sd values from all codons using the formulas:

dN/dS ratio

One then calculates the proportion of nonsynonymous (pn) and synonymous (ps) differences with the following equations:

Then to estimate the number of nonsynonymous substitutions (dN) and synonymous substitutions (dS) per site, and the dN/dS ratio itself, we use the formulas:

So using our reference and sample sequence examples above we now have the following values:

• N = 14.666
• S = 3.333
• Nd = 2
• Sd = 2
• pN = 0.1364
• pS = 0.6001
• dN = 0.1505
• dS = 1.2074
• dN/dS = 0.1247

NGS Datasets

The expansion of all of the above to NGS datasets is relatively straightforward. The calculation of the number of nonsynonymous and synonymous sites in the reference sequences proceeds the same as above.

One must then calculate the number of observed nonsynonymous and synonymous mutations in the reads when compared to the reference (this will need information on where the open reading frame [ORF] starts and stops). There are two options when calculating the observed numbers:

1. Consider all observed mutations in the reads covering any part of the codon.
2. Consider only those mutations where the read fully covers the codon the mutation occurs in, i.e. ignore partially covered codons at the read ends, or partially covered codons due to indels.

To adapt to NGS datasets, one must consider read coverage. The approach taken by Morelli et al (2013) adjusts the formula for pN (and likewise pS) as follows to take into account the read coverage (c) at each codon:

Essentially, for each read (c) that covers a particular codon, the observed number of nonsynonymous mutations in the read compared to the reference codon is calculated, and divided by the expected number. The values for all reads at the codon are then summed and averaged. Then the value for all codons is summed to give a single value for the whole ORF. This value of pN and pS can they be plugged in to the same dN and dS formulas to calculate the dN/dS ratio.

References

Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions

Evolution of foot-and-mouth disease virus intra-sample sequence diversity during serial transmission in bovine hosts