
Why and how to use biomaRt?

Bioinformatics work frequently involves gene annotation, and in recent years more and more biological data has become available. Being able to access these valuable data resources and analyse the data is therefore important for comprehensive bioinformatics analysis, and biomaRt is a very useful tool for doing exactly that. This raises two questions: 1. Why use biomaRt? 2. How do we use biomaRt?

Let us first introduce the concept of BioMart. The BioMart project (http://www.biomart.org) provides free software and data services to the international scientific community in order to foster scientific collaboration and facilitate the scientific discovery process. Examples of BioMart databases are Ensembl, UniProt and HapMap. However, when datasets are large and converting identifiers between different datasets becomes troublesome, we need a bioinformatics tool that can do it automatically. biomaRt is the package that provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way, without the need to know the underlying database schemas or to write complex SQL queries. The major databases (e.g. Ensembl, UniProt) give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R/Bioconductor.

The first way to use BioMart is online ID conversion. We can go to the web interface at http://useast.ensembl.org/biomart/martview/, select the corresponding dataset, filters and attributes, and then click the 'Results' button to see the final output.

The second way is to use biomaRt, which is an R/Bioconductor package. There are two steps: (1) select the Mart database and dataset, and (2) use getBM to retrieve the gene annotation. But how many Mart databases can the package access? And how do we find the correct filter and attribute settings for the corresponding datasets? We can use the functions 'listMarts' and 'listDatasets' to check the databases, while the functions 'listFilters' and 'listAttributes' are useful for finding the correct settings. Let's check the corresponding results from R.

The available marts and their versions, shown by the command listMarts()

The available datasets, shown by the command listDatasets(ensembl)

The available filters, shown by the command listFilters(ensembl)

The available attributes, shown by the command listAttributes(ensembl)
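
To make the two steps concrete, here is a minimal sketch in R. The dataset, filter and attribute names below (hsapiens_gene_ensembl, hgnc_symbol, ensembl_gene_id, entrezgene) are illustrative choices and may differ between Ensembl releases, so check listDatasets(), listFilters() and listAttributes() for the values available to you.

```r
library(biomaRt)

# Step 1: choose a mart and a dataset (human genes from Ensembl here)
listMarts()                                   # which BioMart databases are available?
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
head(listDatasets(useMart("ensembl")))        # browse the datasets in the Ensembl mart

# Check which filters and attributes this dataset offers
head(listFilters(ensembl))
head(listAttributes(ensembl))

# Step 2: query with getBM(), e.g. convert gene symbols to Ensembl and Entrez IDs
getBM(attributes = c("hgnc_symbol", "ensembl_gene_id", "entrezgene"),
      filters    = "hgnc_symbol",
      values     = c("BRCA1", "TP53"),
      mart       = ensembl)
```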

Besides ID conversion between databases (e.g. IDs, symbols and names), biomaRt can also retrieve information on SNPs, alternative splicing, exons, introns, and 5' and 3' UTRs.
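
As one illustration, biomaRt's getSequence() function can pull UTR (and exon, coding, etc.) sequences directly. This sketch assumes the 'ensembl' mart object created above, and the gene symbol is just an example:

```r
# Retrieve the 5' UTR sequences for a gene, queried by HGNC symbol
# (other seqType values include "3utr", "gene_exon" and "coding"; see ?getSequence)
utr5 <- getSequence(id      = "BRCA1",
                    type    = "hgnc_symbol",
                    seqType = "5utr",
                    mart    = ensembl)
head(utr5)
```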

The third way is to use the BioMart Perl API, which is also one of the most convenient ways to access BioMart programmatically. We will not introduce it in detail in this post.

Generally speaking, it is an amazing bioinformatics tool, and moreover, it is free!


A simple method to distinguish low frequency variants from Illumina sequence errors

RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences.

Next generation sequencing has revolutionised the study of viral populations by enabling the ultra deep sequencing of their genomes, and the subsequent identification of the full spectrum of variants within the population.

However, the current challenge is to accurately model the errors in the sequence data and distinguish real viral variants, particularly those that exist at low frequency, from errors introduced during sequencing.

A simple method to do this is to utilise the Illumina quality scores. Every sequenced base in every read is assigned a quality score that is a measure of the uncertainty of the base call using the Phred scale. An example FASTQ read is:
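
A FASTQ record consists of four lines: the read identifier, the called bases, a '+' separator line, and one quality character per base. The record below is purely illustrative rather than taken from a real dataset:

```
@read_1
ACGTGCATTTGGAC
+
IIIIHHGGFFFEEE
```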

A Phred quality score (Q) is logarithmically related to its corresponding base-calling error probability (P), and they can be calculated from each other using the following formulas:

Q = -10 × log10(P)

P = 10^(-Q/10)

In FASTQ files, quality scores are typically displayed as an ASCII character that represents the quality score. One reason for this is to save space: instead of writing "40" (2 characters), one can simply write "I" (1 character). A range of quality scores and their corresponding error probabilities and ASCII characters is presented below (NB: on the Sanger Phred scale the ASCII characters have a +33 offset to the quality score).

| Phred Quality (Q) Score | Probability (P) of Base Calling Error | Q + 33 (Sanger) | ASCII Q Character |
|---|---|---|---|
| 40 | 0.0001 | 73 | I |
| 39 | 0.000125893 | 72 | H |
| 38 | 0.000158489 | 71 | G |
| 37 | 0.000199526 | 70 | F |
| 36 | 0.000251189 | 69 | E |
| 35 | 0.000316228 | 68 | D |
| 34 | 0.000398107 | 67 | C |
| 33 | 0.000501187 | 66 | B |
| 32 | 0.000630957 | 65 | A |
| 31 | 0.000794328 | 64 | @ |
| 30 | 0.001 | 63 | ? |
| 20 | 0.01 | 53 | 5 |
| 10 | 0.1 | 43 | + |
| 1 | 0.794328235 | 34 | " |
| 0 | 1 | 33 | ! |
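
As a quick sanity check of the +33 offset and the formulas above, here is a small R sketch that converts an ASCII quality string (the illustrative one from the FASTQ record above) back into Phred scores and error probabilities:

```r
qual_string <- "IIIIHHGGFFFEEE"      # illustrative Sanger-encoded quality string

q <- utf8ToInt(qual_string) - 33L    # ASCII code minus the +33 Sanger offset gives Q
p <- 10^(-q / 10)                    # P = 10^(-Q/10)

data.frame(char = strsplit(qual_string, "")[[1]], Q = q, P = p)
```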

At one of the highest quality scores, Q40, there is a 1 in 10,000 probability of the base being called incorrectly by the sequencing machine. Although this is a very low probability, the ultra-deep sequencing of viral samples can result in coverage of well over 10,000 at each position in the genome. This implies that even if every base were of the highest quality, we would still expect errors to be introduced via base miscalls from the Illumina machine.

Therefore, in order to identify true viral variants present in the population at low frequency, one must handle base miscall errors from the Illumina machine. A simple way to do this is as follows:

  1. For a given genome position, retrieve all the quality scores of every individual base aligned at that position from the reads.
  2. Convert each quality score to a probability of error using the formulas above.
  3. Calculate the average (mean) probability of error at the given genome position; call this AvP.
  4. Test whether the observed number of mutations (M) is greater than expected from Illumina sequence errors alone. This can be done statistically using a binomial distribution with C trials, where C is the coverage at the position, and success probability AvP/3, where AvP/3 represents the probability of the specific mutation observed in the reads. The P value is the probability of observing M or more mutations by chance, i.e. 1 - BinomCDF(M - 1; C, AvP/3). Using a 5% threshold, if this P value is less than 0.05 then one can say the variant is validated.

For example, suppose at genome position 1034 the reference base is an A, we observe a coverage of 35,000 composed of 34,993 As and 7 G mutations, and evaluating the quality scores at this position gives an average probability of error of 0.00032. Computing 1 - BinomCDF(M - 1; C, AvP/3) with C = 35,000, AvP/3 = 0.00032/3 and M = 7 gives a P value of 0.08476547, which is not below the 0.05 threshold, so the G is not a validated low-frequency mutation.
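
The same calculation takes only a couple of lines of base R using pbinom(); this sketch simply reproduces the worked example above:

```r
coverage  <- 35000     # C: reads covering position 1034
mutations <- 7         # M: number of G mutations observed
avp       <- 0.00032   # AvP: average error probability from the quality scores

# Probability of observing M or more mutations from sequencing error alone,
# i.e. P(X >= M) where X ~ Binomial(C, AvP/3)
p_value <- 1 - pbinom(mutations - 1, size = coverage, prob = avp / 3)
p_value                # ~0.085, not below 0.05

p_value < 0.05         # FALSE: not validated as a real low-frequency variant
```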

This is a relatively simple approach that was proposed by Morelli et al. 2013 and is also incorporated into our own tool DiversiTools. However, validation of low-frequency variants is further complicated by issues such as strand bias and systematic Illumina errors, as well as RT-PCR errors and sample contamination.

A good website with a full table of quality scores and error probabilities is available here, along with information on older Illumina quality scores (which utilise slightly different formulas).