Author Archives: Quan Gu

4th Annual Training Course on Viral Bioinformatics and Genomics (20-24th Aug, 2018)

From 20-24th August we held our 4th Viral Bioinformatics and Genomics training course at the Garscube campus of the University of Glasgow. This year the course had 16 participants from academic and health institutions across the UK, Europe, Canada, Australia, Qatar, and South Africa. The MRC-University of Glasgow Centre for Virus Research (CVR) was designated a World Organisation for Animal Health (OIE) Collaborating Centre for Viral Genomics and Bioinformatics at the 82nd OIE General Session.


Andrew Davison, Ana Filipe, Quan Gu (Course Organiser), Joseph Hughes, Richard Orton, David Robertson, Sreenu Vattipally (Co-organiser)

The course involved the usual line-up of instructors, and included sessions on Linux, reference assembly and consensus/variant calling, advanced bash scripting, de novo assembly, metagenomics, transcriptomics, genome annotation, and phylogenetics, all with a focus on viral data sets. The course itself was held in the IAMS computer room of the McCall building, with lunch and refreshments in the Mary Stewart building, the poster session in the Sir Michael Stoker building, and the course dinner at Brel restaurant.

Day 1: Introductions, NGS Reads, and Mapping

The course started off with an introduction to viral bioinformatics from our head of bioinformatics, Prof. David Robertson, followed by an introduction to High Throughput Sequencing (HTS) by the head of the CVR’s sequencing facility, Ana Filipe. This was followed by a 1 minute – 1 slide introduction session involving all course participants and instructors. In the afternoon, there were sessions on Basic Linux Commands, NGS Reads & Cleaning, and Mapping Reads to a Reference, followed by a wine and poster reception held at the CVR where participants got a chance to get to know each other and the instructors better, and to discuss the latest viral research with members of the CVR. The Director of the CVR, Prof. Massimo Palmarini, gave a welcome speech at the reception.

Day 2: Variant Calling & Advanced Linux

The course picked up from the previous day with sessions on Post Assembly & Consensus Calling followed by Low-Frequency Variant Calling. In the afternoon, the course moved on to advanced Linux topics with sessions covering the basics of text processing, scripts, and Linux conditions.

Day 3: Bash Scripting and Phylogenetic Analysis

On the 3rd day, participants learned how to build their own bash scripts using loops and arrays to automate sample analysis, whilst in the afternoon there were sessions on automatically downloading sequence data from GenBank followed by sequence alignment and phylogenetic analysis.

Day 4: De Novo Assembly and Genome Annotation

On the 4th day, the course moved on to de novo assembly: the different assemblers available, assessing the quality of contigs, merging, and gap filling. During the coffee break, we played a Lego-based alignment game. In the evening, all course participants and instructors moved on to Brel restaurant for our annual course dinner and drinks, where Quan Gu showed off his singing skills.

Day 5: Metagenomics, Transcriptomics, and Pipelines

The final day involved metagenomics, transcriptomics and a build-your-own-pipeline session in the morning, as well as a chance to discuss participants’ own data. After our annual cutting of the OIE cake, we said our goodbyes to this year’s participants. We wish all the participants lots of fun with their new bioinformatics skills.

If you are interested in finding out about future courses that we will be running, please fill in the form with your details.

featureCounts or htseq-count?

Count-based differential expression analysis of sequencing data is one of the best-known pipelines in bioinformatics. In this pipeline, the vital step is to estimate the read count for each genomic feature. After counting, differential expression (DE) analysis tools are used to obtain a list of differentially expressed features. This approach has been widely used in ChIP-Seq and RNA-Seq. Several software tools (e.g. summarizeOverlaps, coverageBED, htseq-count and featureCounts) are available for counting features.

In the case of RNA-Seq, the features are typically genes, where each gene is considered as the union of all its exons. Counting RNA-Seq reads is complex because of the need to accommodate exon splicing. The common approach is to summarize counts at the gene level, by counting all reads that overlap any exon of each gene; a gene annotation file from RefSeq or Ensembl is usually used for this purpose. So far there are two major feature-counting tools: featureCounts (Liao et al.) and htseq-count (Anders et al.). Both are well known and reliable. But which one is better? I give my personal suggestions below.

First, here are some features common to the two tools.

1. Input files: both require reference-based alignment files (BAM/SAM), and both can be used for stranded or unstranded reads.

2. Genomic features file: both require a GTF/GFF file from RefSeq or Ensembl.

3. OS: both run on UNIX-like systems.

However, htseq-count has its own advantages. This Python-based tool was developed by scientists at the European Molecular Biology Laboratory, who also took part in developing several well-known differential expression tools for RNA-Seq analysis (e.g. DESeq, DESeq2). Hence, it fits naturally into count-based differential expression RNA-Seq workflows. It is also worth mentioning that htseq-count is well cited and many users still use it.


In contrast, featureCounts has its own advantages:

1. Input files: featureCounts accepts SAF files as well as GTF/GFF files.

2. The SAF format has only 5 columns and is simpler than GTF/GFF.

GeneID    Chr    Start      End        Strand
301026    chr1   2206566    2209048    -
301026    chr1   2431765    2431992    -
301026    chr1   2661153    2662007    -

3. I have tested the running speed of featureCounts on my real data; it was roughly 20 times faster than htseq-count.

4. Besides the UNIX version, featureCounts also has an R-based version (in the Rsubread package). For bioinformaticians who want to perform RNA-Seq or ChIP-Seq analysis on Windows/macOS, featureCounts is the best choice.

5. featureCounts is more liberal than htseq-count: it can obtain more counts, especially for paired-end reads.
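To show how simple the SAF format in point 2 is, here is a small Python sketch that converts GTF exon records into SAF rows. The GTF lines are invented for illustration; only the gene ID and coordinates come from the example table above.

```python
# Sketch: convert GTF exon lines into SAF rows (GeneID, Chr, Start, End, Strand).
# The GTF lines below are made up; the gene ID and coordinates follow the
# SAF example shown earlier.

def gtf_exons_to_saf(gtf_lines):
    """Yield one SAF row (gene_id, chrom, start, end, strand) per exon."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, _, feature, start, end, _, strand, _, attrs = fields
        if feature != "exon":
            continue
        # Crude attribute parsing: extract the value of gene_id "..."
        gene_id = attrs.split('gene_id "')[1].split('"')[0]
        yield (gene_id, chrom, start, end, strand)

gtf = [
    'chr1\tRefSeq\texon\t2206566\t2209048\t.\t-\t.\tgene_id "301026";',
    'chr1\tRefSeq\texon\t2431765\t2431992\t.\t-\t.\tgene_id "301026";',
]

print("GeneID\tChr\tStart\tEnd\tStrand")
for row in gtf_exons_to_saf(gtf):
    print("\t".join(row))
```

A real conversion would of course use the full annotation file, but the five-column target format stays this simple.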

To see why, let’s first check how htseq-count does the counting (the figure is taken from the HTSeq manual).


In htseq-count, union mode is the most appropriate setting and the one recommended by the authors. If a read (or read pair) overlaps more than one feature, it is counted as ambiguous and not assigned to any feature. featureCounts behaves differently. For single-end reads, featureCounts and htseq-count are almost equivalent, but for paired-end reads featureCounts has the advantage.

For example, if a read-pair fragment overlaps two genes, but one gene overlaps only one read of the pair while the other gene overlaps both reads, featureCounts will assign the fragment to the gene overlapping both reads, whereas htseq-count will treat the fragment as ambiguous and not assign it to any gene.

In my opinion, featureCounts makes a sensible distinction between features that overlap different numbers of reads from the same fragment, whereas htseq-count is too conservative and yields lower counts.
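The two counting policies can be illustrated with a toy Python sketch. This is not the actual htseq-count or featureCounts code, just a minimal model of the decision described above, where each mate of a pair is represented by the set of genes it overlaps:

```python
# Toy model (not real htseq-count/featureCounts code) of how the two
# policies treat a paired-end fragment that touches two genes.

def htseq_union(mate1_genes, mate2_genes):
    """Union mode: the fragment's gene set is the union of both mates' overlaps.
    More than one gene -> 'ambiguous', counted for nothing."""
    genes = set(mate1_genes) | set(mate2_genes)
    if len(genes) == 1:
        return genes.pop()
    return "ambiguous" if genes else "no_feature"

def featurecounts_like(mate1_genes, mate2_genes):
    """featureCounts-like rule: prefer the single gene overlapped by BOTH mates."""
    both = set(mate1_genes) & set(mate2_genes)
    if len(both) == 1:
        return both.pop()
    return htseq_union(mate1_genes, mate2_genes)  # otherwise fall back

# Fragment: mate 1 overlaps geneA only; mate 2 overlaps geneA and geneB.
print(htseq_union({"geneA"}, {"geneA", "geneB"}))         # -> ambiguous
print(featurecounts_like({"geneA"}, {"geneA", "geneB"}))  # -> geneA
```

For a fragment whose mates both land in a single gene, the two rules agree; they only diverge on the mixed case above.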

However, the proportion of ambiguous reads is usually very low. I checked 20 real samples in my database: around 1,000-100,000 ambiguous reads per sample, less than 0.5% of the total. In summary, then, the disagreement between the two tools can usually be ignored in real data analysis. If the proportion of overlapping features is high, however, you do need to make the choice.

Some key factors affecting the number of significant DE genes

Across different RNA-Seq experiments, we have found that some experiments yield more significant DE (differentially expressed) genes than others. Besides the experimental design, which is the most important factor, a number of other factors may influence the number of DE genes. Here, we list the experiments we carried out to investigate these factors.

The influence of trimming reads
In previous analyses, gentle trimming had a negligible effect on the number of DE genes. However, heavy trimming can strongly depress expression levels, which biases the analysis.

The threshold for significant DE genes

The default criterion for calling a DE gene significant is Q-value (i.e. FDR-adjusted P-value) < 0.05, based on the Benjamini–Hochberg correction. One way to change the number of DE genes is to change the threshold value or the statistic used as a threshold; for example, Q-value < 0.1, P-value < 0.05, or Q-value < 0.1 & |log2FC| > 1 could all be used. Using data sequenced at the CVR (treated and untreated cells) as an example, with edgeR as the DE analysis tool (Fig. 1), the largest number of differentially expressed genes is found when P < 0.05 is used as the threshold. On the other hand, if you want to reduce the number of DE genes, you could change the correction procedure: the Benjamini–Yekutieli procedure, for example, is more conservative than Benjamini–Hochberg.


Fig. 1. Number of significant DE genes with different thresholds
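For readers who want to see how Q-values arise from P-values, here is a plain-Python sketch of the Benjamini–Hochberg adjustment (the same idea as R’s p.adjust(p, method = "BH"); the P-values below are invented for illustration):

```python
# Plain-Python Benjamini-Hochberg FDR adjustment.  The P-values are invented;
# note how the Q < 0.05 cutoff keeps fewer genes than P < 0.05 does.

def bh_adjust(pvals):
    """Return BH-adjusted Q-values in the original order of `pvals`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    prev = 1.0
    # Walk from the largest P-value down, enforcing monotone Q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        qvals[i] = prev
    return qvals

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
qvals = bh_adjust(pvals)

print(sum(p < 0.05 for p in pvals))  # significant at P < 0.05 -> 5
print(sum(q < 0.05 for q in qvals))  # significant at Q < 0.05 -> 2
```

This is exactly the effect seen in Fig. 1: relaxing from Q to raw P as the threshold statistic inflates the DE gene list.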

Biological coefficient of variation

An important factor that influences the number of DE genes is the variation among the samples. The BCV (Biological Coefficient of Variation) plot is a way to measure the biological variation within a particular condition. A common dispersion (the red line on the BCV plot) between 0.2 and 0.4 is usually considered reasonable and allows more DE genes to be detected. If the common dispersion is above the 0.4 threshold, this will reduce the number of DE genes found in the study. Additionally, the PCA (principal component analysis) and MDS (multidimensional scaling) plots, which represent the relationships between groups of samples, are affected by high BCVs. The experiment was run twice, and we can see the differences in the BCV and MDS plots between the two runs: the experiment where the common dispersion is 0.41 yields 576 DE genes, whilst the experiment with a high average BCV results in only 109 DE genes. Additionally, the MDS plots clearly separate the conditions when the common dispersion is low, implying greater variation between the conditions than within them, as expected.

Experiment 1 (Condition 1; Condition 2):

Condition 1 sample ID: CVR71-1, CVR74-1, CVR57-1;

Condition 2 sample ID: CVR71-2, CVR74-2, CVR57-2

Experiment 2 (Condition 1; Condition 2):

Condition 1 sample ID: CVR82-1, CVR82-2, CVR82-3;

Condition 2 sample ID: CVR83-1, CVR83-2, CVR83-3


Fig 2. BCV plot of Experiment 1 vs BCV plot of Experiment 2


Fig 3. MDS plot of Experiment 1 vs MDS plot of Experiment 2
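The intuition behind the BCV can be sketched with the ordinary coefficient of variation (CV = standard deviation / mean) across replicates; in edgeR the BCV reported on these plots is the square root of the estimated common dispersion. A toy Python example with invented counts:

```python
import statistics

# Toy illustration of within-condition variation.  The replicate counts for a
# single gene below are invented; a higher CV across replicates within a
# condition means the dispersion estimate rises and fewer DE genes are found.

def cv(values):
    """Coefficient of variation: sample standard deviation over mean."""
    return statistics.stdev(values) / statistics.mean(values)

tight_replicates = [100, 105, 98]   # low within-condition variation
noisy_replicates = [100, 180, 40]   # high within-condition variation

print(round(cv(tight_replicates), 2))
print(round(cv(noisy_replicates), 2))
```

The noisy triple has a far higher CV, which is the per-gene analogue of the difference between the two BCV plots above.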

Number of samples

Statistically, it is beneficial to have as many replicates as possible. The minimum number of replicates is 3, but it is advisable to have more than 4 per condition for cell lines and 6 for tissue samples, which inherently have more variation in the genes being expressed. The number of DE genes is affected by the number of samples in the experiment. We used edgeR to compare the results obtained with 2 replicates versus 3 replicates for experiments 1 and 2 (Fig. 4), which shows that more samples yield more significant DE genes.


Fig.4 Number of significant DE genes vs sample number

However, if we have outlier samples in a condition, the number of significant DE genes will decrease (Fig. 5). For example, we ran a new DE analysis on the pig genome using:

Condition 1: CVR57-1, CVR82-1, CVR82-2, CVR82-3

Condition 2: CVR57-2, CVR83-1, CVR83-2, CVR83-3

As shown in Fig. 5, the number of DE genes decreased because of the outliers (CVR57-1 and CVR57-2). Adding these outliers increases the biological variation in the experiment, resulting in a higher BCV and fewer DE genes.

Number of reads in each sample

The number of DE genes can also be affected by the number of reads in each sample. Here we randomly sub-sampled the RNA-Seq reads of each sample (Fig. 5). With more reads, we find more DE genes. DESeq2 is influenced more by the number of reads in the experiment than edgeR, which is less affected by changes in the number of mapped reads. This also suggests that sequencing 25% fewer reads per sample will not adversely affect the number of DE genes determined by edgeR.


Fig.5 Number of significant DE genes vs proportion of sub-sampling
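Sub-sampling can be mimicked by keeping each read independently with probability p, as in this toy Python sketch. The per-gene counts are invented; real sub-sampling would be done on the FASTQ/BAM files, e.g. with seqtk or samtools.

```python
import random

# Toy sketch of read sub-sampling: keep each read with probability p
# (binomial thinning).  Real analyses subsample the FASTQ/BAM instead;
# the per-gene counts here are invented.

def subsample_counts(counts, p, seed=0):
    """Thin each gene's read count to an expected fraction p of the original."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return {gene: sum(rng.random() < p for _ in range(n))
            for gene, n in counts.items()}

counts = {"geneA": 1000, "geneB": 200, "geneC": 50}
thinned = subsample_counts(counts, p=0.75)
print(thinned)  # each count is roughly 75% of the original
```

Repeating the DE analysis on such thinned counts at several proportions is what produces curves like Fig. 5.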


Mycoplasma contamination

Here we mapped the unmapped reads of some of our samples (Ion Proton sequencing data) to mycoplasma genomes. Mycoplasma can influence the growth of cell cultures and may adversely affect the experiment, resulting in fewer DE genes.

The mycoplasma genomes used in this study were downloaded from NCBI Genomes: Mycoplasma hominis ATCC 23114 (NC_013511.1), M. hyorhinis MCLD (NC_017519.1), Mycoplasma fermentans M64 (NC_014921.1) and Acholeplasma laidlawii PG-8A (NC_010163.1). As shown in Table 1, there is no obvious relationship between the number of significant DE genes and mycoplasma contamination.

Table 1. Statistics of mapping to the mycoplasma and pig genomes


The number of genes in the genome

The number of genes in the genome could also influence the number of significant DE genes, although this is unlikely to be the case in the current set of experiments (approx. 25,000 genes versus approx. 26,000).

The statistical model

Even within the same pipeline, using a different statistical model can affect the number of DE genes.

For example, the number of DE genes increased from 576 to 3,457 when the edgeR model was changed from classical linear models to generalized linear models (GLMs) with multiple sample factors.


We propose the following strategy for future experimental design:

  • increase the number of replicates;
  • decrease the number of sequenced reads slightly to accommodate more replicates;
  • ensure consistent handling in the laboratory.


Why and how to use biomaRt?

Bioinformatics work often includes gene annotation. In recent years, more and more biological data have become available, and knowing how to access and analyse these valuable data resources is important for comprehensive bioinformatics analysis. biomaRt is a very useful tool for this. That raises two questions: 1. Why use biomaRt? 2. How to use biomaRt?

Let us first introduce the concept of BioMart. The BioMart project provides free software and data services to the international scientific community in order to foster scientific collaboration and facilitate the scientific discovery process. Examples of BioMart databases are Ensembl, Uniprot and HapMap. However, when the dataset is large and conversion between different databases is troublesome, we need a bioinformatics tool that can do it automatically. biomaRt is the package which provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way, without the need to know the underlying database schemas or write complex SQL queries. The major databases (e.g. Ensembl, Uniprot) give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R/Bioconductor.

The first way to use BioMart is online ID conversion: go to the BioMart website and select the corresponding dataset, filters and attributes, then click the ‘Results’ button to see the final output.

The second way is to use biomaRt, an R/Bioconductor package. There are two steps: (1) select the Mart database, and (2) use getBM to retrieve the gene annotation. But how many Mart databases does the package have, and how do we find the correct filter and attribute settings for a given dataset? The functions listMarts and listDatasets let us check the available databases and datasets, while listFilters and listAttributes help us find the correct settings. Let’s check the corresponding results in R.

Available marts, listed by the command listMarts()

Available datasets, listed by the command listDatasets(ensembl)

Available filters, listed by the command listFilters(ensembl)

Available attributes, listed by the command listAttributes(ensembl)

Besides database ID conversion (e.g. IDs, symbols, names), biomaRt can also retrieve information on SNPs, alternative splicing, exons, introns, 5’UTRs and 3’UTRs.

The third way is to use the BioMart Perl API, which is also one of the most convenient ways to access BioMart programmatically. We will not cover it in detail in this post.
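A further option, if neither R nor Perl suits, is BioMart’s REST interface, which accepts an XML query document. The sketch below only builds the query string; the dataset, filter and attribute names mirror standard Ensembl BioMart identifiers, and actually submitting the query to the martservice endpoint (which requires network access) is left out here.

```python
import xml.etree.ElementTree as ET

# Sketch: build the XML query document that BioMart's REST "martservice"
# endpoint accepts.  Only the query string is constructed; submitting it
# (e.g. with urllib to the Ensembl martservice URL) needs network access
# and is omitted.  Names below follow standard Ensembl BioMart identifiers.

def biomart_query(dataset, filters, attributes):
    query = ET.Element("Query", {
        "virtualSchemaName": "default", "formatter": "TSV",
        "header": "0", "uniqueRows": "0",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    return ET.tostring(query, encoding="unicode")

xml_query = biomart_query(
    "hsapiens_gene_ensembl",
    {"ensembl_gene_id": "ENSG00000139618"},
    ["ensembl_gene_id", "external_gene_name"],
)
print(xml_query)
```

This is essentially what biomaRt’s getBM builds for you behind the scenes, which is why the filter and attribute names discovered with listFilters/listAttributes carry over directly.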

Generally speaking, it is an amazing bioinformatics tool, and moreover, it is free!