Monthly Archives: November 2015

Simple Bash Quiz

Basic Bash Quiz

A simple quiz to test bash knowledge.

This quiz has 20 questions and you must provide at least one answer for each question.

Each user can only have one open quiz attempt at a time

You will be given 10 mins to complete this quiz.

You must specify a text.
You must fill out this field.
Bioinformatician at CVR.

Java CIGAR Parser for SAM format

Sequence Alignment/Map (SAM) format is a well-known bioinformatics format designed to store  information about reads mapping against large reference sequence.  The SAM file is split into two sections: a header section and an alignment section. The header section starts with ‘@’ and it contains information such as the name and length of the reference sequence. The alignment section follows the header section. Each line in this section represents an alignment of a read to the reference sequence that may include insertion, deletion, skips and padding. However, not all lines are successful alignment. A single alignment line consists of 11 mandatory fields and a variable number of optional fields. The mandatory fields are ordered and, if information is unavailable, their values can be zero, for a number field, or * for a string field.  The reader is referred to the SAM Format Specification [1,2] for a detailed description of each field in the alignment section but in summary, the mandatory fields are:

  1. QNAME: the name of the read.
  2. FLAG: a bitwise number describing information about the read such as mapped/unmapped read, pairing, etc.
  3. RNAME: the reference sequence name. The value of this field is * for unmapped reads.
  4. POS: the left most mapping position of the first matching base. It is set to zero for unmapped reads.
  5. MAPQ: a number to indicate the mapping quality.
  6. CIGAR: a string describing how the read aligns with the reference. It consists of one or more components. Each component comprises an operator and the number of bases which the operator applies to. Operators are: MIDNSHP=X.  The following table gives an overview of these operators
  7. MNAME: the name of the mate strand, if it is a paired read. ‘=’ means that the name is the same as RNAME.
  8. MPOS: a number indicating the left most mapping position of the mate.
  9. TLEN: a number set to 0 for single-segment. For paired-segments it is calculated as the number of bases from the left most mapped base to the right most mapped base.
  10. SEQ: the read sequence as the reference. If not  ‘*’ then the length of the sequence must be equal the sum of lengths of M/I/S/=/X operations in the CIGAR string.
  11. QUAL: the read quality, same as the quality string in a fastaq format.
Operator Description
D Deletion; the nucleotide is present in the reference but not in the read
H Hard Clipping; the clipped nucleotides are not present in the read.
I Insertion; the nucleotide is present in the read  but not in the rference.
M Match; can be either an alignment match or mismatch. The nucleotide is present in the reference.
N Skipped region; a region of nucleotides is not present in the read
P Padding; padded area in the read and not in the reference
S Soft Clipping;  the clipped nucleotides are present in the read
X Read Mismatch; the nucleotide is present in the reference
= Read Match; the nucleotide is present in the reference

The following example shows the mapping of different reads to a reference sequence and their corresponding SAM file [2].  SAM_parser_figure1

Different tools such as SAMtools [2] and htsjdk are available to post-process the SAM files for analysis of the alignment information.

An intuitive usage of the SAM file is to view the overall coverage of the reference sequence by the reads. This information can be retrieved from the mapping position and CIGAR value fields which are the 4th and 6th fields, respectively.  Computationally, reference coverage is obtained by counting the number of times (frequency) of each nucleotide base in the reference sequence that appears in the mapped reads.  The frequency is determined using one of three cases according to the operators in the CIGAR string:

  • Case1 ( M/X/=) :
    • start at the specified mapping position, set counter to 1
    • Add 1 to both the counts of the bases from that position and the counter.
    • Move to the next position.
    • Repeat this process till counter is the same as the number associated with the operator.
  • Case2 (N/D):
    • Move the specified mapping position by the number associated with the operator.
  • Case3 (I/S/H/P):
    • Do nothing

The following Java codes illustrate how to obtain the coverage of a single reference sequence obtained from a SAM file.


The highest frequency in the reference sequence indicates the depth of the reads mapped to the reference. In this case, it is 3.


[1]: The Sequence Alignment/Map format and SAMtools.   BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078–2079 doi:10.1093/bioinformatics/btp352

[2]: Sequence Alignment/Map Format Specification.pdf





The dark arts of Ion Torrent Sequencing

All technologies, have their advantages and disadvantages, and next generation sequencing (NGS) is no different. For the time being, the current dominant forces in NGS are Illumina, which is based upon detecting a flash of light as fluorescent nucleotides are incorporated, or Ion Torrent, where nucleotide incorporation is detected in the form of a change in pH. Illumina has made its technology user friendly and produces high quality data, but is expensive and sensitive to misalignment of the lens used to photograph the flow cell, whereas Ion Torrent is (relatively) cheaper and less sensitive to shakes, but historically results in poorer quality data and user friendliness is far behind Illumina. Ion Torrent’s answer to addressing consistency and user friendliness has been to release the Ion Chef; effectively a robot which performs the steps from library to chip. Life without an Ion Chef is very different. Like any method, the instructions that come with the supplies tells you what to do, but it doesn’t tell you the ‘tricks’: those little touches that make you hit the sweet spot. In the non-chef protocol, for the Ion Proton, there are various points throughout the procedure, from RNA to sequence data, where little touches help.
My experience has been using the Ion Proton for transcriptomics, so any ‘recommendations’ when it comes to Ion Torrent sequencing refer to RNAseq.
Library preparation begins with good quality RNA. The standard way to determine RNA quality is to assess the integrity and ratio of the ribosomal RNA peaks. Originally this was simply an agarose gel, but nowadays the Bioanalyzer or Tapestation from Agilent are commonplace and provide a ‘RIN’ value indicating the RNA quality. For anyone thinking of buying one, given the choice between Bioanalyser and Tapestation, personally I’d choose the Bioanalyzer every time as the reagents seem to be much more stable overtime. For quantification it is widely accepted that dye-based methods of quantitation, e.g. Qubit, are much more reliable than spectrophotometry, e.g. NanoDrop.

Bioanalyser trace

Good quality RNA: Total RNA (RIN 10) as analysed using the Bioanalyzer.

The amount of RNA used to generate the library can be very small, but should be sufficient that the resulting library is representative of the RNA population in the sample. Currently, I poly(A) enrich 1.5µg of total RNA and use the entire eluate as the input for the RNAseq V2 kit from Ion Torrent. The protocol provided suggests 3 minutes of fragmentation for <100ng of poly(A) enriched RNA as an input. Fragmenting for this long though will produce short reads, whereas a shorter incubation, e.g. 1min 30sec will result in longer reads, which may be beneficial when it comes to aligning the reads to the reference. When it comes to the clean-up after fragmentation, and this applies to all of the clean-ups, big emphasis is placed upon precise volumes. It’s true, this is important, but just as important – and something not in the manual – is that the ethanol MUST be fresh. As soon as the bottle is open the quality drops. It’s prudent to make aliquots from a freshly opened bottle, especially if you want batch –to-batch consistency among your libraries.

When it comes to barcoding, if possible plan ahead such that the successive runs on the Proton utilise different barcodes. In this way it may help to remove inter-run contamination of data.
Library QC is also important, again, use the Qubit. Once amplified, using 1.5µg of total RNA input results in libraries that need to be diluted a lot to reach the required dilution for the OT2, either 100pM (and use 8µl in the OT2 reaction), or 8pM (and use 100µl in the OT2 reaction). More DNA than this will result in an increase in polyclonal spheres. It may be tedious vortexing the ISPs for a full minute prior to their addition to the OT2 reaction, but it’s important and adding the ISPs is the last thing I do prior to starting the OT2.

After enrichment and loading (adding 10µl of 50% annealing buffer prior to adding to the chip), the foam washes aren’t to be underestimated. The wells may be perfectly loaded with perfectly templated ISPs, but unless the surface is washed properly, the amount of data will be reduced. Whereas the protocol states two washes, I perform a minimum of 4, some run slowly, some faster. The goal each time is to generate a fine foam. For this the pipette tip makes a surprising amount of difference; unless there is a sufficiently narrow opening to the tip, it is difficult to generate a good foam. It is also sometimes possible to ‘over-foam’, whereby pipetting too much will result in the small bubbles becoming bigger.


Chip loading: an image of the Ion Proton PI chip loaded with ISPs

I’m not sure how much of a difference it makes, but I try to add the polymerase as slowly as possible and leave the chip to sit still for at least 5 minutes before moving it.

These are simply some of the habits I’ve picked up. The protocol is much improved compared to previous versions but, ultimately, the best option would be to buy an Ion Chef!

Some key factors for number of significant DE genes

In the different RNA-Seq experiments, we have found that some experiments have more significant DE (differential expression) genes than others. Besides the experiment design which is the most important factor, a number of other factors may influence the number of DE genes. Here, we list the different experiments that we have carried out the investigation of the factors that may influence the number of DE genes.

The influence of trimming reads
Previous analyses applying gentle trimming have negligible effects on the number of DE genes. However, heavy trimming could decrease the expression level heavily which is not objective for analysis.

The threshold of Significant DE genes

The default parameter forchoosing the significance of a DE gene is the Q-value (i.e. FDR P-value)< 0.05. This is based on the Benjamini–Hochberg procedure correction. One way to change the number of DE genes is to change the threshold value or the parameter used as a threshold. For example, any of Q-value< 0.1, P-value<0.05 and Q-value<0.1&abs(Log2FC)>1 could be used as threshold parameters. Using data sequenced at the CVR (treated and untreated cells) as an example, edgeR is used as the DE analysis tool (Fig. 1). The largest number of deferentially expressed genes is found if P<0.05 is used as a threshold parameter.On the other hand, if you want to reduce the number of DE genes, you could change the correction procedure used. For example, the Benjamini–Yekutieli procedure is more conservative than the Benjamini–Hochberg procedure.


Fig.1 Number of significant DE genes with different threshold

Biological coefficient of variation

An important factor that influences the number of DE genes is the variation among the samples.The BCV (Biological Coefficient of Variation) plot is a way to measure the biological variation within a particular condition. A common dispersion (i.e. red line on the BCV plot) between 0.2 and 0.4 is usually considered reasonable and hence could detect more DE genes.If the common dispersion is above the 0.4 threshold, this will influence the number of DE genes found in the study. Additionally, the PCA (principal component analysis) and MDS (multidimensional scaling) plots which represents the relationship between groups of samples are affected by high BCVs. The experiment was run twice. We can see the differences in the BCV and MDS plots between the two experiments. An experiment where the common dispersion is 0.41 provides almost 576 DE genes whilst an experiment with high average BCV results in only 109 DE genes. Additionally, the MDS plots clearly differentiate the different conditions when the common dispersion is low implying greater variation between the conditions than within, as expected.

experiment 1 (Condition 1; Condition 2):

Condition 1 sample ID: CVR71-1, CVR74-1, CVR57-1;

Condition 2 sample ID: CVR71-2, CVR74-2, CVR57-2

experiment 2 (Condition1; Condition 2):

Condition 1 sample ID: CVR82-1, CVR 82-2, CVR82-3;

Condition 2 sample ID: CVR83-1, CVR83-2, CVR83-3


Fig 2. BCV plot of Experiment 1 vs BCV plot of Experiment 2


Fig 3. MDS plot of Experiment 1 vs MDS plot of Experiment 2

Number of samples

Statistically, it is beneficial to have as many replicates as possible. The minimum number of replicates would be 3 but it is advisable to have more than 4 in each condition for cell lines and 6 for tissue samples that inherently have more variation in the genes being expressed. The number of DE genes would be affected by number of samples in the experiment. We used edgeR to compare the results obtained with 2 replicates versus 3 replicates for an experiment 1 and cell experiment 2 by edgeR and get the result to Fig. 4. Figure 4 shows that we could get more significantly DE genes with more samples.


Fig.4 Number of significant DE genes vs sample number

However, if we have outlier samples in each condition, the number of significant DE genes will decrease (Fig. 5). For example, we made a new DE analysis of the pig genome using:

Condition 1:CVR57-1, CVR82-1, CVR 82-2, CVR82-3

Condition 2: CVR57-2, CVR83-1, CVR83-2, CVR83-3

As shown in the Fig.5, the number of DE genes decreased because of the outlier (i.e. CVR57-1 and CVR57-2). By adding this outlier, you are increasing the biological variation in the experiment, resulting in a higher BCV and a lower number of DE genes.

Number of reads in each sample

The number of DE genes could also be affected by the number of reads in the samples. Here we randomly sub-sampling the number of RNA-Seq reads of each sample (Fig. 5). If we have more reads, we find more DE genes.DESeq2 is influenced more by the number of reads in the experiment than edgeR which is less affected by changes in the number of mapped reads. This also suggests that sequencing 25% less per sample, will not adversely affect the number of DE genes determined by edgeR.


Fig.5 Number of significant DE genes vs proportion of sub-sampling


Here we test some of our samples (unmapped reads) to the Mycoplasmas bacteria with Ion-Proton sequencing data. Mycoplasma can influence the growth of cell cultures may adversely affect the experiment, thus resulting in lower numbers of DE genes.

The mycoplasma genomes used in this study were downloaded from NCBI genomes: Mycoplasma hominis ATCC 23114 (NC_013511.1), M. hyorhinis MCLD (NC_017519.1), Mycoplasma fermentans M64 (NC_014921.1) and Acholeplasma laidlawii PG-8A (NC_010163.1). As shown in the Table 1, there is not an obvious relationship between number of significant DE genes and the mycoplasma contamination.

Table 1 Statistic of mapping to Mycoplasma and pig genome


The number of genes in the genome

The number of genes in the genome could also influence the number of significant DE genes although this is unlikely to be the case in the current set of experiments approx. 25,000 genes and approx. 26,000.

The statistical model

Even in the same pipeline, use different statistical model could affect the number of DE genes as well.

For example, number of DE genes will increase from 576 to 3457 if the model of edgeR changed from ‘classical linear models’ to ‘Generalized linear models (GLM) with muti-factor of samples’ .


We propose that the following strategy for future experimental design:

  • increasing the number of replicates;
  • the number of sequenced reads could be decreased slightly to accommodate for more replicates;
  • insuring consistent handling in the laboratory.