Category Archives: Uncategorized

Exploring the FAST5 format

FAST5 format from Oxford Nanopore (ONT) is in fact HDF5, which is a very flexible data model, library, and file format for storing and managing data. It is able to store an unlimited variety of datatypes.

A number of tools have been developed for handling HDF5 available from here.  The most useful are:

  • hdfview, a java visual tool for viewing HDF5 files with some limited functionality of plotting data and the option of exporting subsets in HDF5 (extension .h5)
  • h5ls, for listing specified entries of the HDF5 file
  • h5dump, to examine the HDF5 file and export specified groups or datasets in ASCII.

Here’s a run through exploring the lambda phage control run. First off, looking at the FAST5 file produced by the MinION.

At this stage, the FAST5 file only has one dataset which is the “Signal” dataset.

The same thing, on a FAST5 file, which has been processed by Metrichor, now has a lot more associated information, notably Fastq, Events, various Log files for the different analyses and still contains the raw Signal dataset.


To list all groups recursively using h5ls use -r:

Similar information can be obtained using h5dump -n:

To get all data and metadata for a given group /Raw/Reads/Read_939:

Or, the following is similar without the group tags. The -d option is used for printing a specified dataset.

Removing the array indices using option -y:

Saving the raw Signal dataset to file “test”:

The same as the above but specifying that the column width of the dataset is 1 with the option -w 1:

Dumping the whole FAST5 into XML format:

O.K., that it for now.

2nd Viral Bioinformatics and Genomics Training Course (1st – 5th August 2016)

We have shared our knowledge on Viral bioinformatics and genomics with yet another clever and friendly bunch of researchers. Sixteen delegates from across the world joined us for a week of intensive training. The line-up of instructors changed slightly due to the departure of Gavin Wilkie earlier in the year.

Joseph Hughes (Course Organiser)
Andrew Davison
Sejal Modha
Richard Orton (co-organiser)
Sreenu Vattipally
Ana Da Silva Filipe

The timetable changed a bit with more focus on advanced bash scripting (loops and conditions) as we asked the participants to have basic linux command experience (ls, mkdir, cp) which saved us a lot of time. Rick Smith-Unna’s Linux bootcamp was really useful for the students to check their expertise before the course:

The timetable this year follows and as in the previous year, we had plenty of time for discussion at lunch time and tea breaks and the traditional celebratory cake at the end of the week.

9:00-9:45                 Tea & Coffee in the Barn – Arrival of participants


The first day will start with an introduction to the various high-throughput sequencing (HTS) technologies available and the ways in which samples are prepared, with an emphasis on how this impacts the bioinformatic analyses. The rest of the first day and the second day will aim to familiarize participants with the command line and useful UNIX commands, in order to empower them to automate various analyses.

9:45-10:00           Welcome and introductions – Massimo Palmarini and Joseph Hughes
10:00-10:45        Next-generation sequencing technologies – Ana Da Silva Filipe
10:45-11:15        Examples of HTS data being used in virology – Richard Orton
11:15:11:30            Short break
11:30-11:45        Introduction to Linux and getting started – Sreenu Vattipally
11:45-12:30        Basic commands – Sreenu Vattipally
12:30-13:30            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
13:30-14:30        File editing in Linux – Sreenu Vattipally & Richard Orton
14:30-15:30        Text processing – Sreenu Vattipally & Richard Orton
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Advanced Linux commands – Sreenu Vattipally

The second day will continue with practicing UNIX commands and learning how to run basic bioinformatic tools. By the end, participants will be able to analyse HTS data using various reference assemblers and will be able to automate the processing of multiple files.

9:30-11:00           BASH scripting (conditions and loops) – Sreenu Vattipally
11:00-11:30            Tea & Coffee in the Barn Room
11:30-12:15        Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally & Richard Orton
12:15-13:00        Sequence quality checks – Sreenu Vattipally & Richard Orton
13:00-14:00            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
14:00-14:45        Introduction to assembly (BWA and Bowtie2)– Sreenu Vattipally & Richard Orton
14:45-15:30        More reference assembly (Novoalign, Tanoti and comparison of mapping methods) – Sreenu Vattipally & Sejal Modha
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Post-processing of assemblies and visualization (working with Tablet and Ugene and consensus sequence generation) – Sreenu Vattipally & Sejal Modha

The third day will start with participants looking at variant calling and quasi-species characterisation. In the afternoon, we will use different approaches for de novo assembly and also provide hands-on experience.

9:30-11:00           Error detection and variant calling – Richard Orton
11:00-11:30            Tea & Coffee in Barn Room
11:30-13:00        Quasi-species characterisation – Richard Orton
13:00-14:00            Lunch break in the Lomond Room with an informal presentation of Pablo Murcia’s research program.
14:00-14:45        De novo assemblers – Sejal Modha
14:45-1:30           Using different de novo assemblers (e.g. idba-ud, MIRA, Abyss, Spades) – Sejal Modha
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Assembly quality assessment, merging contigs, filling gaps in assemblies and correcting errors (e.g. QUAST, GARM, scaffold builder, ICORN2, spades) – Sejal Modha

On the fourth day, participants will look at their own assemblies in more detail, and will learn how to create a finished genome with gene annotations. A popular metagenomic pipeline will be presented, and participants will learn how to use it. In the afternoon, the participants will build their own metagenomic pipeline putting in practice the bash scripting learnt during the first two days.

9:30-10:15           Finishing and annotating genomes – Andrew Davison & Sejal Modha
10:15-11:00        Annotation transfer from related species – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-12:15        The MetAMOS metagenomic pipeline – Sejal Modha & Sreenu Vattipally
13:00-14:00            Lunch break in Lomond Room with informal presentation of Roman Biek’s research program.
14:00-15:30        Practice in building a custom de novo pipeline – Sejal Modha & Sreenu Vattipally
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Practice in building a custom de novo pipeline – Sejal Modha
17:30                         Group photo followed by social evening and Dinner at the Curler’s Rest ( 

On the final day, participants will combine the the consensus sequences generated during day two with data from Genbank to produce phylogenies. The practical aspects of automating phylogenetic analyses will be emphasised to reinforce the bash scripting learnt over the previous days.

9:30-10:15           Downloading data from GenBank using the command line – Joseph Hughes & Sejal Modha
10:15-11:00        Introduction to multiple sequence alignments – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-1300         Introduction to phylogenetic analysis – Joseph Hughes
13:00-14:00            Lunch break in the Lomond Room with a celebratory cake
14:00-15:30        Analysing your own data or developing your own pipeline – Whole team available
15:30-16:00            Tea & Coffee in the Barn
16:00-17:00        Analysing your own data or developing your own pipeline – Whole team available
17:00                       Goodbyes
We wish all the participants lots of fun with their new bioinformatic skills.

If you are interested in finding out about future course that we will be running, please fill in the form with your details.

NGS Data Formats and Analyses

Here are my slides from a session on NGS data formats and analyses that I gave as part of the EPIZONE Workshop on Next Generation Sequencing applications and Bioinformatics in Brussels in April 2016. It covers file formats such as FASTA, FASTQ, SAM, BAM, and VCF, and also goes over IUAPAC nucleotide ambiguity codes, read names, quality scores, error probabilities, CIGAR strings.


Simple Bash Quiz

Basic Bash Quiz

A simple quiz to test bash knowledge.

This quiz has 20 questions and you must provide at least one answer for each question.

Each user can only have one open quiz attempt at a time

You will be given 10 mins to complete this quiz.

You must specify a text.
You must fill out this field.
Bioinformatician at CVR.

Java CIGAR Parser for SAM format

Sequence Alignment/Map (SAM) format is a well-known bioinformatics format designed to store  information about reads mapping against large reference sequence.  The SAM file is split into two sections: a header section and an alignment section. The header section starts with ‘@’ and it contains information such as the name and length of the reference sequence. The alignment section follows the header section. Each line in this section represents an alignment of a read to the reference sequence that may include insertion, deletion, skips and padding. However, not all lines are successful alignment. A single alignment line consists of 11 mandatory fields and a variable number of optional fields. The mandatory fields are ordered and, if information is unavailable, their values can be zero, for a number field, or * for a string field.  The reader is referred to the SAM Format Specification [1,2] for a detailed description of each field in the alignment section but in summary, the mandatory fields are:

  1. QNAME: the name of the read.
  2. FLAG: a bitwise number describing information about the read such as mapped/unmapped read, pairing, etc.
  3. RNAME: the reference sequence name. The value of this field is * for unmapped reads.
  4. POS: the left most mapping position of the first matching base. It is set to zero for unmapped reads.
  5. MAPQ: a number to indicate the mapping quality.
  6. CIGAR: a string describing how the read aligns with the reference. It consists of one or more components. Each component comprises an operator and the number of bases which the operator applies to. Operators are: MIDNSHP=X.  The following table gives an overview of these operators
  7. MNAME: the name of the mate strand, if it is a paired read. ‘=’ means that the name is the same as RNAME.
  8. MPOS: a number indicating the left most mapping position of the mate.
  9. TLEN: a number set to 0 for single-segment. For paired-segments it is calculated as the number of bases from the left most mapped base to the right most mapped base.
  10. SEQ: the read sequence as the reference. If not  ‘*’ then the length of the sequence must be equal the sum of lengths of M/I/S/=/X operations in the CIGAR string.
  11. QUAL: the read quality, same as the quality string in a fastaq format.
Operator Description
D Deletion; the nucleotide is present in the reference but not in the read
H Hard Clipping; the clipped nucleotides are not present in the read.
I Insertion; the nucleotide is present in the read  but not in the rference.
M Match; can be either an alignment match or mismatch. The nucleotide is present in the reference.
N Skipped region; a region of nucleotides is not present in the read
P Padding; padded area in the read and not in the reference
S Soft Clipping;  the clipped nucleotides are present in the read
X Read Mismatch; the nucleotide is present in the reference
= Read Match; the nucleotide is present in the reference

The following example shows the mapping of different reads to a reference sequence and their corresponding SAM file [2].  SAM_parser_figure1

Different tools such as SAMtools [2] and htsjdk are available to post-process the SAM files for analysis of the alignment information.

An intuitive usage of the SAM file is to view the overall coverage of the reference sequence by the reads. This information can be retrieved from the mapping position and CIGAR value fields which are the 4th and 6th fields, respectively.  Computationally, reference coverage is obtained by counting the number of times (frequency) of each nucleotide base in the reference sequence that appears in the mapped reads.  The frequency is determined using one of three cases according to the operators in the CIGAR string:

  • Case1 ( M/X/=) :
    • start at the specified mapping position, set counter to 1
    • Add 1 to both the counts of the bases from that position and the counter.
    • Move to the next position.
    • Repeat this process till counter is the same as the number associated with the operator.
  • Case2 (N/D):
    • Move the specified mapping position by the number associated with the operator.
  • Case3 (I/S/H/P):
    • Do nothing

The following Java codes illustrate how to obtain the coverage of a single reference sequence obtained from a SAM file.


The highest frequency in the reference sequence indicates the depth of the reads mapped to the reference. In this case, it is 3.


[1]: The Sequence Alignment/Map format and SAMtools.   BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078–2079 doi:10.1093/bioinformatics/btp352

[2]: Sequence Alignment/Map Format Specification.pdf





The dark arts of Ion Torrent Sequencing

All technologies, have their advantages and disadvantages, and next generation sequencing (NGS) is no different. For the time being, the current dominant forces in NGS are Illumina, which is based upon detecting a flash of light as fluorescent nucleotides are incorporated, or Ion Torrent, where nucleotide incorporation is detected in the form of a change in pH. Illumina has made its technology user friendly and produces high quality data, but is expensive and sensitive to misalignment of the lens used to photograph the flow cell, whereas Ion Torrent is (relatively) cheaper and less sensitive to shakes, but historically results in poorer quality data and user friendliness is far behind Illumina. Ion Torrent’s answer to addressing consistency and user friendliness has been to release the Ion Chef; effectively a robot which performs the steps from library to chip. Life without an Ion Chef is very different. Like any method, the instructions that come with the supplies tells you what to do, but it doesn’t tell you the ‘tricks’: those little touches that make you hit the sweet spot. In the non-chef protocol, for the Ion Proton, there are various points throughout the procedure, from RNA to sequence data, where little touches help.
My experience has been using the Ion Proton for transcriptomics, so any ‘recommendations’ when it comes to Ion Torrent sequencing refer to RNAseq.
Library preparation begins with good quality RNA. The standard way to determine RNA quality is to assess the integrity and ratio of the ribosomal RNA peaks. Originally this was simply an agarose gel, but nowadays the Bioanalyzer or Tapestation from Agilent are commonplace and provide a ‘RIN’ value indicating the RNA quality. For anyone thinking of buying one, given the choice between Bioanalyser and Tapestation, personally I’d choose the Bioanalyzer every time as the reagents seem to be much more stable overtime. For quantification it is widely accepted that dye-based methods of quantitation, e.g. Qubit, are much more reliable than spectrophotometry, e.g. NanoDrop.

Bioanalyser trace

Good quality RNA: Total RNA (RIN 10) as analysed using the Bioanalyzer.

The amount of RNA used to generate the library can be very small, but should be sufficient that the resulting library is representative of the RNA population in the sample. Currently, I poly(A) enrich 1.5µg of total RNA and use the entire eluate as the input for the RNAseq V2 kit from Ion Torrent. The protocol provided suggests 3 minutes of fragmentation for <100ng of poly(A) enriched RNA as an input. Fragmenting for this long though will produce short reads, whereas a shorter incubation, e.g. 1min 30sec will result in longer reads, which may be beneficial when it comes to aligning the reads to the reference. When it comes to the clean-up after fragmentation, and this applies to all of the clean-ups, big emphasis is placed upon precise volumes. It’s true, this is important, but just as important – and something not in the manual – is that the ethanol MUST be fresh. As soon as the bottle is open the quality drops. It’s prudent to make aliquots from a freshly opened bottle, especially if you want batch –to-batch consistency among your libraries.

When it comes to barcoding, if possible plan ahead such that the successive runs on the Proton utilise different barcodes. In this way it may help to remove inter-run contamination of data.
Library QC is also important, again, use the Qubit. Once amplified, using 1.5µg of total RNA input results in libraries that need to be diluted a lot to reach the required dilution for the OT2, either 100pM (and use 8µl in the OT2 reaction), or 8pM (and use 100µl in the OT2 reaction). More DNA than this will result in an increase in polyclonal spheres. It may be tedious vortexing the ISPs for a full minute prior to their addition to the OT2 reaction, but it’s important and adding the ISPs is the last thing I do prior to starting the OT2.

After enrichment and loading (adding 10µl of 50% annealing buffer prior to adding to the chip), the foam washes aren’t to be underestimated. The wells may be perfectly loaded with perfectly templated ISPs, but unless the surface is washed properly, the amount of data will be reduced. Whereas the protocol states two washes, I perform a minimum of 4, some run slowly, some faster. The goal each time is to generate a fine foam. For this the pipette tip makes a surprising amount of difference; unless there is a sufficiently narrow opening to the tip, it is difficult to generate a good foam. It is also sometimes possible to ‘over-foam’, whereby pipetting too much will result in the small bubbles becoming bigger.


Chip loading: an image of the Ion Proton PI chip loaded with ISPs

I’m not sure how much of a difference it makes, but I try to add the polymerase as slowly as possible and leave the chip to sit still for at least 5 minutes before moving it.

These are simply some of the habits I’ve picked up. The protocol is much improved compared to previous versions but, ultimately, the best option would be to buy an Ion Chef!

Some key factors for number of significant DE genes

In the different RNA-Seq experiments, we have found that some experiments have more significant DE (differential expression) genes than others. Besides the experiment design which is the most important factor, a number of other factors may influence the number of DE genes. Here, we list the different experiments that we have carried out the investigation of the factors that may influence the number of DE genes.

The influence of trimming reads
Previous analyses applying gentle trimming have negligible effects on the number of DE genes. However, heavy trimming could decrease the expression level heavily which is not objective for analysis.

The threshold of Significant DE genes

The default parameter forchoosing the significance of a DE gene is the Q-value (i.e. FDR P-value)< 0.05. This is based on the Benjamini–Hochberg procedure correction. One way to change the number of DE genes is to change the threshold value or the parameter used as a threshold. For example, any of Q-value< 0.1, P-value<0.05 and Q-value<0.1&abs(Log2FC)>1 could be used as threshold parameters. Using data sequenced at the CVR (treated and untreated cells) as an example, edgeR is used as the DE analysis tool (Fig. 1). The largest number of deferentially expressed genes is found if P<0.05 is used as a threshold parameter.On the other hand, if you want to reduce the number of DE genes, you could change the correction procedure used. For example, the Benjamini–Yekutieli procedure is more conservative than the Benjamini–Hochberg procedure.


Fig.1 Number of significant DE genes with different threshold

Biological coefficient of variation

An important factor that influences the number of DE genes is the variation among the samples.The BCV (Biological Coefficient of Variation) plot is a way to measure the biological variation within a particular condition. A common dispersion (i.e. red line on the BCV plot) between 0.2 and 0.4 is usually considered reasonable and hence could detect more DE genes.If the common dispersion is above the 0.4 threshold, this will influence the number of DE genes found in the study. Additionally, the PCA (principal component analysis) and MDS (multidimensional scaling) plots which represents the relationship between groups of samples are affected by high BCVs. The experiment was run twice. We can see the differences in the BCV and MDS plots between the two experiments. An experiment where the common dispersion is 0.41 provides almost 576 DE genes whilst an experiment with high average BCV results in only 109 DE genes. Additionally, the MDS plots clearly differentiate the different conditions when the common dispersion is low implying greater variation between the conditions than within, as expected.

experiment 1 (Condition 1; Condition 2):

Condition 1 sample ID: CVR71-1, CVR74-1, CVR57-1;

Condition 2 sample ID: CVR71-2, CVR74-2, CVR57-2

experiment 2 (Condition1; Condition 2):

Condition 1 sample ID: CVR82-1, CVR 82-2, CVR82-3;

Condition 2 sample ID: CVR83-1, CVR83-2, CVR83-3


Fig 2. BCV plot of Experiment 1 vs BCV plot of Experiment 2


Fig 3. MDS plot of Experiment 1 vs MDS plot of Experiment 2

Number of samples

Statistically, it is beneficial to have as many replicates as possible. The minimum number of replicates would be 3 but it is advisable to have more than 4 in each condition for cell lines and 6 for tissue samples that inherently have more variation in the genes being expressed. The number of DE genes would be affected by number of samples in the experiment. We used edgeR to compare the results obtained with 2 replicates versus 3 replicates for an experiment 1 and cell experiment 2 by edgeR and get the result to Fig. 4. Figure 4 shows that we could get more significantly DE genes with more samples.


Fig.4 Number of significant DE genes vs sample number

However, if we have outlier samples in each condition, the number of significant DE genes will decrease (Fig. 5). For example, we made a new DE analysis of the pig genome using:

Condition 1:CVR57-1, CVR82-1, CVR 82-2, CVR82-3

Condition 2: CVR57-2, CVR83-1, CVR83-2, CVR83-3

As shown in the Fig.5, the number of DE genes decreased because of the outlier (i.e. CVR57-1 and CVR57-2). By adding this outlier, you are increasing the biological variation in the experiment, resulting in a higher BCV and a lower number of DE genes.

Number of reads in each sample

The number of DE genes could also be affected by the number of reads in the samples. Here we randomly sub-sampling the number of RNA-Seq reads of each sample (Fig. 5). If we have more reads, we find more DE genes.DESeq2 is influenced more by the number of reads in the experiment than edgeR which is less affected by changes in the number of mapped reads. This also suggests that sequencing 25% less per sample, will not adversely affect the number of DE genes determined by edgeR.


Fig.5 Number of significant DE genes vs proportion of sub-sampling


Here we test some of our samples (unmapped reads) to the Mycoplasmas bacteria with Ion-Proton sequencing data. Mycoplasma can influence the growth of cell cultures may adversely affect the experiment, thus resulting in lower numbers of DE genes.

The mycoplasma genomes used in this study were downloaded from NCBI genomes: Mycoplasma hominis ATCC 23114 (NC_013511.1), M. hyorhinis MCLD (NC_017519.1), Mycoplasma fermentans M64 (NC_014921.1) and Acholeplasma laidlawii PG-8A (NC_010163.1). As shown in the Table 1, there is not an obvious relationship between number of significant DE genes and the mycoplasma contamination.

Table 1 Statistic of mapping to Mycoplasma and pig genome


The number of genes in the genome

The number of genes in the genome could also influence the number of significant DE genes although this is unlikely to be the case in the current set of experiments approx. 25,000 genes and approx. 26,000.

The statistical model

Even in the same pipeline, use different statistical model could affect the number of DE genes as well.

For example, number of DE genes will increase from 576 to 3457 if the model of edgeR changed from ‘classical linear models’ to ‘Generalized linear models (GLM) with muti-factor of samples’ .


We propose that the following strategy for future experimental design:

  • increasing the number of replicates;
  • the number of sequenced reads could be decreased slightly to accommodate for more replicates;
  • insuring consistent handling in the laboratory.


Bioinformatics tools to analyse viral genomics data


We have recently written a review article entitled “Bioinformatics tools to analyse viral genomics data” for the OIE. In the review, we were unable to provide direct hyperlinks and references to all available tools, simply because there are too many, so we included them here. These commonly used bioinformatics tools are split into the following categories:

Quality Control – Adapter Removal

AdapterRemoval, CutAdapt, FASTX-Toolkit, Scythe, TagCleaner, Trimmomatic, TrimGalore

Quality Control – Trimming / Filtering

ConDeTriFastQC, FASTX-ToolkitPRINSEQ, SickleTrimGalore

Quality Control – Non FASTQ formats

454/Torrent (sff): PyroCleanerseq_crumbs, sff tools; PacBio (hd5): pbh5tools; Oxford Nanopore (fast5): PoreTools

Error Correction – 454/Torrent

AmpliconNoise, Coral, PyroHMMvarRC454

Reference Mapping – Hash Based

Mosaik, NextGenMap, Novoalign, Stampy, Tanoti

Reference Mapping – Burrows-Wheeler

BarraCUDA, Bowtie, BWA, Cushaw2, GEM, SOap3-DP

Reference Mapping – Long Reads


Variant Calling

DiversiTools, FluxSimulator, LoFreq, Segminator, V-Phaser, VarScan

Quasispecies Reconstruction

HaploClique, PredictHaplo, Qcolors, QuasiRecomb, QuRe, ShoRAH, ViQuaS

De novo assembly – OLC

Edena, Forge, Newbler, SGA, Shorty

De novo assembly – de Bruijn

ABySS, CLC, Cortex, EULER-SR, IDBA-UDMIRA, SOAP2, SPAdes, Velvet, Vicuna

De novo assembly – Scaffolders

Abacas, Bambus2, BESST, GRASS, MIP, Scaffold Builder, SCARPA, SOPRA, SSPACE

De novo assembly – Gap Filling

GapCloser, GapFiller, IMAGE

Metagenomics – Homology

MEGAN, Naïve Bayes Classifier, PhymmBL

Metagenomics – Abundance

Kraken, MetaPhlAn, RIEMS, SIGMA

Metagenomics – Pipelines

IMSA, MetaAMOS, VirusFinder2

Metagenomics – De Novo

MetaVelvet, Ray Meta, also see de novo section above

RNA-Seq – Mapping

TopHat, GSNAP, OLego, SOAPsplice, STAR

RNA-Seq – Transcript assembly

Cufflinks, baySeq, edgeR, DESeq, limma

RNA-Seq – de novo

Trinity, SOAPdenovo-Trans, Trans-ABySS

If you think there is a tool missing that should be included, or a link is not working, leave a comment below.


Case studies of HTS applications

Here are my slides from the BBSRC WestBio DTP skills training session at University of Glasgow, Friday 26th June 2015. The talk was entitled “Case studies of HTS applications” and presents a number of case studies on the application on high-throughput sequencing (HTS), also known as next generation sequencing (NGS), to biological problems ranging from human genome sequencing, identification of disease mutations, metagenomics, virus discovery, epidemics, transmission chains and viral population analyses.



Convert NCBI Protein GI to Genome Accession

A few days back I posted a question on BioStars about getting genome accession numbers for a list of protein GIs. I had a long list of protein GI and I wanted the genome accession number for each protein GI (if there is one in NCBI databases) but without downloading files for each protein GI in genbank or xml format.

One way to  do this is use db2db. However you can only use db2db if you have a list of protein accession number for the protein GIs of interest. Also I wanted to include this step as part of a pipeline and automate it. db2db is a web based approach so doesn’t allow for easy automation.

I wrote the following script that first uses NCBI utilities to convert the list of protein GI to nucleotide GI and then fetches genome accession numbers for those nucleotide GIs.

This script will take a file with the list of protein GI as an input and can be run as

where test_gi_list file contains the following protein GIs


This command should provide the following output.

Bioinformatician at CVR.