Author Archives:

About

Bioinformatician at CVR. http://bioinformatics.cvr.ac.uk/sejal.php

Update Kraken databases

Kraken is a really good k-mer based classification tool. I frequently use this tool for viral signal detection in metagenomic samples. A number of useful scripts such as updating Kraken databases are provided with the tool. Since the NCBI updated the FTP website structure and decided to phase-out Genbank Idenfiers (GIs), the default Kraken database update scripts do not work. So I decided to write a little python script to update Kraken databases that some Kraken users might find useful.

This script automatically downloads:

  • Human genome – most recent assembly version
  • Complete bacterial reference genomes
  • Complete viral reference genomes
  • Archaeal genomes
  • Reference plasmids sequences

To change default genome downloads e.g. download mouse reference genome instead of human; please make necessary changes in the code. Feel free to fork this script on github.

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php

NCBI Entrez Direct UNIX E-utilities

I use NCBI Entrez Direct UNIX E-utilities regularly for sequence and data retrieval from NCBI. These UNIX utils can be combined with any UNIX commands.

It is available to download from the NCBI website: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/

A few useful examples for NCBI edirect utilities.

Download a sequence in fasta format from NCBI using accession number

Batch retrieval for all proteins for taxon ID. This example will download all proteins for viruses in fasta format.

Download sequences infasta format from NCBI using edirect using isolate info

Download sequences from NCBI using edirect using bioproject accession or ID

Get all CDS from a genome

Get taxonomy ID from protein accession number

Get taxonomy ID from accession number using esummary

Get full lineage from accession number
Tip : xtract can be used to fetch any element from the xml output

Get scientific name from accession number

Download all refseq protein sequences for viruses

Download reference genome sequence from taxonomy ID
Note: Using efilter command

Get all proteins from a genome accession

Extract genome accession from protein accession – DBSOURCE attribute in genbank file and an alternative to the script mentioned in one of my earlier blog post.
Note: Following command would work with protein accession and GIs used as -id parameter in elink command.

More info about NCBI Entrez Direct E-utillities is available on the NCBI website. http://www.ncbi.nlm.nih.gov/books/NBK179288/

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php

Simple Bash Quiz

Basic Bash Quiz

A simple quiz to test bash knowledge.

This quiz has 20 questions and you must provide at least one answer for each question.

Each user can only have one open quiz attempt at a time

You will be given 10 mins to complete this quiz.

You must specify a text.
You must fill out this field.
Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php

Top tips to keep your home folder on a server tidy

All bioinformatics server users and administrators would know how easy it is to fill up our home directories with huge amounts of data, especially when you are analysing deep sequencing data on a daily basis.

Here is a list of a few useful commands and tips that can help to keep your home directory tidy.

  • Do not copy fastq files to multiple locations, create soft links instead using the following command in your working directories.

  • Always convert .sam files to .bam files

  • Zip any data files/folders that are not going to be used for next few weeks.

  • To compress a directory and all the data within it run the following command

  • Organise your home directory well.

Keep all reference sequences in one folder

Keep all indexes in one folder (this could be the same folder as the references for simplicity)

  • Always delete temporary and intermediate files and keep a log of deleted files in a text file.
  • Empty the trash folder if you use a GUI or Virtual Desktop Environment
  • Use ncdu, tree or baobab (GUI) commands to find out disk consumption

  • Find out the size of your home directory using the following du command

  • For advanced users:

As mentioned in this stackoverflow forum, if you would like to get a list of multiple copies of files in your directory use the following set of commands.

 

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php

Convert NCBI Protein GI to Genome Accession

A few days back I posted a question on BioStars about getting genome accession numbers for a list of protein GIs. I had a long list of protein GI and I wanted the genome accession number for each protein GI (if there is one in NCBI databases) but without downloading files for each protein GI in genbank or xml format.

One way to  do this is use db2db. However you can only use db2db if you have a list of protein accession number for the protein GIs of interest. Also I wanted to include this step as part of a pipeline and automate it. db2db is a web based approach so doesn’t allow for easy automation.

I wrote the following script that first uses NCBI utilities to convert the list of protein GI to nucleotide GI and then fetches genome accession numbers for those nucleotide GIs.

This script will take a file with the list of protein GI as an input and can be run as

where test_gi_list file contains the following protein GIs

752901017
675510735
674269272
360086542

This command should provide the following output.

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php

Setting up automatic BLAST database update on linux servers

Basic Local Alignment Search Tool (BLAST) is one of the most commonly used programs for sequence classification using similarity search.

Standalone BLAST can be setup easily on the local server. More info about how to set it up on a local Linux server can be found here:

http://www.ncbi.nlm.nih.gov/books/NBK52640/

In our lab, all our servers run the BioLinux operating system and BLAST is pre-installed on the server. With local BLAST, it is important to update local BLAST databases regularly to include new sequences submitted to NCBI. However, sometimes it does become a bit tricky to install and regularly update these databases.

Here is a small tutorial about how to setup local BLAST databases and regularly update them.

In BioLinux, the BLASTDB variable path is usually set up to /var/lib/blastdb and is specified in the blast_environment.sh file in /etc/profile.d/blast_environment.sh

The standard blast_environment.sh file looks like this.

BLASTDB path can be updated to /your/blastdb/location by changing details in the “if” statement of the file.

The following example shows how I will change the location to my customized blastdb in my home directory /home/sejalmodha/blastdb

On a standard linux server you can specify the BLASTDB path variable in /etc/bash.bashrc or in your local ~/.bashrc

To update these databases regularly on the server, use NCBI’s update_blastdb script and wrap it in a cronjob.

I have an update_db.sh script that downloads nr, nt and refseq_protein databases from the NCBI website and changes the permissions of those files so that all users can use the files.

To schedule the downloading of these databases monthly, put it in a cronjob called blast_cronrun and save the log to download.log file.

The last step is to submit the cronjob using the crontab command.

 

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php

Adding a new assembler in MetAMOS

MetAMOS is a modular metagenomic analysis pipeline that can be used to automate metagenomic assembly, annotation and scaffolding analysis. Further information and published paper about MetAMOS can be found here and documentation about the pipeline can be found here.

One of the biggest advantages of using the MetAMOS framework is the modularity of the framework. It allows users to add new programs at various steps of the pipeline such as annotation and assembly.

Here, I discuss how to add a new assembler within the framework. I am going to use metacortex as an example assembler that we wanted to incorporate in the CVR’s MetAMOS pipeline.

The chart below show the steps to add a new assembler into the MetAMOS pipeline.

Add a new assembler in MetAMOS

Steps to add new assembler in MetAMOS pipeline

To add a new assembler, add the name of the assembler in the ASSEMBLE.generic file. This file can be found here metAMOS/Utilities/config/ASSEMBLE.generic

It is also possible to add various version of the same program in the list as long as the names of the programs are different, e.g. soap_v1, soap_v2 etc.

Writing a configuration file is the most important part of adding a new assembler to the pipeline. The configuration file is the place where we specify various parameters for the assembly software including name, location, input, output, commands to run etc. The configuration file has several reserved keywords. A list of currently available reserved keywords has been described here.

The configuration file for metacortex looks like this:

The program configuration can be specified here in the [CONFIG] section. The “commands” keyword is used to specify the commands that need to run the actual program. Multiple commands can be separated using && sign. One can also use the reserved keywords such as [FIRST], [SECOND], [KMER], [PREFIX] etc to specify the input and output details. These keywords are identified by MetAMOS and are consistent throughout the framework.

input Input type – fasta, fastq etc
name Program name that you want to report
output Output file name
location Location of executable
paired Type of paired end read and how to pass it. You can use reserved keywords such as [FIRST] and [SECOND] here
commands Actual commands that need to be run using the executable specified at the location specified in the location

After preparing this file and following the steps mentioned in the figure 1, metAMOS needs to be built again. To rebuild the metAMOS executable:

1) Delete .pyc and .c files from the metAMOS-master/src/ folder
2) Delete the executables initPipeline and runPipeline.
Do not delete the .py scripts.
To rebuild metAMOS run the python script setup.py. This should generate the .pyc and .c files in the src/ directory and should also create the initPipeline and runPipeline executable.

When I tried running this newly added metacortex assembler within the pipeline, it gave me the following error.

I posted this on the github forum for help and MetAMOS developers (Todd J Treangen and Chris Hill) asked me to run the following command and provide my config file:

I got the following output for the grep command:

Sergey Koren, one of the developers of metAMOS , pointed out that the error was not due the configuration file but was because of the ‘echo’ command used in the file. MetAMOS has a list of allowed system commands, as echo was not specified in that list, MetAMOS could not recognize the command. System commands are specified in metAMOS-master/src/generic.py

Line #13 SYSTEM_COMMANDS

As highlighted above, the ‘echo’ command was added to this list, and MetAMOS executable was rebuilt again for the changes to take effect.

Job done, metacortex is now part of our in-house customized MetAMOS.

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php