Category Archives: BioLinux

2nd Viral Bioinformatics and Genomics Training Course (1st – 5th August 2016)

We have shared our knowledge on Viral bioinformatics and genomics with yet another clever and friendly bunch of researchers. Sixteen delegates from across the world joined us for a week of intensive training. The line-up of instructors changed slightly due to the departure of Gavin Wilkie earlier in the year.

Instructors:
Joseph Hughes (Course Organiser)
Andrew Davison
Sejal Modha
Richard Orton (co-organiser)
Sreenu Vattipally
Ana Da Silva Filipe

The timetable changed a bit with more focus on advanced bash scripting (loops and conditions) as we asked the participants to have basic linux command experience (ls, mkdir, cp) which saved us a lot of time. Rick Smith-Unna’s Linux bootcamp was really useful for the students to check their expertise before the course: http://rik.smith-unna.com/command_line_bootcamp.

The timetable this year follows and as in the previous year, we had plenty of time for discussion at lunch time and tea breaks and the traditional celebratory cake at the end of the week.

9:00-9:45                 Tea & Coffee in the Barn – Arrival of participants

DAY 1 – SEQUENCING TECHNOLOGIES AND A UNIX PRIMER

The first day will start with an introduction to the various high-throughput sequencing (HTS) technologies available and the ways in which samples are prepared, with an emphasis on how this impacts the bioinformatic analyses. The rest of the first day and the second day will aim to familiarize participants with the command line and useful UNIX commands, in order to empower them to automate various analyses.

9:45-10:00           Welcome and introductions – Massimo Palmarini and Joseph Hughes
MassimoWelcomeOIE2016
10:00-10:45        Next-generation sequencing technologies – Ana Da Silva Filipe
AnaSequencingTechnology
10:45-11:15        Examples of HTS data being used in virology – Richard Orton
11:15:11:30            Short break
11:30-11:45        Introduction to Linux and getting started – Sreenu Vattipally
SreenuExplaining
11:45-12:30        Basic commands – Sreenu Vattipally
12:30-13:30            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
13:30-14:30        File editing in Linux – Sreenu Vattipally & Richard Orton
14:30-15:30        Text processing – Sreenu Vattipally & Richard Orton
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Advanced Linux commands – Sreenu Vattipally
DAY 2 – UNIX CONTINUED AND HOW TO DO REFERENCE ASSEMBLIES

The second day will continue with practicing UNIX commands and learning how to run basic bioinformatic tools. By the end, participants will be able to analyse HTS data using various reference assemblers and will be able to automate the processing of multiple files.

9:30-11:00           BASH scripting (conditions and loops) – Sreenu Vattipally
11:00-11:30            Tea & Coffee in the Barn Room
11:30-12:15        Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally & Richard Orton
12:15-13:00        Sequence quality checks – Sreenu Vattipally & Richard Orton
13:00-14:00            Lunch break in the Barn followed by a guided tour of the sequencing facility with Ana Da Silva Filipe
14:00-14:45        Introduction to assembly (BWA and Bowtie2)– Sreenu Vattipally & Richard Orton
14:45-15:30        More reference assembly (Novoalign, Tanoti and comparison of mapping methods) – Sreenu Vattipally & Sejal Modha
15:30-16:00            Tea & Coffee in the Barn Room
16:00-17:30        Post-processing of assemblies and visualization (working with Tablet and Ugene and consensus sequence generation) – Sreenu Vattipally & Sejal Modha
DAY 3 – HOW TO DO VARIANT CALLING AND DE NOVO ASSEMBLY

The third day will start with participants looking at variant calling and quasi-species characterisation. In the afternoon, we will use different approaches for de novo assembly and also provide hands-on experience.

9:30-11:00           Error detection and variant calling – Richard Orton
RichardHelping
11:00-11:30            Tea & Coffee in Barn Room
11:30-13:00        Quasi-species characterisation – Richard Orton
13:00-14:00            Lunch break in the Lomond Room with an informal presentation of Pablo Murcia’s research program.
PabloPontificating
14:00-14:45        De novo assemblers – Sejal Modha
14:45-1:30           Using different de novo assemblers (e.g. idba-ud, MIRA, Abyss, Spades) – Sejal Modha
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Assembly quality assessment, merging contigs, filling gaps in assemblies and correcting errors (e.g. QUAST, GARM, scaffold builder, ICORN2, spades) – Sejal Modha
DAY 4 – METAGENOMICS AND HOW TO DO GENOME ANNOTATION

On the fourth day, participants will look at their own assemblies in more detail, and will learn how to create a finished genome with gene annotations. A popular metagenomic pipeline will be presented, and participants will learn how to use it. In the afternoon, the participants will build their own metagenomic pipeline putting in practice the bash scripting learnt during the first two days.

9:30-10:15           Finishing and annotating genomes – Andrew Davison & Sejal Modha
10:15-11:00        Annotation transfer from related species – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-12:15        The MetAMOS metagenomic pipeline – Sejal Modha & Sreenu Vattipally
13:00-14:00            Lunch break in Lomond Room with informal presentation of Roman Biek’s research program.
Roman
14:00-15:30        Practice in building a custom de novo pipeline – Sejal Modha & Sreenu Vattipally
SejPresenting
15:30-16:00            Tea & Coffee in the Barn
16:00-17:30        Practice in building a custom de novo pipeline – Sejal Modha
GroupPhotoOIE2016
17:30                         Group photo followed by social evening and Dinner at the Curler’s Rest (http://www.thecurlersrestglasgow.co.uk). 
CurelersDinner
DAY 5 – PRIMER IN APPLIED PHYLOGNETICS

On the final day, participants will combine the the consensus sequences generated during day two with data from Genbank to produce phylogenies. The practical aspects of automating phylogenetic analyses will be emphasised to reinforce the bash scripting learnt over the previous days.

9:30-10:15           Downloading data from GenBank using the command line – Joseph Hughes & Sejal Modha
10:15-11:00        Introduction to multiple sequence alignments – Joseph Hughes
11:00-11:30            Tea & Coffee in the Barn
11:30-1300         Introduction to phylogenetic analysis – Joseph Hughes
13:00-14:00            Lunch break in the Lomond Room with a celebratory cake
OIEcake2016
14:00-15:30        Analysing your own data or developing your own pipeline – Whole team available
15:30-16:00            Tea & Coffee in the Barn
16:00-17:00        Analysing your own data or developing your own pipeline – Whole team available
17:00                       Goodbyes
We wish all the participants lots of fun with their new bioinformatic skills.

If you are interested in finding out about future course that we will be running, please fill in the form with your details.

How to make a BioLinux Live USB Stick – with persistent data storage

These are the steps I used to create a batch of bootable BioLinux Live USB sticks – with persistent data so that any data files created/downloaded would be preserved. This was used for a course so that each stick had the same NGS data and the same additional (non-BioLinux) programs pre-installed and already configured.

Step 1 – Download the BioLinux ISO file for use with DVD/USB media

The downloaded .iso file is an archive file that contains the whole BioLinux operating system – it can be used later to either install BioLinux onto a machine, or to create a bootable BioLinux USB Live disk. The bio-linux-8-latest.iso image is currently (March 2016) 3.58GB in size.

Step 2 – Download and install UNetbootin

UNetbootin allows you to create bootable Live USB drives for Ubuntu and other Linux distributions without burning a CD.

It is simple to install, on a Mac you just move the downloaded unetbootin.app file into /Applications

Step 3 – Create an initial BioLinux Live USB disk with persistent data

As the .iso file is 3.58GB in size, a USB stick of atleast 4GB is needed, but that is a little to close for comfort, so best to go for a USB stick of atleast 8GB; these days 8GB sticks are very cheap (£2.99) and are the same price (if not cheaper) as 4GB sticks. To play safe, the USB stick should probably be in FAT32 format – FAT32 has a limitation of 4GB for file sizes – this includes the overall casper-rw BioLinux file which will be where all the persistent data is stored, so if you are going to be storing more than 4GB of data then you will probably need the NTFS file system on the USB stick.

Insert your blank USB key into your computer. Launch unetbootin. Select the “Diskimage” toggle button, select “ISO” from the drop down list, and then navigate to and select the BioLinux .iso file from your computer downloaded in Step 1. Next, in the field entitled “Space used to preserve files across reboots (Ubuntu only)” enter “3500” into the MB textfield (3.5 GB) – you could increase this above 4GB if you have a bigger USB stick and if it is using the NTFS file system. Next, select “USB Drive” from the “Type” drop down list, and then select your actual USB stick from the “Drive” drop down list and then click “OK” to create your bootable BioLinux Live USB stick with persistent data storage.

Step 4 – Boot into your BioLinux

Next step is to boot into the BioLinux Live USB disk from a machine – this will need to be a Windows or Linux machine, a modern Mac is unlikely to boot up from it. Turn the computer off, insert the BioLinux Live USB stick into the computer, turn the computer back on, and get ready. As soon as the first screen appears – which normally has the computer manufacturer logo – it should say something like “Press F12 to Choose Boot Device” at the bottom of the screen – so press F12 quickly before the screen disappears. Sometimes it is not F12, sometimes it is F10 or F2 or another key, but it should say on the screen what button to press. This will launch the BIOS menu. Enter the “Boot Device Select” menu, and move your USB Stick up the boot order to the top, so that the computer will now boot from the USB stick before its own hard drive. Exit the BIOS menu, saving any changes, and the computer should now boot into the BioLinux Live USB stick.

Step 5 – Customise your BioLinux – add data and programs

Now you will be inside your own BioLinux OS on the USB stick. So install any extra programs you want, configure PATHs, and download any data files you want. The programs, configs and data will be saved onto the USB stick and preserved – due to the persistent data storage and the casper-rw file.

Now shutdown BioLinux, remove the USB stick, and boot back into your normal operating system.

Step 6 – Make an image copy of your customised BioLinux disk

Once inside your normal operating system, insert the BioLinux USB stick back in. The next step will only work on a Mac or a Linux machine as it using the dd command.

This copies the BioLinux Live USB stick (located at /dev/disk2 on my machine – on a mac run “diskutil list” to see where yours is) and it creates a single biolinux.img file in the Documents folder which contains the entire operating system along with all the extra data and programs I installed.

The original customised BioLinux Live USB stick can now be ejected and removed.

Step 7 – Copy Copy Copy

Insert a new blank USB stick into the computer (obviously it needs to be atleast the same size as the original one). Now we want to make a copy of that original BioLinux Live USB stick onto the new USB stick using the dd command:

This copies the biolinux.img file located in the Documents folder that we created in Step 6, onto the new blank USB disk (located at /dev/disk2 – check where yours is). On a Mac, I had to first go into DiskUtility and dismount the FAT32 partition of the USB stick before dd would work – not dismount the USB stick itself, just the FAT32 partition. The key thing here, is that you can insert multiple blank USB sticks into all the available USB sticks and run the dd command in parallel:

For an 8GB USB stick, this copying process took almost exactly 1 hour. Then you can eject the USB sticks and put new ones in and copy another batch.

Acknowledgements

Many thanks to Paul Capewell and Willie Weir for a tip on the dd command.

Setting up an Amazon ftp server to receive big files

Sharing large files with collaborators has rarely been a problem, we usually just compress them and put them on our web server and then send the link to our collaborator who can then download the file.
However, we have struggled to find a solution to receive large files. We usually run out of space in Dropbox or Google Drive. We have tried infinit.io but this has failed on a few occasions, we think due to firewall issues either on our side or on the side of the collaborator. So when we managed to get an amazon cloud account set-up through Arcus Global (see my previous blog on how we got that organised), an obvious thing to try was to set-up an ftp server to receive large files.

A bit of googling around provided us with a very useful post on Stackoverflow.

First, we launched an instance through the Amazon web interface. We selected an Ubuntu instance from Amazon EC2 and specified 60Gb of storage. We generated a new key called “ftp” and saved the key locally. The .txt extension was added to the file so we renamed it and changed the permissions.

Using the ip address of the instance we then logged into the instance using ssh.

We installed the ftp server

but we did not provide a password for ftp as we decided to use the ubuntu username and password for login.

Then we opened up the FTP ports on your EC2 instance as described on Stackoverflow.

We changed the ssh configuration as explained in the previous blog and changed the ubuntu user’s password:

Then we changed the ftp configuration file in /etc/vsftpd.conf

The following lines where changed:

And added the following with the IP of our instance. If restarting an image, the IP will be different so this will need to be changed

Restart vsftpd

We had it up for about 24 hours and it cost approximately £0.56.

Top tips to keep your home folder on a server tidy

All bioinformatics server users and administrators would know how easy it is to fill up our home directories with huge amounts of data, especially when you are analysing deep sequencing data on a daily basis.

Here is a list of a few useful commands and tips that can help to keep your home directory tidy.

  • Do not copy fastq files to multiple locations, create soft links instead using the following command in your working directories.

  • Always convert .sam files to .bam files

  • Zip any data files/folders that are not going to be used for next few weeks.

  • To compress a directory and all the data within it run the following command

  • Organise your home directory well.

Keep all reference sequences in one folder

Keep all indexes in one folder (this could be the same folder as the references for simplicity)

  • Always delete temporary and intermediate files and keep a log of deleted files in a text file.
  • Empty the trash folder if you use a GUI or Virtual Desktop Environment
  • Use ncdu, tree or baobab (GUI) commands to find out disk consumption

  • Find out the size of your home directory using the following du command

  • For advanced users:

As mentioned in this stackoverflow forum, if you would like to get a list of multiple copies of files in your directory use the following set of commands.

 

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php

1st Viral Bioinformatics and Genomics Training Course

SRE_9674

The first Viral Bioinformatics and Genomics training course held at the University of Glasgow was completed successfully by 14 delegates (nine external and five internal) on 10-14 August 2015. The course took place in the McCall Building computer cluster, and the adjacent Lomond and Dumgoyne Rooms were used for refreshments and lunch.

Instructors:
Joseph Hughes (Course Organiser)
Andrew Davison
Robert Gifford
Sejal Modha
Richard Orton
Sreenu Vattipally
Gavin Wilkie

9:00-10:00 Tea & Coffee in Dumgoyne Room – Arrival of participants
DAY 1 – SEQUENCING TECHNOLOGIES AND A UNIX PRIMER

Day one will introduced the participant to the different sequencing technologies available, the ways in which samples are prepared with an emphasis on how this impacts the bioinformatic analyses. The rest of the day aimed to familiarize the researcher with the command line and useful UNIX commands.

IMG_20150814_121005

10:00-10:45 Next-generation sequencing technologies – Gavin Wilkie
10:45-11:30 Examples of HTS data being used in virology – Richard Orton
11:30-11:45 Brief introduction and history of Unix/Linux – Sreenu Vattipally and Richard Orton
11:45-12:30 The command line anatomy and basic commands – Sreenu Vattipally and Richard Orton

12:30-13:30 Lunch break in Dumgoyne Room

13:30-14:30 Essential UNIX commands – Sreenu Vattipally and Sejal Modha
14:30-15:30 Text processing with grep – Sreenu Vattipally and Sejal Modha
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:30 sed and awk – Sreenu Vattipally and Sejal Modha

DAY 2 – UNIX CONTINUED AND HOW TO DO REFERENCE ASSEMBLIES

The participant continued to practice UNIX commands and learn how to run basic bioinformatic tools. By the end of the day, the participant were able to analyse HTS data with different reference assemblers and automate the processing of multiple files.

9:30-10:15 Introduction to file formats (fasta, fastq, SAM, BAM, vcf) – Sreenu Vattipally and Richard Orton
10:15-11:00 Quality scores, quality control and trimming (Prinseq and FastQC and trim_galore) – Sreenu Vattipally and Richard Orton
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-13:00 Introduction to alignment and assembly programs (Blast, BWA, Bowtie) – Sreenu Vattipally and Richard Orton

13:00-14:00 Lunch break in Dumgoyne Room

14:00-14:45 Continuation of alignment and assembly programs (Novoaligner, Tanoti) – Sreenu Vattipally and Richard Orton
14:45-15:30 Post-assembly processing and alignment visualization (Tablet and UGENE) – Sreenu Vattipally and Richard Orton
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-16:45 Workflow design and automation – Sreenu Vattipally and Richard Orton
16:45-17:30 Loops and command line arguments – Sreenu Vattipally and Richard Orton

DAY 3 – HOW TO DO DE NOVO ASSEMBLY AND USE METAGENOMICS PIPELINES

The third day covered the different approaches used for de-novo assembly and provided hands-on experience. A popular metagenomic pipeline was presented and participants learned how to use it as well as create their own pipeline. (Slides and practical)

IMG_20150812_113351

9:30-10:15 De-novo assemblers – Sejal Modha and Gavin Wilkie
10:15-11:00 Using different de-novo assemblers (e.g. Meta-idba, Edena, Abyss, Spades) – Sejal Modha and Gavin Wilkie

11:00-11:30 – Tea & Coffee in Lomond Room
11:30-13:00 Scaffolding contigs, filling gaps in assemblies and correcting errors (e.g. phrap, gap_filler, scaffold builder, ICORN2, spades) – Gavin Wilkie and Sejal Modha

13:00-14:00 Lunch break in Lomond Room

14:00-14:45 Practice building a custom de-novo pipeline (BLAST, KronaTools)
14:45-15:30 the MetAmos metagenomic pipeline – Sejal Modha and Sreenu Vattipally
15:30-16:00 Tea & Coffee in Lomond Room
16:00-17:30 Analysis using the pipeline – Sejal Modha and Sreenu Vattipally

DAY 4 – SEQUENCING TECHNOLOGIES AND HOW TO DO GENOME ANNOTATION

Day four gave the participant the opportunity to look into more detail at their assembly, they learned how to create a finished curated full genome (emphasis on finishing) with gene annotations and analysed the variation within their sample

9:30-11:00 Finishing and Annotating genomes – Andrew Davison, Gavin Wilkie and Sreenu Vattipally
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-13:00 Annotation transfer from related species – Gavin Wilkie, Andrew Davison and Sreenu Vattipally

13:00-14:00 Lunch break in Dumgoyne Room

IMG_20150813_141642

14:00-15:30 Error detection and variant calling – Richard Orton
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:30 Quasi-species characterisation – Richard Orton

17:30 Social evening in the pub/restaurant at Curlers on Byres Road organised by Richard Orton (http://www.thecurlersrestglasgow.co.uk). 

 

DAY 5 – PRIMER IN INVESTIGATING PATHWAYS OF VIRAL EVOLUTION

Researchers worked through several practical examples of using sequences in virology and spent the remaining time analysing their own data with the teachers help.

9:30-10:15 Identifying mutations of interest for individual sequences and within a set of sequences – Robert Gifford
10:15-11:00 Combining phylogenies with traits – Robert Gifford
11:00-11:30 Tea & Coffee in Dumgoyne Room
11:30-12:15 Investigating epidemiology (IDU example) – Robert Gifford
12:15-13:00 Investigate transmission of drug resistance – Robert Gifford

13:00-14:00 Lunch break in Dumgoyne Room

14:00-15:30 Analysing your own data or developing your own pipeline
15:30-16:00 Tea & Coffee in Dumgoyne Room
16:00-17:00 Analysing your own data or developing your own pipeline
17:00 Goodbyes

IMG_20150814_130750

 

If you would like to be informed of similar training courses run in the future, please fill in your details here.

 

 

Setting up automatic BLAST database update on linux servers

Basic Local Alignment Search Tool (BLAST) is one of the most commonly used programs for sequence classification using similarity search.

Standalone BLAST can be setup easily on the local server. More info about how to set it up on a local Linux server can be found here:

http://www.ncbi.nlm.nih.gov/books/NBK52640/

In our lab, all our servers run the BioLinux operating system and BLAST is pre-installed on the server. With local BLAST, it is important to update local BLAST databases regularly to include new sequences submitted to NCBI. However, sometimes it does become a bit tricky to install and regularly update these databases.

Here is a small tutorial about how to setup local BLAST databases and regularly update them.

In BioLinux, the BLASTDB variable path is usually set up to /var/lib/blastdb and is specified in the blast_environment.sh file in /etc/profile.d/blast_environment.sh

The standard blast_environment.sh file looks like this.

BLASTDB path can be updated to /your/blastdb/location by changing details in the “if” statement of the file.

The following example shows how I will change the location to my customized blastdb in my home directory /home/sejalmodha/blastdb

On a standard linux server you can specify the BLASTDB path variable in /etc/bash.bashrc or in your local ~/.bashrc

To update these databases regularly on the server, use NCBI’s update_blastdb script and wrap it in a cronjob.

I have an update_db.sh script that downloads nr, nt and refseq_protein databases from the NCBI website and changes the permissions of those files so that all users can use the files.

To schedule the downloading of these databases monthly, put it in a cronjob called blast_cronrun and save the log to download.log file.

The last step is to submit the cronjob using the crontab command.

 

Bioinformatician at CVR.
http://bioinformatics.cvr.ac.uk/sejal.php