Monthly Archives: April 2016

How to generate a Sample Sheet from sample/index data in BaseSpace

If you are using BaseSpace for sample entry but demultiplexing your data manually, you may have been frustrated that there is no facility to download your sample names and index tag data from BaseSpace as a sample sheet. This means you have to enter the same data twice, with the possibility of errors creeping in, especially for large projects with many samples and dual index tags.

We have found a way to avoid typing the same information twice and instead fetch the sample names, index IDs and index tag sequences from BaseSpace straight to a sample sheet. This saves a huge amount of time for large projects with many samples.

Log in to BaseSpace, and navigate to the ‘Libraries’ page within the ‘Prep Libraries’ tab. Each line is a set of libraries with complete information on index names and tag sequences. Clicking a set of libraries will bring up the following screen – this example has 24 samples with TruSeqLT tags (only 7 are visible without scrolling down the list).

[Image: libraries_for_export – the list of libraries in BaseSpace]

Clicking the ‘EXPORT’ button will download a comma separated file (csv) that can be opened in Excel. This file has all the sample names, index IDs and index sequences (though not quite in the correct format to paste into a sample sheet).

[Image: excel_for_export – the exported csv file opened in Excel]

Open the file in Excel, select the entire Index1 Column and click the ‘Text to Columns’ function (under the ‘Data’ menu in Excel). Choose the ‘Delimited’ option, then tick ‘Other’ and enter a hyphen (-) in the box. This will split the Index1 Column into two, with the name of the Index and the actual Tag sequence in two separate columns, as below.

[Image: excel_for_indexes – the Index1 column split into index name and tag sequence]

If using dual indexing (e.g. TruSeqHT or NexteraXT), do the same for the Index2 column to split the Index2 names and sequences into two separate columns.

Now open a blank or used sample sheet that is set up for the correct library chemistry and sequencing instrument (see previous blog post), then copy and paste the sample IDs, index IDs and index sequences into the sample sheet. Save as a comma separated file (csv) and it’s ready to use for demultiplexing and fastq generation, or your next MiSeq run. The above example looks like this…

[Image: sample_sheet – the finished sample sheet]

How to demultiplex Illumina data and generate fastq files using bcl2fastq

Sequence runs on NGS instruments are typically carried out with multiple samples pooled together. An index tag (also called a barcode) consisting of a unique sequence of between 6 and 12bp is added to each sample so that the sequence reads from different samples can be identified.

On the Illumina MiSeq, the process of demultiplexing (dividing your sequence reads into separate files for each index tag/sample) and generating the fastq data files required for downstream analysis is carried out automatically using the onboard PC. However, on the higher-throughput NextSeq500 and HiSeq models this process is carried out on BaseSpace – Illumina’s cloud-based resource.

Whilst there are many advantages to having your sequence data in the cloud (e.g. monitoring a sequence run from home, ease of sharing data with collaborators, etc) there are also some drawbacks to this system. In particular, the process of demultiplexing and fastq file generation in BaseSpace can be very slow. It takes up to 8 hours to demultiplex the data from a high output NextSeq500 run on BaseSpace, and if the fastq files then have to be downloaded to your local computer or server for analysis this requires a further 3 hours.

If your data is urgent you may not want to wait 11 hours or more after your sequence run has finished to begin your analysis! We have found that demultiplexing and fastq file generation from a high output NextSeq500 run can instead be carried out in about 30 minutes on our in-house UNIX server. This also has the advantage of avoiding the rather slow step of downloading your fastq files from BaseSpace.

In order to do this, you need to install a free piece of software from Illumina called bcl2fastq on your UNIX server. Demultiplexing NextSeq500 data (or data from any Illumina system running RTA version 1.18.54 and later) requires bcl2fastq version 2.16 or newer (the latest version at the time of writing is v2.17, which can be downloaded here).

Importantly, we have checked that the results obtained from bcl2fastq and BaseSpace are equivalent – the fastq files generated are exactly the same. BaseSpace is set to remove adapter sequences by default, meaning that the sequence reads may not all be the same length (any reads from short fragments with adapter read-through will have those sequences removed). In bcl2fastq you have the option to either remove adapter sequences or leave them in so that all reads are the same length.

In order to demultiplex the data, first copy the entire run folder from the sequencer to your UNIX server. On the NextSeq500, the run folder will be inside the following directory on the hard disc –
D:\Illumina\NextSeq Control Software\Temp\
It ought to be the ONLY folder here as the NextSeq only retains data from the most recent run – as soon as you start a new sequence run the data from the previous run is deleted. Copy the entire folder, including all its subdirectories. This folder contains the raw basecall (bcl) files. Do not change the name of the folder, which will be named as per the following convention – YYMMDD_InstrumentID_RunID_FlowcellID
For example, the 10th run carried out on a NextSeq500 with serial number 500999, on 14th April 2016 and using flowcell number AHLFNLBGXX would be named as follows –
160414_NB500999_0010_AHLFNLBGXX

The other requirement is a sample sheet – a simple comma separated file (csv) with the library chemistry, sample names and the index tag used for each sample, in addition to some other metrics describing the run. Anyone running a MiSeq will already be familiar with these, but NextSeq and HiSeq users may only have used BaseSpace to enter these values. Unfortunately there is no way to automatically download a sample sheet from BaseSpace (although we have figured out a way around this to avoid double data entry, see the next blog post). Sample sheets can be made and modified using MS Excel or any other software that can read csv files, but the easiest way to make one is to use a free wizard-type program for the PC called Illumina Experiment Manager, which guides you through the process. The latest version at the time of writing is v1.9, which is available here.

Open Illumina Experiment Manager and click ‘Create Sample Sheet’. Then make certain that you choose the correct sequencer (essential, since the NextSeq and MiSeq use opposite reverse complements during index reads). Select ‘Fastq only’ output. Enter any value (numbers or text) for the Reagent Kit Barcode – this will become the filename. Ensure the correct library chemistry is selected (e.g. TruSeqLT, TruSeqHT, NexteraXT, etc). If there are custom/non-standard tags, these will need to be entered manually in the csv file. Tick adapter trimming for read 1 and read 2 if required, select either paired or single end reads, and enter the read length as appropriate (add one base, so for 150bp reads enter 151). Then either follow the instructions in the next blog post to import sample names and tags from BaseSpace, or enter them manually by adding a blank row for each sample, entering the sample names and selecting the index tag(s) for each sample. It is wise to double-check that the sample names and indexes are correct, as mistakes will cause data to be allocated to the wrong file. Change the name of the file to ‘SampleSheet.csv’ and copy it into the top directory of the sequence run folder on the server. The sample sheet file should resemble the example below – this is for a paired end 2x151bp NextSeq run with four samples, TruSeqLT index tags, and adapter trimming selected.

[Image: SampleSheet_example – an example SampleSheet.csv]
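For reference, a minimal sample sheet along those lines looks something like the following. This is a sketch rather than the exact file pictured above – the experiment name, date, adapter sequences and index names are illustrative, so check them against your own kit documentation:

[Header]
IEMFileVersion,4
Experiment Name,Example_Run
Date,14/04/2016
Workflow,GenerateFASTQ
Application,NextSeq FASTQ Only
Assay,TruSeq LT
Chemistry,Default

[Reads]
151
151

[Settings]
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

[Data]
Sample_ID,Sample_Name,I7_Index_ID,index
Sample1,Sample1,A001,ATCACG
Sample2,Sample2,A002,CGATGT
Sample3,Sample3,A003,TTAGGC
Sample4,Sample4,A004,TGACCA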

Now use the command line below on the server to run bcl2fastq. For speed, we use 12 threads for processing the data on our UNIX server (-p 12); however, the optimal number will depend on your system architecture, resources and usage limits. It is important to set a limit to the number of threads, otherwise bcl2fastq will use 100% of the CPUs on the server. We usually invoke the no-lane-splitting option, otherwise each output file from our NextSeq is divided into four (one for each lane on the flowcell). Here we are using the NextSeq run folder mentioned above as an example (160414_NB500999_0010_AHLFNLBGXX) and sending the output to a subdirectory within it called ‘fastq_files.’ For other bcl2fastq options please see Illumina’s manual on the software.
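A typical invocation looks like this (a sketch – adjust the run folder path and thread count for your own system):

bcl2fastq --runfolder-dir 160414_NB500999_0010_AHLFNLBGXX \
  --output-dir 160414_NB500999_0010_AHLFNLBGXX/fastq_files \
  --no-lane-splitting -p 12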

In this example, there should be two fastq files generated for each sample (one each for forward R1 and reverse R2 reads, since this is a paired end 2x151bp run), plus a forward and reverse file for ‘Undetermined’ reads whose index tag did not match any of the tags in the sample sheet. The Undetermined file will contain all of the reads from the PhiX spike-in if used (as PhiX does not have a tag) and also any other reads where there was a basecalling error during the index read. Depending on the PhiX spike-in % and the total number of samples on the run, the Undetermined file should normally be smaller than the other files. If a problem with demultiplexing or tagging is suspected, always check the ‘index.html’ file within the ‘Reports/html’ subdirectory. This file will open in a standard web browser, and clicking the ‘unknown barcode’ option will display the top unknown barcodes and allow problems to be diagnosed. Common issues are that one or more samples were omitted from the sample sheet, errors entering the barcodes, incorrect library chemistry (e.g. selecting NexteraXT instead of TruSeqHT), or that the barcodes (especially index 2 on dual-indexed samples) need to be reverse-complemented on the sample sheet.

NCBI Entrez Direct UNIX E-utilities

I use the NCBI Entrez Direct (EDirect) UNIX E-utilities regularly for sequence and data retrieval from NCBI. These utilities can be combined with any standard UNIX commands.

They are available to download from the NCBI website: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/

Here are a few useful examples of the EDirect utilities.

Download a sequence in fasta format from NCBI using accession number
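For example (ACCESSION is a placeholder – substitute your own accession number):

efetch -db nucleotide -id ACCESSION -format fasta > ACCESSION.fasta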

Batch retrieval of all proteins for a taxon ID. This example will download all proteins for viruses in fasta format.
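A sketch of the pipeline (txid10239 is the taxonomy ID for Viruses):

esearch -db protein -query "txid10239[Organism:exp]" | efetch -format fasta > viral_proteins.fasta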

Download sequences in fasta format from NCBI using edirect and isolate information
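Something along these lines, using the organism and isolate names as search terms (both shown here as placeholders):

esearch -db nucleotide -query "ORGANISM[Organism] AND ISOLATE_NAME[All Fields]" | efetch -format fasta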

Download sequences from NCBI using edirect with a BioProject accession or ID
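A sketch, with the BioProject accession as a placeholder:

esearch -db bioproject -query "PRJNAxxxxxx" | elink -target nuccore | efetch -format fasta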

Get all CDS from a genome
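The fasta_cds_na format returns the nucleotide sequence of each annotated CDS (fasta_cds_aa gives the translations); the genome accession here is a placeholder:

efetch -db nucleotide -id GENOME_ACCESSION -format fasta_cds_na > cds.fasta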

Get taxonomy ID from protein accession number
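For example, pulling the TaxId element out of the document summary (protein accession as placeholder):

efetch -db protein -id PROTEIN_ACCESSION -format docsum | xtract -pattern DocumentSummary -element TaxId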

Get taxonomy ID from accession number using esummary
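The equivalent with esummary (accession as placeholder):

esummary -db nucleotide -id ACCESSION | xtract -pattern DocumentSummary -element TaxId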

Get full lineage from accession number
Tip: xtract can be used to fetch any element from the XML output
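A sketch – link the accession to the taxonomy database, fetch the full record as XML, and pull out the Lineage element (accession as placeholder):

elink -db nucleotide -id ACCESSION -target taxonomy | efetch -format xml | xtract -pattern Taxon -element Lineage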

Get scientific name from accession number
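For example (accession as placeholder):

esummary -db nucleotide -id ACCESSION | xtract -pattern DocumentSummary -element Organism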

Download all refseq protein sequences for viruses
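A sketch, combining the Viruses taxonomy ID with the RefSeq filter:

esearch -db protein -query "txid10239[Organism:exp] AND refseq[Filter]" | efetch -format fasta > viral_refseq_proteins.fasta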

Download reference genome sequence from taxonomy ID
Note: Using efilter command
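A sketch (txidNNNN is a placeholder for your taxonomy ID; the efilter step restricts the search results to RefSeq records):

esearch -db nucleotide -query "txidNNNN[Organism]" | efilter -query "refseq[Filter]" | efetch -format fasta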

Get all proteins from a genome accession
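For example (genome accession as placeholder):

elink -db nucleotide -id GENOME_ACCESSION -target protein | efetch -format fasta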

Extract the genome accession from a protein accession – the DBSOURCE attribute in the GenBank file – an alternative to the script mentioned in one of my earlier blog posts.
Note: the following command works with protein accessions or GIs as the -id parameter in the elink command.
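A sketch (protein accession or GI as placeholder; -format acc returns just the linked accession):

elink -db protein -id PROTEIN_ACCESSION -target nuccore | efetch -format acc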

More info about the NCBI Entrez Direct E-utilities is available on the NCBI website: http://www.ncbi.nlm.nih.gov/books/NBK179288/


NGS Data Formats and Analyses

Here are my slides from a session on NGS data formats and analyses that I gave as part of the EPIZONE Workshop on Next Generation Sequencing applications and Bioinformatics in Brussels in April 2016. It covers file formats such as FASTA, FASTQ, SAM, BAM, and VCF, and also goes over IUPAC nucleotide ambiguity codes, read names, quality scores, error probabilities, and CIGAR strings.


How to Import data for libraries with index tags into BaseSpace

In this blog we describe how to import lists of sample data with defined index tags into BaseSpace, and provide templates for TruSeqLT and TruSeqHT libraries. We have found this saves a lot of time and eliminates errors associated with manual entry.
The Illumina NextSeq500 sequencer requires all users to complete sample data entry on BaseSpace (Illumina’s cloud-based resource) including sample names, species, project names, index tags and sample pools. Whilst there are many advantages to having this data in the cloud, the BaseSpace interface is not always the most convenient or user-friendly system for data entry and management.
Our experience has been that for large projects with many samples, it is impractical to use the manual method of entering sample names in the ‘Biological Samples’ tab, then individually assigning an index tag in the ‘Libraries’ tab by dragging each sample onto an image of a 96-well plate of barcodes. To make matters worse, BaseSpace always mixes up the order of the samples (even if they are named 1-96), so it becomes all too easy to make an error when faced with a long list of sample names in a random order that each require a tag to be assigned.
It is quite easy to import a csv file created in Excel (or similar) with the sample names, species, project and nucleic acid into the ‘Biological Samples’ tab, and thus avoid a large part of the manual data entry. However this still requires the user to individually assign an index tag to each sample using the cumbersome and error-prone interface pictured below, dragging each sample on the list to the correct well on the index plate.
[Image: BaseSpace_indexing – the drag-and-drop index assignment interface in BaseSpace]
It is possible to avoid this by importing a csv file with the sample names, species, project, nucleic acid, index name and also the index tags into the ‘Libraries’ tab on BaseSpace. However, there is very little guidance on how to do this – and Illumina only provide an example template for libraries made using Nextera XT with none of the sequence tags themselves.
We are mainly using TruSeq indexes, so we have generated our own import templates with all 24 TruSeqLT tags, and all 96 dual-indexed TruSeqHT tags. This took quite a bit of trial and error, plus fetching the sequences of all 216 index tags. We have therefore made our own templates for importing TruSeqLT and TruSeqHT libraries available here for others to use.
Simply open the csv file in Excel (or similar) and insert the names of your own samples in the first two columns. Copy and paste the index tags you have used onto the correct sample lines (each sample requires the Well, Index1Name, Index1Sequence, Index2Name and Index2Sequence). Change the ContainerID from ‘Platename’ to your own name and delete any lines you don’t need (e.g. if you have fewer than 24 or 96 samples). Here we are using the template to import 24 samples called apples 1-24 with TruSeqHT dual tags. If using 96 samples, use this.
[Image: template_image – the import template opened in Excel]
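To give an idea of the layout, the first data lines of a filled-in template look roughly like this. This is a simplified sketch – the exact column set and order should be taken from the downloadable template itself, and the index names and sequences shown are just examples:

SampleID,Name,ContainerID,Well,Index1Name,Index1Sequence,Index2Name,Index2Sequence
apples1,apples1,MyPlate,A1,D701,ATTACTCG,D501,TATAGCCT
apples2,apples2,MyPlate,B1,D702,TCCGGAGA,D501,TATAGCCT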
Save the csv file, navigate to the ‘Libraries’ tab in your BaseSpace account and then click the ‘Import’ button on the top-right corner. Choose your csv file, and after a minute you should see your libraries successfully imported with the correct index tags as below, ready to pool for a sequence run.
[Image: imported_libraries – the libraries imported into BaseSpace with index tags assigned]
Now, if Illumina would just allow us to import pools of samples we could also avoid having to individually drag each sample into a small dot in the ‘Pools’ tab. This is rather tiresome when there are large numbers of samples in a pool!

How to make a BioLinux Live USB Stick – with persistent data storage

These are the steps I used to create a batch of bootable BioLinux Live USB sticks – with persistent data so that any data files created/downloaded would be preserved. This was used for a course so that each stick had the same NGS data and the same additional (non-BioLinux) programs pre-installed and already configured.

Step 1 – Download the BioLinux ISO file for use with DVD/USB media

The downloaded .iso file is an archive file that contains the whole BioLinux operating system – it can be used later to either install BioLinux onto a machine, or to create a bootable BioLinux USB Live disk. The bio-linux-8-latest.iso image is currently (March 2016) 3.58GB in size.

Step 2 – Download and install UNetbootin

UNetbootin allows you to create bootable Live USB drives for Ubuntu and other Linux distributions without burning a CD.

It is simple to install: on a Mac you just move the downloaded unetbootin.app file into /Applications.

Step 3 – Create an initial BioLinux Live USB disk with persistent data

As the .iso file is 3.58GB in size, a USB stick of at least 4GB is needed, but that is a little too close for comfort, so it is best to go for a USB stick of at least 8GB; these days 8GB sticks are very cheap (£2.99) and are the same price (if not cheaper) as 4GB sticks. To play it safe, the USB stick should probably be in FAT32 format. Note that FAT32 has a 4GB file-size limit – this includes the overall casper-rw BioLinux file, which is where all the persistent data is stored – so if you are going to be storing more than 4GB of data you will probably need the NTFS file system on the USB stick.

Insert your blank USB key into your computer. Launch UNetbootin. Select the “Diskimage” toggle button, select “ISO” from the drop down list, and then navigate to and select the BioLinux .iso file downloaded in Step 1. Next, in the field entitled “Space used to preserve files across reboots (Ubuntu only)”, enter “3500” into the MB textfield (3.5GB) – you could increase this above 4GB if you have a bigger USB stick and it is using the NTFS file system. Next, select “USB Drive” from the “Type” drop down list, then select your actual USB stick from the “Drive” drop down list, and click “OK” to create your bootable BioLinux Live USB stick with persistent data storage.

Step 4 – Boot into your BioLinux

The next step is to boot into the BioLinux Live USB disk from a machine – this will need to be a Windows or Linux machine, as a modern Mac is unlikely to boot from it. Turn the computer off, insert the BioLinux Live USB stick, turn the computer back on, and get ready. As soon as the first screen appears – which normally has the computer manufacturer logo – it should say something like “Press F12 to Choose Boot Device” at the bottom of the screen – so press F12 quickly before the screen disappears. Sometimes it is not F12 but F10 or F2 or another key; the screen should say which button to press. This will launch the BIOS menu. Enter the “Boot Device Select” menu and move your USB stick to the top of the boot order, so that the computer will now boot from the USB stick before its own hard drive. Exit the BIOS menu, saving any changes, and the computer should now boot into the BioLinux Live USB stick.

Step 5 – Customise your BioLinux – add data and programs

Now you will be inside your own BioLinux OS on the USB stick. So install any extra programs you want, configure PATHs, and download any data files you want. The programs, configs and data will be saved onto the USB stick and preserved – due to the persistent data storage and the casper-rw file.

Now shutdown BioLinux, remove the USB stick, and boot back into your normal operating system.

Step 6 – Make an image copy of your customised BioLinux disk

Once inside your normal operating system, insert the BioLinux USB stick back in. The next step will only work on a Mac or a Linux machine, as it uses the dd command.
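The command is along these lines (assuming the stick is at /dev/disk2; on Linux use bs=1M rather than bs=1m):

sudo dd if=/dev/disk2 of=~/Documents/biolinux.img bs=1m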

This copies the BioLinux Live USB stick (located at /dev/disk2 on my machine – on a Mac, run “diskutil list” to see where yours is) and creates a single biolinux.img file in the Documents folder, which contains the entire operating system along with all the extra data and programs I installed.

The original customised BioLinux Live USB stick can now be ejected and removed.

Step 7 – Copy Copy Copy

Insert a new blank USB stick into the computer (obviously it needs to be at least the same size as the original one). Now we want to make a copy of that original BioLinux Live USB stick onto the new USB stick using the dd command:
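Again assuming the new stick appears as /dev/disk2:

sudo dd if=~/Documents/biolinux.img of=/dev/disk2 bs=1m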

This copies the biolinux.img file located in the Documents folder that we created in Step 6 onto the new blank USB disk (located at /dev/disk2 – check where yours is). On a Mac, I had to first go into Disk Utility and unmount the FAT32 partition of the USB stick before dd would work – not eject the USB stick itself, just unmount the FAT32 partition. The key thing here is that you can insert multiple blank USB sticks into all the available USB ports and run the dd command in parallel:
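For example, with two blank sticks at /dev/disk2 and /dev/disk3 (the device names will vary on your machine):

sudo dd if=~/Documents/biolinux.img of=/dev/disk2 bs=1m &
sudo dd if=~/Documents/biolinux.img of=/dev/disk3 bs=1m &
wait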

For an 8GB USB stick, this copying process took almost exactly 1 hour. Then you can eject the USB sticks and put new ones in and copy another batch.

Acknowledgements

Many thanks to Paul Capewell and Willie Weir for a tip on the dd command.

Submitting a job to run on another server and retrieving the results

Imagine having two different servers called darwin and linnaeus. Imagine that darwin is a great server with loads of RAM for doing de-novo assembly, and that linnaeus has loads of nodes, making it a great server for splitting up work and running lots of jobs in parallel. To make good use of all these resources, it would make sense to do part of the processing on one server and then automatically send jobs to be processed on the other.

So this is how you do that. On linnaeus you run:
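ssh-keygen -t rsa

(Accept the defaults when prompted; the public key is then written to ~/.ssh/id_rsa.pub.)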

Copy the public key, and on darwin put it in ~/.ssh/authorized_keys2.

The reverse also needs to be done by putting a darwin key on linnaeus.

Now, to test it out, create the shell script that will be executed on linnaeus, e.g. linnaeusshell:
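A minimal sketch of such a script (the username and paths here are placeholders):

#!/bin/bash
# Uncompress the file that darwin sent over
gunzip Pf3D7_01.embl.gz
# Return the uncompressed file to darwin
scp Pf3D7_01.embl user@darwin:/path/to/results/
# Signal completion with a "Done" log
echo "Done" > done.log
scp done.log user@darwin:/path/to/results/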

This small script uncompresses a file, returns the uncompressed file to darwin, and sends back a “Done” log once the script is finished.

Now create a command shell on darwin, e.g. darwinshell:
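Again a sketch, with placeholder username and paths – transfer the file to linnaeus, then run the remote script:

#!/bin/bash
# Transfer the compressed file to linnaeus
scp Pf3D7_01.embl.gz user@linnaeus:/path/to/work/
# Run the processing script remotely on linnaeus
ssh user@linnaeus 'cd /path/to/work && ./linnaeusshell'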

And finally execute the darwinshell:
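./darwinshell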

This will transfer the Pf3D7_01.embl.gz compressed file over to linnaeus where the file will be uncompressed and transferred back to darwin.

Big thanks to Sreenu who helped me a lot to sort this out.

Setting up an Amazon ftp server to receive big files

Sharing large files with collaborators has rarely been a problem: we usually just compress them and put them on our web server, then send the link to our collaborator, who can then download the file.
However, we have struggled to find a solution for receiving large files. We usually run out of space in Dropbox or Google Drive. We have tried infinit.io, but this has failed on a few occasions, we think due to firewall issues either on our side or on the side of the collaborator. So when we managed to get an Amazon cloud account set up through Arcus Global (see my previous blog on how we got that organised), an obvious thing to try was to set up an ftp server to receive large files.

A bit of googling around provided us with a very useful post on Stackoverflow.

First, we launched an instance through the Amazon web interface. We selected an Ubuntu instance from Amazon EC2 and specified 60GB of storage. We generated a new key called “ftp” and saved it locally. The .txt extension was added to the file, so we renamed it and changed the permissions:
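Something like (the exact downloaded filename may differ):

mv ftp.pem.txt ftp.pem
chmod 400 ftp.pem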

Using the IP address of the instance, we then logged into it using ssh:
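ssh -i ftp.pem ubuntu@&lt;IP address of instance&gt;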

We installed the ftp server:
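sudo apt-get install vsftpd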

but we did not provide a password for ftp as we decided to use the ubuntu username and password for login.

Then we opened up the FTP ports on our EC2 instance as described on Stackoverflow.

We changed the ssh configuration as explained in the previous blog and changed the ubuntu user’s password:
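sudo passwd ubuntu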

Then we changed the ftp configuration file in /etc/vsftpd.conf

The following lines were changed:
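These were most likely the standard settings that disable anonymous logins and allow the local user to upload (a reconstruction – adjust to your own needs):

anonymous_enable=NO
local_enable=YES
write_enable=YES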

We also added the following, with the IP of our instance. If restarting an image, the IP will be different, so this will need to be changed.
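These lines were along the following pattern (a reconstruction based on the Stackoverflow recipe; the passive port range must match the ports opened on the instance):

pasv_enable=YES
pasv_min_port=1024
pasv_max_port=1048
pasv_address=&lt;public IP of instance&gt;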

Finally, we restarted vsftpd:
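sudo service vsftpd restart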

We had it up for about 24 hours and it cost approximately £0.56.