Category Archives: Perl

A compilation of conversion tools for BED, SAM/BAM, psl, pslx, blast tabular and blast xml

A wide range of formats exist for representing the comparisons of different sequences to each other: blast tabular, blast xml, psl, pslx, SAM/BAM, BED
Most of these formats can be converted from one format to another. Sometimes the format is lossless allowing for the original data to be perfectly converted
without the loss of information. Other times, the format conversion is lossy permitting the conversion of only part of the original data resulting in the loss
of some information.

Here, I have compiled the tools or UNIX commands necessary for converting from one file format to another. As you can see, I am still needing to compete some of the gaps so please let me know of any other tools which are missing.

The command is shown in full below the table.

Conversion From Row/To Col blast-xml blast-tab psl pslx SAM/BAM BED
blast-xml N/A blast2tsv.xsl blastXmlToPsl blastXmlToPsl -pslx blast2bam BLAST_to_BED
blast-tab perl blast2xml.pl N/A ?? ?? ?? blast2bed
psl ?? ?? N/A pslToPslx psl2sam.pl pslToBed
psl2BED
pslx ?? ?? ?? N/A ?? pslToBed
SAM/BAM ?? ?? sam2psl.py sam2psl.py -s samtools view bedtools bamtobed
BED ?? ?? bed2psl ?? bedtools bedtobam N/A

Convert from SAM to psl using sam2psl.py

Available from: https://github.com/ndaniel/fusioncatcher/blob/master/bin/sam2psl.py

Example command:

This is a lossless format conversion with the -s option , however the sequence as a read is no longer supported in the psl format.

Usage:

Convert from psl to SAM

Available from the samtools legacy scripts: https://github.com/lh3/samtools-legacy/blob/master/misc/psl2sam.pl

Example command:

This ends up being a lossy conversion as the read sequence is not in the output.

Usage

The options are used to calculate a blast like scoring see post:
https://www.biostars.org/p/44281/


Convert psl to pslx

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion. For usage:

 

Convert SAM to fasta

 

Convert psl to BED

Option 1:

Using pslToBed from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion as the standard psl doesn’t have the sequence and so the bed file doesn’t either.

Usage:

 

Option 2 as suggested by Alex Reynolds:

Using psl2bed from http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html

This is also lossless when used with –keep-header:

Example:

As a bonus, it uses sort-bed to make a sorted BED file, so that it is ready to use with bedops, bedmap, etc.

Usage:

 

Convert pslx to BED

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossy conversion as the sequence is lost

Usage:

 

Convert BAM to BED

Using bedtools: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html

This is a lossy conversion as the sequence data is lost.

Usage:

 

Convert BED to BAM

Create the genome file for bed

 

Using the genome file and BED file to produce the BAM file.

Usage:

 

The sequence is not present in the BED file so is absent from BAM as well. This is a lossy format conversion.
Additionally, there are differences in the number of read compared to the original file.

Convert BED to psl

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion as neither the BED nor psl contain sequence information

 

Usage:

 

Preparing blast-xml format

Preparing blast-tab format

Blast-xml to psl

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossy conversion

However, if you use the -pslx option, you can get lossless conversion

Usage:

 


Converting from blast-xml to SAM/BAM

Using https://github.com/guyduche/Blast2Bam

This is a lossless conversion with sequence and read quality introduced.

 

Usage

Subsequently converted to BAM using samtools

 

Blast-xml to BED

Using https://github.com/mojaveazure/BLAST_to_BED

Command

This is a lossy conversion as the sequence information is lost.

 

Converting Blast tabular to blast-xml

Not completely possible due to missing information such as the alignment but see post https://www.biostars.org/p/7981/#7983
Or using the script blast2xml.pl from the blast2go google group: https://11302289526311073549.googlegroups.com/attach/ed2c446e1b1852a9/blast2xml.pl?part=0.1&view=1&vt=ANaJVrEJYYa7SZC-uvOtoKb6932qlMJWltc2p_5GrTK5Wi7jo-hw14zFroKEcLhdNcJUcQweoUJOuXk2H7wQB5q6mzDTTn211hC2OvwiWw0b5PZev-HQ7Qg

Command:

Usage

 

Convert blast-xml to blast tabular

Several approaches have been suggested here https://www.biostars.org/p/7290/ but the most straight forward I have found is using the style sheet blast2tsv.xml from here:
https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2tsv.xsl. This is a lossless conversion but is not a standardly formatted blast-tabular output as it contains the sequence and the aligned site information in the last two columns.

Command:

 

Blast tabular to BED

Using https://github.com/nterhoeven/blast2bed

The output will be in test.blasttab.bed, this is a lossless conversion neither blast-tabular nor BED have the sequence.

Usage

 


Converting SAM to BAM

Using samtools http://quinlanlab.org/tutorials/samtools/samtools.html
This is a lossless format conversion.

Command


Converting BAM to SAM using samtools

This is a lossless format conversion.

 

PS: I dedicate this tutorial to Sej, a great bioinformatician and friend 😉

Short command lines for manipulation FASTQ and FASTA sequence files

I thought it was time for me to compile all the short command that I use on a more or less regular basis to manipulate sequence files.

Convert a multi-line fasta to a singleline fasta

 

To convert a fastq file to fasta in a single line using sed

 

Dirty way to count the number of sequences in a fastq

It’s dirty because sometimes the quality information line may also start with “@” so the number of sequences could be overestimated.

A more precise way is to count the lines and divide by four:

One liner to remove the description information from a fasta file and just keep the identifier

 

Get all the identifier names from a fasta file

 

Extract sequences by their ID from a fasta file
For example, you want to get the sequences with id1 and id2 as identifiers

If you have a long list of identifiers in a file called ids.txt, then the following should do the trick:

 

Convert from a two column text tab-delimited file (ID and sequence) to a fasta file

 

Get the length of a fasta sequence (the sequence must in singleline)

 

I’ll update this when I find some more useful single line commands for manipulation fastq and fasta files.

Please post comments if you have some suggestions.

 

Continue reading

Parsing PubMed for email addresses in author affiliations

USE THE FOLLOWING RESPONSIBLY PLEASE!

Recently, we wanted to send out a survey for the International Committee on Taxonomy of Viruses (ICTV) to a large number of authors who have recently published in a virology journal. Fortunately, PubMed stores author affiliations and the email address is also sometimes present in the affiliation. We decided to target the following journals: Journal of Virology; Journal of General Virology, Virology, Virus Research, Antiviral Research, Viruses and Journal of Medical Virology. A lot of the difficult work can be done using E-utilities to generate the URL for the search. As we may be retrieving a large number of emails, we need to retrieve the results from the URL query in batches. We then want to extract the affiliations and the emails from the affiliations using:

As we didn’t want to send all the emails off in one go, we split the output into multiple batches of 100 emails.   Here’s the full code also available as a Gist on Github:

Here are the email counts: Journal of Virology = 634: Journal of General Virology = 169; Virology = 546: Virus Research = 425; Antiviral Research = 252; Viruses = 892; Journal of Medical Virology = 0.

The Journal of Medical Virology doesn’t release the email addresses of authors and if the information is not used responsibly, then a number of other journals might go that way to as discussed in “E-mail Address Harvesting on PubMed—A Call for Responsible Handling of E-mail Addresses“.

If you re-run this script, you might find a few more hits as more papers get published this year.