Category Archives: Perl

A compilation of conversion tools for BED, SAM/BAM, psl, pslx, blast tabular and blast xml

A wide range of formats exist for representing the comparisons of different sequences to each other: blast tabular, blast xml, psl, pslx, SAM/BAM, BED
Most of these formats can be converted from one format to another. Sometimes the format is lossless allowing for the original data to be perfectly converted
without the loss of information. Other times, the format conversion is lossy permitting the conversion of only part of the original data resulting in the loss
of some information.

Here, I have compiled the tools or UNIX commands necessary for converting from one file format to another. As you can see, I am still needing to compete some of the gaps so please let me know of any other tools which are missing.

The command is shown in full below the table.

Conversion From Row/To Col blast-xml blast-tab psl pslx SAM/BAM BED
blast-xml N/A blast2tsv.xsl blastXmlToPsl blastXmlToPsl -pslx blast2bam BLAST_to_BED
blast-tab perl N/A ?? ?? ?? blast2bed
psl ?? ?? N/A pslToPslx pslToBed
pslx ?? ?? ?? N/A ?? pslToBed
SAM/BAM ?? ?? -s samtools view bedtools bamtobed
BED ?? ?? bed2psl ?? bedtools bedtobam N/A

Convert from SAM to psl using

Available from:

Example command:

This is a lossless format conversion with the -s option , however the sequence as a read is no longer supported in the psl format.


Convert from psl to SAM

Available from the samtools legacy scripts:

Example command:

This ends up being a lossy conversion as the read sequence is not in the output.


The options are used to calculate a blast like scoring see post:

Convert psl to pslx


This is a lossless conversion. For usage:


Convert SAM to fasta


Convert psl to BED

Option 1:

Using pslToBed from

This is a lossless conversion as the standard psl doesn’t have the sequence and so the bed file doesn’t either.



Option 2 as suggested by Alex Reynolds:

Using psl2bed from

This is also lossless when used with –keep-header:


As a bonus, it uses sort-bed to make a sorted BED file, so that it is ready to use with bedops, bedmap, etc.



Convert pslx to BED


This is a lossy conversion as the sequence is lost



Convert BAM to BED

Using bedtools:

This is a lossy conversion as the sequence data is lost.



Convert BED to BAM

Create the genome file for bed


Using the genome file and BED file to produce the BAM file.



The sequence is not present in the BED file so is absent from BAM as well. This is a lossy format conversion.
Additionally, there are differences in the number of read compared to the original file.

Convert BED to psl


This is a lossless conversion as neither the BED nor psl contain sequence information




Preparing blast-xml format

Preparing blast-tab format

Blast-xml to psl


This is a lossy conversion

However, if you use the -pslx option, you can get lossless conversion



Converting from blast-xml to SAM/BAM


This is a lossless conversion with sequence and read quality introduced.



Subsequently converted to BAM using samtools


Blast-xml to BED



This is a lossy conversion as the sequence information is lost.


Converting Blast tabular to blast-xml

Not completely possible due to missing information such as the alignment but see post
Or using the script from the blast2go google group:




Convert blast-xml to blast tabular

Several approaches have been suggested here but the most straight forward I have found is using the style sheet blast2tsv.xml from here: This is a lossless conversion but is not a standardly formatted blast-tabular output as it contains the sequence and the aligned site information in the last two columns.



Blast tabular to BED


The output will be in test.blasttab.bed, this is a lossless conversion neither blast-tabular nor BED have the sequence.



Converting SAM to BAM

Using samtools
This is a lossless format conversion.


Converting BAM to SAM using samtools

This is a lossless format conversion.


PS: I dedicate this tutorial to Sej, a great bioinformatician and friend 😉

Short command lines for manipulation FASTQ and FASTA sequence files

I thought it was time for me to compile all the short command that I use on a more or less regular basis to manipulate sequence files.

Convert a multi-line fasta to a singleline fasta


To convert a fastq file to fasta in a single line using sed


Dirty way to count the number of sequences in a fastq

It’s dirty because sometimes the quality information line may also start with “@” so the number of sequences could be overestimated.

A more precise way is to count the lines and divide by four:

One liner to remove the description information from a fasta file and just keep the identifier


Get all the identifier names from a fasta file


Extract sequences by their ID from a fasta file
For example, you want to get the sequences with id1 and id2 as identifiers

If you have a long list of identifiers in a file called ids.txt, then the following should do the trick:


Convert from a two column text tab-delimited file (ID and sequence) to a fasta file


Get the length of a fasta sequence (the sequence must in singleline)


I’ll update this when I find some more useful single line commands for manipulation fastq and fasta files.

Please post comments if you have some suggestions.


Continue reading

Parsing PubMed for email addresses in author affiliations


Recently, we wanted to send out a survey for the International Committee on Taxonomy of Viruses (ICTV) to a large number of authors who have recently published in a virology journal. Fortunately, PubMed stores author affiliations and the email address is also sometimes present in the affiliation. We decided to target the following journals: Journal of Virology; Journal of General Virology, Virology, Virus Research, Antiviral Research, Viruses and Journal of Medical Virology. A lot of the difficult work can be done using E-utilities to generate the URL for the search. As we may be retrieving a large number of emails, we need to retrieve the results from the URL query in batches. We then want to extract the affiliations and the emails from the affiliations using:

As we didn’t want to send all the emails off in one go, we split the output into multiple batches of 100 emails.   Here’s the full code also available as a Gist on Github:

Here are the email counts: Journal of Virology = 634: Journal of General Virology = 169; Virology = 546: Virus Research = 425; Antiviral Research = 252; Viruses = 892; Journal of Medical Virology = 0.

The Journal of Medical Virology doesn’t release the email addresses of authors and if the information is not used responsibly, then a number of other journals might go that way to as discussed in “E-mail Address Harvesting on PubMed—A Call for Responsible Handling of E-mail Addresses“.

If you re-run this script, you might find a few more hits as more papers get published this year.