A compilation of conversion tools for BED, SAM/BAM, psl, pslx, blast tabular and blast xml

Post by: Joseph Hughes
November 15, 2017
No Comment

A wide range of formats exist for representing the comparisons of different sequences to each other: blast tabular, blast xml, psl, pslx, SAM/BAM, BED
Most of these formats can be converted from one format to another. Sometimes the format is lossless allowing for the original data to be perfectly converted
without the loss of information. Other times, the format conversion is lossy permitting the conversion of only part of the original data resulting in the loss
of some information.

Here, I have compiled the tools or UNIX commands necessary for converting from one file format to another. As you can see, I am still needing to compete some of the gaps so please let me know of any other tools which are missing.

The command is shown in full below the table.

Conversion From Row/To Col	blast-xml	blast-tab	psl	pslx	SAM/BAM	BED
blast-xml	N/A	blast2tsv.xsl	blastXmlToPsl	blastXmlToPsl -pslx	blast2bam	BLAST_to_BED
blast-tab	perl blast2xml.pl	N/A	??	??	??	blast2bed
psl	??	??	N/A	pslToPslx	psl2sam.pl	pslToBed psl2BED
pslx	??	??	??	N/A	??	pslToBed
SAM/BAM	??	??	sam2psl.py	sam2psl.py -s	samtools view	bedtools bamtobed
BED	??	??	bed2psl	??	bedtools bedtobam	N/A

Convert from SAM to psl using sam2psl.py

Available from: https://github.com/ndaniel/fusioncatcher/blob/master/bin/sam2psl.py

Example command:

python sam2psl.py -i test.sam -s -o test.psl

This is a lossless format conversion with the -s option , however the sequence as a read is no longer supported in the psl format.

python sam2psl.py -i test.sam -o test_no_seq.psl

Usage:

sam2psl.py [options]

It takes as input a file in SAM format and it converts into a PSL format file.

Options:
--version show program's version number and exit
-h, --help show this help message and exit
-i INPUT_FILENAME, --input=INPUT_FILENAME
The input file in SAM format.
-4, --skip-conversion-cigar-1.3
By default if the CIGAR strings in the input SAM file
are in the format defined in SAM version 1.4 (i.e.
there are 'X' and '=') then the CIGAR string will be
first converted into CIGAR string, which is described
in SAM version 1.3, (i.e. there are no 'X' and '='
which are replaced with 'M') and afterwards into PSL
format. Default is 'False'.
-s, --read-seq It adds to the PSL output as column 22, the sequence
of the read. This is not anymore a valid PSL format.
-r REPLACE_READS_IDS, --replace-read-ids=REPLACE_READS_IDS
In the reads ids (also known as query name in PSL) the
string specified here will be replaced with '/' (which
is used in Solexa for /1 and /2).
-o OUTPUT_FILENAME, --output=OUTPUT_FILENAME
The output file in PSL format.

Convert from psl to SAM

Available from the samtools legacy scripts: https://github.com/lh3/samtools-legacy/blob/master/misc/psl2sam.pl

Example command:

psl2sam.pl test.psl

This ends up being a lossy conversion as the read sequence is not in the output.

Usage

psl2sam.pl
Usage: psl2sam.pl [-a 1] [-b 3] [-q 5] [-r 2]

The options are used to calculate a blast like scoring see post:
https://www.biostars.org/p/44281/

Convert psl to pslx

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

pslToPslx test_no_seq.psl test.fa ref.fa test.pslx

This is a lossless conversion. For usage:

pslToPslx - Convert from psl to pslx format, which includes sequences
usage:
pslToPslx [options] in.psl qSeqSpec tSeqSpec out.pslx

qSeqSpec and tSeqSpec can be nib directory, a 2bit file, or a FASTA file.
FASTA files should end in .fa, .fa.gz, .fa.Z, or .fa.bz2 and are read into
memory.

Options:
-masked - if specified, repeats are in lower case cases, otherwise entire
sequence is loader case.

Convert SAM to fasta

awk '$1~!/^@/ {print ">"$1"\n"$10}' test.sam > test.fa

Convert psl to BED

Option 1:

Using pslToBed from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion as the standard psl doesn’t have the sequence and so the bed file doesn’t either.

pslToBed test_no_seq.psl test.bed

Usage:

bedToPsl - convert bed format files to psl format
usage:
   bedToPsl chromSizes bedFile pslFile

Convert a BED file to a PSL file. This the result is an alignment.
 It is intended to allow processing by tools that operate on PSL.
If the BED has at least 12 columns, then a PSL with blocks is created.
Otherwise single-exon PSLs are created.

Options:
-keepQuery  -  instead of creating a fake query, create PSL with identical query and
                target specs. Useful if bed features are to be lifted with pslMap and one 
                wants to keep the source location in the lift result.

(khmerEnv)hugh01j@Alpha:~/test_format_convert$ kentUtils/bin/linux.x86_64/pslToBed 
pslToBed: tranform a psl format file to a bed format file.
usage:
    pslToBed psl bed
options:
    -cds=cdsFile
cdsFile specifies a input cds tab-separated file which contains
genbank-style CDS records showing cdsStart..cdsEnd
e.g. NM_123456 34..305
These coordinates are assumed to be in the query coordinate system
of the psl, like those that are created from genePredToFakePsl
    -posName
changes the qName field to qName:qStart-qEnd
(can be used to create links to query position on details page)

Option 2 as suggested by Alex Reynolds:

Using psl2bed from http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html

This is also lossless when used with –keep-header:

Example:

psl2bed < in.psl > out.bed

As a bonus, it uses sort-bed to make a sorted BED file, so that it is ready to use with bedops, bedmap, etc.

Usage:

convert2bed -i psl
  version:  2.4.29 (typical)
  author:   Alex Reynolds

  Converts 0-based, half-open [a-1, b) headered or headerless PSL
  input into 0-based, half-open [a-1, b) extended BED or BEDOPS Starch

  $ psl2bed < foo.psl > sorted-foo.psl.bed
  $ psl2starch < foo.psl > sorted-foo.psl.starch

  Or:

  $ convert2bed -i psl < foo.psl > sorted-foo.psl.bed
  $ convert2bed -i psl -o starch < foo.psl > sorted-foo.psl.starch

  We make no assumptions about sort order from converted output. Apply
  the usage case displayed to pass data to the BEDOPS sort-bed application
  which generates lexicographically-sorted BED data as output.

  If you want to skip sorting, use the --do-not-sort option:

  $ psl2bed --do-not-sort < foo.psl > unsorted-foo.psl.bed

  The PSL specification (http://genome.ucsc.edu/goldenPath/help/blatSpec.html)
  contains 21 columns, some which map to UCSC BED columns and some which do not.

  PSL input can contain a header or be headerless, if the BLAT search was
  performed with the -noHead option. This program can accept input in
  either format.

  If input is headered, you can use the --keep-header option to preserve the header
  data as pseudo-BED elements that use the "_header" chromosome name. We expect this
  should not cause any collision problems since PSL data should use UCSC chromosome
  naming conventions.

  We describe below how we map columns to BED, so that BLAT results can be losslessly
  transformed back into PSL format with a simple awk statement or other similar
  command that permutes columns into PSL-ordering.

  We map the following PSL columns to their equivalent BED column, as follows:

  - tName    <-->   chromosome
  - tStart   <-->   start
  - tEnd     <-->   stop
  - qName    <-->   id
  - qSize    <-->   score
  - strand   <-->   strand

  Remaining PSL columns are mapped, in order, to columns 7 through 21 in the
  BED output:

  - matches
  - misMatches
  - repMatches
  - nCount
  - qNumInsert
  - qBaseInsert
  - tNumInsert
  - tBaseInsert
  - qStart
  - qEnd
  - tSize
  - blockCount
  - blockSizes
  - qStarts
  - tStarts

  PSL conversion options:

  --keep-header (-k)
      Preserve header section as pseudo-BED elements (requires --headered)
  --split (-s)
      Split record into multiple BED elements, based on tStarts field value

  Other processing options:

  --do-not-sort (-d)
      Do not sort BED output with sort-bed (not compatible with --output=starch)
  --max-mem=<value> (-m <val>)
      Sets aside <value> memory for sorting BED output. For example, <value> can
      be 8G, 8000M or 8000000000 to specify 8 GB of memory (default is 2G)
  --sort-tmpdir=<dir> (-r <dir>)
      Optionally sets [dir] as temporary directory for sort data, when used in
      conjunction with --max-mem=[value], instead of the host's operating system
      default temporary directory
  --starch-bzip2 (-z)
      Used with --output=starch, the compressed output explicitly applies the bzip2
      algorithm to compress intermediate data (default is bzip2)
  --starch-gzip (-g)
      Used with --output=starch, the compressed output applies gzip compression on
      intermediate data
  --starch-note="xyz..." (-e "xyz...")
      Used with --output=starch, this adds a note to the Starch archive metadata
  --help | --help[-bam|-gff|-gtf|-gvf|-psl|-rmsk|-sam|-vcf|-wig] (-h | -h <fmt>)
      Show general help message (or detailed help for a specified input format)
  --version (-w)
      Show application version

Convert pslx to BED

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossy conversion as the sequence is lost

pslToBed test.pslx test.bed

Usage:

pslToBed: tranform a psl format file to a bed format file.
usage:
    pslToBed psl bed
options:
    -cds=cdsFile
cdsFile specifies a input cds tab-separated file which contains
genbank-style CDS records showing cdsStart..cdsEnd
e.g. NM_123456 34..305
These coordinates are assumed to be in the query coordinate system
of the psl, like those that are created from genePredToFakePsl
    -posName
changes the qName field to qName:qStart-qEnd
(can be used to create links to query position on details page)

Convert BAM to BED

Using bedtools: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html

This is a lossy conversion as the sequence data is lost.

bedtools bamtobed -i test.bam > test_bamtobed.bed

Usage:

Tool:    bedtools bamtobed (aka bamToBed)
Version: v2.22.1
Summary: Converts BAM alignments to BED6 or BEDPE format.

Usage:   bedtools bamtobed [OPTIONS] -i <bam> 

Options: 
	-bedpe	Write BEDPE format.
		- Requires BAM to be grouped or sorted by query.

	-mate1	When writing BEDPE (-bedpe) format, 
		always report mate one as the first BEDPE "block".

	-bed12	Write "blocked" BED format (aka "BED12"). Forces -split.

		http://genome-test.cse.ucsc.edu/FAQ/FAQformat#format1

	-split	Report "split" BAM alignments as separate BED entries.
		Splits only on N CIGAR operations.

	-splitD	Split alignments based on N and D CIGAR operators.
		Forces -split.

	-ed	Use BAM edit distance (NM tag) for BED score.
		- Default for BED is to use mapping quality.
		- Default for BEDPE is to use the minimum of
		  the two mapping qualities for the pair.
		- When -ed is used with -bedpe, the total edit
		  distance from the two mates is reported.

	-tag	Use other NUMERIC BAM alignment tag for BED score.
		- Default for BED is to use mapping quality.
		  Disallowed with BEDPE output.

	-color	An R,G,B string for the color used with BED12 format.
		Default is (255,0,0).

	-cigar	Add the CIGAR string to the BED entry as a 7th column.

Convert BED to BAM

Create the genome file for bed

samtools faidx ref.fa
awk -v OFS='\t' {'print $1,$2'} ref.fai > ref.txt

Using the genome file and BED file to produce the BAM file.

bedtools bedtobam -i test_bamtobed.bed -g ref_revcomp.txt > test_bedtobam.bam

Usage:

Tool:    bedtools bedtobam (aka bedToBam)
Version: v2.22.1
Summary: Converts feature records to BAM format.

Usage:   bedtools bedtobam [OPTIONS] -i <bed/gff/vcf> -g <genome>

Options: 
	-mapq	Set the mappinq quality for the BAM records.
		(INT) Default: 255

	-bed12	The BED file is in BED12 format.  The BAM CIGAR
		string will reflect BED "blocks".

	-ubam	Write uncompressed BAM output. Default writes compressed BAM.

Notes: 
	(1)  BED files must be at least BED4 to create BAM (needs name field).

The sequence is not present in the BED file so is absent from BAM as well. This is a lossy format conversion.
Additionally, there are differences in the number of read compared to the original file.

Convert BED to psl

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossless conversion as neither the BED nor psl contain sequence information

bedToPsl Longest_revcomp.txt test.bed test_bedtopsl.psl

Usage:

bedToPsl - convert bed format files to psl format
usage:
bedToPsl chromSizes bedFile pslFile

Convert a BED file to a PSL file. This the result is an alignment.
It is intended to allow processing by tools that operate on PSL.
If the BED has at least 12 columns, then a PSL with blocks is created.
Otherwise single-exon PSLs are created.

Options:
-keepQuery - instead of creating a fake query, create PSL with identical query and
target specs. Useful if bed features are to be lifted with pslMap and one
wants to keep the source location in the lift result.

Preparing blast-xml format

makeblastdb -dbtype nucl -in Longest_revcomp.fa
blastn -query test.fa -db Longest_revcomp.fa -out test.blastxml -outfmt 5

Preparing blast-tab format

blastn -query test.fa -db Longest_revcomp.fa -out test.blasttab -outfmt 6

Blast-xml to psl

Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64

This is a lossy conversion

blastXmlToPsl test.blastxml test_blastxmltopsl.psl

However, if you use the -pslx option, you can get lossless conversion

blastXmlToPsl -pslx test.blastxml test_blastxmltopsl.pslx

Usage:

blastXmlToPsl - convert blast XML output to PSLs
usage:
blastXmlToPsl [options] blastXml psl

options:
-scores=file - Write score information to this file. Format is:
strands qName qStart qEnd tName tStart tEnd bitscore eVal qDef tDef
-verbose=n - n >= 3 prints each line of file after parsing.
n >= 4 dumps the result of each query
-eVal=n n is e-value threshold to filter results. Format can be either
an integer, double or 1e-10. Default is no filter.
-pslx - create PSLX output (includes sequences for blocks)
-convertToNucCoords - convert protein to nucleic alignments to nucleic
to nucleic coordinates
-qName=src - define element used to obtain the qName. The following
values are support:
o query-ID - use contents of the element if it
exists, otherwise use
o query-def0 - use the first white-space separated word of the
element if it exists, otherwise the first word
of .
Default is query-def0.
-tName=src - define element used to obtain the tName. The following
values are support:
o Hit_id - use contents of the element.
o Hit_def0 - use the first white-space separated word of the
element.
o Hit_accession - contents of the element.
Default is Hit-def0.
-forcePsiBlast - treat as output of PSI-BLAST. blast-2.2.16 and maybe
others indentify psiblast as blastp.
Output only results of last round from PSI BLAST

Converting from blast-xml to SAM/BAM

Using https://github.com/guyduche/Blast2Bam

This is a lossless conversion with sequence and read quality introduced.

blast2bam -o test_blastxmltosam.sam test.blastxml ref.fa reads_1.fq reads_2.fq

Usage

Blast2Bam. Last compilation: Jun 27 2017 at 15:21:50.

Usage: blast2bam [options] [FastQ_2]

Options:
--output | -o FILE Output file (default: stdout)
--interleaved | -p Interleaved data
--readGroup | -R STR Read group header line '@RG\tID:foo'
--minAlignLength | -W INT Discard alignments shorter than [INT]
--shortCigar | -c Short version of the CIGAR string ('M' instead of '=' and 'X')
--posOnChr | -z Adjust the alignment position to the first position of the reference
--help | -h Get help (this screen)

Subsequently converted to BAM using samtools

samtools view -b test_blastxmltosam.sam > test_blastxmltosam.bam

Blast-xml to BED

Using https://github.com/mojaveazure/BLAST_to_BED

Command

BLAST_to_BED.py -x test.blastxml -o test_blastxmltobed.bed

This is a lossy conversion as the sequence information is lost.

usage: BLAST_to_BED.py (-f FASTA FILE | -x XML FILE)
                       [-b BED FILE | -o OUTPUT FILE]
                       [-d REFERENCE BLAST DATABASE] [-e E-VALUE THRESHOLD]
                       [-s MAX HITS] [-m MAX HSPS] [-k]
                       [-X [EXCLUDE CHROMOSOMES [EXCLUDE CHROMOSOMES ...]] |
                       -K [KEEP ONLY CHROMOSOMES [KEEP ONLY CHROMOSOMES ...]]]

Input options:
  Specify the type of input we're working with, '-f | --fasta' requires BLASTn from NCBI to be installed

  -f FASTA FILE, --fasta FASTA FILE
                        Input FASTA file to run BLAST on, incompatible with
                        '-x | --xml'
  -x XML FILE, --xml XML FILE
                        Input BLAST XML file to turn into BED file,
                        incompatible with '-f | --fasta'

Output options:
  Specify how we're writing our output files, create new file or append to preexisting BED file

  -b BED FILE, --bed BED FILE
                        BED file to append results to, incompatible with '-o |
                        --outfile'
  -o OUTPUT FILE, --outfile OUTPUT FILE
                        Name of output file to write to, defaults to
                        '/home1/hugh01j/test_format_convert/output.bed',
                        incompatible with '-b | --bed'

BLAST options:
  Options only used when BLASTING, no not need to provide with '-x | --xml'

  -d REFERENCE BLAST DATABASE, --database REFERENCE BLAST DATABASE
                        Reference BLAST database in nucleotide format, used
                        only when running BLAST
  -e E-VALUE THRESHOLD, --evalue E-VALUE THRESHOLD
                        Evalue threshold for BLAST, defaults to '1e-1', used
                        only when running BLAST
  -s MAX HITS, --max-hits MAX HITS
                        Maximum hits per query, defaults to '1', used only
                        when running BLAST
  -m MAX HSPS, --max-hsps MAX HSPS
                        Maximum HSPs per hit, defaults to '1', used only when
                        running BLAST
  -k, --keep-xml        Do we keep the XML results? pas '-k | --keep-xml' to
                        say 'yes', used only when running BLAST

Filtering options:
  Options to filter the resulting BED file, can specify as many or as few chromosome/contig names as you want. You may choose to either exclude or keep, not both. Note: this only affects the immediate output of this script, we will not filter a preexisting BED file

  -X [EXCLUDE CHROMOSOMES [EXCLUDE CHROMOSOMES ...]], --exclude-chrom [EXCLUDE CHROMOSOMES [EXCLUDE CHROMOSOMES ...]]
                        Chromosomes/Contigs to exclude from BED file,
                        everything else is kept, incompatible with '-K |
                        --keep-chrom'
  -K [KEEP ONLY CHROMOSOMES [KEEP ONLY CHROMOSOMES ...]], --keep-chrom [KEEP ONLY CHROMOSOMES [KEEP ONLY CHROMOSOMES ...]]
                        Chromosomes/Contigs to keep from BED file, everything
                        else is excluded, incompatible with '-X | --exclude-
                        chrom'

Converting Blast tabular to blast-xml

Not completely possible due to missing information such as the alignment but see post https://www.biostars.org/p/7981/#7983
Or using the script blast2xml.pl from the blast2go google group: https://11302289526311073549.googlegroups.com/attach/ed2c446e1b1852a9/blast2xml.pl?part=0.1&view=1&vt=ANaJVrEJYYa7SZC-uvOtoKb6932qlMJWltc2p_5GrTK5Wi7jo-hw14zFroKEcLhdNcJUcQweoUJOuXk2H7wQB5q6mzDTTn211hC2OvwiWw0b5PZev-HQ7Qg

Command:

perl blast2xml.pl -i test.blastxml -o test_blasttoblastxml

Usage

perl blast2xml.pl

- i|input : path of the input file (must be text blast file output)
- o|output : path of the output file (by default, the same as the input file)
- s|sequences : number of sequences by xml file (default inf)
- hit : number of hit to print for each sequences (default inf)
- hsp : number of hsp to print for each hit (default inf)
- help|h|? : print this help and exit

Convert blast-xml to blast tabular

Several approaches have been suggested here https://www.biostars.org/p/7290/ but the most straight forward I have found is using the style sheet blast2tsv.xml from here:
https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2tsv.xsl. This is a lossless conversion but is not a standardly formatted blast-tabular output as it contains the sequence and the aligned site information in the last two columns.

Command:

xsltproc --novalid blast2tsv.xsl test.blastxml

Blast tabular to BED

Using https://github.com/nterhoeven/blast2bed

blast2bed test.blasttab

The output will be in test.blasttab.bed, this is a lossless conversion neither blast-tabular nor BED have the sequence.

Usage

Usage: ./blast2bed
The blast file should be in blast outfmt 6 or 7.
See Readme.org for more details.

Converting SAM to BAM

Using samtools http://quinlanlab.org/tutorials/samtools/samtools.html
This is a lossless format conversion.

Command

samtools view -bS test.sam > test.bam

Converting BAM to SAM using samtools

This is a lossless format conversion.

samtools view -h -o test.sam test.bam

PS: I dedicate this tutorial to Sej, a great bioinformatician and friend 😉

Categories: BLAST, Perl, UNIX

Tagged: BAM, BED, blast, psl, pslx, SAM, xml

A compilation of conversion tools for BED, SAM/BAM, psl, pslx, blast tabular and blast xml

Convert from SAM to psl using sam2psl.py

Convert from psl to SAM

Convert psl to pslx

Convert SAM to fasta

Convert psl to BED

Option 1:

Option 2 as suggested by Alex Reynolds:

Convert pslx to BED

Convert BAM to BED

Convert BED to BAM

Convert BED to psl

Preparing blast-xml format

Preparing blast-tab format

Blast-xml to psl

Converting from blast-xml to SAM/BAM

Blast-xml to BED

Converting Blast tabular to blast-xml

Convert blast-xml to blast tabular

Blast tabular to BED

Converting SAM to BAM

Leave a Reply Cancel reply