A compilation of conversion tools for BED, SAM/BAM, psl, pslx, blast tabular and blast xml
- Post by: Joseph Hughes
- November 15, 2017
- No Comment
A wide range of formats exist for representing the comparisons of different sequences to each other: blast tabular, blast xml, psl, pslx, SAM/BAM, BED
Most of these formats can be converted from one format to another. Sometimes the format is lossless allowing for the original data to be perfectly converted
without the loss of information. Other times, the format conversion is lossy permitting the conversion of only part of the original data resulting in the loss
of some information.
Here, I have compiled the tools or UNIX commands necessary for converting from one file format to another. As you can see, I am still needing to compete some of the gaps so please let me know of any other tools which are missing.
The command is shown in full below the table.
Conversion From Row/To Col | blast-xml | blast-tab | psl | pslx | SAM/BAM | BED |
blast-xml | N/A | blast2tsv.xsl | blastXmlToPsl | blastXmlToPsl -pslx | blast2bam | BLAST_to_BED |
blast-tab | perl blast2xml.pl | N/A | ?? | ?? | ?? | blast2bed |
psl | ?? | ?? | N/A | pslToPslx | psl2sam.pl | pslToBed psl2BED |
pslx | ?? | ?? | ?? | N/A | ?? | pslToBed |
SAM/BAM | ?? | ?? | sam2psl.py | sam2psl.py -s | samtools view | bedtools bamtobed |
BED | ?? | ?? | bed2psl | ?? | bedtools bedtobam | N/A |
Convert from SAM to psl using sam2psl.py
Available from: https://github.com/ndaniel/fusioncatcher/blob/master/bin/sam2psl.py
Example command:
python sam2psl.py -i test.sam -s -o test.psl
This is a lossless format conversion with the -s option , however the sequence as a read is no longer supported in the psl format.
python sam2psl.py -i test.sam -o test_no_seq.psl
Usage:
sam2psl.py [options] It takes as input a file in SAM format and it converts into a PSL format file. Options: --version show program's version number and exit -h, --help show this help message and exit -i INPUT_FILENAME, --input=INPUT_FILENAME The input file in SAM format. -4, --skip-conversion-cigar-1.3 By default if the CIGAR strings in the input SAM file are in the format defined in SAM version 1.4 (i.e. there are 'X' and '=') then the CIGAR string will be first converted into CIGAR string, which is described in SAM version 1.3, (i.e. there are no 'X' and '=' which are replaced with 'M') and afterwards into PSL format. Default is 'False'. -s, --read-seq It adds to the PSL output as column 22, the sequence of the read. This is not anymore a valid PSL format. -r REPLACE_READS_IDS, --replace-read-ids=REPLACE_READS_IDS In the reads ids (also known as query name in PSL) the string specified here will be replaced with '/' (which is used in Solexa for /1 and /2). -o OUTPUT_FILENAME, --output=OUTPUT_FILENAME The output file in PSL format.
Convert from psl to SAM
Available from the samtools legacy scripts: https://github.com/lh3/samtools-legacy/blob/master/misc/psl2sam.pl
Example command:
psl2sam.pl test.psl
This ends up being a lossy conversion as the read sequence is not in the output.
Usage
psl2sam.pl Usage: psl2sam.pl [-a 1] [-b 3] [-q 5] [-r 2]
The options are used to calculate a blast like scoring see post:
https://www.biostars.org/p/44281/
Convert psl to pslx
Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
pslToPslx test_no_seq.psl test.fa ref.fa test.pslx
This is a lossless conversion. For usage:
pslToPslx - Convert from psl to pslx format, which includes sequences usage: pslToPslx [options] in.psl qSeqSpec tSeqSpec out.pslx qSeqSpec and tSeqSpec can be nib directory, a 2bit file, or a FASTA file. FASTA files should end in .fa, .fa.gz, .fa.Z, or .fa.bz2 and are read into memory. Options: -masked - if specified, repeats are in lower case cases, otherwise entire sequence is loader case.
Convert SAM to fasta
awk '$1~!/^@/ {print ">"$1"\n"$10}' test.sam > test.fa
Convert psl to BED
Option 1:
Using pslToBed from https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
This is a lossless conversion as the standard psl doesn’t have the sequence and so the bed file doesn’t either.
pslToBed test_no_seq.psl test.bed
Usage:
bedToPsl - convert bed format files to psl format usage: bedToPsl chromSizes bedFile pslFile Convert a BED file to a PSL file. This the result is an alignment. It is intended to allow processing by tools that operate on PSL. If the BED has at least 12 columns, then a PSL with blocks is created. Otherwise single-exon PSLs are created. Options: -keepQuery - instead of creating a fake query, create PSL with identical query and target specs. Useful if bed features are to be lifted with pslMap and one wants to keep the source location in the lift result. (khmerEnv)hugh01j@Alpha:~/test_format_convert$ kentUtils/bin/linux.x86_64/pslToBed pslToBed: tranform a psl format file to a bed format file. usage: pslToBed psl bed options: -cds=cdsFile cdsFile specifies a input cds tab-separated file which contains genbank-style CDS records showing cdsStart..cdsEnd e.g. NM_123456 34..305 These coordinates are assumed to be in the query coordinate system of the psl, like those that are created from genePredToFakePsl -posName changes the qName field to qName:qStart-qEnd (can be used to create links to query position on details page)
Option 2 as suggested by Alex Reynolds:
Using psl2bed from http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/psl2bed.html
This is also lossless when used with –keep-header:
Example:
psl2bed < in.psl > out.bed
As a bonus, it uses sort-bed to make a sorted BED file, so that it is ready to use with bedops, bedmap, etc.
Usage:
convert2bed -i psl version: 2.4.29 (typical) author: Alex Reynolds Converts 0-based, half-open [a-1, b) headered or headerless PSL input into 0-based, half-open [a-1, b) extended BED or BEDOPS Starch $ psl2bed < foo.psl > sorted-foo.psl.bed $ psl2starch < foo.psl > sorted-foo.psl.starch Or: $ convert2bed -i psl < foo.psl > sorted-foo.psl.bed $ convert2bed -i psl -o starch < foo.psl > sorted-foo.psl.starch We make no assumptions about sort order from converted output. Apply the usage case displayed to pass data to the BEDOPS sort-bed application which generates lexicographically-sorted BED data as output. If you want to skip sorting, use the --do-not-sort option: $ psl2bed --do-not-sort < foo.psl > unsorted-foo.psl.bed The PSL specification (http://genome.ucsc.edu/goldenPath/help/blatSpec.html) contains 21 columns, some which map to UCSC BED columns and some which do not. PSL input can contain a header or be headerless, if the BLAT search was performed with the -noHead option. This program can accept input in either format. If input is headered, you can use the --keep-header option to preserve the header data as pseudo-BED elements that use the "_header" chromosome name. We expect this should not cause any collision problems since PSL data should use UCSC chromosome naming conventions. We describe below how we map columns to BED, so that BLAT results can be losslessly transformed back into PSL format with a simple awk statement or other similar command that permutes columns into PSL-ordering. We map the following PSL columns to their equivalent BED column, as follows: - tName <--> chromosome - tStart <--> start - tEnd <--> stop - qName <--> id - qSize <--> score - strand <--> strand Remaining PSL columns are mapped, in order, to columns 7 through 21 in the BED output: - matches - misMatches - repMatches - nCount - qNumInsert - qBaseInsert - tNumInsert - tBaseInsert - qStart - qEnd - tSize - blockCount - blockSizes - qStarts - tStarts PSL conversion options: --keep-header (-k) Preserve header section as pseudo-BED elements (requires --headered) --split (-s) Split record into multiple BED elements, based on tStarts field value Other processing options: --do-not-sort (-d) Do not sort BED output with sort-bed (not compatible with --output=starch) --max-mem=<value> (-m <val>) Sets aside <value> memory for sorting BED output. For example, <value> can be 8G, 8000M or 8000000000 to specify 8 GB of memory (default is 2G) --sort-tmpdir=<dir> (-r <dir>) Optionally sets [dir] as temporary directory for sort data, when used in conjunction with --max-mem=[value], instead of the host's operating system default temporary directory --starch-bzip2 (-z) Used with --output=starch, the compressed output explicitly applies the bzip2 algorithm to compress intermediate data (default is bzip2) --starch-gzip (-g) Used with --output=starch, the compressed output applies gzip compression on intermediate data --starch-note="xyz..." (-e "xyz...") Used with --output=starch, this adds a note to the Starch archive metadata --help | --help[-bam|-gff|-gtf|-gvf|-psl|-rmsk|-sam|-vcf|-wig] (-h | -h <fmt>) Show general help message (or detailed help for a specified input format) --version (-w) Show application version
Convert pslx to BED
Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
This is a lossy conversion as the sequence is lost
pslToBed test.pslx test.bed
Usage:
pslToBed: tranform a psl format file to a bed format file. usage: pslToBed psl bed options: -cds=cdsFile cdsFile specifies a input cds tab-separated file which contains genbank-style CDS records showing cdsStart..cdsEnd e.g. NM_123456 34..305 These coordinates are assumed to be in the query coordinate system of the psl, like those that are created from genePredToFakePsl -posName changes the qName field to qName:qStart-qEnd (can be used to create links to query position on details page)
Convert BAM to BED
Using bedtools: http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html
This is a lossy conversion as the sequence data is lost.
bedtools bamtobed -i test.bam > test_bamtobed.bed
Usage:
Tool: bedtools bamtobed (aka bamToBed) Version: v2.22.1 Summary: Converts BAM alignments to BED6 or BEDPE format. Usage: bedtools bamtobed [OPTIONS] -i <bam> Options: -bedpe Write BEDPE format. - Requires BAM to be grouped or sorted by query. -mate1 When writing BEDPE (-bedpe) format, always report mate one as the first BEDPE "block". -bed12 Write "blocked" BED format (aka "BED12"). Forces -split. http://genome-test.cse.ucsc.edu/FAQ/FAQformat#format1 -split Report "split" BAM alignments as separate BED entries. Splits only on N CIGAR operations. -splitD Split alignments based on N and D CIGAR operators. Forces -split. -ed Use BAM edit distance (NM tag) for BED score. - Default for BED is to use mapping quality. - Default for BEDPE is to use the minimum of the two mapping qualities for the pair. - When -ed is used with -bedpe, the total edit distance from the two mates is reported. -tag Use other NUMERIC BAM alignment tag for BED score. - Default for BED is to use mapping quality. Disallowed with BEDPE output. -color An R,G,B string for the color used with BED12 format. Default is (255,0,0). -cigar Add the CIGAR string to the BED entry as a 7th column.
Convert BED to BAM
Create the genome file for bed
samtools faidx ref.fa awk -v OFS='\t' {'print $1,$2'} ref.fai > ref.txt
Using the genome file and BED file to produce the BAM file.
bedtools bedtobam -i test_bamtobed.bed -g ref_revcomp.txt > test_bedtobam.bam
Usage:
Tool: bedtools bedtobam (aka bedToBam) Version: v2.22.1 Summary: Converts feature records to BAM format. Usage: bedtools bedtobam [OPTIONS] -i <bed/gff/vcf> -g <genome> Options: -mapq Set the mappinq quality for the BAM records. (INT) Default: 255 -bed12 The BED file is in BED12 format. The BAM CIGAR string will reflect BED "blocks". -ubam Write uncompressed BAM output. Default writes compressed BAM. Notes: (1) BED files must be at least BED4 to create BAM (needs name field).
The sequence is not present in the BED file so is absent from BAM as well. This is a lossy format conversion.
Additionally, there are differences in the number of read compared to the original file.
Convert BED to psl
Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
This is a lossless conversion as neither the BED nor psl contain sequence information
bedToPsl Longest_revcomp.txt test.bed test_bedtopsl.psl
Usage:
bedToPsl - convert bed format files to psl format usage: bedToPsl chromSizes bedFile pslFile Convert a BED file to a PSL file. This the result is an alignment. It is intended to allow processing by tools that operate on PSL. If the BED has at least 12 columns, then a PSL with blocks is created. Otherwise single-exon PSLs are created. Options: -keepQuery - instead of creating a fake query, create PSL with identical query and target specs. Useful if bed features are to be lifted with pslMap and one wants to keep the source location in the lift result.
Preparing blast-xml format
makeblastdb -dbtype nucl -in Longest_revcomp.fa blastn -query test.fa -db Longest_revcomp.fa -out test.blastxml -outfmt 5
Preparing blast-tab format
blastn -query test.fa -db Longest_revcomp.fa -out test.blasttab -outfmt 6
Blast-xml to psl
Using https://github.com/ENCODE-DCC/kentUtils/tree/master/bin/linux.x86_64
This is a lossy conversion
blastXmlToPsl test.blastxml test_blastxmltopsl.psl
However, if you use the -pslx option, you can get lossless conversion
blastXmlToPsl -pslx test.blastxml test_blastxmltopsl.pslx
Usage:
blastXmlToPsl - convert blast XML output to PSLs usage: blastXmlToPsl [options] blastXml psl options: -scores=file - Write score information to this file. Format is: strands qName qStart qEnd tName tStart tEnd bitscore eVal qDef tDef -verbose=n - n >= 3 prints each line of file after parsing. n >= 4 dumps the result of each query -eVal=n n is e-value threshold to filter results. Format can be either an integer, double or 1e-10. Default is no filter. -pslx - create PSLX output (includes sequences for blocks) -convertToNucCoords - convert protein to nucleic alignments to nucleic to nucleic coordinates -qName=src - define element used to obtain the qName. The following values are support: o query-ID - use contents of the element if it exists, otherwise use o query-def0 - use the first white-space separated word of the element if it exists, otherwise the first word of . Default is query-def0. -tName=src - define element used to obtain the tName. The following values are support: o Hit_id - use contents of the element. o Hit_def0 - use the first white-space separated word of the element. o Hit_accession - contents of the element. Default is Hit-def0. -forcePsiBlast - treat as output of PSI-BLAST. blast-2.2.16 and maybe others indentify psiblast as blastp. Output only results of last round from PSI BLAST
Converting from blast-xml to SAM/BAM
Using https://github.com/guyduche/Blast2Bam
This is a lossless conversion with sequence and read quality introduced.
blast2bam -o test_blastxmltosam.sam test.blastxml ref.fa reads_1.fq reads_2.fq
Usage
Blast2Bam. Last compilation: Jun 27 2017 at 15:21:50. Usage: blast2bam [options] [FastQ_2] Options: --output | -o FILE Output file (default: stdout) --interleaved | -p Interleaved data --readGroup | -R STR Read group header line '@RG\tID:foo' --minAlignLength | -W INT Discard alignments shorter than [INT] --shortCigar | -c Short version of the CIGAR string ('M' instead of '=' and 'X') --posOnChr | -z Adjust the alignment position to the first position of the reference --help | -h Get help (this screen)
Subsequently converted to BAM using samtools
samtools view -b test_blastxmltosam.sam > test_blastxmltosam.bam
Blast-xml to BED
Using https://github.com/mojaveazure/BLAST_to_BED
Command
BLAST_to_BED.py -x test.blastxml -o test_blastxmltobed.bed
This is a lossy conversion as the sequence information is lost.
usage: BLAST_to_BED.py (-f FASTA FILE | -x XML FILE) [-b BED FILE | -o OUTPUT FILE] [-d REFERENCE BLAST DATABASE] [-e E-VALUE THRESHOLD] [-s MAX HITS] [-m MAX HSPS] [-k] [-X [EXCLUDE CHROMOSOMES [EXCLUDE CHROMOSOMES ...]] | -K [KEEP ONLY CHROMOSOMES [KEEP ONLY CHROMOSOMES ...]]] Input options: Specify the type of input we're working with, '-f | --fasta' requires BLASTn from NCBI to be installed -f FASTA FILE, --fasta FASTA FILE Input FASTA file to run BLAST on, incompatible with '-x | --xml' -x XML FILE, --xml XML FILE Input BLAST XML file to turn into BED file, incompatible with '-f | --fasta' Output options: Specify how we're writing our output files, create new file or append to preexisting BED file -b BED FILE, --bed BED FILE BED file to append results to, incompatible with '-o | --outfile' -o OUTPUT FILE, --outfile OUTPUT FILE Name of output file to write to, defaults to '/home1/hugh01j/test_format_convert/output.bed', incompatible with '-b | --bed' BLAST options: Options only used when BLASTING, no not need to provide with '-x | --xml' -d REFERENCE BLAST DATABASE, --database REFERENCE BLAST DATABASE Reference BLAST database in nucleotide format, used only when running BLAST -e E-VALUE THRESHOLD, --evalue E-VALUE THRESHOLD Evalue threshold for BLAST, defaults to '1e-1', used only when running BLAST -s MAX HITS, --max-hits MAX HITS Maximum hits per query, defaults to '1', used only when running BLAST -m MAX HSPS, --max-hsps MAX HSPS Maximum HSPs per hit, defaults to '1', used only when running BLAST -k, --keep-xml Do we keep the XML results? pas '-k | --keep-xml' to say 'yes', used only when running BLAST Filtering options: Options to filter the resulting BED file, can specify as many or as few chromosome/contig names as you want. You may choose to either exclude or keep, not both. Note: this only affects the immediate output of this script, we will not filter a preexisting BED file -X [EXCLUDE CHROMOSOMES [EXCLUDE CHROMOSOMES ...]], --exclude-chrom [EXCLUDE CHROMOSOMES [EXCLUDE CHROMOSOMES ...]] Chromosomes/Contigs to exclude from BED file, everything else is kept, incompatible with '-K | --keep-chrom' -K [KEEP ONLY CHROMOSOMES [KEEP ONLY CHROMOSOMES ...]], --keep-chrom [KEEP ONLY CHROMOSOMES [KEEP ONLY CHROMOSOMES ...]] Chromosomes/Contigs to keep from BED file, everything else is excluded, incompatible with '-X | --exclude- chrom'
Converting Blast tabular to blast-xml
Not completely possible due to missing information such as the alignment but see post https://www.biostars.org/p/7981/#7983
Or using the script blast2xml.pl from the blast2go google group: https://11302289526311073549.googlegroups.com/attach/ed2c446e1b1852a9/blast2xml.pl?part=0.1&view=1&vt=ANaJVrEJYYa7SZC-uvOtoKb6932qlMJWltc2p_5GrTK5Wi7jo-hw14zFroKEcLhdNcJUcQweoUJOuXk2H7wQB5q6mzDTTn211hC2OvwiWw0b5PZev-HQ7Qg
Command:
perl blast2xml.pl -i test.blastxml -o test_blasttoblastxml
Usage
perl blast2xml.pl - i|input : path of the input file (must be text blast file output) - o|output : path of the output file (by default, the same as the input file) - s|sequences : number of sequences by xml file (default inf) - hit : number of hit to print for each sequences (default inf) - hsp : number of hsp to print for each hit (default inf) - help|h|? : print this help and exit
Convert blast-xml to blast tabular
Several approaches have been suggested here https://www.biostars.org/p/7290/ but the most straight forward I have found is using the style sheet blast2tsv.xml from here:
https://github.com/lindenb/xslt-sandbox/blob/master/stylesheets/bio/ncbi/blast2tsv.xsl. This is a lossless conversion but is not a standardly formatted blast-tabular output as it contains the sequence and the aligned site information in the last two columns.
Command:
xsltproc --novalid blast2tsv.xsl test.blastxml
Blast tabular to BED
Using https://github.com/nterhoeven/blast2bed
blast2bed test.blasttab
The output will be in test.blasttab.bed, this is a lossless conversion neither blast-tabular nor BED have the sequence.
Usage
Usage: ./blast2bed The blast file should be in blast outfmt 6 or 7. See Readme.org for more details.
Converting SAM to BAM
Using samtools http://quinlanlab.org/tutorials/samtools/samtools.html
This is a lossless format conversion.
Command
samtools view -bS test.sam > test.bam
Converting BAM to SAM using samtools
This is a lossless format conversion.
samtools view -h -o test.sam test.bam
PS: I dedicate this tutorial to Sej, a great bioinformatician and friend 😉