Short command lines for manipulation FASTQ and FASTA sequence files

Post by: Joseph Hughes
February 23, 2015
9 Comments

I thought it was time for me to compile all the short command that I use on a more or less regular basis to manipulate sequence files.

Convert a multi-line fasta to a singleline fasta

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample1.fa > sample1_singleline.fa

To convert a fastq file to fasta in a single line using sed

sed '/^@/!d;s//>/;N' sample1.fq > sample1.fa

Dirty way to count the number of sequences in a fastq

grep -c '^@' sample1.fq

It’s dirty because sometimes the quality information line may also start with “@” so the number of sequences could be overestimated.

A more precise way is to count the lines and divide by four:

cat sample1.fq | echo $((`wc -l`/4))

One liner to remove the description information from a fasta file and just keep the identifier

perl -p -i -e 's/>(.+?) .+/>$1/g' sample1.fa

Get all the identifier names from a fasta file

perl -ne 'if(/^>(\S+)/){print "$1\n"}' sample1.fa

Extract sequences by their ID from a fasta file
For example, you want to get the sequences with id1 and id2 as identifiers

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(id1 id2)}print if $c' sample1.fa

If you have a long list of identifiers in a file called ids.txt, then the following should do the trick:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.txt sample1.fa

Convert from a two column text tab-delimited file (ID and sequence) to a fasta file

awk -vOFS='' '{print ">",$1,"\n",$2,"\n";}' two_column_sample_tab.txt > sample1.fa

Get the length of a fasta sequence (the sequence must in singleline)

cat sample1_singleline.fa | awk 'NR%2==0' | awk '{print length($1)}'

I’ll update this when I find some more useful single line commands for manipulation fastq and fasta files.

Please post comments if you have some suggestions.

Categories: awk, Perl, UNIX

9 Comments on “Short command lines for manipulation FASTQ and FASTA sequence files”

Ma Pengze says:

May 28, 2015 at 1:44 pm

haha I like awk~

Richard Orton says:

April 3, 2016 at 12:50 pm

Your cat sample1.fq | echo $((`wc -l`/4)) doesn’t seem to work for me for counting the number of reads in a FASTQ files – on BioLinux 8 – default z shell – I get an error:

wc: standard input: Input/output error.

This seems to work:

echo $((`wc -l < ReadFileName.fastq`/4))

bljog says:

April 3, 2016 at 6:08 pm

If using Bourne shell (sh), it works. Also if you want to do it for .gz compressed fastq files:
cat sample1.fq | echo $((`wc -l`/4)

bljog says:

April 3, 2016 at 7:01 pm

You could also switch to bash:
exec bash
And to switch back to zsh:
exec zsh

Lijing Bu says:

January 9, 2018 at 10:08 pm

Some quality score lines start with @ !!!

1. bljog says:
  
  January 10, 2018 at 12:10 pm
  
  This is why the following command is a better way to count the number of sequences:
  
  cat sample1.fq | echo $((`wc -l`/4))
  
  1. Lijing Bu says:
    
    January 10, 2018 at 6:09 pm
    
    Sorry, I meant it affect the fastq to fasta convert.
    
    About the counting, sometimes the end of the file doens’t have a end of line character. In such rare case, the number of lines from wc -l is 1 less than the real number.
    
Weigela2 says:

April 24, 2018 at 9:30 pm

Counting reads: Do head of our fastq or fasta file first, take part of the read name, and use that to grep -c . the chance that there is the same 4 or more character sequence in the quality scores is virtually nil.

LoM says:

May 29, 2018 at 1:27 pm

Hello,
I tried to loop the perl script to extract sequences based on Ids from a fasta file, using multiple IDs.txt files and one original fasta file (original_fasta.fa):

for file in *.txt; do perl -ne ‘if(/^>(S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV’ $file original_fasta.fa > “$file”_cds_hits.fa ; done

However all the output fasta files contains the sequences from the first file:

grep – c “>” *.txt
ID_1.txt:25
ID_2.txt:48
ID_3.txt:21
…
ID_27.txt:34

grep -c “>” *_cds_hits.fa
ID_1_cds_hits.fa:25
ID_2_cds_hits.fa:25
ID_3_cds_hits.fa:25
…
ID_27_cds_hits.fa:25

Any thoughts on how to loop this script properly ?
thanks

Short command lines for manipulation FASTQ and FASTA sequence files

9 Comments on “Short command lines for manipulation FASTQ and FASTA sequence files”

Leave a Reply Cancel reply