Short command lines for manipulation FASTQ and FASTA sequence files

I thought it was time for me to compile all the short command that I use on a more or less regular basis to manipulate sequence files.

Convert a multi-line fasta to a singleline fasta

 

To convert a fastq file to fasta in a single line using sed

 

Dirty way to count the number of sequences in a fastq

It’s dirty because sometimes the quality information line may also start with “@” so the number of sequences could be overestimated.

A more precise way is to count the lines and divide by four:

One liner to remove the description information from a fasta file and just keep the identifier

 

Get all the identifier names from a fasta file

 

Extract sequences by their ID from a fasta file
For example, you want to get the sequences with id1 and id2 as identifiers

If you have a long list of identifiers in a file called ids.txt, then the following should do the trick:

 

Convert from a two column text tab-delimited file (ID and sequence) to a fasta file

 

Get the length of a fasta sequence (the sequence must in singleline)

 

I’ll update this when I find some more useful single line commands for manipulation fastq and fasta files.

Please post comments if you have some suggestions.

 

  • Ma Pengze

    haha I like awk~

  • Richard Orton

    Your cat sample1.fq | echo $((wc -l/4)) doesn’t seem to work for me for counting the number of reads in a FASTQ files – on BioLinux 8 – default z shell – I get an error:

    wc: standard input: Input/output error.

    This seems to work:

    echo $((wc -l < ReadFileName.fastq/4))

  • bljog

    If using Bourne shell (sh), it works. Also if you want to do it for .gz compressed fastq files:
    cat sample1.fq | echo $((wc -l/4)

  • bljog

    You could also switch to bash:
    exec bash
    And to switch back to zsh:
    exec zsh

  • Lijing Bu

    Some quality score lines start with @ !!!

    • bljog

      This is why the following command is a better way to count the number of sequences:

      cat sample1.fq | echo $((wc -l/4))

      • Lijing Bu

        Sorry, I meant it affect the fastq to fasta convert.

        About the counting, sometimes the end of the file doens’t have a end of line character. In such rare case, the number of lines from wc -l is 1 less than the real number.

  • Weigela2

    Counting reads: Do head of our fastq or fasta file first, take part of the read name, and use that to grep -c . the chance that there is the same 4 or more character sequence in the quality scores is virtually nil.

  • LoM

    Hello,
    I tried to loop the perl script to extract sequences based on Ids from a fasta file, using multiple IDs.txt files and one original fasta file (original_fasta.fa):

    for file in *.txt; do perl -ne ‘if(/^>(S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV’ $file original_fasta.fa > “$file”_cds_hits.fa ; done

    However all the output fasta files contains the sequences from the first file:

    grep – c “>” *.txt
    ID_1.txt:25
    ID_2.txt:48
    ID_3.txt:21

    ID_27.txt:34

    grep -c “>” *_cds_hits.fa
    ID_1_cds_hits.fa:25
    ID_2_cds_hits.fa:25
    ID_3_cds_hits.fa:25

    ID_27_cds_hits.fa:25

    Any thoughts on how to loop this script properly ?
    thanks