
Essential AWK Commands for Next Generation Sequence Analysis

Here are a few essential awk one-liners for next-generation sequence analysis.

You need a recent version of gawk to run the commands that use bitwise operations. Most Linux distributions ship with gawk. OS X users, however, have to install it from here:

http://rudix.org/packages/gawk.html

Count number of reads in a FastQ file
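
A minimal sketch of such a command, assuming every FastQ record spans exactly four lines. The function name and the file name in the usage comment are placeholders, not part of the original post.

```shell
# Count reads in a FastQ file (assumes 4 lines per record).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}
# Usage: count_fastq_reads reads.fastq
```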

Convert FastQ to FastA
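
One way this conversion could be written, again assuming strict four-line records. The function name is illustrative only.

```shell
# Keep line 1 (header, "@" swapped for ">") and line 2 (sequence) of each record.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$1"
}
# Usage: fastq_to_fasta reads.fastq > reads.fasta
```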

 

Get reads matching a sequence pattern and convert them to FastA

This gets all reads that contain the EcoRI recognition site (GAATTC).
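
A sketch of how the pattern filter might look. The function name and argument order are assumptions for illustration; the pattern is treated as an awk regular expression and is matched against the read as written, not its reverse complement.

```shell
# Print reads whose sequence matches a pattern, in FastA format.
# $1 = FastQ file, $2 = pattern (for example GAATTC)
fastq_grep_to_fasta() {
    awk -v pat="$2" '
        NR % 4 == 1 { header = substr($0, 2) }
        NR % 4 == 2 && $0 ~ pat { print ">" header; print }
    ' "$1"
}
# Usage: fastq_grep_to_fasta reads.fastq GAATTC
```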

Separate reads based on their length

This will print reads that are 75bp or more in length.
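
A possible implementation that keeps the output in FastQ format. The function name and the optional second argument (minimum length, defaulting to 75) are illustrative assumptions.

```shell
# Print complete FastQ records whose sequence is at least `min` bases long.
# $1 = FastQ file, $2 = minimum length (default 75)
fastq_min_length() {
    awk -v min="${2:-75}" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 && length(seq) >= min + 0 {
            print header; print seq; print plus; print $0
        }
    ' "$1"
}
# Usage: fastq_min_length reads.fastq 75 > long_reads.fastq
```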

Printing the above output in FastA format
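
The same length filter can be sketched with FastA output. The function name and the default of 75 are assumptions, as before.

```shell
# Print reads of at least `min` bases as FastA.
# $1 = FastQ file, $2 = minimum length (default 75)
fastq_min_length_to_fasta() {
    awk -v min="${2:-75}" '
        NR % 4 == 1 { header = substr($0, 2) }
        NR % 4 == 2 && length($0) >= min + 0 { print ">" header; print }
    ' "$1"
}
# Usage: fastq_min_length_to_fasta reads.fastq 75 > long_reads.fasta
```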

 

Get All header lines from a SAM file
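
A minimal sketch, relying on the fact that SAM header lines start with "@". The function name is a placeholder.

```shell
# Print only the header section of a SAM file.
sam_headers() {
    awk '/^@/' "$1"
}
# Usage: sam_headers aln.sam
```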

 

Get all reads – excluding headers
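
The inverse filter could look like this (function name illustrative):

```shell
# Print alignment records only, skipping header lines that start with "@".
sam_reads() {
    awk '!/^@/' "$1"
}
# Usage: sam_reads aln.sam
```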

 

Get all unmapped

Get all mapped
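
Both filters test bit 0x4 of the FLAG column (field 2), which marks a read as unmapped. The sketch below swaps the gawk bitwise call for plain arithmetic so that it also runs in other awk implementations; with gawk, `and($2, 4)` is the equivalent test. Function names are placeholders.

```shell
# Unmapped reads: FLAG bit 0x4 is set.
sam_unmapped() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1' "$1"
}

# Mapped reads: FLAG bit 0x4 is not set.
sam_mapped() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0' "$1"
}
# Usage: sam_unmapped aln.sam > unmapped.sam
```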

 

Count unmapped

Count mapped
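
Counting uses the same FLAG test (bit 0x4 set means unmapped) and prints a total at the end. This is a sketch with illustrative function names; it uses arithmetic rather than gawk's `and($2, 4)` so that any awk can run it.

```shell
# Count reads with FLAG bit 0x4 set (unmapped).
sam_count_unmapped() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { n++ } END { print n + 0 }' "$1"
}

# Count reads with FLAG bit 0x4 clear (mapped).
sam_count_mapped() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$1"
}
# Usage: sam_count_mapped aln.sam
```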

 

Convert all unmapped reads into fasta format

Convert all mapped reads into fasta format

Convert all reads into fasta format
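
For the three conversions, the read name is field 1 and the sequence is field 10 of each alignment line. A sketch under those assumptions follows; the function names are placeholders, and the FLAG test is written with arithmetic instead of gawk's `and($2, 4)`.

```shell
# Unmapped reads (FLAG bit 0x4 set) as FastA.
sam_unmapped_to_fasta() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 1 { print ">" $1; print $10 }' "$1"
}

# Mapped reads (FLAG bit 0x4 clear) as FastA.
# Sequences are printed as stored in the SAM file.
sam_mapped_to_fasta() {
    awk -F '\t' '!/^@/ && int($2 / 4) % 2 == 0 { print ">" $1; print $10 }' "$1"
}

# All reads as FastA.
sam_to_fasta() {
    awk -F '\t' '!/^@/ { print ">" $1; print $10 }' "$1"
}
# Usage: sam_to_fasta aln.sam > reads.fasta
```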

 

Short command lines for manipulating FASTQ and FASTA sequence files

I thought it was time for me to compile all the short commands that I use on a more or less regular basis to manipulate sequence files.

Convert a multi-line fasta to a singleline fasta
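
One way to do this with awk: print each header on its own line and join the sequence lines that follow it. The function name is illustrative.

```shell
# Join wrapped sequence lines so each record is one header plus one sequence line.
fasta_unwrap() {
    awk '/^>/ { if (NR > 1) printf "\n"; print; next }
         { printf "%s", $0 }
         END { if (NR > 0) printf "\n" }' "$1"
}
# Usage: fasta_unwrap wrapped.fasta > singleline.fasta
```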

 

Convert a fastq file to fasta in a single line using sed
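
A sed sketch that assumes well-formed four-line records. It avoids the GNU-only `first~step` addresses, so it may differ from the command originally posted here; the function name is a placeholder.

```shell
# For each record: rewrite the header and print it, read and print the
# sequence line, then read past the "+" and quality lines without printing.
fastq_to_fasta_sed() {
    sed -n 's/^@/>/p;n;p;n;n' "$1"
}
# Usage: fastq_to_fasta_sed reads.fastq > reads.fasta
```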

 

Dirty way to count the number of sequences in a fastq

It’s dirty because sometimes the quality information line may also start with “@” so the number of sequences could be overestimated.
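
The dirty approach amounts to counting lines that start with "@", for example with grep. The function name is illustrative.

```shell
# Count lines beginning with "@"; this overcounts whenever a quality line
# also starts with "@".
count_fastq_dirty() {
    grep -c '^@' "$1"
}
# Usage: count_fastq_dirty reads.fastq
```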

A more precise way is to count the lines and divide by four:
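
A sketch using wc and shell arithmetic, assuming four lines per record (function name illustrative):

```shell
# Number of lines divided by four.
count_fastq_precise() {
    echo $(( $(wc -l < "$1") / 4 ))
}
# Usage: count_fastq_precise reads.fastq
```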

One liner to remove the description information from a fasta file and just keep the identifier
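
This could be done by keeping only the first whitespace-delimited word of each header line. A sketch, with an illustrative function name:

```shell
# Truncate each header at the first space or tab; sequence lines pass through.
fasta_strip_description() {
    awk '/^>/ { print $1; next } { print }' "$1"
}
# Usage: fasta_strip_description in.fasta > out.fasta
```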

 

Get all the identifier names from a fasta file
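
A sketch that prints the identifier of every header line without the leading ">" (whether the original output kept the ">" is not known, so treat that as an assumption; the function name is also illustrative):

```shell
# Print the first word of each header, minus the ">" prefix.
fasta_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1"
}
# Usage: fasta_ids in.fasta > ids.txt
```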

 

Extract sequences by their ID from a fasta file
For example, you want to get the sequences with id1 and id2 as identifiers
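
A sketch that matches identifiers exactly (so id1 does not also pull in id10) and works for both single-line and wrapped fasta. The function name and argument order are assumptions for illustration.

```shell
# $1 = fasta file, remaining arguments = identifiers to extract
fasta_extract_ids() {
    file=$1
    shift
    awk -v ids="$*" '
        BEGIN { n = split(ids, a, " "); for (i = 1; i <= n; i++) want[a[i]] = 1 }
        /^>/  { id = substr($1, 2); keep = (id in want) }
        keep  { print }
    ' "$file"
}
# Usage: fasta_extract_ids in.fasta id1 id2
```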

If you have a long list of identifiers in a file called ids.txt, then the following should do the trick:
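
With a list file, awk can load the identifiers first and then filter the fasta. This sketch assumes one identifier per line in a non-empty ids.txt; the function name is a placeholder.

```shell
# $1 = file with one identifier per line, $2 = fasta file
fasta_extract_id_list() {
    awk 'NR == FNR { want[$1] = 1; next }
         /^>/      { id = substr($1, 2); keep = (id in want) }
         keep      { print }' "$1" "$2"
}
# Usage: fasta_extract_id_list ids.txt in.fasta > subset.fasta
```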

 

Convert from a two column text tab-delimited file (ID and sequence) to a fasta file
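
A minimal sketch, assuming the ID is in column 1 and the sequence in column 2 (function name illustrative):

```shell
# Turn "ID<TAB>sequence" lines into fasta records.
tab_to_fasta() {
    awk -F '\t' '{ print ">" $1; print $2 }' "$1"
}
# Usage: tab_to_fasta table.txt > out.fasta
```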

 

Get the length of a fasta sequence (the sequence must be on a single line)
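
A sketch that prints each identifier with the length of the sequence line after it, separated by a tab. The output layout and function name are assumptions.

```shell
# Requires single-line fasta: one header line, then one sequence line.
fasta_lengths() {
    awk '/^>/ { id = substr($1, 2); next }
         { print id "\t" length($0) }' "$1"
}
# Usage: fasta_lengths singleline.fasta
```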

 

I’ll update this when I find some more useful single-line commands for manipulating fastq and fasta files.

Please post comments if you have some suggestions.

 
