Essential AWK Commands for Next Generation Sequence Analysis

Here are a few essential awk one-liners for next-generation sequence analysis.

Users need a recent version of gawk to run the commands that use bitwise operations. Most Linux distributions ship with gawk; OS X users have to install it separately.

Count number of reads in a FastQ file
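A minimal sketch, assuming the usual four-lines-per-read FastQ layout with no wrapped sequences. The function name `count_fastq_reads` and the file name `reads.fastq` are placeholders, not names from the original post.

```shell
# Count reads by dividing the total line count by 4.
# Assumes every record is exactly four lines (header, sequence, "+", quality).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Usage (reads.fastq is a placeholder file name):
#   count_fastq_reads reads.fastq
```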

Convert FastQ to FastA
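One way to do this, again assuming four-line FastQ records: keep lines 1 and 2 of each record and swap the leading `@` of the header for `>`. The function name `fastq_to_fasta` is a hypothetical helper.

```shell
# Line 1 of each record is the header, line 2 is the sequence.
# substr($0, 2) drops the leading "@" before ">" is prepended.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$1"
}

# Usage:
#   fastq_to_fasta reads.fastq > reads.fasta
```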


Get reads matching a sequence pattern and convert them to FastA

This gets all reads that contain the EcoRI recognition site (GAATTC).
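A sketch of such a filter, assuming four-line FastQ records. It remembers the header of each record and prints the record as FastA only when the sequence line contains GAATTC. The function name `fastq_grep_ecori` is made up for this example; change the pattern between the slashes to search for a different motif.

```shell
# Remember the header (without "@"), then test the sequence line
# against the EcoRI site and emit a FastA record on a match.
fastq_grep_ecori() {
    awk 'NR % 4 == 1 { header = substr($0, 2) }
         NR % 4 == 2 && /GAATTC/ { print ">" header; print }' "$1"
}

# Usage:
#   fastq_grep_ecori reads.fastq > ecori_reads.fasta
```

Note that this only looks at the read as given; it does not search the reverse strand, which happens not to matter for GAATTC because the site is its own reverse complement.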

Separate reads based on their length

This will print reads that are 75 bp or more in length.
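A possible version that keeps the complete FastQ record for each read passing the length cutoff. It assumes four-line records; the function name `filter_fastq_by_length` and its optional second argument (minimum length, default 75) are assumptions added for illustration.

```shell
# Buffer the first three lines of each record, then print all four
# lines once the quality line arrives, if the sequence is long enough.
filter_fastq_by_length() {
    awk -v min="${2:-75}" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 && length(seq) >= min {
            print header; print seq; print plus; print $0
        }' "$1"
}

# Usage:
#   filter_fastq_by_length reads.fastq > reads_75bp.fastq
```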

Printing the above output in FastA format
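The same length filter can emit FastA directly. This is a sketch under the same four-line-record assumption; `fastq_to_fasta_by_length` is a hypothetical name and the second argument is the minimum length (default 75).

```shell
# Keep the header without "@", and print a FastA record when the
# sequence line meets the minimum length.
fastq_to_fasta_by_length() {
    awk -v min="${2:-75}" '
        NR % 4 == 1 { header = substr($0, 2) }
        NR % 4 == 2 && length($0) >= min { print ">" header; print }' "$1"
}

# Usage:
#   fastq_to_fasta_by_length reads.fastq > reads_75bp.fasta
```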


Get all header lines from a SAM file
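SAM header lines start with `@`, and read names in alignment lines are not allowed to start with that character, so a simple anchored match is enough. The function name `sam_headers` and the file name `aln.sam` are placeholders.

```shell
# Print only the lines that begin with "@" (the SAM header section).
sam_headers() {
    awk '/^@/' "$1"
}

# Usage:
#   sam_headers aln.sam
```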


Get all reads – excluding headers
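This is the inverse of the header match: print every line that does not begin with `@`. `sam_reads` is a hypothetical helper name.

```shell
# Print alignment lines only, skipping the "@" header section.
sam_reads() {
    awk '!/^@/' "$1"
}

# Usage:
#   sam_reads aln.sam
```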


Get all unmapped
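A sketch using gawk's bitwise `and()` built-in. In SAM, the second column is the FLAG field and bit 4 (0x4) marks a read as unmapped. The function name `sam_unmapped` is an assumption; `gawk` is called explicitly because `and()` is a gawk extension.

```shell
# Skip header lines, then keep alignments whose FLAG has bit 4 set.
# and($2, 4) returns 4 (true) for unmapped reads and 0 otherwise.
sam_unmapped() {
    gawk -F '\t' '!/^@/ && and($2, 4)' "$1"
}

# Usage:
#   sam_unmapped aln.sam
```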

Get all mapped
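For mapped reads, one option is to negate the test on FLAG bit 4 (0x4, "read unmapped"). This is a gawk sketch; `sam_mapped` is a hypothetical name.

```shell
# Skip header lines, then keep alignments whose FLAG does NOT have
# bit 4 set, i.e. reads the aligner reported as mapped.
sam_mapped() {
    gawk -F '\t' '!/^@/ && !and($2, 4)' "$1"
}

# Usage:
#   sam_mapped aln.sam
```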


Count unmapped
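Counting works like the filter, with a counter in place of the print. A gawk sketch; `count_sam_unmapped` is a made-up name, and `n + 0` makes sure a `0` is printed when nothing matches.

```shell
# Increment n for every non-header line with FLAG bit 4 set,
# then print the total at the end.
count_sam_unmapped() {
    gawk -F '\t' '!/^@/ && and($2, 4) { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_sam_unmapped aln.sam
```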

Count mapped
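The mapped count is the same idea with the bit test negated. A gawk sketch with a hypothetical function name, `count_sam_mapped`. It counts alignment lines, so a read with secondary or supplementary alignments is counted once per line.

```shell
# Increment n for every non-header line with FLAG bit 4 clear,
# then print the total at the end.
count_sam_mapped() {
    gawk -F '\t' '!/^@/ && !and($2, 4) { n++ } END { print n + 0 }' "$1"
}

# Usage:
#   count_sam_mapped aln.sam
```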


Convert all unmapped reads into fasta format
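A gawk sketch that combines the unmapped filter with FastA output: column 1 (QNAME) becomes the FastA header and column 10 (SEQ) the sequence. `sam_unmapped_to_fasta` is a hypothetical name.

```shell
# For each unmapped alignment line, print ">" plus the read name,
# followed by the read sequence from column 10.
sam_unmapped_to_fasta() {
    gawk -F '\t' '!/^@/ && and($2, 4) { print ">" $1; print $10 }' "$1"
}

# Usage:
#   sam_unmapped_to_fasta aln.sam > unmapped.fasta
```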

Convert all mapped reads into fasta format
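The mapped counterpart, as a gawk sketch with a hypothetical function name, `sam_mapped_to_fasta`. Keep in mind that for reads aligned to the reverse strand (FLAG bit 16), the SEQ column holds the reverse complement of the original read, and this sketch prints it as stored.

```shell
# For each mapped alignment line, print ">" plus the read name,
# followed by the sequence exactly as stored in column 10.
sam_mapped_to_fasta() {
    gawk -F '\t' '!/^@/ && !and($2, 4) { print ">" $1; print $10 }' "$1"
}

# Usage:
#   sam_mapped_to_fasta aln.sam > mapped.fasta
```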

Convert all reads into fasta format
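With no FLAG test, plain awk is enough. This sketch prints every alignment line as a FastA record; `sam_to_fasta` is a hypothetical name. Paired reads share a name in SAM, and secondary alignments may carry `*` in place of a sequence, so the output may need further cleanup depending on the aligner.

```shell
# Skip header lines; print read name (column 1) as the FastA header
# and the sequence (column 10) on the next line.
sam_to_fasta() {
    awk -F '\t' '!/^@/ { print ">" $1; print $10 }' "$1"
}

# Usage:
#   sam_to_fasta aln.sam > all_reads.fasta
```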

