Java CIGAR Parser for SAM format

Sequence Alignment/Map (SAM) format is a well-known bioinformatics format designed to store  information about reads mapping against large reference sequence.  The SAM file is split into two sections: a header section and an alignment section. The header section starts with ‘@’ and it contains information such as the name and length of the reference sequence. The alignment section follows the header section. Each line in this section represents an alignment of a read to the reference sequence that may include insertion, deletion, skips and padding. However, not all lines are successful alignment. A single alignment line consists of 11 mandatory fields and a variable number of optional fields. The mandatory fields are ordered and, if information is unavailable, their values can be zero, for a number field, or * for a string field.  The reader is referred to the SAM Format Specification [1,2] for a detailed description of each field in the alignment section but in summary, the mandatory fields are:

  1. QNAME: the name of the read.
  2. FLAG: a bitwise number describing information about the read such as mapped/unmapped read, pairing, etc.
  3. RNAME: the reference sequence name. The value of this field is * for unmapped reads.
  4. POS: the left most mapping position of the first matching base. It is set to zero for unmapped reads.
  5. MAPQ: a number to indicate the mapping quality.
  6. CIGAR: a string describing how the read aligns with the reference. It consists of one or more components. Each component comprises an operator and the number of bases which the operator applies to. Operators are: MIDNSHP=X.  The following table gives an overview of these operators
  7. MNAME: the name of the mate strand, if it is a paired read. ‘=’ means that the name is the same as RNAME.
  8. MPOS: a number indicating the left most mapping position of the mate.
  9. TLEN: a number set to 0 for single-segment. For paired-segments it is calculated as the number of bases from the left most mapped base to the right most mapped base.
  10. SEQ: the read sequence as the reference. If not  ‘*’ then the length of the sequence must be equal the sum of lengths of M/I/S/=/X operations in the CIGAR string.
  11. QUAL: the read quality, same as the quality string in a fastaq format.
Operator Description
D Deletion; the nucleotide is present in the reference but not in the read
H Hard Clipping; the clipped nucleotides are not present in the read.
I Insertion; the nucleotide is present in the read  but not in the rference.
M Match; can be either an alignment match or mismatch. The nucleotide is present in the reference.
N Skipped region; a region of nucleotides is not present in the read
P Padding; padded area in the read and not in the reference
S Soft Clipping;  the clipped nucleotides are present in the read
X Read Mismatch; the nucleotide is present in the reference
= Read Match; the nucleotide is present in the reference

The following example shows the mapping of different reads to a reference sequence and their corresponding SAM file [2].  SAM_parser_figure1

Different tools such as SAMtools [2] and htsjdk are available to post-process the SAM files for analysis of the alignment information.

An intuitive usage of the SAM file is to view the overall coverage of the reference sequence by the reads. This information can be retrieved from the mapping position and CIGAR value fields which are the 4th and 6th fields, respectively.  Computationally, reference coverage is obtained by counting the number of times (frequency) of each nucleotide base in the reference sequence that appears in the mapped reads.  The frequency is determined using one of three cases according to the operators in the CIGAR string:

  • Case1 ( M/X/=) :
    • start at the specified mapping position, set counter to 1
    • Add 1 to both the counts of the bases from that position and the counter.
    • Move to the next position.
    • Repeat this process till counter is the same as the number associated with the operator.
  • Case2 (N/D):
    • Move the specified mapping position by the number associated with the operator.
  • Case3 (I/S/H/P):
    • Do nothing

The following Java codes illustrate how to obtain the coverage of a single reference sequence obtained from a SAM file.

SAM_parser_fig2

The highest frequency in the reference sequence indicates the depth of the reads mapped to the reference. In this case, it is 3.

References:

[1]: The Sequence Alignment/Map format and SAMtools.   BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078–2079 doi:10.1093/bioinformatics/btp352

[2]: Sequence Alignment/Map Format Specification.pdf

 

 

 

 

  • Richard Orton

    In your first table – are the insertion and deletion descriptions the wrong way round? Insertion; the nucleotide is present in the reference but not in the read. Shouldn’t this be present in the read but not the reference – and vice versa for the deletion?