Java CIGAR Parser for SAM format
- Post by: Maha Maabar
- November 16, 2015
- 1 Comment
Sequence Alignment/Map (SAM) format is a well-known bioinformatics format designed to store information about reads mapping against large reference sequence. The SAM file is split into two sections: a header section and an alignment section. The header section starts with ‘@’ and it contains information such as the name and length of the reference sequence. The alignment section follows the header section. Each line in this section represents an alignment of a read to the reference sequence that may include insertion, deletion, skips and padding. However, not all lines are successful alignment. A single alignment line consists of 11 mandatory fields and a variable number of optional fields. The mandatory fields are ordered and, if information is unavailable, their values can be zero, for a number field, or * for a string field. The reader is referred to the SAM Format Specification [1,2] for a detailed description of each field in the alignment section but in summary, the mandatory fields are:
- QNAME: the name of the read.
- FLAG: a bitwise number describing information about the read such as mapped/unmapped read, pairing, etc.
- RNAME: the reference sequence name. The value of this field is * for unmapped reads.
- POS: the left most mapping position of the first matching base. It is set to zero for unmapped reads.
- MAPQ: a number to indicate the mapping quality.
- CIGAR: a string describing how the read aligns with the reference. It consists of one or more components. Each component comprises an operator and the number of bases which the operator applies to. Operators are: MIDNSHP=X. The following table gives an overview of these operators
- MNAME: the name of the mate strand, if it is a paired read. ‘=’ means that the name is the same as RNAME.
- MPOS: a number indicating the left most mapping position of the mate.
- TLEN: a number set to 0 for single-segment. For paired-segments it is calculated as the number of bases from the left most mapped base to the right most mapped base.
- SEQ: the read sequence as the reference. If not ‘*’ then the length of the sequence must be equal the sum of lengths of M/I/S/=/X operations in the CIGAR string.
- QUAL: the read quality, same as the quality string in a fastaq format.
Operator | Description |
D | Deletion; the nucleotide is present in the reference but not in the read |
H | Hard Clipping; the clipped nucleotides are not present in the read. |
I | Insertion; the nucleotide is present in the read but not in the rference. |
M | Match; can be either an alignment match or mismatch. The nucleotide is present in the reference. |
N | Skipped region; a region of nucleotides is not present in the read |
P | Padding; padded area in the read and not in the reference |
S | Soft Clipping; the clipped nucleotides are present in the read |
X | Read Mismatch; the nucleotide is present in the reference |
= | Read Match; the nucleotide is present in the reference |
The following example shows the mapping of different reads to a reference sequence and their corresponding SAM file [2].
Different tools such as SAMtools [2] and htsjdk are available to post-process the SAM files for analysis of the alignment information.
An intuitive usage of the SAM file is to view the overall coverage of the reference sequence by the reads. This information can be retrieved from the mapping position and CIGAR value fields which are the 4th and 6th fields, respectively. Computationally, reference coverage is obtained by counting the number of times (frequency) of each nucleotide base in the reference sequence that appears in the mapped reads. The frequency is determined using one of three cases according to the operators in the CIGAR string:
- Case1 ( M/X/=) :
- start at the specified mapping position, set counter to 1
- Add 1 to both the counts of the bases from that position and the counter.
- Move to the next position.
- Repeat this process till counter is the same as the number associated with the operator.
- Case2 (N/D):
- Move the specified mapping position by the number associated with the operator.
- Case3 (I/S/H/P):
- Do nothing
The following Java codes illustrate how to obtain the coverage of a single reference sequence obtained from a SAM file.
public void main (String referenceFile){ //Read the reference file in a string called refSeq //create an array with the same size as the reference to hold the frequency of each base. Set its values to zero int [] referenceFreq = new int [refSeq.length()]; int currentPosition =0; for (int i=0;i < referenceFreq.length; i++) referenceFreq [i]= 0; //populate the reference frequencies from the SAM file setFrequencies(samFile,referenceFreq); //print out the frequencies for (int i=0;i < refSeq.length(); i++) System.out.print(refSeq.charAt(i)); System.out.println(); for (int i=0;i < referenceFreq.length; i++) System.out.print(referenceFreq[i]); System.out.println(); } /*A function to populate the frequency array according to information in SAM file */ public void setFrequencies(String samFile, int [] refFreq){ try (BufferedReader bf = new BufferedReader( new FileReader(samFile))) { String line; while((line = bf.readLine()) !=null){ //Ignore header sections in sam file if (line.charAt(0)!='@') { String [] fields = line.split("\t"); //fields are tab-delimited //mapping position is the 4th column in sam file int readPos = Integer.parseInt(words[3]); if (readPos>0){ //only mapped reads are considered //CIGAR string is the 5th column in sam file String cigarVal = fields[5]; ArrayList<String> cigarValues= splitCIGAR(cigarVal); // because java array index starts at 0 currentPosition=readPos-1; // decodeCIGAR(cigarValues, refFreq); } } } bf.close(); } catch (IOException ex) { System.out.println("Error reading the file: "+samFile+"."); ex.printStackTrace(); } } /*A function to interpret the CIGAR string as frequencies*/ private void decodeCIGAR(ArrayList<String> cigarValues, int [] refFreq){ for(int i=0;i<cigarValues.size();i++){ String cVal = cigarValues.get(i); int cLen = Integer.parseInt(cVal.substring(0,cVal.length()-1)); char cLetter = cVal.toUpperCase().charAt(cVal.length()-1); switch (cLetter){ case 'M': case 'X': case '=': int start = currentPosition; int end=(start+cLen)-1; for (int pos=start; pos<=end;pos++) { refFreq [pos]++; currentPosition++; } break; case 'I': case 'S': case 'H': case 'P': break; case 'N': case 'D': setCurrentPosition +=cLen; break; } } } /*A function to split the CIGAR value into a list of CIGAR elements*/ private ArrayList<String> splitCIGAR(String cigarString) { //One cigar component is a number of any digit, followed by a letter or = Pattern cigarPattern = Pattern.compile(“[\\d]+[a-zA-Z|=]”); ArrayList<String> cigarElems = new ArrayList<String>(); Matcher matcher = cigarPattern.matcher(cigarString); while (matcher.find()) { cigarElems.add( matcher.group() ); } return cigarElems; } Running the codes on the previous example, gives the following results:
The highest frequency in the reference sequence indicates the depth of the reads mapped to the reference. In this case, it is 3.
References:
[1]: The Sequence Alignment/Map format and SAMtools. BIOINFORMATICS APPLICATIONS NOTE Vol. 25 no. 16 2009, pages 2078–2079 doi:10.1093/bioinformatics/btp352
[2]: Sequence Alignment/Map Format Specification.pdf
In your first table – are the insertion and deletion descriptions the wrong way round? Insertion; the nucleotide is present in the reference but not in the read. Shouldn’t this be present in the read but not the reference – and vice versa for the deletion?