weeSAM version 1.5.
- Post by: Zack Boyd
- July 17, 2018
What is weeSAM?
weeSAM is a Python script which produces coverage statistics and coverage plots from an input SAM or BAM file. Figures and stats are written to HTML so users can easily view the coverage for their reference assembly.
weeSAM is simple to run; the steps below walk through an example.
What’s new in version 1.5?
weeSAMv1.5 (https://github.com/centre-for-virus-research/weeSAM) is a Python rewrite of Joseph Hughes’ weeSAMv1.4, which was written in Perl and R.
If you’re familiar with weeSAM, all of the functionality of version 1.4 is still available in 1.5, so there is no need to worry. The major changes and additions in version 1.5 are as follows:
- The R / PDF functionality has been replaced with Matplotlib / HTML.
With .pdf files replaced by .html files, users can now view the coverage statistics of their data in a browser.
The weeSAM command line options are:
Usage: weeSAM { --sam [txt] OR --bam [txt] } --cutoff [int] --out [txt] --html [txt] -v -h --overwrite

Flag descriptions:
--sam : An input .sam file.
--bam : An input .bam file.
--cutoff : Cut-off value for number of mapped reads.
--out : Output file name.
--html : HTML file name.
--overwrite : Add this flag if you want to remove the html directory from a previous run.
-v : Version number.
-h : Help

Here’s an example of weeSAM using a BAM file generated from a de novo assembly of HCMV data using SPAdes:
weeSAM --bam SpadesContigs_aligned.bam --html Spades_HCMV.html
When this command is run, a new directory called Spades_HCMV_html_results is produced.
To view your results, double-click the .html file in that directory and it will open in your default browser.
All of the fields in the results table are described at the bottom of this post. The most important field is “Ref_Name”: it contains the name of each sequence in the BAM file and is a clickable link to the coverage plot of that sequence.
Figure 1.3 shows the coverage plot for NODE_3. The coverage along the genome is shown in blue, the average coverage as a dotted green line, the (average coverage)*0.2 as a dotted orange line and (average coverage)*1.8 as a dotted red line.
The table below the figure shows the same information as in the main table. To view a different sequence, go back in your browser and click another link.
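The three reference lines in the coverage plot are all simple multiples of the average depth. Here is a minimal sketch of how a plot like this could be drawn with Matplotlib. It is not weeSAM's own code: the function names `reference_levels` and `plot_coverage`, the axis labels and the figure size are made up for illustration.

```python
def reference_levels(depths):
    """Return (average, 0.2 * average, 1.8 * average) for per-site depths."""
    average = sum(depths) / len(depths)
    return average, 0.2 * average, 1.8 * average


def plot_coverage(depths, title, png_path):
    """Draw depth along the sequence plus the three dotted reference lines."""
    # Imported here so reference_levels() can be used without Matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # write to a file; no display needed
    import matplotlib.pyplot as plt

    average, low, high = reference_levels(depths)
    positions = range(1, len(depths) + 1)  # 1-based positions (assumption)

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(positions, depths, color="blue", label="Depth")
    ax.axhline(average, color="green", linestyle=":", label="Average depth")
    ax.axhline(low, color="orange", linestyle=":", label="Average depth * 0.2")
    ax.axhline(high, color="red", linestyle=":", label="Average depth * 1.8")
    ax.set_xlabel("Position")
    ax.set_ylabel("Depth")
    ax.set_title(title)
    ax.legend()
    fig.savefig(png_path)
    plt.close(fig)
```

Calling `plot_coverage(depths, "NODE_3", "NODE_3.png")` with a list of per-site depths would write one image per sequence; the output file name here is only an example.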
If you’re not interested in the HTML, you can produce just a tab-delimited text file containing exactly the same information as seen in figure 1.2, using this command:
weeSAM --bam SpadesContigs_aligned.bam --out Spades_HCMV.txt
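A tab-delimited file like this is easy to process further, for example with Python's csv module. The sketch below assumes that the file starts with a header row using the field names described in this post, so check your own output before relying on that. The function name `read_weesam_table` and the example rows are made up for illustration.

```python
import csv
import io


def read_weesam_table(handle):
    """Read a tab-delimited table into a list of dicts keyed by column name.

    Assumes (not verified against weeSAM itself) that the first line
    is a header row.
    """
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical two-column excerpt standing in for a real output file.
example = io.StringIO("Ref_Name\tAvg_Depth\nNODE_3\t152.4\nNODE_7\t12.0\n")
rows = read_weesam_table(example)

# Values are read as strings, so convert them before comparing.
well_covered = [r["Ref_Name"] for r in rows if float(r["Avg_Depth"]) > 100]
```

With a real file you would pass an open file handle instead, e.g. `with open("Spades_HCMV.txt", newline="") as handle: rows = read_weesam_table(handle)`.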
Explanation of the statistics produced by weeSAM:
- Ref_Name: Name of the reference sequence in the SAM/BAM file.
- Ref_Len: Length of the reference sequence (in bases).
- Mapped_Reads: The number of reads mapped to the sequence.
- Breadth: The number of sites on the sequence covered by reads.
- %_Covered: The percentage of sites on the sequence which have coverage.
- Min_Depth: The minimum read depth observed.
- Max_Depth: The maximum read depth observed.
- Avg_Depth: The average read depth.
- Std_Dev: The standard deviation of the read depth around the mean (Avg_Depth).
- Above_0.2_Depth: The percentage of sites which have a coverage value greater than the average depth multiplied by 0.2.
- Above_1_Depth: The percentage of sites which have a coverage value greater than Avg_Depth.
- Above_1.8_Depth: The percentage of sites which have a coverage value greater than the average depth multiplied by 1.8.
- Variation_Coefficient: A measure of variability (Std_Dev/Avg_Depth)
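To make these definitions concrete, here is a minimal sketch that computes most of them from a list of per-site read depths for one sequence. It is not weeSAM's implementation. It assumes the list has one entry per site, including sites with zero depth (for example, as reported by `samtools depth -a`), that "above" means strictly greater than, and that Std_Dev is a population standard deviation. The function name `coverage_stats` and the example depths are made up.

```python
from statistics import mean, pstdev


def coverage_stats(depths):
    """Compute summary statistics from a list of per-site read depths."""
    n_sites = len(depths)
    avg = mean(depths)
    std = pstdev(depths)  # assumption: population standard deviation
    breadth = sum(1 for d in depths if d > 0)

    def pct_above(threshold):
        # Percentage of sites with depth strictly greater than the threshold.
        return 100 * sum(1 for d in depths if d > threshold) / n_sites

    return {
        "Ref_Len": n_sites,
        "Breadth": breadth,
        "%_Covered": 100 * breadth / n_sites,
        "Min_Depth": min(depths),
        "Max_Depth": max(depths),
        "Avg_Depth": avg,
        "Std_Dev": std,
        "Above_0.2_Depth": pct_above(0.2 * avg),
        "Above_1_Depth": pct_above(avg),
        "Above_1.8_Depth": pct_above(1.8 * avg),
        # Undefined when the average is 0 (no coverage at all).
        "Variation_Coefficient": std / avg if avg else float("nan"),
    }


# Made-up depths for a five-base sequence with one uncovered site.
stats = coverage_stats([0, 10, 10, 10, 20])
```

In this toy example four of the five sites are covered, so Breadth is 4 and %_Covered is 80. Mapped_Reads is left out because it is a count of reads and cannot be derived from per-site depths.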
The last four values (Above_0.2_Depth, Above_1_Depth, Above_1.8_Depth and Variation_Coefficient) give an indication of the variability in your coverage. A value below 100 for Above_0.2_Depth implies that you have a number of sites with very low coverage. A large value for Above_1.8_Depth suggests that you have some peaks with very high depth. Low-coverage sites (a low Above_0.2_Depth) are a bigger problem than high-depth peaks (a large Above_1.8_Depth).
The coefficient of variation is also used to look at variability in coverage. A coefficient of variation < 1 suggests that the coverage has low variance, which is good, while a coefficient > 1 would be considered high variance.
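These rules of thumb can be turned into a small check. The sketch below uses only the two thresholds stated above (100 for Above_0.2_Depth and 1 for the coefficient of variation). The function name `coverage_notes` and the wording of the notes are made up, and no cut-off is applied to Above_1.8_Depth because none is given.

```python
def coverage_notes(above_02_depth, variation_coefficient):
    """Return plain-text notes for one sequence, based on two statistics."""
    notes = []
    if above_02_depth < 100:
        notes.append("some sites have very low coverage")
    if variation_coefficient > 1:
        notes.append("high-variance coverage")
    elif variation_coefficient < 1:
        notes.append("low-variance coverage")
    # A coefficient of exactly 1 falls between the two cases and is left unlabelled.
    return notes


# Made-up values for two sequences.
first = coverage_notes(80.0, 0.63)
second = coverage_notes(100.0, 1.5)
```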