SAMtools: Primer / Tutorial
by Ethan Cerami, Ph.D.
keywords: samtools, next-gen, next-generation, sequencing, bowtie, sam, bam, primer, tutorial, how-to, introduction
- 1.0: May 30, 2013: First public release on biobits.org.
- 1.1: July 24, 2013: Updated with Disqus Comments / Feedback section.
- 1.2: December 19, 2014: Multiple updates, including:
- Updated to use samtools 1.1 and bcftools 1.2.
- Updated usage for bcftools.
SAMtools is a popular open-source tool used in next-generation sequence analysis. This primer provides an introduction to SAMtools, and is geared towards those new to next-generation sequence analysis. The primer is also designed to be self-contained and hands-on, meaning that you only need to install SAMtools, and no other tools, and sample data sets are provided. Terms in bold are also explained in the glossary at the end of the document.
If you find an error in this document, or would like to send specific feedback, please send email to: cerami AT gmail.com.
Sample Data Files
You can access all sample data files for this primer from the companion github repository.
From the repository, you can:
- browse and download individual data files.
- download a complete zip file containing everything.
- clone the entire repository. If you have never used Github before, there is a comprehensive tutorial to get you started.
Next-generation sequencing refers to new, high-throughput technologies for sequencing DNA or RNA. A typical work-flow for next-generation sequence analysis is usually composed of the following steps:
SAMtools fits in at steps 4 and 5. Most importantly, it can process aligned sequence reads, and manipulate them with ease. For example, it can convert between the two most common file formats (SAM and BAM), sort and index files (for speedy retrieval later), and extract specific genomic regions of interest. It also enables quality checking of reads, and automatic identification of genomic variants.
Installing SAMtools, bcftools
To get started, download the SAMtools source code from: www.htslib.org. Then, unzip and unpack the distribution to your preferred location –– for example, I have placed SAMtools at:
SAMtools is written in C, compiled with gcc and make, and has only two dependencies: the GNU curses library, and the ZLib compression library. If you are using a modern variant of Linux or MacOS X, you probably already have these libraries installed.
To build SAMtools, type:
Next, download the bcftools source code, also from: www.htslib.org. As with SAMtools, unzip and unpack to your preferred location – for example, I have placed bcftools at:
To build bcftools, type:
Then, add the samtools and bcftools executables to your path. For example, I have modified my
.bash_profile like so:
export PATH=/Users/ecerami/libraries/samtools-1.1:$PATH export PATH=/Users/ecerami/libraries/bcftools-1.1:$PATH
As an optional, but recommended step, copy the man page for
samtools.1 to one of your man page directories .
To illustrate the use of SAMtools, we will focus on using SAMtools within a complete workflow for next-generation sequence analysis. For simplicity, the tutorial uses a small set of simulated reads from E. coli. I have chosen E. coli because its genome is quite small — 4,649 kilobases, with 4,405 genes, and its entire genome fits into a single FASTA file of 4.8 megabytes. The sample read file is also small — just 790K — enabling you to download both within a few minutes.
If you are new to next-generation sequence analysis, you will soon find that one of the biggest obstacles is just finding and downloading sample data sets, and then downloading the very large reference genomes. By focusing on E. coli and simulated data sets, you can start small, learn the tool sets, and then advance to other organisms and larger sample data sets.
The workflow below is organized into six steps (see below):
Steps 3-6 are focused on the use of SAMtools. Steps 1-2 require the use of other tools. These initial steps do, however provide important context, and are therefore described below. That said, you are not required to actually perform any of steps yourself now, as intermediate files from these steps are available for download.
Generating Simulated Reads
First, we need a small set of sample read data. A number of tools, including ArtificialFastqGenerator, and SimSeq, will generate artificial or simulated sequence data for you. For this tutorial, I chose to use the wgsim tool (created by Heng Li, also the creator of SAMtools).
The command line usage for wgsim is:
wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>
By default, wgsim reads in a reference genome in the FASTA format, and generates simulated paired-end reads in the FASTQ format.
If you specify the full E. coli genome, wgsim will generate simulated reads across the entire genome. However, for the tutorial, I chose to restrict the simulated reads to just the first 1,000 bases of the E. coli genome. To do so, I extracted the first 1K bases of the E. coli genome, placed these within a new file: NC_008253__1K.fna, and ran:
wgsim -N1000 -S1 genomes/NC_008253_1K.fna simulated_reads/sim_reads.fq /dev/null
Command line options are described below:
-N1000: directs wgsim to generate 1,000 simulated reads (default is set to: 1,000,000).
-S1: specifies a seed for the wgsim random number generator; by specifying a seed value, one can reproducibly create the same set simulated reads.
/dev/null: wgswim requires that you specify two output files (one for each set of paired-end reads). However, if you set the second output to
/dev/null, all data for the second set of reads will be discarded, and you can effectively generate single-end read data only.
wgsim will then output:
[wgsim] seed = 1 [wgsim_core] calculating the total length of the reference sequence... [wgsim_core] 1 sequences, total length: 1000 gi|110640213|ref|NC_008253.1| 736 T G -
This indicates that wgsim has read in a reference sequence of 1K and has generated simulated reads to support exactly one artificial SNP, a T->G at position 736. By default, all artificial reads have a read length of 70, and the reads are written in the FASTQ format. For illustrative purposes, the first read is shown below:
@gi|110640213|ref|NC_008253.1|_418_952_1:0:0_1:0:0_0/1 CCAGGCAGTGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCATCTGGTAGCGATGAT + 2222222222222222222222222222222222222222222222222222222222222222222222
Note that in the FASTQ format, the first line specifies a unique sequence identifier (usually referred to as the QName), the second line specifies the sequence, and the fourth line specifies the Phred quality score for each base. wgsim does not generate artificial quality scores, and all bases are simply set to 2, indicating that the bases have a 0.01995 probability of being called incorrectly (for additional details, refer to Phred quality scores in the glossary.)
The next step is to align the artificial reads to the reference genome for E. coli. For this, I have chosen to use the widely used Bowtie2 aligner, from Johns Hopkins University. Again, you need not actually perform this step to use SAMtools, but it does provide important context, so I have included the details.
To align, use the command:
bowtie2 -x indexes/e_coli -U simulated_reads/sim_reads.fq -S alignments/sim_reads_aligned.sam
This directs bowtie to align the simulated reads against the e_coli reference genome, and to output the alignments in the SAM file format. Details regarding each command line option is provided below:
-x: the location of the reference genome index files. Bowtie requires that reference genome indexes be downloaded or built, prior to alignment. I have provided a pre-built index for e_coli on the github repository for this primer.
-U: list of unpaired reads to align.
-S: output alignments in the SAM format to the specified file.
For additional details on using bowtie2, refer to the online manual.
Understanding the SAM Format
As SAMtools is primarily concerned with manipulating SAM files, it is useful to take a moment to examine the sample SAM file generated by bowtie, and to dive into the details of the SAM file format itself. The first six lines from the bowtie SAM file are extracted below:
SAM files consist of two types of lines: headers and alignments. Headers begin with @, and provide meta-data regarding the entire alignment file. Alignments begin with any character except @, and describe a single alignment of a sequence read against the reference genome.
For now, we ignore the header-lines, auto-generated by bowtie, and focus instead on the alignments. Two factors are most important to note. First, each read in a FASTQ file may align to multiple regions within a reference genome, and an individual read can therefore result in multiple alignments. In the SAM format, each of these alignments is reported on a separate line. Second, each alignment has 11 mandatory fields, followed by a variable number of optional fields. Each of the fields is described in the table below:
|Field Name||Description||Example from the e_coli SAM file|
|QNAME||Unique identifier of the read; derived from the original FASTQ file.||gi|110640213|ref |NC_008253.1 |_418_952_1:0:0_1:0:0_0/1|
|FLAG||A single integer value (e.g. 16), which encodes multiple elements of meta-data regarding a read and its alignment. Elements include: whether the read is one part of a paired-end read, whether the read aligns to the genome, and whether the read aligns to the forward or reverse strand of the genome. A useful online utility decodes a single SAM flag value into plain English.||16: indicates that the read maps to the reverse strand of the genome.|
|RNAME||Reference genome identifier. For organisms with multiple chromosomes, the RNAME is usually the chromosome number; for example, in human, an RNAME of "chr3" indicates that the read aligns to chromosome 3. For organisms with a single chromosome, the RNAME refers to the unique identifier associated with the full genome; for example, in E. coli, the RNAME is set to: gi|110640213|ref|NC_008253.1|.||gi|110640213|ref|NC_008253.1||
|POS||Left-most position within the reference genome where the alignment occurs.||418|
|MAPQ||Quality of the genome mapping. The MAPQ field uses a Phred-scaled probability value to indicate the probability that the mapping to the genome is incorrect. Higher values indicate increased confidence in the mapping.||42: indicates low probability (6.31e-05) that the mapping is incorrect.|
|CIGAR||A compact string that (partially) summarizes the alignment of the raw sequence read to the reference genome. Three core abbreviations are used: M for alignment match; I for insertion; and D for Deletion. For example, a CIGAR string of 5M2I63M indicates that the first 5 base pairs of the read align to the reference, followed by 2 base pairs, which are unique to the read, and not in the reference genome, followed by an additional 63 base pairs of alignment. Note that the CIGAR string is a partial representation of the alignment because M indicates an alignment match, but this could indicate an exact sequence match or a mismatch . If you would like to determine match v. mismatch, you can consult the optional MD field, detailed below.||70M: 70 base pairs within the read match the reference genome.|
|RNEXT||Reference genome identifier where the mate pair aligns. Only applicable when processing paired-end sequencing data. For example, an alignment with RNAME of "chr3" and RNEXT of "chr4" indicates that the mate and its pair span two chromosomes, indicating a possible structural rearrangement. A value of * indicates that information is not available.||the E. coli simulated data is single read sequencing data, and all RNEXT values are therefore set to * (no information available).|
|PNEXT||Position with the reference genome, where the second mate pair aligns. As with RNEXT, this field is only applicable when processing paired-end sequencing data. A value of 0 indicates that information is not available.||the E. coli simulated data is single read sequencing data, and all RNEXT values are therefore set to 0 (no information available).|
|TLEN||Template Length. Only applicable for paired-end sequencing data, TLEN is the size of the original DNA or RNA fragment, determined by examining both of the paired-mates, and counting bases from the left-most aligned base to the right-most aligned base. A value of 0 indicates that TLEN information is not available.||the E. coli simulated data is single read sequencing data, and all RNEXT values are therefore set to 0 (no information available).|
|SEQ||the raw sequence, as originally defined in the FASTQ file.||CCAGGCAGTGGCAGGT...|
|QUAL||The Phred quality score for each base, as originally defined in the FASTQ file.||222222222222222...: as noted above, wgsim does not generate artificial quality scores, and all bases are simply set to 2, indicating that the bases have a 0.01995 probability of being called incorrectly.</table> Following the 11 mandatory fields, each alignment can also have a variable number of optional fields. For example, the bowtie generated SAM file for *E. coli* includes 8 additional fields for each alignment. Most of these are specific to bowtie, and are beyond the scope of this document . However, two fields are part of the official SAM format, and are worth noting. These are detailed in the table below. Note that optional SAM fields always use the format: FIELD_NAME:DATA_TYPE:VALUE, where the DATA_TYPE of `i` indicates integer values and `Z` indicates string values .|
|Field Name||Description||Examples from the e_coli SAM file|
|NM||The number of base pair changes required to change the aligned sequence to the reference sequence, known formally as the Edit Distance.||
NM:i:0 indicates a perfect match between the aligned sequence and the reference sequence.
NM:i:1 indicates that the aligned sequence contain one mismatch.
|MD||A string which summarizes the mismatch positions between the aligned read and the reference genome.||MD:Z:8G61 indicates a single base pair mismatch. Specifically, the aligned read matches the first 8 bases of the reference, after which it fails to match a G in the reference sequence, followed by 61 exact matches to the reference.</table> ### Converting SAM to BAM Now that you understand where alignments come from, know how to interpret SAM fields, and have a sample [SAM file to play with](tutorial/alignments/sim_reads_aligned.sam), you are finally read to tackle SAMtools. As described above, our goal is to identify the set of genomic variants within the *E. coli* data set. To do so, our first step is to convert the SAM file to BAM. This is an important prerequisite, as all the downstream steps, including the identification of genomic variants and visualization of reads, require BAM as input. To convert from SAM to BAM, use the SAMtools `view` command: samtools view -b -S -o alignments/sim_reads_aligned.bam alignments/sim_reads_aligned.sam * `-b`: indicates that the output is BAM. * `-S`: indicates that the input is SAM. * `-o`: specifies the name of the output file. BAM files are stored in a compressed, binary format, and cannot be viewed directly. However, you can use the same `view` command to display all alignments. For example, running: samtools view alignments/sim_reads_aligned.bam | more will display all your reads in the unix `more` paginated style. You can also use `view` to only display reads which match your specific filtering criteria. For example: samtools view -f 4 alignments/sim_reads_aligned.bam | more * `-f INT`: extracts only those reads which match the specified SAM flag. In this case, we filter for only those reads with flag value of 4 = read fails to map to the reference genome. or: samtools view -F 4 alignments/sim_reads_aligned.bam | more * `-F INT`: removes reads which match the specified SAM flag. You can also try out the `-c` option, which does not output the reads, but rather outputs the number of reads which match your criteria. For example: samtools view -c -f 4 alignments/sim_reads_aligned.bam indicates that 34 of our artificial reads failed to align to the reference genome. Finally, you can use the `-q` parameter to indicate a minimal quality mapping filter. For example: samtools view -q 42 -c alignments/sim_reads_aligned.bam outputs the total number of aligned reads (819) that have a mapping quality score of 42 or higher. ### Sorting and Indexing The next step is to sort and index the BAM file. There are two options for sorting BAM files: by read name (`-n`), and by genomic location (default). As our goal is to call genomic variants, and this requires that we "pile-up" all matching reads within a specific genomic location, we sort by location: samtools sort alignments/sim_reads_aligned.bam alignments/sim_reads_aligned.sorted This reads the BAM file from `alignments/sim_reads_aligned.bam` and writes the sorted file to: `alignments/sim_reads_aligned.sorted.bam`. Once you have sorted your BAM file, you can then index it. This enables tools, including SAMtools itself, and other genomic viewers to perform efficient random access on the BAM file, resulting in greatly improved performance. To do so, run: samtools index alignments/sim_reads_aligned.sorted.bam This reads in the sorted BAM file, and creates a BAM index file with the `.bai` extension: `sim_reads_aligned.sorted.bam.bai`. Note that if you attempt to index a BAM file which has not been sorted, you will receive an error and indexing will fail.|
If you are already conversant with command line tools and possess basic programming skills,
you will quickly realize that there is nothing magical about next-generation sequence analysis.
You just need to learn the terms, the file formats, and the tools, and tie them all together.
However, there is one extra skill you will need to cultivate: patience. Once you move beyond
toy examples, and enter the world of real sequence data, all the concepts remain the same, but
all the files become much bigger, and everything takes longer -- usually a lot longer. For
example, sorting and indexing a single human exome BAM file can easily take 2-4 hours,
depending on read depth and your computer set-up. Worse, you may encounter an error after
two hours of processing, then make an adjustment, and wait another two hours to see if the
problem is fixed. It's a bit like rolling giant boulders up a mountain, having the
occasional one slip through, and starting all over from the beginning.
Many bioinformaticians hedge their patience by running next-generation sequencing pipelines on large distributed compute clusters, so that they can at least move dozens of boulders at once -- however, this is a topic for another primer. Until then, my best advice is to get the concepts right and start with small files. Then, pack your patience and work your way up to larger data sets.
|Genotype||Phred Scaled Score||Likelihood p-value|
|Within tview you can press g to go to a specific genomic location. However, if you type g and enter a position, such as 736, nothing will happen. This is because the location must be specified in the form: RNAME:position, e.g. for human data, you would type: chr3:200000. In our case, the RNAME is set to the identifier associated with the full genome, which is: gi|110640213|ref|NC_008253.1|. Therefore, to jump to position 736, you would have to enter the rather long string: gi|110640213|ref|NC_008253.1|:736.|