LII:Next Generation Sequencing (NGS)/Alignment
This page in a nutshell: Given sequencing data (reads) and the reference sequence for the species, comparing the reads to the reference is an easy way to detect small variations in the sequenced sample, such as SNPs and short InDels. |
Introduction
Alignment, also called mapping,[1] of reads is an essential step in re-sequencing. Having sequenced an organism of a species before, and having constructed a reference sequence, re-sequencing more organisms of the same species allows us to see the genetic differences to the reference sequence, and, by extension, to each other. Alignments of data from these re-sequenced organisms is a relatively simple method of detecting variation in samples. There are certain instances (such as new genes in the sequenced sample that are not found in the existing reference sequence) that can not be detected by alignment alone; however, while other approaches, such as de novo assembly, are potentially more powerful, they are also much harder or, for some organisms, impossible to achieve with current sequencing methods.
Next-generation sequencing generally produces short reads or short read pairs, meaning short sequences of <~200 bases (as compared to long reads by Sanger sequencing, which cover ~1000 bases). To compare the DNA of the sequenced sample to its reference sequence, we need to find the corresponding part of that sequence for each read in our sequencing data. This is called aligning or mapping the reads against the reference sequence. Once this is done, we can look for variation (e.g. SNPs) within the sample. This poses a number of problems:
- The short reads do not come with position information, that is, we do not know what part of the genome they came from; we need to use the sequence of the read itself to find the corresponding region in the reference sequence.
- The reference sequence can be quite long (~3 billion bases for human), making it a daunting task to find a matching region.
- Since our reads are short, there may be several, equally likely places in the reference sequence from which they could have been read. This is especially true for repetitive regions.
- If we were only looking for perfect matches to the reference, we would never see any variation. Therefore, we need to allow some mismatches and small structural variation (InDels) in our reads.
- Any sequencing technology produces errors. Similar to the "real" variation, we need to tolerate a low level of sequencing errors in our reads, and separate them from the "real" variation later.
- We need to do that for each of the millions of reads in our sequencing data.
Short reads
Raw short reads often come in (or can be converted into) a file format called FASTQ.[2] It is a plain text format, containing the sequence and quality scores for every read, where each single read normally occupies four consecutive lines:
@Read_id_1 CTGATGTGCCGCCTCACTTCGGTGGT + @@@DDDDDH8<BAHG@BHGIHIII>( @Read_id_2 TGATGTGCCGCCTCACTACGGTGGTG + FHHHHHJIJIJIJIIIJJIIJGIGII @Read_id_3 ...
The four lines are:
- The name/ID of the read, preceded by a "@". For read pairs, there will be two entries with that name, either in the same or a second FASTQ file.
- The sequence of the read.
- A "+" sign. In very old FASTQ files, this is followed by the read name from the first line. Today, this line is present for
historical reasonsbackwards compatibility only. - The quality scores of the bases from line 2. The scores are generated by the sequencing machine, and encoded as ASCII (33+score) characters. The line should have the same length as line 2, as there is one quality score per base.
Alignment
For each of the short reads in the FASTQ file, a corresponding location in the reference sequence (or that no such region exists) needs to be determined. This is achieved by comparing the sequence of the read to that of the reference sequence. A mapping algorithm will try to locate a (hopefully unique) location in the reference sequence that matches the read, while tolerating a certain amount of mismatch to allow subsequence variation detection. Reads aligned (mapped) to a reference sequence will look like this:
GCTGATGTGCCGCCTCACTTCGGTGGTGAGGTG Reference sequence CTGATGTGCCGCCTCACTTCGGTGGT Short read 1 TGATGTGCCGCCTCACTACGGTGGTG Short read 2 GATGTGCCGCCTCACTTCGGTGGTGA Short read 3 GCTGATGTGCCGCCTCACTACGGTG Short read 4 GCTGATGTGCCGCCTCACTACGGTG Short read 5
You can see the reference sequence on the top row, and five short reads stacked below; this is called a pileup. While two of the reads are a perfect match to the reference, the three other reads show a mismatch each, highlighted in red ("A" in the read, instead of "T" in the reference). Since there are multiple reads showing the mismatch, at the same position, with the same difference, one could conclude that it is an actual genetic difference (point mutation or SNP), rather than a sequencing error or mismapping.
Mapping algorithms
There are several alignment algorithms in existence; you can find an (incomplete) list further down in software packages. Some notes on mapping algorithms:
- The reference sequence, the short reads, or both, are often pre-processed into an indexed form for rapid searching. (See BWT)
Sources of errors
There are several potential sources for errors in an alignment, including (but not limited to):
- Polymerase Chain Reaction artifacts (PCR artifacts). Many NGS methods involve one or multiple PCR steps. PCR errors will show as mismatches in the alignment, and especially errors in early PCR rounds will show up in multiple reads, falsely suggesting genetic variation in the sample. A related error would be PCR duplicates, where the same read pair occurs multiple times, skewing coverage calculations in the alignment.
- Sequencing errors. The sequencing machine can make an erroneous call, either for physical reasons (e.g. oil on an Illumina slide), or due to properties of the sequenced DNA (e.g., homopolymers). As sequencing errors are often random, they can be filtered out as singleton reads during variant calling.
- Mapping errors. The mapping algorithm can map a read to the wrong location in the reference. This often happens around repeats or other low-complexity regions.
Alignment types
Alignments can be used for different purposes:
- Whole-genome sequencing. This would be the "default" use; sequence all DNA from an organism and map it to the appropriate reference sequence, to find genetic variation.
- Exome sequencing. For large genomes (e.g., human), capture just the exomic DNA before sequencing. This will return sequencing data for most of the genes, at a fraction of the cost.
- Transcriptome sequencing (RNA-Seq). Sequencing of the transcriptome, that is, of the RNA present in the sample. This can show which genes are transcribed in the sample, and help fine-tune gene annotation (exon boundaries etc.). Mapping can be done either to the full reference sequence, or to a special "transcriptome reference".
- ChIP-Seq (Protein-DNA interaction).
The SAM/BAM format
The SAM/BAM format has emerged as the de facto standard format for short read alignments. SAM[3] is the plain-text version of the binary, compressed BAM format. They can be converted into one another by the name-giving samtools[4] command-line tool. BAM (without alignment position data) is increasingly used as a space-saving alternative to FASTQ files for containing the short raw read data, and all current alignment software can generate SAM/BAM as an output format. Once in BAM format, the file can be indexed, giving quick access to any region of the reference sequence. Subsequently, using samtools or other software, BAM files can be analysed (e.g. for quality control), modified (removal of PCR duplicates, local realignment, base quality recomputation), or used to call variation, either small (SNPs, short InDels) or large (inversions, tandem duplications, deletions, translocations). BAM files can be visualised using tools like Artemis, ACT, or LookSeq[5]. Last but not least, alignments in BAM format can be used to "morph" the reference sequence to correspond to the short read data with ICORN[6]; this can be useful to get an actual DNA sequence for a sample, or to construct a new reference sequence based on a closely related species.
Other useful SAM/BAM-related software includes:
SAM/BAM tools examples
- Convert SAM to BAM format:
- samtools view -bS aln.sam > aln.bam
- Sort BAM file according to position on reference sequence
- samtools sort aln.bam aln_sorted.bam
- Index BAM file (needed for visualising the alignment)
- samtools index aln_sorted.bam aln_sorted.bai
- Extract the header information from a BAM file
- samtools view –h aln_sorted.bam > aln.sam
- Generate a FASTQ file from a BAM file
- bam2fastq -o aligned.fastq --no-unaligned aln.bam
Software packages
Software | Type | Supported technologies |
Interface | Notes |
---|---|---|---|---|
Partek | Commercial | All | GUI |
|
BWA[9] | Free software | Illumina SOLiD 454 |
Command line |
|
Bowtie[10] | Free software | Illumina SOLiD |
Command line |
|
Stampy[11] | Free software | Illumina | Command line |
|
SHRiMP2 | Free software | Illumina | Command line |
|
TMAP | Free software | IonTorrent | Command line |
|
SNP-o-matic[12] | Free software | Illumina | Command line |
|
CLC workstation | Commercial | All | GUI |
|
NextGenMap[13] | Open source | Illumina, Ion Torrent | Command Line | Fast and accurate. Self adjusting to the underlying data. Robust for high polymorphism
|
Novoalign | Commercial for multi-threaded version. Single threaded version is free | Illumina | Command Line | Fast and accurate. Probably the best aligner as of 2013. |
GSMapper | Commercial | 454 | GUI | / |
SSAHA2 | Free software | 454 | Command line | Fast and accurate for all reads it can map |
BLAT | Free software | 454 | Command line | Not designed for NGS data. |
Mosaik | Free software | All | Command line | Tedious steps. Alignment speed can be slow. Huge memory requirement. |
BWA-SW[14] | Free software | 454, IonTorrent | Command line |
|
BWA-MEM[15] | Free software | 454, IonTorrent | Command line |
|
Bfast[17] | Free software | SOLiD | Command Line | Speed of alignment may be too slow for large NGS data [18] |
Tophat[19] | Free software | Illumina | Command Line | Transcriptome data only |
Splicemap | Free software | Illumina | Command Line | Transcriptome data only |
MapSplice | Free software | Illumina | Command Line | Transcriptome data only |
AbMapper | Free software | Illumina | Command Line | Transcriptome data only |
ERNE-map (rNA)[20] | Free software | Illumina | Command line |
|
mrsFAST-Ultra[22] | Free software | Illumina | Command line / GUI |
|
An exhaustive list of NGS aligner is maintained at HTS mapper
Based on personal experience and prevalence and based on literature data on the performance[23][24][25], a quick primer on when to use which software:
- If only speed matters use bowtie.
- BWA is a bit slower but more sensitive.
- If sensitivity and specificity is needed, try NextGenMap, Stampy, Novoalign, or SHRiMP 2.
Furthermore a framework (Teaser) was proposed that automatically benchmarks several mappers within minutes.[26]
References
- ↑ Flicek, P.; Birney, E. (2009). "Sense from sequence reads: Methods for alignment and assembly". Nature Methods. 6 (11 Suppl): S6–S12. doi:10.1038/nmeth.1376. PMID 19844229.
{{cite journal}}
: CS1 maint: multiple names: authors list (link) - ↑ Cock, P.J.; Fields, C.J.; Goto, N.; Heuer, M.L.; Rice, P.M. (2010). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Nucleic Acids Research. 38 (6): 1767–71. doi:10.1093/nar/gkp1137. PMC 2847217. PMID 20015970.
{{cite journal}}
: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link) - ↑ The SAM/BAM Format Specification Working Group (18 November 2015). "Sequence Alignment/Map Format Specification" (PDF). GitHub. p. 16. Retrieved 28 April 2016.
- ↑ "Samtools organisation and repositories". GitHub. Retrieved 28 April 2016.
- ↑ Manske, H.M.; Kwiatkowski, D.P. (2009). "LookSeq: A browser-based viewer for deep sequencing data". Genome Research. 19 (11): 2125–2132. doi:10.1101/gr.093443.109. PMC 2775587. PMID 19679872.
{{cite journal}}
: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link) - ↑ "ICORN: Iterative Correction of Reference Nucleotides". SourceForge. Retrieved 28 April 2016.
- ↑ Broad Institute. "Picard". GitHub. Retrieved 28 April 2016.
- ↑ Stein, L.D. (12 February 2016). "Bio-SamTools". CPAN. Perl.org. Retrieved 28 April 2016.
- ↑ Li, H.; Durbin, R. (2009). "Fast and accurate short read alignment with Burrows–Wheeler transform". Bioinformatics. 25 (14): 1754–1760. doi:10.1093/bioinformatics/btp324. PMC 2705234. PMID 19451168.
{{cite journal}}
: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link) - ↑ Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, S.L. (2009). "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome". Genome Biology. 10 (3): R25. doi:10.1186/gb-2009-10-3-r25. PMC 2690996. PMID 19261174.
{{cite journal}}
: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link) - ↑ Lunter, G.; Goodson, M. (2011). "Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads". Genome Research. 21 (6): 936–939. doi:10.1101/gr.111120.110. PMC 3106326. PMID 20980556.
{{cite journal}}
: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link) - ↑ Manske, H.M.; Kwiatkowski, D.P. (2009). "SNP-o-matic". Bioinformatics. 25 (18): 2434–2435. doi:10.1093/bioinformatics/btp403. PMC 2735664. PMID 19574284.
{{cite journal}}
: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link) - ↑ Cibiv. "NextGenMap". GitHub. Retrieved 28 April 2016.
- ↑ Li, H.; Durbin, R. (2010). "Fast and accurate long-read alignment with Burrows–Wheeler transform". Bioinformatics. 26 (5): 589–595. doi:10.1093/bioinformatics/btp698. PMC 2828108. PMID 20080505.
{{cite journal}}
: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link) - ↑ Li, H. (2013). "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM". Preprint: 1–3.
- ↑ Chapman, B. (6 May 2013). "Framework for evaluating variant detection methods: Comparison of aligners and callers". Blue Collar Bioinformatics. Retrieved 28 April 2016.
- ↑ Homer, N. "Blat-like Fast Accurate Search Tool". SourceForge. Retrieved 28 April 2016.
- ↑ Koboldt, D. (3 July 2011). "Aligners". MassGenomics. Retrieved 28 April 2016.
- ↑ Kim, D.; Salzberg, S. (23 Februart 2016). "TopHat". Johns Hopkins University Center for Computational Biology. Retrieved 28 April 2016.
{{cite web}}
: Check date values in:|date=
(help)CS1 maint: multiple names: authors list (link) - ↑ Vezzi, F.; Del Fabbro, C.; Tomescu, A.I.; Policriti, A. (2011). "rNA: a fast and accurate short reads numerical aligner". Bioinformatics. 28 (1): 123–124. doi:10.1093/bioinformatics/btr617. PMID 22084252.
{{cite journal}}
: CS1 maint: multiple names: authors list (link) - ↑ "Welcome to ERNE 2!". SourceForge. 29 September 2014. Retrieved 28 April 2016.
- ↑ Hach, F.; Sarrafi, I.; Hormozdiari, F.; Alkan, C.; Eichler, E.E.; Sahinalp, S.C. (2014). "mrsFAST-Ultra: A compact, SNP-aware mapper for high performance sequencing applications". Nucleic Acids Research. 42 (W1): W494–W500. doi:10.1093/nar/gku370. PMC 4086126. PMID 24810850.
{{cite journal}}
: CS1 maint: PMC format (link) CS1 maint: multiple names: authors list (link) - ↑ Bao, S.; Jiang, R.; Kwan, W.; Wang, B.; Ma, X.; Song, Y.Q. (2011). "Evaluation of next-generation sequencing software in mapping and assembly". Journal of Human Genetics. 56 (6): 406–14. doi:10.1038/jhg.2011.43. PMID 21525877.
{{cite journal}}
: CS1 maint: multiple names: authors list (link) - ↑ Li, H. (19 November 2009). "NGS Alignment Programs". SourceForge. Retrieved 28 April 2016.
- ↑ Li, H. (19 November 2009). "NGS mapper ROC curves". SourceForge. Retrieved 28 April 2016.
Notes
This article is a direct transclusion of the Wikibooks article and therefore may not meet the same editing standards as LIMSwiki.