NGS Data Analysis

Tools & Formats

Basic Workflow
SEQUENCER -> FASTQ -> alignment to REFERENCE -> SAM -> BAM

File Formats

FASTA

Simple text-based format.
Each record starts with a header line: a > followed by the sequence identifier and, optionally, a description.
The sequence follows on one or more lines.
Usually indicated by the suffix *.fa, *.fasta, or *.fsa.

>seq_1 description
ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGT
GCAGATGCA
>seq_2
GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
CGATAGCAGATGCA
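
The record layout described above can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser; the function names `parse_fasta` and `_finish` are illustrative and do not come from the slides.

```python
def _finish(header, chunks):
    # Split "identifier description" on the first space; description may be empty.
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may wrap, so collect them and join later.
            chunks.append(line)
    if header is not None:
        records.append(_finish(header, chunks))
    return records


example = """>seq_1 description
ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGT
GCAGATGCA
>seq_2
GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
CGATAGCAGATGCA
"""
fasta_records = parse_fasta(example)
```

Wrapped sequence lines are joined, so each record carries one continuous sequence string.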

FASTQ

Text-based format.
Four lines per sequence entry: identifier, sequence, separator (+), and quality string.
Stores each sequence together with its per-base quality scores.
The most commonly used format for storing sequencing reads.
Usually indicated by the suffix *.fastq or *.fq.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
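
The four-line structure can be parsed by stepping through the file in groups of four. The sketch below assumes unwrapped records (sequence and quality each on a single line), which is the common case for sequencer output; `parse_fastq` is an illustrative name.

```python
def parse_fastq(text):
    """Parse FASTQ text into (identifier, sequence, quality) tuples."""
    lines = [ln for ln in text.splitlines() if ln]
    if len(lines) % 4:
        raise ValueError("FASTQ input must contain four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        # Rely on line position rather than scanning for '@',
        # because '@' is also a valid quality character.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], seq, qual))
    return records


fastq_text = "\n".join([
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
])
fastq_records = parse_fastq(fastq_text)
```

The length check catches the most frequent corruption: a quality string that does not have exactly one character per base.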

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Field     Meaning
@EAS139   the unique instrument name
136       the run id
FC706VJ   the flowcell id
2         flowcell lane
2104      tile number within the flowcell lane
15343     'x'-coordinate of the cluster within the tile
197393    'y'-coordinate of the cluster within the tile
1         the member of a pair, 1 or 2
Y         Y if the read is filtered, N otherwise
18        0 when none of the control bits are on, otherwise it is an even number
ATCACG    index sequence
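
As an illustration, the identifier line shown above can be split into its named fields. This sketch assumes the exact layout of that example (seven colon-separated fields, a space, then four more); the class and field names are illustrative.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read: int            # member of a pair, 1 or 2
    is_filtered: bool    # True when the flag is "Y"
    control_number: int
    index: str


def parse_illumina_header(line):
    """Split an identifier line of the form shown in the example above."""
    left, right = line.strip().lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    read, filtered, control, index = right.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(read), filtered == "Y", int(control), index,
    )


header = parse_illumina_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
```

Identifier layouts differ between instrument generations and pipelines, so a header with a different number of fields will raise an error here rather than be silently misread.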

Quality
Q = -10 log10(P), where P is the base-calling error probability
(i.e., the probability that the corresponding base call is
incorrect)

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

http://en.wikipedia.org
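
The relationship between Q and P can be checked numerically. The sketch below also decodes a quality string, assuming the Phred+33 encoding in which '!' (ASCII 33) represents Q = 0; other offsets exist in older data, so the `offset` parameter is an assumption to verify for a given data set. Function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality(qual, offset=33):
    """Convert a quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


scores = decode_quality("!5I")
probabilities = [phred_to_error_prob(q) for q in scores]
```

Each step of 10 in Q corresponds to a tenfold drop in error probability: Q20 is 1 error in 100 calls, Q30 is 1 in 1000.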

SAM

SAM stands for Sequence Alignment/Map format
TAB-delimited text format
flexible enough to store all the alignment information generated
allows most operations on the alignment to work on a stream without loading the whole alignment into memory
allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus
consists of an optional header section and an alignment section

Li et al., 2009.
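
To make the header/alignment split concrete, here is a minimal sketch that streams over SAM lines, skips header lines (which start with @), and splits each alignment into its 11 mandatory tab-separated fields. The field names follow the SAM specification rather than the slides, and the sample text is hypothetical.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Split one alignment line into a dict of the mandatory fields plus tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("an alignment line needs at least 11 fields")
    rec = dict(zip(SAM_FIELDS, cols[:11]))
    for key in INT_FIELDS:
        rec[key] = int(rec[key])
    rec["TAGS"] = cols[11:]  # optional TAG:TYPE:VALUE fields
    return rec


def iter_alignments(lines):
    """Yield alignments one at a time, so the input can be a stream."""
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        yield parse_sam_line(line)


# Hypothetical two-line header followed by one alignment.
sam_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0\n"
)
alignments = list(iter_alignments(sam_text.splitlines()))
is_reverse = bool(alignments[0]["FLAG"] & 0x10)  # FLAG bit 0x10: reverse strand
```

Because `iter_alignments` is a generator, it processes one line at a time, which mirrors the streaming property listed above.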


BAM
BAM is the compressed binary version of the SAM format
compact and indexable representation of nucleotide sequence alignments
uses a modified form of the gzip format called BGZF (Blocked GNU Zip Format)
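
Because BGZF is a series of ordinary gzip blocks with one extra header field, standard gzip tools can still decompress it. The sketch below inspects a block header for the BGZF extra subfield (identifier bytes 'B', 'C') and reads the block size from it. The test block is the 28-byte empty block that the SAM specification defines as the end-of-file marker; `bgzf_block_size` is an illustrative name.

```python
import gzip
import struct

# The empty BGZF block used as an end-of-file marker (28 bytes).
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 42430200 1b00 0300 00000000 00000000"
)


def bgzf_block_size(block):
    """Return the total size of a BGZF block, or None if it is plain gzip."""
    if len(block) < 18:
        return None
    id1, id2, method, flags = block[:4]
    # gzip magic, DEFLATE method, and the FEXTRA flag (bit 2) must be set.
    if (id1, id2, method) != (0x1F, 0x8B, 8) or not flags & 4:
        return None
    xlen = struct.unpack_from("<H", block, 10)[0]
    pos, end = 12, 12 + xlen
    while pos + 4 <= end:
        si1, si2, slen = struct.unpack_from("<BBH", block, pos)
        if (si1, si2, slen) == (66, 67, 2):  # 'B', 'C', 2-byte payload
            # The payload stores (block size - 1).
            return struct.unpack_from("<H", block, pos + 4)[0] + 1
        pos += 4 + slen
    return None


eof_size = bgzf_block_size(BGZF_EOF)
eof_payload = gzip.decompress(BGZF_EOF)          # still valid gzip
plain_size = bgzf_block_size(gzip.compress(b"ACGT"))
```

Knowing each block's size is what makes random access possible: an index can point to a block start, and a reader can jump there without decompressing everything before it.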

VCF
Variant Call Format
VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome. The format can also contain genotype information on samples for each position.

VCF specs v4.2
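
The three kinds of lines can be told apart by their prefix: meta-information lines start with ##, the header line starts with a single #, and everything else is a data line. The following sketch separates them and pairs genotype values with sample names. Column names follow the VCF specification; the sample text and sample names are hypothetical.

```python
def parse_vcf(text):
    """Return (meta_lines, sample_names, records) from uncompressed VCF text."""
    meta, samples, records = [], [], []
    for line in text.splitlines():
        if line.startswith("##"):
            meta.append(line[2:])
        elif line.startswith("#"):
            # Header line: 8 fixed columns, then FORMAT, then one column per sample.
            samples = line[1:].split("\t")[9:]
        elif line:
            cols = line.split("\t")
            rec = {
                "CHROM": cols[0], "POS": int(cols[1]), "ID": cols[2],
                "REF": cols[3], "ALT": cols[4].split(","),
                "QUAL": cols[5], "FILTER": cols[6], "INFO": cols[7],
            }
            if len(cols) > 9:
                keys = cols[8].split(":")  # FORMAT column, e.g. GT:DP
                rec["GENOTYPES"] = {
                    name: dict(zip(keys, value.split(":")))
                    for name, value in zip(samples, cols[9:])
                }
            records.append(rec)
    return meta, samples, records


# Hypothetical VCF content with two samples.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20\tGT:DP\t0/1:12\t1/1:8\n"
)
vcf_meta, vcf_samples, vcf_records = parse_vcf(vcf_text)
```

Real files are usually compressed, so in practice the text would first be read through a gzip-aware reader.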


Hapmap
text-based file format
information for a series of SNPs as well as the germplasm lines is stored in one file
the first row contains the header labels, and each additional row contains all the information associated with a single SNP
the first 11 columns describe attributes of the SNP, while each following column gives the SNP value for a single germplasm line

http://www.maizegenetics.net
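
The column layout above suggests a simple split at column 11. In this sketch the attribute names are read from the header row rather than hard-coded, a tab delimiter is assumed, and the sample row and line names (lineA, lineB) are hypothetical.

```python
N_ATTRIBUTE_COLUMNS = 11  # per the description above


def parse_hapmap(text):
    """Return (germplasm_line_names, [(snp_attributes, calls_per_line), ...])."""
    rows = [ln.split("\t") for ln in text.splitlines() if ln]
    header, data = rows[0], rows[1:]
    attr_names = header[:N_ATTRIBUTE_COLUMNS]
    line_names = header[N_ATTRIBUTE_COLUMNS:]
    snps = []
    for cols in data:
        attrs = dict(zip(attr_names, cols[:N_ATTRIBUTE_COLUMNS]))
        calls = dict(zip(line_names, cols[N_ATTRIBUTE_COLUMNS:]))
        snps.append((attrs, calls))
    return line_names, snps


# Hypothetical two-line Hapmap file: one header row and one SNP row.
hapmap_header = ["rs#", "alleles", "chrom", "pos", "strand", "assembly#",
                 "center", "protLSID", "assayLSID", "panelLSID", "QCcode",
                 "lineA", "lineB"]
hapmap_row = ["snp1", "A/G", "1", "1000", "+", "NA",
              "NA", "NA", "NA", "NA", "NA",
              "A", "G"]
hapmap_text = "\t".join(hapmap_header) + "\n" + "\t".join(hapmap_row) + "\n"
germplasm_lines, hapmap_snps = parse_hapmap(hapmap_text)
```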


GFF : General Feature Format


GFF3 files are nine-column, tab-delimited, plain text files used to hold information about gene structure.
Column 1: seqid
Column 2: source
Column 3: type
Columns 4 & 5: start and end
Column 6: score
Column 7: strand
Column 8: phase
Column 9: attributes
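
The nine columns map directly onto a small record parser. This is a minimal sketch: it treats "." as a missing score, splits the attributes column on ";" and "=", and does not handle URL-escaped characters. The sample line is illustrative.

```python
GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("a GFF3 feature line must have exactly 9 columns")
    feat = dict(zip(GFF_COLUMNS, cols))
    feat["start"] = int(feat["start"])
    feat["end"] = int(feat["end"])
    # "." means the value is undefined.
    feat["score"] = None if feat["score"] == "." else float(feat["score"])
    feat["attributes"] = dict(
        pair.split("=", 1) for pair in feat["attributes"].split(";") if pair
    )
    return feat


# Illustrative CDS line with a Parent attribute.
feature = parse_gff3_line("Ca8\tGLEAN\tCDS\t76668\t76852\t.\t+\t2\tParent=Ca_11934;")
cds_length = feature["end"] - feature["start"] + 1  # coordinates are 1-based, inclusive
```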

Example:

seqid  source  type  start  end    score     strand  phase  attributes
Ca8    GLEAN   mRNA  76315  78595  0.990688  +       .      ID=Ca_11934;
Ca8    GLEAN   CDS   76315  76450  .         +       0      Parent=Ca_11934;
Ca8    GLEAN   CDS   76668  76852  .         +       2      Parent=Ca_11934;
Ca8    GLEAN   CDS   77457  77657  .         +       0      Parent=Ca_11934;
Ca8    GLEAN   CDS   77994  78155  .         +       0      Parent=Ca_11934;
Ca8    GLEAN   CDS   78233  78595  .         +       0      Parent=Ca_11934;
Ca8    GLEAN   mRNA  85322  90545  0.655887  +       .      ID=Ca_11933;
Ca8    GLEAN   CDS   85322  86173  .         +       0      Parent=Ca_11933;
Ca8    GLEAN   CDS   88630  89316  .         +       0      Parent=Ca_11933;
Ca8    GLEAN   CDS   89970  90545  .         +       0      Parent=Ca_11933;
Ca8    GLEAN   mRNA  94102  99473  0.967529  -       .      ID=Ca_11932;
Ca8    GLEAN   CDS   98946  99473  .         -       0      Parent=Ca_11932;
Ca8    GLEAN   CDS   97180  97620  .         -       0      Parent=Ca_11932;
Ca8    GLEAN   CDS   96589  96819  .         -       0      Parent=Ca_11932;
Ca8    GLEAN   CDS   95733  95797  .         -       0      Parent=Ca_11932;
Ca8    GLEAN   CDS   95601  95658  .         -       1      Parent=Ca_11932;
Ca8    GLEAN   CDS   94282  94350  .         -       0      Parent=Ca_11932;
Ca8    GLEAN   CDS   94102  94200  .         -       0      Parent=Ca_11932;

GTF : General Transfer Format


The first 8 columns are the same as in GFF.
The 9th column is structured differently:
it must begin with the 'gene_id' and 'transcript_id' attributes
each attribute must end with a semicolon

Attribute column (column 9) compared:
GFF: ID=geneA;Name=geneA
GFF: ID=exonA1;Parent=geneA
GTF: gene_id "geneA";transcript_id "geneA.1";
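
The difference in the 9th column shows up clearly when both styles are parsed. This sketch handles only quoted GTF values, as in the example above; unquoted numeric values, which some GTF files contain, are not covered. Function names are illustrative.

```python
import re


def parse_gtf_attributes(field):
    """Parse 'key "value";' pairs from a GTF attribute column."""
    return dict(re.findall(r'(\S+)\s+"([^"]*)"\s*;', field))


def parse_gff3_attributes(field):
    """Parse 'key=value' pairs separated by ';' from a GFF3 attribute column."""
    return dict(pair.split("=", 1) for pair in field.strip().split(";") if pair)


def has_required_gtf_ids(attrs):
    """GTF requires both gene_id and transcript_id to be present."""
    return "gene_id" in attrs and "transcript_id" in attrs


gtf_attrs = parse_gtf_attributes('gene_id "geneA";transcript_id "geneA.1";')
gff_attrs = parse_gff3_attributes("ID=exonA1;Parent=geneA")
```

GFF3 expresses hierarchy through ID/Parent links between lines, while GTF repeats the gene and transcript identifiers on every line.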

Tools

Quality Control
Why Quality Control?
Without QC, the following can be wasted:
sequencing runs spent on a poor library
time required for analysis
cost of analyzing data
storage used for raw sequence data
hours already spent in analysis

QC Tools
FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
provides a quick overview of the areas in which there may be problems
summary graphs and tables to quickly assess your data
export of results to an HTML-based permanent report
offline operation to allow automated generation of reports without running the interactive application

PRINSEQ Schmieder R and Edwards R, 2011
summary statistics for your sequence data
reformats and trims your sequences
easily configurable

Trimmomatic Bolger et al., 2014
flexible read trimming tool for Illumina NGS data
trims adapters
fast, multithreaded command-line tool

Sickle https://github.com/najoshi/sickle
supports gzipped input files
works with both paired-end and single-end reads
easily configurable

Cutadapt Marcel Martin, 2011
trims reads from current high-throughput sequencing machines
tolerates errors in the adapter
input and output files can be gzip-compressed
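
To give a feel for what quality trimming does, here is a simplified sliding-window sketch: the read is cut at the start of the first window whose mean quality falls below a threshold. It is an illustration of the general idea only and is not the exact algorithm of Trimmomatic, Sickle, or Cutadapt; the default window size, threshold, and Phred+33 offset are assumptions.

```python
def sliding_window_trim(seq, qual, window=4, min_mean_q=20, offset=33):
    """Trim the 3' end of a read where windowed mean quality drops too low."""
    scores = [ord(ch) - offset for ch in qual]
    cut = len(seq)  # default: keep the whole read
    for start in range(len(scores) - window + 1):
        mean_q = sum(scores[start:start + window]) / window
        if mean_q < min_mean_q:
            cut = start  # drop everything from this window onward
            break
    return seq[:cut], qual[:cut]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = sliding_window_trim("ACGTACGTAC", "IIIIII####")
untouched = sliding_window_trim("ACGT", "IIII")
```

Averaging over a window makes the cut less sensitive to a single low-quality base in an otherwise good stretch.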

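[Figure: example QC plot labelled BAD. Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]

[Figure: example QC plot labelled BAD. Source: http://prinseq.sourceforge.net]

[Figure: example QC plot labelled GOOD. Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]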

Alignment Tools
Also called mapping
used in experiments where the genome is known
aligns reads to the reference genome
computationally intensive for large data volumes and large reference genomes

Bowtie2 Langmead and Salzberg, 2012
an ultrafast and memory-efficient tool for aligning sequencing reads
supports gapped, local, and paired-end alignment
no upper limit on read length

BWA Li and Durbin, 2009
fast and requires less memory compared to many other tools
supports gapped alignment
supports read lengths up to 1 Mb
default configuration works for most typical inputs

GS Reference Mapper Roche
rapidly and accurately aligns reads to any reference genome
identifies differences compared to the reference
annotates reference features and variations
explores the full spectrum of genomic variation

IGV : Integrative Genomics Viewer


The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets.
Supports multiple data types:
Sequence alignments
Genome annotations
Variants/SNPs
etc.

Robinson et al., 2011

CLC Genomics Workbench

commercial / paid application
computationally less intensive
proprietary internal algorithms
flexible and scalable
supports all typical NGS workflows:
Resequencing
Mapping
Variant Detection
RNA-seq
De novo assembly
etc.

http://www.clcbio.com

References
Bolger, A. M., Lohse, M., & Usadel, B. Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics (2014), btu170.
James T. Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman, Eric S. Lander, Gad Getz, Jill P. Mesirov. Integrative Genomics Viewer. Nature Biotechnology (2011), 29:24-26.
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods (2012), 9:357-359.
Li et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics (2009), 25(16):2078-2079.
Li H. and Durbin R. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics (2009), 25:1754-60.
Marcel Martin. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal (2011), 17:10-12.
Schmieder R and Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics (2011), 27:863-864.

Thank you!
