NGS ToolsFormats r1 BDG

NGS Data Analysis
Tools & Formats
Basic Workflow
SEQUENCER -> FASTQ (+ REFERENCE) -> SAM -> BAM
File Formats
FASTA
Simple text-based format.

Each sequence starts with a header line: a > followed by the sequence
identifier and, optionally, a description
Usually indicated with the suffix *.fa, *.fasta, or *.fsa
>seq_1 description
ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGT
GCAGATGCA
>seq_2
GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
CGATAGCAGATGCA
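As a minimal sketch of how the FASTA layout above can be read, the following Python function splits each header into identifier and description and joins wrapped sequence lines. The function names are illustrative only and not part of any library.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def _make_record(header, chunks):
    # The identifier is the first whitespace-delimited token;
    # anything after it is the optional description.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks)


example = """>seq_1 description
ATGCTGCTGACGTAGCGATGCAGTAGCAGGTACGAGTCGCAGT
GCAGATGCA
>seq_2
GTAGACGATCGATGCAGCATGACGATGACGATGACGACGATGA
CGATAGCAGATGCA
"""
records = parse_fasta(example)
```

For real projects an established parser (for example, the one in Biopython) is usually a better choice than a hand-written one.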
FASTQ
text-based format
four-line entry per sequence
stores each sequence together with its corresponding quality scores
most commonly used format to store sequencing reads
usually indicated with the suffix *.fastq or *.fq
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
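The four-line structure can be sketched in Python as follows. This is a simplified reader that assumes every sequence and quality string sits on a single line (no wrapping); the function name is hypothetical.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from an iterable of FASTQ lines."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record starting at line %d" % (i + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


example = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTC" "AACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
reads = list(parse_fastq(example))
```

The length check reflects the rule that every base must have exactly one quality character.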
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
@EAS139   the unique instrument name
136       the run id
FC706VJ   the flowcell id
2         flowcell lane
2104      tile number within the flowcell lane
15343     'x'-coordinate of the cluster within the tile
197393    'y'-coordinate of the cluster within the tile
1         the member of a pair, 1 or 2
Y         Y if the read is filtered, N otherwise
18        0 when none of the control bits are on, otherwise it is an even number
ATCACG    index sequence
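The header fields described above can be pulled apart with a few string splits. The sketch below assumes the header always has this exact layout (seven colon-separated fields, a space, then four more); the class and function names are illustrative.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    is_filtered: bool
    control_bits: int
    index_sequence: str


def parse_illumina_header(header):
    """Split a header such as '@EAS139:136:...:197393 1:Y:18:ATCACG'."""
    location, _, flags = header.lstrip("@").partition(" ")
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    pair, filtered, control, index = flags.split(":")
    return IlluminaHeader(
        instrument=instrument,
        run_id=int(run_id),
        flowcell_id=flowcell,
        lane=int(lane),
        tile=int(tile),
        x=int(x),
        y=int(y),
        pair_member=int(pair),
        is_filtered=(filtered == "Y"),
        control_bits=int(control),
        index_sequence=index,
    )


parsed = parse_illumina_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
```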
Quality
Q = -10 log10(P), where P is the base-calling error probability
(i.e., the probability that the corresponding base call is
incorrect)
!#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
http://en.wikipedia.org
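The relationship between Q, P, and the quality characters can be illustrated with a short Python sketch. It assumes the Phred+33 encoding (ASCII code minus 33), which is an assumption here since other offsets have been used by older pipelines.

```python
import math


def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Compute Q = -10 * log10(P) for an error probability P."""
    return -10 * math.log10(p)


def decode_quality(qual, offset=33):
    """Convert a quality string to a list of Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


scores = decode_quality("!5I")
probabilities = [phred_to_error_prob(q) for q in scores]
```

Under this encoding a score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.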
SAM
SAM stands for Sequence Alignment/Map format

TAB-delimited text format
flexible enough to store all the alignment information
generated
allows most operations on the alignment to work on a
stream without loading the whole alignment into memory
allows the file to be indexed by genomic position to efficiently
retrieve all reads aligning to a locus
consists of a header section (optional) and an alignment
section
Li et al., 2009.
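To make the TAB-delimited layout concrete, here is a minimal Python sketch that splits one alignment line into the eleven mandatory SAM fields and keeps any optional tags as raw strings. The example line is invented for illustration and does not come from real data.

```python
SAM_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Return a dict for an alignment line, or None for a header line."""
    if line.startswith("@"):
        return None  # header lines start with '@'
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Anything beyond column 11 is an optional TAG:TYPE:VALUE field.
    record["TAGS"] = columns[len(SAM_FIELDS):]
    return record


# Hypothetical alignment line, made up for this example.
example_line = "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
alignment = parse_sam_line(example_line)
```

Because each line is parsed independently, the same function works on a stream of lines without holding the whole file in memory.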
BAM
BAM is the compressed binary version of the SAM format
compact and index-able representation of nucleotide sequence
alignments.
uses a modified form of gzip format called BGZF (Blocked GNU Zip Format)
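One way to see what "modified gzip" means: each BGZF block is an ordinary gzip member that carries an extra 'BC' subfield recording the block size, so standard gzip readers can still decompress a series of blocks. The sketch below builds two tiny blocks by hand to show this. It is an illustration under that reading of the format, not a replacement for samtools or htslib, and the function names are hypothetical.

```python
import gzip
import struct
import zlib


def make_bgzf_block(payload):
    """Wrap payload in one BGZF block (a gzip member with a 'BC' extra subfield)."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15 selects raw deflate
    deflated = compressor.compress(payload) + compressor.flush()
    block_size = 18 + len(deflated) + 8  # header + data + CRC32 and ISIZE
    header = struct.pack(
        "<BBBBIBBHBBHH",
        0x1F, 0x8B,          # gzip magic bytes
        8,                   # compression method: deflate
        4,                   # flags: extra field present
        0,                   # modification time
        0,                   # extra flags
        0xFF,                # operating system: unknown
        6,                   # length of the extra field
        ord("B"), ord("C"),  # BGZF subfield identifier
        2,                   # subfield payload length
        block_size - 1,      # BSIZE: total block size minus one
    )
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + deflated + trailer


def is_bgzf(data):
    """Check the gzip magic, the extra-field flag, and the 'BC' subfield."""
    return data[:4] == b"\x1f\x8b\x08\x04" and data[12:14] == b"BC"


blocks = make_bgzf_block(b"ACGT") + make_bgzf_block(b"TTAA")
restored = gzip.decompress(blocks)
```

Storing data in independent blocks is what makes it possible to jump to a position in the file through an index instead of decompressing from the start.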
VCF
Variant Call Format
VCF is a text file format (most likely stored in a compressed manner).
It contains meta-information lines, a header line, and then data lines,
each containing information about a position in the genome.
The format can also contain genotype information on samples for each
position.
VCF specs v4.2
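The three kinds of lines (meta-information, header, data) can be told apart by their leading characters. The following Python sketch separates them and parses the eight fixed columns plus optional per-sample genotype columns. The sample names and the variant shown are invented for illustration.

```python
def parse_info(info):
    """Turn 'DP=20;DB' into {'DP': '20', 'DB': True}."""
    if info == ".":
        return {}
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # keys without '=' are flags
    return result


def parse_vcf(lines):
    """Return (meta_lines, sample_names, records) from VCF text lines."""
    meta, samples, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])
        elif line.startswith("#"):
            # Header line: columns after the ninth are sample names.
            samples = line[1:].split("\t")[9:]
        else:
            fields = line.split("\t")
            chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
            record = {
                "CHROM": chrom, "POS": int(pos), "ID": var_id, "REF": ref,
                "ALT": alt.split(","), "QUAL": qual, "FILTER": flt,
                "INFO": parse_info(info),
            }
            if len(fields) > 9:
                keys = fields[8].split(":")
                record["GENOTYPES"] = {
                    name: dict(zip(keys, value.split(":")))
                    for name, value in zip(samples, fields[9:])
                }
            records.append(record)
    return meta, samples, records


# Hypothetical two-sample file, made up for this example.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\tsampleB",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20;DB\tGT:DP\t0/1:12\t1/1:8",
]
meta, samples, records = parse_vcf(example)
```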
Hapmap
text-based file format
information for a series of SNPs as well as the germplasm
lines are stored in one file

the first row contains the header labels, and each additional
row contains all the information associated with a single SNP
the first 11 columns describe attributes of the SNP, while the
following columns describe the SNP value for a single
germplasm line
http://www.maizegenetics.net
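The row layout described above (11 attribute columns followed by one column per germplasm line) might be read as in this Python sketch. The column names in the example header follow the naming commonly seen in TASSEL Hapmap files, which is an assumption here; the SNP, line names, and genotype calls are invented.

```python
N_ATTRIBUTE_COLUMNS = 11  # the first 11 columns describe the SNP itself


def parse_hapmap(lines):
    """Return (line_names, snps); each SNP is an (attributes, calls) pair."""
    rows = [ln.rstrip("\n").split("\t") for ln in lines if ln.strip()]
    header, data = rows[0], rows[1:]
    attribute_names = header[:N_ATTRIBUTE_COLUMNS]
    line_names = header[N_ATTRIBUTE_COLUMNS:]
    snps = []
    for row in data:
        attributes = dict(zip(attribute_names, row[:N_ATTRIBUTE_COLUMNS]))
        calls = dict(zip(line_names, row[N_ATTRIBUTE_COLUMNS:]))
        snps.append((attributes, calls))
    return line_names, snps


# Hypothetical file content, made up for this example.
example = [
    "rs#\talleles\tchrom\tpos\tstrand\tassembly#\tcenter\tprotLSID\t"
    "assayLSID\tpanelLSID\tQCcode\tline1\tline2\tline3",
    "snp1\tA/G\t1\t1000\t+\tNA\tNA\tNA\tNA\tNA\tNA\tAA\tGG\tAG",
]
line_names, snps = parse_hapmap(example)
```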
GFF : General Feature Format

GFF3 files are nine-column, tab-delimited, plain text files. They are used
to hold information about gene structure
Column 1: seqid
Column 2: source
Column 3: type
Columns 4 & 5: start and end
Column 6: score
Column 7: strand
Column 8: phase
Column 9: attributes
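The nine columns listed above map naturally onto a dictionary. This Python sketch parses one GFF3 line and splits the attributes column into key/value pairs; the example line uses the values of the first mRNA feature on Ca8 from this deck, and the function name is illustrative.

```python
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse one tab-delimited GFF3 feature line into a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(GFF3_COLUMNS):
        raise ValueError("a GFF3 feature line must have exactly nine columns")
    feature = dict(zip(GFF3_COLUMNS, values))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = {}
    for pair in feature["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    feature["attributes"] = attributes
    return feature


feature = parse_gff3_line(
    "Ca8\tGLEAN\tmRNA\t76315\t78595\t0.990688\t+\t.\tID=Ca_11934;"
)
```

Score, strand, and phase are left as strings because a "." is allowed in each of them to mean "not applicable".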
GFF : General Feature Format

Ca8  GLEAN  mRNA  76315  78595  0.990688  +  .  ID=Ca_11934;
Ca8  GLEAN  CDS   76315  76450  .         +  0  Parent=Ca_11934;
Ca8  GLEAN  CDS   76668  76852  .         +  2  Parent=Ca_11934;
Ca8  GLEAN  CDS   77457  77657  .         +  0  Parent=Ca_11934;
Ca8  GLEAN  CDS   77994  78155  .         +  0  Parent=Ca_11934;
Ca8  GLEAN  CDS   78233  78595  .         +  0  Parent=Ca_11934;
Ca8  GLEAN  mRNA  85322  90545  0.655887  +  .  ID=Ca_11933;
Ca8  GLEAN  CDS   85322  86173  .         +  0  Parent=Ca_11933;
Ca8  GLEAN  CDS   88630  89316  .         +  0  Parent=Ca_11933;
Ca8  GLEAN  CDS   89970  90545  .         +  0  Parent=Ca_11933;
Ca8  GLEAN  mRNA  94102  99473  0.967529  -  .  ID=Ca_11932;
Ca8  GLEAN  CDS   98946  99473  .         -  0  Parent=Ca_11932;
Ca8  GLEAN  CDS   97180  97620  .         -  0  Parent=Ca_11932;
Ca8  GLEAN  CDS   96589  96819  .         -  0  Parent=Ca_11932;
Ca8  GLEAN  CDS   95733  95797  .         -  0  Parent=Ca_11932;
Ca8  GLEAN  CDS   95601  95658  .         -  1  Parent=Ca_11932;
Ca8  GLEAN  CDS   94282  94350  .         -  0  Parent=Ca_11932;
Ca8  GLEAN  CDS   94102  94200  .         -  0  Parent=Ca_11932;
GTF : General Transfer Format

the first 8 columns are the same as in GFF
the 9th column is structured differently
it must begin with the 'gene_id' and 'transcript_id' attributes
each attribute must end with a semicolon
GFF
ID=geneA;Name=geneA
ID=exonA1;Parent=geneA
GTF
gene_id "geneA";transcript_id "geneA.1";
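The difference in the 9th column shows up clearly when parsing it. GFF uses key=value pairs, while GTF uses a key followed by a quoted value and a closing semicolon. A minimal Python sketch for the GTF style, assuming every value is double-quoted (unquoted values are not handled here):

```python
import re

# key, whitespace, "value", then the mandatory terminating semicolon
GTF_ATTRIBUTE = re.compile(r'\s*(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(text):
    """Turn 'gene_id "geneA";transcript_id "geneA.1";' into a dict."""
    return {key: value for key, value in GTF_ATTRIBUTE.findall(text)}


attributes = parse_gtf_attributes('gene_id "geneA";transcript_id "geneA.1";')
```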
Tools
Quality Control
Why Quality Control ?
sequencing a poor library on multiple runs

time required for analysis
cost of analyzing data
raw sequence data storage
hours spent in analysis could be wasted
QC Tools
FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
providing a quick overview to tell you in which areas
there may be problems
summary graphs and tables to quickly assess your
data
export of results to an HTML based permanent report
offline operation to allow automated generation of
reports without running the interactive application
PRINSEQ Schmieder R and Edwards R, 2011

summary statistics for your sequence data
reformat and trim your sequences
easily configurable
Trimmomatic Bolger et al., 2014

flexible read trimming tool for Illumina NGS data
trims adapters
fast, multithreaded command-line tool
Sickle https://github.com/najoshi/sickle
supports gzipped file inputs
works with both paired-end and single-end reads
easily configurable
Cutadapt Marcel Martin, 2011
trims reads from current high-throughput sequencing
machines
errors in the adapter are tolerated
input or output file can be gzip-compressed
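[Figure: BAD quality example. Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]
[Figure: BAD quality example. Source: http://prinseq.sourceforge.net]
[Figure: GOOD quality example. Source: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/]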
Alignment Tools
Also called mapping
used for experiments with a known genome
aligns reads to the reference genome
computationally intensive for large data volumes and large reference
genomes
Bowtie2
Langmead and Salzberg, 2012
an ultrafast and memory-efficient tool for aligning sequencing reads
supports gapped, local, and paired-end alignment
no upper limit on read length
BWA
Li and Durbin, 2009
fast and requires less memory compared to many other tools
supports gapped alignment
supports read lengths up to 1 Mb
default configuration works for most typical inputs
GS Reference Mapper
Roche
rapidly and accurately align reads to any reference genome
identify differences compared to the reference
annotate reference features and variations
explore the full spectrum of genomic variation
IGV : Integrative Genomics Viewer

The Integrative Genomics Viewer (IGV) is a high-performance
visualization tool for interactive exploration of large,
integrated genomic datasets
Supports multiple data types
Sequence alignments
Genome annotations
Variants/SNPs
etc.
Robinson et al., 2011
CLC Genomics Workbench
commercial / paid application

computationally less intensive
proprietary internal algorithms
flexible and scalable
supports all typical NGS workflows
Resequencing
Mapping
Variant Detection
RNA-seq
De novo assembly
etc.
http://www.clcbio.com
References
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer
for Illumina Sequence Data. Bioinformatics, btu170.
James T. Robinson, Helga Thorvaldsdóttir, Wendy Winckler, Mitchell Guttman,
Eric S. Lander, Gad Getz, Jill P. Mesirov. Integrative Genomics Viewer. Nature
Biotechnology (2011), 29, 24-26.
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature
Methods. (2012), 9:357-359.
Li et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics
(2009), 25 (16): 2078-2079.
Li H. and Durbin R. Fast and accurate short read alignment with
Burrows-Wheeler Transform. Bioinformatics (2009), 25:1754-60.
Marcel Martin. Cutadapt removes adapter sequences from high-throughput
sequencing reads. EMBnet.journal (2011), 17:10-12
Schmieder R and Edwards R: Quality control and preprocessing of
metagenomic datasets. Bioinformatics (2011), 27:863-864.
Thank you!
