
Next Generation Sequencing (NGS)

Data analysis in GALAXY (https://usegalaxy.org/)


by
Waseem Haider, PhD
Assistant Professor COMSATS University Islamabad (CUI)
CEO Next Gen. Solutions
Learning Objectives
• Get familiar with the FASTQ format and base quality scores
• Perform quality control (QC) of the data
• Learn to align reads to the reference using the BWA aligner
• Generate a BAM file and subsequently a pileup file
• Visualize BAM files using IGV and identify SNVs and indels
Background on variant detection:
Please read through the links below.
https://docs.google.com/document/pub?id=1NfythYcSrkQwldGMrbHRKLRORFFn-WnBm3gOMHwIgmE
http://vlsci.github.io/lscc_docs/tutorials/variant_calling_galaxy_1/variant_calling_galaxy_1/
Dataset: Short-read exome data for chromosome 22 of a single human individual. The
dataset contains over one million 76 bp reads, produced on an Illumina GAIIx from
exome-enriched DNA. The data were generated as part of the 1000 Genomes Project.
GALAXY (https://usegalaxy.org/)

1: Create an account and log in. This saves your work and your time.
2: Data uploading/fetching from the Internet
i) In the Galaxy tools panel (left), click on Get Data and choose Upload File. Click
Paste/Fetch data and paste the URL below.
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/VariantDet_BASIC/NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq
Select Type as fastqsanger (careful: there is a fastqcssanger too) and click Start.
Once the upload status turns green, the upload is complete. You should now
be able to see the file in the Galaxy history panel (right).

ii) Alternatively, if you have a local file, you can upload it directly by browsing
to it on your computer.
iii) You can also import data from the SRA (Sequence Read Archive) through
EMBL-EBI’s ENA (European Nucleotide Archive), which is linked with Galaxy
(e.g. GSE88943 at GEO).

http://www.ebi.ac.uk/ena

On the ENA record page, data can be sent to Galaxy by clicking the file link in the "Fastq files (galaxy)" column.


3: Data Quality Control (QC)
Have a look at the FASTQ file with FastQC.
NGS: QC and manipulation > FastQC
Measure                              Value
Filename                             NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq
File type                            Conventional base calls
Encoding                             Sanger / Illumina 1.9
Total Sequences                      1151702
Sequences flagged as poor quality    0
Sequence length                      76
%GC                                  53
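
The values in this table can be recomputed directly from the FASTQ text. The following is a minimal Python sketch, assuming the standard four-line FASTQ record layout. The function name fastq_basic_stats and the two example reads are made up for illustration and are not part of FastQC or Galaxy.

```python
import io


def fastq_basic_stats(handle):
    """Return read count, observed read lengths and overall %GC for a FASTQ stream."""
    total = 0
    lengths = set()
    gc_bases = 0
    all_bases = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().strip()
        plus = handle.readline()
        qual = handle.readline().strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near read %d" % (total + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ in read %d" % (total + 1))
        total += 1
        lengths.add(len(seq))
        upper = seq.upper()
        gc_bases += upper.count("G") + upper.count("C")
        all_bases += len(seq)
    percent_gc = 100.0 * gc_bases / all_bases if all_bases else 0.0
    return {"total": total, "lengths": sorted(lengths), "percent_gc": percent_gc}


# Two made-up 8 bp reads, for illustration only (not from the tutorial dataset).
example = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nGGCCGGCC\n+\nIIIIIIII\n"
)
stats = fastq_basic_stats(example)
print(stats)
```

For the tutorial file, the same function could be called on open("NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq") after downloading it. FastQC may round or estimate some values differently.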
1. QUALITY SCORES ACROSS ALL THE BASES

COMMENTS:
The green area marks good quality scores and the red area marks bad/low
quality. Low-quality reads should be trimmed or discarded to get reliable
results. Only a few reads fall in the red area, which indicates few errors and little bias in
the sequencing. Overall the sample is of good quality, with most of the reads
showing high quality.
2. QUALITY PER TILE (the plot is blue, which means quality is not biased by tile)
3. MEAN SEQUENCE QUALITY:

The average quality score per read mostly lies in the range 35-38.
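
A Phred quality score Q is related to the probability p that a base call is wrong by Q = -10 log10(p). Under the Sanger encoding reported by FastQC for this file, each score is stored as the ASCII character with code Q + 33. The sketch below decodes a quality string and converts scores to error probabilities. The helper names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Sanger offset 33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def mean_quality(quality_string, offset=33):
    """Mean Phred score of one read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)


# 'D' is ASCII 68 and 'G' is ASCII 71, so they encode Q35 and Q38.
print(phred_scores("DG"))                 # [35, 38]
print(mean_quality("DG"))                 # 36.5
print(f"{error_probability(35):.6f}")     # 0.000316
```

A mean score of 35-38 therefore corresponds to roughly one expected error in several thousand base calls.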
4. SEQUENCE CONTENT ACROSS ALL BASES:

The first few bases show an uneven distribution of A, T, C and G, while from base 12 onwards the distribution is uniform.

5. GC CONTENT PER READ:

Mean GC content lies in the region of 53-56% per read.
6. N CONTENT ACROSS ALL BASES

No read shows N-content.


7. DISTRIBUTION OF SEQUENCE LENGTH AMONG ALL SEQUENCES:

All sequences are 76 base pairs in length.


8. % SEQUENCES REMAINING AFTER DEDUPLICATION:

Only a few sequences show a high level of duplication.
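
The idea behind this measure can be sketched by counting identical read sequences. This is a simplified illustration, not FastQC's own method: FastQC works from a sample of the file, whereas the hypothetical helper below counts every sequence exactly.

```python
from collections import Counter


def percent_remaining_after_dedup(sequences):
    """Percentage of reads left if each distinct sequence is kept only once."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * len(counts) / total


# Made-up reads for illustration: one sequence occurs twice.
reads = ["ACGT", "ACGT", "TTTT", "GGGG"]
remaining = percent_remaining_after_dedup(reads)
print(remaining)  # 75.0
```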


9. PRESENCE OF ADAPTORS:

No adaptors found.

10. Kmer Content:

No k-mer is overrepresented.

11. OVERREPRESENTED SEQUENCES:

Not Found.

4: Alignment or Mapping of short reads onto the reference


Two major alignment tools are Bowtie2 and BWA. We will map/align the
reads with BWA to the UCSC human reference genome hg19.
NGS: Mapping > Map with BWA-MEM
From the options:
Using reference genome: set to hg19
Single or Paired-end reads: set to Single
Keep other options as default and click Execute
5: Sort the BAM file. From the Galaxy tools panel, select
NGS: SAM Tools > Sort BAM dataset
From the options:
BAM File: set to the BAM file output by the alignment
Sort by: Chromosomal coordinates
Keep other options as default and click Execute
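
Outside Galaxy, the alignment and sorting steps correspond roughly to the bwa and samtools command lines built below. This is only a sketch. It assumes bwa and samtools are installed, that a BWA index has already been built for the reference, and that the file names (hg19.fa, aligned.sam, aligned.sorted.bam) are placeholders chosen for illustration. The Galaxy tools may use additional options internally. The function only builds the commands and does not run them.

```python
import shlex


def alignment_commands(reference, fastq, sam="aligned.sam",
                       sorted_bam="aligned.sorted.bam"):
    """Build (but do not run) shell commands for single-end alignment and sorting."""
    q = shlex.quote  # protect file names that contain spaces or shell characters
    return [
        # bwa mem writes SAM to standard output, so redirect it to a file
        f"bwa mem {q(reference)} {q(fastq)} > {q(sam)}",
        # samtools sort orders by coordinate by default and writes BAM with -o
        f"samtools sort -o {q(sorted_bam)} {q(sam)}",
    ]


commands = alignment_commands(
    "hg19.fa", "NA12878.GAIIx.exome_chr22.1E6reads.76bp.fastq"
)
for command in commands:
    print(command)
```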

6: Convert BAM to SAM. To examine the sorted BAM file, we first need to convert it into the
human-readable SAM format. From the Galaxy tools panel, select
NGS: SAM Tools > BAM-to-SAM
From the options:
BAM File to Convert: set to the sorted BAM file
Keep other options as default and click Execute
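
Each alignment line in a SAM file has 11 mandatory tab-separated fields, and the FLAG field packs several yes/no properties into bits. The sketch below splits one line and decodes two of those bits (0x4 = read unmapped, 0x10 = read on the reverse strand). The function name and the example line are made up for illustration. Header lines, which start with "@", should be skipped before calling it.

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]


def parse_sam_line(line):
    """Split one SAM alignment line into its 11 mandatory fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


# A made-up alignment line for illustration (not taken from the tutorial data).
line = "read1\t16\tchr22\t17000000\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["is_reverse"], record["is_unmapped"])
```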
7: Study Mapping Statistics
We can generate some mapping statistics from the BAM file to assess the quality of our
alignment.

Run IdxStats
NGS: SAM Tools > IdxStats
From the options:
The BAM: select the sorted BAM file
Keep other options as default and click Execute
Output: a tab-delimited file with four columns. Each line consists of a
reference sequence name (e.g. a chromosome), the reference sequence length,
the number of mapped reads and the number of placed but unmapped reads.
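
Because the reads come from chromosome 22 exome data, a quick sanity check is the fraction of mapped reads that landed on chr22. The sketch below parses the four-column IdxStats output and computes that fraction. The read counts in the example are invented for illustration, and the helper names are hypothetical.

```python
def parse_idxstats(text):
    """Parse IdxStats output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def mapped_fraction(rows, name):
    """Fraction of all mapped reads that map to the named reference sequence."""
    total = sum(row[2] for row in rows)
    on_target = sum(row[2] for row in rows if row[0] == name)
    return on_target / total if total else 0.0


# Invented counts for illustration; the '*' line holds reads with no placement.
example = "chr21\t48129895\t100\t0\nchr22\t51304566\t900\t0\n*\t0\t0\t50\n"
rows = parse_idxstats(example)
print(mapped_fraction(rows, "chr22"))  # 0.9
```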

8: Visualize the BAM file.


Download the sorted BAM file.
From the Galaxy history panel, click on the sorted BAM file.
Click on the disk icon > Download dataset
Click on the disk icon > Download bam_index
This will result in two files being downloaded:
• a BAM file (*.bam), and
• a BAM index file (*.bai)
View the BAM file in IGV (Integrative Genomics Viewer)
(http://software.broadinstitute.org/software/igv/)
9: Calling single nucleotide variations (SNVs)
A pileup is essentially a column-wise representation of the aligned reads, at the
base level, against the reference. The pileup file summarizes all data from the reads
at each genomic position that is covered by at least one read. We will use the
sorted BAM file.
NGS: SAM Tools > Generate pileup
From the options:
Call consensus according to MAQ model = Yes
This generates a called 'consensus base' for each chromosomal position.
Keep other options as default and click Execute

Pileup format:
The pileup file we generated has 10 columns:
1. chromosome
2. position
3. current reference base
4. consensus base from the mapped reads
5. consensus quality
6. SNV quality
7. maximum mapping quality
8. coverage
9. bases within reads
10. quality values

Further information on (9): each character represents one of the following (the
longer this string, the higher the coverage):
• . = match on forward strand for that base
• , = match on reverse strand
• ACGTN = mismatch on forward
• acgtn = mismatch on reverse
• +[0-9]+[ACGTNacgtn]+ = insertion between this reference position and the next
• -[0-9]+[ACGTNacgtn]+ = deletion between this reference position and the next
• ^ = start of read
• $ = end of read

Further information on (10): BaseQualities = one character per base in
ReadBases, ASCII-encoded Phred scores
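
The read-bases symbols listed above can be decoded with a small scanner. The sketch below counts matches, mismatches, insertions and deletions in one read-bases string. It relies on one samtools convention not listed above: the character immediately after ^ encodes the read's mapping quality and is not a base. The function name and the example string are made up for illustration.

```python
import re


def summarize_read_bases(read_bases):
    """Count matches, mismatches, insertions and deletions in a pileup read-bases string."""
    counts = {"match": 0, "mismatch": 0, "insertion": 0, "deletion": 0}
    i = 0
    while i < len(read_bases):
        ch = read_bases[i]
        if ch == "^":
            i += 2  # skip '^' and the mapping-quality character that follows it
        elif ch in "+-":
            digits = re.match(r"\d+", read_bases[i + 1:])
            if digits is None:
                raise ValueError("indel marker without a length at index %d" % i)
            indel_length = int(digits.group())
            counts["insertion" if ch == "+" else "deletion"] += 1
            # skip the sign, the length digits and the inserted/deleted bases
            i += 1 + len(digits.group()) + indel_length
        elif ch in ".,":
            counts["match"] += 1
            i += 1
        elif ch in "ACGTNacgtn":
            counts["mismatch"] += 1
            i += 1
        else:
            i += 1  # '$' (end of read) and any other symbol carry no base call
    return counts


# Made-up string: 5 matches, 1 mismatch (A), a 2-base deletion and a 1-base insertion.
summary = summarize_read_bases("..,^F.A$-2ct,+1g")
print(summary)
```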

Convert to pileup file: The output file above is in tabular format. For the processing
below, we need to convert it to pileup format. To do so, click on the pencil icon
(Edit attributes) for the pileup file and change the datatype attribute to pileup.
The next step will operate on this converted file.
SNV Filtering
NGS: SAM Tools > Filter Pileup
From the options:
which contains = Pileup with ten columns (with consensus)
Do not report positions with coverage lower than = 10
Convert coordinates to intervals = Yes
Keep other options as default and click Execute
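
The coverage cut-off and the coordinate conversion can be sketched as follows. This is an illustration of the idea, not the Galaxy tool's code. It assumes coverage is in column 8 of the ten-column pileup and that an interval is written as a zero-based start and an exclusive end, as in BED. The function name and the two example lines are invented.

```python
def filter_pileup(lines, min_coverage=10):
    """Yield (chrom, start, end) for pileup lines with coverage >= min_coverage."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        coverage = int(fields[7])  # column 8: number of reads covering the position
        if coverage < min_coverage:
            continue
        position = int(fields[1])  # column 2: 1-based position
        # assumed interval convention: zero-based start, exclusive end
        yield fields[0], position - 1, position


# Invented pileup lines for illustration: coverage 12 and coverage 3.
lines = [
    "chr22\t100\tA\tG\t30\t30\t60\t12\tGGgGgGGgGgGg\tIIIIIIIIIIII",
    "chr22\t200\tC\tC\t30\t0\t60\t3\t.,.\tIII",
]
intervals = list(filter_pileup(lines))
print(intervals)  # [('chr22', 99, 100)]
```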

10: Calling INDELS (INsertions-DELetionS)


NGS: SAM Tools > Generate pileup
From the options:
Select the BAM file to generate the pileup file for = sorted BAM file
Whether or not to print only output pileup lines containing indels = Print only lines
containing indels
Call consensus according to MAQ model? = Yes
Keep other options as default and click Execute
Filter for high quality INDELS: Filter and Sort > Filter
From the options:
With following condition = c7>50 and c11>20 # c7 for quality, c11 for coverage
Keep other options as default and click Execute
# Filtering removes about 83% of the lines, leaving about 17% of the data (~700 INDELS)
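
The filter condition refers to columns by number, starting at c1. A rough Python equivalent of c7>50 and c11>20 on tab-separated rows is sketched below. The function name is hypothetical and the three rows are invented: only the values in columns 7 and 11 matter here, and the other fields are placeholders.

```python
def filter_rows(rows, min_c7=50, min_c11=20):
    """Keep rows where column 7 > min_c7 and column 11 > min_c11 (columns count from 1)."""
    kept = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip rows that are too short to test
        c7 = int(fields[6])
        c11 = int(fields[10])
        if c7 > min_c7 and c11 > min_c11:
            kept.append(fields)
    return kept


# Invented rows for illustration; only columns 7 and 11 are meaningful here.
rows = [
    "\t".join(["chr22", "1000", "*", "+A/+A", "60", "60", "55", "30", "+A", "*", "25"]),
    "\t".join(["chr22", "2000", "*", "-T/-T", "20", "20", "40", "30", "-T", "*", "25"]),
    "\t".join(["chr22", "3000", "*", "*/+G", "60", "60", "60", "12", "*", "+G", "8"]),
]
kept = filter_rows(rows)
print(len(kept), "of", len(rows), "rows kept")
```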

To visualize these indels, we need to convert the file from tabular to BED. This is a two-step
process. Click the pencil icon; under the Datatype tab, choose Interval and save; under the
Attributes tab, make sure End column = 2.
Next, convert the Interval file to BED format. Click the pencil icon; under the
Convert Format tab, choose Convert Genomic Interval to BED. Rename the result to
indels.filtered

Download the BED file and open it in the IGV genome browser.
Try looking at region chr22:31,854,409-31,854,460
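
A region string of this form can also be parsed in a script, for example to check which filtered indels fall inside it. A minimal sketch with a hypothetical helper name:

```python
import re


def parse_region(region):
    """Parse 'chrom:start-end' (commas allowed in numbers) into (chrom, start, end)."""
    match = re.fullmatch(r"(\S+):([\d,]+)-([\d,]+)", region.strip())
    if match is None:
        raise ValueError("unrecognized region string: %r" % region)
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start > end:
        raise ValueError("region start is after region end")
    return chrom, start, end


chrom, start, end = parse_region("chr22:31,854,409-31,854,460")
print(chrom, start, end, end - start + 1)  # chr22 31854409 31854460 52
```

The region is 52 bases wide, which is narrow enough to see individual read bases in IGV.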
