Analysis of Sequence-Based Copy Number Variation Detection Tools for Cancer Studies


Sheida Nabavi, Zhengqiu Cai, Peter J. Tonellato
Laboratory for Personalized Medicine, Center for Biomedical Informatics, Harvard Medical School
Why focus on CNV?

CNVs (amplifications and deletions of chromosomal segments) are major structural variations.
CNVs have been associated with various diseases, including cancers

Amplification of oncogenes and loss of tumor suppressor genes
NGS is emerging as the best platform for CNV detection

Array-based technologies have been used to study CNVs
Advantages: affordable; established; high-resolution arrays
Disadvantages: noisy hybridization; limited resolution; predefined probes
Sequence-based technologies (using next-generation sequencing, NGS) are emerging
Advantages: higher resolution and accuracy; single platform; rapid cost reduction
Disadvantages: currently expensive; no standard analysis; computationally demanding
Objectives of this work

Indicate the performance characteristics of recent sequence-based CNV detection tools to facilitate tool selection.
Identify strengths and weaknesses of current CNV tools.
Address limitations.
Identify modifications to improve performance.
Develop a new method.
Typical pipeline for read depth-based CNV detection

CNV detection: NGS data -> Alignment -> Counting reads -> Normalization -> Segmentation -> CNV list
NGS data
@SRR034720.3267591 length=36
ATTATTTTATGTTATTTATTTTGTATGTTTTTTTTT
+
88888888888888888888885888%888888/8)
@SRR034720.3267592 length=36
TCGGGAACGTCTCGACCGAAATTATTTTGTATGTCT
+
8888788888888888878-188288881878"888
...
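The records above follow the four-line FASTQ layout (header, bases, separator, qualities). A minimal Python sketch of reading such records is given here, assuming an uncompressed file with exactly four lines per record. The function name `read_fastq` is illustrative and is not part of any tool evaluated in this work.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Keep only the read identifier, dropping the '@' and the description.
        yield header[1:].split()[0], seq, qual


# First record from the slide, used as a small in-memory example.
example = io.StringIO(
    "@SRR034720.3267591 length=36\n"
    "ATTATTTTATGTTATTTATTTTGTATGTTTTTTTTT\n"
    "+\n"
    "88888888888888888888885888%888888/8)\n"
)
records = list(read_fastq(example))
```

Each yielded tuple can then be passed to an aligner wrapper or used for basic quality checks such as read-length validation.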
[Figure: short reads aligned to the reference genome, with the read count taken per counting window]
The number of reads that align to a position in a genome is proportional to the copy number at that position.
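The counting step can be sketched as follows, assuming reads are already aligned and each read is represented only by its 0-based leftmost position on one chromosome. The function name and the fixed, non-overlapping window scheme are assumptions for illustration; several of the tools compared here use other windowing strategies.

```python
from collections import Counter
from typing import Dict, Iterable


def count_reads_per_window(read_starts: Iterable[int], window_size: int) -> Dict[int, int]:
    """Count reads per fixed-size, non-overlapping window.

    read_starts: 0-based leftmost alignment positions on one chromosome.
    Returns {window_index: read_count}; windows without reads are absent.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return dict(Counter(pos // window_size for pos in read_starts))


# Toy positions: three reads in window 0, three in window 1, one in window 2.
starts = [5, 120, 980, 1001, 1500, 1999, 2000]
counts = count_reads_per_window(starts, window_size=1000)
```

Under the proportionality assumption stated above, a window with roughly double the expected count suggests a gain and a window with roughly half suggests a loss, before any bias correction.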
Normalization reduces biases

[Figure: aligned short reads for the sample and the control, showing a somatic amplification and a somatic deletion]
Main sources of bias are GC content, mappability, and sample preparation.
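One common way to reduce these biases is to rescale window counts by GC bin and then compare the sample against the control. The sketch below is an assumed, generic scheme and is not the normalization used by any specific tool in this comparison. The function names, the 0.05 GC bin width, and the 0.5 pseudocount are all illustrative choices.

```python
import math
from collections import defaultdict
from statistics import median
from typing import List, Sequence


def gc_correct(counts: Sequence[float], gc: Sequence[float], bin_width: float = 0.05) -> List[float]:
    """Rescale each window so that every GC bin has the same median read count."""
    bins = defaultdict(list)
    for c, g in zip(counts, gc):
        bins[int(g / bin_width)].append(c)
    overall = median(counts)
    corrected = []
    for c, g in zip(counts, gc):
        bin_median = median(bins[int(g / bin_width)])
        # Windows in a bin with zero median cannot be rescaled; set them to 0.
        corrected.append(c * overall / bin_median if bin_median > 0 else 0.0)
    return corrected


def log2_ratio(sample: Sequence[float], control: Sequence[float], pseudo: float = 0.5) -> List[float]:
    """Per-window log2(sample / control) after scaling the sample to the control's total depth."""
    if sum(sample) == 0:
        raise ValueError("sample has no reads")
    scale = sum(control) / sum(sample)
    return [math.log2((s * scale + pseudo) / (c + pseudo)) for s, c in zip(sample, control)]


# Toy data: two low-GC windows with high counts, two high-GC windows with low counts.
corrected = gc_correct([100, 110, 50, 60], [0.42, 0.43, 0.62, 0.63])
ratios = log2_ratio([100, 200, 50], [100, 100, 100])
```

After correction both GC bins share the same median, and the log2 ratios are positive where the sample is enriched relative to the control and negative where it is depleted.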
Read depth-based CNV detection tools

CNVnator
  Method: analyze depth of coverage (DOC), mean-shift technique for segmentation, correct GC bias
  SE/PE: SE, PE
  Control: No
  Reference: Abyzov et al. (2011), Genome Res, 21, 974-984
ReadDepth
  Method: analyze DOC, circular binary segmentation algorithm, use negative binomial distribution, use PE information
  SE/PE: SE, PE
  Control: No
  Reference: Miller et al. (2011), PLoS One, 6, e16327
FREEC
  Method: analyze DOC, use LASSO-based algorithm for segmentation, normalize for GC content
  SE/PE: SE, PE
  Control: Yes/No
  Reference: Boeva et al. (2011), Bioinformatics, 27, 268-269
CNVer
  Method: analyze DOC with paired-end mapping information, repeat graph algorithm
  SE/PE: PE
  Control: No
  Reference: Medvedev et al. (2010), Genome Res, 20, 1613-1622
CNV-seq
  Method: analyze DOC, fixed window, no segmentation
  SE/PE: SE, PE
  Control: Yes
  Reference: Xie and Tammi (2009), BMC Bioinformatics, 10, 80
SegSeq
  Method: analyze DOC, extend a window to include a fixed number of reads
  SE/PE: SE, PE
  Control: Yes
  Reference: Chiang et al. (2009), Nat Methods, 6, 99-103
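To make the segmentation step concrete, here is a deliberately naive threshold-and-merge sketch. It is not mean-shift, circular binary segmentation, or the LASSO-based method used by the tools listed above; it only illustrates how per-window values become a CNV list. The input (per-window log2 ratios of sample to control read counts), the function name, and the thresholds of +0.3 and -0.3 are assumptions.

```python
from typing import List, Sequence, Tuple


def call_segments(
    log_ratios: Sequence[float],
    window_size: int,
    gain: float = 0.3,
    loss: float = -0.3,
) -> List[Tuple[int, int, str]]:
    """Merge runs of adjacent windows with the same call into (start, end, type) segments."""
    labels = [
        "amplification" if r >= gain else "deletion" if r <= loss else "neutral"
        for r in log_ratios
    ]
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the data or when the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != "neutral":
                segments.append((start * window_size, i * window_size, labels[start]))
            start = i
    return segments


# Toy profile: a two-window gain followed by a three-window loss.
ratios = [0.0, 0.5, 0.6, 0.1, -0.8, -0.7, -0.9, 0.0]
segments = call_segments(ratios, window_size=1000)
```

A hard threshold like this is sensitive to noise in single windows, which is one reason the published tools rely on statistical segmentation instead.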
Cell line and synthesized NGS datasets

Eight breast cancer cell line NGS datasets
Cell line    Read length  Single end/paired end  PF reads    Coverage  Illumina platform
BT-20        50           PE                     13,822,800  0.43X     GA II
BT-474       50           PE                     15,787,700  0.49X     GA II
MDA-MB-231   50           PE                     12,372,400  0.39X     GA II
MDA-MB-468   50           PE                     13,265,400  0.41X     GA II
MCF-7        50           PE                     9,838,800   0.31X     GA II
T47D         50           PE                     13,514,100  0.42X     GA II
ZR-75-1      50           PE                     13,484,800  0.42X     GA II
HCC1143      36           SE                     15,038,736  0.25X     1G
Six synthetic NGS datasets

500 amplifications and 500 deletions with random lengths from 500 bp to 1 Mbp on chr1
Coverage: 0.5x, 2x, 5x, 10x, 20x, 50x
50-base-pair short reads generated using SAMtools wgsim
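The relation between coverage and read count is C = N * L / G, where N is the number of reads, L the read length, and G the length of the sequenced region. The sketch below computes how many 50 bp reads each coverage level implies. The chr1 length used here is an assumed round figure rather than the exact reference length, and the function name is illustrative. The sketch does not invoke wgsim.

```python
import math


def reads_for_coverage(coverage: float, genome_length: int, read_length: int) -> int:
    """Smallest N such that N * read_length / genome_length >= coverage."""
    return math.ceil(coverage * genome_length / read_length)


CHR1_LENGTH = 250_000_000  # assumed round figure, not the exact reference length
READ_LENGTH = 50           # read length stated for the synthetic datasets

# Read counts implied by each coverage level listed for the synthetic datasets.
needed = {cov: reads_for_coverage(cov, CHR1_LENGTH, READ_LENGTH) for cov in (0.5, 2, 5, 10, 20, 50)}
```

For paired-end simulation the number of read pairs would be half of these values, since each pair contributes two reads.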
There is no consistent agreement among the tools

[Figure: CNV size distributions for the cell lines, as percentages per size bin: 0-1k, 1k-2k, 2k-5k, 5k-10k, 10k-50k, 50k-100k, 100k-500k, 500k-1000k, >1000k]
For the cell lines, the size, number, span, and type of detected CNVs differ across the tools.
Few consensus genes among tools
[Figure: amplifications and deletions per tool. Tools: 1: SegSeq, 2: CNVer, 3: CNVnator, 4: FREEC, 5: CNV-seq, 6: ReadDepth]
Tools show high sensitivity for cell line datasets

[Figure: sensitivity and precision (0% to 100%) per tool for MCF7]
Benchmark: CNVs common to two published results. Detected CNVs with more than 70% overlap with the benchmark CNVs are called true positives.
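The overlap criterion can be sketched as below. The slides do not state how the 70% is measured, so this sketch makes two assumptions: a call is a true positive when more than 70% of its own length lies inside a single benchmark CNV, and a benchmark CNV counts as found when at least one true-positive call matches it. CNV type is ignored, and all names are illustrative.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def evaluate(detected: List[Interval], benchmark: List[Interval], min_frac: float = 0.7) -> Tuple[float, float]:
    """Return (sensitivity, precision) under the assumed overlap rule."""
    true_calls = 0
    found = set()
    for d in detected:
        # Benchmark CNVs that cover more than min_frac of this call's length.
        hits = [i for i, b in enumerate(benchmark) if overlap(d, b) > min_frac * (d[1] - d[0])]
        if hits:
            true_calls += 1
            found.update(hits)
    sensitivity = len(found) / len(benchmark) if benchmark else 0.0
    precision = true_calls / len(detected) if detected else 0.0
    return sensitivity, precision


# Toy example: one good call, one call that barely touches a benchmark CNV, one spurious call.
benchmark = [(1000, 2000), (5000, 6000)]
detected = [(1100, 1900), (4000, 5100), (8000, 9000)]
sensitivity, precision = evaluate(detected, benchmark)
```

Measuring the overlap relative to the benchmark CNV, or requiring reciprocal overlap, would give different numbers, so the rule should be fixed before tools are compared.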
There are many false positive CNVs

[Figure: Chr20, FREEC. Tracks: detected CNV, sample read count, normal read count, benchmark CNV region]
Sensitivity is higher than precision for high-coverage synthesized datasets

[Figure: two panels, sensitivity and precision (0% to 100%) versus coverage (0.5x, 2x, 5x, 10x, 20x, 50x) for SegSeq, CNVnator, CNV-seq, CNVer, FREEC, and ReadDepth]
Benchmark: known synthesized CNVs. Detected CNVs with more than 70% overlap with the benchmark CNVs are called true positives.
Conclusions
The CNV results across the tools are not consistent.
Most of the tools show high sensitivity and breakpoint accuracy; however, their precision is not high.
Tools with advanced algorithms, such as CNVnator and FREEC, perform better but are computationally more expensive.
Tools that utilize paired-end information, such as CNVer and ReadDepth, detect CNVs more accurately.
Development of more efficient and accurate tools is required.
Acknowledgements
Laboratory for Personalized Medicine (LPM)
Peter Tonellato, Zhengqiu Cai, Erik Gafni, Vincent Fusaro, Chih-Lin Chi, Michiyo Yamada, Jessica Correia, Matthew Crawford
