
Analysis of sequence-based copy number variation detection tools for cancer studies


Sheida Nabavi, Zhengqiu Cai, Peter J. Tonellato
Laboratory for Personalized Medicine, Center for Biomedical Informatics, Harvard Medical School

Why focus on CNV?


CNVs (amplifications and deletions of chromosomal segments) are major structural variations.

[Figure: amplification]

CNVs have been associated with various diseases, including cancers.
Amplification of oncogenes and loss of tumor suppressor genes

NGS is emerging as the best platform for CNV detection


Array-based technologies have been used to study CNVs
Advantages: Affordable; Established; High-resolution arrays
Disadvantages: Noisy; Hybridization; Limited resolution; Predefined probes

Sequence-based technologies (using next-generation sequencing (NGS)) are emerging
Advantages: Higher resolution and accuracy; Single platform; Rapid cost reduction
Disadvantages: Currently expensive; No standard analysis; Computationally demanding

Objectives of this work


Indicate the performance characteristics of recent sequence-based CNV detection tools to facilitate tool selection.
Identify strengths and weaknesses of current CNV tools.
Address limitations
Identify modifications to improve performance
Develop new method

Typical pipeline for read depth-based CNV detection


NGS data -> Alignment -> CNV detection (Counting reads -> Normalization -> Segmentation) -> CNV list

NGS data
@SRR034720.3267591 length=36
ATTATTTTATGTTATTTATTTTGTATGTTTTTTTTT
+
88888888888888888888885888%888888/8)
@SRR034720.3267592 length=36
TCGGGAACGTCTCGACCGAAATTATTTTGTATGTCT
+
8888788888888888878-188288881878"888
. . .

[Figure: short reads aligned to the reference genome, with the read count taken in each counting window]

The number of reads that align to a position in a genome is proportional to the copy number at that position
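The proportionality between read count and copy number can be illustrated with a minimal Python sketch. It assumes reads are represented only by their alignment start positions, that windows have a fixed size, and that the median window is diploid; the function names and the baseline of two copies are illustrative assumptions, not part of any of the evaluated tools.

```python
from collections import Counter
from statistics import median


def count_reads_per_window(read_starts, window_size):
    """Count reads whose start position falls in each fixed-size window."""
    counts = Counter(pos // window_size for pos in read_starts)
    n_windows = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_windows)]


def estimate_copy_number(counts, baseline_copies=2):
    """Scale window counts so that the median window equals baseline_copies.

    Assumption: most of the genome is at the baseline copy number, so the
    median read count corresponds to that baseline.
    """
    typical = median(counts)
    if typical == 0:
        raise ValueError("median read count is zero; coverage is too low")
    return [baseline_copies * c / typical for c in counts]


# Hypothetical alignment start positions on one chromosome.
starts = [5, 20, 110, 120, 130, 150, 210, 250]
window_counts = count_reads_per_window(starts, window_size=100)
copies = estimate_copy_number(window_counts)
```

In this toy input the middle window holds twice as many reads as its neighbours, so its estimated copy number is twice the baseline.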

Normalization reduces biases


[Figure: aligned short reads in the sample and the control, showing a somatic amplification and a somatic deletion]

Main sources of bias are GC content, mappability, and sample preparation
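Two common ways to reduce these biases can be sketched as follows: rescaling window counts so that windows with similar GC content share the same median, and taking a library-size-scaled log2 ratio of sample to control. This is a simplified illustration under stated assumptions (fixed GC bins of width 0.05, a small pseudocount), not the normalization implemented by any specific tool, and it does not address mappability.

```python
import math
from statistics import median


def gc_correct(counts, gc_fractions, bin_width=0.05):
    """Rescale counts so every GC bin has the same median as the whole set."""
    overall = median(counts)
    bins = {}
    for count, gc in zip(counts, gc_fractions):
        bins.setdefault(int(gc / bin_width), []).append(count)
    bin_medians = {b: median(values) for b, values in bins.items()}
    corrected = []
    for count, gc in zip(counts, gc_fractions):
        m = bin_medians[int(gc / bin_width)]
        # Windows in a bin with zero median carry no usable signal.
        corrected.append(count * overall / m if m > 0 else 0.0)
    return corrected


def log2_ratios(sample, control, pseudocount=0.5):
    """Per-window log2(sample / control) after matching total read counts."""
    scale = sum(control) / sum(sample)
    return [
        math.log2((s * scale + pseudocount) / (c + pseudocount))
        for s, c in zip(sample, control)
    ]


# Hypothetical window counts: GC-rich windows are over-represented.
corrected = gc_correct([10, 10, 20, 20], [0.32, 0.33, 0.52, 0.53])
ratios = log2_ratios([20, 40, 20, 20], [10, 10, 10, 10])
```

Using a matched control cancels biases that affect both libraries equally, which is why tools that require a control can skip explicit GC correction in this simplified view.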

Read depth-based CNV detec7on Tools


Tool | Method | SE/PE | Control | Reference
CNVnator | Analyze depth of coverage (DOC), mean-shift technique for segmentation, correct GC bias | SE, PE | No | Abyzov et al. (2011), Genome Res, 21, 974-984
ReadDepth | Analyze DOC, circular binary segmentation algorithm, use negative binomial distribution, use PE information | SE, PE | No | Miller et al. (2011), PLoS One, 6, e16327
FREEC | Analyze DOC, use LASSO-based algorithm for segmentation, normalize for GC content | SE, PE | Yes/No | Boeva et al. (2011), Bioinformatics, 27, 268-269
CNVer | Analyze DOC with paired-end mapping information, repeat graph algorithm | PE | No | Medvedev et al. (2010), Genome Res, 20, 1613-1622
CNV-seq | Analyze DOC, fixed window, no segmentation | SE, PE | Yes | Xie and Tammi (2009), BMC Bioinformatics, 10, 80
SegSeq | Analyze DOC, extend a window to include a fixed number of reads | SE, PE | Yes | Chiang et al. (2009), Nat Methods, 6, 99-103
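To make the role of the segmentation step concrete, here is a deliberately simple Python stand-in: it labels each window by thresholding its log2 ratio (sample over control) and merges runs of adjacent windows with the same label into one CNV. This is not mean-shift, circular binary segmentation, or the LASSO-based method named in the table; the thresholds of +0.3 and -0.3 and the function name are arbitrary illustrative assumptions.

```python
def call_segments(log2_ratios, window_size, gain=0.3, loss=-0.3):
    """Merge adjacent windows with the same state into (start, end, type)."""

    def state(ratio):
        if ratio >= gain:
            return "amplification"
        if ratio <= loss:
            return "deletion"
        return "neutral"

    segments = []
    run_start, run_state = 0, None
    # A trailing None acts as a sentinel that closes the last run.
    for i, ratio in enumerate(list(log2_ratios) + [None]):
        current = state(ratio) if ratio is not None else None
        if current != run_state:
            if run_state in ("amplification", "deletion"):
                segments.append((run_start * window_size, i * window_size, run_state))
            run_start, run_state = i, current
    return segments


# Hypothetical per-window log2 ratios with 1 kb windows.
cnvs = call_segments([0.0, 0.5, 0.6, 0.0, -0.8, -0.9], window_size=1000)
```

A thresholding approach like this is sensitive to noise in single windows, which is one reason the tools in the table use statistical segmentation instead.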

Cell line and synthesized NGS datasets


Eight breast cancer cell line NGS datasets
Cell line | Read length | Single end/paired end | PF reads | Coverage | Illumina platform
BT-20 | 50 | PE | 13,822,800 | 0.43X | GA II
BT-474 | 50 | PE | 15,787,700 | 0.49X | GA II
MDA-MB-231 | 50 | PE | 12,372,400 | 0.39X | GA II
MDA-MB-468 | 50 | PE | 13,265,400 | 0.41X | GA II
MCF-7 | 50 | PE | 9,838,800 | 0.31X | GA II
T47D | 50 | PE | 13,514,100 | 0.42X | GA II
ZR-75-1 | 50 | PE | 13,484,800 | 0.42X | GA II
HCC1143 | 36 | SE | 15,038,736 | 0.25X | 1G

Six synthetic NGS datasets


500 amplifications and 500 deletions with random lengths from 500 bp to 1 Mbp on chr1
Coverage: 0.5x, 2x, 5x, 10x, 20x, 50x
50 bp short reads generated using SAMtools wgsim

There is no consistent agreement among the tools


[Figure: CNV size distributions per tool, as a percentage of calls in each size bin: 0-1k, 1k-2k, 2k-5k, 5k-10k, 10k-50k, 50k-100k, 100k-500k, 500k-1000k, >1000k]
For the cell lines, the size, number, span, and type of detected CNVs differ across the tools.

Few consensus genes among tools

[Figure: amplifications and deletions detected by each tool. 1: SegSeq, 2: CNVer, 3: CNVnator, 4: FREEC, 5: CNV-seq, 6: ReadDepth]

Tools show high sensitivity for cell line datasets


[Figure: sensitivity and precision for MCF7]

Benchmarks: common CNVs in two published results
Detected CNVs with more than 70% overlap with the benchmark CNVs are called true positives
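The 70% overlap rule and the resulting sensitivity and precision can be sketched in Python as follows. The slides do not state how overlap is measured, so this sketch assumes it is the fraction of the benchmark CNV covered by a detected CNV, uses half-open (start, end) intervals on one chromosome, and ignores CNV type; all names are illustrative.

```python
def overlap_fraction(call, truth):
    """Fraction of the benchmark interval covered by the detected interval."""
    start = max(call[0], truth[0])
    end = min(call[1], truth[1])
    return max(0, end - start) / (truth[1] - truth[0])


def sensitivity_precision(calls, truths, min_overlap=0.7):
    """Return (sensitivity, precision) under the overlap rule.

    Sensitivity: share of benchmark CNVs matched by at least one call.
    Precision: share of calls that match at least one benchmark CNV.
    """
    matched_truths = set()
    true_positive_calls = 0
    for call in calls:
        hit = False
        for i, truth in enumerate(truths):
            if overlap_fraction(call, truth) > min_overlap:
                matched_truths.add(i)
                hit = True
        true_positive_calls += hit
    sensitivity = len(matched_truths) / len(truths) if truths else 0.0
    precision = true_positive_calls / len(calls) if calls else 0.0
    return sensitivity, precision


# Hypothetical benchmark CNVs and detected CNVs.
benchmark = [(100, 200), (500, 600)]
detected = [(90, 190), (550, 700), (1000, 1100)]
sens, prec = sensitivity_precision(detected, benchmark)
```

Here one of two benchmark CNVs is recovered and one of three calls is a true positive, so many spurious calls lower precision without affecting sensitivity.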

There are many false positive CNVs


[Figure: Chr20, FREEC: detected CNVs, sample read count, normal read count, and benchmark CNV regions]


Higher sensitivity compared to precision for high-coverage synthesized datasets


[Figure: sensitivity and precision versus coverage (0.5x, 2x, 5x, 10x, 20x, 50x) for SegSeq, CNVnator, CNV-seq, CNVer, FREEC, and ReadDepth]

Benchmark: known synthesized CNVs
Detected CNVs with more than 70% overlap with the benchmark CNVs are called true positives

Conclusions
The CNV results across the tools are not consistent.
Most of the tools show high sensitivity and breakpoint accuracy; however, their precision is not high.
Tools with advanced algorithms, such as CNVnator and FREEC, perform better; however, they are computationally more expensive.
Tools that utilize paired-end information, such as CNVer and ReadDepth, detect CNVs more accurately.
Development of more efficient and accurate tools is required.

Acknowledgements
Laboratory of Personalized Medicine (LPM)
Peter Tonellato, Zhengqiu Cai, Erik Gafni, Vincent Fusaro, Chih-Lin Chi, Michiyo Yamada, Jessica Correia, Matthew Crawford
