
Data Deduplication in a Virtual Tape Library Environment

Mathias Defiebre
IBM Lab Services
mathias.defiebre@de.ibm.com

STG Technical Conferences 2010

2010 IBM Corporation


Agenda
• Data Deduplication Overview
• Data Deduplication Theory
• Data Deduplication Approaches in Practice
• Data Deduplication Considerations and Value Proposition
• TS7650 ProtecTIER Deduplication Gateway
• TS7650 ProtecTIER Deduplication Appliance Series
• A Look into the Future
• Links

Data Deduplication Overview

• With data deduplication, repeated instances of identical data are identified and stored only once
  - Identical data is referenced to a single instance
  - Saves storage capacity and network bandwidth
• Data deduplication is a feature of a storage device or an application
  - VTL, NAS box, backup application
• Data deduplication requires an I/O protocol
  - FCP, iSCSI, CIFS, NFS, API, tape library emulation
• Data deduplication does not always make sense
  - Not all data can be deduplicated well
  - May interfere or work together with other technologies such as compression and encryption, or with data security requirements
• Data deduplication is transparent
  - To end users and applications

Data Deduplication Theory


Data Deduplication Process (simplified)

A data object or stream is subject to deduplication.

(1) Data chunking: the data object is split into chunks (fixed or variable size)

    Example chunk sequence: A B C D A E F F D

(2) Identity determination: for each chunk an identity characteristic is determined

(3) Determining duplicates
    (3a) Identical chunks are referenced (pointer, reference)
    (3b) Non-identical chunks (single instances) are stored once
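
The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the method of any product: the function names `deduplicate` and `reconstruct` are made up for this example, SHA-256 is assumed as the identity characteristic, and collision handling is left out.

```python
import hashlib

def deduplicate(chunks):
    """Store each distinct chunk once and describe the stream as references."""
    store = {}        # identity (SHA-256 hex digest) -> the single stored chunk
    references = []   # one identity per incoming chunk, in stream order
    for chunk in chunks:
        identity = hashlib.sha256(chunk).hexdigest()   # step (2): identity
        if identity not in store:                      # step (3b): new chunk
            store[identity] = chunk
        references.append(identity)                    # step (3a): reference
    return store, references

def reconstruct(store, references):
    """Rebuild the original stream by following the references."""
    return b"".join(store[identity] for identity in references)

# Step (1) is assumed done: the slide's sequence is already split into chunks.
stream = [c.encode() for c in "A B C D A E F F D".split()]
store, references = deduplicate(stream)
print(len(stream), "chunks in,", len(store), "chunks stored")
```

For the slide's sequence, nine chunks arrive and six (A to F) are stored; the repeated A, F and D become references only.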


Methods for Data Chunking

Data Object / Stream

1. File based
   - One chunk is one file; most appropriate for file systems
2. Block based
   - Data object is chunked into blocks of fixed or variable size
   - Used by block storage devices
3. Format aware (content aware)
   - Understands explicit data formats and chunks data objects according to the format
   - Example: breaking a PowerPoint presentation into separate slides
4. Format agnostic (content agnostic)
   - Chunking is based on an algorithm that looks for logical breaks or similar elements within a data object/stream

The chunking method influences the dedupe ratio.
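
Why the chunking method matters can be shown with a small experiment. The sketch below compares fixed-size blocks with a very simple content-defined (variable-size) chunker after one byte is inserted at the front of the data. Both chunkers, the window-sum boundary rule and all parameter values are assumptions chosen for illustration; real products use more robust rolling hashes.

```python
import random

def fixed_chunks(data, size=64):
    """Block-based chunking with a fixed block size."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data, window=16, divisor=64):
    """Cut after a byte whenever the sum of the last `window` bytes
    matches a pattern, so boundaries depend on content, not position."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if sum(data[end - window:end]) % divisor == divisor - 1:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])   # remaining tail
    return chunks

rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(4096))
shifted = b"X" + original   # same data with one byte inserted at the front

def shared(chunker):
    """Number of distinct chunks found in both versions of the data."""
    return len(set(chunker(original)) & set(chunker(shifted)))

print("fixed-size chunks shared:     ", shared(fixed_chunks))
print("content-defined chunks shared:", shared(content_defined_chunks))
```

With fixed blocks the inserted byte shifts every block boundary, so the two versions share practically nothing. The content-defined boundaries move with the data, so the chunks after the first boundary are found again.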



Methods for Determining Duplicates

Example chunk sequence: A B C D A E F F D

1. Hashing
   - Calculate a hash (MD5, SHA-256) for each data chunk (A, B, C, D in the figure)
   - Compare the hash with the hashes of existing data
   - Identical hash means most likely identical data
   - Hash collision: identical hash but non-identical data
   - Collisions must be handled through a secondary comparison (additional metadata, second hash method, additional binary comparison)
2. Binary Comparison
   - Compare all bits of similar chunks
3. Delta Differencing
   - Computes a delta between two similar chunks of data, where one chunk is the baseline and the second is stored as the delta
   - Since each delta is unique there is no possibility of collision
   - To reconstruct the original chunk, the delta(s) have to be re-applied to the baseline chunk
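
Delta differencing can be sketched with Python's standard `difflib` module. The helper names `make_delta` and `apply_delta` and the delta format (copy ranges plus literal data) are assumptions made for this illustration, not a format used by any particular product.

```python
from difflib import SequenceMatcher

def make_delta(baseline, target):
    """Describe `target` as copies from `baseline` plus literal new data."""
    delta = []
    matcher = SequenceMatcher(None, baseline, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))          # range of the baseline
        elif tag in ("replace", "insert"):
            delta.append(("data", target[j1:j2]))   # bytes not in the baseline
        # "delete": baseline bytes are simply not copied
    return delta

def apply_delta(baseline, delta):
    """Reconstruct the target chunk by re-applying the delta to the baseline."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += baseline[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)

baseline = b"The quick brown fox jumps over the lazy dog"
target = b"The quick red fox jumps over the very lazy dog"
delta = make_delta(baseline, target)
literal = sum(len(op[1]) for op in delta if op[0] == "data")
print("literal bytes in delta:", literal, "of", len(target))
```

Only the literal bytes need to be stored for the second chunk; everything else is a reference into the baseline, which is why the baseline must remain available for reconstruction.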


Data Deduplication Architectures

[Figure: Client, LAN, Server, LAN or SAN, Storage Device]

Client-side
  (+) Reduces load on Server
  (+) Reduces bandwidth on LAN
  (-) Adds load to Client
  (-) No cross-correlation among multiple clients

Server-side
  (+) Allows cross-correlation among multiple Clients
  (-) Adds load to Server

Storage-side
  (+) Transparent to Clients and Servers
  (+) Reduces load on Server and Clients
  (-) Adds load to Storage Device


Data Deduplication Processing Time

• In-line: data is deduplicated before it is actually stored
  (+) Requires less storage capacity
  (-) Potential decrease of I/O performance
• Post-processing: data is first stored and deduplicated later in the background
  (+) Better performance expected
  (-) Requires more storage capacity to temporarily store the data
  (-) Data is written, read and written again, thus more I/O intensive
  (-) Deduplication window must be coordinated with backup window
• Combination of in-line and post-processing
  - In-line as long as performance can be satisfied, then switch to post-processing


Data Deduplication Approaches in Practice


Practical Approaches Overview

• Practical approaches combine
  - Chunking method
  - Method for determining/checking identity
• Common practical approaches

[Figure: matrix of chunking method against identity check]
  Chunking (rows): Format Aware, Format Agnostic, Fixed/Variable Block Size
  Identity check (columns): Hashing, Delta Diff, Binary Diff
  Approaches placed in the matrix: Hash based, Content Aware, HyperFactor


Hash Based Approach

1. Slice data into chunks (fixed or variable)
2. Generate a hash per chunk (Ah, Bh, Ch, Dh, Eh)
3. Compare hashes with the hash table
   (hash table columns: Hash Value, Storage Locations, Object References)
4. For identical hashes store a reference; otherwise store the chunk and update the hash table
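
The four steps and the three hash table columns can be sketched as follows. The class names `Entry` and `HashStore`, the tiny fixed chunk size and the choice of SHA-256 are assumptions for illustration; the secondary comparison needed to rule out hash collisions is omitted to keep the sketch short.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Entry:
    location: int                                    # storage location of the chunk
    references: list = field(default_factory=list)   # (object name, chunk position)

class HashStore:
    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.repository = []   # unique chunks, addressed by index
        self.table = {}        # hash value -> Entry

    def write(self, name, data):
        for pos in range(0, len(data), self.chunk_size):     # step 1: slice
            chunk = data[pos:pos + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()           # step 2: hash
            entry = self.table.get(digest)                    # step 3: look up
            if entry is None:                                 # step 4: store new chunk
                self.repository.append(chunk)
                entry = Entry(location=len(self.repository) - 1)
                self.table[digest] = entry
            entry.references.append((name, pos // self.chunk_size))

store = HashStore()
store.write("backup1", b"AAAABBBBCCCC")
store.write("backup2", b"AAAABBBBDDDD")
print(len(store.repository), "unique chunks stored for 6 chunks written")
```

The two objects share the chunks `AAAA` and `BBBB`, so four chunks are stored and the shared chunks each carry two object references.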


Assessment for Hash Based Approach

• Hash collisions must be handled
  - More overhead, especially for in-line deduplication
• Requires a hash table to store hashes for all chunks
  - Hash table will grow with data volume
• Hash table must be quickly searchable and accessible
  - Growing hash table may become a performance bottleneck (doesn't fit into RAM)
  - Scalability issues
• Hash table must be protected
  - One copy might not be sufficient

Example:
Chunk size of 8 KB, each hash is 20 bytes long
With a 1 TB repository:
  - 1 TB repository has ~134,000,000 chunks of 8 KB each
  - Need a pointer scheme to reference inside 1 TB
  => Hash table requires ~2.5 GB of memory: no issue
With a 100 TB repository:
  => Hash table requires ~250 GB of memory: performance problem!
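
The figures in the example can be checked with a short calculation. The sketch assumes binary units (1 TB taken as 2^40 bytes, 8 KB as 8192 bytes), which reproduces the slide's rounded numbers, and counts only the 20-byte hashes; pointers and other metadata would add to the total. The function name `hash_table_bytes` is made up for this example.

```python
TIB = 2 ** 40
GIB = 2 ** 30

def hash_table_bytes(repository_bytes, chunk_bytes=8 * 1024, hash_bytes=20):
    """Return (number of chunks, bytes needed for one hash per chunk)."""
    chunks = repository_bytes // chunk_bytes
    return chunks, chunks * hash_bytes

for size_tib in (1, 100):
    chunks, table = hash_table_bytes(size_tib * TIB)
    print(f"{size_tib:>3} TiB: {chunks:,} chunks, hash table {table / GIB:.1f} GiB")
```

The table grows linearly with the repository, so a hundredfold larger repository needs a hundredfold larger hash table.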


HyperFactor Approach

• HyperFactor has two indexes
  - HyperFactor Index
  - Restore Index
• HyperFactor Index used for backup
  - Used to filter out similar elements from the incoming data stream
  - Fixed size of 4 GB, memory resident, synced to disk (repository) periodically
  - Can be restored from repository if lost
  - References up to 1 PB of physical data elements stored in the repository
• Restore Index used for restore
  - Includes references to physical data elements
  - Dynamic index, growing
  - Stored on disk (repository)


HyperFactor Approach

[Figure: a new data stream is compared with Element A, Element B and Element C read from storage]

1. Look through the data stream for similarity and filter similar elements
   - Using the HyperFactor Index (fixed size 4 GB)
2. Read the elements that are most similar from storage
   - Using the Restore Index
3. Binary compare the element in the stream with the element(s) read from storage
4. Identical data is referenced by a new additional entry in the Restore Index; unique data is stored in the repository
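
The general idea of "find something similar first, then compare byte by byte" can be illustrated with a toy model. This is explicitly not IBM's HyperFactor algorithm, whose internals are not described in these slides: the signature scheme (smallest window hashes), the class `SimilarityStore` and all parameters are assumptions invented for this sketch.

```python
import hashlib
from difflib import SequenceMatcher

def signature(element, window=8, keep=4):
    """Small similarity signature: the `keep` smallest hashes of all byte windows."""
    hashes = {hashlib.sha256(element[i:i + window]).digest()[:8]
              for i in range(len(element) - window + 1)}
    return sorted(hashes)[:keep]

class SimilarityStore:
    def __init__(self):
        self.index = {}      # signature hash -> element id (stand-in for a resident index)
        self.elements = []   # stored elements (stand-in for the repository)

    def most_similar(self, element):
        """Step 1: vote for stored elements that share signature hashes."""
        votes = {}
        for h in signature(element):
            if h in self.index:
                votes[self.index[h]] = votes.get(self.index[h], 0) + 1
        return max(votes, key=votes.get) if votes else None

    def write(self, element):
        """Return how many bytes are NOT covered by identical ranges."""
        new_bytes = len(element)
        candidate = self.most_similar(element)
        if candidate is not None:
            base = self.elements[candidate]                       # step 2: read
            matcher = SequenceMatcher(None, base, element, autojunk=False)
            matched = sum(b.size for b in matcher.get_matching_blocks())  # step 3
            new_bytes = len(element) - matched                    # step 4
        # Toy simplification: the whole element is kept so later writes can compare.
        self.elements.append(element)
        for h in signature(element):
            self.index.setdefault(h, len(self.elements) - 1)
        return new_bytes

store = SimilarityStore()
first = bytes(range(200))
print("first write, new bytes: ", store.write(first))
print("repeat write, new bytes:", store.write(first))
changed = first[:100] + b"CHANGED" + first[100:]
print("changed write, new bytes:", store.write(changed))
```

The point of the design is that the lookup structure holds only a few signature values per element, so its size is decoupled from the number of chunks, while the exact answer comes from the binary comparison.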

Assessment for HyperFactor

• No hash table required
  - No scalability issues
  - 4 GB index references up to 1 PB of physical data elements
• No dependency on data format and application
  - Very flexible, no ongoing development effort due to format changes
• HyperFactor Index always fits into memory
  - Enables enterprise-class high-performance in-line deduplication
• Eliminates the phenomenon of missed factoring opportunities
  - Looks for similarity between data, not for exact chunk matches


Data Deduplication Considerations and Value Proposition


Not All Data Dedupes Well

• High dedupe ratio expected for ...
  - Structured data
  - Database files
  - E-mails
• Low dedupe ratio expected for ...
  - Unstructured data
  - Images
  - Videos
  - Voice data
  - Seismic data
  - Large collections of small files
• Some technologies influence the dedupe ratio


Technologies Influencing Data Deduplication

• Compression
  - Archives
  - *.zip (Phil Katz zip: pkzip, pkunzip)
  - *.gz (GNU zip: gzip, gzip -d)
• Compaction
  - Lotus Notes database
• Multiplexing
  - Multiple backup streams to a single tape drive
  - Veritas Backup Exec
  - Computer Associates ARCserve
  - Oracle RMAN multiplexing of backup sets
• Encryption

The technologies above change the data stream, making identical data non-identical!

Example: Data Deduplication and Encryption

Data encryption prior to deduplication processing can subvert data reduction.

[Figure: three data sources each hold the same text file ("Important text"). Data source 1 uses no encryption, data source 2 encrypts with encryption key 1 ("txpt tnatroemI"), data source 3 encrypts with encryption key 2 ("te tarpIxtntom"). All three pass through data deduplication into the data store. Additional label: "Compression possible".]

1. Three data sources have the same text file
2. After encryption, the text files do not match
3. Deduplication processing does not detect redundancy
4. Text files are stored without data reduction
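
The effect can be reproduced with a toy cipher. The function `toy_encrypt` below is an assumption for illustration only (an XOR with a SHA-256-derived keystream); it is not secure and does not represent any product's encryption. It is enough to show that identical plaintext becomes non-identical once different keys are applied.

```python
import hashlib

def toy_encrypt(data, key):
    """XOR with a SHA-256-derived keystream. Illustration only, NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))

def unique_objects(objects):
    """Count distinct objects by hash, as a hash-based deduplicator would."""
    return len({hashlib.sha256(o).hexdigest() for o in objects})

text = b"Important text"
unencrypted = [text, text, text]
mixed = [text, toy_encrypt(text, b"key 1"), toy_encrypt(text, b"key 2")]

print("unique objects without encryption:", unique_objects(unencrypted))
print("unique objects with per-source keys:", unique_objects(mixed))
```

Without encryption the three copies collapse into one stored object; with a different key per source all three are stored in full.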

Dedupe Value Proposition & Potential Drawbacks

• Data deduplication value proposition
  - Disk storage savings
  - Network bandwidth savings
  - Energy savings (Green IT)
  - Better utilization of existing floor and rack space
  - Increased scalability
• Data deduplication potential drawbacks
  - Loss of one single data chunk may cause loss of multiple files
  - A repository or index is required to store metadata; it
      must be protected
      requires additional storage capacity
      may slow down performance
  - Loss of the entire index means loss of all data
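
The first drawback follows directly from chunk sharing and can be made concrete. In this sketch the file names, the chunk labels and the helper `affected_files` are hypothetical; each file is represented by the list of chunks it references.

```python
def affected_files(recipes, lost_chunk):
    """Return the files whose chunk list references the lost chunk."""
    return sorted(name for name, chunks in recipes.items() if lost_chunk in chunks)

# Each file is a list of references to shared, deduplicated chunks.
recipes = {
    "report.doc": ["A", "B", "C"],
    "report_v2.doc": ["A", "B", "D"],
    "notes.txt": ["E"],
}

print(affected_files(recipes, "A"))   # chunk shared by two files
print(affected_files(recipes, "E"))   # chunk used by one file
```

Losing the shared chunk `A` damages both versions of the report, whereas without deduplication each file would have had its own copy.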


TS7650 ProtecTIER Gateway


ProtecTIER Architecture Overview

[Figure: the Backup Server connects via FC to the ProtecTIER Server running the ProtecTIER Application, which presents a Virtual Tape Library ("It's a Tape Library and Drives") and uses a Disk Storage System.]

• Linux server-based application running on a System x server
• Emulates a tape library unit, including drives, cartridges, and robotics
• Uses a Fibre Channel (FC) attached disk storage system as the backup medium
• Has a built-in deduplication engine (HyperFactor)


[Figure: HyperFactor data flow. Backup Servers send a new data stream to the Virtual Tape Emulation on the ProtecTIER Server (HyperFactor, memory-resident index, Restore Index); an FC switch connects the server to the disk arrays (data storage) holding the existing data.]

• Filter out similar elements (using the memory-resident index: 4 GB, may contain predefined elements)
• Read similar elements from storage and compare
• Reference identical elements in the Restore Index
• Store unique elements (filtered data) on storage



Dedupe Ratio Depends on ...

• Data change rate
  - The percentage of data in the incoming backup data stream that is new for ProtecTIER and not already stored physically in the repository
• Backup policies
  - Number of full backups
  - Number of incremental backups
  - Backup frequency
  - Data retention period
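
How change rate and retention interact can be estimated with a deliberately simple model. The assumptions are not from the slides: only full backups of constant size are retained, the first backup is entirely new, each later backup contributes `change_rate` new data, and compression is ignored. The function name `dedupe_ratio` is made up for this example.

```python
def dedupe_ratio(full_backups, change_rate):
    """Nominal data written divided by physical data stored,
    measured in units of one full backup."""
    nominal = full_backups
    physical = 1 + (full_backups - 1) * change_rate
    return nominal / physical

for retained in (5, 10, 30):
    ratio = dedupe_ratio(retained, change_rate=0.10)
    print(f"{retained:>2} retained fulls at 10% change rate: {ratio:.2f}:1")
```

Under these assumptions a longer retention period or a lower change rate raises the ratio, which matches the two factors named on the slide.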


ProtecTIER Native Replication: Key New Feature in R2.3

[Figure: at the primary site, backup servers write to a ProtecTIER Gateway (represented capacity versus physical capacity). ProtecTIER IP replication transfers the data to a ProtecTIER Gateway at the secondary site (backup server, represented capacity, physical capacity).]

• Significant bandwidth reduction


TS7650 ProtecTIER Appliance Series


TS7650 Appliance Series

[Figure: rack layouts (F05 base frame) of the appliance configurations, built from X3850 M2 servers (3 x 6core, 24GB RAM), DS4700 and EXP810 disk enclosures, Ethernet switches (1U), a WTI switch, optional TSSC, and power features (Base, FC1903). Further labels in the figure: 6.3TB, 15.8TB, 31.5TB (twice), 500MB/sec (twice).]

Configurations:
• Standalone 4700: 32 spindle 450GB (2 drawer), 7TB, 100MB/sec
• Standalone 4700: 64 spindle 450GB (4 drawer), 18TB, 250MB/sec
• Standalone 4700: 128 spindle 450GB (8 drawer), 36TB, 450MB/sec
• Clustered 4700: 128 spindle 450GB (8 drawer), 36TB, 450MB/sec

Appliances can be upgraded one step forward ...


A Look into the Future ...


A Look into the Future

• Some observations from the VTL and dedupe market
  - Vendors converge to a common point
      Scalable appliances with multiple I/O interfaces (FCP, iSCSI, CIFS, NFS, library emulation)
  - Replication becomes more and more of a commodity
      Replication benefits from deduped data
  - Intelligent storage devices will be tightly integrated with 3rd party backup applications
      e.g. controlling & monitoring replication from a backup application


Links


Links I

• TS7650G ProtecTIER Deduplication Gateway
  http://www-03.ibm.com/systems/storage/tape/ts7650g/index.html
• TS7650 ProtecTIER Deduplication Appliance
  http://www-03.ibm.com/systems/storage/tape/ts7650a/index.html
• Whitepaper: IBM Data Deduplication Strategy and Operations
  http://www.ibm.com/developerworks/wikis/display/tivolistoragemanager/IBM+Tivoli+Storage+Manager+V6.1+Data+Deduplication+Strategy+and+Operations
• Redbook: The IBM System Storage TS7650G and TS7650 ProtecTIER Servers
  http://w3.itso.ibm.com/redpieces/abstracts/sg247652.html?Open


Links II

• TS7650G ProtecTIER Implementation Workshops
  - IBMer:
    https://w301.sso.ibm.com/learning/lms/Saba/Web/Main/goto/learningActivity?courseNum=SS92E1DE&deepLinkRedirect=false
  - Business Partner:
    http://www304.ibm.com/jct03001c/services/learning/ites.wss/de/de?pageType=course_description&includeNotScheduled=y&courseCode=SS92E1DE


Storage Competence at the Mainz Location

IBM Germany's fourth largest location offers you a broad portfolio of IBM System Storage Services.

IBM Dynamic Infrastructure Leadership Center for Information Infrastructure
• Business, Channel & Skill Enablement & Training
• DI Education & Briefings
• Demos & Showcases
• IT Transformation Roadmaps & Workshops
• BP Certification

IBM European Storage Competence Center & Systems Lab Europe
• Business, Channel & Skill Enablement & Training
• End-to-end client support
• Workshops
• Solution Design
• Lab Services
• Customer Relationship Management

IBM Executive Briefing Center & TMCC
• Business, Channel & Skill Enablement & Training
• Customer and Group Briefings
• Product & SW Demos
• Integrated Solution Demos
• Exhibition Support & Organization

IBM STG Europe Storage Software Development
• Software Development
• Storage & Tape
• Linux
• Mainframe
• File Systems


IBM System Storage Solutions Center of Excellence

We offer technical support from the planning phase through well after installation.

Our Services
• Client Briefings & Education
• Systems Lab Services & Training
• Customized Workshops
• System Storage Demos
• Advanced Technical Support
• Solution Design
• Proof of Concepts
• Benchmarks
• Product Field Engineering

Our Expertise
• Skilled technical storage experts covering the whole IBM System Storage Portfolio
• Information Infrastructure: Compliance, Availability, Retention, Security
• HW / SW & Performance

Our Systems Lab Europe
• 1500 sqm lab space
• IBM & heterogeneous hardware


Thank You

Gracias (Spanish), Obrigado (Brazilian Portuguese), Tak (Danish), Danke (German), Grazie (Italian), Merci (French)


Disclaimer I


Copyright 2009 by International Business Machines Corporation.

No part of this document may be reproduced or transmitted in any form without written permission from
IBM Corporation.

The performance data contained herein were obtained in a controlled, isolated environment. Results
obtained in other operating environments may vary significantly. While IBM has reviewed each item for
accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained
elsewhere. These values do not constitute a guarantee of performance. The use of this information or
the implementation of any of the techniques discussed herein is a customer responsibility and depends
on the customer's ability to evaluate and integrate them into their operating environment. Customers
attempting to adapt these techniques to their own environments do so at their own risk.

Product data has been reviewed for accuracy as of the date of initial publication. Product data is
subject to change without notice. This information could include technical inaccuracies or
typographical errors. IBM may make improvements and/or changes in the product(s) and/or
program(s) at any time without notice. Any statements regarding IBM's future direction and intent are
subject to change or withdrawal without notice, and represent goals and objectives only.

References in this document to IBM products, programs, or services do not imply that IBM intends to
make such products, programs or services available in all countries in which IBM operates or does
business. Any reference to an IBM Program Product in this document is not intended to state or imply
that only that program product may be used. Any functionally equivalent program that does not
infringe IBM's intellectual property rights may be used instead. It is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program or service.


Disclaimer II


THE INFORMATION PROVIDED IN THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY
WARRANTY, EITHER EXPRESS OR IMPLIED. IBM EXPRESSLY DISCLAIMS ANY WARRANTIES
OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NONINFRINGEMENT.

IBM shall have no responsibility to update this information. IBM products are warranted according to
the terms and conditions of the agreements (e.g. IBM Customer Agreement, Statement of Limited
Warranty, International Program License Agreement, etc.) under which they are provided. IBM is not
responsible for the performance or interoperability of any non-IBM products discussed herein.

Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products in
connection with this publication and cannot confirm the accuracy of performance, compatibility or any
other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be
addressed to the suppliers of those products.

The provision of the information contained herein is not intended to, and does not, grant any right or
license under any IBM patents or copyrights. Inquiries regarding patent or copyright licenses should be
made, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.


Trademarks


The following terms are trademarks or registered trademarks of the IBM Corporation in either the
United States, other countries or both:
IBM, TotalStorage, zSeries, pSeries, xSeries, S/390, ES/9000, AS/400, RS/6000,
z/OS, z/VM, VM/ESA, OS/390, AIX, DFSMS/MVS, OS/2, OS/400, ESCON, Tivoli,
iSeries, ES/3090, VSE/ESA, TPF, DFSMSdfp, DFSMSdss, DFSMShsm, DFSMSrmm, FICON,
ProtecTIER, XIV

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in
the United States, other countries, or both. Other company, product, and service names mentioned
may be trademarks or registered trademarks of their respective companies.
