Copyright 2010 EMC Corporation. Do not Copy - All Rights Reserved.

Using patented high-speed inline deduplication technology, Data Domain systems
identify redundant data as it is being stored, creating a storage footprint that is on average 10x to
30x smaller than the original dataset and reducing the WAN bandwidth
needed for replication by up to 99%. Originally deployed as an ideal solution for backup and
disaster recovery applications, Data Domain deduplication storage is now being used
more broadly as a storage tier, including near-line file storage, backup, disaster
recovery (DR), and long-term retention of enterprise data for reference, litigation
support, and regulatory compliance.
The Data Domain product family ranges from the low-end DD140 system to the high-end Global Deduplication Array.


A Data Domain appliance is a storage system with shelves of disks and a controller. It is optimized
first for backup and second for archive applications, and it supports most of the industry-leading backup
and archiving applications. The list on the slide is composed primarily of leading backup
applications, including not only EMC's NetWorker but also Symantec, CommVault, and others.
On the way into the storage system, data can pass through either Ethernet or Fibre Channel. With
Ethernet it can use standard protocols such as NFS or CIFS; it can also use optimized protocols, such as
Data Domain Boost, a custom integration with leading backup applications. After the data is stored,
having been deduplicated during the storage process, it can be replicated for disaster recovery, sending only
the compressed, deduplicated, unique data segments that were filtered out during the write
process to the target tier. Within the hardware, best-in-class approaches use
commodity hardware for maximum effect, and Data Domain uses a RAID 6 implementation.


The end result of identifying duplicate segments and compressing the data before storing is a significant
reduction in the data stored on disk. The overall reduction is viewed as compression, and it is sometimes
discussed in two parts: global and local. Global compression refers to the deduplication process that
compares received data to data already stored on disks. Data that is new is then locally compressed before
being written to disk.
To see how the effect of global compression increases over time, consider a backup stream from a first full
backup that contains five segments: A, B, C, another copy of B, and D. This gets stored on disk as
A, B, C, D and a reference to B instead of a second copy. Global compression at this point is the ratio of the
size of the original 5 segments received (A+B+C+B again+D) to the size of the 4 segments (A+B+C+D)
stored on disk.
If the next backup is an incremental backup that includes copies of A and B as well as a new segment E, only E
needs to be stored. A and B are already stored, so the system simply creates references to the previously stored
segments. Global compression of this backup is quite good since it is the ratio of the 3 received segments
(A+B+E) to the single stored segment E.
The second full backup is when the savings from global compression start to become very large. A, B, C, D,
and E are recognized as duplicates from the previous two backups, and only the new segment F gets
stored. Global compression of this second full backup is very high, with 6 segments coming in but only
the one new segment getting stored.
Global compression taken over all three backups is the ratio of the 14 segments sent by the backup
software to the 6 segments stored to represent all of the data received over time. Local
compression further reduces the space needed for the 6 stored segments by as much as another
2:1.
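The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the ratios in the example, not Data Domain's actual segment store or fingerprinting logic; the run_backups helper and the single-letter segments are hypothetical.

def run_backups(backups):
    stored = set()   # unique segments kept on disk
    received = 0     # total segments sent by the backup software
    for name, segments in backups:
        new = [s for s in segments if s not in stored]
        stored.update(new)
        received += len(segments)
        print(f"{name}: received {len(segments)}, stored {len(new)} new "
              f"(global compression {len(segments)}:{len(new)})")
    print(f"Overall: {received} received vs {len(stored)} stored, "
          f"about {received / len(stored):.1f}:1 before ~2:1 local compression")

run_backups([
    ("First full",  ["A", "B", "C", "B", "D"]),
    ("Incremental", ["A", "B", "E"]),
    ("Second full", ["A", "B", "C", "D", "E", "F"]),
])

Running this prints 5:4 for the first full backup, 3:1 for the incremental, 6:1 for the second full backup, and roughly 2.3:1 overall, matching the walkthrough above.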


In the post-process architecture, data is stored to disk before deduplication. Then after
it is stored, it is read back internally, deduplicated, and written again to a different area.
Although this approach may sound appealing because it seems to allow for
faster backups and the use of fewer resources, post-process deduplication requires many
more disks to store the multiple pools of data and to maintain speed.
In the inline approach, all data is filtered before it is stored to disk, which improves overall
performance.
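As a rough sketch of the trade-off, assuming a toy stream of single-letter segments and a hypothetical dedup() helper, the difference between the two architectures shows up in the amount of disk I/O each one performs:

def dedup(segments, index):
    """Return segments not already in the index and add them to it."""
    new = [s for s in segments if s not in index]
    index.update(new)
    return new

stream = ["A", "B", "C", "B", "D", "A", "B", "E"]   # raw backup segments

# Post-process: the full stream lands on disk first, then is read back,
# deduplicated, and written again to a separate deduplicated pool.
landing_writes = len(stream)                  # first write: everything
dedup_writes = len(dedup(stream, set()))      # second write: unique segments only
print("post-process writes:", landing_writes + dedup_writes, "reads:", len(stream))

# Inline: segments are filtered before they ever reach disk, so only
# unique segments are written and nothing is read back for deduplication.
print("inline writes:", len(dedup(stream, set())), "reads: 0")

In this toy run the post-process path performs 13 segment writes and 8 reads, against 5 writes and no extra reads for the inline path, which is the additional disk capacity and I/O referred to above.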


The Data Domain operating system (DD OS) is purpose-built for data protection; its design
elements form an architecture whose goal is data invulnerability. Since every
component of a storage system can introduce errors, an end-to-end test is the simplest path
to ensuring data integrity. End-to-end verification means reading data after it is written and
comparing it to what it is supposed to be, proving that it is reachable through the file system
to disk.
When DD OS receives a write request from backup software, it computes a checksum for the
data. After analyzing the data for redundancy, it stores the new data segments and all of the
checksums. After the backup is complete and all the data has been synchronized to disk, DD
OS verifies that it can read the entire file from the disk platter and through the Data Domain
file system, and that the checksums of the data read back match the checksums of the data
written. This ensures that the data on the disks is readable and correct and that the file
system metadata structures used to find the data are also readable and correct. The data is
correct and recoverable at every level of the system. If there is a problem anywhere
along the way, for example if a bit has flipped on a disk drive, it will be caught. In most
cases it can be corrected through the self-healing feature. If for any reason it cannot be
corrected, it is reported immediately, and the backup can be repeated while the data is
still valid on the primary store. Conventional, performance-optimized storage systems
cannot afford such rigorous verifications. The tremendous data reduction achieved by Data
Domain Global Compression reduces the amount of data that needs to be verified and
makes such verifications possible.
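A minimal sketch of this end-to-end verification idea, assuming an in-memory stand-in for the disk and SHA-256 checksums; the Disk class and verify() helper are hypothetical and not the DD OS implementation:

import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Disk:
    """Toy stand-in for the storage layer and the file system path to it."""
    def __init__(self):
        self.segments = {}
    def write(self, seg_id, data):
        self.segments[seg_id] = data
    def read(self, seg_id):
        return self.segments[seg_id]

def backup(disk, segments):
    """Checksum each segment as it is received, then store it."""
    expected = {}
    for seg_id, data in segments.items():
        expected[seg_id] = checksum(data)   # computed before anything is stored
        disk.write(seg_id, data)
    return expected

def verify(disk, expected):
    """Read every segment back and compare against the checksums taken at write time."""
    for seg_id, chk in expected.items():
        if checksum(disk.read(seg_id)) != chk:
            raise IOError(f"segment {seg_id} failed verification")
    return True

disk = Disk()
expected = backup(disk, {"A": b"payload-a", "B": b"payload-b"})
print("verified:", verify(disk, expected))

disk.segments["B"] = b"bit-flipped"   # simulate silent corruption on disk
try:
    verify(disk, expected)
except IOError as err:
    print("caught:", err)

The final block shows the point of the read-back pass: a flipped bit that the write path never saw is still detected before the backup window closes.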


Once the data is stored in a Data Domain system, there are a variety of replication
options for moving the compressed, deduplicated changes to a secondary or tertiary
site so the data can be restored in multiple locations for disaster recovery.
Collection replication performs whole-system mirroring in a one-to-one topology,
continuously transferring changes in the underlying collection, including all of the
logical directories and files of the Data Domain filesystem.
In addition, the most popular option is a directory- or tape pool-oriented approach that lets you
select a part of the file system, or a virtual tape library or tape pool, and replicate only
that. A single system can therefore be used as both a backup target and a replica for another
Data Domain system.
The graphic shows a number of smaller sites all replicating into one hub site. In that
case, each source system asks the hub whether or not it already has a
given segment of data. If it does not, the source sends the data. If the destination system
already has the data, the source site does not have to send it again. In this
scenario, with multiple systems replicating to one system in a many-to-one
configuration, there is cross-site deduplication, further reducing the WAN bandwidth
required and the cost.
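The many-to-one negotiation can be sketched as follows, assuming hypothetical Hub and replicate() helpers and SHA-256 fingerprints; this illustrates cross-site deduplication rather than the actual Data Domain replication protocol:

import hashlib

def fingerprint(segment: bytes) -> str:
    return hashlib.sha256(segment).hexdigest()

class Hub:
    """Destination system holding one deduplicated store for all source sites."""
    def __init__(self):
        self.store = {}                      # fingerprint -> segment data
    def missing(self, fingerprints):
        """Tell a source which fingerprints the hub does not yet hold."""
        return [fp for fp in fingerprints if fp not in self.store]
    def receive(self, segments):
        self.store.update(segments)

def replicate(hub, segments):
    """Ship only the segments the hub is missing; return bytes sent over the WAN."""
    by_fp = {fingerprint(s): s for s in segments}
    needed = hub.missing(by_fp.keys())
    hub.receive({fp: by_fp[fp] for fp in needed})
    return sum(len(by_fp[fp]) for fp in needed)

hub = Hub()
site_a = [b"finance-db", b"os-image", b"mail-store"]
site_b = [b"os-image", b"mail-store", b"hr-db"]      # overlaps heavily with site A

print("site A sent", replicate(hub, site_a), "bytes")
print("site B sent", replicate(hub, site_b), "bytes")   # only hr-db crosses the WAN

Because site B's common segments already arrived from site A, only the unique hr-db segment is transferred, which is the cross-site bandwidth saving described above.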


EMC Data Domain Boost software distributes parts of the deduplication process to the DD Boost
Library that runs on backup servers. Traditional backup is a three-tier system: there is a backup client,
a backup server, and a storage array. The whole stream of backup data from the client has to go
through the backup server, across two LAN hops, to a storage device. Traditionally with Data
Domain, since all of the deduplication occurs on the array, the network and each system along the
way have to ship the whole dataset over both hops of the backup LAN. DD Boost distributes some of
the deduplication processing to the backup server, so the last hop carries only deduplicated,
compressed data. This makes the backup network more efficient, makes Data Domain systems 50%
faster, and makes the whole system more manageable. It works across the entire Data Domain
product line.
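As a rough sketch of why moving part of the deduplication to the backup server shrinks the last hop, assuming a toy segment model and a hypothetical appliance fingerprint index (this is not the DD Boost library API):

import zlib

def last_hop_traditional(segments):
    # All deduplication happens on the appliance, so the full stream
    # crosses the final hop of the backup LAN.
    return sum(len(s) for s in segments)

def last_hop_with_boost(segments, appliance_index):
    # The backup server filters out segments the appliance already holds
    # and compresses the remainder before sending.
    sent = 0
    for seg in segments:
        fp = hash(seg)
        if fp not in appliance_index:
            appliance_index.add(fp)
            sent += len(zlib.compress(seg))
    return sent

segments = [b"base-image" * 100, b"daily-delta" * 10, b"base-image" * 100]
print("traditional last hop:", last_hop_traditional(segments), "bytes")
print("with server-side dedup:", last_hop_with_boost(segments, set()), "bytes")

In this toy example the duplicate base image never crosses the last hop at all, and what does cross is compressed first, which is where the backup-LAN efficiency gain comes from.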


Today's IT environments are facing challenges with the combination of data growth
and shrinking backup windows. Recovery time objectives (RTOs) and recovery point
objectives (RPOs) are also becoming more stringent, increasing the importance of a
highly reliable, high-performance backup environment.
As a complement to tape for long-term, offsite storage, backup-to-disk products such as the
EMC Disk Library have emerged as powerful solutions. Customers seeking
the advanced virtual tape library (VTL) functionality of the Disk Library as well as
the ROI benefits of deduplication can leverage a Disk Library deployment with Data
Domain. This enables customers to move data to Data Domain deduplication storage
systems for longer-term retention of data and network-efficient replication.
The figure on the slide shows a Disk Library with Data Domain deployment scenario. In
this deployment, data in the Disk Library virtual tape cartridges is migrated or copied
to the Data Domain system where it is deduplicated to remove data redundancies,
resulting in longer data retention capability than a stand-alone Disk Library. The Data
Domain system does not need to be dedicated to the Disk Library. While operations
are occurring from the Disk Library to the Data Domain system, NAS or
VTL jobs can run in parallel on the Data Domain system.


The most common scenarios for using the Disk Library with the Data Domain system are
shown in the slide.
1. Copying data from the Disk Library to the Data Domain system: In this scenario, either
one or two engines are writing data to the Data Domain system. Data is migrated from the
Disk Library (using tape caching) or is copied (using the embedded media managers) to the
Data Domain system. With the Automated Tape Caching feature, the backup application sees
the local copy of data and data access is through the Disk Library. With the embedded storage
node or embedded media server, the backup application is aware of both copies of data and
data access is through the backup application.
2. Copying data from the Disk Library to Data Domain and to a physical tape library: In
this scenario, data is copied to the Data Domain system and a physical tape library via the
embedded storage node/media server. In this configuration, the data can reside on each of the
three units for different retention periods. Each engine would have to see the Data Domain
system and the physical tape library since the data is seen by each engine individually.
Multiple engines can be used in a dual-engine configuration, with each writing to its own Data
Domain system and physical tape unit.
3. Copying data to the Data Domain and replicating to another Data Domain: In this
scenario, data is written to the Data Domain system and then replicated to another Data
Domain system. Data can either be migrated from the Disk Library (using tape caching) or is
copied (using the embedded media managers) to the Data Domain system. The data is then
automatically replicated to another Data Domain system. A dedicated Disk Library on the
target side is not required, although in some tape caching environments, a Disk Library on the
target side may be required.
