
Welcome to ScaleIO Fundamentals.

Copyright 1996, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014 EMC Corporation. All Rights Reserved. EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED "AS IS." EMC CORPORATION MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION,
AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software
license.
EMC², EMC, Data Domain, RSA, EMC Centera, EMC ControlCenter, EMC LifeLine, EMC OnCourse, EMC Proven, EMC
Snap, EMC SourceOne, EMC Storage Administrator, Acartus, Access Logix, AdvantEdge, AlphaStor,
ApplicationXtender, ArchiveXtender, Atmos, Authentica, Authentic Problems, Automated Resource Manager,
AutoStart, AutoSwap, AVALONidm, Avamar, Captiva, Catalog Solution, C-Clip, Celerra, Celerra Replicator, Centera,
CenterStage, CentraStar, ClaimPack, ClaimsEditor, CLARiiON, ClientPak, Codebook Correlation Technology,
Common Information Model, Configuration Intelligence, Configuresoft, Connectrix, CopyCross, CopyPoint, Dantz,
DatabaseXtender, Direct Matrix Architecture, DiskXtender, DiskXtender 2000, Document Sciences, Documentum,
eInput, E-Lab, EmailXaminer, EmailXtender, Enginuity, eRoom, Event Explorer, FarPoint, FirstPass, FLARE,
FormWare, Geosynchrony, Global File Virtualization, Graphic Visualization, Greenplum, HighRoad, HomeBase,
InfoMover, Infoscape, Infra, InputAccel, InputAccel Express, Invista, Ionix, ISIS, Max Retriever, MediaStor,
MirrorView, Navisphere, NetWorker, nLayers, OnAlert, OpenScale, PixTools, Powerlink, PowerPath, PowerSnap,
QuickScan, Rainfinity, RepliCare, RepliStor, ResourcePak, Retrospect, RSA, the RSA logo, SafeLine, SAN Advisor,
SAN Copy, SAN Manager, Smarts, SnapImage, SnapSure, SnapView, SRDF, StorageScope, SupportMate,
SymmAPI, SymmEnabler, Symmetrix, Symmetrix DMX, Symmetrix VMAX, TimeFinder, UltraFlex, UltraPoint,
UltraScale, Unisphere, VMAX, Vblock, Viewlets, Virtual Matrix, Virtual Matrix Architecture, Virtual Provisioning,
VisualSAN, VisualSRM, Voyence, VPLEX, VSAM-Assist, WebXtender, xPression, xPresso, YottaYotta, the EMC logo,
and where information lives, are registered trademarks or trademarks of EMC Corporation in the United States and
other countries.
All other trademarks used herein are the property of their respective owners.
Copyright 2014 EMC Corporation. All rights reserved. Published in the USA.
Revision Date: October 2014
Revision Number: MR-1WN-SIOFUN.1.30


This course provides an introduction to the ScaleIO product.


This module provides an overview and general introduction to ScaleIO.


This lesson covers the definition and value proposition of ScaleIO.


ScaleIO is software-defined, distributed shared storage. It is a software-only solution that enables you to create a SAN from direct-attached storage (DAS) located in your hosts. ScaleIO creates a large pool of storage that can be shared among all SDC hosts within the cluster. This storage pool can be tiered to supply differing performance needs. ScaleIO is infrastructure-agnostic. It can run on any host, whether physical or virtual, and leverage any storage media, including disk drives, flash drives, or PCIe flash cards.


ScaleIO is focused on convergence, scalability, elasticity, and performance. The software converges storage and compute resources into a single architectural layer, which resides on the application host. The architecture allows for scaling out from as few as three hosts to thousands by simply adding nodes to the environment. This is done elastically; increasing and decreasing capacity and compute resources can happen on the fly without impact to users or applications.
ScaleIO has self-healing capabilities, which enable it to easily recover from host or disk failures. ScaleIO aggregates all the IOPS in the various hosts into one high-performing virtual SAN. All hosts participate in servicing I/O requests using massively parallel processing.


ScaleIO converges storage and compute resources into a single-layer architecture, aggregating capacity and performance and simplifying management. All I/O and throughput are collective and accessible to any SDC-enabled host within the cluster. With ScaleIO, storage is just another application running alongside other applications, and each host is a building block for the global storage and compute cluster.
Converging storage and compute simplifies the architecture and reduces cost without compromising on any benefit of external storage. ScaleIO enables the IT administrator to singlehandedly manage the entire data center stack, improving operational effectiveness and lowering operational costs.


ScaleIO is designed to massively scale from three to thousands of nodes. Unlike most
traditional storage systems, as the number of hosts grows, so do throughput and IOPS. The
scalability of performance is linear with regard to the growth of the deployment. Whenever
the need arises, additional storage and compute resources (i.e., additional hosts and
drives) can be added modularly. Storage and compute resources grow together so the
balance between them is maintained. Storage growth is therefore always automatically
aligned with application needs.


With ScaleIO, you can increase or decrease capacity and compute whenever the need
arises. You need not go through any complex reconfiguration or adjustments due to
interoperability constraints. The system automatically rebalances the data on the fly with
no downtime. No capacity planning is required, which is a major factor in reducing
complexity and cost.
You can think about it as being tolerant toward errors in planning. Insufficient storage? Starvation? Just add nodes. Overprovisioned? Just remove nodes. Additions and removals can be done in small or large increments, contributing to the flexibility in managing capacity.
ScaleIO runs on just about any commodity hardware and any host or operating system, bare metal or virtualized, with any storage media (HDDs, SSDs, or PCIe cards) located anywhere (DAS/external). A ScaleIO environment can comprise any mix of the above. This is true elasticity.


Every host in the ScaleIO cluster is used in the processing of I/O operations. The
architecture is parallel so, unlike a dual controller architecture, there are no bottlenecks or
choke points. As the number of hosts increases, more compute resources can be added
and utilized. Performance scales linearly and cost/performance rates improve with growth.
Performance optimization is automatic; whenever rebuilds and rebalances are needed, they
occur in the background with minimal or no impact to running applications. For performance
management, manual tiering can be designed by using storage pools. For example, you can
create a designated performance tier to utilize low-latency, high-bandwidth flash media and
a designated capacity tier to utilize disk spindles of various kinds.


ScaleIO provides enterprise-class SAN features such as:
Volume snapshots - including instant writable snapshots and consistency groups
Multi-tenancy - enabled by segregating tenant data across media and hosts and by providing data encryption
Multi-tiering - volumes may be tiered based on the type of storage media and the class of host and/or network
Quality of Service (QoS) configuration - IOPS and bandwidth may optionally be limited to specific values on a per-volume basis


ScaleIO offers a virtual SAN that is entirely host-based.
It is a software solution that installs directly on production hosts as a lightweight agent with a minimal footprint, enabling them to access volumes from storage pools created by aggregating storage from complete disks, partitions within a disk, or even files on the local hosts. Because all the storage in the aggregated pool is host-resident, this removes the need for traditional SAN infrastructure, such as storage arrays and Fibre Channel switches.
The solution can accommodate dynamic growth and shrinkage of the aggregated storage pool without disruption to application I/O activity.


This lesson covers the marketing and use cases for ScaleIO.


Generally, VSI environments require large amounts of storage that can be grown easily. At the same time, they require easy manageability and a low cost per host. ScaleIO is ideal for VSI because it leverages any commodity hardware and can accommodate growth of any size. No capacity planning is required, as growth in both capacity and performance is easy and linear. ScaleIO is easy to manage, requiring little administration. With no need for dedicated storage components or expensive arrays, TCO and cost per host are low.


Generally, VDI environments require high levels of performance at peak times, such as boot storms. They require large amounts of storage that can be grown easily as the number of users increases. At the same time, they require a low cost per desktop. ScaleIO is ideal for VDI because every host in the cluster is used in the processing of I/O operations, eliminating bottlenecks and stabilizing performance. It leverages any commodity hardware and can accommodate growth of any size. No capacity planning is required, as growth in both capacity and performance is easy and linear. ScaleIO is easy to manage, requiring little administration. With no need for dedicated storage components or expensive arrays, TCO and cost per desktop are low.


Generally, database environments require high write performance, high availability, quick
recovery, and low cost of storage. ScaleIO is ideal for databases because converged storage
and compute allows for very fast writes. Its massive parallelism delivers quick recovery and
stable, predictable performance. With no need for dedicated storage components or
expensive arrays, TCO is kept low.


Development and testing environments do not require huge amounts of capacity and do not have to deliver top-tier performance, but they must be low cost, since there is no revenue directly tied to them. Dev/test environments often change rapidly for repurposing. ScaleIO is ideal for development and testing environments. Its auto-rebalancing, easy scale-out, and elasticity with no downtime are a perfect fit for dynamic environments. Its low initial cost is justifiable for a non-production workload and allows for more investment in what matters: compute.


This lesson covers a review of the competitive products for ScaleIO.


Nutanix is a converged appliance offering a hardware and software solution to remove the
complicated SAN infrastructure used today. It was developed for a virtualized environment
requiring high IOPS while maintaining a low TCO.


ScaleIO competes against Nutanix by offering a software-only solution; there is no reliance on proprietary hardware. Because of the software-only approach, ScaleIO can elastically expand or contract storage without concern for the existing environment.


Here is a simple matrix positioning ScaleIO against Nutanix.


This module covered an overview of ScaleIO focusing on the benefits and use cases for the
product.


This module focuses on a review of the architecture of ScaleIO.


This lesson covers an overview of the ScaleIO architecture.


ScaleIO makes much of the traditional storage infrastructure unnecessary. You can create a large-scale SAN without arrays, dedicated fabric, or HBAs. With ScaleIO, you can leverage the local storage in your existing hosts that often goes unused, ensuring that IT resources aren't wasted. And you simply add hosts to the environment as needed. This gives you great flexibility in deploying SANs of various sizes and modifying them as needed. It also significantly reduces the cost of initial deployment.


The first component is the ScaleIO Data Client, or SDC. The SDC is a block device driver
that exposes ScaleIO shared block volumes to applications. The SDC runs locally on any
application host that requires access to the block storage volumes. The blocks that the SDC
exposes can be blocks from anywhere within the ScaleIO global virtual SAN. This enables
the local application to issue an I/O request and the SDC fulfills it regardless of where the
particular blocks reside. The SDC communicates with other nodes (beyond its own local
host) over a TCP/IP-based protocol, so it is fully routable. TCP/IP is ubiquitous and is
supported on any network. Data center LANs are naturally supported.
You can see the I/O flow in this animation. The application issues an I/O, which flows
through the file system and volume manager, but instead of accessing the local storage on
the host (via the block device driver), it is passed to the SDC (denoted as C in the slide).
The SDC knows where the relevant block resides on the larger system, and directs it to its
destination (either locally or on another host within the ScaleIO cluster). The SDC is the
only ScaleIO component that applications see in the data path.

Note that in a bare-metal configuration, the SDC is always implemented as an OS component (kernel). In virtualized environments, it is typically implemented as a hypervisor element or as an independent VM.
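To make the SDC's routing role concrete, here is a minimal Python sketch of the kind of lookup an SDC performs: given a volume offset, it consults a chunk map (maintained cluster-wide by the MDM in a real system) to decide which SDS owns the data. The chunk size, map layout, and names are illustrative assumptions, not ScaleIO internals.

```python
CHUNK_SIZE = 1 * 1024 * 1024  # assume 1 MB chunks, matching the layout discussion later

# Hypothetical chunk map: (volume, chunk index) -> owning SDS node.
chunk_map = {
    ("vol1", 0): "sds-07",
    ("vol1", 1): "sds-42",
    ("vol1", 2): "sds-13",
}

def route_io(volume, offset, length):
    """Split an application I/O into per-SDS requests (illustrative only)."""
    requests = []
    end = offset + length
    while offset < end:
        chunk_index = offset // CHUNK_SIZE
        chunk_end = (chunk_index + 1) * CHUNK_SIZE
        piece = min(end, chunk_end) - offset
        owner = chunk_map[(volume, chunk_index)]
        requests.append((owner, offset, piece))  # only the bytes needed, not the whole chunk
        offset += piece
    return requests

# A 4 KB read that straddles a chunk boundary is routed to two different SDSs.
print(route_io("vol1", CHUNK_SIZE - 2048, 4096))
```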


The next component in the ScaleIO data path is known as the ScaleIO Data Server, or SDS.
The SDS owns local storage that contributes to the ScaleIO storage pools. An instance of
the SDS runs on every host that contributes some or all of its local storage space (HDDs,
SSDs, or PCIe flash cards) to the aggregated pool of storage within the ScaleIO virtual SAN.
Local storage may be disks, disk partitions, even files. The role of the SDS is to actually
perform I/O operations as requested by an SDC on the local or another host within the
cluster.
You can see the I/O flow in this animation. A request, originating at one of the cluster's SDCs, arrives over the ScaleIO protocol at the SDS. The SDS uses the native local media's
block device driver to fulfill the request and returns the results. An SDS always talks to
the local storage, the DAS, on the host it runs on. Note that an SDS can run on the same
host that runs an SDC or can be decoupled. The two components are independent from
each other.


ScaleIO's control component is known as the metadata manager, or MDM. The MDM serves
as the monitoring and configuration agent.
The MDM holds the cluster-wide mapping information and is responsible for decisions
regarding migration, rebuilds, and all system-related functions. The ScaleIO monitoring
dashboard communicates with the MDM to retrieve system information for display.
The MDM is not on the ScaleIO datapath. That is, reads and writes never traverse the MDM.
The MDM may communicate with other ScaleIO components within the cluster in order to
perform system maintenance/management operations but never to perform data
operations. This means that the MDM does not represent a bottleneck for data operations
and is never an issue in scaling up the overall cluster. The MDM consumes only resources that are not needed by applications or datapath activities. The MDM does not preempt users' operations and does not have any impact on overall cluster performance and bandwidth.

To support high availability, three instances of MDM can be run on different hosts. This is
also known as the MDM cluster. An MDM may run on hosts that also run SDCs and/or SDSs.
The MDM may also run on a separate host. During installation, the user decides where MDM
instances reside. If a primary MDM fails (due to host crash, for example), another MDM
takes over and functions as primary until the original MDM is recovered. The third instance
is usually used both for HA and as a tie-breaker in case of conflicts.


In VMware environments, ScaleIO uses a model that is similar to a virtual storage appliance
(VSA), which is called ScaleIO VM, or SVM. This is a dedicated VM in each ESX host that
contains both the SDS and the SDC. The VMs in that host can access the storage as depicted: to the hypervisor, then to the SVM, and from the SVM to the local storage. All the
SVMs are connected, so this allows any VM in any ESX host to access any SDS in the
system, as in a physical environment.


Non-VMware hypervisor environments, including Citrix XenServer, Linux KVM, and Microsoft Hyper-V, are treated identically to physical environments. Both the SDS and the SDC sit inside the hypervisor.
Nothing is installed at the guest layer. Since ScaleIO is installed in the hypervisor, you are
not dependent on the operating system, so there is only one build to maintain and test. And
the installation is easy, as there is only one location to install ScaleIO.


Protection domains are an important feature of ScaleIO. Protection domains are sets of
hosts or SDS nodes. The administrator can divide SDSs into multiple protection domains of various sizes, designating volume-to-domain assignments. As the name implies, data
protection (redundancy, balancing, etc.) is established within a protection domain. This
means that all the chunks of a particular volume will be stored in SDS nodes that belong to
the protection domain for this specific volume. Volume data is kept within the boundaries of
the protection domain.
Any application on any host can access all the volumes, regardless of protection domain
assignment. So an SDC can access data in any protection domain. It is important to
understand that protection domains are not related to data accessibility, only data
protection and resilience. Protection domains allow for:
Increasing the resilience of the overall system by tolerating multiple simultaneous
failures across the overall deployment.
Separation of volumes for performance planning: for example, assigning highly accessed volumes to less busy domains or dedicating a particular domain to an application.
Data location and partitioning in multi-tenancy deployments so that tenants can be
segregated efficiently and securely.
Adjustments to different network constraints within the cluster.


ScaleIO offers several methods of managing the cluster. The primary tool for creating and
administering ScaleIO is the CLI. The ability to deploy and manage components is all available within the CLI. The GUI offers many of the same functions as the CLI but
adds the monitoring interface for real-time information about the health and performance of
the cluster.


This lesson covers how ScaleIO works with the components to provide storage access.


It is common practice to install both an SDC and an SDS on the same host. This way
applications and storage share the same compute resources. This slide shows such a fully
converged configuration, where every host runs both an SDC and an SDS. All hosts can
have applications running on them, performing I/O operations via the local SDC. All hosts
contribute some or all of their local storage to the ScaleIO system via their local SDS.
Components communicate over the LAN.


In some situations, an SDS can be separated from an SDC and installed on a different host.
ScaleIO does not have any requirements in regard to deploying SDCs and SDSs on the
same or different hosts. Whatever the preference of the administrator is, ScaleIO works
with it transparently and smoothly. Shown here is a two-layer configuration. A group of
hosts is running SDCs and another distinct group is running SDSs. The applications that run
on the first group of hosts make I/O requests to their local SDC. The second group, running
SDSs, contributes the hosts' local storage to the virtual SAN. The first and second groups
communicate over the LAN. In a way, this deployment is similar to a traditional external
storage system. Applications run in one layer, while storage is in another layer.


This slide shows the power of the distributed architecture of ScaleIO. Every SDC knows how
to direct an I/O operation to the destination SDS. There is no flooding or broadcasting. This
is extremely efficient parallelism that eliminates single points of failure. Since there is no
central point of routing, all of this happens in a distributed manner. Each SDC does its own
routing, independent from any other SDC. The SDC has all the intelligence needed to route
every request, preventing unnecessary network traffic and redundant SDS resource usage.
This is, in effect, a multi-controller architecture that is highly optimized and massively
parallel. It allows performance to scale linearly as the number of nodes increases. ScaleIO
is capable of handling asymmetric clusters with different capacities and media types.


Similarly, a fully converged configuration will have even higher parallelism and load
distribution between the nodes. Any combination of the fully converged and two-layer
configuration options is valid and operational.


Within a given protection domain, you can select a set of storage devices and designate
them as a storage pool. You can define several storage pools within the same protection
domain. When provisioning a data volume, you can assign it to one of the storage pools.
Doing so means that all the chunks of this volume will be stored in devices belonging to the
assigned storage pool. With protection domains and storage pools, ScaleIO establishes a
strict hierarchy for volumes. A given volume belongs to one storage pool; a given storage
pool belongs to one protection domain.
The most common use of storage pools is to establish performance tiering. For example,
within a protection domain, you can combine all the flash devices into one pool and all the
disk drives into another pool. By assigning volumes, you can guarantee that frequently
accessed data resides on low-latency flash devices while the less frequently accessed data
resides on high-capacity HDDs. Thus, you can establish a performance tier and a capacity
tier. You can divide the device population as you see fit to create any number of storage
pools.

The pools' boundaries are soft in that the admin can move devices from pool to pool as necessary. ScaleIO responds to such shifts in pool assignments by rebalancing and re-optimizing. No user action is required to reconfigure and rebalance the system; it is automatic and fast. This ease and simplicity of movement allows the admin to introduce temporary enhancements on a whim. For instance, you can move a couple of devices from pool1 to pool2 (or from one protection domain to another) for a limited period to address an expected (or experienced) peak demand. The situation can later be reversed with no downtime or significant overhead. It's that simple.
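As a rough mental model of the hierarchy just described, the sketch below (names and data structures invented for illustration, not ScaleIO's API) shows each protection domain containing storage pools, each pool containing devices, and each volume pinned to exactly one pool:

```python
# Illustrative model of the strict hierarchy: volume -> storage pool -> protection domain.
protection_domains = {
    "pd1": {
        "flash_pool": ["sds1:ssd0", "sds2:ssd0", "sds3:ssd0"],  # performance tier
        "hdd_pool": ["sds1:hdd0", "sds2:hdd0", "sds3:hdd0"],    # capacity tier
    }
}

# Each volume belongs to one pool, which belongs to one protection domain.
volume_assignment = {
    "oltp_vol": ("pd1", "flash_pool"),    # frequently accessed data on low-latency media
    "archive_vol": ("pd1", "hdd_pool"),   # less frequently accessed data on HDDs
}

def devices_for_volume(volume):
    """All chunks of a volume land only on devices of its assigned pool."""
    domain, pool = volume_assignment[volume]
    return protection_domains[domain][pool]

print(devices_for_volume("oltp_vol"))
```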


This module covered the architecture and components of ScaleIO.


This module focuses on the features and capabilities of ScaleIO.


This lesson covers ScaleIO fault tolerance.


Let's look at the distributed data layout scheme of ScaleIO volumes. This scheme is
designed to maximize protection and optimize performance. On the left, you see a data
Volume 1 in grey and a data Volume 2 in blue. On the right, you see a 100-node SDS
cluster.

A single volume is divided into chunks of reasonably small size, say 1 MB. These chunks will
be scattered (striped) on physical disks throughout the cluster, in a balanced and random
manner.
Once the volume is provisioned, the chunks of Volume 1 are spread throughout the cluster
randomly and evenly.
Volume 2 is treated similarly.
Note that the slide shows a partial layout; ideally, the chunks are spread over all 100 hosts. It is important to understand that ScaleIO volume chunks are not the same as data blocks. I/O operations are done at the block level: if an application writes 4 KB of data, only 4 KB are written, not an entire chunk. The same goes for read operations: only the required data is read.
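The balanced, random striping can be sketched in a few lines of Python. The 1 MB chunk size comes from the narration; the shuffle-and-deal placement policy is only an assumption used to illustrate "random and even," not ScaleIO's actual allocator.

```python
import random

CHUNK_SIZE_MB = 1
NODES = [f"SDS{i}" for i in range(1, 101)]   # the 100-node cluster in the example

def layout_volume(volume_size_mb, nodes):
    """Scatter a volume's chunks across the cluster, balanced and random."""
    placement = {}
    order = []
    for chunk in range(volume_size_mb // CHUNK_SIZE_MB):
        if not order:               # refill with a fresh shuffle to keep placement random...
            order = nodes[:]
            random.shuffle(order)
        placement[chunk] = order.pop()   # ...dealing one node per chunk keeps it even
    return placement

layout = layout_volume(8 * 1024, NODES)      # an 8 GB volume -> 8192 chunks
counts = {n: 0 for n in NODES}
for node in layout.values():
    counts[node] += 1
print(min(counts.values()), max(counts.values()))   # per-node counts stay nearly equal
```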


Now let's look at the ScaleIO two-copy mesh mirroring. For simplicity, it is illustrated with Volume 2, which has only five chunks: A, B, C, D, and E. The chunks are initially stored on hosts as shown. In order to protect the volume data, we need to create redundant copies of those chunks. We end up with two copies of each chunk. It is important that we never store copies of the same chunk on the same physical host.
The copies have been made. Now, chunk A resides on two hosts: SDS2 and SDS4. Similarly, all other chunks' copies are created and stored on hosts different from their first copy. Note that no host holds a complete mirror of another host. The ScaleIO mirroring
scheme is referred to as mesh mirroring, meaning the volume is mirrored at the chunk level
and is meshed throughout the cluster. This is one of the factors in enhancing overall data
protection and cluster resilience. A volume never fails in full and rebuilding a particular
damaged chunk (or chunks) is fast and efficient, as it is done simultaneously by multiple
hosts. When a host fails (or is removed from the cluster), its chunks are spread over the
whole cluster and rebuilding is shared among all the hosts.
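The one hard rule in the mirroring scheme, that the two copies of a chunk never share a physical host, is easy to express. The sketch below illustrates only that constraint; the real placement logic also balances capacity and load.

```python
import random

def place_mirrored_chunk(nodes):
    """Pick two distinct hosts for the two copies of a chunk (mesh mirroring)."""
    primary, secondary = random.sample(nodes, 2)   # sample() guarantees distinct hosts
    return primary, secondary

nodes = [f"SDS{i}" for i in range(1, 101)]
for chunk in "ABCDE":                              # the five chunks of Volume 2 above
    first, second = place_mirrored_chunk(nodes)
    print(f"chunk {chunk}: copies on {first} and {second}")
```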


Let's take a look at a host failure scenario. SDS1 presently stores chunks E and B from
Volume 2 and chunk F from Volume 1.
If SDS1 crashes, ScaleIO needs to rebuild these chunks, so chunks E, B and F are copied to
other hosts. This is done by copying the mirrors. The mirrored chunk of E is copied from
SDS3 to SDS4, the mirrored chunk of B is copied from SDS6 to SDS100, and the mirrored
chunk of F is copied from SDS2 to SDS5. This process is called forward rebuild. It is a
many-to-many copy operation. By the end of the forward rebuild operation, the system is
again fully protected and optimized. No matter what, no two copies of the same chunk are
allowed to reside on the same host. Clearly, this rebuild process is much lighter-weight and
faster than having to serially copy an entire host to another. Note that while this operation
is in progress, all the data is still accessible to applications. For the chunks of SDS1, the
mirrors are still available and are used. Users experience no outage or delays. ScaleIO
always reserves space on hosts for failure cases, when rebuilds are going to occupy new
chunk space on disks. This is a configurable parameter (i.e., how much storage capacity to
allocate as reserve).
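A forward rebuild can be sketched as follows: for every chunk that lost a copy on the failed host, copy its surviving mirror to some other host that does not already hold that chunk. The data structures and host choices below are illustrative assumptions, not ScaleIO internals.

```python
import random

# copies[chunk] = set of hosts currently holding that chunk (two copies when healthy)
copies = {
    "E": {"SDS1", "SDS3"},
    "B": {"SDS1", "SDS6"},
    "F": {"SDS1", "SDS2"},
}
cluster = {f"SDS{i}" for i in range(1, 101)}

def forward_rebuild(failed_host):
    """Re-protect every chunk that had a copy on the failed host (many-to-many)."""
    for chunk, hosts in copies.items():
        if failed_host in hosts:
            hosts.discard(failed_host)
            survivor = next(iter(hosts))                          # the surviving mirror
            target = random.choice(sorted(cluster - hosts - {failed_host}))
            print(f"copy chunk {chunk} from {survivor} to {target}")
            hosts.add(target)                                     # two copies again

forward_rebuild("SDS1")
```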


Adding a node or disk to ScaleIO automatically triggers primary and secondary chunks to be migrated to the newly available devices, ensuring balanced utilization of all devices in the corresponding Storage Pool and Protection Domain.


A similar rebalance mechanism applies to removing a node or device from a ScaleIO system. Nodes or disks may be dynamically removed and the ScaleIO cluster automatically migrates and rebalances the Storage Pool.


Forward rebuild and backward rebuild are two strategies that ScaleIO uses to ensure
ongoing data protection immediately after it detects a node or device failure. We'll examine
these strategies in greater detail.


Let's examine carefully what happens when a single node fails. The same logic applies when a disk fails within a node.
Consider SDS2, which has chunks that are mirrored across every other node of this 100-node cluster.
When the SDS2 host fails, all I/O requests to ScaleIO will continue to be serviced as usual.
However, we have a situation where roughly 1 percent of the data chunks are in what is
termed degraded protection mode, which implies that the chunk does not have a usable
mirror.
Within a few seconds of detecting the failure, ScaleIO will allocate new chunks out of the
reserved space to take the place of the failed mirrors. The data gets re-mirrored to the new
locations.


Note that remirroring is done in a many-to-many fashion, since the mirrors of SDS2's chunks are scattered across all nodes of the cluster. This is therefore many times faster than the traditional approach of simply restoring the SDS2 node as it used to be, which would be bottlenecked by either the NIC bandwidth of a single node or the throughput capability of the disk spindles on that single node, and would also be very imbalanced in terms of rebuild load, degrading application performance.
Let's consider a simple example to get a sense of the achievable rebuild time. Assume we
have 1TB of data on each of 101 nodes, and a 1 Gbit network in place. When one node fails,
we need to re-mirror roughly 1TB of data that was lost. This 1TB of data needs to be
propagated between the surviving 100 nodes. That is 10GB of data to be mirrored by each
of these nodes. Assuming ~50 MB/sec read speed, it will take about 200 seconds to read
the chunks. Let's double that and call it 400 seconds to complete both the reads and writes,
which is around seven minutes. This is a pretty realistic example.
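The back-of-the-envelope numbers are easy to reproduce; the ~50 MB/s per-node read rate and the doubling to account for writes are the assumptions stated in the example.

```python
failed_node_data_gb = 1000                  # ~1 TB of data to re-mirror
surviving_nodes = 100
per_node_gb = failed_node_data_gb / surviving_nodes        # ~10 GB handled by each node
read_mb_per_s = 50                                          # assumed per-node read speed
read_seconds = per_node_gb * 1000 / read_mb_per_s           # ~200 s to read the chunks
total_seconds = 2 * read_seconds                            # double it for reads + writes
print(per_node_gb, read_seconds, round(total_seconds / 60, 1))  # 10 GB, 200 s, ~6.7 min
```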
Within a cluster partition, when two nodes go down simultaneously, data could become
unavailable. In this example, this unavailability condition will apply only to one percent of
one percent - or one 10,000th of the total data (since only the chunks that are mirrored
between these two specific nodes are affected). The odds of this happening are quite small,
given that the rebuild is many-to-many and quick. The SDCs can continue to access the
rest of the available data. Unlike some other competing products, ScaleIO also has good handling mechanisms in place for when nodes go down momentarily and then come back. When this happens to two different nodes, the data that becomes momentarily unavailable becomes available again. As you saw earlier, ScaleIO has the notion of Protection Domains
that enable large clusters to tolerate multi-node failures. Nodes that are in separate
Protection Domains can fail simultaneously and still not cause any data loss or
unavailability.


What if SDS2 comes back with all its chunks intact after going down for a short while and
forward rebuilds are already in progress? Now the situation becomes more complex,
because chunks could have been modified *after* the forward rebuild process has started.
So some chunks in SDS2 may have out-of-date data at this point. If a chunk has been only
slightly modified, it may make sense to simply restore the SDS2 copy from the up-to-date
copy elsewhere in the cluster. This is a backward-rebuild operation.
ScaleIO is intelligent enough to decide on and do what makes the most sense efficiency-wise: continue with the forward rebuild, or initiate a backward rebuild selectively, only for those chunks that have changed. And it can make that decision on a chunk-by-chunk basis. This design ensures that there is no large efficiency penalty for momentary host failures. A short outage results in only a small penalty, since most of the chunks will either remain unchanged or require only a short backward rebuild.


Let's look at an example of a forward rebuild versus a backward rebuild at work. Pretend that a host failed for an hour and was restored.
Meanwhile, 10 percent of the data on that host has been modified due to application writes. Does it make sense to do a forward rebuild or a backward rebuild?
It depends on how success is measured. If the goal is to minimize the amount of data being propagated on the network, then the backward rebuild makes the most sense: it would result in only 10 percent of the data being remirrored to SDS2.
However, if the main concern is to reduce the window of vulnerability to a second host outage, the picture changes. With a backward rebuild, you need to remirror 10 percent of the data to a single host. Let's assume there is just one disk spindle per host. With a forward rebuild, you would be propagating ~10 times more data, but it would go to 100 spindles on 100 different hosts, so it is going to be 10 times faster.

ScaleIO is designed to target the second goal, that is, to get fully protected as soon as possible.
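The trade-off can be made concrete with the same rough numbers: 1 TB owned by the recovered host, one spindle per host, 10 percent of the data modified during the outage, and the 50 MB/s per-spindle rate carried over from the earlier example (all of these are illustrative assumptions).

```python
host_data_gb = 1000           # data owned by the recovered host
changed_fraction = 0.10       # 10% was modified while it was down
spindle_mb_per_s = 50         # assumed per-spindle throughput
peers = 100                   # hosts/spindles sharing a forward rebuild

# Backward rebuild: re-mirror only the changed 10%, but to a single host/spindle.
backward_s = host_data_gb * changed_fraction * 1000 / spindle_mb_per_s

# Forward rebuild: re-mirror everything, but spread across 100 hosts in parallel.
forward_s = (host_data_gb / peers) * 1000 / spindle_mb_per_s

print(f"backward: {backward_s:.0f} s, forward: {forward_s:.0f} s")   # 2000 s vs 200 s
```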


This lesson covers ScaleIO snapshot functionality.


A snapshot is a volume that is a copy of another volume. Snapshots take no time to create (or remove); they are instantaneous. Snapshots do not consume much space initially because they are thinly provisioned. You can create snapshots of snapshots, any number of them.

All ScaleIO snapshots are thinly provisioned, regardless of whether the original volumes are thin or thick.
The dark-bordered area in the dashboard view of Capacity is the capacity used by
snapshots. Note that a snapshot can grow in one of two ways:
The user writes to the original volume. Since the snapshot needs to preserve the
original state, it must therefore grow.
The user writes directly to the snapshot volume.


The slide shows a volume, V1; a snapshot of that volume, S111; and a snapshot of snapshot S111, S121. Snapshot volumes, which are fully functional, can be mapped to SDCs just like any other volumes. A complete genealogy of volumes and their snapshots is known as a VTree. Any number of VTrees can be created in the system. When you create a snapshot for
several volumes (or snapshots), a consistency group that contains all the volumes in that
operation is created and named. Consistency groups are automatically created when issuing
a snapshot command for several volumes. Operations may be performed on an entire
consistency group (for example, delete volume).
The slide shows two genealogies of volumes, V1 and V2. V1 is being used to create VTree1
of snapshots. V2 is used to create the VTree2 genealogy.
At some point, a command is issued to create two snapshots, one of V1 and the other of
V2. Because this is a single snapshot command, the two newly created snapshots are
grouped. S112 (of V1) and S211 (of V2) have been grouped together in the C1 consistency
group.


ScaleIO enables the user to create a consistent snapshot of multiple volumes with a single
command. This ensures that write order fidelity is preserved when taking the snapshot,
making the snapshots crash-consistent.
In addition, a consistency group can include volumes from multiple Protection Domains.
This means you can make a crash-consistent copy of multiple volumes spread all over the
data center, if they are all related to the same application. Plus, you can even snapshot the
entire data center in a crash-consistent manner, if desired.


Note that when the snapshot is created, the original blocks of the volume become the snapshot. This is a redirect-on-write implementation. In other words, new writes to the original volume end up getting redirected to the snapshot space.
In all cases, data is written exactly once. It is managed within ScaleIO with pointer manipulation.
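Here is a minimal sketch of the redirect-on-write idea, pointer manipulation only, with invented names: when a snapshot is taken, the volume's existing block pointers become the snapshot, and subsequent writes to the volume are redirected to freshly allocated space.

```python
storage = {}   # location -> data; each piece of data is written exactly once

class Volume:
    """Toy redirect-on-write volume: block number -> location of the data."""
    def __init__(self):
        self.block_map = {}       # live view of the volume
        self._next_loc = 0

    def write(self, block, data):
        loc = ("loc", self._next_loc)     # this toy always allocates fresh space
        self._next_loc += 1
        storage[loc] = data
        self.block_map[block] = loc       # only pointers change; nothing is copied

    def snapshot(self):
        return dict(self.block_map)       # the original pointers become the snapshot

vol = Volume()
vol.write(0, b"v1")
snap = vol.snapshot()
vol.write(0, b"v2")                       # redirected; the snapshot still sees b"v1"
print(storage[snap[0]], storage[vol.block_map[0]])   # b'v1' b'v2'
```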


This lesson covers ScaleIO protection and security features.


ScaleIO offers a Quality of Service feature. This feature:
Limits IOPS on hosts (bare metal) or datastores (VMware)
Prevents an application from monopolizing resources
Enables performance-based SLAs and monetization of storage performance


ScaleIO features a capability known as the limiter: a configurable option to limit resource
consumption by applications. The limiter provides the ability to limit specific customers from
exceeding a certain rate of IOPS and/or a certain amount of bandwidth when accessing a
certain volume. On the slide, you see three applications sharing compute resources while
accessing the same volume. The amounts of consumed resources are represented by the
size of the colored boxes. Initially, they are all the same; the division of resources is equal.
Some available compute resources exist, which are not currently being consumed by the
three applications.
Now let's say App 3 has become hungry and has consumed all of the available resources. Apps 1 and 2 have no resources left to consume, should they need them. They are at risk. App 3 is now consuming so much IOPS and bandwidth that it is eating into the compute power that Apps 1 and 2 require, so their performance is suffering due to App 3's hogging.

With the ScaleIO limiter applied, however, App 3 is limited in the amount of resources it can consume, so Apps 1 and 2 can operate within their defined SLAs. As you can see, the limiter allows you to allocate IOPS and bandwidth as desired in a controlled manner.
Protection domains, storage pools, and the limiter allow the administrator to manage
resource efficiency within the ScaleIO cluster. These tools allow the administrator to
regulate and condition the system, thereby optimizing its overall operation.
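Conceptually, the limiter behaves like a per-volume rate cap. The sketch below is a generic fixed-window IOPS cap for illustration only; it is not ScaleIO's implementation, and the names and numbers are assumptions.

```python
import time

class IopsLimiter:
    """Generic per-volume IOPS cap (conceptual illustration, not ScaleIO internals)."""
    def __init__(self, max_iops):
        self.max_iops = max_iops
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= 1.0:   # start a new one-second window
            self.window_start = now
            self.count = 0
        if self.count < self.max_iops:
            self.count += 1
            return True
        return False                          # the hungry app waits; others keep their share

app3 = IopsLimiter(max_iops=5000)             # cap the volume used by the hungry application
admitted = sum(app3.allow() for _ in range(20000))
print(admitted)                               # at most 5000 I/Os admitted per second
```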


With ScaleIO, data at rest can be stored in encrypted form, enabling customers to secure their data while maintaining current service-level operations. Encryption of data at rest:
Does not affect real-time application performance
Prevents data from being compromised due to host theft
Prevents non-authorized data from being viewed in multi-tenancy environments


ScaleIO can provide dedicated software instances for dedicated IaaS customers, delivering all the security and performance benefits of dedicated SAN hosting, but purely in software and in a cost-efficient way.


This module covered the key features of ScaleIO.


This demonstration covers an overview of administering the features within ScaleIO.


Click the Launch button to view the video.


This module focuses on ScaleIO management interfaces.


Managing a ScaleIO deployment is easy. Everything from installation and configuration to monitoring and upgrades is simple and straightforward. Anyone who manages the data
center is capable of fully administering the deployment, without any specialized training
and/or vendor certification. The complexity of storage administration is completely
eliminated. The screens shown here are all that is needed in order to monitor the ScaleIO
system. There is also a simple CLI for configuration and various system actions. Because
the system manages itself and takes all the necessary remedial actions when a failure
occurs, including re-optimization, there is no need for operator intervention when various
events occur.
However, ScaleIO features a call home capability via email, which alerts the administrator
should an event occur. The admin can then take action to respond to the event (if
necessary) even outside of business hours. The administrator can follow the system
operations and monitor its progress. For example, the actual rebuilds and rebalance
operations, when they are executed, can be monitored via the dashboard.


The dashboard displays the overall system status. Each tile displays a certain aspect of the
storage system. Various controls let you customize the information displayed on the
dashboard.
In the Capacity tile, note that you can display a legend showing what each color represents in terms of how the storage is used. The I/O Workload tile allows you to display either IOPS or bandwidth, depending on your preference.
In general, note that only the tiles with interesting data are highlighted, while the other
panels are dimmed. For example, here notice that there is no rebalance and rebuild activity
currently in the system and the entire tile is dimmed. As another example, the
Management tile would be dimmed, if everything were normal and the MDM cluster were
running in clustered mode. If it is set to single mode, then it would be highlighted, since
the ScaleIO GUI wishes to alert the user to that fact.


The backend view provides detailed information about components in the system. The main
areas of the backend view are:
Filter - Allows the user to filter the information displayed in the table and property sheets
Toolbar - Allows the user to perform an action on the selected row in the table by clicking the appropriate button
Table - Displays detailed information about system components
Property Sheets - Display very detailed, read-only information about the component selected in the table. This is useful for fine-grained drill-downs into specific components within a large cluster without having to resort to the CLI. You can even work with multiple Property Sheets simultaneously, one for each of several related objects: for example, a device, an SDS, a Storage Pool, and a Protection Domain.


The CLI enables you to perform all provisioning, maintenance, and monitoring activities in
ScaleIO. It is installed and available by default on the primary and secondary MDM.
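As an illustration of driving the CLI, here is a small Python wrapper around scli. The flag names are recalled from ScaleIO 1.3x documentation and may differ by version, so treat them as assumptions and verify with scli --help on your MDM before use.

```python
import subprocess

def scli(*args):
    """Run an scli command on the MDM and return its output (illustrative sketch)."""
    result = subprocess.run(["scli", *args], capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical credentials and object names; flags may vary by ScaleIO version.
scli("--login", "--username", "admin", "--password", "ChangeMe")
print(scli("--query_all"))                                    # cluster-wide summary
scli("--add_volume", "--volume_name", "vol1", "--size_gb", "16",
     "--protection_domain_name", "pd1", "--storage_pool_name", "sp1")
scli("--map_volume_to_sdc", "--volume_name", "vol1", "--sdc_ip", "10.0.0.21")
```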


ScaleIO provides a vSphere plug-in for ScaleIO. The plug-in works with vSphere Web
Clients only. It cannot be accessed from a Windows-native vSphere client.
The recommended ScaleIO installation method in VMware environments is to use the plug-in. Beyond basic installation, the plug-in can be used during day-to-day operations to
perform basic provisioning and monitoring tasks. This enables VMware host administrators
to function in their familiar user interface without having to resort to ScaleIO commands or
the native ScaleIO dashboard for routine storage management chores.


The vSphere web plug-in can be used to perform initial installation of ScaleIO in a VMware
environment, using the Deploy ScaleIO environment option as shown on the pane on the
right.
Other common administration and management tasks for SDS hosts, SDC clients,
Protection Domains, Fault Sets, Storage Pools, Devices and Volumes are grouped as shown
in the highlighted area on the bottom left.
In particular, note that from the Storage Pools view, you can perform Add Volume from
the vSphere web client plug-in. This includes optionally mapping the volume via iSCSI to
the ESXi initiators so that the ScaleIO volume can present a shared datastore to all the
ESXi hosts selected.


Shown here is the Capacity Overview. Using the drop-down option, users can switch to any of the supported views, including Total Capacity, Capacity In-Use, I/O Bandwidth, IOPS, Rebuild, Rebalance, and Alerts. These selectable views, or presets, are designed to suit specific customer use cases.

Within a preset, users can change the hierarchy of items within the table. For example, the current screen shows a view sorted by SDS (upper-right drop-down), with items listed within each SDS. Users can switch this to sort by Storage Pool.


This demo covers a review of the ScaleIO CLI and GUI.


Click the Launch button to view the video.


This module covered the ScaleIO management options and interfaces.


This course covered the fundamentals of the ScaleIO product.


This concludes the training. Proceed to the course assessment on the next slide.

