
Business Continuity and

Disaster Recovery Overview

August 2016

Version 3.0a

Prepared by
TBD
TBD

Table of Contents

1 Introduction
1.1 Purpose

2 Business Continuity and Disaster Recovery Concepts
2.1 Overview of Business Continuity and Disaster Recovery (BC/DR) Concepts
2.2 Enterprise Risk Management (ERM)
2.3 Business Continuity
    Disaster Recovery Plans
    Emergency Management (Business Response) Teams
    Business Continuity Standards
2.4 Positioning Disaster Recovery
    Plan-Do-Check-Act (PDCA) Model
2.5 Recovery Concepts Overview
    Recovery Time Objective (RTO) / Recovery Point Objective (RPO)
    SIPOC
    Dependency Categories
    Technical Dependency Analysis (TDA)
2.6 Disaster Recovery Planning Approach
2.7 Emergency Response (The Incident Command System)

3 Microsoft Cloud-Based Disaster Recovery Capabilities
3.1 Planning BC/DR for Cloud Environments
    On-Premises Cloud BC/DR Capabilities
    Hybrid Cloud BC/DR Capabilities
    VMware and Physical Environments
    How Does ASR Protect On-Premises Resources?
    What is Needed to Configure ASR for VMware
    Native Application Platform Considerations

4 Summary

1 Introduction

1.1 Purpose
This guide is intended to provide an introduction to the concepts relating to Business
Continuity and Disaster Recovery (BC/DR).

2 Business Continuity and Disaster Recovery
Concepts

2.1 Overview of Business Continuity and Disaster Recovery (BC/DR) Concepts
Operating an IT environment involves being prepared to manage system outages,
misconfigurations, and corrupt data. These incidents often require a few focused break/fix
operations (sometimes called cases or outages). The ability to manage and mitigate these
isolated disruptions is often what defines an organization's ability to keep a stable, running
business.

In this context, it is important to understand what happens when an organization experiences a
significant disaster. During a disaster, a significant portion of the organization's IT ecosystem is
lost. Restoring the organization's IT infrastructure requires several break/fix and more complex
activities. Today, with complex global supply chain strategies, rapidly recovering from a
major disaster before other competing organizations (such as industries or governments) is
viewed as a significant competitive advantage and is often referred to as a resiliency strategy.

Disaster events can be categorized into two types: forecasted and un-forecasted. A forecasted
event is one where the impact can be foreseen (such as a weather system event like a hurricane)
and can be mitigated through prior planning. Un-forecasted events are those where the
organization does not have a mitigation plan in place, either due to the immediate timing of the
event itself (such as an earthquake or cyber security attack) or the realization of previously
accepted risk factors.

Disaster scenarios, major attack vectors or incident types are the events that could lead to a
major disruption or a crisis/emergency for the business. Organizations identify specific
forecasted threats, along with their probabilities and impact, in the organization's Risk
Assessment (RA) and Business Impact Assessment (BIA). Most of these assessments focus on
forecasted risks to the company and/or the operations of a specific organizational unit.
Strategically, it is important to look at the overall impact of the scenario on the organization.
Disaster Management is often divided into manageable areas to manage risk and to plan for and
react to forecasted and un-forecasted disaster events.

2.2 Enterprise Risk Management (ERM)
Often found in the organization's finance group, this team forecasts potential threats to the
business for the board of directors and shareholders. Enterprise Risk Management (ERM)
looks at competitive threats, natural and manmade threats, regulatory changes, and government
and market changes. The ERM team's primary purpose is to map out the forecasted impact of
strategic mistakes. This forecasting process requires due diligence and can take up a significant
amount of time and energy. When analyzing disasters, the primary goal is to understand how
much damage (money, assets, and destroyed supply chains) the organization can withstand.
ERM has its roots in insurance, loss control and compliance [1]. Common risk areas include:

Figure 1: Common Risk Areas

2.3 Business Continuity


The Business Continuity team works with the Risk Management team and other teams to forecast,
analyze and mitigate specific threat vectors for targeted divisions in the organization, and to
restore the essential people, processes, technologies and supply chains needed to stabilize the
organization in the event of a disaster. It works with various teams to build an actionable
business and operational capability. The following diagram depicts a typical business continuity
lifecycle.

[1] Additional information can be found in the Business Continuity Institute's Good Practice Guidelines
(http://www.thebci.org/index.php/resources/the-good-practice-guidelines) and from the MIT Sloan School of Management
(http://sloanreview.mit.edu/tag/risk-identification).

Figure 2: Business Continuity Lifecycle

Some common outputs an IT BC/DR professional should be familiar with include:

• Business Continuity Policy and Charter - Most organizations will have a policy stating
  the strategy and executive support in times of disaster.
• Risk Assessments - These identify and analyze potential risks and threats to the
  organization's overall performance before a disaster event is realized.
• Business Impact Analysis - This determines the impact of specific disasters on specific
  operational functions. It is commonly defined as a systematic, repeatable and
  substantially defensible analysis to identify, measure, and validate the potential impacts an
  interruption would cause to a business process.
• Continuity Requirements - These determine specific continuity performance metrics for
  specific supply chains, systems and processes, including desired recovery time objectives
  and recovery point objectives.

Disaster Recovery Plans


Disaster Recovery Plans are a component of business continuity that focuses on mitigating the
impact of forecasted disasters on specific targeted systems and processes. For un-forecasted
disasters where no predefined recovery plan is available, the disaster recovery plan covers the
roles and responsibilities for handling the disaster.

Emergency Management (Business Response) Teams


Emergency Management Teams manage the complexity of a disaster event by providing situational
awareness, performing impact analysis and triaging mission teams to recover the organization as
fast as possible. They are specifically trained for high-stress situations and have the ability and
authority to make decisions quickly to restore the organization. Typically, these response teams
use the Incident Command System (ICS) to do so.

Business Continuity Standards
While there is a significant set of regulations and laws pertaining to continuity and resiliency
for governments and publicly traded companies in different countries, there is also a growing
list of accepted international standards from the International Organization for Standardization
(ISO): ISO 22301 (standard) and ISO 22313 (implementation). Note that ISO 22301 and ISO 22313
should be viewed as a minimum bar and not as an organization's final goal. Passing an ISO 22301
audit does not by itself mean that an organization has an effective business continuity or disaster
recovery capability. Other common standards and regulations that cover business continuity
and disaster recovery include:

Standard                  Purpose

ISO 9001                  Quality requirements
ISO 14001                 Environmental management systems - Requirements with guidance for use
ISO 19011                 Guidelines for auditing management systems
ISO/IEC 20000-1           Service Management
ISO 22300                 Societal security - Terminology
ISO/PAS 22399             Societal security - Guideline for incident preparedness and operational
                          continuity management
ISO/IEC 24762             Information technology - Security techniques and guidelines for information
                          and communications technology disaster recovery services
ISO/IEC 27001             Information Security Management Systems
ISO/IEC 27031             Information technology - Security techniques - Guidelines for information
                          and communication technology readiness for business continuity
ISO 31000                 Risk Management - Principles and Guidelines
ISO/IEC 31010             Risk management - Risk assessment techniques
ISO/IEC Guide 73          Risk management - Vocabulary
BS 25999-1                Business continuity management - Code of practice, British Standards
                          Institution (BSI)
BS 25999-2                Business continuity management - Specification, British Standards
                          Institution (BSI)
SI 24001                  Security and continuity management systems - Requirements and guidance
                          for use, Standards Institution of Israel
NFPA 1600                 Standard on disaster/emergency management and business continuity
                          programs, National Fire Protection Association (USA)
Business Continuity Plan  Ministry of Economy, Trade and Industry (Japan), 2005
Drafting Guideline
Business Continuity       Central Disaster Management Council, Cabinet Office, Government of
Guideline                 Japan, 2005
ANSI/ASIS SPC.1           Organizational Resilience: Security, Preparedness, and Continuity
                          Management Systems - Requirements with Guidance for Use
SS 540: 2008              Singapore Standard for Business Continuity Management
ANSI/ASIS/BSI BCM.01      Business Continuity Management Systems: Requirements with Guidance for Use

2.4 Positioning Disaster Recovery


There are five phases to disaster response: Watch, Mobilize, Assess, Stabilize and Close. These
phases cover, in order, the broad range of activities that typically need to be addressed in a
disaster event, and are illustrated in the diagram below:

Figure 3: Standard Disaster Response Protocol

When starting a BC/DR-oriented project, it is crucial to use the organization's Business Impact
Assessment (BIA) and Risk Assessment (RA) to define needs in responding to a disaster. These
needs will help the organization define its disaster response strategy from an IT perspective.
Often the IT organization will have the RA, BIA and Continuity Requirements (CR)
documented, or will know the crucial business services and the corresponding dependent IT systems.
It is important that the disaster recovery plan aligns with the RA, BIA and CR
of the organization. The DR plan should be tightly scoped to the targeted services and
supply chains, and should forecast the impact on adjacent dependent IT assets and processes.

It is important to note that the IT organization must, in most cases, own its RA, BIA and CR.
While a BC/DR project can present options and observations, the engagement cannot represent
itself as owning the final strategy or guaranteeing risk scenarios to the customer (board of
directors, shareholders, or political leaders). It is generally regarded as bad practice for
customers to delegate their strategic business continuity decisions to a third party. Promoting a
disaster recovery plan without regard to the customer's business continuity requirements, the
customer's emergency implementation capability or the critical dependencies (people, processes
and IT systems) is generally regarded as irresponsible.

Plan-Do-Check-Act (PDCA) Model


Typically, organizations use a variation of the Plan-Do-Check-Act (PDCA) model to turn
their business continuity and disaster recovery strategy into reality. Not only is the model used
by the ISO 22301 standard, it also promotes BC/DR as a perpetual commitment to implementation,
including processes, technology, organizational muscle memory and executive commitment.
BC/DR is not a product or technology; it is a process-based effort.

An example of the PDCA is provided below:

[Diagram: Continual Improvement of Business Continuity Management Systems. A cycle of four
stages: PLAN (Establish), DO (Implement and Operate), CHECK (Monitor and Review) and ACT (Review
and Improve). On one side, stakeholders drive requirements, vision and direction; on the other,
stakeholders realize the results.]

Figure 4: Plan-Do-Check-Act (PDCA) Model

PDCA is an essential top-down approach that helps make sure business continuity strategies are
aligned with the executive needs of the organization. However, it must be complemented with a
bottom-up capability perspective to help make sure that the strategy can be implemented by
the disaster response team. When DR plans are designed from the top down without regard to
the response capabilities of the organization, the ability to execute under pressure with limited
staff in a disaster is often compromised. Often, the tools, training and muscle memory of the teams
will determine whether the organization recovers effectively using the disaster recovery plan.

2.5 Recovery Concepts Overview


Recovery Time Objective (RTO) / Recovery Point Objective (RPO)
From a technology perspective, the Recovery Point Objective (RPO) and Recovery Time Objective
(RTO) are important inputs when setting up your disaster recovery plan. These metrics are
usually documented by the business continuity team in the Continuity Requirements for specific
business functions.

The Recovery Point Objective (RPO) is the maximum amount of data, measured in time, that can be
lost in case of a disruption. It answers the question, "To what point in time can I recover?"

The Recovery Time Objective (RTO) is the maximum amount of time allowed, from the moment of the
disruption, to bring back the business functions, including data. It answers the question, "At what
point in time can I expect business operations to continue?"

The RPO and RTO figures most often found in a Service Level Agreement (SLA) are focused on the
regular backup and recovery processes. For Disaster Recovery (DR), the RPO and RTO
figures would normally be higher. Realistic values need to be set based on the Business
Continuity (BC) plan. As seen in the figure below, RPO and RTO are ideally business-driven numbers
that roll up from the RPO and RTO of the technical components that make up the business
application.

Figure 5: RPO/RTO
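
The roll-up from component figures to a business application figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the application is treated as usable only once every component is back, its data is treated as only as current as the component with the oldest restore point, and the component names and hour values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentObjective:
    """Recovery figures for one technical component (hypothetical structure)."""
    name: str
    rto_hours: float  # time needed to restore the component
    rpo_hours: float  # maximum age of the data the component can be restored to


def roll_up(components: list[ComponentObjective], parallel: bool = True) -> tuple[float, float]:
    """Return (rto_hours, rpo_hours) for the business application as a whole.

    Assumptions: the application is usable only when every component is back,
    and its data is only as current as the component with the oldest restore point.
    """
    if not components:
        raise ValueError("at least one component is required")
    rtos = [c.rto_hours for c in components]
    # Parallel recovery is bounded by the slowest component; serial recovery adds up.
    rto = max(rtos) if parallel else sum(rtos)
    rpo = max(c.rpo_hours for c in components)
    return rto, rpo


# Illustrative values only.
components = [
    ComponentObjective("database", rto_hours=4.0, rpo_hours=0.25),
    ComponentObjective("web tier", rto_hours=1.0, rpo_hours=24.0),
    ComponentObjective("identity service", rto_hours=2.0, rpo_hours=1.0),
]

print(roll_up(components))                  # components recovered in parallel
print(roll_up(components, parallel=False))  # components recovered one at a time
```

Comparing the rolled-up figures with the business-driven RTO and RPO shows quickly whether the weakest component, rather than the average one, determines what the business application can achieve.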

SIPOC
Most business continuity teams look at the organization as a whole and work on a resiliency
strategy that restores each specific business operation as a whole. The SIPOC
acronym (from the Six Sigma methodology) stands for suppliers, inputs, process, outputs, and
customers. This model is often used to assist groups in understanding the interrelationships of
their processes and how work is currently performed within each process.

Dependency Categories
A common terminology is used when outlining business dependencies (non-IT assets)
during BC/DR analysis. It is critical to understand how technology can impact each of the areas
listed below, and familiarity with these terms helps BC/DR planning address the needs of
the organization's business. These terms include, but are not limited to:

• A Supplier is any person, entity or organization that provides inputs to the current
  process. A supplier can provide information, data, documents, guidelines, transactions,
  supplies, equipment or raw material. An internal supplier is internal to the organization,
  such as a team or business group, and provides inputs for the process in question. An
  external supplier is an external entity or organization providing inputs to the process.
• An Input is anything which feeds into the process as a document, guidelines, product,
  data, transaction, specialized equipment or raw material.
• Workforce, for the purpose of the Dependency Analysis, is any employee or non-payroll
  worker. These would include vendors and independent contractors.
• A Location is a place where something is or could be located; a site, such as a specific
  building name or number.
• An Application refers to a computer program or group of programs designed for end
  users. Applications are self-contained programs that perform a well-defined set of tasks
  under user control.
• A Vendor is a business entity contracted to provide a service or infrastructure element to
  customers or clients. Vendors can be any third-party provider, regardless of the service
  they provide.
• Data and Vital Records refer to any data or information required to perform your
  process. Data and vital records can be electronic or hard copy and reside in a number of
  different formats or locations.
• Specialized Equipment is any specialized equipment, machine or tool required to
  perform your process. Your list of equipment and tools should not include normal office
  equipment and supplies such as laptops, PCs, printers, copiers, fax machines, paper, pens
  and general desk supplies.

• A Partnership is a formal contractual relationship established to provide regular
  business services between two companies.
• An Output is anything that was created by the process, such as a document, transaction,
  product, data or information, and given to a customer of the process. An output of your
  process can be an input of another related process. Example: "Customer Balance" can
  be an input to a "Collections Process". An output can be given to an internal customer,
  external customer, or a related process.
• A Customer is any team, business unit or organization that receives a product or service
  from your process. Internal customers are colleagues, departments or groups inside the
  organization who receive products, services, support, or information from your process.
  External customers are individuals or organizations outside of the organization who are
  usually associated with paying money for the organization's products and services, or are an
  extension of your process under a contractual relationship.

Technical Dependency Analysis (TDA)


About thirty years ago, most organizational IT solutions were massive, self-contained proprietary
systems. Today, almost all solutions depend on other applications, networks and data
for their uninterrupted operation. This means DR teams have to
understand the dependencies and relationships the targeted solution has with other systems.

Technical Dependency Analysis (TDA) is a process to define all the technology and process
components and key personnel needed to keep a specific IT capability operational. Common TDA
questions asked by the Business Continuity Team for each critical system include:

Responsible for Data          Data Elements

Business Unit Lead            Business Process Name
                              Business Process Criticality
                              Business Process Recovery Time Objective (RTO)
                              Business Process Recovery Point Objective (RPO), as applicable
                              Application Name/SharePoint site URL
                              Application Owner

Application & Infrastructure  Recovery Time Capability (RTC), or
Support Team                  Recovery Time Estimate (RTE), if they haven't been tested
                              Recovery Point Capability (RPC), or
                              Recovery Point Estimate (RPE), if they haven't been tested
                              Identify the Primary Production Site
                              Identify the Failover Site (if one exists)
                              Identify the dependent applications to the primary application
                              Identify critical systems that are dependent on the primary application
                              Identify all Single Points of Failure (SPOF) for the primary application
                              Has Disaster Recovery (DR) been implemented (not backups)?
                              Is the Disaster Recovery Plan (DRP) available?
                              Has the Disaster Recovery Plan (DRP) been tested?
                              What is the last Disaster Recovery Plan (DRP) test date?
The Technical Dependency Analysis (TDA) examines the application(s) and supporting
infrastructure that a process depends on to determine, at a minimum, the following:

• Recovery Time Capability (RTC)
• Recovery Time Estimate (RTE), if they haven't been tested
• Recovery Point Capability (RPC)
• Recovery Point Estimate (RPE), if they haven't been tested

The data collected from this analysis will be used to identify gaps between the business process
recovery requirements and the recovery capabilities of the applications and supporting
infrastructure.

• Recovery Time Capability means the technical dependency has been proven through a
  test and may or may not meet the RTO requirement.
• Recovery Time Estimate means the technical dependency has not been proven through
  a test and the RTC has not been validated.
• Recovery Point Capability means the technical dependency has been proven through a
  test and may or may not meet the RPO requirement.
• Recovery Point Estimate means the technical dependency has not been proven
  through a test and the RPC has not been validated.
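
The gap identification described above can be sketched as a simple comparison of each dependency's tested capability, or untested estimate, against the business requirement. The record layout, field names and status strings below are hypothetical and only serve the illustration; the same comparison would apply to RPO against RPC or RPE.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryFigure:
    """One technical dependency's recovery time, compared with the business RTO."""
    dependency: str
    rto_hours: float                    # requirement from the Continuity Requirements
    rtc_hours: Optional[float] = None   # capability proven through a test
    rte_hours: Optional[float] = None   # estimate, not yet proven through a test


def assess(figure: RecoveryFigure) -> str:
    """Classify a dependency as meeting, missing, or not yet proving its RTO."""
    if figure.rtc_hours is not None:
        # A tested capability takes precedence over any estimate.
        if figure.rtc_hours <= figure.rto_hours:
            return "meets RTO (tested)"
        return f"gap of {figure.rtc_hours - figure.rto_hours:g} h (tested)"
    if figure.rte_hours is not None:
        if figure.rte_hours <= figure.rto_hours:
            return "estimated to meet RTO (untested)"
        return f"estimated gap of {figure.rte_hours - figure.rto_hours:g} h (untested)"
    return "no capability or estimate recorded"


# Illustrative values only.
print(assess(RecoveryFigure("payroll database", rto_hours=4, rtc_hours=6)))
print(assess(RecoveryFigure("file shares", rto_hours=8, rte_hours=6)))
```

Keeping tested and untested figures distinct in the output preserves the difference between a capability and an estimate, which is the point of the four definitions above.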

Performing the due diligence of TDA will lead towards the development of service dependency
maps, which outline the dependent systems and services for each application providing
capability to an organization's specific business functions.
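
A service dependency map can be represented as a simple directed graph. The sketch below assumes a hypothetical set of systems and shows two questions a DR team might ask of such a map: what a service needs in order to run, and which services are affected when a shared system is lost.

```python
from collections import deque

# Hypothetical service dependency map: each key depends directly on the listed systems.
DEPENDS_ON = {
    "order portal": ["web farm", "order database"],
    "web farm": ["load balancer", "active directory"],
    "order database": ["storage array", "active directory"],
    "load balancer": [],
    "active directory": ["dns"],
    "storage array": [],
    "dns": [],
}


def all_dependencies(service: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every system the service needs, directly or indirectly (breadth-first)."""
    seen: set[str] = set()
    queue = deque(depends_on.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against circular dependencies
        seen.add(current)
        queue.extend(depends_on.get(current, []))
    return seen


def dependents_of(system: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that would be affected if the given system were lost."""
    return {svc for svc in depends_on if system in all_dependencies(svc, depends_on)}


print(sorted(all_dependencies("order portal", DEPENDS_ON)))
print(sorted(dependents_of("dns", DEPENDS_ON)))
```

A system that appears in the dependents list of many critical services is a candidate Single Point of Failure and a natural early target in the recovery order.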

2.6 Disaster Recovery Planning Approach


To optimize your business continuity plan to support your IT Disaster Recovery strategy,
multiple workshops are necessary:

1. Kick-off and positioning: Provides all attendees of the program with detailed information on
the goals, planning and their roles, as well as a common language for talking about
disaster recovery.
2. Service Mapping: Collects all information about the business function or service, its
components, the parties involved and the different agreements. Insights gathered in this
session are essential in planning for the recovery from a disaster.
3. Scenario identification: Identifies all possible DR scenarios and ranks them on
probability, impact and mitigations already in place. Information from this session is used
to validate the coverage of the technical recovery scenarios and to evaluate whether the
information from the Business Impact Assessment is complete. Based on the identified
scenarios, the response and the corresponding processes are designed.

4. Information Needs and War Room Facilities: Based on the identified recovery
scenarios and their constraints (email not available, no phone, no access to the office, etc.),
information needs and facility requirements are identified. This area is operated by the
Response Leadership Team. It is crucial to design a strategy that the response team can
easily run. Too much complexity and too many manual processes are the root of most response
failures.
5. Responsibility and accountability: Based on the prior workshops, a RACI matrix is
set up for extending and maintaining the DRP and for running a disaster recovery. In addition,
Critical Success Factors (CSF) and Key Performance Indicators (KPI) are set up to
measure and extend the BC/DR capability.

While the information gathered and documented is valuable, it is also quite volatile.
Embedding disaster recovery in the change process allows the organization to update the
DR information as part of the changes that are implemented. It is good practice to assign an
owner to the information and to set relevant review intervals to verify that the information is
up to date.
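
The ranking performed in the scenario identification workshop can be sketched as a small scoring exercise. The scoring formula, the scales and the scenario values below are assumptions made for illustration; an organization would substitute the scales used in its own Risk Assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    probability: float   # likelihood per year, 0.0 to 1.0 (illustrative)
    impact: int          # 1 (minor) to 5 (severe), an assumed scale
    mitigation: float    # share of the impact already mitigated, 0.0 to 1.0


def residual_risk(s: Scenario) -> float:
    """Probability times impact, reduced by the mitigations already in place."""
    return s.probability * s.impact * (1.0 - s.mitigation)


def rank(scenarios: list[Scenario]) -> list[Scenario]:
    """Order scenarios from highest to lowest residual risk."""
    return sorted(scenarios, key=residual_risk, reverse=True)


# Illustrative values only.
scenarios = [
    Scenario("regional power outage", probability=0.30, impact=3, mitigation=0.50),
    Scenario("ransomware attack", probability=0.20, impact=5, mitigation=0.25),
    Scenario("data center fire", probability=0.02, impact=5, mitigation=0.00),
]

for s in rank(scenarios):
    print(f"{s.name}: {residual_risk(s):.2f}")
```

A single multiplied score is easy to explain in a workshop, but it hides low-probability, high-impact scenarios; teams may prefer to review those separately rather than rely on the ranking alone.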

2.7 Emergency Response (The Incident Command System)


It is a sad fact that most disaster recovery plans are failures. There are a variety of reasons for
this:

• Too much complexity
• Too much specialized human involvement
• Decisions by consensus
• Lack of testing (testing brings out the missing details)
• Lack of real-world disaster management experience on the planning team

When a real disaster hits, it is the emergency response team that will mobilize
with the executive leadership team to stabilize the organization, its people, processes, partners
and customers in their time of need.

Often BC/DR plans ignore the implementation capabilities of the emergency team and assume
the organization will have access to its most talented technical staff to fix or restore critical
systems. A practical example is the assumption that the most skilled Active Directory
administrators will be available to restore the Active Directory service for the company after a
disaster has occurred. IT leadership also often assumes that the incident response team that
addresses common outages can manage a major disaster. These assumptions are often ill-founded
in a real disaster. As a guideline, it is safe to assume an IT organization will have access to 50%
of its employees, operating at 50% mental capacity under stress. As a general practice, BC/DR
plans should incorporate this assumption into the recovery capabilities of the organization.
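
One way to apply this guideline when sizing a plan is to treat the two percentages as multiplicative, which leaves roughly a quarter of normal staffing capacity. That reading is an assumption made for this sketch, not something the guideline itself states, and the function names and headcounts are hypothetical.

```python
def effective_capacity(headcount: int, availability: float = 0.5, effectiveness: float = 0.5) -> float:
    """Full-time-equivalent capacity available during a disaster.

    Assumption: availability and effectiveness combine multiplicatively.
    """
    if not 0.0 <= availability <= 1.0 or not 0.0 <= effectiveness <= 1.0:
        raise ValueError("availability and effectiveness must be between 0 and 1")
    return headcount * availability * effectiveness


def fits_capacity(required_fte: float, headcount: int) -> bool:
    """Check whether a plan's staffing need fits within the reduced capacity."""
    return required_fte <= effective_capacity(headcount)


print(effective_capacity(40))   # a team of 40 under the 50%/50% guideline
print(fits_capacity(12, 40))    # does a plan needing 12 full-time equivalents fit?
```

A plan that fails this check is a candidate for more automation or a narrower scope, in line with the reasons for failure listed at the start of this section.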
For a variety of reasons, regular organizational bureaucracy is often ill-equipped to handle the
pressure and rapid pace that managing a massive disaster requires. Emergency Response requires
a clear command model with focused teams to quickly and effectively rebuild the organization's
systems and services. While there are a variety of approaches, most successful Emergency Response
teams use a variation of the Incident Command System (ICS). ICS is an internationally
recognized operational command and control model used to mobilize, assess and triage the crisis,
and to incorporate and responsibly orchestrate all available talent while working with
critical partners, government organizations and key stakeholders. ICS as a process has been
maturing and proving itself for decades. Organizations use ICS because it consistently
works.

An example of an IT Incident Command System (ICS) model for Disaster Management is provided
in the diagram below:

[Diagram: The Executive Leadership Team and the Information Technology Incident Command Lead,
supported by an Internal Business Unit Liaison Team and an Employee Welfare and Safety Team,
with four section teams:]

• Operations Section Team: Focused IT recovery teams: break/fix, App Dev, IT Infrastructure
  Build, Cloud Migration.
• Planning Section Team: Triage Requests, Strategic Dependencies, Compile Reports and Action
  Plans, Team Assignments, Demobilization, Impact Assessment.
• Logistics Section Team: Hardware / Software, Internet Access, Food, Office Space,
  Transportation, Passports, Employee and Contractor Resource Mgmt.
• Finance and Administration Team: Contracts, Internal Charge Costs and Payments, Compliance,
  Financial Reporting and Impact.

Figure 6: Incident Command Structure Example

These technology response teams support the Operations Lead on targeted missions to
restore key organizational functions and supply chains. The Operations Lead directs all
response/tactical actions. For government organizations, the Operations Lead will take on a
variety of responsibilities while working with other ICS leadership.

Typical IT restoration will consist of multiple teams, either in a separate IT Recovery Team or
injected into a task force unit. Needs assessments will be triaged to focus on the most important
systems first. IT restoration encompasses two types of missions:

• Break/Fix missions: Repair and restoration of existing IT assets.
  o Examples include restoring an existing application or service from backup.
• Complex missions: Major rebuild of key IT assets.
  o Examples include setting up new emergency cloud services, rapidly building
    applications or addressing significant cyber-attacks while in crisis.
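
The triage step described above can be sketched with a priority queue, so that the most important systems are always worked first even as new missions arrive. The priority scale, the mission names and the `Mission` structure are hypothetical and only illustrate the idea.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Mission:
    priority: int                      # 1 = most important system (assumed scale)
    name: str = field(compare=False)
    kind: str = field(compare=False)   # "break/fix" or "complex"


def triage(missions: list[Mission]) -> list[Mission]:
    """Return missions in the order they should be worked, most important first."""
    heap = list(missions)   # copy so the caller's list is left untouched
    heapq.heapify(heap)
    return [heapq.heappop(heap) for _ in range(len(heap))]


# Illustrative values only.
missions = [
    Mission(3, "intranet wiki", "break/fix"),
    Mission(1, "payment gateway", "complex"),
    Mission(2, "email", "break/fix"),
]

for m in triage(missions):
    print(m.priority, m.name, m.kind)
```

In practice the priority would come from the business process criticality and RTO recorded during the Technical Dependency Analysis rather than from a hand-assigned number.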

It is important for the DR strategy to be automated and standardized as much as possible to
help these teams be successful. Often, the IT Recovery Team is managing hundreds of separate,
uncoordinated DR missions in a major disaster. As a recommended practice, when evaluating
the technical dependencies of an organization, examine the organization's incident response
capabilities carefully to help make sure the organization can complete the DR plan that is developed.

3 Microsoft Cloud-Based Disaster Recovery
Capabilities

3.1 Planning BC/DR for Cloud Environments


Cloud workloads fall into two categories: stateful and stateless. Stateful workloads rely on the
infrastructure to provide availability and therefore do not have the constructs within the
application or service to manage their own state in a cloud environment. In cloud architectures,
stateful resource pools provide virtual machine resiliency through availability constructs such as
virtual machine Live Migration. Stateless workloads rely on the application or service to provide
availability and contain the constructs within the service to continue service during outages. In
some cases, these workloads provide resiliency at the cost of operating during failures with
diminished capacity.

Cloud infrastructures provide availability constructs, such as upgrade domains, which define
boundaries of failure. These boundaries differ between public and private cloud offerings, and
each model's capabilities are outlined in the sections below.

On-Premises Cloud BC/DR Capabilities


For private cloud architectures, availability and upgrade domains are defined by discrete
resource pools which can support a level of availability within or between one another using
technologies such as virtual machine mobility, replication and backup.

Virtual Machine Mobility

Virtual machine mobility for on-premises infrastructures running Windows Server 2012 R2
Hyper-V is supported by two technologies: Hyper-V Live Migration and Hyper-V Storage
Migration.

Hyper-V Live Migration makes it possible to move running virtual machines from one physical
host to another with no effect on the availability of the virtual machines or the services running
within them. Hyper-V Live Migration is divided into two categories:

Shared storage-based live migration. In this instance, the hard disk of each virtual
machine is stored on either a local Cluster Shared Volume (CSV) or a central SMB file share, and
live migration occurs over either TCP/IP or the SMB transport. You then perform a live migration
of the virtual machines from one server to another while their storage remains on the
local CSV or central SMB share.

Shared-nothing live migration. In this case, the live migration of a virtual machine
from one non-clustered Hyper-V host to another begins when the hard drive storage of
the virtual machine is mirrored to the destination server over the network. Then you
perform the live migration of the virtual machine to the destination server while it
continues to run and provide network services.

Windows Server 2012 R2 also supports Live storage migration, which supports the movement of
virtual hard disks that are attached to a virtual machine that is running. This provides the
flexibility to manage storage without affecting the availability of virtual machine workloads,
perform maintenance on storage subsystems, upgrade storage-appliance firmware and
software, and balance loads while the virtual machine is in use. Live storage migration is
supported for virtual hard disks on shared and non-shared storage subsystems (when using
Hyper-V over SMB designs).

Virtual Machine Replication

Virtual machine replication for on-premises infrastructures is supported by the Windows Server
2012 R2 Hyper-V Replica feature. Hyper-V Replica provides a workload-agnostic failure recovery
solution by providing asynchronous replication of virtual machines over standard network
protocols (HTTP or HTTPS) from one Hyper-V host or cluster to another remote Hyper-V host or
cluster without relying on storage arrays or other software replication technologies. Windows
Server 2012 R2 Hyper-V Replica supports replication between source and target Hyper-V servers
(or clusters) which can be physically co-located or geographically separated. It can further
support extending replication from the target server to a third server through the extended
replication feature. Hyper-V Replica tracks the write operations on the primary virtual machine
and replicates these changes to the replica server at a configurable frequency of 30 seconds, 5
minutes, or 15 minutes. Additional recovery points can be configured and retained for up to 24
hours. Hyper-V Replica also supports both planned and unplanned failover scenarios with
advanced logic such as TCP/IP re-addressing of the host as part of the failover process.
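
As a rough illustration of how the replication frequency bounds potential data loss, the following Python sketch picks the slowest of the three intervals named above that still meets a target RPO. The function names are hypothetical, and the model (worst-case loss is one full interval plus the time needed to ship the changes) is a simplifying assumption rather than documented Hyper-V Replica behavior.

```python
# Replication intervals named in the text above, in seconds.
SUPPORTED_INTERVALS_S = (30, 5 * 60, 15 * 60)


def worst_case_rpo_seconds(interval_s, transfer_s=0.0):
    """Rough upper bound on data loss (assumption): one interval plus transfer time."""
    if interval_s not in SUPPORTED_INTERVALS_S:
        raise ValueError(f"unsupported replication interval: {interval_s}")
    if transfer_s < 0:
        raise ValueError("transfer_s must be non-negative")
    return interval_s + transfer_s


def slowest_interval_meeting(target_rpo_s, transfer_s=0.0):
    """Return the largest supported interval that still meets the target RPO, or None."""
    candidates = [
        interval
        for interval in SUPPORTED_INTERVALS_S
        if worst_case_rpo_seconds(interval, transfer_s) <= target_rpo_s
    ]
    return max(candidates) if candidates else None


if __name__ == "__main__":
    # Hypothetical target: lose no more than 10 minutes of data,
    # allowing 20 seconds to transfer each batch of changes.
    print(slowest_interval_meeting(target_rpo_s=600, transfer_s=20))
```

Choosing the slowest interval that still meets the target is one possible design choice; it keeps replication traffic low, but a workload with a bursty write pattern may need a shorter interval than this simple bound suggests.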

Virtual Machine Backup

Virtual machine backup for on-premises infrastructures is provided through backup software
which supports the Hyper-V Volume Shadow Copy Service (VSS) writer. The ability to back up
open files is required to provide business continuity. VSS creates frozen copies of open files,
helping to make sure that virtual machines do not have to be put into hibernation or be shut
down before a consistent backup can be made. In a virtualized data center, there are three
commonly used backup types: host-based, guest-based, and SAN-based snapshot. The
following table contrasts these types.

Backup Capability Host-Based Guest-Based SAN Snapshot

Protection of virtual machine configuration

Protection of host and cluster configuration

Protection of virtualization-specific data

Protection of data inside the virtual machine

Protection of data inside the virtual machine stored on pass-through disks, iSCSI and vFC LUNs, and shared VHDX files

Support for Microsoft Volume Shadow Copy Service (VSS)-based backups for supported operating systems and applications

Support for continuous data protection

Ability to granularly recover specific files or applications inside the virtual machine

Note that the use of SAN volume snapshots is highly dependent on the storage vendor's level of
VSS and Hyper-V integration. SAN volume snapshots are typically block-level, and they only
utilize storage capacity as blocks change on the originating volume.

System Center 2012 R2 Data Protection Manager provides disk-based and tape-based data
protection and recovery for Hyper-V servers. Data Protection Manager supports the protection
of standalone computers running Hyper-V and of Hyper-V failover clusters that use shared storage
(Cluster Shared Volumes) or SMB storage.

Azure Site Recovery (ASR)

On-premises BCDR for physical instances (including MSCS clusters) and for virtual instances
running on VMware or pre-2012 Hyper-V is available using ASR. ASR enables
your organization to meet stringent disaster recovery needs, eliminate the impact of local
backups, and manage application uptime to meet high availability requirements. ASR uses
advanced technologies like Continuous Data Protection (CDP), Asynchronous Replication over
IP, Application Failover/Failback, and WAN Optimization for disaster recovery of data.

CDP technology enables ASR to capture data for recovery purposes and lets you decide upon
any recovery point in time to recover your lost/corrupted data.
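
The idea of choosing a recovery point can be sketched in Python as follows. This is a minimal illustration, not an ASR API: the function names and the sample timestamps are hypothetical, and it assumes recovery points are available as a simple list of timestamps.

```python
from bisect import bisect_left
from datetime import datetime


def latest_point_before(recovery_points, incident_time):
    """Return the most recent recovery point strictly before the incident, or None."""
    ordered = sorted(recovery_points)
    index = bisect_left(ordered, incident_time)
    return ordered[index - 1] if index > 0 else None


def data_loss_window(recovery_point, incident_time):
    """Time between the chosen recovery point and the incident."""
    return incident_time - recovery_point


if __name__ == "__main__":
    # Hypothetical recovery points and corruption time.
    points = [
        datetime(2016, 8, 1, 10, 0),
        datetime(2016, 8, 1, 10, 15),
        datetime(2016, 8, 1, 10, 30),
    ]
    corrupted_at = datetime(2016, 8, 1, 10, 20)
    chosen = latest_point_before(points, corrupted_at)
    print(chosen, data_loss_window(chosen, corrupted_at))
```

With continuous data protection the set of candidate points is much denser than in this example, which is what allows the data loss window to shrink toward zero.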

Asynchronous replication, configurable in 1-to-1, 1-to-N, and N-to-1 configurations, supports
short- or long-distance DR requirements over IP networks, while WAN optimization technologies
allow ASR to support even large applications using minimal bandwidth. Instead of the
shared-disk model that conventional high availability clustering software uses, ASR Application
failover/failback utilizes a shared-nothing model. All these capabilities are combined into a
single software-based platform that supports Windows, Linux, and UNIX environments and
heterogeneous storage architectures (DAS, SAN, NAS, iSCSI, FC). A key differentiator for ASR is
that it provides DR solutions that enable recoveries from or at remote sites within minutes
through an efficient use of available bandwidth. This debunks the perception that DR
configurations are expensive and inflexible, allowing customers to preserve their existing
investments in hardware, software, and networks as they deploy a DR solution that can meet
stringent recovery point objective, recovery time objective, and recovery reliability requirements.

Hybrid Cloud BC/DR Capabilities


Public cloud architectures support a wide range of availability for workloads running within their
service. While public cloud offerings still ultimately reside on physical hardware running in
physical datacenters across the world, this is where the similarity ends. Public cloud offerings
differ vastly from private cloud environments because they are exposed as services. A capability
that looks familiar (such as virtual machines) may not provide the traditional constructs that
organizations expect; instead, it uses a different set of constructs, based solely on what the
provider chooses to expose to the consumer. Furthermore, new constructs
are often developed by the provider to support a greater degree of service separation and
availability for consumer workloads. These services are often backed by a Service Level
Agreement (SLA), and the granularity is often exposed at the service level. Azure's SLAs are
available publicly and are aligned directly to each of the services being provided through its
management portal.

A key consideration when deploying workloads to hybrid cloud environments is that the
organization is mixing availability constructs between what they provide internally through on-
premises cloud infrastructures and what the public cloud provider has exposed through various
service offerings. This mixing of constructs means that BC/DR planning of a workload which
spans public and private cloud infrastructures must consider both the capabilities and SLAs
provided by both environments to assess availability and recovery needs.
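
One way to reason about mixed SLAs is to estimate a composite availability for a workload that depends on components in both environments. The Python sketch below assumes the components are serial dependencies that fail independently, so their availabilities multiply. The function names and the availability figures are hypothetical and are not taken from any published SLA.

```python
from math import prod

HOURS_PER_YEAR = 8760.0


def composite_availability(availabilities):
    """Availability of a chain of serial, independent dependencies (assumption)."""
    values = list(availabilities)
    if not values:
        raise ValueError("at least one availability figure is required")
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability must be between 0 and 1: {value}")
    return prod(values)


def downtime_hours_per_year(availability):
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures for an on-premises tier and a public cloud tier.
    on_premises_tier = 0.999
    public_cloud_tier = 0.9995
    combined = composite_availability([on_premises_tier, public_cloud_tier])
    print(f"{combined:.6f}", f"{downtime_hours_per_year(combined):.2f} h/year")
```

The composite figure is always lower than the weakest individual figure, which is why a workload spanning both environments needs its own availability assessment rather than inheriting either SLA.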

Azure provides a wide range of capabilities which support the availability of workloads spanning
on-premises and Public Cloud Infrastructure as a Service (IaaS)-based solutions. These
capabilities change rapidly with each new release and an overview of currently available IaaS
services is provided below.

First, availability of workloads hosted in Azure virtual machines is achieved by using multiple
virtual machines for continuity. This provides general availability of the workload during local
network failures, local disk-hardware failures, and any planned downtime that the platform
might require. Availability of a workload composed of multiple virtual machines is achieved by
adding them to an availability set. Availability sets are directly related to fault domains and
update domains in cloud infrastructures. A fault domain in Azure is a group of resources that
share a single point of failure, such as the network switch or power unit of a rack of servers.
When multiple virtual machines are connected together in a cloud service, an availability set
can be used to help make sure that the virtual machines are located in different fault domains.
The following diagram shows two availability sets, each of which contains two virtual machines.

Figure 7: Azure Availability Sets

Azure periodically updates the underlying infrastructure that hosts the instances of running
workloads and during that process a virtual machine is shut down when an update is applied. An
update domain is used to help make sure that not all of the virtual machine instances are
updated at the same time. When you assign multiple virtual machines to an availability set,
Azure helps to make sure that the virtual machines are assigned to different update domains.

As discussed previously, Azure virtual machine availability concepts are not the
same as those of on-premises Hyper-V. To support high availability for workloads hosted in Azure,
multiple virtual machines per application or role must be created, and Azure constructs such as
availability sets and load balancing must be used. Additional information about these
constructs can be found in the Infrastructure-as-a-Service Product Line Architecture Fabric
Architecture Guide.
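
The effect of spreading virtual machines across fault domains and update domains can be sketched in Python. This is an illustration under stated assumptions: the round-robin assignment, the default domain counts, and all names are hypothetical, and actual placement is decided by the platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    vm: str
    fault_domain: int
    update_domain: int


def place_round_robin(vm_names, fault_domains=2, update_domains=5):
    """Assign VMs to domains in round-robin order (an assumed policy)."""
    if fault_domains < 1 or update_domains < 1:
        raise ValueError("domain counts must be at least 1")
    return [
        Placement(name, index % fault_domains, index % update_domains)
        for index, name in enumerate(vm_names)
    ]


def survivors_during_update(placements, update_domain):
    """VMs still running while one update domain is being updated."""
    return [p.vm for p in placements if p.update_domain != update_domain]


def survivors_after_fault(placements, fault_domain):
    """VMs still running after one fault domain fails."""
    return [p.vm for p in placements if p.fault_domain != fault_domain]


if __name__ == "__main__":
    placements = place_round_robin(["web1", "web2", "web3"])
    print(survivors_during_update(placements, update_domain=0))
    print(survivors_after_fault(placements, fault_domain=0))
```

The sketch makes the capacity trade-off visible: with a single virtual machine, either list would be empty, which is why multiple virtual machines per role are needed.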

VMware and Physical Environments

The Azure Site Recovery Service contributes to your business continuity and disaster recovery
(BCDR) strategy by orchestrating replication, failover and recovery of virtual machines and
physical servers. Machines can be replicated to Azure, or to a secondary on-premises datacenter.

Azure Site Recovery Service is a hybrid cloud service which coordinates and manages the
protection of VMware virtual machines located in private cloud infrastructures managed by
VMware ESX servers. Azure Site Recovery Service orchestrates failover of these virtual machines
from one on-premises ESX host or cluster to another on-premises ESX host or cluster located in
a secondary location.

Azure Site Recovery Service uses the concept of vaults in Azure to store configuration data
related to the protection of single and multi-tier workloads which are defined as Recovery Plans.
See the following configuration example.

Recovery Plans are linear orchestration plans which allow for the grouping of virtual machines
into one or more failover groups. Recovery plans also allow for the addition of manual steps
and the insertion of automation (scripts) which can be run as part of a failover event. When
combined, these features let Recovery Plans support many of the requirements for failing over
multi-tier applications and workloads that span multiple virtual machines. While there are many
technologies available which provide protection of virtual machines themselves, very few
recovery solutions exist which provide the fabric management infrastructure with the
intelligence to see multiple virtual machines as composed applications and services with
differing failover needs and actions for each tier.
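
The structure of a recovery plan described above (ordered failover groups, optional manual steps, and optional scripts) can be sketched in Python. The class and function names are hypothetical and do not represent the Azure Site Recovery API; the sketch only shows the linear ordering and where a manual step can halt the plan.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FailoverGroup:
    name: str
    vms: List[str]
    manual_step: Optional[str] = None            # operator action after the group
    script: Optional[Callable[[], str]] = None   # automation run after the group


def run_plan(groups, fail_over_vm, confirm_manual=lambda step: True):
    """Run groups strictly in order and return a log of what happened."""
    log = []
    for group in groups:
        for vm in group.vms:
            fail_over_vm(vm)
            log.append(f"failover:{vm}")
        if group.manual_step is not None:
            if not confirm_manual(group.manual_step):
                log.append(f"halted:{group.manual_step}")
                return log
            log.append(f"manual:{group.manual_step}")
        if group.script is not None:
            log.append(f"script:{group.script()}")
    return log


if __name__ == "__main__":
    # Hypothetical two-tier application: the data tier fails over before the web tier.
    plan = [
        FailoverGroup("data", ["sql1"], script=lambda: "update-dns"),
        FailoverGroup("web", ["web1", "web2"], manual_step="verify site"),
    ]
    for entry in run_plan(plan, fail_over_vm=lambda vm: None):
        print(entry)
```

Keeping the plan as an ordered list is what lets each tier have its own failover actions while still guaranteeing that dependencies, such as the data tier, come up first.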

Azure Site Recovery Service additionally automates planned and unplanned failover activities
across sites and supports the TCP/IP readdressing needs when failover is performed across
separate network segments. Finally, recovery plans can be tested in isolation without disruption
to the running workload, supporting activities such as organizational BC/DR drills and plan
verification.

How Does ASR Protect On-Premises Resources?


Site Recovery helps protect your on-premises resources by orchestrating and simplifying
replication, failover, and failback in a number of deployment scenarios. If you want to protect
your on-premises VMware virtual machines or Windows or Linux physical servers, here's how Site
Recovery can help:

• Allows VMware users to replicate virtual machines to Azure.
• Allows the replication of physical on-premises servers to Azure.
• Provides a single location to set up and manage replication, failover, and recovery.
• Provides easy failover from your on-premises infrastructure to Azure, and failback
  (restore) from Azure to on-premises.
• Implements recovery plans for easy failover of workloads that are tiered over multiple
  machines.
• Provides multi-VM consistency so that virtual machines and physical servers running
  specific workloads can be recovered together to a consistent data point.
• Supports data replication over the Internet, over a site-to-site VPN connection, or
  over Azure ExpressRoute.
• Provides automated discovery of VMware virtual machines.

What is Needed to Configure ASR for VMware

COMPONENT: Configuration Server
DEPLOYMENT: Deploy as an Azure standard A3 virtual machine in the same subscription as Site Recovery. You set up this server in the Azure Site Recovery portal.
DETAILS: This server coordinates communication between protected machines, the Process Server, and Master Target servers in Azure. It sets up replication and coordinates recovery in Azure when failover occurs.

COMPONENT: Master Target Server
DEPLOYMENT: Deploy as an Azure virtual machine, either as a Windows server based on a Windows Server 2012 R2 gallery image (to protect Windows machines) or as a Linux server based on an OpenLogic CentOS 6.6 gallery image (to protect Linux machines). Two sizing options are available: standard A4 and standard D14. The server is connected to the same Azure network as the Configuration Server. You set it up in the Site Recovery portal.
DETAILS: It receives and retains replicated data from your protected machines using attached VHDs created on blob storage in your Azure storage account.

COMPONENT: Process Server
DEPLOYMENT: Deploy as an on-premises virtual or physical server running Windows Server 2012 R2. We recommend placing it on the same network and Local Area Network (LAN) segment as the machines that you want to protect, but it can run on a different network as long as protected machines have L3 network visibility to it. You set it up and register it to the Configuration Server in the Site Recovery portal.
DETAILS: Protected machines send replication data to the on-premises Process Server. It has a disk-based cache to cache replication data that it receives. It performs a number of actions on that data. It optimizes data by caching, compressing, and encrypting it before sending it on to the Master Target server. It handles push installation of the Mobility Service. It performs automatic discovery of VMware virtual machines.

COMPONENT: On-Premises Machines
DEPLOYMENT: On-premises virtual machines running on a VMware hypervisor, or physical servers running Windows or Linux.
DETAILS: You set up replication settings that apply to virtual machines and servers. You can fail over an individual machine or, more commonly, fail it over as part of a recovery plan containing multiple virtual machines that fail over together.

COMPONENT: Mobility Service
DEPLOYMENT: Installs on each virtual machine or physical server you want to protect. Can be installed manually or pushed and installed automatically by the Process Server.
DETAILS: The service takes a VSS snapshot of data on each protected machine and moves it to the Process Server, which in turn replicates it to the Master Target server.

COMPONENT: Azure Site Recovery Vault
DEPLOYMENT: Set up after you've subscribed to the Site Recovery service.
DETAILS: You register servers in a Site Recovery vault. The vault coordinates and orchestrates data replication, failover, and recovery between your on-premises site and Azure.

COMPONENT: Replication Mechanism
DEPLOYMENT: Over the Internet: Communicates and replicates data between protected on-premises servers and Azure using a secure SSL/TLS communication channel over a public internet connection. This is the default option. VPN/ExpressRoute: Communicates and replicates data between on-premises servers and Azure over a VPN connection. You'll need to set up a site-to-site VPN or an ExpressRoute connection between the on-premises site and your Azure network. You'll select how you want to replicate during Site Recovery deployment. You can't change the mechanism after it's configured without impacting protection on already protected servers.
DETAILS: Neither option requires you to open any inbound network ports on protected machines. All network communication is initiated from the on-premises site.

FEATURE: Set up protection between on-premises VMware virtual machines or physical servers and Azure
REFERENCE: https://azure.microsoft.com/en-us/documentation/articles/site-recovery-vmware-to-azure

FEATURE: Site Recovery Overview
REFERENCE: https://azure.microsoft.com/en-us/documentation/articles/site-recovery-overview/

FEATURE: Site Recovery components
REFERENCE: https://azure.microsoft.com/en-us/documentation/articles/site-recovery-components/

Native Application Platform Considerations


As stated earlier, all cloud solutions should be built with the workload in mind. From a BC/DR
perspective, any cloud solution should respect the availability constructs provided by the
workload itself. While Hyper-V and Azure support virtual machine availability through some of
the constructs outlined above, many workloads provide native capabilities to support service
availability within their application or service. Examples of this include Microsoft SQL Server
Always-On Availability Groups, Active Directory Domain Services (AD DS) domain controllers,
Exchange Server Database Availability Groups (DAG) and Lync services. In some cases, combining
the availability constructs of the cloud with those of the workload is either not supported or
not recommended. In these cases, it is often preferable (or required) to allow the workload to
manage its own availability. Some workload availability constructs can be combined with cloud
capabilities to further enhance their availability. Examples of this include SQL Server Always-On
Availability Groups support in Azure and enhancements to AD DS support in virtualized
environments. As outlined in the BC/DR concepts, it is important to include workload availability
constructs as Recovery Point Capabilities when determining RTO/RPO for cloud-based solutions.

4 Summary
Business Continuity and Disaster Recovery planning is a required element of any cloud-based
workload deployment. This document is meant to serve as a framework for applying BC/DR
concepts in workload planning and design for public, private and hybrid cloud environments.
These concepts and capabilities can be applied to various applications and services and
therefore require analysis of each workload's capabilities and support for the constructs
discussed earlier. Along with this guide, a series of workload-specific scenario guides is
available to outline the practical application of the framework outlined in this
document to various workloads.
