Foreword
By Laura DuBois, Program Vice President, Storage, IDC
The Disaster Recovery Imperative

Nearly all organizations today rely on information technology and the data it manages to operate. Keeping computers and networks running, and data accessible, is imperative. Without this information technology, customers cannot be serviced, orders taken, transactions completed, or patients treated. Disasters that create IT downtime are numerous and common, spanning the physical and the logical, the man-made and the natural. Organizations must be resilient to these disasters and able to keep operating through a disruption of any type, whether it is a security incident, human error, device failure, or power failure.
State of Preparedness

Most organizations know the importance of disaster recovery, and firms of all sizes are investing to drive greater uptime. An IDC study on business continuity and disaster recovery (DR) showed that the unplanned events of most concern were power, telecom, and data center failures (physical infrastructure), more so than natural events such as fire or weather. Security was considered the second most critical and extreme threat to business resiliency. Seventy-one percent of those surveyed had as many as 10 hours of unplanned downtime over a 12-month period. This underscores the importance of greater uptime and DR, which is driving firms to conduct DR tests more frequently. Approximately one in four firms conduct DR testing quarterly or monthly, while another 45% test semi-annually or annually. This is a marked increase from research IDC conducted three years earlier, when firms were testing annually at best. However, 25% of firms are still not doing any DR testing.
IDC Advice

DR planning is complex and spans three key areas: technology, people, and process. From an IT perspective, planning starts with a business impact analysis (BIA) by application/workload. Natural tiers or stages of DR begin at phase 1 infrastructure (networking, AD, DHCP, etc.) and then extend to recovery by application tier. Each application tier should have an established recovery time objective (RTO) and recovery point objective (RPO) based on business risk. DR testing is essential not only to ensure adequate recovery of systems and data, but also to uncover events or conditions encountered during real disaster scenarios that were not previously accounted for. Examples include change management, such as the needed reconfiguration of applications or systems; recovering systems in the right sequence is also important.

To ensure that DR testing, planning, and recovery are organized and effective, many organizations use a disaster recovery "run book." A DR run book is a working document, unique to every organization, which outlines the necessary steps to recover from a disaster or service interruption. It provides an instruction set for personnel in the event of a disaster, including both infrastructure and process information. Run books, or updates to run books, are the outputs of every DR test. However, a run book is only useful if it is up to date. If documented properly, it can take the confusion and uncertainty out of the recovery environment, which, during an actual disaster, is often in a state of panic. Using the run book template provided here by Xtium can make the difference for an organization between two extremes: being prepared for an unexpected event and efficiently recovering, or never recovering at all.
DR Scenarios
Though not part of the run book itself, we're providing this section to list some common events that would cause DR scenarios. These threats are general and could affect any business, so you might also want to list those that would threaten your business specifically. Research firm Forrester outlined some of the most common causes of disaster scenarios in a 2011/2012 study. The findings show that your business should not just be prepared for the news-making types of disaster threats (hurricanes or tornadoes, for example). Instead, consider all of these potential causes of disaster:
Most common causes of disaster scenarios (Forrester 2011/2012 study; chart not reproduced). Source: http://it.toolbox.com/blogs/managed-hosting-news/whats-your-2012-it-disaster-recovery-plan-49333
It is wise to also list disaster scenarios that are unique to, or are more likely to affect, your business. For each possibility, include details on the scenario, methods for data restoration on the part of the provider and your company, and procedures by which DR events will be initiated. For example:

Scenario #1: List your first disaster scenario or business continuity threat here. Examples might include significant loss of hardware, a power outage of significant length, an infrastructure outage, disk corruption, or loss of most or all systems due to unavoidable natural disaster. Identify and address those disaster scenarios that are most relevant and likely to affect your business. For each scenario, include:
- Overview of the associated scenario and the systems most likely to be affected by the threat
- Time frame of potential outages, based on the likely elements of the specific scenario
- Systems that may be brought up locally via on-premise failover equipment or premise-based cloud enablement technology
- Procedures for initiation of system failover to external data centers
- Priority schedule for system restoration
- Procedures for contacting your hosting provider (if applicable) to initiate critical support
Continue listing disaster scenarios with all important details. Do not feel limited to only a few disaster recovery scenarios; list all those that could realistically impact your business along with the associated recovery procedures. The table below may be an effective tool for listing your potential DR scenarios:

Event | Plan of Action | Owner
Power failure | Enact affected system run book plans | Application business owner
Power failure | Enact total failover plan | Disaster Recovery Coordinator (DRC)
Pending weather event (winter storm, hurricane, etc.) | Review all DR plans, notify DRC, put key employees on standby | Disaster Recovery Coordinator (DRC); Business Owner
Distribution List
This section is also critical to the development of your run book. You must keep a clearly defined distribution list for the run book, ensuring that all key stakeholders have access to the document. Use the chart below to indicate the stakeholders to whom this run book will be distributed.
Role | Name | Phone
Owner | |
Approver | |
Auditor | |
Contributor (Technical) | |
Contributor (DBA) | |
Contributor (Network) | |
Contributor (Vendor) | |

Location
Specify the location(s) where this document may be found in electronic and/or hard copy. You may wish to include it on your company's shared drive or portal. If it is located on a shared drive or company portal, consider providing a link here so the most recent version is readily accessible. If this run book is also stored as a hard copy in one or multiple locations, list those locations here (along with who has access to them). We also recommend making your run book available outside of shared networks, because the document must be readily accessible at the time of a disaster, even if primary systems like email are not accessible to employees. In other words, ensure your run book is accessible under any circumstances!
Table of Contents
Document Control
Contact Information
Data Center Access Control List
Communication Structure of Plan
Declaration Guidelines
Alert Response Procedures
Issue Management and Escalation
Changes to SOP During Recovery
Infrastructure Overview
    Data Center
    Network Layout
    Topology
    Access to Facilities
Order of Restoration
System Configuration
Backup Configuration
Monitors
Roles and Responsibilities
Data Restoration Processes
Document Control
Document creation and edit records should be maintained by your company's disaster recovery coordinator (DRC) or business continuity manager (BCM). If your organization does not have a DRC, consider creating that role to manage all future disaster recovery activities.
Document Name | Version | Date created | Date last modified | Last modified by
 | V1.1 | 12/30/2010 | |

Keep the most up-to-date information on your disaster recovery plan in this section, including the most recent dates your plan was accessed, used, and modified. Keep a running log, with as many lines as necessary, of document changes and document reviews as well.
Contact Information
This section will list your service provider's contacts (if applicable) along with those from your IT department. This is the team that will conduct ongoing disaster recovery operations and respond in the case of a true emergency. The specific roles listed below are examples of those that might comprise your team. All of these roles need to be in communication when in a disaster recovery mode of operation. For pending events, this same distribution list should be used to provide advance notice of potential incidents. Customer support teams should also not be overlooked, as they are the first line of communication to your customer base. Forgetting this step will create extra work for your primary recovery team, who will have to take time to explain what is going on.
Title | Name | Email | Primary phone | Secondary phone
Disaster Recovery Coordinator | | | |
Chief Information Officer | | | |
Network Systems Administrator | | | |
Database Systems Administrator | | | |
Chief Security Officer | | | |
Chief Technology Officer | | | |
Business Owner | | | |
Application Development Lead (as applicable) | | | |
Data Center Manager | | | |
Customer Support Manager | | | |
Call Center Manager | | | |
Role | Name | Email | Primary phone | Secondary phone
Disaster Recovery Coordinator | | | |
Customer Service Emergency Support | | | |
Sr. System Engineer | | | |
Director Service Delivery | | | |

Note: If you are working with a service provider, this position might alternately be filled by an account or test manager.
Data Center Access Control List

Name | Role | Phone | Email | Access level
 | | | | General access; can authorize guest access
 | | | | General access; can authorize guest access
 | Systems Engineer | | | Server room, cage/cabinet, and NOC access; cannot authorize guest access
 | Network Engineer | | | Server room access; cannot authorize guest access
 | | | | Server room access; cannot authorize guest access
 | | | | General access; can authorize guest access
 | | | | General access; can authorize guest access
Communication Structure of Plan

For the situation described above, your general progression of calls might be as follows:
1. Sr. Systems Engineer
2. Disaster Recovery Coordinator
3. Head of Operations
4. Director of Service Delivery
5. Network Engineer
6. Systems Administrator
7. CEO
8. Director of Business Development
9. Sales contact
10. PR Representative
Declaration Guidelines
As you create your run book, you must consider guidelines for declaring a disaster scenario. The guidelines we recommend are specified in the chart below:

Situation | Action | Owner
A workaround does not exist within a time frame that avoids affecting customer SLAs | Declare application-level failover and enact failover to the secondary site |
Restoration procedures cannot be completed in your production environment | Declare application-level failover and enact failover to the secondary site |
A production environment no longer exists or cannot be accessed | Declare a data center failure and enact a total failover plan from the primary to the secondary data center; notify your service provider and have them enact DR plans |
The use of technology can be incorporated into the declaration steps of a DR plan. Be sure not to declare on the first instance of an event unless it is completely understood that secondary instances of the event will result in increased damage to your customers or your business systems. The table below details some standard practices to use in order to mitigate premature declarations. SLAs should be built in a manner that allows for some troubleshooting and system restoration prior to the need to declare a disaster. Also use this section to outline standard monitoring procedures along with associated thresholds. List all system monitors, what they do, their associated thresholds, the alerts generated when those thresholds are met or exceeded, the individual(s) who receive the alerts, and the remediation steps for each monitor. List event monitoring standards by defining thresholds for event types, durations, corrective actions to be taken once the threshold is met, and event criticality level. Use the following chart (or a derivative thereof) to specify your event monitoring standards.
Event Type | Alert Level | Duration | Corrective Action
Performance Monitoring (Memory Usage > 80%) | Warning | > 5 minutes | Isolate problem device / recycle device
Performance Monitoring (Memory Usage > 80%) | Critical | > 3 minutes | Isolate physical device / virtual machine; configure memory pool increase; clear memory cache; clear memory buffer; increase compute allocation (virtual); add additional compute resources into application pool
Memory | Critical | > 15 minutes | Check memory queue; clear memory cache of affected system; increase memory allocation (virtual)
Storage | | |
Network | | |
Ping Check | | |
IP Check | | |

These event types (memory, storage, network, ping check, and IP check) are categories of events for which you should list specific examples in this chart.
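If it helps to keep these standards machine-readable alongside the chart, the following is a minimal Python sketch of how warning and critical rules for the memory example might be encoded and evaluated. The threshold percentage and duration windows echo the chart above; the class, function, and field names are our own illustrative choices and are not tied to any particular monitoring product.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Threshold:
    """One alerting rule: the limit must be exceeded for at least duration_min minutes."""
    level: str          # "Warning" or "Critical"
    limit_pct: float    # usage percentage that triggers the rule
    duration_min: int   # minutes the condition must persist before alerting

# Example standards for the Memory event type, mirroring the chart above.
MEMORY_THRESHOLDS: List[Threshold] = [
    Threshold("Warning", 80.0, 5),
    Threshold("Critical", 80.0, 15),
]

def classify(usage_pct: float, minutes_breached: int) -> Optional[str]:
    """Return the most severe alert level whose rule is met, or None."""
    matched = [t for t in MEMORY_THRESHOLDS
               if usage_pct > t.limit_pct and minutes_breached >= t.duration_min]
    if not matched:
        return None
    # With these rules, the longest-duration rule satisfied is also the most severe level.
    return max(matched, key=lambda t: t.duration_min).level

if __name__ == "__main__":
    print(classify(85.0, 6))    # Warning
    print(classify(85.0, 20))   # Critical
```

A structure like this can be extended with one list per event type (storage, network, ping check, IP check) so that the run book chart and the monitoring configuration stay in step.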
Alert Response Procedures

Service interruption identified > Service Delivery Manager contacted
1. A ticket is opened with the support team (either in-house or via a third-party provider's ticket creation system).
2. Contact key stakeholders to ensure they are aware of the alert and determine whether any current activity or recent changes may be responsible for the service interruption.
3. Verify that the alert is legitimate and not an isolated single-user issue or a monitoring timeout.
4. Notify end users of the ticket creation.
5. Contact the appropriate member(s) of your operations or engineering teams to notify them of the alert and assign investigation and data restoration procedures.
Issue Management and Escalation

Depending on the severity of the service interruption, your escalation procedures will vary by the parties involved, the response chain, the response time, and the target resolution.
Changes to SOP During Recovery

For example, a user submits a call/ticket to your service desk stating that they cannot access the company website. This ticket would be answered with a message that the organization is currently in a recovery operations cycle and that the service ticket will be addressed as soon as technicians have completed the restoration work.
Infrastructure Overview
Provide a detailed overview of your IT environment in this section, including the location(s) of all data center(s), nature of use of those facilities (e.g. colocation, tape storage, cloud hosting), security features of your infrastructure and the hosting facilities, and procedures for access to those facilities.
Data Center
Specify the location of all facilities in which your company's data is stored. Include an address and directions to each location. Any data center diagram you include needs to be detailed enough to give a backup recovery team member the information necessary to perform his or her responsibilities if called upon.
Example data center network diagram (not reproduced). Source: http://www.storageguardian.com/media/network_diagram.gif
Example data center topology diagram (not reproduced). Source: http://www.routereflector.com/en/2013/05/data-center-topology-with-cisco-nexus-hp-virtual-connect-and-vmware/
Example network topology diagram (not reproduced). Source: http://sanketshukla.blogspot.com/2009/11/dhs-network-topology-diagram.html
Access to Facilities
Data centers and colocation facilities typically maintain strict entry protocols, and only certain members of your organization will hold the appropriate credentials to enter the facility. Detail the members of your team (and/or your IT service provider's team) who have access to each data facility, along with any requirements for access.
Order of Restoration
This section includes instructions for recovery personnel that lay out which infrastructure components to restore and in which order. It should take into account application dependencies, authentication, middleware, database, and third-party elements, and it should list restoration items by system or application type. Ensure that this order of restoration is understood before engaging in restore work. An example is provided below; the rest of the table should be filled out in the exact order in which restoration procedures are to be completed.
Order of Restoration Table:

Server Name | Server Role | Order of Restoration | OS / Patch level | Application loaded
Ws12_VF1 | Web Server Valley Forge 1 | Restore prior to db12_VF1 startup | ESX4.1 | Apache
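Because the order of restoration is essentially a dependency-ordering problem (infrastructure before application tiers, and dependent systems only after what they rely on), it can also be derived mechanically from declared dependencies. The sketch below is illustrative only: Ws12_VF1 and db12_VF1 echo the example row above, while "infra_VF1" is a placeholder standing in for phase 1 infrastructure.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each server to the servers that must be restored before it can start.
depends_on = {
    "infra_VF1": set(),                 # placeholder for phase 1 infrastructure (networking, AD, DHCP, etc.)
    "Ws12_VF1": {"infra_VF1"},          # example row: the web server is restored first...
    "db12_VF1": {"Ws12_VF1"},           # ...prior to db12_VF1 startup
}

# static_order() yields a restoration sequence that respects every dependency.
restore_order = list(TopologicalSorter(depends_on).static_order())
print(" -> ".join(restore_order))       # infra_VF1 -> Ws12_VF1 -> db12_VF1
```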
System Configuration
This section should include system- and application-specific topology diagrams and an inventory of the elements that comprise your overall system. Include networking, web/app middleware, database, and storage elements, along with third-party systems that connect to and share data with this system. Lay out each of your systems separately and include a table for your network, server layout, and storage layout.

Network table:

Device type | Name | Primary IP | OS level | Gateway | Subnet Mask
Firewall | | | | |
Load balancer | | | | |
Switch | | | | |
Router | | | | |
Server table:
Server Name / Priority | OS | Patch | IP Address | Subnet | Gateway | DNS | Alternate DNS | Secondary IPs | Production MAC Address
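If you also want this inventory in a machine-readable form that can be versioned alongside the run book, a minimal sketch such as the following works. The field names mirror the table columns above; the sample devices, addresses, and values are placeholders, not a real environment.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NetworkDevice:
    device_type: str          # Firewall, Load balancer, Switch, Router
    name: str
    primary_ip: str
    os_level: str
    gateway: str
    subnet_mask: str

@dataclass
class Server:
    name: str
    priority: int             # restoration priority
    os: str
    patch: str
    ip_address: str
    subnet: str
    gateway: str
    dns: str
    alternate_dns: str = ""
    secondary_ips: List[str] = field(default_factory=list)
    production_mac: str = ""

# Placeholder entries; replace with your actual inventory.
network = [NetworkDevice("Firewall", "fw01", "10.0.0.1", "v9.1", "10.0.0.254", "255.255.255.0")]
servers = [Server("Ws12_VF1", 1, "ESX4.1", "U3", "10.0.1.10", "255.255.255.0", "10.0.1.254", "10.0.0.53")]

# Print the server layout in restoration-priority order.
for s in sorted(servers, key=lambda s: s.priority):
    print(f"{s.priority:>2}  {s.name:<12} {s.os:<8} {s.ip_address}")
```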
Backup Configuration
Use this section to list instructions specifying the servers, directories and files from (and to) which backup procedures will be run. This should be the location of your last known good copy of production data.
Server | Software | Version | Backup Cycle | Backup Source | Backup Target
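One way to make the "last known good copy" explicit is a small check that each backup target exists and is no older than its backup cycle. The sketch below is hypothetical: the system names, UNC paths, and the 24-hour cycle are placeholders to be replaced with the servers and targets from your own table.

```python
import os
import time

# Hypothetical backup targets keyed by system name; replace with your own.
BACKUP_TARGETS = {
    "payroll": r"\\backup01\payroll\latest.bak",
    "crm":     r"\\backup01\crm\latest.bak",
}
MAX_AGE_HOURS = 24  # should match the backup cycle recorded in the table above

for system, path in BACKUP_TARGETS.items():
    if not os.path.exists(path):
        print(f"{system}: MISSING backup at {path}")
        continue
    age_hours = (time.time() - os.path.getmtime(path)) / 3600
    status = "OK" if age_hours <= MAX_AGE_HOURS else "STALE"
    print(f"{system}: {status} (last written {age_hours:.1f} hours ago)")
```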
Monitors
Listed by server, these monitors should be put in place and activated as part of your restore activities. Restoring from a disaster should result in a mirror of your production environment (even if scaled). Monitors and alerts are a critical element of your production system.
Server name | Monitor | Cycle | Alert
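After a restore, a quick reachability pass over the monitored endpoints can confirm that systems are back before alerting is re-enabled. This is a generic sketch; the host names and ports below are hypothetical examples and are not taken from the tables above.

```python
import socket

# Hypothetical (server, port) pairs to confirm before re-activating monitors.
ENDPOINTS = [
    ("ws12-vf1.example.local", 443),    # web server
    ("db12-vf1.example.local", 1433),   # database server
]

def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in ENDPOINTS:
    state = "up" if reachable(host, port) else "NOT reachable - verify before enabling alerts"
    print(f"{host}:{port} {state}")
```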
Roles and Responsibilities

This matrix describes the participation of various roles in completing DR tasks or deliverables. It clarifies roles and responsibilities for IT stakeholders in your organization as well as for any service providers involved with your business's disaster recovery program. Fill in the matrix below, specifying the roles for your company, your service provider (if applicable), and any other third parties that will be involved in your disaster recovery tests.
- Responsible Party: Those who do the work to achieve the task.
- Accountable Party: The party ultimately answerable for the correct and thorough completion of the deliverable or task, and the one who delegates the work to the responsible party.
- Consulted Party: Those whose opinions are sought, typically subject matter experts, and with whom there is two-way communication.
- Informed Party: Those who are kept up to date on progress, often only on completion of the task or deliverable, and with whom there is just one-way communication.
Positions that will fill these roles and responsibilities will often include your DR coordinator, network engineer, database engineer, systems engineer, application owner, data center service coordinator, and your service provider. Identify the responsibilities of each of these roles in a disaster event, then map them onto a matrix of all activities associated with recovery procedures, as in the example table provided below.
Activity | R | A | C | I
Maintain situational management of recovery events | DRC | DRC | DRC | All
React to server outage alerts | | | |
React to file system alerts | | | |
React to host outage alerts | | | |
React to network outage alerts | | | |
Document technical landscape | | | |
Configure network for system access | | | |
Configure VPN and acceleration between your business and service provider network (if applicable) | | | |
Maintain DNS or host file | | | |
Monitor service provider network availability (if applicable) | | | |
Diagnose service provider network errors (if applicable) | | | |
Create named users at OS level | | | |
Create domain users | | | |
Manage OS privileges | | | |
Create virtual machines | | | |
Convert physical servers to virtual servers | | | |
Install base operating system | | | |
Configure operating system | | | |
Configure OS disks | | | |
Diagnose OS errors | | | |
Start/stop the virtual machine | | | |
Windows OS licensing (or your operating system) | | | |
Security hardening of the OS | | | |
Daily server-level backup | | | |
Patch management for Windows servers (or your operating system) | | | |
Provide a project manager | | | |
Provide a key technical contact for OS, network, and SAN | | | |
Coordinate deployment schedule | | | |
Support, management, and update of Protection Software | | | |
Install, support, management, and update of Terminal Server | | | |
Restoration Procedures
Though your order of operations should stay relatively consistent, list the steps taken for each and every backup system. For example:

Payroll system backup: System XYZ Payroll
1. Start DB server vm2345-qa1
2. Start application server vm354-r1
3. Start web server vm6_ws4
4. Terminal server to Ws1_Vf1_Payroll
5. Log in to backup archive URL: backup.archive.payroll
6. Create temp target folder for backup file
7. Login: user1 / Password: 1resu
8. Navigate to most recent backup file
9. Select file
10. Select restore target Ws1_Vf1_PayrollProd1
11. Initiate restore
12. Select overwrite options
13. Confirm dialog box warning "Are you sure?"
14. Complete restore of backup file
15. Log in to Ws1_Vf1_PayrollProd1
16. Start Payroll App: local\temp\dirs\payrollprod1.exe
17. Navigate via Explorer to temp backup folder
18. Select file
19. Open payrollprod application console
20. Select data source > temp\backup\payrollWs1bckup
21. Import
22. Validate through report test 1 run

This is only an example of what the procedures for one system restoration may look like. For each of your actual systems, similarly list step-by-step instructions for a full system restoration.

Use the rest of this section to similarly list restoration procedures for each of your backup systems.
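To keep a procedure like the payroll example both repeatable and auditable, one option is to record each step as a short description plus an action, and log the outcome as the run proceeds. The sketch below is generic and hypothetical: every step is a placeholder standing in for the manual actions or tooling calls your environment actually uses, and the step names only loosely echo the example above.

```python
import datetime

def log(message: str) -> None:
    """Print a timestamped line so the restore run leaves an audit trail."""
    print(f"{datetime.datetime.now():%Y-%m-%d %H:%M:%S}  {message}")

def run_steps(steps) -> None:
    """Execute steps in order; any exception stops the run so it can be reviewed."""
    for name, action in steps:
        log(f"START  {name}")
        action()
        log(f"DONE   {name}")

# Placeholder steps mirroring the payroll example; replace each lambda with a
# call to your virtualization, backup, or application tooling, or leave it manual.
payroll_restore = [
    ("Start DB server vm2345-qa1",                    lambda: None),
    ("Start application server vm354-r1",             lambda: None),
    ("Start web server vm6_ws4",                      lambda: None),
    ("Restore latest backup to Ws1_Vf1_PayrollProd1", lambda: None),
    ("Import data source and validate report test 1", lambda: None),
]

if __name__ == "__main__":
    run_steps(payroll_restore)
```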
Have questions about this run book or disaster recovery for your business?
Contact Us!
solutions@xtium.com 800-707-9116