
Major Incident Handling
Request for Approval by ITS Senior Management Team
May 29, 2007
Presenters: Linda Rosewood, Ann Berry-Kline, John Hammond
(Process Approved)

SMT chartered a project to produce a Problem Management process. The group worked on an ITIL framework for Problem Management, but discovered that what SMT probably wanted, and what the division needed, was a major incident handling process. The model for this would be Incident Management, where the goal is to restore service as soon as possible and to manage communications and client expectations. Problem Management will be finished later. This Major Incident Handling (MIH) process is a special form of Incident Management, where coordination and communication internally and externally are emphasized.

What we're asking for today is the approval of this MIH process. When we presented this for discussion several months ago, the discussion focused on "who can declare a major incident" and how the processes would work after hours. Since our last visit, John Hammond, Eric Kiesler and I have refined the process and completed it for both 8-5 and after hours. We left the criteria for declaring a major incident to the details recorded in an OLA. Recently, the CruzMail OLA team used this process and it worked flawlessly when the details of the service were applied to it. We believe the process is ready to be adopted for the four highlighted services, and to become a general model for all ITS services, including those offered by the divisions.

Additionally, one of the deliverables of the DDSLA program is to create a plan to complete OLAs/SLAs for all services. This plan will need to stage services. One criterion for placing a service at the top of the list is whether it qualifies as an MIH service, and in the packet you'll find a list of services the committee recommends as the next to have MIH developed.

If approved, we will go forward with implementing the process in the four highlighted services, as well as creating a training for Core Technology, Communications, and Support Center staff, where we run scenarios and understand the tasks of each role. In advance of a complete portfolio of OLAs, we hope to get to a place where we can establish thresholds in the services provided by Core Technologies and put this process into production sooner rather than later. That is, we will become familiar with the process and be able to identify "who can declare a major incident" before OLAs are created for all major services.

Contents of MIH packet

1 This contents page
2 Major Incident Handling Process Overview
  A diagram giving an overview of the individual processes, showing how incident management is applied to major incidents, and identifying three types of coordination roles: Incident, Technical, and Communications.
3 MIH Process Description
  An outline describing tasks, roles, and timeline. A textual annotation of the detailed process diagrams that follow.
4 MIH Process Description (After Hours)
  Supplementary to the MIH Process Description, showing how tasks, roles, and timeline differ if a major event occurs outside of our business day of 8 am to 5 pm, Monday-Friday.
5 Implementation Notes
  Assumptions, suggestions on monitoring tool coordination, "Hot Line" proposal for the Data Center.
6 Recording and Classification Processes (Detail)
  A process diagram for the first two stages of Incident Management, including incipient event tracking, major incident classification, and the declaration of a major incident.
7 Referral and Resolution Processes
  A process diagram for stages three and four of Incident Management, showing building the team and working the incident to resolution.
8 Event Coordination and Event Communications Processes
  The purpose of the MIH process is to improve coordination and communication between ITS units. This is a process diagram detailing the coordination and communication tasks and roles.
9 Closure Processes by Role
  A process diagram for the last stage of Incident Management for the roles in the Support Center, Communications, and Technical teams.
10 After Hours Process
  A process diagram supplementary to the MIH process.
11 Role Definitions
  Detailed definition of roles and tasks performed.
12 Major Incident Declaration Template
  Exactly what tool will be used to store and edit the template has not been determined yet. It is likely to be found in IT Request. This template lists what information is to be included when declaring a Major Incident.
13 Highlighted services that need MIH
  List of services that need MIH processes first. The first of this list of ten will be CruzMail, CruzNet, and CruzTime. Although Desktop Services is one of the four highlighted services this summer, it will not have a major incident handling OLA.

ITS Major Incident Handling Process Overview

[Overview diagram: lanes for Help Desk, Inc Coord, Tech Lead, and PMG/Comm across five steps. Step 1, Recording: Incipient Event Tracking. Step 2, Classification: Major Incident Classification. Step 3, Major Inc Tracking: MIH Declared, Diagnose, Create Prelim Comms, SMT/DLs Notified. Step 4, Referral/Resolution: Assign Tech Lead, Build Team, Work Incident, Major Inc Coord, Resolution, Event Comms. Step 5, Closure: Closure for each role.]

03 MIH Process Descriptions

Major Incident Processes

1 (Recording) Incipient Event Tracking
  1.1 HD staff note a pattern of tickets and start linking them with the "hot ticket" feature of IT Request.
      Role: Help Desk Team. Inputs: tickets. Outputs: hot ticket. Time: 15 to 30 min.
  1.2 ITS staff can use the Major Incident Hotline.
      Role: Svc Providers. Inputs: phone call. Outputs: ticket. Time: on-going.
    1.2.1 aka "Red Phone." This is a phone that rings in the HD office, is always answered, and is never busy. Used by Svc Providers to report major incidents and by the Tech Lead to update the HD team. The telephone number is easy to remember and is distributed orally.
2 (Classification) Major Incident Classification
  2.1 For each service, Major Incident criteria are consulted in the OLA.
  2.2 Priority is assigned to tickets based on the OLA.
  2.3 Process includes a quick call or IM to system administrators for a status check. Also check with the Support Center Subject Matter Expert.
3 Major Incident declared (Investigation and Diagnosis)
  3.1 If tickets indicate that the issue has reached Major Incident status:
    3.1.1 Complete the MIH Template and post it as a tech-only message.
      Role: Inc Coord. Inputs: tickets. Outputs: Major Inc declaration. Time: 10 minutes.
  3.2 Create preliminary communications in IT Request visible to ITS staff, visible to the public, and on the telephone greeting.
      Outputs: tech-only msgs, ITR public msgs, phone msg. Time: 5 minutes.
  3.3 Notify SMT/DLs that a major incident is declared.
      Outputs: email/cell/vmail to PMG/Comm. Time: 5 minutes.
    3.3.1 Message to SMT/DLs is simple: there is a major incident, look at Messages and Ticket # for details.
4 Major Incident Tracking
  4.1 Continue to collect incident information.
      Role: Help Desk Team. Inputs: tickets. Outputs: tech notes. Time: on-going.
5 Incident Coord Assigns Tech Lead (Referral and Resolution)
  5.1 Consults OLA or IT Request to identify Lead Tech.
      Role: Inc Coord. Inputs: ITR Config.
6 Tech Lead Builds Team
  6.1 Builds team. Tech Lead can assign another tech lead if needed.
      Role: Tech Lead. Outputs: MIH note in hot ticket describes team. Time: 5 minutes.

  6.2 Agreement on communication protocol: status/tools/phone numbers/IM.
      Role: Tech Lead/Inc Coord.
7 Team works Incident
  7.1 Outputs: ticket updates, FAQs, tech-messages, workarounds, communications within team and with Incident Coord, root cause, known errors, emergency RFCs.
  7.2 Priority is placed on creating workarounds over fixes or discovering root cause.
8 Event Communications
  8.1 Tech Lead's communications
      Role: Tech Lead. Inputs: see 7.1. Outputs: updates to Inc Coord via protocol (workarounds, status, fixes). Time: every 60 minutes or as scheduled; on the hour if possible.
  8.2 Incident Coordinator's communications
      Role: Inc Coord. Outputs: updates to PMG/Comm via tech-only notes, phone calls. Time: ditto.
  8.3 PMG/Comm communications
      Role: PMG/Comm. Inputs: from Inc Coord. Outputs: public web messages to campus and others. Time: every 60 minutes or as scheduled.
  8.4 PMG/Comm is responsible for keeping the ITS phone list updated with ITS staff contact information. ITR technician accounts are created by the Support Center. ITR technicians are expected to keep their tech profiles updated.
9 Event Coord
  9.1 Includes implementing workarounds as created.
      Role: Inc Coord. Inputs: workarounds, fixes. Time: as available.
10 Resolution
  10.1 Resolutions.
      Role: Tech Lead. Time: as possible.
  10.2 Resolve tickets.
      Role: HD Team.
11 Closure
  11.1 Tech Lead closure activities
      Role: Tech Lead. Outputs: post mortem (audience = SMT/DLs/ITS). Time: within 48 hours.
  11.2 HD team closure activities
      Role: HD team. Outputs: FAQs, and other normal IcM closure activities. Time: within 5 working days.
  11.3 PMG/Comm closure activities
      Role: PMG/Comm. Outputs: report (audience = campus). Time: within a week, and kept in web archive.
12 After hours process
  In separate document.
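
The closure timelines in step 11 amount to simple date arithmetic. The following is a minimal sketch and not part of the MIH process itself: the function names are hypothetical, "within 48 hours" and "within a week" are read as elapsed time from resolution, and "working days" are assumed to be Monday through Friday with no holiday calendar.

```python
from datetime import datetime, timedelta


def add_working_days(start: datetime, days: int) -> datetime:
    """Advance by the given number of Monday-Friday days (holidays ignored)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def closure_deadlines(resolved_at: datetime) -> dict:
    """Deadlines for the three closure activities in step 11."""
    return {
        # 11.1 Tech Lead post mortem: within 48 hours
        "tech_lead_post_mortem": resolved_at + timedelta(hours=48),
        # 11.2 HD team closure activities: within 5 working days
        "hd_team_closure": add_working_days(resolved_at, 5),
        # 11.3 PMG/Comm closure activities: within a week
        "pmg_comm_report": resolved_at + timedelta(days=7),
    }


if __name__ == "__main__":
    resolved = datetime(2007, 5, 29, 14, 0)  # a Tuesday afternoon
    for activity, due in closure_deadlines(resolved).items():
        print(f"{activity}: due {due:%a %Y-%m-%d %H:%M}")
```

A service with a holiday calendar, or an OLA that defines these windows differently, would need a different `add_working_days`.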


04 MIH Process Description After Hours

Major Incident Processes After Hours

1 (Recording) Alarms/Monitors/Reports are received by Ops
  Currently Ops provides a daily message about the night's events. sc.update is a list that should be added to that distribution. This alerts not only the Help Desk but Lisa Bono and her back-ups. As Core Technologies monitoring tools develop and align, we will align "warning event" and "critical event" alerts to the Major Incident processes.
2 (Recording) Alarm/Monitor/Report sufficient to suggest calling On-Call staff
  2.1 Opens ticket. Emails ticket to sc.update. This step will require training before implementation.
  2.2 Using a template, creates a tech-only message pointing at the ticket. Audience is all-techs. Expires in 24 hours. This step will require training before implementation.
  2.3 Checks SLA or other documentation: is the system/service eligible for off-hours support? If not, contacts client if known. Emails ticket to dls@ucsc.edu, calling their attention to the ticket. Before SLAs are in place, we require a list of services that get off-hours support. This list is partially completed.
  2.4 Is there support for off-hours? If not, contact client and DL per SLA.
    2.4.1 If the service does not get off-hours support, Help Desk staff perform resolution and closure on the ticket after 8 am.
  2.5 If yes, contact the On-Call person.
    2.5.1 If On-Call doesn't respond within the defined time, moves down the list of contacts per on-call protocol.
3 (Classification) Incident Classification
  3.1 For each service, Major Incident criteria are consulted in the OLA.
4 Major incident?
  4.1 On-Call person decides if the report fits the criteria for this service. Acts as Tech Lead until relieved or delegated.
  4.2 If yes, informs Operations staff, who complete the Major Incident Template and add it to the ticket. Changes original ticket to "Hot Ticket." Operations staff act as Incident Coord during off-hours.
  4.3 If not a Major Incident, Operators resolve the ticket and copy in sc.update. This updates PMG/Comm also. Record in detail what occurred and why it was not a Major Incident.
5 (Investigation/Diagnosis)
  5.1 On-Call person does what is necessary to gather information and determine the team and perhaps another tech lead, if needed.
6 Referral/Resolution (if it is a Major Incident)
  6.1 Assembles team, if needed; starts working the issue.
  6.2 Decides: is there a high probability that status will not be normal at 6:30 am? (consulting the documentation for that service)
    6.2.1 If yes, Operators create a tech-only message pointing at the Hot Ticket containing the Major Incident declaration.
    6.2.2 Tech Lead communicates with management per the service's protocol.

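Role, inputs, outputs, and time by step:
  Step 1. Role: Ops Staff. Inputs: alarms. Outputs: daily email.
  Step 2. Role: Ops staff. Inputs: alarm. Outputs: ticket. Time: immediate.
  Steps 3 and 4. Role: On-Call. Inputs: ticket. Outputs: tech notes; ITR message, email. Time: 30 min; 10 min; depends on service.
  Steps 5 and 6. Roles: On-Call, Ops Staff. Outputs: Major Inc declaration; phone calls/emails. Time: 30 min; varies.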

  6.3 At 8 am, the Major Incident Process starts with Step 3 of business-hours Major Incident Handling. Incident Coordinator duties can be handed off or stay with Operators, depending on workload and service.
7 If resolved before 8 am, begin closure procedures.
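
The after-hours decision points (off-hours support, on-call escalation, major incident criteria, and the 6:30 am check) can be sketched as a small function. This is an illustration only, under stated assumptions: the `AfterHoursReport` fields and the returned action strings are hypothetical names, the real criteria live in each service's OLA/SLA rather than in boolean flags, and the branch where no on-call contact responds is not specified by the process and is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AfterHoursReport:
    """Hypothetical summary of what Ops knows about an after-hours alarm."""
    service: str
    has_off_hours_support: bool      # from the SLA or off-hours support list (2.3)
    on_call_responses: List[bool]    # per contact, in on-call list order (2.5.1)
    meets_major_criteria: bool       # On-Call person's decision per the OLA (4.1)
    normal_by_630_likely: bool       # judgment made in step 6.2


def first_responder(responses: List[bool]) -> Optional[int]:
    """Index of the first on-call contact who responded, or None."""
    for index, responded in enumerate(responses):
        if responded:
            return index
    return None


def after_hours_actions(report: AfterHoursReport) -> List[str]:
    """Ordered actions for Ops and On-Call staff, following steps 2 through 6."""
    actions = ["open ticket and email it to sc.update",
               "post tech-only message pointing at ticket"]
    if not report.has_off_hours_support:
        actions += ["contact client and DL per SLA",
                    "Help Desk resolves and closes ticket after 8 am"]
        return actions
    responder = first_responder(report.on_call_responses)
    if responder is None:
        # Assumption: the process does not say what happens here.
        actions.append("no on-call contact responded; follow on-call protocol")
        return actions
    actions.append(f"on-call contact #{responder + 1} acts as Tech Lead")
    if not report.meets_major_criteria:
        actions.append("Operators resolve ticket, copy sc.update, record why not major")
        return actions
    actions += ["Operations completes Major Incident Template; ticket becomes Hot Ticket",
                "assemble team if needed and work the issue"]
    if not report.normal_by_630_likely:
        actions += ["Operators post tech-only message pointing at Hot Ticket",
                    "Tech Lead communicates with management per service protocol"]
    return actions
```

For example, `after_hours_actions(AfterHoursReport("CruzMail", True, [False, True], True, False))` walks the full major-incident path with the second on-call contact acting as Tech Lead.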

05 Implementation notes for MIH

Implementation Notes for Major Incident Handling


1 We already handle Major Incidents
  This process builds on what we are already doing. The goal of these processes is to document for training, consistency, and improvement.
2 Assumptions
  2.1 We depend on OLAs to record the implementation details per service.
  2.2 Incidents only; no MIH for service requests.
  2.3 Incident Coords are assumed to be members of the Help Desk team, but could be in other parts of ITS as the process develops operationally. As the Support Center matures and becomes the single point of contact, incident coordinators are likely to be solely in the Support Center or Data Center. The after-hours process identifies DC Operators as the Incident Coords.
3 Groundworks and HPOV integration
  The two main monitors of Core Tech already have alerts, events, and actions taken when thresholds are reached. For example, a "warning event" seen by Groundworks generates an email and a "critical event" requires operator action. As this process is implemented, we will need to coordinate the actions and terms from the monitoring tools with the terms used in the process and keep our terminology consistent. As our monitors mature, we will rely less on clients notifying ITS of outages.
4 Senior Management Buy-In
  All Senior Managers will need to endorse the framework of Major Incident handling and use it as we develop OLAs for all services. We need to put energy into process development and into examining how we did after each Major Event. Core Tech is the key area here.
5 Major Incident Hotline
  The Help Desk team's room contains a "hot line" that is used for phone conversations between the Lead Tech and the Incident Coord during a Major Incident (generally during open hours of 8 am to 5 pm). We keep the phone number a close secret. The phone is always answered and never busy. It is used by DLs to report incidents. It is used by Tech Leads to report status when email is not suitable.
  The MIH plan proposes installing a second hotline in the Data Center. As with the hotline in the Help Desk, it should have a distinctive ring and always be answered. It is to be used by DLs and ITS Service Providers to report incidents. System Owners will use this line to give status reports to the Data Center operators during a major incident, or to report the initial discovery of what could become a major incident.
  The hotlines are to be used when verified status is known, not to ask the Help Desk or Data Center staff for technical assistance or to "check to see if there is something wrong with the network." The hotlines are intended for conversations such as "I'm the system administrator for XYZ. Users can't login and we're investigating it now. I'll call you back in 30 minutes with an update."
6 Closure Reports
  After a Major Incident, the Tech Lead has the responsibility to produce a technical report describing the technical history of the incident, workarounds created, and a root cause, if known. The Support Center staff work with the PMG/Communications staff to create public versions of this document. The public reports (and technical reports, if appropriate) will be posted on the web indefinitely.
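
The monitoring note above says a "warning event" generates an email and a "critical event" requires operator action. One way to keep tool terms and process terms consistent is a single lookup table. The sketch below is hypothetical: `EVENT_ACTIONS` and `action_for` are illustrative names, not part of Groundworks or HPOV, and the fallback for unmapped severities is an assumption.

```python
# Hypothetical mapping from monitoring-tool severity terms to process actions.
EVENT_ACTIONS = {
    "warning": "send email",
    "critical": "operator action required",
}


def action_for(severity: str) -> str:
    """Look up the process action for a monitoring severity term.

    Unknown terms fall back to operator review (an assumption, so that
    an unmapped event is never silently dropped).
    """
    return EVENT_ACTIONS.get(severity.strip().lower(), "operator review")
```

Keeping the mapping in one place means a change in terminology is made once rather than in every tool's configuration.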

Recording and Classification

[Process diagram, Support Center lane. Incipient Event Tracking: inputs are Incident, Alarm, Monitors, email, Voice; Record Basic Details and Acknowledge Receipt (ITRequest); decision "Need More Detail?" leading to the IcM Recording process; decision "Service Request?" leading to the Service Request Process. Major Incident Classification: Consultation with Svc Mgrs and SC SMEs; Impact Classification using Urgency, Priority, and Impact Assessment Parameters from the SLA and OLA; decision "Classified Major Incident?", with "No" leading to Standard Incident Management. Major Incident Tracking: Upgrade to Hot Ticket; Preliminary Comms using Declaration template; Notify SMT/DLs; continue to Referral & Resolution.]

Referral and Resolution

[Process diagram, lanes for Support Center and MIH Technical Team. Initial Investigation; Major Incident Coordinator Assigns Technical Lead. Build Team: Tech Lead and MIC Agree on Internal Communication Protocol; Tech Lead Assigns Appropriate Resources to Team; Tech notes to Hot Ticket. Work Incident: Investigate and Diagnose; Tech Lead to MIC Status; Review Progress. Resolution: decision "Determine Root Cause?"; Fix Root Cause; Test Root Resolution; decision "Success?"; Identify Workaround; Test Workaround; decision "Success?"; Implement Workaround; Publish Resolution or Workaround. Closure.]

Event Comms and Event Coord

[Process diagram, lanes for Communications, Support Center, and MIH Technical Team, across the stages Recording and Classification, Referral and Resolution, and Closure. Event Comms go to DLs/SMT, to specific clients, and to campus. Build Team and Work Incident produce Tech notes to Hot Ticket, Tech Lead to MIC Status, and Ticket updates. Major Incident Coordination produces Tech-only messages and workarounds/FAQs. Resolution leads to Publish Resolution and Publish Workaround. Closure: Major Incident Closure, Closed, Problem Record, Problem Management.]

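Closure

[Process diagram, lanes for Support Center, MIH Tech Team, and PMG Comms. Each lane starts from Resolution. MIH Tech Team: Write Technical Report. PMG Comms: Report to Campus. Support Center: assist with writing; Closure. Outputs: Problem Record, passed to Problem Management.]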

After Hours MIH Process

[Process diagram, lanes for On Call Personnel and Operations. Recording: inputs are Incident, Alarm, Monitors, email, Voice; Record Basic Details in ITRequest and Acknowledge Receipt using template; tech-only message, email sc.update; Check Service Availability (SLA). Classification: decision "Off Hour Support for Service?", with "No" leading to Contact Client and/or DLs and "Yes" leading to Contact On Call Person; On Call Investigate/Diagnose; Major Inc Criteria. Investigation/Referral: Major Incident Declared; Notify Operations; Notify Unit Management; Work Incident; Message to SMT; decision "fixed by 6:30 am?", with "Yes" leading to Closure and Closure Processes, and "No" leading to Message to DLs, Prelim Comms, Update tech-only mesg, Update to Hot Ticket, and Start Step 3 @ 8 am.]

11 Roles for MIH

Roles

1 Major Incident Tech Lead
  Predetermined person for each service, recorded in the OLA and in IT Request as the "Group Manager."
  Makes decisions about resources.
  Authority to pull in resources from across units.
  Reports status to Major Incident Coordinator per protocol as defined in OLA.
  Coordinates work of technical team developing workarounds and other resolutions.
  Writes Technical Post Mortem Report (audience: SMT, DLs, and ITS staff). Reports to other audiences use this as a resource.
  Characteristics of Role
    Senior technical person or manager.
    A single person is designated the tech lead during an event. This designation can be delegated.
  Characteristics of Communications
    Communications with Incident Coordinator concentrate on status, workarounds, and fixes. Although theories are discussed with Incident Coordinator, the focus should not be on technical theories until the Problem Management process discovers the root cause of the problems behind the incidents.
2 Major Incident Coordinator
  In the initial version of the process, this person is always a member of Support Center's Help Desk Team. However, Incident Coords may be designated in Instructional Technology, Media Services, Divisional IT groups, or other ITS units with a client-facing function.
  Receives reports from Lead Tech and reports incident information to Lead Tech.
  Usually communicates only with the Lead Tech during an incident, and not with the rest of the technical team.
  Gives status reports to the PMG Communications staff per OLA/SLA.
  Sees that technical communications via tech-only messages are created and updated.
  Oversees recording, classification, and initial diagnosis during incident.
  Oversees complete closure of all tickets.
  Oversees documentation of workarounds, FAQs, etc. during closure.
3 Support Center Subject Matter Expert
  Support Center staff are assigned to be technical experts in each service of the catalog.
  Consulted during incident classification process before major incident is declared.
  Assists incident coordinator in writing communications and implementing workarounds.
  If Incident Coordinator is not in the Support Center, then relevant SMEs should be consulted before declaring a major incident.
4 Data Center Operator
  Incident Coordinator during Off-Hours process.
  Notifies On-Call tech per OLA/protocol.
  Creates initial ticket during off-hours events.
5 On Call Tech
  Core Systems staff person who is on call for a particular service after hours.
  Responsible for finding the right person to respond to the issue.
  Contacted by operators, and no one else.
6 PMG Communications Staff
  Communicates with campus community during event and post-mortem.
  Uses all communication channels: ITS status page, mass vmail, mass email, targeted communications. In the future, RSS feeds, etc.
7 Divisional Liaison (DLs)
  Communicates with their clients as needed.
  Informs ITS staff of client needs, priorities, critical calendar events.
8 Help Desk Staff
  Executes IcM processes.
  Performs internal and external communications.
9 Senior Management Team (SMT)
  Receives early communications.
  Receives communications during events and the post-mortem by the Tech Lead.
  Sets priorities and policy, assigns resources.

12 Major Incident Declaration template

Major Incident Declaration Template

Exactly what tool will be used to store and edit the template has not been determined yet. It is likely to be found in IT Request.

Elements of the template
  the service impacted
  the person who made the declaration
  the condition of alarms
  the impact: who is impacted by this outage or service degradation? Who should care about this?
  when did it begin, or when was it discovered?
  how long is the condition expected to last? This may be unknown. If so, say unknown.
  tech groups working on the incident
  technical lead on incident
  incident coordinator
  PMG/Comm staff person contacted (if possible)
  Consult campus calendar and maintenance calendar. Will other events be affected? Could other events be related to the cause?
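
Since the tool for storing the template is undecided, the elements above can be pictured as a simple record. The following sketch is an assumption for illustration: the class name, field names, and text rendering are hypothetical and do not describe IT Request.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MajorIncidentDeclaration:
    """Hypothetical record holding the template elements."""
    service_impacted: str
    declared_by: str
    alarm_condition: str
    impact: str                               # who is impacted, who should care
    began_or_discovered: str
    expected_duration: Optional[str] = None   # None is rendered as "unknown"
    tech_groups: List[str] = field(default_factory=list)
    technical_lead: str = ""
    incident_coordinator: str = ""
    pmg_comm_contact: Optional[str] = None    # only if someone was reached
    calendar_notes: str = ""                  # campus and maintenance calendars

    def render(self) -> str:
        """Format the declaration as plain text, one element per line."""
        lines = [
            f"Service impacted: {self.service_impacted}",
            f"Declared by: {self.declared_by}",
            f"Alarm condition: {self.alarm_condition}",
            f"Impact: {self.impact}",
            f"Began or discovered: {self.began_or_discovered}",
            f"Expected duration: {self.expected_duration or 'unknown'}",
            f"Tech groups: {', '.join(self.tech_groups) or 'none yet'}",
            f"Technical lead: {self.technical_lead}",
            f"Incident coordinator: {self.incident_coordinator}",
            f"PMG/Comm contact: {self.pmg_comm_contact or 'not yet contacted'}",
            f"Calendar notes: {self.calendar_notes}",
        ]
        return "\n".join(lines)
```

Leaving `expected_duration` unset follows the template's instruction to say "unknown" rather than guess.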

13 Services for MIH

Services for MIH

The committee suggests that MIH processes be created and added to the OLAs of these services first.

1 Network
2 CruzMail
3 Web Services, especially for the main campus and ITS servers
4 CruzTime
5 Business Systems (FIS, PPS)
6 AIS
7 CruzID and source systems
8 Unix Systems (Timeshares)
9 Telephone
10 Santa Cruz Tickets.com
11 Other Mission Critical Systems as appropriate
