Rosewood, Ann Berry-Kline, John Hammond (Process Approved)
SMT chartered a project to produce a Problem Management process. The group worked on an ITIL framework for Problem Management, but discovered that what SMT probably wanted, and what the division needed, was a major incident handling process. The model for this would be Incident Management, where the goal is to restore service as soon as possible and to manage communications and client expectations. Problem Management will be finished later.
This Major Incident Handling (MIH) process is a special form of Incident Management, where coordination and communication internally and externally are emphasized. What we're asking for today is the approval of this MIH process.
When we presented this for discussion several months ago, the discussion focused on "who can declare a major incident" and how the processes would work after hours. Since our last visit, John Hammond, Eric Kiesler and I have refined the process and completed it for both 8-5 and after hours. We left the criteria for declaring a major incident to the details recorded in an OLA. Recently, the CruzMail OLA team used this process and it worked flawlessly when the details of the service were applied to it. We believe the process is ready to be adopted for the four highlighted services, and to become a general model for all ITS services, including those offered by the divisions.
Additionally, one of the deliverables of the DDSLA program is to create a plan to complete OLAs/SLAs for all services. This plan will need to stage services. One criterion for prioritizing a service at the top of the list would be whether it qualifies as an MIH service, and in the packet you'll find a list of services the committee recommends as the next to have MIH developed.
If approved, we will go forward with implementing the process in the four highlighted services, as well as creating a training for Core Technology, Communications, and Support Center staff, where we run scenarios and understand the tasks of each role. In advance of a complete portfolio of OLAs, we hope to get to a place where we can establish thresholds in the services provided by Core Technologies and put this process in production sooner rather than later. That is, we will become familiar with the process and be able to identify "who can declare a major incident" before OLAs are created for all major services.
Contents of MIH packet
1 This contents page
2 Major Incident Handling Process Overview: A diagram giving an overview of individual processes, showing how incident management is applied to major incidents, and identifying three types of coordination roles: Incident, Technical, and Communications.
3 MIH Process Description: An outline describing tasks, roles, and timeline. A textual annotation of the detailed process diagrams that follow.
4 MIH Process Description (After Hours): Supplementary to the MIH Process Description, showing how tasks, roles, and timeline are different if a major event occurs outside of our business day of 8 am to 5 pm, Monday-Friday.
5 Implementation Notes: Assumptions, suggestions on monitoring tool coordination, "Hot Line" proposal for Data Center.
6 Recording and Classification Processes (Detail): A process diagram for the first two stages of Incident Management, including incipient event tracking, major incident classification, and the declaration of a major incident.
7 Referral and Resolution Processes: A process diagram for stages three and four of Incident Management, showing building the team and working the incident to resolution.
8 Event Coordination and Event Communications Processes: The purpose of the MIH process is to improve coordination and communication between ITS units. This is a process diagram detailing the coordination and communication tasks and roles.
9 Closure Processes by Role: A process diagram for the last stage of Incident Management for the roles in the Support Center, Communications, and Technical teams.
10 After Hours Process: A process diagram supplementary to the MIH process.
11 Role Definitions: Detailed definition of roles and tasks performed.
12 Major Incident Declaration Template: Exactly what tool will be used to store and edit the template has not been determined yet. It is likely to be found in IT Request. This template lists what information is to be included when declaring a Major Incident.
13 Highlighted services that need MIH: List of services that need MIH processes first. The first of this list of ten will be CruzMail, CruzNet, and CruzTime. Although Desktop Services is one of the four highlighted services this summer, it will not have a major incident handling OLA.
[Diagram: Major Incident Handling Process Overview. Coordination roles: Inc Coord, Tech Lead, PMG/Comm. Stages: Recording, Classification, MIH Declared, Diagnose, SMT/DLs Notified, Referral/Resolution, Closure.]
Major Incident Processes
1 (Recording) Incipient event tracking
  1.1 HD staff note a pattern of tickets and start linking them with the "hot ticket" feature of IT Request.
  1.2 ITS staff can use the Major Incident Hotline.
    1.2.1 Also known as the "Red Phone." This is a phone that rings in the HD office, is always answered, and is never busy. It is used by service providers to report major incidents and by the Tech Lead to update the HD team. The telephone number is easy to remember and is distributed orally.
2 (Classification) Major Incident classification
  2.1 For each service, Major Incident criteria are consulted in the OLA.
  2.2 Priority is assigned to tickets based on the OLA.
  2.3 The process includes a quick call or IM to system administrators for a status check. Also check with the Support Center Subject Matter Expert.
3 (Investigation and Diagnosis) Major Incident declared
  3.1 If tickets indicate that the issue has reached Major Incident status:
    3.1.1 Complete the MIH Template and post it as a tech-only message.
  3.2 Create preliminary communications in IT Request visible to ITS staff, visible to the public, and on the telephone greeting.
Time (as listed for these steps): 10 minutes, 5 minutes, 5 minutes.
    3.3.1 Message to SMT/DLs is simple: there is a major incident; look at Messages and Ticket # for details.
4 Major Incident tracking
  4.1 Continue to collect incident information.
5 (Referral and Resolution) Incident Coord assigns Tech Lead
  5.1 Consults OLA or IT Request to identify Lead Tech.
6 Tech Lead builds team
  6.1 Builds team. Tech Lead can assign another tech lead if needed.
Inputs and outputs listed for these steps: tickets, tech notes, ITR Config. Time: ongoing.
  6.2 Agreement on communication protocol: status, tools, phone numbers, IM (Tech Lead/Inc Coord).
7 Team works incident
  7.1 Outputs: ticket updates, FAQs, tech-messages, workarounds, communications within team and with Incident Coord, root cause, known errors, emergency RFCs.
  7.2 Priority is placed on creating workarounds over fixes or discovering root cause.
8 Event Communications
  8.1 Tech Lead's communications (role: Tech Lead): every 60 minutes or as scheduled, on the hour if possible.
  8.2 Inc Coord's communications (role: Inc Coord): updates to PMG/Comm via tech-only notes and phone calls, on the same schedule.
  8.3 PMG/communications (role: PMG/Comm): input from Inc Coord; public web messages to campus and others, every 60 minutes or as scheduled.
  8.4 PMG/Comm is responsible for keeping the ITS phone list updated with ITS staff contact information. ITR technician accounts are created by the Support Center. ITR technicians are expected to keep their tech profiles updated.
9 Event Coord
  9.1 Includes implementing workarounds as created.
10 Resolution
  10.1 Resolutions
  10.2 Resolve tickets
11 Closure
  11.1 Tech Lead closure activities
Closure details (columns: Role, Inputs, Outputs, Time)
  Roles: Inc Coord, Tech Lead, HD Team.
  Tech Lead: post mortem; audience = SMT/DLs/ITS.
  HD Team: FAQs and other normal IcM closure activities; audience = campus.
  Inputs: workarounds, fixes as available.
  Time: as possible, within 48 hours.
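The update cadence in step 8 (every 60 minutes or as scheduled, on the hour if possible) can be sketched as a small scheduling helper. This is a minimal illustration and not part of the approved process: the function names `next_hourly_update` and `update_schedule` are hypothetical, and the sketch assumes the "on the hour" option is always taken rather than a custom schedule.

```python
from datetime import datetime, timedelta
from typing import List


def next_hourly_update(now: datetime) -> datetime:
    """Return the next top of the hour strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


def update_schedule(declared_at: datetime, count: int) -> List[datetime]:
    """Return the first `count` hourly update times after a declaration."""
    first = next_hourly_update(declared_at)
    return [first + timedelta(hours=i) for i in range(count)]


if __name__ == "__main__":
    # Hypothetical declaration time, used only for illustration.
    declared = datetime(2024, 1, 15, 9, 20)
    for when in update_schedule(declared, 3):
        print(when.strftime("%H:%M"))
```

Rounding to the top of the hour keeps the Tech Lead, Inc Coord, and PMG/Comm updates aligned on one clock, which is one way to read "on the hour if possible."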
Major Incident Processes After Hours
1 (Recording) Alarms/Monitors/Reports are received by Ops
  Currently Ops provides a daily message about the night's events. sc.update is a list that should be added to that distribution. This alerts not only the Help Desk but also Lisa Bono and her back-ups. As Core Technologies monitoring tools develop and align, we will align "warning event" and "critical event" alerts to the Major Incident processes.
2 (Recording) Alarm/Monitor/Report sufficient to suggest calling On-Call staff
  2.1 Opens ticket. Emails ticket to sc.update. This step will require training before implementation.
  2.2 Using a template, creates a tech-only message pointing at the ticket. Audience is all-techs. Expires in 24 hours. This step will require training before implementation.
  2.3 Checks SLA or other documentation: is the system/service eligible for off-hours support? If not, contacts client if known. Emails ticket to dls@ucsc.edu, calling their attention to the ticket. Before SLAs are in place, we require a list of services that get off-hours support. This list is partially completed.
  2.4 Is there support for off-hours? If not, contact client and DL per SLA.
    2.4.1 If the service does not get off-hours support, Help Desk staff perform resolution and closure on the ticket after 8 am.
  2.5 If yes, contact the On-Call person.
    2.5.1 If the On-Call person doesn't respond within the defined time, moves down the list of contacts per on-call protocol.
3 (Classification) Incident classification
  3.1 For each service, Major Incident criteria are consulted in the OLA.
4 Major incident?
  4.1 On-Call person decides if the report fits the criteria for this service. Acts as Tech Lead until relieved or delegated.
  4.2 If yes, informs Operations staff, who complete the Major Incident Template and add it to the ticket. Changes original ticket to "Hot Ticket." Operations staff act as Incident Coord during off-hours.
  4.3 If not a Major Incident, Operators resolve the ticket and copy in sc.update. This updates PMG/Comm also. Record in detail what occurred and why it was not a Major Incident.
5 (Investigation/Diagnosis)
  5.1 On-Call person does what is necessary to gather information and determine the team and perhaps another tech lead, if needed.
6 Referral/Resolution (if it is a Major Incident)
  6.1 Assembles team, if needed; starts working the issue.
  6.2 Decides: is there a high probability that status will not be normal at 6:30 am? (Consulting the documentation for that service.)
    6.2.1 If yes, Operators create a tech-only message pointing at the Hot Ticket containing the Major Incident declaration.
    6.2.2 Tech Lead communicates with management per the service's protocol.
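Roles, inputs, outputs, and times listed for the steps above: Ops staff (input: alarm; output: ticket; time: immediate); On-Call (input: ticket; time: 30 min, otherwise varies); On-Call and Ops staff (outputs: Major Incident declaration, phone calls/emails; time: varies).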
  6.3 At 8 am, the Major Incident Process starts with Step 3 of business-hours Major Incident Handling. Incident Coordinator duties can be handed off, or stay with Operators, depending on workload and service.
7 If resolved before 8 am, begin closure procedures.
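The after-hours decisions above (off-hours eligibility, on-call escalation, and the major-incident decision) can be sketched as two small functions. This is a hedged illustration under stated assumptions, not the division's tooling: the names `first_responder` and `after_hours_action` are hypothetical, and the `responded` callback stands in for whatever the on-call protocol defines as a timely answer.

```python
from typing import Callable, Optional, Sequence


def first_responder(
    on_call_list: Sequence[str],
    responded: Callable[[str], bool],
) -> Optional[str]:
    """Walk the on-call list in order, as in step 2.5.1.

    `responded` is a hypothetical callback reporting whether a contact
    answered within the defined time. Returns None if nobody answered.
    """
    for contact in on_call_list:
        if responded(contact):
            return contact
    return None


def after_hours_action(has_off_hours_support: bool, is_major_incident: bool) -> str:
    """Return the next action for an after-hours ticket (steps 2.4 to 4.3)."""
    if not has_off_hours_support:
        # Steps 2.4 and 2.4.1: no off-hours support for this service.
        return "Contact client and DL per SLA; Help Desk resolves after 8 am"
    if is_major_incident:
        # Step 4.2: the On-Call person decided the report fits the criteria.
        return "Operations completes Major Incident Template; mark Hot Ticket"
    # Step 4.3: not a major incident.
    return "Operators resolve ticket, copy sc.update, record why not major"
```

For example, `first_responder(["primary", "backup"], lambda name: name == "backup")` moves past the primary contact and returns `"backup"`.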
[Diagram: Recording and Classification Processes. Decision points: Service Request? (Yes: Service Request Process), SLA check. Outcome: Notify SMT/DLs.]
[Diagram: Referral and Resolution Processes. Steps: Work Incident, Review Progress, Implement Workaround, Success?, Resolution, Test Resolution, Determine Root Cause?, Closure.]
[Diagram: Event Coordination and Event Communications Processes. Steps: Build Team, Work Incident, Event Comms, Closure. Outputs: Tech Lead status to MIC, tech notes to Hot Ticket, ticket updates, tech-only messages, workarounds/FAQs. Audiences: DLs/SMT, specific clients, campus.]
[Diagram: Closure Processes by Role. Elements: PMG Comms, Resolution, Closure, Closed, Problem Record, Problem Management.]
[Diagram: After Hours Process. Operations receives Incident, Alarm, Monitors, email, Voice; Record Basic Details in IT Request and Acknowledge Receipt using template; Check Service Availability (SLA); tech-only message, email sc.update; Classification; Contact On-Call Person; On-Call Investigate/Diagnose; Investigation/Referral; Notify Operations; Work Incident; Message to SMT; Message to DLs; Prelim Comms; Closure Processes; Start Step 3 at 8 am.]
Roles
1 Major Incident Tech Lead
  Predetermined person for each service, recorded in the OLA and in IT Request as the "Group Manager."
  Makes decisions about resources.
  Has authority to pull in resources from across units.
  Reports status to the Major Incident Coordinator per protocol as defined in the OLA.
  Coordinates work of the technical team developing workarounds and other resolutions.
  Writes the Technical Post Mortem Report (audience: SMT, DLs, and ITS staff). Reports to other audiences use this as a resource.
  Characteristics of role:
    Senior technical person or manager.
    A single person is designated the Tech Lead during an event. This designation can be delegated.
  Characteristics of communications:
    Communications with the Incident Coordinator concentrate on status, workarounds, and fixes. Although theories are discussed with the Incident Coordinator, the focus should not be on technical theories until the Problem Management process discovers the root cause of the problems behind the incidents.
2 Major Incident Coordinator
  In the initial version of the process, this person is always a member of the Support Center's Help Desk Team. However, Incident Coords may be designated in Instructional Technology, Media Services, Divisional IT groups, or other ITS units with a client-facing function.
  Receives reports from the Lead Tech and reports incident information to the Lead Tech. Usually communicates only with the Lead Tech during an incident, and not with the rest of the technical team.
  Gives status reports to the PMG Communications staff per OLA/SLA.
  Sees that technical communications via tech-only messages are created and updated.
  Oversees recording, classification, and initial diagnosis during the incident.
  Oversees complete closure of all tickets.
  Oversees documentation of workarounds, FAQs, etc. during closure.
3 Support Center Subject Matter Expert
  Support Center staff are assigned to be technical experts in each service of the catalog.
  Consulted during the incident classification process before a major incident is declared.
  Assists the Incident Coordinator in writing communications and implementing workarounds.
  If the Incident Coordinator is not in the Support Center, then relevant SMEs should be consulted before declaring a major incident.
4 Data Center Operator
  Incident Coordinator during the Off-Hours process.
  Notifies the On-Call tech per OLA/protocol.
  Creates the initial ticket during off-hours events.
5 On-Call Tech
  Core Systems staff person who is on call for a particular service after hours.
  Responsible for finding the right person to respond to the issue.
  Contacted by operators, and no one else.
6 PMG Communications Staff
  Communicates with the campus community during the event and post-mortem.
  Uses all communication channels: ITS status page, mass vmail, mass email, targeted communications. In the future, RSS feeds, etc.
7 Divisional Liaisons (DLs)
  Communicate with their clients as needed.
  Inform ITS staff of client needs, priorities, and critical calendar events.
8 Help Desk Staff
  Executes IcM processes.
  Performs internal and external communications.
9 Senior Management Team (SMT)
  Receives early communications.
  Receives communications during events and the post-mortem by the Tech Lead.
  Sets priorities and policy, assigns resources.
Major Incident Declaration Template
Exactly what tool will be used to store and edit the template has not been determined yet. It is likely to be found in IT Request.
Elements of the template:
  the service impacted
  the person who made the declaration
  the condition of alarms
  the impact: who is impacted by this outage or service degradation? Who should care about this?
  when did it begin, or when was it discovered?
  how long is the condition expected to last? This may be unknown. If so, say unknown.
  tech groups working on the incident
  technical lead on the incident
  incident coordinator
  PMG/Comm staff person contacted (if possible)
  Consult the campus calendar and maintenance calendar. Will other events be affected? Could other events be related to the cause?
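Because the storage tool for the template is not yet determined, the elements above can be sketched as a plain data structure independent of any tool. This is an illustrative assumption, not the IT Request format: the class name `MajorIncidentDeclaration`, its field names, and the `to_message` layout are all hypothetical. Fields default to "unknown", following the template's instruction to say unknown when a value is not known.

```python
from dataclasses import dataclass, field
from typing import List

UNKNOWN = "unknown"


@dataclass
class MajorIncidentDeclaration:
    """Hypothetical container for the declaration template elements."""

    service_impacted: str
    declared_by: str
    alarm_condition: str = UNKNOWN
    impact: str = UNKNOWN
    began_or_discovered: str = UNKNOWN
    expected_duration: str = UNKNOWN
    tech_groups: List[str] = field(default_factory=list)
    technical_lead: str = UNKNOWN
    incident_coordinator: str = UNKNOWN
    pmg_comm_contact: str = UNKNOWN
    calendar_notes: str = UNKNOWN

    def to_message(self) -> str:
        """Render the declaration as plain text for a tech-only message."""
        groups = ", ".join(self.tech_groups) or UNKNOWN
        lines = [
            f"Service impacted: {self.service_impacted}",
            f"Declared by: {self.declared_by}",
            f"Condition of alarms: {self.alarm_condition}",
            f"Impact: {self.impact}",
            f"Began or discovered: {self.began_or_discovered}",
            f"Expected duration: {self.expected_duration}",
            f"Tech groups working on the incident: {groups}",
            f"Technical lead: {self.technical_lead}",
            f"Incident coordinator: {self.incident_coordinator}",
            f"PMG/Comm contact: {self.pmg_comm_contact}",
            f"Calendar notes: {self.calendar_notes}",
        ]
        return "\n".join(lines)
```

A declaration needs only the service and the declaring person to start, for example `MajorIncidentDeclaration("CruzMail", "Help Desk")`; the remaining elements can be filled in as the incident is diagnosed.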
Services for MIH
The committee suggests that MIH processes be created and added to the OLAs of these services first.
1 Network
2 CruzMail
3 Web Services, especially for the main campus and ITS servers
4 CruzTime
5 Business Systems (FIS, PPS)
6 AIS
7 CruzID and source systems
8 Unix Systems (Timeshares)
9 Telephone
10 Santa Cruz Tickets.com
11 Other Mission Critical Systems as appropriate