

CAREGROUP IT
KEYWORDS
Legacy Systems: outdated computer systems, programming languages, or application software. [Techopedia]
Meditech Software: provides electronic healthcare solutions, including access to healthcare records.
Unix/Linux clusters: a set of computers harnessed together to present a single server resource to applications and to users. [unix.org]
EMC storage: EMC's Information Management Strategy Services enable enterprises to organize unstructured information. [Techopedia]

OLD SYSTEM

Decentralised network with non-standardised email

Each hospital ran its own legacy system (IBM mainframes, PCs, KMS)

Hand-typed results

No email and few PCs

No electronic output

Financial systems with frequent vendor failures

Non-integrated lab and radiology systems

Self-developed payroll

Failed operating room system

Limited network with no remote access

NEW SYSTEM

Acquired Meditech clinical and financial software systems

Knowledge cycle: acquire, store, disseminate, apply

Voice/wireless data structure

IT budget dropped

Web-enabled applications on paired CPUs

Unix/Linux servers with 3 backup data centers

3,000 physicians and 200 IT staff supported; 40 TB of data per day
CAREGROUP IT
COMPANY UNITS

HP Unix and Wintel boxes

IBM PCs (previously Dell); IBM cut PC prices, enabling replacement of PCs

PeopleSoft HR and payroll system

Nightly data centre tape backup with off-site storage at Iron Mt.

CareGroup cut capital budget expenditures by 90%

The Meditech installations were completed in half the usual time and without consultants

Recognized in 2002 for the best knowledge management services in the US
Timeline of Network Collapse

Wed., afternoon: Napster-like internal attack
Thurs., 4:00 am: Service restored
Thursday: Intermittent failures
Thurs., 4:00 pm: CAP declared
Fri., morning: Network blocked; backup procedures implemented
Friday: Reboot of core and Layer 3
Saturday: Redundant core built
Sunday, night: Network fully operational

What went wrong?
In November 2002, a researcher at CareGroup was experimenting with a file-sharing application
Upon finding that his wife was in labor, he left suddenly, with the software running in an untested state
The new application began to explore surrounding networks and to copy data in large volumes, eventually moving terabytes of data across the network
On Wednesday, November 13, 2002, the entire CareGroup network went down for almost four days
The Network collapse
Huge data transfers quickly monopolized the services of a centrally located network switch

Fortunately, the network was physically redundant throughout

But when the major switch became unavailable, this redundancy produced a complex web of alternative paths for data

The network had grown and crept out of spec; the algorithms for computing alternative paths no longer worked correctly

Throughout the network, redundant components both became primary, resulting in an endless loop of messages between them until the network was disabled completely

Eventually, every software application that used the network stopped working
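
The failure mode described above can be made concrete with a small sketch. The Python fragment below is illustrative only and assumes a hypothetical three-switch topology, not CareGroup's actual network: it floods a broadcast frame and shows that when every redundant link stays active the frame circulates until it is forcibly cut off, whereas blocking one redundant link, as a correctly functioning loop-prevention (spanning-tree-style) computation would, lets the flood terminate on its own.

from collections import deque

# Hypothetical topology: three switches wired in a triangle, i.e. fully redundant.
links = {("A", "B"), ("B", "C"), ("C", "A")}

def forward_broadcast(active_links, start, max_events=20):
    """Flood a broadcast frame: each switch re-sends it on every active link
    except the one it arrived on. Returns the number of forwarding events."""
    events = 0
    queue = deque([(start, None)])            # (current switch, link arrived on)
    while queue and events < max_events:
        node, arrived_on = queue.popleft()
        for link in active_links:
            if node in link and link != arrived_on:
                neighbour = link[0] if link[1] == node else link[1]
                events += 1
                queue.append((neighbour, link))
    return events

# All redundant links active: the frame loops until the safety cap stops it.
print(forward_broadcast(links, "A"))                    # hits the max_events cap
# One redundant link blocked (what a working loop-prevention algorithm would do):
# the flood reaches every switch and then dies out on its own.
print(forward_broadcast(links - {("C", "A")}, "A"))     # small, finite count

The real network's problem was of this kind at a much larger scale: once the network crept outside the limits the path-computation algorithm was designed for, redundant paths stayed active and looping traffic saturated the core.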


The Network collapse
Existing Network
[Network topology diagram: redundant core Cisco 5500/5505/5509 switches (switch-rca, switch-rcb, switch-rcc, spg06a, spg06b, ccw00m4 and others) at Renaissance Park and the East and West Campuses, connected to building switch groups at Ren Ctr, Mount Auburn (remote), Research East, Baker, Deaconess, Lowry Medical, Palmer, 109 Brookline Ave, CC East and CC West Campus, Dana, Farr, Feldberg, Kennedy, Kirstein, Reisman, Rose, Stoneham and other sites. Links are ATM OC-3 (155 Mbps) over SONET or dark fiber and Fast EtherChannel (400 or 800 Mbps); some links marked as not active.]
The Network collapse
IT Department's failed attempts

The network failure affected operations at Beth Israel Deaconess Medical Center

Staff went back to 1970s-style paper-based operations, for which they were not trained

The IT department, under the supervision of Mr. Halamka, began to resolve the problem

But they had no clue of the cause, and each person's different suggestions only increased the complexity

Tactical measures such as restarting network equipment were used, but no permanent solution was found
The Network collapse
CISCO called in
At 4:00 PM on Thursday, about 24 hours after the first difficulties with the network, CISCO was called in for assistance
CISCO sent a team and, within hours, enough equipment to build an entire redundant core network if needed
The CISCO team arrived and immediately took charge, instituting their CAP process
To implement the CAP process, 10 people were on site, including at least 3 engineers working 24x7, coordinating continuously with teams throughout the world
At about 1:00 AM on Friday, the team zeroed in on at least part of the problem and decided to install a large modern switch (Cisco 6509)
After isolating that part of the CareGroup network, they discovered that two other parts of the network had the same problem
The Network collapse
Decision to remain on backup procedures
Network failures led to the enactment of backup procedures, with a reversion to normal procedures only after systems were completely restored
Mr. Halamka recommended to senior management that they remain on backup procedures until normalcy was restored
On November 15, an internal command center was activated to coordinate backup procedures
The Network collapse
Specific processes followed during outage
Command center as the central point of all communications

Morning & afternoon briefing sessions each day as part of a regular schedule

System of runners to retrieve specimens, tests, equipment etc.

Paper documentation of all activities that were previously recorded electronically

Call-back of all urgent lab results to the responsible clinician

Establishment of staggered lab draws to even out demand on laboratories

Manual process for orders and pharmacy dispensing

Manual census lists by admitting office, updated every few hours

Contingency plan for the outsourcing of all ambulatory laboratory volume

Creation of hotlines for support


LESSONS LEARNT
LESSON#1 BRING IN THE EXPERTS
Halamka signed a $300,000 per year agreement for support from CISCO
CISCO could make relevant changes to CareGroup's network
Two CISCO engineers were assigned to remain on-site at BID permanently

LESSON#2 DO NOT OVERRELY ON ONE PERSON


Too much reliance on one expert
It's better to get a second opinion about the network's configuration

LESSON#3 KEEP CURRENT


Reliance on one person allowed the organization's knowledge to become outdated
By the late 1990s, the network had evolved into a fragile state

LESSON#4 A LITTLE KNOWLEDGE IS DANGEROUS


The outage was triggered by user experimentation
It is important for the IT group to remain vigilant, note changes, and supervise user experiments
LESSON#5 NETWORK COMPLIANCE
A formal procedure for making changes to the network was later established
A Network Change Control Board was created to review and approve network infrastructure changes
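
Purely as an illustration of the kind of gate such a board introduces (the class and board name below are hypothetical, not taken from the case), a minimal Python sketch: an infrastructure change request cannot be deployed until the reviewing board has signed off.

from dataclasses import dataclass, field

# Assumed name of the reviewing body; a real board would be defined by policy.
APPROVERS = {"network-change-control-board"}

@dataclass
class ChangeRequest:
    summary: str
    requested_by: str
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # Only a recognized reviewing body can sign off.
        if reviewer in APPROVERS:
            self.approvals.add(reviewer)

    def may_deploy(self) -> bool:
        # No infrastructure change goes live without board sign-off.
        return bool(self.approvals & APPROVERS)

req = ChangeRequest("Add access switch to East Campus", "net-ops")
assert not req.may_deploy()                      # blocked until reviewed
req.approve("network-change-control-board")
assert req.may_deploy()                          # cleared to deploy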

LESSON#6 ADAPT TO EXTERNALITIES


IT staff became more cautious about events in the outside environment
and their possible impacts on existing IT functionality

LESSON#7 BALANCE SERVICE AND SECURITY


Customer-centricity must be balanced against risks to the network and IT systems
Shortcuts taken to meet urgent business needs should be removed

LESSON#8 HAVE BACKUP PLAN


It is not productive to move back and forth between computer and paper processes
The paper system needs to be robust

LESSON#9 YOU NEED ALTERNATIVE ACCESS


Important systems need emergency connectivity through dial-up modems
CareGroup acquired additional analog telephone capacity and added modem capabilities

LESSON#10 LIFE-CYCLE YOUR NETWORK


Regular upgrades of network components are required
The upgrades amounted to a multimillion-dollar one-time expense
