

CAREGROUP IT
KEYWORDS
Legacy Systems: outdated computer systems, programming languages, or application software. [Techopedia]
Meditech Software: provides electronic healthcare solutions, including access to healthcare records.
Unix/Linux clusters: a set of computers harnessed together to present a single server resource to applications and to users. [unix.org]
EMC storage: EMC's Information Management Strategy Services enable enterprises to organize unstructured information. [Techopedia]

OLD SYSTEM

Decentralised network with non-standardised email

Each hospital ran its own legacy system (IBM mainframes, PCs, KMS)

Hand-typed results

No email and few PCs

No electronic output

Financial systems with frequent vendor failures

Non-integrated lab and radiology systems

Self-developed payroll

Failed operating room system

Limited network with no remote access

NEW SYSTEM

Acquired Meditech clinical and financial software systems

Knowledge cycle: acquire, store, disseminate, apply

Voice/wireless data structure

IT budget dropped

Web-enabled applications on paired CPUs

Unix/Linux servers with 3 backup data centers

3,000 physicians and 200 IT staff supported; 40 TB of data per day
CAREGROUP IT
COMPANY UNITS

HP Unix and Wintel boxes

IBM PCs (previously Dell); IBM cut PC prices, enabling replacement of PCs

PeopleSoft HR and payroll system

Nightly data centre tape backup with off-site storage at Iron Mt.

CareGroup cut capital budget expenditures by 90%

The Meditech installations were completed in half the usual time and without consultants

Recognized in 2002 for the best knowledge management services in the US
Timeline of Network Collapse

Wed., afternoon: Napster-like internal attack
Thurs., 4:00 am: Service restored
Thursday: Intermittent failures
Thurs., 4:00 pm: CAP declared
Fri., morning: Network blocked; backup procedures implemented
Friday: Reboot of core and Layer 3
Saturday: Redundant core built
Sunday, night: Network fully operational

What went wrong?
In November 2002, a researcher at CareGroup was experimenting with a file-sharing application
Upon finding that his wife was in labor, he left suddenly, with the software running in an untested state
The new application began to explore surrounding networks and to copy data in large volumes, eventually moving terabytes of data across the network
On Wednesday, November 13, 2002, the entire CareGroup network went down for almost four days
The Network collapse
Huge data transfers quickly monopolized the services of a centrally located network switch

Fortunately, the network was physically redundant throughout

But when the major switch became unavailable, this redundancy produced a complex web of alternative paths for data

The network had grown and crept out of spec; the algorithms for computing alternative paths no longer worked correctly

Throughout the network, redundant components both became primary, resulting in an endless loop of messages between them until the network was disabled completely

Eventually, every software application that used the network stopped working
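
The failure mode described above can be made concrete with a small sketch. The Python fragment below is illustrative only and assumes a hypothetical three-switch topology, not CareGroup's actual network: it floods a broadcast frame and shows that when every redundant link stays active the frame circulates until it is forcibly cut off, whereas blocking one redundant link, as a correctly functioning loop-prevention (spanning-tree-style) computation would, lets the flood terminate on its own.

from collections import deque

# Hypothetical topology: three switches wired in a triangle, i.e. fully redundant.
links = {("A", "B"), ("B", "C"), ("C", "A")}

def forward_broadcast(active_links, start, max_events=20):
    """Flood a broadcast frame: each switch re-sends it on every active link
    except the one it arrived on. Returns the number of forwarding events."""
    events = 0
    queue = deque([(start, None)])            # (current switch, link arrived on)
    while queue and events < max_events:
        node, arrived_on = queue.popleft()
        for link in active_links:
            if node in link and link != arrived_on:
                neighbour = link[0] if link[1] == node else link[1]
                events += 1
                queue.append((neighbour, link))
    return events

# All redundant links active: the frame loops until the safety cap stops it.
print(forward_broadcast(links, "A"))                    # hits the max_events cap
# One redundant link blocked (what a working loop-prevention algorithm would do):
# the flood reaches every switch and then dies out on its own.
print(forward_broadcast(links - {("C", "A")}, "A"))     # small, finite count

The real network's problem was of this kind at a much larger scale: once the network crept outside the limits the path-computation algorithm was designed for, redundant paths stayed active and looping traffic saturated the core.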


The Network collapse
Existing Network
[Network topology diagram: redundant core Cisco 5500/5505/5509 switches (switch-rca, switch-rcb, switch-rcc, spg06a, spg06b, ccw00m4 and others) at Renaissance Park and the East and West Campuses, connected to building switch groups at Ren Ctr, Mount Auburn (remote), Research East, Baker, Deaconess, Lowry Medical, Palmer, 109 Brookline Ave, CC East and CC West Campus, Dana, Farr, Feldberg, Kennedy, Kirstein, Reisman, Rose, Stoneham and other sites. Links are ATM OC-3 (155 Mbps) over SONET or dark fiber and Fast EtherChannel (400 or 800 Mbps); some links marked as not active.]
The Network collapse
IT Department's failed attempts

The network failure affected operations at Beth Israel Deaconess Medical Center

Staff went back to 1970s-style paper-based operations, for which they were not trained

The IT department, under the supervision of Mr. Halamka, began to resolve the problem

But they had no clue of the cause, and each person's different suggestions only increased the complexity

Tactical measures such as restarting network equipment were used, but no permanent solution was found
The Network collapse
CISCO called in
At 4:00 PM on Thursday, about 24 hours after the first difficulties with the network, CISCO was called in for assistance
CISCO sent a team and, within hours, enough equipment to build an entire redundant core network if needed
The CISCO team arrived and immediately took charge, instituting their CAP process
To implement the CAP process, 10 people were on site, including at least 3 engineers working 24x7, coordinating continuously with teams throughout the world
At about 1:00 AM on Friday, the team zeroed in on at least part of the problem and decided to install a large modern switch (Cisco 6509)
After isolating that part of the CareGroup network, they discovered that two other parts of the network had the same problem
The Network collapse
Decision to remain on backup procedures
Network failures led to the enactment of backup procedures, with a reversion to normal procedures only after systems were completely restored
Mr. Halamka recommended to senior management that they remain on backup procedures until normalcy was restored
On November 15, an internal command center was activated to coordinate backup procedures
The Network collapse
Specific processes followed during outage
Command center as the central point of all communications

Morning & afternoon briefing sessions each day as part of a regular schedule

System of runners to retrieve specimens, tests, equipment etc.

Paper documentation of all activities that were previously recorded electronically

Call-back of all urgent lab results to the responsible clinician

Establishment of staggered lab draws to even out demand on laboratories

Manual process for orders and pharmacy dispensing

Manual census lists by admitting office, updated every few hours

Contingency plan for the outsourcing of all ambulatory laboratory volume

Creation of hotlines for support


LESSONS LEARNT
LESSON#1 BRING IN THE EXPERTS
Halamka signed a $300,000 per year agreement for support from CISCO
CISCO could make relevant changes to CareGroup's network
Two CISCO engineers were assigned to remain on-site at BID permanently

LESSON#2 DO NOT OVERRELY ON ONE PERSON


Too much reliance on one expert
It's better to get a second opinion about the network's configuration

LESSON#3 KEEP CURRENT


Reliance on one person allowed the organization's knowledge to become outdated
By the late 1990s, the network had evolved into a fragile state

LESSON#4 A LITTLE KNOWLEDGE IS DANGEROUS


The outage was triggered by user experimentation
It is important for the IT group to remain vigilant, note changes, and supervise user experiments
LESSON#5 NETWORK COMPLIANCE
A formal procedure for making changes to the network was later established
A Network Change Control Board was created to review and approve network infrastructure changes
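
Purely as an illustration of the kind of gate such a board introduces (the class and board name below are hypothetical, not taken from the case), a minimal Python sketch: an infrastructure change request cannot be deployed until the reviewing board has signed off.

from dataclasses import dataclass, field

# Assumed name of the reviewing body; a real board would be defined by policy.
APPROVERS = {"network-change-control-board"}

@dataclass
class ChangeRequest:
    summary: str
    requested_by: str
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # Only a recognized reviewing body can sign off.
        if reviewer in APPROVERS:
            self.approvals.add(reviewer)

    def may_deploy(self) -> bool:
        # No infrastructure change goes live without board sign-off.
        return bool(self.approvals & APPROVERS)

req = ChangeRequest("Add access switch to East Campus", "net-ops")
assert not req.may_deploy()                      # blocked until reviewed
req.approve("network-change-control-board")
assert req.may_deploy()                          # cleared to deploy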

LESSON#6 ADAPT TO EXTERNALITIES


IT staff became more cautious about events in the outside environment
and their possible impacts on existing IT functionality

LESSON#7 BALANCE SERVICE AND SECURITY


Customer-centricity must be balanced against risks to the network and IT systems
Shortcuts taken to meet urgent business needs should be removed

LESSON#8 HAVE BACKUP PLAN


It is not productive to move back and forth between computer and paper processes
The paper system needs to be robust

LESSON#9 YOU NEED ALTERNATIVE ACCESS


Important systems need emergency connectivity through dial-up modems
CareGroup acquired additional analog telephone capacity and added modem capabilities

LESSON#10 LIFE-CYCLE YOUR NETWORK


Regular upgrades of network components are required
The upgrades amounted to a multimillion-dollar one-time expense
