
ID902 Occam's Razor: An Introduction to Holistic Troubleshooting

Wes Morgan, Senior Software Engineer

2011 IBM Corporation

Agenda

Why are we here?
    Increasingly Complex Architectures
    Specialization within IT/IS
    Command and Control Issues
    Consequences of "Fix it NOW!"
The Holistic Approach: Occam's Razor
    Preparation
        Understanding Your Deployment
        Knowing Your Routine
        Knowing Your Limits
    Execution
        Ask Your Neighbors
        Identify/Refine Your Target
        Problem vs. Routine
        Client, Server or Both?
        Recent Changes
        Lather, Rinse, Repeat...
Questions & Answers



Why Are We Here? Complex Architectures


Fault Tolerance/Redundancy
Load Balancers
Firewalls
Intranet/Extranet
Virtualization


Why Are We Here? IT/IS Specialization


"We don't handle that"
Different team
Communication often rare and/or difficult
Simple questions answered slowly
No one really sees big picture


Why Are We Here? Command and Control


"We can't do that until the next window"
Change Control != everyone informed
Software integration demands team integration as well
Multiple vendors/contractors may be involved


Why Are We Here? "Fix It NOW" Consequences

Panic mode
Time-to-resolution faces sometimes arbitrary limits
"All hands on deck"
Overall technical guidance lacking
Troubleshooting becomes scattershot


The Holistic Approach: Occam's Razor

"Pluralitas non est ponenda sine necessitate."
"Plurality should not be posited without necessity."
William of Ockham, c. 1285-1349

Close relatives:

"When two theories explain the same phenomenon, choose the simpler"
"admit no more causes... than such as are both true and sufficient..." (Newton)
KISS: Keep It Simple, Stupid


Why Use Occam's Razor?


Multiple failures highly unlikely
Far more likely that one root failure triggered additional problems
Playing "it could be..." introduces complexity and (probably) politics
Don't chase rabbits!


Preparation: Understand Your Deployment

It's far more than just "your stuff"
Hardware (or lack thereof!)
Operating System
Network (within the data center)
Network (long haul/extranet/VPN)
Dependencies (directory, SAN)
Special-purpose devices (firewalls/proxies/reverse-proxies)
Network appliances

KNOW YOUR DATA PATH!



Preparation: Know Your Routine

Profile your systems!
    perfpmr (AIX), perfmon (Windows), iostat/vmstat (Linux)
Understand what "normal" looks like
Be sure to profile peak time too!
Logins/sessions per day
User patterns (e.g. Accounting end-of-month)
Domino platform statistics can be VERY useful
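
One way to record what "normal" looks like is to reduce periodic samples to a small per-metric summary. The sketch below is only an illustration: the metric names and values are hypothetical, and the samples are assumed to come from whatever tool you already use (perfpmr, perfmon, iostat/vmstat or Domino statistics).

```python
from statistics import mean, pstdev


def build_baseline(samples):
    """Summarize routine samples as {metric: {"mean", "stdev", "peak"}}.

    `samples` is a list of dicts, one per collection interval.
    The metric names used here are hypothetical.
    """
    baseline = {}
    metrics = {name for sample in samples for name in sample}
    for name in sorted(metrics):
        values = [s[name] for s in samples if name in s]
        baseline[name] = {
            "mean": mean(values),
            "stdev": pstdev(values),
            "peak": max(values),
        }
    return baseline


# Hypothetical routine data; the last sample was taken at peak time.
routine = [
    {"cpu_pct": 30.0, "disk_queue": 1.0},
    {"cpu_pct": 40.0, "disk_queue": 2.0},
    {"cpu_pct": 50.0, "disk_queue": 3.0},
]
baseline = build_baseline(routine)
print(baseline["cpu_pct"])
```

Keeping the peak alongside the mean matters because a value that is alarming at 3 a.m. may be perfectly routine at month end.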


Preparation: Knowing Your Limits

Compare your routine use to:


Vendor benchmarks
Third party testing/whitepapers
Software specifications
CPU utilization
RAM consumption
ESPECIALLY important in virtual environments

Know how much wiggle room you have
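
"Wiggle room" can be expressed as the share of a limit that is still unused at routine peak load. This is a minimal sketch with made-up numbers; the function name and the RAM figures are assumptions for illustration, and the limit would come from your own benchmarks, specifications or virtual machine allocation.

```python
def headroom(routine_peak, limit):
    """Return the fraction of `limit` still unused at routine peak load."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return (limit - routine_peak) / limit


# Hypothetical numbers: peak RAM use of 12 GB on a guest allocated 16 GB.
ram_headroom = headroom(routine_peak=12.0, limit=16.0)
print(f"RAM wiggle room: {ram_headroom:.0%}")
```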



Execution: Ask Your Neighbors

Many deployments in your environment share potential points of failure
    Load Balancers
    SAN
Quick check with peers may identify common problem quickly
Formalize this process if you can: weekly outage reports?
May also be indicative of general network issues
Allows you to handle some issues without vendor involvement
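
If you keep even a rough inventory of what each deployment depends on, the "ask your neighbors" check becomes a set intersection. The inventory below is hypothetical, as are the deployment and component names; it is a sketch of the idea, not a tool.

```python
def shared_dependencies(deployments, affected):
    """Return the dependencies that every affected deployment has in common."""
    dependency_sets = [set(deployments[name]) for name in affected]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# Hypothetical inventory: deployment name -> components on its data path.
deployments = {
    "mail": ["lb-1", "san-1", "directory-1"],
    "chat": ["lb-1", "directory-2"],
    "portal": ["lb-2", "san-1"],
}
suspects = shared_dependencies(deployments, affected=["mail", "chat"])
print(sorted(suspects))
```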


Execution: Identify/Refine the Target

Most missed aspect of troubleshooting
Identify scope/range of affected users
Identify scope/range of affected servers
LOOK FOR COMMON FACTORS!
    Third-party applications
    Same location
    Same release
    Time of day
Check for customizations
Follow the data flow!
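
Looking for common factors can be as simple as counting which attribute values most affected users share. The following sketch assumes you have collected a small record per affected user; the attribute names, values and the 80% threshold are all hypothetical.

```python
from collections import Counter


def common_factors(affected, threshold=0.8):
    """Return (attribute, value, share) for values shared by most affected users."""
    counts = Counter()
    for record in affected:
        for attribute, value in record.items():
            counts[(attribute, value)] += 1
    total = len(affected)
    return sorted(
        (attribute, value, count / total)
        for (attribute, value), count in counts.items()
        if count / total >= threshold
    )


# Hypothetical records for three users who reported the problem.
affected_users = [
    {"location": "site-a", "release": "r1", "addon": "none"},
    {"location": "site-b", "release": "r1", "addon": "fax"},
    {"location": "site-a", "release": "r1", "addon": "fax"},
]
for attribute, value, share in common_factors(affected_users):
    print(attribute, value, f"{share:.0%}")
```

Here only the release is common to everyone affected, which is where the next round of questions would start.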


Execution: Problem vs. Routine

Take a snapshot of the problem
Compare it to routine data
May identify particular areas of concern
May allow vendor to focus their efforts better/faster
Examples:
    Domino NSD: NAMElookup activity
    Perfmon/perfpmr/iostat: disk queuing
Pay particular attention to period just BEFORE problem (last 10 minutes)
Be prepared to be pointed in a different direction!
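
The comparison of problem data against routine data, focused on the minutes just before the problem, might look like the sketch below. The timestamps, the disk queue values and the routine mean are invented for illustration; only the 10-minute window comes from the slide.

```python
from datetime import datetime, timedelta
from statistics import mean


def pre_problem_window(samples, problem_start, minutes=10):
    """Keep (timestamp, value) samples from the `minutes` just before the problem."""
    window_start = problem_start - timedelta(minutes=minutes)
    return [(ts, v) for ts, v in samples if window_start <= ts < problem_start]


def deviation_ratio(samples, routine_mean):
    """Ratio of the window's mean value to the routine mean."""
    return mean(v for _, v in samples) / routine_mean


# Hypothetical disk queue samples around a problem reported at 14:00.
problem_start = datetime(2011, 1, 31, 14, 0)
disk_queue = [
    (datetime(2011, 1, 31, 13, 40), 1.0),
    (datetime(2011, 1, 31, 13, 52), 4.0),
    (datetime(2011, 1, 31, 13, 58), 6.0),
    (datetime(2011, 1, 31, 14, 2), 9.0),
]
window = pre_problem_window(disk_queue, problem_start)
ratio = deviation_ratio(window, routine_mean=1.25)
print(f"{len(window)} samples, {ratio:.1f}x routine")
```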


Execution: Client, Server or Both?

DON'T GO AFTER A FLY WITH A SLEDGEHAMMER!
Resist the urge to turn on all the debug
Overly ambitious debug can present its own performance cost
    DEBUG_TCP_ALL in IBM Lotus Domino
    VP_TRACE_ALL in IBM Lotus Sametime
    debug=FINEST in Java
It's worth a round of data gathering to target server debug more specifically
High-level client-side debug correlates well with trace logs
    Live HTTP Headers (Firefox add-on)
    Firebug (Firefox add-on)
    Fiddler (MSIE proxy)
Again, gather twice - routine and problem - when possible


Execution: Recent Changes

Back to Change Control
Look for ANY changes close to start of problem
Don't forget to check for OS patches/updates
Look for new "stuff" too...
Check all along the data flow
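
If change records from every team along the data flow can be pulled into one list, finding the changes close to the start of the problem is a simple filter. The record fields, the sample entries and the 72-hour lookback are assumptions made for this sketch.

```python
from datetime import datetime, timedelta


def changes_near(changes, problem_start, lookback_hours=72):
    """Return changes made within `lookback_hours` before the problem, newest first."""
    earliest = problem_start - timedelta(hours=lookback_hours)
    recent = [c for c in changes if earliest <= c["when"] <= problem_start]
    return sorted(recent, key=lambda c: c["when"], reverse=True)


# Hypothetical change log gathered from several teams.
problem_start = datetime(2011, 1, 31, 9, 0)
change_log = [
    {"when": datetime(2011, 1, 10, 22, 0), "where": "firewall", "what": "rule cleanup"},
    {"when": datetime(2011, 1, 29, 23, 0), "where": "os", "what": "patch bundle"},
    {"when": datetime(2011, 1, 30, 23, 30), "where": "load balancer", "what": "new pool member"},
]
for change in changes_near(change_log, problem_start):
    print(change["when"], change["where"], change["what"])
```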


Lather, Rinse, Repeat...


Be prepared to cycle through this process several times
Apply same principles to each area of troubleshooting
Example:

    Identify/Refine shows only particular users suffering
    Logs show directory issues
    Now, users not experiencing problems are "routine"
    Troubleshoot directory by comparing problem users against routine users
    e.g. get LDIF dumps for both

Only go where the evidence takes you!
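
The LDIF comparison from the example above can be sketched as follows. This is a deliberately simplified reader, not a full LDIF parser: it skips comments, folded continuation lines and base64 values, and the two directory entries are invented for illustration.

```python
def parse_ldif_entry(text):
    """Parse one simple LDIF entry into {attribute: set(values)}.

    Simplification: comments, folded lines and base64 values are skipped.
    """
    entry = {}
    for line in text.strip().splitlines():
        if not line or line.startswith("#") or line.startswith(" "):
            continue
        name, separator, value = line.partition(": ")
        if not separator or name.endswith(":"):
            continue
        entry.setdefault(name.lower(), set()).add(value)
    return entry


def diff_entries(problem, routine, ignore=("dn", "cn", "mail")):
    """Return {attribute: (problem_values, routine_values)} where they differ."""
    differences = {}
    for name in sorted(set(problem) | set(routine)):
        if name in ignore:
            continue
        p, r = problem.get(name, set()), routine.get(name, set())
        if p != r:
            differences[name] = (p, r)
    return differences


# Hypothetical dumps for one problem user and one routine user.
problem_user = parse_ldif_entry("""
dn: cn=Pat Example,ou=People,o=Example
mail: pat@example.com
memberOf: cn=Staff,o=Example
""")
routine_user = parse_ldif_entry("""
dn: cn=Sam Example,ou=People,o=Example
mail: sam@example.com
memberOf: cn=Staff,o=Example
memberOf: cn=MailUsers,o=Example
""")
differences = diff_entries(problem_user, routine_user)
print(differences)
```

Attributes that are expected to differ per user are ignored so that the output shows only differences worth following up.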


QUESTIONS & ANSWERS

Please complete a session evaluation!
More questions? Find me in the Lotus Solutions Development Lab!

THANKS FOR BEING HERE!
