Fault Tolerance Techniques: Unit 3

UNIT 3
Fault Tolerance Techniques

Introduction, failure causes, fault types, fault detection,
fault and error containment, redundancy, data diversity,
reversal checks, malicious or Byzantine failures
Coding technique
Software fault tolerance
Network fault tolerance
Roll No: 15
Fault Tolerance
Definition
Fault tolerance refers to a system's ability to deal with malfunctions.
Fault-tolerant systems are, ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors.
Real Time and Fault Tolerance
Failure Causes
There are three causes of failure:
Errors in the specification or design,
Defects in the components,
Environmental effects.
Fault Types
Faults are classified according to their behavior:
1. Temporal behavior
2. Output behavior
A fault is said to be active when it is physically capable of generating errors and to be benign when it is not.
Temporal Behavior
Transient faults:
These occur once and then disappear.
Intermittent faults:
These occur, vanish, then reoccur and vanish again, repeatedly.
Permanent faults:
These are persistent: they continue to exist until the faulty component is repaired or replaced.
Fault Detection
Definition
There are two ways to determine that a processor is malfunctioning:
1. Online
2. Offline
Online detection goes on in parallel with normal system operation.
One way of doing this is to check for any behavior that is inconsistent with correct operation.
A monitor (called a watchdog processor) is associated with each processor, looking for signs that the processor is faulty.
The watchdog processor watches the data and address lines, as shown in the figure.
A second approach is to have multiple processors, which are supposed to put out the same result, and compare the results.
A discrepancy indicates the existence of a fault.
Online Detection using Watchdog Processor
The following actions are indicative of a faulty processor:
Branching to an invalid destination.
Fetching an opcode from a location containing data.
Writing into a portion of memory to which the process has no write access.
Fetching an illegal opcode.
Being inactive for more than a prescribed period.
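The last symptom in the list (inactivity beyond a prescribed period) is the one most often checked with a watchdog timer. The following is a minimal sketch only: the class name Watchdog, its methods, and the injectable clock are illustrative assumptions, not part of the original slides.

```python
import time


class Watchdog:
    """Flags a monitored unit as faulty if it stays silent for too long."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock              # injectable clock (assumption, eases testing)
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored processor to signal that it is alive."""
        self.last_kick = self.clock()

    def expired(self):
        """True if no kick arrived within the prescribed period."""
        return self.clock() - self.last_kick > self.timeout_s


# Deterministic usage with a fake clock instead of real time.
now = [0.0]
wd = Watchdog(timeout_s=1.0, clock=lambda: now[0])
now[0] = 0.5
alive_at_half = not wd.expired()   # still within the period
now[0] = 2.0
faulty_at_two = wd.expired()       # silent for too long
```

A real watchdog processor would run independently of the unit it monitors; here both live in one process purely for illustration.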
Offline detection consists of running diagnostic tests.
When a processor is running such a test, it cannot be executing the applications software.
Diagnostic tests can be scheduled just like ordinary tasks.
The greater the failure rate, the greater must be the frequency with which these tests are run.
Fault and Error containment

Includes
When a fault or error occurs in one part of the system, it can, if unchecked, spread through the system like an infectious disease.
A fault in one part of the system might, for example, cause large voltage swings in another; a fault-free processor can put out erroneous results as a result of using erroneous input from a faulty unit.
Faults and errors must therefore be prevented from spreading through the system. This is called fault and error containment.
The system is divided into containment zones:
Fault Containment Zones (FCZ):
An FCZ is a subset of the system that operates
correctly despite arbitrary logical or electrical faults
outside the subset.
That is, the failure of some part of the computer
outside an FCZ cannot cause any element inside
that FCZ to fail.
Error Containment Zones (ECZ)
The function of an ECZ is to prevent errors from
propagating across zone boundaries. This is
typically achieved by voting redundant outputs.
Redundancy
Four types:
Hardware redundancy: The system is provided with far more hardware than it would need if all the components were perfectly reliable.
Software redundancy: The system is provided with different software versions of tasks, so that when one version of a task fails under certain inputs, another version can be used.
Time redundancy: The task schedule has some slack in it, so that some tasks can be rerun if necessary and still meet critical deadlines.
Information redundancy: The data are coded in such a way that a certain number of bit errors can be detected or corrected.
Hardware Redundancy
Two types: static (or masking) redundancy and dynamic redundancy.
Static: redundant components are used inside a system to hide the effects of faults, e.g. Triple Modular Redundancy (TMR).
TMR: three identical subcomponents and majority voting circuits; the outputs are compared and, if one differs from the other two, that output is masked out.
Dynamic: redundancy supplied inside a component which indicates that the output is in error; it provides an error detection facility, and recovery must be provided by another component.
E.g. communications checksums and memory parity bits.
N-Modular Redundancy
N-modular redundancy (NMR) is a scheme for forward error recovery.
It works by using N processors instead of one, and voting on their output. N is usually odd.
The figure illustrates this scheme for N = 3.
One of two approaches is possible.
In design (a), there are N voters and the entire cluster produces N outputs. In design (b), there is just one voter.
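The voting step can be sketched in a few lines. This is an illustrative software voter in the style of design (b), with a single voter; the function name majority_vote and the choice to raise an error when no strict majority exists are assumptions, not something the slides prescribe.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError if no value is held by more than half of them.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


# N = 3 (TMR): the single faulty output (17) is masked by the other two.
tmr_result = majority_vote([42, 17, 42])
```

With N = 3 one faulty module is masked; in general a strict majority requires at most (N - 1) // 2 faulty modules.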
Software Redundancy
The system is provided with different software versions of a task, written independently by different teams of programmers.
If one version of a task fails under certain inputs, another version can be used.
Two approaches to software redundancy:
N-Version Programming
Recovery Block Approach
N-Version Programming
The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware.
In an N-version software system, each module is made with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way.
Each version then submits its answer to a voter or decider, which determines the correct answer.
This system can hopefully overcome the design faults present in most software by relying upon the design diversity concept.
An important distinction in N-version software is the fact that the system could include multiple types of hardware using multiple versions of software.
The goal is to increase the diversity in order to avoid common mode failures.
In N-version software, each version should be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments.
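As a rough illustration of the idea, the sketch below runs three variants of the same small task (summing 1..n) and lets a decider pick the majority answer. All names are hypothetical, the third variant carries a deliberate bug, and real N-version systems would use independently developed versions rather than three functions in one file.

```python
from collections import Counter


def sum_loop(n):
    """Version 1: explicit loop."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_formula(n):
    """Version 2: closed-form expression."""
    return n * (n + 1) // 2


def sum_buggy(n):
    """Version 3: deliberately faulty (off-by-one, omits n)."""
    return sum(range(1, n))


def decide(versions, n):
    """Run every version and return the majority answer, or None if there is none."""
    answers = [version(n) for version in versions]
    value, count = Counter(answers).most_common(1)[0]
    return value if count * 2 > len(answers) else None


nvp_answer = decide([sum_loop, sum_formula, sum_buggy], 10)
```

The decider outvotes the faulty variant only because the other two fail independently of it, which is exactly why design diversity matters.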
Recovery Block Approach

The recovery block operates with an adjudicator which confirms the results of various implementations of the same algorithm.
In a system with recovery blocks, the system view is broken down into fault recoverable blocks.
The entire system is constructed of these fault tolerant blocks.
Each block contains at least primary, secondary, and exceptional case code along with an adjudicator.
The adjudicator is the component which determines the correctness of the various blocks to try.
Upon first entering a unit, the adjudicator first executes the primary alternate.
If the adjudicator determines that the primary block failed, it then tries to roll back the state of the system and tries the secondary alternate.
If the adjudicator does not accept the results of any of the alternates, it then invokes the exception handler, which then indicates the fact that the software could not perform the requested operation.
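The control flow just described (try the primary, roll back, try the secondary, otherwise signal failure) can be sketched as follows. This is a simplified illustration under assumptions: the state is a plain dictionary, rollback is done by copying a saved checkpoint, and the names recovery_block, RecoveryBlockError, primary, secondary and accept are all hypothetical.

```python
import copy
import math


class RecoveryBlockError(Exception):
    """Raised when no alternate passes the acceptance test."""


def recovery_block(state, alternates, acceptance_test):
    checkpoint = copy.deepcopy(state)            # recovery point taken on entry
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)    # roll back before each attempt
        try:
            alternate(candidate)
            if acceptance_test(candidate):
                return candidate
        except Exception:
            pass                                 # a crash counts as a failed alternate
    raise RecoveryBlockError("all alternates rejected")


def primary(s):
    s["root"] = s["x"] / 2                       # deliberately wrong algorithm


def secondary(s):
    s["root"] = math.sqrt(s["x"])


def accept(s):
    """Adjudicator: the result squared must reproduce the input."""
    return abs(s["root"] ** 2 - s["x"]) < 1e-9


rb_result = recovery_block({"x": 9.0}, [primary, secondary], accept)
```

Here the primary's answer (4.5) is rejected by the acceptance test, the state is restored, and the secondary's answer (3.0) is accepted.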
Software Redundancy Structures
Time Redundancy
Time redundancy achieves fault tolerance by performing an operation several times.
Timeouts and retransmissions in reliable point-to-point and group communication are examples of time redundancy.
This form of redundancy is useful in the presence of transient or intermittent faults. It is of no use with permanent faults.
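A retry loop is the simplest form of time redundancy. The sketch below is illustrative only: retry and FlakyLink are hypothetical names, and the link simulates a transient fault by failing a fixed number of times before succeeding.

```python
def retry(operation, attempts=3):
    """Run operation up to `attempts` times (assumed >= 1); re-raise the last error."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as err:
            last_error = err
    raise last_error


class FlakyLink:
    """Hypothetical link whose first `failures` sends time out."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def send(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("no acknowledgement")
        return "ack"


link = FlakyLink(failures=2)              # transient fault: clears after two tries
retry_result = retry(link.send, attempts=3)
```

If the fault never clears (a permanent fault), every attempt fails and the retries only cost time, which matches the limitation stated above.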
Time Redundancy
1. Recovery Points
2. Backward Error Recovery
Information Redundancy
The basic idea of information redundancy is to provide more information than is strictly necessary and to use that extra information to check for errors.
We use coding all the time ourselves, while correcting for typographical errors.
For example, if we encounter the word startegic, we will most likely unconsciously correct it to strategic.
This is possible because (a) there is no such word as startegic, and (b) strategic is the closest word that we can think of to startegic.
The conditions (a) and (b) are at the basis of all coding theory.
All computer words are strings of 0s and 1s. Coding ensures that not all strings of 0s and 1s are legal (i.e., valid).
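Conditions (a) and (b) translate directly into code: a received word is checked against the set of legal codewords, and an illegal word is corrected to the closest legal one. The codebook below is an arbitrary example chosen for illustration (its codewords differ pairwise in at least three bit positions); it is not taken from the slides.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


# Illustrative codebook: only 4 of the 32 possible 5-bit strings are legal.
CODEWORDS = ["00000", "01011", "10101", "11110"]


def decode(word):
    """Return (is_legal, nearest_codeword)."""
    if word in CODEWORDS:
        return True, word                                   # condition (a) not triggered
    nearest = min(CODEWORDS, key=lambda c: hamming_distance(c, word))
    return False, nearest                                   # condition (b): closest legal word


legal, corrected = decode("01111")   # "01011" with one bit flipped
```

Because any two codewords here differ in at least three bits, a single bit error always leaves the original codeword as the unique closest match.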
When assessing a coding scheme, we want to know how many extra bits it adds to the words, and how many bit errors it can detect or correct.
We are also interested in how much work the encoding takes.
Information Redundancy structures

Repetition Codes
Parity coding
Checksum codes
Cyclic Redundancy check
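Three of the structures listed above can be sketched briefly. The helper names are illustrative assumptions; the CRC uses the CRC-32 function from Python's standard zlib module rather than a hand-written polynomial division.

```python
import zlib


def add_even_parity(bits):
    """Append one bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    return sum(word) % 2 == 0


def repeat3_encode(bits):
    """Repetition code: send every bit three times."""
    return [b for b in bits for _ in range(3)]


def repeat3_decode(word):
    """Majority vote over each group of three received bits."""
    return [int(sum(word[i:i + 3]) >= 2) for i in range(0, len(word), 3)]


# Parity: detects (but cannot correct) a single bit error.
sent = add_even_parity([1, 0, 1, 1])
received = sent.copy()
received[2] ^= 1
parity_detects_error = not parity_ok(received)

# Repetition: corrects a single bit error per group of three.
coded = repeat3_encode([1, 0])
coded[1] ^= 1
repaired = repeat3_decode(coded)

# CRC: a changed byte produces a different check value.
crc_detects_error = zlib.crc32(b"block") != zlib.crc32(b"blocl")
```

The trade-off is visible in the overhead: parity adds one bit per word but only detects, while the repetition code triples the data in order to correct.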
Data diversity
Data diversity is an approach that can be used in
association with any of the redundancy techniques
considered above.
Sometimes, hardware or software may fail for certain
inputs, but not for other inputs that are very close to
them.
So, instead of applying the same input data to the
redundant processors, we apply slightly different input
data to them.
Thus we have in some cases another line of defense
against failure.
This approach will only work if the sensitivity of the output to small changes in the input is low.
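A small sketch of data diversity, under stated assumptions: fragile_square is a hypothetical routine with a fault that shows up for exactly one input value, the three evaluations stand in for three redundant processors, and the median is used as a simple way to discard the outlier.

```python
import statistics


def fragile_square(x):
    """Hypothetical routine with an input-specific fault at exactly x == 2.0."""
    if x == 2.0:
        return -1.0            # erroneous output for this one input
    return x * x


def with_data_diversity(f, x, delta=1e-6):
    """Evaluate f on slightly different inputs and return the median result."""
    results = [f(x - delta), f(x), f(x + delta)]
    return statistics.median(results)


dd_result = with_data_diversity(fragile_square, 2.0)
```

The result is close to 4 rather than exactly 4, which is acceptable only because squaring is not very sensitive to a perturbation of this size.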
Reversal Checks
Introduction
If there is a simple relationship between the inputs and
outputs of a system, it may then be possible to
calculate the inputs given the outputs.
This can then be compared with the actual inputs as a
check.
For example, consider a task that finds the square root of a number.
To see if the process is correct, we can square the output and check it against the original input.
Or let the task consist of writing a block onto disk.
The reverse operation consists of reading this block from the disk after writing and comparing it to the input to make sure that the two are the same.
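The square-root case can be sketched as a wrapper that squares the result and compares it with the input. The function name checked_sqrt, the pluggable sqrt_impl parameter and the tolerance are assumptions for illustration; floating-point results need a tolerance rather than an exact comparison.

```python
import math


def checked_sqrt(x, sqrt_impl=math.sqrt, rel_tol=1e-9):
    """Compute a square root and verify it by reversing the operation."""
    y = sqrt_impl(x)
    if not math.isclose(y * y, x, rel_tol=rel_tol):
        raise ArithmeticError("reversal check failed")
    return y


good = checked_sqrt(16.0)

# A faulty implementation (halving instead of taking the root) is caught.
try:
    checked_sqrt(16.0, sqrt_impl=lambda v: v / 2)
    caught = False
except ArithmeticError:
    caught = True
```

The check is cheap here because squaring is much simpler than taking a root; reversal checks pay off when the inverse is easier than the forward computation.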
MALICIOUS OR BYZANTINE FAILURES

Introduction
Whenever a failure can cause a unit to behave arbitrarily, a malicious or Byzantine failure is said to happen.
For correct operation, it is often the case that copies of
the same data as seen by various processors must be
consistent (i.e., the same).
When communication is limited to two-party messages,
the faulty units must be fewer than a third of the total
number of units if consistency is to be guaranteed.
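The one-third bound can be turned into a quick sizing calculation. The helper names below are illustrative; the arithmetic simply restates "faulty units fewer than a third of the total", that is, a system of n units tolerates m faulty ones only if n >= 3m + 1.

```python
def max_byzantine_faults(n_units):
    """Largest m such that m is strictly less than n_units / 3."""
    return (n_units - 1) // 3


def min_units_for(m_faults):
    """Smallest total number of units that keeps m_faults below one third."""
    return 3 * m_faults + 1


three_units = max_byzantine_faults(3)   # 0: three units cannot tolerate any
four_units = max_byzantine_faults(4)    # 1
```

So tolerating a single Byzantine unit takes at least four units, and tolerating two takes at least seven.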
Integrated failure handling

Introduction
When an error is detected, the system must respond swiftly to deal with it.
In the short term, the error might be masked by voting.
In the long term, the system will have to locate the failure that gave rise to the error and decide what to do with the failed unit.
Three options are usually available:
1. Retry
2. Disconnect
3. Replace
Network fault tolerance:
Includes
Reliable communication protocols
Agreement protocols
Database commit protocols - Application:
Agreement in faulty systems

Introduction
Two Army Problem:
We'll first examine the case of good processors but faulty communication lines.
This is known as the two army problem.
Byzantine agreement:
The source processor broadcasts its initial value to all other processors.
Agreement: All nonfaulty processors agree on the same value.
Validity: If the source processor is nonfaulty, the value agreed upon by all nonfaulty processors should be the initial value of the source.
Checkpointing & Recovery

Includes
Checkpoint-recovery is a common technique for imbuing a program or system with fault-tolerant qualities, and grew from the ideas used in systems which employ transaction processing.
It allows systems to recover after some fault interrupts the system and causes the task to fail or be aborted in some way.
While many systems employ the technique to minimize lost processing time, it can be used more broadly to tolerate and recover from faults in a critical application or task.
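A minimal sketch of checkpoint and rollback, assuming the task state fits in a small dictionary. The class CheckpointedCounter is hypothetical, and the serialized bytes are held in memory here as a stand-in for the stable storage a real system would use.

```python
import pickle


class CheckpointedCounter:
    """Hypothetical task that accumulates values and can roll back."""

    def __init__(self):
        self.state = {"processed": 0, "total": 0}
        self._saved = pickle.dumps(self.state)   # initial checkpoint

    def checkpoint(self):
        """Save a copy of the current state."""
        self._saved = pickle.dumps(self.state)

    def recover(self):
        """Discard the current state and restore the last checkpoint."""
        self.state = pickle.loads(self._saved)

    def process(self, value):
        if value < 0:
            raise ValueError("simulated fault")
        self.state["total"] += value
        self.state["processed"] += 1


task = CheckpointedCounter()
for v in (1, 2):
    task.process(v)
task.checkpoint()
try:
    for v in (3, -1):          # the fault strikes after 3 was processed
        task.process(v)
except ValueError:
    task.recover()             # work since the checkpoint is lost
```

Only the work done after the last checkpoint (the value 3) has to be repeated, which is the lost processing time the technique aims to minimize.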
Micro checkpointing

A single checkpoint buffer is maintained per multithreaded ARMOR process.
The element state is checkpointed after each operation.
Checkpoints are committed to stable storage after processing a message.
There is no need to do process-wide checkpoints of stacks or heap.
The existing locking policy of element data prevents the need to suspend all threads.
IRIX checkpointing
A facility for saving running processes and, at some other time, restarting the saved processes from the point already reached, without starting all over again.
A checkpoint image is saved in a set of disk files and can comprise:
A set of processes
All processes in the process group (a set of processes that constitute a logical job)
All processes in a process session (a set of processes started from the same physical or logical terminal)
THANK YOU
