
UNIT 3

Fault Tolerance Techniques


Introduction, failure causes, fault types, fault detection, fault and error containment, redundancy, data diversity, reversal checks, malicious or Byzantine failures

Coding technique

Software fault tolerance

Network fault tolerance

Roll No: 15

Fault Tolerance
Definition
Fault tolerance refers to a system's ability to deal with malfunctions.

Fault-tolerant systems are, ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors.

Real Time and Fault Tolerance

Failure Causes
There are three causes of failure:
Errors in the specification or design,
Defects in the components,
Environmental effects.


Fault Types
Faults are classified according to their behavior:
1. Temporal behavior
2. Output behavior

A fault is said to be active when it is physically capable of generating errors and to be benign when it is not.

Temporal Behavior
Transient faults:
These occur once and then disappear.

Intermittent faults:
Intermittent faults are characterized by a fault
occurring, then vanishing again, then reoccurring, then
vanishing.

Permanent faults:
This type of failure is persistent: it continues to exist
until the faulty component is repaired or replaced.


Fault Detection
Definition
There are two ways to determine that a processor is malfunctioning:
1. Online
2. Offline

Online detection goes on in parallel with normal system operation. One way of doing this is to check for any behavior that is inconsistent with correct operation.

A monitor (called a watchdog processor) is associated with each processor, looking for signs that the processor is faulty. The watchdog processor watches the data and address lines, as shown in the figure.

A second approach is to have multiple processors, which are supposed to put out the same result, and compare the results. A discrepancy indicates the existence of a fault.

Online Detection using Watchdog Processor

The following actions are indicative of a faulty processor:
Branching to an invalid destination.
Fetching an opcode from a location containing data.
Writing into a portion of memory to which the process has no write access.
Fetching an illegal opcode.
Remaining inactive for more than a prescribed period.
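
The last symptom in the list, inactivity beyond a prescribed period, can be sketched in software as a heartbeat check. This is only an illustration, not the hardware watchdog processor described in the slides; the class name `Watchdog`, the methods `kick` and `expired`, and the injectable `clock` parameter are all hypothetical.

```python
import time


class Watchdog:
    """Flags a monitored unit as suspect if it stays silent too long."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # the prescribed period, in seconds
        self._clock = clock             # injectable so the logic can be tested
        self._last_kick = clock()

    def kick(self):
        """Called by the monitored unit to signal that it is still alive."""
        self._last_kick = self._clock()

    def expired(self):
        """True if no kick arrived within the prescribed period."""
        return self._clock() - self._last_kick > self.timeout_s
```

The monitored task would call `kick()` at regular points in its loop, while a separate monitor polls `expired()` and raises an alarm when it returns true.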

Offline detection consists of running diagnostic tests. When a processor is running such a test, it obviously cannot be executing the applications software.

Diagnostic tests can be scheduled just like ordinary tasks.

The greater the failure rate, the greater must be the frequency with which these tests are run.

Fault and Error containment


When a fault or error occurs in one part of the system, it can, if unchecked, spread through the system like an infectious disease.

A fault in one part of the system might, for example, cause large voltage swings in another; a fault-free processor can put out erroneous results as a result of using erroneous input from a faulty unit.

Faults and errors must therefore be prevented from spreading through the system. This is called containment.

The system is divided into:

Fault Containment Zones (FCZ):
An FCZ is a subset of the system that operates correctly despite arbitrary logical or electrical faults outside the subset. That is, the failure of some part of the computer outside an FCZ cannot cause any element inside that FCZ to fail.

Error Containment Zones (ECZ):
The function of an ECZ is to prevent errors from propagating across zone boundaries. This is typically achieved by voting on redundant outputs.

Redundancy
Four types:

Hardware redundancy: The system is provided with far more hardware than it would need if all the components were perfectly reliable.
Software redundancy: The system is provided with different software versions of tasks, so that when one version of a task fails under certain inputs, another version can be used.
Time redundancy: The task schedule has some slack in it, so that some tasks can be rerun if necessary and still meet critical deadlines.
Information redundancy: The data are coded in such a way that a certain number of bit errors can be detected or corrected.

Hardware Redundancy
Two types: static (or masking) and dynamic redundancy.
Static: redundant components are used inside a system to hide the effects of faults, e.g. Triple Modular Redundancy (TMR).
TMR: three identical subcomponents and majority voting circuits; the outputs are compared and if one differs from the other two, that output is masked out.
Dynamic: redundancy supplied inside a component which indicates that the output is in error; provides an error detection facility; recovery must be provided by another component.
E.g. communications checksums and memory parity bits.

N-Modular Redundancy
N-modular redundancy (NMR) is a scheme for forward error recovery.

It works by using N processors instead of one, and voting on their output. N is usually odd.

The figure illustrates this scheme for N = 3. One of two approaches is possible. In design (a), there are N voters and the entire cluster produces N outputs. In design (b), there is just one voter.
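
The voting step can be sketched as a single software voter, in the spirit of design (b). This is a minimal illustration under the assumption that module outputs are directly comparable (and hashable) values; the function name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the N modules.

    Raises ValueError when no value has a strict majority, so that a
    disagreement is never silently passed on.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


# N = 3: one faulty module is outvoted by the two that agree.
print(majority_vote([42, 42, 17]))  # prints 42
```

Choosing an odd N avoids ties between two equally large groups, which is one reason N is usually odd.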

N-Modular Redundancy


Software Redundancy
The system is provided with different software versions of a task.

The versions are written independently by different teams of programmers.

If one version of a task fails under certain inputs, another version can be used.

Software Redundancy
N-Version Programming
Recovery Block Approach


N-Version Programming
The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware.

In an N-version software system, each module is made with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way.

Each version then submits its answer to a voter or decider, which determines the correct answer.

This system can hopefully overcome the design faults present in most software by relying upon the design diversity concept.
An important distinction in N-version software is the fact that the system could include multiple types of hardware using multiple versions of software.
The goal is to increase the diversity in order to avoid common mode failures.
Using N-version software, it is encouraged that each different version be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments.

Recovery Block Approach


The recovery block operates with an adjudicator which confirms the results of various implementations of the same algorithm.

In a system with recovery blocks, the system view is broken down into fault recoverable blocks.

The entire system is constructed of these fault tolerant blocks.

Each block contains at least primary, secondary, and exceptional case code along with an adjudicator.

The adjudicator is the component which determines the correctness of the various blocks to try.

Upon first entering a unit, the adjudicator first executes the primary alternate.

If the adjudicator determines that the primary block failed, it then tries to roll back the state of the system and tries the secondary alternate.
If the adjudicator does not accept the results of any of the alternates, it then invokes the exception handler, which then indicates the fact that the software could not perform the requested operation.
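
The control flow just described (try the primary, roll back, try the secondary, otherwise signal an exception) can be sketched as follows. This is a simplified illustration, not a reference implementation: the names `recovery_block`, `RecoveryBlockError`, and `acceptable` are hypothetical, and rollback is approximated by running every alternate on a fresh copy of the state.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when no alternate passes the acceptance test."""


def recovery_block(state, alternates, acceptable):
    """Try each alternate in order until the adjudicator accepts a result.

    `alternates` is an ordered list of functions (primary first);
    `acceptable` plays the role of the adjudicator's acceptance test.
    """
    for alternate in alternates:
        trial = copy.deepcopy(state)  # each attempt starts from the saved state
        try:
            result = alternate(trial)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptable(result):
            return result
    raise RecoveryBlockError("all alternates were rejected")


def primary(xs):
    xs.pop()  # deliberately buggy: loses an element
    return sorted(xs)


def secondary(xs):
    return sorted(xs)


data = [3, 1, 2]


def accept(out):
    # Acceptance test: same length as the input and in ascending order.
    return len(out) == len(data) and all(a <= b for a, b in zip(out, out[1:]))


print(recovery_block(data, [primary, secondary], accept))  # prints [1, 2, 3]
```

Because each alternate works on a copy, the damage done by the buggy primary never reaches `data`, which is what lets the secondary start from a clean state.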

Software Redundancy Structures


Time Redundancy
Achieves fault tolerance by performing an operation several times.

Timeouts and retransmissions in reliable point-to-point and group communication are examples of time redundancy.

This form of redundancy is useful in the presence of transient or intermittent faults. It is of no use with permanent faults.
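
Repeating an operation can be sketched as a simple retry loop. This is a minimal illustration; the helper name `run_with_retries`, the default of three attempts, and the `flaky` demo operation are assumptions made for the example.

```python
def run_with_retries(operation, attempts=3):
    """Repeat `operation` until it succeeds or the attempts are used up."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except Exception as err:  # a transient fault may vanish on the next try
            last_error = err
    raise RuntimeError(f"failed after {attempts} attempts") from last_error


calls = {"n": 0}


def flaky():
    """Simulates a transient fault: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient fault")
    return "ok"


print(run_with_retries(flaky))  # prints ok
```

An operation that fails on every attempt, as it would with a permanent fault, simply exhausts the retries and raises, which matches the point that time redundancy does not help in that case.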

Time Redundancy
1. Recovery Points

2. Backward Error Recovery


Information Redundancy
The basic idea of information redundancy is to provide more information than is strictly necessary and to use that extra information to check for errors.

We use coding all the time ourselves, while correcting for typographical errors.

For example, if we encounter the word "startegic", we will most likely unconsciously correct it to "strategic".

This is possible because (a) there is no such word as "startegic", and (b) "strategic" is the closest word that we can think of to "startegic".

The conditions (a) and (b) are at the basis of all coding theory.

All computer words are strings of 0s and 1s. Coding ensures that not all strings of 0s and 1s are legal (i.e., are valid).

When assessing a coding scheme, we want to know how many extra bits it adds to the words, and how many bit errors it can detect or correct.

We are interested in how much work it takes to encode

Information Redundancy structures


Repetition Codes
Parity coding
Checksum codes
Cyclic Redundancy check
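
Three of the structures listed above can be illustrated briefly. This is a sketch under simple assumptions: bits are held in Python lists, the helper names `even_parity_bit` and `repetition_decode` are hypothetical, and CRC-32 from the standard `zlib` module stands in as one example of a cyclic redundancy check.

```python
import zlib


def even_parity_bit(bits):
    """Extra bit that makes the total number of 1s even."""
    return sum(bits) % 2


def repetition_decode(received, n=3):
    """Decode an n-fold repetition code by majority vote within each group."""
    groups = [received[i:i + n] for i in range(0, len(received), n)]
    return [1 if sum(g) * 2 > len(g) else 0 for g in groups]


# Parity coding: one extra bit detects any single-bit error.
word = [1, 0, 1, 1]
coded = word + [even_parity_bit(word)]
assert sum(coded) % 2 == 0       # a legal codeword has even parity

corrupted = coded.copy()
corrupted[2] ^= 1                # flip one bit
assert sum(corrupted) % 2 == 1   # the error is detected, but not located

# Repetition code: each bit is sent three times, so one flipped copy
# per group is corrected by the majority.
assert repetition_decode([1, 1, 0, 0, 0, 0, 1, 0, 1]) == [1, 0, 1]

# Cyclic redundancy check: a changed message yields a different CRC.
assert zlib.crc32(b"fault tolerance") != zlib.crc32(b"fault tolerancf")
```

The trade-off mentioned earlier is visible here: parity adds one bit per word and only detects a single error, while the threefold repetition code triples the data but can correct one error per group.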


Data diversity
Data diversity is an approach that can be used in
association with any of the redundancy techniques
considered above.
Sometimes, hardware or software may fail for certain
inputs, but not for other inputs that are very close to
them.
So, instead of applying the same input data to the
redundant processors, we apply slightly different input
data to them.
Thus we have in some cases another line of defense
against failure.
This approach will only work if the sensitivity of the

Data diversity


Reversal Checks
Introduction
If there is a simple relationship between the inputs and outputs of a system, it may then be possible to calculate the inputs given the outputs. These can then be compared with the actual inputs as a check.

For example, consider a task that finds the square root of a number. To see if the process is correct, we can square the output and check it against the original input.

Or let the task consist of writing a block onto disk. The reverse operation consists of reading this block from the disk after writing and comparing it to the input to make sure that the two are the same.
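
The square-root example can be written down directly. This is a minimal sketch; the function name `checked_sqrt` and the tolerance values are assumptions, needed because floating-point squaring rarely reproduces the input exactly.

```python
import math


def checked_sqrt(x, rel_tol=1e-9):
    """Compute sqrt(x) and verify the result by squaring it."""
    root = math.sqrt(x)
    # Reversal check: recompute the input from the output and compare.
    if not math.isclose(root * root, x, rel_tol=rel_tol, abs_tol=1e-12):
        raise ArithmeticError("reversal check failed")
    return root
```

The check costs one multiplication and one comparison, which is far cheaper than running a second square-root computation for comparison.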

MALICIOUS OR BYZANTINE FAILURES


Introduction
Whenever a failure can cause a unit to behave arbitrarily, a malicious or Byzantine failure is said to have occurred.
For correct operation, it is often the case that copies of the same data as seen by various processors must be consistent (i.e., the same).
When communication is limited to two-party messages, the faulty units must be fewer than a third of the total number of units if consistency is to be guaranteed.
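
The "fewer than a third" condition can be turned into a small calculation: with n units and m faulty ones, consistency requires 3m < n, that is n >= 3m + 1. The function names below are hypothetical and the code only evaluates the bound; it is not an agreement protocol.

```python
def can_guarantee_consistency(n, m):
    """True if m faulty units out of n are fewer than a third of the total."""
    return 3 * m < n


def max_tolerable_byzantine_faults(n):
    """Largest m satisfying n >= 3m + 1."""
    if n < 1:
        raise ValueError("need at least one unit")
    return (n - 1) // 3


# Three units cannot tolerate even one Byzantine fault; four units can.
print(max_tolerable_byzantine_faults(3))  # prints 0
print(max_tolerable_byzantine_faults(4))  # prints 1
```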

Integrated failure handling


Introduction

When an error is detected, the system must respond swiftly to deal with it.
In the short term, the error might be masked by voting.
In the long term, the system will have to locate the failure that gave rise to the error and decide what to do with the failed unit.
Three options are usually available:
1. Retry
2. Disconnect
3. Replace

Network fault tolerance:
Includes

Reliable communication protocols

Agreement protocols

Database commit protocols - Application:

Agreement in faulty systems


Introduction
Two Army Problem:
We'll first examine the case of good processors but faulty communication lines. This is known as the two army problem.
Byzantine agreement:
The source processor broadcasts its initial value to all other processors.
Agreement: All nonfaulty processors agree on the same value.
Validity: If the source processor is nonfaulty, the value agreed upon by all nonfaulty processors should be the initial value of the source.

Check pointing & Recovery


Checkpoint-recovery is a common technique for imbuing a program or system with fault tolerant qualities, and grew from the ideas used in systems which employ transaction processing.
It allows systems to recover after some fault interrupts the system and causes the task to fail or be aborted in some way.
While many systems employ the technique to minimize lost processing time, it can be used more broadly to tolerate and recover from faults in a critical application or task.
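
A minimal sketch of the idea, assuming a task whose whole state fits in a small dictionary: the state is saved after every step, and a restarted task resumes from the last saved state instead of from the beginning. The function names, the use of `pickle`, and the `fail_at` fault-injection parameter are illustrative assumptions, not part of any system named in these slides.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # a crash mid-write leaves the old checkpoint intact


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def sum_with_checkpoints(items, path, fail_at=None):
    """Sum `items`, checkpointing after each one; `fail_at` injects a fault."""
    state = load_checkpoint(path, {"next": 0, "total": 0})
    for i in range(state["next"], len(items)):
        if i == fail_at:
            raise RuntimeError("injected fault")
        state = {"next": i + 1, "total": state["total"] + items[i]}
        save_checkpoint(path, state)
    return state["total"]
```

If a run with `fail_at=2` is aborted and the function is then called again on the same path without `fail_at`, the second run starts at item 2 and only the interrupted step is repeated.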


Micro check pointing


A single checkpoint buffer is maintained per multithreaded ARMOR process.
The element state is checkpointed after each operation.
Checkpoints are committed to stable storage after processing a message.
There is no need to do process-wide checkpoints of stacks, heap, etc.
The existing locking policy of element data prevents the need to suspend all threads.

IRIX check pointing

A facility for saving running processes and, at some other time, restarting the saved processes from the point already reached, without starting all over again.
A checkpoint image is saved in a set of disk files and can comprise:
A set of processes
All processes in the process group (a set of processes that constitute a logical job)
All processes in a process session (a set of processes started from the same physical or logical terminal)

THANK YOU
