Topics: failure causes, fault and error containment, redundancy, data diversity, reversal checks, malicious or Byzantine failures, coding technique, network fault tolerance
Roll No: 15
Fault Tolerance
Definition
Fault tolerance refers to a system's ability to deal
with malfunctions.
Failure Causes
There are three causes of failure:
Errors in the specification or design,
Defects in the components,
Environmental effects.
Fault Types
Faults are categorized by their temporal behavior.
Temporal Behavior
Transient faults:
These occur once and then disappear.
Intermittent faults:
These occur, vanish, and then recur, repeatedly.
Permanent faults:
These are persistent: they continue to exist
until the faulty component is repaired or replaced.
Fault Detection
Definition
There are two ways to determine that a processor
is malfunctioning:
1. online
2. offline.
Online detection goes on in parallel with normal
system operation.
One way of doing this is to check for any behavior
that is inconsistent with correct operation.
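As a minimal sketch of such an online check, the Python snippet below tests each reading against two assumed properties of correct operation: a valid range and a maximum step between consecutive readings. The function names, the limits, and the sample values are illustrative assumptions, not taken from the slides.

```python
def is_consistent(reading, previous, low=0.0, high=100.0, max_step=5.0):
    """Return True if a reading is consistent with the assumed correct behavior."""
    if not (low <= reading <= high):
        return False  # outside the assumed valid range
    if previous is not None and abs(reading - previous) > max_step:
        return False  # changed faster than the assumed maximum step
    return True


def monitor(readings):
    """Check readings as they arrive and return the indices flagged as suspect."""
    suspects = []
    previous = None
    for index, reading in enumerate(readings):
        if is_consistent(reading, previous):
            previous = reading  # only trusted readings update the reference
        else:
            suspects.append(index)
    return suspects
```

With the assumed limits, monitor([20.0, 21.5, 80.0, 22.0, 150.0]) flags indices 2 (step too large) and 4 (out of range), while normal processing of the other readings continues.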
Redundancy
Four types: hardware, software, time, and information redundancy.
Hardware Redundancy
Two types: static (or masking) and dynamic redundancy.
Static: redundant components are used inside a system to hide the effects of faults, e.g. Triple Modular Redundancy (TMR).
TMR: 3 identical subcomponents and majority voting circuits; the outputs are compared, and if one differs from the other two, that output is masked out.
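The voting step of TMR can be sketched in software as follows. This is only an illustration of majority voting, assuming module outputs are hashable values that can be compared for equality; the function names and the example modules are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules."""
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority among module outputs")
    return value


def tmr(module_a, module_b, module_c, x):
    """Run three redundant modules on the same input and mask a single bad output."""
    return majority_vote([module_a(x), module_b(x), module_c(x)])


def good_module(x):
    return x * 2


def faulty_module(x):
    return x * 2 + 1  # simulated fault in one subcomponent
```

Calling tmr(good_module, good_module, faulty_module, 21) returns 42: the differing output is outvoted. If all three outputs disagree, the voter raises an error instead of guessing.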
Dynamic: redundancy supplied inside a component which indicates that the output is in error; it provides an error detection facility, and recovery must be provided by another component.
E.g. communications checksums and memory parity bits.
Real Time and Fault Tolerance
N-Modular Redundancy
N-modular redundancy (NMR) is a scheme for
forward error recovery.
Software Redundancy
The system is provided with different software versions of a task, written independently by different teams of programmers.
Two approaches:
N-Version Programming
Recovery Block Approach
N-Version Programming
The N-version software concept attempts to parallel the
traditional hardware fault tolerance concept of N-way
redundant hardware.
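A small sketch of the idea, assuming three independently written versions of an integer square root and a majority vote over their results. The version functions and the n_version_execute helper are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def isqrt_v1(n):
    """Version 1: library routine."""
    return math.isqrt(n)


def isqrt_v2(n):
    """Version 2: binary search for the largest r with r * r <= n."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_v3(n):
    """Version 3: integer Newton iteration."""
    if n < 2:
        return n
    x, y = n, (n + 1) // 2
    while y < x:
        x, y = y, (y + n // y) // 2
    return x


def n_version_execute(versions, n):
    """Run every version and return the result a strict majority agrees on."""
    results = []
    for version in versions:
        try:
            results.append(version(n))
        except Exception:
            continue  # a crashed version simply casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes * 2 > len(versions):
            return value
    raise RuntimeError("versions did not reach a majority")
```

The benefit depends on the versions failing independently; if the teams make the same mistake, the vote cannot detect it.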
Recovery Block Approach
Each block contains at least a primary,
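The slide text is cut off here. The sketch below assumes the usual recovery block structure: a primary routine, one or more alternates, and an acceptance test that decides whether a result can be used. All names are hypothetical, and the sorting task is only an example.

```python
import copy


def recovery_block(data, acceptance_test, primary, *alternates):
    """Try the primary, then each alternate, until a result passes the test."""
    for routine in (primary, *alternates):
        working_copy = copy.deepcopy(data)  # every attempt starts from the saved input
        try:
            result = routine(working_copy)
        except Exception:
            continue  # a crash counts as a failed attempt
        if acceptance_test(data, result):
            return result
    raise RuntimeError("primary and all alternates failed the acceptance test")


def faulty_sort(items):
    """Primary with a simulated design fault: it reverses instead of sorting."""
    items.reverse()
    return items


def is_sorted_same_length(original, result):
    """Acceptance test: same length and non-decreasing order."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(original) and in_order
```

For example, recovery_block([3, 1, 2], is_sorted_same_length, faulty_sort, sorted) rejects the primary's output and returns [1, 2, 3] from the alternate.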
Time Redundancy
Achieves fault tolerance by performing an operation
several times.
Timeouts and retransmissions in reliable point-to-point and group communication are examples of
time redundancy.
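A minimal sketch of retransmission on timeout, assuming a send function that raises TimeoutError when no acknowledgement arrives. The FlakyChannel class only simulates a channel that loses the first few messages; both names are hypothetical.

```python
def send_with_retries(send, message, max_attempts=3):
    """Repeat the send until it is acknowledged or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(message), attempt
        except TimeoutError:
            continue  # repeat the same operation: redundancy in time
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


class FlakyChannel:
    """Simulated channel that times out a fixed number of times, then works."""

    def __init__(self, failures):
        self.remaining_failures = failures

    def send(self, message):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise TimeoutError("acknowledgement timed out")
        return f"ACK:{message}"
```

Repeating the operation masks transient faults such as a lost message; a permanent fault would exhaust the attempts and surface as an error.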
1. Recovery Points
Information Redundancy
The basic idea of information redundancy is to provide
more information than is strictly necessary and to use
that extra information to check for errors.
The conditions (a) and (b) are at the basis of all coding
theory.
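A single even-parity bit is perhaps the simplest instance of information redundancy: one extra bit is stored with the data and used to check it. The sketch below assumes data is given as a list of 0/1 integers; the function names are illustrative.

```python
def add_parity(bits):
    """Append an even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def check_parity(word):
    """Return True if the word (data bits plus parity bit) has even parity."""
    return sum(word) % 2 == 0


def flip_bit(word, position):
    """Return a copy of the word with one bit inverted (simulated error)."""
    corrupted = list(word)
    corrupted[position] ^= 1
    return corrupted
```

A single flipped bit makes check_parity return False. Two flipped bits cancel out and go undetected, which is why stronger codes add more check bits.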
Data diversity
Data diversity is an approach that can be used in
association with any of the redundancy techniques
considered above.
Sometimes, hardware or software may fail for certain
inputs, but not for other inputs that are very close to
them.
So, instead of applying the same input data to the
redundant processors, we apply slightly different input
data to them.
Thus we have in some cases another line of defense
against failure.
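The idea can be sketched as follows, assuming a routine with a simulated fault at one exact input value and a result that changes little under small input changes. The function names and the offsets are assumptions made for this example.

```python
def fragile_square(x):
    """Computes x * x but fails for one specific input (simulated fault)."""
    if x == 2.0:
        raise ArithmeticError("simulated input-specific fault")
    return x * x


def run_with_data_diversity(func, x, offsets=(0.0, 1e-6, -1e-6)):
    """Apply slightly different inputs and return a middle value of the results."""
    results = []
    for offset in offsets:
        try:
            results.append(func(x + offset))
        except Exception:
            continue  # this input variant hit the fault; rely on the others
    if not results:
        raise RuntimeError("every input variant failed")
    results.sort()
    return results[len(results) // 2]  # upper median when the count is even
```

For x = 2.0 the unperturbed call fails, but the perturbed calls still yield a value within about 1e-5 of 4.0.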
This approach will only work if the sensitivity of the
Reversal Checks
Introduction
If there is a simple relationship between the inputs and
outputs of a system, it may then be possible to
calculate the inputs given the outputs.
This can then be compared with the actual inputs as a
check.
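For instance, a square root can be checked by squaring the result and comparing it with the original input. The sketch below assumes a forward function with a known inverse and a small numeric tolerance; reversal_check is a hypothetical helper name.

```python
import math


def reversal_check(forward, inverse, x, rel_tol=1e-9):
    """Compute forward(x), recompute the input from the output, and compare."""
    y = forward(x)
    recovered = inverse(y)
    if not math.isclose(recovered, x, rel_tol=rel_tol, abs_tol=1e-12):
        raise ArithmeticError(f"reversal check failed: recovered {recovered}, expected {x}")
    return y


def square(y):
    """Inverse of the square root for non-negative values."""
    return y * y


def faulty_sqrt(x):
    """Square root with a simulated error in the output."""
    return math.sqrt(x) + 0.01
```

reversal_check(math.sqrt, square, 2.0) passes, while reversal_check(faulty_sqrt, square, 2.0) raises ArithmeticError. The check is only worthwhile when the inverse is much cheaper or simpler than the forward computation.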
Network fault tolerance:
Includes agreement protocols
A single checkpoint buffer is maintained per multithreaded ARMOR process.
The element state is checkpointed after each operation.
Checkpoints are committed to stable storage after
processing a message.
There is no need to do process-wide checkpoints of
stacks, heap,
The existing locking policy of element data removes
the need to suspend all threads.
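The scheme described above can be sketched roughly as follows. This is not ARMOR's actual implementation, which the slides do not show; the class names, the JSON file used as stable storage, and the per-element lock are all assumptions made for illustration.

```python
import json
import os
import threading


class Element:
    """A unit of state protected by its own lock (assumed locking policy)."""

    def __init__(self, name):
        self.name = name
        self.state = {}
        self.lock = threading.Lock()


class CheckpointedProcess:
    """Keeps one checkpoint buffer for the whole multithreaded process."""

    def __init__(self, elements, path):
        self.elements = {element.name: element for element in elements}
        self.path = path
        self.buffer = {}  # the single checkpoint buffer
        self.buffer_lock = threading.Lock()

    def apply(self, element_name, key, value):
        """Perform one operation and checkpoint only the touched element."""
        element = self.elements[element_name]
        with element.lock:  # other threads and elements are not suspended
            element.state[key] = value
            with self.buffer_lock:
                self.buffer[element_name] = dict(element.state)

    def commit(self):
        """Write the buffer to stable storage, replacing the old file atomically."""
        with self.buffer_lock:
            data = json.dumps(self.buffer)
        temp_path = self.path + ".tmp"
        with open(temp_path, "w") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(temp_path, self.path)

    def process_message(self, operations):
        """Apply every operation in a message, then commit the checkpoint."""
        for element_name, key, value in operations:
            self.apply(element_name, key, value)
        self.commit()
```

Because only the element touched by an operation is locked and copied, no stack or heap snapshot of the whole process is taken.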
THANK YOU