

Fault Tolerance

Error models
Error detection techniques
Error Detection, Damage Confinement, Error control coding.
Error Recovery, Fault Treatment, Fault Prevention, anticipated and unanticipated Faults.
Roll No: 15


A fault is some physical defect that can cause a component to malfunction.

Fault can be:

- Hardware fault (e.g. logical wire)
- Software fault (e.g. bug)

Real Time and Fault Tolerance

Fault is a defect within the system

Software bug
Random hardware fault
Memory bit stuck

Error is a deviation from the required operation of a system or subsystem.
A fault may lead to an error, i.e., an error is a mechanism by which the fault becomes apparent.

A fault may stay dormant for a long time before it manifests itself as an error:

- A memory bit got stuck but the CPU does not access this memory
- A software bug in a subroutine is not visible while the subroutine is not called

A system failure occurs when the system fails to perform its required function.

The presence of an error might cause a whole system to deviate from its required operation.

One of the goals of safety-critical systems is that an error should not result in system failure.

Need of fault tolerance

Complex system
Critical applications
Harsh Environment

All fault-tolerant techniques rely on extra elements introduced into the system to detect and recover from faults.

Components are redundant as they are not required in a perfect system.

Often called protective redundancy.

Aim: minimise redundancy while maximising reliability, subject to the cost and size constraints of the system.

Warning: the added components inevitably increase the complexity of the overall system.

This itself can lead to less reliable systems.

E.g. first launch of the space shuttle

It is advisable to separate out the fault-tolerant components from the rest of the system.

Hardware Redundancy

Two types: static (or masking) and dynamic

Static: redundant components are used inside a system to hide the effects of faults; e.g. Triple Modular Redundancy (TMR): 3 identical subcomponents and majority voting circuits; the outputs are compared and if one differs from the other two, that output is masked out.

Dynamic: redundancy supplied inside a component which indicates that the output is in error; provides an error detection facility; recovery must be provided by another component.

E.g. communications checksums and memory parity
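
The majority voting used in TMR can be sketched in software as follows. This is a minimal illustration, not a hardware voter design: `tmr_vote`, `module` and `faulty_module` are hypothetical names, and the modules are assumed to return directly comparable values.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the value produced by at least two of the three modules."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three outputs differ: a single-fault voter cannot mask this.
        raise RuntimeError("no majority among the three module outputs")
    return value


def module(x):
    # Hypothetical correct subcomponent.
    return x * x


def faulty_module(x):
    # Hypothetical subcomponent with a fault that corrupts its output.
    return x * x + 1


# One faulty output out of three is masked by the vote.
result = tmr_vote(module(4), faulty_module(4), module(4))
```

The voter masks a single faulty module; if two modules fail in different ways there is no majority, which is why the sketch raises an error instead of guessing.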

Software Redundancy

The system is provided with different software versions of a task, written independently.

If one version of a task fails under certain input, another version can be used.

- N-version programming
- Recovery blocks

Software Redundancy: dynamic type

Four phases

1. Error detection: no fault tolerance scheme can be utilised until the associated error is detected
2. Damage confinement and assessment: to what extent has the system been corrupted? The delay between a fault occurring and the detection of the error means erroneous information could have spread throughout the system
3. Error recovery: techniques should aim to transform the corrupted system into a state from which it can continue its normal operation
4. Fault treatment and continued service: an error is a symptom of a fault; although the damage is repaired, the fault may still exist

Information Redundancy

The data are coded in such a way that a certain number of bit errors can be detected or corrected.

Error Detecting Technique

Finding the error in the first place

- Parity checking
- Checksum error detection
- Cyclic redundancy check (CRC)
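
The three techniques can be compared on a small message. This is a sketch under assumptions: `even_parity_bit` and `checksum8` are hypothetical helpers written for illustration (a single parity bit over the whole message and a simple modulo-256 sum), while the CRC uses Python's standard `zlib.crc32`.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def checksum8(data: bytes) -> int:
    """Simple 8-bit checksum: sum of all bytes modulo 256."""
    return sum(data) % 256


message = b"fault tolerance"
parity = even_parity_bit(message)
check = checksum8(message)
crc = zlib.crc32(message)

# Flip one bit in the first byte: all three checks notice the change.
one_bit_error = bytes([message[0] ^ 0x01]) + message[1:]

# Flip one bit in each of the first two bytes: the parity bit is unchanged,
# so a single parity bit misses this error.
two_bit_error = bytes([message[0] ^ 0x01, message[1] ^ 0x02]) + message[2:]
```

A single parity bit detects any odd number of bit errors but no even number, which is one reason checksums and CRCs are used when more detection capability is needed.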

Error Detection

- Environmental detection
  - hardware, e.g. illegal instruction
  - O.S./RTS, e.g. null pointer
- Application detection
  - Replication checks
  - Timing checks
  - Reversal checks
  - Coding checks
  - Reasonableness checks
- Other types: heartbeats etc.

Damage Confinement and Assessment

Damage assessment is closely related to the damage confinement techniques used.
Damage confinement is concerned with structuring the system so as to minimise the damage caused by a faulty component (also known as firewalling).
Modular decomposition provides static damage confinement; it allows data to flow through well-defined pathways.
Atomic actions provide dynamic damage confinement; they are used to move the system from one consistent state to another.

Error Recovery

Probably the most important phase of any fault-tolerance technique

Two approaches:
1. Forward Recovery
2. Backward Recovery

Forward Recovery
Assessing and removing errors completely.
Forward error recovery continues from an erroneous state by making selective corrections to the system.
This includes making safe the controlled environment, which may be hazardous or damaged because of the failure.
It is system specific and depends on accurate predictions of the location and cause of errors (i.e., damage assessment).
E.g. redundant pointers in data structures
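
The redundant-pointer example can be sketched with a doubly linked list, where the backward links are the redundancy used to repair a damaged forward link. This is an illustrative sketch: `Node`, `build`, `repair_forward_links` and `to_list` are hypothetical names, and the repair assumes (as a damage assessment) that only forward links are corrupted while the backward chain is intact.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None   # forward link
        self.prev = None   # redundant backward link


def build(values):
    """Build a doubly linked list and return (head, tail)."""
    nodes = [Node(v) for v in values]
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left
    return nodes[0], nodes[-1]


def repair_forward_links(tail):
    """Rebuild every forward link from the backward chain; return the head."""
    node = tail
    while node.prev is not None:
        node.prev.next = node
        node = node.prev
    return node


def to_list(head):
    """Collect the values reachable through forward links."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


head, tail = build([1, 2, 3, 4])
head.next.next = None                 # simulated corruption of a forward link
damaged = to_list(head)               # the list now appears truncated
repaired = to_list(repair_forward_links(tail))
```

The correction is selective: execution continues from the erroneous state and only the damaged links are rewritten, with no rollback to an earlier state.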

Backward Recovery

Backward error recovery (BER) relies on restoring the system to a previous safe state and executing an alternative section of the program.

This has the same functionality but uses a different algorithm (c.f. N-Version Programming) and therefore, it is hoped, no fault.

The point to which a process is restored is called a recovery point, and the act of establishing it is termed checkpointing (saving appropriate system state).

Advantage: the erroneous state is cleared and it does not rely on finding the location or cause of the fault; it can therefore be used to recover from unanticipated faults, including design errors.

Disadvantage: it cannot undo errors in the environment!

Fault Treatment
Error recovery (ER) returned the system to an error-free state; however, the error may recur; the final phase of fault tolerance (F.T.) is to eradicate the fault from the system.

The automatic treatment of faults is difficult and system specific.

Some systems assume all faults are transient; others that error recovery techniques can cope with recurring faults.

Fault treatment can be divided into 2 stages:

1. fault location
2. system repair

Error detection techniques can help to trace the fault to a component. For hardware, the component can be replaced.

A software fault can be removed in a new version of the code.

In non-stop applications it will be necessary to modify the program while it is executing!

The Recovery Block approach to FT

Language support for BER
At the entrance to a block is an automatic recovery point and at the exit an acceptance test.
The acceptance test is used to test that the system is in an acceptable state after the block's execution (primary module).
If the acceptance test fails, the program is restored to the recovery point at the beginning of the block and an alternative module is executed.

If the alternative module also fails the acceptance test, the program is restored to the recovery point and yet another module is executed, and so on.

If all modules fail then the block fails and recovery must take place at a higher level.
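
The recovery block scheme can be emulated in a language without built-in support. The sketch below is one possible emulation, not a language feature: `recovery_block`, `RecoveryBlockFailure` and the sorting modules are hypothetical, and the recovery point is assumed to be a deep copy of a state dictionary.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when every module fails, so recovery must happen at a higher level."""


def recovery_block(state, modules, acceptance_test):
    checkpoint = copy.deepcopy(state)          # recovery point at block entry
    for module in modules:                     # primary first, then alternatives
        candidate = copy.deepcopy(checkpoint)  # restore state before each attempt
        try:
            module(candidate)
        except Exception:
            continue                           # a crash counts as a failed attempt
        if acceptance_test(candidate):
            return candidate
    raise RecoveryBlockFailure("all modules failed the acceptance test")


def buggy_sort(state):
    # Primary module with a deliberate design fault: reverses instead of sorting.
    state["data"].reverse()


def simple_sort(state):
    # Alternative module: same functionality, different algorithm.
    state["data"] = sorted(state["data"])


def is_sorted(state):
    data = state["data"]
    return all(a <= b for a, b in zip(data, data[1:]))


result = recovery_block({"data": [3, 1, 2]}, [buggy_sort, simple_sort], is_sorted)
```

The primary module fails the acceptance test, the state is restored from the checkpoint, and the alternative succeeds. Note that this acceptance test only checks ordering; a stronger test would also confirm that the output is a permutation of the input.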


Error Control Coding


- The channel is noisy
- The channel output is prone to error
- We need measures to ensure correctness of the bit stream transmitted

Error control coding aims at developing methods for coding to check the correctness of the bit stream transmitted.

The bit stream representation of a symbol is called the codeword of that symbol.

Different error control mechanisms:

- Linear Block Codes
- Repetition Codes
- Convolutional Codes

Linear Block Codes

A code is linear if adding any two codewords using modulo-2 arithmetic produces a third codeword in the code.

Consider an (n, k) linear block code. Here,

1. n represents the codeword length
2. k is the number of message bits
3. n - k bits are error control bits or parity check bits generated from the message using an appropriate rule.

We may therefore represent the codeword as


Repetition Codes
This is the simplest of linear block codes.

A single message bit is encoded into a block of n identical bits, producing an (n, 1) block code.
This code allows a variable amount of redundancy.
It has only two codewords: the all-zero codeword and the all-one codeword.

Consider a linear block code which is also a repetition
code. Let
k = 1 and n = 5. From the analysis done in linear block


The parity check matrix takes the form
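
The original matrix is not reproduced here. As an illustration for the (5, 1) repetition code, the sketch below assumes one common systematic layout (four parity bits followed by the message bit), which gives a parity check matrix of the form H = [I_4 | column of ones]; the layout and the names `encode`, `decode` and `syndrome` are assumptions, not taken from the slides.

```python
N = 5  # codeword length of the (5, 1) repetition code


def encode(bit):
    """Repeat the single message bit N times."""
    return [bit] * N


def decode(received):
    """Majority decision over the received bits."""
    return 1 if sum(received) > N // 2 else 0


# Assumed layout: H = [I_4 | 1], one row per parity check.
# Row i checks that parity bit i equals the message bit (last position).
H = [[1 if j == i else 0 for j in range(N - 1)] + [1] for i in range(N - 1)]


def syndrome(word):
    """Compute H * word^T using modulo-2 arithmetic."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]


codeword = encode(1)
received = [1, 1, 0, 1, 1]   # one bit flipped by the channel
```

Both codewords give an all-zero syndrome, while the received word with one flipped bit gives a nonzero syndrome and is still decoded to 1 by the majority decision.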


Hamming Distance
Improves traditional measures by

Hamming weight, w(c), is defined as the number of nonzero elements in a code vector.

Hamming distance, d(c1, c2), between two codewords c1 and c2 is defined as the number of bits in which they differ.

Minimum distance, dmin, is the minimum Hamming distance between any two distinct codewords.
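
These three definitions translate directly into code. A minimal sketch, with codewords assumed to be tuples of bits and hypothetical function names:

```python
from itertools import combinations


def hamming_weight(c):
    """Number of nonzero elements in the code vector."""
    return sum(1 for bit in c if bit != 0)


def hamming_distance(c1, c2):
    """Number of positions in which two equal-length codewords differ."""
    if len(c1) != len(c2):
        raise ValueError("codewords must have the same length")
    return sum(1 for a, b in zip(c1, c2) if a != b)


def minimum_distance(code):
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


repetition_5 = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1)]
even_parity_3 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

d_rep = minimum_distance(repetition_5)
d_par = minimum_distance(even_parity_3)
```

For the (5, 1) repetition code the two codewords differ in all five positions, so its minimum distance is 5; for the 3-bit even-parity code it is 2. A standard result is that a code with minimum distance dmin can detect up to dmin - 1 bit errors and correct up to floor((dmin - 1) / 2).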

Watchdog processors
Error detection technique:

A watchdog processor (WP) is a relatively small and simple coprocessor used to perform concurrent system-level error detection by monitoring the behavior of a main processor.

The general system architecture is shown in Figure.

The watchdog is provided with some information about the state of the processor or process to be checked on the system (instruction and/or data).

Errors detected by the WP are signaled towards the checked processor or any external supervisory unit responsible for error treatment and recovery.

Watchdog Processor


Watchdog Timer
An inexpensive method of error detection.

The process being watched must reset the timer before the timer expires, otherwise the watched process is assumed to be faulty.

It detects only errors that manifest themselves as a control-flow error such that the system does not continue to reset the timer.

Only processes with relatively deterministic runtimes can be checked, since the error detection is based entirely on the time between timer resets.
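
A software analogue of a watchdog timer can be sketched as below. `WatchdogTimer` is a hypothetical class written for illustration; real watchdogs are usually hardware timers that trigger a reset or interrupt, whereas this sketch only reports expiry when polled. The clock is injectable so the behaviour can be exercised without waiting.

```python
import time


class WatchdogTimer:
    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout          # maximum allowed time between resets
        self.clock = clock              # time source, replaceable for testing
        self.last_reset = clock()

    def reset(self):
        """Called by the watched process to show it is still making progress."""
        self.last_reset = self.clock()

    def expired(self):
        """True if the watched process failed to reset the timer in time."""
        return self.clock() - self.last_reset > self.timeout


# Simulated time: the watched process resets at t=0.5 and then stops.
now = [0.0]
watchdog = WatchdogTimer(timeout=1.0, clock=lambda: now[0])

now[0] = 0.5
watchdog.reset()
now[0] = 1.4
healthy = not watchdog.expired()   # 0.9 s since last reset: still within timeout
now[0] = 1.6
faulty = watchdog.expired()        # 1.1 s since last reset: assumed faulty
```

The check depends only on the time between resets, so a process that keeps resetting the timer while computing wrong results is not detected.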


Watchdog Timer Application

GUI power off
Temperature control
Telephone switch
Structural integrity check



Heartbeats

A common approach to detecting process and node failures in a distributed (networked) computing environment.

Periodically, a monitoring entity sends a message (a heartbeat) to a monitored node or process and waits for a reply.

If the monitored node does not respond within a predefined timeout interval, the node is declared failed.
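
One round of heartbeat monitoring can be sketched as follows. The transport is deliberately left abstract: `probe` is a hypothetical callable that sends a heartbeat and returns True if a reply arrived within the timeout, and `fake_probe` with its made-up reply delays stands in for a real network.

```python
def check_nodes(nodes, probe, timeout):
    """Send one heartbeat to each node; return the set of nodes declared failed."""
    failed = set()
    for node in nodes:
        try:
            replied = probe(node, timeout)
        except (TimeoutError, ConnectionError):
            replied = False          # no reply counts as a missed heartbeat
        if not replied:
            failed.add(node)
    return failed


def fake_probe(node, timeout):
    """Stand-in for a network round trip, using made-up reply delays in seconds."""
    reply_delay = {"node-a": 0.1, "node-b": 5.0}
    return reply_delay[node] <= timeout


failed_nodes = check_nodes(["node-a", "node-b"], fake_probe, timeout=1.0)
```

With a fixed timeout of one second, the slow node is declared failed even though it would eventually reply, which illustrates why the choice of timeout matters.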

Heartbeats: Issues
The timeout period is pre-negotiated by the two parties.

The predefined timeout cannot adapt to changes in network traffic or to load variability on individual nodes.

The monitored node is assumed to be healthy if it is able to respond to a heartbeat message.

The process/thread responding to the heartbeat message may operate correctly, while other processes/threads may be in a deadlock situation or operating incorrectly.

Consistency and Capability Checking

Capability Checking

- can be implemented as a hardware mechanism or can be part of the operating system (usually the case)
- access to objects (memory segments, I/O devices) is limited to users (processors or processes) with the proper authorization
- capability check: permission vs. activity; if these are not valid, there is an error trap
- password checking

Consistency Checks
- range check: confirms that a computed value is in a valid range, e.g. a computed probability must be in the range 0 to 1
- address checking: verifies that the address to be accessed exists
- opcode checking: checks whether the instruction to be executed has one of the defined (documented) opcodes
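
The three consistency checks can be sketched as simple guard functions. The memory size and opcode set below are made-up values for a hypothetical machine, and `ConsistencyError` is an illustrative exception type.

```python
MEMORY_SIZE = 1024                    # hypothetical address space, in words
VALID_OPCODES = {0x01, 0x02, 0x10}    # hypothetical documented opcodes


class ConsistencyError(Exception):
    """Raised when a consistency check fails."""


def range_check(probability):
    """A computed probability must lie in the range 0 to 1."""
    if not 0.0 <= probability <= 1.0:
        raise ConsistencyError(f"probability out of range: {probability}")
    return probability


def address_check(address):
    """The address to be accessed must exist."""
    if not 0 <= address < MEMORY_SIZE:
        raise ConsistencyError(f"address does not exist: {address}")
    return address


def opcode_check(opcode):
    """The instruction to be executed must have a defined opcode."""
    if opcode not in VALID_OPCODES:
        raise ConsistencyError(f"undefined opcode: {opcode:#04x}")
    return opcode
```

Each check passes a valid value through unchanged, so it can be wrapped around an existing computation, for example `p = range_check(compute_probability())` where `compute_probability` is whatever routine produced the value.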

Data Audits

Widely used in the telecommunications industry

A broad range of custom and ad hoc application-level

techniques for detecting and recovering from errors in a
switching environment (in particular in a database).

Data-specific techniques deeply embedded in the

application can provide significant improvement in

Static and Dynamic Data Check

Corruption in the static data region is detected by computing a golden checksum of all static data at startup and comparing it with a periodically computed checksum (e.g., Cyclic Redundancy Code).

For dynamic data, the range of allowable values for database fields is often stored in the database system catalog. This information is used to perform a range check on the dynamic fields in the database.
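
Both audits can be sketched in a few lines. This is an illustration under assumptions: the static region is modelled as a byte array, the checksum is CRC-32 from Python's `zlib`, and `CATALOG` with its field names and ranges is a made-up stand-in for a database system catalog.

```python
import zlib


class StaticDataAudit:
    def __init__(self, static_region):
        # Golden checksum computed once, at startup.
        self.golden = zlib.crc32(bytes(static_region))

    def is_intact(self, static_region):
        """Periodic audit: recompute the checksum and compare with the golden one."""
        return zlib.crc32(bytes(static_region)) == self.golden


# Hypothetical catalog: allowable (low, high) range for each dynamic field.
CATALOG = {"channel": (0, 63), "priority": (1, 5)}


def out_of_range_fields(record):
    """Return the names of fields whose values violate the catalog ranges."""
    return [name for name, (low, high) in CATALOG.items()
            if not low <= record[name] <= high]


region = bytearray(b"routing table v1")
audit = StaticDataAudit(region)
intact_before = audit.is_intact(region)
region[3] ^= 0x04                      # simulated single-bit corruption
intact_after = audit.is_intact(region)

bad_fields = out_of_range_fields({"channel": 70, "priority": 3})
```

The static audit only says that something in the region changed, not where; the range check on dynamic data points at specific fields but cannot detect a corrupted value that still falls inside its allowed range.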


Semantic Referential Integrity Check

1. Traces logical relationships among records in

different tables to verify the consistency of the

logical loops formed by the record(s)
2. Detects resource leaks
3. Corruption of key attributes in a database leads to

lost records, i.e., records participating in semantic

relationships disappear without being properly

Data Audits: Structural Checks

The structure of the database is established by header fields that precede the data portion in every record of each table.
Structural audit calculates the offset of each record header from the beginning of the database based on record sizes stored in system tables (all record sizes are fixed and known).
The database structure (in particular, the alignment of each record and table within the database) is checked by comparing the header fields at the computed offsets with expected values.


Generate runtime assertions by monitoring the values of selected variables in a program.

Use the monitored data to abstract out, via statistical pattern recognition techniques, the key relationships between the variables, separately and jointly, and to establish their probabilistic behavior.
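
As a much simplified stand-in for the statistical techniques mentioned above, the sketch below derives an assertion for a single variable from its observed mean and standard deviation. The function name `learn_assertion`, the threshold `k` and the sample values are assumptions for illustration; it covers only the "separately" case and does not model joint relationships between variables.

```python
import statistics


def learn_assertion(samples, k=3.0):
    """Build a check that accepts values within k standard deviations of the mean."""
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    low, high = mean - k * spread, mean + k * spread
    return lambda value: low <= value <= high


# Values of one monitored variable collected during fault-free runs (made up).
observed_temperatures = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2]
temperature_ok = learn_assertion(observed_temperatures)

normal = temperature_ok(20.3)      # close to the observed behavior
suspicious = temperature_ok(35.0)  # far outside the observed behavior
```

The generated check can then be evaluated at run time wherever the variable is updated; a value that fails it is treated as a possible error rather than as proof of one.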


Control-flow Monitoring Using Signatures


Hardware Approaches
Software Approaches


Hardware Approaches
- Embedded Signature Monitoring
  - Pre-computed signature embedded in the application
  - Recompilation of existing programs
  - Performance degradation of application
- Autonomous Signature Monitoring
  - Watchdog processor stores pre-computed signature in memory and mimics the control flow of application
  - Watchdog processor rather complex
  - High memory overhead

Software Approaches
Software techniques partition the application into blocks, either in the assembly language or in the high-level language.
Appropriate instrumentation is inserted at the beginning and/or end of the blocks.
The checking code is inserted in the instruction stream, eliminating the need for a hardware watchdog.
Two classes of approaches:
- non-preemptive signature checking
- preemptive signature checking
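
The idea of instrumenting block entries can be sketched with an XOR-based runtime signature. This is a generic illustration and not the specific scheme behind either class named above: the control-flow graph, the signatures and the `SignatureChecker` class are made up, and each block is assumed to have a single legal predecessor.

```python
# Hypothetical control-flow graph: each block and its single legal predecessor.
PREDECESSOR = {"B1": None, "B2": "B1", "B3": "B2"}
# Arbitrary unique signatures assigned to the blocks ahead of time.
SIGNATURE = {"B1": 0b0001, "B2": 0b0010, "B3": 0b0100}


class ControlFlowError(Exception):
    """Raised when a block is entered from an illegal predecessor."""


class SignatureChecker:
    def __init__(self):
        self.g = 0   # runtime signature register, 0 before the first block

    def enter(self, block):
        """Instrumentation executed at the beginning of each block."""
        pred = PREDECESSOR[block]
        pred_signature = SIGNATURE[pred] if pred is not None else 0
        # XOR with the precomputed difference between predecessor and block.
        self.g ^= pred_signature ^ SIGNATURE[block]
        if self.g != SIGNATURE[block]:
            raise ControlFlowError(f"illegal jump into {block}")


legal = SignatureChecker()
for name in ("B1", "B2", "B3"):
    legal.enter(name)              # follows the graph, so no error is raised

illegal = SignatureChecker()
illegal.enter("B1")
try:
    illegal.enter("B3")            # skips B2: the signature no longer matches
    detected = False
except ControlFlowError:
    detected = True
```

When control arrives from the legal predecessor, the register ends up equal to the block's own signature; any other path leaves a mismatching value, which the inserted check detects without a hardware watchdog.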
