
UNIT 2

Fault Tolerance
- Fault, Error, Failure
- Error models
- Redundancy
- Error detection techniques
- Error Detection
- Watchdog
- Damage Confinement
- Error control coding
- Error Recovery
- Fault Treatment
- Fault Prevention
- Anticipated and unanticipated faults
Roll No: 15

Fault
Definition

A fault is a defect, physical or in the design, that can cause a component to malfunction.

A fault can be:

- Hardware fault (e.g. logical wire)
- Software fault (e.g. bug)

Real Time and Fault Tolerance

Fault:
Example
A fault is a defect within the system.

Examples:
- Software bug
- Random hardware fault
- Memory bit stuck

Error
Definition
- An error is a deviation from the required operation of a system or subsystem.
- A fault may lead to an error, i.e. an error is the mechanism by which the fault becomes apparent.
- A fault may stay dormant for a long time before it manifests itself as an error.

Error:
Example
- A memory bit is stuck, but the fault stays dormant while the CPU does not access this data; it becomes an error only when the data is read.
- A software bug in a subroutine is not visible while the subroutine is not called.

Failure:
Definition
- A system failure occurs when the system fails to perform its required function.
- The presence of an error might cause the whole system to deviate from its required operation.
- One of the goals of safety-critical systems is that an error should not result in system failure.
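
The fault and error definitions can be illustrated with a toy model. This is only a sketch: the `Memory` class and the stuck bit position are hypothetical. A stuck memory bit (the fault) stays dormant until the affected cell is read, at which point the value deviates from what was written (the error).

```python
class Memory:
    """Toy memory in which bit 0 of one cell is stuck at 1 (the fault)."""

    def __init__(self, size, stuck_address):
        self.cells = [0] * size
        self.stuck_address = stuck_address

    def write(self, address, value):
        self.cells[address] = value

    def read(self, address):
        value = self.cells[address]
        if address == self.stuck_address:
            value |= 1  # the fault becomes an error only when this cell is read
        return value


mem = Memory(size=4, stuck_address=2)
mem.write(0, 10)
mem.write(2, 10)

dormant = mem.read(0)    # fault exists but is dormant: the value is correct
erroneous = mem.read(2)  # fault activated: the value read differs from 10
```

Whether this error becomes a failure depends on whether the system detects it before the wrong value affects its required function.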

Need for fault tolerance

- Complex systems
- Critical applications
- Harsh environments

Redundancy
- All fault-tolerant techniques rely on extra elements introduced into the system to detect and recover from faults.
- These components are redundant in that they would not be required in a perfect system.
- Often called protective redundancy.

- Aim: minimise redundancy while maximising reliability, subject to the cost and size constraints of the system.
- Warning: the added components inevitably increase the complexity of the overall system.
- This itself can lead to less reliable systems.
- E.g. the first launch of the space shuttle.
- It is advisable to separate out the fault-tolerant components from the rest of the system.

Hardware Redundancy
Two types: static (or masking) and dynamic redundancy.
- Static: redundant components are used inside a system to hide the effects of faults, e.g. Triple Modular Redundancy (TMR).
- TMR: 3 identical subcomponents and majority voting circuits; the outputs are compared and, if one differs from the other two, that output is masked out.
- Dynamic: redundancy supplied inside a component which indicates that the output is in error; it provides an error detection facility, and recovery must be provided by another component.
- E.g. communications checksums and memory parity bits.
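
The TMR majority vote can be sketched in software as follows. This is an illustration only: real TMR voters are hardware circuits, and the replica functions `good` and `faulty` are hypothetical.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority output of three replicas; raise if all disagree."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three outputs differ")
    return value


def good(x):
    return x * x


def faulty(x):
    return x * x + 1  # hypothetical faulty replica


# The single differing output is masked out by the vote.
result = tmr_vote(good(3), faulty(3), good(3))
```

The voter masks one faulty replica; if two replicas fail in the same way, the wrong value wins the vote.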

Software Redundancy
- The system is provided with different software versions of a task.
- The versions are written independently by different teams of programmers.
- If one version of the task fails under a certain input, another version can be used.
- Static: N-version programming
- Dynamic: recovery blocks

Software Redundancy: dynamic type

Four phases:
1. Error detection: no fault tolerance scheme can be utilised until the associated error is detected.
2. Damage confinement and assessment: to what extent has the system been corrupted? The delay between a fault occurring and the detection of the error means erroneous information could have spread throughout the system.
3. Error recovery: techniques should aim to transform the corrupted system into a state from which it can continue its normal operation.
4. Fault treatment and continued service: an error is a symptom of a fault; although the damage is repaired, the error may recur until the fault itself is treated.

Information Redundancy

The data are coded in such a way that a certain number of bit errors can be detected or corrected.

Error Detecting Techniques

Finding the error in the first place:

- Parity checking
- Checksum error detection
- Cyclic Redundancy Check (CRC)
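
The three techniques above can be compared on a small message. This is a minimal sketch: the 8-bit additive checksum is one simple checksum among many, and the CRC uses the CRC-32 from Python's standard `zlib` module; the message and the helper names are assumptions for illustration.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def checksum8(data: bytes) -> int:
    """Simple 8-bit additive checksum (sum of bytes modulo 256)."""
    return sum(data) % 256


message = b"fault"
sent_parity = even_parity_bit(message)
sent_checksum = checksum8(message)
sent_crc = zlib.crc32(message)

# Flip a single bit "in transit": all three checks notice the change.
corrupted = bytes([message[0] ^ 0x01]) + message[1:]
parity_detects = even_parity_bit(corrupted) != sent_parity
checksum_detects = checksum8(corrupted) != sent_checksum
crc_detects = zlib.crc32(corrupted) != sent_crc

# Flip two bits: the parity bit is unchanged, so parity alone misses it.
double_error = bytes([message[0] ^ 0x03]) + message[1:]
parity_misses_double = even_parity_bit(double_error) == sent_parity
```

A single parity bit detects any odd number of bit errors but no even number, which is why checksums and CRCs are used when stronger detection is needed.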

Error Detection
Types
- Environmental detection
  - Hardware, e.g. illegal instruction
  - O.S./RTS, e.g. null pointer
- Application detection
  - Replication checks
  - Timing checks
  - Reversal checks
  - Coding checks
  - Reasonableness checks
- Other types: heartbeats etc.
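
Two of the application-level checks can be sketched briefly, assuming their usual meaning: a reversal check recomputes the input from the output and compares, and a reasonableness check tests a value against what is known about the system. The function names and the 0 to 300 km/h limit are hypothetical.

```python
import math


def checked_sqrt(x: float, tolerance: float = 1e-9) -> float:
    """Compute sqrt(x) and verify the result with a reversal check."""
    result = math.sqrt(x)
    # Reversal check: square the output and compare it with the input.
    if not math.isclose(result * result, x, rel_tol=tolerance, abs_tol=tolerance):
        raise ArithmeticError("reversal check failed")
    return result


def reasonable_speed(speed_kmh: float) -> bool:
    """Reasonableness check: a vehicle speed must lie in 0..300 km/h."""
    return 0.0 <= speed_kmh <= 300.0
```

Reversal checks only apply where the relationship between input and output can be inverted cheaply.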

Damage Confinement and Assessment

Introduction
- Damage assessment is closely related to the damage confinement techniques used.
- Damage confinement is concerned with structuring the system so as to minimise the damage caused by a faulty component (also known as firewalling).
- Modular decomposition provides static damage confinement; it allows data to flow only through well-defined pathways.
- Atomic actions provide dynamic damage confinement; they are used to move the system from one consistent state to another.

Error Recovery
Introduction

Probably the most important phase of any fault-tolerance technique.

Two approaches:
1. Forward Recovery
2. Backward Recovery

Forward Recovery
- Assessing and removing errors completely.
- Forward error recovery continues from an erroneous state by making selective corrections to the system state.
- This includes making safe the controlled environment, which may be hazardous or damaged because of the failure.
- It is system specific and depends on accurate predictions of the location and cause of errors (i.e. damage assessment).
- E.g. redundant pointers in data structures.
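
The redundant-pointer example can be sketched with a doubly linked list in which the `prev` links are kept as redundancy. This is an illustration under stated assumptions: only `next` links are damaged, the `prev` links and the tail node are intact, and all names are hypothetical.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None  # redundant pointer, used only for repair


def build(values):
    nodes = [Node(v) for v in values]
    for a, b in zip(nodes, nodes[1:]):
        a.next, b.prev = b, a
    return nodes


def repair_next_pointers(tail):
    """Forward recovery: rebuild every `next` link from the `prev` links."""
    node = tail
    while node.prev is not None:
        node.prev.next = node  # selective correction of the damaged state
        node = node.prev
    return node  # head of the repaired list


def to_list(head):
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


nodes = build([1, 2, 3, 4])
nodes[1].next = None                 # simulated damage: a lost forward link
damaged = to_list(nodes[0])          # traversal now stops early
head = repair_next_pointers(nodes[-1])
repaired = to_list(head)
```

The repair works only because the kind of damage was anticipated, which is the defining limitation of forward recovery.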

Backward Recovery

- Backward error recovery (BER) relies on restoring the system to a previous safe state and executing an alternative section of the program.
- The alternative has the same functionality but uses a different algorithm (cf. N-Version Programming) and therefore should not contain the same fault.
- The point to which a process is restored is called a recovery point, and the act of establishing it is termed checkpointing (saving appropriate system state).
- Advantage: the erroneous state is cleared and it does not rely on finding the location of the fault; it can therefore be used to recover from unanticipated faults, including design errors.
- Disadvantage: it cannot undo errors in the environment!

Fault Treatment
Introduction
- Error recovery returned the system to an error-free state; however, the error may recur. The final phase of fault tolerance is to eradicate the fault from the system.
- The automatic treatment of faults is difficult and system specific.
- Some systems assume all faults are transient; others assume that error recovery techniques can cope with recurring faults.

Fault treatment can be divided into 2 stages:

1. Fault location
2. System repair

- Error detection techniques can help to trace the fault to a component. For hardware, the component can be replaced.
- A software fault can be removed in a new version of the code.
- In non-stop applications it will be necessary to modify the program while it is executing!

The Recovery Block approach to FT

- Language support for BER.
- At the entrance to a block is an automatic recovery point and at the exit an acceptance test.
- The acceptance test is used to test that the system is in an acceptable state after the block's execution (primary module).
- If the acceptance test fails, the program is restored to the recovery point at the beginning of the block and an alternative module is executed.

Continued
- If the alternative module also fails the acceptance test, the program is restored to the recovery point and yet another module is executed, and so on.
- If all modules fail then the block fails and recovery must take place at a higher level.
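
The recovery block scheme can be sketched as a library function. Python has no language-level support for recovery blocks, so this is only an approximation: the state is assumed to be a dictionary, and the modules and the acceptance test are hypothetical.

```python
import copy


def recovery_block(state, modules, acceptance_test):
    """Try each module in turn, restoring the recovery point between attempts."""
    checkpoint = copy.deepcopy(state)  # recovery point at block entry
    for module in modules:
        try:
            module(state)
            if acceptance_test(state):
                return state  # acceptance test passed
        except Exception:
            pass  # an exception counts as a failed attempt
        state.clear()
        state.update(copy.deepcopy(checkpoint))  # restore the recovery point
    raise RuntimeError("recovery block failed: all modules were rejected")


def primary_sort(state):
    state["data"] = state["data"][:-1]  # hypothetical bug: drops an element


def alternative_sort(state):
    state["data"] = sorted(state["data"])


def acceptance_test(state):
    data = state["data"]
    return len(data) == 3 and all(a <= b for a, b in zip(data, data[1:]))


result = recovery_block({"data": [3, 1, 2]},
                        [primary_sort, alternative_sort],
                        acceptance_test)
```

The final `RuntimeError` corresponds to the block failing, so that recovery is left to a higher level.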

Error Control Coding

- The channel is noisy, so the channel output is prone to error.
- We need measures to ensure the correctness of the bit stream transmitted.
- Error control coding aims at developing coding methods to check the correctness of the bit stream transmitted.
- The bit stream representation of a symbol is called the codeword of that symbol.

Continued
Different error control mechanisms:

- Linear Block Codes
- Repetition Codes
- Convolutional Codes

Linear Block Codes

Concepts
- A code is linear if adding any two codewords using modulo-2 arithmetic produces a third codeword in the code.
- Consider an (n, k) linear block code. Here,
  1. n represents the codeword length
  2. k is the number of message bits
  3. the n - k bits are error control bits or parity check bits, generated from the message using an appropriate rule.

Continued
We may therefore represent the codeword as a block of n - k parity bits together with the k message bits.
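
One common way to write the codeword of an (n, k) linear block code is the systematic form, with the n - k parity bits placed before the k message bits. This convention is assumed here as an illustration rather than taken from the slides; the symbols b, m, P, G and H are introduced for this sketch, and all arithmetic is modulo 2.

```latex
\begin{aligned}
\mathbf{c} &= [\, b_0, b_1, \ldots, b_{n-k-1} \mid m_0, m_1, \ldots, m_{k-1} \,] \\
\mathbf{b} &= \mathbf{m}\mathbf{P}, \qquad
\mathbf{c} = \mathbf{m}\mathbf{G}, \qquad
\mathbf{G} = [\, \mathbf{P} \mid \mathbf{I}_k \,] \\
\mathbf{H} &= [\, \mathbf{I}_{n-k} \mid \mathbf{P}^{T} \,], \qquad
\mathbf{c}\,\mathbf{H}^{T} = \mathbf{0}
\end{aligned}
```

Here P is a k × (n - k) coefficient matrix that defines the parity rule, G is the generator matrix and H is the parity-check matrix; every valid codeword gives a zero result when multiplied by the transpose of H.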

Repetition Codes
- This is the simplest of the linear block codes.
- A single message bit is encoded into a block of n identical bits, producing an (n, 1) block code.
- This code allows a variable amount of redundancy.
- It has only two codewords: the all-zero codeword and the all-one codeword.

Example:
Consider a linear block code which is also a repetition code. Let k = 1 and n = 5. From the analysis done in linear block

The parity check matrix takes the form
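
A minimal sketch of the (5, 1) repetition code from the example. The parity-check matrix used below, a 4 x 4 identity followed by a column of ones, is an assumed systematic-form convention and not the matrix from the original slide; the function names are hypothetical.

```python
N = 5  # codeword length; k = 1 message bit


def encode(bit):
    """Repeat the single message bit N times."""
    return [bit] * N


def decode(word):
    """Majority vote: corrects up to (N - 1) // 2 = 2 bit errors."""
    return 1 if sum(word) > N // 2 else 0


# Assumed parity-check matrix H = [I_4 | column of ones].
H = [[1 if j == i else 0 for j in range(N - 1)] + [1] for i in range(N - 1)]


def syndrome(word):
    """Multiply the received word by H (modulo 2); all zeros means no error seen."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]


received = [1, 0, 1, 1, 0]      # all-one codeword with two bit errors
decoded = decode(received)      # majority of the bits is still 1
clean = syndrome(encode(1))     # a valid codeword gives a zero syndrome
dirty = syndrome([1, 0, 0, 0, 0])
```

Increasing n buys more error correction at the cost of a lower code rate, which is the variable redundancy mentioned above.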

Hamming Distance
- Hamming weight, w(c), is defined as the number of nonzero elements in a code vector.
- Hamming distance, d(c1, c2), between two codewords c1 and c2 is defined as the number of bits in which they differ.
- Minimum distance, dmin, is the minimum Hamming distance between any two distinct codewords.
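
These three definitions translate directly into code. The sketch below represents codewords as tuples of bits, which is an assumption made for illustration, and uses the (5, 1) repetition code as the example code.

```python
from itertools import combinations


def hamming_weight(c):
    """Number of nonzero elements in a code vector."""
    return sum(1 for bit in c if bit != 0)


def hamming_distance(c1, c2):
    """Number of positions in which two equal-length codewords differ."""
    if len(c1) != len(c2):
        raise ValueError("codewords must have equal length")
    return sum(1 for a, b in zip(c1, c2) if a != b)


def minimum_distance(code):
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


repetition_code = [(0, 0, 0, 0, 0), (1, 1, 1, 1, 1)]
d_min = minimum_distance(repetition_code)
```

A code with minimum distance dmin can detect up to dmin - 1 bit errors and correct up to (dmin - 1) // 2, so this repetition code detects 4 errors and corrects 2.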

Watchdog processors
Error detection technique:
- A watchdog processor (WP) is a relatively small and simple coprocessor used to perform concurrent system-level error detection by monitoring the behavior of a main processor.
- The general system architecture is shown in Figure.

- The watchdog is provided with some information about the state of the processor or process to be checked, via the system (instruction and/or data) bus.
- Errors detected by the WP are signaled to the checked processor or to an external supervisory unit responsible for error treatment and recovery.

Figure: Watchdog Processor (general system architecture)

Watchdog Timer
- An inexpensive method of error detection.
- The process being watched must reset the timer before the timer expires; otherwise the watched process is assumed to be faulty.
- Watchdog timers only detect errors which manifest themselves as a control-flow error such that the system does not continue to reset the timer.
- Only processes with relatively deterministic runtimes can be checked, since the error detection is based entirely on the time between timer resets.
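
A watchdog timer can be sketched in software as follows. Real watchdog timers are usually hardware counters that reset the processor on expiry; the `WatchdogTimer` class, its method names, and the fake clock used for the demonstration are all assumptions for illustration.

```python
import time


class WatchdogTimer:
    """Software watchdog: the watched process must call kick() before the timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_kick = clock()

    def kick(self):
        """Reset the timer; called by the watched process when it makes progress."""
        self.last_kick = self.clock()

    def expired(self):
        """True if the watched process failed to reset the timer in time."""
        return self.clock() - self.last_kick > self.timeout_s


# Demonstration with a fake clock so the example runs instantly.
now = [0.0]
wd = WatchdogTimer(timeout_s=1.0, clock=lambda: now[0])

now[0] = 0.5
wd.kick()                   # the process is alive and resets the timer
now[0] = 1.2
healthy = not wd.expired()  # only 0.7 s since the last reset

now[0] = 3.0                # the process hangs and stops resetting the timer
faulty = wd.expired()
```

A process that keeps calling `kick()` while computing wrong results is not detected, which matches the limitation stated above.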

Watchdog Timer Application

- GUI power off
- Temperature control
- Timer
- Telephone switch
- Availability
- Reliability
- Structural integrity check


Heartbeats
- A common approach to detecting process and node failures in a distributed (networked) computing environment.
- Periodically, a monitoring entity sends a message (a heartbeat) to a monitored node or process and waits for a reply.
- If the monitored node does not respond within a predefined timeout interval, the node is declared as failed.

Heartbeats: Issues
- The timeout period is pre-negotiated by the two parties, or sometimes even hard-coded by the programmer.
- The predefined timeout value cannot adapt to changes in network traffic or to load variability on individual nodes.
- The monitored node is assumed to be healthy if it is able to respond to a heartbeat message.
- The process/thread responding to the heartbeat message may operate correctly while other processes/threads may be in a deadlock situation or operating incorrectly.
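
The timeout logic of a heartbeat monitor can be sketched without any networking. In this illustration the replies are recorded by hand with explicit timestamps; the `HeartbeatMonitor` class and the node names are hypothetical.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat reply per node and flags silent nodes."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # fixed, predefined timeout
        self.last_reply = {}

    def record_reply(self, node, now):
        self.last_reply[node] = now

    def failed_nodes(self, now):
        """Nodes whose last reply is older than the timeout interval."""
        return sorted(node for node, t in self.last_reply.items()
                      if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.record_reply("node-a", now=0.0)
monitor.record_reply("node-b", now=0.0)
monitor.record_reply("node-a", now=3.0)   # node-a keeps answering
suspects = monitor.failed_nodes(now=4.0)  # node-b has been silent for 4 s
```

Because `timeout_s` is fixed, a slow network or an overloaded node would be declared failed just like a crashed one, which is the adaptation issue listed above.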

Consistency and Capability Checking

Capability Checking
- Can be implemented as a hardware mechanism or can be part of the operating system (usually the case).
- Access to objects (memory segments, I/O devices) is limited to users (processors or processes) with the proper authorization.

Examples:
- Virtual address management (the MMU usually has a capability check).
- Permission vs. activity: if these are not valid, there is an error trap.
- Password checking.

Consistency Checks
- Range check: confirms that a computed value is in a valid range, e.g. a computed probability must be in the range 0 to 1.
- Address checking: verifies that the address to be accessed exists.
- Opcode checking: checks whether the instruction to be executed has one of the defined (documented) opcodes.
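
Each of these checks reduces to a simple predicate. In practice address and opcode checks are done by hardware or the operating system; the sketch below only illustrates the logic, and the opcode table and memory size are hypothetical.

```python
VALID_OPCODES = {0x01: "LOAD", 0x02: "STORE", 0x03: "ADD"}  # hypothetical instruction set
MEMORY_SIZE = 1024                                          # hypothetical address space


def range_check(probability):
    """A computed probability must lie in the range 0 to 1."""
    return 0.0 <= probability <= 1.0


def address_check(address):
    """The address to be accessed must exist."""
    return 0 <= address < MEMORY_SIZE


def opcode_check(opcode):
    """The instruction must use one of the defined opcodes."""
    return opcode in VALID_OPCODES
```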

Data Audits
Introduction
- Widely used in the telecommunications industry.
- A broad range of custom and ad hoc application-level techniques for detecting and recovering from errors in a switching environment (in particular in a database).
- Data-specific techniques deeply embedded in the application can provide significant improvement in availability.

Static and Dynamic Data Check

- Corruption in the static data region is detected by computing a golden checksum of all static data at startup and comparing it with a periodically computed checksum (e.g. Cyclic Redundancy Code).
- For dynamic data, the range of allowable values for database fields is often stored in the database system catalog. This information is used to perform a range check on the dynamic fields in the database.
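
Both audits can be sketched in a few lines. The static region, the catalog contents and the field names are hypothetical, and CRC-32 from the standard `zlib` module stands in for whatever checksum a real system would use.

```python
import zlib

static_data = bytearray(b"routing-table-v1")  # hypothetical static data region
golden = zlib.crc32(static_data)              # golden checksum, computed at startup


def static_audit(region):
    """Periodic audit: recompute the checksum and compare it with the golden one."""
    return zlib.crc32(region) == golden


catalog = {"port": (0, 65535), "priority": (0, 7)}  # hypothetical system catalog


def dynamic_audit(record):
    """Return the names of fields whose values fall outside the catalog range."""
    return [name for name, value in record.items()
            if not (catalog[name][0] <= value <= catalog[name][1])]


ok_before = static_audit(static_data)
static_data[0] ^= 0xFF                 # simulated corruption of the static region
ok_after = static_audit(static_data)
bad_fields = dynamic_audit({"port": 8080, "priority": 9})
```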

Semantic Referential Integrity Check

1. Traces logical relationships among records in different tables to verify the consistency of the logical loops formed by the record(s).
2. Detects resource leaks.
3. Corruption of key attributes in a database leads to lost records, i.e. records participating in semantic relationships disappear without being properly updated.

Data Audits: Structural Checks

- The structure of the database is established by header fields that precede the data portion in every record of each table.
- The structural audit calculates the offset of each record header from the beginning of the database, based on record sizes stored in system tables (all record sizes are fixed and known).
- The database structure (in particular, the alignment of each record and table within the database) is checked by comparing all header fields at the computed offsets with expected values.

Assertions
Goals
- Generate runtime assertions by monitoring the values of selected variables in a program.
- Use the monitored data to abstract out, via statistical pattern recognition techniques, the key relationships between the variables, separately and jointly, and to establish their probabilistic behavior.
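
As a much simplified stand-in for statistical pattern recognition, the sketch below derives a range assertion for a single variable from its observed mean and standard deviation. The three-sigma rule, the `learn_assertion` helper and the sample values are assumptions for illustration; relationships between several variables are not modelled.

```python
import statistics


def learn_assertion(samples, k=3.0):
    """Derive a range assertion (mean plus or minus k standard deviations)."""
    mean = statistics.mean(samples)
    spread = k * statistics.stdev(samples)
    low, high = mean - spread, mean + spread
    return lambda value: low <= value <= high


observed = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]  # hypothetical monitored values
temperature_ok = learn_assertion(observed)
```

The learned predicate can then be evaluated at runtime, for example `assert temperature_ok(reading)`, to flag values that do not fit the observed behavior.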

Control-flow Monitoring Using Signatures

Types
- Hardware Approaches
- Software Approaches

Hardware Approaches
- Embedded Signature Monitoring
  - Pre-computed signature embedded in the application program
  - Requires recompilation of existing programs
  - Performance degradation of the application
- Autonomous Signature Monitoring
  - Watchdog processor stores the pre-computed signature in memory and mimics the control flow of the application
  - Watchdog processor is rather complex
  - High memory overhead

Software Approaches
- Software techniques partition the application into blocks, either in assembly language or in a high-level language.
- Appropriate instrumentation is inserted at the beginning and/or end of the blocks.
- The checking code is inserted in the instruction stream, eliminating the need for a hardware watchdog processor.
- Two classes of approaches:
  - non-preemptive signature checking
  - preemptive signature checking
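
The software approach can be sketched with block identifiers acting as signatures. This is a simplification: real schemes compute signatures over the instructions of each block, whereas here a hypothetical control-flow graph lists the legal successors of each block and the instrumentation call checks every transition against it.

```python
# Hypothetical control-flow graph: block id -> set of legal successor blocks.
CFG = {"entry": {"loop"}, "loop": {"loop", "exit"}, "exit": set()}


class ControlFlowChecker:
    def __init__(self, cfg, start):
        self.cfg = cfg
        self.current = start

    def enter(self, block):
        """Instrumentation call placed at the beginning of each block."""
        if block not in self.cfg[self.current]:
            raise RuntimeError(f"illegal jump {self.current} -> {block}")
        self.current = block


checker = ControlFlowChecker(CFG, start="entry")
checker.enter("loop")
checker.enter("loop")
checker.enter("exit")
legal_path_ok = checker.current == "exit"

bad = ControlFlowChecker(CFG, start="entry")
try:
    bad.enter("exit")  # skips the loop: a control-flow error
    detected = False
except RuntimeError:
    detected = True
```

The check runs inside the application itself, so it costs execution time but needs no watchdog processor.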


THANK YOU