RTFT15 Unit 2

UNIT 2
Fault Tolerance
Fault-Error-Failure.
Error models
Redundancy,
Error detection techniques
Error Detection,
Watchdog
Damage Confinement,
Error control coding.
Error Recovery,
Fault Treatment,
Fault Prevention,
Anticipated and unanticipated faults.
Roll No: 15
Fault
Definition
A fault is some physical defect that can cause a component to malfunction.
A fault can be:
- Hardware fault (e.g. logical wire)
- Software fault (e.g. bug)
Real Time and Fault Tolerance
Fault:
Example
Fault is a defect within the system
Examples:
Software bug
Random hardware fault
Memory bit stuck
Error
Definition
An error is a deviation from the required operation of a system or subsystem.
A fault may lead to an error, i.e., an error is the mechanism by which the fault becomes apparent.
A fault may stay dormant for a long time before it manifests itself as an error.
Error:
Example
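A memory bit is stuck, but the CPU does not access this data.
A software bug in a subroutine is not visible while the subroutine is not called.
In both cases the fault stays dormant: no error appears until the faulty element is actually used.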
Failure:
Definition
A system failure occurs when the system fails to perform its required function.
The presence of an error might cause a whole system to deviate from its required operation.
One of the goals of safety-critical systems is that an error should not result in system failure.
Need for fault tolerance

Complex systems
Critical applications
Harsh environments
Redundancy
All fault-tolerant techniques rely on extra elements introduced into the system to detect and recover from faults.
These components are redundant in that they are not required in a perfect system.
Often called protective redundancy.
Aim: minimise redundancy while maximising reliability, subject to the cost and size constraints of the system.
Warning: the added components inevitably increase the complexity of the overall system.
This itself can lead to less reliable systems, e.g. the first launch of the space shuttle.
It is advisable to separate out the fault-tolerant components from the rest of the system.
Hardware Redundancy
Two types: static (or masking) redundancy and dynamic redundancy.
Static: redundant components are used inside a system to hide the effects of faults, e.g. Triple Modular Redundancy (TMR).
TMR: 3 identical subcomponents and majority voting circuits; the outputs are compared and, if one differs from the other two, that output is masked out.
Dynamic: redundancy supplied inside a component which indicates that the output is in error; it provides an error detection facility; recovery must be provided by another component.
E.g. communications checksums and memory parity bits.
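The majority voting used in TMR can be sketched in a few lines. This is only an illustration in software of what the slide describes as a voting circuit; the function name `tmr_vote` is an assumption made for this example.

```python
from collections import Counter

def tmr_vote(a, b, c):
    """Return the majority value of three module outputs.

    A single disagreeing output is masked out. If all three
    outputs differ there is no majority and the fault cannot be masked.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three outputs differ")
    return value

# One faulty module (output 9) is outvoted by the two healthy ones.
print(tmr_vote(7, 7, 9))
```

Note that the voter masks at most one faulty module; two modules failing in the same way would outvote the healthy one.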
Software Redundancy
The system is provided with different software versions of a task, written independently by different teams of programmers.
If one version of a task fails under a certain input, another version can be used.
Static:
N version
Recovery block
Software Redundancy: dynamic type

Four phases
1. Error detection: no fault tolerance scheme can be utilised until the associated error is detected.
2. Damage confinement and assessment: to what extent has the system been corrupted? The delay between a fault occurring and the detection of the error means erroneous information could have spread throughout the system.
3. Error recovery: techniques should aim to transform the corrupted system into a state from which it can continue its normal operation.
4. Fault treatment and continued service: an error is a symptom of a fault; although the damage is repaired, the fault may still exist.
Information Redundancy
The data are coded in such a way that a certain number of bit errors can be detected or corrected.
Error Detecting Techniques

Finding the error in the first place:
Parity checking
Checksum error detection
Cyclic Redundancy check
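The three techniques listed above can be compared on a single corrupted message. The sketch below is illustrative only: `even_parity_bit` and `checksum8` are hypothetical helper names, the 8-bit additive checksum is just one simple checksum variant, and the CRC uses the CRC-32 from Python's standard `zlib` module.

```python
import zlib

def even_parity_bit(data: bytes) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2

def checksum8(data: bytes) -> int:
    """Simple additive checksum: sum of all bytes modulo 256."""
    return sum(data) % 256

message = b"fault"
one_flip = bytes([message[0] ^ 0x01]) + message[1:]   # one bit corrupted
two_flips = bytes([message[0] ^ 0x03]) + message[1:]  # two bits corrupted

# A single bit error changes all three check values.
print(even_parity_bit(message) != even_parity_bit(one_flip))
print(checksum8(message) != checksum8(one_flip))
print(zlib.crc32(message) != zlib.crc32(one_flip))

# Parity misses an even number of bit errors; the CRC still catches this one.
print(even_parity_bit(message) == even_parity_bit(two_flips))
print(zlib.crc32(message) != zlib.crc32(two_flips))
```

The receiver recomputes the check value over the received data and compares it with the transmitted one; a mismatch signals an error but does not by itself say which bit is wrong.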
Error Detection
Types
Environmental detection
- hardware, e.g. illegal instruction
- O.S/RTS, e.g. null pointer
Application detection
- Replication checks
- Timing checks
- Reversal checks
- Coding checks
- Reasonableness checks
Other types: heartbeats etc.
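Two of the application-level checks listed above can be sketched as follows. The function names and the tolerance are assumptions chosen for this example: a reversal check recomputes the input from the output, and a replication check performs the same computation twice and compares the results.

```python
import math

def checked_sqrt(x: float, tol: float = 1e-9) -> float:
    """Reversal check: square the result and compare it with the input."""
    if x < 0:
        raise ValueError("input must be non-negative")
    y = math.sqrt(x)
    if not math.isclose(y * y, x, rel_tol=tol, abs_tol=tol):
        raise RuntimeError("reversal check failed")
    return y

def replicated_sum(values):
    """Replication check: compute the sum twice, in two different orders."""
    first = sum(values)
    second = 0
    for v in reversed(values):
        second += v
    if first != second:
        raise RuntimeError("replication check failed")
    return first

print(checked_sqrt(16.0))
print(replicated_sum([1, 2, 3, 4]))
```

A reversal check only works when the inverse of the computation is cheap and well defined, as it is for a square root; the replication check here uses integers so that the two results can be compared exactly.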
Damage Confinement and Assessment

Introduction
Damage assessment is closely related to the damage confinement techniques used.
Damage confinement is concerned with structuring the system so as to minimise the damage caused by a faulty component (also known as firewalling).
Modular decomposition provides static damage confinement; it allows data to flow through well-defined pathways.
Atomic actions provide dynamic damage confinement; they are used to move the system from one consistent state to another.
Error Recovery
Introduction
Probably the most important phase of any fault-tolerance technique
Two approaches:
1. Forward Recovery
2. Backward Recovery
Forward Recovery
assessing and removing errors completely
Forward error recovery continues from an erroneous state by making selective corrections to the system state.
This includes making safe the controlled environment, which may be hazardous or damaged because of the failure.
It is system specific and depends on accurate predictions of the location and cause of errors (i.e., damage assessment).
E.g. redundant pointers in data structures.
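Redundant pointers can be illustrated with a doubly linked list, where the `prev` pointers carry enough information to rebuild damaged `next` pointers. The sketch below is a minimal example under two stated assumptions: the `prev` chain and the tail reference are intact, and all names (`Node`, `repair_next_pointers`, and so on) are hypothetical.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None

def build(values):
    """Build a doubly linked list and return its nodes in order."""
    nodes = [Node(v) for v in values]
    for left, right in zip(nodes, nodes[1:]):
        left.next, right.prev = right, left
    return nodes

def repair_next_pointers(tail):
    """Forward recovery: walk back along prev and rewrite every next pointer."""
    node = tail
    node.next = None
    while node.prev is not None:
        node.prev.next = node
        node = node.prev
    return node  # the head of the repaired list

def to_list(head):
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out

nodes = build([1, 2, 3, 4])
nodes[1].next = None                    # simulate a corrupted pointer
print(to_list(nodes[0]))                # traversal stops early
head = repair_next_pointers(nodes[-1])
print(to_list(head))                    # full list is reachable again
```

The state is corrected in place and execution continues; nothing is rolled back, which is what distinguishes this from backward recovery.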
Backward Recovery
Backward error recovery (BER) relies on restoring the system to a previous safe state and executing an alternative section of the program.
This has the same functionality but uses a different algorithm (c.f. N-Version Programming) and therefore, it is hoped, does not contain the same fault.
The point to which a process is restored is called a recovery point, and the act of establishing it is termed checkpointing (saving appropriate system state).
Advantage: the erroneous state is cleared and it does not rely on finding the location or cause of the fault; it can therefore be used to recover from unanticipated faults, including design errors.
Disadvantage: it cannot undo errors in the environment!

Fault Treatment
Introduction
Error recovery (ER) returns the system to an error-free state; however, the error may recur. The final phase of fault tolerance (F.T.) is to eradicate the fault from the system.
The automatic treatment of faults is difficult and system specific.
Some systems assume all faults are transient; others assume that error recovery techniques can cope with recurring faults.
Fault treatment can be divided into 2 stages:
1. fault location
2. system repair
Error detection techniques can help to trace the fault to a component. For hardware, the component can be replaced.
A software fault can be removed in a new version of the code.
In non-stop applications it will be necessary to modify the program while it is executing!
The Recovery Block approach to FT

Language support for BER
At the entrance to a block is an automatic recovery point and at the exit an acceptance test.
The acceptance test is used to test that the system is in an acceptable state after the block's execution (primary module).
If the acceptance test fails, the program is restored to the recovery point at the beginning of the block and an alternative module is executed.
If the alternative module also fails the acceptance test, the program is restored to the recovery point and yet another module is executed, and so on.
If all modules fail then the block fails and recovery must take place at a higher level.
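The recovery block scheme described above can be sketched as a small helper. This is an illustration under assumptions, not language-level support: the state is a plain dictionary, the recovery point is a deep copy of it, and the names `recovery_block`, `primary`, `alternative` and `acceptance_test` are hypothetical.

```python
import copy

def recovery_block(state, modules, acceptance_test):
    """Run modules in order until one leaves the state acceptable."""
    checkpoint = copy.deepcopy(state)        # recovery point at block entry
    for module in modules:                   # primary first, then alternatives
        try:
            module(state)
            if acceptance_test(state):
                return state
        except Exception:
            pass                             # an exception counts as a detected error
        state.clear()                        # roll back to the recovery point
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("recovery block failed; recover at a higher level")

def primary(state):
    # Deliberately faulty module: loses the largest element.
    state["output"] = sorted(state["input"])[:-1]

def alternative(state):
    state["output"] = sorted(state["input"])

def acceptance_test(state):
    out = state.get("output", [])
    in_order = all(a <= b for a, b in zip(out, out[1:]))
    return in_order and len(out) == len(state["input"])

result = recovery_block({"input": [3, 1, 2]}, [primary, alternative], acceptance_test)
print(result["output"])
```

The acceptance test here checks only ordering and length, so it would accept some wrong outputs; an acceptance test is a plausibility check rather than a proof of correctness.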
Error Control Coding

Includes
The channel is noisy.
The channel output is prone to error.
We need measures to ensure the correctness of the transmitted bit stream.
Error control coding aims at developing coding methods to check the correctness of the transmitted bit stream.
The bit stream representation of a symbol is called the codeword of that symbol.
Different error control mechanisms:
Linear Block Codes
Repetition Codes
Convolutional Codes
Linear Block Codes

Concepts
A code is linear if adding any two codewords using modulo-2 arithmetic produces a third codeword in the code.
Consider an (n, k) linear block code. Here,
1. n represents the codeword length
2. k is the number of message bits
3. n - k bits are error control bits or parity check bits generated from the message using an appropriate rule.
We may therefore represent the codeword as a vector of n - k parity check bits together with k message bits.
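One common way to write this, assuming the parity check bits b are placed before the message bits m (an assumed ordering; some texts put the message bits first), is:

```latex
\mathbf{c} = [\, b_0, b_1, \dots, b_{n-k-1} \mid m_0, m_1, \dots, m_{k-1} \,]
```

With this layout the codeword has (n - k) + k = n bits, matching the definition of n above.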
Repetition Codes
This is the simplest of the linear block codes.
A single message bit is encoded into a block of n identical bits, producing an (n, 1) block code.
This code allows a variable amount of redundancy.
It has only two codewords: the all-zeros codeword and the all-ones codeword.
Example:
Consider a linear block code which is also a repetition code. Let k = 1 and n = 5.
The generator and parity check matrices then follow from the analysis done for linear block codes.
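As a worked sketch for the (5, 1) repetition code, assume the systematic layout with the four parity bits first and the single message bit last, so that G = [P | I_k] and H = [I_{n-k} | P^T]. This convention is an assumption made here. Every parity bit equals the message bit, so P = [1 1 1 1], which gives:

```latex
G = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \end{bmatrix},
\qquad
H = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 1
\end{bmatrix}
```

As a check, G H^T = [1 1 1 1] + [1 1 1 1] = [0 0 0 0] in modulo-2 arithmetic, as required of a valid generator and parity check pair.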
Hamming Distance
Improves traditional measures by
Hamming weight, w(c), is defined as the number of nonzero elements in a code vector.
Hamming distance, d(c1, c2), between two codewords c1 and c2 is defined as the number of bits in which they differ.
Minimum distance, dmin, is the minimum Hamming distance between any two distinct codewords.
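The three definitions above translate directly into code. In this minimal sketch codewords are represented as strings of '0' and '1' characters, which is a representation chosen for this example only.

```python
from itertools import combinations

def hamming_weight(c: str) -> int:
    """Number of nonzero elements in the code vector."""
    return c.count("1")

def hamming_distance(c1: str, c2: str) -> int:
    """Number of bit positions in which two codewords differ."""
    if len(c1) != len(c2):
        raise ValueError("codewords must have the same length")
    return sum(a != b for a, b in zip(c1, c2))

def minimum_distance(code) -> int:
    """Smallest Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

print(hamming_weight("10110"))
print(hamming_distance("10110", "00111"))
print(minimum_distance(["00000", "11111"]))  # the (5, 1) repetition code
```

A code with minimum distance dmin can detect up to dmin - 1 bit errors and correct up to floor((dmin - 1) / 2), so the (5, 1) repetition code corrects any two bit errors.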
Watchdog processors
Error detection technique:
A watchdog processor (WP) is a relatively small and simple coprocessor used to perform concurrent system-level error detection by monitoring the behavior of a main processor.
The general system architecture is shown in Figure.
The watchdog is provided with some information about the state of the processor or process to be checked on the system (instruction and/or data) bus.
Errors detected by the WP are signaled towards the checked processor or any external supervisory unit responsible for error treatment and recovery.
Watchdog Processor
Watchdog Timer
An inexpensive method of error detection.
The process being watched must reset the timer before the timer expires; otherwise the watched process is assumed to be faulty.
Watchdog timers only detect errors which manifest themselves as a control-flow error such that the system does not continue to reset the timer.
Only processes with relatively deterministic runtimes can be checked, since the error detection is based entirely on the time between timer resets.
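The reset-before-expiry rule can be sketched with an explicit clock value passed in by the caller, so the example stays deterministic. The class `WatchdogTimer` and its method names are assumptions for illustration; a hardware watchdog would count down on its own and reset the system on expiry.

```python
class WatchdogTimer:
    def __init__(self, timeout: float, now: float = 0.0):
        self.timeout = timeout
        self.last_reset = now

    def reset(self, now: float) -> None:
        """Called by the watched process to show that it is still running."""
        self.last_reset = now

    def expired(self, now: float) -> bool:
        """True if the watched process failed to reset the timer in time."""
        return now - self.last_reset > self.timeout

wd = WatchdogTimer(timeout=1.0)
wd.reset(now=0.5)            # the process checks in at t = 0.5
print(wd.expired(now=1.2))   # 0.7 time units since the reset: still healthy
print(wd.expired(now=2.0))   # 1.5 time units since the reset: assumed faulty
```

In a running system the `now` argument would come from a monotonic clock such as `time.monotonic()`. The timeout has to exceed the longest legitimate gap between resets, which is why only processes with fairly deterministic runtimes are good candidates.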
Watchdog Timer Application

GUI power off
Temperature control
Timer
Telephone switch
Availability
Reliability
Structural integrity check
Heartbeats
Includes
A common approach to detecting process and node failures in a distributed (networked) computing environment.
Periodically, a monitoring entity sends a message (a heartbeat) to a monitored node or process and waits for a reply.
If the monitored node does not respond within a predefined timeout interval, the node is declared as failed.
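The monitoring side of a heartbeat scheme can be sketched as a table of last-reply times checked against a fixed timeout. The class and method names are hypothetical, message transport is left out, and times are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_reply = {}          # node name -> time of the last reply

    def register(self, node: str, now: float) -> None:
        self.last_reply[node] = now

    def record_reply(self, node: str, now: float) -> None:
        """Called when a monitored node answers a heartbeat message."""
        self.last_reply[node] = now

    def failed_nodes(self, now: float):
        """Nodes that have not replied within the timeout interval."""
        return sorted(node for node, t in self.last_reply.items()
                      if now - t > self.timeout)

monitor = HeartbeatMonitor(timeout=3.0)
monitor.register("node-a", now=0.0)
monitor.register("node-b", now=0.0)
monitor.record_reply("node-a", now=4.0)   # node-b never answers
print(monitor.failed_nodes(now=5.0))
```

The fixed `timeout` in this sketch is exactly the kind of predefined value that cannot adapt to network or load changes.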
Heartbeats: Issues
The timeout period is pre-negotiated by the two parties or sometimes even hard-coded by the programmer.
The predefined timeout value cannot adapt to changes in network traffic or to load variability on individual nodes.
The monitored node is assumed to be healthy if it is able to respond to a heartbeat message.
The process/thread responding to the heartbeat message may operate correctly, while other processes/threads may be in a deadlock situation or operating incorrectly.
Consistency and Capability Checking

Capability Checking
Can be implemented as a hardware mechanism or can be part of the operating system (usually the case).
Access to objects (memory segments, I/O devices) is limited to users (processors or processes) with the proper authorization.
Examples:
- virtual address management (the MMU usually has a capability check): permission vs. activity; if these are not valid, there is an error trap
- password checking
Consistency Checks
- range check: confirms that a computed value is in a valid range, e.g. a computed probability must be in the range 0 to 1
- address checking: verifies that the address to be accessed exists
- opcode checking: checks whether the instruction to be executed has one of the defined (documented) opcodes
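Each of the three consistency checks reduces to a simple predicate. In the sketch below the memory size and the opcode set are hypothetical values made up for the example, not those of any real machine.

```python
MEMORY_SIZE = 1024                        # hypothetical size of the address space
VALID_OPCODES = {0x00, 0x01, 0x02, 0x10}  # hypothetical documented opcodes

def range_check(probability: float) -> bool:
    """A computed probability must lie in the range 0 to 1."""
    return 0.0 <= probability <= 1.0

def address_check(address: int) -> bool:
    """The address to be accessed must exist."""
    return 0 <= address < MEMORY_SIZE

def opcode_check(opcode: int) -> bool:
    """The instruction must have one of the defined opcodes."""
    return opcode in VALID_OPCODES

print(range_check(0.25), range_check(1.5))
print(address_check(1023), address_check(1024))
print(opcode_check(0x10), opcode_check(0xFF))
```

These checks are cheap because they need no redundancy beyond knowledge of what a valid value looks like; a value that passes may still be wrong.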
Data Audits
Introduction
Widely used in the telecommunications industry
A broad range of custom and ad hoc application-level techniques for detecting and recovering from errors in a switching environment (in particular in a database).
Data-specific techniques deeply embedded in the application can provide significant improvement in availability.
Static and Dynamic Data Check
Corruption in a static data region is detected by computing a golden checksum of all static data at startup and comparing it with a periodically computed checksum (e.g., Cyclic Redundancy Code).
For dynamic data, the ranges of allowable values for database fields are often stored in the database system catalog. This information is used to perform a range check on the dynamic fields in the database.
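Both audits can be sketched briefly. The golden checksum uses CRC-32 from Python's standard `zlib` module; the catalog contents, field names and class name are hypothetical examples.

```python
import zlib

class StaticDataAudit:
    def __init__(self, static_data: bytes):
        # Golden checksum computed once, at startup.
        self.golden = zlib.crc32(static_data)

    def check(self, static_data: bytes) -> bool:
        """Periodic audit: recompute the checksum and compare with the golden one."""
        return zlib.crc32(static_data) == self.golden

# Hypothetical system catalog: field name -> (lowest, highest) allowed value.
CATALOG = {"channel": (0, 63), "priority": (1, 5)}

def range_audit(record: dict, catalog: dict = CATALOG):
    """Return the names of dynamic fields whose values are out of range."""
    return [field for field, (lo, hi) in catalog.items()
            if not lo <= record[field] <= hi]

audit = StaticDataAudit(b"routing-table-v1")
print(audit.check(b"routing-table-v1"))   # unchanged data passes
print(audit.check(b"routing-table-v2"))   # corrupted data is detected
print(range_audit({"channel": 70, "priority": 3}))
```

The static audit detects that something changed but not what; the range audit points at the offending field, which helps when recovering the record.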
Semantic Referential Integrity Check

1. Traces logical relationships among records in different tables to verify the consistency of the logical loops formed by the record(s)
2. Detects resource leaks
3. Corruption of key attributes in a database leads to lost records, i.e., records participating in semantic relationships disappear without being properly updated
Data Audits: Structural Checks

The structure of the database is established by header fields that precede the data portion in every record of each table.
Structural audit calculates the offset of each record header from the beginning of the database based on record sizes stored in system tables (all record sizes are fixed and known).
The database structure (in particular, the alignment of each record and table within the database) is checked by comparing all header fields at computed offsets with expected values.
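The offset calculation and header comparison can be sketched on a toy in-memory database. Everything about the layout is an assumption for illustration: a two-byte header holding a table id and a record index, fixed record sizes given as (record_size, record_count) pairs, and zero-filled data portions.

```python
HEADER_SIZE = 2   # hypothetical header layout: (table id, record index)

def build_database(tables):
    """tables: list of (record_size, record_count); data bytes are zero."""
    db = bytearray()
    for table_id, (size, count) in enumerate(tables):
        for index in range(count):
            db += bytes([table_id, index]) + bytes(size - HEADER_SIZE)
    return db

def structural_audit(db, tables):
    """Return the offsets whose header differs from the expected value."""
    bad, offset = [], 0
    for table_id, (size, count) in enumerate(tables):
        for index in range(count):
            expected = bytes([table_id, index])
            if bytes(db[offset:offset + HEADER_SIZE]) != expected:
                bad.append(offset)
            offset += size        # record sizes are fixed and known
    return bad

tables = [(8, 3), (4, 2)]         # two tables: 3 records of 8 bytes, 2 of 4 bytes
db = build_database(tables)
print(structural_audit(db, tables))   # intact database: no bad offsets
db[8] = 0xFF                          # corrupt the header of table 0, record 1
print(structural_audit(db, tables))
```

Because every offset is derived from the record sizes alone, a damaged header cannot mislead the audit about where the next record starts.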
Assertions
Goals
Generate runtime assertions by monitoring the values of selected variables in a program.
Use the monitored data to abstract out, via statistical pattern recognition techniques, the key relationships between the variables, separately and jointly, and to establish their probabilistic behavior.

Control-flow Monitoring Using Signatures

Types
Hardware Approaches
Software Approaches
Hardware Approaches
Embedded Signature Monitoring
- Pre-computed signature embedded in the application program
- Recompilation of existing programs
- Performance degradation of the application
Autonomous Signature Monitoring
- Watchdog processor stores the pre-computed signature in memory and mimics the control flow of the application
- Watchdog processor is rather complex
- High memory overhead
Software Approaches
Software techniques partition the application into blocks, either in assembly language or in the high-level language.
Appropriate instrumentation is inserted at the beginning and/or end of the blocks.
The checking code is inserted in the instruction stream, eliminating the need for a hardware watchdog processor.
Two classes of approaches:
- non-preemptive signature checking
- preemptive signature checking
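The software approach can be sketched as a checker called by instrumentation at the start of each block. This is a strong simplification under stated assumptions: block names stand in for computed signatures, the control-flow graph `CFG` is a hypothetical three-block program, and the class name is invented for the example.

```python
# Hypothetical control-flow graph: block -> set of legal successor blocks.
CFG = {"entry": {"loop"}, "loop": {"loop", "exit"}, "exit": set()}

class ControlFlowChecker:
    def __init__(self, cfg, start):
        self.cfg = cfg
        self.current = start

    def enter(self, block):
        """Instrumentation call placed at the beginning of each block."""
        if block not in self.cfg[self.current]:
            raise RuntimeError(f"illegal jump {self.current} -> {block}")
        self.current = block

checker = ControlFlowChecker(CFG, "entry")
for block in ["loop", "loop", "exit"]:   # a legal execution path
    checker.enter(block)
print(checker.current)
```

A jump straight from "entry" to "exit" would raise an error here, which is the kind of control-flow error this class of techniques targets; an error that stays on a legal path goes unnoticed.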
THANK YOU
