
UNIT 6

Practical Systems for Fault Tolerance:
Application: Ad-hoc wireless network
Application: NASA Remote Exploration & Experimentation
System Architecture: Fault tolerant computers
General purpose commercial systems
Fault tolerant multiprocessor and VLSI based communication architecture
Fault tolerant software: Design-N-version

Roll No: 15

Ad hoc Wireless Networks


Definition
Ad-hoc network
A LAN or other small network with wireless
connections
Devices are part of the network only for the
duration of a communications session,
or while in close proximity to the network
Decentralized
E.g.: Bluetooth
Real Time and Fault Tolerance

Ad hoc Wireless Networks


Definition
The principle behind ad hoc networking is multi-hop
relaying in which messages are sent from the
source to the destination by relaying through the
intermediate hops (nodes).

In multi-hop wireless networks, communication
between two end nodes is carried out through a
number of intermediate nodes whose function is to
relay information from one point to another.

A static string topology is an example of such a
network:
[Figure: static string topology]
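
Multi-hop relaying over a static string topology can be sketched as follows. This is only an illustration: the helper names `string_topology` and `find_route` are hypothetical, and the breadth-first search stands in for whatever route discovery a real ad hoc protocol would use.

```python
from collections import deque


def string_topology(n):
    """Adjacency map for n nodes in a line: 0 - 1 - ... - (n-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def find_route(links, source, destination):
    """Breadth-first search; returns the nodes on a shortest route, or None."""
    previous = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            route = []
            while node is not None:    # walk back to the source
                route.append(node)
                node = previous[node]
            return route[::-1]
        for neighbour in links[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None                        # destination unreachable


links = string_topology(5)
route = find_route(links, 0, 4)        # every intermediate node acts as a relay
hops = len(route) - 1
```

In a string topology there is only one route between the two end nodes, so every intermediate node is a single point of failure for that conversation.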

In the last few years, efforts have been focused on
multi-hop "ad hoc" networks, in which relaying nodes
are in general mobile, and communication needs are
primarily between nodes within the same network.


Applications of Ad hoc Wireless Networks

Military applications
Ad hoc wireless networks are useful in establishing
communication on a battlefield.

Collaborative and Distributed Computing


A group of people in a conference can share data in ad hoc
networks.
Streaming of multimedia objects among the participating
nodes.

Emergency Operations
Ad hoc wireless networks are useful in emergency
operations such as search and rescue, and crowd
control.

Issues in Protocol Design


Must run in a distributed environment
Must provide loop-free routes
Must be able to find multiple routes
Must establish routes quickly
Must minimize overhead in its communication and in its
reaction to topology change
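
Two of the requirements above, loop-free routes and multiple routes, can be illustrated with a small sketch. It assumes each node knows the whole topology, which a real distributed protocol would not. The function names and the four-node example network are hypothetical.

```python
def loop_free_routes(links, source, destination, path=None):
    """Yield every route from source to destination that visits no node twice."""
    path = (path or []) + [source]
    if source == destination:
        yield path
        return
    for neighbour in sorted(links.get(source, ())):
        if neighbour not in path:      # skipping visited nodes keeps routes loop-free
            yield from loop_free_routes(links, neighbour, destination, path)


def remove_link(links, a, b):
    """Return a copy of the topology without the bidirectional link a-b."""
    return {n: {m for m in nbrs if {n, m} != {a, b}} for n, nbrs in links.items()}


# Hypothetical topology: two disjoint routes from A to D.
links = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
routes = sorted(loop_free_routes(links, "A", "D"), key=len)

# Topology change: the A-B link breaks, and the second route is still usable.
after_break = list(loop_free_routes(remove_link(links, "A", "B"), "A", "D"))
```

Keeping more than one route on hand lets a node switch immediately after a link break instead of starting a new route discovery.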


Remote Exploration & Experimentation

Introduction

The goal of the Remote Exploration & Experimentation
(REE) project is

to move supercomputing into space in a cost-effective
manner

to allow the use of inexpensive, state-of-the-art,
commercial-off-the-shelf (COTS) components and
subsystems in these space-based supercomputers.
sombody@gmail.com

Some of the responsibilities of the REE system software
include:

1. Managing system resources (maintaining state
information about each node and about the global
system, performing system resource diagnostics, etc.).

2. Job scheduling (globally scheduling jobs across the
system, local job scheduling within the node,
allocation of resources to jobs, etc.).

3. Managing the scientific applications (launching the
applications, monitoring the applications for failure,
initiating recovery for applications, etc.).


The immediate concern of the Applications Manager is to
oversee the execution of the scientific applications.

As the applications represent the ultimate customer of the
REE environment, efficiently supporting their required
dependability level is paramount.

The Applications Manager monitors the science application
for externally visible signs of faulty behavior as well as for
messages generated internally by the applications
requesting fault tolerance services.
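
The two monitoring paths described above can be sketched as follows. This is not the REE design: the class, the use of a heartbeat as the "externally visible sign", the timeout value, and the application names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AppStatus:
    last_heartbeat: float              # time of the most recent heartbeat
    restarts: int = 0


@dataclass
class ApplicationsManager:
    heartbeat_timeout: float = 5.0     # assumed value, not taken from REE
    apps: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def launch(self, name, now):
        self.apps[name] = AppStatus(last_heartbeat=now)

    def heartbeat(self, name, now):
        self.apps[name].last_heartbeat = now

    def request_service(self, name, now):
        """Message generated inside the application asking for recovery."""
        self._recover(name, now, reason="requested by application")

    def check(self, now):
        """Externally visible sign of failure: a missed heartbeat."""
        for name, status in self.apps.items():
            if now - status.last_heartbeat > self.heartbeat_timeout:
                self._recover(name, now, reason="heartbeat timeout")

    def _recover(self, name, now, reason):
        status = self.apps[name]
        status.restarts += 1
        status.last_heartbeat = now    # the restarted application starts fresh
        self.log.append((name, reason))


manager = ApplicationsManager()
manager.launch("imaging", now=0.0)     # hypothetical application names
manager.launch("spectral", now=0.0)
manager.heartbeat("imaging", now=4.0)
manager.check(now=6.0)                 # "spectral" has been silent for too long
manager.request_service("imaging", now=7.0)
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic and easy to test.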



Fault tolerant computers


Fault-tolerant computing is the art and science of
building computing systems that continue to
operate satisfactorily in the presence of faults.

A fault-tolerant system may be able to tolerate
one or more fault types, including
i) transient, intermittent or permanent hardware
faults,
ii) software and hardware design errors,
iii) operator errors,
iv) externally induced upsets or physical
damage.

Fault-tolerant computer systems


Fault-tolerant computer systems are systems
designed around the concepts of fault tolerance.

In essence, they must be able to continue working
to a satisfactory level in the presence of faults.


[Figure: conceptual design of a segregated-component
fault-tolerant computer]


Fault Tolerant Multiprocessor


Introduction
Fault tolerance has often been considered a good additional
feature for multiprocessor systems, but nowadays it is
becoming an essential attribute.

Fault tolerance can be achieved by the use of dedicated,
customized hardware, which may have the disadvantage of
high cost.

Another approach to fault tolerance is to exploit existing
redundancy in multiprocessor systems via a task scheduling
software strategy based on time redundancy.
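
Time redundancy can be illustrated with a minimal sketch: the same task is executed more than once on processors the system already has, and a result is accepted once two executions agree. The function names, the "two matching results" rule, and the simulated faulty processor are assumptions for illustration, not a specific published scheduling strategy.

```python
from collections import Counter


def run_with_time_redundancy(task, processors, max_runs=3):
    """Repeat `task` on existing processors until two results agree.

    `processors` is a list of callables that execute the task. Each repeat
    uses the next processor in turn, so no dedicated spare hardware is needed.
    Results must be hashable so they can be counted.
    """
    results = Counter()
    for run in range(max_runs):
        processor = processors[run % len(processors)]
        result = processor(task)
        results[result] += 1
        if results[result] >= 2:       # two matching executions: accept
            return result
    raise RuntimeError("no two executions agreed")


def healthy(task):
    return task()


def faulty(task):
    return task() + 1                  # hypothetical processor that corrupts the result


answer = run_with_time_redundancy(lambda: 6 * 7, [faulty, healthy, healthy])
```

The cost of this approach is execution time rather than hardware: the task above runs three times before a result is accepted.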

Fault tolerant software


N-Version Programming
Recovery Block Approach


N-Version Programming
The N-version software concept attempts to parallel the
traditional hardware fault tolerance concept of N-way
redundant hardware.

In an N-version software system, each module is made
with up to N different implementations. Each variant
accomplishes the same task, but hopefully in a different
way.

Each version then submits its answer to a voter or
decider, which determines the correct answer.


This system can hopefully overcome the design faults
present in most software by relying upon the design
diversity concept.
An important distinction in N-version software is the
fact that the system could include multiple types of
hardware using multiple versions of software.
The goal is to increase the diversity in order to avoid
common mode failures.
Using N-version software, it is encouraged that each
different version be implemented in as diverse a
manner as possible, including different tool sets,
different programming languages, and possibly
different environments.
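
A minimal sketch of the voter in an N-version system is shown here. Three hypothetical versions of a square-root module are used, one of which contains a deliberate design fault. The strict-majority rule and the rounding used to compare floating-point answers are assumptions. In practice the versions would be developed independently rather than written side by side.

```python
from collections import Counter
import math


def sqrt_newton(x):
    guess = x if x > 1 else 1.0
    for _ in range(60):                # Newton's iteration for the square root
        guess = 0.5 * (guess + x / guess)
    return round(guess, 6)             # rounded so that answers can be compared


def sqrt_library(x):
    return round(math.sqrt(x), 6)


def sqrt_faulty(x):
    return round(x / 2, 6)             # hypothetical version with a design fault


def n_version_execute(versions, value):
    """Run every version and return the answer given by a strict majority."""
    answers = []
    for version in versions:
        try:
            answers.append(version(value))
        except Exception:
            pass                       # a crashed version simply casts no vote
    if answers:
        answer, votes = Counter(answers).most_common(1)[0]
        if votes > len(versions) / 2:
            return answer
    raise RuntimeError("no majority among versions")


result = n_version_execute([sqrt_newton, sqrt_library, sqrt_faulty], 9.0)
```

The voter masks the faulty version only because the other two agree. If two versions shared the same design fault (a common mode failure), the wrong answer would win the vote.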

Recovery Block Approach


The recovery block operates with an adjudicator, which
confirms the results of various implementations of the
same algorithm.

In a system with recovery blocks, the system view is
broken down into fault-recoverable blocks.

The entire system is constructed of these fault-tolerant
blocks.

Each block contains at least primary, secondary, and
exceptional case code, along with an adjudicator.
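
One recovery block might be sketched as follows, under these assumptions: the adjudicator is an acceptance test, state is checkpointed by deep copy, and alternates are tried in order. The sorting routines, including the deliberately faulty primary, are hypothetical.

```python
import copy
from collections import Counter


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order, each starting from the checkpointed state.

    Returns the first result that the acceptance test (adjudicator) accepts.
    """
    checkpoint = copy.deepcopy(state)          # saved so failed attempts leave no trace
    for alternate in alternates:
        attempt = copy.deepcopy(checkpoint)    # roll back to the checkpoint
        try:
            result = alternate(attempt)
        except Exception:
            continue                           # exceptional case: move to the next alternate
        if acceptance_test(checkpoint, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def primary_sort(items):
    items.pop()                                # hypothetical design fault: drops an element
    return sorted(items)


def secondary_sort(items):
    return sorted(items)


def acceptance_test(original, result):
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


data = [3, 1, 2]
result = recovery_block(data, [primary_sort, secondary_sort], acceptance_test)
```

Unlike the N-version voter, the alternates here run one after another, and the secondary is executed only when the primary's result is rejected.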


Fault tree diagrams


Figure 1 shows a simple fault tree diagram in
which either A or B must occur in order for the
output event to occur. In this diagram, the two
events are connected to an OR gate.

Alternatively, if both input events must occur in
order for the output event to occur, then they are
connected by an AND gate.
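
The two gate types can be written out as a short sketch. The Boolean functions follow directly from the definitions of the gates. The probability functions additionally assume that the input events are statistically independent, which the slides do not state. The example probabilities are hypothetical.

```python
import math


def or_gate(*events):
    """Output event occurs if at least one input event occurs."""
    return any(events)


def and_gate(*events):
    """Output event occurs only if every input event occurs."""
    return all(events)


def or_probability(*probabilities):
    # Assumes independent events: 1 - P(no input event occurs).
    return 1 - math.prod(1 - p for p in probabilities)


def and_probability(*probabilities):
    # Assumes independent events: product of the input probabilities.
    return math.prod(probabilities)


# Hypothetical probabilities for input events A and B.
p_a, p_b = 0.1, 0.2
p_or = or_probability(p_a, p_b)
p_and = and_probability(p_a, p_b)
```

With the same inputs, an OR gate always gives an output probability at least as high as an AND gate, which is why redundancy (modelled by AND gates) lowers the probability of the top-level failure.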


THANK YOU
