You are on page 1of 28

International Institute of Professional Studies

Research Paper Submission

on

Analysis of Costs of Recovery of Message Logging Protocols


Date: 15th Oct,’01
Submitted to: Ms. Yamini Karmarkar
Submitted by: Megha Jain
IM-98-29
M.M.S. (5Yrs.)
VII Semester
ACKNOWLEDGEMENTS

I am grateful to Ms. Yamini Karmarkar, who gave me an opportunity to


present this research paper; and under whose able guidance and experience,
this research paper was possible.

I am indebted to the Library resources and Librarians of International


Institute of Professional Studies who provided me with the relevant journals
and literatures used in this research.

My thankfulness is also extended to my peers and colleagues for their


support and encouragement.

Megha Jain
IM-98-29
Abstract – Past research in message logging has focussed on studying the relative
overhead imposed by pessimistic, optimistic and causal protocols during failure free
executions. Applications face a complex trade-off when choosing a message logging
protocol for fault tolerance. In this empirical research, an evaluation of the performance
of these protocols during recovery is presented.

1. AIM

To analyze the effect of T[chk], T[acq], T[rollfwd], T[replay] and


T[rollbck] on T[rec]; where:
T[chk] – Time required to restore the state of the failed process from its
latest checkpoint.
T[acq] – Time required to retrieve determinants and messages logged during
failure-free execution.
T[rollfwd] – Time required to roll-forward the execution of process to its
precrashed state.
T[replay] – Time required to replay messages to the recovering process
from the acquired logs.
T[rollbck] – Time required to roll-back orphan processes.
T[rec] – Recovery time.

2. INTRODUCTION

Message-logging protocols are popular techniques for building systems that


can tolerate process crash failures. These protocols are built on the
assumption that the state of a process is determined by its initial state and by
the sequence of messages it delivers. In principle, a crashed process can be
recovered by: -

• Restoring the process to its initial state.


• Rolling it forward by replaying to it messages in the same order in which
they were delivered before the crash.
In practice, message-logging protocols limit the extent of roll-forward by
having each process periodically save its local state in a checkpoint.

The delivery order of each message is recorded in a tuple, called the


message’s determinant. If the determinants of all the messages delivered by
a crashed process are available during recovery, then the process can be
restored to a state consistent with the state of all operational processes. An
operational process whose state is inconsistent with the recovered state of a
crashed process is called an orphan process.

All message-logging protocols guarantee that upon recovery, no process is


an orphan, but differ in the way they enforce this consistency condition. On
this basis they can be classified into: -

• Pessimistic protocols: They require that a process, before sending a


message, synchronously log on stable storage the determinants and the
content of all messages delivered so far. Thus, these protocols never create
orphan processes. There are two pessimistic protocols:
1) Receiver-based Pessimistic: A process, before sending a message, logs
to stable storage both the determinants and the contents of the messages
delivered so far.
2) Sender-based Pessimistic: The receiver logs synchronously to stable
storage only the determinant of every message it delivers, while the
contents of the message are stored in a volatile send log kept by the
message’s sender.

• Optimistic protocols: They allow processes to communicate even if the


determinants on which they depend are not yet logged on stable storage.
They only require them to eventually reach stable storage. However, if any
of the determinants are lost when a process crashes, orphans may be created.
To reach a consistent global state, these processes must be identified and
rolled back.

• Causal protocols: They combine some of the positive aspects of


pessimistic and optimistic protocols. They never create orphans, yet they do
not write determinants to stable storage synchronously. Rather, the
determinants are logged on to volatile memory. Here, if the state of an
operational process p causally depends on the delivery of a message m, then,
p has a copy of m’s determinant in its volatile memory. This property is
sufficient to restore a crashed process in a state consistent with the state of
all operational processes.

Applications face a complex trade-off when choosing a message logging


protocol for fault tolerance. On the one hand, optimistic protocols can
provide fast failure-free execution and good performance during recovery,
but are complex to implement and can create orphan processes. On the other
hand, orphan-free protocols, like pessimistic and causal, either risk being
slow during recovery, or incur a substantial overhead during failure free
execution. This trade-off needs to be addressed.

3. LITERATURE REVIEW

• L. Alvisi and K. Marzullo, “Tradeoffs in Implementing Optimal


Message Logging Protocols”, Principles of Distributed Computing, pp. 58-
67, June 1996.
• L. Alvisi and K. Marzullo, “Message Logging: Pessimistic, Optimistic,
Causal and Optimal”, IEEE Trans. Software Eng., vol. 24, no.2, pp. 149-
159, Feb. 1998.
• O.P. Damani and V.K. Garg, “How to Recover Efficiently and
Asynchronously when Optimism Fails”, Distributed Computing Systems,
pp. 108-115, 1996.
• S. Rao, L. Alvisi and H.M. Vin, “Egida: An Extensible Toolkit for Low-
Overhead Fault-Tolerance”, IEEE Fault-Tolerant Computing Symp. FTCS-
29, Madison, Wis., pp. 48-55, June 1999.
• S. Rao, L. Alvisi and H.M. Vin, “Cost of Recovery in Message Logging
Protocols”, IEEE Trans. on Knowledge and Data Eng., vol. 12, no. 2, March
2000.
• F.B. Schneider, “Implementing Fault-Tolerant Services Using the State
Machine Approach: A Tutorial”, Computing Surveys, vol. 22, no.3, pp. 299-
319, Sep. 1990.
• E.N. Elnozahy, “On the Relevance of Communication Costs of Rollback
Recovery Protocols”, Principles of Distributed Computing, pp. 74-79, Aug.
1995.
4. HYPOTHESIS

On the basis of the literature review and from the understanding of the
concepts, I have formulated the following hypothesis: -

“There is a high correlation between the T[rec] and the T[rollfwd] in case of
receiver-based pessimistic, sender-based pessimistic and causal protocols;
and between T[rec] and T[replay] in case of optimistic protocols.”

And also,

“The best message logging protocol is one with the lowest T[rec].”
4.1 The Null Hypothesis

Ho: “There is no significant dependence of T[rec] on T[rollfwd] and


T[replay].”

4.2 The Alternate Hypothesis

“There is a significant dependence of T[rec] on T[rollfwd] and T[replay].”

5. METHODOLOGY

The cost of recovery in message logging protocols is measured using Egida,


an object-oriented toolkit for synthesizing rollback recovery protocols. The
five long-running compute-intensive applications – bt, cg, lu, sp, and mg –
from the NPB2.3 benchmark suite developed by NASA’s Numerical
Aerodynamic Simulation program. This information is obtained from the
research conducted by S. Rao, L. Alvisi and H.M. Vin.

5.1 Assumption

It is assumed that processes can fail by crashing, but the hardware is reliable.
Also, other overheads on a system are not considered.

5.2 The Variables

For pessimistic and causal protocols, the recovery time (denoted by T[rec])
for a process consists of: -

1) T[chk], the time to restore the state of the failed process from its latest
checkpoint.
2) T[acq], the time to retrieve determinants and messages logged during
failure-free execution.
3) T[rollfwd], the time to roll-forward the execution of the process to its
precrashed state.

For optimistic protocols, on the other hand, in addition to T[chk] and


T[acq], the recovery time T[rec] consists of: -
1) T[replay], the time to replay messages to the recovering process from
the acquired logs.
2) T[rollbck], the time required to rollback orphans.

Note: T[acq] is protocol dependent – for pessimistic and optimistic


protocols, it is the time to read logs from the file server, while, for causal
protocols, it is the time to collect messages and determinants from the logs
of the remaining processes. In the case of multiple failures, the values of
T[chk], T[acq], T[rollfwd], T[replay] and T[rollbck] are taken to be the
processes that take the longest to recover.

5.3 The Method

For all protocols, the value of T[rec] is determined by the three parameters: -
1) The number of concurrent failures, f. For optimistic protocols, multiple
failures may cause a process to rollback multiple times. For sender-based
pessimistic and causal protocols, multiple failures may complicate the task
of retrieving messages and determinants from other processes.

2) The number of processes, n. For causal and sender-based optimistic


protocols, n may affect T[acq] because it may change the set of processes
from which a recovering process collects its logs. For optimistic protocols, n
may affect T[rollbck] because it may change the number of orphans.

3) The time, t since the latest checkpoint occurrence at which a failure is


induced. For all protocols, this parameter affects the amount of lost
computation that has to be recovered and the size of the logs that have to be
acquired by the recovering process.

Note: To compare fairly the recovery performance of the four logging


protocols for a given application, t should be the same for all protocols.
Once a t is chosen, the iterative nature of applications to compute the
number of iterations i necessary to run each application is used.

5.4 Communication Patterns of the Benchmark Applications


5.5 The Model for Empirical Analysis

1) For receiver-based pessimistic protocols:


T[rec] = T[chk] + T[acq] + T[rollfwd]

2) For sender-based pessimistic protocols:


T[rec] = T[chk] + T[acq] + T[rollfwd]

3) For optimistic protocols:


T[rec] = T[chk] + T[acq] + T[replay] + T[rollbck]

4) For causal protocols:


T[rec] = T[chk] + T[acq] + T[rollfwd]

Note: To measure the values of these protocols, the experiments were


repeated for each protocol until the variance for T[chk], T[acq], T[rollfwd],
T[replay], T[rollbck] and T[rec] was within 1% of their average value.
These figures are taken from the work of S. Rao et all.

6. DATA COLLECTION

The data was taken from IEEE Transactions on Knowledge and Data
Engineering, Volume 2, No. 2, March/April 2000; from the research of
Sriram Rao, Lorenzo Alvisi and Harrick M. Vin.
7. FINDINGS

The findings from different statistical analysis on the data set: -

7.1 Correlation coefficients


See annexes.

7.2 Regression Analysis and Standard Error


See annexes.

7.3 Measures of Central Tendency

The mean is used as the measure of central tendency. The means of different
data sets: -
1) Receiver-based Pessimistic: Mean T[rec] = 168.39 sec.
2) Sender-based Pessimistic: Mean T[rec] = 187.50 sec.
3) Optimistic: Mean T[rec] = 176.60 sec.
4) Causal: Mean T[rec] = 197.90 sec.

7.4 Standard Deviation

The values for standard deviation (S.D.) for the data: -


1) Receiver-based Pessimistic: S.D. = 10.55 sec.
2) Sender-based Pessimistic: S.D. = 13.51 sec.
3) Optimistic: S.D. = 9.90 sec.
4) Causal: S.D. = 20.37 sec.

7.5 Confidence Interval

The confidence interval is calculated at 5% significance level for: -


1) Receiver-based Pessimistic: 5.34
2) Sender-based Pessimistic: 6.84
3) Optimistic: 5.01
4) Causal: 10.31

7.6 T-test Statistics

Two-tailed t-test is used.

1) Receiver-based Pessimistic: -
Table value of t = 2.179 (at 5% level of significance and 12 degree of
freedom). Calculated value of t = 0.10755

2) Sender-based Pessimistic: -
Table value of t = 2.179 (at 5% level of significance and 12 degree of
freedom). Calculated value of t = 0.02822

3) Optimistic: -
Table value of t = 2.201 (at 5% level of significance and 11 degree of
freedom). Calculated value of t = 0.0000038
4) Causal: -
Table value of t = 2.179 (at 5% level of significance and 12 degree of
freedom). Calculated value of t = 0.00267

7.7 R- square Statistics

The R-square statistics obtained for the data set: -

1) Receiver-based Pessimistic: 0.907557


2) Sender-based Pessimistic: 0.69997
3) Optimistic: 0.62042
4) Causal: 0.55734

7.8 F-test Statistics


F-test applied on the data set: -

1) Receiver-based Pessimistic: 0.62777


2) Sender-based Pessimistic: 0.91958
3) Optimistic: 0.48144
4) Causal: 0.20094

8. RESULTS

8.1 Correlation Coefficients

The correlation coefficients lie between 0.3 – 1, which implies the


correlation ranges from high to near perfect. This suggests that there is near
perfect correlation between T[rec] and T[rollfwd]; and high correlation
between T[rec] and T[replay] in case of all the message logging protocols.
8.2 Regression Analysis and Standard Error

Using the regression analysis, we can find the value of the dependent
variable, T[rec], given the values of the independent variables, T[chk],
T[acq], T[rollfwd], T[replay] and T[rollbck]. For this calculation, the values
of standard errors must be taken into account. The linear regression model
fits this calculation.

8.3 Measures of Central Tendency (Mean)

Using this feature of the descriptive statistical tools, the average value of the
cost of recovery (in terms of time in seconds), T[rec], was found out to be
minimum for Receiver-based message logging protocols (168.39 sec). This
suggests that (not considering other overheads) the best method is Receiver-
based message logging protocols.
8.4 Confidence Interval

The mean, standard deviation and the level of significance (here 5%) were
used to calculate the confidence interval. In our samples of data, we are 95%
confident that the mean is in the confidence interval i.e.: -
1) Receiver-based Pessimistic: 168.39 + 5.34 sec
2) Sender-based Pessimistic: 187.5 + 6.84 sec
3) Optimistic: 176.6 + 5.01 sec
4) Causal: 197.9 + 10.31 sec

It gives the upper and the lower limit within which the value can lie.

8.5 Significance of t-test

The t-test shows whether or not the hypothesis holds. If the calculated t-
value < table t-value, the hypothesis is accepted. Else, it is rejected and the
alternate hypothesis is accepted.
1) Receiver-based Pessimistic: -
0.10755 < 2.179, the hypothesis “there is a significant dependence of
T[rec] on T[rollfwd]”; is accepted.
2) Sender-based Pessimistic: -
0.02822 < 2.179, the hypothesis “there is a significant dependence of
T[rec] on T[rollfwd]”; is accepted.
3) Optimistic: -
0.0000038 < 2.201, the hypothesis “there is a significant dependence of
T[rec] on T[replay]”; is accepted.
4) Causal: -
0.00267 < 2.179, the hypothesis “there is a significant dependence of
T[rec] on T[rollfwd]”; is accepted.
8.6 Significance of R-square test

It measures the total percentage variation in dependent variable, T[rec], due


to variations in independent variables, T[rollfwd] and T[replay]. Here: -

1) Receiver-based Pessimistic: 0.907557 implies 90.76%


2) Sender-based Pessimistic: 0.69997 implies 70%
3) Optimistic: 0.62042 implies 62.04%
4) Causal: 0.55734 implies 55.73%

The closer the value of R-square is to 100%, the more is the probability that
we have included the most important variables in our analysis.

8.7 Significance of F-statistics

It tests the significance of R-square test and the overall validity of the
model. If calculated F-value <= table F-value, the null hypothesis is
accepted. Else, it is rejected. For the model to be acceptable, the null
hypothesis should be rejected.
1) Receiver-based Pessimistic: -
0.62777 < 4.18; the null hypothesis is rejected.
2) Sender-based Pessimistic: -
0.91958 < 4.18; the null hypothesis is rejected.
3) Optimistic: -
0.48144 < 4.10; the null hypothesis is rejected.
4) Causal: -
0.20094 < 4.18; the null hypothesis is rejected.

Thus, the null hypothesis is rejected and the overall model is held
significant.

9. CONCLUSION

There is a high correlation between the cost of recovery of message logging


protocols, T[rec], and the time required to roll-forward, T[rollfwd], and the
time required to replay the message, T[replay]. Also, the cost of recovery in
Receiver-based Pessimistic Protocols is the minimum.

10. AREAS OF FURTHER RESEACH

Further research can be done on the implementation of a relatively new


concept of Hybrid Protocols suggested by S. Rao et all.
The objective of hybrid protocols is to maintain the failure-free performance
of sender-based protocols while approaching the performance of receiver-
based protocols during recovery. In hybrid protocols, messages are logged
both at the sender and at the receiver. The sender synchronously logs
messages in its volatile memory; the receiver asynchronously logs messages
to stable storage. Since logging on stable storage is asynchronous,
performance during failure-free executions is virtually identical to that of
sender-based protocols. However, recovery is much faster. Any missing
message is either available in the volatile memory of other operational
processes or it has to be regenerated during recovery if the sender has failed.
In either case, no correct process ever becomes an orphan and the recovering
process can roll forward using the messages on stable storage while in
parallel acquires the missing messages.

You might also like