
Lewis Sykalski

A Case Study In Reliability Analysis

0. Abstract

The project analyzed the dependability gains achievable through the use of design diversity across two
Oracle releases. Bug reports from Oracle Metalink were reproduced against multiple releases, and
failure analysis was performed to determine the effectiveness of a design-diversity fault-tolerance
approach. In addition, reliability analysis and prediction were performed on the NCW Data
Collector software, using failure logs generated from past simulation events together with a
software reliability analysis tool called CASRE (Computer Aided Software Reliability Estimation).

1. Introduction

It has been established that both designing for fault tolerance and performing reliability analysis
not only bring better quality to the product but also help lower costs by allowing the analyst to detect
reliability trends earlier in the lifecycle, when they can still be corrected. However, it is
important to realize that an equilibrium exists for each product at which the utility of
doing more reliability analysis and fault-tolerance activities is overcome by the utility of stopping.
Furthermore, fully reliable software can never be achieved. A more attainable objective is
to produce software that is reasonably reliable, or that meets the customer's
requirements for software reliability. Despite the importance and cost benefits, reliability is quite
often ignored by enterprises, and when the schedule starts to slip the established QA/reliability
activities are abandoned rather rapidly. This project attempts to demonstrate the importance of
reliability activities as they relate to the software environment at my work. It also allows me to
gain experience with these reliability methodologies in the hope that I might employ some of them in my
everyday tasking at work.

2. Background

The Data Collector software provides a means of collecting data associated with an
experiment for later comprehensive analysis. The software is written in Java and is architecturally
very modular and versatile. Furthermore, it has the flexibility to support remote databases through
the JDBC and Java RMI APIs. Its sole purpose is to listen to network DIS PDU (Protocol Data
Unit) traffic in the form of UDP datagram packets, as well as XML packets on a separate port, and
then refine and record that data to an Oracle database. There is a front-end program
(Hyperion Interactive Reporting Studio) that serves as a graphical window into the database;
however, it is not the focus of the reliability analysis. Figure 2.1 below
illustrates the simulation environment and, consequently, the possible sources of network PDUs for
collection.
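
To make the collection flow concrete, the following is a minimal Java sketch of such a listener loop. It is not the Data Collector's actual code: the port number, connection URL, credentials, and table layout are assumptions made purely for illustration.

// Minimal sketch of the collector's basic loop: listen for DIS PDUs as UDP
// datagrams and persist a refined record over JDBC. Port, URL, credentials,
// and table layout are hypothetical, not the real program's configuration.
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public final class PduCollectorSketch {
    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket(3000);     // DIS traffic port (assumed)
             Connection db = DriverManager.getConnection(
                     "jdbc:oracle:thin:@dbhost:1521:ncw", "collector", "secret")) {
            byte[] buf = new byte[8192];
            PreparedStatement insert = db.prepareStatement(
                    "INSERT INTO pdu_log (recv_time, pdu_type, raw_bytes) VALUES (?, ?, ?)");
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);                            // blocks until a PDU arrives
                int pduType = buf[2] & 0xFF;                       // DIS header: PDU type is the third octet
                insert.setTimestamp(1, new java.sql.Timestamp(System.currentTimeMillis()));
                insert.setInt(2, pduType);
                insert.setBytes(3, java.util.Arrays.copyOf(buf, packet.getLength()));
                insert.executeUpdate();                            // "refine and record" step, much simplified
            }
        }
    }
}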

The different simulation players in Figure 2.1 are further described in Table 2.1 below in terms of
their relation to data collection.
Player                       Description
CAOC                         Combined Air Operations Center. Responsible for sending XML-based
                             EW Reports to players.
DC                           Data Collector. Program responsible for collection and refinement
                             of experiment data.
EADSIM                       Environment model that controls certain red players. Responsible
                             for Entity State, Detonation, and Electromagnetic Emission PDUs.
FUSION                       Fusion algorithm which transmits fused EW Reports to all players
                             in the form of XML-based packets.
Humvee Sim                   A man-in-the-loop simulation for controlling a Humvee. Entity
                             State, Detonation, and Fire PDUs.
JABE                         Environment model that controls certain blue players. Responsible
                             for all cockpit PDUs defined by Man-in-the-loop Cockpit below.
JIMM                         Environment model that controls certain red players. Responsible
                             for Entity State, Detonation, and Electromagnetic Emission PDUs.
JSAF                         Environment model that controls certain red players. Responsible
                             for Entity State, Detonation, and Electromagnetic Emission PDUs.
JTAC                         Joint Terminal Attack Controller. Responsible for sending
                             XML-based NTISR Assignments to players.
Man-in-the-loop Cockpit      A manned player (F-16, F-22, or F-35) that sends a wide range of
                             PDUs for collection (Entity State, Detonation, XML-based, etc.).
Man-in-the-loop Threat Sims  A man-in-the-loop simulation for controlling a SAM site. Entity
                             State, Detonation, and Fire PDUs are the pertinent PDUs these
                             transmit to DC.
Other Sims                   A wide variety of other sims that speak DIS. Entity State,
                             Detonation, and Fire PDUs are the main PDUs Data Collection is
                             interested in from these.
Police Car Sim               A man-in-the-loop simulation for controlling a police car. Entity
                             State, Detonation, and Fire PDUs.
VBMS                         Virtual Battlespace Management Software. A God's-eye viewer; not
                             responsible for transmission of any PDUs.
WCS                          White Control Station. A God's-eye viewer; not responsible for
                             transmission of any PDUs needed by data collection.
Table 2.1: Simulation Player Descriptions
The DIS standard defines 67 different PDU types, arranged into 12 families. The data collector
software collects & refines data from the following DIS PDUs & XML PDUs:
(Entity Information/Interaction family) - Entity State PDU
(Warfare family) - Fire PDU, Detonation PDU
(Distributed Emission Regeneration family) - Electromagnetic Emission PDU
(XML Custom PDUs) - Emcon Status, EW Report, EW Fused Report, MBMS, NTISR
Assignment, etc.

3. Problem

It's been eight months now since I came into possession of this software. From what I can gauge of
the environment, reliability is important, but not overly so. The software is, for the
most part, allowed to miss a few data PDUs here and there. If the program aborts, however, that
is unacceptable. When a catastrophic failure occurs and data for the experiment is lost, the
simulation run must be thrown away, wasting the time of everyone involved. In some of our larger
experiments, this can be upwards of 40 people.
Unfortunately, in the past, reliability has been unacceptable and the software has established a
bad reputation. I've heard many horror stories from my new colleagues about data lost during past
experiments due to crashes, configuration, and transport issues. To prevent some of these kinds of
problems from recurring, I have personally added a few fail-safe measures. For example, if the TCP
socket to Oracle fails, I close the connection and reinitialize it. In addition, most exceptions
are wrapped with try/catch blocks so that the program fails safely and continues. Most importantly,
however, I have built a decent level of verbosity into the program. Should a problem occur, it will
be more readily transparent and easily located.
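
The reconnect safeguard described above might look roughly like the following sketch; the retry count, back-off delay, and connection details are illustrative assumptions rather than the program's actual values.

// Minimal sketch of a reconnect-on-failure safeguard for the Oracle connection.
// Retry bound, delay, and credentials are illustrative, not the real values.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

final class ReconnectingWriter {
    private Connection con;
    private final String url, user, pass;

    ReconnectingWriter(String url, String user, String pass) throws SQLException {
        this.url = url; this.user = user; this.pass = pass;
        this.con = DriverManager.getConnection(url, user, pass);
    }

    /** Attempt the write; on a broken connection, close, reconnect, and retry once. */
    void writeRecord(String sql) {
        for (int attempt = 0; attempt < 2; attempt++) {
            try (Statement st = con.createStatement()) {
                st.executeUpdate(sql);
                return;                                       // success: record written
            } catch (SQLException e) {
                try { con.close(); } catch (SQLException ignored) { }
                try {
                    Thread.sleep(1000);                       // brief back-off before reinitializing
                    con = DriverManager.getConnection(url, user, pass);
                } catch (Exception reconnectFailure) {
                    // Fail safely: log and drop this record rather than aborting the program.
                    System.err.println("Reconnect failed: " + reconnectFailure.getMessage());
                    return;
                }
            }
        }
    }
}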
Despite many of my reliability revisions, I still encounter many crashes and other failures.
This is because each event is its own environment, with different PDUs of interest,
different scenarios, different software loads, and different corporate entities involved. Furthermore,
problems are at times exacerbated by the fact that the other entities also enjoy a lax quality
assurance environment. This most often results in the receipt of garbage datagrams, which I
must account for and recover from in my implementation.
In choosing an application project, I thought it important to examine both components.
Overall reliability, after all, is the product of the individual component reliabilities (for
components in series). With Oracle, a fault containment strategy must be employed, as we have no
control over the source and you get what you get. After reading a paper by Gashi, Popov, and
Strigini about the use of design diversity in DBMS products, I decided to analyze a similar approach
utilizing multiple versions of Oracle. If the strategy were found to be useful, it could be employed
as a fault-tolerance/containment technique for Oracle. For the Java NCW Data Collector, a different
strategy was necessary. For this component, it was decided that reliability trend analysis and
prediction would be more beneficial. This strategy, if employed properly, could help detect issues
before it is too late in the development/integration cycle to resolve them. I thus resolved to
analyze reliability both across lifecycle phases (integration/execution) and across simulation events.

4. Strategy

4.1 Design Diversity Strategy

Approximately 20 bug reports each for Oracle 9i and 10g will be taken from the Oracle
Metalink site using a pseudo-random approach. My main requirement was that each bug
come with a bug script or a detailed description of how to reproduce it. In order to get a controlled
set of inputs, I needed to be very careful about my selection criteria for bug
scripts. The following criteria are the conditions I hoped to satisfy in selecting bug scripts:
1. Date independent: I wanted to remove dates as a factor, especially in the selection of the
Oracle 9.2 bugs, since picking an earlier report date would make it more likely that the bug
had already been fixed in a later version. To do this, I sorted on other columns, effectively
ignoring the report date column and removing this dependency.
2. Easy to reproduce: While I would like to consider myself an expert in this domain, I
am not. To satisfy this criterion, the bug needed a well-detailed bug
report or a detailed description of the stimuli so that it could be reproduced.
3. Type independent: While some types will naturally be easier to reproduce and thus
appear more frequently in my sample, I will do my best to largely ignore this attribute.
I thus hope the type distribution is a fair representation of the frequency of each error
type within a release.
Bugs were then run on both versions of Oracle and failures were documented. Results were
classified by failure type, and self-evidence as well as divergence was noted, with the comparison
performed along the lines of the sketch below.
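
As an illustration only (not the harness actually used), a bug script's statement can be run against both releases over JDBC and the results compared. The connection URLs, credentials, and placeholder query below are hypothetical.

// Illustrative sketch: run the same statement against two Oracle releases and
// flag divergent results. URLs, credentials, and the query are hypothetical.
import java.sql.*;
import java.util.*;

public final class DiversityCheck {

    static List<List<String>> run(String url, String sql) throws SQLException {
        List<List<String>> rows = new ArrayList<>();
        try (Connection con = DriverManager.getConnection(url, "scott", "tiger");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            int cols = rs.getMetaData().getColumnCount();
            while (rs.next()) {
                List<String> row = new ArrayList<>();
                for (int c = 1; c <= cols; c++) row.add(String.valueOf(rs.getObject(c)));
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Placeholder query; the Metalink bug script's statement would be substituted here.
        // A real comparison would also need ORDER BY for a deterministic row order.
        String sql = "SELECT * FROM dual";
        try {
            List<List<String>> v92  = run("jdbc:oracle:thin:@host92:1521:orcl92", sql);
            List<List<String>> v100 = run("jdbc:oracle:thin:@host10:1521:orcl10", sql);
            System.out.println(v92.equals(v100) ? "Non-divergent results" : "DIVERGENT results");
        } catch (SQLException e) {
            // An exception raised by either release is a self-evident failure.
            System.out.println("Self-evident failure: " + e.getMessage());
        }
    }
}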

4.2 Reliability of NCW DC Strategy

In preparation for this activity, both S-Plus and CASRE were downloaded and examined for
suitability. CASRE was determined to be more suitable for this project, as it was easier to
understand and required less overhead from the analyst. Suitable log files were also gathered
from around the labs; once gathered, they were organized by both run and experiment.
For the reliability analysis portion, the NCW Data Collector log files were then run through
CASRE to determine reliability through a variety of metrics. CASRE was used in lieu of S-Plus
because of the short turn-around time required for this project. To facilitate this, prep
work was done to determine keywords of interest. In addition, a simple JavaScript program was
written to parse the log files into an intermediate format, extracting the time of program start,
time of program termination, times of thread terminations, and exception or failure messages
(a rough sketch of this step follows Table 4.2 below). The severity of each failure was then
determined by examining the exception in the intermediate log file and comparing it with the
failure descriptions in Table 4.2 below. Failure information was then translated by hand into
CASRE's internal format; the definition of this format can be found in Appendix C.1.

Severity Code   Failure Description
9               Failure causes machine to be rebooted, causing catastrophic loss
8               Failure causes program abort
7               Failure causes program thread abort
5               Failure causes record not to be written; thread continues
3               Failure causes incorrect data to be written; thread continues
1               Failure is caught, handled, and recovers correctly
Table 4.2: Reliability Analysis Severity Definitions
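
A rough Java sketch of the log-triage step described above is shown below (the report's own parser was a separate script). The log keywords and the keyword-to-severity mapping are assumptions keyed loosely to Table 4.2, not the program's actual messages.

// Hypothetical log triage: extract lines of interest and map them onto
// Table 4.2 severity codes. Keywords and mapping are illustrative assumptions.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public final class LogTriage {

    /** Map an extracted failure message onto a Table 4.2 severity code (assumed keywords). */
    static int severityOf(String line) {
        if (line.contains("OutOfMemoryError") || line.contains("Program abort")) return 8;
        if (line.contains("Thread terminated"))                                   return 7;
        if (line.contains("record dropped"))                                      return 5;
        if (line.contains("parse error"))                                         return 3;
        return 1;  // caught, handled, and recovered
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("Exception") || line.contains("Program start")
                        || line.contains("Program termination") || line.contains("Thread terminated")) {
                    // Emit an intermediate record: the original line plus its severity code.
                    System.out.println(line + "  => severity " + severityOf(line));
                }
            }
        }
    }
}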
In this exercise I performed analysis on 10 runs from two different unclassified experiments,
CALOE-08 and MAGTF-08 (see Appendix A.1 for in-depth event descriptions). I could
not go back any farther due to a lack of log files, caused by poor configuration management on the
part of my predecessor; even if I had managed to scrape up some older log files, their failure-logging
verbosity would not have allowed for meaningful reliability analysis. I also planned to analyze runs
from both the integration phase and the execution phase of MAGTF-08 to gauge reliability between
life-cycle phases.
Reliability analysis was then to be performed using CASRE's analysis methodology and
built-in charting mechanisms. The CASRE User's Guide includes flow charts detailing usage of
the tool, which became the basis for this exercise (see Appendix C.3: CASRE Usage Flow).
This broke the exercise into four distinct phases: failure count data generation, preliminary
reliability analysis, reliability trend analysis, and reliability prediction. The activities were
then guided by the built-in analysis functions within CASRE.

5. Activities & Results

5.1. Design Diversity Activity & Results

An established set of metrics was chosen based on the journal article "Fault Tolerance via Diversity
for Off-the-Shelf Products: A Study with SQL Database Servers" [Gashi, Popov, Strigini].
Definitions of self-evidence and divergence are provided in Appendix A (Glossary). In addition, the
SQL failure types encountered are provided in Appendix B (Failure Types).

Bug reports were first chosen from 9.2 and run against both 9.2 and 10.0. Results
are reported in Table 5.1.A.
Bug #     Type               9.2 S.E.   10.0 Fails?   10.0 S.E.   Divergent
2357784   Internal Error     X          NO            N/A         X
2299898   Performance/Hang   X          NO            N/A         X
2202561   Incorrect Results             NO            N/A
2221401   Incorrect Results             NO            N/A
2739068   Incorrect Results             NO            N/A
2683540   Incorrect Results             NO            N/A
2991842   Incorrect Results             NO            N/A
2200057   Internal Error     X          NO            N/A
2405258   Internal Error     X          NO            N/A
2716265   Internal Error     X          NO            N/A
2054241   Performance/Hang   X          NO            N/A
2485871   Internal Error     X          NO            N/A
2670497   Internal Error     X          NO            N/A
2659126   Internal Error     X          NO            N/A         X
2064478   Internal Error     X          NO            N/A
2624737   Internal Error     X          NO            N/A         X
1918751   Internal Error     X          NO            N/A
2286290   Incorrect Results             NO            N/A         X
2700474   Incorrect Results             NO            N/A
2576353   Internal Error     X          NO            N/A

Table 5.1.A: Oracle 9.2 Bug Classification Activity


Bug reports were then chosen from 10.0 and run against both 10.0 and 9.2. Results
are reported in Table 5.1.B.

Bug #     Type               10.0 S.E.   9.2 Fails?   9.2 S.E.   Divergent
5731063   Internal Error     X           NO           N/A
3664284   Incorrect Results              NO           N/A
4582808   Incorrect Results              NO           N/A
3895678   Internal Error     X           YES          X
3893571   Internal Error     X           YES          X
3903063   Incorrect Results              YES
3912423   Internal Error     X           NO           N/A
4029857   Engine Crash       X           YES          X
4156695   Incorrect Results              YES
2929556   Internal Error     X           YES          X          X
3255350   Performance/Hang   X           NO           N/A
3887704   Internal Error     X           NO           N/A
3405237   Engine Crash       X           YES          X
3952322   Feature Unusable   X           YES          X
4033889   Incorrect Results              NO           N/A
4060997   Internal Error     X           YES          X
4134776   Internal Error     X           NO           N/A        X
4149779   Incorrect Results              NO           N/A
2964132   Internal Error     X           YES          X
3361118   Internal Error     X           YES          X

Table 5.1.B: Oracle 10.0 Bug Classification Activity


                              Oracle 9.2 scripts       Oracle 10.0 scripts
                              on 9.2     on 10.0       on 10.0    on 9.2
Total Bug Scripts               20          -             20         -
Failures Observed               20          0             20        11
Performance/Hang     S.E.        2          0              1         0
Internal Error       S.E.       11          0             10         6
Engine Crash         S.E.        0          0              2         2
Incorrect Result     S.E.        0          0              0         0
                     N.S.E.      7          0              6         2
Other                S.E.        0          0              1         1
                     N.S.E.      0          0              0         0
In summary:

Oracle 9.2 scripts: 13 self-evident, 7 non-self-evident.
Oracle 10.0 scripts: one product failing: 9 (5 S.E. / 4 N.S.E.); both products failing: 11,
comprising 1 divergent (1 S.E. / 0 N.S.E.) and 10 non-divergent (8 S.E. / 2 N.S.E.).
Total Bug    Failures    1 out of 2 Products Failing      Both DBMS Products Failing
Scripts                     S.E.        N.S.E.          Non-Divergent         Divergent
                                                        S.E.    N.S.E.      S.E.    N.S.E.
   40           40           18           11              8        2          1        0

According to the Gashi, Popov, and Strigini analysis method, the number of failures not detected by
design diversity is the number of non-divergent, non-self-evident failures across all scripts
wherein both products fail. This is because such failures cannot be observed
and occur identically across both DBMS products. Doing this calculation for my scripts, 2/40
(5%) of failures were not detectable.
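
Written out as a worked formula (notation mine, not the paper's):

\[
  P_{\mathrm{undetected}}
  = \frac{\#\{\text{non-divergent and non-self-evident failures with both products failing}\}}
         {\#\{\text{bug scripts}\}}
  = \frac{2}{40} = 5\%.
\]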

5.2. Reliability of NCW DC Activity & Results

The reliability analysis activity was performed by first generating the input CASRE datafiles
(see Appendix C.2 for the files). Each datafile included every run from the integration dry
runs or execution event runs for its respective event. To elaborate, execution event runs
use a Design of Experiments (DOE) standard-deviation set of 10 runs to represent an
experiment. The extra runs present are runs that were either redone due to a variety of
factors (pilot error, cockpit error, etc.) or runs where data collection had to be restarted (not
necessarily the run itself) due to a program abort. As working data collection is a requirement even
on bad runs, we cannot throw these out. Integration runs were not limited in the same fashion.
The files were then run through CASRE individually to track reliability progress within a
simulation set. They were then grouped into dependent-variable sets (integration/execution or
event/event) to track reliability with regard to the independent variable.
The following charts were then generated for both the CALOE Execution/MAGTF Execution and the
MAGTF Integration/MAGTF Execution partitions (see Appendix C.4.1-C.4.5):
Failure Count: a plot of the number of failures observed in a test interval as a function
of the test interval number.
Time Between Failures: a plot of the time since the last failure as a function of failure
number.
Failure Intensity: the failure intensity (failures observed per unit time) as a function of
total elapsed testing time.
Cumulative Failures: the total number of failures observed as a function of total elapsed
time.
Test Interval Length: a plot of the length of each test interval as a function of test
interval number.
Reliability trend analysis was then performed to determine whether the failure count data showed
reliability growth. Two separate tests were employed (see Appendix C.4.6-C.4.7); a minimal sketch
of the Laplace computation follows this list:
Running Average: the running average of the number of failures per interval for failure
count data. If the running average decreases with time (fewer failures per test interval),
reliability growth is indicated.
Laplace Test: the null hypothesis for this test is that occurrences of failures can be
described as a homogeneous Poisson process. If the test statistic decreases with
increasing failure number (test interval number), the null hypothesis can be rejected
in favor of reliability growth at an appropriate significance level. If the test statistic
increases with failure number (test interval number), the null hypothesis can be
rejected in favor of decreasing reliability.
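
The following is a minimal Java sketch of the Laplace factor computation for failure-count data, assuming equal-length test intervals; the failure counts in the example are hypothetical, not the project's data.

// Minimal sketch: Laplace trend test for failure-count data with equal-length
// test intervals. n[i] holds the failures in interval i+1; a strongly negative
// u(k) suggests reliability growth, a positive u(k) decreasing reliability.
public final class LaplaceTrend {

    /** Laplace factor u(k) computed over the first k test intervals. */
    public static double laplaceFactor(int[] n, int k) {
        double total = 0.0;      // total failures in intervals 1..k
        double weighted = 0.0;   // sum of (i-1) * n_i
        for (int i = 1; i <= k; i++) {
            total += n[i - 1];
            weighted += (i - 1) * n[i - 1];
        }
        double numerator = weighted / total - (k - 1) / 2.0;
        double denominator = Math.sqrt((k * k - 1) / (12.0 * total));
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Hypothetical failure counts per test interval (not the report's data).
        int[] failuresPerInterval = {7, 5, 6, 3, 4, 2, 1, 1, 0, 1};
        for (int k = 2; k <= failuresPerInterval.length; k++) {
            System.out.printf("u(%d) = %.3f%n", k, laplaceFactor(failuresPerInterval, k));
        }
    }
}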

Upon successful determination of a reliability growth trend, predictions were made using three
models: NHPP (intervals), Yamada S-Shaped, and Generalized Poisson (ML weighting). Please
note that two more models (Generalized Poisson and Schick-Wolverton) would have been employed
had I been able to get them not to crash. Also note that more models could have been
evaluated had I converted the failure count data to time-between-failures data through
CASRE's built-in randomized sampling function; however, that would have degraded the data
and stretched out my already long task list. Predictions were made at the average interval
length for 15 additional intervals (see Appendix C.4.8). Cumulative failure predictions as well
as reliability curves were then generated and are shown in Appendix C.4.9-C.4.10.
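
For reference, the mean value functions of two of the selected models are standard forms from the literature (the Generalized Poisson form is omitted here); to my understanding, CASRE's NHPP (intervals) model corresponds to the Goel-Okumoto exponential form. In both, m(t) is the expected cumulative number of failures by time t, a is the expected total number of failures, and b is a rate parameter:

\begin{align*}
  \text{NHPP (Goel--Okumoto)}: \quad & m(t) = a\left(1 - e^{-bt}\right) \\
  \text{Yamada delayed S-shaped}: \quad & m(t) = a\left(1 - (1 + bt)\,e^{-bt}\right)
\end{align*}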

6. Results Analysis

6.1 Design Diversity Results Analysis

Design diversity using multiple versions of Oracle has been shown to be quite effective for detecting
failures across multiple releases according to the Gashi, Popov, and Strigini methodology, detecting
the failures of 95% of bug scripts. These numbers are much better than the similar measurements
published by Gashi, Popov, and Strigini and might lead us to believe this is an omnipotent strategy.
However, we must proceed with caution in interpreting the results. First, this is a statistically
insignificant sample and as such should be taken with a grain of salt. Second, this may not be a
fair way to interpret the results. If we are using this as a true design-diversity setup, the
reasoning holds for fault detection. However, if we are using this as a means to smooth out a
transition to a future release, we would be primarily concerned with bug scripts in the future
version, as past-version bugs would likely be contained through other means. Examining just the 20
bug scripts for Oracle 10.0, 6 were N.S.E., and 4 out of these 6 failures would be detected by
utilizing a past release, yielding a 2/20 (10%) non-detection rate.
Furthermore, if we chose to do so, we could be more restrictive in our analysis. Just because
a fault can be detected does not necessarily mean it can be recovered from. Sure,
we can employ rollback or other fault-tolerance strategies once a fault is detected, but that doesn't
necessarily mean we will be successful. Using the metric of both DBMS products failing, we
find that 10/40 (25% for both sets of bug scripts) or 11/20 (55% of Oracle 10.0 bug scripts)
have the potential of not being recoverable.
Despite some of the other ways of slicing the data leading to less fruitful percentages, overall
this strategy seems well suited for detecting faults. Using more releases, or a more diverse set
of releases, would likely yield even better percentages. Similarly, I would expect that using
different products would detect close to 100% of faults.

6.2 Reliability of NCW DC Results Analysis

My expectations were low, as the data was sampled by test interval and therefore discrete and
relatively coarse in nature. However, the raw data proved to have a moderate to high level of
reliability-growth correlation. The MAGTF Integration / MAGTF Execution set, representing different
lifecycle phases, exhibited more reliability growth than the CALOE Execution / MAGTF
Execution set. This was expected, as the integration effort is normally more fraught with problems
needing to be resolved than the execution phase.

The reliability trend analysis exhibited statistically significant correlation under both tests,
indicating that reliability modeling was possible. For the Running Arithmetic Average test, this
was evaluated by observing that the running arithmetic average decreased as the test interval
number increased; when this holds, the number of failures observed per test interval is
decreasing, and hence the software's reliability is increasing. For the Laplace test, this was
evaluated by observing that the test statistic was less than or equal to -1.61,
indicating reliability growth at the 5% significance level or better. This meant that the null
hypothesis, that occurrences of failures follow a homogeneous Poisson process
(a Poisson process in which the rate remains unchanged over time), could be rejected in favor of
the hypothesis of reliability growth at the 5% significance level. Again, both trend tests
exhibited more growth trending in the MAGTF Integration / MAGTF Execution set than in the
CALOE Execution / MAGTF Execution set.

After employing the models with default parameters, I did not arrive at any reliability curve that
fit the data at CASRE's default goodness-of-fit (GOF) significance. This was probably due to
the moderate level of variance in the data (some test intervals with many failures adjacent to clean
test intervals with zero failures). However, by ignoring this criterion as a first-stage screen, I was
able to obtain a model ranking. Additionally, I charted the prequential likelihood, which shows how
much more likely it is that one model will produce accurate predictions than another; this data did
not necessarily track the goodness-of-fit results. Applying the ranking scheme shown below (the
default provided by CASRE), with highest priority given to goodness of fit, I arrived at the
relative rankings also shown below.

The two sets yielded wildly different top reliability models, as shown above (Yamada S-Shaped vs.
NHPP [intervals]); in fact, each set's #1 model was the #3 model in the opposing set. At first I
thought this might be indicative of a divergent answer (after all, we are modeling the same
software), but it is actually an expected result: it demonstrates that different reliability growth
models are better suited to different data sets, one to the lifecycle-phase set
(integration/execution) and another to the event execution set.

The reliability model predictions arrive at different reliability growth curves, due to the influence
of the prior data set, but in both cases reliability is increasing. What is more interesting is that
reliability growth is increasing more toward the end in the Execution/Execution set than in the
Integration/Execution set. The Execution/Execution set's prediction intervals represent new
events in the execution lifecycle phase, while the Integration/Execution set's prediction intervals
represent continued time in the same execution lifecycle or a transition to a maintenance lifecycle
(note: we don't have such a phase in our software development lifecycle). To interpret this: if
transitioning to the beginning of the execution lifecycle phase from the current point in time
[MAGTF Execution end], the reliability models tell us that there is more growth to be had;
however, since we have a positive growth trend between events, it likely won't be as rough as
MAGTF execution or CALOE before it. If instead we continue with more runs within the
execution phase from the current point in time [MAGTF Execution end], the reliability models
tell us that we are already at a relatively stable point.

7. Lessons Learned

The thing I found most beneficial about this project was the practical knowledge gained of the
CASRE tool. Gaining practical experience, in my opinion, is more important
than a deep-seated understanding of the models. In addition, I learned that there is much
volatility in individual models and that one should not read too far into specific numbers;
however, by examining trends across multiple models, we can gauge future reliability.
Finally, I learned that having failure count data is more limiting in the world of software
reliability than having time-between-failures data, because certain models require
time-between-failures data. Time-between-failures data is more granular than failure count data
and can be transformed back into failure count data with 100% precision. The transformation from
failure count data to time-between-failures data, on the other hand, introduces error, as random
sampling must be employed.

8. Follow-up Actions/Summary

Since the results of the design diversity experiment show a strong benefit, a case could be made for
using design diversity to tolerate OTS faults. Realistically, however, nothing will be done in the
near term, as I really don't have the time. In addition, I got yelled at by my manager for even
investigating a transition to Oracle 11. I will, however, keep these results in the back of my head
should I ever get relief from a new team member.
The reliability analysis results for the NCW Data Collector, while useful for seeing where I stand,
will likely result in no follow-up action. Trending and predictions could be employed to identify
reliability shortfalls when new functionality is added, which in turn could be facilitated by adding
an auto-logging capability to NCW to write failures in a CASRE log format. However, I am at
present a one-man team and do not have the time. In addition, reliability, while important in the
sense that data collection must not abort and waste everyone's time, is not a primary focus of our
low- to medium-fidelity simulation. In fact, most problems can be corrected in the integration phase
leading up to the experiment. The most important factor in my conclusion that nothing will be done,
however, is that this type of activity will not generate any interest from management. My manager
(the same one who yelled at me for investigating a transition to Oracle 11) firmly believes that we
have sufficient reliability at present and gets overly upset when I bring up the prospect of
restructuring the code to improve fault tolerance and efficiency. All of this could change, however,
if a string of poor performances arises. So for the time being I will continue to avoid focusing
on forecasting reliability and instead exert effort towards fault containment and resolution.

Appendix A: Glossary

A.1 Simulation Event & Concept Descriptions



CALOE08 (Combat Air Level of Engagement): An experimentation event held in 2008 which
focused on monitoring level of engagement in a combat air environment, factored by the level of
NTISR (Non-Traditional Intelligence, Surveillance, and Reconnaissance) enabled by information
technology.

MAGTF08 (Marine Air Ground Task Force): An experimentation event held in 2008 focused
on MAGTF operations, wherein a balanced air-ground, combined-arms task organization of
Marine Corps forces under a single commander is employed to accomplish a specific mission.

Network Centric Warfare (NCW): Experimentation focused on translating an information
advantage, enabled primarily by information technology, into a competitive warfighting
advantage (either lethality or survivability) through information sharing and robust networking.

A.2 SQL Failure Evidence & Divergence Definitions

Divergent failures: Failures where the DBMS products return different results. This could be
1 out of 2 products, or up to n-1 out of n. (e.g., if a bug exists in 9.2 and not in 10.0, it is
divergent.)

Non-divergent failures: Failures for which two (or more) DBMS products fail with identical
symptoms. (e.g., if a bug exists in both 9.2 and 10.0, it is non-divergent.)

Non-self-evident failures: Failures that cannot be detected through observation (e.g., incorrect-
result failures without DBMS product exceptions and with acceptable response time).

Self-evident failures: Failures that can be detected through observation (e.g., engine crash
failures and internal failures signaled by DBMS product exceptions, or performance failures).

Appendix B: Design Diversity of Oracle Analysis

B.1 SQL Failure Type Definitions

Internal Error: An internal error within the DBMS product, generating an exception message.

Performance/Hang: A self-evident error resulting in obvious loss of performance or a product
hang.

Incorrect Results: An incorrect result returned by the DBMS product. Most likely
non-self-evident in nature.

Loss of Service/Crash: A self-evident error resulting in loss of the Oracle service or a crash of
the Oracle engine.

Feature Unusable: An error that makes a feature unusable. Considered a miscellaneous error.
Appendix C: CASRE Reliability Analysis

C.1 CASRE Failure Counts Input File Format

The following is an excerpt from the CASRE 3.0 User's Guide defining the failure counts input
file format; it appears in Appendix C, Section 2, of that guide.

The first row in a file of failure counts must be one of the following seven keywords: Seconds,
Minutes, Hours, Days, Weeks, Months, Years. For failure count data, this keyword names the units
in which the lengths of each test interval are expressed.

The second through last rows in a file of failure count data have the following fields:

Interval   Number of   Interval   Error
Number     Errors      Length     Severity
(int)      (float)     (float)    (int)
The following is an example of a failure count data file:
Hours
1 5.0 40.0 1
1 3.0 40.0 2
1 2.0 40.0 3
2 4.0 40.0 1
2 3.0 40.0 3
3 7.0 40.0 1
4 5.0 40.0 1
5 4.0 40.0 1
6 4.0 40.0 1
7 3.0 40.0 1
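
As a small illustration (a hypothetical helper, not part of the project's tooling), a file in this format could be produced programmatically as follows:

// Hypothetical helper that emits failure-count records in the CASRE format
// excerpted above: a units keyword on the first line, then one row per
// (interval number, error count, interval length, severity code).
import java.io.FileNotFoundException;
import java.io.PrintWriter;

public final class CasreWriter {

    public static void main(String[] args) throws FileNotFoundException {
        try (PrintWriter out = new PrintWriter("CASRE_example.txt")) {
            out.println("Hours");                       // units keyword for interval lengths
            writeRow(out, 1, 5.0, 40.0, 1);             // illustrative counts, not the report's data
            writeRow(out, 2, 4.0, 40.0, 1);
            writeRow(out, 3, 7.0, 40.0, 1);
        }
    }

    /** Write one failure-count row: interval, errors, interval length, severity code. */
    private static void writeRow(PrintWriter out, int interval, double errors,
                                 double length, int severity) {
        out.printf("%d %.1f %.1f %d%n", interval, errors, length, severity);
    }
}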

C.2 CASRE Input Datafiles

CASRE_Execution_CALOE08.txt CASRE_Execution_MAGTF08.txt CASRE_Integration_MAGTF08.txt

CASRE13.txt CASRE23.txt

C.3 CASRE Usage Flow



C.4 CASRE Charts

C.4.1 CASRE Failure Count

C.4.2 CASRE Time Between Failures



C.4.3 CASRE Failure Intensity

C.4.4 CASRE Cumulative Failures



C.4.5 CASRE Test Interval Length

C.4.6 CASRE Running Average



C.4.7 CASRE Laplace Test



C.4.8 CASRE Prediction Setup

C.4.9 CASRE Cumulative Failure Prediction (next 15 ints)



C.4.10 CASRE Reliability Prediction (next 15 ints)



C.4.11 CASRE Prequential Likelihood (log) (next 15 ints)

References:

Gashi, I., Popov, P., and Strigini, L. "Fault Tolerance via Diversity for Off-the-Shelf Products:
A Study with SQL Database Servers." IEEE Transactions on Dependable and Secure Computing,
vol. 4, no. 4, Oct.-Dec. 2007, p. 280.

Nikora, A. P. Computer Aided Software Reliability Estimation (CASRE) User's Guide, Version 3.0.
March 23, 2000.
