by
Suraj Raghuraman
___________________________________________
Dr. Vincent Ng, Chair
___________________________________________
Dr. Yang Liu
___________________________________________
Dr. Latifur Khan
© Copyright 2008
Suraj Raghuraman
To all my friends
ANOMALY DETECTION FROM AVIATION SAFETY REPORTS
by
Suraj Raghuraman
THESIS
in Partial Fulfillment
of the Requirements
December, 2008
ACKNOWLEDGEMENTS
I would like to thank my advisor Dr. Vincent Ng for putting up with me. I would have
given up long before if it had not been for his support and motivation. He somehow
knew what would motivate me and did his best to keep me going.
I would like to thank my family for motivating me enough to study abroad. I would
still be working in India if it weren’t for them. Thanks for supporting me when I needed it.
I would like to thank my friends for keeping me well fed and happy during the bad
times, and for pushing me hard enough to finish my thesis. If it weren’t for you guys I
would not have made it.
September, 2008
ANOMALY DETECTION FROM AVIATION SAFETY REPORTS
ABSTRACT
Anomaly detection from Aviation Safety Reporting System (ASRS) reports is the task of
identifying the multiple types of anomalies that resulted in an aviation incident based on the
narrative portion of the report, which is written with a lot of domain-specific terminology.
This thesis proposes an effective semi-supervised approach to the problem that does not
use any domain-related information. The approach relies on methods similar to singular
value decomposition and latent semantics for feature generation. A Support Vector Machine
with transduction is used to classify the narratives. Experimental results on the ASRS dataset
show that the method developed is marginally better than existing methods for high-frequency
anomalies and substantially better for low-frequency anomalies.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS ............................................................................................... v
ABSTRACT .......................................................................................................................vi
LIST OF FIGURES............................................................................................................. x
CHAPTER 1 INTRODUCTION...................................................................................... 1
4.5.2 Inner Product .................................................................................... 17
4.5.3 Singular Value Decomposition ........................................................ 19
4.5.4 Gram Schmidt Process ..................................................................... 19
4.5.5 Applying Gram Schmidt .................................................................. 20
4.8 Classification ................................................................................................... 21
4.8.1 Learning............................................................................................ 22
4.8.2 Classifying........................................................................................ 22
4.8.3 Interpreting Output ........................................................................... 23
REFERENCES ................................................................................................................. 39
Vita
LIST OF TABLES
Number Page
LIST OF FIGURES
Number Page
CHAPTER 1
INTRODUCTION
Aviation safety reports are a rich source of information on flight incidents. Every time an
aviation incident occurs anywhere in the world, a report is written that contains the
complete details of the incident. Over the years these reports from around the world were
submitted to the ASRS archive at the National Aeronautics and Space Administration (NASA).
Over the decades the database has grown to contain over 150 thousand incident narratives.
The task of anomaly detection is to determine the cause of an incident by looking only at the
narrative portion of the report. There are about 62 anomaly classes in all. The narratives
come from varying backdrops and contain many grammatical mistakes, misspellings, and
conflicting abbreviations; they are hard for even an average person to understand, which
makes the task difficult. The goal of this thesis is to develop a semi-supervised
approach to anomaly detection that does not rely on any knowledge of the aviation domain.
We will start with the existing approaches that were applied to the dataset and approaches
that were used for similar classification tasks, since it is wise to build upon approaches
that people are already working on. We then enhance the word selection methods and obtain
better accuracies by using semi-supervised learning algorithms.
After that the focus shifts to using the entire dataset, not just the labeled instances, to learn
the features. For this we propose a new approach that uses similarity between documents to
generate features. We use latent semantics and singular value decomposition to compute the
similarity between documents. While computing the similarity, greater emphasis is given to
the words that help identify the presence of the anomalies. The key idea that this thesis
introduces is the use of a set of documents to generate the features for a single document.
Instead of just looking at the similarity between two documents and classifying based on
that alone, an instance selection algorithm is used to pick out points of reference; then,
based on the similarity with these varying reference documents, patterns are extracted using
an SVM, which is used to classify documents into anomaly classes. In order to take greater
advantage of the instances we have, an SVM with transduction is used instead of a fully
supervised SVM.
Finally, we show that the word selection based methods that we developed perform better
than the existing methods. The semantics based method does not perform as well as the word
selection based methods, but performs far better than the approximated latent semantics based
approach.
CHAPTER 2
PROBLEM DEFINITION
Anomaly detection in Aviation Safety Reporting System (ASRS) reports is the task of
identifying the types of anomalies that resulted in an aviation incident based on the narrative
portion of the report. Incidents can occur due to more than one reason, and hence more than
one anomaly can be associated with a single report.
Aviation officials write an ASRS report whenever an incident occurs. These
reports are written all around the world and as a result contain varying styles of writing. The
content of the report is highly specific to the aviation domain and contains many misspelled
words, acronyms, and grammatical mistakes. To get a clearer picture of what a report looks
like, an example narrative is shown in Figure 2.1.
ACR Y CLIMBING TO FL210 WAS STOPPED AT 160 FOR TFC AT 170; WHICH PASSED.
ACR Y WAS THEN KEPT AT 160. ACR X WAS DESCENDED TO 170 UNTIL CLEAR OF TFC;
THEN DESCENDED TO 150. ACR X WAS NOT OUT OF 170 WHEN CLRNC ISSUED TO
STOP AT 170. NO ANSWER. CLRNC GIVEN AGAIN. NO ANSWER. TURNED ACR Y FROM
320 DEG TO 250 DEG HDG. CLRNC TO ACR X AGAIN WITH TURN TO 060 DEG; NO
ANSWER. FINALLY ACR X ANSWERED AND TURNED TO 060 DEG WITH CLIMB BACK TOS
170. ACFT PASSED 4.3 MI AND 200' APART. ACR X DID NOT RESPOND TO CLRNC.
Figure 2.1. A sample ASRS report narrative.
As Figure 2.1 clearly shows, most of the text is not very readable to a non-aviation
expert. The other complication is the fact that most acronyms are 3 characters in length.
For example, ACR can stand for Aircraft as well as Air Control Radar, and it may also refer
to other entities depending on context.
There are over 62 classes of anomalies that a narrative can belong to. For the sake of
simplicity, and to reduce the amount of processing the data needs to go through, we
consider only 8 anomalies. In order not to lose the variations that each anomaly brings
forth in terms of frequency of occurrence and the situations in which they occur, the
selected anomalies cover both frequent and rare cases.
CHAPTER 3
RELATED WORK
Supervised approaches use labeled documents to learn and then label the unlabeled
documents. The next sections cover some of the supervised approaches used on the ASRS
dataset.
The basic approach tried by [1] was based on regular expressions. This approach is heavily
influenced by the domain and works similarly to how an aviation expert identifies
anomalies from a narrative. After a lot of brainstorming, a set of shaper words that are useful
in identifying anomaly expressions was created. These shaper words are used to identify
expressions from a narrative. Regular expression templates are generated for particular types
of anomalies using these expression rules. A narrative is run through these rule trees, and
the matching anomalies are reported.
Another supervised approach applied to text classification is Naïve Bayes, which
assigns a document d the class c* = argmax_c P(c | d) = argmax_c P(c) P(d | c) / P(d),
where P(c) is the prior probability of the class and P(d | c) is the probability of the
document given the class.
P(d) does not play any role in determining argmax_c P(c | d), so it is ignored. Naïve Bayes
decomposes the term P(d | c) into a set of features f_i, assuming they are conditionally
independent, so P(d | c) = Π_i P(f_i | c).
In spite of this strong assumption of conditional independence between words, which
clearly does not hold for real-world data, Naïve Bayes is known to be good for text
classification [2] and is sometimes optimal even for highly dependent features [3].
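As an illustration (not the implementation used in the cited work), the decision rule above can be sketched in a few lines of Python; the toy training data and the use of Laplace smoothing are assumptions made for the example:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate class priors P(c) and Laplace-smoothed word likelihoods
    P(f_i | c) from tokenized training documents."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    word_counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
        vocab.update(doc)
    loglik = {}
    for c in classes:
        total = sum(word_counts[c].values()) + len(vocab)
        loglik[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return prior, loglik

def classify_nb(doc, prior, loglik):
    """Return argmax_c [log P(c) + sum_i log P(f_i | c)]; unseen words are skipped."""
    return max(prior, key=lambda c: prior[c] +
               sum(loglik[c][w] for w in doc if w in loglik[c]))
```

Working in log space avoids underflow when many feature probabilities are multiplied together.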
Support Vector Machines (SVMs) are known to be very good at classifying text [4]. The
classification task is split into multiple binary classification tasks. In the binary case,
the basic idea is to map the input features into a high-dimensional feature space and then
identify a hyperplane that separates the two classes by the highest possible margin. If we
let x_j denote the training instances and y_j their labels, the decision function takes the
form f(x) = sign(Σ_j α_j y_j K(x_j, x) + b), where the x_j with nonzero α_j are the support
vectors, and the α_j are obtained by solving the dual optimization problem.
In [6] an SVM classifier was used on the ASRS dataset with words as features. In order
to control the number of features given to the learner, Information Gain (IG) was used as a
selection metric:
IG = H(C) – H(C|W)
where H(C) is the entropy of the anomaly class and H(C|W) is the conditional entropy of the
class given the word.
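The IG computation above can be sketched as follows, using presence/absence counts only; the function names and argument layout are illustrative assumptions:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(n_pos, n_neg, n_pos_with, n_neg_with):
    """IG = H(C) - H(C|W) for a binary anomaly class C and a word W.
    n_pos/n_neg: documents in/out of the class;
    n_pos_with/n_neg_with: those among them that contain the word."""
    n = n_pos + n_neg
    h_c = entropy([n_pos / n, n_neg / n])
    n_with = n_pos_with + n_neg_with
    n_without = n - n_with
    h_c_w = 0.0
    if n_with:
        h_c_w += (n_with / n) * entropy(
            [n_pos_with / n_with, n_neg_with / n_with])
    if n_without:
        h_c_w += (n_without / n) * entropy(
            [(n_pos - n_pos_with) / n_without,
             (n_neg - n_neg_with) / n_without])
    return h_c - h_c_w
```

A word that perfectly predicts the class yields IG equal to H(C), while a word distributed identically inside and outside the class yields IG of zero.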
Overall it was observed that the top 2500 words are more than sufficient to get good
results. It was also noticed that anomalies that occur less frequently result in bad F-scores,
while the more frequent anomalies are detected well.
Semi-supervised approaches use a combination of labeled and unlabeled data to label the
unlabeled documents. Very few semi-supervised algorithms have been applied to the ASRS
dataset. The approach described below, which combines Naïve Bayes with Expectation
Maximization (EM), is good for text classification in general and makes use of both
labeled and unlabeled data. The algorithm works in two stages: in the first stage, the
unlabeled data is probabilistically labeled using the current parameter values; in the second
stage, the values of the parameters are recalculated using maximum likelihood estimation.
To classify (E-Step):
P(c_j | d_i) ∝ P(c_j) Π_k P(w_{d_i,k} | c_j)
where P(w_{d_i,k} | c_j) is the probability of the kth word of the dith document occurring
in class j.
A weighting function is used to distinguish between the labeled and unlabeled documents:
counts contributed by unlabeled documents are scaled down by a weighting factor, while
labeled documents contribute with full weight. Using the above set of expressions, good
classification accuracy was achieved in their experiments. The best result is obtained when
the value of the weighting factor is between 0 and 1; its value controls how much influence
the unlabeled documents have on the model parameters.
CHAPTER 4
4.1 Introduction
A good look into the narratives of the ASRS dataset reveals a lot of connections between a
few keywords and the anomalies associated with them. There are many words that will not
occur with certain anomalies, and many words that will occur only with certain anomalies.
This is not an easily exploitable feature, since a document can have multiple anomalies.
Still, these are the two key concepts on which the approach relies for classification.
To start with, the task of anomaly detection is split into binary classification tasks of
checking for the presence of a single anomaly. For each of the binary tasks the following
operations are performed:
• Word Selection: Relevant features are selected using the Information Gain metric.
• Feature Reduction: Multiple features are combined into a single feature using Singular
Value Decomposition.
• Classification: A transductive SVM is used for learning and classifying the features.
The further sections detail each of these operations. We start by analyzing the data.
Let us start by taking another look at the data. As mentioned before, the data has many
aspects, such as abbreviated words, misspellings, grammatical mistakes, and domain-specific
words, that make it hard for humans outside the aviation domain to understand. So there is
no way we can readily use the more advanced tools of Natural Language Processing (NLP),
such as part-of-speech taggers, syntactic parsers, or named entity recognizers. We are left
with two alternatives: either we bite the bullet and take on the big task of fixing the text
so that advanced NLP tools can be used, or we avoid using those tools and adopt a different
approach.
WHEN I CALLED GND CTL I STATED; 'WORCESTER GND (OUR ACFT IDENT) TAXI; VFR
TO BRADLEY.' SHE RESPONDED BY SAYING; '(ACFT IDENT); TAXI VIA BRAVO FOR
INTXN DEP 29.' SHE ALSO ASKED ME PREVIOUS TO THIS IF I REQUIRED FULL LENGTH.
I RESPONDED BY SAYING; 'NEGATIVE.' SHE CAME BACK BY SAYING; 'BRAVO INTXN 29
FOR TKOF IS APPROVED.' I RESPONDED BY SAYING; 'CLRD TO GO AT 29?' SHE SAID;
'AFFIRMATIVE;' AFTER I SAID; 'CLRD TO GO 29.' WHEN I MADE MY 45 DEG LEFT TURN
FOR DEP; I DECIDED TO CALL TWR FREQ TO ADVISE (WHICH IS NOT REQUIRED).
WHEN I CALLED TWR FREQ; A MAN RESPONDED BY SAYING; '(ACFT IDENT); YOU WERE
NEVER CLRD FOR TKOF.' THE NEXT DAY MY F/O (NAME) AND I VISITED THE
WORCESTER TWR TO SPEAK TO THE CHIEF CTLR. WE LISTENED TO THE TAPE AND I
WAS CORRECT; THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT
INTXN BRAVO 29. THE CHIEF CTLR SAID THERE WOULD BE NO ACTION TAKEN
AGAINST ME AND MY F/O. THE CHIEF CTLR AND MYSELF AGREED THAT THE GND CTLR
AND MYSELF USED POOR PHRASEOLOGY. FOR INSTANCE; 'CLRD TO GO;' WHICH IS
USED QUITE OFTEN IN AVIATION FOR 'CLRD FOR TKOF.' I WOULD LIKE TO POINT OUT
THAT NO ATTEMPT WAS MADE BY TWR OR GND FREQ TO CONTACT ME NOT TO TKOF
AT BRAVO INTXN RWY 29. SUPPLEMENTAL INFORMATION FROM ACN 80338: CAPT
FLYING ACFT AND OPERATING RADIOS. I ASKED CAPT IF SHE CLRD US FOR TKOF ON
GND FREQ. HE SAID YES; AND THAT SHE WAS WORKING BOTH FREQS.
Figure 4.1. A sample ASRS narrative.
On closer examination of the narrative in Figure 4.1, it is easy to notice that many words
occur only in certain anomaly cases. For example, in the above case the word TAXI is
mentioned, which increases the chance that the anomaly is an incursion onto or an excursion
from the runway. Since it is easy for us to determine that by looking at the record, machines
can do it too, using simple measures like plain old information gain [6]. Another way of
doing something similar would be to use Inverse Term Frequency and then add weight to
the rarer words.
Now that we know what kind of words to look for, let us look at the way the words in the
text are arranged. As we look at more and more documents, it becomes clear that in most
cases the order in which the words occur is not relevant. The other thing to notice is that
shaper words like TAXI are not mentioned when the anomaly is Weather (Figure 4.2), for
instance. The more you look at the documents, the more a Bag of Words approach seems
the best way forward, and that is what this thesis ended up doing.
AFTER 4 HRS FLT TIME ARRIVED AT SEA. WX HAD STARTED TO GO DOWN. ARPT USING
ILS 34R APCH WITH A RVR OF 3000'. ATTEMPTED FIRST APCH AND INITIATED MISSED
APCH AT DH WITH UNEVENTFUL GO AROUND. STARTED SECOND APCH; AT DH I SAW
APCH LIGHTS. AT SAME TIME COPLT STARTED TO PUSH THROTTLE FORWARD FOR GO
AROUND. I DID NOT REALIZE THAT THE TOGA BUTTONS HAD BEEN PUSHED. WHEN I
GRABBED THE YOKE I DISCONNECTED THE AUTOPLT. AT 100' RWY CAME IN SIGHT
AND WE WERE LINED UP A LITTLE RIGHT. AT THE SAME TIME I BECAME CONFUSED
WITH THE THROTTLE RESPONSES BECAUSE I DID NOT REALIZE THE TOGA BUTTON
HAD BEEN ACTIVATED. WE STARTED A LEFT DRIFT WHILE I TRIED TO TURN THE
LIGHTS ON FOR BETTER DEFINITION OF THE RWY. AT ABOUT 50' THE DRIFT BECAME
FASTER TO THE LEFT. WHEN EDGE LIGHTS CENTERED UP AND ALSO ILS TWR WAS
SEEN; GO AROUND WAS INITIATED FOR AN UNEVENTFUL GO AROUND. ON GO
AROUND IT WAS LEARNED THAT THE RVR HAD DROPPED TO 1000' ALL XMITTERS.
LESSONS LEARNED: 1) A BETTER APCH BRIEF WITH COPLT KNOWING TO CALL DH OUT
EXTRA LOUD SO IT IS DEFINITELY HEARD. 2) APCH SHOULD HAVE BEEN
DISCONTINUED AS SOON AS THE AUTOTHROTTLES FELT FUNNY AND DEFINITELY
WHEN THE RWY WAS NOT LINED UP. 3) ALSO BRIEF A POSITIVE GO AROUND CALLOUT
AGAIN TO DISTRACT THE OUTSIDE CONCENTRATE. I THINK ANOTHER CONTRIBUTING
FACTOR WAS THAT THE TIME WAS CLOSE TO 2 O'CLOCK BODY TIME AND
CONCENTRATION WAS JUST A LITTLE SLOWER THAN NORMAL.
Figure 4.2. A sample narrative for the anomaly Inflight Encounter – Weather.
The other key thing to note is that many of the anomalies occur together; for example,
Inflight Encounter – Turbulence is almost always accompanied by Inflight Encounter –
Weather. However, this is a feature that is unique to this dataset and one that many datasets
will not share; as a result, this thesis will not exploit this characteristic. A Bayesian
classifier could be used to predict, based on the narrative and the anomalies already
identified, whether another anomaly should also be associated with the current narrative.
This could be worth trying if the aim were just to classify this dataset.
This section details how the narratives were extracted from the ASRS dataset, the
preprocessing and cleaning applied to them, and finally how the words from the narratives
were converted into features.
The ASRS document set used for this thesis is a set of comma-separated values (CSV) files
covering the years 1988 to 2007. Each of these files has thousands of narratives, and each
file has about 54 fields. The thesis uses only two fields from the files: the narrative and the
event anomaly. There are more than 60 anomalies present in these files. In this stage we
pick out all the narratives containing at least one of the anomalies mentioned in Table 2.1.
This step returns about 75,000 narratives.
4.3.2 Tokenization
This step converts the text seen in Figure 4.1 into the form shown in Figure 4.3.
Tokenizing the text is very important to ensure that words can be identified easily, since
many punctuation marks are attached to the words of the narrative. The first step in
tokenization is to separate punctuation from words. However, there are a few places where
punctuation indicates a measure, for example in the case of 3000', which denotes a distance
of 3000 feet. In these places we cannot afford to strip or separate the punctuation, because
doing so would have an adverse effect on the matching process. To handle such cases,
punctuation following or contained within numbers was not removed.
WHEN I CALLED GND CTL I STATED WORCESTER GND OUR ACFT IDENT TAXI VFR TO
BRADLEY SHE RESPONDED BY SAYING ACFT IDENT TAXI VIA BRAVO FOR INTXN DEP 29
SHE ALSO ASKED ME PREVIOUS TO THIS IF I REQUIRED FULL LENGTH I RESPONDED
BY SAYING NEGATIVE SHE CAME BACK BY SAYING BRAVO INTXN 29 FOR TKOF IS
APPROVED I RESPONDED BY SAYING CLRD TO GO AT 29 SHE SAID AFFIRMATIVE …
Figure 4.3. Tokenized version of the narrative in Figure 4.1.
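The numeric-punctuation rule described above can be sketched with a simple regular-expression tokenizer. This is an illustrative heuristic, not the exact tokenizer used in the thesis; in particular, the pattern that decides what counts as a numeric token is an assumption:

```python
import re

def tokenize(text):
    """Separate punctuation from words, but keep punctuation that follows or
    is contained within a number (e.g. 3000' or 4.3), as described above."""
    tokens = []
    for raw in text.split():
        # Numeric token: digits possibly followed by . , : ' marks; keep as-is.
        if re.match(r"^\d[\d.,:']*$", raw):
            tokens.append(raw)
        else:
            # Elsewhere, replace punctuation with spaces and re-split.
            cleaned = re.sub(r"[^\w]", " ", raw)
            tokens.extend(cleaned.split())
    return tokens
```

For example, `tokenize("RVR OF 3000'. TFC; AT 4.3 MI")` keeps the distance tokens intact while stripping the semicolon from `TFC;`.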
The basic idea behind hash generation is to replace words with numbers to facilitate the
generation of the input files for the SVM learning algorithm. All the narratives are read,
and each word is assigned a number that serves as its primary identity from then on. We are
not interested in the order in which words occur, only in the number of times they occur in
a narrative. So, to avoid repeated parsing of the sentence to get the word frequencies, we
take the primary key and associate with each word its frequency of occurrence. Finally, the
result is sorted in ascending order of the primary key value. As an example, Figure 4.4
shows the hashed version of the tokenized narrative.
Figure 4.4. Hashed version of tokenized Narrative for Anomaly, Incursion – Runway
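The hash generation step can be sketched as follows; the two helper functions are illustrative, not the actual implementation:

```python
from collections import Counter

def build_hash(narratives):
    """Assign each word a numeric primary key in order of first appearance."""
    word_id = {}
    for narrative in narratives:
        for word in narrative.split():
            if word not in word_id:
                word_id[word] = len(word_id) + 1  # ids start at 1 (SVM-light needs > 0)
    return word_id

def hash_narrative(narrative, word_id):
    """Replace words by their ids with frequencies, sorted ascending by id."""
    freq = Counter(word_id[w] for w in narrative.split() if w in word_id)
    return sorted(freq.items())
```

The sorted `(id, frequency)` pairs map directly onto the `feature:value` lines that the SVM input files require later.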
There are over 20,000 unique words in the ASRS dataset. Since we are considering words
as features, it becomes a tall order for any classifier to perform well with such a high
number of features. Even though the SVM learning algorithm is known to handle large
feature sets very nicely, it is still better to reduce the number of features sent to it by looking
at the kinds of words that matter.
After experimenting with several methods on the dataset, we observed that the best
results were obtained when only a selective subset of the words from the narratives was used
instead of all the words. Typically, the words selected either appear a lot inside one class or
appear rarely outside it.
The Information Gain (IG) method for word selection used by the folks at NASA in
[6] was found to be one of the best ways of doing word selection (see Section 4.4.1). The key
advantage of using IG is that it takes care of both of the cases mentioned above. Another
method, based on subtraction, was also added to the system to increase the accuracy for the
low-frequency anomalies.
IG is calculated as before: IG = H(C) – H(C|W).
In this method, normalized frequencies are not considered while calculating the entropy; all
that is considered is the presence or absence of a feature in the document. Also, features
that occur in only a single document are ignored for being too rare.
One of the easiest ways to find out whether a feature is more prevalent in one class than in
the others is by subtraction. The problem with subtraction is that the range of values it
returns can be huge. To fix this we apply the log function, which brings the numbers
returned by the subtraction down to a manageable level. So, essentially, the log of the
difference between the frequencies of occurrence of a word inside the class and outside it is
what is used.
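Since the exact formula is not spelled out above, one hedged reading of the subtraction method is a signed log of the frequency difference:

```python
import math

def subtraction_score(freq_in_class, freq_outside):
    """Signed log of the frequency difference between a word's occurrences
    inside the class and outside it. The use of log1p and the sign handling
    are assumptions; the thesis only states 'log of the difference'."""
    diff = freq_in_class - freq_outside
    if diff == 0:
        return 0.0
    # copysign keeps the direction of the difference while log1p compresses it.
    return math.copysign(math.log1p(abs(diff)), diff)
```

A word far more frequent inside the class gets a large positive score, one far more frequent outside gets a negative score, and the log keeps both within a manageable range.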
A combination of both word selection methods described above is used to eliminate the
irrelevant features for an anomaly. The values of both IG and subtraction are calculated for
all the features. Two lists are then formed, one with the features arranged in descending
order of IG and the other in descending order of the subtraction score. 75% of the required
features are chosen from the top of the IG list, and the remaining 25% are taken from the
top of the subtraction list. After this, all the unselected features are considered irrelevant
and are removed from both the training and test files. A result of word selection is shown
in Figure 4.5.
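The 75/25 combination can be sketched as follows; how duplicates between the two lists are handled is an assumption, since the text does not say how overlap is treated:

```python
def select_words(ig_scores, sub_scores, k):
    """Pick 75% of k words from the top of the IG ranking, then fill the
    remaining 25% from the top of the subtraction ranking, skipping words
    already chosen (the duplicate handling is an illustrative assumption)."""
    ig_ranked = sorted(ig_scores, key=ig_scores.get, reverse=True)
    sub_ranked = sorted(sub_scores, key=sub_scores.get, reverse=True)
    chosen = ig_ranked[:int(0.75 * k)]
    seen = set(chosen)
    for w in sub_ranked:
        if len(chosen) >= k:
            break
        if w not in seen:
            chosen.append(w)
            seen.add(w)
    return chosen
```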
In many classification tasks, the feature values for an instance are generated from the
instance itself. A less common approach is to use multiple instances to generate the
features for a given instance. One such approach uses latent semantic kernels [11]. Our
approach follows this latter style.
This section starts with the basic terminology and then explains how and why using
this method for feature reduction is a good idea.
A document can be represented in many different ways. One of the more mathematical
representations is similar to a histogram: take all the words in alphabetical order and
associate with each word the frequency of its occurrence in the document. For example, the
document term vector for the sentence given below is shown in Table 4.1.
THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO
Table 4.1. Document term vector for the example sentence.
acknowledged 1
at 1
bravo 1
clrd 1
ctlr 1
gnd 1
go 1
i 1
intxn 1
that 2
the 1
to 1
was 1
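Building such a document term vector is straightforward; the helper below is an illustrative sketch:

```python
from collections import Counter

def term_vector(tokens, vocabulary):
    """Document-by-term vector: the frequency of each vocabulary word in the
    document, with the vocabulary in a fixed (e.g. alphabetical) order."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]
```

Applied to the sentence in Table 4.1, the entry for "that" is 2 and every other entry is 1, matching the table.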
In a conventional term by document vector, the entries correspond to word stems instead
of the actual words. In our case the words themselves were used, due to the presence of
misspellings and domain-specific abbreviations that make stemming unreliable.
The term by document vector in this case has already been generated based on the
results from Section 4.4. The only key difference is that instead of having the words sorted
in alphabetical order, there are numbers that represent the words, sorted in ascending order.
The document by term vector also does not have normalized frequencies in its representation.
This does not harm the approach in any way, since the numbers are still as far apart as
before.
The inner product (or dot product) is an operation that takes two vectors and returns a
single number. To compute it, each element of the first vector is multiplied by the
corresponding element of the other vector, and the products are summed.
The easiest way to calculate the similarity between two documents is to take the dot
product of the document by term vectors of the two documents. So, for example, take the
following sentences.
S1. THAT THE GND CTLR ACKNOWLEDGED THAT I WAS 'CLRD TO GO' AT INTXN BRAVO
Taking the dot product between each part of document by term vectors, we have:
S1 . S2 = 1 + 1 + 1 + 1 + 1 + 1 = 6
S1 . S3 = 1 + 1 = 2
S2 . S3 = 1 + 1 = 2
19
Looking at the results of the dot products, it is easy to say that S1 and S2 are more
similar than S1 and S3 or S2 and S3. The same result holds if the frequencies are replaced
by normalized frequencies: instead of 6, 2, and 2, the results would be fractions in a
similar order.
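The dot product, and its normalized-frequency variant (cosine similarity), can be sketched as:

```python
import math

def dot(u, v):
    """Inner product of two equal-length document-by-term vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product of length-normalized vectors: the 'normalized
    frequencies' case, which preserves the similarity ordering."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0
```

Two vectors that are scalar multiples of each other have cosine similarity 1, so scaling frequencies does not change which pair of documents looks most similar.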
Singular value decomposition (SVD) is the process of factorizing an n x m matrix M into
M = U Σ V^T
where U is a unitary matrix over the feature space of real numbers, Σ is a diagonal matrix
containing the singular values, which can be used to transform other vectors, and V^T is a
unitary matrix over the document space.
The training set is used to generate Σ, which is then used to generate the features for both
the training and the test documents.
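A minimal sketch of this factorization using NumPy; the tiny matrix is an assumption made purely for illustration:

```python
import numpy as np

# A toy document-by-term matrix: rows are documents, columns are words.
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

# Factorize M = U @ diag(s) @ Vt, with s the singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Reduced-dimension features: project each document onto the top k
# singular directions, scaled by the singular values.
k = 2
reduced = U[:, :k] * s[:k]   # one k-dimensional feature vector per document
```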
The Gram Schmidt process is used for converting a linearly independent set of vectors v1, v2,
v3, …, vn into an orthogonal set of vectors q1, q2, q3, …, qn in the inner product feature
space, where each qi is obtained by subtracting from vi its projections onto the previous
vectors, and proj_q v is
proj_q v = (⟨v, q⟩ / ⟨q, q⟩) q
Due to floating point issues, the direct algorithm described above does not give exact results
on a computer. A numerically stable version of the algorithm1 is used instead, given below.
GRAM-SCHMIDT-STABLE
for i = 1 to n do
    vi = ai
end for
for i = 1 to n do
    rii = ||vi||
    qi = vi / rii
    for j = i + 1 to n do
        rij = qiT vj
        vj = vj - rij qi
    end for
end for
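The stable pseudocode above is the modified Gram-Schmidt procedure; a NumPy sketch, assuming the input columns are linearly independent:

```python
import numpy as np

def gram_schmidt_stable(A):
    """Modified Gram-Schmidt: orthonormalize the columns of A, mirroring the
    pseudocode above. Assumes the columns are linearly independent."""
    V = A.astype(float).copy()
    n = V.shape[1]
    Q = np.zeros_like(V)
    for i in range(n):
        r = np.linalg.norm(V[:, i])      # r_ii = ||v_i||
        Q[:, i] = V[:, i] / r            # q_i = v_i / r_ii
        for j in range(i + 1, n):
            rij = Q[:, i] @ V[:, j]      # r_ij = q_i^T v_j
            V[:, j] -= rij * Q[:, i]     # v_j = v_j - r_ij q_i
    return Q
```

Subtracting each projection immediately after it is computed (rather than all at once) is what makes this variant numerically stable.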
The concept of latent semantics relies on co-occurrence, since we are not using any external
semantic resource like WordNet or SemanticNet. We use the fact that two words occurring
together in a document are related to each other. This fact can be used to derive semantic
similarity, but computing the full SVD of the term-document matrix is a very
computationally intensive task. As a result, we use Gram Schmidt orthogonalization (GS)
to estimate the singular values and use it to reduce the dimensions of the space effectively.
The GS method is known to provide results similar to the latent semantics kernel (LSK) [12].
All the feature values of each document are normalized. The document term matrix is then
created and given to the GS algorithm for processing.
1 http://www.phys.ocean.dal.ca/~lukeman/teaching/3170-3111/slides/Lecture8.pdf
• A biasing factor is applied to the norm for all the training documents containing the
anomaly.
• The K documents (where K is the resultant feature space size) with the highest biased
norm scores are taken as the reference documents.
• The GS algorithm is run on all the documents using these K documents. The reduced
feature vectors it produces form the training features.
To generate the test data, the GS algorithm is run with the K documents selected in the
training stage.
The training and test files are generated using the features computed in the train and test
stages, as described above. Although there exists an unsupervised way of selecting the K
documents [13], it was seen that the unsupervised approach did not give nearly as good
results as the supervised selection.
4.6 Classification
SVM-light2 is a C-language implementation of SVM. It is small, fast, and very easy to use;
these advantages are why we chose it. It comes with two main executables: svm_learn and
svm_classify. The learner svm_learn takes the training data and generates a model. The
classifier svm_classify takes the test data along with the model and produces the predictions.
2 http://svmlight.joachims.org/
4.6.1 Learning
Training an SVM with transduction is quite simple. Once all the features are generated
using the methods from Sections 4.2 to 4.5, we only need to put the features into a format
that the SVM-light learner can understand. The input format for the training file is a class
label (-1 or 1) followed by a space-separated list of feature number and value pairs. In our
case, if the document belongs to the anomaly we assign it -1; otherwise we assign it 1. In
order to use transduction, the unlabeled instances also need to be passed; these are assigned
a class value of 0. See Figure 4.6 for an example of a training file. Once the learner
encounters a line with label zero, it switches to the transduction mode of operation. Finally,
svm_learn is run on the training file with the model file name to start the learner.
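A helper that formats one instance in this input format might look like the following sketch (the function name is an assumption):

```python
def svmlight_line(label, features):
    """Format one instance for SVM-light: '<label> <id>:<value> ...'.
    Feature ids must be strictly positive and appear in increasing order;
    a label of 0 marks an unlabeled instance for transduction mode."""
    pairs = " ".join(f"{fid}:{val}" for fid, val in sorted(features.items()))
    return f"{label} {pairs}"
```

For example, `svmlight_line(-1, {3: 0.5, 1: 2})` yields `-1 1:2 3:0.5`, with the feature pairs sorted by id as SVM-light requires.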
4.6.2 Classifying
The test file is created the same way as the training file. All the narratives that are not
labeled are put in a file. The features required for classifying a test document are generated
by processing each test document through the steps described in Sections 4.2 to 4.5. After
going through these stages, the features generated for each document are taken, and a zero
is put in place of the class value when the features are passed for classification.
To use the classifier, the executable svm_classify has to be run with the test file, the
model file, and the output file name given as command line arguments.
SVM returns a set of numbers around zero, one per line, with each line corresponding to
one test instance. If the number is closer to -1, SVM classifies the instance as negative; if
it is closer to 1, as positive. The magnitude of the number indicates the degree of
confidence SVM has in its decision. A threshold of 0 was used for determining the class:
anything above zero was considered positive, and anything below negative.
CHAPTER 5
EVALUATION
5.1 Dataset
As mentioned before, the evaluation dataset contains the aviation safety reports, with over 62
anomalies and over 130,000 narratives. In order to reduce the scale of the task, only 8
anomalies were considered in our experiments. The number of relevant narratives (i.e., the
number of narratives containing at least one of the eight anomalies) was over 75,000. To
make sure that the program could run on everyday hardware, only 8,000 narratives were
used. Using fewer narratives also increased the proportion of low-frequency anomalies in
the dataset. Since a single narrative can have more than one anomaly, the total count of
anomaly labels need not equal the total number of narratives. The table below lists the
anomalies and the number of narratives for each.
The evaluation was performed using 2-fold cross validation. This was done in order
to have a smaller percentage of training instances, which adds greater value when
developing a semi-supervised approach. Each fold contains 4,000 narratives; the details of
each fold are given below.
Identification is done using binary SVM classifiers; we use 8 binary classifiers, one per
anomaly, each handling its anomaly independently of the others. The experiments and
results are presented starting with the existing methods, followed by our proposed approach.
The most basic method that can be readily used on the dataset is classification using
a linear SVM. To do this we take the output generated by the method described in
section 4.2.3 and pass it as input to the learning algorithm. The SVM-light package, which is
based on [7], was used to classify the data. For the classification stage all the options were set
to their default values. Words were selected using the IG metric mentioned in section 4.2.4.
The SVM-light system accepts input as line-separated instances. Each instance is
given as a class label (-1 or 1) followed by space-separated feature-number:feature-value
pairs. The feature numbers must be strictly greater than zero.
The learner processes the labeled instances and generates a model file. The test
instances are passed to SVM-light in the same way as described above, but with a
placeholder in place of the class label at the beginning of the line. The output provided by
the system is a number in the range of -1 to 1, with numbers closer to 1 indicating the
positive class. A threshold of 0 was used to discretize the output. The results of the
experiment are tabulated in Table 5.2 and graphed in Figure 5.1. The rows of Table 5.2
correspond to the number of features selected by IG and used by the SVM; the columns are
the eight anomalies.
No. of words \ Anomaly      1      2      3      4      5      6      7      8
500                     83.21  62.13  85.64  50.19  65.38  77.32  71.80  78.42
1500                    83.50  63.02  85.27  49.28  65.63  77.54  72.30  79.12
2500                    83.65  62.99  85.49  49.21  65.48  77.85  71.87  78.95
5000                    83.61  62.79  85.44  49.34  65.33  78.12  72.48  79.07
All                     83.92  61.40  85.50  49.06  65.34  77.65  72.57  79.13
As seen from the results, the approach of just selecting the words using IG works very well
with this dataset, with the overall F-score reaching 73.81. The key reason it fails to
perform well on anomaly 4 is that this anomaly (Inflight Encounter – Turbulence)
is almost always accompanied by anomaly 5 (Inflight Encounter – Weather), so the two
share many common words and are not easily distinguished from each other. In the case of
anomaly 2 (Airspace Violation – Entry), the low performance can be attributed to the fact
that not many good discriminative words are found, due to the low number of training
instances belonging to this anomaly class. Overall, anomalies with a higher presence in the
dataset were detected well.
The Naïve Bayes classifier is known to provide good results on text classification
tasks. In this experiment the Naïve Bayes classifier is used in conjunction with EM as given
in [7]. The details of the method can be found in section 3.2.1. Unlike the previous methods
where a standard learning algorithm package was used, here a custom Naïve Bayes EM
algorithm was coded. Doing this allowed greater flexibility in how the input was
provided and easier tuning of the parameters. The unlabeled data that we use is the test set.
The input given to the system is again taken from the output of sections 4.2.3 and
4.2.4. Since the EM algorithm works differently from the SVM, a few changes needed
to be made to the way the input was passed and handled. In order to keep the
input format the same as in the other methods, the missing features (the words that do not
occur in the narrative) are handled in the code by setting the value of those features to zero.
Furthermore, instead of using the frequency of occurrence of words in the instances as given
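The custom Naïve Bayes EM loop can be sketched as follows. This is an illustrative reconstruction, not the thesis code: it assumes binary (Bernoulli) word features, Laplace smoothing, and a weight `lam` that down-weights the unlabeled (test) documents, in the spirit of [7]; all names are hypothetical.

```python
import numpy as np

def nb_em(X_lab, y_lab, X_unl, lam=0.1, iters=10):
    """Semi-supervised Bernoulli Naive Bayes trained with EM.
    X_lab, X_unl are {0,1} document-by-word matrices; y_lab holds 0/1 labels.
    Unlabeled documents contribute to the counts scaled by `lam`."""
    Y = np.eye(2)[y_lab]                       # hard labels, shape (n_lab, 2)
    resp = np.full((len(X_unl), 2), 0.5)       # soft labels for unlabeled docs
    for _ in range(iters):
        # M-step: class priors and per-word Bernoulli parameters,
        # with Laplace smoothing and lam-weighted unlabeled counts.
        w = np.vstack([Y, lam * resp])
        X = np.vstack([X_lab, X_unl])
        prior = (w.sum(0) + 1) / (w.sum() + 2)
        theta = (w.T @ X + 1) / (w.sum(0)[:, None] + 2)
        # E-step: recompute class posteriors of the unlabeled documents.
        log_lik = (X_unl @ np.log(theta).T
                   + (1 - X_unl) @ np.log(1 - theta).T + np.log(prior))
        log_lik -= log_lik.max(1, keepdims=True)
        resp = np.exp(log_lik)
        resp /= resp.sum(1, keepdims=True)
    return prior, theta, resp

# Toy run: word 0 marks class 1, word 1 marks class 0.
X_lab = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
X_unl = np.array([[1, 0], [0, 1]])
prior, theta, resp = nb_em(X_lab, np.array([1, 1, 0, 0]), X_unl)
```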
The output generated by the method is not the same as that generated by the SVM
classifiers: it is the probability of occurrence of the anomaly, which is then thresholded
to obtain the final label.
The importance given to the unlabeled data was varied. The best scores obtained by
varying the importance of the unlabeled data are listed in Table 5.3.
Unlabeled-data weight \ Anomaly      1      2      3      4      5      6      7      8
0.1                              83.69  58.71  86.55  59.53  69.90  69.84  72.73  78.57
0.25                             83.57  56.46  86.74  57.89  68.14  67.18  72.38  78.47
0.5                              83.35  54.16  86.83  55.75  66.31  64.98  71.68  78.06
0.7                              83.26  52.43  86.68  54.30  64.93  63.92  71.44  77.81
0.9                              83.26  51.59  86.76  52.80  63.91  63.47  71.23  77.27
The overall F-score obtained over all the anomalies was 74.21. All words are used, so
there are more word frequencies that can influence the outcome; as a result it performs
better on both anomalies 4 and 5. It performs poorly on anomaly 2 because the
narratives do not contain many words that are repeated with high frequency across different
documents of the same anomaly. The performance on the other anomalies is similar to the
plain SVM approach. So the use of unlabeled data improves the performance of the Naïve
Bayes classifier.
In this baseline, we generate the features using the Gram-Schmidt Kernel (GSK) algorithm
mentioned in [12]. The entire original preprocessed dataset was used to generate the features
using the GSK algorithm. The numbers of features chosen were 100, 500 and 1000.
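A simplified, document-space sketch of Gram-Schmidt feature generation is shown below. The actual GSK of [12] operates in kernel space with greedy pivot selection; here, purely for illustration, the first k linearly independent documents are used as reference points, an assumption not taken from the thesis.

```python
import numpy as np

def gram_schmidt_features(X, k):
    """Project documents onto an orthonormal basis built, via classical
    Gram-Schmidt, from the first k independent document vectors used as
    reference points. X is a documents-by-words matrix; the result is a
    documents-by-k reduced feature matrix."""
    basis = []
    for i in range(X.shape[0]):
        v = X[i].astype(float).copy()
        for b in basis:
            v -= (v @ b) * b          # remove components along the basis
        norm = np.linalg.norm(v)
        if norm > 1e-10:              # skip (near-)dependent documents
            basis.append(v / norm)
        if len(basis) == k:
            break
    B = np.array(basis)               # shape (k, n_words)
    return X @ B.T                    # new k-dimensional features

X = np.array([[1., 0., 1.], [0., 1., 1.], [1., 1., 0.]])
F = gram_schmidt_features(X, 2)       # 3 documents, 2 features each
```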
Once the features were generated, the labeled feature instances were given as input to the
SVM with all default options for training; -1 was used if the anomaly was present and 1
otherwise. Then the test file was generated using GSK via the exact same procedure as for
the training file. The result was determined using a threshold of 0: instances with scores
below 0 were classified as -1 and scores above 0 were classified as 1.
The overall F-score obtained was 72.77. The GSK approach reduces the feature set,
similarly to the word selection approach, by combining the features that are most similar
to one another.
The semi-supervised approach has two stages. The first stage is based entirely on word
selection using IG and the subtractive log method mentioned in section 4.4. The second stage
further reduces the dimensionality of the feature set obtained in the first stage by applying
the Gram-Schmidt based feature reduction.
An SVM with transduction can incorporate both labeled and unlabeled information to
potentially yield higher classification accuracies. As in the case of the plain SVM, the
SVM-light package was used with default options. Running SVM-light in transductive mode
requires a small change to the input format.
The input is generated in a similar manner to the previous method. The output
generated at section 4.3.3 is used with a few additions, since we need to combine the
labeled and unlabeled instances for processing by the learning algorithm. The way the
class labels are passed is different: the labeled instances are given a -1 or 1, while 0 is
passed for the unlabeled instances. In all the experiments, all the test instances were passed
as the unlabeled instances for training the learner. The learner learns from the data given to it
and generates a model file.
Unlabeled instances are passed in the same way as they were passed in the previous
method. Again the classifier returns a confidence score between -1 and 1, and the threshold
of 0 is used to discretize it.
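A minimal writer for the transductive input file might look like this (hypothetical helper; the 0 label for unlabeled instances is SVM-light's transductive convention, as described above):

```python
def write_transductive_file(path, labeled, unlabeled):
    """Write an SVM-light transductive training file: labeled instances
    carry -1/1, unlabeled (test) instances carry the label 0, which is
    how svm_learn recognizes them in transductive mode."""
    with open(path, "w") as f:
        for label, feats in labeled:
            feat_str = " ".join(f"{i}:{v:g}" for i, v in sorted(feats.items()))
            f.write(f"{label} {feat_str}\n")
        for feats in unlabeled:
            feat_str = " ".join(f"{i}:{v:g}" for i, v in sorted(feats.items()))
            f.write(f"0 {feat_str}\n")

write_transductive_file("train.dat",
                        labeled=[(1, {1: 1, 4: 1}), (-1, {2: 1})],
                        unlabeled=[{1: 1, 3: 1}])
```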
The experiment was performed with the application of word selection as given in
section 4.3.2. The results for 500, 1500, 2500 and 5000 words are tabulated
in Table 5.5 and graphed in Figure 5.4. Comparing this table with those for the baseline
systems, we can see that combining labeled and unlabeled data gives better results than
using labeled data alone.
When looking at the results, one of the key things that stands out is the classification
results obtained for anomaly 4. In the preceding experiments, word selection was
done using the frequency of occurrence of the words in the documents. In this
experiment, however, we use binary-valued features that indicate only the presence or
absence of a word in the document, instead of the number of times it occurs, which
was the case with the plain SVM approach. The slight increase seen in all the other
anomalies is due partly to the presence/absence modification and partly to the fact that
both the labeled and unlabeled data were used to learn the features by the transductive SVM
classifier.
No. of words \ Anomaly      1      2      3      4      5      6      7      8
500                     84.16  72.64  85.99  69.53  72.48  81.30  76.73  79.70
1500                    84.21  73.23  86.36  69.71  72.00  81.94  76.54  79.04
2500                    84.61  73.57  86.79  69.68  72.23  82.22  76.53  79.30
5000                    84.29  73.88  86.24  69.94  71.96  82.81  76.13  78.91
We continue with the reduced feature set produced by our word selection method. One
interesting observation that can be incorporated into this reduction method is that using
word presence instead of word frequency in a given document does not hurt the overall
classification, and for anomalies with low classification rates, like anomaly 4, word
presence was very beneficial. Since we cannot simply replace the normalized frequencies
with 1, or some fixed fraction, without influencing the value decomposition, we instead
take the 1-norm of the feature set generated in the previous step.
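One plausible reading of the 1-norm step is sketched below: each binary presence vector is divided by its 1-norm, so word weights depend only on how many distinct selected words the document contains (hypothetical helper, not the thesis code):

```python
def l1_normalize(presence_vector):
    """Divide a binary presence vector by its 1-norm so every document
    sums to 1; each present word then carries weight 1/(number of
    distinct words present), keeping raw frequency out of the features."""
    total = sum(presence_vector)
    if total == 0:
        return presence_vector
    return [v / total for v in presence_vector]

vec = l1_normalize([1, 0, 1, 0])   # -> [0.5, 0.0, 0.5, 0.0]
```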
In the word selection phase, 5000 features were selected, so that none of the
documents is left with a very small number of features. The train and test feature reduction
algorithms are run on this 5000-feature set. Then the transductive SVM with default
parameters is used to classify the data. The results tabulated in Table 5.6 and graphed in
Figure 5.5 were obtained using reduced feature sets of 250, 500 and 800 with a bias value of
3.
The overall F-score obtained was 78.62, compared to the overall F-score of 78.64
from stage 1. The F-scores do not change much with the application of the Gram-Schmidt
feature reduction, but for classes with average representation the reduction gives better
results compared to word selection. However, it performs badly in cases where there is a
good set of highly discriminative words.
No. of features \ Anomaly      1      2      3      4      5      6      7      8
250                        82.93  70.51  85.85  66.73  70.57  81.13  76.46  77.96
500                        84.41  73.51  86.17  68.31  72.32  81.87  77.26  79.09
800                        84.77  74.01  86.77  69.05  73.03  82.41  77.22  79.33
Looking at the results for the various approaches, it is very clear that the word-selection-based
approaches perform very well on the ASRS dataset. The performance goes down when the
words are spread over a lot of documents with similar frequencies in different documents.
But overall, in most cases the classification performance is good, considering that no
domain-specific information was used.
The best approach turned out to be our approach of word selection using a
combination method instead of plain IG, together with using the entire dataset for training
the transductive SVM classifier. The GS-kernel-based baseline method for determining
features gave better results than the other existing methods, but was not as good as the word
presence method used in combination with a transductive SVM.
The reason our GS-based approach does not perform as well is perhaps that the
reference points used to obtain the new feature space are not spread widely enough; a
transductive approach to selecting the reference documents could boost the results.
CONCLUSION
Few methods use the entire dataset to generate features for a single document in the dataset.
This thesis proposes a method that lets us combine various aspects of the existing approaches.
The ASRS dataset is very word-centric. This is clearly shown by the
experiments conducted just by using different ways of word selection. We could get really
high F-scores just by using simple word selection methods and passing the result through a
learning algorithm like an SVM. Combining unlabeled and labeled data using transduction
gave good improvements over all the existing approaches, but it could perform even better
if both the training and test data were used during feature generation.
The semi-supervised approach developed using the concepts of word selection and
GSK worked far better on this dataset than GSK alone. The key to our method is the way the
instances are selected and used to generate features for other documents. The main
contribution of this thesis lies in our view that documents should not be used just as input
data but also as points of reference from which patterns can be derived for future
classification.
There is a lot of room for improvement, especially with respect to how the original
instances are selected for use during feature generation. The current method of using the
training data, though effective, could be improved to use both the training and test data. Also,
the broader the spread of the reference points, the better the results would be. The word
selection metric can also be improved to provide better features to the GSK module, such
that more discriminative features are generated.
At first glance, it appears that obtaining high accuracy in determining the anomalies
associated with a report would be an extremely complicated task, requiring many stages of
processing. However, the results of our experiments and the details of the new approach
suggest that sometimes it is good to think simple. At least for the ASRS dataset, it turns out
that the simpler the approach, the better the results.
REFERENCES
[1] Christian Posse, Brett Matzke, Catherine Anderson, Alan Brothers, Melissa Matzke,
Thomas Ferryman. Extracting Information from Narratives: An Application to Aviation
Safety Reports. In Proceedings of the IEEE Aerospace Conference, 2005.
[2] Pedro Domingos, Michael J. Pazzani. On the optimality of the simple Bayesian classifier
under zero-one loss. Machine Learning, 29(2-3): 103-130, 1997.
[3] David D. Lewis. Naive (Bayes) at forty: The independence assumption in information
retrieval. In Proceedings of the European Conference on Machine Learning (ECML),
pages 4-15, 1998.
[4] Thorsten Joachims. Text categorization with support vector machines: Learning with
many relevant features. In Proceedings of the European Conference on Machine
Learning (ECML), pages 137-142, 1998.
[6] Ashok N. Srivastava, Ram Akella, Vesselin Diev, Sakthi Preethi Kumaresan, Dawn M.
McIntosh, Emmanuel D. Pontikakis, Zuobing Xu, Yi Zhang. Enabling the Discovery of
Recurring Anomalies in Aerospace Problem Reports using High-Dimensional Clustering
Techniques. In Proceedings of IEEE Aerospace Conference, 2006.
[7] Kamal Nigam, Andrew Kachite McCallum, Sebastian Thrun, Tom Mitchell. Text
Classification from Labeled and Unlabeled Documents using EM. Special Issue on
Information Retrieval, 2000.
[8] P. Dewilde. In E. F. Deprettere (Ed.), SVD and Signal Processing: Algorithms,
Applications and Architectures. Elsevier Science Publishers, pages 3-41, 1988.
[9] Gene H. Golub, Charles F. Van Loan. Matrix Computations, 3rd ed., Johns Hopkins,
1996.
[10] Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for
Text Categorization. In Proceedings of the 14th International Conference on Machine
Learning (ICML), pages 78-85, 1997.
[11] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard
Harshman. Indexing by latent semantic analysis. Journal of the American Society for
Information Science, 1990.
[12] Nello Cristianini, John Shawe-Taylor, Huma Lodhi. Latent Semantic Kernels. Journal
of Intelligent Information Systems, 2002.
Suraj Raghuraman was born in Madras (Chennai), India on 25th June 1984, the son of
Raghuraman and Usha Raghuraman. Suraj spent most of his childhood breaking and fixing
things around the house, later shifting to doing the same on the computer and saving his
parents a lot of money. He did his schooling all over India, switching schools and places
very often. He joined MVIT, Bangalore for a Bachelor of Engineering in Information Science
and, after a very patient wait of four years, was awarded the degree in 2005. He joined Subex
as a Fraud & Risk Management engineer during the final years of his Bachelor's. Suraj was
very fond of starting companies during the later years of his Bachelor's and during his work
at Subex; he started a few companies and closed them down, all of them profitable. During
2007 he was getting bored with his job at Subex, and one of his family's ventures finally
became big enough to sustain itself. He decided to leave his country, quit his job, and joined
UTD for a Master's in Computer Science in Fall 2007. Currently he works at the HLTRI at
UTD.