
Multiclass classification of shelter animal outcomes

Yudong Shen
UC Davis
ydshen@ucdavis.edu

ABSTRACT
In this paper, several machine learning methods, including KNN,
SVM, Softmax and a neural network, are used to predict shelter
animal outcomes from information collected by the Austin Animal
Center. To improve the classification results, re-sampling is also
used to deal with the imbalanced data.

Keywords
Machine learning; Classification; Neural Network

1. INTRODUCTION
Every year, approximately 7.6 million companion animals end up
in US shelters. Many animals are given up as unwanted by their
owners, while others are picked up after getting lost or taken out of
cruelty situations. Many of these animals find forever families to
take them home, but just as many are not so lucky. About 2.7
million dogs and cats are euthanized in the US every year.
Using a dataset of intake information including breed, color, sex,
and age from the Austin Animal Center, it is possible to understand
trends in animal outcomes. These insights could help shelters focus
their energy on specific animals who need a little extra help finding
a new home.
This problem is essentially a multiclass classification problem. As
we have learned in this class, there are many techniques for solving
this kind of problem. For this shelter animal outcomes problem,
there are 5 classes to be predicted.

2. TECHNIQUES
The features included in the raw data are ID, Name, DateTime,
Animal Type, Sex upon Outcome, Age upon Outcome, Breed and
Color, and the outcomes are Died, Adoption, Euthanasia, Return to
owner and Transfer. The purpose is to use machine learning models
to classify the data into these 5 outcomes. Here is an example of the
raw data and an overview of the outcomes of all dogs and cats,
copied from the Kaggle forum. As we can see in the figure, most
of the animals end up as Adoption or Transfer, which is good news.

Table 1. Raw data example

ID       Name    Date        Type  Sex            Age      Breed                   Color        Outcome
A671945  Summer  2015-10-12  Dog   Neutered Male  1 year   Shetland Sheepdog Mix   Brown/White  Return to owner
A699218  Jimmy   2015-03-28  Cat   Intact Male    3 weeks  Domestic Shorthair Mix  Blue Tabby   Transfer

Figure 1. Animal outcomes overview

Before getting started on the classification task, I carried out some
research to get familiar with the dataset and to decide which
features and observations would be useful. First of all, it is
necessary to convert non-numeric values to numbers.
1. ID: not needed.
2. Name: the meaning of the names is not feasible to analyze
in my prediction model, but whether an animal has a name
should be a useful feature.
3. Date: it matters whether it is a workday or a weekend.
4. Type: simply set cats to 1 and dogs to 0.
5. Sex: there are four kinds of sex plus unknown, so I used
five features, and every animal has exactly one of them set
to 1 and the others 0. This feature is relatively important
because intact animals may have shorter lives.
6. Age: it is natural to convert the ages to days. Intuitively,
this is the most important feature, as young dogs (older
than a month) are likely to be adopted and old cats are
likely to be euthanized.
7. Breed: there are more than 1,000 unique entries in the breed
feature, so the breeds are rather difficult to work with.
However, most of them are crosses of two breeds, so I split
them, giving 216 simple breeds in total plus 2 other
situations (unknown, or a mix of one simple breed). Every
animal then has several of these 218 features set. This
feature is not that important because different people adopt
different kinds of animals according to their own preferences.
8. Color: use the same technique as for breed. There are 366
unique colors, from which 54 simple colors are extracted.
The next step is normalization. Most of the pre-processed data is
already in the range [0, 1], except the date feature, which has a
much wider range, so only that feature column needs a simple
rescaling step.
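The per-feature conversions and the normalization step above can be sketched with pandas. The column names and the tiny two-row frame below are hypothetical, mirroring the raw-data example rather than the actual Kaggle files, and the min-max rescaling is shown on the age column as an illustration of the same step:

```python
import pandas as pd

# Hypothetical mini-frame mirroring the raw columns described above.
df = pd.DataFrame({
    "Name": ["Summer", None],
    "DateTime": ["2015-10-12", "2015-03-28"],
    "AnimalType": ["Dog", "Cat"],
    "SexuponOutcome": ["Neutered Male", "Intact Male"],
    "AgeuponOutcome": ["1 year", "3 weeks"],
})

# 2. Name: keep only whether the animal has a name.
df["HasName"] = df["Name"].notnull().astype(int)

# 3. Date: workday vs. weekend.
dt = pd.to_datetime(df["DateTime"])
df["IsWeekend"] = (dt.dt.dayofweek >= 5).astype(int)

# 4. Type: cats -> 1, dogs -> 0.
df["IsCat"] = (df["AnimalType"] == "Cat").astype(int)

# 5. Sex: one-hot encode the five categories.
df = pd.concat([df, pd.get_dummies(df["SexuponOutcome"])], axis=1)

# 6. Age: convert "1 year" / "3 weeks" style strings to days.
UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}
def age_in_days(s):
    n, unit = s.split()
    return int(n) * UNIT_DAYS[unit.rstrip("s")]
df["AgeDays"] = df["AgeuponOutcome"].map(age_in_days)

# Normalization: min-max rescale a wide-range column into [0, 1].
a = df["AgeDays"].astype(float)
df["AgeDays"] = (a - a.min()) / (a.max() - a.min())
```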

2.1 K-Nearest Neighbor


The most straightforward approach is a k-nearest-neighbor
classifier: an object is classified by a majority vote of its
neighbors, being assigned to the class most common among its k
nearest neighbors. Intuitively, higher values of k have a smoothing
effect that makes the classifier more resistant to outliers. [1] That
is to say, for each test point in the high-dimensional space, the
Euclidean distance to every training point can be calculated by
summing the squared differences over every coordinate; the k
nearest training points are found, and the prediction for the test
point is the mode of those k neighbors' labels.

Figure 2. KNN classification
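The vote just described can be sketched in a few lines of numpy; this is a minimal illustration of the idea, not the implementation used in the experiments:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict each test row as the majority class among its k nearest
    training rows under Euclidean distance (a minimal sketch)."""
    preds = []
    for x in X_test:
        # Squared Euclidean distance to every training point.
        d = np.sum((X_train - x) ** 2, axis=1)
        nearest = y_train[np.argsort(d)[:k]]
        # Prediction is the mode of the k neighbor labels.
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])
    return np.array(preds)
```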

2.2 Linear Classifier


Another method is a linear classifier, which computes the score of
a class as a weighted sum of all the input features (in the image
setting of [2], the pixel values). In other words, the score is the
value of a linear combination of the features. The score function is
defined as:

f(x, W) = Wx + b

There are many ways to define this kind of classifier; the two most
popular are the Multiclass Support Vector Machine (SVM)
classifier (especially linear) and the Softmax classifier.
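Concretely, the score function can be evaluated in a couple of lines of numpy. The 281 feature dimensions and 5 classes match the pre-processed data in this paper, while the random weights are placeholders standing in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 281, 5                        # feature dimension and number of classes
W = rng.normal(size=(C, D)) * 0.01   # placeholder weights
b = np.zeros(C)                      # placeholder biases

x = rng.random(D)            # one pre-processed animal, features in [0, 1]
scores = W @ x + b           # one score per class: f(x, W) = Wx + b
pred = int(np.argmax(scores))  # predicted class = highest score
```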

2.2.1 Support Vector Machine


The SVM classifier ideally wants to find weights such that the
calculated score of the correct class is higher than all other scores
by at least a margin delta, which is usually impossible to satisfy
exactly. A loss function is therefore defined on how far each wrong
class's score comes within that margin, in the form of a hinge loss,
and the SVM algorithm computes the weights that make the total
loss as low as possible. The basic idea of training with SVM is to
find a hyperplane that separates the d-dimensional data perfectly
into two classes. If the data is not linearly separable, SVM
introduces the notion of a "kernel-induced feature space", which
casts the data into a higher-dimensional space where the data is
separable. In my project, since 281 features are obtained from raw
data that has only 7 features, this kernel trick may not be necessary.
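The hinge loss for one example can be sketched directly from this description; a minimal numpy version, assuming the class scores have already been computed:

```python
import numpy as np

def multiclass_hinge_loss(scores, y, delta=1.0):
    """Multiclass SVM (hinge) loss for one example: sum over the wrong
    classes of how far their scores come within `delta` of the score of
    the correct class y."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0   # the correct class contributes no loss
    return margins.sum()
```

For example, with scores [3.0, 1.0, 2.5] and correct class 0, only the third class violates the margin, by 0.5.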

2.2.2 Softmax classifier


The Softmax classifier is the other common choice. It is similar to
SVM but has a different loss function, the cross-entropy loss.
Unlike the unnormalized scores from SVM, the output of the
Softmax classifier provides probabilities for each class. [3] In
practice these two popular linear classifiers are comparable: the
performance difference between a linear SVM and Softmax is
usually very small.
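The softmax transform and its cross-entropy loss can be sketched as follows, again assuming the class scores are given:

```python
import numpy as np

def softmax(scores):
    # Shift by the max score for numerical stability before exponentiating.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def cross_entropy_loss(scores, y):
    """Negative log of the probability assigned to the correct class y."""
    return -np.log(softmax(scores)[y])
```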

2.3 Neural network


A neural network is a family of models inspired by biological
neural networks. It is generally presented as a system of
interconnected "neurons" which exchange messages with each
other: the input features are weighted and transformed by an
activation function, the activations of these neurons are passed on
to other neurons, and this process repeats until the final output
layer is reached. [4] Although neural networks typically work best
on unstructured data (text/audio/images/video), it is still worth
trying here because my pre-processed data has 281 dimensions in
total, which is relatively large.
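One forward pass through such a network can be sketched in numpy; the 281-10-5 shape matches one architecture from the experiments, while the ReLU activation and random weights are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

D, H, C = 281, 10, 5   # input features, hidden units, outcome classes
W1 = rng.normal(size=(H, D)) * 0.01   # placeholder first-layer weights
b1 = np.zeros(H)
W2 = rng.normal(size=(C, H)) * 0.01   # placeholder output-layer weights
b2 = np.zeros(C)

x = rng.random(D)                   # one pre-processed animal
h = np.maximum(0.0, W1 @ x + b1)    # hidden activations (ReLU assumed)
scores = W2 @ h + b2                # one score per outcome class
```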

Figure 3: A simple neural network

In my project, my work is implementing the classification of animal
outcomes with the methods mentioned above and evaluating the
accuracy using cross validation. Note that the data was downloaded
from Kaggle, and there are 26,729 training examples with 281
features in total.

3. EXPERIMENTS
3.1 K-Nearest Neighbor

Table 2. KNN
k      Accuracy
1      44.16%
5      49.08%
50     52.59%
200    52.11%
500    51.76%
Processing Time: ~13 min

The prediction result when k = 1 is bad, but as k increases the
cross-validation accuracy also increases. If k is too large, however,
the accuracy decreases again.
Although this method is very simple, the prediction speed is
relatively slow. Another drawback is that every feature carries the
same weight, but the breed and color of an animal should surely
matter more than whether it is a weekday or not.

3.2 Support Vector Machine

Table 3. SVM
Kernel     Training Accuracy   Test Accuracy   Processing Time
Linear     63.97%              63.19%          ~7 min
Gaussian   60.97%              60.90%          ~15 min

The result of SVM is much better than KNN. As mentioned above,
a linear kernel should be enough to separate the data because there
are 281 dimensions; the nonlinear kernel takes longer while
obtaining worse accuracy.

3.3 Softmax Classifier

Table 4. Softmax
Name      Training Accuracy   Test Accuracy   Processing Time
Softmax   65.02%              63.74%          ~2 min

The result of the Softmax classifier is better than SVM and the
training time is much shorter.

3.4 Neural Network

Table 5. Neural Network
Architecture     Training Accuracy   Test Accuracy   Processing Time
281-3-5          65.60%              64.20%          ~4 min
281-8-5          68.08%              63.38%          ~5 min
281-10-5         69.07%              64.35%          ~5 min
281-100-5        83.00%              58.35%          ~8 min
281-300-5        87.74%              57.37%          ~17 min
281-400-400-5    88.43%              57.16%          ~30 min
* Different activation functions do not affect the result much.

The last model is the neural network. The test results show roughly
no difference from the simple linear classifiers. When the
architecture gets more complex, the model is more likely to overfit.
My understanding is that although a neural network has the ability
to learn unstructured and complex data with deeper layers, the data
structure here is simple enough that this capacity is not needed.
After some survey, a suitable method here could be a Random
Forest, an ensemble learning method that operates by constructing
a multitude of decision trees. [5]

4. EVALUATION
Looking deeper into the data, I think the imbalanced distribution
hurts the performance. For example, only 0.7% of all animal
outcomes are Died, so it is difficult for the model to learn this
class; in my validation results the accuracy on this class is 0%. In
other words, the results obtained above cannot reflect the
information of this minority class.
If more data could be collected, the performance should improve:
a larger dataset might expose a different and perhaps more
balanced perspective on the classes, and more examples of the
minor classes would be useful when resampling the dataset. But
since there are only 197 (0.7%) samples of the class Died, a
re-sampling method is used instead to build more balanced data.
The most popular of such algorithms is SMOTE, the Synthetic
Minority Over-sampling Technique. It works by creating synthetic
samples from the minor class: it selects two or more similar
instances (using a distance measure) and perturbs an instance one
attribute at a time by a random amount within the difference to the
neighboring instances. [6]
Here is the final result after SMOTE, using the Softmax classifier,
which shows a decent improvement over the results above.

Table 6. Softmax after SMOTE
Name      Training Accuracy   Test Accuracy   Processing Time
Softmax   73.49%              71.24%          ~2 min
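The interpolation step SMOTE performs can be sketched as follows; this is a simplified SMOTE-like routine written for illustration only, not the reference implementation (in practice a library such as imbalanced-learn provides SMOTE):

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, rng=None):
    """SMOTE-style synthetic minority samples (a simplified sketch):
    pick a minority point, pick one of its k nearest minority neighbors,
    and interpolate a random fraction of the way between them."""
    if rng is None:
        rng = np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.sum((X_min - X_min[i]) ** 2, axis=1)
        # Nearest neighbors of point i, excluding i itself.
        neighbors = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbors)
        gap = rng.random()   # random fraction of the way to the neighbor
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point lies on the segment between two real minority points, it stays inside the region the minority class already occupies.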

5. CONCLUSION
Through this project I learned a lot about linear classifiers and
neural networks. The shelter animal outcomes classification
problem is very challenging and quite different from our class
assignments: there we did hand-written digit recognition, an
unstructured dataset with a balanced distribution (every digit has
roughly the same number of samples), while this problem is more
structured and has imbalanced data.
The SMOTE algorithm is very useful for dealing with an
imbalanced dataset: it adds artificial minority-class data to the
training set so that those classes are not ignored by the model.
Also, given how structured this problem's data is, a neural network
is not necessary; a linear classifier like Softmax is good and fast
enough to implement.

6. REFERENCES
[1] Wikipedia. K-nearest neighbors algorithm. Wikipedia, the
free encyclopedia, 2016.
[2] Wikipedia. Linear classifier. Wikipedia, the free
encyclopedia, 2016.
[3] Yichuan Tang. Deep learning using linear support vector
machines. arXiv preprint arXiv:1306.0239, 2013.
[4] Wikipedia. Artificial neural network. Wikipedia, the free
encyclopedia, 2016.
[5] Tin Kam Ho. The random subspace method for constructing
decision forests. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 20(8):832-844, 1998.
[6] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P.
Kegelmeyer. SMOTE: synthetic minority over-sampling
technique. Journal of Artificial Intelligence Research,
321-357, 2002.