
Classification of Modified MNIST Dataset by Team KSV
Kashif Javed
CS Department
McGill University
kashif.javed@mail.mcgill.ca

Vaseem Ahmed
ECE Department
McGill University
vaseem.ahmed@mail.mcgill.ca

Shrey Gupta
Department of Software and IT Engineering
Ecole de technologie superieure
shrey.gupta@mail.mcgill.ca

Abstract: The problem of digit classification in machine learning has
been well studied over the years, and a significant amount of work
has been done on numerous datasets. In this paper, we attempt to
classify images of handwritten digits (0-9) represented in cropped
images obtained from a transformed version of the MNIST dataset. We
discuss various data pre-processing and feature extraction
techniques. Furthermore, we discuss the performance of Naive
Bayes (NB), K-Nearest Neighbor (KNN), Support Vector Machine
(SVM), Neural Network (NN), and Convolutional Neural Network
(CNN) classifiers. Results show that the best classifier was the CNN,
with an accuracy of 0.848 on the given dataset.

I. INTRODUCTION

Handwritten text recognition has been an active area of
research in machine learning since the 1980s. The task of digit
classification has numerous applications, such as signature
verification, postal-address interpretation, and bank-check processing.
Recognizing handwritten digits is difficult because they vary in
shape, thickness, orientation, and position relative to the margins.
In this project we are given a modified version of the MNIST dataset.
The images in the modified dataset are resized, rescaled, and rotated
variants of the original MNIST images. In addition, texture and
embossing are added to each image. Thus, images in the modified
dataset contain much more noise than images in the classic MNIST
dataset. In this report, we attempt to classify images containing
digits into ten categories (0-9). A digit classification task largely
depends upon feature selection. Therefore, we applied filtering
techniques to remove the extra noise from the images and extracted
features using Histogram of Oriented Gradients and Gabor filters.
The processed data was used to train classifiers such as
Naive Bayes, K-Nearest Neighbor, Support Vector Machine,
Neural Network, and Convolutional Neural Network.
In addition, we also trained these algorithms using raw
feature vectors obtained from the images. Finally, we
present our results and discuss the performance of all the tested
classifiers.

A. Averaging Filters
Adding texture to the data made the original images of the
digits very noisy. Averaging filters are useful for noise removal
because each pixel of the image is replaced by the average
of its neighboring pixels.
Median filter: Each output pixel in the image is replaced
by the median of its 3x3 neighborhood. A median filter
was used since median filters are effective in removing noise
while preserving edges at the same time [7]. Thus we expected
the median filter to help in detecting the strokes of the digits.
Gaussian filter: Each output pixel in the image is replaced
by a weighted average of its 3x3 neighborhood. Gaussian
filters are smoothing filters. They are better than mean
noise-removing filters in the sense that they provide
gentler smoothing and preserve edges better than a similarly
sized mean filter [8]. The value of sigma used was 0.5.
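The two 3x3 filters described above can be sketched in plain Python as follows. This is only an illustration, not the code used in the project: the border handling (edge replication), the helper names, and the toy image are assumptions, while the 3x3 window and sigma = 0.5 come from the text.

```python
import math
from statistics import median


def gaussian_kernel_3x3(sigma=0.5):
    """Return a normalized 3x3 Gaussian kernel (weights sum to 1)."""
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in (-1, 0, 1)] for dy in (-1, 0, 1)]
    total = sum(sum(row) for row in raw)
    return [[w / total for w in row] for row in raw]


def neighborhood(img, r, c):
    """3x3 neighborhood of pixel (r, c); border pixels are replicated (assumption)."""
    h, w = len(img), len(img[0])
    return [[img[min(max(r + dy, 0), h - 1)][min(max(c + dx, 0), w - 1)]
             for dx in (-1, 0, 1)] for dy in (-1, 0, 1)]


def median_filter(img):
    """Replace each pixel by the median of its 3x3 neighborhood."""
    return [[median(v for row in neighborhood(img, r, c) for v in row)
             for c in range(len(img[0]))] for r in range(len(img))]


def gaussian_filter(img, sigma=0.5):
    """Replace each pixel by a Gaussian-weighted average of its 3x3 neighborhood."""
    k = gaussian_kernel_3x3(sigma)
    out = []
    for r in range(len(img)):
        out_row = []
        for c in range(len(img[0])):
            nb = neighborhood(img, r, c)
            out_row.append(sum(k[i][j] * nb[i][j]
                               for i in range(3) for j in range(3)))
        out.append(out_row)
    return out


# Toy image: a flat background with one isolated noisy pixel.
image = [[10, 10, 10, 10],
         [10, 200, 10, 10],
         [10, 10, 10, 10],
         [10, 10, 10, 10]]
median_out = median_filter(image)
gauss_out = gaussian_filter(image)
```

On this toy image the median filter removes the isolated pixel entirely, whereas the Gaussian filter only attenuates it and spreads part of its value to the neighbors.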
B. FFT and IFFT
The Fast Fourier Transform (FFT) converts the image signal from
the spatial domain to the frequency domain. High-frequency noise can
then be filtered out in the frequency domain. The image is then
converted back to the spatial domain using an Inverse Fast Fourier
Transform (IFFT). This helps in removing background texture and
noise, as shown in Fig. 1.
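The transform, filter, and inverse-transform pipeline can be illustrated with the small sketch below. It uses a naive discrete Fourier transform rather than a true FFT so that it stays self-contained; the square cutoff mask, the cutoff value, and the toy 8x8 image are assumptions, since the report does not state which frequencies were removed.

```python
import cmath


def dft2(img, inverse=False):
    """Naive 2-D discrete Fourier transform (O(N^4)); adequate for tiny images."""
    h, w = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for r in range(h):
                for c in range(w):
                    angle = 2 * cmath.pi * (u * r / h + v * c / w)
                    acc += img[r][c] * cmath.exp(sign * 1j * angle)
            out[u][v] = acc / (h * w) if inverse else acc
    return out


def low_pass(img, cutoff):
    """Zero all frequencies above `cutoff`, then transform back to the spatial domain."""
    h, w = len(img), len(img[0])
    spectrum = dft2(img)
    for u in range(h):
        for v in range(w):
            # Distance of each index from the zero frequency, with wrap-around.
            fu, fv = min(u, h - u), min(v, w - v)
            if fu > cutoff or fv > cutoff:
                spectrum[u][v] = 0j
    return [[val.real for val in row] for row in dft2(spectrum, inverse=True)]


# Constant image (value 100) plus a checkerboard "texture" of amplitude 20,
# which is the highest spatial frequency an 8x8 grid can hold.
noisy = [[100 + (20 if (r + c) % 2 else -20) for c in range(8)] for r in range(8)]
smooth = low_pass(noisy, cutoff=2)
```

Because the checkerboard lives entirely in the highest-frequency coefficient, the low-pass step recovers the flat background. Real digit images are less forgiving: stroke edges also contain high frequencies, so an aggressive cutoff can remove part of the digit along with the texture.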
III. FEATURE EXTRACTION

Feature extraction is a significant part of digit recognition,
and a variety of feature types and extraction strategies have
been proposed in the past. However, selecting an appropriate
extraction method depends upon the type of data. In our case,
the modified MNIST dataset contains rotated images of digits
with texture and embossing in the background. Therefore, we used
feature extraction methods that are invariant to the rotation,
texture, and contrast of an image. The following feature extraction
methods were tried:
A. Histogram of Oriented Gradients (HOG)

II. DATA PRE-PROCESSING

We tried two pre-processing methods to prepare our data
before training our algorithms. We focused our pre-processing
efforts on removing the noise added to the images by the texture
and embossing in the background. The techniques
used are described below:

Fig. 1: Images on the left show the original image data, and
images on the right show the data after being transformed through
an FFT and IFFT.

Fig. 2: Gabor filters

Fig. 3: Gabor filters after convolution

We first applied the Histogram of Oriented Gradients method to count
the occurrences of gradient orientations in localized portions of an
image. HOG was proposed by Navneet Dalal and Bill Triggs
in June 2005 [2]. HOG features are considered effective for
recognizing objects in images because they are tolerant to
changes in the scale, rotation, translation, and contrast of an image
[4]. The idea of HOG descriptors is that the local appearance
and shape within an image can be described by the distribution
of intensity gradients or edge directions. The algorithm determines
local properties of an image by dividing the image into
small overlapping blocks. For each block, histograms of gradient
directions are calculated, and the combination of these histograms
forms the feature vector for the image. For this project we calculated
the features on a cell size of [8*8] pixels and obtained a feature
vector of [1*900] for each image.
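As a rough check of the [1*900] feature length, the sketch below counts HOG features under assumed settings: 48x48 images (the size mentioned in the results), blocks of 2x2 cells that overlap by one cell, and 9 orientation bins. Only the 8x8 cell size is stated in the report; the block size and bin count are assumptions that happen to reproduce the 900 figure. The second function is a simplified, hypothetical per-cell orientation histogram, without the block normalization a full HOG implementation applies.

```python
import math


def hog_feature_length(image_size=48, cell_size=8, block_cells=2, bins=9):
    """Number of HOG features for a square image (assumed block layout)."""
    cells = image_size // cell_size          # 6 cells per side
    blocks = cells - block_cells + 1         # 5 overlapping blocks per side
    return blocks * blocks * block_cells * block_cells * bins


def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]   # horizontal central difference
            gy = cell[r + 1][c] - cell[r - 1][c]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += magnitude
    return hist


# A vertical edge (dark left half, bright right half) produces purely
# horizontal gradients, so all the weight falls into the first bin.
cell = [[0] * 4 + [255] * 4 for _ in range(8)]
hist = cell_histogram(cell)
n_features = hog_feature_length()
```

With these settings the count is 5 x 5 blocks x 4 cells x 9 bins = 900 features.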
Fig. 4: Confusion Matrix for Naive Bayes

B. Gabor Filter Bank
A Gabor filter bank uniformly covers the image in the frequency
domain, and each filter calculates the energy of a localized
frequency. Gabor filters extract features for different orientations
and scales [1]. The filters are convolved with the
original image to give different representations of the image.
These representations give the feature vector for the
image. Fig. 2 shows the [5*8] filters for different
orientations and scales, and Fig. 3 shows the Gabor
features after convolution of the [5*8] filters with the original
image. Thus, we finally get two [1*40] matrices, and after
appending them together a [1*80] feature matrix for each image
is obtained. We used Gabor filters for feature extraction because
they are scale invariant and effective for edge detection.
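A bank of 5 scales by 8 orientations, as described above, could be generated along the following lines. This is a hedged sketch: the kernel size, wavelengths, sigma, and aspect ratio are hypothetical values chosen for illustration, and the single response per filter computed here is not necessarily how the two [1*40] vectors in the report were derived.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a 2-D Gabor kernel: Gaussian envelope times a cosine wave."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the wave travels along direction theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel


def gabor_bank(scales=5, orientations=8, size=9):
    """Build scales x orientations kernels (parameter spacing is an assumption)."""
    bank = []
    for s in range(scales):
        wavelength = 3.0 * (2.0 ** (s / 2.0))
        for o in range(orientations):
            theta = o * math.pi / orientations
            bank.append(gabor_kernel(size, wavelength, theta, sigma=0.5 * wavelength))
    return bank


def filter_response(patch, kernel):
    """Response of one filter at one location: elementwise product, summed."""
    return sum(p * k for prow, krow in zip(patch, kernel) for p, k in zip(prow, krow))


bank = gabor_bank()
# 9x9 patch containing a vertical edge; one response magnitude per filter.
patch = [[255 if c >= 4 else 0 for c in range(9)] for r in range(9)]
features = [abs(filter_response(patch, k)) for k in bank]
```

Each image location therefore yields 40 numbers, one per filter; summarizing the responses over the whole image in two different ways would give the two 40-element vectors mentioned in the text.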
IV. ALGORITHMS

A. Naive Bayes
The Naive Bayes algorithm classifies the data on the basis of
P(class|features), which is proportional to the product of the prior
probability and the likelihood. There is also a
general probability term in the denominator, but it is the same for
all classes and can therefore be ignored. The formula for Naive
Bayes is:

$$P(\text{class} \mid \text{features}) = \frac{P(\text{class}) \, P(\text{features} \mid \text{class})}{P(\text{features})} \tag{1}$$
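A minimal Bernoulli Naive Bayes sketch based on Eq. (1) is given below. It assumes binary (for example, thresholded) pixel features and uses Laplace smoothing; both the smoothing constant and the toy data are assumptions, not details taken from the report. The denominator of Eq. (1) is dropped because it is the same for every class.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed P(feature = 1 | class) for each class."""
    n_features = len(X[0])
    model = {}
    for cls in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == cls]
        prior = len(rows) / len(X)
        # Laplace smoothing (alpha is an assumed hyperparameter).
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(n_features)]
        model[cls] = (math.log(prior), probs)
    return model


def predict(model, x):
    """Return the class maximizing log P(class) + sum_j log P(x_j | class)."""
    def log_posterior(cls):
        log_prior, probs = model[cls]
        return log_prior + sum(math.log(p if xi else 1.0 - p)
                               for xi, p in zip(x, probs))
    return max(model, key=log_posterior)


# Toy binary "images" with four pixels each.
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
y = [0, 0, 1, 1]
model = train_bernoulli_nb(X, y)
```

Working in log space avoids numerical underflow when the product in Eq. (1) runs over thousands of pixel features.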

For this project, we implemented a Bernoulli Naive Bayes classifier
and achieved an accuracy of 0.35 on raw features. The confusion
matrix for NB is shown in Fig. 4.

B. K-Nearest Neighbor
We also tested the K-Nearest Neighbor (KNN) algorithm for
this project. We used Euclidean distance to find the training
instances closest to a test point. The test point is then assigned
the class that is in the majority among these k neighbors. The
Euclidean distance is calculated using the formula:

$$D(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2} \tag{2}$$

where $x$ belongs to the test set and $y$ to the training set. The
variables $x_i$ and $y_i$ represent the same (i-th) feature of the
respective instances; the sum in Eq. (2) runs over the features.
For KNN we achieved a maximum accuracy of 0.30 on raw
features.

C. Support Vector Machine (SVM)
We also tested the accuracy on the raw image dataset, as well
as on the filtered dataset, using a Support Vector Machine classifier.
It performed marginally better than the NB classifier and achieved
an accuracy of 0.38 with raw features. However, using filtered
features the performance of SVM was similar to that of NB.
The confusion matrix for SVM calculated with raw features is
shown in Fig. 5.

Fig. 5: Confusion Matrix for SVM

D. Neural Network
Results have shown that carefully designed Neural Networks can
classify handwritten digits very effectively. The idea of Neural
Networks was inspired by the human brain, which is exceptionally
good at classifying objects. Like the human brain, a
Neural Network contains many neurons arranged in
layers. The first layer is the input layer; in our problem, the
pixel values of the images were used as its input values.
Intermediate layers are called hidden layers and
can contain an arbitrary number of neurons, depending
on the architecture of the network. A neuron has
several inputs, with a different weight assigned to each input
link. The output of a neuron is a sigmoid function
applied to the linear combination of its inputs and weights
($\theta_2 x_2 + \theta_1 x_1 + b$), where the sigmoid function is defined by:

$$a = \sigma(z) = \frac{1}{1 + \exp(-z)} \tag{3}$$

In this way the output of each neuron becomes the input
of the next layer's neurons, until we reach the output layer,
which represents the predicted outcome of the network. During
training, the weights are adjusted using gradient descent to minimize
the overall error, by differentiating the cost function with respect
to the weights:

$$\nabla_{\theta^{(l)}} = \delta^{l+1} \, (a^{l})^{T} + \ldots \tag{4}$$

where $\delta_j^{l}$ represents the error of the j-th node in
layer $l$. The new weights are then assigned to the links by:

$$\theta^{(l)} = \theta^{(l)} + \alpha \, \delta^{l+1} \, (a^{l})^{T} \tag{5}$$

where $\alpha$ represents the learning rate.

1) Neural Network Architecture: Initially, a fully connected
Neural Network with 2 hidden layers, each containing 392 nodes,
was trained on the MNIST dataset.
This was done to determine the minimum architecture required
to solve this problem. Using this architecture, 0.93 accuracy
was achieved on the MNIST dataset. Since the provided training
images were much more difficult than the MNIST dataset,
an additional layer was added to the architecture and the number
of nodes in the hidden layers was increased to 576 for solving
this problem.
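The sigmoid activation and the additive weight update used by the network can be illustrated for a single neuron as follows. This is a sketch under stated assumptions: the error term is taken to be (target - output) times the sigmoid derivative, which is one common definition under which adding the learning rate times the error times the input reduces the error; the report does not spell out its own definition, and the starting weights and learning rate here are arbitrary.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs):
    """Sigmoid applied to the weighted sum of the inputs plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def update(weights, bias, inputs, target, learning_rate):
    """One gradient step for a single sigmoid neuron with squared error."""
    a = neuron_output(weights, bias, inputs)
    # Assumed error term: (target - output) * sigmoid'(z), with sigmoid'(z) = a(1 - a).
    delta = (target - a) * a * (1.0 - a)
    new_weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * delta
    return new_weights, new_bias


weights, bias = [0.1, -0.2], 0.0
inputs, target = [1.0, 0.5], 1.0
before = neuron_output(weights, bias, inputs)
for _ in range(100):
    weights, bias = update(weights, bias, inputs, target, learning_rate=0.5)
after = neuron_output(weights, bias, inputs)
```

In a full network the same update is applied layer by layer, with the error terms of the hidden layers obtained by propagating the output error backwards.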
E. Convolutional Neural Network
A Convolutional Neural Network (CNN) is a type of feed-forward
neural network based on Hubel and Wiesel's early work on the visual
cortex [3]. The neurons in a CNN are arranged so that
they respond to overlapping visual regions of the image, acting
as local filters.
In our project we used a CNN based on the LeNet
architecture [6]. The LeNet-5 architecture is a specialized neural
network architecture designed to recognize handwritten
and machine-printed characters. Caffe (an implementation of
CNNs) was used [5]. Caffe has a default configuration for the
LeNet-5 architecture, which we used for our dataset as well.
The architecture has the following layers:

For validating KNN, we used a smaller dataset, since KNN
takes very long to run on the entire training dataset. Therefore,
we used 10000 samples as the training set and 1000 samples
as the testing set for validation of KNN.
In the case of SVM, we used a single hold-out split (1-fold
validation), with 37500 images in the training set and 12500 in the
testing set.
Furthermore, validation of the Neural Network was performed
on a testing set consisting of 1000 images, and 10000 images
were used for training the NN.

Fig. 6: Learning Rate against number of iterations for CNN

Lastly, the Convolutional Neural Network was validated using
a validation set of 100 images. The training set consisted
of 49900 images. Validation was used to determine the learning
rate and the number of iterations necessary to achieve optimal
accuracy. We could not perform a comprehensive parameter
selection for the CNN due to time limitations. Thus we used the
default parameters available in the Caffe LeNet library.
V. RESULTS

As per our training and validation scheme, we performed
5-fold cross-validation on the Naive Bayes classifier. The aim was
to obtain an optimal feature set. The accuracy of the Naive Bayes
algorithm using raw features was 0.35, while with the feature set
obtained from filtered data the accuracy was about 0.10.
KNN underperformed Naive Bayes, giving an accuracy of
0.30 on raw features. However, we could not run the algorithm
on more than 10000 training images, as the running time
of KNN with 48x48 (2304) raw pixel features is very large. The
best result was achieved at k = 10.
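The KNN procedure (Euclidean distance followed by a majority vote among the k nearest training instances) can be sketched as below. The two-dimensional toy data and k = 3 are assumptions for illustration only; the experiments in this report used raw pixel vectors and found k = 10 to work best.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k):
    """Assign the majority class among the k training points closest to `query`."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
label_a = knn_predict(train_X, train_y, [0.5, 0.5], k=3)
label_b = knn_predict(train_X, train_y, [5.5, 5.5], k=3)
```

Every prediction computes a distance to every training instance, which is why the cost grows quickly with both the number of training images and the 2304-dimensional feature vectors.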
Fig. 7: Accuracy vs number of iterations for CNN

Convolution Layer (C1): The first convolution layer used
a kernel size of 5 pixels and learned 20 filters.
Pooling Layer (P1): The pooling layer reduces
computation in the upper layers by merging the results of the lower
layers and eliminating non-maximal values. The pooling
kernel size used was 2 pixels.
Convolution Layer (C2): The second convolution layer
used a kernel size of 5 pixels and learned 50 filters.
Pooling Layer (P2): The pooling kernel size used was 2
pixels.
The learning rate used was 0.01, and 8000 iterations were
performed. An accuracy of 0.848 was obtained.
We were not able to perform more iterations because of time
constraints.
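To see how the four layers above shrink an image, the sketch below traces the spatial size through C1, P1, C2, and P2. It assumes a 48x48 single-channel input, convolutions with stride 1 and no padding, and pooling with stride 2; the kernel sizes and filter counts come from the text, but the strides, padding, and input size are assumptions.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, kernel, stride):
    """Spatial output size of pooling, rounding up for partial windows."""
    return -(-(size - kernel) // stride) + 1


def trace_lenet(input_size=48):
    """Return (layer, channels, height, width) after each layer."""
    # (name, kind, kernel size, stride, number of filters)
    layers = [("C1", "conv", 5, 1, 20),
              ("P1", "pool", 2, 2, None),
              ("C2", "conv", 5, 1, 50),
              ("P2", "pool", 2, 2, None)]
    size, channels = input_size, 1
    shapes = []
    for name, kind, kernel, stride, filters in layers:
        if kind == "conv":
            size = conv_output_size(size, kernel, stride)
            channels = filters
        else:
            size = pool_output_size(size, kernel, stride)
        shapes.append((name, channels, size, size))
    return shapes


shapes = trace_lenet()
flattened = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
```

Under these assumptions the image goes from 48x48 to 44x44, 22x22, 18x18, and finally 9x9, so the layers after P2 would receive 50 x 9 x 9 = 4050 values.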
F. Testing and Validation
To validate our Naive Bayes results we used k-fold cross-validation.
Given the large number of features and examples in the
training set, we selected a small value, k = 5, because
processing time increases considerably with higher values of k.
In each fold, we validated Naive Bayes with 40000
examples in the training set and 10000 examples in the testing set.
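The 5-fold scheme can be sketched as an index split, shown below. It assumes 50000 labelled examples in total (implied by the 40000/10000 split) and contiguous, unshuffled folds; in practice the indices would normally be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of (train, test) indices."""
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder.
        stop = n_samples if i == k - 1 else start + fold_size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        folds.append((train_idx, test_idx))
    return folds


folds = k_fold_indices(50000, 5)
fold_sizes = [(len(train), len(test)) for train, test in folds]
```

Each example is used for testing exactly once and for training in the other four folds, and the reported accuracy is the average over the five test folds.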

For SVM a maximum accuracy of 0.38 was achieved using
raw features, while using features obtained after binarization
and the other feature selection methods described above the
accuracy was approximately 0.10.
Moreover, the Neural Network with 2 hidden layers containing
392 nodes each gave an accuracy of approximately 0.20.
Finally, the Convolutional Neural Network outperformed all other
classifiers, giving an accuracy of 0.848 on the test set (Kaggle)
and 0.90 on the validation set. Fig. 7 shows a graph of accuracy
versus the number of iterations for the CNN.
VI. DISCUSSION

In this project we evaluated various classifiers:
Naive Bayes, K-Nearest Neighbor, Support Vector Machine,
Neural Network, and Convolutional Neural Network.
Originally we expected that pre-processing and a better
feature set would improve the accuracy of classification. We
thought that pre-processing the data to remove noise by applying
FFT/IFFT transforms would help in classifying the
images by identifying strokes and edges. But the results were
not favorable. We believe that by transforming the image we
removed important information about the digit,
thus degrading the image. As a result, the classifiers were not able
to perform well on the transformed data.
We also thought that using scale- and rotation-invariant
filters would help in performing better classification. Thus we
used HOG and Gabor filters. But again the filters performed
badly, suggesting that they were not able to extract important
discriminating features from the image dataset due to the high
noise content of the images.
Thus the results show that features calculated after
pre-processing the data performed considerably worse than
raw data features for simple algorithms such as Naive Bayes,
KNN, SVM, and the Neural Network classifier.
Even on raw data, the simple algorithms performed poorly,
with a maximum accuracy of 0.38 in the case of SVM. The poor
performance of the classifiers other than the Convolutional Neural
Network can be attributed to the fact that the pre-processing and
feature extraction procedures we used were unable to extract
a good feature set. On the other hand, the CNN performed much
better because it learns localized features from the raw data by
itself. The features learned by the CNN are strongly discriminative,
and thus the accuracy obtained for the CNN is much higher than
for the other classifiers.
VII. CONCLUSION

An important aspect of digit classification is feature selection.
Pre-processing the images using different filters did not
yield the feature set necessary to classify
the images properly. In contrast, a more sophisticated algorithm,
the Convolutional Neural Network, which acts as both a localized
feature selector and a classifier, reached an overall accuracy of
0.848. For future work, we would like to compare the performance of
the CNN to more advanced classifiers, e.g., other deep learning models.
VIII. APPENDIX

We hereby state that all the work presented in this report
is that of the authors.
REFERENCES
[1] T. C. Bau, Using Two-Dimensional Gabor Filters for Handwritten Digit
Recognition, PhD thesis, M.Sc. thesis, University of California, Irvine.
[2] N. Dalal and B. Triggs, Histograms of oriented gradients for human
detection, in Computer Vision and Pattern Recognition, 2005. CVPR
2005. IEEE Computer Society Conference on, vol. 1, IEEE, 2005,
pp. 886-893.
[3] D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction
and functional architecture in the cat's visual cortex, The Journal of
Physiology, 160 (1962), p. 106.
[4] S. Impedovo, F. M. Mangini, and D. Barbuzzi, A novel prototype
generation technique for handwriting digit recognition, Pattern
Recognition, 47 (2014), pp. 1002-1010.
[5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, Caffe: Convolutional
architecture for fast feature embedding, arXiv preprint arXiv:1408.5093,
(2014).
[6] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based
learning applied to document recognition, Proceedings of the IEEE, 86
(1998), pp. 2278-2324.
[7] F. Mamedov and J. F. A. Hasna, Character recognition using neural
networks, in IC-AI, 2006, pp. 728-733.
[8] H. I. Works, Gaussian smoothing.
