
Département de Génie Electrique & Informatique

TUTORED RESEARCH PROJECT REPORT

4th year, Computer Science and Networks (Informatique et Réseaux)

Spotting compromised electronic documents using a neural network

Adrien GONZALEZ
Alban de La MORSANGLIERE
Julien TENA

Tutor: Eric ALATA
June 2018
Abstract

The world is undergoing a progressive and ineluctable dematerialisation of paper documents, replaced by electronic documents. This evolution will result in the eventual disappearance of paper documents (especially administrative documents) within a few years. Today, most people already need to use electronic documents, whether it is to read them or to write them. We all know the example of taxes!

The problem is that these documents rarely present any danger in people's minds. People often do not know that electronic documents can contain security vulnerabilities, which could be exploited in a nefarious way. Malicious code can easily be introduced into these documents. Opening such a document can be dangerous, because the malevolent code is executed at that moment and could damage the user's computer.

Most computer attacks follow this scenario, which is why informing people and giving them tools to spot compromised documents is so relevant.

To do so, we created a neural network using TensorFlow. This network is capable of learning how to recognise different types of documents with 128-byte pieces as input. We can use it to check the integrity of a document, verifying that none of its parts contains a portion of malicious code.

Keywords: vulnerabilities, electronic document, security, artificial neural network, neurons, learning rate, deep learning.

Summary

Introduction

I. Material and method

A. Choosing method and material
1. Choice of the type of documents
2. Choice of the instrument to create the tool

B. Artificial neural network using TensorFlow
1. General functioning
2. What is backpropagation? [1]
3. Creating our dataset
4. Computation

C. Limits of these choices
1. TensorFlow's limits
2. Limits of our dataset

II. Results

A. Changing the learning rate
B. Changing the number of iterations
C. Changing the number of layers
D. Changing the number of neurons in each layer

III. Discussion

A. Results' Interpretation
1. Changing the learning rate
2. Changing the number of iterations
3. Changing the number of layers
4. Changing the number of neurons in each layer

B. How to optimise?
1. Improve the dataset
2. Clever input

C. Prospects for use
1. Malware detection
2. Advanced file analysis
3. Integration in a professional environment

Conclusion

Introduction

Today the large expansion of the Internet has led to a progressive dematerialisation of paper documents, replaced by electronic documents. We can easily assume that paper documents will disappear in a few years, which will strengthen the importance of electronic documents in everyday life. Everybody has to use them at some stage, whether it is to read them or to write them.

The problem is that electronic documents can contain security vulnerabilities, and their nefarious exploitation can lead to attacks against badly informed people. Attackers can effectively alter a document, introducing malicious code into it without people knowing. Opening this type of document can be dangerous because the malevolent code is executed at that moment and could cause damage to the user's computer.

Most computer attacks follow this scenario and are the direct consequence of the opening of a compromised document.

This is why informing people about compromised documents and giving them tools to spot them is so relevant. To do so, it is necessary to create a neural network (using TensorFlow in this case) which will learn to recognise different types of documents, taking 128-byte pieces of documents as input. Our hypothesis is that this system is able to confirm the integrity of a document, checking all its parts to verify that there is no portion of malicious code in it.

I. Material and method

A. Choosing method and material

1. Choice of the type of documents

The problem of compromised documents is very broad and could concern nearly every existing type of document. The tool we created could be efficient with many of these types, but in our work we chose to select only three of them to implement the neural network:
- .pdf documents
- .txt documents
- .docx documents

We chose this option because these types are representative of the documents people use. Moreover, it would be very long and difficult for a neural network to learn how to recognise every type of document all at once.

We chose not to include compromised files in our dataset because of the lack of data. Thus our neural network will not be able to spot a compromised document directly. But if you feed it the right data, it will be able to spot the difference between a compromised document and a safe one.

2. Choice of the instrument to create the tool

It is possible to create a neural network without any instrument, using only code, but it would be much more tedious. Fortunately there is an open source solution called TensorFlow, developed by Google and open-sourced in 2015. It is one of the most used deep learning tools in the field of Artificial Intelligence.

B. Artificial neural network using TensorFlow

1. General functioning

An artificial neural network is a system inspired by the human neural network. The global idea remains the same because of the high number of neurons and connections between them, even if an artificial neural network is significantly less complicated. The main purpose of this system is to learn how to recognise things, like an animal in a picture or, in our case, the type of an electronic document. We will see in the next section how this learning is possible. For now, let us focus on the global aspect of a neural network.

Artificial neurons are the most important part of a neural network and are organised in layers. We can find input and output neurons, which are used to train a model and check its performance. The others are hidden neurons with inputs and outputs: they take as inputs the outputs of all the neurons of the previous layer, and their outputs feed the neurons of the next layer. As shown in Figure 1, each neuron is connected to all the neurons of the previous and the next layers.

Each neuron can be represented as a node of a graph and each connection as an edge. With this representation we can add the concepts of weight and bias. These two notions are really important in an artificial neural network because they are the reason why it can learn.

Figure 1. Artificial Neural Network example

Each edge has a weight which is multiplied by the previous output value to contribute to the input value of the next layer.

There is also an input bias for each layer. It is taken into account during the computation phase, as we can see in the following equation. We can now calculate the total input for $h_1$:

$$in_{h_1} = \omega_1 \cdot i_1 + \omega_2 \cdot i_2 + b_1 \cdot 1$$

This result is then passed through an activation function to give the output for $h_1$.

Following this process, we can compute the total input of an output neuron and so find the result of the computation. Here is an example for the total input of $o_1$:

$$in_{o_1} = \omega_5 \cdot out_{h_1} + \omega_6 \cdot out_{h_2} + b_2 \cdot 1$$

By using the activation function, it is now possible to find the output. This one is not always as good as expected, so we calculate the error on each output neuron to find the total error, using the squared error function. This error is used to know how accurate our neural network is. Here is the calculation of $o_1$'s error:

$$E_{o_1} = \frac{1}{2}\left(target_{o_1} - output_{o_1}\right)^2$$

In this example, target stands for the expected result and output is our actual result. The total error is the sum of all the errors:

$$E_{tot} = E_{o_1} + E_{o_2}$$

Once we get this total error, we can start to modify our neural network to optimise it. The method we will use is called backpropagation.
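Before moving on to backpropagation, here is a minimal NumPy sketch of this forward pass and total-error computation for a two-input, two-hidden, two-output network like the one above. The sigmoid activation and the numeric values are assumptions taken from the worked example in [1], not values from our network.

```python
import numpy as np

def sigmoid(x):
    """Assumed activation function for this example."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical inputs, weights, biases and targets (values from [1])
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
target_o1, target_o2 = 0.01, 0.99

# Forward pass: total input of each hidden neuron, then its output
in_h1 = w1 * i1 + w2 * i2 + b1 * 1
in_h2 = w3 * i1 + w4 * i2 + b1 * 1
out_h1, out_h2 = sigmoid(in_h1), sigmoid(in_h2)

# Same computation for the output layer
in_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
in_o2 = w7 * out_h1 + w8 * out_h2 + b2 * 1
out_o1, out_o2 = sigmoid(in_o1), sigmoid(in_o2)

# Squared error on each output neuron, then the total error
E_o1 = 0.5 * (target_o1 - out_o1) ** 2
E_o2 = 0.5 * (target_o2 - out_o2) ** 2
E_tot = E_o1 + E_o2
print(E_tot)  # about 0.2984 with these values
```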

2. What is backpropagation? [1]

We saw in the last section that weights and biases are the main parameters that matter to find the result. In order to improve this result, we have to adjust our weight and bias values. To do so, there is a method called backpropagation that will help us find how to adjust these parameters and reach the best result.

This method starts at the end of the computation and finds how much we have to modify a weight to obtain a result closer to the target. So we have to find how a change in a weight affects the error we get as output. The main tools here are the partial derivative and the learning rate.

The α symbol stands for the learning rate, an important parameter in an artificial neural network because it changes both the training time and its result. A high learning rate allows the system to learn faster, but it also increases the value of the cost function (the function the network tries to minimise) at the beginning of the training. It could be profitable for some projects where you do not have to be precise. A low learning rate is clearly more precise and reliable than a high one, but it takes more time to train the network: a low learning rate makes the training take really small steps to reach the minimum of the loss function, while a high learning rate progresses with big steps. [2]

Figure 2. Minimization of the loss function with different learning rates [3]

Below is an example of the approach used to find a new value for $\omega_5$ (following [1], which uses the logistic activation function).

How much does the total error change according to the output $out_{o_1}$:

$$\frac{\partial E_{tot}}{\partial out_{o_1}} = -(target_{o_1} - out_{o_1})$$

How much does the output $out_{o_1}$ change according to the input $in_{o_1}$ (derivative of the logistic activation):

$$\frac{\partial out_{o_1}}{\partial in_{o_1}} = out_{o_1}(1 - out_{o_1})$$

How much does the input $in_{o_1}$ change according to $\omega_5$:

$$\frac{\partial in_{o_1}}{\partial \omega_5} = out_{h_1}$$

How much does the total error change according to $\omega_5$ (chain rule):

$$\frac{\partial E_{tot}}{\partial \omega_5} = \frac{\partial E_{tot}}{\partial out_{o_1}} \cdot \frac{\partial out_{o_1}}{\partial in_{o_1}} \cdot \frac{\partial in_{o_1}}{\partial \omega_5}$$

This value is multiplied by the learning rate and subtracted from $\omega_5$ to find the new value:

$$\omega_5^{new} = \omega_5 - \alpha \cdot \frac{\partial E_{tot}}{\partial \omega_5}$$

Once we have the new value, we just have to repeat this approach for each weight and each bias. Notice that the example above concerns the output layer. Computations are not exactly the same for hidden layers, because each error propagates itself backwards and so there are more details to consider.
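Continuing the NumPy sketch from the previous section (and still assuming the sigmoid activation), the chain rule update for ω5 can be written directly:

```python
# Partial derivatives of the chain rule for w5
dEtot_douto1 = -(target_o1 - out_o1)   # dE_tot / d out_o1
douto1_dino1 = out_o1 * (1 - out_o1)   # d out_o1 / d in_o1 (sigmoid derivative)
dino1_dw5 = out_h1                     # d in_o1 / d w5

# How much the total error changes according to w5
dEtot_dw5 = dEtot_douto1 * douto1_dino1 * dino1_dw5

# Gradient-descent update with learning rate alpha (0.5 in [1]'s example)
alpha = 0.5
w5_new = w5 - alpha * dEtot_dw5
```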

3. Creating our dataset

Once we knew how TensorFlow works, we had to create the dataset we would be working with. As we saw, TensorFlow is a very complete tool which permits doing nearly everything using deep learning. That means we had to establish how we were going to work with it.

The first step was to define the content of our dataset: we had to create a repository for each type of file we were working with. Then we filled each one with what we consider a representative sample of files. We tried to select typical files which could be exchanged on the Internet, in accordance with the electronic documents people use every day.

The second step was to define the size of our network in two ways.

- The number of inputs

It corresponds to the size of the pieces of document we are working with, which gives the number of neurons on the first layer of our network. It cannot be too big, because that would increase the complexity of the computation, but it has to be big enough to be representative of the document. We chose to work with 128-byte pieces and thus we have 128 input neurons on our first layer.

- The number of layers

Increasing the number of layers is easy with TensorFlow; theoretically it permits the network to learn more complex functions and thus recognise document types more easily. In our problem, we do not need to add more than two intermediary layers because it is not complex enough. It would not improve the results, but it would definitely increase the computation time, because the more layers we have, the more parameters we have to estimate.
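As an illustration of the first step, here is a minimal sketch of how documents can be cut into 128-byte pieces and labelled by type. The directory layout, labels and function names are assumptions for the example, not our exact script.

```python
import os

PIECE_SIZE = 128  # bytes per piece = number of input neurons
LABELS = {"pdf": 0, "txt": 1, "docx": 2}  # one repository per file type

def file_to_pieces(path):
    """Read a document and cut it into full 128-byte pieces."""
    with open(path, "rb") as f:
        data = f.read()
    return [data[i:i + PIECE_SIZE]
            for i in range(0, len(data) - PIECE_SIZE + 1, PIECE_SIZE)]

def build_dataset(root):
    """Walk one folder per type and return (piece, label) examples."""
    examples = []
    for ext, label in LABELS.items():
        folder = os.path.join(root, ext)
        for name in os.listdir(folder):
            for piece in file_to_pieces(os.path.join(folder, name)):
                # each byte becomes one input value scaled to [0, 1]
                examples.append(([b / 255.0 for b in piece], label))
    return examples
```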

4. Computation

Here are the general steps of the algorithm (a minimal code sketch of these steps is given after the list):

- Create 128-byte pieces from our documents and mix them, creating the dataset.

- Divide this dataset into two parts: a training dataset (learning phase) and a testing dataset.

- Define the parameters of the network (learning rate, number of layers, steps, number and size of the hidden layers...).

- Create the model, store the layers' characteristics, initialise the variables.

- Start training, using the backpropagation method we saw above; test on the testing dataset, print the results and reiterate the operation for the number of steps wanted by the programmer.

- At this point we can either:
  - train again, possibly using a new training dataset and keeping what we have already learnt;
  - test the network with a new testing dataset.
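Here is a minimal TensorFlow sketch of these steps. The layer sizes match our 128-input, two-hidden-layer configuration, but the activation, optimiser, split ratio and number of epochs are assumptions for illustration, not our exact script; `build_dataset` is the hypothetical helper from the previous section.

```python
import random
import numpy as np
import tensorflow as tf

# Steps 1-2: build and shuffle the dataset, then split it in two parts
examples = build_dataset("dataset")
random.shuffle(examples)
split = int(0.8 * len(examples))  # assumed 80/20 train/test split
x = np.array([e[0] for e in examples], dtype=np.float32)
y = np.array([e[1] for e in examples], dtype=np.int32)
x_train, y_train, x_test, y_test = x[:split], y[:split], x[split:], y[split:]

# Steps 3-4: define the parameters and create the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="sigmoid", input_shape=(128,)),
    tf.keras.layers.Dense(64, activation="sigmoid"),
    tf.keras.layers.Dense(3, activation="softmax"),  # one output per type
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Step 5: train with backpropagation, then test on the testing dataset
model.fit(x_train, y_train, epochs=10)
print(model.evaluate(x_test, y_test))
```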

C. Limits of these choices

1. TensorFlow’s limits

The principal limit of TensorFlow comes from the fact that it is a relatively recent tool. The problem with such a recent instrument is that there is not much information about it. Even if there are some examples explaining the general functioning of TensorFlow, some parts are still a bit vague, with nobody explaining them clearly. For example, there is no real rule about how we should configure the network for a given problem. We do not know how many layers and inputs we need or which optimisation function to use. It means we need to try different configurations to improve our network.

2. Limits of our dataset

Obviously, the dataset we chose is not perfect. Let us take a critical look at it.

First, as we said earlier, we only chose three types of documents to work with, which is not fully representative of reality. We know that there are many more types that could be concerned by the compromised documents problem. Our current network could work with other types, but it would need some updates to be really efficient.

Secondly, we are working with a relatively small dataset. Given the equipment we have, we cannot work with huge datasets, because they would be too complex to compute. The problem this induces is that, once again, the dataset is not really representative of reality. Even for the pdf type, we cannot affirm that we have a good approximation of all that exists, which distorts the results of our tool a bit.

The final defect of our dataset is that we do not have enough compromised documents. These documents are relatively hard to find and, as we know, we need to be careful when manipulating them.

The result is that we cannot easily make the network learn directly how to recognise this type of document (remember that for now, we know a document is compromised when not all its parts are of its type).

II. Results

We are now going to present some results from our neural network and our dataset, with variations in the different parameters we are able to change. The results line gives the percentage of successful recognition, after the learning part, in a set of 784 pieces of 128 bytes which can come from doc, txt or pdf files.

A. Changing the learning rate

The first parameter we changed is the learning rate. In these three examples our neural network is composed of two layers of 128 and 64 neurons respectively. The three tables present the results of the neural network trying to recognise txt and doc files (first line, on the left), pdf and doc files (first line, on the right), and txt, pdf and doc files (second line).

Figure 3. Computations with different learning rates

B. Changing the number of iterations

We changed the number of steps of the learning part. Here are the results for neural networks composed of two layers of 128 and 64 neurons respectively.

Figure 4. Computations with different number of iterations

C. Changing the number of layers

Here we compare the use of different numbers of layers. In the first line of tables the neural network is comparing txt and doc files; in the second one, txt, doc and pdf files. For all these computations we performed one thousand iterations, and the layers are composed of 128, 64, 32 and 16 neurons, or of 128 and 64 neurons.

Figure 5. Computations with different number of layers

D. Changing the number of neurons in each layer

We also changed the number of neurons composing each layer of the neural network. Our first layer is always composed of 128 neurons because of the shape of our input. The other layer is composed of 128, 64, 32 or 16 neurons.

Figure 6. Computations with different numbers of neurons in each layer

III. Discussion

A. Results’ Interpretation

1. Changing the learning rate

If we take a look at the effective results of our neural network, we see that no learning rate seems to be optimal for all three situations: for each specific situation there is an optimal learning rate. This can be explained if we look a little closer at these situations.

Pdf and doc files are binary files, so they should look similar when we take them as pieces of 128 bytes. This is maybe why the neural network prefers a low learning rate: it does not rush to a conclusion too fast and takes more time to spot differences. On the contrary, txt files are not binary, so their 128-byte pieces are really different from pieces of doc files. We observe that the neural network prefers a bigger learning rate, maybe because it is less dangerous to rush to a conclusion when the inputs are very different.

It is more difficult to interpret the MSE and loss rate from the learning part. It is surprising, because a good neural network model should have a low MSE and a low loss, but in our case it is sometimes reversed. For example, with the neural network comparing pdf and doc files, the best solution in practice is a learning rate of 0.001, whereas the theory would choose 0.01 or 0.1. We can assume that MSE and loss rate are good indicators, but they are not sufficient to affirm that a model is better than another, because they depend on the training dataset, which is different from the testing dataset.

For example, we could have in our training dataset a special doc file which appears one time in a hundred in real life and on which our network fails. This could raise the MSE and loss rate, but it would not have an impact in reality.
2. Changing the number of iterations

The results of this experiment are quite clear. We see that more iterations lead to a better neural network, except for the two cases in purple, which we can assume not to be representative of the general functioning of our network; they are just isolated cases. We also see that the lower the learning rate, the more effective increasing the number of iterations is at improving the recognition.

3. Changing the number of layers

As before, the result of this test is quite clear. We can see that using more layers does not lead to a better recognition: the percentage of recognition is twice as high with two layers as with four. This can be explained by the fact that more layers are appropriate when we want to estimate a lot of parameters [4]. In our case we just want to estimate the type of a file. As we do not estimate a big number of parameters, having more layers reduces the efficiency of backpropagation.

4. Changing the number of neurons in each layer

The results of this experiment are interesting. We can see that, as with the learning rate, the optimal number of neurons in each layer depends on the other parameters; more precisely, it seems to depend on the learning rate. It is difficult to draw conclusions from these results: the scientific community does not yet agree on a specific rule about the number of neurons in each layer. All we can say is that the best configuration for our case, which is recognising doc, txt and pdf files, seems to be a learning rate of 0.1 with two layers of 128 and 32 neurons respectively.

B. How to optimise?

1. Improve the dataset

The first and easiest thing we could do to optimise our network is to improve our dataset. This part is very relevant because our network relies only on our dataset to learn, which makes the dataset very valuable. There are at least two ways to improve it:

- Use a much bigger dataset. It would help us get closer to reality, even if it would make the computation more complex.

- Use a more complete dataset. That means that for each type of document we are working with, we could have a folder containing compromised files of this type and a folder containing uncompromised files. It would for example be useful for the use we describe in part C.2.

2. Clever input

Another way to improve our system is to use a cleverer and more appropriate input. The input of the artificial neural network has to match the project's objective perfectly. For example, if you want to recognise a language you should use letter sequences as inputs. This way, the neural network will be able to check for these sequences and associate them with a language.

In our case, the input could be optimised using ASCII (American Standard Code for Information Interchange). By using this code we could sort every character into different categories and replace each character of the input file by its category. The artificial neural network should then be able to train more efficiently. A sketch of this idea follows.
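As a sketch of this idea, each byte of a piece could be replaced by the index of its ASCII category before being fed to the network. The categories below are an assumption for illustration; the real split would have to be tuned.

```python
def ascii_category(byte):
    """Map a byte to a coarse ASCII category (assumed categories)."""
    if 0x30 <= byte <= 0x39:
        return 0  # digit
    if 0x41 <= byte <= 0x5A or 0x61 <= byte <= 0x7A:
        return 1  # letter
    if byte in (0x20, 0x09, 0x0A, 0x0D):
        return 2  # whitespace
    if 0x21 <= byte <= 0x2F or 0x3A <= byte <= 0x40:
        return 3  # common punctuation and symbols
    return 4      # anything else, including non-printable bytes

def encode_piece(piece):
    """Replace each byte of a 128-byte piece by its category index."""
    return [ascii_category(b) for b in piece]
```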

C. Prospects for use

1. Malware detection

We could now include the system we created in an application accessible to a wide audience. It could be useful to educate people about the danger of compromised electronic documents and to protect them.

The application would take a file the user is not sure about and split it into parts of 128 bytes. After checking all the parts, the application could tell whether the document is compromised or not, according to how many parts are detected as being of the same type as the file. The difficulty is to choose the error percentage above which the application affirms that a document is compromised. A minimal sketch of such an application follows.
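Here is a minimal sketch of such a check, reusing the hypothetical `file_to_pieces` helper and trained model from the previous sections; the threshold value is an assumption and would have to be calibrated.

```python
import numpy as np

THRESHOLD = 0.10  # assumed: fraction of mismatching pieces we tolerate

def check_document(path, claimed_type, model):
    """Split a file into 128-byte pieces and flag it if too many pieces
    are not recognised as the claimed type."""
    pieces = file_to_pieces(path)
    x = np.array([[b / 255.0 for b in p] for p in pieces], dtype=np.float32)
    predictions = model.predict(x).argmax(axis=1)
    mismatch_rate = np.mean(predictions != LABELS[claimed_type])
    return "compromised" if mismatch_rate > THRESHOLD else "probably safe"
```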

2. Advanced file analysis

The idea of this section remains the same as the previous one, but with an improvement: with our artificial neural network, the application could find the file's type and at the same time detect whether it is compromised or not.

This application would require two neural networks to work. The first would be the file type analyser, while the second one would check whether the document is compromised or not, based on the analysis of the first neural network.

This application could be more profitable than the previous one because it finds the document's type by itself, so it would be more comfortable to use for people who are not familiar with computers. A sketch of this two-network pipeline follows.
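A hypothetical sketch of this pipeline, where the first network's majority vote over the pieces gives the document's type and the checker from the previous section then verifies the integrity:

```python
import numpy as np

def analyse_file(path, type_model, integrity_model):
    """Infer the file type with one network, then check integrity."""
    pieces = file_to_pieces(path)
    x = np.array([[b / 255.0 for b in p] for p in pieces], dtype=np.float32)
    votes = type_model.predict(x).argmax(axis=1)
    inferred = int(np.bincount(votes).argmax())  # most frequent type wins
    name = [t for t, i in LABELS.items() if i == inferred][0]
    return name, check_document(path, name, integrity_model)
```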

3. Integration in a professional environment

We could also think about integrating our project into a professional environment. It could be implemented in a company's network to act as a filter for emails, for example: a system which checks the attached files of every mail coming from the outside, to make sure that they are not compromised. It could be an additional measure used to strengthen security in many companies.

Conclusion

We know that we are currently going through a global dematerialisation of paper documents, replaced by electronic documents. We also know that these documents, used nefariously, can be dangerous when opened. That raises the problem of spotting compromised documents among the flow of documents we face.

We created a neural network, using TensorFlow, to solve this problem. The network we implemented is capable of learning to recognise different types of documents, taking 128-byte pieces as input. Using this skill, it can check whether a document is compromised or not: if all its parts (or the vast majority) correspond to the effective type of the document, we can conclude that it is not compromised.

We saw above that it could also be used in different ways. It could be improved, dealing with other types of documents for example. It could also be used in a professional field, checking the integrity of the documents entering a company. Finally, even if it is not perfect, this tool is efficient and adaptable to lots of applications with different uses.

Bibliography

● [1] M. MAZUR, "A Step by Step Backpropagation Example", Matt Mazur's website, March 17th 2015. Available: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/. [Consulted on May 26th 2018]

● [2] P. SURMENOK, "Estimating an Optimal Learning Rate For a Deep Neural Network", Towards Data Science, November 13th 2017. Available: https://towardsdatascience.com/estimating-optimal-learning-rate-for-a-deep-neural-network-ce32f2556ce0. [Consulted on May 26th 2018]

● [3] J. JORDAN, "Setting the learning rate of your neural network.", Jeremy Jordan's website, March 1st 2018. Available: https://www.jeremyjordan.me/nn-learning-rate/. [Consulted on May 26th 2018]

● [4] Y. BENOIT, "TensorFlow & Deep Learning – Episode 3 – Modifiez votre Réseau de Neurones en toute simplicité", Blog Xebia, 2017. Available: http://blog.xebia.fr/2017/04/11/tensorflow-deep-learning-episode-3-modifiez-votre-reseau-de-neurones-en-toute-simplicite/. [Consulted on May 27th 2018]
