Project Report
Summary
Introduction
I. Material and method
II. Results
III. Discussion
A. Results' Interpretation
1. Changing the learning rate
2. Changing the number of iterations
3. Changing the number of layers
4. Changing the number of neurons in each layer
B. How to optimise?
1. Improve the dataset
2. Clever input
C. Prospects for use
Conclusion
Bibliography
Introduction
Today, the large expansion of the Internet has led to a progressive dematerialisation of paper documents, replaced by electronic documents. We can easily assume that paper documents will disappear in a few years, which will strengthen the importance of electronic documents in everyday life. Everybody has to use them at some stage, whether to read them or to write them.
The problem is that electronic documents can contain security vulnerabilities, and their nefarious exploitation can lead to attacks against badly informed people. Attackers can effectively change a document, introducing malicious code into it without people knowing. Opening this type of document can be dangerous, because the malevolent code is executed at that moment and could damage the user's computer. Most computer attacks follow this scenario and are the direct consequence of the opening of a compromised document.
This is why informing people about compromised documents and giving them tools to spot them is so relevant. To do so, it is necessary to create a neural network (using TensorFlow in this case) which will learn to recognise different types of documents, taking 128-byte pieces of a document as input. Our hypothesis is that this system is able to confirm the integrity of a document, checking all its parts to verify that there is no portion of malicious code in it.
I. Material and method
The problem of compromised documents is very widespread and could concern nearly every existing type of document. The tool we created could be efficient with many of these types, but in our work we chose to select only three of them to implement the neural network:
- .pdf documents
- .txt documents
- .docx documents
We chose this option because these types are representative of the documents people use. Moreover, it would be very long and difficult for a neural network to learn how to recognise every type of document all at once.
We chose not to include compromised files in our dataset because of the lack of data. Thus our neural network will not be able to spot a compromised document directly. But if you feed it the right data, it will be able to spot the difference between a compromised document and a safe one.
It is possible to create a neural network without any instrument, using only code, but it would be much more tedious. Fortunately there is an open-source solution called TensorFlow, developed by Google and open-sourced in 2015. It is one of the most used deep learning tools in the field of Artificial Intelligence.
B. Artificial neural network using TensorFlow
1. General functioning
An artificial neural network is a system inspired by the human neural network. The global idea remains the same because of the high number of neurons and connections between them, even if an artificial neural network is significantly less complicated. The main purpose of this system is to learn how to recognise things, like an animal in a picture or, in our case, the type of an electronic document. We will see in the next section how this learning is possible. For now, let us focus on the global aspect of a neural network.
Artificial neurons are the most important part of a neural network and are organised in layers. We can find input and output neurons, which will be used to train a model and check its performance. The others are hidden neurons with inputs and outputs. They take as inputs all the neurons of the previous layer, and their outputs feed the neurons of the next layer. As shown in Figure 1, a neuron is connected to all the neurons of the previous and the next layer.
Each neuron can be represented as a node of a graph, and each connection as an edge. With this representation we can add the concepts of weight and bias. These two notions are really important in an artificial neural network because they are the reason why it can learn. Each edge has a weight, which will be multiplied by the previous output value to give a new input value for the next layer.
There is also an input bias for each layer. It will be taken into account during the computation phase, as we can see in the following equation. We can now calculate the total input for h1:

in_h1 = ω1 * i1 + ω2 * i2 + b1 * 1

This result is then passed through an activation function to give the output for h1. Following this process, we can compute the total input for an output neuron and so find the result of the computation.
Here is an example for the total input of o1:

in_o1 = ω5 * out_h1 + ω6 * out_h2 + b2 * 1
By using the activation function, it is now possible to find the output. This output is not always as good as expected, so we calculate the error on each output neuron to find the total error, using the squared error function. This error tells us how accurate our neural network is.
Here is the calculation of o1's error:

E_o1 = 1/2 * (target_o1 - output_o1)^2

In this example, target stands for the expected result and output is our actual result. The final total error is the sum of the errors of all output neurons:

E_total = Σ E_o
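The forward pass and error computation described above can be sketched in a few lines of NumPy. The numerical values are taken from the worked example in [1]; the sigmoid activation is an assumption, since the text only speaks of "an activation function".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Inputs, weights and biases taken from the worked example in [1].
i = np.array([0.05, 0.10])        # input neurons i1, i2
W1 = np.array([[0.15, 0.25],      # column 0 = (w1, w2) feeding h1
               [0.20, 0.30]])     # column 1 = (w3, w4) feeding h2
b1 = 0.35                         # hidden-layer bias
W2 = np.array([[0.40, 0.50],      # column 0 = (w5, w6) feeding o1
               [0.45, 0.55]])     # column 1 = (w7, w8) feeding o2
b2 = 0.60                         # output-layer bias

in_h = i @ W1 + b1                # e.g. in_h1 = w1*i1 + w2*i2 + b1*1
out_h = sigmoid(in_h)             # activation applied to the total input
in_o = out_h @ W2 + b2            # e.g. in_o1 = w5*out_h1 + w6*out_h2 + b2*1
out_o = sigmoid(in_o)

target = np.array([0.01, 0.99])
E = 0.5 * (target - out_o) ** 2   # squared error per output neuron
E_total = E.sum()                 # total error of the network
```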
Once we get this total error, we can start to modify our neural network to optimise it. The
method we will use is called backpropagation.
2. What is backpropagation? [1]
We saw in the last section that weights and biases are the main parameters that determine the result. In order to improve this result, we have to adjust our weight and bias values. To do so, there is a method called backpropagation that helps us find how to adjust these parameters and reach the best result.
This method starts from the end of the computation and finds how much we have to modify a weight to obtain a result closer to the target. So we have to find how a change in a weight affects the error we get as output.
The main tools here are the partial derivative and the learning rate:

ω_new = ω - α * ∂E_total/∂ω

The α symbol stands for the learning rate and is an important parameter of an artificial neural network, because it changes both the training time and its result.
A high learning rate allows the system to learn faster, but it also increases the value of the cost function (the main function the network minimises) at the beginning of the training. It can be profitable for projects where you do not have to be precise. A low learning rate is clearly more precise and reliable than a high one, but it takes more time to train the network: a low learning rate makes the training take really small steps to reach the minimum of the loss function, while a high learning rate progresses with big steps. [2]
Figure 2. Minimization of the loss function with different learning rates [3]
Below is an example of the approach used to find a new value for ω5. By the chain rule:

∂E_total/∂ω5 = (∂E_total/∂out_o1) * (∂out_o1/∂in_o1) * (∂in_o1/∂ω5)

How much does the total error change according to the output out_o1:

∂E_total/∂out_o1 = -(target_o1 - out_o1)

How much does the output out_o1 change according to the input in_o1 (for the sigmoid activation used in [1]):

∂out_o1/∂in_o1 = out_o1 * (1 - out_o1)

Since in_o1 = ω5 * out_h1 + ..., the last factor is ∂in_o1/∂ω5 = out_h1. The resulting value is multiplied by the learning rate and subtracted from ω5 to find the new value:

ω5_new = ω5 - α * ∂E_total/∂ω5

Once we have the new value, we just have to repeat this approach for each weight and each bias. Notice that the example above concerns the output layer. The computations are not exactly the same for hidden layers, because each error propagates backwards through them, so there are more details to consider.
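As a numerical sketch, the update of ω5 can be written out using the intermediate values from the worked example in [1] (the sigmoid derivative is an assumption tied to that example):

```python
# Numerical update of w5, using the intermediate values from the worked
# example in [1] (sigmoid activation assumed, as in that example).
alpha = 0.5                  # learning rate (alpha)
w5 = 0.40
out_h1 = 0.593269992         # output of hidden neuron h1
out_o1 = 0.751365070         # output of output neuron o1
target_o1 = 0.01

# Chain rule: dE_total/dw5 = dE/dout_o1 * dout_o1/din_o1 * din_o1/dw5
dE_dout  = -(target_o1 - out_o1)       # derivative of the squared error
dout_din = out_o1 * (1.0 - out_o1)     # derivative of the sigmoid
din_dw5  = out_h1                      # since in_o1 = w5*out_h1 + ...
grad = dE_dout * dout_din * din_dw5

w5_new = w5 - alpha * grad             # gradient-descent step
```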
3. Creating our dataset
Once we knew how TensorFlow works, we had to create the dataset we would work with. As we saw, TensorFlow is a very complete tool and makes it possible to do nearly everything using deep learning. That means we had to establish how we were going to work with it.
The first step was to define the content of our dataset: we had to create a repository for each type of file we were working with. Then we filled each one with what we consider a representative sample of files. We tried to select typical files which could be exchanged on the Internet, in accordance with the electronic documents people use every day.
The second step was to define the size of our network in two ways.
4. Computation
- Create 128-byte pieces from our documents and mix them, creating the dataset.
- Divide this dataset into two parts: a training dataset (learning phase) and a testing dataset.
- Define the parameters of the network (learning rate, number of layers, steps, number and size of the hidden layers...).
- Start training, using the backpropagation method we saw above, test on the testing dataset, print the results and repeat the operation for the number of steps chosen by the programmer.
C. Limits of these choices
1. TensorFlow's limits
TensorFlow's principal limit is due to the fact that it is a relatively recent tool. The problem with such a recent instrument is that there is not much information about it. Even if there are some examples explaining the general functioning of TensorFlow, some parts are still a bit vague, with nobody explaining them clearly. For example, there is no real rule about how to configure the network for a given problem. We do not know how many layers and inputs we need, or which optimisation function to use. It means we need to try different configurations to improve our network.
2. The dataset's limits
Obviously, the dataset we chose is not perfect. Let us take a critical look at it.
First, as we said earlier, we only chose 3 types of documents to work with, which is not fully representative of reality. We know that there are many more types that could be concerned by our compromised-documents problem. Our current network could work with other types, but it would need some updates to be really efficient.
Secondly, we are working with a relatively small dataset. Given the equipment we have, we cannot work with huge datasets, because they would be too complex to compute. The problem this induces is that, once again, it is not really representative of reality. Even for the PDF type, we cannot affirm that we have a good approximation of all that exists, which distorts the results of our tool a bit.
The final defect of our dataset is that we do not have enough compromised documents. These documents are relatively hard to find and, as we know, we need to be careful when manipulating them.
The result is that we cannot easily make it learn to recognise this type of document directly (remember that, for now, we consider a document compromised when not all of its pieces are of its declared type).
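The decision rule in parentheses can be sketched as a simple vote over the per-piece predictions of the network; the function name and the tolerance threshold are assumptions:

```python
def looks_compromised(piece_labels, declared_type, tolerance=0.05):
    """Flag a document whose pieces do not (almost) all match its declared
    type. `piece_labels` are the per-piece predictions of the network;
    `tolerance` (an assumed threshold) allows a few misclassified pieces."""
    mismatched = sum(1 for lbl in piece_labels if lbl != declared_type)
    return mismatched / len(piece_labels) > tolerance
```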
II. Results
We are now going to present some results from our neural network and our dataset, with variations in the different parameters we are able to change. The Results line gives the percentage of successful recognition in a set of 784 pieces of 128 bytes, which can come from doc, text or pdf files, after the learning part.
A. Changing the learning rate
The first parameter we changed is the learning rate. In these three examples our neural network is composed of two layers of respectively 128 and 64 neurons. The three tables present the results of the neural network trying to recognise text and doc files (first line on the left), pdf and doc files (first line on the right), and txt, pdf and doc files (second line).
B. Changing the number of iterations
We changed the number of steps of the learning part. Here are the results for neural networks composed of two layers of respectively 128 and 64 neurons.
C. Changing the number of layers
Figure 5. Computations with different numbers of layers
D. Changing the number of neurons in each layer
We also changed the number of neurons composing each layer of the neural network. Our first layer is always composed of 128 neurons because of the shape of our input. The other layer is composed of 128, 64, 32 or 16 neurons.
III. Discussion
A. Results' Interpretation
1. Changing the learning rate
If we take a look at the effective results of our neural network, we see that no learning rate seems to be optimal for all three situations. For each specific situation there is an optimal learning rate. This can be explained if we look a little closer at all these situations.
Pdf and doc files are binary files, so they should look similar when we take them as pieces of 128 bytes. This may be why the neural network prefers a low learning rate: it does not rush into a conclusion too fast, and it takes more time to spot differences. On the contrary, text files are not binary, so their 128-byte pieces are really different from pieces of doc files. We observe that the neural network prefers a bigger learning rate, maybe because it is less dangerous to rush into a conclusion when the inputs are very different.
It is more difficult to interpret the MSE and loss rate from the learning part. It is surprising, because a good neural network model should have a low MSE and loss, but in our case it is sometimes the reverse. For example, with the neural network comparing pdf and doc files, the best solution in practice uses 0.001 as learning rate, while the theory would choose 0.01 or 0.1.
We can assume that MSE and loss rate are good indicators, but they are not sufficient to affirm that a model is better than another, because they depend on the training dataset, which is different from the testing dataset.
For example, our training dataset could contain a special doc file which appears one time in a hundred in real life and on which our network fails. This could raise the MSE and loss rate, but it would not have an impact in reality.
2. Changing the number of iterations
The results of this experiment are quite clear: more iterations lead to a better neural network, except for the two cases in purple, which we can assume not to be representative of the general functioning of our network; they are just isolated cases. We also see that the lower the learning rate is, the more increasing the number of iterations improves the recognition.
3. Changing the number of layers
As before, the result of this test is quite clear. We can see that using more layers does not lead to a better recognition: the percentage of recognition is multiplied by two with 2 layers instead of four. This could be explained by the fact that more layers are suited to estimating a lot of parameters [4]. In our case we just want to estimate the type of a file. As we do not estimate a big number of parameters, having more layers reduces the efficiency of backpropagation.
4. Changing the number of neurons in each layer
The results of this experiment are interesting. We can see that, as with the learning rate, the optimal number of neurons in each layer depends on the other parameters; more precisely, it seems to depend on the learning rate. It is difficult to draw a firm conclusion from these results. The scientific community does not yet agree on a specific rule about the number of neurons in each layer. All we can say is that the best configuration for our case, which is recognising doc, text and pdf files, seems to be a learning rate of 0.1 with 2 layers of respectively 128 and 32 neurons.
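As an illustration, this best configuration could be expressed with the current tf.keras API roughly as follows; the activation, optimiser and loss choices are assumptions, since the report does not show its exact TensorFlow code:

```python
import tensorflow as tf

# Sketch of the best configuration found above: learning rate 0.1,
# two layers of 128 and 32 neurons, three document classes.
# Activation, optimiser and loss are assumed, not taken from the report.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(128,)),                     # one 128-byte piece
    tf.keras.layers.Dense(128, activation="sigmoid"),
    tf.keras.layers.Dense(32, activation="sigmoid"),
    tf.keras.layers.Dense(3, activation="softmax"),   # pdf / txt / docx
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```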
B. How to optimise?
1. Improve the dataset
The first and easiest thing we could do to optimise our network is to improve our dataset. This part is very relevant because our network relies only on our dataset to learn, which makes it very valuable. There are at least two ways to improve it:
- Use a much bigger dataset. It would bring us closer to reality, even if it would make the computation more complex.
- Use a more complete dataset. That means that for each type of document we are working with, we could have a folder containing compromised files of this type and a folder containing uncompromised files. It would, for example, be useful for the use case described in part C.2.
2. Clever input
Another way to improve our system is to use a cleverer and more appropriate input. The input of the artificial neural network has to match the project's objective as closely as possible. For example, if you want to recognise a language, you should use letter sequences as inputs. This way, the neural network will be able to check for these sequences and associate them with a language.
In our case, the input could be optimised with ASCII (American Standard Code for Information Interchange). By using this code we could sort every character into different categories and replace each character of the input file by its category. The artificial neural network should then train more efficiently.
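A minimal sketch of this preprocessing, assuming illustrative category boundaries (the report does not specify them):

```python
# Map each byte of a piece to a coarse ASCII category instead of its
# raw value. The category boundaries below are illustrative assumptions.
def categorise(byte):
    if 0x30 <= byte <= 0x39:
        return 0   # digit
    if 0x41 <= byte <= 0x5A or 0x61 <= byte <= 0x7A:
        return 1   # letter
    if byte in (0x20, 0x09, 0x0A, 0x0D):
        return 2   # whitespace
    if 0x21 <= byte <= 0x7E:
        return 3   # other printable (punctuation, symbols)
    return 4       # non-printable / non-ASCII

def encode_piece(piece):
    """Replace every byte of a 128-byte piece by its category."""
    return [categorise(b) for b in piece]
```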
C. Prospects for use
1. Malware detection
The idea of this section remains the same as the previous one, but with an improvement. With our artificial neural network, the application could find the file's type and at the same time detect whether it is compromised or not.
This application would require two neural networks to work. The first would be the file-type analyser, while the second would check whether the document is compromised, based on the analysis of the first neural network.
This application could be more profitable than the previous one, because it finds the document's type by itself, so it would be easier and more comfortable to use for people who are not familiar with computers.
3. Integration in a professional environment
Conclusion
We created a neural network, using TensorFlow, to solve this problem. The network we implemented is capable of learning to recognise different types of documents, taking 128-byte pieces as input. Using this skill, it can check whether a document is compromised or not: if all its parts (or the vast majority) correspond to the effective type of the document, we can conclude that it is not compromised.
We saw above that it could also be used in different ways. It could be improved to deal with other types of documents, for example. It could also be used in a professional field, checking the integrity of the documents entering a company.
Finally, even if it is not perfect, this tool is efficient and adaptable to many applications with different uses.
Bibliography
● [1] M. Mazur, "A Step by Step Backpropagation Example", Matt Mazur's website, March 17th 2015. Available: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/. [Consulted on May 26th 2018]
● [3] J. Jordan, "Setting the learning rate of your neural network.", Jeremy Jordan's website, March 1st 2018. Available: https://www.jeremyjordan.me/nn-learning-rate/. [Consulted on May 26th 2018]
● [4] Y. Benoit, "TensorFlow & Deep Learning – Episode 3 – Modifiez votre Réseau de Neurones en toute simplicité", Blog Xebia, 2017. Available: http://blog.xebia.fr/2017/04/11/tensorflow-deep-learning-episode-3-modifiez-votre-reseau-de-neurones-en-toute-simplicite/. [Consulted on May 27th 2018]
Département de Génie Electrique & Informatique
SPOTTING COMPROMISED ELECTRONIC DOCUMENTS USING A NEURAL NETWORK
Most computer attacks follow this scenario, and this is why informing people and giving them tools to spot compromised documents is so relevant.
To do so, we created a neural network using TensorFlow. This network is capable of learning how to recognise different types of documents from 128-byte pieces given as input. We can use it to check the integrity of a document, verifying all its parts to make sure that there is no portion of malicious code in it.
Keywords: vulnerabilities, electronic document, security, artificial neural network, neurons, learning rate, deep learning.