Malware Detection Using N-GRAM Based File Signature Based Method

International Journal on Recent and Innovation Trends in Computing and Communication
Volume: 2 Issue: 11
ISSN: 2321-8169
3793 - 3795
_______________________________________________________________________________________________
Malware Detection Using N-GRAM Based File Signature Based Method

Nutan Navnath Phalke
Sushadevi Shamrao Adagale
Student, Dept. of Computer Engg.

DGOI, COE, Daund, Pune
nutan.phalke@gmail.com
Student, Dept. of Computer Engg.

susha0810@gmail.com
Prof. Amrit Priyadarshi

Assistant Professor
Dept. of Computer Engg.
amritpriyadarshi@gmail.com
Varsha Balasaheb Shinde

Student
Dept. of E & TC Engg.
SCOE, Vadgaon (Bk),Pune
vershs98@gmail.com
Abstract We know that malware can affect on computer data, they disturb computer .there is large growth in virus of different like Trojan
horses, worms, benign etc. however developer has need pay attention on that activity ,need to develop strong anti-analysis technique for that.
Malware detection is critical technique in computer security. Signature based method for malware detection is used, this is mostly used in
commercial antivirus software but this method detects malware only when virus caused damage or already registered. Otherwise it fail to detect
malware. Applying a methodology proven successful in similar problem-domains, we propose the use of n-grams as file signatures in order to
detect unknown malware whilst keeping low false positive ratio. We show that n-grams signatures provide an effective way to detect unknown
malware.
Keywords: Security, computer viruses, data-mining, malware detection, machine learning.
__________________________________________________*****_________________________________________________
I.
INTRODUCTION
Every year there is large growth in malware such as

Trojan horses, worms etc most purpose of this is to disturb
computer, so there is need to develop security technique. The
classic method to detect these threats consists on waiting for
a certain number of computers to be infected, determining
then a file signature for the virus and finally finding a
specific solution for it. the malware detection software can
provide protection against known viruses (ie. those on the
list). This approach has proved to be effective when the
threats are known in beforehand, and is the most extended
solution within antivirus software. Still, as already
mentioned, it fails when facing new ones.
New virus detection is most critical until signature is
obtained. Mutations of the original virus released in the
meanwhile may escape to detection based on that signature.
Malware writer has developed different virus and different
way to hide that code. While researchers design new tools
and strategies to detect them (Nachenberg, 1997). Such
evolution makes very difficult to develop an universal
malware detector. Thus, the task of the researcher is to
achieve very good results against known malware and to turn
the attempt of writing new undetectable virus more difficult.
Several method for effectiveness and achieve efficiency
in new malware detection strategies, such as First, we
consider malware detection ratio (i.e. the amount of viruses
of a sample that the software detects). Second, we have to
look at the false positive ratio (i.e. the amount of non

malicious programs that are erroneously classified as
malware), there is need detect how method is practically
commercialized ,new method detect malware but it has cost
of high false positive is not practical in the real world, where
the system must deal with a lot of benign software that
should not be classified as malware.
Antivirus company deal with low false positive ratio instead
of notable detection of high false positive ratio. Language
recognition is a research area that has tackled a similar
problem, since they also have to deal with the retrieval of
information that is hard to see at the first glance. The most
extended technique against this problem has been the use of
the so-called n-grams. For instance, in Chinese document
classification (Zhou and Guan, 2002) or text classification
(Jacob and Gokhale, 2007).N-grams are all substrings of a
larger string with a length n. A string is simply split into
substrings of fixed length n. For example, the string
MALWARE, can be segmented into several 4-grams:
MALW,ALWA, LWAR, WARE and so on.
In this paper we use two method for tackle this problem
first is that use n-gram based file signature for malware
detection. and second is that dealing with false positives.
Using a parameter named d to control how strict the method
behaves to classify the instance as malware or benign
software in order to avoid false positives.
3793
IJRITCC | November 2014, Available @ http://www.ijritcc.org
_______________________________________________________________________________________

Volume: 2 Issue: 11
ISSN: 2321-8169
3793 - 3795
_______________________________________________________________________________________________
II.
LITERATURE SURVEY
Detecting unknown malicious code has been tackle by

using machine learning and data mining [1] In their research
they propose to use different data mining methods for
detection of new malware, however, none of the techniques
proposed had good balance between false positive ratio and
malware detection ratio in their experimental results.
N-grams were used first for malware analysis by an IBM
research group [2] They proposed a method to automatically
extract signatures for the malware. Still, there was no
experimental results in their research.
[3] addressed an n-grams-based signature method to detect
computer viruses. In that approach, given a set of non
malicious programs and computer viruses code, n-grams
profiles for each class of software (malicious or benign
software) were generated, in order to later classify any
unknown instance into benign or malicious code using a KNN algorithm with k =1. That experiment had very good
results in terms of malware detection ratio; still, no false
positive ratio was given in the experiment results.
In this paper [5] author says that now a day there is large
growth in malware they affect the computer .malware is
nothing but malicious code .signature based approach is used
for malware detection. still this
method is used by
commertial anti-virus software it can only detect the virus but
it damaged by virus then it fail to detect that malware. so that
it use n-gram based signature approach to to detect unknown
malware. use of n-gram it acts as file signature on that set.to
classify unknown instance malicious and bening software it
uses
k-nearest neighbour algorithm can be used in
classifying issues.
This Paper [6] Author Is To Be Able To Accurately Identify
The True Type Of An Arbitrary File Using Statistical
Analysis Of Their Binary Contents Without Parsing.Also
Possible To Find True Type Of The File Without Name
Announce It. The Method Represent Name Of The File
Called Fileprint.
This paper[7] Detectors that check for the presence of a
sequence of system calls exhibited by a malware instance are
often evadable by system call reordering several dynamic
detection approaches have been proposed that aim to
identify the behavior exhibited by a malware family
III.
PROPOSED METHOD
Our technique use set of training value i.e known value.

for set of representation of file. this set is collection of
malicious software and benign program. Specifically, the
malware is made up of different kind of malicious software
(i.e. computer viruses, Trojan horses, spyware, etc ).N-gram
for set of file acts as a signature based. then system can
classify unknown instances into malicious software and
benign program. we classify the unknown instance using k-
nearest neighbor algorithm (Fix and Hodges, one of the

simplest machine learning algorithms that can be used in
classifying issues. This algorithm relies on identifying the k
most nearest (say most similar) instances, to later classify the
unknown instance based on which class (malware or benign)
are the k-nearest instances. The following measure function
is used in order to detect how much the unknown instance
looks like the known ones:
(1)
where X is the set of the n-grams of the unknown instance, x
is any n-gram in the set X, Y is the set of n-grams of the
instance its class is known, and f (x) is the number of
coincidences of the n-gram x in the set Y.
When every instance of the known set is measured
comparing it with the unknown instance, we select the k
highest values of the measured instance, and we consider the
unknown instance as malware only if on the k most alike files
the amount of malware instances minus the amount of benign
instances is greater or equal than a parameter d, as shown in
the following formula:
(2)
where MW(K) is the amount of malware instances in the knearest neighbours, GW(K) is the amount of benign software
in the k nearest neighbours and d is the parameter d. This
parameter d controls how strict the system is going to be in
order to classify the unknown instance as malware or benign
software. Moreover, this parameter is what we are going to
use to keep low the false positive ratio; with a high value of d
(that has to be always lesser or equal than k), we predict that
the false positive ratio is going to keep low.
Generally we are trying to use dataset of malware
program that means of considering data set of training files
those are known to us and another is set of classifying means
unknown files. through the use machine learning method try
to determine the percentage of originality of the data.
IV.
CONCLUSION:
We conclude that N-gram signature based method for

malware detection is effective . we have chosen a set of
decompiled malware code and texts that have result of the
monitorizing of the execution of those file to act as the
unknown threats and other set to act as the known threats.
then use the n-grams of the known files and the n-grams of
the unknown ones. For any instance of the unknown set, we
are going to classify into malware or benign software, based
upon the coincidences of n-grams between the set of n-grams
of the known files and the set of n-grams of the unknown
ones. Also use n-grams-based signatures methodology can
3794
_______________________________________________________________________________________

Volume: 2 Issue: 11
ISSN: 2321-8169
3793 - 3795
_______________________________________________________________________________________________
achieve detection of new or unknown malware. Furthermore,
with the inclusion of the parameter d in the classification
method, we have achieved a way to configure how strict the
system will be for classifying an unknown instance as
malware; thus, it is a way for controlling the false positive
ratio. This method provides a good detection ratio and the
possibility to control the false positive ratio.
In future research we can use n-gram analysis for malware
detection through the use of different and large collection of
malware
V.
[1]
[2]
[3]
[4]
[5]
[6]
[7]
REFERENCES
Stolfo, S. J.(2001). Data mining methods for detection

of new malicious executables. In SP 01: Proceedings
of the 2001 IEEE Symposium on Security and
Privacy, page 38, Washington, DC, USA. IEEE
Computer Society.
Kephart, J. O. (1994). A biologically inspired immune
system for computers. In In Artificial Life IV:
Proceedings of the Fourth International Workshop on
the Synthesis and Simulation of Living Systems,
pages 130139. MIT Press.
Abou-Assaleh, T., Cercone, N., Keselj, V., and
Sweidan, R. (2004). N-gram-based detection of new
malicious code. In COMPSAC Workshops, pages 41
42.
Kaspersky (2008). Kaspersky security bulletin 2008:
Malware evolution january - june 2008.
I. Santos, Y. K. Penya, J. Devesa, and P. G. Garcia,
N- grams-basedfile signatures for malware
detection, S3Lab, Deusto Technological Found.,
2009
[Online].
Available:
pgbg@technologico.deusto.es
W. L. K. Wang, S. Stolfo, and B. Herzog, Fileprints:
Identifying file types by n-gram analysis, in Proc. 6t
h IEEE Inform. Assurance Workshop,Jun. 2005, pp. :
6471
C. Kolbitsch, P. M. Comparetti, C. Kruegel, E. Kirda,
X. Zhou, and X.Wang, Effective and efficient
malware detection at the end host, in Proc. 18th
Usenix Security Symp., 2009, pp. 351366.
3795
_______________________________________________________________________________________

Malware Detection Using N-GRAM Based File Signature Based Method

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Malware Detection Using N-GRAM Based File Signature Based Method

Uploaded by

Copyright:

Available Formats

International Journal on Recent and Innovation Trends in Computing and Communication