Linköpings universitet
SE-581 83 Linköping, Sweden
Institutionen för datavetenskap
Examensarbete (Master's thesis)
Copyright
The publishers will keep this document online on the Internet or its possible
replacement from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for
anyone to read, to download, or to print out single copies for his/her own use
and to use it unchanged for non-commercial research and educational purposes.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional upon the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linköping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its www home page: http://www.ep.liu.se/.
Abstract
Malware in smartphones is growing at a significant rate. There are
currently more than 250 million smartphone users in the world and this
number is expected to grow in coming years [44].
In the past few years, smartphones have evolved from simple mobile
phones into sophisticated computers. This evolution has enabled smartphone users to access and browse the Internet, to receive and send emails,
SMS and MMS messages and to connect devices in order to exchange information. All of these features make the smartphone a useful tool in our
daily lives, but at the same time they render it more vulnerable to attacks
by malicious applications.
Given that most users store sensitive information on their mobile
phones, such as phone numbers, SMS messages, emails, pictures and
videos, smartphones are a very appealing target for attackers and malware developers.
The need to maintain security and data confidentiality on the Android
platform makes the analysis of malware on this platform an urgent issue.
We have based this report on previous approaches to the dynamic
analysis of application behavior, and have adapted one approach in order
to detect malware on the Android platform. The detector is embedded
in a framework to collect traces from a number of real users and is based
on crowdsourcing. Our framework has been tested by analyzing data collected at the central server using two types of data sets: data from artificial
malware created for test purposes and data from real malware found in
the wild. The method used is shown to be an effective means of isolating
malware and alerting users of downloaded malware, which suggests that
it has great potential for helping to stop the spread of detected malware
to a larger community.
Finally, the report will give a complete review of results for self-written
and real Android malware applications that have been tested with the
system.
This thesis project shows that it is feasible to create an Android malware detection system with satisfactory results.
Acknowledgments
First of all, I would like to thank Prof. Simin Nadjm-Tehrani and
Dr. Urko Zurutuza for their support, guidance and patience over
the course of this Master's thesis project.
Contents
1 Introduction
1.2 Goal
1.3 Project Assumptions
1.4 Intended audience
1.5 Related work
1.6 Thesis structure
2 Background
2.1.1 Platform architecture
2.1.4 Android applications
2.2.1 Definition
2.2.2 Detection types
2.4 Data Mining
2.6 Crowdsourcing
3.1 Overview
4.1 Data Set
4.3.1 Self-written Malware
4.3.2 Real Malware
5.1 Conclusions
5.2 Future Directions
Chapter 1
1 Introduction
This paper describes the results of a Master's thesis project (30 ECTS) towards
the fulfillment of a degree in Telecommunications Engineering at Mondragon
Unibertsitatea.
1.1
Communications and technology are rapidly growing industries that are changing every day.
According to the predictions in Table 1, Android and Windows Mobile will each
grow by almost 50% between 2010 and 2014, with a high probability of becoming
the leading smartphone operating systems in the future.
Operating System    2010 Market Share    2014 Market Share (predicted)    2014/2010 Change (predicted)
Symbian             40.1%                32.9%                            -18.0%
BlackBerry OS       17.9%                17.3%                            -3.5%
Android             16.3%                24.6%                            51.2%
iOS                 14.7%                10.9%                            -25.8%
Windows Mobile      6.8%                 9.8%                             43.3%
Others              4.2%                 4.5%                             8.3%
Total               100%                 100%
Table 1: Worldwide mobile device Operating System Market Shares and 2010-2014 Growth [36]
The IDC predicts that the total number of smartphone applications will grow
at the same rate as smartphone sales. There are currently more than 350,000
applications in Apple's iPhone market and 250,000 applications in the Android
market, according to Silicon Alley Insider [37]. This is depicted in Figure 1.
The official Google Android market nearly doubled in size in 2010 and 2011,
surpassing 250,000 applications in March 2011.
Malware has been a threat to PCs for many years [30], and in light of
the rapid increase in smartphone sales over the last few years [38], it was only
a matter of time before malware developers became interested in staging their
attacks on the smartphone platform. In particular, 2010 and 2011 saw a growing
interest among malware developers in waging attacks on the Android OS [28].
Malware usually destroys valuable and sensitive information in infected systems.
profits from them. In the same way as malware harms computers, it can also
perform attacks on smartphones, given that they have similar operating features. This observation makes it clear that smartphone devices need the same kind of enhanced protection that computers received some years
ago.
The Android market is an open market system. This means that Android
developers can upload their applications, also called third-party applications,
to Android's official market without them being filtered by any certification
authority that would check the trustworthiness of the applications.
On the one hand, this increases the odds that the Android market will have a greater
variety of applications and content, but on the other hand it facilitates infection
by malware applications, as applications are not analyzed by any certification
authority.
In conclusion, considering the growth of smartphones running the Android
OS and the increasing number of applications available for the Android OS,
the Android platform is the main focus of this project. To that end,
we will develop a behavior-based malware detection system for the
Android platform.
1 Malicious (mal) software (ware)
2 Samsung and HTC smartphone vendors [38]
1.2 Goal
The goal of this Master's thesis was to design and implement a behavior-based
malware detection system for the Android platform.
More specifically, the work was divided into the following sub-goals:
The proposed solution was expected to detect malicious applications from official and non-official Android markets or repositories.
1.3 Project Assumptions
Even if malware did exist in the Android market, we first needed clean
or good applications with the same name or purpose to test the malware
detection system.
1.4 Intended audience
This thesis is useful to anyone who is involved in mobile security, and is especially
intended for Android smartphone users and developers. It is also targeted at
anyone interested in crowdsourcing and data mining techniques as they apply
to mobile phones.
The document does not require any prior knowledge in the area of security.
Chapter 2 will provide all the basic theory for the concepts explained in the
paper.
1.5 Related work
Malware has been a threat to computers for many years [30] and continues to cause irreparable damage to infected systems [29]. The first attempts to identify and analyze malware on smartphones adapted existing PC security solutions and applied them to mobile phones. This was not feasible, given the high resource demands of antivirus techniques and the power and memory constraints of mobile devices. Since malware and intrusion detection systems have already been the subject of extensive research, we give only a brief review of the evolution of malware and malware detection techniques for mobile phones.
Nwokedi et al. compiled a summary of the most commonly used malware detection techniques [60]. Their report examined 45 different malware detection techniques in the fields of anomaly-based detection, specification-based detection and signature-based detection. The techniques explained in their report provide useful background for understanding the first approaches to malware detection that can also be used on smartphones.
Iseclab [25], the International Secure Systems Laboratory, explored the detection of malicious or infected applications using different approaches and detection techniques based on dynamic analysis [55]. The paper provides useful information about the malware detection techniques and tools used in the dynamic analysis of malware.
Jacoby et al.
monitor applications and extract device features, such as free RAM,
user inactivity, process count, CPU usage and sent SMS messages.
The aim
Schmidt et al.
from the
pattern.
The same group proposed static analysis in 2009 [66] and an Android application sandbox system in 2010 [48]. The first report presented a collaborative
scenario in which different devices could perform static analysis of malware directly on the phone. The second method used an Android application sandbox,
a totally secure environment, to perform static and dynamic analysis.
Static
Enck et al. proposed real-time monitoring and analysis of sensitive data with
dynamic taint tracking[56]
sources and applies labels as sensitive data propagates through program variables, files, and inter-process messages. When tainted data leaves the system,
the application scans for suspicious outgoing data.
Bose et al., Shabtai et al.
devices and apply malware detection techniques to these Android mobile phones.
The replicas are an equivalent version of the real mobile devices, and will be
sent to the remote server for malware analysis. Mobile phone replicas will run
in a secure virtual environment where different malware detection techniques
are applied.
A lightweight data collector application, installed on the device, will be responsible for collecting the system calls generated by Android applications on the device and storing device information files in SD card memory. This application has similar features to the one proposed by Buennemeyer et al. [50], i.e. the sending of all monitored files to a remote server. They, however, made very few attempts with mobile phones, and we aim to extend use of the application as much as possible. To do so, we will ask Android community users to use a lightweight script application (crowdsourcing application) in order to collect as much data as possible from different Android devices.
A. Doan, R. Ramakrishnan and A. Halevy analyzed the impact of crowdsourcing on the WWW (World-Wide Web) [54]. Their article explains how in
the future crowdsourcing will become one of the most influential techniques used
to collect information and create databases faster and more efficiently.
3 Crowdsourcing application [59]
The following text gives an overview of some recent attacks targeting Android and of malware that has appeared on the Android platform.
Android malware has increased by 400% since 2010 [31], and will continue to
grow. Several malware attacks were carried out on the Android
OS in 2010 and 2011 [65] [11].
Hong Tou Tou, Angry Birds Bonus Level, Tip Calculator, Tap Snake, Monkey Jump and Steamy Window are the most famous malicious applications to
date on the Android platform. Furthermore, more than 50 infected applications
were found on Google's Android market in March 2011, all of them infected
with the DroidDream Trojan application[1].
Another attack targeting the Android platform was carried out by J. Oberheide. He developed the Angry Birds Bonus Level for the Android OS[11]. This
application was a proof-of-concept malware application to showcase the weak
security of the Android marketplace.
purports to be an additional bonus level for the famous game Angry Birds.
The malicious application downloads and installs three additional applications
information from a great number of Android devices in only a few days' time.
NetQin Inc[34], a mobile security service provider, discovered a spyware
application called Tip Calculator in the Android market.
all incoming and outgoing SMS messages in the system to a designated email
address. Another piece of spyware with similar characteristics discovered in non-
Android Pjapps modifies the original version of this application and wages an
attack by subscribing to a premium SMS service.
Due to its appeal as the latest malware discovered for the Android OS, and
since both the clean and malicious instances of the application were available, we
decided to analyze this spyware with our proposed malware detection system.
4 Fake
[Table: summary of related work on malware detection for smartphones. Columns: Author, Approach, Detection Method, Description, Platform. Works listed include Jacoby et al. (2004) [61], Cheng et al. (2007) [52], Buennemeyer et al. (2008) [50], Schmidt et al. (2008), Bose et al., Shabtai et al., Schmidt et al. (2009) [66], Schmidt et al. (2009) [67], Bläsing et al. (2010) [48], Sharifi et al. (2010) [72], Enck et al. (2010) [56] and Portokalidis et al. (2010) [63, 64]. Approaches: HIDS, NIDS. Detection methods: signature-based detection, anomaly detection. Platforms: Symbian OS, Windows Mobile, Android OS, all smartphone OS. Surviving description fragment for Enck et al.: "Taint Droid will monitor Android applications and will alert the".]
1.6 Thesis structure
This section summarizes the main topics to be discussed throughout the paper,
giving a short overview of each chapter.
Chapter 2 describes the basic theory of the Android platform, intrusion
detection systems, Linux system calls, data mining and clustering algorithms.
The aim of this chapter is to enable the reader to understand the basic concepts
of the project.
Chapter 3 describes the behavior-based malware detection system for the
Android platform that was designed in this project.
Chapter 4 describes the testing and evaluation methods used by the behavior-based malware detection system for the Android platform.
Chapter 5 presents the final conclusions and defines the future work of the
project.
Chapter 2
2 Background
This chapter will give a brief description of some of the fundamental concepts
and terminology relating to the Android OS, intrusion detection systems, Linux
system calls, data mining and clustering algorithms. The clustering algorithm
section will be illustrated with reference to the way in which we have applied
these known techniques in order to group Android system calls.
2.1
Operating system: As mentioned above, Android is based on the Linux 2.6 kernel, which provides the platform
with basic services such as security, memory management and process
management. The kernel can be considered an abstraction layer between
software and hardware layers, responsible for managing and processing requests received from higher layers for interaction with hardware resources.
Middleware:
Application:
Start-up
Another essential part of the Android OS is the startup process. Like any
other Linux system, Android has a boot sequence which prepares the services
necessary to run/start the device's operating system.
Figure 3 shows the first stage in the boot sequence on Android OS.
The first stage in the boot sequence is running the bootstrapper application.
The bootstrapper is the program which starts the device's operating system and
initializes and tests the basic requirements of the hardware, peripherals and external memory devices. GRUB and LILO for Linux and NTLDR for Windows
are some of the best-known bootstrapper applications. The bootstrapper application loads the kernel image into RAM, and then the kernel starts the init
process. Figures 3 and 4 explain the Android OS init process and boot
sequence.
The init process initializes system daemons for handling low-level hardware
interfaces, such as USB, the Android debugger or Android Debug Bridge Daemon.
The init process also starts the basic runtime processes, such as the
Figure 4 shows the Android OS boot sequence in greater detail. As mentioned above, the init process initializes several daemons and services in the
system. At the same time, the init process starts the Zygote process. We will
describe the process in greater detail on the following pages.
Every Android application runs in its own process, with its own instance
of the Dalvik VM inside a secure environment, a Sandbox.
The Dalvik VM
The Zygote
As detailed above, every Android application runs in its own instance of
the Dalvik VM and each instance must start quickly when a new application
is launched in the application layer. Android uses a concept called Zygote to
provide the fast start-up time needed to run the Dalvik VM every time a new
application is executed. Zygote loads the original Dalvik VM during the boot
sequence and waits for new requests from the Runtime process.
When the
Register-based Architecture
Virtual machine developers have always been in favor of implementing virtual machines with a stack-based architecture [42] rather than a register-based
architecture[45]
developers to prefer its use. Obviously, this simple implementation comes with
a performance cost. Executables for stack-based architecture are smaller than
executables for register-based architecture. This means a higher memory consumption, leading to a worse performance of the virtual machine.
Register-
new Dalvik VM instance for every new application request, and the requested
application will run in that Dalvik VM instance.
Eclipse [19] or Netbeans [33], create an Android application installation (APK) file. These APK files can be
installed on Android devices using the Android Debug Bridge tool (adb) or by
downloading them from Android's Official Market.
Classes.Dex:
source code. It contains optimized Dex bytecode for the Android application and will run on the Dalvik VM.
Resources: This group contains pictures, libraries and layout files used by
the application.
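As a rough illustration of the package contents listed above, the following Python sketch inspects an APK using the fact that an APK file is a ZIP archive. The entry names `AndroidManifest.xml`, `classes.dex` and the `res/` prefix follow the standard APK layout; the helper names `summarize_apk` and `build_demo_apk` are hypothetical, and the demo archive holds placeholder bytes rather than a real application.

```python
import io
import zipfile


def summarize_apk(apk_file):
    """Report which of the expected parts an APK (a ZIP archive) contains."""
    with zipfile.ZipFile(apk_file) as apk:
        names = apk.namelist()
    return {
        "has_manifest": "AndroidManifest.xml" in names,
        "has_dex": "classes.dex" in names,
        "resources": sorted(n for n in names if n.startswith("res/")),
    }


def build_demo_apk():
    """Build a tiny in-memory archive with placeholder contents (not a valid app)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", b"placeholder manifest")
        apk.writestr("classes.dex", b"placeholder dex bytecode")
        apk.writestr("res/layout/main.xml", b"placeholder layout")
        apk.writestr("res/drawable/icon.png", b"placeholder image")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(summarize_apk(build_demo_apk()))
```

The same `summarize_apk` function can be pointed at a path to a real `.apk` file, since `zipfile.ZipFile` accepts either a path or a file-like object.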
Figure 8 shows the compilation process in the creation of an Android APK file.
One of the most important elements of creating an APK file is the compilation of Java source code. The process of generating the APK file is described
in Figure 8. The files undergo a series of transformations during the process of
creating the Android APK file. These transformations comprise the compilation
process required to generate APK files that will run on Android devices.
The first step in the process of creating an Android application is to create
an Android project, in which Java source code, Android manifest and resource
files will be generated by Eclipse or Netbeans.
The next step is to program and configure the code to suit your purposes and
to compile the project. Java's compiler in the SDK programming environment
will generate class files from Java's source code and the
aapt 5
will transform
dx.
into the Dex format. Once all the files are compiled, the aapt is tasked with
compiling and generating the Android APK file.
5 Android
2.2
2.2.1 Definition
An Intrusion Detection System, also known as an IDS [24], is a device or software
application which monitors a network or system for malicious activities [58].
There are many different types of IDS. The aim of an IDS is to identify and
detect anomalies in the system or device that is being monitored. Some classes
of IDS will be described below.
Network-Based
Wireless
The Wireless Intrusion Detection System (WIDS) is similar to the NIDS. Instead of analyzing wired network traffic it can analyze wireless traffic to detect
suspicious activity.
Host-Based
Host-Based Intrusion Detection Systems (HIDS) monitor all activity that occurs
on the host (the platform comprising the computer hardware and the operating
system) being monitored. This system is capable of monitoring features of the
system such as power consumption, opened files, system call logs, etc.
This project will use a Host-Based Intrusion Detection System to monitor
events on Android devices.
detail.
Misuse detection
Signatures
are then created by a group of experts who analyze the code, behavior and
manifestation of the malware. Most antivirus companies still use this technique
to create malware signatures and patterns.
One of the disadvantages of this detection type is that the system must
be familiar with all malware patterns and signatures in advance. This type of
detection limits the ability to detect new malware.
The process of finding and identifying new types of attacks and malware
manually takes experts a great deal of time. Antivirus companies are trying to
come up with different alternatives in order to avoid this problem through use
of automated processes. Figure 9 shows the differences between the techniques
of misuse detection and anomaly detection.
Anomaly-Based detection
Anomaly-Based Intrusion Detection Systems use a prior training phase to establish a model for normal system activity. This mode of detection is first trained
on the normal behavior of the system or application to be monitored.
Using
If there is no
2.3
In Linux, a system call is the way in which a program requests a service from the
operating system's kernel. The Linux kernel has roughly 190 system calls, and
each system call is identified by a unique number that is found in the kernel's
system call table [27].
A system call is invoked by an application using glibc library functions.
Functions such as socket() are among those that glibc provides to applications
to enable them to invoke a system call.
Every time an application from user space makes a request to the OS, the
request passes through the glibc library, the system call interface and the kernel,
and finally reaches the hardware. The glibc library interprets the request and
the CPU switches to kernel mode. The system call interface gets the request
from the glibc library and executes the appropriate kernel function by consulting
the system call table. The kernel must interpret the request from the system
call interface and pass the request on to the hardware platform. Afterwards, the
user receives the information requested by the application following the inverse
process.
Figure 10 describes the Linux user kernel space and the process by
The Linux kernel is executed in the lowest layer of the Android architecture.
This means that all requests made from the upper layers pass through the kernel
using the system call interface before they are executed in the hardware.
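As a small illustration of the first hop in this path, the sketch below calls a C library wrapper function from Python through `ctypes`, and compares the result with Python's own `os.getpid()`. It assumes a Unix-like host where `ctypes.CDLL(None)` exposes the symbols of the C library already loaded into the process; `getpid` is chosen only because it is simple and side-effect free, and `libc_getpid` is a hypothetical helper name. On Android itself the C library is bionic rather than glibc.

```python
import ctypes
import os


def libc_getpid():
    """Call the C library's getpid() wrapper, which issues the system call."""
    # CDLL(None) opens the running process itself, so the symbols of the
    # C library already linked into the interpreter are reachable (Unix only).
    libc = ctypes.CDLL(None)
    return libc.getpid()


if __name__ == "__main__":
    # Both values come from the same kernel service, reached through the
    # C library wrapper in one case and through Python's os module in the other.
    print(libc_getpid(), os.getpid())
```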
Analyzing all of the system calls that pass through the system call interface
will give us an accurate picture of the behavior of the application.
6
of hijacking
The aim
the events generated by the Android application. This file will provide useful
information, such as opened and accessed files, execution timestamps and the
number of system calls executed by the application. We will use the number
of system call executions performed by the application to represent behavior.
Section 2.5 will provide insight into this technique.
This project will use the lists of system calls to create an anomaly detection system, first creating the normality model for the Android application using clean Android applications (applications free of malicious code). As stated
above, by extracting the number of system call executions generated by the
Android application it is possible to create a behavioral vector representation
for Android applications.
For example, an application may execute the open()
system call 25 times to open files or libraries from the system, and the kill()
system call 47 times to kill processes.
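A minimal sketch of how such a behavioral vector could be built from a trace is given below. It assumes strace-style text lines of the form `name(arguments) = result`, optionally preceded by a PID and a timestamp; the function names `count_system_calls` and `behavior_vector` and the fixed ordering of call names are hypothetical choices for illustration, not the tooling used in this project.

```python
import re
from collections import Counter

# Matches the call name at the start of a strace-style line, e.g.
#   open("/system/lib/libc.so", O_RDONLY) = 3
# optionally preceded by a PID column and/or a timestamp column.
_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(?:\d[\d:.]*\s+)?([a-z_][a-z0-9_]*)\("
)


def count_system_calls(trace_lines):
    """Count how many times each system call name appears in a trace."""
    counts = Counter()
    for line in trace_lines:
        match = _CALL.match(line.strip())
        if match:  # signal and exit lines do not match and are skipped
            counts[match.group(1)] += 1
    return counts


def behavior_vector(counts, call_names):
    """Turn the counts into a fixed-order vector, using 0 for unseen calls."""
    return [counts.get(name, 0) for name in call_names]


if __name__ == "__main__":
    trace = (
        ['open("/system/lib/libc.so", O_RDONLY) = 3'] * 25
        + ["kill(1234, SIGKILL) = 0"] * 47
        + ["--- SIGCHLD (Child exited) ---"]
    )
    print(behavior_vector(count_system_calls(trace), ["open", "read", "kill"]))
```

Keeping the order of `call_names` identical for every application is what makes the resulting vectors comparable with each other.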
The list of Android system calls is too large to show here, but the system
calls list can be found in the Android Linux kernel[9] bionic folder
or in Section
2.4 Data Mining
Data mining is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence in order to obtain useful
information. Data mining is also considered to be the set of techniques and technologies used for exploring large databases in order to find repetitive patterns,
trends or rules to explain the behavior of a given data set.
Figure 11 shows the sequence of the knowledge discovery process used in
databases (KDD) [46] to obtain useful information or knowledge from a raw
data set. The KDD process refers to the process of discovering useful knowledge.
Data mining refers to a particular step in the process.
When we are given a raw data set, the first step is to select information in order
to obtain relevant data. This project will use a crowdsourcing application
installed on several Android devices and an information collector script to
obtain the data set of the behavior of the Android application.
2. Data preprocessing:
or patterns, it is necessary to filter out irrelevant data. Collecting inappropriate data results in poor interpretation and evaluation of the system,
renders the system unreliable and produces undesired results.
3. Data transformation:
previous phases into a readable and organized structure. This data will
determine the outcome of the analysis and will create the data set for the
data mining algorithm.
4.
Classification
This is a technique used in data mining to classify data into different fields or
groups. One of the main characteristics of this technique is that the classification
of data is based on groups or patterns that are already known. This means that
all the information on groups in the system is already defined, and new data
will be compared with these groups in order to classify the data.
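The idea of comparing new data with groups that are already defined can be sketched as follows. The nearest-representative rule, the group labels and the example vectors are assumptions made for illustration only; the text does not prescribe a particular classifier.

```python
import math


def nearest_group(vector, groups):
    """Return the label of the predefined group whose representative
    vector is closest (Euclidean distance) to the new vector."""
    return min(groups, key=lambda label: math.dist(vector, groups[label]))


if __name__ == "__main__":
    # Hypothetical, already known groups, each described by one
    # representative vector of system call counts.
    known_groups = {"good": [25, 3, 0], "malicious": [26, 39, 46]}
    print(nearest_group([24, 5, 1], known_groups))
```

The important point is that `known_groups` exists before any new data arrives, which is what distinguishes this setting from the clustering technique described next.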
Clustering
The technique of clustering involves grouping a set of physical or abstract
objects into clusters of similar objects. In data mining, a cluster is a collection
or group of data that are similar to each other.
compared to the classification method is that the clustering method uses raw
data to create the groups to be used later in order to make a decision. These are
created without any predefined group. The given data set will be responsible
for creating the groups or clusters, and afterwards a decision will be made on
which cluster the data belongs to.
or group created to which to assign the data, so the clustering algorithm will
create a random cluster in any position.
One of the easiest ways to decide to which group the data belongs is to
measure the
Euclidean
2.5
cluster and recursively split the clusters into smaller ones. Figure 13 shows the
graphical representation of agglomerative and divisive methods.
number of clusters as the objective, and the data set is split into those clusters.
The partitioning method aims to discover clusters by iteration and relocation
of points in the data set.
In unsupervised learning, the pattern classification system is based on a set
of training patterns whose class labels are as yet unknown.
This occurs when labeling each individual sample is almost impossible. This
type of learning encompasses algorithms such as neural networks,
nearest neighbor, k-means, etc.
Bearing in mind that the objective in this project is to cluster system call
behavior vectors into two different clusters, i.e. Good and Malicious application
behaviors, it is appropriate to apply the partitioning method using the k-means
clustering algorithm.
k = 2.
The Good application cluster will describe the proper behavior of Android applications and data clustered into the Malicious group or cluster will be
considered to be malicious or dangerous applications.
The k-means clustering algorithm [62] is a clustering method which aims to
create
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$
where $\left\| x_i^{(j)} - c_j \right\|^2$ is the distance measured between a data point $x_i^{(j)}$ and
cluster center $c_j$. The cluster center $c_j$ indicates the distance of the n data
the $p_i$, where each $p_i$ is a vector of $D$ numbers. Each $p_i$ can be regarded as a point in a $D$-dimensional space.
Every $p_i$ in the data set will represent a system call vector produced by the user.
vector
Figure 14: K-means applied as a detection system for android system calls
The n observations, $x_i^{(j)}$,
system call vector. Applying the k-means algorithm to the Android application
vector data set will create two clusters, with the good and malicious Android
applications classified (k=2) as described below.
The speed of the algorithm and the results obtained in training and test
evaluation are the main reasons we chose to use the k-means algorithm in this
project. Another reason why we chose k-means was the simplicity of implementation in Matlab.
One of the most important tasks of the clustering algorithm is the selection of the distance measure. This measurement will determine the cluster to
which the data belongs. The calculation of this distance may vary depending
on which mathematical formula is used in the process.
Euclidean, Manhattan, Mahalanobis and Hamming distances are some of the most commonly used
functions to measure such distances.
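To make the difference between these measures concrete, the following Python sketch computes three of them for two short vectors. It is only an illustration and not the Matlab code used in the project: the function names and the two sample vectors are assumptions, Hamming distance is taken here as the fraction of coordinates that differ, and Mahalanobis distance is left out because it needs a covariance estimate from a larger data set.

```python
import math

def euclidean(u, v):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def hamming(u, v):
    """Fraction of coordinates in which the two vectors differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

# Two made-up system call count vectors (illustrative values only).
p = [4, 5, 6, 7, 8]
q = [1, 2, 3, 9, 9]

print(round(euclidean(p, q), 4))  # 5.6569
print(manhattan(p, q))            # 12
print(hamming(p, q))              # 1.0
```

The same pair of vectors can therefore look close or far apart depending on the measure, which is why the choice of metric affects the cluster to which a vector is assigned.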
2.6
Crowdsourcing
Crowdsourcing [59],
Chapter 3
3
3.1
Overview
the data collected from Android users' applications and create the system calls
vectors. Afterwards, Matlab and the k-means clustering algorithm will use these
system call vectors to detect anomalies in the applications.
8 Strace
3.2
In order to collect Android application data, we will use two data collector
applications. The first one is a crowdsourcing application developed for Android
devices and the second one is a script running on the Android Emulator.
The first attempt we made to collect data was carried out by a script using
the thirty most downloaded applications from the Android market in 2010. The
purpose of the script was to monitor Android emulator activity and generate
reports based on the analysis.
The second data mining trial was carried out by the crowdsourcing application for Android devices. The aim of the application was the same as that of
the previous script, but this time the Android user community was used.
Both applications were able to collect essential information from Android
devices, such as installed applications, device information and, most importantly,
the system call log files.
See Figure16.
points with a script will produce the system call vectors that will be used in the
Android malware detection system.
The aim of the crowdsourcing and data collection script is to collect as much
information as possible from the Android devices and applications.
Parse the collected data to create system call vectors, device information
files and a list of other actions performed by Android applications, such
as opened files or accessed directories, execution timestamp, etc.
The data collector script is written in Perl. This gives us the opportunity to run
the script on several operating systems without changing it in any way. Figure
17 shows the User Interface (UI) of the script.
Steps 4, 5 and 6 on the UI, Figure 17, will obtain the Android device
information file and installed application file and create the system call vector
file.
The script was designed to automate most of the data mining process and
interaction within the system. At first we decided to use a pseudo-random action
event tool called ADB Monkey [2] for interacting with and collecting information
from Android applications. Taking into account the fact that there are more
than 250,000 applications available in the Android Market, it was natural to
conclude that we needed to use an automatic process to record and interact
with the applications. After several attempts, we realized that ADB Monkey was
generating flawed pseudo-random events in Android applications. Considering
this, the data generated by this tool was unsuitable for processing and for
use in the system if we intended to obtain good results.
Our next approach was to teach ADB Monkey to behave and interact with
Android applications in the same way as humans. We realized, however, that
this technique required artificial intelligence knowledge and generated too much
work with processing data, so we decided to use a normal user to create the data.
The complexity of writing a program to behave like a human was the main reason
we decided to use a normal user for data creation. Even so, we found a
disadvantage associated with use of this technique, i.e. that a single user has to
create the data set for more than 250,000 Android applications. Spending just
5 minutes per application on monitoring and recording application system calls
and the Android device information adds up to 1,250,000 minutes, so the user would need more than two
years of uninterrupted work to collect all of the information for the Android market apps.
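The size of this effort can be checked with a quick calculation. The short Python sketch below uses only the two figures given in the text (250,000 applications and five minutes each) and assumes uninterrupted, round-the-clock work; with normal working hours the total would be several times larger.

```python
APPS = 250_000          # applications in the Android Market, per the text
MINUTES_PER_APP = 5     # monitoring time per application, per the text

total_minutes = APPS * MINUTES_PER_APP
total_days = total_minutes / (60 * 24)   # continuous, 24 hours a day
total_years = total_days / 365

print(total_minutes)           # 1250000
print(round(total_days))       # 868
print(round(total_years, 1))   # 2.4
```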
We realized that even if we decided to use this technique for the most important 30 applications available on the Android market in January 2011, testing 30
applications would not be sufficient to determine and create a Malware pattern
for Android applications.
This brings us to the need for a crowdsourcing approach.
Crowdsourcing
environment will provide the tools necessary to compile the Java source code
and generate the APK file that will run on the devices.
The crowdsourcing application has the same features as the data collector
script mentioned in Section 3.2.1, but includes an FTP client to send collected
files to the Android malware detection system. Android Community users only
need to download the application and let it run in the background in order for
it to start monitoring and collecting information from the applications running
on the device.
with other applications while the application runs as a background process and
collects data.
in the SD Card memory and will later be sent as data to the behavior-based
Android malware detection system server via FTP.
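As an illustration of the upload step, the following Python sketch sends a set of collected log files over an open FTP session. It is not the Java code of the crowdsourcing application: the function name `upload_logs`, the file names and the host in the usage comment are hypothetical, and the only library call relied on is `storbinary` from the standard `ftplib.FTP` client.

```python
import io

def upload_logs(ftp, files):
    """Send each log over an open FTP session and return the names stored.

    `ftp` is expected to behave like ftplib.FTP (only storbinary is used);
    `files` maps a remote file name to the bytes to upload.
    """
    stored = []
    for name, payload in files.items():
        ftp.storbinary(f"STOR {name}", io.BytesIO(payload))
        stored.append(name)
    return stored

# Typical use with the standard library client
# (host, credentials and file name are placeholders):
#   from ftplib import FTP
#   with FTP("collector.example.org") as ftp:
#       ftp.login("user", "password")
#       upload_logs(ftp, {"calculator.out": b"..."})
```

Passing the session in as a parameter keeps the transfer logic separate from connection handling, so the same function could later be used with `ftplib.FTP_TLS` if an encrypted channel is wanted.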
3.3
and applications mentioned in Section 3.2 are responsible for collecting data
from Android applications, and the script running on the server is responsible for parsing and storing all collected data. Furthermore, the script is
responsible for creating the system call vectors for the k-means clustering
algorithm.
decompress, disassemble and search for patterns in the APK files. The method
is fast and does not generate a high processing load.
Dynamic Analysis analyzes the behavior of Android applications by monitoring system calls with the Strace tool.
Android smartphone user will be collected using the data collector application
described in section 3.2 as well as the crowdsourcing and data collector script.
In Dynamic Analysis the user will install, execute and generate input data for
the Android applications in order to obtain an application behavior output log
file.
Table 5 shows the advantages and disadvantages of Static and Dynamic
analysis.
Static analysis
Advantages
Disadvantages
or signatures in advance
consuming
Dynamic Analysis
Detection of
unknown attacks
Figure 21 describes the complete process of Android malware detection carried out by the system.
Data acquisition:
and parsing all of the information collected from Android users. The data
analyzer scripts will collect, extract and analyze all of the parameters from
the strace output files (from the applications tested).
important pieces of data that can be obtained from the strace output file
is the number of system calls executed by an Android application. Another
feature that can be extracted from the output file is the set of files and libraries
used during the monitoring process.
and clustering the vectors obtained in the previous phase in order to create the normality model and subsequently be able to detect anomalous
behavior of Android applications. Matlab will be responsible for clustering the different vectors into different groups using the k-means algorithm.
This algorithm will create two clusters, a normality model and a malicious
behavior or anomaly model. See Figure 9. All good application vectors
will be clustered into the normality model, and malicious behavior vectors
into the malicious behavior model cluster.
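The step from a strace output file to a system call vector can be sketched as follows. This is a Python illustration, not the Perl scripts used in the project: the sample lines are invented, the short `ORDER` list stands in for the full system call table, and the regular expression only handles the plain `name(args) = result` line format, optionally prefixed by a process id.

```python
import re
from collections import Counter

# Optional "[pid N]" or bare pid prefix, then the system call name before "(".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")

def count_syscalls(lines):
    """Count how many times each system call name appears in strace output."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts

def to_vector(counts, syscall_order):
    """Turn the counts into a fixed-length vector, one slot per system call."""
    return [counts.get(name, 0) for name in syscall_order]

# Invented sample lines in strace's usual format (signal lines are skipped).
SAMPLE = [
    'open("/dev/ashmem", O_RDWR) = 3',
    'read(3, "abc", 3) = 3',
    'read(3, "", 3) = 0',
    '[pid  922] ioctl(4, 0xc0186201, 0xbe9c5a58) = 0',
    'close(3) = 0',
    '--- SIGCHLD (Child exited) @ 0 (0) ---',
]
ORDER = ["read", "write", "open", "close", "ioctl"]  # hypothetical short table

counts = count_syscalls(SAMPLE)
vector = to_vector(counts, ORDER)
print(vector)  # [2, 0, 1, 1, 1]
```

Keeping the order of system calls fixed is what makes vectors from different applications and different users comparable position by position.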
A = [4, 5, 6, 7, 8];   % Good
B = [4, 5, 6, 6, 8];   % Good
C = [1, 2, 3, 9, 9];   % Malware
D = [4, 5, 6, 7, 7];   % Good
E = [1, 3, 3, 9, 8];   % Malware

PROGRAM CODE AND COMMENTS

% Clear all variables in the system
clear all;
% Loads the 5 vectors from a .txt file into vectors_variable
% (the file name is missing in the source; 'vectors.txt' is a placeholder)
vectors_variable = load('vectors.txt');
% pdist computes the Euclidean distance between pairs of objects in the
% m-by-n data matrix X (this call is reconstructed from its comment)
vector_distance = pdist(vectors_variable);
% squareform puts the pairwise distances in matrix format
matrix_vectors = squareform(vector_distance);
% Optional. Gets the maximum value of the matrix
max_value = max(matrix_vectors(:));
% kmeans creates two clusters from the input value.
% By default kmeans uses squared Euclidean distance.
clusters = kmeans(matrix_vectors, 2);
Vectors A, B and D are assigned to Cluster 1 (Good), and vectors C and E
to Cluster 2 (Malicious).
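The same three steps (pairwise distances, square matrix, two clusters) can be sketched in plain Python for the five example vectors. This is not the Matlab code of the project: the helper names are assumptions, and the k-means routine is a basic Lloyd iteration with a deterministic farthest-point start, chosen here so that the result is repeatable, whereas Matlab's kmeans picks its starting centers differently. As in the Matlab steps, the rows of the distance matrix are the points that get clustered, and all distances are computed from the vectors exactly as printed above.

```python
import math

VECTORS = {
    "A": [4, 5, 6, 7, 8],
    "B": [4, 5, 6, 6, 8],
    "C": [1, 2, 3, 9, 9],
    "D": [4, 5, 6, 7, 7],
    "E": [1, 3, 3, 9, 8],
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(rows):
    """Square symmetric matrix of pairwise distances (pdist plus squareform)."""
    return [[euclidean(r, s) for s in rows] for r in rows]

def kmeans(points, k=2, max_iter=100):
    """Basic Lloyd iteration with a deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from every center chosen so far.
        centers.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        )
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

names = list(VECTORS)
matrix = distance_matrix([VECTORS[n] for n in names])
labels = kmeans(matrix, k=2)
clusters = {
    j: [n for n, lab in zip(names, labels) if lab == j] for j in sorted(set(labels))
}
print(clusters)  # {0: ['A', 'B', 'D'], 1: ['C', 'E']}
```

The cluster numbers themselves carry no meaning; deciding which cluster is "Good" and which is "Malicious" still depends on knowing the nature of at least some of the vectors in each group.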
Euclidean
Seuclidean
City-Block
Minkowski
Cosine
Mahalanobi
Spearman
Hamming
Jaccard
Result
Table 8 shows the similarities between vectors after applying the pdist function with the Euclidean distance metric and the squareform function.
The pdist function computes the Euclidean distance between all pairs of vectors, and the squareform function transforms the pdist result into matrix form. This table shows
only the Euclidean distance results, but similar results were obtained using the
standardized Euclidean (seuclidean) and Hamming distance metrics.
call vectors of an Android application is the result. Values close to 0 indicate similar
or equal vectors, and values far from 0 indicate dissimilar vectors.
       A        B        C        D        E
A      0        1        5.6569   2.2361   5.0990
B      1        0        6.0828   2.4495   5.5678
C      5.6569   6.0828   0        7.1414   1.4142
D      2.2361   2.4495   7.1414   0        6.2450
E      5.0990   5.5678   1.4142   6.2450   0
C, E
are malicious.
Table 9 shows the cluster results obtained using the k-means clustering algorithm with the Euclidean distance metric on the results obtained from the
squareform function.
Cluster
C, E
Chapter 4
4
This chapter is divided into three different sections. Section 4.1 describes the data set
used in the project. Section 4.2 shows the devices and applications used in the
system. A complete analysis of created and real malware is described in Section
4.3.
Our framework has been tested through analysis of the data collected on the
central server, with two types of data sets: data from artificial malware created
for test purposes (Table 17) and data from real malware found in the wild (Table
22).
4.1
Data Set
The data set used in this project is that collected by several data collector
applications, as described in Section 3.2.
contains device info, installed applications info and the system call vector log
files, and will be used as the data set or input data in the behavior-based malware
detection system.
4.2
Tables 10 and 11 describe the tools and applications used during implementation
of the project.
Devices
Description
Android G1
First mobile phone with Android OS, version 1.6. It was used to
run Self Written Malware and Android applications.
Samsung Galaxy S
One of the latest mobile phones, running Android version 2.2. It was used to run
self-written malware and Android applications.
Program
Description
Ubuntu OS
Matlab
Android emulator
The Android SDK includes a virtual mobile device that can run
on the computer. The emulator allows us to develop and test
Android applications without using a physical device. It was
used to run self written Malware and Android applications.
vsftpd
Very Secure FTP Daemon is an FTP server for the Linux OS. We
used vsftpd to collect Android application system call log files
for the different applications, as sent in by the users.
Perl scripts
4.3
This section is divided into two different subsections. Subsection 4.3.1 describes
the evaluation process and the results obtained with our own self-written Android malware. Next, the Steamy Window malware is analyzed in subsection
4.3.2.
Calculator_G
Countdown_G
MoneyConverter_G
The good application pattern obtained will be compared against the incoming
data in order to decide if it belongs in the normality model or not.
All developed Android applications were tested using the Android emulator
and the Android mobile phone terminal Samsung Galaxy S, Table 10. All of
these applications have been tested under equal conditions for a fixed period of
time (five minutes), with different user interactions. The following pages will
describe some of the results obtained for our malware with the behavior-based
Android malware detection system. We will also provide some information on
the data files collected by the data collector script and crowdsourcing applications.
DEVICE INFO
ANDROID NAME
: FROYO
ANDROID VERSION
2.2
IMEI
354795046233372
BOARD
GTI 9 0 0 0
BOARDLOADER
unknown
BRAND
samsung
CPU_ABI
a r m e a b i v 7 a
CPU_ABI2
armeabi
DEVICE
GTI 9 0 0 0
DISPLAY
: FROYO
FINGERPRINT
HARDWARE
smdkc110
SES 6 0 8
2 . 2 /FROYO/XWJPA: u s e r / r e l e a s e
HOST
MANUFACTURER
samsung
MODEL
GTI 9 0 0 0
PRODUCT
GTI 9 0 0 0
RADIO
GTI 9 0 0 0
TAGS
release
TYPE
user
USER
root
k e y s
k e y s
INSTALLED
V e r s i o n C o d e p a c k a g e : 1 5
Version
Installed
Process
PACKAGES INFO
0.1.5
Application
Name
PERMISSION
SharkReader
l v . n3o . s h a r k r e a d e r
a n d r o i d . p e r m i s s i o n . INTERNET
a n d r o i d . p e r m i s s i o n .ACCESS_NETWORK_STATE
a n d r o i d . p e r m i s s i o n . GET_TASKS
a n d r o i d . p e r m i s s i o n .READ_PHONE_STATE
V e r s i o n C o d e p a c k a g e : 8
Version
Installed
Process
2.2.1
Application
Name
PERMISSION
Network
Location
com . g o o g l e . a n d r o i d . l o c a t i o n
a n d r o i d . p e r m i s s i o n .RECEIVE_BOOT_COMPLETED
a n d r o i d . p e r m i s s i o n . INSTALL_LOCATION_PROVIDER
a n d r o i d . p e r m i s s i o n . ACCESS_WIFI_STATE
a n d r o i d . p e r m i s s i o n . CHANGE_WIFI_STATE
a n d r o i d . p e r m i s s i o n .READ_PHONE_STATE
a n d r o i d . p e r m i s s i o n . ACCESS_COARSE_LOCATION
a n d r o i d . p e r m i s s i o n . INTERNET
a n d r o i d . p e r m i s s i o n . WRITE_SECURE_SETTINGS
V e r s i o n C o d e p a c k a g e : 1
Version
1.0
Application
Process
Name
PERMISSION
Installed
Camera
:
Firmware
com . s e c . a n d r o i d . app . c a m e r a f i r m w a r e
a n d r o i d . p e r m i s s i o n . WRITE_SETTINGS
a n d r o i d . p e r m i s s i o n . VIBRATE
a n d r o i d . p e r m i s s i o n .READ_PHONE_STATE
a n d r o i d . p e r m i s s i o n .MODIFY_PHONE_STATE
a n d r o i d . p e r m i s s i o n .CAMERA
a n d r o i d . p e r m i s s i o n . ACCESS_FINE_LOCATION
a n d r o i d . p e r m i s s i o n .WAKE_LOCK a n d r o i d . p e r m i s s i o n .SET_WALLPAPER
ANDROID_APPLICATION_REPORT
Autor : Iker Burguera Hidalgo
Date : Tue Feb 22 15:47:22 2011
Email : ikerburguera (at) gmail (dot) com
Application_Name : STRACEcom.mu.rtslab.iker.calculatorG.apk.out_Report.txt

system call STATISTIC
system call Name    Number of Executions
fork                2
read                202
write               266
open                235
close               243
time                6712
lseek               90
getpid              4737
ptrace              7944
access              84
kill                66
brk                 173
setgid              1
ioctl               15930
gettimeofday        84
writev              191
mmap2               3
vfork               1

OPEN FILES
File : /system/usr/keychars/qwerty.kcm.bin
File : /proc/922/cmdline
ANDROID_APPLICATION_REPORT
Autor : Iker Burguera Hidalgo
Date : Tue Feb 22 15:48:38 2011
Email : ikerburguera (at) gmail (dot) com
Application_Name : STRACEcom.mu.rtslab.iker.calculatorB.apk.out_Report.txt

system call STATISTIC
system call Name    Number of Executions
fork                2
read                235
write               696
open                807
close               812
time                7194
lseek               101
getpid              5457
setuid              1
ptrace              9354
access              179
kill                73
dup                 2
times               1
brk                 188
setgid              1
signal              1
ioctl               18792
gettimeofday        184
mmap                455
munmap              499
getpriority         113
stat                134
fstat               133
recv                17901
mprotect            514
sigprocmask         1236
msgget              109901
syscall             6
writev              445
mmap2               455
vfork               1
OPEN FILES
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
ACCESS FILES
File : /mnt/sdcard/Calculator_B
File : /system/usr/share/zoneinfo/Europe/Stockholm
File : /mnt/sdcard/Calculator_B
File : /mnt/sdcard/Calculator_B
File : /mnt/sdcard/Calculator_B
File : /mnt/sdcard/Calculator_B
File : /mnt/sdcard/Calculator_B
File : /mnt/sdcard/Calculator_B
MoneyConverter_B
CountDown_B
Calculator_B
MoneyConverter_G
CountDown_G
Calculator_G
Name
Malware
Malware
Malware
Normal
Normal
Normal
Type
memory
particular
stores in
SDCard
Get GPS
position and
server
Send user
contacts to a
Fill SDCard
Money
conversion
second.
Second
countdown
Number
Calculation
Objective
Given two numbers, can make trivial operations like the Sum,
Description
Picture
                  Interactions        Clustering result                       Detection
                  Good     Malware    Good clustered    Malware clustered     rate
Calculator        50       10         50                10                    100%
Countdown         50       10         50                10                    100%
MoneyConverter    50       10         50                10                    100%
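The detection rate in the table can be read as the share of interactions that ended up in the cluster matching their type. The short Python sketch below makes that reading explicit; the function name and the interpretation of the two "clustered" columns as correctly placed interactions are assumptions, since the text does not define the rate formally.

```python
def detection_rate(good, malware, good_clustered, malware_clustered):
    """Percentage of interactions placed in the expected cluster.

    Assumes the 'clustered' counts are the interactions that ended up in the
    cluster matching their type, which is how the table is read here.
    """
    total = good + malware
    return 100.0 * (good_clustered + malware_clustered) / total

for app in ("Calculator", "Countdown", "MoneyConverter"):
    print(app, detection_rate(50, 10, 50, 10))  # each line ends with 100.0
```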
Calculator_Vector.txt
Countdown_Vector.txt
MoneyConverter_Vector.txt
Each file will contain 60 system call interaction vectors, including good and bad
application interaction vectors.
The next step was to test the system using real Android Malware applications.
Steamy Window
We performed several tests using the only Android malware that we had at
the time, Steamy Window.
Steamy Window, shown in Figure 22, was the first malware to be tested in
the system. Steamy Window is a harmless application that can be found in the
official Android market for free. However, the same application can be found on
non-official Android repositories with malicious code attached. The first step
was to perform Dynamic Analysis.
Interaction_A= 0,0,0,3,7,7,7,0,0,1,1,0,0,11,0,1,0,0,0,3,438,0,0,0,0,0,2405,0,0,0,0,0,0,
0,5,0,0,0,1,1,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,5164,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,12,7,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,8,1,3,4065,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,2,2,0,0,0,0,0,14011,0,0,0,0,0,648,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,12,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Interaction_B=0,0,0,34,43,45,87,0,0,5,5,0,0,47,0,5,0,0,0,31,2695,0,0,0,4,0,8468,0,0,0,
0,0,0,0,22,0,0,0,5,5,0,0,27,0,0,0,46,0,0,0,0,0,0,0,0,20324,48,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,16,0,0,0,0,0,0,0,0,0,0,0,132,88,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,60,5,27,
13717,0,0,0,0,0,0,0,0,0,0,0,16,0,0,0,0,68,262,0,0,0,0,0,49976,0,0,0,0,0,2328,0,0,0,0,0,
0,0,0,38,0,0,0,0,0,0,132,0,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,
Interaction_C=0,0,0,19,12,28,29,0,0,1,1,0,0,22,0,1,0,0,0,19,1718,0,0,0,0,0,6632,0,0,0,
0,0,0,0,11,0,0,0,3,1,0,0,4,0,0,0,8,0,0,0,0,0,0,0,0,15089,36,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,41,21,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,24,1,19,10580,
0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,27,15,0,0,0,0,0,37324,0,0,0,0,0,1855,0,0,0,0,0,0,0,0,11,0,
0,0,0,0,0,41,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Interaction_D=0,0,0,16,12,27,28,0,0,1,1,0,0,19,0,1,0,0,0,16,1214,0,0,0,0,0,5663,0,0,0,0,
0,0,0,8,0,0,0,2,1,0,0,4,0,0,0,7,0,0,0,0,0,0,0,0,12376,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,40,20,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,21,1,16,8597,0,0,0,0,
0,0,0,0,0,0,0,6,0,0,0,0,27,15,0,0,0,0,0,29712,0,0,0,0,0,1549,0,0,0,0,0,0,0,0,11,0,0,0,0,0,0,
40,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Interaction_E=0,0,0,48,73,67,139,0,0,8,8,0,0,56,0,8,0,0,0,38,2964,0,0,0,8,0,8803,0,0,0,0,
0,0,0,28,0,0,0,6,8,0,0,45,0,0,0,78,0,0,0,0,0,0,0,0,20937,48,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,24,0,0,0,0,0,0,0,0,0,0,0,210,151,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,93,8,37,14230,
0,0,0,0,0,0,0,0,0,0,0,21,0,0,0,0,108,501,0,0,0,0,0,52168,0,0,0,0,0,2328,0,0,0,0,0,0,0,0,65,
0,0,0,0,0,0,210,0,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Interaction_F=0,0,0,22,13,29,30,0,0,1,1,0,0,32,0,1,0,0,0,22,2512,0,0,0,0,0,8253,0,0,0,0,0,
0,0,14,0,0,0,4,1,0,0,4,0,0,0,12,0,0,0,0,0,0,0,0,19940,48,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,44,22,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,27,1,22,13363,0,0,0,0,
0,0,0,0,0,0,0,7,0,0,0,0,28,15,0,0,0,0,0,48565,0,0,0,0,0,2328,0,0,0,0,0,0,0,0,12,0,0,0,0,0,0,
44,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Table 18 shows the similarities between the Steamy Window system call
vectors after applying the pdist function with Euclidean distance as the metric.
The pdist function computes the Euclidean distance between all vectors, and
the squareform function transforms the result of pdist into matrix form.
Interaction   A        B        C        D        E        F
A             0        0.1818   0.1414   0.1414   0.1818   0.1414
B             0.1818   0        0.1768   0.1768   0.1616   0.1667
C             0.1414   0.1768   0        0.1010   0.1818   0.1212
D             0.1414   0.1768   0.1010   0        0.1818   0.1212
E             0.1818   0.1616   0.1818   0.1818   0        0.1717
F             0.1414   0.1667   0.1212   0.1212   0.1717   0
Table 18: Steamy Window system call vectors comparison matrix table
The last step was to cluster the previous results into two different clusters.
In order to do that we used the k-means clustering algorithm, defining two
clusters, k = 2,
this metric gave us the best outcome in the analysis and testing when detecting
different vectors.
Interaction
Cluster
Application
Table 19: Steamy window clustering result
The final outcome was a vector with the results of the k-means clustering
algorithm, see Table 19.
vectors,
and
E.
pi
space. Since it is not possible to graphically represent more than three vectors
in a
D-dimensional
The blue bars represent the normal behavior of the Steamy Window application and the red bars represent the behavior of the malicious version of the
Steamy Window application.
Every system call has its own number and the
of the executed system call or the count of executed system call. The
axis
shows the number of times that the system call has been executed.
Upon studying Figure 24, we can note some distinct differences between
good and malicious interactions.
mal behavior of the Steamy Window application, we can clearly see that the
malicious version of the Steamy Window application is executing additional system calls;
and chown()
some of these. Taking into account that both applications have the same version
number, we can assume that the Steamy Window application downloaded from
non-official Android repositories, interactions and E, is a suspicious/harmful
Android application.
formation such as executed system calls, count of system call executions and
opened and accessed files.
ANDROID_APPLICATION_REPORT
Autor : Iker Burguera Hidalgo
Date : Thu Mar 3 16:35:57 2011
Email : ikerburguera (at) gmail (dot) com
Application_Name : STRACEcom.appspot.swisscodemonkeys.steam.apk.out_Report.txt

system call STATISTIC
system call Name    Number of Executions
read                1219
write               263
open                1192
close               1311
link                21
unlink              21
time                81
chmod               12
lseek               280
getpid              19762
getuid              15
ptrace              49675
access              112
sync                48
kill                25
rename              9
mkdir               1
ioctl               123934
fcntl               489
gettimeofday        29
mmap                406
munmap              270
fstat               1188
recv                68683
mprotect            2319
sigprocmask         1128
msgget              475966
syscall             7277
writev              106
mmap2               406
sched_yield         11

OPEN FILES
File : /dev/ashmem

ACCESS FILES
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/google_analytics.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/google_analytics.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/google_analytics.db-journal
Figure 24: Steamy Window Interactions bar plot
ANDROID_APPLICATION_REPORT
Autor : Iker Burguera Hidalgo
Date : Thu Mar 3 16:21:06 2011
Email : ikerburguera (at) gmail (dot) com
Application_Name : STRACEcom.appspot.swisscodemonkeys.steam.apk348.out_Report.txt

system call STATISTIC
system call Name    Number of Executions
read                530
write               163
open                530
close               591
link                13
unlink              13
time                31
chmod               7
lseek               162
getpid              5441
getuid              6
ptrace              12378
access              44
sync                36
kill                6
rename              4
mkdir               1
dup                 35
brk                 110
ioctl               29213
fcntl               230
gettimeofday        15
mmap                176
munmap              124
getpriority         3
stat                550
lstat               4
fstat               514
recv                16310
fsync               36
clone               22
mprotect            1009
sigprocmask         637
msgget              112187
syscall             2047
writev              53
mmap2               176
sched_yield         3
OPEN FILES
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
ACCESS FILES
File : /data/data/com.appspot.swisscodemonkeys.steam/shared_prefs/com.appspot.swisscodemonkeys.steam_preferences.xml
File : /data/data/com.appspot.swisscodemonkeys.steam/shared_prefs/com.appspot.swisscodemonkeys.steam_preferences.xml.bak
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webview.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webview.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webview.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webview.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webview.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webview.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webviewCache.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webviewCache.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webviewCache.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webviewCache.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webviewCache.db-journal
File : /data/data/com.appspot.swisscodemonkeys.steam/databases/webviewCache.db-journal
Chapter 5
5
This chapter summarizes the results of the work described in this project in
two different sections. Section 5.1 will summarize the work carried out over the
course of this Master's thesis. Section 5.2 will suggest new ideas that can be
pursued based on this project.
5.1
Conclusions
and
chown()
are
the system calls most commonly used by malware. A benign application could
make moderate or heavy use of those system calls, thus triggering false positives.
Even when dealing with slightly modified Trojans, the system would still class
them correctly. We have seen that Trojanized applications made more system
call executions and invoked different system calls to the kernel in comparison
with the original applications.
The most important contribution of this project is the mechanism we propose for obtaining real traces of application behavior. In previously published
works, we have seen that it is possible to obtain information on behavior using
artificially created user actions or creating replicas of smartphones, but crowdsourcing helps the community to obtain real application traces from hundreds
or even thousands of applications.
A paper has been published based on this report for the ACM CCS Workshop on Security and Privacy in Smartphones and Mobile Devices 2011 - SPSM
2011. The paper summarizes the essential details of the framework and contains
further tests performed using the framework on the latest Android malware.
5.2
Future Directions
The next step is to deploy the Crowdroid lightweight client on Google's Android market and distribute it to as many users as possible. Users running our
application will be able to see their own smartphone behavior. We could even
alert the users when one of their applications shows an abnormal trace.
The
system can also act as an early warning system, capable of detecting malicious
or abnormally behaving applications in the early stages of propagation.
By implementing a set of tools, we have demonstrated that one can obtain behavior-based information and have it processed and clustered on a central server. Clustering results have been flawless for self-written malware, and
promising with real malware. Whether the performance of a single central server
would suffice for large-scale deployment is an interesting topic for further study.
A configuration with multiple cooperating servers, each with a lower load and
faster response, is an avenue to explore.
We have chosen a simple 2-means clustering algorithm to distinguish between
benign applications and their corresponding malware version. The results have
been encouraging, although we need to address some issues that remain unresolved. First, the system would always separate the system call data vectors into
two clusters even if there was no malware present. The cluster mapping would
change drastically whenever a malicious execution vector entered the dataset.
This issue requires some manual checks or further automatic analysis. Secondly,
one could intentionally submit incorrect data into the system, thus leaving the
dataset corrupted. One of the next steps is to authenticate the submitting application in order to ensure that nobody is deliberately sending incorrect data
to the system. As regards the communication mechanism between the Crowdroid client and our server, it is carried out using the FTP protocol in this first
version and thus does not focus on protecting the privacy of transferred data.
If an attacker sniffs and manipulates the traffic in the communication process,
it can lead to misclassification errors. In order to avoid this, we are introducing
encryption mechanisms to preserve the integrity of the data and the authenticity
of the sender. We have to take into account that when applying this technique
on the mobile device it might have an extra overhead in the processing stage,
resulting in higher energy consumption.
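One possible way to protect the integrity of a submitted report and the authenticity of its sender is a keyed message authentication code. The Python sketch below illustrates the idea with HMAC-SHA256 from the standard library. It is an assumption about how such a check could look, not the mechanism chosen for Crowdroid, which the text does not specify: the key, the report contents and the function names are all hypothetical, and an HMAC does not hide the data, so confidentiality would still require an encrypted channel.

```python
import hashlib
import hmac

def sign(payload: bytes, key: bytes) -> str:
    """Return a hex tag that only holders of the shared key can produce."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, key: bytes, tag: str) -> bool:
    """Check the tag in constant time; False means tampering or a wrong key."""
    return hmac.compare_digest(sign(payload, key), tag)

KEY = b"example-shared-secret"               # hypothetical pre-shared key
report = b"read,202\nwrite,266\nopen,235\n"  # hypothetical report contents

tag = sign(report, KEY)
print(verify(report, KEY, tag))                 # True
print(verify(report + b"ioctl,1\n", KEY, tag))  # False
```

Computing one hash per report is cheap compared with collecting the trace itself, although any additional processing on the device still contributes to the energy cost mentioned above.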
Finally, we have the challenge of convincing the Android user community
to install the Crowdroid application.
References
[1] 50 Malware applications found on Android Official Market. Access date:
25 Nov 2010.
http://m.guardian.co.uk/technology/blog/2011/mar/02/android-market-apps-malware?cat=technology&type=article.
[2] Adb Monkey UI- Application exerciser. Access date: 12 Nov 2010.
http://developer.android.com/guide/developing/tools/monkey.html.
http://en.ophonesdn.com/article/show/354.
http://www.androidenea.com/2009/06/android-boot-process-from-power-on.
html.
[5] Android Architecture. Access date: 4 Nov 2010. [Online]. Available from:
http://developer.android.com/guide/basics/what-is-android.html.
http://reminisce06.springnote.com/pages/7407623?print=1.
http://www.alittlemadness.com/2010/06/07/
understanding-the-android-build-process/.
http://bootloader.wikidot.com/linux:boot:android.
[9] Android Kernel. Access date: 20 Apr 2011. [Online]. Available from:
//android.git.kernel.org/.
http:
http://developer.android.com/sdk/index.html.
[11] Angry Birds Bonus Level. J. Oberheide. Access date: 27 Dec 2010.
http://m.guardian.co.uk/technology/blog/2011/mar/02/
android-market-apps-malware?cat=technology&type=article.
http://facinatingandroid.blogspot.com/2011/09/
android-apk-file.html.
9 Jan 2011.
code.google.com/p/smali/.
http://www.f-secure.com/weblog/archives/00000414.html.
69
http://
http://www.f-secure.com/v-descs/cabir.shtml.
http://www.dalvikvm.com/.
http://en.ophonesdn.com/article/show/354.
http://ai.stanford.edu/~ang/papers/nips02-metric.pdf.
eclipse.org/.
http://www.
http://www.hispasec.com/.
http://www.idc.com/getdoc.jsp?containerId=227360.
http://www.idc.com/getdoc.jsp?containerId=prUS22762811.
http://www.idc.com.
http://www.sans.org/reading_room/whitepapers/detection/
intrusion-detection-systems-definition-challenges_343.
14 Jan
2011.
http://www.iseclab.org/.
[26] Linux Kernel manual pages. Access date: 14 Mar 2011.
http://www.kernel.org/doc/man-pages/online/dir_section_2.html.
[27] Linux Kernel system call list table. Access date: 24 Mar 2011.
http://bluemaster.iu.hio.no/edu/dark/lin-asm/syscalls.html.
http://adtmag.com/articles/2011/03/03/
android-attacks-on-rise.aspx.
http://www.computereconomics.com/page.cfm?name=Malware%
20Repor.
http://pages.cs.wisc.edu/~pb/comsnets09.pdf.
70
http://www.topnews.in/android-malware-increase-400-report-2328121.
http://www.spamlaws.com/malware-types.html.
//netbeans.org/.
23 Nov 2010.
http:
http://www.netqin.com/en/.
http://techcrunch.com/2011/02/10/nokia-confirms-microsoft-partnership-new-leadership-te
http://www.idc.com/getdoc.jsp?containerId=prUS22486010.
http://www.businessinsider.com/sai.
[38] Samsung HTC Smartphone vendor companies market share. Access date:
3 Jan 2011.
http://www.eweek.com/c/a/Mobile-and-Wireless/
Android-Helps-Samsung-HTC-Double-Market-Share-IDC-792965.
[39] Sandbox. Access date: 29 Oct 2010.
http://www.cs.bgu.ac.il/~dsec022/papers/j9a.pdf.
http://www.businessinsider.com/chart-of-the-day-smartphone-apps-2011-3.
http://www.idc.com/about/viewpressrelease.jsp?containerId=
prUS22689111.
http://en.wikipedia.org/wiki/Stack_machine.
http://www.netqin.com/en/.
http://www.parksassociates.com//blog/article/
number-of-smartphone-users-to-quadruple--exceeding-1-billion-worldwide-by-2014-4.
71
Jour-
Techniques, 10:1
Techniques, 10:1
56, 2002.
[48] Thomas
Bl,
Leonid
Batyuk,
Aubrey-Derrick
Schmidt,
Seyit
Ahmet
An android appli-
Techniques, pages
5562, 2010.
[49] Abhijit Bose, Xin Hu, Kang G. Shin, and Taejoon Park.
Behavioral de-
Proceeding of the 6th international conference on Mobile systems, applications, and services, MobiSys
tection of malware on mobile handsets. In
Proceedings of the
Proceedings of the 41st Annual Hawaii International Conference on System Sciences, HICSS '08, pages 296, Washington, DC, USA, 2008. IEEE
ling and intrusion detection using smart batteries. In
Computer Society.
[51] Iker Burguera, Urko Zurutuza, and Simin Nadjm-Tehrani.
Crowdroid:
Workshop on
Security and Privacy in Smartphones and Mobile Devic es 2011 - SPSM
2011. ACM, October 2011.
Behavior-based malware detection system for android.
In
[52] Jerry Cheng, Starsky H Y Wong, Hao Yang, and Songwu Lu.
SmartSiren:
ACM, 2007.
[53] David Dagon, Tom Martin, and Thad Starner. Mobile phones as computing
3:1115,
October 2004.
[54] Anhai Doan, Raghu Ramakrishnan, and Alon Y Halevy.
systems on the world-wide web.
Crowdsourcing
54(4):86,
2011.
[55] Manuel Egele.
[56] William Enck, Peter Gilbert, Byung-Gon Chun, Landon P. Cox, Jaeyeon
Jung, Patrick McDaniel, and Anmol N. Sheth. Taintdroid: an informationow tracking system for realtime privacy monitoring on smartphones. In
72
[58] By Fengmin Gong, Chief Scientist, Mcafee Network, and Security Technologies. Deciphering detection techniques : Part ii anomaly-based intrusion
detection.
edition, 2008.
[60] Nwokedi Idika and Aditya P Mathur. A survey of malware detection techniques.
[62] J Macqueen.
volume 233,
Proceedings
of the 26th Annual Computer Security Applications Conference, ACSAC
Bos. Paranoid android: versatile protection for smartphones. In
'10, pages 347356, New York, NY, USA, 2010. ACM.
http://www.slideshare.net/JaimeBlasco/
wtf-is-happeninginsidemyandroidphonepublic.
Proceedings of the 2009 IEEE international conference on Communications, ICC'09, pages 631635, Piscataway, NJ, USA, 2009. IEEE
droid. In
Press.
[67] Aubrey-Derrick Schmidt, Jan Hendrik Clausen, Ahmet Camtepe, and
Sahin Albayrak. Detecting symbian os malware through static function call
[71] Asaf Shabtai, Robert Moskovitch, Yuval Elovici, and Chanan Glezer. Detection of malicious code by applying machine learning classiers on static
features: A state-of-the-art survey.
ary 2009.
[72] Ashkan Shari Shamili, Christian Bauckhage, and Tansu Alpcan. Malware
detection on mobile devices using distributed machine learning. In Proceedings of the 2010 20th International Conference on Pattern Recognition,
ICPR '10, pages 43484351, Washington, DC, USA, 2010. IEEE Computer
Society.
[73] Symantec.
http://www.techeye.net/security/androids-steamy-window-trojan-sends-sms-to-premium-numb
[74] Urko Zurutuza, Roberto Uribeetxeberria, and Diego Zamboni.
A data
ture generation. In
74