You are on page 1of 82

Institutionen fr datavetenskap

Department of Computer and Information Science


Final thesis

Behavior-based malware detection


system for the Android platform
by

Iker Burguera Hidalgo


LIU-IDA/ERASMUS-A11/002SE
2011-09-27

Linkpings universitet
SE-581 83 Linkping, Sweden

Linkpings universitet
581 83 Linkping

Linkping universitet
Institutionen for datavetenskap

Examensarbete

Behavior-based malware detection


system for the Android platform
av

Iker Burguera Hidalgo


LIU-IDA/ERASMUS-A11/002SE
2011-09-27

Handledare: Dr. Urko Zurutuza


Examinator: Dr. Simin Nadjm-Tehrani

Linkping University Electronic Press

Upphovsrtt
Detta dokument hlls tillgngligt p Internet eller dess framtida ersttare frn
publiceringsdatum under frutsttning att inga extraordinra omstndigheter
uppstr.
Tillgng till dokumentet innebr tillstnd fr var och en att lsa, ladda ner,
skriva ut enstaka kopior fr enskilt bruk och att anvnda det ofrndrat fr ickekommersiell forskning och fr undervisning. verfring av upphovsrtten vid
en senare tidpunkt kan inte upphva detta tillstnd. All annan anvndning av
dokumentet krver upphovsmannens medgivande. Fr att garantera ktheten,
skerheten och tillgngligheten finns lsningar av teknisk och administrativ art.
Upphovsmannens ideella rtt innefattar rtt att bli nmnd som upphovsman i
den omfattning som god sed krver vid anvndning av dokumentet p ovan beskrivna stt samt skydd mot att dokumentet ndras eller presenteras i sdan form
eller i sdant sammanhang som r krnkande fr upphovsmannens litterra eller
konstnrliga anseende eller egenart.
Fr ytterligare information om Linkping University Electronic Press se frlagets hemsida http://www.ep.liu.se/

Copyright
The publishers will keep this document online on the Internet or its possible
replacement from the date of publication barring exceptional circumstances.
The online availability of the document implies permanent permission for
anyone to read, to download, or to print out single copies for his/hers own use
and to use it unchanged for non-commercial research and educational purpose.
Subsequent transfers of copyright cannot revoke this permission. All other uses
of the document are conditional upon the consent of the copyright owner. The
publisher has taken technical and administrative measures to assure authenticity,
security and accessibility.
According to intellectual property law the author has the right to be
mentioned when his/her work is accessed as described above and to be protected
against infringement.
For additional information about the Linkping University Electronic Press
and its procedures for publication and for assurance of document integrity,
please refer to its www home page: http://www.ep.liu.se/.

Iker Burguera Hidalgo.

Abstract
Malware in smartphones is growing at a signicant rate. There are
currently more than 250 million smartphone users in the world and this
number is expected to grow in coming years [44].
In the past few years, smartphones have evolved from simple mobile
phones into sophisticated computers. This evolution has enabled smartphone users to access and browse the Internet, to receive and send emails,
SMS and MMS messages and to connect devices in order to exchange information. All of these features make the smartphone a useful tool in our
daily lives, but at the same time they render it more vulnerable to attacks
by malicious applications.
Given that most users store sensitive information on their mobile
phones, such as phone numbers, SMS messages, emails, pictures and
videos, smartphones are a very appealing target for attackers and malware developers.
The need to maintain security and data condentiality on the Android
platform makes the analysis of malware on this platform an urgent issue.
We have based this report on previous approaches to the dynamic
analysis of application behavior, and have adapted one approach in order
to detect malware on the Android platform. The detector is embedded
in a framework to collect traces from a number of real users and is based
on crowdsourcing. Our framework has been tested by analyzing data collected at the central server using two types of data sets: data from articial
malware created for test purposes and data from real malware found in
the wild. The method used is shown to be an eective means of isolating
malware and alerting users of downloaded malware, which suggests that
it has great potential for helping to stop the spread of detected malware
to a larger community.
Finally, the report will give a complete review of results for self written
and real Android Malware applications that have been tested with the
system.
This thesis project shows that it is feasible to create an Android malware detection system with satisfactory results.

Acknowledgments
First of all, I would like to thank Prof. Simin Nadjm-Tehrani and
Dr. Urko Zurutuza for their support, guidance and patience over
the course of this Master's thesis project.

I would also like to thank all members of the Real-Time Systems


Laboratory (RTSLab), my corridor mates from Ryds All 9 and
Alsttersgatan 9 and friends from Legazpi for all the support and
fantastic moments we shared in 2010-2011.
Finally, I would like to thank my wonderful and fantastic family,
which in addition to providing me with economic and moral support
also wrote part of my acknowledgment notes.

Contents

1 Introduction

1.1

Background and Motivation . . . . . . . . . . . . . . . . . . . . .

1.2

Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.3

Project Assumptions . . . . . . . . . . . . . . . . . . . . . . . . .

1.4

Intended audience

. . . . . . . . . . . . . . . . . . . . . . . . . .

1.5

Related work

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.6

Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . .

13

2 Background
2.1

2.2

14

Android Operating System

. . . . . . . . . . . . . . . . . . . . .

14

2.1.1

Platform architecture

. . . . . . . . . . . . . . . . . . . .

14

2.1.2

The Dalvik Virtual Machine . . . . . . . . . . . . . . . . .

18

2.1.3

The Android Security Model

. . . . . . . . . . . . . . . .

20

2.1.4

Android applications . . . . . . . . . . . . . . . . . . . . .

22

. . . . . . . . . . . . . . . . . . . . .

24

2.2.1

Intrusion Detection System

Denition . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

2.2.2

Detection types

25

. . . . . . . . . . . . . . . . . . . . . . .

2.3

System calls and Vectors . . . . . . . . . . . . . . . . . . . . . . .

27

2.4

Data Mining

29

2.4.1

. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Data collection in KDD process

. . . . . . . . . . . . . .

29

. . . . . . . . . . . . . . . . . . .

31

. . . . . . . . . . . . . . . . . . . . . . . . . . . .

34

2.5

K-means Clustering algorithm

2.6

Crowdsourcing

3 Behavior-Based malware detection system for Android Applications


35
3.1

Overview

3.2

Android Data mining: Crowdsourcing and Self-written applications 37

3.3

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

35

3.2.1

Android Data collector script . . . . . . . . . . . . . . . .

3.2.2

Android Crowdsourcing and data mining application . . .

41

Behavior-Based malware detection system . . . . . . . . . . . . .

42

3.3.1

42

Design of the Behavior-Based malware detection system .

4 Results and Evaluation

38

48

4.1

Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

48

4.2

Devices and Programs

48

4.3

. . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . .

50

4.3.1

Malware detection system Results

Self-written Malware . . . . . . . . . . . . . . . . . . . . .

50

4.3.2

Real Malware . . . . . . . . . . . . . . . . . . . . . . . . .

58

5 Conclusions, Contributions and Future Work

67

5.1

Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

67

5.2

Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . .

68

iii

List of Figures
1

Number of Applications available at smartphone App Stores[40] .

Android platform architecture[5]

. . . . . . . . . . . . . . . . . .

15

Android Linux Kernel and Init process . . . . . . . . . . . . . . .

17

Android boot sequence . . . . . . . . . . . . . . . . . . . . . . . .

18

Dex le creation process . . . . . . . . . . . . . . . . . . . . . . .

19

Application request process

21

Android APK le . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

Android APK le generation process . . . . . . . . . . . . . . . .

23

Misuse detection versus Anomaly detection

. . . . . . . . . . . .

25

10

Linux User and Kernel space

. . . . . . . . . . . . . . . . . . . .

27

11

Knowledge Discovery in Databases (KDD) process[46]

. . . . . .

29

12

Taxonomy clustering methods . . . . . . . . . . . . . . . . . . . .

31

13

Hierarchical method: Agglomerative vs Divisive . . . . . . . . . .

32

14

K-means applied as a detection system for android system calls

34

15

Android malware detection system scheme . . . . . . . . . . . . .

35

16

Data acquisition process . . . . . . . . . . . . . . . . . . . . . . .

37

17

Data collector script user interface

. . . . . . . . . . . . . . . . .

38

18

Data collector script process . . . . . . . . . . . . . . . . . . . . .

39

19

Android Crowdsourcing application . . . . . . . . . . . . . . . . .

41

20

Static and Dynamic Analysis

42

21

Android Malware Detection process

. . . . . . . . . . . . . . . .

44

22

Steamy Window application . . . . . . . . . . . . . . . . . . . . .

58

23

Interaction with Steamy window application

. . . . . . . . . . .

59

24

Steamy Window Interactions bar plot

. . . . . . . . . . . . . . .

64

. . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

iv

List of Tables
1

Worldwide mobile device Operating System Market Shares and


2010-2014 Growth[36]

. . . . . . . . . . . . . . . . . . . . . . . .

Related work State-of-the-Art Summary(i) . . . . . . . . . . . . .

11

Related work State-of-the-Art Summary(ii)

K-means Clustering algorithm process

. . . . . . . . . . . .

12

. . . . . . . . . . . . . . .

33

Static and Dynamic Malware analysis advantages and Disadvantages

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Matlab Clustering code for Android Malware Detection

Clustering algorithm metrics

43

. . . . .

45

. . . . . . . . . . . . . . . . . . . .

46

Vector comparison matrix . . . . . . . . . . . . . . . . . . . . . .

47

Example vector clustering results . . . . . . . . . . . . . . . . . .

47

10

Test Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

48

11

Programs used in the project

49

12

Crowdsourcing application result - Android Device Information

51

13

Crowdsourcing application result - Installed applications . . . . .

52

14

Self Written Application report - Calculator Good Application

53

15

Self Written Application report - Calculator Malicious Application 55

16

Self written android applications description . . . . . . . . . . . .

17

Self written Android Malware result

18

Steamy Window system call vectors comparison matrix table

19

Steamy window clustering result

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . .

56
57

. .

61

. . . . . . . . . . . . . . . . . .

61

Chapter 1
1

Introduction

This paper describes the results of a Master's thesis project (30 ECTS) towards
the fulllment of a degree in Telecommunications Engineering at Mondragon
Unibertsitatea.

The project was carried out at the Department of Computer

and Information Science at Linkping University while studying as a visiting


student from Mondragon Unibertsitatea.
The following paragraphs will detail the background, motivation, related
work and goals of the master thesis. Details on how the project was carried out
and on the results obtained will be presented in the following chapters.

1.1

Background and Motivation

Communications and technology are rapidly growing industries that are changing every day.

The constant evolution of technology necessitates adaption to

new concepts and awareness of new developments. In the following section we


briey cover the trends in the evolution of the smartphone market that make
the subject matter of this thesis relevant.
According to the International Data Corporation [23], smartphone vendors
will ship more than 450 million smartphones in 2011, compared to the 303.4
million units shipped in 2010[21]. Moreover, the smartphone market will grow
four times faster than the traditional mobile phone market, and due to this, the
demand for smartphones will rise considerably. Eventually, customers will reach
the point where they will replace their old mobile phones with smartphones.
The sales growth of mobile phone companies such as Samsung and HTC
between 2009 and 2010 has revolutionized the smartphone market. In light of
this, the IDC predicts that the Android OS will surpass Nokia's Symbian OS
in terms of sales in 2011, and will continue to lead the smartphone OS market
in the coming years [36].

Furthermore, it predicts that the Android OS and

Windows Mobile will grow almost 50% between 2010 and 2014, with a high
probability of becoming the leading smartphone operating system vendors in
the future. See Table 1.

Operating System

2010

2014

2014/2010

Market

Market

Change

predicted

predictedShare

Share
Symbian

40.1%

32.9%

BlackBerry OS

17.9%

17.3%

-18.0%
-3.5%

Android

16.3%

24.6%

51.2%

iOS

14.7%

10.9%

-25.8%

Windows Mobile

6.8%

9.8%

43.3%

Others

4.2%

4.5%

8.3%

Total

100%

100%

Table 1: Worldwide mobile device Operating System Market Shares and 20102014 Growth[36]

The IDC predicts that the total number of smartphone applications will grow
at the same rate as smartphone sales. There are currently more than 350,000
applications in Apple's iPhone market and 250,000 applications in the Android
market, according to Silicon Alley Insider [37]. This is depicted in Figure 1.

Figure 1: Number of Applications available at smartphone App Stores[40]

The ocial Google Android market nearly doubled in size in 2010 and 2011,
surpassing 250,000 applications in March 2011.

Figure 1, shows the interest

of software developers in the Android platform, and we can assume that as


Android developers continue to create applications for Android's OS, malware
developers will continue to create Malware for the system, as well.
1

Malware , has been a threat for PCs for many years[30] and in light of
the rapid increase of smartphone sales over the last few years[38], it was only
a matter of time before malware developers became interested in staging their
attacks on the smartphone platform. In particular, 2010 and 2011 saw a growing
interest among malware developers in waging attacks on Android's OS[28].
Malware usually destroys valuable and sensitive information in infected systems.

Malware is also commonly used to exploit infected devices and obtain

prots from them. In the same way as malware harms computers, it can also
perform attacks on smartphones, given that they have similar operating features. This observation makes it clear that it is necessary to enhance protection
of smartphone devices in the same way as we did with computers some years
ago.
The Android market is an open market system. This means that Android
developers can upload their applications, also called third-party applications,
to Android's ocial market without them being ltered by any certication
authority that would check the trustworthiness of the applications.

On the

one hand, this increases the odds that the Android market will have a greater
variety of applications and content, but on the other hand it facilitates infection
by malware applications, as applications are not analyzed by any certication
authority.
In conclusion, considering the growth of smartphones running the Android
OS

and the increasing number of applications available for the Android OS,

improving the security (i.e.

the integrity, condentiality and privacy) of the

Android platform is the main objective of this project. In order to achieve that
objective, we will develop a behavior-based malware detection system for the
Android platform.

1 Malicious(Mal ) software(ware )
2 Samsung and HTC smartphone

vendors[38]

1.2

Goal

The goal of this Master's thesis was to design and implement a behavior-based
malware detection system for the Android platform.
More specically, the work was divided into the following sub-goals:

Create a malware detection system for the Android platform.

Create data collector applications to monitor Android OS activity.

Design and implement the Android application behavior database.

The proposed solution was expected to detect malicious applications from Android ocial and non-ocial markets or repositories.

1.3

Project Assumptions

Some assumptions were made at the beginning of the project:

Applications available on the ocial Android market would be used to


establish the normality model for the applications, and the equivalent
programs in non-ocial repositories would be used to test the system.

Even if malware did exist in the Android market, rst we needed clear
or good applications with the same name or purpose to test the malware
detection system.

We assumed that downloaded third-party applications were not trusted


applications and must be analyzed/monitored with the crowdsourcing application or data collector script.

The Android community would collaborate on this project by installing


the crowdsourcing application on their devices. The crowdsourcing application would send recorded les to the malware detection system server
for post-analysis.

1.4

Intended audience

This thesis is useful to anyone who is involved in mobile Security, and is specially
designed for Android smartphone users and developers. It is also targeted at
anyone interested in crowdsourcing and data mining techniques as they apply
to mobile phones.
The document does not require any prior knowledge in the area of security.
Chapter 2 will provide all the basic theory for the concepts explained in the
paper.

1.5

Related work

Malware has been a threat for computers for many years[30] and continues to
cause irreparable damage to infected systems[29]. The rst attempts to identify
and analyze malware on smartphones started by adapting existing PC security
solutions and applying them to mobile phones. This was not a feasible solution
in light of the high demand placed on resources by antivirus techniques and the
power and memory constraints of mobile devices. Since malware and intrusion
detection systems have already been the subject of massive research, we will give
just a brief review of the evolution of malware and malware detection techniques
as regards mobile phones.
Nwokedi et al. compiled a summary of the most commonly used malware
detection techniques[60]. Their report examined 45 dierent malware detection
techniques in the elds of anomaly-based detection, specication-based detection and signature-based detection. All techniques explained in this report are
very useful background information in order to understand the rst approaches
to malware detection that can also be used in smartphones.
Iseclab[25], International Secure Systems Laboratory, explored the detection
of malicious applications and used dierent approaches to detection based on
dynamic analysis of malicious or infected applications. [55]. They used dierent
approaches and detection techniques based on dynamic analysis that are used
to detect malicious or infected applications. The paper provides useful information about malware detection techniques and tools used in dynamic analysis of
malware.

Jacoby et al.

introduced battery-based intrusion detection, a host-based

intrusion detection system[61]. This technique monitors anomalous behavior of


smartphone batteries and writes a report in the device listing the causes of high
power consumption.
Some years later, Buennemeyer et al. evaluated the power consumption of
devices with a client application installed on a smartphone using the Symbian
OS [50].The application monitored power consumption data and sent a report to
a remote server to analyze and detect anomalies in the system. Due to the lack
of smartphone malware patterns at that time, most of the anomalous detection
techniques used battery power consumption as the main source of detection
data. These techniques were based on checking and monitoring mobiles phones'
power consumption and comparing it to the normal power consumption pattern
in order to detect anomalies.
Cheng et al. introduced SmartSiren, a collaborative virus detection application for Windows Mobile 5[52]. It collects the communication activity from
smartphones and performs system log le analysis to detect anomalous behavior
in the system. The system uses a proxy-based architecture that interacts with
a client installed on devices in order to avoid a heavy processing load.
Schmidt et al.

showed how to extract smartphone features from Symbian

OS and Windows mobile phones in order to perform anomaly detection in the


systems[68].

They use several APIs provided by Windows and Symbian to

monitor applications and extract device features, such as RAM free memory,
user inactivity, process count, CPU usage, sent SMS messages, etc.

The aim

of monitoring the applications' performance is to obtain data enabling us to


dierentiate between normal and malicious use of a device.

Schmidt et al.

presented a novel approach to static malware detection in

resource-limited mobile environments[67]. Their approach consisted of detecting


malware by extracting function calls from binaries in order to apply a clustering
algorithm to the data.

This technique was used for detecting Symbian OS

malware depending on a mobile phone's features, such as device eciency, speed


and limited resource usage.
In 2006 Symbian was the most widely used smartphone OS and many malware detection techniques were developed for this platform. Due to the imminent growth of smartphones with the Android OS, malware researchers decided
to switch their malware detection techniques and security mechanisms to this
platform [38].
Schmidt et al. presented the rst serious research on malicious applications

for the Android OS [69]

They proposed a solution based on monitoring events

occurring at the Linux kernel level.

They used a monitoring application to

extract features such as executed system calls, modied les, etc.


Linux kernel.

from the

These features were used to create the smartphone normality

pattern.
The same group proposed static analysis in 2009[66] and an Android application sandbox system in 2010[48]. The rst report presented a collaborative
scenario in which dierent devices could perform static analysis of malware directly on the phone. The second method used an Android application sandbox,
a totally secure environment, to perform static and dynamic analysis.

Static

analysis disassembled Android APK les to detect malware patterns. During


dynamic analysis, all of the events occurring on the device (opened les, accessed les, battery consumption, etc.) were monitored. This sandbox provided
a secure environment where malware applications could be executed without
any risk of infection.

Enck et al. proposed real-time monitoring and analysis of sensitive data with
dynamic taint tracking[56]

This technique taints data from privacy-sensitive

sources and applies labels as sensitive data propagates through program variables, les, and inter-process messages. When tainted data leaves the system,
the application scans for suspicious outgoing data.
Bose et al., Shabtai et al.

and Shari et al.

have proposed another solu-

tion for malware detection on smartphones based on Support Vector Machines


(SVM) and learning machines[49, 71, 72], an extension to the Android mobile
phone platform that tracks the ow of privacy-sensitive data through thirdparty applications. Their proposal consists of monitoring smartphone devices
to determine their normal behavior and using collected data to train a learning
machine. This learning machine will learn the normality model of the smartphone and applications and alert the user every time it detects a suspicious
action.
Portolakidis et al.

have proposed a system in which they will perform a

complete malware analysis of the phone in a virtual environment on a remote


server[64] [63]

In both reports, they explain how to create replicas from Android

devices and apply malware detection techniques to these Android mobile phones.
The replicas are an equivalent version of the real mobile devices, and will be
sent to the remote server for malware analysis. Mobile phone replicas will run
in a secure virtual environment where dierent malware detection techniques
are applied.

Our purpose in this project is to improve on and contribute to malware


detection strategies for the Android OS by oering up new ideas. Our work has
its foundation in many of the works mentioned above [48, 69, 28, 68, 64, 63, 66].
Our approach is based on detecting Android malware applications using
Linux system calls and clustering algorithms. Like Portolakidis et al.[63], and
taking into account the limited and poor battery life of smartphones, we are
in complete agreement with the procedure of using a remote server machine to
perform malware detection.
Antivirus software techniques are inadequate for use on smartphones, as
they consume a great deal of CPU and memory resources and can drastically
shorten battery life.On the other hand, we consider it dangerous to send phone
replicas to a remote server, since the replicas contain important and condential
information (contact numbers, messages, pictures, etc.) and may compromise
user condentiality. Rather than sending the whole replica, we propose sending
the log les, collected by a lightweight data collector application installed in
Android devices and containing the device's most important information, to the
remote server for remote malware analysis.
3

A lightweight data collector application , installed on the device will be responsible for collecting the system calls generated by Android applications in
the device and storing device information les in the SD Card memory. This
application has similar features to the one proposed by Buennemeyer et al.,[50]
i.e. the sending of all monitored les to a remote server. They, however, made
very few attempts with mobile phones, and we aim to extend use of the application as much as possible. To do so we will ask Android community users to use
a lightweight script application (crowdsourcing application) in order to collect
as much data as possible from dierent Android devices.
A. Doan, R. Ramakrishnan and A. Halevy analyzed the impact of crowdsourcing on the WWW (World-Wide Web) [54]. Their article explains how in
the future crowdsourcing will become one of the most inuential techniques used
to collect information and create databases faster and more eciently.

3 Crowdsourcing

application[59]
9

The following text gives an overview of some recent attacks targeting Android and of malware that has appeared on the Android platform.
Android malware has increased by 400% since 2010[31], and will continue to
grow. In light of this, several malware attacks were carried out on the Android
OS in 2010 and 2011, [65] [11].
Hong Tou Tou, Angry Birds Bonus Level, Tip Calculator, Tap Snake, Monkey Jump and Steamy Window are the most famous malicious applications to
date on the Android platform. Furthermore, more than 50 infected applications
were found on Google's Android market in March 2011, all of them infected
with the DroidDream Trojan application[1].
Another attack targeting the Android platform was carried out by J. Oberheide. He developed the Angry Birds Bonus Level for the Android OS[11]. This
application was a proof-of-concept malware application to showcase the weak
security of the Android marketplace.

The Angry Birds Bonus Level malware

purports to be an additional bonus level for the famous game Angry Birds.
The malicious application downloads and installs three additional applications

on the user's device in order to steal sensitive information. These applications


were available in Android's ocial marketplace for over ve months, but were
removed after they were discovered to be stealing sensitive information from
mobile phone devices.

J. Oberheide argues that he could collect condential

information from a great number of Android devices in only a few days' time.
NetQin Inc[34], a mobile security service provider, discovered a spyware
application called Tip Calculator in the Android market.

The spyware sent

all incoming and outgoing SMS messages in the system to a designated email
address. Another piece of spyware with similar characteristics discovered in non-

ocial Android repositories was Steamy Window[43]

A Trojan Horse called

Android Pjapps modies the original version of this application and wages an
attack by subscribing to a SMS premium service.
Due to its appeal as the latest malware discovered for the Android OS, and
since both the clean and malicious instances of the application were available, we
decided to analyze this spyware with our proposed malware detection system.

4 Fake

Contact Stealer, Fake Location Tracker and Fake Toll Fraud


10

11

NIDS

et

Bose et

Detection

Signature Based

HIDS

al.(2009)[71]

Shabtai et

NIDS

Anomaly
Detection

HIDS,

Detection

Anomaly

Detection

Signature Based

Schmidt et

HIDS

Anomaly
Detection

al.(2008)[69]

al.(2008)[68]

Schmidt et

al.(2008)[49]

HIDS

HIDS,

Buennemeyer

al.(2008)[50]

Anomaly

NIDS

(2007)[52]

Detection

HIDS,

Cheng et al.

SignatureBased
Detection

HIDS

Approach Detection
Method

al.(2004)[61]

Jacoby et

Author

Smartphone OS.

Detection techniques described can be applied in any

malicious application using Machine Learning methods.

It uses static features extracted from executables for classifying

anomalies in the system.

File system logs and Event detection modules to detect

Linux-kernel view. It uses network trac, Kernel system calls,

This paper analyzes the security on Android smartphones from

analysis.

Vectors will be processed by a Machine learning for further

extracted device features to a remote server in a vector format.

Symbian OS or Windows mobile client application will send

It uses a remote learning-based machine as anomaly detection.

monitored events and API calls in Symbian OS.

Support Vector Machines (SVM) and constructs signatures from

It detects malicious applications by training a classier based on

anomalies.

sends the report to a remote server to be analyzed and detect

Lightweight application monitors the power consumption and

behavior in the system.

activity from the device in order to detect any anomalous

It Performs system log le analysis and collect communication

device power consumption to detect anomalies in the system.

Monitor's device Normal power consumption against actual

Description

Table 2: Related work State-of-the-Art Summary(i)

All

OS

Android

Mobile

dows

OS/Win-

Symbian

OS

Symbian

OS

Symbian

OS

Symbian

OS

Symbian

Platform

12

63]

al.(2010)[64,

et

Detection

Anomaly

Anomaly

Detection

Anomaly

Detection

OS

Android

OS

Android

OS

Symbian

OS

Android

OS

Symbian

OS

Android

Platform

Android mobile phone replicas.

detection analysis. Virtual environments will be used to analyze

A remote security server in the cloud performs the Malware

taint tracking analysis to monitor privacy sensitive information.

user whenever a sensitive data of the user is compromised. Uses

Taint Droid will monitor Android applications and will alert the

TaintDroid is a realtime monitoring system for Android OS.

monitor network trac in a distributed way.

mobile device network. A light-weight Symbian application will

It presents a distributed SVM algorithm to detect Malware on a

applications in a totally secure environment.

patterns. Dynamic analysis executes and monitors Android

Static analysis scan's Android source code to detect Malware

perform Static and Dynamic analysis on Android applications.

It uses an Android Application Sandbox (AASandbox) to

clustering mechanisms in Symbian OS.

They extract function calls from binaries in order to apply

are compared with Malware executables for classifying.

calls in Android OS using the command readelf. Function calls

Perform Static analysis on the executables to extract function

Description

Table 3: Related work State-of-the-Art Summary(ii)

Signature Based

Portolakidis

HIDS,NIDS

Anomaly
Detection

Detection

HIDS,NIDS

NIDS

HIDS

HIDS

Signature Based

HIDS
Detection

Detection
Method

Approach

al.(2010)[56]

Enck et

al.(2010)[72]

Shari et

al.(2010)[48]

Blsing et

al.(2009)[67]

Schmidt et

al.(2009)[66]

Schmidt et

Author

1.6

Thesis structure

This section summarizes the main topics to be discussed throughout the paper,
giving a short overview of each chapter.
Chapter 2, describes the basic theory of the Android platform, intrusion
detection systems, Linux system calls, data mining and clustering algorithms.
The aim of this chapter is to enable the reader to understand the basic concepts
of the project.
Chapter 3, describes the behavior-based malware detection system for the
Android platform that was designed in this project.
Chapter 4, describes the testing and evaluation methods used by the behaviorbased malware detection system for the Android platform.
Chapter 5, describes the nal conclusions and denes the future work of the
project.

13

Chapter 2
2

Background

This chapter will give a brief description of some of the fundamental concepts
and terminology relating to the Android OS, intrusion detection systems, Linux
system calls, data mining and clustering algorithms. The clustering algorithm
section will be illustrated with reference to the way in which we have applied
these known techniques in order to group Android system calls.

2.1

Android Operating System

The Android OS is a Linux-based open source operating system for mobile


devices. It was originally developed by Android Inc. and was bought by Google
in 2005.
The operating system is based on a modied version of the Linux 2.6 kernel[9]
optimized for embedded systems and specially adapted for smartphones and
tablets. The optimization process in embedded systems improves data processing and battery consumption, extending battery life.
The following pages will provide detailed information about the Android OS.

2.1.1 Platform architecture


Architecture
The Android platform was created for devices with limited processing power,
memory and storage space, commonly called embedded systems. It was created
with the objective of implementing an operating system in environments requiring a low memory footprint and processing load, such as smartphones or
tablets.

14

Figure 2: Android platform architecture[5]

15

The Android OS is composed of several software components that can be


divided into three main groups: Operating System (OS), Middleware and Applications.

Operating system:

This group consists of Linux Kernel, the core and

most important component of the Android architecture.

As mentioned

above, Android is based on Linux 2.6 kernel, which provides the platform
with basic services such as security, memory management and process
management. The kernel can be considered an abstraction layer between
software and hardware layers, responsible for managing and processing requests received from higher layers for interaction with hardware resources.

Middleware:

This group consists of Android Runtime and Libraries.

Android Libraries are written in the C/C++ programming language and


Android developers can use them through the Application Framework.
Libraries provide easier access to system resources, such as the camera,
Wi-Fi, ash memory, etc. Dalvik Virtual Machine, or Dalvik VM[16], is
also one of the most important parts of the Android architecture. Dalvik
VM is a Java Virtual Machine specially designed and modied to optimize
memory and energy consumption in embedded systems. Dalvik VM was
designed to run multiple virtual machines without placing additional processing load on the processor. It is also responsible for executing optimized
Java code and Dex les (les in the Dalvik execution format). Dalvik VM
and Dex le internals will be explained in greater detail in Section 2.1.2.

Application:

This group consists of the Application Framework and Ap-

plications. By default, the Android OS includes basic applications like a


web browser, an email client and maps. This layer can also run third-party
applications from the Android market or other repositories. Applications
in this layer are written in the Java programming language. The application framework provides useful components for Android developers. This
layer consists of views, a resource manager, content providers and the notication manager, providing aid to applications using standard libraries.

As Android OS is an open-source project the kernel is available to download on


the internet [9] and it is possible to modify and create new versions adapted to
suit dierent purposes.

16

Start-up
Another essential part of the Android OS is the startup process. Like any
other Linux system, Android has a boot sequence which prepares the services
necessary to run/start the device's operating system.
Figure 3 shows the rst stage in the boot sequence on Android OS.

Figure 3: Android Linux Kernel and Init process

The rst stage in the boot sequence is running the Bootstrapper application.
The bootstrapper is the program which starts the device's operating system and
initializes and tests the basic requirements of the hardware, peripherals and external memory devices. GRUB and LILO for Linux and NTLDR for Windows
are some of the most famous bootstrapper applications. The bootstrapper application loads the kernel image into RAM, and then the kernel starts the init
process. Figures 3 and Figure 4 explain the Android OS init process and boot
sequence.
The init process initializes system daemons for handling low-level hardware
interfaces, such as USB, the Android debugger or Android Debug Bridge Daemon.

The init process also starts the basic runtime processes, such as the

Runtime service, Service manager, Media server and the Zygote.

17

Figure 4: Android boot sequence

Figure 4 shows the Android OS boot sequence in greater detail. As mentioned above, the init process initializes several daemons and services in the
system. At the same time, the init process starts the Zygote process. We will
describe the process in greater detail on the following pages.

2.1.2 The Dalvik Virtual Machine


The Dalvik VM[16], is a Java virtual machine specially designed and modied
to optimize memory and energy consumption in embedded systems like smartphones, tablets and netbooks. It was designed and created by Dan Bornstein,
with collaboration and contribution by other Google engineers. The virtual machine is optimized to require a low level of memory usage and enables multiple
virtual machine instances to run simultaneously with little additional load on
the processor.
The Dalvik VM uses register-based architecture[45], which is faster and more
ecient than the stack-based architecture used in most other virtual machines.

Every Android application runs in its own process, with its own instance
of the Dalvik VM inside a secure environment, a Sandbox.

The Dalvik VM

executes les in the Dalvik VM executable format (Dex Format), which is an


optimized Java code le for systems with constrained memory and processor
speeds.

18

The Dex le format


The Android Java source code is still compiled in class les. As mentioned
earlier, the Dalvik VM is a modied version of a Java virtual machine optimized
for embedded systems, and therefore code must be optimal to achieve the best
performance. Since it is not possible to run class les on Dalvik VM they are
optimized and converted into the Dex le format. Dex les are optimized class
les ready to be executed on the Dalvik VM. Figure 5 shows the process of
compilation from Java source code les to optimized code Dex les.

Figure 5: Dex le creation process

The Zygote
As detailed above, every Android application runs in its own instance of
the Dalvik VM and each instance must start quickly when a new application
is launched in the application layer. Android uses a concept called Zygote to
provide the fast start-up time needed to run the Dalvik VM every time a new
application is executed. Zygote loads the original Dalvik VM during the boot
sequence and waits for new requests from the Runtime process.

When the

Zygote process starts, it initializes an instance of Dalvik VM from the original


Dalvik VM. Afterwards, it loads and initializes the core library classes. Every
time Zygote receives a new application request from the runtime process, it
will create/fork a new Dalvik VM instance from the original Dalvik VM that
was loaded during the boot sequence. Creating an instance of Dalvik VM from
an existing Dalvik VM minimizes the startup time of the application in the
secure environment. For every new application request, Zygote will create a new
instance of Dalvik VM. This process is repeated every time the user requests an
application.

19

Register-based Architecture
Virtual machine developers have always been in favor of implementing virtual machines with a stack-based architecture [42] rather than a register-based

architecture[45]

The simple implementation of stack-based architecture leads

developers to prefer its use. Obviously, this simple implementation comes with
a performance cost. Executables for stack-based architecture are smaller than
executables for register-based architecture. This means a higher memory consumption, leading to a worse performance of the virtual machine.

Register-

based architecture requires an average of 48% fewer executed virtual machine


instructions than stack-based architecture, which considerably improves the performance of the device. On the other hand, the register code used by registerbased architecture is larger than stack-based architecture code.

Even so, the

processing load generated by Register-based architecture is still lower than that


of Stack-based architecture. Taking into account the fact that the Dalvik VM
runs on embedded devices with constrained memory and processing power, the
use of a register-based architecture is the most appropriate choice.

2.1.3 The Android Security Model


Android's security architecture guarantees that no application in the system can
damage other applications or the operating system. Each application runs in
an independent instance of Dalvik VM, with its corresponding PID. This means
that applications are completely isolated.

This technique of running applica-

tions in a secure environment is called sandboxing[39]. A Sandbox is a security


mechanism often used to execute potentially unsafe code or applications from
third-party developers. The Android OS uses a le called AndroidManifest.xml
to enable applications to interact with other applications and system resources
in the device. These permissions are declared before the application is installed
on the device. These permissions are also declared before Android's installation
APK le is generated, and cannot be modied after the app is installed on the
device.
In Linux a user ID identies a user. On Android the Android ID identies an
application running on a Dalvik VM instance. This Android ID is assigned and
stored in device's system after installation and is released when the application
is removed from the device.
Android uses permissions in the sandbox environment to grant access to
system resources such as les, SD Card memory, network, sensors and APIs in
general. Figure 6 the process of executing applications in the Android OS.

20

Figure 6: Application request process

Every time an application is executed in the Android OS application layer,


the System Manager is responsible for collecting and sending these requests to
the runtime process.

The runtime process will catch the requests and notify

Zygote of the execution of a new Android application.

Zygote will create a

new Dalvik VM instance for every new application request, and the requested
application will run in that Dalvik VM instance.

Every Dalvik VM instance

will run only one application in order to provide a secure environment.

21

2.1.4 Android applications


Android applications are written in the Java programming language. Android
uses the Android Software Development Kit (SDK) [10] and Java's programming
environments, such as

Eclipse [19]

or

Netbeans [33],

to compile Java code and

create an Android application installation (APK) le. These APK les can be
installed on Android devices using the Android Debug Bridge tool (adb) or by
downloading them from Android's Ocial Market.

Figure 7 shows the basic

structure of an APK le.

Figure 7: Android APK le

An APK le is composed of three main groups: AndroidManifest.xml, Classes.dex


and Resources, which are packaged into a single le.

AndroidManifest.xml: The Android manifest le describes the Android


application's essential information. It describes application features such
as the application and package name, permissions used by the application
and the minimum version of Android required to run the application.

Classes.Dex:

This le is the result of the compilation of Android Java

source code. It contains optimized Dex bytecode for the Android application and will run on the Dalvik VM.

Resources: This group contains pictures, libraries and layout les used by
the application.

Figure 8 shows the compilation process in the creation of an Android APK le.

22

Figure 8: Android APK le generation process

One of the most important elements of creating an APK le is the compilation of Java source code. The process of generating the APK le is described
in Figure 8. The les undergo a series of transformations during the process of
creating the Android APK le. These transformations comprise the compilation
process required to generate APK les that will run on Android devices.
The rst step in the process of creating an Android application is to create
an Android project, in which Java source code, Android manifest and resource
les will be generated by Eclipse or Netbeans.
The next step is to program and congure the code to suit your purposes and
to compile the project. Java's compiler in the SDK programming environment
will generate class les from Java's source code and the

aapt 5

will transform

the AndroidManifest.xml and resource les into an adequate format so that


they can be interpreted by the Dalvik VM. The generated class les cannot
be interpreted by the Dalvik VM and in order to convert these class les into
Dex les, Android SDK provides a tool called

dx.

This tool converts class les

into the Dex format. Once all the les are compiled, the aapt is tasked with
compiling and generating the Android APK le.

5 Android

Asset Packaging Tool

23

2.2

Intrusion Detection System

2.2.1 Denition
An Intrusion Detection System, also known as an ID[24], is a device or software
application which monitors a network or system for malicious activities[58].
There are many dierent types of IDS. The aim of an IDS is to identify and
detect anomalies in the system or device that is being monitored. Some classes
of IDS will be described below.

Network-Based

The Network-Based Intrusion Detection System (NIDS) is an intrusion detection


system that analyzes network trac, makes decisions about the purpose of the
trac and scans the network for suspicious activity.

-Wireless
The Wireless Intrusion Detection System (WIDS) is similar to the NIDS. Instead of analyzing wired network trac it can analyze wireless trac to detect
suspicious activity.

Host-Based

Host-Based Intrusion Detection Systems (HIDS) monitor all activity that occurs
on the host (the platform comprising the computer hardware and the operating
system) being monitored. This system is capable of monitoring features of the
system such as power consumption, opened les, system call logs, etc.
This project will use a Host-Based Intrusion Detection System to monitor
events on Android devices.

Section 3 will describe this approach in further

detail.

24

2.2.2 Detection types


As regards types of IDS detection, we can divide these into two: Signature-Based
or Misuse detection and Anomaly-Based detection.

Misuse detection

The technique of Misuse detection searches for specic indications or patterns


of attacks, identifying raw byte sequences, protocol type, port numbers, etc.
The aim of this type of detection is to nd patterns in raw data.

Signatures

are then created by a group of experts who analyze the code, behavior and
manifestation of the malware. Most antivirus companies still use this technique
to create malware signatures and patterns
One of the disadvantages of this detection type is that the system must
be familiar with all malware patterns and signatures in advance. This type of
detection limits the ability to detect new malware.
The process of nding and identifying new types of attacks and malware
manually takes experts a great deal of time. Antivirus companies are trying to
come up with dierent alternatives in order to avoid this problem through use
of automated processes. Figure 9 shows the dierences between the techniques
of Misuse detection and Anomaly detection.

Figure 9: Misuse detection versus Anomaly detection

25

Anomaly-Based detection

Anomaly-Based Intrusion Detection Systems use a prior training phase to establish a model for normal system activity. This mode of detection is rst trained
on the normal behavior of the system or application to be monitored.

Using

this model of normal behavior, it is possible to detect anomalous activities that


are occurring in the system by searching the system for strange behavior. This
technique is more complex and requires more resources than Misuse detection.
Despite this, it has the advantage of being able to detect new attacks.
Typically, Misuse detection tries to identify/classify the new object by consulting known malware or malicious behavior patterns stored in a signature
database. Unknown objects are compared with database objects, and if a match
is found between the unknown object being analyzed and the database object,
the unknown object will be considered suspicious or malware.

If there is no

match, it will be classied as unknown.


Anomaly-Based detection, on the other hand, creates a pattern of normal
behavior based on the system's model of normality. New objects will be compared with the normal behavior pattern, and if any of the objects show any
abnormal activity compared to that pattern of normal behavior, they will be
considered malicious applications.

26

2.3

System calls and Vectors

In Linux, a system call is the way in which a program requests a service from the
operating system's kernel. The Linux kernel has roughly 190 system calls, and
each system call is identied by a unique number that is found in the kernel's
system call table [27].
A system call is invoked by an application using glibc library functions.
Functions like

getpid(), open(), read()

and

socket()

are some of the functions

that glibc can provide applications with to enable them to invoke a system call.
Every time an application from user space makes a request of the OS, the
request passes through the glibc library, the system call interface, the kernel
and nally reaches the hardware. The glibc library interprets the request and
the CPU switches to kernel mode. The system call interface gets the request
from the glibc library and executes the appropriate kernel function by consulting
the system call table. The kernel must interpret the request from the system
call interface and make the request of the hardware platform. Afterwards, the
user receives the information requested by the application following the inverse
process.

Figure 10 describes the Linux user kernel space and the process by

which an application sends requests to the hardware platform.

Figure 10: Linux User and Kernel space

The Linux kernel is executed in the lowest layer of the Android architecture.
This means that all requests made from the upper layers pass through the kernel
using the system call interface before they are executed in the hardware.

27

Analyzing all of the system calls that pass through the system call interface
will give us an accurate picture of the behavior of the application.
6

of hijacking

The aim

these system calls is to create an output le containing all of

the events generated by the Android application. This le will provide useful
information, such as opened and accessed les, execution timestamps and the
number of system calls executed by the application. We will use the number
of system call executions performed by the application to represent behavior.
Section 2.5 will provide insight into this technique
This project will use the lists of system calls to create an anomaly detection system, rst creating the normality model for the Android application using clear Android applications (applications free of malicious code). As stated
above, by extracting the number of system call executions generated by the
Android application it is possible to create a behavioral vector representation
for Android applications.

These vectors will be used to create the normality

model or pattern of normal behavior for the application. Here is an example of


an Android application behavior system call vector:
0 ,0 ,0 ,25 ,47 ,4 ,34 ,0 ,0 ,0 ,0 ,0 ,0 ,12 ,0 ,0 ,0 ,0 ,
0 ,260 ,9 ,0 ,0 ,0 ,0 ,1649 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,10 ,0 ,
0 ,0 ,5 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,22 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,
3466 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,0 ,
12 ,0 ,0 ,0 ,0 ,132 ,0 ,0 ,0 ,0 ,0 ,0 ,40 ,41 ,0 ,0 ,0 ,
0 ,0 ,76 ,0 ,0 ,0 ,0 ,0 ,0 ,4 ,0 ,87 ,17 ,0 ,...
Each number separated by commas represents a system call and the number
of system call requests/executions made by the Android application during the
monitoring process. For instance, the system call

kill()

47 times.

open()

is used 25 times and

This means that the monitored application used the

open()
kill()

system call 25 times to open les or libraries from the system, and the
system call 47 times to kill processes.

The list of Android system calls is too large to show here, but the system
calls list can be found in the Android Linux kernel[9] bionic folder

or in Section

2 of the Linux kernel manual pages[26].

6 Hijacking, refers to all illegal actions


7 bionic/libc/SYSCALLS.TXT

to take over or stealing information by an attacker

28

2.4

Data Mining

Data mining is the process of extracting patterns from large data sets by combining methods from statistics and articial intelligence in order to obtain useful
information. Data mining is also considered to be the set of techniques and technologies used for exploring large databases in order to nd repetitive patterns,
trends or rules to explain the behavior of a given data set.
Figure 11 shows the sequence of the knowledge discovery process used in
databases (KDD) [46] to obtain useful information or knowledge from a raw
data set. The KDD process refers to the process of discovering useful knowledge.
Data mining refers to a particular step in the process.

Figure 11: Knowledge Discovery in Databases (KDD) process[46]

2.4.1 Data collection in KDD process


1. Selection of raw data: This is the rst phase of the KDD process.

When

we are given a raw data set, the rst step is to select information in order
to obtain relevant data. This project will use a crowdsourcing application
installed on several Android devices and an information collector script to
obtain the data set of the behavior of the Android application.
2.

Data preprocessing:

In order to avoid misleading or inappropriate rules

or patterns, it is necessary to lter out irrelevant data. Collecting inappropriate data results in poor interpretation and evaluation of the system,
will render the system unreliable and produce undesired results.
3.

Data transformation:

This will transform relevant data collected from

previous phases into a readable and organized structure. This data will
determine the outcome of the analysis and will create the data set for the
data mining algorithm.
4.

Data mining algorithm:

This process uses a data mining algorithm to

detect rules or patterns from the previously generated data set.


5.

Interpretation and evaluation:


the obtained results evaluated.

29

In this phase a report is generated and

Data mining techniques can be separated into many categories or groups,


but this report will analyze classication and clustering techniques, since these
are the most appropriate and relevant for the project.

Classication
This is a technique used in data mining to classify data into dierent elds or
groups. One of the main characteristics of this technique is that the classication
of data is based on groups or patterns that are already known. This means that
all the information on groups in the system is already dened, and new data
will be compared with these groups in order to classify the data.

Clustering
The technique of clustering involves grouping a set of physical or abstract
objects into clusters of similar objects. In data mining, a cluster is a collection
or group of data that are similar to each other.

One of the main dierences

compared to the classication method is that the clustering method uses raw
data to create the groups to be used later in order to make a decision. These are
created without any predened group. The given data set will be responsible
for creating the groups or clusters, and afterwards a decision will be made on
which cluster the data belongs to.

At the beginning there will be no cluster

or group created to which to assign the data, so the clustering algorithm will
create a random cluster in any position.
One of the easiest ways to decide to which group the data belongs is to
measure the

Euclidean

distance between the data and the formed groups. The

Euclidean distance is the result obtained by measuring the proximity of a point


to two or more cluster groups. Based on the analysis, the Euclidean distance
will cluster the data into the closest or nearest cluster.

30

2.5

K-means Clustering algorithm

Clustering is a common technique used for statistical data analysis in many


elds, including machine learning, data mining, pattern recognition, image analysis and bioinformatics[47].
This project will use an unsupervised learning or clustering technique to form
groups or cluster patterns in order to nd the hidden structure or similarities
within the data set.

Due to the lack of data sets available for the Android

platform, we decided to design an Android application behavior database from


scratch, where all the Android app behavior data will be stored.
In order to get satisfactory results in the interpretation and evaluation phase,
we must know which clustering method is the most suitable for detecting malicious applications in the Android platform, as well as which can provide the
best and the most useful information on the collected data.
This part of the document will describe two dierent categories of clustering methods:

Hierarchical methods and Non-Hierarchical or partitioning

methods[74]. Figure 12 shows the taxonomy of clustering methods.

Figure 12: Taxonomy clustering methods

Hierarchical clustering methods create a hierarchy or tree of clusters from a


given data set. The root of the tree contains all data observations in a single
cluster. The tree creates sub-clusters from the root.
Algorithms used in Hierarchical clustering methods are generally agglomerative or divisive.

Agglomerative algorithms start at the leaves of the small

clusters and merge into bigger clusters.

Divisive algorithms start at the root

cluster and recursively split the clusters into smaller ones. Figure 13 shows the
graphical representation of agglomerative and divisive methods.

31

Figure 13: Hierarchical method: Agglomerative vs Divisive

Another method of clustering is the partitioning method. This method sets

number of clusters as the objective, and the data set is split into those clusters.
The partitioning method aims to discover clusters by iteration and relocation
of points in the data set.
In unsupervised learning, the pattern classication system is based on a set
of training patterns, based on data with as yet unknown respective class labels.
This occurs when labeling of each individual sample is almost impossible. This
type of learning algorithm encompasses algorithms such as neural networks,
nearest neighbor, k-means, etc.
Bearing in mind that the objective in this project is to cluster system call
behavior vectors into two dierent clusters, i.e. Good and Malicious application
behaviors, it is appropriate to apply the partitioning method using the k-means
clustering algorithm.

32

K-means Clustering algorithm


Every Android application has its own behavior data, and this data will be
placed in one of two possible clusters: Good and Malicious behavior clusters,

k = 2.

The Good application cluster will describe the proper behavior of An-

droid applications and data clustered into the Malicious group or cluster will be
considered to be malicious or dangerous applications.
The k-means clustering algorithm[62], is a clustering method which aims to
create

clusters, given a data set of n observations.

The k-means clustering algorithm uses the following formula:

J=

k X
n
2
X

(j)
xi cj
j=1 i=1


2
(j)

(j)
xi cj is the distance measured between a data point xi and
cluster center cj . The cluster center cj indicates the distance of the n data

where
the

points from their respective cluster centers.


Table 4 shows the steps of the k-means clustering algorithm:

1. Randomly place K cluster points into the space represented


by n objects. These points will represent the initial centroids of
the clusters
2. Assign every object to the group that has the closet centroid.
3. When all objects have been assigned, recalculate the positions
of the K centroids.
4. Repeat the 2nd and 3rd steps until the centeroids stop
moving. This produces a separation of the objects into groups.
Table 4: K-means Clustering algorithm process

, P , of n observations, with a typical

We suppose that we are given a data set


entry being

pi ,

where each

We can think of each

pi

pi

D numbers.
D-dimensional space.

is a vector of

as a point in a

Every

pi

in the data set, will represent a system call vector produced by the user.

33

vector

Figure 14: K-means applied as a detection system for android system calls

The

n observations,

will be the set of system call vectors collected by mon-

itoring the Android applications, and each

(j)

xi

data point will be one such

system call vector. Applying the k-means algorithm to the Android application
vector data set will create two clusters, with the good and malicious Android
applications classied (k=2) as described below.
The speed of the algorithm and the results obtained in training and test
evaluation are the main reasons we chose to use the k-means algorithm in this
project. Another reason why we chose k-means was the simplicity of implementation in Matlab.
One of the most important tasks of the clustering algorithm is the selection of the Distance measure. This measurement will determine the cluster to
which the data belongs. The calculation of this distance may vary depending
on which mathematical formula is used in the process.

Euclidean, Manhat-

tan, Mahalanobis and Hamming distances are some of the most commonly used
functions to measure such distances.

2.6

Crowdsourcing

Je Howe dened 

Crowdsourcing  [59],

as the act of exporting tasks tradition-

ally performed by one or more employees to an indenite group of persons or a


community through an open call.
Using the crowdsourcing technique, we divided the responsibility of creating
the Android application data set between the users of the Android Community.
Considering that there are more than 8 million Android users in the world, using
this technique to collect information from many dierent Android devices is a
very appealing option.

34

Chapter 3
3

Behavior-Based malware detection system for


Android Applications

3.1

Overview

The implementation of malware detection systems in mobile devices is a fairly


a new concept that is gaining a lot of attention.

Applying the security tools

and mechanisms used in computers to smartphones is not a feasible choice due


to excessive resource and energy consumption. Because of this, we decided to
perform the entire analysis process on a dedicated remote server. This server
will be dedicated exclusively to detecting malicious and suspicious applications
on the Android platform.
Figure 15 describes the general scheme of the behavior-based malware detection system for Android applications.

Figure 15: Android malware detection system scheme

35

As the Android market is an open-market system, users can download their


applications from sources other than the Android ocial market. As a result,
many users end up making heavy use of non-ocial Android repositories where
a lack of supervision and control can result in their downloading third party
applications that may contain malicious code. The aim of the server is to perform dynamic analysis of Android applications to detect anomalies which may
be dangerous for the user.
Using information collector applications such as crowdsourcing and the data
collector script, we can obtain the necessary information from Android applications and perform malware analysis on the system.
Using the crowdsourcing application installed on Android devices, community users will have a chance to contribute to the project by sending recorded
log les of the behavior of Android applications to our malware detection server.
All collected log data les result from use of the Strace Linux tool with
8

Android applications . This tool is assumed to be installed on each user device.


Strace will collect information on the system calls executed by the application.
Monitored system call logs and device information les will be stored in the SD
Card memory and will be sent to the malware detection system using an FTP
client in the crowdsourcing application.

The FTP Server will be responsible

for collecting the information sent by the crowdsourcing application and an


information collector script.

The data collector script will process and parse

the data collected from Android users' applications and create the system calls
vectors. Afterwards, Matlab and the k-means clustering algorithm will use these
system call vectors to detect anomalies in the applications.

8 Strace

tool output le (*.out)


36

3.2

Android Data mining: Crowdsourcing and Self-written


applications

In order to collect Android application data, we will use two data collector
applications. The rst one is a crowdsourcing application developed for Android
devices and the second one is a script running on the Android Emulator.
The rst attempt we made to collect data was carried out by a script using
the thirty most downloaded applications from the Android market in 2010. The
purpose of the script was to monitor Android emulator activity and generate
reports based on the analysis.
The second data mining trial was carried out by the crowdsourcing application for Android devices. The aim of the application was the same as that of
the previous script, but this time the Android user community was used.
Both applications were able to collect essential information from Android
Devices, such as installed applications, device information and most importantly
the system call log les.

See Figure16.

The system call log les contain the

system call sequence generated by Android applications.

Parsing these data

points with a script will produce the system call vectors that will be used in the
Android malware detection system.

Figure 16: Data acquisition process

The aim of the crowdsourcing and data collection script is to collect as much
information as possible from the Android devices and applications.

37

3.2.1 Android Data collector script


As described above, in the rst data mining trial we carried out the data mining
process used a script to collect information from Android applications.
The purpose of the script was to:

Use Android APK applications for training or testing the system.

Install/Uninstall applications on the emulator or real Android device.

Collect Linux system calls using the Linux tool Strace.

Parse the collected data to create system call vectors, device information
les and a list of other actions performed by Android applications, such
us opened les or accessed directories, execution timestamp, etc.

Compile the report for the analyzed applications.

The data collector script is written in Perl. This gives us the opportunity to run
the script on several operating systems without changing it in any way. Figure
17 shows the User Interface (UI) of the script.

Figure 17: Data collector script user interface

Figure 18 describes the data collector script in greater detail.

38

The data collector script allows us to choose between installing applications


on the Android emulator or the real device. Training Data and Test Data folders
contain Good and Malicious Android applications. In order to create the good
behavior pattern for Android applications, we will use applications from the
Training Data folder as a training phase.
The script will install applications from the training data folder and users
will start to interact with the installed application. The script will start monitoring and recording all system calls executed by an application. Afterwards,
the script will remove the application from the device and create a new, clean instance of the system or emulator. This procedure ensures that every monitored
application has the same initial system condition and conguration. Applications in the Test Data folder will undergo the same procedure as the training
data applications.
Finally, the script will create a folder with all monitored/recorded applications.

Steps 4, 5 and 6 on the UI, Figure 17, will obtain the Android device

information le and installed application le and create the system calls vector
le.

Figure 18: Data collector script process

39

The script was designed to automate most of the data mining process and
interaction within the system. At rst we decided to use a pseudo-random action
event tool called ADB Monkey[2] for interacting with and collecting information
from Android applications. Taking into account the fact that there are more
than 250,000 applications available in the Android Market, it was natural to
conclude that we needed to use an automatic process to record and interact
with the applications. After several attempts, we realized that ADB Monkey was
generating awed pseudo-random events in Android applications. Considering
this, data generated by this application was unsuitable for processing and for
using with the system if we intended to have good results.
Our next approach was to teach ADB Monkey to behave and interact with
Android applications in the same way as humans. We realized, however, that
this technique required articial intelligence knowledge and generated too much
work with processing data, so we decided to use a normal user to create the data.
The complexity of writing a program to behave like a human was the main reason
we decided to use a normal user for data creation. Even so, we found a small
disadvantage associated with use of this technique, i.e. that a single user has to
create the data set for more than 250,000 Android applications. Spending just
5 minutes per application on monitoring and recording application system calls
and the Android device information would require the user to spend almost two
years collecting all of the information for the Android market apps.
We realized that even if we decided to use this technique for the most important 30 applications available on the Android market in January 2011, testing 30
applications would not be sucient to determine and create a Malware pattern
for Android applications.
This brings us to the need for a crowdsourcing approach.

40

3.2.2 Android Crowdsourcing and data mining application


The next solution is based on using Android community users to collect data
through a lightweight application installed on their Android devices.
Je Howe dened 

Crowdsourcing 

as the act of exporting tasks tradition-

ally performed by one or more employees to an indenite group of persons or


community through an open call[59].
Using the crowdsourcing technique we shared the responsibility of creating
the Android Application Data set among the Android Community users. Considering that there are more than 8 million Android users in the world it is a
very attractive opportunity to use this technique.
The crowdsourcing application is an Android application written in Java
for the Android OS platform.

The Android SDK and the Java programming

environment will provide the tools necessary to compile the Java source code
and generate the APK le that will run on the devices.
The crowdsourcing application has the same features as the data collector
script mentioned in Section 3.2.1, but includes an FTP client to send collected
les to the Android malware detection system. Android Community users only
need to download the application and let it run in the background in order for
it to start monitoring and collecting information from the applications running
on the device.

Figure 19: Android Crowdsourcing application

The user interface(UI) of the crowdsourcing application Figure 19, contains


two buttons, Start and Stop. If the user presses the Start button a monitoring
service will start running in the background, and the application will stop when
the user presses the Stop button. Android users can also start the application
and let it run in the background as a system service.

The user can interact

with other applications while the application runs as a background process and
collects data.

Files recorded by the crowdsourcing application will be stored

in the SD Card memory and will later be sent as data to the behavior-based
Android malware detection system server via FTP.

41

3.3

Behavior-Based malware detection system

3.3.1 Design of the Behavior-Based malware detection system


The behavior-based malware detection system is composed of several applications, which together provide the resources and mechanisms needed to detect
malware on the Android platform. Each program has its own specic functionality and purpose in the system and the combination of all of them creates the
Behavior-Based malware detection system.

The Android data mining scripts

and applications mentioned in Section 3.2 are the responsible for collecting data
from Android applications, and the script running on the server will be the responsible for parsing and storing all collected data. Furthermore, the script will
be responsible for creating the system call vectors for the k-means clustering
algorithm.

Figure 20: Static and Dynamic Analysis

The methods of analysis of the behavior-based malware detection system


developed in this project can be divided into two main groups: Static Analysis
and Dynamic Analysis.
Static Analysis is responsible for analyzing Android source code les in order to nd malicious code patterns or signatures.

This form of analysis will

decompress, disassemble and search for patterns in the APK les. The method
is fast and does not generate a high processing load.
Dynamic Analysis analyzes the behavior of Android applications by monitoring system calls with the Strace tool.

All input traces generated by the

Android smartphone user will be collected using the data collector application
described in section 3.2 as well as the crowdsourcing and data collector script.
In Dynamic Analysis the user will install, execute and generate input data for
the Android applications in order to obtain an application behavior output log
le.
Table 5 shows the advantages and disadvantages of Static and Dynamic
analysis.

42

Static analysis

Advantages

Disadvantages

Cheap and Fast.

Have to know Malware patterns

Not very resource

or signatures in advance

consuming
Dynamic Analysis

Detection of

Highly resource consuming, not

unknown attacks

feasible for battery devices

Table 5: Static and Dynamic Malware analysis advantages and Disadvantages

43

Figure 21 describes the complete process of Android malware detection carried out by the system.

Figure 21: Android Malware Detection process

The malware detection process is divided into three main activities:

Data acquisition:

This activity allows application data to be obtained

from users via crowdsourcing or data collector script.

Data processing manipulation:

This activity consists of managing

and parsing all of the information collected from Android users. The data
analyzer scripts will collect, extract and analyze all of the parameters from
the strace output les (from the applications tested).

One of the most

important pieces of data that can be obtained from the strace output le
is the number of system calls executed by an Android application. Another
feature that can be extracted from the output le are the les and libraries
used during the monitoring process.

Malware analysis and detection:

This activity consists of analyzing

and clustering the vectors obtained in the previous phase in order to create the normality model and subsequently be able to detect anomalous
behavior of Android applications. Matlab will be responsible for clustering the dierent vectors into dierent groups using the k-means algorithm.
This algorithm will create two clusters, a normality model and a malicious
behavior or anomaly model. See Figure 9. All good application vectors
will be clustered into the normality model, and malicious behavior vectors
into the malicious behavior model cluster.

44

The following example shows ve vectors created by an Android application.


This example is just a proof-of-concept to illustrate how the system works.

A= [ 4 , 5 , 6 , 7 , 8 ] ;
B= [ 4 , 5 , 6 , 6 , 8 ] ;
C= [ 1 , 2 , 3 , 9 , 9 ] ;
D= [ 4 , 5 , 6 , 7 , 7 ] ;
E= [ 1 , 3 , 3 , 9 , 8 ] ;

%Good
%Good
%Malware
%Good
%Malware

Each vector represents an interaction with an Android application installed


on the emulator or the real device. Numbers separated by commas represent
the number of times that a system call has been executed. For instance, in the
A interaction the rst system call was executed four times, the second one ve
times, the third one six times and so on.

PROGRAM CODE

clear a l l ;

load ( ' A p p l i c a t i o n _ f i l e _ V e c t o r . t x t ' ) ;

vectors_variable =

vectors_distance = pdist ( vectors_variable , ' Euclidean ' ) ;


matrix_vectors

COMMENTS

= SQUAREFORM( v e c t o r _ d i s t a n c e ) ;

max( m a t r i x _ v e c t o r s ( : ) ) ;

max_value

clusters

= kmeans ( m a t r i x _ v e c t o r s , 2 ) ;

% Clear a l l v a r i a b l e s in t h e system
% Loads t o v e c t o r _ v a r i a b l e
5

vectors

from

. txt

file

% p d i s t f u n c t i o n computes t h e
Euclidean
pairs
data

of

distance
objects

between

i n mbyn

matrix X

% Makes t h e comparison between


vectors

and

puts

in

matrix

% O p t i o n a l . Gets t h e maximum
value

of

the

format

matrix

% kmeans a l g o r i t h m c r e a t e s
two
By

clusters
default

Squared

from

kmeans

Euclidean

input

value .

uses
distance .

Table 6: Matlab Clustering code for Android Malware Detection

45

We used several functions provided by Matlab to perform the analysis of


Android applications. The pdist function, used in the malware vector clustering
Matlab code, Table 6, contains many ways to measure the distances between
the vectors. The pdist function includes several distance metrics, such as the
Euclidean, Semi-Euclidean, City-block, Minkowski, Chebyshev, Mahalanobis,
Spearman, Hamming and Jaccard metrics.
To determine which metric was the best suited for our purposes we performed
several tests using dierent distance metrics on the previous vector example
code. Knowing which vectors are good and which ones are malicious, it is easy
to select the metric that will hopefully produce the best results in the malware
detection system. Vectors

A, B, D

belongs to Cluster 1 (Good)and vectors

C, E

to Cluster 2 (Malicious).

Euclidean

Seuclidean

City-Block

Minkowski

Cosine

Mahalanobi

Spearman

Hamming

Jaccard

Result

Table 7: Clustering algorithm metrics

Out of all the tested metrics, Euclidean, Semi-Euclidean and Hamming


showed the best results, as shown by Table 7.

46

Table 8 shows the similarities between vectors after applying the pdist function with the Euclidean distance metric and squareform function.

The pdist

function computes the Euclidean distance between all vectors and the squareform function transforms the pdist result into matrix form. This table shows
only the Euclidean distance results, but similar results were obtained using the
Semi-Euclidean and Hamming distance metrics.

A comparison of the system

call vectors of an Android application is the result. Vectors close to 0 are similar
or equal vectors, and those vectors far from 0 are dissimilar vectors.

5.6569

2.2361

5.0990

6.0828

2.4495

5.5678

5.6569

6.0828

7.1414

1.4142

2.2361

2.4495

7.1414

6.2450

5.0990

5.5678

1.4142

6.2450

Table 8: Vector comparison matrix

The objective of the project is to distinguish between good and malicious


use of Android applications using the system call vectors generated by Android
applications. It is known in advance that vectors

C, E

A, B, D are benign and vectors

are malicious.

Table 9 shows the cluster results obtained using the k-means clustering algorithm with the Euclidean distance metric on the results obtained from the
squareform function.

Cluster

Table 9: Example vector clustering results

According to the previous table, interactions


model and interactions

C, E

A, B, D belong to the normality

belong to the malicious behavior or anomaly model.

In conclusion, the system was able to distinguish malicious vectors from


normal ones, showing that using the k-means clustering algorithms with the
Euclidean distance metric is an accurate technique for malware detection. Other
test carried out using Semi-Euclidean and Hamming distance metrics showed
very similar results to Euclidean distance and hence we decided not to include
them in the report.

47

Chapter 4
4

Results and Evaluation

This chapter is divided in 3 dierent sections. Section 4.1 describes the data set
used in the project. Section 4.2 shows the devices and applications used in the
system. A complete analysis of created and real Malware is described in Section
4.3.
Our framework has been tested through analysis of the data collected on the
central server, with two types of data sets: data from articial malware created
for test purposes, Table17, and data from real malware found in the wild, Table
22.

The method is shown to be an eective means of isolating malware and

alerting users of downloaded malware, highlighting its potential for helping to


stop the spread of detected malware to a larger community.

4.1

Data Set

The data set used in this project is that collected by several data collector
applications, as described in Section 3.2.

This data, described in Figure 16,

contains device info, installed applications info and the system call vector log
les, and will be used as the data set or input data in the behavior-based malware
detection system.

4.2

Devices and Programs

Tables 10 and 11 describe the tools and applications used during implementation
of the project.

Devices

Description

Android G1

First mobile phone with Android OS, version 1.6. It was used to
run Self Written Malware and Android applications.

Samsung Galaxy S

One of the latest mobile phone, version 2.2. It was used to run
self written Malware and Android applications.

Table 10: Test Devices

48

Program

Description

Ubuntu OS

Ubuntu is a Debian-based Linux distribution operating system.

Matlab

Matlab is a mathematical software used for manipulation of

It was used as the main (OS) in this project.

matrices, representation of data and functions, implementation


of algorithms and vector analysis. We used Matlab as a system
call vector analyzer and clustering method, in order to cluster
Good and Malicious Android applications.
Eclipse

Eclipse is a platform for programming, development, and


Compilation of Java, C++ and many other programming
languages. We used Eclipse integrated with the Android SDK to
develop Android application and self written Malware in Java.

Android emulator

The Android SDK includes a virtual mobile device that can run
on the computer. The emulator allows us to develop and test
Android applications without using a physical device. It was
used to run self written Malware and Android applications.

vsftpd

Very Secure FTP Daemon is a FTP server for the Linux OS. We
used vsftpd to collect Android applications system call log les
for the dierent applications, as sent in by the users.

Perl scripts

Perl is a high-level, general-purpose, interpreted, dynamic


programming language, useful for data manipulation. It was
used to create an automatic Android data mining script on
Ubuntu using an Android emulator. The script for system call
vector generator, Device info collector, etc was made with Perl.

Android Crowdsourcing app

Crowdsourcing is the act of outsourcing tasks to an undened,


large group of people or a community. We developed an Android
application to collect information about the applications from
user's devices. The application info contains system calls logs,
System device info, opened les, ...

Postgre SQL Database

Postgre SQL is an open relational database management system.


We designed an ERM architecture to store Android devices and
applications info.

Table 11: Programs used in the project

49

4.3

Malware detection system Results

This section is divided into two dierent subsections. Subsection 4.3.1 describes
the evaluation process and the results obtained with our own self-written Android malware. Next, the Steamy Window malware is analyzed in subsection
4.3.2.

4.3.1 Self-written Malware


Due to the fast removal of infected applications from applications markets, nding real malware is a dicult task. Antivirus companies can provide these applications, but access to antivirus company databases is often restricted.
Due to these limitations and restricted access to these databases, we decided to create our Android programs and corresponding malware as a proofofconcept in order to test the behavior-based malware detection system until new
real malware is released for the Android platform. These programs will simulate programs available in dierent Android markets. On one hand, the benign
Android applications will simulate applications available on the ocial Android
market, and on the other hand the equivalent applications containing malicious
code will simulate non-ocial repository Android applications.
Using this technique, it is easy to establish the normality model for Android
applications. Vectors collected from good and malicious applications will form
the data set for the k-means clustering algorithm. Afterwards, the clustering
algorithm will determine if the vector belongs to the normality model cluster
or the malicious model cluster. Every version of an application has a normality
and a malicious model.
In order to create the application normality model we will use the following
three applications:

Calculator_G

Countdown_G

MoneyConverter_G

The good application pattern obtained will be compared against the incoming
data in order to decide if it belongs in the normality model or not.
All developed Android applications were tested using the Android emulator
and the Android mobile phone terminal Samsung Galaxy S, Table 10. All of
these applications have been tested under equal conditions for a xed period of
time (ve minutes), with dierent user interactions. The following pages will
describe some of the results obtained for our malware with the behavior-based
Android malware detection system. We will also provide some information on
the data les collected by the data collector script and crowdsourcing applications.

50

Android Device Information


Table 12 shows the Android device information collected by the crowdsourcing
application.

This application will collect Android device information where

Android applications are running in order to understand the behavior of the


applications on dierent devices.

DEVICE INFO

ANDROID NAME

: FROYO

ANDROID VERSION

2.2

IMEI

354795046233372

BOARD

GTI 9 0 0 0

BOARDLOADER

unknown

BRAND

samsung

CPU_ABI

a r m e a b i v 7 a

CPU_ABI2

armeabi

DEVICE

GTI 9 0 0 0

DISPLAY

: FROYO

FINGERPRINT

samsung /GTI 9 0 0 0 /GTI 9 0 0 0 /GTI 9 0 0 0 :

HARDWARE

smdkc110
SES 6 0 8

2 . 2 /FROYO/XWJPA: u s e r / r e l e a s e
HOST

MANUFACTURER

samsung

MODEL

GTI 9 0 0 0

PRODUCT

GTI 9 0 0 0

RADIO

GTI 9 0 0 0

TAGS

release

TYPE

user

USER

root

k e y s

k e y s

Table 12: Crowdsourcing application result - Android Device Information

51

Installed android applications on the device


Table 13 shows the installed applications list for the device, collected by the
crowdsourcing application.

INSTALLED
V e r s i o n C o d e p a c k a g e : 1 5
Version

Installed
Process

PACKAGES INFO

0.1.5
Application

Name

PERMISSION

SharkReader

l v . n3o . s h a r k r e a d e r

a n d r o i d . p e r m i s s i o n . INTERNET
a n d r o i d . p e r m i s s i o n .ACCESS_NETWORK_STATE
a n d r o i d . p e r m i s s i o n . GET_TASKS
a n d r o i d . p e r m i s s i o n .READ_PHONE_STATE

V e r s i o n C o d e p a c k a g e : 8
Version

Installed
Process

2.2.1
Application

Name

PERMISSION

Network

Location

com . g o o g l e . a n d r o i d . l o c a t i o n

a n d r o i d . p e r m i s s i o n .RECEIVE_BOOT_COMPLETED
a n d r o i d . p e r m i s s i o n . INSTALL_LOCATION_PROVIDER
a n d r o i d . p e r m i s s i o n . ACCESS_WIFI_STATE
a n d r o i d . p e r m i s s i o n . CHANGE_WIFI_STATE
a n d r o i d . p e r m i s s i o n .READ_PHONE_STATE
a n d r o i d . p e r m i s s i o n . ACCESS_COARSE_LOCATION
a n d r o i d . p e r m i s s i o n . INTERNET
a n d r o i d . p e r m i s s i o n . WRITE_SECURE_SETTINGS

V e r s i o n C o d e p a c k a g e : 1
Version

1.0

Application
Process

Name

PERMISSION

Installed

Camera
:

Firmware

com . s e c . a n d r o i d . app . c a m e r a f i r m w a r e

a n d r o i d . p e r m i s s i o n . WRITE_SETTINGS
a n d r o i d . p e r m i s s i o n . VIBRATE
a n d r o i d . p e r m i s s i o n .READ_PHONE_STATE
a n d r o i d . p e r m i s s i o n .MODIFY_PHONE_STATE
a n d r o i d . p e r m i s s i o n .CAMERA
a n d r o i d . p e r m i s s i o n . ACCESS_FINE_LOCATION
a n d r o i d . p e r m i s s i o n .WAKE_LOCK a n d r o i d . p e r m i s s i o n .SET_WALLPAPER

Table 13: Crowdsourcing application result - Installed applications

52

Self written Malware results


Some results obtained from the data collector using Self Written applications
applications are shown in Table 16.
CALCULATOR_G REPORT -Calculator Malware free Application

ANDROID_APPLICATION_REPORT

Autor : I k e r Burguera H i d a l g o
Date : Tue Feb 22 1 5 : 4 7 : 2 2 2 0 1 1
E m a i l : i k e r b u r g u e r a ( a t ) g m a i l ( d o t ) com
A p p l i c a t i o n _ N a m e : STRACEcom . mu . r t s l a b . i k e r . c a l c u l a t o r G . apk . o u t _ R e p o r t . t x t

s y s t e m c a l l STATISTIC

system

call

Name

Number

of

Executions

fork
read
write
open
close
time
lseek
getpid
ptrace
access
kill
brk
setgid
ioctl
gettimeofday
writev
mmap2
vfork

2
202
266
235
243
6712
90
4737
7944
84
66
173
1
15930
84
191
3
1

REPORT OF USED FILES

OPEN FILES

File
File

: / s y s t e m / u s r / k e y c h a r s / q w e r t y . kcm . b i n
: / proc /922/ cmdline

Table 14: Self Written Application report - Calculator Good Application

53

CALCULATOR_B REPORT - Calculator application, malicious code attached.

ANDROID_APPLICATION_REPORT

Autor : I k e r Burguera H i d a l g o
Date : Tue Feb 22 1 5 : 4 8 : 3 8 2 0 1 1
E m a i l : i k e r b u r g u e r a ( a t ) g m a i l ( d o t ) com
A p p l i c a t i o n _ N a m e : STRACEcom . mu . r t s l a b . i k e r . c a l c u l a t o r B . apk . o u t _ R e p o r t . t x t

s y s t e m c a l l STATISTIC

system

call

Name Number

of

Executions

fork
2
read
235
write
696
open
807
close
812
time
7194
lseek
101
getpid
5457
setuid
1
ptrace
9354
access
179
kill
73
dup
2
times
1
brk
188
setgid
1
signal
1
ioctl
18792
gettimeofday
184
mmap
455
munmap
499
getpriority
113
stat
134
fstat
133
recv
17901
mprotect
514
sigprocmask
1236
msgget
109901
syscall
6
writev
445
mmap2
455
vfork
1

REPORT OF USED FILES

54


OPEN FILES

File
File
File
File
File
File
File
File
File
File
File
File
File
File
File

:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/
:/

proc /300/ cmdline


proc /300/ cmdline
system / u s r / s h a r e / z o n e i n f o / z o n e i n f o . dat
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.8663092891957762 2011 2 20 10 54 25 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.0878473753052298 2011 2 20 10 54 27 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.6006282967641784 2011 2 20 10 54 29 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.8340689635440677 2011 2 20 10 54 30 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.1437738552877451 2011 2 20 10 54 31 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.7376069611353528 2011 2 20 10 54 33 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.4984244802612797 2011 2 20 10 54 44 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.5530206720597484 2011 2 20 10 54 46 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.2674243132841187 2011 2 20 10 54 47 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.2960847053244705 2011 2 20 10 54 49 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.2947512088951718 2011 2 20 10 54 51 . t x t
s d c a r d / C a l c u l a t o r _ B / T r a s h I n f o 0.5186420445867813 2011 2 20 10 54 56 . t x t

ACCESS FILES

File
File
File
File
File
File
File
File

:
:
:
:
:
:
:
:

/mnt/ s d c a r d / C a l c u l a t o r _ B
/ system / u s r / s h a r e / z o n e i n f o / Europe / Stockholm
/mnt/ s d c a r d / C a l c u l a t o r _ B
/mnt/ s d c a r d / C a l c u l a t o r _ B
/mnt/ s d c a r d / C a l c u l a t o r _ B
/mnt/ s d c a r d / C a l c u l a t o r _ B
/mnt/ s d c a r d / C a l c u l a t o r _ B
/mnt/ s d c a r d / C a l c u l a t o r _ B

Table 15: Self Written Application report - Calculator Malicious Application

55

56

MoneyConverter_B

CountDown_B

Calculator_B

MoneyConverter_G

CountDown_G

Calculator_G

Name

Malware

Malware

Malware

Normal

Normal

Normal

Type

memory

particular

all your contact information and sends to a server.

stores in
SDCard

attached. When Swedish->Euro button is pressed starts


running the GPS service in the background and writes your
location in SDCard.

Table 16: Self written android applications description

Get GPS
position and

Similar behavior as Moneyconverter_G but with malicious code

server

Send user
contacts to a

Similar behavior as CountDown_G but with malicious code


attached. Every time you press reset button, the application gets

programs write useless information in a text le.

Fill SDCard

Similar behavior as Calculator_G but with malicious code


attached. If the result of the operation is higher than 100, the

Money
conversion

Swedish KR and viceversa.

second.

Given an input number, converts the value from Euros to

Second
countdown

Given an input number, counts downs until the value is 0 every

Number
Calculation

rest, Multiple and Divide.

Objective

Given two numbers, can make trivial operations like the Sum,

Description

Picture

Self Written Android application Malware report


Table 17 shows the results of using the behavior-based Android malware
detection system on the Android malware and apps we created.
In order to test the system we performed 60 interactions for each type of
application. At the end of the process we had 60 interactions, with the good
application of created 50 good traces, and 10 malicious application traces.

Interactions

Good

Malware

Clustering

Detection

result

rate

Good

Malware

Clustered

Clustered

Calculator

50

10

50

10

100%

Countdown

50

10

50

10

100%

MoneyConverter

50

10

50

10

100%

Table 17: Self written Android Malware result

As detailed above, we developed three dierent Android applications with


corresponding malware. Every application was executed 60 times, with 50 interactions performed with the good Android app and another 10 with the malicious
Android application. These 50 interactions will represent the normality model
of the application.
Another data collector script will collect all generated output les from every
interaction and will create three vector les:

Calculator_Vector.txt

Countdown_Vector.txt

MoneyConverter_Vector.txt

Each le will contain 60 system call interaction vectors, including good and bad
application interaction vectors.
The next step was to test the system using real Android Malware applications.

57

4.3.2 Real Malware


In the previous chapter we performed an analysis of self-written Android apps
developed by us as a proof-of-concept to ensure that the behavior-based Android
malware detection system was working properly and could detect malicious
Android applications.
We understood that detecting our Android malware was not as interesting
as detecting real Android malware. Therefore, we decided to look at dierent
Android markets and repositories in order to nd real malicious applications.
Our rst approach was to contact several antivirus companies, such as Hispasec
and Panda Antivirus, in order to obtain real malicious Android applications.
Hispasec, a Spanish security company, was very interested in the project and
decided to share some real Android malware with us. It also provided us with
access to the VirusTotal service and to its malware database.
As long as antivirus companies can provide us with real malware and we can
nd the original applications on the Android market, we can test the behaviorbased Android malware detection system.

Steamy Window
We performed several tests using the only Android malware that we had at
the time, Steamy Window.

Figure 22: Steamy Window application

Steamy Window, shown in Figure 22, was the rst Malware to be tested in
the system. Steamy Window, is a harmless application that can be found in the
Android ocial market for free. However, the same application can be found on
non-ocial Android repositories with malicious code attached. The rst step
was to perform Dynamic Analysis.

58

Dynamic analysis on Steamy Window


We installed the Steamy Window application in the Android emulator and
recorded the performance and user interactions of the applications using the
data collector script and the crowdsourcing application.
Six interactions were performed in total with the malicious and non-malicious
Steamy Window application. These vectors were collected using the crowdsourcing application installed in six dierent devices with six dierent users. Some of
the users installed the Android ocial Steamy Window application, and others
downloaded the application from the non-ocial or unocial Android market.
Every interaction with the application represents a unique system call vector.
This vector will be analyzed by the behavior-based Android malware detection
system.

Figure 23: Interaction with Steamy window application

Interaction_A= 0,0,0,3,7,7,7,0,0,1,1,0,0,11,0,1,0,0,0,3,438,0,0,0,0,0,2405,0,0,0,0,0,0,
0,5,0,0,0,1,1,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,5164,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,12,7,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,8,1,3,4065,0,0,0,0,0,
0,0,0,0,0,0,2,0,0,0,0,2,2,0,0,0,0,0,14011,0,0,0,0,0,648,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,12,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Interaction_B=0,0,0,34,43,45,87,0,0,5,5,0,0,47,0,5,0,0,0,31,2695,0,0,0,4,0,8468,0,0,0,
0,0,0,0,22,0,0,0,5,5,0,0,27,0,0,0,46,0,0,0,0,0,0,0,0,20324,48,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,16,0,0,0,0,0,0,0,0,0,0,0,132,88,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,60,5,27,
13717,0,0,0,0,0,0,0,0,0,0,0,16,0,0,0,0,68,262,0,0,0,0,0,49976,0,0,0,0,0,2328,0,0,0,0,0,
0,0,0,38,0,0,0,0,0,0,132,0,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,
Interaction_C=0,0,0,19,12,28,29,0,0,1,1,0,0,22,0,1,0,0,0,19,1718,0,0,0,0,0,6632,0,0,0,
0,0,0,0,11,0,0,0,3,1,0,0,4,0,0,0,8,0,0,0,0,0,0,0,0,15089,36,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,41,21,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,24,1,19,10580,
0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,27,15,0,0,0,0,0,37324,0,0,0,0,0,1855,0,0,0,0,0,0,0,0,11,0,
0,0,0,0,0,41,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Interaction_D=0,0,0,16,12,27,28,0,0,1,1,0,0,19,0,1,0,0,0,16,1214,0,0,0,0,0,5663,0,0,0,0,
0,0,0,8,0,0,0,2,1,0,0,4,0,0,0,7,0,0,0,0,0,0,0,0,12376,24,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,40,20,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,21,1,16,8597,0,0,0,0,
0,0,0,0,0,0,0,6,0,0,0,0,27,15,0,0,0,0,0,29712,0,0,0,0,0,1549,0,0,0,0,0,0,0,0,11,0,0,0,0,0,0,
40,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

59

Interaction_E=0,0,0,48,73,67,139,0,0,8,8,0,0,56,0,8,0,0,0,38,2964,0,0,0,8,0,8803,0,0,0,0,
0,0,0,28,0,0,0,6,8,0,0,45,0,0,0,78,0,0,0,0,0,0,0,0,20937,48,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,24,0,0,0,0,0,0,0,0,0,0,0,210,151,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,93,8,37,14230,
0,0,0,0,0,0,0,0,0,0,0,21,0,0,0,0,108,501,0,0,0,0,0,52168,0,0,0,0,0,2328,0,0,0,0,0,0,0,0,65,
0,0,0,0,0,0,210,0,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
Interaction_F=0,0,0,22,13,29,30,0,0,1,1,0,0,32,0,1,0,0,0,22,2512,0,0,0,0,0,8253,0,0,0,0,0,
0,0,14,0,0,0,4,1,0,0,4,0,0,0,12,0,0,0,0,0,0,0,0,19940,48,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,44,22,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,27,1,22,13363,0,0,0,0,
0,0,0,0,0,0,0,7,0,0,0,0,28,15,0,0,0,0,0,48565,0,0,0,0,0,2328,0,0,0,0,0,0,0,0,12,0,0,0,0,0,0,
44,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

In order to detect malicious behavior in those interactions, we applied the


Matlab program described in Table 6.

60

Table 18 shows the similarities between the Steamy Window system call
vectors after applying the pdist function with Euclidean distance as the metric.
The pdist function computes the Euclidean distance between all vectors, and
the squareform function transforms the result of pdist into matrix form.

Interaction

0.1818

0.1414

0.1414

0.1818

0.1414

0.1818

0.1768

0.1768

0.1616

0.1667

0.1414

0.1768

0.1010

0.1818

0.1212

0.1414

0.1768

0.1010

0.1818

0.1212

0.1818

0.1616

0.1818

0.1818

0.1717

0.1414

0.1667

0.1212

0.1212

0.1717

Table 18: Steamy Window system call vectors comparison matrix table

0, are equal or similar vectors. Vectors whith


0 are dissimilar vectors. For instance, vectors F and C are very
similar vector, with a similarity 0.1212 out of a maximum 0.1818. Futhermore,
we can see that the distance between F and E (0.1717) is greater than that
from F to C (0.1212). This means that the clustering algorithm considers F
and E dissimilar vectors and F and C similar or equal vectors.
Vectors with a result close to

a value far from

The last step was to cluster the previous results into two dierent clusters.
In order to do that we used the k-means clustering algorithm, dening two
clusters,

k = 2,

and using the Euclidean distance metric. Compared to others,

this metric gave us the best outcome in the analysis and testing when detecting
dierent vectors.

Interaction

Cluster

Application
Table 19: Steamy window clustering result

The nal outcome was a vector with the results of the k-means clustering
algorithm, see Table 19.
vectors,

and

E.

The system can identify two malicious system call

Thus we can we can conrm that the behavior-based Android

malware detection system can detect interactions performed by the malicious


Steamy Window application.

61

Another way of representing the results obtained using the behavior-based


Android malware detection system is with bar graphs. These graphs, see Figure
24, depict the executed system call vectors of Android apps. .
As stated in Section 2.5, we have

pi

n observations in a data set P , with several


pi is a point in a D-dimensional

system call vectors. We can assume that each

space. Since it is not possible to graphically represent more than three vectors
in a

D-dimensional

space, we used bar graphs.

The blue bars represent the normal behavior of the Steamy Window application and the red bars represent the behavior of the malicious version of the
Steamy Window application.
Every system call has its own number and the

axis, represents the number

of the executed system call or the count of executed system call. The

axis

shows the number of times that the system call has been executed.
Upon studying Figure 24, we can note some distinct dierences between
good and malicious interactions.

Given that the blue bars represent the nor-

mal behavior of the Steamy Window application, we can clearly see that the
malicious version of the Steamy Window application is executing additional system calls;

open(), read(), access(), kill(), chmod()

and

chown()

system calls are

some of these. Taking into account that both applications have the same version
number, we can assume that the Steamy Window application downloaded from
non-ocial Android repositories, interactions
Android application.

62

and

E,

is a suspicious/harmful

Steamy Window Android market Report


The following pages will show some of the reports generated by the data
collector script when run on the system call log les collected from good and
malicious versions of the Steamy Window application.

The les contain in-

formation such as executed system calls, count of system call executions and
opened and accessed les.

ANDROID_APPLICATION_REPORT

Autor : I k e r Burguera H i d a l g o
Date : Thu Mar 3 1 6 : 3 5 : 5 7 2 0 1 1
E m a i l : i k e r b u r g u e r a ( a t ) g m a i l ( d o t ) com
A p p l i c a t i o n _ N a m e : STRACEcom . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m . apk . o u t _ R e p o r t . t x t

s y s t e m c a l l STATISTIC

system

call

Name

Number

of

Executions

read
write
open
close
link
unlink
time
chmod
lseek
getpid
getuid
ptrace
access
sync
kill
rename
mkdir
ioctl
fcntl
gettimeofday
mmap
munmap
fstat
recv
mprotect
sigprocmask
msgget
syscall
writev
mmap2
sched_yield

1219
263
1192
1311
21
21
81
12
280
19762
15
49675
112
48
25
9
1
123934
489
29
406
270
1188
68683
2319
1128
475966
7277
106
406
11

REPORT OF USED FILES

OPEN FILES

File

: / d e v /ashmem

ACCESS FILES

File :
File :
File :

/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / g o o g l e _ a n a l y t i c s . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / g o o g l e _ a n a l y t i c s . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / g o o g l e _ a n a l y t i c s . db j o u r n a l

63

64
Figure 24: Steamy Window Interactions bar plot

Steamy Window Android non-ocial repository application Report

ANDROID_APPLICATION_REPORT

Autor : I k e r Burguera H i d a l g o
Date : Thu Mar 3 1 6 : 2 1 : 0 6 2 0 1 1
E m a i l : i k e r b u r g u e r a ( a t ) g m a i l ( d o t ) com
A p p l i c a t i o n _ N a m e : STRACEcom . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m . apk 3 4 8 . o u t _ R e p o r t . t x t

s y s t e m c a l l STATISTIC

system

call

Name

Number

of

Executions

read
write
open
close
link
unlink
time
chmod
lseek
getpid
getuid
ptrace
access
sync
kill
rename
mkdir
dup
brk
ioctl
fcntl
gettimeofday
mmap
munmap
getpriority
stat
lstat
fstat
recv
fsync
clone
mprotect
sigprocmask
msgget
syscall
writev
mmap2
sched_yield

530
163
530
591
13
13
31
7
162
5441
6
12378
44
36
6
4
1
35
110
29213
230
15
176
124
3
550
4
514
16310
36
22
1009
637
112187
2047
53
176
3

REPORT OF USED FILES

65


OPEN FILES

File
File
File
File

: / proc /348/ cmdline


: / s y s t e m / u s r / k e y c h a r s / q w e r t y . kcm . b i n
: / proc /348/ cmdline
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / s h a r e d _ p r e f s /
com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m _ p r e f e r e n c e s . xml
: / proc /348/ cmdline
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db j o u r n a l
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s
: / d e v / urandom F i l e : / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s
/ webview . db j o u r n a l
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db j o u r n a l
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l
: / d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s
: / d e v /ashmem
: / s y s t e m / u s r / k e y c h a r s / q w e r t y . kcm . b i n
: / proc /348/ cmdline
: / d e v /ashmem
: / d e v /ashmem
: / d e v /ashmem
: / d e v /ashmem
: / d e v /ashmem

File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File
File

ACCESS FILES

File :
File :
File
File
File
File
File
File
File
File
File
File
File
File

:
:
:
:
:
:
:
:
:
:
:
:

/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / s h a r e d _ p r e f s /
com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m _ p r e f e r e n c e s . xml
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / s h a r e d _ p r e f s /
com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m _ p r e f e r e n c e s . xml . bak
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webview . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l
/ d a t a / d a t a /com . a p p s p o t . s w i s s c o d e m o n k e y s . s t e a m / d a t a b a s e s / webviewCache . db j o u r n a l

After several tests of the Steamy Window application on our behavior-based


Android malware detection system, we can conclude that the Steamy Window
application downloaded from the unocial Android repository is potentially
dangerous Android application malware.

66

Chapter 5
5

Conclusions, Contributions and Future Work

This chapter summarizes the results of the work described in this project in
two dierent sections. Section 5.1 will summarize the work carried out over the
course of this Master's thesis. Section 5.2 will suggest new ideas that can be
pursued based on this project.

5.1

Conclusions

All market indicators forecast a massive increase in the number of smartphones


purchased over the next 5 years. This will pave the way for a potentially massive
increase in malware creation, in particular for the leading OS on the market,
Android.
In this report we have proposed a new framework for obtaining and analyzing smartphone application activity. In collaboration with the Android user
community, it will be capable of distinguishing between benign and malicious
applications with the same name and version by detecting anomalous behavior
for known applications. In addition, by deploying our platform on a number of
test Smartphones, we have created a proof-of-concept for this mechanism as a
means of analyzing emerging threats.
We have indicated that monitoring system calls is a feasible way for detecting malware. According to a brief survey of related works, we have seen that
there are many dierent approaches designed to detect malware. We reasoned
that monitoring system calls is one of the most accurate ways to determine the
behavior of Android applications, since they provide detailed low level information. We realize that API call analysis, information ow tracking and network
monitoring techniques can contribute to a deeper analysis of malware, providing more useful information about malware behavior and more accurate results.
On the other hand, more monitoring capability places a higher demand on the
amount of resources consumed on the device.
We have seen that

open(), read(), access(), kill(), chmod()

and

chown()

are

the system calls most commonly used by malware. A benign application could
make moderate or heavy use of those system calls, thus triggering false positives.
Even when dealing with slightly modied Trojans, the system would still class
them correctly . We have seen that Trojanized applications made more system
call executions and invoked dierent system calls to the kernel in comparison
with the original applications.
The most important contribution of this project is the mechanism we propose for obtaining real traces of application behavior. In previously published
works, we have seen that it is possible to obtain information on behavior using
articially created user actions or creating replicas of smartphones, but crowdsourcing helps the community to obtain real application traces from hundreds
or even thousands of applications.

67

A paper has been published based on this report for the ACM CCS Workshop on Security and Privacy in Smartphones and Mobile Devices 2011 - SPSM
2011. The paper summarizes the essential details of the framework and contains
further tests performed using the framework on the latest Android malware.

5.2

Future Directions

The next step is to deploy the Crowdroid lightweight client on Google's Android market and distribute it to as many users as possible. Users running our
application will be able to see their own smartphone behavior. We could even
alert the users when one of their applications shows an abnormal trace.

The

system can also act as an early warning system, capable of detecting malicious
or abnormally behaving applications in the early stages of propagation.
By implementing a set of tools, we have demonstrated that one can obtain behavior-based information and have it processed and clustered on a central server. Clustering results have been awless for self-written malware, and
promising with real malware. Whether the performance of a single central server
would suce for large-scale deployment is an interesting topic for further study.
A conguration with multiple cooperating servers, each with a lower load and
faster response, is an avenue to explore.
We have chosen a simple 2-means clustering algorithm to distinguish between
benign applications and their corresponding malware version. The results have
been encouraging, although we need to address some issues that remain unresolved. First, the system would always separate the system call data vectors into
two clusters even if there was no malware present. The cluster mapping would
change drastically whenever a malicious execution vector entered the dataset.
This issue requires some manual checks or further automatic analysis. Secondly,
one could intentionally submit incorrect data into the system, thus leaving the
dataset corrupted. One of the next steps is to authenticate the submitting application in order to ensure that nobody is deliberately sending incorrect data
to the system. As regards the communication mechanism between the Crowdroid client and our server, it is carried out using the FTP protocol in this rst
version and thus does not focus on protecting the privacy of transferred data.
If an attacker snis and manipulates the trac in the communication process
it can lead to misclassication errors. In order to avoid this, we are introducing
encryption mechanisms to preserve the integrity of the data and the authenticity
of the sender. We have to take into account that when applying this technique
on the mobile device it might have an extra overhead in the processing stage,
resulting in higher energy consumption.
Finally, we have the challenge of convincing the Android user community
to install the Crowdroid application.

We need to manage the perception of

a loss of privacy associated with supplying personal behavior information to


the research community, weighing this against the benet of having access to
up-to-date behavioral-based statistics on detected malware.

68

References
[1] 50 Malware applications found on Android Ocial Market. Access date:
25 Nov 2010.

http://m.guardian.co.uk/technology/blog/2011/mar/02/
android-market-apps-malware?cat=technology&type=article.
[2] Adb Monkey UI- Application exerciser. Access date: 12 Nov 2010.

http://developer.android.com/guide/developing/tools/monkey.
html.

[3] Android apk format. Access date: 25 Jan 2011.

http://en.ophonesdn.com/article/show/354.

[4] Android application. Access date: 1 Feb 2011.

http://www.androidenea.com/2009/06/android-boot-process-from-power-on.
html.

[5] Android Arquitecture. Access date: 4 Nov 2010. [Online]. Available from:

http://developer.android.com/guide/basics/what-is-android.
html.

[6] Android boot. Access date: 13 Jan 2011.

http://reminisce06.springnote.com/pages/7407623?print=1.

[7] Android build process. Access date: 26 Jan 2011.

http://www.alittlemadness.com/2010/06/07/
understanding-the-android-build-process/.

[8] Android init. Access date: 13 Jan 2011.

http://bootloader.wikidot.com/linux:boot:android.

[9] Android Kernel. Access date: 20 Apr 2011. [Online]. Available from:

//android.git.kernel.org/.

http:

[10] Android SDK. Access date: 29 Oct 2010.

http://developer.android.com/sdk/index.html.

[11] Angry Birds Bonus Level. J. Oberheide. Access date: 27 Dec 2010.

http://m.guardian.co.uk/technology/blog/2011/mar/02/
android-market-apps-malware?cat=technology&type=article.

[12] Apk le generation. Access date: 27 Jan 2011.

http://facinatingandroid.blogspot.com/2011/09/
android-apk-file.html.

[13] Baksmali. Access date:

9 Jan 2011.

code.google.com/p/smali/.

[Online]. Available from:

[14] Cabir Malware variants.Access date: 26 Nov 2010.

http://www.f-secure.com/weblog/archives/00000414.html.
69

http://

[15] Cabir, Smartphone Malware. Access date: 26 Nov 2010.

http://www.f-secure.com/v-descs/cabir.shtml.

[16] Dalvik Virtual Machine. Access date: 15 Dec 2010.

http://www.dalvikvm.com/.

[17] Dex le compilation. Access date: 14 Feb 2011.

http://en.ophonesdn.com/article/show/354.

[18] Distance metric. Access date: 4 Mar 2011.

http://ai.stanford.edu/~ang/papers/nips02-metric.pdf.

[19] Eclipse. Access date: 23 Nov 2010. [Online]. Available from:

eclipse.org/.

http://www.

[20] Hispasec Security Company. Access date: 25 Mar 2011.

http://www.hispasec.com/.

[21] IDC Forecast 2010 2015. Access date: 12 Feb 2011.

http://www.idc.com/getdoc.jsp?containerId=227360.

[22] IDC Forecast 2011-2015. Access date: 2 Dec 2010.

http://www.idc.com/getdoc.jsp?containerId=prUS22762811.

[23] International Data Corporation, IDCweb. Access date: 4 Dec 2010.

http://www.idc.com.

[24] Intrusion Detection System, IDS. Access date: 24 Dec 2010.

http://www.sans.org/reading_room/whitepapers/detection/
intrusion-detection-systems-definition-challenges_343.

[25] Iseclab. International Secure Systems Laboratory. Access date:

14 Jan

2011.

http://www.iseclab.org/.
[26] Linux Kernel manual pages. Access date: 14 Mar 2011.

http://www.kernel.org/doc/man-pages/online/dir_section_2.html.

[27] Linux Kernel system call list table. Access date: 24 Mar 2011.

http://bluemaster.iu.hio.no/edu/dark/lin-asm/syscalls.html.

[28] Malware developers attacks. Access date: 29 Nov 2010.

http://adtmag.com/articles/2011/03/03/
android-attacks-on-rise.aspx.

[29] Malware economic damage in 2007. Access date: 23 Oct 2010.

http://www.computereconomics.com/page.cfm?name=Malware%
20Repor.

[30] Malware evolution. Access date: 22 Nov 2010.

http://pages.cs.wisc.edu/~pb/comsnets09.pdf.
70

[31] Malware increase. Access date: 22 Dec 2010.

http://www.topnews.in/android-malware-increase-400-report-2328121.

[32] Malware types. Access date: 2 Dec 2010.

http://www.spamlaws.com/malware-types.html.

[33] Netbeans. Access date:

//netbeans.org/.

23 Nov 2010.

[Online]. Available from:

http:

[34] Netqin, mobile security service provider.Access date: 13 Nov 2010.

http://www.netqin.com/en/.

[35] Nokia conrms Microsoft partnership. Access date: 11 Feb 2010.

http://techcrunch.com/2011/02/10/nokia-confirms-microsoft-partnership-new-leadership-te

[36] Operating Systems in Smartphones. Access date: 14 Dec 2010.

http://www.idc.com/getdoc.jsp?containerId=prUS22486010.

[37] SAI - Business Insider. Access date: 17 Dec 2010.

http://www.businessinsider.com/sai.

[38] Samsung HTC Smartphone vendor companies market share. Access date:
3 Jan 2011.

http://www.eweek.com/c/a/Mobile-and-Wireless/
Android-Helps-Samsung-HTC-Double-Market-Share-IDC-792965.
[39] Sandbox. Access date: 29 Oct 2010.

http://www.cs.bgu.ac.il/~dsec022/papers/j9a.pdf.

[40] Smartphone applications evolution. Access date: 21 Dec 2011.

http://www.businessinsider.com/chart-of-the-day-smartphone-apps-2011-3.

[41] Smartphones vendors sales 2011. Access date: 14 Feb 2011.

http://www.idc.com/about/viewpressrelease.jsp?containerId=
prUS22689111.

[42] Stack-Based architecture. Access date: 18 Nov 2010.

http://en.wikipedia.org/wiki/Stack_machine.

[43] Steamy Window Malware. Access date: 25 Feb 2011.

http://www.netqin.com/en/.

[44] Worldwide Smartphone Users. Access date: 2 Nov 2010.

http://www.parksassociates.com//blog/article/
number-of-smartphone-users-to-quadruple--exceeding-1-billion-worldwide-by-2014-4.

[45] Jan A Bergstra and Alban Ponse. Register-machine based processes.

nal of the ACM, 48(6):12071241, 2001.

[46] P Berkhin. Survey of clustering data mining techniques.


56, 2002.

71

Jour-

Techniques, 10:1

[47] P Berkhin. Survey of clustering data mining techniques.

Techniques, 10:1

56, 2002.
[48] Thomas

Bl,

Leonid

Batyuk,

Aubrey-Derrick

Schmidt,

Camtepe, Sahin Albayrak, and Technische Universit.


cation sandbox system for suspicious software detection.

Seyit

Ahmet

An android appli-

Techniques, pages

5562, 2010.
[49] Abhijit Bose, Xin Hu, Kang G. Shin, and Taejoon Park.

Behavioral de-

Proceeding of the 6th international conference on Mobile systems, applications, and services, MobiSys
tection of malware on mobile handsets. In

'08, pages 225238, New York, NY, USA, 2008. ACM.


[50] Timothy K. Buennemeyer, Theresa M. Nelson, Lee M. Clagett, John P.
Dunning, Randy C. Marchany, and Joseph G. Tront. Mobile device pro-

Proceedings of the
Proceedings of the 41st Annual Hawaii International Conference on System Sciences, HICSS '08, pages 296, Washington, DC, USA, 2008. IEEE
ling and intrusion detection using smart batteries. In

Computer Society.
[51] Iker Burguera, Urko Zurutuza, and Simin Nadjm-Tehrani.

Crowdroid:

Workshop on
Security and Privacy in Smartphones and Mobile Devic es 2011 - SPSM
2011. ACM, October 2011.
Behavior-based malware detection system for android.

In

[52] Jerry Cheng, Starsky H Y Wong, Hao Yang, and Songwu Lu.

virus detection and alert for smartphones, pages 258271.

SmartSiren:

ACM, 2007.

[53] David Dagon, Tom Martin, and Thad Starner. Mobile phones as computing

IEEE Pervasive Computing,

devices: The viruses are coming!

3:1115,

October 2004.
[54] Anhai Doan, Raghu Ramakrishnan, and Alon Y Halevy.
systems on the world-wide web.

Crowdsourcing

Communications of the ACM,

54(4):86,

2011.
[55] Manuel Egele.

A survey on automated dynamic malware analysis tech-

niques and tools vienna university of technology.

Computing, V:149, 2011.

[56] William Enck, Peter Gilbert, Byung-Gon Chun, Landon P. Cox, Jaeyeon
Jung, Patrick McDaniel, and Anmol N. Sheth. Taintdroid: an informationow tracking system for realtime privacy monitoring on smartphones. In

Proceedings of the 9th USENIX conference on Operating systems design and


implementation, OSDI'10, pages 16, Berkeley, CA, USA, 2010. USENIX
Association.
[57] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. Advances in knowledge discovery and data mining. chapter From data mining
to knowledge discovery: an overview, pages 134. American Association for
Articial Intelligence, Menlo Park, CA, USA, 1996.

72

[58] By Fengmin Gong, Chief Scientist, Mcafee Network, and Security Technologies. Deciphering detection techniques : Part ii anomaly-based intrusion
detection.

Network, (March), 2003.

Crowdsourcing: Why the Power of the Crowd Is Driving the


Future of Business. Crown Publishing Group, New York, NY, USA, 1

[59] Je Howe.

edition, 2008.
[60] Nwokedi Idika and Aditya P Mathur. A survey of malware detection techniques.

Purdue University, page 48, 2007.

[61] G A Jacoby and Nathaniel J Davis Iv. Battery-based intrusion detection.

Design, page 224, 2005.

[62] J Macqueen.

Some methods for classication and analysis,

volume 233,

pages 281297. 1967.


[63] Georgios Portokalidis, Philip Homburg, Kostas Anagnostakis, and Herbert

Proceedings
of the 26th Annual Computer Security Applications Conference, ACSAC
Bos. Paranoid android: versatile protection for smartphones. In
'10, pages 347356, New York, NY, USA, 2010. ACM.

[64] Georgios Portokalidis, Philip Homburg, Kostas Anagnostakis, Herbert Bos,


and Universiteit Amsterdam. Paranoid android : Zero-day protection for
smartphones using the.
[65] J.Blasco P.Rincon.

csvunl, pages 120, 2010.

Hong toutou malware analysis. wTF is happening

inside my android phone. Access date: 24 Dec 2011. Technical report.

http://www.slideshare.net/JaimeBlasco/
wtf-is-happeninginsidemyandroidphonepublic.

[66] Aubrey-Derrick Schmidt, Rainer Bye, Hans-Gunther Schmidt, Jan Clausen,


Osman Kiraz, Kamer A. Yksel, Seyit A. Camtepe, and Sahin Albayrak.
Static analysis of executables for collaborative malware detection on an-

Proceedings of the 2009 IEEE international conference on Communications, ICC'09, pages 631635, Piscataway, NJ, USA, 2009. IEEE
droid. In
Press.
[67] Aubrey-Derrick Schmidt, Jan Hendrik Clausen, Ahmet Camtepe, and
Sahin Albayrak. Detecting symbian os malware through static function call

2009 4th International Conference on Malicious and Unwanted


Software MALWARE, (March 2006):1522, 2009.
analysis.

[68] Aubrey-Derrick Schmidt, Frank Peters, Florian Lamour, Christian Scheel,


Seyit Ahmet amtepe, and Sahin Albayrak. Monitoring smartphones for
anomaly detection.

Mob. Netw. Appl., 14:92106, February 2009.

[69] Aubrey-Derrick Schmidt, Hans-Gunther Schmidt, Jan Clausen, Ahmet


Camtepe, and Sahin Albayrak. Enhancing security of linux-based android
devices.

Image Rochester NY, 2008.


73

[70] Asaf Shabtai, Uri Kanonov, and Yuval Elovici.

Intrusion detection for

mobile devices using the knowledge-based, temporal abstraction method.

J. Syst. Softw., 83:15241537, August 2010.

[71] Asaf Shabtai, Robert Moskovitch, Yuval Elovici, and Chanan Glezer. Detection of malicious code by applying machine learning classiers on static
features: A state-of-the-art survey.

Inf. Secur. Tech. Rep., 14:1629, Febru-

ary 2009.
[72] Ashkan Shari Shamili, Christian Bauckhage, and Tansu Alpcan. Malware

detection on mobile devices using distributed machine learning. In Proceedings of the 2010 20th International Conference on Pattern Recognition,
ICPR '10, pages 43484351, Washington, DC, USA, 2010. IEEE Computer
Society.

[73] Symantec.

Trojanized android application, steamy window. Access date:

12 Apr 2011. Technical report.

http://www.techeye.net/security/androids-steamy-window-trojan-sends-sms-to-premium-numb
[74] Urko Zurutuza, Roberto Uribeetxeberria, and Diego Zamboni.

A data

mining approach for analysis of worm activity through automatic signa-

Proceedings of the 1st ACM workshop on Workshop on


AISec, AISec '08, pages 6170, New York, NY, USA, 2008. ACM.

ture generation. In

74

You might also like