Behavior-based malware detection system for the Android platform

Institutionen för datavetenskap

Department of Computer and Information Science

Final thesis

Behavior-based malware detection system for the Android platform

by

Iker Burguera Hidalgo

LIU-IDA/ERASMUS-A—11/002—SE

2011-09-27

Linköpings universitet SE-581 83 Linköping, Sweden

Supervisor: Dr. Urko Zurutuza

Linköping University Electronic Press

Copyright

The publishers will keep this document online on the Internet (or its possible replacement) from the date of publication, barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

Malware in smartphones is growing at a significant rate. There are currently more than 250 million smartphone users in the world and this number is expected to grow in coming years [44].

In the past few years, smartphones have evolved from simple mobile phones into sophisticated computers. This evolution has enabled smartphone users to access and browse the Internet, to receive and send emails, SMS and MMS messages and to connect devices in order to exchange information. All of these features make the smartphone a useful tool in our daily lives, but at the same time they render it more vulnerable to attacks by malicious applications.

Given that most users store sensitive information on their mobile phones, such as phone numbers, SMS messages, emails, pictures and videos, smartphones are a very appealing target for attackers and malware developers.

The need to maintain security and data confidentiality on the Android platform makes the analysis of malware on this platform an urgent issue. We have based this report on previous approaches to the dynamic analysis of application behavior, and have adapted one approach in order to detect malware on the Android platform. The detector is embedded in a framework to collect traces from a number of real users and is based on crowdsourcing. Our framework has been tested by analyzing data collected at the central server using two types of data sets: data from artificial malware created for test purposes and data from real malware found in the wild. The method used is shown to be an effective means of isolating malware and alerting users of downloaded malware, which suggests that it has great potential for helping to stop the spread of detected malware to a larger community.

Finally, the report will give a complete review of results for self-written and real Android malware applications that have been tested with the system.

This thesis project shows that it is feasible to create an Android malware detection system with satisfactory results.

Acknowledgments

First of all, I would like to thank Prof. Simin Nadjm-Tehrani and Dr. Urko Zurutuza for their support, guidance and patience over the course of this Master's thesis project.

I would also like to thank all members of the Real-Time Systems Laboratory (RTSLab), my corridor mates from Ryds Allé 9 and Alsättersgatan 9 and friends from Legazpi for all the support and fantastic moments we shared in 2010-2011.

Finally, I would like to thank my wonderful and fantastic family, which in addition to providing me with economic and moral support also wrote part of my acknowledgment notes.

Contents

1 Introduction
1.1 Background and Motivation
1.2 Goal
1.3 Project Assumptions
1.4 Intended audience
1.5 Related work
1.6 Thesis structure
2 Background
2.1 Android Operating System
2.1.1 Platform architecture
2.1.2 The Dalvik Virtual Machine
2.1.3 The Android Security Model
2.1.4 Android applications
2.2 Intrusion Detection System
2.2.1 Definition
2.2.2 Detection types
2.3 System calls and Vectors
2.4 Data Mining
2.4.1 Data collection in KDD process
2.5 K-means Clustering algorithm
2.6 Crowdsourcing
3 Behavior-Based malware detection system for Android Applications
3.1 Overview
3.2 Android Data mining: Crowdsourcing and Self-written applications
3.2.1 Android Data collector script
3.2.2 Android Crowdsourcing and data mining application
3.3 Behavior-Based malware detection system
3.3.1 Design of the Behavior-Based malware detection system
4 Results and Evaluation
4.1 Data Set
4.2 Devices and Programs
4.3 Malware detection system Results
4.3.1 Self-written Malware
4.3.2 Real Malware
5 Conclusions, Contributions and Future Work
5.1 Conclusions

List of Figures

1 Number of Applications available at smartphone App Stores [40]
2 Android platform architecture [5]
3 Android Linux Kernel and Init process
4 Android boot sequence
5 Dex file creation process
6 Application request process
7 Android APK file
8 Android APK file generation process
9 Misuse detection versus Anomaly detection
10 Linux User and Kernel space
11 Knowledge Discovery in Databases (KDD) process [46]
12 Taxonomy clustering methods
13 Hierarchical method: Agglomerative vs Divisive
14 K-means applied as a detection system for Android system calls
15 Android malware detection system scheme
16 Data acquisition process
17 Data collector script user interface
18 Data collector script process
19 Android Crowdsourcing application
20 Static and Dynamic Analysis
21 Android Malware Detection process
22 Steamy Window application
23 Interaction with Steamy Window application
24 Steamy Window Interactions bar plot

List of Tables

1 Worldwide mobile device Operating System Market Shares and 2010-2014 Growth [36]
2 Related work State-of-the-Art Summary (i)
3 Related work State-of-the-Art Summary (ii)
4 K-means Clustering algorithm process
5 Static and Dynamic Malware analysis Advantages and Disadvantages
6 Matlab Clustering code for Android Malware Detection
7 Clustering algorithm metrics
8 Vector comparison matrix
9 Example vector clustering results
10 Test Devices
11 Programs used in the project
12 Crowdsourcing application result - Android Device Information
13 Crowdsourcing application result - Installed applications
14 Self Written Application report - Calculator Good Application
15 Self Written Application report - Calculator Malicious Application
16 Self written Android applications description
17 Self written Android Malware result
18 Steamy Window system call vectors comparison matrix table

Chapter 1

1 Introduction

This paper describes the results of a Master's thesis project (30 ECTS) towards the fulfillment of a degree in Telecommunications Engineering at Mondragon Unibertsitatea. The project was carried out at the Department of Computer and Information Science at Linköping University, where the author was a visiting student from Mondragon Unibertsitatea.

The following paragraphs detail the background, motivation, related work and goals of the Master's thesis. Details on how the project was carried out and on the results obtained are presented in the following chapters.

1.1 Background and Motivation

Communications and technology are rapidly growing industries that are changing every day. The constant evolution of technology necessitates adaptation to new concepts and awareness of new developments. In the following section we briefly cover the trends in the evolution of the smartphone market that make the subject matter of this thesis relevant.

According to the International Data Corporation [23], smartphone vendors will ship more than 450 million smartphones in 2011, compared to the 303.4 million units shipped in 2010[21]. Moreover, the smartphone market will grow four times faster than the traditional mobile phone market, and due to this, the demand for smartphones will rise considerably. Eventually, customers will reach the point where they will replace their old mobile phones with smartphones.

The sales growth of mobile phone companies such as Samsung and HTC between 2009 and 2010 has revolutionized the smartphone market. In light of this, the IDC predicts that the Android OS will surpass Nokia's Symbian OS in terms of sales in 2011, and will continue to lead the smartphone OS market in the coming years [36]. Furthermore, it predicts that the market shares of the Android OS and Windows Mobile will grow by 51.2% and 43.3% respectively between 2010 and 2014, giving them a high probability of becoming leading smartphone operating systems in the future. See Table 1.

| Operating System | Predicted 2010 Market Share | Predicted 2014 Market Share | 2014/2010 Change |
|---|---|---|---|
| Symbian | 40.1% | 32.9% | -18.0% |
| BlackBerry OS | 17.9% | 17.3% | -3.5% |
| Android | 16.3% | 24.6% | 51.2% |
| iOS | 14.7% | 10.9% | -25.8% |
| Windows Mobile | 6.8% | 9.8% | 43.3% |
| Others | 4.2% | 4.5% | 8.3% |
| Total | 100% | 100% | |

Table 1: Worldwide mobile device Operating System Market Shares and 2010-2014 Growth [36]

The IDC predicts that the total number of smartphone applications will grow at the same rate as smartphone sales. There are currently more than 350,000 applications in Apple's iPhone market and 250,000 applications in the Android market, according to Silicon Alley Insider [37]. This is depicted in Figure 1.

The official Google Android market nearly doubled in size in 2010 and 2011, surpassing 250,000 applications in March 2011. Figure 1 shows the interest of software developers in the Android platform, and we can assume that as Android developers continue to create applications for the Android OS, malware developers will continue to create malware for the system as well.

Malware¹ has been a threat to PCs for many years [30], and in light of the rapid increase in smartphone sales over the last few years [38], it was only a matter of time before malware developers became interested in staging their attacks on the smartphone platform. In particular, 2010 and 2011 saw a growing interest among malware developers in waging attacks on the Android OS [28].

Malware often destroys valuable and sensitive information in infected systems. It is also commonly used to exploit infected devices and obtain profits from them. Malware can attack smartphones in the same way as it harms computers, given that the two have similar operating features. This makes it clear that smartphone devices need the same kind of enhanced protection that computers received some years ago.

The Android market is an open market system. This means that Android developers can upload their applications, also called third-party applications, to Android's official market without any certification authority filtering them or checking their trustworthiness. On the one hand, this increases the odds that the Android market will have a greater variety of applications and content; on the other hand, it facilitates infection by malware applications.

In conclusion, considering the growth of smartphones running the Android OS² and the increasing number of applications available for the Android OS, improving the security (i.e. the integrity, confidentiality and privacy) of the Android platform is the main objective of this project. In order to achieve that objective, we will develop a behavior-based malware detection system for the Android platform.

¹ Malicious (mal) software (ware)

² Samsung and HTC smartphone vendors [38]

1.2 Goal

The goal of this Master's thesis was to design and implement a behavior-based malware detection system for the Android platform.

More specifically, the work was divided into the following sub-goals:

• Create a malware detection system for the Android platform.

• Create data collector applications to monitor Android OS activity.

• Design and implement the Android application behavior database.

The proposed solution was expected to detect malicious applications from official and non-official Android markets or repositories.

1.3 Project Assumptions

Some assumptions were made at the beginning of the project:

• Applications available on the official Android market would be used to establish the normality model for the applications, and the equivalent programs in non-official repositories would be used to test the system.

• Even if malware did exist in the Android market, we first needed clean (known-good) applications with the same name or purpose in order to test the malware detection system.

• We assumed that downloaded third-party applications were not trusted and had to be analyzed or monitored with the crowdsourcing application or data collector script.

• The Android community would collaborate on this project by installing the crowdsourcing application on their devices. The crowdsourcing application would send the recorded files to the malware detection system server for post-analysis.

1.4 Intended audience

This thesis is useful to anyone involved in mobile security, and is especially aimed at Android smartphone users and developers. It is also targeted at anyone interested in crowdsourcing and data mining techniques as they apply to mobile phones.

The document does not require any prior knowledge in the area of security. Chapter 2 will provide all the basic theory for the concepts explained in the paper.

1.5 Related work

Malware has been a threat to computers for many years [30] and continues to cause irreparable damage to infected systems [29]. The first attempts to identify and analyze malware on smartphones adapted existing PC security solutions to mobile phones. This was not feasible, given the high resource demands of antivirus techniques and the power and memory constraints of mobile devices. Since malware and intrusion detection systems have already been the subject of extensive research, we give just a brief review of the evolution of malware and malware detection techniques for mobile phones.

Nwokedi et al. compiled a summary of the most commonly used malware detection techniques [60]. Their report examined 45 different malware detection techniques in the fields of anomaly-based detection, specification-based detection and signature-based detection. The techniques explained in this report are useful background for understanding the first approaches to malware detection that can also be used on smartphones.

Iseclab [25], the International Secure Systems Laboratory, explored the detection of malicious or infected applications using different approaches and techniques based on dynamic analysis [55]. The paper provides useful information about malware detection techniques and tools used in dynamic analysis of malware.

Jacoby et al. introduced battery-based intrusion detection, a host-based intrusion detection system[61]. This technique monitors anomalous behavior of smartphone batteries and writes a report in the device listing the causes of high power consumption.

Some years later, Buennemeyer et al. evaluated the power consumption of devices with a client application installed on a smartphone running the Symbian OS [50]. The application monitored power consumption data and sent a report to a remote server, which analyzed it to detect anomalies in the system. Due to the lack of smartphone malware patterns at that time, most anomaly detection techniques used battery power consumption as the main source of detection data. These techniques monitored mobile phones' power consumption and compared it to the normal power consumption pattern in order to detect anomalies.
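As a rough illustration of comparing observed power consumption against a normal pattern (not the method of any of the cited authors), the following Python sketch flags readings that deviate strongly from a baseline. The readings, the milliwatt unit, the function name `find_power_anomalies` and the z-score threshold of 3 are all assumptions made for this example.

```python
from statistics import mean, stdev


def find_power_anomalies(baseline_mw, observed_mw, z_threshold=3.0):
    """Return the indices of observed samples that deviate from the baseline.

    baseline_mw: power readings recorded during normal use (hypothetical data).
    observed_mw: new readings to check against that baseline.
    z_threshold: how many standard deviations count as anomalous (assumed value).
    """
    mu = mean(baseline_mw)
    sigma = stdev(baseline_mw)
    anomalies = []
    for index, sample in enumerate(observed_mw):
        # With a perfectly flat baseline no z-score can be computed,
        # so this sketch simply reports nothing in that case.
        z_score = abs(sample - mu) / sigma if sigma > 0 else 0.0
        if z_score > z_threshold:
            anomalies.append(index)
    return anomalies


# Hypothetical readings in milliwatts.
baseline = [210, 205, 198, 202, 207, 201, 204, 199]
observed = [203, 206, 390, 200, 415]

print(find_power_anomalies(baseline, observed))  # -> [2, 4]
```

A fixed threshold on a single feature is the simplest possible normality model; it mainly shows why such detectors need a representative baseline before they can say anything about a device.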

Cheng et al. introduced SmartSiren, a collaborative virus detection application for Windows Mobile 5 [52]. It collects the communication activity from smartphones and performs system log file analysis to detect anomalous behavior in the system. The system uses a proxy-based architecture that interacts with a client installed on devices in order to avoid a heavy processing load.

Schmidt et al. showed how to extract smartphone features from Symbian OS and Windows Mobile phones in order to perform anomaly detection in the systems [68]. They use several APIs provided by Windows and Symbian to monitor applications and extract device features, such as free RAM, user inactivity, process count, CPU usage, sent SMS messages, etc. The aim of monitoring the applications' performance is to obtain data that makes it possible to differentiate between normal and malicious use of a device.

Schmidt et al. presented a novel approach to static malware detection in resource-limited mobile environments [67]. Their approach consisted of detecting malware by extracting function calls from binaries in order to apply a clustering algorithm to the data. This technique was used for detecting Symbian OS malware and was designed around the constraints of mobile phones, such as device efficiency, speed and limited resources.

In 2006 Symbian was the most widely used smartphone OS and many malware detection techniques were developed for this platform. Due to the imminent growth of smartphones with the Android OS, malware researchers decided to switch their malware detection techniques and security mechanisms to this platform [38].

Schmidt et al. presented the first serious research on malicious applications for the Android OS [69]. They proposed a solution based on monitoring events occurring at the Linux kernel level. They used a monitoring application to extract features such as executed system calls, modified files, etc. from the Linux kernel. These features were used to create the smartphone normality pattern.

The same group proposed static analysis in 2009 [66] and an Android application sandbox system in 2010 [48]. The first report presented a collaborative scenario in which different devices could perform static analysis of malware directly on the phone. The second method used an Android application sandbox, an isolated environment, to perform static and dynamic analysis. Static analysis disassembled Android APK files to detect malware patterns. During dynamic analysis, all of the events occurring on the device (opened files, accessed files, battery consumption, etc.) were monitored. The sandbox provided an environment where malware applications could be executed without the risk of infecting a real device.

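Enck et al. proposed TaintDroid, an extension to the Android mobile phone platform that tracks the flow of privacy-sensitive data through third-party applications, providing real-time monitoring and analysis of sensitive data with dynamic taint tracking [56]. This technique taints data from privacy-sensitive sources and applies labels as sensitive data propagates through program variables, files, and inter-process messages. When data leaves the system, the outgoing data is checked for taint labels and suspicious transmissions are flagged.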
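The label propagation behind dynamic taint tracking can be illustrated with a toy Python sketch. This is not how the system of Enck et al. is implemented; the `Tainted` class, the `concat` and `send_to_network` helpers, the "IMEI" label and the sample identifier are all hypothetical names and values chosen for this example. The point is only that a value derived from a sensitive source keeps its label until it reaches an output (a "sink").

```python
class Tainted:
    """A value paired with the set of sensitive sources it was derived from."""

    def __init__(self, value, labels=frozenset()):
        self.value = value
        self.labels = frozenset(labels)


def concat(*parts):
    """Concatenate strings; the result carries the union of all taint labels."""
    value = "".join(p.value if isinstance(p, Tainted) else str(p) for p in parts)
    labels = frozenset().union(
        *(p.labels for p in parts if isinstance(p, Tainted))
    )
    return Tainted(value, labels)


def send_to_network(data, alert_log):
    """Hypothetical sink: record an alert if the outgoing data is tainted."""
    labels = data.labels if isinstance(data, Tainted) else frozenset()
    if labels:
        alert_log.append(sorted(labels))
    return bool(labels)


# A made-up device identifier read from a privacy-sensitive source.
imei = Tainted("356938035643809", {"IMEI"})

# The label survives string manipulation inside the application.
msg = concat("id=", imei)

alerts = []
leaked = send_to_network(msg, alerts)
print(leaked, alerts)
```

Untainted data, for example `concat("hello")`, passes through the same sink without raising an alert, which is what keeps this kind of monitoring from flagging every outgoing message.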

Bose et al., Shabtai et al. and Sharifi et al. have proposed other solutions for malware detection on smartphones based on Support Vector Machines (SVM) and machine learning [49, 71, 72]. Their proposals consist of monitoring smartphone devices to determine their normal behavior and using the collected data to train a learning machine. This learning machine learns the normality model of the smartphone and its applications and alerts the user every time it detects a suspicious action.

Portokalidis et al. have proposed a system that performs a complete malware analysis of the phone in a virtual environment on a remote server [63, 64]. In both reports, they explain how to create replicas of Android devices and apply malware detection techniques to them. The replicas are equivalent versions of the real mobile devices and are sent to the remote server for malware analysis, where they run in a secure virtual environment in which different malware detection techniques are applied.

Our purpose in this project is to improve on and contribute to malware detection strategies for the Android OS by offering new ideas. Our work has its foundation in many of the works mentioned above [28, 48, 63, 64, 66, 68, 69]. Our approach is based on detecting Android malware applications using Linux system calls and clustering algorithms. Like Portokalidis et al. [63], and taking into account the limited battery life of smartphones, we favor using a remote server to perform malware detection.
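To make the idea of clustering system call data concrete, the sketch below runs a plain two-cluster k-means over per-trace system call counts. It is a minimal illustration under stated assumptions rather than the detector built in this thesis: the four system calls, the counts, the function name `kmeans` and the deterministic choice of initial centroids are all hypothetical.

```python
import math


def kmeans(vectors, k=2, iterations=20):
    """Plain k-means (Lloyd's algorithm) with Euclidean distance.

    The first k vectors are used as initial centroids so that the
    example is deterministic; real implementations usually randomize this.
    """
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        assignment = []
        for v in vectors:
            distances = [math.dist(v, c) for c in centroids]
            assignment.append(distances.index(min(distances)))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical traces of the same application on five devices.
# Each vector counts the calls (open, read, write, sendto) seen in one trace.
traces = [
    [12, 40, 8, 0],
    [11, 38, 9, 1],
    [13, 41, 7, 0],
    [12, 39, 8, 30],  # unusually many sendto calls
    [10, 42, 9, 0],
]

assignment, centroids = kmeans(traces, k=2)

# In this toy data set, a trace that ends up alone in its cluster is the
# one worth inspecting more closely.
suspicious = [i for i, a in enumerate(assignment) if assignment.count(a) == 1]
print(assignment, suspicious)
```

With k = 2 the algorithm is forced to split the traces into two groups, which fits a setting where traces of one application are expected to be either normal or deviating; whether the deviating group is actually malicious still has to be established separately.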

Antivirus software techniques are inadequate for use on smartphones, as they consume a great deal of CPU and memory resources and can drastically shorten battery life. On the other hand, we consider it dangerous to send phone replicas to a remote server, since the replicas contain important and confidential information (contact numbers, messages, pictures, etc.) and may compromise user confidentiality. Rather than sending the whole replica, we propose sending only log files to the remote server for malware analysis. These log files are collected by a lightweight data collector application installed on Android devices and contain the device's most important information.

A lightweight data collector application³ installed on the device will be responsible for collecting the system calls generated by Android applications on the device and storing device information files on the SD card. This application has similar features to the one proposed by Buennemeyer et al. [50], i.e. the sending of all monitored files to a remote server. They, however, carried out very few trials with mobile phones, and we aim to extend use of the application as much as possible. To do so we will ask Android community users to use a lightweight script application (crowdsourcing application) in order to collect as much data as possible from different Android devices.
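The step from a recorded system call log to data that a server can analyze can be sketched as follows. The log format is an assumption: the example expects simplified strace-style lines of the form `name(args) = result`, optionally prefixed by a process id. The fixed list `SYSCALLS`, the regular expression and the function name `syscall_vector` are hypothetical and only serve the illustration.

```python
import re
from collections import Counter

# Hypothetical fixed ordering of the system calls that make up the vector.
SYSCALLS = ["open", "read", "write", "sendto"]

# Matches an optional leading process id followed by "name(".
LINE_RE = re.compile(r"^\s*(?:\d+\s+)?(\w+)\(")


def syscall_vector(trace_lines, syscalls=SYSCALLS):
    """Count system calls in a trace and return the counts in a fixed order."""
    counts = Counter()
    for line in trace_lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    # Calls outside the chosen list (e.g. close) are ignored in the vector.
    return [counts[name] for name in syscalls]


# Made-up trace of an application reading a file and printing its content.
trace = [
    'open("/sdcard/notes.txt", O_RDONLY) = 3',
    'read(3, "hello", 4096) = 5',
    'read(3, "", 4096) = 0',
    'write(1, "hello", 5) = 5',
    '--- SIGCHLD (Child exited) ---',
    'close(3) = 0',
]

print(syscall_vector(trace))  # -> [1, 2, 1, 0]
```

Sending only such count vectors, rather than file contents or a full device replica, keeps the uploaded data small and avoids transmitting the user's private information.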

A. Doan, R. Ramakrishnan and A. Halevy analyzed the impact of crowdsourcing on the WWW (World-Wide Web) [54]. Their article explains how in the future crowdsourcing will become one of the most influential techniques used to collect information and create databases faster and more efficiently.

³ Crowdsourcing application [59]

The following text gives an overview of some recent attacks targeting Android and of malware that has appeared on the Android platform.

Android malware has increased by 400% since 2010 [31] and is expected to continue growing. Several malware attacks were carried out on the Android OS in 2010 and 2011 [11, 65].

Hong Tou Tou, Angry Birds Bonus Level, Tip Calculator, Tap Snake, Monkey Jump and Steamy Window are among the best-known malicious applications to date on the Android platform. Furthermore, more than 50 infected applications were found on Google's Android market in March 2011, all of them infected with the DroidDream Trojan [1].

Another attack targeting the Android platform was carried out by J. Oberheide, who developed Angry Birds Bonus Level for the Android OS [11]. This application was a proof-of-concept malware application to showcase the weak security of the Android marketplace. The Angry Birds Bonus Level malware purports to be an additional bonus level for the famous game Angry Birds. The malicious application downloads and installs three additional applications⁴ on the user's device in order to steal sensitive information. These applications were available in Android's official marketplace for over five months, but were removed after they were discovered to be stealing sensitive information from mobile phone devices. J. Oberheide argues that he could collect confidential information from a great number of Android devices in only a few days' time.

NetQin Inc [34], a mobile security service provider, discovered a spyware application called Tip Calculator in the Android market. The spyware sent all incoming and outgoing SMS messages in the system to a designated email address. Another piece of spyware with similar characteristics, discovered in non-official Android repositories, was Steamy Window [43]. A Trojan horse called Android Pjapps modifies the original version of this application and wages an attack by subscribing to a premium SMS service.

Due to its appeal as the latest malware discovered for the Android OS, and since both the clean and malicious instances of the application were available, we decided to analyze this spyware with our proposed malware detection system.

| Author | Approach | Detection Method | Platform | Description |
|---|---|---|---|---|
| Jacoby et al. (2004) [61] | HIDS | Signature-Based Detection | Symbian OS | Monitors the device's normal power consumption against the actual device power consumption to detect anomalies in the system. |
| Cheng et al. (2007) [52] | HIDS, NIDS | Anomaly Detection | Symbian OS | Performs system log file analysis and collects communication activity from the device in order to detect any anomalous behavior in the system. |
| Buennemeyer et al. (2008) [50] | HIDS, NIDS | Anomaly Detection | Symbian OS | A lightweight application monitors the power consumption and sends the report to a remote server to be analyzed and to detect anomalies. |
| Bose et al. (2008) [49] | HIDS | Signature-Based Detection | Symbian OS | Detects malicious applications by training a classifier based on Support Vector Machines (SVM) and constructs signatures from monitored events and API calls in Symbian OS. |
| Schmidt et al. (2008) [68] | HIDS | Anomaly Detection | Symbian OS / Windows Mobile | Uses a remote learning-based machine for anomaly detection. A Symbian OS or Windows Mobile client application sends extracted device features to a remote server in a vector format. The vectors are processed by a machine learning system for further analysis. |
| Schmidt et al. (2008) [69] | HIDS, NIDS | Anomaly Detection | Android OS | Analyzes the security of Android smartphones from the Linux-kernel view. It uses network traffic, kernel system calls, file system logs and event detection modules to detect anomalies in the system. |
| Shabtai et al. (2009) [71] | HIDS | Signature-Based Detection | All | Uses static features extracted from executables for classifying malicious applications using machine learning methods. The detection techniques described can be applied in any smartphone OS. |

Table 2: Related work State-of-the-Art Summary (i)

| Author | Approach | Detection Method | Platform | Description |
|---|---|---|---|---|
| Schmidt et al. (2009) [66] | HIDS | Signature-Based Detection | Android OS | Performs static analysis on the executables to extract function calls in Android OS using the command readelf. Function calls are compared with malware executables for classification. |
| Schmidt et al. (2009) [67] | HIDS | Anomaly Detection | Symbian OS | Extracts function calls from binaries in order to apply clustering mechanisms in Symbian OS. |
| Bläsing et al. (2010) [48] | HIDS | Signature-Based Detection | Android OS | Uses an Android Application Sandbox (AASandbox) to perform static and dynamic analysis on Android applications. Static analysis scans Android source code to detect malware patterns. Dynamic analysis executes and monitors Android applications in a totally secure environment. |
| Sharifi et al. (2010) [72] | NIDS | Anomaly Detection | Symbian OS | Presents a distributed SVM algorithm to detect malware on a mobile device network. A lightweight Symbian application monitors network traffic in a distributed way. |
| Enck et al. (2010) [56] | HIDS, NIDS | Anomaly Detection | Android OS | TaintDroid is a real-time monitoring system for Android OS. TaintDroid monitors Android applications and alerts the user whenever sensitive user data is compromised. It uses taint tracking analysis to monitor privacy-sensitive information. |
| Portokalidis et al. (2010) [63, 64] | HIDS, NIDS | Anomaly Detection | Android OS | A remote security server in the cloud performs the malware detection analysis. Virtual environments are used to analyze Android mobile phone replicas. |

Table 3: Related work State-of-the-Art Summary (ii)

1.6 Thesis structure

This section summarizes the main topics to be discussed throughout the paper, giving a short overview of each chapter.

Chapter 2 describes the basic theory of the Android platform, intrusion detection systems, Linux system calls, data mining and clustering algorithms. The aim of this chapter is to enable the reader to understand the basic concepts of the project.

Chapter 3 describes the behavior-based malware detection system for the Android platform that was designed in this project.

Chapter 4 describes the testing and evaluation methods used for the behavior-based malware detection system for the Android platform.

Chapter 5 presents the final conclusions and defines the future work of the project.

Chapter 2

2 Background

This chapter will give a brief description of some of the fundamental concepts and terminology relating to the Android OS, intrusion detection systems, Linux system calls, data mining and clustering algorithms. The clustering algorithm section will be illustrated with reference to the way in which we have applied these known techniques in order to group Android system calls.

2.1 Android Operating System

The Android OS is a Linux-based open source operating system for mobile devices. It was originally developed by Android Inc. and was bought by Google in 2005.

The operating system is based on a modified version of the Linux 2.6 kernel[9] optimized for embedded systems and specially adapted for smartphones and tablets. The optimization process in embedded systems improves data processing and battery consumption, extending battery life.

The following pages will provide detailed information about the Android OS.

2.1.1 Platform architecture

Architecture

The Android platform was created for devices with limited processing power, memory and storage space, commonly called embedded systems. It was created with the objective of implementing an operating system in environments requiring a low memory footprint and processing load, such as smartphones or tablets.


Figure 2: Android platform architecture[5]


The Android OS is composed of several software components that can be divided into three main groups: Operating System (OS), Middleware and Applications.

• Operating system: This group consists of the Linux Kernel, the core and most important component of the Android architecture. As mentioned above, Android is based on the Linux 2.6 kernel, which provides the platform with basic services such as security, memory management and process management. The kernel can be considered an abstraction layer between the software and hardware layers, responsible for managing and processing requests received from higher layers for interaction with hardware resources.

• Middleware: This group consists of Android Runtime and Libraries. Android Libraries are written in the C/C++ programming language and Android developers can use them through the Application Framework. Libraries provide easier access to system resources, such as the camera, Wi-Fi, flash memory, etc. Dalvik Virtual Machine, or Dalvik VM[16], is also one of the most important parts of the Android architecture. Dalvik VM is a Java Virtual Machine specially designed and modified to optimize memory and energy consumption in embedded systems. Dalvik VM was designed to run multiple virtual machines without placing additional processing load on the processor. It is also responsible for executing optimized Java code and Dex files (files in the Dalvik execution format). Dalvik VM and Dex file internals will be explained in greater detail in Section 2.1.2.

• Application: This group consists of the Application Framework and Applications. By default, the Android OS includes basic applications like a web browser, an email client and maps. This layer can also run third-party applications from the Android market or other repositories. Applications in this layer are written in the Java programming language. The application framework provides useful components for Android developers. This layer consists of views, a resource manager, content providers and the notification manager, providing aid to applications using standard libraries.

As Android OS is an open-source project, the kernel is available to download on the internet [9] and it is possible to modify it and create new versions adapted to suit different purposes.


Start-up

Another essential part of the Android OS is the startup process. Like any other Linux system, Android has a boot sequence which prepares the services necessary to run/start the device's operating system.

Figure 3 shows the first stage in the boot sequence on Android OS.

Figure 3: Android Linux Kernel and Init process

The first stage in the boot sequence is running the Bootstrapper application. The bootstrapper is the program which starts the device's operating system and initializes and tests the basic requirements of the hardware, peripherals and external memory devices. GRUB and LILO for Linux and NTLDR for Windows are some of the best-known bootstrapper applications. The bootstrapper application loads the kernel image into RAM, and then the kernel starts the init process. Figures 3 and 4 explain the Android OS init process and boot sequence.

The init process initializes system daemons for handling low-level hardware interfaces, such as USB, the Android debugger or Android Debug Bridge Daemon. The init process also starts the basic runtime processes, such as the Runtime service, Service manager, Media server and the Zygote.


Figure 4: Android boot sequence

Figure 4 shows the Android OS boot sequence in greater detail. As mentioned above, the init process initializes several daemons and services in the system. At the same time, the init process starts the Zygote process. We will describe the process in greater detail on the following pages.

2.1.2 The Dalvik Virtual Machine

The Dalvik VM[16] is a Java virtual machine specially designed and modified to optimize memory and energy consumption in embedded systems like smartphones, tablets and netbooks. It was designed and created by Dan Bornstein, with collaboration and contribution by other Google engineers. The virtual machine is optimized to require a low level of memory usage and enables multiple virtual machine instances to run simultaneously with little additional load on the processor.

The Dalvik VM uses a register-based architecture[45], which is faster and more efficient than the stack-based architecture used in most other virtual machines. Every Android application runs in its own process, with its own instance of the Dalvik VM inside a secure environment, a Sandbox. The Dalvik VM executes files in the Dalvik VM executable format (Dex format), which is an optimized Java code file for systems with constrained memory and processor speeds.


The Dex file format

The Android Java source code is still compiled into class files. As mentioned earlier, the Dalvik VM is a modified version of a Java virtual machine optimized for embedded systems, and therefore code must be optimal to achieve the best performance. Since it is not possible to run class files on the Dalvik VM, they are optimized and converted into the Dex file format. Dex files are optimized class files ready to be executed on the Dalvik VM. Figure 5 shows the process of compilation from Java source code files to optimized Dex files.

Figure 5: Dex file creation process

The Zygote

As detailed above, every Android application runs in its own instance of the Dalvik VM and each instance must start quickly when a new application is launched in the application layer. Android uses a concept called Zygote to provide the fast start-up time needed to run the Dalvik VM every time a new application is executed. Zygote loads the original Dalvik VM during the boot sequence and waits for new requests from the Runtime process. When the Zygote process starts, it initializes an instance of Dalvik VM from the original Dalvik VM. Afterwards, it loads and initializes the core library classes. Every time Zygote receives a new application request from the runtime process, it will create/fork a new Dalvik VM instance from the original Dalvik VM that was loaded during the boot sequence. Creating an instance of Dalvik VM from an existing Dalvik VM minimizes the startup time of the application in the secure environment. For every new application request, Zygote will create a new instance of Dalvik VM. This process is repeated every time the user requests an application.


Register-based Architecture

Virtual machine developers have traditionally favored a stack-based architecture [42] over a register-based architecture[45], largely because a stack-based design is simpler to implement. This simplicity comes with a performance cost. Executables for a stack-based architecture are smaller than executables for a register-based architecture, but they execute more instructions and perform more memory accesses to move operands to and from the stack, leading to worse performance of the virtual machine. Register-based architecture requires an average of 48% fewer executed virtual machine instructions than stack-based architecture, which considerably improves the performance of the device. On the other hand, the register code used by register-based architecture is larger than stack-based architecture code. Even so, the processing load generated by register-based architecture is still lower than that of stack-based architecture. Taking into account the fact that the Dalvik VM runs on embedded devices with constrained memory and processing power, the use of a register-based architecture is the most appropriate choice.
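The difference in executed instruction counts can be illustrated with a toy example. The sketch below is not Dalvik or JVM bytecode: the two interpreters and their instruction names (`push`, `add`, `store`) are hypothetical and exist only to show why the statement a = b + c needs several stack instructions but a single three-address register instruction.

```python
# Hypothetical toy interpreters; instruction names are illustrative only.

def run_stack(program, variables):
    """Run a stack-machine program; return (variables, executed_count)."""
    stack = []
    executed = 0
    for op, arg in program:
        executed += 1
        if op == "push":        # copy a variable onto the operand stack
            stack.append(variables[arg])
        elif op == "add":       # pop two operands, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "store":     # pop the result into a variable
            variables[arg] = stack.pop()
        else:
            raise ValueError(f"unknown instruction: {op}")
    return variables, executed


def run_register(program, registers):
    """Run a register-machine program; return (registers, executed_count)."""
    executed = 0
    for op, dst, src1, src2 in program:
        executed += 1
        if op == "add":         # three-address form: dst = src1 + src2
            registers[dst] = registers[src1] + registers[src2]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return registers, executed


# The same statement, a = b + c, expressed for both machines.
stack_program = [("push", "b"), ("push", "c"), ("add", None), ("store", "a")]
register_program = [("add", "a", "b", "c")]

stack_vars, stack_count = run_stack(stack_program, {"b": 2, "c": 3})
reg_vars, reg_count = run_register(register_program, {"b": 2, "c": 3})
print("stack instructions:", stack_count, "register instructions:", reg_count)
```

The register instruction is longer (it names three operands), which mirrors the trade-off described above: larger code, but fewer instructions to dispatch.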

2.1.3 The Android Security Model

Android's security architecture guarantees that no application in the system can damage other applications or the operating system. Each application runs in an independent instance of the Dalvik VM, with its own PID. This means that applications are completely isolated. This technique of running applications in a secure environment is called sandboxing[39]. A Sandbox is a security mechanism often used to execute potentially unsafe code or applications from third-party developers. The Android OS uses a file called AndroidManifest.xml to declare the permissions that enable applications to interact with other applications and system resources in the device. These permissions are declared before Android's installation APK file is generated, and therefore before the application is installed on the device, and they cannot be modified after the app is installed.

In Linux, a user ID identifies a user. On Android, the Android ID identifies an application running on a Dalvik VM instance. This Android ID is assigned and stored in the device's system after installation and is released when the application is removed from the device.

Android uses permissions in the sandbox environment to grant access to system resources such as files, SD Card memory, network, sensors and APIs in general. Figure 6 shows the process of executing applications in the Android OS.
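As a minimal sketch of how declared permissions can be read from a manifest, the following Python fragment parses a small, made-up manifest in plain-text XML. The package name and the two permissions are illustrative assumptions; note also that the manifest inside a built APK is stored in a binary XML encoding, so this only applies to the source form of the file.

```python
import xml.etree.ElementTree as ET

# XML namespace used by Android manifest attributes.
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Made-up manifest for illustration; the package and permissions are assumptions.
MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
    <uses-permission android:name="android.permission.INTERNET"/>
    <uses-permission android:name="android.permission.READ_CONTACTS"/>
    <application android:label="Demo"/>
</manifest>"""


def declared_permissions(manifest_xml):
    """Return the permission names declared with <uses-permission>."""
    root = ET.fromstring(manifest_xml)
    return [
        element.get(f"{{{ANDROID_NS}}}name")
        for element in root.iter("uses-permission")
    ]


permissions = declared_permissions(MANIFEST)
print(permissions)
```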


Figure 6: Application request process

Every time an application is executed in the Android OS application layer, the System Manager is responsible for collecting and sending these requests to the runtime process. The runtime process will catch the requests and notify Zygote of the execution of a new Android application. Zygote will create a new Dalvik VM instance for every new application request, and the requested application will run in that Dalvik VM instance. Every Dalvik VM instance will run only one application in order to provide a secure environment.


2.1.4 Android applications

Android applications are written in the Java programming language. Android uses the Android Software Development Kit (SDK) [10] and Java programming environments, such as Eclipse[19] or Netbeans[33], to compile Java code and create an Android application installation (APK) file. These APK files can be installed on Android devices using the Android Debug Bridge tool (adb) or by downloading them from Android's Official Market. Figure 7 shows the basic structure of an APK file.

Figure 7: Android APK file

An APK file is composed of three main groups: AndroidManifest.xml, Classes.dex and Resources, which are packaged into a single file.

• AndroidManifest.xml: The Android manifest file describes the Android application's essential information. It describes application features such as the application and package name, permissions used by the application and the minimum version of Android required to run the application.

• Classes.dex: This file is the result of the compilation of Android Java source code. It contains optimized Dex bytecode for the Android application and will run on the Dalvik VM.

• Resources: This group contains pictures, libraries and layout files used by the application.
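Since an APK file is a ZIP archive, its three groups can be inspected with ordinary archive tools. The sketch below builds a tiny stand-in archive in memory (the entry contents are placeholders, not a valid manifest or Dex file) and sorts its entries into the three groups listed above; the grouping rule is a simplifying assumption.

```python
import io
import zipfile


def build_demo_apk():
    """Create an in-memory ZIP that mimics the layout of an APK file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as apk:
        apk.writestr("AndroidManifest.xml", b"placeholder manifest")
        apk.writestr("classes.dex", b"placeholder dex bytecode")
        apk.writestr("res/layout/main.xml", b"placeholder layout")
        apk.writestr("res/drawable/icon.png", b"placeholder image")
    buffer.seek(0)
    return buffer


def group_apk_entries(apk_file):
    """Sort archive entries into manifest, dex and resources groups."""
    groups = {"manifest": [], "dex": [], "resources": []}
    with zipfile.ZipFile(apk_file) as apk:
        for name in apk.namelist():
            if name == "AndroidManifest.xml":
                groups["manifest"].append(name)
            elif name.endswith(".dex"):
                groups["dex"].append(name)
            else:
                # Simplification: everything else is treated as a resource.
                groups["resources"].append(name)
    return groups


apk_groups = group_apk_entries(build_demo_apk())
print(apk_groups)
```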


Figure 8: Android APK file generation process

One of the most important elements of creating an APK file is the compilation of Java source code. The process of generating the APK file is described in Figure 8. The files undergo a series of transformations during the process of creating the Android APK file. These transformations comprise the compilation process required to generate APK files that will run on Android devices.

The first step in the process of creating an Android application is to create an Android project, in which Java source code, Android manifest and resource files will be generated by Eclipse or Netbeans.

The next step is to program and configure the code to suit your purposes and to compile the project. Java's compiler in the SDK programming environment will generate class files from Java's source code and the aapt⁵ will transform the AndroidManifest.xml and resource files into an adequate format so that they can be interpreted by the Dalvik VM. The generated class files cannot be interpreted by the Dalvik VM and in order to convert these class files into Dex files, the Android SDK provides a tool called dx. This tool converts class files into the Dex format. Once all the files are compiled, the aapt is tasked with compiling and generating the Android APK file.

⁵ Android Asset Packaging Tool


2.2 Intrusion Detection System

2.2.1 Definition

An Intrusion Detection System, also known as an IDS[24], is a device or software application which monitors a network or system for malicious activities[58].

There are many different types of IDS. The aim of an IDS is to identify and detect anomalies in the system or device that is being monitored. Some classes of IDS will be described below.

• Network-Based

The Network-Based Intrusion Detection System (NIDS) is an intrusion detection system that analyzes network traffic, makes decisions about the purpose of the traffic and scans the network for suspicious activity.

- Wireless

The Wireless Intrusion Detection System (WIDS) is similar to the NIDS. Instead of analyzing wired network traffic, it analyzes wireless traffic to detect suspicious activity.

• Host-Based

Host-Based Intrusion Detection Systems (HIDS) monitor all activity that occurs on the host (the platform comprising the computer hardware and the operating system) being monitored. This system is capable of monitoring features of the system such as power consumption, opened files, system call logs, etc.

This project will use a Host-Based Intrusion Detection System to monitor events on Android devices. Section 3 will describe this approach in further detail.


2.2.2 Detection types

IDS detection types can be divided into two: Signature-Based or Misuse detection and Anomaly-Based detection.

• Misuse detection

The technique of Misuse detection searches for specific indications or patterns of attacks, identifying raw byte sequences, protocol type, port numbers, etc. The aim of this type of detection is to find patterns in raw data. Signatures are created by a group of experts who analyze the code, behavior and manifestation of the malware. Most antivirus companies still use this technique to create malware signatures and patterns.

One of the disadvantages of this detection type is that the system must be familiar with all malware patterns and signatures in advance. This limits its ability to detect new malware.

The process of finding and identifying new types of attacks and malware manually takes experts a great deal of time. Antivirus companies are trying to come up with different alternatives in order to avoid this problem through use of automated processes. Figure 9 shows the differences between the techniques of Misuse detection and Anomaly detection.

Figure 9: Misuse detection versus Anomaly detection


• Anomaly-Based detection

Anomaly-Based Intrusion Detection Systems use a prior training phase to establish a model for normal system activity. This mode of detection is first trained on the normal behavior of the system or application to be monitored. Using this model of normal behavior, it is possible to detect anomalous activities that are occurring in the system by searching the system for strange behavior. This technique is more complex and requires more resources than Misuse detection. Despite this, it has the advantage of being able to detect new attacks.

Typically, Misuse detection tries to identify/classify the new object by consulting known malware or malicious behavior patterns stored in a signature database. Unknown objects are compared with database objects, and if a match is found between the unknown object being analyzed and the database object, the unknown object will be considered suspicious or malware. If there is no match, it will be classified as unknown.

Anomaly-Based detection, on the other hand, creates a pattern of normal behavior based on the system's model of normality. New objects will be compared with the normal behavior pattern, and if any of the objects show any abnormal activity compared to that pattern of normal behavior, they will be considered malicious applications.
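The contrast between the two detection types can be sketched in a few lines of Python. This is only an illustration under simple assumptions: signatures are exact behavior tuples, the model of normality is a single reference vector, and the threshold value is arbitrary. None of these choices is taken from the system described in this thesis.

```python
import math


def misuse_detect(sample, signatures):
    """Signature matching: only samples identical to a known pattern are flagged."""
    return "malware" if sample in signatures else "unknown"


def anomaly_detect(sample, normal_profile, threshold):
    """Flag a sample whose distance from the normal profile exceeds the threshold."""
    distance = math.dist(sample, normal_profile)
    return "anomalous" if distance > threshold else "normal"


# Illustrative data: tuples stand for simplified behavior measurements.
known_signatures = {(0, 9, 9)}
normal_profile = (1, 1, 2)

print(misuse_detect((0, 9, 9), known_signatures))     # known pattern
print(misuse_detect((9, 9, 9), known_signatures))     # new pattern, not in database
print(anomaly_detect((9, 9, 9), normal_profile, 3.0)) # far from normal behavior
```

The second and third calls show the point made above: a sample that is absent from the signature database stays "unknown" to Misuse detection, while Anomaly-Based detection can still flag it because it deviates from normal behavior.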


2.3 System calls and Vectors

In Linux, a system call is the way in which a program requests a service from the operating system's kernel. The Linux kernel has roughly 190 system calls, and each system call is identified by a unique number that is found in the kernel's system call table [27].

A system call is invoked by an application using glibc library functions. Functions like getpid(), open(), read() and socket() are some of the functions that glibc can provide applications with to enable them to invoke a system call. Every time an application from user space makes a request of the OS, the request passes through the glibc library, the system call interface, the kernel and finally reaches the hardware. The glibc library interprets the request and the CPU switches to kernel mode. The system call interface gets the request from the glibc library and executes the appropriate kernel function by consulting the system call table. The kernel must interpret the request from the system call interface and make the request of the hardware platform. Afterwards, the user receives the information requested by the application following the inverse process. Figure 10 describes the Linux user kernel space and the process by which an application sends requests to the hardware platform.

Figure 10: Linux User and Kernel space
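The path from user space to the kernel can be observed from any program that uses the C library wrappers. As a small sketch, Python's `os` module exposes thin wrappers around these functions, so the calls below end up as kernel requests for opening, reading and closing a file. The temporary file and its content are assumptions made only for the demonstration.

```python
import os
import tempfile


def read_via_wrappers(path, size=64):
    """Read a file using the low-level wrappers instead of Python file objects."""
    fd = os.open(path, os.O_RDONLY)  # asks the kernel to open the file
    try:
        return os.read(fd, size)     # asks the kernel to read from the descriptor
    finally:
        os.close(fd)                 # asks the kernel to release the descriptor


# Create a small demo file, read it back, then remove it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello kernel")
    demo_path = tmp.name

data = read_via_wrappers(demo_path)
os.unlink(demo_path)

print("process id:", os.getpid())    # getpid() is also served by the kernel
print("data read:", data)
```

Running such a program under a system call tracer would show one trace line per request, which is the raw material used for the behavior analysis in this project.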

The Linux kernel is executed in the lowest layer of the Android architecture. This means that all requests made from the upper layers pass through the kernel using the system call interface before they are executed in the hardware.


Analyzing all of the system calls that pass through the system call interface will give us an accurate picture of the behavior of the application. The aim of hijacking⁶ these system calls is to create an output file containing all of the events generated by the Android application. This file will provide useful information, such as opened and accessed files, execution timestamps and the number of system calls executed by the application. We will use the number of system call executions performed by the application to represent behavior. Section 2.5 will provide insight into this technique.

This project will use the lists of system calls to create an anomaly detection system, first creating the normality model for the Android application using clean Android applications (applications free of malicious code). As stated above, by extracting the number of system call executions generated by the Android application it is possible to create a behavioral vector representation for Android applications. These vectors will be used to create the normality model or pattern of normal behavior for the application. Here is an example of an Android application behavior system call vector:

0, 0, 0, 25, 47, 4, 34, 0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 260, 9, 0, 0, 0, 0, 1649, 0, 0, 0, 0, 0, 0, 0, 10, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 22, 0, 0, 0, 0, 0, 0, 0, 0, 3466, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 132, 0, 0, 0, 0, 0, 0, 40, 41, 0, 0, 0, 0, 0, 76, 0, 0, 0, 0, 0, 0, 4, 0, 87, 17, 0, ...

Each position in the vector corresponds to one system call, and the number at that position is the number of requests/executions of that system call made by the Android application during the monitoring process. For instance, the system call open() is used 25 times and kill() 47 times. This means that the monitored application used the open() system call 25 times to open files or libraries from the system, and the kill() system call 47 times to kill processes.
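A minimal sketch of how such a vector can be built from a sequence of observed system call names is shown below. The seven-entry index is a hypothetical, shortened stand-in for the full system call list, chosen so that open() and kill() land in the fourth and fifth positions as in the example above.

```python
from collections import Counter

# Hypothetical, shortened system call index; the real list is much longer.
SYSCALL_INDEX = ["exit", "fork", "read", "open", "kill", "write", "close"]


def to_behavior_vector(observed_calls, index=SYSCALL_INDEX):
    """Count each observed call and lay the counts out in index order.

    Calls that do not appear in the index are ignored.
    """
    counts = Counter(observed_calls)
    return [counts.get(name, 0) for name in index]


# Illustrative trace: 25 open() calls, 47 kill() calls and 4 write() calls.
trace = ["open"] * 25 + ["kill"] * 47 + ["write"] * 4
vector = to_behavior_vector(trace)
print(vector)
```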

The list of Android system calls is too large to show here, but the system calls list can be found in the Android Linux kernel[9] bionic folder⁷ or in Section 2 of the Linux kernel manual pages[26].

⁶ Hijacking refers to any illegal action by an attacker to take over a system or steal information.
⁷ bionic/libc/SYSCALLS.TXT


2.4 Data Mining

Data mining is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence in order to obtain useful information. Data mining is also considered to be the set of techniques and technologies used for exploring large databases in order to find repetitive patterns, trends or rules to explain the behavior of a given data set.

Figure 11 shows the sequence of the knowledge discovery process used in databases (KDD) [46] to obtain useful information or knowledge from a raw data set. The KDD process refers to the process of discovering useful knowledge. Data mining refers to a particular step in the process.

Figure 11: Knowledge Discovery in Databases (KDD) process[46]

2.4.1 Data collection in KDD process

1. Selection of raw data: This is the first phase of the KDD process. When we are given a raw data set, the first step is to select information in order to obtain relevant data. This project will use a crowdsourcing application installed on several Android devices and an information collector script to obtain the data set of the behavior of the Android application.

2. Data preprocessing: In order to avoid misleading or inappropriate rules or patterns, it is necessary to filter out irrelevant data. Collecting inappropriate data results in poor interpretation and evaluation of the system, renders the system unreliable and produces undesired results.

3. Data transformation: This will transform relevant data collected from previous phases into a readable and organized structure. This data will determine the outcome of the analysis and will create the data set for the data mining algorithm.

4. Data mining algorithm: This process uses a data mining algorithm to detect rules or patterns from the previously generated data set.

5. Interpretation and evaluation: In this phase a report is generated and the obtained results evaluated.


Data mining techniques can be separated into many categories or groups, but this report will analyze classification and clustering techniques, since these are the most appropriate and relevant for the project.

Classification

This is a technique used in data mining to classify data into different fields or groups. One of the main characteristics of this technique is that the classification of data is based on groups or patterns that are already known. This means that all the information on groups in the system is already defined, and new data will be compared with these groups in order to classify the data.

Clustering

The technique of clustering involves grouping a set of physical or abstract objects into clusters of similar objects. In data mining, a cluster is a collection or group of data that are similar to each other. One of the main differences compared to the classification method is that the clustering method uses raw data to create the groups to be used later in order to make a decision. These are created without any predefined group. The given data set will be responsible for creating the groups or clusters, and afterwards a decision will be made on which cluster the data belongs to. At the beginning there will be no cluster or group created to which to assign the data, so the clustering algorithm will create a random cluster in any position.

One of the easiest ways to decide to which group the data belongs is to measure the Euclidean distance between the data and the formed groups. The Euclidean distance is the straight-line distance between two points, here between a data point and the center of each cluster. The data point is then assigned to the cluster whose center is nearest.
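The assignment rule described above can be sketched as follows. The cluster centers and the sample point are illustrative values, not data from this project.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_cluster(point, centers):
    """Return the index of the cluster center closest to the point."""
    distances = [euclidean(point, center) for center in centers]
    return distances.index(min(distances))


centers = [(0.0, 0.0), (10.0, 10.0)]   # illustrative cluster centers
print(euclidean((0, 0), (3, 4)))       # classic 3-4-5 triangle
print(nearest_cluster((1.0, 1.0), centers))
```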


2.5 K-means Clustering algorithm

Clustering is a common technique used for statistical data analysis in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics[47].

This project will use an unsupervised learning or clustering technique to form groups or cluster patterns in order to find the hidden structure or similarities within the data set. Due to the lack of data sets available for the Android platform, we decided to design an Android application behavior database from scratch, where all the Android app behavior data will be stored.

In order to get satisfactory results in the interpretation and evaluation phase, we must know which clustering method is the most suitable for detecting malicious applications in the Android platform, as well as which can provide the best and the most useful information on the collected data.

This part of the document will describe two different categories of clustering methods: Hierarchical methods and Non-Hierarchical or partitioning methods[74]. Figure 12 shows the taxonomy of clustering methods.

Figure 12: Taxonomy clustering methods

Hierarchical clustering methods create a hierarchy or tree of clusters from a given data set. The root of the tree contains all data observations in a single cluster. The tree creates sub-clusters from the root.

Algorithms used in Hierarchical clustering methods are generally agglomerative or divisive. Agglomerative algorithms start at the leaves of the small clusters and merge into bigger clusters. Divisive algorithms start at the root cluster and recursively split the clusters into smaller ones. Figure 13 shows the graphical representation of agglomerative and divisive methods.


Figure 13: Hierarchical method: Agglomerative vs Divisive
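The agglomerative direction can be sketched in a few lines. The example below assumes single-linkage merging (the distance between two clusters is the distance between their closest members) and stops at a requested number of clusters; both choices are assumptions made for illustration, since the text does not fix a linkage criterion.

```python
import math


def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest members of two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)


def agglomerate(points, target_clusters):
    """Start with one cluster per point and merge the closest pair repeatedly."""
    clusters = [[p] for p in points]          # the leaves of the hierarchy
    while len(clusters) > target_clusters:
        best = None                           # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters


demo_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
demo_clusters = agglomerate(demo_points, 2)
print(demo_clusters)
```

A divisive algorithm would run in the opposite direction, starting from a single cluster that holds all points and splitting it recursively.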

Another method of clustering is the partitioning method. This method sets k number of clusters as the objective, and the data set is split into those clusters. The partitioning method aims to discover clusters by iteration and relocation of points in the data set.

In unsupervised learning, the pattern classification system is built from a set of training patterns whose class labels are not yet known. This occurs when labeling of each individual sample is almost impossible. This type of learning algorithm encompasses algorithms such as neural networks, nearest neighbor, k-means, etc.

Bearing in mind that the objective in this project is to cluster system call behavior vectors into two different clusters, i.e. Good and Malicious application behaviors, it is appropriate to apply the partitioning method using the k-means clustering algorithm.


K-means Clustering algorithm

Every Android application has its own behavior data, and this data will be placed in one of two possible clusters: Good and Malicious behavior clusters, k = 2. The Good application cluster will describe the proper behavior of Android applications and data clustered into the Malicious group or cluster will be considered to be malicious or dangerous applications.

The k-means clustering algorithm[62] is a clustering method which aims to create k clusters, given a data set of n observations.

The k-means clustering algorithm minimizes the following objective function:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\left\| x_i^{(j)} - c_j \right\|^2$ is the distance measured between a data point $x_i^{(j)}$ and the cluster center $c_j$. The objective $J$ indicates the distance of the n data points from their respective cluster centers.
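To make the objective concrete, the sketch below evaluates J for a small hand-made partition. The points and centers are illustrative values; the function name is hypothetical.

```python
def kmeans_objective(clusters, centers):
    """Sum of squared distances of every point to the center of its own cluster."""
    total = 0.0
    for points, center in zip(clusters, centers):
        for point in points:
            total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


# Two clusters: the first has two points around (1, 0), the second a single point.
demo_partition = [[(0, 0), (2, 0)], [(10, 10)]]
demo_centers = [(1, 0), (10, 10)]

objective = kmeans_objective(demo_partition, demo_centers)
print(objective)
```

Each of the two points in the first cluster lies at squared distance 1 from its center and the single point in the second cluster coincides with its center, so J is 1 + 1 + 0 = 2 for this partition.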

Table 4 shows the steps of the k-means clustering algorithm:

1. Randomly place K cluster points into the space represented by the n objects. These points will represent the initial centroids of the clusters.

2. Assign every object to the group that has the closest centroid.

3. When all objects have been assigned, recalculate the positions of the K centroids.

4. Repeat the 2nd and 3rd steps until the centroids stop moving. This produces a separation of the objects into groups.

Table 4: K-means Clustering algorithm process
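The four steps of Table 4 can be sketched in plain Python as follows. This is not the Matlab implementation used in the project. Two details are assumptions of the sketch: the initial centroids are drawn from the data points themselves rather than placed anywhere in the space, and a cluster that becomes empty keeps its previous centroid.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups; return (centroids, groups)."""
    rng = random.Random(seed)
    # Step 1: choose k initial centroids (assumption: sampled from the data).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    groups = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Step 2: assign every point to the group with the closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            groups[nearest].append(p)
        # Step 3: recalculate each centroid as the mean of its group.
        new_centroids = []
        for j, group in enumerate(groups):
            if group:
                dim = len(group[0])
                mean = tuple(sum(p[d] for p in group) / len(group) for d in range(dim))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # empty group: keep old centroid
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, groups


demo_data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centroids, final_groups = kmeans(demo_data, k=2)
print(final_groups)
```

With k = 2, as in this project, the two resulting groups play the role of the Good and Malicious behavior clusters; which group is which has to be decided afterwards, since the algorithm itself does not label the clusters.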

We suppose that we are given a data set, P, of n observations, with a typical entry being p_i, where each p_i is a vector of D numbers.

We can think of each p_i as a point in a D-dimensional space. Every p_i vector in the data set will represent a system call vector produced by the user.


Figure 14: K-means applied as a detection system for Android system calls

The n observations will be the set of system call vectors collected by monitoring the Android applications, and each data point $x_i^{(j)}$ will be one such system call vector. Applying the k-means algorithm to the Android application vector data set will create two clusters, with the good and malicious Android applications classified (k=2) as described below.

The speed of the algorithm and the results obtained in training and test evaluation are the main reasons we chose to use the k-means algorithm in this project. Another reason why we chose k-means was the simplicity of implementation in Matlab.

One of the most important tasks of the clustering algorithm is the selection of the distance measure. This measurement will determine the cluster to which the data belongs. The calculation of this distance may vary depending on which mathematical formula is used in the process. Euclidean, Manhattan, Mahalanobis and Hamming distances are some of the most commonly used functions to measure such distances.
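Three of the distance functions named above can be compared on the same pair of vectors. The sketch below leaves out the Mahalanobis distance, because it additionally requires the covariance matrix of the data set; the two short vectors are illustrative.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def hamming(p, q):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(p, q) if a != b)


vector_a = (0, 25, 47)   # illustrative system call counts
vector_b = (3, 21, 47)

print(euclidean(vector_a, vector_b))
print(manhattan(vector_a, vector_b))
print(hamming(vector_a, vector_b))
```

The same pair of vectors gives a different value under each measure, which is why the choice of distance can change the cluster to which a vector is assigned.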

2.6 Crowdsourcing

Jeff Howe defined crowdsourcing[59] as the act of exporting tasks traditionally performed by one or more employees to an indefinite group of persons or a community through an open call.

Using the crowdsourcing technique, we divided the responsibility of creating the Android application data set between the users of the Android Community. Considering that there are more than 8 million Android users in the world, using this technique to collect information from many different Android devices is a very appealing option.


Chapter 3

3 Behavior-Based malware detection system for Android Applications

3.1 Overview

The implementation of malware detection systems in mobile devices is a fairly new concept that is gaining a lot of attention. Applying the security tools and mechanisms used in computers to smartphones is not a feasible choice due to excessive resource and energy consumption. Because of this, we decided to perform the entire analysis process on a dedicated remote server. This server will be dedicated exclusively to detecting malicious and suspicious applications on the Android platform.

Figure 15 describes the general scheme of the behavior-based malware detection system for Android applications.

Figure 15: Android malware detection system scheme


As the Android market is an open-market system, users can download their applications from sources other than the Android ocial market. As a result, many users end up making heavy use of non-ocial Android repositories where a lack of supervision and control can result in their downloading third party applications that may contain malicious code. The aim of the server is to per-form dynamic analysis of Android applications to detect anomalies which may be dangerous for the user.

Using information collector applications such as crowdsourcing and the data collector script, we can obtain the necessary information from Android applica-tions and perform malware analysis on the system.

Using the crowdsourcing application installed on Android devices, commu-nity users will have a chance to contribute to the project by sending recorded log les of the behavior of Android applications to our malware detection server. All collected log data les result from use of the Strace Linux tool with Android applications8. This tool is assumed to be installed on each user device.

Strace will collect information on the system calls executed by the application. The monitored system call logs and device information files will be stored on the SD card and sent to the malware detection system by an FTP client in the crowdsourcing application. The FTP server will be responsible for collecting the information sent by the crowdsourcing application and an information collector script. The data collector script will process and parse the data collected from Android users' applications and create the system call vectors. Afterwards, the k-means clustering algorithm, run in Matlab, will use these system call vectors to detect anomalies in the applications.
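The clustering step can be illustrated with a minimal sketch. The thesis runs k-means in Matlab; the Python version below is only an illustration of the idea, not the implementation used in this work. It assumes two clusters, fixed starting centroids and small made-up system call vectors, all of which are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, initial_centroids, max_iterations=100):
    """Plain k-means: assign each vector to its nearest centroid, then
    move each centroid to the mean of its members, until the
    assignment stops changing."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean(members)
    return labels, centroids


# Hypothetical system call vectors: three similar traces and one outlier.
vectors = [[10, 2, 1], [11, 3, 0], [9, 2, 1], [40, 30, 25]]
labels, centroids = kmeans(vectors, [vectors[0], vectors[3]])
```

In this toy input the fourth vector ends up alone in its own cluster, which is the kind of separation an anomaly detector would look for. How clusters are interpreted as benign or malicious is not covered by this sketch.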


3.2 Android Data mining: Crowdsourcing and Self-written applications

In order to collect Android application data, we will use two data collector applications. The first one is a crowdsourcing application developed for Android devices and the second one is a script running on the Android Emulator.

The first attempt we made to collect data was carried out by a script using the thirty most downloaded applications from the Android market in 2010. The purpose of the script was to monitor Android emulator activity and generate reports based on the analysis.

The second data mining trial was carried out by the crowdsourcing application for Android devices. The aim of the application was the same as that of the previous script, but this time the Android user community was used.

Both applications were able to collect essential information from Android devices, such as installed applications, device information and, most importantly, the system call log files. See Figure 16. The system call log files contain the system call sequence generated by Android applications. Parsing these data points with a script will produce the system call vectors that will be used in the Android malware detection system.
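The parsing step can be sketched as follows. This is an illustrative Python example, not the Perl script used in this work. It assumes that a system call vector holds the number of times each call appears in the log, and that log lines follow the usual Strace layout of a call name followed by its arguments. The list of tracked calls and the sample log lines are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, fixed ordering of the system calls tracked in each vector.
SYSCALLS = ["open", "read", "write", "close", "ioctl", "recv", "send"]

# Matches the call name at the start of a typical strace line, optionally
# preceded by a process id (as written when child processes are followed).
_CALL = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscall_vector(log_lines, syscalls=SYSCALLS):
    """Count how often each tracked system call occurs in a strace log.

    Lines that do not start with a call name (signals, exit notices,
    resumed calls) are ignored.
    """
    counts = Counter()
    for line in log_lines:
        match = _CALL.match(line)
        if match:
            counts[match.group(1)] += 1
    return [counts[name] for name in syscalls]


# Made-up log excerpt for illustration.
sample_log = [
    'open("/data/app/x.apk", O_RDONLY) = 3',
    'read(3, "PK"..., 4096) = 4096',
    '1234 read(3, "", 4096) = 0',
    '[pid  1234] close(3) = 0',
    '--- SIGCHLD (Child exited) ---',
    '<... read resumed> ) = 5',
]
vector = syscall_vector(sample_log)
```

Keeping the order of `SYSCALLS` fixed means that vectors from different applications can be compared position by position.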

Figure 16: Data acquisition process

The aim of the crowdsourcing application and the data collector script is to collect as much information as possible from the Android devices and applications.


3.2.1 Android Data collector script

As described above, the first data mining trial used a script to collect information from Android applications.

The purpose of the script was to:

• Use Android APK applications for training or testing the system.

• Install/Uninstall applications on the emulator or real Android device.

• Collect Linux system calls using the Linux tool Strace.

• Parse the collected data to create system call vectors, device information files and a list of other actions performed by Android applications, such as opened files or accessed directories, execution timestamp, etc.

• Compile the report for the analyzed applications.

The data collector script is written in Perl. This gives us the opportunity to run the script on several operating systems without changing it in any way. Figure 17 shows the User Interface (UI) of the script.

Figure 17: Data collector script user interface

Figure 18 describes the data collector script in greater detail.


The data collector script allows us to choose between installing applications on the Android emulator or on a real device. The Training Data and Test Data folders contain both good and malicious Android applications. In order to create the good behavior pattern for Android applications, we will use applications from the Training Data folder in a training phase.

The script will install applications from the training data folder and users will start to interact with the installed application. The script will start monitoring and recording all system calls executed by an application. Afterwards, the script will remove the application from the device and create a new, clean instance of the system or emulator. This procedure ensures that every monitored application has the same initial system condition and configuration. Applications in the Test Data folder will undergo the same procedure as the training data applications.
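The per-application procedure can be outlined as a list of commands. The sketch below is written in Python for illustration and is not the Perl script described here. It assumes the standard `adb` client is available on the host and that a `strace` binary is present on the device. The function name, package name, process id and log path are hypothetical, and the step that resets the emulator to a clean instance is left out.

```python
def monitoring_commands(apk_path, package, pid, log_path):
    """Return the commands for one monitoring run, in order.

    The commands are only built here, not executed, so the plan can be
    inspected or logged before anything touches a device.
    """
    return [
        # 1. Install the application under test.
        ["adb", "install", apk_path],
        # 2. Attach strace to the running application process and write
        #    the system call trace to a file on the device. This command
        #    keeps running until the trace is stopped.
        ["adb", "shell", "strace", "-f", "-p", str(pid), "-o", log_path],
        # 3. Copy the trace from the device to the host.
        ["adb", "pull", log_path],
        # 4. Remove the application again.
        ["adb", "uninstall", package],
    ]


# Hypothetical values for illustration.
commands = monitoring_commands(
    apk_path="TrainingData/demo.apk",
    package="com.example.demo",
    pid=1234,
    log_path="/sdcard/demo_strace.log",
)
```

Each entry could be passed to `subprocess.run` in turn. Separating planning from execution keeps the sequence identical for training and test applications.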

Finally, the script will create a folder with all monitored/recorded applications. Steps 4, 5 and 6 on the UI (Figure 17) will obtain the Android device information file and the installed application file, and create the system call vector file.

Figure 18: Data collector script process


The script was designed to automate most of the data mining process and interaction within the system. At first we decided to use a pseudo-random action event tool called ADB Monkey [2] for interacting with and collecting information from Android applications. Taking into account the fact that there are more than 250,000 applications available in the Android Market, it was natural to conclude that we needed to use an automatic process to record and interact with the applications. After several attempts, we realized that ADB Monkey was generating flawed pseudo-random events in Android applications. Considering this, data generated by this application was unsuitable for processing and for using with the system if we intended to have good results.
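For reference, a Monkey run of the kind described above is started through `adb shell monkey`. The Python sketch below only builds such a command. The helper name, package name, event count and seed are hypothetical and do not reflect the settings used in our attempts.

```python
def monkey_command(package, event_count, seed=None, throttle_ms=None):
    """Build an `adb shell monkey` command for one application.

    -p restricts events to the given package, -s fixes the seed of the
    pseudo-random generator so a run can be repeated, --throttle adds a
    delay in milliseconds between events, and -v enables verbose output.
    """
    command = ["adb", "shell", "monkey", "-p", package]
    if seed is not None:
        command += ["-s", str(seed)]
    if throttle_ms is not None:
        command += ["--throttle", str(throttle_ms)]
    command += ["-v", str(event_count)]
    return command


# Hypothetical example: 500 pseudo-random events with a fixed seed.
command = monkey_command("com.example.demo", 500, seed=42)
```

Fixing the seed makes a run reproducible, but the events remain random taps and gestures rather than meaningful user behavior.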

Our next approach was to teach ADB Monkey to behave and interact with Android applications in the same way as humans. We realized, however, that this technique required artificial intelligence knowledge and generated too much data processing work. Because of the complexity of writing a program that behaves like a human, we decided to use a normal user to create the data. Even so, this technique has a disadvantage: a single user has to create the data set for more than 250,000 Android applications. Spending just 5 minutes per application on monitoring and recording application system calls and the Android device information would require the user to spend more than two years of uninterrupted work collecting all of the information for the Android market apps.
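The effort estimate can be checked with a short calculation. The sketch below uses the figures from the text, 250,000 applications at 5 minutes each, and assumes continuous work around the clock with no breaks.

```python
APPS = 250_000          # applications in the Android Market (from the text)
MINUTES_PER_APP = 5     # monitoring time per application (from the text)

total_minutes = APPS * MINUTES_PER_APP
total_days = total_minutes / (60 * 24)   # minutes in a day
total_years = total_days / 365

print(f"{total_minutes} minutes = {total_days:.0f} days = {total_years:.2f} years")
```

Under these assumptions the total comes to about 868 days, or roughly 2.4 years. With ordinary working hours the figure would be several times higher.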

We realized that even if we decided to use this technique for the 30 most important applications available on the Android market in January 2011, testing 30 applications would not be sufficient to determine and create a Malware pattern for Android applications.
