Extended Abstract : Detecting Scareware by Mining Variable Length Instruction Sequences


Electronic Research Archive of Blekinge Institute of Technology

http://www.bth.se/fou/

This is an author produced version of a conference paper. The paper has been peer-reviewed but may not include the final publisher proof-corrections or pagination of the proceedings.

Citation for the published conference paper:

Title: Extended Abstract: Detecting Scareware by Mining Variable Length Instruction Sequences

Author: Raja Khurram Shahzad, Niklas Lavesson

Conference Name: 11th Scandinavian Conference on Artificial Intelligence

Conference Year: 2011

Conference Location:

Access to the published version may require subscription.

Published with permission from: IOS Press


Extended Abstract: Detecting Scareware by Mining Variable Length Instruction Sequences

Raja Khurram SHAHZAD and Niklas LAVESSON

School of Computing, Blekinge Institute of Technology, SE-371 32 Karlskrona, Sweden, {rks,nla}@bth.se

Introduction

Scareware comprises scam applications that usually masquerade as security software, such as fake anti-virus programs, and display fake scanning processes and results to scare users into believing that their systems have been infected with malicious content. Traditional countermeasures rely on either signature-based or heuristic-based methods and cannot detect novel instances of scareware: for both methods, anti-malware vendors must first catch novel instances, analyze them, create new signatures or rules, and then update their databases. A detection method that generalizes to novel instances is therefore arguably important for user protection. A further problem is that the differences between scareware and legitimate software are subtler than those between, say, viruses and legitimate software. That is, there is less information that can be used to distinguish between the two.

To our knowledge, no other studies have been conducted regarding the detection of scareware. There could be two reasons for this. On the one hand, researchers may have regarded scareware as just another type of malware and thus assumed that previous malware detectors should work for scareware as well. On the other hand, scareware may have been regarded as harmless. We argue that scareware is too distinct from other forms of malware for previous detectors to work, and that the risks of scareware could be substantial, primarily because scareware uses social engineering to gain access to the complete file system of personal computers.

1. Methodology

We have investigated how well classification algorithms can detect scareware and present a static analysis method based on data mining. Our data set contains 550 scareware and 250 benign Windows-based executables. The scareware sample was provided by Lavasoft [1] and the benign files were downloaded from download.com [2]. We represent files by using extracted instruction sequences (ISes) as features. We disassembled the files with the Netwide Disassembler and first extracted the opcodes from each file. These opcodes were combined into ordered lists, the instruction sequences, to produce our vocabulary. During extraction, the end of an instruction sequence was determined by identifying a conditional or unconditional control transfer instruction or a function boundary. We used the bag-of-words model and the StringToWordVector filter in Weka to produce word vectors. In our experiment, each word represents a unique IS. We used TF-IDF to calculate the weight of each word. Our final vocabulary contains 1,625 unique words.

We also applied categorical proportional difference (CPD) [3] to obtain reduced feature sets. As it is difficult to know beforehand the optimal number of features to remove, we generated 19 reduced data sets in which 5-95% of the original features were kept. To evaluate classifier performance on the task of scareware detection and to assess the impact of feature selection using CPD, seven common learning algorithms, namely ZeroR (baseline), SMO, JRip, J48, Naive Bayes, IBk and Random Forests, were evaluated using 10-fold cross-validation.

The following four measures define the elements of the generated confusion matrices: True Positives (TP) are correctly identified scareware programs, False Positives (FP) are legitimate programs that were classified as scareware, True Negatives (TN) are correctly identified legitimate programs, and False Negatives (FN) are scareware programs that were incorrectly classified as legitimate software. We argue that false negatives carry the highest cost from the users' perspective. The performance of each classifier was evaluated using the Detection Rate, which is the percentage of scareware programs that were correctly identified, TP / (TP + FN), and the False Negative Rate, which is the percentage of scareware programs that were incorrectly classified as legitimate, FN / (TP + FN). Finally, the Area Under the Receiver Operating Characteristic Curve (AUC) was used as the primary measure of classifier performance.
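The extraction step described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the set of control transfer mnemonics, the choice to include the terminating instruction in the sequence, the use of `ret` as a stand-in for a function boundary, and the function names `extract_sequences` and `bag_of_words` are all assumptions made for illustration. The sketch splits an ordered opcode list into variable-length sequences and counts each sequence as one "word".

```python
from collections import Counter

# Assumption: a small illustrative set of x86 control transfer mnemonics.
# The paper does not list the exact opcodes that terminate a sequence.
CONTROL_TRANSFER = {"jmp", "je", "jne", "jz", "jnz", "ja", "jb", "jg", "jl",
                    "call", "ret", "loop"}


def extract_sequences(opcodes):
    """Split an ordered opcode list into variable-length instruction sequences.

    Assumption: a sequence ends at, and includes, a control transfer
    instruction. Trailing opcodes without a terminator form a final sequence.
    """
    sequences, current = [], []
    for op in opcodes:
        op = op.lower()
        current.append(op)
        if op in CONTROL_TRANSFER:
            # Join the opcodes so that each sequence becomes a single token.
            sequences.append("_".join(current))
            current = []
    if current:
        sequences.append("_".join(current))
    return sequences


def bag_of_words(opcodes):
    """Count how often each instruction sequence ("word") occurs in one file."""
    return Counter(extract_sequences(opcodes))


if __name__ == "__main__":
    sample = ["push", "mov", "call", "mov", "cmp", "jne", "xor", "ret", "nop"]
    print(extract_sequences(sample))
    print(bag_of_words(sample))
```

Counts of this kind, one vector per file, are the sort of input on which a TF-IDF weighting such as the one mentioned above would then operate.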

2. Results and Analysis

Random Forests outperformed the other algorithms, and its best performance (AUC = 0.97) was recorded on the data set in which 60% of the original features were kept after feature selection (a data set with 974 features). On this data set, the other algorithms also achieved either their highest AUC or a value negligibly close to it. Naive Bayes exhibited a high variance in performance across the different data sets, which makes it unreliable for the studied problem.
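How a reduced data set such as the 60% one can be derived is sketched below. This is a minimal illustration, not the authors' code: it assumes the common two-class form of CPD, |A - B| / (A + B), where A and B are the numbers of scareware and benign files that contain a given word, and it assumes that features are ranked by this score and the top fraction is kept. The function names `cpd_scores` and `keep_top_fraction`, the label string "scareware", and the toy data are hypothetical.

```python
from collections import defaultdict


def cpd_scores(documents, labels, positive="scareware"):
    """Return a CPD score per word for a two-class problem.

    documents: list of iterables of words (instruction sequences), one per file.
    labels: list of class labels, aligned with documents.
    Assumption: CPD(w) = |A - B| / (A + B), with A and B the number of
    positive and negative files that contain w.
    """
    pos, neg = defaultdict(int), defaultdict(int)
    for words, label in zip(documents, labels):
        target = pos if label == positive else neg
        for w in set(words):  # count each word once per file
            target[w] += 1
    vocab = set(pos) | set(neg)
    return {w: abs(pos[w] - neg[w]) / (pos[w] + neg[w]) for w in vocab}


def keep_top_fraction(scores, fraction):
    """Keep the given fraction of words with the highest CPD scores."""
    # Ties are broken alphabetically so that the result is deterministic.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    n_keep = max(1, round(len(ranked) * fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    docs = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"}]
    labels = ["scareware", "scareware", "benign", "benign"]
    scores = cpd_scores(docs, labels)
    print(keep_top_fraction(scores, 0.5))
```

A word that occurs in only one class scores 1, and a word spread evenly over both classes scores 0, so keeping the top-ranked fraction discards the words that discriminate least between scareware and benign files.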

3. Conclusions and Future Work

We have used a variable length instruction sequence mining approach for the purpose of scareware detection. Due to the absence of publicly available data sets, we obtained a large sample of scareware applications and designed an algorithm for extracting instruction sequences from this sample. The experimental results are promising: the Random Forests algorithm yielded an AUC score of 0.97. For future work, we aim to combine visualization and data mining into a suitable approach both to improve the scareware detection rate further and to identify and understand the general patterns that are characteristic of the scareware class.

References

[1] Lavasoft AB, “Lavasoft.” [Online]. Available: http://www.lavasoft.com/. [Accessed: 13-Jul-2010].

[2] CNET, “Free software downloads.” [Online]. Available: http://download.cnet.com/. [Accessed: 02-Jan-2010].

[3] M. Simeon and R. Hilderman, “Categorical Proportional Difference: A Feature Selection Method for Text Categorization,” in Seventh Australasian Data Mining Conference (AusDM 2008), vol. 87, pp.
