Implementation and evaluation of some platform independent obfuscating transformations


Academic year: 2021



Degree project in Computer Science, Second cycle

Implementation and evaluation of some platform independent obfuscating transformations


Implementation and evaluation of some platform independent obfuscating transformations

OSKAR WERKELIN AHLIN

Master’s Thesis at CSC
Supervisor: Mikael Goldmann

Examiner: Johan Håstad


Abstract


Implementation och utvärdering av några plattformsoberoende obfuskeringstransformationer


Acknowledgements


Contents

1 Introduction
  1.1 Motivation
  1.2 Purpose
  1.3 Outline

2 Background
  2.1 Definitions
  2.2 Code obfuscation
    2.2.1 History
    2.2.2 Definition
    2.2.3 State of the art
  2.3 Compiler technology
    2.3.1 Front end
    2.3.2 Back end
    2.3.3 Common compilers
  2.4 Reverse engineering
    2.4.1 Static analysis
    2.4.2 Dynamic analysis
    2.4.3 Program slicing
  2.5 Code obfuscation methods
    2.5.1 Layout obfuscation methods
    2.5.2 Data obfuscation methods
    2.5.3 Control flow obfuscation methods
  2.6 Analysis metrics
    2.6.1 Potency
    2.6.2 Resilience
    2.6.3 Execution cost
    2.6.4 Stealth
    2.6.5 Ease of integration

3 Implementation
  3.1 Obfuscation
  3.3 Pointer analysis

4 Method
  4.1 Obfuscation
    4.1.1 Variable aliasing
    4.1.2 Call graph flattening
  4.2 Analysis methods
    4.2.1 Halstead’s Difficulty Metric
    4.2.2 Cyclomatic complexity
    4.2.3 Nesting complexity

5 Analysis
  5.1 Analysis method
    5.1.1 Evaluating call graph flattening

6 Results
  6.1 Variable aliasing
  6.2 Call graph flattening

7 Discussion
  7.1 Variable aliasing
    7.1.1 Potency
    7.1.2 Resilience
    7.1.3 Execution cost
    7.1.4 Ease of integration
  7.2 Call graph flattening
    7.2.1 Potency
    7.2.2 Resilience
    7.2.3 Execution cost
    7.2.4 Ease of integration

8 Conclusions
  8.1 General
  8.2 Obfuscation quality
  8.3 Variable aliasing
  8.4 Call graph flattening

9 Future work
  9.1 Variable aliasing
    9.1.1 Adding pointer support
  9.2 Call graph flattening
    9.2.1 Extending to control flow flattening
  9.3 Array restructuring
    9.3.1 Algorithm


Chapter 1

Introduction

There are many techniques a company can use to protect its intellectual property against competitors and its services against unintended use. Code obfuscation is one such technique: a collection of methods that can be used to protect software against reverse engineering. Reverse engineering is a process that transforms the output of another process back into its input, in part or completely, for instance to reconstruct source code from a compiled program. Applying code obfuscation does not guarantee that a program cannot be reverse engineered, but its purpose is to make the process of reverse engineering prohibitively expensive. Combined with other techniques for protecting software against third party tampering, code obfuscation can serve as an important tool for software protection in industry.

1.1

Motivation

Consider the process of distributing software from one party to another. One party acts as the distributor (the company) and the other acts as the user of the software. By tradition, we will call the first party Alice and the second party Bob. Alice wants to distribute her software service in such a way that Bob cannot easily reverse engineer it and reveal the secrets of her code. The secrets can be anything from encryption keys to algorithms. It is important that Bob can run the software efficiently in the way it was intended; otherwise Bob might choose another service instead.


Internet access while running the program, and it also might require Alice to solve the problem of dealing with high server load.

Another solution for Alice could be to use hardware support to secure the software: a specific hardware device would be required to use the service, forcing an attacker to analyze how that device works. Even though such an approach seems secure, distributing the program would be tedious due to the requirement of a physical hardware device. Furthermore, if an attacker is able to disable or bypass the device, the system would still be open to attacks.

Code obfuscation aims to hide Alice’s secrets by obscuring the code from third party inspection. Code obfuscation does not prevent Bob from running the program in the way he usually does; in fact, it preserves its functionality, albeit at a slight decrease in performance. Bob may try to reverse engineer the code, but the obfuscations will make that process more difficult, time consuming and bothersome.

Alice wants to distribute her software to as many customers as possible, so she is planning to do so for desktops as well as mobile clients. Most of her customers use standard operating systems such as Microsoft Windows or Apple Mac OS X on a personal computer, but there are also people who would like to use her software on their smartphones, tablets, etc. Recently she has also been approached by companies who want to embed her application into their hardware. She wants to protect her software on all of these platforms.

Over the years, tools have appeared that can obfuscate code for specific platforms, but since they depend on those platforms, they need to be revamped as new platforms appear. Each of them implements different obfuscation methods. The methods have different characteristics regarding how difficult they are to implement, the cost of using them and how easy they are for an attacker to remove. We are interested in looking at a subset of these methods that do not depend on the target platform.

1.2

Purpose

The long-standing problem of protecting software secrets against a potentially malicious third party today needs to be addressed for multiple platforms, ranging from common personal computers to integrated systems with specially designed operating systems. Even more so with new platforms continually emerging, platform independent code obfuscation could be a potent, economical and practical protection against a malicious third party.



Finally, the aim of the report is to present recommendations on how a system for target platform independent code obfuscation could be built.

1.3

Outline

The Background chapter serves as an introduction to the current research in the field, both in terms of what methods for code obfuscation exist and how they can be evaluated and analyzed. In the Method chapter we lay out and motivate the selection of the approach that was used for evaluating the chosen obfuscations. We also present and motivate our choice of analysis methods, which have been derived from current research in the subject area. In the Results chapter we present the results of the analysis. We discuss our results in the Discussion chapter, and present our final conclusions in the Conclusions chapter. Finally, we discuss future work and what could have been done differently in the Future work chapter.


Chapter 2

Background

In this chapter we describe important theory and concepts that can aid in understanding the following sections.

2.1

Definitions

This section defines terms that are used throughout the report.

Obfuscator A program that obfuscates an input program, emitting an obfuscated output program with equivalent semantics compared to the input.

Toolchain A set of tools that are used to create a software product. Tools are usually chained in a specific way. A simple example of a toolchain is a source code editor and a compiler used to compile the source code.

CPU Central processing unit, commonly referred to as the computer’s processor.

2.2

Code obfuscation

Code obfuscation is a collective term for techniques that make code harder to read and reverse engineer.

2.2.1 History


had been sparse until then. At first there were hopes that obfuscation could be used as a black box providing encryption similar to public key encryption. In 2001 Barak [1] showed that no obfuscation could exist in this sense. The results found by Barak more or less halted the academic research in the area of code obfuscation for many years, although companies kept using it practically. Little academic progress was made until 2004 when Lynn et al. [16] showed the first positive results about obfuscation. They showed how obfuscation can be applied on access control graphs, and observed that a similar approach probably could be used for obfuscating finite automata or regular expressions.

Code obfuscation has been used in one way or another in the industry for a long time. Lately, its usefulness in malware creation has been discovered [21], making it harder to do code analysis on viruses thus making it harder for anti-virus programs to detect malicious software. Much of the academic research during the last years has thus focused on deobfuscating obfuscated software in order to protect users from malware.

2.2.2 Definition

In order to reason about the correctness of an obfuscation method, we first need to define formally what an obfuscation is. Collberg, et al. [2] proposed the following definition:

Let P → P′ be a transformation of a source program P into a target program P′.

P → P′ is an obfuscating transformation if P and P′ have the same observable behavior. More precisely, in order for P → P′ to be a legal obfuscating transformation, the following must hold:

• If P fails to terminate or terminates with an error condition, then P′ may or may not terminate.

• Otherwise, P′ must terminate and produce the same output as P.

Collberg, et al. defined the observable behavior to be the behavior of the program “as experienced by the user”. What this means is that P′ may have side effects that P does not have, but they should be acceptable and go unnoticed by the user. Collberg, et al. further elaborate on the memory usage and speed differences between P and P′ and state that such differences are valid side effects.

2.2.3 State of the art


software running on different architectures to use different obfuscation tools.

Themida

Themida is a proprietary software protection system for Windows, developed by Oreans Technologies. It works on the binary level, obfuscating compiled Windows x86 binaries. Its notable features include anti-debugging, anti-memory-dump protection (a memory dump is a snapshot of the working memory of a process at a specific time) and binary integrity checks (ensuring that the program can only run if it has not been modified). It also includes functionality for hiding code inside a virtual machine which is itself obfuscated, increasing the obfuscation level significantly. As Themida not only obfuscates code, it is much more than an obfuscation tool, aiming to protect a program from reverse engineering altogether. [22]

Morpher

Morpher is a compiler driven obfuscation tool developed by MTC Group LTD. Morpher supports a large number of obfuscation methods, which can be used in arbitrary combinations. It makes use of its tight coupling with the languages’ compilers to apply more sophisticated transforms that require more information about the source code.

It has support for standard C/C++, and limited support for Fortran95 and Ada. Supported architectures are x86, PowerPC and ARM among others. The obfuscator tool is built on llvm-gcc (see Section 2.3.3). Among its notable features Morpher has support for protecting constant values (e.g. values are stored encrypted, and decrypted only upon use) and function transformations. [15]

Diablo

Diablo is an open source infrastructure for rewriting binaries. Development is currently stalled, but some obfuscation methods have been implemented previously. One of its features is control flow graph flattening. Support exists for x86 and ARM. [10]

2.3

Compiler technology


A standard compiler can be divided into two main parts, namely the front end and the back end. There are no exact rules for how a compiler should be written; the description below rather describes one common approach. This section is based on Engineering a Compiler by Cooper and Torczon [4].

2.3.1 Front end

The compiler front end consists of lexical analysis, parsing and semantic analysis. Lexical analysis and parsing are often jointly referred to as syntactical analysis.

Lexical analysis

Lexical analysis is the process of transforming the stream of characters of the input source code into a sequence of tokens describing the program; each token corresponds to a sequence of characters in the source code. Examples of tokens that can be useful for a C compiler front end are IF, ELSE, FOR, RETURN and AND, but a C string literal such as “a string” can also make up a token. The purpose of the lexical analysis is primarily to make the code easier to process for the parser. Listing 2.1 shows a program and Listing 2.2 shows a possible result from an imaginary lexer when fed the program in Listing 2.1. Note that each token from the lexical analysis is often associated with the sequence of characters it represents in the source input. This information is needed to e.g. extract the contents of a string or the name of an identifier.

Listing 2.1. A simple example program.

int main(int argc, const char **argv) {
    printf("%d %s\n", argc - 1, argv[0]);
    return 0;
}

Listing 2.2. Example of what the lexical analysis of the program in Listing 2.1 might look like.

INT ID PAR_BEGIN INT ID COMMA CONST CHAR STAR STAR ID PAR_END
CURLY_BEGIN ID PAR_BEGIN STRING COMMA ID MINUS NUM COMMA ID
SQUARE_BEGIN NUM SQUARE_END PAR_END SEMICOLON RETURN NUM
SEMICOLON CURLY_END

Parsing


language. A node may have static attributes, such as the name of the function in a function declaration, and child nodes, such as the arguments in a function call. The parser ensures that the input is part of the source language grammar, and emits a syntactic error if this is not the case. Figure 2.1 shows the output from an imaginary parser fed with the output from Listing 2.2.

Figure 2.1. Example of an AST that could be generated from the program in Listing

2.1.

Semantic Analysis

Input to the semantic analysis is an AST. The purpose of the semantic analysis is solely to ensure that the program is semantically correct. Common errors that the semantic analysis should detect are references to non-existing identifiers (names of variables, functions and similar constructs), type errors, etc. To help in the process, the compiler creates a symbol table, which contains information about the visible identifiers in different scopes within the code. Apart from being used as a means to do semantic analysis of the code, the symbol table can be output for further use by later phases in the compiler.

Intermediate code generation

A compiler may support multiple input languages and multiple output formats. Common input languages are, for example, C and C++, and it is not uncommon that a user of a compiler wants to target two or more completely different platforms, such as x86 and ARM (we ignore the target operating system in this example). One way to support this is to write one compiler for compiling C code to x86, another for compiling C++ to x86, and similarly one more compiler per input language for ARM native code. In the general case this would require M · N compilers, where M is the number of input languages and N is the number of output formats.


Figure 2.2 illustrates this process. The intermediate representation needs to be expressive enough to cover all constructs in all input languages. In contrast to the specialized approach with M · N compilers, this approach only requires M compiler front ends and N compiler back ends. Furthermore, the most difficult and advanced parts of a compiler are its analysis and optimizations. Parts of the analysis and optimizations can be done on the intermediate code, reducing the amount of work required to write a back end for a new output format.

Figure 2.2. An illustration of the purpose of the intermediate code.

It should be noted that almost every compiler uses its own language and conventions for IR code. One common property shared by many of these languages is static single assignment (SSA) form. In SSA form each variable is assigned exactly once, i.e. it is not allowed to write to the same variable multiple times. A simple example of IR code in SSA form can be seen in Figure 2.3.

x ← 0          x1 ← 0
y ← 5          y1 ← 5
y ← x ∗ y      y2 ← x1 ∗ y1
x ← x ∗ y      x2 ← x1 ∗ y2

Figure 2.3. A sample program to the left and its SSA transformation to the right.

2.3.2 Back end


Analysis and optimizations

One of the most important properties of a compiler is the quality of its optimizations. In order to optimize the code, the compiler needs to analyze it. The analysis done by a compiler has much in common with the static analysis an attacker can perform, as described later.

Input to the optimization phase is intermediate code. The intermediate code is parsed into one or more control flow graphs (CFGs), which are similar to an AST but at a much more detailed level. The nodes of a CFG are basic blocks. A basic block is a small piece of code that fulfills certain requirements; most importantly, only the last instruction in a basic block may be a branch instruction. Each branch instruction corresponds to directed edges in the CFG from one basic block to one or more basic blocks. In the most basic case, each function in the source code translates to one CFG. A CFG may have one or more exit nodes, corresponding to different return points in a function, and normally has one entry point, corresponding to the start of the function it models. Note that a CFG may also be used to model things other than functions, such as whole programs.

Listing 2.3. A sample program.

void f(int x) {
    ++x;              // block 1
    if (x == 1)
        x = 0;        // block 2
    else
        ++x;          // block 3
    printf("%d", x);  // block 4
    return;
}

Figure 2.4. The control flow graph of the function in Listing 2.3. Nodes represent basic blocks.


Code generation

Input to the code generation is intermediate code. The code generation phase of the compiler is responsible for instruction selection and register allocation. Instruction selection transforms the instructions of the intermediate code into instructions in the target’s native code such that the semantics of the program is preserved. Register allocation is the process of efficiently assigning processor registers to the variables that need to be represented in the program; when required, the register allocator will spill registers to the stack. Note that both instruction selection and register allocation are also subject to optimization. Output of the code generation phase is native code for the target platform.

2.3.3 Common compilers

Here we present a survey of some common compilers for C and C++.

GNU Compiler Collection

The GNU Compiler Collection (GCC) is an open-source compiler. It is produced by the GNU project, a free software and mass collaboration project. It has been adopted as the standard compiler by most Unix-like operating systems, although support exists for Windows as well. GCC uses an intermediate representation called GIMPLE. GIMPLE comes in many forms, one of which is SSA based. GCC is widely used not only in free software projects, but also in commercial and proprietary software development. [8]

Low Level Virtual Machine

Low Level Virtual Machine (LLVM) is an open source compiler infrastructure that originated as a research project at the University of Illinois. LLVM is commonly used together with Clang, a compiler front end with support for C/C++/Objective-C; in this configuration, LLVM acts as the compiler back end. There is also a GCC compatible front end for LLVM called llvm-gcc. This front end is intended as a drop-in replacement for GCC on supported platforms, while using the LLVM back end.


A central feature of LLVM is the LLVM IR. LLVM IR is in SSA form, and the same IR can be used as input to all LLVM tools, such as the optimizer, the code generator and the interpreter. It can also easily be edited manually, without any tools other than a text editor.

LLVM is available for multiple processor architectures, and is able to produce code for x86, x86-64 and ARM as of today. Support for other architectures can be added with modules just as with optimization passes. [14]

Microsoft Visual Studio Compiler

The Microsoft Visual Studio Compiler, which is distributed as part of Microsoft Visual Studio, is the de facto standard compiler for Windows programs, although many other compilers exist. The Microsoft Visual Studio Compiler is proprietary and cannot be extended with third party code transformation modules. [5]

2.4

Reverse engineering

In this section, different reverse engineering techniques are explained and discussed. We do not go into great depth, since this is not the topic of the report. It is necessary to have a brief understanding of some standard reverse engineering techniques in order to understand the benefits of certain obfuscation techniques.

Analysis methods are divided into two main types, namely static and dynamic analysis.

2.4.1 Static analysis


Data-flow analysis

Data-flow analysis is a technique for determining the set of possible values for variables at certain points of a computer program. For each basic block s, we define an entry state s_entry and an exit state s_exit. By state, we mean information about the program, e.g. relationships between variables. Define a transfer function f such that f(s_entry) = s_exit. We also know that s_entry depends only on the combined exit states of the predecessors of s in the control flow graph. Using this, we get a pair of equations for each node:

s_entry = ∪_{s′ ∈ S} s′_exit

s_exit = f(s_entry)

where S is the set of predecessors of s. By solving these equations, we can determine certain properties about the data flow in each node. [4]

Code cloning

In a program, the same node can often be reached through different execution paths. As the information propagated by data-flow analysis becomes more complex for such a node, a technique called code cloning is commonly used. Code cloning duplicates nodes so that each node is reachable through exactly one path of nodes. For example, if block C is reachable either through block A or block B, C is split into two blocks C_A and C_B. This way, each block will have exactly one predecessor, making the control flow graph larger but also simpler. [23]

Path feasibility analysis

Path feasibility analysis finds a subset of the dummy edges introduced by control flow obfuscation and concludes that they are infeasible, giving the reverse engineer a simpler control flow graph.

Assume that we have an arbitrary acyclic program execution path P, and let x̄ be the set of variables live (in use) at entry to P. We want to construct a constraint C_P such that (∃x̄)C_P is unsatisfiable only if P is infeasible. Having constructed C_P, we check whether it is satisfiable. If it is not, we know that P is infeasible.

The condition C_P can be constructed using a simple set of rules that depend on the


2.4.2 Dynamic analysis

Dynamic analysis is carried out during program execution. The software is executed on a real or a virtual processor. As opposed to static analysis, which should always give the same result for a specific input, the result of dynamic analysis is highly dependent on how the program is executed. Therefore, it is important to explore a sufficient number of different program executions to trigger interesting program behavior. Techniques such as code coverage can be used to ensure that a sufficient portion of the program’s set of possible execution paths has been explored. During the run, various data can be collected, for example snapshots of the state of the program at different locations in the code, and the control flow graph as traversed when executing the program. [23]

2.4.3 Program slicing

Program slicing is an important technique that can be used both for static and dynamic analysis. A program is sliced according to some slicing criterion, and all parts that do not affect the parts of interest are filtered out. This makes the debugging process easier: for example, if we want to know why a specific value is incorrect at a specific point in the program, the variable would be selected, and slicing would filter out all code that does not affect the variable at hand, directly or indirectly.

a = input()        a = input()
c = 5              b = a
b = a              print b
print b

Figure 2.5. A sample program to the left and, to the right, the result of program slicing on the last statement.


2.5

Code obfuscation methods

The methods that can be used to obscure source code can be divided into three categories: layout obfuscations, data obfuscations and control flow obfuscations. Layout obfuscation deals only with the syntactic elements of the source code, i.e. how the source code is formatted and encoded. Data obfuscations obscure variables, classes and other data. Control flow obfuscations change the control flow of the program so that its pattern is harder to spot while analyzing the program flow. This section is largely based on the work by Collberg, et al. [2] and Drape [7]. Note that we put the emphasis on describing the concept of each method, rather than elaborating on when it can be used without changing the semantics.

2.5.1 Layout obfuscation methods

Layout obfuscations have in common that they only change syntactic elements of the code, i.e. they change the appearance of the code while leaving the real structure intact. Normally this type of transformation is unnecessary, as the compiler already removes this information from the code. However, if the code is to be redistributed as is, i.e. in source form, methods of this type can be used to remove some of the human readable information. For source distribution, these transformations can actually provide an important means of protection, as variable names and other syntactic sugar provide a human reader with context for better comprehension of the code, even if it has been obfuscated by other means.

Scrambling identifiers

Scrambling identifiers is the process of renaming identifiers such as variable and function names. It aims to change these identifiers to names which do not explain what they are used for, or to names that are illogical to what they are used for. Figure 2.6 shows an example of this.

int confirmLogin()... int apples()...

Figure 2.6. An example of how identifier names can be rewritten.


Remove comments

Comments often contain high level information, such as why the code works in a specific way and what the purpose of the code is. This information can be removed without any semantic changes. Similarly to the process of scrambling identifiers, this process is normally only useful if the code is distributed in source form, as a compiler does not preserve this information.

2.5.2 Data obfuscation methods

A data obfuscation method changes the way that data is stored in memory. Instead of storing a data structure in the normal way, data is shuffled or changed so that it is hard to interpret at run time, without knowing in which way its representation has been changed.

Value encoding

Encryption of constant strings is an example of a value encoding obfuscation. The idea behind this obfuscation method is that an attacker will inspect variables during execution in order to understand the context in which a variable is used. For example, a variable incremented by 1 in a block of code and compared to a constant limit is what a typical loop looks like. Similarly, branching based on a string value is easy to spot in the code, which can help the attacker navigate the code.

An example of value encoding obfuscation is to encrypt each constant string in the code. Upon usage the string is decrypted and after use it is encrypted again. It is possible to perform a similar obfuscation for integers. Instead of coding a value naturally, it can be coded in a way similar to Equation 2.1 (⊕ means exclusive or in this context).

var′ = var ⊕ 17 (2.1)

A desirable feature of a value encoding obfuscation is that it is a one-to-one mapping over the range of the variable that stores the value. This ensures that the encoding is invertible, i.e. whatever the value of the variable to be encoded is, it can be decoded without ambiguity. Hence the obfuscation will not break the program if the range of the variable is changed without the obfuscator’s knowledge.

Variable aliasing


A simple example is the merging of two equally sized integers into one integer of double the size. This can be done by simply storing the first integer in the lower part of the new integer and the second integer in the upper part. Storing the variables in this way generally requires the new variable to be repacked for each operation on either of the original variables. Listing 2.4 shows an example of how variables can be aliased.

Listing 2.4. An example of how four variables can be merged into one.

int8_t a = ...;
int8_t b = ...;
int8_t c = ...;
int8_t d = ...;
int32_t merged = a | (b << 8) | (c << 16) | (d << 24);

Class refactoring

Class refactoring obfuscates by changing the class inheritance hierarchy. This can be done, for example, through inheritance from dynamically generated classes with no real functionality, or by taking all functionality of one class and splitting it into two new classes, with the original class acting as a proxy for the two new classes. Figure 2.7 shows one such example.

class InputCheck {
    void updateInput();
    int getResult();
};

class A { void updateInput(); };
class B { int getResult(); };
class InputCheck : public A, B {};

Figure 2.7. Example showing how the code to the left can be transformed according to the class refactoring method.

Array restructuring

Array restructuring is an obfuscation method that changes the structure of an array, transforming it in a way that makes it difficult for an attacker to understand its structure at run time. A simple example of array restructuring would be reversing the array at compile time and reading it in the opposite order at run time, effectively undoing the reversal done during compilation. Listing 2.5 shows an example of array restructuring.

Listing 2.5. An example of how an array can be restructured for obfuscation purposes.

int scan[10] = { 8, 2, 9, 0, 6, 4, 1, 3, 5, 7 };
int fibonacci[10] = { 2, 8, 1, 13, 5, 21, 3, 34, 0, 1 };
for (int i = 0; i < 10; ++i)
    printf("%d\n", fibonacci[scan[i]]);

Variable promotion

Variable promotion changes the scope of variables, making it more confusing to know which context certain variables belong to. For example, a loop variable is commonly initialized just before the loop, used for array indexing inside the loop and incremented or decremented at the end of each iteration; after the loop it is usually not used for anything. Instead of this scheme, an obfuscator can initialize the loop variable in some other context, use it in the loop, and then continue to use it outside the loop context, as shown in Listing 2.6.

Listing 2.6. An example of how a variable can be promoted for obfuscation purposes.

int i;
for (i = 0; i < 10; ++i)
    check(i);
char buffer[i];
scanf("%9s", buffer);

2.5.3 Control flow obfuscation methods


In some sense control flow obfuscation transformations are opposed to optimization transformations, and it is not uncommon that they undo each other.

Opaque predicates

Opaque predicates are predicates that evaluate to a value known at compile time. Using this fact, we can create conditionals that complicate the control flow graph. This could be done by e.g. using a mathematical identity as in Listing 2.7 [3].

Listing 2.7. An example of dynamically created opaque predicates.

int v = rand();
if ((v * v * (v + 1) * (v + 1)) % 4 == 0)
    // always executed
else
    // never executed

An obfuscator could also use relations between variables acquired through e.g. static analysis to create opaque predicates. The obfuscator could also choose to create such relations itself by introducing one or more new variables with a predetermined relation.

Opaque predicates are commonly used as part of other obfuscation methods to produce more powerful obfuscations.

Pseudo cycles insertion

Pseudo cycle insertion creates a loop in the code with some kind of opaque predicate as loop condition, which makes the program always break out of the loop immediately; to a static analyzer, however, it will look like there is another loop in the program. This further complicates the control flow graph of the program.
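As a sketch, such a pseudo cycle could be built on the opaque predicate from Listing 2.7 (the helper names are ours, and the loop body is a placeholder):

```cpp
#include <cstdlib>

// Opaque predicate: v*(v+1) is always even, so its square is divisible
// by 4, and the test below is always false. Unsigned arithmetic keeps
// divisibility by 4 intact under wraparound.
inline bool opaquely_false(unsigned v) {
    return (v * v * (v + 1) * (v + 1)) % 4 != 0;
}

// Pseudo cycle: looks like a loop to a static analyzer, but the body
// never executes at run time.
int compute(int x) {
    unsigned v = (unsigned)std::rand();
    while (opaquely_false(v)) {
        x = -x;                    // bogus update, never executed
        v = (unsigned)std::rand();
    }
    return x + 1;                  // the real computation
}
```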

Control flow flattening



Figure 2.8. A program where all jumps between basic blocks go through a common proxy block.

All basic blocks are rerouted through a common proxy, with edges between the proxy and each basic block's entry point and edges between each basic block's exit point and the proxy entry point, as shown in Figure 2.8.

Combined with opaque predicates, this obfuscation method can make it much harder for the attacker to determine what path will be executed just through static analysis.
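A sketch of what flattening might produce for a small function (the check function is our illustrative example; the while/switch dispatcher plays the role of the proxy block):

```cpp
int check(int input) {
    int state = 0;   // which basic block runs next
    int result = 0;
    // The while/switch pair acts as the proxy block: every basic block
    // exits back here, and `state` selects the next block.
    while (state != -1) {
        switch (state) {
        case 0:  // entry block: the original conditional
            state = (input > 400 && input < 800) ? 1 : 2;
            break;
        case 1:  // then-block
            result = 1; state = -1; break;
        case 2:  // else-block
            result = 0; state = -1; break;
        }
    }
    return result;
}
```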

Function pointer obfuscation

All common programming languages have support for function calls. Some programming languages also support function pointers. In contrast to a direct call, it may be hard to determine statically which function is actually called when a function pointer is used instead.

One approach to function pointer obfuscation is to store a pointer to each function in a globally accessible array. Each function call is then replaced by indexing into this array, loading the pointer that corresponds to the function to call and finally calling that function pointer. This scheme can be extended with other obfuscations, such as value encoding and variable aliasing, effectively obscuring the function pointer data structure. Combined with opaque predicates, the process of determining statically what function is called can be made even more difficult.
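A minimal sketch of the global-array scheme described above (function and table names are ours):

```cpp
typedef int (*fn_t)(int);

static int one_more(int v) { return v + 1; }
static int twice(int v)    { return v * 2; }

// Globally accessible array of function pointers; every call site is
// rewritten to index into this array instead of naming the callee.
static fn_t fn_table[] = { one_more, twice };

int call_site(int x) {
    return fn_table[0](x);  // was: one_more(x)
}
```

In a full implementation the index 0 would itself be computed through value encoding or an opaque predicate, hiding which slot is used.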

Transforming the code to use function pointers instead of normal function calls has been formally proven to obscure the program: in the general case, determining which function a function pointer call corresponds to is NP-hard [18].

Figure 2.9 shows an example of function pointer obfuscation.

Loop transformations


int one_more(int v) {
    return v + 1;
}

int main() {
    int x = 5;
    int y = one_more(x);
    printf("%d\n", y);
    return 0;
}

int one_more(int v) {
    return v + 1;
}

int (*func)(int) = &one_more;

int main() {
    int x = 5;
    int y = func(x);
    printf("%d\n", y);
    return 0;
}

Figure 2.9. An example showing how the code to the left can be obfuscated with the help of function pointers.

Exceptional branching

Common practice is to use some pattern resembling the if-then-else construct when handling conditional control flow in the application. However, exceptions can be used for this purpose as well if they are supported. In fact, an if-else construction can be trivially transformed into a semantically equivalent try-catch block.

This type of obfuscation does not alter the control flow in the strictest sense, but it alters the construct that determines the control flow. The motivation behind this is mainly that it can make a code block look like normal exception handling code while in fact it is not, thus misleading a reverse engineer.

Figure 2.10 shows an example of an exceptional branching transformation.

Dead code insertion


int check(int input) {
    if (input > 400 && input < 800)
        return 1;
    else {
        ...
        return 0;
    }
}

int check(int input) {
    try {
        if (input > 400 && input < 800)
            throw 1;
        else {
            ...
            return 0;
        }
    } catch (int e) {
        return 1;
    }
}

Figure 2.10. A normal conditional control flow to the left, and code that uses exceptional branches to the right.

there is a need for stealthy opaque predicates to protect the code. Figure 2.11 shows examples of unreachable and useless code.

if (/* false opaque predicate */) {
    ...
}

int check(int v) {
    int p = 0;
    for (int i = 0; i < v; ++i) {
        p += i;
        /* code which does not reference p */
    }
    return 0;
}

Figure 2.11. Unreachable code to the left, and useless code (the code that operates on p) to the right.


Function transformations

Function transformations include inline expansion of functions into the caller, splitting one function into two or more functions such that calling the resulting functions sequentially performs the same action as the original function, and transforming blocks of code into functions.

Functions provide a means for an attacker to navigate in a program. If an important function has been discovered by an attacker, he can use this function as a starting point and track down all of its callers. Inlining a function makes such tracking harder. Splitting a function into multiple functions makes it harder for an attacker to grasp the context of the function. Combined with replacing the split functions with pointers as explained in Section 2.5.3, this can be a powerful obfuscation.

Code virtualization

Code virtualization is an obfuscation method in which the code is transformed into virtual machine code. The code is then executed through the use of a virtual machine interpreter that is shipped along with the program. Code virtualization can be applied on the whole program or only on parts of it.

This method is primarily useful for making the program more time consuming to reverse engineer. The mapping from the original code is one to one, so the original code can be restored by a reverse engineer, in most cases statically, i.e. without running the program, given that the virtual machine interpreter is reverse engineered first. Note however that the structure of the virtual machine interpreter may be very complex. Running the code in a virtual machine interpreter is typically very slow compared to executing the native code directly, so this approach may not be suitable for performance critical code.
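As an illustration only (not the design of any particular product), a tiny stack-machine interpreter for which the expression (a + b) * 2 has been translated into bytecode:

```cpp
#include <vector>
#include <cstddef>

// A minimal stack-based virtual machine. The original expression
// (a + b) * 2 is "virtualized" into the bytecode
//   { PUSH, a, PUSH, b, ADD, PUSH, 2, MUL, HALT }
// and executed by this interpreter instead of as native code.
enum Op { PUSH, ADD, MUL, HALT };

int run(const std::vector<int>& code) {
    std::vector<int> stack;
    for (std::size_t pc = 0; pc < code.size(); ++pc) {
        switch (code[pc]) {
        case PUSH: stack.push_back(code[++pc]); break;
        case ADD:  { int b = stack.back(); stack.pop_back();
                     stack.back() += b; } break;
        case MUL:  { int b = stack.back(); stack.pop_back();
                     stack.back() *= b; } break;
        case HALT: return stack.back();
        }
    }
    return 0;
}
```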

Encryption

Encryption can be applied at different levels in a program. One approach is to encrypt the program in full and decrypt it in full upon execution. An attacker can in this case either decrypt the program statically or dump the program just after decryption at run time. In most cases, only parts of the program need to be protected; the full program then need not be encrypted, but only the relevant parts.



definitely have the value of the constant string. Thus we can encrypt the code block with the constant string as key. To make it harder for a reverse engineer to do static decryption, the comparison between the variable and the constant string can be replaced by a comparison between the hash of each value. Knowing the hash of the string is not enough to decode the block, so an attacker would have to run the code to be able to deduce the meaning of the code in question. Listing 2.8 shows an example of such an approach.

Listing 2.8. An example where a part of the program has been encrypted based on a string key. The hashing of the string should be done at compile time for the best result.

if (hash(input) == hash("encryption-key")) {
    /* Decrypt this block with input as key */
}
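The decryption routine itself could, in the simplest case, be an XOR with the key. The following is our placeholder sketch; a real obfuscator would use a proper cipher keyed on the input:

```cpp
#include <cstring>
#include <cstddef>

// XOR the encrypted block in place with the key string. Applying the
// same function twice with the same key restores the original bytes.
void xor_crypt(unsigned char *block, std::size_t n, const char *key) {
    std::size_t klen = std::strlen(key);
    for (std::size_t i = 0; i < n; ++i)
        block[i] ^= (unsigned char)key[i % klen];
}
```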

2.6 Analysis metrics

Collberg et al. [2, 3] propose a number of criteria to use when evaluating obfuscation transformations. In particular they define the criteria potency, resilience, execution cost and stealth. Drape [7] added more criteria, of which we were particularly interested in the one he defined as ease of integration.

2.6.1 Potency

Potency measures how obscure a program P is, i.e. how difficult it is for a human reader to understand. It is based on results from software complexity metrics research. A number of carefully chosen attributes E_i are defined. The program P is defined to be more potent than the program P' with regard to the attribute E if E(P)/E(P') > 1. Attributes for measuring potency include cyclomatic complexity, program length, nesting level complexity, etc. Collberg proposed that a weighted sum E = Σ k_i · E_i(P) could be used to retrieve one potency value from multiple metrics.

Cyclomatic complexity

McCabe defined a measure called the cyclomatic complexity number [17]. The purpose of the measure is that when a function is more complex, it will have a higher cyclomatic complexity number.

McCabe showed that the cyclomatic complexity C of a function f can be calculated as

C(f) = e_f − n_f + 2 (2.2)

where e_f and n_f denote the number of edges and nodes respectively in the control flow graph of the function f. This formula simplifies to

C(f) = d_f + 1 (2.3)

where d_f denotes the number of conditions in the function f.

Applied to a program, the cyclomatic complexity can be calculated by taking the sum of the cyclomatic complexities of all its functions. The argument for this is that a call to a function is just an edge in the control flow graph of a program.
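As a sanity check of the two formulas, consider an if-then-else: its control flow graph has four nodes (condition, then-block, else-block, exit) and four edges, and the single condition gives the same result:

```cpp
// Cyclomatic complexity, in both of its equivalent forms:
// C(f) = e_f - n_f + 2 (edges and nodes of the control flow graph),
// which simplifies to C(f) = d_f + 1 (number of conditions).
int cyclomatic_en(int edges, int nodes) { return edges - nodes + 2; }
int cyclomatic_d(int conditions)        { return conditions + 1; }
```

For the if-then-else above, cyclomatic_en(4, 4) and cyclomatic_d(1) both give 2.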

Halstead’s metrics

Halstead defined a number of metrics that can be used for measuring the complexity of a program. For the potency criterion we are interested in one of them, namely Halstead's difficulty metric. Consider a subset of a program, for example one block in a control flow graph, or a function. Let n1 be the number of distinct operators in the subset. Furthermore let n2 be the number of distinct operands and N2 the total number of operands in the subset. Halstead proposed a difficulty metric D for the subset defined as in Equation 2.4.

D = (n1 · N2) / (2 · n2) (2.4)

Halstead claimed that D is positively correlated with the complexity of the subset it is calculated for, and that it can be used to compare different blocks of code with each other.
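For example, the statement x = x + 1 has operators {=, +} (n1 = 2) and operands {x, 1} (n2 = 2, N2 = 3), giving D = 1.5. A direct transcription of Equation 2.4:

```cpp
// Halstead's difficulty metric: D = (n1 * N2) / (2 * n2), where
// n1 = distinct operators, n2 = distinct operands, N2 = total operands.
double halstead_difficulty(int n1, int n2, int N2) {
    return (n1 * (double)N2) / (2.0 * n2);
}
```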

Nesting level complexity

Harrison et al. [11] argued that neither McCabe's nor Halstead's metrics correctly handle the complexity added by deep nesting of blocks of code. They suggested that the complexity of a control flow graph block should not only depend on what operands and operators are contained in that particular block, but also on the complexity of the blocks that the block in question can reach. Harrison does not explicitly specify how his new analysis metric should be carried out.

A more precise definition of a complexity measure based on the same idea is proposed by Gong et al. [9].



Define postdomination as follows. For two nodes x, y ∈ G, x ≠ y, x postdominates y if and only if every path from y to the exit node passes through x. We say that x directly postdominates y if and only if:

• x postdominates y

• ∀ z such that z postdominates y, z postdominates x.

We note that trivially all nodes have exactly one direct postdominator when there is only one exit node. For a selection node n, we let G_n contain all the nodes between n and the node that directly postdominates it. Let d_n be the number of selection nodes in G_n.

Now we are ready to define our nesting level complexity measure. The nesting degree of a selection node n is defined as in Equation 2.5.

ε_n = 1 − 1/d_n (2.5)

The nesting degree of the entire graph is calculated as ε = (ε_1 + ε_2 + ... + ε_N)/N, where N is the number of selection nodes in the entire graph. This yields a nesting complexity measure between 0 and 1, where a higher value means higher complexity.

It is difficult to apply the nesting complexity algorithm to a program (i.e. with function calls as normal edges) for several reasons (indirect function calls, external function calls, etc.). It could be argued, however, that just as for cyclomatic complexity it is possible to get a good result by simply taking the sum of the nesting complexities of all functions in the program.

2.6.2 Resilience

Resilience measures how hard it would be to create and execute an automatic deobfuscator for a transformation, reversing the obfuscation performed. In contrast to potency, which measures the confusion of a human reader, resilience measures the confusion of an automatic deobfuscator. Evaluation of this criterion is quite subjective and speculative, but may be backed up with formal results.

2.6.3 Execution cost

An obfuscation is regarded as free if the resource penalty it introduces is constant,

i.e. the penalty is the same regardless of what input the program is run on. It is regarded as cheap if the amount of resources needed to run the obfuscated program grows linearly in the resources used by the original program. Define n to be the resources used by the original program, i.e. time and/or memory used. A costly obfuscation is one that requires O(n^x), x > 1, more resources than the original program, and a dear obfuscation requires exponentially more resources.

2.6.4 Stealth

Stealth measures how easily a programmer can determine if and how a particular piece of code has been obfuscated. The motivation behind this criterion is twofold. First, an attacker will probably be more interested in a piece of code that looks obfuscated, as it has probably been obfuscated for a reason. Second, if the reverse engineer can detect what methods were used to obfuscate a program, he can more easily develop inverse transformations to recreate the original code, as well as understand the code more easily.

2.6.5 Ease of integration


Chapter 3

Implementation

3.1 Obfuscation

In order to be able to implement and evaluate our obfuscation algorithms, we need a way to alter the execution flow of computer programs. This can be done on three different levels:

• Binary level
• Source code level
• Intermediate level

Should we decide to apply our obfuscation algorithms on the binary level, we would have to restrict them to a number of architectures. As we aim to implement a platform independent obfuscator, this is not good enough. Also, performing transformations on a level this low might complicate the process of implementing our algorithms, since a lot of context is lost during compilation.

Obfuscating on the source code level is as platform independent as we can get, as the entire obfuscation procedure can be plugged in even before the source code is compiled. However, as with binary level obfuscation, it requires a lot of contextual analysis. For this, we need a parser that parses the source code in some way, from which we can get contextual information. If we could create a program that parses source code, applies transformations, and then outputs obfuscated source code without any additional dependencies, we would have a lot of opportunities. This would require us to be able to parse programming languages, which can be tedious in some cases, e.g. C++.


apply our obfuscating transformations directly on the IR. After having run our obfuscation routines, we rely on a compiler that is able to transform the IR into our target architecture, finally giving us an obfuscated binary.

3.2 Decision on which tool to use

A number of different tools were evaluated before finally deciding on how our transformations would be applied, and what we would use.

Binary level obfuscation could quickly be excluded from our set of possible approaches, since one of our main goals was platform independence, and, as mentioned above, a lot of context is lost during compilation. This would probably reduce the ease of integration by a substantial amount. Hence, we did not evaluate any binary obfuscation techniques.

We looked at and briefly evaluated a number of source level obfuscating transformations.

TXL [24] is a programming language designed to support computer software analysis and source level transformations. This is exactly what we were looking for, but we decided to leave TXL since the community around the project did not seem strong enough for us. Since for example C++ is such a complex language, we felt it would be better to rely on an open source program surrounded by a strong community. We also looked at the DMS Software Reengineering Toolkit [6], a toolkit promising exactly what we were looking for. According to their website, the DMS Software Reengineering Toolkit transforms source code into an AST, performs transforms on it, and then outputs source code again. However, the product is non-free and not open source, and we decided that should we run into problems caused by the toolkits we are using, we would prefer turning to a strong open source community and a free utility.



3.3 Pointer analysis


Chapter 4

Method

We have implemented and evaluated some of the obfuscation methods discussed in the Background chapter. We have also implemented analysis passes that automatically calculate the metrics defined in the Background chapter. These analysis passes will later help us evaluate our obfuscation methods. In this chapter, we describe the implementation of each analysis and obfuscation method on a high level. The following obfuscation methods have been evaluated:

1. Variable aliasing
2. Call graph flattening
3. Array restructuring

The first two methods have been implemented and analyzed. The third method has not been implemented. Thus, we have not been able to perform any analysis on this obfuscation method. The third method is discussed in Section 9.3, where we also discuss how applying it would affect each one of our analysis metrics. Implemented obfuscation methods will be evaluated using the implemented analysis passes, listed below.


4.1 Obfuscation

Here we describe on a high level how the different obfuscations were implemented. When describing the algorithms, we make some assumptions about the programming language on which the obfuscations are to be applied. We assume that it is an imperative language that is not dynamically typed, i.e. we are able to distinguish variable types at compile time. The details of how the implementation was carried out practically are explained later in the report, in the Implementation chapter. In order to explain our obfuscation methods in a more convenient way, it is necessary to understand the terms loads and stores.

Recall that a load is an instruction that loads a piece of memory into a computer register. A store instruction does the opposite, saving a register value to memory.

4.1.1 Variable aliasing

The algorithm is divided into two phases. During the first phase, we go through the program to identify candidate variables for aliasing, and decide where they should be aliased in our wrapper variables. A wrapper variable is an integer that is meant to fit several smaller integer types inside it. During the second phase, we change the program so that it will access its variables from the wrapper variables, instead of just using the variables themselves.

1. Let S be an ordered set of integer variables that are to be obfuscated with variable aliasing. Only integer variables under a certain bit width k are included in this set; integers of size equal to k or bigger are ignored.

2. Let w denote the current wrapper variable, and let it have w_k bit slots left.

3. Randomly shuffle the elements in S.

4. Take the next element s_i ∈ S. If w_k < b, where b denotes the number of bits used by s_i, allocate a new wrapper integer w, and once again let w_k = k.

5. Alias the current variable into w at the current offset w_k. Decrease w_k by the bit width of the current variable.

6. If all elements in S have been handled, exit the algorithm. Otherwise, go to (4).

Now we have a function f(v) ↦ (v', d) that maps the place of an integer variable v to its wrapper variable v' and bit offset d. It



remains to make sure that all reads and writes to v will access v' instead, and that they only deal with the part of v' that belongs to the corresponding variable. Let us now change all load instructions: instead of loading from memory slot v, we load from memory slot v', unmasking the value of interesting size at offset d. Analogously, we change all store instructions so that instead of storing to v, they store to position v' at offset d. It is important that, while doing this, we leave the rest of v' as it is, since v' is probably a wrapper variable for other integers in the program.
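The rewritten loads and stores amount to masked field accesses of the wrapper variable; a sketch with hypothetical helper names:

```cpp
#include <cstdint>

// Load the b-bit field of wrapper w that starts at bit offset d.
uint64_t load_field(uint64_t w, unsigned d, unsigned b) {
    uint64_t mask = (b >= 64) ? ~0ULL : ((1ULL << b) - 1);
    return (w >> d) & mask;
}

// Store v into that field, leaving the rest of the wrapper intact.
void store_field(uint64_t *w, unsigned d, unsigned b, uint64_t v) {
    uint64_t mask = ((b >= 64) ? ~0ULL : ((1ULL << b) - 1)) << d;
    *w = (*w & ~mask) | ((v << d) & mask);
}
```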

Effectively, this means that the integer variables in the program will be aliased in memory. During runtime, the real values will be exposed in the computer registers. Hence, this obfuscation aims at preventing static analysis.

Now, for clarification purposes, let us examine how an example program will be obfuscated by the obfuscation method described above.

Listing 4.1. Source code of an example program.

int main() {
    int a = 1;
    int b = 2;
}

This is a very simple case. The two variables a and b will be aliased into a single wrapper variable. The obfuscated program will look something like the following, assuming the compiler uses 32-bit ints and we use a wrapper size of k = 64 bits.

Listing 4.2. Source code of the same program, with aliased variables.

long long w1 = 0;
int mask_int = -1;

int main() {
    w1 = (w1 & (mask_int << 32)) | (1 << 32);
    w1 = (w1 & (mask_int)) | (2);
}

If pointers are used in the program, we must be careful. If a pointer is dereferenced while pointing to a variable that has been encoded using variable aliasing, the pointed-to value must be treated in the same way. Keeping track of where a pointer points at run time is hard [18], and therefore we must apply techniques that enable us to alias variables correctly at compile time, at the cost of a somewhat weaker obfuscation. For this implementation, we left out variables that could possibly be pointed to by a pointer.


Similarity to bin packing

Selecting the wrapper variable for the variables to be aliased can be visualized as packing items into bins of a certain size, also known as the bin packing problem. Finding the minimum number of bins needed for arbitrary item sizes and bin sizes is a famous NP-hard problem. In our case, the problem is much easier, since items (variables to be aliased) and bins (wrapper variables) have sizes that are always a power of two. The minimal number of bins can then be found by sorting the items in ascending order and packing them greedily. This was not implemented, but our algorithm can easily be modified to do this by sorting the set of variables instead of shuffling it randomly.
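The greedy variant could be sketched as follows (assuming all variable widths and the wrapper size k are powers of two, as argued above):

```cpp
#include <vector>
#include <algorithm>

// Greedily pack variable bit widths into k-bit wrappers after sorting
// ascending; per the argument above this uses the minimum number of
// wrappers when all sizes are powers of two.
int wrappers_needed(std::vector<int> widths, int k) {
    std::sort(widths.begin(), widths.end());
    int wrappers = 0, left = 0;
    for (int w : widths) {
        if (w > left) { ++wrappers; left = k; }
        left -= w;
    }
    return wrappers;
}
```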

Complicating for the optimizer

Experiments have also been carried out where we always left the least significant bit in each wrapper unused; basically, in the algorithm above, we let w_k = k − 1 instead of w_k = k. The purpose of this was to see how the optimizer would react when things were offset by an odd number of bits in the wrapper variables.

4.1.2 Call graph flattening

This obfuscation aims to flatten the call graph of the program by reducing its height. The basic idea is that a proxy is added to the call graph, effectively reducing its height to 2. Whenever the program changes state to another node in the graph, this is done via the proxy. By node we here mean a function, and a function call represents an edge in the call graph. This is basically the same idea as described in Section 2.5.3, but instead of flattening the control flow within a function, we flatten the call graph of the program.

Now we are ready to give an algorithm for performing the call graph flattening.

1. Plant the function proxy f into the program. Let f (x) return a function pointer given an identifier x for some function. The function pointer returned points to the function corresponding to the given identifier. Let f handle an internal list that keeps track of the functions, initially empty.

2. Go through the program. For each function, assign it a unique identifier and add the function to f's internal list.


3. Go through the program once again. Replace each function call with the following operations: first a call f(k), where k denotes the identifier of the function to be called; then, instead of calling the function directly, a call via the function pointer returned by f(k).

This way, the function call graph will be flattened to height 2.
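In code, the proxy and the rewritten call sites might look like the following simplified sketch (here the proxy's internal list is a fixed table, whereas the real implementation builds it in step 2):

```cpp
typedef int (*fn_t)(int);

static int one_more(int v) { return v + 1; }
static int square(int v)   { return v * v; }

// Proxy f: given an identifier, return the corresponding function
// pointer. The internal list is a fixed table here for brevity.
static fn_t f(int id) {
    static fn_t table[] = { one_more, square };
    return table[id];
}

int pipeline(int x) {
    int y = f(0)(x);   // was: one_more(x)
    return f(1)(y);    // was: square(y)
}
```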

In Section 9.2.1, we describe strategies for how this could possibly be done on a basic block level, and the probable effects of it.

4.2 Analysis methods

In order to be more effective in our evaluation of obfuscation algorithms, we implemented special LLVM passes that perform the analysis for us automatically. Here we give a description of how these passes were implemented, which is necessary for understanding how our results were derived.

4.2.1 Halstead’s Difficulty Metric

This metric was implemented as defined in Section 2.6.1, calculating the metric for a function. The analysis passes were applied on a per-function basis, so to give a measurement of an entire program, we calculated the difficulty metric D_f for each individual function f ∈ F, and then let the difficulty metric for the entire program be Σ_{f∈F} D_f.

4.2.2 Cyclomatic complexity

Let us say that the cyclomatic number for each function f ∈ F is called C_f. We let C_f be the number of branches in f (see Section 2.6.1 for how closely related this is). To calculate the cyclomatic complexity of the entire program, we simply counted the number of branches in the program, which of course is Σ_{f∈F} C_f.

4.2.3 Nesting complexity

We calculated the nesting complexity as follows. For each function f ∈ F in the program, we calculated its nesting complexity N_f. For each selection node s, its


Chapter 5

Analysis

Here we analyze the obfuscation methods described in the previous chapter according to the analysis methods described in the Background chapter.

5.1 Analysis method

We have implemented automatic analysis passes for some of the analysis metrics defined in the Background chapter: cyclomatic complexity, nesting complexity and Halstead's difficulty metric. We will use the results of these metrics to strengthen our discussion about the effect of the implemented obfuscations. For details on how these implementations were carried out, see Section 4.2.

Execution cost has been measured by building gzip [19] and compressing a file with it under different obfuscation settings. During compression, we measure the running time and peak memory usage of the program. The reason that this is not measured for more than one program is that the main purpose of the measurement is to see how CPU-heavy parts of a program are affected by obfuscating transformations. Data compression is CPU-heavy, so measuring this for gzip [19] is sufficient for this project.

The compressed file was generated using this simple C++ program:

Listing 5.1. Program that was used to generate random bytes.

#define FILE_SIZE 100000000

#include <cstdlib>
#include <cstdio>

int main() {
    for (int i = 0; i < FILE_SIZE; ++i)
        printf("%c", char(rand() % 250 + 1));
    printf("%c\n", '\0');
}

For the other metrics, namely resilience and ease of integration, we have not been able to measure anything concretely, since these metrics are subjective. We will still discuss them in the Discussion chapter.

Each one of the measurable metrics will be tested for:

• Clean binary without obfuscation and without optimization
• Clean binary without obfuscation and with optimization
• Obfuscated binary without optimization
• Obfuscated binary with optimization

All tests were performed using a MacBook Pro equipped with a 2.2 GHz Intel Core i7 processor. Execution times were consistent, i.e. the difference between execution times was at most 0.1 seconds. Hence, each measurement was done once, and then several more measurements were made to verify the running time.

When optimization is applied, it is always applied after obfuscation in our tests. Seeing how well the optimizer reverses our obfuscation methods is a good way of quickly telling whether a given obfuscation is strong.

Since the implemented obfuscation methods are independent in the way they change program execution flow, we have not tested applying a combination of the obfuscation methods.

Some obfuscation method specific evaluations have been carried out. These are listed below.

5.1.1 Evaluating call graph flattening

We printed the call graph for a simple C program before and after obfuscation. The program source code can be read in Listing 5.2. The call graph before and after obfuscation is presented in Figure 6.1 and Figure 6.2.

Listing 5.2. Example program for call graph flattening.

int c(int x) {
    return x % 100;
}


Chapter 6

Results

Here we present the results of our measurements. For details on how the experiments were carried out, see the previous chapter.

6.1 Variable aliasing

In Table 6.1, we list the results of our implemented analysis passes. In Table 6.2, we list the results of our execution cost measures.

We observe that with the optimizer turned on, we always got the same results. We also observe that memory usage increased with obfuscation (without optimization).

Table 6.1. Measurement results for variable aliasing.

obf  opt  Halstead's Complexity  Cyclomatic Complexity  Nesting Complexity
off  off  976                    1052                   1081.6
on   off  1222                   1052                   1081.6
off  on   766                    980                    1004.6
on   on   766                    980                    1004.6

Table 6.2. Execution cost results for variable aliasing.

obf  opt  Peak Memory (1024 B)  Running time (seconds)
off  off  720896                7.43
on   off  733184                12.85
off  on   729088                8.07


Table 6.3. Measurement results for variable aliasing with offset.

obf  opt  Halstead's Complexity  Cyclomatic Complexity  Nesting Complexity
on   off  1223                   1052                   1081.6
on   on   766                    980                    1004.6

Table 6.4. Execution cost results for variable aliasing with offset.

obf  opt  Peak Memory (1024 B)  Running time (seconds)
on   off  737280                11.63
on   on   729088                7.97

Table 6.5. Measurement results for call graph flattening.

obf  opt  Halstead's Complexity  Cyclomatic Complexity  Nesting Complexity
off  off  976                    1052                   1081.6
on   off  1052                   1066                   1095.6
off  on   766                    980                    1004.6
on   on   824                    994                    1018.6

Variable aliasing with offset

In Table 6.3, we list the results of our implemented analysis passes. Since the purpose of this experiment was to see how the optimizer reacted, we tested only with obfuscation enabled.

In Table 6.4, we list the results of our execution cost measures.

We observe that adding an offset in our aliasing algorithm yielded essentially the same measurement data, but with a slightly faster running time in the non-optimized case.

6.2 Call graph flattening

In Table 6.5, we list the results of our implemented analysis passes. In Table 6.6, we list the results of our execution cost measures.

Table 6.6. Execution cost results for call graph flattening.

obf   opt   Peak memory (KiB)   Running time (s)
off   off              720896               7.64
on    off              729088               7.99
off   on               729088               8.02



Figure 6.1. Call graph for the program in Listing 5.2 before call graph flattening obfuscation.

Figure 6.2. Call graph for the program in Listing 5.2 after call graph flattening obfuscation.

We observe that running time increased with obfuscation turned on. We also observe that all three complexity measures increased with obfuscation, and that memory usage increased with obfuscation.


Chapter 7

Discussion

In this chapter, we discuss our results and the reasons behind them. Final conclusions are given in the next chapter.

7.1 Variable aliasing

7.1.1 Potency

Variable aliasing did not affect the cyclomatic complexity of the program. This is as expected: recall from the Background chapter that cyclomatic complexity is closely related to the number of branches in the program. Our variable aliaser neither adds branches nor changes existing ones, so the cyclomatic complexity is expected to remain unchanged.

The obfuscation method also left the nesting complexity of the program unchanged. This, too, matches our expectations, and for the same reason: the branches in the program are untouched.

(57)

7.1.2 Resilience

Applying the optimizer to our obfuscated code effectively inverted all of our transforms. We confirmed this by comparing the output assembly of obfuscated and non-obfuscated optimized code. Hence, we conclude that the obfuscation on its own has weak resilience, and it should be combined with other obfuscation methods in order to increase resilience.

Even though we tried to make life harder for the optimizer by adding an odd offset in our aliasing algorithm, the optimizer could still figure out how to invert our transformations.

It is likely that the optimizer has a set of optimization rules that perfectly matched our obfuscation rules, effectively dealiasing variables that do not need to be coupled.

7.1.3 Execution cost

Since the optimizer effectively inverted all our obfuscations, the discussion below concerns the non-optimized case, unless otherwise stated.

Running time

The extra operations needed to shift a value into a wrapper variable increase the running time by a constant amount for each obfuscated variable access. In our experiments, variable aliasing made the running time roughly 70% higher than the original (12.85 s versus 7.43 s in Table 6.2).

Interestingly, the obfuscated binary ran faster when we aliased using offsets. The probable cause is a so-called partial register stall: on a 64-bit machine, writing a 32-bit integer into part of a 64-bit register is slower than writing the full 64-bit value at once. Adding offsets caused the aliased 32-bit integers to straddle the "center" of the 64-bit wrapper variable, likely forcing the machine to operate on the full 64-bit register and thereby increasing speed. This would explain the results.

Memory
