
Linköpings universitet, SE–581 83 Linköping

Linköping University | Department of Computer and Information Science
Master thesis, 30 ECTS | Datateknik
2019 | LIU-IDA/LITH-EX-A--19/074--SE

Taint analysis for automotive safety using the LLVM compiler infrastructure

Éléonore Goblé

Supervisor: Ulf Kargén
Examiner: Nahid Shahmehri


Copyright (Upphovsrätt)

This document is made available on the Internet – or its future replacement – for a period of 25 years from the date of publication, barring exceptional circumstances.

Access to the document implies permission for anyone to read, download, and print single copies for personal use, and to use it unchanged for non-commercial research and for teaching. Subsequent transfers of copyright cannot revoke this permission. All other use of the document requires the consent of the author. To guarantee authenticity, security and accessibility, there are solutions of a technical and administrative nature.

The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in such a form or context that is offensive to the author's literary or artistic reputation or integrity.

For additional information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to down-load, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Software safety is getting more and more important in the automotive industry as mechanical functions are replaced by complex embedded computer systems. Errors during development can lead to accidents and threaten users' lives. The safety level of the software must therefore be monitored through Automotive Safety Integrity Levels (ASILs), according to the standard ISO 26262. This thesis presents the development of a static taint analysis tool using the LLVM compiler infrastructure in order to identify safety-critical components and analyze their dependencies in automotive software. The aim was to provide a useful visualization tool to help safety engineers in their work and save time during development. It was concluded that this static taint analysis tool can facilitate and improve the precision of the ASIL decomposition of automotive software.


Acknowledgments

First and foremost, I would like to thank ARCCORE for giving me the opportunity to conduct this master thesis. In addition, I would like to thank my supervisor Daniels Umanovskis and my colleague John Tinnerholm for their valuable help. I would also like to thank all my colleagues at ARCCORE for their friendly welcome and their support.

Furthermore, I would like to thank my supervisor Ulf Kargén and my examiner Nahid Shahmehri for providing me with valuable feedback.

I would also like to thank my sister Morgane for proofreading my thesis.

Finally, I would like to thank Linköping University and the University of Technology of Compiègne for giving me the possibility to carry out this double-degree project.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
1.1 Company
1.2 Motivation
1.3 Aim
1.4 Research questions
1.5 Delimitations
1.6 Outline

2 Theory
2.1 Automotive industry
2.2 Functional safety
2.3 Static Analysis
2.4 Pointer and Alias Analysis
2.5 LLVM
2.6 Related Work
2.7 Visualization
2.8 Evaluation

3 Method
3.1 LLVM
3.2 Taint analysis
3.3 Visualization
3.4 Evaluation

4 Results
4.1 LLVM
4.2 Taint analysis
4.3 Visualization
4.4 Evaluation

5 Discussion
5.1 Taint analysis
5.2 Results
5.3 Method
5.4 Source criticism
5.5 The work in a wider context

6 Conclusion
6.1 Consequences
6.2 Further work


List of Figures

1.1 Master thesis outline
2.1 Compilation process
3.1 An overview of the LLVM Value inheritance
3.2 UML diagram describing the architecture of the taint analysis pass
3.3 SafeValue and SafeInstruction classes
4.1 The list of tainted functions and global variables in each file
4.2 An example of the tree view, whose initiator is the variable safe
4.3 The alias view of the variable safe in the function testInterProcedural
4.4 Visualization tool overview
4.5 Which aspect has been used to find the ASIL rating of an object?
4.6 An overview of the result of the taint analysis pass on the project (real names have been modified)


List of Tables

3.1 Taint propagation policy
3.2 Linear scale questions
3.3 Tasks
3.4 Store test cases
3.5 Load address test case
3.6 Pointer parameter test cases
3.7 Global initialization test case
3.8 File test case
3.9 Call test case
3.10 Violation test case
4.1 Linear scale questions
4.2 LLVM IR metrics
4.3 Taint information
4.4 Taint analysis results
4.5 Results


1 Introduction

The importance of safety in the automotive industry has significantly increased in recent years. Purely mechanical functions have been replaced by complex embedded computer systems, which require high levels of safety. In fact, errors during development can lead to accidents and threaten users' lives. The safety level of the software must therefore be assessed and monitored. ISO 26262 [1] is an industry-specific standard for functional safety of road vehicles, similar to the broader standard IEC 61508 which defines Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems [2]. According to ISO 26262, the safety level of an application can be measured by Automotive Safety Integrity Levels (ASILs). This standard recommends separating safety-critical objects from non-hazardous objects in the memory.

1.1 Company

This master thesis was carried out in collaboration with ARCCORE AB [3], headquartered in Gothenburg, Sweden. ARCCORE is a fully-owned subsidiary of Vector Informatik GmbH, headquartered in Stuttgart, Germany. ARCCORE provides leading solutions for embedded systems development in the automotive industry, and its software is developed in accordance with the automotive standard AUTOSAR [4].

1.2 Motivation

In the automotive industry, the embedded code supplier needs to provide guarantees to the Original Equipment Manufacturer (OEM) with regard to safety requirements. In order to attempt to establish that the software is safe, the company needs to perform analyses of the code.

Dynamic analysis techniques such as testing and verification are common ways to check software safety; however, these methods are tedious. The number of possible paths grows exponentially with the size of the program; therefore, testing only provides a "partial verification", according to Silva et al. [5]. Hardware protection can also be used to ensure safety. AUTOSAR [4] defines a standard for the architecture of Electronic Control Units (ECUs) and recommends functional measures for safety-relevant systems. In embedded systems, a hardware Memory Protection Unit (MPU) [6] provides memory protection by defining access rights to different parts of memory. In a safety-critical system, the MPU can be used to partition the memory and prevent unsafe components from writing into the safe memory during run-time [7].

Static analysis consists in analyzing the source code before executing it, and thus enables engineers to prove code safety. Static analysis could be used to identify the components to be placed in the safe partition. Static analysis can also be combined with dynamic analysis to improve the efficiency of the analysis [8]. However, developing a sound static analyzer is expensive in terms of complexity.

Moreover, safe components which have a higher ASIL need "Freedom from interference" (FFI) [9] from lower-level components, which ensures that "a fault in a less safety critical software component will not lead to a fault in a more safety critical component", according to Leitner-Fischer et al. [7].

Nevertheless, monitoring the safety of the entire software can be costly, according to Azevedo et al. [10]. For a developer of automotive software, it is desirable to limit the number of ASIL components. In fact, such components have to be developed according to additional requirements imposed by ISO 26262, which significantly increases the effort during the implementation and testing phases. The goal is to reduce the volume of code subject to high safety levels as much as possible, in order to be able to study these slices precisely and to limit the risks.

Currently, a manual code inspection is performed in order to identify the dependencies related to the variables used in safety-critical modules. The challenge is to develop a tool that automatically identifies dependencies between the safe objects of the program, and thus gives safety engineers a basis for partitioning the memory.

Taint analysis [11] consists in detecting data coming from untrusted sources and propagating the taint to the variables related to this data. Taint analysis can be used to identify data which can influence safety-critical components.

The Low Level Virtual Machine (LLVM) [12] is a compiler infrastructure composed of a set of libraries and reusable objects. LLVM provides several modules for compiler construction, which can be used for static code analysis. The Clang compiler uses LLVM to transform C code into LLVM IR, an intermediate representation that facilitates the analysis of the relations between variables. LLVM also provides the LLVM Pass Framework [13], which makes it possible to develop an "LLVM analysis pass", a plugin built on top of LLVM to analyze source code.

1.3 Aim

The aim of the thesis is to develop a static analysis tool, composed of a taint analysis pass built on existing static analyzers such as the LLVM analysis modules and the LLVM Pass Framework, and a visualization tool to present the results. In doing so, this thesis examines how taint analysis can be used to ensure embedded systems safety. This could be done by analyzing C code and generating the dependencies related to safety-critical components. The output of the program should be easily understandable for the safety architects, which means it should be easy and quick to learn how to use the tool, and the output should be precise enough to provide them with additional information for their work.

1.4 Research questions

The first meetings and discussions made it possible to highlight the most important aspects of the thesis and to raise the following questions:


1. Is it possible to utilize LLVM to develop a static analysis tool for automotive software?

The first task is to study the possibility of implementing a new module on top of LLVM.

2. How can static taint analysis be used to track dependencies related to safe components in automotive software?

This thesis aims at studying the best method to implement a static taint analyzer for automotive software. This analyzer should efficiently identify the components which can influence variables marked as safe.

3. How can the results be represented in an understandable way so that engineers can improve the safety development process?

This thesis aims at generating an understandable output which focuses on the most important and relevant information and presents useful data for safety engineers. One task is thus to study the best way to represent the dependencies between safe and unsafe components.

4. Is the taint analysis accuracy sufficient for the application? How does taint analysis visualization affect the usefulness of the output?

The results of the tool can also be compared to manual analysis results performed on existing projects to evaluate the accuracy. The output visualization can be submitted to safety engineers, so that they can evaluate the usefulness of the result.

1.5 Delimitations

This thesis only aims at analyzing dependencies originating from safety components provided by the user. Thus, the thesis does not provide identification of the initial components considered as safe. This thesis aims at developing a standalone tool, so the integration of the tool is not included in the development process. Moreover, this tool should be compatible with LLVM 5 and should work on Windows, according to the company's technical configuration. Finally, this thesis aims at analyzing embedded code for the automotive industry which follows the rules described in the MISRA C Guidelines [14].

1.6 Outline

Figure 1.1 illustrates the outline of the master thesis and highlights the main steps of the study.

First, a pre-study was conducted in order to define the subject and plan the thesis work. The Introduction Chapter [1] and the Research Questions were written following this. Some literature and technical research was done in order to write the Theory Chapter [2] and to start the development phase. The research study was useful to design the architecture of the taint analysis LLVM pass based on the LLVM Pass Framework, presented in the Method Chapter [3]. Then, the development of the taint analysis LLVM pass and the visualization tool was done iteratively. The main functionalities of the taint analysis pass were tested. A qualitative study was performed on the visualization tool, and the taint analysis pass was tested on a real project of the company in order to evaluate its accuracy. The Results Chapter [4] presents the results of the evaluation and the static taint analyzer composed of the taint analysis pass and the visualization tool. The Discussion Chapter [5] presents feedback and improvements made following the different studies. Finally, the Conclusion Chapter [6] summarizes the results of the master thesis and suggests further work.


[Figure 1.1: Master thesis outline — defining the subject, planning and research questions; literature research (taint analysis, automotive systems) and technical research (LLVM, C++, visualization); iterative development of the taint analysis pass and the visualization tool, including architecture design and testing; qualitative study and accuracy evaluation of the static taint analyzer; discussion of feedback and improvements; conclusion.]


2 Theory

This section aims at presenting the background and the related work relevant to this thesis. First, section 2.1 presents software development in the automotive industry. Then, section 2.2 defines functional safety standards and concepts. A review of the different types of static analysis is provided in section 2.3. A brief explanation of pointer analysis is given in section 2.4. Besides, an overview of LLVM is provided in section 2.5. Section 2.6 presents existing studies related to this topic. Finally, section 2.7 introduces software visualization and section 2.8 reviews methods to evaluate software usability and accuracy in the context of static analyzers.

2.1 Automotive industry

The automotive industry deals with safety-critical systems whose malfunctions could lead to serious consequences, including injury to people, environmental damage and large financial losses [5]. Vehicles are increasingly automated and use a large number of embedded computer systems [15]. These systems require more and more checks to ensure the safety of vehicle passengers.

Automotive systems architecture

Automotive systems are divided into a physical hardware part, such as Electronic Control Units (ECUs), and a software part [16]. ECUs are embedded systems composed of "a microcontroller and a set of sensors" [15], aiming at controlling an electrical system in a vehicle through embedded software. These systems need to implement protection methods to ensure safety, both at the hardware and software levels. The Memory Protection Unit (MPU) [6] is a hardware protection in ECUs aiming at restricting access to the safe partition during run-time. A memory access violation generates an exception that terminates program execution.

This thesis focuses on software methods to ensure embedded systems safety.

Embedded software development

According to Freund [17], embedded software involves many constraints such as "real-time scheduling, reliability and production requirements", which influence software development methods. Embedded software is usually developed in C because this language has been used in critical systems for a long time, and efficient machine code can be generated from C programs [14].

The MISRA C Guidelines [14] provide "a subset of the C language" which is intended to reduce the possibility of making mistakes during development. This is done by removing C language expressions which could lead to undefined behaviour, misuse or misunderstanding. These guidelines are recommended for the development of embedded applications and safety-related systems.

2.2 Functional safety

Functional safety aims at detecting hazardous situations and applying preventive solutions. These solutions should prevent systematic or hardware failures from having serious consequences [2]. Therefore, standards have been developed to assess functional safety and to provide common methods to solve these issues.

Functional safety standards

The automotive industry is regulated by several standards which aim at standardizing product development. IEC 61508 defines Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems [2]. ISO 26262 [1] is adapted from IEC 61508 and deals with the functional safety of road vehicles.

AUTOSAR (AUTOmotive Open System Architecture) [4] is an automotive standard for the software architecture of ECUs. This standard recommends measures and mechanisms to improve the development of safety-related software, such as memory partitioning [6]. Unsafe applications are run in user mode whereas safe applications are run in supervisor mode, in order to access the MPU without restriction.

Automotive Safety Integrity Levels (ASILs)

Part 9 of ISO 26262 defines "ASIL-oriented and safety-oriented analyses" in order to decompose the software into safety-related components and non-safety-related components. Automotive Safety Integrity Levels (ASILs) have been developed to assess the safety level of an embedded system. Therefore, respecting ASILs aims at convincing the manufacturers that the products meet safety requirements. In order to develop ASIL software, designers must identify safety-critical components whose malfunctions could lead to serious issues [10], such as the brake system. Therefore, risks related to hazardous situations are defined and classified into four different levels (ASIL A, ASIL B, ASIL C, ASIL D) according to their severity, probability, and controllability [1]. ASIL components must be monitored through safety measures and require more development effort [10]. Components which do not require specific safety measures are identified as Quality Management (QM).

Freedom from interference

Freedom from interference (FFI) is defined by ISO 26262 Part 9 Section 6.2 [1] as the absence of “cascading failures” from a lower ASIL element to a higher ASIL element. This means that components with lower ASIL should not influence components with higher ASIL. This should prevent an error that happens in an unsafe module from propagating to a safety-critical module [7].

Therefore, ASIL components should be separated from QM components inside the memory, and placed in memory regions protected by the Memory Protection Unit (MPU) [6].

Finally, static code analysis can be performed in order to identify the components related to safety-critical modules.


2.3 Static Analysis

Static analysis refers to the analysis of a program without running it [18]. Contrary to dynamic analysis, which is performed on programs during run-time [11], static analysis can be performed directly on source code or on intermediate code, for example on the LLVM intermediate representation (IR) [19].

Although dynamic analysis can be popular, this method has some limitations. One execution path is generated for each input set, and one path is tested for a program at a time. Thus, achieving a high percentage of code coverage is challenging when the number of paths increases, and dynamic methods can thus "encounter [...] paths explosion problems", according to Feng and Zhang [20]. Dynamic testing tends to provide only "partial verification" according to Silva et al. [5]: some paths can be missed and inaccurate results can be provided. Static analysis gives the possibility of simulating all the execution paths of the program during compile-time, which is called symbolic execution, according to Liang et al. [21].

However, static analysis tools are not always fully reliable [11]. They provide either over-approximation or under-approximation. These tools can be incomplete, and produce false positives (find an error where there is none), or unsound, and produce false negatives (error not reported), depending on the chosen approximation method. According to Mock et al. [22], if the static analysis method is too precise, then the algorithm complexity can be a limit when running the analysis on large programs.

Analysis methods

Static analysis can be performed by applying formal methods, that is to say, analyzing the source code mathematically in order to prove certain properties.

According to P. Cousot and R. Cousot [23], abstract interpretation approximates possible values using abstract sets, which aims at converting infinite spaces into finite ones. For example, as far as the sign of a variable's values is concerned, the set of integers can be abstracted to the set {(+), (−), (0)}. Another technique is deductive verification, which aims at proving the algorithm by dividing it into a list of mathematical proof obligations, according to Silva et al. [5]. Furthermore, symbolic execution consists in simulating the execution of the program during compile-time, according to Liang et al. [21].

Static analysis can also be based on compiler technology. According to Arroyo et al. [11], modern compilers enable developers to build upon their structure elements, such as Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs) and Call Graphs (CGs), in order to perform data and control flow analysis. Data-flow analysis consists in analyzing the operations performed on a data set, whereas control-flow analysis is used to study the flow of tasks and the structure of the program.

Taint Analysis

The field of static analysis developed in this thesis work is taint analysis. According to Arroyo et al. [11], taint analysis is based on information flow and “non-interference”: information flow analysis is used to check that tainted information does not interfere with information which should not be tainted.

Usually, in software security, taint analysis consists in marking data coming from untrusted sources, such as user input, as unsafe, because external data is always a security risk [11]. As far as software safety is concerned, unsafe data does not necessarily come from the user; it can also come from unsafe modules. Taint analysis can then be used to track the unsafe variables which can influence the safety-critical components. In the context of this thesis, tainted data is classified into different safety levels. Data with a lower ASIL should not influence data with a higher ASIL; otherwise, both must be tainted with the higher ASIL level.

(16)

2.4. Pointer and Alias Analysis

Taint analysis is usually divided into three phases [20]. The first one is taint information, which aims at tainting the initiators (source objects). The second phase is taint propagation, which aims at broadcasting the taint to all the other objects related to the initiators. The last phase is taint checking, which consists in checking whether an object which has been tainted should not be tainted, in order to detect unauthorized behavior. According to Schwarz et al. [24], the taint policy should define how new objects are tainted, which operations propagate the taint, and how the taint is checked at the end.
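The following minimal, self-contained C++ sketch (not the thesis implementation; the object names and the dependency map are made up for illustration) shows the three phases on a toy dependency graph:

    #include <iostream>
    #include <map>
    #include <queue>
    #include <set>
    #include <string>
    #include <vector>

    using Object = std::string;

    int main() {
        // Dependencies: an edge a -> b means "a can influence b".
        std::map<Object, std::vector<Object>> influences = {
            {"sensor_input", {"speed"}},
            {"speed", {"brake_request"}},
        };

        // Phase 1: taint information - taint the initiators (source objects).
        std::set<Object> tainted = {"sensor_input"};

        // Phase 2: taint propagation - spread the taint along dependencies.
        std::queue<Object> worklist;
        for (const Object &o : tainted) worklist.push(o);
        while (!worklist.empty()) {
            Object o = worklist.front();
            worklist.pop();
            for (const Object &succ : influences[o])
                if (tainted.insert(succ).second)  // newly tainted
                    worklist.push(succ);
        }

        // Phase 3: taint checking - report objects that should not be tainted.
        std::set<Object> mustStayUntainted = {"brake_request"};
        for (const Object &o : mustStayUntainted)
            if (tainted.count(o))
                std::cout << "Violation: " << o << " is tainted\n";
    }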

2.4

Pointer and Alias Analysis

During taint analysis, propagating operations need to be identified. As far as the C language is concerned, the main challenge is pointer and alias analysis. According to Avots et al. [25], C is an unsafe language and is difficult to analyze. In fact, operations can be performed on pointers, and pointers can point to stack objects, heap objects or functions. There are also multi-level pointers. All of this increases the complexity of the analysis, according to Andersen [26]. Thus, a sound pointer analysis is very hard to achieve. A pointer analyzer must make compromises to obtain readable and reasonable results. Therefore, different properties can be used to identify the level of precision needed for the pointer analysis. According to Hind [27], this level should be in line with the customer's needs.

Andersen presents in his Ph.D. thesis [26] a pointer analysis for the C language based on subset constraints. This analysis is inter-procedural, which means that the relationships between the functions are taken into account. Steensgaard [28] presents another inter-procedural pointer analysis, which is based on equality constraints.

Definitions

Andersen [26] defines two fundamental concepts regarding pointer analysis: “alias pair” and “point-to information”.

Alias pair: if p = &x is an assignment, then *p is aliased with x. The alias pair is written ⟨*p, x⟩. "When the lvalue of two objects coincides, the objects are said to be aliased" [26].

Point-to information: if p = &x and p = &y are two assignments, then the point-to information of p is the set {x, y}, and is written p ↦ {x, y}. Point-to information denotes "the set of objects a pointer may point to" [26].
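As a small illustration (a hypothetical fragment, not taken from the thesis), the two notions can be seen in the following C-style code:

    void example(int cond) {
        int x = 0, y = 0;
        int *p;

        if (cond)
            p = &x;   /* here *p is aliased with x: the alias pair <*p, x> holds */
        else
            p = &y;   /* here *p is aliased with y: the alias pair <*p, y> holds */

        *p = 1;       /* considering both assignments, the point-to information  */
                      /* of p is p -> {x, y}                                     */
    }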

Properties

Pointer analysis properties aim at defining the level of precision needed by the application.

Field-sensitivity: Field-sensitivity deals with aggregate data types such as structures and arrays. A field-sensitive analysis studies each field of each structure separately, whereas a field-insensitive analysis considers each access to aggregate data as an access to the whole structure [29].

Intra-procedural or inter-procedural: The intra-procedural pointer analysis performs data-flow analysis only inside functions. This is much easier than inter-procedural analysis, which performs a pointer analysis considering the interaction between functions. Inter-procedural analysis consists in analyzing each function call separately [26].

(17)

2.5. LLVM

Flow-sensitivity: Flow-sensitive analysis takes the execution order of the program, called control-flow, into consideration. This analysis is more precise because it can detect a dependency at a given line in the source code, which is also called program-point specific analysis. Contrary to flow-sensitive analysis, flow-insensitive analysis can only summarize the dependencies between pointers in the whole program. Pointers which are aliases only at a given moment of the program are referred to as "may-alias" [26].
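A hypothetical fragment illustrating the difference:

    void example(void) {
        int x = 0, y = 0;
        int *p = &x;    /* a flow-sensitive analysis knows that, at this point,  */
        *p = 1;         /* p points only to x, so this write modifies x          */

        p = &y;
        *p = 2;         /* ...and this write modifies y                          */
    }
    /* A flow-insensitive analysis only records p -> {x, y} for the whole
       function, so p and x (and p and y) are reported as may-aliases at
       every program point. */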

2.5 LLVM

The Low Level Virtual Machine (LLVM) Project [12] is a compiler framework developed at the University of Illinois. This framework is composed of “modular and reusable compiler and toolchain technologies” [30]. LLVM aims at being a long-term code analysis and optimization system by providing built-in optimization and analysis passes, and the possibility to develop new passes.

Compilation

The compilation is usually divided into three phases [Fig. 2.1]. First, a static compiler front-end, such as Clang, parses the source code and translates it into the LLVM intermediate representation (IR). Then, LLVM modules analyze the LLVM IR to optimize the code, and finally machine code compatible with the chosen platform is generated.

Figure 2.1: Compilation process [31]

LLVM Intermediate representation (IR)

LLVM IR [19] is an intermediate representation used during compilation. It provides "a human readable assembly language representation" (.ll) and a binary representation called "bitcode" (.bc) which can be executed and on which optimizations are performed.

LLVM IR is a "language independent type-system", which uses common low-level primitives to implement complex high-level functions. Its architecture is a "load/store architecture": all accesses to memory are done using load (read from memory) or store (write to memory) instructions [32]. This means that all more complex operations which require access to memory are broken down into load and store instructions.
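For example (an illustrative sketch; the exact instructions emitted depend on the Clang version and optimization level), a single C assignment between two global variables is lowered to a load followed by a store:

    int qm_value;        /* global variables */
    int safe_counter;

    void update(void) {
        safe_counter = qm_value;
        /* Roughly corresponding LLVM IR:
             %0 = load i32, i32* @qm_value
             store i32 %0, i32* @safe_counter                                  */
    }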

LLVM bitcode files can be linked together into one single file thanks to the LLVM linker [33], which aims at resolving the definition of functions and variables declared in different files.

Static Single Assignment (SSA)

LLVM IR is a “Static Single Assignment (SSA)” [34] based language: each new assignment of a value to a variable results in a new version of the variable being created. Data-flow analysis is facilitated by SSA representation which expresses a variable as a function of its previous versions.

According to Braun et al. [35], SSA form aims at improving the efficiency of the analysis by "compactly representing use-def chains". A use-def chain is a data structure composed of an instruction (use) of a variable, and all the possible definitions of this variable. The def-use information is the list of all the instructions which involve a given variable. LLVM SSA is built according to Cytron et al.'s algorithm [34]. This algorithm first identifies the different definitions of the variable. Then, if there are concurrent definitions, due to an if-statement for example, the multiple definitions are merged and propagated. Finally, the new definition of the variable replaces the old variable in its different uses.
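A small illustration (a hypothetical example) of how concurrent definitions caused by an if-statement are handled:

    int select_value(int c) {
        int x;
        if (c)
            x = 1;
        else
            x = 2;
        return x;
        /* In SSA form, each assignment creates a new version of x, and the
           two concurrent definitions are merged at the join point:
             x1 = 1
             x2 = 2
             x3 = phi(x1, x2)
             return x3                                                        */
    }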

LLVM Pass Framework

The LLVM project provides an LLVM Pass Framework [13]. An LLVM pass can be used to transform, analyze and optimize source code. New LLVM passes can also be developed in C++. Several types of passes are available, which enable the analysis of the source code on different scales, such as modules, functions or basic blocks.

2.6 Related Work

Clang static analyzer

Clang static analyzer is an open-source tool, part of the Clang and the LLVM projects [36]. The formal analysis is based on symbolic execution: a core engine simulates the different execution paths of the program, while the constraint manager checks if the path is satisfiable. The algorithm is path sensitive, so all the possible paths are explored. Arroyo et al. [11] developed a “user configurable static analyzer taint checker” plugin for Clang static analyzer, which aims at checking the propagation of tainted data in C, C++ and Objective C programs. Their tool provides a configuration file so that users can define the sources, propagators, sinks and filters of the taint analysis. Sinks are defined as “critical functions” which should not be influenced by tainted data. Filters are sanitizers which can generate safe data from tainted data. This tool can be used to detect security flaws which could be triggered by malicious user inputs.

Sparse Value Flow (SVF)

SVF (Sparse Value-Flow) [37] is an open-source static analysis tool developed at the School of Computer Science and Engineering, UNSW, Australia. This tool is implemented on top of LLVM and aims at analyzing inter-procedural pointer dependencies for C and C++ programs. It resolves both data and control flow dependencies, thus enabling a more precise analysis. The value-flow construction module, based on Andersen's points-to information, generates an "inter-procedural memory SSA" [37] representation, providing def-use chains for pointers, whereas LLVM only provides an intra-procedural memory dependence analysis pass, according to Sui et al. [37]. The inter-procedural analysis is performed sparsely, that is to say, by first over-approximately computing def-use chains and then eliminating unnecessary propagation, thus refining the data-flow analysis. SVF can be used to detect bugs involving value-flow reachability, such as memory leaks. SABER [38] is a memory leak detector developed on top of SVF. SVF can also be used to implement "scalable and precise pointer analyses" [38].

Frama-C

The Frama-C platform [39] is an open-source static analysis tool, which aims at performing safety verification on industrial C code. This tool is intended to be correct, which means that it provides over-approximation, in order to guarantee that no error remains undetected. Frama-C uses abstract interpretation, deductive verification and concolic testing, which is a form of dynamic symbolic execution, to prove assertions. Frama-C is developed in the OCaml language and aims at being an extensible platform, composed of several plugins which enable more sophisticated approaches. The Frama-C Evolved Value Analysis plugin aims at identifying the set of possible values of a variable at a given moment of the execution. Frama-C also provides the possibility to slice the program in order to simplify it, and to navigate the use-def chains. Thus, Frama-C can be used to verify that the source code respects its specifications, which can be expressed as ACSL (a formal specification language) annotations. However, Frama-C does not currently provide a taint analysis plugin, although it is possible to compute the dependencies between variables.

Assisted Assignment of Automotive Safety Requirements

Azevedo et al. [10] have developed a tool aiming at automating ASIL allocation and decomposition during the design phase. According to ISO 26262 Part 9.5 [1], if several independent safety requirements are responsible for the ASIL rating of a common element, then it is possible to assign a lower ASIL to these requirements. For example, if an element is tainted ASIL D because of two ASIL D sub-elements, then these two sub-elements can be decomposed into two ASIL B requirements, since two ASIL B sub-elements are equivalent to an ASIL D element. This is done by associating an integer with each ASIL rating (i.e. A=1, B=2, C=3, D=4). In order to compute the ASIL allocation, this tool first generates the fault trees using an existing safety analyzer and design optimizer called HiP-HOPS (hierarchically performed hazard origin and propagation studies) [40]. Then, ASIL decomposition is computed by running a constraint solving algorithm on the "minimal cut set" [10], which refers to the smallest set of events that causes an element to be marked as ASIL. This tool can be used to reduce development costs by limiting the amount of high ASIL elements.
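The integer encoding above can be used to check whether a proposed decomposition is consistent, as in the following simplified sketch (illustrative only; the actual ISO 26262 decomposition schemes also impose independence requirements on the sub-elements):

    #include <iostream>

    // ASIL ratings encoded as integers, following the encoding used by
    // Azevedo et al.: QM = 0, A = 1, B = 2, C = 3, D = 4.
    enum Asil { QM = 0, A = 1, B = 2, C = 3, D = 4 };

    // A decomposition of a requirement into two independent sub-requirements
    // is considered consistent if the encoded values add up to the original
    // rating (e.g. D = B + B, D = C + A, C = B + A).
    bool consistentDecomposition(Asil original, Asil part1, Asil part2) {
        return part1 + part2 == original;
    }

    int main() {
        std::cout << consistentDecomposition(D, B, B) << "\n";  // 1: ASIL D -> ASIL B + ASIL B
        std::cout << consistentDecomposition(D, B, A) << "\n";  // 0: B + A is not enough for D
    }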

2.7 Visualization

Software visualization refers to the visual representation of software components [41]. The challenge related to software visualization is to provide understandable and useful information for developers so that they can work more effectively [42]. In fact, software visualization aims at reducing the effort spent by developers on development and maintenance tasks [43]. According to Shahin et al.'s systematic review [41], the most used visualization technique is graph-based visualization.

Graph representation

When static analysis is used to examine relations between objects in the source code, a graph representation can be a suitable solution. In fact, graphs can be used to represent these relationships graphically, nodes being objects and relations being edges. Some graphs are commonly used in static analysis, such as call graphs, program dependence graphs and control-flow graphs. Call graphs display the calling relationship between functions, nodes being functions and edges being calls. Program dependence graphs are used to show the dependencies between variables, nodes being statements or values, and edges being relations between them. Control-flow graphs present the different execution paths of a program [42], nodes being instructions and edges being instruction jumps.

The SVF tool [37] can generate a value-flow graph in order to display program dependencies. Different kinds of nodes exist and are highlighted by different colors in order to identify them. The dependencies between elements are represented by edges.

As far as graph representation is concerned, the developer should be able to easily find the useful information and understand the relationships between objects. Providing interactive features enables the user to hide information which is not currently important and to expand useful details [42]. Information visualization can be facilitated by navigation interactions such as zooming, moving or expanding nodes.

The key issues related to graph representation are due to the information layout. To make a graph easy to read and understand, information should be organized clearly and follow specific rules, according to Herman et al. [44]. Graph drawing also has aesthetic and practical rules, such as equal space distribution between nodes. Moreover, edge crossing should be avoided if the graph is planar. One of the most common graph layouts is the tree layout, which is convenient for displaying hierarchical information. "Tree layout algorithms have the lowest complexity and are simpler to implement" [44].

Code annotation

As static analysis is used to analyze source code, a common visualization method is to annotate the code directly with the results. Usually, a plugin can be developed and integrated into the IDE (integrated development environment).

The results of the taint checker developed by Arroyo et al. [11] are displayed as code annotations in order to warn the developer against untrusted data during development. Frama-C [39] also provides a user interface and a source code browser to display the results on the code.

The advantage of code annotation is to let developers see the context of a result [42], that is to say, reading the code and locating the information inside the project. However, annotations on code do not give the possibility to have a global representation of the dependencies.

LaToza and Myers [42] developed an Eclipse plugin composed of both code annotation and graph representation in order to navigate the call graph of a module. In fact, inter-procedural dependencies are easily represented through a call graph. Thus, the user can get context information from the Eclipse IDE and global information from the graph.

Useful properties in software visualization tools

According to Bassil and Keller [43], "appropriate visualization can significantly reduce the effort spent on system comprehension and maintenance". In order to define what an "appropriate visualization" is, Bassil and Keller conducted a survey about software visualization tools. They aimed at evaluating the usefulness and the importance of different visualization aspects. Bassil and Keller [43] report the most essential properties according to the results of the questionnaire:

1. "Search tools for graphical and/or textual elements"
2. "Source code visualization (textual views)"
3. "Hierarchical representation"
4. "Use of colors"
5. "Source code browsing"
6. "Navigation across hierarchies"
7. "Easy access, from the symbol list, to the corresponding source code"

Some useful but not essential properties have also been reported, such as “saving of views for future use”, the “possibility of having multiple [...] instances of the same object being highlighted in all the views”, or the “visualization of different levels of detail in separate window”.

Bassil and Keller [43] have also questioned experts about code analysis support of software visualization tools. It has been reported that the most important functionalities are "visualization of function calls", "visualization of inheritance graph" and "visualization of different levels of detail in separate window".


2.8 Evaluation

In the context of static analysis in automotive safety, the accuracy of the tool should be measured so that users can assess whether they can rely on the results. Moreover, the tool aims at helping engineers to be more efficient in their work. Thus, the usefulness of the results should be evaluated to check whether the tool fulfils its goal.

Evaluating the usefulness of the results

According to Seaman [45], qualitative research methods are increasingly used to take into account human behaviour when evaluating software. Qualitative data cannot be represented as numbers, contrary to quantitative data. Two data collection methods are commonly used: "participant observation and interviewing" [45]. The first one consists in observing software developers while they are working and taking notes about their behaviour and thoughts. The second one consists in asking a series of questions to developers. After collecting data, the results should be analyzed in order to extract "a statement or proposition" [45].

LaToza and Myers [42] evaluated the "potential productivity benefits [...] and the usability" of their call graph navigation tool, called REACHER, by conducting a lab study with 12 participants. This tool aims at reducing the time required for a task by allowing developers to understand and navigate the code more effectively. The study consisted in comparing the time the participants needed to perform a task with Eclipse to the time needed to perform the same task with REACHER. To make the two tools comparable, all the participants had completed two tutorials on Eclipse and REACHER in order to familiarize themselves with both interfaces before taking part in the study. Each task involved the understanding of "control flow between events" in the program and the use of a call graph, which is REACHER's focus. Each task focused on a particular aspect of REACHER.

Evaluating the accuracy of the results

According to Anderson [46], ISO 26262 requires static analyzers to be qualified by assessing the tool confidence level (TCL). This is expressed as the possibility that a failure in the tool prevents the requirements from being met (tool impact, TI), and the probability that the failure can be detected (tool error detection, TD). Thus, the accuracy of the tool should be assessed and the functional requirements should be tested.

Arroyo et al. [11] evaluated the accuracy of their taint checker based on the Clang static analyzer according to the following criteria:

• “capacity for finding usage of tainted data”: this refers, for example, to the capacity of the tool to detect the use of a tainted variable in a given instruction. Each type of usage was tested in a test case.

• “the number of false positives”: this refers to the wrong propagation of tainted data generating false errors.

• "scalability": the tool was tested on a real case, the Heartbleed vulnerability of OpenSSL.

Sui et al. [38] performed an experimental evaluation in order to measure the accuracy of their static memory leaks detector, called SABER. They define accuracy as the “ability to detect memory leaks with a low false positive rate”. To conduct the study, they tested their tool on “15 SPEC2000 C programs (620 KLOC) and seven open-source applications”. They reported the number of faults found by SABER, and the number of false positives. Then, they computed the false positive rate as seen in [eq. (2.2)]. Finally, they compared the results to the results obtained with other analyzers.


Recall that the number of faults reported and the true number of faults can be expressed as follows:

    number of faults reported = false positives + true positives    (2.1)

Then, the false positive rate can be defined as:

    false positive rate = false positives / number of faults reported    (2.2)

They concluded that their detector is "neither complete [...] nor sound" [38] due to some approximations, such as treating multi-dimensional arrays monolithically or bounding the number of loop iterations.

Imparato et al. [8] have reported "a comparative study of static analysis tools for AUTOSAR". They evaluated the tools according to their precision and recall, which can be expressed as follows:

    precision = true positives / number of faults reported    (2.3)

    recall = true positives / (false negatives + true positives)    (2.4)

A high precision saves time because it limits the amount of false alerts that developers have to check. The recall measures the number of errors detected out of the total number of errors. If the recall equals 1, then the tool detects all the errors.
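As a purely hypothetical numerical illustration of these metrics: suppose a tool reports 10 faults, of which 7 are real (true positives) and 3 are spurious (false positives), while 2 real faults are missed (false negatives). Then:

    precision = 7 / 10 = 0.7
    false positive rate = 3 / 10 = 0.3
    recall = 7 / (2 + 7) ≈ 0.78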


3 Method

This chapter describes the implementation of the taint analyzer on top of LLVM in sections 3.1 and 3.2, the development of the visualization tool in section 3.3, and the evaluation of the accuracy of the results and the usefulness of the visualization tool in section 3.4.

3.1 LLVM

The first research question was to examine if it was possible to utilize LLVM to develop a static analysis tool for automotive software. This study was done in three steps.

The first step was to study how to develop a plugin on top of LLVM. One of the advantages of the compiler infrastructure is the LLVM Pass Framework [13], presented in section 2.5. LLVM passes can be used to transform, analyze and optimize source code in a modular way. Moreover, it is possible to develop new LLVM passes easily thanks to a set of reusable functions and application programming interfaces (APIs) written in C++. LLVM also provides detailed documentation [47] intended for developers. New passes inherit from one of the Pass child classes: ModulePass, CallGraphSCCPass, FunctionPass, LoopPass, RegionPass and BasicBlockPass. In the context of the thesis, the Module pass was selected because it can analyze the whole program and therefore enables inter-procedural analysis, whereas the Function pass only provides the possibility of analyzing the content of each function separately and independently. Finally, the runOnModule function, which should be overridden, is the entry point of the pass. Thus, any object-oriented application can be developed on top of a Module pass.
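A minimal skeleton of such a Module pass, using the legacy pass registration mechanism available in LLVM 5 (the class name and the registered pass name below are placeholders, not those of the thesis implementation), could look as follows:

    #include "llvm/Pass.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Function.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    namespace {
    // Placeholder name; the real pass of this thesis is structured differently.
    struct TaintAnalysisPass : public ModulePass {
      static char ID;
      TaintAnalysisPass() : ModulePass(ID) {}

      // Entry point of the pass: called once for the whole module, which is
      // what enables inter-procedural analysis.
      bool runOnModule(Module &M) override {
        for (Function &F : M)
          errs() << "Visiting function: " << F.getName() << "\n";
        return false; // pure analysis: the IR is not modified
      }
    };
    } // end anonymous namespace

    char TaintAnalysisPass::ID = 0;
    static RegisterPass<TaintAnalysisPass>
        X("taint-analysis", "Example taint analysis module pass", false, false);

Compiled with Clang into a shared library, such a pass can then be loaded into the modular optimizer opt and run on an LLVM bitcode file.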

The second step was to study how to perform the taint analysis based on the function and APIs provided by the LLVM infrastructure. LLVM APIs give the possibility to iterate over several objects of the LLVM IR inside the module. For example, it is possible to iterate over each instruction, each function or each global variable of the program. It is also possible to iterate over the def-use chains, defined in section 2.5, making LLVM especially well suited to perform taint analysis.
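For example, the users of a value, that is to say its def-use chain, can be enumerated directly through the LLVM API (a sketch, not the thesis code):

    #include "llvm/IR/Function.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/Value.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    // Walk the def-use chain of a value: every user is an instruction or a
    // constant expression that refers to this value.
    static void listUsers(Value &V) {
      for (User *U : V.users()) {
        if (auto *I = dyn_cast<Instruction>(U))
          errs() << "used by an instruction in function "
                 << I->getFunction()->getName() << "\n";
        else
          errs() << "used by a non-instruction user (e.g. a constant expression)\n";
      }
    }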

The last step was to study how to run the pass on the projects of the company. Once the pass is developed, it must be compiled with Clang in order to generate a shared library. Then, a pass can be run on an LLVM bitcode file through the command-line interface thanks to "the modular optimizer, opt", according to Lattner and Adve [48]. Thus, in order to analyze the source files of the different projects of the company, the projects had to be compiled with Clang to obtain the corresponding bitcode file of each module, where a module is a single C translation unit.

3.2 Taint analysis

The second research question was to examine how taint analysis can be used to analyze the dependencies between safe variables in the automotive industry. The first phase was to define the way to identify the source safe variables, called taint information, and how to implement them in the tool. The second phase was to determine the taint propagation policy, that is to say the set of operations or actions propagating the taint, according to the automotive industry requirements and ISO 26262 [1]. The last phase was to study how to implement the taint analysis algorithm to analyze the LLVM IR.

Taint information

Taint information, also called source information, represents the data set tainted at the initialization of the taint analysis algorithm. Thereafter, tainted data refers to safety-critical data, divided into four ASIL ratings (A, B, C, D), whereas untainted data refers to quality management (QM) data.

Specification The specifications related to taint information should state the type of objects which can be tainted by the user at the beginning. These specifications have been discussed during a meeting with the safety engineers of the company. In the context of the thesis, according to the needs of the company, taint information should be user-configurable, which means that the user can define the list of tainted values as an input of the taint analysis tool. Then, it has been decided that the source objects that a user can taint at the initialization could be:

• a global variable, identified by its name,

• a memory region, identified by an address range,
• a source-code file, identified by its name.

In fact, specifying the name of a safe global variable is sufficient to identify it in the source code. Moreover, specifying a memory region can be used to taint the safe registers and the partitions which should be protected in the MPU. Specifying a file is useful if a lot of functions that have to be tainted are located in the same file. This prevents developers from writing the name of each tainted function one at a time.

Finally, each user input can be associated to an ASIL rating (A, B, C, D).

Implementation To implement a user-configurable analyzer, taint variables are defined by the user in an XML configuration file, which is read by the taint analysis pass using the C++ XML processing library Pugixml [49]. Then, user input is converted into several instances of the Input class [Fig. 3.2]. This class is composed of the name of the object or of a memory region (start and end addresses), and an ASIL rating. All the instances of the Input class are stored in a list, which is a member of the taint analysis pass. Thus, this list represents the set of taint information.
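The configuration format itself is not reproduced here; as an illustration only, a layout such as the following could be read with Pugixml (the element and attribute names, the file name, and the struct layout in this sketch are hypothetical):

    #include <iostream>
    #include <string>
    #include <vector>
    #include "pugixml.hpp"

    // Hypothetical record loosely mirroring the thesis's Input class.
    struct Input {
      std::string name;   // global variable name, file name, or empty for a region
      std::string asil;   // "A", "B", "C" or "D"
    };

    int main() {
      // Example configuration (hypothetical schema):
      //   <taint>
      //     <input name="safe_counter" asil="D"/>
      //     <input name="Safety.c"     asil="B"/>
      //   </taint>
      pugi::xml_document doc;
      if (!doc.load_file("taint_config.xml"))
        return 1;

      std::vector<Input> inputs;
      for (pugi::xml_node node : doc.child("taint").children("input"))
        inputs.push_back({node.attribute("name").as_string(),
                          node.attribute("asil").as_string()});

      for (const Input &in : inputs)
        std::cout << in.name << " -> ASIL " << in.asil << "\n";
    }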

Then, these inputs need to be associated with an LLVM class, that is to say an instance of LLVM::Value, which is the most generic LLVM class used to define a variable. An LLVM function is used to select the LLVM::Value corresponding to a name or a memory region. The child classes of LLVM::Value are presented in [Fig. 3.1]. Taint information can either be a global variable (LLVM::GlobalVariable), an address (LLVM::ConstantExpr) or a function (LLVM::Function). LLVM::AllocaInst and LLVM::Argument cannot be part of taint information, since they define local variables; however, they will be used later in the analysis of the dependencies.

Once the LLVM::Value instance corresponding to the Input has been identified, taint information is converted to an instance of the SafeValue class [Fig. 3.3], which is composed of:

• an LLVM::Value instance

• an instance of the enumeration ASIL (QM, A, B, C, D)

This class is the key of the taint analyzer because every LLVM::Value instance analyzed by the algorithm is stored in a SafeValue instance. All instances rated ASIL A, B, C or D are tainted information, whereas instances rated QM are untainted information.
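A condensed sketch of such a class (the actual member and method names of Figure 3.3 may differ) could be:

    #include "llvm/IR/Value.h"

    // ASIL rating attached to every analyzed value; QM means untainted.
    enum class Asil { QM, A, B, C, D };

    // Associates an LLVM value with its current safety rating.
    class SafeValue {
    public:
      SafeValue(llvm::Value *V, Asil Rating) : V(V), Rating(Rating) {}

      bool isTainted() const { return Rating != Asil::QM; }

      // When a value is influenced by several objects, it keeps the highest
      // ASIL among them (ISO 26262 Part 9).
      void raiseTo(Asil Other) {
        if (Other > Rating)
          Rating = Other;
      }

      llvm::Value *getValue() const { return V; }
      Asil getAsil() const { return Rating; }

    private:
      llvm::Value *V;
      Asil Rating;
    };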

Figure 3.1: An overview of the LLVM Value inheritance [30]


Figure 3.2: UML Diagram, describing the architecture of the taint analysis pass


Taint propagation policy

After defining the taint information, the second phase is to identify the kind of operations which can propagate the taint to other variables, which can be either global variables, local variables, addresses or functions.

Specification The taint propagation policy has been defined in accordance with the opinion of the engineers of the company, based on their experience with safety requirements. Seven cases have been defined and are presented below [Tab. 3.1]. If an object is tainted by several objects, then the highest ASIL should be assigned to it, according to ISO 26262 Part 9 [1].

Store If a new value is assigned to an ASIL variable, resulting in the variable being modified, then the function where the assignment is done should be tainted. A memory write access is always translated into a store instruction in LLVM IR [32]. It is considered that an instruction modifies tainted data if its memory location or its content is overwritten. Thus, if tainted data is a pointer, any assignment to the pointer or to the dereferenced pointer will be considered as a modification.

Load address If an ASIL hard-coded address is assigned to a scalar variable, or converted and assigned to a pointer, then the variable or pointer should be tainted.

Pointer parameter If an ASIL pointer is passed as a parameter to a function, then the content of the function should be analyzed to check if the pointer is modified inside, that is to say, if its memory location or its content is overwritten by another value. In order to do this, the function behaviour is first over-approximated: the calling function and the parameter inside the function are tainted. Then, the content of the called function is analyzed to determine whether the pointer is effectively modified. If there appears to be a modification, then the called function is tainted as well. If the pointer is not modified inside the function, then the called function is not tainted.

Function call If a function is tainted, then each function calling this function should also be tainted. Thus, the taint is propagated to the functions of the call graph originating from this function.

Global If a global value is initialized with tainted data, then this global value should also be tainted.

File A file can only be tainted if the user includes its name in the configuration file. If a file is tainted, then all global variables and functions defined in this file should also be tainted.

Violation When the scalar value of a tainted variable is assigned to a QM variable or a lower ASIL variable, it is not a safety-critical operation, because the safe memory is not likely to be modified. So, no tainted value is added. However, if a tainted pointer is stored in another QM or lower ASIL pointer, the safe memory could be modified later through this unsafe pointer. Thus, this case should not happen in a safe application, except if the tainted variable is a hard-coded address, or if it is a global variable definition. Assigning an ASIL variable to a lower ASIL or QM variable is inconsistent with safety recommendations. Thus, this case is considered as a violation.


Table 3.1: Taint propagation policy

Store
• Description: modification of a safe variable inside a function
• Taint information: lvalue (any type)
• Taint propagation: function
• Examples: variable_asil = variable_qm; variable_asil = function_qm(); pointer_asil = &variable_qm;

Load address
• Description: a safe address is loaded into a variable inside a function
• Taint information: rvalue (address)
• Taint propagation: function and lvalue
• Examples: int* pointer = (int *) 0x0F; uint32 address = 0x0F;

Pointer parameter
• Description: a safe pointer is passed as a parameter to a function
• Taint information: the pointer parameter
• Taint propagation: parameter, calling function and called function
• Example: called_fn(&variable_asil); with the definition void called_fn(int* pointer) { *pointer = variable_qm; }

Call
• Description: a call to a safe function
• Taint information: called function
• Taint propagation: calling function
• Example: void calling_function() { function_modifying_ASIL() }

Global
• Description: a global value definition
• Taint information: rvalue
• Taint propagation: global variable
• Examples: int* global = &global_asil; int* global = 0x00001002;

File
• Description: a file is marked as safe
• Taint information: file
• Taint propagation: global variables, functions

Violation
• Description: a safe pointer is loaded into an unsafe pointer
• Taint information: rvalue (not an address)
• Taint propagation: violation
• Examples: pointer_qm = pointer_asil; pointer_qm = &variable_asil;

Implementation

The first step of the implementation was to define the scope of the taint analysis pass. Then, the second step was to develop the algorithm to parse and analyze the LLVM IR, in order to identify the different cases presented in the propagation policy [Tab. 3.1]. The last step was to compile the project with Clang to generate LLVM IR.

MISRA C Guidelines Some assumptions have been made throughout the development process of the analyzer according to MISRA C Guidelines [14]. The following rules apply to the embedded project analyzed by the taint analysis pass:

• Each line of code is reachable.

• Variables should always have distinct names.

• Dynamic allocation and deallocation functions are not used.

These rules allow some simplifications. All the lines of the LLVM bitcode file are analyzed as there is no unreachable code. A variable can be identified by its name since two different variables should have different names. Dynamic allocation and deallocation are not taken into account during the analysis. Only hard-coded memory addresses are studied.

(29)

3.2. Taint analysis

Pointer analysis A pointer analysis can have different levels of accuracy, as presented in section 2.4. The level of accuracy needed by the tool has been established according to the needs of the company. The taint analyzer should be field-insensitive, which means that each access to a sub-element is treated as an access to the whole aggregate. In fact, according to ISO 26262 Part 9 Section 6.2 [1], elements composed of sub-elements should be developed according to “the highest ASIL applicable to the element”. The taint analyzer should be inter-procedural so that relationships between functions can be analyzed, in order to identify when a tainted pointer parameter is modified inside a function. Finally, the taint analyzer should be flow-insensitive, which means that the execution order of the program is not taken into account. This is an over-approximation which simplifies the analysis, because flow-sensitive analysis is costly in terms of complexity.
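Field-insensitivity can be illustrated with a small hypothetical aggregate (names invented for the example): because an element composed of sub-elements must be handled at the highest ASIL applicable to it, any field access is treated as an access to the tainted aggregate as a whole.

struct SensorData {
    int wheel_speed;               /* assumed ASIL D sub-element */
    int cabin_temp;                /* assumed QM sub-element     */
};

struct SensorData sensors;         /* the aggregate is handled at ASIL D */

void set_cabin_temp(int t) {
    /* Field-insensitive view: this is a write to the tainted aggregate,  */
    /* so the function is tainted even though only the QM field changes.  */
    sensors.cabin_temp = t;
}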

Instruction level The taint analysis pass only analyzes the code at the instruction level. Thus, analyzing machine-level code such as assembly is out of scope.

LLVM IR analysis At initialization time, taint information is defined. The taint should be propagated to other data according to the taint propagation policy.

The users of each piece of taint information, that is, the instructions involving a given LLVM::Value instance, can be listed using the iterator over its users. Once a user is detected, it needs to be analyzed to identify which case of the taint propagation policy it corresponds to. A user can be either an instruction or a constant expression. The AnalyzerFactory selects the child class of Analyzer corresponding to the LLVM IR instruction type, as described in the UML diagram [Fig. 3.2].
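As a rough sketch of what this user iteration can look like against the LLVM C++ API (the handle* helpers are hypothetical stand-ins for the Analyzer subclasses produced by the AnalyzerFactory, and only a few instruction kinds are shown):

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Value.h"

using namespace llvm;

// Hypothetical stand-ins for the Analyzer subclasses.
void handleStore(StoreInst *SI);
void handleLoad(LoadInst *LI);
void handleCall(CallInst *CI);
void handleConstantExpr(ConstantExpr *CE);

// Enumerate the users of a tainted value and dispatch on their kind.
void analyzeUsers(Value *Tainted) {
  for (User *U : Tainted->users()) {
    if (auto *SI = dyn_cast<StoreInst>(U))
      handleStore(SI);
    else if (auto *LI = dyn_cast<LoadInst>(U))
      handleLoad(LI);
    else if (auto *CI = dyn_cast<CallInst>(U))
      handleCall(CI);
    else if (auto *CE = dyn_cast<ConstantExpr>(U))
      handleConstantExpr(CE);   // a user can also be a constant expression
  }
}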

The LLVM language reference manual [19] describes the different LLVM IR instructions.

Listing 3.1: Store Inst

store {type} {source}, {type}* {destination}, align {type_alignment}

The store instruction writes a value to an address in memory. It is the only instruction which can modify the content of an existing variable in memory (at the LLVM IR level) [19]. Thus, this instruction is related to the Store case of the taint propagation policy if the destination operand has a higher ASIL than the source operand. Otherwise, it is a violation. Finally, if the source operand is a safety-critical address, then it is related to the Load Address case of the taint propagation policy.

A store instruction is often preceded by a load instruction which aims at loading the destination address or the source value of the store instruction.
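A minimal sketch of how a store could be classified against the policy, assuming hypothetical helpers (asilOf and isSafeAddress) that look up the safety level recorded for an operand:

#include "llvm/IR/Instructions.h"

using namespace llvm;

// Hypothetical lookups into the taint information (QM = 0, ASIL A..D = 1..4).
int  asilOf(const Value *V);
bool isSafeAddress(const Value *V);     // hard-coded safety-critical address

// Classify a store instruction against the cases of Table 3.1.
void classifyStore(StoreInst *SI) {
  Value *Src = SI->getValueOperand();   // the value being written
  Value *Dst = SI->getPointerOperand(); // the destination memory location
  if (isSafeAddress(Src)) {
    // Load Address case: an ASIL address is assigned to a variable or pointer.
  } else if (asilOf(Dst) > asilOf(Src)) {
    // Store case: a higher-ASIL destination is modified; taint the function.
  } else if (asilOf(Src) > asilOf(Dst) && Src->getType()->isPointerTy()) {
    // Violation case: a tainted pointer escapes into a lower-ASIL pointer.
  }
}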

Listing 3.2: Load Inst

{result} = load {type}, {type}* {source}, align {type_alignment}

The load instruction reads the content of a memory address and stores it in an SSA result. This instruction is used each time the content of a memory address needs to be read. For example, a load instruction can be used to load the address stored in a pointer; in order to access the value pointed to by the pointer, a second load instruction is used to load the content stored at that address.

A load instruction does not necessarily indicate that the loaded operand will be modified. In fact, the address held by a pointer can be loaded either to modify the pointed-to content or to read it. Therefore, the instructions following the load should be analyzed until a store instruction or a call instruction is found.

The call instruction is a special case related to inter-procedural analysis.

Listing 3.3: Call Inst

{result} = call {return_type} @{function}({arguments})


The call instruction is used for function calls. The return value is stored in an SSA result. When performing inter-procedural analysis, if safety-critical data is passed as a parameter to the function, then the content of the function needs to be analyzed as well. This instruction is related to the Pointer Parameter and Call cases of the taint propagation policy.
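A sketch of the inter-procedural step, under the same assumptions as the previous sketches (isTainted and propagateTaint are hypothetical stand-ins for the SafeValue bookkeeping, and a reasonably recent LLVM version is assumed): when a tainted value is passed as an actual argument, the matching formal parameter of the callee is tainted so that the callee's body is analyzed in turn.

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

bool isTainted(Value *V);         // hypothetical lookup
void propagateTaint(Value *V);    // hypothetical: taints V and analyzes its users

// When a tainted value is passed to a function, taint the matching parameter.
void handleCall(CallInst *CI) {
  Function *Callee = CI->getCalledFunction();
  if (!Callee || Callee->isDeclaration())
    return;                       // indirect or external call: body not available
  unsigned Idx = 0;
  for (Argument &Param : Callee->args()) {
    if (Idx < CI->arg_size() && isTainted(CI->getArgOperand(Idx)))
      propagateTaint(&Param);     // Pointer parameter case: analyze the callee body
    ++Idx;
  }
}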

Listing 3.4: Alloca Inst

{result} = alloca {type}, align {type_alignment}

The alloca instruction is used to allocate memory on the stack frame during the execution of a function. It enables the declaration of local variables, which are released after the function returns. An argument of a function is later assigned to a local value which is declared with an alloca instruction.

Listing 3.5: GetElementPtr Inst

{result} = getelementptr inbounds {type}* {source}, {type} {index}

The getelementptr instruction is used to “get the address of a sub-element of an aggregate data structure” [19], such as an array or a structure. Like the load instruction, it does not necessarily lead to a modification of the operand, so the following instructions need to be analyzed.
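In a field-insensitive setting, the sub-element address computed by getelementptr can simply be treated as an alias of the (tainted) base aggregate; a minimal sketch, again with the hypothetical helpers from the earlier sketches:

#include "llvm/IR/Instructions.h"

using namespace llvm;

bool isTainted(Value *V);         // hypothetical lookup
void propagateTaint(Value *V);    // hypothetical: taints V and analyzes its users

// Field-insensitive handling of getelementptr: the resulting sub-element
// pointer inherits the taint of the aggregate it points into.
void handleGEP(GetElementPtrInst *GEP) {
  if (isTainted(GEP->getPointerOperand()))
    propagateTaint(GEP);          // the users of the GEP are then analyzed as well
}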

Listing 3.6: Global variable

@{globalVarName} = {global | constant} {type} {initializer}, align {type_alignment}

The global instruction is used to declare a global variable. A global variable can be initialized with another global initializer, which can be a global variable or a constant. This is related to the Global case of the taint propagation policy.

Listing 3.7: An example of constant expression: inttoptr

{destination_type} inttoptr({type} {value} to {destination_type})

Finally, a user can also be a constant expression, which is used to perform operations on constants [19]. If a global value, which inherits from the LLVM::Constant class, is used by a constant expression, then the users of this constant expression should also be analyzed. For example, the constant expression inttoptr [List. 3.7] can be used to convert a constant integer, such as an address, to a pointer.
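A sketch of how such a constant expression could be followed, reusing the hypothetical dispatch from the earlier user-iteration sketch:

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

void analyzeUser(User *U);        // hypothetical dispatch over instruction kinds

// A constant expression such as inttoptr is not an instruction, so the taint
// has to be followed through the expression to the instructions that use it.
void handleConstantExpr(ConstantExpr *CE) {
  if (CE->getOpcode() == Instruction::IntToPtr)
    for (User *U : CE->users())
      analyzeUser(U);             // e.g. a store of the converted ASIL address
}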

Propagation policy New tainted variables are stored in SafeValue instances [Fig. 3.3], in the same way as the taint information. Recall that the SafeValue class stores an LLVM::Value analyzed by the pass, which can thus be associated with an ASIL (A, B, C, D) or classified as QM. Each SafeValue object stores the list of its users corresponding to a propagation case, in a map whose keys are the users' locations. Each time a user is identified as a case of the taint propagation policy, it is stored in an instance of SafeInstruction [Fig. 3.3], which is composed of the tainted value, its alias, the propagation type, and its location. If, at some point, the lvalues of two variables are equal, they are said to be aliases, as explained in section 2.4. The location is a global object which refers either to the tainted function where the user is located, or to a tainted global variable if the user is a global declaration. Finally, SafeValue instances are stored in a SafeMap whose keys are the LLVM::Value instances. Thus, it is possible to find out which functions and aliases have been tainted because of a given value, and which case of the taint propagation policy was responsible for the taint.
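The description above corresponds roughly to data structures of the following shape; this is a simplified sketch, the actual classes are those shown in Figure 3.3, and the field names here are illustrative only.

#include "llvm/IR/Value.h"

#include <map>
#include <vector>

enum class Asil { QM, A, B, C, D };
enum class PropagationType { Store, LoadAddress, PointerParameter, Call, Global, File, Violation };

// One detected propagation case (user) of a tainted value.
struct SafeInstruction {
  llvm::Value *taintedValue;   // the value responsible for the taint
  llvm::Value *alias;          // alias created by the propagation, if any
  PropagationType type;        // which policy case was matched
  llvm::Value *location;       // tainted function, or global variable declaration
};

// A value analyzed by the pass, together with its safety level and users.
struct SafeValue {
  llvm::Value *value;
  Asil level = Asil::QM;
  bool tainted = false;
  std::map<llvm::Value *, std::vector<SafeInstruction>> userMap;  // keyed by location
};

// All SafeValue instances, keyed by the underlying llvm::Value.
using SafeMap = std::map<llvm::Value *, SafeValue>;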


Figure 3.3: SafeValue and SafeInstruction classes

Taint propagation algorithm The taint propagation algorithm developed in the context of this thesis is summarized below in pseudo-code. Each instance of the taint information is tainted at initialization. Then, the users of the tainted variables are analyzed. If a user corresponds to a propagation case of the taint propagation policy, then the taint is propagated to the function or the alias. Finally, the user is converted to an instance of SafeInstruction, which is inserted in the user map of the SafeValue instance.

Listing 3.8: Taint propagation algorithm

This is the initialization:

taint_information = list_of_safe_values
for each safe_value in taint_information
    propagating_taint(safe_value)

This function propagates the taint to the safe value and analyzes its users:

void function propagating_taint(safe_value) {
    if not(safe_value.tainted) {
        safe_value.tainted = true
        for each user in safe_value.users() {
            if user corresponds to a propagation case {
                if STORE or LOAD or PARAMETER or CALL
                    propagating_taint(function)
                if LOAD or PARAMETER or GLOBAL
                    propagating_taint(alias)
                convert user to safe_instruction
                append safe_instruction to safe_value.user_map
            }
        }
    }
}


Compiling a project with Clang To run the analysis pass on a project, the project has to be compiled with Clang, in order to generate the LLVM bitcode files.

The following command should be executed for each source file in order to generate the corresponding bitcode file.

Listing 3.9: Build

clang -g -emit-llvm -o file.bc -c file.c

The linking part should be done with the LLVM linker, presented in section 2.5, which combines several bitcode files into a single bitcode file.

Listing 3.10: Linking

llvm-link *.bc -o output.bc

3.3 Visualization

The third research question was “How to represent results in an understandable way so that engineers can improve the safety development process?”. The development of the visualization tool was done in two phases: data structure and serialization, and the development of the graph representation.

Data structure and serialization

The main information to be stored is the list of tainted variables (the instances of the SafeValue class), and the userMap of each safe value, containing the list of functions, safe instructions and aliases related to this tainted value.

Listing 3.11: Example of JSON representation

trees['safeValue'] = {
    "name": "safeValue",
    "userMap": [
        {
            "name": "function1",
            "safeInstructionList": [
                {
                    "alias": "alias1",
                    "propagationType": "store"
                },
                [...]
            ]
        },
        {
            "name": "function2",
            "safeInstructionList": [ [...] ]
        }
    ]
}
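For completeness, a minimal sketch of how such a JSON document could be produced from the SafeMap sketched in section 3.2, using only the standard library; this is an illustration of the serialization idea, not the tool's actual implementation, and it abbreviates the safeInstructionList to a count.

#include "llvm/IR/Value.h"

#include <ostream>

// Assumes the SafeMap/SafeValue sketch from section 3.2.
void writeJson(std::ostream &os, const SafeMap &map) {
  os << "{\n";
  bool firstValue = true;
  for (const auto &entry : map) {
    const SafeValue &sv = entry.second;
    if (!firstValue)
      os << ",\n";
    firstValue = false;
    os << "  \"" << sv.value->getName().str() << "\": { \"userMap\": [";
    bool firstLoc = true;
    for (const auto &loc : sv.userMap) {
      if (!firstLoc)
        os << ", ";
      firstLoc = false;
      os << "{ \"name\": \"" << loc.first->getName().str()
         << "\", \"safeInstructionList\": " << loc.second.size() << " }";
    }
    os << "] }";
  }
  os << "\n}\n";
}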
