Växjö University


School of Mathematics and System Engineering

Reports from MSI - Rapporter från MSI

P2A

An Approximative Points-to Analysis for Java

Peter Stensson

Feb 2005

MSI

Växjö University SE-351 95 VÄXJÖ

Report 05016 ISSN 1650-2647

ISRN VXU/MSI/DA/E/--05016/--SE


Master’s Thesis in Computer Science (Datalogi), 2005


Abstract

This master’s thesis presents an approach to points-to analysis that is targeted at software engineering activities. Software engineering applications that use the result of a points-to analysis are often interested in only a small number of the classes in an application. To make the analysis more efficient, a couple of approximations are presented that reduce the size of the analysis scope. An implementation was constructed to evaluate the approach and the approximations. The experimental results show that, in most cases, the approximations do not affect the precision at all for the relevant parts of the application being analyzed. The experimental results also show that the analysis time is shortened considerably by using the approximations.

Key-words: Points-to analysis, Approximation.


Summary (Sammanfattning)

This master’s thesis presents a method for points-to analysis that is intended for software development activities. Software development applications that use the result of a points-to analysis are usually interested in only a small part of the program being analyzed. To make the method more efficient, a number of approximations are also presented that reduce the size of the analysis. An implementation of the method and the approximations was made in order to evaluate them.

The results show that the approximations made on the analysis do not affect the precision considerably. The results also show that the analysis time is shortened substantially by applying approximations to the analysis.

Keywords: Points-to analysis, Approximation.


Acknowledgments

First of all, I want to thank my supervisor Jonas Lundberg, whose dedication to the topic has eased my work considerably. Every time I ran into trouble, Jonas came up with clever suggestions. I wish him the best of luck with the second part of his PhD studies, and I hope the work in this thesis will be of help along the way.

Thanks to all my friends and to my family, who have supported me throughout my studies at Växjö University and also helped me through some personal difficulties.

Special thanks to Mikael Hansen for being a great friend; his bottomless knowledge of computer science has helped me many times.

Special thanks also go to Daniel Petersson for taking the time to read my thesis, and for all the times we have worked together during our studies.

Finally, I want to thank my dad, who died way too young and is not able to be with us today and enjoy this moment. This work is for you, and someday we will meet again.


List of Figures

3.1 Design overview for P2A
3.2 A simple points-to assignment graph illustrating regular assignments
3.3 A simple points-to assignment graph illustrating content-assignments
3.4 An invocation connected to a method
5.1 Class diagram for the node classes in the P2A Implementation


List of Tables

3.1 Statement and their set constraints
3.2 Statements involving arrays and the corresponding set constraints
3.3 Constructs in the pointer assignment graph derived from different statements
4.1 Statement and their set constraints for Collections
4.2 Statements and corresponding pointer assignment graph constructs
4.3 Statements and set constraint for collections and arrays
4.4 Statements and corresponding pointer assignment graph constructs
6.1 General results for the evaluated programs
6.2 Pointer assignment graph and Call graph results for the evaluated programs
6.3 General results for the application approximation
6.4 Pointer assignment graph results for the application approximation
6.5 Call graph results for the application approximation
6.6 General results for string approximation
6.7 Pointer assignment graph results for string approximation
6.8 Call graph results for string approximation
6.9 General results for graphical user interface approximation
6.10 Pointer assignment graph results for graphical user interface approximation
6.11 Call graph results for graphical user interface approximation
6.12 General results for collection approximation
6.13 Pointer assignment graph results for collection approximation
6.14 Call graph results for collection approximation
6.15 General results for arbitrary removal of classes approximation
6.16 Pointer assignment graph results for arbitrary removal of classes approximation
6.17 Call graph results for arbitrary removal of classes approximation
6.18 General results for all approximations
6.19 Pointer assignment graph results for all approximations
6.20 Call graph results for all approximations


Chapter 1 Introduction

This thesis presents a static program analysis that computes the set of objects a given reference variable might refer to. Furthermore, the thesis presents and evaluates approximations that can be applied to the analysis in order to increase its efficiency. This first chapter introduces the thesis. Section 1.1 presents the motivation for the thesis, followed by the problem definition in section 1.2. The method for the work as a whole is presented in section 1.3. Section 1.4 discusses the restrictions of the thesis, and the chapter concludes with a description of the structure of the thesis in section 1.5.

1.1 Motivation

The analysis described above is commonly called points-to analysis. As stated, the intent of points-to analysis is to compute the set of objects a reference variable could refer to. Current research on points-to analysis is often targeted at applications related to compiler optimization. This thesis, however, focuses on applications targeted at software engineering. Before moving on, one should remember that points-to analysis is very time-consuming when high precision is required.

The two types of targeted applications do not have the same requirements on the analysis. In order to be useful, the analysis for compilers needs to be very conservative. That is, every piece of the software needs to be examined and computed. This includes methods invoked at virtual machine start-up and class initialization methods responsible for assigning values to static fields. The consequence is that several hundred classes will be analyzed even for comparatively small programs.

Before the requirements for analyses targeted at software engineering are presented, the term application relevant information must be discussed. The application classes are relevant information and are defined by a specific package or a name filter. An instance of an application class is called an application object and is also relevant information. Non-application classes will be referred to as library classes in this thesis.

The bad news for applications targeted at software engineering is that they are often interactive, and it is therefore unacceptable to wait several minutes for the result. The good news is that software engineering tools often need only the application relevant information in order to be useful. Consider a tool that visualizes an abstract view of a program in the form of a graph. If one wants to examine the visualization, including library classes will probably confuse the user. Only the application classes make sense to the user. With this in mind, the cost of the analysis may be reduced if only the application relevant information is considered in the analysis. How can irrelevant information be left out of the analysis without affecting the application relevant information? Previous work [27] shows that approximations increase the efficiency of points-to analysis. With this discussion in mind, the goal of this thesis can now be expressed.

1.2 Thesis Goal

Consider this question: Is it possible to cut down the cost of points-to analysis by making approximations without losing any relevant information for the analyzed application?

In other words, the main goal of this thesis is to answer the question whether it is possible to significantly reduce the time and memory requirements of the analysis without losing any information that is relevant from the application point of view. Significantly means that the result is acceptable if the approximations reduce the analysis time by a factor of ten or more. Not losing relevant application information means that structures such as call graphs should not differ for the application classes when approximations are applied to the analysis.

1.3 Method

To carry out this study, constructive research [24] will be used. As stated earlier, the problem is to improve and enhance an existing innovation. The building process according to Järvinen [24] consists of three states: initial, building and target. The initial state for the study consists of existing approaches and previous work by Jonas Lundberg [29]. To specify the initial state accordingly, the thesis will describe how points-to analysis works and also describe previous work in the area. The building process should then specify which steps are sufficient to reach a target state. The design will be described and potential problems with the design will be discussed. An important matter in the building process is whether any existing tools will be used. Consider these two examples:

1. How should the information be parsed from the application?

2. Is a graph tool necessary to create the pointer assignment graph?

This leads to the question: should you build new tools or use existing ones? This is a design matter that must be sorted out.

The target state could either be known from the beginning or be revealed during the study [24]. In this case, the target state for the pointer assignment graph will often not be known from the beginning. The concept of a pointer assignment graph will be presented in section 2.1. Consider this example: the call graph for a program analyzed in this study should have the same state as in some previous work in this area. A part that does not have a known final state is the approximations that will be implemented, because it is impossible to know whether the approximations will have the same effect in different approaches. The result, for example efficiency and precision, is not known from the beginning.

The other part of the thesis will evaluate and discuss the result. The approach in this thesis will first be validated and then verified. Different metrics will be used when the approaches are compared. As the problem definition states, the two most important metrics are efficiency and precision.

Järvinen [24] describes how a disposition can be structured when constructive research is used. The disposition for the work of the thesis is presented below; the structure of this report, presented in section 1.5, is quite similar. The disposition is more or less a summary of the discussion in this section and is an adapted version of the ones Järvinen presents.

1. Introduction. This part should give an introduction to points-to analysis and the previous work in this area.

2. Specification. This part should describe the new idea and specify what will be built.

3. Design of main analysis. This part should describe how the new idea will be built and present the tools that will be used. Possible design problems will be identified and solutions or alternatives to these will be presented. Possible design problems may lead to restrictions in the study.

4. Implementation of main analysis. When the design is completed and any restrictions are identified, the implementation can be done.

5. Design of approximations. When the main analysis is ready, the design of new approximations will be done. This is a part in the study that may include a lot of restrictions, because some of the approximations may not work at all.

6. Implementation of approximations. This part will implement the approximations.

7. Restrictions. When all design and implementation work is complete, all restrictions should be identified.

8. Evaluation. This part will validate the approach and its approximations.

9. Discussion. The conclusions drawn from the evaluation will be discussed in this part. With the evaluation results available, the advantages and disadvantages of the approach may be discussed.

1.4 Restrictions

The restrictions of this thesis are mostly related to the precision of the analysis and the reason for this is that some parts of an application are hard to analyze and retrieve information from. The restrictions are presented below.

• Native code will not be considered in the analysis. Every time an invocation is targeted to a native method, the invocation is dropped.

• Input streams to an application are not considered, because it is hard to simulate input streams.

• Methods invoked on an array will not be considered. Like the native invocations, these types of invocations will also be dropped. This means that statements like a.length and a.clone(), where a is an array-variable, will be ignored.

• Class loading and exceptions will be neglected in the analysis.

It is unfortunate to make these restrictions, but this matter will be discussed further in section 7.1 about future work.


1.5 Structure of the Thesis

The rest of this thesis is organized as follows. Chapter 2 describes the concept of points-to analysis, presents some previous work in the area and discusses some applications that could make use of the result of the analysis. Chapter 3 presents the design of the analysis. The set constraints are presented, and the algorithms for building the pointer assignment graph and for doing the propagation are discussed. The approximations that make the analysis more efficient are presented in chapter 4. The tools used to build the implementation of the analysis are presented in chapter 5, where the focus is on how the tools are used. In chapter 6, the analysis and its approximations are evaluated by presenting experimental results. The thesis ends with chapter 7, which discusses the results and suggests some future work.


Chapter 2 Points-to analysis

This chapter serves as an introduction to the concept of points-to analysis and further motivates the need for efficient algorithms to compute points-to sets for references in an application. Section 2.1 describes the basics of points-to analysis, while section 2.2 presents previous work on the topic. Finally, section 2.3 presents a group of applications that are targeted at software engineering and can make use of the result of points-to analysis.

2.1 Basics of points-to analysis

The main idea of points-to analysis is very simple: for every reference variable in a program, compute the objects it could refer to. From now on in this thesis, a reference variable will simply be called a variable. This introduction to points-to analysis is aimed at applications written in Java, although the next section will show that points-to analysis is used for other languages as well. The important concepts of points-to set and pointer assignment graph will also be presented, because both are used in the approach this thesis presents in chapter 3. Consider the following simple code fragment.

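// Assumes that class B is a subclass of A; otherwise a1 = b2 would not compile.
public void m() {
    A a1 = new A();   // object creation point of class A
    B b2 = new B();   // object creation point of class B
    A a3, a4;
    a3 = a1;
    a1 = b2;
    a4 = a1;
}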

Before discussing the example, a short remark about objects is needed. An object in this thesis is an object creation point rather than an actual object. That is, one object serves as the representation of a certain object creation point. The representing object may also be called an abstract object. Which objects do the variables in the example refer to? That depends on whether the flow of the program is considered, which is called flow-sensitivity. The allocations of the classes A and B will be called O_A and O_B from now on in this section. The objects the variables refer to under a flow-sensitive analysis are summarized below.

• a1 is pointing to both O_A and O_B.

• b2 is pointing to O_B.


• a3 is pointing to O_A.

• a4 is pointing to O_B.

On the other hand, if the analysis is flow-insensitive, the result will be the following.

• a1 is pointing to both O_A and O_B.

• b2 is pointing to O_B.

• a3 is pointing to both O_A and O_B.

• a4 is pointing to both O_A and O_B.

Since we do not know the flow of the program, it is impossible to decide which of the two objects a1 is pointing to when a1 is first assigned to a3 and later to a4. This gives the less precise result. Why not use flow-sensitivity all the time? It is extremely costly, because each definition of a variable must be considered, and that makes the abstraction of the program used during the analysis enormous for large applications. Another sensitivity that must be considered is whether the different invocations of a method should be distinguished. If the invocations are distinguished, the analysis is context-sensitive; otherwise it is context-insensitive. The approach presented in chapter 3 is conservative and both context- and flow-insensitive.
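As an illustration, the flow-insensitive result for the method m() above can be reproduced with a small fixed-point computation. The sketch below is not taken from P2A; the class name FlowInsensitiveExample and the encoding of assignments as string pairs are assumptions made only for this example. Each assignment l = r is read as the constraint that everything r may point to, l may point to as well, and the statement order is ignored.

```java
import java.util.*;

// Hypothetical helper, not part of P2A: solves the subset constraints of the
// example method m() without looking at statement order (flow-insensitive).
class FlowInsensitiveExample {
    static Map<String, Set<String>> solve() {
        Map<String, Set<String>> pts = new TreeMap<>();
        for (String v : Arrays.asList("a1", "b2", "a3", "a4")) {
            pts.put(v, new TreeSet<>());
        }
        // Allocations: the abstract object is added to the target's set.
        pts.get("a1").add("O_A");   // A a1 = new A();
        pts.get("b2").add("O_B");   // B b2 = new B();
        // Assignments l = r stored as {l, r}: pts(r) must be a subset of pts(l).
        String[][] assigns = {{"a3", "a1"}, {"a1", "b2"}, {"a4", "a1"}};
        boolean changed = true;
        while (changed) {           // iterate until no set grows any more
            changed = false;
            for (String[] a : assigns) {
                changed |= pts.get(a[0]).addAll(pts.get(a[1]));
            }
        }
        return pts;
    }

    public static void main(String[] args) {
        System.out.println(solve());
    }
}
```

Because the constraints are re-applied until nothing changes, a3 also receives O_B, although a3 = a1 is executed before a1 = b2 in the program text. This is exactly the loss of precision described above.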

The concept of a points-to set is important and will be used a lot in this thesis. A points-to set for a variable consists of all the objects the variable may refer to. In the example above, the variable a1 has the points-to set {O_A, O_B}. Later, the approach presented in this thesis will restrict what a points-to set may contain. The restrictions depend on the kind of object, variable and statement that is analyzed. It was stated earlier that the approach presented in this thesis is conservative. Conservative means that if a variable, v, might refer to a certain object, o, at run-time, that object must be added to the points-to set of v.

The other important concept that will be used in this thesis is the pointer assignment graph. Such a graph is an abstraction of the run-time memory states. The graph may have many types of nodes, depending on the implementation, but the two most common ones are the nodes representing variables and objects. A directed edge in the graph represents an assignment. When the graph has been constructed, it can be used to propagate objects transitively to variables. The building blocks of the pointer assignment graph and the propagation for the approach presented in this thesis are described thoroughly in chapter 3.
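The idea of propagating objects along assignment edges can be sketched in a few lines of Java. This is a minimal sketch under simplifying assumptions, not the graph or the algorithm used in P2A: it only knows variable nodes, allocations and plain assignments, it uses a single work list, and all identifiers (PointerAssignmentGraphSketch, addAllocation, addAssignment, propagate) are invented for the illustration.

```java
import java.util.*;

// Illustrative sketch only; the class and method names are assumptions
// and do not come from the P2A implementation.
class PointerAssignmentGraphSketch {
    // Assignment edges: source variable -> variables it is assigned to.
    private final Map<String, Set<String>> succ = new HashMap<>();
    // Points-to set of every variable node.
    private final Map<String, Set<String>> pts = new HashMap<>();
    private final Deque<String> worklist = new ArrayDeque<>();

    // v = new T(): the abstract object obj flows into variable v.
    void addAllocation(String obj, String v) {
        if (pointsTo(v).add(obj)) worklist.add(v);
    }

    // l = r: a directed edge from r to l.
    void addAssignment(String l, String r) {
        succ.computeIfAbsent(r, k -> new HashSet<>()).add(l);
        if (!pointsTo(r).isEmpty()) worklist.add(r);
    }

    Set<String> pointsTo(String v) {
        return pts.computeIfAbsent(v, k -> new TreeSet<>());
    }

    // Propagate objects transitively along the assignment edges.
    void propagate() {
        while (!worklist.isEmpty()) {
            String v = worklist.poll();
            for (String target : succ.getOrDefault(v, Collections.emptySet())) {
                // Re-queue the target only if its points-to set actually grew.
                if (pointsTo(target).addAll(pointsTo(v))) worklist.add(target);
            }
        }
    }
}
```

A variable is put back on the work list only when its points-to set has grown, so the loop terminates even if the assignment edges form a cycle.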

2.2 Recent work

This section will present previous work on points-to analysis. A summary will be presented in a form of a list that will start with the earliest work and conclude with points-to analysis for Java. The description of each work will be very short.

• In 1994, Emami, Ghiya and Hendren [18] introduced a points-to analysis that divided the memory into concrete locations and then computed, for each reference, the set of locations the reference could point to.


• Andersen [10] presented an analysis related to points-to analysis that modelled the heap precisely. He used set constraints to express the analysis, an approach that will be used in this thesis.

• Steensgaard [44] replaced the set constraints with set equality constraints. With this approach, he was able to find connected components in the constraint graph.

• In 1995, Ruf [38] showed, with an implementation of both a context-sensitive and a context-insensitive subset-based analysis, that the latter was almost equal in precision.

• Shapiro and Horwitz [40] presented an empirical result demonstrating that an equality-based analysis is less precise but much faster than a set-based analysis on large programs.

• Diwan, McKinley and Moss [17] applied points-to analysis to Modula-3, which has declared types, and studied three types of alias analysis.

• Aiken, Fähndrich, Foster and Su [9, 19, 45] developed a framework called BANE, which improved the efficiency on a set-based analysis by removing cycles in the constraint graph.

• In 2000, Rountev and Chandra [36] showed that optimizing the constraint graph decreased both the analysis time and the memory requirement by about 50%.

• Heintze and Tardieu [21] constructed a demand-driven analysis that only produces the points-to sets needed by the application of the analysis. That made it possible to analyze large programs.

• Liang, Pennings and Harrold [28] adapted several analyses to Java, all of which were flow-insensitive and context-insensitive.

• Rountev, Milanova and Ryder [37] used Soot Optimization Framework [49] to output set constraints to be used as input to the framework BANE. The analysis was both context- and flow-insensitive but field-sensitive.

• Whaley, Rinard and Vivien [53] constructed a context-, flow- and field-sensitive set-based analysis, which was used to compute escape information. The analysis was demand-driven and was therefore able to analyze large programs.

• Whaley and Lam [52] adapted the demand-driven analysis presented by Heintze and Tardieu [21] to Java by adding field-sensitivity and respecting the declared types.

• In 2002, Milanova, Rountev and Ryder [32] presented object-sensitive analysis which was an adaptation of a context-sensitive analysis.

• Lhoták [27] presented a framework called SPARK, which is a part of the Soot Optimization Framework [49] and includes several types of points-to analysis. The analysis was aided by a pointer assignment graph, which was a representation of the analyzed program.
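The difference between subset constraints and equality constraints mentioned in the list above can be illustrated on the example from section 2.1. The following sketch is a strongly simplified illustration and not a reimplementation of any of the cited analyses: it treats an assignment l = r as the equality pts(l) = pts(r) and merges the two variables with a union-find structure. The class name EqualityBasedSketch and its methods are assumptions made for this example.

```java
import java.util.*;

// Simplified illustration of equality-based constraints; all names are
// assumptions and the code does not reproduce any cited implementation.
class EqualityBasedSketch {
    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, Set<String>> objects = new HashMap<>();

    // Find the representative of v's equivalence class (with path compression).
    private String find(String v) {
        parent.putIfAbsent(v, v);
        String p = parent.get(v);
        if (!p.equals(v)) {
            p = find(p);
            parent.put(v, p);
        }
        return p;
    }

    // v = new T(): the object is stored with the representative of v.
    void addAllocation(String obj, String v) {
        objects.computeIfAbsent(find(v), k -> new TreeSet<>()).add(obj);
    }

    // l = r: both variables end up in one class that shares one points-to set.
    void addAssignment(String l, String r) {
        String a = find(l), b = find(r);
        if (a.equals(b)) return;
        parent.put(a, b);
        Set<String> moved = objects.remove(a);
        if (moved != null) {
            objects.computeIfAbsent(b, k -> new TreeSet<>()).addAll(moved);
        }
    }

    Set<String> pointsTo(String v) {
        return objects.getOrDefault(find(v), Collections.emptySet());
    }
}
```

For the example method m(), all four variables are merged into one class, so even b2 is reported to point to both O_A and O_B. Each assignment is processed once and no iteration is needed, which illustrates the trade-off between speed and precision reported by Shapiro and Horwitz [40].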


2.3 Applications for points-to analysis

This chapter concludes by describing which kinds of applications can be built from the result of points-to analysis. The main focus is on applications that are targeted at software engineering and on derivatives of the pointer assignment graph. But first, some applications for compiler optimization will be presented.

1. Side-effect analysis determines the memory locations that may be modified by the execution of a statement [35].

2. Def-use analysis identifies pairs of statements that set the value of a memory location and afterwards use that value [33].

3. Another use is to determine whether an object is unreachable after the method that created it has returned; such an object could be allocated on the stack. One could also determine whether an object is reachable only from a single thread during its lifetime, allowing the removal of unnecessary synchronization operations that ensure mutual exclusion [38]. These two applications are somewhat related and are commonly called escape analysis [12].

When the result of points-to analysis is used for compiler optimization, the whole program must be analyzed. This means that every method in all classes reachable from the entry points of the program must be included. The entry points of a program are the main method, static initializations and the methods invoked implicitly by the Java Virtual Machine. This leads to very large analyses involving several hundred classes. If the analysis is mainly intended for applications targeted at software engineering, the situation is different, because the interesting classes are probably not the ones used implicitly by the Java Virtual Machine. For example, if an architecture recovery is based on the result of points-to analysis, the classes loaded by the Java Virtual Machine will probably only confuse a user who is trying to understand the architecture. Before presenting the applications for software engineering, some derivatives of the pointer assignment graph will be presented.

2.3.1 Pointer assignment graph derivatives

After the pointer assignment graph has been constructed and propagated, each reference in the graph has a points-to set consisting of zero or more objects. Various data structures can be derived from the graph.

1. An object call graph is a directed graph whose nodes are pairs of a method and an object; these nodes will be called object methods. The edges in the graph are statically detected invocations between two object methods. Consider a method, m, whose this-reference has the points-to set {o¹, o²} and which contains an invocation of another method, n, on a reference with the points-to set {o³}. With a conservative approach, the following constructs will be created in the graph: (m, o¹) → (n, o³) and (m, o²) → (n, o³). Two types of data structures can subsequently be derived from an object call graph, namely an ordinary call graph and an object usage graph.

2. The ordinary call graph is created by merging all object methods implemented by the same method definition into a single node. The call graph will be an important structure in this thesis, because it will be used to measure whether any information is lost due to approximations made on the analysis.

3. The object usage graph is derived by reducing all object methods belonging to the same object into a single node.

4. From both the ordinary call graph and the object usage graph, a class usage graph could be derived, where each node is a class and each edge represents an invocation between two classes.
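The derivations in the list above can be sketched as follows. The code is a hypothetical illustration and not part of P2A: an edge of the object call graph is represented as a list of four strings (caller method, caller object, callee method, callee object), and the names ObjectCallGraphSketch, objectCallEdges, callGraph and objectUsageGraph are assumptions made for this example.

```java
import java.util.*;

// Hypothetical sketch; identifiers are illustrative assumptions.
class ObjectCallGraphSketch {
    // Object-method edges for an invocation of callee inside caller: every
    // this-object of the caller is paired with every receiver object.
    static Set<List<String>> objectCallEdges(String caller, Set<String> thisPts,
                                             String callee, Set<String> receiverPts) {
        Set<List<String>> edges = new LinkedHashSet<>();
        for (String from : thisPts) {
            for (String to : receiverPts) {
                edges.add(Arrays.asList(caller, from, callee, to));
            }
        }
        return edges;
    }

    // Ordinary call graph: keep only the method part of each node.
    static Set<String> callGraph(Set<List<String>> edges) {
        Set<String> result = new TreeSet<>();
        for (List<String> e : edges) result.add(e.get(0) + " -> " + e.get(2));
        return result;
    }

    // Object usage graph: keep only the object part of each node.
    static Set<String> objectUsageGraph(Set<List<String>> edges) {
        Set<String> result = new TreeSet<>();
        for (List<String> e : edges) result.add(e.get(1) + " -> " + e.get(3));
        return result;
    }
}
```

For the example in item 1, objectCallEdges yields the two edges (m, o1) → (n, o3) and (m, o2) → (n, o3). Merging by method collapses them into the single call graph edge m → n, while merging by object keeps the two usage edges o1 → o3 and o2 → o3. A class usage graph could be obtained in the same way by mapping each object to its class.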

These graphs are considerably more precise when derived from points-to analysis than when derived from other algorithms, such as Rapid Type Analysis [11] and Class Hierarchy Analysis [16].

The pointer assignment graph can also be used to trace individual objects. By looking at which points-to sets might contain a certain object, one can determine which parts of the program the object can or cannot reach. In a similar way, one can determine which objects a certain entity in a program can contain by looking at the points-to sets of the references in the entity. An entity is either a member or a class. The information gained for an entity can be used for forward or backward tracing, to see which entities it will affect or be affected by.
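Tracing an object amounts to a simple query over the computed points-to sets. The sketch below assumes that the result of the analysis is available as a map from variable names to points-to sets; this representation and the names ObjectTraceSketch and reachedBy are assumptions made for the illustration.

```java
import java.util.*;

// Hypothetical query helper; not taken from the P2A implementation.
class ObjectTraceSketch {
    // Returns the variables whose points-to set may contain obj.
    static Set<String> reachedBy(Map<String, Set<String>> pts, String obj) {
        Set<String> vars = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : pts.entrySet()) {
            if (e.getValue().contains(obj)) vars.add(e.getKey());
        }
        return vars;
    }
}
```

Applied to the flow-insensitive result of the example in section 2.1, a query for O_A returns a1, a3 and a4 but not b2, which shows that the object never reaches b2.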

2.3.2 Software engineering applications

As mentioned, the result of points-to analysis can be used to create applications for software engineering. These applications do not always need a whole-program points-to analysis in order to be useful. This is fortunate, considering that the approximations presented in chapter 4 might reduce the precision of the analysis. The first application that will be discussed is architecture recovery.

When maintaining, extending or reusing a software system, understanding the different parts of the system is vital. If the software system is a legacy system, it is often poorly documented. Maintenance is a large part of the software life cycle, and understanding the software is a large part of the maintenance process. It is impossible to understand all the details of a software system, so one must create higher-level models of the system that show how the different parts interact. The goal is thus to recover the main architecture of the software system. Most of the architecture recovery methods in use today start at the class level and use different techniques to find a hierarchical clustering of the system classes. These clustering techniques are based either on similarity metrics [39, 51, 31, 48] or on graph-theoretical concepts such as dominance [15, 20, 30] and edge and node cuts [15]. A problem with these clustering techniques is that they group classes rather than separate instances of these classes. If a library class is used often and in different parts of the system, the resulting clustering of the classes may be misleading. It would be better if the instances of the library class were treated as separate entities. The good news is that an object usage graph, derived from points-to analysis, provides the information needed to treat instances separately.

Metric-based refactoring uses software metrics to measure properties of a software system. Numbers are assigned to the properties according to well-defined, objective measurement rules [14, 13, 22, 43]. The measurements provide information with which software engineering tasks can be both planned and performed better. Many of these metrics are based on invocations or field accesses; one of them, called coupling, measures to what degree a class is coupled to its surrounding classes. A high metric value means that a class is not an isolated part of the system, and that knowledge is useful when a system is to be maintained. Metrics have the same problem as the clustering techniques discussed above. If instances are used instead of classes, the coupling value will probably be smaller and the result perhaps better. The result of points-to analysis could help achieve these goals for metric-based refactoring.

Another type of application is usage-based software visualization, which visualizes a program as a graph with classes as nodes and inheritance or usage as edges. The goal of the visualization is to group classes in a way that makes the graph easy to understand. The distance between two classes must be computed using a metric value. A low value indicates that two classes should be grouped closely together. The number of invocations and field accesses is often used to determine the metric value [42, 41, 46]. Like the previous two software engineering applications, software visualization using distance metrics needs a precise call graph in order to be effective, and instances of frequently used classes are better to use than the classes themselves. All three applications have the same problem when extracting and using information about a program, and points-to analysis could improve the situation. This section closes with a discussion about integration and regression testing.

• Integration testing is used to validate that the interactions between two classes are correct. The main problem with integration testing is that it is hard to show that enough interactions are considered during the testing. Another problem is to find a suitable test ordering for the involved classes.

• Regression testing is used to show that a program still satisfies its requirements after a change is made. Again the test order is a problem; furthermore, it is hard to determine the impact of a change in a given class.

The object relation diagram is a model of dependencies between different classes and can be used to solve the above stated problems [25, 26, 23, 47]. The object relation diagram is a directed graph in which nodes represent program classes and edges represent dependencies between these classes, which could be of type inheritance, aggregation or association. Graphs that not are precise will contain spuriously edges that represent impossible dependencies and are not originated from any actual program execution. In the case of integration testing, unnecessary dependencies may be triggered and extend the testing time. In the case of regression testing, the imprecise graph could lead to overesti- mation of changes, which cause retesting of unaffected classes. The problems may also lead to dependency cycles in the graph, which complicates the task of finding a test order and is a problem for both types of testing. With points-to analysis, the aggregation and association edges are easy to derive from the pointer assignment graph and that will help building a much more precise object relation diagram.

This chapter has given an introduction to points-to analysis and presented previous work. The last section was devoted to further motivating the need for an approach whose goal is to focus mainly on the application classes in a program. The next two chapters will describe an approach to points-to analysis and a couple of approximations that try to make the analysis more efficient.


Chapter 3 P2A

This chapter of the thesis will present a points-to analysis that is called P2A. The approach adapts the set constraints that Andersen uses in his algorithm for C [10]. The whole process is iterative and uses two work lists. P2A is built to analyze classes that are reachable from the entry points, particularly the Main class entry point.

Definition 3.1. An analysis of a program p involves all classes that are transitively reachable from the explicit or implicit entry points.
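Definition 3.1 can be read as a work list closure over the classes of a program. The sketch below is only an illustration of that reading: the class ReachableClasses, the dependencies map (from a class name to the class names it refers to) and the example class names are assumptions made here and are not part of P2A, which obtains this information from the parsed class files.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (not P2A code): the classes involved in an analysis are
// those transitively reachable from the entry points.
final class ReachableClasses {

    // dependencies is a hypothetical map: class name -> class names it refers to.
    static Set<String> reachableFrom(List<String> entryPoints,
                                     Map<String, List<String>> dependencies) {
        Set<String> reached = new LinkedHashSet<>();
        Deque<String> workList = new ArrayDeque<>(entryPoints);
        while (!workList.isEmpty()) {
            String current = workList.poll();
            if (!reached.add(current)) {
                continue; // this class has already been visited
            }
            List<String> next =
                dependencies.getOrDefault(current, Collections.<String>emptyList());
            workList.addAll(next);
        }
        return reached;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("Main", Arrays.asList("A", "B"));
        deps.put("A", Arrays.asList("C"));
        deps.put("D", Arrays.asList("A")); // D is never reached from Main
        System.out.println(reachableFrom(Arrays.asList("Main"), deps));
    }
}
```

In this made-up example the class D refers to A but is itself never referred to, so it falls outside the analysis.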

P2A is created for Java, but the approach presented here could easily be adapted to another strongly typed object-oriented language. In order to be as language independent as possible, all algorithms will be written in pseudo-code. The examples, however, will be presented in Java code, mostly for clarity. The structure of this chapter is as follows:

• Section 3.1 will present a rather abstract view of the design for P2A.

• The set constraints that will be used are described in section 3.2.

• The internal representation of the program is a graph that is called a pointer assignment graph. The approach to build the graph is described in section 3.3.

• Section 3.4 concludes this chapter by presenting the algorithms that are used to do the propagation of objects.

It should also be mentioned that each of the sections may end with some details about the corresponding part of the implementation. Another convention concerns the terminology for methods. When P2A is presented in this thesis, the Java terminology, method and invocation, will be used rather than procedure, function and call.

3.1 Design overview

Figure 3.1 gives an overview of the design of P2A. Rectangles denote processes and rounded rectangles denote data. The whole process starts with reading two XML documents, which contain information about what will be analyzed and how it will be analyzed. The structure of these two XML documents can be found in appendix A. The next phase reads and parses the application that will be analyzed. The implementation uses the Soot Optimization Framework [49] to do the parsing. Section 5.1 describes the use of the Soot Optimization Framework in the context of the implementation. When the class files have been parsed, P2A is ready to start building the pointer assignment graph. This process is called method-processing in figure 3.1.

Figure 3.1: Design overview for P2A (diagram labels: Project Info, Analysis Info, SOOT Processing, Parsed Class Files, Method Processing, Local PAG Construction, Method PAG, Local Propagation, PAG, Propagation, Points-to Sets, P2A Engine)

Method-processing consists of two parts:

1. Method pointer assignment graph construction, and

2. Local propagation.

The output from method-processing is a pointer assignment graph for the processed method and points-to sets for the references in the method. Notice that the points-to sets are built up iteratively through the whole process; the dashed rounded rectangle means that the points-to sets are not finalized. Next in the process is the inter-procedural propagation, which connects invocations with method targets. As mentioned before, the whole process uses two work lists, which may contain invocations, references and entry points. When both work lists are empty, the analysis is finished.

In the implementation, method-processing uses Grail¹ to construct the pointer assignment graph, and section 5.2 describes how Grail is used in the implementation. As mentioned earlier, section 3.3 will describe the algorithm for constructing the pointer assignment graph, while the propagation algorithms and the work lists will be presented in section 3.4.

3.2 Set constraints

As mentioned earlier, P2A adapts the set constraints that Andersen uses in his algorithm for C [10]. The analysis is defined in terms of two sets. Set V contains all the variables in the analyzed program and set O contains the objects created at the object allocation sites. For each allocation site, sⁱ, a separate object name oⁱ ∈ O will be used. Finally, each variable v ∈ V is associated with a set of objects denoted Pt(v) ⊆ O and referred to as the points-to set of v.

Each program statement that involves object transportation will be associated with a specific set of constraints. That is, the complete points-to analysis will be defined in terms of set constraints. Relevant statements and their set constraints are presented in table 3.1.

For example, a simple assignment statement l = r indicates that all objects referenced by variable r are also referenced by variable l. In terms of constraints on the points-to sets, the following is written: Pt(r) ⊆ Pt(l). Arrays are a special case and are treated in section 3.2.1.

Statement                   Set constraint

sⁱ: l = new A()             oⁱ ∈ Pt(l)
l = r                       Pt(r) ⊆ Pt(l)
l = (A)r                    Pt(r) ⊆_A Pt(l) ⇔ (oⁱ ∈ Pt(r) ∧ typeOf(oⁱ) ⪯ A ⇒ oⁱ ∈ Pt(l))
l = r₀.m(r₁, …, r_n)        ∀oⁱ ∈ Pt(r₀): m(this, p₁, …, p_n, ret) = dispatch(oⁱ, m),
                            oⁱ ∈ Pt(this) ∧
                            Pt(r₁) ⊆ Pt(p₁) ∧ … ∧ Pt(r_n) ⊆ Pt(p_n) ∧
                            Pt(ret) ⊆ Pt(l)
l = r₀.x                    Pt(r₀.x) ⊆ Pt(l)
l₀.x = r                    Pt(r) ⊆ Pt(l₀.x)

Table 3.1: Statements and their set constraints

¹ Grail is an in-house graph tool used by the Software Technology Group at Växjö University.


Constraints due to object allocation and assignments are straightforward. The type cast statement l = (A)r implies that only objects oⁱ ∈ Pt(r) having a type typeOf(oⁱ) that is a subtype of A are propagated to Pt(l). In the virtual invocation statement l = r₀.m(r₁, …, r_n), a method dispatch(Type A, Signature m) is used to decide the constraints generated between the invocation and the methods that may be invoked. The dispatch method is supposed to mimic the behaviour at run-time when the target of a method invocation is decided.
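The allocation, assignment and cast constraints can be illustrated with a small solver. The sketch below is a naive fixed-point iteration written for this explanation, not the work list algorithm of section 3.4 and not P2A's implementation; the class ConstraintSketch, its method names and the single-inheritance superOf map are all assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: resolves allocation, assignment and cast
// constraints by iterating until no points-to set changes.
final class ConstraintSketch {
    // Hypothetical class hierarchy: class name -> direct superclass (or null).
    private final Map<String, String> superOf = new HashMap<>();
    private final Map<String, String> typeOf = new HashMap<>();
    private final Map<String, Set<String>> pt = new HashMap<>();
    // Each entry is {r, l, castTypeOrNull}, read as Pt(r) ⊆ Pt(l), optionally filtered.
    private final List<String[]> subsets = new ArrayList<>();

    void declareClass(String name, String superName) { superOf.put(name, superName); }

    Set<String> pointsTo(String v) {
        return pt.computeIfAbsent(v, k -> new LinkedHashSet<>());
    }

    // s_i: l = new A()   gives   o_i ∈ Pt(l)
    void alloc(String l, String object, String type) {
        typeOf.put(object, type);
        pointsTo(l).add(object);
    }

    // l = r   gives   Pt(r) ⊆ Pt(l)
    void assign(String l, String r) { subsets.add(new String[] {r, l, null}); }

    // l = (A) r   lets only objects whose type is a subtype of A flow to l
    void cast(String l, String castType, String r) {
        subsets.add(new String[] {r, l, castType});
    }

    boolean isSubtype(String type, String target) {
        for (String t = type; t != null; t = superOf.get(t)) {
            if (t.equals(target)) {
                return true;
            }
        }
        return false;
    }

    void solve() {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] c : subsets) {
                // Copy the source set so that r and l may be the same variable.
                for (String o : new ArrayList<>(pointsTo(c[0]))) {
                    if (c[2] == null || isSubtype(typeOf.get(o), c[2])) {
                        changed |= pointsTo(c[1]).add(o);
                    }
                }
            }
        }
    }
}
```

With a class B extending A, allocating an A-object into a and a B-object into b, assigning both to x and then casting y = (B)x leaves both objects in Pt(x) but only the B-object in Pt(y).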

3.2.1 Set constraints for arrays

Array-objects and their array-variables are treated slightly differently. An array-object has a points-to set of its own, consisting of other objects, oⁱ ∈ O. The set of array-objects will sometimes be referred to as O_array ⊆ O. The array-variables, however, have a points-to set consisting of array-objects, aⁱ ∈ O_array. The array-variables are a subset of all variables and form the set V_array. Only arrays such as sⁱ: l = new A[n], where A is a reference type, are interesting for the analysis. Because an array-object has a points-to set and can itself be contained in another points-to set, assignments need special care. How should the following statements be handled?

l = r       // where l and r are array references

l[i] = x    // where l is an array reference and x is a reference

What kind of set constraints should these two statements have? The first statement should transport all array-objects, aⁱ ∈ O_array, from r ∈ V_array to l ∈ V_array. This, however, does not work for the second statement, because the semantics state that the objects referenced by x ∈ V should be contained in the array-object that l ∈ V_array refers to. l ∈ V_array could, of course, refer to more than one array-object. The first statement is a regular assignment; the second will be called a content-assignment, because it affects the content of array-objects. Table 3.2 shows the set constraints for statements involving arrays. In the table, l, r ∈ V_array and x, y ∈ V.

Statement                   Set constraint

sⁱ: l = new A[n]            aⁱ ∈ Pt(l)
l = r                       Pt(r) ⊆ Pt(l)
l = (A[])r                  Pt(r) ⊆_A[] Pt(l) ⇔ (oⁱ ∈ Pt(r) ∧ typeOf(oⁱ) ⪯ A[] ⇒ oⁱ ∈ Pt(l))
l = x₀.r                    Pt(x₀.r) ⊆ Pt(l)
x₀.l = r                    Pt(r) ⊆ Pt(x₀.l)
x₀.l = y₀.r                 Pt(y₀.r) ⊆ Pt(x₀.l)
l[i] = x                    ∀aⁱ ∈ Pt(l), Pt(x) ⊆ Pt(aⁱ)
x = r[i]                    ∀aⁱ ∈ Pt(r), Pt(aⁱ) ⊆ Pt(x)
l[i] = r[i]                 ∀aⁱ ∈ Pt(r), ∀bʲ ∈ Pt(l), Pt(aⁱ) ⊆ Pt(bʲ)
x₀.l[i] = y                 ∀aⁱ ∈ Pt(x₀.l), Pt(y) ⊆ Pt(aⁱ)
x = y₀.r[i]                 ∀aⁱ ∈ Pt(y₀.r), Pt(aⁱ) ⊆ Pt(x)
x₀.l[i] = y₀.r[i]           ∀aⁱ ∈ Pt(y₀.r), ∀bʲ ∈ Pt(x₀.l), Pt(aⁱ) ⊆ Pt(bʲ)
System.arraycopy(r, l)      ∀aⁱ ∈ Pt(r), ∀bʲ ∈ Pt(l), Pt(aⁱ) ⊆ Pt(bʲ)
Arrays.fill(l, x)           ∀aⁱ ∈ Pt(l), Pt(x) ⊆ Pt(aⁱ)

Table 3.2: Statements involving arrays and the corresponding set constraints

l[i] = x means that for all array-objects, aⁱ ∈ O_array, in the points-to set of l ∈ V_array, the points-to set of aⁱ now includes the points-to set of x ∈ V. The reverse statement x = r[i] should be read as follows: the points-to set of x ∈ V now includes the points-to sets of all array-objects contained in the points-to set of r ∈ V_array. l[i] = r[i] means that every array-object, bʲ ∈ O_array, in the points-to set of l ∈ V_array should add the points-to sets of all array-objects, aⁱ ∈ O_array, in the points-to set of r ∈ V_array. There are also some static methods involving arrays that can be handled as content-assignments. System.arraycopy(r, l) has the same set constraint as the statement l[i] = r[i], and Arrays.fill(l, x) has the same set constraint as l[i] = x. An array-cast works in the same manner as a regular cast and therefore has the same set constraint, except that the cast-type is an array-type. Field accesses involving arrays have the same set constraints as ordinary array accesses. This way of handling arrays will be extended in section 4.3 to include the Collection Framework in the Java Standard Library. That is one of the approximations that will be presented in this thesis.

All the necessary set constraints are now in place for the analysis. The next step is to describe how the pointer assignment graph is built and to present the different types of nodes and edges.

3.3 Pointer assignment graph

An abstraction of the run-time memory states is constructed when the points-to sets are computed for every variable. In P2A this abstraction is a pointer assignment graph, and this section presents how the graph is built and what kinds of constructs it has. Before examples of pointer assignment graphs and the algorithms that construct them are presented, the different types of nodes and edges will be introduced.

The nodes in the pointer assignment graph can roughly be divided into three categories: value-, reference- and artificial-nodes. Value-nodes contain an object that can be added to a points-to set, reference-nodes refer to different value-nodes through their points-to sets, and artificial-nodes represent a method or an invocation. Most of the nodes belong to a distinct category, but there are some exceptions, as will be shown.

The different nodes in the pointer assignment graph are as follows:

• An object-node is a representation for an object, oⁱ∈ O, and could be contained in different points-to sets.

• An array-object node is a representation of an array, aⁱ ∈ O_array. The array-object node differs slightly from an object-node. Like an object-node, it can be contained in different points-to sets, but in addition it has a points-to set of its own. The values contained in its points-to set represent what could actually be stored in the array.

• A variable-node represents a variable, v ∈ V, and has a points-to set containing different objects.

• A field-node represents a field, v ∈ V. It has the same characteristics as the variable-node and exists only to distinguish fields from variables.

• A this-node represents the this-reference, v ∈ V, of a method. It has the same characteristics as the variable-node, except that the points-to set of a this-node has some restrictions.

• A cast-node represents a cast and is, perhaps surprisingly, a part of the set V. The reason for including casts in the set of variables, V, is that it makes the propagation easier. This will be discussed further in section 3.4. Similar to the this-node, the points-to set of a cast-node has some restrictions, which were presented in section 3.2 on set constraints.

• An array-variable node represents an array-variable, v ∈ V_array. It has a points-to set consisting of array-objects, aⁱ ∈ O_array.

• An array-field node represents a field, v ∈ V_array, of array-type. Like a regular field-node, the array counterpart exists only to distinguish array-fields from array-variables.

• An array-cast node represents a cast of array-type and is a member of the set V_array. Like the regular cast-node, the points-to set of this node has some restrictions on what it may contain. These were stated in section 3.2.1.

• A null-node is a representation of the null constant. Although null is not of a reference type, it must be included, because a parameter of a method could otherwise be lost. An argument that is not of reference type is of no interest, so the corresponding parameter of the method is not handled. If the same method is invoked again, the argument that was previously null may this time be of reference type. Since a method is only created once, the parameter would be lost if null constants were ignored.

• A method-node is an artificial node and is representing a method. The node has references to the nodes representing its this-reference, parameters and a possible return value. If the method is static the node has no reference to the this-reference.

• An invocation-node is another artificial node and represents an invocation. It has references to the nodes representing its arguments, the possible receiver-reference, and the base-reference if the target method is non-static. The invocation-node has a points-to set to ease the propagation; that matter is discussed further in section 3.4. It also records whether the invocation is polymorphic or monomorphic.

And now, the different types of edges in the pointer assignment graph:

• An assign-edge is representing an ordinary assignment. It will be represented by the symbol → in the thesis.

• A content-assign edge is used when an assignment accesses an array-object. Content-assignments were discussed more thoroughly in section 3.2.1. In this thesis, the content-assign edge will be represented by the symbol ↦.

• A reference-edge is in fact not a part of the analysis. But it will be used to strengthen the understanding of the pointer assignment graph examples presented later in this section. The reference-edge will be represented with a dotted version of the assignment symbol.

These are the kinds of nodes and edges that will be used in the pointer assignment graph. Before a few examples are presented, table 3.3 describes which graph constructs are built for the statements presented in tables 3.1 and 3.2 in section 3.2. Array-variables are denoted x_a in the table to distinguish them from regular variables. In order to reduce the size of the table, the logical or symbol | is used on the right-hand side of some statements. | has precedence over all three symbols →, ↦ and =. The method invocation statement is shown only with regular variables, but l and the arguments r₁, …, r_n could be array-variables. This implies that ret and the parameters p₁, …, p_n can also be array-variables. The restriction that r₀ and the corresponding this cannot be an array-variable was discussed in section 1.4.

Statement                                       Pointer assignment graph construct

sⁱ: l = new A()                                 oⁱ → l
sⁱ: l_a = new A[]                               aⁱ → l_a
l = r | r₀.x                                    r | r₀.x → l
l_a = r_a | r₀.x_a                              r_a | r₀.x_a → l_a
l = (A)r                                        r → A → l
l_a = (A[])r_a                                  r_a → A → l_a
l = r₀.m(r₁, …, r_n)                            r₀ → r₀.m(r₁, …, r_n) ∧
                                                r₀ → m.this ∧
                                                r₁ → m.p₁ ∧ … ∧ r_n → m.p_n ∧
                                                m.ret → l
l₀.x = r | r₀.y                                 r | r₀.y → l₀.x
l₀.x_a = r_a | r₀.y_a                           r_a | r₀.y_a → l₀.x_a
l_a[i] = x | r₀.x | r_a[i] | r₀.x_a[i]          x | r₀.x | r_a | r₀.x_a ↦ l_a
x = r_a[i] | r₀.x_a[i]                          r_a | r₀.x_a ↦ x
l₀.x_a[i] = x | r₀.x | r_a[i] | r₀.x_a[i]       x | r₀.x | r_a | r₀.x_a ↦ l₀.x_a
l₀.x = r_a[i] | r₀.x_a[i]                       r_a | r₀.x_a ↦ l₀.x
System.arraycopy(r_a, l_a)                      r_a ↦ l_a
Arrays.fill(l_a, x)                             x ↦ l_a

Table 3.3: Constructs in the pointer assignment graph derived from different statements

It is now time to present a few examples of pointer assignment graphs. The first example will be created from the Java code below:

A a1 = new A();

A a2 = new A();

B b1 = new B();

B b2;

b2 = b1;

b2 = a2;

a1 = b2;

}

The code is quite simple, but it generates an example of a pointer assignment graph that shows how regular assignments work. The objects will be called O¹_A, O²_A and O³_B. The pointer assignment graph is shown in figure 3.2. Eventually, when the propagation has been processed, the points-to set for each variable will be:

• Pt(a1) = {O¹_A, O²_A, O³_B}

• Pt(a2) = {O²_A}

• Pt(b1) = {O³_B}

• Pt(b2) = {O²_A, O³_B}

Figure 3.2: A simple points-to assignment graph illustrating regular assignments (nodes: Object: A, Object: A, Object: B, Reference: a1, Reference: a2, Reference: b1, Reference: b2)
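The points-to sets of this first example can be reproduced by propagating objects along the assign-edges. The sketch below is only an illustration under stated assumptions: the class AssignEdgeExample, its method names and the single work list are invented here, whereas the actual propagation in P2A is the subject of section 3.4. Note that O¹_A is only ever assigned to a1, so under these assumptions it reaches neither a2, b1 nor b2.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: propagates objects along assign-edges until
// nothing changes. Names are hypothetical and not taken from P2A.
final class AssignEdgeExample {
    private final Map<String, Set<String>> pt = new TreeMap<>();
    private final Map<String, List<String>> assignEdges = new HashMap<>();
    private final Deque<String> workList = new ArrayDeque<>();

    Set<String> pointsTo(String variable) {
        return pt.computeIfAbsent(variable, k -> new TreeSet<>());
    }

    // s_i: l = new A()   gives the construct   o_i -> l
    void alloc(String variable, String object) {
        if (pointsTo(variable).add(object)) {
            workList.add(variable);
        }
    }

    // l = r   gives the construct   r -> l
    void assign(String l, String r) {
        assignEdges.computeIfAbsent(r, k -> new ArrayList<>()).add(l);
        workList.add(r);
    }

    void propagate() {
        while (!workList.isEmpty()) {
            String source = workList.poll();
            List<String> targets =
                assignEdges.getOrDefault(source, Collections.<String>emptyList());
            for (String target : targets) {
                // Re-queue the target only if its points-to set grew.
                if (pointsTo(target).addAll(pointsTo(source))) {
                    workList.add(target);
                }
            }
        }
    }

    // The statements of the first example, in source order.
    static AssignEdgeExample thesisExample() {
        AssignEdgeExample g = new AssignEdgeExample();
        g.alloc("a1", "O1_A");
        g.alloc("a2", "O2_A");
        g.alloc("b1", "O3_B");
        g.assign("b2", "b1");
        g.assign("b2", "a2");
        g.assign("a1", "b2");
        g.propagate();
        return g;
    }

    public static void main(String[] args) {
        AssignEdgeExample g = thesisExample();
        for (String v : new String[] {"a1", "a2", "b1", "b2"}) {
            System.out.println("Pt(" + v + ") = " + g.pointsTo(v));
        }
    }
}
```

Because the analysis is flow-insensitive, the order in which the statements are registered does not affect the final sets.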

The next example illustrates the other type of assignment, namely content-assignment. The pointer assignment graph in figure 3.3 is derived from the Java code below:

A a1[] = new A[2];

A r1 = new A();

A r2 = new A();

a1[0] = r1;

a1[1] = r2;

A a2[] = a1;

A r3 = a2[0];

A r4 = new A();

a2[0] = r4;

}

Again, the code makes no sense, but it is enough to illustrate the concept of content-assignment in the context of a pointer assignment graph. The objects will be called O¹_A[], O²_A, O³_A and O⁴_A. This time, when the propagation has been processed, the points-to set for each reference and for the array-object will be:

• Pt(O¹_A[]) = {O²_A, O³_A, O⁴_A}

• Pt(a1[]) = {O¹_A[]}

• Pt(a2[]) = {O¹_A[]}

• Pt(r1) = {O²_A}

• Pt(r2) = {O³_A}

• Pt(r3) = {O²_A, O³_A, O⁴_A}

• Pt(r4) = {O⁴_A}
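The array example can be checked in the same way with the content-assignment constraints of table 3.2. The following sketch is an assumption-laden illustration: the class ContentAssignExample and its methods are hypothetical, and a naive fixed-point loop is used instead of P2A's work lists. Since r3 reads from an array-variable whose only array-object is O¹_A[], it receives the whole points-to set of that array-object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: array-objects carry a points-to set of their own,
// and content-assignments read or write that set. Names are hypothetical.
final class ContentAssignExample {
    // One map holds Pt(...) for variables and for array-objects alike.
    private final Map<String, Set<String>> pt = new TreeMap<>();
    private final List<String[]> assigns = new ArrayList<>(); // {r, l}: l = r
    private final List<String[]> stores = new ArrayList<>();  // {x, l}: l[i] = x
    private final List<String[]> loads = new ArrayList<>();   // {r, x}: x = r[i]

    Set<String> pointsTo(String name) {
        return pt.computeIfAbsent(name, k -> new TreeSet<>());
    }

    void alloc(String l, String object) { pointsTo(l).add(object); }
    void assign(String l, String r) { assigns.add(new String[] {r, l}); }
    void store(String l, String x) { stores.add(new String[] {x, l}); }
    void load(String x, String r) { loads.add(new String[] {r, x}); }

    void solve() {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] c : assigns) {          // Pt(r) ⊆ Pt(l)
                changed |= pointsTo(c[1]).addAll(pointsTo(c[0]));
            }
            for (String[] c : stores) {           // ∀a ∈ Pt(l), Pt(x) ⊆ Pt(a)
                for (String a : new ArrayList<>(pointsTo(c[1]))) {
                    changed |= pointsTo(a).addAll(pointsTo(c[0]));
                }
            }
            for (String[] c : loads) {            // ∀a ∈ Pt(r), Pt(a) ⊆ Pt(x)
                for (String a : new ArrayList<>(pointsTo(c[0]))) {
                    changed |= pointsTo(c[1]).addAll(pointsTo(a));
                }
            }
        }
    }

    // The statements of the array example, in source order.
    static ContentAssignExample thesisExample() {
        ContentAssignExample g = new ContentAssignExample();
        g.alloc("a1", "O1_A[]");
        g.alloc("r1", "O2_A");
        g.alloc("r2", "O3_A");
        g.store("a1", "r1");
        g.store("a1", "r2");
        g.assign("a2", "a1");
        g.load("r3", "a2");
        g.alloc("r4", "O4_A");
        g.store("a2", "r4");
        g.solve();
        return g;
    }
}
```

The fixed-point loop revisits every constraint until nothing changes, which sidesteps the work list issue mentioned for content-assignments at the cost of efficiency.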

The content of the points-to sets is straightforward, but notice that the reference r3 gets the whole points-to set of the array-object O¹_A[] even though it only reads from one position. There is a problem with the concept of content-assignment in a pointer assignment graph, but that matter will be discussed in section 3.4, because it is more related to the propagation. In short, the problem is that some variables and array-objects may not end up with the correct points-to set, because some array-variables may in some cases never belong to the work list.

Figure 3.3: A simple points-to assignment graph illustrating content-assignments (nodes: Object: A[], Object: A, Object: A, Object: A, Reference: a1, Reference: a2, Reference: r1, Reference: r2, Reference: r3, Reference: r4)

The last example deals with the connection between an invoke-statement and the target method. The graph will also show the references that method- and invocation-nodes have.

The code below is not complete and shows only an invocation in method m1 to target method m2 contained in class A.

public void m1(){

...

A a = new A();

B b = new B();

C c = a.m2(b);

...

}

public C m2(B b){

...

C c = new C();

...

return c;

}

Figure 3.4 shows the graph, which excludes the object-nodes because they are not relevant when showing the inter-procedural connection between an invocation-node and a method-node. The assign-edge from the variable a to the invocation is only in place so that the propagation reaches the invocation-node, which only has one incoming edge.

That implies that an invocation-node will always have the same points-to set as its base- node, in this case the variable a. As mentioned earlier, if a method has a void return value or is static, the method- and invocation-node will only have references to their parameters and arguments, respectively. If these exist, that is.
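The connection between an invocation and its target method can be sketched as follows. This is a hedged illustration of the invocation row in table 3.3 for a single receiver object, not P2A's code: the class InvocationBindingSketch, the string-based edge labels, the class SubA and the single-inheritance lookup in dispatch are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: binds l = r0.m(r1, ..., rn) to the method selected
// by a run-time-like dispatch on the type of one receiver object.
final class InvocationBindingSketch {
    private final Map<String, String> superOf = new HashMap<>();
    private final Map<String, Set<String>> declared = new HashMap<>();
    private final List<String> edges = new ArrayList<>();

    // Hypothetical class model: name, direct superclass (or null), declared methods.
    void declareClass(String name, String superName, String... methods) {
        superOf.put(name, superName);
        declared.put(name, new HashSet<>(Arrays.asList(methods)));
    }

    // Walks up from the receiver's class until a declaration of the method is found.
    String dispatch(String receiverType, String method) {
        for (String c = receiverType; c != null; c = superOf.get(c)) {
            if (declared.getOrDefault(c, Collections.<String>emptySet()).contains(method)) {
                return c + "." + method;
            }
        }
        return null;
    }

    // Adds the assign-edges of table 3.3 for one receiver object type.
    void bind(String l, String r0, String receiverType, String method, String... args) {
        String target = dispatch(receiverType, method);
        if (target == null) {
            return; // no target method found in this toy hierarchy
        }
        edges.add(r0 + " -> " + target + ".this");
        for (int i = 0; i < args.length; i++) {
            edges.add(args[i] + " -> " + target + ".p" + (i + 1));
        }
        if (l != null) {
            edges.add(target + ".ret -> " + l); // omitted for void invocations
        }
    }

    List<String> edges() {
        return edges;
    }
}
```

For the invocation C c = a.m2(b), a receiver object of the hypothetical subclass SubA that inherits m2 from A would be bound to A.m2, giving one edge to this, one edge to the first parameter and one edge from the return value to c.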

Figure 3.4: An invocation connected to a method (nodes: Reference: a, Reference: b, Reference: c, Invocation: m2, Method: m2, Reference: this, Reference: p, Reference: ret)

It is now time to present an algorithm that shows how a pointer assignment graph could be built. The algorithm is quite large, so it is divided into three parts: method-, statement- and expression-processing. The corresponding algorithms are named MethodProcessing, StatementProcessing and ExpressionProcessing. When a new method is encountered, it should be processed; that is, the pointer assignment graph for the method is built and the local propagation is then run. The local propagation will be presented in section 3.4. The processing of a method begins in the method-processing part, which first creates a method-node and then scans through the statements of the method. Statement-processing is then applied to each statement in the method body. Algorithm 1 shows the method-processing part of the algorithm. The algorithm is simple and does not require much explanation. The environment of method-processing should, however, be explained. Method-processing is started every time a new target method is found when an invocation is processed in the propagation. In fact, this is the only way method-processing can start in P2A. As the name implies, ProcessedMethods is a set consisting of all processed methods. The variables CurrentMethod and Graph will be used in the forthcoming parts of the algorithm and are therefore a kind of global variable.

Algorithm 1 Building the pointer assignment graph. Part 1 of 3

procedure METHODPROCESSING(Method m)
    if m ∉ ProcessedMethods then
        set m as CurrentMethod
        methodNode = createNode(m)
        add methodNode to Graph
        for all Statements sⁱ ∈ m do
            StatementProcessing(sⁱ)
        end for
    end if
end procedure
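Algorithm 1 could be rendered in Java roughly as follows. The sketch is illustrative only: the Method class is a made-up stand-in for a parsed method, the graph is reduced to a list of labels, and statementProcessing is a placeholder for the second part of the algorithm. The sketch also records the method in processedMethods when processing starts, which is an assumption, since the pseudo-code does not show where that set is updated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: a Java rendering of the method-processing step.
final class MethodProcessingSketch {
    // Hypothetical stand-in for a parsed method: a name and its statements.
    static final class Method {
        final String name;
        final List<String> statements;

        Method(String name, String... statements) {
            this.name = name;
            this.statements = Arrays.asList(statements);
        }
    }

    private final Set<Method> processedMethods = new HashSet<>();
    private final List<String> graph = new ArrayList<>(); // stand-in for the graph
    private Method currentMethod;

    void methodProcessing(Method m) {
        // Assumption: m is marked as processed here, so a method is built only once.
        if (!processedMethods.add(m)) {
            return;
        }
        currentMethod = m;
        graph.add("Method: " + m.name); // corresponds to createNode(m)
        for (String statement : m.statements) {
            statementProcessing(statement);
        }
    }

    // Placeholder for statement-processing, which is not modelled here.
    private void statementProcessing(String statement) {
        graph.add(currentMethod.name + " | " + statement);
    }

    List<String> graph() {
        return graph;
    }
}
```

Processing the same Method instance twice leaves the graph unchanged after the first call, which mirrors the guard at the top of the pseudo-code.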

The next part of the algorithm is the statement-processing, which is shown in algorithm 2. Statement-processing is built around four general types of statements: invoke-, assign-, identity- and return-statements. With the expression-processing, these statements will be further categorized. These statement types originate from the Soot Optimization Framework [49], which will be presented further in section 5.1. To aid the understanding of what happens in this part of the algorithm, the four general kinds of statements will be presented in more depth.

• An InvokeStatement represents an invocation that returns a void value. Expression-processing is used to determine which type of invocation it is, that is, whether the invocation is monomorphic or polymorphic. createInvocationReferences can be seen as a method that extracts the base and the arguments from the invoke-statement.

• An AssignStatement represents an assignment. Both the left- and right-side expressions of the assignment will be added to the pointer assignment graph if their type is of interest. Expression-processing will be consulted to construct nodes of the correct type. The right-side expression of the assignment could be an invoke-expression, and if that is the case, the method createInvocationReferences will
