
Linköpings universitet
Institutionen för datavetenskap
Department of Computer and Information Science

Master Thesis

Extended MetaModelica Based Integrated Compiler Generator

by

ARUNKUMAR PALANISAMY

LIU-IDA/LITH-EX-A--12/058--SE

2012-10-18

Supervisor: Olena Rogovchenko
Dept. of Computer and Information Science

Examiner: Prof. Peter Fritzson

Linköpings universitet
SE-581 83 Linköping, Sweden


Abstract

OMCCp is a new generation (not yet released) of the OpenModelica Compiler-Compiler parser generator. It contains an LALR parser generator implemented in the MetaModelica language, with parsing tables generated by the tools Flex and GNU Bison. It also provides very good error handling and is integrated with the MetaModelica semantics specification language.

The main contribution of this master thesis project is the development of a new version of OMCCp with complete support for an extended Modelica grammar, yielding a complete OMCCp-based Modelica parser. The implemented parser has been tested and the results have been analyzed. This new, enhanced generation of OMCCp improves on the previous version: it supports Modelica as well as the language extensions for MetaModelica, ParModelica, and optimization problem specification. Moreover, the generated parsers are about three times faster than those from the old OMCCp.


Acknowledgements

I take this opportunity to thank the several important people who made it possible to complete this thesis. First I would like to thank my examiner, Professor Peter Fritzson, who gave me the opportunity to work on this thesis and believed in my capabilities. Along with him, I thank my supervisor Olena Rogovchenko, who kept track of my progress every week and encouraged me to complete my work faster, and my technical supervisor Martin Sjölund, who provided great technical assistance in coding and enough guidance to make this work possible. I am happy to be a part of the OpenModelica project, which has given me the opportunity to learn a new language and contribute to the development of this open source project.

I thank the IDA administration (Department of Computer and Information Science) for providing me with the office and resources for my daily work, which were essential.

I would like to thank my family, especially my mother and brother, who offered their support when I started my work and encouraged me daily during difficult situations, which made me feel comfortable. I would also like to thank my fellow master thesis students Goutham and Alachew for offering valuable advice.


Table of contents

1. Introduction
   1.1. OpenModelica
   1.2. OMCCp
   1.3. ANTLR
   1.4. Project Goal
   1.5. Approach
   1.6. Intended readers
2. Theoretical Background
   2.1. Modelica
   2.2. MetaModelica
        2.2.1. Matchcontinue expression
        2.2.2. Union type
        2.2.3. List
   2.3. Compiler Construction
        2.3.1. Compiler phases
        2.3.2. Front end - Lexical analysis
        2.3.3. Front end - Syntax analysis
        2.3.4. LALR Parser
3. Existing Technologies
   3.1. Flex
   3.2. Bison
4. Implementation
   4.1. Problem Statement
   4.2. Proposed Solution
   4.3. OMCCp Design Architecture
   4.4. New parser
   4.5. Error Handler
   4.6. OMCCp vs ANTLR Error Handler
5. Testing
   5.1. Sample input and output
   5.2. Analysis of result
   5.3. Improvements from previous version
6. User Guide
   6.1. Getting started
   6.2. OMCCp commands
7. Conclusion
   7.1. Accomplishments
   7.2. Future work
Bibliography
Appendices


List of figures

2.3.1.    Compiler Phases
2.3.2.    Lexer Component
2.3.2.a)  Finite state automaton
2.3.3.    Parser Component
2.3.3.a)  Construction of PDA
2.3.3.b)  Construction of AST
2.3.3.c)  Construction of DFA
2.3.3.d)  Action and goto table
2.3.3.e)  Construction of AST without look ahead
2.3.3.f)  Construction of AST with look ahead
4.3.      Layout of OMCCp
5.2.a)    Parsing time results between old OMCCp and new OMCCp


Chapter 1

Introduction

Before getting into the technical aspects of the project, I would like to give some general information about this thesis. The idea for this thesis work came from the OpenModelica developers, who analyzed the problems of the previous implementation of OMCCp and proposed reworking the parser, so that the effort of the previous work would not go in vain, with the goal of making OMCCp the main parser tool of the OpenModelica project, replacing ANTLR. The whole thesis required knowledge of compiler construction techniques; the advanced compiler construction course that I had chosen in my course curriculum was the main motivation for taking on this thesis work, as it gave me good background knowledge, and the lab exercises in the course taught me how to write code for a lexer and a parser. Even with this background, my lack of knowledge of Modelica made the start very slow, and I had difficulties installing the tools and software required for the work. Learning Modelica is not a difficult task, however, as there are plenty of resources and materials provided by the Open Source Modelica Consortium (OSMC). Once I understood the language, the next part of the work went rather rapidly. I then had some difficulty understanding the actual workings of OMCCp; considering the complexity of the OMCCp design, it was initially tough to understand what the real problem was and what I should do, but the OpenModelica developers always provided help and made me comfortable whenever I found it difficult. After the first three weeks I understood my task, progressed at a rapid pace, and was able to complete the implementation within the 60 to 90 days proposed in my initial thesis plan. I assume the people reading this report are those who are going to do some work on OMCCp, and I assure you that, in my experience, the work will be very interesting and fun.

1.1 OpenModelica

The OpenModelica project develops a modeling and simulation environment based on the Modelica language. OpenModelica uses OMC (the OpenModelica Compiler), which translates code written in the Modelica language into C or C++ code that runs the simulation of the Modelica models. Several subsystems are integrated with OpenModelica, including the following [1].

OMEdit

OMEdit is a user-friendly graphical modeling tool which provides users with an easy approach to creating models by adding components (drag and drop), connecting and editing the components, simulating models, and plotting the results. It also provides both textual and graphical representations of user-defined models, which makes the work easier [1].

OMShell

OMShell is an interactive window terminal integrated with the OpenModelica project which provides an interactive approach: users simply load and simulate models by means of commands [1].
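For example, a minimal OMShell session for the HelloWorld model of Section 2.1 could look as follows (assuming the model is stored in a file HelloWorld.mo; loadFile, simulate and plot are standard OpenModelica scripting commands):

loadFile("HelloWorld.mo");        // load the model into the compiler
simulate(HelloWorld, stopTime=2); // translate the model and run the simulation
plot(x);                          // plot the simulated variable x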

OMNotebook

OMNotebook is a lightweight notebook editor which can be used to compile and run Modelica code by evaluating the code written in its cells [1].

1.2 OMCCp (OpenModelica Compiler-Compiler Parser)

OMCCp (the OpenModelica Compiler-Compiler parser generator) is a new-generation parser tool implemented entirely in MetaModelica, an extension of the Modelica language. The OpenModelica project currently uses the ANTLR (ANother Tool for Language Recognition) tool for generating the AST. In this thesis we present an alternative tool called OMCCp, a new-generation enhanced parser with lexical analysis and syntax analysis implemented in separate phases, improving the efficiency of the compiler, and using MetaModelica in the new bootstrapped OpenModelica compiler for the RML semantics implementation. The tool also contains good error handling. OMCCp uses an LALR parser to generate the Abstract Syntax Tree (AST) [2].

1.3 ANTLR (ANother Tool for Language Recognition)

The OpenModelica project is currently integrated with the ANTLR tool to generate the Abstract Syntax Tree (AST). ANTLR uses an LL(k) parser to generate the AST. Over the years the tool has shown well-known disadvantages such as memory overhead, bad error handling, and lack of type checking. The tool combines lexical and syntax analysis in a single integrated pass, which decreases the efficiency of the parser; generally the performance of a parser is higher when lexical analysis is kept separate from syntax analysis. The ANTLR tool is connected to OMC through an external C interface [2].

1.4 Project Goal

The goal of this master thesis is to write a new parser and front end for OMCCp (the OMCCp lexer and OMCCp parser), replacing the previous parser, to generate a MetaModelica Abstract Syntax Tree as input to the new bootstrapped OpenModelica compiler.

The results expected from this thesis are:

1) A working OMCCp lexer and parser integrated with MetaModelica in the new bootstrapped OpenModelica compiler.
2) A tested OMCCp.
3) Improvements in performance compared to the previous parser.
4) A completed Modelica grammar for a complete OMCCp-based Modelica parser.
5) A release of the updated OMCCp.

1.5 Approach

The first step of this thesis is a literature study of Modelica, MetaModelica, and, most importantly, compiler construction techniques. Several lectures and presentation slides for learning Modelica are available on the www.openmodelica.org website; these contribute to a better understanding of the Modelica language syntax and are also the first step in the construction of the new parser.

Since OMCCp is implemented in MetaModelica, an extension of the Modelica language, it is important to become familiar with the language constructs. Online courses are available on the www.openmodelica.org website which provide sample exercises and methods for writing MetaModelica code. The exercises are highly recommended before the start of implementation. A MetaModelica guide [5] is also available which covers all MetaModelica constructs and built-in functions.

The next part is to learn about the different phases of a compiler, which must be known before implementation. There are two phases, namely the front end and the back end. In this thesis we focus on the front end, which requires knowledge of lexical analysis and syntax analysis.

Lexical analysis, or scanning, is the first stage in the front end of the compiler. In this phase the scanner receives the source code as a character stream and translates the character stream into a list of tokens based on rules written in the form of regular expressions. The tool used for generating the tokens is Flex.

Syntax analysis, or parsing, is the next phase in the compiler front end; it takes the tokens generated by the lexer as its input and creates an intermediate form called the Abstract Syntax Tree (AST). The Bison or Yacc tool is used for generating the AST.
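To make the two phases concrete, consider the small fragment below; the token and node names are hypothetical, chosen only to illustrate the flow from characters to tokens to a tree:

// source code:   model A  Real x;  end A;
// lexer output:  MODEL IDENT("A") REAL IDENT("x") SEMICOLON END IDENT("A") SEMICOLON
// parser output: an AST node along the lines of CLASS("A", {COMPONENT("x", "Real")})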

1.6 Intended Readers

The reader of this document is someone who is interested in core compiler construction work, especially in building the front end of a compiler. This document provides developers with some of the important information about the OpenModelica project OMCCp.


Chapter 2

Theoretical background

This chapter provides the theoretical knowledge required for understanding this work. It covers Modelica, MetaModelica, and compiler construction techniques.

2.1 Modelica

Modelica is an object-oriented, equation-based language for modeling and simulation. It can be used for multi-domain, component-based modeling of systems containing mechanical, electrical, electronic, hydraulic, thermal, control, electric power, and other components. The effort is supported by OSMC, a non-profit organization [3].

Example

class HelloWorld

Real x (start = 1);

parameter Real a = 1;

equation

der(x) = - a * x;

end HelloWorld;

An important point about the language to discuss here is that the number of unknown variables must equal the number of equations, with some exceptions: parameter and constant variables need not be defined in the equation section. In the above example we created a new model (class) HelloWorld which contains a variable x of type Real and a parameter variable a of type Real. In the equation section we defined the variable x, so the number of unknown variables, one, matches the number of equations [3].


2.2 MetaModelica

MetaModelica is an extension of the Modelica language which is used to model the semantics of the Modelica language. MetaModelica was developed as part of the OpenModelica project to provide a platform for developing the Modelica compiler. Some of the MetaModelica language constructs used in this implementation are discussed below [4] [5].

2.2.1 Matchcontinue

The matchcontinue expression is very similar to the switch statement in C, with the differences that the matchcontinue expression returns a value and also supports pattern matching [4] [5].

Example

function lookup
  input String s;
  output Integer x;
algorithm
  x := matchcontinue s
    case "one" then 1;
    case "two" then 2;
    case "three" then 3;
    else 0;
  end matchcontinue;
end lookup;

In the above example, given a String s, the function returns the numeric value of the corresponding string. The underscore '_' can be used as a pattern that matches any value. The matchcontinue expression contains case blocks; each case block is tried in turn to find a match. If a case matches, its value is returned; otherwise the next case block is tried until a match is found. A matchcontinue expression can also return more than one value.
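For example, lookup("two") returns 2, while lookup("five") fails to match the first three cases and falls through to the else branch, returning 0.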


2.2.2 Uniontype

A uniontype is a collection of one or more record types. Uniontypes can be recursive and can refer to other uniontypes. This is one of the important constructs of MetaModelica used in this implementation; it is mainly used in the construction of the Abstract Syntax Tree (AST) [4] [5].

Example

uniontype Exp
  record INT Integer x1; end INT;
  record NEG Exp x1; end NEG;
  record ADD Exp x1; Exp x2; end ADD;
end Exp;

In the above example we created a new uniontype Exp which includes a collection of record types, namely INT, NEG, and ADD. A record type is a restricted class type in Modelica which does not contain equations. Suppose we have an expression like 6 + 44. Its abstract syntax tree is constructed as:

ADD(INT(6), INT(44))

When the parser finds the expression 6 + 44, it finds the appropriate record types of the uniontype Exp and constructs the AST: the expression is an add operation, so the record ADD is selected, and the operands are of integer type, so the record INT is selected from the uniontype Exp. Pictorially, the tree has ADD at the root with the two leaves INT(6) and INT(44).
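To show how such a uniontype is typically consumed, the sketch below evaluates an Exp tree with a match expression; the function eval is illustrative only and is not part of OMCCp:

function eval "evaluates an Exp AST to its Integer value"
  input Exp e;
  output Integer result;
algorithm
  result := match e
    local
      Exp e1, e2;
      Integer i;
    case INT(i) then i;                        // a leaf holds its value
    case NEG(e1) then -eval(e1);               // negate the subtree value
    case ADD(e1, e2) then eval(e1) + eval(e2); // add the two subtree values
  end match;
end eval;

With this function, eval(ADD(INT(6), INT(44))) returns 50.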


2.2.3 List

The list construct in MetaModelica is used to add or retrieve items from a list, which is somewhat similar to an array. The cons operator '::' is used for adding elements to, or retrieving elements from, the list [4] [5] [2].

Example:

list<Integer> a = {1, 2, 3};
i::a = a;
a = i::a;

In the above example we create a list a of Integer type which contains the elements 1 to 3, and then use the cons operator '::' to retrieve and add elements.

Retrieve operation: in line 2, the statement i::a = a retrieves the head element 1 from the list, stores it in the variable i, and stores the remaining list {2, 3} in the variable a.

Add operation: in line 3, the statement a = i::a adds the item i to the front of the list a, i.e., puts the element 1 back at the head of the list.
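As a complete example of the cons operator in use, the sketch below sums a list of Integers by repeatedly retrieving the head element; the function sumList is illustrative only:

function sumList
  input list<Integer> inList;
  output Integer total;
protected
  list<Integer> rest := inList;
  Integer head;
algorithm
  total := 0;
  while (List.isEmpty(rest) == false) loop
    head::rest := rest;     // retrieve: head gets the first element of rest
    total := total + head;  // accumulate the retrieved element
  end while;
end sumList;

For example, sumList({1, 2, 3}) returns 6.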


2.3 Compiler Construction

2.3.1 Compiler phases

Fig 2.3.1: Compiler phases [6] [7]. (The diagram shows the pipeline: source code -> Lexical Analyzer -> tokens -> Syntax Analyzer -> AST -> Semantic Analyzer -> type-checked AST, forming the front end; then Code Optimizer -> optimized AST -> intermediate code -> assembly code, forming the back end.)


This implementation strongly requires compiler construction knowledge. People who are interested in building a new compiler of their own must follow the six stages below. Generally a compiler consists of two phases, namely [6] [7]:

1. the front end, and
2. the back end.

Front end

The front end includes three stages, namely:

1. Lexical analyzer
2. Syntax analyzer
3. Semantic analyzer

Back end

The back end includes three stages, namely:

1. Code optimizer
2. Intermediate code generator
3. Code generator

The front end and the back end depend on each other; each stage in a compiler takes its input from the output of the previous stage. Now let us see what each stage performs.

1. Lexical Analyzer

Input: source code
Output: tokens

The lexical analyzer, also called the scanner, is the first stage of building a compiler. It takes the source code in the form of a character stream as input and outputs tokens. The tokens are identified based on rules written in the form of regular expressions [6] [7].

2. Syntax Analyzer

Input: tokens
Output: AST

The syntax analyzer, also called the parser, is the second stage of building a compiler. It takes the tokens from the previous stage as its input and outputs an Abstract Syntax Tree, but only if the tokens are formed according to the rules of the language; otherwise it reports error messages [6] [7].

3. Semantic Analyzer

Input: AST
Output: type-checked AST

The semantic analyzer is the third stage of the compiler. It takes the AST from the syntax analyzer as input, performs type checking over the AST, and outputs a type-checked AST; otherwise it reports error messages [6] [7].

4. Code Optimizer

Input: type-checked AST
Output: optimized AST

The code optimizer is the fourth stage of the compiler. It takes its input from the semantic analyzer, performs code optimization over the type-checked AST, and outputs an optimized AST. More about code optimization can be found in the compiler construction book [6] [7].

5. Intermediate Code Generator

Input: optimized AST
Output: intermediate code

Intermediate code generation is the fifth stage of the compiler. It takes its input from the code optimizer and outputs intermediate code, which is close to the final code [6] [7].

6. Code Generator

Input: intermediate code
Output: assembly or machine code

The code generator is the final stage of the compiler. It takes the intermediate code as input and outputs machine code or assembly code, which is the final target code [6] [7].
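The data flow between the six stages can be summarized by the following sketch; the function names are hypothetical placeholders for the stages, not actual OMC functions:

tokens       := lexicalAnalyzer(sourceCode);    // stage 1: characters -> tokens
ast          := syntaxAnalyzer(tokens);         // stage 2: tokens -> AST
typedAst     := semanticAnalyzer(ast);          // stage 3: AST -> type-checked AST
optimizedAst := codeOptimizer(typedAst);        // stage 4: type-checked AST -> optimized AST
intermediate := intermediateCode(optimizedAst); // stage 5: optimized AST -> intermediate code
machineCode  := codeGenerator(intermediate);    // stage 6: intermediate code -> assembly/machine code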

In this implementation we are interested in the front-end phase of the compiler, and we will discuss the front-end stages in detail. The lines above give basic knowledge to readers interested in building a new compiler. More about the compiler stages, and especially about the back end, can be read in the book: Aho, Lam, Sethi, Ullman, Compilers: Principles, Techniques, and Tools, Second Edition, Addison-Wesley, 2006.

2.3.2 Front end - Lexical analyzer

The lexical analyzer, also called the scanner, is the first stage in compiler construction. As stated in the definition, it takes source code as input and outputs tokens based on rules written in the form of regular expressions. It identifies the tokens and thereby reduces the complexity of the subsequent stages of the compiler. The scanner outputs a token based on the first match it finds, and hence the order of the rules is essential to avoid ambiguity.

The lexical analysis in this implementation is performed by a program called the lexer [6] [7] [2].

Lexer

Fig 2.3.2: Lexer component. (The diagram shows the OMCCp lexer program taking source code as input and running a finite automaton (DFA) built from the regular expressions.)


The lexer is a program which takes the source code and outputs the tokens. To generate the tokens, the lexer runs a deterministic finite automaton (DFA) based on the rules written in the form of regular expressions. We use a DFA in this implementation because we want a single path between state transitions and, more importantly, to avoid ambiguity; a non-deterministic finite automaton (NFA) can also be converted to a DFA, but to avoid such complexities we use a DFA directly. Let us see how a DFA works.

Deterministic Finite Automaton (DFA)

A DFA is a 5-tuple (Q, Σ, δ, q0, F) [8] [6], where:

Q - the set of states
Σ - the input alphabet
δ - the set of transitions, δ: Q x Σ -> Q
q0 - the start state
F - the set of final states

Example:

{wa | w ∈ {a, b}*}

Input: abbaaba

Fig 2.3.2.a: Finite state automaton. (Two states q0 and q1; reading a leads to q1 and reading b leads to q0, from either state.)


In the example we have a regular expression rule which says that words over the alphabet {a, b}, of any length, are accepted on the condition that they end with an a. When we give the input string abbaaba, the automaton runs over it and decides whether the string is accepted.

For the above input string the 5-tuple is:

Q = {q0, q1}
Σ = {a, b}
δ = Q x Σ -> Q, e.g. (q0, a) -> q1 and (q1, b) -> q0
q0 = start state, F = {q1}

The automaton starts in the state q0, reads a and moves to the next state q1, reads b and moves back to q0, reads b again and remains in q0, reads a and goes to q1, reads a again and remains in q1, reads b and goes to q0, and finally reads the last a and reaches the final state q1, which tells us that the string is accepted; the lexer then outputs the corresponding token [8] [6].
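As a sketch of how a scanner runs such an automaton, the MetaModelica-style function below simulates this two-state DFA; the function accepts is illustrative (stringListStringChar and stringEq are MetaModelica built-in functions):

function accepts "returns true if a word over {a, b} ends with an a"
  input String word;
  output Boolean accepted;
protected
  list<String> chars := stringListStringChar(word);
  String c;
  Integer state := 0; // 0 = q0 (start state), 1 = q1 (final state)
algorithm
  while (List.isEmpty(chars) == false) loop
    c::chars := chars;
    // transitions: reading a leads to q1, reading b leads to q0, from either state
    state := if stringEq(c, "a") then 1 else 0;
  end while;
  accepted := (state == 1); // accept when the automaton ends in the final state q1
end accepts;

For the input above, accepts("abbaaba") returns true, matching the trace.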

2.3.3 Front end - Syntax analyzer

The syntax analyzer, also called the parser, is the second stage in the front end. It takes the tokens from the lexer as its input and checks whether they are formed according to the rules of the language. If they are, it creates the Abstract Syntax Tree (AST); otherwise it reports error messages. Since lexical analysis is kept separate from syntax analysis, the complexity of the parser is reduced. The AST is used as input for the back end of the compiler [6] [7].

Syntax analysis is performed by the component called the parser.

Parser

The parser is a program which takes the tokens as its input, runs the LALR algorithm over the list of tokens, and generates the AST with the help of parse tables and stack states. The architecture of the parser is discussed below.

Fig 2.3.3: Parser component. (The diagram shows the parser taking tokens as input and consulting the parse table and the stack states.)

The parser uses the parsing table to determine the next state. The next state is obtained from the parsing table based on the look-ahead token and the current stack state. In this way it constructs the AST by means of two operations, namely shift and reduce. It uses a PDA to perform these two operations.

Context-free grammar (CFG)

The rules of the parser are represented in the form of a context-free grammar. A CFG is represented as a 4-tuple (V, Σ, R, S) [9] [6], where:

V - the set of non-terminals
Σ - the set of terminals
R - the set of rules
S - the start symbol

E.g.:

1. S -> aS | X
2. X -> ab

In the above grammar the left-hand sides of the rules are strictly non-terminals, generally represented by capital letters, while the right-hand sides contain a combination of terminals and non-terminals. A terminal is usually represented in lower case; terminals can also include operators and symbols like $, #, etc. In the above example the terminals are a and b. For instance, the word aab is derived as S => aS => aX => aab.



Push-Down Automaton (PDA)

A push-down automaton (PDA) uses a stack to construct the AST: it reads the tokens and pushes them onto the stack (the push, or shift, operation), and when it has seen enough tokens that they can be replaced by a new symbol, it builds a piece of the abstract syntax tree (the reduce operation). It uses the stack states and the tokens to decide the next action, and it continues to build the AST until an accept state is found, with the help of the parsing table [10] [6] [2].

Let us see how the PDA works with a small example:

1. S -> aS | X
2. X -> ab

Consider the above grammar and the input aab. The PDA reads the first token a and pushes it onto the stack, reads the next token a and pushes it as well, and then reads and pushes b. From the top of the stack it determines the next action and state: the tokens ab match the right-hand side of rule 2, so they can be replaced by the new symbol X (a reduce operation); X can then be replaced by S using rule 1, and finally aS can be replaced by S, thereby constructing the AST bottom-up. The process continues until an accept state is found. Below is a diagrammatic representation of the construction of the PDA stack and the AST.

Fig 2.3.3.a: Construction of the PDA. (Stack snapshots showing tokens being shifted and then reduced: ab is reduced to X, X to S, and aS to S.)

Fig 2.3.3.b: Construction of the AST. (The tree built from the reductions X -> ab, S -> X, and S -> aS.)

2.3.4 LALR Parser

The look-ahead left-to-right (LALR) parser is the type of LR parser used in this implementation. LALR parsers belong to the category of bottom-up parsers, which are more efficient than top-down parsers. An LR parser reads the input from left to right without backtracking; since it avoids backtracking, time is saved, and this type of parser ideally suits our implementation. To avoid backtracking, the LR parser uses a look-ahead of k input symbols to decide whether to shift or reduce. Hence an LR parser is usually written LR(k), where k denotes the look-ahead; in general k = 1 [11] [6] [2].

LR parsers are better suited than LL parsers, which commit to a reduce action as soon as matching tokens appear on the parse stack, which can give wrong results, whereas an LR(k) parser waits until it has completed the look-ahead, i.e., it looks at the upcoming input symbols and finds the appropriate pattern before committing to an action. Thus LR parsers can handle a large class of grammars without errors, and they are also better suited for error handling than LL parsers [11] [6] [2].

The LR parser is deterministic and produces a single correct parse without backtracking or ambiguity. The construction of an LR parser is driven by two actions, namely shift and reduce, using a parse table constructed from a DFA. Now let us see with a small example how the LR parse tables are constructed.

Construction of the LR parser

Consider the following grammar rules:

1. S' -> S$
2. S -> aS
3. S -> X
4. X -> ab

The DFA construction is represented below.

Fig 2.3.3.c: Construction of the DFA. (Item sets: state 1 = {S' -> .S$, S -> .aS, S -> .X, X -> .ab}; state 2 = {S' -> S.$}, the accept state; state 3 = {S -> a.S, X -> a.b, S -> .aS, S -> .X, X -> .ab}; state 4 = {S -> X.}; state 5 = {S -> aS.}; state 6 = {X -> ab.}. Transitions: from state 1, S leads to 2, X to 4, and a to 3; from state 3, a loops back to 3, b leads to 6, S to 5, and X to 4.)

As said earlier, LR parsers are deterministic; the canonical LR items, namely the action and goto tables, are built based on this DFA. The construction of the DFA begins by traversing every production with a dot placed at the front of its right-hand side. For example, for the first rule, S' -> .S$, the dot is placed before the non-terminal S. The ultimate goal of the LR parser is to process the entire right-hand side of each production, i.e., the dot should reach the end, past the symbol $, which is represented as S' -> S$. and indicates that the rule has been processed completely by the parser. The figure above shows the various states through which the DFA is constructed to parse all four productions. For constructing the DFA we have to follow certain rules with respect to where the dot is placed.

1) If the dot is placed in front of a non-terminal (represented by a capital letter), then we write down all the productions of that non-terminal. For example, in state 1 the dot is placed in front of the non-terminal S (S' -> .S$), so in that state we write down all the productions of S (S -> .aS, S -> .X); the dot is then also in front of the non-terminal X, so we write down the productions of X as well.

2) If the dot is placed in front of a terminal (a lower-case letter), we continue by moving the dot past that symbol.

The figure shows the states in which all the productions are traversed completely; these are used for the construction of the parse table (action and goto). For more details about the construction of the DFA, see http://en.wikipedia.org/wiki/LR_parser.

Construction of the action and goto table

Once the DFA is constructed, we can build the action and goto tables from the DFA states. The action table records the action taken on a particular token, which is one of two operations:

1. Shift, written S followed by the state to go to next; for example, S3 means that on reading that particular input we perform a shift operation and move to state 3.

2. Reduce, written R followed by the number of the grammar rule to reduce by; for example, R4 means that on that particular input we perform a reduce operation using rule 4.

The action table holds the actions performed on the set of terminals, and the goto table holds the transitions taken on the set of non-terminals. Below is the table based on the DFA construction.

State |  a  |  b  |   $    || S | X
  1   | S3  |     |        || 2 | 4
  2   |     |     | accept ||   |
  3   | S3  | S6  |        || 5 | 4
  4   |     |     |  R3    ||   |
  5   |     |     |  R2    ||   |
  6   |     |     |  R4    ||   |

Fig 2.3.3.d: Action and goto table. (The reduce actions appear in the $ column, since the follow sets of S and X in this grammar contain only $.)
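To illustrate how the parser uses this table, here is a possible trace for the input aab followed by the end marker $, assuming, as in the table above, that the reduce actions fire on the $ look-ahead:

stack [1]          input aab$ : action[1, a] = S3, shift -> stack [1, 3]
stack [1, 3]       input  ab$ : action[3, a] = S3, shift -> stack [1, 3, 3]
stack [1, 3, 3]    input   b$ : action[3, b] = S6, shift -> stack [1, 3, 3, 6]
stack [1, 3, 3, 6] input    $ : action[6, $] = R4, reduce X -> ab, pop 2 states, goto[3, X] = 4 -> stack [1, 3, 4]
stack [1, 3, 4]    input    $ : action[4, $] = R3, reduce S -> X, pop 1 state, goto[3, S] = 5 -> stack [1, 3, 5]
stack [1, 3, 5]    input    $ : action[5, $] = R2, reduce S -> aS, pop 2 states, goto[1, S] = 2 -> stack [1, 2]
stack [1, 2]       input    $ : accept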

The construction of an LALR parser is very similar to the above, with only a minor modification to the parse table: two different states which perform almost the same actions are combined in order to save table space. E.g., if states 3 and 4 had the same actions on the same inputs, they would be combined into a single row, 34, with their corresponding actions joined. To learn more about this construction, see Aho, Lam, Sethi, Ullman, Compilers: Principles, Techniques, and Tools, Second Edition, Addison-Wesley, 2006.

Look ahead

The look ahead is very important in parsing a grammar, as it avoids ambiguity and errors. With look ahead, the parser examines the upcoming tokens before deciding which rule it should apply. The look ahead has two advantages:

a) It helps to avoid conflicts and produces the correct result.
b) It avoids duplicate states and eliminates the need for an extra stack.

Now let us see with an example how look ahead works. Consider the following grammar rules [12]:

1: E -> E + E
2: E -> E * E
3: E -> number
4: * has higher precedence than +

Input:

1 + 2 * 3 (the correct result is 7)

Case 1: Construction of the AST without look ahead

The parser pushes the input token 1 onto the stack, finds a match with rule 3, and replaces 1 with E; it then pushes + onto the stack, and then 2, replacing it with E by rule 3 (E -> number). At this point it does not consult rule 4, which says that * has higher precedence than +, meaning that rule 1 should not yet be applied to the input 1 + 2; but since no look ahead is performed, it performs the reduce operation and applies rule 1 (E -> E + E), so at this point we have the result 3 on the stack. In the next step it pushes * onto the stack, and finally 3, replacing it with E by rule 3 (E -> number); the stack now holds E * E, rule 2 (E -> E * E) is applied, and we get the result 9 on the final stack, which is wrong. Below is a pictorial representation of the stack [12].

Fig 2.3.3.e: Stack without look ahead [12]. (The stack evolves: 1; E; E +; E + 2; E + E; E; E *; E * 3; E * E; E.)


Case 2: Construction of the AST with look ahead

The parser pushes the input token 1 onto the stack, finds a match with rule 3, and replaces 1 with E; it then pushes + onto the stack. At this point it consults rule 4, which says that * has higher precedence than +, meaning that rule 1 should not yet be applied to the input 1 + 2; the token examined after + is the look ahead. Since look ahead is performed, the parser does not perform the reduce operation on 1 + 2, so at this point E + E remains pending on the stack. In the next step it pushes * onto the stack, and finally 3, replacing it with E by rule 3 (E -> number); the stack now holds E + E * E, rule 2 (E -> E * E) is applied first, giving the result 6, and then rule 1 (E -> E + E) is applied to the remaining E + E, which gives the correct result 7. Below is a pictorial representation of the stack [12].

Fig 2.3.3.f: Stack with look ahead [12]. (The stack evolves: 1; E; E +; E + 2; E + E; E + E *; E + E * 3; E + E * E; E + E; E, with the look ahead deferring the + reduction.)

The above example is taken from Wikipedia; for further details see http://en.wikipedia.org/wiki/Parsing#Lookahead.


Chapter 3

Existing Technologies

In this chapter we discuss the existing tools which are used to generate the lexer and parser, namely the Flex and Bison tools.

3.1 Flex

Flex (fast lexical analyzer) is a tool which is used to generate the lexical analyzer, or scanner; it serves as an alternative to Lex. Flex makes the first stage of the compiler easier by automating the generation of tokens. The input to Flex is a file with the ".l" extension; in our implementation the input file is "lexerModelica.l", which contains the rules written in the form of regular expressions. When the user gives input (test cases), the scanner reads the input, finds matches with the regular expressions, and outputs tokens. Flex generates a C file ("lexerModelica.c") which contains the arrays and the algorithm that run the DFA to produce the list of tokens. The Flex tool avoids the complexity of recognizing tokens based on the order of priority, handles a large number of rules, and thus reduces the complexity of the next stage of the compiler (the parser). Flex is mostly used in combination with Bison, as it provides the tokens directly to Bison, avoiding ambiguity. More information about Flex, including its manual, can be found at http://flex.sourceforge.net/manual/; the manual contains very good information about Flex [13] [2] [7].


3.2 GNU Bison

GNU Bison, usually referred to simply as Bison, is a parser generator. The Bison tool is mostly used together with Flex: it receives the tokens from Flex as input, checks whether the tokens are constructed according to the rules of the grammar, and creates the Abstract Syntax Tree (AST). The rules are specified in the form of a context-free grammar. Bison generates an LALR parser, the algorithm we use to construct the AST. The Bison tool generates C code which contains the transition arrays and the algorithm that runs the PDA to create the AST. More information about GNU Bison can be found at http://flex.sourceforge.net/manual/Bison-Bridge.html; the manual contains very good information about the generated C files, how the transition arrays and LALR tables are constructed, and how the parser decides to perform a shift or reduce operation by querying the table [14] [7] [2].


Chapter 4

Implementation

In this chapter we discuss how OMCCp is implemented: we identify the problem statement, propose a solution for the identified problem, and then discuss the OMCCp design and architecture and the important code changes.

4.1 Problem Statement

The design of OMCCp was started in 2011 by the OpenModelica developers. The development and coding were started by a master thesis student who designed the structure and layout of OMCCp. When the final code was done, the implementation had problems. The implementation was highly dependent on the OpenModelica project directory "Compiler", which includes the front-end and back-end components. During the development of OMCCp, the OpenModelica developers made a lot of significant changes to the "Compiler" utility files in order to improve the efficiency of OMC. One significant change is that the compiler utility file "rtopts.mo" was removed from the "Compiler" directory (Compiler->util->rtopts.mo); since OMCCp had been developed against the removed file, it failed to work and did not support the newer version, which also prevented the release of OMCCp [2].

4.2 Proposed Solution

The solution we propose in this thesis is to rebuild the entire OMCCp with respect to the changes made to the "Compiler" utility models, so that it supports the new version. We reuse the old version of OMCCp as a starting point to understand the code and identify the flaws which made OMCCp fail. We implement the parser entirely in MetaModelica, with lexical analysis and syntax analysis in separate phases; by separating these phases we improve the efficiency of the parser and also the performance of the compiler. We use the LALR algorithm to create the AST, which performs much better than other parsing algorithms and also avoids ambiguity. We also implement the full Modelica grammar, so that OMCCp parses all types of Modelica and MetaModelica syntax and constructs.

4.3 OMCCp Design and Architecture

Fig 4.3: Layout of OMCCp. (The diagram shows the two parts of OMCCp. The OMCCp lexical analyzer is built from lexerModelica.l, Lexer.mo and LexerCode.tmo; running Flex leads to the generated files LexerCodeModelica.mo, LexerGenerator.mo, LexerModelica.mo, LexTableModelica.mo and TokenModelica.mo. The OMCCp syntax analyzer is built from parserModelica.y, Parser.mo and ParseCode.tmo; running Bison leads to the generated files ParserCodeModelica.mo, ParserGenerator.mo, ParserModelica.mo and ParseTableModelica.mo. Main.mo, SCRIPT.mos and OMCC.mos drive the process.)


The design of OMCCp consists of two processes, namely:

1) the OMCCp lexical analysis, and
2) the OMCCp syntax analysis.

We use the Flex tool for generating the lexical analyzer. The main files in the lexical analysis are "lexerModelica.l" and "Lexer.mo". As said earlier, OMCCp is implemented entirely in MetaModelica. The Flex tool processes the scanner file "lexerModelica.l" and, from the token definitions, generates a C file ("lexerModelica.c"). We use the C file to generate the MetaModelica code (files with the .mo extension), which can be seen in the diagram above under the heading "Generated files".

We use the Bison tool for generating the syntax analyzer. The main files in the syntax analysis are "parserModelica.y" and "Parser.mo". The Bison tool takes "parserModelica.y" as its input and generates a C file ("parserModelica.c"); from the generated C file we generate the MetaModelica code shown in the diagram above.

The file "Main.mo" is the main file which contains the run-time system calls that start the translation process; from this file we make the function calls that start the lexer and parser. The file "SCRIPT.mos" loads the compiler utility files and the test files.

In this report we discuss the important changes made during the development. For a more detailed explanation of the code, please see the report "A MetaModelica based parser generator applied to Modelica" by Edgar Alonso Lopez-Rojas.

4.4 New Parser

In this part we discuss the changes made to support the new version. Before getting into the changes, we present the main models of OMCCp which start the translation.

Main.mo

public function main
  input list<String> inStringLst;
protected
  list<OMCCTypes.Token> tokens;
  ParserModelica.AstTree astTreeModelica;
algorithm
  _ := matchcontinue (inStringLst)
    local
      list<String> args;
      String filename, parser;
      Boolean result;
    case args as _::_
      equation
        {filename, parser} = Flags.new(args);
        "Modelica" = parser;
        false = (0 == stringLength(filename));
        print("\nParsing Modelica with file " + filename + "\n");
        // call the lexer
        print("\nstarting lexer");
        tokens = LexerModelica.scan(filename, false);
        print("\n Tokens processed:");
        print(intString(listLength(tokens)));
        // call the parser
        print("\nstarting parser");
        (result, astTreeModelica) = ParserModelica.parse(tokens, filename, true);
        print("\n");
        // printing the AST
        if (result) then
          print("\nSUCCEED");
        else
          print("\n" + Error.printMessagesStr());
        end if;
      then ();
  end matchcontinue;
end main;


The above code is taken from the OMCCp main model "Main.mo". This is the main model from which we make the run-time system calls that start the translation process, namely the lexer and the parser. The model contains a function "main" which makes the calls to the lexer and parser respectively. The function has an input component, a list of type String, which takes the arguments as strings. In the protected section we have a list of OMCC tokens, used for printing the tokens identified by the lexer (see the model "TokenModelica.mo" for the Token type), and the ParserModelica.AstTree astTreeModelica used to print the AST. In the equation section we take the input from the user and then call the lexer, "LexerModelica.scan", with the filename (the input or test-case file given by the user) as an argument, as can be seen in the code above; the results of the lexer are stored in "tokens". Once the lexer has output the tokens, we call the parser, "ParserModelica.parse", giving the tokens identified by the lexer as input. The parser checks whether the tokens are formed according to the rules of the language, and then we print the AST or print an error message. For the detailed version of this model, refer to the Appendix A section.

LexerModelica.mo, function scan

function scan "Scan starts the lexical analysis, loads the tables and consumes the program to output the tokens"
  input String fileName "input source code file";
  input Boolean debug "flag to activate the debug mode";
  output list<OMCCTypes.Token> tokens "return list of tokens";
algorithm
  // load program
  (tokens) := match (fileName, debug)
    local
      list<OMCCTypes.Token> resTokens;
      list<Integer> streamInteger;
    case (_, _)
      equation
        streamInteger = loadSourceCode(fileName);
        resTokens = lex(fileName, streamInteger, debug);
      then (resTokens);
  end match;
end scan;


The above code is taken from the model "LexerModelica.mo", which contains the function "scan". The model "Main.mo" makes the first system call to this function to start the lexical analysis. The function starts the scanning process, loads the tables from the arrays created by the Flex tool, and consumes the program to output the tokens. The function has two input components, "fileName" of type String and "debug" of type Boolean, and one output component, "tokens" of type list. In the algorithm section we first load the program by passing the input components. In the equation section we load the actual source code file given by the user as input, using the function call "loadSourceCode(fileName)", and then start the scanning process by calling "lex" with the given filename, which identifies the characters in the given file and outputs them as tokens. Only the important parts of the code are shown here; for more details, refer to the Appendix A section.

LexerModelica.mo, function lex

function lex
  input String fileName "input source code file";
  input list<Integer> program "source code as a stream of Integers";
  input Boolean debug "flag to activate the debug mode";
  output list<OMCCTypes.Token> tokens "return list of tokens";
protected
  list<Integer> program1 := program;
algorithm
  tokens := {};
  if (debug) then
    print("\n TOTAL Chars:");
    print(intString(listLength(program1)));
  end if;
  while (List.isEmpty(program1) == false) loop
    if (debug) then
      print("\nChars remaining:");
      print(intString(listLength(program1)));
    end if;
    // ... the tokenization steps, which consume characters from program1
    // and prepend the recognized tokens to the list, are omitted here ...
  end while;
  tokens := listReverse(tokens);
end lex;

(38)

The above code is taken from the model "LexerModelica.mo", which contains the function "lex". The function "scan" makes the call to this function, passing the input source code file converted to a stream of integers. In the algorithm section the function reads the characters from the source file as a list and outputs tokens until all the characters identified in the given source file have been consumed; finally it puts the tokens in the right order using the built-in function listReverse. Only the important parts of the code are shown here; for more details, refer to the Appendix A section.

ParserModelica.mo, function parse

function parse "realizes the syntax analysis over the list of tokens and generates the AST"
  input list<OMCCTypes.Token> tokens "list of tokens from the lexer";
  input String fileName "file name of the source code";
  input Boolean debug;
  output ParseCodeModelica.AstTree ast "AST tree that is returned when the result output is true";
protected
  list<OMCCTypes.Token> tokens1 := tokens;
  Boolean result;
  // the declarations and initialization of the environment env and the
  // parse table pt are omitted here; see Appendix A
algorithm
  while (List.isEmpty(tokens1) == false) loop
    if (debug) then
      print("\nTokens remaining:");
      print(intString(listLength(tokens1)));
    end if;
    (tokens1, env, result, ast) := processToken(tokens1, env, pt);
    if (result == false) then
      break;
    end if;
  end while;
  if (debug) then
    printAny(ast);
  end if;
end parse;


The above code is taken from the model "ParserModelica.mo", which contains the function "parse". The model "Main.mo" makes the second system call, to this function, to start the syntax analysis. The function "parse" realizes the syntax analysis over the list of tokens given as input from the lexer and generates the AST. The function has three input components and one output component; the first input is the list containing the tokens generated by the lexer. In the algorithm section we walk through the tokens, and for each token we check, with the help of the processToken function, that it is formed according to the rules; if the result is correct we print the AST. Only the important parts of the code are shown here; for more details, refer to the Appendix A section.

Function statements

Modelica functions have specific input and output components, with declared data types. We can also declare other ordinary variables, which are neither input nor output components, under the keyword protected. Moreover, we cannot re-assign an input component directly to another value.

Old version

function printBuffer
  input list<Integer> inList;
  output String outList;
  Integer c;
algorithm
  outList := "";
  while (Util.isListEmpty(inList) == false) loop
    c::inList := inList;
    outList := outList + intStringChar(c);
  end while;
end printBuffer;


The above code is taken from the package "Lexer.mo". This function block is used for printing the tokens, but it contains a warning and an error, pointed out below. The function block has one input component, a list of type Integer named inList, and one output component of type String named outList. Then we have an ordinary variable c of type Integer; this declaration gives a warning in the newer version (variables which are not declared as input or output components must be protected).

In the algorithm block we have a while loop which checks whether the list is empty and then assigns from the input component inList to the variable c, which gives an error (a variable which is declared as an input component cannot be assigned to).

New version

The modified code is listed below; the warning is cleared by declaring the variable c of type Integer under the keyword protected, and the error is cleared by copying the input component (the list of type Integer) to an ordinary protected list of type Integer named inList1 (list<Integer> inList1 := inList).

function printBuffer
  input list<Integer> inList;
  output String outList;
protected
  list<Integer> inList1 := inList;
  Integer c;
algorithm
  outList := "";
  while (List.isEmpty(inList1) == false) loop
    c::inList1 := inList1;
    outList := outList + intStringChar(c);
  end while;
end printBuffer;
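As a quick check of the function above: printBuffer({72, 105}) returns the string "Hi", since the built-in function intStringChar converts the integer character codes 72 and 105 to "H" and "i" and appends them to outList.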

The function block changes mentioned above are really important for supporting the newer version of OpenModelica.

Modelica identifiers

There are two kinds of identifiers in Modelica. The first kind, of the form [a-zA-Z_][a-zA-Z0-9_]*, is parsed by all parsers. But Modelica also has a second kind of identifier, Q-IDENT, which represents special characters enclosed in single quotes ('any character within the quotes'). Q-IDENT is mainly used in enumeration statements, mostly in the Modelica electrical component packages. The older version did not support this special type of enumeration statement written within quotes.

Enumeration statements which include Q-IDENT:

class test
  String str;
  type test = enumeration('4' "word", '1' "alpha");
  test v;
equation
  str = v.'4';
  str = v.'1';
end test;

The quoted identifiers above ('4' and '1') are a special type of Modelica identifier (Q-IDENT) which was not parsed by the old OMCCp: the error handler would report the token and suggest replacing it with some other alternative. In the new implementation we wrote the new rule in "lexerModelica.l" by prioritizing the actual identifier rule. Since Q-IDENT is only used on rare occasions, we gave Q-IDENT lower priority, without affecting the actual identifier rule. The new rules are listed below:

letter [a-zA-Z]
wild [ _ ]
digit [0-9]
digits {digit}+
ident (({letter}|{wild})|({letter}|{digit}|{wild}))({letter}|{digit}|{wild})*

In the above listing the ordinary identifier is given higher priority than the Q-IDENT, which appears in single quotes, without affecting the actual identifier rule; we also decided not to write a completely separate rule for Q-IDENT, as that would cause overhead in the parser. With the new rule, OMCCp is able to handle all such statements.

Next we look into the important changes made in the parser.

One of the important changes made during this implementation is that the compiler utility file at the path Compiler->util->rtopts.mo has been removed. The old parser was implemented to handle many operations through this file, which caused an overhead problem in the parser; hence the new parser is modified according to the newly added utility files, which are loaded in SCRIPT.mos.

Old version

SCRIPT.mos

loadFile("../../Compiler/FrontEnd/Absyn.mo");
loadFile("../../Compiler/Util/Error.mo");
loadFile("../../Compiler/Util/ErrorExt.mo");
loadFile("../../Compiler/FrontEnd/Dump.mo");
loadFile("../../Compiler/Util/Print.mo");
loadFile("../../Compiler/Util/RTOpts.mo");
loadFile("../../Compiler/Util/Util.mo");
loadFile("../../Compiler/Util/System.mo");

The SCRIPT.mos file is used to load all the packages required to run OMCCp; it is passed as an argument when OMCCp is started. As we can see in the listing above, the old parser loads RTOpts.mo, which has been removed because of the overhead problem: the main problem with the file was that functions for handling quite different operations, such as string manipulation and list operations, were all written in this one file. The functionality of rtopts.mo has been split into the several files listed below.

New version

SCRIPT.mos

loadFile("Types.mo");
loadFile("../../../Compiler/FrontEnd/Absyn.mo");
loadFile("../../../Compiler/Util/Error.mo");
loadFile("../../../Compiler/Util/ErrorExt.mo");
loadFile("../../../Compiler/FrontEnd/Dump.mo");
loadFile("../../../Compiler/Util/Print.mo");
loadFile("../../../Compiler/Util/Flags.mo");
loadFile("../../../Compiler/Global/Global.mo");
loadFile("../../../Compiler/Util/Pool.mo");
loadFile("../../../Compiler/Util/Debug.mo");
loadFile("../../../Compiler/Util/List.mo");
loadFile("../../../Compiler/Util/Settings.mo");
loadFile("../../../Compiler/Util/Corba.mo");
loadFile("../../../Compiler/Util/Name.mo");
loadFile("../../../Compiler/Util/Scope.mo");
loadFile("../../../Compiler/Util/Util.mo");
loadFile("../../../Compiler/Util/System.mo");

The above models are developed by the OpenModelica developers, so we need not get into the details of their code; what we need to do is simply find which functions have to replace those of the old version. Some examples of these modifications are listed below.

Main.mo

case args as _::_
  equation
    {filename,parser} = RTOpts.args(args);
    "Modelica" = parser;
    false = (0 == stringLength(filename));

The listing above is taken from "Main.mo", where we start the actual translation. The function call RTOpts.args was used to read the strings given as arguments; since this file has been removed, the task is to find an alternative, which is provided by the new utility file "Flags.mo":

case args as _::_
  equation
    {filename,parser} = Flags.new(args);
    "Modelica" = parser;
    false = (0 == stringLength(filename));

The listing above shows the replacement for the removed RTOpts.args. The new utility file "Flags.mo" contains a function named "new" which reads the strings given as arguments and outputs them. The function new from the package Flags.mo is listed below.

Flags.mo

public function new
  "Create a new flags structure and read the given arguments."
  input list<String> inArgs;
  output list<String> outArgs;
algorithm
  _ := loadFlags();
  outArgs := readArgs(inArgs);
end new;

The change mentioned above is one example of how the modifications were made to support the new version. We will not discuss all the changes in this report, as it would be impossible to include the entire set of code changes here.

Performing the entire modification was a very difficult task considering the complexity of the implementation, as the whole of OMCCp contains about 15,000 lines of code, and it is really hard to go through every line of code to find the problems. The good thing about OMCCp is that it contains very good error handling, so based on the previous report we compile the code, obtain the list of errors, and from the obtained error list we understand the code and find possible replacements in the context of the new implementation.
