OMCCp: A MetaModelica Based Parser Generator Applied to Modelica
Institutionen för Datavetenskap
Department of Computer and Information Science

Master's thesis

OMCCp: A MetaModelica Based Parser Generator Applied to Modelica

by

Edgar Alonso Lopez-Rojas

LIU-IDA/LITH-EX-A–11/019–SE

2011-05-31

Linköpings universitet
SE-581 83 Linköping, Sweden


Supervisors: Martin Sjölund and Mohsen Torabzadeh-Tari
Dept. of Computer and Information Science

Examiner: Prof. Peter Fritzson

Upphovsrätt

This document is held available on the Internet - or by its possible replacement - for a considerable time from the date of publication, provided that no extraordinary circumstances arise. Access to the document implies permission for anyone to read, to download, to print out single copies for individual use, and to use it unchanged for non-commercial research and for teaching. Transfer of the copyright at a later date cannot revoke this permission. All other use of the document requires the copyright owner's consent. To guarantee authenticity, security, and accessibility, there are solutions of a technical and administrative nature. The author's moral rights include the right to be named as the author, to the extent required by good practice, when the document is used as described above, as well as protection against the document being altered or presented in such a form or in such a context as is offensive to the author's literary or artistic reputation or individuality. For additional information about Linköping University Electronic Press, see the publisher's home page: http://www.ep.liu.se/

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Edgar Alonso Lopez-Rojas


Abstract

The OpenModelica Compiler-Compiler parser generator (OMCCp) is an LALR(1) parser generator implemented in the MetaModelica language, with parsing tables generated by the tools Flex and GNU Bison. The code generated for the parser is in the MetaModelica 2.0 language, which is the OpenModelica compiler implementation language and an extension of the Modelica 3.2 language. OMCCp uses as input an LALR(1) grammar that specifies the Modelica language. The generated parser can be used inside the OpenModelica Compiler (OMC) as a replacement for the current parser, which is generated by the tool ANTLR from an LL(k) Modelica grammar. This report explains the design and implementation of this novel lexer and parser generator called OMCCp.

Modelica and its extension MetaModelica are both languages used in the OpenModelica environment. Modelica is an object-oriented, equation-based language for modeling and simulation.


Acknowledgements

It is an honor for me to be able to culminate this work with the guidance of remarkable computer scientists. This thesis would not have been possible without the clear vision of my examiner, professor Peter Fritzson. As the director of the Open Source Modelica Consortium (OSMC) he presented this great opportunity to me. Together with him, I have to thank my supervisors Martin Sjölund and Mohsen Torabzadeh-Tari. Martin has made his support and guidance available in more ways than I can count, and Mohsen has always kept track of my progress and helped me with the difficulties I found. I am pleased to be part of, learn from, and contribute to this great open source project called OpenModelica.

I am also grateful to IDA (Department of Computer and Information Science) for offering its facilities and resources for my daily work.

I cannot forget to thank my family: my parents Jesus and Soledad, for supporting me from the beginning in this project to become a Master in Computer Science, and my fiancée Helena, who has encouraged me all along to give my best in every step of this journey. I am delighted to include my future daughter Isabella here, who has been my biggest motivation to complete this work before the day she steps into this world for the first time.

Last, but not least, my financial sponsors from Colombia: Fundación Colfuturo (http://www.colfuturo.org/) and EAFIT University (http://www.eafit.edu.co/). They believed in my talent and provided the financial resources to achieve this goal.


Contents

1 Introduction
  1.1 Background
  1.2 Project Goal
  1.3 Methodology
  1.4 Intended Readers
  1.5 Thesis Outline

2 Theoretical Background
  2.1 Compilers
    2.1.1 Fundamentals
    2.1.2 Lexical Analysis
    2.1.3 Syntax Analysis
    2.1.4 Parser LALR(1)
  2.2 Error Handling in Syntax Analysis
    2.2.1 Error Recovery
    2.2.2 Error Messages
  2.3 The OpenModelica Project
    2.3.1 The Modelica Language
    2.3.2 MetaModelica extension
    2.3.3 Abstract Syntax Tree - AST

3 Existing Technologies
  3.1 OpenModelica Compiler (OMC)
    3.1.1 Architecture and Components
    3.1.2 ANTLR
    3.1.3 Current state
  3.2 Flex
    3.2.1 Input file lexer.l
    3.2.2 Output file lexer.c
  3.3 GNU Bison
    3.3.1 Input file parser.y
    3.3.2 Output file parser.c

4 Implementation
  4.1 Proposed Solution
  4.2 OMCCp Design
    4.2.1 Lexical Analyser
    4.2.2 Syntax Analyser
  4.3 OpenModelica Compiler-Compiler Parser (OMCCp)
    4.3.1 Lexer Generator
    4.3.2 Parser Generator
  4.4 Error handling
    4.4.1 Error recovery
    4.4.2 Error messages
  4.5 Integration OMC

5 Discussion
  5.1 Analysis of Results
    5.1.1 Lexer and Parser
    5.1.2 OMCCp Construction
    5.1.3 Implementation of a subset of Modelica and MetaModelica grammar
  5.2 OpenModelica Compiler
  5.3 Limitations

6 Related Work
  6.1 OpenModelica Development
  6.2 Compiler-Compiler Construction

7 Conclusions
  7.1 Accomplishments
  7.2 Future Work

Bibliography

Appendices

A OMC Compiler Commands
  A.1 Parameters - MetaModelica Parser Generator
    A.1.1 Generate compilerName
    A.1.2 Run compilerName, fileName
  A.2 OMC Commands

B Lexer Generator
  B.1 Lexer.mo
  B.2 LexerGenerator.mo
  B.3 LexerCode.tmo

C Parser Generator
  C.1 Parser.mo
  C.2 ParserGenerator.mo
  C.3 ParseCode.tmo

D Sample Input
  D.1 lexer10.l
  D.2 parser10.y

E Sample Output
  E.1 ParseTable10.mo
  E.2 ParseCode10.mo
  E.3 Token10.mo
  E.4 LexTable10.mo
  E.5 LexerCode10.mo

F Modelica Grammar
  F.1 lexerModelica.l
  F.2 parserModelica.y

G Additional Files
  G.1 SCRIPT.mos
  G.2 Main.mo

Glossary

Acronyms


List of Figures

2.1 Compiler Phases
2.2 Compiler Front-End
2.3 Parser components
2.4 OpenModelica Environment [Fritzson et al., 2009]
3.1 OMC simplified overall structure [Fritzson et al., 2009]
3.2 OMC Language Grammars
4.1 OMCCp (OpenModelica Compiler-Compiler) Lexer and Parser Generator
4.2 OMCCp Lexer and Parser Generator Architecture Design
4.3 OMC-Lexer design
4.4 OMC-Parser design
4.5 OMC-Parser LALR(1)
5.1 OMCCp - Time Parsing


List of Tables

2.1 LR(1) parsing table [Aho et al., 2006]
2.2 LR(1) parsing table rearranged [Aho et al., 2006]
2.3 LALR(1) parsing table [Aho et al., 2006]
5.1 OMCCp Files Implementation
5.2 Test Suite - Compiler
5.3 OMCCp - Time Parsing


Listings

2.1 MetaModelica uniontype
2.2 MetaModelica matchcontinue
2.3 MetaModelica list
3.1 ANTLR grammar file structure
3.2 Flex file structure
3.3 Bison file structure
4.1 Lexer.mo function scan
4.2 Parser.mo function parse
4.3 MultiTypedStack AstStack
4.4 ParseCode.mo case reduce action
4.5 ParseCode.mo function getAST
4.6 Modifications in the Bison Epilogue
4.7 Modifications in the Rules section in Bison
4.8 List of semantic values of tokens
4.9 Constants for error handling
4.10 Custom error messages in OMCCp
4.11 Error messages in OMCCp
4.12 program.mo with errors
4.13 Parser.mo original function
4.14 Parser.mo modified function
A.1 Compile Flex and Bison
A.2 OMCC.mos
A.3 OMCCP Command
A.4 SCRIPT.mos debug mode
A.5 OMCCP debug mode
B.1 Lexer.mo
B.2 LexerGenerator.mo
B.3 LexerCode.tmo
B.4 Types.mo
C.1 Parser.mo
C.2 ParserGenerator.mo
C.3 ParseCode.tmo
D.1 lexer10.l
D.2 parser10.y
E.1 ParseTable10.mo
E.2 ParseCode10.mo
E.3 Token10.mo
E.4 LexTable10.mo
E.5 LexerCode10.mo
F.1 lexerModelica.l
F.2 parserModelica.y
G.1 SCRIPT.mos
G.2 Main.mo


Chapter 1

Introduction

1.1 Background

The OpenModelica project develops a modeling and simulation environment based on the Modelica language [Fritzson, 2004]. The effort is supported by the Open Source Modelica Consortium (OSMC). It uses the OpenModelica Compiler (OMC) [Fritzson et al., 2009] to generate C, C++, or C# code that runs simulations written in the Modelica language. OpenModelica currently uses the tool ANTLR (ANother Tool for Language Recognition) to generate the parser for the OpenModelica Compiler (OMC). The work presented in this master's thesis offers an alternative to the ANTLR parser. We present a novel Compiler-Compiler implemented completely in MetaModelica. MetaModelica is an extension of the Modelica language intended for modeling the semantics of languages. One large example is the modeling of the whole Modelica language, together with its MetaModelica extensions, in the OpenModelica bootstrapped compiler version [Sjölund et al., 2011].

The ANTLR parser generator [Parr and Quong, 1995], which has already been used in the OpenModelica project for several years, has well-known disadvantages, including memory overhead, bad error handling, lack of type checking, and not generating MetaModelica code for building the Abstract Syntax Tree (AST). Since the AST nodes are initially generated in C (for later conversion into MetaModelica) without strong type checking, small errors in the semantic actions in the grammar are not detected at generation time, and can give rise to hard-to-find bugs in the generated C code. When the semantic actions can be specified in MetaModelica and the AST builder generated in MetaModelica, this source of errors can be completely eliminated. Currently, ANTLR-generated parsers connect with OMC through an external C interface. ANTLR is also built as an integrated lexer and parser that hides behind a considerable number of libraries. These libraries handle the complexity of the syntax analysis process in the compiler as a black box. ANTLR is only suitable for parsing LL grammars.

1.2 Project Goal

The goal of this master's thesis is to write a parser generator in the MetaModelica language that can replace the current ANTLR parser and that generates MetaModelica code instead of C code.

The results expected from this thesis are:

• A lexer and a parser for the Modelica grammar, including its MetaModelica extension, that output the Abstract Syntax Tree (AST) for the language processed.

• A lexer and parser generator written in the MetaModelica language.

• Improvements in the error-handling messages compared with ANTLR, specifically the messages concerning error-correction hints for malformed syntax.

1.3 Methodology

The methodology used for the construction of the OpenModelica Compiler-Compiler parser generator (OMCCp) is based on a literature study of compiler construction techniques. There are various projects that offer lexer and parser generators, but there are none for the Modelica language. A literature review is the basis for the initiation of this project on compiler construction.

Different literature from the OSMC is available, which contributes to a better understanding of the OpenModelica project. Besides the literature reviewed, we draw on the experience of the supervisor Martin Sjölund, who built the first bootstrapping compiler for the Modelica language [Sjölund et al., 2011]. Various papers and books by the examiner are available. The examiner has a clear vision of the next steps in the development of the compiler due to his involvement in the project since it started several years ago.

There are exercises available for learning MetaModelica, including online courses. The exercises are important for familiarisation with the MetaModelica language. A guide to MetaModelica is also provided that addresses the most common built-in functions and the limitations of the language.

After the literature review, existing technologies that can support the project are addressed. A review of the techniques they use and their benefits is performed. This leads to the architectural decisions for the implementation of the parser and lexer generator.

Finally, the implementation of a subset of the Modelica grammar for the parser generator is addressed. This finalises the project and proves the validity of the proposed solution.

1.4 Intended Readers

The reader of this document is someone who wants to understand more about compiler construction, and more specifically the syntax-analysis phase of the OpenModelica Compiler. This document contains important information for OpenModelica developers who want to work on the OMC compiler design and construction.

1.5 Thesis Outline

This thesis gives an overview of the OpenModelica project and the architecture of the OpenModelica compiler. Chapter 2, Theoretical Background, familiarises the reader with the topic of compiler construction, more specifically lexical analysis, syntax analysis, and basic concepts about grammars.

Chapter 3 covers existing technologies as a basis for understanding the implementation. Finally, Chapter 5, Discussion, and Chapter 6, Related Work, explain different parts of the project, analysing the results of the implementation. The conclusions review the achievement of the goals and analyse the implemented solution. The future work section provides the reader who intends to continue this work with more information about desired extensions and improvements to this project.

The appendices contain the source code of the entire project, including the sample files generated from exercise 10 of the MetaModelica exercises available in [Fritzson and Pop, 2011a,b]. A large subset of the Modelica 3.2 grammar [Modelica-Association, 2010] is also included; it was used to prove the usability of the parser generator.


Chapter 2

Theoretical Background

“The world as we know it depends on programming languages”

Aho et al. [2006]

We required strong knowledge of compiler construction theory to implement the solution for this thesis. For a better understanding of this project, the reader should be familiar with some of the fundamental terms and basic algorithms used for the construction of the lexical analyser and the parser during this project.

This chapter covers the main topics of compiler construction that are used in the implementation of the solution presented in Chapter 4. The next part of this chapter addresses an important topic for this project, namely the improvement of error handling during the compiler's parsing phase. The last part presents an overview of the current OpenModelica project, including the Modelica and MetaModelica languages and the OMC.

2.1 Compilers

Aho et al. [2006] is a mandatory book for anyone who intends to understand the concepts of compilers. Most of the compiler theory covered in this part is based on this book. Other sources, such as Kakde [2002] and Terry [2000], have also been reviewed and are addressed in the different subsections. This section intends to give the reader an introduction to the compiler terms and techniques used during the design and development of this project.

2.1.1 Fundamentals

Programming languages rely strongly on compilers for their evolution and widespread use. These languages exist due to the limitations developers face when building complex systems in machine language, which only recognises sequences of binary instructions. In a more general view, a compiler is a software tool that serves as a translator from one language into another.

Figure 2.1: Compiler Phases

If we see a compiler as a process, we can identify the source language as the input and the target language as the output of this process. For example, in a language such as C, the input language is the C code and the output language is machine code for a specific architecture and operating system.


There are several types of compilers in use today, and the classification depends on the purpose of the compiler. We distinguish between native compilers, cross-compilers, interpreters, and source-to-source compilers (translators).

Native compilers are used for the generation of machine-specific (binary) code. Cross-compilers generate machine-specific code too, but they generate the code for a different machine than the one they run on.

Interpreters for languages are similar to the Java virtual machine (JVM). They receive two parameters as input: the source program and the input for the program. The interpreter simulates the result of the compiled source program executed directly as machine-language code. It outputs the expected result of the source program applied to the input used as a parameter.

In this report we address source-to-source compilers, which are commonly used for translating one high-level language, such as Modelica, into another high-level language, such as C. This technique is common due to the difficulty of generating low-level code, such as assembler or binary code, directly.

The complexity of a compiler is shown in Figure 2.1. Inside a compiler there are two main parts: the Analysis (Front-End) and the Synthesis (Back-End). The Analysis phase is handled by the Front-End of the compiler, which is divided into three steps: Lexical Analysis, Syntax Analysis, and Semantic Analysis, as presented in Figure 2.2.

The Lexical Analysis task is performed by a component called the Lexer. The main function of the Lexer is to take the source code as input and recognise different sequences of characters as units called tokens.

The Front-End is the part of the compiler that we focus on in this implementation. During the analysis phase, the source code is processed by the Lexical Analyser, the Syntax Analyser, and the Semantic Analyser to output an intermediate representation of the input code called the Abstract Syntax Tree (AST).

Figure 2.2: Compiler Front-End

2.1.2 Lexical Analysis

The Lexical Analysis phase, also called scanning, receives the source code as a character stream. It identifies the tokens specified by the language, making the next phase of the compiler simpler. A programming language's tokens are often specified by the use of regular expressions.

A Lexer is a program that runs a finite automaton which recognises a valid input based on a regular language. As mentioned above, regular languages are described by regular expressions.

The Lexical Analysis is the first part of the compiler. It reduces the complexity of recognising a complete grammar by giving a simple transformation of the source code into a list of tokens. In the next step of the compiler, the syntax analysis uses only the tokens to accept or reject the source code provided.

For a better understanding of how the Lexical Analysis works, we introduce the basic concepts of finite automata and regular languages in the next section. We then present a description of what a Lexer specifically does.

Finite Automata and Regular Languages

Sipser [2005] presents the use of Finite Automata, also known as Finite State Machines, to recognise the regular languages. He defines a Finite Automaton as a collection of states (Q), an alphabet (Σ), a transition function (δ : Q × Σ → Q), a start state (q0), and a set of accept states.

To briefly describe how a Finite Automaton works, state diagrams are broadly used. There are two types of Finite Automata: the Deterministic Finite Automaton (DFA) and the Non-Deterministic Finite Automaton (NFA).
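As an illustrative sketch (not taken from the thesis), the definition above (states Q, alphabet Σ, transition function δ, start state q0, and accept states F) maps directly onto Python data structures; the toy DFA below accepts identifiers of the form letter (letter | digit)*:

```python
# A DFA written out as the collection (Q, Sigma, delta, q0, F).
# This example accepts identifiers: a letter followed by letters or digits.

def char_class(c):
    """Map a concrete character to an alphabet symbol."""
    if c.isalpha() or c == "_":
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

Q = {"start", "ident", "dead"}            # states
DELTA = {                                 # transition function delta: Q x Sigma -> Q
    ("start", "letter"): "ident",
    ("start", "digit"): "dead",
    ("start", "other"): "dead",
    ("ident", "letter"): "ident",
    ("ident", "digit"): "ident",
    ("ident", "other"): "dead",
    ("dead", "letter"): "dead",
    ("dead", "digit"): "dead",
    ("dead", "other"): "dead",
}
Q0 = "start"                              # start state
F = {"ident"}                             # accept states

def run_dfa(text):
    """Run the DFA over the input and report acceptance."""
    state = Q0
    for c in text:
        state = DELTA[(state, char_class(c))]
    return state in F

print(run_dfa("x42"))   # True: a letter followed by letters/digits
print(run_dfa("42x"))   # False: starts with a digit
```

Because δ is total (every state/symbol pair has exactly one entry), the machine never has to choose between paths, which is exactly the determinism property discussed next.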

The Lexer

For the construction of the Lexer, a DFA is preferred. However, an NFA can also be converted into a DFA, and a lexer can also simulate the non-deterministic behaviour of an NFA. The main reason for using a DFA is that we want a transition function δ that allows the Lexer to decide on exactly one path for a specific character in the input stream from the source code.

All the regular expressions for the set of tokens are combined during the construction of a Lexer. It often happens that one sequence of characters can be recognised as two or more different tokens. Therefore, the lexer must have extra rules that prioritise longer matches over shorter ones. The rules can also be ordered in an accepting sequence to avoid ambiguity.

The Lexical Analysis phase filters out tokens that are used only by the programmer, such as comments and the different kinds of spacing and indentation in the code. This simplifies the later phases by converting all the characters into a list of tokens. If the Parser had to deal with this task, the number of terminal tokens would increase, making the rules more complex and decreasing the overall performance of the compiler.

The compiler gains performance when the Lexical Analysis is kept separate from the Syntax Analysis. This performance can be achieved by applying specialised techniques to the handling of the character stream, such as buffering to read a certain number of characters at a time.

A structure called the Symbol Table is used to store all the identifiers with their names or values; this structure avoids duplication and improves efficiency through all the phases of the compiler, as represented in Figure 2.2. A token is usually represented by a tuple, consisting of an identifier of the token and a reference to the Symbol Table, e.g. TOKEN<IDENT, x>, where IDENT is the identifier of the token and x is the value found by the Lexer for the identifier.
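A minimal scanner sketch in Python can make these ideas concrete (illustrative only; the token names and patterns here are invented for this example, not OMCCp's): tokens come out as tuples, whitespace and comments are filtered, and identifiers are stored once in a symbol table.

```python
import re

# One regular expression per token kind. Each pattern is greedy within
# itself (so \d+ takes the longest run of digits), and listing order
# resolves conflicts between patterns, as a lexer's rule ordering does.
TOKEN_SPEC = [
    ("NUMBER",  r"\d+"),
    ("IDENT",   r"[A-Za-z_]\w*"),
    ("ASSIGN",  r"="),
    ("PLUS",    r"\+"),
    ("SKIP",    r"[ \t\n]+"),      # whitespace: filtered out
    ("COMMENT", r"//[^\n]*"),      # comments: filtered out
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (tokens, symbol_table); each identifier is stored once."""
    tokens, symtab = [], {}
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind in ("SKIP", "COMMENT"):
            continue                                  # never reaches the parser
        if kind == "IDENT":
            symtab.setdefault(lexeme, len(symtab))    # avoid duplication
        tokens.append((kind, lexeme))                 # TOKEN<kind, lexeme>
    return tokens, symtab

toks, table = tokenize("x = x + 42  // initialise x")
print(toks)   # [('IDENT', 'x'), ('ASSIGN', '='), ('IDENT', 'x'), ('PLUS', '+'), ('NUMBER', '42')]
print(table)  # {'x': 0}
```

Note how the parser would only ever see the five remaining tokens; the comment and the spacing have already been filtered away, which is the division of labour argued for above.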

Flex: a Lexer Generator

There are programs that automate the labour of constructing the transition rules that identify the tokens for a Lexer. Flex, which is based on the lexer generator Lex, is one example of such a lexer generator. It takes as input a file with the definitions of the rules for the recognition of the tokens. These rules are defined using regular expressions.

Flex also allows the developer to specify the token returned when each pattern matches. Some tokens, such as white spaces and line feeds, are ignored, as explained above.

In the next chapter we cover the existing technologies used for this work and explain Flex in technical detail.

2.1.3 Syntax Analysis

The Syntax Analysis phase is performed by a program called the Parser. The Parser requires a more powerful formalism than regular expressions to specify the programming language's constructions. The rules are commonly expressed using Context-Free Grammars (CFG), which can be recognised by the use of a Push-Down Automaton.

Push-Down Automata and Context-Free Languages

Sipser [2005] defines a Context-Free Grammar (CFG) as a 4-tuple (V, Σ, R, S), where V is the set of variables, Σ is the set of terminals, R is the set of rules, and S is the start variable.

A Push-Down Automaton (PDA) starts by reading an input list of tokens. The PDA uses the tokens and a stack to decide the next state and action; this action can be to reduce from the stack or to push a state or token onto the stack. It keeps running until it finds an accept state, and then ends. Several other situations can occur, including an infinite loop of the machine; this is why the grammar should be constructed in such a way that these problems are avoided.

Similar to the DFA, there are PDAs that are deterministic, and those are the ones we consider for building the Parser.
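To make the 4-tuple (V, Σ, R, S) concrete, here is a small sketch (ours, not from the thesis): the rules R of a toy grammar are encoded as a Python dict, and membership of a token string is tested by exhaustive leftmost derivation, an approach that is adequate only for very small grammars but mirrors the definition directly.

```python
# Rules R of a toy CFG with variables V = {S, C}, terminals {c, d},
# and start variable S.
RULES = {
    "S": ["CC"],
    "C": ["cC", "d"],
}

def derives(sentential, target):
    """Try every leftmost derivation from `sentential` to `target` (brute force)."""
    if len(sentential) > len(target):          # every rule here is non-shrinking
        return False
    if sentential == target:
        return True
    for i, sym in enumerate(sentential):
        if sym in RULES:                       # expand the leftmost variable
            return any(
                derives(sentential[:i] + body + sentential[i + 1:], target)
                for body in RULES[sym]
            )
    return False                               # all terminals, but not the target

print(derives("S", "cdd"))  # True: S => CC => cCC => cdC => cdd
print(derives("S", "cc"))   # False: no derivation yields only c's
```

A real parser never searches blindly like this; the predictive tables discussed below exist precisely so the machine can pick the right rule from the lookahead token.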

The Parser

The Parser is in charge of determining whether the source code that has been tokenised by the Lexer is constructed according to the rules of the grammar. To do this, it executes a PDA that outputs "accept" if the input belongs to a valid construction of the grammar; otherwise it outputs an error message identifying the token that does not fit the construction rules.

The work done by the Lexer in the first phase of the compiler allows the Parser to ignore tokens such as white spaces and line feeds, and to treat all the tokens as terminals of the grammar that describes the language. This simplifies the rules of the Parser, making the syntax-analysis process more efficient and faster.

The Parser validates the rules of the grammar against the list of tokens received from the Lexer. At the same time, a Parser can perform an additional task: the construction of a structure called the Abstract Syntax Tree (AST). The AST is a tree-shaped representation of all the source code. It is the input for the compiler Back-End, which uses the AST for optimisation and the generation of machine-specific code.

A Parser is composed of a predictive table, a stack of states, a list of tokens as input, and a parsing algorithm that runs over the list of tokens. Figure 2.3 shows these components and their interactions.

A Parser uses the predictive tables, also called parsing tables, to determine the next action and the new state of the machine. The next state is queried from the parsing tables depending on the lookahead token and the current state on top of the stack.

Parsers are commonly classified by the algorithm used for performing the parsing operation. There are three known types: Top-Down Parsers, Bottom-Up Parsers, and Universal Parsers. However, for programming languages only the first two are used, due to the inefficiency of the Universal Parser.

A Top-Down Parser builds the parse tree from the top to the bottom; a Bottom-Up Parser works in the opposite direction. Top-Down Parsers only work for grammars called Left-to-right, Leftmost-derivation (LL) grammars. An LL(k) Parser is a top-down parser with k lookahead tokens. LL(k) Parsers use a predictive table to decide the next state.
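How an LL(1) parser consults its predictive table can be sketched as follows (an illustrative sketch with a toy grammar, not code from the thesis); the table M maps a (variable, lookahead) pair to the production to expand:

```python
# Predictive (LL(1)) parsing table for the toy grammar
#   S -> C C      C -> c C | d
# M[variable, lookahead] gives the right-hand side to expand.
TABLE = {
    ("S", "c"): ["C", "C"], ("S", "d"): ["C", "C"],
    ("C", "c"): ["c", "C"], ("C", "d"): ["d"],
}
VARIABLES = {"S", "C"}

def ll1_parse(tokens):
    """Expand the start symbol top-down, guided by one lookahead token."""
    stack = ["$", "S"]                 # bottom marker, then the start variable
    tokens = list(tokens) + ["$"]
    pos = 0
    while stack:
        top = stack.pop()
        if top in VARIABLES:
            rhs = TABLE.get((top, tokens[pos]))
            if rhs is None:
                return False           # no table entry: syntax error
            stack.extend(reversed(rhs))
        elif top == tokens[pos]:
            pos += 1                   # terminal matches lookahead: consume it
        else:
            return False               # terminal mismatch
    return pos == len(tokens)          # all input consumed, including '$'

print(ll1_parse("cdd"))  # True
print(ll1_parse("cc"))   # False
```

The single lookahead (k = 1) is enough here because the two productions for C start with distinct terminals; grammars where that is not the case need a larger k or a different parsing strategy.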

Figure 2.3: Parser components

Bottom-Up Parsers work for grammars called Left-to-right, Rightmost-derivation (LR) grammars. Knuth [1965] first introduced the concept of LR parsing. The most common parsers are LR(k), the Simple LR parser (SLR), and the Look-Ahead LR parser (LALR). The LALR(1) Parser uses a simplification of the parsing tables used by the LR(1) parser.

In general, a Bottom-Up Parser builds the AST by performing two types of operation: Shift and Reduce.

Shift stores the variable or terminal symbol found while the machine goes through the list of tokens. The parser uses a table called the Action Table, which contains, for all terminals and rules, the next action to take, and a table called the GOTO Table for calculating the next state. When the result is calculated, it pushes these values back onto the state stack.

Reduce pops a certain number of values from the stack and then pushes a new value, also using the GOTO Table. While reducing, a LALR parser can build up the AST and push the new value onto another stack, called the Semantic Stack, which follows the same shift and reduce steps performed by the algorithm.

Blasband [2001] made an effort in parsing grammars that do not perfectly fit into the classification of LL and LALR grammars.

In this report we only briefly look at Top-Down Parsers. We are more interested in the LALR(1) Bottom-Up Parser, which is the type of parser used in this implementation. The LALR(1) parser is explained in more detail in the next section.

2.1.4 Parser LALR(1)

LALR(k) parsers were first introduced by DeRemer [1969]. They are the parsers most commonly used for programming languages, due to the speed and size of their parsing tables and their advantages over their predecessors, the LR(0) and SLR parsers.

Kakde [2002] and Aho et al. [2006] explain very well how the bottom-up algorithm works. We are interested here in understanding the basic principles of the LALR(1) algorithm.

Parsing tables

There are two tables in an LALR parser: the first one is the ACTION table, the second is the GOTO table. The theoretical construction of these tables can be found in almost any compiler literature, such as Aho et al. [2006], Kakde [2002] and Terry [2000].

There are two methods for constructing the LALR(1) parsing tables from the LR(1) parsing tables. The first one, easy but space consuming, is presented here. The other method differs from the former by checking, at every step of the LR(1) construction, for common rules that can be merged, significantly reducing the number of states in the LR(1) parsing table.

We explain the construction of the LALR(1) parsing tables and the content of the LR(1) tables through an example. Let us take this sample grammar from Aho et al. [2006].

Simple Grammar Sample

S’ → S

S → CC

C → cC|d

From this grammar the LR(1) parsing table 2.1 is constructed, according to the algorithm presented in Aho et al. [2006], from the canonical LR(1) collection. The symbol r in the table identifies a REDUCE operation and the symbol s identifies a SHIFT operation. The keyword acc identifies the accepting state.

Table 2.1: LR(1) parsing table [Aho et al., 2006]

              ACTION            GOTO
    state    c     d     $     S    C
    0        s3    s4          1    2
    1                    acc
    2        s6    s7               5
    3        s3    s4               8
    4        r3    r3
    5                    r1
    6        s6    s7               9
    7                    r3
    8        r2    r2
    9                    r2

From table 2.1 we can observe that about half of the entries in the table are blank. The LR(1) parsing tables have the disadvantage of growing considerably large, even for small grammars, due to the redundancy of productions for similar states with different lookahead symbols.

If we rearrange the rows as presented in parsing table 2.2, we can notice similarities between the productions for different lookaheads (states 3 and 6, 4 and 7, and 8 and 9): there are states that share the same core productions but have different lookahead symbols.

Table 2.2: LR(1) parsing table rearranged [Aho et al., 2006]

              ACTION            GOTO
    state    c     d     $     S    C
    0        s3    s4          1    2
    1                    acc
    2        s6    s7               5
    3        s3    s4               8
    6        s6    s7               9
    4        r3    r3
    7                    r3
    5                    r1
    8        r2    r2
    9                    r2

The LALR(1) parsing table 2.3 is constructed based on the one above, first by identifying the common core of each set and then replacing the sets with their union. For a better understanding of this construction the reader can consult the literature [Aho et al., 2006, Section 4.7.4].

Table 2.3: LALR(1) parsing table [Aho et al., 2006]

              ACTION            GOTO
    state    c     d     $     S    C
    0       s36   s47          1    2
    1                    acc
    2       s36   s47               5
    36      s36   s47               89
    47       r3    r3    r3
    5                    r1
    89       r2    r2    r2

LALR(1) Algorithm

Both the LR(1) and the LALR(1) parsers perform the same algorithm; the only difference is that the parsing tables used by LALR(1) contain merged states that will be shifted or reduced on the stack.

The parsing algorithm starts by finding the right action in the ACTION table for a given terminal symbol a and a current state i, denoted ACTION[i, a]. This value can be either a REDUCE (r), SHIFT (s), ACCEPT (acc) or error (blank) action.

The GOTO table is used to find the next state I_j for the current state I_i and a non-terminal A, denoted GOTO[I_i, A] = I_j.

A REDUCE action takes a certain number of symbols from the parsing stack, applies a transformation, and puts the result and the next state back onto the stack.

When an error is detected (a blank entry in the parse table), several correcting actions can be performed. This topic is covered in more detail in Section 2.2.
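To make the algorithm concrete, the following Python sketch (the dictionary encoding and function names are our own illustration, not generated parser code) runs the driver loop with the merged tables of Table 2.3 on the sample grammar:

```python
# LALR(1) driver for the grammar S' -> S, S -> C C, C -> c C | d,
# using the merged states of Table 2.3 ("36", "47", "89").
ACTION = {
    (0, 'c'): ('s', 36), (0, 'd'): ('s', 47),
    (1, '$'): ('acc', None),
    (2, 'c'): ('s', 36), (2, 'd'): ('s', 47),
    (36, 'c'): ('s', 36), (36, 'd'): ('s', 47),
    (47, 'c'): ('r', 3), (47, 'd'): ('r', 3), (47, '$'): ('r', 3),
    (5, '$'): ('r', 1),
    (89, 'c'): ('r', 2), (89, 'd'): ('r', 2), (89, '$'): ('r', 2),
}
GOTO = {(0, 'S'): 1, (0, 'C'): 2, (2, 'C'): 5, (36, 'C'): 89}
# rule number -> (left-hand side, number of right-hand-side symbols)
RULES = {1: ('S', 2), 2: ('C', 2), 3: ('C', 1)}

def parse(tokens):
    stack = [0]                      # state stack, starts in state 0
    pos = 0
    while True:
        state, a = stack[-1], tokens[pos]
        kind, arg = ACTION.get((state, a), ('err', None))
        if kind == 's':              # shift: push next state, consume token
            stack.append(arg)
            pos += 1
        elif kind == 'r':            # reduce: pop |rhs| states, then GOTO
            lhs, rhs_len = RULES[arg]
            del stack[-rhs_len:]
            stack.append(GOTO[(stack[-1], lhs)])
        elif kind == 'acc':
            return True
        else:
            return False             # blank entry: syntax error

print(parse(list('ccdd') + ['$']))   # True: c c d d is a valid sentence
```

A reduce pops as many states as the rule has right-hand-side symbols and then consults GOTO with the uncovered state, exactly as described above; a real parser would additionally run the semantic action and push its result on the semantic stack.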

2.2 Error Handling in Syntax Analysis

The error handling techniques in the Front-End are more relevant during the Syntax Analysis phase and the Semantic Analysis phase than in the Lexical Analysis phase.

Only a few errors can be detected by the Lexical Analysis, such as non-terminated comments, invalid characters or unrecognised tokens. One possible error-recovery strategy implemented in a lexer is to ignore invalid characters from the input and keep processing.

Error handling techniques can be divided into two topics: error recovery techniques and error message display. Error recovery techniques are concerned with how the parser can keep parsing after an error token is found. Error message display is related to how to present useful hints to the developer in order to correct the source code.

In this section we present these two topics for error handling during the Syntax Analysis phase.

2.2.1 Error Recovery

For LALR parsers, several error recovery techniques have been developed, as in [Burke and Fisher Jr, 1982, Bilos, 1983, Burke and Fisher, 1987, McKenzie et al., 1995, Degano and Priami, 1998, Corchuelo et al., 2002] and in more recent research such as [Kats et al., 2009, de Jonge et al., 2010].

Error recovery techniques try to improve the quality of the parser through different strategies, such as primary recovery or secondary recovery.

The first condition to start the recovery is to access the configuration

obtained when the token preceding the error token was shifted onto the

stack. Techniques for deferring the reduce actions after a shift have been

developed in Burke and Fisher Jr [1982].

Primary techniques are related to single-token modifications of the list of tokens. A single modification is only possible when the error is classified as simple. This modification can be an insertion, deletion, substitution or merging.

Every attempt to perform a repair is known as a trial. A common technique for searching the trials is to attempt to repair the error token by performing one of these operations, in order: merging, insertion, substitution, scope recovery and finally deletion.

In the case of insertion or substitution, a set of possible candidates is generated, and from it a single candidate or none is selected.

If no single-token repair succeeds, the list of tokens needs to be reduced. This can be done by discarding tokens that precede, follow or surround the error token. This is known as secondary recovery.
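As an illustration only (the brute-force strategy and names below are our own, not taken from the cited papers), a primary recovery can be sketched as trying single-token edits at the error position and keeping the first trial that lets the parser advance far enough:

```python
# Sketch of primary (single-token) error recovery: try deleting,
# inserting or substituting a token at the error position and accept
# the first trial that lets the parser advance past the error point.
CHECK_DISTANCE = 3   # tokens a trial must survive beyond the error point

def recover(tokens, error_pos, candidates, parse_prefix):
    """parse_prefix(tokens) returns how many tokens parse before an error."""
    trials = [('delete', tokens[:error_pos] + tokens[error_pos + 1:])]
    for t in candidates:
        trials.append(('insert ' + t,
                       tokens[:error_pos] + [t] + tokens[error_pos:]))
        trials.append(('substitute ' + t,
                       tokens[:error_pos] + [t] + tokens[error_pos + 1:]))
    for repair, trial in trials:
        consumed = parse_prefix(trial)
        if consumed >= min(len(trial), error_pos + CHECK_DISTANCE):
            return repair, trial     # first successful trial wins
    return None, tokens              # give up: secondary recovery needed

# A toy "parser" accepting only the alternating pattern a b a b ...
def parse_prefix(toks):
    for i, t in enumerate(toks):
        if t != ('a' if i % 2 == 0 else 'b'):
            return i
    return len(toks)

print(recover(['a', 'b', 'x', 'b'], 2, ['a', 'b'], parse_prefix))
# ('substitute a', ['a', 'b', 'a', 'b'])
```

Real recovery schemes such as Burke-Fisher additionally defer reductions and rank competing candidates; the check distance here plays the role of their trial validation window.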

2.2.2 Error Messages

In simple recovery the error messages are classified into 5 different types: merging, misspelling, insertion, deletion and substitution.

In secondary recovery the error messages are classified into 2 types. Type 1 error messages are displayed when the discarded tokens are on a single line. Type 2 errors are displayed when multiple lines need to be discarded.

In addition there are 3 other types. The first refers to different candidates for a recovery. The second type is displayed when the end of file is reached but not expected. The third is used when all error recovery routines fail, in which case the parser displays a generic unrecoverable syntax error message.

2.3 The OpenModelica Project

OpenModelica is an open source project led by the Open Source Modelica Consortium (OSMC). At the time of writing this report, OpenModelica is at version 1.7.0, launched in April 2011.

OpenModelica contains different tools that contribute to the design and construction of simulation projects. These tools are classified into compiler tools, graphic interface tools and an Eclipse-based environment.

The OpenModelica environment consists of several tools such as OMEditor, UML-Modelica, OMShell, OMNotebook, DrControl under OMNotebook and the Modelica Development Tooling (MDT). There are some other resources, such as documentation, OMDev (tools for building the compiler) and auxiliary tools for the OpenModelica developer, that have been used during the development of this project. Figure 2.4 shows the architecture of the OpenModelica environment.

2 OpenModelica: http://www.openmodelica.org


Figure 2.4: OpenModelica Environment [Fritzson et al., 2009] (components: Graphical Model Editor/Browser, Textual Model Editor, Eclipse Plugin Editor/Browser, DrModelica NoteBook Model Editor, OMOptim Optimization Subsystem, Interactive session handler, Modelica Compiler, Modelica Debugger, Execution)

2.3.1 The Modelica Language

The design of the Modelica language started in the fall of 1996. The first report on the language was made available on the web in September 1997. The first publication on Modelica, by Elmqvist [1997], was made at the Symposium on Computer-Aided Control System Design. The language has been developed ever since as a language for multi-domain modelling and simulation, with several research contributors, e.g. Fritzson and Bunus [2002], Pop and Fritzson [2005, 2006], Akesson et al. [2008, 2010], Sjölund [2009], Sjölund et al. [2011] and Lundvall et al. [2009]. Modelica is an equation-based and object-oriented language designed with the aim of defining a de facto standard for simulation.

There have been recent efforts in writing a new Modelica compiler. The compiler and other parts of the OpenModelica project are described in Fritzson et al. [2009].

2.3.2 MetaModelica extension

The main source of information about the MetaModelica language is the draft document “MetaModelica Users Guide” written by Fritzson and Pop [2011a]. This document has recently been improved by Fritzson and Pop [2011b] towards the implementation of the specification of a new version of the Modelica language.

MetaModelica was created in the OpenModelica project with the intention of modelling the semantics of the Modelica language. MetaModelica is thus the starting point for the construction of a Modelica compiler. The MetaModelica language is part of the project to create a bootstrapped compiler, written in MetaModelica, for the MetaModelica and Modelica languages.

MetaModelica adds new operators and types to the Modelica language. In this report we cover the constructs uniontype, record, matchcontinue and list.

uniontype

The uniontype is a construct that allows MetaModelica to declare types based on the union of two or more record types. It can be recursive and include other uniontypes. An example of a uniontype is presented in Listing 2.1.

Listing 2.1: MetaModelica uniontype

uniontype Exp
  record INT
    Integer integer;
  end INT;
  record IDENT
    String ident;
  end IDENT;
end Exp;
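For readers more familiar with mainstream languages, a rough Python analogue of this uniontype (our own illustration, not part of OpenModelica) is an algebraic data type modelled as one base class with one dataclass per record constructor:

```python
# Rough Python analogue of the Exp uniontype: the base class plays the
# role of the uniontype and each dataclass plays the role of a record.
from dataclasses import dataclass

class Exp:
    """Base type corresponding to the uniontype Exp."""

@dataclass
class INT(Exp):
    integer: int

@dataclass
class IDENT(Exp):
    ident: str

e = INT(42)
assert isinstance(e, Exp)   # every record value is also an Exp
```

Recursion works the same way: a record field may itself be typed `Exp`, which is what makes the construct suitable for declaring ASTs.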

matchcontinue

The matchcontinue instruction resembles the switch instruction in C, with some additions. Unlike the switch instruction, matchcontinue can return a value. It can contain more than one conditional, and it can also return more than one value. A section for the definition of local variables comes right after the matchcontinue declaration. The wild card ‘_’ (underscore) can be used to match all cases; additionally, an else case can be used instead of the wild card.

The matchcontinue instruction contains case blocks similar to those of the common switch instruction in C. Each case can contain an equation block. The program flow tries to execute the instructions of a specific equation block one after the other. If any instruction cannot be executed or fails, the next case is tried, and so on until one case block reaches its end; the return value is then assigned to the corresponding variables. If no case block reaches its end, no value is assigned to the variables. An example of the syntax for matchcontinue is presented in Listing 2.2.

Listing 2.2: MetaModelica matchcontinue

(token, env2) := matchcontinue (act)
    local
      Types.Token tok;
    case (1)
      equation
        tok = Types.TOKEN(tokName[act], act, buffer, info);
      then (SOME(tok), env2);
    case (_)
      then (NONE(), env2);
    else
      then (NONE(), env2);
  end matchcontinue;
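The try-the-next-case-on-failure semantics has no direct C counterpart. A rough Python analogue (our own illustration, not how the compiler implements it) is a loop over case bodies where any failure moves on to the next case:

```python
# Rough analogue of matchcontinue semantics: case bodies are tried in
# order, and any failure inside a body makes the flow continue with
# the next case instead of aborting.
def matchcontinue(value, cases):
    for pattern, body in cases:
        try:
            # None plays the role of the '_' wild card / else branch
            if pattern is not None and pattern != value:
                raise ValueError('pattern mismatch')
            return body(value)      # first case that completes wins
        except Exception:
            continue                # this case failed: try the next one
    raise ValueError('no case matched')

# The first case matches 1 and succeeds; with 0 its body fails
# (division by zero), so the wild-card case is used instead.
print(matchcontinue(1, [(1, lambda v: 10 // v), (None, lambda v: -1)]))  # 10
print(matchcontinue(0, [(0, lambda v: 10 // v), (None, lambda v: -1)]))  # -1
```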

list

It is used to create linked lists. The braces are used to define the list elements, and the operator ‘::’ is used to add items to or retrieve items from a list.

To illustrate how a list works in MetaModelica, we have the sample code in Listing 2.3. In this code, the instruction on line 1 creates a list called ‘a’. Line 2 retrieves the top element ‘1’ from the list ‘a’ and saves it in the variable ‘i’, saving the rest of the list back in the variable ‘a’. Finally, line 3 is the inverse operation of line 2 and adds the item ‘i’ back into the list ‘a’.

Listing 2.3: MetaModelica list

1 list<Integer> a = {1, 2, 3};
2 i :: a = a;
3 a = i :: a;
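The same three operations can be sketched in Python, with nested tuples playing the role of cons cells (our own illustration):

```python
# MetaModelica-style list operations using nested tuples as cons cells:
# the list {1, 2, 3} becomes (1, (2, (3, None))).
def cons(head, tail):
    return (head, tail)

a = cons(1, cons(2, cons(3, None)))  # list<Integer> a = {1, 2, 3};
i, a = a                             # i :: a = a;   retrieves the head, i = 1
a = cons(i, a)                       # a = i :: a;   puts the head back in front
```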


2.3.3 Abstract Syntax Tree - AST

The AST (Abstract Syntax Tree) is a structure that abstracts away part of the details present in the source code and represents unambiguously the constructs of the programming language. The declaration of the AST (the semantic constructs of the language) can be expressed in the MetaModelica language thanks to the construct ‘uniontype’ explained in the last section.

The construction of this tree is based on primitive operations such as integer operations. These constructs represent the semantic constructs of the Modelica and MetaModelica languages. The file Absyn.mo contains the specification of the constructs.


Chapter 3

Existing Technologies

Several technologies were the base for the construction of this project. In this chapter we present an introduction to some of the features that the OpenModelica Compiler (OMC) has built in, and to the ANTLR parser generator currently used in OMC.

The Fast Lexical Analyzer Generator (FLEX) and GNU Bison are the technologies on which this project was based for the construction of OMCCp. Aaby [2003] is a good reference for those who want to learn more about compiler construction with Flex and Bison. We use his book and the GNU Bison manual (latest version today 2.4.3, by Donnelly and Stallman [2010]) to explain some technological aspects of FLEX and GNU Bison.

This chapter is not intended to be a guide to these technologies, but it gives the reader the concepts required to understand the rest of this thesis.

3.1 OpenModelica Compiler (OMC)

The OpenModelica Compiler (OMC) is the core tool of the OpenModelica project. It has been developed since the beginning of the Modelica language, and is featured in Fritzson et al. [2009].

3.1.1 Architecture and Components

The architecture of OMC is presented in Figure 3.1. The main components of this diagram represent some of the phases of a compiler, where the lexical and syntax analysis is represented by the starting process “Parser”. This parser does both the lexical and the syntax analysis, due to the design of ANTLR as an integrated lexer and parser generator.

Figure 3.1: OMC simplified overall structure [Fritzson et al., 2009] (modules: Parse, Absyn, SCode/explode, Lookup, Inst, Ceval, Static, DAE, DAELow, SimCode, C code, Dump, Flat Modelica)

The OMC is used to compile both the MetaModelica grammar and the Modelica grammars from version 1 to 3 (see Figure 3.2). Its source code is available for download from the Subversion repository.

Figure 3.2: OMC Language Grammars (OMC Compiler: MetaModelicaParser, Modelica1Parser, Modelica2Parser, Modelica3Parser)

3.1.2 ANTLR

Another Tool for Language Recognition (ANTLR) is a parser generator tool

that integrates the lexical analysis and the syntax analysis in one single tool.


It generates parsers for LL(k) grammars.

ANTLR was created by Parr and Quong [1995] and is today in version 3. The information presented here is extracted from the official website (http://www.antlr.org/) and from the tutorial website by Mills [2005]. The reference manual by Parr [2007] contains more complete and detailed information about ANTLR. This project is intended to be a substitute for that tool, so we consider it important to get an overview of its most significant features and characteristics.

The grammar used by ANTLR is of type LL(k), which means that the parsers generated by ANTLR are Top-Down parsers, as explained in Section 2.1.3. ANTLR uses Extended Backus-Naur Form (EBNF) notation for defining the grammar rules. EBNF is an extension of Backus-Naur Form (BNF); it adds new constructs to BNF, such as ‘+’ after an item to indicate one or more occurrences.

The grammar file used by ANTLR contains several parts as presented in

listing 3.1.

Listing 3.1: ANTLR grammar file structure

header {
  // stuff that is placed at the top of <all> generated files
}

options { options for entire grammar file }

{ optional class preamble - output to generated class file
  immediately before the definition of the class }
class YourLexerClass extends Lexer;
// definition extends from here to next class definition
// (or EOF if no more class defs)
options { YourOptions }
tokens {
  EXPR;          // Imaginary token
  THIS="that";   // Literal definition
  INT="int";     // Literal definition
}
lexer rules ...
myrule [args] returns [retval]
  options { defaultErrorHandler=false; }
  : // body of rule ...
  ;

{ optional class preamble - output to generated class file
  immediately before the definition of the class }
class YourParserClass extends Parser;
options { YourOptions }
tokens ...
parser rules ...
rulename [args] returns [retval]
  options { defaultErrorHandler=false; }
  { optional initiation code }
  : alternative_1
  | alternative_2 ...
  | alternative_n
  ;

{ optional class preamble - output to generated class file
  immediately before the definition of the class }
class YourTreeParserClass extends TreeParser;
options { YourOptions }
tokens ...
tree parser rules ...

// arbitrary lexers, parsers and tree parsers may be included

As we can see, it contains several sections, including a header, a lexer (tokens and rules), a parser (tokens and rules), AST rules, and options sections that are copied verbatim to the generated parser.

The generated parser files are in the desired target language, which is specified when compiling the grammar file.

ANTLR allows the OpenModelica developers to specify in a robust and flexible way the rules and the grammar for the combined lexer and parser. It then generates code in the target language that outputs the designed AST for both the Modelica and MetaModelica grammars.

3.1.3 Current state

The OMC is today (May 2011) at version 1.7.0 (r8600). It is intended to be used by both industry and academia. Various research materials have been produced since 1997, including Master's and PhD theses, conference papers, journal papers and books, all listed on the OpenModelica research pages:

http://www.openmodelica.org/index.php/research/master-theses
http://www.openmodelica.org/index.php/research/phd-and-licentiate-theses
http://www.openmodelica.org/index.php/research/conference-papers
http://www.openmodelica.org/index.php/research/journal-papers
http://www.openmodelica.org/index.php/research/booksproceedings

More are today under development or recently finished, such as this master's thesis. This proves that the OMC is today an active research topic in the OpenModelica project.

3.2 Flex

FLEX is based on the tool called Lexical Analyzer Generator (LEX). Its grammar accepts regular expressions to define the tokens.

3.2.1 Input file lexer.l

The FLEX input file lexer.l contains three sections: definitions, rules and

user code.

Listing 3.2: Flex file structure

Definitions
%%
Rules
%%
User code

Definitions: Contains declarations of definitions and start conditions. Can contain code to be included verbatim at the top of the output as declarations.

Rules: Contains the rules in the form of patterns of an extended set of regular expressions. Each rule contains an action in C code that can return a token, reject, or change the start condition.

User code: Copied verbatim to the output file.

3.2.2 Output file lexer.c

The output file lexer.c generated by FLEX contains three main sections: the declaration of variables and arrays, the algorithm that runs the DFA, and the return action section with the actions that have been specified for each rule.

The arrays that are present in the lexer are:

yyec: matches any UTF-8 code with a start condition.

yyaccept: checks the states against the accept condition.

yyacclist: once accepted, the action for each state is found here.

yymeta: control array for the transitions.

yybase: control array for the transitions.

yydef: default transition for the states.

yynxt: determines the next transition of the states.

yychk: control array that verifies errors.
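The interplay of a character-class array, a transition array and an accept array can be illustrated with a much simplified, hand-written analogue in Python. The automaton below recognises only integers and identifiers and is not Flex output; the array roles are merely borrowed from the description above:

```python
# Minimal table-driven DFA in the spirit of the Flex arrays: a
# character-class function (like yyec), a transition table (like
# yynxt) and an accept map (like yyaccept).
def char_class(c):
    if c.isdigit():
        return 0
    if c.isalpha():
        return 1
    return 2                         # anything else ends the token

# Rows are states, columns are character classes; -1 = no transition.
NXT = [
    [1, 2, -1],   # state 0: start
    [1, -1, -1],  # state 1: inside an integer
    [2, 2, -1],   # state 2: inside an identifier
]
ACCEPT = {1: 'INT', 2: 'IDENT'}

def next_token(text, pos):
    state, last, i = 0, None, pos
    while i < len(text):
        nxt = NXT[state][char_class(text[i])]
        if nxt == -1:
            break
        state, i = nxt, i + 1
        if state in ACCEPT:          # remember the longest accepted match
            last = (ACCEPT[state], text[pos:i], i)
    return last                      # None on a lexical error

print(next_token('x1 + 42', 0))      # ('IDENT', 'x1', 2)
```

Real Flex tables additionally compress the transitions (yybase, yydef, yychk), which is why several control arrays appear in the generated file instead of one plain matrix.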

FLEX is designed to handle a large number of rules and tokens. It simplifies the number of rules and tokens used by the parser in the next phase of the compiler. That is why it is common to find FLEX combined with parser generators such as the tool called Yet Another Compiler-Compiler (YACC) or its successor GNU Bison.

For a complete reference on FLEX, the FLEX manual by Paxson [2002] is a good source of information.

3.3 GNU Bison

GNU Bison is a parser generator that generates an LALR(1) parser from a context-free grammar. The generated parser can be in one of three languages: C, C++ or Java. It is based on the tool called YACC.

GNU Bison receives as input a file with the grammar rules, specified using BNF. The output of the process is a parser written in C that communicates with a lexer, commonly written in LEX or FLEX.

In this section we explain these input and output files in detail and cover some other details about GNU Bison that will improve the understanding of the project implementation presented in the next chapter.


3.3.1 Input file parser.y

There are four sections in a grammar file: Prologue, Bison declarations, Grammar rules and Epilogue, distributed as presented in Listing 3.3.

Listing 3.3: Bison file structure

%{
Prologue
%}
Bison declarations
%%
Grammar rules
result: rule1-components ...
      | rule2-components ...
      ...
      ;
%%
Epilogue

Prologue: Macro definitions and declarations of functions and variables used in the grammar rules. It is attached verbatim to the beginning of the generated file.

Declarations: Define terminal and nonterminal symbols and specify precedence.

Grammar rules: Rules expressed as compositions of other rules, with an action defined for each rule, in Backus-Naur Form (BNF) notation.

Epilogue: Attached verbatim to the end of the generated file, as the prologue is to the beginning.

Each section is separated by a specific token: the prologue uses ‘%}’ and the other sections use ‘%%’.

3.3.2 Output file parser.c

GNU Bison generates a file that contains C source code. In the generated file we identify four main parts: the declarations of variables and transition arrays; the algorithm that runs the PDA; the response actions that build the AST while doing the reductions; and a section for custom code inserted by the developer.


Transition arrays

Aho et al. [2006] present the algorithm for LALR(1) based on the creation of two-dimensional parsing tables called the ACTION table and the GOTO table, as presented in Section 2.1.4. GNU Bison improves the storage efficiency by converting these two-dimensional tables into arrays using the method described by Tarjan and Yao [1979].

Popuri [2006] presents a detailed analysis of the role of each array in the generated file. The 15 arrays generated by GNU Bison are:

yytranslate: interface with the lexer, mapping the lexer's token numbers to the internal representation of the parser.

yyrhs: list of symbol numbers of all rules; yyrhs[n] is the first symbol on the right-hand side of rule n.

yyprhs: index in yyrhs of the first right-hand-side symbol of each rule.

yyrline: line number in the grammar file where each rule is defined.

yytname: list of names of the defined symbols.

yytoknum: list of the values of the tokens in the lexer.

yyr1: specifies the symbol number for each rule.

yyr2: number of tokens to be popped when a certain rule is reduced.

yydefact: default reduction for each state.

yydefgoto: compressed GOTO table; each entry specifies the state to transition to for each non-terminal.

yytable: state numbers in a pre-calculated order; works together with yycheck, yypgoto and yypact to indicate the next state and the rule to be used for a reduction.

yypact: indicates what to do next; works together with yytable.

yypgoto: indicates anomalies that result in errors.

yycheck: a control table that matches the current rule and guides the discovery of anomalies in the parser.
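Bison's actual layout is more elaborate, but the core idea behind packing a sparse two-dimensional table into value/check arrays, in the spirit of Tarjan and Yao's row displacement, can be sketched as follows (the toy data and names are ours, not Bison internals):

```python
# Row-displacement compression of a sparse 2-D table: each row is
# shifted into a shared value array so that its non-empty columns do
# not collide, and a parallel check array records the owning row.
def compress(table):
    ncols = len(table[0])
    value, check, base = [], [], []
    for r, row in enumerate(table):
        cols = [c for c, v in enumerate(row) if v is not None]
        d = 0
        while True:                  # find a displacement with no collision
            if all(d + c >= len(value) or check[d + c] == -1 for c in cols):
                break
            d += 1
        base.append(d)
        while len(value) < d + ncols:
            value.append(None)
            check.append(-1)
        for c in cols:
            value[d + c] = table[r][c]
            check[d + c] = r
    return base, value, check

def lookup(base, value, check, r, c):
    i = base[r] + c
    if i < len(check) and check[i] == r:
        return value[i]
    return None                      # blank entry: error or default action

TABLE = [
    [None, 's3', None, 's4'],
    ['r1', None, None, None],
    [None, None, 'acc', None],
]
base, value, check = compress(TABLE)  # 3x4 table packed into 4 entries
```

The check array is what makes out-of-row probes fail safely, which is exactly the role the yycheck array plays against yytable in the generated parser.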
