High-level programming languages translator

Master Thesis

Computer Science

Thesis no: MCS-2008-17

January 2008

Mohammed Salih

Ognyan Tonchev

Department of Interaction and System Design

School of Engineering

Blekinge Institute of Technology

Box 520

This thesis is submitted to the Department of Interaction and System Design, School of Engineering at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies.

Contact Information:

Authors:

Mohammed Salih

E-mail: m.mustafams@gmail.com

Ognyan Tonchev

E-mail: otonchev@gmail.com

University advisor:

Mia Persson

mia.persson@bth.se

Department of Interaction and System Design

ABSTRACT

This thesis discusses a high-level language translator. Translators of programming languages can be divided into two types: those that work for two specific languages, and universal translators that can be used between different programming languages. The solution presented in this work can be classified as both a language-specific and a universal translator. For the purpose of the research it was limited to translating from Java to C++, but it is designed to be extended to other high-level languages. To simplify translation, the project uses an intermediate step: all programs in the input language are first compiled to an abstract XML language and then to the desired output language. That way it is not necessary to translate directly from one programming language to another, which is a difficult task and would make the solution hard to maintain and extend. The translator can therefore also be used to translate from a high-level language to XML, which gives our solution another advantage: an XML representation of a computer program is valuable information in itself. We describe the design and implementation of the solution, demonstrate how it works, and explain how it can be extended to work for other programming languages.

Preface

CONTENTS

Abstract
Preface
Introduction
Chapter 1: Background
1.1 Language Translators/Compilers
1.2 Components of a Translator/Compiler
1.3 Previous work
1.4 Universal high-level language translator
1.5 Why XML
Chapter 2: Problem definition/Goals
2.1 Problem definition
2.2 Goals
2.2.1 High-level language translator at low cost
2.2.2 Universal
2.2.3 Portable
2.2.4 Extendable
2.2.5 Showing good performance
2.2.6 Reliable
Chapter 3: Methodology
3.1 Preparation work
3.2 Choosing the technologies
3.3 Designing the project
3.4 The case study
Chapter 4: Theoretical work
4.1 Implementation language – Java
4.2 The intermediate level – XML
4.3 Java CC
4.4 The xml parser
4.5 Putting them all together
4.5.1 L1->XML
4.5.2 XML->L2
4.6 Introducing support for a new language
4.6.1 New input language
4.6.2 New output language
Chapter 5: Case study
5.1 First scenario
5.3 Third scenario
Chapter 6: Results
6.1 Translator
6.2 Code re-factoring
6.3 Source code reformatting and Programming style
6.4 Pseudo code
6.5 Publishing the source code on the Internet
Chapter 7: Discussion/analysis
7.1 Advantages and Limitations of the study
7.2 Further work
Summary
References
Appendix A: The abstract programming language
Appendix B: Case study

INTRODUCTION

- Motivation for the work

Many programming languages have been used in the past, and more are in use today. The need to translate programs written in one high-level computer language into another arises more and more often. There are millions of lines of code written in old-fashioned languages that need to be rewritten. Many companies have already invested money in developing products but now want to move to a new platform in order to use the opportunities it provides. There are also many algorithms available on the Internet, written in popular programming languages, that people would like to reuse in other languages they are familiar with.

Manual translation of programs from one language to another is hard and in many cases impractical. Either it is hard to find programmers familiar with the source language, or investing in people to handle the process is not worth the money. Individuals who are not familiar with the implementation language of a solution they have found on the Internet have to spend a lot of time on manual translation, almost reconstructing the system from its functional requirements. Some kind of automated approach is needed. But is it possible to create such an automated translator? Why would it be difficult to program a universal translator, given that the concepts behind computer languages are common: for instance loops, branch statements and mathematical expressions, procedures and functions in structured programming, and classes with methods and attributes in OOP? How could those difficulties be minimized?

- Related work

There have been many attempts to implement high-level computer language translators. Some were developed only to solve small problems, such as converting between different versions of the same computer language. Others supported different languages but translated in one direction only, as is the case in [1], which translates only from Smalltalk to C++. Several approaches can be used to translate from one high-level computer language to another: for example, the translation can be done directly, or through an intermediate code that facilitates the work [2]. This work addresses a universal language translator in detail. It describes the technologies the translator uses, its overall design and implementation, and what makes it different from the existing solutions.

- A simple overview of the solution

The need to translate one programming language into another arises because new programming languages keep appearing. These languages differ in features and in purpose (there must be a purpose behind the existence of any new language). Some languages can use the benefits of others directly, without being translated: for instance, the features of Java can be used from PHP or Perl [3], and a scripting language can be used inside Java code through Java/JavaScript integration [4].

- The thesis statement/Contribution

This thesis introduces a universal translator for high-level programming languages. A new Abstract Programming Language based on XML (APL) is introduced for the purpose of translation; it is used as the intermediate representation of a program's source code.

- APL as standard for pseudo code

Pseudo code describes an algorithm using natural language [7]. Publishing the source code of an algorithm on the Internet raises a simple question: in which computer language should it be published, perhaps Java, C++ or C#? It can either be published in many programming languages, or as pseudo code that programmers then code again in a particular language. Either way is time consuming. The idea here is to use APL as the pseudo code representation: source code can be published on the Internet in APL format, and later anyone can use the universal translator to convert the APL to the target language he or she wants.

"Pseudo code is an informal language that helps programmers develop algorithms. It is similar to

everyday English; it is convenient and user-friendly, but it is not an actual computer programming

language" [8]

The research, analysis, design and implementation of the translator and of APL are covered in detail in the rest of the thesis.

- Discussion regarding the format of the rest of the thesis

CHAPTER 1: BACKGROUND

1.1 Language Translators/Compilers

A compiler is a program that can read a program in one language - the source language - and

translate it into an equivalent program in another language - the target language. An important role

of the compiler is to report any errors in the source program that it detects during the translation

process.[9]

Compilers are most commonly used to translate a program written in a high-level computer language to machine code. The machine code can then be executed to produce the desired output. An example is the widely used GNU Compiler Collection (gcc), a standard collection of compilers on Unix-like systems [10]. With this set of tools one can compile a program written in C, C++, Ada, Fortran and some other computer languages to an executable program.

Example 1.1: compiling a simple C program using gcc

Although translating computer programs from one high-level language to another is a rare use of a compiler, there have been a number of such attempts over the years. One of the earliest is a project developed by the PascAda group at the University of California in the early 1980s [11]. It is a translator that converts source code written in Ada to Pascal and vice versa. It is discussed in detail later in this chapter, since it uses a simple and straightforward method for high-level language translation that is also used in almost all of the previous works, including, for example, the Gypsy-to-Ada Program Compiler [12].

1.2 Components of a Translator/Compiler

optimization steps are optional and sometimes can be skipped. The output of the Syntax Analyzer is an Abstract Syntax Tree (AST), a tree representation of the program that is being compiled. Its format can be seen in Figure 1.2 below.

Figure 1.1: Phases of a compiler[9]


Figure 1.2: Syntax tree[9]

There is not much difference when we talk about high-level-language to high-level-language transformation. Only the back end differs: it produces output in a high-level computer language instead of machine code. Usually in that case the intermediate representation is some kind of subset of the semantically equal parts of the two high-level languages that take part in the translation. That way, if a program written in one of the two languages is transformed to the intermediate representation, transforming it to the other language is straightforward. Of course, if there are also syntactic differences between the two languages, they have to be handled separately.
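The idea that only the back end changes can be illustrated with a minimal Java sketch. It is not taken from the thesis: the class `TinyBackEnds`, the `Assign` record-like class and the two emitters are hypothetical names introduced here, and the intermediate form covers a single assignment statement only.

```java
class TinyBackEnds {
    /** Hypothetical intermediate form of the statement "target = left op right". */
    static final class Assign {
        final String target, left, op, right;

        Assign(String target, String left, String op, String right) {
            this.target = target;
            this.left = left;
            this.op = op;
            this.right = right;
        }
    }

    /** A back end turns the intermediate form into source text. */
    interface BackEnd {
        String emit(Assign a);
    }

    // C-style syntax: "=" for assignment (the same for C, C++ and Java here).
    static final BackEnd C_STYLE =
        a -> a.target + " = " + a.left + " " + a.op + " " + a.right + ";";

    // Pascal-style syntax: ":=" for assignment.
    static final BackEnd PASCAL_STYLE =
        a -> a.target + " := " + a.left + " " + a.op + " " + a.right + ";";

    public static void main(String[] args) {
        // One intermediate statement, two textual outputs.
        Assign sum = new Assign("total", "a", "+", "b");
        System.out.println(C_STYLE.emit(sum));
        System.out.println(PASCAL_STYLE.emit(sum));
    }
}
```

The same `Assign` object is handed to both emitters; the semantic content is shared, and only the syntactic difference (`=` versus `:=`) is handled per language.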

1.3 Previous work

Either because it took too much effort or for some other reason, it is rare to find work dedicated to high-level-language to high-level-language translation. The solutions that exist work for two specific languages and use a fairly straightforward translation technique. The PascAda project mentioned above is one of the first documented translators of this kind. The algorithm that it uses, which was later used by other solutions, is explained here.

Example 1.2 Java Class translated to C structure

1.4 Universal high-level language translator

The implementation discussed in this report is a high-level language translator that can be changed and configured to translate between any two high-level computer languages. For the purpose of this research it is, however, limited to translating from Java to C++. It shows that such a translation can be implemented at low cost using two standard “tools”.

The heart of the front end of the compiler is a language parser generated by the popular JavaCC parser generator [13]. JavaCC is an open source project developed in the Java programming language. It takes a language structure description (a grammar file in a special format [14]) as input and generates a parser in Java for the language described by the grammar file. By default the parser does not generate any output; it just parses and analyses programs. It is the programmer's task to change the parser to generate the desired output. Another “tool” that is used is an add-on for JavaCC called JJTree [15]. JJTree allows the parser generated by JavaCC to produce an AST (the syntax tree described above that is part of any compiler).

The intermediate representation of the input program that the front end produces is a document in XML format. It is a special language designed and implemented as part of this work; a discussion motivating this approach follows later. The XML language is built on one set, the semantics of C++, plus some additions that make it easier to translate Java programs to XML. That way every program written in either of the two languages can be transformed and presented as XML. Since C++ is a very complex language that has not only high-level but also low-level capabilities, it should not be difficult to represent programs written in other high-level languages (such as C, Pascal or Ada) in that XML format.

The back end of the translator is the same for all high-level languages. It reads language-specific grammar descriptions from XML files and uses that information to translate the intermediate representation to the desired output language. Support for a new language can therefore be introduced by editing a grammar file and changing a configuration file within the translator.
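A back end driven by a language description file can be sketched as follows. This is only an illustration under stated assumptions: the `<Language>` and `<Rule>` elements, their attributes and the class `TemplateBackEnd` are invented for this sketch and are not the file format used by the thesis. The XML is read with the standard `javax.xml.parsers` DOM API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class TemplateBackEnd {
    // Hypothetical language descriptions; element and attribute names are
    // assumptions of this sketch, not the thesis's actual grammar files.
    static final String CPP_RULES =
        "<Language name=\"C++\">"
      + "<Rule construct=\"Print\" template=\"std::cout &lt;&lt; %s &lt;&lt; std::endl;\"/>"
      + "<Rule construct=\"Return\" template=\"return %s;\"/>"
      + "</Language>";

    static final String JAVA_RULES =
        "<Language name=\"Java\">"
      + "<Rule construct=\"Print\" template=\"System.out.println(%s);\"/>"
      + "<Rule construct=\"Return\" template=\"return %s;\"/>"
      + "</Language>";

    private final Map<String, String> templates = new HashMap<>();

    /** Loads one output template per construct from the description. */
    TemplateBackEnd(String rulesXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rulesXml.getBytes(StandardCharsets.UTF_8)));
            NodeList rules = doc.getElementsByTagName("Rule");
            for (int i = 0; i < rules.getLength(); i++) {
                Element rule = (Element) rules.item(i);
                templates.put(rule.getAttribute("construct"), rule.getAttribute("template"));
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not read language description", e);
        }
    }

    /** Emits source text for one construct by filling in its template. */
    String emit(String construct, String argument) {
        String template = templates.get(construct);
        if (template == null) {
            throw new IllegalArgumentException("No rule for " + construct);
        }
        return String.format(template, argument);
    }

    public static void main(String[] args) {
        System.out.println(new TemplateBackEnd(CPP_RULES).emit("Print", "fib(i)"));
        System.out.println(new TemplateBackEnd(JAVA_RULES).emit("Print", "fib(i)"));
    }
}
```

In this sketch the Java code of the back end never changes; adding an output language means supplying another description string or file, which is the property the paragraph above describes.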

1.5 Why XML?

source code is used[17]. The simple answer to the question is that it is an attempt at an easily maintainable and extendable approach. A more detailed explanation follows below:

- As already discussed, a parser for any high-level language can be generated using standard tools. Some coding is also needed, but only to integrate the parser within the project. The intermediate representation should be in a format that can easily be understood, extended if needed, and parsed. XML is the natural choice here. It is a standard for storing and transferring data. It is software and hardware independent, and anyone involved in programming understands it to some extent. There are already XML parsers for all the modern platforms.

- The project can be used to translate programs to XML, and such a representation is valuable in itself. There is one popular work in this area that runs on Unix-like systems and translates from C to XML [18]. It is an extension to gcc, but is of limited use because of many limitations in the supported features. However, such an XML representation can be used for all kinds of code analysis by anyone, without the need to put great effort into dealing with compiler internals.

CHAPTER 2: PROBLEM DEFINITION AND GOALS

2.1 Problem definition

The problem of translating high-level languages is receiving more attention than ever. This is due to the variety of programming languages in use today, and it concerns not only the new languages but also the old-fashioned ones. Millions of lines of code written in old-fashioned languages are available on the Internet. Code written in these old languages is also still in production because of the high cost of translating it to modern languages. Either people who are not familiar with programming at all are using the old programs and still have to deal with old language interpreters, or the projects written in old-fashioned languages, which may also perform badly, are so big that manual translation by a programmer would cost too much money. Hence the need for an automated translator.

Although the problem is important, not much has been done in this area so far. There are a number of high-level language translators, but none of them can be used as a universal one or even extended to work for other languages. Most of these works support only two specific languages, like the Fortran to QuickBasic translator [6] or the Smalltalk to C++ translator [1]. Neither is designed to work as a universal translator, and both are written in C, which makes them less portable. The Ada to Pascal project [11] supports translation from Ada to Pascal and vice versa, and its authors also state that the translation procedure can be applied to all high-level languages. The problem with this compiler is that, although it was developed as a universal one, introducing support for another language is not straightforward. It requires good programming skills and a lot of coding, including development of a language parser, an intermediate representation and so on. The translator was developed almost 30 years ago, and the procedures that could be applied to the old-fashioned languages cannot be applied to modern ones because of the large number of semantic differences.

The project developed as a result of this thesis work deals with the problems discussed above. The new translator supports translation from Java to C++ but can be extended to work for other high-level languages. It is also portable and can be installed on both Windows and Unix-like systems. It also has to meet other requirements, which are discussed later in this chapter.

The research questions that this thesis will try to answer are:

- Is it possible to convert programs written in one programming language to another?

- What are the difficulties?

- How could they be minimized?

- Why would it be difficult to program a universal translator since the concept behind computer

languages is common?

2.2 Goals

The project developed as a result of this thesis work meets the goals described below.

2.2.1 High-level language translator at low cost


highly popular technologies will be used. The Java language used for developing the project is also

extremely popular nowadays.

The heart of the front end of the compiler is a language parser generated by the popular JavaCC parser generator [13]. JavaCC is an open source project developed in the Java programming language. It takes a language structure description (a grammar file in a special format [14]) as input and generates a parser in Java for the language described by the grammar file. By default the parser does not generate any output; it just parses and analyses programs. It is the programmer's task to change the parser to generate the desired output. Another “tool” that is used is an add-on for JavaCC called JJTree [15]. JJTree allows the parser generated by JavaCC to produce an AST (the syntax tree described above that is part of any compiler).

The intermediate representation of the input program that the front end produces is a document in XML format. It is a special language designed and implemented as part of this work; a discussion motivating this approach follows later. The XML language is built on one set, the semantics of C++, plus some additions that make it easier to translate Java programs to XML. That way every program written in either of the two languages can be transformed and presented as XML. Since C++ is a very complex language that has not only high-level but also low-level capabilities, it should not be difficult to represent programs written in other high-level languages (such as C, Pascal or Ada) in that XML format.

2.2.2 Universal

The high-level language translator should be universal. That means the algorithms used for the language translation should be applicable to all high-level languages to the greatest possible extent.

2.2.3 Portable

The word universal also means that the project can easily be installed on different platforms. A variety of operating systems are in use today, and the translator should work at least on Windows and Unix-like systems. This is achieved by using Java for developing the project and XML for storing and transferring the data needed for the translation procedure.

2.2.4 Extendable

The high-level language translator should be extendable. That means support for any high-level computer language can be introduced through configuration, e.g. by editing an XML file, or with little programming effort by following a straightforward procedure that is described in detail.

2.2.5 Showing good performance

The translator has to show good performance when translating programs. Since there is no basis for direct comparison, performance will be measured against existing and popular compilers or interpreters.

CHAPTER 3: METHODOLOGY

3.1 Preparation work

In preparation for writing this thesis, an extensive literature survey was done to find the existing solutions and any written material on the topic. To find the material, two digital libraries containing a large amount of technical literature and research papers in the computer science field were chosen: IEEE Xplore and the ACM Digital Library. These two databases were chosen because they are considered to be among the richest, with archives going back decades. They also have a very good reputation among scientists.

The keywords used to search for material in both the IEEE and ACM digital libraries were: language translator, source to source, language convertor, language compiler. Although there is a lack of good previous work related to the chosen topic, what was found through this search provided a solid base on which a new and better project could be developed, resolving some of the problems of the previous solutions.

Since the project developed as a result of this thesis is basically a compiler, good material on compiler design and implementation was needed. Some teachers involved in compiler-related work were asked to recommend a good book. Amazon was also searched for books about compilers, using the keyword “compiler”. The book reviews and readers' opinions were examined in order to choose the best book. The book that was chosen is referenced in the Background section of this thesis and was also used as a source of ideas on how to build the high-level language translator.

3.2 Choosing the technologies

Besides collecting the literature, the preparation work for this thesis also included a search for technologies to be used when building the translator. These technologies have to fulfill several criteria in order for the goals to be achieved.

needs of the translator.

Another important point is choosing the XML parser. This step is vital to the project, since there are basically two technologies for building XML parsers, and choosing one of them also leads to design changes in the back end of the translator (the part that translates the intermediate representation to the destination language). The criteria used for choosing between the two technologies are described later in this thesis. A Google search was done to find a list of XML parsers, using the keywords “free xml parser” and “open source xml parser”. The parsers were then rated by the following criteria: how popular and reliable the parser was, how fast it was, the level of documentation, and ease of use.

The procedure for choosing other technologies such as the language for implementation of the

parser and the format of the intermediate representation is described later in this thesis in the

Theoretical part. The decisions taken were made based on the experience of the authors of the

thesis.

3.3 Designing the project

The universal language translator was developed to combine all the technologies mentioned above and to meet the goals already described to the greatest extent possible. The whole project was designed to fulfill the following design requirements:

- The project should be structured in several functionally different parts: the front end of the translator, which converts an input language to the intermediate representation, and the back end, which is responsible for converting the intermediate representation to the destination language.

- The front end and the back end should also be designed with classes that separate different logic into different objects. For example, the language parser only parses the input language and knows nothing other than how to parse. Other classes are responsible for walking the abstract syntax tree that it produces and for generating the XML, which is the intermediate representation.

- The project should not only work as a whole, translating a program written in one high-level computer language to another, but should also be usable for translation from a high-level computer language to XML and vice versa.
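The three requirements above can be pictured as two small interfaces and a class that composes them. The following Java sketch is only an illustration: `TranslatorPipeline`, `FrontEnd`, `BackEnd` and the toy implementations are hypothetical names, and the toy front end handles nothing more than a single identifier.

```java
class TranslatorPipeline {
    /** Front end: source text in an input language to the XML intermediate form. */
    interface FrontEnd {
        String toXml(String source);
    }

    /** Back end: XML intermediate form to source text in an output language. */
    interface BackEnd {
        String fromXml(String xml);
    }

    // Toy stand-ins for this sketch: an identifier becomes a Variable element
    // and back again. A real front end would be a generated parser.
    static final FrontEnd TOY_FRONT_END =
        source -> "<APL><Variable name=\"" + source.trim() + "\"/></APL>";

    static final BackEnd TOY_BACK_END =
        xml -> xml.replaceAll(".*name=\"([^\"]*)\".*", "$1");

    private final FrontEnd frontEnd;
    private final BackEnd backEnd;

    TranslatorPipeline(FrontEnd frontEnd, BackEnd backEnd) {
        this.frontEnd = frontEnd;
        this.backEnd = backEnd;
    }

    /** Whole pipeline: input language to XML to output language. */
    String translate(String source) {
        return backEnd.fromXml(frontEnd.toXml(source));
    }

    public static void main(String[] args) {
        TranslatorPipeline pipeline = new TranslatorPipeline(TOY_FRONT_END, TOY_BACK_END);
        System.out.println(TOY_FRONT_END.toXml("n"));  // front end used on its own
        System.out.println(pipeline.translate(" n ")); // both halves together
    }
}
```

Because the two halves only share the XML string, either one can be called on its own, which corresponds to the third requirement.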

3.4 The case study

This work also uses a case study as a research method: the process of translation from Java to XML, back to Java, and to C++. The case study data is then analyzed to get an indication of the extent to which the translation works and how reliable it is. This test is important because the authors believe that the translator works and can be used in production, but given its limitations and the wide variety of computer programs available, it has to be shown whether everything that is expected to be supported really is supported. Some measurements are also taken of the performance and the reliability of the project under Windows and Unix-like systems.

The second scenario converts the same Java program used in the first scenario to C++, in order to see how the translator deals with the syntax and methods when converting to a different programming language.

CHAPTER 4: THEORETICAL WORK

4.1 Implementation language - Java

The main goal of this thesis is to create a cheap and portable compiler from one high-level computer language to another. It also has to be easy to maintain and extend by people who were not involved in its design and implementation. The language that naturally comes to mind is Java. It is a language that a large share of programmers working with high-level computer languages understand and use. In some sense it is simpler than C++, and programs written in it are easier to read and maintain. It is also easy to find a needed piece of code or a program written in Java available for free on the Internet. It can be a tricky and sometimes impractical task to make a project written in a high-level language other than Java run on a platform different from the one it was initially developed on. Java was designed as a platform-independent language, or at least it is supposed to be. There is no perfect world, however, and this applies to Java as well: there are some compatibility issues, but it is to a great extent platform independent, and writing in Java saves a tremendous amount of time when developing code that is supposed to work on different platforms.

4.2 The intermediate level - XML

It was already discussed in this work that the intermediate level is an important part of any

compiler. In this project the intermediate representation is in a special XML format. This is a

language that was created for the purpose of this work. Its documentation and description is

available and can be seen in Appendix A. Figure 4.1 below shows a simple Java program that prints

the first N=20 Fibonacci numbers:
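The listing of Figure 4.1 is not reproduced in this text. A program consistent with the description above and with the XML fragments of Figure 4.2 (a recursive `fib` with an if/else, a `for` loop, and a call to `System.out.println`) might look like the following sketch. The base case, the loop bounds and the field `N` are assumptions and may differ from the original listing.

```java
class Fibonacci {
    // Number of values to print; the text states N = 20.
    static int N = 20;

    // Recursive definition. The base case (n < 2) is an assumption of this sketch.
    static int fib(int n) {
        if (n < 2) {
            return n;
        } else {
            return fib(n - 1) + fib(n - 2);
        }
    }

    public static void main(String[] args) {
        // Prints fib(0) through fib(N - 1), one value per line.
        for (int i = 0; i < N; i++) {
            System.out.println(fib(i));
        }
    }
}
```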

The output of the front-end part of the universal language translator will be an intermediate

representation of the program in the mentioned XML format. The actual look of the new XML

program is shown below:

          </Variable>
        </Variables>
      </VariableDeclaration>
    </ClassMember>

                          </Operand>
                        </BinaryExpression>
                      </Operand>
                    </FunctionParameters>
                  </FunctionCallExpression>
                </Operand>
                <Operand id="0">
                  <FunctionCallExpression name="fib">
                    <FunctionParameters>
                      <Operand>
                        <BinaryExpression>
                          <OperatorType type="-"/>
                          <Operand id="0">
                            <Variable name="n"/>
                          </Operand>
                          <Operand id="0">
                            <Data type="int" value="2"/>
                          </Operand>
                        </BinaryExpression>
                      </Operand>
                    </FunctionParameters>
                  </FunctionCallExpression>
                </Operand>
              </BinaryExpression>
            </ReturnStatement>
          </Else>
        </IfStatement>
      </BlockStatement>
    </FunctionDeclaration>

              </Operand>
              <Operand id="1"/>
            </UnaryExpression>
          </ForStatementExpression>
          <ExpressionStatement>
            <FunctionCallExpression name="println">
              <Member>
                <Variable name="System"/>
                <Variable name="out"/>
              </Member>
              <FunctionParameters>
                <Operand>
                  <FunctionCallExpression name="fib">
                    <FunctionParameters>
                      <Operand>
                        <Variable name="i"/>
                      </Operand>
                    </FunctionParameters>
                  </FunctionCallExpression>
                </Operand>
              </FunctionParameters>
            </FunctionCallExpression>
          </ExpressionStatement>
        </ForStatement>
      </BlockStatement>
    </FunctionDeclaration>
  </ClassDeclaration>
</APL>

Figure 4.2: Fibonacci.xml

Why use XML?

As discussed previously in this work, anyone familiar with compilers who looks at this approach will probably first ask “Why XML?”. At first glance, using XML for the intermediate representation gives the impression of working at too high a level. The authors were not able to find an existing solution that works at such a high level. Sometimes even a machine code representation of the initial source code is used [17]. The simple answer to the question is that it is an attempt at an easily maintainable and extendable approach. A more detailed explanation follows below:

- As already discussed, and as discussed in more detail later, a parser for any high-level language can be generated using standard tools. Some coding is also needed, but only to integrate the parser within the project. The intermediate representation should be in a format that can easily be understood, extended if needed, and parsed. XML is the natural choice that comes to mind. It is a standard for storing and transferring data. It is software and hardware independent, and anyone involved in programming understands it to some extent. There are already XML parsers for all the modern platforms. The translator developed as part of this thesis uses such a standard XML parser, which is discussed later in this chapter. Hence this is another part of the project that is highly popular and free.

compilers stuff. Imagine in how many ways the program shown in Figure 4.1 can be written. Figure 4.3 below shows the same program written in a different format.

Figure 4.3: Fibonacci2.java

Doing any kind of analysis over these two programs is clearly not a simple and straightforward task. They are one and the same program, but they look different from a parsing standpoint. The XML representation of the program, however, is the same in both cases, and it is very easy to download an XML parser from the Internet (there are plenty of them, written in different languages) and use it within any program that analyses the code.
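As an illustration of such an analysis, the following Java sketch counts how often each function is called in an APL document, using the DOM parser that ships with the Java platform (`javax.xml.parsers`). The class `CallCounter` and the `SAMPLE` string are hypothetical; the sample reuses element names that appear in Figure 4.2 but simplifies the surrounding document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class CallCounter {
    // Simplified APL fragment for println(fib(i)); an assumption of this sketch.
    static final String SAMPLE =
        "<APL><ExpressionStatement>"
      + "<FunctionCallExpression name=\"println\"><FunctionParameters><Operand>"
      + "<FunctionCallExpression name=\"fib\"><FunctionParameters><Operand>"
      + "<Variable name=\"i\"/>"
      + "</Operand></FunctionParameters></FunctionCallExpression>"
      + "</Operand></FunctionParameters></FunctionCallExpression>"
      + "</ExpressionStatement></APL>";

    /** Returns the number of call sites per function name, sorted by name. */
    static Map<String, Integer> countCalls(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // getElementsByTagName visits all descendants, nested calls included.
            NodeList calls = doc.getElementsByTagName("FunctionCallExpression");
            Map<String, Integer> counts = new TreeMap<>();
            for (int i = 0; i < calls.getLength(); i++) {
                String name = ((Element) calls.item(i)).getAttribute("name");
                counts.merge(name, 1, Integer::sum);
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countCalls(SAMPLE)); // prints {fib=1, println=1}
    }
}
```

The analysis never looks at Java source text, so differently formatted versions of the same program give the same result as long as their XML representation is the same.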

Of course, using XML has its disadvantages, mainly connected to performance. The XML representation of high-level computer language programs is heavy, and hence parsing it is time consuming. However, this is not an issue for this project, since it focuses on other goals: creating a cheap, portable, extendable and easily maintainable translator.

4.3 JavaCC

As discussed previously in this work, each language compiler consists of two main parts: the front
end and the back end. The heart of the front end of this compiler is a language parser generated by
the popular JavaCC parser generator[13]. This is an open source project developed in the Java
programming language. It takes a description of the language structure (a grammar file in a special
format[14]) as input and generates a parser in Java for the language described by the grammar file.
By default the generated parser does not produce any output; it just parses and analyses programs.
It is the programmer's task to change the parser so that it generates the desired output. Another
“tool” used within the front end is an add-on for JavaCC called JJTree[15]. JJTree allows the
parser generated by JavaCC to produce an abstract syntax tree (APL). It was already shown what the
APL tree looks like.

Why choose JavaCC? What other existing solutions could be used?

There is a wide variety of parser generators available. However, only a few can produce Java code
and are also free or open source. These include JavaCC, ANTLR and SableCC. JavaCC and ANTLR are
similar since they both produce parsers that use the LL(1) algorithm. SableCC, on the other hand,
produces parsers that use the LALR(1) algorithm, which is a variant of the LR parser. This was one
of the reasons SableCC was not considered for this thesis. There are a number of reasons to choose
LL(1) parsing instead of LALR(1):

As stated before, JavaCC and ANTLR are similar parser generators. The documents found on the
Internet comparing JavaCC to ANTLR all indicated that JavaCC is the more popular of the two, and
that its only disadvantages were poor documentation and the inability to create parsers in
languages other than Java. This was not enough to decide which one to use, since for the purpose of
this thesis the generated parser should be in Java, and hence the latter is not a real
disadvantage. Further study and testing of both parser generators was therefore done. It showed
that although ANTLR was better documented than JavaCC, the JavaCC parser was easier to understand
and use. JavaCC generates all the classes needed by the parser, and hence the parser does not need
any external libraries or files. ANTLR, however, uses its own environment, and the generated parser
is not a stand-alone program. One will run into problems when using two parsers (for example a
parser for Java programs and another one for C++ programs) created under different versions of
ANTLR. JavaCC therefore seems to be the more maintainable one. It has been reported to work on all
Java platforms and on countless different machines[20]. JavaCC also has better error reporting,
which shows the exact place where the parsing error was found, plus a diagnostic screen. The
debugging capabilities are also better: by using different debug options one can get a deep
analysis of the parsing process. JavaCC also comes with a number of grammars, which makes the
process of generating a language parser easy and straightforward. There is no need to write a
grammar file for the Java language, for example. On the Internet there is also a rich variety of
grammar files for the JavaCC parser generator, including C, C++, Pascal and other popular
high-level computer languages. However, if a parser written in the C++ language has to be created,
then ANTLR seems to be the best parser generator to use.

4.4 The XML parser

The project uses the widely popular and open-source SAX parser[21]. It is used for parsing the XML
intermediate representation of the programs being translated. Before explaining why the SAX parser
was chosen for this project, a short introduction to XML parsing is needed.

There are two main specifications for parsing XML: DOM and SAX[21]. There are a number of parsers
available using either DOM or SAX. The difference between the two technologies is as follows. SAX
parsing is event-based: the XML is parsed and, upon meeting specific tags, a dedicated handler
function is called. There is no container that keeps the already parsed data, so the programmer
has to keep track of the history or build an internal representation if the XML has to be
re-analysed at a later stage. Of course that is done only if needed. The DOM technology, on the
other hand, works in a somewhat different way. DOM parsers create an internal tree representation
of the XML that is kept in memory, and navigation through that tree is possible at run time. For
that purpose there are different functions for selecting parent or child nodes. The whole tree is
kept in memory while the XML is being processed.

It is obvious that parsers using the SAX technology have one major advantage: better performance
compared to DOM parsers. They can parse even a document that is gigabytes in size. DOM parsers are
very slow on big XML documents since they keep a representation of the whole document in memory.
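The event-based model can be illustrated with a minimal sketch that uses the SAX classes shipped with the Java platform (javax.xml.parsers and org.xml.sax). The class names SaxEventDemo and RecordingHandler are chosen for this example only and are not part of the translator. The handler simply records the start and end events in the order in which the parser reports them; nothing else about the document is kept.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventDemo {
    /** Records every start/end element event in arrival order. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }
    }

    /** Parses the XML text and returns the recorded events. */
    static List<String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            RecordingHandler handler = new RecordingHandler();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // prints: [start:Variables, start:Variable, end:Variable, end:Variables]
        System.out.println(parse("<Variables><Variable name=\"x\"/></Variables>"));
    }
}
```

Any state beyond the current event, such as the list of events here, has to be kept by the handler itself, which is exactly the bookkeeping described above.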

4.5 Putting them all together

The technologies described above are combined to form the universal high-level language
translator. The translator consists of two executable programs: the first one converts a program in
one language to the XML intermediate representation, and the second one converts the intermediate
representation into a destination language. The two parts can work independently, hence the
translator can also be used to translate from a high-level computer language to XML and vice
versa. Figure 4.4 below shows the three ways in which the translator can be used.

Figure 4.4: The universal high-level language translator

4.5.1 L1->XML

The program that translates from the input language L1 to the XML representation consists of at
least one language parser generated by the JavaCC parser generator. There are as many language
parsers as there are supported high-level computer languages. Each parser also uses the JJTree
add-on to generate an abstract syntax tree. The procedure for generating a language parser is
straightforward and is accomplished by using the JavaCC parser generator. Only one Java class has
to be written for each parser in order to integrate it into the universal language translator. That
class walks the abstract syntax tree and generates the intermediate representation of the input
program in XML format. A special API is used for generating the XML. Figure 4.5 shows some of the
public methods of that API.

Figure 4.5: Public methods in XMLOutput.java

Figure 4.6: APLVisitor.java

That code will print the Data tag representing a string:

<Data type="string" value="Hello world!"/>
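How such a tag could be produced is sketched below. This is not the XMLOutput API of the translator, whose methods are listed in Figure 4.5; the class DataTagSketch and its methods escape, emptyTag and dataTag are assumptions made for this illustration. The sketch only shows the general idea of building an empty-element tag from a name and an ordered set of attributes, with the attribute values escaped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DataTagSketch {
    /** Escapes the characters that may not appear raw inside a double-quoted XML attribute. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Builds an empty-element tag, keeping the attributes in insertion order. */
    static String emptyTag(String name, Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append("=\"").append(escape(e.getValue())).append('"');
        }
        return sb.append("/>").toString();
    }

    /** Convenience method for the Data tag used for literals (hypothetical helper). */
    static String dataTag(String type, String value) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("type", type);
        attrs.put("value", value);
        return emptyTag("Data", attrs);
    }

    public static void main(String[] args) {
        // prints: <Data type="string" value="Hello world!"/>
        System.out.println(dataTag("string", "Hello world!"));
    }
}
```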

4.5.2 XML->L2

The program that translates from the XML intermediate representation to the destination language
consists of four classes: the main class APL, APLhandler, XMLTag and Printer. The diagram below
illustrates how this part works:

Figure 4.7: XML to source code

APL: contains the entry point of the program. It checks the arguments, makes sure that the
configuration file exists and loads it, and then creates an instance of SAXParser to parse the XML
file and an APLhandler to handle the XML events during the parsing.

source code instructions. SAXParser notifies the handler each time it finds the start of the
document, the start of an element, the end of an element, character data or the end of the
document, as well as when any error occurs [21]. Every time APLhandler gets a start tag it creates
an XMLTag to hold the tag information and pushes it onto the stack. When it gets an end tag it
generates the source code for the tag by calling the method XMLTag.getValue(tag); this value is
added to the XMLTag object on the top of the stack. The process continues until it reaches the end
of the root element, the "APL" tag, which will then contain the full source code of the program.

The Printer class just helps to generate organized and readable source code.

Figure 4.8 shows an XML fragment that represents the variable declaration “private int x;”, and
Table 4.1 demonstrates in detail how these XML tags are converted to source code.

Figure 4.8: Variable declaration “private int x;” in XML format

The APLHandler deals with events using the following algorithm:

New tag event: create an XMLTag and push it onto the stack

End tag event: pop the tag, get its value, pop another tag, add the value of the previous tag to
it, and push it back onto the stack

The following table shows step by step how the XML fragment in Figure 4.8 is converted to source
code:

Event                         Response

New tag VariableDeclaration   tag = new XMLTag("VariableDeclaration", attributes);
                              push(tag);

New tag Type                  tag = new XMLTag("Type", attributes);
                              push(tag);

End tag Type                  tag = pop();
                              value = XMLTag.getValue(tag); // the value is the string "int"
                              parentTag = pop(); // the parent tag, in this case VariableDeclaration
                              parentTag.addValue(value);
                              push(parentTag);

New tag Variables             tag = new XMLTag("Variables", attributes);
                              push(tag);

New tag Variable              tag = new XMLTag("Variable", attributes);
                              push(tag);

End tag Variable              tag = pop();
                              value = XMLTag.getValue(tag); // the value is "x"
                              parentTag = pop(); // the parent is the Variables tag
                              parentTag.addValue(value);
                              push(parentTag);

End tag Variables             tag = pop();
                              value = XMLTag.getValue(tag); // the value added by the Variable
                                                            // tag on the previous event, which is "x"
                              parentTag = pop(); // the parent is the VariableDeclaration tag
                              parentTag.addValue(value);
                              push(parentTag);

End tag VariableDeclaration   tag = pop();
                              value = XMLTag.getValue(tag); // it has two values, "int" and "x"
                              // it does not have a parent tag
                              // the function getValue is responsible for generating the source
                              // from the values and the attributes of the tag; in this case the
                              // source is generated as follows:
                              tag.getAttribute("modifiers")+" "+getValue(0)+" "+getValue(1)+";"
                              which equals: private int x;

Table 4.1: Generating source code from XML
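The steps of Table 4.1 can be reproduced with a short, runnable sketch built on the standard Java SAX classes. It is not the APLhandler or XMLTag code of the translator: the classes StackTranslationSketch, Tag and Handler are hypothetical, and the rules in getValue are hard-coded for the three tag kinds of this example, whereas the translator takes them from a configuration file. Instead of popping the parent and pushing it back, the sketch peeks at the top of the stack, which has the same effect.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class StackTranslationSketch {
    /** Hypothetical stand-in for XMLTag: a name, its attributes and the values of its children. */
    static class Tag {
        final String name;
        final Map<String, String> attributes = new HashMap<>();
        final List<String> values = new ArrayList<>();

        Tag(String name, Attributes atts) {
            this.name = name;
            for (int i = 0; i < atts.getLength(); i++) {
                attributes.put(atts.getQName(i), atts.getValue(i));
            }
        }

        /** Turns a finished tag into source text; rules are hard-coded for this example only. */
        String getValue() {
            switch (name) {
                case "Type":
                    return attributes.get("type");
                case "Variable":
                    return attributes.get("name");
                case "VariableDeclaration":
                    return attributes.get("modifiers") + " " + values.get(0) + " " + values.get(1) + ";";
                default:
                    return String.join(", ", values);
            }
        }
    }

    static class Handler extends DefaultHandler {
        final Deque<Tag> stack = new ArrayDeque<>();
        String result;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            stack.push(new Tag(qName, atts));       // new tag event
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = stack.pop().getValue();  // end tag event
            if (stack.isEmpty()) {
                result = value;                     // root element: translation finished
            } else {
                stack.peek().values.add(value);     // hand the value over to the parent tag
            }
        }
    }

    static String translate(String xml) {
        try {
            Handler handler = new Handler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.result;
        } catch (Exception e) {
            throw new IllegalStateException("could not translate XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<VariableDeclaration modifiers=\"private\"><Type type=\"int\"/>"
                   + "<Variables><Variable name=\"x\"></Variable></Variables></VariableDeclaration>";
        System.out.println(translate(xml)); // prints: private int x;
    }
}
```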

4.6 Introducing support for a new language

Introducing support for a new language within the translator has two aspects. The first one is
introducing support for a new input language, which involves generating a new language parser with
the JavaCC parser generator. The second one is adding support for a new output language, which
involves editing a configuration file which is part of the translator. Both procedures are
described in detail below.

4.6.1 New input language

Implementing support for parsing a new language is a straightforward and simple process involving
only a little programming. The process of introducing a new input language includes:

- Generating a new parser for the language to be implemented. That parser has to be generated
with the JavaCC parser generator with the JJTree add-on included. The process of generating a new
parser is very simple and is not described in this thesis; one can read about it in [13]. The only
thing needed for generating a new parser is a grammar file. For the most popular high-level
computer languages such a file is available on the Internet, but if it is missing a new one has to
be created by the programmer. The format of the grammar files is described in [14].

- After generating a new parser with the JavaCC parser generator, that parser has to be changed to
generate the intermediate representation, the abstract programming language shown in Appendix A.
That process is straightforward, but it is the trickiest thing that has to be done. It involves
creating and programming a new Java class. That class uses the visitor design pattern [22]: it goes
through the abstract syntax tree built by the language parser and generates the desired output, in
this case XML. The class for the Java language can be seen in Appendix C. Its name is APLVisitor
and it is located in the APLVisitor.java file. Figure 4.6 above shows a small part of that file and
how XML is generated by using a dedicated API. The public methods of that API are shown in
Figure 4.5.
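The role of the visitor class can be illustrated with a small, self-contained sketch. It does not use the node classes or the visitor interface generated by JJTree, and it is not the APLVisitor from Appendix C; the node types ClassNode and FieldNode, the Visitor interface and the XmlVisitor class are assumptions made for this illustration. The tag names follow the XML shown elsewhere in this thesis.

```java
import java.util.ArrayList;
import java.util.List;

class VisitorSketch {
    /** Visitor interface: one visit method per (hypothetical) node type. */
    interface Visitor {
        void visit(ClassNode node);
        void visit(FieldNode node);
    }

    interface Node {
        void accept(Visitor v);
    }

    static class FieldNode implements Node {
        final String modifiers, type, name;
        FieldNode(String modifiers, String type, String name) {
            this.modifiers = modifiers;
            this.type = type;
            this.name = name;
        }
        public void accept(Visitor v) { v.visit(this); }
    }

    static class ClassNode implements Node {
        final String name;
        final List<FieldNode> fields = new ArrayList<>();
        ClassNode(String name) { this.name = name; }
        public void accept(Visitor v) { v.visit(this); }
    }

    /** Walks the tree and appends one XML line per tag, indenting nested tags. */
    static class XmlVisitor implements Visitor {
        final StringBuilder out = new StringBuilder();
        private int depth = 0;

        private void line(String text) {
            for (int i = 0; i < depth; i++) out.append("  ");
            out.append(text).append('\n');
        }

        public void visit(ClassNode node) {
            line("<ClassDeclaration name=\"" + node.name + "\">");
            depth++;
            for (FieldNode field : node.fields) {
                line("<ClassMember>");
                depth++;
                field.accept(this);   // double dispatch selects visit(FieldNode)
                depth--;
                line("</ClassMember>");
            }
            depth--;
            line("</ClassDeclaration>");
        }

        public void visit(FieldNode node) {
            line("<VariableDeclaration modifiers=\"" + node.modifiers + "\">");
            depth++;
            line("<Type type=\"" + node.type + "\"/>");
            line("<Variables>");
            depth++;
            line("<Variable name=\"" + node.name + "\"/>");
            depth--;
            line("</Variables>");
            depth--;
            line("</VariableDeclaration>");
        }
    }

    static String toXml(Node root) {
        XmlVisitor visitor = new XmlVisitor();
        root.accept(visitor);
        return visitor.out.toString();
    }

    public static void main(String[] args) {
        ClassNode point = new ClassNode("Point");
        point.fields.add(new FieldNode("private", "int", "x"));
        System.out.print(toXml(point));
    }
}
```

The tree classes stay free of any output logic; supporting another output format would only require another Visitor implementation.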

4.6.2 New output language

Translating a program from the XML format to another programming language is a simple process,
because the translator always uses the same code but a different configuration file for each
target language. That makes introducing support for a new output language straightforward: it only
requires adding a new configuration file for the new language. A configuration file is a simple
text file that contains many lines, each of which is a key and value pair describing how a
specific instruction in the abstract format is translated and represented in the target language.
The figure below shows part of such a configuration file for the Java language.
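The idea of driving the output by key and value pairs can be sketched with the standard java.util.Properties class. The key names used here (Consol.println and statement.end) and the two configuration texts are invented for the illustration; the real keys are those defined in the configuration files of the translator. The same method produces a print statement for Java or for C depending only on which configuration is loaded.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class ConfigSketch {
    // Hypothetical configuration entries, one "key=value" pair per line.
    static final String JAVA_CONF = "Consol.println=System.out.println\nstatement.end=;\n";
    static final String C_CONF = "Consol.println=printf\nstatement.end=;\n";

    /** Loads key/value pairs from configuration text. */
    static Properties load(String text) {
        Properties conf = new Properties();
        try {
            conf.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("could not read configuration", e);
        }
        return conf;
    }

    /** Renders the abstract print call using the names of the chosen target language. */
    static String printCall(Properties conf, String argument) {
        return conf.getProperty("Consol.println") + "(" + argument + ")"
                + conf.getProperty("statement.end");
    }

    public static void main(String[] args) {
        System.out.println(printCall(load(JAVA_CONF), "\"hi\"")); // System.out.println("hi");
        System.out.println(printCall(load(C_CONF), "\"hi\""));    // printf("hi");
    }
}
```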

CHAPTER 5: CASE STUDY

The solution is built on open source technologies (Java and XML), which makes it low cost and
portable because both technologies are platform independent. It is universal for high-level
programming languages and extendable: it just uses a different configuration file for each target
language.

This section presents a case study which is used to demonstrate the work and show that the system
achieves its goals.

A computer program written in a specific programming language will be converted to another
programming language. As mentioned before, the translator has a middle step, which is generating
intermediate code for a given program in the Abstract Programming Language. This intermediate code
is actually a computer program in eXtensible Markup Language (XML) format. Another part of the
translator converts from XML to a specific programming language by using a simple configuration
file that contains key and value pairs describing the reserved words and some other information
about the specific language. Several scenarios are covered by the case study.

5.1 First scenario

In this scenario the small Java program shown below is converted to APL and then from APL back to
Java. Then some measurements are taken to demonstrate some of the goals.

Figure 5.1: class Point.java original source code

By using the translator we can convert this program to abstract format APL. Here is the command:

2apl Point.java

Executing the command produces a new abstract program, Point.xml, in XML format as follows:

<APL>
  <ClassDeclaration name="Point" modifiers="public">
    <ClassMember>
      <VariableDeclaration modifiers="private">
        <Type type="int"/>
        <Variables>
          <Variable name="x">
          </Variable>
        </Variables>
      </VariableDeclaration>
    </ClassMember>
    <ClassMember>
      <VariableDeclaration modifiers="private">
        <Type type="int"/>
        <Variables>
          <Variable name="y">
          </Variable>
        </Variables>
      </VariableDeclaration>
    </ClassMember>
    <FunctionDeclaration constructor="yes" name="Point" modifiers="public">
      <FunctionArguments>
        <Parameter name="ix" type="int"/>
        <Parameter name="iy" type="int"/>
      </FunctionArguments>
      <BlockStatement>
        <ExpressionStatement>
          <AssignExpression>
            <OperatorType type="="/>
            <Operand id="0">
              <Variable name="x"/>
            </Operand>
            <Operand id="0">
              <Variable name="ix"/>
            </Operand>
          </AssignExpression>
        </ExpressionStatement>
        <ExpressionStatement>
          <AssignExpression>
            <OperatorType type="="/>
            <Operand id="0">
              <Variable name="y"/>
            </Operand>
            <Operand id="0">
              <Variable name="iy"/>
            </Operand>
          </AssignExpression>
        </ExpressionStatement>
      </BlockStatement>
    </FunctionDeclaration>

…
                <BinaryExpression>
                  <OperatorType type="+"/>
                  <Operand id="0">
                    <BinaryExpression>
                      <OperatorType type="+"/>
                      <Operand id="0">
                        <Variable name="x"/>
                      </Operand>
                      <Operand id="0">
                        <Data type="string" value=","/>
                      </Operand>
                    </BinaryExpression>
                  </Operand>
                  <Operand id="0">
                    <Variable name="y"/>
                  </Operand>
                </BinaryExpression>
              </Operand>
            </FunctionParameters>
          </FunctionCallExpression>
        </ExpressionStatement>
      </BlockStatement>
    </FunctionDeclaration>
  </ClassDeclaration>
</APL>

Figure 5.2: class Point.xml

As the previous figure shows, the APL format is not Java code or code in any other programming
language. It represents the program instructions at a very abstract level: it describes the
program, adds more detail and makes it general and not tied to a specific programming language.
This information helps the converter to convert the program to another programming language.
Observe that "System.out.println" has been recognized and represented as Consol.println to make it
more abstract. Later, Consol.println will be converted to the appropriate method that prints to the
standard output of the target language; for instance it will be converted to printf in the case of
C, to printf/cout in C++, to writeln in Pascal, and so forth.

The next step is converting Point.xml from its abstract form back to Java, in order to show that
the translator does not change the functionality of the program, and then comparing the original
source code with the output. By using apl2o (Abstract Programming Language to Other) and the Java
configuration file, Point.xml is translated to Java source code.

apl2o Point.xml -java.conf

Figure 5.3: class Point.java generated from Point.xml

The figure above shows that the program was successfully converted back to Java. There are a few
differences, but they do not affect the functionality; for example, comments have been removed.
The converter generates formatted code even if the original code was not formatted, so it could
also be used to reformat code, organize it and increase its readability.

5.2 Second scenario

In this scenario the same Java program, Point.java, is converted to C++, after first being
converted to the abstract form (XML).

By executing the following command

apl2o Point.xml cpp.conf

we will get the program in C++ syntax

Figure 5.4: class Point.cpp generated from Point.xml

previous diagram which also shows that "System.out.println(x+","+y)" function in the original

Point.java program has been translated to "cout<<x<<","<<y<<endl".
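One way in which a nested "+" expression such as the argument of this call could be turned into a chain of "<<" operators is sketched below. This is an illustration and not the code of the translator: the expression classes Var, Str and Plus and the method toCout are hypothetical, and the sketch assumes that every "+" in the argument is a string concatenation, which is the only case in which such flattening preserves the meaning.

```java
class CoutSketch {
    interface Expr {}

    static class Var implements Expr {
        final String name;
        Var(String name) { this.name = name; }
    }

    static class Str implements Expr {
        final String value;
        Str(String value) { this.value = value; }
    }

    static class Plus implements Expr {
        final Expr left, right;
        Plus(Expr left, Expr right) { this.left = left; this.right = right; }
    }

    /** Appends one "<<operand" per leaf, visiting the operands from left to right. */
    static void flatten(Expr e, StringBuilder out) {
        if (e instanceof Plus) {
            flatten(((Plus) e).left, out);
            flatten(((Plus) e).right, out);
        } else if (e instanceof Var) {
            out.append("<<").append(((Var) e).name);
        } else {
            out.append("<<\"").append(((Str) e).value).append('"');
        }
    }

    /** Renders the argument of the abstract print call as a C++ output statement. */
    static String toCout(Expr argument) {
        StringBuilder out = new StringBuilder("cout");
        flatten(argument, out);
        return out.append("<<endl").toString();
    }

    public static void main(String[] args) {
        // (x + ",") + y, the same nesting as the BinaryExpression tags in Point.xml
        Expr argument = new Plus(new Plus(new Var("x"), new Str(",")), new Var("y"));
        System.out.println(toCout(argument)); // prints: cout<<x<<","<<y<<endl
    }
}
```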

5.3 Third scenario

This scenario checks the performance, reliability and portability, so a more complicated example
is needed here. A program with complex instructions has been developed in Java for the sake of
testing. The program first creates a dynamic list with the Fibonacci numbers and then prints it.
Then the list is copied into an array, and all members of the array greater than a given number are
printed in reverse order. The last step is to sort the array in reverse order using a simple sort
algorithm and print its elements. The figures below show some parts of the program; for the full
source code see Appendix B.

Figure 5.5: method fib, class Fibonacci.java

Figure 5.6: Simple sort algorithm, method sortArray, class Fibonacci.java
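For readers without access to the figures, the steps of the test program described above can be sketched as follows. This is not the Fibonacci.java of Appendix B; the class FibonacciSketch, its method names and the choice of an exchange sort are assumptions that only follow the textual description.

```java
import java.util.ArrayList;
import java.util.List;

class FibonacciSketch {
    /** Builds a dynamic list with the first n Fibonacci numbers (1, 1, 2, 3, ...). */
    static List<Integer> fibList(int n) {
        List<Integer> list = new ArrayList<>();
        int a = 1, b = 1;
        for (int i = 0; i < n; i++) {
            list.add(a);
            int next = a + b;
            a = b;
            b = next;
        }
        return list;
    }

    /** Copies the list into a plain array. */
    static int[] toArray(List<Integer> list) {
        int[] array = new int[list.size()];
        for (int i = 0; i < array.length; i++) {
            array[i] = list.get(i);
        }
        return array;
    }

    /** Collects, in reverse order, all members of the array greater than the limit. */
    static List<Integer> greaterThanReversed(int[] array, int limit) {
        List<Integer> result = new ArrayList<>();
        for (int i = array.length - 1; i >= 0; i--) {
            if (array[i] > limit) {
                result.add(array[i]);
            }
        }
        return result;
    }

    /** Simple exchange sort into descending order. */
    static void sortDescending(int[] array) {
        for (int i = 0; i < array.length - 1; i++) {
            for (int j = i + 1; j < array.length; j++) {
                if (array[j] > array[i]) {
                    int tmp = array[i];
                    array[i] = array[j];
                    array[j] = tmp;
                }
            }
        }
    }

    public static void main(String[] args) {
        List<Integer> list = fibList(6);
        System.out.println(list);                          // [1, 1, 2, 3, 5, 8]
        int[] array = toArray(list);
        System.out.println(greaterThanReversed(array, 2)); // [8, 5, 3]
        sortDescending(array);
        System.out.println(java.util.Arrays.toString(array)); // [8, 5, 3, 2, 1, 1]
    }
}
```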

This Fibonacci program is converted to XML and then back to Java. Both the original program and
the converted one are compiled and executed on Windows XP and Ubuntu Linux to show that the
translator is really portable and works on different platforms. Then some measurements are
collected and discussed. The same commands as in the previous scenarios were used here. See
Appendix B for the results of converting to XML and then from XML to Java.

System specifications

Microsoft Windows XP and Ubuntu 7, running on the same computer with the following hardware
specifications:

Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz

512 MB of RAM

                  Original file   XML file   Generated source code   Java compile time

Windows XP                        390 ms     20 ms                   2.1 sec

Ubuntu 7 Linux                    373 ms     18 ms                   2.6 sec

Size in bytes     8192            53248      7521

Table 5.1: performance, portability, and reliability measurements
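The size figures in Table 5.1 can be put into proportion with a short calculation. The sketch below assumes that the three size values correspond, in order, to the original file, the XML file and the generated source code; the class and method names are chosen for this example only.

```java
import java.util.Locale;

class SizeRatioSketch {
    /** Returns how many times larger (or smaller) a file is than the reference file. */
    static double ratio(long size, long reference) {
        return (double) size / reference;
    }

    public static void main(String[] args) {
        long original = 8192, xml = 53248, generated = 7521; // bytes, as read from Table 5.1
        // prints: XML / original = 6.50
        System.out.println(String.format(Locale.ROOT, "XML / original = %.2f", ratio(xml, original)));
        // prints: generated / original = 0.92
        System.out.println(String.format(Locale.ROOT, "generated / original = %.2f", ratio(generated, original)));
    }
}
```

Under that reading, the XML representation is 6.5 times the size of the original source file, while the regenerated Java source is slightly smaller than the original.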

CHAPTER 6: RESULTS

The purpose of the thesis work was to build a universal high-level language translator. An
Abstract Programming Language (shown in Appendix A) has been created to represent computer
programs in an abstract form that does not belong to any specific language. The APL uses an XML
syntax, which makes programs represented in it easy to parse and understand, and platform
independent. The translator uses APL as an intermediate representation. The solution was built on
open source technologies, including Java and XML, and has good performance. Support for new
languages can in most cases be introduced without writing even a single line of code; only a
simple configuration file has to be written using a regular text editor. The sections below show
where the translator could be used.

6.1 Translator

The main purpose of the solution is translation from one programming language to another,
replacing manual translation to save time and avoid mistakes.

6.2 Code re-factoring

It facilitates code parsing. Parsing source code in a high-level language format is always a very
complicated process. By using this translator, source code can be converted to XML, which is very
simple to parse; many XML parsers are available on the Internet for free, and they are easy to
learn and use.

"In software engineering, "refactoring" a source code module often means modifying without

changing its external behavior, and is sometimes informally referred to as "cleaning it up"."[23]

6.3 Source code reformatting and Programming Style

It could be used for reformatting source code by removing unnecessary white space and arranging
the statements according to a specific style to increase the readability of the source code.

"Programming style refers to a set of rules or guidelines used when writing the source code for a

computer program. It is often claimed that following a particular programming style will help

programmers quickly read and understand source code conforming to the style as well as helping to

avoid introducing faults." [24]

6.4 Could be used as pseudo code

6.5 Publishing the source code on the Internet

Publishing source code on the Internet is a big issue, especially for public algorithms such as
encryption, sorting and searching, because one has to decide in which programming language it
should be published. In most cases the solution is either to publish it in many programming
languages or to publish it as pseudo code. The first solution leads to duplication, and the second
one requires more time and effort to understand the pseudo code and write it in a specific
programming language manually. The best solution to the problem is to publish the source in XML
format, from which it can easily be converted to the target language, and even to a new language
in the future.

CHAPTER 7: DISCUSSION/ANALYSIS

7.1 Advantages and Limitations of the study

Due to the size of the project it is obvious that it will have a number of limitations and flaws.
Some of them are due to the limited amount of time dedicated to the thesis work, but others are
simply normal for software dealing with a problem that will always remain open and need further
work. There are also other problematic areas within this work that exist mainly because the
problems and goals that the thesis tries to fulfill had to be prioritized. The limitations and
disadvantages of the whole work are discussed below.

- The project handles mainly semantically equal parts, with a few exceptions. Only a small number
of semantically different parts in Java and C++ are handled by the translator. For example, if an
array is created in Java, the following syntax is used:

int array[] = new int[10];

array[0] = 5;

array[1] = 10;

…

In Java arrays are objects and are created with new. That code is translated to C++ in the
following way:

int array[10];

array[0] = 5;

array[1] = 10;

…

- Only file-by-file translation is supported. Hence if the project consists of more than one file,
each file has to be translated separately.

- Only one class per file is supported. Hence files containing more than one class cannot be
translated.

- The program in the XML file is very large compared to the original one.

- There is no procedure established for handling C or C++ header files. Hence introducing support
for parsing C or C++ programs will also need the implementation of an algorithm that handles them.

- Using XML as an intermediate representation seems a little unnatural for a compiler. Although
having an XML representation of a program is valuable, the intermediate representation could be
changed to something at a lower level that is better suited to a compiler, and the XML could be
just another output. However, that would have a positive impact mainly on performance and would
introduce some problems into the overall design of the project, which is aimed to be easily
understood and extended.

project gives another different view that deserves to be analyzed. Below there is a short list

describing the main advantages of the project:

- It is free: it only uses technologies which are open source and free of charge.

- It can be upgraded to work for other high-level computer languages. The existing implementation
translates only Java programs to XML, Java and C++. However, introducing support for a new
language is easy and can be done without much effort. The procedure for implementing a new
language is straightforward and involves almost no programming effort.

- The source can easily be understood and maintained.

- The design of the project ensures that extending it does not lead to source code that is
difficult to maintain and extend again at a later stage.

The authors' overall impression of the thesis work is that, although it has many disadvantages and
limitations, it remains valuable, deserves a closer look and gives a good basis for later
improvement that will make it more usable in the real world. It achieves its goals to a large
extent, and that makes it different from what has been created so far.

7.2 Further work
