Master Thesis
Computer Science
Thesis no: MCS-2008-17
January 2008
High-level programming languages translator
Mohammed Salih
Ognyan Tonchev
Department of
Interaction and System Design
School of Engineering
Blekinge Institute of Technology
Box 520
This thesis is submitted to the Department of Interaction and System Design, School of Engineering
at Blekinge Institute of Technology in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science. The thesis is equivalent to 20 weeks of full-time studies.
Contact Information:
Authors:
Mohammed Salih
E-mail:
m.mustafams@gmail.com
Ognyan Tonchev
E-mail:
otonchev@gmail.com
University advisor:
Mia Persson
mia.persson@bth.se
Department of Interaction and System Design
ABSTRACT
This thesis discusses a high-level language translator. If translators of programming languages are
divided into two types, those working for two specific languages and universal translators that can
translate between different programming languages, the solution presented in this work can be
classified as both a language-specific and a universal translator. For the purpose of the research
it was limited to translating only from Java to C++, but it can easily be extended to translate
between any other high-level languages. To simplify the process of translation the project uses an
intermediate step. All programs in the input language are first compiled to an abstract XML language
and then to the desired output language. That way it is not necessary to translate directly from one
programming language to another, which is a tricky and difficult task and could make the solution
difficult to maintain and extend. Hence the translator can also be used to translate from any
high-level language to XML. That gives another advantage to our solution: an XML representation of a
computer program is valuable information by itself. We describe the design and implementation of
the solution, demonstrate how it works and also give information on how it can be extended to work
for any other programming language.
Preface
CONTENTS
Abstract
Preface
Table of contents
Introduction
Chapter 1: Background
1.1 Language Translators/Compilers
1.2 Components of a Translator/Compiler
1.3 Previous work
1.4 Universal high-level language translator
1.5 Why XML?
Chapter 2: Problem definition/Goals
2.1 Problem definition
2.2 Goals
2.2.1 High-level language translator at low cost
2.2.2 Universal
2.2.3 Portable
2.2.4 Extendable
2.2.5 Showing good performance
2.2.6 Reliable
Chapter 3: Methodology
3.1 Preparation work
3.2 Choosing the technologies
3.3 Designing the project
3.4 The case study
Chapter 4: Theoretical work
4.1 Implementation language - Java
4.2 The intermediate level - XML
4.3 Java CC
4.4 The xml parser
4.5 Putting them all together
4.5.1 L1->XML
4.5.2 XML->L2
4.6 Introducing support for a new language
4.6.1 New input language
4.6.2 New output language
Chapter 5: Case study
5.1 First scenario
5.3 Third scenario
Chapter 6: Results
6.1 Translator
6.2 Code re-factoring
6.3 Source code reformatting and Programming style
6.4 Pseudo code
6.5 Publishing the source code on the Internet
Chapter 7: Discussion/analysis
7.1 Advantages and Limitations of the study
7.2 Further work
Summary
References
Appendix A: The abstract programming language
Appendix B: Case study
INTRODUCTION
- Motivation for the work
Many programming languages have been used in the past and more and more are being used today.
The need to translate programs written in one high-level computer language to another arises more
often nowadays. There are millions of lines of code written in old-fashioned languages that need
to be rewritten. A lot of companies have already invested money in developing products but now
want to move to a new platform in order to use the opportunities it provides. There are also many
algorithms available on the Internet, written in popular programming languages, that people would
like to reuse in other languages they are familiar with.
Manual translation of programs from one language to another is hard and in most cases an impossible
task. Either it is hard to find programmers familiar with the source language, or investing in
people to handle this process is not worth the money. Individuals not familiar with the language
used to implement a solution they have found on the Internet have to spend a lot of time struggling
with manual translation, almost reconstructing the system from its functional requirements. Some
kind of automated approach is needed. But is it possible to create such an automated translator?
Why would it be difficult to program a universal translator, since the concepts behind computer
languages are common: for instance loops, branch statements and mathematical expressions,
procedures and functions in structured programming, classes with methods and attributes in OOP?
How could those difficulties be minimized?
- Related work
There have been many attempts in the past to implement high-level computer language translators.
Some of them were developed just to solve small problems such as converting between different
versions of the same computer language. Others supported different languages but translated in one
direction only, as is the case in [1], which can translate only from Smalltalk to C++. There are
many approaches that could be used for translating from one high-level computer language to
another: for example, the translation can be done directly, or by using intermediate code in order
to facilitate the work[2]. In this work a universal language translator is addressed in detail.
There is a description of the technologies it uses, the overall design and implementation, and also
a description of what makes it different from the existing solutions.
- A simple overview of the solution
The need to translate one programming language to another arises because new programming languages
appear all the time. These languages differ in features and purpose (there must be a purpose behind
the existence of any new language). Some languages can use the benefits of others directly, without
being translated to that language: for instance, all the features of Java can be used from PHP or
Perl [3], and a scripting language can be used inside Java code, as with Java/JavaScript
integration [4].
- The thesis statement/Contribution
This thesis introduces a universal translator for high-level programming languages. A new
Abstract Programming Language using XML (APL) is introduced for the purpose of translation; it is
used as the intermediate representation of a program's source code.
- APL as standard for pseudo code
Pseudo code describes an algorithm using natural language[7]. Publishing the source code of an
algorithm on the Internet raises a simple question: in which computer language should it be
published? Java, C++ or C#, maybe. Either it is published in many programming languages, or as
pseudo code which programmers then have to code again in a particular language. Either way is
time-consuming. The idea here is to use APL as the pseudo code representation: source code could be
published on the Internet in APL format, and later anyone can use the universal translator to
convert APL to the target language he or she wants.
"Pseudo code is an informal language that helps programmers develop algorithms. It is similar to
everyday English; it is convenient and user-friendly, but it is not an actual computer programming
language" [8]
The research, analysis, design and implementation of both the translator and the APL are covered
in detail in the rest of the thesis.
- Discussion regarding the format of the rest of the thesis
CHAPTER 1: BACKGROUND
1.1 Language Translators/Compilers
A compiler is a program that can read a program in one language - the source language - and
translate it into an equivalent program in another language - the target language. An important role
of the compiler is to report any errors in the source program that it detects during the translation
process.[9]
Compilers are most often used to translate a program written in a high-level computer language
to machine code. The machine code can then be executed to produce some desired output. An example
is the widely popular GNU Compiler Collection (gcc), a standard collection of compilers used on
Unix-like systems nowadays.[10] With this set of tools one can compile a program written in C, C++,
Ada, Fortran and some other computer languages to an executable program.
Example 1.1: compiling a simple C program using gcc
Although translating computer programs written in one high-level language to another high-level
language is a rare use of a compiler, there have been a number of such attempts through the years.
One of the earliest is a project developed by the PascAda group at the University of California in
the early 80s[11]. It is a translator that converts source code written in Ada to Pascal and vice
versa. It will be discussed in detail later in this chapter, since it uses a very simple and
straightforward method for high-level language translation which is also used in almost all of the
previous works, including the Gypsy-to-Ada Program Compiler[12] for example.
1.2 Components of a Translator/Compiler
optimization steps are optional and sometimes can be skipped. The syntax tree that is the output
from the Syntax Analyzer is called an Abstract Syntax Tree (AST), a tree representation of the
program being compiled. Its format can be seen in Figure 1.2 below.
Figure 1.1: Phases of a compiler[9]
Figure 1.2: Syntax tree[9]
There is not much difference when it comes to high-level-language to high-level-language
transformation. Only the back end differs: it produces output in a high-level computer language
instead of machine code. Usually in that case the intermediate representation is some kind of
subset of the semantically equal parts of the two high-level languages that take part in the
translation. That way, if a program written in one of the two languages is transformed to the
intermediate representation, transforming it to the other language is straightforward. Of course,
if there are also syntactic differences between the two languages, they have to be processed in a
different way.
1.3 Previous work
Either because it cost too much effort or for some other reason, it is rare to find work
dedicated to high-level-language to high-level-language translation. The existing solutions work
only for two specific languages and use a fairly straightforward translation technique. The
PascAda project mentioned above is one of the first documented translators of this kind. The
algorithm that it uses, and that was later used by other solutions, is explained here.
Example 1.2 Java Class translated to C structure
1.4 Universal high-level language translator
The implementation discussed in this report is a high-level language translator that can easily be
changed and configured to translate between any two high-level computer languages. For the purpose
of this research it is, however, limited to translating only from Java to C++. It will show that
such a translation can be implemented at low cost using two standard “tools”.
The heart of the front end of the compiler is a language parser generated by the popular JavaCC
parser generator[13]. This is an open source project developed in the Java programming language.
It takes a language structure description (a grammar file in a special format[14]) as input and
generates a parser in Java for the language described by the grammar file. By default the parser
does not generate any output; it just parses and analyses programs. It is the programmer's task to
change the parser to generate some desired output. Another “tool” that is used is an add-on for
JavaCC called JJTree[15]. JJTree allows the parser generated by JavaCC to produce an AST (the
syntax tree described above that is part of any compiler).
The intermediate representation of the input program that the front end produces is a document in
XML format. It is a special language designed and implemented as part of this work. A discussion
motivating this approach follows later. The XML language is built on one set, the semantics of
C++, plus some additions that make life easier when translating Java programs to XML. That way
every program written in either of the two languages can easily be transformed and presented as
XML. Since C++ is a very complex language that has not only high-level capabilities but also
low-level ones, it should not be difficult to represent programs written in other high-level
languages (such as C, Pascal, Ada) in that XML format.
The back end of the translator is the same for all high-level languages. It uses XML files to read
language-specific grammar descriptions. That information is then used to translate the
intermediate representation to the desired output language. This way support for a new language
can be introduced by editing a grammar file and changing a configuration file within the
translator.
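To make the role of the back end more concrete, the following is a minimal sketch rather than the translator's actual code. It renders an expression from the XML intermediate representation as target-language text. The element names (BinaryExpression, OperatorType, Operand, Variable, Data) follow the Fibonacci.xml listing shown later in Chapter 4. The class name ExpressionBackEnd and the operator table are hypothetical; the table stands in for the language description file that the real back end reads.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical back-end fragment: turns an expression subtree of the XML
// intermediate representation into infix text for the output language.
class ExpressionBackEnd {
    // Assumed operator table (XML operator name -> target symbol); in the
    // translator this information would come from a language description file.
    private final Map<String, String> operators;

    ExpressionBackEnd(Map<String, String> operators) {
        this.operators = operators;
    }

    String render(Element e) {
        switch (e.getTagName()) {
            case "Variable":
                return e.getAttribute("name");
            case "Data":
                // Literal values are emitted as-is; string quoting is not handled here.
                return e.getAttribute("value");
            case "Operand": {
                Element inner = childElements(e).get(0);
                String text = render(inner);
                // Parenthesize nested binary expressions instead of modelling precedence.
                return inner.getTagName().equals("BinaryExpression") ? "(" + text + ")" : text;
            }
            case "BinaryExpression": {
                // Expected children: OperatorType followed by two Operand elements.
                List<Element> kids = childElements(e);
                String type = kids.get(0).getAttribute("type");
                String symbol = operators.getOrDefault(type, type);
                return render(kids.get(1)) + " " + symbol + " " + render(kids.get(2));
            }
            default:
                throw new IllegalArgumentException("Unsupported element: " + e.getTagName());
        }
    }

    // Returns the element children of a node, skipping whitespace text nodes.
    static List<Element> childElements(Element parent) {
        List<Element> out = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (nodes.item(i) instanceof Element) {
                out.add((Element) nodes.item(i));
            }
        }
        return out;
    }

    // Parses an XML string with the JDK's built-in DOM parser.
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> ops = new HashMap<>();
        ops.put("le", "<=");
        String xml = "<BinaryExpression><OperatorType type=\"le\"/>"
                + "<Operand id=\"0\"><Variable name=\"n\"/></Operand>"
                + "<Operand id=\"0\"><Data type=\"int\" value=\"1\"/></Operand>"
                + "</BinaryExpression>";
        System.out.println(new ExpressionBackEnd(ops).render(parse(xml)));
    }
}
```

Keeping the operator symbols in a table rather than in code reflects the idea that a new output language should mostly require new configuration data and little new code.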
1.5 Why XML?
Sometimes even a machine code representation of the initial source code is used[17]. The simple
answer to the question is: it is an attempt at an easily maintainable and extendable approach. A
more detailed explanation follows below:
- As already discussed, a parser for any high-level language can easily be generated using
standard tools. Some coding is also needed, but only to integrate the parser within the project.
The intermediate representation should be in a format that can easily be understood, extended if
needed, and parsed. XML is the natural choice here. It is a standard for storing and transferring
data. It is software and hardware independent, and anyone involved in programming understands it
to some extent. There are a number of already developed XML parsers for all modern platforms.
- The project can be used to translate programs to XML, and such a representation is valuable by
itself. There is one popular work in this area that runs on Unix-like systems and translates from
C to XML[18]. It is an extension to gcc, but it is of limited use because many features are not
supported. However, such an XML representation can be used for all kinds of code analysis by
anyone, without the need to put great effort into dealing with compiler internals.
CHAPTER 2: PROBLEM DEFINITION AND GOALS
2.1 Problem definition
The problem of translating high-level languages is now getting more attention than ever. This is
due to the variety of programming languages in use nowadays, not only the new languages but also
the old-fashioned ones. Millions of lines of code written in old-fashioned languages are available
on the Internet. Code written in these old languages is also still in production due to the high
cost of translation to modern languages. Either people not familiar with programming at all are
using the old programs and still have to deal with old language interpreters, or the projects
written in old-fashioned languages, which also perform badly, are so big that manual translation
by a programmer would cost too much money. Hence the need for an automated translator.
Although the problem is important, not much has been done in this area so far. There are a number
of high-level language translators, but none of them can be used as a universal one or even
extended to work for other languages. Most of these works support only two specific languages,
like the Fortran to QuickBasic translator[6] or the Smalltalk to C++ translator[1]. Neither of
them is designed to work as a universal translator; they are also written in C, which makes them
less portable. The Ada to Pascal project[11] supports translation from Ada to Pascal and vice
versa, and the authors also state that the translation procedure can be applied to all high-level
languages. The problem with this compiler is that, although it was developed as a universal one,
the process of introducing support for another language is not straightforward. It requires good
programming skills and a lot of coding, including development of a language parser, an
intermediate representation and so on. The translator was developed almost 30 years ago, and the
procedures that could be applied to the old-fashioned languages cannot be applied to modern ones
due to the large number of semantic differences.
The project developed as a result of this thesis work deals with the problems discussed above.
The new translator supports translation from Java to C++ but can easily be extended to work for
other high-level languages. It is also portable and can be installed on both Windows and Unix-like
systems. However, it also has to meet other requirements, which are discussed later in this
chapter.
The research questions that this thesis will try to answer are:
- Is it possible to convert programs written in one programming language to another?
- What are the difficulties?
- How could they be minimized?
- Why would it be difficult to program a universal translator since the concept behind computer
languages is common?
2.2 Goals
The project developed as a result of this thesis work meets the goals described below.
2.2.1 High-level language translator at low cost
highly popular technologies will be used. The Java language used for developing the project is also
extremely popular nowadays.
The heart of the front end of the compiler is a language parser generated by the popular JavaCC
parser generator[13]. This is an open source project developed in the Java programming language.
It takes a language structure description (a grammar file in a special format[14]) as input and
generates a parser in Java for the language described by the grammar file. By default the parser
does not generate any output; it just parses and analyses programs. It is the programmer's task to
change the parser to generate some desired output. Another “tool” that is used is an add-on for
JavaCC called JJTree[15]. JJTree allows the parser generated by JavaCC to produce an AST (the
syntax tree described above that is part of any compiler).
The intermediate representation of the input program that the front end produces is a document in
XML format. It is a special language designed and implemented as part of this work. A discussion
motivating this approach follows later. The XML language is built on one set, the semantics of
C++, plus some additions that make life easier when translating Java programs to XML. That way
every program written in either of the two languages can easily be transformed and presented as
XML. Since C++ is a very complex language that has not only high-level capabilities but also
low-level ones, it should not be difficult to represent programs written in other high-level
languages (such as C, Pascal, Ada) in that XML format.
2.2.2 Universal
The high-level language translator should be universal. That means the algorithms used for the
language translation should be applicable to all high-level languages to the greatest possible
extent.
2.2.3 Portable
The word universal also means that the project can easily be installed on different platforms.
Nowadays a variety of operating systems are in use, and the translator should work at least on
Windows and Unix-like systems. This is achieved by using Java for developing the project plus XML
for storing and transferring the data needed for the translation procedure.
2.2.4 Extendable
The high-level language translator should be extendable. That means support for any high-level
computer language can be introduced through configuration, e.g. by editing an XML file, or with
little programming effort by following a straightforward procedure that is described in detail.
2.2.5 Showing good performance
The translator has to show good performance when translating programs. Since there is no basis
for direct comparison, performance is measured against existing and popular compilers or
interpreters.
CHAPTER 3: METHODOLOGY
3.1 Preparation work
As preparation for writing this thesis, a thorough literature survey was done to find the existing
solutions and any written material on the topic being researched here. To find the material, two
digital libraries containing a lot of technical literature and research papers in the computer
science field were chosen: IEEE Xplore and ACM Digital Library. These two databases were chosen
since they are considered to be among the richest ones, containing decades of archives. They also
have a very good reputation among scientists.
The keywords used to search for material in both the IEEE and ACM digital libraries were: language
translator, source to source, language convertor, language compiler. Although there is a lack of
good previous work related to the chosen topic, what was found through this search provided a
solid base on which to develop a new and better project that resolves some of the problems in the
previous solutions.
Since the project developed as a result of this thesis is basically a compiler, good material on
compiler design and implementation was needed. Some teachers involved in compiler-related work
were asked to recommend a good book. Amazon was also searched for books about compilers. The
keyword used was “compiler”. The book reviews and readers' opinions were examined in order to
choose the better book. The book that was found is referenced in the Background section of this
thesis and was also used for collecting ideas on how to build the high-level language translator.
3.2 Choosing the technologies
Besides collecting the literature, the preparation work for this thesis also included a search for
technologies to be used when building the translator. These technologies have to fulfill several
criteria in order for the goals to be achieved.
needs of the translator.
Another important point is choosing the XML parser. This step is vital to the project, since there
are basically two technologies for building XML parsers, and choosing one of them also leads to
design changes in the back end of the translator (the part that translates the intermediate
representation to the destination language). The criteria used for choosing between the two
technologies are described later in this thesis. To find a list of XML parsers a Google search was
performed. The keywords used were: “free xml parser”, “open source xml parser”. The parsers were
then rated by the following criteria: how popular and reliable the parser was, how fast it was,
the level of documentation, and ease of use.
The procedure for choosing other technologies, such as the implementation language of the parser
and the format of the intermediate representation, is described later in this thesis in the
Theoretical part. The decisions were based on the experience of the authors of the thesis.
3.3 Designing the project
The universal language translator was developed to combine all the technologies mentioned above
and to meet the goals already described to the greatest extent possible. The whole project was
designed to fulfill the following design requirements:
− The project should be structured in functionally different parts: the front end of the
translator, which converts an input language to the intermediate representation, and the back end,
which is responsible for converting the intermediate representation to the destination language.
− The front end and the back end should also be designed with classes that separate different
logic into different objects. For example, the language parser only parses the input language and
knows nothing other than how to parse. Other classes are responsible for going through the
abstract syntax tree that it produces and for generating the XML, which is the intermediate
representation.
− The project should not only work as a whole, translating a program written in one high-level
computer language to another, but should also be usable for translation from a high-level computer
language to XML and vice versa.
3.4 The case study
This work also uses a case study as a research method: programs are translated from Java to XML,
then back to Java and to C++. The case study data is then analyzed to get an indication of the
extent to which the translation works and how reliable it is. This test is important because the
authors believe that the translator works and can be used in production, but given the limitations
and the wide variety of computer programs available it has to be shown whether everything expected
to be supported really is supported. Measurements are also taken of the performance and the
reliability of the project under Windows and UNIX-like systems.
The second scenario converts the same Java program from the first scenario to C++, in order to see
how the translator deals with syntax and methods when converting to a different programming
language.
CHAPTER 4: THEORETICAL WORK
4.1 Implementation language - Java
The main goal of this thesis is to create a cheap and portable compiler from one high-level
computer language to another. It also has to be easily maintainable and extendable by people not
involved in its design and implementation. The language that naturally comes to mind is Java. It
is a language that a large share of programmers working in high-level computer languages
understand and use. In some sense it is simpler than C++, and programs written in it are easier to
read and maintain. It is also easy to find a piece of code or a program that you need already
available for free on the Internet in Java. Making a project written in a high-level computer
language other than Java run on a platform different from the one it was initially developed on
can be a tricky and sometimes impossible task. Java was developed as a platform-independent
language, or at least it is supposed to be. There is no perfect world, however, and that holds for
Java as well. There are some compatibility issues, but it is platform-independent to a great
extent, and writing in Java saves a tremendous amount of time when developing code that is
supposed to work on different platforms.
4.2 The intermediate level - XML
It was already discussed in this work that the intermediate level is an important part of any
compiler. In this project the intermediate representation is in a special XML format. This is a
language that was created for the purpose of this work. Its documentation and description are
available in Appendix A. Figure 4.1 below shows a simple Java program that prints the first N=20
Fibonacci numbers:
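The listing of Figure 4.1 did not survive in this copy of the text. The sketch below is a reconstruction inferred from the XML in Figure 4.2, not the authors' original file. The class, field and method names, types and modifiers are taken from that XML, except that the class is left package-private here (the XML records it as public) so that the snippet compiles in any file. The base-case constant, the first recursive argument, the addition of the two calls and the loop bounds are elided in the XML and are assumptions.

```java
import java.io.*;

// Reconstruction based on Fibonacci.xml; details marked "assumed" are not in the source.
class Fibonacci {
    private final static int MAX = 20;

    public static long fib(int n) {
        // The XML records an "le" comparison on n; the constant 1 is assumed.
        if (n <= 1) {
            return n;
        } else {
            // fib(n - 2) is in the XML; fib(n - 1) and the "+" are assumed.
            return fib(n - 1) + fib(n - 2);
        }
    }

    public static void main(String[] args) {
        // Loop start value and condition are assumed; the XML shows only the variable i.
        for (int i = 0; i < MAX; i++) {
            System.out.println(fib(i));
        }
    }
}
```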
The output of the front end of the universal language translator is an intermediate
representation of the program in the XML format mentioned above. The resulting XML program is
shown below:
<APL> <Imports> <Import> <Variable name="java"/> <Variable name="io"/> <Variable name="*"/> </Import> </Imports>
<ClassDeclaration name="Fibonacci" modifiers="public"> <ClassMember>
<VariableDeclaration modifiers="private, final, static"> <Type type="int"/>
<Variables>
<Variable name="MAX">
<DeclarationInitializator>
<Data type="int" value="20"/> </DeclarationInitializator>
</Variable> </Variables>
</VariableDeclaration> </ClassMember>
<FunctionDeclaration name="fib" type="long" modifiers="public, static"> <FunctionArguments>
<Parameter name="n" type="int"/> </FunctionArguments> <BlockStatement> <IfStatement> <Condition> <BinaryExpression> <OperatorType type="le"/> <Operand id="0"> <Variable name="n"/> </Operand> <Operand id="0">
</Operand> </BinaryExpression> </Operand> </FunctionParameters> </FunctionCallExpression> </Operand> <Operand id="0"> <FunctionCallExpression name="fib"> <FunctionParameters> <Operand> <BinaryExpression> <OperatorType type="-"/> <Operand id="0"> <Variable name="n"/> </Operand> <Operand id="0"> <Data type="int" value="2"/> </Operand> </BinaryExpression> </Operand> </FunctionParameters> </FunctionCallExpression> </Operand> </BinaryExpression> </ReturnStatement> </Else> </IfStatement> </BlockStatement> </FunctionDeclaration>
<FunctionDeclaration name="main" type="void" modifiers="public, static"> <FunctionArguments>
<Parameter name="args" type="String[]"/> </FunctionArguments> <BlockStatement> <ForStatement> <ForStatementInit> <VariableDeclaration> <Type type="int"/> <Variables> <Variable name="i"> <DeclarationInitializator>
</Operand> <Operand id="1"/> </UnaryExpression> </ForStatementExpression> <ExpressionStatement> <FunctionCallExpression name="println"> <Member> <Variable name="System"/> <Variable name="out"/> </Member> <FunctionParameters> <Operand> <FunctionCallExpression name="fib"> <FunctionParameters> <Operand> <Variable name="i"/> </Operand> </FunctionParameters> </FunctionCallExpression> </Operand> </FunctionParameters> </FunctionCallExpression> </ExpressionStatement> </ForStatement> </BlockStatement> </FunctionDeclaration> </ClassDeclaration> </APL> Figure 4.2: Fibonacci.xml
Why use XML?
As discussed previously in this work, anyone familiar with compilers will probably first ask “Why
XML?” after looking at this approach. At first glance, using XML for the intermediate
representation gives the impression of working at too high a level. It was not possible to find an
existing solution that works at such a high level. Sometimes even a machine code representation of
the initial source code is used[17]. The simple answer to the question is: it is an attempt at an
easily maintainable and extendable approach. A more detailed explanation follows below:
- As already discussed, and discussed in more detail later, a parser for any high-level language
can easily be generated using standard tools. Some coding is also needed, but only to integrate
the parser within the project. The intermediate representation should be in a format that can
easily be understood, extended if needed, and parsed. XML is the natural choice that comes to
mind. It is a standard for storing and transferring data. It is software and hardware independent,
and anyone involved in programming understands it to some extent. There are a number of already
developed XML parsers for all modern platforms. The translator developed as part of this thesis
uses such a standard XML parser, which is discussed later in this chapter. Hence this is another
part of the project that is highly popular and free.
compilers stuff. Imagine in how many ways the program shown in Figure 4.1 can be written. Figure
4.3 below shows the same program written in a different format.
Figure 4.3: Fibonacci2.java
Analysing these two programs does not look like a simple and straightforward task. It is one and
the same program, but it looks different from a parsing standpoint. The XML representation of the
program, however, is one and the same, and it is easy to download an XML parser from the Internet
(there are plenty of them, written in different languages) and use it within any program that
analyses the code.
Of course, using XML has its disadvantages, mainly connected to performance. The XML
representation of high-level language programs is heavy, and hence parsing it is time consuming.
However, this is not an issue for this project, since it focuses on other goals such as creating a
cheap, portable, extendable and easily maintainable translator.
4.3 JavaCC
As discussed earlier in this work, each language compiler consists of two main parts: the front end
and the back end. The heart of the front end of this compiler is a language parser generated by the
popular JavaCC parser generator [13]. This is an open source project developed in the Java
programming language. It takes a language structure description (a grammar file that should be in a
special format [14]) as input and generates a parser in Java for the language described by the
grammar file. By default the parser does not generate any output; it just parses and analyses
programs. It is the programmer's task to change the parser so that it generates the desired output.
Another tool used within the front end is an add-on for JavaCC called JJTree [15]. JJTree allows
the parser generated by JavaCC to produce an abstract syntax tree (APL). It was already shown what
the APL tree looks like.
Why choose JavaCC? What other existing solutions could be used?
There is a wide variety of parser generators available. However, only a few can produce Java code
and are also free or open source. These include JavaCC, ANTLR and SableCC. JavaCC and ANTLR are
similar since they both produce parsers that use the LL(1) algorithm. SableCC, on the other hand,
produces parsers that use the LALR(1) algorithm, which is a variant of the LR parser. This was one
of the reasons SableCC was not considered for this thesis. There are a number of reasons to choose
LL(1) parsing instead of LALR(1):
As stated before, JavaCC and ANTLR are similar parser generators. The comparisons of JavaCC and
ANTLR found on the Internet all indicated that JavaCC is the more popular of the two, and that the
only disadvantages of JavaCC are its poor documentation and its inability to create parsers in
languages other than Java. This was not enough to decide which one to use, since for the purpose of
this thesis the generated parser should be in Java, and hence the latter is not a real
disadvantage. Both parser generators were therefore studied and tested further. This showed that
although ANTLR was better documented than JavaCC, the JavaCC parser was easier to understand and
use. JavaCC generates all the classes needed by the parser, and hence the parser does not need any
external libraries or files. ANTLR, however, uses its own environment, and the generated parser is
not a stand-alone program. One will run into problems when using two parsers (for example a parser
for Java programs and another one for C++ programs) created under different versions of ANTLR.
JavaCC therefore seems to be the more maintainable one. It has been reported to work on all Java
platforms and on countless different machines [20]. JavaCC also has better error reporting, which
shows the exact place where the parsing error was found plus a diagnostic screen. The debugging
capabilities are also better: by using different debug options one can get a deep analysis of the
parsing process. JavaCC also comes with a number of grammars, which makes the process of generating
a language parser easy and straightforward. There is no need to write a grammar file for the Java
language, for example. On the Internet there is also a rich variety of grammar files for the JavaCC
parser generator, including C, C++, Pascal and other popular high-level languages. However, if a
parser written in the C++ language is needed, then ANTLR seems to be the best parser generator to
use.
4.4 The XML parser
The project uses the widely popular and open-source SAX parser [21]. It is used for parsing the XML
intermediate representation of the programs being translated. Before explaining why the SAX parser
was chosen for this project, a short introduction to XML parsing is needed.
There are two main specifications for parsing XML: DOM and SAX [21]. There are a number of parsers
available using either DOM or SAX. The difference between the two technologies is as follows. SAX
parsing is event based. The XML is parsed and, whenever a specific tag is encountered, a dedicated
handler function is called. There is no container that keeps already parsed data, so the programmer
has to keep track of history or build an internal representation if the XML needs to be re-analysed
at a later stage. Of course, that is done only if needed. The DOM technology, on the other hand,
works in a somewhat different way. DOM parsers create an internal tree representation of the XML
that is kept in memory, and navigation through that tree is possible at run time. For that purpose
there are different functions for selecting parent or child nodes. The tree is kept in memory until
processing of the XML is finished.
Parsers using the SAX technology have one major advantage: they perform better than DOM parsers.
They can parse even a document which is gigabytes in size. DOM parsers work very slowly with big
XML documents since they keep the representation of the whole document in memory.
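The difference between the two styles can be illustrated with the parsers that ship with the Java
platform. The following is only a minimal sketch and not code from the translator: the class name
SaxVersusDom and its method names are assumptions made for this example. The SAX part records the
callbacks it receives, while the DOM part builds the whole tree first and then navigates it.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the thesis code.
class SaxVersusDom {
    static final String XML =
        "<VariableDeclaration modifiers=\"private\">"
      + "<Type type=\"int\"/>"
      + "<Variables><Variable name=\"x\"/></Variables>"
      + "</VariableDeclaration>";

    private static InputStream stream(String xml) {
        return new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
    }

    // SAX: the parser calls the handler for every tag; nothing is kept unless we keep it.
    static List<String> saxEvents(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                events.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(stream(xml), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return events;
    }

    // DOM: the whole document is loaded into a tree that can be navigated afterwards.
    static String domFirstChildOfRoot(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(stream(xml));
            return document.getDocumentElement().getFirstChild().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(saxEvents(XML));
        System.out.println(domFirstChildOfRoot(XML));
    }
}
```

With SAX, memory use depends only on what the handler chooses to store, which is why it suits large
intermediate files; with DOM, the complete tree must fit in memory before any node can be visited.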
4.5 Putting them all together
The technologies described above are combined to form the universal high-level language translator.
The translator consists of two executable programs: the first one converts a program in one
language to the XML intermediate representation and the second one converts the intermediate
representation into a destination language. The two parts can work independently, hence the
translator can also be used to translate from a high-level language to XML and vice versa. Figure
4.4 below shows the three ways in which the translator can be used.
Figure 4.4: The universal high-level language translator
4.5.1 L1->XML
The program that translates from the input language L1 to the XML representation consists of at
least one language parser generated by the JavaCC parser generator. There are as many language
parsers as there are supported high-level languages. Each parser also uses the JJTree add-on to
generate an abstract syntax tree. The procedure for generating a language parser is straightforward
and is accomplished by using the JavaCC parser generator. Only one Java class has to be written for
each parser in order to integrate it into the universal language translator. That class goes
through the abstract syntax tree and generates the intermediate representation of the input program
in XML format. A special API is used for generating the XML. Figure 4.5 shows some of the public
methods of that API.
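To give an impression of what such an output API can look like, the following is a minimal sketch
of a tag-writing helper. It is not the XMLOutput class of the translator: the class name
XmlOutputSketch and the methods openTag, emptyTag and closeTag are assumptions made for this
example only.

```java
// Hypothetical helper for writing XML tags; the names are illustrative assumptions.
class XmlOutputSketch {
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;

    // Writes <name a="v" ...> and increases the indentation for nested tags.
    XmlOutputSketch openTag(String name, String... attributes) {
        indent();
        out.append('<').append(name);
        appendAttributes(attributes);
        out.append(">\n");
        depth++;
        return this;
    }

    // Writes a self-closing tag such as <Data type="string" value="..."/>.
    XmlOutputSketch emptyTag(String name, String... attributes) {
        indent();
        out.append('<').append(name);
        appendAttributes(attributes);
        out.append("/>\n");
        return this;
    }

    // Writes </name> one indentation level up.
    XmlOutputSketch closeTag(String name) {
        depth--;
        indent();
        out.append("</").append(name).append(">\n");
        return this;
    }

    // Attributes are passed as alternating name/value pairs.
    private void appendAttributes(String... attributes) {
        for (int i = 0; i + 1 < attributes.length; i += 2) {
            out.append(' ').append(attributes[i]).append("=\"")
               .append(escape(attributes[i + 1])).append('"');
        }
    }

    private void indent() {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    // Escapes the characters that are not allowed inside an attribute value.
    private static String escape(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;")
                    .replace(">", "&gt;").replace("\"", "&quot;");
    }

    @Override
    public String toString() {
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(new XmlOutputSketch()
            .emptyTag("Data", "type", "string", "value", "Hello world!"));
    }
}
```

Keeping all tag writing behind one small helper means that escaping and indentation are handled in
a single place, regardless of which language parser produces the tags.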
Figure 4.5: Public methods in XMLOutput.java
Figure 4.6: APLVisitor.java
That code will print the Data tag representing a string:
<Data type="string" value="Hello world!"/>
4.5.2 XML->L2
The program that translates from the XML intermediate representation to the destination language
consists of four classes: the main class APL, APLhandler, XMLTag and Printer. The diagram below
illustrates how this part works:
Figure 4.7: XML to source code
APL: contains the entry point of the program. It checks the arguments, makes sure that the
configuration file exists and loads it, and then creates an instance of SAXParser to parse the XML
file and an APLhandler to handle the XML events during parsing.
source code instructions. SAXParser notifies the handler each time it finds the start of the
document, the start of an element, the end of an element, character data or the end of the
document, as well as any error that occurred [21]. Every time APLhandler gets a start tag it
creates an XMLTag to hold the tag information and pushes it onto the stack. When it gets the end of
a tag it generates source code for the tag by calling the method XMLTag.getValue(tag), and this
value is added to the XMLTag object that is then on top of the stack. The process continues until
the root element, the "APL" tag, is reached, which then contains the full source code of the
program.
Printer: helps to generate organized and readable source code.
Figure 4.8 shows an XML fragment that represents the variable declaration “private int x;”, and
Table 4.1 demonstrates in detail how these XML tags are converted to source code.
<VariableDeclaration modifiers="private">
  <Type type="int"/>
  <Variables>
    <Variable name="x">
    </Variable>
  </Variables>
</VariableDeclaration>
Figure 4.8: Variable declaration “private int x;” in XML format
The APLHandler deals with events using the following algorithm:
New tag event: create an XMLTag and push it onto the stack
End tag event: pop the tag, get its value, pop its parent tag, add the value to the parent tag,
and push the parent tag back onto the stack
The following table shows in steps how the XML fragment in Figure 4.8 is converted to source
code:
Event                         Response
New tag VariableDeclaration   tag = new XMLTag("VariableDeclaration", attributes);
                              push(tag);
New tag Type                  tag = new XMLTag("Type", attributes);
                              push(tag);
End tag Type                  tag = pop();
                              value = XMLTag.getValue(tag); // the value is the string "int"
                              parentTag = pop(); // the parent tag, in this case VariableDeclaration
                              parentTag.addValue(value);
                              push(parentTag);
New tag Variables             tag = new XMLTag("Variables", attributes);
                              push(tag);
New tag Variable              tag = new XMLTag("Variable", attributes);
                              push(tag);
End tag Variable              tag = pop();
                              value = XMLTag.getValue(tag); // the value is "x"
                              parentTag = pop(); // the parent is the Variables tag
                              parentTag.addValue(value);
                              push(parentTag);
End tag Variables             tag = pop();
                              value = XMLTag.getValue(tag); // the value added by the Variable tag
                                                            // in the previous event, which is "x"
                              parentTag = pop(); // the parent is the VariableDeclaration tag
                              parentTag.addValue(value);
                              push(parentTag);
End tag VariableDeclaration   tag = pop();
                              value = XMLTag.getValue(tag); // it has two values, "int" and "x"
                              // it does not have a parent tag
                              // the function getValue is responsible for generating the source
                              // from the values and the attributes of the tag; in this case the
                              // source is generated as follows:
                              tag.getAttribute("modifiers")+" "+getValue(0)+" "+getValue(1)+";"
                              which equals: private int x;
Table 4.1: Generating source code from XML
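The stack-based algorithm walked through in Table 4.1 can be sketched in a few lines of Java on top
of the standard SAX API. This is only an illustration under stated assumptions and not the
APLhandler or XMLTag code of the translator: the class StackTranslationSketch, its nested Tag class
and the hard-coded rules in getValue are hypothetical, whereas the real translator takes such rules
from a configuration file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of the push/pop translation scheme; not the thesis code.
class StackTranslationSketch {

    // Stand-in for XMLTag: holds the attributes and the values added by child tags.
    static final class Tag {
        final String name;
        final Map<String, String> attributes = new HashMap<>();
        final List<String> values = new ArrayList<>();

        Tag(String name, Attributes saxAttributes) {
            this.name = name;
            // The SAX Attributes object may be reused by the parser, so copy it.
            for (int i = 0; i < saxAttributes.getLength(); i++) {
                attributes.put(saxAttributes.getQName(i), saxAttributes.getValue(i));
            }
        }

        // Hard-coded rules for this example only.
        String getValue() {
            switch (name) {
                case "Type":
                    return attributes.get("type");
                case "Variable":
                    return attributes.get("name");
                case "VariableDeclaration":
                    return attributes.get("modifiers") + " " + values.get(0) + " " + values.get(1) + ";";
                default: // e.g. Variables: join the values of the children
                    return String.join(", ", values);
            }
        }
    }

    static String translate(String xml) {
        final Deque<Tag> stack = new ArrayDeque<>();
        final StringBuilder result = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                stack.push(new Tag(qName, attributes)); // new tag event
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String value = stack.pop().getValue();  // end tag event
                if (stack.isEmpty()) {
                    result.append(value);               // root tag: translation is complete
                } else {
                    stack.peek().values.add(value);     // same effect as pop parent, add, push back
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        String xml = "<VariableDeclaration modifiers=\"private\"> <Type type=\"int\"/> <Variables>"
                   + " <Variable name=\"x\"> </Variable> </Variables> </VariableDeclaration>";
        System.out.println(translate(xml));
    }
}
```

Because a value is only ever passed from a tag to its direct parent, the handler never needs the
whole document in memory, which fits the event-based SAX model.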
4.6 Introducing support for a new language
Introducing support for a new language within the translator has two aspects. The first is
introducing support for a new input language, which involves generating a new language parser with
the JavaCC parser generator. The second is adding support for a new output language, which involves
editing a configuration file which is part of the translator. Both procedures are described in
detail below.
4.6.1 New input language
Implementing support for parsing a new language is a straightforward and simple process involving
only a little programming. The process of introducing a new input language includes:
- Generating a new parser for the language to be implemented. That parser has to be generated
with the JavaCC parser generator with the JJTree add-on included. The process of generating a
new parser is very simple and is not described in this thesis; one can read about it in [13].
The only thing needed for generating a new parser is a grammar file. For the most popular
high-level languages such a file is available on the Internet, but if it is missing, a new one
has to be created by the programmer. The format of the grammar files is described in [14].
- After generating a new parser with the JavaCC parser generator, that parser has to be changed
to generate the intermediate representation: the abstract programming language shown in
Appendix A. That process is straightforward in principle, but it is the trickiest thing that
has to be done. It involves writing a new Java class. That class uses the visitor design
pattern [22]: it goes through the abstract syntax tree built by the language parser and
generates the desired output, in this case XML. The class for the Java language can be seen in
Appendix C. Its name is APLVisitor and it is located in the APLVisitor.java file. Figure 4.6
above shows a small part of that file and how XML is generated by using a dedicated API. The
public methods of that API are shown in Figure 4.5.
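The role of the visitor class in the second step can be sketched as follows. This is a simplified,
hand-written illustration and not the APLVisitor class from Appendix C: the node classes ClassDecl
and VarDecl, the Visitor interface and the XmlVisitor class are assumptions made for this example,
whereas in the translator the node classes are generated by JJTree.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature syntax tree and visitor; not the JJTree-generated classes.
class VisitorSketch {

    interface Visitor {
        void visit(ClassDecl node);
        void visit(VarDecl node);
    }

    interface Node {
        void accept(Visitor visitor);
    }

    static final class VarDecl implements Node {
        final String modifiers, type, name;

        VarDecl(String modifiers, String type, String name) {
            this.modifiers = modifiers;
            this.type = type;
            this.name = name;
        }

        @Override
        public void accept(Visitor visitor) {
            visitor.visit(this);
        }
    }

    static final class ClassDecl implements Node {
        final String modifiers, name;
        final List<Node> members;

        ClassDecl(String modifiers, String name, Node... members) {
            this.modifiers = modifiers;
            this.name = name;
            this.members = Arrays.asList(members);
        }

        @Override
        public void accept(Visitor visitor) {
            visitor.visit(this);
        }
    }

    // Walks the tree and emits one XML element per node, in the style of Figure 4.8.
    static final class XmlVisitor implements Visitor {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void visit(ClassDecl node) {
            out.append("<ClassDeclaration name=\"").append(node.name)
               .append("\" modifiers=\"").append(node.modifiers).append("\">");
            for (Node member : node.members) {
                out.append("<ClassMember>");
                member.accept(this); // double dispatch selects the right visit method
                out.append("</ClassMember>");
            }
            out.append("</ClassDeclaration>");
        }

        @Override
        public void visit(VarDecl node) {
            out.append("<VariableDeclaration modifiers=\"").append(node.modifiers).append("\">")
               .append("<Type type=\"").append(node.type).append("\"/>")
               .append("<Variables><Variable name=\"").append(node.name).append("\"/></Variables>")
               .append("</VariableDeclaration>");
        }

        String result() {
            return out.toString();
        }
    }

    static String toXml(Node root) {
        XmlVisitor visitor = new XmlVisitor();
        root.accept(visitor);
        return visitor.result();
    }

    public static void main(String[] args) {
        System.out.println(toXml(new ClassDecl("public", "Point", new VarDecl("private", "int", "x"))));
    }
}
```

The benefit of the pattern is that the generated node classes stay untouched: all XML generation
lives in the single visitor class, which is the only class that has to be written per language.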
4.6.2 New output language
Translating a program from the XML format to another programming language is a simple process,
because the translator always uses the same code with a different configuration file for each
target language. That makes introducing support for a new output language straightforward: it only
requires adding a new configuration file for the new language. A configuration file is a simple
text file in which each line is a key-value pair describing how a specific instruction in the
abstract format is to be represented in the target language. The figure below shows part of such a
configuration file for the Java language.
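The idea of driving the output purely from key-value pairs can be sketched with the standard
java.util.Properties class. The keys, the %0% placeholder syntax and the class TargetConfigSketch
below are assumptions made for illustration only; they are not the actual format of the
translator's configuration files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical configuration-driven code generation; keys and placeholders are invented.
class TargetConfigSketch {
    // One template per abstract instruction; %0%, %1%, ... are filled in by the translator.
    static final String JAVA_CONF =
          "println = System.out.println(%0%);\n"
        + "variable.declaration = %0% %1% %2%;\n";

    static final String CPP_CONF =
          "println = cout << %0% << endl;\n"
        + "variable.declaration = %1% %2%;\n";

    // Parses "key = value" lines into a Properties object.
    static Properties load(String text) {
        Properties conf = new Properties();
        try {
            conf.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("cannot read configuration", e);
        }
        return conf;
    }

    // Looks up the template for an abstract instruction and substitutes the arguments.
    static String render(Properties conf, String key, String... args) {
        String template = conf.getProperty(key);
        if (template == null) {
            throw new IllegalArgumentException("no template for " + key);
        }
        String result = template;
        for (int i = 0; i < args.length; i++) {
            result = result.replace("%" + i + "%", args[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(render(load(JAVA_CONF), "println", "x"));
        System.out.println(render(load(CPP_CONF), "println", "x"));
    }
}
```

In this sketch the translating code is identical for both targets; only the loaded text differs,
which mirrors the way a new output language is added by supplying a new configuration file.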
CHAPTER 5: CASE STUDY
The solution is built on open technologies (Java and XML), which makes it low cost and portable
because both technologies are platform independent. It is universal for high-level programming
languages and extendable: it just uses a different configuration file for each target language.
This chapter presents a case study which is used to demonstrate the work and to show that the
system achieves its goals.
A computer program written in a specific programming language will be converted to another
programming language. As mentioned before, the translator has an intermediate step, which is
generating intermediate code for the given program in the Abstract Programming Language. This
intermediate code is the computer program in eXtensible Markup Language (XML) format. Another part
of the translator converts from XML to a specific programming language by using a simple
configuration file that contains key-value pairs describing the reserved words and some other
information about the specific language. Several scenarios are covered by the case study.
5.1 First scenario
In this scenario the small Java program shown below is converted to APL and then from APL back to
Java. Some measurements are then taken to demonstrate some of the goals.
Figure 5.1: class Point.java original source code
By using the translator we can convert this program to abstract format APL. Here is the command:
2apl Point.java
Executing the command produces a new abstract program, Point.xml, in XML format as follows:
<APL>
  <ClassDeclaration name="Point" modifiers="public">
    <ClassMember>
      <VariableDeclaration modifiers="private">
        <Type type="int"/>
        <Variables>
          <Variable name="x">
          </Variable>
        </Variables>
      </VariableDeclaration>
    </ClassMember>
    <ClassMember>
      <VariableDeclaration modifiers="private">
        <Type type="int"/>
        <Variables>
          <Variable name="y">
          </Variable>
        </Variables>
      </VariableDeclaration>
    </ClassMember>
    <FunctionDeclaration constructor="yes" name="Point" modifiers="public">
      <FunctionArguments>
        <Parameter name="ix" type="int"/>
        <Parameter name="iy" type="int"/>
      </FunctionArguments>
      <BlockStatement>
        <ExpressionStatement>
          <AssignExpression>
            <OperatorType type="="/>
            <Operand id="0">
              <Variable name="x"/>
            </Operand>
            <Operand id="0">
              <Variable name="ix"/>
            </Operand>
          </AssignExpression>
        </ExpressionStatement>
        <ExpressionStatement>
          <AssignExpression>
            <OperatorType type="="/>
            <Operand id="0">
              <Variable name="y"/>
            </Operand>
            <Operand id="0">
              <Variable name="iy"/>
            </Operand>
          </AssignExpression>
        </ExpressionStatement>
      </BlockStatement>
    </FunctionDeclaration>
    ...
                <BinaryExpression>
                  <OperatorType type="+"/>
                  <Operand id="0">
                    <BinaryExpression>
                      <OperatorType type="+"/>
                      <Operand id="0">
                        <Variable name="x"/>
                      </Operand>
                      <Operand id="0">
                        <Data type="string" value=","/>
                      </Operand>
                    </BinaryExpression>
                  </Operand>
                  <Operand id="0">
                    <Variable name="y"/>
                  </Operand>
                </BinaryExpression>
              </Operand>
            </FunctionParameters>
          </FunctionCallExpression>
        </ExpressionStatement>
      </BlockStatement>
    </FunctionDeclaration>
  </ClassDeclaration>
</APL>
Figure 5.2: class Point.xml
As the previous figure shows, the APL format is not Java code or code in any other programming
language. It represents the program instructions at a very abstract level, describing the program
with additional detail in a general way that is not tied to a specific programming language. This
information helps the converter to convert the program to another programming language. Observe
that "System.out.println" has been recognized and handled as Consol.println to make it more
abstract. Later, Consol.println will be converted to the appropriate method that prints to the
standard output of the target language; for instance, it will be converted to printf in C, to
printf/cout in C++, to writeln in Pascal, and so forth.
The next step is converting Point.xml from its abstract form back to Java and then comparing the
original source code with the output, to show that the translator does not change the
functionality of the program. Using apl2o (Abstract Programming Language to Other) and the Java
configuration file, Point.xml is translated to Java source code.
apl2o Point.xml -java.conf
Figure 5.3: class Point.java generated from Point.xml
The figure above shows that the program was successfully converted back to Java. There are a few
differences, but they do not affect functionality; for example, comments have been removed. The
converter generates formatted code even if the original code was not formatted, so it can also be
used to reformat code, making it organized and more readable.
5.2 Second scenario
In this scenario the same Java program, Point.java, is converted to C++, after first being
converted to the abstract form (XML). Executing the following command
apl2o Point.xml cpp.conf
produces the program in C++ syntax.
Figure 5.4: class Point.cpp generated from Point.xml
The previous figure also shows that the "System.out.println(x+","+y)" call in the original
Point.java program has been translated to "cout<<x<<","<<y<<endl".
5.3 Third scenario
This scenario checks performance, reliability and portability, for which a more complicated example
is needed. A program with more complex instructions has been developed in Java for the sake of
testing. The program first creates a dynamic list with the Fibonacci numbers and then prints it.
The list is then copied into an array, and all members of the array greater than a given number are
printed in reverse order. The last step is to sort the array in reverse order using a simple sort
algorithm and print its elements. The figures below show some parts of the program; for the full
source code see Appendix B.
Figure 5.5: method fib, class Fibonacci.java
Figure 5.6: Simple sort algorithm, method sortArray, class Fibonacci.java
This Fibonacci program is converted to XML and then back to Java. Both the original program and the
converted one are compiled and executed on Windows XP and Ubuntu Linux to show that the translator
is portable and works on different platforms. Some measurements are then collected and discussed.
The same commands as in the previous scenarios are used here. See Appendix B for the results of
converting to XML and then from XML to Java.
System specifications
Microsoft Windows XP and Ubuntu 7, running on the same computer with the following hardware
specifications: Intel(R) Pentium(R) 4 Mobile CPU 1.60GHz, 512 MB of RAM
                  Original file   XML file   Generated source code   Java compile time
Windows XP        -               390 ms     20 ms                   2.1 sec
Ubuntu 7 Linux    -               373 ms     18 ms                   2.6 sec
Size in bytes     8.192           53.248     7.521                   -
Table 5.1: performance, portability, and reliability measurements