Prototyping the Tree Automata Workbench Marbles


Petter Ericson

Supervisor: Frank Drewes

Assistant supervisor: Brink van der Merwe

Department of Computing Science, Umeå University, S–901 87 Umeå, Sweden, pettter@cs.umu.se

Abstract. In [Dre09], Drewes outlines Marbles, a programming framework for working in a generic and systematic way, not only on trees, as several frameworks already exist for this purpose, but on tree recognisers, transducers, generators and other formal devices as well. This thesis presents a prototype of a proposed implementation of this framework, demonstrating its functionality by using it as a base for implementing a well-known algorithm on tree transducers.


Techniques based on trees and various tree formalisms have seen increasing use in many areas in recent years[NP92][CDG+02]. Perhaps most well-known is the propensity for using XML as a data exchange medium[Sch07], but tree-based techniques have found their place in many other areas, such as natural language processing[KG05], model checking[AJMd02], and compiler optimisation.

Tree techniques, and specifically tree automata techniques, are largely based on the work done by Chomsky et al. in exploring various string formalisms and their respective restrictions, such as the Chomsky hierarchy[Cho56]. During the 1960s, researchers started exploring whether the results obtained in string automata theory and formal languages could be extended to trees. They could, and since then tree automata research has been an ever-growing area, though practical applications of this research are much more recent.

Despite this rather long history, there has been relatively little in the way of programming language and operating system support for exploration of the capabilities of tree automata. Instead, toolkits and applications tend to focus on solving a specific problem, or being well-suited for one particular area. While the lack of a generic toolkit has obviously not entirely stopped research into tree techniques and various automata, it may still be argued that the fragmented nature of the research codebase has slowed the pace. Further, if such a generic toolkit were available and widely used, researchers would presumably find it easier to collaborate by producing and exchanging source code.

The tree automata workbench Marbles, proposed by Drewes in [Dre09], is intended to be a generic and extensible programming framework for working with trees and tree automata. Specifically, it is intended for exploring the relationships between, and capabilities of, various categories of tree automata, and how algorithms can transform these automata in different ways.

1.1 Previous work

Of course, several projects exist which allow for working on trees and tree automata in various capacities. However, most of these are somewhat narrow in scope, or are not presented in a more organised fashion but rather used informally within the confines of a research group.

Certain tools have seen wider distribution, though, and some of these have served as an inspiration for the ideas behind Marbles.

– Treebag[Dre98] is a workbench for working with tree generators and transducers, stepping through derivations and transductions and inspecting the effects on the tree. It is further possible to write algebras that work on the trees to produce some output. The specific context for which Treebag was developed was tree-based picture generation, but it is usable for many other tasks relating to trees and tree generators as well. Treebag is in many ways an ancestor to Marbles, having the same originator and similar ambition of generality.

– ForestFire[Cle09] is a toolkit for pattern-matching and tree acceptance problems, with various algorithms organised into taxonomies.

– Tiburon[MK06] is a package of algorithms for working on various kinds of tree automata, notably weighted tree transducers and regular tree grammars. Though the focus is on natural language processing and related problems in machine learning, the algorithms are general enough to potentially see use in certain other domains as well.


– Timbuk[GB03] is a collection of tools for working on reachability proofs for term rewriting systems. Recent versions (3.0 and onward) no longer include the tree automata manipulation tools that warrant its inclusion here, but older versions include various algorithms for emptiness checking and boolean operations such as automata intersection, union and inversion.

– LearnLib[RSB05] is a library for finite automata learning and experimentation that primarily focuses on learning algorithms for string automata. However, certain aspects of its organisation were useful as an inspiration for the Marbles prototype.

1.2 Project goals

While the above projects and systems are very useful in their specific domains, they are nevertheless developed to explore those domains, and in some sense constrained by them. Marbles, in contrast, aims to be a jack-of-all-trades programming framework, supporting exploration of tree automata and tree-based algorithms through being extensible and flexible enough for first-approximation work in practically any domain.

As may be apparent, a complete framework of this size and complexity is not a viable goal for a thesis at the MSc level. Instead, we aim to propose a viable basis for further research into a complete framework. Specifically, we aim to

– Find a programming language suitable for implementing the Marbles system, with a view to making the system easy for external researchers to expand upon

– Choose a reasonable subset of tree formalisms for implementation in the prototype, given the time constraints and desire for coverage of the relevant automata classes

– Make a viable prototype for the Marbles system. In particular, the prototype should

• be usable for at least a small part of the tasks covered by the full system,

• include a basic GUI for interacting with the automata,

• include concrete implementations of a subset of the concepts described in [Dre09], and

• have a reasonable (i.e. consistent and logical) architecture, suitable for further implementation work, with a view to eventually being expanded into the complete framework.

1.3 Outline

Section 2 will dig deeper into the theoretical foundations required for the rest of the thesis, while Section 3 will describe the basics of the Marbles implementation, including a discussion of the choice of programming language.

Section 4 will continue with the theory and practice of the prototype, giving both the theoretical description and the implementation of the automata types provided in the prototype.

Section 5 describes a proof-of-concept implementation of two related algorithms on tree automata using the types and methods described in Sections 3 and 4.

Finally, Section 6 will contain a number of closing remarks, and introduce the next steps in making the prototype into an actual working framework, usable for research purposes.


2 Preliminaries

In order to fully appreciate the potential applications and programming patterns used in the Marbles prototype, it is necessary to first go through some basics of formal tree language theory. In principle, the extension from string languages is simple: allow more than one successor to each symbol. Obviously, however, this is not sufficiently well-defined to function in a formal setting.

2.1 Introduction to trees and automata theory

We define an alphabet to be any nonempty set $\Sigma$ of symbols, which can be extended to be a ranked alphabet by adding a mapping $R$ from $\Sigma$ to $\mathbb{N}$.

The number $k = R(s)$ we name the rank of the symbol $s \in \Sigma$. We also define the sets $\Sigma_k = \{s \in \Sigma \mid R(s) = k\}$ for all $k \in \mathbb{N}$. As a convention, we may use a subscript to make the rank explicit, i.e. a symbol $s$ with rank $R(s) = 2$ may be written $s_2$. Requiring that symbols have one rank only is not in general necessary, but makes some proofs and theorems easier to state.

Trees A tree can be defined in many ways: as an acyclic graph with a designated root node or as a term, for example. We prefer to view trees as a special case of strings, however, and reach this definition:

Let $\{[, ], ,\}$ (left bracket, right bracket and comma) be a set of auxiliary symbols, disjoint from any other alphabet considered herein. The set $T_\Sigma$ of trees over the (ranked) alphabet $\Sigma$ is the set of strings defined inductively as follows:

– $\Sigma_0 \subseteq T_\Sigma$,

– for $a \in \Sigma_k$, $k \geq 1$, and $t_1, \ldots, t_k \in T_\Sigma$, the string $t = a[t_1, \ldots, t_k]$ is in $T_\Sigma$.

Fig. 1. A simple graphical representation of the tree a[b[c], d]

In the tree $t = a[b[c], d]$ (shown graphically in Figure 1), the symbol $a$ is the root of the tree, while $b[c]$ and $d$ are child trees, or direct subtrees. The set of all subtrees of a particular tree, $\mathit{subtrees}(t)$, is composed inductively as follows:

– $t$ is in the set $\mathit{subtrees}(t)$,

– if $t'$ is in the set $\mathit{subtrees}(t)$, then all child trees of $t'$ are in $\mathit{subtrees}(t)$.

Further, a tree with no direct subtrees (e.g. $d$) is called a leaf. Thus, $\Sigma_0$ is a set of trees, as well as a set of symbols. A tree language over $\Sigma$ is any subset of $T_\Sigma$. We can again use $\Sigma_0$ as an example, as it can be viewed as the tree language consisting of only leaves.

The yield of a tree $t \in T_\Sigma$ is the string over $\Sigma_0$ obtained by reading the leaves of the tree from left to right.


It should be noted that for all trees considered in this thesis, there is an ordering of the direct subtrees of a tree such that each direct subtree can be given an index. However, it is likely that the complete Marbles system would contain support for unordered trees as well.

Contexts, variables and multicontexts A context $c$ over the ranked alphabet $\Sigma$ is a tree with a special symbol $\Box \notin \Sigma$ occurring exactly once, as a leaf. The substitution of any tree $t \in T_\Sigma$ in place of the symbol $\Box$ is denoted $c[t]$ and (obviously) yields a tree in $T_\Sigma$. The set of all contexts over the ranked alphabet $\Sigma$ is denoted $C_\Sigma$.

We define a set $X$ of variables $x_1, x_2, \ldots$, which is disjoint from any specific ranked alphabet considered in this thesis. In terms of ranks, $X = X_0$, and we use the notation $X^k$ to denote the first $k$ elements of $X$.

A multicontext $c^k$ of rank $k$ over the ranked alphabet $\Sigma$ is a tree in $T_{\Sigma \cup X^k}$, where each variable occurs exactly once. We use the notation $c[t_1, \ldots, t_k]$ to denote the substitution of each variable $x_i$ in $c$ by the tree $t_i$.

String automata Recall that a deterministic finite string automaton (DFSA) is a 5-tuple $A = (\Sigma, Q, R, F, q_0)$, where

– $\Sigma$ is the alphabet,

– $Q$ is the set of states,

– $R$ is the set of rules of the form $qa \to q'$, where $q, q' \in Q$ and $a \in \Sigma$, such that each left-hand side occurs at most once,

– $F \subseteq Q$ is the set of final states, and

– $q_0 \in Q$ is the initial state.

An intermediate string $s_i$ of the DFSA $A$ working on the string $s$ is a string $p_i q_i r_i$ where $p_i$ is a prefix of the string $s$ and $r_i$ the suffix such that $p_i r_i = s$, while $q_i$ is a state in $Q$. Informally, $p_i$ may be seen as the part of the string that has been processed, $q_i$ as the current state, and $r_i$ as the part of the string that is left.

A valid run of a DFSA $A$ on a string $s$ is a sequence $s_l \ldots s_k$ of intermediate strings where $s_i$ and $s_{i+1}$ are related to each other and $R$ as follows:

– $p_{i+1}$ in $s_{i+1}$ is exactly $p_i a$, and $r_i$ is exactly $a r_{i+1}$ for some symbol $a$ in $s$,

– there is a rule $q_i a \to q_{i+1}$ in $R$.

An accepting run $s_0 \ldots s_n$ of a DFSA $A$ on a string $s$ is a valid run such that

– $s_0 = q_0 s$, and

– $s_n = sq$ where $q \in F$.

The regular language accepted by A is the set L(A) of strings on which accepting runs of A can be constructed.

Further, a DFSA can be seen as a function from the set of strings $\Sigma^*$ to the boolean values, where every string in the language is mapped to true, and every other string to false.
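To make the definition concrete, the following is a minimal sketch of a DFSA in Scala (the language used for the prototype later in this thesis). It is not part of Marbles: the names DFSA, run, accepts and evenAs are chosen here purely for illustration, the alphabet is assumed to be the set of characters, and a rule qa → q' is assumed to be stored as a map entry (q, a) -> q', which enforces that each left-hand side occurs at most once.

```scala
// Illustrative sketch only; none of these names are taken from Marbles.
final case class DFSA[Q](
    states: Set[Q],
    rules: Map[(Q, Char), Q], // a rule q a -> q' is stored as (q, a) -> q'
    finals: Set[Q],
    initial: Q
) {
  /** The state after processing all of s, or None if a rule is missing. */
  def run(s: String): Option[Q] =
    s.foldLeft(Option(initial)) { (state, a) =>
      state.flatMap(q => rules.get((q, a)))
    }

  /** The automaton seen as a function from strings to booleans. */
  def accepts(s: String): Boolean = run(s).exists(finals.contains)
}

// Example: strings over {a, b} containing an even number of a's.
val evenAs = DFSA[Int](
  states = Set(0, 1),
  rules = Map((0, 'a') -> 1, (1, 'a') -> 0, (0, 'b') -> 0, (1, 'b') -> 1),
  finals = Set(0),
  initial = 0
)
```

Folding over the string corresponds to constructing the run one intermediate string at a time: the accumulated state plays the role of the current state, and the remaining characters are the unprocessed suffix.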

By dropping the requirement that each left-hand side occurs at most once in the rule set, we obtain non-deterministic finite string automata (NFSA). No additional expressive power is gained with this change, though individual languages may have an exponentially smaller representation as NFSA than as DFSA[Sip06].

Tree automata While we will more rigorously define tree automata in the next section, a brief note is in order to explain how string automata are extended to work on trees. Basically, there are two approaches. In the first, the automaton has an initial state which is applied to the root, after which the computation runs in parallel top-down through the various branches. The alternative is to have no particular starting state, but instead “leaf transitions”, moving from a leaf directly to a state, and then working bottom-up through the tree, culminating in a state that is or is not part of the set of final states. Naturally, these automata exist in deterministic and non-deterministic versions as well.
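The bottom-up approach can be sketched in a few lines of Scala. This is only an informal illustration under our own assumptions, not the formal definition given in the next section and not the Marbles classes: the names Tree, BottomUpTA, run and accepts are hypothetical, and a rule is assumed to map a symbol together with the sequence of states reached at its children to a new state, so that leaf transitions are simply rules with an empty sequence of child states.

```scala
// Illustrative sketch only; these are not the Marbles classes.
final case class Tree[T](root: T, subtrees: Seq[Tree[T]] = Nil)

final case class BottomUpTA[T, Q](
    rules: Map[(T, Seq[Q]), Q], // (symbol, states of the children) -> state
    finals: Set[Q]
) {
  /** The state reached at the root, or None if some rule is missing. */
  def run(t: Tree[T]): Option[Q] = {
    val childStates = t.subtrees.map(run)
    if (childStates.forall(_.isDefined))
      rules.get((t.root, childStates.flatten))
    else None
  }

  def accepts(t: Tree[T]): Boolean = run(t).exists(finals.contains)
}

// Example: conjunctions of truth values; the states are the Boolean results.
val rules: Map[(String, Seq[Boolean]), Boolean] = Map(
  ("true", Seq.empty[Boolean]) -> true,   // leaf transition
  ("false", Seq.empty[Boolean]) -> false, // leaf transition
  ("and", Seq(true, true)) -> true,
  ("and", Seq(true, false)) -> false,
  ("and", Seq(false, true)) -> false,
  ("and", Seq(false, false)) -> false
)
val evalTA = BottomUpTA[String, Boolean](rules, finals = Set(true))

val conj = Tree("and", Seq(Tree("true"), Tree("and", Seq(Tree("true"), Tree("true")))))
```

Since the rules form a map, each left-hand side occurs at most once and the sketch is deterministic; a non-deterministic variant would map each left-hand side to a set of states instead.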

2.2 Project plan

The original plan was for the project to run over 6 months, with a bird's-eye view of the intended activities detailed in Table 1. The various stages are explained below:

– Initial planning was to include not only the planning, but also the final decision on what the project would actually entail. The preliminary readings were focused on the initial Marbles paper, as well as a small number of surveys of similar projects and the programming languages used there.

– Design and prototyping: The taxonomy constructions detailed in [Cle08] were to be considered at this stage, while making a further study of reasonable programming language choices and the module structures implied by them.

– Basic definitions and implementation: The choice of programming language having been made, the basic types and organisation of the prototype were to be considered and implemented at this stage.

– Concrete class implementation: Actual automata and algorithm types were to be implemented here, and connected to a basic GUI.

– Concluding implementation, writing a report: The final weeks were to be dedicated to bughunting, minor design issues and writing of the report.

Week    Activity
1       Initial planning, preliminary reading of materials, initial design choices
2-4     Further design and prototyping, including work on the taxonomies discussed in [Cle08]
5-10    Basic definitions and implementation
10-15   Concrete class implementation, simple GUI construction
16-22   Concluding implementation, writing a report

Table 1. Initial project plan


3 Basic Implementation of Marbles

3.1 Choice of language

As the prototype was intended to function as a base on which the full system could be built, much thought was spent on the choice of language. Ideally, the language would be familiar to a large number of researchers, while having several desirable features, such as platform-independence, easy extensibility and a powerful typing system. Initially, the languages considered were C++, Java and Haskell (C# being seen as far too closely tied to the Microsoft Windows platform). However, as C++ is both hard to distribute in a platform-independent form, and notoriously counterintuitive, the deliberations quickly centered on Java or Haskell.

Java Java is intuitively a good fit, being familiar to most researchers, and additionally having the desired platform independence and extensibility. However, there are a number of typing “tricks” that are quite difficult to pull off using Java, such as deciding on type parameter co- and contravariance (that is, if S is a supertype of T, is P<S> a super- or subtype of P<T>?). Java also has a number of other undesirable features, e.g. the distinction between primitive types and objects, the lack of operator overloading and implicit conversions between user-defined types, and a general lack of easy prototyping constructions. Further, programming in Java tends to require much so-called “boilerplate” code, i.e. simple and often-used concepts take much code to express properly. As an example, Java requires the types of arguments, return values and variables to be stated at all times, even when the type of a particular variable is both obvious (to a human reader) and easy to infer (for the compiler).

Haskell Haskell[HPJW+92], in contrast, is a purely functional language which, while not having the mass appeal of Java, is still well-known in the research community. However, it lacks the compiled portability of Java, as well as any sort of easy integration with any other language. Further, having no option to use anything but functional programming (albeit with monads etc. playing the roles of objects) makes certain algorithms hard to implement. It does feature a very powerful typing system, including inference computations to reduce boilerplate type declarations. There is quite a bit more support for fast prototyping than in Java, including an interactive shell (ghci) for trying out the functions and monads that have been defined.

Scala With both Haskell and Java being problematic in their own way, Scala[OMM+04] appeared as an alternative. While at the time it was not as well-known as even Haskell, it nevertheless had an active community, and showed great promise for the future. Further, it touted easy integration into the Java ecosystem, meaning researchers interested in using Marbles would likely be able to write their client code in Java and use the Scala parts of the framework behind the scenes. Strong, static typing with heavy use of type inference further tipped the scales, promising to remove much of the boilerplate required in Java code. Scala also allows for much coding to be done using a functional programming paradigm, which offers certain advantages. Other attractive features include the ease by which user-written classes and types can be integrated into the language. Notably, infix operators are simply method applications with the dot and parentheses omitted, and by defining properly named methods, pattern matching and other features traditionally implemented as language constructs can be easily applied to any imaginable class.


Scala also allows for quick and easy prototyping through its interactive shell and quick syntax. The integration with Java works both ways as well; it is trivial to use any existing Java library as a component in a Scala program, meaning integrating existing Java code with Marbles would likely be comparatively easy. These features combined to make Scala appear an ideal choice.

3.2 Basics of Scala

Scala is a functional object-oriented hybrid language with static typing and a syntax designed to remove boilerplate and increase legibility. It is designed to run on both the Java JVM and Microsoft's .NET infrastructure, and has a large standard library that handles much of the underlying complexity in various common tasks.

Everything in Scala is an object, down to the integral data types (Int and so on), and functions. Further, as opposed to Java or C#, there is no such thing as a static method, which makes Scala in some respects even more object-oriented than those two languages. To facilitate the equivalent functionality, Scala allows the definition of singleton objects through the use of the keyword object. When a class and a singleton object are defined with the same name, they are called the companion class and companion object of each other, which means that they can access each other's private members, accomplishing the functionality of static members.

Probably the most significant difference between Scala and C#/Java is multiple implementation inheritance. Java classes inherit from only a single class, but potentially from multiple interfaces. However, each interface only describes methods that need to be present in the class; no actual implementation code is included in an interface. Through personal communication with the author of Treebag, it has been established that the lack of this feature was a major stumbling block in the (Java) Treebag implementation. By nature, automata implementation lends itself well to the use of multiple implementation inheritance, and the steps required to reproduce the same behaviour in Java were cumbersome and forced.

In contrast, Scala traits are “rich”, in the sense that they can make use of the defined methods to provide more functionality. For example, by inheriting (mixing in) the trait Ordered[T] and implementing the single abstract method compare, all of the comparison operators (< > <= >=) become available, as well as various sorting methods on collections of the class and many other similarly useful functions.
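The Ordered example can be tried out directly. The sketch below uses a made-up class Rank, which is not part of Marbles; only the trait Ordered and its abstract method compare come from the Scala standard library.

```scala
// Rank is a hypothetical example class, used only to illustrate the trait.
final case class Rank(value: Int) extends Ordered[Rank] {
  // The single abstract method required by Ordered[Rank].
  def compare(that: Rank): Int = value.compare(that.value)
}

val ranks = List(Rank(2), Rank(0), Rank(1))

val isLess = Rank(0) < Rank(2)  // comparison operators are inherited
val sortedRanks = ranks.sorted  // sorting uses the ordering given by compare
val smallest = ranks.min
```

Here the comparison operators and the sorting support are obtained from the one-line definition of compare, which is the kind of reuse that a Java interface of the time could not offer.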

3.3 Scala example

In order to familiarise ourselves with Scala syntax, we will gradually construct parts of the class Tree as implemented in Marbles. Starting off, recall that a tree is defined as having a finite number of subtrees. A simple implementation of trees in Scala might thus look like this:

class Tree (val subtrees: Seq[Tree])

The keyword class starts a class definition, just as in Java and C#, but as is readily apparent, such definitions may be much more concise than in those two languages. The parentheses simultaneously define the default constructor of the class and the instance variables, which in this case is a single sequence (Seq) of Trees. Furthermore, this instance variable is defined to be immutable (the val keyword), meaning that it cannot be reassigned during the lifetime of the object. Instance variables and methods are public by default in Scala. A sample usage of this class is the simple assignment

val t = new Tree(Nil)
println(t.subtrees)


which creates a leaf tree, and prints its subtrees. The printout of this script will simply be 'List()', which is how the empty list Nil is rendered as a string.

This tree class can only represent skeletons of trees, however, as there is no way to associate a symbol with a specific position. We can easily add a “root” instance variable to take care of this, and by using type parameters, we can even make trees with roots of any type:

class Tree[+T] (val root:T, val subtrees: Seq[Tree[T]])

The type parameterisation should be familiar to anyone with experience of languages like C++, C# or Java. The only unfamiliar part would be the + symbol, which in this case represents the variance of the type parameterisation. That is, if S is a supertype of T, is Tree[S] a supertype of Tree[T] (covariance)? Is it a subtype of Tree[T] (contravariance)? In Scala, covariance is indicated by the + sign, while a - sign would indicate contravariance. Note that if Trees were not immutable, then they could not safely be covariant, as one could potentially do

val t1 = new Tree[Int](2,Nil)
val t2:Tree[Any] = t1
t2.root = "abc"
println(3.0 / t1.root)

which will obviously result in a runtime error. Thus making objects immutable is not only useful for verification, but also has impacts on what restrictions are reasonable to set on type casts.
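The safe direction of covariance can be checked with a small sketch. It repeats the minimal immutable Tree definition from above; the names strings, anys and rootLabel are introduced here only for illustration.

```scala
// The minimal immutable, covariant tree from the text.
class Tree[+T](val root: T, val subtrees: Seq[Tree[T]])

val strings: Tree[String] = new Tree("a", Seq(new Tree("b", Nil)))

// Accepted by the compiler because Tree is covariant in T
// and String is a subtype of Any.
val anys: Tree[Any] = strings

// A function written for Tree[Any] can therefore be applied to any tree.
def rootLabel(t: Tree[Any]): String = t.root.toString
```

Declaring root as a var instead of a val in a covariant class is rejected at compile time, so the unsafe assignment sketched above cannot actually be written for this class.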

As defined above, the Tree class can represent the actual data of a tree properly, but it still lacks any method implementations. For example, printing a Tree with println would use the standard AnyRef toString method, which simply prints the class name and object address. Likewise, the equals method uses object reference rather than object equality to determine its truth value. Adding these methods we arrive at

class Tree[+T] (val root:T, val subtrees: Seq[Tree[T]]){
  override def toString : String = subtrees.size match {
    case 0 => root.toString
    case _ => root.toString + subtrees.mkString("[",",","]")
  }

  override def equals(other : Any) = other match{
    case that : Tree[_] => this.root == that.root &&
                           this.subtrees == that.subtrees
    case _ => false
  }

  override def hashCode =
    41 * ( 41 + root.hashCode) + subtrees.hashCode
}

As is apparent, the def keyword declares the beginning of a function declaration, while override indicates that the function will override an implementation in a superclass. Further, note that toString and hashCode lack parentheses. This is a Scala convention indicating that they, though they might require calculation, will not change the object.


A function definition includes an equals sign that indicates the beginning of the function body. This might be followed by a {, but may also simply be followed by a single expression computing the result, as in these cases.

The pattern matching showcased in equals and toString is similar to that of most other functional languages, though the specifics differ. Note the use of the underscore character (_) as a general default, or “uninteresting” value. Further, mkString is a method of the Seq class, which constructs a string that starts with the first argument, continues with the members of the sequence interspersed with the second argument, and ends with the third argument. In other words, it produces generally what you would expect a string rendition of a specific tree to be.

The hashCode definition is included so as to make sure that if two objects are equal, then they also have the same hashCode. The specific implementation was inspired by an example in the book Programming in Scala[OSV11]. The basic idea is to use a reasonably large prime (e.g. 41) and combine it with the hashCodes of all instance variables relevant for object equality, to arrive at a reasonably fast and well-spread hash code.
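The practical effect of these three methods can be observed with the following sketch, which repeats the class as defined above and builds the same tree twice; the value names t1, t2 and distinctTrees are ours.

```scala
class Tree[+T](val root: T, val subtrees: Seq[Tree[T]]) {
  override def toString: String = subtrees.size match {
    case 0 => root.toString
    case _ => root.toString + subtrees.mkString("[", ",", "]")
  }

  override def equals(other: Any) = other match {
    case that: Tree[_] => this.root == that.root &&
                          this.subtrees == that.subtrees
    case _ => false
  }

  override def hashCode =
    41 * (41 + root.hashCode) + subtrees.hashCode
}

// Two structurally identical trees, built separately.
val t1 = new Tree("a", Seq(new Tree("b", Seq(new Tree("c", Nil))), new Tree("d", Nil)))
val t2 = new Tree("a", Seq(new Tree("b", Seq(new Tree("c", Nil))), new Tree("d", Nil)))

// Hash-based collections rely on equals and hashCode agreeing.
val distinctTrees = Set(t1, t2)
```

With the overrides in place, the two trees compare as equal and collapse into a single element of the set; without them, the set would contain both objects.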

Fields and functions of Tree For the latter parts of the implementation of Tree, we will omit the already defined parts, and only show the added fields and functions.

depth The depth (or height) of a tree is the length of the longest path from the root to a leaf. In Tree:

def depth:Int = subtrees match {
  case Nil => 1
  case _ => (subtrees map (_.depth)).max + 1
}

Again, we use pattern matching to give the proper results. However, we also utilise two new Scala constructs: anonymous functions, and the map method. Taking these concepts in order, an anonymous function can be defined in many different ways. The canonical way is to define the arguments, with types, and then the body, as follows:

val f = (x:Tree[T]) => x.depth

However, Scala allows for various shorthands. In particular, if the type of the argument can be inferred somehow, it may be omitted. Moreover, if each argument only appears once in the function, one may omit the list of arguments, and instead introduce “holes” (denoted by underscores) in the function body. Thus the above function is identical to the argument given to map in the definition of depth.

Moving on, map is a method of pretty much every collection in the Scala collections framework. It takes as its argument a function from the element type T to some other type U, and produces a collection of the results of the function, as applied to every element of the collection. Thus subtrees map (_.depth) returns a sequence of the depths of each subtree. Taking the maximum and adding one necessarily results in the depth of the current tree.
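The equivalence of the different ways of writing an anonymous function can be seen on a plain list of integers. The list below stands in for the depths of three hypothetical subtrees; all value names are chosen for this illustration.

```scala
// Depths of three hypothetical subtrees.
val subtreeDepths = List(3, 1, 2)

// Three equivalent ways of passing "add one" to map.
val viaExplicit = subtreeDepths.map((x: Int) => x + 1) // argument type given
val viaInferred = subtreeDepths.map(x => x + 1)        // argument type inferred
val viaHole     = subtreeDepths.map(_ + 1)             // argument replaced by a hole

// The depth computation from the text: the maximum subtree depth plus one.
val depthOfParent = subtreeDepths.max + 1
```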

leaves The leaves of a tree are often of interest in various algorithms and automata implementations. Accessing them in Marbles is handled by the leaves function:

def leaves:Seq[Tree[T]] = subtrees match {
  case Nil => Seq(this)
  case _ => subtrees flatMap (_.leaves)
}


The only thing that warrants explanation in this function would be the flatMap function. In Scala, flatten can be applied to a collection of collections (e.g. a List of Lists), and results in the inner collection being “flattened”, i.e. the members of the inner collections are instead made members of the outer:

scala> val l = List(Set(1),Set(2,3),Set(2))
l: List[Set[Int]] = List(Set(1), Set(2, 3), Set(2))

scala> l.flatten
res1: List[Int] = List(1, 2, 3, 2)

flatMap, then, is simply a map followed by flatten.
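As a small sketch of how leaves relates to the yield defined in Section 2.1, the following repeats only the relevant part of Tree and applies it to the example tree a[b[c], d]. The value names are ours, and rendering the yield by concatenating the root symbols of the leaves is an assumption that suits string-labelled trees.

```scala
// Only the part of Tree needed for this illustration.
class Tree[+T](val root: T, val subtrees: Seq[Tree[T]]) {
  def leaves: Seq[Tree[T]] = subtrees match {
    case Nil => Seq(this)
    case _   => subtrees.flatMap(_.leaves)
  }
}

// The tree a[b[c], d] from Figure 1.
val t = new Tree("a", Seq(new Tree("b", Seq(new Tree("c", Nil))), new Tree("d", Nil)))

// flatMap gives the same result as map followed by flatten.
val viaFlatMap = t.subtrees.flatMap(_.leaves)
val viaFlatten = t.subtrees.map(_.leaves).flatten

// The yield: the leaf symbols read from left to right.
val treeYield = t.leaves.map(_.root).mkString
```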

map With mapping functions being so central to a functional way of programming, the Tree class would have been clearly lacking without one:

def map[To](f : (T) => (To)):Tree[To] = {
  new Tree(f(root), subtrees map (_ map f))
}

Note the (function) type of the argument.

subst In addition to simply making a mapping of the nodes, one might want to keep most of the tree intact, while substituting specific subtrees based on the value of the root. This is accomplished in Marbles by the subst method:

def subst[U >: T](subs : Map[U,Tree[U]]) : Tree[U] = {
  if(subs isDefinedAt root)
    subs(root)
  else
    new Tree[U](root,subtrees map(_.subst(subs)))
}

The type parameter of subst showcases another part of the Scala typing system, namely lower bounds. This means that the type parameter (in this case U) must be a supertype of the indicated type (in this case T). That this restriction is reasonable in the case of the subst method is easily verified: as some subtrees of type Tree[T] are likely to remain, it is not reasonable to have no restrictions whatsoever on the target type, nor is it reasonable to allow subtypes of T. However, a T is also a member of all its supertypes, meaning that (due to covariance) Tree[T]s are also members of Tree[U] for U a supertype of T. These kinds of restrictions are very hard to realise in a consistent manner in languages such as Java.
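The effect of the lower bound can be seen when a subtree of a Tree[Int] is replaced by a tree of strings. The sketch below repeats subst together with a toString for readable output; the values numbers, subs and mixed are hypothetical examples.

```scala
class Tree[+T](val root: T, val subtrees: Seq[Tree[T]]) {
  def subst[U >: T](subs: Map[U, Tree[U]]): Tree[U] =
    if (subs.isDefinedAt(root)) subs(root)
    else new Tree[U](root, subtrees.map(_.subst(subs)))

  override def toString: String =
    if (subtrees.isEmpty) root.toString
    else root.toString + subtrees.mkString("[", ",", "]")
}

val numbers: Tree[Int] = new Tree(1, Seq(new Tree(2, Nil), new Tree(3, Nil)))

// Replace every subtree rooted in 2 by a tree of strings.
val subs: Map[Any, Tree[Any]] =
  Map(2 -> new Tree[Any]("two", Seq(new Tree[Any]("x", Nil))))

// U is inferred as Any, the common supertype of Int and String.
val mixed: Tree[Any] = numbers.subst(subs)
```

The result mixes integer and string symbols, so it can only be typed as a Tree of a common supertype; the original tree is left untouched.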


The object Tree We briefly mentioned the concept of companion objects, which is, among other things, how Scala provides the functionality usually associated with static members.

apply In addition to the more traditional uses of static methods (constants, parsing etc.), a Scala convention is to define only a default constructor in the class, and use the companion object apply functions as factory methods for additional sets of arguments. This in turn relies on another Scala shorthand notation, namely that in calls to the apply function, the function name can be omitted. That is, given the singleton object

object Incr {
  def apply(i:Int):Int = i + 1
}

then instead of Incr.apply(5) you can write Incr(5).

The constructor/factory parts of the Tree companion object look like this:

object Tree {
  def apply[T](rt:T,ss:Seq[Tree[T]]) = new Tree(rt,ss)
  def apply[T](rt:T) = new Tree(rt,Nil)
}

Note that since the object itself is a singleton, it has no default constructor, and no type parameter. Instead, the type parameters are attached to the apply methods.

unapply In order to facilitate pattern matching on user-defined types, Scala defines the “inverse” function to apply: unapply. In the case of Tree, it is implemented as follows:

def unapply[T](t:Tree[T]):Option[ (T, Seq[Tree[T]]) ] = Some((t.root,t.subtrees))

The predefined type Option is used to indicate optional values, and is either equal to Some(value), as in the unapply method above, or to None. In the case of general unapply methods, they can do an arbitrary computation on the input and return None in case the pattern does not match. In this case, the pattern match is used like this:

val t = Tree("a",Nil)
t match {
  case Tree(x,Nil) => println(x + " with no subtrees")
  case Tree(x,subs) => println(x + " with subtrees " + subs)
}
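To illustrate an unapply method that can fail to match, the sketch below adds a hypothetical extractor object Leaf next to the Tree companion object. Leaf and describe are not part of Marbles; they are introduced here only to show the role of None.

```scala
class Tree[+T](val root: T, val subtrees: Seq[Tree[T]])

object Tree {
  def apply[T](rt: T, ss: Seq[Tree[T]]): Tree[T] = new Tree(rt, ss)
  def apply[T](rt: T): Tree[T] = new Tree(rt, Nil)
  def unapply[T](t: Tree[T]): Option[(T, Seq[Tree[T]])] =
    Some((t.root, t.subtrees))
}

// Hypothetical extractor: matches only trees without subtrees.
object Leaf {
  def unapply[T](t: Tree[T]): Option[T] =
    if (t.subtrees.isEmpty) Some(t.root) else None
}

def describe(t: Tree[String]): String = t match {
  case Leaf(x)               => x + " is a leaf"
  case Tree(x, Seq(Leaf(y))) => x + " has the single leaf child " + y
  case Tree(x, subs)         => x + " has " + subs.size + " subtrees"
}
```

Since Leaf.unapply returns None for trees with subtrees, matching falls through to the later cases; extractors can also be nested, as in the second case.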


Tree parsing Scala provides a framework for combinator parsing, which is used heavily in Marbles to provide persistent test cases. Fully examining the parsing framework is outside the scope of this thesis, as it requires quite a few concepts of Scala not covered in the sections above. However, a few notes are in order before the source code is shown:

– A Parser[T] is the parser type defined by Scala, which is used in building combined parsing expressions.

– The implicit keyword is introduced in the code below. It has three different meanings:

• Before the declaration of an object, it means that that object can be implicitly accessed wherever it is in scope (usually, the same class, the companion object, and anywhere that class is explicitly used).

• Before a function parameter, it means that if the parameter is omitted, a search will be made to find an implicit object of the correct type, which will be inserted as a parameter.

• By defining a converter function T => S as implicit, an object might be automatically converted from type T to S, anywhere that conversion function is in scope, and the typing rules demand it. A notable instance is the implicit conversion from AnyRef to String, analogous to Java. No implicit converter is explicitly shown in the below code.

– The ~ operator combines multiple Parsers into a Parser parsing the sequence of the segments. This is the basic operator of the combinator parsing framework.

– The class ElementParsers[T] was created to serve as a way to pass a Parser[T] as an implicit parameter, something which proved otherwise complicated due to the design of the Scala combinator parsing framework.

/** A parser for trees on the form "root[tree,...]" or "root" if the
 *  tree is a leaf.
 */
implicit def treeParser[T](implicit rootParsers:ElementParsers[T]) =
  new ElementParsers[Tree[T]] {
    val root:Parser[T] = rootParsers
    def tree:Parser[Tree[T]] =
      (root~opt("["~>repsep(tree,",")<~"]")) ^^ {
        case root~None => new Tree(root,Nil)
        case root~Some(subs) => new Tree(root,subs)
      }
    def start = tree
  }

For further details on Scala syntax and programming, the excellent book Programming in Scala [OSV11] by Odersky et al. is highly recommended.


3.4 General Marbles organisation

The current Marbles codebase is organised into three modules:

– algorithm holds the actual higher-level algorithms that have been implemented as part of this thesis. Future versions will likely have baseline classes and traits available to help facilitate working with the as of yet unimplemented Marbles GUI, but as of now, the algorithms are relatively self-contained.

– automaton contains the various types of tree automata that have been implemented. These classes will likely also be further split up into base classes, traits and subclasses in the future, to avoid code duplication as far as possible. However, at the moment there are a few simple interface traits defined, with little code being shared between the classes.

– util is a catch-all module for holding the basics of the Marbles system. Notably, alphabets and trees reside here, as do the basics of the parsing system.

A (partial) class diagram is shown in Fig. 2.

Fig. 2. Partial class diagram of the Marbles prototype


4 Tree Recognisers and Transducers in Marbles

As described in Section 2.1, finite automata are constructs that in general have a few things in common, notably an alphabet, a state set, and a rule set. In Marbles, the Alphabet is simply a Set of some type T, while a RankedAlphabet is a map from some T to Int. States could be of any type, conceptually, but in the current prototype, they are Strings. The type of the rule sets obviously varies between different automata types, but in general they are Maps from tuples representing the left-hand side to Sets of the appropriate type collecting the various right-hand sides.

All of the assertions regarding the expressive power of various unweighted automata expressed below have been proven since at least the 1970s, and the proofs can generally be found in Joost Engelfriet's lecture notes on Tree Automata and Tree Grammars [Eng75], though in many cases more elegant variants have emerged.

4.1 Recognisers

Extending string recognisers (FSA) to the tree case entails, as mentioned, some way of handling branches, with a choice being made as to moving from the root downwards (top-down) or from the leaves up (bottom-up) during processing. Both of these approaches are available in Marbles, using the TDNFTA and BUNFTA classes, respectively. With more theoretical rigour:

A bottom-up non-deterministic finite tree automaton (BUNFTA) is a 4-tuple $A = (\Sigma, Q, R, F)$ where

– $\Sigma$ is the (ranked) tree alphabet,

– $Q$ is a ranked alphabet of states such that $Q = Q_1$,

– $R$ is a set of rules on the form $a[q_1 \dots q_k] \to q'$ for $q', q_1, \dots, q_k \in Q$, $a \in \Sigma_k$, and

– $F \subseteq Q$ is a set of final states.

In Marbles, this is represented by the BUNFTA class, which contains the instance variables sigma, states, rules, and fin, which correspond exactly to the structure described above. The rule set is of the type Map[(T,Seq[String]), Set[String]], which, again, corresponds rather exactly to how we describe the rules in algorithms. As was mentioned in the introduction to this section, we use a Set to gather the various right-hand sides corresponding to a particular left-hand side.

Moving on with the theoretical definition, an intermediate tree $t_i$ of a BUNFTA $A$ is a tree over $\Sigma \cup Q$.

A valid run of a BUNFTA $A$ is a sequence of intermediate trees $t_l \dots t_m$ such that the trees $t_i$ and $t_{i+1}$ are related as follows:

– there is a subtree $a[q_1[s_1] \dots q_k[s_k]]$, $a \in \Sigma_k$, $q_1, \dots, q_k \in Q$, $s_1, \dots, s_k \in T_\Sigma$ at a position $p$ in $t_i$,

– there is a subtree $q'[a[s_1, \dots, s_k]]$, $q' \in Q$, at position $p$ in $t_{i+1}$,

– $t_i$ and $t_{i+1}$ are otherwise equal, and

– there is a rule $a[q_1 \dots q_k] \to q'$ in $R$.

An accepting run $t_0 \dots t_n$ of a BUNFTA $A$ on a tree $t$ is a run where

– $t_0 = t$ and

– $t_n = q[t]$, $q \in F$.

The set L(A) of trees on which an accepting run can be constructed for a BUNFTA A is the language of the automaton. The class of languages recognised by BUNFTA is the class of regular tree languages.


In Marbles, running the automaton on a tree to see if it is part of the language is a simple matter of using the apply method, either explicitly or through just using the object(arguments) syntax. Further, the applyState method will reveal the exact state set that a specific subtree ends up in, while isDeterministic checks if the automaton is deterministic (i.e. that every Set of right-hand sides is of size at most one). Also, parsing of an automaton has been implemented using combinator parsing. In addition, BUNFTA mixes in (inherits) the Scala trait PartialFunction[Tree[T],Boolean], which makes it smoothly integrate into the Scala software ecosystem.
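To make the bottom-up evaluation concrete, the following is a minimal sketch of how a state set could be computed from a rule map of the shape described above. It is not the Marbles source: the Tree case class, the Rules alias, the combinations helper and the free-standing function signatures are assumptions made for this illustration, and rulesN encodes the automaton N of Example 1 with the state names qa and qb.

```scala
object BottomUpSketch {
  // Simplified stand-in for the Marbles Tree class (an assumption, not the real API).
  case class Tree[T](root: T, subtrees: Seq[Tree[T]])

  // (symbol, states of the subtrees) -> possible resulting states.
  type Rules[T] = Map[(T, Seq[String]), Set[String]]

  // All sequences obtained by picking one state from each set, in order.
  def combinations(sets: Seq[Set[String]]): Set[Seq[String]] =
    sets.foldRight(Set(Seq.empty[String])) { (choices, acc) =>
      for (q <- choices; rest <- acc) yield q +: rest
    }

  // The set of states the automaton can reach at the root of t.
  def applyState[T](rules: Rules[T], t: Tree[T]): Set[String] = {
    val childStates = t.subtrees.map(s => applyState(rules, s))
    combinations(childStates).flatMap(qs => rules.getOrElse((t.root, qs), Set.empty[String]))
  }

  // A tree is accepted if at least one reachable root state is final.
  def accepts[T](rules: Rules[T], fin: Set[String], t: Tree[T]): Boolean =
    applyState(rules, t).exists(fin.contains)

  // Deterministic: no left-hand side has more than one right-hand side.
  def isDeterministic[T](rules: Rules[T]): Boolean =
    rules.values.forall(_.size <= 1)

  // The rule set of the automaton N from Example 1.
  val rulesN: Rules[String] = Map(
    ("a", Seq.empty[String]) -> Set("qa"),
    ("b", Seq.empty[String]) -> Set("qb"),
    ("g", Seq("qa")) -> Set("qa"),
    ("g", Seq("qb")) -> Set("qb"),
    ("f", Seq("qa", "qb")) -> Set("qa", "qb"),
    ("f", Seq("qb", "qa")) -> Set("qa", "qb")
  )
}
```

Enumerating every combination of child states is exponential in the worst case; it is chosen here only because it mirrors the definition of a run directly.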

By restricting the rule set such that each left-hand side appears at most once, we arrive at the deterministic variant of bottom-up tree automata (BUDFTA).

The expressive power of BUDFTA is exactly equal to that of BUNFTA [Eng75], and both recognise the class of regular tree languages. Though the proof of this assertion is outside the scope of this introduction, we provide an example of a single language implemented with and without nondeterminism. The language is the set of trees over $\Sigma = \{f_2, g_1, a_0, b_0\}$ such that each subtree whose root is an $f$ contains both $a$s and $b$s, and the two automata are, respectively

Example 1. $N = (\Sigma, \{q_a, q_b\}, R_N, \{q_a, q_b\})$ where $R_N$ is

$$\begin{array}{ll}
a \to q_a & b \to q_b \\
g[q_a] \to q_a & g[q_b] \to q_b \\
f[q_a, q_b] \to q_a & f[q_a, q_b] \to q_b \\
f[q_b, q_a] \to q_a & f[q_b, q_a] \to q_b
\end{array}$$

and $D = (\Sigma, \{q_a, q_b, q_{ab}\}, R_D, \{q_a, q_b, q_{ab}\})$ where $R_D$ is

$$\begin{array}{ll}
a \to q_a & b \to q_b \\
g[q_a] \to q_a & g[q_b] \to q_b \\
f[q_a, q_b] \to q_{ab} & f[q_b, q_a] \to q_{ab} \\
f[q_{ab}, q_a] \to q_{ab} & f[q_{ab}, q_b] \to q_{ab} \\
f[q_a, q_{ab}] \to q_{ab} & f[q_b, q_{ab}] \to q_{ab} \\
g[q_{ab}] \to q_{ab} & f[q_{ab}, q_{ab}] \to q_{ab}
\end{array}$$

The proof of the general equivalence is based on the same principle as the proof of the equivalence of deterministic and nondeterministic string automata. That is, the state set Q of the nondeterministic automaton is replaced by the set P(Q), and transitions are added accordingly. Though D lacks a state corresponding to the empty set and the requisite transitions, it demonstrates the key elements of the proof: the addition of states and transitions in a systematic manner to recognise the same language as N deterministically.

Moving on to the top-down case, a top-down non-deterministic finite tree automaton (TDNFTA) is a 4-tuple $A = (\Sigma, Q, R, q_0)$ where

– $\Sigma$ is the (ranked) tree alphabet,

– $Q$ is a ranked alphabet of states such that $Q = Q_1$,

– $R$ is a set of rules on the form $q[a[x_1 \dots x_k]] \to q_1[x_1] \dots q_k[x_k]$ where $q, q_1, \dots, q_k \in Q$, $a \in \Sigma_k$, and $\{x_1, \dots, x_k\} = X_k$ are variables, and

– $q_0 \in Q$ is the initial state.

Again, the Marbles implementation stays close to what is defined in the theory, with the variables being named sigma, state, rules and q0, respectively, with the rule set being a Map[(T,String),Set[Seq[String]]]. The one thing to note is that the right-hand sides are contained in a Set of Seqs. That is, the state that the automaton uses to traverse downward is dependent on what state is applied to the sibling trees. The distinction may seem unimportant, but as will be apparent, it is critical to how the tree automaton works non-deterministically. We introduce the notation λ to denote the empty sequence (i.e. in rules involving leaves on the left-hand side).

An intermediate tree of a TDNFTA A is, as for BUNFTA, a tree over Σ ∪ Q.

A valid run of the TDNFTA $A$ is a sequence of intermediate trees $t_l \dots t_m$ where $t_i$ and $t_{i+1}$ relate to each other as follows:

– there is a subtree $q[a[s_1, \dots, s_k]]$, $q \in Q$, $a \in \Sigma_k$, $s_1, \dots, s_k \in T_\Sigma$ at position $p$ in $t_i$,

– there is a subtree $a[q_1[s_1] \dots q_k[s_k]]$, $q_1, \dots, q_k \in Q$ at position $p$ in $t_{i+1}$,

– $t_i$ and $t_{i+1}$ are otherwise equal, and

– there is a rule $q[a[v_1 \dots v_k]] \to q_1[v_1] \dots q_k[v_k]$ in $R$.

An accepting run $t_0 \dots t_n$ of a TDNFTA $A$ on a tree $t$ is a run where $t_0 = q_0[t]$ and $t_n \in T_\Sigma$, that is, no states remain in the final intermediate tree.

The set L(A) of trees on which an accepting run can be constructed for the TDNFTA A is the language accepted by A. TDNFTA recognise the class of regular tree languages, just as BUNFTA and BUDFTA. Deterministic top-down tree automata (TDDFTA) can be defined similarly to BUDFTA, that is, we restrict the rule set such that each left-hand side occurs at most once.

TDDFTA recognise a proper subclass of the regular tree languages, i.e. there are regular tree languages for which no TDDFTA can be constructed. An example of such a language is the language {f [a, b], f [b, a]}. To prove this, assume that there is a rule

$$q_0[f[v_1, v_2]] \to q_1[v_1], q_2[v_2]$$

in R. This, however, means that in order for both f[a, b] and f[b, a] to be in L(A), there must be rules

$$q_1[a] \to \lambda \qquad q_2[a] \to \lambda \qquad q_1[b] \to \lambda \qquad q_2[b] \to \lambda$$

in R as well, meaning that both f [a, a] and f [b, b] are in L(A), which would result in the automaton recognising the wrong language. It should be fairly obvious that it is rather trivial to construct a BUDFTA recognising the correct language (and more generally, that all finite tree languages, like all finite string languages, are regular).

Further, by allowing non-determinism in the top-down case, we can amend the automaton to have the rule set

$$\begin{array}{ll}
q_0[f[v_1, v_2]] \to q_a[v_1], q_b[v_2] & q_0[f[v_1, v_2]] \to q_b[v_1], q_a[v_2] \\
q_a[a] \to \lambda & q_b[b] \to \lambda
\end{array}$$

showing that non-deterministic top-down automata can recognise the same language. This example also shows non-determinism in top-down automata, and specifically nondeterministically choosing not only the possible states for a specific subtree, but the possible combinations of states for sibling subtrees, which is what allows TDNFTA to recognise languages TDDFTA cannot.

The functions apply, isDeterministic, and parsing all work similarly to how they work for BUNFTA, but applyState no longer takes only a tree as an argument, but instead takes both a tree and a state, and reports if it is possible for the tree to be processed starting in the specified state.
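The top-down check can be sketched in a few lines. The code below is an illustration under stated assumptions, not the Marbles implementation: the Tree case class and the function signatures are hypothetical, and rulesAB encodes the non-deterministic rule set for the language {f[a, b], f[b, a]} given above, with λ represented by the empty sequence.

```scala
object TopDownSketch {
  // Simplified stand-in for the Marbles Tree class (an assumption, not the real API).
  case class Tree[T](root: T, subtrees: Seq[Tree[T]])

  // (symbol, state) -> set of state sequences, one state per subtree.
  type Rules[T] = Map[(T, String), Set[Seq[String]]]

  // Can t be processed completely when starting in state q?
  // Each right-hand side fixes the states of all sibling subtrees at once.
  def applyState[T](rules: Rules[T], t: Tree[T], q: String): Boolean =
    rules.getOrElse((t.root, q), Set.empty[Seq[String]]).exists { qs =>
      qs.size == t.subtrees.size &&
        t.subtrees.zip(qs).forall { case (s, qi) => applyState(rules, s, qi) }
    }

  def accepts[T](rules: Rules[T], q0: String, t: Tree[T]): Boolean =
    applyState(rules, t, q0)

  // Deterministic: no left-hand side has more than one right-hand side.
  def isDeterministic[T](rules: Rules[T]): Boolean =
    rules.values.forall(_.size <= 1)

  // The non-deterministic rule set for {f[a, b], f[b, a]}.
  val rulesAB: Rules[String] = Map(
    ("f", "q0") -> Set(Seq("qa", "qb"), Seq("qb", "qa")),
    ("a", "qa") -> Set(Seq.empty[String]),
    ("b", "qb") -> Set(Seq.empty[String])
  )
}
```

Storing whole sequences in the Set, rather than one set of states per subtree, is what keeps the two choices for f apart: the combination (qa, qa) is never offered.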


4.2 Semirings and weighted automata

A semiring is an algebraic structure, used to define weighted tree automata (WTA), and, using WTA, recognisable tree series. Specifically, a semiring is a set O equipped with two binary operations, + and · (addition and multiplication), such that

– + is an associative, commutative operation on O with identity element 0,

– · is an associative operation on O with identity element 1,

– · distributes over +, and

– multiplication with the additive identity 0 annihilates O, that is, a · 0 = 0 · a = 0 for all a ∈ O.

Marbles defines the Semiring[T] and SemiringFactory[T] traits which may be implemented by the user. Alternatively, one may use one of the predefined semirings provided in semirings.scala, that is either

– the Reals semiring, which is basically the real numbers, with +, ·, 0 and 1 as would be expected,

– the MaxPlus semiring, with 0 := −∞, 1 := 0, + := max, and · := +, or

– the Boolean semiring, with 0 := false, 1 := true, + := OR, and · := AND

(here, symbols to the left of ‘:=’ denote semiring components and symbols to the right of it have their usual meaning).
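As an illustration of the three predefined semirings, the following sketch spells them out against a small interface. The trait shown here is a hypothetical simplification; the actual Semiring[T] and SemiringFactory[T] traits in Marbles may be organised differently, and the member names zero, one, plus and times are assumptions. The reals are approximated by Double.

```scala
object SemiringSketch {
  // Hypothetical, simplified semiring interface (not the Marbles trait itself).
  trait Semiring[A] {
    def zero: A                 // identity of plus, annihilator of times
    def one: A                  // identity of times
    def plus(x: A, y: A): A
    def times(x: A, y: A): A
  }

  // The reals, approximated by Double.
  object Reals extends Semiring[Double] {
    val zero: Double = 0.0
    val one: Double = 1.0
    def plus(x: Double, y: Double): Double = x + y
    def times(x: Double, y: Double): Double = x * y
  }

  // 0 := -infinity, 1 := 0, + := max, * := +
  object MaxPlus extends Semiring[Double] {
    val zero: Double = Double.NegativeInfinity
    val one: Double = 0.0
    def plus(x: Double, y: Double): Double = math.max(x, y)
    def times(x: Double, y: Double): Double = x + y
  }

  // 0 := false, 1 := true, + := OR, * := AND
  object BooleanSemiring extends Semiring[Boolean] {
    val zero: Boolean = false
    val one: Boolean = true
    def plus(x: Boolean, y: Boolean): Boolean = x || y
    def times(x: Boolean, y: Boolean): Boolean = x && y
  }

  // Sum and product of a sequence, expressed only through the semiring operations.
  def sum[A](sr: Semiring[A], xs: Seq[A]): A = xs.foldLeft(sr.zero)(sr.plus)
  def product[A](sr: Semiring[A], xs: Seq[A]): A = xs.foldLeft(sr.one)(sr.times)
}
```

Code written against the interface alone, such as sum and product, works unchanged for every semiring, which is the property the weighted automata rely on.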

Informally, a weighted tree automaton (WTA) computes a function from $T_\Sigma$ to some semiring $O$, using the multiplication and addition operations to deduce a value from the tree. Using the previous definition of a BUNFTA as a basis, we extend this to the weighted case as follows:

A bottom-up weighted finite tree automaton (BUWFTA) is a 5-tuple $A = (O, \Sigma, Q, R, F)$ where

– $O$ is a semiring,

– $\Sigma$ and $Q$ are as in BUNFTA,

– $R$ is a set of rules on the form $a[q_1 \dots q_k] \to_w q'$ for $q', q_1, \dots, q_k \in Q$, $a \in \Sigma_k$, where $w \in O$ is called the weight of the rule, and

– $F$ is a mapping from $Q$ to $O$ of final weights.

The Marbles representation of this structure is the BUWFTA class, which is defined in much the same way as the BUNFTA class, save that the set containing the right-hand sides now contains tuples of resulting state and weight. Further, the set of final states has been replaced by a Map from state to a final weight, and instead of inheriting from PartialFunction[Tree[T],Boolean], the resulting value is of type R (i.e. the semiring type parameter).

We define the (potentially infinite) alphabet $\Gamma = \Gamma_1$ of pairs of state and weight. That is, for $q \in Q$ and $w \in O$, $(q, w) \in \Gamma_1$. An intermediate tree $t_i$ of a BUWFTA $A$ is a tree over $\Sigma \cup \Gamma$.

A valid run of a BUWFTA $A$ is a sequence of intermediate trees $t_l \dots t_m$ such that the trees $t_i$ and $t_{i+1}$ are related as follows:

– there is a subtree $a[(q_1, w_1)[s_1] \dots (q_k, w_k)[s_k]]$, $a \in \Sigma_k$, $q_1, \dots, q_k \in Q$, $w_1, \dots, w_k \in O$, $s_1, \dots, s_k \in T_\Sigma$ at a position $p$ in $t_i$,

– there is a subtree $(q', w')[a[s_1, \dots, s_k]]$, $q' \in Q$, at position $p$ in $t_{i+1}$,

– $t_i$ and $t_{i+1}$ are otherwise equal,

– there is a rule $a[q_1 \dots q_k] \to_w q'$ in $R$, and

– $w' = w \cdot \prod_{i=1}^{k} w_i$.

A successful run $r = t_0 \dots t_n$ of a BUWFTA $A$ on a tree $t$ is a run where

– $t_0 = t$ and

– $t_n = (q, w)[t]$, where $q \in Q$. In this case, $w_r(t) = F(q) \cdot w$ is the weight contributed by the run $r$ of the BUWFTA $A$ on the tree $t$.

The weight of the tree $t$ as given by the BUWFTA $A$ is the sum of all weights $w_r(t)$, where $r$ is a successful run of $A$ on $t$.

The tree series defined by $A$ is the mapping from the trees in $T_\Sigma$ to their weights as given by $A$. As was mentioned at the beginning of this section, BUWFTA define the class of recognisable tree series.
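The sum over all successful runs does not need to enumerate runs explicitly. Using distributivity, it suffices to keep, for every subtree and every state, the summed weight of all runs ending in that state. The sketch below follows this idea; it is an assumption-laden illustration rather than the Marbles BUWFTA class. The Semiring trait, the Tree case class and all function names are hypothetical, and the example automaton countA (which weighs a tree by its number of a-leaves) is invented for this illustration.

```scala
object WeightedSketch {
  case class Tree[T](root: T, subtrees: Seq[Tree[T]])

  // Hypothetical minimal semiring interface (not the Marbles trait itself).
  trait Semiring[A] {
    def zero: A
    def one: A
    def plus(x: A, y: A): A
    def times(x: A, y: A): A
  }

  object Reals extends Semiring[Double] {
    val zero: Double = 0.0
    val one: Double = 1.0
    def plus(x: Double, y: Double): Double = x + y
    def times(x: Double, y: Double): Double = x * y
  }

  // (symbol, states of the subtrees) -> set of (resulting state, rule weight).
  type Rules[T, A] = Map[(T, Seq[String]), Set[(String, A)]]

  // For every state q: the sum of the weights of all runs on t that end in q.
  def stateWeights[T, A](sr: Semiring[A], rules: Rules[T, A], t: Tree[T]): Map[String, A] = {
    val below: Seq[Map[String, A]] = t.subtrees.map(s => stateWeights(sr, rules, s))
    val contributions: Seq[(String, A)] = rules.toSeq.flatMap { case ((sym, qs), rhs) =>
      if (sym == t.root && qs.size == below.size) {
        // A child that cannot reach the required state contributes the semiring zero.
        val childWeights = qs.zip(below).map { case (qi, m) => m.getOrElse(qi, sr.zero) }
        // w' = w * w_1 * ... * w_k
        rhs.toSeq.map { case (q, w) => (q, childWeights.foldLeft(w)(sr.times)) }
      } else Seq.empty[(String, A)]
    }
    contributions.groupBy(_._1).map { case (q, cs) =>
      (q, cs.map(_._2).foldLeft(sr.zero)(sr.plus))
    }
  }

  // Weight of t: the sum over all states q of F(q) * (summed run weight for q).
  def weight[T, A](sr: Semiring[A], rules: Rules[T, A], fin: Map[String, A], t: Tree[T]): A = {
    val ws = stateWeights(sr, rules, t)
    fin.toSeq.map { case (q, f) => sr.times(f, ws.getOrElse(q, sr.zero)) }
      .foldLeft(sr.zero)(sr.plus)
  }

  // Invented example: state "c" marks exactly one chosen a-leaf, "n" marks none.
  val countA: Rules[String, Double] = Map(
    ("a", Seq.empty[String]) -> Set(("c", 1.0), ("n", 1.0)),
    ("b", Seq.empty[String]) -> Set(("n", 1.0)),
    ("f", Seq("c", "n")) -> Set(("c", 1.0)),
    ("f", Seq("n", "c")) -> Set(("c", 1.0)),
    ("f", Seq("n", "n")) -> Set(("n", 1.0))
  )
  val finA: Map[String, Double] = Map("c" -> 1.0, "n" -> 0.0)
}
```

In countA every successful run ending in c corresponds to one choice of an a-leaf, so over the Reals semiring the weight of a tree equals the number of its a-leaves.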

The top-down case is defined analogously, though obviously using TDNFTA as its base rather than BUNFTA. For a thorough survey on weighted automata theory in general, including formal proofs of various properties mentioned above, refer to [DKV09].

4.3 Generators

The “inverse” formal devices of recognisers are various kinds of grammars. Notable in the string case is the context-free grammar, which is much more readily used to define context-free languages than the appropriate recogniser, the push-down automaton. Likewise, the standard regular expression is as easily converted to a grammar as to a finite string automaton. For regular tree languages the equivalent construction is the regular tree grammar. While recognisers are reasonable to define in both a top-down and a bottom-up manner, it would be hard to know in advance how many leaves to start with in a bottom-up generator. Further, as with the string variants, deterministic grammars are obviously unreasonable, as such grammars would only define a single string or tree. Formally:

A regular tree grammar (RTG) is a 4-tuple $G = (\Sigma, N, R, S)$ where

– $\Sigma$ is the ranked alphabet of terminal (output) symbols,

– $N$ is a ranked alphabet of non-terminal symbols such that $N = N_0$,

– $R$ is a set of rules on the form $A \to t$, where $A \in N$ and $t$ is a tree over $\Sigma \cup N$, and

– $S \in N$ is the starting symbol.

In Marbles, the class RTGrammar is more or less set up as expected, with the instance variables being named sigma, nonterminals, rules and start, respectively. The rules variable is a Map[String,Set[Tree[Either[String,T]]]]. In contrast with the recogniser automata, generators like RTGs cannot in a natural way be represented as functions in Scala. Instead, we choose to model them as Iterator[Tree[T]], i.e. devices that iterate over a (potentially infinite) collection of items (in this case the Tree[T] of a tree language).

An intermediate tree t of an RTG G is a tree over Σ ∪ N .

A valid sequence of $G$ is a sequence $t_0 \dots t_n$ of intermediate trees where each pair of trees $t_i$, $t_{i+1}$ is related as follows:

– there is a nonterminal $A$ at position $p$ in $t_i$,

– there is a subtree $t'$ at position $p$ in $t_{i+1}$,

– $t_i$ and $t_{i+1}$ are otherwise identical, and

– there is a rule $A \to t'$ in $R$.

A generation of a tree $t \in T_\Sigma$ by the RTG $G$ is a valid sequence of intermediate trees such that $t_0 = S$ and $t_n = t$.

The language L(G) generated by the RTG G is the set of trees that can be generated by G. Though the proof is outside the scope of this thesis, it can be shown that RTGs correspond exactly to BUNFTA, and thus are another way to define the regular tree languages. As an example, consider the language defined by D and N in Example 1. An RTG defining the same language looks as follows:

Example 2. $G = (\Sigma, \{A, B, S\}, R_G, S)$ where $R_G$ is

$$\begin{array}{ll}
S \to A & S \to B \\
A \to a & B \to b \\
A \to g[A] & B \to g[B] \\
A \to f[A, B] & B \to f[A, B] \\
A \to f[B, A] & B \to f[B, A]
\end{array}$$

It may be illustrative to connect the nonterminals $A$ and $B$ with the states $q_a$ and $q_b$, respectively, and compare $R_G$ with $R_N$. Note that, apart from the start rules involving $S$, the rules are identical but “inverted”. In fact, the constructive proof of the expressive equivalence of BUNFTA and RTG involves adding a start state that goes to every state/nonterminal in $F$, and then simply inverting the rules.

In Marbles, actual tree generation proved to be much more of a problem than anticipated. Eventually, a solution was found, though the algorithm is less elegant than might be desired. At the centre of the algorithm is a “rule-choice” tree, which dictates what rule to apply at each nonterminal. The iteration works on this tree, while constructing the current output tree of the iteration.

Making things more formal: for the RTG $G = (\Sigma, N, R, S)$, we define a partial mapping $d_r$ (the “rule-depth”) from the nonterminals $N$ to the natural numbers as follows: $d_r$ is the unique partial function $d_r : N \to \mathbb{N}$ such that, for all $A \in N$, $d_r(A)$ is the smallest natural number for which there exists a rule $A \to t$ where, for all nonterminals $B$ in $t$, $d_r(B) < d_r(A)$. Starting with the completely undefined mapping, we can determine $d_r$ iteratively:

– If there is a rule $A \to t$ in $R$ such that $t$ consists of only terminal symbols, then $d_r(A) = 0$.

– Loop, while $d_r$ becomes more defined:

• Let $d_r^{old} = d_r$.

• For each rule $A \to c[B_1, \dots, B_l]$ in $R$, where $c$ is a multicontext over $\Sigma$, and $B_1, \dots, B_l$ are nonterminals,

– if $d_r^{old}(A)$ is not defined, but $d_r^{old}(B_i)$ is defined for $i = 1, \dots, l$, let $d_r(A) = \max_{1 \le i \le l} d_r^{old}(B_i) + 1$.

The mapping $d_r$ thus denotes how many “levels” each nonterminal is from a finished output tree. This is used to order the rules involving a specific left-hand side, such that the most “shallow” rule comes first. If two rules are equally deep, the ordering is based on the output trees.
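The rule-depth computation is a straightforward fixpoint iteration, which can be sketched as follows. This is an illustration only: the Tree case class, the helper names nt, sym and nonterminals, and the function ruleDepth are hypothetical and not taken from the Marbles source. The base case (rules without nonterminals get depth 0) is folded into the first round of the loop, and example2 encodes the rules of Example 2.

```scala
object RuleDepthSketch {
  case class Tree[+T](root: T, subtrees: Seq[Tree[T]])

  // Left(name) is a nonterminal, Right(symbol) a terminal.
  type Rhs[T] = Tree[Either[String, T]]

  // All nonterminals occurring in a right-hand side.
  def nonterminals[T](t: Rhs[T]): Seq[String] = {
    val here = t.root match {
      case Left(a)  => Seq(a)
      case Right(_) => Seq.empty[String]
    }
    here ++ t.subtrees.flatMap(s => nonterminals(s))
  }

  // Repeats until no further nonterminal receives a depth.
  def ruleDepth[T](rules: Map[String, Set[Rhs[T]]]): Map[String, Int] = {
    @annotation.tailrec
    def loop(old: Map[String, Int]): Map[String, Int] = {
      val added = rules.toSeq.flatMap { case (a, rhss) =>
        // Depths offered by rules whose nonterminals all have a depth already.
        val depths = rhss.toSeq
          .map(rhs => nonterminals(rhs))
          .filter(bs => bs.forall(old.contains))
          .map(bs => if (bs.isEmpty) 0 else bs.map(old).max + 1)
        if (old.contains(a) || depths.isEmpty) None else Some(a -> depths.min)
      }
      if (added.isEmpty) old else loop(old ++ added)
    }
    loop(Map.empty)
  }

  // Helpers for writing right-hand sides over String terminals.
  def nt(name: String): Rhs[String] = Tree[Either[String, String]](Left(name), Nil)
  def sym(s: String, subs: Rhs[String]*): Rhs[String] =
    Tree[Either[String, String]](Right(s), subs.toList)

  // The rules of Example 2.
  val example2: Map[String, Set[Rhs[String]]] = Map(
    "S" -> Set(nt("A"), nt("B")),
    "A" -> Set(sym("a"), sym("g", nt("A")), sym("f", nt("A"), nt("B")), sym("f", nt("B"), nt("A"))),
    "B" -> Set(sym("b"), sym("g", nt("B")), sym("f", nt("A"), nt("B")), sym("f", nt("B"), nt("A")))
  )
}
```

Nonterminals that can never produce a finished tree simply stay outside the resulting map, which matches the mapping being partial.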

Given a rule-choice tree, we define the corresponding output tree as follows:

FUNCTION computeCurrent(n: nonterminal, rt: rule-choice tree)
  LET root[subtrees] = rt
  // i.e. the root:th rule corresponding to the left-hand side n
  LET ot: output-tree = rules(n)[root]
  FOREACH (nonterminal sn in ot)
    IF subtrees IS EMPTY
      // Find the shortest path to a complete output tree
      replace sn with computeCurrent(sn, 0[])
    ELSE
      replace sn with computeCurrent(sn, subtrees.head)
      LET subtrees = subtrees.tail
    ENDIF
  END
  RETURN ot
END

LET output = computeCurrent(startSymbol, ruleChoiceTree)

Thus, the rule-choice tree 0[] will return the smallest tree of the language, even though it may require more than a single rule application to get there.
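A functional rendering of this procedure could look as follows. It is a sketch under several assumptions and not the Marbles code: the rules are given as a Map from nonterminal to a Vector that is assumed to be sorted already (shallowest rule first), each nonterminal occurrence is expanded according to its own name, terminals are Strings, and the names Tree, Rhs, expand, nt and sym are hypothetical. An out-of-range rule index is not handled.

```scala
object RuleChoiceSketch {
  case class Tree[+T](root: T, subtrees: Seq[Tree[T]])

  // Left(name) is a nonterminal, Right(symbol) a terminal.
  type Rhs = Tree[Either[String, String]]

  // Expands nonterminal n according to the rule-choice tree rt.
  def computeCurrent(rules: Map[String, Vector[Rhs]], n: String, rt: Tree[Int]): Tree[String] = {
    // Rewrites t from left to right; returns the result and the unused choices.
    def expand(t: Rhs, choices: List[Tree[Int]]): (Tree[String], List[Tree[Int]]) =
      t.root match {
        case Left(sn) => choices match {
          // No choice left: take the shallowest rule all the way down.
          case Nil          => (computeCurrent(rules, sn, Tree(0, Nil)), Nil)
          case head :: tail => (computeCurrent(rules, sn, head), tail)
        }
        case Right(s) =>
          val (subs, rest) =
            t.subtrees.foldLeft((Vector.empty[Tree[String]], choices)) {
              case ((done, cs), sub) =>
                val (expanded, remaining) = expand(sub, cs)
                (done :+ expanded, remaining)
            }
          (Tree(s, subs), rest)
      }
    expand(rules(n)(rt.root), rt.subtrees.toList)._1
  }

  def nt(name: String): Rhs = Tree[Either[String, String]](Left(name), Nil)
  def sym(s: String, subs: Rhs*): Rhs = Tree[Either[String, String]](Right(s), subs.toList)

  // The rules of Example 2, in an assumed shallowest-first order.
  val example2: Map[String, Vector[Rhs]] = Map(
    "S" -> Vector(nt("A"), nt("B")),
    "A" -> Vector(sym("a"), sym("g", nt("A")), sym("f", nt("A"), nt("B")), sym("f", nt("B"), nt("A"))),
    "B" -> Vector(sym("b"), sym("g", nt("B")), sym("f", nt("A"), nt("B")), sym("f", nt("B"), nt("A")))
  )
}
```

The remaining choices are threaded through the traversal as an explicit return value, which replaces the mutable subtrees variable of the pseudocode.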

Iterating over the trees of the language is accomplished using a depth-first search with iterative deepening, though certain complications are introduced because certain rule chains end in dead ends, among other things. In addition, some rule combinations will result in the same output tree being generated twice. This can either be ignored as an irrelevant side effect of the algorithm, or alleviated by keeping track of the output trees that have already been produced, and simply iterating until a new tree is found. This is guaranteed to terminate because the rule set and alphabet are both finite. As an example, for the language discussed in Example 2, the first few rule-choice trees we expect from the iteration are:

0[]

1[]

0[1]

0[2]

0[3]

1[1]

//We skip 1[2] and 1[3] since the outputs are equal to 0[2] and 0[3]

0[1[1]]

0[1[2]]


The iteration itself is not particularly interesting. It is a simple matter of updating the positions of the tree, and substituting the subtrees with fresh copies with the proper number of subtrees in the cases of an internal node being updated. Indeed, implementing the algorithm in a functional manner took far more time than understanding its general outline. A simplified version of the functional implementation looks as follows:

FUNCTION iterateSubtree(rt : rule-choice tree,
                        depth : integer,
                        alreadyUpdated : boolean
                       ): (rule-choice tree, boolean)
  IF(alreadyUpdated)
    (rt, true) // Simply move on, the tree has already been updated
  ELSE
    IF(depth == 0)
      IF(rules left for this nonterminal)
        // Simple, we can update here and move on
        (Tree(rt.root + 1, Nil), true)
      ELSE
        // We need to update somewhere else
        (Tree(0, Nil), false)
      ENDIF // Rules left
    ELSE
      // We are not yet at the target depth
      LET newsubs = FOREACH( srt IN rt.subtrees )
        // Iterate downwards, and collect the changed subtrees
        (newsub, alreadyUpdated) = iterateSubtree(srt, depth - 1, alreadyUpdated)
        YIELD newsub
      END
      // Did we get our desired change yet?
      IF(alreadyUpdated)
        // Yes, return it, then
        (Tree(rt.root, newsubs), true)
      ELSE
        // We need to update this node
        IF(rules left for this nonterminal)
          // Fill the tree below this level with the proper
          // number and arrangement of zeroes
          (fillTree(nonterminal, rt.root + 1, depth), true)
        ELSE
          // Just fill with zeros for now
          (fillTree(nonterminal, 0, depth), false)
        ENDIF // Rules left
      ENDIF // Update below this node
    ENDIF // At the target depth
  ENDIF // Update before we even got to this node
END

In the actual iteration, an iterate function tries to run the current rule-choice tree through the iterateSubtree function, and increases the depth if it is not possible to update. This function also incorporates the duplicate checking code.

4.4 Transducers

Tree transducers are formalised automata that take a tree $t$ as input and use that to construct an output tree $t'$ (possibly linked to some other value). Because of their use in areas such as translation and XML processing, as well as having other interesting properties, they have been the focus of quite a bit more research than the recogniser classes.

As for recognisers, it is reasonable to define both bottom-up and top-down variants of tree transducers and it will be shown that both variants have interesting properties. Informally, we can think of tree transducers as tree recognisers that, apart from producing a state at each node, also produce a tree. Additionally, there is a specified way the trees at each node are combined into one final output tree.

More formally:

A bottom-up finite tree transducer (BUFTT) is a 5-tuple $T = (\Sigma, \Delta, Q, R, F)$, where

– $\Sigma$ is the (ranked) input alphabet,

– $\Delta$ is the (ranked) output alphabet,

– $Q$ is a ranked alphabet of states such that $Q = Q_1$,

– $R$ is a set of rules on the form

$$s[q_1[x_1], \dots, q_k[x_k]] \to q[t']$$

where $q, q_1, \dots, q_k \in Q$, $s \in \Sigma_k$, $x_1, \dots, x_k$ are variables, and $t' \in T_{\Delta \cup X_k}$, and

– $F \subseteq Q$ is a set of final states.

As for the previous automata types, the Marbles equivalent is fairly close to the theoretical definition: sigma, delta, states and fin all have the types one would expect, while rules is of the type Map[(F,Seq[String]),Set[(VarTree[T], String)]], where F is the type parameter of the input alphabet, and T of the output alphabet. VarTree[T] is in principle a Tree over Either[Int,T], though there are a number of extra methods implemented for easing the tasks associated with tree transducers and similar constructs.

An intermediate tree t of a BUFTT T is a tree over Σ ∪ ∆ ∪ Q.

A computation of a BUFTT $T$ is a sequence $t_l, \dots, t_m$ of intermediate trees such that $t_i$ and $t_{i+1}$ relate to each other and $T$ as follows:

– there exists a tree $s[q_1[t_1], \dots, q_k[t_k]]$ at position $p$ in $t_i$,

– there exists a tree $q[t'']$ at position $p$ in $t_{i+1}$,

– $t_i$ and $t_{i+1}$ are otherwise equal,

– there is a rule $s[q_1[x_1], \dots, q_k[x_k]] \to q[t']$ in $R$, and

– $t''$ is the tree one obtains by taking $t'$ and substituting each instance of $x_i$ by $t_i$, for $i = 1, \dots, k$.

A successful computation of a BUFTT $T$ on a tree $t \in T_\Sigma$ is a computation $t_0, \dots, t_n$ where $t_0 = t$ and $t_n = q[t_{out}]$ where $q \in F$ and $t_{out} \in T_\Delta$. The trees $t$ and $t_{out}$ are the input and output trees, respectively, of this computation. As the BUFTT may be nondeterministic, each input tree defines a set of output trees, and the BUFTT as a whole defines a relation $U$ on $T_\Sigma \times T_\Delta$, where $(t, t_{out}) \in U$ if and only if there is a successful computation of $T$ such that $t$ and $t_{out}$ are its input and output trees respectively.
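The computation just defined can be sketched as a recursive function that returns, for each input subtree, all reachable pairs of state and output tree. The code is an illustration and not the BUTreeTransducer class: VarTree is modelled as a plain type alias over Either[Int,T] with 1-based variable indices, and the names substitute, process, transduce, v, out and the example transducer mirror are assumptions introduced here.

```scala
object TransducerSketch {
  case class Tree[+T](root: T, subtrees: Seq[Tree[T]])

  // Left(i) is the variable x_i (1-based), Right(d) an output symbol.
  type VarTree[T] = Tree[Either[Int, T]]
  type Rules[F, T] = Map[(F, Seq[String]), Set[(VarTree[T], String)]]

  // Replaces every variable x_i in rhs by the i-th tree in args.
  def substitute[T](rhs: VarTree[T], args: Seq[Tree[T]]): Tree[T] = rhs.root match {
    case Left(i)  => args(i - 1)
    case Right(d) => Tree(d, rhs.subtrees.map(s => substitute(s, args)))
  }

  // All (state, output tree) pairs reachable at the root of the input tree t.
  def process[F, T](rules: Rules[F, T], t: Tree[F]): Set[(String, Tree[T])] = {
    val below: Seq[Set[(String, Tree[T])]] = t.subtrees.map(s => process(rules, s))
    // One (state, output) choice per subtree, in every possible combination.
    val combos = below.foldRight(Set(List.empty[(String, Tree[T])])) { (choices, acc) =>
      for (c <- choices; rest <- acc) yield c :: rest
    }
    for {
      combo <- combos
      (rhs, q) <- rules.getOrElse((t.root, combo.map(_._1)), Set.empty[(VarTree[T], String)])
    } yield (q, substitute(rhs, combo.map(_._2)))
  }

  // The output trees of all successful computations on t.
  def transduce[F, T](rules: Rules[F, T], fin: Set[String], t: Tree[F]): Set[Tree[T]] =
    process(rules, t).collect { case (q, o) if fin.contains(q) => o }

  def v(i: Int): VarTree[String] = Tree[Either[Int, String]](Left(i), Nil)
  def out(d: String, subs: VarTree[String]*): VarTree[String] =
    Tree[Either[Int, String]](Right(d), subs.toList)

  // Invented example: swaps the two subtrees of every f.
  val mirror: Rules[String, String] = Map(
    ("a", Seq.empty[String]) -> Set((out("a"), "q")),
    ("b", Seq.empty[String]) -> Set((out("b"), "q")),
    ("f", Seq("q", "q")) -> Set((out("f", v(2), v(1)), "q"))
  )
}
```

Since the subtrees are fully processed before a rule is applied at their parent, a variable that occurs twice in a right-hand side copies an already finished output tree.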

In Marbles, a BUTreeTransducer[F,T] inherits from TreeTransducer[F,T], which as of this writing is simply a “forwarding” trait that inherits from the basic PartialFunction[Tree[F],Set[Tree[T]]] trait. This inserts the transducer at the appropriate place in the Scala ecosystem, and allows one to use various interesting constructions, such as making a RegularTreeGrammar, and then mapping a tree transducer on top of it, to end up with an Iterator over the sets of output trees. Alternatively, by using flatMap, the individual trees are accessed. As for the other automata types, a parser is included in the companion object.

Just as BUNFTA relate to BUFTT, TDNFTA relate to top-down finite tree transducers (TDFTT). Formally:

A top-down finite tree transducer (TDFTT) is a 5-tuple $T = (\Sigma, \Delta, Q, R, q_0)$, where

– $\Sigma$ is the (ranked) input alphabet,

– $\Delta$ is the (ranked) output alphabet,

– $Q$ is a ranked alphabet of states such that $Q = Q_1$,

– $R$ is a set of rules on the form

$$q[s[x_1, \dots, x_k]] \to c[q_1[x_{i_1}], \dots, q_n[x_{i_n}]]$$

where $q_1, \dots, q_n, q \in Q$, $s \in \Sigma_k$, $k \in \mathbb{N}$, $i_1, \dots, i_n \in \{1, \dots, k\}$ and $c$ is a multicontext of rank $n$ over $\Delta$, and

– $q_0 \subseteq Q$ is a set of initial states.

The Marbles implementation is again fairly close to the theory, but with rules being of a quite interesting type: Map[(F,String),Set[(VarTree[T],Seq[(String, Int)])]]. Here, (F,String) corresponds to the left-hand side, but the right-hand side is more complex: each VarTree has a number of variables that may be larger or smaller than the number of subtrees of s, so the Seq of (String,Int) records what state should be used for each particular variable, and what subtree of s should be inserted at that point. The Seq therefore needs to have the same size as the number of variables in the VarTree.

We forego formal definitions of the computations of TDFTT at this time to focus on what makes TDFTT fundamentally different from BUFTT. In short: instead of choosing a tree based on symbol and states from below, and inserting the subtrees at their respective places, we work from the top, transforming the tree and nondeterministically choosing the states and trees as we move downward. This becomes relevant only when the transducer is non-linear, in the sense that subtrees are copied during the processing. Specifically, in TDFTT, we can initiate processing of two copies of the same subtree using two different states, while in BUFTT any processing will already be complete by the time we are able to apply any copying. This important distinction means that there are relations that can be defined by BUFTT but not by TDFTT, and vice versa. This relationship will be further explored in Section 5.

Weighted transducers In a similar way to how weighted automata associate a weight with a tree, weighted transducers associate a weight with an input-output tree pair. This is useful for multiple purposes, such as associating probabilities with various translations of a natural language sentence.

5 Algorithms on Tree Automata

In order to demonstrate that the implemented prototype serves one of the intended purposes of the complete Marbles system (i.e. as a means to quickly and easily test algorithms on various tree automata), several algorithms on tree automata were implemented. During the implementation work, certain problems seemed to be common for implementing most if not all algorithms.

5.1 Functionalising tree automata algorithms

While it is quite possible to use an imperative programming style even in Scala, the language is designed to be used in a functional way, and many parts of the standard library, the collections framework among them, have plenty of methods and structures that allow for easy functional programming. This can be contrasted to many if not most descriptions of algorithms in the literature, where imperative pseudocode seems prevalent. Converting the algorithms from imperative to functional requires a rather deeper understanding of the fundamentals of the algorithm. For this reason, implementation is often slow to start, and may have to be restarted several times, as the understanding of the problem grows. As a compensation, the final algorithm implementation may in some cases be simpler, more elegant, and less bug-prone than the more traditional implementations. In addition, while implementation using imperative languages may at times be more straightforward, it most often still requires the programmer to consider various implementation details that are left undefined by the algorithm.

5.2 Further transducer background

In order to properly appreciate the example transducer splitting algorithms, we require some more theory and definitions. Recall the definitions in Subsection 4.4.

By placing constraints on the structures, we can find different classes of tree transductions. Specifically, we call the class of tree transductions definable by TDFTT T, and the class definable by BUFTT B. Further, by restricting the number of rules with the same left-hand side to at most one, we arrive at the deterministic TDFTT and BUFTT; the classes of transductions definable by these are denoted by DT and DB, respectively. These, and the classes defined below, are all described in [Eng75]. Additionally, the proofs of the various relations between the classes can be found there, as well as the algorithms implemented below.

Additionally, we define the following constraints:

– A transducer is total deterministic, if there is exactly one right-hand side for each possible left-hand side.

– A transducer is linear, if each variable that occurs on the left-hand side of a rule occurs at most once in each right-hand side.

– A transducer is non-deleting, if each variable that occurs on the left-hand side of a rule occurs at least once in each right-hand side.

– A transducer is single-state, or pure, if |Q| = 1.
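For bottom-up rule maps of the shape described in Subsection 4.4, the syntactic constraints above can be checked by counting variable occurrences. The following is a sketch under stated assumptions: VarTree is again modelled as a tree over Either[Int,T] with 1-based variable indices, a rule with k child states is taken to have the variables x_1 to x_k on its left-hand side, and all names (variables, isLinear, isNonDeleting, mirror, copyLeft) are hypothetical.

```scala
object ConstraintSketch {
  case class Tree[+T](root: T, subtrees: Seq[Tree[T]])

  // Left(i) is the variable x_i (1-based), Right(d) an output symbol.
  type VarTree[T] = Tree[Either[Int, T]]
  type Rules[F, T] = Map[(F, Seq[String]), Set[(VarTree[T], String)]]

  // The variable indices occurring in a right-hand side, with repetitions.
  def variables[T](t: VarTree[T]): Seq[Int] = {
    val here = t.root match {
      case Left(i)  => Seq(i)
      case Right(_) => Seq.empty[Int]
    }
    here ++ t.subtrees.flatMap(s => variables(s))
  }

  // Checks the occurrence count of every left-hand side variable in every rule.
  private def forallCounts[F, T](rules: Rules[F, T])(ok: Int => Boolean): Boolean =
    rules.forall { case ((_, qs), rhss) =>
      rhss.forall { case (rhs, _) =>
        val vars = variables(rhs)
        (1 to qs.size).forall(i => ok(vars.count(_ == i)))
      }
    }

  // Linear: each variable occurs at most once in each right-hand side.
  def isLinear[F, T](rules: Rules[F, T]): Boolean = forallCounts(rules)(_ <= 1)

  // Non-deleting: each variable occurs at least once in each right-hand side.
  def isNonDeleting[F, T](rules: Rules[F, T]): Boolean = forallCounts(rules)(_ >= 1)

  // Deterministic: at most one right-hand side per left-hand side.
  def isDeterministic[F, T](rules: Rules[F, T]): Boolean = rules.values.forall(_.size <= 1)

  def v(i: Int): VarTree[String] = Tree[Either[Int, String]](Left(i), Nil)
  def out(d: String, subs: VarTree[String]*): VarTree[String] =
    Tree[Either[Int, String]](Right(d), subs.toList)

  // Invented examples: mirror swaps subtrees, copyLeft duplicates the first one.
  val mirror: Rules[String, String] = Map(
    ("a", Seq.empty[String]) -> Set((out("a"), "q")),
    ("f", Seq("q", "q")) -> Set((out("f", v(2), v(1)), "q"))
  )
  val copyLeft: Rules[String, String] = Map(
    ("a", Seq.empty[String]) -> Set((out("a"), "q")),
    ("f", Seq("q", "q")) -> Set((out("f", v(1), v(1)), "q"))
  )
}
```

The single-state constraint is a property of the state set rather than of the rules, so it would be checked on the states variable directly.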

By prepending $D_t$, L, N, and P to T or B, we denote the class of transductions defined by TDFTT and BUFTT with the above constraints, respectively. For example, DLB is the class of transductions definable by deterministic linear BUFTT.

In Subsection 4.4, we briefly mentioned that there were transductions that could be defined by a BUFTT but not by a TDFTT, and vice versa, i.e. that T and B are incomparable:

T ⊈ B ⊈ T

Table of Contents

1 Introduction and Motivation
  1.1 Previous work
  1.2 Project goals
  1.3 Outline
2 Preliminaries
  2.1 Introduction to trees and automata theory
    Trees
    Contexts, variables and multicontexts
    String automata
    Tree automata
  2.2 Project plan
3 Basic Implementation of Marbles
  3.1 Choice of language
    Java
    Haskell
    Scala
  3.2 Basics of Scala
  3.3 Scala example
    Fields and functions of Tree
    The object Tree
    Tree parsing
  3.4 General Marbles organisation
4 Tree Recognisers and Transducers in Marbles
  4.1 Recognisers
  4.2 Semirings and weighted automata
  4.3 Generators
  4.4 Transducers
    Weighted transducers
5 Algorithms on Tree Automata
  5.1 Functionalising tree automata algorithms
  5.2 Further transducer background
  5.3 Bottom-up transducer splitting
    BUFTT splitting example
  5.4 Top-down transducer splitting
    TDFTT splitting example
  5.5 Splitting algorithm implementation
  5.6 Top-down splitter implementation
6 Conclusions and future work
7 Acknowledgements

1 Introduction and Motivation

1.1 Previous work

Of course, several projects exist which allow for working on trees and tree automata in various capacities. However, most of these are somewhat narrow in scope, or are not presented in a more organised fashion but rather used informally within the confines of a research group.

Certain tools have seen wider distribution, though, and some of these have served as an inspiration for the ideas behind Marbles.

– ForestFire[Cle09] is a toolkit for pattern-matching and tree acceptance problems, with various algorithms organised into taxonomies.

– Tiburon[MK06] is a package of algorithms for working on various kinds of tree automata, notably weighted tree transducers and regular tree grammars. Though the focus is on natural language processing and related problems in machine learning, the algorithms are general enough to potentially see use in certain other domains as well.

– LearnLib[RSB05] is a library for finite automata learning and experimentation that primarily focuses on learning algorithms for string automata. However, certain aspects of its organisation were useful as an inspiration for the Marbles prototype.

1.2 Project goals

As may be apparent, a complete framework of this size and complexity is not a viable goal for a thesis at the MSc level. Instead, we aim to propose a viable basis for further research into a complete framework. Specifically, we aim to

– Make a viable prototype for the Marbles system. In particular, the prototype should

• be usable for at least a small part of the tasks covered by the full system,

• include a basic GUI for interacting with the automata,

• include concrete implementations of a subset of the concepts described in [Dre09], and

• have a reasonable (i.e. consistent and logical) architecture, suitable for further implementation work, with a view to eventually being expanded into the complete framework.

1.3 Outline

Section 2 will dig deeper into the theoretical foundations required for the rest of the thesis, while Section 3 will describe the basics of the Marbles implementation, including a discussion of the choice of programming language.

Section 4 will continue with the theory and practice of the prototype, giving both the theoretical description and the implementation of the automata types it provides.

Section 5 describes a proof-of-concept implementation of two related algorithms on tree automata using the types and methods described in Sections 3 and 4.

Finally, Section 6 will contain a number of closing remarks, and introduce the next steps in making the prototype into an actual working framework, usable for research purposes.

2 Preliminaries

2.1 Introduction to trees and automata theory

Let {[, ], ,} be a set of auxiliary symbols, disjoint from any other alphabet considered herein. The set T_Σ of trees over the (ranked) alphabet Σ is the set of strings defined inductively as follows:

– Σ₀ ⊂ T_Σ,

We define a set X of variables x₁, x₂, ..., which is disjoint from any specific ranked alphabet considered in this thesis. In terms of ranks, X = X₀, and we use the notation X^k to denote the first k elements of X.

A multicontext c^k of rank k over the ranked alphabet Σ is a tree in T_(Σ∪X)

String automata Recall that a deterministic finite string automaton (DFSA) is a 5-tuple A = (Σ, Q, R, F, q₀), where

– R is the set of rules of the form qa → q′, where q, q′ ∈ Q, and a ∈ Σ, such that each left-hand side occurs at most once,

Further, a DFSA can be seen as a function from the set of strings Σ* to the boolean values, where every string in the language is mapped to true, and every other string to false.
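The view of a DFSA as a function from strings to boolean values can be sketched in Scala as follows. This is a minimal illustration, not the Marbles implementation: the names Dfsa, accepts and evenAs are hypothetical, Σ and Q are left implicit in the rule map, and it is assumed here that F is the set of final states and q₀ the initial state.

```scala
object DfsaSketch {
  // R is modelled as a partial map from (state, symbol) to state, so each
  // left-hand side qa occurs at most once by construction.
  final case class Dfsa[Q](
      rules: Map[(Q, Char), Q],
      finalStates: Set[Q], // assumed meaning of F
      initial: Q           // assumed meaning of q0
  ) {
    // The automaton seen as a function from strings to Boolean: run the
    // rules symbol by symbol; a missing rule rejects the input.
    def accepts(input: String): Boolean = {
      val end = input.foldLeft(Option(initial)) { (state, a) =>
        state.flatMap(q => rules.get((q, a)))
      }
      end.exists(finalStates.contains)
    }
  }

  // Example: strings over {a, b} containing an even number of a's.
  val evenAs: Dfsa[Int] = Dfsa(
    rules = Map((0, 'a') -> 1, (1, 'a') -> 0, (0, 'b') -> 0, (1, 'b') -> 1),
    finalStates = Set(0),
    initial = 0
  )
}
```

With these definitions, DfsaSketch.evenAs.accepts("abba") is true and DfsaSketch.evenAs.accepts("ab") is false. A string containing a symbol with no applicable rule is rejected.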

Haskell Haskell[HPJW⁺92], in contrast, is a purely functional language which, while not having the mass appeal of Java, is still well known in the research community.

Scala With both Haskell and Java being problematic in their own way, Scala[OMM⁺04]