
Practical MAT learning of natural languages using treebanks

Petter Ericson

Supervisor: Johanna Björklund

Department of Computing Science, Umeå University, S-901 87 Umeå, Sweden, pettter@cs.umu.se

Abstract. In this thesis, an implementation of the MAT algorithm for query learning is improved and tested, specifically by attempted simulation of the MAT oracle using large corpora. In order to arrive at suitable test corpora, algorithms for generating positive and negative examples in relation to regular tree languages are developed.


Table of Contents

1 Introduction
   1.1 Motivation
   1.2 Language identification
   1.3 MAT learners
       Basics
   1.4 Outline
2 Preliminaries
   2.1 Introduction to trees and tree languages
       Alphabets, strings and trees
       Automata
       Tree automata
       Grammars
   2.2 MAT learners
       Contexts and substitution
       Myhill-Nerode and equivalence classes
       Observation tables
       Putting it all together
       Implementation status
3 Method
   3.1 Implementation of the algorithm
       Optimising for speed
       Adding required functionality
   3.2 Tree generation
   3.3 Test runs
       Optimisations
       Tree generation
       Corpus runs
4 Results
   4.1 Tower languages
   4.2 Optimisation
   4.3 Generalisation
   4.4 Corpora
       Real-world corpus gaps
   4.5 Tree generation
5 Major difficulties
   5.1 Lack of robustness
   5.2 Selecting counterexample
   5.3 Suitable testing languages
6 Future work


1 Introduction

Achieving a concise and specific model for natural languages has been a major goal of natural language research in recent decades, as such a representation could lead to a better understanding of language, and specifically, could lead to computerised simulations of various natural language tasks. This thesis will consider the Minimal Adequate Teacher model of algorithmic learning of languages, specifically how it can be used to achieve a compact and precise tree-based model for natural languages through simulating complete knowledge of the target language by using large positive and negative corpora.

1.1 Motivation

While probabilistic models of language appear to be adequate in provisionally recognising and translating natural languages (e.g. the Google Translate service), and small, hand-crafted grammars are used to correct and rearrange words according to heuristics, a comprehensive, accurate and complete model of a natural language is still not within the reach of computers. Such a model would be very helpful in many applications, such as parsing, translation and natural language interfaces in general.

Furthermore, having an exact computational model of a natural language could lead to insights into more general linguistics, such as how language is represented in the brain, what drives linguistic development, et cetera.

1.2 Language identification

The general problem of language identification has been extensively explored, notably by Gold [Gol67] and Angluin [Ang80], among others.

Simply put, the problem is: "what can we say about a language, given examples inside and outside that language?" Unfortunately, given no prior knowledge about the language, the answer is "not very much". However, if we know that the underlying language is of a certain class, opportunities start to arise.

Identification of natural languages has proved to be a difficult problem, mainly due to two facts: the simplest class of string grammars that are expressive enough to model natural languages is the context-free class, and language identification for context-free languages has been shown to be NP-complete. Thus, the simplest reasonable way of producing a grammar for a natural language has been assumed to be a human typing up the rule set and submitting it to a minimisation process (or at least an optimiser).

However, as context-free string languages can be produced from the yield of regular tree languages (see Section 2 on page 4), and language identification of regular tree languages is not NP-complete, there would seem to be room for a way of learning "natural tree languages", as long as the parse trees of the string examples are provided. Fortunately, tree banks, i.e. large sets of parse trees, have been compiled by linguists for many different languages, providing both training and test data for various learning algorithm approaches.


1.3 MAT learners

The language identification model of choice in this thesis is the Minimal Adequate Teacher due to Angluin et al. [Ang87], where it is named the L∗ algorithm. Indeed, the aim of the project was to arrive at a practically usable model for language identification of natural languages from corpora using MAT learners. While the theoretical basis of MAT learning is further covered in the next section, and specifically in Subsection 2.2 on page 10, let us briefly describe the overall algorithm, and why it is thought to be a suitable candidate for real-world language identification duties.

Basics MAT learning was introduced to explore the minimum information required for a perfectly rational student to learn the given regular language, and as such introduces a restricted set of messages that the student and the teacher may pass between each other.

The model is as follows: the teacher (or MAT oracle) has full knowledge about the target language, and is required to respond to the queries of the student according to that knowledge. The student has initially no knowledge of the target language, besides it being a regular language, but by submitting queries to the teacher it will eventually build up an internal model consistent with the target language. The student may submit two kinds of queries:

– First, it may submit a model (an automaton) to the teacher (an equivalence query). If the model is consistent with the language, the teacher will return a special token, indicating that the learning is complete. Otherwise, it will return a counterexample, i.e. an item that the model classifies wrongly. That is, the submitted automaton regards the counterexample as a member of the language when in fact it is not, or as not being part of the language when in fact it is.

– Second, it may submit an item (a string, tree etc.) to the teacher (a membership query). If the item is part of the target language, the teacher will return true, otherwise it will return false.

The data from the membership queries is used to build an observation table of membership data for various combinations of string prefixes and suffixes, which eventually can be synthesised into an automaton. Receiving a counterexample from an equivalence query using that automaton results in more rows and/or columns in the table, which are filled in by further membership queries, giving another automaton which is submitted to the teacher, and so on until the teacher returns the accepting token, indicating that the algorithm has run its course.
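For illustration, the two query types can be captured in a small interface. The Java sketch below is hypothetical and not taken from the actual code base discussed later in this thesis; it only fixes the protocol between learner and oracle.

// Hypothetical sketch of the MAT oracle protocol described above.
// T is the type of items (strings, trees, ...), A the type of models (automata).
interface Teacher<T, A> {
    // Membership query: is the item a member of the target language?
    boolean isMember(T item);

    // Equivalence query: returns null if the submitted model is consistent
    // with the target language, otherwise a counterexample that the model
    // classifies wrongly.
    T findCounterexample(A model);
}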


1.4 Outline

Section 2 on the following page will explain the theory behind the MAT algorithm in more detail, while Section 3 on page 13 will give an overview of previous work, as well as the project plan for the thesis. Section 3 also contains certain algorithms developed for the purpose of testing the results obtained. Section 4 on page 18 contains a run-down of what results were accomplished, as well as some reasoning about the test runs and what they actually measure. Finally, Section 5 on page 23 reasons about how the results illustrate the difficulties of the problem at hand, and what potential exists for alleviating these problems.


2 Preliminaries

2.1 Introduction to trees and tree languages

The theory of tree languages is in essence an extension of the theory of string languages. Specifically, the class of regular tree languages is the class of regular string languages, extended by allowing symbols to have more than one successor.

To make this extension obvious, it is informative to view the class of regular string languages in the context of finite automata, because the hierarchy of string languages depends on the hierarchy of automata used to recognise them.

Thus, the extension of the classes of string languages to trees becomes a problem of redefining the automata in terms of trees as opposed to in terms of strings.

Alphabets, strings and trees Formally, an alphabet is a nonempty set Σ of symbols. A ranked alphabet is a pair (Σ, R) where

– Σ is an alphabet, i.e., a finite set of symbols, and
– R is a mapping Σ → K ⊂ N.

We call the number k = R(s) the rank of the symbol s. Furthermore, for every k ∈ range(R), we define the set Σk = {s ∈ Σ | R(s) = k}. A symbol s of rank k may be written s(k) to make the rank explicit. The requirement that symbols have one rank only is unimportant, but useful.

Informally, a tree (or term) is a connected acyclic graph with a designated node called the root. Looking at it from a string perspective, however, we arrive at the following definition:

Let {[, ]} be a set of auxiliary symbols, disjoint from every other alphabet considered herein. The set TΣ of trees over the (ranked) alphabet Σ is the set of strings defined inductively as follows

– for a ∈ Σ0, t = a ∈ TΣ

– for a ∈ Σk with k ≥ 1 and t1 . . . tk ∈ TΣ, t = a[t1 . . . tk] ∈ TΣ.

Fig. 1. A simple graphical representation of the tree a[b[c], d]


In the tree t = a[b[c], d] (shown graphically in Figure 1 on the facing page), the symbol a is the root of the tree, while b[c] and d are child trees, or direct subtrees. The set of all subtrees of a particular tree, subtrees(t), is composed inductively as follows:

– t is in the set subtrees(t)

– if t′ is in the set subtrees(t), then all child trees of t′ are in subtrees(t).

Further, a tree with no direct subtrees (e.g. d) is called a leaf. Thus, Σ0 can be seen as a set of trees, as well as a set of symbols. A tree language over Σ is any subset of TΣ; again, Σ0 serves as an example, namely the tree language consisting of only leaves.

The yield of a tree t ∈ TΣ is the string over Σ0 obtained by reading the leaves of the tree from left to right.
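As a small illustration of these definitions, the following Java sketch represents trees over a ranked alphabet and computes the yield. It is a hypothetical example, not taken from the thesis code base.

import java.util.List;

// Hypothetical sketch: a tree as a symbol plus an ordered list of direct subtrees.
public class YieldDemo {
    record Tree(String symbol, List<Tree> children) {
        static Tree leaf(String s) { return new Tree(s, List.of()); }

        // The yield: the leaf symbols read from left to right.
        String yield() {
            if (children.isEmpty()) return symbol;
            StringBuilder sb = new StringBuilder();
            for (Tree c : children) sb.append(c.yield());
            return sb.toString();
        }
    }

    public static void main(String[] args) {
        // The tree a[b[c], d] from Figure 1; its yield is the string cd.
        Tree t = new Tree("a", List.of(new Tree("b", List.of(Tree.leaf("c"))), Tree.leaf("d")));
        System.out.println(t.yield()); // prints cd
    }
}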

It should be noted, at this point, that for all trees considered in this thesis, there is a well-defined ordering of the direct subtrees of a tree, such that each direct subtree can be given an index.

A path in a tree t is a sequence p = a1 . . . ad of symbols where, for every n with 2 ≤ n ≤ d, the symbol an is the root of a subtree of the tree rooted in an−1. In other words, a path is the sequence of symbols encountered when walking downward from some node in a tree. d is called the length of the path.

The length of the longest path in the tree t is called the depth of the tree.

Fig. 2. A path in a tree

A related concept to paths in trees is the concept of positions. While string positions are easily denoted by a simple numeric index, tree positions need to take branching into account. Thus, the position of subtree s in the tree t is the successive indices of the direct subtrees going downwards from the root of t. For example, the position [0, 0] corresponds to the node marked c in Figure 2, and the path going from the root to that node is abc.

The set subs(t) of subtrees of t ∈ TΣ is defined recursively as follows:

– let subs(t) = {t}, and
– for each tree a[s1 . . . sn] in subs(t), let subs(t) = subs(t) ∪ {s1, . . . , sn}.


Automata A deterministic finite string automaton (DFSA) is a 5-tuple A = (Σ, Q, R, F, q0) consisting of

– an alphabet Σ,
– a set Q of states,
– a set R of rules of the form q[a] → q′, where q, q′ ∈ Q and a ∈ Σ, such that each left-hand side occurs at most once,
– a set F ⊆ Q of final states, and
– an initial state q0 ∈ Q.

A configuration of the DFSA A working on the string s is a 4-tuple C = (A, s, q, p) where q ∈ Q is the current state of the automaton and p ∈ substr(s) is the position of the automaton in the string, defined as the substring that is left to process.

A run of a DFSA A on a string s is a sequence C0 . . . Cn of configurations where A and s are the same in each configuration, and qn and pn in Cn relate to qn−1 and pn−1 in Cn−1, and to R, as follows:

– pn is the substring of pn−1 obtained by removing its first symbol a, and
– there is a rule qn−1[a] → qn in R.

An accepting run C0 . . . Cn of a DFSA A on a string s is a run such that

– C0 has q = q0 and p = s, and
– Cn has q ∈ F and p is the empty string.

The regular language accepted by A is the set L(A) of strings on which ac- cepting runs of A can be constructed.
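To make the definitions concrete, the following hypothetical Java sketch constructs runs of a DFSA; the rule map encodes the unique rule q[a] → q′, and the example automaton (for the string language a+b+) is an assumption made only for illustration.

import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a DFSA run: states are strings, rules map (q, a) to q'.
public class DfsaDemo {
    record Lhs(String state, char symbol) {}

    static boolean accepts(Map<Lhs, String> rules, Set<String> finals, String q0, String s) {
        String q = q0;
        for (char a : s.toCharArray()) {
            q = rules.get(new Lhs(q, a)); // apply the unique rule q[a] -> q'
            if (q == null) return false;  // no applicable rule: no accepting run
        }
        return finals.contains(q);        // accept iff the run ends in a final state
    }

    public static void main(String[] args) {
        // An example DFSA for the string language a+b+ (illustration only).
        Map<Lhs, String> rules = Map.of(
                new Lhs("q0", 'a'), "qa",
                new Lhs("qa", 'a'), "qa",
                new Lhs("qa", 'b'), "qb",
                new Lhs("qb", 'b'), "qb");
        Set<String> finals = Set.of("qb");
        System.out.println(accepts(rules, finals, "q0", "aabbb")); // true
        System.out.println(accepts(rules, finals, "q0", "ba"));    // false
    }
}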

Tree automata Extending finite automata to work on trees entails mostly some way of handling the branches, though the asymmetry of trees requires a choice to be made on which direction the computations should run: top-down or bottom-up. That is, do we start in an initial state and start processing at the root, moving downwards (top-down), or do the leaves lead directly to states, and the processing involves work upwards toward the root, leaving a final (or nonfinal) state (bottom-up)? Both are valid choices, though as will be shown, the resulting language classes are not necessarily equivalent. Formally:

A bottom-up deterministic finite tree automaton (BUDFTA) is a 4-tuple A = (Σ, Q, R, F) where

– Q is a ranked alphabet of states such that Q = Q0,
– F ⊆ Q is a ranked alphabet of final states,
– Σ is the (ranked) tree alphabet, and
– R is a set of rules of the form a[q1 . . . qk] → q′ for q′, q1 . . . qk ∈ Q and a ∈ Σk, where each left-hand side occurs at most once.


A configuration of a BUDFTA A running on a tree t is a 4-tuple C = (A, t, Σ′, t′) where

– A is the automaton,
– t is the tree,
– Σ′ is a ranked alphabet with Σ′k = Σk for k > 0 and Σ′0 = Σ0 ∪ Q, and
– t′ is a tree over Σ′, called the position tree.

A run of a BUDFTA A on a tree t is a sequence of configurations C0 . . . Cl such that for all configurations A, t and Σ′ are equal, and for each successive pair of configurations Cn−1, Cn, the trees t′n−1 and t′n are related as follows:

– there is a subtree a[q1 . . . qk], q1 . . . qk ∈ Q, at a position p in t′n−1,
– there is a subtree q′ ∈ Q at position p in t′n,
– t′n−1 and t′n are otherwise equal, and
– there is a rule a[q1 . . . qk] → q′ in R.

An accepting run C0 . . . Cn of a BUDFTA A on a tree t is a run where

– in C0, t′ = t, and
– in Cn, t′ ∈ F.

The set L(A) of trees on which an accepting run can be constructed for a BUDFTA A is the language of the automaton. The class of languages recognised by BUDFTA is the regular tree languages.

A top-down deterministic finite tree automaton (TDDFTA) is a 5-tuple A = (Σ, Q, R, F, q0) where

– Q is a ranked alphabet of states such that Q = Q1 ∪ Q0,
– F ⊂ Q is a ranked alphabet of final states such that F = F0,
– Σ is the (ranked) tree alphabet,
– q0 is the initial state, and
– R is a set of rules either of the form q[a[v1 . . . vk]] → a[q1[v1] . . . qk[vk]], where q, q1 . . . qk ∈ Q1, a ∈ Σk, k > 0 and v1 . . . vk are variables, or of the form q[a] → q′, where q ∈ Q1, q′ ∈ Q0 and a ∈ Σ0; each left-hand side occurs at most once in R.

A configuration of a TDDFTA A on a tree t is a 4-tuple C = (A, t, Σ′, t′) where

– A and t are as in BUDFTA configurations,
– Σ′ is a ranked alphabet where Σ′k = Σk for k ≠ 1 and Σ′1 = Σ1 ∪ Q, and
– t′ is a tree where on the path from the root to each leaf there is exactly one symbol in Q.


A run of the TDDFTA A on a tree t is a sequence of configurations C0 . . . Cl where for each configuration A, t and Σ′ are equal, and for each successive pair Cn−1, Cn, t′n−1 and t′n relate to each other as follows:

– there is a subtree q[a[t1 . . . tk]], q ∈ Q, at position p in t′n−1,
– there is a subtree a[q1[t1] . . . qk[tk]] at position p in t′n,
– t′n−1 and t′n are otherwise equal, and
– there is a rule q[a[v1 . . . vk]] → a[q1[v1] . . . qk[vk]] in R;
– as an alternative, the subtrees may be q[a] and q′, respectively, with the rule being q[a] → q′.

An accepting run C0 . . . Cn of a TDDFTA A on a tree t is a run where

– in C0, t′ = q0[t], and
– in Cn, yield(t′) ∈ F∗.

The set L(A) of trees on which an accepting run can be constructed for the TDDFTA A is the language accepted by A. TDDFTA recognise a subclass of the regular tree languages, i.e. there are regular tree languages for which no TDDFTA can be constructed. An example of such a language is the language {f[a, b], f[b, a]}. To prove this, assume that there are rules

q0[f[v1, v2]] → f[q[v1], q[v2]]
q[a] → qa
q[b] → qb

in R. This, however, means that in order for both f[a, b] and f[b, a] to be in L(A), both qa and qb have to be in F, meaning that both f[a, a] and f[b, b] are in L(A) as well, i.e. we are recognising the wrong language. It should be fairly obvious that it is both possible and, in fact, easy to construct a BUDFTA recognising the correct language, as the following sketch shows.
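For concreteness, such a BUDFTA has the rules a → qa, b → qb, f[qa, qb] → qf and f[qb, qa] → qf, with F = {qf}. The hypothetical Java sketch below (not the thesis code base) runs a BUDFTA bottom-up and checks this example.

import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a bottom-up run: every subtree is rewritten to the
// state given by the unique rule a[q1 ... qk] -> q'.
public class BudftaDemo {
    record Tree(String symbol, List<Tree> children) {
        static Tree leaf(String s) { return new Tree(s, List.of()); }
    }
    record Lhs(String symbol, List<String> childStates) {}

    static String run(Map<Lhs, String> rules, Tree t) {
        List<String> qs = t.children().stream().map(c -> run(rules, c)).toList();
        return rules.get(new Lhs(t.symbol(), qs)); // null if no rule applies
    }

    static boolean accepts(Map<Lhs, String> rules, Set<String> finals, Tree t) {
        String q = run(rules, t);
        return q != null && finals.contains(q);
    }

    public static void main(String[] args) {
        // BUDFTA recognising exactly {f[a, b], f[b, a]}.
        Map<Lhs, String> rules = Map.of(
                new Lhs("a", List.of()), "qa",
                new Lhs("b", List.of()), "qb",
                new Lhs("f", List.of("qa", "qb")), "qf",
                new Lhs("f", List.of("qb", "qa")), "qf");
        Set<String> finals = Set.of("qf");
        Tree fab = new Tree("f", List.of(Tree.leaf("a"), Tree.leaf("b")));
        Tree faa = new Tree("f", List.of(Tree.leaf("a"), Tree.leaf("a")));
        System.out.println(accepts(rules, finals, fab)); // true
        System.out.println(accepts(rules, finals, faa)); // false: no rule for f[qa, qa]
    }
}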

Recall that non-deterministic finite string automata are defined similarly to deterministic finite string automata, except that there is no restriction that each rule have a unique left-hand side. Further, recall that this adds no additional computational power. In other words, non-deterministic finite string automata recognise the same class of languages as deterministic string automata.

It can be shown that while BUDFTA and their non-deterministic counterparts (BUNFTA) both recognise the regular tree languages, non-deterministic top-down automata (TDNFTA) do not have the same restrictions as TDDFTA, and in fact recognise the whole class of regular tree languages.

Grammars A construct related to automata is the grammar. While an automaton recognises a language, using states and rules in the aforementioned manner, a grammar generates the language, by successively applying rules to nonterminal symbols to produce a complete language member, comprised entirely of terminal symbols. As a single grammar generates an entire language of many members, it is obvious that non-determinism is a prerequisite. Further, for trees


it is natural to generate the tree from the top down, as there is no way to know in advance how many leaves the finished tree will have.

A regular tree grammar (RTG) is a 4-tuple G = (Σ, N, S, R) where

– Σ is a ranked alphabet of terminal symbols,
– N is a ranked alphabet of nonterminal symbols such that N0 = N and N ∩ Σ = ∅,
– S ∈ N is the start symbol, or starting nonterminal, and
– R is a set of rules of the form n → t, where n ∈ N and t is a tree over Σ ∪ N.

It is possible to construct a definition of RTGs where N and Σ may overlap, but the definition is slightly more complex. As a convention, in this report nonterminal symbols will be denoted by capital letters (A, B, C, . . . ), while terminal symbols will be denoted by lowercase letters (a, b, c, . . . ).

An intermediate tree of the RTG G = (Σ, N, S, R) is a tree over Σ ∪ N. A successful generation of a tree t by the RTG G is a sequence t0 . . . tl of intermediate trees of G such that t0 = S, tl = t and, for each successive pair tn−1, tn,

– there is a nonterminal A at position p in tn−1,
– there is a tree s at position p in tn,
– tn−1 and tn are otherwise identical, and
– there is a rule A → s in R.

As for the automata, this definition can also be made so that the generation is done ”in parallel”, i.e. that several rules may be applied on different nonterminals in each intermediate tree. This has no impact on the trees that can be generated, however, and the definition again becomes more complex.

The language L(G) generated by the RTG G is every tree in TΣ for which there is a successful generation by G. It can be shown that RTGs define exactly the class of regular tree languages, and that for every BUDFTA A there is an RTG generating the language recognised by A.

Recall that for strings, context-free grammars define the context-free languages in terms very similar to the regular tree grammars defined above. It can be shown that for every context-free grammar there is a regular tree grammar G such that if one takes the yield of every tree in L(G), the resulting set of strings is exactly the context-free language. Conversely, every regular tree language defines a corresponding context-free string language. See [Eng75] for the proof of this relation.
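As a concrete example (constructed here for illustration), consider the RTG with start symbol S, a binary terminal f, a ternary terminal g, leaves a and b, and the rules

S → f[a, b]
S → g[a, S, b]

It generates trees such as f[a, b] and g[a, f[a, b], b]; taking the yield of every generated tree gives exactly the non-regular context-free string language {a^n b^n : n ≥ 1}.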


2.2 MAT learners

Recall from the introduction that the basic algorithm uses a MAT oracle with full knowledge of the target language to respond to the membership and equivalence queries of the learner. The learner uses the query responses to build up an observation table from which an automaton consistent with the queries made so far may be synthesised, in order to be submitted as a new equivalence query.

Eventually, the synthesised automaton will recognise exactly the target language.

While the original algorithm, described in [Ang87], works on strings and regular string languages, it is well-known that the basis of the algorithm, the Myhill-Nerode theorem, carries over to the regular tree languages. For an earlier look at an adaptation of the MAT algorithm to tree languages, see [DH03].

Contexts and substitution An important concept for understanding MAT learning of trees is the context. A context is an incomplete tree, which can be completed by "slotting in" any tree at a specific position. Formally:

The set of contexts CΣ over a ranked alphabet Σ is the set of trees over the ranked alphabet Σ′ such that

Σ′0 = Σ0 ∪ {□}, □ ∉ Σ, and Σ′n = Σn for n ≥ 1,

where the special symbol □ occurs at most once in each such tree.

The substitution of □ in a context c ∈ CΣ by any tree t ∈ TΣ is written c(t).
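A hypothetical Java sketch of contexts and substitution (again, not the thesis code base), using the string "□" as the special symbol:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a context is a tree with (at most) one occurrence of
// the special symbol □; c(t) replaces that occurrence by the tree t.
public class SubstitutionDemo {
    record Tree(String symbol, List<Tree> children) {
        static Tree leaf(String s) { return new Tree(s, List.of()); }
    }

    static Tree substitute(Tree context, Tree t) {
        if (context.symbol().equals("□")) return t;
        List<Tree> kids = new ArrayList<>();
        for (Tree c : context.children()) kids.add(substitute(c, t));
        return new Tree(context.symbol(), kids);
    }

    public static void main(String[] args) {
        // The context f[f[□]] applied to the tree b gives f[f[b]].
        Tree context = new Tree("f", List.of(new Tree("f", List.of(Tree.leaf("□")))));
        System.out.println(substitute(context, Tree.leaf("b")));
    }
}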

Myhill-Nerode and equivalence classes A Myhill-Nerode theorem for a class of languages establishes that the number of equivalence classes of each language contained therein is finite. Briefly, any automaton partitions the set of all possible elements (e.g., TΣ or Σ∗) into disjoint subsets, depending on what state the automaton will end up in, having processed a specific element of the universe.

This requires the automaton to be complete, in the sense that every combination of state and input element must be accounted for in the rule set. This can easily be achieved by adding a dead state that serves as the right-hand side of all previously undefined rules. Naturally, this may lead to the automaton containing many more transitions than would otherwise be needed.

Formally, we define the equivalence classes of Σ∗, as induced by a complete DFSA A, as a partition of Σ∗ into disjoint subsets EQq, q ∈ Q, such that for each equivalence class EQq′ and every string s ∈ Σ∗, s ∈ EQq′ if and only if a run C0 . . . Cn of A on s can be constructed such that

– C0 has q = q0 and p = s, and
– Cn has q = q′ and p is the empty string.

Using this definition, we arrive at a different way of defining the language of A: as the union of the equivalence classes EQq, q ∈ F.

An equivalence class EQq of A on Σ∗ is also referred to as a set of equivalent prefixes. The Myhill-Nerode theorem for regular string languages claims that


these equivalence classes are finite in number, and correspond exactly to the states of the complete automaton. Contrast this to, for example, the push-down automata of the context-free languages, which have a potentially infinite number of states, counting the stack. It should come as no surprise that there is no version of the Myhill-Nerode theorem for context-free languages.

However, there is a version of the Myhill-Nerode theorem for regular tree languages: we define the equivalence classes of TΣ, as induced by a complete BUDFTA A, as a partition of TΣ into disjoint subsets EQq, q ∈ Q, such that for each equivalence class EQq′ and every tree t ∈ TΣ, t ∈ EQq′ if and only if a run C0 . . . Cn of A on t can be constructed such that

– C0 has t′ = t, and
– Cn has t′ = q′.

The equivalence classes of regular tree languages thus define sets of equivalent subtrees. Similarly, TDDFTA can be said to define equivalent contexts.

Observation tables An observation table is a table where the rows are indexed by trees, and the columns by contexts. Further, the trees indexing the rows are divided into two sets S and R, representing states and transitions, respectively.

These sets relate to each other as follows:

– S ⊂ R ⊂ Σ[S], where Σ[S] denotes the set of trees a[t1 . . . tk] such that a ∈ Σ and t1 . . . tk ∈ S, and
– S is subtree-closed, i.e., if a tree t = a[t1 . . . tk] is in the set, then t1 . . . tk are in the set.

An observation is a truth value indicating whether the tree c(t), composed of the context c marking the column and the tree t marking the row, is part of the target language. An observation table can be synthesised into an automaton, assuming the table fulfils certain properties. A full description of these properties is outside the scope of this thesis, but the idea is that each row must be filled in completely, i.e. there is no tree c(t), c ∈ C, t ∈ R ∪ S, for which we do not know whether it is in the target language or not. Further, every row indicating a transition must be the same as a row indicating a state. That is, there is no row indexed by a tree in R \ S that differs from every row indexed by a tree in S.

The automaton synthesised from this kind of table takes its states from the set S. The final states are the subset of S where the empty context results in the tree being in the target language, and the transition function is defined by how the trees in R look, and what rows are equal.

As an example, consider the observation table shown in Table 1 on the following page. The first four rows correspond to states (note that they are all distinct, in the sense that they have different truth values in some column), while the two remaining rows represent transitions. Translating this table to an automaton is done by observing which rows are equal.

For example, taking a as a subtree takes us to the state qa, while working upwards and finding an f, we look in the table for the tree f[a], which we find at row 5. Observing that the row is identical to that for a, we arrive once again at state qa. The case for b is more complex, since we go from state qb to state qf[b] upon reading the first f. Another f takes us to qf[f[b]], an accepting state, while reading a third f again leaves us in state qf[b] (note how the rows are identical).

             □    f[□]   f[f[□]]
a            t    t      t
b            f    f      t
f[b]         f    t      f
f[f[b]]      t    f      t
f[a]         t    t      t
f[f[f[b]]]   f    t      f

Table 1. Example observation table
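Reading off Table 1 as described above gives the following BUDFTA, with states named after the rows in S: the final states are qa and qf[f[b]] (the rows with a t in the □ column), and the rules are

a → qa
b → qb
f[qa] → qa            (the row for f[a] equals the row for a)
f[qb] → qf[b]
f[qf[b]] → qf[f[b]]
f[qf[f[b]]] → qf[b]    (the row for f[f[f[b]]] equals the row for f[b])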

Putting it all together The basic idea of MAT learning is to identify the equivalence classes and the relations between them, by asking equivalence and membership queries. The intermediate results are stored in the observation table.

In general, the table will start with the empty context and no states, implying that no trees belong in the language. The teacher will reply with some tree from the language, which then will form the basis of the first ”proper” automaton.

Because of the requirement that S stay subtree-closed, the rows will be populated by all subtrees of the first counterexample. The contexts will be chosen so that they distinguish between two otherwise equivalent subtrees. Thus the learner will eventually arrive at a newly completed observation table, synthesise a new automaton, and receive a counterexample from the teacher, and so on, until the correct automaton is submitted.

In the observation table, the equivalence classes and distinguishing contexts are shown explicitly: each of the rows in S, representing states, also represents an equivalence class, while the distinguishing contexts indexing the columns must distinguish between at least two states (or equivalence classes), or it would be pointless to include them in the table.

While much work has gone into exploring the capabilities of MAT learning, and optimising it for various minimisation tasks, comparatively little effort has gone into adapting it for real-world identification of natural languages.

Implementation status The source code relevant to the project was previously published and used in conjunction with the article [DH07] by Drewes et al., among others. Briefly, it is a direct implementation of the MAT algorithm in Java code, combined with parsing, timing, profiling and other utilities. The first goal of the current project was to optimise and generalise the current code base, adding capabilities for weighted trees and more reasonable run times for the simpler cases where the oracle had access to the original automaton. With this done, the project was to move on to showing the actual learning from corpora.


3 Method

In order to demonstrate a robust implementation of MAT learning that is relevant to the real world, we aim to learn a reasonably complex language from a tree bank. Further, the learning should be efficient enough that the normal run times are measured in minutes, hours or at least days, instead of months and years. To accomplish this, the code base previously used in [DH07] is optimised, extended and modified. Improvements are also made to the reporting abilities in the code base, both for profiling and testing purposes. In addition, the capability to generate corpora from a given grammar is introduced to make testing against a ground truth possible.

3.1 Implementation of the algorithm

When the project started, the code base was largely unimproved from its previous use. There, it had been used both to show MAT learning of tree languages as an actually working concept, given a description of the language, and to illustrate various improvements to the basic algorithm. Of necessity, the code also contained various basic constructs, such as finite tree automata, and, for that matter, trees. While the standard Java library contains tree implementations, these have a rather different design and usage pattern than what is required for formal language purposes.

Optimising for speed As little effort had previously been spent on optimising the actual code (the focus having been on algorithm-level optimisations), a number of changes dramatically improved performance. Identification of bottlenecks was accomplished mainly through code profiling (that is, using debugging and run-time information to determine what lines of code take up what proportion of CPU time), but visual inspection and refactoring played a role as well. The first (and most important) bottleneck identified and fixed in this way was that comparing two symbols from two trees for equality was a string comparison operation. Tests detailed in Section 4 on page 18 show the amount of speedup derived from this change.

The second major change involved the large hash table constructed to keep track of trees in general, and to avoid excessive tree traversals for comparisons.

As adding a tree to this table involved checking (and adding) every subtree to the table, a very large number of table look-ups would occur. Though hash look-ups theoretically take constant time, in practice larger hash tables can be significantly slower than smaller ones. Separating the table based on the root symbol of the tree yielded a significant speedup, again shown in Section 4.
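The following hypothetical Java sketch illustrates the idea of the root-indexed table; the actual code base structures this differently, so the sketch only shows the principle.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one hash set per root symbol instead of a single
// large table over all known trees and subtrees.
public class RootIndexedTable {
    record Tree(String symbol, List<Tree> children) {}

    private final Map<String, Set<Tree>> byRoot = new HashMap<>();

    // Add a tree and, recursively, all of its subtrees.
    void add(Tree t) {
        if (byRoot.computeIfAbsent(t.symbol(), k -> new HashSet<>()).add(t)) {
            for (Tree c : t.children()) add(c);
        }
    }

    // Only the (smaller) bucket for the root symbol needs to be consulted.
    boolean contains(Tree t) {
        Set<Tree> bucket = byRoot.get(t.symbol());
        return bucket != null && bucket.contains(t);
    }

    public static void main(String[] args) {
        RootIndexedTable table = new RootIndexedTable();
        Tree leaf = new Tree("a", List.of());
        table.add(new Tree("f", List.of(leaf)));
        System.out.println(table.contains(leaf)); // true: subtrees are added too
    }
}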


Adding required functionality While improving performance was a primary focus of the project, certain operations were unavailable in the original code base. Notably, both weighted trees and any sort of handling of corpora were missing. Reducing code duplication through polymorphism was seen as a priority, but unfortunately had to take a back seat to making the implementation work quickly. As a result, there are several classes claiming to do more or less the same thing, such as the various Learners and Teachers. Refactoring these to reduce code duplication and increase the versatility of the code base is important, but was not a priority at the time for the project.

3.2 Tree generation

Since it was discovered that learning from corpora was significantly more difficult than first surmised, a number of simpler tests had to be devised. A key requirement was to generate both positive and negative corpora relative to a known automaton (which was relatively simple to accomplish using grammars), together with a notion of a "random" tree from inside and outside a given language. This random generation problem turned out to be much harder than anticipated, and there seems to be little prior work in the area.

While it may not be immediately obvious why this is a problem, consider the wide range of alphabets and potential structures that have to be handled in a reasonable manner. Not only do we want a good selection of tree depths, but also a good selection of trees from each depth. If, for example, only one symbol of the alphabet has rank greater than 0, but there are many leaves, we are likely to still want a wide variety of structures to be represented. The naive algorithm (choosing a random member from the alphabet as the root of the next subtree) will not in general capture this behaviour, as it is likely that most of the chosen nodes will be leaves, leaving us with a larger proportion of small trees than perhaps is desired.

For this purpose, a parameter, or "fudge factor", was introduced to the algorithms, so as to help shape the randomisation to produce a reasonable selection of trees, making it less dependent on the specific structure of the alphabet or language.

In the end, two algorithms were devised: one that uses the rules of the grammar to produce a random tree in the language (subject to a "fudge factor"), and one that generates a random tree in TΣ (also subject to a "fudge factor"), which can then be classified using the automaton corresponding to the grammar.


The algorithm used to produce trees from a grammar is as follows:

// Produce a random tree from a regular tree grammar.
// We use recursion to build up the tree; the initial call is given last.

function getRandomSubtree(nonterminal N, integer depth)
    do
        rhs = get random right-hand side corresponding to N
    while (fudge / (depth * minimum depth arising from rhs) < random(0,1))
    for each nonterminal K in rhs
        replace K in rhs by getRandomSubtree(K, depth + 1)
    end
    return rhs
end

randomTree = getRandomSubtree(starting symbol, 0)

There are a number of problems with this algorithm; notably, it can end up in an endless loop from time to time, and there is no obvious connection between the fudge factor and what trees are produced. A more robust version may leverage basic statistics in order to, for example, select the desired tree depth from a gamma distribution, and use some kind of randomisation to build the tree up to that depth. There would still be problems with using the naive algorithm for this purpose, though, as you may not be interested in every branch being the same length. This may be a common result if symbols are chosen randomly up to the final depth and the alphabet contains many internal nodes and few leaves.

The algorithm used to produce random trees from TΣ is as follows:

// Produce a random tree from a ranked alphabet.
// We again use recursion to build the tree.

function getRandomSubtree(integer depth)
    if ((number of leaves in alphabet * depth * fudge) / total number of symbols > random(0,1))
        return random leaf
    else
        root = random symbol
        subtrees = empty list
        for i = 1 to rank(root)
            add getRandomSubtree(depth + 1) to subtrees
        end
        return Tree(root, subtrees)
    end
end


As with the previous algorithm, this one has several problems. Though there is no risk of endless loops in this case, the distribution of output trees is heavily dependent on the proportion of leaves to the total number of symbols in the relevant alphabet. The fudge factor can alleviate this somewhat, but as the else-clause still chooses a random symbol from the complete alphabet, if the alphabet contains mostly leaves, the output will still be mostly very small trees. The obvious fix for this problem, choosing a non-leaf if available depending on the fudge factor, has not been implemented as of this writing.

3.3 Test runs

In order to demonstrate the improvements, several different test runs are presented. The first focus will be to compare the speeds of the unmodified and optimised learners. Secondly, we will show that the trees produced by the generation algorithms at least in some respects represent a good sample of the target language. Finally, an attempt will be made to show some of the successes and failures of the various corpus-learning algorithms.

Optimisations The simplest way to test for improvements is to run the same test before and after the modifications, and this is how these tests were run. Three different languages will be identified using the old and the new code base, averaging the execution time over several runs to get a better estimate.

Tree generation In order to properly test the generation algorithms, we require a measure to test against. Several options exist, such as the number of leaves or the number of subtrees, but one of the simplest is the total depth of the tree, or, equivalently, the length of the longest path in the tree. We partition TΣ into subsets TΣ^i based on the depth i. Though individual trees may be interesting to look at, the time and effort required to generate every tree of the various languages is excessive. Thus, we only look at the relative sizes of the various subsets.

The size of TΣ^i is heavily dependent on the alphabet Σ, and the recursive definition is rather involved. However, it should be obvious that TΣ^0 = Σ0, meaning

|T_\Sigma^0| = |\Sigma_0|.

Further,

|T_\Sigma^1| = \sum_{i \ge 1} |\Sigma_i| \cdot |\Sigma_0|^i.

Moving on to higher depths makes the calculation more involved, though, as a single path is enough to determine the depth. An expression that slightly underestimates the number of trees is

|T_\Sigma^n| = \sum_{i \ge 1} |\Sigma_i| \cdot \Big( \sum_{j=0}^{n-1} |T_\Sigma^j| \Big)^{i-1} \cdot |T_\Sigma^{n-1}|.

The underestimation stems from fixing the depth-defining path in one position. Allowing this to be any path would result in an even more rapidly rising number, though the computation becomes significantly more complex.
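A small, hypothetical Java sketch of the recursion above, computing the (under)estimated number of trees per depth from the sizes |Σk| of the rank classes; the rank profile used in the example is an assumption chosen only for illustration.

import java.math.BigInteger;

// Hypothetical sketch: counts[n] approximates |T_Sigma^n| using the
// underestimating recursion above.
public class TreeCountDemo {
    // sizes[k] = |Sigma_k|, the number of symbols of rank k.
    static BigInteger[] countByDepth(int[] sizes, int maxDepth) {
        BigInteger[] t = new BigInteger[maxDepth + 1];
        t[0] = BigInteger.valueOf(sizes[0]); // |T^0| = |Sigma_0|
        for (int n = 1; n <= maxDepth; n++) {
            BigInteger shallower = BigInteger.ZERO; // sum of |T^j| for j < n
            for (int j = 0; j < n; j++) shallower = shallower.add(t[j]);
            BigInteger total = BigInteger.ZERO;
            for (int i = 1; i < sizes.length; i++) { // root symbol of rank i >= 1
                total = total.add(BigInteger.valueOf(sizes[i])
                        .multiply(shallower.pow(i - 1)) // i-1 children of depth < n
                        .multiply(t[n - 1]));           // one depth-defining child
            }
            t[n] = total;
        }
        return t;
    }

    public static void main(String[] args) {
        // Example rank profile: one leaf symbol and two binary symbols.
        BigInteger[] counts = countByDepth(new int[]{1, 0, 2}, 5);
        for (int d = 0; d < counts.length; d++)
            System.out.println("depth " + d + ": " + counts[d]);
    }
}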

As for estimating the number of trees of each depth of a specific language by looking at the grammar, this is significantly harder. It is certainly not impossible, especially with normalised grammars, but the work required to arrive at a reasonable specification is deemed to be too involved at this time. Instead, a comparison will be made between the total number of trees in TΣ, the randomised negative trees, the randomised positive trees, and the trees used for membership and equivalence queries in the algorithm.

Corpus runs Finally, a set of tests was run to see how many trees would actually be required for a reasonable shot at finding the target language using a number of randomised corpora. These included initial tests using actual tree banks. Once it became clear that this was unlikely to produce positive results in any meaningful way, they were followed by tests on corpora linked to successively simpler languages.


4 Results

While the project as a whole failed to obtain positive results, sub-goals were still reached and certain complications made the nature of the problem clearer for future research.

4.1 Tower languages

In order to explore the relationship between the number of states and the number of trees required, a family of languages was devised:

A tree from the tower language of height n starts with at least one fn, then goes on through at least one fn−1, and so on until the last f1 has the symbol ⊥ as its only leaf. Formally:

Definition 1 (Tower language). A tower language of height n is defined by an RTG G = (Σ, N, S, R) where

– Σ is the set {fi : i ∈ 1 . . . n} ∪ {⊥}, where Σ0 = {⊥} and Σ1 = Σ − {⊥},
– N is the set {qi : i ∈ 1 . . . n},
– S is the nonterminal qn, and
– R is the set of rules such that for every i ∈ 2 . . . n,
  • qi → fi[qi] ∈ R,
  • qi → fi[qi−1] ∈ R, and also
  • q1 → f1[q1] ∈ R, and
  • q1 → f1[⊥] ∈ R.

Informally, we can see the trees of tower language alphabets (i.e. alphabets with Σ = Σ1 ∪ {⊥}) as strings turned sideways. That is, the tree f3[f2[f1[⊥]]] is equivalent to the string f3f2f1, and the tower language of height 3 is equivalent to the regular string language f3+f2+f1+.
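For example, the tower language of height 3 is generated by the rules q3 → f3[q3], q3 → f3[q2], q2 → f2[q2], q2 → f2[q1], q1 → f1[q1] and q1 → f1[⊥], with start symbol q3; a typical member is f3[f3[f2[f1[⊥]]]], corresponding to the string f3f3f2f1 ∈ f3+f2+f1+.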

A similar, but more complex language class is the binary-tree variant of the tower languages, in which each symbol fi is binary, and the rules are otherwise similar. However, as these trees and languages quickly become too complicated for the corpus learning to be effective, they have not been included in the testing.

4.2 Optimisation

While still perhaps not clearly usable in most cases, in the end the code base performed quite a bit faster than at the start of the project. The most obvious speedup came from replacing most string comparisons with integer comparisons, by creating an alphabet class that keeps the string representations of the symbols and letting each symbol carry its alphabet index, which is used for comparisons instead. See Table 2 on the facing page for comparisons.

Further optimisation was achieved by splitting the rather large hash table of trees in the corpus into several tables, indexed by root symbol. It is likely that further optimisations could be achieved by creating similar indexed tables for deeper symbols, but the tests showed a much smaller benefit as the tables grew smaller and more manageable. Again, see Table 2 for comparisons.


Branch                      Mean   Max    Min    Total (10 runs)
Neither improvement         25.2   26.0   24.3   252.0
No string comparisons       15.3   15.8   14.4   153.0
Root-indexed hash tables    15.0   15.5   14.6   149.7
Mainline (both)             10.7   11.0   10.5   107.2

Table 2. Time in seconds for a run of 4 corpora of 2000 trees each taken from the tower language of height 4 and its co-language

As can be seen in the table, the speedups achieved by the two changes are almost entirely independent, meaning the total speedup is rather high. It has to be noted, at this point, that while the test is not intentionally contrived to display the speedup, it does have certain properties that make it suitable to show the speedups in a flattering light, viz. it involves lots and lots of tree comparisons.

The size of the corpora (2000 trees in both positive and negative training and testing corpora) may seem large, but as will be shown, it is far from an unreasonable number in order to capture important aspects of even small and uncomplicated languages. In addition, the speedup test uses the CorpusTeacher class, which by its nature has to compare many more trees than does the AutomaticTeacher class. However, though the tests are not demonstrated here, there is a notable speedup even using the AutomaticTeacher oracle. Obviously, the difference is, for most languages, less than for the similar CorpusTeacher learning process.

Apart from these two major improvements, loop tightening and other small optimisations accounted for minor improvements.

4.3 Generalisation

The plan for the generalisations of the code base was to have a single class family of learners and one of teachers. While the end result was close to this, with much code shared between various learners and teachers (and other classes), sadly, most of this was duplicate rather than inherited code. The speed differential between running with native booleans everywhere as opposed to applying a generic semiring factory was in the end big enough to motivate this somewhat unwieldy construction, while the other variants used to try various other approaches had enough differences in actual code to require a more thorough integration, making shared code through inheritance largely impractical.

4.4 Corpora

Implementing the use of corpora in the program was mostly straightforward, as long as it could be assumed that every requested tree was included in either the positive or the negative corpus. However, as was readily apparent, this is not an assumption that can be made for any real-world corpora.


Real-world corpus gaps The requirement in the MAT algorithm to keep the rows of the table subtree-closed forces the MAT learner to make a substantial number of specific membership queries. In fact, for an automaton with n states and m transitions, at minimum (n + m) · log(n) membership queries have to be made, potentially many more. Several trials were made with simple grammars to detect how many trees would have to be in the negative and positive corpora, given "random" generation. This in itself required quite a lot of programming and testing, given that no proper algorithm to generate a "random" tree from within or without a language was readily available. In the end, the algorithms mentioned in Subsection 3.2 were developed for these purposes.

In order to illustrate the number of trees required in the corpora for various numbers of states, compared to the theoretical minimum number of membership queries required, Table 3 was produced.

Tower height   Min. trees req.   Positive corpus   Negative corpus   Trees used
Tower 1        3                 10                1                 4
Tower 2        8                 10                10                15
Tower 3        13                150               150               159
Tower 4        20                2000              2000              1523
Tower 5        27                > 9200            > 20000           N/A

Table 3. Various tree metrics for learning of tower languages

The values in the table were computed as follows:

– The minimum trees required is the number of trees used by the AutomaticTeacher for teaching the language. This includes every tree used in a membership query, as well as the trees sent as counterexamples in response to equivalence queries.

– The positive and negative corpora columns give the sizes, in number of trees, of corpora that are likely to result in the correct language being identified. For the tower languages of height 3 and 4 the corpora were not actually sufficient to answer all the queries, but were of sufficient size that the guesses made to fill the gap were few enough that the final result was the correct language anyway.

– The trees used is the number of trees used by CorpusTeacher to respond to queries, both membership and equivalence queries.

The main reason why the Trees used number is significantly higher than Minimum trees required is that it includes the final equivalence comparison, which is likely to run through many trees before it concludes that the automaton indeed recognises the correct language. Without this, the numbers closely match, though the second number is still slightly higher. Both numbers are somewhat


inaccurate, though, as the equivalence queries are handled quite differently by the two Teachers. While AutomaticTeacher does the contradiction backtracking computation detailed in [DH07], the CorpusTeacher does a simple iteration over all the trees in the corpora until it finds one that is misidentified by the automaton. Because of this, the second number may even be seen as underestimating the number of trees "used" by the teacher.

However, whatever the number of trees actually used, the number of trees required in the corpora remains the same. As is obvious from the table, this number increases exponentially with the height of the tower language. This is due to a combination of factors, the most important being that the languages get increasingly complex, with more states and transitions that need to be identified by the learner. The alphabets also grow more complex, meaning that the number of trees of a specific depth that are not in the language increases greatly.

With the realisation that corpora in general need to be impractically large, the main focus shifted from getting usable corpora to making reasonable guesses about the inevitable holes in the teachers’ knowledge.


4.5 Tree generation

Fig. 3. Trees by depth generated by various means

Figure 3 shows the distribution of trees of various depths for TΣ, for two positive and two negative corpora (relative to the tower language of height 4) generated using the algorithms described in Subsection 3.2 on page 14, and for the trees used in an actual corpus run of the MAT algorithm on the same language, using those corpora. While it may seem strange that the total number of trees used is larger than the total number of trees in the corpus, recall that not only is each subtree included, but every context as well. Due to how fast TΣ grows with each step, depths above 7 have been omitted.

As is readily observed, the generated negative examples peak at around depths 6 to 7. This is likely because of the statistical characteristics of the algorithm and the alphabet, since the small number of positive trees at this depth would have little impact. These instead cluster at around depth 16. The characteristics of the generating algorithms seem to be, if not perfect, then at least usable for generating a random sample of trees selected from inside and outside a particular language. However, further testing would be required in order to claim this in any sort of definitive way.

References
