Methods for phylogenetic analysis


Master's thesis (Examensarbete)

Kåre Krig


Applied Mathematics, Linköpings Universitet

Kåre Krig

LiTH - MAT - EX - - 2010 / 20 - - SE

Master's thesis (Examensarbete): 30 hp, Level: D

Supervisor: Jan Snellman

Examiner: Jan Snellman


Abstract

In phylogenetic analysis one studies the relationship between different species. By comparing DNA from two different species it is possible to get a numerical value representing the difference between the species. For a set of species, all pairwise comparisons result in a dissimilarity matrix d.

In this thesis I present a few methods for constructing a phylogenetic tree from d. The common denominator for these methods is that they do not generate a tree, but instead give a connected graph. The resulting graph will be a tree in areas where the data perfectly matches a tree. When d does not perfectly match a tree, the resulting graph will instead show the different possible topologies, and how strongly each of them is supported by the data.

Finally I have tested the methods both on real measured data and on constructed test cases.

Keywords: Phylogenetic trees, tight span, split decomposition.


Acknowledgements

I would like to thank my supervisors: first Svante Linusson, who suggested this subject and supervised me when the project started, and also Jan Snellman, who was my supervisor when I finally got around to finishing the work. Bengt Persson also deserves some thanks for supplying the data used.

My work makes heavy use of previous works by Alice Lesser and Daniel Huson. Both have been kind enough to answer mails with questions when I got stuck at some point. Thanks a lot for your help.

Finally my family deserves some thanks for encouragement and some proofreading, with special thanks to my cousin Sara Rassner, who has been a great resource with her extensive biological knowledge.


Nomenclature

Symbols

X   a finite set: the species of animals between which we want to find a relationship.

d   a function X × X → R. d is a metric on X if for any x, y, z ∈ X it satisfies the following three conditions:

    d(x, y) = d(y, x) ≥ 0
    d(x, y) = 0 ⇐⇒ x = y
    d(x, z) ≤ d(x, y) + d(y, z) (the triangle inequality)

    If we relax the second condition to d(x, x) = 0 we call d a pseudo metric.

dij   = d(i, j), will sometimes be used to make the text easier to read.

$\binom{X}{2}$   the set of all pairs of distinct elements in X.

|X|   the number of elements in X.
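The three conditions above can be checked mechanically for a dissimilarity matrix. The following is a minimal sketch, assuming d is stored as a square list of lists indexed 0, ..., |X| − 1; the function name `is_metric` and its flag are illustrative and not part of the thesis.

```python
from itertools import product

def is_metric(d, allow_zero_between_distinct=False):
    """Check the metric conditions on a square distance matrix d.

    With allow_zero_between_distinct=True only d(x, x) = 0 is required,
    which corresponds to the pseudo metric case.
    """
    n = len(d)
    for x, y in product(range(n), repeat=2):
        if d[x][y] != d[y][x] or d[x][y] < 0:
            return False                      # symmetry and non-negativity
        if x == y and d[x][y] != 0:
            return False                      # d(x, x) = 0
        if x != y and d[x][y] == 0 and not allow_zero_between_distinct:
            return False                      # d(x, y) = 0 only when x = y
    # the triangle inequality for every ordered triple
    return all(d[x][z] <= d[x][y] + d[y][z]
               for x, y, z in product(range(n), repeat=3))

three_points = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
merged_pair = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]    # two points at distance 0
too_long = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]       # violates the triangle inequality
```

Here `merged_pair` is a pseudo metric but not a metric, and `too_long` is neither.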

Definitions

• The pair (X,d) is called a metric space.

• A graph G = (V, E) consists of two finite sets. V is the set of vertices of the graph and E consists of unordered pairs of elements of V, called edges.

• A weighted graph G = (V, E, w) is similar to the ordinary graph with the addition that for every edge e ∈ E there is an associated weight w(e). In this paper weights will never be negative.

• A path in a graph is a sequence of distinct vertices v1, v2, . . . , vk such that {vi, vi+1} ∈ E for i = 1, 2, . . . , k − 1. For a graph, the length of a path is the number of edges in the path. For a weighted graph, the length of a path is the sum of the weights of the edges in the path. A path from u to v will be written uv.

• Two vertices u, v ∈ V are said to be neighbours if they are connected by an edge {u, v} ∈ E.

• The degree of a vertex is the number of neighbours of said vertex.

• A subdivision of a graph is the process of deleting one edge e = {u, v} ∈ E and adding one vertex w and two edges {u, w} and {w, v}.

• Two graphs are said to be homeomorphic if both can be constructed from the same graph with a series of subdivisions.


• An embedding of a graph is an actual drawing of a graph, usually in two dimensions.

• A planar graph is a graph that can be embedded in a plane so that no edges need to cross.

• The frontier of a graph with a given embedding contains the vertices and edges from which a line can be drawn to the infinity point without crossing any edges.

Chapter 1

Introduction

1.1 Background

Finding a weighted tree in which the distances between all pairs of leaves match a given metric space is a fairly easy problem. When analyzing measured data, the problem is that a perfectly matching tree usually does not exist. A basic approach to this problem is to find the tree that is the closest possible match¹ to the measurements. The problem with this method is that even small amounts of noise in the measurements might give different tree topologies. Usually such methods provide a single number to describe how well the measurements fit into a tree. There is no way to know in which areas the tree topology has strong support from the data and in which areas the topology is uncertain.

All the methods in this thesis have in common that they aim for a tree, but when no exactly matching tree exists, they will generate a graph. The aim is that the graphs will be close to trees in areas where one tree topology has strong support, while areas where several topologies are possible should give an idea of how strongly the different topologies are supported.

1.2 Topics covered

Chapter 2: Here we consider optimal realizations as a tool for analyzing metric spaces. We also present hereditarily optimal realizations.

Chapter 3: This chapter studies the Tight Span of metric spaces. We show how to use it to construct realizations and how those realizations relate to the hereditarily optimal realizations.

Chapter 4: Finding d-splits of (X, d), and how to visualize a set of splits as a splits graph. The chapter also suggests a hybrid method, based on d-splits and the Tight Span.

Chapter 5: Here we apply the three methods from chapters 3 and 4 to analyse a few different metrics.

Appendix A: An appendix containing the metrics used in chapter 4.

¹ For example the least square fit over all pairs of species.

Chapter 2

Realization of metrics

In this chapter we study the problem of finding realizations of metrics. That is, to find a weighted graph representing the distances given in the metric. More specifically we will study realizations where the sum of edge weights is minimal.

2.1 Realizations

A realization of d on X is a weighted graph G = (V, E, w) satisfying V ⊇ X, w > 0, and for any a, b ∈ X the length of the shortest connecting path is equal to d(a, b). Consider the complete graph with V = X and edge weight w(a, b) = d(a, b). Because a metric by definition satisfies the triangle inequality, the shortest path¹ between two vertices in this graph is the edge connecting them. Hence it is clear that realizations exist for all metric spaces.
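The definition can be tested directly by comparing shortest path lengths with d. The sketch below is an illustration under my own assumptions: graphs are stored as dictionaries from vertex pairs to weights, the elements of X are the vertices 0, ..., |X| − 1, and the four-point tree metric `D` is a made-up example, not data from this thesis.

```python
def shortest_path_lengths(n, edges):
    """All-pairs shortest path lengths (Floyd-Warshall) on vertices 0..n-1."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = dist[v][u] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def realizes(d, n_vertices, edges):
    """True if the graph distances between the first len(d) vertices equal d."""
    dist = shortest_path_lengths(n_vertices, edges)
    m = len(d)
    return all(dist[i][j] == d[i][j] for i in range(m) for j in range(m))

def complete_graph_realization(d):
    """The trivial realization: one edge of weight d(i, j) per pair."""
    n = len(d)
    return {(i, j): d[i][j] for i in range(n) for j in range(i + 1, n)}

# Hypothetical metric: leaves 0 and 2 share one inner vertex, 1 and 3 the other.
D = [[0, 4, 2, 4],
     [4, 0, 4, 2],
     [2, 4, 0, 4],
     [4, 2, 4, 0]]

complete = complete_graph_realization(D)
# The same metric realized by a tree with auxiliary vertices 4 and 5.
tree = {(0, 4): 1, (2, 4): 1, (4, 5): 2, (1, 5): 1, (3, 5): 1}
```

Both graphs realize `D`, but the complete graph has total edge weight 20 while the tree has total edge weight 6, which is why the total edge weight is a natural quantity to minimize.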

Vertices v ∈ V, v ∉ X are called auxiliary vertices. If an auxiliary vertex has no or only one neighbour it is obviously not contributing to connecting the non-auxiliary vertices. An auxiliary vertex with two neighbours a, b does contribute to the connection, but it and its two edges can be replaced with a single edge connecting a and b, with weight equal to the sum of the weights of the two removed edges. To avoid such unnecessary vertices, we introduce the condition that any auxiliary vertex must have at least three neighbours. We also require edge weights to be non-zero, as any two vertices with a distance of zero could be merged into a single vertex.

We define the total edge weight of a realization as $\|G\| = \sum_{e \in E} w(e)$.

2.2 Optimal realizations

A realization is considered to be optimal if ||G|| is minimal.

Lemma 2.2.1 Let G be a realization of (X, d), |X| = n, with more than $2^{\binom{n}{2}+1}$ vertices. Then (X, d) is realized by a subgraph of G with at most $2^{\binom{n}{2}+1}$ vertices.

¹ Other paths of equal length are possible.

Proof by Imrich[6]. Assume G = (V, E, w) to be a realization of the metric space (X, d), and |X| = n. For any pair x, y ∈ X there is a shortest path $p_{xy}$ = xy in G. If there are multiple shortest paths, arbitrarily pick one. Let P be the set of chosen paths for the $\binom{n}{2}$ pairs. We now assume $G = \bigcup_{x,y \in X} p_{xy}$. To any vertex v ∈ V we associate the set $P_v := \{p_{xy} \mid v \in p_{xy},\ p_{xy} \in P\}$.

Assume there are three vertices u, v, w such that $P_u = P_v = P_w$. We may then choose the notation so that $d_G(u, v) + d_G(v, w) = d_G(u, w)$. Choose a path $p_{xy} \in P_v$ (and hence $p_{xy}$ is also in $P_u$ and $P_w$). Because $p_{xy}$, by definition, is a shortest path, its sub-path from u to w must also be optimal. Because all paths in $P_v$ pass v on the way between u and w, we may route them all via the sub-path used by $p_{xy}$ to get from u to w. Now all edges to v except the two in $p_{xy}$ are unused and may be removed. The degree of v is now 2 and we may therefore remove it and replace the two connecting edges with a single one. Hence we may require that a realization has at most one u for any v, such that $P_u = P_v$.

With $\binom{n}{2}$ paths, we have $2^{\binom{n}{2}}$ possible values² for $P_u$, and at most $2^{\binom{n}{2}+1}$ vertices in G. □

Theorem 2.2.1 Every finite metric space has an optimal realization.

Proof According to lemma 2.2.1, the number of vertices in G is bounded. There can be at most one edge for each pair of vertices, so the number of possible unweighted graphs is also bounded. The edge weights are bounded below by 0 and above by max(d(x, y))³. The total edge weight of a finite graph is a continuous function of the lengths of its edges. Since a continuous function on a compact set attains its infimum, there is an optimal set of edge weights for any graph. And because there is only a finite number of unweighted graphs, an optimal realization exists. □

Theorem 2.2.2 Optimal realizations are not always unique.

To give an example of this we first need a few lemmas.

Lemma 2.2.2 Let x, y, z ∈ X be all different. If d(x, z) = d(x, y) + d(y, z) then y is the only common point of any shortest paths xy and yz in a given realization of (X, d).

Proof Assume that there exists a vertex v ≠ y common to the paths xy and yz. The triangle inequality gives

d(x, z) ≤ d(x, v) + d(v, z)

As v is on the path xy and edge weights are positive, we know d(x, v) < d(x, y). And because v is on yz we also have d(v, z) < d(y, z). Those three inequalities give

d(x, z) ≤ d(x, v) + d(v, z) < d(x, y) + d(y, z) = d(x, z)

which is a contradiction. □

² Including the sets with $|P_u| = 1 \Rightarrow u \in X$.

³ In this proof we will temporarily allow edge weights of zero, to get a compact space. This is not a problem, as for any graph with zero-weight edges there is a graph where vertices at zero distance are joined together.

Lemma 2.2.3 Let x, y, u, v ∈ X be all different. If

$$d_{xy} + d_{uv} < \max\{d_{xu} + d_{yv},\ d_{xv} + d_{yu}\}$$

then no shortest path xy has any common vertices with any shortest path uv in any realization of (X, d).

Proof Assume that there exists a vertex w that is common to the two paths, with distances to w measured in the realization. Then the triangle inequality gives

$$d_{xu} + d_{yv} \le d_{xw} + d_{wu} + d_{yw} + d_{wv} \tag{2.1}$$

As w is on a shortest path xy it follows that

$$d_{xy} = d_{xw} + d_{wy}$$

and as w is also on a shortest path uv

$$d_{uv} = d_{uw} + d_{wv}$$

Using the last two equalities as substitutions in (2.1) gives

$$d_{xu} + d_{yv} \le d_{xy} + d_{uv} \tag{2.2}$$

With a similar argument it can also be shown that

$$d_{xv} + d_{yu} \le d_{xy} + d_{uv} \tag{2.3}$$

But as $d_{xy} + d_{uv}$ is assumed to be strictly smaller than the max, at least one of eq(2.2) and eq(2.3) must be false. A contradiction. □

We will now consider an example of a metric space that has multiple optimal realizations. This example was presented by Dress[2].

      a  b  c  d  e
  a   0  2  4  2  2
  b   2  0  2  4  2
  c   4  2  0  2  2
  d   2  4  2  0  2
  e   2  2  2  2  0

According to lemma 2.2.2, b is the only common vertex for any shortest path ab and bc. Similar conditions hold for vertices a, c & d, but not e. Lemma 2.2.3 gives that any shortest path ab has no vertices in common with any shortest path cd. The same holds for shortest paths ac, bd. As a result the shortest paths ab, bc, cd and da may only intersect at vertices a, b, c, d. Therefore any realization of the metric must contain a subgraph homeomorphic to a 4-cycle joining a, b, c, d with all edges having a weight of 2.

We can assume that the shortest path ae travels some distance along either ab or ad before reaching an auxiliary vertex v where it leaves the path. Assume without loss of generality that ae follows ab. When placing the auxiliary vertex v we must make sure not to create a path be that is shorter than 2. We have the following conditions:

$$d_{av} + d_{ve} = 2$$

$$d_{bv} + d_{ve} \ge 2$$

Since v lies on the path ab of length 2, $d_{bv} = 2 - d_{av} = d_{ve}$, so the second condition gives $d_{ve} \ge 1$. Recalling lemma 2.2.2 we conclude that the shortest paths ae and ce have e as their only common vertex. With similar reasoning as above we can show that ce will pass an auxiliary vertex w with $d_{we} \ge 1$. We now know that any realization must at least contain a 4-cycle of total weight 8, and two edges {e, v}, {e, w}, each with weight at least 1. Hence any realization must have weight at least 10. Both graphs in figure 2.1 have total weight 10, and are therefore optimal.

[Figure (graphic not reproduced): two graphs on the vertices a, b, c, d, e, v, w, each with two edges of weight 2 and six edges of weight 1.]

Figure 2.1: Two optimal realizations of the same metric.
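The claim that two different graphs of total weight 10 realize this metric can be verified numerically. The sketch below encodes one reading of the figure that is consistent with the argument above (e attached to the midpoints of two opposite sides of the 4-cycle); the exact drawings in the original figure are an assumption, as are the vertex numbering a, b, c, d, e, v, w = 0, ..., 6 and the helper names.

```python
def shortest_path_lengths(n, edges):
    """All-pairs shortest path lengths (Floyd-Warshall) on vertices 0..n-1."""
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = dist[v][u] = min(dist[u][v], w)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def realizes(d, n_vertices, edges):
    """True if the graph distances between the first len(d) vertices equal d."""
    dist = shortest_path_lengths(n_vertices, edges)
    m = len(d)
    return all(dist[i][j] == d[i][j] for i in range(m) for j in range(m))

# The metric of Dress on a, b, c, d, e (indices 0..4).
DRESS = [[0, 2, 4, 2, 2],
         [2, 0, 2, 4, 2],
         [4, 2, 0, 2, 2],
         [2, 4, 2, 0, 2],
         [2, 2, 2, 2, 0]]

A, B, C, D_, E, V, W = range(7)
# v subdivides ab and w subdivides cd; e hangs between v and w.
first = {(A, V): 1, (V, B): 1, (B, C): 2, (C, W): 1,
         (W, D_): 1, (D_, A): 2, (V, E): 1, (W, E): 1}
# v subdivides ad and w subdivides bc instead.
second = {(A, B): 2, (B, W): 1, (W, C): 1, (C, D_): 2,
          (D_, V): 1, (V, A): 1, (V, E): 1, (W, E): 1}
```

Both edge sets pass the `realizes` check and both have total weight 10, the lower bound derived above.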

The existence of multiple solutions makes optimal realizations a troublesome tool for visualizing data. The final nail in the coffin is provided by Althöfer[7], who proved that computing them is NP-hard.

2.3 Hereditarily optimal realizations

An alternative to optimal realizations is hereditarily optimal realizations⁴. For a finite metric space (X, d), the hereditarily optimal realization Γ = (V, E, w) is defined as follows. If |X| ≤ 2, all optimal realizations are also h-optimal. If |X| = k ≥ 3 and h-optimal realizations of $(Y, d_Y)$ are already defined for all |Y| < k, then the realization Γ = (V, E, w) of (X, d), with X ⊆ V, is h-optimal if for any Y ⊊ X there is a subgraph $\Gamma' = (V', E', w|_{E'})$ of Γ such that Γ' is an h-optimal realization of $(Y, d|_Y)$, and Γ has minimal total edge weight among all such graphs.

In general a hereditarily optimal realization is not optimal. Dress[2] shows that all h-optimal realizations of a metric are homeomorphic. One example of a metric where the h-optimal realization is not optimal is the one that gives the optimal realizations in fig 2.1. The h-optimal realization of the same metric can be seen in fig 2.2.

4

[Figure 2.2 (graphic not reproduced): the h-optimal realization of the metric above, with labelled vertices a, b, c, d, e and twelve edges, each of weight 1.]

Chapter 3

Tight Span of metric spaces

The theory of Tight Spans was independently developed by Isbell[4], Dress[2] and Chrobak and Larmore[5]. This presentation will use the terminology of Lesser[1], which is based on the one used by Dress. Because this text will only cover finite metric spaces, some minor changes have been made, e.g. using max in place of sup.

Let $\mathbb{R}^X$ be the set of functions X → R. Given a metric space (X, d) we define $P(X, d) = P_X$ as

$$P(X, d) := \{ f \in \mathbb{R}^X \mid f(x) + f(y) \ge d(x, y),\ \forall x, y \in X \} \tag{3.1}$$

According to the definition d(x, x) = 0, and f(x) + f(x) ≥ d(x, x). Hence f(x) ≥ 0, ∀f ∈ P(X, d), ∀x ∈ X. As f is a map from a finite set X to R, it can also be viewed as a point in a |X|-dimensional real space. $P_X$ will then be the polyhedron given by the cuts f(x) + f(y) ≥ d(x, y). We now define the Tight Span $T(X, d) = T_X$ as the subset of $P_X$ containing all maps f that are minimal with respect to the partial ordering f ≤ g ⇔ f(x) ≤ g(x), ∀x ∈ X. Another equivalent description of the Tight Span is

$$T_X = \{ f \in \mathbb{R}^X \mid f(x) = \max_{y \in X} \{ d(x, y) - f(y) \},\ \forall x \in X \} \tag{3.2}$$

where the condition states that, apart from fulfilling the inequality conditions of $P_X$, there must for every x ∈ X exist some y ∈ X so that the inequality holds with equality. A proof that these two definitions are equivalent can be found in [1].

3.1 Tight-equality Graphs

For any f ∈ P(X, d), we now define its tight-equality graph K(f) = (V, E). The vertex set is V = X and the edge set is

$$E = \left\{ \{x, y\} \in \tbinom{X}{2} \;\middle|\; f(x) + f(y) = d(x, y) \right\} \tag{3.3}$$

Remember that for $f \in T_X$ there is at least one condition in eq(3.1) for every x ∈ X that holds with equality. Because every edge in K(f) corresponds to one of the conditions holding with equality, $f \in T_X$ if and only if all vertices in K(f) have degree greater than or equal to one. Note that loops are allowed in K(f) and will appear at vertex x when f(x) = 0.

As an example consider a three-point metric with all points at a distance of 1 apart. The cuts from eq(3.1) will then be

f(x) ≥ 0
f(y) ≥ 0
f(z) ≥ 0
f(x) + f(y) ≥ 1
f(x) + f(z) ≥ 1
f(y) + f(z) ≥ 1

and $P_X$ will look as in Fig 3.1. The K-graphs corresponding to the marked points can be seen in Table 3.1.
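For this small example the tight-equality graph and the two descriptions of the Tight Span can be computed directly. The sketch below is only an illustration: points are numbered x, y, z = 0, 1, 2, a loop at x is stored as the pair (x, x), and the three sample points are my own choices of coordinates satisfying the equalities listed for points a, d and e in Table 3.1.

```python
def tight_equality_graph(f, d, tol=1e-12):
    """Edges (x, y), x <= y, whose cut f(x) + f(y) >= d(x, y) is tight."""
    n = len(d)
    return {(x, y) for x in range(n) for y in range(x, n)
            if abs(f[x] + f[y] - d[x][y]) <= tol}

def in_polyhedron(f, d, tol=1e-12):
    """Membership in P_X, eq(3.1)."""
    n = len(d)
    return all(f[x] + f[y] >= d[x][y] - tol
               for x in range(n) for y in range(x, n))

def in_tight_span(f, d, tol=1e-12):
    """f is in T_X iff it is in P_X and every vertex of K(f) has degree >= 1."""
    covered = {v for edge in tight_equality_graph(f, d, tol) for v in edge}
    return in_polyhedron(f, d, tol) and covered == set(range(len(d)))

def satisfies_eq_3_2(f, d, tol=1e-12):
    """The max description of T_X, eq(3.2)."""
    n = len(d)
    return all(abs(f[x] - max(d[x][y] - f[y] for y in range(n))) <= tol
               for x in range(n))

d3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
point_a = (2.0, 2.0, 0.0)       # only f(z) = 0 is tight
point_d = (0.5, 0.5, 0.5)       # all three pair conditions are tight
point_e = (0.75, 0.75, 0.25)    # tight for {x, z} and {y, z}
```

On these points the degree criterion and eq(3.2) agree: `point_d` and `point_e` lie in the Tight Span, while `point_a` lies in $P_X$ but not in $T_X$.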

We now introduce the notation [f ] for the smallest face in PX that contains

f . For all internal points g ∈ [f ] the same equalities hold, and hence they all have the same graph K(g). At the border of [f ] the equalities still hold, but also some of the inequalities may now hold with equality. Hence

Lemma 3.1.1

[f ] = {g ∈ PX|K(f ) ⊆ K(g)}

Which will now be used to show another lemma.

Lemma 3.1.2 The dimension dim[f] of the face [f] is equal to the number of bipartite connected components of K(f).

Proof Let f be an element in a bounded face of $P_X$. This also means that $f \in T_X$. Lemma 3.1.1 tells us that

$$[f] = \{g \in P_X \mid K(f) \subseteq K(g)\}$$

Now by subtracting f from this face and taking the affine hull we obtain a linear space

$$L := \{h \in \mathbb{R}^X \mid h(x) + h(y) = 0,\ \forall \{x, y\} \in K(f)\}$$

Let Y ⊆ X be all vertices of a connected component of K(f). Let $K_Y(f)$ be the subgraph of K(f) with vertex set Y and edges {x, y} ∈ K(f) iff x, y ∈ Y. If $K_Y(f)$ is bipartite it will contribute 1 to the dimension of L, while a non-bipartite component will not. This can be illustrated by considering some vertex y ∈ Y. Let h(y) = a. Then for all x ∈ Y that are neighbours of y we know that h(x) + h(y) = 0, and hence h(x) = −a. So for a bipartite component Y = (A|B) we have h(x) = a, ∀x ∈ A and h(y) = −a, ∀y ∈ B, which clearly has a one-dimensional set of solutions. If on the other hand we add even a single edge between two vertices u, v ∈ A (or both in B) we add the condition h(u) + h(v) = 0 ⇒ a + a = 0 ⇒ a = 0, which gives a 0-dimensional contribution to L. Summation over all connected components of K(f) proves the lemma. □
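The counting rule of Lemma 3.1.2 is easy to apply mechanically. The following sketch counts the bipartite connected components of a graph by two-colouring each component with a depth-first search. The representation (vertices 0, ..., n − 1, loops as pairs (x, x)) and the function name are my own assumptions; the five edge sets are the tight equalities listed for the points a to e in Table 3.1 with x, y, z = 0, 1, 2.

```python
def face_dimension(n, edges):
    """Number of bipartite connected components of a graph on 0..n-1.

    edges is an iterable of pairs (x, y); a pair with x == y is a loop,
    which makes its component non-bipartite.
    """
    adj = {v: [] for v in range(n)}
    for x, y in edges:
        adj[x].append(y)
        adj[y].append(x)
    colour = {}
    count = 0
    for start in range(n):
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        bipartite = True
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    stack.append(v)
                elif colour[v] == colour[u]:
                    bipartite = False      # odd cycle or loop found
        count += bipartite                 # True counts as 1
    return count

k_graphs = {
    "a": [(2, 2)],
    "b": [(2, 2), (1, 2)],
    "c": [(1, 2)],
    "d": [(0, 1), (0, 2), (1, 2)],
    "e": [(0, 2), (1, 2)],
}
dims = {name: face_dimension(3, edges) for name, edges in k_graphs.items()}
```

The result assigns dimension 0 to point d (a corner), 1 to points b and e (edges) and 2 to points a and c (two-dimensional faces). Note that an isolated vertex counts as a bipartite component.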

For any f, g ∈ PX, let h = (f + g)/2. Because PX is defined by linear

Figure 3.1: $P_X$. The polytope continues to infinity in the direction of the dashed edges. The x-axis goes to the right, the y-axis goes up and the z-axis goes into the paper.

Table 3.1 (the K-graph drawings are not reproduced; the last column lists the edges that follow from the tight constraints):

  Point   Constraints that hold with equality                        K-graph edges
  a       f(z) = 0                                                   loop at z
  b       f(z) = 0, f(y) + f(z) = 1                                  loop at z, {y, z}
  c       f(y) + f(z) = 1                                            {y, z}
  d       f(x) + f(y) = 1, f(x) + f(z) = 1, f(y) + f(z) = 1          {x, y}, {x, z}, {y, z}
  e       f(x) + f(z) = 1, f(y) + f(z) = 1                           {x, z}, {y, z}

Lemma 3.1.3 K(h) = K(f ) ∩ K(g)

Proof Assume {x, y} ∈ K(h). Then according to the definition

h(x) + h(y) = d(x, y)

and because h = (f + g)/2

(f(x) + g(x) + f(y) + g(y)) / 2 = d(x, y)

or

f(x) + g(x) + f(y) + g(y) = 2d(x, y)

With $f, g \in P_X$, it is not possible that g(x) + g(y) > d(x, y), because then f(x) + f(y) < d(x, y), which is a contradiction. Using the same argument but replacing f with g and g with f we can conclude that f(x) + f(y) = d(x, y) and g(x) + g(y) = d(x, y). Hence {x, y} ∈ K(f) and {x, y} ∈ K(g) ⇒ {x, y} ∈ K(f) ∩ K(g). Since {x, y} ∈ K(h) was chosen arbitrarily it follows that K(h) ⊆ K(f) ∩ K(g).

An edge {x, y} ∈ K(f) ∩ K(g) is induced by an equality condition that holds in both f and g. Because the condition is linear and h is a convex combination of f and g, the condition must hold at h too. Hence {x, y} ∈ K(h), and K(f) ∩ K(g) ⊆ K(h).

∴ K(h) = K(f) ∩ K(g). □

Lemma 3.1.4 If f ≠ g then dim[h] > 0.

Proof Assume f ≠ g and dim[h] = 0. Then [h] = {h}. Lemma 3.1.3 gives that K(h) ⊆ K(f) and K(h) ⊆ K(g). Lemma 3.1.1 then gives f, g ∈ [h] ⇔ f = g = h, a contradiction. □

3.2 The relation between Tight Span and h-optimal realizations

Using the graph K(f) we are now able to give an explicit description of the h-optimal realization as $\Gamma_d := (V_d, E_d, w_d)$ with vertex set

$$V_d := \{f \in P(X, d) \mid K(f) \text{ is connected and not bipartite}\}$$

and edge set

$$E_d := \left\{ \{f, g\} \in \tbinom{V_d}{2} \;\middle|\; K\!\left(\tfrac{f+g}{2}\right) \text{ is connected and bipartite} \right\}$$

with weights

$$w_d(\{f, g\}) := \max_{x \in X} |f(x) - g(x)|$$

Theorem 3.2.1 Γd is an h-optimal realization of (X, d).

A full proof is given by Dress[2] (Theorem 7), and a sketch of proof is given by Lesser[1] (Theorem 4.4).

Computing the hereditarily optimal graph $\Gamma_d$ with the use of the above description is still not easy. Instead consider the 1-skeleton $T_d^{(1)} := (F_0, F_1, w_d)$ of the Tight Span: a graph with vertex set $F_0$ containing all the 0-dimensional faces of $P_X$, and edge set $F_1$ containing the pairs $\{f, g\} \in \binom{F_0}{2}$ where f and g are points of a shared 1-dimensional face of $P_X$.


Theorem 3.2.2 The hereditarily optimal realization $\Gamma_d$ is a subgraph of the 1-skeleton $T_d^{(1)}$.

Proof We first show that $V_d \subseteq F_0$. Let $f \in V_d$. According to the definition of $V_d$ the graph K(f) must be connected and not bipartite. By Lemma 3.1.2, dim[f] = 0. Hence $f \in F_0$.

The second step is to show that $E_d \subseteq F_1$. Let $\{f, g\} \in E_d$. According to the definition K((f + g)/2) is connected and bipartite. By Lemma 3.1.2, dim[(f + g)/2] = 1 and therefore f and g are points on a shared 1-dimensional face in $P_X$. Hence $\{f, g\} \in F_1$.

By definition $w_d$ is equal for all edges that exist in both graphs. □

Theorem 3.2.3 The 1-skeleton $T_d^{(1)}$ of the tight span is an h-optimal realization if and only if the metric (X, d) is totally split decomposable¹.

A proof of this is given by Lesser[1] (Theorem 7.3).

Theorem 3.2.4 If the metric (X, d) can be realized as a tree, then $T_d^{(1)}$ will be that tree. $T_d^{(1)}$ will then also be both an h-optimal and an optimal realization of (X, d).

A proof of this is given by Dress[2] (Theorem 8, pages 359-365).

3.2.1 Reconstructions using Tight Span

For reconstructions using Tight Span we look at the 1-skeleton $T_d^{(1)}$. For a given metric one will first compute all the cuts from eq(3.1). From these inequalities we can then compute all 0-dimensional faces in $P_X$. The 0-dimensional faces correspond to vertices in the 1-skeleton. Finally the edges can be computed by searching for vertices whose corresponding corners only differ in one inequality holding with equality. These computations have all been done using polymake, and I am not sure which algorithms the program uses.
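For very small metrics the first steps can be illustrated without polymake. The sketch below is a brute-force substitute and not the algorithm polymake uses: it picks every set of |X| cuts from eq(3.1), solves them as equalities with exact rational arithmetic, and keeps the solutions that satisfy all remaining cuts. The function names are my own, and the approach is only practical for a handful of points.

```python
from fractions import Fraction
from itertools import combinations

def solve_square(rows, rhs):
    """Solve a square linear system exactly; return None if it is singular."""
    n = len(rows)
    a = [[Fraction(v) for v in row] + [Fraction(b)] for row, b in zip(rows, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return None
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return tuple(row[n] for row in a)

def polyhedron_vertices(d):
    """All 0-dimensional faces of P_X for a distance matrix d."""
    n = len(d)
    # The cut (i, j) with i <= j stands for f(i) + f(j) >= d(i, j).
    cuts = [(i, j) for i in range(n) for j in range(i, n)]
    vertices = set()
    for chosen in combinations(cuts, n):
        rows = []
        for i, j in chosen:
            row = [0] * n
            row[i] += 1
            row[j] += 1          # i == j gives the row 2 f(i) = 0
            rows.append(row)
        f = solve_square(rows, [d[i][j] for i, j in chosen])
        if f is not None and all(f[i] + f[j] >= d[i][j] for i, j in cuts):
            vertices.add(f)
    return vertices

d3 = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
corners = polyhedron_vertices(d3)
```

For the three-point example this finds four corners: the centre (1/2, 1/2, 1/2) and the three points (0, 1, 1), (1, 0, 1), (1, 1, 0), one for each element of X.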

1


Chapter 4

Split Decomposition

Probably the oldest method of analyzing relationships between different species is to divide them into different groups. E.g. life can be split into animals, plants, fungi etc. Unfortunately it is not obvious which attributes are most important for analyzing relationships. Mass is one example of an attribute that does not give much information. A full-grown Mastiff dog can weigh around 70 kg, a weight similar to that of a full-grown human, but it is more closely related to other breeds of dogs, like a chihuahua with a weight of about 3 kg. Split Decomposition was first presented in 1992 by Andreas Dress[3].

4.1 d-Splits

Split decomposition is a method for finding splits (A|B), A, B ⊊ X such that A ∩ B = ∅ and A ∪ B = X, where the elements in A are in some way different from the elements in B.

A d-split (A|B) is a bipartition of X into two non-empty sets A, B such that for any i, j ∈ A and k, l ∈ B

$$d_{ij} + d_{kl} < \max(d_{ik} + d_{jl},\ d_{il} + d_{jk}) \tag{4.1}$$

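To understand the condition more easily, we study the three possible trees that connect four different points.

[Figure (graphic not reproduced): three trees on the leaves a, b, c, d. The left tree separates a, c from b, d; the middle tree separates a, b from c, d; the right tree separates a, d from b, c.]

Figure 4.1: The three possible trees connecting four elements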

Trees with some edge of weight zero are possible here. It is clear that the only suitable split¹ (A|B) of the left tree is A = {a, c}, B = {b, d}. An indication of how likely the three different cases are is given by the sums:

d(a, c) + d(b, d)   left tree
d(a, b) + d(c, d)   middle tree
d(a, d) + d(b, c)   right tree

where the lowest sum indicates the most likely relation between the four points. The condition in eq(4.1) states that a d-split (A|B) may not contain two elements in A and two in B split in the least likely way. Note that if the distances between four points can be realized by a tree, then the sums representing the two other trees will be equal and hence only the correct tree will be allowed.

All d-splits are given a weight

$$\alpha_{A,B} := \frac{1}{2} \min_{i,j \in A,\ k,l \in B} \left( \max(d_{ik} + d_{jl},\ d_{il} + d_{jk}) - d_{ij} - d_{kl} \right)$$

called the isolation index of the split. Because of eq(4.1), $\alpha_{A,B} > 0$ for all d-splits. To simplify computations we can define α for all bipartitions A, B as:

$$\alpha_{A,B} := \frac{1}{2} \min_{i,j \in A,\ k,l \in B} \left( \max(d_{ik} + d_{jl},\ d_{il} + d_{jk},\ d_{ij} + d_{kl}) - d_{ij} - d_{kl} \right) \tag{4.2}$$

If A, B is a d-split the expressions are equal. If A, B is not a d-split this new definition gives $\alpha_{A,B} = 0$.
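Eq(4.2) translates almost literally into code. The sketch below assumes that i, j (and k, l) are allowed to coincide, as the phrase "for any i, j ∈ A and k, l ∈ B" suggests, and uses a made-up four-point tree metric in which the leaves 0 and 2 are joined to one inner vertex, 1 and 3 to another, all leaf edges have weight 1 and the inner edge has weight 2.

```python
from itertools import product

def isolation_index(d, A, B):
    """Isolation index of the bipartition (A|B) as defined in eq(4.2).

    The result is positive exactly when (A|B) is a d-split, otherwise 0.
    """
    best = min(
        max(d[i][k] + d[j][l], d[i][l] + d[j][k], d[i][j] + d[k][l])
        - d[i][j] - d[k][l]
        for i, j in product(A, repeat=2)     # i == j is allowed
        for k, l in product(B, repeat=2)     # k == l is allowed
    )
    return best / 2

# Hypothetical tree metric on the points 0, 1, 2, 3.
D = [[0, 4, 2, 4],
     [4, 0, 4, 2],
     [2, 4, 0, 4],
     [4, 2, 4, 0]]

inner = isolation_index(D, [0, 2], [1, 3])      # the split made by the inner edge
leaf = isolation_index(D, [0], [1, 2, 3])       # a trivial split
wrong = isolation_index(D, [0, 1], [2, 3])      # not a d-split
```

For this tree metric the isolation indices equal the weights of the corresponding tree edges (2 and 1), and the bipartition that contradicts the tree gets index 0.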

For each d-split (A|B) we now define a split metric²:

$$\delta_{A,B}(i, j) = \begin{cases} 1 & \text{if } i \in A, j \in B, \text{ or } i \in B, j \in A \\ 0 & \text{otherwise} \end{cases}$$

We can now divide d into two parts $d^0$ and $d^1$, defined as follows.

$$d^1 := \sum_{\text{d-splits } (A|B)} \alpha_{A,B}\, \delta_{A,B}$$

$$d^0 := d - d^1$$

According to the definition d(x, y) ≥ 0, $\delta_{A,B}(x, y) \ge 0$ and, as shown previously, $\alpha_{A,B} \ge 0$. Hence we conclude that $d^1 \ge 0$. Bandelt and Dress[3] show that $d^0$ is a pseudo metric, and that there are no d-splits of $d^0$. We call $d^0$ the split-prime residue of d. If $d^0 = 0$, then d is said to be totally split decomposable. Note that $d^1$ is by definition always totally split decomposable.

As the data in the split-prime residue is not represented by the d-splits, it is reasonable to assume that for the d-splits to be a good representation of d, $d^0$ should be small. This will be measured with the splittability index

$$100 \cdot \left( \sum_{i,j \in X} d^1_{ij} \Big/ \sum_{k,l \in X} d_{kl} \right)$$

¹ Excluding trivial splits where |A| = 1 or |B| = 1.

2


4.1.1 Computing the d-splits

When analysing the metric space (X, d) we want to find all possible d-splits. This is done efficiently by computing bigger and bigger partial splits. For a partial d-split (A|B), just like for a full d-split, A and B satisfy A ∩ B = ∅ and eq(4.1), but unlike for a full split A ∪ B ⊊ X. For a set {a, b} with only two elements, there is one³ d-split A = {a}, B = {b}.

Assume X = {1, 2, . . . , n}. Further assume that all d-splits of the subset {1, 2, . . . , i} are known. We can now find all d-splits for {1, 2, . . . , i, i + 1} by testing ({1, 2, . . . , i} | {i + 1}) and, for every d-split (A|B) of {1, 2, . . . , i}, testing if (A ∪ {i + 1} | B) or (A | B ∪ {i + 1}) are d-splits on {1, 2, . . . , i, i + 1}. Such tests can be done by computing $\alpha_{A,B}$ as defined in eq(4.2) and checking if it is non-zero.
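The incremental procedure can be sketched as follows. This is an illustration under my own conventions, not the code used for the results in this thesis: points are numbered from 0, every split is stored with the side containing point 0 first, d is assumed to be a metric on at least two points, and the test data is a made-up four-point tree metric.

```python
from itertools import product

def isolation_index(d, A, B):
    """Isolation index of (A|B) as in eq(4.2); zero if (A|B) is no d-split."""
    best = min(
        max(d[i][k] + d[j][l], d[i][l] + d[j][k], d[i][j] + d[k][l])
        - d[i][j] - d[k][l]
        for i, j in product(A, repeat=2)
        for k, l in product(B, repeat=2)
    )
    return best / 2

def d_splits(d):
    """Map each d-split (A, B) of the points 0..n-1 to its isolation index."""
    n = len(d)
    current = [((0,), (1,))]                # the only split of a two-point set
    for new in range(2, n):
        # The new point alone against everything seen so far ...
        candidates = [(tuple(range(new)), (new,))]
        # ... or added to either side of a surviving partial split.
        for A, B in current:
            candidates.append((A + (new,), B))
            candidates.append((A, B + (new,)))
        current = [(A, B) for A, B in candidates
                   if isolation_index(d, A, B) > 0]
    return {(A, B): isolation_index(d, A, B) for A, B in current}

# Hypothetical tree metric: leaves 0, 2 on one side of an inner edge of
# weight 2, leaves 1, 3 on the other side, all leaf edges of weight 1.
D = [[0, 4, 2, 4],
     [4, 0, 4, 2],
     [2, 4, 0, 4],
     [4, 2, 4, 0]]

splits = d_splits(D)
```

Partial splits with index zero are dropped as soon as they appear, which corresponds to the rows marked "removed" in the worked example of section 4.4.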

4.2 The splits graph

4.2.1 Circular split systems

Let S be the set of all d-splits of a metric (X, d). Consider an ordered list $(x_1, x_2, x_3, \ldots, x_n)$ of all elements in X. The set of splits is circular if there exists some ordered list such that all splits⁴ can be written in the form

$$(\{x_p, x_{p+1}, \ldots, x_q\} \mid X \setminus \{x_p, x_{p+1}, \ldots, x_q\}), \quad 1 < p \le q \le n$$

Arbitrarily choose one element as $x_1$. Finding a circular ordering is then equivalent to the consecutive ones problem. If a circular ordering exists it can be found in linear time[11]. If S is not circular, we want to find a circular ordering of a subset S' ⊂ S where |S'| is maximal. This is a computationally hard problem. SplitsTree⁵ uses an algorithm that will find a circular ordering if one exists, and otherwise find a reasonably good ordering[10]. Note that we do not need to consider trivial splits when computing an ordering.
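Checking whether a given ordering is circular for a set of splits is much simpler than finding one. The sketch below only performs this check; it is not the ordering algorithm used by SplitsTree. A side of a split is an interval of the cyclic ordering exactly when membership changes twice while walking once around the cycle. The splits used are the five non-trivial splits of Table 4.2 and the ordering is the one used in section 4.4.2; the function names are my own.

```python
def is_circular_interval(side, order):
    """True if `side` (a proper, non-empty subset) is contiguous on the cycle."""
    side = set(side)
    n = len(order)
    changes = sum((order[i] in side) != (order[(i + 1) % n] in side)
                  for i in range(n))
    return changes == 2

def is_circular(splits, order):
    """True if every split (A, B) is an interval split of the cyclic order."""
    return all(is_circular_interval(A, order) for A, _ in splits)

# Non-trivial splits from Table 4.2.
non_trivial = [
    ({1, 3, 4}, {2, 5, 6}),
    ({1, 2, 3, 4}, {5, 6}),
    ({1, 4}, {2, 3, 5, 6}),
    ({1, 2, 4, 5}, {3, 6}),
    ({1, 2, 5, 6}, {3, 4}),
]

ok = is_circular(non_trivial, (1, 2, 5, 6, 3, 4))
not_ok = is_circular(non_trivial, (1, 2, 3, 4, 5, 6))
```

Since the complement of a cyclic interval is again a cyclic interval, it is enough to test one side of each split.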

4.2.2 Add trivial splits

The splits graph $G_0 = (V_0, E_0, w)$ starts out with a single vertex $v_0$, and a labeling function $\sigma : X \to V_0$, mapping all elements in X to $v_0$. For each trivial split, separating some $x_k$ from all other elements, add a vertex $v_k$ and an edge from $v_k$ to $v_0$ with weight equal to the isolation index of the split, and set $\sigma(x_k) = v_k$. Once all trivial splits have been added in this manner, the resulting graph will be a star graph where each element of X marks its own leaf, leaving an unmarked vertex in the center⁶.

4.2.3 Add circular splits

Given a splits graph $G_{t-1}$ and a split $S_t = (\{x_p, x_{p+1}, \ldots, x_q\} \mid X \setminus \{x_p, x_{p+1}, \ldots, x_q\}) \in S$, we compute a shortest path $P_t$ from $\sigma(x_p)$ to $\sigma(x_q)$, such that the edges are in the frontier of $G_{t-1}$. Finally, to construct $G_t$ we take $G_{t-1}$, add a copy of the subgraph containing $P_t$ and any leaves $\sigma(x_{p+1}), \ldots, \sigma(x_{q-1})$, and remove the leaves labeled by $x_p, \ldots, x_q$ and any unlabeled leaves from $G_{t-1}$. Then we add edges connecting vertices induced by the same vertex in $P_t$. The new edges are all given weight equal to the isolation index of $S_t$. σ is updated so that $x_p, \ldots, x_q$ map to the corresponding new leaves.

³ ({a}|{b}) is considered equivalent to ({b}|{a}).

⁴ Remember, (A|B) = (B|A).

⁵ The program used for creating the splits graph.

⁶ It is possible to create metrics where the number of trivial splits is less than |X|; in such cases the central vertex will still be marked.

4.3 Split Decomposition of a Tree

Theorem 4.3.1 If (X, d) is a tree-realizable metric, the splits graph will be the

underlying tree.

Proof First consider the set of d-splits. An edge e in the tree T separates X into two sets A and B, containing the vertices on the respective sides. Let i, j ∈ A, k, l ∈ B, and let u, v be the vertices connected by e, chosen so that u is on the same side of e as A. By the triangle inequality d(i, j) ≤ d(i, u) + d(u, j). Similarly d(k, l) ≤ d(k, v) + d(v, l). Hence

d(i, j) + d(k, l) ≤ d(i, u) + d(u, j) + d(k, v) + d(v, l) (4.3)

Because T is a tree, a path connecting a vertex in A to a vertex in B must include e. Therefore

d(i, k) = d(i, u) + d(u, v) + d(v, k)

d(i, l) = d(i, u) + d(u, v) + d(v, l)

d(j, k) = d(j, u) + d(u, v) + d(v, k)

d(j, l) = d(j, u) + d(u, v) + d(v, l)

and

d(i, k) + d(j, l) = d(i, u) + d(u, v) + d(v, k) + d(j, u) + d(u, v) + d(v, l)

= d(i, u) + d(j, u) + d(v, k) + d(v, l) + 2d(u, v)

≥ d(i, j) + d(k, l) + 2d(u, v)

> d(i, j) + d(k, l)

d(i, l) + d(j, k) = d(i, u) + d(u, v) + d(v, l) + d(j, u) + d(u, v) + d(v, k)

= d(i, u) + d(j, u) + d(v, l) + d(v, k) + 2d(u, v)

≥ d(i, j) + d(k, l) + 2d(u, v)

> d(i, j) + d(k, l)

Hence, according to eq(4.1), (A|B) is a d-split. Also, no other type of d-splits are possible. The isolation index of the d-split is defined in eq(4.2), but because both choices in the max are equal it can be simplified.

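$$\begin{aligned}
\alpha_{A,B} &= \frac{1}{2} \min_{i,j \in A,\ k,l \in B} \left( d_{ik} + d_{jl} - d_{ij} - d_{kl} \right) \\
&= \frac{1}{2} \min_{i,j \in A,\ k,l \in B} \left( d_{iu} + d_{ju} + d_{vk} + d_{vl} + 2d_{uv} - d_{ij} - d_{kl} \right) \\
&\ge \frac{1}{2} \min_{i,j \in A,\ k,l \in B} \left( 2d_{uv} \right) = d_{uv}
\end{aligned} \tag{4.4}$$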

If u is an auxiliary vertex we may choose i, j ∈ A so that ij contains u. Should u not be an auxiliary vertex, let i = u. In a similar fashion, choose k, l ∈ B so


Because the splits all correspond to edges in the underlying tree, for any two splits $S_i$, $S_j$ we have $A_i \subset A_j$. Therefore when adding the split $S_i$ to the graph, all species in $A_i$ must be on the same side of any previously added edge. That is, they are all mapped to a single vertex, which $S_i$ splits into two, adding a single edge between them.

Corollary 4.3.1 A tree-realizable metric is totally split decomposable.

Proof By definition

$$d^1 := \sum_{\text{d-splits } (A|B)} \alpha_{A,B}\, \delta_{A,B}$$

For any i, j ∈ X, $d^1(i, j)$ is the sum of $\alpha_{A,B}$ for all splits where i and j are in different sets. Because all splits are induced by an edge in the tree and the isolation indices are all equal to the weight of the corresponding edge, $d^1(i, j) = d(i, j),\ \forall i, j \in X \Leftrightarrow d^1 = d \Leftrightarrow d^0 = d - d^1 = 0$. Hence d is totally split decomposable.
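The corollary can be checked numerically on a small tree metric. The sketch below computes $d^1$ by brute force over all bipartitions, which is only feasible for very small X, and then the splittability index. The metric is a made-up example (leaves 0 and 2 on one side of an inner edge of weight 2, leaves 1 and 3 on the other, leaf edges of weight 1), and the function names are my own.

```python
from itertools import combinations, product

def isolation_index(d, A, B):
    """Isolation index of (A|B) as in eq(4.2); zero if (A|B) is no d-split."""
    best = min(
        max(d[i][k] + d[j][l], d[i][l] + d[j][k], d[i][j] + d[k][l])
        - d[i][j] - d[k][l]
        for i, j in product(A, repeat=2)
        for k, l in product(B, repeat=2)
    )
    return best / 2

def split_decomposable_part(d):
    """d^1: the sum of alpha * delta over all bipartitions of 0..n-1."""
    n = len(d)
    d1 = [[0] * n for _ in range(n)]
    for size in range(1, n):
        # Let A be the side containing point 0, so each bipartition occurs once.
        for rest in combinations(range(1, n), size - 1):
            A = (0,) + rest
            B = tuple(p for p in range(n) if p not in A)
            alpha = isolation_index(d, A, B)   # zero for non d-splits
            for i, j in product(A, B):
                d1[i][j] += alpha
                d1[j][i] += alpha
    return d1

def splittability_index(d):
    d1 = split_decomposable_part(d)
    return 100 * sum(map(sum, d1)) / sum(map(sum, d))

D = [[0, 4, 2, 4],
     [4, 0, 4, 2],
     [2, 4, 0, 4],
     [4, 2, 4, 0]]

d1 = split_decomposable_part(D)
```

For this tree metric $d^1$ reproduces d exactly, so the split-prime residue is zero and the splittability index is 100.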

4.4 Example

This example will use data from six mammals found in appendix A.1.

  Size   A               B               αA,B
  2      1               2               16
  3      1, 3            2               7.5
         1               2, 3            8.5
         1, 2            3               15.5
  4      1, 3, 4         2               7.5
         1, 3            2, 4            0      No possible splits using this partial split, removed
         1, 2, 3         4               2.5
         1, 4            2, 3            0.5
         1               2, 3, 4         8
         1, 2, 4         3               3
         1, 2            3, 4            12.5
  5      1, 3, 4, 5      2               3.5
         1, 3, 4         2, 5            3
         1, 2, 3, 4      5               4.5
         1, 2, 3, 5      4               2.5
         1, 2, 3         4, 5            0      removed
         1, 4, 5         2, 3            0      removed
         1, 4            2, 3, 5         0.5
         1, 5            2, 3, 4         0      removed
         1               2, 3, 4, 5      7
         1, 2, 4, 5      3               2
         1, 2, 4         3, 5            0      removed
         1, 2, 5         3, 4            12.5
         1, 2            3, 4, 5         0      removed
  6      1, 3, 4, 5, 6   2               3.5
         1, 3, 4, 5      2, 6            0      removed
         1, 2, 3, 4, 5   6               4.5
         1, 3, 4, 6      2, 5            0      removed
         1, 3, 4         2, 5, 6         3
         1, 2, 3, 4, 6   5               3
         1, 2, 3, 4      5, 6            1.5
         1, 2, 3, 5, 6   4               2.5
         1, 2, 3, 5      4, 6            0      removed
         1, 4, 6         2, 3, 5         0      removed
         1, 4            2, 3, 5, 6      0.5
         1, 6            2, 3, 4, 5      0      removed
         1               2, 3, 4, 5, 6   7
         1, 2, 4, 5, 6   3               1.5
         1, 2, 4, 5      3, 6            0.5
         1, 2, 5, 6      3, 4            12.5
         1, 2, 5         3, 4, 6         0      removed

A                B                α_{A,B}

Non-trivial splits
1, 3, 4          2, 5, 6          3
1, 2, 3, 4       5, 6             1.5
1, 4             2, 3, 5, 6       0.5
1, 2, 4, 5       3, 6             0.5
1, 2, 5, 6       3, 4             12.5

Trivial splits
1                2, 3, 4, 5, 6    7
2                1, 3, 4, 5, 6    3.5
3                1, 2, 4, 5, 6    1.5
4                1, 2, 3, 5, 6    2.5
5                1, 2, 3, 4, 6    3
6                1, 2, 3, 4, 5    4.5

Table 4.2: The final set of d-splits.
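The incremental computation behind the two tables can be sketched in a few lines of Python. This is a minimal sketch, not the program used for the thesis. It assumes the isolation index is defined as in Bandelt and Dress [3], that is, half the minimum over i, j ∈ A and k, l ∈ B of max(d_ij + d_kl, d_ik + d_jl, d_il + d_jk) − d_ij − d_kl. The function names `isolation_index` and `d_splits` are made up for this illustration, and taxa are indexed from 0, so taxon 1 in the tables is index 0 here.

```python
from itertools import combinations_with_replacement

# Distance matrix for the six mammals, copied from Appendix A.1.
D = [
    [0, 16, 24, 23, 16, 19],
    [16, 0, 23, 23, 9, 12],
    [24, 23, 0, 6, 23, 23],
    [23, 23, 6, 0, 25, 26],
    [16, 9, 23, 25, 0, 9],
    [19, 12, 23, 26, 9, 0],
]


def isolation_index(d, A, B):
    """Isolation index of the split (A|B), assuming the Bandelt-Dress definition."""
    best = float("inf")
    for i, j in combinations_with_replacement(sorted(A), 2):
        for k, l in combinations_with_replacement(sorted(B), 2):
            base = d[i][j] + d[k][l]
            cross = max(d[i][k] + d[j][l], d[i][l] + d[j][k])
            best = min(best, max(base, cross) - base)
    return best / 2


def d_splits(d):
    """Add one taxon at a time, extending every surviving split both ways.

    Requires at least two taxa. Returns a list of (A, B, index) triples,
    where A always contains taxon 0.
    """
    n = len(d)
    splits = [(frozenset([0]), frozenset([1]))]
    for m in range(2, n):
        # The new trivial split that isolates taxon m.
        candidates = [(frozenset(range(m)), frozenset([m]))]
        for A, B in splits:
            candidates.append((A | {m}, B))
            candidates.append((A, B | {m}))
        # Splits whose isolation index drops to zero are removed.
        splits = [(A, B) for A, B in candidates if isolation_index(d, A, B) > 0]
    return [(A, B, isolation_index(d, A, B)) for A, B in splits]


if __name__ == "__main__":
    for A, B, alpha in d_splits(D):
        # Print with the 1-based labels used in the tables.
        print(sorted(i + 1 for i in A), sorted(k + 1 for k in B), alpha)
```

If the definition assumed here agrees with the one used earlier in this chapter, the printed splits should correspond to the rows of Table 4.2. Each step tries two extensions per surviving split plus one new trivial split, which is the pattern visible in the size-by-size listing above.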

4.4.2 Constructing the graph

This example metric produces a set of 11 splits, which we will now visualize. We start by adding the 6 trivial splits all at the same time. The circular ordering used is (1, 2, 5, 6, 3, 4). The non-trivial splits are then added, going from top to bottom in the previous table. For each split (A|B), we choose to duplicate the smaller of the sets A, B, to minimize the size of the graph.


Figure 4.3: Path for split (56|1234) marked.


4.5 A hybrid method

In chapter 3 we saw that the 1-skeleton of the Tight Span is hereditarily optimal if and only if the metric is totally split decomposable. Chapter 4 describes a way to compute the totally split decomposable component $(X, d^1)$ of a metric $(X, d)$. I therefore propose a method where $(X, d^1)$ is computed from $(X, d)$, and then the h-optimal realization of $(X, d^1)$ is computed using the Tight Span. This realization will not be h-optimal with respect to $(X, d)$, but if the splittability index is high it might still be close to the real h-optimal realization.

Theorem 4.5.1 If $(X, d)$ is a tree-realizable metric, the hybrid method will find the underlying tree.

Proof Because $(X, d)$ is tree-realizable it is, according to Corollary 4.3.1, totally split decomposable. Hence $(X, d^1) = (X, d)$ will be tree-realizable and, according to Theorem 3.2.4, the 1-skeleton of the Tight Span will be the underlying tree.
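The first step of the hybrid method, forming $d^1$ from a list of weighted splits, can be sketched as follows. This is a minimal sketch and not the program written for the thesis. The names `split_metric` and `residue` are made up for this illustration, the splits and their weights are copied from Table 4.2, and the distance matrix is the one in Appendix A.1. Each split is stored as one of its two sides, with the other side being the complement.

```python
# Distance matrix for the six mammals, copied from Appendix A.1.
D = [
    [0, 16, 24, 23, 16, 19],
    [16, 0, 23, 23, 9, 12],
    [24, 23, 0, 6, 23, 23],
    [23, 23, 6, 0, 25, 26],
    [16, 9, 23, 25, 0, 9],
    [19, 12, 23, 26, 9, 0],
]

# (one side of the split, isolation index), copied from Table 4.2.
TABLE_4_2 = [
    ({1, 3, 4}, 3.0), ({5, 6}, 1.5), ({1, 4}, 0.5), ({3, 6}, 0.5), ({3, 4}, 12.5),
    ({1}, 7.0), ({2}, 3.5), ({3}, 1.5), ({4}, 2.5), ({5}, 3.0), ({6}, 4.5),
]


def split_metric(n, splits):
    """d^1(i, j): sum of the weights of all splits that separate i and j."""
    d1 = [[0.0] * n for _ in range(n)]
    for side, alpha in splits:
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                # i and j are separated when exactly one of them is in `side`.
                if (i in side) != (j in side):
                    d1[i - 1][j - 1] += alpha
    return d1


def residue(d, d1):
    """Split prime residue d^0 = d - d^1, entry by entry."""
    return [[a - b for a, b in zip(row, row1)] for row, row1 in zip(d, d1)]


if __name__ == "__main__":
    d1 = split_metric(6, TABLE_4_2)
    d0 = residue(D, d1)
    print(d1[0][1], d0[0][1])
```

The matrix returned by `split_metric` is what would be handed to the Tight Span computation in the second step. For a tree-realizable metric the residue is zero everywhere, which is the situation covered by Theorem 4.5.1.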


Chapter 5

Test runs

In this chapter I will use the three methods on several metrics. I will use both real world data and randomly generated data. For d-splits all work is done using SplitsTree [8]. For the Tight Span the computations are performed with polymake [9], using JavaView for visualization. For the hybrid method, I have written a program to compute the totally split decomposable metric in the first step. Then I construct the Tight Span using polymake and JavaView.

5.1 Metrics from random trees

In this section I will present two random weighted trees of different sizes. The main reason for testing on metrics from constructed trees is that we then know the underlying topology, which is not the case for metrics from measured data. We already know that all three methods will find the right tree from exact data; therefore some noise will be added to the distances.

5.1.1 Generating a random tree

To generate a random binary tree I use the following recursive algorithm suggested by Pio Korinth [12]. For a tree with n leaves, first create one auxiliary vertex. Then connect this vertex to a left tree with k leaves and a right tree with n − k leaves, where k ∈ N, 1 ≤ k < n, is randomly chosen. Once the unweighted tree is created, each edge is given an integer weight x ∈ [1, 5] chosen randomly with uniform distribution. Finally the leaves are labeled (0, 1, 2, ...) and the distance matrix is computed.
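The recursive construction can be sketched as below. This is a minimal sketch under stated assumptions, not the code used for the test runs: a subtree with a single leaf is taken to be the leaf itself, k is drawn uniformly, leaves are numbered in the order they are created, and the function name `random_tree_metric` is made up for this illustration.

```python
import random


def random_tree_metric(n, seed=None):
    """Leaf-to-leaf distance matrix of a random binary tree with n >= 2 leaves."""
    rng = random.Random(seed)
    adj = {}     # vertex id -> list of (neighbour, edge weight)
    leaves = []  # vertex ids of the leaves, in creation order

    def build(size):
        v = len(adj)  # next unused vertex id
        adj[v] = []
        if size == 1:
            leaves.append(v)
            return v
        k = rng.randint(1, size - 1)  # 1 <= k < size
        for part in (k, size - k):
            child = build(part)
            w = rng.randint(1, 5)     # integer weight in [1, 5]
            adj[v].append((child, w))
            adj[child].append((v, w))
        return v

    build(n)

    # Path lengths in a tree: one depth-first traversal from every leaf.
    d = [[0] * n for _ in range(n)]
    for a, src in enumerate(leaves):
        dist = {src: 0}
        stack = [src]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        for b, dst in enumerate(leaves):
            d[a][b] = dist[dst]
    return d


if __name__ == "__main__":
    for row in random_tree_metric(6, seed=1):
        print(row)
```

Because the weights are integers, the resulting matrix satisfies the four-point condition exactly, which makes it easy to check that a generated metric really is tree-realizable before noise is added.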

5.1.2 Disturbing the distance matrix

Given a distance matrix $d$, I compute a perturbed matrix $\tilde{d}$. For each $d_{ij}$ in the lower left triangle of $d$ we compute $\tilde{d}_{ij} = d_{ij} + \delta_{ij}$, where $\delta_{ij}$ is a normally distributed stochastic variable with mean $\mu = 0$ and standard deviation $\sigma = 0.2$. The upper right triangle of $\tilde{d}$ is the transposition of the lower left triangle. It is possible that the triangle inequality does not hold in $\tilde{d}$. If so, $(X, \tilde{d})$ is not a metric space. In such cases the entire $\tilde{d}$ is recalculated.
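The perturbation with rejection can be sketched as follows. This is a minimal sketch, not the code used for the test runs. The names `satisfies_triangle` and `perturb` and the `max_tries` safety limit are made up for this illustration, and the small input matrix in the usage lines is made up as well.

```python
import random


def satisfies_triangle(d):
    """True if d[i][j] <= d[i][k] + d[k][j] holds for all i, j, k."""
    n = len(d)
    return all(
        d[i][j] <= d[i][k] + d[k][j]
        for i in range(n) for j in range(n) for k in range(n)
    )


def perturb(d, sigma=0.2, rng=None, max_tries=10000):
    """Add N(0, sigma^2) noise to the lower triangle and mirror it.

    The whole matrix is redrawn until the triangle inequality holds.
    """
    rng = rng or random.Random()
    n = len(d)
    for _ in range(max_tries):
        p = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i):  # lower left triangle only
                p[i][j] = p[j][i] = d[i][j] + rng.gauss(0.0, sigma)
        if satisfies_triangle(p):
            return p
    raise RuntimeError("no valid perturbation found within max_tries")


if __name__ == "__main__":
    d = [[0, 5, 6], [5, 0, 7], [6, 7, 0]]  # made-up example metric
    for row in perturb(d, rng=random.Random(0)):
        print(row)
```

With a symmetric matrix and a zero diagonal, the triangle check with i = j also rules out negative distances, so no separate positivity test is needed.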


5.1.3 Case with 6 leaves


Figure 5.1: Original undisturbed tree with 6 vertices


Figure 5.3: Reconstruction using d-splits. Splitability Index: 95.40


5.1.4 Case with 10 leaves


Figure 5.5: Original undisturbed tree with 10 vertices


Figure 5.7: Reconstruction using d-splits. Splitability Index: 74.44


Figure 5.8: Reconstruction using the hybrid method. Vertex 9 is connected to the right of the two nearby auxiliary vertices.


5.2 Metrics using real world data

5.2.1 About the data

The data is taken from ClustalW and contains pairwise matches of the DNA for a single protein family. Scores from ClustalW are in the range 0-100, with a score of 100 for identical sequences. I use 100 − score as the distance. The taxa and full distance matrices can be found in Appendix A.
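The conversion from scores to distances is a one-liner per pair; a minimal sketch is given below. The function name `scores_to_distances` and the dictionary layout are made up for this illustration. The only score used is the one for the pair (1:2) shown in the ClustalW output in Appendix A.1.

```python
def scores_to_distances(n, scores):
    """Build a symmetric distance matrix from pairwise scores in 0-100.

    `scores` maps 1-based pairs (i, j) to a score; the distance is 100 - score.
    """
    d = [[0] * n for _ in range(n)]
    for (i, j), score in scores.items():
        d[i - 1][j - 1] = d[j - 1][i - 1] = 100 - score
    return d


if __name__ == "__main__":
    # Score for sequences (1:2) as listed in Appendix A.1.
    print(scores_to_distances(2, {(1, 2): 84}))
```

A score of 84 gives the distance 16, which is the entry for taxa 1 and 2 in the distance matrix of Appendix A.1.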

5.2.2 Case with 6 mammals


Figure 5.9: Tight Span visualizing the relationship between six mammals.

Figure 5.10: Visualizing the relationship between six mammals, using d-splits. Splitability Index: 100

Figure 5.11: The hybrid method visualizing the relationship between six mammals.


5.2.3 Case with 10 taxa


Figure 5.12: Tight Span visualizing the relationship between ten taxa.

Figure 5.13: Visualizing the relationship between ten taxa, using d-splits. Splitability Index: 93.


5.2.4 Case with 19 taxa

Figure 5.15: Visualizing the relationship between 19 taxa, using d-splits. Splitability Index: 59.

The Tight Span and the hybrid method proved too time consuming to compute for 19 taxa on a normal home computer.


Chapter 6

Conclusions and future work

For the fairly simple problems with six species all three methods give very similar results. For the random case the correct topology can quite clearly be found by all methods. When we increase the number of species to ten, the topology of the random tree can still be found with d-splits, with the exception of the leaves 2 and 5, which are both connected to the same auxiliary vertex.

d-Splits have the advantage of being cheap to compute. Also, the splits graph is often planar, which makes it easier to present in most media. The disadvantage is that the data in the split prime residue is thrown away. It seems that as the number of species increases, the splitability index decreases and it gets harder to find enough non-trivial splits to get any useful information. I can't see any obvious way around this problem. There exist other methods for finding families of splits (not d-splits, but other bipartitions) that can give a higher splitability index. Some of those are already implemented in SplitsTree.

A general problem with the Tight Span is that the resulting graph will contain a high number of auxiliary vertices. Also, the graph is often not planar. JavaView, the tool used for visualizing the graph from the Tight Span, sometimes even has trouble finding an embedding in 3d with edge lengths corresponding to the edge weights. Even for a problem with just ten species it was hard to find a point of view rendering a good 2d representation. The high number of vertices from the Tight Span not only makes this method computationally expensive, it also means that the end result gets cluttered, making it hard to extract useful information. It is possible that the Tight Span might work well as the first step, followed by some other algorithm that reduces the number of vertices in the graph. One such method is suggested by Pio Korinth [12]. Another possibility is to preprocess the distance matrix to get a matrix with properties that reduce the number of vertices in the final graph. This was the idea behind the hybrid method.

The hybrid method has the advantage that the substeps where the Tight Span and the final embedding are computed are significantly faster than in the pure Tight Span method. The reason for this is the lower number of auxiliary vertices. This also results in a cleaner looking final graph. Compared with d-splits, it is hard to find any advantages of the hybrid. A clear disadvantage is that once again it is hard to find a planar embedding of the final graph, something that is usually easy for a splits graph. All in all, the hybrid seems to inherit the disadvantages of both original methods, being computationally expensive and also throwing away the split prime part of the data.

The observant reader might have noticed that the topology suggested for the six mammals does not match the common classification. It is suggested that species 5 (Sus scrofa, wild boar) and 6 (Equus caballus, domestic horse) share a common ancestor that is not shared with 2 (Bos taurus, cattle). However, 2 and 5 both belong to the order Artiodactyla, and are believed to have a common ancestor not shared with 6, which belongs to the order Perissodactyla. Because the different methods agree with each other, I believe that this discrepancy is due to the distances being measured from a single protein family. Measurements accounting for a larger portion of the genome should hopefully resolve this problem.


Bibliography

[1] A. Lesser: Hereditarily Optimal Realizations, Filosofie Licensiatavhandling i matematik med inriktning mot bioinformatik, (2004), Department of Mathematics, Uppsala University.

[2] A. Dress: Trees, tight extensions of metric spaces, and the cohomological dimension of certain groups: A note on combinatorial properties of metric spaces, Adv. in Math 53, (1984)

[3] H-J. Bandelt, A. Dress: A Canonical Decomposition Theory for Metrics on a Finite Set, Adv. in Math 92, (1992)

[4] J.R. Isbell: Six theorems about injective metric spaces, Commentarii Mathematici Helvetici 39, (1965)

[5] M. Chrobak, L. Larmore: Generosity helps or an 11-competitive algorithm for three servers, J. Algorithms 16, (1994)

[6] W. Imrich: On Optimal Embeddings of Metrics in Graphs, J. of Combinatorial Theory B 36, (1984)

[7] I. Althöfer: On optimal realizations of finite metric spaces by graphs, Discrete and Computational Geometry 3, (1988)

[8] D.H. Huson, D. Bryant: Application of Phylogenetic Networks in Evolutionary Studies, Mol. Biol. Evol 23(2), (2006)

[9] E. Gawrilow, M. Joswig: polymake: a Framework for Analyzing Convex Polytopes, Polytopes – Combinatorics and Computation, (2000)

[10] A.W.M. Dress, D.H. Huson: Constructing Splits Graphs, IEEE Transactions on computational biology and bioinformatics, Vol 1, No 3, (2004)

[11] K.S. Booth, G.S. Lueker: Testing for the Consecutive Ones Property, Interval Graphs, and Graph Planarity Using PQ-Tree Algorithms, Journal of computer and system sciences 13, (1976)

[12] P. Korinth: Tight Span used in Phylogenetics, Examensarbete i matematik, (2007), Institutionen för Matematik, Kungliga tekniska högskolan.


Appendix A

Metrics used

A.1 6 mammals

Data file:

Sequence format is Pearson

Sequence 1: PTGE_HUMAN 152 aa
Sequence 2: Bos_taurus.AAK51127 153 aa
Sequence 3: Mus_musculus.BAC28463 153 aa
Sequence 4: Rattus_norvegicus.AAG24803 153 aa
Sequence 5: Sus_scrofa.BG895996 153 aa
Sequence 6: Equus_caballus.AAL18255 153 aa

Start of Pairwise alignments Aligning...

Sequences (1:2) Aligned. Score: 84

Distance matrix:

 0 16 24 23 16 19
16  0 23 23  9 12
24 23  0  6 23 23
23 23  6  0 25 26
16  9 23 25  0  9
19 12 23 26  9  0

A.2 10 taxa

Sequences used:

Sequence 1: uniprot|P25437|ADH3_ECOLI 369 aa
Sequence 2: uniprot|P32771|FADH_YEAST 386 aa
Sequence 3: uniprot|Q965R0|Q965R0 554 aa
Sequence 4: uniprot|Q17335|ADHX_CAEEL 384 aa
Sequence 5: uniprot|P00326|ADHG_HUMAN 374 aa
Sequence 6: uniprot|P00325|ADHB_HUMAN 374 aa
Sequence 7: uniprot|P07327|ADHA_HUMAN 374 aa
Sequence 8: uniprot|P40394|ADH7_HUMAN 374 aa
Sequence 9: uniprot|P28332|ADH6_HUMAN 368 aa
Sequence 10: uniprot|P08319|ADH4_HUMAN 391 aa

Distance matrix:

 0 39 48 43 53 53 54 54 54 53
39  0 51 43 54 54 54 54 57 56
48 51  0 16 55 56 55 56 59 57
43 43 16  0 48 48 48 49 52 50
53 54 55 48  0  6  8 32 37 41
53 54 56 48  6  0  7 31 36 40
54 54 55 48  8  7  0 32 37 40
54 54 56 49 32 31 32  0 40 42
54 57 59 52 37 36 37 40  0 41
53 56 57 50 41 40 40 42 41  0


A.3 19 taxa

This is a superset of the taxa in A.2.

Sequences used:

Sequence 1: uniprot|P25437|ADH3_ECOLI 369 aa
Sequence 2: uniprot|P32771|FADH_YEAST 386 aa
Sequence 3: uniprot|Q965R0|Q965R0 554 aa
Sequence 4: uniprot|Q17335|ADHX_CAEEL 384 aa
Sequence 5: uniprot|P00326|ADHG_HUMAN 374 aa
Sequence 6: uniprot|P00325|ADHB_HUMAN 374 aa
Sequence 7: uniprot|P07327|ADHA_HUMAN 374 aa
Sequence 8: uniprot|P40394|ADH7_HUMAN 374 aa
Sequence 9: uniprot|P28332|ADH6_HUMAN 368 aa
Sequence 10: uniprot|P08319|ADH4_HUMAN 391 aa
Sequence 11: uniprot|P11766|ADHX_HUMAN 373 aa
Sequence 12: uniprot|Q9SK86|Q9SK86 388 aa
Sequence 13: uniprot|Q9SK87|Q9SK87 386 aa
Sequence 14: uniprot|Q9XIS0|Q9XIS0 387 aa
Sequence 15: uniprot|Q8RV10|Q8RV10 381 aa
Sequence 16: uniprot|P06525|ADH1_ARATH 379 aa
Sequence 17: uniprot|Q96533|ADHX_ARATH 379 aa
Sequence 18: hCP35830.1 379 aa
Sequence 19: hCP1624454.1 220 aa

Distance matrix:

 0 39 48 43 53 53 54 54 54 53 38 59 61 60 57 53 40 40 44
39  0 51 43 54 54 54 54 57 56 38 57 58 59 56 57 40 40 44
48 51  0 16 55 56 55 56 59 57 40 63 65 64 62 59 46 42 37
43 43 16  0 48 48 48 49 52 50 31 56 58 58 59 52 38 34 38
53 54 55 48  0  6  8 32 37 41 38 57 58 57 56 50 47 41 41
53 54 56 48  6  0  7 31 36 40 38 57 58 55 57 49 46 41 41
54 54 55 48  8  7  0 32 37 40 37 59 59 56 59 49 47 40 41
54 54 56 49 32 31 32  0 40 42 40 57 57 56 60 55 50 43 43
54 57 59 52 37 36 37 40  0 41 42 59 59 56 55 53 51 45 43
53 56 57 50 41 40 40 42 41  0 39 58 59 57 58 51 47 42 44
38 38 40 31 38 38 37 40 42 39  0 55 55 54 53 47 32 50 13
59 57 63 56 57 57 59 57 59 58 55  0 22 50 52 53 51 56 61
61 57 65 58 58 58 59 57 59 59 55 22  0 55 51 52 53 67 62
60 59 64 58 57 55 56 56 56 57 54 50 55  0 50 51 50 56 57
57 56 62 59 56 57 59 60 55 58 53 52 51 50  0 47 48 55 57
53 57 59 52 50 49 49 55 53 51 47 53 52 51 47  0 42 49 50
40 40 46 38 47 46 47 50 51 47 32 51 53 50 48 42  0 35 35
40 40 42 34 41 41 40 43 45 42 50 56 57 56 55 49 35  0 15
44 44 37 38 41 41 41 43 43 44 13 61 62 57 57 50 35 15  0


LINKÖPING UNIVERSITY ELECTRONIC PRESS

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years from the date of publication barring exceptional circumstances. The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© 2010, Kåre Krig
