Multiple Sequence Alignment by Iterated Local Search

HS-IKI-MD-04-001

Jens Auer

Submitted by Jens Auer to the University of Skövde as a dissertation towards the degree of Master of Science with a major in computer science.

September 2004

I certify that all material in this dissertation which is not my own work has been identified, and that no material is included for which a degree has already been conferred upon me.


Kommunikation och information

Master Dissertation

Multiple Sequence Alignment by Iterated Local Search

Jens Auer

August 29, 2004

Acknowledgements

I would like to thank my supervisor Björn Olsson for his support and comments which helped me very much to complete this thesis project. I also would like to thank Thomas Stützle for providing me with some of the

Abstract

The development of good tools and algorithms to analyse genome and protein data in the form of amino acid sequences coded as strings is one of the major subjects of Bioinformatics. Multiple Sequence Alignment (MSA) is one of the basic yet most important techniques to analyse data of more than two sequences. From a computer scientist's point of view, MSA is merely a combinatorial optimisation problem where strings have to be arranged in a tabular form to optimise an objective function. The difficulty with MSA is that it is NP-complete, and thus impossible to solve optimally in reasonable time under the assumption that P ≠ NP.

The ability to tackle NP-hard problems has been greatly extended by the introduction of Metaheuristics (see Blum & Roli (2003) for a summary of most Metaheuristics): general, problem-independent optimisation algorithms extending the hill-climbing local search approach to escape local minima. One of these algorithms is Iterated Local Search (ILS) (Lourenço et al., 2002; Stützle, 1999a, p. 25), a recent, easy to implement but powerful algorithm with results comparable or superior to other state-of-the-art methods for many combinatorial optimisation problems, among them TSP and QAP. ILS iteratively samples local minima by modifying the current local minimum and restarting a local search procedure on this modified solution.

This thesis will show how ILS can be implemented for MSA. After that, ILS will be evaluated and compared to other MSA algorithms on BAliBASE (Thompson et al., 1999), a set of manually refined alignments used in most recent publications of algorithms and in at least two MSA algorithm surveys. The runtime behaviour will be evaluated using runtime distributions.

The quality of alignments produced by ILS is at least as good as that of the best algorithms available, and significantly superior to the previously published Metaheuristics for MSA, Tabu Search and the Genetic Algorithm SAGA. On average, ILS performed best in five out of eight test cases, second for one test set and third for the remaining two. A drawback of all iterative methods for MSA is the long runtime needed to produce good alignments. ILS needs considerably less runtime than Tabu Search and SAGA, but cannot compete with progressive or consistency-based methods, e.g. ClustalW or T-COFFEE.

Keywords: Combinatorial Optimisation, Metaheuristics, Multiple Sequence Alignment

Contents

1 Introduction
  1.1 Motivation
  1.2 Aim and Objectives

2 Background
  2.1 Multiple Sequence Alignment
    2.1.1 Definition
    2.1.2 Alignment Scores
    2.1.3 Complexity
  2.2 Metaheuristics and Combinatorial Optimisation
    2.2.1 Combinatorial Optimisation
    2.2.2 Heuristic algorithms
    2.2.3 Metaheuristics
  2.3 Iterated Local Search
    2.3.1 GenerateInitialSolution
    2.3.2 Embedded Heuristic
    2.3.3 Perturbation
    2.3.4 Acceptance criterion
    2.3.5 Termination condition
  2.4 Related Work
    2.4.1 Simulated Annealing
    2.4.2 SAGA
    2.4.3 Tabu Search

3 Algorithm
  3.1 Construction heuristic
  3.2 Local Search
    3.2.1 Single exchange local search
    3.2.2 Block exchange local search
    3.2.3 Comparison
  3.3 Perturbation
    3.3.1 Random regap perturbation
    3.3.2 Block move perturbation
    3.3.3 Pairwise alignment perturbation
    3.3.4 Realign region perturbation
  3.4 Acceptance criterion
  3.5 Termination condition
  3.6 Iterated Local Search for Multiple Sequence Alignment
  3.7 Implementation

4 Method and Materials
  4.1 Run-time distributions
  4.2 The BAliBASE database of multiple alignments
    4.2.1 Statistical Evaluation

5 Results
  5.1 ILS components
    5.1.1 Construction heuristic
    5.1.2 Local search
    5.1.3 Perturbation
    5.1.4 Acceptance Criterion
  5.2 BAliBASE evaluation of ILS
  5.3 Runtime-behaviour of ILS

6 Conclusions and Discussion
  6.1 Conclusions
  6.2 Discussion

List of Figures

2.1 Example Multiple Sequence Alignment
2.2 Structure of the hevein protein
2.3 Pictorial view of ILS
3.1 Move operation of the single exchange local search
3.2 Local search step with block exchange
3.3 Random regap perturbation
3.4 Perturbation using a random block move
3.5 Perturbation using optimal pairwise alignments
3.6 Block-alignment perturbation
3.7 COFFEE scores for ILS on reference set gal4 ref1
5.1 Boxplot of the average BSPS scores of all 144 alignments in the database
5.2 Boxplot of the average CS scores of all 144 alignments in the database
5.3 Boxplots for each reference set showing BSPS-scores
5.4 Boxplots for each reference set showing CS-scores
5.5 Solution quality trade-off for 1bbt3 ref1
5.6 Solution quality trade-off for 1pamA ref1
5.7 Runlength-distribution for 1r69 ref1
5.8 Runlength-distribution for 1pamA ref1

List of Tables

2.1 Classification of MSA algorithms
5.1 Comparison of construction heuristics
5.2 Comparison of COFFEE-scores for both local searches
5.3 Runtime comparison for both local searches
5.4 Results of different perturbations
5.5 Runtimes of different perturbations
5.6 Percentage improvements in COFFEE-score for different acceptance criteria
5.7 Runtime for different acceptance criteria
5.8 Average SP-scores for BAliBASE
5.9 Average column scores for BAliBASE
5.10 Statistical significance of the differences in BSPS- and CS-score

List of Algorithms

2.1 Iterated Local Search
2.2 BETTER acceptance criterion
2.3 RESTART acceptance criterion
3.1 Single exchange local search
3.2 Block exchange local search
3.3 Random regap perturbation
3.4 Random block move perturbation
3.5 Pairwise alignment perturbation
3.6 Realign region perturbation

1 Introduction

The development of good tools and algorithms to analyse genome and protein data is one of the major subjects of Bioinformatics. This data is typically collected in the form of biological sequences, i.e. strings of symbols representing amino acids or nucleotides. From the comparison of these sequences, information about evolutionary development, conserved regions or motifs, and structural commonalities can be acquired. This provides the basis for deducing biological functions of proteins or characterising protein families.

One central technique to compare a set of sequences is Multiple Sequence Alignment (MSA). This work presents a new algorithm for amino acid sequence alignment, where the goal is to arrange a set of sequences representing protein data in a tabular form such that a scoring function is optimised. The problem of Multiple Sequence Alignment is hard to solve; for most scoring functions it has been proven to be NP-hard. Considering computational and biological problems, Notredame (2002) describes three major problems:

• Which sequences should be aligned?
• How is the alignment scored?
• How should the alignment be optimised w.r.t. the objective function?

This thesis focuses on the third point, summarised as follows by Notredame (2002):

    Computing optimal alignments is very time-consuming; it has even been proved to be NP-hard with most used objective functions, which makes it intractable for exact algorithms when more than a small set of sequences is to be aligned. To handle larger sets, heuristics must be used, which leaves the question of the reliability of these heuristics and their capability to produce near-optimal results.

1.1 Motivation

MSA has been a field of research for a long time and many algorithms have been proposed. Most algorithms proposed for MSA rely on a step-by-step alignment procedure where one sequence is consecutively added until all sequences are aligned. In most cases, the sequences are added to the alignment in a precomputed order. This class of algorithms, with ClustalW (Thompson et al., 1994) being the most prominent member, suffers from the usual problem of the greedy approach, e.g. being unable to alter gaps placed in early steps of the alignment. Such alignments can often be improved by a local search, especially for the interesting case of sequence similarity below 25%, where a local search can improve the alignment by 8% (see table 5.2). This approach is very likely to become stuck in a local minimum and could therefore be improved by using algorithms able to escape a local minimum.

To overcome the problems of local minima, several stochastic schemes have been proposed in the context of combinatorial optimisation. These schemes are often described as Metaheuristics, since they describe a general algorithmic framework which is independent of a specific optimisation problem. Established Metaheuristics are simulated annealing, tabu search, guided local search, ant colony optimisation and genetic/evolutionary algorithms. Metaheuristics are an active research field in AI and are used in many different domains, both in research and in industrial applications (see Blum & Roli (2003) for an overview of Metaheuristics in Combinatorial Optimisation).

Even though MSA is of such remarkable importance, only three cases of contemporary Metaheuristics applications to MSA were found in the literature search for this project. Kim et al. (1994) used simulated annealing, Notredame & Higgins (1996) implemented a genetic algorithm and, recently, Riaz et al. (2004) studied Tabu Search. The results of Notredame and Riaz are comparable with other state-of-the-art techniques and encourage further investigation of other Metaheuristics.

In this thesis, Iterated Local Search (ILS) (Lourenço et al., 2002) is implemented and evaluated as an algorithm for MSA. Starting from a local minimum generated by running an embedded heuristic procedure on a heuristically computed initial alignment, ILS iteratively performs the following steps:

1. Perturbation: Modify the current solution to escape the local minimum.
2. Embedded Heuristic: Run the embedded heuristic on the perturbed solution to reach the next local minimum.
3. Acceptance: Select either the solution before or after running the embedded heuristic as starting solution for the next iteration.

Section 2.3 describes ILS as a Metaheuristic in detail, and section 3.6 presents the algorithm implemented in this project for MSA. The idea behind ILS, the search in the reduced space of local minima, can be found in many algorithms in combinatorial optimisation under different names, including iterated descent, large-step Markov chains, iterated Lin-Kernighan and chained local optimisation. Despite the long history of the basic idea, ILS is a relatively new Metaheuristic and has not been used for MSA. Lourenço et al. (2001) give a brief introduction to ILS.

ILS has shown very good results on many combinatorial optimisation problems, e.g. the travelling salesman problem (Stützle & Hoos, 2002), where it constitutes one of the best algorithms currently available, and the quadratic assignment problem (Stützle, 1999b), one of the hardest problems known. Other applications include scheduling problems of different kinds (den Besten et al., 2001; Stützle, 1998) and graph colouring. Lourenço et al. (2002) summarise these applications and give a good overview of ILS.

Compared to the already applied Metaheuristics, ILS has several advantages. It is a modular algorithm with very simple basic steps and few parameters which have to be adjusted, as opposed to e.g. genetic algorithms. Another benefit, which ILS shares with other Metaheuristics, is its independence of the objective function to be optimised.

1.2 Aim and Objectives

The aim of this dissertation project is to implement and evaluate ILS (Lourenço et al., 2002) for MSA. The algorithm is evaluated using a reference set of alignments, provided by the BAliBASE database of multiple sequence alignments (Bahr et al., 2001; Thompson et al., 1999). ILS is compared to established state-of-the-art algorithms for MSA w.r.t. biological validity, by measuring the conformance of the computed alignments with those found in the database, and w.r.t. computational properties, by using runtime distributions (Hoos & Stützle, 1998).

To fulfil the aim, the following objectives have to be considered:

• Choose a good construction heuristic by testing several alternatives.
• Construct a good local search procedure which is able to improve an alignment very fast but also reaches high solution quality. This involves testing different neighbourhood structures and design decisions such as pivoting strategies or neighbourhood pruning techniques.
• Implement a perturbation procedure which fits the local search procedure used and helps to find high quality solutions.
• Choose an acceptance criterion which balances intensification and diversification to explore interesting regions of the search space efficiently.
• Determine a cut-off to end the ILS.

Chapter 3 presents several alternatives for each of these points. Chapter 5 compares each of the alternatives, taking into account the intended use as a component of the ILS. Considering the results, the complete ILS is presented in algorithm 3.7 in chapter 3.

Altogether, three construction heuristics, two local search procedures (with two pivoting strategies), four perturbation strategies and three acceptance criteria are tested.

2 Background

This chapter introduces the basic concepts needed in this thesis and provides an overview of previous related work.

2.1 Multiple Sequence Alignment

MSA is a fundamental tool in molecular biology and bioinformatics. It has a variety of applications, e.g. finding characteristic motifs and conserved regions in protein families, analysing evolutionary relations between proteins, and secondary/tertiary structure prediction. The following section shows concrete examples of MSA in widely known problems of molecular biology.

A common way to represent the evolution of a set of proteins is to arrange them in a phylogenetic tree. The first step in computing the tree is to measure all pairwise distances of the given set of sequences and derive a distance matrix from this. The distances can be gathered from a MSA by comparing each pair of aligned sequences in the alignment. From this distance matrix, a phylogenetic tree can be computed, e.g. by UPGMA (Sneath & Sokal, 1973) or Neighbour Joining (Saitou & Nei, 1987).
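As a small illustration of this step (not taken from the thesis), the following Python sketch derives such a distance matrix from an alignment; the distance measure used here, the fraction of mismatches over gap-free column pairs, is an assumed, simple choice:

def pairwise_distance(a: str, b: str, gap: str = "-") -> float:
    """Fraction of mismatches among columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0  # no comparable columns: treat as maximally distant
    return sum(1 for x, y in pairs if x != y) / len(pairs)

def distance_matrix(alignment: list[str]) -> list[list[float]]:
    """Symmetric matrix of pairwise distances between the aligned sequences."""
    n = len(alignment)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = pairwise_distance(alignment[i], alignment[j])
    return d

print(distance_matrix(["ACDE-", "AC-EF", "GCDEF"]))

The resulting matrix can then be fed to a tree construction method such as UPGMA or Neighbour Joining.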

MSA is a basic tool when grouping proteins into functionally related families. From a MSA, it is easy to construct a profile or a consensus, which is a pseudo-sequence representing the complete alignment. The consensus or the profile in turn can be used to identify conserved regions or motifs. MSA can also be used when deciding whether a protein belongs to a family or not, by aligning the query sequence together with known proteins belonging to the family.

MSA can also be used as a starting point for secondary or tertiary structure prediction. When aligned together with related sequences, the conserved motifs can give a hint of secondary or tertiary structure, since the structure is usually conserved during evolution, but the actual sequence is likely to vary in certain parts. Figure 2.1 shows this exemplarily with the Cysteines (printed yellow) aligned together. These amino acids influence the structure of a protein by forming disulfide bridges.

Figures 2.1 and 2.2 show an example MSA of a family of related sequences together with a picture of the three-dimensional tertiary structure of one of the aligned proteins, hevein.

As defined later, all sequences in the MSA are adjusted to the same length by inserting gaps (symbolised by dots in this example). Similar amino acids are aligned together in the same column and regions of high amino acid similarity can be identified. Most important are the black printed columns of Cysteines, since these parts of the proteins influence the tertiary structure by forming bonds inside the proteins. These bonds are known as disulfide bridges.

Figure 2.1: Example multiple sequence alignment of a family of related proteins. The alignment shows eight columns of Cysteines, which form disulfide bridges and are an essential part of the proteins' structure. The dots symbolise gaps. Taken from Reinert et al. (2003).

Figure 2.2: Structure of one of the proteins from figure 2.1 (hevein), showing the disulfide bridges formed by the aligned Cysteines.

2.1.1 Definition

This section defines the domain, i.e. the MSA problem, in formal terms, so that it can be used to describe the algorithm precisely. We first formalise the concept of a biological sequence.

Definition 2.1 (Alphabet and Sequence) Let Σ be an alphabet, i.e. a finite set of symbols, with Σ ≠ ∅. A sequence over the alphabet Σ is a finite string of symbols from Σ. The length of a sequence s, denoted as |s|, is the number of symbols in the sequence. For a sequence, s[i] denotes the i-th symbol in this sequence.

In molecular biology, the alphabet Σ usually consists of the set of twenty-three amino acids found in proteins, extended with a special symbol as a placeholder for any amino acid (usually denoted as "X" or "-").

We can now define the multiple alignment of a set of sequences. We first define a variation of multiple alignment, where columns consisting completely of gap symbols are allowed, and restrict this in the final definition of multiple alignment.

Definition 2.2 (Pseudo Multiple Alignment) Let Σ be an alphabet as defined in definition 2.1, and {s_1, ..., s_k} a set S of k sequences over this alphabet. A pseudo multiple alignment is a set of sequences S′ = {s′_1, ..., s′_k} over the alphabet Σ′ = Σ ∪ {-}, where

• k ≥ 2
• all sequences in S′ have the same length: ∀ s′_i, s′_j ∈ S′: |s′_i| = |s′_j|
• every s′_i ∈ S′ (i ≤ k) can be reduced to the corresponding sequence s_i in S by removing all gap symbols from s′_i
• the order of symbols in every sequence s′_i in S′ is the same as in the corresponding sequence s_i.

The number of sequences in the alignment is |S′|; the number of columns is defined as the length of the sequences in S′. PMSA(S) is the set of all possible pseudo alignments of the sequences S.

The set of all possible pseudo alignments PMSA(S) is an infinite set, since the number of columns consisting only of gaps is not restricted, which also implies that there is no bound on the length of the sequences in a pseudo alignment.

A Multiple Sequence Alignment is a pseudo alignment with the restriction that no column may consist only of gap characters.

Definition 2.3 (Multiple Sequence Alignment) Let n be the number of sequences in the alignment and k the number of columns. A set S′ of sequences is a multiple alignment iff it is a pseudo alignment and it fulfils the following condition:

∀ i ≤ k ∃ j ≤ n: s′_j[i] ≠ -

MSA(S) denotes the set of all multiple sequence alignments of the sequences S.

Despite the infinite set of pseudo alignments, the number of Multiple Sequence Alignments is restricted by the lengths and the number of sequences, and is hence finite, as is the length of the alignment.
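To make these conditions concrete, here is a minimal Python sketch (my own illustration, not part of the thesis; the gap symbol "-" and the function name are assumptions) that checks definitions 2.2 and 2.3 for a candidate alignment:

GAP = "-"

def is_multiple_alignment(aligned: list[str], originals: list[str]) -> bool:
    """Check definitions 2.2 and 2.3 for a candidate alignment."""
    if len(aligned) < 2 or len(aligned) != len(originals):
        return False
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        return False                                    # equal lengths
    if any(s.replace(GAP, "") != o for s, o in zip(aligned, originals)):
        return False                                    # reducibility, order kept
    return all(any(s[i] != GAP for s in aligned)        # no all-gap column
               for i in range(length))

print(is_multiple_alignment(["AC-E", "A-DE"], ["ACE", "ADE"]))    # True
print(is_multiple_alignment(["AC--E", "A-D-E"], ["ACE", "ADE"]))  # False: all-gap column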

When only two sequences are aligned, the MSA is called a pairwise alignment. Pairwise alignments are not discussed in this thesis, because they can be computed efficiently by dynamic programming (Needleman & Wunsch, 1970).

In order to define an optimal multiple alignment, a quality measure must be defined which assigns a value to an alignment. The measure is called a scoring or objective function (OF), because it is the mathematical goal to be optimised. The term OF will be preferred in this thesis because the MSA problem is tackled as a Combinatorial Optimisation (CO) problem.

Definition 2.4 (Scoring Function, Objective Function) Let Σ′ be defined as in definition 2.2. For a number of sequences n, a function

f : (Σ′*)^n → ℝ

is called an objective function (OF). Σ* is defined as the set of all words obtained by concatenation of symbols from the alphabet Σ.

Typical OFs are the (weighted) Sum of Pairs (SP) (Altschul, 1989), which scores the MSA by scoring all pairs of amino acids induced by the columns of the alignment, and the Consistency based Objective Function For alignmEnt Evaluation (COFFEE) (Notredame et al., 1998), which compares the alignment to a library of reference alignments. OFs are discussed in more detail in section 2.1.2.

Finally, an optimal alignment can now be defined. The term optimal can mean either a minimum or a maximum value, depending on the objective function used. As we use a scoring function (as opposed to a cost function) in our alignment algorithm, where better alignments receive a higher score, we define optimality by maximisation.¹

Definition 2.5 (Optimal Multiple Sequence Alignment) For a set of sequences S = {s_1, ..., s_k} over an alphabet Σ and an OF f, S′ = {s′_1, ..., s′_k} is an optimal multiple sequence alignment iff

• S′ ∈ MSA(S)
• ∀ S* ∈ MSA(S): S* ≠ S′ → f(S*) ≤ f(S′)

The second point states that f(S′) is a maximum value of the function f on the set of all alignments over S. This definition includes the possibility of more than one optimal MSA.

¹ This doesn't cause any loss of generality, since a minimisation problem can easily be transformed into a maximisation problem.

2.1.2 Alignment Scores

To estimate the quality of an alignment, each alignment gets a value based on an OF, where a higher value corresponds to better quality. The OF measures the quality of an alignment and should reflect how close this alignment is to the optimal biological alignment.

The most widely used scoring function is the Sum-of-Pairs (SP) score (Altschul, 1989), a direct extension of the scoring methods used in pairwise alignment. For each pair of sequences, the similarity between the aligned pairs of amino acids is computed, in most cases by using a scoring matrix. The pairwise scores are then added for each aligned pair of residues in every pair of sequences.

SP scores are used in many variations in MSA algorithms, e.g. using weights for the pairs of sequences to prevent sets of similar sequences from dominating the whole alignment, or different methods of assigning penalties to gaps in an alignment.
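As an unweighted illustration, the column-wise computation can be sketched as follows (a sketch under assumed conventions that the thesis does not fix here: gap-gap pairs are ignored, every residue-gap pair costs a flat gap_score, and the toy matrix is a placeholder):

from itertools import combinations

GAP = "-"

def sp_score(alignment: list[str], subst: dict, gap_score: int = -1) -> int:
    """Sum substitution scores over all induced residue pairs, column by column."""
    total = 0
    for col in zip(*alignment):               # iterate over alignment columns
        for x, y in combinations(col, 2):     # all induced pairs in the column
            if x == GAP and y == GAP:
                continue                      # gap-gap pairs are ignored
            if GAP in (x, y):
                total += gap_score            # flat residue-gap penalty
            else:
                total += subst.get((x, y), subst.get((y, x), 0))
    return total

toy_matrix = {("A", "A"): 2, ("A", "C"): -1, ("C", "C"): 2}
print(sp_score(["AC-", "A-C", "ACC"], toy_matrix))   # prints 6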

The dependence on a scoring matrix is a major drawback of the SP score, since these scoring matrices are very general and created by statistical analysis of large sets of alignments. The degree to which they match the properties of the sequences one is to align varies. Morgenstern et al. (1998) argue that scoring matrices fail to produce biologically reasonable alignments when the sequences have a low global similarity and share only local similarities. Notredame (2002) identifies the scoring of gaps when using the SP score as a major source of concern.

COFFEE is a consistency-based approach to scoring an MSA. Instead of measuring the similarity of each pair of amino acids induced by an alignment, COFFEE (Notredame et al., 1998) measures the correspondence between each induced pairwise alignment and a library of computed pairwise alignments. Each pair in the library is assigned a weight that is a function of its quality. The weight is equal to the percentage identity of the aligned sequences. The score of an alignment is computed as the score of all pairs of induced alignments, i.e. all pairs of aligned sequences in this multiple alignment. For each pair of residues aligned in both the library and the induced alignment, the weight is added to the score. To normalise the score, this score is divided by the maximal value, where all pairs in the multiple alignment are found in the library alignments.²

Formalised, the COFFEE score for an alignment A can be formulated as:

Definition 2.6 (COFFEE score for a MSA)

COFFEE(A) = ( Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} · SCORE(A_{i,j}) ) / ( Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} · LEN(A_{i,j}) )

where A_{i,j} is the pairwise alignment of the sequences s_i and s_j induced by A, W_{i,j} the weight associated with that alignment, LEN(A_{i,j}) the length of this alignment, and SCORE(A_{i,j}) the number of aligned pairs of residues found both in A_{i,j} and the library alignment.

² For the optimal score to be 1, the alignments in the library must be totally consistent, such that each pair of residues aligned in the MSA can also be found in the library.
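The following Python sketch (not the thesis implementation; the data layout, induced pairwise alignments as sets of residue-index pairs and a library dict mapping a sequence pair to (weight, pairs), is an assumption of mine) computes the score of definition 2.6, taking LEN(A_{i,j}) as the number of induced residue pairs, which matches the normalisation described above:

GAP = "-"

def induced_pairs(si: str, sj: str) -> set[tuple[int, int]]:
    """Residue-index pairs aligned when the MSA is projected onto (si, sj)."""
    pairs: set[tuple[int, int]] = set()
    p = q = -1
    for x, y in zip(si, sj):
        if x != GAP:
            p += 1
        if y != GAP:
            q += 1
        if x != GAP and y != GAP:
            pairs.add((p, q))
    return pairs

def coffee(alignment: list[str], library: dict) -> float:
    """library[(i, j)] = (W_ij, set of library residue pairs), for i < j."""
    num = den = 0.0
    n = len(alignment)
    for i in range(n - 1):
        for j in range(i + 1, n):
            w, lib_pairs = library[(i, j)]
            ind = induced_pairs(alignment[i], alignment[j])
            num += w * len(ind & lib_pairs)  # SCORE(A_ij): pairs found in both
            den += w * len(ind)              # LEN(A_ij): induced pairs
    return num / den if den else 0.0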

Using a library of alignments computed from the set of sequences under analysis prevents the scoring scheme from being too general and takes the properties of the sequences into account. Notredame et al. (1998) used a library of pairwise alignments created by ClustalW, but Notredame et al. (2000) suggest a library containing both global alignments from ClustalW and local alignments, and showed that this actually leads to an increase in the quality of the computed alignments.

When considering COFFEE as the OF for a local search, it is a great benefit that one can restrict the computational effort to those parts which have been changed by the procedure when performing one step, e.g. to those two columns where a gap and a residue have been exchanged.

One of the problems of both COFFEE and SP scoring is their assumption of uniform and time-invariant substitution probabilities for all positions in an alignment. The substitution probability depends on structural and functional properties of the proteins and can vary from zero for some regions to complete variability in others.

norMD (Thompson et al., 2001) tries to reflect this by using a column-based approach, where the probabilities of substitutions, and thus the scores for residue pairs, are column-dependent.

The norMD score is computed by measuring mean distances in a continuous space imposed by each sequence in the alignment. norMD incorporates affine gap costs and a similarity measure for the sequences which is based on a hash score derived from dot plots of each pair of sequences. Finally, the score is normalised to a value between zero and one.

The strength of this method is its independence from the number and length of the sequences in the alignment, and the assignment of column-based scores which can be used to find well or badly aligned regions inside an alignment.

A major drawback, which prohibits the usage of norMD as a scoring function for the ILS, is its complexity. Due to the handling of gap costs and the computation of the hash scores during scoring, the whole alignment has to be processed completely, which leads to very large run-times. Taking this into account, COFFEE is preferred as the OF for this project.

2.1.3 Complexity

The alignment of only two sequences is a common special case of MSA, and efficient algorithms have been proposed, e.g. by Needleman & Wunsch (1970) or Gotoh (1982). These algorithms use dynamic programming to efficiently compute all possible alignments; their asymptotic complexity is O(l²), where l is the length of the longest sequence, needing also O(l²) space (the space complexity can be reduced to linear by a divide-and-conquer technique). Extending dynamic programming to more than two sequences leads to an exponential complexity with the number of sequences: the complexity for dynamic programming with n sequences of maximum length l is O(lⁿ). Considering this, only small sets of sequences can be aligned with dynamic programming.
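For illustration, a minimal Python version of this O(l²) dynamic programme is sketched below (the scoring scheme of match +1, mismatch -1, linear gap penalty -2 is an arbitrary toy choice, not one used in the thesis, and only the score is returned, without traceback):

def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    """Optimal global alignment score of a and b in O(len(a) * len(b))."""
    rows, cols = len(a) + 1, len(b) + 1
    f = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        f[i][0] = i * gap                      # align a[:i] against gaps only
    for j in range(1, cols):
        f[0][j] = j * gap                      # align b[:j] against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = f[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            f[i][j] = max(diag, f[i - 1][j] + gap, f[i][j - 1] + gap)
    return f[-1][-1]

print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))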

Computing alignments for more than two sequences has been shown to be NP-hard by Wang & Jiang (1996) for metric scoring matrices, and by Just (1999) for a more general class of scoring matrices, which gives strong reasons to believe that there will be no efficient algorithm to solve this problem exactly (unless P = NP).

Even if Just's proof cannot be directly transferred to MSA with COFFEE as OF, it gives strong reason to believe that computing optimal COFFEE alignments is computationally hard and needs advanced techniques.

2.2 Metaheuristics and Combinatorial Optimisation

This section explains and defines two key terms of this thesis, namely Metaheuristic and Combinatorial Optimisation (CO).

2.2.1 Combinatorial Optimisation

The problem of optimisation can generally be formulated as optimising a function f(X), where X is a set of variables.³ Usually the possible set of values is restricted by constraints, expressed by inequalities such as g(X) ≤ c, c ∈ ℝ, where g(X) is a general function. Optimisation problems can be divided into two categories: those where the variables are from a continuous domain, and those with a discrete domain; the latter is called combinatorial. According to Papadimitriou & Steiglitz (1982), CO is the search for an object from a finite (or countably infinite) set, typically an integer, set, permutation or graph, whereas in the continuous case the solutions are usually a set of real numbers or even a function. Both cases are tackled with very different methods, and we focus only on CO problems.

Blum & Roli (2003) define a CO problem as follows:

Definition 2.7 (Combinatorial Optimisation problem) A CO problem P = (S, f) is defined by:

• a set of variables X = {x_1, ..., x_n}
• variable domains D_1, ..., D_n
• constraints among variables
• an objective function f to be optimised, where f : D_1 × ... × D_n → ℝ

³ In this context, "optimising" must be seen in a very broad view as the search for a best assignment of values to the variables.

The search space S is the set of all feasible assignments:

S = { s = {(x_1, v_1), ..., (x_n, v_n)} | v_i ∈ D_i, s satisfies all constraints }

S is usually called the search space or solution space, and the elements of S are feasible solutions (also called candidate solutions).

To conform with our definition of optimal MSA (see definition 2.5), we again focus on maximisation problems solely. To solve an optimisation problem means to find the solution s* ∈ S with maximum value, that is f(s*) ≥ f(s) ∀ s ∈ S. The solution s* is called a globally optimal solution. If there is more than one globally optimal solution, the set S* ⊆ S will be called the set of globally optimal solutions.

2.2.2 Heuristic algorithms

When tackling CO problems, one generally distinguishes between complete and approximation algorithms. Complete algorithms are guaranteed to find an optimal solution for every instance of a CO problem in finite time. Among the most widely used algorithms are the Branch & Bound algorithms, which are also called A* in AI. To guarantee optimal solutions is often not possible, since many CO problems are NP-hard, which implies that no polynomial time algorithms exist, if we assume P ≠ NP. In approximation algorithms, the guarantee of finding optimal solutions is relaxed to finding good solutions close to the optimal ones.

Approximative algorithms are often called heuristic algorithms, especially in Artificial Intelligence and Operations Research. The word heuristic has Greek roots, with "to find" or "to discover" as its original meaning. Reeves (1993) defines a heuristic as:

Definition 2.8 (Heuristic) A heuristic is a technique seeking good (i.e. near-optimal) solutions at a reasonable computational cost without being able to guarantee either feasibility or optimality, or even in many cases to state how close to optimality a particular feasible solution is.

Heuristics usually incorporate problem-specific knowledge beyond the knowledge found in the problem definition to reach a good trade-off between solution quality and computational complexity. The following presentation of heuristic methods follows those by Blum & Roli (2003) and Stützle (1999a).

When considering heuristic methods, Stützle (1999a) distinguishes between constructive and local search methods. Constructive methods generate solutions from an initially empty solution by adding components until a complete solution is reached; most prominent of these is the greedy approach, which always adds the component that gives the best improvement in terms of the OF. Constructive methods are usually among the fastest approximation algorithms found, but the computed solutions are of inferior quality compared to local search methods. Many MSA algorithms are constructive algorithms, among them ClustalW, PileUp and MultAlign.

Local search methods instead perform an iterative search. They start from an initial solution and repeatedly replace the current solution by a neighbouring solution, which can be computed from the current solution by applying a modification operator to it. The set of all solutions computable from the current solution is called the neighbourhood of the current solution. The neighbourhood structure is often visualised as a graph on top of the set of all solutions, where an edge between two solutions indicates that these two solutions are neighboured, i.e. that one solution can be computed from the other by using the modification operator on it. The neighbourhood is defined according to Blum & Roli (2003) and Stützle (1999a) as:

Definition 2.9 (Neighbourhood) A neighbourhood structure is a function N : S → 2^S that assigns to every s ∈ S a set of neighbours N(s) ⊆ S. N(s) is called the neighbourhood of s.

The iterative process goes on until no better solution in the neighbourhood of the current solution can be found. This final solution of the local search is called a locally maximal solution:

Definition 2.10 (Local Optimum) A locally maximal solution (or local maximum) w.r.t. a neighbourhood structure N is a solution s such that ∀ s′ ∈ N(s): f(s′) ≤ f(s). s is called a strict local maximum iff ∀ s′ ∈ N(s): f(s′) < f(s).

A neighbourhood where every local optimum is also a global optimum is called an exact neighbourhood. Exact neighbourhoods often have exponential size, so that searching the neighbourhood requires exponential time in the worst case, which makes them unusable in practise.
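A generic best-improvement variant of such a local search can be sketched in a few lines of Python (an illustration of the idea only; the neighbourhood function and the objective are parameters, and nothing here is specific to MSA):

def local_search(s, neighbours, f):
    """Iterative improvement: climb until s is a local maximum (definition 2.10)."""
    while True:
        best = max(neighbours(s), key=f, default=None)
        if best is None or f(best) <= f(s):
            return s                 # no improving neighbour: local optimum
        s = best                     # move to the best improving neighbour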

2.2.3 Metaheuristics

One of the main features of heuristics as described in the last section is that they incorporate problem-specific knowledge. This makes them specific to the problem and prohibits the transfer of one successful heuristic from one problem to another.

In recent years, much effort was spent on the development of general heuristic methods which are applicable to a wide range of CO problems. These general-purpose methods are now called metaheuristics (Glover, 1986) (some earlier publications use the term "modern heuristics", e.g. Reeves (1993)).

The Metaheuristics Network defines metaheuristic as:

Definition 2.11 (Metaheuristic) A metaheuristic is a set of concepts that can be used to define heuristic methods that can be applied to a wide set of different problems. In other words, a metaheuristic can be seen as a general algorithmic framework which can be applied to different optimisation problems with relatively few modifications to make them adapted to a specific problem.⁴

⁴ Metaheuristics Network website: http://www.metaheuristics.net, last accessed Sun, 25th April 2004.

Blum & Roli (2003) summarise the main properties which characterise a metaheuristic as:

• Metaheuristics are strategies that "guide" the search process.
• The goal is to efficiently explore the search space in order to find (near-)optimal solutions.
• Techniques which constitute metaheuristic algorithms range from simple local search procedures to complex learning processes.
• Metaheuristic algorithms are approximative and usually non-deterministic.
• They may incorporate mechanisms to avoid getting trapped in confined areas of the search space.
• The basic concepts of metaheuristics permit an abstract level description.
• Metaheuristics are not problem-specific.
• Metaheuristics may make use of domain-specific knowledge in the form of heuristics that are controlled by the upper level strategy.
• Today's more advanced metaheuristics use search experience (embodied in some form of memory) to guide the search.

Many metaheuristics have been proposed during the last years and are well known today, e.g. Simulated Annealing (SA) or Genetic Algorithms (GA). Blum & Roli (2003) give an overview of the most successful metaheuristics and show several ways of categorising them, depending on which aspect of the algorithm is in focus. Metaheuristics differ in the number of feasible solutions examined in each step (population-based vs. single point search), in their use of the objective function (dynamic vs. static objective function), in the number of neighbourhood structures used during the search (one vs. various neighbourhoods), in the usage of memory (memory usage vs. memory-less methods) and in the domain from which the underlying idea stems (nature-inspired vs. non-nature inspired).

Common metaheuristics are (in alphabetical order): ant algorithms, evolutionary algorithms, Greedy Randomised Adaptive Search Procedure (GRASP), Guided Local Search, Iterated Local Search (ILS), Simulated Annealing (SA), Tabu Search (TS) and Variable Neighbourhood Search (VNS). Section 2.4 shows examples of metaheuristics for MSA.

2.3 Iterated Local Search

A major problem with local search algorithms is the probability of getting trapped in a local optimum. One possibility to prevent this is to let the local search accept worsening steps; another is to restart the search from a modified version of the current local optimum. The modification should be done in such a way that it leads to a solution distant enough that the local search cannot easily undo the changes and return to the local optimum.

Figure 2.3: Pictorial view of ILS (the figure plots the score over the solution space S). After being trapped in the local optimum s, the current solution is modified, which leads to a new point in the search space, marked with s′. The next run of the local search leads to the new local optimum s″, which in turn can be modified again.

ILS (Lourenço et al., 2002; Stützle, 1999a, p. 25) is a systematic application of this idea, where an embedded heuristic search is applied repeatedly to a modified solution computed before. This results in a random walk of solutions that are locally optimal w.r.t. the embedded heuristic. ILS is a very general framework which includes other metaheuristics such as iterated Lin-Kernighan, variable neighbourhood search or large-step Markov chains.

ILS is expected to perform much better than simply restarting the embedded heuristic from a randomly computed position in the search space (Lourenço et al., 2002). Lourenço et al. (2002) point out that random restarts of local search algorithms have a distribution that becomes arbitrarily peaked about the mean when the instance size goes to infinity. Most local optima found have a value close to the mean local optimum, and only few optima are found with significantly better values. The probability to find a local optimum with a value close to the mean local optimum found is very large, whereas the probability to find a significantly better local optimum converges to zero with increasing instance size. They conclude that it is very unlikely to find a solution even slightly better than the mean local optimum for large problem instances, and that a biased sampling of the space of local optima could lead to better solutions.

ILS uses the idea of a random walk in the space of local optima. From the current solution s, a new starting point s′ for the local search is computed by modifying s using a perturbation procedure (s′ must remain within the set of feasible solutions). The embedded heuristic is then run on s′, which leads to a new local optimum s″. If s″ passes an acceptance test, e.g. yields an improvement compared to the starting point s, it becomes the next element of the walk, otherwise s″ is discarded and the walk restarts with s. Figure 2.3 shows one step of the random walk. The final solution of the algorithm is the best solution found during this walk.

Algorithm 2.1 An algorithmic presentation of ILS.

procedure IteratedLocalSearch
    s_0 ← GenerateInitialSolution
    s ← LocalSearch(s_0)
    s_best ← s
    while termination condition not met do
        s′ ← Perturbation(s, history)
        s″ ← EmbeddedHeuristic(s′)
        if f(s″) > f(s_best) then
            s_best ← s″
        end if
        s ← AcceptanceCriterion(s, s″, history)
    end while
    return s_best
end procedure

This process is formalised in algorithm 2.1. The algorithm consists of the components GenerateInitialSolution, LocalSearch, Perturbation, AcceptanceCriterion and a termination condition. In order to reach high performance and solution quality, the Perturbation and AcceptanceCriterion can make use of the history of the search, e.g. by emphasising diversification or intensification based on the period of time passed since the last improvement was found. The acceptance criterion acts as a counterbalance to the perturbation procedure by filtering the new local optima. The following paragraphs describe each component of the algorithm.
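Algorithm 2.1 translates almost line by line into a generic Python skeleton. The following is a minimal sketch of the control flow, not the thesis implementation; all callables (generate_initial, local_search, perturb, accept, terminated, f) are hypothetical parameters:

def iterated_local_search(generate_initial, local_search, perturb,
                          accept, terminated, f):
    """Generic ILS skeleton mirroring algorithm 2.1 (maximisation)."""
    s = local_search(generate_initial())     # s_0, then the first local optimum
    s_best = s
    history = []                             # may inform perturb/accept
    while not terminated(history):
        s_pert = perturb(s, history)         # s'
        s_new = local_search(s_pert)         # s'' from the embedded heuristic
        if f(s_new) > f(s_best):
            s_best = s_new                   # keep the best solution of the walk
        s = accept(s, s_new, history)        # acceptance criterion
        history.append(f(s))
    return s_best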

2.3.1 GenerateInitialSolution

GenerateInitialSolution is a construction heuristic which computes the starting point for the first run of the embedded heuristic. The construction heuristic used should be as fast as possible but produce good starting points for the embedded heuristic; standard choices are either random solutions or greedy heuristics. In most cases, greedy heuristics are preferred, since the embedded heuristic reaches a local optimum in fewer steps than when starting from random solutions. It is worth noting that the best greedy heuristic may not provide the best starting point for a local search. Lourenço et al. (2002) give an example where using one of the best construction heuristics known for the Travelling Salesman Problem (TSP) leads to worse results than using a random starting solution as input for a local search procedure. Another point to consider is the time a greedy heuristic needs; the longer the ILS runs, the less important the choice of the starting point gets.

2.3.2 Embedded Heuristic

In most cases, the embedded heuristic is an iterative improvement local search procedure, but any other heuristic can be used, since the embedded heuristic can mostly be treated as a black box. Lourenço & Zwijnenburg (1996) used a TS in their application to the job-shop scheduling problem.

However, there are points to consider when optimising the performance of an ILS. The local search should be selected with respect to the perturbation used: it should not be able to easily undo the changes of the perturbation.

A trade-off between speed and solution quality of the embedded heuristic must be found. Usually a heuristic with better results is preferable, but if the heuristic needs too much time, the ILS is unable to examine enough local optima to find high-quality solutions.

Finally, one has to consider the interrelation between the embedded heuristic and the perturbation when optimising an ILS. In many cases, the embedded heuristic uses some tricks to prune the search space and speed up the search process, e.g. by using don't look bits (Bentley, 1992) in TSP algorithms (Stützle, 1999a, p. 78). The highest performance is reached when these tricks are implemented w.r.t. the perturbation procedure. In the example of don't look bits for TSP, the don't look bits are saved for the next run of the local search and only reset for the cities where the perturbation induced a change. This gains a large advantage in performance of the ILS.

2.3.3 Perturbation

The perturbation procedure modifies a local optimum to become a new starting point for the next heuristic search. Perturbations are often indeterministic to avoid cycling. A general aspect in choosing perturbations is that they must not be undone easily by the embedded heuristic, since this would mean falling back to the last visited local optimum. To find a good perturbation, properties of the problem instance can be considered, as well as problem-specific knowledge.

Most perturbations are rather simple: e.g. in TSP algorithms, small "double-bridge moves" are frequently used, which remove four edges and add four new edges. While in the case of TSP a simple perturbation procedure leads to good results, other problems might need more complex procedures, e.g. optimising a part of the current instance or solving a related instance with relaxed constraints.

The most important property of the perturbation is its strength. Perturbation strength can be informally defined as the number of components in a solution that get changed by a perturbation. Since the number of changed components provides a measure for the distance between two solutions, the perturbation strength can be seen as the distance of the perturbed solution to the original solution. The strength of the perturbation should be neither too weak nor too strong; a good choice depends on the problem and on the particular instance.

The strength of a perturbation can be fixed, where the distance between s and s′ is fixed, or adaptive, where its strength changes over time, e.g. gets larger when no improvements were found, to emphasise diversification. To adapt the perturbation strength, the search history can be used.

2.3.4 Acceptance criterion

The acceptance criterion determines whether a solution from the embedded heuristic is used as a starting point for the next iteration of ILS. The acceptance criterion provides a balance between intensification and diversification of the search. The two simplest cases are a criterion which accepts only improvements, and thus emphasises intensification, and the opposite criterion, which accepts every new solution and provides a strong diversification. The first criterion is defined as BETTER (algorithm 2.2), the second one as ALWAYS.

Algorithm 2.2 BETTER acceptance criterion

if f(s″) > f(s) then
    s ← s″
end if

In between these two extremes, there are several possibilities for acceptance criteria. It is possible to prefer improvements, but sometimes accept non-improvement steps, e.g. by using a criterion similar to the one used in SA. Other choices could make use of the search history, similar to some implementations of TS. A limited use of the history is to restart the ILS algorithm from a new initial solution when it seems that there will be no improvement found in the current region of the search space. A simple way would be to restart if the search is not able to traverse to a better local optimum for a certain number of steps. The RESTART criterion implements this approach:

Let i be the current number of iterations of the ILS and i_last the iteration where the last improving step was made. The RESTART acceptance criterion is defined in algorithm 2.3.

Algorithm 2.3 RESTART acceptance criterion

if f(s″) > f(s) then
    s ← s″
else if f(s″) ≤ f(s) ∧ i − i_last > i_T then
    s ← GenerateInitialSolution
end if

It conforms to the BETTER criterion by accepting a local optimum as starting point for the next iteration only if it is better than the previous starting point. Furthermore, if there has been no improvement for a number i_T of iterations, the search is restarted with a newly generated solution. The new starting point can be generated in different ways, e.g. by running the construction heuristic again; other choices for generating the new starting point are also possible.
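In the same hypothetical Python style as the skeleton above, the two criteria of algorithms 2.2 and 2.3 could look as follows; the iteration counters i and i_last are assumed to be tracked by the caller:

def better(s, s_new, f):
    """BETTER (algorithm 2.2): accept only improvements."""
    return s_new if f(s_new) > f(s) else s

def restart(s, s_new, f, i, i_last, i_T, generate_initial):
    """RESTART (algorithm 2.3): BETTER, plus a restart after i_T
    iterations without improvement."""
    if f(s_new) > f(s):
        return s_new
    if i - i_last > i_T:
        return generate_initial()  # e.g. a randomised construction heuristic
    return s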

2.3.5 Termination condition

The termination condition ends the algorithm. There are no restrictions on how to choose the condition. Common choices include a fixed number of iterations, a number of iterations depending on some properties of the problem instance, a fixed amount of consumed run-time or (if possible) an estimation of the solution quality.

2.4 Related Work

This chapter presents previously published algorithms and summarises the most important results found during the literature study done in the initial phase of this project. We start with a classification of MSA algorithms, briefly describing the most prominent algorithms of each class. The applications of Metaheuristics that were found are described in detail later in this chapter. Notredame (2002) reviews most available algorithms and classifies them into four classes:

Progressive algorithms add sequences in a step-by-step manner according to a precomputed order. They use an extension of the dynamic programming algorithm used in pairwise alignment (Needleman & Wunsch, 1970) to add a sequence to an alignment. This class includes the most widely used algorithms: ClustalW, which can be seen as a standard method for multiple alignments, PileUp and MultAlign.

All progressive algorithms start by choosing an order in which the sequences will be added to the alignment, mostly implemented by computing an approximative phylogenetic tree from the similarities of the sequences. The sequences are then added according to this order. This process is similar to the construction heuristic used in ILS. The drawbacks of progressive algorithms are the inability to correct gaps placed in early steps of the alignment, the domination of a large set of similar sequences, and the imposed order on the sequences.

The most prominent algorithm for MSA, ClustalW (Thompson et al., 1994), aligns the sequences according to a guidance tree computed by neighbour joining. It uses many heuristic adjustments to compensate the drawbacks of pure progressive strategies, among them automatic choice of substitution matrices for different degrees of sequence identity, local gap penalties and gap penalty adjustment.

Exact algorithms rely on a generalisation of the algorithm for pairwise alignment to more than two sequences. The original pairwise alignment algorithm (Needleman & Wunsch, 1970) uses dynamic programming to find an optimal path in a two-dimensional lattice which corresponds to the optimal pairwise alignment. By extending this scheme to n dimensions, alignments of n sequences can be computed. Since the time needed to compute the optimal alignment is O(lⁿ) for n sequences of maximal length l, this approach is limited to only small sets of short sequences.

This approach is implemented in the MSA program (Lipman et al., 1989), which in turn is a heuristic implementation of an algorithm proposed by Carrillo & Lipman (1988). DCA (Stoye et al., 1997) enhances the number of sequences which can be aligned by MSA by using a divide-and-conquer approach to cut the sequences into smaller fractions, which in turn can be aligned with MSA and later reassembled. Still, the maximum number of sequences which can be handled remains relatively small.

Iterative algorithms are based on the idea that a solution can be computed by starting from a sub-optimal solution, which is then modified until it reaches the optimum. Iterative algorithms can be further classified into stochastic and deterministic algorithms. The first subclass includes Hidden Markov Models, GA, SA, TS and the ILS presented in this thesis. Most algorithms in the second subclass are based on dynamic programming. SAGA (a Genetic Algorithm), SA and TS are described in detail at the end of this chapter.

One of the best MSA algorithms available, Prrp, uses a deterministic iterative strategy. Prrp uses a doubly nested loop strategy to optimise weights assigned to the pairs of sequences and the weighted SP-score simultaneously. The inner loop re-aligns subgroups inside the alignment until no further improvement can be made. The outer loop in turn computes approximative weights for all pairs of sequences based on the current alignment. The iteration ends when the weights have converged.

Consistency-based algorithms try to produce a MSA that corresponds most with all pairwise alignments. By using the pairwise alignments, consistency-based algorithms are independent of a specific scoring matrix and make use of position-dependent scores. This bypasses many of the problems of progressive algorithms and provides the enhancements of ClustalW as an inherent feature of the algorithm. Since this class includes all algorithms which make use of a consistency-based objective function, it includes SAGA, T-COFFEE, DiAlign and the ILS.

T-COFFEE is a progressive algorithm using the COFFEE function instead of a substitution matrix when computing the alignment. It uses an extended library of global and local alignments, which are merged into a scoring matrix by a heuristic method called library extension. The library extension process assigns a weight to each pair of residues present in the library alignments. The weight is larger for pairs in an alignment which are consistent with the other library alignments.

DiAlign is based on gap-free segments of residue pairs in the pairwise alignments induced by the sequences inside a MSA. It starts with computing all pairwise alignments and identifying all diagonals (diagonals represent matches of residues, and hence a diagonal is a gap-free segment). The diagonals are then sorted according to a probabilistic weight and their overlap with other diagonals. The latter property should emphasise motifs occurring in more than two sequences. The MSA is formed from this list by adding the aligned segments one after another, as long as they are consistent with the alignment built so far.

Name          Classification                  Reference
MSA           Exact                           Lipman et al. (1989)
DCA           Exact                           Stoye et al. (1997)
ClustalW      Progressive                     Thompson et al. (1994)
Multalin      Progressive                     Corpet (1988)
DiAlign       Consistency-based               Morgenstern et al. (1998)
DiAlign2      Consistency-based               Morgenstern (1999)
T-COFFEE      Consistency-based/Progressive   Notredame et al. (2000)
Prrp          Iterative                       Gotoh (1996)
SAGA          Iterative                       Notredame & Higgins (1996)
Tabu Search   Iterative                       Riaz et al. (2004)

Table 2.1: MSA algorithms and their classification, taken from Notredame (2002). Please refer to the text for a short description of the algorithms and the classifications.

Table 2.1 shows the most popular algorithms together with their classification and references. The remainder of this chapter briefly describes the already implemented Metaheuristics in more detail. More detailed information can be found in the references.

2.4.1 Simulated Annealing

The first modern Metaheuristic applied to MSA is Simulated Annealing (Kim et al., 1994), which was used to optimise the SP objective function with affine and natural gap costs. The neighbourhood structure used is built up from a move operation which swaps a block of consecutive gaps with an adjacent block of symbols in one of the sequences. All parameters for the swap operation, which are the size and position of the gap block, whether symbols to the left or right are swapped, and how many symbols are swapped, are determined randomly. No gaps are inserted or deleted during the search.

SA tries to explore this neighbourhood structure by traversing iteratively from one solution to a better scoring solution, but also accepting steps to solutions with lower scores with a certain probability. This gives SA the possibility to escape local minima.

SA is one of the few Metaheuristics with mathematically proven convergence towards an optimal solution. In practise, SA is too slow to be used as a complete alignment algorithm, but it can be used as an alignment improver to correct and improve alignments computed by other algorithms.

2.4.2 SAGA

SAGA (Sequence Alignment by Genetic Algorithm) is a GA for MSA proposed by Notredame & Higgins (1996). It was used to optimise alignments for the weighted SP score with different gap penalties, and later extended to use the COFFEE objective function as well (Notredame et al., 1998).

SAGA starts with a population of random alignments generated by adding a random number of gaps to the start and end of each sequence, ensuring that all sequences have the same length afterwards.

To compute the next generation of one iteration, 50% of the new population is filled with the 50% fittest alignments (the fitness value corresponds to the score of the alignment) from the previous generation. The remaining 50% are created by modifying randomly chosen alignments from the previous generation, where the probability of being chosen depends on the fitness value. SAGA uses 22 different operators to modify an alignment: two crossover operators which generate a new alignment from two parents, two operators which insert blocks of gaps into the alignment, a family of sixteen operators shifting blocks of gaps inside the alignment to new positions, an operator which tries to find small profiles by finding the best small gap-free substring of a sequence in all sequences, and an operator optimising a pattern of gaps in a small region of the alignment by either exhaustive search or GA, depending on the size of the region and the number of gaps to be placed.

Which operator is applied is stochastically chosen, with a probability depending on the quality of the offspring an operator has generated in the last ten iterations.

SAGA is among the best algorithms for MSA regarding alignment quality, but needs a very long time to produce good quality alignments, maybe due to using randomly generated alignments as the initial population.

2.4.3 Tabu Search

Tabu Search (TS) (Glover, 1986; Glover & Laguna, 1997) is a general Metaheuristic which explores a neighbourhood by traversing from one solution to the neighbour with the best score, even if this neighbour has a lower score. To prevent cycling, TS uses a form of memory to declare recent moves as forbidden for a number of iterations. In some cases, a move to a solution declared tabu may actually lead to an improvement. In this case, aspiration criteria are used which allow forbidden moves under some circumstances, e.g. if the new solution would be better than all previously found solutions. The memory can also be used to balance between diversification and intensification.

Riaz et al. (2004) implemented a TS for MSA, using the COFFEE objective function. They propose two different move strategies: a single block move, which is the same move as used in the BlockMove local search in ILS, and a move strategy moving a rectangle of gaps which spans over more than one sequence to a new position. The second move strategy is implemented in their TS algorithm since it has been found to give shorter runtimes than performing a series of block moves. The search concentrates on a badly aligned region until it has been improved enough to no longer be classified as badly aligned. The TS is then focused on the next badly aligned region, until no further badly aligned regions are found.

TS is tested on the BAliBASE database (version 1.0), and is found to be comparable to other algorithms in terms of solution quality, but needs a very long time to produce good alignments. It is pointed out that the initial alignment from which the TS is started has a strong influence on the quality of the final alignment.

3 Algorithm

This chapter describes the ILS algorithm for MSA and motivates the decisions made when designing the components of ILS (section 2.3). For each of the components, several alternatives will be proposed and compared to find the best alternative to use in the ILS. The results are all presented in chapter 5. The decisions made are motivated by preliminary tests on subsets of the BAliBASE database.

3.1 Construction heuristic

Three variants of construction heuristics have been implemented and evaluated to see if they can serve as a good starting point for a local search.

The first heuristic, Random, builds the initial alignment by randomly inserting a fixed percentage of gaps into the sequences, filling the shorter sequences with additional gaps.

The second heuristic, AlignRandom, generates the initial alignment by successively aligning a sequence to an alignment in random order. It starts with aligning two sequences, and adds the next sequence to this alignment until all sequences have been added. The alignment method is similar to standard dynamic programming and is described by Gotoh (1993).¹

The third heuristic, NeighbourJoining, is a greedy heuristic which aligns the sequences step by step according to a guidance tree, which is computed using the neighbour joining method (Saitou & Nei, 1987); the alignment is done with the same procedure as in the second heuristic. NeighbourJoining is implemented with the Quickjoin library, which uses a heuristic speed-up technique to implement the algorithm as described by Saitou (Brodal et al., 2003).²

The heuristics will be referred to as Random, AlignRandom and NeighbourJoining in the following.
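A sketch of the first heuristic in Python (the interfaces and the 20% default are my own illustrative choices; the thesis does not fix them here) could look like this:

import random

GAP = "-"

def random_construction(seqs: list[str], gap_fraction: float = 0.2) -> list[str]:
    """Insert a fixed percentage of gaps at random positions, then pad the
    shorter rows with trailing gaps so that all rows have equal length."""
    rows = []
    for s in seqs:
        chars = list(s)
        for _ in range(int(gap_fraction * len(s))):
            chars.insert(random.randrange(len(chars) + 1), GAP)
        rows.append("".join(chars))
    width = max(len(r) for r in rows)
    return [r + GAP * (width - len(r)) for r in rows]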

Table 5.1 on page 41 compares the quality of the alignments computed by each
construction heuristic, for both the COFFEE score and the BAliBASE Sum of Pairs Score
(BSPS). NeighbourJoining clearly achieved the highest scores in both cases.

Considering that the runtime for NeighbourJoining and AlignRandom is nearly the
same, and that the runtime of the whole ILS benefits from a starting solution near a
local optimum, NeighbourJoining is selected as the construction heuristic used in the
implementation. This decision is further motivated by the assumption that good solutions
are clustered together in the search space, which is one reason for the good performance
of ILS.

¹ Gotoh presents four algorithms, differing in estimated runtime and quality of the alignments produced.
In this implementation, Algorithm A is used.

² QuickJoin is available from http://www.daimi.au.dk/~

A drawback of NeighbourJoining is that it is deterministic, so every restarted local
search starts from the same initial solution, which makes it unsuited for the RESTART
acceptance criterion. When used in conjunction with RESTART, the first initial
solution is computed by NeighbourJoining, whereas the new initial solutions for restarts are
computed by AlignRandom.

3.2 Local Search

The embedded heuristic, which will be a local search procedure in this project, has great
influence on the performance of ILS. Given that great impact, we will design the other
components around the local search procedure.

To find a good local search regarding both runtime and solution quality, two local
search procedures have been implemented and tested on a random subset of the
BAliBASE database. The subset consists of a quarter of all sequences of each reference set,
randomly chosen.

The two local search procedures each examine a different neighbourhood structure by
strict iterative improvement, going from one solution to another only if the new solution
is better than the former.

3.2.1 Single exchange local search

The first local search procedure exchanges one symbol with an adjacent gap symbol until
no further improvement can be made. The neighbourhood structure used in the local
search is defined as follows:

Definition 3.1 (Neighbourhood structure for single exchange local search). A
pseudo multiple sequence alignment A is contained in the neighbourhood of another
pseudo multiple sequence alignment B iff alignment A can be transformed into alignment B by
exchanging a gap with a symbol, preserving the order of the symbols in each sequence.

The size of the neighbourhood is $O(N^2 \cdot L)$, where $N$ is the number of sequences in the
alignment and $L$ is the length of the longest sequence in the set $S$ of sequences. The worst
case is an alignment with a maximum number of gaps, since each gap can be replaced
by two residues (except gaps at the beginning or end of a sequence, which can only be
replaced by one residue), which gives two (one) neighbours. This worst-case alignment
consists of columns which are built from $N-1$ gaps and only one non-gap symbol.
Each symbol from the sequences constitutes one column, so the alignment has $\sum_{s \in S} |s|$
columns. The number of gaps is then given by $(N-1) \cdot \sum_{s \in S} |s|$. For $L$ being the length of
the longest sequence, it holds that $L \ge |s|$ for all $s \in S$, and from that $\sum_{s \in S} |s| \le |S| \cdot L = N \cdot L$.
Altogether, this gives $(N-1) \cdot N \cdot L = N^2 \cdot L - N \cdot L = O(N^2 \cdot L)$.
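As a concrete example, for $N = 4$ sequences of length at most $L = 10$, the worst-case alignment has at most $\sum_{s \in S} |s| \le 40$ columns and $(N-1) \cdot 40 = 120$ gaps, each of which contributes at most two neighbours, i.e. at most $240$ neighbours, in accordance with the $O(N^2 \cdot L)$ bound.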

The local search procedure explores this neighbourhood with a combination of a greedy
and a random procedure. The local search traverses every block of gaps, i.e. a consecutive
run of gaps, and searches for an improvement within this block.


Within a block, each exchange of a gap with one of the symbols adjacent to the block
is evaluated. If a move yields an improvement, it is executed and the local search is restarted. If no improvement was
found in this block, the local search is continued with the next block. The blocks and
the gaps in each block are traversed in random order.

Figure 3.1: Move operation of the single exchange local search. A gap inside a block of
consecutive gaps is swapped with one of the symbols adjacent to this block:

KQQ----KYLSA-KSPK        KQQ----KYLSA-KSPK
HFN----RY-ITS--PK   =>   HF---N-RY-ITS--PK
ERT---QYPDI----PK        ERT---QYPDI----PK
---KLVN--E-AKNL--        ---KLVN--E-AKNL--

One final point of variation inside the local search worth considering is the question
which move should be performed during each iteration.

Algorithm 3.1 shows the complete local search algorithm in pseudo code. The
algorithm examines, for all gap blocks and their adjacent symbols, whether swapping the symbol
with one of the gaps inside the block yields an improvement, until either an improvement
is found or all gaps have been examined. Depending on the parameter pivot, which can
be set to either "Best" or "First", the local search selects the neighbour which yields the
largest improvement possible, or the first neighbour found to yield an improvement. The
algorithm uses an auxiliary function swap which swaps two positions inside an alignment.

The function ExchangeMove evaluates a move in the neighbourhood structure by
computing the scores of the current alignment and of the neighbour reached by
exchanging one symbol and a gap. The difference between these two scores is returned.

The function ExamineBlock searches for the best improvement w.r.t. a block of
consecutive gaps by evaluating all possible moves which consider one of the symbols right
and left of the block and the gaps inside the block. It returns the value and gap position of
the best improving move found, and zero if no improvement can be made.

The local search is implemented in procedure SingleExchange. It examines all blocks
of gaps as long as an improving move in the neighbourhood can be made. To do
this, it first evaluates all moves induced by a block and the symbol to the left of this block
(lines 24–25). If an improvement was found, or if pivoting is set to "Best", the same is
done for the symbol to the right of the same block (lines 26–34). If the examination of a
block found an improvement and pivoting is set to "First", the for-loop is quit in line 36.
When the loop terminates, the best improving move found is executed (lines 39–40).

Figure 3.1 shows a possible move during the local search. First, a block of gaps is
selected. For both adjacent residues, the score of exchanging them with any gap in the
selected block is computed, and the best combination is eventually executed as the move
for this iteration.

The estimated runtime of one iteration of the local search procedure is $O(N^2 \cdot L)$. In
the worst case, the local search has to evaluate the whole neighbourhood of a solution,
which conforms to evaluating every possible move, since each neighbour can be reached
by exactly one move.


 1: function ExchangeMove(Alignment A, OF f, gapPos, symbolPos)
 2:     delta ← f(A)
 3:     A′ ← A
 4:     swap(A′, gapPos, symbolPos)                ▷ swap two positions in an alignment
 5:     delta ← delta − f(A′)
 6:     return delta
 7: end function
 8:
 9: function ExamineBlock(Alignment A, OF f, Block G, symbolPos)
10:     bestDelta ← 0
11:     for all gaps g ∈ G do                      ▷ find the best improvement for this block
12:         delta ← ExchangeMove(A, f, g, symbolPos)
13:         if delta < bestDelta then
14:             bestDelta ← delta
15:             gapPosition ← g
16:         end if
17:     end for
18:     return (bestDelta, gapPosition)            ▷ bestDelta = 0 if no improvement found
19: end function
20:
21: procedure SingleExchange(Alignment A, Objective Function f, pivot)
22:     repeat
23:         for each block G of consecutive gaps do
24:             symbolPos ← symbol to the left of G
25:             (bestDelta, gapPos) ← ExamineBlock(A, f, G, symbolPos)
26:             if bestDelta < 0 ∨ pivot = "Best" then
27:                 rightPos ← symbol to the right of G
28:                 (rightDelta, rightGap) ← ExamineBlock(A, f, G, rightPos)
29:                 if rightDelta < bestDelta then
30:                     bestDelta ← rightDelta
31:                     symbolPos ← rightPos
32:                     gapPos ← rightGap
33:                 end if
34:             end if
35:             if pivot = "First" ∧ bestDelta < 0 then
36:                 break                          ▷ quit the for loop
37:             end if
38:         end for
39:         if bestDelta < 0 then                  ▷ if improvement found
40:             swap(A, gapPos, symbolPos)         ▷ perform the found move
41:         end if
42:     until no improvement was found in the last iteration
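For illustration, a compact Python sketch of the single exchange neighbourhood follows. It is a simplified first-improvement variant that always checks both adjacent symbols, not a reproduction of the pivoting logic of algorithm 3.1; it assumes an alignment given as a list of equal-length row strings, '-' as the gap symbol, and a maximised objective function f such as COFFEE.

def gap_blocks(row, gap='-'):
    """Return (start, end) pairs of maximal runs of gaps in one row."""
    blocks, i = [], 0
    while i < len(row):
        if row[i] == gap:
            j = i
            while j < len(row) and row[j] == gap:
                j += 1
            blocks.append((i, j))  # gaps occupy positions i .. j-1
            i = j
        else:
            i += 1
    return blocks

def swap(alignment, r, i, j):
    """Return a copy of the alignment with positions i and j of row r swapped."""
    row = list(alignment[r])
    row[i], row[j] = row[j], row[i]
    return alignment[:r] + [''.join(row)] + alignment[r + 1:]

def single_exchange(alignment, f):
    """First-improvement single exchange local search (maximising f)."""
    improved = True
    while improved:
        improved = False
        for r in range(len(alignment)):
            row = alignment[r]
            for start, end in gap_blocks(row):
                # Symbols directly left and right of the gap block, if any.
                for sym in (p for p in (start - 1, end) if 0 <= p < len(row)):
                    for g in range(start, end):
                        candidate = swap(alignment, r, g, sym)
                        # f(alignment) is recomputed for clarity; a real
                        # implementation would cache the current score.
                        if f(candidate) > f(alignment):
                            alignment = candidate  # execute the move and restart
                            improved = True
                            break
                    if improved:
                        break
                if improved:
                    break
            if improved:
                break
    return alignment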


To reduce the runtime, don't look bits are used to prune the search
space. Don't look bits are a matrix of boolean values, which are initially set to false.
A symbol adjacent to a gap block is only considered for a move if the corresponding don't
look bit is false. Whenever the local search evaluates a symbol adjacent to a block of
gaps and finds that no improvement can be made by exchanging this symbol with one of
the gaps in the adjacent block, the don't look bit is set to true; when a move is actually
executed, the don't look bits for the two exchanged positions are set to false.
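A sketch of how the don't look bits could be wired into the search loop, reusing gap_blocks and swap from the previous sketch; the exact bookkeeping in the thesis implementation may differ from this illustration.

def single_exchange_dlb(alignment, f):
    """Single exchange local search with don't look bits: a symbol that was
    examined without success is skipped until a move touches it again."""
    dont_look = [[False] * len(row) for row in alignment]
    improved = True
    while improved:
        improved = False
        for r in range(len(alignment)):
            row = alignment[r]
            for start, end in gap_blocks(row):
                for sym in (p for p in (start - 1, end) if 0 <= p < len(row)):
                    if dont_look[r][sym]:
                        continue  # this symbol yielded no improvement before
                    for g in range(start, end):
                        candidate = swap(alignment, r, g, sym)
                        if f(candidate) > f(alignment):
                            alignment = candidate
                            # reset the bits of the two exchanged positions
                            dont_look[r][g] = dont_look[r][sym] = False
                            improved = True
                            break
                    else:
                        dont_look[r][sym] = True  # no improvement: don't look again
                    if improved:
                        break
                if improved:
                    break
            if improved:
                break
    return alignment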

3.2.2 Block exchange local search

The second local search procedure is similar to the first one, but tries to exchange whole
blocks of gaps instead of one symbol in each step. Riaz et al. (2004) favoured this local
search in their TS implementation³ and state that it is superior in runtime and solution
quality to the single exchange local search.

The neighbourhood structure for the block exchange local search is defined as:

Definition 3.2 (Neighbourhood structure for block exchange local search). A
pseudo multiple sequence alignment A is contained in the neighbourhood of another
pseudo multiple sequence alignment B iff alignment A can be transformed into
alignment B by exchanging a block of gaps with a block of non-gap symbols in one sequence,
preserving the order of the symbols in each sequence.

In the worst case, the neighbourhood structure is equivalent to the neighbourhood
structure of the single exchange local search, since a maximal number of gap blocks is given
when each block is of length one. This gives the same worst-case size of the
neighbourhood, $O(N^2 \cdot L)$, but on average the neighbourhood can be expected to be much
smaller.

Figure 3.2: Move operation for the block exchange local search. A randomly selected
block of gaps is exchanged with a block of symbols of equal size to its right:

KQQ----KYLSA-KSPK        KQQ----KYLSA-KSPK
HFN----RY-ITS--PK   =>   HFN----RY-ITS--PK
ERT---QYPDI----PK        ERTQYP---DI----PK
---KLVN--E-AKNL--        ---KLVN--E-AKNL--

Algorithm 3.2 describes the local search procedure, and figure 3.2 presents one step
in graphical form. It uses the same technique as SingleExchange (algorithm 3.1) to
distinguish between best improvement and first improvement pivoting. Swapping of blocks
of symbols is done by an auxiliary function swapBlock which swaps two blocks inside an
alignment.

³ TS traverses to the best solution in the neighbourhood of the current solution, a process similar to
a best improvement local search.

 1: function BlockMove(Alignment A, OF f, Block gaps, Block symbols)
 2:     delta ← f(A)
 3:     A′ ← A
 4:     swapBlock(A′, gaps, symbols)
 5:     delta ← delta − f(A′)
 6:     return delta
 7: end function
 8:
 9: procedure BlockExchange(Alignment A, OF f, pivot)
10:     repeat
11:         bestDelta ← 0
12:         for each block G of consecutive gaps do
13:             leftSymbols ← maximal block of at most |G| consecutive symbols to the left
14:             if BlockMove(A, f, G, leftSymbols) < bestDelta then
15:                 bestDelta ← BlockMove(A, f, G, leftSymbols)
16:                 symbolBlock ← leftSymbols
17:                 gapBlock ← G
18:             end if
19:             if pivot = "Best" ∨ bestDelta = 0 then
20:                 rightSymbols ← maximal block of at most |G| consecutive symbols to the right
21:                 if BlockMove(A, f, G, rightSymbols) < bestDelta then
22:                     bestDelta ← BlockMove(A, f, G, rightSymbols)
23:                     symbolBlock ← rightSymbols
24:                     gapBlock ← G
25:                 end if
26:             end if
27:             if bestDelta < 0 ∧ pivot = "First" then
28:                 break                          ▷ end the for-loop
29:             end if
30:         end for
31:
32:         if bestDelta < 0 then
33:             swapBlock(A, gapBlock, symbolBlock)
34:         end if
35:     until no improvement was found in the last iteration
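To make the move operation concrete, here is a small Python sketch of a swapBlock-style operation on a single row; the half-open index ranges and '-' as the gap symbol are assumptions for illustration, and the two blocks must be adjacent, as in figure 3.2.

def swap_block(row, gap_start, gap_end, sym_start, sym_end, gap='-'):
    """Exchange the gap run row[gap_start:gap_end] with the adjacent symbol
    run row[sym_start:sym_end], preserving the order of the symbols."""
    chars = list(row)
    symbols = chars[sym_start:sym_end]
    gaps = [gap] * (gap_end - gap_start)
    if sym_end == gap_start:        # symbol block directly left of the gaps
        chars[sym_start:gap_end] = gaps + symbols
    elif gap_end == sym_start:      # symbol block directly right of the gaps
        chars[gap_start:sym_end] = symbols + gaps
    else:
        raise ValueError("blocks must be adjacent")
    return ''.join(chars)

# The move of figure 3.2: the three gaps in "ERT---QYPDI----PK" are
# exchanged with the symbol block "QYP" directly to their right.
print(swap_block("ERT---QYPDI----PK", 3, 6, 6, 9))  # -> "ERTQYP---DI----PK"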
