Multiple Sequence Alignment by Iterated Local Search
HS-IKI-MD-04-001
Jens Auer
Submitted by Jens Auer to the University of Skövde as a dissertation towards the degree of Master of Science with a major in computer science.
September 2004
I certify that all material in this dissertation which is not my own work has been identified and that no material is included for which a degree has already been conferred upon me.
Kommunikation och information
Master Dissertation
Multiple Sequence Alignment by Iterated Local Search
Jens Auer
August 29, 2004
I would like to thank my supervisor Björn Olsson for his support and comments which helped me very much to complete this thesis project. I also would like to thank Thomas Stützle for providing me with some of the
The development of good tools and algorithms to analyse genome and protein data in the form of amino acid sequences coded as strings is one of the major subjects of Bioinformatics. Multiple Sequence Alignment (MSA) is one of the basic yet most important techniques to analyse data of more than two sequences. From a computer scientist's point of view, MSA is merely a combinatorial optimisation problem where strings have to be arranged in a tabular form to optimise an objective function. The difficulty with MSA is that it is NP-complete, and thus impossible to solve optimally in reasonable time under the assumption that P ≠ NP.
The ability to tackle NP-hard problems has been greatly extended by the introduction of Metaheuristics (see Blum & Roli (2003) for a summary of most Metaheuristics), general problem-independent optimisation algorithms extending the hill-climbing local search approach to escape local minima. One of these algorithms is Iterated Local Search (ILS) (Lourenço et al., 2002; Stützle, 1999a, p. 25), a recent, easy to implement but powerful algorithm with results comparable or superior to other state-of-the-art methods for many combinatorial optimisation problems, among them TSP and QAP. ILS iteratively samples local minima by modifying the current local minimum and restarting a local search procedure on this modified solution.
This thesis shows how ILS can be implemented for MSA. After that, ILS is evaluated and compared to other MSA algorithms on BAliBASE (Thompson et al., 1999), a set of manually refined alignments used in most recent publications of algorithms and in at least two MSA algorithm surveys. The runtime behaviour is evaluated using runtime distributions.
The quality of alignments produced by ILS is at least as good as that of the best algorithms available and significantly superior to previously published Metaheuristics for MSA, Tabu Search and a Genetic Algorithm (SAGA). On average, ILS performed best in five out of eight test cases, second for one test set and third for the remaining two. A drawback of all iterative methods for MSA is the long runtime needed to produce good alignments. ILS needs considerably less runtime than Tabu Search and SAGA, but cannot compete with progressive or consistency-based methods, e.g. ClustalW or T-COFFEE.
keywords: Combinatorial Optimisation, Metaheuristics, Multiple Sequence Alignment
Contents

1 Introduction 1
1.1 Motivation . . . 1
1.2 Aim and Objectives . . . 3
2 Background 4
2.1 Multiple Sequence Alignment . . . 4
2.1.1 Definition . . . 6
2.1.2 Alignment Scores . . . 8
2.1.3 Complexity . . . 9
2.2 Metaheuristics and Combinatorial Optimisation . . . 10
2.2.1 Combinatorial Optimisation . . . 10
2.2.2 Heuristic algorithms . . . 11
2.2.3 Metaheuristics . . . 12
2.3 Iterated Local Search . . . 13
2.3.1 GenerateInitialSolution . . . 15
2.3.2 Embedded Heuristic . . . 16
2.3.3 Perturbation . . . 16
2.3.4 Acceptance criterion . . . 17
2.3.5 Termination condition . . . 18
2.4 Related Work . . . 18
2.4.1 Simulated Annealing . . . 20
2.4.2 SAGA . . . 21
2.4.3 Tabu Search . . . 21
3 Algorithm 23
3.1 Construction heuristic . . . 23
3.2 Local Search . . . 24
3.2.1 Single exchange local search . . . 24
3.2.2 Block exchange local search . . . 27
3.2.3 Comparison . . . 29
3.3 Perturbation . . . 30
3.3.1 Random regap perturbation . . . 30
3.3.2 Block move perturbation . . . 30
3.3.3 Pairwise alignment perturbation . . . 31
3.4 Acceptance criterion . . . 33
3.5 Termination condition . . . 33
3.6 Iterated Local Search for Multiple Sequence Alignment . . . 34
3.7 Implementation . . . 36
4 Method and Materials 37
4.1 Run-time distributions . . . 37
4.2 The BAliBASE database of multiple alignments . . . 38
4.2.1 Statistical Evaluation . . . 40
5 Results 41
5.1 ILS components . . . 41
5.1.1 Construction heuristic . . . 41
5.1.2 Local search . . . 41
5.1.3 Perturbation . . . 43
5.1.4 Acceptance Criterion . . . 43
5.2 BAliBASE evaluation of ILS . . . 44
5.3 Runtime behaviour of ILS . . . 50
6 Conclusions and Discussion 54
6.1 Conclusions . . . 54
6.2 Discussion . . . 54
List of Figures

2.1 Example Multiple Sequence Alignment . . . 5
2.2 Structure of the hevein protein . . . 5
2.3 Pictorial view of ILS . . . 14
3.1 Move operation of the single exchange local search . . . 25
3.2 Local search step with block exchange . . . 27
3.3 Random regap perturbation . . . 30
3.4 Perturbation using a random block move . . . 31
3.5 Perturbation using optimal pairwise alignments . . . 31
3.6 Block-alignment perturbation . . . 32
3.7 COFFEE scores for ILS on reference set gal4 ref1 . . . 34
5.1 Boxplot of the average BSPS scores of all 144 alignments in the database . . . 46
5.2 Boxplot of the average CS scores of all 144 alignments in the database . . . 46
5.3 Boxplots for each reference set showing BSPS scores . . . 47
5.4 Boxplots for each reference set showing CS scores . . . 48
5.5 Solution quality trade-off for 1bbt3 ref1 . . . 51
5.6 Solution quality trade-off for 1pamA ref1 . . . 52
5.7 Runlength distribution for 1r69 ref1 . . . 53
5.8 Runlength distribution for 1pamA ref1 . . . 53
List of Tables

2.1 Classification of MSA algorithms . . . 20
5.1 Comparison of construction heuristics . . . 41
5.2 Comparison of COFFEE scores for both local searches . . . 42
5.3 Runtime comparison for both local searches . . . 42
5.4 Results of different perturbations . . . 43
5.5 Runtimes of different perturbations . . . 43
5.6 Percentage improvements in COFFEE score for different acceptance criteria . . . 44
5.7 Runtime for different acceptance criteria . . . 44
5.8 Average SP scores for BAliBASE . . . 49
5.9 Average column scores for BAliBASE . . . 49
5.10 Statistical significance of the differences in BSPS and CS score . . . 50
List of Algorithms

2.1 Iterated Local Search . . . 15
2.2 BETTER acceptance criterion . . . 17
2.3 RESTART acceptance criterion . . . 17
3.1 Single exchange local search . . . 26
3.2 Block exchange local search . . . 28
3.3 Random regap perturbation . . . 30
3.4 Random block move perturbation . . . 31
3.5 Pairwise alignment perturbation . . . 32
3.6 Realign region perturbation . . . 32
1 Introduction

The development of good tools and algorithms to analyse genome and protein data is one of the major subjects of Bioinformatics. This data is typically collected in the form of biological sequences, i.e. strings of symbols representing amino acids or nucleotides. From the comparison of these sequences, information about evolutionary development, conserved regions or motifs and structural commonalities can be acquired. This provides the basis for deducing biological functions of proteins or characterising protein families.

One central technique to compare a set of sequences is Multiple Sequence Alignment (MSA). This work presents a new algorithm for amino acid sequence alignment, where the goal is to arrange a set of sequences representing protein data in a tabular form such that a scoring function is optimised. The problem of Multiple Sequence Alignment is hard to solve; for most scoring functions it has been proven to be NP-hard.
Considering computational and biological problems, Notredame (2002) describes three major problems:

- Which sequences should be aligned?
- How is the alignment scored?
- How should the alignment be optimised w.r.t. the objective function?
This thesis focuses on the third point, summarised as follows by Notredame (2002):

Computing optimal alignments is very time-consuming; it has even been proved that it is NP-hard with most used objective functions, which makes it intractable for exact algorithms when more than a small set of sequences is to be aligned. To handle larger sets, heuristics must be used, which leaves the question of the reliability of these heuristics and their capability to produce near-optimal results.
1.1 Motivation
MSA has been a field of research for a long time and many algorithms have been proposed. Most algorithms proposed for MSA rely on a step-by-step alignment procedure where one sequence is consecutively added until all sequences are aligned. In most cases, the sequences are added in a precomputed order to the alignment. This class of algorithms, with ClustalW (Thompson et al., 1994) being the most prominent member, suffers from the usual problem of their greedy approach, e.g. being unable to alter decisions made in earlier steps. This is especially relevant for the interesting case of sequence similarity below 25%, where a local search can improve the alignment by 8% (see table 5.2 on page 42). This approach is very likely to become stuck in a local minimum and could therefore be improved by using algorithms able to escape a local minimum.
To overcome the problems of local minima, several stochastic schemes have been proposed in the context of combinatorial optimization. These schemes are often described as Metaheuristics, since they describe a general algorithmic framework which is independent of a specific optimization problem. Established Metaheuristics are simulated annealing, tabu search, guided local search, ant colony optimization and genetic/evolutionary algorithms. Metaheuristics are an active research field in AI and are used in many different domains, both in research and industrial applications (see Blum & Roli (2003) for an overview of Metaheuristics in Combinatorial Optimisation).
Even though MSA is of such remarkable importance, only three cases of contemporary Metaheuristics applications to MSA were found in the literature search for this project. Kim et al. (1994) used simulated annealing, Notredame & Higgins (1996) implemented a genetic algorithm and recently, Riaz et al. (2004) studied Tabu Search. The results of Notredame and Riaz are comparable with other state-of-the-art techniques and encourage further investigation of other Metaheuristics.
In this thesis, Iterated Local Search (ILS) (Lourenço et al., 2002) is implemented and evaluated as an algorithm for MSA. Starting from a local minimum generated by running an embedded heuristic procedure on a heuristically computed initial alignment, Iterated Local Search (ILS) iteratively performs the following steps:

1. Perturbation: Modify the current solution to escape the local minimum.
2. Embedded Heuristic: Run the embedded heuristic on the perturbed solution to reach the next local minimum.
3. Acceptance: Select either the solution before or after running the embedded heuristic as starting solution for the next iteration.
Section 2.3 describes ILS as a Metaheuristic in detail, and section 3.6 presents the algorithm implemented in this project for MSA. The idea behind ILS, the search in the reduced space of local minima, can be found in many algorithms in combinatorial optimization under different names, including iterated descent, large-step Markov chains, iterated Lin-Kernighan and chained local optimisation. Despite the long history of the basic idea, ILS is a relatively new Metaheuristic and has not been used for MSA. Lourenço et al. (2001) give a brief introduction to ILS.

ILS has shown very good results on many combinatorial optimization problems, e.g. the traveling salesman problem (Stützle & Hoos, 2002), where it constitutes one of the best algorithms currently available, and the quadratic assignment problem (Stützle, 1999b), one of the hardest problems known. Other applications include scheduling problems of different kinds (den Besten et al., 2001; Stützle, 1998) and graph colouring; these works also give a good overview of ILS.
Compared to already applied Metaheuristics, ILS has several advantages. It is a modular algorithm with very simple basic steps and few parameters which have to be adjusted, as opposed to e.g. genetic algorithms. Another benefit which ILS shares with other Metaheuristics is its independence of the objective function to be optimised.
1.2 Aim and Objectives
The aim of this dissertation project is to implement and evaluate ILS (Lourenço et al., 2002) for MSA. The algorithm is evaluated using a reference set of alignments, provided by the BAliBASE database of multiple sequence alignments (Bahr et al., 2001; Thompson et al., 1999). ILS is compared to established state-of-the-art algorithms for MSA w.r.t. biological validity, by measuring the conformance of the computed alignments with those found in the database, and w.r.t. computational properties, by using runtime distributions (Hoos & Stützle, 1998).
To fulfil the aim, the following objectives have to be considered:

- Choose a good construction heuristic by testing several alternatives.
- Construct a good local search procedure which is able to improve an alignment very fast but also reaches high solution quality. This involves testing different neighbourhood structures and design decisions such as pivoting strategies or neighbourhood pruning techniques.
- Implement a perturbation procedure which fits to the local search procedure used and helps to find high quality solutions.
- Choose an acceptance criterion which balances intensification and diversification to explore interesting regions of the search space efficiently.
- Determine a cut-off to end the ILS.
Chapter 3 presents several alternatives for each of these points. Chapter 5 compares each of the alternatives, taking into account the intended use as a component of the ILS. Considering the results, the complete ILS is presented in algorithm 3.7 on page 35 in chapter 3.

Altogether, three construction heuristics, two local search procedures (with two pivoting strategies), four perturbation strategies and three acceptance criteria are tested.
2 Background

This chapter introduces the basic concepts needed in this thesis and provides an overview of previous related work.
2.1 Multiple Sequence Alignment
MSA is a fundamental tool in molecular biology and bioinformatics. It has a variety of applications, e.g. finding characteristic motifs and conserved regions in protein families, analysing evolutionary relations between proteins and secondary/tertiary structure prediction. The following section shows concrete examples of MSA in widely known problems of molecular biology.

A common way to represent the evolution of a set of proteins is to arrange them in a phylogenetic tree. The first step in computing the tree is to measure all pairwise distances of the given set of sequences and derive a distance matrix from this. The distances can be gathered from an MSA by comparing each pair of aligned sequences in the alignment. From this distance matrix, a phylogenetic tree can be computed, e.g. by UPGMA (Sneath & Sokal, 1973) or Neighbour Joining (Saitou & Nei, 1987).
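As a minimal sketch of this step (the identity-based distance, the '-' gap symbol and the function names are illustrative assumptions, not the exact measure of any cited method), pairwise distances could be derived from an MSA as follows:

def pairwise_distance(a: str, b: str) -> float:
    """Distance between two aligned rows: 1 minus the fraction of identical residues,
    counted only over columns where neither row has a gap."""
    compared = matches = 0
    for x, y in zip(a, b):
        if x == '-' or y == '-':          # skip columns with a gap in either row
            continue
        compared += 1
        matches += (x == y)
    return 1.0 - matches / compared if compared else 1.0

def distance_matrix(alignment: list[str]) -> list[list[float]]:
    """Symmetric matrix of pairwise distances between all rows of the alignment."""
    n = len(alignment)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = pairwise_distance(alignment[i], alignment[j])
    return d

if __name__ == "__main__":
    msa = ["KQQ--KYLSA", "HFN--RY-TS", "ERT-QYPD-A"]   # toy example alignment
    print(distance_matrix(msa))

Such a matrix could then be passed to a tree construction method like UPGMA or Neighbour Joining.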
MSA is a basic tool when grouping proteins into functionally related families. From an MSA, it is easy to construct a profile or a consensus, which is a pseudo-sequence representing the complete alignment. The consensus or the profile in turn can be used to identify conserved regions or motifs. MSA can also be used when deciding if a protein belongs to a family or not, by aligning the query sequence together with known proteins belonging to the family.

MSA can also be used as a starting point for secondary or tertiary structure prediction. When aligned together with related sequences, the conserved motifs can give a hint of secondary or tertiary structure, since the structure is usually conserved during evolution, but the actual sequence is likely to vary in certain parts. Figure 2.1 shows this exemplarily with the Cysteines (printed yellow) aligned together. These amino acids influence the structure of a protein by forming disulfide bridges.
Figures 2.1 and 2.2 show an example MSA of a family of related sequences together with a picture of the three-dimensional tertiary structure of one of the aligned proteins, hevein.

As defined later, all sequences in the MSA are adjusted to the same length by inserting gaps (symbolised by dots in this example). Similar amino acids are aligned together in the same column and regions of high amino acid similarity can be identified. Most important are the black printed columns of Cysteines, since these parts of the proteins influence the tertiary structure by forming bonds inside the proteins. These bonds are called disulfide bridges.
Figure 2.1: Example multiple sequence alignment of a family of related proteins. The alignment shows eight columns of Cysteines, which form disulfide bridges and are an essential part of the proteins' structure. The dots symbolise gaps. Taken from Reinert et al. (2003).
Figure 2.2: Structure of one of the proteins from figure 2.1. The disulfide bridges formed
2.1.1 Definition

This section defines the domain, i.e. the MSA problem, in formal terms, so that it can be used to describe the algorithm precisely. We first formalise the concept of a biological sequence.
Definition 2.1 (Alphabet and Sequence) Let Σ be an alphabet, i.e. a finite set of symbols, and Σ ≠ ∅. A sequence over the alphabet Σ is a finite string of symbols from Σ. The length of a sequence s, denoted as |s|, is the number of symbols in the sequence. For a sequence, s[i] denotes the i-th symbol in this sequence.

In molecular biology, the alphabet Σ usually consists of the set of twenty-three amino acids found in proteins, extended with a special symbol as a placeholder for any amino acid (usually denoted as "X" or "-").
We can now define the multiple alignment of a set of sequences. We first define a variation of multiple alignment, where columns consisting completely of gap symbols are allowed, and restrict this in the final definition of multiple alignment.
Definition 2.2 (Pseudo Multiple Alignment) Let Σ be an alphabet as defined in definition 2.1, and S = {s_1, ..., s_k} a set of k sequences over this alphabet, where k ≥ 2. A pseudo multiple alignment is a set of sequences S' = {s'_1, ..., s'_k} over the alphabet Σ' = Σ ∪ {-} such that:

- All sequences in S' have the same length: ∀ s'_i, s'_j ∈ S': |s'_i| = |s'_j|
- Every s'_i ∈ S' (i ≤ k) can be reduced to the corresponding sequence s_i in S by removing all gap symbols from s'_i.
- The order of symbols in every sequence s'_i in S' is the same as in the corresponding sequence s_i.

The number of sequences in the alignment is |S'|; the number of columns is defined as the length of the sequences in S'. PMSA(S) is the set of all possible pseudo alignments of the sequences S.
The set of all possible alignments PMSA(S) is an infinite set, since the number of columns consisting only of gaps is not restricted, which also implies that there is no bound on the length of the sequences in a pseudo alignment.

A Multiple (Sequence) Alignment is a pseudo alignment with the restriction that no column may consist only of gap characters.
Definition 2.3 (Multiple Sequence Alignment) Let n be the number of sequences in the alignment and k the number of columns. A set S' of sequences is a multiple alignment iff it is a pseudo alignment and it fulfils the following condition:

∀ i ≤ k ∃ j ≤ n : s'_j[i] ≠ -

i.e. every column contains at least one symbol that is not a gap. MSA(S) denotes the set of all multiple sequence alignments of the sequences S.

Despite the infinite set of pseudo alignments, the number of Multiple Sequence Alignments is restricted by the lengths and the number of sequences and is hence finite, as is the length of the alignment.
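The conditions of definitions 2.2 and 2.3 can be checked directly on a set of strings; the following sketch assumes sequences are plain Python strings and '-' is the gap symbol (the function names are illustrative only):

def is_pseudo_alignment(seqs: list[str], aligned: list[str]) -> bool:
    """All aligned rows have equal length and reduce to the original sequences
    when gaps are removed (which also preserves the symbol order)."""
    if len(aligned) < 2 or len(aligned) != len(seqs):
        return False
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        return False                                   # condition 1: equal lengths
    return all(row.replace('-', '') == s               # conditions 2 and 3
               for row, s in zip(aligned, seqs))

def is_multiple_alignment(seqs: list[str], aligned: list[str]) -> bool:
    """A pseudo alignment in which no column consists only of gaps (definition 2.3)."""
    return (is_pseudo_alignment(seqs, aligned)
            and all(any(row[i] != '-' for row in aligned)
                    for i in range(len(aligned[0]))))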
When only two sequences are aligned, the MSA is called a pairwise alignment. Pairwise alignments are not discussed in this thesis, because they can be computed efficiently by dynamic programming (Needleman & Wunsch, 1970).

In order to define an optimal multiple alignment, a quality measure must be defined, which assigns a value to an alignment. The measure is called a scoring or objective function (OF), because it is the mathematical goal to be optimised. The term OF will be preferred in this thesis because the MSA problem is tackled as a Combinatorial Optimisation (CO) problem.
Definition 2.4 (Scoring Function, Objective Function) Let Σ' be defined as in definition 2.2. For a number of sequences n, a function

f : (Σ'*)ⁿ → ℝ

is called an objective function (OF). Σ'* is defined as the set of all words obtained by concatenation of symbols from the alphabet Σ'.
Typical OFs are the (weighted) Sum of Pairs (SP) (Altschul, 1989), which scores the MSA by scoring all pairs of amino acids induced by the columns of the alignment, and the Consistency based Objective Function For alignmEnt Evaluation (COFFEE) (Notredame et al., 1998), which compares the alignment to a library of reference alignments. OFs are discussed in section 2.1.2 on the following page in more detail.

Finally, an optimal alignment can now be defined. The term optimal can mean either a minimum or maximum value, depending on the objective function used. As we use a scoring function (as opposed to a cost function) in our alignment algorithm, where better alignments receive a higher score, we define optimality by maximisation. 1
Definition 2.5 (Optimal Multiple Sequence Alignment) For a set of sequences S = {s_1, ..., s_k} over an alphabet Σ and an OF f, S' = {s'_1, ..., s'_k} is an optimal multiple sequence alignment iff

- S' ∈ MSA(S)
- ∀ S* ∈ MSA(S): S* ≠ S' → f(S*) ≤ f(S')

The second point states that f(S') is a maximum value of the function f on the set of all alignments over S. This definition includes the possibility of more than one optimal MSA.

1 This doesn't cause any loss of generality, since a minimisation problem can easily be transformed into a maximisation problem.
2.1.2 Alignment Scores

To estimate the quality of an alignment, each alignment gets a value based on an OF, where a higher value corresponds to improved quality. The OF measures the quality of an alignment and should reflect how close this alignment is to the optimal biological alignment.
The Sum-of-Pairs (SP) score (Altschul, 1989) is the most widely used scoring function, a direct extension of the scoring methods used in pairwise alignment. For each pair of sequences, the similarity between the aligned pairs of amino acids is computed, in most cases by using a scoring matrix. The pairwise scores are then added for each aligned pair of residues in every pair of sequences.

SP scores are used in many variations in MSA algorithms, e.g. using weights for the pairs of sequences to prevent sets of similar sequences from dominating the whole alignment, or different methods of assigning penalties to gaps in an alignment.
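A minimal sketch of an unweighted SP score is shown below; the dictionary-based substitution matrix, the constant penalty per residue-gap pair and the decision to ignore gap-gap pairs are simplifying assumptions for illustration, not the exact variant used by any particular tool.

from itertools import combinations

def sp_score(alignment: list[str],
             subst: dict[tuple[str, str], int],
             gap_penalty: int = -4) -> int:
    """Sum, over every pair of rows and every column, of the substitution score
    of the two residues, or a gap penalty if exactly one of them is a gap."""
    score = 0
    for a, b in combinations(alignment, 2):   # every pair of sequences
        for x, y in zip(a, b):
            if x == '-' and y == '-':
                continue                      # gap-gap pairs are not scored
            elif x == '-' or y == '-':
                score += gap_penalty
            else:
                score += subst.get((x, y), subst.get((y, x), 0))
    return score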
The dependence on a scoring matrix is a major drawback of SP scores, since these scoring matrices are very general and created by statistical analysis of large sets of alignments. The degree to which they match the properties of the sequences one is to align varies. Morgenstern et al. (1998) argue that scoring matrices fail to produce biologically reasonable alignments when the sequences have a low global similarity and share only local similarities. Notredame (2002) identifies the scoring of gaps when using the SP score as a major source of concern.
COFFEE is a consistency based approach to scoring an MSA. Instead of measuring the similarity of each pair of amino acids induced by an alignment, COFFEE (Notredame et al., 1998) measures the correspondence between each induced pairwise alignment and a library of computed pairwise alignments. Each pair in the library is assigned a weight that is a function of its quality. The weight is equal to the percentage identity of the aligned sequences. The score of an alignment is computed from the scores of all pairs of induced alignments, i.e. all pairs of aligned sequences in this multiple alignment. For each pair of residues aligned in both the library and the induced alignment, the weight is added to the score. To normalise the score, this score is divided by the maximal value, which is reached when all pairs in the multiple alignment are found in the library alignments. 2
Formalised, the COFFEE score for an alignment A can be formulated as:

Definition 2.6 (COFFEE score for a MSA)

COFFEE(A) = ( Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} × SCORE(A_{i,j}) ) / ( Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} W_{i,j} × LEN(A_{i,j}) )

with A_{i,j} being the pairwise alignment of sequences s_i and s_j induced by A, W_{i,j} the weight associated with that alignment, LEN(A_{i,j}) the length of this alignment and SCORE(A_{i,j}) being the number of aligned pairs of residues found both in A_{i,j} and the library alignment.

2 For the optimal score to be 1, the alignments in the library must be totally consistent, such that each pair of residues aligned in the library can also be realised simultaneously in one multiple alignment.
Using a library of alignments computed from the set of sequences under analysis prevents the scoring scheme from being too general and takes the properties of the sequences into account. Notredame et al. (1998) used a library of pairwise alignments created by ClustalW, but Notredame et al. (2000) suggest a library containing both global alignments from ClustalW and local alignments, and showed that this actually leads to an increase in the quality of the computed alignments.

When considering COFFEE as the OF for a local search, it is a great benefit that, when performing one step, one can restrict the computational effort to those parts which have been changed by the procedure, e.g. to those two columns where a gap and a residue have been exchanged.
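The following is a sketch of definition 2.6. The representation of the library (a mapping from a sequence pair (i, j) to a weight and a set of residue-index pairs aligned in the library's pairwise alignment) is an assumption made for this illustration, and the score is recomputed from scratch rather than restricted to changed columns as described above.

def induced_pairs(row_i: str, row_j: str):
    """Residue-index pairs aligned between two rows of the MSA, plus the length
    of the induced pairwise alignment (columns that are not gap-gap)."""
    pi = pj = 0
    pairs, length = set(), 0
    for x, y in zip(row_i, row_j):
        if x != '-' and y != '-':
            pairs.add((pi, pj))
        if x != '-' or y != '-':
            length += 1
        pi += x != '-'
        pj += y != '-'
    return pairs, length

def coffee_score(alignment: list[str], weights, library) -> float:
    """weights[(i, j)] -> W_ij, library[(i, j)] -> set of aligned residue-index pairs."""
    num = den = 0.0
    n = len(alignment)
    for i in range(n - 1):
        for j in range(i + 1, n):
            pairs, length = induced_pairs(alignment[i], alignment[j])
            w = weights[(i, j)]
            num += w * len(pairs & library[(i, j)])   # SCORE(A_ij)
            den += w * length                          # LEN(A_ij)
    return num / den if den else 0.0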
One of the problems of both COFFEE and SP scoring is their assumption of uniform and time-invariant substitution probabilities for all positions in an alignment. The substitution probability depends on structural and functional properties of the proteins and can vary from zero for some regions to complete variability in others.

norMD (Thompson et al., 2001) tries to reflect this by using a column-based approach, where the probabilities of substitutions, and thus the scores for residue pairs, are column-dependent.

The norMD score is computed by measuring mean distances in a continuous space imposed by each sequence in the alignment. norMD incorporates affine gap costs and a similarity measure for the sequences which is based on a hash score derived from dot plots of each pair of sequences. Finally, the score is normalised to a value between zero and one.

The strength of this method is its independence from the number and length of the sequences in the alignment, and the assignment of column-based scores which can be used to find well or badly aligned regions inside an alignment.

A major drawback which prohibits the usage of norMD as a scoring function for the ILS is its complexity. Due to the handling of gap costs and the computation of the hash scores during scoring, the whole alignment has to be processed completely, which leads to very large run-times. Taking this into account, COFFEE is preferred as the OF for this project.
2.1.3 Complexity
The alignment of only two sequences is a common special case of MSA and efficient algorithms have been proposed, e.g. by Needleman & Wunsch (1970) or Gotoh (1982). These algorithms use dynamic programming to efficiently compute all possible alignments; their asymptotic complexity is O(l²), where l is the length of the longest sequence, needing also O(l²) space (the space complexity can be reduced to linear space consumption). The problem when aligning more than two sequences is the exponential growth of the complexity with the number of sequences: the complexity for dynamic programming with n sequences of maximum length l is O(lⁿ). Considering this, only small sets of sequences can be aligned with dynamic programming.
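As an illustration of the quadratic pairwise case, a minimal Needleman-Wunsch style sketch is given below; it computes only the optimal score with a match/mismatch scheme and a linear gap penalty (the parameter values are illustrative, not those of any cited tool), and uses the full O(l²) table, which shows the time and space behaviour discussed above.

def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global pairwise alignment score by dynamic programming (score only)."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap                 # aligning a prefix of a against gaps
    for j in range(1, cols):
        dp[0][j] = j * gap                 # aligning a prefix of b against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diag = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diag, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]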
Computing alignments for more than two sequences has been shown to be NP-hard by Wang & Jiang (1996) for metric scoring matrices and by Just (1999) for a more general class of scoring matrices, which gives strong reasons to believe that there will be no efficient algorithm to solve this problem exactly (unless P = NP).

Even if Just's proof cannot be directly transferred to MSA with COFFEE as OF, it gives strong reason to believe that computing optimal COFFEE alignments is computationally hard and needs advanced techniques.
2.2 Metaheuristics and Combinatorial Optimisation
This section explains and defines two key terms of this thesis, namely Metaheuristic and Combinatorial Optimisation (CO).
2.2.1 Combinatorial Optimisation
The problem of optimisation can generally be formulated as optimising a function f(X), where X is a set of variables. 3 Usually the possible set of values is restricted by constraints, expressed by inequalities such as g(X) ≤ c, c ∈ ℝ, where g(X) is a general function. Optimisation problems can be divided into two categories: those where the variables are from a continuous domain, and those with a discrete domain; the latter case is called combinatorial. According to Papadimitriou & Steiglitz (1982), CO is the search for an object from a finite (or countably infinite) set, typically an integer, set, permutation or graph, whereas in the continuous case, the solutions are usually a set of real numbers or even a function. Both cases are tackled with very different methods, and we focus only on CO problems.
Blum & Roli (2003) define a CO problem as follows:

Definition 2.7 (Combinatorial Optimisation problem) A CO problem P = (S, f) is defined by:

- a set of variables X = {x_1, ..., x_n}
- variable domains D_1, ..., D_n
- constraints among variables
- an objective function f to be optimised, where f : D_1 × ... × D_n → ℝ

The set of all possible feasible assignments is

S = { s = {(x_1, v_1), ..., (x_n, v_n)} | v_i ∈ D_i, s satisfies all constraints }

S is usually called the search space or solution space, and the elements of S are feasible solutions (also called candidate solutions).

3 In this context, "optimising" must be seen in a very broad view as the search for a best assignment of values to the variables.
To conform with our definition of optimal MSA (see definition 2.5), we again focus solely on maximisation problems. To solve an optimisation problem means to find the solution s* ∈ S with maximum value, that is f(s*) ≥ f(s) ∀ s ∈ S. The solution s* is called a globally optimal solution. If there is more than one globally optimal solution, the set S* ⊆ S will be called the set of globally optimal solutions.
2.2.2 Heuristic algorithms
When tackling CO problems, one generally distinguishes between complete and approximate algorithms. Complete algorithms are guaranteed to find an optimal solution for every instance of a CO problem in finite time. Among the most widely used are Branch & Bound algorithms, which are also called A* in AI. To guarantee optimal solutions is often not possible, since many CO problems are NP-hard, which implies that no polynomial time algorithms exist, if we assume P ≠ NP. In contrast to complete algorithms, the guarantee of finding optimal solutions is relaxed to finding good solutions close to the optimal ones.
Approximate algorithms are often called heuristic algorithms, especially in Artificial Intelligence and Operations Research. The word heuristic has Greek roots, with "to find" or "to discover" as its original meaning. Reeves (1993) defines a heuristic as:

Definition 2.8 (Heuristic) A heuristic is a technique seeking good (i.e. near-optimal) solutions at a reasonable computational cost without being able to guarantee either feasibility or optimality, or even in many cases to state how close to optimality a particular feasible solution is.

Heuristics usually incorporate problem specific knowledge beyond the knowledge found in the problem definition to reach a good trade-off between solution quality and computational complexity. The following presentation of heuristic methods follows those by Blum & Roli (2003) and Stützle (1999a).
When considering heuristic methods, Stützle (1999a) distinguishes between constructive and local search methods. Constructive methods generate solutions from an initially empty solution by adding components until a complete solution is reached; the most prominent of these is the greedy approach, which always adds the component which gives the best improvement in terms of the OF. Constructive methods are usually among the fastest approximation algorithms found, but the computed solutions are of inferior quality compared to local search methods. Many MSA algorithms are constructive algorithms, among them ClustalW, PileUp and MultAlign.

Local search methods instead perform an iterative search. They start from an initial solution and iteratively move to a new solution which can be computed from the current solution by applying a modification operator to it. The set of all solutions computable from the current solution is called the neighbourhood of the current solution. The neighbourhood structure is often visualised as a graph on top of the set of all solutions, where an edge between two solutions indicates that these two solutions are neighboured, i.e. that one solution can be computed from the other by using the modification operator on the other one. The neighbourhood is defined according to Blum & Roli (2003); Stützle (1999a) as:
Definition 2.9 (Neighbourhood) A neighbourhood structure is a function N : S → 2^S that assigns to every s ∈ S a set of neighbours N(s) ⊆ S. N(s) is called the neighbourhood of s.

The iterative process goes on until no better solution in the neighbourhood of the current solution can be found. This final solution of the local search is called a locally maximal solution:

Definition 2.10 (Local Optimum) A locally maximal solution (or local maximum) w.r.t. a neighbourhood structure N is a solution s such that ∀ s' ∈ N(s): f(s') ≤ f(s). s is called a strict local maximum iff ∀ s' ∈ N(s): f(s') < f(s).

A neighbourhood where every local optimum is also a global optimum is called an exact neighbourhood. Exact neighbourhoods often have exponential size, so that searching the neighbourhood requires exponential time in the worst case, which makes them unusable in practise.
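The iterative improvement scheme described above can be sketched as follows; `neighbours(s)` enumerating N(s) and the objective function `f` are placeholders (assumptions for this sketch), and the best-improvement pivoting rule shown here is only one of the alternatives considered later in the thesis.

def local_search(s, f, neighbours):
    """Strict iterative improvement for a maximisation problem: move to the best
    neighbour as long as it improves the objective, then return the local maximum."""
    while True:
        best = max(neighbours(s), key=f, default=None)   # best neighbour of s
        if best is None or f(best) <= f(s):
            return s                                      # s is locally maximal
        s = best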
2.2.3 Metaheuristics
One of the main features of heuristics as described in the last section is that they incorporate problem specific knowledge. This makes them specific to the problem and prohibits the transfer of one successful heuristic from one problem to another.

In recent years, much effort was spent on the development of general heuristic methods which are applicable to a wide range of CO problems. These general-purpose methods are now called metaheuristics (Glover, 1986) (some earlier publications use the term "modern heuristics", e.g. Reeves (1993)).

The Metaheuristics Network defines metaheuristic as:
Definition 2.11 (Metaheuristic) A metaheuristic is a set of concepts that can be used to define heuristic methods that can be applied to a wide set of different problems. In other words, a metaheuristic can be seen as a general algorithmic framework which can be applied to different optimization problems with relatively few modifications to make them adapted to a specific problem. 4

4 Metaheuristics Network website: http://www.metaheuristics.net, last accessed Sun, 25th April 2004.

Blum & Roli (2003) summarise the main properties which characterise a metaheuristic as:
- Metaheuristics are strategies that "guide" the search process.
- The goal is to efficiently explore the search space in order to find (near-)optimal solutions.
- Techniques which constitute metaheuristic algorithms range from simple local search procedures to complex learning processes.
- Metaheuristic algorithms are approximate and usually non-deterministic.
- They may incorporate mechanisms to avoid getting trapped in confined areas of the search space.
- The basic concepts of metaheuristics permit an abstract level description.
- Metaheuristics are not problem specific.
- Metaheuristics may make use of domain-specific knowledge in the form of heuristics that are controlled by the upper level strategy.
- Today's more advanced metaheuristics use search experience (embodied in some form of memory) to guide the search.
Many metaheuristics have been proposed during the last years and are well known today, e.g. Simulated Annealing (SA) or Genetic Algorithms (GA). Blum & Roli (2003) give an overview of the most successful metaheuristics and show several ways of categorising them, depending on which aspect of the algorithm is in focus. Metaheuristics differ in the number of feasible solutions examined in each step (population-based vs. single point search), in their use of the objective function (dynamic vs. static objective function), in the number of neighbourhood structures used during the search (one vs. various neighbourhoods), the usage of memory (memory usage vs. memory-less methods) and the domain the underlying idea originates from (nature-inspired vs. non-nature inspired).
Common metaheuristics are (in alphabetical order): ant algorithms, evolutionary algorithms, Greedy Randomised Adaptive Search Procedure (GRASP), Guided Local Search, Iterated Local Search (ILS), Simulated Annealing (SA), Tabu Search (TS) and Variable Neighbourhood Search (VNS). Section 2.4 on page 18 shows examples of metaheuristics for MSA.
2.3 Iterated Local Search
A major problem with local search algorithms is the probability of getting trapped in a local optimum. One possibility to prevent this is to let the local search accept worsening steps; another is to modify the current local optimum and restart the local search from the modified solution. The modification should be done in such a way that it leads to a solution distant enough that the local search cannot easily undo the changes and return to the local optimum.
Figure 2.3: Pictorial view of ILS. After being trapped in the local optimum s, the current solution is modified, which leads to a new point in the search space, marked with s'. The next run of the local search leads to the new local optimum s'', which in turn can be modified again.
ILS (Lourenço et al., 2002; Stützle, 1999a, p. 25) is a systematic application of this idea, where an embedded heuristic search is applied repeatedly to a modified version of the solution computed before. This results in a random walk over solutions that are locally optimal w.r.t. the embedded heuristic. ILS is a very general framework which includes other metaheuristics such as iterated Lin-Kernighan, variable neighbourhood search or large-step Markov chains.

ILS is expected to perform much better than simply restarting the embedded heuristic from a randomly computed position in the search space (Lourenço et al., 2002). Lourenço et al. (2002) point out that random restarts of local search algorithms have a distribution that becomes arbitrarily peaked about the mean when the instance size goes to infinity. Most local optima found have a value close to the mean local optimum, and only few optima are found with significantly better values. The probability to find a local optimum with a value close to the mean local optimum found is very large, whereas the probability to find a significantly better local optimum converges to zero with increasing instance size. They conclude that it is very unlikely to find a solution even slightly better than the mean local optimum for large problem instances and that a biased sampling of the space of local optima could lead to better solutions.
ILS uses the idea of a random walk in the space of local optima. From the current solution s, a new starting point s' for the local search is computed by modifying s using a perturbation procedure (s' is usually not a local optimum, but it must be a feasible solution). The embedded heuristic is then run on s', which leads to a new local optimum s''. If s'' passes an acceptance test, e.g. yields an improvement compared to the starting point s, it becomes the next element of the walk, otherwise s'' is discarded and the walk restarts with s. Figure 2.3 shows one step of the random walk. The final solution of the algorithm is the best solution found during this walk.
Algorithm 2.1 An algorithmic presentation of ILS.

procedure IteratedLocalSearch
    s_0 <- GenerateInitialSolution
    s <- LocalSearch(s_0)
    s_best <- s
    while termination condition not met do
        s' <- Perturbation(s, history)
        s'' <- EmbeddedHeuristic(s')
        if f(s'') > f(s_best) then
            s_best <- s''
        end if
        s <- AcceptanceCriterion(s, s'', history)
    end while
    return s_best
end procedure
This process is formalised in algorithm 2.1. The algorithm consists of the components GenerateInitialSolution, LocalSearch, Perturbation, AcceptanceCriterion and the termination condition. In order to reach high performance and solution quality, the Perturbation and AcceptanceCriterion can make use of the history of the search, e.g. by emphasising diversification or intensification based on the period of time passed since the last improvement was found. The acceptance criterion acts as a counterbalance to the perturbation procedure by filtering the new local optima. The following paragraphs describe each component of the algorithm.
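As an illustration, algorithm 2.1 can be transcribed almost literally into Python. The component functions are placeholders for the alternatives discussed in the following subsections, and keeping the history as a simple dictionary is an implementation choice made for this sketch only.

def iterated_local_search(generate_initial, local_search, perturbation,
                          acceptance, f, terminate):
    """Sketch of algorithm 2.1: repeated perturbation, embedded local search
    and acceptance, returning the best solution found."""
    history = {"iteration": 0, "last_improvement": 0}
    s = local_search(generate_initial())
    s_best = s
    while not terminate(history):
        history["iteration"] += 1
        s_perturbed = perturbation(s, history)
        s_new = local_search(s_perturbed)        # embedded heuristic
        if f(s_new) > f(s_best):
            s_best = s_new
            history["last_improvement"] = history["iteration"]
        s = acceptance(s, s_new, history)
    return s_best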
2.3.1 GenerateInitialSolution
GenerateInitialSolution is a construction heuristic which computes the starting point for the first run of the embedded heuristic. The construction heuristic used should be as fast as possible but produce good starting points for the embedded heuristic; standard choices are either random solutions or greedy heuristics. In most cases, greedy heuristics are preferred since the embedded heuristic reaches a local optimum in fewer steps than when starting from random solutions. It is worth noting that the best greedy heuristics may not provide the best starting point for a local search. Lourenço et al. (2002) give an example where using one of the best construction heuristics known for the Travelling Salesman Problem (TSP) leads to worse results than using a random starting solution as input for
a local search procedure. Another point to consider is the time a greedy heuristic needs: if computing the greedy solution consumes a large part of the available run-time, the quality of the starting point gets less important.
2.3.2 Embedded Heuristic

In most cases, the embedded heuristic is an iterative improvement local search procedure, but any other heuristic can be used, since the embedded heuristic can mostly be treated as a black box. Lourenço & Zwijnenburg (1996) used a TS in their application to the job-shop scheduling problem.
However, there are points to consider when optimising the performance of an ILS. The local search should be selected with respect to the perturbation used: it should not be able to easily undo the changes of the perturbation.

A trade-off between speed and solution quality of the embedded heuristic must be found. Usually a heuristic with better results is preferable, but if the heuristic needs too much time, the ILS is unable to examine enough local optima to find high-quality solutions.
Finally, one has to consider the interrelation between the embedded heuristic and the perturbation when optimising an ILS. In many cases, the embedded heuristic uses some tricks to prune the search space and speed up the search process, e.g. by using don't-look bits (Bentley, 1992) in TSP algorithms (Stützle, 1999a, p. 78). The highest performance is reached when these tricks are implemented w.r.t. the perturbation procedure. In the example of don't-look bits for TSP, the don't-look bits are saved for the next run of the local search and only reset for the cities where the perturbation induced a change. This gains a large advantage in performance of the ILS.
2.3.3 Perturbation
The perturbation procedure modifies a local optimum to become a new starting point for the next heuristic search. Perturbations are often non-deterministic to avoid cycling. A general aspect in choosing perturbations is that they must not be undone easily by the embedded heuristic, since the search would otherwise fall back to the last visited local optimum. To find a good perturbation, properties of the problem instance can be considered as well as problem specific knowledge.
Most perturbations are rather simple; e.g. in TSP algorithms, small "double-bridge moves" are frequently used, which remove four edges and add four new edges. While in the case of TSP a simple perturbation procedure leads to good results, other problems might need more complex procedures, e.g. optimising a part of the current instance or solving a related instance with relaxed constraints.
The most important property of the perturbation is its strength. Perturbation strength can be informally defined as the number of components in a solution that get changed by a perturbation. Since the number of changed components provides a measure for the distance between two solutions, the perturbation strength can be seen as the distance of the perturbed solution to the original solution. The strength of the perturbation should depend on the problem and on the particular instance.

The strength of a perturbation can be fixed, where the distance between s and s' is constant, or adaptive, where its strength changes over time, e.g. gets larger when no improvements were found, to emphasise diversification. To adapt the perturbation strength, the search history can be used.
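As a hypothetical illustration of a perturbation on an alignment (not the exact perturbations evaluated in chapter 3), a randomly chosen block of consecutive gaps in one row could be moved to a new random position; the perturbation strength here corresponds to the size of the moved block.

import random
import re

def move_gap_block(alignment: list[str], rng: random.Random) -> list[str]:
    """Move one randomly chosen block of consecutive gaps in one row to a new
    position, preserving the row length and the order of the residues."""
    rows = list(alignment)
    i = rng.randrange(len(rows))
    blocks = [m.span() for m in re.finditer(r"-+", rows[i])]
    if not blocks:
        return rows                              # nothing to perturb in this row
    start, end = rng.choice(blocks)
    residues = rows[i][:start] + rows[i][end:]   # row with the gap block removed
    pos = rng.randrange(len(residues) + 1)       # new insertion position
    rows[i] = residues[:pos] + "-" * (end - start) + residues[pos:]
    return rows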
2.3.4 Acceptance criterion
The acceptance criterion determines whether a solution from the embedded heuristic is used as a starting point for the next iteration of ILS. The acceptance criterion provides a balance between intensification and diversification of the search. The two simplest cases are a criterion which accepts only improvements, and thus emphasises intensification, and the opposite criterion, which accepts every new solution and provides a strong diversification. The first criterion is defined as BETTER (algorithm 2.2), the second one as ALWAYS.
Algorithm 2.2 BETTER acceptance criterion

if f(s'') > f(s) then
    s <- s''
end if
In between these two extremes, there are several possibilities for acceptance criteria. It is possible to prefer improvements, but accept non-improvement steps sometimes, e.g. by using a criterion similar to the one used in SA. Other choices could make use of the search history, similar to some implementations of TS. A limited use of the history is to restart the ILS algorithm from a new initial solution when it seems that there will be no improvement found in the current region of the search space. A simple way would be to restart if the search is not able to traverse to a better local optimum for a certain number of steps. The RESTART criterion implements this approach:

Let i be the current number of iterations of the ILS and i_last the iteration where the last improving step was made. The RESTART acceptance criterion is defined in algorithm 2.3.
Algorithm 2.3 RESTART acceptance criterion

if f(s'') > f(s) then
    s <- s''
else if f(s'') <= f(s) and i - i_last > i_T then
    s <- GenerateInitialSolution
end if
It conforms to the BETTER criterion by accepting a local optimum as starting point for the next iteration only if it is better than the previous starting point. Furthermore, if there has been no improvement for a number i_T of iterations, the search is restarted with a newly generated solution. The new starting point can be generated in different ways, e.g. randomly or by a construction heuristic; restarts from a strongly perturbed copy of the best solution found so far instead of a completely new starting point are also possible.
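The two criteria from algorithms 2.2 and 2.3 could be sketched as follows, assuming the same history dictionary and the threshold i_T as in the ILS skeleton shown earlier; these are illustrative transcriptions, not the project's implementation.

def better(s, s_new, f):
    """BETTER: accept only improving local optima (algorithm 2.2)."""
    return s_new if f(s_new) > f(s) else s

def restart(s, s_new, f, history, generate_initial, i_T):
    """RESTART: like BETTER, but restart from a fresh solution after i_T
    iterations without improvement (algorithm 2.3)."""
    if f(s_new) > f(s):
        return s_new
    if history["iteration"] - history["last_improvement"] > i_T:
        return generate_initial()
    return s

To plug either criterion into the ILS skeleton above, the extra arguments (f, generate_initial, i_T) would be bound beforehand, e.g. with a small lambda wrapper.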
2.3.5 Termination condition
The termination condition ends the algorithm. There are no restrictions on how to choose the condition. Common choices include a fixed number of iterations, a number of iterations depending on some properties of the problem instance, a fixed amount of consumed run-time or (if possible) an estimation of the solution quality.
2.4 Related Work
This section presents previously published algorithms and summarises the most important results found during the literature study done in the initial phase of this project. We start with a classification of MSA algorithms, briefly describing the most prominent algorithms of each class. Applications of Metaheuristics that were found are described in detail later in this chapter. Notredame (2002) reviews most available algorithms and classifies them into four classes:
Progressive algorithms add sequences in a step-by-step manner according to a precomputed order. They use an extension of the dynamic programming algorithm used in pairwise alignment (Needleman & Wunsch, 1970) to add a sequence to an alignment. This class includes the most widely used algorithms: ClustalW, which can be seen as a standard method for multiple alignments, PileUp and MultAlign.

All progressive algorithms start by choosing an order in which the sequences will be added to the alignment, mostly implemented by computing an approximative phylogenetic tree from the similarities of the sequences. The sequences are then added according to this order. This process is similar to the construction heuristic used in ILS. The drawbacks of progressive algorithms are the inability to correct gaps placed in early steps of the alignment, the domination of a large set of similar sequences and the imposed order on the sequences.
The most prominent algorithm for MSA, ClustalW (Thompson et al., 1994), aligns the sequences according to a guidance tree computed by neighbour joining. It uses many heuristic adjustments to compensate for the drawbacks of pure progressive strategies, among them automatic choice of substitution matrices for different degrees of sequence identity, local gap penalties and gap penalty adjustment.
Exact algorithms rely on a generalisation of the algorithm for pairwise alignment to more than two sequences. The original pairwise alignment algorithm (Needleman & Wunsch, 1970) uses dynamic programming to find an optimal path in a two-dimensional lattice which corresponds to the optimal pairwise alignment. This scheme can be extended to an n-dimensional lattice, but since the time needed to compute the optimal alignment is O(lⁿ) for n sequences of maximal length l, this approach is limited to only small sets of short sequences.
This approach is implemented in the MSA program (Lipman et al., 1989), which in turn is a heuristic implementation of an algorithm proposed by Carrillo & Lipman (1988). DCA (Stoye et al., 1997) enhances the number of sequences which can be aligned by MSA by using a divide-and-conquer approach to cut the sequences into smaller fractions, which in turn can be aligned with MSA and later reassembled. Still, the maximum number of sequences which can be handled remains relatively small.
Iterative algorithms are based on the idea that a solution can be computed by starting from a sub-optimal solution which is then modified until it reaches the optimum. Iterative algorithms can be further classified into stochastic and deterministic algorithms. The first subclass includes Hidden Markov Models, GA, SA, TS and the ILS presented in this thesis. Most algorithms in the second subclass are based on dynamic programming. SAGA (a genetic algorithm), SA and TS are described in detail at the end of this chapter.

One of the best MSA algorithms available, Prrp, uses a deterministic iterative strategy. Prrp uses a doubly nested loop strategy to optimise weights assigned to the pairs of sequences and the weighted SP score simultaneously. The inner loop re-aligns subgroups inside the alignment until no further improvement can be made. The outer loop in turn computes approximative weights for all pairs of sequences based on the current alignment. The iteration ends when the weights have converged.
Consistency based algorithms try to produce an MSA that corresponds best with all pairwise alignments. By using the pairwise alignments, consistency based algorithms are independent of a specific scoring matrix and make use of position dependent scores. This bypasses many of the problems of progressive algorithms and resembles the enhancements of ClustalW as an inherent feature of the algorithm.

Since this class includes all algorithms which make use of a consistency based objective function, it includes SAGA, T-COFFEE, DiAlign and the ILS.

T-COFFEE is a progressive algorithm using the COFFEE function instead of a substitution matrix when computing the alignment. It uses an extended library of global and local alignments, which are merged into a scoring matrix by a heuristic method called library extension. The library extension process assigns a weight to each pair of residues present in the library alignments. The weight is larger for pairs in an alignment which are consistent with the other library alignments.
DiAlign is based on gap-free segments of residue pairs in the pairwise alignments induced by the sequences inside an MSA. It starts with computing all pairwise alignments and identifying all diagonals (diagonals represent matches of residues and hence, a diagonal is a gap-free segment). The diagonals are then sorted according to a probabilistic weight and their overlap with other diagonals. The latter property should emphasise motifs occurring in more than two sequences. The MSA is formed from this list by adding the diagonals one after another to the growing alignment, as long as they are consistent with the diagonals already included.
Name          Classification                   Reference
MSA           Exact                            Lipman et al. (1989)
DCA           Exact                            Stoye et al. (1997)
ClustalW      Progressive                      Thompson et al. (1994)
Multalin      Progressive                      Corpet (1988)
DiAlign       Consistency-based                Morgenstern et al. (1998)
DiAlign2      Consistency-based                Morgenstern (1999)
T-COFFEE      Consistency-based/Progressive    Notredame et al. (2000)
Prrp          Iterative                        Gotoh (1996)
SAGA          Iterative                        Notredame & Higgins (1996)
Tabu Search   Iterative                        Riaz et al. (2004)

Table 2.1: MSA algorithms and their classification, taken from Notredame (2002). Please refer to the text for a short description of the algorithms and the classifications.
Table 2.1 shows the most popular algorithms together with their classification and references. The remainder of this chapter briefly describes the already implemented Metaheuristics in more detail. More detailed information can be found in the references.
2.4.1 Simulated Annealing
The first modern Metaheuristic applied to MSA is Simulated Annealing (Kim et al., 1994), which was used to optimise the SP objective function with affine and natural gap costs. The neighbourhood structure used is built up from a move operation which swaps a block of consecutive gaps with an adjacent block of symbols in one of the sequences. All parameters for the swap operation, which are size and position of the gap block, whether symbols to the left or right are swapped and how many symbols are swapped, are determined randomly. No gaps are inserted or deleted during the search.
SA tries to explore this neighbourhood structure by traversing iteratively from one solution to a better scoring solution, but also accepting steps to solutions with lower scores with a certain probability. This gives SA the possibility to escape local minima.

SA is one of the few Metaheuristics with mathematically proved convergence towards an optimal solution. In practise, SA is too slow to be used as a complete alignment algorithm, but can be used as an alignment improver to correct and improve alignments.
2.4.2 SAGA

SAGA (Sequence Alignment by Genetic Algorithm) is a GA for MSA proposed by Notredame & Higgins (1996). It was used to optimise alignments for the weighted SP score with different gap penalties, and later extended to use the COFFEE objective function as well (Notredame et al., 1998).
SAGA starts with a population of random alignments generated by adding a random number of gaps to the start and end of each sequence, ensuring that all sequences have the same length afterwards.
To compute the next generation in one iteration, 50% of the new population is filled with the 50% fittest alignments (the fitness value corresponds to the score of the alignment) from the previous generation. The remaining 50% are created by modifying randomly chosen alignments from the previous generation, where the probability for being chosen depends on the fitness value. SAGA uses 22 different operators to modify an alignment: two crossover operators which generate a new alignment from two parents, two operators which insert blocks of gaps into the alignment, a family of sixteen operators shifting blocks of gaps inside the alignment to new positions, an operator which tries to find small profiles by finding the best small gap-free substring of a sequence in all sequences, and an operator optimising a pattern of gaps in a small region of the alignment by either exhaustive search or a GA, depending on the size of the region and the number of gaps to be placed.

Which operator is applied is chosen stochastically, with a probability depending on the quality of the offspring an operator has generated in the last ten iterations.
SAGA is among the best algorithms for MSA regarding alignment quality, but needs a very long time to produce good quality alignments, maybe due to using randomly generated alignments as the initial population.
2.4.3 Tabu Search
Tabu Search (TS) (Glover, 1986; Glover & Laguna, 1997) is a general Metaheuristic which explores a neighbourhood by traversing from one solution to the neighbour with the best score, even if this neighbour has a lower score. To prevent cycling, TS uses a form of memory to declare recent moves as forbidden for a number of iterations. In some cases, a move to a solution declared tabu may actually lead to an improvement. In this case, aspiration criteria are used which allow forbidden moves under some circumstances, e.g. if the new solution would be better than all previously found solutions. The memory can also be used to balance between diversification and intensification.
Riaz et al. (2004) implemented a TS for MSA, using the COFFEE objective function. They propose two different move strategies: a single block move, which is the same move as used in the BlockMove local search in ILS, and a move strategy moving a rectangle of gaps which spans more than one sequence to a new position. The second move strategy is implemented in their TS algorithm since it has been found to give shorter runtimes than performing a series of block moves. The TS works on one badly aligned region at a time, which is modified until it has been improved enough to no longer be classified as badly aligned. The TS is then focused on the next badly aligned region until no further badly aligned regions are found.
TS is tested on the BAliBASE database (version 1.0), and is found to be comparable to other algorithms in terms of solution quality, but needs a very long time to produce good alignments. It is pointed out that the initial alignment from which the TS is started strongly influences the quality of the final alignment.
3 Algorithm

This chapter describes the ILS algorithm for MSA and motivates the decisions made when designing the components of ILS (section 2.3). For each of the components, several alternatives will be proposed and compared to find the best alternative to use in the ILS. The results are all presented in chapter 5. The decisions made are motivated by preliminary tests on subsets of the BAliBASE database.
3.1 Construction heuristic
Three variants of construction heuristics have been implemented and evaluated to see if they can serve as a good starting point for a local search.

The first heuristic, Random, builds the initial alignment by randomly inserting a fixed percentage of gaps into the sequences, filling the shorter sequences with additional gaps.

The second heuristic, AlignRandom, generates the initial alignment by successively aligning a sequence to an alignment in random order. It starts with aligning two sequences, and adds the next sequence to this alignment until all sequences have been added. The alignment method is similar to standard dynamic programming and described by Gotoh (1993). 1
The third heuristic, NeighbourJoining, is a greedy heuristic which aligns the sequences step by step according to a guidance tree, which is computed using the neighbour joining method (Saitou & Nei, 1987); the alignment is done with the same procedure as in heuristic number two. NeighbourJoining is implemented using the QuickJoin library, which uses a heuristic speed-up technique to implement the algorithm as described by Saitou & Nei (Brodal et al., 2003). 2

The heuristics will be referred to as Random, AlignRandom and NeighbourJoining in the following.
Table 5.1 on page 41 compares the quality of the alignments computed by each construction heuristic for both the COFFEE score and the BAliBASE Sum of Pairs Score (BSPS). NeighbourJoining clearly achieved the highest scores in both cases.

Considering that the run-time for NeighbourJoining and AlignRandom is nearly the same and that the runtime of the whole ILS benefits from a starting solution near to a local optimum, NeighbourJoining is selected as the construction heuristic used in the implementation. This decision is further motivated by the assumption that good solutions are clustered together in the search space, which is one reason for the good performance of ILS.
1 Gotoh presents four algorithms, differing in estimated run-time and quality of the alignments produced. In this implementation, Algorithm A is used.
2 QuickJoin is available from http://www.daimi.au.dk/
A drawback of NeighbourJoining is that it is deterministic, so every restarted search starts from the same initial solution, which makes it unsuited for the RESTART acceptance criterion. When used in conjunction with RESTART, the first initial solution is computed by NeighbourJoining, whereas the new initial solutions for restarts are computed by AlignRandom.
3.2 Local Search
The embedded heuristic, which will be a local search procedure in this project, has great influence on the performance of ILS. Given that great impact, we will design the other components around the local search procedure.
To nd a good local search regarding both runtime and solution quality, two local
search procedures have been implemented and tested on a random subset of the
BAl-iBASEdatabase. Thesubsetconsistsofa quarter ofall sequencesofeachreferenceset,
randomlychosen.
The twolocalsearchprocedureseachexamineadierent neighbourhoodstructureby
strictiterativeimprovement,goingfromonesolutiontoanotheronlyifthenew solution
isbetterthantheformer.
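In generic form, strict iterative improvement can be sketched as follows. This is a toy illustration in Python rather than the project's implementation; higher scores are taken as better by convention of the sketch, and the pivot parameter anticipates the first and best improvement pivoting discussed below.

    def iterative_improvement(start, neighbours, score, pivot="First"):
        """Strict iterative improvement: accept a neighbour only if it scores strictly higher.

        pivot="First" moves to the first improving neighbour found,
        pivot="Best" moves to the best neighbour of the whole neighbourhood.
        """
        current, current_score = start, score(start)
        improved = True
        while improved:
            improved = False
            best, best_score = current, current_score
            for cand in neighbours(current):
                cand_score = score(cand)
                if cand_score > best_score:
                    best, best_score = cand, cand_score
                    improved = True
                    if pivot == "First":
                        break
            current, current_score = best, best_score
        return current

    # toy usage: maximise -x*x over integer steps of +/-1
    # iterative_improvement(7, lambda x: (x - 1, x + 1), lambda x: -x * x)  -> 0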
3.2.1 Single exchange local search
The first local search procedure exchanges one symbol with an adjacent gap symbol until no further improvement can be made. The neighbourhood structure used in the local search is defined as follows:

Definition 3.1 (Neighbourhood structure for single exchange local search) A pseudo multiple sequence alignment A is contained in the neighbourhood of another pseudo multiple sequence alignment B iff alignment A can be transformed into alignment B by exchanging a gap with a symbol, preserving the order of the symbols in each sequence.
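As a toy illustration of the move on a single row (the string representation with '-' as the gap symbol is an assumption, not the project's data structure):

    def exchange(row, gap_pos, symbol_pos):
        """Swap a gap with an adjacent symbol in one alignment row; the residue order is preserved."""
        chars = list(row)
        chars[gap_pos], chars[symbol_pos] = chars[symbol_pos], chars[gap_pos]
        return "".join(chars)

    # "AC--G" and "AC-G-" are neighbours of each other: exchange("AC--G", 3, 4) == "AC-G-"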
The size of the neighbourhood is O(N² · L), where N is the number of sequences in the alignment and L is the length of the longest sequence in the set S of sequences. The worst case is an alignment with a maximum number of gaps, since each gap can be replaced by two residues (except gaps at the beginning or end of a sequence, which can only be replaced by one residue), which gives two (one) neighbours. This worst-case alignment consists of columns which are built from N − 1 gaps and only one non-gap symbol. Each symbol from the sequences constitutes one column, so the alignment has Σ_{s∈S} |s| columns. The number of gaps is then given by (N − 1) · Σ_{s∈S} |s|. For L being the length of the longest sequence, it holds that L ≥ |s| for all s ∈ S, and from that Σ_{s∈S} |s| ≤ |S| · L = N · L. Altogether, this gives (N − 1) · N · L = N² · L − N · L = O(N² · L).
The local search procedure explores this neighbourhood by a combination of a greedy and a random procedure. The local search traverses every block of gaps, i.e. a consecutive run of gaps, and searches for an improvement within this block.
[Figure 3.1: Move operation of the single exchange local search. A gap inside a block of consecutive gaps is swapped with one of the symbols adjacent to this block.]
Within a block, the gaps and the two symbols adjacent to the block are examined; if an improvement is found, it is executed and the local search is restarted. If no improvement was found in this block, the local search is continued with the next block. The blocks and the gaps in each block are traversed in random order.

One final point of variation inside the local search worth considering is the question of which move should be performed during each iteration.
Algorithm 3.1 shows the complete local search algorithm in pseudo code. The algorithm examines all gap blocks and the adjacent symbols to check whether swapping a symbol with one of the gaps inside the block yields an improvement, until either an improvement is found or all gaps have been examined. Depending on the parameter pivot, which can be set to either "Best" or "First", the local search selects the neighbour which yields the largest improvement possible or the first neighbour found to yield an improvement. The algorithm uses an auxiliary function swap which swaps two positions inside an alignment.

The function ExchangeMove evaluates a move in the neighbourhood structure by computing the scores of the current alignment and of the neighbour reached by exchanging one symbol and a gap. The difference between these two scores is returned.

The function ExamineBlock searches for the best improvement with respect to a block of consecutive gaps by evaluating all possible moves obtained by considering one of the symbols to the right or left of the block and the gaps inside the block. It returns the value and the gap position of the best improvement move found, and zero if no improvement can be made.

The local search is implemented in the procedure SingleExchange. It examines all blocks of gaps as long as an improvement move in the neighbourhood can be made. To do this, it first evaluates all moves induced by a block and the symbol to the left of this block (lines 24-25). If an improvement was found, or if pivoting is set to "Best", the same is done for the symbol to the right of the same block (lines 26-34). If the examination of a block found an improvement and pivoting is set to "First", the for-loop is quit in line 36. When the loop terminates, the best improvement move found is executed (lines 39-40).

Figure 3.1 shows a possible move during the local search. First, a block of gaps is selected. For both adjacent residues, the score of exchanging them with each gap in the selected block is computed, and the best combination is eventually executed as the move for this iteration.
The estimated run-time of one iteration of the local search procedure is O(N² · L). In the worst case, the local search has to evaluate the whole neighbourhood of a solution, which corresponds to evaluating every possible move, since each neighbour can be reached by a single exchange move.
Algorithm 3.1: Single exchange local search.

 1: function ExchangeMove(Alignment A, OF f, gapPos, symbolPos)
 2:     delta ← f(A)
 3:     A' ← A
 4:     swap(A', gapPos, symbolPos)              ▷ swap two positions in an alignment
 5:     delta ← delta − f(A')
 6:     return delta
 7: end function
 8:
 9: function ExamineBlock(Alignment A, OF f, Block G, symbolPos)
10:     bestDelta ← 0
11:     for all gaps g ∈ G do                    ▷ find the best improvement for this block
12:         delta ← ExchangeMove(A, f, g, symbolPos)
13:         if delta < bestDelta then
14:             bestDelta ← delta
15:             gapPosition ← g
16:         end if
17:     end for
18:     return (bestDelta, gapPosition)          ▷ bestDelta = 0 if no improvement found
19: end function
20:
21: procedure SingleExchange(Alignment A, Objective Function f, pivot)
22:     repeat
23:         for each block G of consecutive gaps do
24:             symbolPos ← the symbol to the left of G
25:             (bestDelta, gapPos) ← ExamineBlock(A, f, G, symbolPos)
26:             if bestDelta < 0 ∨ pivot = "Best" then
27:                 rightPos ← the symbol to the right of G
28:                 (rightDelta, rightGap) ← ExamineBlock(A, f, G, rightPos)
29:                 if rightDelta < bestDelta then
30:                     bestDelta ← rightDelta
31:                     symbolPos ← rightPos
32:                     gapPos ← rightGap
33:                 end if
34:             end if
35:             if pivot = "First" ∧ bestDelta < 0 then
36:                 break                        ▷ quit the for loop
37:             end if
38:         end for
39:         if bestDelta < 0 then                ▷ if improvement found
40:             swap(A, gapPos, symbolPos)       ▷ perform the found move
41:         end if
42:     until no improvement was found in the last iteration
To speed up the search, don't look bits are used to prune the search space. Don't look bits are a matrix of boolean values which are initially set to false. A symbol adjacent to a gap block is only considered for a move if the corresponding don't look bit is false. Whenever the local search evaluates a symbol adjacent to a block of gaps and finds that no improvement can be made by exchanging this symbol with one of the gaps in the adjacent block, the don't look bit is set to true; when a move is actually executed, the don't look bits for the two exchanged positions are set to false.
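A minimal sketch of that bookkeeping (illustrative only; representing the alignment as a list of rows and addressing positions by (row, column) pairs are assumptions, not the data structures of the actual implementation):

    # don't look bits: one boolean per alignment position, initially False
    def make_dont_look_bits(alignment):
        return [[False] * len(row) for row in alignment]

    def should_examine(dont_look, pos):
        """A symbol is only considered for a move if its don't look bit is False."""
        row, col = pos
        return not dont_look[row][col]

    def mark_unpromising(dont_look, pos):
        """No improving exchange was found for this symbol: set its bit to True."""
        row, col = pos
        dont_look[row][col] = True

    def reset_after_move(dont_look, gap_pos, symbol_pos):
        """A move was executed: clear the bits of the two exchanged positions."""
        for row, col in (gap_pos, symbol_pos):
            dont_look[row][col] = False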
3.2.2 Block exchange local search
The second local search procedure is similar to the first one but tries to exchange whole blocks of gaps instead of one symbol in each step. Riaz et al. (2004) favoured this local search in their TS implementation³ and state that it is superior in runtime and solution quality to the single exchange local search.
The neighbourhood structure for the block exchange local search is defined as:

Definition 3.2 (Neighbourhood structure for block exchange local search) A pseudo multiple sequence alignment A is contained in the neighbourhood of another pseudo multiple sequence alignment B iff alignment A can be transformed into alignment B by exchanging a block of gaps with a block of non-gap symbols in one sequence and preserving the order of the symbols in each sequence.
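As a toy illustration of such a transformation on a single row (the string representation and '-' as the gap symbol are assumptions), the following sketch reproduces the kind of move shown in figure 3.2:

    def exchange_blocks(row, gap_start, symbol_start, length):
        """Swap a block of gaps with an equally long block of symbols in one row."""
        chars = list(row)
        for i in range(length):
            chars[gap_start + i], chars[symbol_start + i] = chars[symbol_start + i], chars[gap_start + i]
        return "".join(chars)

    # "ERT---QYPDI" and "ERTQYP---DI" are neighbours:
    # exchange_blocks("ERT---QYPDI", 3, 6, 3) == "ERTQYP---DI"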
In the worst case, the neighbourhood structure is equivalent to the neighbourhood structure of the single exchange local search, since a maximal number of gap blocks is given when each block is of length one. This gives the same worst-case size of the neighbourhood, O(N² · L), but on average the neighbourhood can be expected to be much smaller.
[Figure 3.2: Move operation for the block exchange local search. A randomly selected block of gaps is exchanged with a block of symbols of equal size to its right.]
Algorithm 3.2 describes the local search procedure and figure 3.2 presents one step in graphical form. It uses the same technique to distinguish between best improvement and first improvement pivoting as SingleExchange (algorithm 3.1). Swapping of blocks of symbols is done by an auxiliary function swapBlock which swaps two blocks inside an alignment.
³ TS traverses to the best solution in the neighbourhood of the current solution, a process similar to best improvement pivoting.
Algorithm 3.2: Block exchange local search.

 1: function BlockMove(Alignment A, OF f, Block gaps, Block symbols)
 2:     delta ← f(A)
 3:     A' ← A
 4:     swapBlock(A', gaps, symbols)
 5:     delta ← delta − f(A')
 6:     return delta
 7: end function
 8:
 9: procedure BlockExchange(Alignment A, OF f, pivot)
10:     repeat
11:         bestDelta ← 0
12:         for each block G of consecutive gaps do
13:             leftSymbols ← maximal block of at most |G| consecutive symbols to the left
14:             if BlockMove(A, f, G, leftSymbols) < bestDelta then
15:                 bestDelta ← BlockMove(A, f, G, leftSymbols)
16:                 symbolBlock ← leftSymbols
17:                 gapBlock ← G
18:             end if
19:             if pivot = "Best" ∨ bestDelta = 0 then
20:                 rightSymbols ← maximal block of at most |G| consecutive symbols to the right
21:                 if BlockMove(A, f, G, rightSymbols) < bestDelta then
22:                     bestDelta ← BlockMove(A, f, G, rightSymbols)
23:                     symbolBlock ← rightSymbols
24:                     gapBlock ← G
25:                 end if
26:             end if
27:             if bestDelta < 0 ∧ pivot = "First" then
28:                 break                        ▷ end the for-loop
29:             end if
30:         end for
31:
32:         if bestDelta < 0 then
33:             swapBlock(A, gapBlock, symbolBlock)
34:         end if
35:     until no improvement was found in the last iteration
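Both local searches need to enumerate the blocks of consecutive gaps in a row and the symbol blocks next to them. A small sketch of that enumeration follows; the row representation and '-' as the gap symbol are assumptions and the helper names are hypothetical, not those of the actual implementation.

    import re

    def gap_blocks(row, gap="-"):
        """Yield (start, end) index pairs of maximal runs of gaps in one row."""
        for match in re.finditer(re.escape(gap) + "+", row):
            yield match.start(), match.end()

    def adjacent_symbol_block(row, start, end, side, gap="-"):
        """Return (start, end) of the maximal block of at most (end - start) non-gap
        symbols directly left or right of the gap block row[start:end], or None."""
        max_len = end - start
        if side == "left":
            pos = start
            while pos > 0 and row[pos - 1] != gap and start - pos < max_len:
                pos -= 1
            return (pos, start) if pos < start else None
        pos = end
        while pos < len(row) and row[pos] != gap and pos - end < max_len:
            pos += 1
        return (end, pos) if pos > end else None

    # example: list(gap_blocks("ERT---QYPDI----PK")) == [(3, 6), (11, 15)]
    # adjacent_symbol_block("ERT---QYPDI----PK", 3, 6, "right") == (6, 9)   # "QYP"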