A tool for graphical visualisation and analysis of genomic sequences


UPTEC X 01 008 ISSN 1401-2138 JAN 2001

KRISTINA ENGDAHL

A tool for graphical visualisation and analysis of genomic sequences

Master’s degree project


Molecular Biotechnology Programme
Uppsala University School of Engineering

UPTEC X 01 008
Date of issue: 2001-01
Author: Kristina Engdahl
Title (English): A tool for graphical visualisation and analysis of genomic sequences
Title (Swedish):
Abstract: A graphical tool for the visualisation and analysis of genomic sequence data has been implemented using object-oriented modelling and Java. In a user-friendly interface, a set of sub-sequences is aligned to a known reference sequence in EMBL format. The features of the EMBL entry are displayed and it is possible to extract specific regions from the subset in well-established file formats for further analysis.
Keywords: Evolutionary analysis, visualisation, object-oriented modelling, CORBA
Supervisors: Tomas Bergström, Uppsala Universitet
Examiner: Ulf Gyllensten, Uppsala Universitet
Project name:
Sponsors:
Language: English
Security:
ISSN 1401-2138
Classification:
Supplementary bibliographical information:
Pages: 26

Biology Education Centre Biomedical Center Husargatan 3 Uppsala Box 592 S-75124 Uppsala Tel +46 (0)18 4710000 Fax +46 (0)18 555217


In evolutionary studies, one is often interested in comparing genes from different individuals or species in order to examine differences and similarities. The rate at which changes arise in our genetic material differs between different parts of the genome, and it is therefore often important to analyse different regions separately.

The aim of this project has been to construct software that facilitates evolutionary studies. The program can be used to view a mapped genomic region and its various features. In addition, the user can import sub-sequences from the same region or from closely related genomic regions. The new sequences are compared with the reference sequence and drawn in the same window.

With the help of the program, one obtains a good overview of how a set of sequences covers a given region. The information can be saved in different ways, and specific sub-regions can be selected for further analysis by saving them in standardised file formats. The file formats have been chosen so that they can be used in other biological analysis programs.


1 Introduction
2 Background
2.1 Object-oriented modelling of biological data
2.1.1 BioJava
2.2 Design patterns
2.2.1 The MVC-model
2.3 Integration of external programs
2.3.1 Distributed computing
2.3.2 CORBA
2.4 Visualisation
2.5 The EMBL Nucleotide Sequence Database
2.6 The General Feature Format
3 Materials and Methods
4 Results
4.1 The graphical view
4.1.1 Looking at features in the reference sequence
4.1.2 Adding sub-sequences to the display
4.2 Aligning subsets to the reference sequence using blast
4.3 The editable view
4.4 Saving the result
4.4.1 Save alignment
4.4.2 Extract and save a specific region
4.5 CORBA wrapper for Bl2seq
4.6 Platform independence?
5 Discussion
5.1 Designing an object-oriented model
5.2 File formats
5.3 CORBA
5.4 Possible extensions of the software
5.5 Further testing
6 Conclusions


1 Introduction

Biological data are often provided in crude text file formats that can be difficult to interpret. Visualising such data in an intuitive manner can be a great help in overviewing and analysing a specific data set. The aim of this project was to construct a software tool for graphical visualisation and analysis of different regions in a genomic sequence. The software should provide a user-friendly interface for genetic analysis in general and for evolutionary studies in particular. A program like this will be very helpful when dispersed fragments in a genomic region have been sequenced from several individuals and should be examined further. It will give a good overview of the coverage that the set of sub-sequences constitutes and also help in deciding which parts should be extracted for further analysis.

The fundamental concept when performing evolutionary studies is comparison, and in the study of molecular evolution one of the basic quantities is the nucleotide substitution rate. By comparing differences and similarities between related genomic sequences, the nature and effect of nucleotide substitutions can be characterised. This knowledge can be used, for example, when dating evolutionary events such as the divergence time between species. There are several theories on how comparisons should be performed and how substitution rates should be calculated [19].
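As a rough illustration of what comparing two related sequences can mean in practice, the sketch below counts the proportion of sites at which two sequences differ. It is only the simplest possible measure and is not claimed to be the method used in this project or in [19]. The class and method names are hypothetical, and the sketch assumes that the two sequences have already been aligned to equal length, with '-' marking gaps and 'N' marking unknown bases.

```java
/** Hypothetical helper for illustration; not part of the software described here. */
final class PairwiseDifference {

    /**
     * Returns the proportion of comparable sites at which two aligned
     * sequences differ. Columns with a gap ('-') or an unknown base ('N')
     * in either sequence are skipped.
     */
    static double proportionOfDifferences(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Sequences must be aligned to equal length");
        }
        int compared = 0;
        int different = 0;
        for (int i = 0; i < a.length(); i++) {
            char x = Character.toUpperCase(a.charAt(i));
            char y = Character.toUpperCase(b.charAt(i));
            if (x == '-' || y == '-' || x == 'N' || y == 'N') {
                continue; // this site cannot be compared
            }
            compared++;
            if (x != y) {
                different++;
            }
        }
        if (compared == 0) {
            throw new IllegalArgumentException("No comparable sites");
        }
        return (double) different / compared;
    }

    public static void main(String[] args) {
        // Two of the eight sites differ in this made-up pair.
        System.out.println(proportionOfDifferences("ACGTACGT", "ACGTTCGA"));
    }
}
```

A raw proportion like this underestimates the true number of substitutions when sequences are distant, which is one reason why several competing models for calculating substitution rates exist.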

The combination of molecular biology with classical evolutionary studies has, since the 1970s, resulted in several methods for estimating genetic distances between organisms and between individuals within species. This soon led to the proposal that there might exist a molecular clock that could be used as a universal reference when placing species divergence in time [21]. The assumption of this theory is that the substitution rate is constant for every living organism over time in shared genomic domains. However, it has since been shown that the substitution rate differs both within a genome [20] and between different lineages in evolution [10]. Thus, when analysing genomic sequences from an evolutionary perspective, separate analysis of different domains is often required.

Biology is a very data-rich discipline and there are a large number of computational tools available for various kinds of analysis. The tools are provided in different ways: often they can be accessed directly via the internet, but sometimes they need to be downloaded and executed on a local machine. The size and complexity of the programs vary. Some can be used without charge, while others require a licence. There is much to gain by gathering resources; combining those of special interest can render a tailor-made solution for a specific application. The tools are often very specialised and seldom compatible with other programs without modifications. The tech- [...] present themselves in the form of distribution and heterogeneity [17]. Distributed applications run in different processes and are not restricted to the local machine. By setting up communication between the remote resource and the client program, these can be made to interoperate. A more demanding task is to solve the problem of heterogeneous applications, since these run on different operating systems or are implemented in different programming languages. In simple terms, the solution is to somehow wrap the external resource, thereby providing a translation of that program and enabling an integration.

Another problem that is often encountered is the lack of well-established, general definitions of how to represent biological data. There are several groups working on standardisation of formats in the field of biocomputing. The biowidget consortium works on a community consensus for graphical components [3], and the BioJava [23] and BioPerl [24] projects are rich resources for the handling of biosequences and alignment data. However, one of the most important groups providing standards is the Life Science Group, a task force of the Object Management Group (OMG) [25]. They are defining standards for communication and interoperability among computational resources in life science research. The standards are developed in a distributed object architecture.

The primary aim of this project was to develop a program with an interface where the user can easily extract the information of a specific region from a set of biological sequences and obtain a visual representation of how this set of sub-sequences is located relative to a reference sequence. Although there are existing programs that do similar things, few of them are developed for evolutionary studies.

One of the goals has also been to gather and use existing software for analysis and thereby avoid re-implementation of different algorithms. This can be accomplished by setting up a client/server application using distributed computing. The program is mainly intended for evolutionary studies within a single species or of closely related species. For this purpose, the specification of the project was to build an object-oriented application that fulfilled the following requirements:

- imports a reference sequence from one of the public sequence databases and graphically visualises its different genomic domains,
- enables the alignment of "in-house" produced sequences to the graphically displayed reference sequence,
- [...] domains,
- is platform independent,
- makes use of existing tools for alignment and analysis,
- provides an intuitive user interface.

2 Background

In the design of software, several decisions about models and methods for implementing the program must be made. The first thing to do is to define the application and how it should work. When designing a user interface, issues such as how data should be displayed and handled are important. It is also important to investigate what kind of data should be used and how it should be collected. In this section, modelling of a biological system and the use of distributed computing are explained. Some computational aids for bioinformatics are mentioned and the need for visualisation is discussed. Finally, some general file formats for representing biological information are described.

2.1 Object-oriented modelling of biological data

In object-oriented software, a program is viewed as a collection of interacting but independent components. An object encapsulates both data and methods and has an interface through which other objects are allowed to interact. The first step when designing an object-oriented system is to identify the necessary objects and their relationships [4]. Developing object-oriented models is a demanding task. The goal is often to provide a universal and reusable solution. To obtain good results the process must be iterative, and careful analysis of the specifications is required.

There have been several attempts to provide conceptual object-oriented models of biological data [12]. When designing a model for the representation of a biological system, it is very important to have a clear idea of the use and purpose of the model. Different approaches can render different interpretations of some biological terms. Conceptual modelling offers implementation-independent solutions and can be a great help in separating information from the description of a system. A well-designed model provides a good platform for discussing issues related to a specific system.


2.1.1 BioJava

One resource available for object-oriented software development in a biological context is BioJava [23]. BioJava is an open-source project that provides various objects for processing biological data. As the name suggests, the objects are implemented in the Java programming language. The project was started in 1999 and has been under continuous development ever since. The first stable release, version 1.00, of the BioJava core was ready in early autumn 2000. The Application Programming Interface (API) can be used from a compiled and compressed jar file, or the source code can be downloaded. The BioJava packages include objects for working with sequences and symbols, different file formats and visualisation. There are also objects for dynamic programming and database access.

The core elements of BioJava are the objects for representing biological sequences. In a computational context, a sequence is traditionally just a string of ASCII characters. However, a biological sequence is much more than just letters, and therefore additional characteristics must be defined. Examples of properties of a biological sequence that can be included in an object are the definition of the alphabet it is composed of, the annotation of regions within the sequence, and information about how and when the sequence was discovered. When defining biological properties it is important to avoid ambiguity in the definition. For example, if a sequence's only representation is letters, a T will refer to the base thymine under some circumstances, while under other circumstances it may just as well refer to the amino acid threonine.
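The ambiguity of a bare letter can be illustrated with a small sketch in which every sequence carries its alphabet. The types below (`TypedSequence`, `Alphabet`) are hypothetical and written only for this illustration; they are not the BioJava classes, and the protein alphabet is deliberately restricted to four residues to keep the example short.

```java
import java.util.Map;

/** Hypothetical sequence type that always knows which alphabet its letters belong to. */
final class TypedSequence {
    private final Alphabet alphabet;
    private final String letters;

    TypedSequence(Alphabet alphabet, String letters) {
        // Reject letters that have no meaning in the chosen alphabet.
        for (int i = 0; i < letters.length(); i++) {
            alphabet.nameOf(letters.charAt(i));
        }
        this.alphabet = alphabet;
        this.letters = letters;
    }

    /** Returns the molecular name of the symbol at a 0-based index. */
    String symbolNameAt(int index) {
        return alphabet.nameOf(letters.charAt(index));
    }

    public static void main(String[] args) {
        // The same letters are interpreted differently depending on the alphabet.
        System.out.println(new TypedSequence(Alphabet.DNA, "GATT").symbolNameAt(2));
        System.out.println(new TypedSequence(Alphabet.PROTEIN_SUBSET, "GATT").symbolNameAt(2));
    }
}

/** Hypothetical alphabets; the protein one lists only four residues for brevity. */
enum Alphabet {
    DNA(Map.of('A', "adenine", 'C', "cytosine", 'G', "guanine", 'T', "thymine")),
    PROTEIN_SUBSET(Map.of('A', "alanine", 'C', "cysteine", 'G', "glycine", 'T', "threonine"));

    private final Map<Character, String> names;

    Alphabet(Map<Character, String> names) {
        this.names = names;
    }

    String nameOf(char symbol) {
        String name = names.get(Character.toUpperCase(symbol));
        if (name == null) {
            throw new IllegalArgumentException("Symbol '" + symbol + "' is not in alphabet " + this);
        }
        return name;
    }
}
```

Because the alphabet travels with the sequence, code that receives a `TypedSequence` never has to guess whether a T means thymine or threonine.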

In BioJava, the representation of a biological sequence is based on three interfaces: SymbolList, FeatureHolder, and Annotatable [23]. These are implemented in an object called Sequence, see figure 1. The SymbolList defines all methods for handling data associated with the code string. It has a defined alphabet and each symbol has its own molecular definition. The FeatureHolder adds the possibility for the sequence to contain information about annotations such as different features. A Feature is an object that must be held by another object, for example by a Sequence, and cannot exist on its own. The Feature object has a location in the Sequence and is in fact itself a FeatureHolder, making it possible to have nested features. Both the Sequence and the Feature interfaces extend the interface Annotatable, which means that they both can contain additional information that, rather than being defined as a self-contained biological object, is accessible as a description in the form of a string.


[Diagram showing the interfaces Sequence, Annotatable, SymbolList, FeatureHolder and Feature]

Figure 1: Graphical representation of interfaces representing different components of a biological sequence and their relationships.
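The relationships between these interfaces can be sketched in a few lines of Java. The signatures below are simplified and hypothetical; they only mirror the structure described in the text (a Sequence is a SymbolList, a FeatureHolder and Annotatable, and a Feature is itself a FeatureHolder) and should not be read as the actual BioJava API. The 1-based position used in `symbolAt` is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Small demonstration of nested features on a hypothetical sequence. */
class SequenceModelDemo {
    public static void main(String[] args) {
        Sequence seq = new SimpleSequence("ATGGCCTAA");
        Feature gene = seq.createFeature("gene", 1, 9);
        gene.createFeature("CDS", 1, 9);            // nested feature
        gene.annotation().put("note", "made-up example");
        System.out.println(seq.features().size() + " top-level feature(s)");
    }
}

// Simplified, hypothetical signatures; the real BioJava interfaces differ in detail.
interface SymbolList {
    int length();
    char symbolAt(int position); // 1-based position (assumption of this sketch)
}

interface Annotatable {
    Map<String, String> annotation(); // free-text descriptions stored as strings
}

interface FeatureHolder {
    List<Feature> features();
    Feature createFeature(String type, int start, int end);
}

// A Feature is itself a FeatureHolder, which is what allows nesting.
interface Feature extends FeatureHolder, Annotatable {
    String type();
    int start();
    int end();
    FeatureHolder parent();
}

interface Sequence extends SymbolList, FeatureHolder, Annotatable {
}

/** Shared bookkeeping for anything that holds features and annotations. */
abstract class AbstractHolder implements FeatureHolder, Annotatable {
    private final List<Feature> children = new ArrayList<>();
    private final Map<String, String> annotation = new HashMap<>();

    public List<Feature> features() { return children; }

    public Map<String, String> annotation() { return annotation; }

    // Features are created through a holder, so every feature has a parent.
    public Feature createFeature(String type, int start, int end) {
        Feature feature = new SimpleFeature(type, start, end, this);
        children.add(feature);
        return feature;
    }
}

class SimpleFeature extends AbstractHolder implements Feature {
    private final String type;
    private final int start;
    private final int end;
    private final FeatureHolder parent;

    SimpleFeature(String type, int start, int end, FeatureHolder parent) {
        this.type = type;
        this.start = start;
        this.end = end;
        this.parent = parent;
    }

    public String type() { return type; }
    public int start() { return start; }
    public int end() { return end; }
    public FeatureHolder parent() { return parent; }
}

class SimpleSequence extends AbstractHolder implements Sequence {
    private final String symbols;

    SimpleSequence(String symbols) { this.symbols = symbols; }

    public int length() { return symbols.length(); }
    public char symbolAt(int position) { return symbols.charAt(position - 1); }
}
```

Routing feature creation through the holder is one way to express that a Feature cannot exist on its own; other designs are possible.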

2.2 Design patterns

The use of design patterns is a great help when designing object-oriented software. The idea is to extract high-level interactions between objects and reuse their core features in new applications. The development of design patterns in the software community has been much inspired by the architect Christopher Alexander, who defined the nature of a pattern like this: "Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice." [1]. Even though this definition applies to constructing houses rather than building programs, it is also true for object-oriented design.

Design patterns make object-oriented designs more flexible and elegant. The benefits are found in avoiding redesign and reusing solutions that have worked before. They also help a great deal in obtaining clarity of design and in writing code that is easy to understand [4].

2.2.1 The MVC-model

The Model-View-Controller (MVC) pattern was originally developed at Xerox PARC in the late 1970s for the programming language Smalltalk [14]. It is one of the best-known design patterns and is used for the implementation of graphical user interfaces (GUIs). The Model-View-Controller model consists of three separate objects which, as the name suggests, are the Model, the View, and the Controller. Before MVC was introduced, GUIs were often complex in their internal structure and had a low degree [...] both flexibility and reuse.

Smalltalk's Model-View-Controller has been adapted to various degrees in most other GUI class libraries and application frameworks. The Java Swing component architecture is based on the MVC model [5]. However, the Swing MVC is a specialised version, constructed to support the pluggable look and feel rather than applications in general.

The responsibilities of the classes building the GUI are clearly divided in the MVC model, see figure 2. The Model object comprises the data of the application. It maintains the state and updates its views according to its current state. The View displays information about the Model through a user interface. There can be several Views accessing the same Model. The Controller mediates how user inputs should be reflected in the data and which view should be displayed.

[Diagram with the nodes user, controller, model, view 1 and view 2, connected by edges labelled sees, uses, notify and data-access]

Figure 2: The Model-View-Controller model is based on three objects. The Model contains all the data of the system, while the Views are responsible for displaying the data in different ways. The Controller takes care of user inputs and makes sure they are reflected in the Model.


[...] subscribe/notify protocol. The Model allows other objects to register to receive information about changes in the model and is not concerned with how its data will be displayed. The Model notifies its views when a change occurs in the model's data. Thus, the view always displays an up-to-date version of the data. This protocol makes it possible to attach different views or controllers to the same model.
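The subscribe/notify protocol described above can be sketched as follows. This is a minimal, GUI-free illustration with hypothetical class names (`AlignmentModel`, `CountView`, `NameListView`, `InputController`); it is not the code of the program described in this report, and the "views" simply store a string instead of drawing on screen.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Small demonstration: two views stay in sync with one model. */
class MvcDemo {
    public static void main(String[] args) {
        AlignmentModel model = new AlignmentModel();
        CountView countView = new CountView();
        NameListView listView = new NameListView();
        model.addListener(countView);   // views subscribe to the model
        model.addListener(listView);

        InputController controller = new InputController(model);
        controller.userAddedSequence("seqA");
        controller.userAddedSequence("seqB");

        System.out.println(countView.rendered);
        System.out.println(listView.rendered);
    }
}

/** Anything that wants to be told when the model changes. */
interface ModelListener {
    void modelChanged(AlignmentModel model);
}

/** The Model: holds the data and notifies subscribers, but knows nothing about display. */
class AlignmentModel {
    private final List<String> subSequences = new ArrayList<>();
    private final List<ModelListener> listeners = new ArrayList<>();

    void addListener(ModelListener listener) {
        listeners.add(listener);
    }

    void addSubSequence(String name) {
        subSequences.add(name);
        notifyListeners();
    }

    List<String> subSequences() {
        return Collections.unmodifiableList(subSequences);
    }

    private void notifyListeners() {
        for (ModelListener listener : listeners) {
            listener.modelChanged(this);
        }
    }
}

/** One View: shows how many sub-sequences are loaded. */
class CountView implements ModelListener {
    String rendered = "";

    public void modelChanged(AlignmentModel model) {
        rendered = model.subSequences().size() + " sub-sequence(s)";
    }
}

/** Another View of the same data: lists the names. */
class NameListView implements ModelListener {
    String rendered = "";

    public void modelChanged(AlignmentModel model) {
        rendered = String.join(", ", model.subSequences());
    }
}

/** The Controller: translates user input into changes of the model. */
class InputController {
    private final AlignmentModel model;

    InputController(AlignmentModel model) {
        this.model = model;
    }

    void userAddedSequence(String name) {
        if (name != null && !name.trim().isEmpty()) {
            model.addSubSequence(name.trim()); // blank input is ignored
        }
    }
}
```

Adding a third view requires only a new `ModelListener` implementation and one `addListener` call; neither the model nor the controller has to change.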

When designing a particular application with the MVC model in mind, the first step is to formalise the separation between the data management (the Model) and the user interface (the View-Controller). Answering a few questions can do this [14]. In the case of data management, the programmer needs to find out the following: What is the data? How can the data be specified? How can the data be changed? In the case of the user interface, solutions to these issues must be found: How should data be displayed? How do events map into changes in the data? How should it all be put together?

The MVC model is a good base when building an application that requires multiple views. It is very good for constructing powerful user interfaces that can easily be extended by adding new features, e.g. attaching a new view. Another scenario where a model like the MVC is very useful is when distributed applications are needed. By separating the model and its graphical representation, the application can be distributed, e.g. by letting the model act as a server and the user interface act as a client.

2.3 Integration of external programs

One of the most important tasks in the field of bioinformatics is to integrate different resources in order to highlight correlations and gain new knowledge about a specific system. When addressing such issues, heterogeneous environments and the need to integrate distributed applications are frequently encountered problems. Fortunately, there are several ways to solve these kinds of problems.

2.3.1 Distributed computing

Distributed computing enables the interoperation of different applications and programs. A distributed application has components and objects running in different processes, either on the same machine or on machines located across a network. In addition, if the application is also heterogeneous, its processes are running under different operating systems or have been implemented in different programming languages.