
UPTEC X 01 008 ISSN 1401-2138 JAN 2001

KRISTINA ENGDAHL

A tool for graphical visualisation and analysis of genomic sequences

Master’s degree project


Molecular Biotechnology Programme
Uppsala University School of Engineering

UPTEC X 01 008
Date of issue: 2001-01

Author: Kristina Engdahl

Title (English): A tool for graphical visualisation and analysis of genomic sequences

Title (Swedish):

Abstract: A graphical tool for the visualisation and analysis of genomic sequence data has been implemented using object-oriented modelling and Java. In a user-friendly interface, a set of sub-sequences is aligned to a known reference sequence in EMBL-format. The features of the EMBL-entry are displayed and it is possible to extract specific regions from the subset in well-established file-formats for further analysis.

Keywords: Evolutionary analysis, visualisation, object-oriented modelling, CORBA

Supervisors: Tomas Bergström, Uppsala Universitet

Examiner: Ulf Gyllensten, Uppsala Universitet

Project name:

Sponsors:

Language: English

Security:

ISSN 1401-2138

Classification:

Supplementary bibliographical information:

Pages: 26

Biology Education Centre, Biomedical Center, Husargatan 3, Uppsala. Box 592, S-75124 Uppsala. Tel +46 (0)18 4710000. Fax +46 (0)18 555217


In evolutionary studies, one is often interested in comparing genes from different individuals or species in order to examine differences and similarities. The rate at which changes arise in our genetic material differs between different parts of the genome, and it is therefore often important to analyse different regions separately.

The aim of this project has been to construct software that facilitates evolutionary studies. The program can be used to look at a mapped genomic region and its various features. In addition, the user can import sub-sequences from the same region or from closely related genomic regions. The new sequences are compared with the reference sequence and drawn in the same window.

The program gives a good overview of how a set of sequences covers a given region. The information can be saved in different ways, and certain sub-regions can be selected for further analysis by saving them in standardised file formats. The file formats have been chosen so that they can be used in other biological analysis programs.


1 Introduction
2 Background
2.1 Object-oriented modelling of biological data
2.1.1 BioJava
2.2 Design patterns
2.2.1 The MVC-model
2.3 Integration of external programs
2.3.1 Distributed computing
2.3.2 CORBA
2.4 Visualisation
2.5 The EMBL Nucleotide Sequence Database
2.6 The General Feature Format
3 Materials and Methods
4 Results
4.1 The graphical view
4.1.1 Looking at features in the reference sequence
4.1.2 Adding sub-sequences to the display
4.2 Aligning subsets to the reference sequence using blast
4.3 The editable view
4.4 Saving the result
4.4.1 Save alignment
4.4.2 Extract and save a specific region
4.5 CORBA wrapper for Bl2seq
4.6 Platform independence?
5 Discussion
5.1 Designing an object-oriented model
5.2 File formats
5.3 CORBA
5.4 Possible extensions of the software
5.5 Further testing
6 Conclusions

1 Introduction

Biological data are often provided in crude text file formats that can be difficult to interpret. Visualising such data in an intuitive manner can be a great help in overviewing and analysing a specific data set. The aim of this project was to construct a software tool for graphical visualisation and analysis of different regions in a genomic sequence. The software should provide a user-friendly interface for genetic analysis in general and for evolutionary studies in particular. A program like this is helpful when dispersed fragments in a genomic region have been sequenced from several individuals and should be examined further. It gives a good overview of the coverage that the set of sub-sequences constitutes and also helps in deciding which parts should be extracted for further analysis.

The fundamental concept when performing evolutionary studies is comparison, and in the study of molecular evolution one of the basic quantities is the nucleotide substitution rate. By comparing differences and similarities between related genomic sequences, the nature and effect of nucleotide substitutions can be characterised. This knowledge can be used, for example, when dating evolutionary events such as the divergence time between species. There are several theories on how comparisons should be performed and how substitution rates should be calculated [19].

The combination of molecular biology with classical evolutionary studies has, since the 1970s, resulted in several methods for estimating genetic distances between organisms and between individuals within species. This soon led to the proposal that there might exist a molecular clock that could be used as a universal reference when placing species divergence in time [21]. The assumption of this theory is that the substitution rate is constant over time for every living organism in shared genomic domains. However, it has since been shown that the substitution rate differs both within a genome [20] and between different lineages in evolution [10]. Thus, when analysing genomic sequences from an evolutionary perspective, separate analysis of different domains is often required.

Biology is a very data-rich discipline and there are a large number of computational tools available for various kinds of analysis. The tools are provided in different ways: often they can be accessed directly via the internet, but sometimes they need to be downloaded and executed on a local machine. The size and complexity of the programs vary. Some can be used without charge, while others require a licence. There is much to gain by gathering resources; combining those of special interest can yield a tailor-made solution for a specific application. The tools are often very specialised and seldom compatible with other programs without modifications. The tech- [...] present themselves in the form of distribution and heterogeneity [17]. Distributed applications run in different processes and are not restricted to the local machine. By setting up communications between the remote resource and the client program, these can be made to interoperate. A more demanding task is to solve the problem of heterogeneous applications, since these run on different operating systems or are implemented in different programming languages. In simple terms, the solution is to somehow wrap the external resource, thereby providing a translation of that program and enabling an integration.

Another problem that is often encountered is the lack of well-established, general definitions of how to represent biological data. There are several groups working on standardisation of formats in the field of biocomputing. The biowidget consortium works on a community consensus for graphical components [3], and the BioJava [23] and BioPerl [24] projects are rich resources for the handling of biosequences and alignment data. However, one of the most important groups that provide standardisations is the Life Science Group, a task force of the Object Management Group (OMG) [25]. They are defining standards for the communication and interoperability among computational resources in life science research. The standards are developed in a distributed object architecture.

The primary aim of this project was to develop a program with an interface where the user can easily extract the information of a specific region from a set of biological sequences and obtain a visual representation of how this set of sub-sequences is located relative to a reference sequence. Although there are existing programs that do similar things, few are developed for evolutionary studies.

One of the goals has also been to gather and use existing software for analysis and thereby avoid re-implementation of different algorithms. This can be accomplished by setting up a client/server application using distributed computing. The program is mainly intended for evolutionary studies within a single species or of closely related species. For this purpose, the specification of the project was to build an object-oriented application that fulfilled the following requirements:

• imports a reference sequence from one of the public sequence databases and graphically visualises its different genomic domains,

• enables the alignment of "in-house" produced sequences to the graphically displayed reference sequence,

• [...] domains,

• is platform independent,

• makes use of existing tools for alignment and analysis,

• provides an intuitive user interface.

2 Background

In the design of software, several decisions about models and methods for implementing the program must be made. The first thing to do is to define the application and how it should work. When designing a user interface, issues such as how data should be displayed and handled are important. It is also important to investigate what kind of data should be used and how it should be collected. In this section, modelling of a biological system and the use of distributed computing are explained. Some computational aids for bioinformatics are mentioned and the need for visualisation is discussed. Finally, some general file formats for representing biological information are described.

2.1 Object-oriented modelling of biological data

In object-oriented software, a program is viewed as a collection of interacting but independent components. An object encapsulates both data and methods and has an interface through which other objects are allowed to interact. The first step when designing an object-oriented system is to identify the necessary objects and their relationships [4]. Developing object-oriented models is a demanding task. The goal is often to provide a universal and reusable solution. To obtain good results the process must be iterative, and careful analysis of the specifications is required.

There have been several attempts to provide conceptual object-oriented models of biological data [12]. When designing a model for the representation of a biological system it is very important to have a clear idea of the use and purpose of the model. Different approaches can render different interpretations of some biological terms. Conceptual modelling offers implementation-independent solutions and can be a great help in separating information from description of a system. A well-designed model provides a good platform for discussing issues related to a specific system.

2.1.1 BioJava

One resource available for object-oriented software development in a biological context is BioJava [23]. BioJava is an open-source project that provides various objects for processing biological data. As the name suggests, the objects are implemented in the Java programming language. The project was started in 1999 and has been under continuous development ever since. The first stable release, version 1.00, of the BioJava core was ready in early autumn 2000. The Application Programmers Interface (API) can be used from a compiled and compressed jar file, or the source code can be downloaded. The BioJava packages include objects for working with sequences and symbols, different file formats and visualisation. There are also objects for dynamic programming and database access.

The core elements of BioJava are the objects for representing biological sequences. In a computational context, a sequence is traditionally just a string of ASCII characters. However, a biological sequence is much more than just letters, and therefore additional characteristics must be defined. Examples of properties of a biological sequence that can be included in an object are the definition of the alphabet it is constituted of, the annotation of regions within the sequence, or information about how and when the sequence was discovered. When defining biological properties it is important to avoid ambiguity in the definition. For example, if a sequence's only representation is letters, a T will refer to the base thymine under some circumstances, while under other circumstances it may just as well refer to the amino acid threonine.
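The ambiguity of a bare letter can be illustrated with a minimal sketch in which every letter is resolved through an explicit alphabet. The `Alphabet` class, its `resolve` method and the abbreviated symbol tables are hypothetical names introduced here for illustration only; they are not the BioJava API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical alphabet: maps one-letter codes to unambiguous symbol names. */
final class Alphabet {
    private final String name;
    private final Map<Character, String> symbols = new HashMap<>();

    Alphabet(String name) {
        this.name = name;
    }

    /** Registers a one-letter code and returns this alphabet for chaining. */
    Alphabet add(char code, String symbolName) {
        symbols.put(code, symbolName);
        return this;
    }

    /** Resolves a letter within this alphabet; fails for letters outside it. */
    String resolve(char code) {
        String symbol = symbols.get(Character.toUpperCase(code));
        if (symbol == null) {
            throw new IllegalArgumentException(
                    "'" + code + "' is not part of the alphabet " + name);
        }
        return symbol;
    }

    // Only a few symbols are listed to keep the sketch short.
    static final Alphabet DNA = new Alphabet("DNA")
            .add('A', "adenine").add('C', "cytosine")
            .add('G', "guanine").add('T', "thymine");

    static final Alphabet PROTEIN = new Alphabet("PROTEIN")
            .add('A', "alanine").add('C', "cysteine")
            .add('G', "glycine").add('T', "threonine");
}

class AlphabetDemo {
    public static void main(String[] args) {
        // The same letter yields different symbols depending on the alphabet.
        System.out.println(Alphabet.DNA.resolve('T'));     // prints "thymine"
        System.out.println(Alphabet.PROTEIN.resolve('T')); // prints "threonine"
    }
}
```

Because the alphabet travels with the sequence, the letter T never has to be interpreted without context.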

In BioJava, the representation of a biological sequence is based on three interfaces: SymbolList, FeatureHolder, and Annotatable [23]. These are implemented in an object called Sequence, see figure 1. The SymbolList defines all methods for handling data associated with the code string. It has a defined alphabet and each symbol has its own molecular definition. The FeatureHolder adds the possibility for the sequence to contain information about annotations such as different features. A Feature is an object that must be held by another object, for example by a Sequence, and cannot exist on its own. The Feature object has a location in the Sequence and is in fact itself a FeatureHolder, making it possible to have nested features. Both the Sequence and the Feature interfaces implement the interface Annotatable, which means that they both can contain additional information that, rather than being defined as a self-contained biological object, is accessible as a description in the form of a string.

Figure 1: Graphical representation of interfaces representing different components of a biological sequence and their relationships. (Diagram labels: Sequence, Annotatable, SymbolList, FeatureHolder, Feature.)
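The relationships between these interfaces can be sketched in a few lines of Java. The declarations below are simplified stand-ins written for this illustration: the method names (`symbols`, `features`, `createFeature`, `annotation`) and the class `SimpleSequence` are assumptions, and the real BioJava interfaces differ and contain many more methods.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical versions of the interfaces named in the text.
interface Annotatable {
    Map<String, String> annotation();
}

interface SymbolList {
    String symbols();
    default int length() { return symbols().length(); }
}

interface FeatureHolder {
    List<Feature> features();
    /** Features are only created through a holder, so none exists on its own. */
    Feature createFeature(String type, int start, int end);
}

/** A feature has a location in its parent and can hold nested features. */
class Feature implements FeatureHolder, Annotatable {
    final FeatureHolder parent;
    final String type;
    final int start; // 1-based, inclusive
    final int end;   // 1-based, inclusive
    private final List<Feature> children = new ArrayList<>();
    private final Map<String, String> annotation = new HashMap<>();

    Feature(FeatureHolder parent, String type, int start, int end) {
        this.parent = parent;
        this.type = type;
        this.start = start;
        this.end = end;
    }

    public List<Feature> features() { return children; }

    public Feature createFeature(String type, int start, int end) {
        Feature child = new Feature(this, type, start, end);
        children.add(child);
        return child;
    }

    public Map<String, String> annotation() { return annotation; }
}

/** Plays the role of the Sequence object: all three interfaces combined. */
class SimpleSequence implements SymbolList, FeatureHolder, Annotatable {
    private final String symbols;
    private final List<Feature> features = new ArrayList<>();
    private final Map<String, String> annotation = new HashMap<>();

    SimpleSequence(String symbols) { this.symbols = symbols; }

    public String symbols() { return symbols; }

    public List<Feature> features() { return features; }

    public Feature createFeature(String type, int start, int end) {
        Feature feature = new Feature(this, type, start, end);
        features.add(feature);
        return feature;
    }

    public Map<String, String> annotation() { return annotation; }
}

class SequenceModelDemo {
    public static void main(String[] args) {
        SimpleSequence seq = new SimpleSequence("ATGGCCTAA");
        seq.annotation().put("description", "hypothetical example entry");
        Feature gene = seq.createFeature("gene", 1, 9);
        gene.createFeature("CDS", 1, 9); // nested feature held by the gene
        System.out.println(seq.length() + " symbols; " + gene.type
                + " contains " + gene.features().get(0).type);
    }
}
```

In this sketch the rule that a Feature cannot exist on its own is approximated by creating features only through `createFeature` on a holder, which also records the parent.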

2.2 Design patterns

The use of design patterns is a great help when designing object-oriented software. The idea is to extract high-level interactions between objects and reuse their core features in new applications. The development of design patterns in the software community has been much inspired by the architect Christopher Alexander, who defined the nature of a pattern like this: "Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice." [1]. Even though this definition applies to constructing houses rather than building programs, it is also true for object-oriented design.

Design patterns make object-oriented designs more flexible and elegant. The benefits are found in avoiding redesign and reusing solutions that have worked before. They also help a great deal in obtaining clarity of a design and in writing code that is easy to understand [4].

2.2.1 The MVC-model

The Model-View-Controller (MVC) pattern was originally developed at Xerox PARC in the late 1970s for the programming language Smalltalk [14]. It is one of the most well-known design patterns and is used for the implementation of graphical user interfaces (GUIs). The Model-View-Controller model consists of three separate objects which, as the name suggests, are the Model, the View, and the Controller. Before MVC was introduced, GUIs were often complex in their internal structure and had a low degree [...] both flexibility and reuse.

Smalltalk's Model-View-Controller has been adapted to various degrees in most other GUI class libraries and application frameworks. The Java Swing component architecture is based on the MVC model [5]. However, the Swing MVC is a specialised version, constructed to support the pluggable look and feel instead of applications in general.

The responsibilities of the classes building the GUI are clearly divided in the MVC model, see figure 2. The Model object comprises the data of the application. It maintains the state and updates its views according to its current state. The View displays information about the Model through a user interface. There can be several Views accessing the same model. The Controller is the mediator of how user inputs should be reflected in the data and which view should be displayed.

Figure 2: The Model-View-Controller model is based on three objects. The Model contains all the data of the system, while the Views are responsible for displaying the data in different ways. The Controller takes care of user inputs and makes sure they are reflected in the Model. (Diagram labels: user, controller, model, view 1, view 2; arrows: sees, uses, notify, data access.)

[...] subscribe/notify protocol. The Model allows other objects to be registered to receive information about changes in the model and is not concerned with how its data will be displayed. The Model notifies its views when a change occurs in the model's data. Thus, the view always displays an up-to-date version of the data. This protocol makes it possible to attach different views or controllers to the same model.
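The subscribe/notify protocol described above can be sketched as follows. All names (`RegionModel`, `ModelListener`, `RangeView`, `LengthView`, `RegionController`) and the "start-end" input format are hypothetical and chosen only for this illustration; the views store their rendering as a string instead of drawing on screen so that the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;

/** Anything that wants to be notified of model changes implements this. */
interface ModelListener {
    void modelChanged(RegionModel model);
}

/** The Model: holds the data and notifies registered listeners on change. */
class RegionModel {
    private final List<ModelListener> listeners = new ArrayList<>();
    private int start = 1;
    private int end = 100;

    void addListener(ModelListener listener) { listeners.add(listener); }

    int getStart() { return start; }

    int getEnd() { return end; }

    void setRegion(int start, int end) {
        this.start = start;
        this.end = end;
        // The model does not know how its data is displayed; it only notifies.
        for (ModelListener listener : listeners) {
            listener.modelChanged(this);
        }
    }
}

/** View 1: shows the region as coordinates. */
class RangeView implements ModelListener {
    String lastRendering = "";

    public void modelChanged(RegionModel model) {
        lastRendering = "region " + model.getStart() + ".." + model.getEnd();
    }
}

/** View 2: shows the same data as a length in base pairs. */
class LengthView implements ModelListener {
    String lastRendering = "";

    public void modelChanged(RegionModel model) {
        lastRendering = (model.getEnd() - model.getStart() + 1) + " bp";
    }
}

/** The Controller: translates user input into changes of the model. */
class RegionController {
    private final RegionModel model;

    RegionController(RegionModel model) { this.model = model; }

    /** Accepts input of the assumed form "start-end", for example "200-350". */
    void userSelected(String input) {
        String[] parts = input.trim().split("-");
        model.setRegion(Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim()));
    }
}

class MvcDemo {
    public static void main(String[] args) {
        RegionModel model = new RegionModel();
        RangeView rangeView = new RangeView();
        LengthView lengthView = new LengthView();
        model.addListener(rangeView);
        model.addListener(lengthView);

        new RegionController(model).userSelected("200-350");
        System.out.println(rangeView.lastRendering);  // prints "region 200..350"
        System.out.println(lengthView.lastRendering); // prints "151 bp"
    }
}
```

A further view can be attached with one more call to `addListener`, without any change to the model or the controller.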

When designing a particular application with the MVC model in mind, the first step is to formalise the separation between the data management (the Model) and the user interface (the View-Controller). Answering a few questions can do this [14]. In the case of data management the programmer needs to find out the following: What is the data? How can the data be specified? How can the data be changed? In the case of the user interface, solutions to these issues must be found: How should data be displayed? How do events map into changes in the data? How should it all be put together?

The MVC model is a good base when building an application that requires multiple views. It is very good for constructing powerful user interfaces that can easily be extended by adding new features, e.g. attaching a new view. Another scenario where a model like the MVC is very useful is when distributed applications are needed. By separating the model and its graphical representation, the application can be distributed, e.g. by letting the model act as a server and the user interface act as a client.

2.3 Integration of external programs

One of the most important tasks in the field of bioinformatics is to integrate different resources in order to highlight correlations and gain new knowledge about a specific system. When addressing such issues, heterogeneous environments and the need to integrate distributed applications are often encountered problems. Fortunately, there are several ways to solve these kinds of problems.

2.3.1 Distributed computing

Distributed computing enables the interoperation of different applications and programs. A distributed application has components and objects running in different processes, either on the same machine or on machines located across a network. In addition, if the application is also heterogeneous, its processes are running under different operating systems or have been implemented in different programming languages.
