Evaluation and development of bioinformatics tools for design of ligation-based probes for nucleic acid analysis


September 2009


Hoda Ibrahim


Bioinformatics Engineering Program

Uppsala University School of Engineering

UPTEC X 09 029

Date of issue: 2009-08

Author: Hoda Ibrahim

Title (English)

Evaluation and development of bioinformatics tools for design of ligation-based probes for nucleic acid analysis

Title (Swedish)

Abstract

The aim of this thesis is to develop a process for the selection of a target for the identification of microorganisms using the so-called padlock probes, and to design and implement a computer program that automates this process.

Keywords

Padlock probes, probe design, rolling circle amplification (RCA), ASMD, YODA, OligoArray 2.1

Supervisors

Johan Stenberg, Q-linea

Scientific reviewer

Olle Eriksson, Department of Information Technology, Uppsala University

Project name Sponsors

Language

English

Security

ISSN 1401-2138 Classification

Supplementary bibliographical information

Pages

30

Biology Education Centre, Biomedical Center, Husargatan 3, Uppsala

Box 592, S-75124 Uppsala, Tel +46 (0)18 4710000, Fax +46 (0)18 555217


Hoda Ibrahim

Summary

There is great interest in the detection of microorganisms, particularly those involved in bioterrorism, food and water poisoning, and pandemics such as "the new influenza". The world's population is larger than ever, and because a large share of it travels, isolating infected individuals is difficult.

Early detection is critical, since it can prevent extreme consequences.

At the Department of Genetics and Pathology at Uppsala University, a method that can be used for detection of microorganisms has been developed: Amplified Single Molecule Detection, ASMD. The method is built on a common platform for protein and nucleic acid analysis. Q-linea AB, where this degree project was carried out, develops instruments for ASMD, among other things. In this report I will only treat nucleic acid analysis, since it is most relevant to my work.

With ASMD, everything from sequence recognition through amplification to detection is carried out in one continuous experiment with a single instrument. The method is based on sequence recognition with padlock probes, the amplification method rolling circle amplification (RCA), and fluorescence-based detection. Together they make ASMD a specific and fast method.

One of the hardest tasks in nucleic acid analysis is designing unique probes for an organism. A probe is a DNA molecule with a short sequence that is complementary to the target sequence and that also contains sequences used for amplification and identification. The probe's target sequence acts as a "fingerprint" of the genome. Many programs for probe design exist today, but many of them are too slow, perform poorly, require too much memory and are difficult to install. Moreover, none of them are adapted for padlock probes; most are intended for the design of PCR primer pairs and hybridization probes for microarrays. In this project I have analyzed different programs that design probes and written an evaluation program that sorts these probes according to additional parameters adapted for padlock probes.

Degree project, 30 credits

Master of Science Programme in Bioinformatics Engineering, Uppsala University, January 2009


Introduction

The world's population is larger than ever and travels widely. This makes outbreaks harder to isolate, and the recent avian influenza has shown that an infection can spread not only between migrating birds but also to humans.

A general disadvantage of today's laboratory methods is the time it takes for samples to reach the central reference laboratory. There are also many closely related biological agents in the environment that classical assays cannot discriminate between. It would therefore be a great advantage to be able to use a single instrument platform for both protein and nucleic acid-based detection. The ideal assay should be simple to perform, fast, reliable, cost-efficient and capable of multiplexing.

The BioNanoLab project

In the BioNanoLab project, Q-linea cooperates with Uppsala University and the Swedish Defense Research Agency (FOI) with funding from the Swedish Defense Materiel Administration (FMV). The aim of the project is to develop new and better optimized technology for detection of potential biological warfare agents.

Although Sweden has never been exposed to biological warfare¹, it has been used in other countries, and an outbreak has severe consequences. Sweden should therefore always be prepared for it and able to minimize its impact if it happens. Biological warfare can be anything from anthrax letters to contamination of water and food, e.g. with disease-causing bacteria such as Vibrio cholerae or Salmonella enterica (2).

Aim of this project

The purpose of this thesis work is to develop a process for the selection of target sequences for the identification of microorganisms, using the so‐called Padlock probes, and to design and implement a computer program that automates this process.

There are many detection methods for microorganisms today. Some of the most popular are bacterial culture, DNA microarrays and qPCR (quantitative PCR). Unfortunately, each of these methods falls short in some respect. Bacterial culture takes too long and needs complementary tests for identification. DNA microarrays perform poorly at distinguishing pathogens with small differences in their DNA, and they require pre-amplification of the DNA. qPCR has a low multiplexing capacity.

The padlock probe assay is a relatively new method used in the same field, with the advantage of being able to detect a small amount of target sequence among large amounts of non-target DNA. This highly multiplexable, sensitive and specific method is based on oligonucleotide probes whose two end sequences are complementary to the target sequence. When these ends match a target sequence perfectly, the probe takes on a circular shape, resembling a padlock.

After ligation the probe is circularized, making it possible to amplify it using e.g. RCA.

1 Biological warfare is the use of pathogens such as bacteria, viruses or other toxic products created by using biological organisms (2) as weapons and in bioterrorism.


The major challenge in this project is to find the short stretch of sequence that acts as a fingerprint for each organism. Many of the tools available today are intended for microarray probe design. Since padlock probes differ somewhat from microarray probes, a new tool needs to be developed or an existing one adapted.

Background

Nucleic acid analysis methods

Several types of nucleic acid analysis methods are in use today. I will go through just a few of them in this report: DNA microarrays, PCR, and padlock probes, which is the method used at Q-linea. These techniques are widely used for detection of microorganisms such as bacteria, viruses, fungi and other organisms that are too small to be seen by the naked eye.

DNA microarrays

DNA microarrays are of great importance in large-scale analysis. They enable thousands of sequences to be studied at the same time. Microarrays have been used in many fields: gene expression in cancer research (1), SNP detection, comparative genomic hybridization, alternative splicing and many more (2). A DNA microarray, often called a "gene chip", consists of numerous DNA sequences attached to known positions on a surface, such as a glass slide. These sequences are referred to as probes, while the unknown sequences are the targets.

A typical gene expression experiment measures the amount of mRNA in a sample. This can be done by extracting RNA from an affected and a non-affected organism, each labeled with a different color, see Figure 1. First, the mRNA is reverse transcribed to more stable cDNA, which is then applied to the microarray surface for hybridization. Each spot on the microarray consists of many probes representing a specific gene. The more cDNA that hybridizes to these probes, the stronger the signal. After hybridization, the image is scanned with a laser and a detector for further data analysis and data mining.

Figure 1: An illustration of a microarray. The target hybridizes to the probe and the detection oligonucleotide hybridizes to the target giving a signal after scanning.


For analysis of bacterial and other microbial communities, microarrays face great challenges in terms of specificity, sensitivity and quantification. They are a powerful tool when the quantity of target is large, since that gives a higher signal intensity. Many types of microarrays have been developed for this purpose; functional gene arrays, community genome arrays and phylogenetic oligonucleotide arrays are some of the examples reviewed by Zhou (3). These still need improvement when it comes to hybridization, accuracy and interpretation of the output data. Another disadvantage is that a gene chip can only be used for one experiment.

Real-time quantitative PCR

Real-time quantitative polymerase chain reaction (qPCR) is based on the ordinary PCR technique, which amplifies a small amount of DNA in the presence of dNTPs, primers and a thermostable enzyme, DNA polymerase. The double-stranded DNA is denatured and each strand is used as a template for replication, initiated by the primers. The amount of product grows exponentially, except at the start and the end of the reaction, which makes it difficult to get accurate measurements in ordinary PCR.

The main difference in qPCR is that the amplified DNA can be measured after each amplification cycle. This is done either with fluorescent dyes that bind to double-stranded DNA or with probes that fluoresce when hybridized to the complementary target (2). As the number of amplification products increases, the fluorescence signal increases with it, which makes the product detectable and quantifiable. One disadvantage of PCR is its low capacity for multiplexing.
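The exponential phase described above can be illustrated with a small sketch. It assumes an idealized reaction in which every cycle multiplies the amount of product by (1 + efficiency); the function names, the efficiency parameter and the threshold value are illustrative assumptions and do not come from any qPCR software.

```python
def copies_after(n0, cycles, efficiency=1.0):
    """Idealized number of copies after a number of PCR cycles.

    efficiency=1.0 means perfect doubling in every cycle (an assumption;
    real reactions deviate from this at the start and the end).
    """
    return n0 * (1.0 + efficiency) ** cycles


def threshold_cycle(n0, threshold, efficiency=1.0, max_cycles=60):
    """First cycle at which the idealized copy number reaches the threshold.

    Returns None if the threshold is not reached within max_cycles.
    """
    copies = float(n0)
    for cycle in range(max_cycles + 1):
        if copies >= threshold:
            return cycle
        copies *= 1.0 + efficiency
    return None


if __name__ == "__main__":
    # A ten-fold dilution series: fewer starting copies need more cycles.
    for start in (1000, 100, 10):
        print(start, threshold_cycle(start, threshold=1e6))
```

In this idealized model, a sample with fewer starting copies crosses the threshold at a later cycle, which is the basis for quantification in a dilution series.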

Figure 2: a) PCR starts with denaturation of the double strand, allowing primers to anneal to each single strand. DNA polymerase binds to the primers and extends a new DNA strand using free dNTPs. b) Schematic picture of the exponential amplification in PCR. c) Results from a typical qPCR run on a dilution series. The DNA signal increases exponentially each cycle until it reaches a threshold. Pictures are reproduced with permission from Andy Vierstraete and from USB, now a part of Affymetrix, Inc., 2009 (4), (5).

Padlock probes technology

The ligase-mediated gene detection technique was invented by Ulf Landegren in the late 1980s (6). It is based on the ability of two oligonucleotides to be joined by ligation when they anneal next to each other on a complementary target sequence. One of the oligonucleotides is radioactively labeled and the other is attached to another known substance. Nowadays a fluorescent label is used instead of the radioactive one. If both oligonucleotides match the target sequence perfectly, they are joined covalently by ligation and can easily be detected once the known substance is bound to a solid support.


Figure 3: This picture illustrates the difference between a complementary target and a mismatched target. If the two probes match perfectly to the target a ligation can occur that joins the two ends. From [U Landegren, R Kaiser, J Sanders,

The padlock probe is a further development of this method. It differs by having one more segment, a linker that joins the two oligonucleotides into a single molecule.

The probe is around 90 bases long: about 20 bases for each target-complementary end and about 50 bases for the linker. The linker may carry useful sequences such as tags for hybridization of primers or fluorescent reporter probes, or restriction sites. The two ends of the probe bind to the complementary target, forming a circle that looks like a padlock, with a nick² between the ends. The probe takes this shape because of the helical structure of DNA: the ends of the oligonucleotide wrap around the target.
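The layout of a padlock probe can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: the target is written 5' to 3' and split in the middle, the linker is an arbitrary placeholder, and the function names are hypothetical. It is not the design procedure of any of the tools discussed in this report.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def build_padlock(target, linker):
    """Assemble a padlock probe (5' to 3') for a target written 5' to 3'.

    The target is split in the middle. The 5' arm of the probe pairs with
    the first half of the target and the 3' arm with the second half, so
    that the two probe ends meet at a nick in the middle of the target.
    """
    target = target.upper()
    half = len(target) // 2
    arm5 = reverse_complement(target[:half])
    arm3 = reverse_complement(target[half:])
    return arm5 + linker + arm3


if __name__ == "__main__":
    target = "ACGTTGCAAGGCTTAACCGG" + "TTAGGCATCGATCGGATACC"  # made-up 40mer
    probe = build_padlock(target, linker="T" * 50)  # placeholder linker
    print(len(probe), probe)
```

After ligation the 3' arm is joined to the 5' arm, so reading across the closed nick gives the reverse complement of the whole target.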

If the ends match the target perfectly, a ligase enzyme joins them, forming a closed circle. After that, amplification is possible. This can be done either with PCR, where a pair of primers recognizes short sequences on the linker and initiates polymerization, or with rolling circle amplification (RCA). RCA uses a polymerase that copies the circularized probe over and over until the enzyme is inactivated, for example by raising the temperature. The resulting product is a linear single-stranded DNA molecule containing multiple complementary copies of the circle. In solution the strand collapses into a roughly spherical coil called a rolling circle product (RCP), or blob. RCPs are used later in amplified single molecule detection.

2 The covalent bond is missing between the two bases.

This method is very specific, since a single mismatch is enough to prevent a circle from forming, which makes it usable for detection of single mutations, detection of pathogens (7) and genotyping. One advantage over PCR, which is itself quite fast, is that hundreds or thousands of different padlock probes can be combined in a single reaction, while PCR is usually limited to about ten simultaneous reactions.

Figure 4: Illustration of the padlock probe technology. The yellow ends on the padlock probe are complementary to part of the target sequence (A). The probe is circularized by ligation in the presence of a correct target sequence (B). Amplification can be done either by RCA or PCR (C).

Amplified single molecule detection (ASMD)

The optimal solution would be a strategy that does the same work as qPCR and microarrays without their disadvantages. One step closer to this ideal is amplified single molecule detection (ASMD). By using padlock probes to identify the target sequence and RCA to amplify the reacted probes, Jarvius and coworkers have developed instruments for sensitive detection and quantification of microorganisms (8).

ASMD is based on RCA of small circular DNA probes, resulting in a rolling circle product (RCP) that forms a random coil roughly 1 µm in diameter. If the blobs are labeled with short fluorescent molecules, each blob will, upon illumination with a suitable laser, emit light that is approximately 100 times brighter than the surrounding solution. This is very useful for blob counting. When blobs pass through a thermoplastic microchannel in the confocal fluorescence microscope, they are detected with a line scan, producing a histogram with different stack heights depending on the pixel intensity of each blob, see Figure 5. During blob counting, all intensities under a specific threshold are discarded to remove background noise. The remaining stacks are the actual blobs. If the blobs are few, they can be amplified with a second round of RCA. This procedure is called circle-to-circle amplification (C2CA). A restriction enzyme is added to the sample along with a replication oligonucleotide (RO). The restriction enzyme cleaves the RCA product at specific positions, and the RO then initiates replication of these products, starting the RCA procedure over again.
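The thresholding step in blob counting can be sketched as follows. This is only an illustration of the idea, assuming the line scan is available as a list of intensity values and that a blob shows up as a run of consecutive samples at or above the threshold. The function name and the example numbers are made up, and this is not the algorithm used in the actual instrument.

```python
def count_blobs(intensities, threshold):
    """Count runs of consecutive samples at or above the threshold.

    Samples below the threshold are treated as background noise.
    """
    blobs = 0
    inside_blob = False
    for value in intensities:
        if value >= threshold:
            if not inside_blob:
                blobs += 1  # a new run starts here
                inside_blob = True
        else:
            inside_blob = False
    return blobs


if __name__ == "__main__":
    scan = [2, 3, 50, 80, 4, 1, 120, 3, 2, 60, 70, 65, 5]  # made-up line scan
    print(count_blobs(scan, threshold=40))
```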

Figure 5: Illustration of the ASMD technique, from detection of target with padlock probe, to amplification with RCA and finally analysis and blob counting. It can be done for both nucleic acid and proteins (using the proximity ligation assay).

Adapted by permission from Macmillan Publishers Ltd: NATURE METHODS (Jarvius et al), copyright (2006) (8).

The RCA methodology is a powerful and simple procedure. It has exceptionally high specificity and sensitivity, allowing detection of single-base mutations and specific antigens as well as single-molecule counting of DNA, RNA or even proteins (9). Unlike PCR, which is a cycle-based procedure, RCA is isothermal, so there is no need for a large instrument to regulate the temperature. It also gives a lower error rate during amplification and less variation than PCR. Combined with probing methods such as padlock probes, it yields an inexpensive, less complicated method with higher multiplexing capacity that can serve as a potent alternative to thermocycling diagnostic methods.

Methods

The following definitions are used in this report.

Probe = Consists of sequences complementary to the target at the ends and information-carrying sequences in the middle. In this section, "probe" refers only to the part that is complementary to the target sequence in the genome.

Target sequence = the sequence the probe ends will bind to. This sequence is only 30 bases in length.


Subsequence = A 30mer sequence from the genome generated from the evaluation program.

Target genome = the genome that we are trying to find target sequences for.

Non‐target genome = other genomes that should not be detected by the probes we are designing.
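To make the definition of a subsequence concrete, the sketch below enumerates all windows of a given length in a genome string. The function name and the default length of 30 are illustrative assumptions based on the definitions above; how the evaluation program actually generates its subsequences is not shown here.

```python
def subsequences(genome, length=30):
    """Yield (start, subsequence) for every window of the given length.

    Start positions are 0-based. A genome shorter than the window
    yields nothing.
    """
    for start in range(len(genome) - length + 1):
        yield start, genome[start:start + length]


if __name__ == "__main__":
    for start, subseq in subsequences("ACGTACGT", length=3):
        print(start, subseq)
```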

Probe design tools

The main task in probe design is to identify a short sequence in a genome that acts as a "fingerprint". The complementary sequence of the "fingerprint", the probe, should bind strongly to it but weakly, or preferably not at all, to non-targets during hybridization. Many oligonucleotide design tools are available today. Unfortunately, many of them are difficult to install and configure, complicated to use, platform-specific, slow, or lack important features for good probe design. They are also not directly suitable for padlock probe design, since different kinds of parameters need to be taken into consideration. However, some of the parameters are needed in both cases, which makes it easier to develop new software for padlock probe design using the existing tools as templates.

The following guidelines may be considered when designing a program for probe design:

Probe length: How long the identifying part of the sequence is. Some programs accept a range of lengths, others return just one specific length. The probe length is usually 40-50 bases for DNA microarrays, while padlock probes are somewhat shorter, around 40.

Melting temperature (Tm): How the melting temperature of the probe-oligonucleotide duplex is calculated is another critical factor in probe design. Tm is usually between 63 and 70 °C, depending on the probe length. The most common method for calculating the melting temperature is the nearest-neighbor (NN) method (10). For more difficult cases a Tm range is useful since it gives more flexibility.

GC Content: GC% is an important factor in DNA and provides information about the strength of annealing. A range between 30%‐70% is preferable. A higher percentage gives a greater risk for secondary structure and may be difficult to denature.

Identity (%): The tool makes use of a heuristic percent limit on how much the probe may match the non‐target.

Contiguous identity: Checks whether there are contiguous stretches of identical sequence between a target and a non-target. A 50mer probe, for example, is rejected if the number of contiguous identical bases is larger than a set limit.

Sequence similarity search: BLAST is the most widely used tool for similarity search. Some tools use their own search tool like SeqMatch (YODA).

Forward and reverse strand match: Checks if oligonucleotide selection should be compared against both forward strand and reverse‐complement to ensure there is no cross hybridization.

Target/probe mismatch position: Taking into account the impact of mismatch position between the probe and target. Mismatch on the solution end of the probe gives lower signal intensity than those in the middle.

Free energy: Calculation of the Gibbs free energy for the binding of the target to the probe. This can also be used to calculate Tm.

Non-specific hybridization: Checks whether cross-hybridization against any non-target is possible. This is necessary for the selection of specific probes.

Secondary structure: Tries to predict the secondary structure of the oligonucleotide. Several programs are used for this, for example MFOLD or Primer3.

Dimer and Hairpin: Check if dimerization³ or hybridization within the probe is possible.

Oligo binding position: Whether the probe should bind at the 3' or 5' end of the target sequence.

Prohibited motifs: Whether the tool allows users to specify motifs to be avoided, for example repetitive sequences that have not been filtered out beforehand.

Start and ending positions: Start and ending positions for the probe to be avoided.

Probes / target: Refers to how many probes will be designed for each target sequence.

Target regions/probe: Whether a probe may bind to multiple regions on the target.

Exon/intron structure: Whether the probe design distinguishes between different splice variants.

Platform: What platform the tool can run on for example Windows, Linux, or Mac.

GUI/CL: How to run the program, from the command line or through a graphical interface.
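A few of the guidelines above are simple enough to sketch in code: GC content, prohibited motifs and contiguous identity against a non-target. The sketch is illustrative only. The GC range of 30-70% is taken from the guideline above, while the default limit for contiguous identity, the function names and the example sequences are assumptions. Melting temperature, free energy and secondary structure are deliberately left out, since they require thermodynamic parameters.

```python
def gc_content(seq):
    """GC content of a sequence in percent."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def longest_shared_stretch(probe, non_target):
    """Length of the longest contiguous stretch found in both sequences.

    Plain dynamic programming (longest common substring); fine for short
    examples, too slow for whole genomes.
    """
    best = 0
    previous = [0] * (len(non_target) + 1)
    for p in probe:
        current = [0] * (len(non_target) + 1)
        for j, n in enumerate(non_target, start=1):
            if p == n:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def passes_basic_filters(probe, non_target, gc_range=(30.0, 70.0),
                         max_contiguous=15, prohibited=()):
    """Apply the GC, prohibited-motif and contiguous-identity checks.

    max_contiguous=15 is a hypothetical default, not a recommended value.
    """
    probe, non_target = probe.upper(), non_target.upper()
    if not gc_range[0] <= gc_content(probe) <= gc_range[1]:
        return False
    if any(motif.upper() in probe for motif in prohibited):
        return False
    return longest_shared_stretch(probe, non_target) <= max_contiguous


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
    print(passes_basic_filters("GGCCAATT", "TTTTTTTT", max_contiguous=3))
```

Only the forward strand of the non-target is compared here; a real check would also consider the reverse complement, as noted under forward and reverse strand match.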

Three programs that are suitable for evaluation and further development are PathogenMIPer, YODA and OligoArray 2.1. All of them are free to download from the internet and easy to install, and each comes with a website or files that serve as a manual.

PathogenMIPer

PathogenMIPer is a tool for molecular inversion probe (MIP) design. The method was first developed for detection of single nucleotide polymorphisms (SNPs) in human genes but has been shown to work well for pathogens (11). The technology behind MIP is quite similar to padlock probes. A probe with a sequence complementary to the target sequence is designed. The middle base is removed and the two halves are attached to either side of an identification sequence. Because one base is missing in the middle of the target-complementary sequence, hybridization creates a double helix with a gap, which can be filled in with a specific nucleotide and closed by ligation, see Figure 6.

3 Two biological components binding to each other (2).


Figure 6: A schematic picture of the MIP methodology. A probe is designed from the genome and the middle base is removed, creating two sequences, homologue 1 and homologue 2, which are ligated to a middle sequence (A). After annealing, the gap is filled, followed by PCR and sequencing (B), (C), (D). © 2006 Thiyagarajan et al; licensee BioMed Central Ltd (11).

PathogenMIPer works well for short genomes such as those of viruses and bacteria. The program accepts data in FASTA format through a graphical user interface. First, all candidate probes of the specified length are generated. Candidates containing a stretch of more than six identical consecutive bases are eliminated, as are candidates whose middle base is not the preferred one. The remaining candidates are checked against non-target genomes and all exact matches are eliminated. After that, the middle 11-base region is checked against the other genomes to ensure specificity. Candidates that fall outside the melting temperature range or fail the similarity threshold are also eliminated. The last two steps are a BLAST search against the host genome and the addition of tags to complete the inversion probe.
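The first filtering steps described above can be sketched as follows. This is not PathogenMIPer's code (the tool is written in Perl); it is a minimal Python illustration in which the function names, the odd candidate length and the example sequences are assumptions. Only the forward strand is checked for exact matches.

```python
import re

# Seven or more identical bases in a row, i.e. a stretch longer than six.
LONG_RUN = re.compile(r"([ACGT])\1{6,}")


def has_long_run(seq):
    """True if the sequence contains more than six identical bases in a row."""
    return LONG_RUN.search(seq.upper()) is not None


def first_pass(genome, length, preferred_middle, non_targets=()):
    """Yield (start, candidate) pairs that survive the first filters.

    length is assumed to be odd so that there is a single middle base.
    preferred_middle is a string of allowed middle bases, e.g. "C" or "CG".
    """
    genome = genome.upper()
    for start in range(len(genome) - length + 1):
        candidate = genome[start:start + length]
        if has_long_run(candidate):
            continue
        if candidate[length // 2] not in preferred_middle:
            continue
        if any(candidate in nt.upper() for nt in non_targets):
            continue  # exact match in a non-target genome
        yield start, candidate


if __name__ == "__main__":
    print(list(first_pass("ACGTA", 3, "CG", non_targets=["TTCGTTT"])))
```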


Figure 7: Picture of the PathogenMIPer user interface.

PathogenMIPer generates seven output files. The files are created after each step and can be viewed while the program is running. The first file, "project_name.cnd", contains all possible probes. After these have been filtered, "project_name.ftr" is created, containing only the ID and the sequence of each probe. The third file, "project_name.res", is created after screening against a host. Finally, "project_name.prb" holds all ready-to-order probes, with tags. It contains all the information needed: genotype/ID, location of the MIP in the target genome, homologue 2 of the probe, two primer sequences separated by the cleavage sequence, and finally homologue 1. Probes with problems in the tag region are saved in a separate file, "project_name_bad.prb".

YODA

Yet Another Oligonucleotide Design Application (YODA) is written entirely in Java and provides a graphical user interface with editable parameters. It designs probes for microarrays, which distinguishes it from PathogenMIPer. The developers concentrated on sensitivity, specificity and consistency (12). YODA is unusual in that it does not rely entirely on BLAST, as many existing tools do. BLAST is used to search a sequence database for sequences similar to the probe (13). It uses a word-based search with a minimum word size of 7 nt. If the query (the probe) and a sequence in the database do not share at least one identical word of this length, BLAST will not be able to align them. YODA instead uses its own sequence similarity search, SeqMatch, with a word length of 4 nt. This may make the search slower but gives higher sensitivity.
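The effect of the word size can be illustrated with a small sketch. It is neither BLAST nor SeqMatch, only the underlying idea that two sequences become candidates for alignment when they share at least one exact word of length w. The function names and the two example sequences are made up; the second sequence is the first one with a mismatch at every fifth position.

```python
def words(seq, w):
    """Set of all words (substrings) of length w in a sequence."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def shares_word(a, b, w):
    """True if the two sequences share at least one exact word of length w."""
    return not words(a, w).isdisjoint(words(b, w))


if __name__ == "__main__":
    probe = "ACGTAGGCTATCCGA"
    other = "ACGTCGGCTCTCCGC"  # same as probe, mismatch at every fifth base
    for w in (7, 4):
        print(w, shares_word(probe, other, w))
```

With a word size of 7 the two sequences share no word, so a word-seeded search would never compare them, even though 12 of 15 positions are identical. With a word size of 4 the similarity is found.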


Like PathogenMIPer, YODA accepts files in FASTA format but requires a unique identifier for each genome. Files can be supplied as a design, host or genome file. The design file contains the genome to design probes for. The host file contains the genome(s) to compare the design genome against, and the genome file provides additional support for the design file when performing iterative designs. Initially, melting temperature and GC content are determined for all possible probes of the specified length, the melting temperature with the nearest-neighbor formula. Oligos with prohibited sequences, a melting temperature outside the allowed range or an improper secondary structure are rejected.

Figure 8: Picture of the YODA user interface.

To select among accepted candidates, YODA uses a probe sorter mechanism in which the user specifies the selection criteria. YODA returns three output files for each probe sorter. The first file contains the candidate probes that passed the parameter limits, with the oligo sequence, the location in the genome and GC%. The second file contains the FASTA title lines of the targets for which the probe sorter was unable to select probes. The last file contains the sequences for which the probe sorter was unable to select probes.


OligoArray 2.1

OligoArray is another program used for microarray probe design. It is also written in Java and was developed on Linux computers (14). The program depends on two other programs, BLAST and OligoArrayAux (15). Before running OligoArray, a local BLAST database must be created. It should contain all the sequences against which the design sequences are to be checked for specificity. The file with all the sequences should be in FASTA format, and each sequence should have a unique identifier containing no spaces or " and ' symbols.

Before running the program, a number of parameters need to be set: oligo length range, melting temperature range, GC content range, maximum number of probes, etc. Prohibited sequences can be edited by the user, which makes the program more user-friendly. The program also requires two inputs, the genome to design probes for and the BLAST database. With this done, the program starts looking for probes within the required distance from the 5' to the 3' end. If this distance is longer than the sequence, the program shrinks it to the actual length of the genome. The next step is to find similarities between the oligonucleotide and the sequences in the BLAST database. A matrix is created that keeps track of all similarities between each position of the input sequence and the other sequences. GC content and Tm are calculated for each sequence of the specified length, and those that do not fulfill the criteria are eliminated. The rest are tested for the absence of secondary structure, with the free energy calculated using OligoArrayAux. Probes that can form a structure with a negative free energy are eliminated. Finally, specificity is computed for the probes that remain.


Figure 9: Picture of the OligoArray 2.1 user interface.

OligoArray generates three output files: an oligonucleotide file, a log file and a rejected file. The oligonucleotide file contains all accepted oligonucleotides, each with its name, length, position, melting temperature, free energy and any non-target genomes it matches. The log file contains the program status and all analysis steps during design, and the rejected file lists all oligonucleotides that did not pass the design process.

Results and Discussion

Tool comparison

For comparison, PathogenMIPer, OligoArray 2.1 and YODA were used to design probe sets for six different microorganisms, two viruses and four bacteria. The viruses are Cydia pomonella granulovirus (a baculovirus) and Enterobacteria phage MS2, and the bacteria are Bacillus atrophaeus, Bacillus subtilis, Escherichia coli and Pantoea agglomerans. Bacillus thuringiensis was used only as a host genome against the design genomes.

| Parameter | PathogenMIPer | OligoArray 2.1 | YODA |
|---|---|---|---|
| Seq. similarity search | BLAST | BLAST w = 7 | SeqMatch w = 4 |
| Contiguous identity | Yes | ??? | Yes |
| % Identity | Yes | No (not as a parameter) | Yes |
| Target/probe mismatch pos. | No | No | No |
| Forward/reverse strand match | No | No | Yes |
| GC content | Yes | Yes | Yes |
| Free energy | No | Yes | No |
| Tm method | NN | NN; SL98 | NN; SL98 |
| Tm range | No | Yes | Yes |
| Non-specific hybridization | Yes | Yes | No |
| Secondary structure | No | OligoArrayAux | Yes |
| Dimer | No | Yes | Yes |
| Hairpin | No | Yes | Yes? |
| Oligo binding pos. | Yes | Yes (3') | Yes |
| Optimized probe length | Yes | Yes | No |
| Prohibited motifs | Yes | Yes | Yes |
| Probes/target | No | >1 | Yes |
| Target regions/probe | ?? | 1 | 1 |
| Exon/intron structure | No | No | No |
| Language | Perl | Java | Java |
| OS | Windows | Windows, Unix | Windows, Linux and Mac OS X |

Table 1: Comparison between the tools and their parameters.

All genomes were downloaded as FASTA files from GenBank. Pantoea agglomerans and Bacillus atrophaeus have not been completely sequenced; instead, all partial sequences matching each organism were downloaded into one file. For these two genomes, different parameter settings were used in the tools than for the other genomes. Since each tool has its own set of parameters, which complicates a direct comparison, the parameters for YODA were set first and the other tools' parameters were adjusted to match, as shown in Table 2.

| Tool / organism | #probes | Oligo length | Tm range | GC% | Specific probes | Max (%) identity | Total run time (min) |
|---|---|---|---|---|---|---|---|
| YODA | 392 | 40 | 6(69) | 12(51) | 392 | 80 | 1490:14 |
| Cydia pomonella granulovirus | 50 | 40 | 69 | 51 | 50 | - | 21:57 |
| Enterobacteria phage MS2 | 50 | 40 | - | - | 50 | - | 1:37 |
| Escherichia coli | 49 | 40 | - | - | 49 | - | 827:49 |
| Bacillus subtilis | 50 | 40 | - | - | 50 | - | 623:02 |
| Bacillus atrophaeus | 47 | 40 | - | - | 47 | - | 1:11 |
| Pantoea agglomerans | 146 | 40 | - | - | 146 | - | 14:38 |
| OligoArray 2.1 | ~50/genome or 3/part genome | 40-40 | 60-90 | 40-60 | | x | x |
| Cydia pomonella granulovirus | 50 | 40 | - | - | | x | x |
| Enterobacteria phage MS2 | 50 | 40 | - | - | | x | x |
| Escherichia coli | X | 40 | x | x | x | x | x |
| Bacillus subtilis | X | 40 | x | x | x | x | x |
| Bacillus atrophaeus | 323 | 40 | - | - | | x | x |
| Pantoea agglomerans | 1395 | 40 | - | - | | x | x |

Table 2: Comparison between the parameter settings.

YODA has a running time of O(n) if the max identity is below 94%: if the sequence length doubles, the running time doubles. If the max identity is higher than 94%, the running time is O(n²), so it quadruples if the sequence length doubles (12). YODA selected 392 probes in total for all of the organisms; the individual organisms and their probe counts are listed in Table 2. The first four organisms are fully sequenced in GenBank, so 50 probes/target were chosen for them, and 3 probes/target for the last two, which have not been completely sequenced (these genomes are represented by multiple short sequences rather than a single long sequence). All possible sorters were selected, except for Pantoea agglomerans, which could not be sorted with the non-overlapping sorter. The total running time was 24 hours, 50 minutes and 14 seconds. Escherichia coli, which has the largest genome, took the longest.

The running time of OligoArray 2.1 depends on the number of sequences, the number of probes to generate and the number of sequences in the BLAST database. The program does not print out the total time it takes to process a sequence, but the OligoArray 2.0 article gives an example that indicates roughly how long it takes: for a 45-mer it takes between 4 and 12 hours on a 1.2 GHz dual Xeon processor (14). Total percent identity is not available as a parameter in OligoArray 2.1, but according to the documentation it has a limit of 50% identity (12).

PathogenMIPer has a processing time that depends on the number of candidates in the first step. If the number of candidates increases, the processing time increases exponentially. Other parameters, such as length and tag replacement, also affect the processing time. Since this tool took too long to design probes, up to several days for just one bacterium, it was excluded from the comparison.

The Evaluation program

PathogenMIPer, OligoArray 2.1 and YODA do not fully satisfy the padlock probe criteria. The best probe is one that is 30-40 bases long and is as unique as possible when compared to non-target genomes. In cases where highly similar non-target sequences exist, they should preferably differ more in the middle of the target sequence, since this is where the ligation will occur. If there are a lot of similar non-target sequences in the sample, the padlock probe will bind to different targets and will not be unique for a specific genome. A probe that is unique in this sense will be a good "fingerprint" for the genome.

These programs were used to select the actual probes used later in the evaluation algorithm. The goal is to find the ten best probes for each organism.

Algorithm

Pseudo code for the evaluation program:

1. //Read in all 3 files and the type:
2. File1 = inSubseqs //The file with target sequences
3. File2 = inHostGenomes //File with all non-target sequences
4. File3 = inGenome //The whole genome sequence for the target organism
5. Type = yoda/oligoArray
6. If(type = Yoda) call YodaReader()
7. If(type = OligoArray) call OligoArrayReader()
8. Remove all identical sequences in target sequences
9. For(iterate through inSubseqs)
10.   For(int start = 1; start<=probe.length()-30+1; start++){
11.     create new 30mer subsequences of the target sequences
12.     delete all containing a sequence of the restriction site }
13. save the remaining 30mers in an indexed sequence database (db)
14. Iterate through db{
15.   Create a hashmap with subsequence ID and a list as value}
16. For(iterate through inSubseqs) //Go through all target sequences
17.   for(iterate through inHostGenomes){ //Go through all the sequences of the host genome in steps of one base
18.     compare each 30mer subsequence with the target sequence
19.     if(targetsequence.position != subsequence.position)
20.       mismatches++;
21.       score += scoreArray[position]; //ScoreArray = {1,1,1,1,1,1,1,1,1,1,2,4,6,50,100,100,50,6,4,2,1,1,1,1,1,1,1,1,1,1}
22.     save all subsequences with mismatches <20
23.     save 10 most similar subsequences/target sequence in the Map}}
24. Sort all target sequences with their 10 most similar subsequences //The most similar subsequence should have as high a score as possible
25. Print into output file

Above is a short pseudocode version of my evaluation program. The program starts by reading in three files: the probe file containing all candidate probes from YODA or OligoArray 2.1, the host file containing all genomes/sequences to screen against, and the file containing the genome from which the probes were created. Depending on which tool generated the probes, a different method is called, YodaReader() or OligoArrayReader(); see points 5 to 7 above. The program makes sure the inSubseqs file does not contain any identical probes (point 8) before it generates new probes with a length of 30 bases. Each 40-mer probe gives 11 new probes. Any sequence that contains the restriction site sequence is deleted; see Appendix 1.
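The window generation described above can be illustrated with plain Java strings. This is only a sketch under stated assumptions: it does not use MolTools, the class and method names are hypothetical, and it applies the "AGTC" motif from Appendix 1 as a simple substring filter without the additional position condition used there.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch (hypothetical names, plain strings instead of MolTools types).
final class ThirtyMerSketch {
    static final int WINDOW = 30;
    // Restriction site motif as it appears in Appendix 1.
    static final String RESTRICTION_SITE = "AGTC";

    // Returns every 30-base window of the probe that lacks the restriction site.
    static List<String> windows(String probe) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start + WINDOW <= probe.length(); start++) {
            String candidate = probe.substring(start, start + WINDOW);
            if (!candidate.contains(RESTRICTION_SITE)) {
                result.add(candidate);
            }
        }
        return result;
    }

    // Removes identical probes first (point 8), then expands each probe into 30-mers.
    static List<String> expand(List<String> probes) {
        Set<String> unique = new LinkedHashSet<>(probes);
        List<String> all = new ArrayList<>();
        for (String probe : unique) {
            all.addAll(windows(probe));
        }
        return all;
    }
}
```

A 40-base probe has 40 - 30 + 1 = 11 windows, so windows() returns at most 11 sequences per probe and fewer when some of them contain the motif.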

Each of the new 30-mer probes is then compared to all the subsequences generated from the host (non-target) genomes. Each subsequence gets two different values, mismatch and score. Mismatch is the number of mismatches between the probe and the subsequence. Score depends on where the mismatches are on the probe: the closer a mismatch is to the center of the probe, the more it adds to the score; see Figure 10. To make the program run faster, only the subsequences with fewer than 20 mismatches are saved and checked further.
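The two values can be sketched in a few lines of Java. The weights are the scoreArray from the pseudocode; the class and method names are hypothetical, and plain strings stand in for the MolTools sequence types.

```java
// Illustrative sketch of the mismatch count and position-weighted score.
final class MismatchScoreSketch {
    // Position weights from the pseudocode: highest around the ligation site in the middle.
    static final int[] SCORE_ARRAY =
            {1,1,1,1,1,1,1,1,1,1,2,4,6,50,100,100,50,6,4,2,1,1,1,1,1,1,1,1,1,1};
    // Only subsequences with fewer mismatches than this are kept.
    static final int MAX_MISMATCHES = 20;

    // Returns {mismatches, score} for a 30-mer probe and a 30-mer host subsequence.
    static int[] compare(String probe, String hostSubsequence) {
        if (probe.length() != SCORE_ARRAY.length
                || hostSubsequence.length() != SCORE_ARRAY.length) {
            throw new IllegalArgumentException("both sequences must be 30 bases long");
        }
        int mismatches = 0;
        int score = 0;
        for (int pos = 0; pos < SCORE_ARRAY.length; pos++) {
            if (probe.charAt(pos) != hostSubsequence.charAt(pos)) {
                mismatches++;
                score += SCORE_ARRAY[pos];
            }
        }
        return new int[] {mismatches, score};
    }

    // True if the subsequence is similar enough to be saved for further checking.
    static boolean isKept(int mismatches) {
        return mismatches < MAX_MISMATCHES;
    }
}
```

For the probe TGGGCGTAAAGGGCTCGCAGGCGGTTTCTT and the host subsequence CGACGGTAATTGCCTCGCAGGCGGTTATCT from Appendix 2, compare() gives 10 mismatches and a score of 16, which agrees with the values listed there. A low score therefore means that the mismatches lie far from the center, i.e. that the subsequence is similar to the probe where it matters most.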


Figure 10: A schematic picture of how the score is set during comparison between probe and subsequence.

The 10 most similar subsequences, i.e. the subsequences with the lowest scores, are saved for each probe. Finally the probes are sorted according to their most similar subsequence. The best probe is the one whose most similar subsequence is as different as possible, i.e. has the highest score. The probes, together with their 10 most similar subsequences, are printed to an output file.
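The selection and sorting step can be sketched as follows. The data classes and method names are hypothetical and only illustrate the ordering rule described above; how the actual program stores its hits and breaks ties is not specified here. Treating a probe without any saved hits as the most unique one is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of keeping the 10 most similar hits and ranking probes.
final class ProbeRankingSketch {
    static final int KEEP = 10;

    // One host subsequence compared against a probe.
    static final class Hit {
        final String hostId;
        final int mismatches;
        final int score;

        Hit(String hostId, int mismatches, int score) {
            this.hostId = hostId;
            this.mismatches = mismatches;
            this.score = score;
        }
    }

    // A probe together with its most similar host subsequences.
    static final class Candidate {
        final String probeId;
        final List<Hit> mostSimilar;

        Candidate(String probeId, List<Hit> hits) {
            this.probeId = probeId;
            this.mostSimilar = ProbeRankingSketch.keepMostSimilar(hits);
        }

        // Score of the most similar hit; assumption: no hits means maximally unique.
        int lowestScore() {
            return mostSimilar.isEmpty() ? Integer.MAX_VALUE : mostSimilar.get(0).score;
        }
    }

    // Keeps the KEEP hits with the lowest scores, lowest first.
    static List<Hit> keepMostSimilar(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingInt((Hit h) -> h.score));
        return new ArrayList<>(sorted.subList(0, Math.min(KEEP, sorted.size())));
    }

    // Orders probes so that the one whose most similar hit scores highest comes first.
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(Candidate::lowestScore).reversed());
        return sorted;
    }
}
```

With this ordering, a probe whose closest host subsequence scores 20 is ranked ahead of a probe whose closest host subsequence scores 16.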

Figure 11: Structure of the evaluation program.

Implementation

The evaluation program is written in Java and was developed in Eclipse under the Windows operating system. It relies on MolTools, a class library that provides classes and methods used in the evaluation program (16). The program is run from the command line. To start the program, three input files and the name of the probe design tool that was used are required. The first file contains all the candidate probes, the second contains all the host genomes to be tested against, and the third is the genome used for probe design. The program can only accept output files from YODA and OligoArray 2.1. When it has finished, it generates a file with all the sorted probes and their 10 most similar subsequences.

Performance

The evaluation program is limited by the amount of available memory and the processor speed. Its running time also depends on how many genomes there are in the host file. The code has not been optimized during compilation; doing so could give better performance. It took approximately 136 min 28 sec for Enterobacterio-phage MS2, which is 4 kilobases (kb) long. Escherichia coli, which is the largest bacterium at 4642 kb, took 96 min 52 sec. This time difference depends on the size of the host file: since Enterobacterio-phage MS2 was compared to all five other, larger genomes, it took longer.

The reason for using YODA or OligoArray 2.1 for probe design before the evaluation program is to save time. The evaluation program would take much longer to do the design itself. It took approximately 1 min 37 sec to run Enterobacterio-phage MS2 on YODA and 136 min 28 sec to sort the resulting probes with the evaluation program. In total this is about 2.76 min for each probe (11 * 50 = 550 30-mers). If the evaluation program designed the probes itself, it would take approximately 9770.4 min in total for all 3540 probes (3569 bases give 3540 30-mers).
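The arithmetic behind this estimate can be retraced with a short Java sketch. It assumes that the figure of 2.76 min is the combined YODA and evaluation time divided by the 50 YODA probes, which is one reading of the text; the class and method names are hypothetical.

```java
// Illustrative sketch retracing the run-time extrapolation (hypothetical names).
final class RunTimeEstimateSketch {
    // Number of windows of the given length in a sequence.
    static int windowCount(int sequenceLength, int windowLength) {
        return Math.max(0, sequenceLength - windowLength + 1);
    }

    static double toMinutes(int minutes, int seconds) {
        return minutes + seconds / 60.0;
    }

    // Combined YODA and evaluation time per YODA probe, rounded to two decimals.
    static double minutesPerProbe() {
        double total = toMinutes(1, 37) + toMinutes(136, 28); // YODA + evaluation program
        return Math.round(total / 50 * 100.0) / 100.0;        // 50 probes from YODA
    }

    // Extrapolation to every 30-mer of the 3569-base MS2 genome.
    static double estimatedMinutes() {
        return minutesPerProbe() * windowCount(3569, 30);
    }

    public static void main(String[] args) {
        System.out.println(windowCount(40, 30));   // 30-mers per 40-mer probe
        System.out.println(windowCount(3569, 30)); // 30-mers in the whole genome
        System.out.println(minutesPerProbe());
        System.out.println(estimatedMinutes());
    }
}
```

Note that the per-probe time is measured per 40-mer probe while the extrapolation multiplies by the number of 30-mers, so the result should be read as a rough order-of-magnitude estimate.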

Comparison

The best results from using the evaluation program after YODA or OligoArray 2.1 were compared with the results from using only YODA or OligoArray 2.1. Each approach picked out its best probe; see Table 3. The shortest genome, Enterobacterio-phage MS2, was used since it took the shortest time. The results show that the evaluation program got a higher score for the probes, since the mismatches were closer to the middle than in the results from YODA and OligoArray 2.1.

Table 3: Comparison of the results from the different tools and their mismatches. The upper sequence is the probe and the one below it is the most similar subsequence from the host file. The green area shows where the score gets higher, and the red shows where the mismatches are.


It would be interesting to see how the evaluation program performs when it designs target sequences for a whole genome. Since this would take too much time to calculate, one gene from Bacillus atrophaeus was used instead. It was given to the evaluation program as a target sequence file, from which the program could design a total of 1379 target sequences, and it took 129 min and 17 sec to find the 10 best target sequences. These results were compared to the results from first selecting target sequences using YODA and then using the evaluation program. This way it took only 42 min and 29 sec in total. The results were surprisingly good; see Appendix 2. Both ways chose the same four best target sequences. The difference was in the fifth target sequence, where the evaluation program alone gave a slightly better result.

Program | The 5th best sequences | Mismatch | Score | Time
YODA (495) + Evaluation program | TGGGCGTAAAGGGCTCGCAGGCGGTTTCTT / CGACGGTAATTGCCTCGCAGGCGGTTATCT | 10 | 16 | 0:13 + 41 min 16 sec
Evaluation program (1379) | AGCCGCGGTAATACGTAGGTGGCAAGCGTT / CGACGGTAATTGCCTCGCAGGCGGTTATCT | 15 | 16 | 129 min 17 sec

Application

The ten best target sequences for each organism were selected using YODA and the evaluation program. Padlock probes were designed and synthesized for each of these target sequences, but they have not been fully evaluated.

Conclusions

The results show that we get much better results when using the evaluation program after one of the tools than when using only the tool. The evaluation program gives a more unique target sequence that differs more from the non-target genomes in the ligation area and therefore has a higher score.

The comparison between letting the evaluation program design target sequences for a whole gene and first using YODA shows that the evaluation program alone gives somewhat better results. But YODA is a good tool to use since it reduces the running time to a manageable level while still yielding acceptable target sequences. In the comparison made, the first four selected targets are the same, indicating that YODA did not discard the best target sequences.

Future

Some improvements could be made to develop the evaluation program further. One of these is to score mismatches differently depending on the bases involved. The number of hydrogen bonds varies depending on which base it is, see below. If a mismatch involves a base that binds with 3 bonds, it will get a higher score than one that binds with only 2 or 1.

Proposed points per mismatch: 3, 2 or 1 depending on the bases involved; A-A, T-T, G-G, C-C, C-T and G-A give 0 p.

The performance could always be improved by optimizing the code. 96 min 52 sec to run Escherichia coli is too long. This could be done in many ways, e.g. by changing the programming language.

Acknowledgment

First I would like to thank my supervisor Johan Stenberg for his help and input during my work at Q-linea and for his critical reading of this work and his suggestions. I want to thank the others at Q-linea, Anna Karman, Magnus Elgh and Jenny Göransson, for their help and support and for offering valuable comments on the manuscript. Also, big thanks to Olle Eriksson for assisting me while I was writing my report.

Finally I want to express my appreciation to my family (Samira, Walid, Khalid, Aya, Hiba and Ghina), who have always believed in me and supported me during my studies.


References

1. Geschwind, Daniel H. DNA microarrays: translation of the genome from laboratory to clinic. The Lancet Neurology. 05 2003, Vol. 2, pp. 275-282.

2. www.wikipedia.com. [Online] [Cited: 05 29, 2009.] http://en.wikipedia.org/wiki/Biological_warfare.

3. Zhou, Jizhong. Microarrays for bacterial detection and microbial community analysis. 2003, Vol. 6, pp. 288-294.

4. Andy Vierstraete. [Online] [Cited: 08 18, 2009.] http://users.ugent.be/~avierstr/principles/pcr.html.

5. USB Corporation. [Online] [Cited: 08 18, 2009.] http://www.usbweb.com/tech_tips/USB_Tech_Tip_207.pdf.

6. Landegren, Ulf, et al. A Ligase-Mediated Gene Detection Technique. Science. 08 26, 1988, pp. 1077-1080.

7. Szemes, Marianna, et al. Diagnostic application of padlock probes - multiplex detection of plant pathogens using universal microarrays. Nucleic Acids Research. April 28, 2005, Vol. 33 No. 8, p. e70.

8. Jarvius, Jonas, et al. Digital quantification using amplified single-molecule detection. Nature Methods. 09 2006, Vol. 3:9, pp. 725-727.

9. Demidov, Vadim V. Rolling-Circle Amplification. Encyclopedia of Diagnostic Genomics and Proteomics. 2005, pp. 1175-1179.

10. Kreil, David P., Russell, Roslin R. and Russell, Steven. Microarray Oligonucleotide Probes. Methods Enzymol. 2006, Vol. 410, pp. 73-98.

11. Thiyagarajan, Sreedevi, et al. PathogenMIPer: a tool for the design of molecular inversion probes to detect multiple pathogens. BMC Bioinformatics. 11 14, 2006, Vol. 7:500.

12. Nordberg, Eric K. YODA: selecting signature oligonucleotides. Bioinformatics. 09 23, 2005, Vol. 21:8, pp. 1365-1370.

13. BLAST. ncbi. [Online] [Cited: 06 09, 2009.] http://blast.ncbi.nlm.nih.gov/Blast.cgi.

14. Rouillard, Jean-Marie, Zuker, Michael and Gulari, Erdogan. OligoArray 2.0: Design of oligonucleotide probes for DNA microarrays using a thermodynamic approach. Nucleic Acids Research. 04 22, 2003, Vol. 31, 12, pp. 3057-3062.

15. OligoArray 2.1: Genome-scale oligonucleotide design for microarrays. OligoArray 2.1. [Online] 11 01, 2006. [Cited: 06 10, 2009.] http://berry.engin.umich.edu/oligoarray2_1/.

16. Sourceforge. [Online] [Cited: 08 24, 2009.] http://sourceforge.net/projects/moltools‐lib/.

17. Nilsson, Mats, et al. Padlock Probes: Circularizing Oligonucleotides for Localized DNA Detection. Science. 09 30, 1994, pp. 2085-2088.


18. Abd-Elsalam, Kamel A. Bioinformatic tools and guideline for PCR primer design. African Journal of Biotechnology. 04 28, 2003, Vol. 2:5, pp. 91-95.

19. Stenberg, Johan. Software Tools for Design of Reagents for Multiplex Genetic Analyses. Uppsala : Acta Universitatis Upsaliensis, 2006. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Medicine 148. p. 41.

20. Korf, Ian, Yandell, Mark and Bedell, Joseph. BLAST. Sebastopol : O'Reilly, 2003.

21. PathPort: the pathogen portal project. YODA. [Online] [Cited: 06 10, 2009.] http://pathport.vbi.vt.edu/YODA.

22. BMC Bioinformatics. BioMed Central. [Online] [Cited: 06 10, 2009.] PathogenMIPer Software. http://www.biomedcentral.com/1471‐2105/7/500.

Appendix

Appendix 1

Source code for generating the new 30-mer probes from the original 40-base probes.

final Collection<NucleotideSequence> thirtyMer = new ArrayList<NucleotideSequence>();
final int length = 30;
final int scoreArray[] = {1,1,1,1,1,1,1,1,1,1,2,4,6,50,100,100,50,6,4,2,1,1,1,1,1,1,1,1,1,1};

// Iterate through all probes in the input file (goalSeqsBac)
for (final Iterator<SimpleNucleotideSubSequence> i = goalSeqsBac.iterator(); i.hasNext();) {
    final SimpleNucleotideSubSequence s = i.next();

    // Generate every 30mer from each probe (positions are 1-based)
    for (int start = 1; start <= s.length() - length + 1; start++) {
        int end = start + length - 1;
        String newSubSeq = s.subsequence(start, end);
        String parentID = s.getID();
        int hostSubPos = s.getStart() + start - 1;
        String hostSubID = s.getParentID() + "_" + hostSubPos;

        // Filter out all that contain the restriction site sequence
        if (hostSubPos > 40 || !newSubSeq.contains("AGTC")) {
            PropertyHolderNucleotideSequence newSubsequence = new DefaultPropertyNucleotideSubSequence(
                    hostSubID, newSubSeq, NucleotideSequence.DNA, parentID, Polar.PLUS, start, end);
            thirtyMer.add(newSubsequence);
        }
    }
}


Appendix 2

The 5 best results from the output files for Bacillus atrophaeus

Bacillus atrophaeus with Evaluation program:

Sequence ID: Sequence Mismatch Score

Target sequence:
gi_162569007_gb_EU326483.1__48
ACGGGTGAGTAACACGTGGGTAACCTGCCT
gi|170079663|ref|NC_010473.1|:2998649: GCGGGTGACCATCACGTGGAGTAGAAGTAA 13 17
gi|170079663|ref|NC_010473.1|:4370316: ATTCTCGACATACACGTGGCGCGCTTCCCG 15 17
gi|170079663|ref|NC_010473.1|:2425904: CCGGTTCAGCAACACGTTGCAGACCCACTA 12 18

Target sequence:
gi_162569007_gb_EU326483.1__513
ATTGGGCGTAAAGGGCTCGCAGGCGGTTTC
gi|170079663|ref|NC_010473.1|:2088643: ATTTTTCGGAAAGGGCTTGCAGAGTGCCAT 12 17
gi|170079663|ref|NC_010473.1|:354219: GTTGCTGGCGAACGGCTCGCGGGTGAAATC 12 17
gi|170079663|ref|NC_010473.1|:1001123: TTGTGCCATTCTGGGCTCGCTGGCGATGGG 13 17

Target sequence:
gi_162569007_gb_EU326483.1__1083
CATTCAGTTGGGCACTCTAAGGTGACTGCC
gi|170079663|ref|NC_010473.1|:3180352: CTTTCCACTCAGCACTCTGAAGAGATCGAC 12 16
gi|170079663|ref|NC_010473.1|:2575887: CATTCCGAAAAGCACTCTCGGATTCCTTAC 12 17
gi|170079663|ref|NC_010473.1|:295258: TATTGTCTGCGGCACTCTTCGGTTGCAACA 13 17

Target sequence:
gi_162569007_gb_EU326483.1__47
GACGGGTGAGTAACACGTGGGTAACCTGCC
gi|170079663|ref|NC_010473.1|:2425903: ACCGGTTCAGCAACACGTTGCAGACCCACT 12 16
gi|170079663|ref|NC_010473.1|:4501585: GTGGAGCAAGCAACACGTCGTCCACCAGTC 12 16
gi|170079663|ref|NC_010473.1|:4014542: TATTCGGCGTTAACACGTGCAGCACGCTCC 15 16

Target sequence:
gi_162569007_gb_EU326483.1__473
AGCCGCGGTAATACGTAGGTGGCAAGCGTT
gi|170079663|ref|NC_010473.1|:3010953: ATGAATAGTGGTACGTAGGTCAAATGTCTG 15 16
gi|170079663|ref|NC_010473.1|:792824: ATTGCAGGAAATACGTAGGCCTGATAAGAC 15 16
gi|170079663|ref|NC_010473.1|:3590751: TGCTGGGCGTAGACGTAGGCGGTATTGGTG 13 17

Bacillus atrophaeus with YODA and then the Evaluation program:

Sequence ID: Sequence Mismatch Score

Target sequence:
gi_162569007_gb_EU326483.1__48
ACGGGTGAGTAACACGTGGGTAACCTGCCT
gi|170079663|ref|NC_010473.1|:2998649: GCGGGTGACCATCACGTGGAGTAGAAGTAA 13 17
gi|170079663|ref|NC_010473.1|:4370316: ATTCTCGACATACACGTGGCGCGCTTCCCG 15 17
gi|170079663|ref|NC_010473.1|:2425904: CCGGTTCAGCAACACGTTGCAGACCCACTA 12 18

Target sequence:
gi_162569007_gb_EU326483.1__513
ATTGGGCGTAAAGGGCTCGCAGGCGGTTTC 10
gi|170079663|ref|NC_010473.1|:2088643: ATTTTTCGGAAAGGGCTTGCAGAGTGCCAT 12 17
gi|170079663|ref|NC_010473.1|:354219: GTTGCTGGCGAACGGCTCGCGGGTGAAATC 12 17
gi|170079663|ref|NC_010473.1|:1001123: TTGTGCCATTCTGGGCTCGCTGGCGATGGG 13 17

Target sequence:
gi_162569007_gb_EU326483.1__1083
CATTCAGTTGGGCACTCTAAGGTGACTGCC
gi|170079663|ref|NC_010473.1|:3180352: CTTTCCACTCAGCACTCTGAAGAGATCGAC 12 16
gi|170079663|ref|NC_010473.1|:2575887: CATTCCGAAAAGCACTCTCGGATTCCTTAC 12 17
gi|170079663|ref|NC_010473.1|:295258: TATTGTCTGCGGCACTCTTCGGTTGCAACA 13 17

Target sequence:
gi_162569007_gb_EU326483.1__47
GACGGGTGAGTAACACGTGGGTAACCTGCC
gi|170079663|ref|NC_010473.1|:2425903: ACCGGTTCAGCAACACGTTGCAGACCCACT 12 16
gi|170079663|ref|NC_010473.1|:4501585: GTGGAGCAAGCAACACGTCGTCCACCAGTC 12 16
gi|170079663|ref|NC_010473.1|:4014542: TATTCGGCGTTAACACGTGCAGCACGCTCC 15 16

Target sequence:
gi_162569007_gb_EU326483.1__515
TGGGCGTAAAGGGCTCGCAGGCGGTTTCTT
gi|170079663|ref|NC_010473.1|:4027809: CGACGGTAATTGCCTCGCAGGCGGTTATCT 10 16
gi|170079663|ref|NC_010473.1|:1016438: GGGCGTTAAACAGCTCGCAGAAGATCCGTT 12 16
gi|170079663|ref|NC_010473.1|:2420868: CAGGCGCAACCGGCTCGCTGGAGAGCGCCT 12 16
