
Probes for ESBL: A Method for Production of Probe Targets in Antibiotic Resistant Genes


17-X4 

Probes for ESBL 

A Method for Production of Probe Targets in  Antibiotic Resistant Genes 

Erik Berner-Wik, Caitlin Haughey, Lauri Mesilaakso,   Hampus Olin, Jonatan Ulfsparre, Emma Östlund 

Client: Q-linea

Client representative: Jan Andersson  Supervisor: Magnus Lundgren

1MB332, Independent Project in Molecular Biotechnology, 15 credits, spring term 2017. Master's Programme in Molecular Biotechnology Engineering

Biology Education Centre, Uppsala University


Abstract

This project aimed to find a method for producing potential probe targets for identification of ESBL (Extended Spectrum Beta Lactamase) genes in bacteria. ESBLs are a type of enzymes responsible for antibiotic resistance in many bacteria. The result we developed was a semi-automated pipeline that utilises several Perl scripts to download gene sequences, identify sequence subgroups based on sequence similarity, find common target sequences among them and screen the target sequences against a background database. These target sequences should work with padlock probes and therefore had specific requirements regarding length and highest number of allowed mismatches. This report includes descriptions of the scripts and ideas for future improvements, as well as an ethical analysis about aspects relevant to research on antibiotic resistance.


Table of contents

1. Background
2. Introduction
3. The pipeline
3.1 Production of ESBL gene sequences
3.2 Sequence similarity analysis
3.3 Finding target sequences
3.3.1 Alternative I (matcher.pl)
3.3.2 Alternative II (common_target_finder.pl)
3.4 Screening targets against a background list (target_bg_check.pl)
4. Evaluating the reliability of target finding scripts
4.1 Alternative I (matchTester.pl & matchControl.pl)
4.2 Alternative II (false_targets_finder.pl)
5. Results
6. Discussion and ideas for improvement
6.1 Improving and automating the similarity analysis
6.2 Evaluation of target sequence searcher scripts
6.3 Subgrouping and geography
6.4 Improving evaluation of match finding scripts
6.5 A better background check procedure
7. Ethical considerations and challenges
7.1 Geographical and financial justice
7.2 Ethical and financial conflicts
7.3 Overconsumption of antibiotics
7.4 Knowledge, the way to success
Acknowledgements
References
Appendices
A1. Different types of ESBLs
A2. Neighbor-joining trees of IMP and OXA with marked subgroups
A3. Neighbor-joining trees of VIM, NDM and KPC
A4. Background genomes
A5. Step-by-step guide to retrieving target sequences
A5.1 Retrieving gene sequences
A5.2 Similarity analysis
A5.3 Search for targets
A5.4 Filter targets against the background list
A5.5 Subgrouping (if no targets are found)
A5.6 Checking how matcher.pl and common_target_finder.pl find matches
A5.6.1 Running matchTester.pl
A5.6.2 Running false_targets_finder.pl
A7. A closer look at common_target_finder.pl
A7.1 First comparison
A7.2 Second comparison


1. Background

Sepsis is the medical term for blood poisoning, which is an infection caused by pathogenic bacteria in the bloodstream. This infection can become so severe that it results in a potentially fatal inflammatory reaction in the body. A common symptom of sepsis is hypotension, unusually low blood pressure. If a sepsis patient becomes hypotensive and does not receive effective treatment, mortality increases by 7 % for each hour that the infection goes untreated during the first 6 hours after symptoms are observed (Kumar et al. 2006). Speed is therefore essential when it comes to correctly diagnosing sepsis, especially since it is the leading cause of death in hospital intensive care units (ICUs) in many high-income countries (Bataar et al. 2010).

Q-linea is a company that focuses on developing a method for fast diagnostics of sepsis. The technology that they are working on will be able to identify pathogens and their susceptibility to antibiotics. This method will be able to offer high speed detection due to the use of padlock probes, which are linear nucleotide sequences that become circular when binding to a certain target sequence. Thereafter the probes can be amplified by rolling circle amplification and hybridized with fluorescent DNA-probes. If a padlock probe has hybridized with its target sequence a signal will be received.

Since pathogens within the same species have shared DNA-sequences, padlock probes can be designed for these DNA-sequences and thus detect the pathogenic species. Padlock probes can also be designed to detect a certain antibiotic resistance by giving the probe a target sequence that lies within the resistance gene.

At the present time, sepsis diagnostics is based on the use of blood cultures. This identification method takes at least two days to give a conclusive result, which means that it can only confirm or refute a doctor’s suspicions when it comes to sepsis and the chosen treatment. This leads to a large number of possibly ineffective antibiotic treatments for sepsis patients due to the increasing number of resistance genes carried by the pathogens causing the disease. Recognition of these resistance genes with the help of padlock probe technology could reduce the time needed to start a correct and effective antibiotic treatment, tailored to the pathogen involved in each infection. The padlock probes designed by Q-linea will be used by an instrument that will be able to identify pathogens and their resistance genes within 10 hours, which is a significant and life-saving reduction from the previously common 48 hours (Kaiser et al. 2002).

One major cause of antibiotic resistance is the so-called Extended Spectrum Beta Lactamases (ESBLs), which are a diverse group of enzymes that confer antibiotic resistance on certain Gram-negative bacteria. There are many ESBL gene variants that encode these enzymes. These can be grouped and classified in various ways depending on specific characteristics, places of origin or the antibiotics they are able to inactivate. Some of the more common and severe ones are listed in Appendix 1. The only common denominator for all these ESBLs is the ability to cleave and neutralize beta lactam, which is the active part of several antibiotics frequently used today. The term extended-spectrum beta-lactamase originates from the fact that these enzymes have ‘extended broad-spectrum’ activity on many different types of beta lactams, compared with the ‘broad-spectrum’ classic TEM and SHV enzymes (see Appendix 1 for descriptions). This capacity renders the pathogens carrying ESBLs resistant to numerous types of beta-lactam antibiotics and beta-lactamase inhibitors, and they can occur in many different strains of bacteria due to horizontal gene transfer.


Additionally, ESBLs can be categorized into subgroups in different ways depending on a particular set of characteristics. One of the most common ways of doing this is by considering the amino acid sequence (Ambler 1980), splitting the ESBLs into four molecular classes A, B, C and D based on conserved regions and motifs. This classification is simple and widely used since it is based only on primary structure. However, while this may be an easy way to classify the enzymes, it does not provide any kind of association to their effect on certain antibiotics, since the relationship between an enzyme’s structure and function depends on more than just the amino acid sequence.

Another established classification is the Bush-Jacoby grouping, which is based on functional properties of the enzymes such as their substrate and inhibitor profiles, molecular masses and isoelectric points (Bush & Jacoby 2010).

2. Introduction

The objective given to us by the company Q-linea was to produce a pipeline for identifying suitable probe targets for different ESBL genes and then to compile a list of the discovered potential target sequences. Since there is more than one variation of the same resistance gene, the aim was to preferably find as few sequences as possible that can be used as target sequences for all variations of each distinct gene. These target sequences also had to meet specific criteria given by Q-linea regarding the configuration of the probe target. Lastly, the sequences were to be compared and filtered against a list of background genomes obtained by Q-linea in order to ensure that they would not give rise to false positive results if they were used as probe targets.

The pipeline we created in accordance with these specifications consists of a step-by-step procedure with different programs and algorithms with specific inputs and outputs for each step, along with suggestions of how to divide the genes into smaller groups if no target sequence should be found.

Since this pipeline was such a central part of this project, the outline of it is included in the methods section of this report, as well as the neighbor-joining trees and the subgroups based on the trees. However, the full output of the pipeline will only be reported to Q-linea in a separate document and is not included in this report.

The aforementioned subgroups of ESBLs can be further compiled into four more extensive groups: classical ESBLs (CTX-M, SHV, TEM), carbapenemases (KPC, NDM, OXA, VIM, IMP), AmpC ESBLs (CMY, AAC) and inhibitor resistant beta lactamases (IRT, inhibitor resistant SHV) (Babic et al. 2006). For this project Q-linea requested that we focus on the first three of these groups, starting with the carbapenemases which have the highest priority, then the classical ESBLs and lastly the AmpC group. With this priority order in mind, we restricted our work to carbapenemases, but the other groups of ESBLs can be readily handled with our pipeline as well.

The precise attributes that Q-linea has specified for the possible probe target sequences are under nondisclosure, yet the overall design criteria can be summarized as two distinct segments separated by a spacer of a certain length. The upstream segment, segment 1, will bind the gene to a matrix. It has to have a distinct length in number of base pairs and is allowed to contain a specific maximum number of mismatches. The other segment (downstream from segment 1) will be called segment 2 and it should consist of three separate subsegments (A, B and C). Subsegment B will serve as the binding site for the padlock probes and is therefore not allowed to contain any mismatches, whilst subsegments A and C are allowed to contain a few mismatches. Segment 2 has to be unique, so no exact matches are permitted against the obtained background list of genomes, whereas segment 1 should not be found in certain background genomes.

Figure 1. A graphical representation of what the probe targets should look like.


3. The pipeline

The following chapter will go through each step of the pipeline in more detail. Figure 2 summarises the steps in the pipeline. For a step-by-step guide on how to use the pipeline see section A5 of the appendix.

Figure 2. An overview of the developed pipeline.

3.1 Production of ESBL gene sequences

The ESBL gene sequences were obtained by querying the databases of the National Center for Biotechnology Information (NCBI). In practice the queries were performed using programming utilities called E-utilities, which are provided by NCBI. E-utilities contain server-side programs that provide a stable interface into the Entrez query and database system at NCBI.

The script dataProd.pl first reads query parameters (which NCBI database is queried and the exact search terms used for the query) from a text file. The queries were restricted to the Bacterial Antimicrobial Resistance Reference Gene Database (BioProject number: 313047) at NCBI because this BioProject contains well curated records of representative DNA sequences that contribute to resistance to various antibiotics. The deRedundancifier.pm package contains a subroutine called deRed which handles both the reading of two datasets and writing to a file. It also ensures that identical gene sequences are not written more than once to a fasta file by checking that each fasta header is unique.

3.2 Sequence similarity analysis

Since it would likely be necessary to divide the gene sequences into smaller groups we needed a way to visualise the similarity between them. The reason for this is that some ESBL types are quite divergent due to accumulation of mutations over long periods of time, so finding a common target sequence among all the variations could prove to be difficult. Since phylogenetic trees are easy to interpret we decided to use them as a starting point.

The similarity measure should be based on the number of matches and mismatches between two sequences. Therefore we chose to construct trees based on a p-distance matrix. p-distance is calculated using the following formula (Zvelebil & Baum 2008):

$$D = 1 - \frac{\#\text{matches}}{\#\text{positions} - \#\text{gaps}}$$

where D equals the fraction of mismatches in a pairwise alignment, excluding gaps. Applying this formula to each pair of sequences in a multiple alignment yields a distance matrix that is then used to construct an NJ (neighbor-joining) tree. NJ-trees are constructed by pairing the two sequences or nodes with the shortest distance between them until all sequences are incorporated. It is possible to set the branch lengths to represent these distances when viewing the tree.
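As an illustration of the formula, the following is a minimal Python sketch of a p-distance matrix. It is not the code used in the pipeline, which relies on EMBOSS and PHYLIP. The function names `p_distance` and `distance_matrix` and the toy alignment are assumptions made for this example, and columns where either sequence has a gap are skipped, which is how we read "excluding gaps".

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of mismatching columns between two aligned sequences.

    Columns in which either sequence has a gap are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gapped column: counted neither as match nor mismatch
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 1 - matches / compared


def distance_matrix(alignment):
    """Symmetric p-distance matrix for a dict {name: aligned sequence}."""
    names = list(alignment)
    matrix = {name: {name: 0.0} for name in names}
    for n1, n2 in combinations(names, 2):
        d = p_distance(alignment[n1], alignment[n2])
        matrix[n1][n2] = d
        matrix[n2][n1] = d
    return matrix


# Hypothetical toy alignment, not taken from the project data.
toy_alignment = {"seq1": "ACGTACGT", "seq2": "ACGTTCGT", "seq3": "AC-TACGA"}
toy_matrix = distance_matrix(toy_alignment)
```

Note that this sketch returns a fraction between 0 and 1, whereas a tool may report the same quantity scaled per 100 positions.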

This method does a poor job of estimating phylogeny since p-distance does not use an evolutionary model and therefore underestimates the number of mutations, but it works very well to show the relative number of mismatches between sequences in an easily understood way (see Figure 3), which is what we needed for our bacterial genes. For larger sets of genes the trees can get quite dense and complicated, but this is not a problem when viewing them in a program such as FigTree.


Figure 3. Example of a tree constructed using the sequence similarity pipeline on our IMP dataset. A highly divergent sequence is seen on the right; this sequence turned out not to be an IMP gene. The NJ algorithm sometimes assigns negative branch lengths, as seen in the middle of the image where there is a branch pointing inwards rather than outwards, but this is not important for our analysis.

MUSCLE was used to align sequences. MAFFT is another alignment tool that worked just as well and found the same alignments as MUSCLE for datasets containing up to about 40 sequences. To construct the distance matrix we used the distmat function in EMBOSS with the gap weight set to 0. This function assigns a score of 1 to exact matches and a score of 0 to mismatches. The result is a matrix containing the average number of mismatches per 100 positions for each sequence pair.

The designed method allows you to exclude a number of positions in the beginning and end of the alignment before matrix calculations, since those regions often contain big gaps caused by sequences that are longer than the rest. They are therefore unnecessary to include if you want to shorten the time for calculations. Seaview was used to visualise the alignments so that we could determine how many positions we needed to exclude. The distance matrix was then converted to the phylip format and used to construct an NJ-tree using the ​neighbor​ function in PHYLIP. See Figure 4 for a flowchart of the pipeline and the programs used in each step during this project.


Figure 4. Flowchart of the sequence similarity analysis pipeline and the programs used during the project. The alignment visualisation step is recommended but not necessary.

Using this method we were able to identify sequences that had been found through the database searches but did not contain ESBL genes, and remove them from our datasets. We were also able to divide the larger sets of genes that had no target sequences (e.g. the OXA carbapenemases) into groups based on sequence similarity.

3.3 Finding target sequences

For finding the common target sequences between genes, two different Perl scripts were produced. They tackle the same problem but with two different approaches. The scripts will be described below and compared in the discussion part of the report.

3.3.1 Alternative I (matcher.pl)

The script matcher.pl imports sequences from a single fasta file containing the relevant genes. Based on the gene accession number the sequences are sorted and the first one is selected as the reference sequence. Knowing which sequence is the reference simplifies debugging of the script. This sequence will be the basis for both the generation and preliminary combination of segment 1 and the three subsegments of segment 2 into the full target sequences based on their spacing. The different segments are generated by stepping through the entire reference sequence and saving every segment of the specified length, as seen in Figure 5.

Figure 5. Example of segment generation of length 4 from a reference sequence.
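The stepping procedure can be sketched in a few lines of Python. This is an illustrative sliding window, not the Perl code of matcher.pl; the function name `generate_segments` and the example sequence are hypothetical. Start positions are kept alongside each segment because the later spacing check depends on them.

```python
def generate_segments(reference, length):
    """Return (start, segment) for every window of the given length."""
    return [
        (start, reference[start:start + length])
        for start in range(len(reference) - length + 1)
    ]


# Segments of length 4 from a short, made-up reference sequence.
segments = generate_segments("ACGTAC", 4)
```

A reference of length n yields n - length + 1 segments, so the number of candidates grows linearly with the gene length.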

Filtering of the segments is done using the Perl module String::Approx, which searches for matches between a segment and the rest of the sequences, taking the maximum number of allowed mismatches into account. A segment is discarded unless a match is found in every one of the sequences. When the segments are generated, their positions in the reference sequence are saved. These positions are then used as a guideline for the combination of the segments into full target sequences. Segments are combined into a target sequence only if the spacing between them is correct. In the same way that the individual segments were filtered, the full targets are now filtered by checking that the combination of segments is present in all sequences with the correct spacing and mismatches. Once this final filtering is done the remaining targets are saved for further use in the pipeline, i.e. screening against the background list. An overview of the process can be seen in Figure 6.

Figure 6. Flowchart showing the general idea behind Alternative I for finding target sequences.
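The segment filtering step can be illustrated with the Python sketch below. It is a simplification under stated assumptions: only substitutions are counted as mismatches (a plain Hamming distance), whereas String::Approx is a more general approximate matcher, and the names `find_approx` and `common_segments` as well as the toy sequences are hypothetical. The spacing check between segments is not included here.

```python
def find_approx(segment, sequence, max_mismatches):
    """Start positions where segment matches with at most max_mismatches substitutions."""
    hits = []
    for start in range(len(sequence) - len(segment) + 1):
        window = sequence[start:start + len(segment)]
        mismatches = sum(1 for a, b in zip(segment, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def common_segments(reference, others, length, max_mismatches):
    """Keep reference segments that match somewhere in every other sequence."""
    kept = []
    for start in range(len(reference) - length + 1):
        segment = reference[start:start + length]
        if all(find_approx(segment, seq, max_mismatches) for seq in others):
            kept.append((start, segment))
    return kept


# Made-up sequences, purely for illustration.
reference = "ACGTTGCA"
others = ["ACGTAGCA", "TTACGTTG"]
exact = common_segments(reference, others, 4, 0)
relaxed = common_segments(reference, others, 4, 1)
```

Allowing one mismatch keeps more segments than exact matching, which shows how strongly the mismatch limit influences the number of candidates.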


3.3.2 Alternative II (common_target_finder.pl)

There are many similarities between matcher.pl and common_target_finder.pl. This alternative also starts by loading a fasta file containing sequences and uses String::Approx to match different segments. In Figure 7 the basic idea of how common_target_finder.pl works is presented. Two sequences are first used to retrieve potential target sequences in a first comparison. The remaining sequences in the set are used in a second comparison to filter out targets that do not exist in all the sequences in the file. At the end, only targets that are found in all sequences (true targets) are kept.

The two sequences in the fasta file that come first alphabetically are used for the first comparison. The reason for this is that the script reads a fasta file of a sequence set and stores the sequences and their accession numbers in a hash. Since hashes are not sorted, there is a chance that the two sequences used in the first comparison will be the same sequence if they are chosen randomly from the hash. By ordering the hash and choosing the first two sequences, this problem is avoided. A more detailed explanation of common_target_finder.pl can be found in Appendix A7.

Figure 7. The general idea of how common_target_finder.pl works.
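The two-stage idea can be sketched in Python as follows. This is a simplified illustration and not the script itself: it uses exact substring matching instead of String::Approx, treats a target as a single word rather than two spaced segments, and the names `shared_words` and `common_targets` and the accession labels are hypothetical.

```python
def shared_words(seq_a, seq_b, length):
    """Words of the given length from seq_a that also occur in seq_b."""
    words = {seq_a[i:i + length] for i in range(len(seq_a) - length + 1)}
    return {word for word in words if word in seq_b}


def common_targets(sequences, length):
    """Two-stage search over a dict {accession: sequence} with at least two entries."""
    # Sorting the accessions makes the choice of the first two sequences deterministic.
    first, second, *rest = sorted(sequences)
    # First comparison: candidate targets from the first two sequences.
    candidates = shared_words(sequences[first], sequences[second], length)
    # Second comparison: drop candidates missing from any remaining sequence.
    for accession in rest:
        candidates = {w for w in candidates if w in sequences[accession]}
    return sorted(candidates)


# Made-up accessions and sequences.
toy_set = {"B2": "TTACGTTG", "A1": "ACGTTGCA", "C3": "GGACGTAA"}
true_targets = common_targets(toy_set, 4)
```

Because candidates can only be removed in the second comparison, the order in which the remaining sequences are processed does not affect the final result in this sketch.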

3.4 Screening targets against a background list (target_bg_check.pl)

Gene identification using probes is not done in a vacuum, and therefore you need to make sure that your probe targets will not appear in any genomes other than the ones you are looking for. We decided to use the BLAST algorithm to align our potential targets to the list of background genomes and then discard targets with hits meeting certain criteria. The downside of using BLAST is that you are not guaranteed to find all possible alignments for your query, but the upsides are that it is a fast algorithm, you have a lot of search parameters to customise and you get access to a lot of information about the hits.

We specifically used the BLAST+ suite from NCBI since it allows the construction of local databases and running searches against those. The genomes used to construct the databases were the reference genomes from NCBI for each species listed in table A2 in appendix 4. The command used to do this was makeblastdb -in genomes.fasta -parse_seqids -dbtype nucl, where genomes.fasta is the file containing all the sequences for the database.

The program we wrote does separate BLAST searches for segment 1 and 2 of the targets since they have different criteria on specificity and which genomes should be in the background list. Segment 1 is not allowed to exist in full in a few genomes. Segment 2 has a short middle subsegment (subsegment B) that is very important to the target-probe interaction and therefore should not exist in a larger set of genomes, while the flanking subsegments (A and C) are allowed to exist.

The requirements for segment 1 were fulfilled by running a BLAST search that only returns full alignments with no gaps or mismatches as hits. If any hits are returned for a segment the program discards it. This was done using the default BLAST+ options with the addition of only returning ungapped alignments.

The requirements for segment 2 were highly unlikely to let any potential probe targets pass the background check, since subsegment B is short enough that it is statistically expected to appear in any sufficiently large genome, such as the human genome. Tests with randomly generated sequences of the same length as subsegment B returned many identical hits with E-values (Expect value, a BLAST parameter that describes the number of hits expected for a specific sequence when searching a database (NCBI 2017)) between 10 000 and 50 000, meaning that the background database has a very high chance of containing identical matches of any sequence of that length.

This problem was solved by running a BLAST search on the whole segment where only about half or more of the segment needs to align to the background sequence and some mismatches are allowed for a hit to be returned. All hits are then checked for any appearances of subsegment B, and are discarded if it is found. This means that a segment will pass if the background genomes contain sequences identical to subsegment B, but it will fail if the background genomes contain subsegment B that is also flanked by parts of subsegment A or C.
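The decision logic for the two segments can be sketched in Python as below. This is a hedged illustration, not target_bg_check.pl: the function names are hypothetical, the BLAST searches themselves are not run here, and the hits are assumed to be available as lists of aligned background subsequences (for instance taken from the tabular output of a search). Reverse-strand hits are not considered in this sketch.

```python
def segment1_passes(segment1_hits):
    """Segment 1 passes only if the exact, ungapped search returned no hits."""
    return len(segment1_hits) == 0


def segment2_passes(subsegment_b, segment2_hits):
    """Segment 2 fails if any partial hit of the whole segment contains subsegment B."""
    for hit in segment2_hits:
        ungapped = hit.replace("-", "")  # drop alignment gap characters, if any
        if subsegment_b in ungapped:
            return False
    return True


def target_passes(segment1_hits, segment2_hits, subsegment_b):
    """A target is kept only if both segments pass the background check."""
    return segment1_passes(segment1_hits) and segment2_passes(subsegment_b, segment2_hits)


# Made-up hit sequences for illustration.
clean = target_passes([], ["AAGGTA"], "CGT")
flanked = target_passes([], ["AACGTA"], "CGT")
```

A background occurrence of subsegment B alone never reaches `segment2_passes`, because only hits covering roughly half or more of segment 2 are returned by the search in the first place.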

If both segment 1 and 2 of a potential target pass the background check, the target will be saved to a file and can be considered for probe design.

4. Evaluating the reliability of target finding scripts

Once we had produced the target finding scripts (common_target_finder.pl and matcher.pl), we wished to ensure that they would find matches in all the ways in which matches can occur in the gene sequences. For example, matches can be found with various numbers of mismatches and various distances between the segments. For this purpose we developed two scripts which can be used to test these cases.

4.1 Alternative I (matchTester.pl & matchControl.pl)

The basic idea behind this alternative for evaluating how well the target search works is that all the relevant information about the first sequence is stored while it is created. The subsequent sequences then randomly vary this stored information with respect to the number of mismatches (but not the distance between segments 1 and 2). Hence, the length of the sequences and the number of matches stay the same in all sequences. Once this procedure has been repeated a sufficient number of times, it is probable that all possible variations within the boundaries given in the script will have been tested.

In the first stage the script matchControl.pl randomly creates segments (either segment 1 or 2, or skipping either of these). Those which are created and appended to the first sequence are stored in an array of hashes as either segment 1 or segment 2. The areas immediately before and after the segments (flanking areas) are also stored, and the lengths of the flanking areas are randomly varied. Once these segments are created, the distances between them are checked in order to ascertain whether any segments fall within the predetermined classified distance criteria. If any segments fulfil the criteria after a certain number of iterations of additions, the creation of the sequence ends and it is printed to a file as the first fasta sequence.

In the next stage a number of mismatches is added to the segments which were found to be matches in the first stage. The number of mismatches is random and allowed to vary only within the range of the predetermined classified criteria. In addition all the flanking areas are completely randomized. This makes it probable that only the matches created in the first stage are found as matches, provided that the segments and flanking areas are long enough.

Once the mismatching and randomization are done, all the segments are printed out as the next fasta sequence. This process is then repeated 25 times so that in total there will be 26 fasta sequences in the file. Lastly the length of each fasta sequence (which is the same for all of them) is printed to STDOUT, as well as the details of the matches created in the first stage.

The script matchTester.pl first runs matchControl.pl, which creates a fasta file. This file is then used as the input file for the scripts which search for matches, i.e. common_target_finder.pl and matcher.pl. In summary, matchControl.pl creates sequences in which various numbers of mismatches are applied and various combinations of segments 1 and 2 exist. Then matchTester.pl runs the searching scripts on the output of matchControl.pl, and the results are printed to STDOUT and to fasta files.
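The principle of generating such test data can be sketched in Python as shown below. This is a simplified illustration and not matchControl.pl: it plants exactly one segment 1 and one segment 2 at fixed positions, and the function names, segment strings, lengths and mismatch limit are all made-up assumptions (the real criteria are classified).

```python
import random

BASES = "ACGT"


def random_dna(length, rng):
    """Random DNA string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def mutate(segment, n_mismatches, rng):
    """Substitute n_mismatches distinct positions with a different base."""
    chars = list(segment)
    for pos in rng.sample(range(len(chars)), n_mismatches):
        chars[pos] = rng.choice([b for b in BASES if b != chars[pos]])
    return "".join(chars)


def make_test_set(seg1, seg2, spacer_len, flank_len, max_mismatches, n_variants, seed=0):
    """First sequence carries exact segments; variants carry mutated copies.

    Flanks and spacer are re-randomized in every sequence, while the
    positions of the segments (and thus the spacing) stay fixed.
    """
    rng = random.Random(seed)

    def build(s1, s2):
        return (random_dna(flank_len, rng) + s1 + random_dna(spacer_len, rng)
                + s2 + random_dna(flank_len, rng))

    sequences = [build(seg1, seg2)]
    for _ in range(n_variants):
        v1 = mutate(seg1, rng.randint(0, max_mismatches), rng)
        v2 = mutate(seg2, rng.randint(0, max_mismatches), rng)
        sequences.append(build(v1, v2))
    return sequences


# 1 reference + 25 variants = 26 sequences, mirroring the counts in the text.
test_set = make_test_set("ACGTACGTAC", "TTGGCCAATT", 5, 20, 2, 25)
```

Fixing the seed makes a failing test case reproducible, which is useful when a search script misses a planted target.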

4.2 Alternative II (false_targets_finder.pl)

Alternative I checks whether a search script finds any targets in a set of sequences where targets should be found. false_targets_finder.pl works as a complement to alternative I, since it checks that only targets that exist in all sequences of a set (true targets) are generated by the search scripts. It does so by searching for each target sequence in the set it was retrieved from. Targets that do not exist in all sequences are called false targets. If such targets are generated by a search script, false_targets_finder.pl will find them and report that the search script has a bug. These false targets will be printed to a file.

The script starts by running either searcher script set to zero mismatches allowed in any segment, and then uses the searcher’s in- and outfile as arguments. By doing this, false_targets_finder.pl can use regular expressions to search for each target sequence from the searcher’s outfile in all sequences from its infile. If a target does not exist in one or more sequences in the set, it is stored in a hash. When all targets have been searched for, false_targets_finder.pl checks whether the hash is empty. An empty hash means that no false targets could be found, and the searcher is therefore said not to generate any false targets. However, if the hash is not empty, the searcher generates false targets and its results are not reliable.
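The check can be sketched in Python as below. This is an illustration under assumptions, not false_targets_finder.pl: a target is represented here as a hypothetical triple (segment 1, spacer length, segment 2), the spacer is matched as any run of characters of that length, and the function name and toy data are made up.

```python
import re


def find_false_targets(targets, sequences):
    """Map each target missing from some sequence to the sequences lacking it.

    An empty result means no false targets were found.
    """
    false_targets = {}
    for seg1, spacer_len, seg2 in targets:
        # Exact segments separated by exactly spacer_len arbitrary characters.
        pattern = re.compile(f"{re.escape(seg1)}.{{{spacer_len}}}{re.escape(seg2)}")
        missing = [name for name, seq in sequences.items() if not pattern.search(seq)]
        if missing:
            false_targets[(seg1, spacer_len, seg2)] = missing
    return false_targets


# Made-up sequences and targets.
toy_sequences = {"s1": "AAACGTTGGCA", "s2": "CCACGAAGGCT"}
toy_targets = [("ACG", 2, "GGC"), ("ACG", 3, "GCA")]
report = find_false_targets(toy_targets, toy_sequences)
```

In this toy case the first target is present in both sequences, while the second is found only in s1 and is therefore reported as a false target.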

5. Results

The developed pipeline was used to gather target sequences for five different ESBL types: KPC, IMP, NDM, OXA and VIM. The gathering is described in more depth in Appendix A5. Only common_target_finder.pl was used to search for targets since it generated more targets than matcher.pl. Table 1 shows the results for each type and how the number of targets is reduced after the targets have been filtered. No subgroup division was necessary for NDM, KPC or VIM. These types also had the smallest numbers of sequences in their datasets, which together with the fact that they show little variation can explain why no division was necessary.

Table 1. The number of target sequences for each set of gene types in the different steps of the process of retrieving target sequences. The letter s symbolises a subgroup of the gene type.

Gene type   Number of sequences in dataset   Number of found targets   Number of targets after filtering
KPC         24                               114302                    60852
NDM         18                               119902                    56772
VIM         49*                              403                       251
IMP_s1      26                               193                       23
IMP_s2      20                               160                       160
IMP_s3      10                               122                       52
OXA_s1      172                              10199                     890

* This group had sequences removed in the similarity analysis step, since they were not from the correct gene. The number in the table is after removing sequences.

VIM had a total number of 50 sequences, but one was removed in the similarity analysis when it was proven to be a sequence for a gene other than VIM.

IMP was divided into three subgroups: IMP_s1, IMP_s2 and IMP_s3. This was due to the difficulty in efficiently deducing where each sequence originates from. The different subgroups with the neighbor-joining tree for IMP can be seen in Appendix 2 Figure A1.

OXA had the largest set of sequences (480 sequences). When OXA’s subgroups were created it was evident that automation of the subgrouping was necessary. There was only one prominent subgroup for OXA, OXA_s1. This subgroup accounted for 172 sequences of the set. We attempted to create other subgroups, but quickly realized that this would take an unreasonable amount of time, since these were insignificant in size compared to OXA_s1. For this reason, only OXA_s1 was used as a subgroup for OXA.


6. Discussion and ideas for improvement

6.1 Improving and automating the similarity analysis

For this project we manually divided sequences with variations of the same gene type into subgroups using the trees created by the similarity analysis. This process could be greatly improved by a script that does this automatically at the end of the analysis.

One way to go about this would be to use the tree file produced by PHYLIP or the distance matrix produced by EMBOSS to calculate the mean (or median) and variation of the distances between sequences. This can then be used as a measuring stick to determine which sequences are different enough from the rest to be excluded completely and to decide how to divide a set of sequences into subgroups. A criterion for subgroup division could be to minimize distance variation within a subgroup until some stopping condition is fulfilled. This condition could be a minimum number of sequences allowed in a subgroup, a maximum number of subgroups or a specific maximum amount of variation.
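One possible starting point for such a script is sketched below in Python. It only covers the exclusion of strongly divergent sequences, not the subgroup division, and it is an untested idea rather than part of the pipeline. The function name `flag_outliers`, the threshold of 1.5 standard deviations and the toy matrix are arbitrary assumptions.

```python
from statistics import mean, stdev


def flag_outliers(matrix, n_std=1.5):
    """Names whose mean distance to all other sequences is unusually large.

    matrix is a nested dict {name: {other_name: distance}} with at least
    three sequences.
    """
    names = list(matrix)
    mean_dist = {
        n: mean(matrix[n][m] for m in names if m != n) for n in names
    }
    overall = mean(mean_dist.values())
    spread = stdev(mean_dist.values())
    return [n for n in names if mean_dist[n] > overall + n_std * spread]


# Toy matrix: four similar sequences (a-d) and one divergent sequence (e).
toy_names = ["a", "b", "c", "d", "e"]
toy_distances = {
    n: {m: 0.0 if n == m else (0.5 if "e" in (n, m) else 0.02) for m in toy_names}
    for n in toy_names
}
outliers = flag_outliers(toy_distances)
```

With very small datasets a single divergent sequence inflates the standard deviation, so the threshold has to be chosen with the dataset size in mind; a median-based measure might be more robust.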

Another method for subgroup division would be to take one row or column from the EMBOSS distance matrix (as each row or column contains distances from one sequence to every other sequence) as a one-dimensional dataset for cluster analysis using a fitting algorithm. Cross validation should be used to find the best number of subgroups and identify outliers. Outliers should be easily identified after clustering, since they will be alone in a cluster.

Since the distance matrix only has positive distances the cluster analysis would have to be performed on each row (or column) separately to identify the correct subgroups. Otherwise the sequences that are located close to the middle of the tree would make it appear as though sequences further out on the branches are close to each other even if they are not, since they have the same distance to the middle. Making sure to “center” the data on different sequences would prevent such errors. Maybe a clustering method that uses latent variables (such as PCA) would be able to identify the sequence subgroups outright using the distance matrix, in which case that would likely be the best method.

6.2 Evaluation of target sequence searcher scripts

Alternative I (​matcher.pl​) and alternative II (​common_target_finder.pl​) started out as two quite different takes on the same problem. As development went on, the technical basis for alternative I turned out to be unsustainable for the size of the segments and the number of mismatches. Due to this, both alternatives now use the Perl module String::Approx for matching segments to sequences since it is quick and easy to use.

There are still differences between the two alternatives, the main one being how the target sequences are filtered. Alternative I uses the String::Approx function aindex to find the position of each matching segment and thereby check the spacing. Unfortunately, aindex will match a segment too early if there is a repeating pattern. This gives a false indication of faulty spacing, and the target sequence is deleted even though it is actually a correct target. This was found too late in development to be addressed.


Alternative II does not use aindex but instead splits the sequence into substrings around the first matched segment and from there finds the full target sequence by matching the other segments in the substring. This means a larger number of targets are generated by alternative II. Neither of the alternatives seems to generate false target sequences though, as tested in our evaluation of the found matches.

Alternative II is not able to handle instances when there are duplicates of segment 1s and 2s. Since it creates words (see figure 5) that represent segment 2 of its reference sequence, all multiples of segment 2 are handled for the reference. However, when these words are matched against sequence 2 in the set, the script only evaluates one match, even though there could potentially be more than one match for a word. This means that when a match is found, one spacing segment (i.e. the segment in which segment 1s are allowed to be found for a segment 2) is cut out from sequence 2, corresponding to the matched word. Since there is a possibility that more matches can be found, more spacing segments could also be created. These segments are used to retrieve segment 1s for each segment 2. When only one spacing segment is created, potential segment 1s are missed.

When the remaining sequences in the dataset are used to filter out non-targets, alternative II uses the same function that matches words from sequence 1 with sequence 2. This results in too many targets being considered non-targets and they are therefore deleted. When the script iterates through each sequence, it searches for segment 2 for each target sequence. Since segment 2 can match in more than one position in the evaluated sequence, the issue remains. Only one match is considered. Therefore, only one spacing segment is created in which the script searches for segment 1s corresponding to the evaluated segment 2. When a segment 1 is not found, it is deleted. Since only one spacing segment is used, there is a chance that the deleted segment 1 exists in another matching segment.

This issue with alternative II was discovered too late in the project. A quick fix was attempted, but without success. The idea behind the fix was to change the subroutine that matches the words representing each possible segment 2 of one sequence against another sequence, so that more than one match position could be found if it existed. This was done by duplicating the sequence in which a word was searched. With the duplicate, a while loop could be used to check each position of a match in the sequence. To avoid returning the same position twice, the word was matched against the duplicated sequence in the while condition, and for each match the matched word in the duplicate was substituted with a non-matching word. This way, the positions in the duplicate and the original sequence are identical, but since all previous matches are overwritten in the duplicate that determines the while condition, each matching position is unique. The subsegments A and C of each match could then be retrieved from the original sequence. The criteria for the subsegments would then be checked in the same way as described earlier (see A7 in Appendix). By checking all possible matches of segment 2 in a sequence, all positions can be retrieved and used to create multiple spacing segments, which in turn potentially gives more segment 1s for each segment 2.
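The masking idea can be sketched as follows. This is a Python illustration with exact matching only, whereas the project's Perl subroutine used approximate matching through String::Approx, so it is not the script's actual code. The function name and the filler character are assumptions.

```python
def all_positions_by_masking(word, sequence, filler="N"):
    """Find every (non-overlapping) position of word in sequence by
    overwriting each match in a working copy before searching again."""
    if filler in word:
        raise ValueError("filler must not occur in the word")
    masked = sequence  # working copy; the original is left untouched
    positions = []
    while word in masked:
        pos = masked.index(word)
        positions.append(pos)
        # Overwrite the match so the next search cannot return it again.
        masked = masked[:pos] + filler * len(word) + masked[pos + len(word):]
    return positions

# Positions refer equally to the original sequence, since lengths are kept.
print(all_positions_by_masking("ACG", "TTACGTTACGA"))
```

Because each match is overwritten in full, overlapping matches are not reported. Whether that matters depends on how repetitive the segments are.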

Some other changes had to be made in the script as well. Since more than one position of a match of segment 2 could be returned after the fixed subroutine, more segments corresponding to each match of a segment 2 had to be created. This was solved by iterating through each position of a match for a segment 2 in the evaluated sequence and creating corresponding spacing segments. In the first comparison, each spacing segment that was created was matched with the words created from the spacing segment of the reference sequence. As before, all matches were segment 1s corresponding to a segment 2, that combined together would create potential target sequences.

In the second comparison, the changed subroutine was used to evaluate if a segment 2 of a target sequence existed in the remaining sequences of the set. This also meant that more than one position could be retrieved for a match of segment 2 in the sequence. Hence, more segments could also be created. This process was done similarly to the one in the first comparison (see A7 in Appendix). Since more potential spacing segments were created, each segment 1 corresponding to an evaluated segment 2 had to be searched for in each spacing segment. Segment 1s that were not found in the evaluated sequence were deleted, and the rest were saved.

Theoretically, the fix worked. However, when the script was executed repeatedly for a sequence set, the number of target sequences was not consistent, indicating that the fix introduced a bug in the script. Subsequently, when it was tested with false_targets_finder.pl, the result was that the new version of the script returned false target sequences. Since there was not enough time for debugging, the earlier version of the script was used to search for targets in the experiment. This means that fewer target sequences were found than actually exist, which in turn affected the subgrouping.

6.3 Subgrouping and geography

One interest that Q-linea had with this project was that we should find target sequences for various gene types in different geographical areas. Target sequences that were specific for ESBL genes in Europe and the US were of particular interest. The primary goal with this project was however to develop a method for gathering target sequences and then for us to use this method to retrieve the sequences. After this was completed, the project would focus more on how to solve the geographic targeting. As we started looking into this we soon noticed that the task at hand was too arduous to be completed with the resources we had at our disposal inside the confines of this project. The reason for this was primarily the lack of proper annotational information in NCBI’s sequence database entries on the geographical origins of the sequences, which means that finding geographical information would have entailed excessive amounts of manual work on our part.

The subgrouping for IMP was performed with the aim of obtaining a minimal number of subgroups for targets. By doing so, Q-linea would be able to check the geographical origin of the sequences within each subgroup and see which subgroup fits them best. For OXA only one subgroup was created since the other subgroups were considered to be too small in comparison, consisting of only 20-30 sequences when targets could finally be found. Since the eight different groups of the gene OXA are so distantly related (Walther-Rasmussen & Høiby 2006), each group would probably have to be considered a separate set of sequences, which in turn would result in eight different sets of sequences to be evaluated independently. For each of these sets, subgrouping could then be done and target sequences would be retrieved for more sequences.


6.4 Improving evaluation of match finding scripts

On the whole the evaluation scripts for finding matches seem to deliver what is expected of them. Both produce seemingly correct results for the test cases in which potential targets should be found or not found with match searching scripts. However, there are still some improvements yet to be implemented. For matchTester.pl one of these is to make the script loop enough times so that it is probable that all different variations of mismatches and distances between segments are covered, and to make a readable summary of the success or failure of the searching scripts. In addition, the script should be able to track which types of variations have been tested and whether the tests failed or not. One of the first steps in this could be to change matchControl.pl into a package which can return the match list. This would make it easier to see if the match finding scripts really found that which was there to be found.

false_targets_finder.pl can also be improved. It can only test searcher scripts that are set to allow zero mismatches in each segment. One improvement of false_targets_finder.pl would therefore be to change the script so that the search can allow mismatches, and as a result search for all targets that the searchers can potentially retrieve. As a fast and easy control, however, false_targets_finder.pl can give a good indication of whether a searcher functions as it should.
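A zero-mismatch control of this kind can be sketched in a few lines. The Python below is not the project's false_targets_finder.pl. It assumes, purely for illustration, that a target is represented as a tuple of segment strings and that a target is false if any of its segments is missing verbatim from any sequence in the set. It ignores segment order and spacing.

```python
def find_false_targets(targets, sequences):
    """Return targets that cannot be common to all sequences because at
    least one segment is absent (exact match) from at least one sequence."""
    false_targets = []
    for target in targets:
        for seq in sequences:
            if any(segment not in seq for segment in target):
                false_targets.append(target)
                break  # one failing sequence is enough
    return false_targets

# Hypothetical toy data: two sequences and two candidate targets.
sequences = ["AACCGGTT", "TTAACCGG"]
targets = [("AACC", "GG"), ("AACC", "GT")]
print(find_false_targets(targets, sequences))
```

Allowing mismatches, as suggested above, would mean replacing the exact `in` test with an approximate comparison.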

6.5 A better background check procedure

The initial plan for the background check was that it should be done after the first potential target sequences had been found, using the first two gene sequences of a dataset before those targets were searched for in the rest of the sequences. This would speed up everything that is done after the background check because there would be fewer targets left to go through. However, doing the background check after the entire target finding process would instead speed up the background check, also because there would be fewer targets to go through. But because of some problems with integration of the background check into the target finding scripts, the second option was chosen for this project. The best option could depend on the specifics of the input dataset (a large or small dataset, more or less divergent sequences) and this is probably worth investigating further, but it could not be done in this project because of time constraints.

The script itself has been somewhat optimized for speed, most importantly by making sure that there are no redundant BLAST searches done since those account for most of the running time. There are probably minor things that could be changed to make the script even better timewise. More important improvements would be to provide the user with more and better information about the target segments, such as how many matches are found in the background list, how many mismatches they contain and information about which species contain the matches. This information can be found by manually looking at the BLAST output files for now.
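One way such a summary could be produced is sketched below in Python (the project scripts are Perl). It assumes the BLAST output is in the default 12-column tabular format obtained with `-outfmt 6`, where columns 1, 2 and 5 are the query ID, subject ID and number of mismatches. The project's output files may use another format, and `summarise_hits` is a hypothetical helper.

```python
def summarise_hits(lines):
    """Summarise tabular BLAST hits per query: number of hits, lowest
    mismatch count and the set of subject sequences that were hit."""
    summary = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, mismatches = fields[0], fields[1], int(fields[4])
        entry = summary.setdefault(
            query, {"hits": 0, "min_mismatches": mismatches, "subjects": set()}
        )
        entry["hits"] += 1
        entry["min_mismatches"] = min(entry["min_mismatches"], mismatches)
        entry["subjects"].add(subject)
    return summary

# Made-up example rows (IDs and values are illustrative only).
example = [
    "seg1_7\tchrA\t95.00\t20\t1\t0\t1\t20\t100\t119\t0.01\t32.0",
    "seg1_7\tchrB\t90.00\t20\t2\t0\t1\t20\t500\t519\t0.10\t28.0",
]
print(summarise_hits(example))
```

Mapping subject IDs to species would require a lookup table built from the background fasta headers.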

BLAST is a very versatile tool for finding alignments of sequences in a database, and by running it using a command-line application you get access to many options for input, output and alignment criteria. The database used in this project was comparatively small compared with what a bigger project might want to use, so no extra options regarding the database had to be used. For a larger database there are options to restrict the search using GI numbers on local databases or Entrez queries on remote databases.

There is of course the option to use a different method than BLAST to screen the potential targets. The best reason not to use BLAST is that it has proved to be difficult to specify an exact number of mismatches to use as a threshold for keeping or discarding segments, or to specify patterns where parts of the query can have more or fewer mismatches (as in segment 2 of our targets) in an alignment. BLAST is also a heuristic method which means it cannot guarantee that it finds all alignments. Using a non-heuristic method would be the best option precision-wise, but that usually means that the time for screening targets will increase a lot for a larger database.
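To make the trade-off concrete, a minimal non-heuristic screen is sketched below in Python. It slides the segment along each background sequence and counts mismatches exactly, so a precise threshold can be applied. The sketch ignores insertions, deletions and the reverse strand, and the function names and the `max_allowed` parameter are assumptions, not part of target_bg_check.pl.

```python
def min_mismatches(segment, background):
    """Smallest number of mismatches between segment and any window of
    the same length in background (len(segment) if no window fits)."""
    m = len(segment)
    best = m
    for start in range(len(background) - m + 1):
        window = background[start:start + m]
        best = min(best, sum(a != b for a, b in zip(segment, window)))
    return best

def passes_screen(segment, backgrounds, max_allowed):
    """Keep a segment only if every background sequence differs from it
    by more than max_allowed mismatches at every position."""
    return all(min_mismatches(segment, bg) > max_allowed for bg in backgrounds)
```

The running time grows with segment length times background length for every segment, which illustrates why this approach becomes slow for genome-sized databases.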

7. Ethical considerations and challenges

During the course of this project we considered and analysed the impact of the results and their possible repercussions from an ethical point of view. Analysing the ethical aspects of projects requires reflecting over society, nature, environmental sustainability, individual people and human rights. This type of analysis is a difficult one to make and there is often no clear answer to a certain problem, since there are so many aspects to contemplate aside from just the potential economic or scientific benefits.

Aspects such as risks and benefits, moral values and duties, rights and justice have to be taken into consideration and analysed from different points of view. Who should be held responsible? Whose rights do we need to consider? Who will reap the benefits and who will be affected by the fallout if something should go wrong? Is the product or technique safe? Is it just, in terms of accessibility for all the people of the world? Is it sustainable to use with no risks of disrupting any balances? These are questions that we have pondered since the beginning of our project and we have done our best to consider all the aspects that are relevant, both positive and negative ones.

Our project has been to develop a pipeline for finding and correctly identifying target sequences in different ESBL genes that can be present in the genomes of bacteria that can cause sepsis. These targets will then hopefully be useful to Q-linea as probe targets for the technology they are developing, which will result in a fast and sensitive method for sepsis diagnostics. At first glance this would seem to be a purely righteous pursuit since it would help in saving countless lives and also limit the use of ineffective antibiotics, something that is quite problematic since it promotes antibiotic resistance in bacterial strains. However, there are some aspects of this project that may not be entirely positive from an ethical point of view, such as ones regarding human rights, equality and sustainability.

7.1 Geographical and financial justice

Because the number of different types of ESBL genes and the number of variations that occur within many of the groups are so large, it would not be possible for us to find probe targets within each one of these groups over the course of only three months. Therefore we had to categorize some of the genes into larger groups based on their degree of resistance, but also to some extent based on their geographical locations.

The groups of ESBL genes that we chose to prioritize were selected with regard to their threat level and to the regions where Q-linea would stand to make a larger profit, even though these areas might not be the places in which the technique is most needed. This could be seen as ethically questionable where human rights and equality are concerned, since it means that wealthier individuals in developed countries will reap the benefits of the technology even though they might not need it as badly as people from developing countries do. However, this is often necessary for companies developing new products and technology so that they can make a profit and establish themselves in parts of the scientific world where sponsorship for their product might be obtained. This will in turn enable them to lower the cost for their product and make it available in countries that would not have been able to afford it otherwise.

7.2 Ethical and financial conflicts

Since antibiotic resistance is an issue affecting the whole world it might seem unethical for companies to withhold information that might help to solve the problem. As members of this group working on this specific project we had to sign a non-disclosure agreement, which upon signing meant that we could not spread any type of information regarding the specific design criteria of the target probes.

However, this kind of information might hold a significant value for solving the antibiotic resistance issue, which could possibly be of great use to other researchers working on solving this problem. Is it ethically acceptable for companies to knowingly withhold valuable information regarding an issue that affects all of society on a global scale, that we should all work together on to solve as quickly as possible? From this point of view it might seem wrong for a company to keep this information to themselves for economical reasons.

On the other hand we have to consider the fact that ensuring that a profit can be made from a product is necessary for it to be developed in the first place, and that once the initial technology is on the market the price of manufacture can be significantly lowered so that even low income countries can afford it. You need a large amount of funding for a new product to be made, but no one would be interested in investing in a product that might not generate a profit simply because of someone else beating you to the punch. It is therefore a necessary thing that we must face within scientific research in the corporate world. However, it is always worth discussing when it is necessary to keep information confidential and when it is not.

7.3 Overconsumption of antibiotics

Something that is often brought up in discussions regarding antibiotic resistance is the paradox that by treating patients with antibiotics for bacterial infections, we are subsequently making the bacteria more resistant. Therefore it is important to consider both the positive and negative effects of antibiotic use from an ethical standpoint. The dilemma is to have the knowledge and means to save a person’s life in the case of serious infections while knowing that by doing this you are further contributing to the growing resistance problem.

Where should we draw the line regarding use of antibiotics? It is a complicated matter, and the main problem stems from the overuse of antibiotics that has been steadily getting worse over the last few decades. This high consumption, along with the use of antibiotics for purposes other than bacterial infections, has gotten us to this point, an important example being the meat industry where animals are treated with antibiotics to achieve faster growth and prevent disease from spreading. Another example is the misuse of antibiotics occurring in some countries such as China, where parents demand that their children should be given antibiotics for common colds. Since these are viral infections the treatment does not help the children, but rather worsens the antibiotic resistance problem (Kan et al. 2015).

In Sweden the use of antibiotics has been well regulated and limited in comparison to other countries in Europe, which becomes apparent from studies analyzing the geographical spread of different resistances (Glasner et al. 2013). This does not mean that antibiotics are unused, or even always used well, in Sweden, but it comes down to one keyword: restraint. It is not simply the usage of antibiotics that causes resistance problems, but the far too high frequency with which bacteria are exposed to them. This is why the work being done by companies such as Q-linea is both very important and ethically sound. By creating a technology for fast and specific diagnosis of the resistance genes present in bacteria causing sepsis, both the number of ineffective antibiotic treatments and the amounts used will decrease, which is the answer to the problem.

7.4 Knowledge, the way to success

Unfortunately there is a large and widespread ignorance regarding antibiotics and antibiotic resistance, which makes things even more difficult. If a sizable part of the world’s population is not aware of the problem, how can we solve it? Consequently there lies an enormous responsibility on the shoulders of those who possess the knowledge to educate the public about the dangers of overuse and how we have to take measures to change for the better. New antibiotics and faster diagnostics will both have an impact on the problem of course, but will not solve the problem itself. The focus of all research today is on finding new possibilities and ways to sustain our dependence. But why are we not using the tools we already have, i.e. the power of widespread knowledge? It can therefore be questioned whether Q-linea, our project, and all other researchers are doing everything they can to solve the problem. Informing the public should be of great interest in every project directed at solving the problem of antibiotic resistance. We think that is our ethical responsibility as engineers.


Acknowledgements

Lastly we would like to express our thanks to Q-linea for giving us the opportunity to work on this project, and for all their advice and help with different problems we encountered during our work. We would also like to thank Jan Andersson and Magnus Lundgren for all their support during the course of this project.


References

Ambler. 1980. The structure of beta-lactamases. Philosophical Transactions of the Royal Society of London Series B, Biological Sciences 289: 321–331.

Babic M, Hujer AM, Bonomo RA. 2006. What’s new in antibiotic resistance? Focus on beta-lactamases. Drug Resistance Updates 9: 142–156.

Bacterial Antimicrobial Resistance Reference Gene (ID 313047) - BioProject - NCBI. WWW document: https://www.ncbi.nlm.nih.gov/bioproject/?term=Bacterial+Antimicrobial+Resistance+Reference+Gene+Database. Retrieved 2017-05-23.

Bataar O, Lundeg G, Tsenddorj G, Jochberger S, Grander W, Baelani I, Wilson I, Baker T, Dünser MW. 2010. Nationwide survey on resource availability for implementing current sepsis guidelines in Mongolia. Bulletin of the World Health Organization 88: 839–846.

Bush, Jacoby GA. 2010. Updated functional classification of beta-lactamases. Antimicrobial Agents and Chemotherapy 54: 969–976.

Glasner C, Albiger B, Buist G, Tambić Andrasević A, Canton R, Carmeli Y, Friedrich AW, Giske CG, Glupczynski Y, Gniadkowski M, Livermore DM, Nordmann P, Poirel L, Rossolini GM, Seifert H, Vatopoulos A, Walsh T, Woodford N, Donker T, Monnet DL, Grundmann H, European Survey on Carbapenemase-Producing Enterobacteriaceae (EuSCAPE) Working Group. 2013. Carbapenemase-producing Enterobacteriaceae in Europe: a survey among national experts from 39 countries, February 2013. Euro Surveillance: Bulletin Europeen Sur Les Maladies Transmissibles = European Communicable Disease Bulletin 18:

National Center for Biotechnology Information, U.S. National Library of Medicine. 2008. Options for the command-line applications. National Center for Biotechnology Information (US).

Kaiser JR, Cassat JE, Lewno MJ. 2002. Should antibiotics be discontinued at 48 hours for negative late-onset sepsis evaluations in the neonatal intensive care unit? Journal of Perinatology: Official Journal of the California Perinatal Association 22: 445–447.

Kan J, Zhu X, Wang T, Lu R, Spencer PS. 2015. Chinese patient demand for intravenous therapy: a preliminary survey. The Lancet 386: S61.

Kumar A, Roberts D, Wood KE, Light B, Parrillo JE, Sharma S, Suppes R, Feinstein D, Zanotti S, Taiberg L, Gurka D, Kumar A, Cheang M. 2006. Duration of hypotension before initiation of effective antimicrobial therapy is the critical determinant of survival in human septic shock. Critical Care Medicine 34: 1589–1596.

NCBI - BLAST - Frequently Asked Questions. WWW document: https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=FAQ#expect. Retrieved 2017-06-01.

Walther-Rasmussen J, Høiby N. 2006. OXA-type carbapenemases. The Journal of Antimicrobial Chemotherapy 57: 373–383.

Zvelebil, Baum JO. 2008. Understanding Bioinformatics. Garland Science, Taylor & Francis Group, LLC, New York, USA.


Appendices

A1. Different types of ESBLs

Table A1.

Abbreviation | Description
KPC | Klebsiella pneumoniae Carbapenemases
NDM | New Delhi Metallo-Betalactamases
OXA | Oxacillinase-type Betalactamases
VIM | Verona Integron-encoded Metallo-Betalactamases
IMP | Metallo-Betalactamases
CTX-M | Cefotaxime and Ceftriaxone resistant. Ceftazidime sensitive.
SHV | Resistance against broad spectrum penicillins such as ampicillin, ticarcillin and piperacillin but not against oxyimino-substituted cephalosporins.
TEM | Can hydrolyze penicillins and first generation cephalosporins but not oxyimino cephalosporins.
CMY | Cephamycin resistance
ACC | Aminoglycoside resistance


A2. Neighbor-joining trees of IMP and OXA with marked subgroups

Figure A1. The Neighbor-Joining tree for IMP genes with the 3 subgroups used. The sequence on the right was removed from the dataset as it was not an IMP gene.


Figure A2. The Neighbor-Joining tree for OXA genes with the subgroup that was used.


A3. Neighbor-joining trees of VIM, NDM and KPC

Figure A3. The tree for all variants of VIM genes used.

Figure A4. The tree for all variants of NDM genes used.


Figure A5. The tree for all variants of KPC genes used.


A4. Background genomes

Table A2. Species and the specific genomes used in the background databases. For bacteria only chromosomal DNA was included.

Species | NCBI reference genome | Database
Candida albicans (eight chromosomes) | Candida albicans SC5314 | Both
Homo sapiens (24 chromosomes and mitochondrial DNA) | Homo sapiens GRCh38.p10 | Both
Escherichia coli | Escherichia coli str. K-12 substr. MG1655 | Segment 2
Enterococcus faecalis | Enterococcus faecalis V583 | Segment 2
Enterococcus faecium | Enterococcus faecium DO | Segment 2
Klebsiella pneumoniae | Klebsiella pneumoniae subsp. pneumoniae HS11286 | Segment 2
Pseudomonas aeruginosa | Pseudomonas aeruginosa PAO1 | Segment 2
Staphylococcus aureus | Staphylococcus aureus subsp. aureus NCTC 8325 | Segment 2
Streptococcus pneumoniae | Streptococcus pneumoniae R6 | Segment 2


A5. Step-by-step guide to retrieving target sequences

This guide will assume you are using the command prompt in Linux.

A5.1 Retrieving gene sequences

Sequence sets can be created using dataProd.pl. This script must be located in the same folder as deRedunancifier.pl and a text file containing all search terms. For instance, to create a KPC fasta file, the search terms are:

nucleotide,313047[BioProject]+AND+(bacteria[filter]+AND+refseq[filter])+AND+blaKPC[title]

nucleotide,313047[BioProject]+AND+(bacteria[filter]+AND+refseq[filter])+AND+KPC*[title]

It is important that both lines are included. By substituting KPC with names of other gene types fasta files for other genes can also be created. More than one fasta file can be created simultaneously by adding search terms for several genes in the same file.

Run the program by entering perl dataProd.pl INFILE where INFILE is replaced by the name of the file containing the search terms. The program will create separate fasta files for each gene.
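When many gene types are needed, the search term file can be generated rather than typed by hand. The Python sketch below only reproduces the two-line pattern shown above for KPC. The helper name `search_terms`, and the assumption that the file holds one term per line, are ours and should be checked against dataProd.pl.

```python
TEMPLATE = (
    "nucleotide,313047[BioProject]+AND+"
    "(bacteria[filter]+AND+refseq[filter])+AND+{pattern}[title]"
)

def search_terms(genes):
    """Return the two search term lines (bla<GENE> and <GENE>*) per gene."""
    lines = []
    for gene in genes:
        lines.append(TEMPLATE.format(pattern="bla" + gene))
        lines.append(TEMPLATE.format(pattern=gene + "*"))
    return lines

# Print the terms for two gene types; redirect the output to a text file
# and pass that file as INFILE.
print("\n".join(search_terms(["KPC", "NDM"])))
```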

A5.2 Similarity analysis

To run the sequence similarity analysis you need the two scripts named treemaker.pl and emboss2PhyMatrix.pl as well as the gene sequences in fasta format. All three files should be located in the same folder. The programs MUSCLE, Seaview, EMBOSS and PHYLIP must also be installed.

Start the analysis by typing ​perl treemaker.pl​ in the command line. Type ​Y​ and press enter when prompted. After this enter the name of the file containing the gene sequences and the name of the job (this affects what the output files will be named). This starts the sequence alignment using MUSCLE.

After the alignment is done, Seaview will open in a new window to visualise the alignment and you will be prompted to input the start and end positions for distance matrix calculations. An example alignment is shown in Figure A6. We suggest positions where most sequences have started or ended, so in the case shown in Figure A6 the start position should be at the A nucleotide marked with a red dot. Click on one of the nucleotides to see its position number. Do the same thing to choose the end position. After doing this you can close Seaview. If you want to skip this step simply input ​1 as the start position and the last position in the alignment as the end position.


Figure A6. A section of the beginning of an alignment as seen in Seaview. The red dot marks the suggested start position for matrix calculations for this alignment.

The program will now calculate the p-distance matrix using EMBOSS and convert it to the PHYLIP format using ​emboss2PhyMatrix.pl. ​Once this is done you will be presented with the settings screen for the ​neighbor​ module of PHYLIP. Type ​R​ and press enter to choose the upper-triangular data matrix, then type ​Y​ to start building the tree.
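For readers unfamiliar with the p-distance, it is the proportion of compared alignment positions at which two sequences differ. The Python sketch below shows the calculation for one pair of aligned sequences between a chosen start and end position (1-based and inclusive, as read off in Seaview). Skipping positions with a gap in either sequence is an assumption of this sketch and may differ from how EMBOSS treats gaps.

```python
def p_distance(seq_a, seq_b, start, end):
    """Proportion of differing sites between two aligned sequences within
    positions start..end (1-based, inclusive), ignoring gapped positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = differences = 0
    for a, b in zip(seq_a[start - 1:end], seq_b[start - 1:end]):
        if a == "-" or b == "-":
            continue  # assumption: gapped positions are not compared
        compared += 1
        if a.upper() != b.upper():
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions in the chosen window")
    return differences / compared

# One difference among five comparable positions.
print(p_distance("ACGT-A", "ACCTTA", 1, 6))
```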

The program then lists the output files with descriptions. Three files need to be renamed or moved to a new folder before running the analysis again, or they will be overwritten. These are the files named infile, outfile and outtree.

To visualize the tree, use a program that reads Newick files to open ​outtree. ​If there are sequences that stick out compared to the rest you can search for them in the original fasta file using the accession number to see if they are sequences from a wrong gene that accidentally got picked up by the gene retrieval step. If so, remove them from the data set. Otherwise, leave the dataset intact but return to the tree if you need to divide the dataset into smaller groups later.

A5.3 Search for targets

common_target_finder.pl and ​matcher.pl​ are executed in the same way. Each script needs to be in the same folder as the fasta file of the sequence set that targets are searched for. The fasta file and the name of a file containing all found targets are used as arguments. The scripts will create the target file.

During the run, common_target_finder.pl prints the number of targets found after comparing each sequence to STDOUT. This way, the user can see if a sequence in the set greatly reduces the number of target sequences. Runtime for common_target_finder.pl is roughly a minute, depending on the size of the sequence set, and it is printed at the end of the run together with the number of targets found.

While matcher.pl runs it will print the different steps: Comparing, Combining or Filtering. If no matches are found an error will be thrown, explaining which segment is limiting or in which step of the filtering the number of targets was reduced to zero. Runtime for matcher.pl is roughly a minute, the exact runtime being printed to STDOUT when finished.

Both scripts are called in the same way from the command line by typing either ​perl matcher.pl INFILE OUTFILE or​ perl common_target_finder.pl INFILE OUTFILE. ​This will find the common targets for the input sequence set in the fasta file given as argument INFILE and print the found targets to the outfile given as argument OUTFILE.

A5.4 Filter targets against the background list

Before starting you need to have the files containing the background genomes in fasta format, named S1_bg.fasta ​for the background for segment 1 and ​S2_bg.fasta​ for segment 2. You also need the target_bg_check.pl​ script, the file containing the target sequences retrieved in the previous step, as well as having BLAST+ and BioPerl installed.

To create local BLAST databases run ​makeblastdb -in S1_bg.fasta -parse_seqids -dbtype nucl through the command line. This creates the database for segment 1. Repeat for ​S2_bg.fasta.


To start the background screening type perl target_bg_check.pl INFILE OUTFILE where INFILE is replaced by the file containing the target sequences and OUTFILE is the file where you want the final targets.

The background screening can take a few minutes. While it runs it will print on the screen which segment is currently being screened and which segment sequences (represented by a number identifier) pass or fail. At the end you will see how many target sequences are left and some information about the output files created by the program.

If you want to keep the BLAST input or output files for later use, rename them or move them to a new folder before starting a new background screening. These are the files named blast1_in.txt, blast1_out.txt, blast2_in.txt and blast2_out.txt. We recommend keeping the BLAST output files, as they can be used to get more information about the target sequences.
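Moving the files can be automated. The following Python sketch moves whichever of the four files exist into a subfolder of the working folder; the function name archive_blast_files and the idea of one subfolder per run are assumptions made for this example, not part of the pipeline.

```python
import shutil
from pathlib import Path

# File names as listed above.
BLAST_FILES = ("blast1_in.txt", "blast1_out.txt", "blast2_in.txt", "blast2_out.txt")


def archive_blast_files(work_dir, archive_name):
    """Move existing BLAST input/output files into work_dir/archive_name.

    Returns the names of the files that were moved.
    """
    work_dir = Path(work_dir)
    archive_dir = work_dir / archive_name
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in BLAST_FILES:
        source = work_dir / name
        if source.is_file():
            shutil.move(str(source), str(archive_dir / name))
            moved.append(name)
    return moved
```

For example, archive_blast_files(".", "screening_1") would move the files from the current folder into a subfolder named screening_1 before the next screening is started.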

A5.5 Subgrouping (if no targets are found)

If no targets can be found during step 5.3, or if they are all removed during step 5.4, subgrouping is necessary. Subgrouping means dividing the original set of sequences into smaller groups based on sequence similarity.

A subgroup is created by viewing the NJ-tree created during step 5.2 (named outtree) in a program such as FigTree, selecting a number of sequences and copying them to a new text file. The new text file will be in Newick format, so to extract the sequence accession numbers you can use a text editor with regular expression search, such as Sublime Text 2. To select all accession numbers, use a search expression such as "NG_\d+".

The selected accession numbers are then copied to a new file and saved.
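The same extraction can be done with a short script instead of a text editor. The Python sketch below applies the search expression NG_\d+ to Newick text; the function name extract_accessions and the accession numbers in the usage example are made up for illustration. Note that this expression drops any version suffix such as ".1".

```python
import re

# Same search expression as suggested for the text editor.
ACCESSION_PATTERN = re.compile(r"NG_\d+")


def extract_accessions(newick_text):
    """Return unique accession numbers in order of first appearance."""
    seen = set()
    accessions = []
    for accession in ACCESSION_PATTERN.findall(newick_text):
        if accession not in seen:
            seen.add(accession)
            accessions.append(accession)
    return accessions


if __name__ == "__main__":
    # Hypothetical subtree copied from the tree viewer.
    subtree = "((NG_000001.1:0.01,NG_000002.1:0.02):0.05,NG_000003.1:0.1);"
    print("\n".join(extract_accessions(subtree)))
```

Writing one accession number per line gives a file that can be saved as the accession number file.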

The script get_seqs_for_sub.pl is used to extract the sequences from the original fasta file. Before running this script you need the original fasta file containing gene sequences as well as the file containing accession numbers in the same folder. Type perl get_seqs_for_sub.pl GENES ACCESSION OUTFILE, where GENES is replaced by the original fasta file, ACCESSION is replaced by the accession number file and OUTFILE is the file you want the subgroup sequences in.
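To make the extraction step concrete, the following Python sketch selects fasta records by accession number. It is not the code of get_seqs_for_sub.pl, whose internals are not shown here; the function names and the assumption that each fasta header contains an accession number of the form NG_ followed by digits are made for this example only.

```python
import re

ACCESSION_PATTERN = re.compile(r"NG_\d+")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_by_accession(fasta_lines, accessions):
    """Keep the records whose header accession is in the wanted set."""
    wanted = set(accessions)
    selected = []
    for header, sequence in read_fasta(fasta_lines):
        match = ACCESSION_PATTERN.search(header)
        # Compare the whole accession, so NG_000001 does not match NG_000012.
        if match and match.group(0) in wanted:
            selected.append((header, sequence))
    return selected
```

With open file handles, select_by_accession(open(genes_file), accession_list) would return the records to write to the subgroup fasta file; genes_file and accession_list are placeholders for the user's own inputs.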

Extract the accession numbers and sequences for each subgroup you want to create and run the analysis from step 5.3 for each subgroup. If no targets are found for some or all subgroups, redo step 5.5 with smaller subgroups.

You can also use this method to exclude individual sequences that drastically lower the number of found targets, by leaving them out when copying from the tree in FigTree.

A5.6 Checking how matcher.pl and common_target_finder.pl find matches

A5.6.1 Running matchTester.pl

To check how well matcher.pl and common_target_finder.pl find matches, matchTester.pl can be used. It requires that the directories of both match finding scripts and of matchControl.pl are correctly specified in the script code. Run matchTester.pl by typing perl matchTester.pl.
