Dual RNA-seq analysis of host-pathogeninteraction in Eimeria infection of chickensArnar Kári Sigurðarson Sandholt

(1)

Dual RNA-seq analysis of host-pathogen interaction in Eimeria infection of chickens

Arnar Kári Sigurðarson Sandholt

Degree project inbioinformatics, 2020

Examensarbete ibioinformatik 30 hp tillmasterexamen, 2020

Biology Education Centre, Uppsala University, and Statens Veterinärmedicinska Anstalt

(2)

(3)

Abstract

Eimeria tenella is a eukaryotic, intracellular parasite that, along with six other Eimeria species, causes coccidiosis in chickens. This disease can result in weight loss or even death and is estimated to cause 2 billion euros of damages to the chicken industry each year. While much is known of the life cycle of E. tenella in the chicken, less is known about molecular mechanisms of infection and the chicken immune response. In this study, we produced a pipeline for dual RNA-sequencing analysis of a mixed chicken and E. tenella dataset. We then carried out an analysis on an in vitro infection of the chicken macrophage HD-11 cell line. This was followed by a differential expression analysis across six time points, 2, 4, 12, 24, 48, and 72 hours post-infection, in order to elucidate these mechanisms.

The results showed clear patterns of expression for the chicken immune genes, with strong down-regulation of genes across the immune system at 24 hours and a repetition of early patterns at 72 hours, indicating that reinfection by a second generation of parasite cells was occurring. Several genes that may have important roles in the immune reaction of the chicken were identified, such as MRC2, ITGB3 and ITGA9, along with genes with known roles, such as TLR15. The expression of surface antigen genes in E. tenella was also examined, showing a clear upregulation in the late stages of merogony, suggesting important roles for merozoites.

Finally, a co-expression analysis was carried out, showing considerable co-expression among the two organisms. One of the gene co-expression networks identified appeared to be

enriched with both infection specific genes from E. tenella and chicken immune genes. These results, along with the pipeline, will be used in further studies on E. tenella infections and bring us closer to the eventual goal of a vaccine for coccidiosis.

(4)

(5)

Revealing the mechanisms of coccidiosis - the path to healthier, happier chickens

Popular Science Summary Arnar Kári Sigurðarson Sandholt

The chicken is one of the most important animals in agriculture and is consumed world-wide.

It carries several diseases that can infect humans, the best known one likely being salmonellosis. However, the chicken also suffers from several diseases that do not affect humans, one of which is coccidiosis. This disease is caused by single-cell parasites of genus Eimeria that infect chicken cells in order to breed. This disease ranges from being practically unnoticeable, as the chicken shows no outward symptoms, to very serious symptoms,

potentially killing the chicken. Regardless of symptoms, however, the parasite will multiply in the chicken and be carried out with its feces. The severity of coccidiosis depends on species of Eimeria as well as the number of oocysts, eggs from the parasite, the chicken eats. These oocysts are very difficult to get rid of as they are resistant to many types of disinfectant and reliable methods to get rid of them are difficult to implement in an intensive farming

environment. While the disease is not always severe, it does result in weaker, slower growing chickens which translates to losses for the chicken industry, estimated to be around 2 billion euros per year and resulting in more expensive chicken meat for consumers.

There are currently two ways of combating coccidiosis: vaccination and anticoccidials.

Vaccination is an ideal way to fight the disease, but the downside is that current vaccines require live Eimeria cells. An infection by any Eimeria species gives the chicken immunity to that species, so an infection by a weakened parasite will render a chicken immune to parasites that can cause the disease. However, live parasites can currently only be produced in live chickens. Because of this, the production of large amounts of vaccines is difficult and expensive as well as being problematic from an ethical standpoint. The other method is anticoccidials, chemical antibiotics that are administered through the chicken feed. Due to the scaling problems with vaccination production, this has been the main way of fighting

coccidiosis. However, the Eimeria species are becoming resistant to these antibiotics. This, combined with a growing negative sentiment among consumers towards use of chemicals in meat production, has led to a growing demand for a better form of vaccines that don't require live parasites.

There have been numerous attempts to address this demand, but so far, no viable vaccine has been developed. One reason for this is that the molecular mechanisms, i.e. genes and proteins, by which Eimeria species infect the chicken and the chicken immune system reacts to the infection are not well known. In this project, we aimed to explore these mechanisms by

(6)

examining the expression of genes in the chicken and a species of parasite, Eimeria tenella and comparing it to their expression in non-infected cells. This means looking at how much a given gene is used to produce proteins at different times during the infection process of Eimeria tenella and comparing it to the same gene when the cell isn't infected. For the parasite, the comparison is made to cells that haven't entered chicken cells and are waiting to start infecting. This allows us to see which proteins are being used, when they are being used, and helps to find ones that can be targeted by a vaccine and ones that are important to the immune response.

Our results show that many genes are affected by the parasite infection, especially in the immune system. However, a large part of that reaction is down-regulation, i.e. reducing the amount of protein that is produced from that gene compared to an uninfected cell. This further supports previous results, that Eimeria tenella is actively weakening the immune response of the host and that targeting this ability might produce a viable vaccine. We examined the expression of several parasite genes believed to have a role in this process, showing that the genes were highly expressed when the parasite was reproducing. The expression analysis also revealed several genes that clearly react to the presence of the parasite at different times during infection and may be key to understanding how the chicken acquires immunity to it.

Finally, we attempted to find genes that show a similar pattern of expression in both the chicken and the parasite, i.e. genes that behave in similar ways and might be affecting one another. Many genes showed similar patterns between species and one group of genes was identified that contained both immune genes with known roles in Eimeria tenella infections and parasite genes with roles in the infection process.

By identifying important immune genes in the chicken and infection genes in the parasite, this project has contributed to increasing understanding of coccidiosis. The results will be used to inform further research into this disease and, hopefully, take us one step closer to the creation of an effective vaccine to this disease.

Degree project in bioinformatics, 2020

Examensarbete i bioinformatik 30 hp till masterexamen, 2020 Biology Education Centre and Statens Veterinärmedicinska Anstalt

Supervisor: Robert Söderlund

(7)

Abbreviations

API Application programming interface CPM Counts per million

DE Differential expression FDR False discovery rate

GO Gene Ontology

GPI Glycosylphosphatidylinositol IFN Interferon

IL Interleukin

KEGG Kyoto Encyclopaedia of Genes and Genomes MDS Multidimensional scaling

PCA Principal components analysis PRR Pattern recognition receptors SAG GPI-anchored surface antigens SVA Statens Veterinärmedicinska Anstalt TLR Toll-like receptor

WGCNA Weighted correlation network analysis

(10)

(11)

1 Introduction

Coccidiosis is a disease caused by intracellular, eukaryotic parasites of the genus Eimeria. The disease occurs in a wide variety of species, one of which is the chicken, Gallus gallus. Seven different species of Eimeria are known to infect chickens, including E. tenella and E. maxima.

All seven species infect the gut of the chicken but the specific sections of intestine they infect, the length and specifics of their lifecycle and the severity of the symptoms they cause vary considerably. What is common to all species is that the severity of the infection depends on the infection dose and infected individuals acquire a species-specific immunity post-infection (Chapman et al. 2013). E. tenella causes one of the more severe infections. Its entire life cycle takes place within the cecum of the chicken where it goes through three stages. First, an oocyst containing invasive sporozoites is ingested by the chicken. In the cecum, these sporozoites invade epithelial cells and go through three cycles of asexual replication, called merogony, invading more cells in each cycle. Symptoms start appearing after the second merogony. After the third merogony, sexual reproduction takes place, forming a zygote that transforms into an oocyst that is shed in the chickens feces. The whole process takes approximately nine days (Daszak 1999). At this point, the oocyst needs a few days to sporulate before it's ready to infect the next chicken, beginning the next cycle of infection. The oocyst itself is also an important factor in Eimeria infection, being resistant to a variety of disinfectants and allowing the parasite to remain viable in the environment for months or even years (Belli et al. 2006). This means that it is exceedingly difficult to control coccidiosis through proper hygiene and environmental control.

In order to study the infection, chicken cell lines have been infected with the parasite in vitro.

However, it is impossible to maintain Eimeria parasites through in vitro infection as the parasite doesn't produce oocysts in cell cultures. The life cycle is also slower; in vivo E. tenella is finished with the first merogony at 48 hours post-infection but in vitro it takes 72 hours or more (McDonald & Rose 1987, Tierney & Mulcahy 2003). The in vitro infection system does allow for studying coccidiosis without infecting live chickens and is a viable method for studying the early stages of the parasite life cycle.

Once a chicken is infected by an Eimeria species, symptoms vary greatly in severity. Most cases result in limited symptoms that can nonetheless cause weight loss and less efficient feed conversion. In the worst case, the infection can lead to death. Either way, it results in losses for the chicken industry. Indeed, it is estimated that coccidiosis causes 2 billion euros of damages to the global industry each year (McDonald & Shirley 2009). Currently, two main methods are used to control E. tenella: vaccination and use of antibiotics called coccidiostats.

Vaccination makes use of avirulent and attenuated strains of Eimeria to induce immunity in the chicken. While this method is effective, it remains impossible to produce viable Eimeria through in vitro methods, requiring the use of live chickens instead (Witcombe & Smith 2014).

This results in expensive vaccines and has raised concerns regarding the security and scalability

(12)

of production as well as ethical issues. For these reasons, coccidiostats have been much more widely used than vaccination. Coccidiostats are a type of chemical antibiotic that has been used to effectively control Eimeria in chickens, however many species have acquired resistance to them. This has led to a renewed interest in the creation of a subunit vaccine, i.e. a vaccine that makes use of immune activating proteins to induce immunity in the chicken (Witcombe &

Smith 2014).

If a subunit vaccine could be produced safely and efficiently using, for example, transformed E. coli, it would effectively mitigate most current issues with vaccination. However, despite extensive research, no effective subunit vaccine has yet been produced. One of the problems that have frustrated vaccine development is that, while the life cycles of different Eimeria species are well defined, the molecular mechanisms that they use to infiltrate their hosts and prevent discovery and destruction by the chicken immune system are not well known. Some have been identified, however, like the surface antigen (SAG) proteins. They have been found to play an important role in infection, taking part in the invasion process as well as affecting the expression of important cytokines such as IFN-γ and IL-12 (Chow et al. 2011, Chapman et al. 2013). Much remains unknown about the mechanism by which the chicken immune system detects and deals with Eimeria infections as well, though several factors have been identified.

Eimeria immunity in chickens is known to be species specific and cell-mediated (Rose &

Hesketh 1979). Several important immune mechanisms have also been identified, notably the cytokine IFN-γ which was found to be important in immunoregulation (Rose et al. 1991). Toll- like receptors, specifically TLR4 and TLR15, have been suggested to have a role in the initiation of immune response to Eimeria (Zhou et al. 2013). However, a complete picture of the immune response has yet to emerge. It is therefore imperative that both the E. tenella infection strategy and the chicken immune response be further explored in order to elucidate these processes and identify targets for vaccine development.

One way to do this is to examine the gene expression of both organisms. However, if the normal method of sequencing the transcriptome of each organism separately is used, information on the possible interactions of the two species is lost. One way of circumventing this problem is to use dual RNA-seq. This method differs from normal RNA-seq in that samples are taken from infected tissue and RNA reads are mapped to the genomes of both organisms involved. The advantage of this method is that it allows for exploring the expression of both organisms simultaneously, making it possible to identify interactions between the two species by finding gene co-expression networks involving genes from both organisms (Westermann et al. 2012).

Therefore, dual RNA-seq should be a fitting method to examine the E. tenella infection of chicken cells and identify how the parasite interacts with the chicken immune system over the course of the infection.

(13)

1.1 Aims

The main aim of this project is to produce a pipeline for performing a differential expression analysis on dual RNA-seq data from the chicken and E. tenella. That pipeline will then be used to analyse an in vitro dataset to identify important genes in the chicken immune response to E.

tenella and elucidate their roles and to examine the expression of genes in the chicken and E.

tenella at the same time in order to identify networks of co-expression across the two organisms.

2 Methods

2.1 Data

The data for this project consisted of two datasets of raw mRNA read data, both from in vitro experiments. The two were made up of RNA reads collected from immortalized chicken macrophage cells of cell line HD-11 (Beug et al. 1979), infected with sporozoites from the Eimeria tenella reference strain Houghton. Both the parasite and the cell line are maintained at Statens Veterinärmedicinska Anstalt (SVA). The first in vitro dataset consisted of samples taken from infected and uninfected chicken cell cultures, one from each at 2 hours post- infection and two from each at 12 hours and 24 hours. The sequencing libraries for this dataset were prepared from 500ng total RNA using the TruSeq stranded mRNA library preparation kit including poly-A selection. The sequencing was done in a HiSeq 2500 machine with 125 bp reads using v4 sequencing chemistry. The second in vitro dataset was from a larger experiment with samples taken at 0, 2, 4, 12, 24, 48, and 72 hours post-infection. The sequencing libraries were prepared in the same way as the pilot dataset but from 120 ng of total RNA. The sequencing method was also the same. The samples were taken from cultures kept at SVA and the datasets were generated at the SNP&SEQ Technology Platform at SciLifeLab/Uppsala University.

2.2 Read counting

The analysis of the data began with read counting, i.e. mapping the sequenced reads to the chicken and E. tenella genomes and counting the number of reads mapping to each gene. This pipeline expanded on the pipeline created by Feifei Xu (Uppsala University), who previously worked with both in vitro datasets. The first step of the pipeline was to evaluate the quality of the raw data to determine the extent of trimming required. Quality reports were generated by the SNP&SEQ Technology Platform and these were used to determine the settings for Trimmomatic (Bolger et al. 2014). As the average read quality scores were very high before trimming, only a sliding window of length 4 with an average quality threshold of 20 was used to get rid of any low-quality regions. A high percentage of adapter content was also observed, which required trimming. The trimmed reads were subsequently examined to ensure the adapters had been removed successfully and read quality remained high using FastQC

(14)

(Andrews 2010). MultiQC was then used to collate the reports (Ewels et al. 2016). Once the desired read quality had been attained, the next step was to map the reads to the reference genomes. As this was a dual-RNA seq pipeline, the reads had to be mapped to both genomes.

This was accomplished by concatenating both the sequence files and the annotation files while ensuring that there was no conflict in gene identifiers between the two genomes. The reads were then mapped to the genomes using the splice-aware mapper STAR and counted using HTSeq (Dobin et al. 2013, Anders et al. 2015). As the library preparation allowed for strand-specific mapping, this feature of HTSeq was used with the strandedness setting on reverse. This produced files containing the number of transcripts mapping to each genomic feature in both organisms. These files were split up into count files for each species and statistics on the counts were computed. The computations were performed on resources provided by SNIC through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX) under Project SNIC 2020/15-16. All the code used in the project along with information on software versions can be found on the project’s GitHub repository (Sigurðarson Sandholt 2020).

2.3 Differential expression analysis

The second part of this project was the differential expression analysis run in R. Several different packages for differential expression analysis are available in R, the one chosen for this project was edgeR (Robinson et al. 2010). This package takes in raw read counts and includes functions for normalizing and filtering the data before running the differential expression analysis. It also has functions for running GO (Gene Ontology) and KEGG (Kyoto Encyclopaedia of Genes and Genomes) category enrichment analyses. Before loading the data into R, however, it needed to be processed to make it easier to work with and to add in Entrez gene IDs for downstream analysis. This was accomplished with Python scripts using the Pandas package to manipulate tables and Bash scripts for extracting IDs from the annotation files.

These processed read count files were then read into R. There they were split into a chicken dataset and E. tenella dataset, each containing only relevant data, i.e. infected samples for the parasite dataset. These were then normalized using CPM (Counts Per Million), which is dividing each gene count by the total number of reads for that sample and multiplying by one million. The next step was to examine the data for any outliers using a PCA (Principal Components Analysis) and MDS (Multidimensional Scaling) and examining those found. The MDS plots display leading log² fold change, “...defined as the root-mean-square average of the log-fold-changes for the genes best distinguishing each pair of samples” (Ritchie et al. 2015).

Once the data had been quality checked, it was loaded into the DGElist object from edgeR.

Functions from edgeR were used to normalize the library size of samples in each dataset and remove genes that had very low counts across all samples. The experimental design was then defined, i.e. which comparisons were to be made, and the differential expression analysis run on the different time points. Infected and uninfected samples at each time point were compared for the chicken data whereas the E. tenella data was compared to a pre-infection sample of pure E. tenella sporozoites. The threshold for differential expression was a False discovery rate

(15)

(FDR) value <0.05, as calculated using edgeR, and a log2 fold change of ≥1. P-values were adjusted using the Benjamini-Hochberg methods, resulting in FDR values, in order to correct for the multiple hypothesis testing occurring when testing for the differential expression of every gene present. This resulted in a list of differentially expressed genes which was further analysed by running a GO category and KEGG pathway enrichment analysis. For the chicken, this was accomplished using edgeR functions along with the packages GO.db (Carlson 2019a), org.Gg.eg.db (Carlson 2019b), and the KEGGrest API (Tenenbaum 2020). For E. tenella, the lack of available GO and KEGG annotation and annotation packages meant that a manual enrichment analysis was done using ad hoc scripts. The GO and KEGG annotations for E.

tenella were taken from ToxoDB (Gajria et al. 2008) and generated through KEGG's BlastKOALA tool, respectively.

Visualization of the results of the differential expression analysis was accomplished with several different R packages. These were EnhancedVolcano (Blighe et al. 2019), ggbiplot (Vu 2011), ggplot2 (Wickham 2016 p. 2), ClassDiscovery (Coombes 2019), RColorBrewer (Neuwirth 2014) and functions that are part of edgeR.

The final analysis used the WGCNA package to identify groups of genes with significantly similar expression patterns and cluster them together into modules (Langfelder & Horvath 2008). This was run on normalised CPM values of read counts after filtering using edgeR and only used data from infected samples so that chicken and E. tenella data could be combined without introducing too much bias from uninfected samples having zero read counts for E.

tenella genes.

3 Results

3.1 Read counting

The analysis began with a read counting step, i.e. getting the number of reads mapping to each gene in the chicken and E. tenella from the raw read data. The raw read data from the SNP&SEQ Platform was of high read quality but had a large amount of adapter sequences present which required trimming. Once trimmed, the reads were mapped to both the chicken and E. tenella genomes simultaneously and the number of reads mapping to genes counted.

Table 1 shows information on the samples and the fraction of reads mapping to features.

Features refer to any expressed parts of the genome, generally meaning protein coding genes but also RNA genes and pseudogenes. For this project, only protein coding genes, along with rRNA and tRNA coding genes, were included in the analysis and are referred to as genes. The mapping rate to features was generally high for the samples, ~80-85%, but was much lower for the pure E. tenella sample, ~65%. Once the number of reads mapping to each feature was computed, the files were processed and loaded into R for differential expression analysis.

(16)

Table 1. Statistics of read mapping and metadata for each sample. For each time point, there are three replicates for each of infected and uninfected chicken samples. The 0-hour sample is an exception, only one pure E. tenella sample and two chicken samples were taken. The fraction of E. tenella reads is shown for every sample, including the uninfected ones.

Sample Infection status

Time post- infection

[hours]

Number of mapped

reads

Percentage of E. tenella reads

Percentage of reads mapped

to features 33_S29 Pure

E. tenella

0 1.24E+07 99.954 65.28

1_S1 Uninfected 0 2.16E+07 0.0003 85.64

11_S10 Uninfected 0 1.53E+07 0.0006 84.1

23_S19 Infected 2 1.17E+07 0.4993 83.32

24_S20 Infected 2 1.07E+07 0.466 83

6_S22 Infected 2 2.23E+07 0.5854 84.4

25_S21 Uninfected 2 1.47E+07 0.002 83.63

26_S22 Uninfected 2 2.26E+07 0.0014 83.82

5_S21 Uninfected 2 1.74E+07 0.0005 84.26

2_S18 Infected 4 1.72E+07 1.8494 84.92

2_S2 Infected 4 1.11E+07 0.6958 83

8_S24 Infected 4 1.89E+07 1.0926 85.23

1_S17 Uninfected 4 1.55E+07 0.0003 83.64

3_S3 Uninfected 4 1.09E+07 0.0007 84.05

7_S23 Uninfected 4 2.69E+07 0.0006 85.63

27_S23 Infected 12 1.70E+07 0.573 81.99

28_S24 Infected 12 2.73E+07 0.5766 81.91

29_S25 Infected 12 3.62E+07 0.6084 82.08

30_S26 Uninfected 12 2.31E+07 0.0025 81.49

31_S27 Uninfected 12 1.22E+07 0.0034 82.37

32_S28 Uninfected 12 1.26E+07 0.0016 82.55

10_S26 Infected 24 1.95E+07 2.3565 85.42

4_S20 Infected 24 1.65E+07 1.5509 83.89

4_S4 Infected 24 1.22E+07 1.3263 83.02

3_S19 Uninfected 24 1.77E+07 0.0004 82.83

5_S5 Uninfected 24 1.42E+07 0.0007 82.8

9_S25 Uninfected 24 1.85E+07 0.0018 87.81

12_S11 Infected 48 2.27E+07 5.5871 81.34

18_S15 Infected 48 1.19E+07 5.936 81.25

6_S6 Infected 48 2.81E+07 4.6507 81.81

13_S12 Uninfected 48 2.12E+07 0.0007 82.7

19_S16 Uninfected 48 8.54E+06 0.0011 81.92

7_S7 Uninfected 48 1.02E+07 0.0032 84.86

14_S13 Infected 72 1.27E+07 2.4569 80.76

20_S17 Infected 72 1.34E+07 3.1254 80.5

8_S8 Infected 72 1.28E+07 2.8447 80.82

15_S14 Uninfected 72 1.14E+07 0.0015 81.71

21_S18 Uninfected 72 1.91E+07 0.0016 81.92

9_S9 Uninfected 72 1.42E+07 0.0008 81.06

(17)

Figure 1. A line plot showing the average percentage of reads mapping to the E. tenella genome at each time point.

The fraction of read mapping to the E. tenella genome varies over time, as Figure 1 shows.

There is a large jump in the percentage from 2 to 4 hours and a fall again at 12 hours. It then steadily rises towards 48 hours and falls again at 72 hours. A potential explanation for this pattern is that at 12 hours, the parasite is in the early stages of the trophozoite form, where it is not yet dividing. At 72 hours, some of the already released merozoites might be washed away and lost when the RNA is harvested. The fraction reaches a maximum of just over 5% at 48 hours and has a minimum of about 0.5% at 2 and 12 hours.

3.2 Differential expression analysis

Before running the differential expression (DE) analysis, the data was examined for potential outliers. First, the read counts and library sizes were normalized and any genes with a very small number of reads mapping to them across all samples were removed. Then, both an MDS analysis and a PCA were run on the data and plotted in Figures 2-5.

(18)

Figure 2. A multidimensional scaling plot of the normalized chicken count data. The labels are composed of U for uninfected or I for infected and the time post-infection that the sample was taken.

Figure 3. A multidimensional scaling plot of the normalized E. tenella count data. The labels are the time post-infection at which the sample was taken.

(19)

Figure 4. PCA plot of the first two principal components of the normalized chicken count data. The labels are composed of U for uninfected or I for infected and the time post-infection that the sample was taken.

(20)

Figure 5. PCA plot of the first two principal components of the normalized E. tenella count data. The labels are the time post-infection at which the sample was taken. The 0-hour, pure sporozoite sample is not included.

These figures show a stark contrast between the chicken and E. tenella data, with the chicken data having a more random distribution, perhaps reflecting the more diverse background transcription going on in the chicken, whereas the E. tenella data clusters more distinctly. The MDS plot in Figure 2 does show some clustering for the uninfected and infected samples at 48 and 72 hours. This also occurs in the PCA plot in Figure 4 but data from other time points also clusters close to them. In neither plot is there a very clear separation of the infected and uninfected time points, though in both plots there is a trend of separation along PC2. It would appear, however, that neither infection status nor time can completely explain the variance in the chicken data, suggesting that it has to do with how the cells grow in each experiment. This contrasts dramatically with the E. tenella data where the MDS plot in Figure 3 shows very clear clustering on timepoints, with three main clusters forming for 2 and 4, 12 and 24 and 48 and 72 hour samples. This clustering is not as clear in the PCA plot in Figure 5 but there is still a clear tendency for samples from the same time point to cluster together. This indicates a more dramatic shift in expression in E. tenella between time points. It should be mentioned that the pure E. tenella sample, shown as the zero timepoint in Figure 3, was not included in Figure 5 as it formed an extreme outlier. This sample was used as the reference for the parasite DE analysis and was not removed.

(21)

The next step was the DE analysis itself. For the chicken data, the DE analysis was conducted for each time point, comparing infected samples to uninfected ones. For the E. tenella data, data from each time point was instead compared to the pure parasite sample, 33_S29. Figures 6 and 7 show volcano plots for the chicken and E. tenella, respectively, at all time points.

Figure 6. Volcano plots for differential gene expression in the chicken samples for all time points. The significance thresholds are set at a log2 fold change of ±1 and a false discovery rate of 0.05. NS stands for non-significant.

(22)

Figure 7. Volcano plots for differential gene expression in the E. tenella samples for all time points. The significance thresholds are set at a log2 fold change of ±1 and a false discovery rate of 0.05. NS stands for non-significant.

The plots show the changing expression profile of both species as the infection progresses. For the chicken, the number of up-regulated genes increases up to the 24 hour time point where it peaks and then falls slightly at 48 and 72 hours. Of particular note here are the genes LOC107049603 and NR4A1 which are highly significant at 2, 4 and 12 hours. These genes encode a FosB-like protein and a nuclear receptor, respectively. The number of down-regulated genes increases in a slower fashion and hits a very sudden peak at 24 hours. A far smaller number of genes are significantly down-regulated in the last two time points. This indicates a particularly large reaction in the chicken at 24 hours, perhaps due to the E. tenella parasite hitting a specific stage of the infection process. For E. tenella, a large number of genes are both up- and down-regulated across the different time points, with both quickly increasing from the 2-hour time point. A relatively small number of genes are up-regulated at 2 hours, showing that

(23)

relatively few genes have started to express more than they would while the parasite is still looking for a host to infect. The number of up-regulated genes quickly increases and appears fairly stable in the 24, 48, and 72 hour samples.

3.3 GO and KEGG analysis

An enrichment analysis for Gene Ontology (GO) categories and Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathways was carried out on the lists of differentially expressed genes for all time points on both organisms. The chicken data was analysed first using edgeR. For the sake of clarity, only results from three time points from both organisms, 4, 24, and 72 hours post-infection, will be discussed here. The top 10 most significantly enriched GO categories and KEGG pathways for the chicken at each time point can be found in Tables 2 and 3. Only GO categories from the biological process ontology were included in this analysis.

Table 2. Top 10 most significantly enriched GO categories in the chicken at 4, 24, and 72 hours post-infection. The table shows the term, then number of genes associated with this category in the chicken (N), the number of significantly up- and down-regulated genes and the p-value associated with each from a Fisher’s Exact Test.

Top 10 enriched GO categories at 4 hours post-infection

Term N Up Down P.Up P.Down

PR (Positive Regulation) of cellular process 390 17 1 1.07E-06 5.16E-01

PR of transcription by RNA polymerase II 109 9 0 2.72E-06 1

PR of biological process 435 17 1 4.68E-06 5.56E-01

response to chemical 244 12 0 1.36E-05 1

regulation of transcription by RNA polymerase II

174 10 0 1.93E-05 1

PR of transcription, DNA-templated 141 9 0 2.22E-05 1

PR of nitrogen compound metabolic process 260 12 0 2.55E-05 1

transcription by RNA polymerase II 180 10 0 2.59E-05 1

PR of cellular metabolic process 263 12 0 2.86E-05 1

PR of macromolecule metabolic process 264 12 0 2.97E-05 1

cellular process 1380 172 175 4.53E-07 5.07E-01

biological process 1504 184 192 6.05E-07 4.64E-01

cellular macromolecule metabolic process 753 104 79 1.25E-06 9.74E-01 macromolecule metabolic process 834 112 88 1.92E-06 9.77E-01

nucleic acid metabolic process 469 71 37 2.75E-06 1

developmental process 435 67 50 2.89E-06 7.92E-01

anatomical structure development 410 64 47 3.06E-06 7.92E-01 multicellular organism development 370 59 46 3.76E-06 5.79E-01 regulation of biological process 837 110 100 6.55E-06 7.56E-01 nucleobase-containing compound metabolic

process

525 76 44 6.68E-06 9.99E-01

(24)

sensory perception of mechanical stimulus 8 4 0 3.59E-05 1

sensory perception of sound 8 4 0 3.59E-05 1

regulation of signalling 205 16 7 1.72E-04 4.43E-01

signal transduction 361 22 11 4.16E-04 5.56E-01

regulation of cell communication 202 15 7 4.65E-04 4.28E-01

response to stimulus 597 31 18 5.05E-04 5.69E-01

lipid catabolic process 13 0 4 1 5.05E-04

regulation of signal transduction 186 14 5 6.24E-04 6.81E-01 cell surface receptor signalling pathway 188 14 3 6.94E-04 9.32E-01

signalling 377 22 12 7.44E-04 4.93E-01

Table 3. Top 10 most significantly enriched KEGG pathways in the chicken at 4, 24, and 72 hours post-infection. The table shows the term, then number of genes associated with this category in the chicken (N), the number of

significantly up- and down-regulated genes and the p-value associated with each from a Fisher’s Exact Test.

Top 10 enriched KEGG pathways at 4 hours post-infection

Pathway N Up Down P.Up P.Down

Cytokine-cytokine receptor interaction 107 13 2 1.21E-10 1.62E-02 Toll-like receptor signalling pathway 75 8 1 1.41E-06 1.29E-01

C-type lectin receptor signalling pathway 78 8 0 1.91E-06 1

Cytosolic DNA-sensing pathway 36 5 0 3.98E-05 1

NOD-like receptor signalling pathway 110 7 1 1.88E-04 1.83E-01

Adipocytokine signalling pathway 53 5 0 2.61E-04 1

MAPK signalling pathway 199 9 0 3.11E-04 1

Herpes simplex virus 1 infection 134 7 0 6.26E-04 1

Salmonella infection 183 8 1 8.40E-04 2.86E-01

AGE-RAGE signalling pathway in diabetic complications

74 5 1 1.22E-03 1.27E-01

Lysosome 109 0 61 1 3.03E-27

Phagosome 117 6 46 9.49E-01 2.59E-13

Metabolic pathways 1075 80 198 9.48E-01 9.95E-09

Valine, leucine and isoleucine degradation 41 1 19 9.76E-01 1.21E-07

Drug metabolism - other enzymes 51 7 21 1.52E-01 3.18E-07

Glycosaminoglycan degradation 15 0 10 1 1.68E-06

Drug metabolism - cytochrome P450 25 0 13 1 2.40E-06

Metabolism of xenobiotics by cytochrome P450

29 0 14 1 3.08E-06

Glutathione metabolism 42 5 17 3.02E-01 5.59E-06

Histidine metabolism 14 1 9 7.21E-01 8.93E-06

(25)

Lysosome 109 0 35 1 1.40E-26

Phagosome 117 2 21 8.35E-01 5.31E-11

Cytokine-cytokine receptor interaction 107 16 4 3.30E-08 4.18E-01

Metabolic pathways 1075 31 61 4.12E-01 1.85E-06

Glycosaminoglycan degradation 15 0 6 1 3.19E-06

Arginine and proline metabolism 34 0 7 1 6.41E-05

mTOR signalling pathway 122 5 13 2.44E-01 9.40E-05

Glycosphingolipid biosynthesis - globo and isoglobo series

9 0 4 1 9.76E-05

Glycerolipid metabolism 44 0 7 1 3.51E-04

Histidine metabolism 14 0 4 1 6.86E-04

The top GO categories for the 4 hour time point are clearly dominated by various regulatory genes while the top categories at 72 hours are mostly signalling and sensory related. The categories at 4 hours seem to be mostly associated with the positive regulation of different cellular processes and are up-regulated. The categories at 72 hours are more associated reactions and signalling due to stimuli and include both up- and down-regulated genes. The top enriched categories at 24 hours, however, are quite different, including several high-level categories such as biological process, likely due to the sheer number of significantly up- or down-regulated genes at that time point. There are also several developmental process categories, indicating that genes associated with these processes are also used for the response to an E. tenella infection.

The KEGG pathway enrichment at 4 hours shows a similar result to the GO enrichment, several signalling pathways seem to be up-regulated. In this case, it is likely that the same genes are showing up in different categories as many of them share signalling molecules, if not receptors.

At 24 hours, more metabolic pathways are significant, mostly being down-regulated. Of special interest are the top two categories, lysosome and phagosome, as both are involved in breaking down material, including invasive organisms. The large scale down-regulation of these pathways may indicate that the parasite is affecting regulation to prevent cells from destroying it. At 72 hours, the lysosome and phagosome pathways remain down-regulated, though not to the same extent. There remain several down-regulated metabolic pathways but also two signalling pathways: mTOR signalling and cytokine-cytokine receptor interaction.

The E. tenella data was analysed as well but due to a lack of available GO annotation packages in R and a lack of KEGG annotation for the species, the analysis was accomplished using in- house code. The top 10 enriched GO categories and KEGG pathways can be found in Tables 4 and 5.

(26)

Table 4. Top 10 most significantly enriched GO categories in E. tenella at 4, 24, and 72 hours post-infection. The table shows the term, then number of genes associated with this category in the chicken (N), the number of significantly up- and down-regulated genes and the p-value associated with each from a Fisher’s Exact Test.

dephosphorylation 7 0 7 1 1.45E-03

mRNA splicing, via spliceosome 9 0 8 1 3.35E-03

intracellular protein transport 28 12 6 4.65E-03 9.87E-01

translational elongation 8 5 1 1.01E-02 9.82E-01

proton transmembrane transport 6 4 0 1.66E-02 1

vesicle-mediated transport 36 13 7 1.72E-02 9.97E-01

transcription, DNA-templated 26 5 16 6.12E-01 1.81E-02

pathogenesis 4 0 4 1 2.40E-02

protein dephosphorylation 4 0 4 1 2.40E-02

proteolysis involved in cellular protein catabolic process

16 7 2 2.58E-02 9.96E-01

glycolytic process 13 13 0 2.12E-05 1

16 15 1 3.81E-05 1

translation 99 59 19 1.00E-03 1

mRNA splicing, via spliceosome 9 1 8 9.94E-01 2.97E-03

regulation of transcription, DNA-templated 44 9 25 1.00E+00 1.11E-02

intracellular protein transport 28 18 7 2.28E-02 9.58E-01

16 15 1 6.89E-05 1

glycolytic process 13 12 1 6.02E-04 9.98E-01

translation 99 61 16 8.99E-04 1

mRNA splicing, via spliceosome 9 0 8 1 2.85E-03

mRNA processing 6 1 5 9.74E-01 3.45E-02

rRNA processing 19 13 3 3.82E-02 9.92E-01

(27)

Table 5. Top 10 most significantly enriched KEGG pathways in E. tenella at 4, 24, and 72 hours post-infection. The table shows the term, then number of genes associated with this category in the chicken (N), the number of significantly up- and down-regulated genes and the p-value associated with each from a Fisher’s Exact Test.

Spliceosome 79 0 66 1 4.28E-18

Homologous recombination 10 7 2 6.70E-04 9.26E-01

Ribosome 96 31 24 1.52E-03 9.93E-01

Autophagy - animal 5 4 0 5.79E-03 1

Pathogenic Escherichia coli infection 7 1 6 7.76E-01 1.09E-02

Phagosome 6 0 5 1 2.62E-02

Proteasome 33 11 7 3.94E-02 9.80E-01

RNA polymerase 28 4 15 8.16E-01 4.56E-02

RIG-I-like receptor signalling pathway 3 0 3 1 4.77E-02

Oxidative phosphorylation 30 10 10 4.84E-02 6.95E-01

Ribosome 96 60 7 2.82E-05 1

Proteasome 33 23 3 1.05E-03 1

Pyrimidine metabolism 14 11 2 5.71E-03 9.82E-01

Glycolysis / Gluconeogenesis 25 17 5 7.09E-03 9.72E-01

Pathogenic Escherichia coli infection 7 1 6 9.77E-01 9.89E-03

Fatty acid elongation 5 5 0 1.26E-02 1

Autophagy - animal 5 5 0 1.26E-02 1

Base excision repair 5 5 0 1.26E-02 1

Ribosome 96 62 7 3.09E-05 1

Fatty acid elongation 5 5 0 1.62E-02 1

Proteasome 33 21 3 1.74E-02 1

Glycolysis / Gluconeogenesis 25 16 6 3.41E-02 9.21E-01

RIG-I-like receptor signalling pathway 3 0 3 1 4.36E-02

MAPK signalling pathway - yeast 5 1 4 9.44E-01 5.51E-02

Citrate cycle (TCA cycle) 15 10 2 6.47E-02 9.86E-01

Arginine and proline metabolism 8 6 0 7.83E-02 1

(28)

Unlike the chicken gene expression data, all time points have many significantly up- and down- regulated genes in E. tenella. This is very clear at the 4 hour time point in Table 4 where the top categories have most of the genes they contain being significantly differentially expressed in the analysis. The top categories are also very different from those in the chicken. At 4 hours, the chicken had mostly signalling categories of different types, whereas E. tenella has transport, proteolysis, dephosphorylation and DNA/RNA processing. Of these, dephosphorylation and DNA/RNA processing are strongly down-regulated while the transport categories seem to have a mix of up- and down-regulated genes. The catabolic proteolysis process is mostly up- regulated. At 24 hours, dephosphorylation remains down-regulated along with transcription and mRNA splicing but glycolytic processes have become highly up-regulated along with translation and transport. At 72 hours, the top categories are still similar to 24 hours and with similar expression patterns though rRNA processing has become significantly up-regulated.

In the KEGG pathway analysis for E. tenella in Table 5, two things must be kept in mind. First, the comparison is between E. tenella that is actively infecting a host and E. tenella that is ready to infect cells. Second, as there is no KEGG entry for E. tenella or any species of the Eimeria genus; the pathway annotation was generated by BLAST comparison with organisms present in the database and may not be completely accurate. As the table shows, four pathways are dominant across the time points: spliceosome, ribosome, proteasome and RNA polymerase.

The spliceosome is strongly down-regulated across all time points, indicating a high up- regulation in the pure E. tenella sample. The ribosome pathway has both up- and down- regulated genes at 4 hours but becomes strongly up-regulated for the rest of the analysis. The proteasome pathway has a mix of up- and down-regulated genes at 4 and 72 hours but is very up-regulated at 24 hours. The RNA polymerase pathway has mostly down-regulated genes but also some up-regulated ones across the time points. The rest of the pathways generally change but worth a mention is the glycolysis/gluconeogenesis pathway, which is mostly up-regulated at 24 and 72 hours.

(29)

3.4 Immune/infection gene expression

Figure 8. Heatmap of the expression of 241 immune related genes in the chicken. Green represents up-regulated genes and red down-regulated. Intensity scales with log2 fold change.

(30)

In order to examine the immune response in the chicken cells, all genes that were associated with GO:0002376, immune systems process, and all subcategories, as well as immune system related KEGG pathways were identified. Those genes that also showed significant differential expression in at least one time point were then plotted in a heatmap (Figure 8). It also includes a hierarchical clustering dendrogram, showing which genes behave in a similar fashion. The plot shows a few dominant patterns of expression. A majority of genes appear to be heavily down-regulated at 24 hours, especially those that are up-regulated at 2 and 72 hours. A smaller number at the top of the plot shows the opposite behaviour.

Dr. Eva Wattrang, a chicken immunology expert at SVA, identified several groups of genes with known immune related roles in the chicken. These genes were plotted in line plots to further elucidate their behaviour over the course of the infection. Only genes that had an FDR

< 0.05 and a log2 fold change of at least one in at least a single time point were included in the graphs.

Figure 9. The expression of mannose receptors as a function of time. Point shape indicates if the gene was significant, FDR < 0.05, at that time point.

The first gene category examined was the mannose receptors, shown in Figure 9. These show a relatively clear pattern, all mannose receptors except for MRC2 are down-regulated, a very significant effect at 24 hours. MRC2 begins down-regulated, though not significantly so, and is up-regulated at the later time points.

(31)

Figure 10. The expression of chemokines as a function of time. Point shape indicates if the gene was significant, FDR <

0.05, at that time point.

Figure 11. The expression of cytokines as a function of time. Point shape indicates if the gene was significant, FDR <

0.05, at that time point.

(32)

Cytokines are an important family of signalling proteins that have a role in the immune response. Chemokines are a subcategory of cytokines mainly involved in the recruitment of cells to e.g. sites of infection. The expression of these genes in the chicken are shown in Figures 10 and 11. Most of the up-regulated chemokines share a peak at 4 hours, indicating that they may have an important role in the early stages of the infection. The expression then falls, reaching a minimum at 48 hours before rising again at 72 hours. An exception to this pattern is CX3CL1 which has a relatively stable up-regulation until 72 hours, when it goes down but remains significantly up-regulated. The only two chemokines to be significantly down- regulated are CXCL12 and ATRN, both of which are highly down-regulated at 24 hours. The general cytokine category shows a much wider variety of expression patterns. IFNW1 and IL11 both remain non-significant until 48 hours but are both highly up-regulated at 48 and 72 hours, seemingly being important at the later stages of the in vitro infection. CSF3 and IL1B are both significantly up-regulated at the early time points, taper off towards the 48-hour time point but then are again significantly up-regulated at 72 hours.

Figure 12. The expression of Type I Interferon ”signature” genes as a function of time. Point shape indicates if the gene was significant, FDR < 0.05, at that time point.

Interferons are cytokines important in the defence against intracellular pathogens and induce the expression of many other genes. The ”Interferon (IFN) signature” category contains genes involved with the induction of type I IFNs and IFN regulated genes, i.e. genes that are expressed by cells in response to IFN. Figure 12 shows the expression of several of these. The genes IFITM3, LOC107053353, IGBP1L and IRF5 all share a similar expression pattern, with no

(33)

significant differential expression until 24 hours, where they are all significantly down- regulated. After this, they are all again non-significant, except for IFITM3 which is significantly up-regulated at 72 hours. The rest of the genes are all significantly up-regulated at different times. Most of them, IFI6, IFIT5, IFITM5, MX1 and OASL, are all nonsignificant until at 48 hours where they are significantly up-regulated and continue to be up-regulated at 72 hours.

Only IRF1 clearly breaks this pattern, being very significant and highly up-regulated at 2-12 hours and falling to a low but consistent level of up-regulation at 24-72 hours. IFITM5 has a similar pattern of expression at 2-24 hours but is not significant.

Figure 13. The expression of pattern recognition receptor genes as a function of time. Point shape indicates if the gene was significant, FDR < 0.05, at that time point.

Pattern recognition receptors (PRR) are parts of the innate immune system that help recognize pathogens and pathogen related cell damage. In Figure 13, the vast majority of PRR genes are significantly down-regulated at 24 hours, again supporting a major immunological event at that time. Outside of this time point, a few genes are significant, though the clearest outliers are NLRC5 and TLR15. NLRC5 is not significant until at 24 hours, where it is significantly up- regulated and continues to be at 48 and 72 hours. TLR15 on the other hand is up-regulated at 2- 12 hours but only significantly so at 4 hours. CLEC17A is also interesting in that it appears to be up-regulated at 4 hours, though not significantly, but otherwise follows the same pattern as the majority of PRR genes.

(34)

Figure 14. The expression of integrin genes as a function of time. Point shape indicates if the gene was significant, FDR

< 0.05, at that time point.

Integrins are transmembrane receptors that can activate signalling pathways in the cell and allow for rapid reaction to events at the cell surface. Similar to previous categories, the majority of genes in Figure 14 follow a similar pattern, with significant down-regulation at 24 hours and otherwise a general down-regulation but with few significant time points. ITGB3, ITGA9 and ITGB6 clearly follow a different pattern, however. ITGB3 appears to be up-regulated early and has significant up-regulation at 12 and 24 hours but is nonsignificant at 48 and 72 hours. ITGA9 appears to be non-significant until at 48 hours where it is fairly strongly up-regulated and continues to be at 72 hours. ITGB6 behaves in a similar fashion though it isn't significant at 48 hours.

Overall, the various categories of immune related genes in the chicken show a large variety of expression patterns. The clearest common behaviour that can be seen across categories is an increased number of significant genes at 24 hours, especially for down-regulated genes, something that the heatmap in Figure 8 and the volcano plot in Figure 6 also show.

(35)

Figure 15. The expression of surface antigen genes in E. tenella as a function of time. Point shape indicates if the gene was significant, FDR < 0.05, at that time point.

The glycosylphosphatidylinositol (GPI)-anchored surface antigen (SAG) genes are a set of proteins found on the surface of several species of apicomplexan parasites, including E. tenella.

They appear to have important roles in infection. Figure 15 shows the expression of a subset of these genes at the different time points. Most of them show a similar behaviour, non-significant until 48 hours where they are significantly up-regulated and continue at a similar level at 72 hours. A few genes, SAGs 13, 14, and 4, are instead down-regulated across all time points.

SAGs 4 and 14 show a peak of downregulation at 12 hours but subsequently following a similar pattern to the upregulated SAGs and SAG13 becomes continuously more down-regulated.

(36)

3.5 Gene co-expression modules

In order to find networks of co-expressed genes across the two organisms, chicken and E.

tenella, the WGCNA package was used. This tool is used to identify highly correlated gene clusters in an expression dataset. It was applied to the infected chicken samples only; this was to avoid bias from the uninfected chicken samples not having any E. tenella data.

Figure 16. Hierarchical clustering of the normalized expression of genes in each sample. Sample 33_S19 forms an outlier.

(37)

Figure 17. Model fit versus mean connectivity. The lines on the scale independence plot are set at a fit of 0.7 and 0.9.

The first step of the analysis was to load all the in vitro read data from both the chicken and E.

tenella into R where it was filtered and normalized. The data was then clustered (Figure 16) to identify outlier samples. Sample 33_S29 formed a clear outlier, likely because it only included E. tenella data and would therefore be very different from the other samples, having high CPM values for parasite genes and zeroes for the chicken genes. This sample was therefore removed from this analysis to avoid any potential bias. The rest of the data forms three clear groups: the leftmost group of samples was taken at a different time from the rest but includes samples from 2-24 hours, the centre group is composed of samples from 48 and 72 hours and the rightmost group is also samples from 2-24 hours. Next, the fit of the data to the WGCNA model was assessed with different soft thresholds. The goal is to pick a threshold that fits the model well but also allow for high connectivity. Figure 17 shows the trade-off between connectivity and model fit. In order to get the best fit, a threshold of 15-20 would be best but at that point the connectivity has become very low. Therefore, a soft threshold of 9 was chosen as a compromise.

(38)

Figure 18. A dendrogram showing the different genes clustered together and how they were assigned color coded modules.

Table 6. The number of genes and the percentage of E. tenella genes in each module, the number of significantly DE immune-related genes and the number of genes from the immune categories IFN signalling and PRR.

Module Number of genes

E. tenella gene percentage

Number of significant immune genes

IFN signal

PRR

Black 704 17.47 26 0 3

Blue 4129 2.33 40 1 2

Brown 4047 11.29 28 0 2

Green 1014 31.66 38 0 2

Greenyellow 130 31.54 3 0 1

Grey 110 19.09 2 1 0

Magenta 149 12.08 1 1 0

Pink 620 86.45 2 0 0

Purple 149 13.42 11 5 0

Red 716 60.34 15 0 3

Tan 94 3.19 6 0 0

Turquoise 5225 78.39 22 2 1

Yellow 3438 26.70 42 0 6

(39)

With the data cleaned up and model parameters chosen, the co-expression modules were calculated. This resulted in 13 modules, shown in Figure 18 and Table 6. The modules vary greatly in size, from large ones such as turquoise and blue, to small ones such as tan, purple and grey. The fraction of E. tenella genes in each module also varies greatly, with large modules with very high and low fractions of parasite genes likely containing housekeeping genes from each organism. For example, the blue module contains the genes of the glycolysis pathway in the chicken whereas the turquoise module contains the same for E. tenella. A GO and KEGG enrichment analysis further supported this, showing an enrichment of mitochondrial GO categories and metabolic KEGG pathways. Many of the other modules have a fairly mixed composition, though given the far larger number of chicken genes in the analysis, it's not surprising to see the majority of modules being dominated by the chicken.

Table 6 also shows the number of genes from the heatmap in Figure 8 and the immune categories IFN signature and PRR from the previous section. The immune-related genes appear to be spread across the modules, with no module being an obvious “immune response” module.

The GO and KEGG enrichment analysis supported this, with immune related categories being enriched but not showing a strong signal in any specific module. Specific categories of immune genes were checked for enrichment in the modules as well and showed the same pattern again, as can be seen with the IFN signature and PRR genes. These two categories do, however, have a fair number of genes clustered in specific modules: yellow for PRR and purple for IFN signature. For PRR, this includes the gene TLR15, which shows unique behaviour in Figure 13 and is thought to have a role in the chicken immune response to E. tenella. For the IFN signature genes, most of them seem to cluster in the purple module and they make up just under half the immune-related genes present in the module.

(40)

4 Discussion

4.1 The data and the pipeline

The main aim of this project was building a pipeline for the dual RNA-seq and subsequent differential expression analysis of chicken and E. tenella data. This pipeline was to be used to analyse both in vitro and in vivo data for this project and beyond. The first half of the pipeline quality checked the reads, trimmed them, mapped them to the reference genomes and counted those that mapped to features. These steps went well and as Table 1 shows, the data mapped very well to the chicken reference genome but less so to the E. tenella reference. The second half of the pipeline involved loading the data into R, performing further quality checks and performing a differential expression analysis. As the volcano plots in Figures 6 and 7 show, this step was also a success. Overall, the pipeline performed well with the main bottlenecks being the mapping and read counting, both of which take a relatively large number of CPU hours to complete. This was not an issue for the in vitro datasets, as they were relatively small, but needs to be addressed with larger datasets that the pipeline might be used for in the future. The read mapping performed better as STAR allows for multithreading, something HTSeq does not. As such, running large amounts of data will require splitting the data up manually if faster results are desired.

While running the pipeline, the quality checks revealed various issues in the in vitro datasets.

The first was a large amount of adapter content in the reverse reads. This needed to be trimmed away and left many of the reverse reads unusable for the analysis, meaning that many of the forward reads were mapped as single reads. The cause for this most likely lies in the way that the samples were prepared for sequencing in the lab and as such, was beyond the scope of this project to address. The second was the mapping rate of the different samples, shown in Table 1. While most samples mapped quite well, with an ~80-85% mapping rate to features, there was one notable exception: the pure E. tenella sample. It only had a mapping rate of 65%, which is quite a lot lower than for the chicken samples. While the mapping rate to features was quite low, the mapping rate to the genome was around the same level as the chicken samples, ~87%, meaning that the issue lies in reads mapping where there are no annotated features. It seems unlikely that such a large-scale genomic contamination would be present, given the lab prep used DNAse before cDNA synthesis that should have gotten rid of any such contamination.

The best way to check this would be to sequence several pure samples and map them to the genome, but that was beyond the scope of this project. The cause is likely to be a lack of annotation for the E. tenella genome. This then suggests that a large portion of the expression of E. tenella is not being captured and emphasizes the importance of producing a better annotation for the organism, even if it's only structural, so that important genes are not being missed. Data from this project could be used to aid in improving the annotation in the future, as expression data is a valuable tool for determining genes in eukaryotic genomes.

(41)

The general quality of the E. tenella reference genome is another issue. While the chicken genome is of fairly high quality and not very fragmented, the E. tenella genome is highly fragmented. The genome is ~50 Mb in size and made up of 14 chromosomes but is split into over 4000 contigs. This fragmentation may cause genes to be split into multiple contigs, making it more difficult to identify them. This also leads into a different issue: the annotation of both the chicken and E. tenella genomes. For the chicken genome, most of the genes do have a functional annotation but many of those are taken directly from mammalian homologues and many are still only annotated as hypothetical proteins. This reduces the accuracy of any analysis utilizing the reference genome and makes the interpretation of results more difficult. This also extends to GO enrichment analyses as the GO annotations available through packages, such as org.eg.Gg.db, remain relatively poor. There are, however, more extensive annotations available from the GO Consortium. For E. tenella, this problem is far more serious. Over a third of the gene products remain annotated as hypothetical proteins and another large portion has only putative annotations. This extends to the GO annotation where only a minority of genes are GO annotated. There is also no KEGG pathway annotation for E. tenella or any species of genus Eimeria. This lack of annotation makes studying E. tenella and other eimerian parasites difficult and needs to be addressed in order to facilitate more research on Eimeria species and apicomplexan parasites in general.

In order to gauge the quality of the datasets, quality checks were implemented as part of the pipeline by clustering the data and displaying it in Figures 2-5. The chicken data had no particular outliers, and also did not cluster clearly based on infection status, though the infected samples did seem to cluster together more than the uninfected ones. In the MDS plot in Figure 2 there is some tendency for samples from the later time points, 48 and 72 hours post infection, to cluster together, something that can also be observed in the dendrogram in Figure 16. This seems to indicate that a major event is occurring between 24 hours and 48 hours. This is also supported by the volcano plots in Figure 6 and the behaviour of many of the immune-related genes examined in Figures 9-15. Tierney & Mulcahy (2003) described a time series of in vitro E. tenella infection, though with a different E. tenella strain, UTI1507, and in a bovine cell line, Madison-Darby bovine kidney. Their results show that at 24 hours post infection, the parasite has formed trophozoites within a parasitophorous vacuole, i.e. a membrane protected compartment in the host cell. At 48 hours they have formed immature schizonts which become fully mature at 60 hours and have formed merozoites at 72 hours, meaning the first merogony is over and they are ready to leave the current host cell and begin infecting other cells (Tierney

& Mulcahy 2003).

The expression patterns of E. tenella were quite different from those of the chicken. As both the MDS plot in Figure 3 and the PCA plot in Figure 5 show, the parasite samples cluster very clearly based on time points, with there being a clear separation between 2-4, 12-24 and 48-72 hours. Given the results of Tierney & Mulcahy (2003), it seems reasonable to assume that these three stages are the beginning of infection, the parasite securing itself in the parasitophorous vacuole and the merogony itself, possibly with some reinfection starting to occur at 72 hours.

Dual RNA-seq analysis of host-pathogeninteraction in Eimeria infection of chickensArnar Kári Sigurðarson Sandholt