A modular approach to identify differentially expressed genes between men and women for inflammatory diseases

(1)

women for inflammatory diseases

Master Project in Bioinformatics

Spring term 2016 Stefan Fehrenbach

Supervisor: Zelmina Lubovac

Examiner: Angelica Lindlöf

(2)

Abstract

Inflammatory diseases show large differences in susceptibility between men and women.

In previous study, genes that showed different expression patterns between patients and healthy controls in males and females were identified using modules in disease-gene interaction networks. In this work, genes were identified using different methods based on gene expressions in public available data sets.

By counting the occurrences of genes identified in the interaction network in our results, we showed that they greatly overlap with genes identified by our methods and that the disease gene-interaction networks are able to identify genes that can be identified in a gene expression based analysis as well.

Gene expression analysis was implemented in an automatic pipeline, which was designed for a general use. Thereby, future research with similar problems can be simplified. The R- packages limma and WGCNA were used to identify genes that showed differences in males and females and GO terms and KEGG pathways were used to search for enriched functions of those genes.

Further, a difference between males and females was found for systemic lupus erythematosus and Sjögren’s syndrome data sets in the expression of genes belonging to interferon signaling. Interferons are currently examined as drug targets for SLE and a difference between men and women could lead to different results of such a medication. However, the identified genes showed changes in expressions between patients and controls for both men and women. This supports a beneficial effect of such drugs in men and women.

(3)

7 Appendix A-1 7.1 The pipeline code . . . A-1 7.2 Most enriched GO terms . . . A-2 7.2.1 SLE . . . A-2 7.2.2 MS . . . A-5 7.2.3 RA . . . A-8 7.2.4 SS . . . A-9 7.2.5 T2D . . . A-10 7.2.6 UC . . . A-12 7.3 Evaluation of genes found in the modular approach . . . A-14 7.3.1 SLE . . . A-15 7.3.2 MS . . . A-15 7.3.3 RA . . . A-16 7.3.4 SS . . . A-16 7.3.5 T2D . . . A-17 7.3.6 UC . . . A-17 7.3.7 GD, MG . . . A-17 7.4 Calculation of p-values . . . A-18

(5)

1 Introduction

Chronic inflammatory diseases such as asthma, autoimmune diseases, atherosclerosis, inflammatory bowel disease or allergies become an increasingly important factor to our health.

In many cases women are more likely to contract with inflammatory diseases [Ngo et al., 2014]. However, the reasons for this difference in susceptibility between men and women are just partly known.

In previous research, possible causes for different susceptibilities were investigated. Firstly, many studies investigated the impact of sex hormones on the immune system in terms of susceptibility [Ngo et al., 2014], [Whitacre, 2001] and [Lockshin, 2006]. It is likely that sex hormones cause differences between men and women for inflammatory diseases as they are one of the biological processes that are strongly different between men and women and have an impact on the complete body. For example, previous work shows that mice with knocked out estrogen receptor (ERα^-/-) develop glomerulonephritis after one year [Shim et al., 2004].

Secondly, pregnancy was examined for its effects on the woman’s immune system. The pregnancy on the one hand comes with changing concentrations of sex hormones but also with an exchange of cells between the women and the fetus. Arguments for and against a connection between autoimmune diseases and fetal cells have been raised in previous work[Kirby et al., 2004].

Thirdly, effects from different lifestyles on some inflammatory diseases were considered. For example, drug-induced lupus is dominant in males, which is caused by the fact that there are more men than women that are treated with drugs that induce this disease [Lockshin, 2006]. [Ngo et al., 2014] describes that the differences in for example smoking or exposure to sunlight may as well lead to different susceptibilities for inflammatory disease.

In present work, genes were identified by a modular disease-gene network approach [Badam et al., 2016]. A network consisting of two layers was thereby constructed; a disease layer consisting of disease-disease comorbidities generated from the data used in [Blair et al., 2013] and a gene layer consisting of gene-gene interactions received from the STRING protein-protein interaction database. The two layers were connected using associations between diseases and genes from the DisGeNET database [Piñero et al., 2015].

The disease-disease comorbidities were obtained for male and female separately, resulting in a male network and a female network. Within each of these two two-layered network modules of disease and genes were identified using a random walk algorithm as in [Rosvall and Bergström, 2011].

An example network consisting of three diseases and overall ten genes is shown in Fig- ure 1a. The disease layer was plotted in blue, the gene layer in green. Black lines indicate disease-gene associations. The red area M1 shows one possible module. Figure 1b shows the same network with diseases in the center and genes plotted around them. A second possible module M2 is now easier to see. To keep the figure simple the modules in the example consist of two genes and two diseases, however modules can have different numbers of diseases and genes.

In this work, the focus is set to the level of gene expressions by deriving differentially expressed genes between females and males in controls and patients. As well as the modular

(6)

(a) An example network with disease in the upper and genes in the lower part.

(b) The same network with diseases in the center.

Figure 1: An example network consisting of a disease layer (blue nodes) and a gene layer (green nodes). Red areas indicate a module.

disease-gene network approach above, the analysis of gene expression data is supposed to result in genes that show a difference between female patient and controls and male patients and controls. The accordance of resulting genes will be used to evaluate the disease-gene network approach and an enrichment analysis will be performed to identify affected pathways.

1.1 Relevance

As the causes for different susceptibility in autoimmune diseases in men and women are just partly known, further research is needed in this area. By setting the focus to genes in disease-gene interaction networks, novel highly relevant causes may be found. Significant findings will not only improve the understanding of the susceptibility but also improve treatment of the diseases, for example by predicting different drug targets for men and women in individualized medicine.

Beneath improving the understanding of different susceptibility this work also shows the interchangeability of a single gene expression approach and a disease-gene association driven approach.

By the construction of an automatic pipeline to analyze public available gene expression data sets for differences between e.g. female and male patients and controls the analyses of such data will be simplified for future research.

1.2 Aims and Objectives

In this study, we attempt to contribute to deeper the understanding of differences in inflammatory disease between men and women, by evaluating the modular disease-gene network method. This will be performed by developing a versatile pipeline for gene expression analyses in further research.

(7)

This study aims to

• validate genes identified by [Badam et al., 2016] as source for sex specific susceptibility by

– collecting gene expression data for inflammatory diseases from public databases – performing gene expression analysis on the data sets

– comparing differentially expressed genes with genes identified by [Badam et al., 2016].

– connect differentially expressed genes from different diseases

• find a possible explanation for why genes are differentially expressed between dis- eased but not between healthy men and women for modules that are not validated on the healthy data by

– performing pathway analyses on differentially expressed genes

• design in a pipeline that is able to automatically

– download gene expression data sets from public databases

– identify genes that are differentially expressed between male and female controls and patients

– perform pathway analysis of differentially expressed genes

(8)

2 Materials and Methods

To identify genes with differences between men and women in several inflammatory diseases, three functional parts were implemented:

• Firstly, data had to be provided. This was done using data sets from the public database GEO [Edgar et al., 2002]. To come around a manual download of data sets, functions for an automatic download were implemented.

• Once the data was downloaded, interesting genes could be identified. Therefore, dif- ferential expression methods and a WGCNA [Langfelder and Horvath, 2008]method were implemented.

• Finally, the identified genes were used to search for enriched functions, using Gene Ontology terms and KEGG pathways.

To perform all three steps as user friendly and fast as possible on several data sets, the parts were implemented in an automatic pipeline.

For clarification, the names of R-packages are written in bold, R-functions and functions of packages as well as their parameters are written in italic, functions implemented for this work and their parameters are written in a teletype font and implemented global variables are written in a slanted shape in the following sections. Mathematical formulas are written in math italic.

2.1 Materials

All analyzed data sets are listed in Table 1. Data sets were obtained from the GEO database [Edgar et al., 2002] and selected to contain samples of healthy controls and patients that have one of the chosen diseases. An important criteria was that the number of samples should be large enough to provide good statistical power (i.e. over twenty samples, in cases of no other data set for the same tissue and disease sometimes fewer samples). Availability of Bioconductor annotation package for the used microarray (essentially Affymetrix, Illumina and Agilent micoarray) was another criteria important for the choice of data sets. Cho- sen diseases were systemic lupus erythematosus (SLE), multiple sclerosis (MS), rheumatoid arthritis (RA), Sjögren’s syndrome (SS), diabetes type 2 (T2D), ulcerative colitis (UC), Graves’ disease and myasthenia gravis. For the latter two no suitable data sets were found.

The number of samples listed in Table 1 corresponds to the total number of samples in the data set. In some cases was the number of samples used in the analysis lower since samples of e.g. medicated patients or diseases different than the chosen ones had to be excluded. For data sets for that the sex of samples was not given, the sex was predicted as described later in section 2.4.1.

(9)

Table 1: Data sets obtained from the GEO database. Abbreviations are: SLE - systemic lupus erythematosus, MS - multiple sclerosis, RA - rheumatoid arthritis, SS - Sj¨ogren’s syndrome, T2D - diabetes type 2, UC - ulcerative colitis, pDCs - plasmacytoid dendritic cells and PBMCs - peripheral blood mononuclear cells.

Accession number Disease Tissue Samples Sex given

GSE29536 SLE whole blood 410 no

GSE65391 SLE whole blood 996 yes

GSE49454 SLE whole blood 177 yes

GSE30153 SLE B-cells 26 no

GSE4588 SLE B-cells, CD4 T-cells 49 no

GSE10325 SLE B-cells,CD4 T-cells, myeloids 67 no

GSE46923 SLE monocytes 142 no

GSE17048 MS whole blood 144 no

GSE37750 MS pDCs 26 yes

GSE43591 MS T-cells 20 no

GSE32988 MS CD4, CD8 T-cells 48 no

GSE62584 MS B-cells, CD4, CD8 T-cells 60 yes

GSE52139 MS spinal cord 16 yes

GSE59085 CIS PBMCs 22 no

GSE13732 CIS CD4 T-cells 133 no

GSE15573 RA PBMCs 33 yes

GSE4588 RA B-cells, CD4 T-cells 33 no

GSE77298 RA synovial biopsy 23 no

GSE51092 SS whole blood 222 no

GSE40611 SS parotid gland 49 no

GSE23117 SS minor salivary glands 15 no

GSE15932 T2D peripheral blood 32 yes

GSE73034 T2D muscle 28 no

GSE55650 T2D myoblasts, myotubes 23 no

GSE13760 T2D arterial tissue 21 no

GSE73034 T2D hepatic tissue 17 no

GSE20966 T2D pancreatic beta cells 20 yes

GSE25724 T2D human islets 133 yes

GSE3365 UC PBMCs 126 yes

GSE24287 UC ileum 99 no

GSE13367 UC intestinal biopsy 56 no

GSE38713 UC intestinal mucosa 43 yes

GSE53306 UC endoscopic pinch biopsy 40 no

(10)

2.2 Implementation in a pipeline

The implementation of the proposed method was done using the R software environment.

Despite being slower than a low level programming language, R was chosen because it offers many additional packages, especially for the bioinformatic purpose.

For this study, the same analysis steps had to be performed on several data sets. To auto- mate this procedure, the implementation was designed as an automatic pipeline, including all steps from downloading the data to the identification of enriched pathways. As all steps are encapsulated in functions, the pipeline actually called the functions in a specific order and passed needed arguments from one function to another. The specific steps are shown in Figure 2 and further described in the following sections.

Figure 2: The functional steps within the pipeline. The arrows describe the functionality and their order. The squares describe the resulting R object for each step.

2.3 Download

Initially the accession number of the wanted data set had to be set. The entered number was thereby checked to be a valid GEO accession number by comparing it with the global variable geoPattern, which was GSE[1-9]+. As far as known, this term matched all valid accession numbers for GEO series records. If the terms matched, the data set was downloaded from GEO using the GEOquery R package [Davis et al., 2007]. The GEOquery package was chosen, as it is a standard package to access the GEO-database from R.

The GEO2R analysis tool [NCBI, 2016] on the GEO-website also uses this package to load GEO-data into R. GEOquery allowed downloading a data set by its accession number. The data was then saved in files as well as loaded into a list of ExpressionSets. ExpressionSet is an extension of R object ExpressionSets that was implemented into R to store data of microarray experiments (e.g. expression matrix, description of samples). Once one data set

(11)

was downloaded, the package used the saved files to load the data set from the file system in succeeding runs.

The list of ExpressionSets returned by the download functions contained one ExpressionSet for all samples obtained with the same microarray type (the list has length one for most data sets). In cases where the list had a length greater than one, the first ExpressionSet was used.

Once one ExpressionSet was obtained, the subsequent analysis had to consider the type of microarray with which the data was generated. The array type was detected using the annotation slot of ExpressionSet.

Subsequently, the probe IDs were matched to other identifiers using the correct Biocon- ductor AnnotationData package available from Bioconductor [Bioconductor.org, 2016]. This was done with a call to the function loadDatabase. This function matched the given string with sub-strings describing some AnnotationData packages, i.e. u133x3p.db, hgu95av2.db, hgu133plus2.db, hgu133a.db, hgu133plus2.db, hthgu133a.db, illuminaHumanv2.db, illumi- naHumanv3.db, illuminaHumanv4.db, illuminaHumanWGDASL-v4.db and hgug4112.db. If the given string could not be matched to any of these packages, the function used the string itself as name of an AnnotationData package. In case where the name was not a valid AnnotationData package no annotation functions were loaded.

The functions needed are, for example, for the array Affymetrix GeneChip Human Genome U133 Plus 2.0 [Carlson], hgu133plus2ALIAS2PROBE to match gene symbols to probeIDs, hgu133plus2SYMBOL to match probeIDs to gene symbols, hgu133plus2CHR to match a probeIDs to the chromosome of the corresponding gene and hgu133plus2ENTREZ-

ID to match probeIDs to Entrez Gene identifiers as needed for KEGG pathway analysis.

To be available in following analysis steps, the values of these functions were saved in the global variables geneProbes, geneSymbols, CHRs and entrezID.

2.4 Group detection

After the correct annotationData package was loaded, the samples had to be grouped into male or female and control or patient. This was done by a call to getTrait with the Expres- sionSet and the chosen groups as arguments. The function checked the phenoData object of the ExpressionSets for columns matching the groups by using the values of the global variable groupRegex. This variable was a vector with possible groups as names and regular expressions as values. Values for the groups ’Sex’ and ’Disease’ were set.

As the values in the phenoData object depend on the information uploaded to the database by the researcher, it was not possible to set regular expressions matching in all data sets. In cases where no matching column of the phenoData was found (either because the regular expression did not match any column or there was no regular expression for the corresponding group) the user was asked to specify the matching phenoData column. As an aid, a table with the column names of the phenoData and the levels of the values in the columns was shown.

To make the detected groups easier to handle in following analyses, every group was con- verted into a vector with values of -1, 0 and 1, where -1 could represent controls, 1 could represent patients and 0 could represent medicated patients. This conversion was done

(12)

automatically if the group contained only one or two different values. If there were more values, the user was asked to specify which values had to be parsed to -1 and 1. All other values were parsed to 0. Samples that were 0 in one group were afterwards removed with the function removeTraitZero. Thereby it was possible to remove all samples that were not of interest, for example, samples of medicated patients.

2.4.1 Sex prediction

If a group ’Sex’ was of interest but neither automatically found nor detected by the user in the phenoData, the function calcSex was called to predict the sex of the samples.

The sex prediction was based on a method by [Badam et al., 2016], where the sex of samples was identified using the expression of the Affymetrix probes with the probeIDs 201909_at, 205000_at and 206700_s_at. These probes were identified to have a stable expression in male as well as in female samples but were higher expressed in males than in females. Firstly, the expression values were normalized by subtracting the mean expression over all samples and dividing by the standard deviation of all samples and secondly the mean over all three probes was calculated. Samples with a mean below zero were classified as female and samples with a mean higher than zero were classified as male.

To make this approach more general, some additional features were included:

- As the probeIDs are specific for Affymetrix arrays, it is not possible to use them with arrays of other manufacturers, for example an Illumina array. To overcome this problem, the probeIDs were matched to gene symbols. In case that a probeID was not found it was re- placed by the mean expression of all probes for the corresponding gene symbol. This led to different values, as there are possibly more than one probes for a gene symbol. However, in several tested data sets there was no difference in prediction found when using all probes for the corresponding genes symbols compared to the prediction when using only the three probeIDs.

- As the values were normalized around zero there would always be values lower and higher than zero regardless of the original values. This led to at least one male and one female prediction for every input. As the sex of the samples is unknown, there might have been only male or female samples. To account for that, the normalized values were only used if the range of expression between the samples was higher than a threshold (2.0). In tests on a few data sets, a range below 2.0 was only reached in data sets with same sex samples. If the range was lower than 2.0, all samples with a mean expression higher than a cut-off value were classified as male and lower than it were classified as female. On the tested Affymetrix data, a cutoff of 6.0 correctly separated between male and female. As this method might have not been that precise, a warning was generated in the latter case.

As an additional evaluation the R package massiR [Buckberry et al., 2014] was used to predict the sex of the samples. In cases where the predictions were different between the methods, a warning was generated.

2.5 Gene expression analysis

To identify interesting genes in the data two distinct methods were applied. Firstly, thelimma R-package [Ritchie et al., 2015] was used to identify differentially expressed genes between

(13)

male and female and patient and control samples. Limma was used as it offered an easy to use interface and it was capable of finding differentially expressed genes between more than two groups. In addition, it was stable when only few samples were available. This was an interesting feature as data sets with different amount of samples were analyzed.

Secondly, weighted gene co-expression network analysis (WGCNA) was used to identify sets of interesting genes. WGCNA defined modules of genes with similar expression and evaluated the modules’ correlation with clinical trait data. Here, the trait data consisted of the groups male and female and patient and control. Thereby, modules of genes with strong correlation to those groups could be identified [Langfelder and Horvath, 2008]. WGCNA was used as it was shown to discover interesting genes, which could not be discovered by other approaches [Zhou et al., 2014].

2.5.1 Limma

The analysis with limma was divided into two steps. Firstly, a linear model was fitted to the data and secondly, significant probes were extracted from the result. In order to fit a liner module to the data, a design and a contrast matrix had to be defined. Example matrices can be seen in Figure 3. The design matrix defined the membership of each sample to a number of trait groups. The rows of the matrix corresponded to samples and the columns to groups. It was possible to assign a sample to several groups. A value 1 indicates that a sample belongs to the group, a value 0 that the sample does not belong to the group.

The contrast matrix defined which differences between groups were of interest. The rows of the matrix corresponded to the same groups of samples as the columns in the design matrix, whereas each column in the contrast matrix flagged a difference that has to be examined.

By using the values 1 and -1 the groups could be assigned to take part as either positive or negative when calculating the differences. A value of 0 indicated that the group is not of interest for a specific difference.

design F C F P M C M P

F P 1 0 1 0 0

F C1 1 0 0 0

F P 2 0 1 0 0

M P 1 0 0 0 1

M C1 0 0 1 0

contrast F vsM P vsC F P vsM P

F C 1 1 0

F P 1 −1 1

M C −1 1 0

M P −1 −1 −1

Figure 3: Example design and contrast matrix. FP1, FC1, FP2, MC1, MP1 represent names of samples, FC, FP, MC and MP the groups of samples and FvsM, PvsC, FPvsMP names of calculated differences.

With the functions fit <- lmFit(data, designMatrix) and fit2 <- contrasts.fit(fit, contrastMatrix) a linear model was fitted to the data. Limma estimated the contrast β_iso that C^T(X⁻¹E[y_j]) = βiwith C being the contrast matrix, X the design matrix and y_i^T the expression data for gene j. An estimation of αj by E[yj] = Xαj was done with lmFit and C^Tαj = βj was estimated with contrasts.fit. By calling fit2 <- eBayes(fit2) the results from contrasts.fit were supple- mented with p-values for the differences as well as with the moderated F-statistics. Small

(14)

p-values in the F-statistics related to genes that were significantly differentially expressed in all comparisons specified by the contrast matrix [Smyth, 2005].

The calculation of the matrices and fitting of a linear model to the data was encapsulated in the function performLimma. This function returned the MArrayLM object as it was returned from the limma-function eBayes. The design matrix was built dynamically using the trait data frame from getTrait and matching the numerical values in it with values expected for each group. The needed contrast matrix was hard coded and provided the comparison FCvsFP, MCvsMP and FPvsMP.

For the second step of extracting significant probes three functions were available: GetLim- maSigProbes, getLimmaCombPVals and getDeltaDeltaProbes. The first two used the p- values of all comparisons defined by the contrast matrix. GetLimmaSigProbes returned all probes, which were significantly differentially expressed in the moderated F-statistics. The function getLimmaCombPVals used the individual p-values of each comparison and combined them using the function max. Thus, the highest p-value of each probe in the three comparisons was returned. GetDeltaDeltaProbes returned all probes that were significantly differentially expressed in either FCvsFP or MCvsMP, but not in both. In the context of this study, this referred to genes that differed significantly between female controls and patients or male controls and patients. The resulting probes thus corresponded to genes that were different between controls and patients in either male or female but not in both and thereby accounted for general differences in the disease between male and female. This function returned a vector of 1 and 0 indicating if a probe was significantly different in one comparison or not. Whenever p-values were examined to be significant, the Benjamini-Hochberg correction was applied before a significance threshold of 0.05 was used.

2.5.2 WGCNA

The weighted gene co-expression network analysis (WGCNA) was used as a complement to the limma analysis. As both methods use the same input, as the later steps of the pipeline worked on both results, as there was no interconnection between the two methods and the perform the same task in pipeline, the methods are fully exchangeable. Both methods were used to allow a comparison of the results. For the WGCNA the R-package with the same name was utilized [Langfelder and Horvath, 2008].

Using the function probesDiff it was possible to run WGCNA on two different subsets of the complete data. Here, firstly all female samples in the data and secondly all male samples were used. To decrease the amount of probes passed to the WGCNA network construction, each subset was filtered by keeping only the 5000 probes with the highest variance (as variance filtering was recommended in the WGCNA FAQ [Langfelder and Horvath, 2014]).

Filtering saves a significant amount of time and reduces the number of genes so, that during the network building process the data was not separated into ’blocks’. Blocks are subsets of the data for those first networks were built separately and then afterwards merged. The separation would have been needed for big data sets because of memory restrictions on the executing computer.

Afterwards probesDiff called the function getProbesBestModules for both subsets and a named vector of 1 and 0 was returned, indicating for each probe if it was in one but not

(15)

both results for the subsets. Thus, the function checks for each probe if it was correlated to disease-status in one gender but not in the other one or not.

GetProbesBestModule returned probes that belong to the module in the network with the strongest correlation to the trait data. It internally built a network with the WGCNA function blockwiseModules(data, power = sft). This function constructed a network on the given data set and assigned each column (corresponded to probes here) to a module. The value sft set the soft-thresholding power for the network construction. Correlations between probes were raised by this power to define their adjacency value. The adjacency a_i,j of probes x_i and x_j was defined as abs(cor(x_i, x_j))^sft. A higher value for sf t increased scale independence of the network but decreased its mean connectivity, as the values for a_i,j were lower for a high value of sft (due to the fact that abs(cor(a, b)) returned values in the range of [0, 1] and raising them to a power >1 decreased the values). By default, the function cor calculated Pearson correlation of the given values. As the resulting a_i,jvalues were continuous in [0, 1]

the resulting network was weighted. This technique is different from an unweighted network where the values of a_i,j were either 1 if abs(cor(x_i, x_j)) >=hdt or 0 otherwise.

The WGCNA package came with functions to estimate values for both sft for a weighted network and hdt for an unweighted network. The values were estimated so that the network has a good trade of between scale free topology and mean connectivity (As a high scale free topology but also a low mean connectivity can be obtained with a higher value for sft and vice versa with a low value for sft). The functions pickSoftThreshold( data, powerVector

= powers)$ powerEstimate and pickHardTreshold( data, cutVector = powers)$ cutEstimate returned the lowest value from the vector powers (here set to values form 1 to 30) for that the resulting network was estimated to have a good scale free topology. This was defined via an internal factor RsquaredCut. As weighted networks allowed a better representation of the biological background (continuous values rather than true or false), only weighted networks were used here.

Using the function moduleEigengenes(data, colors)$eigengenes with colors indicating the corresponding module for every gene, the eigengene for each module could be found. An eigengene was the first principal component of the expression matrix for all genes in that module and could be used as a replacement for an average expression profile. Using the function with $averageExpr, the normalized average expression of the modules for each sample was returned. Both the eigengenes and the average expressions of the modules could be correlated with the expression of the trait data to find modules corresponding to the trait data. The eigengenes were strongly correlated with the intradmodular hub genes and thus it was better to compare the eigengenes with the trait data, as the hub genes were likely to represent central information of the module. The correlation between the trait data and the modules as well as the correlations significance was calculated utilizing corPvaleStudent. The module with the lowest p-value for its correlation to the disease status was then considered as best module. This module (and thus also its genes) was expected to have a strong correlation with the trait. Besides returning the probes of this module, all probes were saved to a file with their module-color and the probe’s correlation with the trait data. In cases where the parameter visualize for the function getProbesBestModule was set to TRUE, the network was additionally plotted with the function visualizeGraph using Pearson correlation between the probes to generate edges and the color that was assigned

(16)

to each probe by WGCNA as color of the nodes in an graph generated with the R-package igraph [Csardi et al., 2006]. The network building generated some additional plots that were used to control the quality of the constructed network, i.e a scatter plot of trait significance vs module membership in the module was generated. The module membership should have followed the trait significance in a linear manner in this plot to approve the connection between the module and the trait values.

2.6 Enrichment analysis

The enrichment analysis of the identified gene set was performed using two widely used databases. Gene Ontology (GO) terms that are enriched in the derived set of genes were examined. The GO terms identified genes by the action of their products in terms of biological process, molecular function and cellular component [Ashburner et al., 2000].

Beside GO enrichment the KEGG PATHWAY database was used. The KEGG PATHWAY database stored maps of well-known pathways [Kanehisa Laboratories, 2016] and was used to explore if some identified genes shared affected pathways.

As the limma analysis and WGCNA returned either a vector of p-values or a factor of 0 and 1, both named with the corresponding probes, the enrichment was adjusted to work with those two values.

2.6.1 GO enrichment

For the GO enrichment the package topGO [Alexa and Rahnenfuhrer, 2010] was chosen.

This package allowed the use of different statistical tests and algorithms to apply the tests and came with the possibility to visualize the most enriched GO terms in an easy to un- derstand graph structure. Looking through the user guide on Bioconductor the other two suitable packages GOexpress and GOstats did not have these features. TopGO also sup- ported an enrichment analysis based on both vectors of p-values and factors of 0 and 1.

The GO enrichment analysis was encapsulated into the function goEnrichment that needs either a vector of named p-values or a named factor as argument.

The function used the topGO package in three sub-steps:

- Firstly, a topGOdata object was created with new( "topGOdata", onotology = ont, allGenes

= pVals, geneSel = topDiffGenes, affyLib = db). The object could be created for GO terms of the ontology ont ("BP" = biological process, "MP" = molecular function or "CC" = cellular component). In this work only the ontology biological process was used. By changing the value of the parameter affyLib to the correct database (as in section 2.3) the GO enrichment was performed on the right GO terms for every probe (even if the array was not an Affymetrix microarray).

- Secondly, the enrichment in the topGOdata object were tested for statistical significance with the function runTest( topGOdata, statistic = stat, algorithm = algo). The statistical test that was used to test the significance of a GO term was "fisher" for Fisher’s exact test.

Setting the parameter algorithm to "classic" tested the enrichment for every GO term inde- pendently from the significance of adjacent GO terms. The algorithm "elim" for example would have removed the significance used in the enrichment of one GO term from all ancestor terms. The elim-algorithm would have resulted in lower false positives rates than

(17)

the classic-algorithm but also in higher false negative rates [Alexa, 2005]. As there was a chance that false positives would be detected in a later step but false negatives could not be recovered the classic-algorithm was used in this work.

- Once the significance of the enrichment was tested, the results were stored. Therefore, a table of most enriched GO terms was obtained with the function GenTable( goData, test = result, orderBy = "test", topNodes = x). This function returned a table of the x most enriched terms sorted by their values for test. For every of the x terms the table had one row con- taining the GO-ID, GO term, number of annotated genes, number of annotated significant genes, number of expected genes as well as the p-values in result. This table was appended with the symbols of genes identified for each GO term and stored as result for the GO enrichment analysis. The relation of the x most enriched terms was visualized in a graph using the function printGraph( goData, result1, firstSigNodes = x, fn.prefix = file, pdfSW=TRUE).

This graph consisted of the x most enriched terms as well as their ancestor terms and the connections to those. The significance of all terms was encoded in colors ranging from light yellow to dark red. Setting pdfSW = TRUE directly wrote the graph into a pdf at the location specified by file.

2.6.2 KEGG pathway

To access the KEGG PATHWAY database the KEGGprofile R package was chosen. In contrast to the KEGGgraph package, the KEGGprofile package allowed the use of multi- group data. This was of use when handling data with the two groups’ sex and disease status [Zhao et al., 2015].

Instead of a factor or vector of all probes, the library KEGGprofile used a vector with Entrez IDs for only interesting/significant genes. The probes were therefore translated into Entrez IDs of the corresponding genes using the global variable entrezID. As there could have been more than one probe identified as interesting for one entrez-ID, it was possible that the mean expression had to be used. This had no effect on the identified pathways as they were ranked according to the number of corresponding genes and not according to their expression.

Having the genes annotated with their entrez-ID, enriched pathways were identified with find_enriched_pathway( genes, species = "hsa", returned_pvalue = pVal,

returned_genenumber = n). The species argument allowed the enrichment on pathways for different species. As the focus of this work was set to diseases in human, "hsa" was used for human pathways. The values pVal and n allowed setting of thresholds for the returned pathways. PVal set the maximum p-value for enrichment and n the minimum number of annotated genes. The most enriched pathway were then plotted with the function plot_pathway(

gene_expr, groups, pathway_id, species). This function downloaded and plotted a pathway- map from the KEGG PATHWAY database matching pathway_id and species. The parameter groups was used to set the groups within the expression data of the genes. As the pathway maps were very interesting but first had to be downloaded and extended, only the maps for the five pathways with the highest coverage of identified genes were downloaded in an automatic execution of the pipeline.

(18)

2.7 Analysis of given set of genes

In addition to identify genes that are differentially expressed in the expression data sets, also the genes found in the previous disease-gene network approach by Badam et al. had to be examined in terms of showing differences in the groups male/female and control/patient.

As the genes were identified by their gene symbol, here the mean expression of all corresponding probes was used as gene expression. The function checkGenes(genes,sep=",

",entryNames=NA) checked the genes given by gene in the data set for setAccessionNum- ber by plotting their expression profile. The expression profile showed the expression of the genes in all samples. In many cases was it possible to see differences between the groups in the expression profile. To make the differences between the groups better visible, also the mean expressions of the genes in the groups were plotted. For a set with many genes it was however not possible to distinguish between the expression profiles of all genes.

In addition to the expression profile, ratios between the groups, for example the ratios female patients vs male patients, female patients vs male patients or fold change in males vs fold change in females, were plotted. The first two plots allowed identifying genes that are down or up regulated between the specified groups, whereas the latter identified genes with a difference in the comparison of males and females.

2.7.1 Evaluation with the results of expression analyses

In order to evaluate the genes identified by [Badam et al., 2016] for one disease with the results of the gene expression analysis for the same disease, the average occurrence per gene per result was used.

Therefore, for each gene out of the genes identified by [Badam et al., 2016] was counted in how many of our gene expression analysis results it was. The occurrences of all genes were summed up and divided by the number of results and the number of genes in the given set (Equation 1). This value was intended to be independent of the number of genes and the number of result files. This independence allowed a comparison of gene sets with different lengths and a comparison between the diseases as for different diseases different numbers of results were available.

avg(genes, results) =

#genes

X

i=1

#results

X

j=1

x_i,j

#genes · #results with x_i,j =

(1 genei ∈ result_j

0 gene_i ∈ result/ _j (1) The same average values were computed using random genes and all genes that were present in at least one result for each disease.

To have an average value for a random gene set that was stable between different runs, firstly, average values were computed for 100 independent random sets with the same number of genes as the set by [Badam et al., 2016] for each disease and secondly the average over all 100 averages was calculated. Genes were randomly obtained from all genes annotated in the Bioconductor annotation package for the hgu133plus2 microarray (63786 genes).

(19)

The resulting average value was between 0 and 1, where 0 indicated that none of the genes was found in any result and 1 indicated that all genes were found in all results. It could be seen as percentage of how many genes out of the gene set were found in average in each result (i.e. a value of 0.1 showed that in each result 10% of the genes were found).

The average was tested to be independent of number of genes and number of files and random generated values. The results in Figure 4 show that the computed average value (as in Equation 1) is independent of gene set sizes and number of files (thick black lines) in randomly generated sets. However, the lower the number of genes or files, the higher the variance.

To test the average for independence of length of gene sets, ten pseudo result files with random lengths in [0,¹₄×#genes] were generated (here min: 718 genes, max: 14726 genes, mean: 10111.2 genes). For 100 runs, where each run generated 100 gene sets with 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000 random genes, the average of each gene set was computed. Finally, the mean of all average values of sets with the same length are generated and plotted as colored lines.

To test for independence of number of files, first 100 set of genes with each 100 random genes were generated. In 100 runs 100 pseudo result files with random lengths in [0,¹₄ ×

#genes]of random genes were generated (here min: 2 genes, max: 15945 genes, mean:

7971.838 genes). The average value was computed over all 100 set of genes in the first 5, 10, 25, 50, 75 and 100 pseudo result files (colored lines).

(a) Averages of gene sets with given lengths (x- Axis), 100 gene sets for each.

(b) Averages of 100 gene sets with each 100 genes in ’number of files’ (x-Axis) files.

Figure 4: Independence of the computed average from the number of genes and the number of files. Plots were generated on 100 runs (colored lines) for each plot. Thick black line indicates mean averages over all 100 runs.

In order to check the probability of a score for a set of [Badam et al., 2016] to be a random score, a normal distribution of random scores was approximated by using the fitdistr -function

(20)

of the fitdistrplus package on scores for 5000 sets of random genes with the same length as the given set of [Badam et al., 2016]. This function returns µN and σN of an approximated normal distribution. If the probability of a score of a gene set to be a random score is very low, then the probability of the gene set to be a random gene set is very low as well.

The probability to have a value extreme as or extremer than x (as far away or further away from the distributions mean µN as x) was calculated using the normal distributions cumu- lative density function (CDFN) as shown in Equation 2. The CDF gives the probability of the distribution having a value lower than x. This value had to be subtracted from 1 if x is greater than µN to get the probability of the distribution having values bigger than or equal to x. To account for extreme values at both lower and upper tail of the distribution the value finally had to be multiplied with two. The returned value was seen as p-value for x being a random score, even though it is actually probability of all values extreme as or extremer than xbeing a random score.

p(x) =

(2 ∗ CDFN(x, µ = µN, σ = σN) x ≤ µN

2 ∗ (1 − CDFN(x, µ = µN, σ = σN)) x > µN

(2)

2.8 Unimplemented methods

Some methods that would have been of benefit for this work could not be implemented.

A uniform pre-processing for all data sets rather than different methods for different data sets would have improved this work. The selected preprocessing of raw data by the tools Affymetrix QC & pre-processing and Illumina QC & pre-processing ([Eijssen et al., 2013]

and [Eijssen et al.]) could not be integrated into the pipeline due to a difference in expected and available data types, i.e. was the Affymetrix pre-processing made for objects of type AffyBatch, whereas the data was saved in ExpressionSets. Both classes extend the class eSet, it was however not possible to adapt the pre-processing for objects of class eSet.

Secondly, was it not possible to append download of data sets from ArrayExpress [Kolesnikov et al., 2015]. The R-package [Kauffmann et al., 2009] was selected for this task because it was the recommended R package on the website of the ArrayExpress database [EMBL-EBI, 2016] to access this database.

The download function could be used to download data sets from ArrayExpress, where no information about microarray type was available. Accession numbers were used as ArrayEx- press accession numbers if they matched the pattern E-[A-Z]+-[0-9]+ in the global variable arrayExpressPattern. In order to call other functions of the pipeline the needed Bioconductor annotation package had to be set manually. In future work is it possible to ask for user input to specify the array type for ArrayExpress data sets, as this pauses the automatic execution of the pipeline and all data sets were available from the GEO database as well, this was however not implemented.

(21)

3 Results

This section explains the implemented pipeline and shows two example results (for the data sets GSE29536 and GSE65391 with samples of SLE samples and healthy controls) of it. As 35 data sets were analyzed, it is not possible to show the results for all data sets.

3.1 Pipeline

The implemented pipeline can be executed automatically and requires only minimal user input. As initial step needs the R-code setUp.R to be executed. This code loads and sets up all functionality. There are five functions that manage the main user interaction:

• setAccessionNumber( number): This function sets the accession number of the data set that is then used by startPipeline and checkGenes. It will also update the folder structure as in Figure 5.

• startPipeline( numberEset = 1): This function starts the analysis pipeline. Most steps are executed without further user input. Only during the group detection, might the pipeline stop and wait for user input. The code of this function is attached in section 7.1.

• checkGenes( genes, sep = ", ", entryNames = NA): This function checks the expression of genes given by genes, a string with genes separated by sep or a list of such strings for more than one set of genes, and plots different figures based on their expression.

• setRoot(path): This function migrates the folder structure into that files are written to path.

• help(): Writes some help, basically the available functions, to the console.

The code setUp.R also generates a folder structure into which downloaded data sets and results will be written during the execution of the analysis steps. To be operating system independent ’\\’ is used as directory separator in Windows systems and ’/’ in Linux systems.

An example folder structure is to see in Figure 5. The folder ’R’ is the location of all R-code.

Its name is however not important and can be different. The folders ’Data’ and ’Results’

will be created and filled with data during the execution of the program. For every used accession number, a sub-folder is generated in these two folders. The ’DATA’ folder contains all the files that are downloaded from the databases, the folder ’WORK’ is intended to contain later computed data. The folders ’Enrichment’, ’Limma’ and ’WGCNA’ contain the results of each step. The folder ’Checked’ will contain results if a set of genes is checked for their expression in the data set. By calling the function setRoot() it is possible to change the location of ’Data’ and ’Results’ to any location in the file system.

(22)

rootF older Data

GSE29536 DATA

GSE29536_matrix.txt.gz WORK

R

setUp.R Results

GSE29536 Checked Enrichment Limma WGCNA

Figure 5: An example directory tree as used by this program.

As all functionality of the pipeline is encapsulated in functions, it is possible to customize the executed program by for example executing some functions and replacing/removing others.

By changing the parameters of a function, it is further possible to change its behavior and adopt it to other problems. As most of the parameters are appended with default values, this flexibility can be used if needed but comes with no further effort if not needed. Two example results of the pipeline are given in the following two sections.

3.2 Results for SLE in whole blood, data set I

The data set for GSE29536 contains 410 samples obtained with the Illumina Human-6 v2 Expression BeadChip. It contains whole blood profiles for patients with various diseases as well as control samples. Here, only samples for the disease systemic lupus erythematosus (SLE) and control samples were used (106 patients and 114 controls). As the uploaded data contains no information about sex of the samples, this information was predicted, resulting in 60 female controls, 83 female patients, 54 male controls and 13 male patients.

3.2.1 limma

The number of probes that are significantly differentially expressed between female patients and female controls respectively male patients and male controls in the limma comparisons can be seen in the Venn-diagram (Figure 6). It is likely that the higher number of differentially expressed genes in females compared to the number in males is due to the higher number of female patients. However, the function getDeltaDeltaProbes returns a high number of probes (14892), but all of them are only contributed from the female comparison. The two other methods identify a slightly lower number of probes (F-stats: 3321, combination: 573).

(23)

Figure 6: Venn-diagram showing the number of significantly differentially expressed probes in GSE29536. ’X up in Y’ represents probes that are significantly higher expressed in disease status Y in sex X compared to the other disease status in the same sex.

The number 32606 in the lower right indicates genes that are not differentially expressed.

3.2.2 WGCNA

The WGCNA analysis resulted in five modules in the female network (the grey module is a container for all unassigned probes). The brown module has the highest correlation with disease status (correlation: 0.7, p-value 3 ∗ 10⁻²², Figure 7a). In the male network three modules were identified: the blue module has a correlation of 0.58 (p-value 3 ∗ 10⁻⁰⁷) with disease status Figure 7b.

All nodes (representing probes) have only edges (representing high correlations) to nodes of the same color (probes from the same WGCNA module) (Figure 7c). This indicates that all probes of the female network have only high correlations to probes within the same WGCNA module. The only exception is one probe from the blue module that has high correlations with two probes from the turquoise module). The separation of probes from modules in disconnected subgraphs is result of a cutoff to fixed number of correlations/edges.

The scatter plot in Figure 7d shows that central probes in the brown module, i.e. probes that have a high correlations to other probes of the module, have a high correlation to disease status.

Out of the 68 probes in the blue module of the male network and 94 probes in the brown module of the female network, 46 probes were found unique in one of the two modules (10

(24)

in the male and 36 in the female module).

(a) Brown module: Correlation 0.7, p-value 3 ∗ 10⁻²².

(b) Blue module: Correlation 0.58, p-value 3 ∗ 10⁻⁰⁷.

(c) Around 2500 highest correlations (edges) between probes (nodes) in the female subset.

(d) Significance for the type of the module of each probe vs the connection of the probe into the module, brown module, female subset.

Figure 7: Results of the WGCNA-analysis for GSE29536.

3.2.3 GO enrichment

In Table 2 are the four most enriched GO terms for each of the four methods listed. Com- paring the enrichment results for all four methods the most enriched terms for all three implemented limma based methods overlap whereas the result of the WGCNA analysis are clearly distinct. This is likely to be effected by the number of identified probes. The WGCNA method identified 46 probes (to be unique in one of the best module from female and male networks), whereas the limma methods identified 573 (combined), 3321 (F-stats) and 14892 (delta-delta) probes. An effect of number of probes to different enrichment is support by the finding that an enrichment analysis of the 150 probes of the limma F-stats method with the

(25)

highest p-value (p-value < 10⁻¹⁵) enrich the pathways ’defense response to virus’, ’type I interferon signaling pathway’ and ’negative regulation of viral genome replication’, a sub-term of ’regulation of viral process’. Thus, an enrichment analysis with fewer probes of the limma methods is likely to result in GO terms similar to the WGCNA method. The method used for GO enrichment here sets all p-values < 10⁻³⁰to 1e-30, this was the case in enrichment results found for the limma F-stat and the limma delta-delta method. It is hence not known if the GO terms given here are indeed most enriched or if other terms that are not listed first but have a p-value < 10⁻³⁰as well are the most enriched terms.

Table 2: Most enriched GO terms for each of the four methods in the data set GSE29536.

WGCNA comb F-stats delta-delta

interferon-gamma signaling / response to interferon gamma

SRP-dependent cotranslational protein targeting to ER

translation

regulation of viral process

viral transcription nuclear-transcribed mRNA catabolic process

cellular nitro- gen compound metabolic process immune response nuclear-transcribed

mRNA catabolic process, nonsense- mediated decay

translation cellular macro- molecule metabolic process

type I interferon signaling / response to type I interferon

translation viral transcription SRP-dependent cotranslational protein targeting to membrane

To evaluate the results from the enrichment analysis, the expression of the genes annotated to ’interferon-gamma-mediated signaling’ identified by the WGCNA method and the expression of genes annotated to ’SRP-dependent cotranslational protein targeting to membrane’

identified by the limma F-stats method were examined.

Figure 8a shows that all identified genes for ’interferon-gamma-mediated signaling’ are higher expressed in patients compared to controls, with small differences between male controls and female controls (Figure 8b) but always higher expression in female patients than in male patients (Figure 8c). This results in fold changes that are around 1.5 - 1.7 times higher in females than in males (Figure 8d). Thereby the identification of those genes to show different behavior between female respectively male controls and patients is con- firmed, a higher activity of ’interferon-gamma-mediated signaling’ pathway in female patients than in male patients is very likely.

(26)

(a) Mean of controls (C-Mean) vs mean of patients (P-Mean).

(b) Mean of female controls (FC-Mean) vs mean of male controls (MC-Mean).

(c) Mean of female patients (FP-Mean) vs mean of male patients (MP-Mean).

(d) Fold change between female controls (FC/FP) and patients vs fold change between male controls and patients (MC/MP).

Figure 8: Expression of genes annotated with ’interferon-gamma-mediated signaling pathway’

and identified by WGCNA analysis for GSE29536.

The genes identified by the limma F-stats method and annotated to ’SRP-dependent cotranslational protein targeting to membrane’ have a higher expression in controls than in patients (Figure 9a). Interestingly, all genes are higher expressed in females in controls (Figure 9b) whereas the genes are higher expressed in males in patients (Figure 9c). Fig- ure 9d shows that this is reasoned by the fact that most genes have very small differences in

(27)

expression between male controls and patients (as the fold change is around 1 - 1.5) but are 1.25 - 2.5 times higher expressed in female controls than in female patients. Only one gene has a higher fold change in the male comparison than in the female comparison (green dot in the upper center of Figure 9d). In comparison to the fold changes of genes annotated to

’interferon-gamma-mediated signaling pathway’ in Figure 8 all genes are higher expressed in controls than in patients (folds greater than one). The figures show that genes identified by the limma F-stats method and annotated with ’SRP-dependent cotranslational protein targeting to membrane’ have different expressions between female respectively male controls and patients, here almost no difference between male patients and controls but a difference between female patients and controls.

(28)

(a) Mean of controls (C-Mean) vs mean of patients (P-Mean).

(b) Mean of female controls (FC-Mean) vs mean of male controls (MC-Mean).

(c) Mean of female patients (FP-Mean) vs mean of male patients (MP-Mean).

(d) Fold change between female controls and patients (FC/FP) vs fold change between male controls and patients (MC/MP).

Figure 9: Expression of gene annotated to ’SRP-dependent cotranslational protein targeting to membrane’ and identified by limma F-stats analysis for GSE29536.

3.2.4 KEGG enrichment

In addition to GO enrichment, KEGG enrichment was performed. The highest enriched pathways for genes identified by each of the four methods are shown in Table 3. For the results of the WGCNA and the limma combination method only one respectively two pathways were

(29)

found. Genes for the limma F-stats and limma delta-delta methods enrich many pathways (22 and more than 50), for them only the five most enriched pathways are shown.

Table 3: Most enriched KEGG pathways for each of the four methods in the data set GSE29536.

Influenza A Ribosome Purine metabolism Oxidative phosphorylation

Herpes simplex in- fection

Pyramidine metabolism

Purine metabolism

Ribosome Pyrimidine

metabolism Chemokine signal-

ing pathway

Valine, leucine and isoleucine degradation

Cell cycle pathway Lysine degradation The ’purine metabolism’ pathway can be seen in Figure 10. The red background profile behind some genes shows those that were identified by the limma delta-delta method. It represents the expression profile of the identified gene (or mean if more than one probe for the same gene was identified) over all samples. Even though the profiles cannot be seen in detail as they were plotted into small squares of the pathway map, it can be seen that the genes have a very similar expression profile. The similar expressions that are differentially expressed between samples identify the complete pathway to be differentially expressed between the samples. Similar expressions between the annotated and identified genes are even better to see in Figure 11, the pathway map for the ’oxidative phosphorylation’

pathway with genes identified by the limma delta-delta method. Most of the genes have a high expression in the first samples and a moderate expression in the right center. Beside of the similar expression, it is also interesting how many of the associated genes the analysis identifies to be significantly differentially expressed.

(30)

Figure 10: The ’purine metabolism’ pathway map from the KEGG database. Genes with a red background profile are genes identified by the limma F-stats method for GSE29536.

Figure 11: The ’oxidative phosphorylation’ pathway map from the KEGG database. Genes with a red background profile are genes identified by the limma delta-delta method for GSE29536.

(31)

3.3 Results for SLE in whole blood, data set II

Similar to the data set GSE29536, GSE65391 also contains whole blood samples of SLE patients and controls. Thus, the two data sets can be used in a comparison. The sex of the samples is given in this data set, out of 996 samples 57 are female controls, 817 female patients, 15 male controls and 107 male patients.

The Venn-diagram in Figure 12 shows that for this data set again most of the genes that are significantly differentially expressed between controls and patients are identified in the female but not in the male comparison. The numbers of significantly differentially expressed genes in both males and females are higher than the numbers for GSE29536, the number of genes that are only differentially expressed in females is however lower. These differences are partly reasoned by a different number of samples, different female-vs-male ratios of samples (7.2:1 (GSE65391) vs 2.2:1 (GSE29536)) and different numbers of probes (42464 (GSE65391) vs 47668 (GSE29536)).

Figure 12: Venn-diagram showing the number of significantly differentially expressed probes in GSE65391. ’X up in Y’ represents probes that are significantly higher expressed in disease status Y in sex X compared to the other disease status in the same sex X. The number 31209 in the lower right indicates genes that are not differentially expressed.

The number of probes identified (as significant differentially expressed) by the four analysis methods are 177 for WGCNA, 519 for combined limma p-values, 4480 for the limma F- statistics and 10071 for the limma delta-delta method. The order of amount of returned probes between the methods is the same as in GSE29536.

(32)

Comparing the GO terms from Table 4 with the GO terms for GSE29536 in Table 2, an enrichment of similar terms can be observed for the genes identified by the WGCNA analyses and the limma F-stats analyses. The other two methods identified genes with different enrichment, as the tables show only four terms, some enriched terms for the analysis in one data set can be enriched in the analysis of the other data set as well but not within the first four terms (e.g. the term ’SRP-dependent cotranslational . . . ’ found as most enriched for genes identified by the limma combination method in GSE29536 is found highly enriched in GSE65391 as well but not within the best four).

Table 4: Most enriched GO terms for each of the four methods in the data set GSE65391.

immune response protein localiza- tion to basolateral plasma membrane

translation cellular nitro- gen compound metabolic process response to

interferon-gamma

maintenance of api- cal/basal cell polar- ity

SRP-dependent cotranslational protein targeting to membrane

cellular macro- molecule metabolic process

antigen processing and presenta- tion of exogenous peptide antigen via MHC class I, TAP- dependent

rRNA metabolic process

nuclear-transcribed mRNA catabolic process, nonsense- mediated decay

gene expression

regulation by virus of viral protein levels in host cell

ribosomal small subunit biogenesis

viral transcription RNA processing

(33)

4 Analysis

4.1 Systemic lupus erythematosus

For the disease systemic lupus erythematosus (SLE) nine suitable data sets were found in GEO. Four data sets contained whole blood samples (GSE29536, GSE65391, GSE61635 and GSE49454). The data set GSE50772 contained samples of only peripheral blood mononuclear cells (PBMCs). GSE46923 contained samples of monocytes and GSE30153, GSE10325 and GSE10325 samples of B cells, CD4 T cells and myeloids respectively.

Comparing the GO terms found for the data sets a difference in interferon signaling pathways (interferon type I and interferon gamma) and in genes associated to response to viruses is to see. The most significantly enriched terms for the results of each data set can be found in appendix Table 6. In 48 results obtained by the four methods and for different subsets of data sets ’type I interferon signaling pathway’ was found often as enriched process (14 times within the 50 highest enriched processes / 10 times within the 10 highest enriched processes, each enrichment with a p-value lower 0.01). In average are 160 probes in the complete data sets and 28 probes identified as significantly differentially expressed (limma) or within one of the two best modules (WGCNA) annotated to it. Only the more general term ’immune response’ was found more often (16 / 9 times) than the ’type I interferon signaling’. Besides ’interferon type I signaling’ ’viral process’ (13 / 3 times) and ’response to virus’ (10 / 8 times) are found frequently. As interferon is for example released by cells with viral infections [Haller et al., 2006] a finding of this terms together is not surprising.

In average are 180 significant genes and 1459 genes over all annotated to ’viral process’, showing that the enrichment for it comes not only from shared genes with interferon type I signaling. Other frequently annotated terms that cannot be seen as sup-terms of the above are ’peptide biosynthetic process’ and it’s sub-term ’translation’ (9 / 1 times and 8 / 0 times

’translation initiation’ 4 times in the 10 highest enriched pathways), ’response to interferon- gamma’ (7 / 2 times) and ’SRP-dependent cotranslational protein targeting to membrane’ (7 / 3 times).

Both interferon type I and interferon gamma are known to play important roles in SLE [Chiche et al., 2014] and are currently evaluated as drug targets [Kirou and Gkrouzman, 2013, Math- ian et al., 2015]. A difference in the targeted pathways could influence the success of a medication. However, no literature showing a difference in success of treatment targeting interferon type I or interferon gamma between male and female was found. From Figure 8 it is to see that a difference in expressions of genes corresponding to ’interferon-gamma- mediated signaling pathway’ exists between female patients and male patients (c) but also a difference exists between control and patients for both female and male (a, d), supporting the usefulness of interferon type gamma blocking drugs for both sexes. As the genes OASL, OAS1, OAS2, STAT1, GBP1, BST2 and IRF9 are annotated to the GO term ’type I interferon signaling pathway’ as well the same can be said for interferon type I blocking drugs.

4.2 Multiple sclerosis

For multiple sclerosis (MS) were six suitable data sets as well as two data sets for clinically isolated syndrome (CIS) examined. The CIS is related to MS as 85% of MS patients first