EVALUATING THE BIOLOGICAL RELEVANCE OF DISEASE CONSENSUS MODULES


An in silico study of IBD pathology using a bioinformatics approach

Bachelor Degree Project in Bioscience G2E 30 credits

Spring term 2019

Author Joel Ströbaek

a16joest@student.his.se

Supervisor Hendrik Arnold de Weerd

hendrik.arnold.de.weerd@his.se

Co-supervisor Björn Olsson

bjorn.olsson@his.se

Examiner Zelmina Lubovac

zelmina.lubovac@his.se


Abstract

Inflammatory bowel disease encompasses a variety of heterogeneous chronic inflammatory diseases that affect the gastrointestinal tract, where Crohn’s disease and ulcerative colitis are the principal examples. The etiology of these, and many other complex human diseases, remains largely unknown and therefore poses a relevant target for novel research strategies. One such strategy is the in silico application of network theory derived methods to data sourced from publicly available repositories of e.g. gene expression data. Specifically, graphs of interconnected elements enriched by differentially expressed genes (disease modules) were inferred with data available through the Gene Expression Omnibus. Based on a previous method, the current project aimed to evaluate disease modules, combined from stand-alone inferential methods, in disease consensus modules: representing pathophenotypical motifs for the diseases of interest. The modules found to be significantly enriched by genome-wide association study inferred single-nucleotide polymorphisms, as validated using the Pathway Scoring Algorithm, were subsequently subjected to further analysis using Kyoto Encyclopedia of Genes and Genomes pathway enrichment and literature searches. The results of this study adhere to previous findings relating to the employed method, but lack any novelty pertaining to the diseases of interest. However, the results substantiate the preceding method’s conclusion by including parameters that increase statistical validity. In addition, the study contributed peripheral results concerning both the methodology of consensus module methods and the elucidation of inflammatory bowel disease etiology and disease subtype differentiation, which pose interesting subjects for future investigation.


List of abbreviations

AEM Annotated expression matrix
CD Crohn’s disease
DCM Disease consensus module
DEG Differentially expressed gene
DG Disease gene
DM Disease module
DREAM Dialogue on Reverse Engineering Assessment and Methods
GEO Gene Expression Omnibus
GWAS Genome-wide association study
HPC High-performance computing
IBD Inflammatory bowel disease
KEGG Kyoto Encyclopedia of Genes and Genomes
NSC National Supercomputer Center
Pascal Pathway Scoring Algorithm
PPI Protein-protein interaction
PPIN Protein-protein interaction network
S2B Double specific-betweenness
SNP Single-nucleotide polymorphism
STRING Search Tool for the Retrieval of Interacting Genes/Proteins
UC Ulcerative colitis


Introduction

The clinical treatment administered to a patient is most often the result of years of costly trials and testing, preceded by extensive research into the afflicting illness. And when determining the specific illness that ails a patient—in order to administer the correct treatment—powerful diagnostic tools are required that accurately assess the disruptions caused by the disease. Owing to the complexity of human biology, and of the diseases that afflict us, finding precise predictive markers (biomarkers), as well as therapeutic targets, is an ongoing effort where informatics and computational approaches have proved reliable additions (Barabási et al., 2011). In silico based research further allows for a move away from traditional single-gene targets to a broader systematic approach that has the potential to include the complete human interactome (the physical interactions of molecules within a cell or system) (Loscalzo et al., 2007; Chen & Butte, 2013; Gustafsson et al., 2014a). Network medicine, which bridges systems biology and network theory, ultimately aims to address the complexity of human disease by the amalgamation of existing and future quantitative data (Zanzoni et al., 2009; Barabási et al., 2011; Lee & Loscalzo, in press). But, to accomplish this feat there are several key aspects that require further attention (Mitra et al., 2013; Lee & Loscalzo, in press)—some of which will be touched upon below.

Networks are already widely used to represent and study interactions between elements in several scientific disciplines. In bioscience this includes the study of biological processes, like those observed in the interactome, which in turn is used, e.g., to describe and investigate pathogenesis and disease progression (Vidal et al., 2011; Lage, 2014). Essential to the application of network theory is the idea that intricate systems can be graphically represented (Figure 1) by mapping their constituents (nodes) and their interactions (edges). In contrast to initial assumptions, most biological networks are scale-free, and do not correspond with a Poisson distribution (Barabási & Albert, 1999). This can be observed through the heterogeneous degree distribution of the network nodes, where “degree” is defined as the number of edges connecting one node to other nodes. Simply put, a few elements will be highly interconnected to surrounding nodes, forming hubs, while a large portion of elements will be poorly connected (Boccaletti et al., 2006). When analyzing biological networks of this nature, hubs will represent important aspects of the system of interest. For example, if each node represents a protein and each edge the connection between a protein pair, as was shown by Jeong et al. (2001), removing hub proteins substantially increases (by a factor of three) the corresponding cell lethality, indicating their importance for network integrity and, ultimately, organism survival.

Figure 1. Overview of select clustered network types and their constituents. For a biological network, nodes represent e.g. genes, proteins or processes, which are connected through edges. The edges can in turn be weighted (A), to further define the strength/weakness of any underlying biological interaction, or unweighted (B) for a more rudimentary representation. More information may also be gained by (C) identifying modules, subsets of elements that all share commonalities based on a given parameter. These include, e.g., similar biological functionality or involvement in a particular disease. Modules may also be based purely on topological features and are typically identified from the pattern of interactions using network concepts, such as shortest paths or betweenness.

Further, several concepts may be employed to describe the nature of communication within a network (Figure 2). For example, the shortest path denotes the minimal length (number of edges for unweighted networks) between two non-adjacent nodes, whereas betweenness describes the importance of each node/edge by the number of shortest paths passing through that node/edge (Boccaletti et al., 2006). These, and similar descriptive concepts, are extensively used to elucidate the integral mechanisms of a network of interest, and form the basis for some of the methods used herein.
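The concepts of degree, shortest path, and betweenness can be made concrete with a small sketch. The Python code below is an illustration only: the toy graph and all function names are assumptions made for this example and are not part of the pipeline used in this study. The graph is merely constructed to resemble the situation described for Figure 2, with a node k linking node i to nodes j1 and j2. The betweenness computed here is the plain, unnormalized count of shortest paths passing through a node.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy network (undirected), stored as adjacency lists.
graph = {
    "i": ["k", "m", "x"],
    "k": ["i", "j1", "j2"],
    "j1": ["k", "n"],
    "j2": ["k"],
    "m": ["i", "n"],
    "n": ["m", "j1"],
    "x": ["i"],
}


def degree(graph, node):
    """Number of edges connecting a node to other nodes."""
    return len(graph[node])


def all_shortest_paths(graph, source, target):
    """Return every shortest path from source to target (breadth-first search)."""
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:              # first visit: shortest distance found
                dist[v] = dist[u] + 1
                preds[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:   # another equally short route
                preds[v].append(u)
    if target not in dist:
        return []
    paths = []

    def walk(node, tail):
        # Backtrack from the target to the source along predecessor links.
        if node == source:
            paths.append([source] + tail)
            return
        for p in preds[node]:
            walk(p, [node] + tail)

    walk(target, [])
    return paths


def betweenness(graph, k):
    """Unnormalized betweenness: share of shortest paths through k, summed over node pairs."""
    score = 0.0
    others = [n for n in graph if n != k]
    for s, t in combinations(others, 2):
        paths = all_shortest_paths(graph, s, t)
        if paths:
            score += sum(k in p for p in paths) / len(paths)
    return score


print(degree(graph, "k"), all_shortest_paths(graph, "i", "j1"), betweenness(graph, "k"))
```

In this toy graph the route from i to j1 through k uses two edges, whereas the alternative route through m and n uses three, so only the former is returned as a shortest path.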

For purposes of studying complex human diseases, protein-protein interaction networks (PPINs) have been of particular interest, especially when trying to identify the source of disease phenotypes (Delude, 2015). These networks represent the physical interconnectedness of gene products within a biological system, and can be used to explore the functional relationship between the underlying genes (Xu & Li, 2006). Where differentially expressed genes (DEGs) have shown promise for diagnosing diseased tissue, functional studies are argued to grant a better understanding of diseases: their molecular processes, systemic interactions, and potential therapeutic targets. This is based on the notion that the underlying mechanisms of the disease might only affect the coding region of the gene, i.e. not induce differential expression, but rather modify the product’s downstream function (de la Fuente, 2010). However, the combination of these concepts forms the key element when attributing biological meaning to a larger network of interest. One such example is the study of pathways that play integral parts in, e.g., cellular processes, cell signaling, or disease physiology. These inquiries can be explored by enriching a PPIN with DEGs, thereby identifying subgroups, called modules (Figure 1C) or communities, that describe the relationship of highly interconnected nodes on multiple levels (Boccaletti et al., 2006; Barabási et al., 2011; Sharma et al., 2015).

Knowledge gained from network studies includes the observation that disease genes (DGs) have a higher tendency to reside along the periphery of PPINs (Goh et al., 2007), consistent with the previously mentioned study indicating the importance of hubs for organism survival (Jeong et al., 2001). Furthermore, the peripheral DGs have an increased likelihood to interact with each other and form smaller interconnected groups, i.e. modules (Xu & Li, 2006; Goh et al., 2007; Barabási et al., 2011; Garcia-Vaquero et al., 2018). Disease modules (DMs) reveal interacting proteins influenced by one or more dysfunctional gene products that typically convey a disease mechanism or phenotype. This phenomenon is further utilized to identify commonalities for similar, or related, illnesses (Mathur & Dinakarpandian, 2010; Mathur & Dinakarpandian, 2012; Garcia-Vaquero et al., 2018; Ni et al., 2018). However, it also complicates these inquiries due to the observed comorbidity of otherwise unrelated diseases that affect similar molecular processes (Barabási et al., 2011; Loscalzo & Barabasi, 2011; Hasin et al., 2017; Garcia-Vaquero et al., 2018). Different strategies have been employed to increase the specificity of DMs (Bebek et al., 2012; Sharma et al., 2015), including the use of genome-wide association studies (GWAS) (Lee et al., 2011). These, in turn, include curated lists of single-nucleotide polymorphisms (SNPs) that are observed to be specific to one disease, disease symptom, or group of related diseases (Lee et al., 2011; Lamparter et al., 2016).

Figure 2. Visualization of some graph theory concepts. The shortest path P1 connects nodes i and j1 through two edges, whereas path P2 involves three edges for the same node pair. In this example, node k has a high betweenness due to being part of two shortest paths (i-j1 and i-j2). The edge connecting nodes k and i will similarly have a high edge betweenness. Both these nodes have the same degree (3), i.e. three edges connecting them to the rest of the network.

A strategy that might help elucidate the intricacies of pathophenotypical etiology was recently proposed and employed by Garcia-Vaquero et al. (2018). Their module algorithm, S2B (Double specific-betweenness), combines disease modules from two related diseases, forming a unified consensus module that represents their shared commonalities. This is based on betweenness centrality, which scores nodes by their importance to the network based on the number of important interactions that cross through them (further detailed in a dedicated section of Material & methods). The method was successfully applied to expression data pertaining to motor neuron diseases (Garcia-Vaquero et al., 2018), giving cause for further investigation of the methodology.

Similarly, applying a combination of methods on data from one disease, or comparing disease modules inferred by different methods for the same dataset, could strengthen the validity of currently known DGs and proteins, as well as increase the chances of novel discoveries (Marbach et al., 2012; Choobdar et al., 2018).

Several methods have been proposed to construct DMs (Saelens et al., 2018), which can be categorized by the underlying modularity principle(s) used to cluster input elements. Two examples are clique-based and co-expression clustering. Clique-based clustering methods evaluate graph density based on regions of highly interconnected units, i.e. subgraphs, commonly called cliques. The included elements are required to have edges connected to all other constituents which, for biological source data, increases the likelihood that the resulting module represents some underlying biological process (Peng et al., 2004). A drawback with clique-based clustering is its comparatively time-consuming execution for large datasets (Schmitt et al., 2017).
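The clique requirement, that every element is connected to all other constituents, can be illustrated with a short sketch. The code below enumerates maximal cliques with the basic Bron-Kerbosch recursion. It is a generic textbook procedure chosen for this illustration, not the implementation used by any of the methods in this study, and the toy network and function name are assumptions. The recursion also hints at why clique enumeration becomes time-consuming for large networks, since the number of candidate sets grows quickly with network size.

```python
def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion (no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        # current: nodes in the growing clique
        # candidates: nodes adjacent to all of current that may still be added
        # excluded: nodes already handled, used to avoid reporting duplicates
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical toy interaction network (undirected), as neighbour sets.
ppi = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A", "E"},
    "E": {"D"},
}

for clique in maximal_cliques(ppi):
    print(sorted(clique))
```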

Co-expression methods instead focus on gene co-expression, e.g. by weighting node pair connections using a reference trait (Chu et al., 1998; Zhang & Horvath, 2005) so that the trait informs the final inferential module. A drawback of this approach is the inability to fully account for the complexity of the genetic co-expression, resulting in loss of information or an overestimation of the importance of a connection (Larmuseau et al., 2019). Methods used for the current study are summarized in Material & methods.

The current study was largely based on a previous MSc thesis project conducted by McCoy (2019), which investigated the hypothesis that individual DM inference methods could be combined through the S2B method to form disease consensus modules (DCMs). Utilizing the R package MODifieR (https://github.com/ddeweerd/MODifieRDev), three different types of algorithm were employed with required input (DEGs; PPIN) to generate stand-alone method modules. These were subsequently combined iteratively to generate DCMs through S2B. All DMs were then evaluated alongside the DCMs with Pascal (Pathway scoring algorithm), using reference GWAS SNPs. Both S2B and MODifieR are summarized in Material & methods.

Inflammatory bowel disease (IBD) is a group of diseases for which positive results have been reported from DEG studies (Lawrance et al., 2001; Lee et al., 2011; Dobre et al., 2018), GWAS (Cleynen et al., 2015; Liu et al., 2015; Luo et al., 2017; Gettler et al., 2019), and the aforementioned networking methodologies (de Souza & Fiocchi, 2018; Eguchi et al., 2018). IBD encompasses a variety of similar chronic inflammatory diseases that affect the gastrointestinal tract, where Crohn’s disease (CD; potentially affects the entirety of the tract) and ulcerative colitis (UC; only affects the large intestine) are the principal examples. Both initially display typical inflammatory symptoms, such as increased blood flow (reddening) and loss of function, but generally differ in the area being affected, and increasingly diverge throughout disease progression (Sanders, 1998). Correctly diagnosing IBD is therefore essential for adequate treatment, but also of critical import due to the highly comorbid nature of the more common observable symptoms which, e.g., are shared with other inflammatory diseases, colonic cancer, and intestinal infection by Clostridium difficile (Gramlich & Petras, 2007; Magro et al., 2013). Furthermore, complex inflammatory diseases, like IBD, grossly diverge from other human afflictions by the multiplicity of potential factors involved in pathogenesis, including genetic, environmental, microbial, and immunological factors. It is, in part, the observed complexity of inflammatory diseases that makes them appropriate targets for in silico network studies (Zhang et al., 2013; de Souza & Fiocchi, 2018), where the inconsistencies of IBD pathology give further cause for investigation with systems biology and network medicine approaches (de Souza & Fiocchi, 2018; Fiocchi, 2018).

Finding the underlying mechanisms, unique to each IBD subtype, that induce the inflammatory responses and eventual subsequent complications (Ott & Schölmerich, 2013) would greatly increase the possibility of finding effective therapies and—in future—personalizing IBD treatments.

Research question

Using networks to describe biological systems has proven essential to linking, and understanding, many of the processes therein—including potential dysregulations and defects that lead to disease (Barabási et al., 2011; Lee & Loscalzo, in press). But, to translate this into clinically applicable treatments more robust models need to be produced, which further requires a substantial improvement of existing methodologies (Chen et al., 2013; Hasin et al., 2017).

Although inferential DMs have successfully been used in the past (Sonawane et al., 2019), their usefulness depends on several factors that impact their ability to be universally applicable. These include, e.g., the underlying theory used to construct the applied algorithm (de la Fuente, 2010), the type and size of the input data, and limitations presented by available hardware (Gustafsson et al., 2014a; Hasin et al., 2017; McCoy, 2019). It is therefore of great interest to find stand-alone methods that perform well across research settings and datasets or, alternatively, to find ways to combine individual algorithms in consensus methods that outperform the individual components by balancing the pros and cons of the combined stand-alone methods (Marbach et al., 2012; Choobdar et al., 2018; McCoy, 2019).

To complement existing research, this study sought to substantiate the potential of consensus methods for generating inferential DMs and, more specifically, to further investigate S2B as a candidate DCM method. The hope is that any knowledge gained from this project can aid in the ongoing efforts of producing in silico platforms that effectively represent the human body, where the ultimate goal is to bring us closer to true personalized medicine. Furthermore, the results produced by the current method and its application to the chosen diseases of interest (IBDs; CD and UC) would hopefully substantiate previous findings regarding the genetic dysfunctions (and subsequent disruption of biological processes and pathways) of the individual diseases.


Aim

The principal aim was to explore the relevance of combined inference methods for identifying pathophenotypical modules as applied to datasets pertaining to IBD, available through the Gene Expression Omnibus (GEO) (Edgar et al., 2002) and supplied by de Weerd (see Data, in Material & methods). The following objectives were therefore set in order to substantiate this goal:

i. Construct a pipeline to generate DMs and DCMs

ii. Comparatively evaluate modules by calculating meta-p-values

iii. Visualize and validate significant DM- and DCM constituents (gene lists) to determine adherence to current consensus in regards to enriched pathways

The results were also empirically evaluated in light of the following one-sided hypothesis:

H0: The consensus module methods do not show a significant improvement over the stand-alone methods.

H1: The consensus module methods employed show a significant improvement as compared to the stand-alone methods.

The approach chosen in this study was also compared to the previous one (see Discussion), implemented by McCoy (2019).


Material and methods

As mentioned in the introduction, this study was a continuation of previous MSc thesis work performed by McCoy (2019), and thereby, an attempt to evaluate the effectiveness of combining individual inferential module methods to contrive biologically relevant DCMs.

A general workflow was established (Figure 3) to elucidate the required processing structure and simplify the pipeline construction. In summary, pre-processed gene expression data, in some instances in combination with a PPIN, was used as input for the five chosen inferential module methods (Table 1) to generate DMs. Rudimentary DCMs were then generated by identifying the gene overlap between individual modules based on method type (co-expression or clique-based). These were then combined using S2B to form a unified DCM. All DMs and DCMs were subsequently evaluated through Pascal (detailed in a dedicated section below), as well as through visual representation of the results (see Evaluation).

The implemented scripts (appendix B), initially written to be executed locally, were edited to adhere to the structure of the high-performance computing (HPC) cluster Tetralith (https://www.nsc.liu.se/systems/tetralith/). It is one of several HPC clusters managed by the National Supercomputer Center (NSC), located at Linköping University (https://www.nsc.liu.se/).

Access to the cluster vastly improved the computational resources available to the project, and opened up some previously unavailable options, the most important of which was an increase in size of the input PPIN (∼ 9-fold).

The current strategy finally differed from the one employed by McCoy (2019) in the following primary ways:

i. Included a larger PPIN

ii. Increased the number of randomizations for S2B (n = 50, as opposed to n = 1)

iii. Excluded repeat implementation of S2B DCMs

iv. Used a different combination of inferential module methods

v. Focused on two method types (excluding seed-based methods)

Figure 3. Simplified overview of the workflow and subsequent pipeline of the current method, which includes (a.) the generation of disease modules and disease consensus modules, and (b.) subsequent evaluation of these. Color assigned to ease interpretation. *Pathway scoring algorithm (Lamparter et al., 2016)

Changes i and ii were made to strengthen the validity of any significant results found, since both increase the likelihood that the observed downstream meta-p-values were not due to random chance (Yeung et al., 2003). Repeated implementations of S2B modules were excluded (iii) based on the idea that, given the underlying algorithm, they might impact end results (Garcia-Vaquero et al., 2018). Changes iv and v were more arbitrary in nature, and largely implemented to retain novelty of the present method and any potentially significant results, i.e. to avoid a generic verification of the previous study’s results.

Computational details pertaining to the method (primarily R package versions) can be found in appendix A.

Data

In order to test and evaluate the proposed method, an initial disease of interest was chosen based on availability through a series of datasets obtained from de Weerd (2017). The selected IBD (CD and UC) datasets were sourced, by de Weerd, from the GEO database; SuperSeries accession GSE87650. Specifically, the data comprise gene expression profiles generated on the Illumina HumanHT-12 V4.0 expression beadchip microarray (Illumina, San Diego, CA, USA), available through GEO SubSeries GSE86434. This subset includes samples (n = 68) from three immune cell types (CD14+ monocytes, CD4+ and CD8+ lymphocytes) and globulin depleted whole blood. The data were originally collected and analyzed by Ventham et al. (2016), with the approval of The Tayside Committee on Medical Ethics B, and adhere to scientific standards for subsequent data processing and handling.

All datasets had been subject to a standard pre-processing workflow using limma (Smyth, 2005; Ritchie et al., 2015), with duplicate probes filtered by lowest p-value. To accommodate the input structure required by the MODifieR R package, specific to some of the included methods, probes had also been collapsed into genes using the collapseRows() function, found in the WGCNA R package (Langfelder & Horvath, 2008), with MaxMean set as method parameter (de Weerd, 2017).
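The idea behind collapsing probes into genes with the MaxMean setting can be sketched as follows. The Python code is a simplified illustration of the principle (for each gene, keep the probe with the highest mean expression across samples) and is not the WGCNA implementation. The function name, probe identifiers, and expression values are all hypothetical.

```python
from statistics import mean


def collapse_max_mean(expression, probe_to_gene):
    """For each gene, keep the expression row of the probe with the highest mean."""
    best = {}  # gene -> (probe id, mean expression of that probe)
    for probe, values in expression.items():
        gene = probe_to_gene[probe]
        probe_mean = mean(values)
        if gene not in best or probe_mean > best[gene][1]:
            best[gene] = (probe, probe_mean)
    return {gene: expression[probe] for gene, (probe, _) in best.items()}


# Hypothetical expression values: one list of sample intensities per probe.
expression = {
    "probe_1": [5.0, 6.0, 7.0],
    "probe_2": [8.0, 9.0, 10.0],
    "probe_3": [2.0, 2.5, 3.0],
}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}

print(collapse_max_mean(expression, probe_to_gene))
```

Here probe_2 represents GENE_A because its mean expression is higher than that of probe_1, so each gene ends up with a single expression row, as required for a gene-level input matrix.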

The protein-protein interaction (PPI) data for the current implemented method was from version 10.5 of Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) (Szklarczyk et al., 2017), and included the known PPIs of the curated human network—with minimum required interaction score set to > 0.7. The CSV file used as input thereby included n = 594 101 interactions.
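Filtering an interaction table by a minimum score can be sketched as below. This is an illustration under stated assumptions, not the script used in this study: the column names (protein1, protein2, combined_score), the protein identifiers, and the scores are hypothetical, and the sketch assumes that scores are stored as integers between 0 and 1000, so that a cutoff of 0.7 corresponds to 700.

```python
import csv
import io

# Hypothetical excerpt of an interaction file (assumed layout, toy identifiers).
EXAMPLE_CSV = """protein1,protein2,combined_score
P1,P2,490
P1,P3,912
P2,P3,700
"""


def filter_interactions(csv_text, min_score=0.7):
    """Keep interactions whose rescaled score is strictly above min_score."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = []
    for row in reader:
        score = int(row["combined_score"]) / 1000  # rescale to the 0-1 range
        if score > min_score:
            kept.append((row["protein1"], row["protein2"], score))
    return kept


print(filter_interactions(EXAMPLE_CSV))
```

With a strict inequality, an interaction scoring exactly 0.7 is excluded, which matches the "> 0.7" wording above.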

The GWAS SNP data was also from de Weerd (2017), selected to correspond to the dataset series made available. Two text files were therefore implemented in the current method, corresponding to the underlying disease (CD or UC) of the datasets, each containing reference SNP identifiers and associated p-values.

Inferential module methods

The following section describes the individual methods used for the project, summarizes their underlying scoring basis, and provides details on the employed consensus methods. The level of detail for these descriptions was limited to fit the scope of the current project, with further information instead deferred to each respective publication cited below.

To generate the stand-alone inference method modules, the R package MODifieR, which comprises eight such methods, was utilized. This was partly because it streamlined implementation and minimized the required code, but also because the author was readily available (being part of the University of Skövde alumni) to help troubleshoot potential errors. The MODifieR package is available through GitHub (https://github.com/ddeweerd/MODifieRDev), including documentation pertaining to installation, dependencies, and usage. Except for the post-processing performed for WGCNA modules noted below (see Co-expression methods), all implementations were made with default settings (de Weerd, 2019) for each of the five methods used.

Co-expression methods

These methods infer modules based on gene co-expression, e.g. by weighting node pair connections using a reference trait (Zhang & Horvath, 2005). Inferential DM co-expression methods used for the current study (Table 1) are briefly summarized below.

WGCNA has multiple functions and parameter settings available through its R package (Langfelder & Horvath, 2008), although these were largely minimized by using MODifieR for the current method. The MODifieR implementation of WGCNA (de Weerd, 2019) required an input object containing an annotated expression matrix (AEM), where the disease state samples were indicated to include the trait of interest. DMs were then identified from a constructed network by selecting co-expression modules that fell below an adjusted p-value cutoff (< 0.05). As was observed by McCoy (2019), the final DM includes a substantially larger number of nodes than those produced by most other inferential module methods. Post-processing was therefore applied to limit the module size, for each dataset, to approximately 2000 genes.

DiffCoEx (Tesson et al., 2010) is a WGCNA derivative that, instead of inferring DMs from previously generated co-expression modules (based on two-way comparison of a defined trait), assigns genes to the DM in a pairwise manner based on how their shared correlation is observed to change between the differing conditions of the samples studied (e.g. healthy and diseased). This, according to the authors, results in a more sensitive method that is especially well suited for identifying minute changes between gene expression datasets.
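The notion of a pairwise correlation that changes between conditions can be illustrated with a minimal sketch. The code below simply takes the absolute difference between the Pearson correlation of a gene pair in diseased samples and in healthy samples. This is a deliberate simplification for illustration and not the DiffCoEx scoring itself (see Tesson et al., 2010, for the actual procedure); the gene names, expression values, and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_change(control, disease, gene_a, gene_b):
    """Absolute change in correlation of a gene pair between two conditions."""
    r_control = pearson(control[gene_a], control[gene_b])
    r_disease = pearson(disease[gene_a], disease[gene_b])
    return abs(r_disease - r_control)


# Hypothetical expression profiles (one value per sample).
control = {"g1": [1.0, 2.0, 3.0, 4.0], "g2": [2.0, 4.0, 6.0, 8.0]}
disease = {"g1": [1.0, 2.0, 3.0, 4.0], "g2": [8.0, 6.0, 4.0, 2.0]}

print(correlation_change(control, disease, "g1", "g2"))
```

In this toy example the pair is positively correlated in the controls and negatively correlated in the diseased samples, so the change is large even if neither gene were differentially expressed on its own.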

Table 1. Chosen inferential disease module methods for the current project method, alongside their respective required input. DEGs = differentially expressed genes, AEM = annotated expression matrix, and PPIN = protein-protein interaction network.

Method                     Primary input (DEGs, AEM, PPIN)   Reference

Co-expression methods
WGCNA                      x                                  Langfelder & Horvath, 2008
DiffCoEx                   x                                  Tesson et al., 2010

Clique-based methods
Clique Sum Permutations    x x                                Gustafsson et al., 2014b
Correlation cliques        x x x                              Köpsén, 2016
ModuleDiscoverer           x                                  Vlaic et al., 2018

Clique-based methods

This inference type is based on identification of cliques (subgroups) within a given network. These cliques, or modules, can then be enriched by applying relevant data (e.g., PPIs) or reference filters (in this case DEGs) to find cliques that display the characteristics of interest (Peng et al., 2004).

Three clique-based methods were included in this study (Table 1), each given a short summary below.

Clique Sum Permutations is a variation of the original Clique Sum method (Barrenäs et al., 2012), suggested by Gustafsson et al. (2014b). Common to these methods is the construction and use of a SQLite (https://www.sqlite.org/) clique database. For a given network, this database is static as long as the intended reference (the PPIN) remains the same, as was the case for the current project. The difference in methodology comes when enriching the cliques, where Clique Sum Permutations compares the summed negative log10 p-values for all genes present in a clique with the corresponding sum for a clique of the same size that has been assigned random genes, following a null distribution.
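The comparison of a clique score against random gene sets of the same size can be sketched as a simple permutation test. The code below is an illustration of the principle only and is not the Clique Sum Permutations implementation found in MODifieR. The gene names, p-values, number of permutations, and function names are assumptions made for this example.

```python
import math
import random


def clique_score(genes, pvalues):
    """Summed negative log10 p-values of the genes in a clique."""
    return sum(-math.log10(pvalues[g]) for g in genes)


def permutation_pvalue(clique, pvalues, n_permutations=1000, seed=0):
    """Share of random gene sets (same size) scoring at least as high as the clique."""
    rng = random.Random(seed)
    observed = clique_score(clique, pvalues)
    universe = sorted(pvalues)
    hits = 0
    for _ in range(n_permutations):
        random_genes = rng.sample(universe, len(clique))
        if clique_score(random_genes, pvalues) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical differential expression p-values: three strongly
# significant genes and twenty unremarkable ones.
pvalues = {f"sig_{i}": 1e-6 for i in range(3)}
pvalues.update({f"bg_{i}": 0.5 for i in range(20)})

print(permutation_pvalue(["sig_0", "sig_1", "sig_2"], pvalues))
print(permutation_pvalue(["bg_0", "bg_1", "bg_2"], pvalues))
```

A clique made up of the strongly significant genes is rarely matched by random gene sets and therefore receives a small permutation p-value, whereas a clique of unremarkable genes does not.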

Correlation cliques was developed during an MSc thesis project by Köpsén (2016), and builds upon the above-mentioned Clique Sum method developed by Barrenäs et al. (2012). In summary, the Correlation cliques algorithm takes the input PPIN and splits it into several smaller networks, based on weight matrix scores generated from the expression sets of healthy controls. Through iterative computations, all cliques contained within these subnetworks are extracted and imposed on a final network, which in turn is scaled down based on the DEGs present after each iteration. Significant cliques are then used to form a DM.

The version of the ModuleDiscoverer algorithm available through MODifieR is based on the single-seed approach, which iteratively identifies minimal cliques (size three) by random walks through a supplied PPIN, starting at a single random seed node. Each node is weighted by its number of connections, used to discriminate during the subsequent random extension (unification) of cliques, which is terminated when all cliques have been maximized. Significance (cutoff set by the user) is determined through enrichment with DEGs, with significantly enriched cliques being combined in the final module (Vlaic et al., 2018).

Consensus module methods

Consensus module methods are used to identify overlap between two or more inputs, often with the aim to identify functional relationships, structural commonalities, or similar descriptive patterns within the pooled data (Barrenäs et al., 2012; Gustafsson et al., 2014a; Menche et al., 2015). In the current method, two consensus methods were used: MODifieR formed DCMs by comparing the overlap in input DM gene lists, whereas S2B used calculations based on shortest paths (detailed below).

The initial DCMs inferred by MODifieR were (A) the product of combining the co-expression module gene overlap, separately for each dataset, and (B) the product of combining the genes of the three clique-based method modules in the same manner. An occurrence cutoff was set at n = 2, which meant that the overlap genes had to occur in at least two of the sets used as input (i.e. all sets for the co-expression DMs) to be included in the final DCM.
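The occurrence cutoff can be sketched in a few lines. The Python code below illustrates the counting principle only and is not the MODifieR function used in this study. The function name, module contents, and gene names are hypothetical.

```python
from collections import Counter


def consensus_genes(modules, min_occurrence=2):
    """Genes that occur in at least min_occurrence of the input modules."""
    # Each module is reduced to a set so that a gene is counted once per module.
    counts = Counter(gene for module in modules for gene in set(module))
    return {gene for gene, count in counts.items() if count >= min_occurrence}


# Hypothetical gene lists from three clique-based disease modules.
module_1 = ["G1", "G2", "G3"]
module_2 = ["G2", "G3", "G4"]
module_3 = ["G3", "G5"]

print(sorted(consensus_genes([module_1, module_2, module_3])))
```

With only two input modules, as for the co-expression methods, a cutoff of two reduces to the plain intersection of the two gene lists.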

S2B

S2B was developed by Garcia-Vaquero et al. (2018) for comparative analysis of diseases with phenotypical commonalities (related diseases), with the aim of identifying disease-associated genes that potentially interact more closely in the compared illnesses. The method graphically overlaps two disease modules and calculates shortest paths within the network overlap, based on a double specific-betweenness score [see equation (1)], to sort top scoring interconnected genes. This allows for the subsequent identification of proteins likely to be involved with the comorbid phenotype. This could be of particular importance when evaluating potential therapeutic targets pertaining to the specified disease (Garcia-Vaquero et al., 2018), and potentially insightful in the advancement of personalized medicine (Sharma et al., 2015). In the study performed by McCoy, however, S2B was harnessed to iteratively identify overlap between the three selected inferential DM methods, in order to create a unified DCM specific to one disease of interest at a time (McCoy, 2019).
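A simplified reading of the score in equation (1) can be sketched in code. For a node k, the sketch counts the pairs (i, j), with i taken from input set a and j from input set b, whose shortest path passes through k, and divides by the number of eligible pairs. A pair is eligible when its shortest path is no longer than the average shortest path length of the graph. The code is an illustration under these assumptions and is not the S2B implementation by Garcia-Vaquero et al. (2018), which also involves randomizations not shown here. The toy graph, input sets, threshold value, and function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_distances(graph, source):
    """Shortest path length (number of edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def s2b_scores(graph, a, b):
    """Share of eligible a-b shortest paths that pass through each node k."""
    dist = {n: bfs_distances(graph, n) for n in graph}
    lengths = [dist[u][v] for u, v in combinations(graph, 2) if v in dist[u]]
    average = sum(lengths) / len(lengths)
    # Pairs with t(i, j, G) = 1: connected and not longer than the average.
    pairs = [(i, j) for i in a for j in b
             if i != j and j in dist[i] and dist[i][j] <= average]
    scores = {}
    for k in graph:
        usable = [(i, j) for i, j in pairs if j != k]
        # sp(k, i, j, G) = 1 when k lies on a shortest path between i and j.
        # Only j = k is excluded here, following the index conditions as
        # written in equation (1), so a node i counts on its own paths.
        on_path = sum(1 for i, j in usable
                      if k in dist[i] and j in dist[k]
                      and dist[i][k] + dist[k][j] == dist[i][j])
        scores[k] = on_path / len(usable) if usable else 0.0
    return scores


def s2b_module(scores, threshold):
    """Nodes whose score exceeds a chosen (here arbitrary) threshold."""
    return {k for k, s in scores.items() if s > threshold}


# Hypothetical toy network: hub h links the two input sets; x-y-z is a tail.
graph = {
    "a1": ["h", "x"], "a2": ["h"], "b1": ["h"], "b2": ["h"],
    "h": ["a1", "a2", "b1", "b2"], "x": ["a1", "y"], "y": ["x", "z"], "z": ["y"],
}
scores = s2b_scores(graph, a=["a1", "a2"], b=["b1", "b2"])
print(scores)
print(s2b_module(scores, threshold=0.4))
```

In this toy network the hub h lies on every eligible shortest path between the two input sets and therefore receives the maximum score, whereas the tail nodes score zero.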

\[
S2B(k, G, a, b) = \frac{\sum_{i \in a,\; i \neq j} \; \sum_{j \in b,\; j \neq k} sp(k, i, j, G) \times t(i, j, G)}{\sum_{i \in a,\; i \neq j} \; \sum_{j \in b,\; j \neq k} t(i, j, G)} \qquad (1)
\]

The above equation (1) computes the S2B score for each node k of an undirected graph G containing two overlapping subgraphs (A and B). Since the entirety of A and B is unknown, these are replaced by the input subsets a and b. Two functions are utilized in the equation: sp(k, i, j, G) equals 1 if k occurs in the shortest path between nodes i and j, and t(i, j, G) equals 1 if the shortest path length between i and j is ≤ the average shortest path length of the graph G. Equations for these functions, alongside others involved in the S2B methodology, were omitted from this report and are available in the source publication by Garcia-Vaquero et al. (2018).

A consensus module can be inferred from the resulting S2B model by filtering the S2B scores (a value between 0 and 1) for the included network genes. The filtering parameter is an empirical threshold, determined by an underlying function, which discriminates towards the highly connected paths (S2B score > threshold) (Garcia-Vaquero et al., 2018).

Evaluation

The validity of the produced inferential modules was tested with Pascal using GWAS SNPs. This was followed by visualization of Pascal meta-p-values, as well as analysis of any modules found to be significant at p-value < 0.05. Gene lists were extracted and subjected to further analysis, as briefly described below in the Visualization and enrichment analysis section.

Pascal

Chosen due to its relatively low performance requirements and simple implementation, Pascal utilizes GWAS generated disease- or trait-associated SNP p-values to produce gene and pathway enrichment scores. For the current method, the significance of a module was inferred through the resulting meta-p-value (sum of χ² gene scores) (Lamparter et al., 2016), based on a user defined threshold (p-value < 0.05). A script used for this purpose can be found in appendix B.

Visualization and enrichment analysis

The results produced during previous steps in the method were primarily visualized through the R packages ‘ggplot2’ (Wickham, 2016) and ‘VennDiagram’ (Chen & Boutros, 2011).

Enrichment analysis, in a wide sense, is used to apply relevance to research data, often by comparing study results to reference databases through statistical testing. For the current study, this was performed using the built-in functional enrichment analysis of the STRING database (http://string-db.org; version 11), which in turn employs other databases (Szklarczyk et al., 2019).

Gene lists extracted from significant modules (Pascal meta-p-value < 0.05) were translated from their original output (gene IDs) into gene symbols, utilizing the R packages ‘annotate’ (Gentleman, 2019) and ‘org.Hs.eg.db’ (Carlson, 2019). The lists were subsequently fed individually to the STRING web application, where the resulting top 5 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were sourced. The KEGG series of databases collects differing biological data, offering a primarily curated encyclopedia that can be used for assigning functional meaning to data for a system of interest (Kanehisa et al., 2017).

Results

The implemented method aimed to (i) construct a pipeline to generate DMs and DCMs, using five stand-alone methods and two consensus methods; (ii) comparatively evaluate these modules by calculating meta-p-values using Pascal; and (iii) visualize and validate any significant DM and DCM constituents (gene lists) to determine the biological relevance of the source module. Results from each step of this process are detailed and analyzed in this chapter.

WGCNA modules, produced using default settings, were very inconsistent in size (Table 2) for the default cutoff (p-value < 0.05). It was therefore decided to apply a simple post-processing procedure that resized the modules (n ∼ 2000) to be more uniform, prior to being used as input for consensus modeling. This meant that the majority of the WGCNA modules included as input were not found to be significantly enriched by the select trait (see Material & methods, Co-expression methods). The new cutoff was applied using the function wgcna_set_module_size, available through the MODifieR package implementation of WGCNA (de Weerd, 2019).

Table 2. Number of WGCNA module elements based on two different applied cutoffs. Disease modules (DMs) inferred with size cutoff were subsequently used for consensus modeling.

DM size with cutoff …

Disease             Sample        p-value < 0.05   n ∼ 2000
Crohn's disease     CD14+                      0      1 920
Crohn's disease     CD4+                     158      2 018
Crohn's disease     CD8+                       0      1 902
Crohn's disease     Whole blood           10 591      2 171
Ulcerative colitis  CD14+                      0      2 192
Ulcerative colitis  CD4+                       0      1 994
Ulcerative colitis  CD8+                       0      2 438
Ulcerative colitis  Whole blood            1 922      1 922

Each of the stand-alone method modules, as well as the DCMs, was evaluated with Pascal. Neither the co-expression modules nor their overlap consensus resulted in any significant modules (Table 3) at the threshold p-value < 0.05. However, the WGCNA UC whole blood module came close at p-value = 0.071, separating it somewhat from the other modules in Table 3.

Table 3. Pascal meta-p-values for co-expression modules and the MODifieR consensus generated from their respective gene lists. No significant modules inferred for threshold p-value < 0.05.

Disease             Sample        WGCNA    DiffCoEx   MODifieR consensus A
Crohn's disease     CD14+         1.7E-1   1.3E-1     2.2E-1
Crohn's disease     CD4+          8.8E-1   1.8E-1     5.7E-1
Crohn's disease     CD8+          1.5E-1   1.4E-1     3.4E-1
Crohn's disease     Whole blood   4.8E-1   3.2E-1     3.8E-1
Ulcerative colitis  CD14+         4.7E-1   8.9E-1     4.4E-1
Ulcerative colitis  CD4+          8.0E-1   1.7E-1     1.4E-1
Ulcerative colitis  CD8+          8.6E-1   3.5E-1     2.5E-1
Ulcerative colitis  Whole blood   7.1E-2   3.2E-1     9.7E-1

No significant modules were inferred through Correlation cliques or ModuleDiscoverer (the closest being Correlation cliques for the CD4+ sample set), but the Clique Sum Permutations method identified significant modules for three datasets (Table 4). This did not, however, make a difference for their rudimentary combined consensus (MODifieR consensus B), which failed to infer any DCMs of significance.

Table 4. Pascal meta-p-values for clique-based disease modules and the MODifieR consensus generated from their individual gene lists. Only the Clique Sum Permutations method inferred modules of significance (p-value < 0.05), marked with an asterisk.

Disease             Sample        Clique Sum Permutations   Correlation cliques   ModuleDiscoverer   MODifieR consensus B
Crohn's disease     CD14+         2.1E-1                    2.0E-1                1.6E-1             2.6E-1
Crohn's disease     CD4+          2.5E-2 *                  7.4E-2                8.6E-1             2.3E-1
Crohn's disease     CD8+          3.6E-1                    8.8E-2                7.5E-1             6.7E-1
Crohn's disease     Whole blood   1.5E-1                    3.4E-1                5.0E-1             4.5E-1
Ulcerative colitis  CD14+         4.2E-3 *                  4.7E-1                7.6E-1             2.6E-1
Ulcerative colitis  CD4+          2.2E-1                    2.4E-1                8.2E-1             5.5E-1
Ulcerative colitis  CD8+          5.3E-1                    5.7E-1                6.6E-1             8.4E-1
Ulcerative colitis  Whole blood   3.3E-2 *                  5.0E-1                2.7E-1             1.2E-1

Given that the rudimentary consensus modules produced with MODifieR did not result in any significant modules, no significant results were expected when combining them with S2B. This proved to be a faulty assumption, as seen in Table 5, since the method produced four such DCMs. Based on this initial outcome, S2B was implemented again using modules from only two stand-alone methods as input. The result from this test is also reported in Table 5. However, it is important to note that this added parameter could not be implemented in a systematic manner (due to time constraints) over all possible combinations of input, meaning that the resulting DCMs do not hold any significance for the questions posed by the current study. They were still included to give grounds for further investigation.

Table 5. Pascal meta-p-values for S2B consensus modules generated from MODifieR consensus A and B, alongside ones generated by combining the Clique Sum Permutations and Correlation cliques modules. Significant (p-value < 0.05) modules are marked with an asterisk.

Disease             Sample        S2B consensus A+B   S2B consensus C
Crohn's disease     CD14+         1.2E-1              3.1E-2 *
Crohn's disease     CD4+          3.8E-4 *            6.0E-3 *
Crohn's disease     CD8+          3.0E-1              1.3E-2 *
Crohn's disease     Whole blood   1.8E-1              1.3E-1
Ulcerative colitis  CD14+         1.3E-2 *            5.6E-3 *
Ulcerative colitis  CD4+          7.3E-5 *            1.2E-5 *
Ulcerative colitis  CD8+          2.8E-1              1.9E-3 *
Ulcerative colitis  Whole blood   5.0E-3 *            2.1E-1

All Pascal meta-p-values reported in Tables 3–5 were subsequently summarized in Figure 4 to simplify overview and comparison.

The modules found significant (excluding those produced with the added S2B implementation, i.e. S2B consensus C; see legend Figure 4) were primary subjects for further analysis. Module constituents (gene lists; for modules with p-value < 0.05, see appendix C) were used to generate Venn diagrams (Figures 5–8), with further analysis through KEGG pathways inferred with STRING (http://string-db.org) (Table 6).

Figure 4. Negative log10 transformed meta-p-values derived from Pascal, divided by disease and separated further based on sample source. The black line was superimposed approximately at the 0.05 threshold, with significant values (p-value < 0.05) appearing above the line. MODifieR consensus modules were distinguished by squares, where A was derived from DiffCoEx- and WGCNA modules, and B from the combined modules of Correlation cliques, Clique Sum Permutations, and ModuleDiscoverer. S2B consensus A+B was generated from the combined MODifieR consensuses, while C was a later addition based on the initial meta-p-values that indicated a poor performance by ModuleDiscoverer (for the current data sets).

Though the performance of this later addition seems to be stronger, its actual scientific value could not be assessed since the treatment was not replicated with the other stand-alone methods (owing to time restrictions).
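The transformation used in Figure 4 can be reproduced with a short Python sketch, included here only as an illustration; the function name `neg_log10` is an assumption, and the two example values are the CD4+ and CD8+ Crohn's disease entries of S2B consensus A+B in Table 5. A meta-p-value is significant at the 0.05 threshold exactly when its transformed value lies above the transformed threshold.

```python
import math


def neg_log10(p_value):
    """Negative log10 transform applied to p-values before plotting."""
    return -math.log10(p_value)


# Height of the significance line for the 0.05 threshold (about 1.30).
threshold_line = neg_log10(0.05)

# Two meta-p-values taken from Table 5 (S2B consensus A+B, Crohn's disease).
examples = {"CD CD4+": 3.8e-4, "CD CD8+": 3.0e-1}
above_line = {name: neg_log10(p) > threshold_line for name, p in examples.items()}
```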

Figure 5. Venn diagrams of disease modules for each of the applied methods (input source: ulcerative colitis CD4+ lymphocytes), showing the number of overlapping constituent genes. The intersection of the co-expression modules WGCNA (B1) and DiffCoEx (B2) forms MODifieR consensus A (A2), while the combination of the clique-based modules [Correlation cliques (C1)/Clique Sum Permutations (C2)/ModuleDiscoverer (C3)] forms MODifieR consensus B (A3). The S2B consensus (A1) was subsequently derived from the MODifieR consensus modules.

Figure 6. Venn diagrams of disease modules for each of the applied methods (input source: Crohn’s disease CD4+ lymphocytes), showing the number of overlapping constituent genes. The intersection of the co-expression modules WGCNA (B1) and DiffCoEx (B2) forms MODifieR consensus A (A2), while the combination of the clique-based modules [Correlation cliques (C1)/Clique Sum Permutations (C2)/ModuleDiscoverer (C3)] forms MODifieR consensus B (A3). The S2B consensus (A1) was subsequently derived from the MODifieR consensus modules.

Figure 7. Venn diagrams of disease modules for each of the applied methods (input source: ulcerative colitis whole blood), showing the number of overlapping constituent genes. The intersection of the co-expression modules WGCNA (B1) and DiffCoEx (B2) forms MODifieR consensus A (A2), while the combination of the clique-based modules [Correlation cliques (C1)/Clique Sum Permutations (C2)/ModuleDiscoverer (C3)] forms MODifieR consensus B (A3). The S2B consensus (A1) was subsequently derived from the MODifieR consensus modules.

Figure 8. Venn diagrams of disease modules for each of the applied methods (input source: ulcerative colitis CD14+ monocytes), showing the number of overlapping constituent genes. The intersection of the co-expression modules WGCNA (B1) and DiffCoEx (B2) forms MODifieR consensus A (A2), while the combination of the clique-based modules [Correlation cliques (C1)/Clique Sum Permutations (C2)/ModuleDiscoverer (C3)] forms MODifieR consensus B (A3). The S2B consensus (A1) was subsequently derived from the MODifieR consensus modules.

Table 6. Results from functional enrichment analysis of gene sets from significant modules, performed through the STRING database web application. FDR = false discovery rate. CSP = Clique Sum Permutations.

Disease             Sample        Method   Total   KEGG pathway (top 5 *)                                  FDR
Crohn's disease     CD4+          CSP      218     Pathways in cancer                                      1.07E-52
                                                   Ribosome                                                2.39E-51
                                                   Cell cycle                                              2.07E-48
                                                   Ubiquitin mediated proteolysis                          5.92E-41
                                                   Purine metabolism                                       2.07E-36
                                                   …
                                                   Inflammatory bowel disease (IBD)                        1.62E-08
Crohn's disease     CD4+          S2B      207     Pathways in cancer                                      9.70E-89
                                                   Kaposi's sarcoma-associated herpesvirus infection       1.20E-72
                                                   PI3K-Akt signaling pathway                              3.54E-69
                                                   HTLV-I infection                                        3.54E-69
                                                   Chemokine signaling pathway                             6.04E-65
                                                   …
                                                   Inflammatory bowel disease (IBD)                        4.17E-09
Ulcerative colitis  CD4+          S2B      202     Pathways in cancer                                      9.03E-97
                                                   PI3K-Akt signaling pathway                              2.06E-68
                                                   HTLV-I infection                                        2.30E-65
                                                   Kaposi's sarcoma-associated herpesvirus infection       2.32E-65
                                                   Epstein-Barr virus infection                            1.57E-61
                                                   …
                                                   Inflammatory bowel disease (IBD)                        6.04E-10
Ulcerative colitis  CD14+         CSP      186     Ribosome                                                2.01E-61
                                                   Ubiquitin mediated proteolysis                          3.92E-56
                                                   PI3K-Akt signaling pathway                              5.36E-55
                                                   Pathways in cancer                                      1.75E-49
                                                   Focal adhesion                                          4.01E-44
                                                   …
                                                   Inflammatory bowel disease (IBD)                        3.51E-08
Ulcerative colitis  CD14+         S2B      186     Ribosome                                                9.75E-51
                                                   Pathways in cancer                                      1.18E-41
                                                   Oocyte meiosis                                          1.21E-38
                                                   Ubiquitin mediated proteolysis                          1.21E-38
                                                   Epstein-Barr virus infection                            1.21E-38
                                                   …
                                                   Inflammatory bowel disease (IBD)                        0.0161
Ulcerative colitis  Whole blood   CSP      208     Pathways in cancer                                      1.16E-65
                                                   PI3K-Akt signaling pathway                              1.41E-53
                                                   Cell cycle                                              7.23E-50
                                                   Ubiquitin mediated proteolysis                          1.49E-49
                                                   Kaposi's sarcoma-associated herpesvirus infection       2.35E-45
                                                   …
                                                   Inflammatory bowel disease (IBD)                        2.55E-09
Ulcerative colitis  Whole blood   S2B      188     Viral carcinogenesis                                    2.90E-57
                                                   Pathways in cancer                                      4.75E-55
                                                   Epstein-Barr virus infection                            6.52E-54
                                                   Ubiquitin mediated proteolysis                          2.33E-53
                                                   HTLV-I infection                                        7.99E-45
                                                   …
                                                   Inflammatory bowel disease (IBD)                        1.11E-05

* IBD pathway added for each as reference
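The FDR column reports enrichment p-values corrected for multiple testing. As a minimal sketch of what such a correction does, the Python example below applies a Benjamini-Hochberg style adjustment. That STRING uses this particular procedure is an assumption here, since the report does not state it, and the function name and raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that monotonicity can be enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.6]  # hypothetical raw p-values
fdr = benjamini_hochberg(raw)
```

Each adjusted value is at least as large as its raw p-value, which is why the number of pathways tested matters when comparing FDR values between modules.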


Discussion

The current project sought to evaluate the biological relevance of combined inferential module methods for the diseases of interest (CD; UC). The proposed method generated DMs using five individual stand-alone methods (Table 1), and DCMs by combining the resulting DMs (based on the modularity principle) in rudimentary overlap consensus modules through the MODifieR R package. The rudimentary DCMs were subsequently used as input to infer S2B consensus modules (see any of Figures 5–8 for visual reference). Pascal-derived meta-statistics were then utilized to identify significantly SNP-enriched modules, i.e. the targets for further enrichment analysis and for comparison with the empirical literature.

For the implemented datasets (GSE86434), consensus derived inferential DMs showed a higher tendency to result in significant meta-p-values (Tables 3–5)—as compared to DMs inferred by the group of selected stand-alone methods (Clique Sum Permutations being the individual exception).

This result aligns with the insights gained through a comprehensive network inference method evaluation performed by Marbach et al., published in 2012. Their results and complementary analysis were based on contributions made by “network inference experts” to the fifth annual Dialogue on Reverse Engineering Assessment and Methods (DREAM5) systems biology challenge.

After systematically comparing a total of 35 individual methods (used to infer gene regulatory networks), they rescored and integrated all individual networks into a unified model. Although this community network did not outperform all stand-alone methods for all tested datasets, it consistently placed in the top-performing group, something that was not observed for any individual method. They concluded that community networks show promise by being more robust than any stand-alone method (Marbach et al., 2012). As mentioned, the significant modules inferred by Clique Sum Permutations presented a contradiction to the otherwise uniform Pascal results, the probable reason for which could not be determined within the scope of the current study. However, as argued by Marbach et al., stand-alone inference methods are often developed in a setting that plays to the algorithm's strengths, potentially making them biased towards a specific application (that subsequent implementations might not adhere to). A possibility could therefore be that Clique Sum Permutations inherently fits these data better (Stolovitzky et al., 2009; Marbach et al., 2012). This could also be an underlying reason for the inconsistencies observed in the Venn diagrams (Figures 5–8), where clique-based method DCMs inferred by MODifieR on average contributed 31% of their constituents to the resulting S2B DCMs, in contrast to the co-expression counterpart, which contributed only approximately 3%.

Regardless of the validity of this reasoning for the above contradictory result, the limited flexibility of stand-alone methods is further cause to consider consensus modules as the most relevant candidate for inferring modules (Marbach et al., 2012; Choobdar et al., 2018).
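The contribution percentages quoted for the Venn diagrams can be read as the share of an input module's genes that reappear in the S2B consensus. The Python sketch below shows one way to compute such a share; it assumes this reading of the percentages, and the function name and gene identifiers are hypothetical.

```python
def contribution_fraction(input_module, consensus_module):
    """Share of the input module's genes that also appear in the consensus module."""
    input_genes = set(input_module)
    if not input_genes:
        return 0.0
    return len(input_genes & set(consensus_module)) / len(input_genes)


# Hypothetical modules: 3 of the 10 input genes reappear in the consensus.
input_module = [f"G{idx}" for idx in range(10)]
consensus_module = ["G0", "G1", "G2", "X1", "X2"]
share = contribution_fraction(input_module, consensus_module)
```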

In the preceding work by McCoy (2019), the same statistical tendency as found in this report was observed for the Pascal results, although the two implemented methods diverge enough to make any direct comparison problematic. Of higher relevance is that similar results were obtained after addressing the scalability issues highlighted by McCoy: a larger PPIN, new data (pertaining to different diseases), and different stand-alone inferential methods were employed for the current method. The similarities in results substantiate McCoy’s findings, while contributing more information to the investigation of S2B as a possible consensus method for inferring DCMs.

Of further note is that significant DCMs were still identified by Pascal with the increased number of S2B randomizations (as noted on p. 6–7). This process is included in the S2B algorithm to verify that the assigned score of nodes in the resulting module is equal to or higher than the corresponding score inferred by a randomized seed set (Garcia-Vaquero et al., 2018). Increasing the number of randomized modules compared to the final module, from 1 to 50, makes the results reported herein more likely to be of actual significance (Yeung et al., 2003). However, it should be noted that this substantially increased the run time of S2B, which was the motivation for McCoy to decrease this number. With access to the Tetralith cluster, the number of randomizations implemented here could have been increased further, which should be considered for any follow-up to this work where sufficient computational resources are available.
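One reason a larger number of randomizations strengthens such a check can be sketched with an empirical p-value. This is a generic permutation-style summary, assumed here purely for illustration; it is not claimed to be the statistic that S2B computes internally, and the function name and scores are hypothetical.

```python
def empirical_p_value(observed_score, random_scores):
    """Empirical p-value, counting the observed score among the comparisons."""
    at_least = sum(1 for score in random_scores if score >= observed_score)
    return (at_least + 1) / (len(random_scores) + 1)


# Hypothetical best case: the observed score beats every randomized seed set.
observed = 0.9
p_with_1 = empirical_p_value(observed, [0.2])
p_with_50 = empirical_p_value(observed, [0.2] * 50)
```

Even in this best case, a single randomization cannot yield a value below 0.5, whereas 50 randomizations can reach roughly 0.02.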

Regarding the KEGG pathways found significantly enriched (the full scope of which cannot possibly be addressed in this report), all of the top 5 pathways reported in Table 6 were found to recur within the literature, mostly establishing solid links to IBD (see below). This presents as evidence of the biological relevance of these modules, but since the employed analysis was not applied to all inferred modules (including those with p-value > 0.05), it is currently uncertain how large the comparative effect size would be. Further, some of the pathways were primarily connected to only one of the two diseases in the reviewed literature (Tokuhira et al., 2015; Crittenden et al., 2018; Vinayaga-Pavan et al., 2019), but were present in all modules being analyzed [regardless of underlying sample source (patient diagnosed with CD or UC)]. However, this does support the aforementioned problems of complexity inherent in IBD (Zhang et al., 2013; de Souza & Fiocchi, 2018), making the importance of effective differentiation between CD and UC, in silico, even more apparent. To elucidate the included top 5 KEGG pathways of each module, a summary of select literature found to support them follows below.

The link between inflammation and cancer has been well established since first being identified by Rudolf Virchow in 1863 (Balkwill & Mantovani, 2001), and inflammatory processes active in IBD specifically increase the risk of developing colorectal cancer. The incidence rates are reportedly lower for sufferers of CD, but mirror the observations for UC, where the risk of oncogenesis increases over time (Waldner & Neurath, 2014). Chemokines are among the underlying inflammatory effectors of IBD. These proteins are responsible for recruiting leukocytes and similar immune cells, and are in some cases involved in their subsequent activation. For this reason, they are suggested as a potential therapeutic target to alleviate or revert symptoms of IBD related to inflammation (Singh et al., 2016), but are also targets of ongoing trials involving cancer treatment (Mollica Poeta et al., 2019). Another pathway bridging the gap between inflammation and cancer, in relation to IBD, is the PI3K-Akt signaling pathway, which is e.g. involved in the recruitment of immune cells (Khan et al., 2013). Although this pathway is intrinsically linked to the disease progression of CD (Tokuhira et al., 2015), owing partly to its involvement of the prominent CD DG NOD2 (Hugot et al., 2001; Ogura et al., 2001), it was consistently significantly enriched in the UC sets as well.

Resulting top 5 KEGG pathways with connections to metabolism included ubiquitin mediated proteolysis (Vergnolle, 2016; Cai et al., 2019), purine metabolism (Crittenden et al., 2018), and the cell cycle pathway. IBD-induced dysfunction of protein metabolism has been well established for several processes involved in IBD pathology (Vergnolle, 2016), and a recent publication reported further ties between UC and elevated gene transcription of the cell cycle, as well as of protein metabolism (Vinayaga-Pavan et al., 2019). Furthermore, the “ribosome” (a cellular site for protein synthesis) has been observed to be reduced in cases of IBD and to, e.g., induce detrimental effects on muscle and skeletal growth (Figueiredo et al., 2016).

Several of the resulting pathways were implicated through links with viral infections, including Epstein-Barr infection (Nissen et al., 2015; Goetgebuer et al., 2019); viral carcinogenesis and HTLV-1 infection (Tattermusch & Bangham, 2012; Futsch et al., 2017); and Kaposi's sarcoma-associated herpesvirus infection (Butler et al., 2011; Duh & Fine, 2017). These are noted to be potential complications associated with IBD immunosuppression therapies, which aligns with the sample source being patients with previously established CD or UC undergoing active therapy (Garcia-Vaquero et al., 2018).

Finding clear, substantial connections between IBD and the enriched oocyte meiosis pathway was problematic compared to the other pathways, but the pathway seems to be implicated (alongside disruption of the cell cycle pathway) in cases of gastric cancer (Shi et al., 2018). The cell cycle and cell renewal are also integral functions for a healthy epithelium (the cells that line the gastrointestinal tract), the disruption of which is readily observed in cases of IBD (Okamoto & Wakanabe, 2016).

Furthermore, there are recently reported connections between cytokine activity and intestinal stem cell renewal and differentiation (Biton et al., 2018). Cytokines are a group of proteins, including chemokines, that are important to cell signaling—found to be further implicated in IBD through “ribosomal biogenesis dysfunction” (Moon, 2014).

Lastly, focal adhesion, or more specifically anti-adhesion, has been discussed as a novel therapeutic target for treatment of IBD. This is proposed to potentially limit the number of T cells recruited to the gastrointestinal tract, and thereby reduce inflammatory symptoms and corresponding complications (Zundler et al., 2017).

The comparatively weak enrichment of the IBD pathway (a higher FDR than the top 5 pathways; Table 6) in the present results could be due to the discriminatory selective process imposed by the rudimentary consensus method, which lacks a statistical scoring basis (Tarca et al., 2012). Alternatively, some genes and proteins reported to be involved in this pathway, as implicated by the KEGG database, may not be significantly differentially expressed, or enriched with disease-associated SNPs, leading to their potential exclusion by the implemented inferential methods, or to the exclusion of the entire module in the subsequent meta-analysis with Pascal.

Another reason could be the simplistic nature of the employed enrichment analysis method, a process argued to require conscious decisions based on the initial data and subsequent results (Simillion et al., 2017). To confidently ascertain a cause, further investigation would be required.

A priority for researchers and clinicians working with IBD is finding therapeutic targets that can be used to target a specific subtype. Pathway enrichment analysis poses a viable option in this ongoing endeavor, exemplified in a recent publication by Han et al. (2018), who report on 15 pathways that might be a step towards effectively differentiating between CD and UC. Making a thorough comparison between these 15 and the ones implicated in this study would have been an interesting addition to the report, possibly imposing some separation of the otherwise uniform enrichment result. Instead, this will remain a recommended addition to potential future investigations of a similar nature.

Any discussion pertaining to the individual gene constituents was omitted, based on a lack of systematic implementation of the analysis. To ensure that the module elements did not pose randomly seeded additions from the input PPIN, review of all modules alongside the PPIN would be required. This follows the same reasoning previously brought up in the discussion of S2B randomization.

In consideration of the above discussion, could the null hypothesis posed in this study (that consensus methods would not show a significant improvement over the stand-alone methods) be discarded? The rudimentary DCMs produced through MODifieR did not perform significantly better than the stand-alone methods (Tables 3–4). And although the DCMs inferred by S2B did outperform the majority of other methods, they were more or less neck and neck with the DMs inferred by Clique Sum Permutations. In other words, this study presents no overwhelming evidence of an increased significance of consensus modules, and the posed null hypothesis is therefore retained. But, as with the preceding work of McCoy, this study’s results indicate a strong potential for the implemented method that warrants further investigation.

An important detail regarding the co-expression methods, which potentially had an effect on the subsequent meta-p-values awarded by Pascal, was the post-processing applied to WGCNA modules (Table 2). It might have affected the performance of the resulting WGCNA DMs, which would subsequently extend to the consensus modules derived from the intersection of the co-expression methods.

The currently available information is inadequate to investigate this further, but of note is that the unprocessed WGCNA module had the lowest meta-p-value.

A possible continuation of this project, based on the preliminary results of the S2B consensus C modules (Table 5; Figure 4), would be to systematically apply this variation of the current method to see if other stand-alone methods, combined in this way, generate a similarly positive result.

And, adding the community method employed by Marbach et al. (2012), to unify the resulting DCMs, should be considered a further addition to any scenario involving similar methodologies.

Another interesting result that warrants further investigation is the numerical differences observed through the Venn diagrams (Figures 5–7), where the clique-based modules consistently formed a larger basis for the S2B consensus modules. It would be interesting to see if the occurrence is based on mathematical similarities and/or differences that bias the S2B algorithm towards implemented methods with a particular type of underlying modularity principle.

Furthermore, as was pointed out by McCoy (2019), extending implementation to other data is a requirement to ensure the actual validity and stability of the proposed method. This includes testing other diseases but also, e.g., utilizing potentially updated versions of GWAS SNPs for the targeted disease(s), and more exhaustive PPINs. Also, new conceptual models are continuously published (Liu et al., 2019), aiming to untangle the complexity of human disease. Implementing such models would greatly increase the scope of a research project, and thereby the required resources, but might also increase the potential gain awarded by the effort.

Ethical considerations and impact on the society

Public repositories enable unrestricted reuse of previously gathered data. However, besides (possibly) providing submission requirements many of these databases lack resources to curate any/all incoming data. Through the use of disclaimers the responsibility of verifying data and underlying study quality is often transferred to the future users (Rung & Brazma, 2012). This has e.g. led to several publications trying to address worries of wrongful implementations, where researchers might select datasets unfit for their new intended use. This could in turn result in wrongfully inferred biological significance, risking—at best—a waste of resources when continuing research based on these faulty assumptions. The risks of misuse and ethical aspects of this are an ongoing discussion (Ioannidis & Khoury, 2011; Bauchner et al., 2016), but to date the potential gain to all fields of research is deemed to grossly overshadow the risks of making scientific data publicly available (Bauchner et al., 2016; Taichman et al., 2016). The International Committee of Medical Journal Editors recently strengthened this stance by the implementation of a mandatory “data sharing plan” for clinical trial—“deidentified” individual-patient—data, regarding submissions to all member journals, as of January 1, 2019 (Taichman et al., 2016).
