Omic Network Modules in Complex diseases

Academic year: 2021

Tejaswi V.S. Badam

FACULTY OF SCIENCE AND ENGINEERING

Linköping Studies in Science and Technology, Dissertation No. 2114, 2021 Department of Physics, Chemistry and Biology

Linköping University SE-581 83 Linköping, Sweden www.liu.se

Tejaswi Venkata Satya Badam

Linköping University
The Department of Physics, Chemistry and Biology
Division of Bioinformatics
SE-581 83 Linköping
Linköping 2021

Edition 1:1

© Tejaswi Venkata Satya Badam, 2021
ISBN 978-91-7929-717-6

ISSN 0345-7524

Published articles have been reprinted with permission from the respective copyright holder.

This Ph.D. thesis was kindly supported by grants from the Swedish Research Council (grant number 2015-03807).

CO-SUPERVISORS

Senior Lecturer Zelmina Lubovac-Pilav, PhD
Researcher Maja Jagodic, PhD

FACULTY OPPONENT

Performance drives success, but when performance can’t be measured, networks drive success.

molecules, as well as interactions among these components. Understanding a given phenotype, the function of a cell or tissue, the aetiology of diseases, or cellular organisation requires accurate measurements of the expression profiles of these molecules, which gives rise to enormous amounts of biomedical data. Analysis of biomedical data allows us to explain important features of the interactions, leading to a mechanistic understanding of the observed phenotype. The interplay between different components at different levels can be represented in the form of biological networks, for example protein-protein interactions (PPI). Networks provide a conceptual and intuitive framework for modelling different components in multiple omics data, such as the transcriptome. The topological properties of disease-associated genes vary significantly from disease to disease.

Translational bioinformatics concerns the development of analytical and interpretive methods for transforming omics data into an understanding of complex diseases. Complex diseases such as multiple sclerosis, rheumatoid arthritis and lung cancer are some of the diseases believed to result from underlying disturbances in the omic networks. Although there are many methods for modelling interactions between omics data in complex diseases, there is still a lack of clarity on how the resulting network modules should be interpreted.

In this doctoral thesis, we showed how different omics data, such as the transcriptome and methylome, can be overlaid on the network of protein interactions to extract densely interconnected network structures of relevance to disease, so-called disease modules. In the first paper, we selected the most commonly used methods for identifying disease modules and implemented them in an R package, MODifieR, which offers an easy-to-use common structure for the different methods, as well as the possibility of combining modules from different methods. In the second paper, we showed how the network module concept can be applied to whole-genome sequencing data to develop a model for predicting myelosuppressive toxicity in non-small cell lung cancer.

The third paper demonstrated a further successful application of network modules, which were used to identify genes associated with biological pathways as well as disease-associated methylation changes related to multiple sclerosis, rheumatoid arthritis

The omic network modules were then evaluated on 19 different complex diseases using both transcriptome and methylome data. Furthermore, we also identified a multi-omic module in multiple sclerosis with a significant link to disease risk factors by exploiting genomic concordance, i.e. the requirement that several omics should yield high gene overlap.

The application of network modules as a concept for linking omics data to disease mechanisms is the core of the research presented in this doctoral thesis. In particular, it aimed to show how network-omics concepts can contribute to knowledge of the genes that are dysregulated in complex diseases, in order to understand disease mechanisms. This thesis also provides tools and benchmarks for methods, and insights into how a network module can be extracted and interpreted from omics data in complex diseases.

genes, proteins, and other biological molecules, including interactions among those components. Understanding a given phenotype, the functioning of a cell or tissue, the aetiology of disease, or cellular organization requires accurate measurements of the abundance profiles of these molecular entities in the form of biomedical data. Analysing the interplay between these entities at various levels, represented in the form of a biological network, provides a mechanistic understanding of the observed phenotype. Studying this interplay requires a conceptual and intuitive framework that can model multiple omics, such as the genome, transcriptome, or proteome. Network-based strategies can address this need.

Translational bioinformatics deals with the development of analytic and interpretive methods to optimize the transformation of different omics and clinical data into an understanding of complex diseases and improvements in human health. Complex diseases such as multiple sclerosis (MS), rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and non-small cell lung cancer (NSCLC) are hypothesized to result from a disturbance in the omic networks that renders healthy cells dysfunctional. Even though there are numerous methods to lay out the interactions among omics in complex diseases, the resulting network modules have not been clearly interpreted.

In this PhD thesis, we showed how different omic data such as transcriptome and methylome can be mapped to the network of interactions to extract highly interconnected gene sets relevant to the disease, so-called disease modules. First, we selected common module identification methods and assembled them into a unified framework implemented in the R package MODifieR (Paper I). Secondly, we showed that the concept of network modules can be applied to whole-genome sequencing data to develop a tested model for predicting myelosuppressive toxicity (Paper II).

associated with pregnancy-induced pathways and were enriched for disease-associated methylation changes that were also shared by three autoimmune and inflammatory diseases, namely MS, RA, and SLE (Paper III). Remarkably, those methylation changes correlated with the expected outcome from clinical experience in those diseases. Last, we benchmarked the omic network modules on 19 different complex diseases using both transcriptomic and methylomic data. This led to the identification of a multi-omic MS module that was highly enriched not only for disease-associated genes identified by genome-wide association studies, but also for genes associated with the most common environmental risk factors of MS (Paper IV). The application of the network module concept to different omics is the centrepiece of the research presented in this PhD thesis. The thesis presents the application of omic network modules in complex diseases and how these modules should be integrated and interpreted. In particular, it aimed to show the importance of networks given the incomplete knowledge of the genes dysregulated in complex diseases. Its contribution consists of tools and benchmarks for the methods, as well as insights into how a network module can be extracted and interpreted from omic data in complex diseases.

I. MODifieR: An Ensemble R Package for Inference of Disease Modules from Transcriptomics Networks.

Hendrik A. de Weerd*, Tejaswi V.S. Badam*, David Martínez-Enguita, Julia Åkesson, Daniel Muthas, Mika Gustafsson, and Zelmina Lubovac-Pilav.

Bioinformatics, 2020. 36(12), pp.3918–3919.

II. Whole-genome sequencing and gene network modules predict gemcitabine/carboplatin-induced myelosuppression in non-small cell lung cancer patients.

Niclas Björn*, Tejaswi V.S. Badam*, Rapolas Spalinskas, Eva Brandén, Hirsh Koyi, Rolf Lewensohn, Luigi De Petris, Zelmina Lubovac-Pilav, Pelin Sahlén, Joakim Lundeberg, Mika Gustafsson, and Henrik Gréen.

npj Systems Biology and Applications, 2020. 6(1).

III. CD4+ T-cell DNA methylation changes during pregnancy significantly correlate with disease-associated methylation changes in autoimmune diseases.

Tejaswi V.S. Badam*, Sandra Hellberg*, Ratnesh B. Mehta*, Jeanette Lechner-Scott, Rodney A. Lea, Jorg Tost, Xavier Mariette, Judit Svensson-Arvelund, Colm E. Nestor, Mikael Benson, Göran Berg, Maria C. Jenmalm, Mika Gustafsson* and Jan Ernerudh*.

Submitted

IV. A validated generally applicable approach using the systematic assessment of disease modules by GWAS reveals a multi-omic module strongly associated with risk factors in multiple sclerosis.

Tejaswi V.S. Badam*, Hendrik A. de Weerd*, David Martínez-Enguita, Tomas Olsson, Lars Alfredsson, Ingrid Kockum, Maja Jagodic, Zelmina Lubovac-Pilav*, Mika Gustafsson*.

Submitted. BioRxiv, 2020, doi: https://doi.org/10.1101/2020.10.26.351783

S1. Therapeutic efficacy of dimethyl fumarate in relapsing-remitting multiple sclerosis associates with ROS pathway in monocytes.

Karl Carlström, Ewoud Ewing, Mathias Granqvist, Alexandra Gyllenberg, Shahin Aeinehband, Sara Lind Enoksson, Antonio Checa, Tejaswi V.S. Badam, Jesse Huang, David Gomez-Cabrero, Mika Gustafsson, Faiez Al Nimer, Craig E. Wheelock, Ingrid Kockum, Tomas Olsson, Maja Jagodic and Fredrik Piehl.

Nature Communications, 2019. 10(1).

S2. DNA Methylation changes in Primary Progressive Multiple Sclerosis associate with brain pathology.

Majid Pahlevan Kakhki, Yun Liu, Alexandra Gyllenberg, Tejaswi V.S. Badam, Tojo James, Jacqueline Hammer, Mika Gustafsson, Ingrid Kockum, Lars Alfredsson, Jan Hillert, Tomas Olsson, Lara Kular* and Maja Jagodic*.

In manuscript

S3. Methylation, miRNA, and gene expression have an integrated sex-specific role in the pineal gland of birds subjected to unpredictable light schedules.

Fábio Pértille, Tejaswi V.S. Badam, Nina Mitheiss, Pia Løtvedt, Mika Gustafsson, Luiz Lehmann Coutinho, Per Jensen and Carlos Guerrero-Bosagna.

BIOGRID Biological General Repository for Interaction Datasets
BMI Body Mass Index
CAD Coronary Artery Disease
CD Crohn’s Disease
CHD Congenital Heart Disease
DREAM Dialogue for Reverse Engineering Assessments and Methods
DIAMOnD DIseAse Module Detection
DiffCoEx Differential Co-expression Analysis
DiME Disease Module Extraction
DICER Differential Correlation in Expression for meta-module Recovery
DINGO Differential Network Analysis in Genomics
DEG Differentially Expressed Genes
DMP Differentially Methylated Genes
DisGeNET Disease Gene Network
EBV Epstein-Barr Virus
FunCoup Functional Coupling
GWAS Genome Wide Association Studies
GSNCA Gene Sets Net Correlations Analysis
GSVD Generalized Singular Value Decomposition
GSEA Gene Set Enrichment Analysis
GO Gene Ontology
GEO Gene Expression Omnibus
HLA Human Leukocyte Antigen
HPIN Human Protein Interaction Network
IMEX International Molecular Exchange
IBD Inflammatory Bowel Disease
MODifieR MODule IdentifieR
MONET MOdularising Network Toolbox
MINT Molecular INTeraction
MCODE Molecular Complex Detection
MODA Module Differential Analysis
MS Multiple Sclerosis
NSCLC Non-Small Cell Lung Cancer
NP Non-deterministic Polynomial-time
PPI Protein-Protein Interactions
Pascal Pathway Scoring Algorithm
QUBIC Qualitative Biclustering
RA Rheumatoid Arthritis
SNP Single Nucleotide Polymorphism
SuM Susceptibility Modules
SLE Systemic Lupus Erythematosus
STRING Search Tool for Recurring Instances of Neighbouring Genes
UC Ulcerative Colitis
VIStA diVIsive Shuffling Approach
WGCNA Weighted Gene Co-expression Network Analysis

POPULÄRVETENSKAPLIG SAMMANFATTNING
ABSTRACT
List of Included Publications and Manuscripts
Supplemental Relevant Manuscripts
Abbreviations
1 Introduction
1.1 Translational bioinformatics
1.2 Networks and systems medicine
1.3 Omics data
1.4 Complex diseases and traits
1.4.1 Drug toxicity
1.4.2 Multiple sclerosis
1.4.3 MS, RA and SLE in pregnancy
2 Materials and methods
2.1 Network and graph theory
2.2 Protein-protein interaction networks
2.3 Properties of networks
2.3.1 Modularity
2.3.2 Centrality
2.3.3 Cliques
2.4 Graph clustering
2.5 Module inference
2.5.1 Seed-based algorithms
2.5.2 Clique-based algorithms
2.6 Data pre-processing, differential analysis, and mapping
2.7 Genomic concordance
2.8 Validation of the disease modules
2.8.1 Functional validation
2.8.2 Statistical validation
2.9 Software used in the thesis
3 Results and Discussion
3.1 Tool development
3.2 Network modules in transcriptomics
3.3 Network modules in methylomics
3.4 Network modules in genetics (WGS)
3.5 Enrichment of risk factors in multi-omic modules
4 Summary and future perspectives
4.1 Summary
4.2 Relevance of the thesis to medicine
4.3 Limitations and future perspectives
Acknowledgements
Bibliography

1 Introduction

1.1 Translational bioinformatics

In spite of noteworthy advances during the previous century, healthcare is facing huge challenges related to ineffective medical therapies and treatments. One issue is interindividual drug-response variability, which causes suffering in patients and enormous costs for healthcare. Estimations suggest that the yearly expense of ineffective medication is about $350 billion in the US alone (1). In addition, the tremendous expense related to the development and improvement of drugs, and to clinical trials, further affects the finances of medical services (2). These issues mirror the unpredictability of complex diseases, which can involve the interaction of a number of factors such as genome, environment and lifestyle. A future goal of modern healthcare is personalized medicine, i.e., tailoring prediction, diagnosis, treatment, and eventual prevention to individual patients. At present, however, individual markers for drug selection generally work poorly, and the choice of drugs in complex diseases is often based on trial-and-error strategies, which bring suffering to patients and incur higher healthcare costs.

To enable the development of individualized medicine (sometimes also referred to as precision medicine), Leroy Hood’s pioneering 2004 work suggested the use of molecular interaction networks, derived from omics involving translational bioinformatics (1). Translational bioinformatics is an emerging field that addresses the development of methods for storage, analysis, and interpretation, in order to optimize the transformation of increasingly voluminous biomedical data, and genomic data, into proactive, predictive, preventive, and participatory medicine (3). Thus, the end product of translational bioinformatics involves newly discovered knowledge from these integrated efforts which can be disseminated to a variety of stakeholders such as biomedical scientists, clinicians, and patients.

1.2 Networks and systems medicine

Molecular networks, such as protein-protein interaction (PPI) networks, serve as effective platforms for uncovering the multi-faceted interplay of factors within complex diseases. Disease-associated genes identified by high-throughput studies can be computationally mapped onto models of the human PPI network. One of the most important characteristics of functionally related genes is that they tend to co-localize and form disease modules (4). An aim of systems medicine is for medical applications to adopt a holistic or systems perspective. This field utilizes translational bioinformatics methods and data integration, which have often taken the form of network-based methods, sometimes also referred to as network medicine. These applications have so far tended to make use of the fact that disease genes are functionally related, and that their corresponding protein products are highly interconnected and co-localized in networks, thereby forming disease modules. A number of module-based studies have been undertaken on different complex diseases and cancers by several research groups (5–9). Our research group focuses on autoimmune diseases such as multiple sclerosis (MS), and has applied a coarse-graining strategy to find modules that were enriched for single nucleotide polymorphisms (SNPs) found by genome-wide association studies (GWAS) for MS. For example, Hellberg et al. (2016) found a highly connected MS module of 81 genes, derived from gene expression, that was enriched for pathways related to T cells, such as cell activation and chemotaxis (7). The modules contained hundreds of genes, and a common principle was to first validate them through genomic concordance (4), i.e., modules derived from one or several omics were validated by enrichment for disease genes derived from another, independent omic. For example, the MS module was also validated by an eight-fold enrichment of disease-associated SNPs from GWAS (7). Moreover, using the module, they identified and validated four secreted protein biomarkers which, together, separated three independent MS cohorts significantly better than any single-protein model. The genomic concordance principle was also used in disease module validation by Menche et al. (2015) (10,11), and in the disease module benchmark study organized by the DREAM community (8), which aimed to prioritize and combine different disease modules.
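The genomic concordance check described above can be illustrated with a small enrichment calculation. The following is a minimal sketch, assuming a one-sided hypergeometric test; the function name and all gene counts are hypothetical and are not taken from the thesis or from the cited studies, which may use other enrichment statistics.

```python
from math import comb


def hypergeom_enrichment(n_background, n_disease, n_module, n_overlap):
    """Return (fold_enrichment, p_value) for a module/disease-gene overlap.

    n_background: number of genes in the network (the sampling universe)
    n_disease:    disease genes from an independent omic (e.g. GWAS)
    n_module:     genes in the candidate disease module
    n_overlap:    module genes that are also disease genes
    """
    # Overlap expected if the module were a random draw from the background
    expected = n_module * n_disease / n_background
    fold = n_overlap / expected

    # One-sided hypergeometric tail: P(overlap >= n_overlap)
    upper = min(n_module, n_disease)
    favourable = sum(
        comb(n_disease, k) * comb(n_background - n_disease, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    p_value = favourable / comb(n_background, n_module)
    return fold, p_value


# Toy numbers chosen for illustration only (assumption, not thesis data)
fold, p = hypergeom_enrichment(
    n_background=20000, n_disease=200, n_module=100, n_overlap=8
)
print(f"fold enrichment = {fold:.1f}, p = {p:.2e}")
```

With these toy counts the expected overlap is one gene, so eight shared genes correspond to an eight-fold enrichment. The p-value depends strongly on the choice of background, which is one reason the background gene set should be stated explicitly.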

Figure 1. Overview of topics covered in this thesis. The Venn diagram shows the three principal areas of this thesis: bioinformatics, systems medicine, and software development, where omics is the core of all the components.

As shown in Figure 1, this thesis spans bioinformatics, systems medicine, and software development. These areas use omics (see section 1.3) to reconstruct, analyse and model disease networks, which is only possible through database and software development.

1.3 Omics data

High-throughput technologies and experiments have revolutionized biological research. The introduction of technologies such as microarrays, high-throughput sequencing, large-scale genome-wide association studies and mass spectrometry has enabled the analysis of whole genomes, transcriptomes, proteomes, etc., which is necessary for understanding complex biological systems. Omics can be considered a discipline which encompasses biological information on various molecular entities. The suffix “omics” is added to a molecular entity to imply that it is assessed comprehensively or globally. For example, “genomics” focuses on the study of whole genomes, whereas “genetics” studies only individual variants or single genes. The field of omics has enabled researchers to analyse and understand pools of transcripts (transcriptomics), proteins (proteomics), epigenetic modifications of DNA (epigenomics), metabolites (metabolomics) and other biological entities, as described further in Figure 2 (12).

Omics could potentially revolutionize medicine by analysing disease as perturbations in systems, rather than malfunctions in individual genes, and thereby enable the development of precision medicine in complex diseases. However, traditional bioinformatics tools require a thorough and complete understanding of how all molecules are regulated within a complex network with many nonlinearities. One example of such a nonlinearity is that the expression of a gene requires transcription factors which act by recruiting co-factors, resulting in “AND” relationships between the factors involved. Another example is the saturation effect that occurs when a gene is already highly transcribed: a further increase in the transcription factor then results in a smaller increase in the transcription rate.
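Both nonlinearities can be made concrete with a toy model. The sketch below assumes a Hill-type activation function, a standard textbook choice that the thesis does not prescribe; the function names and parameter values are hypothetical.

```python
def hill_activation(tf_level, k_half=1.0, n=2.0):
    """Saturating activation (between 0 and 1) by one transcription factor.

    k_half is the factor level giving half-maximal activation and n is the
    Hill coefficient; both are illustrative assumptions.
    """
    return tf_level**n / (k_half**n + tf_level**n)


def and_gate_rate(tf_a, tf_b, v_max=1.0):
    """Transcription rate when both factors are required ("AND" logic).

    Multiplying the two activations means the rate is zero whenever either
    factor is absent, however abundant the other one is.
    """
    return v_max * hill_activation(tf_a) * hill_activation(tf_b)


# Saturation: the same unit increase in factor level has a much smaller
# effect when activation is already high.
low_gain = hill_activation(2.0) - hill_activation(1.0)
high_gain = hill_activation(10.0) - hill_activation(9.0)
print(f"gain at low level: {low_gain:.3f}, gain at high level: {high_gain:.4f}")

# AND relationship: one abundant factor alone gives no transcription.
print(and_gate_rate(10.0, 0.0), and_gate_rate(10.0, 10.0))
```

A linear model cannot reproduce either behaviour, which is one way to see why single-gene analyses can miss effects that only appear at the level of the network.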

Figure 2. Types of omics which can be integrated, and their constituent parts.

UK Biobank is probably the largest compendium of this kind: researchers can apply for access to single nucleotide polymorphism (SNP) data linked to the medical records of ~500,000 individuals from the UK. Such resources have ushered in a post-GWAS era, where targeted approaches derived from GWAS can now be carried out using longitudinal multi-omics designs, which are much more powerful in terms of systems medicine.

In this thesis, we tried to show a generally applicable strategy for integrating omics data to understand the complexity and characteristics of different complex diseases of vital importance.

1.4 Complex diseases and traits

Genetic factors influence almost every human condition or disease, determining susceptibility or resistance, and how they interact with environmental factors. Complex diseases do not obey the standard Mendelian patterns of inheritance (13). Complex traits are features in human genetics that are believed to be the result of multiple gene interactions, genetic heterogeneity and other as yet unknown factors (14). In simple terms, a complex disease or trait is the result of interaction between a number of known and unknown factors, which results in changes in mRNA expression and DNA methylation in thousands of genes. In many cases there is a genetic predisposition, which means that the individual is susceptible to developing the disease, but this does not mean the person is necessarily destined to develop it. The disease phenotype also depends on environment and lifestyle, which can be altered to prevent or delay the onset of disease. Thus, a complex disease is the result of interplay between genetic and environmental factors, and this is a challenge in terms of research (13).

Given that a complex disease is caused by the interaction of multiple genes and environmental factors, it is essential to understand gene-gene interactions and gene-gene-environment interactions to improve understanding and develop new therapies. A complex disease can also be described as polygenic, indicating that multiple genes contribute to it. Examples include diabetes and various types of cancer, where the condition cannot be attributed to a single glitch in the genome.

The problem of complexity is also due to disease heterogeneity, as well as the fact that the interconnections between substantial numbers of genes make it exceedingly difficult for detailed studies of individual genes to provide a functional understanding of disease mechanisms. Successful strategies have been employed to define the interplay between genetic factors at the molecular level. This might involve prioritization of genes acting as potential biomarkers, based on co-localization in protein-protein interaction (PPI) networks, thereby forming disease modules. However, this has not been optimized for decision support. Complex diseases where modules have been used to understand the disease as a whole include multiple sclerosis, asthma, lung cancer and diabetes. For example, Sharma et al. (2015) identified an asthma disease module that is enriched for disease-associated SNPs and validated it through both computational and experimental approaches. This module was able to explain the disease heterogeneity and captured a novel pathway involving glucocorticoids and GAB1 that is associated with immune response in asthma (15).

The main complex diseases and traits addressed in this thesis are drug toxicity (Paper II), MS (mainly Paper IV, but also Paper III), and RA and SLE in the 2nd trimester of pregnancy (Paper III).

1.4.1 Drug toxicity

In the context of treatment for complex diseases such as cancer, chemotherapy is a classic therapy which aims to reduce the growth rate of tumour cells. The general mechanism of action of these drugs affects healthy cells as well as cancer cells, resulting in unwanted toxicity effects. These effects can range from simple nausea and fatigue to myelosuppression and neuropathy. Because chemotherapy drugs have narrow therapeutic windows, small variations in dosage can cause toxicity or render the treatment ineffective. In Paper II, we aimed to understand three different myelosuppressive toxicity effects caused by gemcitabine and carboplatin treatment in non-small cell lung cancer (NSCLC), namely neutropenia, leukopenia, and thrombocytopenia. A deficit or decreased production of neutrophils is known as neutropenia, whereas a deficit of all leucocytes is referred to as leukopenia. This deficit of blood cells renders patients vulnerable to infections. The reduction in the number of platelets is referred to as thrombocytopenia and can cause spontaneous bleeding in patients (16).

1.4.2 Multiple sclerosis

Multiple sclerosis is a chronic inflammatory and neurodegenerative disease characterized by multi-focal demyelinating lesions in the central nervous system, causing a variety of neurological manifestations. It is well known that a number of genetic variants are associated with MS (17), but no single gene variant can predict whether a patient will develop MS. Recent research shows that HLA-DRB1*15:01 is the strongest genetic marker, with an odds ratio of 3. This means that the odds of developing MS are about three times higher for an individual carrying this allele than for an individual without it. The exact cause of MS remains unknown, but it is thought to arise in genetically susceptible individuals, where the microbiome determines disease development alongside environmental and lifestyle factors (18). Siblings of an individual with MS have an almost 17-fold higher risk of developing the disease. Monozygotic twins have a higher concordance rate than dizygotic twins, which provides support for a significant, yet complex genetic aetiology in MS (19). Genetic variations account for about 30% of the overall disease risk, and disease-associated SNPs are most often located in genes which regulate innate or adaptive immunity (20). The human leukocyte antigen (HLA) locus on chromosome 6 accounts for more than 20% of susceptibility, particularly HLA class I and II genes, whose function is to present antigens to T cells, a crucial step in adaptive immunity. This region has been implicated in the development of hundreds of diseases, many immune-mediated, which suggests common predisposing immunological processes.
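The difference between an odds ratio and a relative risk can be made concrete with a small calculation. The sketch below uses invented cohort counts (not data from the thesis or from any MS study) and hypothetical function names. It only illustrates that the two measures nearly coincide for a rare disease, which is why an odds ratio of 3 is commonly read as a roughly threefold risk.

```python
def odds_ratio(exposed_cases, exposed_healthy, unexposed_cases, unexposed_healthy):
    """Odds of disease among carriers divided by odds among non-carriers."""
    odds_exposed = exposed_cases / exposed_healthy
    odds_unexposed = unexposed_cases / unexposed_healthy
    return odds_exposed / odds_unexposed


def relative_risk(exposed_cases, exposed_healthy, unexposed_cases, unexposed_healthy):
    """Disease probability among carriers divided by that among non-carriers."""
    risk_exposed = exposed_cases / (exposed_cases + exposed_healthy)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_healthy)
    return risk_exposed / risk_unexposed


# Invented cohort (assumption): 100,000 allele carriers with 300 cases,
# and 300,000 non-carriers with 300 cases.
counts = dict(
    exposed_cases=300,
    exposed_healthy=99_700,
    unexposed_cases=300,
    unexposed_healthy=299_700,
)
or_value = odds_ratio(**counts)
rr_value = relative_risk(**counts)
print(f"odds ratio = {or_value:.3f}, relative risk = {rr_value:.3f}")
```

In this toy cohort the disease affects at most 0.3% of either group, so the odds ratio exceeds the relative risk only slightly. For common outcomes the gap widens and the two should not be used interchangeably.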

Genetic investigations in the past decade have uncovered many genetic variants and mechanisms that have a role in the pathogenesis of MS (21). The neurodegeneration and toxic effects are considered to be driven by the triggering of neurotoxic pathways, such as activation of microglia and the response to reactive oxygen species. As such, risk factors for onset of the disease may differ from those determining its progression. Furthermore, long periods of time between exposure to a risk factor and its impact on disease progression could also be a challenge for this type of investigation, as could the widespread use of effective disease-modifying treatments. For instance, some risk factors for the onset of the disease, such as low sun exposure/vitamin D and smoking, were also risk factors for relapse (low sun exposure/vitamin D) or for progression of the disease (smoking), but not for both. Others, such as previous Epstein-Barr virus (EBV) infection, were related to the onset of the disease, but not to clinical infection or to progression of the disease, while a few, such as pregnancy, were related to relapse, but not to diagnosis or prognosis (18). In Paper IV, we aimed to identify a multi-omic disease module that is enriched for genes associated with environmental and lifestyle risk factors of MS.

1.4.3 MS, RA and SLE in pregnancy

During pregnancy, the intimate relationship between the mother and the developing embryo creates a potential problem, since the maternal immune system needs to tolerate the presence of paternal alloantigens (non-self) in order to allow the two genetically distinct individuals to co-exist for the duration of the pregnancy. Pregnancy is a unique immunological condition, balancing the need for immunological tolerance while maintaining effective immunity (22).

Characterization of peripheral immune cells throughout pregnancy has shown that the systemic changes induced by pregnancy are both gestational-age dependent and cell-type specific (23). Furthermore, increased susceptibility to certain infections (24), and the improvement of some T-cell-mediated diseases such as MS and RA (25,26), also confirm that 1) systemic adaptations occur in response to pregnancy and 2) the adaptation is not a general immune suppression, but a response tailored to maintaining the integrity of the maternal immune response while allowing tolerance of foetal antigens. Therefore, pregnancy is evidently a dynamic state rather than an “immunosuppressive” one. However, certain aspects of adaptive immunity are clearly altered and suppressed systemically, particularly aspects of T-cell responses, since triggering foetal-specific T-cell responses could be detrimental to pregnancy. Specifically, in Paper III, we aimed to understand the epigenetic changes and mechanisms in CD4+ T cells during pregnancy that correlate with the disease findings of MS, RA and SLE.


2 Materials and methods

2.1 Network and Graph theory

Networks can represent many types of system in the real world. For example, the Internet could be described as a network where the nodes are computers or other devices, and the edges are physical (or perhaps wireless) connections between the devices. The World Wide Web could be seen as a vast network where the pages are nodes, and links are the edges. Other examples include social networks of acquaintances or other types of interaction, networks of publications linked by citations, transportation networks, metabolic networks and communication networks. As shown in Figure 3, a simple network has nodes which are interconnected with other nodes via edges or links.

Figure 3. Simple network figure. The blue dots represent the nodes or vertices, and the green links connecting the blue dots represent the edges.

Regardless of whether they represent the Internet or associations between proteins, people or airports, complex networks share many topological features, and this suggests that similar rules govern their formation. Models which aim to mimic the evolution and growth of these networks assume the existence of a geometry underlying their structure and shaping their topology (27).

Network-based approaches to human disease potentially have both biological and clinical applications. A better understanding of the effects of cellular interconnectedness on disease progression could lead to the identification of disease genes and disease pathways which, in turn, could offer better targets for drug development (28). These advances may also lead to better and more accurate biomarkers for monitoring the functional characteristics of networks perturbed by diseases, as well as to better disease classification.


2.2 Protein-protein interaction networks

High-throughput technologies detecting molecular interactions have resulted in a large number of network datasets. These datasets often contain a number of various kinds of evidence, including functional, physical, and molecular level interactions. These interactions can be manually curated or scientifically predicted, based on models developed using high-throughput data. This ensemble of knowledge on interactions constitutes protein-protein interactions (PPIs).

PPI networks are the mathematical representation of physical and functional connections between proteins in cells. These can represent transient, stable, and predicted interactions which constitute the dynamic part of the network. A PPI network can also be referred to as an interactome, since it involves a comprehensive assessment of the interactions between proteins. PPI networks are available through molecular interaction databases such as STRING db (29), INTACT (30), HPIN (31), BIOGRID (32), MINT (33) and IMEX (34). This thesis used the human PPI network extracted from STRING, version 11 (29). This version consists of interactions between more than 35,000 proteins, identified by Ensembl protein IDs. Since different protein isoforms can be encoded by a single gene, we were able to map these 35,000 protein IDs to about 21,000 Entrez gene IDs.
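The mapping from protein identifiers to gene identifiers can be illustrated with a minimal sketch. The function name, the toy identifiers and the choice to drop unmapped proteins and isoform self-interactions are assumptions made for illustration; they are not taken from the STRING files or from the processing used in this thesis.

```python
def collapse_to_gene_level(protein_edges, protein_to_gene):
    """Map protein-level interactions to unique, undirected gene-level edges."""
    gene_edges = set()
    for prot_a, prot_b in protein_edges:
        gene_a = protein_to_gene.get(prot_a)
        gene_b = protein_to_gene.get(prot_b)
        # Assumption: skip unmapped proteins and self-interactions
        # that arise when two isoforms map to the same gene.
        if gene_a is None or gene_b is None or gene_a == gene_b:
            continue
        gene_edges.add(frozenset((gene_a, gene_b)))
    return gene_edges


# Toy identifiers, invented for illustration only
protein_to_gene = {"P1a": "G1", "P1b": "G1", "P2": "G2", "P3": "G3"}
protein_edges = [("P1a", "P2"), ("P1b", "P2"), ("P1a", "P1b"),
                 ("P2", "P3"), ("P3", "P9")]
gene_edges = collapse_to_gene_level(protein_edges, protein_to_gene)
```

In this toy case, the two isoforms P1a and P1b collapse onto gene G1, so five protein-level interactions reduce to two gene-level edges.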


Since the advent of high-throughput studies, research has tended to focus on a few proteins, for which more and more interactions are consequently found. This gives rise to knowledge bias, i.e., an excessive focus on a few genes generates many more hypotheses (in this case interactions) about these genes simply because of research interest. Network-based strategies may therefore be heavily influenced by an increased false-positive rate in specific network compartments, which can lead to false conclusions, such as an over-emphasis on well-studied genes. To counteract these biases, research groups also rely on predicted interactions from the large-scale high-throughput studies available in public databases. For example, FunCoup is a network incorporating 10 different types of evidence with Bayesian likelihood, from high-throughput genomic and proteomic data (35). Though the interactome network is still incomplete, Menche et al. (2015) have shown that the predictive power of these network-based strategies can uncover shared genetic information across many complex diseases (11). The PPIs determined by large-scale experimental and computational approaches include a very large number of false positives, i.e., a large fraction of the putative interactions detected must be considered false because they cannot be confirmed to occur in vivo (36).


2.3 Properties of networks

2.3.1 Modularity

One of the main characteristics of complex networks in general is modularity. Modularity can be understood as the presence of partitions of highly connected components of networks according to their physical or functional properties (37). Modules can be found in different systems, such as in networks of web pages describing related topics (38), networks of friends from a sociological perspective (39) or networks for scientific collaboration (40). An equivalent term for “module” in other scientific disciplines, such as sociology, is network community. In biological networks, for example, the module of genes involved in the MAPK cascade can be considered as a dynamic signalling pathway, whereas modules consisting of ribosomal genes can be considered as static molecular complexes.

Figure 4. Schematic diagram of the three modularity concepts, redrawn from the paper by Barabasi et al. (2012) on network medicine (28): a) topological module, b) functional module, c) disease module.

It is important to distinguish between three distinct phenomena in the formation of modules from a network-based perspective of disease genes. Nodes in a locally dense neighbourhood have a stronger tendency to connect with nodes in the same neighbourhood, which could be described as a “topological module”. These modules are identified using network-clustering algorithms, which ignore the function of the individual genes. Alternatively, if nodes are clustered together based on functional similarity, so-called “functional modules” can be defined (Figure 4). Both of these types of modules usually refer to functional pathways in a biological system. The third distinct type of module is a “disease module”, representing a group of genes that contribute to a complex function which could be dysregulated in a particular disease phenotype.

However, three points are important to bear in mind. First, disease modules can be similar to, or can overlap with, functional/topological modules. Secondly, a disease module is defined with respect to a particular disease and, accordingly, each disease has its own module (22), which tends to overlap with other modules. Finally, a gene, protein or metabolite can be implicated in several disease modules, which means that different disease modules can overlap (28). These characteristics aid the disease module identification process, an important step in network medicine (Figure 4).

2.3.2 Centrality

It is important to know how a node or edge is connected within complex networks in order to understand the flow of information. Centrality is a measure that gives an estimation of this information. It is a useful parameter in signalling networks, and is often used in attempts to identify drug targets (41). For example, a number of studies show that cancer proteins are topologically more central than the other proteins in the human PPI network. Centrality can be measured using various methods. For example, centralities are sometimes calculated using “random walks”, where random nodes are chosen as starting points and the “time” or “speed” required to reach other nodes in the network is measured. This information can be coupled with the weights of the nodes or edges in the network to influence the calculation.

Betweenness centrality is a measure of how often a node lies on the shortest paths between pairs of other nodes. This enables the identification of nodes which can control information flow, i.e., nodes with high betweenness centrality. These nodes can represent important genes or proteins involved in signalling pathways and can form targets for drug discovery.
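To make the definition concrete, the following minimal sketch computes betweenness centrality for a small unweighted, undirected network using Brandes' algorithm. It is an illustration only: the function name, the dictionary-of-sets representation and the toy node names are assumptions, and it is not the implementation used in this thesis.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph.

    adjacency maps each node to the set of its neighbours.
    Returns unnormalised scores (each unordered node pair counted once).
    """
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source
        order = []
        predecessors = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    n_paths[neighbour] += n_paths[node]
                    predecessors[neighbour].append(node)
        # Back-propagate path dependencies, farthest nodes first
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = n_paths[pred] / n_paths[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Every unordered pair was visited from both of its endpoints
    return {node: value / 2 for node, value in scores.items()}


# Toy path graph A - B - C - D - E (hypothetical node names)
path = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
        "D": {"C", "E"}, "E": {"D"}}
centrality = betweenness_centrality(path)
```

In the toy path graph, the middle node C lies on the shortest paths of four node pairs, whereas the end nodes lie on none, so C would be the node with the greatest control over information flow.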

2.3.3 Cliques

Cliques are one of the basic concepts in graph theory. A clique is a set of nodes in which every pair of nodes is connected, so that the nodes form a complete subgraph. The task of finding whether there is a clique of a given size in a graph (the clique problem) is NP-complete and may therefore require exponential time with respect to the number of network interactions. This has generated a considerable number of algorithms for finding cliques in a short time. A maximal clique is one which cannot be extended by adding another node, i.e., it is not part of another, larger clique.


The Maximal Clique Enumeration (MCE) problem is a graph-related problem in which all the maximal cliques in a finite simple graph have to be identified; the effort required depends, in turn, on the size of the largest clique and the density of the network. Although the Bron-Kerbosch algorithm is useful for solving this problem, it remains a challenge to enumerate all maximal cliques in biological networks when it comes to extremely large graphs or those with a hair-ball density. Problems like clique enumeration and optimization, which are NP-hard or NP-complete, have been challenging issues in the fields of data mining and bioinformatics (42).
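As an illustration of clique enumeration, a minimal sketch of the Bron-Kerbosch algorithm with pivoting is given below. The function name, the graph representation and the toy network are assumptions made for illustration; the sketch enumerates maximal cliques and is not tuned for interactome-sized graphs.

```python
def bron_kerbosch(adjacency):
    """Enumerate all maximal cliques of an undirected graph.

    adjacency maps each node to the set of its neighbours (no self-loops).
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            # Nothing left that could extend the clique: it is maximal
            cliques.append(frozenset(current))
            return
        # Pivot on the node that covers most candidates to prune branches
        pivot = max(candidates | excluded,
                    key=lambda n: len(adjacency[n] & candidates))
        for node in list(candidates - adjacency[pivot]):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Toy network: a triangle A-B-C with a tail C-D-E (hypothetical names)
toy = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
       "D": {"C", "E"}, "E": {"D"}}
maximal_cliques = bron_kerbosch(toy)
largest = max(maximal_cliques, key=len)
```

The toy network contains three maximal cliques, the triangle and the two tail edges; only the triangle would pass a minimum clique size of 3.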

2.4 Graph clustering

A feature of real networks is modularity where different components interact with each other under a certain function or phenotype.

Figure 5. A typical network where nodes and edges are divided into four communities coloured in red, yellow, green, and blue.


Graph clustering is a specific problem defined in the context of networks to identify the clusters that correspond to a certain operation. For example, as shown in Figure 5, the set of blue nodes or green nodes belong to a specific cluster that can be attributed to a specific coordinated function.

2.5 Module inference

Given the importance of module inference in systems medicine, there are plenty of algorithms used for this very purpose. However, the performance of different module inference methods for discovering disease modules remains poorly understood, creating a need to evaluate these methods transparently, based on objective benchmarks across various diseases and networks. The vast majority of module-inferring algorithms have focused on using topological information on the interactome to predict disease modules. This only takes network properties into consideration, and not biological (mainly omic) disruptions in the disease. For example, in the DREAM challenge article by Choobdar et al. (2019), 75 module identification algorithms were assessed on networks to find the top performing algorithms (8). The 75 algorithms belong to categories such as kernel clustering, modularity optimization and random walk, which take into consideration the topological properties of molecular networks, and cluster them to identify the disease modules.

The algorithms proposed and applied to identify disease modules can be categorized into two groups. On the one hand, there are algorithms which take topological information on the interactome and apply various clustering techniques to identify disease modules, e.g., Markov clustering. On the other hand, there are algorithms which make use of data-derived, disease-associated molecules or genetic loci to identify disease modules that correlate with disease functions, such as DIAMOnD (see section 2.5.1). The data-derived information can be either differentially expressed genes or differentially correlated/co-expressed genes. Here, we consider the latter group of algorithms to define a gold standard for disease module identification. This group of algorithms can be further classified into three sub-groups: 1. Seed-based algorithms, 2. Clique-based algorithms, and 3. Algorithms based on co-expression.

2.5.1 Seed-based algorithms

Seed-based algorithms are those in which the module is identified in networks based on a given set of prior genes, often referred to as seed genes. These genes are considered to be top priority, based either on the fact that they are differentially methylated or expressed, or on the fact that they are genes known to be associated with the disease phenotype in question.

2.5.1.1 DIAMOnD (DIseAse Module Detection)

This algorithm is based on a systematic analysis of the network properties of known disease proteins, which revealed that connectivity significance, rather than connection density, is the most predictive quantity in terms of characterizing their interaction patterns (43). This is a heuristic algorithm (Figure 6) which identifies the disease module through the following steps:

1. The connectivity significance is determined for all genes connected to any of the seed genes, in the form of a p-value per gene.

2. The genes are then ranked according to their respective p-values.

3. The gene with the highest rank (i.e., the lowest p-value) is added to the set of seed nodes, increasing their number.

4. Steps (1)-(3) are repeated with the expanded set of seed genes, pulling one gene at a time into the growing disease module. This four-step procedure is repeated to span the module across the entire given network to identify topologically relevant genes which will become part of the disease module.
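The four steps above can be sketched as follows. The sketch assumes that the connectivity significance of a gene with k links, of which k_s reach seed genes, is the hypergeometric probability of observing at least k_s seed links by chance. The function names and the toy network are hypothetical, and refinements of the published method (such as seed weighting) are deliberately left out.

```python
from math import comb


def connectivity_pvalue(n_nodes, n_seeds, degree, links_to_seeds):
    """Hypergeometric tail: P(at least links_to_seeds of degree links hit a seed)."""
    total = comb(n_nodes, degree)
    upper = min(degree, n_seeds)
    tail = sum(comb(n_seeds, i) * comb(n_nodes - n_seeds, degree - i)
               for i in range(links_to_seeds, upper + 1))
    return tail / total


def grow_module(adjacency, seeds, n_added):
    """Add n_added genes, one per iteration, choosing the lowest p-value."""
    module = set(seeds)
    added = []
    for _ in range(n_added):
        # Step 1: score every gene connected to the current module
        candidates = set().union(*(adjacency[g] for g in module)) - module
        if not candidates:
            break
        scored = []
        for gene in sorted(candidates):
            k = len(adjacency[gene])
            k_s = len(adjacency[gene] & module)
            p = connectivity_pvalue(len(adjacency), len(module), k, k_s)
            scored.append((p, gene))
        # Steps 2 and 3: rank by p-value and absorb the top-ranked gene
        best_p, best_gene = min(scored)
        module.add(best_gene)
        added.append(best_gene)
    return added


# Toy network: X links only to the two seeds, H is a hub with one seed link
network = {
    "S1": {"X", "H"}, "S2": {"X"}, "X": {"S1", "S2"},
    "H": {"S1", "A", "B", "C", "D"},
    "A": {"H"}, "B": {"H"}, "C": {"H"}, "D": {"H"},
}
first_added = grow_module(network, {"S1", "S2"}, n_added=1)
```

In the toy network, the low-degree gene X is preferred over the hub H even though both touch a seed, which illustrates why connectivity significance differs from a simple count of seed links.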

Figure 6. The DIAMOnD algorithm. (a) The algorithm is heuristic, and therefore undergoes several iterations to calculate the connectivity significance of each node by expanding the module by one node for every iteration. (b) Module from the interactome, highlighting the seed genes and detected genes along with their interactor genes.


2.5.2 Clique-based algorithms

2.5.2.1 MCODE

“Molecular Complex Detection” (MCODE) aims to detect densely connected regions in large protein-protein interaction networks that may represent molecular complexes (44). MCODE uses node weighting by local neighbourhood density, and outward traversal from a locally dense seed protein, to isolate the dense regions according to a given parameter. The algorithm has an advantage over other graph clustering methods in that it has a directed mode that allows fine-tuning of clusters of interest without considering the rest of the network. It also allows examination of cluster interconnectivity, which is relevant for protein networks (44). It follows a three-step procedure:

1. It uses node weighting to define the cliquishness of a locally dense region of a network (k-core).

2. This node-weighted graph is used for complex prediction: starting from the highest weighted seed node, the algorithm recursively traverses outward to define a smaller, denser network region around the seed node.

3. The final complexes are filtered and scored on the basis of fluff

2.5.2.2 Module discoverer

Taking a Maximal Clique Enumeration (MCE) based approach, Module Discoverer is a heuristic which approximates the underlying modular structure of the PPI network by enumerating cliques iteratively, starting with random seed proteins in the network (45). It also follows three steps:

1. Approximation of PPI community structure by identifying minimal cliques of size 3 starting from the random genes and then iteratively extending them in random order.

2. Identification of significantly enriched cliques using a permutation test.

3. Assembly of the disease module based on the user defined p-value cut-off.

2.5.2.3 Clique SuM

In this method, SuMs (Susceptibility Modules) are defined by integrating PPI network data and differentially expressed genes (DEGs) for each disease (46). The SuMs are identified using a stepwise process. First, maximal cliques are extracted from the PPI network, and all cliques which are not part of other cliques are used, down to a minimum size of 3 (46). Each clique identified is then tested using one of two variants: 1) Clique SuM permutation, 2) Clique SuM exact.

The difference between these two variants is the test used to assess the enrichment of DEGs in a clique. Clique SuM exact is the method proposed by Barrenäs et al. (2012), in which enrichment of the clique is determined by application of a one-sided Fisher exact test. Clique SuM permutation is a variation of the first method and was initially used by Gustafsson et al. (2014). In this variant, clique enrichment is determined with a permutation test that compares the sum of the weights of the genes in a clique with a null distribution obtained for cliques of equivalent size (6).
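The permutation variant can be illustrated with a minimal sketch. It assumes that the null distribution is built from randomly drawn gene sets of the same size as the clique and that each gene carries a non-negative weight; both are assumptions made for illustration, as are the function name and the toy weights.

```python
import random


def clique_permutation_pvalue(clique, gene_weights, n_permutations=1000, seed=0):
    """Empirical p-value for the summed gene weight of a clique.

    gene_weights maps every gene to a non-negative score, for example
    -log10 of a differential-expression p-value (an assumption here).
    """
    rng = random.Random(seed)
    genes = sorted(gene_weights)
    observed = sum(gene_weights[g] for g in clique)
    hits = 0
    for _ in range(n_permutations):
        # Null model: a random gene set of the same size as the clique
        sample = rng.sample(genes, len(clique))
        if sum(gene_weights[g] for g in sample) >= observed:
            hits += 1
    # The add-one correction keeps the estimate strictly positive
    return (hits + 1) / (n_permutations + 1)


# Toy weights: three strongly dysregulated genes among 50 (invented values)
weights = {f"g{i}": 1.0 for i in range(50)}
weights.update({"g0": 10.0, "g1": 10.0, "g2": 10.0})
p_high = clique_permutation_pvalue({"g0", "g1", "g2"}, weights)
p_low = clique_permutation_pvalue({"g10", "g11", "g12"}, weights)
```

A clique made up of the three heavily weighted genes receives a small empirical p-value, whereas a clique of background genes does not, since every random gene set scores at least as high.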

2.5.2.4 Clique correlation

Clique correlation is also a clique-based method, which incorporates both correlation analysis and clique enrichment (7). This method involves subsampling from the given interactome according to the weights of the interactions, which can be used to prioritize interactions between proteins that correlate well in the specific cell type of interest. This step of the algorithm is inspired by co-expression-based network analysis (see section 2.5.3). To filter the given interactome, a correlation score is calculated for every interaction by subtracting the Pearson correlation value of the interacting gene pair from 1. The correlation score is then multiplied by the confidence of the correlation to obtain an edge score. This edge score is scaled so that it can be compared with a random edge score drawn from a uniform distribution, to determine whether the original edge score of the interaction is higher or lower than the random one.


Using this as the criterion, the given interactome is filtered to contain only the interactions whose calculated edge score is higher than the random edge score. Thus, a data-specific interactome network is obtained. Maximal cliques are then inferred from this new interactome, and clique enrichment is performed using the Fisher exact test to identify the significant cliques. The union of the identified significant cliques is termed the disease module for one iteration. In this way, at least 10 000 iterations are performed to obtain disease modules, and the genes present in at least 50% of iterations are considered as the final disease module.
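Two parts of this procedure, the random filtering of edges and the final consensus over iterations, can be sketched as follows. The sketch assumes that edge scores have already been scaled to the interval [0, 1]; the function names and toy inputs are hypothetical, and the clique inference and enrichment steps in between are omitted.

```python
import random
from collections import Counter


def subsample_edges(edge_scores, rng):
    """Keep each edge whose scaled score (in [0, 1]) beats a uniform random draw."""
    return [edge for edge, score in edge_scores.items() if score > rng.random()]


def consensus_module(iteration_modules, min_fraction=0.5):
    """Keep genes found in at least min_fraction of the per-iteration modules."""
    counts = Counter(gene for module in iteration_modules for gene in set(module))
    threshold = min_fraction * len(iteration_modules)
    return {gene for gene, n in counts.items() if n >= threshold}


# Toy inputs, invented for illustration only
kept = subsample_edges({("A", "B"): 1.0, ("B", "C"): 0.0}, random.Random(1))
modules = [{"A", "B"}, {"A", "C"}, {"A", "B", "D"}, {"A"}]
final_module = consensus_module(modules)
```

With four toy iterations, genes A and B appear in at least half of the modules and are retained, whereas C and D are dropped as unstable.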

2.5.3. Algorithms based on co-expression

Co-expression networks are considered to be the most flexible networks for evaluating and understanding various phenotypes as they do not use known networks as prior input. They generally refer to correlation at transcript level, though correlation relationships can be assessed at different molecular levels such as proteins, metabolites, or methylated DNA sites. By way of comparison, most PPI networks represent “generic” interactions that are devoid of cell type and temporal context, whereas co-expression networks, for example, can be generated using data from specific cell types from normal and diseased individuals across the entire spectrum of development (47).


2.5.3.1 WGCNA

One of the most widely used algorithms for constructing co-expression networks is Weighted Gene Co-expression Network Analysis (WGCNA) (47). The input for WGCNA can be any quantitative molecular data such as gene expression matrices collected from the sample population.

The first step in a WGCNA is to filter genes based on expression and variance, since genes which are not expressed or do not vary will be uninformative. Pair-wise Pearson correlations are calculated between the remaining genes, and then converted to connection strengths by raising the correlations to the power of β (soft thresholding). Since the derived correlation network can be very dense, it is necessary to approximate the network to have scale-free topology. This is ensured by selection of an appropriate β. Next, the connection strengths are transformed into a topological overlap measure (TOM). Two genes have a high TOM if they share strong connections with the same sets of genes. Genes are then grouped into “modules” based on TOMs, using hierarchical clustering (47).
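The soft-thresholding and TOM steps can be illustrated with a minimal pure-Python sketch. It assumes an unsigned network, in which the connection strength is the absolute correlation raised to the power β, and the commonly used unsigned form of the topological overlap. The function names and the toy correlation matrix are hypothetical; the selection of β and the hierarchical clustering step are not shown.

```python
def soft_threshold(correlations, beta):
    """Unsigned adjacency a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(correlations)
    return [[0.0 if i == j else abs(correlations[i][j]) ** beta
             for j in range(n)] for i in range(n)]


def topological_overlap(adjacency):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adjacency)
    connectivity = [sum(row) for row in adjacency]  # k_i, diagonal is zero
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Strength of the connections that genes i and j share
            shared = sum(adjacency[i][u] * adjacency[u][j] for u in range(n))
            tom[i][j] = (shared + adjacency[i][j]) / (
                min(connectivity[i], connectivity[j]) + 1.0 - adjacency[i][j])
    return tom


# Toy correlations: genes 0-2 perfectly correlated, gene 3 unrelated
cor = [[1.0, 1.0, 1.0, 0.0],
       [1.0, 1.0, 1.0, 0.0],
       [1.0, 1.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
adj = soft_threshold(cor, beta=6)
tom = topological_overlap(adj)
```

In the toy matrix, the three perfectly correlated genes have a topological overlap of 1 with each other and 0 with the unrelated gene, so clustering on 1 - TOM would place them in one module.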

2.5.3.2 DiffCoEx

The principle of DiffCoEx (Differential Co-expression Analysis) involves applying WGCNA to an adjacency matrix representing the correlation changes between conditions (48). It clusters the genes using a dissimilarity measure, which is computed from correlations of the same sets of genes across different conditions within the presented data.


To identify co-expression modules with DiffCoEx, five steps are followed: first an adjacency matrix is defined between all the genes under consideration, based on pair-wise correlations using a Pearson correlation. The generalized topological overlap-based dissimilarity matrix is then computed from the adjacency matrix. Finally, using this dissimilarity measure, hierarchical clustering is applied, followed by tree-cutting using a dynamic height cut. The resulting clusters form modules of genes in which all members are strongly inter-correlated (48).
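The adjacency matrix of correlation changes can be sketched as follows. The sketch assumes the signed-squared-difference form, in which the adjacency between genes i and j is the square root of half the absolute difference between the signed squared correlations of the two conditions, raised to the power β. The function name, the default β and the toy matrices are assumptions; the topological overlap and clustering steps are not shown.

```python
from math import copysign, sqrt


def adjacency_difference(cor_a, cor_b, beta=6):
    """Adjacency of correlation change between two conditions.

    d_ij = sqrt(0.5 * |sign(a_ij) * a_ij**2 - sign(b_ij) * b_ij**2|) ** beta
    """
    def signed_square(r):
        return copysign(r * r, r)

    n = len(cor_a)
    return [[0.0 if i == j else
             sqrt(0.5 * abs(signed_square(cor_a[i][j])
                            - signed_square(cor_b[i][j]))) ** beta
             for j in range(n)] for i in range(n)]


# Toy correlations: genes 0 and 1 flip from +1 to -1 between conditions,
# while gene 2 keeps the same correlation with both (invented values)
healthy = [[1.0, 1.0, 0.5], [1.0, 1.0, 0.5], [0.5, 0.5, 1.0]]
disease = [[1.0, -1.0, 0.5], [-1.0, 1.0, 0.5], [0.5, 0.5, 1.0]]
difference = adjacency_difference(healthy, disease)
```

A gene pair whose correlation flips sign between conditions receives the maximum adjacency of 1, whereas pairs with unchanged correlation receive 0, so only differentially co-expressed pairs contribute to the subsequent clustering.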

2.5.3.3 MODA

MODA (Module Differential Analysis) is a method based on co-expression, which works by identifying the modules in condition-specific networks (49). A condition-specific network is constructed for each of the conditions from the gene expression matrix, defined by correlation coefficients of gene pairs using a sample-saving approach. Similar to WGCNA, modules are then identified from the constructed network by employing hierarchical clustering and determining the optimal cutting height based on the quality of modules. Network-differential analysis then compares two sets of modules from each of the condition-specific networks (49).

Although many other disease module algorithms are publicly available, we chose the nine algorithms described above to build the MODifieR R-package. The other methods considered in formulating the package were as follows:


Table 1. List of module detection algorithms that were reviewed but not included in the MODifieR R-package

DiME (50) Constructs a co-expression network using jackknife correlation coefficients between gene pairs. Computes bi-score and topological modules iteratively by moving nodes locally within the network.

CHD (51) Identifies the union of dysfunctional subnetworks, using the shortest paths between disease genes and the remaining part of the interactome as a module.

DICER (52) Identifies modules that correlate differently between sample groups.

CoXpress (53) Identifies modules pertaining to each sample group separately.

DINGO (54) Shows how a subset of samples behaves differently from base co-expression derived from all samples.

GSNCA (55) Defines a set of genes as a module if they are differentially expressed at gene level or pathway level.

GSVD (56) Identifies “genelets”, which represent the partial co-expression signal from multiple genes that can be compared and used for clustering the samples.

HO-GSVD (57) Similar to GSVD. Allows use of more than two sample groups.

QUBIC (58) Method based on biclustering, which identifies modules that are unique to a subpopulation of samples. Does not use prior grouping of samples.
