
Doctoral Thesis in Biotechnology

Exploring the transcriptional space

JOSEPH BERGENSTRÅHLE

KTH Royal Institute of Technology


Doctoral Thesis in Biotechnology, KTH Royal Institute of Technology, Stockholm, Sweden 2021

Academic Dissertation which, with due permission of the KTH Royal Institute of Technology, is submitted for public defence for the Degree of Doctor of Philosophy on Friday the 19th of February 2021, at 10:00 a.m. in Air & Fire, Tomtebodavägen 23A, Stockholm.


© Joseph Bergenstråhle ISBN 978-91-7873-761-1 TRITA-CBH-FOU-2021:3


Public defense

The public defence of this thesis will take place on February 19, 2021 at 10.00 AM in Air and Fire, Science For Life Laboratory, Tomtebodavägen 23, Solna, for the degree of Doctor of Philosophy (PhD) in Biotechnology.

Respondent: Joseph Bergenstråhle, M.Sc. in Biotechnology
Dept. of Gene Technology, Royal Institute of Technology - KTH
Science For Life Laboratory, Solna, Sweden

Chairman: Prof. Peter Savolainen
Dept. of Gene Technology, Royal Institute of Technology - KTH
Science For Life Laboratory, Solna, Sweden

Faculty opponent: D.Sci. Jay W. Shin
RIKEN Center for Integrative Medical Sciences
1-7-22 Suehiro-cho Tsurumi-ku W406 Yokohama 230-0045 Japan

Evaluation committee

Assoc.Prof. Marc Friedländer
Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University
Science For Life Laboratory, Solna, Sweden

Prof. Ola Spjuth
Department of Pharmaceutical Biosciences, Uppsala University, Uppsala, Sweden

PhD. Åsa Björklund
Department of Cell and Molecular Biology, National Infrastructure of Sweden, Science for Life Laboratory
Uppsala University, Uppsala, Sweden

Respondent’s supervisor: Prof. Joakim Lundeberg
Dept. of Gene Technology, Royal Institute of Technology - KTH
Science For Life Laboratory, Solna, Sweden

Respondent’s co-supervisor: Asst.Prof. Patrik L. Ståhl
Dept. of Gene Technology, Royal Institute of Technology - KTH
Science For Life Laboratory, Solna, Sweden


Abstract

Transcriptomics promises biological insight into gene regulation, cell diversity, and mechanistic understanding of dysfunction. Driven by technological advancements in sequencing technologies, the field has witnessed an exponential growth in data output. Not only has the amount of raw data increased tremendously, but its granularity has as well. From only being able to obtain aggregated transcript information from large tissue samples, we can now pinpoint the precise origin of transcripts within the tissue, sometimes even within the confines of individual cells. This thesis focuses on different aspects of how to use these emergent technologies to obtain a greater understanding of biological mechanisms. The work conducted here spans only a few years of the much longer history of spatially resolved transcriptomics, which started with the early in situ hybridization techniques and will continue to a potential future with complete molecular profiling of every cell in its natural, active state. Thus, while the work presented here introduces and demonstrates the use of the latest techniques within spatial transcriptomics, it also deals with the shortcomings of the current state of the field, which will undoubtedly see extensive improvements in the not too distant future.

Article I is part of a series of articles where we mechanistically examine the biological underpinnings of a serendipitous finding that single-stranded nucleic acids have immunomodulatory effects. In particular, we look at influenza-infected innate immune cells and the ability of the oligonucleotide to inhibit viral entry. The oligonucleotides prevent the cells from responding to certain types of pattern recognition and cause a decrease in viral load. Our hypothesis is that the administration of oligonucleotides blocks certain endocytic routes. While the in vivo experiments suggest that the influenza virus is still able to infect and promote disease in the host, changes in signaling response due to the inhibition of the endocytic routes could represent an avenue for future therapeutics. The conclusions were drawn by combining protein labeling and conventional methods for RNA profiling in the form of quantitative real-time PCR and bulk RNA sequencing.

As a transition into the concept of spatial RNA profiling, the thesis includes, as additional material, a review article on spatial transcriptomics, where we give an overview of the state of the field as it looked at the beginning of 2020.

In Article II, we report on the development of an R package for analyzing spatial transcriptomics datasets. The package offers visualization features and an automated pipeline for masking tissue images and aligning serially sectioned experiments. The tool is used extensively throughout the rest of the articles where spatial transcript information is analyzed and is available to all scientists who use the supported spatial transcriptomics platforms in their research.

In Article III, we propose a method to spatially map long-read sequencing data. While previously described methods for high-throughput spatial transcriptomics produce short-read data, full-length transcript information allows us to spatially profile alternatively spliced transcripts. Using the proposed method, we find alternatively spliced transcripts and find isoforms of the same gene to be differentially expressed in different regions of the mouse brain. Furthermore, we profile RNA editing across the full-length transcripts and find certain parts of the mouse left hemisphere to display a substantially higher degree of editing events compared to the rest of the brain. The proposed method is based on readily available reagents and does not require advanced instrumentation. We believe full-length transcript information obtained in this manner could help scientists obtain a deeper understanding from transcriptome data.

Finally, in Article IV, we explore how the latest technologies for spatial transcriptomics can be used to characterize the expression landscape of respiratory syncytial virus infections by comparing infected and non-infected mouse lungs. By integrating annotated single-cell data and spatially resolved transcriptomics, we map the location of the single cells onto the spatial grid to localize immune cell populations across the tissue sections. By correlating the locations to gene expression, we profile locally confined cellular processes and immune responses. We believe that high-throughput spatial information obtained without predefined targets will become an important tool for exploratory analysis and hypothesis generation, which in turn could unlock mechanistic knowledge of the differences between experimental models that are important for translational research.


Summary (Sammanfattning)

The study of gene expression is believed to offer knowledge about cell diversity and an increased mechanistic understanding of dysregulation. This field, known as transcriptomics, has seen exponential growth in the amount of generated data in recent years, largely driven by technological advances. Not only has the raw amount of data increased, but so has the ability to distinguish which cells the gene information comes from. Historically, such information has only been observed from larger pieces of tissue, and thus an average over many cells has been observed, without knowing which cells the individual observations originate from or how the cells are positioned relative to each other. This thesis revolves around the new methods for spatial analysis of the transcriptome, which make it possible to determine where in the tissue gene expression takes place and thereby provide the granularity that real mechanistic understanding often requires. The work presented here spans only a few years of the longer path that the development of spatial transcriptomics is on, from the early in situ hybridization experiments to a potential future with complete molecular profiling of every cell in its natural environment. Since the latter has not yet been realized, the thesis and the included works also address shortcomings of today’s technology. The field is developing very rapidly, and several of the weaknesses that exist today are expected to be greatly reduced in the relatively near future.

Article I is part of a series of articles in which we mechanistically investigate a phenomenon where single-stranded nucleic acids have immunomodulatory effects. More specifically, in this article we investigate how oligonucleotides of a particular length affect influenza-infected dendritic cells and the ability of virus particles to enter these cells. We find inhibition of the cells’ ability to respond to certain types of pattern recognition, as well as reduced viral loads directly after administration of oligonucleotides. Our hypothesis is that this is an effect of blocking certain endocytic routes. Experiments in mice suggest that the influenza virus is still capable of infecting the animals and causing disease, but blocking the endocytic uptake routes of the virus results in altered signaling, which could represent an interesting opportunity for therapeutic interventions. The conclusions are drawn by combining protein staining and conventional methods for transcriptome analysis, in the form of quantitative real-time PCR and bulk RNA sequencing.

A transition to spatial analysis is then made, with a review on the subject included in the thesis as an appendix, which serves as an overview of all the methods that have been developed to enable this type of analysis, as it looked at the beginning of 2020. In Article II, we present the development of software written in R for [...] visualization capabilities and an automated pipeline for image handling. The tool is openly available to everyone who uses the supported spatial transcriptomics platforms.

In Article III, we further develop the protocol for spatial transcriptomics in order to take advantage of the technological advances made in full-length transcriptome sequencing. By reading entire transcripts instead of only shorter fragments, which is the standard today, the full complexity of the transcriptome can be analyzed. For example, we show how the quantities of different isoforms of one and the same gene differ markedly between regions of the mouse brain, and how certain types of RNA alterations are more common in different regions. The proposed protocol uses readily available reagents and requires no advanced instrumentation. We believe that full-length information will be crucial for achieving a complete biological understanding from transcriptome data.

Finally, in Article IV, we use the latest methods for spatial transcriptomics to investigate how the local environment in the lung is affected by a viral infection by comparing gene expression between infected and non-infected mice. By integrating publicly available annotated single-cell data with spatial transcriptome data, we map how different types of immune cells are located across the tissue sections. By correlating gene expression with the positions of the cell types, we create a comprehensive picture of how different cellular processes and immune responses display local adaptations. We believe that large-scale spatial information, without predefined choices about which genes are examined, will constitute an important tool for exploratory analysis [...]

Preface

My work during the years has been a little bit of everything. Some might call it unfocused, but let’s call it multi-faceted instead. There is however a common thread throughout all publications that in some places bear my name. That common thread is called transcriptomics. For an outsider, a rather curious word. To be honest, I’m not sure anyone outside the field of genomics even has the faintest idea what it’s about. So what is this ... transcriptomics? What can you do with it, and why should you really bother? Writing this thesis, my ambition is to partly answer these questions. Partly, because that’s where we’re at. While the transcriptome indeed can give us much information about the biological underpinnings of cellular processes and responses, it’s still but a smaller piece of the complex puzzle that is cell biology. However, a brief background introduction is in order to set the stage. Let’s start with that thing I mentioned, transcriptomics.


List of publications

This thesis is based on the following articles and manuscripts, included as an appendix to the thesis.

Article I

Candice Poux, Aleksandra Dondalska, Joseph Bergenstråhle, Sandra Pålsson, Vanessa Contreras, Claudia Arasa, Peter Järver, Jan Albert, David C. Busse, Roger LeGrand, Joakim Lundeberg, John S. Tregoning, Anna-Lena Spetz. A Single-Stranded Oligonucleotide Inhibits Toll-Like Receptor 3 Activation and Reduces Influenza A (H1N1) Infection. Frontiers in Immunology 10:2161 (2019), DOI: 10.3389/fimmu.2019.02161

Article II

Joseph Bergenstråhle, Ludvig Larsson, Joakim Lundeberg. Seamless integration of image and molecular analysis for spatial transcriptomics workflows. BMC Genomics, 21:482 (2020), DOI: 10.1186/s12864-020-06832-3

Article III

Kevin Lebrigand, Joseph Bergenstråhle, Kim Thrane, Annelie Mollbrink, Pascal Barbry, Rainer Waldmann, Joakim Lundeberg. The spatial landscape of gene expression isoforms in tissue sections. bioRxiv (pre-print), 2020, DOI: 10.1101/2020.08.24.252296

Article IV

Joseph Bergenstråhle, Aleksandra Dondalska, Lovisa Franzén, Sandra Pålsson, Alexandros Sountoulidis, Joakim Lundeberg, Anna-Lena Spetz. Spatially resolved transcriptomics of Respiratory Syncytial Virus infection. Manuscript.

Additional material

Michaela Asp, Joseph Bergenstråhle, Joakim Lundeberg. Spatially Resolved Transcriptomes—Next Generation Tools for Tissue Exploration. BioEssays, 42:10, DOI: 10.1002/bies.201900221


Contents

1 The Transcriptome
  1.1 Dynamic changes and stochasticity
  1.2 The exogenous transcriptome
  1.3 Spatial context

2 What are we really looking at?
  2.1 Cell type
  2.2 Cell fate
  2.3 Single-cell sequencing and new cell types
  2.4 Correlation between the omes
  2.5 The analysis of transcriptomic data
  2.6 Integration across datasets

3 The era of spatial transcriptomics
  3.1 The characteristics of in situ capture methods
  3.2 The limitation of sensitivity and resolution
  3.3 Computational tools for in situ capture data

4 Over the horizon
  4.1 A more complete picture of spatial transcriptomics
  4.2 Spatial multi-omics
  4.3 Computational enhancement
  4.4 Large atlases and their applications
  4.5 Adding the next dimension

5 Present investigations

Acknowledgements

References


Nomenclature

Adapters (sequencing) - Specific sequences that are added to the nucleic acid fragments prior to sequencing, which are required for the sequencing instrument to recognize and identify the fragments.

Autoencoders - A type of unsupervised machine learning, where the original input is compressed to a space of reduced dimensionality and then reconstructed in order to learn the most salient features of the data.

Cell type - A classification distinguishing morphologically and phenotypically different cells.

Cell states - A more fine-tuned classification quantifying the overall characteristics of a cell, including its transcriptome, proteome, and morphology. Cells from the same type can exist in different states.

Differentiation (cellular) - The process in which a cell transitions from one cell type to another, often into a more stable state.

Epigenomics - The study of the epigenome, which is all the chemical changes to the DNA and its current configuration and packaging that affect gene expression.

Gene expression - The event where a gene is transcribed to RNA.

Heterogeneous - Something that is diverse in character or composition.

Latent state - A “hidden” state that we cannot observe. In this context, it is often used in machine learning to refer to an underlying biological condition that we try to infer given observational data.

Long non-coding RNA (lncRNA) - RNA molecules that are more than 200 nucleotides long and not translated into protein.

Non-negative matrix factorization - An unsupervised machine learning algorithm that can reveal low-dimensional structure from high-dimensional data, making it easier to interpret. A non-negative matrix (e.g., gene counts across samples) is factorized into two new non-negative matrices (e.g., one that describes the structure between genes and the other the structure between samples).

Messenger RNA (mRNA) - RNA molecules that are later translated into proteins.

Multi-omics - The field of analyzing multiple omics in an integrated manner.

Neoantigens - Tumor-specific antigens displayed on the tumor cell that are absent from normal tissue.

Next generation sequencing (NGS) - Also called massively parallel sequencing, a catchall term to describe the sequencing methods that introduced sequencing of millions of reads in parallel.

Omics - The suffix -omics is used to describe several disciplines around the quantification and characterization of different types of biological features (e.g., transcriptomics refers to the study of RNA transcripts).

Omes - The suffix -ome is related to -omics but instead describes the object that is under study within the different disciplines (e.g., transcriptome refers to the set of all RNA transcripts).

Phenotype - The observable traits of an object (in this context, e.g., a cell).

Proteomics - The study of proteins.

Pseudotime - An axis describing the cell state of a cell along a differentiation path inferred from observations of multiple cells of the same type.

Ribosomal RNA (rRNA) - RNA molecules that are the primary components of ribosomes, which translate mRNA into proteins.

Stemness - Refers to fundamental stem cell properties; a high degree of stemness would imply greater ability for self-replication and plasticity to transition into various cell types/states.

Somatic cell - All cells of the body excluding sperm and egg cells.

Translation - The event where ribosomes connect amino acids together, in the order that the mRNA codes for, to form proteins.

Transcriptomics - The study of all the RNAs in a cell, tissue, or organism.

Transcription - The process where an enzyme (polymerase) creates RNA from DNA.

Transcription factor (TF) - A protein that binds to specific parts of the DNA to regulate transcription.


Chapter 1

The Transcriptome

Let’s start with the cell, the basic yet remarkably complex building block of the human body. There are more than ten trillion of these in a human being [1]. If we consider the somatic cells, and set aside special cases such as red blood cells that lack a nucleus, immune cells that by nature undergo chromosomal rearrangements, and the changes to the genome that occur and accumulate over time, they all share the same DNA, a molecular code built from a 4-letter alphabet (figure I). However, as we are all quite aware, the body is remarkably heterogeneous and displays a vast array of different cells; otherwise we would just look like a homogeneous blob. How can this be? This is pretty much where transcriptomics comes in.

When the cell reads the DNA blueprint, it creates complementary copies of the code, known as RNA transcripts. The complete collection of these transcripts constitutes the cell’s transcriptome. Here, we arrive at a very important point from an analysis perspective: even though the code base is static, its usage is highly dynamic. The transcriptome underpins the proteome (the collection of all proteins in the cell), which mediates cell function and makes up 60 % of the cell’s dry mass [2]. The proteome will, in turn, affect the transcription of the genome and the translation of the transcriptome in an endless dependency cycle. When we measure the transcriptome with current methods, we only see a snapshot of it at a specific point in time, and, from that snapshot, we try to figure out the dynamic underpinnings of what we observe. Furthermore, the dynamic nature of the transcriptome implies that quantitative measurements are informative, since both the timing and the degree of expression affect cellular function.

How many transcripts are there? This is still a growing number, although with a negative second derivative. In 2003, the National Human Genome Research Institute (NHGRI) launched a consortium named ENCODE, which aims to identify and categorize all functional elements in the human genome. Table I provides the numbers in its most recent version (the 35th at the time of this writing).

Figure I: Constituents of a cell. (A) The building blocks of the nucleic acids DNA and RNA. (B) The flow of genetic information from DNA to protein. The flow should not be viewed as completely unidirectional, as products of the later stages in turn affect the events of the earlier.

A first noteworthy observation is that there are substantially more transcripts than there are genes, a consequence of the fact that transcripts, once transcribed, undergo rearrangements during the removal of intronic elements to give rise to alternative exon configurations in the final transcripts. These modifications add substantial complexity to the transcriptome. While these alternative transcripts have historically been somewhat neglected, mainly due to technical constraints, transcript-level information most likely does have important implications for cellular function, as evidenced by their tight regulation and role in development and tissue homeostasis [3].
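As a rough illustration of the gap between gene and transcript counts, the figures listed in Table I can be turned into average transcripts-per-gene ratios. The snippet below is a minimal sketch in Python; the dictionary keys and the helper name `transcripts_per_gene` are labels chosen here for illustration, and the numbers are simply copied from Table I.

```python
# Counts copied from Table I (GENCODE release 35); key names are illustrative.
GENCODE_V35 = {
    "genes_total": 60_656,
    "transcripts_total": 229_580,
    "genes_protein_coding": 19_954,
    "transcripts_protein_coding": 84_485,
}


def transcripts_per_gene(n_transcripts: int, n_genes: int) -> float:
    """Average number of annotated transcripts per annotated gene."""
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    return n_transcripts / n_genes


overall = transcripts_per_gene(
    GENCODE_V35["transcripts_total"], GENCODE_V35["genes_total"]
)
protein_coding = transcripts_per_gene(
    GENCODE_V35["transcripts_protein_coding"], GENCODE_V35["genes_protein_coding"]
)

print(f"all genes: {overall:.2f} transcripts per gene")
print(f"protein-coding: {protein_coding:.2f} transcripts per gene")
```

With these annotations there are, on average, a little under four transcripts per gene overall and a little over four per protein-coding gene. These are averages only and say nothing about how the transcripts are distributed across individual genes.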

As a side note, this increased complexity has a significant correlation with the number of annotated cell types in each species [4]. Thus, with the proviso that one could use cell-type diversity as a proxy for organism complexity, it seems that higher-order organisms have developed by increasing the post-transcriptional diversity rather than the number of genes, even though it remains an open debate if this is causative or more of a symptom of genetic drift [5].

Table I: Statistics of gene features obtained from GENCODE Release (version 35)

Genes
  Total number of genes                   60 656
  Protein-coding genes                    19 954
  Long non-coding RNA genes               17 957
  Small non-coding RNA genes               7 569
  Pseudogenes                             14 767

Transcripts
  Total number of transcripts            229 580
  Protein-coding transcripts              84 485
  Nonsense mediated decay transcripts     16 495
  Long non-coding RNA loci transcripts    48 684

The last section quietly introduced the concept of a cell type, which is given a more thorough treatment in chapter 2. For now, we can conclude that different cells display variations in their transcriptome to the extent that some transcripts are restricted to certain types of cells. At the other end, there are also genes, usually involved in maintenance tasks, that are universally expressed in all cells. The degree of expression, i.e., the number of transcript copies present in a cell at a given time, varies considerably between different genes, ranging from single digits to thousands. Furthermore, the distribution of transcript abundances across genes also varies between different cells. For example, when looking at a diverse set of immune cells, 80 % of the transcripts identified in the cell could come from anywhere between 500 and 5000 unique genes, depending on the function of the cell and its differentiation state [6].
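The notion that a given share of a cell’s transcripts comes from a limited number of genes can be made concrete with a small calculation. The following sketch is written in Python under simple assumptions: the function name `genes_covering_fraction` and the toy count vectors are invented here for illustration and are not taken from [6]. It counts how many of the most highly expressed genes are needed to account for a chosen fraction of all transcripts in one cell.

```python
def genes_covering_fraction(counts, fraction=0.8):
    """Return the smallest number of top-expressed genes whose summed
    counts reach `fraction` of all transcripts in the profile.

    `counts` is a sequence of non-negative transcript counts, one per gene.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    total = sum(counts)
    if total <= 0:
        raise ValueError("profile contains no transcripts")

    threshold = fraction * total
    running = 0
    # Walk through genes from the most to the least abundant.
    for n_genes, count in enumerate(sorted(counts, reverse=True), start=1):
        running += count
        if running >= threshold:
            return n_genes
    return len(counts)  # only reached through floating-point rounding


# Toy profiles (hypothetical numbers): one dominated by a few genes,
# one with evenly spread expression.
skewed_cell = [60, 25, 10, 5]
even_cell = [25, 25, 25, 25]

print(genes_covering_fraction(skewed_cell))
print(genes_covering_fraction(even_cell))
```

In the toy example, two genes suffice for the skewed profile, whereas the even profile needs all four, which is the kind of difference in abundance distributions described above.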

While almost all of the genetic code at some point is transcribed into RNA molecules [7], only approximately 1.5 % of these, the messenger RNAs (mRNAs), are translated into protein [8]. What about the transcribed parts of the genome that never get translated? The most abundant (80 % to 90 %) type of RNA is ribosomal RNA (rRNA), which assists in the actual translational process. The exact function of the remaining RNA molecules is less known, although they have been shown to have a functional role in cellular behaviour. For instance, they have been implicated in selective pressure [9] and linked to various disease states [10]. Moreover, using genetic engineering, scientists have disrupted certain non-coding regions of the genome with resulting detrimental effects [11]. Hence, there is little doubt that transcripts from these regions have important biological implications. Among these lesser known transcripts, the majority are more than 200 nucleotides long. These are referred to as long non-coding RNA (lncRNA). The current knowledge about lncRNAs is quite far behind that of their coding counterparts. While we’ve reached a point where we have quite a good understanding of the sequence and resulting function of protein-coding genes, to the degree that one can predict function based on primary sequence alone [12], the current understanding of the relationship between function and sequence of non-coding RNA is substantially poorer. The ability to annotate many such transcripts is hampered by the fact that they are often weakly expressed, making them challenging to study. Furthermore, the sequences are in general less conserved throughout evolution [13].

1.1 Dynamic changes and stochasticity

There is an inherent stochasticity to gene expression, which should come as no surprise given the underlying probabilistic nature of the chemical reactions and interactions of molecules involved in transcription [14], [15]. This stochasticity comes on top of cell-state-dependent regulatory mechanisms, like the packaging of the DNA and the sequences surrounding the genes, that dictate the likelihood of transcription [16].
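To illustrate how randomness in individual reaction events translates into fluctuating transcript numbers, the sketch below simulates a deliberately simplified birth-death model with the Gillespie stochastic simulation algorithm: transcripts are produced at a constant rate and each transcript is degraded independently. This is a textbook simplification chosen here for illustration, not a model proposed in this thesis or in [14], [15]; the function name, the rate values, and the time units are all assumptions.

```python
import random


def simulate_transcript_counts(k_tx, k_deg, t_end, seed=0):
    """Gillespie simulation of a constant-rate birth-death process.

    k_tx  : transcription events per unit time (assumed constant)
    k_deg : degradation rate per transcript per unit time
    t_end : total simulated time
    Returns (final_count, time_averaged_count).
    """
    if k_tx <= 0 or k_deg <= 0 or t_end <= 0:
        raise ValueError("rates and t_end must be positive")

    rng = random.Random(seed)
    t, n = 0.0, 0
    weighted_sum = 0.0  # integral of the copy number over time

    while True:
        rate_tx = k_tx
        rate_deg = k_deg * n
        total_rate = rate_tx + rate_deg

        # Waiting time to the next reaction is exponentially distributed.
        wait = rng.expovariate(total_rate)
        if t + wait >= t_end:
            weighted_sum += n * (t_end - t)
            break
        weighted_sum += n * wait
        t += wait

        # Pick which reaction fires, in proportion to its rate.
        if n == 0 or rng.random() * total_rate < rate_tx:
            n += 1  # one transcript produced
        else:
            n -= 1  # one transcript degraded

    return n, weighted_sum / t_end


final_count, mean_count = simulate_transcript_counts(
    k_tx=10.0, k_deg=1.0, t_end=200.0
)
print(final_count, round(mean_count, 2))
```

For this model the long-run average copy number approaches k_tx / k_deg (10 with the values above), while the count at any single time point scatters around that value. A single measurement is thus one draw from a distribution rather than a fixed quantity.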

There have been numerous attempts to formulate models for the highly dynamic nature of gene expression. Yet, researchers have failed to find a unified model that satisfactorily explains transcription across genes and conditions, in large part owing to the multifactorial nature of the process and the technological difficulties of studying it [17]. Nevertheless, it’s apparent that the transcriptome can change in the timeframe of just a few minutes in response to external stimuli [18]. There are also well-understood and categorized oscillation patterns, like those of the cell cycle phases, that occur along the cell division stages, which result in periodic transcriptional expression of genes involved in processes like DNA replication, chromosome segregation, and cell adhesion [19].

1.2 The exogenous transcriptome

While the RNA transcripts of the transcriptome are primarily created within the cell, there is evidence that some transcripts are transferred between cells via extracellular vesicles [20]. These membrane-enclosed particles are secreted from the cell containing a cargo that can consist of various molecules, including nucleic acids, and are thought of as a means for the cell to communicate with and affect both its immediate neighbors and more distant cells.

The vesicles are thought to have several modes of interaction, and one such mode is internalization by the recipient cell and subsequent release of the cargo within. The normal route of delivery would imply lysosomal degradation, and for the RNA to be functionally active in the new host some sort of endosomal escape would have to occur. The frequency of such events and the extent to which they impact the transcriptomic profile of cells are currently unknown. Nevertheless, the cargo size of individual extracellular vesicles is limited, which would suggest a bias in the type of RNA that can be packed within. Likely, the overall transcriptome of the recipient cell is hardly affected but specific transcripts might be. These transcripts could mediate important physiological effects on the cell.

RNA transfer has also been suggested to occur via alternative routes, mediated by membrane-coated “nanotubes” that act as intercellular bridges between cells in close contact with each other [21]. The concept of cellular exchange, cellular parabiosis, covers all types of molecules, not just RNA. It is suggested to be an important mechanism for the stability of cellular homeostasis and tissue integrity, as the network of cells becomes more resilient to deleterious events like detrimental mutations. In contrast, individual cells without a community to support them could easily go off course. Like many protective processes, cellular parabiosis has also been implicated in tumor biology, where inter-cell communication could make tumors resistant to therapies [22], [23]. More research is needed to fully understand the role of RNA transfer in organisms, and appropriately enough, it has its own “ome”, dubbed the transferome [24].

1.3 Spatial context

Finally, in line with the main theme of this thesis, cellular status and its constituents are highly influenced by their environment, which should come as no surprise given the information flow from both nearby cells and organs far away. One of the more striking examples of such influence is within tumor biology, where the malignant cells along with their surroundings, including the extracellular matrix and soluble molecules, form the so-called tumor microenvironment (TME). The TME has been shown to be decisive for understanding inter-patient variability in disease progression and treatment outcome [25], [26]. With time, tumors tend to become more heterogeneous, and divergent clones can evolve from the same primary tumor [27]. Depending on which cells are in contact with each other, the behaviors of the cells change, and these differences are reflected in their transcriptomic composition [28]. Cellular behavior and gene expression within the TME determine the potential for immune effector functions and the variability in infiltration rates of immune cells, and the strong selective pressure exerted by the immune system has been linked to transcriptomic changes among the tumor cells. For example, by lowering the expression of neoantigen genes, the tumor cells could more easily evade immune cell recognition [29]. All these transcriptomic changes occur in response to the surrounding environment, and, even within the same piece of tissue, substantial differences can be observed depending on which neighbors the cells have.

The spatial aspect of transcriptomics could be considered on yet another level of detail, within the individual cell. The various stages from transcription to translation can be performed at different intracellular locations, and the trafficking of RNA transcripts is thought to constitute an important regulatory functionality. For example, it has been shown that certain transcripts localize at different parts of the cell in a way that cannot be explained by random events. This behavior could have evolved due to the fact that the position of an mRNA governs where the protein is made, and, thus, it could be energetically favorable for cells to localize transcripts that code for fast-response proteins near the cellular compartments where they are needed [30]. As technologies for massively parallel exploration of spatial organization within the cell have only recently started to emerge, we can expect to see many new discoveries in the coming years. These discoveries have the potential to deepen our understanding of cellular behavior and the mechanisms of pathology.


Chapter 2

What are we really looking at?

So what can the transcriptome tell us? It’s notable that there is an increasing effort to combine multiple omes to characterize cells and their behavior, but method development for high-throughput profiling of the different modalities has been evolving at an uneven pace. Transcriptomics is a modality where technological development has been at the forefront and has enabled highly feasible workflows for large data collections. You will find that this thesis focuses almost exclusively on the mRNA constituents of the transcriptome, since the underlying technologies used to identify and measure the transcripts are all based on capturing the transcripts by the polyA-tails present at their 3’ ends (figure I). In this context, while we sometimes talk about whole-transcriptome profiling, it often implies that we disregard a large portion of the transcripts, namely those that are not polyadenylated. While these discarded transcripts mainly consist of rRNA, which is often considered less interesting to study as it is less cell-type specific, there is undoubtedly a selection associated with this capture technique, and some important non-coding transcripts will be missed [31].

While the mRNA profile only makes up a tiny fraction of the cells’ absolute mass ( 1 %), it can still give us information about cellular behavior. The transcriptome is a reflection of the current state of the cell, influenced both by the environment surrounding the cell and by the cell’s history. The transcriptome can tell us what kind of cell it is and, potentially, the trajectory it’s currently on toward what it will become in the future. Sometimes it can tell us why the cell is dysfunctional or give us clues about how external input might influence cellular behavior. However, it’s hard to attach much meaning to an individual transcriptomic profile without having some sort of framework as a reference. To create this type of reference, we usually refer to cell types.


2.1 Cell type

Cell types seek to classify the cells within an organism. Historically, this classification has been based on histology and morphological criteria. By combining the location of the cell, its morphology, and the presence of certain key functional markers, like the presence of a gene or a protein on the cell surface, one could put a label on the cell to give it an identity. While these properties can be used for a rough categorization of cells into cell types, single-cell analyses have revealed a much more complex reality; it’s not always possible to separate cells by a few well-known markers, and many anatomical properties can be shared between cells having very different functions. The classification task also becomes increasingly challenging and complex as one moves down the resolution tree and the differences between the cells begin to narrow (figure II).

Figure II: The different resolutions of omics observations, ranging from the well-defined tissue level to the detailed level of single cells covering the whole continuum of cellular states. (Levels shown, in order of increased resolution: tissue level, cell type level, single-cell level.)

There is a non-trivial question of where a meaningful subclassification actually stops. One could argue that it never stops, as each individual cell has a unique composition of molecules that affects its behavior. However, keeping such a dimensionality in the analysis quickly becomes unfeasible in most instances. The preferable grouping would appear to be where a meaningful functional difference of interest still divides the cells, which, as with everything else, largely depends on the question at hand. For one study, the lineage choice of thymocytes to become CD4+ helper T-cells or CD8+ […] stated hypothesis, while, in another, the subclassification of CD8+ T-cells could be essential to understand their pathogen clearance potential [32]. While the typical practice of identifying certain key genes and proteins as a means to put hard labels on cells is mostly an artificial construct, it has its uses for creating a framework to work with and a means of communication. However, it should be stressed that cell-type characteristics are not definitive but highly dynamic in nature. One such example is epithelial-mesenchymal transition (EMT) and the reverse mesenchymal-epithelial transition (MET), an essential alternation between cellular states during embryonic development and tissue repair. Epithelial cells are tightly anchored together as they are positioned along the outer surfaces of organs and blood vessels, with clear spatial differences between the apical and basal surfaces. During EMT, the epithelial cells lose their polarity, and the proteins holding them together are removed. They become mesenchymal in their phenotype, obtaining a greater migratory potential and ability to synthesize and secrete various extracellular components.

As with many other processes associated with developmental biology, EMT and MET reappear as important aspects of tumor development. Epithelial cells lining the invasive front of primary tumors undergo EMT and acquire the migratory potential to relocate from the tumor to a distant site in the body. There, MET enables them to firmly establish a metastatic lesion [33]. During these transitions, there is a gradual shift in the transcriptomic profiles of the tumor cells [34]. Furthermore, by genetically knocking out certain signaling pathways, an intermediate transcriptional profile is enriched among a population of cells undergoing EMT [35], suggesting that there exist checkpoints along the transcriptional continuum for which key genes need to be present in order for the transition to continue (figure III). Is such an enrichment of an intermediate state, often much less stable than either endpoint along the trajectory, to be viewed as its own cell type?

In this context, there is a continuum of cellular states, rather than distinct types. The cell can be more or less mesenchymal or epithelial in its characteristics, rather than making a binary choice between the two, and a cell type could be viewed as a closed manifold of multiple possible cell states. From a transcriptomic standpoint, these manifolds could overlap in certain dimensions, allowing different cell types to find themselves with similar transcriptomic profiles, which could make them challenging to distinguish, especially if only a limited set of marker genes is used to identify them. In such situations, successfully separating the cell types requires more data, either in the form of more comprehensive transcriptomic profiles or alternative modalities, like epigenetics, that can provide additional information about the possible states.


Figure III: Theoretical model of cellular transition between states, exemplified by the epithelial-mesenchymal transition (EMT). The RNA composition of the cell gradually changes over time along the differentiation path, and checkpoint states block further changes unless specific conditions are met. (Labels in the figure: State 1, State 2, State 3, State 4; transcriptional transition; time; checkpoint; epithelial; mesenchymal.)

[…] types can lead to misinterpretation of data. In a paper by Bosteels et al. [36], the authors highlight one such possible scenario, which could serve as an example. They find that monocyte-derived cell (MC) types might historically have been falsely attributed certain roles in immune responses to viral infection. To define cell type, previous experiments have used marker genes that in a “steady state” situation distinguish between MCs and conventional dendritic cells (cDCs). However, in an inflammatory environment, subtypes of cDCs obtain another cellular state, which, with regard to these markers, resembles the conventional profile of MCs. These different cell types display vastly different functional potential in their ability to, for example, present antigens to T-cells and migrate to lymph nodes. Clearly, given their different functions, we should view these as different cell types but with overlapping transcriptional and proteomic profiles of key markers.

2.2 Cell fate

Starting from a single ancestor, cells divide, migrate, differentiate, and evolve into the different organs and tissues of the body. Some cells become committed to stable terminal identities, while others are kept in a more plastic state, allowing them to quickly generate new cells of several types.


The commitment is thought to be achieved by changing the chromatin configuration of the cell, which changes the expression probability of different genes [37]. Such alterations to the phenotype of the cell that are not associated with genotypic changes are the topic of epigenetics. It is worthwhile to point out that the changes are often heritable but not irreversible and can be modified by environmental stimuli [38].

In 1957, C. Waddington [39] used a metaphor for describing developmental processes in which he pictured a marble rolling down a hill (figure IV). Depending on the shape of the landscape, governed by epigenetic factors, the marble would fall into different terminal stages representing alternative cellular fates. This landscape traditionally gave the picture of a unidirectional cell fate progression, which has later been shown to be inaccurate, as exemplified by the EMT/MET pendulum. Furthermore, back in 2006, scientists demonstrated the ability to reprogram fibroblasts back into a stem cell state by introducing certain transcription factors under specific conditions [40]. However, the landscape can still act as a good mental picture, and the slope of the hill could be viewed as the path of least resistance. More recent transcriptomic profiling of different lineage trajectories could be described by modifying the classical landscape picture [41], [42]. These studies suggest that progenitor cells co-activate competing fate-expression programs up to a decision point, where extrinsic cues are involved in tuning the transcriptional profile to the tipping point where the cell takes a turn to one or another differentiation path. Furthermore, as exemplified by neural crest differentiation, such decisive extrinsic cues are tightly connected to the spatial position and neighbors of the cell [41], again reflecting the importance of positional information in determining cellular behavior.

Another model to describe transitions between cell states has been presented as a dynamic system where stable states take the form of attractors within the state space [43]. This is most easily thought of as a simple two-gene system (figure V) where each gene has a positive feedback loop on its own expression while inhibiting the other. We could picture an undifferentiated stem cell positioned right in the middle of the two extreme states, with similar expression of both genes. If we induce a small elevation in expression of either gene, the cell would start to move toward that particular state, reinforcing its move the further it goes due to the positive feedback. At a particular point, the cell will become committed to the state for which the stimulus was given, and, once there, it will be unlikely for the cell to differentiate back without a major stimulus that promotes the other gene. In such a system, stemness would be induced if the strength of the feedback loops were reduced or the degree of inhibition lessened, allowing more oscillations between the genes.
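To make the two-gene picture concrete, the following is a minimal sketch of such a system, not the model of [43]. The equations (Hill-type self-activation and cross-repression with linear decay), all parameter values, and the function names are assumptions chosen for illustration, and the equations are integrated with a plain Euler scheme.

```python
"""Toy two-gene circuit: self-activation plus mutual inhibition (illustrative only)."""


def hill_activation(z, n=4, s=1.0):
    # Saturating positive feedback of a gene on its own expression.
    return z**n / (s**n + z**n)


def hill_repression(z, n=4, s=1.0):
    # Inhibition exerted by the competing gene.
    return s**n / (s**n + z**n)


def simulate(x0, y0, a=1.0, b=1.0, k=1.0, dt=0.01, steps=5000):
    """Euler integration of the assumed system
    dx/dt = a*act(x) + b*rep(y) - k*x,  dy/dt = a*act(y) + b*rep(x) - k*y."""
    x, y = x0, y0
    for _ in range(steps):
        dx = a * hill_activation(x) + b * hill_repression(y) - k * x
        dy = a * hill_activation(y) + b * hill_repression(x) - k * y
        x, y = x + dt * dx, y + dt * dy
    return x, y


if __name__ == "__main__":
    # Balanced start, small push toward gene x, small push toward gene y.
    for start in [(1.0, 1.0), (1.05, 1.0), (1.0, 1.05)]:
        print(start, "->", simulate(*start))
```

With these assumed parameters, a perfectly balanced cell stays at the mid-point, whereas a small elevation of either gene is amplified until the cell settles in the state dominated by that gene, mirroring the commitment described above.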


Figure IV: Waddington’s landscape. The classical model (A) pictured a unidirectional process where epigenetic factors determined the shape of the landscape, in turn deciding the path of the marble. Since then, it has been shown that there are fate-competing expression programs active at the same time and at certain points (B), and the combination of external stimuli and stochastic transcriptional behavior determines the outcome. Furthermore, cellular fates are not completely deterministic but exhibit a high degree of plasticity and ability to reverse (C), even if such reversibility could imply a high-energy barrier and be unrealistic to occur naturally. (Labels in the figure: commitment, plasticity, decision point.)

[…] as cellular behavior is the result of an orchestra of millions of interacting parts. From a transcriptomic viewpoint, it would seem that the expression of a single gene could not have the power to change the fate of the cell on its own. However, as in a game of dominoes, certain influential genes can ignite a cascade of events which ultimately determines the destiny of the cell. In [44], this is exemplified by the powerful influence that a single transcription factor has in determining whether a neural stem cell starts to differentiate and give rise to cell types of regenerative character. However, in this case, there is an equally important extrinsic cue that needs to be present in the form of a physical injury, changing the topology of the landscape in order for the differentiation path to open, if you will.

By observing the transcriptome, we can understand and place the cells on their fate trajectories. Such placements are facilitated by having a reference knowledge of the underlying Waddington landscape. Armed with this knowledge, we may ask ourselves: Where are the hills and valleys of the differentiation paths of the cells? At the current expressional states of the cells, what environmental cues are needed to push them toward a certain fate? Or, what is the underlying reason that cells of the same type from two individuals ended up with distinct functional attributes? Since the topology of the landscape is highly influenced by environmental cues, having spatial information along with transcriptomic profiles substantially eases the interpretation of our observations.

Figure V: The two-gene system of dynamic cellular behavior. Top: Genes x and y exert positive feedback on their own expression while inhibiting the expression of each other. Bottom: A phase portrait of the dynamical system. The green mid-point is a cellular state with a high degree of plasticity (i.e., a “stem cell” state). The cell becomes incrementally committed when the expression level of one of the genes takes over.

2.3 Single-cell sequencing and new cell types

Sequencing of nucleic acids has revolutionized the field of transcriptomic profiling. However, originally, sequencing-based methods measured the transcriptome on the level of cell populations. Such population averages could lead to false interpretations of transcript relations and make it challenging to give an accurate picture of the cellular states if there is large heterogeneity within the sample population (figure VI).
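How pooling heterogeneous cells can distort transcript relations may be illustrated with a small simulation. The sketch below is purely hypothetical: the two subpopulations, their expression ranges, the noise level, and the function names are all assumptions made for the example. Within each subpopulation, transcript y decreases with transcript x, yet the pooled data suggest the opposite trend.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def make_population(offset, n_cells, rng):
    """Simulated subpopulation: y falls as x rises, around a group-specific level."""
    xs = [offset + rng.uniform(0.0, 4.0) for _ in range(n_cells)]
    ys = [2 * offset + 4.0 - x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    xa, ya = make_population(0.0, 50, rng)   # low-expressing subpopulation
    xb, yb = make_population(10.0, 50, rng)  # high-expressing subpopulation
    print("within A:", pearson(xa, ya))
    print("within B:", pearson(xb, yb))
    print("pooled  :", pearson(xa + xb, ya + yb))
```

The within-group correlations are negative while the pooled correlation is positive, because the difference in overall expression level between the two groups dominates the pooled trend.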

Somewhat ironically, the population-average profile could in fact reflect a transcriptomic state that no single cell actually possesses. However, technological advancements for single-cell analysis have made it possible to obtain transcript information of individual cells, enabling much finer cell-state categorizations and detailed interpretations of transcript relationships. This has spurred the interest in creating cellular atlases, often ambitious projects which aim to catalog the cell types present in certain organs or even whole humans [45]. An example of such an effort, and of what the addition of single-cell information can contribute, is described in two back-to-back publications from 2018 that categorize a novel cell type in lung tissue [46], [47]. Here, the gene-expression signatures of the cells were used to partition them into discrete populations, and most of the clusters were annotated as belonging to particular cell types based on previously known markers. However, there were also cells whose expression profiles fell somewhere in between those of the annotated clusters, indicating a transition state for which new unique gene markers were identified. By immunofluorescence, the new markers were used to obtain information about where these cell states were located, and the authors could find co-localization of the cells with the annotated cell types on either side of the proposed trajectory, thus lending further support for the existence of the new-found intermediate state.

Figure VI: Combining data from multiple cells can result in the phenomenon of Simpson’s paradox, where trends within groups disappear, or even reverse. (Panels: observation and interpretation of transcript x against transcript y, for single cells and for multiple cells.)

Let us turn back to the definition of a “new” cell type and where one should actually draw the line for such a claim. In the above-mentioned papers, a clearly discrete cluster in a lower-dimensional gene space was found to have a transcriptional profile similar to ionocytes, previously found to be critical for osmotic homeostasis in other organisms. Furthermore, the authors found that this cell type in particular expressed higher levels of CFTR compared to any other cell obtained from the tissue. Mutations in this gene are well known to be a cause of cystic fibrosis [48]. Based on earlier studies without single-cell information, it had been assumed that CFTR was just a lowly abundant transcript from common ciliated cells located in the airways. However, by obtaining a more granular picture of the individual cells, it was found that this rare cell population, consisting of only around 1 % of the total cellular mass, actually is the main producer of this transcript. Given its implications in cystic fibrosis, this type of knowledge can potentially lead to alternative approaches to treatment.

2.4 Correlation between the omes

In a holistic sense, proteins are the actual effectors of the cell, and it’s therefore the proteins that ultimately dictate cell behavior. While such a statement seems to put forward proteomics as the ultimate modality to study cells with and render the other omics rather insignificant, all the omes are, in fact, complementary and can only together provide a comprehensive picture of the cellular state [49]. In fact, since the transcriptome constitutes an essential part in the interplay between all omes, looking at the transcriptome could be viewed as a proxy for the past, present, and future profile of the cell with regard to the other modalities.

However, it is unknown to what degree the transcriptomic profile can inform on biological mechanisms during situations of longer-term dysregulation, which in general are poorly understood. Proteins are modified in numerous ways after translation, some of them deleterious, and late-onset degenerative diseases have been linked to the accumulation of damaged or toxic proteins. While the environmental change that follows such an accumulation, both intra- and extracellularly, will most probably affect the transcriptome, it is not certain that the underlying cause can be inferred from transcriptomic data. Could it be that the dysregulation was initiated at the proteome level and was never actually visible in the transcriptome?

Nevertheless, as technological advancements have made transcriptomic profiling among the most accessible measurements of the cell, the prospect of using transcriptomics for inference of the other omes has become rather appealing. However, making such inference assumes there is a correlation between the transcriptome and the other modality of interest. Considering the proteome, these are undoubtedly mutually dependent, as all proteins need to be preceded by translation of transcripts, and the proteins in turn affect cellular behavior and alter multiple layers of transcriptional regulation. While the abundance of a certain transcript will largely dictate the abundance of the corresponding protein in a steady-state situation, it’s to be expected that major deviations could appear, especially during transition phases of shorter timescales, e.g., during differentiation or cellular stress (figure VII) [50].
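The difference between steady state and transition can be sketched with a deliberately simple kinetic model, assumed here for illustration only: protein P is produced in proportion to the mRNA level m and degraded at a constant rate, dP/dt = k_tl * m - k_deg * P. The rate constants, the step change in mRNA, and the function names are hypothetical.

```python
def step_mrna(t, low=1.0, high=5.0, t_switch=10.0):
    """Assumed mRNA level: jumps from `low` to `high` at time `t_switch`."""
    return low if t < t_switch else high


def simulate_protein(mrna_at, p0, k_tl=2.0, k_deg=0.2, dt=0.01, t_end=60.0):
    """Euler integration of dP/dt = k_tl * m(t) - k_deg * P.
    Element i of the returned list is the protein level at time i * dt."""
    steps = round(t_end / dt)
    protein = [p0]
    for i in range(steps):
        t = i * dt
        dp = k_tl * mrna_at(t) - k_deg * protein[-1]
        protein.append(protein[-1] + dt * dp)
    return protein


if __name__ == "__main__":
    traj = simulate_protein(step_mrna, p0=10.0)
    for t in (9, 11, 20, 60):
        p = traj[round(t / 0.01)]
        print(f"t={t:>2}  mRNA={step_mrna(t):.0f}  protein={p:.2f}")
```

At steady state the protein level equals (k_tl / k_deg) times the mRNA level, so the two are perfectly proportional. Shortly after the step, the mRNA is already at its new level while the protein still lags behind, which is the kind of transient deviation described above.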

The degree of correlation across time will be a function of the particular gene, where some proteins only get translated on demand in response to certain stimuli and others are stored in cellular compartments (granules) long after an increase in transcript abundance has faded [51]. Nevertheless, a general agreement for canonical markers has been seen when using newly developed methods that allow for simultaneous profiling of both RNA and protein from the same cell [52]. In general, it’s expected to see a certain degree of lag between mRNA and protein abundances, as the former naturally precedes the latter. Any such differences, as well as stochastic variation within single cells, tend to be averaged out across populations.

Figure VII: (A) At steady state, there is usually a higher correlation between mRNA and protein levels than during a transition phase. (B)-(E) The correlation between modalities varies between genes and conditions. (B) Regulation of translation that results in quicker protein production as a response to a stimulus. (C) Translation of a protein that subsequently is stored in the cell while the mRNA is broken down. (D) Stochastic temporary expression but with a more stable protein concentration. (E) Cell-cycle genes with a clear oscillating pattern to their abundance levels. (Axes: protein abundance against mRNA abundance in A, with states A and B separated by a transition; abundance of protein and mRNA over time in B to E, with the stimulus and the window of correlation marked.)


The epigenome can be analyzed by using enzymes which preferentially interact with parts of the genome that are currently in an “accessible” configuration [53]. The enzyme inserts sequencing adapters at the accessible sites, and, thus, downstream sequencing produces a quantitative measurement of the current epigenetic landscape in terms of accessibility. This open-chromatin state is a requirement for most currently known transcription factors to bind to the DNA and subsequently influence transcriptional activity. As noted in a previous section, such transcription factors can have profound effects on cellular fate decisions. However, there are many examples of discordance between the accessibility of a locus and the corresponding transcriptional activity. Accordingly, the current view is that the configuration of the chromatin is more of an enabler than a determinant of its regulatory role, which is far more complex [54]. Going back to the example highlighted earlier, physical injury could be an environmental stimulus that remodels the chromatin, but it’s only when a certain transcription factor is present that some of the newly accessible loci actually modify transcriptional activity [44]. As such, the interplay between the transcriptome and other omics is multifactorial and highly non-linear, suggesting that multi-omics profiling is needed to comprehensively describe cellular states.

2.5 The analysis of transcriptomic data

While transcriptomic profiling includes simple ocular detection of single genes visualized on a piece of tissue, the type of data most relevant to the context of this thesis is high-throughput next-generation sequencing (NGS) data. NGS provides transcript information in the form of integer counts (the number of reads found for each transcript), which gives us a quantification of gene expression. If we seek to profile the whole transcriptomes of many cells, we quickly realize that datasets become quite large, not only in terms of the number of observations (e.g., cells) but also features (e.g., genes or transcripts). When analyzing such high-dimensional data, it’s easy to encounter a phenomenon known as the “curse of dimensionality”. As the dimensionality of the data increases, so typically does the sparsity. Noisy dimensions place even very similar transcriptomes far apart, which poses a challenge to finding interesting biological patterns. And transcriptomics data is swamped with noise, due to both measurement errors and biological stochasticity.
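The effect of noisy dimensions on distances can be illustrated with simulated data. In the hypothetical sketch below, two cell types differ in ten informative genes, and an increasing number of pure-noise genes is appended to every profile. The Gaussian noise model, the gene counts, the size of the shift, and the function names are assumptions made for the example, not properties of real count data.

```python
import math
import random


def simulate_cells(n_cells, shift, n_informative, n_noise, rng):
    """Profiles with `n_informative` genes centered on `shift` plus pure-noise genes."""
    cells = []
    for _ in range(n_cells):
        informative = [shift + rng.gauss(0.0, 1.0) for _ in range(n_informative)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        cells.append(informative + noise)
    return cells


def mean_distance(group_a, group_b):
    """Mean Euclidean distance over all pairs of distinct cells."""
    dists = [math.dist(a, b) for a in group_a for b in group_b if a is not b]
    return sum(dists) / len(dists)


def contrast(n_noise, seed=0, n_cells=10, n_informative=10, shift=3.0):
    """Ratio of between-type to within-type distance; 1.0 means indistinguishable."""
    rng = random.Random(seed)
    type_a = simulate_cells(n_cells, 0.0, n_informative, n_noise, rng)
    type_b = simulate_cells(n_cells, shift, n_informative, n_noise, rng)
    within = (mean_distance(type_a, type_a) + mean_distance(type_b, type_b)) / 2
    return mean_distance(type_a, type_b) / within


if __name__ == "__main__":
    for n_noise in (0, 10, 100, 1000):
        print(n_noise, "noise genes -> contrast", round(contrast(n_noise), 3))
```

As noise genes are added, the ratio of between-type to within-type distance shrinks toward one: cells of the same type end up almost as far apart as cells of different types.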

Thus, a typical first step when analyzing transcriptomics data is to find a lower-dimensional representation of it that turns the noisy picture into something more interpretable, i.e., we make the assumption that there exists a low-dimensional manifold that captures the underlying structure of the data and thereby places similar transcriptomes close to each other and dissimilar transcriptomes far apart. There is a plethora of different algorithms to create this low-dimensional embedding. With the introduction of the larger and more complex datasets generated from scRNA-seq, non-linear methods like UMAP have become popular due to their ability to preserve both local and global structure in the data representation. However, substantial differences in output can arise depending on the implementation of the algorithm and preprocessing of the data [55].

Figure VIII: The high-dimensional nature of whole-transcriptome data. (A) Visual illustration of how an increasing number of dimensions quickly becomes challenging for the human mind to grasp. (B) Typically, high-dimensional data is projected onto a low-dimensional embedding for visualization and downstream processing. (C) Global similarity does not per definition imply that a transition between cell states is likely to occur. By having a well-represented number of homogeneous neighboring cells, such transition probabilities could be computed under the assumption that the expected expression level among the neighborhood for each individual gene can be accurately modeled and statistically assessed. A strong statistical outlier could indicate a less likely transition and might suggest more deterministic differences in cell types. (Labels in the figure: one to three dimensions for genes A, B and C up to the whole transcriptome with 20,000 dimensions; samples x genes; observations, low-dimensional representation, grouping, direction; cell type A, cell type B; homogeneous neighborhood; low-probability transition; expression of gene X; local transcriptional variability.)

For single cells, a low-dimensional representation of the data is often used for grouping cells of high similarity, for which an equally large array of computational methods exists. As an example, in recent years, graph-based methods have become commonplace for clustering single-cell data, as they are flexible, scale to larger datasets, and do not impose strong assumptions about the shape of the resulting clusters. In this type of clustering, each cell is represented as a node that gets connected to neighboring cells based on transcriptional similarity. A second algorithm can then be applied to this graph, e.g., the Louvain method [56], to detect “communities” of nodes (cells). These groups, often annotated as cell types or states, are then used for downstream analysis where various statistical tests are performed to find transcriptomic signatures that differentiate the groups. There is an inherent caveat to this type of analysis, which stems from the fact that the same data is used twice, which inflates the apparent statistical strength. For example, uncertainty in the initial grouping of cells is typically not accounted for during the tests between the resulting cellular groups, and the final outcome might therefore suffer from selection bias. Thus, findings from these tests should be viewed as exploratory, and hypotheses based on them need to be verified in new data. Alternatively, more robust conclusions could be drawn by adopting a design scheme where the initial data is divided into separate datasets and the clustering is performed on the first, to learn a separating hyperplane that can be used to test differences between clusters in the second [57], mirroring the classical setup of a training and a test set in typical machine learning models. However, such handling of transcriptomic profiling data is rarely seen in published literature.
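The inflation caused by using the same data twice can be demonstrated on simulated data in which no real groups exist. The sketch below is a simplified stand-in for the workflow described above: instead of graph-based clustering of whole transcriptomes, it splits a single hypothetical gene with a two-cluster k-means and then tests that same gene between the resulting clusters using a hand-written Welch t-statistic. All names and settings are assumptions for illustration.

```python
import random
import statistics


def two_means_1d(values, n_iter=20):
    """Minimal two-cluster k-means on one-dimensional data."""
    lo, hi = min(values), max(values)  # initial centroids
    for _ in range(n_iter):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = statistics.mean(low), statistics.mean(high)
    return low, high


def welch_t(a, b):
    """Welch t-statistic for the difference in means of two samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


if __name__ == "__main__":
    rng = random.Random(1)
    # One homogeneous population: there is no true subgroup structure.
    expression = [rng.gauss(0.0, 1.0) for _ in range(200)]

    low, high = two_means_1d(expression)
    print("|t| after clustering on the same data:", abs(welch_t(low, high)))

    shuffled = expression[:]
    rng.shuffle(shuffled)
    print("|t| for a random split:", abs(welch_t(shuffled[:100], shuffled[100:])))
```

The clusters differ strongly in the tested gene simply because they were defined by it, whereas a random split of the same cells shows no comparable difference. This is why tests between clusters obtained from the same data are best treated as exploratory.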

As briefly mentioned in the previous sections, practically all current methods for RNA profiling provide a snapshot in time of the current state of the transcriptome. Given the dynamic nature of RNA abundances, such snapshots pose a clear limitation on the ability to determine the direction of cellular behavior and state transitions. However, by observing a larger population of cells, there is certain dynamic information present in the data, as observations include cells from various intermediate states along differentiation trajectories. As such, these cells can be viewed as representing different time points as the cells roll down the Waddington landscape, although they are actually obtained and profiled at the same time. This has given rise to computational methods to order cells in pseudotime. A typical approach is to first project the transcription profiles onto a lower-dimensional space and then find the longest path in a minimum spanning tree of the cells, which would represent a likely differentiation trajectory [58]. As such, the concept describes a transition in transcriptomic state as a proxy for transition in time. Nevertheless, even if cells could be placed along a transcriptional gradient, it does not by itself solve the question of directionality. Such dynamic inference could be made if there is knowledge about the density of cells and the rates at which they enter and exit each state [59]. These rates could depend on various parameters, e.g., proliferation, cellular death, physical migration, or down-/up-regulation of markers. Solving such an equation to predict future states from a static snapshot is non-trivial but can be eased if sampling is performed at multiple time points. By obtaining the rates at which cells enter and exit the various states along a trajectory, models can be built to suggest the topology of the differentiation landscape and where checkpoints for fate decisions exist [60].
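The minimum-spanning-tree approach to pseudotime can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the implementation of [58]: the cells are hypothetical points in a two-dimensional embedding, the tree is built with Prim's algorithm on Euclidean distances, and the longest path is found with two depth-first traversals. All function names are invented for the example.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on Euclidean distances; returns an adjacency dict."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    adj = {i: [] for i in range(n)}
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            adj[u].append((parent[u], best[u]))
            adj[parent[u]].append((u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return adj


def farthest_node(adj, start):
    """Traverse the tree from `start`; return the most distant node and back-pointers."""
    dist, prev, stack = {start: 0.0}, {start: None}, [start]
    while stack:
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v], prev[v] = dist[u] + w, u
                stack.append(v)
    return max(dist, key=dist.get), prev


def longest_path(points):
    """Longest path (diameter) of the minimum spanning tree, as a list of cell indices."""
    adj = minimum_spanning_tree(points)
    a, _ = farthest_node(adj, 0)
    b, prev = farthest_node(adj, a)
    path, node = [], b
    while node is not None:
        path.append(node)
        node = prev[node]
    return path


if __name__ == "__main__":
    # Hypothetical cells sampled, in shuffled order, along one trajectory.
    true_time = [3, 7, 0, 9, 4, 1, 8, 2, 6, 5]
    cells = [(t, 0.1 * (-1) ** t) for t in true_time]
    print([true_time[i] for i in longest_path(cells)])
```

The recovered path orders the cells along the trajectory, but it can equally well be read from either end, which illustrates why the ordering by itself does not settle the question of directionality.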

An alternative way to infer directional changes is by analyzing the ratio between pre-mRNA and mRNA [61], [62]. Given that transcription first gives rise to the immature transcript, which subsequently proceeds through the mature state before being translated to protein, such a ratio could be used to extrapolate transcriptomic abundances at later time points by solving a system of differential equations for each gene. By combining transcriptional information with graph-based trajectories in low-dimensional space, one could obtain not only a representation of cellular differentiation trajectories but also the direction in which the cells are headed along these paths (figure VIII - B).
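The idea can be sketched for a single gene with a simplified kinetic model, assumed here for illustration: spliced mRNA s is produced from unspliced pre-mRNA u at rate beta and degraded at rate gamma, so ds/dt = beta * u - gamma * s. At steady state, u and s are proportional, and a cell with more unspliced transcript than this proportion predicts is presumably increasing its expression. The functions below, the least-squares fit through the origin, and the toy values are assumptions and are not taken from the implementations in [61], [62].

```python
def fit_gamma(unspliced, spliced):
    """Degradation rate from cells assumed to be at steady state (with beta = 1):
    least-squares slope through the origin of unspliced against spliced."""
    return sum(u * s for u, s in zip(unspliced, spliced)) / sum(s * s for s in spliced)


def velocity(u, s, gamma, beta=1.0):
    """Estimated time derivative of the spliced abundance, ds/dt = beta*u - gamma*s."""
    return beta * u - gamma * s


def extrapolate(s, v, dt):
    """First-order extrapolation of the spliced abundance, floored at zero."""
    return max(s + v * dt, 0.0)


if __name__ == "__main__":
    # Hypothetical steady-state cells for one gene: unspliced is half of spliced.
    gamma = fit_gamma([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

    for label, u, s in [("induced", 3.0, 2.0), ("steady", 2.0, 4.0), ("repressed", 0.5, 4.0)]:
        v = velocity(u, s, gamma)
        print(f"{label:>9}: velocity={v:+.2f}, spliced after dt=0.5: {extrapolate(s, v, 0.5):.2f}")
```

A positive velocity indicates that the mature transcript is about to increase, a negative one that it is about to decrease. Repeating this for every gene gives each cell a direction in transcriptional space.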

A word of caution is warranted when considering distances in the low-dimensional representation of transcriptional space. Depending on the algorithm used to project the data onto the embedding, these distances are subject to variability and differences in the power to preserve local and global structure. Commonly used algorithms like tSNE and UMAP are designed primarily to preserve local structure, i.e., highly similar data points in high-dimensional space cluster together in the low-dimensional representation, but distances between clusters, i.e., the global structure, should be interpreted more carefully. Furthermore, it’s worthwhile to note that global similarity of transcriptomes does not per se imply high transition probabilities, and even though two cells are in proximity of each other in the low-dimensional embedding and share many transcriptional traits, they might still be less likely to transition to one another (figure VIII - C). To avoid such potential pitfalls of connectivity within the graph, methods have been developed that compute gene-specific local expression variability and consider the neighbors to determine the expected distribution within the local community [63]. This type of analysis can identify transitions which are highly unlikely to occur unless supported by an intermediate transition state, highlighting the need to observe locally homogeneous neighborhoods in order to reliably detect differentiation paths. In short, any transition or trajectory analysis will be substantially more robust the more complete the representation of cellular states is in the data.

2.6 Integration across datasets

Owing to the diversity of biological systems, there is a trend for datasets to grow in size and complexity. Initiatives to create atlases, like the Human Cell Atlas [45], illustrate such complex endeavors, where the datasets include samples generated in multiple labs, under different conditions, and with different methods and reagents. This creates unwanted technical variation which obscures biological signals. Therefore, there is a need for computational methods to identify and correct for these effects when integrating samples. Such non-biological signals, so-called “batch effects”, always appear to varying degrees, even within smaller experiments conducted in the same lab with exactly the same methodology. When integrating samples across labs and protocols, the effect becomes nested and typically even more complex. There is an inherent challenge in these integration efforts, since there is typically no easy way to know what the combined dataset should look like if correctly integrated, and the risk of removing potentially important biological signals is therefore high. Furthermore, the distinction between technical and biological variation is often not obvious. Consequently, a stronger removal of unwanted variation often comes at the price of a poorer conservation of biological variation. Across the vast array of computational methods developed to adjust for unwanted variation, this trade-off is balanced to varying degrees [64]. Thus, certain methods could be preferable to others in different contexts. For example, if a study aims to detect rare cell types or states rather than to completely remove batch effects for sample-to-sample comparisons, some technical variation could be permissible to make sure important biological variation is not discarded. Another important consideration when selecting a method is the runtime and how it scales with the size of the dataset; certain methods simply become infeasible to apply once the dataset exceeds a certain size. For batch effects pointing in one direction, linear models usually perform well [65]. But for nested, more complex effects, non-linear methods, e.g., based on deep learning, might be preferable [66], [67]. The latter often come at the expense of interpretability, computational requirements, and ease of use, however.
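To make the linear case concrete, the sketch below removes a purely additive, per-batch shift for a single gene by centering each batch on the overall mean. It is a deliberately simplified stand-in for a linear model with batch as the only covariate, not the method of any cited tool; the function name and the example values are hypothetical.

```python
from statistics import mean


def center_batches(expression, batches):
    """Remove an additive per-batch shift for one gene.

    expression: one expression value per cell.
    batches: one batch label per cell, in the same order.
    Each value has its batch mean subtracted and the overall mean added back.
    """
    overall = mean(expression)
    batch_means = {
        label: mean(x for x, b in zip(expression, batches) if b == label)
        for label in set(batches)
    }
    return [x - batch_means[b] + overall for x, b in zip(expression, batches)]


# Hypothetical gene measured in two batches; batch "B" is shifted upwards by 4.
expression = [1.0, 2.0, 3.0, 5.0, 6.0, 7.0]
batches = ["A", "A", "A", "B", "B", "B"]

corrected = center_batches(expression, batches)
print(corrected)
```

The example also shows the trade-off described above: if batch "B" contained a different mix of cell types than batch "A", the shift would be partly biological, and centering would remove that signal together with the technical effect.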

[Figure IX. Panel labels: Batch; Cell type; Correction/Integration; Label-based measures; Label-free measures of biological conservation; Integration between datasets; Conservation within datasets; Poor; Good.]

Figure IX: Dataset integration. (A) An integration harmonizes the samples and thereby enables assessments across datasets. (B),(C) Different methods to score the integration. Optimally, unwanted technical batch effects are removed while biological signals are retained.

While it can be non-trivial to decipher the output of the correction, there are several methods to get an idea of how well the datasets have been integrated. These methods indicate whether unwanted technical variation has been removed or if biological variation has been preserved. An example of such an assessment is that neighborhoods in low-dimensional space should not consist solely of cells from a single dataset (batch-effect correction), while, at the same time, cell-type labels within a dataset should not start to mix (biological signal preservation) (figure IX). The latter requires cell-type labels for each cell, but label-free methods exist that instead use cell-cycle and trajectory positions within each dataset and look at the preservation of such patterns in the final integrated output. In general, integration efforts become easier if cell types are well represented across datasets.
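The neighborhood-based assessment can be sketched as follows. For each cell, the k nearest neighbors in the embedding are found, and two fractions are averaged over all cells: the share of neighbors from another batch (batch mixing) and the share of neighbors with the same cell-type label (label conservation). This is a simplified illustration, not one of the published metrics; the function name, the coordinates, and the labels are hypothetical.

```python
import math


def neighbor_fractions(coords, batches, cell_types, k):
    """Average neighborhood composition over all cells.

    Returns (batch_mixing, label_conservation), where batch_mixing is the
    mean fraction of the k nearest neighbors from a different batch and
    label_conservation is the mean fraction sharing the cell-type label.
    """
    n = len(coords)
    other_batch = 0.0
    same_type = 0.0
    for i in range(n):
        # Rank all other cells by Euclidean distance in the embedding.
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(coords[i], coords[j]),
        )
        neighbors = ranked[:k]
        other_batch += sum(batches[j] != batches[i] for j in neighbors) / k
        same_type += sum(cell_types[j] == cell_types[i] for j in neighbors) / k
    return other_batch / n, same_type / n


# Hypothetical embedding: two well-separated cell types, each sampled in both batches.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
cell_types = ["T", "T", "T", "T", "B", "B", "B", "B"]
batches = ["A", "B", "A", "B", "A", "B", "A", "B"]

mixing, conservation = neighbor_fractions(coords, batches, cell_types, k=3)
print(mixing, conservation)
```

In this toy example every neighborhood contains cells from both batches while cell types stay separate, which corresponds to the "good" outcome in figure IX. If the two batches instead formed separate clusters, the batch-mixing fraction would drop towards zero.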


Chapter 3

The era of spatial transcriptomics

As introduced in chapter 1, the transcriptome of a cell is highly influenced by its surrounding environment. As a result, when analyzing cells obtained from tissue without knowing their physical location and surroundings, important information that contributes to the observed transcriptomic profile of the cell is lost. This realization has led to the development of methods which aim to retain the spatial information alongside the transcript data. Adding the spatial dimension can be approached by various means, and the last decade has seen rapid evolution in the number of methods and their capabilities. A review of this development and the associated methods is included as additional material to this thesis (Appendix). In the next chapter, I will focus entirely on the in situ capture methods that are used throughout the articles included in the present investigations.

3.1 The characteristics of in situ capture methods

The main results in the present investigations are broadly based on the Spatial Transcriptomics (ST) protocol [68] (figure X). While the basis for the method is described in section 4 of the included review, the remainder of this chapter will point out the most important aspects of analyzing ST data. The discussion in chapter two highlighting the analytical aspects of interpreting single-cell versus cell-population data is a recurring theme when working with ST data. In fact, ST data could be viewed as a modality which is challenged both by the sparsity of scRNAseq data and by the population averages of bulk RNAseq data.

As with scRNAseq, ST data usually suffer from sparsity, manifesting as a high number of zero counts observed for specific transcripts. While there is some debate regarding whether these zeros are “inflated”, i.e., observed
