
Doctoral Thesis in Mathematics

Topological and geometrical methods in data analysis

OLIVER GÄFVERT

Stockholm, Sweden 2021

KTH Royal Institute of Technology


Topological and geometrical methods in data analysis

OLIVER GÄFVERT

Doctoral Thesis in Mathematics
KTH Royal Institute of Technology
Stockholm, Sweden 2021

Academic Dissertation which, with due permission of the KTH Royal Institute of Technology, is submitted for public defence for the Degree of Doctor of Philosophy on Friday the 11th of June 2021, at 2:00 p.m. in sal F3, Kungl Tekniska Högskolan, Lindstedtsvägen 26, Stockholm.


© Oliver Gäfvert
ISBN 978-91-7873-853-3
TRITA-SCI-FOU 2021:14

Printed by: Universitetsservice US-AB, Sweden 2021


Abstract

This thesis concerns two related data analysis pipelines, using topological and geometrical methods respectively, to extract relevant information. The first pipeline, referred to as the topological data analysis (TDA) pipeline, constructs a filtered simplicial complex on a given data set in order to describe its shape. The shape is described using a persistence module, which characterizes the topological features of the filtration, and the final step of the pipeline extracts algebraic invariants from this object. The second pipeline, referred to as the geometric data analysis (GDA) pipeline, associates an algebraic variety to a given data set and aims to describe the structure of this variety. Its structure is described using homology, an invariant which for most algebraic varieties can only be computed numerically using sampling methods.

In Paper A we consider invariants on multi-parameter persistence modules. We explain how to convert discrete invariants into stable ones via what we call hierarchical stabilization. We illustrate this process by constructing stable invariants for multi-parameter persistence modules with respect to the interleaving distance and so-called simple noise systems. For one parameter, we recover the standard barcode information. For more than one parameter we prove that the constructed invariants are in general NP-hard to calculate. A consequence is that computing the feature counting function, proposed by Scolamiero et al. (2016), is NP-hard.

In Paper B we introduce an efficient algorithm to compute a minimal presentation of a multi-parameter persistent homology module, given a chain complex of free modules as input. Our approach extends previous work on this problem in the 2-parameter case, and draws on ideas underlying the F4 and F5 algorithms for Gröbner basis computation. In the r-parameter case, our algorithm computes a presentation for the homology of the chain complex C → A → B, where F : C → A and G : A → B and the modules have rank l, n, m respectively, in O(r^2 n^(r+1) + n^r m + n^(r-1) m^2 + r n^2 l) arithmetic operations. We implement this approach in our new software Muphasa, written in C++. In preliminary computational experiments on synthetic TDA examples, we compare our approach to a version of a classical approach based on Schreyer's algorithm, and find that ours is substantially faster and more memory efficient. In the course of developing our algorithm for computing presentations, we also introduce algorithms for the closely related problems of computing Gröbner bases for the image and kernel of the morphism G. This algorithm runs in time O(n^r m + n^(r-1) m^2) and memory O(n^2 + mn + nr + K), where K is the size of the output.

Paper C analyzes the complexity of fitting a variety, coming from a class of varieties, to a configuration of points in R^N. The complexity measure, called the algebraic complexity, computes the Euclidean Distance Degree (EDD) of a certain variety called the hypothesis variety as the number of points in the configuration increases. We also establish a connection to the complexity of architectures of polynomial neural networks. For the problem of fitting an (N − 1)-sphere to a configuration of m points in R^N, we give a closed formula for the algebraic complexity of the hypothesis variety as m grows in the case N = 1. For the case N > 1 we conjecture a generalization of this formula, supported by numerical experiments.


In Paper D we present an efficient algorithm to produce a provably dense sample of a smooth compact variety. The procedure is partly based on computing bottlenecks of the variety. Using geometric information such as the bottlenecks and the local reach, we also provide bounds on the density of the sample needed in order to guarantee that the homology of the variety can be recovered from the sample. An implementation of the algorithm is provided together with numerical experiments and a computational comparison to the algorithm by Dufresne et al. (2019).


Sammanfattning

Den här avhandlingen rör två relaterade dataanalysflöden, som använder topologiska respektive geometriska metoder, för att extrahera relevant information. Det första flödet, kallat topological data analysis (TDA) pipeline, konstruerar ett så kallat “filtrerat simpliciellt komplex” på en given datamängd i syfte att beskriva formen på datamängden. Formen beskrivs genom en så kallad persistensmodul, som karaktäriserar de topologiska egenskaperna hos filtreringen, och det slutgiltiga steget i flödet är att extrahera algebraiska invarianter från detta objekt. Det andra flödet, kallat geometric data analysis (GDA) pipeline, associerar en algebraisk varietet till en given datamängd och syftar till att beskriva dess struktur. Strukturen beskrivs med hjälp av homologi, en invariant som för de flesta algebraiska varieteter bara kan beräknas numeriskt med hjälp av samplingsmetoder.

I Paper A studerar vi invarianter för persistensmoduler med flera parametrar. Vi förklarar hur man konverterar diskreta invarianter till stabila via det vi kallar hierarkisk stabilisering. Vi illustrerar denna process genom att konstruera stabila invarianter för persistensmoduler med flera parametrar med avseende på interleaving-metriken och så kallade simple noise systems. För endast en parameter återfår vi barcode-invarianten. För mer än en parameter bevisar vi att de konstruerade invarianterna i allmänhet är NP-svåra att beräkna. En konsekvens är att beräkning av feature counting-funktionen, föreslagen av Scolamiero et al. (2016), är NP-svår.

I Paper B introducerar vi en effektiv algoritm för att beräkna en minimal presentation av en multi-parameter persistensmodul, givet ett kedjekomplex av fria moduler som input. Vårt tillvägagångssätt bygger på tidigare arbete med detta problem i fallet med två parametrar och på idéer som ligger till grund för F4- och F5-algoritmerna för att beräkna Gröbnerbaser. I r-parameterfallet beräknar vi en presentation av homologin för kedjekomplexet C → A → B, där F : C → A och G : A → B och modulerna har rang l, n respektive m, med O(r^2 n^(r+1) + n^r m + n^(r-1) m^2 + r n^2 l) aritmetiska operationer. Vi har implementerat vår algoritm i den nya mjukvaran Muphasa, skriven i C++. I preliminära experiment på syntetiska TDA-exempel jämför vi vår strategi med en version av en klassisk algoritm baserad på Schreyers algoritm, med resultatet att vår algoritm är väsentligt snabbare och mer minneseffektiv. Under utvecklingen av vår algoritm för minimala presentationer introducerar vi också algoritmer för de närbesläktade problemen att beräkna en Gröbnerbas för bilden och kärnan till morfismen G. Vår algoritm i det fallet har tidskomplexiteten O(n^r m + n^(r-1) m^2) och minneskomplexiteten O(n^2 + mn + nr + K), där K är storleken på resultatet.

Paper C analyserar komplexiteten av att anpassa en algebraisk varietet, som kommer från en klass av varieteter, till en konfiguration av punkter i R^N. Komplexitetsmåttet, kallat algebraisk komplexitet, beräknar den Euklidiska avståndsgraden (EDD) för en viss varietet som kallas hypotesvarieteten när antalet punkter i konfigurationen ökar. Slutligen visar vi en koppling till komplexiteten av arkitekturer i polynomiella neurala nätverk. För problemet med att anpassa en (N − 1)-sfär till en konfiguration av m punkter i R^N ger vi en sluten formel för hypotesvarietetens algebraiska komplexitet när m växer för fallet N = 1. För fallet N > 1 formulerar vi en förmodan som generaliserar denna formel, med stöd av numeriska experiment.

I Paper D presenterar vi en effektiv algoritm för att producera en bevisligt tät delmängd av en slät och kompakt algebraisk varietet. Algoritmen är delvis baserad på beräkningen av flaskhalsar hos varieteten. Med hjälp av geometrisk information, så som flaskhalsar och lokal räckvidd, ger vi också gränser för den densitet som krävs hos delmängden för att garantera att varietetens homologi kan återvinnas. En implementering av algoritmen tillhandahålls tillsammans med numeriska experiment och en jämförelse med algoritmen av Dufresne et al. (2019).


Contents

Contents

Acknowledgements

Part I: Introduction and Summary

1 Introduction

2 Background
2.1 Topological data analysis
2.2 Geometric data analysis

3 Summary of Results

References

Part II: Scientific papers

Paper A

Stable invariants for multiparameter persistence (joint with W. Chachólski)

Pre-print: arXiv:1703.03632.

Paper B

Efficient computation of multiparameter persistence (joint with M. R. Bender and M. Lesnick)

Pre-print.

Paper C

Computational complexity of learning algebraic varieties
Advances in Applied Mathematics, vol. 121 (2020), 102100.


Paper D

Sampling and homology via bottlenecks (joint with S. Di Rocco and D. Eklund)
Pre-print: arXiv:2010.07976.


Acknowledgements

I would like to begin by thanking my advisor Sandra Di Rocco and co-advisors Wojciech Chachólski and David Eklund for their guidance and support, interesting discussions and valuable advice during the years leading up to this thesis.

I am very grateful to my co-authors, in particular Matías Bender and Michael Lesnick, for our many interesting and long discussions. I would also like to thank my other colleagues at the mathematics department, who made my time there much more enjoyable. In particular, and in no particular order, Samuel Fromm, Gerard Farré, Julian Mauersberger, Parikshit Upadhyaya, Johan Wärnegård, Martina Favero, Lena Leitenmaier, Gustav Zickert, Carl Ringqvist, Nasrin Altafi, Marcus Nordström, Philippe Moreillon, Federico Izzo and Eric Ahlqvist.

Finally I would like to thank Ella, Salem and my family for their encouragement and support.


1 Introduction

Understanding the underlying structure generating observed data is the aim of data analysis. The methods of topological data analysis (TDA) and nonlinear algebra can unveil this structure without a specific choice of parameters or model. The power of these methods comes from the formulation of computable and visualizable invariants describing the shape of an object, which in this setting means a data set. These methods have been applied in a wide range of areas including, but not limited to, medicine [4, 33, 37], biology [16, 21, 20], robotics [7, 35, 22, 39] and image processing [6, 3, 2].

The field of topological data analysis originates from the study of similarity of Euclidean submanifolds and the estimation of their homology [30, 36]. The main tool of TDA, the persistence module, was formulated by Edelsbrunner et al. [28]; later, Carlsson et al. showed that it has the structure of a graded module over a polynomial ring [43, 12].

The premise of TDA is that shape matters. Data in high dimensions is usually sparse but may have relevant low-dimensional topological features. An illustrative example can be found in [41], where TDA is used to discover periodicity in dynamical systems. In their setting, periodic behavior corresponds to a homological 1-cycle, i.e. a closed circle in the state space. Detecting such structure using other methods is difficult, as formalizing a description of shape is a non-trivial task often dependent on a choice of parameters. TDA accomplishes this with tools from algebraic topology and commutative algebra. It avoids making a choice of parameters by considering all parameter choices and condensing this usually very large quantity of information into a concise and visualizable invariant.

For problem instances involving multiple measurements on the data (also referred to as filtration parameters), the construction of TDA invariants and efficient computation of the persistence module is an open area of research. In Paper B we make use of the observation that the persistence module has the structure of a graded module in order to compute it. We formulate a novel algorithm based on the theory of Gröbner bases to efficiently compute presentations of persistence modules. In the case of invariants, it was shown by Carlsson and Zomorodian [13] that no complete discrete invariant exists in the multi-parameter case, i.e. there is no invariant that completely determines the isomorphism class of the object. In


Paper A we show how to construct versions of classical commutative algebra invariants for the multi-parameter setting and analyze their computational complexity. Our main result shows that the invariant of Scolamiero et al. [14] and its generalization are in general NP-hard to compute.

TDA has recently been used in other areas of mathematics and, vice versa, tools from e.g. algebraic geometry have been used to enrich the information extracted from the persistence module; see for instance [8, 26, 41]. If the observed data comes from an algebraic variety it has additional structure that we can make use of in our analysis. The information could for example be used to estimate the tangent spaces of the underlying variety in order to guide the construction of the persistence module and reduce memory usage and computation time [8, 32].

Yet another use of the algebraic structure is to directly estimate equations of the underlying variety, produce a dense sample and calculate its homology. Analyzing the complexity of estimating the equations of the variety is the subject of Paper C.

In Paper D we consider the latter part of this pipeline. We construct a sampling algorithm to efficiently sample the variety with a density that guarantees that its homology can be recovered.

Sampling an algebraic variety is in itself an important problem, as geometric invariants such as homology groups and Betti numbers can almost never be computed symbolically. It is however a non-trivial problem, as it scales poorly with the degree and dimension of the variety and the ambient space. Ideally, one would like a uniform sample of the variety with a prescribed density. Formulating an algorithm to produce a uniform sample of the variety is the subject of [9, 18], and an algorithm to produce a provably dense sample is described in [26]. The main result of Paper D is a novel provably dense sampling algorithm which, in numerical experiments, proves to be substantially more efficient than the algorithm of [26].


2 Background

In this chapter we introduce the methods and tools used throughout this thesis.

The primary tool binding this thesis together is homology. Homology can be used to classify and describe the geometric shape of an object and answer questions such as how many components it consists of or how many holes and voids are present. The work in this thesis concerns two related data analysis pipelines, illustrated below: the topological data analysis (TDA) pipeline and the geometric data analysis (GDA) pipeline. The output of each pipeline is a homological invariant, while the tools used to compute it come from algebraic topology and algebraic geometry respectively.

Topological data analysis (TDA) pipeline: Data → construct filtered simplicial complex → compute persistence module → invariant.

Figure 2.1 – The TDA pipeline for computing homological invariants.

Geometric data analysis (GDA) pipeline: Data → estimate equations of an algebraic variety V → compute a sample E ⊂ V with density δ → compute the homology of E.

Figure 2.2 – The GDA pipeline for computing homological invariants.

Throughout this chapter, a data set U is a finite subset U ⊂ R^N, where N is a positive natural number.


2.1 Topological data analysis

The fundamental goal of topological data analysis is to describe the shape of a data set. As observed by Edelsbrunner et al. [28], one can formalize a description of the shape of a data set amenable to computations using simplicial complexes and homology. Starting from a data set U, the question is how to construct a simplicial complex. While any such choice can be made, we would like to construct a simplicial complex that best reflects the structure of the object that U was sampled from. The strategy of persistent homology, the main tool of TDA, is to construct a filtration of simplicial complexes on U. The homological generators that persist in the filtration are believed to represent significant topological features of the underlying object.

The filtration is constructed starting from a metric on the points of U. There are several possible strategies to construct a filtration, such as the Vietoris-Rips, Čech or α-complexes and their witness constructions (see [11, 27] for details).

Example 2.1. The most widely used strategy for constructing a filtration of simplicial complexes in TDA is the Vietoris-Rips complex. Let d : U × U → R be a metric on U. The Vietoris-Rips complex at the parameter value ε is defined by

\[
F(\varepsilon) = \{\sigma \in P(U) \mid d(x, y) < \varepsilon \text{ for all } x, y \in \sigma\}
\]

where P(U) denotes the power set of U, consisting of all its subsets. A property of the Vietoris-Rips complex, which makes it attractive from a computational perspective, is that it is fully determined by its 1-skeleton: to describe it one only has to consider individual pairs of points. This is the case for neither the Čech nor the α-complex.

Figure 2.3 – Illustration of a filtered Vietoris-Rips complex.
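The definition above can be implemented directly for small point sets. The following is a minimal sketch (function and variable names are our own), enumerating all simplices up to a chosen dimension whose pairwise distances stay below ε:

```python
from itertools import combinations
import math

def vietoris_rips(points, eps, max_dim=2):
    """Simplices of the Vietoris-Rips complex F(eps): all subsets of the
    point set whose pairwise Euclidean distances are below eps."""
    n = len(points)
    simplices = [(i,) for i in range(n)]   # vertices are always present
    for k in range(2, max_dim + 2):        # edges, triangles, ...
        for sigma in combinations(range(n), k):
            if all(math.dist(points[i], points[j]) < eps
                   for i, j in combinations(sigma, 2)):
                simplices.append(sigma)
    return simplices
```

On three points forming a right triangle with legs of length 1, ε = 0.5 yields only the three vertices, while ε = 2 yields the full triangle (3 vertices, 3 edges, 1 triangle).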

Let SC denote the category of simplicial complexes, and let R denote the category whose objects are the real numbers, with an arrow from a to b whenever a ≤ b in the usual ordering. A filtered simplicial complex on U can be represented as a functor F : R → SC. One can then compute the homology of each simplicial complex in the filtration. In persistent homology one normally computes the homology with coefficients in some field K, making each homology group a K-vector space (see [11] for details). An important property of homology is functoriality. This means that if there is a map f : S → S′ of two simplicial complexes S and S′, then we get an induced map on homology such that the following diagram

(16)

2.1. TOPOLOGICAL DATA ANALYSIS 5

commutes:

\[
\begin{array}{ccc}
S & \xrightarrow{\;f\;} & S' \\
\downarrow & & \downarrow \\
H_k(S, K) & \longrightarrow & H_k(S', K)
\end{array}
\]

where H_k(S, K) denotes the k-th homology group of S computed with coefficients in K. By taking homology of F (with coefficients in a field K) we thus get the functor H_k(F, K) : R → Vect_K with values in the category of K-vector spaces, called the k-th persistence module of U.
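With field coefficients, homology reduces to linear algebra: over K = F_2 the k-th Betti number is dim ker ∂_k − rank ∂_{k+1}. A small illustrative sketch (our own, not code from the thesis):

```python
def rank_f2(rows):
    """Rank of a 0/1 matrix over the field F_2, by Gaussian elimination."""
    rows = [r[:] for r in rows]
    rank, ncols = 0, len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def betti(simplices, k):
    """k-th Betti number dim H_k(S, F_2) = dim ker d_k - rank d_{k+1}."""
    faces = lambda d: sorted(s for s in simplices if len(s) == d + 1)
    def boundary(d):  # matrix of the boundary map C_d -> C_{d-1}
        rows, cols = faces(d - 1), faces(d)
        return [[1 if set(r) < set(c) else 0 for c in cols] for r in rows]
    ck = len(faces(k))
    rk = rank_f2(boundary(k)) if k > 0 and ck else 0
    rk1 = rank_f2(boundary(k + 1)) if faces(k + 1) else 0
    return (ck - rk) - rk1
```

For the hollow triangle (three vertices and three edges) this gives one connected component and one 1-cycle; filling in the 2-simplex kills the cycle.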

It was shown by Carlsson and Zomorodian in [43] that the k-th persistence module has the structure of a graded module over the polynomial ring K[x]. Further, they observed that K[x] is a principal ideal domain (PID), so there is a unique decomposition of the module into irreducible components, called bars. This means that

\[
H_k(F, K) \cong \bigoplus_{t=1}^{n_1} x^{c_t} K[x] \oplus \bigoplus_{t=1}^{n_2} x^{a_t} K[x] / x^{b_t} K[x]
\]

for some a_t, b_t, c_t, n_1, n_2 ∈ N. Since the above decomposition is unique, the values n_1, n_2, a_t, b_t and c_t completely characterize the module. These numbers constitute an invariant of the module called the barcode. This is a complete invariant of the module, which means that it completely determines its isomorphism type. Figure 2.4 illustrates the barcodes of the zeroth, first and second persistence modules of a filtration defined in [31].

In a more general setting we may construct the filtration using several metrics on U. Let R^r denote the category of r-tuples of real numbers, with an arrow from a to b whenever a ≤ b in the poset ordering, which means that a_i ≤ b_i for all 1 ≤ i ≤ r. Suppose now that we construct the filtration using r metrics on U (see [13] for details); we will refer to the number of metrics used as the number of parameters of the filtration. The result is a functor F : R^r → SC.

Figure 2.5 illustrates an example of a simplicial complex filtered by two parameters, i.e. with r = 2. Taking the homology of this functor as above yields the k-th multi-parameter persistence module H_k(F, K) : R^r → Vect_K. It was shown by Carlsson et al. [13] that no complete discrete invariant exists for multi-parameter persistence modules. This is because the multi-parameter persistence module, constructed using r metrics, has the structure of a Z^r-graded module over a polynomial ring in r variables. The moduli space of such modules is a very complicated algebraic variety for which no complete discrete invariant exists. To analyze such modules one must therefore resort to incomplete invariants, which do not completely determine the isomorphism type but can still extract relevant information from the module.

We are now ready to formulate the TDA pipeline by restating Figure 2.1.


Figure 2.4 – Illustration due to [31] of the barcodes of the zeroth, first and second persistence modules.

Figure 2.5 – Illustration due to [38] of a filtration F : R^2 → SC.


Topological data analysis (TDA) pipeline: Data → construct filtered simplicial complex → compute persistence module → invariant.

In the following we explain the last two stages in the pipeline.

Computing the persistence module

Computing the k-th persistence module means computing the k-th homology of all simplicial complexes in the filtration F : R^r → SC together with all of their induced maps. For the case r = 1 this problem has been studied extensively and several software packages are available; see [34] for an overview. The problem essentially comes down to Gaussian elimination, with a series of tricks to avoid performing unnecessary column reductions.

To compute the k-th persistence module of a filtration F : R^r → SC one proceeds as follows. Let C_k(F(ε), K) denote the k-th chain module of the simplicial complex F(ε). The k-th chain modules for each ε in fact also define a functor C_k(F, K) : R^r → Vect_K, where the maps are induced by the inclusion maps of F. Moreover, the k-th boundary map ∂_k(ε) : C_k(F(ε), K) → C_{k−1}(F(ε), K) extends to a natural transformation from C_k(F, K) to C_{k−1}(F, K). Computing the persistence module then comes down to computing ker(∂_k) and im(∂_{k+1}) and taking their quotient.

For the case r = 1 it is possible to perform all of these computations within the same matrix reduction; see [43] for details. The only restriction when performing reductions is that the filtration values of the simplices have to be respected: one cannot add a column to a column whose associated simplex has a smaller filtration value. This is resolved by ordering the columns by the filtration values of their associated simplices in increasing order and then performing the reductions left to right.
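For r = 1 the reduction just described fits in a few lines. The sketch below (our own illustration, over F_2, with each column of the boundary matrix represented as a set of row indices) performs exactly this left-to-right reduction and reads off the birth-death pairs:

```python
def reduce_boundary(columns):
    """Standard persistence reduction over F_2: columns (sets of row
    indices), already ordered by filtration value, are reduced left to
    right by adding earlier columns only."""
    cols = [set(c) for c in columns]
    low_of = {}                        # lowest row index -> column index
    pairs = []
    for j, col in enumerate(cols):
        while col and max(col) in low_of:
            col ^= cols[low_of[max(col)]]   # add an earlier column (mod 2)
        if col:
            low_of[max(col)] = j
            pairs.append((max(col), j))     # simplex max(col) dies when j enters
    return pairs
```

For the filtration v0, v1, v2, e01, e02, e12, t012 (indices 0-6) this returns the pairs (1, 3), (2, 4) and (5, 6); column 0 stays unpaired, giving the infinite H_0 bar.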

For the case r > 1 things are more complicated. In Paper B we detail the connection between computing the k-th persistence module and the theory of Gröbner bases. The left-to-right column reductions in the r = 1 case essentially correspond to a Gröbner basis computation. This connection was observed already by Carlsson et al. in [12], but since then little effort has been made to further connect TDA with the results of Gröbner basis theory.

One may generalize the persistence module computation for r = 1 to the multi-parameter case by taking the perspective of the persistence module as a graded module over a polynomial ring. In [13] it is shown that there is a correspondence between persistence modules and finitely generated Z^r-graded modules over K[x_1, . . . , x_r]. In fact, the filtration also has the structure of such a module, and computing the persistence module can be expressed as computing the homology of a chain complex of free Z^r-graded K[x_1, . . . , x_r]-modules, as shown by Scolamiero et al. in [15]. Computing the k-th persistence module thus means computing the kernel of a map between such modules and then taking the quotient with the image of the map above.

To compute the kernel of a map ϕ : A → B of free Z^r-graded K[x_1, . . . , x_r]-modules one typically has to compute a Gröbner basis. This can in a sense be seen as a generalization of Gaussian elimination, but in the multigraded case it has a doubly exponential worst-case time complexity. In Paper B we formulate an algorithm that computes the kernel and image of ϕ with time complexity O(n^r m + n^(r-1) m^2) and memory complexity O(n^2 + mn + nr + K), where the input is given as a boundary matrix with n columns and m rows, and K is the size of the output. This algorithm is then used to construct an algorithm that computes a presentation of the multi-parameter persistence module.
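For intuition about the Gröbner basis computations underlying this, general-purpose computer algebra systems expose them directly. A small SymPy illustration (our own, showing the notion rather than the algorithm of Paper B): division by a Gröbner basis gives canonical remainders, so ideal membership becomes a reduction-to-zero test.

```python
from sympy import groebner, symbols

x, y = symbols("x y")
# A Groebner basis plays the role for polynomial systems that row
# echelon form plays for linear ones.
G = groebner([x**2 + y**2 - 1, x - y], x, y, order="lex")
assert G.contains(x**2 + y**2 - 1)   # a generator reduces to zero
assert not G.contains(x + 1)         # x + 1 is not in the ideal
```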

Invariants

For r = 1 the barcode is a complete invariant that contains all of the information about the module. It was shown by Cohen-Steiner et al. [19] that this invariant is stable, which means that small variations in the input yield small variations of the invariant. In order to talk about stability one needs to define a metric. The metric most commonly used on barcodes is the bottleneck distance, defined as follows. The barcode of a 1-parameter persistence module F can be embedded into the space $\mathbb{R} \times \bar{\mathbb{R}}$, where $\bar{\mathbb{R}} := \mathbb{R} \cup \{\infty\}$, by letting each bar represent a coordinate in this space. The barcode of F can thus be represented as

\[
B(F) := \{(a_t, b_t) \in \mathbb{R} \times \bar{\mathbb{R}} \mid 1 \le t \le n_2\} \cup \{(c_t, \infty) \in \mathbb{R} \times \bar{\mathbb{R}} \mid 1 \le t \le n_1\}
\]

This representation is called a persistence diagram [28]. Let F and G be 1-parameter persistence modules; the bottleneck distance is then defined as

\[
d_B(F, G) := \inf_{\gamma} \sup_{b \in B(F)} \lVert b - \gamma(b) \rVert_{\infty}
\]

where γ ranges over all bijections from B(F) to B(G) and ‖·‖_∞ denotes the L^∞-norm. Note that this means that F and G have to have the same number of bars; if this is not the case one simply adds a set of “empty” bars.
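For small diagrams the definition can be evaluated by brute force. The sketch below (our own, not from the thesis) implements the padding with "empty" bars by allowing each point to be matched to the diagonal instead, at a cost of half its persistence:

```python
from itertools import permutations

def bottleneck(dgm1, dgm2):
    """Bottleneck distance between two finite persistence diagrams,
    by brute force over bijections of the padded diagrams."""
    diag = lambda p: (p[1] - p[0]) / 2.0      # cost of deleting a point
    d_inf = lambda p, q: max(abs(p[0] - q[0]), abs(p[1] - q[1]))
    pts1, pts2 = list(dgm1), list(dgm2)
    n = len(pts1) + len(pts2)                 # pad both sides with the diagonal
    best = float("inf")
    for perm in permutations(range(n)):
        cost = 0.0
        for i, j in enumerate(perm):
            p = pts1[i] if i < len(pts1) else None   # None = diagonal point
            q = pts2[j] if j < len(pts2) else None
            if p is None and q is None:
                c = 0.0
            elif p is None:
                c = diag(q)
            elif q is None:
                c = diag(p)
            else:
                c = d_inf(p, q)
            cost = max(cost, c)
        best = min(best, cost)
    return best
```

For the diagrams {(0, 4)} and {(0, 5)} the optimal bijection matches the two bars at cost 1, which beats deleting both (cost 2.5).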

In order for invariants in TDA to be useful in practice they need to be stable.

For the case r > 1 the most prominent stable invariant is the rank invariant, proposed by Carlsson and Zomorodian [43]. This invariant can be seen as collecting all one-dimensional barcode information in a big table. Consequently, it captures the persistent information of the module, but extracting multivariate persistent information from the table and visualizing it require more work. Another stable invariant is the feature counting function [14], which extracts persistent information from the module in a readable form. In Paper A we generalize its construction through hierarchical stabilization, a process to stabilize classical invariants of commutative algebra in the setting of multi-parameter persistent homology.


Further, we define an algorithm to compute it and show that it is in general NP-hard to compute. This is an indication that extracting multivariate persistent information from the rank invariant, and perhaps in general, might be NP-hard as well.

2.2 Geometric data analysis

If we have more information about the source of U , we can make use of this when computing homological invariants. In this section we assume that U is sampled from an algebraic variety, whose defining equations are either known or unknown.

If the variety is unknown we need to estimate its defining equations. This is a non-trivial task, as determining the dimension and degree of the variety is a challenging problem. Once the defining equations have been estimated, computing the homology can almost surely not be done symbolically, and thus one has to resort to numerical methods such as sampling. From a sample of the variety one can construct a simplicial complex and compute its homology. In order to guarantee that the “correct” homology can be recovered from the sample, we require it to have a prescribed density. This outlines the process of the GDA pipeline. In the following we introduce the necessary definitions.
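The estimation step can be phrased as numerical linear algebra: polynomials of bounded degree vanishing on U form the (near-)nullspace of the matrix of monomials evaluated at the sample points. A minimal sketch (our own; the circle and degree bound are illustrative choices):

```python
import numpy as np

def vanishing_quadric(points):
    """Evaluate the degree-<=2 monomials [1, x, y, x^2, xy, y^2] at each
    sample point and return the right singular vector belonging to the
    smallest singular value: the best near-vanishing quadric."""
    M = np.array([[1.0, x, y, x * x, x * y, y * y] for x, y in points])
    return np.linalg.svd(M)[2][-1]

theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
c = vanishing_quadric(circle)
c = c / c[0]   # normalize the constant coefficient
# c is now (1, 0, 0, -1, 0, -1), i.e. a multiple of 1 - x^2 - y^2
```

Since any quadric vanishing on the whole circle is a scalar multiple of x^2 + y^2 − 1, the nullspace is one-dimensional and the recovered coefficients are unique up to scale.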

Geometric invariants

Let V ⊆ C^N be a smooth subvariety and let q ∈ R^N. For x ∈ V, let N_xV ⊆ C^N denote the embedded normal space to V at x. Let

\[
d_E : \mathbb{C}^N \times \mathbb{C}^N \to \mathbb{C}, \qquad d_E(u, v) := \sum_{i=1}^{N} (u_i - v_i)^2,
\]

denote the (squared) Euclidean distance function. Further, let d_V(u) denote the minimal distance, with respect to d_E, from u to any point on V. Similarly, we refer to the distance between two varieties V, W as the minimal distance, with respect to d_E, between any two points x ∈ V, y ∈ W. A critical point of d_E with respect to q and V is a point x ∈ V such that q ∈ N_xV. Consider the set of critical points of d_E with respect to q and V:

\[
I(V, q) = \{x \in V : q \in N_x V\}.
\]

We call I(V, q) the normal locus of V with respect to q.

Recall that the degree of a zero-dimensional ideal is equal to the degree of the corresponding zero-scheme, i.e. the number of points defined by the ideal counted with multiplicity. It holds that for generic q ∈ C^N, the normal locus I(V, q) is zero-dimensional [25]. The degree of I(V, q) for generic q ∈ C^N is constant and is called the Euclidean Distance Degree (EDD) of V; see [25] for more details.
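For plane curves the normal locus can be computed symbolically in small cases. A sketch (our own, using SymPy; the unit circle and the point q are illustrative choices): x ∈ V is critical for the distance to q precisely when q − x is parallel to the gradient ∇f(x).

```python
from sympy import Rational, solve, symbols

x, y = symbols("x y")
f = x**2 + y**2 - 1                    # V: the unit circle
qx, qy = Rational(3), Rational(4)      # a generic point q = (3, 4)
# Criticality: the 2x2 determinant of q - (x, y) and grad f vanishes on V.
parallel = (qx - x) * f.diff(y) - (qy - y) * f.diff(x)
crit = solve([f, parallel], [x, y])
# For generic q the number of critical points is the EDD; a circle has
# EDD 2, the two antipodal points +-(3/5, 4/5) here.
```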


Example 2.2. Consider the variety in R^2 defined by the polynomial

\[
f(x, y) = (0.6x^4 + 7y^4 - 1)(x^4 + y^4 + 4x^2y^2 - 0.9) + xy - 0.1
\]

It is worth noting that the use of the algebraically closed field C is important to the theory. Over C the EDD is constant, in this case equal to 64. Figure 2.6 illustrates the real points in the normal locus for the choice q = (−0.4, −0.2). For this choice of q, only 6 of the 64 critical points are real.

Figure 2.6 – Illustration of the real part of the normal locus for a specific choice of V and q.

The normal locus I(V, q) is interesting for sampling since it contains at least one point on every connected component of V .

Next we consider a very important invariant of the embedding V ⊂ R^N of a variety, called the reach of V. For p ∈ R^N, let π_V(p) denote the set of points x ∈ V such that ‖x − p‖ = min_{x′∈V} ‖x′ − p‖. Also, for r ≥ 0, let V_r denote the tubular neighborhood of V of radius r, that is, the set of points x ∈ R^N at distance less than r from V. Then the reach τ_V of V may be defined as

\[
\tau_V := \sup\{r \ge 0 : |\pi_V(p)| = 1 \text{ for all } p \in V_r\}
\]

For a point q ∈ V we have a local notion, called the local reach or the local feature size, which is defined as

\[
\tau_V(q) := \sup\{r \ge 0 : |\pi_V(p)| = 1 \text{ for all } p \in B_q(r)\}
\]

where B_q(r) = {p ∈ R^N : ‖p − q‖ < r}. Then τ_V = inf_{q∈V} τ_V(q). If V is compact then the reach is positive, and if we in addition assume that dim V > 0, then V is not convex and therefore the reach of V is finite.
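The reach can be estimated from a sample when (estimated) normal directions are available. The sketch below uses a standard pairwise bound in the spirit of reach estimators from the literature (our own illustration, not a construction from the thesis): for x, y on V with unit normal n_x at x, the reach is at most ‖y − x‖² / (2 |⟨y − x, n_x⟩|).

```python
import math

def estimate_reach(points, normals):
    """Minimum over sample pairs of the bound
    ||y - x||^2 / (2 |<y - x, n_x>|) on the reach."""
    best = float("inf")
    for i, (x, nx) in enumerate(zip(points, normals)):
        for j, y in enumerate(points):
            if i == j:
                continue
            d2 = sum((b - a) ** 2 for a, b in zip(x, y))
            h = abs(sum((b - a) * n for a, b, n in zip(x, y, nx)))
            if h > 1e-12:
                best = min(best, d2 / (2 * h))
    return best

# Dense sample of a circle of radius 2, with its unit normals; the
# reach of a circle equals its radius.
ts = [2 * math.pi * k / 100 for k in range(100)]
pts = [(2 * math.cos(t), 2 * math.sin(t)) for t in ts]
nrm = [(math.cos(t), math.sin(t)) for t in ts]
```

On the circle the pairwise bound is exact for every pair, so the estimate equals the reach, 2.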


Let Δ_V = {p ∈ R^N : |π_V(p)| > 1} and define the medial axis M_V as the closure of Δ_V. Let U_V = R^N \ M_V and note that the projection π_V : U_V → V is a well-defined map taking a point p to the unique closest point of V. Also, the local reach τ_V(q) is the distance from q ∈ V to the medial axis, and the reach τ_V is the distance from V to the medial axis.

Example 2.3. Consider the variety defined by the polynomial

\[
f(x, y) = x^4 - 2xy - 2(x^2 - y^2) + 5y + 1
\]

Figure 2.7 illustrates its reach and the set Δ_V, whose closure is its medial axis. The radius of the dashed circle is equal to the reach. The solid circles indicate other candidates for the reach.

Figure 2.7 – The reach and medial axis of an algebraic variety.

The reach can also be computed as the minimum of two quantities [1], one of which derives from local considerations in terms of the curvature of V, while the other is a global quantity arising from the bottlenecks of V. More precisely, the global quantity is the radius of the narrowest bottleneck of V, defined as

$$\frac{1}{2}\inf\{\|x - y\| : (x, y) \in V \times V,\; x \neq y,\; y \in N_xV,\; x \in N_yV\}.$$

A bottleneck of V is a pair (x, y) with x ≠ y, x ∈ N_yV and y ∈ N_xV.

The local quantity, which we call ρ, is the minimal radius of curvature on V , where the radius of curvature at a point x ∈ V is the reciprocal of the maximal curvature of a geodesic passing through x.
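For example, for an ellipse with semi-axes a > b, the minimal radius of curvature is b²/a (attained at (±a, 0)) and the narrowest bottleneck joins (0, b) and (0, −b), so the reach is min(b²/a, b) = b²/a. A sketch verifying the local quantity numerically; the bottleneck value for the ellipse is supplied by hand from this geometric reasoning, not computed:

```python
import numpy as np

# Ellipse x^2/a^2 + y^2/b^2 = 1, parametrized as (a cos t, b sin t).
a, b = 3.0, 1.0
t = np.linspace(0.0, 2.0 * np.pi, 100000, endpoint=False)

# Radius of curvature R(t) = (a^2 sin^2 t + b^2 cos^2 t)^(3/2) / (a*b);
# rho is its minimum over the curve, attained at the points (+-a, 0).
R = (a**2 * np.sin(t) ** 2 + b**2 * np.cos(t) ** 2) ** 1.5 / (a * b)
rho = R.min()

# Narrowest bottleneck: the pair (0, b), (0, -b), of radius b (taken
# from the geometry of the ellipse, not computed here).
b2 = b

reach = min(rho, b2)  # for a > b this equals b^2/a
print(rho, reach)
```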

Example 2.4. Consider the variety in R^2 defined by the polynomial $f(x, y) = x^4 + y - 2(x^2 - y^2)$. Figure 2.8 illustrates the narrowest bottleneck of this variety and its tubular neighbourhood (dashed lines).


12 CHAPTER 2. BACKGROUND

Figure 2.8 – The narrowest bottleneck and the tubular neighbourhood of a variety.

Weak feature size

The weak feature size can be seen as a generalization of the bottlenecks. For the convenience of the reader we recall the construction introduced in Paper D. For a set of points x_1, …, x_k ∈ V to define a bottleneck of degree k we first require there to be a point y ∈ C^N such that the Euclidean distance from y to x_i is the same for each 1 ≤ i ≤ k and that y is in the normal space N_{x_i}V for each 1 ≤ i ≤ k.

Moreover, we require y to be critical in some sense. It turns out that a good notion of critical is, as is explained in [17], that y is in the convex hull of x_1, …, x_k. This means that there exist non-negative real numbers a_1, …, a_k such that $\sum_{i=1}^{k} a_i = 1$ and $y = \sum_{i=1}^{k} a_i x_i$.

Let C := {(y, r) ∈ C^N × C : ‖x_i − y‖^2 − r = 0 for 1 ≤ i ≤ k} and denote the closure of the image of the projection C → C^N by ρ(x_1, …, x_k). Then ρ(x_1, …, x_k) is the set of centers of all (N − 1)-spheres passing through the points x_1, …, x_k. Next, let Conv(x_1, …, x_k) denote the convex hull of the points x_1, …, x_k. Define the k-th bottleneck locus to be the following set:

$$B_k := \{(x_1, \dots, x_k) \in (\mathbb{C}^N)^{\times k} \setminus \Delta \mid \bigcap_{i=1}^{k} N_{x_i}V \cap \rho(x_1, \dots, x_k) \cap \mathrm{Conv}(x_1, \dots, x_k) \neq \emptyset\}$$

where Δ := {(x_1, …, x_k) : x_i = x_j for some i ≠ j} ⊂ (C^N)^{×k}.

We refer to the elements of B_2 as the bottlenecks of V and to the elements of B_k for k > 2 as the generalized bottlenecks of V. In the case k = 2 the conditions in the above definition simplify to algebraic conditions, which means that B_2 is an algebraic variety in C^N × C^N. For k > 2 the set B_k may not be algebraic, but if we restrict B_k to only allow real tuples (x_1, …, x_k) ∈ (R^N)^{×k} it is a semi-algebraic set.

By restricting the sets B_k to only allow tuples in R^N we can define a notion of narrowest bottleneck. Let B_k^R denote this restriction and let ℓ_k : B_k^R → R be the function that takes (x_1, …, x_k) to the radius of the smallest (N − 1)-sphere in ρ(x_1, …, x_k). Define b_k to be the minimal value (or infimum, in case B_k^R is not finite) of the image of ℓ_k, called the radius of the narrowest bottleneck of degree k.
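A brute-force sketch of the k = 2 case: for a union of two unit circles, we search sampled pairs (x, y) for which the segment from x to y is (approximately) normal to the curve at both endpoints, and take half the smallest such distance. The sampling density and tolerances are ad hoc choices for illustration:

```python
import numpy as np

# V = union of two unit circles centered at (-2, 0) and (2, 0).
theta = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
circ = np.stack([np.cos(theta), np.sin(theta)], axis=1)
pts = np.concatenate([circ + [-2.0, 0.0], circ + [2.0, 0.0]])
normals = np.concatenate([circ, circ])  # outward unit normal at each point

# A pair (x, y), x != y, is a (discretized) bottleneck when the unit
# vector u along the segment x-y is parallel to the normals at both x
# and y; in R^2, parallel means the scalar cross product vanishes.
best = np.inf
for i in range(len(pts)):
    diff = pts - pts[i]
    d = np.linalg.norm(diff, axis=1)
    ok = d > 1e-8
    u = diff[ok] / d[ok, None]
    n_ok = normals[ok]
    par_i = np.abs(u[:, 0] * normals[i, 1] - u[:, 1] * normals[i, 0]) < 1e-3
    par_j = np.abs(u[:, 0] * n_ok[:, 1] - u[:, 1] * n_ok[:, 0]) < 1e-3
    hit = d[ok][par_i & par_j]
    if hit.size:
        best = min(best, hit.min())

b2_radius = best / 2.0  # narrowest bottleneck radius
print(b2_radius)  # ≈ 1.0: antipodal pairs and the pair (-1,0), (1,0)
```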


We now define the weak feature size as the following quantity:

$$\mathrm{wfs}(V) := \min_{2 \leq k \leq \mathrm{EDD}(V)} b_k$$

Computing the generalized bottlenecks of V is in general much harder than computing b_2. It is therefore interesting to understand in which cases the equality wfs(V) = b_2 holds.

Example 2.5. Let $f_1(x, y) = 3x^2y - y^3 + 2(x^2 + y^2)(x^2 + y^2 - 1)$ and $f_2(x, y) = 10x^2y^3 - 5x^4y - y^5 + 5(x^2 + y^2)^2(x^2 + y^2 - 1)$. The curves {f_1 = 0} and {f_2 = 0} are illustrated in Figure 2.9 and are two examples of varieties in R^2 for which the weak feature size, wfs(V), is smaller than the narrowest bottleneck b_2.

Figure 2.9 – Two examples where the weak feature size is smaller than the narrowest bottleneck of a variety: (a) the wfs and b_2 of the curve {f_1 = 0}; (b) the wfs and b_2 of the curve {f_2 = 0}.

The varieties in Figure 2.9 illustrate two examples in R^2 in which the equality wfs(V) = b_2 fails. Both of these examples are oscillations of a circle in the sense that they can be described in polar coordinates as r = 1 + f(cos(nθ)) for some function f : R → R and n ∈ Z. In Paper D we conjecture that wfs(V) = b_2 holds in general.

We are now ready to formulate the GDA pipeline by restating Figure 2.2:

Geometric data analysis (GDA) pipeline:

Data → Estimate the equations of an algebraic variety V → Compute a sample E ⊂ V with density δ → Compute the homology of E

In the following we explain the two middle steps of the pipeline.


Estimating equations

Estimating the defining equations of the algebraic variety from which U was sampled is a challenging and largely unsolved problem. While statistical regression is a vast research area with a long history, estimating the parameters of algebraic varieties has received comparatively little attention.

The Veronese map of degree d is the map v_d : P^N → Sym^d P^N sending x ∈ P^N to its d-th symmetric power. The affine Veronese map of degree d is the dehomogenization of v_d in the sense that it sends x ∈ R^N to the vector of all monomials of degree ≤ d, which is a vector in $\mathbb{R}^{\binom{N+d}{d}}$. Let x_1, …, x_m be a labelling of the points in U. The Vandermonde matrix of degree d of U is defined as

$$U_{\leq d}(U) := \begin{bmatrix} v_d(x_1) \\ \vdots \\ v_d(x_m) \end{bmatrix}$$

where $v_d \colon \mathbb{R}^N \to \mathbb{R}^{\binom{N+d}{d}}$ is the affine Veronese map of degree d.

Suppose that U is sampled from a variety V ⊆ RN without noise. It is then straightforward to proceed as in [8] to numerically estimate the coefficients of a set of polynomials defining V . The main tool for doing this is the Vandermonde matrix.

Example 2.6. For U = {x_1, x_2} and N = d = 2 we have v_2 : R^2 → R^6,

$$v_2(x, y) = [x^2, xy, y^2, x, y, 1]$$

and, writing x_i = (x_{i1}, x_{i2}),

$$U_{\leq 2}(U) = \begin{bmatrix} v_2(x_1) \\ v_2(x_2) \end{bmatrix} = \begin{bmatrix} x_{11}^2 & x_{11}x_{12} & x_{12}^2 & x_{11} & x_{12} & 1 \\ x_{21}^2 & x_{21}x_{22} & x_{22}^2 & x_{21} & x_{22} & 1 \end{bmatrix}.$$

Note that the coefficients of any polynomial of degree at most 2 that vanishes on both x_1 and x_2 lie in the kernel of U_{≤2}(U).

In fact, it is true in general that the coefficients of any polynomial f of degree ≤ d that vanishes on U lie in ker U_{≤d}(U). A generating set for an ideal cutting out V can thus be obtained by choosing a basis for ker U_{≤d}(U). The problem one now faces is dealing with the numerical instability associated with computing the kernel of the Vandermonde matrix, as explained in [8].
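A small sketch of this kernel computation, assuming noise-free points sampled from the circle x² + y² = 1; the smallest right singular vector of the degree-2 Vandermonde matrix recovers the coefficients of x² + y² − 1 (the helper `veronese2` and the sign normalization are ours):

```python
import numpy as np

# Noise-free sample from the circle x^2 + y^2 - 1 = 0 in R^2.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, size=20)
X = np.stack([np.cos(t), np.sin(t)], axis=1)

# Affine Veronese map of degree 2 and the Vandermonde matrix with one
# row v_2(x_i) per sample point.
def veronese2(p):
    x, y = p
    return np.array([x * x, x * y, y * y, x, y, 1.0])

U2 = np.array([veronese2(p) for p in X])

# The kernel of U2 is spanned by the right singular vectors belonging
# to (numerically) zero singular values; here the kernel is 1-dimensional.
_, s, Vt = np.linalg.svd(U2)
coeffs = Vt[-1] / Vt[-1][0]  # normalize so the x^2 coefficient is 1
print(np.round(coeffs, 6))  # ≈ [1, 0, 1, 0, 0, -1], i.e. x^2 + y^2 - 1
```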

In practice one usually expects U to be noisy, due to e.g. measurement or system errors. To deal with this, one strategy is to add an unknown perturbation p_i ∈ R^N to each x_i for 1 ≤ i ≤ m. The problem is then to find the smallest perturbation of U that yields a non-trivial kernel of the Vandermonde matrix. This is essentially the strategy of Total Least Squares (TLS) and its non-linear generalizations.

For general varieties the TLS problem is highly non-linear and a very difficult optimization problem, corresponding to a low-rank approximation problem. In Paper C we tackle this by considering parametrized classes of varieties. This strategy


avoids the use of Vandermonde matrices and thus short-cuts their numerical instability. Consider a class of varieties {V_β} in R^N parametrized by β ∈ R^k for some k ∈ N. Finding the V_β that U was most likely sampled from translates into the following optimization problem:

$$\min_{p_1, \dots, p_m \in \mathbb{R}^N} \sum_{i=1}^{m} d_E(x_i, p_i) \quad \text{s.t.}\ \exists\, \beta \in \mathbb{R}^k \text{ such that } p_i \in V_\beta \text{ for all } 1 \leq i \leq m$$

In Paper C we define a complexity measure, called the algebraic complexity, which measures the amount of work that needs to be performed in order to find the global optimum of the above optimization problem.

Example 2.7 (Eckart–Young Theorem). Suppose we want to approximate U with a hyperplane in R^N. The objective function that we optimize is the sum of the squared Euclidean distances from each point in the configuration to its closest point on the hyperplane. Note that this is the (squared) Frobenius norm of U − Y, where U and Y are represented as matrices and Y is a set of m points lying on the hyperplane. It is a consequence of the Eckart–Young Theorem that the critical hyperplanes can be computed analytically using the singular value decomposition (SVD). The critical hyperplanes are computed by first centering the configuration around the origin and then computing the SVD of U represented as a matrix.

Suppose m ≥ N and that

$$U = W_1\, \mathrm{diag}(\sigma_1, \dots, \sigma_N)\, W_2$$

is the SVD of U when represented as an (m × N)-matrix. Then the critical hyperplanes are given by kernel elements of the matrices

$$U_i = W_1\, \mathrm{diag}(\sigma_1, \dots, \sigma_{i-1}, 0, \sigma_{i+1}, \dots, \sigma_N)\, W_2$$

where the i-th singular value has been set to zero. Each U_i : R^N → R^m has a one-dimensional kernel and thus the number of critical hyperplanes equals N. By [25, Example 2.3] these are all the solutions of the corresponding critical ideal of the problem. This means that the EDD of this polynomial optimization problem equals N if m ≥ N. The kernel elements of each U_i are the principal components of U in the sense of principal component analysis (PCA) [40].
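A sketch of the computation in Example 2.7, with N = 2 and synthetic near-linear data (the data-generating line and noise level are our choices): the rows of W_2, i.e. the right singular vectors, are the normals of the N critical hyperplanes, and zeroing the smallest singular value yields the best-fit one.

```python
import numpy as np

# Synthetic configuration of m points near the line y = 0.5 x in R^2.
rng = np.random.default_rng(1)
m, N = 50, 2
x = rng.normal(size=m)
U = np.stack([x, 0.5 * x + 0.01 * rng.normal(size=m)], axis=1)
U = U - U.mean(axis=0)  # center the configuration around the origin

# SVD of the (m x N) matrix: U = W1 diag(sigma) W2.  The kernel element
# of U_i (the i-th singular value set to zero) is the i-th row of W2,
# so the critical hyperplanes are the N principal directions.
W1, sigma, W2 = np.linalg.svd(U, full_matrices=False)
normals = W2  # row i: normal vector of the i-th critical hyperplane

# Zeroing the smallest singular value gives the best-fit hyperplane;
# its normal is (up to sign) orthogonal to the direction (1, 0.5).
best_normal = normals[-1]
print(normals.shape, best_normal)
```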

Sampling algebraic varieties

In practice one almost always has to resort to numerical methods to compute invariants of algebraic varieties, and these rely heavily on sampling. Efficiency of the sampling algorithm becomes central, as many techniques tend to grow exponentially in complexity with the ambient dimension. Figure 2.10 illustrates samplings of varieties using the sampling strategy of Paper D.


Figure 2.10 – Illustrations of sampled complete intersection surfaces and a curve: (a) a surface in R^5; (b) a surface in R^6; (c) a curve in R^6; (d) a surface in R^7.

Existing algorithms for sampling algebraic varieties can be divided into two categories: probabilistic methods [9, 24, 18] and provably dense sampling [26]. In the probabilistic methods one defines a probability distribution on the variety to sample from. In [9] the authors define a probability distribution which yields a uniform sample of the variety. In this approach, however, it is difficult to guarantee the density of the sample, which is sometimes desirable. In [26] the authors introduce a provably dense sampling strategy for algebraic varieties. By a provably dense sample we mean an ε-sample in the sense of the following definition:

Definition. A sample of a variety V ⊂ R^N is a finite subset E ⊂ V. For ε > 0 a sample E ⊂ V is called an ε-sample if for every x ∈ V there is an element e ∈ E such that ‖x − e‖ < ε.

The approach of [26] provides a guarantee that the resulting sample is an ε-sample for some prescribed ε > 0. In Paper D we define an alternative strategy to that of [26] to compute a provably dense sample, which proves to be substantially faster in numerical experiments. Further, we provide bounds on ε that allow us to guarantee that the homology of the variety can be recovered.
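The ε-sample property is easy to check numerically against a dense reference discretization of the variety; here a sketch for a uniform angular sample of the unit circle, where the reference set is our stand-in for "every x ∈ V":

```python
import numpy as np

# An epsilon-sample of the unit circle: uniform angular spacing small
# enough that every curve point is within eps of a sample point.
eps = 0.1
phi = np.linspace(0.0, 2.0 * np.pi, 80, endpoint=False)
E = np.stack([np.cos(phi), np.sin(phi)], axis=1)

# Dense reference discretization standing in for "every x in V".
ref = np.linspace(0.0, 2.0 * np.pi, 5000, endpoint=False)
Vref = np.stack([np.cos(ref), np.sin(ref)], axis=1)

# Distance from each reference point to its nearest sample point.
dists = np.linalg.norm(Vref[:, None, :] - E[None, :, :], axis=2).min(axis=1)
is_eps_sample = bool(dists.max() < eps)
print(is_eps_sample)  # True: worst-case distance ≈ 2 sin(pi/160) < 0.1
```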


3 Summary of Results

Summary of Paper A

Stable invariants for multiparameter persistence.

(joint with W. Chachólski) Pre-print: arXiv:1703.03632.

In this paper we explain how to convert discrete invariants into stable ones via what we call hierarchical stabilization. We illustrate this process by constructing stable invariants for multi-parameter persistence modules with respect to the interleaving distance and so-called simple noise systems. For one parameter, we recover the standard barcode information. For more than one parameter we prove that the constructed invariants are in general NP-hard to calculate. A consequence is that computing the feature counting function, proposed by Scolamiero et al. [14], is NP-hard.

In order for invariants in TDA to be useful in practice they need to be stable. It was shown by Cohen-Steiner et al. [19] that the barcode is a stable invariant with respect to the bottleneck distance. For the case r > 1 the most prominent stable invariant is the rank invariant, proposed by Carlsson and Zomorodian [43]. This invariant can be seen as collecting all one-dimensional barcode information in a big table. Consequently, it captures the persistent information of the module, but extracting multivariate persistent information from the table and visualizing it require more work. Another stable invariant is the feature counting function [14], which extracts readable persistent information from the module. In this paper we generalize the construction of the feature counting function through the process of hierarchical stabilization, a process that stabilizes classical invariants from commutative algebra in the setting of multi-parameter persistent homology. Such invariants include e.g. the Betti numbers, which are otherwise not stable in this setting.

Further we define an algorithm to compute the stabilized 0-th Betti number of a multi-parameter persistence module. The main result of Paper A is the following theorem:

Theorem. Computing the stabilized 0-th Betti number of a multi-parameter per- sistence module is in general an NP-hard problem.


A consequence of this theorem is that computing the feature counting function is also in general NP-hard. This indicates that extracting multivariate persistent information, from the rank invariant and perhaps in general, may be NP-hard as well.

Contributions to the paper

This paper was developed through regular discussions and my contributions can be found in all aspects of the work.

Summary of Paper B

Efficient computation of multiparameter persistence.

(joint with M. R. Bender and M. Lesnick) Pre-print.

Motivated by applications to topological data analysis (TDA), we introduce an efficient algorithm to compute a minimal presentation of a multi-parameter persistent homology module, given a chain complex of free modules as input. Our approach extends previous work on this problem in the 2-parameter case, and draws on ideas underlying the F4 and F5 algorithms for Gröbner basis computation. We implement the algorithm in our new software Muphasa, publicly available at [5]. In preliminary computational experiments on synthetic TDA examples, we compare our approach to a version of a classical approach based on Schreyer's algorithm, and find that ours is substantially faster and more memory-efficient. In the course of developing our algorithm for computing presentations, we also introduce algorithms for the closely related problems of computing Gröbner bases for the image and kernel of a morphism f of free modules. Given free Z^r-graded modules A and B of respective ranks n and m and a morphism f : A → B, our algorithm for Gröbner basis computation, GBS, has the following time and memory complexity:

Theorem. Using the monomial orderings specified in Paper B, the GBS algorithm performs at most O(n^r m + n^{r−1} m^2) arithmetic operations and consumes O(n^2 + mn + nr + K) memory, where K is the size of the output, in order to compute Gröbner bases for the kernel and image of f.

Further we use the ideas of the GBS algorithm to define an algorithm to compute a presentation of the persistence module, called PresentationPair. We show the following time complexity for this algorithm:

Theorem. Using the monomial orderings specified in Paper B, the PresentationPair algorithm performs at most O(r^2 n^{r+1} + n^r m + n^{r−1} m^2 + r n^2 l) arithmetic operations to compute a presentation of the persistence module, where the input boundary matrices have dimensions n × l and m × n respectively.


The presentation computed by PresentationPair is in general not minimal and therefore requires a minimization computation on its output to produce a minimal presentation. In computational experiments this algorithm is substantially faster and consumes less memory than an alternative algorithm based on Schreyer's algorithm. In Figure 3.1 we illustrate the performance of the algorithms on a time-varying point cloud in R^3 consisting of p points and t time steps (see Paper B for details).

Figure 3.1 – Computational results for computing a minimal presentation of the first persistence module of a time-varying point cloud: (a) timing results [s] and (b) memory footprints [MB] of GBS+Schreyer and PresentationPair for p = 10 and different values of t.

Contributions to the paper

The results and proofs of this paper grew out of joint discussions. My contributions lie in all aspects of the work.

Summary of Paper C

Computational complexity of learning algebraic varieties.

Advances in Applied Mathematics (2020).

In this paper we analyze the complexity of fitting a variety, coming from a class of varieties, to a configuration of points in R^N. The complexity measure, called the algebraic complexity, computes the Euclidean Distance Degree (EDD) of a certain variety called the hypothesis variety as the number of points in the configuration increases. Finally, we establish a connection to the complexity of architectures of polynomial neural networks.

The main result of Paper C regards the algebraic complexity of the class of (N − 1)-spheres. Let {Vβ} be the class of (N − 1)-spheres and let H(m, N ) be the


number of critical points of the associated optimization problem (see Paper C) for |U| = m and ambient dimension N. For the case N = 1 we prove the following theorem:

Theorem. We have

$$\mathrm{EDD}(H_{m,1}) = 2^{m-1} - 1,$$

and if U is a real configuration, then the critical points of H_{m,1} with respect to U are all real.
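One way to make the count 2^{m−1} − 1 plausible: a 0-sphere in R^1 is a pair of points {c − r, c + r}, and, as an illustrative assumption on our part rather than a computation from Paper C, each critical sphere corresponds to an unordered partition of the m data points into two nonempty groups, with c − r and c + r the two group means. Counting such partitions reproduces the formula:

```python
from itertools import combinations

# Count unordered partitions of m labelled points into two nonempty
# groups: fix point 0 in the first group so each pair {A, B} is counted
# once; group sizes 1..m-1 keep the complement nonempty.
def count_partitions(m):
    return sum(
        1 for size in range(1, m) for _ in combinations(range(1, m), size - 1)
    )

counts = [count_partitions(m) for m in [2, 3, 4, 5]]
print(counts)  # [1, 3, 7, 15]
print([2 ** (m - 1) - 1 for m in [2, 3, 4, 5]])  # matches the EDD formula
```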

For the general case, N > 1, we conjecture the following formula, based on numerical experiments:

Conjecture. The number of critical (N − 1)-spheres of a generic configuration of m points in R^N, where m > N + 2, is given by:

$$\mathrm{EDD}(H_{m,N}) = \mathrm{EDD}(H_{m-1,N}) + \sum_{k=0}^{m-1} \binom{m-1}{k} \sum_{l=0}^{N} \binom{k+1}{l} - 2^{m-1} - m\,2^{m-2}$$

While the number of critical points tends to be large, it is orders of magnitude smaller than the total degree of the polynomial system, or even its mixed volume. Consequently, one could achieve large speedups by solving the system once for generic parameters and then using parameter homotopies for specific problem instances [42].

Summary of Paper D

Sampling and homology via bottlenecks.

(joint with S. Di Rocco and D. Eklund) Pre-print: arXiv:2010.07976.

In this paper we present an efficient algorithm to produce a provably dense sample of a smooth compact variety. The procedure is partly based on computing bottlenecks of the variety. Using geometric information such as the bottlenecks and the local reach, we also provide bounds on the density of the sample needed in order to guarantee that the homology of the variety can be recovered from the sample. An implementation of the algorithm is provided together with numerical experiments and a computational comparison to the algorithm by Dufresne et al. [26].

Existing algorithms for sampling algebraic varieties can be divided into two categories: probabilistic methods [9, 24, 18] and provably dense sampling [26]. In Paper D we develop a sampling technique that constructs a provably dense sample of a variety. By a provably dense sample we mean an ε-sample in the sense of the following definition:


Definition. A sample of a variety V ⊂ R^N is a finite subset E ⊂ V. For ε > 0 a sample E ⊂ V is called an ε-sample if for every x ∈ V there is an element e ∈ E such that ‖x − e‖ < ε.

To produce an ε-sample from V we start from its defining equations. The basic idea is then to intersect the variety with a grid of linear spaces of complementary dimension. Figure 3.2a shows a curve in R^2 intersected with a 2-dimensional grid.

The grid-size, denoted by δ, is determined by the narrowest bottleneck of the variety.
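A minimal sketch of the slicing idea for a plane curve: intersecting f = 0 with vertical lines x = x₀ reduces sampling to univariate root-finding along each slice. The actual algorithm of Paper D also slices in complementary directions and adds extra points to guarantee density; none of that is reproduced here.

```python
import numpy as np

# Sample the plane curve f(x, y) = x^2 + y^2 - 1 = 0 by slicing with
# vertical lines x = x0; on each slice the curve equation becomes the
# univariate polynomial y^2 + (x0^2 - 1) = 0, solved by np.roots.
delta = 0.2
grid = np.linspace(-1.0, 1.0, 11)  # slices with spacing delta
sample = []
for x0 in grid:
    roots = np.roots([1.0, 0.0, x0 * x0 - 1.0])
    for y in roots:
        if abs(y.imag) < 1e-9:  # keep real intersection points only
            sample.append((x0, y.real))
sample = np.array(sample)

# Sanity check: every sample point lies on the curve.
residuals = np.abs(sample[:, 0] ** 2 + sample[:, 1] ** 2 - 1.0)
print(len(sample), residuals.max())  # 22 points, residuals ≈ 0
```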

Estimates for the number of bottlenecks and numerical algorithms to compute them have recently been studied, see [29, 23]. We show the following theorem, saying that we may expect the number of bottlenecks to be finite in general.

Theorem. A generic complete intersection has only finitely many bottlenecks.

Figure 3.2 – Illustrations of samples of (a) a planar curve and (b) a quartic surface in R^4.

The density guarantee is achieved by introducing extra sample points given by ad hoc slicing and nearest point computations. Figure 3.2b shows an ε-sample of a surface in R^4 with ε = 0.1. Let S_δ denote the result of the sampling process. We show the following:

Theorem. Let b_2 be the radius of the narrowest bottleneck of V and let ε > 0. If $0 < \delta\sqrt{N} < \min\{\epsilon, 2b_2\}$, then S_δ is an ε-sample of V.

Given an ε-sample of V it is straightforward to compute the homology groups of V, assuming that ε is small enough. In [10] it is shown that the homology groups of V can be recovered from the sample if ε < τ_V. In Paper D we show that the density of the sample needed to compute the zeroth and first homology groups of V is controlled by the radii of the generalized bottlenecks. Recall that the radius of the narrowest generalized bottleneck is equal to the weak feature size. One of the main contributions of Paper D is the following result:

Theorem. Let ε < wfs(V) and let i ∈ {0, 1}. Let E be an ε-dense sample of V and let S be the modified Vietoris–Rips complex constructed from E as in Paper D. Then

$$H_i(S) \cong H_i(V).$$
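For i = 0 the content of such a statement can be illustrated without any persistent-homology machinery: connected components of the ε-neighborhood graph of a dense sample recover H_0. A sketch for a union of two circles (the graph is a stand-in for the modified Vietoris–Rips complex of Paper D, which we do not reproduce):

```python
import numpy as np

# Epsilon-sample of V = two disjoint unit circles in R^2.
theta = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
circ = np.stack([np.cos(theta), np.sin(theta)], axis=1)
E = np.concatenate([circ + [-2.0, 0.0], circ + [2.0, 0.0]])
eps = 0.3  # above the sample spacing, below the gap between the circles

# Union-find over sample points joined when within distance eps.
parent = list(range(len(E)))
def find(i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

for i in range(len(E)):
    for j in range(i + 1, len(E)):
        if np.linalg.norm(E[i] - E[j]) < eps:
            parent[find(i)] = find(j)

betti0 = len({find(i) for i in range(len(E))})
print(betti0)  # 2 connected components, matching H_0(V)
```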


Computing the bottlenecks is in general much easier than computing the reach. Therefore, in cases where the radius of the narrowest bottleneck is equal to the wfs, we obtain sparser samples and considerable computational improvements for varieties with high curvature.

For the higher homology groups we show in Paper D the following result, saying that we can bound the reach from below using estimates of the local reach τ_V(p) at sample points p ∈ V. The homology of V can then be computed from the Čech complex Č(E, ε) given by the nerve of the union of the closed ε-balls B̄_e(ε) for e ∈ E.

Theorem. If E ⊂ V, ε > 0, E is an (ε/2)-sample and $\epsilon < \frac{4}{5}\tau_V(e)$ for all e ∈ E, then V is a deformation retract of Č(E, ε). In particular, V and Č(E, ε) have the same homology groups.

As estimating the reach of a manifold is a computationally very challenging problem, our approach provides an important simplification. The estimate is based on a finite process, i.e. computing the local reach at finitely many sample points.

Contributions to the paper

This paper was developed through regular discussions and my contributions can be found in all aspects of the work.
