
ATOMIC NORM ALGORITHMS FOR BLIND SPECTRAL SUPER-RESOLUTION PROBLEMS

by

Jonathan W. Helland


© Copyright by Jonathan W. Helland, 2019. All Rights Reserved.


A thesis submitted to the Faculty and the Board of Trustees of the Colorado School of Mines in partial fulfillment of the requirements for the degree of Master of Science (Electrical Engineering).

Golden, Colorado
Date: ____________
Signed: Jonathan W. Helland

Signed: Dr. Michael B. Wakin, Thesis Advisor

Golden, Colorado
Date: ____________
Signed: Dr. Peter Aaen, Department Head, Department of Electrical Engineering


ABSTRACT

This thesis focuses on the development of atomic norm based algorithms for particular instances of blind super-resolution problems appearing in wireless communications, modal analysis in sensor networks, and target localization in radar signal processing. Blind super-resolution problems are a harder version of canonical super-resolution problems in that they include more degrees of freedom that must be resolved due to additional sources of blurring via unknown linear transformations. Our atomic norm algorithms focus on leveraging special sparsity structure with respect to certain dictionaries inherent in the signals of interest.

We first provide a relatively self-contained introduction to and review of the atomic norm’s use in non-blind and blind super-resolution problems.

We then develop theoretical tools towards establishing guaranteed blind super-resolution for a problem which we refer to as multi-band line spectral estimation – an extension of line spectral estimation to signals whose composite waveforms occupy continuous frequency bands rather than discrete spikes.

We introduce a generic signal model that can be used to model multi-sensor blind super-resolution problems, which are of particular interest in radar signal processing and sensor networks. We establish some theoretical results about computationally tractable techniques for computing the atomic norms that arise from our proposed multi-sensor signal models.

We next apply our multi-sensor atomic norm algorithm to the problem of modal analysis from vibrational measurements – a blind super-resolution problem arising in structural health monitoring and acoustics.

We finally apply our multi-sensor atomic norm algorithm to the problem of extended target localization in stepped-frequency radar signal processing. This problem is relevant to near-field radar imaging including through-the-wall radar imaging.


TABLE OF CONTENTS

ABSTRACT . . . iii
LIST OF FIGURES . . . ix
LIST OF SYMBOLS . . . xi
ACKNOWLEDGMENTS . . . xii
CHAPTER 1 INTRODUCTION . . . 1
1.0.1 Contributions . . . 4
1.0.2 Overview . . . 5
1.1 Mathematical Context . . . 5

1.1.1 Linear Inverse Problems . . . 6

1.1.2 Sparsity Inducing Penalty Functions . . . 7

1.1.3 Duality & Dual Certification of Basis Pursuit (BP) . . . 9

1.1.4 Line Spectral Estimation . . . 11

1.1.5 The Basis Mismatch Problem . . . 12

1.1.6 Sparse Recovery by Total Variation Minimization . . . 12

1.1.7 Dual Certificate for Total Variation Minimization . . . 14

1.2 The Atomic Norm . . . 15

1.2.1 The Atomic Norm Generalizes Other Norms . . . 18

1.2.2 An Atomic Norm for Line Spectral Estimation . . . 19

1.2.3 Dual Certificate for Line Spectral Atomic Norm Minimization . . . 20


1.2.5 An Equivalent SDP to Atomic Norm Minimization . . . 25

1.2.6 Atomic Soft Thresholding (AST) . . . 26

1.3 Blind Super-Resolution Problems . . . 26

1.3.1 The Atomic Norm for Blind Super-Resolution Problems . . . 29

1.3.2 Atomic Norm Minimization for Blind Super-Resolution . . . 30

1.3.3 Dual Polynomials for Blind Super-Resolution . . . 31

1.3.4 Modulating Waveform Recovery . . . 33

1.3.5 Blind Atomic Soft Thresholding . . . 34

CHAPTER 2 TOWARDS GUARANTEED MULTI-BAND LINE SPECTRAL ESTIMATION . . . 37

2.1 Chapter Overview . . . 37

2.2 A Multi-Band Line Spectral Signal Model . . . 38

2.2.1 Discrete Prolate Spheroidal Sequences . . . 40

2.2.2 The Atomic Dictionary & Norm . . . 45

2.3 Computational Approach . . . 46

2.3.1 An Atomic Norm Minimization Problem . . . 47

2.3.2 The Dual Polynomial . . . 47

2.4 Conjectured Existence of the Dual Certificate: A Tale of Two Dual Polynomials . . . 50

2.4.1 Two Dual Polynomial Constructions . . . 50

2.4.2 Roadmap of Our Strategy . . . 54

2.5 The Simplified Single-Band Case . . . 56

2.6 The Multi-Band Case . . . 58


2.6.2 Conjectured Closeness of the Two Dual Polynomials . . . 62

2.6.3 The First Upper Bound . . . 62

2.6.4 The Second Upper Bound . . . 64

2.6.5 The Approximate Orthogonality of Modulated DPSS Spectral Derivatives . . . 68

2.6.6 Simulations for the Pseudoinverse . . . 76

CHAPTER 3 THE ATOMIC NORM FOR MULTI-SENSOR BLIND SUPER-RESOLUTION PROBLEMS . . . 79

3.1 Chapter Overview . . . 79

3.2 Multi-Sensor Blind Line Spectral Signal Models . . . 80

3.2.1 Two Tensor Atomic Dictionaries and Norms . . . 82

3.2.2 Atomic Norm Minimizations & Dual Polynomials . . . 84

3.2.3 Semi-Definite Relaxation for the General Atomic Norm . . . 87

3.2.4 Semi-Definite Equivalence via the Relaxed Atomic Set . . . 90

3.3 More Efficient Joint Sparsity Recovery . . . 95

3.4 Connections to Tensor Decomposition . . . 96

3.5 Accounting for Measurement Noise . . . 98

3.5.1 Multi-Sensor Atomic Soft Thresholding . . . 98

3.5.2 The AST Dual Polynomial . . . 100

CHAPTER 4 DAMPED MODAL ANALYSIS . . . 101

4.1 The Multi-Sensor Signal Model . . . 103

4.1.1 The Tensor Atomic Dictionary & Norm . . . 104

4.2 Computational Methods . . . 106


4.2.2 Atomic Soft Thresholding . . . 107

4.2.3 Estimating the Mode Shapes . . . 108

4.3 Numerical Experiments . . . 110

4.3.1 Subspace Construction . . . 110

4.3.2 Single-Sensor Experiments . . . 115

4.3.3 Multi-Sensor Experiments . . . 117

CHAPTER 5 EXTENDED TARGET LOCALIZATION IN STEPPED FREQUENCY RADAR . . . 121

5.1 Problem Setup . . . 122

5.2 Our Approach . . . 124

5.2.1 Discrete Prolate Spheroidal Sequences (DPSS’s) . . . 124

5.3 The Atomic Dictionary and Norm . . . 125

5.4 Atomic Norm Minimization . . . 126

5.4.1 Atomic Soft Thresholding (AST) . . . 127

5.5 Target Localization via the Dual Problem . . . 127

5.6 Computation via Semi-Definite Relaxation . . . 128

5.6.1 Separated AST . . . 129

5.7 Simulations . . . 130

5.7.1 Ideal Target Returns . . . 131

5.7.2 Line Target Returns . . . 132

5.8 Summary . . . 136

CHAPTER 6 CONCLUSION . . . 140


6.2 Future Work . . . 141
REFERENCES CITED . . . 143


LIST OF FIGURES

Figure 1.1 $\ell_0$-norm and its convex relaxation on $\mathbb{R}^2$. Observe that (b) is the convex hull of the black points in (a). . . . 8
Figure 1.2 An example of a dual polynomial for the noiseless line spectral estimation problem. . . . 22
Figure 1.3 An example atomic norm solution to the blind super-resolution line spectral problem (1.33). . . . 35
Figure 2.1 Spectral concentration of the leading DPSS vectors for W = 1/32 and N = 128. . . . 44
Figure 2.2 Results of solving the SDP (2.26) for synthetic data with $g_k$ generated directly from a DPSS subspace. . . . 49
Figure 2.3 This experiment demonstrates that although $\|q_2(f)\|_2$ is concave at the band-centers $f_k$, $\|q_1(f)\|_2$ may not actually have its peaks at the $f_k$. . . . 55
Figure 2.4 Numerical simulations to verify that the bound in Lemma 2.6.4 holds and is asymptotically small with N. Here, J indicates the number of fixed band-center locations $f_j$ as N increases, which were generated randomly for each value of J. $p^\star_1$ and $p^\star_2$ are the least-squares solutions to (2.32) and (2.36) respectively. . . . 77
Figure 4.1 Here we compute the cumulative singular value energy captured by the PCA subspace B as L increases. We do this computation for a wide range [0, 1] (a) and narrow range [0.05, 0.11] (b) of damping ratios. . . . 111
Figure 4.2 The relative projection errors according to (4.14) for a subspace B constructed from $D_{\mathrm{wide}}$ and $D_{\mathrm{narrow}}$ for N = 128 and N = 2048. . . . 112
Figure 4.3 Dictionary elements, i.e. columns of $D \in \mathbb{R}^{2048 \times 500}$, with damping ratios taken from the interval [0, 0.01]. . . . 113
Figure 4.4 Dictionary elements, i.e. columns of $D_2 \in \mathbb{R}^{2048 \times 500}$, with α taken …
Figure 4.5 Relative subspace projection errors according to (4.14) for the multi-dictionary $D = [D_1\ D_2] \in \mathbb{R}^{N \times 2000}$. Here we construct $D_1 \in \mathbb{R}^{N \times 1000}$ with ξ taken uniformly from [0, 0.01] and $D_2$ with α taken uniformly from $[0, 10^{-4}]$. We construct B using L = 3 left singular vectors of D. (left): N = 128 time samples. (right): N = 2048 time samples. Notice that (left) and (right) are on different scales. . . . 114
Figure 4.6 Recovery results for our atomic norm minimization algorithm assuming that the unknown modulating waveforms $g_k$ explicitly live within the PCA subspace B. Indeed, we can see that our algorithm performs well using our exactly equivalent SDP relaxation, as expected. . . . 116
Figure 4.7 Note: we $\ell_\infty$-normalize the dual polynomial in (a). . . . 117
Figure 4.8 The results of solving (4.8) for data generated such that the $g_k$ strictly live in the subspace B. In (b), the red line corresponds to the method of [ ] and the blue line corresponds to the method of [ ]. . . . 119
Figure 4.9 Results of solving (4.11) for data generated such that the $g_k$ do not strictly live in the subspace B. In (c), the red line corresponds to the method of [ ] and the blue line corresponds to the method of [ ]. . . . 120
Figure 5.1 Target localization results for ideally simulated targets using different methods. . . . 133
Figure 5.2 Dual polynomials for Figure 5.1. We emphasize that the shape of the dual polynomial (especially in (d)) does not correspond to the geometry of the targets $T_k$ in a meaningful way. . . . 134
Figure 5.3 Example reflectivity profile for one of the simulated line targets. . . . 135
Figure 5.4 Target localization results using different methods. The targets are the black lines. . . . 137
Figure 5.5 Dual polynomials for Figure 5.4. We emphasize that the shape of the dual polynomial (especially in (d)) does not correspond to the geometry …


LIST OF SYMBOLS

scalar, vector, matrix, tensor . . . $a$, $\mathbf{a}$, $A$, $\mathcal{A}$
Real numbers, complex numbers, natural numbers . . . $\mathbb{R}$, $\mathbb{C}$, $\mathbb{N}$
$[N]$ . . . index set $\{1, 2, \cdots, N\}$
k-th matrix/tensor slice . . . $A[:, k]$, $\mathcal{A}[:, :, k]$
superscript $*$ . . . Hermitian adjoint of an operator
$\top$ . . . transpose, to differentiate from the conjugate transpose via $*$
⊚ . . . outer product, i.e. tensor product
⊗ . . . Kronecker product
⊙ . . . Khatri-Rao-type product: tensor whose frontal slices are column-wise tensor products
◦ . . . Hadamard product
⊖ . . . circular difference $a \ominus b = \min\{|a - b|, 1 - |a - b|\}$
$\|\cdot\|_p$ . . . $\ell_p$ norm for vectors, operator p norm for matrices
$\|\cdot\|_{p,q}$ . . . $\ell_q$ norm of column-wise $\ell_p$ norms
$\|\cdot\|_F$ . . . Frobenius norm, $\|\cdot\|_{2,2}$
$\langle A, B \rangle$ . . . $\mathrm{vec}(B)^* \mathrm{vec}(A)$
$\mathrm{Re}\{\cdot\}$, $\mathrm{Im}\{\cdot\}$ . . . real part, imaginary part
$\mathrm{diag}(a)$ . . . diagonal matrix with main diagonal $a$
$\mathrm{Toep}(a)$ . . . Hermitian symmetric Toeplitz matrix with first row $a^\top$


ACKNOWLEDGMENTS

I am deeply grateful to my advisor Michael Wakin for providing me with the opportunity to develop this thesis and for providing me with an excellent academic environment – both in the development and pursuit of my interests. Michael's deep appreciation for both intuition and mathematical rigor has established a model that will inform my research for the rest of my life. I am honored to have spent the last two years working in his group.

My parents deserve much of the credit for my success; they provided me with a foundation that allowed me to attend college and pursue my interests free of adversity. My gratitude goes beyond words and can never be sufficiently conveyed.

I would also like to thank my friend and mentor Scott Strong, without whose mentorship I would not have pursued a degree in mathematics and hence would not have begun this thesis in the first place. Our conversations over the years have kept me grounded as a researcher.

I would like to finally thank Gongguo Tang and Tyrone Vincent for serving on my thesis committee, as well as the faculty at Mines who provided me with formative educational experiences. I especially mention Luis Tenorio and Paul Constantine in this regard.


CHAPTER 1 INTRODUCTION

In this thesis, we study a difficult class of linear inverse problems known as blind super-resolution problems, which introduce many additional degrees of freedom to already ill-posed problems in signal processing. Such problems can be thought of as generalizations of canonical super-resolution problems in signal processing that describe richer classes of signals and so can model physical systems with potentially more realism. Blind super-resolution problems are an exciting class of problems to study mathematically and algorithmically because they typically cannot be solved by classical super-resolution algorithms that currently enjoy widespread usage in practice. Moreover, providing theoretical guarantees for the behavior of algorithms developed in this framework often requires new analytic techniques.

However, to lucidly describe these so-called blind super-resolution problems and effectively convey their relevance in modern signal processing, we must first introduce super-resolution problems as a whole. With a mathematical notion of super-resolution problems in place, the generalization to blind super-resolution will then be a natural one. Furthermore, our intuition for the atomic norm and its structure inducing properties will carry over without too much effort. The following discussion of super-resolution in this section is intended to provide intuition only – more rigor will occur in the future sections of this chapter.

Super-resolution itself is a fundamental problem in signal processing – and indeed science as a whole – finding applications wherever sensor and imaging systems are used to collect measurements of phenomena in the real world in some form that can be digitally processed. Intuitively, super-resolution often refers to the process of digitally recovering high-resolution underlying signals from blurry measurements (due to, say, limited pixel count for a camera or time-limited measurements for a seismic receiver) in the presence of additional uncertainty such as measurement noise (often modeled probabilistically as the aggregate of many small and unpredictable errors implicit in the measurement system). Examples range from de-blurring photographs in software like Photoshop to recovering the frequency content of vibrational signals of the kind collected in seismology. An intuitive, albeit fictional, example of a super-resolution system in action can be found in crime television shows, where detectives zoom in on security footage of a perpetrator's face in an incriminating situation, only to find that the pixel-count of the image is not sufficient to make out facial details. The often parodied phrase, "Enhance!" is then uttered in a grim baritone voice and, like magic, the identity of the suspect becomes immediately clear.

Although distinctly not so magical in nature as in serial crime television, a good super-resolution system is nonetheless a powerful tool in that a practitioner can use a low-quality measurement device (i.e. less expensive hardware) to collect measurements and then digitally account for the imprecision inherent in the inexpensive device, thereby synthesizing a more expensive piece of hardware. Such systems are tailored to solve specific, well-studied problems and are not general-purpose in the sense that a system that super-resolves natural images will not automatically super-resolve seismic measurements. Even deep learning approaches to super-resolution ([4, 19, 48] to name a few) are still constrained in this way; deep learning, broadly speaking, automates the process of encoding domain-specific structural knowledge by extracting implicit structure via statistically relevant features from existing data. The point is that there is no free lunch in super-resolution – we must sacrifice flexibility in favor of solvability.

To effectively super-resolve a signal, one must have specific a priori knowledge of the structure of the true signal of interest. Without such knowledge, one can intuitively expect that there will be many plausible ways of filling in the "gaps" in the low-resolution measurements to generate a high-resolution version. For example, when a natural RGB image is subsampled to produce a lower resolution version (i.e. a naive compression scheme), one could fill in the removed pixels with noise to produce a visibly incorrect but still "high-resolution" version of the subsampled image, depending on our metric of quality. That is, when we subsample our "high-resolution" image, the resulting image will exactly match the original subsampled image under a measure of quality like the $\ell_2$ distance; any choice we make for the unknown pixels will explain our measurements equally well. This concept is what is meant by the phrase "ill-posed" in such optimization problems.

To find a high-resolution image with high perceptive quality that still explains our measurements, we have to make a claim about the underlying structure of the image. Such a claim might be that natural images have spatial frequency content that is highly localized around edges, of which there tend to be relatively few. Wavelets are well-suited to capture this spectral localization via sparse representations since certain wavelets are themselves compactly supported in time and frequency¹. That is, wavelets expose special structure present in many types of images that we can rely on when super-resolving.

When we incorporate this sparsity structure into our measure of quality via a wavelet transform, the space of plausible images that we might recover is vastly constrained – in fact, there may now be only one plausible high-resolution image that can explain our measurements, whereas before there were many. In this way, our structural imposition has made the problem well-posed, and it can be solved in a disciplined and guaranteed manner. Moreover, proposed solutions can be well justified according to our domain knowledge (in this example of natural images). This intuition is worth keeping in mind throughout this thesis, as the atomic norm (which we discuss in Section 1.2) provides a general-purpose mathematical framework for expressing and utilizing domain-specific knowledge of signal structure in super-resolution systems. Algorithms derived within the atomic norm framework fundamentally rely on this concept of structure.

Now that the concept of super-resolution has been intuitively introduced, we will transition into more technical background material with this intuition in mind. To be specific, this thesis is focused on the development and analysis of algorithms within this atomic norm framework, and all background material discussed will be in support of this focus; the present review makes no claims of comprehensiveness. That being said, the following sections are dedicated to the requisite mathematics for blind super-resolution problems and the framework of the atomic norm used to develop algorithms for these problems, with some preliminary discussion about the scope of this thesis as a whole.

1 This is an oversimplification of wavelets, which also rely on multi-scale properties to produce sparse representations for natural images – see [31].



1.0.1 Contributions

Our contributions via this thesis are as follows.

• In this chapter, we provide a relatively self-contained introduction to the atomic norm as used in super-resolution problems in signal processing. We also review blind super-resolution problems and discuss some recent work on leveraging the atomic norm to solve these challenging problems.

• In Chapter 2, we discuss a problem called multi-band line spectral estimation, which is a blind super-resolution extension of the canonical line spectral estimation problem that arises in many applications. We develop theoretical tools that we believe will assist researchers in establishing the conditions under which our considered atomic norm algorithm (proposed originally in [59]) has theoretically guaranteed super-resolution capabilities.

• In Chapter 3, we introduce a general model for multi-sensor blind super-resolution problems that are of interest in radar signal processing and sensor networks. We propose atomic norm-based algorithms to super-resolve these kinds of signals and also provide some theoretical results that serve as a starting point for proving guaranteed super-resolution.

• In Chapter 4, we apply our work in Chapter 2 to the problem of modal analysis subject to vibrational damping. We demonstrate by way of numerical experiments the promise of our proposed approach.

(18)

• In Chapter 5, we apply our work in Chapter 2 to the problem of extended target localization in stepped-frequency radar signal processing. We again demonstrate the efficacy of our approach using numerical experiments.

1.0.2 Overview

This section provides a road-map for this chapter as well as the suggested reading order of the thesis. The remainder of the chapter serves as an introduction to the framework of the atomic norm, contextualized by the so-called line spectral estimation problem which is a central and classical problem in this work. We then introduce single sensor blind super-resolution problems in Section 1.3 and discuss the atomic norm framework in the context of these problems.

We intend for a reader not well-versed in the atomic norm to read Chapter 1 prior to the remaining chapters. Readers unfamiliar with notions of sparsity in signal processing problems are invited to read all of Chapter 1, whereas readers simply unfamiliar with blind super-resolution problems can skip ahead to Section 1.3.

For a reader with sufficient background in the atomic norm vis-à-vis blind super-resolution, we suggest skipping ahead to Chapter 2 for the first novel part of our work. Chapter 2 is focused on single-sensor variations of blind inverse problems. Readers interested in multi-sensor problems (e.g. radar signal processing) may wish to focus on Chapter 3 and beyond.

1.1 Mathematical Context

We primarily intend the following sections in this chapter to provide a reasonable introduction to atomic norm minimization by collecting together results from seminal literature on the topic.


1.1.1 Linear Inverse Problems

A principal object of study in signal processing and optimization theory therein is the canonical linear inverse problem

$$\min_{x} \; \Omega(x) \quad \text{subject to} \quad y = \mathcal{L}(x) \tag{1.1}$$

where Ω(·) is a penalty function that imposes a practitioner's a priori structural knowledge of the signals of interest on the domain of optimization. The corresponding forward problem can be stated as

$$y = \mathcal{L}(x), \tag{1.2}$$

where $\mathcal{L}$ is some linear operator (hence the phrase linear inverse problem) and y is the collection of measurements³ from some sensor system. We can, of course, express $\mathcal{L}(x)$ as a matrix-vector product by the linearity of $\mathcal{L}$, but it can oftentimes be more convenient to simply use the function notation. Note also that this measurement model is distinctly noiseless.

In a compressive sensing (CS) context, $\mathcal{L}$ is an overcomplete matrix A satisfying the Restricted Isometry Property (RIP) [22, Definition 6.1], which is fundamentally inspired by the Johnson-Lindenstrauss Lemma [22, Lemma 9.35]: random (sub-Gaussian) embeddings of point clouds preserve the basic geometry, i.e. distances and angles between points. This indicates that the uncompressed points should be recoverable (i.e. can be super-resolved) from such compressive embeddings.

In [8], L is taken to be a low-pass filter in the frequency domain (or a low-pass filter in time).

We typically choose the structural penalty function Ω to be convex for the sake of solvability; supposing that the feasible region $\{x \in \mathcal{X} : y = \mathcal{L}(x)\}$ is a non-empty, convex subset (easily satisfied for a full-rank operator A) of a finite-dimensional⁴ Hilbert space $\mathcal{X}$, (1.1) is itself a convex optimization problem. For such a convex problem, locally optimal solutions are globally optimal [6, Section 3.2.2]. This global optimality of local solutions is dually attractive since we can recover one (assuming strong convexity holds) well-justified answer to our super-resolution problem.

3 We present this linear inverse problem in terms of vector variables and measurements for simplicity – we will consider matrix and tensor variables/measurements in future sections.

When (1.1) is indeed convex, there are a plethora of methods with varying tradeoffs between speed, accuracy, and memory consumption that can be chosen according to the demands of the problem at hand; we suggest [34] as a general-purpose reference. These algorithms often enjoy strong global convergence guarantees due to the convexity of (1.1), although many algorithms still have local convergence guarantees in the nonconvex case. We will primarily focus on convex problems in this thesis.

1.1.2 Sparsity Inducing Penalty Functions

We have already stressed the intuitive importance of structural assumptions on the domain of signals x for the sake of well-posed problems (taken to mean finitely-many solutions as opposed to a continuum of equivalent solutions). In this thesis, we are primarily concerned with sparsity structure inherent in x with respect to a transform domain associated with some linear bijection $\Phi : \mathcal{X} \to \mathcal{Y}$ (e.g. the Discrete Fourier Transform (DFT)). By sparsity, we mean that $x \in \mathcal{X}$ can be represented as $x = \sum_{k=1}^{K} c_k a_k$ for a small $K \ll N$, $a_k \in \mathcal{Y}$, and scalars $c_k$. Sparsity is a type of structure that can effectively constrain a wide array of ill-posed problems in signal processing to be well-posed. We will provide specific examples in future sections.

If we consider a subspace $\mathcal{Y} \subset \mathbb{C}^N$ for $N < \infty$, then there is a finite basis of at most N complex vectors that spans $\mathcal{Y}$. In other words, $K \le N$. For sparsity, we require that $K \ll N$, i.e. x is representable in $\mathcal{Y}$ using only a few $c_k \in \mathbb{C}$ and basis vectors $a_k \in \mathcal{Y}$.

4 Although not necessary in the most general sense, we restrict ourselves in this thesis to finite-dimensional vector spaces $\mathcal{X}, \mathcal{Y} \subseteq \mathbb{C}^N$ for some $N < \infty$.

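To make transform-domain sparsity concrete, the following is a minimal numpy sketch (our own illustration; all sizes are arbitrary) of a signal that looks dense in time but is K-sparse with respect to a DFT basis Φ.

```python
import numpy as np

N, K = 128, 3
rng = np.random.default_rng(0)

# Sparse coefficient vector: only K of N entries are nonzero.
alpha = np.zeros(N, dtype=complex)
support = rng.choice(N, size=K, replace=False)
alpha[support] = rng.standard_normal(K) + 1j * rng.standard_normal(K)

# Columns of Phi are orthonormal Fourier atoms (a unitary DFT basis).
Phi = np.fft.ifft(np.eye(N), axis=0, norm="ortho")
x = Phi @ alpha  # dense-looking signal, K-sparse in the Phi domain

print(np.count_nonzero(alpha), "of", N, "coefficients are nonzero")
```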

A sparsity inducing norm with respect to $\mathcal{Y}$ is the $\ell_0$-norm⁶

$$\|x\|_0 := \mu_{\text{count}}(\mathrm{supp}(x)), \tag{1.3}$$

where $\mu_{\text{count}}$ is the counting measure and $\mathrm{supp}(x) := \{n \in [N] : x[n] \neq 0\}$ is the support of x. In this case, $\mathcal{Y}$ is some scaling t of the set of canonical basis vectors (columns of the $N \times N$ identity matrix $I_N$). We can then take $\Omega(x) = \|x\|_0$ in (1.1), whose optimizer $\hat{x}$ is the sparsest feasible ($\mathcal{L}(\hat{x}) = y$) vector that explains the measurements y. However, $\|\cdot\|_0$ is nonconvex and hence yields optimization problems that are NP-hard to solve⁷. This computational difficulty leads to a convex relaxation via the $\ell_1$-norm – indeed a proper norm and hence convex. Figure 1.1 shows the unit-balls associated with the $\ell_0$ and $\ell_1$ norms on $\mathbb{R}^2$, making the nature of the $\ell_1$ convex relaxation more intuitive. In fact, from Figure 1.1 we can infer that the $\ell_1$-norm promotes a representation of x that is a linear combination of the columns of the identity matrix I with the fewest number of components. In other words, a representation $x = I\alpha$, where $\mu_{\text{count}}(\mathrm{supp}(\alpha)) \ll N$. This is an important perspective of sparsity that will prove critical to understanding the atomic norm when we introduce it later.

(a) $\ell_0$-norm unit-ball. (b) $\ell_1$-norm unit-ball.

Figure 1.1: $\ell_0$-norm and its convex relaxation on $\mathbb{R}^2$. Observe that (b) is the convex hull of the black points in (a).

6 The $\ell_0$-"norm" is not formally a norm in the sense of a normed vector space – it fails the absolute homogeneity required of a norm ($\|\alpha x\|_0 = \|x\|_0$ for any $\alpha \neq 0$); note also that every true norm is convex, while $\|\cdot\|_0$ is not.

7 See [22, Theorem 2.17] for a proof that (1.1) using the $\ell_0$-norm is indeed NP-hard by reduction to the exact cover by 3-sets problem.


Supposing now that we have a good choice of linear bijection Φ that maps from a space which readily yields sparse representations of the underlying signal x to the measurement space $\mathcal{Y}$, choosing the penalty $\Omega(x) = \|x\|_1$ in (1.1) yields a noiseless version of the famous Lasso algorithm [42] called basis pursuit (BP). BP for real-valued optimization variables provably [22, Theorem 3.1] yields sparse representations.
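The thesis's numerical work uses CVX [23]; as a purely illustrative stand-in (not the author's code), the noiseless BP program (1.4) can be posed in Python with CVXPY, here with a generic random Φ and arbitrary sizes.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 32, 64, 4
Phi = rng.standard_normal((N, M)) / np.sqrt(N)  # generic wide measurement matrix

x_true = np.zeros(M)                             # K-sparse ground truth
x_true[rng.choice(M, size=K, replace=False)] = rng.standard_normal(K)
y = Phi @ x_true

x = cp.Variable(M)
prob = cp.Problem(cp.Minimize(cp.norm1(x)),      # Omega(x) = ||x||_1
                  [Phi @ x == y])                # noiseless measurement constraint
prob.solve()
print("recovery error:", np.linalg.norm(x.value - x_true))
```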

1.1.3 Duality & Dual Certification of Basis Pursuit (BP)

In this section, we specifically consider the convex optimization problem associated with BP:

$$\min_{x} \; \|x\|_1 \quad \text{subject to} \quad y = \Phi x \tag{1.4}$$

where $\Phi \in \mathbb{C}^{N \times M}$ with $N < M$. This problem will be referred to as the primal problem, which we assume is feasible: there exists at least one x such that $\Phi x = y$. In particular, we discuss the dual problem to (1.4) and the corresponding dual certificate that guarantees the optimality of a proposed solution $\hat{x}$. We include this discussion as dual certification directly informs the construction of a function called the dual polynomial, which is critical to the development of atomic norm algorithms.

Duality theory in optimization concerns the construction of a mathematically related optimization problem to the primal problem called the dual problem, whose dual objective lower bounds the primal objective. There are cases when the dual problem is equivalent to the primal problem in the sense that the dual objective at the dual optimal solution is equal to the primal objective at the primal optimal solution. Furthermore, it is possible to guarantee (i.e. certify) the optimality of the primal solution using the solution to the dual problem, hence the name dual certificate.

Duality is a broad and intricate theory in optimization; for our purposes, we will only discuss Lagrange duality and moreover Lagrange duality in convex optimization only. We will not provide a comprehensive review of duality even in this more precise context, instead referring the reader to the excellent standard reference [6].


By standard Lagrangian construction of the dual function (see [22, p. 558] for a derivation) $g(p; \hat{x}) \le \|\hat{x}\|_1$, with $\hat{x}$ the optimizer to (1.4), we obtain the dual problem to (1.4)

$$\max_{p} \; -\langle p, y \rangle_{\mathbb{R}} \quad \text{subject to} \quad \|\Phi^* p\|_\infty \le 1 \tag{1.5}$$

where $\langle \cdot, \cdot \rangle_{\mathbb{R}} := \mathrm{Re}\{\langle \cdot, \cdot \rangle\}$ to account for complex variables. Note that $\|\cdot\|_\infty$ is, in fact, the dual norm of $\|\cdot\|_1$.

It can be trivially proved that strong duality holds between (1.4) and (1.5) via Slater's constraint qualification [22, Theorem B.26] since there are no inequality constraints and (1.4) is assumed to be feasible. Strong duality means that for the primal optimal objective value $\hat{f} := \|\hat{x}\|_1$ and dual optimal objective value $\hat{d} := -\langle \hat{p}, y \rangle_{\mathbb{R}}$, we have $\hat{f} = \hat{d}$ and hence (1.4) is equivalent to (1.5) in the sense mentioned before.

The so-called dual certificate $q := \Phi^* \hat{p}$ (the entries of which are complex first-order polynomials of the dual variables $\hat{p}$) can be constructed, and it can be shown that for some x, if the conditions

$$\|q\|_\infty \le 1, \qquad \|x\|_1 - \langle \hat{p}, y \rangle_{\mathbb{R}} = 0 \tag{1.6}$$

are satisfied, then $x = \hat{x}$ is the globally optimal solution to (1.4) and hence q certifies the optimality of x [54, Section 3.1].

It can be further observed that (1.5) has a linear objective with a convex feasible region, i.e. $F := \{p : \|\Phi^* p\|_\infty \le 1\}$ is a convex set since Φ is a linear operator and $\|\cdot\|_\infty$ is itself a convex function whose unit sublevel set is a convex polytope. This implies immediately that $\hat{p} \in \partial F$ and hence $\|q\|_\infty = \|\Phi^* \hat{p}\|_\infty = 1$. It is worth noting that, as per [54, Condition 1], the conditions (1.6) equivalently imply that $q = \mathrm{sign}(\hat{x})$. These observations will all be familiar when we discuss the dual certificates associated with the atomic norm later⁸.

8 In the framework of the atomic norm, $\|q\|_\infty$ is called the dual polynomial – although in the context of the …

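The dual certificate can also be inspected numerically: continuing the illustrative CVXPY sketch of Section 1.1.2 (our own example, not the thesis's code), CVXPY exposes the dual variables of the equality constraint, from which $q = \Phi^* \hat{p}$ is formed and the conditions (1.6) checked. The global sign of the returned duals depends on the solver's convention, so we test both signs.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 32, 64, 4
Phi = rng.standard_normal((N, M)) / np.sqrt(N)
x_true = np.zeros(M)
x_true[rng.choice(M, size=K, replace=False)] = rng.standard_normal(K)
y = Phi @ x_true

x = cp.Variable(M)
constraint = Phi @ x == y
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [constraint])
prob.solve()

p_hat = constraint.dual_value            # dual variables for y = Phi x
q = Phi.T @ p_hat                        # dual certificate q = Phi^* p_hat (real case)
S = np.flatnonzero(np.abs(x.value) > 1e-6)
print("||q||_inf =", np.max(np.abs(q)))  # should equal 1 at optimality
print("q = sign(x) on support:",         # up to the solver's sign convention
      np.allclose(q[S], np.sign(x.value[S]), atol=1e-4)
      or np.allclose(-q[S], np.sign(x.value[S]), atol=1e-4))
```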

1.1.4 Line Spectral Estimation

In this section, we introduce an archetypal super-resolution problem in radar signal processing and wireless communications that serves as a starting point for many of the problems we consider in this thesis. In the following sections, we will also use line spectral estimation as an example problem to contextualize our discussions.

Line spectral estimation begins with a signal model that is a superposition of a few pure sinusoids

$$y(t) = \sum_{k=1}^{K} c_k e^{i 2\pi f_k t}, \tag{1.7}$$

with frequencies which, without loss of generality, are assumed to be $f_k \in [0, 1]$. In the frequency domain, y(t) is a discrete complex measure called the line spectrum:

$$\mu = \sum_k c_k \delta_{f_k}, \qquad \delta_{f_k}(B) := \begin{cases} 1, & f_k \in B \\ 0, & f_k \notin B \end{cases} \tag{1.8}$$

i.e. $\delta_{f_k}$ is the dirac measure on the one-torus [0, 1]. The vanilla noiseless⁹ problem is to recover the true measure μ from time-limited measurements y obtained by sampling y(t) at a properly chosen rate (e.g. a Nyquist rate):

$$y = \sum_{k=1}^{K} c_k e_{f_k}, \qquad e_f := \begin{bmatrix} 1 & e^{i2\pi f} & \cdots & e^{i2\pi f(N-1)} \end{bmatrix}^\top \in \mathbb{C}^N. \tag{1.9}$$

This time-limited sampling can be thought of as applying a time-windowing to y(t) or a low-pass filter to the spectrum of y(t), hence blurring the line spectrum μ and requiring super-resolution to recover the precise frequency spikes $f_k$.
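A small numpy sketch (our own, with arbitrary parameters) of the sampled model (1.9): y is a K-sparse superposition of length-N Fourier atoms $e_f$ with off-grid frequencies. We reuse this kind of synthetic signal in later illustrations.

```python
import numpy as np

def atom(f, N):
    """Fourier atom e_f = [1, e^{i 2 pi f}, ..., e^{i 2 pi f (N-1)}]^T from (1.9)."""
    return np.exp(1j * 2 * np.pi * f * np.arange(N))

N, K = 128, 4
rng = np.random.default_rng(2)
freqs = np.sort(rng.uniform(0, 1, size=K))              # off-grid frequencies f_k
coeffs = rng.standard_normal(K) + 1j * rng.standard_normal(K)
y = sum(c * atom(f, N) for c, f in zip(coeffs, freqs))  # time samples of y(t)
```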

Supposing that the true frequencies $f_k$ lie on a uniform N-element grid over [0, 1] (i.e. a Fourier grid), then BP with $\mathcal{L}(x) = \Phi^* x$, where Φ is the $N \times N$ FFT matrix, will recover μ easily via a sparse vector of Fourier basis coefficients – the $f_k$ can be determined using our a priori knowledge of the Fourier grid spacing. However, this problem becomes difficult for BP to solve when the $f_k$ do not lie on such a Fourier grid – this scenario is called the off-grid line spectral problem and is often the case for real-world line spectral problems.

9 For practical purposes, this problem is usually considered in terms of a noise model. This is not necessary for our purposes.

1.1.5 The Basis Mismatch Problem

Suppose that the $f_k$ in the line spectral model (1.8) do not lie on a predefined Fourier grid over [0, 1]. Then μ is still sparse in the continuous frequency domain, but the discrete Fourier basis Φ does not perfectly align with the basis in which μ is sparse. In fact, regardless of how finely spaced the Fourier grid is (i.e. how large N is), Φ may never fully align with the true sparsity basis. This issue, coined basis mismatch [14], results in representations of μ that are not sparse and hence do not properly super-resolve the frequency spikes $f_k$.

Due to this basis mismatch, it is difficult to provide satisfying theoretical guarantees for the efficacy of BP and Lasso (where noise fundamentally limits the resolution to which the $f_k$ can be resolved) for practical usage.
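Basis mismatch is easy to observe numerically. In this small sketch (our own illustration), a single sinusoid placed halfway between two Fourier grid points leaks energy across every DFT coefficient, so no DFT-domain representation of it is truly sparse.

```python
import numpy as np

N = 128
f_off = 3.5 / N                       # halfway between grid points 3/N and 4/N
y = np.exp(1j * 2 * np.pi * f_off * np.arange(N))

alpha = np.fft.fft(y, norm="ortho")   # coefficients in the DFT basis
energy = np.sort(np.abs(alpha) ** 2)[::-1]
print("energy fraction in 4 largest coefficients:", energy[:4].sum() / energy.sum())
# An on-grid tone puts 100% of its energy in one coefficient; this off-grid
# tone does not, no matter how large N grows.
```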

A modern thrust of research has been on developing algorithms that circumvent this fundamental limitation of BP by optimizing over continuous domains rather than discretizing the problem prior to the optimization process. Furthermore, these algorithms often yield favorable theoretical guarantees: they can perfectly recover off-grid $f_k$ in the noiseless case and are near-optimal in an information-theoretic sense in the noisy case. We discuss the first of these super-resolution methods in the following section.

1.1.6 Sparse Recovery by Total Variation Minimization

In this section, we discuss the theoretical framework proposed in the seminal work [8], which informed the development of the atomic norm framework (which we discuss later).

In [8], a continuous analog to the discrete $\ell_1$-norm is proposed via the penalty $\Omega(\nu) = \|\nu\|_{\mathrm{TV}}$, where ν is a complex measure and $\|\nu\|_{\mathrm{TV}} = |\nu|([0,1])$ is the total variation (TV) norm. The measurement operator then maps the measure ν to a discrete vector by bandlimiting ν to [0, 1] and then sampling according to the sampling strategy chosen to generate the measurements y.


To define the TV norm, we first define a nonnegative measure on the Borel sigma algebra $\mathcal{B}([0,1])$ called the variation:

$$|\nu|(B) := \sup_{\Pi(B)} \sum_{E \in \Pi(B)} |\nu(E)| \tag{1.10}$$

for some $B \in \mathcal{B}([0,1])$, where $\Pi(B)$ is a countable partitioning of B into disjoint, $\mathcal{B}([0,1])$-measurable sets and the supremum is taken over all such disjoint partitionings of B. The TV norm is then defined as the variation evaluated on the entire space:

$$\|\nu\|_{\mathrm{TV}} := |\nu|([0,1]). \tag{1.11}$$

Note that the TV norm is indeed a formal norm and hence induces a metric on spaces of complex measures on a shared sigma algebra. For real nonnegative measures – which, roughly, correspond to notions of the size of a given set – this distance is intuitively the largest difference between the sizes that two measures can assign the same set¹⁰.

This intuition does not precisely hold for signed and complex measures, although it is worth keeping in mind for the sake of understanding. Indeed, we can understand the choice of the TV norm for sparse recovery by observing that for the line spectral measure $\mu = \sum_k c_k \delta_{f_k}$ defined on $\mathcal{B}([0,1])$, the TV norm can easily be seen to equal

$$\|\mu\|_{\mathrm{TV}} = \sum_k |c_k|, \tag{1.12}$$

which clarifies how the TV norm is analogous to the $\ell_1$-norm. Minimizing the TV norm over measures ν supported on $\mathcal{B}([0,1])$ under the constraint $\mathcal{L}(\nu) = y$ will thereby intuitively yield a discrete complex measure $\hat{\nu}$ composed of the fewest (sparsest) number of dirac measures $\delta_{f_k}$, similarly to how the $\ell_1$ norm recovers a vector that is a composite sum of the fewest number of columns of the identity matrix I.
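As a concrete instance of (1.12) (the particular numbers are our own illustration), consider a two-spike line spectrum:

$$\mu = \tfrac{1}{2}\,\delta_{0.1} - 2i\,\delta_{0.7} \quad \Longrightarrow \quad \|\mu\|_{\mathrm{TV}} = \left|\tfrac{1}{2}\right| + \left|-2i\right| = \tfrac{5}{2},$$

which is exactly the $\ell_1$ norm of the coefficient vector $\left(\tfrac{1}{2}, -2i\right)$.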

Indeed, it is shown [8, Theorem 1.2] that when the frequency separation condition

$$\min_{j \neq k} (f_j \ominus f_k) \ge 2/N, \tag{1.13}$$

where $f_j \ominus f_k := \min\{|f_k - f_j|,\ 1 - |f_k - f_j|\}$ is the circular distance on [0, 1], is satisfied, then $\hat{\nu} = \mu$ is the unique minimizer of the following program:

$$\min_{\nu} \; \|\nu\|_{\mathrm{TV}} \quad \text{subject to} \quad y = \mathcal{L}(\nu) \tag{1.14}$$

That is, perfect super-resolution of y is theoretically possible under the intuitive assumption that we have enough time-samples of y(t) relative to how close the frequency spikes $f_k$ are. This is an especially surprising result since no assumption was made that $\hat{\nu}$ is even discrete (let alone K-sparse) to begin with.

10 In the case of probability measures $\nu : \mathcal{B}([0,1]) \to [0,1]$, this difference in size assignment is directly the largest difference in probability assignment.

We mentioned in Section 1.1.4 that the goal of line spectral estimation is to recover the precise $f_k$. It is not at all obvious how to localize these frequency spikes by solving the TV norm minimization problem as stated. The following section will discuss exactly how to localize these frequencies by solving the dual problem and constructing a dual polynomial, similar to Section 1.1.3.

1.1.7 Dual Certificate for Total Variation Minimization

As discussed in [8, Section 2.1], one can prove [8, Proposition A.1] that if (1.13) holds, then a dual certificate for the TV norm minimization (1.14) exists in the form of a trigonometric polynomial q(f) parameterized over [0, 1] and, moreover, that this dual polynomial satisfies the properties

$$|q(f_k)| = 1 \;\; \forall k \in [K], \qquad |q(f)| < 1 \;\; \forall f \in [0,1] \setminus \{f_1, \cdots, f_K\}. \tag{1.15}$$

One can construct q(f) numerically by solving the dual problem to the primal TV norm minimization

$$\max_{p} \; \langle p, y \rangle_{\mathbb{R}} \quad \text{subject to} \quad \|\mathcal{L}^*(p)\|_\infty \le 1 \tag{1.16}$$

where $\mathcal{L}^* : \mathbb{C}^N \to M(\mathcal{X}, \mathcal{B}([0,1]))$ is the adjoint of the linear measurement operator $\mathcal{L}$ and $M(\mathcal{X}, \mathcal{B}([0,1]))$ denotes the space of complex measures of bounded TV norm. The dual polynomial is then constructed using the solution $\hat{p}$ to (1.16) as $q(f) := \langle \hat{p}, a(f) \rangle_{\mathbb{R}}$, where $a(f) := e_f$. According to the conditions in (1.15), we can then localize the true frequency spikes $f_k$ by finding the locations where $|q(f)| = 1$, since we must have that $|q(f)| < 1$ for all $f \notin \{f_1, \cdots, f_K\}$.
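Once a dual solution $\hat{p}$ is in hand (e.g. from the SDP of Section 1.2.4), localization reduces to evaluating |q(f)| on a fine grid and finding where it reaches 1. A minimal numpy sketch of this procedure (our own illustration; the helper name dual_poly_abs is hypothetical and $\hat{p}$ is assumed given):

```python
import numpy as np

def dual_poly_abs(p_hat, grid):
    """Evaluate |q(f)| = |<p_hat, e_f>| over a grid of candidate frequencies."""
    N = len(p_hat)
    A = np.exp(1j * 2 * np.pi * np.outer(np.arange(N), grid))  # columns are e_f
    return np.abs(A.conj().T @ p_hat)

grid = np.linspace(0, 1, 4096, endpoint=False)
# q_abs = dual_poly_abs(p_hat, grid)                   # p_hat from a dual solver
# est_freqs = grid[np.isclose(q_abs, 1.0, atol=1e-3)]  # peaks where |q(f)| = 1
```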

Such constructed dual polynomials are the standard approach in atomic norm algorithms for localizing parameters of interest – in this case frequency spikes – over some continuously parameterized set. We will see in Section 1.2 that we can use dual polynomials to localize much more general parameters, however. Indeed, proof strategies for atomic norm algorithms generally follow the same approach as [8]; if this seems surprising, it turns out that the general equality constrained linear inverse problem (1.1) with Ω chosen as an atomic norm penalty will yield optimization problems that are equivalent¹² to certain TV norm minimization problems (1.14), insofar as the atomic norm is a generalization of the TV norm over certain spaces of complex measures.

12 Interestingly, computational methods for (1.14) proposed in [8] rely on semidefinite programming, as do methods for solving atomic norm minimization problems.

1.2 The Atomic Norm

With the intuition for continuous optimization established via the TV norm in the previous section, we are now ready to define the atomic norm. The atomic norm acts as a generalization of canonical structure inducing norms (including the TV norm) by defining the so-called atomic set, which is a generic collection of building blocks tailored to efficiently represent our signal of interest x. We will provide some illuminating examples of norms which are special cases of the atomic norm later in the section.

The atomic norm was proposed as a structure inducing norm for linear inverse problems in [10], where it was argued that most practical linear inverse problems have significantly fewer degrees of freedom to resolve in their signal models than the ambient dimension of the signal space $\mathcal{X}$ due to their constraints. Under a properly captured mathematical notion of structural decomposition that accounts for this low-rank structure (recall the basis mismatch problem discussed in Section 1.1.5, where the correct decomposition is not properly captured by BP with its Fourier basis), convex optimization can effectively expose the underlying signal structure and allow us to localize parameters of interest via dual polynomials (see Section 1.1.7).

We will now define the atomic norm itself, which will unify the various notions of sparsity with respect to certain collections of fundamental elements that we have already touched on; in particular, the canonical basis vectors for BP discussed in Section 1.1.4 and dirac measures on $\mathcal{B}([0,1])$ discussed in Section 1.1.6.

Define a set $\mathcal{A} \subset \mathcal{X}$ called the atomic dictionary, whose elements we refer to as atoms, which we assume is compact¹⁴ in $\mathcal{X}$ (recall that $\mathcal{X}$ is our notation for the generic signal space which is mapped into the measurement space $\mathcal{Y}$ by $\mathcal{L}$). Further suppose, without loss of generality, that $\mathcal{A}$ is restricted to be a collection of extremal points of a convex body – that is, for any $a \in \mathcal{A}$, $a \notin \mathrm{conv}(\mathcal{A} \setminus \{a\})$. Moreover, this implies that $a \in \partial \mathcal{A}$, making the convex hull $\mathrm{conv}(\mathcal{A})$ closed. Lastly, we assume that $\mathcal{A}$ is radially symmetric in the sense that $a \in \mathcal{A} \Rightarrow -a \in \mathcal{A}$. The atomic norm is then defined as the gauge function¹⁵ of $\mathrm{conv}(\mathcal{A})$

$$\|x\|_{\mathcal{A}} := \inf\{t > 0 : x \in t\,\mathrm{conv}(\mathcal{A})\}. \tag{1.17}$$

In words, this definition is the smallest factor by which the convex hull $\mathrm{conv}(\mathcal{A})$ must be stretched in order to first touch the point $x \in \mathcal{X}$. It is worth pointing out that we can have $\|x\|_{\mathcal{A}} = \infty$ if $\lim_{t \to \infty} t\,\mathrm{conv}(\mathcal{A}) \neq \mathcal{X}$; roughly, $\mathrm{conv}(\mathcal{A})$ does not span (with small abuse of the word) the space $\mathcal{X}$.

If $\|x\|_{\mathcal{A}} < \infty$, then it is clear that we can represent x using some countable combination of the atoms of $\mathcal{A}$. With this in mind, it is easy to see that the atomic norm can be equivalently written in the more illuminating form

$$\|x\|_{\mathcal{A}} = \inf_{c_k,\; a_k \in \mathcal{A}} \left\{ \sum_k |c_k| \;:\; x = \sum_k c_k a_k \right\} \tag{1.18}$$

14 Recall that if $\mathcal{X}$ is finite-dimensional, then this requires that $\mathcal{A}$ is bounded.

15 Note that the gauge function of a generic symmetric set inherently induces a notion of distance with respect to that set.


where $c_k$ is taken over the scalar field of $\mathcal{X}$. We can interpret the atomic norm vis-à-vis (1.18) as finding the sparsest combination of $\mathcal{A}$-atoms that rebuilds x. To make this clearer, define the $\ell_0$-analog of (1.18) as

$$\|x\|_{\mathcal{A},0} := \inf_{c_k,\; a_k \in \mathcal{A}} \left\{ K \in \mathbb{N} \;:\; x = \sum_{k=1}^{K} c_k a_k \right\}, \tag{1.19}$$

which can be directly interpreted as computing the rank of x with respect to $\mathcal{A}$. This rank function need not be convex, however, and so, in the same way that the $\ell_1$-norm is a tight convex relaxation of $\|\cdot\|_0$, so too is $\|\cdot\|_{\mathcal{A}}$ a convex relaxation of $\|\cdot\|_{\mathcal{A},0}$.

For the sake of interest, we will now provide an elementary proof that the atomic norm lives up to its name:

Lemma 1.2.1 (The Atomic Norm is a Norm). Suppose that $\mathcal{X}$ is countable and has underlying scalar field $\mathbb{C}$. Let $\mathcal{A} \subset \mathcal{X}$ be a compact, symmetric set whose elements are the extremal points of a convex body, and induce $\|\cdot\|_{\mathcal{A}}$ per (1.18). Then $\|\cdot\|_{\mathcal{A}}$ is a norm on $\mathcal{X}$.

Proof. To begin, we note that the scalability $\|\alpha x\|_{\mathcal{A}} = |\alpha| \|x\|_{\mathcal{A}}$ for $\alpha \in \mathbb{C}$ and point-separation $x = 0 \Leftrightarrow \|x\|_{\mathcal{A}} = 0$ are obvious upon inspection. As such, we will only prove subadditivity, i.e. the triangle inequality.

Let $x, y \in \mathcal{X}$ and suppose that $\|x\|_{\mathcal{A}}, \|y\|_{\mathcal{A}} < \infty$ (the result is trivial otherwise). Fix $\varepsilon > 0$ and choose decompositions $x = \sum_i \alpha_i b_i$ and $y = \sum_j \beta_j c_j$ with $b_i, c_j \in \mathcal{A}$ that achieve the infimum in (1.18) to within ε:

$$\sum_i |\alpha_i| \le \|x\|_{\mathcal{A}} + \varepsilon, \qquad \sum_j |\beta_j| \le \|y\|_{\mathcal{A}} + \varepsilon.$$

Concatenating these decompositions gives

$$x + y = \sum_i \alpha_i b_i + \sum_j \beta_j c_j,$$

which is itself a countable atomic decomposition of $x + y$ (when some $b_i = c_j$ coincide, merge them into a single atom $a_k$ with coefficient $c_k = \alpha_i + \beta_j$, so that $|c_k| \le |\alpha_i| + |\beta_j|$). It immediately follows that

$$\|x + y\|_{\mathcal{A}} \le \sum_i |\alpha_i| + \sum_j |\beta_j| \le \|x\|_{\mathcal{A}} + \|y\|_{\mathcal{A}} + 2\varepsilon,$$

and letting $\varepsilon \to 0$ implies directly that $\|x + y\|_{\mathcal{A}} \le \|x\|_{\mathcal{A}} + \|y\|_{\mathcal{A}}$. This is the desired result.

1.2.1 The Atomic Norm Generalizes Other Norms

An important interpretation is that the atomic norm generalizes other commonly used structure inducing norms, like the $\ell_1$-norm, depending on the choice of atomic dictionary $\mathcal{A}$. In this way, the atomic norm provides a flexible framework for tailoring $\mathcal{A}$ to fit the specific structure of the signal of interest x that other canonical norms may not capture, while still allowing for these canonical norms when they naturally fit the problem. Moreover, we can induce the atomic norm on connected sets $\mathcal{A}$, i.e. continua, which allows us to avoid discretizing the domain of optimization as algorithms like LASSO do. This flexibility is powerful in the sense that it allows us to circumvent problems like basis mismatch (see Section 1.1.5) by choosing $\mathcal{A}$ to be a continuum that includes the sparse components of x as distinct elements rather than hoping that those elements lie on an a priori grid-based discretization of $\mathcal{X}$.

Upon first introduction to the atomic norm, it is useful to see specific examples of canonical norms as special cases of the atomic norm. We now present four such examples, noting, in particular, the fourth, which shows the TV norm as a special case of the atomic norm.

1. [10] Let $\mathcal{A} = \{\pm e_1, \pm e_2, \cdots, \pm e_N\}$, where $e_j$ is the j-th column of the $N \times N$ identity matrix I. Then $\|\cdot\|_{\mathcal{A}} = \|\cdot\|_1$. This can be seen by noticing that $\mathrm{conv}(\mathcal{A}) = \{x : \|x\|_1 \le 1\}$, the unit-ball of $\|\cdot\|_1$.

2. Let $\mathcal{A} = \{x : \|x\|_2 = 1\}$. Then $\mathrm{conv}(\mathcal{A})$ is the unit-ball of the $\ell_2$ norm and hence $\|\cdot\|_{\mathcal{A}} = \|\cdot\|_2$. This approach holds for any $\ell_p$ norm with $p \ge 1$.

3. Let $\mathcal{A} = \{a \circledcirc b \circledcirc c \in \mathbb{C}^{M \times N \times P} : \|a\|_2 = 1,\ \|b\|_2 = 1,\ \|c\|_2 = 1\}$, the set of all third-order, rank-one tensors on $\mathbb{C}^{M \times N \times P}$. It is easy to see via (1.18) that the induced atomic norm is precisely the tensor nuclear norm. This includes the matrix nuclear norm – the sum of the singular values – as an even more special case. It is worth noting that the tensor nuclear norm can be thought of as the $\ell_1$ norm of the singular values and hence encourages low-rank tensor recovery.

4. Let M be the space of all finite complex measures on the measure space $(\mathbb{T}, \mathcal{B}(\mathbb{T}))$ (note that $\mathbb{T} \neq \emptyset$) for Borel sigma algebra $\mathcal{B}(\mathbb{T})$, equipped with the TV norm. The TV norm unit ball¹⁶ is $\{\nu \in M : \|\nu\|_{\mathrm{TV}} \le 1\}$, which is convex by the convexity of the norm $\|\cdot\|_{\mathrm{TV}}$. It can be shown that the extremal points of this unit ball are the dirac measures

$$\mathcal{A} := \{c\, \delta_\tau : c \in \mathbb{C},\ |c| = 1,\ \tau \in \mathbb{T}\}$$

and hence $\|\cdot\|_{\mathcal{A}} = \|\cdot\|_{\mathrm{TV}}$. This is worth noticing in the context of Section 1.1.6 and line spectral estimation, where the set $\mathbb{T} = [0, 1]$ and the line spectra μ are clearly K-sparse combinations of the atoms of $\mathcal{A}$.

16 Note: for the space of signed Radon measures, the TV norm unit circle (i.e. the boundary of the unit ball) is the convex set of all probability measures.

1.2.2 An Atomic Norm for Line Spectral Estimation

In this section, we set up an atomic norm that can capture the sparse frequency structure of the line spectra discussed in Section 1.1.4 while avoiding the basis mismatch problems of BP (see Section 1.1.5), where off-grid frequencies can potentially never be properly identified by BP. Recalling the line spectrum $\mu = \sum_{k=1}^{K} c_k \delta_{f_k}$ and corresponding sampled time-domain measurements $y = \sum_{k=1}^{K} c_k e_{f_k} = \sum_k |c_k| e^{i\phi_k} e_{f_k}$, we can notice that y is itself a K-sparse combination of sinusoids $e_f$. With this observation in mind, we can construct the atomic dictionary

$$\mathcal{A} := \{c\, e_f : c \in \mathbb{C},\ |c| = 1,\ f \in [0,1]\} = \left\{ e^{i\phi} e_f : \phi \in [0, 2\pi),\ f \in [0,1] \right\}, \tag{1.20}$$

where φ is a phase parameter. This is the exact dictionary used in [41]. It is not difficult to see that $\mathcal{A}$ is both symmetric (due to φ) and that for any $a(\phi, f) \in \mathcal{A}$, $a(\phi, f) \notin \mathrm{conv}(\mathcal{A} \setminus \{a\})$, since removing a corresponds to $[0,1] \setminus \{f\}$, which creates a non-recoverable hole in the continuum.

We can now see that inducing the atomic norm with $\mathcal{A}$ will allow us to find a representation of x that is K-sparse, i.e. optimally sparse, since the atoms of $\mathcal{A}$ are continuously parameterized over all possible frequencies in [0, 1].

1.2.3 Dual Certificate for Line Spectral Atomic Norm Minimization

In this section, we discuss duality for the atomic norm in a similar manner to Sections 1.1.3 and 1.1.7. In particular, we will consider atomic norm minimization for solving line spectral estimation for illustrative purposes.

To begin, we consider the primal problem

$$\min_{x} \; \|x\|_{\mathcal{A}} \quad \text{subject to} \quad y = x \tag{1.21}$$

which can be seen as equivalent to the TV norm minimization problem (1.14) since both norms are convex and moreover, for the optimizer $\hat{x}$ to (1.21) and $\hat{\nu}$ to (1.14),

$$\|\hat{x}\|_{\mathcal{A}} = \|\hat{\nu}\|_{\mathrm{TV}} = \sum_k |c_k|.$$

As such, the dual problems for the atomic norm and TV minimization problems should be equivalent as well. Given this equivalence between atomic norm minimization and TV norm minimization, we have the same sample complexity, where $\min_{i \neq j}(f_i \ominus f_j) \ge 2/N$ is sufficient to exactly recover the spike locations $f_k$ via the dual polynomial. Note that in (1.21), the measurement operator $\mathcal{L}$ is simply the identity mapping rather than a mapping from a space of complex measures to the measurement space $\mathcal{Y} = \mathbb{C}^N$ as in (1.14).

In fact, the dual atomic norm is defined [10], [41] as the support function

$$\|p\|_{\mathcal{A}}^* := \sup_{a \in \mathcal{A}} \langle p, a \rangle_{\mathbb{R}} = \sup_{\|x\|_{\mathcal{A}} \le 1} \langle p, x \rangle_{\mathbb{R}} \tag{1.22}$$

and the dual problem to (1.21) is then

$$\max_{p} \; \langle p, y \rangle_{\mathbb{R}} \quad \text{subject to} \quad \|p\|_{\mathcal{A}}^* \le 1 \tag{1.23}$$

In the same way that there is zero duality gap in the TV norm case, there is zero duality gap between (1.21) and (1.23). Similarly to Section 1.1.7, we can observe that the objective of (1.23) is linear in p and the feasible region is convex by the convexity of the dual norm $\|\cdot\|_{\mathcal{A}}^*$, hence the dual optimizer $\hat{p}$ will occur on the boundary of the feasible region: $\hat{p} \in \{p : \|p\|_{\mathcal{A}}^* = 1\}$, hence $\|\hat{p}\|_{\mathcal{A}}^* = 1$. By strong duality, we also have that

$$\langle \hat{p}, y \rangle_{\mathbb{R}} = \|\hat{x}\|_{\mathcal{A}} = \sum_k |c_k|.$$

The dual certificate, i.e. the dual polynomial, can be derived from the dual optimizer $\hat{p}$ plugged into the dual norm:

$$\|\hat{p}\|_{\mathcal{A}}^* = \sup_{a \in \mathcal{A}} \langle \hat{p}, a \rangle_{\mathbb{R}} = \sup_{\phi \in [0,2\pi),\ f \in [0,1]} \mathrm{Re}\left\{ e^{-i\phi} \langle \hat{p}, a(f) \rangle \right\} \tag{1.24}$$

and take the dual polynomial to be $q(\phi, f) := e^{-i\phi} \langle \hat{p}, a(f) \rangle$. Notice that when we compute $|q(f)|$ in order to localize peaks, this function becomes invariant to the phase term φ, hence we can effectively ignore φ and write the dual polynomial as $q(f) := \langle \hat{p}, a(f) \rangle$.

This dual polynomial q(f) can be proven [41, Proposition 2.4] to certify $\hat{x} = y$ as primal-optimal, and it furthermore exactly localizes the true frequencies $f_k$ [41, Proposition 2.5].

As an illustration of this dual polynomial frequency localization, we simulate a line spectral signal y using K = 4 random frequencies and N = 128 time samples. We then solve (1.21) using CVX [23] to compute the dual variables $\hat{p}$ (details on computing the atomic norm can be found in Section 1.2.4). Figure 1.2 shows the resulting dual polynomial |q(f)| evaluated over a finely spaced grid. We can see that the dual polynomial indeed has well-concentrated spikes around the true frequencies $f_k$.

1.2.4 Computation of the Atomic Norm by Semi-Definite Relaxation

From the previous sections, it is not yet obvious how to compute the atomic norm, let alone solve an optimization problem defined in terms of the atomic norm. Computing the atomic norm, despite its convexity, can still be NP-hard when there is no reduction to a canonical norm like the $\ell_1$ norm. As such, we now discuss a way of computing the atomic norm for the line spectral problem using a convex relaxation of the already convex atomic norm. We use the line spectral problem again as an illustrative example since the appropriate convex relaxation for a generic atomic norm minimization problem must be developed on a case-by-case basis. We note that the idea for such convex relaxation of the atomic norm was first proposed in [10], although many works have since fleshed out the concept with more rigor.

Figure 1.2: An example of a dual polynomial for the noiseless line spectral estimation problem.

For the atomic norm associated with the line spectral atomic dictionary defined in Section 1.2.2, [41, Proposition 2.1] shows by way of Vandermonde decomposition that

$$\begin{aligned} \min_{u \in \mathbb{C}^N,\ w \in \mathbb{R}} \quad & \frac{1}{2}\left( \frac{1}{N}\mathrm{tr}(\mathrm{Toep}(u)) + w \right) \\ \text{subject to} \quad & \begin{bmatrix} \mathrm{Toep}(u) & x \\ x^* & w \end{bmatrix} \succeq 0 \end{aligned} \tag{1.25}$$

is equivalent to computing the atomic norm, i.e. $\|x\|_{\mathcal{A}} = \frac{1}{2}\left( \frac{1}{N}\mathrm{tr}(\mathrm{Toep}(\hat{u})) + \hat{w} \right)$, where $\hat{u}$ and $\hat{w}$ are primal-optimal with respect to (1.25). The operator $\mathrm{Toep}(u)$ is the Hermitian Toeplitz matrix with $u^\top$ as the first row. That is,

$$\mathrm{Toep}(u) := \begin{bmatrix} u_1 & u_2 & \cdots & u_N \\ u_2^* & u_1 & \cdots & u_{N-1} \\ \vdots & \vdots & \ddots & \vdots \\ u_N^* & u_{N-1}^* & \cdots & u_1 \end{bmatrix}. \tag{1.26}$$

In general, it is not obvious that such semi-definite programs (SDPs) are exactly equivalent to the atomic norm – this is a result that must be proved case-by-case.

Note that (1.25) is an optimization problem over the semi-definite cone, i.e. a semi-definite program (SDP), as it is often called in the literature. This optimization has a linear objective subject to a linear matrix inequality constraint, and can be readily solved using off-the-shelf software packages like CVX [23] (which uses interior point methods).
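As an illustrative CVXPY translation of (1.25) (the thesis itself uses CVX [23]; this Python sketch and the helper name atomic_norm_sdp are our own, with the Toeplitz structure imposed via diagonal equality constraints):

```python
import cvxpy as cp
import numpy as np

def atomic_norm_sdp(x):
    """||x||_A for the line spectral dictionary (1.20), via the SDP (1.25)."""
    N = x.shape[0]
    Z = cp.Variable((N + 1, N + 1), hermitian=True)  # [[Toep(u), x], [x^*, w]]
    T, w = Z[:N, :N], cp.real(Z[N, N])
    constraints = [Z >> 0, Z[:N, N] == x]
    # Toeplitz structure per (1.26): entries constant along every diagonal.
    constraints += [T[i + 1, j + 1] == T[i, j]
                    for i in range(N - 1) for j in range(N - 1)]
    prob = cp.Problem(cp.Minimize(0.5 * (cp.real(cp.trace(T)) / N + w)), constraints)
    prob.solve()
    return prob.value
```

For a signal y generated as in Section 1.1.4 with well-separated frequencies, atomic_norm_sdp(y) should return approximately $\sum_k |c_k|$, consistent with Proposition 1.2.1 below.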

For edification, we will provide a proof that (1.25) is equivalent to the line spectral atomic norm; the strategy used in [41] is the same technique employed for most proofs of SDP equivalence in the literature. We will use this strategy in a later chapter to establish an SDP lower bound for an atomic norm problem over tensors.

To establish the desired result, we first require the Carathéodory-Toeplitz lemma on the Vandermonde decompositions of Toeplitz matrices, which we restate now.

Lemma 1.2.2 ([9] Carathéodory-Toeplitz Theorem). Let P be any positive semi-definite Toeplitz matrix with rank R. Then there exist rank-R matrices $V = \begin{bmatrix} e_{f_1} & \cdots & e_{f_R} \end{bmatrix}$ and $D = \mathrm{diag}(d_1, \cdots, d_R)$, where $d_r > 0$, such that

$$P = V D V^*.$$
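As a quick numerical sanity check of the correspondence in Lemma 1.2.2 (our own illustration, checking the easy converse direction), a matrix of the form $V D V^*$ built from Fourier atoms and positive weights is indeed a rank-R, positive semi-definite, Hermitian Toeplitz matrix:

```python
import numpy as np

N, R = 8, 3
rng = np.random.default_rng(3)
f = rng.uniform(0, 1, size=R)
V = np.exp(1j * 2 * np.pi * np.outer(np.arange(N), f))   # V = [e_f1 ... e_fR]
D = np.diag(rng.uniform(0.5, 2.0, size=R))               # d_r > 0

P = V @ D @ V.conj().T
print("PSD:", np.linalg.eigvalsh(P).min() > -1e-10)
print("Toeplitz:", np.allclose(P[:-1, :-1], P[1:, 1:]))  # constant diagonals
print("rank:", np.linalg.matrix_rank(P), "== R =", R)
```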

We are now ready to show SDP equivalence with the atomic norm.

Proposition 1.2.1 ([41, Proposition 2.1]). Denote the optimal objective value of (1.25) as $\mathrm{SDP}(\hat{x})$. Then $\|\hat{x}\|_{\mathcal{A}} = \mathrm{SDP}(\hat{x})$.

Proof. The proof idea is to show that $\mathrm{SDP}(\hat{x})$ both upper and lower bounds $\|\hat{x}\|_{\mathcal{A}}$, in the respective cases when $\hat{x}$ has an atomic decomposition and when the SDP optimization variables are feasible.


We will show $\mathrm{SDP}(\hat{x}) \le \|\hat{x}\|_{\mathcal{A}}$ first. Define $a(f_k) := a(0, f_k)$ and suppose that $\hat{x} = \sum_k c_k a(f_k)$, where $c_k \in \mathbb{C}$ (we have the polar decomposition $c_k = |c_k| e^{i\phi_k}$). We can choose $u = \sum_k |c_k| a(f_k)$ and $w = \sum_k |c_k|$. Towards showing that these choices are feasible, we have that

$$\mathrm{Toep}(u) = \sum_k |c_k|\, a(f_k) a(f_k)^*.$$

It follows that the semi-definite constraint matrix decomposes into a sum of outer products (the phases $e^{i\phi_k}$ are absorbed into the atoms and cancel in the upper-left block):

$$\begin{bmatrix} \mathrm{Toep}(u) & \hat{x} \\ \hat{x}^* & w \end{bmatrix} = \sum_k |c_k| \begin{bmatrix} a(f_k) a(f_k)^* & a(\phi_k, f_k) \\ a(\phi_k, f_k)^* & 1 \end{bmatrix} = \sum_k |c_k| \begin{bmatrix} a(\phi_k, f_k) \\ 1 \end{bmatrix} \begin{bmatrix} a(\phi_k, f_k) \\ 1 \end{bmatrix}^*,$$

which is easily positive semi-definite, hence u and w are feasible. Now, we have that $\frac{1}{N}\mathrm{tr}(\mathrm{Toep}(u)) = u[1] = \sum_k |c_k| = w$. This gives the objective value

$$\frac{1}{2}(u[1] + w) = \sum_k |c_k|.$$

Since the decomposition of $\hat{x}$ was arbitrary, taking the infimum over all such decompositions gives $\mathrm{SDP}(\hat{x}) \le \|\hat{x}\|_{\mathcal{A}}$, completing the first direction of the proof.

We now show that $\mathrm{SDP}(\hat{x}) \ge \|\hat{x}\|_{\mathcal{A}}$. Conversely, assume that u and w are feasible:

$$\begin{bmatrix} \mathrm{Toep}(u) & \hat{x} \\ \hat{x}^* & w \end{bmatrix} \succeq 0 \;\Rightarrow\; \mathrm{Toep}(u) \succeq 0.$$

By Lemma 1.2.2, there must exist a Vandermonde decomposition $\mathrm{Toep}(u) = V D V^*$, where $D \succ 0$ and $V = \begin{bmatrix} a(f_1) & \cdots & a(f_K) \end{bmatrix}$. That is, $V D V^* = \sum_k d_k a(f_k) a(f_k)^*$ and hence $\mathrm{tr}(\mathrm{Toep}(u)) = N\,\mathrm{tr}(D)$. By the Schur complement vis-à-vis the semi-definite constraint matrix in (1.25), we can establish that $\hat{x} \in \mathrm{span}(V)$, thus $\hat{x}$ has the atomic decomposition

$$\hat{x} = V b = \sum_k b[k]\, a(f_k)$$

for some $b \in \mathbb{C}^K$. The Schur Complement Lemma gives that $V D V^* \succeq \frac{1}{w} V b b^* V^*$, and so, choosing a vector q such that $q^* a(f_k) = \mathrm{sign}(b[k])$ for each k (such a q exists since V has full column rank),

$$\mathrm{tr}(D) = q^* V D V^* q \ge \frac{1}{w}\, q^* V b b^* V^* q = \frac{1}{w}\|b\|_1^2.$$

The geometric mean inequality then gives that

$$\frac{1}{2}\left( \frac{1}{N}\mathrm{tr}(\mathrm{Toep}(u)) + w \right) = \frac{1}{2}\mathrm{tr}(D) + \frac{1}{2}w \ge \sqrt{\mathrm{tr}(D)\,w} \ge \sqrt{\|b\|_1^2} = \sum_k |b_k| \ge \|\hat{x}\|_{\mathcal{A}},$$

where the last inequality is because we have already established that $\hat{x}$ has at least one valid atomic decomposition. Since the above results hold for any feasible u and w, we have immediately that $\mathrm{SDP}(\hat{x}) \ge \|\hat{x}\|_{\mathcal{A}}$. This completes the proof.

1.2.5 An Equivalent SDP to Atomic Norm Minimization

Now that we have an exactly equivalent SDP characterization (1.25) of the atomic norm for the line spectral estimation problem, we can easily construct an equivalent SDP to the atomic norm minimization problem for line spectral estimation (1.21) itself. This equivalent (by Proposition 1.2.1) SDP is
\[
\begin{aligned}
\underset{x \in \mathbb{C}^N,\, u \in \mathbb{C}^N,\, w \in \mathbb{R}}{\text{minimize}} \quad & \frac{1}{2}\left(\frac{1}{N}\operatorname{tr}(\operatorname{Toep}(u)) + w\right) \\
\text{subject to} \quad & \begin{bmatrix} \operatorname{Toep}(u) & x \\ x^* & w \end{bmatrix} \succeq 0 \\
& y = x
\end{aligned} \tag{1.27}
\]

which is similar to (1.25) but simply includes $x$ as an additional optimization variable, whereas before $x$ was fixed. Solving the dual problem to (1.27) and taking the dual-optimal variables associated with the equality constraint as $\hat{p}$ allows us to construct a dual polynomial that certifies the optimality of $\hat{x}$ and moreover localizes the true frequency spikes, just as before in Section 1.2.3. In fact, this is the exact approach that we took to produce Figure 1.2 using CVX [23].
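A sketch of this localization step is below, again in cvxpy rather than CVX. The variable names and the synthetic data are our own, and dual variables are only recovered up to the solver's sign and scaling conventions, which is harmless here since we localize frequencies by thresholding $|q(f)|$ relative to its maximum.

```python
import numpy as np
import cvxpy as cp

# Synthetic measurements (illustrative): two frequencies to localize.
N = 32
n = np.arange(N)
y = np.exp(2j * np.pi * np.outer(n, [0.10, 0.35])) @ np.array([1.0, 0.5])

# SDP (1.27): same block matrix as in (1.25), but now the signal slot is a
# variable pinned to the data by the constraint y = x.
Z = cp.Variable((N + 1, N + 1), hermitian=True)
constraints = [Z >> 0]
constraints += [Z[i, j] == Z[i + 1, j + 1]
                for i in range(N - 1) for j in range(i, N - 1)]
data_fit = Z[:N, N] == y  # keep a handle on y = x to read off its dual
constraints.append(data_fit)

objective = cp.Minimize(0.5 * (cp.real(cp.trace(Z[:N, :N])) / N
                               + cp.real(Z[N, N])))
cp.Problem(objective, constraints).solve()

# Dual polynomial q(f) = <p_hat, a(f)>; |q(f)| peaks at the true frequencies.
p_hat = data_fit.dual_value
grid = np.linspace(0, 1, 4096, endpoint=False)
q = np.exp(2j * np.pi * np.outer(n, grid)).conj().T @ p_hat
print("estimated frequencies near:", grid[np.abs(q) > 0.99 * np.abs(q).max()])
```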

1.2.6 Atomic Soft Thresholding (AST)

In this section, we briefly discuss how atomic norm minimization problems such as (1.21) can be extended to account for measurement uncertainty via additive Gaussian noise. In particular, suppose that our measurement model is of the form

\[
y = \sum_k c_k a(f_k) + \nu,
\]
where the additive noise is distributed as $\nu \sim \mathcal{N}(0, \sigma^2 I_{N \times N})$ with some standard deviation $\sigma > 0$. Then the AST program is defined as a relaxed version of (1.21):
\[
\underset{z \in \mathbb{C}^N}{\text{minimize}} \quad \frac{1}{2}\|y - z\|_2^2 + \tau \|z\|_{\mathcal{A}}, \tag{1.28}
\]

where τ > 0 is a penalty term typically referred to as the threshold value. To see why this term is called the threshold, observe that the dual problem to (1.28) is

\[
\begin{aligned}
\underset{p \in \mathbb{C}^N}{\text{maximize}} \quad & \frac{1}{2}\left(\|y\|_2^2 - \|y - p\|_2^2\right) \\
\text{subject to} \quad & \|p\|_{\mathcal{A}}^* \leq \tau
\end{aligned} \tag{1.29}
\]

which is derived in [3, Lemma 2]. Moreover, [3, Corollary 1] establishes that, supposing the frequencies are separated by $\min_{i \neq j}(f_i \ominus f_j) \geq 4/N$, the dual polynomial $q(f)$ corresponding to solving (1.29) has peaks $|q(f_k)| = \tau$ for each $k \in [K]$ and $|q(f)| < \tau$ for all $f \notin \{f_1, \cdots, f_K\}$.

A good choice of the threshold $\tau$ is, per [3, Theorem 2], $\tau = \eta \sigma \sqrt{N \log(N)}$ for some (finite) numerical constant $\eta > 1$ (although one can use $\eta = 1$ effectively in practice).

The equivalent SDP to (1.28) is now clearly
\[
\begin{aligned}
\underset{z \in \mathbb{C}^N,\, u \in \mathbb{C}^N,\, w \in \mathbb{R}}{\text{minimize}} \quad & \frac{1}{2}\|y - z\|_2^2 + \frac{\tau}{2}\left(\frac{1}{N}\operatorname{tr}(\operatorname{Toep}(u)) + w\right) \\
\text{subject to} \quad & \begin{bmatrix} \operatorname{Toep}(u) & z \\ z^* & w \end{bmatrix} \succeq 0.
\end{aligned} \tag{1.30}
\]
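A minimal cvxpy sketch of (1.30) follows, with synthetic noisy data of our own choosing and the threshold set per [3, Theorem 2] with $\eta = 1$:

```python
import numpy as np
import cvxpy as cp

# Noisy synthetic data (illustrative): y = sum_k c_k a(f_k) + nu.
N, sigma = 64, 0.1
rng = np.random.default_rng(0)
n = np.arange(N)
nu = sigma * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
y = np.exp(2j * np.pi * np.outer(n, [0.10, 0.35])) @ np.array([1.0, 0.5]) + nu

tau = sigma * np.sqrt(N * np.log(N))  # threshold per [3, Theorem 2], eta = 1

# AST SDP (1.30): the signal slot z of the block matrix is now free, and the
# data enter only through the quadratic penalty.
Z = cp.Variable((N + 1, N + 1), hermitian=True)
z = Z[:N, N]
constraints = [Z >> 0]
constraints += [Z[i, j] == Z[i + 1, j + 1]
                for i in range(N - 1) for j in range(i, N - 1)]
objective = cp.Minimize(0.5 * cp.sum_squares(y - z)
                        + 0.5 * tau * (cp.real(cp.trace(Z[:N, :N])) / N
                                       + cp.real(Z[N, N])))
cp.Problem(objective, constraints).solve()

z_hat = z.value  # denoised signal estimate
print(np.linalg.norm(z_hat - (y - nu)) / np.linalg.norm(y - nu))
```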

1.3 Blind Super-Resolution Problems

In this section, we introduce the archetypal problem that will remain the focus of the rest of this thesis. Blind super-resolution problems are a particularly challenging class of linear inverse problems that add many more degrees of freedom that must be resolved during optimization. In fact, blind super-resolution problems are generally ill-posed even with the structure captured by the atomic norm in the way we have described so far in this thesis. That is, to solve blind super-resolution problems, we need to introduce additional structure into our signal model, which we will make precise shortly.

To contextualize blind super-resolution problems, we will now briefly discuss some applications in which these problems arise. A first example is blind deconvolution, in which images are convolved with unknown blurring kernels that must be resolved jointly alongside the inherent low fidelity of the image due to, say, the limited pixel count of a camera. The time-series analog would be time-limited measurements. Blind deconvolution arises in seismic imaging [32], photography [21], and astronomy [40].

Similar to blind deconvolution is the problem of simultaneous blind deconvolution and phase retrieval. Phase retrieval is of interest in many biological contexts such as neural cell growth tracking [49] and cancer cell growth tracking [33]. If the light source is only partially coherent (e.g., LEDs instead of lasers), as in many practical scenarios, then trying to incorporate this incoherence into the signal model results in a blind super-resolution problem.

Another application area is damped modal analysis, which is a focus of ours in Chapter 4. Damped modal analysis finds applications in nuclear magnetic resonance spectroscopy [7], acoustics, and the analysis of vibrating physical structures (not unrelated to acoustics) [20], [36].

Blind super-resolution problems arise frequently in radar signal processing, especially in the context of imaging extended (i.e. non-point) targets; this is another focus of ours in Chapter 5. An example is through-the-wall radar imaging [1], [26], [56]. Such problems can also arise in SAR imaging and target tracking scenarios.


The generic signal model that acts as a basis for blind super-resolution problems can be stated in vector format as
\[
y = \sum_{k=1}^{K} c_k \mathcal{T}_k(a(\theta_k)), \tag{1.31}
\]

where $c_k \in \mathbb{C}$, the $a(\theta_k)$ are the parameterized atoms of some generic atomic dictionary $\mathcal{A}$, and the $\mathcal{T}_k$ are unknown linear transformations that warp the atomic decomposition. Here we consider access to full measurements rather than a compressive sensing scenario. It should be pointed out, however, that the compressive measurement operator can in principle be absorbed into the transformations $\mathcal{T}_k$, and hence compressive sensing signal models are special cases of this blind super-resolution signal model.

The general goal in blind super-resolution is, given linear measurements $y = \mathcal{L}(x)$, to recover the atomic parameters $\theta_k$ while simultaneously estimating the unknown transformations $\mathcal{T}_k$. However, recovering the $\theta_k$ and $\mathcal{T}_k$ from measurements of $x$ is extremely ill-posed without constraints on the types of waveforms that the $a(\theta_k)$ and the $\mathcal{T}_k$ can produce.

In the problems we study, the unknown transformations $\mathcal{T}_k$ manifest as modulating waveforms, which means that we can rewrite the signal model as
\[
y = \sum_{k=1}^{K} c_k\, g_k \circ a(\theta_k) \tag{1.32}
\]
for some $g_k \in \mathbb{C}^N$. However, this slight restriction on the $\mathcal{T}_k$ is still not enough to make the recovery problem well-posed.

As established in [50], it turns out that when the $a(\theta_k)$ are sinusoids as in the line spectral estimation problem (see Section 1.1.4), then we only need the mild assumption that the $g_k$ live in an a priori known, low-dimensional subspace to guarantee perfect recovery of both the atomic parameters $\theta_k = f_k$ and the unknown modulating waveforms $g_k$. That is,
\[
y = \sum_{k=1}^{K} c_k\, g_k \circ e_{f_k} \tag{1.33}
\]
with $g_k = B h_k$ for some unknown coefficients $h_k \in \mathbb{C}^L$ and known subspace $B \in \mathbb{C}^{N \times L}$, where $L \ll N$. This is the general signal model that we will use throughout this thesis. We point out in particular that this signal model (1.33) can be thought of as a blind line spectral model. Indeed, the overarching problem of this thesis is such blind line spectral problems. We encourage readers to refer to Section 1.1 if they are not familiar with the canonical line spectral estimation problem central to this work.

1.3.1 The Atomic Norm for Blind Super-Resolution Problems

Given the form of $y$ in (1.33), it is intuitive to expect that we will need to consider the atomic norm defined over a space of matrices. After all, we need to define an atomic norm that will produce tractable optimization problems, which need not be the case if we restrict ourselves to vector optimization: the constraints may become quadratic rather than linear as desired for analytic purposes. Indeed, this is the approach taken in [50], which we present in a manner that is somewhat clearer for the purposes of this thesis and the extensions that we will develop in later sections.

Plugging $g_k = B h_k$ into (1.33), we obtain
\[
y = \sum_k c_k (B h_k) \circ e_{f_k} = \sum_k c_k \mathcal{L}(e_{f_k} ⊚ h_k), \tag{1.34}
\]
where we define a new linear measurement operator $\mathcal{L} : \mathbb{C}^{N \times L} \to \mathbb{C}^N$ with
\[
\mathcal{L}(X) = (B \circ X)\mathbf{1}_L =: \operatorname{colsum}(B \circ X), \tag{1.35}
\]
where $\mathbf{1}_L \in \mathbb{R}^L$ is the vector of all ones. In other words, $\mathcal{L}(X)$ is the column-sum of the product $B \circ X$, which reincorporates the subspace information in $B$ into the lifted signals $e_{f_k} ⊚ h_k = e_{f_k} h_k^* \in \mathbb{C}^{N \times L}$. By defining $X = \sum_k c_k\, e_{f_k} ⊚ h_k$, we now have a matrix sensing model $\mathcal{L}(X) = y$. It is now clear that we can solve an optimization problem in the lifted space in which $X$ lives and connect this matrix optimization back to our vector measurements $y$ via the linear measurement operator $\mathcal{L}$ through a simple linear constraint.
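The following numpy sketch (with arbitrary illustrative dimensions of our own choosing) verifies this lifting identity numerically. One caveat: conventions for the conjugate in the outer product $e_{f_k} ⊚ h_k$ vary in the literature; below we form $e_{f_k} h_k^\top$ without a conjugate so that $\mathcal{L}(X) = y$ holds verbatim, and under the $h_k^*$ convention one simply replaces each $h_k$ by its conjugate.

```python
import numpy as np

# Verify (1.34)-(1.35): y = sum_k c_k (B h_k) o e_{f_k} equals L(X) with
# X = sum_k c_k e_{f_k} h_k^T. All sizes and data are illustrative.
rng = np.random.default_rng(1)
N, L, K = 64, 4, 3
n = np.arange(N)
B = rng.standard_normal((N, L)) + 1j * rng.standard_normal((N, L))  # known subspace
f = np.array([0.12, 0.37, 0.81])
c = rng.standard_normal(K) + 1j * rng.standard_normal(K)
h = rng.standard_normal((L, K)) + 1j * rng.standard_normal((L, K))  # h_k = h[:, k]

E = np.exp(2j * np.pi * np.outer(n, f))  # columns are e_{f_k}

# Direct model (1.33)-(1.34): elementwise (Hadamard) modulation.
y = sum(c[k] * (B @ h[:, k]) * E[:, k] for k in range(K))

# Lifted model: X collects the rank-one lifted signals; L(X) = colsum(B o X).
X = sum(c[k] * np.outer(E[:, k], h[:, k]) for k in range(K))
y_lifted = np.sum(B * X, axis=1)

print(np.allclose(y, y_lifted))  # True
```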


We now need to construct an atomic norm that can capture the structure of $X$, namely that $X$ is a $K$-sparse combination of rank-one matrices. With this observation in mind, we can define the atomic dictionary
\[
\mathcal{A} := \{a(f) ⊚ h : f \in [0, 1],\ \|h\|_2 = 1\} \tag{1.36}
\]
with induced atomic norm $\|\cdot\|_{\mathcal{A}}$, which is now a norm over a space of matrices by the easily seen symmetry of $\mathcal{A}$. We will denote the rank-one matrix atoms in $\mathcal{A}$ as $A(f, h)$ for notational convenience. We include the restriction $\|h\|_2 = 1$ without loss of generality since the necessary scaling for $g_k$ can be absorbed into the atomic coefficients $c_k$. It is important to define an atomic set that is constrained as much as possible if we hope to find an equivalent SDP characterization of the induced atomic norm. We note that $\mathcal{A}$ is equivalent to the atomic dictionary used in [50].

It is worth pointing out how the atomic norm $\|\cdot\|_{\mathcal{A}}$ defined here relates to the matrix nuclear norm $\|\cdot\|_*$. Since $\frac{1}{\sqrt{N}} a(f) ⊚ h$ is a unit-norm rank-one matrix (recall that $\|a(f)\|_2 = \sqrt{N}$ because its entries have unit modulus), we have
\[
\mathcal{A} \subset \sqrt{N}\, \{a ⊚ b : \|a\|_2 = \|b\|_2 = 1\},
\]
and hence $\frac{1}{\sqrt{N}}\|X\|_* \leq \|X\|_{\mathcal{A}}$ for any $X$. That is, the scaled nuclear norm, whose atomic set carries less rigid algebraic structure, lower bounds the atomic norm here. However, performing optimization with respect to $\frac{1}{\sqrt{N}}\|\cdot\|_*$ will not yield effective super-resolution algorithms because $e_f$ has a special trigonometric structure that should be exploited for good performance. The vanilla nuclear norm does not take this explicit structure into account since it is essentially trying to recover the factors of $X$ when we view $X$ as an $N \times L \times 1$ canonical polyadic tensor form. Moreover, direct optimization with respect to rank is intractable and hence must be relaxed in a convex (or nonconvex; see the Burer-Monteiro approach in Chapter 3) manner, much like the atomic norm itself is NP-hard to compute in general and must also be relaxed.

1.3.2 Atomic Norm Minimization for Blind Super-Resolution

Now that we have defined an atomic dictionary $\mathcal{A}$ that captures the structure of the blind super-resolution signal model (1.33) and the associated atomic norm $\|\cdot\|_{\mathcal{A}}$, we can set up
