
BAYES RISK A-OPTIMAL EXPERIMENTAL DESIGN METHODS FOR ILL-POSED INVERSE PROBLEMS

by Christian Lucero

© Copyright by Christian Lucero, 2013. All Rights Reserved.

A thesis submitted to the Faculty and the Board of Trustees of the Colorado School of Mines in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Mathematical and Computer Sciences).

Golden, Colorado

Signed: Christian Lucero
Signed: Luis Tenorio, Thesis Advisor
Signed: Dr. Willy Hereman, Professor and Head, Department of Applied Mathematics and Statistics


ABSTRACT

Optimal experimental design methods are well established and have been applied in a variety of different fields. Most of the classical methods in optimal experimental design, however, neglect the subject of ill-posedness, an issue that is prevalent when solving inverse problems. In order to solve most real-world inverse problems of interest, we must use regularization methods to help stabilize the final solution. The use of regularization introduces a bias into the obtained estimates that classical optimal experimental design techniques do not take into account. The primary goal of this thesis is to consider some advancements in optimal experimental design methods that can be applied to ill-posed inverse problems.

We present a general method known as Bayes risk A-optimal design that uses the prior moment conditions of the inverse problem. The Bayes risk A-optimal design has a natural connection to the Tikhonov regularization estimator. We demonstrate how this connection arises and how this method can be used in practice. We also address some of the important theoretical considerations behind the use of A-optimal, and more generally all linear-optimal, design criteria. One of the most important results in optimal design theory is the so-called equivalence theorem. We present an updated version, the generalized equivalence theorem, that can be used with Bayes risk A-optimal designs for ill-posed inverse problems. This updated form of the equivalence theorem is paramount as it justifies the use of techniques that address ill-posedness for all linear-optimal design criteria.

Optimal designs are typically found by using either a sequential search algorithm or a convex optimization routine. To address the actual computation of an A-optimal design for ill-posed inverse problems, we discuss the strengths and weaknesses of both general approaches and ultimately present a hybrid method that uses both techniques for determining a design. The result is a method that draws on the strengths of both approaches: it is highly customizable and can be used to find A-optimal designs with specific attributes that are desirable to the user.


TABLE OF CONTENTS

ABSTRACT

LIST OF FIGURES AND TABLES

CHAPTER 1 INTRODUCTION
1.1 Historical Review Of Optimal Experimental Design
1.2 OED For Well-posed Inverse Problems
1.3 OED For Ill-posed Linear Inverse Problems
1.3.1 Optimality Criteria
1.3.2 Determining An Optimal Experimental Design
1.3.3 Computational Methodology
1.4 Contributions of this dissertation

CHAPTER 2 BAYES RISK A-OPTIMAL DESIGN FOR ILL-POSED INVERSE PROBLEMS
2.1 Classical Optimal Experimental Design
2.1.1 Statistical Criteria For Optimal Experimental Design
2.1.2 Solving the Classical OED Optimization Problem
2.2 An Affine Bayes Risk Estimator
2.3 Bayes Risk A-optimal Design Methodology
2.3.1 Parameter Selection in the Bayes Risk A-optimal Design

CHAPTER 3 ON EXPERIMENTAL DESIGN IN THE CONTEXT OF TIKHONOV REGULARIZED INVERSE PROBLEMS
3.1 Summary
3.2 Linear Optimality Criteria
3.2.1 The Generalized Information Matrix and Generalized Equivalence Theorem
3.3 Algorithms for Optimal Design
3.3.1 Fedorov Sequential Algorithm
3.3.3 Convergence of the sequential algorithm
3.3.4 Fedorov Sequential Algorithm with Deletions
3.3.5 The CVX Toolbox and the Ill-posed A-optimal Design Problem
3.3.6 The Fedorov-CVX Algorithm
3.4 An Ill-posed Example Problem
3.5 Solution of the Ill-posed A-optimal Design Problem
3.5.1 Using Semidefinite Programming Alone
3.5.2 Using the Sequential Algorithm and Semidefinite Programming
3.5.3 Example 1 - Trimming a Design from an Equally Weighted Grid
3.5.4 Example 2 - Building from a Sparse Initial Set
3.5.5 Inverse Problem Solutions Using Optimal Designs
3.5.6 Summary

CHAPTER 4 EXTENSIONS AND FUTURE WORK
4.1 Randomized Trace Estimators
4.1.1 Coverage Probability of the Hutchinson Trace Estimator
4.2 Optimal Design for Joint Inversion: An Example
4.3 Future Work

REFERENCES CITED


LIST OF FIGURES AND TABLES

Figure 3.1 We use CVX to solve the SDP problem for the case N = 100 on a uniform grid of the design space. The resulting Bayes risk is 3.708e-01. We can use the equivalence theorem to check whether the algorithm has achieved optimality. The green filled circles are those experiments which satisfy the optimality condition $\phi(x^*, \xi) = \Psi(x, \xi)$.

Figure 3.2 In this example our sequential algorithm started with the same grid as the CVX problem, and the Fedorov algorithm was allowed to run for 250 iterations for the case N = 100. We see that 31 points are included in this design and the Bayes risk has been reduced to 3.684e-01. After 250 iterations, a check of the equivalence theorem shows that we have not quite achieved true convergence. We now use CVX to find the final optimal weights.

Figure 3.3 Given the design set after 250 iterations of the sequential search, we now use CVX to determine the final design and the corresponding weights. Only 18 experiments remain in this optimal design. The Bayes risk has been further reduced to 3.675e-01. A final check of the equivalence theorem shows that we have found the weights for an optimal solution.

Figure 3.4 In this example, we start with design points {(1.5, 5.5), (1.5, 10), (1.5, 1), (0, 5.5), (3, 5.5)} and N = 100 as before. After 250 iterations of the Fedorov algorithm, we obtain a result similar to the previous example that was initialized with 121 design points. The Bayes risk has been reduced to 3.689e-01. As before, after 250 iterations a check of the equivalence theorem shows that we have not quite achieved true convergence. We now use CVX to find the optimal weights.

Figure 3.5 Given the design set after 250 iterations of the sequential search, we now use CVX to determine the final design and the corresponding weights. We see that 20 experiments are in this optimal design. The Bayes risk has been further reduced to 3.688e-01. A final check of the equivalence theorem shows that we have found the weights for an optimal solution; it indicates that experiments 5 & 21 are not included in the optimal solution.

Figure 3.6 Tikhonov estimates for the uniform design, the CVX standalone optimal design, and the optimal FED-CVX design. The true model input is the solid black line. It is clear that FED-CVX does a better job of recovering the true model parameters than the other two designs. The number of experimental configurations used in each design is 121, 16, and 17, respectively.

Figure 4.1 A re-weighting of the experimental configurations based upon a joint optimal design. The original weights are in red, the new weights are in blue. The re-weighting results in a use of only 41.84% of the configurations.

Table 3.1 Review of Bayes Risk Values.

Table 4.1 Coverages based upon 100,000 confidence intervals. For the case of matrix size 100, we used the actual generalized dispersion matrices from the CVX


CHAPTER 1 INTRODUCTION

This dissertation describes novel methods for finding optimal experimental designs (OEDs) for ill-posed inverse problems. We begin by describing what is meant by an optimal experimental design and reviewing the classic literature of that field before moving on to define and describe inverse problems, specifically the challenges posed by a property called ill-posedness.

1.1 Historical Review Of Optimal Experimental Design

By optimal experimental design we mean a protocol for collecting data that respects both the scientific and statistical goals as well as external constraints such as cost or time limitations. A classic and well-studied example is simple linear regression [1, 2]. In that context, the design specifies which values of the covariates to collect, where different regions of the covariate space may cost different amounts to sample. Intuitively, one wants to balance observations that are most informative about the unknown regression coefficients against those that are less expensive to obtain. Generalizing this basic intuition to more elaborate situations is the goal of modern experimental design.

The data acquired during an experiment provide a source of information from which, using sophisticated analysis techniques, we do our best to produce a judgement about the underlying model. For example, a large number of procedures have been developed for parameter estimation and inference in linear regression. Yet even with our best tools, we are limited by the quality of the information gathered during the experiment. It is in the experimenter's best interest to optimize the quality of the data by properly designing the experiment before any data collection takes place. This is typically done by selecting which experiments to carry out and how often each experimental configuration should be repeated.

Each type of experiment also faces a unique set of constraints, e.g. cost, physical limitations, time, etc. A particular experimental configuration may produce data that give a faithful representation of the model but might be infeasible due to the constraints. Carrying out all experimental configurations, each repeated infinitely many times, would clearly yield the most knowledge and would be globally optimal, yet it is infeasible. Instead, a subset of all possible configurations should be considered: those which best complement each other and are optimal in some sense. To be specific, an optimal design is one that is highly accurate and produces reliable model parameter estimates given the constraints. It should be efficient and practical to implement, while satisfying optimality conditions set by the experimenter. It should also be robust in the sense that perturbations in the solution should lead to reasonable compromises in the experimental design.

While very early work on optimal experimental design dates to the early 1800s (see, for example, Gergonne [3]), we will instead take the work of R.A. Fisher as the major starting point of modern experimental design. It is due to the interdisciplinary work of Fisher that we can see the direct benefits of optimal experimental design in a variety of disciplines. The use of statistical methodology in the design of experiments was popularized in the 1920s and 1930s by Fisher's analysis of the data accumulated from the classical field experiments at the Rothamsted Experimental Station. It is from Fisher's research that statistical methods became an important part of the experimental process and not simply tools to analyze the final data [4]. In particular, we use optimal designs determined by optimality criteria based upon some function of what is now called the Fisher information matrix.

The core theory of optimal experimental design was developed during the 1950s and 1960s, led primarily by the works of Box, Hunter, Kiefer, Wald & Wolfowitz. One of the most important tools from this era is the general equivalence theorem, first proved by Kiefer & Wolfowitz [5] in 1960, which provides conditions for the global optimality of a design. It is also during this time that a design became a probability measure and continuous design theory began [6]. Optimal design theory also began to incorporate convex analysis, which helped link many of the optimality criteria into a larger, more general class of problems that could be addressed with new methods in mathematical programming. Many of the modern approaches to OED follow a basic sequential search algorithm presented by Fedorov [1]. Modifications of this basic algorithm have led to a class of design search procedures known as exchange algorithms. There are also other techniques that are not based upon Fedorov's algorithm, such as the direct solution of the optimization problem using other numerical procedures.

We can also divide the research involving OED into frequentist and Bayesian approaches. In general, frequentist methods focus upon optimizing some function of the information matrix directly, while Bayesian methods use a decision-theoretic approach that often parallels the non-Bayesian approach. A good foundation for the non-Bayesian approaches is presented in Fedorov [1], Steinberg & Hunter [4], Pukelsheim [2], and Fedorov & Hackl [7]. A basic overview with some resources for solving the associated optimization problem is given by Boyd & Vandenberghe [8]. Bayesian methods have been considered as early as 1956, starting with a paper by Lindley [9]. A good overview of Bayesian methodologies as of 1995 is presented by Chaloner & Verdinelli [10].


Maurer et al. [11] also present a concise summary of recent advances in optimized geophysical survey design based upon Bayesian methodology. Guest & Curtis [12, 13] make use of Bayesian techniques for geophysical surveys with nonlinear models. We also see Bayesian techniques in recent work by Huan & Marzouk [14] applied to nonlinear simulation-based models.

This dissertation focuses on OED for ill-posed inverse problems. Inverse problems arise in many branches of science and mathematics and are used to convert observed measurements into estimates of parameters describing a physical system that we cannot observe directly. Even though inverse problems are typically ill-posed, the bulk of the existing research in OED focuses on well-posed problems. Some OED research papers have avoided ill-posedness by trying to enforce well-posedness in the system matrix. For example, Curtis avoids ill-posedness in many of his papers by sequentially designing the system matrices to remain non-singular [15–18]. In his 1986 review, O'Sullivan [19] discussed using the mean squared error (MSE) as a way to study the performance characteristics of an inversion algorithm. The motivation for using the MSE arose from the use of regularization procedures for the stable solution of ill-posed inverse problems [19]. The MSE can be decomposed into bias and variance components that measure the systematic and random errors, respectively. Early work by Box & Draper [20] also considered using the MSE but focused primarily on minimizing the bias contribution only [21]. Despite O'Sullivan's review more than two decades ago, very little progress has been made on OED for ill-posed problems. Recent works by Bardow [21] and Haber et al. [22] still use the same MSE decomposition described by O'Sullivan but seek different ways to estimate the bias and carry out the optimization.

Efficient computation of an OED is highly desirable, and many techniques have been developed to carry out the optimization. In general, numerical techniques in OED follow one of two general paths. The first is an a priori selection of all the possible experimental configurations that can be carried out. Based upon this set, the numerical techniques must then determine the weight to be assigned to each experimental configuration in the optimal design. The main drawback of these techniques is the inability to consider new experimental configurations beyond the original set. We will refer to such methods as "rigid". The alternative path is algorithms that can explore the inclusion of new experimental configurations from a larger set of candidate configurations. In many cases the candidate set can include infinitely many possibilities. These algorithms must therefore find both the best experimental configurations to include and their associated weights. Such algorithms tend to require more computational effort to determine an OED than a rigid method.


Many advances in OED come from stochastic approximation algorithms. Robbins and Monro [23] helped pioneer stochastic approximation methods with an algorithm that iteratively finds the extrema of functions that cannot be computed directly (or are expensive to compute directly). Following the publication of the Robbins-Monro algorithm, Kiefer & Wolfowitz [24] followed up with another algorithm that stochastically estimates the maximum of a function. Haber et al. look at efficient ways to carry out the computations required in the optimization problems for both linear [22] and nonlinear OEDs [25]. For example, Haber et al. [26] use a stochastic trace estimator to approximate the trace computation required in the A-optimal design.

Many advances in OED have also come from sequential designs that make use of greedy algorithms to iteratively improve the OED. Sequential OED has been used since the 1960s and can be found in papers by Box & Hunter [27] and Draper & Hunter [28], amongst many others. There are a number of sequential algorithms that can be considered; for example, a number of papers by Curtis [15–18] make use of genetic algorithms to find OEDs in geophysical survey design. Instead of an exhaustive overview we will focus upon a large class of sequential algorithms known as exchange algorithms, which are based upon the work done by Wynn [29, 30] and Fedorov [1]. Exchange algorithms work by determining the best experimental candidate(s) to include in an updated design, while considering the elimination of experimental configurations already within the design. The work done by Wynn [29, 30] resulted in a simple update formula for D-optimal designs, while Fedorov focused upon the more general case of linear-optimal design criteria. Numerous modifications of the Fedorov exchange algorithm (FEA) have been presented in the literature by many authors, including Nguyen & Miller [31], Cook & Nachtsheim [32], and Coles & Morgan [33]. Updates to the FEA focus upon more efficient selections of the exchanges made between candidate points and points already included within a design. Fedorov & Hackl also discuss many of the proposed improvements to the original algorithm in their later work [7].

Recent papers still consider further improvements of sequential design techniques; for example, Coles & Morgan [33] present a method of fast sequential experimental design for linearized geophysical inverse problems. Recent works by Guest & Curtis [12, 13] also use sequential design techniques for geophysical survey design. Huan & Marzouk [14] have recently applied sequential OED along with polynomial chaos techniques to nonlinear OEDs. We will use Fedorov's sequential algorithm in this thesis and apply it to the solution of Bayes risk A-optimal designs for ill-posed problems.

Finally, there are other aspects of OED that have been explored, such as the types of constraints. Work by Haber et al. looks at sparse optimal experimental design for linear ill-posed problems [22] and nonlinear ill-posed problems [25]. The ability to control the sparsity of an OED can be useful when the experimenter is limited by resource constraints. Other means of controlling sparsity can come from the use of a loss function based upon the individual costs of experiments, as presented by Fedorov [1]. We present newer work that focuses on enforcing sparsity in ill-posed problems in [26], which can be found in the next chapter.

1.2 OED For Well-posed Inverse Problems

Inverse problems arise when we wish to estimate the parameters that describe a physical object or system by using indirect observations of those parameters. A common inverse problem is the use of tomography in medical imaging, where observations are used to estimate the size and shape of a tumor. In this example, the observations are carried out according to some design, with the hope that the selection of experimental configurations behind the observations produces a good estimate of the size and shape of the actual tumor. The use of optimal experimental designs (OEDs) in inverse problems can help to reduce the number of experimental configurations (and overall observations) needed to obtain a good estimate of the model parameters.

Inverse problems are often ill-posed, an issue that arises when a system of equations has a large condition number. Ill-posed problems suffer from numerical instability, meaning that small errors in the observed data can lead to large deviations in the parameter estimate. We use regularization methods to help control the stability of ill-posed problems. We will revisit OED for ill-posed inverse problems shortly, but for now we will focus on the well-posed case.

A large volume of research associated with OED has focused upon the solution of well-posed inverse problems. In the well-posed discrete linear problem the data for the experimental design are modeled as
$$d_i = a(x_i)^T m + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1.1)$$
where the $a(x_i)$ are functions that control the experimental design given a set of design variables $x_i \in \mathcal{X}$. Here $\mathcal{X}$ is the design space: the set of all experimental configurations that can feasibly be carried out. The data $d_i$ obtained in an experimental configuration are a function of the unknown model parameters $m \in \mathbb{R}^k$, and the $\varepsilon_i$ form an independent sequence of random errors with mean zero and variance $\sigma^2$. Let $k_i$ be the number of observations made for the experimental configuration $a(x_i)^T$; then the sum $k_1 + k_2 + \cdots + k_n = N$ represents the total number of observations obtained. Data are acquired for each of the experimental configurations under consideration. The experimental design is simply the instruction set that dictates which experiments should be carried out (and how often) for the overall goal of recovering the model parameters $m$. We will use the general notation found in Fedorov [1] that poses the experimental design problem as the set
$$\xi_N = \begin{Bmatrix} x_1, \ldots, x_n \\ w_1, \ldots, w_n \end{Bmatrix} = \begin{Bmatrix} x_i \\ w_i \end{Bmatrix}_{i=1}^{n},$$
with $w_i = k_i/N$ and $\sum_i w_i = 1$. Experimental designs expressed in this form are typically called the continuous (sometimes referred to as normalized) design of the experiment [8].

The goal of optimal design is to choose an experimental design that is optimal with respect to some statistical criterion. For well-posed problems (where $\hat m$ is an unbiased weighted least squares estimator) the frequentist approach to OED is to minimize a function of the variance associated with the inverse estimator [1]. The optimality criteria are therefore functionals that act upon the information matrix (the inverse of the covariance matrix $\Sigma_{\hat m} = (\sigma^2/N)(A^TWA)^{-1}$). These functionals scalarize the matrices and focus upon one or more characteristics that the experimenter wishes to optimize. Different objective functions may be used with different justifications, giving rise to the alphabetic optimality criteria. In particular, A-optimal designs minimize the trace of the covariance matrix, which is directly meaningful as the minimization of the MSE. The most popular scalarization is the D-optimal design, which minimizes the determinant of the covariance matrix, known as the generalized variance of $\hat m$ [1]. For more details on non-Bayesian OED for well-posed problems see [1, 2] and the expanded discussion in the next chapter.
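As a concrete illustration, here is a minimal numpy sketch that evaluates the A- and D-criteria for a small synthetic design; the design matrix, weights, and noise level are all hypothetical, chosen only to show how the covariance matrix is scalarized.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: 6 candidate configurations, 2 model parameters,
# with weights w_i = k_i / N summing to one
A = rng.normal(size=(6, 2))                  # rows are the a(x_i)^T
w = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1])
sigma2, N = 1.0, 100                         # noise variance and observation budget

cov = (sigma2 / N) * np.linalg.inv(A.T @ np.diag(w) @ A)  # covariance of m_hat

print("A-criterion (trace):", np.trace(cov))              # A-optimality scalarization
print("D-criterion (determinant):", np.linalg.det(cov))   # generalized variance
```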

The Bayesian approach to the well-posed case is to use a decision-theoretic framework that finds the design maximizing an expected utility function that appropriately describes the goals of a given experiment [10]. The choice of the utility function often leads to objective functions that are similar to those arising from the various alphabetic optimality criteria of the non-Bayesian approach. For example, the most popular approach is a utility function based on maximizing the Shannon information (or entropy). In the case of a linear Gaussian model, the utility function corresponding to the Bayesian D-optimal design reduces to maximizing $f(\xi) = \det\{M(\xi) + \Sigma^{-1}\}$ [10]. The main difference from non-Bayesian D-optimality is the inclusion of the prior covariance in the objective function. The other Bayesian alphabetic optimality objective functions differ in the same basic way.

We would like to look beyond the well-posed case; we will therefore consider how to generalize linear optimality criteria to the ill-posed case. This means that we must both generalize the theory of linear-optimal OEDs and expand (or develop new) computational methods that can be used for the generalized case.


1.3 OED For Ill-posed Linear Inverse Problems

As in most modern scientific areas, the availability of vast amounts of data has led to large-scale problems, in particular large-scale ill-posed inverse problems. Large-scale problems are difficult to deal with in general due to the considerable computational resources that many of the algorithms require in order to obtain a solution. Ill-posedness also requires the use of additional techniques, and hence resources, in order to obtain a meaningful solution.

The main difficulty in such problems is that the estimates are biased due to the use of regularization techniques that are needed to find an estimate. The resulting mean squared error of the estimate now has a non-zero bias component, and this bias may dominate the overall error. Since the bias depends on the parameters to be recovered, its control relies on the available prior information one has about the parameters. Therefore the major issues when dealing with OED for ill-posed problems are controlling the bias that is introduced from regularization techniques and controlling the additional computational cost required by the proposed methodology.

When it comes to experimental design, one usually has prior experience with the type of experiments to be conducted. One may even have collections of training models that provide information about the objects one hopes to recover. A Bayesian approach thus seems natural [26].

The main focus of this dissertation is modern developments in OED for ill-posed linear inverse problems. Some recent attention has been given to large-scale linear ill-posed problems in [22, 26]. We will expand on some of the ideas and methods presented in these papers. In particular, we consider several main areas of research: optimality criteria, selection of the OED, and efficient computational methods for finding an optimal experimental design.

1.3.1 Optimality Criteria

The main issue behind OED for ill-posed problems is exactly how to handle the ill-posedness and how to generalize the traditional methods of OED to these cases. When solving the parameter estimation problem we use regularization techniques to help stabilize the solution. This regularization introduces a bias in the recovered estimate. Therefore optimality criteria that minimize only the variance may overlook a significant contribution from the bias of the estimate [21]. An alternative approach is to use the Moore-Penrose pseudoinverse to help design a system matrix that has a balanced eigenvalue spectrum [18]. One measure of a matrix's conditioning is the ratio of the maximal to minimal eigenvalues, so an OED focused on limiting this ratio may help to reduce the poor conditioning of the system matrix. We will consider bodies of work that focus on using the bias first.

Since the bias depends on knowing the exact quantity that we are trying to estimate, we must instead use methods to estimate it, bound it, or use a priori knowledge of the true model parameters. Bardow addresses this issue primarily from a working-cycle approach, also considered by [34–36], where an iterative cycle of refinement is based upon an initial guess of the estimate, an optimal design step, data collection, and data analysis culminating in an updated estimate. Bardow considers the impact of poor choices for the initial guess and looks to robust experimental design techniques to help stabilize the solution of the OED. The goal of robust OED is to account for the dependence of the optimal design upon the unknown parameter values. This of course is an essential issue that must be considered when trying to find OEDs that include a bias component. The robust OED techniques that Bardow illustrates in [34] primarily follow the work of [37–40], which includes the min-max design and the average design. The min-max (or worst-case) approach attempts to optimize the worst case over all functions $m$ in a specified class $\mathcal{M}$, while the average design optimizes the expected value of the design objective [21]. These methods are similar to those proposed in the decision-theoretic methodology discussed by Chaloner & Verdinelli [10], as well as some earlier methods reviewed by O'Sullivan [19] on the average mean square error and maximum square error design criteria.

Haber et al. [22] approach the estimation of the bias from three perspectives: the average optimal design (AOD), the Bayesian optimal design (BOD), and the empirical Bayesian design (EBD). The common approach of all three methods is the estimation and minimization of a risk functional for the MSE. Assuming that the model is in a bounded convex set $\mathcal{M}$, the AOD simply considers the average of the bias over $\mathcal{M}$ [22]. The BOD assumes that some models within $\mathcal{M}$ are more likely, so an expectation of the MSE over a distribution $\pi$ is taken. This approach is similar to the average design approach taken by Bardow and other authors. Finally, the EBD is motivated by the use of an available training set. When a collection of plausible models is available from previous applications, we may assume that these empirical observations are realizations from an unknown multivariate distribution and take a sample average over these realizations. More approaches to minimizing the Bayes risk are described in [26] and discussed in the upcoming chapters.

A large body of research was done around the turn of the century by Curtis [15–18] on optimality criteria for OED related to geophysical survey problems. These works primarily focus upon the use of 'quality measures' chosen to exploit the eigenvalue and pseudoinverse approach to OED. Some of these measures have direct associations with the alphabetic criteria that are common in OED, such as the sum of the eigenvalues (equal to the trace of $A^TA$). The main idea behind these approaches is to maximize the area beneath the eigenvalue spectrum, thereby increasing the minimal eigenvalues (and hopefully improving the overall condition number). Some of these measures also include a damping parameter that is designed to reduce sensitivity to smaller eigenvalues [16]. The exact choice of this damping parameter seems to remain unclear and problem dependent (based upon the eigenvalues). The main hindrance to the use of many of these quality measures is the computational expense of computing the eigenvalues directly. Instead, Curtis proposes efficient ways to estimate some of these measures [15–17] and notes that the overall power of these measures lies in their ability to increase the smaller eigenvalues at the expense of the larger ones.

1.3.2 Determining An Optimal Experimental Design

There are a number of approaches to determining an optimal experimental design. Traditionally the focus has been on obtaining a globally optimal design by direct optimization. Since finding a globally optimal solution is highly desirable, efficient ways to directly solve or approximate the solution as the problem size increases are very important. Both [22] and [26] look at global optimization for ill-posed problems. Other general techniques for finding a global solution make use of stochastic optimization methods such as simulated annealing. In many applications, however, finding the global optimum may not be computationally tractable, and any improvement in the overall design is desirable [16], which is why we might consider the following alternatives.

Some approaches to OED are not intended to find the global OED but instead focus on efficient determination of a reasonably optimal design. The typical idea behind these schemes is that the system matrix is iteratively constructed by considering experimental candidates that will improve the optimality criteria. Much of the early work by Curtis considered the use of genetic algorithms for this construction [15–18]. Genetic algorithms suffer from a variety of problems in practice, including the tendency to converge towards local optima, as well as poor scalability. Sequential experimental design is another approach gaining in popularity; see [11–14, 33] amongst others. In general, a sequential design uses the results of one set of experiments to help plan the next set of experiments [14]. A good example of how sequential experimental design works is found in [33]. Sequential designs are greedy algorithms that seek to make a locally optimal choice at each stage in the hope that these choices will lead to a globally optimal solution. While it is possible for such an algorithm to find a globally optimal OED, depending on the application these alternatives offer a reasonable trade-off between optimality and efficiency.


1.3.3 Computational Methodology

Improved computational methods have been proposed for solving the optimization problems associated with OED for ill-posed inverse problems. For the robust design problems, Pronzato and Walter [38] proposed the use of stochastic approximation techniques and argued that they are the simplest tools for optimizing the robust design criteria. Huan and Marzouk [14] use stochastic approximation methods that mimic the steepest-descent algorithm but use only two random perturbations to estimate the gradient regardless of the problem's dimension. Stochastic methods have also been used more recently by [22] to help mitigate the computational effort required to compute the trace within the A-optimal objective function. Following the use of the Hutchinson stochastic trace estimator [41], Haber et al. illustrate how this approximation leads to efficient computations of the objective function and gradient needed to solve the optimization problem [22, 26].
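For readers unfamiliar with the Hutchinson estimator, here is a minimal sketch of the basic idea on a synthetic symmetric matrix; it shows only the generic trace estimator, not the specific way [22, 26] embed it in the A-optimal objective and gradient.

```python
import numpy as np

def hutchinson_trace(apply_B, p, n_samples, rng):
    """Hutchinson estimator: E[v^T B v] = trace(B) when the entries of v
    are independent Rademacher (+/-1) variables; only matvecs are needed."""
    total = 0.0
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=p)
        total += v @ apply_B(v)
    return total / n_samples

rng = np.random.default_rng(1)
B = rng.normal(size=(50, 50))
B = B @ B.T                                  # symmetric test matrix
est = hutchinson_trace(lambda v: B @ v, 50, 200, rng)
print(f"estimate: {est:.2f}, exact: {np.trace(B):.2f}")
```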

Sequential design iteratively adds new rows to the system matrix based upon new or repeated experiments chosen by the greedy algorithm. The quality measures then need to be updated for the new system matrix, and these computations can be cumbersome (e.g. finding eigenvalues). Coles and Morgan [33] demonstrate how rank-one updates following the Sherman-Morrison formula can be applied to iteratively update quality measures based upon the determinant. They go on to show how an SVD factorization can be updated quickly if the user can afford an early investment in its computation.
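A minimal sketch of the determinant side of such rank-one updates, using the matrix determinant lemma (the companion of the Sherman-Morrison inverse update) on a synthetic matrix; the actual quality-measure updates of [33] are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 8
M = rng.normal(size=(p, p))
M = M @ M.T + p * np.eye(p)                  # current information matrix A^T A
a = rng.normal(size=p)                       # row of a candidate experiment

# Matrix determinant lemma: det(M + a a^T) = det(M) * (1 + a^T M^{-1} a),
# so scoring a candidate against a determinant-based measure costs one solve
det_updated = np.linalg.det(M) * (1.0 + a @ np.linalg.solve(M, a))

print(np.isclose(det_updated, np.linalg.det(M + np.outer(a, a))))  # True
```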

A large class of sequential design algorithms is known as exchange algorithms. These are search algorithms that explore the design space and iteratively update the design by a series of exchanges. These exchanges consist of the addition of new experimental design points and the possible removal (or re-weighting) of points in the previous design. Most of these algorithms are based upon the fundamentals presented by Wynn [29] and Fedorov [1]. Algorithms that correspond to the D-optimality criterion have update formulas similar to those presented by Coles & Morgan [33] and are based upon the initial work done by Wynn [29]. Exchange algorithms that correspond to linear optimality criteria are based primarily on Fedorov [1].

1.4 Contributions of this dissertation

The format of this dissertation is unlike that of a traditional thesis: it consists of a collection of published (or submitted) papers that discuss the new methodology. We provide a basic summary of the pertinent work included within each of the papers, with some expanded formulations (and/or explanations) that were not included in the final or submitted drafts of the papers.


The main focus of Chapter 2 is the general method of Bayes risk A-optimal design for ill-posed inverse problems. This is based upon work that we previously published [26], but with an expanded discussion that includes some newer developments in the general method. Chapter 3 focuses upon an update to linear-optimal design theory that includes a generalization of some of the fundamental theorems in OED to the ill-posed case. In Chapter 3 we also consider some computational approaches to solving the Bayes risk A-optimal design problem for ill-posed inverse problems. Chapter 4 consists of some auxiliary topics related to the two papers, including a basic framework for joint optimal design that uses the Bayes risk A-optimal design methodology.


CHAPTER 2

BAYES RISK A-OPTIMAL DESIGN FOR ILL-POSED INVERSE PROBLEMS

In this chapter we will discuss Bayes Risk A-optimal design for ill-posed inverse problems. We will begin by reviewing some of the definitions and approaches used in classic optimal design. We then introduce the Bayes Risk A-optimal design approach as a general method for finding OEDs. This chapter serves as an expanded discussion of our paper on “Numerical methods for A-optimal designs with a sparsity constraint for ill-posed inverse problems” [26].

2.1 Classical Optimal Experimental Design

In a well-posed discrete linear inverse problem the data for the experimental design are modeled as

$$d_i = a(x_i)^T m + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (2.1)$$
where the $a(x_i)$ are functions that control the experimental design given a set of design variables $x_i \in \mathcal{X}$, and the matrix
$$A = \begin{pmatrix} a(x_1)^T \\ \vdots \\ a(x_n)^T \end{pmatrix} \qquad (2.2)$$

is full column rank. The variables $x_i$ are also referred to as the design points or experimental configurations, while the set $\mathcal{X}$ is called the design space [1]. The matrix $A$ is often referred to as the design matrix. The rows of $A$ represent the different experimental configurations that will be used to collect the data. Repeated experiments correspond to repeated rows within the design matrix. The goal of experimental design is to determine the experimental configurations and the number of repetitions that should be used in the design matrix. As we explained in the previous chapter, this definition is difficult to work with from a computational standpoint. Instead we will use the continuous experimental design definition, where instead of repetitions of rows we use weights.

In order to see the role of weights in the optimal design, we will first demonstrate their role in the solution of the inverse problem. We rewrite the model as
$$d_i = a_i^T m + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (2.3)$$
where $\{a_i^T\}$ for $i = 1, \ldots, n$ is the set of vectors corresponding to the experimental configurations determined by the vector entries (i.e. $a_i = a(x_i)$). The data $d_i$ obtained in an experimental configuration are a function of the unknown model parameters $m \in \mathbb{R}^p$, and the $\varepsilon_i$ form an independent sequence of random errors with mean zero and variance $\sigma^2$. Let $k_i$ be the number of observations made for the experimental configuration $a_i^T$; then the sum $k_1 + k_2 + \cdots + k_n = N$ represents the total number of observations obtained.

The data can then be expressed as a system of equations
$$d = Am + \varepsilon, \qquad (2.4)$$

where the data vector $d \in \mathbb{R}^N$ and the unknown model parameters $m \in \mathbb{R}^p$ are related directly by the rows of the matrix $A$, which represent experiments, plus experimental noise $\varepsilon$. We seek to estimate the model parameters $m$ from (2.4) using least squares:
$$\hat m = \operatorname*{argmin}_m \sum_{i=1}^{n} \sum_{j=1}^{k_i} \left( d_{ij} - a_i^T m \right)^2. \qquad (2.5)$$

We can reduce this problem to an equivalent form that assigns weights to repeated experiments. Let $w_i = k_i/N$ be the ratio of the number of observations obtained for configuration $i$ to the total taken. Clearly, $w_1 + \cdots + w_n = 1$. Then observe that
$$\sum_{j=1}^{k_i} (d_{ij} - a_i^T m)^2 = \sum_{j=1}^{k_i} (d_{ij} - \bar d_i)^2 + k_i(\bar d_i - a_i^T m)^2, \qquad (2.6)$$
where
$$\bar d_i = k_i^{-1} \sum_{j=1}^{k_i} d_{ij}. \qquad (2.7)$$
Now, by keeping only the terms involving $m$ and dividing by the constant $N$,
$$\frac{1}{N}\sum_{j=1}^{k_i} (d_{ij} - a_i^T m)^2 = w_i(\bar d_i - a_i^T m)^2 + \text{constant}. \qquad (2.8)$$
Then the least-squares estimate becomes
$$\hat m = \operatorname*{argmin}_m \sum_{i=1}^{n} w_i(\bar d_i - a_i^T m)^2, \qquad (2.9)$$
which we rewrite as
$$\hat m = \operatorname*{argmin}_m\, (\bar d - A_n m)^T W (\bar d - A_n m), \qquad (2.10)$$
where $W = \mathrm{diag}\{k_1/N, \ldots, k_n/N\}$. Now the rows of $A_n$ correspond to the distinct experimental configurations; $A_n$ is well-conditioned, $A_n^T A_n$ is non-singular, and $\mathrm{Var}(\bar d_i) = k_i^{-1}\sigma^2$.
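A short numerical check of this reduction on a small synthetic problem (all sizes and values below are hypothetical): the ordinary least-squares estimate over all $N$ repeated rows coincides with the weighted estimate (2.10) built from the averaged data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, N = 4, 2, 40
A = rng.normal(size=(n, p))                  # distinct configurations a_i^T
k = np.array([16, 12, 8, 4])                 # repetitions k_i, sum = N
m_true = np.array([1.0, -2.0])

# Simulate repeated observations d_ij = a_i^T m + eps_ij
rows = np.repeat(np.arange(n), k)
d = A[rows] @ m_true + 0.1 * rng.normal(size=N)

# Full least squares over all N rows, as in (2.5)
m_full, *_ = np.linalg.lstsq(A[rows], d, rcond=None)

# Equivalent weighted form (2.10): averaged data and weights w_i = k_i / N
d_bar = np.array([d[rows == i].mean() for i in range(n)])
W = np.diag(k / N)
m_wls = np.linalg.solve(A.T @ W @ A, A.T @ W @ d_bar)

print(np.allclose(m_full, m_wls))            # True: the two estimates agree
```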


It is at this point that we can see that the original problem (2.5) can be cast as a weighted least-squares problem, where the weights define a proportional allocation of the experiments to be used in the regression. We may also interpret the resulting $w_i$ as the relative proportional allocation of experimental "run-time" that should be afforded to experiment $i$. This interpretation can be quite useful in experiments that have continuous time allocations (i.e. exposures) for run-time. To convert back to the discrete case, we simply determine the weights $w_i$ and estimate $k_i$ with $\hat k_i = \lfloor N w_i \rfloor$.

Let us thoroughly explore the solution to the least-squares problem: we have
$$\hat m = \operatorname*{argmin}_m S_k(m) = \frac{1}{2}(\bar d - A_n m)^T W (\bar d - A_n m) = \frac{1}{2}\left( \bar d^T W \bar d - 2\bar d^T W A_n m + m^T A_n^T W A_n m \right). \qquad (2.11)$$
Taking the gradient of $S_k$ and setting it to zero,
$$\nabla_m S_k(m) = -A_n^T W \bar d + A_n^T W A_n m = 0 \;\Rightarrow\; A_n^T W A_n m = A_n^T W \bar d \;\Rightarrow\; \hat m = (A_n^T W A_n)^{-1} A_n^T W \bar d. \qquad (2.12)$$

For the well-posed case, we will assume that the matrix $A_n^T W A_n$ is invertible. We will consider the case where $A_n^T A_n$ is ill-conditioned later in this chapter. The expected value of the least-squares estimate is
$$E(\hat m) = (A_n^T W A_n)^{-1} A_n^T W\, E(\bar d), \qquad (2.13)$$
where $E(\bar d)$ is found by linearity:
$$E\!\left( k_i^{-1} \sum_{j=1}^{k_i} d_{ij} \right) = k_i^{-1} \sum_{j=1}^{k_i} E(d_{ij}) = a_i^T m \;\Rightarrow\; E(\bar d) = A_n m. \qquad (2.14)$$
Therefore $E(\hat m) = (A_n^T W A_n)^{-1} A_n^T W A_n m = m$, and $\hat m$ is unbiased.

The variance of the estimate $\hat m$ depends on the variance of the averaged data $\bar d$, where $\mathrm{Var}(\bar d_i) = \frac{\sigma^2}{N w_i}$ (for $w_i \neq 0$), or $\Sigma_{\bar d} = \frac{\sigma^2}{N} W^{-1}$. The variance-covariance matrix of $\hat m$ is
$$\Sigma_{\hat m} = \left[ (A_n^T W A_n)^{-1} A_n^T W \right] \Sigma_{\bar d} \left[ (A_n^T W A_n)^{-1} A_n^T W \right]^T = \frac{\sigma^2}{N}(A_n^T W A_n)^{-1}. \qquad (2.15)$$
The variance-covariance matrix $\Sigma_{\hat m}$ is also referred to as the dispersion matrix, and is often denoted $D = \Sigma_{\hat m}$.

In summary, we have shown that in the classical experimental design problem the least-squares estimate $\hat m$ of the model $m$ has the following properties:

(i) $\hat m = (A_n^T W A_n)^{-1} A_n^T W \bar d$.
(ii) $E(\hat m) = m$.
(iii) The covariance matrix is $\Sigma_{\hat m} = \frac{\sigma^2}{N}(A_n^T W A_n)^{-1}$.
(iv) $\mathrm{MSE}(\hat m) = \mathrm{trace}\left( \frac{\sigma^2}{N}(A_n^T W A_n)^{-1} \right)$.

Without loss of generality, we can redefine $A \equiv A_n$ and $d \equiv \bar d$, where the new $\{A, d\}$ correspond to the rows and data from the distinct experimental configurations only. The repetition of experiments is accounted for in the optimal $w_i$'s; since the optimization problem minimizes a function of the variance, the optimal weights are independent of realizations of the data. This is the notation used for the remainder of this dissertation.

2.1.1 Statistical Criteria For Optimal Experimental Design

We have shown that for the classical linear OED the matrix $A^TA$ is nonsingular and $\hat m$ can be found by solving the weighted least squares problem
$$S(m) = \frac{1}{2}(d - Am)^T W (d - Am), \qquad (2.16)$$
where $d_i$ is the average of the $k_i$ observations of the $i$th experiment, the rows of $A$ correspond to the $n$ different experimental configurations, and $W = \mathrm{diag}\{k_1/N, \ldots, k_n/N\}$. We note that $(\sigma^2/N)W^{-1}$ is the covariance matrix of $d$ and the selection of the $k_i$ controls the variances of the observations of different experiments.

Optimal experimental designs are determined by using statistical criteria that optimize the quality of the data before any data gathering takes place. For the well-posed case, frequentist OEDs focus upon minimizing some function of the variance of the inverse estimate. For the A-optimal design, this means minimizing the trace of the dispersion matrix $\Sigma_{\hat m}$, which is equivalent to minimizing the mean squared error (MSE) of the estimator of the model parameters. There are a number of different types of constraints that we can use in the optimization problem. The two types that we focus on are the "cost-controlled" design constraints (also known as the classic continuous design problem) and the "sparsity-controlled" design constraints. The main feature of the sparsity-controlled estimate is that the level of sparsity is controlled by a tuning parameter $\beta$ that behaves in a similar way to the regularization parameter in Tikhonov estimation. Therefore the two optimization problems that we will consider are the A-optimal classic design, given by

$$\begin{aligned} \text{minimize} \quad & \phi_A(w) = \mathrm{tr}(\Sigma_{\hat m}) = \mathrm{trace}\left( (A^TWA)^{-1} \right) \\ \text{subject to} \quad & \textstyle\sum_{i=1}^{n} w_i = 1, \quad w_i \geq 0, \end{aligned} \qquad (2.17)$$
and the sparsity-controlled design, given by
$$\begin{aligned} \text{minimize} \quad & \mathrm{trace}\left( (A^TWA)^{-1} \right) + \beta \textstyle\sum_i w_i \\ \text{subject to} \quad & 0 \leq w_i \leq w_{\max}. \end{aligned} \qquad (2.18)$$

One important detail in this section that should not be overlooked is that while, mathematically, the MSE can decrease to zero with increasing $N$, in practice there is usually a smallest achievable variance $\sigma^2_{\min}$ for the experiments. We account for this lower bound by setting an upper bound $w_{\max}$ on the $w_i$'s.

2.1.2 Solving the Classical OED Optimization Problem

There are a number of ways that we can constrain the optimization problem. Here we will focus on the classical continuous design problem and the sparsity-controlled design problem. The classic A-optimal design problem is sometimes referred to as the cost-controlled A-optimal design [22]. In the general case of cost-controlled design, we consider a fixed total cost $C$, with each observation costing a different amount $c_i$ [10]. Without loss of generality, we can assume that the costs of the experiments are equal (or negligibly different) and focus upon the minimization over the different experimental configurations. The classic A-optimal design problem is therefore represented by the constrained optimization problem
$$\begin{aligned} \text{minimize} \quad & \mathrm{MSE}(\hat m) = \mathrm{trace}\left( (A^TWA)^{-1} \right) \\ \text{subject to} \quad & \textstyle\sum_{i=1}^{n} w_i = 1, \quad w_i \geq 0, \end{aligned} \qquad (2.19)$$

where we note that we have dropped $\sigma^2/N$ from the objective function. The optimal design is given by the optimal weights $w^*$. These weights tell the experimenter the proportional allocation required for each experimental configuration in the OED. The experimenter can also approximate the number of repetitions for each configuration by the conversion $k_i = \lfloor N w_i \rfloor$. While the constraint $\sum w_i = 1$ helps to ensure that the individual $w$'s never get very large, there is no assurance of a sparse subset in the optimal design. More on the formulation of this design problem can be found in [1, 7, 8]. In practice the optimal design is typically a subset, but the practitioner may desire a more sparse design.
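As an illustration, here is a minimal sketch of (2.19) in Python with cvxpy on a hypothetical candidate set (the thesis itself uses the MATLAB CVX toolbox); cvxpy's matrix_frac(X, P) evaluates $\mathrm{trace}(X^T P^{-1} X)$, so with $X = I$ it yields the trace of the inverse as a convex function of the weights.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
n, p = 20, 3
A = rng.normal(size=(n, p))                  # hypothetical candidate configurations

w = cp.Variable(n, nonneg=True)              # design weights w_i
M = A.T @ cp.diag(w) @ A                     # information matrix A^T W A
# matrix_frac(I, M) = trace(M^{-1}): the A-optimality objective of (2.19)
problem = cp.Problem(cp.Minimize(cp.matrix_frac(np.eye(p), M)),
                     [cp.sum(w) == 1])
problem.solve()

print("optimal A-criterion:", problem.value)
print("support of the design:", np.flatnonzero(w.value > 1e-4))
```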


To explore the sparsity-controlled design problem, let us first consider the basic optimization problem
$$\begin{aligned} \text{minimize} \quad & \mathrm{trace}\left( (A^TWA)^{-1} \right) \\ \text{subject to} \quad & 0 \leq w_i. \end{aligned} \qquad (2.20)$$

Here the optimal solution is to set the $w_i$'s equal to infinity, which is clearly infeasible in practice. Another important issue is that while the MSE decreases to zero as the number of experiments $N \to \infty$, in practice there is usually a smallest achievable variance $\sigma^2_{\min}$ for the experiments. This means that we must bound the $w$'s from above, because large values would imply unattainably small variances. Let us therefore consider the revised optimization problem with bounded weights
$$\begin{aligned} \text{minimize} \quad & \mathrm{trace}\left( (A^TWA)^{-1} \right) \\ \text{subject to} \quad & 0 \leq w_i \leq w_{\max}. \end{aligned} \qquad (2.21)$$

The solution to this problem is obvious: $w_i = w_{\max}$ for all $i$. This approach fails in one important way: it requires conducting every experiment with maximum exposure time (or as many replications as possible). While this is intuitive, it defeats the purpose of optimal design; running all experiments may be expensive and infeasible due to time and budget constraints.

We can enforce sparsity by using a penalty term that limits the number of experimental configurations that can be used. The sparsity-controlled optimization problem is therefore given by
$$\begin{aligned} \text{minimize} \quad & \mathrm{trace}\left( (A^TWA)^{-1} \right) + \beta \|w\|_0 \\ \text{subject to} \quad & 0 \leq w_i \leq w_{\max}, \end{aligned} \qquad (2.22)$$
where $\|w\|_0 = \#\{w_i \neq 0\}$. We cannot solve this problem easily because it is not convex [8] and is of combinatorial complexity. It is usually approximated by using the $\ell_1$ norm $\|w\|_1$ in place of $\|w\|_0$ [22]. Since all of the $w_i$ are nonnegative, the $\ell_1$-penalized problem takes the form
$$\begin{aligned} \text{minimize} \quad & \mathrm{trace}\left( (A^TWA)^{-1} \right) + \beta \textstyle\sum_i w_i \\ \text{subject to} \quad & 0 \leq w_i \leq w_{\max}, \end{aligned} \qquad (2.23)$$

where the sparsity of the design is controlled by the tuning parameter $\beta$. To determine an appropriate choice of $\beta$, the convex optimization problem is run for a few values of $\beta$, from which we produce a plot of $\mathrm{trace}\left((A^TWA)^{-1}\right)$ versus $\|w\|_0$. We should note that a plot versus $\beta$ is uninformative, as the value of $\beta$ does not indicate the sparsity level directly; it only serves to balance the trade-off between the reduction of the MSE and sparsity. The resulting plot is known as a Pareto curve (or L-curve, as it is often L-shaped) and allows us to study the MSE reduction as a function of $\|w\|_0$, which depends on the tuning parameter $\beta$. We usually pick the corner of this curve, if one exists, or we justify the selection of another point by choosing to balance the MSE with the sparsity of the design.
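A sketch of this $\beta$ sweep under the same synthetic setup as before; the $\beta$ grid, the bound $w_{\max}$, and the 1e-4 support threshold are arbitrary choices for illustration.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, p, w_max = 20, 3, 0.5
A = rng.normal(size=(n, p))

w = cp.Variable(n)
beta = cp.Parameter(nonneg=True)             # sparsity tuning parameter
mse = cp.matrix_frac(np.eye(p), A.T @ cp.diag(w) @ A)
problem = cp.Problem(cp.Minimize(mse + beta * cp.sum(w)),
                     [w >= 0, w <= w_max])

# Sweep beta and record (||w||_0, MSE) pairs that trace out the Pareto curve
for b in [1e-3, 1e-2, 1e-1, 1.0]:
    beta.value = b
    problem.solve()
    nnz = int(np.sum(w.value > 1e-4))        # effective ||w||_0 after thresholding
    print(f"beta={b:g}: ||w||_0 = {nnz}, trace term = {mse.value:.3f}")
```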


It should be noted that the computed optimal design can be normalized to satisfy the classical constraint $\sum w_i^* = 1$. It is in this classical form that we can better understand the relative proportional allocation required by each experimental configuration in the OED. We can also determine the number of repetitions, $k_i = \lfloor N w_i \rfloor$, for each experiment as we did before. Finally, we should note that if $w_i^* \ll 1$ then the OED for experiment $i$ results in a variance that is larger than the baseline variance of the model. In this event, the resulting $w_i^* \ll 1$ should be set to zero and the corresponding experiments should not be carried out.

The overall method is quite effective in quickly providing a sparse Pareto curve when the values of $\beta$ are appropriate; however, the experimenter has no idea beforehand which values of $\beta$ to use. If poor (or too few) choices of $\beta$ are used, a sharp corner may not form even if one exists. A sharp corner helps to eliminate ambiguity and aids proper decision making for the value of $\beta$ that should be used. There might also be more than one corner, leaving the experimenter to decide which value of $\beta$ to ultimately use. The drawback of this method is therefore that the experimenter has little prior knowledge of which values of $\beta$ to try when drawing a Pareto curve, so a binary search for the corner is needed to find $\beta$. This results in the experimenter re-running the problem multiple times and refining the curve until a sharp corner can be achieved. The solution of each optimization problem is very fast, but the overall process can be limited by the search for a reasonable $\beta$.

2.2 An Affine Bayes Risk Estimator

In order to solve ill-posed inverse problems, regularization methods are usually needed. Tikhonov regularization is a very common method that corresponds to a penalized least-squares problem and helps to improve the conditioning of $A^TA$. The main issue with the Tikhonov regularized estimator is that it is biased. If we seek an optimal design that minimizes the variance only, the bias component may dominate the mean squared error. Since we cannot truly know the bias without knowing the true values of the parameters we want to estimate, we instead use the Bayes risk, defined as the expected value of the mean squared error over the prior distribution of the model parameters $m$.

We will start by showing how the hierarchy of moments from the Bayes risk estimator can be used to obtain the Tikhonov regularized estimate within the ill-posed experimental design framework. We consider the linear model as before, $y = Am + \varepsilon$, and use the following moment conditions to determine a Bayes risk estimator for use in an A-optimal design problem. Suppose $y$, $m$, $\lambda$, and $\gamma$ are random variables such that
$$E(y \mid m, \gamma) = Am, \quad \mathrm{Var}(y \mid m, \gamma) = \sigma^2 I,$$
$$E(m \mid \gamma) = m_0, \quad \mathrm{Var}(m \mid \gamma) = \gamma^2 \Sigma,$$
$$E(\gamma^2) = \mu_\gamma^2,$$
where $\sigma$ and $\Sigma$ are known, $\Sigma$ is non-singular, and $\gamma = \sigma/\lambda$ has been chosen to match the Tikhonov estimator. We wish to find an affine estimator of $m$ that minimizes the Bayes risk $E_m\,\mathrm{MSE}(\hat m)$.

We will express the affine estimator in the form $\hat m = By + a$. Let us begin by computing the expected MSE matrix:
$$E[(\hat m - m)(\hat m - m)^T] = E[(By + a - m)(By + a - m)^T] = E\{[(BA - I)m + a + B\varepsilon][(BA - I)m + a + B\varepsilon]^T\}.$$
Let $H = BA - I$. Then,
$$\begin{aligned}
E[(\hat m - m)(\hat m - m)^T] &= E\{[Hm + a + B\varepsilon][Hm + a + B\varepsilon]^T\} \\
&= E\{ E( E[ Hmm^TH^T + 2am^TH^T + 2B\varepsilon a^T + 2Hm\varepsilon^TB^T + aa^T + B\varepsilon\varepsilon^TB^T \mid m, \gamma ]) \mid \gamma \} \\
&= E(E[Hmm^TH^T \mid \gamma]) + 2E(E[am^TH^T \mid \gamma]) + E(E[aa^T \mid \gamma]) + E\{E(E[B\varepsilon\varepsilon^TB^T \mid m, \gamma]) \mid \gamma\} \\
&= E\big(H(\gamma^2\Sigma + m_0m_0^T)H^T\big) + 2E(am_0^TH^T) + E(aa^T) + E(\sigma^2BB^T) \\
&= E\!\left( \frac{\sigma^2}{\lambda^2}H\Sigma H^T \right) + Hm_0m_0^TH^T + 2am_0^TH^T + aa^T + \sigma^2BB^T \\
&= \sigma^2\delta^2 H\Sigma H^T + \sigma^2BB^T + (a + Hm_0)(a + Hm_0)^T, \qquad (2.24)
\end{aligned}$$
where $\delta^2 = E(1/\lambda^2)$.

We can use this result to determine the mean squared error by permuting the products in the arguments of the traces:
$$\begin{aligned}
\mathrm{MSE}(\hat m) &= E[(\hat m - m)^T(\hat m - m)] \\
&= \sigma^2\delta^2\,\mathrm{tr}(H^TH\Sigma) + \mathrm{tr}(m_0^TH^THm_0) + 2\,\mathrm{tr}(m_0^TH^Ta) + \mathrm{tr}(a^Ta) + \sigma^2\,\mathrm{tr}(B^TB) \\
&= \sigma^2\delta^2\,\mathrm{tr}(H^TH\Sigma) + m_0^TH^THm_0 + 2m_0^TH^Ta + a^Ta + \sigma^2\,\mathrm{tr}(B^TB) \\
&= \sigma^2\delta^2\,\mathrm{tr}(H^TH\Sigma) + \|Hm_0 + a\|^2 + \sigma^2\,\mathrm{tr}(B^TB). \qquad (2.25)
\end{aligned}$$

Now we must find the $B$ and $a$ that minimize the Bayes risk (2.25); this gives the optimal affine estimator under the prior moment conditions. It follows immediately from (2.25) that
$$a = -Hm_0 = (I - BA)m_0. \qquad (2.26)$$

Differentiating (2.25) with respect to $B$ and setting the result to zero (the terms involving $m_0$ cancel once (2.26) is used):
$$\begin{aligned}
\frac{\partial\,\mathrm{MSE}(\hat m)}{\partial B} &= 2\sigma^2\delta^2(BA - I)\Sigma A^T + 2\sigma^2 B = 0 \\
\Rightarrow\; B(\delta^2 A\Sigma A^T + I) &= \delta^2\Sigma A^T \\
\Rightarrow\; B &= \delta^2\Sigma A^T(\delta^2 A\Sigma A^T + I)^{-1} = (\delta^{-2}\Sigma^{-1} + A^TA)^{-1}A^T, \qquad (2.27)
\end{aligned}$$
where the last equality follows from the standard push-through identity.

Finally, solving back for $a$:
$$a = (I - BA)m_0 = (\delta^{-2}\Sigma^{-1} + A^TA)^{-1}\left[ (\delta^{-2}\Sigma^{-1} + A^TA) - A^TA \right]m_0 = \delta^{-2}(\delta^{-2}\Sigma^{-1} + A^TA)^{-1}\Sigma^{-1}m_0. \qquad (2.28)$$

Therefore the affine estimator that minimizes the Bayes risk can be expressed as follows:
$$\hat m = By + a = (\delta^{-2}\Sigma^{-1} + A^TA)^{-1}A^Ty + \delta^{-2}(\delta^{-2}\Sigma^{-1} + A^TA)^{-1}\Sigma^{-1}m_0 = (\delta^{-2}\Sigma^{-1} + A^TA)^{-1}(A^Ty + \delta^{-2}\Sigma^{-1}m_0). \qquad (2.29)$$
Now, to illustrate the correspondence with the Tikhonov estimate, recall that the penalized weighted least-squares method yields the following estimator:

$$\hat m = \operatorname*{argmin}_m\, (y - Am)^T W (y - Am) + \lambda^2\|\Gamma m\|^2 = (A^TWA + \lambda^2\Gamma^T\Gamma)^{-1}A^TWy.$$
We can transform the problem into the form
$$\hat m = \operatorname*{argmin}_m\, (W^{\frac12}y - W^{\frac12}Am)^T(W^{\frac12}y - W^{\frac12}Am) + \lambda^2\|\Gamma m\|^2 = \operatorname*{argmin}_m\, (\tilde y - \tilde Am)^T(\tilde y - \tilde Am) + \lambda^2\|\Gamma m\|^2.$$
If we use the prior moment conditions
$$E(\tilde y \mid m, \gamma) = \tilde Am, \quad \mathrm{Var}(\tilde y \mid m, \gamma) = \frac{\sigma^2}{N}I,$$
$$E(m \mid \gamma) = 0, \quad \mathrm{Var}(m \mid \gamma) = \frac{\sigma^2}{\lambda^2 N}(\Gamma^T\Gamma)^{-1}, \quad E(\gamma^2) = \mu_\gamma^2,$$
the resulting affine estimator is
$$\hat m = (\lambda^2\Gamma^T\Gamma + \tilde A^T\tilde A)^{-1}\tilde A^T\tilde y,$$
or, transforming back,
$$\hat m = (\lambda^2\Gamma^T\Gamma + A^TWA)^{-1}A^TWy.$$

The equivalence between the Tikhonov regularized estimate and the affine estimator in (2.29) is clear; however, a modification of the moment conditions will add further clarity in the context of the optimal design problem. In the next section we look at the general Bayes risk A-optimal design using the prior moment conditions.
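A quick numerical check of this equivalence, assuming the unweighted case $W = I$, $m_0 = 0$, $\Sigma = (\Gamma^T\Gamma)^{-1}$, and $\delta = 1/\lambda$ (all values below are synthetic): the affine estimator (2.29) reproduces the Tikhonov estimate.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, lam = 30, 5, 0.7
A = rng.normal(size=(n, p))
y = rng.normal(size=n)
Gamma = np.eye(p)                            # standard-form Tikhonov penalty

# Tikhonov estimate: argmin ||y - A m||^2 + lam^2 ||Gamma m||^2  (W = I)
m_tik = np.linalg.solve(A.T @ A + lam**2 * Gamma.T @ Gamma, A.T @ y)

# Affine Bayes risk estimator (2.29) with m0 = 0, Sigma = (Gamma^T Gamma)^{-1},
# and delta = 1/lam, so that delta^{-2} Sigma^{-1} = lam^2 Gamma^T Gamma
delta2 = 1.0 / lam**2
Sigma_inv = Gamma.T @ Gamma
m0 = np.zeros(p)
m_bayes = np.linalg.solve(A.T @ A + Sigma_inv / delta2,
                          A.T @ y + Sigma_inv @ m0 / delta2)

print(np.allclose(m_tik, m_bayes))           # True: the two estimates agree
```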

2.3 Bayes Risk A-optimal Design Methodology

We will assume that the moment conditions are given as follows:

E(y|m, γ) = Am, Var(y|m, γ) = σN2W E(m|γ) = m0, Var(m|γ) = γ2Σ

E(γ2 ) = µ2γ,

where σ and Σ are known, Σ is non-singular and γ = σ/(N λ) and has been chosen to match the Tikhonov estimator. The affine estimator with the smallest expected MSE is given by

\[
\widehat{m} = \left(A^TWA + \frac{1}{N\delta^2}\Sigma^{-1}\right)^{-1}\left(A^TWy + \frac{1}{N\delta^2}\Sigma^{-1}m_0\right), \tag{2.30}
\]

where δ² = μ²_γ/σ² = E(1/λ²). Note that we could have started with the moment condition Var(m | γ) = (γ²/N)Σ and arrived at an equivalent result; the use of σ² was shown for completeness and does not affect the final optimization problem shown below.

The corresponding expected MSE matrix is

\[
E[(\widehat{m} - m)(\widehat{m} - m)^T] = \frac{\sigma^2}{N}\left(A^TWA + \frac{1}{N\delta^2}\Sigma^{-1}\right)^{-1}, \tag{2.31}
\]

or, equivalently,

\[
E[(\widehat{m} - m)(\widehat{m} - m)^T] = \frac{\tau\sigma^2}{N}\left(\tau A^TWA + (1 - \tau)\Sigma^{-1}\right)^{-1},
\qquad \tau = \frac{N\delta^2}{1 + N\delta^2} \in (0, 1).
\]
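The equivalence of the two forms of (2.31) is a simple algebraic identity; the following sketch (with illustrative sizes and a random design) confirms it numerically.

```python
# Sketch: the two forms of the expected MSE matrix in (2.31) agree when
# tau = N*delta^2 / (1 + N*delta^2). All values are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 6
N, delta2, sigma2 = 100, 0.05, 0.01
A = rng.standard_normal((n, p))
W = np.diag(rng.dirichlet(np.ones(n)))   # random design weights summing to 1
Sigma_inv = np.eye(p)
tau = N * delta2 / (1 + N * delta2)

form1 = sigma2 / N * np.linalg.inv(A.T @ W @ A + Sigma_inv / (N * delta2))
form2 = tau * sigma2 / N * np.linalg.inv(tau * A.T @ W @ A + (1 - tau) * Sigma_inv)
assert np.allclose(form1, form2)
```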


The design problem is now clear. If we wish to minimize the Bayes risk (the expected MSE), we must choose the weights in W that minimize the posterior covariance. Therefore we may solve either of the following optimization problems:

\[
\begin{array}{ll}
\text{minimize} & \operatorname{tr}\left[\left(A^TWA + \dfrac{1}{N\delta^2}\Sigma^{-1}\right)^{-1}\right] \\
\text{subject to} & \sum_i w_i = 1, \quad w_i \ge 0.
\end{array} \tag{2.32}
\]

Our simple transformation allows us to instead pose the optimization problem as a function of τ:

\[
\begin{array}{ll}
\text{minimize} & \operatorname{tr}\left[\left(\tau A^TWA + (1 - \tau)\Sigma^{-1}\right)^{-1}\right] \\
\text{subject to} & \sum_i w_i = 1, \quad w_i \ge 0.
\end{array} \tag{2.33}
\]

In order to solve these problems we must first consider how to select the input parameters {N, δ²} or τ; we explore these choices in the next subsection. At this point it is also pertinent to discuss some of the theoretical aspects of generalized experimental design.
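Before turning to parameter selection, the following sketch shows how (2.33) can be posed as a convex program, here using CVXPY (the Python counterpart of the CVX package used later in this dissertation); the forward matrix, prior precision, and τ are illustrative placeholders.

```python
# Sketch: the Bayes risk A-optimal design problem (2.33) as a convex program.
import cvxpy as cp
import numpy as np

n, p = 50, 10                           # candidate experiments, model dimension
rng = np.random.default_rng(0)
A = rng.standard_normal((n, p))         # rows are candidate experimental configurations
Sigma_inv = np.eye(p)                   # prior precision Sigma^{-1} (assumed known)
tau = 0.99                              # determined from {N, delta^2}; see Section 2.3.1

w = cp.Variable(n, nonneg=True)         # design weights
M = tau * A.T @ cp.diag(w) @ A + (1 - tau) * Sigma_inv
obj = cp.Minimize(cp.matrix_frac(np.eye(p), M))   # matrix_frac(I, M) = tr(M^{-1})
prob = cp.Problem(obj, [cp.sum(w) == 1])
prob.solve()
print("A-optimal weights:", np.round(w.value, 3))
```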

2.3.1 Parameter Selection in the Bayes Risk A-optimal Design

For the Bayes risk A-optimal design, we have defined an objective function that depends on the selection of the parameter τ. As shown above, τ is related to the parameters {N, δ²}, where δ² = μ²_γ/σ², through τ = Nδ²/(1 + Nδ²). An appropriate selection of τ must therefore reflect the moment conditions in the Bayes risk A-optimal design. We first observe that

\[
\operatorname{tr}\left[\left((1 - \tau)\Sigma^{-1} + \tau A^TWA\right)^{-1}\right] \to \operatorname{tr}\left[(A^TWA)^{-1}\right]
\quad \text{as } \tau \to 1.
\]

However, the matrix AᵀWA can be very ill-conditioned or singular. Therefore we must limit how large a τ we use to prevent the matrix in the objective function from becoming ill-conditioned.

This can be accomplished by selecting an upper bound τ* < 1 for τ such that τAᵀWA + (1 − τ)Σ⁻¹ is sufficiently well-conditioned for τ ≤ τ*. Recall that τ and {N, δ²} are related by τ = Nδ²/(1 + Nδ²). This relationship can then be used to define an upper bound on the possible values of N:

\[
N \le \frac{\tau^*}{\delta^2(1 - \tau^*)}. \tag{2.34}
\]

Recalling that δ² = μ²_γ/σ² leads to a bound on the feasible ratio of μ²_γ to σ² for each fixed N:

\[
\frac{\mu_\gamma^2}{\sigma^2} \le \frac{\tau^*}{N(1 - \tau^*)}.
\]


In order to choose an appropriate value for δ², we must estimate it from prior knowledge or from a training set.

Suppose we have a training set T = {m^{(1)}, ..., m^{(T)}} of vectors that are expected to be similar to the unknown m. Then we can estimate an appropriate δ² using the following procedures.

A moment estimate of m₀ is

\[
\widehat{m}_0 = \bar{m} = \frac{1}{T}\sum_{j=1}^{T} m^{(j)},
\]

and a moment estimate of δ² is

\[
\widehat{\delta}^2 = \widehat{\mu}_\gamma^2/\sigma^2, \qquad
\widehat{\mu}_\gamma^2 = \frac{1}{\operatorname{tr}(\Sigma)\,T}\sum_{j=1}^{T}\left\|m^{(j)} - \bar{m}\right\|^2.
\]

For the estimation of our input parameters in this paper, we will use this moment-based approach.
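A minimal sketch of these moment estimates, assuming the training vectors are stacked as the rows of an array M_train (a name introduced here purely for illustration):

```python
# Sketch: moment estimates of m0 and delta^2 from a training set.
import numpy as np

def estimate_moments(M_train, Sigma, sigma2):
    """M_train: (T, p) array whose rows are the training vectors m^{(j)}."""
    T = M_train.shape[0]
    m0_hat = M_train.mean(axis=0)                    # sample mean, estimate of m0
    resid = M_train - m0_hat
    mu2_gamma_hat = (resid**2).sum() / (np.trace(Sigma) * T)
    delta2_hat = mu2_gamma_hat / sigma2              # delta^2 = mu_gamma^2 / sigma^2
    return m0_hat, delta2_hat
```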

Alternatively, if the prior moments have been chosen so that the optimal \widehat{m} corresponds to a Tikhonov estimate, it can be shown that δ² = E(1/λ²), where λ is the regularization parameter in the standard Tikhonov estimator. For a fixed N, we can then choose an initial design w, for example wᵢ = 1/n, and simulate data for each m^{(j)} in T. For each realization we can use the generalized cross-validation (GCV) method to obtain a regularization parameter λⱼ, and then estimate δ² from the resulting λⱼ's using one of two options.

Option 1 is to use a trimmed mean, which eliminates small values of λⱼ and therefore yields a stable estimate of δ². Option 2 is to estimate δ² = E(1/λ²) using the median of the squared reciprocals of the regularization parameters produced by the GCV method, i.e., δ̂² = Median(1/λⱼ²) for j = 1, ..., T. When using the median we no longer need to eliminate any of the smaller values of λⱼ to obtain a stable estimate.
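A sketch of both options, assuming the GCV-selected parameters λⱼ have already been collected in an array (the trimming fraction is an illustrative choice):

```python
# Sketch: stabilized estimates of delta^2 = E(1/lambda^2) from GCV parameters.
import numpy as np

def delta2_trimmed(lambdas, trim_frac=0.1):
    """Option 1: trimmed mean of 1/lambda^2, dropping the largest values
    (which come from the smallest lambdas)."""
    r = np.sort(1.0 / np.asarray(lambdas)**2)        # ascending 1/lambda_j^2
    k = int(trim_frac * len(r))
    return r[:len(r) - k].mean() if k > 0 else r.mean()

def delta2_median(lambdas):
    """Option 2: median of 1/lambda^2; no trimming required."""
    return np.median(1.0 / np.asarray(lambdas)**2)
```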

Finally, based upon the estimate of δ², we select any value of N that satisfies the inequality in (2.34). An obvious choice, if feasible, is N = p, the number of discretization points used in the collocation of the model. Another natural selection is the value of N that achieves equality in (2.34). Ultimately the choice of N is problem dependent, and the decision can be guided by the trade-off between the reduction of the Bayes risk and the increase in the value of N.

While we could use the parameters {N, δ̂²} directly in the A-optimal design problem, we instead choose to express the problem in terms of τ only, where τ = Nδ̂²/(1 + Nδ̂²), which depends on the estimate of δ² and our choice of N.


For the examples included in this paper, we will consider the case N = 100, equal to the number of collocation points. We estimate the moment conditions by generating a training set of T = 1000 random vectors drawn from a Gaussian distribution with mean equal to the true model m₀ and covariance matrix γ²Σ, where Σ(i, j) = exp(−|yᵢ − yⱼ|/0.2) for yᵢ = i/99 and i = 0, ..., 99. In order to generate noise realizations and estimate δ², we used a signal-to-noise ratio of 10, which resulted in σ² = 0.0158. The parameter δ̂² is then estimated using the moment estimates given above. The final selection of τ is based upon equality in (2.34) with the obtained estimate δ̂², which gives the upper bound on the total number of experiments N for the chosen τ*; for this example we chose τ* = 0.9999. We are free to choose any N that satisfies this inequality, and hence we have chosen N = 100. Finally, based upon this choice of N, we determine τ = Nδ̂²/(1 + Nδ̂²) and use this value in the design objective function.
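For concreteness, the arithmetic behind these choices looks as follows (the value of δ̂² here is a placeholder, not the estimate obtained in the actual experiments):

```python
# Sketch: upper bound on N from (2.34) and the resulting tau.
tau_star = 0.9999                      # chosen upper bound tau*
delta2_hat = 0.05                      # placeholder moment estimate of delta^2
N_max = tau_star / (delta2_hat * (1 - tau_star))   # (2.34): N must satisfy N <= N_max
N = 100                                # chosen number of experiments
assert N <= N_max
tau = N * delta2_hat / (1 + N * delta2_hat)        # value used in the objective (2.33)
print(f"N_max = {N_max:.0f}, tau = {tau:.4f}")
```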


CHAPTER 3

ON EXPERIMENTAL DESIGN IN THE CONTEXT OF TIKHONOV REGULARIZED INVERSE PROBLEMS

In this chapter, we will provide a summary and expanded discussion of our recently submitted paper “Experimental design in the context of Tikhonov regularized inverse problems”. A preprint of this paper can also be found in the appendix.

3.1 Summary

The main focus of this paper is the expansion of some of the fundamental theorems in optimal experimental design so that they may be applied in the context of ill-posed inverse problems. This includes a generalized equivalence theorem that allows us to generalize many of the existing algorithms to the ill-posed case. In particular we will make use of the Bayes risk A-optimal design criterion, which has a natural connection with Tikhonov regularization. We state the conditions that allow us to use these techniques. As a way to efficiently solve the optimization problem, we explore a hybridization of two different numerical approaches with the hope of capitalizing on the strengths of each. Finally, we provide an example of an optimal design problem that is difficult to solve using sequential search algorithms.

Sections 2 & 3 of the paper review the general framework of experimental design and provide a general overview of Bayes Risk estimators. The connection between Bayes risk estimators and the Tikhonov regularization is briefly explained. The main result comes from the use of a hierarchy of prior moment conditions that are assumed to be known or can be estimated using a training set. We have already given a thorough introduction on the use of Bayes Risk estimators in the previous chapter. We will therefore use the Bayes risk A-optimal design criteria as the main objective function in this paper. We will also make use of the the weighted Bayes risk, a slight modification of the objective function, for AVO (angle-versus-offset) reflectivity inversion.

Section 4 of the paper provides a concise explanation of the generalized equivalence theorem, although it is never referred to by this name. This generalized form of the equivalence theorem allows us to work with standard information matrices that may not be invertible due to the ill-posedness of these types of problems. A regularization matrix that depends on the prior covariance matrix Σ (based upon the moment conditions) is used to help stabilize the information matrices. The influence of this regularization matrix is controlled by the parameter τ and the result is a


collection of generalized information matrices that depend on both τ and Σ. Some of the properties of generalized information matrices are stated in the paper; others will be described in Section 3.2.1. In order to use the Bayes risk optimal design techniques, we assume that we either have the prior moment conditions or that we can estimate them using a training set. Additionally, we must determine a reasonable value of τ (the parameter that controls the influence of the prior covariance) before attempting to solve the actual OED problem. The parameter τ depends on the moment conditions and the total number of experiments that are to be carried out. Section 5 of the paper provides a brief discussion of how to use the Bayes risk optimal design in practice, particularly the steps required in the estimation of τ. We have already provided an expanded discussion of the selection of τ in the previous chapter of this dissertation.

Section 6.1 is an application of the Bayes risk A-optimal design to an ill-posed inverse example problem. The main innovation in this section is the extension of numerical methods that can solve Bayes risk A-optimal design problems with the generalized information matrices. In particular, a hybrid technique is introduced that begins with a sequential search algorithm for a relatively small number of iterations. It is then generally assumed that the sequential search algorithm has added the most significant experimental configurations but may not have fully converged in terms of the final weights. The proposed design is then passed to a software package (CVX) that finalizes the weights by solving the associated semidefinite programming problem. The main goal of this hybrid method is to exploit the ability to explore the design space while efficiently determining an optimal design in a relatively short amount of time. The hybrid method appears to achieve this primary goal, as the Bayes risk is reduced relative to using CVX alone with a fixed design grid.

Finally, in Section 6.2 of the paper we present an example of an optimal design problem that cannot be solved easily using a sequential approach. In AVO inversion, a sequential approach is difficult because we need to correct the noise vector z(t) for correlations and heteroscedasticity, but the covariance matrix is defined by the angle stacks, which are our design variables. Therefore we use a predefined selection of stacks and use CVX to solve the associated semidefinite programming problem.

3.2 Linear Optimality Criteria

The theory that we are concerned with in this dissertation applies to all linear-optimal design criteria. We will briefly state the definition of linear optimality and then provide an extension of some fundamental results in optimal design theory that can be used when finding OEDs for ill-posed problems.


A non-negative linear functional is a function L from the n × n symmetric matrices to ℝ such that:

1. L(A + B) = L(A) + L(B) for all matrices A and B;
2. L(cA) = cL(A) for all matrices A and all c ∈ ℝ;
3. L(A) ≥ 0 for all positive semi-definite matrices A.

An example of a non-negative linear functional is the trace of a matrix. In the case of the A-optimal design the criterion is L(D(ξ)) = tr[D(ξ)]. In general, the term "linear criterion" refers to L(D(ξ)) = tr[AD(ξ)], where A = LLᵀ is a positive semi-definite matrix referred to as the "loss matrix".

Two designs may be compared using a non-negative linear functional as follows:

\[
\xi_1 > \xi_2 \quad \text{if} \quad L[D(\xi_1)] < L[D(\xi_2)]. \tag{3.1}
\]

We will call the design ξ* linear-optimal (or "L-optimal") if

\[
L[D(\xi^*)] = \min_{\xi}\, L[D(\xi)]. \tag{3.2}
\]
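In code, the comparison in (3.1) for the A-optimal (trace) criterion applied to the generalized information matrices of Chapter 2 might look as follows (a sketch; all names are illustrative):

```python
# Sketch: comparing two designs under the A-optimality (trace) criterion (3.1).
import numpy as np

def bayes_risk(w, A, Sigma_inv, tau):
    """L[D(xi)] with L = trace, for a design with weights w."""
    M = tau * A.T @ np.diag(w) @ A + (1 - tau) * Sigma_inv
    return np.trace(np.linalg.inv(M))

# Design xi_1 is preferred to xi_2 when its Bayes risk is smaller:
# bayes_risk(w1, A, Sigma_inv, tau) < bayes_risk(w2, A, Sigma_inv, tau)
```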

Let us focus upon using this estimator in A-optimal designs for ill-posed problems. To do so we will need a general framework of the equivalence theorem that can be used for ill-posed problems. In general, it can be shown that the A-optimal design optimization problem can be posed as

\[
\begin{array}{ll}
\underset{\{x_i, w_i\}}{\text{minimize}} & \operatorname{tr}\left[\left(\tau A^TWA + (1 - \tau)\Sigma^{-1}\right)^{-1}\right] \\
\text{subject to} & x_i \in \mathcal{X}, \quad w_i \ge 0, \quad \sum_i w_i = 1.
\end{array} \tag{3.3}
\]

We can relax the equality constraint to obtain

\[
\begin{array}{ll}
\underset{\{x_i, w_i\}}{\text{minimize}} & \operatorname{tr}\left[\left(\tau A^TWA + (1 - \tau)\Sigma^{-1}\right)^{-1}\right] \\
\text{subject to} & x_i \in \mathcal{X}, \quad w_i \ge 0, \quad \sum_i w_i \le 1.
\end{array} \tag{3.4}
\]

This relaxation is useful when an optimal design has been computed for n design points but a decision by the experimenter (or some other limiting factor) then requires that a subset of the experimental configurations be discarded along with their weights. In this case the remaining weights no longer sum to 1. An intuitive step is to re-weight (normalize) the remaining weights so that their relative proportions sum to 1. As an example, say we carry out 5 experiments with configurations {x₁, ..., x₅} and weights {0.15, 0.2, 0.3, 0.2, 0.15}, respectively. If we omit experiment x₅ along with its weight, the sum of the remaining weights is 0.85. Normalizing the remaining weights gives the new weights w' = {0.15, 0.2, 0.3, 0.2}/0.85 ≈ {0.176, 0.235, 0.353, 0.235}.
