
School of Education, Culture and Communication

Division of Applied Mathematics

Nonlinear Least-Square Curve Fitting of Power-Exponential Functions:

Description and comparison of different fitting methods

by

Rasha Talal Altoumaimi

MASTER THESIS IN MATHEMATICS/ APPLIED MATHEMATICS

DIVISION OF APPLIED MATHEMATICS

MÄLARDALEN UNIVERSITY SE-721 23 VÄSTERÅS, SWEDEN


School of Education, Culture and Communication

Division of Applied Mathematics

Master thesis in mathematics / applied mathematics

Date:

2017-10-17

Project name:

Nonlinear Regression of Power-Exponential Functions: Description and comparison of different fitting methods

Author:

Rasha Altoumaimi

Supervisor(s):

Milica Rančić and Karl Lundengård

Reviewer: Mats Bodin
Examiner: Sergei Silvestrov
Comprising: 30 ECTS credits


I would like to dedicate my thesis to my beloved husband Ahmed Sonba, who always gives me support and love.


Acknowledgements

I would like to express my thanks and gratitude to my supervisor Karl Lundengård for his positive and supportive guidance, and to my supervisor Milica Rančić. Special thanks to the reviewer Mats Bodin for giving very detailed feedback, and to Professor Sergei Silvestrov, the examiner of this thesis.


Abstract

This thesis examines how to find the best fit to a series of data points when curve fitting using power-exponential models. We describe different numerical methods, such as the Gauss-Newton and Levenberg-Marquardt methods, and compare them for solving non-linear least-squares curve-fitting problems with different power-exponential functions. In addition, we show the results of numerical experiments that illustrate the effectiveness of this approach. Furthermore, we show its application to practical problems by using different sets of data, such as death rates and rocket-triggered lightning return strokes based on the transmission line model.

Keywords: curve fitting, power-exponential functions, Gauss-Newton algorithm, Levenberg-Marquardt algorithm, death rate, rocket-triggered lightning return strokes.

Contents

1 Introduction
    1.1 Thesis outline
    1.2 Curve fitting
        1.2.1 Power exponential functions
        1.2.2 Selecting parameters
    1.3 Least-squares of curve fitting
    1.4 Nonlinear least squares problems

2 Jacobian, Hessian and gradient
    2.1 Jacobian and the gradient
    2.2 The Hessian matrix
    2.3 Convergence

3 Description of different methods
    3.1 Newton's method
        3.1.1 Line search
        3.1.2 Convergence of Newton's approach
        3.1.3 Problems with Newton's method
    3.2 Trust-region methods
        3.2.1 Trust-region strategy
        3.2.2 Trust-region algorithm
        3.2.3 The sub-problem of the trust-region
    3.3 The approach of Gauss-Newton
    3.4 The Levenberg-Marquardt method
        3.4.1 Implementation strategy of Levenberg-Marquardt method

4 Implementation and verification of the Gauss-Newton and L-M methods
    4.1 Numerical experiments
    4.2 Hessian and gradient for power-exponential model
    4.3 Power exponential data fitting

5 Application of non-linear least squares for curve fitting
    5.1 Modelling mortality rate
        5.1.1 Analysing mortality rate data
    5.2 Modelling rocket triggered lightning return strokes
        5.2.1 Analysing rocket-triggered return stroke data

6 Conclusions and future work
    6.1 Future Work

7 Summary of reflection of objectives in the thesis
    7.1 Objective 1: Knowledge and understanding
    7.2 Objective 2: Methodological knowledge
    7.3 Objective 3: Critically and Systematically Integrate Knowledge
    7.4 Objective 4: Independently and Creatively Identify and Carry out Advanced Tasks
    7.5 Objective 5: Present and Discuss Conclusions and Knowledge
    7.6 Objective 6: Scientific, Social and Ethical Aspects

A MATLAB Code
    A.1 Calculating residuals and Jacobian for the first power exponential function µ(b; x) using GN and Levenberg-Marquardt algorithms
    A.2 Implementation of mortality rate in GN and LM algorithms using power exponential model µ(c1, c2, a1, a2, a3; x)
    A.3 Implementation of rocket-triggered return stroke data in GN and LM algorithms using power exponential model µ(a, b, c; x)

List of Figures

4.1 The power-exponential function of µ(a, b, x_i) in GN-algorithm
4.2 The power-exponential function of µ(a, b, x_i)
5.1 Death risk after ages for men in USA between 1995 and 2004
5.2 Death risk for men in USA for some years between 1995 and 2004 using Gauss-Newton algorithm; x denotes age and y death risk
5.3 Death risk for men in USA for some years between 1995 and 2004 using Levenberg-Marquardt algorithm; x denotes age and y death risk
5.4 Equipment for measuring rocket-triggered return strokes, image originally appeared in [17]
5.5 Comparison of results for fitting rocket-triggered lightning return strokes data

List of Tables

5.1 The result of residual least squares ×10^-8 using µ(c1, c2, a1, a2, a3; x)
5.2 The result of residual least squares for each rocket-triggered stroke using formula f_i = y_i − µ(x_i; a, b, c)


Chapter 1

Introduction

1.1 Thesis outline

The goal of this thesis is to analyze, and conduct several experiments with, the properties of a class of functions for curve fitting, power-exponential functions, and to compare different fitting methods. Before setting up a model, researchers need measurements, so suitable sample points must be chosen and a curve fitted to the chosen points. Here we will examine a few different methods and apply them to data from different applications. In general terms, models or mathematical formulas may be developed to estimate a variable in the data set based on the other variables in the data, with some residual error depending on model accuracy (Data = Model + Error). This research can be used in many areas, such as electromagnetic compatibility calculations, lightning strike protection, and models of death risk.

As will be seen in Chapters 2 and 3, a selection of papers and research on standard methods of analysis and standard performance measures was used when fitting models to data: the Gauss-Newton algorithm, the trust-region algorithm and the Levenberg-Marquardt algorithm. Some of these methods are general in the sense that they can be used for any optimization problem, while the others are especially adapted to least-squares problems. This thesis will explain the most common methods, choose two methods used in non-linear least squares curve fitting, and explore them in detail, including when and how they converge.

In the following sections of the introduction, some of the terminology used in this thesis is described. In Chapter 1 we discuss the basic concepts of non-linear curve fitting. In Chapter 2 we discuss the Jacobian and Hessian matrices, which are important building blocks in popular methods for non-linear curve fitting, as well as the concept of convergence. In Chapter 3 we discuss several methods for non-linear curve fitting. In Chapter 4 two of the methods from Chapter 3 are implemented and verified, and in Chapter 5 they are applied to real data.

1.2 Curve fitting

Curve fitting is used to find the "best fit" line or curve for a series of data points or, put differently, curve fitting describes the relationship between one or more predictors (independent variables) and a response variable (dependent variable), with the aim of defining a "best fit" model of the relationship. The goal of analyzing the results of non-linear least-squares curve fitting is to find the parameters that fit the data best in the least-squares sense, how much uncertainty there is in the values of the parameters, and whether the model fits differently for a different set of data. Least squares minimizes the square of the error between the original data and the values predicted by the model, so the quality of the fit is determined by how near the data are to the predicted values of the model. Most of the time, the curve fit will produce an equation that can be used to find points anywhere along the curve. In some cases we may not be concerned about finding an equation; instead, we may just want to use a curve fit to smooth the data and improve the appearance of our plot [18].

1.2.1 Power exponential functions

In a power function the independent variable x is raised to a constant power c; in its most basic form f(c, x) = x^c, where c is a constant exponent. In this thesis the analysis is focused on linear combinations of functions of the form µ(b, x) = (x e^(1−x))^b. This function is called a power-exponential function, where b is a real-valued parameter, different for each term in the linear combination, that can be changed to adapt the shape of the curve.
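As an illustration, the following minimal MATLAB sketch (not part of the thesis code in Appendix A; the grid and the sample values of b are choices made here) evaluates µ(b, x) = (x e^(1−x))^b for a few values of b, which shows how the single parameter b changes the shape of the curve:

    % Sketch: shape of the power-exponential term mu(b,x) = (x*exp(1-x))^b
    mu = @(b, x) (x .* exp(1 - x)).^b;      % power-exponential function
    x  = linspace(0.01, 3, 300);            % avoid x = 0, where the logarithm is singular
    figure; hold on;
    for b = [0.5 1 2 4]                     % a few example parameter values
        plot(x, mu(b, x), 'DisplayName', sprintf('b = %g', b));
    end
    xlabel('x'); ylabel('\mu(b,x)'); legend show; hold off;

All curves attain the value 1 at x = 1; increasing b makes the peak narrower, which is what makes linear combinations of such terms flexible for curve fitting.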

1.2.2 Selecting parameters

The choice of the parameters of the power-exponential function is crucial, as it determines how good the fit will be. There are several ways to measure the fit, for example the sum of squares of the residuals or the maximum residual. Least squares minimizes the square of the error between the original data and the values predicted by the model [18]. In this thesis we want the sum of squared errors, or residuals, to be as small as possible.

1.3 Least-squares of curve fitting

Least-squares curve fitting is a commonly applied method for choosing the parameters in a mathematical model so that the model approximates observed data well.

Both linear and non-linear regression must deal with the dependent variables being random in some sense. The difference is that in linear regression the value of the predicting model depends linearly on its parameters. For example, y(x; a, b) = ax + b is linear in the parameters, but y(x; a, b) = ax + b^2 is non-linear.

Non-linear regression modelling is similar to linear regression modelling in that both seek to track the response of a dependent variable as a function of a group of predictor variables. Non-linear models are, however, more complex to develop than linear models, because the fit is generated through a sequence of approximations (iterations) that may resemble trial-and-error. Mathematicians use several established approaches, such as the Gauss-Newton approach and the Levenberg-Marquardt approach.

Many problems can be written in the generic form

    f_i(b_1, b_2, ..., b_n) = 0,   i = 1 : n,                                  (1.1)

where the f_i are given functions of n variables. In this section we study the numerical solution of such systems in the case where at least one function depends nonlinearly on at least one of the variables. Such a system is called a nonlinear system of equations, and may be represented compactly in the form

    f(b) = 0,   f : R^n → R^n.

Even more generally, if f is an operator acting on some function space, f(b) = 0 is a functional equation. Let us consider a nonlinear optimization problem of the form

    min_b ϕ(b),   b ∈ R^n,

where the objective function ϕ is a non-linear mapping R^n → R. Most numerical methods attempt to find a local minimum of ϕ(b), i.e. a point b^* such that ϕ(b^*) ≤ ϕ(y) for all y in a neighbourhood of b^*.

If the objective function ϕ is continuously differentiable at a point b, then any local minimum point b of ϕ must satisfy

    g(b) = ∇ϕ(b) = 0,                                                          (1.2)

where g(b) is the gradient vector. This shows the close relationship between solving optimization problems and solving non-linear systems of equations.

If in form (1.1) there are m > n equations, we have an overdetermined nonlinear system. Then the least squares solution can be defined to be a solution to

    min_{b ∈ R^n} ϕ(b) = (1/2) ‖f(b)‖_2^2,

which is a nonlinear least squares problem. We describe this problem in the next section.

Optimization problems are encountered in many applications such as operations research, control theory, chemical engineering, and all kinds of curve fitting or, more generally, mathematical model fitting.

1.4 Nonlinear least squares problems

A widely used approach for the estimation of the unknown parameters in a non-linear regression function is the approach of least squares. The problem of non-linear least squares is that of minimizing a sum of squares. Consider a vector function f : R^n → R^m where m ≥ n. The aim is to minimize ‖f(b)‖_2^2 as the nonlinear least-squares objective function,

    min_b ϕ(b) ≡ min_{b ∈ R^n} (1/2) ‖f(b)‖_2^2 = min_{b ∈ R^n} (1/2) Σ_{i=1}^{m} f_i(b)^2,      (1.3)

in which each f_i is a real-valued function with continuous second partial derivatives. The problem is presented as a minimization of the l_2 norm of a multivariate function,

    min_{b ∈ R^n} ‖f(b)‖_2^2,   where f(b) = ( f_1(b), f_2(b), ..., f_m(b) )^T.                  (1.4)

A common instance is the choice of the parameter b within a nonlinear model µ:

    min_{b ∈ R^n} Σ_{i=1}^{m} ( y_i − µ(x_i; b) )^2,                            (1.5)

where the y_i are observations at specific values x_i. Fitting data to a mathematical model is an important source of non-linear least squares problems. Here one attempts to fit given data (y_i, x_i), i = 1, ..., m, to a model y = µ(x; b). Letting f_i(b) express the error, or residual, in the model prediction for the i:th observation,

    f_i(b) = y_i − µ(x_i; b),   i = 1, ..., m.                                  (1.6)

We must minimize some norm of the vector f(b). The non-linearity arises only from µ(x_i; b). The model µ(b, x_i) fits the data well if the residuals f_i at the data points are small.
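For concreteness, here is a minimal MATLAB sketch of the residuals (1.6) and the objective (1.3) built as function handles. The particular one-parameter model, the data and the noise level are hypothetical and only serve as an example:

    % Sketch: residuals f_i(b) = y_i - mu(x_i; b) and the least-squares objective
    mu  = @(b, x) (x .* exp(1 - x)).^b;             % example model with one parameter b
    res = @(b, x, y) y(:) - mu(b, x(:));            % residual vector f(b), Eq. (1.6)
    phi = @(b, x, y) 0.5 * sum(res(b, x, y).^2);    % objective (1/2)*||f(b)||^2, Eq. (1.3)

    % hypothetical example data
    x = (0.1:0.1:0.9)';
    y = mu(0.8, x) + 1e-3 * randn(size(x));
    disp(phi(0.8, x, y))                            % small value: model is close to the data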

One of the major advantages of non-linear regression is the wide set of formulas that can be fitted. Several scientific or physical processes are inherently non-linear; trying a linear fit to such a structure would therefore give inferior results. For instance, population studies very often follow exponential patterns that cannot be structured in a linear way. Non-linear regression can give good estimates of the unknown parameters in the model with the use of a small data set. Least squares problems can also be solved by general optimization methods. This study offers numerical results for a large and diverse group of problems with the use of software that is common and has undergone extensive testing. The algorithms used in this software include Newton-based line-search and trust-region approaches for unconstrained optimization, in addition to Gauss-Newton and Levenberg-Marquardt for non-linear least squares. Our purpose is to compare the underlying algorithms in order to identify classes of problems on which the performance of each method is either quite good or quite poor, and to provide benchmarks for further work in non-linear least squares and unconstrained optimization.


Chapter 2

Jacobian, Hessian and gradient

2.1 Jacobian and the gradient

The standard approaches for non-linear least squares problems need derivative information concerning the component functions of f(b), as mentioned in [3] and [4].

Definition 1. If the function f_i : R^n → R is differentiable at a point b (then f_i is also continuous at b), the gradient vector [3]

    ∇f_i(b) = ( ∂f_i/∂b_1, ..., ∂f_i/∂b_n )^T ∈ R^n

exists and is continuous. A vector-valued function f(b) : R^n → R^n is said to be differentiable at the point b if each component f_i(b) is differentiable at b. Then the matrix

    J(b) = [ ∇f_1(b)^T ; ... ; ∇f_n(b)^T ] = [ ∂f_i/∂b_j ],   i, j = 1, ..., n,

is called the Jacobian matrix J of f(b) [3].

We assume here that f(b) is twice continuously differentiable. It is easily shown that the gradient of ϕ(b) = (1/2) f^T(b) f(b) has components

    [∇ϕ(b)]_j = ∂ϕ(b)/∂b_j = Σ_{i=1}^{m} f_i(b) ∂f_i(b)/∂b_j,   j = 1, ..., n,

and it follows that the gradient is the vector

    g(b) = ∇ϕ(b) = J(b)^T f(b),                                                (2.1)

where the Jacobian J(b) of the vector function f(b) is the matrix containing the first partial derivatives of the function components,

    [J(b)]_{ij} = ∂f_i(b)/∂b_j = −∂µ(b, x_i)/∂b_j ∈ R^{m×n},   i = 1, ..., m,  j = 1, ..., n.     (2.2)

The ith row of J(b) equals the transpose of the gradient of f_i(b):

    [J(b)]_{i,:} = ∇f_i(b)^T = −∇µ(b, x_i)^T,   i = 1, ..., m.                  (2.3)

As we can see, the Jacobian is a function of the independent variable and the parameters, and therefore it changes from one iteration to the next. In terms of the linearized model, ∂µ(b, x_i)/∂b_j = −J_{ij}, and the residual is given by Eq. (1.6).
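When an analytic Jacobian is derived by hand, it can be useful to check it against a numerical approximation. The following is a minimal MATLAB sketch (the forward-difference step h and the handle name resfun are assumptions, not from the thesis); with f and J available, the gradient and the Gauss-Newton Hessian approximation follow as g = J.'*f and H ≈ J.'*J, as in (2.1) and (2.6):

    % Sketch: forward-difference approximation of the Jacobian J(b) of the residuals
    function J = fd_jacobian(resfun, b, h)
    % resfun: handle returning the residual vector f(b); b: parameter vector
        if nargin < 3, h = 1e-6; end
        f0 = resfun(b);
        m  = numel(f0);  n = numel(b);
        J  = zeros(m, n);
        for j = 1:n
            bj      = b;
            bj(j)   = bj(j) + h;                 % perturb one parameter
            J(:, j) = (resfun(bj) - f0) / h;     % [J]_ij ~ d f_i / d b_j, Eq. (2.2)
        end
    end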

2.2 The Hessian matrix

We shall also need the elements of the Hessian of ϕ(b), denoted ∇²ϕ(b), which are given by

    [∇²ϕ(b)]_{kl} = ∂²ϕ(b)/∂b_k∂b_l = Σ_{i=1}^{m} (∂f_i(b)/∂b_k)(∂f_i(b)/∂b_l) + Σ_{i=1}^{m} f_i(b) ∂²f_i(b)/∂b_k∂b_l,     (2.4)

and it follows that the matrix H(b) can be written as

    H(b) = ∇²ϕ(b) = J(b)^T J(b) + Σ_{i=1}^{m} f_i(b) ∇²f_i(b),                 (2.5)

where

    [∇²f_i(b)]_{kl} = −[∇²µ(b, x_i)]_{kl} = −∂²µ(b, x_i)/∂b_k∂b_l,   k, l = 1, ..., n.

When the summation term can be ignored, we can approximate the Hessian ∇²ϕ(b) as

    H(b) = ∇²ϕ(b) ≈ J(b)^T J(b).                                               (2.6)

The special forms of the gradient g(b) and Hessian H(b) can be exploited by methods for the non-linear least squares problem.

Now we consider how the Hessian matrix can be used to establish the existence of a local minimum or maximum.

Theorem 1. Suppose that f(b) has continuous first and second partial derivatives on a set D ⊆ R^n. Let b^* be an interior point of D that is a critical point of f(b). Then b^* is a:

1. strict local minimum of f(b) if Hf(b^*) is positive definite;
2. strict local maximum of f(b) if Hf(b^*) is negative definite.

This theorem can be proved by using the continuity of the second partial derivatives to show that H f (b) is positive definite for b sufficiently close to b∗, and then applying the multi-variable generalization of Taylor’s Formula [21].

We now consider the case where the Hessian of f(b) is indefinite at a critical point. Suppose that f(b) has continuous second partial derivatives on a set D ⊆ R^n, and let b^* be an interior point of D that is a critical point of f(b). If Hf(b^*) is indefinite, then there exist vectors u, v such that

    u · Hf(b^*) u > 0,   v · Hf(b^*) v < 0.

By continuity of the second partial derivatives, there exists an ε > 0 such that

    u · Hf(b^* + tu) u > 0,   v · Hf(b^* + tv) v < 0

for |t| < ε. If we define

    U(t) = f(b^* + tu),   V(t) = f(b^* + tv),

then U'(0) = V'(0) = 0, whereas U''(0) > 0 and V''(0) < 0. Thus t = 0 is a strict local minimum of U(t) and a strict local maximum of V(t) [21].

Definition 2. A saddle point of f(b) is a critical point b^* such that there are vectors u, v for which t = 0 is a strict local minimum for U(t) = f(b^* + tu) and a strict local maximum for V(t) = f(b^* + tv).

Theorem 2 (Sufficient condition for a local minimum). Suppose that b_s is a stationary point, that ϕ has continuous second partial derivatives, and that the Hessian ∇²ϕ(b_s) is positive definite. Then b_s is a local minimum.

Similarly, if the Hessian H = ∇²ϕ(b_s) is negative definite, then b_s is a local maximum. If the Hessian ∇²ϕ(b_s) is indefinite, that is, it has both positive and negative eigenvalues, then the stationary point b_s is a saddle point [21].

Based on the theorems discussed above, there are two conditions needed to obtain an optimal solution. The first-order necessary condition for b^* to be a local minimum of ϕ(b) is that b^* is a stationary point, i.e. that it satisfies

    g(b^*) = ∇ϕ(b^*) = 0.

The second-order sufficient condition is that if b is a critical point of a twice continuously differentiable function f(b), and if the Hessian of f at b is positive definite, then f has a local minimum at b. In other words, in any direction away from b the value of f increases, at least initially. Thus

    ∇²ϕ(b) = J(b)^T J(b) + Σ_{i=1}^{m} f_i(b) ∇²f_i(b),

where ∇²ϕ(b) is positive definite. The first-order and often dominant term J(b)^T J(b) of the Hessian contains only the Jacobian matrix J(b), i.e., only first derivatives. The computational cost of computing and storing the mn² second derivatives ∂²f_i(b)/∂b_k∂b_l might be quite high. In (2.5) the second derivatives are multiplied by the residuals. If the mathematical model is adequate, then the residuals will be small near the solution and the second term will be less important. In that case the first term, which involves only the Jacobian matrix, can be considered the important part of the Hessian, and the second term can be ignored.

All of these quantities are built from the model function of the non-linear least squares problem and its partial derivatives with respect to each parameter; algorithms such as the Levenberg-Marquardt and Gauss-Newton algorithms construct all the necessary structures by themselves:

• J(b) = ∇f(b) - Jacobian matrix of first-order derivatives of the residuals with respect to each parameter and every measurement. It plays the role of the design matrix of the non-linear least squares model.

• g(b) = ∇ϕ(b) - gradient vector of first-order derivatives of the objective function with respect to the model parameters. It describes the slope of the objective function ϕ(b) at a point b.

• H(b) = ∇²ϕ(b) - Hessian matrix of second-order partial derivatives of the objective function with respect to every combination of parameters.

2.3 Convergence

One of the useful things that we will discuss is the rate of convergence in various numerical methods. At this point, one of the most important criteria to consider is the speed or order of convergence.

Definition 3. A convergent sequence {b_k} with lim_{k→∞} b_k = b^* and b_k ≠ b^* is said to have order of convergence equal to p if

    lim_{k→∞} |e_{k+1}| / |e_k|^p = lim_{k→∞} |b_{k+1} − b^*| / |b_k − b^*|^p = C.

Here p ≥ 1 is called the order of convergence and the constant C is the rate of convergence, or asymptotic error constant. We then say that b_k converges to b^* with order p and constant C.

We use the following to describe how rapidly the error e_k = b_k − b^* converges to zero.

We consider the following cases for three different orders of convergence p:

• A sequence b_k is said to be linearly convergent if b_k converges to b^* with order p = 1 for a constant 0 < C < 1; then

    |e_{k+1}| ≈ C |e_k| < |e_k|   when |e_k| is small.

• A sequence b_k is said to be quadratically convergent if b_k converges to b^* with order p = 2 for a constant C > 0; then

    |e_{k+1}| ≈ C |e_k|^2   when |e_k| is small.

• A sequence b_k is said to be superlinearly convergent if C = 0 for some p ≥ 1; then

    |e_{k+1}| / |e_k| → 0   for k → ∞.

The value of p measures how fast the sequence converges: the bigger the value of p, the faster the convergence. For the numerical approaches considered here, the sequence of approximate solutions converges to the root. When the convergence of an iterative approach is faster, a solution can be reached in a smaller number of iterations compared to an approach with slower convergence. We will discuss the specific convergence properties of some methods in the following chapter.
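Definition 3 can also be used diagnostically. A small MATLAB sketch (the iterate history bk and the known limit bstar are hypothetical variables, not from the thesis) that estimates p and C by comparing successive errors:

    % Sketch: estimating the order of convergence p and constant C from iterates
    e = abs(bk - bstar);                      % errors e_k = |b_k - b*| (column vector)
    p = log(e(3:end) ./ e(2:end-1)) ./ ...    % p ~ log(e_{k+1}/e_k) / log(e_k/e_{k-1})
        log(e(2:end-1) ./ e(1:end-2));
    C = e(3:end) ./ e(2:end-1).^p;            % C ~ |e_{k+1}| / |e_k|^p
    disp([p C])                               % p near 2 indicates quadratic convergence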


Chapter 3

Description of different methods

3.1 Newton's method

Newton's method is a root-finding algorithm that uses the first few terms of the Taylor series of a function in the neighbourhood of a suspected root. We derive the method by looking for local stationary points of ϕ(b), which satisfy ∇ϕ(b) = 0. The approximation of the function ϕ by its second-order Taylor expansion about the point b_{k+1} = b_k + h is given by

    ϕ(b_k + h) ≈ ϕ(b_k) + g(b_k)^T h + (1/2) h^T H(b_k) h,

where the gradient is the vector g(b_k) = ∇ϕ(b_k) and the Hessian is the symmetric matrix

    H(b_k) = [ ∂²ϕ/∂b_k∂b_l ],   k, l = 1, ..., n.

Minimizing ϕ(b_k + h) with respect to h gives a new direction towards a local stationary point b^*. When the Hessian matrix H(b_k) is positive definite, the minimum is obtained where ∇ϕ(b_k + h) equals zero. Hence we want to solve the linear system of equations

    ∇ϕ(b_k + h) = g(b_k) + H(b_k) h = 0,                                       (3.1)

where the Newton step h is the solution of this symmetric linear system, h = −H(b_k)^{-1} g(b_k).

This also gives the iterative update

    b_{k+1} = b_k + h
            = b_k − (∇²ϕ(b_k))^{-1} ∇ϕ(b_k)
            = b_k − H(b_k)^{-1} g(b_k)
            = b_k − ( J(b_k)^T J(b_k) + G(b_k) )^{-1} J(b_k)^T f(b_k),

where G(b_k) denotes the matrix

    G(b_k) = Σ_{i=1}^{m} f_i(b_k) ∇²f_i(b_k).

The resulting iterative algorithm can also be written as

    b_{k+1} = b_k + h = b_k − J(b_k)^{-1} f(b_k).                              (3.2)

In general the inverse Jacobian matrix need not be computed explicitly. We now highlight the most important properties of Newton's approach [4]:

• Newton's method is quite efficient in the final phase of the iteration, where b is close to b^*.

• Newton's method is quadratically convergent to a local minimum b^* as long as the Hessian H(b^*) is positive definite. On the other hand, if H(b_k) is negative definite everywhere in a region containing the point b, the basic Newton approach converges quadratically towards a stationary point b^*, which is then a maximizer.

• It is better to execute a line search for α_k, which helps to guarantee global convergence:

    b_{k+1} = b_k − α_k H_k^{-1} g_k.

We can construct a hybrid method based on Newton's method and the steepest descent method. The solution h_k = −H^{-1}(b_k) g(b_k) is guaranteed to be a downhill direction provided that H_k is positive definite. The central section of this hybrid algorithm can then be sketched as

    if ∇²ϕ(b) is positive definite
        h := h_k
    else
        h := h_sd
    end
    b := b + αh

where h_sd is the steepest descent direction and α is obtained by a line search. Hybrid methods can be very efficient, but they are hardly ever used because they require computing ∇²ϕ(b), which is not available for complex application problems. There are other methods that deal with this by building a sequence of matrices approximating H^* = ∇²ϕ(b^*) [4].

3.1.1 Line search

As in the case of solving a non-linear system, Newton's method needs to be modified when the initial value b_0 is not close to a minimizer. Either a line search can be included or a trust-region technique used. In a line search we take the new iterate to be

    b_{k+1} = b_k + α_k d_k,   α_k ≥ 0.                                        (3.3)

Here d_k is a search direction and the step length α_k > 0 is chosen so that ϕ(b_{k+1}) < ϕ(b_k), which is equivalent to f(α_k) < f(0), where

    f(α) = ϕ(b_k + α d_k).                                                     (3.4)

That d_k is a descent direction ensures that f'(0) = (d_k)^T g(b_k) < 0, where g(b_k) is the gradient at b_k. It is usually not efficient to determine an accurate minimizer of f(α). Rather, it is demanded that α_k satisfy the two conditions

    ϕ(b_k + α_k d_k) ≤ ϕ(b_k) + σ α_k g(b_k)^T d_k,                            (3.5)

    |g(b_k + α_k d_k)^T d_k| ≤ β |g(b_k)^T d_k|,                               (3.6)

where σ and β are constants satisfying 0 < σ < β < 1. Typically σ = 0.001 and β = 0.9 are used [4], [3], [11] and [10]. Three different situations can arise:

1. α_k should be increased when its value is so small that the gain in the value of the objective function is too small.

2. α_k must be decreased when it is too large, in order to satisfy the descent condition ϕ(b_{k+1}) < ϕ(b_k).

3. If α_k is close to the minimizer of f(α), we can accept this value of α_k.

We observe that the Newton step h_k is not a descent direction if g(b_k)^T H(b_k)^{-1} g(b_k) ≤ 0. This is one of the reasons to admit the gradient as an alternative search direction, since there is a risk that the Newton direction will lead to a saddle point.
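A minimal MATLAB sketch of a backtracking line search that enforces the sufficient-decrease condition (3.5); the halving factor, the safeguard and the default sigma = 1e-3 are choices made here in the spirit of the values quoted above, not prescriptions from the thesis:

    % Sketch: backtracking line search satisfying phi(b+a*d) <= phi(b) + sigma*a*g'*d
    function alpha = backtracking(phi, b, d, g, sigma)
        if nargin < 5, sigma = 1e-3; end
        alpha = 1;  phi0 = phi(b);  slope = g(:).' * d(:);    % slope must be negative
        while phi(b + alpha*d) > phi0 + sigma*alpha*slope
            alpha = alpha/2;                                  % shrink the step
            if alpha < 1e-12, break; end                      % safeguard against stagnation
        end
    end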

3.1.2 Convergence of Newton's approach

We are interested here in analyzing the convergence of Newton's approach. Newton's algorithm converges quadratically if the approximation b_k is sufficiently close to a root b^* at which the Jacobian is non-singular.

Theorem 3. Suppose f : R^n → R^n is continuously differentiable and f(b^*) = 0. If

1. J(b^*) is non-singular, and

2. J is Lipschitz continuous on a neighborhood of b^*,

then, for all b^(0) sufficiently close to b^*, Newton's algorithm produces a sequence b^(1), b^(2), ..., that converges quadratically to b^*. The proof of this theorem is described in [3].

Definition 4. Suppose f : R^n → R^m. Then f is said to be Lipschitz continuous on S ⊂ R^n if there exists a positive constant L such that

    ‖f(a) − f(b)‖ ≤ L ‖a − b‖   for all a, b ∈ S.

Lipschitz continuity is a technical condition that is stronger than mere continuity of the Jacobian J but weaker than the condition that the function f(b) be twice continuously differentiable [3], [9].

3.1.3 Problems with Newton's method

The biggest drawback of Newton's approach is that J(b) and its inverse must be computed at every iteration. Computing the Jacobian matrix and its inverse can be relatively hard and time consuming, depending on the size of the system. Newton's approach requires solving several linear systems, which may become expensive when there are many variables. It converges rapidly when the Jacobian J(b^*) is well-conditioned; in the opposite case it may blow up.

3.2 Trust-region methods

Trust-region approaches were initially developed for non-linear least-squares problems [11] and are of a different kind compared to general descent approaches. The fundamental concept of trust-region methods is to first decide the maximal step size for each sub-problem, and thereafter optimize for the direction. The radius Δ_k of the trust region bounds the step size; the function ϕ(b) is assumed to be twice continuously differentiable (so that the second-order Taylor expansion exists) and the model is believed to behave similarly to the original function within the region. In other words, within the radius Δ_k of maximal step size, the optimal direction is calculated with respect to an approximation of ϕ(b_k):

    min_{‖h‖ < Δ_k} m_k(b_k + h),                                              (3.7)

where Taylor's theorem gives

    ϕ(b + h) = ϕ(b) + g(b)^T h + (1/2) h^T H(b) h + higher-order terms.

Consequently the quadratic model function m_k used at each iterate b_k is

    m_k(b_k + h) ≃ m_k(h) = ϕ(b_k) + g(b_k)^T h + (1/2) h^T H(b_k) h.          (3.8)

At each iteration we search for a solution h_k of the sub-problem based on the quadratic model Eq. (3.7), subject to the trust region. Assuming we know a positive number Δ_k such that the model is sufficiently precise inside a ball with radius Δ_k centred at b_k, we determine the step as

    minimize_{h ∈ R^n} m_k(h) = ϕ_k + h^T g_k + (1/2) h^T H_k h   subject to ‖h‖_2 ≤ Δ_k.     (3.9)

Since m_k(h) is supposed to be a good approximation to ϕ(b_k + h) for h sufficiently small, one of the reasons a step may fail is that h was too large and should be reduced. Moreover, if the step is accepted, it may be possible to use a larger step from the new iterate and in that way reduce the number of steps needed before b^* is reached.

3.2.1 Trust-region strategy

A basic element in a trust-region algorithm is the process of selecting the trust-region radius Δ_k at every iteration. The quality of the model with the computed step can be evaluated by the so-called gain ratio

    ρ_k = ( ϕ(b_k) − ϕ(b_k + h_k) ) / ( m_k(0) − m_k(h_k) ),                   (3.10)

which gives the ratio between the actual reduction and the expected reduction. The actual decrease of the objective function for the trial step h_k is given by the numerator. The predicted reduction in the denominator of Equation (3.10) is the reduction that was expected by the model function m_k. The choice of Δ_k is at least partially decided by the ratio ρ_k at previous iterations. By construction the predicted reduction should always be positive. If the gain ratio ρ_k is negative, the new objective value ϕ(b_k + h_k) is bigger than the present value ϕ(b_k), and thus the step has to be discarded. If the gain ratio ρ_k is close to one, there is a good agreement between the model function m_k and the function over the step, and therefore it is safe to expand the trust region for the upcoming iteration. If the gain ratio ρ_k is positive but much smaller than one, the trust region is not updated, but when it is close to zero or negative, the trust region Δ_k is shrunk at the next iteration.

The approximate function m_k(h) can be minimized by a variety of approaches. With a trust-region approach, in contrast to a line search, the exact optimal solution is not necessarily needed. An uncomplicated approach is to minimize the linear approximation

    min_{‖h‖ < Δ_k} { ϕ(b_k) + g(b_k)^T h }.

Its solution points in the steepest descent direction,

    h = −g(b_k) / ‖g(b_k)‖,

where one only needs to choose the step length to be less than the trust radius.

3.2.2 Trust-region algorithm

For the sake of controlling that the process performs well, a check is made to determine whether the trust radius is adequate. Therefore the expected reduction m_k(b_k) − m_k(b_k + h_k) and the actual reduction ϕ(b_k) − ϕ(b_k + h_k) are compared. The ratio between the actual reduction and the expected one, ρ_k = Ared_k / Pred_k, plays a very significant part in the algorithm. This ratio is used to determine whether the trial step is acceptable and to alter the radius of the new trust region [14].

We can now give a model trust-region algorithm [11] and [12].

Algorithm 1. Trust-Region

step 1: Specify a starting approximation b_0, a maximum step length Δ̄, an initial trust-region size Δ_0 ∈ (0, Δ̄) and an acceptance constant γ ∈ [0, 1/4).

step 2: For k = 0, 1, 2, ... until b_k is optimal:
    • Solve min_h m_k(h) in (3.8) subject to ‖h‖ ≤ Δ_k.
    • Compute the gain ratio ρ_k in Eq. (3.10) for h_k.

step 3: Update the current point:

    b_{k+1} = b_k + h_k   if ρ_k > γ,
              b_k         otherwise.

step 4: Update the trust-region radius:

    Δ_{k+1} = (1/4) Δ_k          if ρ_k < 1/4,
              min(2Δ_k, Δ̄)       if ρ_k > 3/4 and ‖h_k‖ = Δ_k,
              Δ_k                otherwise.
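A minimal MATLAB sketch of steps 3-4 of Algorithm 1 (the handle phi, the scalars m0 and mh for the model values, and the outer loop are assumptions; solving the sub-problem itself is not shown):

    % Sketch: acceptance test and radius update from Algorithm 1
    rho = (phi(b) - phi(b + h)) / (m0 - mh);    % gain ratio (3.10); m0 = m_k(0), mh = m_k(h)
    if rho > gamma                              % accept the trial step
        b = b + h;
    end
    if rho < 1/4
        Delta = Delta/4;                        % poor agreement: shrink the region
    elseif rho > 3/4 && norm(h) >= Delta - eps
        Delta = min(2*Delta, Delta_max);        % very good agreement at the boundary: expand
    end                                         % otherwise Delta is kept unchanged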

3.2.3 The sub-problem of the trust-region

One of the most significant parts of trust-region algorithms is the trust-region sub-problem. Since every iteration of a trust-region algorithm demands solving, exactly or inexactly, a trust-region sub-problem, finding an efficient solver for these sub-problems is very important. We will consider sub-problem (3.9), which has been studied by many authors. At iteration k of a trust-region approach, the following sub-problem must be solved:

    minimize_{h ∈ R^n} m_k(h) = ϕ_k + h^T g_k + (1/2) h^T H_k h   subject to ‖h‖_2 ≤ Δ_k.

It can be shown that the solution h^* of this constrained problem is the solution of the linear system

    (H_k + λI) h^* = −g_k,                                                     (3.11)

where g_k ∈ R^n, H_k ∈ R^{n×n} is a symmetric matrix and Δ_k > 0, if and only if there exists λ ≥ 0 such that (H_k + λI) is positive semi-definite, ‖h^*‖_2 ≤ Δ_k and λ(Δ_k − ‖h^*‖_2) = 0. Note that if H_k = ∇²ϕ(b_k) is positive definite and Δ_k is big enough, the solution of the trust-region sub-problem is the solution of

    ∇²ϕ(b_k) h = −∇ϕ(b_k),

i.e. h is the Newton direction. Otherwise

    Δ_k ≥ ‖h_k‖ = ‖(∇²ϕ(b_k) + λI)^{-1} ∇ϕ(b_k)‖.                              (3.12)

So if Δ_k goes to zero, then λ goes to infinity and

    h_k → −(1/λ) ∇ϕ(b_k).

When λ varies between zero and infinity, the corresponding search direction h_k(λ) varies between the Newton direction and the steepest descent direction.

3.3 The approach of Gauss-Newton

This approach depends on a linear approximation of the components of f(b) in the neighbourhood of b. The idea of the approach is to approximate the Hessian matrix H(b) by its first part, J^T(b_k) J(b_k). It is used for solving non-linear least squares problems and can only be used to minimize a sum-of-squares objective function. Consider the Taylor series expansion

    ϕ(b_k + h) ≈ (1/2) f(b_k)^T f(b_k) + h^T J(b_k)^T f(b_k) + (1/2) h^T J(b_k)^T J(b_k) h
               ≈ ϕ(b_k) + g(b_k)^T h + (1/2) h^T H(b_k) h,

where the gradient and Hessian of ϕ = (1/2) f^T f are given by (2.1) and (2.6). Since Newton's method demands computation of the second derivatives, it is only used when it is feasible to compute H(b_k). The Gauss-Newton method differs from Newton's method by using an approximation of the Hessian matrix. It can be considered as arising from neglecting the second-derivative term in

    H(b_k) = ∇²ϕ(b_k) = J(b_k)^T J(b_k) + Σ_{i=1}^{m} f_i(b_k) ∇²f_i(b_k) = J(b_k)^T J(b_k) + Q(b_k).

The term Q(b_k) will be small close to the solution b^* if the residual norm ‖f(b^*)‖ is small or if f(b) is only mildly non-linear. The behaviour of the Gauss-Newton approach can then be expected to be similar to that of Newton's approach. In particular, for a consistent problem with zero residual, i.e. f(b^*) = 0, the local convergence is the same for the two methods. However, for moderate to large residual problems the local convergence rate of the Gauss-Newton approach can be much inferior to that of Newton's approach.

This method uses the approximation Q(b_k) = 0 and determines the search direction from the Newton equations

    ∇²ϕ(b_k) h^N = −∇ϕ(b_k),

with the Gauss-Newton method approximating the gradient and Hessian respectively as

    g(b_k) = J^T(b_k) f(b_k),                                                  (3.13)
    H(b_k) ≈ J^T(b_k) J(b_k).                                                  (3.14)

The resulting method is referred to as the Gauss-Newton approach, where the computation of the search direction h^GN amounts to the solution of the linear system

    J(b_k)^T J(b_k) h^GN = −J(b_k)^T f(b_k).                                   (3.15)

Note that J(b_k)^T J(b_k) is always at least positive semi-definite. When J(b_k) has full rank (so that J(b_k)^T J(b_k) is nonsingular) and the gradient g(b_k) is nonzero, the Gauss-Newton search direction is a descent direction and hence an appropriate direction for a line search; (3.15) is in fact the system of normal equations for the linear least squares problem

    min_{h^GN} ‖ J(b_k) h^GN − (−f(b_k)) ‖_2^2.                                (3.16)

Otherwise J(b_k)^T J(b_k) is non-invertible and the equation

    J(b_k)^T J(b_k) h^GN = −J(b_k)^T f(b_k)                                    (3.17)

does not have a unique solution. In this case the problem is said to be under-determined or over-parametrized. Furthermore, if ϕ(b^*) = 0, then H(b) ≃ ∇²ϕ(b) for b close to b^*, and we get quadratic convergence.

The difference between Newton's approach and the Gauss-Newton approach lies in the search directions:

    H(b) h^N = −g(b),      J(b)^T J(b) h^GN = −g(b).

Gauss-Newton applies to over-determined systems and minimizes the error ‖J(b) h^GN + f(b)‖^2 for the update step. For a square system this reduces to the standard Newton method for f(b) = 0. One could also solve an overdetermined system by minimizing the squared error ‖f(b)‖^2; determining the zeros of the gradient by Newton's method would then require second derivatives of ϕ, which are not needed for Gauss-Newton.

The solution is given by the normal equations,

    h^GN = (J_k^T J_k)^{-1} J_k^T (−f_k),                                      (3.18)

which is a descent direction since, as mentioned before, J_k^T J_k is a positive semi-definite matrix. It can be shown that for a wide range of problems, taking the step length equal to one results in convergence.

Algorithm 2. Gauss-Newton Method

• Step 0: Choose an initial b_0 ∈ R^n and iterate for k = 0, 1, 2, ...

• Step 1: Repeat until convergence:

    - Step 1.1: Solve J^T(b_k) J(b_k) h^GN = −J^T(b_k) f(b_k).
    - Step 1.2: Choose a step length α_k so that there is sufficient descent.
    - Step 1.3: Set b_{k+1} = b_k + α_k h^GN.
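A compact MATLAB sketch of Algorithm 2 (resfun and jacfun are assumed handles returning f(b) and J(b); the tolerance, the iteration cap and the fixed unit step length are choices made here, not prescribed by the thesis):

    % Sketch: Gauss-Newton iteration for min (1/2)*||f(b)||^2
    function b = gauss_newton(resfun, jacfun, b, tol, maxit)
        if nargin < 4, tol = 1e-8; end
        if nargin < 5, maxit = 100; end
        for k = 1:maxit
            f = resfun(b);  J = jacfun(b);
            g = J.' * f;                          % gradient (2.1)
            if norm(g) <= tol, break; end
            h = -(J.' * J) \ g;                   % GN step, Eq. (3.15)/(3.18)
            b = b + h;                            % unit step; a line search could scale h
        end
    end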

A sufficient condition for convergence of the GN approach is known if the normal equations for the linearized least squares problem (3.16) are solved exactly in Step 1.1 at every iteration. The approach with line search α_k can be shown to converge provided that, first, there exists b^* ∈ R^n such that J^T(b^*) f(b^*) = 0 and, second, the Jacobian matrix J(b^*) at b^* has full rank n.

We introduce the notation υ(A) for the spectral radius of an n × n matrix A and define

    ρ = υ( (J(b^*)^T J(b^*))^{-1} Q(b^*) ).                                    (3.19)

The following theorem on local convergence of the Gauss-Newton approach then holds.

Theorem 4. Let the first and second assumptions hold. If ρ < 1, then the Gauss-Newton iteration converges locally to b^*; that is, there exists ε > 0 such that the sequence {b_k} generated by the Gauss-Newton algorithm converges to b^* for all b_0 ∈ D ≡ {b | ‖b − b^*‖ < ε}.

The proof of theorem 4 can be found in [20].

3.4 The Levenberg-Marquardt method

Levenberg introduced improvements to the Gauss-Newton method to eliminate its weaknesses when the Jacobian matrix is rank deficient, i.e. when its rank is strictly smaller than both the number of rows and the number of columns. Marquardt later modified Levenberg's method by adding a strategy for controlling the so-called damping parameter λ. The Levenberg-Marquardt method is also known as the damped least squares method. The method switches between the Gauss-Newton approach described previously and what is called a steepest descent approach [19]. The Levenberg-Marquardt approach can thus be seen as a modification of Gauss-Newton that prevents the failure of the search direction when the Jacobian J is rank deficient. The approach interpolates between the steepest descent and Gauss-Newton methods. In the steepest descent method the sum of squared errors is reduced by updating the parameters in the descent (search) direction; for well-behaved functions and reasonable starting parameters this is usually somewhat slower than Gauss-Newton. On the other hand, this approach is more robust than Gauss-Newton, which means that a solution is found in many cases even when starting quite far from the final minimum. Rather than approximating the Hessian as in (2.5), Marquardt observed that the summation term in (2.5) can be approximated by λI, where λ ≥ 0 is a damping parameter altered by the algorithm. When λ is large, the method behaves like the steepest descent approach; in that regime the Hessian approximation J^T J is not really needed, the rate of convergence is only linear and can be very slow, so the steepest descent behaviour is usually useful only far from the solution or when other search directions fail [4]. Using this we approximate the Hessian as

    H(b) ≈ J(b)^T J(b) + λI.                                                   (3.20)

Thus the search direction h^LM is defined by the following modification of (3.15),

    ( J(b)^T J(b) + λ D^T D ) h^LM = −J(b)^T f(b),                             (3.21)

which can also be written, using the approximation H_k ≈ J_k^T J_k, as the system to be solved for different values of the damping parameter λ:

    (H_k + λ_k I) h^LM = −g_k,                                                 (3.22)

where D^T D is a positive-definite diagonal matrix that represents the relative parameter scaling (we may use D^T D = I to simplify the description of the method) and f is the residual vector. When λ has large values, the approach takes a small step in the steepest descent direction; this is good if the current iterate is far from the solution, since then h^LM ≃ −(1/λ)∇ϕ(b). If λ is selected to be small, then the approach behaves like Gauss-Newton, h^LM ≃ h^GN, which is the preferred step in the final stages of the iteration, when b is close to b^*, and gives nearly quadratic final convergence; far from the solution we instead prefer the robustness of the steepest descent behaviour. The Gauss-Newton part approximates the Hessian matrix by J^T J and does not require the second term of the Hessian to obtain a convergence rate similar to that of Newton's method. In short, the Levenberg-Marquardt method acts more like the steepest descent approach when the parameters are far from their optimal values, and acts more like the Gauss-Newton approach when the parameters are near their optimal values.

This approach might also be seen as Gauss-Newton combined with a trust-region process. The problem is equivalent to minimizing the model function m_k(h^LM) of (3.8), i.e. the trust-region model built from the approximate Hessian and the gradient of ϕ(b_k):

    min_{‖h^LM‖ ≤ Δ}  ϕ(b_k) + ∇ϕ(b_k)^T h^LM + (1/2) (h^LM)^T H(b_k) h^LM,    (3.23)

and the iteration step itself is b_{k+1} = b_k + h^LM. We perform a trust-region strategy instead of a line-search technique, where the norm of the solution of (3.22) is bounded, so at each iteration we must solve

    min ‖ J(b_k) h^LM + f(b_k) ‖_2^2   subject to ‖h^LM‖ ≤ Δ_k,                (3.24)

where the trust-region radius Δ_k is bigger than zero, producing a spherical trust region. Therefore the Levenberg-Marquardt approach may be considered a trust-region approach. The step in Levenberg-Marquardt's method is computed as

    h_k^LM = argmin_{h^LM} { ‖ J(b_k) h^LM + f(b_k) ‖_2^2 + λ_k ‖ h^LM ‖_2^2 },    (3.25)

where λ_k > 0 is the Lagrange parameter for the constraint at the kth iteration and is updated from iteration to iteration. Thus h^LM is calculated from the normal equations of the damped linear least-squares problem [7]

    min_{h^LM} (1/2) ‖ [ J(b_k) ; √λ_k I ] h^LM − [ −f(b_k) ; 0 ] ‖_2^2.       (3.26)
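Equation (3.26) says that the damped step can be obtained from an ordinary linear least-squares problem. A minimal MATLAB sketch (J, f and lambda are assumed to be available; the square root of lambda appears because squaring the augmented system reproduces the normal equations (3.22)):

    % Sketch: one Levenberg-Marquardt step from the damped least-squares problem (3.26)
    n   = size(J, 2);
    A   = [J; sqrt(lambda) * eye(n)];     % augmented matrix
    rhs = [-f; zeros(n, 1)];              % augmented right-hand side
    hLM = A \ rhs;                        % solves (J'J + lambda*I) hLM = -J'f, Eq. (3.22)

Solving the augmented system with the backslash operator avoids forming J^T J explicitly, which is numerically preferable when J is ill-conditioned.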

For an ill-conditioned Jacobian (condition number very large or infinite), this makes the method a more robust variation of Gauss-Newton. The key strategic question in the Levenberg-Marquardt method is how to choose and update the damping parameter λ_k at each iteration.

3.4.1 Implementation strategy of the Levenberg-Marquardt method

Analogous to the strategy of the trust-region method, the gain ratio ρ(h^LM)_k (cf. (3.10)) compares the actual reduction of the objective function in the numerator with the expected reduction of the quadratic model in the denominator. The gain ratio is used to control and update the damping parameter λ_k in the Levenberg-Marquardt method:

    ρ(h^LM)_k = ( ϕ(b_k) − ϕ(b_k + h^LM) ) / ( m_k(0) − m_k(h^LM) ).           (3.27)

By construction the denominator is positive, while the numerator is negative if the step was not downhill, i.e. if it was too large and should be reduced.

Linearizing the model µ, with f(b) = y − µ(b), the actual reduction in the numerator can be approximated as

    ϕ(b_k) − ϕ(b_k + h^LM) ≈ −( (h^LM)^T J^T f(b) + (1/2) (h^LM)^T J^T J h^LM )
                            = −(1/2) (h^LM)^T J^T ( 2(y − µ(b)) + J h^LM ).

The denominator is the reduction predicted by the local linear model:

    m(0) − m(h^LM) = −( (h^LM)^T J^T f(b) + (1/2) (h^LM)^T J^T J h^LM ).

Using the damped normal equations ( J^T J + λ diag(J^T J) ) h^LM = −J^T (y − µ(b)) to eliminate J^T J h^LM gives

    m(0) − m(h^LM) = (1/2) (h^LM)^T ( λ diag(J^T J) h^LM − J^T (y − µ(b)) ).

Both terms in this expression are positive, so m(0) − m(h^LM) is guaranteed to be positive. Finally we get

    ρ(h^LM) = ( ϕ(b) − ϕ(b + h^LM) ) / ( m(0) − m(h^LM) )
            = −(1/2) (h^LM)^T J^T ( 2(y − µ(b)) + J h^LM ) / [ (1/2) (h^LM)^T ( λ diag(J^T J) h^LM − J^T (y − µ(b)) ) ].

In a damped method a small value of ρ(h^LM) indicates that we must increase the damping factor, thereby increasing the penalty on large steps.

• If ρ(h^LM) is large, then m(h^LM) is a good approximation to ϕ(b + h^LM), and the damping parameter λ can be decreased towards 0 so that the following Levenberg-Marquardt step is closer to a Gauss-Newton step.

• If ρ(h^LM) is small or negative, then m(h^LM) is a poor approximation, and the damping parameter λ must be increased, with the two-fold goal of getting closer to the steepest descent direction and reducing the length of the step.

In general, if there is good agreement between the actual reduction and the predicted reduction, ρ_k ≈ 1, then the trust-region radius Δ_k is increased. If the agreement is poor, ρ_k positive but markedly smaller than 1, then Δ_k is kept unchanged. If ρ_k is close to zero or negative, the step is rejected and Δ_k is decreased at the next iteration.

The algorithm for updating the damping parameter in the Levenberg-Marquardt method is as follows [11].

Algorithm 3. Levenberg-Marquardt method for nonlinear equations

Step 1. Given b_1 ∈ R^n, ε ≥ 0, λ_1 > m > 0, 0 ≤ η_0 ≤ η_1 ≤ η_2; set k := 1.

Step 2. If ‖J(b_k)^T f(b_k)‖ ≤ ε, then stop.

Step 3. Compute ρ(h^LM)_k = ( ϕ(b_k) − ϕ(b_k + h^LM) ) / ( m_k(0) − m_k(h^LM) ) and set

    b_{k+1} = b_k + h_k^LM   if ρ_k > η_0,
              b_k            otherwise.

Step 4. Choose λ_{k+1} as

    λ_{k+1} = 4λ_k               if ρ_k < η_1,
              λ_k                if ρ_k ∈ [η_1, η_2],
              max{λ_k/4, m}      if ρ_k > η_2;

set k := k + 1 and go to Step 2.

In the algorithm, the step h_k^LM for the Levenberg-Marquardt method is computed as

    h_k^LM = −( J(b_k)^T J(b_k) + λ_k I )^{-1} J(b_k)^T f(b_k).                (3.28)

The main difference between the trust-region and Levenberg-Marquardt approaches is that the trust-region approach alters the radius of the trust region Δ_k directly, while the Levenberg-Marquardt approach alters the damping parameter λ_k, which controls the corresponding trust-region radius only indirectly.
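Putting the pieces together, a minimal MATLAB sketch of a Levenberg-Marquardt iteration in the spirit of Algorithm 3 (the factor 4 follows the algorithm above; the thresholds 0.25 and 0.75, the tolerances, the handle names and the acceptance level η_0 = 0 are assumptions made here):

    % Sketch: Levenberg-Marquardt iteration with gain-ratio-controlled damping
    function b = lm_fit(resfun, jacfun, b, lambda, maxit)
        if nargin < 4, lambda = 1e-2; end
        if nargin < 5, maxit = 200; end
        for k = 1:maxit
            f = resfun(b);  J = jacfun(b);  g = J.'*f;
            if norm(g) <= 1e-10, break; end
            h    = -(J.'*J + lambda*eye(numel(b))) \ g;        % damped step (3.28)
            fn   = resfun(b + h);
            ared = 0.5*(f.'*f) - 0.5*(fn.'*fn);                % actual reduction
            pred = -(h.'*g + 0.5*h.'*(J.'*J)*h);               % predicted reduction
            rho  = ared / pred;                                % gain ratio (3.27)
            if rho > 0, b = b + h; end                         % accept downhill steps
            if rho < 0.25, lambda = 4*lambda;                  % poor model: more damping
            elseif rho > 0.75, lambda = max(lambda/4, 1e-12);  % good model: less damping
            end
        end
    end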


Chapter 4

Implementation and verification of the Gauss-Newton and L-M methods

4.1 Numerical experiments

We will analyze and simulate the accuracy of fitting a power-exponential model µ(b, x) to a certain set of points, and focus on the specifics of this model for curve fitting. We define our model as

    µ(b_1, b_2, x) = b_1 ( x e^(1−x) )^{b_2}                                   (4.1)

for proper selections of the parameters b_1 and b_2. We have:

• ∂µ/∂b_1 = ( x · e^(1−x) )^{b_2}, which is independent of the parameter b_1;

• ∂µ/∂b_2 = µ(b_1, b_2, x) · ( ln(x) + (1 − x) ) = µ(b_1, b_2, x) · log( x · e^(1−x) ), which depends on b_2.
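The model (4.1) and the two partial derivatives listed above translate directly into MATLAB anonymous functions; this is only a sketch of the expressions, not the code from Appendix A.1:

    % Sketch: power-exponential model mu(b1,b2,x) = b1*(x*exp(1-x))^b2 and its partials
    u      = @(x) x .* exp(1 - x);                         % common factor x*e^(1-x)
    mu     = @(b, x) b(1) .* u(x).^b(2);                   % model (4.1), b = [b1; b2]
    dmudb1 = @(b, x) u(x).^b(2);                           % d mu / d b1 (independent of b1)
    dmudb2 = @(b, x) mu(b, x) .* log(u(x));                % d mu / d b2 = mu * log(x*e^(1-x))

    % Jacobian of the residuals f_i = y_i - mu(b, x_i), following Eq. (2.2):
    Jfun = @(b, x) [-dmudb1(b, x(:)), -dmudb2(b, x(:))];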

When least squares is used for selecting the parameters, the error associated with the model b_1 ( x e^(1−x) )^{b_2} is minimized by

    min_{b_1, b_2} Σ_{i=1}^{m=9} ( y_i − b_1 ( x_i e^(1−x_i) )^{b_2} )^2.      (4.2)

This is just m times the variance of the data set { y_1 − b_1 ( x_1 e^(1−x_1) )^{b_2}, ..., y_m − b_1 ( x_m e^(1−x_m) )^{b_2} }. It is natural to minimize this error, and we observe that the error is a function of two variables. Thus the ith function would be

    µ_i(b_1, b_2) = y_i − b_1 ( x_i e^(1−x_i) )^{b_2},                         (4.3)

which is the residual for the ith data point. The aim is to find values of the two variables b_1 and b_2 that minimize the residual error. The majority of least squares problems are of this type, where the functions µ_i(b) are residuals and the index i represents a specific data point; this is what distinguishes least squares problems. Such problems typically include some assumptions concerning the error in the model. For instance, there might be

    y_i = b_1 ( x_i e^(1−x_i) )^{b_2} + ε_i,                                   (4.4)

where the errors ε_i are thought to come from one probability distribution, often the normal distribution. Associated with this structure are the true parameters b_1 and b_2; however, every time we collect data and solve the least-squares problem, the results are only estimates b̂_1 and b̂_2 of those true parameters. After computing these estimates, the two common methods for non-linear least squares problems, the Gauss-Newton and Levenberg-Marquardt methods, will be compared using their algorithms.

The efficiency of these methods is a consequence of the particular structure of the Hessian matrix ∇²ϕ(b) for the least-squares objective function. The Hessian H(b) in this case is the sum of two terms. The first one involves only the gradients of the power-exponential function µ_i and is therefore easier to calculate. The second involves the second derivatives, but is zero if the errors ε_i are all zero, i.e. if the model fits the data perfectly. Several approaches for least squares try to approximate this second term of (2.5).

4.2 Hessian and gradient for power-exponential model

To compute the gradient and Hessian of this model with the data points (x_i, y_i), i = 1, ..., 9, the residual vector is

    f(b) = ( y_1 − b_1 ( x_1 e^(1−x_1) )^{b_2}, ..., y_9 − b_1 ( x_9 e^(1−x_9) )^{b_2} )^T.

The least-squares objective function is

    ϕ(b_1, b_2) = (1/2) Σ_{i=1}^{9} ( y_i − b_1 ( x_i e^(1−x_i) )^{b_2} )^2 = (1/2) f(b)^T f(b).

The gradient of ϕ is

    ∇ϕ(b_1, b_2) = −( Σ_{i=1}^{9} ( y_i − b_1 ( x_i e^(1−x_i) )^{b_2} ) ( x_i e^(1−x_i) )^{b_2} ;
                      Σ_{i=1}^{9} ( y_i − b_1 ( x_i e^(1−x_i) )^{b_2} ) µ(b_1, b_2, x_i) log( x_i e^(1−x_i) ) ),

which can be written compactly as ∇ϕ(b_1, b_2) = J(b)^T f(b), where row i of J(b) is

    −( ( x_i e^(1−x_i) )^{b_2},  µ(b_1, b_2, x_i) log( x_i e^(1−x_i) ) ).

The Hessian matrix is

    ∇²ϕ(b) = J(b)^T J(b) + Σ_{i=1}^{m=9} f_i(b) ∇²f_i(b).

We observe that {x_i} and {y_i} are the data values of the model, while b_1 and b_2 are the variables in the model. As mentioned before, if f_i(b^*) equals zero then it is reasonable to expect that f(b) ≈ 0 for b ≈ b^*, implying that

    H(b) ≈ J(b)^T J(b).

This final formula involves only the first derivatives of the functions {f_i(b)} and shows that an approximation of the Hessian matrix can be found using only first derivatives, at least in cases where the model is a good fit to the data.

4.3 Power exponential data fitting

One of the simplest methods for analyzing the data is the Gauss-Newton method, which uses the approximation to the Hessian matrix directly. It computes a search direction using the formula from Newton's method,

    ∇²ϕ(b) h^GN = −∇ϕ(b).                                                      (4.5)

When f(b^*) in Eq. (3.15) equals zero and J(b^*) has full rank, the Gauss-Newton method behaves like Newton's method near the solution, but without the cost associated with computing second derivatives. We now apply the Gauss-Newton method to a power-exponential model of the form

    y_i = µ(x_i; a, b) + ε_i,    y_i = a ( x_i e^(1−x_i) )^{b} + ε_i,

with data

    x_i = ( 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 )^T,

where the ε_i are measurement errors on the data ordinates, assumed to behave like noise. We apply the GN method without a line search, using an initial guess that is close to the solution:

    a = 0.53,    b = 0.8137.

For the given data we get the Jacobian

    J = [ 0.3232  −0.2394
          0.5211  −0.2227
          0.6667  −0.1773
          0.7752  −0.1295
          0.8560  −0.0873
          0.9146  −0.0535
          0.9554  −0.0286
          0.9815  −0.0120
          0.9957  −0.0028 ]

and

    J(b)^T J(b) = [ 5.8575  −0.5775
                   −0.5775   0.1666 ].

The Gauss-Newton search direction is obtained by solving the linear system

    J(b)^T J(b) h^GN = −J(b)^T f(b),

and thus

    h^GN = ( −0.0004775, −0.0000036 )^T,

and the new estimate of the solution is

    ( 0.5273, 0.8052 )^T.

Since ϕ ≈ 0, an approximate global solution has been found to the least-squares problem; the least squares objective function cannot be negative.


Figure 4.1: The power-exponential function of µ(a, b, xi) in GN-algorithm

In general, the GN method is only guaranteed to find a local solution; with a poor initial guess it may need to be restarted from a point closer to the solution, and it may converge slowly. The purpose of the nonlinear regression is to minimize the least squares problem ‖f(b)‖^2 = Σ_i ( y_i − µ(a, b; x_i) )^2. Adding the squares of all the entries in f(b) gives the residual norm ‖f(b)‖ = 0.0246, with

    f(b) = ( 0.0053, −0.0151, 0.0042, 0.0122, 0.0015, −0.0005, −0.0093, 0.0084, −0.0048 )^T.

Thus Σ_i ( y_i − µ(a, b; x_i) )^2 = 0.00060653, which shows that the model has become quite accurate. The advantage over Newton's method is that we do not need to calculate the second-order derivatives of the Hessian matrix. However, if any residual component f_i(b^*) is large, the approximation that the second-order derivative term in the Hessian matrix equals zero will be poor, and the Gauss-Newton method will converge more slowly than Newton's method.


There is another important method, the Levenberg-Marquardt method, that can also be used to minimize this least squares problem. This method involves both steepest descent and Gauss-Newton iterations. We can use a steepest-descent-type method until we approach a minimum, then gradually switch to the quadratic rule, and we can estimate how close we are to a minimum from how the error is changing. In particular, Levenberg's algorithm is formulated in terms of a damping parameter λ that determines the blend between steepest descent and Gauss-Newton iterations. Levenberg and Marquardt proposed an algorithm based on this observation, whose update rule is a mix of the above-mentioned algorithms and uses the modified Hessian matrix

    H(b, λ) = H_GN + λI,                                                       (4.6)

    b_{i+1} = b_i − (H_GN + λI)^{-1} ∇ϕ(b_i).                                  (4.7)

We evaluate the new residual error at the point given by Eq. (4.7) and compute the cost at the new point, ϕ_new. This update rule is used as follows: if the residual error goes down following an update, it implies that our quadratic assumption on ϕ(b) is working; when the damping parameter λ is small, H approximates the Gauss-Newton Hessian and the steps are close to Gauss-Newton steps, while when λ is large, H is close to a multiple of the identity and steepest-descent steps are taken. On the other hand, if the error increases, we would like to follow the gradient more, and so the damping parameter λ is increased by the same factor.
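A minimal MATLAB sketch of the accept/reject rule just described (the factor 10 is the one quoted below Figure 4.2; the variable names H_GN, grad, phi_of and phi_old are assumptions made here):

    % Sketch: Levenberg's accept/reject rule for the damping parameter
    h       = -(H_GN + lambda*eye(size(H_GN,1))) \ grad;   % step from Eq. (4.7)
    phi_new = phi_of(b + h);                               % cost at the trial point
    if phi_new < phi_old                                   % error went down: quadratic model works
        b      = b + h;                                    % accept the step
        lambda = lambda / 10;                              % move towards Gauss-Newton behaviour
    else                                                   % error went up: follow the gradient more
        lambda = lambda * 10;                              % move towards steepest descent
    end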


Figure 4.2: The power-exponential function of µ(a, b, xi)

The step in Levenberg-Marquardt's method is computed as in Eq. (3.21):

    h^LM = ( −0.0149, −0.1468 )^T · 10^{−23}.

The resulting residual error of the Levenberg-Marquardt algorithm is ‖f(b)‖ = 0.0246, with

    f(b) = ( 0.0052, −0.0153, 0.0040, 0.0120, 0.0013, −0.0008, −0.0096, 0.0082, −0.0050 )^T,

and Σ_i ( y_i − µ(a, b; x_i) )^2 = 0.00060611. If the error has decreased as a result of the update, the step is accepted and the damping parameter λ is decreased by a factor of 10 (i.e. multiplied by 0.1). The above algorithm has the disadvantage that if the damping parameter λ has a big value, the computed Hessian matrix is not used at all.


Chapter 5

Application of non-linear least squares for curve fitting

This chapter illustrates in detail the analysis of two different data sets by using two methods for non-linear least squares techniques for curve fitting. After that, we will also compare the two methods for each data set separately to observe the difference and between the methods and determine which gives the best results.

5.1 Modelling mortality rate

Mortality (death rate) data indicate numbers of deaths by place, time and cause. The fact that mortality increases strongly with increasing age is anything but surprising, but it is worthwhile to look more closely at how mortality changes with age; there are patterns that might still surprise a little. Death risk is an easy way to put mortality figures in relation to the population [15].

The data used here are death rates for men in the USA between 1995 and 2004. Since it is very unlikely that people die at a young age, it is easier to see the structure of the data when we look at the logarithm of the values instead of the values themselves [16].
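For example, such a logarithmic view can be produced with a couple of lines of MATLAB; the variable names age and deathRate are illustrative placeholders for the data from [16].

```matlab
% Plot death risk against age on a logarithmic y-axis to reveal its structure.
semilogy(age, deathRate, '.');
xlabel('age'); ylabel('death risk');
```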


Figure 5.1: Death risk by age for men in the USA between 1995 and 2004.

5.1.1 Analysing mortality rate data

Consider a linear combination of two power-exponential functions

\mu(c_1, c_2, a_1, a_2, a_3, x) = \frac{c_1 e^{c_2 x}}{x} + e^{(a_3 - a_1)} \left( a_2 x e^{-a_2 x} \right)^{a_3}    (5.1)

where the function µ is the death rate and the variable x is the age for each year. In this scenario a mathematical model is set up from the experimental data points, and the smooth curve given by the theoretical equation is fitted to the data. By solving the resulting system of non-linear equations we obtain the best estimates of the parameters c_1, c_2, a_1, a_2, a_3 of the function µ in the theoretical model (5.1). We can then plot this function along with the data points and see how well the data points fit the theoretical equation.
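As a sketch, model (5.1) can be written as a MATLAB function handle; the parameter ordering p = [c1, c2, a1, a2, a3] is an illustrative choice and need not match the code in Appendix A.2.

```matlab
% Model (5.1) as a function handle; p = [c1, c2, a1, a2, a3], x a vector of ages.
mu = @(p, x) p(1)*exp(p(2)*x)./x ...
           + exp(p(5) - p(3)) .* (p(4)*x.*exp(-p(4)*x)).^p(5);
```

Calling mu(p0, age) with an initial parameter vector p0 then evaluates the model at all ages at once.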

Here we examine the two-term power-exponential formula for death risk by using the first main method, the Gauss-Newton algorithm, to fit the collected mortality data for the USA between 1995 and 2004 to the theoretical model (5.1). We expect the experimental data to follow the theoretical model (5.1) closely. In order to do this using the Gauss-Newton method we must first calculate the partial derivatives for the Jacobian and then the Hessian matrix. Here, f_i is given by the equation

f_i = y_i - \mu(c_1, c_2, a_1, a_2, a_3, x_i).    (5.2)


Thus our equations for the partial derivatives for the Jacobian matrix are given by:

\frac{\partial f}{\partial c_1} = \frac{e^{c_2 x_i}}{x_i}

\frac{\partial f}{\partial c_2} = c_1 e^{c_2 x_i}

\frac{\partial f}{\partial a_1} = -e^{(a_3 - a_1)} \left( a_2 x_i e^{-a_2 x_i} \right)^{a_3}

\frac{\partial f}{\partial a_2} = -e^{(a_3 - a_1)} \left( a_2 x_i e^{-a_2 x_i} \right)^{a_3} \cdot \frac{a_3 (a_2 x_i - 1)}{a_2}

\frac{\partial f}{\partial a_3} = e^{(a_3 - a_1)} \left( a_2 x_i e^{-a_2 x_i} \right)^{a_3} \left( \ln(a_2 x_i e^{-a_2 x_i}) + 1 \right)

where x_i is the age, as mentioned before, y_i is the mortality at that age, and c_1, c_2, a_1, a_2, a_3 are the parameters describing the initial mortality and the rate of death. Once the partial derivatives for the Jacobian have been calculated we can proceed with the Gauss-Newton algorithm, after which we compute the residuals for each year from 1995 to 2004 to see the differences that appear in the graphs. The code used for this procedure can be found in Appendix A.2.
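A sketch of how these derivatives could be collected into a Jacobian in MATLAB is given below; the function name and parameter ordering are illustrative assumptions, and the actual implementation is the one in Appendix A.2.

```matlab
% Sketch: Jacobian of model (5.1) with respect to p = [c1, c2, a1, a2, a3],
% evaluated at a column vector of ages x; row i corresponds to x_i.
function J = jacobian_mortality(p, x)
    c1 = p(1); c2 = p(2); a1 = p(3); a2 = p(4); a3 = p(5);
    w = a2*x.*exp(-a2*x);                               % inner power-exponential term
    J = [ exp(c2*x)./x, ...                             % df/dc1
          c1*exp(c2*x), ...                             % df/dc2
         -exp(a3 - a1)*w.^a3, ...                       % df/da1
         -exp(a3 - a1)*w.^a3 .* (a3*(a2*x - 1)/a2), ... % df/da2
          exp(a3 - a1)*w.^a3 .* (log(w) + 1) ];         % df/da3
end
```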

f_i = (8.9810, 7.3979, 4.5882, 4.5544, 3.4147, 1.7581, 2.5582, 2.5465, 2.6603, 2.6603)^T \cdot 10^{-8}

The implementation of the Gauss-Newton method usually executes a line search in the search direction h_{GN}, so that the step length condition (3.18) discussed in Chapter 3 is satisfied.
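A minimal sketch of such a line search is shown below; it uses a simple backtracking strategy with a sufficient-decrease test, and whether this corresponds exactly to condition (3.18) is an assumption. The function handle costfun and the constants rho and c are illustrative placeholders.

```matlab
% Sketch: backtracking line search along the Gauss-Newton direction hGN.
% costfun(b) is assumed to return the sum of squared residuals at b.
function alpha = backtracking(costfun, b, hGN, grad)
    alpha = 1; rho = 0.5; c = 1e-4;   % assumed backtracking constants
    f0 = costfun(b);
    for k = 1:50                      % safeguard against endless shrinking
        if costfun(b + alpha*hGN) <= f0 + c*alpha*(grad.'*hGN)
            break;                    % sufficient decrease achieved
        end
        alpha = rho*alpha;            % shrink the step and try again
    end
end
```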


Figure 5.2: Death risk for men in the USA for selected years, (a) 1995, (b) 1998, (c) 2000, (d) 2001, (e) 2003 and (f) 2004, fitted using the Gauss-Newton algorithm; x denotes age and y death risk.


We can see from Figure 5.2 that the curve fits the data quite well. We will now try the more sophisticated L-M method and see if we can get a better fit.

The Levenberg-Marquardt method can also be implemented by combining the features of the steepest descent and Gauss-Newton algorithms: LM steps are combinations of steepest descent and Gauss-Newton steps based on adaptive rules. We now study the power-exponential mortality model using the Levenberg-Marquardt algorithm, computing the Jacobian and Hessian matrix of Eq.(5.2). The resulting error residuals f_i for each year are:

f_i = (0.19135, 0.1087, 6.9263, 5.5287, 4.6414, 5.4514, 4.9257, 4.8617, 6.1015, 5.5459)^T \cdot 10^{-8}

The LM algorithm requires an initial guess for the parameters to be estimated. We chose values of the variables c_1, c_2, a_1, a_2 and a_3 as the initial guess for the death rate data.


Figure 5.3: Death risk for men in the USA for selected years, (a) 1995, (b) 1998, (c) 2000, (d) 2001, (e) 2003 and (f) 2004, fitted using the Levenberg-Marquardt algorithm; x denotes age and y death risk.


Here we notice that the fits obtained with the LM algorithm in Figure 5.3 are good, but not quite as good as those obtained with the Gauss-Newton algorithm.

Year    GN algorithm    LM algorithm
1995    8.9810          0.19135
1998    4.5544          5.5287
2000    1.7581          5.4514
2001    2.5582          4.9257
2003    2.6603          6.1015
2004    2.6603          5.5459

Table 5.1: The resulting residual least squares (×10^{-8}) using µ(c_1, c_2, a_1, a_2, a_3; x).

Table 5.1 shows the final error residuals, that is, the sums of squared errors, for the GN and LM algorithms on the mortality data. We see that the Levenberg-Marquardt method gives a particularly good fit for the year 1995, while the Gauss-Newton method gives a particularly good fit for the year 2000. We can therefore say that the GN algorithm fits best for some years and the LM algorithm for others.

5.2 Modelling rocket triggered lightning return strokes

Here we will fit a model based on power-exponential functions to measured data for rocket-triggered lightning return strokes. This model could then be used to calculate electric and magnetic fields using techniques similar to the ones in [17].


Figure 5.4: Equipment for measuring rocket-triggered return strokes; the image originally appeared in [17].

5.2.1 Analysing rocket-triggered return stroke data

Here we study a different linear combination of power-exponential functions than in the previous model:

\mu(a, b, c, x) = (p - a) \cdot \left( x e^{(1-x)} \right)^b + a \cdot \left( x e^{(1-x)} \right)^c    (5.3)

where the model µ describes the rocket-triggered return strokes 8725, 8726 and 8705, each with a different fixed value of the peak p, as can be seen in Table 5.2. The variable x is time, rescaled so that the peak occurs at x = 1, and a, b and c are the three parameters of the power-exponential expression above, for which initial guesses must also be supplied. As the final step in the application of non-linear least squares for curve fitting, we try to find the best fit to each data set with this function using the code in Appendix A.3. As always, the code requires that we first derive, analytically, the partial derivatives needed for the Jacobian. The quantity to be minimized, f_i, is given by:

f_i = y_i - \mu(x_i; a, b, c)    (5.4)

where f_i is the residual at a particular point of the measured current waveform of the rocket-triggered stroke, y_i is the corresponding measured value, and a, b, c are the parameters for which an initial guess is supplied. Since the Gauss-Newton and Levenberg-Marquardt methods require the Jacobian of f_i, the three partial derivatives with respect to the three variables a, b and c must be calculated and entered into the function df in the MATLAB code; refer to Appendix A.3. The three partial derivatives, used for each of the strokes 8725, 8726 and 8705 with their respective values of the peak p, are shown below:

\frac{\partial f}{\partial a} = -\left( x e^{(1-x)} \right)^b + \left( x e^{(1-x)} \right)^c

\frac{\partial f}{\partial b} = (p - a) \cdot (\log(x) - x + 1) \cdot \left( x e^{(1-x)} \right)^b

\frac{\partial f}{\partial c} = a \cdot (\log(x) - x + 1) \cdot \left( x e^{(1-x)} \right)^c

Using these three partial derivatives, the Jacobian of f_i and its transpose can be calculated, which allows us to apply both the Gauss-Newton and the Levenberg-Marquardt algorithm. Once the three variables a, b, c have been determined, it is possible to plot the mathematical model together with the measured experimental data points to see whether the formula is actually a good fit for the data, using both the GN and LM algorithms. The graphs below show these plots of the mathematical model Eq.(5.3) and the collected data points.
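Before turning to the plots, here is a short sketch of how model (5.3) and its Jacobian could look in MATLAB. The names mu and jacobian_stroke are illustrative, and the peak value p is assumed to be passed in as a known constant; the actual code is in Appendix A.3.

```matlab
% Sketch of model (5.3) and its Jacobian with respect to [a, b, c].
% x is a column vector of rescaled times, p the fixed peak value.
mu = @(a, b, c, p, x) (p - a)*(x.*exp(1 - x)).^b + a*(x.*exp(1 - x)).^c;

function J = jacobian_stroke(a, b, c, p, x)
    w = x.*exp(1 - x);                          % base power-exponential term
    J = [ -w.^b + w.^c, ...                     % df/da
          (p - a)*(log(x) - x + 1).*w.^b, ...   % df/db
          a*(log(x) - x + 1).*w.^c ];           % df/dc
end
```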


Figure 5.5: Comparison of results for fitting the rocket-triggered lightning return stroke data (strokes 8725, 8726 and 8705) using the Gauss-Newton algorithm (panels a, c, e) and the Levenberg-Marquardt algorithm (panels b, d, f).


