
School of Education, Culture and Communication

Division of Applied Mathematics

MASTER THESIS IN MATHEMATICS / APPLIED MATHEMATICS

Comparison of the 1st and 2nd order Lee–Carter methods with the

robust Hyndman–Ullah method for fitting and forecasting mortality rates

by

Emil Willersjö Nyfelt

Master thesis in mathematics / applied mathematics (Masterarbete i matematik / tillämpad matematik)

DIVISION OF APPLIED MATHEMATICS


School of Education, Culture and Communication

Division of Applied Mathematics

Master thesis in mathematics / applied mathematics

Date: 2020-05-29

Project name: Comparison of the 1st and 2nd order Lee–Carter methods with the robust Hyndman–Ullah method for fitting and forecasting mortality rates

Author: Emil Willersjö Nyfelt

Supervisor: Milica Rančić

Reviewer: Anatoliy Malyarenko

Examiner: Ying Ni

Comprising: 30 ECTS credits

Course name: Degree Project in Mathematics


Abstract

The 1st and 2nd order Lee–Carter methods were compared with the Hyndman–Ullah method with regard to goodness of fit and forecasting ability of mortality rates. Swedish population data from the Human Mortality Database was used. The robust estimation property of the Hyndman–Ullah method was also tested with the inclusion of the Spanish flu and a hypothetical scenario of the COVID-19 pandemic. After presenting the three methods and making several comparisons between them, it is concluded that the Hyndman–Ullah method is overall superior among the three methods for the chosen dataset. Its robust estimation of mortality shocks could also be confirmed.

Keywords

Mortality rate, Death rate, Data fitting, Lee-Carter, 2nd order Lee–Carter, Hyndman–Ullah, Modeling, ARIMA, Random walk, Singular Value Decomposition, Penalized Regression Splines, Weighted Functional Principal Component Analysis, Forecasting, Spanish flu, COVID-19.


Acknowledgments

I would like to express my appreciation to my supervisor Milica Rančić for the guidance she provided me during the work process. Recognizing the difficulties that the academic staff have endured during the COVID-19 pandemic, this was even more valuable and it did not go unnoticed. A special thanks is also extended to Professor Robin J. Hyndman and Dr. Han Lin Shang at Monash University for helping me debug the demography R package.


Contents

List of Figures
List of Tables

1 Introduction
 1.1 Established models
 1.2 Competitive models
 1.3 Recent developments
 1.4 Project description

2 Methodology
 2.1 Basic concepts
  2.1.1 Definitions
  2.1.2 Actuarial notation
 2.2 Mathematical tools
  2.2.1 Orthogonal basis functions
  2.2.2 Singular Value Decomposition (SVD)
  2.2.3 The Autoregressive Integrated Moving Average model (ARIMA)
  2.2.4 Penalized Regression Splines
  2.2.5 Weighted Functional Principal Component Analysis
 2.3 The basic Lee–Carter method
  2.3.1 The model
  2.3.2 Constraints
  2.3.3 Fitting the data
  2.3.4 Forecasting
  2.3.5 Goodness of fit and confidence intervals
 2.4 The 2nd order Lee–Carter method
  2.4.1 The model
  2.4.2 Constraints
  2.4.3 Fitting the data
  2.4.4 Forecasting
  2.4.5 Goodness of fit and confidence intervals
 2.5 The Hyndman–Ullah method
  2.5.1 The model
  2.5.2 Constraints and assumptions
  2.5.3 Smoothing the data
  2.5.4 Fitting the data
  2.5.5 Forecasting
  2.5.6 Goodness of fit and confidence intervals
 2.6 Sources of error

3 Numerical implementation
 3.1 The data
 3.2 Analysis
 3.3 Results and discussions
  3.3.1 Goodness of fit
  3.3.2 Within-sample forecasts
  3.3.3 Forecasts
  3.3.4 Robust mortality shock handling of Hyndman–Ullah method
  3.3.5 Handling of estimated COVID-19 mortality for Hyndman–Ullah

4 Conclusions
 4.1 Future work

Bibliography
Appendices
 Appendix A: Proof of Theorem 2
 Appendix B: R-Code
 Appendix C: Thesis objectives


List of Figures

2.1 Example of a random walk with drift
2.2 Example of smoothing of noisy data points
3.1 Mortality rates for 20-, 30- and 40-year-olds in Sweden 1900-2018
3.2 Proportion of variance explained η² for period 1921-2018
3.3 Actual and fitted mortality rates for the Basic Lee–Carter
3.4 Actual and fitted mortality rates for the 2nd order Lee–Carter
3.5 Actual and fitted mortality rates for the Hyndman–Ullah
3.6 Actual and within-sample forecasted mortality rates for the Basic Lee–Carter
3.7 Actual and within-sample forecasted mortality rates for the 2nd order Lee–Carter
3.8 Actual and within-sample forecasted mortality rates for the Hyndman–Ullah
3.9 Within-sample forecast for 1981
3.10 Within-sample forecast for 1993
3.11 Within-sample forecast for 2005
3.12 Within-sample forecast for 2018
3.13 Forecasted mortality rates for the Basic Lee–Carter
3.14 Forecasted mortality rates for the 2nd order Lee–Carter
3.15 Forecasted mortality rates for the Hyndman–Ullah
3.16 Forecasted mortality rates from the Swedish government
3.17 Forecasted mortality rates for 2020
3.18 Forecasted mortality rates for 2045
3.19 Forecasted mortality rates for 2070
3.20 Actual and fitted age specific mortality rates for Hyndman–Ullah 1917-1921
3.21 Proportion of variance explained η² of Hyndman–Ullah method (mortality shock)
3.22 Actual (or estimated) and fitted age specific mortality rates for 1917-1921 and 2020-2022


List of Tables


Chapter 1

Introduction

Modeling future mortality rates has in recent decades become important for economists and policymakers. This is, in part, because the world population is getting increasingly older. We live longer and therefore have more economically unproductive years at the end of our lives. For developed countries, where pensions and health care are paid for by the state, this amounts to a high cost that needs to be planned for [29]. Forecasting likely future mortality rates is therefore essential for planning these expenses.

Attempts at predicting mortality rates can be traced back hundreds of years. In 1662, Graunt [27] examined the population of London to create a warning system to combat the bubonic plague. As expected, it was shown that human mortality is uncertain. However, by categorizing people into groups, one could make more accurate predictions. A first theoretical approach to mortality rate modeling was introduced by de Moivre in 1725 [27]. These models are static. Today, we know that mortality rate evolves continuously during the lifetime of an individual. It is, in fact, a stochastic process, which makes accurate forecasting difficult. A milestone was reached when Gompertz published his law of mortality in 1825 [14]. He stated that the human mortality rate is the sum of age-dependent components. This is also called the Gompertz function and increases exponentially with age. This model, proposed almost 200 years ago, impressively still provides a reasonably good fit to recently measured mortality rates.

As previously mentioned, modeling mortality rates is challenging. Human aging brings considerable change in behavior to an individual, which is reflected in mortality over a lifetime. This includes, for example, young people's (especially males') tendency to take on higher levels of risk. Young people also have less developed consequential thinking compared to adults, which leads to reckless conduct. In addition, mortality during the first months of life has a significant impact from the moment of birth onward. When considering this, forecasting indeed becomes a complicated task.

1.1 Established models

Statisticians make use of data from databases, such as the Human Mortality Database [19], to make predictions and match past data to new models in search of new, better models. These new models then, hopefully, provide a quality forecast on which policies can be based for a population. Needless to say, this is also highly interesting for insurance companies. Having accurate models for a population means being able to offer life insurance policies with narrower profit margins which, in turn, makes them more competitive on the market.

The dominant forecasting model used today is the Lee–Carter model, presented in 1992 [27]. The model is based on principal component analysis (PCA). Its input is a matrix of age-specific mortality rates ordered by time. The output is also a matrix, containing the forecasted mortality rates. Because of its simplicity and decent accuracy, it is widely used for forecasting mortality rates around the world. The two-factor Lee–Carter model is defined as

\[
\ln(m_{x,t}) = a_x + b_x k_t + \varepsilon_{x,t}, \quad \text{or consequently} \quad m_{x,t} = e^{a_x + b_x k_t + \varepsilon_{x,t}},
\]
where $m_{x,t}$ is the central mortality rate at age $x$ in year $t$, $a_x$ is an average log-mortality (log of the ratio of the number of deaths to the population) at age $x$, $b_x$ measures the response at age $x$ to changes in the overall level of mortality over time, $k_t$ represents the overall level of mortality in year $t$, and $\varepsilon_{x,t}$ is the residual.

In addition to this model, there exist variations of it. For example, the 2nd order extension of the Lee–Carter model [31] implements an age-specific enhancement in order to make better predictions. It achieves this by using the first two sets of singular value decomposition vectors instead of only the first one, as in the original model. Because of this, it is also sometimes called the 2nd order Lee–Carter model. We will later, in Chapter 2, go into depth explaining both of these models and their implementations.

The Lee–Carter model is recommended by two U.S. social security technical advisory panels and is often used as a benchmark when testing other methods. The model and its variations, however, have some problems. For example, it assumes that the ratio of the rates of mortality change at different ages remains constant over time [6]. That is, the variable $b_x$ is invariant. There is, however, evidence suggesting the presence of major age-time interaction, which contradicts this assumption. It also relies solely on previous data input. This means that predictable future events and their probable impact on mortality are ignored. This is a problem since, as mentioned, mortality rates are continuously evolving. In real life, one cannot assume that past events are the only factors affecting mortality rates in the future. There exists one characteristic property shared by all these types of forecasts, including the Lee–Carter model. The output will always be an age profile that is less smooth and becomes more deviant over time. In other words, the age profile will become less likely to represent reality as time goes on [27]. This behavior is data-independent and will occur regardless of input. The goal thus becomes to find a method of forecasting mortality that, given a base point, stays accurate as time progresses.

Even though the Lee–Carter model is the main method internationally for making forecasts, it has shown inconsistency between different sets of national data. For example, when applied to the Hungarian population, it was only able to model the data successfully when restricted to the most recent 15 years [2]. In fact, it has been shown that countries with less linear mortality trends may not be suited for the Lee–Carter model [33].

1.2 Competitive models

There exist many other methods of forecasting mortality rates. We can divide these into two categories: the ones that use covariates and those without covariates [13]. The Lee–Carter method belongs to the category that does not use covariates. The assumption when using these types of methods is that all information concerning the future is included in already observed values of log-mortality rates. This means that we must ignore unpredictable changes in the future such as innovation in medicine, war, recessions and health care policy change. These sudden changes in mortality are called mortality shocks and are inherently unpredictable [8]. We do, however, need to include cyclic events such as flu season, holiday traffic accidents and other predictable recurring trends.

The advantages of using methods without covariates foremost include simplicity. During the development of these methods, the goal has been to simplify data patterns by reducing dimensionality in the data. The robustness of these models is based on previous knowledge and on the fact that log-mortality rates tend to be well-behaved and thus easy to work with. The arguably easiest form of forecasting in this category is extrapolation, in which previous data is used to produce a likely projection of the age-profile graph.

Another model in this category, which we will later examine and implement for a Swedish demographic dataset, is the Hyndman–Ullah method. It is a generalization, or extension, of the Lee–Carter model [4] and has been shown to give better forecasts for French mortality data compared to the Lee–Carter model and several of its variations [20], [22]. These dimensionally reduced methods do not, however, fit all situations. This has led researchers to add more components to the models. Sometimes these adjustments can arguably be ad hoc additions to the model. Over time, new, more complex models have been developed as computer processing power and new research have become available.

Methods that make use of covariates may include factors ignored in methods such as Lee–Carter and other non-covariate models. When their measured impact on mortality rates is used, they have great potential to improve upon our ability to forecast. It is, however, important to understand that this in no way guarantees a better result. The added information from covariates must exceed the lost information about the age profile of mortality. That is, we need to better understand how future events impact mortality in order to be able to reject empirical data. Only then can we make better forecasts for the future in comparison with the non-covariate methods. Two methods in this category are equation-by-equation maximum likelihood estimation [13], using either Poisson regression or the method of least squares. They are, as previously mentioned, more difficult to compute than, for example, the variations of Lee–Carter. This is especially true for the Poisson regression, which requires optimizing non-linear equations that commonly have no closed-form solutions. In this report, we will focus on methods not using covariates and thus will not go into these methods any further.

1.3 Recent developments

In recent decades, the tools for forecasters have improved considerably. There is now easy access to high quality databases for developed countries readily available on the internet. There are also open-source packages available (mostly for the programming language R) developed by several researchers cited in this thesis. These packages may be used to test datasets against different models to further research. There has also been progress in the models themselves in terms of computation and variety. With that being said, it is not clear that the accuracy of the models has actually increased in recent times [3]. Forecasting models tend to be situation-specific and thus it is difficult to measure one model's superiority over another.

In 2005, a four-year, 1.42 M€ project led by the European Commission, called Bridging the Micro-Macro Gap in Population Forecasting, was started [12]. The motivation for beginning the project was the determined need for more adequate monitoring and forecasting of demographic change. The aim was to find a method that goes further than the usual forecasting methods using age and gender. It illustrates the desire among current researchers to make use of more parameters than the conventional ones to make future predictions. The project resulted in a microsimulation tool called MICCORE. The tool may be used to bridge the gap between joint projections of groups of similarly characterized individuals (Mac), and projections of the life courses of such individuals (Mic) [12].

A contemporary approach to forecasting mortality rates is using artificial intelligence. More specifically, neural networks may be used to create a method that detects and reproduces non-linear occurrences during the evolution of mortality over a life. In fact, this method of forecasting has been shown to have superb predictive ability compared to traditional Lee–Carter methods [15]. The implementation is, however, rather cutting edge. It would require personnel knowledgeable in A.I. and computer science, which may not be available at insurance companies or agencies.

Indeed, forecasting mortality rates is of high interest to the research community. This is, surely in part, due to the vast economic upside of being able to predict precise mortality rates, which encourages investment into new research. Developing an accurate model will allow insurance companies to operate with narrower profit margins and with less cash reserve. This would greatly increase the potential for higher profit and competitiveness on the market.

1.4 Project description

Comparing the 1st and 2nd order Lee–Carter methods with the Hyndman–Ullah method has not previously been done using a Swedish dataset. The project was therefore initiated to determine whether there are benefits to using the functional approach of the Hyndman–Ullah method over the two Lee–Carter methods.

Because predicting the future has the inherent property of always being uncertain and imperfect, it allows researchers plenty of liberty to experiment with new methods of forecasting. Combined with the wide range of mortality rate datasets available, it naturally gives the experimenter considerable freedom to be creative in their approach. In this project, we will capitalize on that granted privilege to investigate the performance of a new forecasting model compared to that of an established variety. Explicitly, the project aims to compare the Hyndman–Ullah method to the basic Lee–Carter method and the 2nd order Lee–Carter method. The objective is to draw conclusions by comparing these methods on properties such as historical accuracy and forecasting ability.

In Chapter 2, we demonstrate the components and workings of the basic Lee–Carter method, the 2nd order Lee–Carter method, and the Hyndman–Ullah method, in that order. Chapter 3 is dedicated to presenting the numerical implementation of these methods with current sets of data. Finally, in Chapter 4, several comparisons between the methods are analyzed in order to draw conclusions on their performance, before the bibliography and appendices are presented.

The purpose of the written report is to describe the inner mechanics of these three methods in an elucidative manner. It is intended to be sufficiently explanatory for the reader to be able to replicate the procedures on their own preferred dataset. This is crucial since the accuracy of a model may depend heavily on the dataset being used. One forecasting method may perform poorly on one set of data while suiting another well. This phenomenon was, for instance, illustrated by Rabbi & Mazzuco in 2018 [30]. In that study, nine sets of data from countries with higher mortality rates were compared to corresponding data from low mortality countries. It was shown that the performance of seven different variations of the Lee–Carter method was lower for countries exhibiting high mortality rates. In fact, none of these models could be shown to be consistently superior for all nine high mortality countries. It is therefore imperative that new methods are tested on various categories of data before their potential area of application can be determined.
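Although the numerical work is presented in Chapter 3 and Appendix B, a minimal, hedged sketch of how such a replication could start in R is given below. It assumes an account at the Human Mortality Database and uses the lca() and fdm() functions of the demography package; the exact arguments may differ between package versions, and the 2nd order Lee–Carter fit is instead built manually from the SVD as described in Section 2.4.

```r
# Hedged sketch, not the thesis code: entry points in the 'demography' package.
library(demography)
library(forecast)

# Swedish data from the Human Mortality Database (placeholder credentials).
swe <- hmd.mx("SWE", username = "user@example.com", password = "password")

lc <- lca(swe, series = "total")             # basic Lee-Carter fit
hu <- fdm(swe, series = "total", order = 6)  # Hyndman-Ullah functional data model

plot(forecast(lc, h = 50))                   # mortality forecasts 50 years ahead
plot(forecast(hu, h = 50))
```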


Chapter 2

Methodology

2.1 Basic concepts

In this section, we present some terminology and concepts that will be used throughout the report. We will start with some fundamental definitions centered in actuarial sciences [8].

2.1.1 Definitions

Definition 1. The future lifetime is modeled by a continuous random variable $T_x$ of a life (x) aged $x$, where $x \ge 0$. Consequently, $x + T_x$ is the age-at-death of the life (x).

Definition 2. Given a distribution function $F_x$ depending on time $t$, the probability that the life (x) does not survive beyond the age $x + t$ is defined as
\[
F_x(t) = \Pr(T_x \le t).
\]
$F_x(t)$ is generally called the lifetime distribution starting at age $x$.

In many cases, however, we are more interested in the probability of survival of a life (x).

Definition 3. Given a distribution function $F_x$ depending on time $t$, the probability that the life (x) survives beyond the age $x + t$ is defined as
\[
S_x(t) = 1 - F_x(t) = 1 - \Pr(T_x \le t) = \Pr(T_x > t).
\]
$S_x(t)$ is generally called the survival function starting at age $x$.


Definition 4. The instantaneous mortality rate for a measured time period $\Delta x$, also called the force of mortality $\mu_x$, is defined as
\[
\mu_x = \lim_{\Delta x \to 0^+} \frac{1}{\Delta x} \Pr(T_0 \le x + \Delta x \mid T_0 > x) = \lim_{\Delta x \to 0^+} \frac{1}{\Delta x} \Pr(T_x \le \Delta x).
\]
This can also be expressed in terms of the survival function,
\[
\mu_x = \lim_{\Delta x \to 0^+} \frac{1}{\Delta x} \big[ 1 - S_x(\Delta x) \big].
\]
If this limit exists, we can write the force of mortality in terms of the survival function from birth as
\[
\mu_x = -\frac{1}{S_0(x)} \frac{d}{dx} S_0(x).
\]

We now define two different rates of mortality which will be the fundamental concepts of this project.

Definition 5. The crude mortality rate, denoted $r$, is defined as the incidence of death in a particular time period among a given population. It is often given as an annual rate of deaths per 100,000 lives. For a selected time period, it is
\[
r = \frac{\text{Number of deaths of the entire population}}{\text{Entire population}}.
\]

Definition 6. The central mortality rate, denoted $m$, is defined as the average incidence of death in a particular time period among a given population. For a selected time period, it is
\[
m_x = \frac{\text{Average number of deaths of the entire population at age } x}{\text{Average number of the population aged } x}.
\]

Example 1.

To calculate $m_x$ for age $x = 73$ during the 2-year period from 2018 to 2019, the average number of deaths at age 73 is divided by the average number of individuals aged 73 years during this time period.
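As a small illustration of Definition 6, the sketch below computes $m_{73}$ in R from hypothetical death and population counts; the numbers are invented for this example only.

```r
# Hypothetical counts for age 73 in 2018 and 2019 (illustration only).
deaths_73     <- c(410, 395)        # deaths at age 73 in each year
population_73 <- c(98500, 99200)    # mid-year population aged 73 in each year

m_73 <- mean(deaths_73) / mean(population_73)
m_73  # about 0.0041, i.e. roughly 407 deaths per 100,000 lives
```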


2.1.2 Actuarial notation

The notation used so far is widely used in the field of statistics. In actuarial sciences, it is common to see a different form of notation shown here [8].

Definition 7. The lifetime distribution function is denoted as
\[
{}_t q_x = F_x(t) = 1 - S_x(t).
\]
If $t = 1$, the $t$ in the notation is usually dropped and we have
\[
q_x = F_x(1) = 1 - S_x(1),
\]
which, for the case of $t = 1$ year, becomes the mortality rate for a life using actuarial notation.

Definition 8. The survival function is denoted as
\[
{}_t p_x = S_x(t),
\]
where for $t = 1$, the $t$ is likewise dropped from the notation.

Definition 9. The deferred mortality probability is defined as
\[
{}_{u|t} q_x = S_x(u) - S_x(u + t) = {}_u p_x - {}_{u+t} p_x,
\]
which denotes the probability that a life (x) survives $u$ years and then dies within the next $t$ years. Here too, the $t$ may be dropped when $t = 1$.

2.2 Mathematical tools

There are a number of tools that are used to perform the Lee–Carter method and the Hyndman– Ullah method. In this section, we will go through the inner workings and the necessity of them.

2.2.1 Orthogonal basis functions

Basis functions extend the notion of basis vectors on a vector space to functions on a function space. They are used in the Lee–Carter and Hyndman–Ullah methods to construct functions that estimate the historical data input into the models [20], [27], [31]. To get some intuition into the workings of these functions, we begin with a simple definition regarding orthogonal vectors.

Definition 10. Two vectors $v$ and $w$, where $v \neq w$, in an inner product space $V$ are orthogonal if
\[
\langle v, w \rangle = 0,
\]
where $\langle \cdot, \cdot \rangle$ is the inner product.

To expand this concept to a function space, two functions $f$ and $g$ defined on the interval $[a, b]$ can also be said to be orthogonal if they satisfy the following definition:

Definition 11. Two functions $f$ and $g$ in a function space $F$ are orthogonal on the interval $[a, b]$ if
\[
\langle f, g \rangle = \int_a^b f g \, dx = 0.
\]

The extension can be intuitively derived by considering two vectors $u_1$ and $u_2$ of infinite length containing the sets of function values $\{f(x)\}$ and $\{g(x)\}$, where the interval $[a, b]$ is partitioned on an infinitely fine grid $x$, such that
\[
\langle f, g \rangle = u_1 \cdot u_2 =
\begin{pmatrix} f(x_1) \\ f(x_2) \\ \vdots \end{pmatrix} \cdot
\begin{pmatrix} g(x_1) \\ g(x_2) \\ \vdots \end{pmatrix}
= \sum_{i=1}^{\infty} f(x_i) g(x_i) = \int_a^b f g \, dx = 0.
\]

From basic linear algebra, we know that any vector can be represented by a linear combination of basis vectors. This is stated in the following definition.

Definition 12. A basis B of a vector space V over a field F is a linearly independent subset of V such that B spans V .

Now, using the notion that two functions can be orthogonal, we can extend Definition 12 to the function space. That is, any continuous function can be represented by a linear combination of basis functions. This is famously utilized in the Fourier series expansion in one dimension. There, the sine and cosine functions form the basis functions used to estimate a given function. Further, if a set of basis functions is also orthogonal, they are called orthogonal basis functions. We will denote this particular set by $\{\phi_k(x)\}$, where $k = 1, 2, \dots$ denotes the order of the basis function.

Just like with the Fourier series, the basis functions $\{\phi_k(x)\}$ are equipped with a set of coefficients, or "scores", which we will denote by $\{\beta_k\}$. In the Fourier series analogy, a score translates to the amplitude of the sine or cosine function and dictates the degree to which that basis function describes the estimated function.

For large datasets (such as demography data), it is possible to use Principal Component Analysis (PCA) to find the orthogonal functions and their scores. PCA is used since it is a linear orthogonal transformation that makes the transformed data orthogonal. The orthogonal functions $\{\phi_k(x)\}$ and their scores $\{\beta_k\}$ can then be extracted from the results of the PCA and used to estimate a function $\zeta(x)$ (a function of the mortality rate) such that
\[
\hat{\zeta}(x) = \mu(x) + \beta_1 \phi_1(x) + \beta_2 \phi_2(x) + \dots + \beta_n \phi_n(x),
\]
where $\hat{\zeta}(x)$ is the estimate of the function $\zeta(x)$, $\mu(x)$ is the function mean and $n$ is the chosen number of orthogonal basis functions used. If an infinite number of basis functions is used (that is, $n \to \infty$), then the approximated function $\hat{\zeta}(x)$ will approach the actual function $\zeta(x)$. Since we have established that an orthogonal basis function $\phi(x)$ can be represented by an infinite vector $u$, we can thus conclude that a set of basis functions $\{\phi_k(x)\}$ can be represented by an infinite matrix $B$ such that its columns are the set of orthogonal vectors $\{u_k\}$. This type of matrix is called an orthogonal matrix and can be defined in the following way.

Definition 13. A matrix $M$ is defined as an orthogonal matrix if
\[
M'M = MM' = I,
\]
or equivalently
\[
M' = M^{-1},
\]
where $M'$ is the transpose of $M$, $M^{-1}$ is the inverse of $M$ and $I$ is the identity matrix.

By choosing large numbers $n$ and $k$ such that $n = k$, we can construct a square matrix $\hat{B}$, estimating the matrix $B$ containing a set of $n = k$ orthogonal basis functions, as follows:
\[
\hat{B} = \begin{pmatrix} u_1 & u_2 & \dots & u_k \end{pmatrix} =
\begin{pmatrix}
\phi_1(x_1) & \phi_2(x_1) & \dots & \phi_k(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \dots & \phi_k(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(x_n) & \phi_2(x_n) & \dots & \phi_k(x_n)
\end{pmatrix}.
\]
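To make the preceding construction concrete, the hedged R sketch below uses ordinary PCA (prcomp) on simulated curves to extract basis vectors and scores and to rebuild one curve from its first $n$ components; the data and the choice $n = 3$ are arbitrary and not taken from the thesis.

```r
# Illustration of zeta_hat(x) = mu(x) + beta_1*phi_1(x) + ... + beta_n*phi_n(x)
# using ordinary PCA on simulated data (30 curves observed on 10 grid points).
set.seed(1)
X <- matrix(rnorm(30 * 10), nrow = 30, ncol = 10)

pca  <- prcomp(X, center = TRUE, scale. = FALSE)
mu   <- pca$center        # mean function mu(x) on the grid
phi  <- pca$rotation      # columns: orthogonal basis vectors phi_k(x)
beta <- pca$x             # rows: scores beta_k for each observed curve

n <- 3                                         # number of basis functions kept
zeta_hat <- mu + phi[, 1:n] %*% beta[1, 1:n]   # approximation of the first curve
```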


2.2.2 Singular Value Decomposition (SVD)

The SVD is used in the Lee–Carter forecasting method and its variations. It is a core procedure performed to obtain the estimators for some of the parameters of the model. The idea behind using it is that of data reduction. Applying the method on a matrix M factorizes it into three separate matrices with individual properties used in many fields of mathematics [24]. We will in this section demonstrate how the SVD is performed in general.

Definition 14. The singular value decomposition (SVD) of a matrix $M$ is defined as
\[
M = U \Sigma V' = \sum_{i=1}^{k} \sigma_i u_i v_i',
\]
where $M$ is an $m \times n$ matrix, $U$ is a unitary $m \times m$ matrix, $\Sigma$ is a non-negative, diagonal $m \times n$ matrix (containing only zeroes except for the diagonal), $V'$ is a unitary $n \times n$ matrix and $k = \mathrm{rank}(M)$.

The columns of the matrices $U$ and $V$ are called the left and right singular vectors. The diagonal entries of $\Sigma$ hold the singular values of $M$, where $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$.

Singular value decomposition can be used as an optimal rank-$k$ approximation of a matrix $M$. Given the rank $k$, the approximation $M \approx U_k \Sigma_k V_k'$ is derived by constructing the matrices $U_k$ and $V_k$ from the first $k$ columns of $U$ and $V$. $\Sigma_k$ is constructed from $\Sigma$ by using the top-left $k \times k$ submatrix containing the non-negative entries, assuming that the relation $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$ still holds. The SVD method is used to find least squares solutions when it is applied to a matrix of logarithms of mortality rates. To do that, we must first subtract the averages over time of the logarithmic, age-specific rates [27].

In the following example, it is demonstrated how the SVD method can be used to reduce the amount of data for a particular matrix.

Example 2.

Consider the following matrix:
\[
M = \begin{pmatrix}
6 & -7 & 3 & 2 & 5 \\
4 & 3 & 0 & 1 & 3 \\
4 & 0 & 9 & 6 & 1 \\
0 & 0 & 0 & 0 & 0 \\
1 & 0 & 3 & 5 & 5
\end{pmatrix}.
\]
The matrix clearly has rank 4. Using some algorithm (in this case, WolframAlpha, https://www.wolframalpha.com, was used), we can construct the matrices $U$, $\Sigma$ and $V$:
\[
U = \begin{pmatrix}
0.58538 & 0.796888 & -0.11164 & -0.0991764 & 0 \\
0.185345 & -0.0846474 & 0.829539 & -0.51995 & 0 \\
0.665419 & -0.577117 & -0.382472 & -0.279051 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0.424494 & -0.157292 & 0.391302 & 0.801216 & 0
\end{pmatrix},
\]
\[
\Sigma = \begin{pmatrix}
15.5849 & 0 & 0 & 0 & 0 \\
0 & 7.85875 & 0 & 0 & 0 \\
0 & 0 & 5.66593 & 0 & 0 \\
0 & 0 & 0 & 3.90503 & 0 \\
0 & 0 & 0 & 0 & 0
\end{pmatrix},
\]
\[
V = \begin{pmatrix}
0.470959 & 0.251564 & 0.266457 & -0.765639 & -0.240231 \\
-0.227248 & -0.742123 & 0.57715 & -0.221666 & 0.12399 \\
0.578663 & -0.416766 & -0.45946 & -0.1038 & 0.519208 \\
0.479381 & -0.34866 & 0.0472884 & 0.413177 & -0.689695 \\
0.402366 & 0.301183 & 0.618513 & 0.427985 & 0.426216
\end{pmatrix}.
\]
We can now construct $U_k$, $\Sigma_k$ and $V_k$, knowing that $\mathrm{rank}(M) = 4$:
\[
U_k = \begin{pmatrix}
0.58538 & 0.796888 & -0.11164 & -0.0991764 \\
0.185345 & -0.0846474 & 0.829539 & -0.51995 \\
0.665419 & -0.577117 & -0.382472 & -0.279051 \\
0 & 0 & 0 & 0 \\
0.424494 & -0.157292 & 0.391302 & 0.801216
\end{pmatrix},
\]
\[
\Sigma_k = \begin{pmatrix}
15.5849 & 0 & 0 & 0 \\
0 & 7.85875 & 0 & 0 \\
0 & 0 & 5.66593 & 0 \\
0 & 0 & 0 & 3.90503
\end{pmatrix},
\qquad
V_k = \begin{pmatrix}
0.470959 & 0.251564 & 0.266457 & -0.765639 \\
-0.227248 & -0.742123 & 0.57715 & -0.221666 \\
0.578663 & -0.416766 & -0.45946 & -0.1038 \\
0.479381 & -0.34866 & 0.0472884 & 0.413177 \\
0.402366 & 0.301183 & 0.618513 & 0.427985
\end{pmatrix}.
\]
From these we can assemble the approximation $M \approx U_k \Sigma_k V_k'$.
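The same computation can be reproduced in R with the built-in svd() function; the following is a sketch of Example 2 rather than code taken from the thesis.

```r
# SVD of the matrix from Example 2 and its rank-k reconstruction.
M <- matrix(c(6, -7, 3, 2, 5,
              4,  3, 0, 1, 3,
              4,  0, 9, 6, 1,
              0,  0, 0, 0, 0,
              1,  0, 3, 5, 5), nrow = 5, byrow = TRUE)

s <- svd(M)                  # s$u, s$d (singular values), s$v
k <- qr(M)$rank              # rank(M) = 4 here

M_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
max(abs(M - M_k))            # essentially zero, since k equals the rank of M
```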

2.2.3 The Autoregressive Integrated Moving Average model (ARIMA)

The ARIMA model is used in time series analysis to, in this case, make forecasts of the future using historical data [27]. The acronym ARIMA is useful for understanding the inner workings of the procedure. The autoregression part, AR, denotes that the regression of the evolving variable is performed on its previous values. The integrated part, I, specifies that the data values are the differences between their current and preceding values. The final part, MA, denotes that the regression error is represented as a linear combination of the coinciding error terms that materialized in the past. On this note, ARIMA is often denoted as
\[
\mathrm{ARIMA}(p, d, q), \qquad p, d, q \in \mathbb{Z}^{+} \cup \{0\}.
\]
The parameters $p$, $d$ and $q$ denote the afore-mentioned properties of the model. The parameter $p$ expresses the number of lags between the current and previous values. It is also called the order of the model. The $d$ denotes the number of times the data has been differenced with previous values and $q$ conveys the order of the moving average [23].

The ARIMA-model can conveniently be reduced in complexity and notation if one or more parameters are set to zero. The ARIMA(0, 0, 1) for instance is simply expressed as MA(1) and thus ignores the two first properties in the model. Likewise, ARIMA(0, 1, 0) is denoted only as I(1) and is known as the random walk [23]. These were the selected parameters best suited for the U.S. population when Lee and Carter presented the basic Lee–Carter method in 1992 [27].

Definition 15. The random walk, $\mathrm{ARIMA}(0, 1, 0) = I(1)$, is defined as
\[
X_t = X_{t-1} + \varepsilon_t,
\]
where $X_t$ is the evolving variable and $\varepsilon_t$ is the independent and identically distributed (i.i.d.) error term.

A constant d can also be added to this equation to add drift if the "path" or trend is expected to be decreasing or increasing over time.

Definition 16. The random walk with drift, $\mathrm{ARIMA}_d(0, 1, 0) = I_d(1)$, is defined as
\[
X_t = d + X_{t-1} + \varepsilon_t, \qquad (2.1)
\]
where $X_t$ is the evolving variable, $d$ is the constant denoting the drift and $\varepsilon_t$ is the i.i.d. error term.

The error term $\varepsilon_t$ is distributed as $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$, which is a prerequisite for the random walk to be valid, i.e., truly random.

It should be noted that the random walk, or ARIMA(0, 1, 0), does not guarantee the best fit to a given dataset. When the basic Lee–Carter method was implemented on the male and female population in India, the ARIMA(0, 2, 0) and ARIMA(1, 2, 0) respectively demonstrated the best fit [7]. Different parameters would need to be tested when constructing a model to determine the best version suited for the particular set of data.


Example 3.

A random walk with drift, simulated as an example. The drift is set to $d = -0.3$ and the error term $\varepsilon_t$ is distributed as $\mathcal{N}(0, 10^2)$. The walk is $t = 30$ steps long, starting at position $k_1 = 0$. Figure 2.1 shows the typical appearance of a random walk using these parameter values.

[Plot: simulated ARIMA(0,1,0) random walk with drift; horizontal axis Time, vertical axis Position.]

Figure 2.1: Example of a random walk with drift
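A walk like the one in Figure 2.1 can be reproduced with a few lines of R; the seed below is arbitrary, so the exact path will differ from the plotted one.

```r
# Simulation of Example 3: random walk with drift d = -0.3,
# error terms N(0, 10^2), 30 steps, starting at k_1 = 0.
set.seed(2020)
d     <- -0.3
steps <- 30
eps   <- rnorm(steps, mean = 0, sd = 10)
walk  <- cumsum(c(0, d + eps))          # k_1 = 0, k_t = d + k_{t-1} + eps_t

plot(walk, type = "l", xlab = "Time", ylab = "Position",
     main = "ARIMA(0,1,0) - Random walk with drift")
```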

The $\mathrm{ARIMA}_d(0, 1, 0)$ model, or the random walk with drift, is particularly interesting since it was the preferred choice, and the determined best fit, for making forecasts when the original basic Lee–Carter forecasting method was presented in 1992 [27]. Using drift is quite natural since we generally assume that the mortality rates will continue to decrease for a specified age group as time progresses.

2.2.4 Penalized Regression Splines

Smoothing splines are function estimates $\hat{s}(x)$ used to balance goodness of fit of a function $\theta(x)$ to a set of noisy observations $\{x, y(x)\}$ [11]. The smoothing reduces noise in the data which, in turn, makes it better behaved when processed. In the Hyndman–Ullah method, it is used to preprocess the mortality rate data before fitting it to the model [20].


Definition 17. For the Hyndman–Ullah method, the smoothing splines can be defined as
\[
\hat{s}_t(x) = \operatorname*{argmin}_{\theta_t(x)} \sum_{t=1}^{T} \big| y_t(x) - \theta_t(x) \big| + \lambda \sum_{t=1}^{T-1} \big| \theta'_{t+1}(x) - \theta'_t(x) \big|,
\]
where $T$ is the number of time periods, $\hat{s}_t(x)$ is the smoothed function, $y_t(x)$ is the mortality rate for time period $t$, $\theta_t(x)$ is the target function, $\theta'_t(x)$ is its derivative and $\lambda \ge 0$ is a spline parameter. The argmin function outputs the point of the domain of the function $\theta_t(x)$ at which the function values are minimized. If there are multiple points at which the function is minimized, the argmin function outputs the set containing these points from the domain.

Hyndman and Ullah used a weighted version of the smoothing splines [20]. It consists of adding weights to each pair of data, (x, y(x)), using some predefined weight w. In Section 2.5, it is shown how this is implemented in the Hyndman–Ullah method.

Regarding the parameter $\lambda$, a trade-off exists in the choice of a suitable value. A low value will keep the smoothed points closer to the actual points, but results in higher variance. Conversely, a high value will provide lower variance but smooth the line more; consequently, it will stray further away from the actual points. Determining an appropriate value for $\lambda$ is thus in itself an optimization problem and it is frequently automated in software. Often, the generalized cross-validation (GCV) criterion is used in least squares based smoothing splines [16]. It will therefore serve as the standard criterion for choosing $\lambda$ in the numerical experiments in Chapter 3, where the Hyndman–Ullah method will be implemented.

In the following example, a conceptual demonstration is shown how a set of data points is smoothed using regression.

Example 4.

Given the arbitrary points (x, y),

x 25 39 45 57 70 85 89 100 110 124 137 150

y 1000 1250 2600 3000 3500 4500 5000 4700 4405 4000 3730 3400

the following smoothed points $(x, y)$, graphed in Figure 2.2, are obtained after applying smoothing using a cubic target function:

[Plot: actual points, fitted (smoothed) points and the approximating function; horizontal axis x, vertical axis y.]

Figure 2.2: Example of smoothing of noisy data points

Notice how the fitted points (red circles) take the shape of a smooth function (dashed line) while estimating the noisy data points.
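For reference, the points of Example 4 can be smoothed in R with smooth.spline(), which chooses the smoothing parameter by generalized cross-validation; this is only a sketch and may not match the cubic target function used for Figure 2.2 exactly.

```r
# Smoothing the noisy points of Example 4 with a penalized smoothing spline.
x <- c(25, 39, 45, 57, 70, 85, 89, 100, 110, 124, 137, 150)
y <- c(1000, 1250, 2600, 3000, 3500, 4500, 5000, 4700, 4405, 4000, 3730, 3400)

fit <- smooth.spline(x, y, cv = FALSE)               # lambda chosen by GCV
plot(x, y, main = "Smoothing of noisy data points")
lines(predict(fit, seq(25, 150, by = 1)), lty = 2)   # smoothed curve
```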

2.2.5 Weighted Functional Principal Component Analysis

In the Hyndman–Ullah method, PCA (see, for example, [21] for details) is used. It is, however, first extended to continuous functions, since ordinary PCA is a discrete concept [20]. Explicitly, it is used to find the orthogonal basis functions $\phi_k(x)$ and their uncorrelated coefficients $\beta_{t,k}$, along with the corresponding weights. Extending ordinary PCA to functional data, however, introduces a problem. The dimension of the sample covariance matrix becomes infinite for continuous data, since we have an infinite number of points. The projection pursuit method, proposed by Li and Chen in 1985 [28], showed that there was a way to resolve this using an algorithm. Improving on this method, Hubert et al. proposed in 2002 [18] the RAPCA algorithm (Reflection based Algorithm for PCA), which achieves this with greater efficiency.


The RAPCA algorithm is intended for discrete data. We therefore need to discretize any smooth function $f_t(x)$ that we wish to use in the algorithm. This discretization is done on a fine grid of $q$ equally spaced values $\{x_1, x_2, \dots, x_q\}$ spanning the interval $[x_1, x_q]$ (the domain of $f_t(x)$). The result is a matrix $G_{t,q}$, containing the discretization of $f_t(x)$ for all $t$. For the discretized data $\{x_1, x_2, \dots, x_q\}$, it also becomes possible to use the SVD to find the approximated orthogonal basis functions $\hat{\phi}_k(x)$, along with their uncorrelated coefficients $\hat{\beta}_{t,k}$ and the applied set of weights $\{w_t\}$ [20]. Below, the procedure to perform the SVD is given.

1. Let $G^* = WG$, where $W$ is a matrix containing the weights such that $W = \mathrm{diag}[w_1, w_2, \dots, w_T]$.
2. Perform the SVD on the matrix $G^*$. This gives $G^* = U \Sigma V'$ (see Subsection 2.2.2 for details).
3. Retrieve the estimators $\hat{\phi}_k(x_j)$ from the $(j, k)$-th element of $U$.
4. Let $B = GU$ and retrieve the estimators $\hat{\beta}_{t,k}$ from the $(t, k)$-th element of $B$.

Missing values in the discretization can be found by interpolation. It is possible to discretize the function $f_t(x)$ using a large enough number $q$ such that interpolation becomes increasingly unnecessary. However, the computational time to perform the SVD increases substantially for larger $q$.
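A hedged R sketch of the four steps is given below. The orientation of $G$ is an assumption: here one row per time period and one column per grid point, so the basis estimates appear in the right singular vectors (with the transposed convention of the text they would appear in $U$ instead); the data and weights are placeholders.

```r
# Weighted SVD step of the weighted FPCA, on simulated data.
set.seed(3)
T_periods <- 20; q <- 15
G <- matrix(rnorm(T_periods * q), nrow = T_periods)   # discretized, mean-adjusted f_t(x)
w <- rep(1 / T_periods, T_periods)                    # placeholder weights w_t

Gstar <- diag(w) %*% G       # step 1: apply W = diag(w_1, ..., w_T)
s     <- svd(Gstar)          # step 2: G* = U Sigma V'
phi_hat  <- s$v              # step 3: estimated basis functions (one column per phi_k)
beta_hat <- G %*% s$v        # step 4: scores beta_{t,k}, one row per time period
```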

2.3 The basic Lee–Carter method

The Lee–Carter model was presented in 1992 by Ronald D. Lee and Lawrence R. Carter in the Journal of the American Statistical Association [27]. It was presented as a time series method to make long-run forecasts in the United States. Due to its ability to accurately fit data, it is now widely used in the world beyond its intended purpose. The model takes a matrix of mortality rates as input and outputs another matrix containing the fitted historical data, as well as the forecasted mortality rates. The method is renowned for its simplicity and accuracy. We will in this section go into detail of the mechanics and theory of this method.


2.3.1 The model

Definition 18. The Lee–Carter forecasting method is defined as
\[
\ln(m_{x,t}) = a_x + b_x k_t + \varepsilon_{x,t}, \qquad (2.2)
\]
where $a_x$, $b_x$ and $k_t$ are parameters depending on age $x$ and time $t$, while $\varepsilon_{x,t}$ denotes the associated error. Explicitly, $a_x$ is a parameter describing the average pattern of mortality for the age group, $k_t$ denotes the mortality index, and $b_x$ denotes a measurement of the sensitivity of $\ln(m_{x,t})$ at age $x$ as $k_t$ varies.

The matrix $m_{x,t}$ is usually represented by a table of mortality rates with specific age groups as rows and time periods as columns. The span of these groups can affect the performance and therefore needs to be selected carefully. For example, some instability in the basic Lee–Carter method could be observed when choosing time periods of 10 or 20 years [27].

2.3.2 Constraints

The basic Lee–Carter method needs some simple constraints to ensure the uniqueness of the solutions. This is because the parametrization in equation (2.2) is invariant under some linear transformations.

Theorem 1. The parametrization in equation (2.2) is invariant under the transformation
\[
(a_x, b_x, k_t) \to \left( a_x + c\,b_x, \; \frac{b_x}{d}, \; d(k_t - c) \right),
\]
for any constants $c$ and $d$, $d \neq 0$. The solution is therefore not unique.

Proof. Applying the transformation we have
\[
\ln(m_{x,t}) = a_x + b_x k_t + \varepsilon_{x,t} \to (a_x + c\,b_x) + \frac{b_x}{d}\, d(k_t - c) + \varepsilon_{x,t}
= a_x + c\,b_x + b_x k_t - c\,b_x + \varepsilon_{x,t} = a_x + b_x k_t + \varepsilon_{x,t}.
\]
The parametrization is thus invariant under this linear transformation and consequently, the solution is not unique.

To account for this, Lee and Carter used the following constraints to obtain a unique solution:
\[
\sum_{x=1}^{X} b_x = 1, \qquad (2.3)
\]
\[
\sum_{t=1}^{T} k_t = 0, \qquad (2.4)
\]
where $X$ is the number of age groups and $T$ is the number of historical time periods. In this study, the same constraints will be used to attain the unique solutions.

2.3.3 Fitting the data

The estimation of the parameters in equation (2.2) was done by Lee and Carter using the SVD. First, the estimator $\hat{a}_x$ was defined in the following manner [27]:

Definition 19. The estimator $\hat{a}_x$ of the parameter $a_x$ is defined as
\[
\hat{a}_x = \frac{1}{T} \sum_{t=1}^{T} \ln(m_{x,t}), \qquad (2.5)
\]
where $T$ is the number of time periods fitted.

The estimator is thus the average over time of the logarithm of the central mortality rates. The remaining two estimators $\hat{b}_x$ and $\hat{k}_t$ are obtained by first constructing the matrix $M$,
\[
M = \ln(m_{x,t}) - \hat{a}_x, \qquad (2.6)
\]
and then taking the SVD of this matrix. As previously mentioned, this is done to render the least squares solutions. Explicitly, we have
\[
SVD(M) = U \Sigma V' = \sigma_1 U_{x,1} V'_{t,1} + \sigma_2 U_{x,2} V'_{t,2} + \dots + \sigma_k U_{x,k} V'_{t,k}, \qquad (2.7)
\]
where $k = \mathrm{rank}(M)$ and $V'$ denotes the matrix transpose of $V$. Further, the singular values are
\[
\sigma_i, \quad i = 1, 2, \dots, k,
\]
and
\[
U_{x,i}, \; V'_{t,i}, \quad i = 1, 2, \dots, k,
\]
are the corresponding singular vectors.

We can now define the estimators $\hat{b}_x$ and $\hat{k}_t$ as the following [22]:

Definition 20. The estimator $\hat{b}_x$ of the parameter $b_x$ is defined as
\[
\hat{b}_x = U_{x,1}.
\]

Definition 21. The estimator $\hat{k}_t$ of the parameter $k_t$ is defined as
\[
\hat{k}_t = \sigma_1 V'_{t,1}.
\]

After executing these procedures, the output is a new matrix $\hat{M}_{x,t}$ corresponding to the estimation of the logarithmic historical mortality rate for each age group and time period. Explicitly,
\[
\hat{M}_{x,t} = \hat{a}_x + \hat{b}_x \hat{k}_t = \hat{a}_x + \sigma_1 U_{x,1} V'_{t,1}, \qquad (2.8)
\]
where $\hat{a}_x$ is given by equation (2.5). Expanded to matrix form, we have
\[
\hat{M}_{x,t} = \begin{pmatrix}
\hat{a}_1 + \sigma_1 U_{1,1} V'_{1,1} & \dots & \hat{a}_1 + \sigma_1 U_{1,1} V'_{T,1} \\
\vdots & \ddots & \vdots \\
\hat{a}_X + \sigma_1 U_{X,1} V'_{1,1} & \dots & \hat{a}_X + \sigma_1 U_{X,1} V'_{T,1}
\end{pmatrix}.
\]
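A minimal sketch of this fitting procedure in R is shown below; `rates` is an assumed $X \times T$ matrix of central mortality rates (ages in rows, years in columns) and is not defined in the thesis.

```r
# Basic Lee-Carter fit via SVD, assuming `rates` holds m_{x,t} (X ages x T years).
log_m <- log(rates)
a_x   <- rowMeans(log_m)              # estimator (2.5)
M     <- sweep(log_m, 1, a_x)         # matrix (2.6): ln(m_{x,t}) - a_x

s   <- svd(M)
b_x <- s$u[, 1]                       # Definition 20
k_t <- s$d[1] * s$v[, 1]              # Definition 21

# Optional rescaling so that sum(b_x) = 1 as in constraint (2.3);
# the product b_x * k_t, and hence the fit, is unchanged.
k_t <- k_t * sum(b_x)
b_x <- b_x / sum(b_x)

fitted_log_m <- a_x + outer(b_x, k_t) # matrix (2.8) of fitted log rates
```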

2.3.4 Forecasting

When the historical data for a specific dataset has been fitted, the main point of the Lee–Carter method is to make future predictions up to a chosen point in time. In the basic Lee–Carter model, this is done by using the random walk with drift. Our sought-after evolving variable, in this case, will be the mortality index $k_t$ [27]. Using equation (2.1), we thus have our forecasted estimator $\hat{k}_t$ of $k_t$:
\[
\hat{k}_t = \hat{d} + \hat{k}_{t-1} + \varepsilon_t, \qquad (2.9)
\]
where the estimator $\hat{d}$ is the average period-for-period change of the random walk, given by
\[
\hat{d} = \frac{\hat{k}_T - \hat{k}_1}{T - 1}, \qquad (2.10)
\]
meaning that it depends only on the first and last estimates of $\hat{k}_t$ [13]. The variable $\varepsilon_t$ is the error term with
\[
\varepsilon_t \sim \mathcal{N}(0, \sigma^2),
\]
where $\mathcal{N}(\mu, \sigma^2)$ is the normal distribution with mean $\mu$ and variance $\sigma^2$. Expanding equation (2.9), we have
\[
\hat{k}_t = \hat{k}_T + \hat{d}\,t + \sum_{i=1}^{t} \varepsilon_{T+i-1}. \qquad (2.11)
\]

Taking the expected value and the variance of $\hat{k}_t$, we get
\[
E[\hat{k}_t] = \hat{k}_T + \hat{d}\,t \quad \text{and} \quad V[\hat{k}_t] = t\,\sigma^2.
\]
This shows that both the expected value and the variance of the evolving variable $\hat{k}_t$ depend on time. This is referred to as a non-stationary time series. Since we expect $\hat{d}$ to be a negative number (due to decreasing mortality rates over time) and $t$ to be strictly increasing, the expected value will decrease over time while the variance will increase over time.

We can now estimate the standard deviation parameter $\sigma$ by using the definition of the standard error estimate [13], denoted $see$.

Definition 22. The standard error estimate $see$ is defined to be
\[
see_t = \sqrt{ \frac{1}{T-2} \sum_{t=1}^{T-1} \left( \hat{k}_{t+1} - \hat{k}_t - \hat{d} \right)^2 }. \qquad (2.12)
\]
The index $t$ is added to $see$ to recognize its dependency on the time period.

Applying the estimator $see_t$, we can rewrite equation (2.11) as
\[
\hat{k}_t = \hat{k}_T + \hat{d}\,t + \sqrt{t}\,\varepsilon_t, \qquad (2.13)
\]
where $\varepsilon_t \sim \mathcal{N}(0, see_t^2)$.

Using equation (2.8) and equation (2.13), we can now derive a single equation describing the forecasting algorithm:
\[
\tilde{M}_{x,t} = \hat{a}_x + \hat{b}_x\big(\hat{k}_T + \hat{d}\,t + \sqrt{t}\,\varepsilon_t\big) = \hat{a}_x + U_{x,1}\big(\hat{k}_T + \hat{d}\,t + \sqrt{t}\,\varepsilon_t\big),
\]
where $\tilde{M}_{x,t}$ is a new matrix containing the logarithmic forecasted mortality rates. The matrix will take the following form, starting from time period number $T + 1$:
\[
\tilde{M}_{x,t} = \begin{pmatrix}
\hat{a}_1 + U_{1,1}\big(\hat{k}_T + \hat{d}\cdot 1 + \sqrt{1}\,\varepsilon_1\big) & \dots & \hat{a}_1 + U_{1,1}\big(\hat{k}_T + \hat{d}\,T' + \sqrt{T'}\,\varepsilon_{T'}\big) \\
\vdots & \ddots & \vdots \\
\hat{a}_X + U_{X,1}\big(\hat{k}_T + \hat{d}\cdot 1 + \sqrt{1}\,\varepsilon_1\big) & \dots & \hat{a}_X + U_{X,1}\big(\hat{k}_T + \hat{d}\,T' + \sqrt{T'}\,\varepsilon_{T'}\big)
\end{pmatrix},
\]
where $T'$ is the number of forecasted time periods.

It is worth noting that the forecasting method assumes that both $\hat{a}_x$ and $\hat{b}_x$ remain constant within their corresponding age group throughout the forecasting procedure. The matrix $\tilde{M}$ thus uses only historical data for these variables, which depend only on the age group $x$. The variables depending on the time period $t$ will then depend only on $t$, and therefore the matrix $\tilde{M}$ behaves strictly as a non-stationary, stochastic time series.


Concluding the forecast, the complete model, including both the fitted historical data and the forecasted mortality rates, can be joined in a matrix $M^*$ containing both matrices $\hat{M}$ and $\tilde{M}$, such that
\[
M^* = \begin{pmatrix} \hat{M} & \tilde{M} \end{pmatrix},
\]
which provides a comprehensive table of logarithmic mortality rates, fitted and forecasted.
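Continuing the fitting sketch from Subsection 2.3.3 (and reusing its objects a_x, b_x, k_t and fitted_log_m), the forecast step can be written as below; the 30-year horizon is arbitrary and the error term is dropped for the point forecast.

```r
# Random walk with drift forecast of the mortality index k_t.
T_hist <- length(k_t)
d_hat  <- (k_t[T_hist] - k_t[1]) / (T_hist - 1)            # drift estimate (2.10)
see    <- sqrt(sum((diff(k_t) - d_hat)^2) / (T_hist - 2))  # standard error (2.12)

h    <- 30                                    # number of forecasted periods T'
k_fc <- k_t[T_hist] + d_hat * (1:h)           # point forecasts of k_{T+1}, ..., k_{T+h}

forecast_log_m <- a_x + outer(b_x, k_fc)      # matrix M-tilde of forecasted log rates
M_star <- cbind(fitted_log_m, forecast_log_m) # joined matrix M* = [M-hat  M-tilde]
```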

2.3.5 Goodness of fit and confidence intervals

There are two main ways of determining the accuracy of the complete Lee–Carter forecasting model. The first regards the method's ability to fit historical data to the model. The second is to measure the credibility of future predictions. Evaluating the fit to the historical data is a simple procedure, while determining future accuracy is naturally more uncertain. However, we can apply a confidence interval to the forecasted data to obtain a measure of certainty of the predicted data.

To measure the goodness of fit of the model in relation to the historical mortality rates, the ratio of variance explained, $\eta^2$, was used in the Lee–Carter method [27]. The specific procedure for calculating the ratio of variance is to first determine the variance of the difference between the actual data and the fitted data. It is then divided by the variance of the actual rates to obtain the ratio. This provides a value for the badness of fit; taking the complement of this value thus provides the sought-after goodness of fit for a chosen age group $x$. Explicitly, we have
\[
\eta_x^2 = 1 - \frac{V\big(m_{x,t} - \hat{m}_{x,t}\big)}{V\big(m_{x,t}\big)} = 1 - \frac{\sum_{t=1}^{T} (m_{x,t} - \hat{m}_{x,t})^2}{\sum_{t=1}^{T} (m_{x,t} - \bar{m}_x)^2}, \qquad (2.14)
\]
where $\bar{m}_x$ is the mean of the actual values. This provides the proportion of variance of the system that is linearly accounted for, such that $0\% < \eta^2 < 100\%$. The value 1 is obtained when the system has no error. This value is sometimes also denoted $R^2$.

Remark 1. All values used for calculating $\eta_x^2$ are actual percentage mortality rates, i.e., they are not logarithmic.
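As a small helper, the ratio (2.14) can be computed as follows; in line with Remark 1, both arguments should be actual (non-logarithmic) rates.

```r
# Proportion of variance explained, eta_x^2, for one age group x.
eta_squared <- function(actual, fitted) {
  1 - sum((actual - fitted)^2) / sum((actual - mean(actual))^2)
}

# Example usage with the earlier sketches (rates on the original scale):
# eta_squared(rates[1, ], exp(fitted_log_m[1, ]))
```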

To equip the forecasted mortality rates with an approximate 95% confidence interval for a forecasted time period $t$, we multiply the point-forecasted data $\tilde{m}_{x,t}$ by a factor, given as [27]
\[
\tilde{m}_{x,t} \times (-1.96\,\hat{b}_x\,see_t) \le \tilde{m}_{x,t} \le \tilde{m}_{x,t} \times (1.96\,\hat{b}_x\,see_t).
\]

Remark 2. The conventional, more exact, value for the quantile $z_{0.025}$, which gives a 95% confidence interval, is 1.959964; the rounded value 1.96 is used here.

2.4 The 2nd order Lee–Carter method

In 2003, a variation of the Lee–Carter method was published by A.E. Renshaw and S. Haberman [31]. It was proposed as an improvement upon the basic Lee–Carter method in light of its observed inflexibility with respect to age. The method later came to be denominated the 2nd order Lee–Carter method. In essence, the new procedure tries to capture the rise in crude mortality rates for individuals aged approximately 20-39 years observed in the last 25 years of the 1900s, which the basic Lee–Carter fails to capture. This upswing in mortality rates is credited to the inflated number of suicides and to the onset of HIV and AIDS [31].

The 2nd order Lee–Carter method is similar to its basic variant, with the main difference being that it uses a second set of singular vectors from the SVD to create the model. Essentially, the model is endowed with an age-specific enhancement, or age smoothing property, to hopefully achieve better fitted data and thus, in turn, better forecasts. Indeed, the method has the potential to enhance the accuracy of model fitting and forecasts, but it unfortunately steps away from the tremendous advantage of simplicity in the basic Lee–Carter method. In theory, several less dominant singular vectors (i.e. higher order Lee–Carter methods) could be used to create further variations of the Lee–Carter method, but these would be difficult to achieve in practice [5].

2.4.1 The model

Definition 23. The Lee–Carter forecasting method with age-specific enhancement, or 2nd order Lee–Carter method, is defined as
\[
\ln(m_{x,t}) = a_x + b_x^{(1)} k_t^{(1)} + b_x^{(2)} k_t^{(2)} + \varepsilon_{x,t}, \qquad (2.15)
\]
where $a_x$, $b_x$ and $k_t$ are parameters depending on age $x$ and time $t$, while $\varepsilon_{x,t}$ denotes the associated error. The indices (1) and (2) denote the order of the singular vectors from the singular value decomposition.

The difference from the definition of the basic Lee–Carter model is only in the addition of a second set of singular vectors. The extra set is used to approximate the parameters $b_x^{(2)}$ and $k_t^{(2)}$. Since these parameters arise from using the second set of the SVD, they are also less dominant than their predecessors $b_x^{(1)}$ and $k_t^{(1)}$. Illustratively, the parameter $k_t^{(2)}$ denotes the mortality index, but to a lesser significance than $k_t^{(1)}$. Likewise, $b_x^{(2)}$ denotes a measurement of the sensitivity of $\ln(m_{x,t})$ at age $x$ as $k_t^{(1)}$ and $k_t^{(2)}$ vary, but to a lesser extent than $b_x^{(1)}$.

2.4.2 Constraints

The 2nd order Lee–Carter method also needs some simple constraints, analogous to the basic case. Again, this is to ensure the outcome of a unique solution. Using equations (2.3) and (2.4), these new constraints can be derived as
\[
\sum_{x=1}^{X} b_x^{(1)} = \sum_{x=1}^{X} b_x^{(2)} = 1, \qquad \sum_{t=1}^{T} k_t^{(1)} = \sum_{t=1}^{T} k_t^{(2)} = 0.
\]

2.4.3 Fitting the data

The estimation of the parameters in equation (2.15) is performed using the SVD. The estimator $\hat{a}_x$ from equation (2.5) is unchanged, as it only depends on the central mortality rates $\ln(m_{x,t})$. Consequently, the process of performing the SVD on the matrix $M$ given by equation (2.6) is also unchanged. Taking the SVD and truncating the expression after two terms, we have from equation (2.7)
\[
SVD(M) = \sigma_1 U_{x,1} V'_{t,1} + \sigma_2 U_{x,2} V'_{t,2}.
\]
We now define the estimators $\hat{b}_x^{(1)}$, $\hat{b}_x^{(2)}$, $\hat{k}_t^{(1)}$ and $\hat{k}_t^{(2)}$, similarly to the basic Lee–Carter method, as the following [31]:

Definition 24. The estimator $\hat{b}_x^{(i)}$ of the parameter $b_x^{(i)}$ is defined as
\[
\hat{b}_x^{(i)} = U_{x,i}, \quad i = 1, 2.
\]

Definition 25. The estimator $\hat{k}_t^{(i)}$ of the parameter $k_t^{(i)}$ is defined as
\[
\hat{k}_t^{(i)} = \sigma_i V'_{t,i}, \quad i = 1, 2.
\]

A new matrix $\hat{M}^{(2nd)}_{x,t}$, containing the estimations of the logarithmic historical mortality rates, is obtained as
\[
\hat{M}^{(2nd)}_{x,t} = \hat{a}_x + \hat{b}_x^{(1)} \hat{k}_t^{(1)} + \hat{b}_x^{(2)} \hat{k}_t^{(2)} = \hat{a}_x + \sigma_1 U_{x,1} V'_{t,1} + \sigma_2 U_{x,2} V'_{t,2}, \qquad (2.16)
\]
where, again, $\hat{a}_x$ is given by equation (2.5). Expanded to matrix form, we obtain
\[
\hat{M}^{(2nd)}_{x,t} = \begin{pmatrix}
\hat{a}_1 + \sigma_1 U_{1,1} V'_{1,1} + \sigma_2 U_{1,2} V'_{1,2} & \dots & \hat{a}_1 + \sigma_1 U_{1,1} V'_{T,1} + \sigma_2 U_{1,2} V'_{T,2} \\
\vdots & \ddots & \vdots \\
\hat{a}_X + \sigma_1 U_{X,1} V'_{1,1} + \sigma_2 U_{X,2} V'_{1,2} & \dots & \hat{a}_X + \sigma_1 U_{X,1} V'_{T,1} + \sigma_2 U_{X,2} V'_{T,2}
\end{pmatrix}.
\]
As observed in the matrix, the procedure of performing the SVD is completely analogous to the case of the basic Lee–Carter method.
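Reusing the objects s and a_x from the sketch in Subsection 2.3.3, the second-order fit only adds the second singular triplet, as in the short sketch below.

```r
# 2nd order Lee-Carter fit: keep the first two singular triplets of M.
b1 <- s$u[, 1];  k1 <- s$d[1] * s$v[, 1]   # order (1) estimators
b2 <- s$u[, 2];  k2 <- s$d[2] * s$v[, 2]   # order (2) estimators

fitted_log_m_2nd <- a_x + outer(b1, k1) + outer(b2, k2)   # matrix (2.16)
```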

2.4.4 Forecasting

To forecast, the procedure is akin to the basic Lee–Carter method. The distinction between the methods consists of calculating doublets of the parameters for the random walk. From equation (2.9) we have, for the estimator of $k_t$,
\[
\hat{k}_t = \hat{d} + \hat{k}_{t-1} + \varepsilon_t.
\]
Redefining the equation for $\hat{k}_t^{(i)}$, we have
\[
\hat{k}_t^{(i)} = \hat{d}^{(i)} + \hat{k}_{t-1}^{(i)} + \varepsilon_t^{(i)}, \quad i = 1, 2. \qquad (2.17)
\]
The error terms are again normally distributed such that $\varepsilon_t^{(i)} \sim \mathcal{N}(0, \sigma^2)$, $i = 1, 2$. Expanding equation (2.17), we have
\[
\hat{k}_t^{(i)} = \hat{k}_T^{(i)} + \hat{d}^{(i)} t + \sum_{j=1}^{t} \varepsilon_{T+j-1}^{(i)}, \quad i = 1, 2. \qquad (2.18)
\]
It is not obvious that the estimated drift constant $\hat{d}$ differs between the two orders, prompting the corresponding indexing. Examining the estimation equation (2.10), however, we see that this is indeed the case. Therefore, the analogous equation for $\hat{d}^{(i)}$ is
\[
\hat{d}^{(i)} = \frac{\hat{k}_T^{(i)} - \hat{k}_1^{(i)}}{T - 1}, \quad i = 1, 2.
\]

Taking the variance of $\hat{k}_t^{(i)}$, we get
\[
V[\hat{k}_t^{(i)}] = t\,\sigma^{(i)2}, \quad i = 1, 2. \qquad (2.19)
\]
Now, taking the square root of equation (2.19) to acquire the standard deviation and then applying it to equation (2.18), we get
\[
\hat{k}_t^{(i)} = \hat{k}_T^{(i)} + \hat{d}^{(i)} t + \sqrt{t}\,\varepsilon_t^{(i)}, \quad i = 1, 2. \qquad (2.20)
\]


Since $\varepsilon_t^{(i)} \sim \mathcal{N}(0, \sigma_t^{(i)2})$ for $i = 1, 2$ is assumed to be independent, we can ignore this error term for the point forecast and essentially make the random walk a straight line. Using equation (2.16) together with equation (2.20), we thus have
\[
\tilde{M}^{(2nd)}_{x,t} = \hat{a}_x + \hat{b}_x^{(1)} \hat{k}_T^{(1)} + \hat{b}_x^{(2)} \hat{k}_T^{(2)} + t\left( \hat{b}_x^{(1)} \hat{d}^{(1)} + \hat{b}_x^{(2)} \hat{d}^{(2)} \right),
\]
where $\tilde{M}^{(2nd)}_{x,t}$ is a new matrix consisting of the logarithmic forecasted mortality rates, starting at time period number $t = T + 1$.

Just like in the case of the basic Lee–Carter method, we can join the two matrices containing the logarithmic fitted and forecasted mortality rates to create a new matrix $M^{*(2nd)}$,
\[
M^{*(2nd)} = \begin{pmatrix} \hat{M}^{(2nd)} & \tilde{M}^{(2nd)} \end{pmatrix},
\]
which gives a complete table for the 2nd order Lee–Carter model.

2.4.5 Goodness of fit and confidence intervals

Equivalently to the basic Lee–Carter method, it is possible to measure the goodness of fit. From equation (2.14) we can derive the expression for $\eta_x^2$:
\[
\eta_x^2 = 1 - \frac{\sum_{t=1}^{T}\big(m_{x,t} - \hat{m}^{(2nd)}_{x,t}\big)^2}{\sum_{t=1}^{T}\big(m_{x,t} - \bar{m}_x\big)^2}.
\]
This will, for the 2nd order Lee–Carter model, provide the proportion of variance explained by the fitted values.

Remark 3. All values for calculating ηx2 are actual percentage mortality rates. I.e., they are not logarithmic.

Now, using the confidence interval for $\hat{k}_t^{(i)}$,
\[
\hat{k}_t^{(i)} \pm 1.96\, see_t^{(i)}\, \hat{k}_t^{(i)}, \quad i = 1, 2,
\]
where from equation (2.12) we can derive $see_t^{(i)}$ to be
\[
see_t^{(i)} = \sqrt{ \frac{1}{T-2} \sum_{t=1}^{T-1} \left( \hat{k}_{t+1}^{(i)} - \hat{k}_t^{(i)} - \hat{d}^{(i)} \right)^2 }, \quad i = 1, 2,
\]
we can construct the approximate 95% confidence interval for $\tilde{m}^{(2nd)}_{x,t}$ as
\[
-\tilde{m}_{x,t} \times 1.96\big(\hat{b}_x^{(1)} see_t^{(1)} + \hat{b}_x^{(2)} see_t^{(2)}\big) \le \tilde{m}_{x,t} \le \tilde{m}_{x,t} \times 1.96\big(\hat{b}_x^{(1)} see_t^{(1)} + \hat{b}_x^{(2)} see_t^{(2)}\big).
\]


2.5 The Hyndman–Ullah method

In 2007, Rob J. Hyndman and Shahid Ullah [20] presented a new, nonparametric method for modeling and forecasting mortality rates. It was introduced in order to capture some non-random patterns that could not be accounted for using the Lee–Carter method, as well as to provide a smoothing of the log mortality rates [32]. The method has shown superior performance in modeling French mortality rates compared to the Lee–Carter method and several of its variants [20], [22]. This has made it compelling for researchers to study its implementation on population data from other developed countries.

The Hyndman–Ullah method can be regarded as a generalization and extension of the Lee–Carter method in five specific ways [4], [32].

1. Smoothing of the log mortality rates

Smoothing is achieved by assuming that there is a continuous, smooth function with an error associated with every discrete age group $x$. This means that instead of the mortality rates being output as a matrix (table) $m_{x,t}$, we consider $x$ a continuous variable of a function $m_t(x)$. The function includes the logarithm transformation applied to the mortality rates.

2. Application of higher-order Principal Component Decomposition

Higher orders of principal component decomposition are used, as opposed to the Lee–Carter method, which only uses the first order. Using higher orders, Hyndman and Ullah determined that it was possible to capture non-random patterns in the mortality data. The method uses Functional Principal Component Analysis (FPCA) (which essentially is PCA extended to time series) to decompose a set of curves into orthogonal functional principal components as well as their uncorrelated principal component scores.

3. Extended univariate time series models to forecast principal component scores

Using the observed data D and the set of functional principal components F, the forecasting of future time periods t can be done using either ARIMA or some exponential smoothing state-space model. This allows for a wider range of forecasting tools.

4. Compensation for unusual (high mortality rate) time periods

Instead of using a dummy variable for unusual years, as in the basic Lee–Carter case [27], robust estimation accounts for these events (see Subsection 2.5.4).

5. Constant $k_t$


2.5.1 The model

Here we provide the Hyndman–Ullah model [32] with some auxiliary comments regarding its components. We will start with a quite voluminous definition of the model.

Definition 26. The Hyndman–Ullah model is defined as

$$y_t(x) = f_t(x) + \sigma_t(x)\,\varepsilon_t, \tag{2.21}$$

where

$$f_t(x) = \mu(x) + \sum_{k=1}^{K} \beta_{t,k}\,\phi_k(x) + e_t(x), \tag{2.22}$$

for $t = 1, \ldots, T$, where

• x is the age,
• t is the time,
• $y_t(x)$ is the mortality rate modeled,
• $f_t(x)$ is the smooth function of x,
• $\sigma_t(x)$ is the smooth volatility function of x,
• $\varepsilon_t$ is the error term,
• $\mu(x)$ is the mean function across ages,
• $\{\phi_1(x), \ldots, \phi_K(x)\}$ is the set of the first K principal components,
• $\{\beta_{t,1}, \ldots, \beta_{t,K}\}$ is a set of uncorrelated principal component scores,
• $e_t(x)$ is the residual function, and
• T is the number of time periods fitted.

Remark 4. The set $\{\phi_k(x)\}$ is a set of orthonormal basis functions analogous to the parameter $b_x$ in the Lee–Carter model (see Section 2.3); it represents the sensitivity of the model for age x as time t varies. The set $\{\beta_{t,k}\}$ contains the coefficient functions analogous to $k_t$ in the Lee–Carter model as time t varies (see Section 2.3).


2.5.2 Constraints and assumptions

Here we will list some assumptions associated with the Hyndman–Ullah method.

1. The error term $\varepsilon_t$ is assumed to be i.i.d. such that

$$\varepsilon_t \sim \mathcal{N}(0, \sigma^2).$$

2. The number of principal components (pairs of basis functions and coefficients) used is set to K = 6, as was done by Hyndman and Ullah in the original published article where the method was presented [20].

3. The residual function $e_t(x)$ is assumed to be distributed as

$$e_t(x) \sim \mathcal{N}(0, v_t(x)),$$

where $v_t(x)$ is given by

$$v_t(x) = \int_x \left(\hat{f}^{*}_t(x) - \sum_{k=1}^{6} \hat{\beta}_{t,k}\,\phi_k(x)\right)^2 dx, \tag{2.23}$$

where $\hat{f}^{*}_t(x)$ is the median-adjusted function. Following the procedure of Hyndman and Ullah, K = 6 principal components are used.

4. A monotonicity constraint is incorporated in the model [20]. The function $f_t(x)$ is assumed to be monotonically increasing for $x > c$ (e.g. c = 50 years). This constraint is imposed to reduce the noise of the estimated curves at higher ages. It is a qualitative, but reasonable, constraint, since for ages above, for example, 50 years, mortality rates are observed to increase strictly for every year of age until death.

5. The number of principal components used satisfies $K < T$. Following the procedure of Hyndman and Ullah (letting K = 6) [20], this implies that at least 7 time periods must be fitted.

For the Hyndman–Ullah method, we define $m_t(x)$ to be the historical mortality rates. We also denote $N_t(x)$ as the total population of age x at the central time point June 30 of year t. We can then say that $m_t(x)$ is approximately binomially distributed and thus

$$V[\, m_t(x) \,] = N_t^{-1}(x)\, m_t(x)\,[1 - m_t(x)].$$

Using a Taylor approximation [20] to obtain an estimator of the variance, we also have

$$\hat{\sigma}_t^2(x) \approx N_t^{-1}(x)\, m_t^{-1}(x)\,[1 - m_t(x)], \tag{2.24}$$

providing us with an estimator $\hat{\sigma}_t(x)$ of $\sigma_t(x)$ in equation (2.21) after taking the square root. The Hyndman–Ullah method is at its core a three-step procedure, and we will now go through each step to fit and forecast the data.
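To illustrate equation (2.24) numerically, the short R sketch below computes $\hat{\sigma}_t(x)$ on a grid of ages and years from matrices of observed rates and population counts; the reciprocal of the squared result reappears as the smoothing weight in the next subsection. The object names are illustrative.

```r
# Observational standard deviation estimate from equation (2.24).
# m: matrix of central mortality rates (ages x years); N: matching matrix of populations.
sigma_hat <- function(m, N) {
  sqrt((1 - m) / (N * m))       # sqrt of N^-1 * m^-1 * (1 - m)
}

# The smoothing weights used later are simply the reciprocal variances:
# w <- (N * m) / (1 - m)
```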


2.5.3 Smoothing the data

Smoothing is done using weighted penalized regression splines (see Subsection 2.2.4). Explicitly, by using matrices and adding weights, we can write this as [9]

$$\underset{\beta}{\text{minimize}} \quad \|w(y - X\beta)\|^2 \quad \text{subject to} \quad \beta' D \beta < \infty, \tag{2.25}$$

where w is a vector containing the weights, y is a vector containing the historical logarithmic mortality rates, X is a matrix representing the linear spline basis, $\beta$ is the solution vector and D is a diagonal matrix such that

$$D = \begin{bmatrix} 0_{2\times 2} & 0_{2\times K} \\ 0_{K\times 2} & 1_{K\times K} \end{bmatrix}.$$

Using Lagrange multipliers, we can rewrite problem (2.25) as [36]

$$\underset{\beta}{\text{minimize}} \quad \|w(y - X\beta)\|^2 + \lambda^2 \beta' D \beta, \tag{2.26}$$

for some λ ≥ 0, which in this case is determined by the GCV-criterion discussed in Subsection 2.2.4. Further, Hyndman and Ullah [20] defined the weights as follows:

Definition 27. The weights $w_t(x)$ associated with age x and time period t are defined as the inverses of the estimated variances $\hat{\sigma}_t^2(x)$ [20]. Explicitly,

$$w_t(x) = \frac{N_t(x)\, m_t(x)}{1 - m_t(x)}.$$

We can now show that a solution $\hat{\beta}$ exists that minimizes expression (2.26) [25].

Theorem 2. The solution $\hat{\beta}$ that minimizes the expression $\|w(y - X\beta)\|^2 + \lambda^2 \beta' D \beta$ for any given $\lambda \ge 0$ is

$$\hat{\beta} = \left(\lambda^2 D + X' w' w X\right)^{-1} X' w' w\, y.$$

Proof. See Appendix A.

Since $\hat{y} = X\hat{\beta}$, we obtain the smoothed values $\hat{y}$.
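To make the solution of Theorem 2 concrete, the R sketch below smooths the log rates of a single period with a simple truncated-linear spline basis. The basis construction and knot placement are illustrative simplifications rather than the exact basis of the thesis, and $\lambda$ is passed in directly instead of being chosen by the GCV criterion of Subsection 2.2.4.

```r
# Weighted penalized regression spline smoothing for one period: a minimal sketch.
# x: ages; y: log mortality rates; w: weights (Definition 27); lambda: penalty level.
smooth_pspline <- function(x, y, w, lambda, n_knots = 10) {
  knots <- quantile(x, probs = seq(0, 1, length.out = n_knots + 2))[-c(1, n_knots + 2)]
  # Design matrix: intercept, linear term and truncated-linear spline basis functions.
  X <- cbind(1, x, sapply(knots, function(k) pmax(x - k, 0)))
  D <- diag(c(0, 0, rep(1, n_knots)))    # penalize only the spline coefficients
  W2 <- diag(w^2)                        # w enters squared, as in ||w(y - X beta)||^2
  beta_hat <- solve(lambda^2 * D + t(X) %*% W2 %*% X, t(X) %*% W2 %*% y)
  as.vector(X %*% beta_hat)              # smoothed values y_hat = X beta_hat
}
```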


2.5.4 Fitting the data

The next step in the method is fitting the data by approximating the bivariate surface (i.e., the difference between the smooth function $f_t(x)$ and the mean function $\mu(x)$). This is done by summing the products of the orthonormal basis functions $\phi_k(x)$ and their corresponding coefficients $\beta_{t,k}$, which is exactly equation (2.22). Since Hyndman and Ullah used $K = 6$ basis functions [20], the estimate $\hat{f}_t(x)$ of $f_t(x)$ becomes

$$\hat{f}_t(x) = \hat{\mu}(x) + \sum_{k=1}^{6} \beta_{t,k}\,\phi_k(x) + e_t(x), \tag{2.27}$$

where the robust estimate of the mean function $\mu(x)$ is given by

$$\hat{\mu}(x) = \underset{\theta(x)}{\operatorname{argmin}} \sum_{t=1}^{T} \left\| \hat{f}_t(x) - \theta(x) \right\|. \tag{2.28}$$

This means that the estimated mean for an age group x is the value of the spline target function $\theta(x)$ that minimizes the sum of the differences between the estimated smooth functions $\hat{f}_t(x)$ and the target function itself over all fitted time periods t. This estimate is also called the $L_1$-median of the estimated smooth curves $\{\hat{f}_t(x)\}$. Computationally, $\hat{\mu}(x)$ can be estimated by

$$\hat{\mu}(x) = \frac{1}{T}\sum_{t=1}^{T} \hat{f}_t(x).$$

To find the basis functions $\phi_k(x)$ and the coefficients $\beta_{t,k}$ we will use weighted functional PCA. This is analogous to weighted PCA extended to continuous functions. The core idea behind this procedure is to find $\phi_k(x)$ for $K = 6$ such that the mean integrated squared error (MISE) is minimized. This can be written compactly as

$$\underset{\phi_k(x)}{\text{minimize}} \quad \mathrm{MISE} = \frac{1}{n}\sum_{t=1}^{n} \int e_t^2(x)\,dx.$$

After the estimators $\hat{f}_t(x)$ and $\hat{\mu}(x)$ are obtained, we define the median-adjusted set of curves as

$$\{\hat{f}^{*}_t(x)\} := \{\hat{f}_t(x) - \hat{\mu}(x)\}.$$

To obtain the complete functional PCA with the weights implemented, we use the following two-step algorithm proposed by Hyndman and Ullah [20].

1. By using projection pursuit (see Subsection 2.2.5), find the initial, robust, values of $\{\hat{\beta}_{t,k}\}$ and $\{\phi_k(x)\}$ for $t = 1, 2, \ldots, T$ and $k = 1, 2, \ldots, 6$.

2. By defining the integrated square error $v_t(x)$ according to equation (2.23), we are provided with a measure of the accuracy of the principal component approximation for time period t. The weights, for a fixed age group x, are then defined to be

$$w_t = \begin{cases} 1 & \text{if } v_t < s + \lambda\sqrt{s}, \\ 0 & \text{otherwise}, \end{cases}$$

where s is the median of the set $\{v_t\}$ and $\lambda > 0$ is a parameter controlling the level of robustness. Using the obtained weights, the new, weighted, values $\{\hat{\beta}_{t,k}\}$ and $\{\phi_k(x)\}$ are generated using the singular value decomposition (see Subsection 2.2.5).

Combining equations (2.21), (2.24) and (2.27), we can rewrite the equation describing the fitted mortality rates $\hat{y}_t(x)$ as

$$\hat{y}_t(x) = \hat{\mu}(x) + \sum_{k=1}^{6} \hat{\beta}_{t,k}\,\hat{\phi}_k(x) + \hat{e}_t(x) + \sqrt{N_t^{-1}(x)\, m_t^{-1}(x)\,[1 - m_t(x)]}\;\varepsilon_t, \tag{2.29}$$

where $\hat{\mu}(x)$ is given by equation (2.28) and $\hat{e}_t(x)$ is given by equation (2.23). This equation is defined on the interval $[x_t, x_T]$, where T is the last given time point of the historical data.
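The fitting step can be summarized on a discretized age grid as in the R sketch below: the smoothed curves are centered at their mean, the robustness weights are computed from the integrated squared errors, and the basis functions and scores are then extracted with a weighted SVD. This is a simplified stand-in for the projection-pursuit-plus-SVD procedure described above (the initial robust step is replaced by an ordinary SVD), and all object names are illustrative.

```r
# Weighted functional PCA on a discretized age grid: a condensed sketch.
# F_hat: T x A matrix of smoothed log rates; dx: grid spacing; K: number of components.
wfpca <- function(F_hat, dx, K = 6, lambda = 3) {
  mu <- colMeans(F_hat)                          # simple estimate of mu(x)
  Fc <- sweep(F_hat, 2, mu)                      # median/mean-adjusted curves f*_t(x)

  # Initial (unweighted) principal components via SVD.
  sv   <- svd(Fc)
  phi  <- sv$v[, 1:K, drop = FALSE]              # basis functions phi_k(x)
  beta <- Fc %*% phi                             # scores beta_{t,k}

  # Integrated squared errors v_t and 0/1 robustness weights.
  v <- rowSums((Fc - beta %*% t(phi))^2) * dx
  s <- median(v)
  w <- as.numeric(v < s + lambda * sqrt(s))      # lambda controls the robustness level

  # Re-estimate the components from the weighted curves.
  svw  <- svd(sqrt(w) * Fc)                      # each row t scaled by sqrt(w_t)
  phi  <- svw$v[, 1:K, drop = FALSE]
  beta <- Fc %*% phi

  list(mu = mu, phi = phi, beta = beta, weights = w)
}
```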

2.5.5 Forecasting

Since the coefficients $\hat{\beta}_{t,k}$ and $\hat{\beta}_{t,\ell}$ are uncorrelated for $k \neq \ell$, the ARIMA model described in Subsection 2.2.3 is a viable way of forecasting the mortality rates. However, since we have $K = 6$ basis functions, we need to fit an ARIMA model to each series $\{\hat{\beta}_{1,k}, \ldots, \hat{\beta}_{T,k}\}$ for $k = 1, 2, \ldots, 6$. Rewriting equation (2.29), we can derive the equation describing the forecasted mortality rates as

$$\tilde{y}_{T,h}(x) = \hat{\mu}(x) + \sum_{k=1}^{6} \hat{\beta}_{T,k,h}\,\hat{\phi}_k(x), \tag{2.30}$$

where $\hat{\beta}_{T,k,h}$ is the h-step forecast of $\beta_{T+h,k}$ using the ARIMA model. This equation is thus defined on the interval $(x_T, x_{T'}]$, where $T'$ is the last time point of the forecast horizon. Using equations (2.24), (2.29) and (2.30), the model can be written compactly as

$$y^{*}_{t,T,h} = \begin{cases} \hat{y}_t(x) = \hat{\mu}(x) + \sum_{k=1}^{6}\hat{\beta}_{t,k}\,\hat{\phi}_k(x) + \hat{e}_t(x) + \hat{\sigma}_t(x) & \text{on } [x_t, x_T], \\[4pt] \tilde{y}_{T,h}(x) = \hat{\mu}(x) + \sum_{k=1}^{6}\hat{\beta}_{T,k,h}\,\hat{\phi}_k(x) & \text{on } (x_T, x_{T'}], \end{cases}$$

where $y^{*}_{t,T,h}$ is the complete model, including fitted and forecasted mortality rates, defined on the interval $[x_t, x_{T'}]$.
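A minimal R sketch of this forecasting step is given below. It fits a separate ARIMA model to each score series with auto.arima() from the forecast package and combines the forecasted scores with the basis functions according to equation (2.30); the automatic order selection and the object names are illustrative choices, not necessarily those used in the thesis.

```r
# Forecasting the principal component scores and the log rates: a minimal sketch.
# beta: T x K matrix of estimated scores; h: forecast horizon.
library(forecast)

forecast_scores <- function(beta, h) {
  K <- ncol(beta)
  beta_fc <- matrix(NA_real_, nrow = h, ncol = K)
  for (k in 1:K) {
    fit <- auto.arima(beta[, k])       # ARIMA order selected automatically per series
    beta_fc[, k] <- forecast(fit, h = h)$mean
  }
  beta_fc
}

# Point forecasts of the log rates on the age grid, following equation (2.30):
# y_tilde <- matrix(mu, nrow = h, ncol = length(mu), byrow = TRUE) +
#            forecast_scores(beta, h) %*% t(phi)
```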

2.5.6 Goodness of fit and confidence intervals

To obtain a measure of the goodness of fit to the historical data, we will use the proportion of variance explained, which allows comparison with the Lee–Carter methods previously discussed.

Since the fitted mortality rates from equation (2.29) in the Hyndman–Ullah model are represented by a continuous function, we first need to discretize this function to be able to compare it to discrete models such as Lee–Carter. We will do this on a grid of c equally spaced values, such that the set $\{\hat{m}_c\}$ $(c = 1, 2, \ldots, T)$ is the set of central mortality rates (i.e., the central point within each time period t) spanning the interval $[x_1, x_T]$. This results in a matrix $\hat{M}^{(HU)}_{x,t}$ consisting of central death rates for the Hyndman–Ullah method with the same dimensions $(x, t)$ as the matrix $M_{x,t}$ containing the actual mortality rates. We can then construct the proportional variance $\eta_x^2$ as

$$\eta_x^2 = 1 - \frac{\sum_{t=1}^{T}\left(m_{x,t} - \hat{m}^{(HU)}_{x,t}\right)^2}{\sum_{t=1}^{T}\left(m_{x,t} - \bar{m}_x\right)^2},$$

for each age group x.

Remark 5. All values used for calculating $\eta_x^2$ are actual mortality rates, i.e., they are not logarithmic.

The forecast variance can be obtained by summing the component variances of equation (2.29) [20]. We thus have

$$\zeta_{T,h}(x) \approx \hat{\sigma}_{\mu}^2 + \sum_{k=1}^{6} u_{T+h,k}\,\hat{\phi}_k^2(x) + v_t(x) + \sigma_{T+h}^2(x),$$

where $\hat{\sigma}_{\mu}^2$ is the variance of the smooth estimate $\hat{\mu}(x)$, $u_{T+h,k}$ is the variance of the ARIMA model forecast and $\sigma_{T+h}^2(x)$ is the observational error variance given by equation (2.24).

Assuming that all sources of variance are normally distributed (Hyndman and Ullah found no reason to reject this assumption [20]), an approximate 95% prediction interval can be constructed as

$$\tilde{y}_{T,h}(x) - z_{0.025}\sqrt{\zeta_{T,h}(x)} \;\le\; y_t(x) \;\le\; \tilde{y}_{T,h}(x) + z_{0.025}\sqrt{\zeta_{T,h}(x)},$$

where $z_{0.025} = 1.96$ is the standard normal quantile.
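Once the component variances have been computed, assembling the interval is plain arithmetic, as the R sketch below shows; the inputs are assumed to have been obtained from the previous steps and all names are illustrative.

```r
# 95% prediction interval for the forecasted log rates, following the variance sum above.
# y_tilde: h x A matrix of point forecasts; var_mu: variance of mu_hat (length A);
# u: h x K matrix of ARIMA forecast variances; phi: A x K basis matrix;
# v: model error variances (length A or scalar); sigma2_obs: h x A observational variances.
predict_interval <- function(y_tilde, var_mu, u, phi, v, sigma2_obs, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)          # 1.96 for a 95% interval
  zeta <- matrix(var_mu, nrow = nrow(y_tilde), ncol = ncol(y_tilde), byrow = TRUE) +
          u %*% t(phi^2) +
          matrix(v, nrow = nrow(y_tilde), ncol = ncol(y_tilde), byrow = TRUE) +
          sigma2_obs
  list(lower = y_tilde - z * sqrt(zeta),
       upper = y_tilde + z * sqrt(zeta))
}
```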

2.6 Sources of error

Numerous sources of error are inherent in the process of forecasting mortality rates. A good overview can be obtained by grouping the sources into four different (although not completely independent) categories, as attempted by Alho in 1990 [1]. The four categories are model misspecification, parameter estimation, errors in informed judgment and inherent random variation. In addition to these groups, three types of correlation between errors exist [6] that further complicate the error analysis of a specific forecasting method. We will here go through the sources of error and their correlations one by one to outline their presence in the methods.

• Model misspecification can arise from the assumptions made about the underlying model of the data, for example the assumption that a parametrization function joined with a forecasting model is representative of the actual data. The error would thus be defined as the difference between a complete specific model (e.g. Lee–Carter with ARIMA) and an ideal "perfect" model describing both the past and the future mortality rates.

