Academic year: 2021


Degree Project

Tauqeer Hussain Shah 1986-02-08

Subject: Mathematics Level: Bachelor Course code: 2MA11E

An Extreme Value Analysis Statistical Preparation For The Inference of

Unemployment Data


Contents

1 Introduction
  1.1 Background
  1.2 Purpose of The Paper
  1.3 Un-Employment Data

2 Extreme Value Theory
  2.1 History
  2.2 Definition

3 Independent Identically Distributed (iid)
  3.1 Definition
  3.2 Examples
  3.3 Maximum Likelihood Estimation (MLE)

4 How to Check Independence
  4.1 Autocorrelation
  4.2 Ljung Box

5 Quantile Plot

6 Analysing The Data

7 Conclusion


Abstract

In this paper I prepare for an extreme value analysis of unemployment data. The aim of the larger study is to investigate whether there is a significant difference between the distributions corresponding to data for immigrants to Sweden and for native Swedish citizens. To this end I briefly recall the main results and assumptions of extreme value theory. The assumptions are then checked: a QQ test for the type of distribution, two alternative tests of independence, and possibly a test of stationarity of the time series.

Keywords: Extreme Value Theory, QQ plot, Autocorrelation.


ACKNOWLEDGMENTS

I would like to thank my supervisor Astrid Hilbert (Linnaeus University) for supporting me and answering all my questions related to my work. I would also like to thank all the people who supported me during my work by providing various tools and the required unemployment data. Thanks to the Swedish Government for giving me the opportunity to study in such a friendly environment.


1 Introduction

In this thesis, applied mathematics is used to study unemployment data. There are many statistical methods that could be applied; I will focus on extreme value theory. This Bachelor thesis is complementary to a bachelor degree from the University of the Punjab, Pakistan.

1.1 Background

After the Second World War, the number of immigrants in Sweden increased very rapidly; see Statistics Sweden (2010a, 2010b). In 1940 the proportion of foreign-born people in the total population amounted to only about 1%. The proportion increased to nearly 7% by 1970 and to nearly 14% at the beginning of 2010, which corresponds to about 1.3 million individuals. About 50% of the foreign-born individuals living in Sweden in 2010 had acquired Swedish citizenship. Moreover, there is a growing group of so-called second-generation immigrants, that is, persons born in Sweden with at least one parent born abroad. Today this group numbers nearly 1 million individuals. More than half of the group has one parent born in Sweden.

There have been great changes in the employment situation for immigrants in Sweden during the post-war period. Many studies [1] show that the employment situation for immigrants in Sweden was good up to the mid-1970s. The unemployment rate was low and there was full employment for both natives and immigrants. During long periods, employment levels among immigrants even exceeded those of the natives. Since the beginning of the 1980s, however, the immigrants' labour market situation in Sweden has deteriorated considerably.

According to the Swedish Labour Force Surveys, the unemployment rate increased for all groups during the recession, especially for those born outside Europe.

The share of long-term unemployment is higher among the foreign born than among the native born. In the second half of 2008 it was 33.9% among the foreign born and 20.6% among the native born. It increased to 39.0% among the foreign born and 28.6% among the native born during the first half of 2010; see [2]. The increase was thus not faster for the foreign born. One reason may be that the foreign born included many newly arrived individuals during 2008 and 2009 with only short periods spent searching for jobs. The proportion of long-term unemployment is expected to decrease only if the recovery is sustained.


1.2 Purpose of The Paper

The purpose of this paper is to analyse the data, checking whether the assumptions of extreme value theory are satisfied.

If an assumption is not satisfied, one has to remove so-called trends.

1.3 Un-Employment Data

I received the unemployment data from [3] to test the models. I have data from 1979 to 2005 for the analysis of unemployment among the native born and the foreign born.


2 Extreme Value Theory

2.1 History

The field of extreme value theory was pioneered by Leonard Tippett (1902-1985). He was employed at the British Cotton Industry Research Association, where he worked to make stronger cotton threads. During his studies he realized that the strength of cotton was controlled by the strength of its weakest fibres. With the help of R. A. Fisher, he obtained three asymptotic limits describing the distributions of extremes. The German mathematician and anti-Nazi activist Emil Julius Gumbel developed this theory further in his 1958 book [4], which includes the Gumbel distribution.

2.2 Definition

Extreme value theory is a branch of statistics dealing with extreme deviations from the median of probability distributions. The general theory sets out to assess the probability of events that are more extreme than those previously observed.

Nowadays we have two approaches:

• The "basic theory approach", as described in the book by Burry (1975). This is usually known as the first theorem in extreme value theory.

• The "tail fitting approach", based on the second theorem in extreme value theory, which is the most common.

For the analysis of extremal events there are precisely three standard extreme value distributions: Fréchet, Weibull, and Gumbel. Their distribution functions are given below and can also be found in Embrechts et al. (1997) [7], page 121.

Fréchet:

\[
\Phi_\alpha(x) =
\begin{cases}
0, & x \le 0,\\
\exp\{-x^{-\alpha}\}, & x > 0,
\end{cases}
\qquad \alpha > 0.
\]

Weibull:

\[
\Psi_\alpha(x) =
\begin{cases}
\exp\{-(-x)^{\alpha}\}, & x \le 0,\\
1, & x > 0,
\end{cases}
\qquad \alpha > 0.
\]

Gumbel:

\[
\Lambda(x) = \exp\{-e^{-x}\}, \qquad x \in \mathbb{R}.
\]


Instead of using three standard cases for the extreme value distribution, it is preferable to use one parameter that represents all three cases. This can be done by introducing a new parameter ξ in such a way that ξ = 0 covers the Gumbel distribution, ξ = 1/α > 0 the Fréchet distribution, and ξ = −1/α < 0 the Weibull distribution. We can then define the unified distribution function H, depending on ξ, as follows:

\[
H_\xi(x) =
\begin{cases}
\exp\{-(1+\xi x)^{-1/\xi}\}, & \text{if } \xi \neq 0,\\
\exp\{-\exp\{-x\}\}, & \text{if } \xi = 0,
\end{cases}
\tag{1}
\]

where 1 + ξx > 0. From the point of view of mathematical inference this means that the distribution of a given sample of extremal data can be determined uniquely by estimating a single parameter, namely ξ. Extremal events are naturally related to the tail of a distribution; since they occur very rarely, statistical inference is less reliable. An extreme value analysis will only give adequate results if the assumptions of the theory are satisfied, in particular those of Theorem 3.4.5 in [7]. Hence it is important to examine statistically whether the data in the given sample are independent and identically distributed and whether they follow an extreme value distribution (1).

For the given unemployment data this is done by a QQ test. These statistical methods are described in the following sections.
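As an illustration, the unified distribution (1) can be evaluated directly. The following minimal Python sketch (the function name and test values are my own, not from the thesis) implements H_ξ and shows that for small ξ it approaches the Gumbel case:

```python
import math

def gev_cdf(x, xi):
    """Unified extreme value distribution H_xi from (1):
    xi > 0 is the Frechet case, xi < 0 the Weibull case,
    xi = 0 the Gumbel case exp(-exp(-x))."""
    if xi == 0.0:
        return math.exp(-math.exp(-x))
    t = 1.0 + xi * x
    if t <= 0.0:
        # Outside the support 1 + xi*x > 0 the cdf is constant:
        # 0 on the Frechet side, 1 on the Weibull side.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

# As xi -> 0 the general case converges to the Gumbel distribution.
print(gev_cdf(1.0, 0.0))   # exp(-exp(-1)) ~ 0.6922
print(gev_cdf(1.0, 1e-8))  # nearly identical
```

This makes concrete why a single estimated ξ suffices to distinguish the three classical families.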


3 Independent Identically Distributed (iid)

3.1 Definition

In probability theory and statistics, a collection of random variables is independent and identically distributed (i.i.d.) if all the random variables have the same probability distribution and all are mutually independent.

The abbreviation i.i.d. is very common in statistics (sometimes written as IID). The assumption that observations are i.i.d. tends to simplify the mathematics behind many statistical methods, and it is a realistic assumption in many practical applications of statistical modelling.

3.2 Examples

Following are some examples of i.i.d. random variables:

• Repeated dice rolling is one example of i.i.d. random variables.

• Repeated flipping of a coin is another example.

• Spins of a roulette wheel are also an example.

• In signal processing and image processing, the idea of a transformation to IID has two parts: ID (identically distributed) means that the signal should be balanced along the time axis, and I (independent) means that the signal spectrum should be flat.

3.3 Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation is the statistical tool used to estimate unknown parameters.

The likelihood and the log-likelihood are the functions most often used to derive estimators for the parameters from given data. They attain their maximum at the same point, even though they have different shapes. The value at which the maximum is attained is known as the maximum likelihood estimate.

Another definition with examples can be found in [6]. An MLE means that we want to maximize the likelihood function

\[
L(\theta) = \prod_{i=1}^{n} f(x_i), \tag{2}
\]

where θ is the parameter (possibly a vector) and f is the density function, which is considered to be known. If the likelihood is well behaved, the parameter θ that maximizes (2) can be found directly, or by maximizing the logarithm of the likelihood function, ln L(θ). The procedure is to differentiate ln L(θ) with respect to θ and set the derivative equal to zero to find the value of θ. If θ is a vector of several variables, calculate the partial derivatives individually.

In the vector case we proceed as before, setting each partial derivative equal to zero and solving for θ. More details can be found in Modelling Extremal Events for Insurance and Finance [7], Chapter 6.
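As a minimal sketch of this differentiate-and-solve recipe, consider the exponential density f(x) = λe^{−λx} (a stand-in example of my own; it is not one of the distributions treated in the thesis). Setting d/dλ ln L(λ) = n/λ − Σx_i = 0 gives λ̂ = n/Σx_i = 1/x̄:

```python
import math

def exp_log_likelihood(lam, data):
    """ln L(lambda) = n*ln(lambda) - lambda*sum(x_i) for the
    exponential density f(x) = lambda * exp(-lambda * x)."""
    return len(data) * math.log(lam) - lam * sum(data)

data = [0.5, 1.2, 0.3, 2.0, 1.0]   # illustrative sample

# Differentiating ln L and setting the derivative to zero gives the
# closed-form maximizer lambda_hat = n / sum(x_i) = 1 / mean(x).
lam_hat = len(data) / sum(data)

# Sanity check: the log-likelihood at lambda_hat beats nearby values.
assert exp_log_likelihood(lam_hat, data) > exp_log_likelihood(0.9 * lam_hat, data)
assert exp_log_likelihood(lam_hat, data) > exp_log_likelihood(1.1 * lam_hat, data)
print(lam_hat)  # 1.0
```

For the extreme value distributions of Section 2 no such closed form exists and the score equations must be solved numerically, but the principle is the same.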


4 How to Check Independence

In statistics there are many ways to check the independence of data; in this paper I will use the following two methods.

4.1 Autocorrelation

To check the independence of the data in a time series we can use the autocorrelation function for different time lags. Let us define this first.

For example, it is essential for a random number generator that the produced numbers are realizations of independent random variables.

We can also say that autocorrelation is a mathematical tool for finding repeating patterns in the data we are going to analyse. For time series data it compares observations separated by a specified time period, called the lag.

Let x_1, x_2, \ldots, x_n be the observations of a sample; then the mean value of the sample is estimated by

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i.
\]

The sample autocorrelation at time lag k is now defined as

\[
\rho(k) = \frac{\sum_{i=1}^{n-k}(x_i - \bar{x})(x_{i+k} - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}. \tag{3}
\]
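Equation (3) is straightforward to compute. A minimal Python sketch (the function name and the toy series are my own, for illustration only):

```python
def sample_acf(x, k):
    """Sample autocorrelation (3) at lag k, with the mean-centred
    sum of squares in the denominator."""
    n = len(x)
    xbar = sum(x) / n
    num = sum((x[i] - xbar) * (x[i + k] - xbar) for i in range(n - k))
    den = sum((v - xbar) ** 2 for v in x)
    return num / den

x = [1, 2, 3, 4, 5]
print(sample_acf(x, 1))  # 0.4
print(sample_acf(x, 2))  # -0.1
```

The strongly trending toy series has a clearly positive lag-1 autocorrelation, as one would expect.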

In Chapter 1 of [5] we can find many other tests that can be used to analyse the independence of data.

If the data were known to be realizations of an independent random sample, there would be nothing to check.

If the data are independent, the autocorrelation function should be near zero for all values of k. If any point differs significantly from zero, this is evidence that the given data are dependent.

To check whether the points are near zero or differ significantly from zero, we define the two bounds ±1.96/√n and verify that the values lie inside these boundaries. The given data are consistent with being iid if the autocorrelation function lies between these bounds.

The reason for this choice of boundaries is that for a large iid sequence the sample autocorrelations r(k) are approximately iid normally distributed with mean zero and variance 1/n. It is well known that the 95% confidence bounds for a normal distribution with mean zero and standard deviation σ are given by ±1.96σ. One can also choose quantiles other than 95%, but 95% is the standard choice in statistics. If the autocorrelation plot shows points that lie outside the 95% confidence interval, then the independence of the data is in question and further justification of the property is needed.

One alternative test for independence is the Ljung-Box test, which uses the whole sample rather than individual lags.

4.2 Ljung Box

We encounter a large number of tests for randomness in statistics. One of the most common is the autocorrelation method discussed in Section 4.1.

Another method is the Ljung-Box test (named after the work of Greta M. Ljung and George E. P. Box). Instead of checking randomness at every distinct lag, this test considers the whole data set based on a number of lags. It is also known as a portmanteau test.

The Ljung-Box test is defined as follows:

H0: The data in the given sample are iid (random).
H1: Some of the data in the given sample are not iid (random).

The test statistic Q is defined as

\[
Q = n(n+2)\sum_{j=1}^{d}\frac{\rho^2(j)}{n-j},
\]

where n is the size of the given sample, ρ(j) is the sample autocorrelation at lag j, and d is the total number of lags to be tested.
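The statistic can be computed directly from the sample autocorrelations of Section 4.1. A self-contained sketch (function names and the toy series are my own assumptions, not from the thesis):

```python
def acf(x, k):
    """Sample autocorrelation at lag k, as in equation (3)."""
    n = len(x)
    m = sum(x) / n
    return (sum((x[i] - m) * (x[i + k] - m) for i in range(n - k))
            / sum((v - m) ** 2 for v in x))

def ljung_box_q(x, d):
    """Ljung-Box statistic Q = n(n+2) * sum_{j=1..d} acf(j)^2 / (n-j)."""
    n = len(x)
    return n * (n + 2) * sum(acf(x, j) ** 2 / (n - j) for j in range(1, d + 1))

x = [1, 2, 3, 4, 5]
q = ljung_box_q(x, 2)
print(round(q, 4))  # 1.5167
```

In practice Q is then compared with a chi-squared quantile, as described below.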

For a given level α, H0 is rejected if

\[
Q > \chi^2_{1-\alpha}(d),
\]

where χ²_{1−α}(d) is the (1 − α)-quantile of the chi-squared distribution with d degrees of freedom.

A random variable has a chi-squared distribution if it can be written as a sum of squares of iid normal variables with mean zero and variance one.

That is, W ∼ χ² if W can be written as

\[
W = \sum_{i=1}^{n} X_i^2, \qquad X_i \sim N(0,1).
\]

The number of degrees of freedom is the sample length minus one. For example, if the sample length is 254, the degrees of freedom are 254 − 1 = 253.


5 Quantile Plot

To find the distribution underlying the data, we use a graphical tool known as the quantile-quantile plot, or QQ plot. For plotting we use the set of points

\[
\left\{ \left( X_{k,n},\; F^{\leftarrow}\!\left(\frac{n-k+1}{n+1}\right) \right) : k = 1, 2, \ldots, n \right\}, \tag{4}
\]

where X_{k,n} is the ordered sample and F^{←} is the generalized inverse of the distribution function F. To use this method the data in the given sample must be ordered. By plotting (4) we can see whether the data come from F: if the QQ plot is linear, this indicates that the given data were generated by the reference distribution.

(4) can also be written as

\[
\{ (X_{k,n}, F^{\leftarrow}(p_{k,n})) : k = 1, 2, \ldots, n \}, \tag{5}
\]

where p_{k,n} is a plotting position, which can be written as

\[
p_{k,n} = \frac{n-k+\delta_k}{n+\gamma_k},
\]

with (δ_k, γ_k) appropriately chosen to allow for a continuity correction.
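As an illustration of (4), consider a QQ plot against the Gumbel distribution Λ(x) = exp{−e^{−x}}, whose generalized inverse is F^{←}(p) = −ln(−ln p). The sketch below (helper names and sample values are my own; the ordering follows the decreasing convention X_{1,n} ≥ … ≥ X_{n,n}) builds the point set:

```python
import math

def gumbel_quantile(p):
    """Generalized inverse of the Gumbel cdf exp(-exp(-x))."""
    return -math.log(-math.log(p))

def qq_points(sample):
    """The points (4): ordered sample X_{k,n} against the quantiles
    F_inv((n - k + 1)/(n + 1)), k = 1, ..., n."""
    n = len(sample)
    ordered = sorted(sample, reverse=True)  # X_{1,n} >= ... >= X_{n,n}
    return [(ordered[k - 1], gumbel_quantile((n - k + 1) / (n + 1)))
            for k in range(1, n + 1)]

pts = qq_points([2.1, 0.3, 5.7, 1.0, 3.4])
# Both coordinates decrease with k, so a sample drawn from a Gumbel
# distribution yields points close to a straight line.
```

Plotting these pairs and checking for linearity is exactly the visual test used in Section 6.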

The properties of the QQ plot can be summarized as follows.

• Comparison of distributions. We get a roughly linear plot if the data are generated from a random sample of the reference distribution.

• Outliers. If one or several data values exhibit gross errors, or if a value differs markedly from the remaining values, the outlying points can easily be identified in the plot.

• Location and scale. A change in one of the distributions by a linear transformation simply transforms the plot by the same transformation. Assuming that the data come from the reference distribution, we can graphically estimate its location and scale parameters from the intercept and slope of the plot.

• Shape. Differences in distributional shape can be seen in the plot. For example, if the reference distribution has heavier tails or larger values, the curve will bend down at the left or up at the right.

More details and illustrations of these properties can be found in [7], page 291.


6 Analysing The Data

In this section I analyse the unemployment data and draw some conclusions. In order to work with larger samples we also consider unemployment data for men and women from 1979 to 2005. The following figures show the time series, individually for men and women.

Figure 1: Time Series Analysis of Men Data

Figure 2: Time Series Analysis of Women Data


Figure 3: Comparison of both men and women un-employment data

Since the sample was small, the method was tested using unemployment data of men and women between 1979 and 2005.

Figure 4: The QQ plot demonstrates that the unemployment data of immigrants and the native Swedish population roughly follow the same distribution.

This is a QQ plot with the extreme value distributions whose distribution functions were given in Section 2.

As expected, the underlying random variables have distributions of the same type, since the points fit the line very well. However, the parameters differ, since the line deviates from the diagonal; its angle with the x-axis is 38°.


The data can be checked by plotting the autocorrelation function for different time lags. For this purpose I calculated the confidence bounds using the formula ±1.96/√n, which gives a confidence bound of 0.1046.
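The reported bound can be reproduced directly (assuming n = 351 observations, which matches the lag range k = 1, …, 351 mentioned in the conclusion):

```python
import math

n = 351                      # assumed sample size
bound = 1.96 / math.sqrt(n)  # 95% confidence bound for the sample acf
print(round(bound, 4))       # 0.1046
```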

The results for the autocorrelation function, individually for men and women, can be seen in the following figures.

Figure 5: Sample autocorrelation function of men unemployment data, for different time lags

Figure 6: Sample autocorrelation function of women unemployment data, for differ- ent time lags


7 Conclusion

In this section I draw some conclusions from the unemployment data.

The plot of the autocorrelations for different time lags indicates that nearest neighbours are correlated. This is supported by the point estimates of the correlation for the time lags k = 1, 2, . . . , 351 and the 95% confidence interval for k = 1, 2, . . . , n − 2.

Before applying an extreme value analysis to these data, techniques should be applied to remove this dependence.

The QQ plot of the male against the female unemployment data shows that the data follow the same type of distribution, even if we do not know which one fits these data. This is documented by the fact that the plot fits a straight line very well.

Since the line is not the diagonal (its inclination is 38°), the distributions might differ in their parameters. This can only be decided by a QQ plot against a theoretical extreme value distribution function, or by a full extreme value analysis as explained in Chapter 6 of [7]. This is beyond the scope of this thesis.


References

[1] Wadensjö (1973), Ekberg (1983), Scott (1999), Bevelander (2000).

[2] Statistics Sweden, 2008b, 2008c, 2010d, 2010e.

[3] www.scb.se

[4] E. J. Gumbel, Statistics of Extremes.

[5] Brockwell et al. (2002).

[6] M. J. Schervish, Theory of Statistics.

[7] P. Embrechts, C. Klüppelberg, T. Mikosch, Modelling Extremal Events for Insurance and Finance.
