The Effect of Immigration on Income Distribution: A Comparative Study of Ordinary Least Squares and Beta Regression


By Fanni Forslind


Department of Statistics Uppsala University


Supervisor: Yukai Yang


2021


Abstract

The purpose of this study is to estimate the relationship between income inequality and immigration in Sweden. To do so, data from the database Kolada, with observations from all 290 municipalities in Sweden, are used. The Gini coefficient is used as a proxy for income distribution, and the share of foreign born of working age as a proxy for immigration. The model also controls for income tax, education level and unemployment level. The dependent variable, the Gini coefficient, is bounded by the unit interval, so a plain linear regression is not appropriate: such a model could predict values outside the interval. To properly estimate the relationship, two approaches are taken. First, a model is estimated with Ordinary Least Squares (OLS) after the dependent variable has been transformed onto the real line through log-odds. Then a model is estimated using beta regression. The study concludes that there is a statistically significant positive correlation between income inequality and immigration in Sweden. The OLS-estimated model shows that a 1 unit increase in immigration increases the log-odds of the Gini coefficient by 0.28336 units on average, ceteris paribus. Beta regression provides perhaps more intuitive results: if immigration increases by 1%, income inequality increases by 0.1046% on average, ceteris paribus. Because of the easier interpretation, among other things, beta regression is judged to be the better estimation method in this study.

Keywords:

Ordinary least squares (OLS), Beta regression, Proportion dependent variable, Gini coefficient, Income distribution, Immigration.


1 Introduction

Distribution of income and wealth has been under discussion for a long time, especially in the political economy debate. Fundamentally, it is a moral-philosophical question. It is important to distinguish the distribution of income from that of wealth. In a society, it is possible for one person to own 100% of the wealth, but there must be some degree of distribution of income for a country to function (Piketty (2020)). If a society has a high enough level of income inequality, a situation may arise where the great majority of the population is unable to purchase the bare necessities of life. It is sometimes argued that some degree of income inequality is good in the sense that it motivates people to be more productive. Others reason that inequality is unfair.

Regardless of one’s political stand, it is of importance to understand what drives income inequality to be equipped with the right tools to change it in the desired direction.

There are many studies on income distribution between countries and within the U.S. The purpose of this thesis is to contribute to the research field of income distribution in Sweden. In particular, the study examines the relationship between income inequality and immigration in Sweden’s municipalities. This is a fairly unexplored area of research but an essential one. Immigration and integration are presently widely debated issues in Sweden, not least since the European refugee crisis of 2015, in which Sweden received a large share of immigrants. To take a stand on these issues and to deal with immigration in a better way, it is necessary to know how it affects society.

The theory from which the thesis takes off is that higher immigration is positively correlated with higher income inequality. It is often argued that immigrants are poorly adjusted to the domestic labour market and therefore mainly add to the lower-skilled work force. This causes a concentration in the lower end of the income distribution (Reed (1999)).

The variable of interest, the Gini coefficient, is bounded by the unit interval [0,1]. This makes estimating a model somewhat more complicated. Therefore, the estimation is done with two different approaches. First, a model is estimated using Ordinary Least Squares (OLS). A drawback of an OLS-estimated model is that it can potentially predict values outside the unit interval. To solve this, the dependent variable is transformed onto the real line through log-odds. This does, however, make interpretation of the results less intuitive. In addition, transformed proportion dependent variables often exhibit heteroskedasticity, which makes OLS a questionable choice. An alternative method is beta regression, which makes interpretation easier. The research field of beta regression is rather new, and a further purpose of this study is to add some meaningful input to the small collection of applied beta regression studies.

The disposition of the thesis is as follows. In section 2. Background, the Gini coefficient is defined and previous research about income distribution is reviewed. In section 3. Data, the variables chosen for the model are presented. The methods used for model estimation are described in section 4. Methodology. The results of the model estimation are found in section 5. Results. These results are discussed in section 6. Discussion. The reference list is presented in section 7. References. Lastly, section 8. Appendices consists of some raw output from RStudio.

2 Background

2.1 The Gini Coefficient

The Gini coefficient is an economic measure that can be used to describe the income distribution in a population. It is calculated from the Lorenz curve, which shows the percentage of total income received by a cumulative percentage of recipients. The income recipients are arranged from those with the lowest to those with the highest income. The curve must touch the 45-degree line at the lower left and the upper right corner: 0% of the income recipients must have 0% of the income, and the cumulative 100% must receive 100% of the income. In an economy with complete equality, the Lorenz curve would be the 45-degree line. In an economy with perfect inequality, one household would receive all income and the Lorenz curve would run along the bottom and right axes. In other words, if the area between the 45-degree line and the Lorenz curve (area A in figure 1) is large, the Gini coefficient is high, which means that inequality is large (Perkins et al. (2013)). See figure 1 for a visual representation of the Lorenz curve.

Figure 1: The Lorenz curve from Perkins et al. (2013)

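The Gini coefficient is defined as $\frac{A}{A+B}$, where A is the area between the 45-degree line and the Lorenz curve and B is the area under the Lorenz curve. It is thus a number bounded by the unit interval [0,1].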
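As a rough illustration of this definition, the following Python sketch builds the Lorenz curve from a list of incomes and measures the area B under it with the trapezoid rule, so that A = 0.5 − B. The function names and the toy income lists are assumptions made for this example only; they are not taken from Kolada or from the computations behind this thesis.

```python
def lorenz_points(incomes):
    """Return the Lorenz curve as (cumulative share of recipients,
    cumulative share of income) points, starting at (0, 0)."""
    ordered = sorted(incomes)          # lowest income first
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    cumulative = 0.0
    for i, income in enumerate(ordered, start=1):
        cumulative += income
        points.append((i / n, cumulative / total))
    return points


def gini(incomes):
    """Gini coefficient A / (A + B), with B the area under the Lorenz curve."""
    points = lorenz_points(incomes)
    area_b = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area_b += (x1 - x0) * (y0 + y1) / 2   # trapezoid rule
    area_a = 0.5 - area_b                      # area below the 45-degree line is 0.5
    return area_a / (area_a + area_b)


# Hypothetical toy populations
equal_incomes = [1, 1, 1, 1]      # complete equality
one_earner = [0, 0, 0, 1]         # one person receives everything
gini_equal = gini(equal_incomes)
gini_unequal = gini(one_earner)
```

With only four recipients the most unequal case gives (n − 1)/n rather than exactly 1; the bound 1 is approached as the population grows.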

The Gini coefficient is certainly not free from criticism. Naturally, a single-number measure does not contain the full picture. Two populations with very different distributions can have the same Gini coefficient, since the coefficient only measures the area and not the shape of the Lorenz curve. Moreover, the focus of the Gini coefficient is relative distribution, and two countries with different income levels can have the same relative distribution. Yet the Gini coefficient is a widely used measure. Ideally, it is used together with other measures that together can provide a fairer picture. But even alone it is not as useless as sometimes suggested. It is relatively easy to calculate and interpret, and it does give part of the picture.

2.2 Previous Research

In this section, previous research on the relationship between immigration and income distribution is reviewed. To successfully explain the effect immigration has on income distribution in Swedish municipalities, it is also necessary to control for other factors that can affect the Gini coefficient. Those factors are reviewed here as well.

2.2.1 Immigration and Income Distribution

A common explanation for income inequality throughout the years has been high immigration levels. Card (2009) formulates the main idea behind the theory that immigration drives up income inequality: in general, immigrants are less skilled and less adjusted to the domestic labour market, for instance in the sense that they may lack the language skills needed. Card (2009) concludes that immigration to U.S. cities has little impact on the native population, but when the immigrants themselves are counted in, inequality increases. This is because there is a concentration of immigrants in the far ends of the income distribution. According to his calculations, immigration caused 5% of the rise in wage inequality in the U.S. between 1980 and 2008.

Immigration adds low-skilled workers to the work force, as put by Reed (1999). It increases the relative supply of less skilled workers, and when demand does not increase at the same pace, the price of their labour inevitably goes down. Reed therefore theorises that wages fall when immigration increases. Her study shows that immigrants to California in general earn either very little or very much. According to her calculations, immigration explains 24% of the rise in California’s income inequality between 1967 and 1997.

Lemos (2014) finds that immigrants in the UK also tend to cluster at the top and bottom of the income distribution. Grundsten (2015) finds that in Sweden, immigration increases income inequality, but mainly due to a concentration of immigrants in the lower end of the income distribution. The author of that thesis theorises that differences between countries may be caused by different selections of immigrants.

2.2.2 Income Tax and Income Distribution

The income tax is a common tool that can be used to even out the income distribution. Sweden has a progressive income tax, which means that people with higher incomes pay a larger share in taxes. The tax revenue is then used for, among other things, the social welfare system. In other words, through the income tax, the state takes from the rich and gives to the poor. The income tax in Sweden consists of multiple parts. All individuals with a yearly income higher than 18 800 SEK in the income year 2020 pay tax to the municipality and region where they live. Individuals with a yearly income higher than 504 500 SEK in the income year 2020 also pay tax to the government. Because this thesis is focused on municipalities, income tax will from now on refer to the municipalities’ income tax rates.


In Joumard, Pisu and Bloch (2012) it is stated that taxes have a large redistributive effect. The Gini coefficient for OECD countries in the late 2000s was on average 25% lower (representing higher income equality) after taxes and transfers than before. In addition, they find that relative poverty was 55% lower after redistribution through taxes and transfers. These effects were particularly strong in the Nordic countries, including Sweden.

Palm and von Beckerath (2019) conducted a study of the relationship between the income tax and income distribution in Swedish municipalities. They find that a higher income tax in a municipality is negatively correlated with the Gini coefficient. That is, the income tax works as a redistributor in Swedish municipalities.

2.2.3 Level of Education and Income Distribution

Another factor that has been shown to explain income inequality is the level of education in the population. Higher education increases an individual’s productivity, and thus income. De Gregorio and Lee (2002) show that countries with a more equal distribution of education in general also have a more equal income distribution. They also find that countries with a higher average education level often have a more equal income distribution.

2.2.4 Unemployment Rate and Income Distribution

Mocan (1999) wrote,

"The consensus has been that income inequality is countercyclical in behavior, i.e., increases in unemployment worsen the relative position of low-income groups."

This idea is indeed widespread. Mocan’s study showed that an increase in structural unemployment substantially increased income inequality in the U.S. between 1948 and 1994, and many studies reach similar results. For example, Jäntti (1994) found statistically significant results supporting that higher unemployment in the U.S. reduces income equality.

Björklund (1991) finds evidence for this in Sweden. According to him, the reason is that people with low wages are more likely to become unemployed. They would then earn even less, and the income distribution would be stretched further.


3 Data

In this section, the database and the variables used for the analysis are presented.

3.1 Kolada

All data used for the analysis in this thesis is downloaded from Kolada. Kolada is a database that provides key figures for Swedish municipalities and regions. It is run by Rådet för främjande av Kommunala Analyser (RKA). RKA is a non-profit organisation created to facilitate collaboration between the Swedish state and The Swedish Association of Local Authorities and Regions (SALAR). The variables used for this report are compiled by Kolada from Statistiska centralbyrån (SCB) which is a Swedish government agency responsible for official statistics.

The data is from 2018, as this was the most recent year available. The data set consists of 290 observations, the number of municipalities in Sweden.

3.2 Variables

The Gini coefficient is used to measure the income distribution in Swedish municipalities. Here, the Gini coefficients are calculated from earned income at the individual level. The hypothesis is that immigrants earn less than natives because they are not as adjusted to the Swedish labour market. Therefore the Gini coefficient based on earned income is preferred over one based on, e.g., disposable income. The variable can take on values between 0 and 1, where 0 represents perfect equality and 1 perfect inequality. For the regression analysis, the Gini coefficient has been given the variable name Gini.

As a proxy for immigration, the share of foreign born between the ages of 18 and 64 is used. It is believed that foreign-born residents have had enough time to settle down. They are probably more likely, to some extent, to have decided where to live and to have approached the labour market than, for example, asylum seekers. Using the share with foreign background instead would include people born in Sweden with foreign-born parents in the analysis, but they are here believed to be better prepared for the labour market. Because the Gini coefficient is based on earned income, the population of working age is used. The variable is weighted by the population size and expressed as a share, and can thus only take on values between 0 and 1. This variable is called Im in the regression.

For income tax, this thesis will only use the part of the income tax decided by the municipalities. Some municipalities belong to the same region and therefore share the same region tax rate. The government income tax is the same for all municipalities. The income tax can also only take on values between 0 and 1. It has been given the name Tax.

For education level, the share of the population with secondary education as their highest level of education is used, rather than the share with elementary or post-secondary education as the highest level. Secondary school can be thought of as the middle level of education. Assuming that education on average increases income, having a large share with a medium level of education (i.e. secondary) should lead to a large share of people in the middle of the income distribution. This variable only includes people between the ages of 24 and 64. The reason 24 is used instead of 18 is presumably to give some time to finish post-secondary education. This variable can only take on values between 0 and 1 and is called Edu.

The share of the population between the ages of 24 and 64 in long-term unemployment is used as a variable for the unemployment rate. Long-term unemployment is here defined as being unemployed for at least 6 months. For clarity, the official definition of long-term unemployment in Sweden is 12 months, but this was unfortunately not available as a variable at the time of writing. Long-term unemployment is used in order to exclude those in short-term, temporary unemployment, who are unlikely to contribute to any changes in the income distribution in the long run. Again, only the part of the population of working age is included, as these are the people affecting the income distribution. This variable is bounded by the [0,1] interval and has been given the working name Ue.

4 Methodology

4.1 Ordinary Least Squares Regression

Ordinary Least Squares (OLS) is a method that can be used to estimate regression models. Assume the population regression equation $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + u$.

$Y$ is the dependent variable and $X_i$, $i = 1, 2, \dots, k$, are the independent variables, which are thought to help explain $Y$. The $\beta_i$ are population parameters that describe the relationship between $Y$ and $X_i$. $u$ is the disturbance term: $Y$ sometimes deviates from its expected value $E(Y)$ because of the disturbance. There are several possible reasons for this. For example, an influential explanatory variable might have been omitted from the model.

But it is time-consuming, expensive and often impossible to observe the whole population. Instead, a sample is used to estimate the population parameters. This gives the sample regression equation $Y = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_k X_k + \hat u$, where $\hat Y = \hat\beta_0 + \hat\beta_1 X_1 + \dots + \hat\beta_k X_k$ is the predicted value of $Y$, $\hat u$ is the residual and $\hat\beta_i$ is the regression coefficient. $\hat\beta_i$ can be interpreted as the average change in $Y$ when $X_i$ increases by one unit, all other regressors held constant.

The dependent variable Gini is a proportion variable, i.e. it lies on the standard unit interval [0,1]. This makes running an ordinary least squares regression a questionable choice, since such a model could potentially predict a value outside the unit interval. A common solution to this problem is to transform the proportion variable into a variable on the real line and then perform the linear regression. Here this is done through log-odds. The dependent variable will instead be

$$\log\left(\frac{Gini}{1 - Gini}\right).$$
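A minimal Python sketch of the log-odds transformation and its inverse may make the idea concrete. It uses the standard definition of log-odds, log(p / (1 − p)); the function names and the example value 0.40 are assumptions chosen for illustration and do not come from the data set.

```python
import math


def logit(p):
    """Map a proportion in the open interval (0, 1) to the real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("log-odds are only defined for 0 < p < 1")
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Map any real number back into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


gini_example = 0.40                 # hypothetical Gini coefficient
z = logit(gini_example)             # value on the real line used as the new response
recovered = inv_logit(z)            # back on the unit interval
```

Because `inv_logit` always returns a value strictly between 0 and 1, a prediction made on the log-odds scale can never fall outside the unit interval once it is transformed back.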


The method used to estimate the regression coefficients $\hat\beta_i$ here is OLS. This method works by minimizing the sum of the squared residuals, where the residuals are the differences between the observed values $Y$ and the estimated values $\hat Y$. By squaring them, their sign does not matter. In other words, OLS chooses the equation that fits closest to the actual $Y$ values. In the case of a single regressor, the slope coefficient can be written as

$$\hat\beta = \frac{\sum x_i y_i - \frac{1}{n}\sum y_i \sum x_i}{\sum x_i^2 - \frac{1}{n}\left(\sum x_i\right)^2},$$

where $x_i$ and $y_i$ are the observed values in the sample and $n$ is the sample size.
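The closed-form expression for a single regressor can be checked with a few lines of Python. This is only a sketch under the assumption of one explanatory variable; the function name and the toy data are made up for the example, and the intercept is obtained from the usual relation that the fitted line passes through the sample means.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x,
    using the sums-of-products formula for a single regressor."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    slope = (sum_xy - sum_y * sum_x / n) / (sum_xx - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n   # line passes through the means
    return intercept, slope


# Hypothetical data lying exactly on the line y = 1 + 2x
x_values = [1, 2, 3, 4]
y_values = [3, 5, 7, 9]
intercept, slope = ols_simple(x_values, y_values)
```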

OLS has several properties that make it a favourable choice. To begin with, if all assumptions are fulfilled, OLS produces unbiased estimators of $\beta_i$. This means that the expected value of the estimator is the true population parameter value: $E(\hat\beta_i) = \beta_i$. The OLS estimators are also the most efficient of all linear unbiased estimators. They are the Best Linear Unbiased Estimators; they are said to be BLUE.

To evaluate the goodness of fit of an OLS model, adjusted R² can be used. Adjusted R² is a coefficient of determination, originating from R². The total variation in Y (total sum of squares = TSS) is the sum of the variation that can be explained by the sample regression model (explained sum of squares = ESS) and the unexplained variation (residual sum of squares = RSS). That is,

$$TSS = ESS + RSS.$$

From this R² is derived. R² is the share of the variation in Y explained by the sample regression model:

$$R^2 = \frac{ESS}{TSS}.$$

One problem with R² is that adding explanatory variables never lowers it, so a model with a large number of explanatory variables will generate a higher R². This makes comparison of two models difficult. A simple model is often desirable, since a complicated model can cause problems with over-fitting: while complicated models perform well in-sample, they tend to predict badly out-of-sample.

Adjusted R² allows for comparison between two models, as it takes the number of independent variables (the degrees of freedom) into consideration:

$$\bar R^2 = 1 - \frac{RSS\,(n-1)}{TSS\,(n-k)},$$

where n is the sample size and k is the number of regressors. If the number of regressors is increased, k increases and RSS is reduced, so the unadjusted R² increases. But because RSS is divided by n − k, the increase in k offsets the fall in RSS.

R² takes on values between 0 and 1. If the model explains 0% of the variation in the dependent variable Y, R² is 0, and if it explains 100% of the variation, R² is 1 (Asteriou and Hall (2016)). Adjusted R² is read on the same scale; it is at most 1, although it can fall below 0 for a model with a very poor fit.
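The two measures can be sketched in Python as follows. The sketch computes R² as 1 − RSS/TSS, which equals ESS/TSS whenever TSS = ESS + RSS holds (as it does for OLS with an intercept), and takes k exactly as it enters the adjusted R² formula in this section. The function names and the toy numbers are assumptions for illustration only.

```python
def sums_of_squares(y, y_hat):
    """Return (TSS, RSS) for observed values y and fitted values y_hat."""
    mean_y = sum(y) / len(y)
    tss = sum((v - mean_y) ** 2 for v in y)
    rss = sum((v - f) ** 2 for v, f in zip(y, y_hat))
    return tss, rss


def r_squared(y, y_hat):
    """R^2 = 1 - RSS/TSS (equal to ESS/TSS when TSS = ESS + RSS)."""
    tss, rss = sums_of_squares(y, y_hat)
    return 1 - rss / tss


def adjusted_r_squared(y, y_hat, k):
    """Adjusted R^2 = 1 - RSS(n - 1) / (TSS(n - k)), with k as in the text."""
    n = len(y)
    tss, rss = sums_of_squares(y, y_hat)
    return 1 - rss * (n - 1) / (tss * (n - k))


# Hypothetical observed and fitted values
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, fitted)
r2_adj = adjusted_r_squared(observed, fitted, k=2)
```

In this toy case the adjusted value is lower than the unadjusted one, which is the penalty for the number of estimated terms.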


4.1.1 Diagnostics of Ordinary Least Squares Models

To perform an OLS regression it is important that all assumptions are fulfilled. The diagnostics run in this analysis are based on the assumptions as summarized by Asteriou and Hall (2016).

To begin with, it is assumed in OLS that the disturbance terms are independently and identically normally distributed, $u_t \sim N(0, \sigma^2)$, i.e. with mean $E(u_t) = 0$ and a common variance $\sigma^2$. If this is not fulfilled, the fitted equation $\alpha + \beta X + \dots$ cannot be interpreted as a statistical average relation. One way to check whether the disturbance terms are normally distributed is to examine the Quantile-Quantile (Q-Q) plot of the residuals. The Q-Q plot plots the quantiles of two distributions against each other to see if the distributions match. One can also perform a Shapiro-Wilk test of normality, where the hypotheses are H₀: the residuals are normally distributed, and H₁: the residuals are not normally distributed. The null hypothesis is rejected if the p-value is smaller than 0.05.
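To show what a Q-Q plot actually compares, the following Python sketch pairs the ordered, standardized residuals with the corresponding quantiles of the standard normal distribution. The plotting positions (i − 0.5)/n are one common convention and are an assumption here, as are the function name and the toy residuals; statistical software may use slightly different positions. The Shapiro-Wilk test is not implemented in this sketch.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical normal quantile, standardized sample quantile) pairs."""
    n = len(residuals)
    centre, spread = mean(residuals), stdev(residuals)
    sample_quantiles = sorted((r - centre) / spread for r in residuals)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample_quantiles))


# Hypothetical residuals from some fitted model
toy_residuals = [0.3, -1.2, 0.8, -0.1, 0.2]
points = qq_points(toy_residuals)
```

If the residuals are approximately normal, the points lie close to the 45-degree line when plotted.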

It is also assumed in OLS that there is no multicollinearity, that is, that there is no perfectly linear relationship between the explanatory variables. There ought to be little information about one variable within another. There are two types of multicollinearity. The first type is perfect multicollinearity, where one variable can be written as a linear function of another. The second type is imperfect multicollinearity, where two variables are correlated. Some degree of imperfect multicollinearity is quite common in regression. The estimators will still be BLUE, but while they still have the smallest variance of all linear unbiased estimators, the variance is larger than it would be without multicollinearity. To detect multicollinearity, the variance inflation factor (VIF) can be calculated. This is defined as

$$VIF_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination of the auxiliary regression of $X_j$ on the other independent variables in the original equation. If there is much inter-correlation among the regressors, $R_j^2$ will be high, causing a large variance of the OLS estimators and a large VIF value. A commonly used rule of thumb is that $VIF_j > 10$ is a sign of a problematic degree of multicollinearity.
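For intuition, the VIF can be sketched in Python for the special case of exactly two regressors. In that case the auxiliary regression of one regressor on the other has an R² equal to their squared Pearson correlation, so no matrix algebra is needed. This simplification, the function names and the toy data are assumptions for the example; with more regressors a full auxiliary regression would be required.

```python
import math


def pearson_r(a, b):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def vif_two_regressors(x1, x2):
    """VIF = 1 / (1 - R_j^2); with two regressors R_j^2 is the squared correlation."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical regressors
x1 = [1, 2, 3, 4]
x2_uncorrelated = [1, -1, -1, 1]   # correlation 0 with x1
x2_correlated = [2, 1, 4, 3]       # correlation 0.6 with x1
vif_low = vif_two_regressors(x1, x2_uncorrelated)
vif_higher = vif_two_regressors(x1, x2_correlated)
```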

Furthermore, OLS assumes homoskedasticity. The word homoskedasticity means equal spread, in contrast to heteroskedasticity, which means unequal spread. To assume homoskedasticity means to assume that all disturbance terms have the same variance, $V(u_i) = \sigma^2$, constant for all $i$. If a model has heteroskedasticity, the unbiasedness and consistency of the OLS estimators are unaffected, so the parameter values given by OLS will still be rather good. However, the spread of the distribution of the estimators will increase, making $\hat\beta$ inefficient. The presence of heteroskedasticity also affects the estimated variances of the estimators themselves. They will be underestimated, causing higher t- and F-statistics, which leads to less reliable hypothesis tests. To detect heteroskedasticity one can either visually inspect plots or perform a formal test.

To check for heteroskedasticity, a plot with the fitted values on the x-axis and the residuals on the y-axis is generated. An observation’s distance from 0 represents how badly the model predicted its value. If the observation is associated with a positive residual, i.e. the point in the plot is above the 0-line, the observed value is higher than the value predicted by the model. To have some inaccuracy in the model is normal and, frankly, good. A model that fits the data too well is said to be over-fitted and will generally perform badly on new data. The important thing is that the errors are not too big, which in the residual plot is illustrated by the residuals clustering around zero, and that the spread around zero is approximately the same for all predicted values, i.e. that the variance of the disturbance terms is equal (homoskedasticity). The spread of the residuals should also be random; if they form some kind of pattern, the relationship might be non-linear.

There are several ways of testing for heteroskedasticity formally. The test used in this report is the Breusch-Pagan (BP) Lagrange Multiplier test. The hypotheses are formulated as H₀: the model is homoskedastic, and H₁: the model is heteroskedastic. If the p-value is smaller than 0.05, the null hypothesis is rejected at the 5% significance level.
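The mechanics of such a test can be sketched in Python for a model with a single regressor. The sketch uses the studentized (Koenker) form of the statistic, LM = n · R², where R² comes from an auxiliary regression of the squared residuals on the regressor, and compares it with a chi-square distribution with one degree of freedom. That variant, the restriction to one regressor, the function names and the toy data are all assumptions of this example; the implementation used for the results in this thesis may differ in detail.

```python
import math


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared_of_line(x, y):
    """R^2 of the least-squares line of y on x (0 if y has no variation)."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    tss = sum((b - mean_y) ** 2 for b in y)
    if tss == 0:
        return 0.0
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return 1 - rss / tss


def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2))


def breusch_pagan_one_regressor(x, y):
    """Return (LM statistic, p-value) for H0: homoskedasticity."""
    intercept, slope = fit_line(x, y)
    squared_resid = [(b - intercept - slope * a) ** 2 for a, b in zip(x, y)]
    lm = len(x) * r_squared_of_line(x, squared_resid)   # auxiliary regression
    return lm, chi2_sf_1df(lm)


# Hypothetical data: the error spread grows with x in y_het but not in y_hom
x = list(range(1, 21))
y_het = [2 * i + (-1) ** i * i for i in x]
y_hom = [2 * i + (-1) ** i for i in x]
lm_het, p_het = breusch_pagan_one_regressor(x, y_het)
lm_hom, p_hom = breusch_pagan_one_regressor(x, y_hom)
```

In the first toy series the squared residuals grow with x, so the null hypothesis is rejected; in the second they do not, and it is not.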

4.2 Beta Regression

An alternative to performing an OLS regression with a transformed dependent variable is to run a beta regression. According to Cribari-Neto and Zeileis (2010), the OLS approach is not optimal, for several reasons. The estimated coefficients will be in terms of the transformed variable and are thus not as easily interpreted. Furthermore, the authors claim that such regression models often exhibit heteroskedasticity: the variances around the interval bounds are lower than they are around the middle. Lastly, proportion variables are seldom symmetrically distributed, which linear regression often assumes.

To solve this problem, Ferrari and Cribari-Neto (2004) proposed a model that assumes the dependent variable to be beta distributed. The beta distribution models probability and is therefore bounded by a unit interval. It is also a very flexible distribution, making it suitable for asymmetrical variables. The equation for the beta density is written,

$$\pi(y; p, q) = \frac{\Gamma(p+q)}{\Gamma(p)\Gamma(q)}\, y^{p-1}(1-y)^{q-1}, \tag{1}$$

where the dependent variable satisfies $0 < y < 1$, the parameters $p, q > 0$ decide the shape of the distribution, and $\Gamma(\cdot)$ is the gamma function. If $y$ is beta distributed, its mean is

$$E(y) = \frac{p}{p+q}$$

and its variance is

$$V(y) = \frac{pq}{(p+q)^2(p+q+1)}.$$

However, it is considered more convenient for the regression analysis to define the mean of the response and a precision parameter. The mean is defined as $\mu = \frac{p}{p+q}$ and the precision parameter as $\phi = p + q$. This gives $p = \mu\phi$ and $q = (1-\mu)\phi$. By plugging this into equation 1, the density of $y$ is defined as

$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\Gamma((1-\mu)\phi)}\, y^{\mu\phi-1}(1-y)^{(1-\mu)\phi-1}, \tag{2}$$

where the mean satisfies $0 < \mu < 1$ and the precision parameter $\phi > 0$. The mean of $y$ can then be written

$$E(y) = \mu$$

and the variance

$$V(y) = \frac{V(\mu)}{1+\phi} = \frac{\mu(1-\mu)}{1+\phi}.$$

The precision parameter measures how tightly the data cluster around the mean. It takes on a small value if much of the data lies close to 0 and 1. If the mean is fixed, the larger the precision parameter is, the smaller the variance of the dependent variable Y is.
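The link between the two parameterizations can be verified numerically with a short Python sketch. It converts a mean and precision into the shape parameters p and q and checks that the variance from the (p, q) formula agrees with µ(1 − µ)/(1 + φ). The function names and the values µ = 0.3 and φ = 10 or 100 are assumptions chosen for the example.

```python
def shape_from_mean_precision(mu, phi):
    """Return the shape parameters (p, q) with p = mu*phi and q = (1 - mu)*phi."""
    return mu * phi, (1.0 - mu) * phi


def beta_mean_variance(p, q):
    """Mean and variance of a beta distribution with shape parameters p and q."""
    mean = p / (p + q)
    variance = p * q / ((p + q) ** 2 * (p + q + 1.0))
    return mean, variance


def variance_from_mean_precision(mu, phi):
    """Variance in the mean/precision form: mu(1 - mu) / (1 + phi)."""
    return mu * (1.0 - mu) / (1.0 + phi)


# Hypothetical values: same mean, two different precisions
p_low, q_low = shape_from_mean_precision(0.3, 10.0)
p_high, q_high = shape_from_mean_precision(0.3, 100.0)
mean_low, var_low = beta_mean_variance(p_low, q_low)
mean_high, var_high = beta_mean_variance(p_high, q_high)
```

With the mean held at 0.3, raising the precision from 10 to 100 lowers the variance, in line with the description of the precision parameter.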

If the dependent variable is assumed to be beta distributed, a beta regression model can be formulated. $y_1, y_2, \dots, y_n$ are said to be independent random variables where each $y_i$ follows a beta density, $Beta(\mu_i, \phi)$. The beta regression model then assumes that the mean of the random variable can be represented by

$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} = \eta_i, \tag{3}$$

where $x_{i1}, \dots, x_{ik}$ are the independent variables for observation $i$, the $\beta_j$ are the regression coefficients and $\eta_i$ is the linear predictor for observation $i$. $g$ is a function that links the linear predictor and the mean of the response variable together. It is called the link function (Yellareddygari et al. (2014)). The link function ought to be a monotonic function that is twice differentiable and maps the unit interval onto the real line, so that $g(E(Y_i|X_i)) = g(\mu_i) = \eta_i$. It is rather common to use the logit link, $g(\mu_i) = \log\left(\frac{\mu_i}{1-\mu_i}\right)$.

The log-likelihood of the beta regression model is defined by Meaney and Moineddin (2014) as,

$$LL(\mu_i, \phi) = \log\Gamma(\phi) - \log\Gamma(\mu_i\phi) - \log\Gamma((1-\mu_i)\phi) + (\mu_i\phi - 1)\log(y_i) + ((1-\mu_i)\phi - 1)\log(1-y_i),$$

and $L(\beta, \phi) = \sum_{i=1}^{n} LL(\mu_i, \phi)$, where $\mu_i$ must be defined so that (3) holds. This can then be maximized. That is, beta regression is based on maximum likelihood estimation.
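The log-likelihood can be written out in Python with the log-gamma function from the standard library. The sketch below assumes the logit link and the mean/precision form of the beta density given in equation 2; it only evaluates the log-likelihood for given coefficients and does not perform the maximization. The function names, the coefficient layout (intercept first) and the toy inputs are assumptions for illustration.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link: maps the linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def beta_loglik_obs(y, mu, phi):
    """Log-density of one observation y under Beta(mu, phi), mean/precision form."""
    return (math.lgamma(phi)
            - math.lgamma(mu * phi)
            - math.lgamma((1.0 - mu) * phi)
            + (mu * phi - 1.0) * math.log(y)
            + ((1.0 - mu) * phi - 1.0) * math.log(1.0 - y))


def beta_loglik(beta, phi, rows, y):
    """Sum of the observation log-likelihoods.

    beta: [intercept, coefficient_1, ..., coefficient_k]
    rows: one list of k regressor values per observation
    """
    total = 0.0
    for row, y_i in zip(rows, y):
        eta = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        total += beta_loglik_obs(y_i, inv_logit(eta), phi)
    return total


# Hypothetical inputs: two observations, one regressor
toy_rows = [[0.10], [0.25]]
toy_y = [0.30, 0.45]
value = beta_loglik([-0.5, 1.0], 20.0, toy_rows, toy_y)
```

An optimizer would search over the coefficients and the precision for the values that make this sum as large as possible.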

A disadvantage of using OLS after transforming the dependent variable onto the real line was previously mentioned to be a more difficult interpretation. If the transformation is done by a logit link, the regression coefficients must be interpreted in terms of log-odds. This is because the transformation is of the form logit(Y). If a beta regression is performed, the results will still need to be interpreted in terms of log-odds. However, the transformation is done through the link function logit(E(Y)), which does not fundamentally change the data in the same way. Therefore the results from the beta regression can be transformed back for easier interpretation. In this paper, a method called fractional response is used to transform the results so that they are more easily interpreted. To use fractional response on a beta regression model, there must be no observations in the data with Y-values of exactly 0 or 1. The fractional response method uses average marginal effects, which here measure the elasticity, that is, how much Y changes on average if an explanatory variable increases by 1%, ceteris paribus. This provides the most meaningful interpretation, since all the independent variables also lie within the unit interval.


To evaluate the goodness of fit of the model estimated with beta regression, Ferrari and Cribari-Neto (2004) use the pseudo R² ($R_p^2$). This is defined as the squared sample correlation coefficient between the linear predictor and the link-transformed response, i.e. the squared correlation between $g(y)$ and $\hat\eta$. $R_p^2$ is bounded by the [0,1] interval. If $g(y)$ and $\hat\eta$ are in perfect agreement, $\hat\mu$ and $y$ are in perfect agreement and $R_p^2 = 1$.
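The pseudo R² described above can be sketched in Python as the squared correlation between the logit-transformed responses and the linear predictor. The logit link is an assumption of this sketch, as are the function names and the toy values; the linear predictor would in practice come from a fitted beta regression model.

```python
import math


def logit(p):
    """Logit link g(p) = log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def pearson_r(a, b):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def pseudo_r_squared(y, eta_hat):
    """Squared correlation between g(y) and the estimated linear predictor."""
    g_y = [logit(v) for v in y]
    return pearson_r(g_y, eta_hat) ** 2


# Hypothetical responses and linear predictor values
toy_y = [0.2, 0.3, 0.5, 0.6]
toy_eta = [-1.2, -0.9, 0.1, 0.3]
r2_pseudo = pseudo_r_squared(toy_y, toy_eta)
```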

4.2.1 Diagnostics of Beta Regression Models

Because the research field of beta regression is relatively new, there is, to the knowledge of the author of this thesis, little consensus about the fundamental assumptions of beta regression. The most important and obvious assumption is of course that the dependent variable is beta distributed. In Cribari-Neto and Zeileis (2010), diagnostics of a beta regression model are done by examining six plots. This approach is used in this thesis.

But first, it should be noted that Espinheira, Ferrari and Cribari-Neto (2008) defined the Pearson residual as the standardized weighted residual,

$$r_t = \frac{y_t - \hat\mu_t}{\sqrt{\hat V(y_t)}}.$$

This will be seen in the diagnostic plots.
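As a small numerical illustration, the residual can be computed in Python by plugging the estimated mean and precision into the beta variance µ(1 − µ)/(1 + φ) from section 4.2. Using that variance expression as the estimate of V(y_t) is an assumption of this sketch, and the function name and the numbers are made up for the example.

```python
import math


def pearson_residual(y, mu_hat, phi_hat):
    """(y - mu_hat) divided by the estimated standard deviation of y,
    using the beta variance mu(1 - mu) / (1 + phi)."""
    variance_hat = mu_hat * (1.0 - mu_hat) / (1.0 + phi_hat)
    return (y - mu_hat) / math.sqrt(variance_hat)


# Hypothetical observation with fitted mean 0.30 and estimated precision 20
residual = pearson_residual(0.35, 0.30, 20.0)
```

A positive value means that the observed value lies above the fitted mean, measured in estimated standard deviations.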

Now, to begin with, the residuals versus the indices of the observations are inspected. This plot shows the Pearson residual, which is the difference between the observed value and the fitted value, standardized by the estimated standard deviation of the observed value (Rodrígues (2020)). The Pearson residuals should not show any systematic pattern.

Cribari-Neto and Zeileis (2010) then look at the Cook’s distance plot. Cook’s distance measures how influential an observation is. An observation with a large residual, also known as an outlier, may negatively influence the accuracy of the regression model. As a rule of thumb, an observation with a Cook’s distance greater than 1 is considered a problem.

Further, the generalized leverage versus the predicted value is examined. According to Kim (2015), this plot also shows influential values. Here, one looks for values in the upper and lower right corners. Observations there could potentially be influential.

In the fourth plot the residuals are plotted against the linear predictor. This plot can be compared to the residual plot described in section 4.1.1. The distance from zero represents how far from the true value the observation was predicted by the model.

The half-normal plot of residuals works like the Q-Q plot described in section 4.1.1: if the residuals follow the line, they follow the proposed distribution rather well. However, the line here should not be thought of as the proposed distribution but instead as the proposed model. Ferrari and Cribari-Neto (2004) wrote that the distribution of the residuals of a beta regression model is unknown. The half-normal plot with a simulated envelope was believed to be helpful in deciding if the observed residuals followed the fitted model. Yang and Sun (n.d.) describe the procedure. First, simulate a sample of n observations, using the fitted model as the true model. Then, fit the proposed model to the simulated sample, and calculate the ordered absolute values of the residuals. Repeat this k times.

For each of the n sets of order statistics, calculate the average, minimum and maximum value over the k simulations. Lastly, one plots these values and the observed residuals of the original data set against the half-normal scores. It is the minimum and maximum values that form the envelope. If a large share of points are outside of the envelope, the adequacy of the model can be questioned.
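The envelope procedure can be sketched in a few lines of Python. To keep the sketch self-contained, several simplifications are assumed that are not part of the thesis: the model is an intercept-only beta model, it is fitted by moment estimates rather than maximum likelihood, standardized (Pearson-type) residuals are used, and the half-normal scores follow one common approximation. All function names and data are hypothetical.

```python
import math
import random
from statistics import NormalDist

def fit_intercept_only(y):
    """Moment estimates of mean and precision for an intercept-only beta model."""
    n = len(y)
    mu = sum(y) / n
    var = sum((v - mu) ** 2 for v in y) / (n - 1)
    phi = mu * (1.0 - mu) / var - 1.0
    return mu, phi

def sorted_abs_residuals(y, mu, phi):
    """Absolute standardized residuals in increasing order."""
    sd = math.sqrt(mu * (1.0 - mu) / (1.0 + phi))
    return sorted(abs((v - mu) / sd) for v in y)

def half_normal_scores(n):
    """Approximate expected half-normal order statistics (a common convention)."""
    return [NormalDist().inv_cdf((i + n - 0.125) / (2 * n + 0.5)) for i in range(1, n + 1)]

def simulated_envelope(y, k=19, seed=1):
    rng = random.Random(seed)
    n = len(y)
    mu_hat, phi_hat = fit_intercept_only(y)
    observed = sorted_abs_residuals(y, mu_hat, phi_hat)
    sims = []
    for _ in range(k):
        # Simulate a sample from the fitted model, treated as the true model.
        y_sim = [rng.betavariate(mu_hat * phi_hat, (1.0 - mu_hat) * phi_hat) for _ in range(n)]
        # Refit the same model to the simulated sample and store the ordered residuals.
        mu_s, phi_s = fit_intercept_only(y_sim)
        sims.append(sorted_abs_residuals(y_sim, mu_s, phi_s))
    # For each order statistic, summarize over the k simulations.
    lower = [min(s[i] for s in sims) for i in range(n)]
    upper = [max(s[i] for s in sims) for i in range(n)]
    average = [sum(s[i] for s in sims) / k for i in range(n)]
    return observed, lower, average, upper

# Hypothetical data: 30 simulated proportions standing in for an observed sample.
data_rng = random.Random(0)
y_obs = [data_rng.betavariate(40.0, 60.0) for _ in range(30)]
observed, lower, average, upper = simulated_envelope(y_obs)
scores = half_normal_scores(len(y_obs))
```

Plotting `observed`, `lower`, `average` and `upper` against `scores` gives the half-normal plot. Observed points falling outside the band between `lower` and `upper` are the ones that call the model into question.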

The last diagnostic plot shows how the deviance residuals vary over the observations. The deviance of a model is

2(LL(Saturated model) − LL(Proposed model)),

where LL is an abbreviation of log-likelihood. The saturated model is an interpolating model that fits the data points perfectly. It has the maximum number of parameters, and therefore generates the highest possible likelihood of the data. The deviance is thus two times the difference between the log-likelihood of the data given the saturated model and the log-likelihood of the data given the proposed model. The deviance residual of an observation is the signed square root of that observation's contribution to the deviance (Ferrari and Cribari-Neto, 2004).

If the proposed model is as over-fitted as the saturated model, all residuals lie along the 0-line in the residual plot. What one looks for in this plot is randomness. If the residuals exhibit some sort of non-random pattern the researcher ought to become wary. This may be a sign of bias or heteroskedasticity.


5 Results

5.1 Ordinary Least Squares Estimation

The relationship between the Gini coefficient, immigration and the three control variables income tax, education level and unemployment rate is here thought to be linear, which makes OLS a logical choice. However, because the Gini coefficient lies on a unit interval, a problem arises. A model estimated with OLS does not take the interval into consideration, and could potentially predict a value outside [0,1]. To solve this, the variable Gini is transformed into a variable on the real line. Here the logit, which is a commonly used transform, is used.

$$ \mathit{Gini}^{*} = \log\left(\frac{\mathit{Gini}}{1 - \mathit{Gini}}\right) $$
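The transformation and its inverse can be illustrated with a short Python sketch. The Gini value used is hypothetical, and the function names are chosen for illustration only.

```python
import math

def logit(p):
    """Map a proportion in (0, 1) to the real line."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    """Map a value on the real line back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

gini = 0.30                      # hypothetical Gini coefficient
gini_star = logit(gini)          # transformed value, negative because gini < 0.5
recovered = inverse_logit(gini_star)
```

Because the inverse logit always returns a value strictly between 0 and 1, predictions made on the transformed scale can never fall outside the unit interval once they are mapped back.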

To estimate the regression coefficients ˆβ_i OLS is applied. The results are displayed in table 1.

Table 1: OLS-estimates of the model.

Coefficient               Estimate   Standard Error   P-value
Intercept                  0.32093          0.10416   0.00226
Immigration (Im)           0.28336          0.10036   0.00508
Income tax (Tax)          -1.67840          0.51315   0.00120
Education level (Edu)     -1.22029          0.09386   2e-16
Unemployment rate (Ue)     2.13713          0.41553   5.03e-07

All the regressors have p-values smaller than 0.01, making them statistically significant at the 1% significance level. A few of the variables have quite large standard errors, so the estimates are perhaps not as precise as one would wish. Using the estimated regression coefficients, the sample regression equation becomes:

$$ \mathit{Gini}^{*} = 0.32093 + 0.28336\,\mathit{Im} - 1.67840\,\mathit{Tax} - 1.22029\,\mathit{Edu} + 2.13713\,\mathit{Ue} + u $$

As Cribari-Neto and Zeileis (2010) wrote, a model with a transformed dependent variable is often difficult to interpret. This is also the case for the OLS model here. It has to be interpreted in terms of the change in the log-odds of the Gini coefficient, which is not very intuitive.

If all other regressors are held constant (ceteris paribus), a one unit increase in the share of foreign born in a municipality (Im) will on average increase the log-odds of the Gini coefficient by 0.28336 units. In other words, in general a high level of immigration has a positive effect on the Gini coefficient. Because a Gini coefficient of 1 represents perfect inequality, this means that higher immigration on average increases income inequality.


Ceteris paribus, a one unit increase in the income tax (Tax) lowers the log-odds of the Gini coefficient by 1.67840 units on average. So the relationship between income tax and the Gini coefficient is negative. A high income tax evens out the income distribution.

The relationship between the education level (Edu) and the Gini coefficient is also negative. A one unit increase in the education level decreases the log-odds of the Gini coefficient by, on average, 1.22029 units, ceteris paribus. A higher education level of the population does on average increase income equality.

The unemployment rate (Ue) is positively correlated with the Gini coefficient. Ceteris paribus, a one unit increase in the unemployment rate increases the log-odds of the Gini coefficient by, on average, 2.13713 units. So a high unemployment rate in a municipality generally implies greater income inequality.
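To give a feeling for what a change in log-odds means on the original scale, the following Python sketch maps a small change in Im back to the Gini scale. The baseline Gini value and the size of the change are hypothetical. It is also assumed here that Im is measured as a share between 0 and 1. Note that back-transforming a prediction of the log-odds gives only an approximation of the expected Gini, so the numbers are illustrative.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

beta_im = 0.28336      # OLS coefficient on Im from table 1
baseline_gini = 0.30   # hypothetical starting value of the Gini coefficient
delta_im = 0.01        # hypothetical increase in Im (one percentage point)

# Exact change: shift the log-odds and map back to the unit interval.
new_gini = inverse_logit(logit(baseline_gini) + beta_im * delta_im)
exact_change = new_gini - baseline_gini

# First-order approximation: derivative of the inverse logit is G(1 - G).
approx_change = beta_im * delta_im * baseline_gini * (1.0 - baseline_gini)
```

The change on the Gini scale depends on the baseline value through the factor G(1 − G), which is why a single log-odds coefficient does not translate into a single effect on the Gini coefficient.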

The adjusted R² is, ¯R²= 0.6352.

5.1.1 Diagnostics of the Ordinary Least Squares Estimated Model

To make sure the results from section 5.1 are valid, some diagnostics of the OLS estimated model are run.

Figure 2: Normal QQ-plot of the OLS estimated model. [Plot not reproduced: Normal Q-Q Plot, sample quantiles against theoretical quantiles.]

Figure 2 shows the normal QQ-plot of the model estimated with OLS. If the residuals were perfectly normally distributed they would follow the green straight line. In figure 2, the residuals deviate from the line. The formal test of the residuals’ normality gives a p-value of 2.28e-11 (see Appendix). At the 1% significance level the null hypothesis that the residuals are normally distributed is rejected in favour of the alternative. This means there may be trends in the data not captured by the model, and that the estimated relationship cannot strictly be considered a statistical average relation.
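The points of a normal QQ-plot can be constructed by hand, which clarifies what the plot compares. The Python sketch below uses hypothetical residuals and the plotting positions (i − 0.5)/n, which is one common convention. Statistical software may use slightly different positions.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Pair each ordered standardized residual with its theoretical normal quantile."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    sample = sorted((r - m) / s for r in residuals)
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Hypothetical residuals with one large positive value (not thesis data).
residuals = [-0.12, -0.05, -0.03, -0.01, 0.00, 0.02, 0.04, 0.45]
points = qq_points(residuals)
```

If the residuals were normally distributed, the pairs would lie close to the 45-degree line. A heavy right tail, as in the hypothetical data, shows up as the last point lying far above that line.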

To test if there is problematic multicollinearity in the model, the VIF-values are calculated.

Table 2: VIF-values for the independent variables.

Variable                  VIF-value
Immigration (Im)           1.935899
Income tax (Tax)           1.747169
Education level (Edu)      1.808246
Unemployment rate (Ue)     1.672560

None of the VIF-values are greater than 10, and the analysis can proceed without worry about multicollinearity.
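Since the VIF of a regressor equals 1/(1 − R²_j), where R²_j is the coefficient of determination from regressing that regressor on the other regressors, the values in table 2 can be translated back into R²_j. The Python sketch below does this. The dictionary and variable names are chosen for illustration.

```python
# VIF-values as reported in table 2.
vif = {"Im": 1.935899, "Tax": 1.747169, "Edu": 1.808246, "Ue": 1.672560}

def implied_r_squared(vif_value):
    """Invert VIF = 1 / (1 - R^2) to obtain the auxiliary R^2."""
    return 1.0 - 1.0 / vif_value

implied = {name: implied_r_squared(value) for name, value in vif.items()}

# A VIF of 10, the rule-of-thumb threshold, corresponds to an auxiliary R^2 of 0.9.
threshold_r_squared = implied_r_squared(10.0)
```

Read this way, each regressor has less than half of its variation explained by the other regressors, which is far below the 90% implied by the threshold of 10.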

Figure 3: Plot of fitted values and residuals of the OLS estimated model. [Plot not reproduced: residuals against fitted values.]

Figure 3 shows a plot with the fitted values on the x-axis and the residuals on the y-axis. There is no clear pattern between the fitted values and the residuals, which indicates that the relationship between the dependent variable and the independent variables is linear.

The observations form a faint funnel shape, i.e. the variance appears to be unequal. There is reason to suspect the assumption of homoskedasticity is violated. To test this formally the Breusch-Pagan test is performed. This generates a p-value much smaller than 0.05 (see Appendix), and the null hypothesis of homoskedasticity is rejected in favour of the alternative of heteroskedasticity.
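The p-value of the Breusch-Pagan test follows from comparing the test statistic with a chi-square distribution. With the statistic and the four degrees of freedom reported in Appendix D, the p-value can be checked by hand, because the upper tail of a chi-square distribution with 4 degrees of freedom has the closed form exp(−x/2)(1 + x/2). A minimal Python sketch, with a hypothetical function name:

```python
import math

def chi2_upper_tail_df4(x):
    """Upper-tail probability of a chi-square variable with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

bp_statistic = 28.744                      # BP statistic reported in Appendix D
p_value = chi2_upper_tail_df4(bp_statistic)
reject_homoskedasticity = p_value < 0.05
```

The resulting p-value agrees with the one reported in Appendix D up to rounding of the test statistic.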


5.2 Beta Regression Estimation

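Here, the equation

$$ g(\mu_i) = \beta_0 + \beta_1 \mathit{Im} + \beta_2 \mathit{Tax} + \beta_3 \mathit{Edu} + \beta_4 \mathit{Ue} = \eta_i $$

is estimated using beta regression. The link function that is used is the logit, $g(\mu_i) = \log\left(\frac{\mu_i}{1 - \mu_i}\right)$.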

Table 3: Beta regression estimates of the model.

Coefficient               Estimate   Standard Error   P-value
Intercept                  0.30348          0.10511   0.00389
Immigration (Im)           0.45074          0.08539   1.30e-07
Income tax (Tax)          -1.61283          0.52292   0.0020
Education level (Edu)     -1.2567           0.09856   2e-16
Unemployment rate (Ue)     1.33868          0.31369   1.98e-05

The results of the beta regression are presented in table 3. All of the p-values are smaller than 0.01, which makes the estimated coefficients statistically significant at the 1% level. The standard errors are fairly large for Tax and Ue, so these regression coefficients are perhaps not as precise as one would hope. The estimated precision parameter is φ = 487.68 (see Appendix). This value is considered rather large, i.e. the variance of the data around the mean is small.
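To see why a large precision implies a small variance, recall that under the mean-precision parameterization of the beta distribution the variance is a decreasing function of φ. The following worked example assumes that parameterization and uses an illustrative mean of 0.4, which is a hypothetical value and not an estimate from the thesis:

$$ \operatorname{Var}(y) = \frac{\mu(1-\mu)}{1+\varphi}, \qquad \frac{0.4 \times 0.6}{1 + 487.68} = \frac{0.24}{488.68} \approx 0.00049, \qquad \sqrt{0.00049} \approx 0.022. $$

With these values, the standard deviation of the Gini coefficient around its conditional mean would be roughly 0.02.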

Using the estimate values in table 3, the estimated sample regression equation becomes:

$$ g(\mu_i) = 0.30348 + 0.45074\,\mathit{Im} - 1.61283\,\mathit{Tax} - 1.25673\,\mathit{Edu} + 1.33868\,\mathit{Ue} = \eta_i. $$

This model has to be interpreted in terms of log-odds. By applying a fractional response procedure to the results above, the average marginal effects are obtained, which give easier interpretations of the model. These values are discussed below. The full output of the fractional response procedure can be found in the Appendix.
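The estimated equation can be used to predict the mean Gini coefficient of a municipality by evaluating the linear predictor and applying the inverse of the logit link. In the Python sketch below the coefficients are those of the estimated equation, while the regressor values describe a hypothetical municipality. It is assumed, for illustration, that all regressors are measured as shares between 0 and 1.

```python
import math

# Coefficients of the estimated beta regression equation.
coef = {"Intercept": 0.30348, "Im": 0.45074, "Tax": -1.61283,
        "Edu": -1.25673, "Ue": 1.33868}

# Hypothetical municipality (values are assumptions, not thesis data).
municipality = {"Im": 0.15, "Tax": 0.32, "Edu": 0.45, "Ue": 0.07}

# Linear predictor eta = beta_0 + sum of beta_j * x_j.
eta = coef["Intercept"] + sum(coef[name] * value for name, value in municipality.items())

# The inverse logit link maps the linear predictor to the predicted mean.
mu_hat = 1.0 / (1.0 + math.exp(-eta))
```

The predicted mean always lies inside the unit interval, regardless of the regressor values.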

If the immigration (Im) in a Swedish municipality increases by 1 percentage point, with all other regressors held constant, the Gini coefficient will on average increase by 0.1046 percentage points. So, the beta regression estimated model shows, just like the OLS estimated model, that immigration has a positive effect on the Gini coefficient, which means that immigration has a negative effect on income equality. Higher immigration generally corresponds to a more unequal income distribution.

A 1 percentage point increase in the income tax (Tax) decreases the Gini coefficient by on average 0.3743 percentage points, ceteris paribus. So the model shows that a higher income tax, on average, evens out the income distribution in Swedish municipalities.

A higher education level in a municipality generally lowers the Gini coefficient too. For a 1 percentage point increase in the education level, the Gini coefficient on average decreases by 0.2917 percentage points, ceteris paribus. In other words, if a large share of the population has a high school education as their highest education level, the income equality in the municipality is generally higher.

Lastly, the beta regression model shows that if the unemployment rate increases by 1 percentage point, the Gini coefficient increases by on average 0.3107 percentage points, ceteris paribus. So a higher unemployment rate generally increases income inequality.
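Under a logit link, the marginal effect of a regressor on the mean is β_j µ(1 − µ), so the average marginal effect equals β_j multiplied by the sample average of ˆµ(1 − ˆµ). That common factor should therefore be recoverable from any regressor by dividing its average marginal effect by its coefficient. The Python sketch below performs this consistency check with the values reported in table 3 (with the Edu coefficient taken from the estimated equation) and in the Appendix. The variable names are illustrative.

```python
# Beta regression coefficients (absolute values are not needed, signs cancel in the ratio).
beta = {"Im": 0.45074, "Tax": -1.61283, "Edu": -1.25673, "Ue": 1.33868}

# Average marginal effects from the fractional response output in the Appendix.
ame = {"Im": 0.1046, "Tax": -0.3743, "Edu": -0.2917, "Ue": 0.3107}

# Implied average of mu_hat * (1 - mu_hat) for each regressor.
scale = {name: ame[name] / beta[name] for name in beta}
spread = max(scale.values()) - min(scale.values())
```

The four ratios agree to about three decimal places at roughly 0.232, which is consistent with the average marginal effects being the coefficients scaled by a common factor.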

The pseudo R² for the beta regression estimated model is, R²_p = 0.6294.

5.2.1 Diagnostics of the Beta Regression Estimated Model

In this section some diagnostics of the model estimated with beta regression are run and analysed.

Figure 4: Diagnostic Plots of the Beta Regression Estimated Model. [Plots not reproduced. The six panels are: Residuals vs indices of obs. (Pearson residuals against observation number); Cook's distance plot (Cook's distance against observation number); Generalized leverage vs predicted values; Residuals vs linear predictor (Pearson residuals against the linear predictor); Half-normal plot of residuals (absolute deviance residuals against normal quantiles); Residuals vs indices of obs. (deviance residuals against observation number).]


In figure 4 the diagnostic plots of the beta regression model are presented.

The first one shows the Pearson residuals plotted against the observation number. No clear pattern between the two is visible.

The Cook’s distance plot shows no observation with a Cook’s distance higher than 1. There is therefore no reason to suspect that there is an influential observation.

Influential observations can also be detected in the third plot by looking for points in the far right corners. There are a few observations close to the bottom right corner of the plot. Where the corners start is, however, arbitrary. Because the Cook’s distance plot shows no influential observations, the analysis is continued without worry.

In the next plot, the Pearson residuals are plotted against the linear predictor. There is no clear pattern to indicate that there is a relationship not picked up by the fitted beta regression model. There is, however, a faint funnel shape. This could indicate heteroskedasticity. Heteroskedasticity is rather common in beta regression.

Since the variance is a function of the mean, some heteroskedasticity is natural and not necessarily a problem. If the heteroskedasticity is not due to this, but for example to an omitted variable, and it is nevertheless explained away, problems could occur. The heteroskedasticity here will be assumed to be due to the nature of the beta distribution.

The half-normal plot of residuals shows most of the deviance residuals inside the envelope. This is a sign of a rather good fit of the estimated model.

Lastly, the deviance residuals are plotted against the observation number. The deviance residuals look to be fairly equally spread across the observation numbers.

To conclude, the diagnostic plots do not give any particular cause for concern. In the plot of the generalized leverage against the predicted values, there might be signs of influential observations. But because the Cook’s distance plot does not support this, the analysis proceeds. The model also seems to exhibit some heteroskedastic behaviour; however, this is for now believed to simply be a consequence of the design of the beta distribution.


6 Discussion

To model the impact of immigration on the income distribution in Swedish municipalities, two approaches have been taken. Firstly the model was estimated with OLS, and secondly with beta regression. Both models show that immigration has a statistically significant positive effect on the Gini coefficient, which is used as a measure of income inequality. All other regressors used to control the effect of immigration are also statistically significant. It is difficult to compare the regression coefficients of the two models. The regression coefficients of the OLS model have to be interpreted as the average change in the log-odds of Gini for a one unit change in one of the regressors, all other regressors held constant, whereas the beta regression model has been chosen to be interpreted through average marginal effects: if one regressor increases by 1 percentage point, the dependent variable Gini on average changes by the average marginal effect, measured in percentage points, ceteris paribus.

The goodness of fit measurements for the two models are not comparable either as they are calculated from different formulas. What can be said is that both models generate rather high values of their coefficients of determination.

The OLS-model’s adjusted R² is 0.6321 and the beta regression model’s pseudo R² is 0.6371. So both models explain a fair share of the variation in Gini in the data.

In this study beta regression is considered superior, mainly because of its easier interpretation. In previous research it has been argued that it is questionable to run an OLS regression when the dependent variable is on a unit scale, since the estimated model could then potentially predict values outside the interval. To remedy this the dependent variable is often transformed on to the real line. This can be done with the logit, as was done in this thesis. It then entails that the results must be read in terms of log-odds, which is not very intuitive. An alternative to this approach is to run a beta regression instead.

A transformation is then done too. In this analysis this was done through the logit link. The difference is however that the transformation in beta regression is applied to the expected value, logit(E(Y)), instead of to the observations, logit(Y). That enables the fitted mean to be transformed back for easier interpretation. Here it was thought most meaningful to read the results in terms of average marginal effects. But it would have been possible to transform the results back to proportions too, and read them as: for a one unit increase in X_i, ceteris paribus, Gini changes by, on average, β_i. This easier interpretation is a great advantage of beta regression.
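The difference between modelling logit(E(Y)) and modelling logit(Y) can be illustrated by simulation. Because the logit is a non-linear function, the average of logit(Y) mapped back to the unit interval does not equal the average of Y. The Python sketch below uses simulated beta-distributed values with arbitrarily chosen shape parameters. Nothing in it is taken from the thesis data.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def inverse_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

rng = random.Random(42)
# Simulated proportions from a Beta(2, 6) distribution, whose mean is 0.25.
y = [rng.betavariate(2.0, 6.0) for _ in range(20000)]

mean_y = sum(y) / len(y)                           # estimate of E(Y)
mean_logit = sum(logit(v) for v in y) / len(y)     # estimate of E(logit(Y))
back_transformed = inverse_logit(mean_logit)       # not an estimate of E(Y)
```

In this example the back-transformed average of the logits is clearly smaller than the average of Y itself. This is the reason why predictions from the OLS model on the transformed scale cannot simply be mapped back and read as expected proportions, whereas the fitted mean of a beta regression can.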

OLS is often used because of its attractive characteristics. If the model is homoskedastic, all error terms are uncorrelated and the mean of the error terms is zero, the OLS estimators are the Best Linear Unbiased Estimators (BLUE). Best refers to the smallest variance. That they are unbiased means that the expected value of the coefficient estimator is the true value. This is of course a very desirable property. The problem is that when the dependent variable is a transformed proportion variable, the model often exhibits heteroskedasticity. Heteroskedasticity makes the estimates less precise and thus no longer the best. The attractive properties of OLS therefore often do not hold when proportions are modeled. Models estimated with beta regression are seldom homoskedastic either, since the variance is a function of the mean. Typically, this is not considered a big problem for the quality of the estimates. In conclusion, the beta regression estimates are not necessarily better than the OLS estimates. But the OLS estimates are often not BLUE and therefore not necessarily better either.

Concerning the relationship between immigration and income distribution, previous research has established a relationship between higher immigration and income inequality. To the knowledge of this author, no research has been done on Swedish municipalities. This study concludes that there is a positive correlation between the share of foreign born of working age and the Gini coefficient. In other words, higher immigration, on average, corresponds to higher income inequality in the municipalities.

It has previously been argued that immigrants add to the lower skilled workforce and increase the income inequality. While this theory does seem reasonable, this study does not go as far as to investigate the reasons behind the relationship. The results are not necessarily caused by a concentration in the lower end of the income distribution either. Possibly, municipalities with higher immigration have a larger share of their population in the richer part of the income distribution. It should also be noted that the direction of the relationship is unknown: there might be factors of integration that make immigrants concentrate in municipalities with higher income inequality.

Further research in this field is needed. Income distribution in itself is considered an important topic by many. Regardless of one's political views on how equally the income of a society should be distributed, it is of interest to know what drives income inequality, in order to know how the goals of the policy makers can be achieved. Immigration is also a constantly current topic, not least since the European Migrant Crisis of 2015. It is of importance to understand the effects that immigration and income distribution have on society as fully as possible. For example, it can be interesting for future research to look into the forces behind the relationship between income distribution and immigration.


References

Asteriou, Dimitrios. Hall, Stephen G. (2016). Applied Econometrics. 3rd edition. London: Palgrave.

Björklund, Anders. (1991). Unemployment and Income Distribution: Time-Series Evidence from Sweden. The Scandinavian Journal of Economics, Vol. 93. No. 3, pp. 457-465.

Borjas, George J. Freeman, Richard B. Katz, Lawrence F. DiNardo, John. Abowd, John M. (1997). How Much Do Immigration and Trade Affect Labor Market Outcomes?. Brookings Papers on Economic Activity. Vol. 1997, No. 1, pp. 1-90.

Card, David. (2009). Immigration and Inequality. The American Economic Review. Vol. 99, No. 2, pp. 1-21.

Cook’s Distance. (2020). Wikipedia. Available at: https://en.wikipedia.org/wiki/Cook%27s_distance. Accessed at: 27 December 2020.

Cribari-Neto, Francisco. Zeileis, Achim. (2010). Beta Regression in R. Journal of Statistical Software. Vol. 34, No. 2, pp. 1-24.

De Gregorio, José. Lee, Jong-Wha. (2002). Education and Income Inequality: New Evidence From Cross-Country Data. Review of Income and Wealth. Vol. 48, No. 3, pp. 395-416.

Espinheira, Patrícia L. Ferrari, Silvia L.P. Cribari-Neto, Francisco. (2008). On beta regression residuals. Journal of Applied Statistics, Vol. 34, No. 4, pp. 407-419.

Ferrari, Silvia L.P. Cribari-Neto, Francisco. (2004). Beta Regression for Modelling Rates and Proportions. Journal of Applied Statistics. Vol. 31, No. 7, pp. 799-815.

Grundsten, Ronja. (2015). Immigration and Income Inequality in Sweden. Linnaeus University. Småland.

Hayes, Andrew F. Cai, Li. (2007). Using heteroskedasticity-consistent standard error estimators in OLS regression: An introduction and software implementation. Behavior Research Methods, Vol. 39, No. 4, pp. 709-722.

Hurdel models. (n.d.). STATA. Available at: https://www.stata.com/stata14/fractional-outcome-models/. Accessed at: 1 January 2021.

Lemos, Sara. (2014). The immigrant-native earnings gap across the earnings distribution. Applied Economics Letters. Vol. 22, No. 5, pp. 361-369.

Palm, Frida. von Beckerath, Maja. Skattens effekt på inkomstojämlikhet. Uppsala University. Uppsala. Available at: http://uu.diva-portal.org/smash/get/diva2:1334531/FULLTEXT01.pdf

Perkins, Dwight H. Radelet, Steven. Lindauer, David L. Block, Steven A. (2013). Economics of Development. 7th edition. New York: W.W. Norton & Company Inc.

Piketty, Thomas. (2020). Kapitalet och Ideologin. Stockholm: Mondial.

Reed, Deborah. (1999). California’s Rising Income Inequality: Causes and Concerns. Public Policy Institute of California.

Rodríguez, Germán. (2020). 3.8 Regression Diagnostics for Binary Data. Generalized Linear Models. [https://data.princeton.edu/wws509/notes/c3s8].

Joumard, Isabelle. Pisu, Mauro. Bloch, Debbie. (2012). Tackling income inequality: The role of taxes and transfers. OECD Journal: Economic Studies. Published online first. http://dx.doi.org/10.1787/eco_studies-2012-5k95xd6l65lt

Jäntti, Markus. (1994). A More Efficient Estimate of the Effects of Macroeconomic Activity on the Distribution of Income. The Review of Economics and Statistics. Vol. 76, No. 2, pp. 372-378.

Kim, Bommae. (2015). Understanding Diagnostics Plots for Linear Regression Analysis. University of Virginia Library: Research Data Services + Sciences. Available at: https://data.library.virginia.edu/diagnostic-plots/.

Meaney, Christopher. Moineddin, Rahim. (2014). A Monte Carlo simulation study comparing linear regression, beta regression, variable-dispersion beta regression and fractional logit regression at recovering average difference measures in a two sample design. BMC Medical Research Methodology, Vol. 14, No. 1.

Mocan, H. Naci. (1999). Structural Unemployment, Cyclical Unemployment, and Income Inequality. The Review of Economics and Statistics. Vol. 81, No. 1, pp. 122-134.

Skatt i Sverige. (2020). Wikipedia. Available at: https://sv.wikipedia.org/wiki/Skatt_i_Sverige#Skatter_i_Sverige. Accessed at: 22 December 2020.

Q-Q Plot. (2020). Wikipedia. Available at: https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot. Accessed at: 25 December 2020.

Yang, Zhao. Sun, Xuezheng. (n.d.). Half-normal Plot for Zero-inflated Binomial Regression. University of South Carolina. Columbia.

Yellareddygari, Shashi. Sherman Pasche, Julie. Raymond, J. Taylor. Hua, Su. Gudmestad, Neil. (2015). Beta Regression Model for Predicting the Development of Pink Rot in Potato Tubers During Storage. Plant Disease, Vol. 100, No. 6, pp. 1118-1124.


Appendices

A Ordinary Least Squares Estimation

##
## Call:
## lm(formula = Gini_star ~ Im + Tax_2 + Edu + Unem, data = data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.24158 -0.05235 -0.00627  0.04654  0.50993
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.28007    0.10509   2.665  0.00814 **
## Im           0.46747    0.08656   5.401 1.40e-07 ***
## Tax_2       -1.54804    0.51978  -2.978  0.00315 **
## Edu         -1.24363    0.09920 -12.536  < 2e-16 ***
## Unem         1.30916    0.31401   4.169 4.06e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09415 on 285 degrees of freedom
## Multiple R-squared: 0.6295, Adjusted R-squared: 0.6243
## F-statistic: 121 on 4 and 285 DF, p-value: < 2.2e-16

B Test of Normality in Residuals

##
## Shapiro-Wilk normality test
##
## data: sresid
## W = 0.9342, p-value = 4.761e-10

C Test of Multicollinearity

##       Im    Tax_2      Edu     Unem
## 1.398181 1.740362 1.961012 1.509766

D Test of Homoskedasticity

##
## studentized Breusch-Pagan test
##
## data: m1_star
## BP = 28.744, df = 4, p-value = 8.812e-06


E Fractional Response

## factor     AME     SE        z      p   lower   upper
## Edu    -0.2917 0.0228 -12.7788 0.0000 -0.3364 -0.2469
## Im      0.1046 0.0198   5.2807 0.0000  0.0658  0.1434
## Tax_2  -0.3743 0.1214  -3.0846 0.0020 -0.6122 -0.1365
## Unem    0.3107 0.0728   4.2685 0.0000  0.1680  0.4534
