
Title: Analysis of pupils’ commuting patterns: from individual to regional levels in Sweden

Author: Walaa Qabaha (Date of birth:90-06-01)

Autumn 2019

Course name: Independent project I, 15 credits

Subject: Statistics

Örebro University School of Business

Supervisor: Yuli Liang

Contents

1. Introduction
1.1 Background
1.2 Aim
2. Data
2.1 Data description
2.2 Data processing
3. Theory and method
3.1 Logistic regression model
3.2 Model performance
3.3 Classification: Sensitivity, specificity and ROC curve
4. Results
5. Conclusion
6. Discussion
References

Abstract

In this paper we study the commuting patterns of Swedish pupils using register data from Statistics Sweden. We start by modeling the commuting probabilities: logistic regression is used to relate the binary outcome (commuter or non-commuter) to the explanatory variables.

Attending a private school and country of birth are found to influence pupils’ commuting.

To demonstrate the explanatory power of the regression model, the receiver operating characteristic (ROC) curve is used as a diagnostic tool for the classification of commuters and non-commuters.

We conclude that the number of pupils who commute from their home municipality to a school in another municipality is expected to increase over time, because many factors affect commuting patterns, such as private schooling, country of birth, and other social factors. The paper also suggests directions for further study on this topic.

Acknowledgment

I want to thank my supervisor, Yuli Liang, for her support and guidance during the thesis process. I want to thank my examiner, Olha Bodnar, for examining my thesis and for her recommendations. Many thanks to Johan Helsing, Katarina Wizell, and Mats Hansson from the unit for Education and Jobs at Statistics Sweden (SCB) for their support and suggestions, for which I am grateful. Finally, I want to thank my family and my husband for their infinite love and constant support.

1. Introduction

Pupils who go to school in a different municipality from the one they live in are counted as commuters. A non-commuter is thus a pupil who lives and goes to school in the same municipality. There may be different reasons why a pupil chooses a school in a municipality other than the one where he or she lives. The choice of school may, for example, depend on the school’s profile or orientation, and on the pupil’s gender and country of birth.

Pupils, as members of the local community, usually have to travel between their place of residence and their place of education. This movement connects two social and geographical functions: living at home and learning. In almost all cases a third essential function, transport, provides the link between the two when the home is not in the same place as the school.

As infrastructure, including the public transportation system, has improved in Sweden over the years, commuting has grown as a feasible choice. One of its benefits is that it frees people from the limitations of their place of residence (Westerlund, 2001).

According to Xue et al. (2010), many factors affect pupils’ commuting patterns, school location among them, and they assume that many pupils tend to choose a school near their home.

About twenty years ago, the Swedish school system changed so that pupils can choose their school, a choice that affects educational equality and educational achievement. Independent schools initially played a minor role, but their share has increased over time: according to the National Agency for Education, 3.2% of high school graduates attended a private school in 2000, rising to 6.3% in 2003 and 8.5% in 2006 (Andersson et al., 2012).

Predictions generally concern future events. The predictions made by Statistics Sweden (SCB) for this purpose cover the coming ten years or more, whereas a forecast is more appropriate for the near future. Such predictions can help SCB support planning for future development. For each municipality, the population is divided by gender, age, and country of birth to analyze the effect of prediction (SCB, 2005).

1.1 Background

Our study is concerned with the commuting pattern of Swedish pupils aged 16 to 18. The commuting pattern is not symmetric, i.e., the number of pupils who commute from municipality A to municipality B need not equal the number who commute from B to A. Understanding the commuting pattern would help Statistics Sweden (SCB) forecast the number of pupils in each municipality. This type of prediction is beneficial for all municipalities, for example when planning budgets and distributing resources.

The official statistics on the education of the population are produced by the unit for Education and Jobs. The unit studies, for example, the transition between education and the labor market, and it is responsible for the registers on education and universities in Sweden. Its work consists of register processing as well as sample surveys and total surveys in the area of education.

This thesis analyzes how the choice of school and the place of birth influence the commuting pattern, by modeling how travel to school is affected by those factors.

1.2 Aim

The purpose of this thesis is to analyze certain characteristics of commuting patterns in Sweden, in particular whether the type of school, country of birth, age, or gender influences the commuting flow. We also want to know how these covariates affect the binary outcome variable.

The aim is to model the probability that a pupil aged 16-18 goes to a school outside his or her home municipality. Because “commuting to a municipality” and “commuting from a municipality” are expected to differ, we model their probabilities separately. Based on the estimated probabilities, the expected numbers of pupils are calculated and compared with the true totals from the register database.

2. Data

The empirical study is based on data from Statistics Sweden (SCB). This chapter describes the data and the data processing.

2.1 Data description

The data come from Statistics Sweden (SCB) and cover the years 2000 to 2018. The data are at the individual level; the logistic model is fitted to the individual data, and its results are then summed to obtain figures at the aggregated (municipality) level. The data cover high school pupils between 16 and 18 years old. In 2018, the total number of commuters from Stockholm municipality to another municipality was 5231, and the total number of commuters from another municipality to Stockholm was 15282.

The empirical analysis is based on data for Stockholm municipality and on the commuting data for incoming and outgoing pupils in 2018. We selected all pupils aged 16-18 from Stockholm municipality. The dependent variable is defined through two dummy variables, as follows.

1) Commute from Stockholm to another municipality (outgoing)

$$y_1 = \begin{cases} 1 & \text{if the home municipality is not equal to the school municipality} \\ 0 & \text{otherwise} \end{cases}$$

2) Commute from another municipality to Stockholm (incoming)

$$y_2 = \begin{cases} 1 & \text{if the school municipality is not equal to the home municipality} \\ 0 & \text{otherwise} \end{cases}$$

The covariates are gender, age, country of birth, and type of school (public or private).

Before we run the logistic model, it is necessary to check whether there are empty or small cells in the cross-tabulation of the response variable and the explanatory variables.
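The check for empty or small cells can be sketched in R roughly as follows. This is only an illustration on simulated data: the data frame `pupils`, its column names, the helper `small_cells`, and the cutoff of 5 observations per cell are all assumptions made for the example, not taken from the thesis.

```r
# Simulated pupil-level data; all names and values are hypothetical.
set.seed(1)
n <- 500
pupils <- data.frame(
  commuter = rbinom(n, 1, 0.25),
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  school_type = factor(sample(c("Public", "Private"), n, replace = TRUE)),
  birth_region = factor(sample(
    c("Sweden", "Scandinavia except Sweden",
      "Europe except Scandinavia", "Outside Europe"),
    n, replace = TRUE, prob = c(0.80, 0.02, 0.05, 0.13)))
)

# Cross-tabulate the response against each categorical covariate and
# return the cells whose count falls below min_count (assumed cutoff).
small_cells <- function(data, response, covariates, min_count = 5) {
  out <- lapply(covariates, function(v) {
    tab <- as.data.frame(table(data[[v]], data[[response]]))
    names(tab) <- c("level", "response", "count")
    tab$covariate <- v
    tab[tab$count < min_count, c("covariate", "level", "response", "count")]
  })
  do.call(rbind, out)
}

flagged <- small_cells(pupils, "commuter",
                       c("gender", "school_type", "birth_region"))
print(flagged)
```

An empty result means every cell has at least `min_count` observations. Sparse categories, if any are flagged, could be merged with a neighboring category before fitting the model.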

Table 1: Descriptive statistics for data set 1, commuting from Stockholm to another municipality (out-commuters)

Table 1a) Total number of pupils by gender in 2018

Gender Commuter Non-commuter Total

Male 2720 9216 11936

Female 2511 8982 11493

Total 5231 18198 23429

Table 1b) Total number of pupils by type of school in 2018

Type of school Commuter Non-commuter Total

Public 3396 9424 12820

Private 1835 8774 10609

Total 5231 18198 23429

Table 1c) Total number of pupils by country of birth in 2018

Country of birth Commuter Non-commuter Total

Scandinavia except Sweden 17 78 95

Sweden 4474 15182 19656

Outside Europe 633 2439 3072

Europe except Scandinavia 107 499 606

Total 5231 18198 23429

Table 1d) Total number of pupils by age in 2018

Age Commuter Non-commuter Total

16 1620 6231 7851

17 1722 6153 7875

18 1889 5814 7703

Table 2: Descriptive statistics for data set 2, commuting from another municipality to Stockholm (in-commuters)

Table 2a) Total number of pupils divided by gender in 2018

Gender Commuter Non-commuter Total

Male 6515 9216 15731

Female 8767 8982 17749

Total 15282 18198 33480

Table 2b) Total number of pupils by type of school

Type of school Commuter Non-commuter Total

Public 5235 9424 14659

Private 10047 8774 18821

Total 15282 18198 33480

Table 2c) Total number of pupils by country of birth

Country of birth Commuter Non-commuter Total

Scandinavia except Sweden 71 78 149

Sweden 13063 15182 28245

outside Europe 1648 2439 4087

Europe except Scandinavia 500 499 999

Total 15282 18198 33480

Table 2d) Total number of pupils by Age

Age Commuter Non-commuter Total

16 5542 6231 11773

17 5095 6153 11248

18 4645 5814 10459


2.2 Data processing

The data obtained from Statistics Sweden come from different sources. Personal identification numbers, gender, home municipality, and country of birth were collected from the population register (RTB) for the years 2000 to 2018. The data about pupils (age, gender, type of school, and type of program) come from the administrative registers.

In this thesis we needed to gather the data from multiple sources and transform them into appropriate shapes and formats. SQL was used to import the data into R, and the essential variables were selected. All data creation and manipulation were performed using R and SQL. SQL was used to create summary tables of the total number of pupils who commuted from Stockholm municipality to another municipality, and from another municipality to Stockholm, in 2018, by gender, age, type of school, and country of birth.

The left_join function in R was used to combine the data sources, matching each pupil in the population register (RTB) with the corresponding pupil in the administrative registers for every year from 2000 to 2018. Before running the models, observations with missing values were removed and the categorical variables were converted into factors.
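As a rough sketch of these processing steps, the following R code joins two small made-up tables with `left_join` from dplyr, drops incomplete rows, and converts the categorical variables to factors. The table names (`rtb`, `school`), the column names, and the join keys are assumptions made for illustration; the actual register layout is not described in the thesis.

```r
library(dplyr)

# Hypothetical stand-ins for the two register extracts.
rtb <- data.frame(
  pupil_id = c(1, 2, 3, 4),
  year = 2018,
  gender = c("Female", "Male", "Female", "Male"),
  home_municipality = c("Stockholm", "Stockholm", "Solna", "Stockholm"),
  birth_region = c("Sweden", "Outside Europe", "Sweden", NA)
)
school <- data.frame(
  pupil_id = c(1, 2, 3),
  year = 2018,
  age = c(16, 17, 18),
  school_type = c("Public", "Private", "Private"),
  school_municipality = c("Stockholm", "Solna", "Stockholm")
)

pupils <- rtb %>%
  # Keep every pupil in the population register and attach school data.
  left_join(school, by = c("pupil_id", "year")) %>%
  # Remove rows with missing values (pupil 4 has no school record).
  na.omit() %>%
  mutate(
    # 1 if home and school municipality differ, 0 otherwise.
    commuter = as.integer(home_municipality != school_municipality),
    # Convert categorical variables into factors.
    across(c(gender, school_type, birth_region), as.factor)
  )

print(pupils)
```

With a left join, pupils without a matching school record are kept with missing values, which is why the removal of missing values follows the join in this sketch.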

3. Theory and method

This chapter presents the logistic model applied in this thesis. The dependent variable is dichotomous, so logistic regression is a natural choice for describing the relationship between the binary response variable and a set of covariates.

3.1 Logistic regression model

Commuting from Stockholm to another municipality is defined by y = 1 if the home municipality is not equal to the school municipality, and y = 0 otherwise. We use the logistic model to estimate the probability that y = 1. The possible prognostic factors are gender, age, type of school, and country of birth.

Dobson and Barnett (2008) describe the probability that the dependent variable $y_i$ takes the value one as

$$P(y_i = 1 \mid x_i) = p_i = \frac{\exp(\alpha + x_i^T \beta)}{1 + \exp(\alpha + x_i^T \beta)} = \frac{1}{1 + \exp(-(\alpha + x_i^T \beta))}$$

The logistic model is

$$\log\left(\frac{p_i}{1 - p_i}\right) = \alpha + x_i^T \beta$$

where $x_i$ is the vector of explanatory variables (age, gender, country of birth, and type of school), the dependent variable takes two values (0 = not commuting, 1 = commuting), and $\beta$ is the parameter vector. The two logistic models used are described as follows.

1) The first model gives the probability that a pupil commutes from Stockholm to another municipality, using gender, age, country of birth, and type of school as covariates. Based on the estimated probabilities and the threshold value, the estimated number of pupils commuting from Stockholm to another municipality can be calculated.

2) The second model gives the probability that a pupil commutes from another municipality to Stockholm, using the same covariates. Based on the estimated probabilities and the threshold value, the estimated number of pupils commuting from another municipality to Stockholm can be calculated.

Based on the estimated probabilities and the threshold value, we obtain the estimated number of commuting pupils for Stockholm municipality in 2018. A confusion matrix is then used to compare the predicted number of pupils with the actual number.
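The procedure described above (fit the logistic model, estimate probabilities, apply a threshold, compare predicted and actual counts) might look like the following in R. The data are simulated, and the coefficients used to generate them, the column names, and the threshold of 0.4 are hypothetical choices for the example rather than values from the thesis.

```r
set.seed(42)
n <- 2000
pupils <- data.frame(
  age = sample(16:18, n, replace = TRUE),
  female = rbinom(n, 1, 0.5),
  private = rbinom(n, 1, 0.45)
)
# Hypothetical coefficients, used only to simulate the response.
eta <- -2 + 0.03 * pupils$age + 0.4 * pupils$female + 1.2 * pupils$private
pupils$commuter <- rbinom(n, 1, plogis(eta))

# Logistic regression of the commuting indicator on the covariates.
fit <- glm(commuter ~ age + female + private,
           family = binomial, data = pupils)
p_hat <- predict(fit, type = "response")   # estimated probabilities

threshold <- 0.4                           # assumed cutoff
predicted_class <- as.integer(p_hat >= threshold)

# Confusion matrix: actual class in rows, predicted class in columns.
confusion <- table(actual = pupils$commuter,
                   predicted = factor(predicted_class, levels = c(0, 1)))
print(confusion)

cat("Predicted commuters (threshold rule):", sum(predicted_class), "\n")
cat("Expected commuters (sum of probabilities):", round(sum(p_hat), 1), "\n")
cat("Actual commuters:", sum(pupils$commuter), "\n")
```

Two different aggregates appear here. The count from the threshold rule depends on the chosen cutoff, while the sum of the fitted probabilities reproduces the observed total on the training data whenever the model includes an intercept.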


3.2 Model performance

Model 1) Commuting from Stockholm to another municipality (out-commuters)

$$\log\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta_1 A + \beta_2 F + \beta_3 P + \beta_4 S + \beta_5 SW + \beta_6 O$$

where $p_i$ is the probability that a pupil commutes from Stockholm to another municipality.

Model 2) Commuting from another municipality to Stockholm (in-commuters)

$$\log\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta_1 A + \beta_2 F + \beta_3 P + \beta_4 S + \beta_5 SW + \beta_6 O$$

where $p_i$ is the probability that a pupil commutes from another municipality to Stockholm.

Table 3: Definition of the variables and parameters

Symbol | Definition
A | Age
F | Female
P | Private school
S | Place of birth Scandinavia except Sweden
SW | Place of birth Sweden
O | Place of birth outside Europe
$\alpha, \beta_1, \beta_2, \beta_3, \beta_4, \beta_5, \beta_6$ | Parameters to be estimated

The AIC criterion is used for model selection, and the six most significant regressors (independent variables) are chosen, since this subset gave the lowest AIC value. According to James et al. (2013), stepwise regression is an efficient method for finding a subset of covariates or predictors that fits the data well with low prediction error.
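A minimal sketch of AIC-based stepwise selection for a logistic model, using the `step` function from base R's stats package, is given below. The thesis does not state which function was used, so this is one possible approach; the simulated data and the extra `noise` covariate are hypothetical.

```r
set.seed(7)
n <- 1500
d <- data.frame(
  age = sample(16:18, n, replace = TRUE),
  female = rbinom(n, 1, 0.5),
  private = rbinom(n, 1, 0.45),
  noise = rnorm(n)   # hypothetical covariate unrelated to the response
)
d$commuter <- rbinom(n, 1, plogis(-1 + 0.5 * d$female + 1.0 * d$private))

# Largest and smallest candidate models.
full <- glm(commuter ~ age + female + private + noise,
            family = binomial, data = d)
null <- glm(commuter ~ 1, family = binomial, data = d)

# Stepwise search in both directions, starting from the null model.
# Each step adds or drops the term that lowers AIC the most.
selected <- step(null,
                 scope = list(lower = formula(null), upper = formula(full)),
                 direction = "both", trace = 0)

print(formula(selected))
cat("AIC null:", AIC(null), " full:", AIC(full),
    " selected:", AIC(selected), "\n")
```

Because every accepted step lowers the AIC, the selected model never has a higher AIC than the starting model. A stepwise search is not guaranteed to find the subset with the globally lowest AIC.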


3.3 Classification: Sensitivity, specificity and ROC curve

The receiver operating characteristic (ROC) curve can be used as a diagnostic tool for the classification of commuters and non-commuters.

The ROC curve is useful for comparing different classifiers because it takes all threshold values into account simultaneously. It can also be used to obtain a suitable threshold value (James et al., 2013).

According to Perkins & Schisterman (2006), one criterion for the threshold value is the cutoff point on the ROC curve closest to (0,1). In this study, sensitivity is the proportion of commuters who are correctly identified by the model, and specificity is the proportion of non-commuters who are correctly identified by the model. A confusion matrix can be used to calculate the true positive rate (sensitivity) and the true negative rate (specificity) at a given threshold value. Instead of calculating a confusion matrix for every threshold value, the ROC curve can be used.

The ROC graph summarizes all of the confusion matrices produced by the different thresholds; a good ROC curve attains a high true positive rate at a low false positive rate.

According to James et al. (2013), specificity and sensitivity can be used to evaluate the performance of a diagnostic test in the case of a binary classifier.

Specificity and sensitivity can be combined in different ways to assess optimal thresholds and model accuracy.

The coordinates function of the ROC curve returns the coordinates of the curve at the optimal threshold value. The curve has two coordinates: Y, the true positive rate, and X, the false positive rate, both of which depend on the threshold value. In this study, the best threshold is the one that gives a low false positive rate together with a high true positive rate.

The true positive rate (TPR) is defined as

$$\text{TPR} = \frac{TP}{TP + FN}$$

The false positive rate (FPR) is defined as

$$\text{FPR} = \frac{FP}{FP + TN}$$

Changing the threshold value changes both rates: lowering the threshold raises the true positive rate but also the false positive rate. The true positive rate is called sensitivity, and the false positive rate equals 1 - specificity.

In this thesis we tried different threshold values. For every threshold value used to separate the two populations, some commuters are correctly classified as positive (TP = true positives), while some commuters are classified as negative (FN = false negatives). Likewise, some non-commuters are correctly classified as negative (TN = true negatives), but some non-commuters are classified as positive (FP = false positives).
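To make these definitions concrete, the sketch below computes TPR and FPR over a grid of thresholds in base R, picks the threshold closest to the point (0,1), and approximates the area under the curve with the trapezoidal rule. The labels and predicted probabilities are simulated, and the helper `rates_at` and the grid step of 0.01 are assumptions made for illustration; the thesis itself loads the ROCR package for this purpose.

```r
set.seed(123)
n <- 1000
# Hypothetical true labels (1 = commuter) and predicted probabilities.
y <- rbinom(n, 1, 0.3)
p_hat <- plogis(-1 + 1.5 * y + rnorm(n))

# True and false positive rates at one threshold.
rates_at <- function(threshold, y, p_hat) {
  pred <- as.integer(p_hat >= threshold)
  tp <- sum(pred == 1 & y == 1)
  fn <- sum(pred == 0 & y == 1)
  fp <- sum(pred == 1 & y == 0)
  tn <- sum(pred == 0 & y == 0)
  c(threshold = threshold, tpr = tp / (tp + fn), fpr = fp / (fp + tn))
}

thresholds <- seq(0, 1, by = 0.01)
roc <- as.data.frame(t(sapply(thresholds, rates_at, y = y, p_hat = p_hat)))

# Closest-to-(0,1) criterion: minimize the distance to the top-left corner.
roc$dist <- sqrt(roc$fpr^2 + (1 - roc$tpr)^2)
best <- roc[which.min(roc$dist), ]
print(best)

# Approximate AUC with the trapezoidal rule over the ROC points.
o <- order(roc$fpr, roc$tpr)
auc <- sum(diff(roc$fpr[o]) *
           (head(roc$tpr[o], -1) + tail(roc$tpr[o], -1)) / 2)
cat("Approximate AUC:", round(auc, 3), "\n")
```

Each row of `roc` corresponds to one confusion matrix, which is why the curve can replace a table of confusion matrices computed threshold by threshold.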


4. Results

Table 1a: Estimates of the logistic model for out-commuters, years 2000–2017

Variables | Coefficients | Odds ratio | Pair category | St. Error | Z-value | P-value
Intercept | -2.0829 | 0.1245 | - | 0.0919 | -22.661 | 2e-16 ***
Age | 0.0312 | 1.0316 | - | 0.0051 | 6.113 | 9.81e-10 ***
Female | -0.0522 | 0.9492 | Female vs male | 0.0083 | -6.302 | 2.94e-10 ***
Private school | 0.0893 | 1.0934 | Private vs public | 0.0086 | 10.326 | 2e-16 ***
Place of birth Scandinavia except Sweden | 0.0372 | 1.0378 | Scandinavia except Sweden vs Europe | 0.0709 | 0.524 | 0.6006
Place of birth Sweden | 0.1139 | 1.1206 | Sweden vs Europe | 0.0296 | 3.851 | 0.0001 ***
Place of birth outside Europe | -0.1753 | 0.83920 | Outside Europe vs Europe | 0.0320 | -5.467 | 4.57e-08 ***

Note: * significant at α = 0.05, *** significant at α = 0.001. The pair category column shows which category is compared with the reference category.

Table 1a shows that age, female, private school, and place of birth Sweden or outside Europe are statistically significant at α = 0.05, and indeed at α = 0.001. Place of birth Scandinavia except Sweden is not significant. The small p-values indicate that these covariates have significant effects on the commuting pattern.

The estimated parameters for each category of the prognostic factors are not convenient to interpret directly, so we express the results as odds ratios instead. According to Dobson and Barnett (2008), when data are analyzed with the logistic model, the effect of the covariates is on the logit scale. To interpret the impact of the covariates, we use odds ratios.

With the estimated coefficients from the table above, the fitted model is

$$\log\left(\frac{p_i}{1 - p_i}\right) = -2.0829 + 0.0312 A - 0.0522 F + 0.0893 P + 0.0372 S + 0.1139 SW - 0.1753 O$$

so the odds are

$$\frac{p_i}{1 - p_i} = e^{\alpha + \beta_1 A + \beta_2 F + \beta_3 P + \beta_4 S + \beta_5 SW + \beta_6 O}$$

We obtain the odds ratio for a covariate by exponentiating its coefficient. To interpret the effect of age, we hold the values of the other covariates fixed:

$$\text{odds ratio} = e^{0.031197} = 1.032$$
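The same calculation can be applied to all covariates at once. The short R sketch below exponentiates the coefficients reported in the estimates table for out-commuters (2000–2017); the vector names are labels chosen for this example.

```r
# Coefficients as reported for the out-commuter model, 2000-2017.
coefs <- c(
  Age = 0.0312,
  Female = -0.0522,
  PrivateSchool = 0.0893,
  BornScandinaviaExceptSweden = 0.0372,
  BornSweden = 0.1139,
  BornOutsideEurope = -0.1753
)

# Odds ratio for each covariate: exp(coefficient).
odds_ratios <- exp(coefs)
print(round(odds_ratios, 4))

# Odds ratio for a two-year age difference, other covariates held fixed:
# coefficients add on the logit scale, so odds ratios multiply.
print(exp(2 * coefs["Age"]))
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, so the sign pattern in the table carries over directly.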

For example, for a one-unit (one-year) increase in age, the odds of commuting from Stockholm to another municipality (vs. not commuting) are 1.032 times greater. Similarly, the odds of commuting from Stockholm for pupils in private schools are 1.093 times the odds for pupils in public schools, and the odds for pupils born in Sweden are 1.12 times the odds for pupils born in Europe (the reference category).

For the variables used in the model, we obtained the smallest AIC using stepwise logistic regression. The confusion matrix is used to compare the actual numbers of pupils with the predicted numbers.

The confusion matrix result is given below.

TN = 12217 FP = 1328 FN = 3962 TP = 6242

Here TP = 6242 is the number of commuters correctly classified as positive, FN = 3962 is the number of commuters classified as negative, TN = 12217 is the number of non-commuters correctly classified as negative, and FP = 1328 is the number of non-commuters classified as positive. Our model predicted that 7570 pupils would be commuting, while the actual number of commuting pupils is 5290.

Figure: Area under the receiver operating characteristic curve (AUROC) graph and threshold for model 1a

We tried different threshold values, and the threshold value of 0.3 gave the highest sensitivity and the lowest specificity. The obtained sensitivity is 0.748 and the specificity is 0.338.

The AUROC graph was used to decide which threshold value is suitable. The ROC graph summarizes all of the confusion matrices produced by the different thresholds, so it was used instead of calculating a confusion matrix for every threshold. An excellent model has an area under the curve (AUC) near 1, while a model that classifies no better than chance has an AUC near 0.5.

Table 1b: Estimates of the logistic model for out-commuters, year 2017

Variables | Coefficients | Pair category | Odds ratio | St. Error | Z-value | P-value
Intercept | -0.6437 | - | 0.525 | 0.3516 | -7.989 | 1.36e-15 ***
Age | 0.0910 | - | 1.095 | 0.1073 | 4.614 | 3.94e-06 ***
Female | -0.0733 | Female vs male | 0.929 | 0.0320 | -2.289 | 0.0221 *
Private school | -0.6437 | Private vs public | 0.525 | 0.0332 | 19.373 | 2e-16 ***
Place of birth Scandinavia except Sweden | 0.1443 | Scandinavia except Sweden vs Europe | 1.154 | 0.2656 | 0.543 | 0.5868
Place of birth Sweden | 0.3797 | Sweden vs Europe | 1.46 | 0.1357 | 3.666 | 0.0002 ***
Place of birth outside Europe | 0.0681 | Outside Europe vs Europe | | | |

Note: * significant at α = 0.05, *** significant at α = 0.001. The pair category column shows which category is compared with the reference category.

Table 1b shows that age, female, private school, and place of birth Sweden are statistically significant at α = 0.05 or at α = 0.001. For the variables used in the model, we obtained the smallest AIC using stepwise logistic regression.

Comparing the two periods, 2000-2017 and 2017, a consistent pattern of prognostic factors was detected.

Table 2a: Estimates of the logistic model for in-commuters, years 2000–2017

Variables | Coefficients | Odds ratio | Pair category | St. Error | Z-value | P-value
Intercept | -0.8181 | 0.4413 | - | 0.0678 | -12.065 | 2e-16 ***
Age | -0.0193 | 0.9808 | - | 0.0038 | -5.067 | 4.04e-07 ***
Female | 0.3916 | 1.4793 | Female vs male | 0.0062 | 63.019 | 2e-16 ***
Private school | 1.2844 | 3.6126 | Private vs public | 0.0062 | 207.086 | 2e-16 ***
Place of birth Scandinavia except Sweden | 0.0297 | 1.0301 | Scandinavia except Sweden vs Europe | 0.0297 | 0.621 | 0.534
Place of birth Sweden | -0.1837 | 0.8321 | Sweden vs Europe | 0.0196 | -9.354 | 2e-16 ***
Place of birth outside Europe | -0.5749 | 0.5627 | Outside Europe vs Europe | 0.0219 | -26.253 | 2e-16 ***

Note: * significant at α = 0.05, *** significant at α = 0.001. The pair category column shows which category is compared with the reference category.

Table 2a shows that age, female, private school, and place of birth Sweden or outside Europe are statistically significant at α = 0.05, and indeed at α = 0.001. Place of birth Scandinavia except Sweden is not significant. The small p-values indicate that these covariates have significant effects on the commuting pattern.

The estimated parameters for each category of the prognostic factors are not convenient to interpret directly, so we express the results as odds ratios instead. For example, the odds of commuting from another municipality to Stockholm (vs. not commuting) for female pupils are 1.479 times the odds for male pupils.

Similarly, the odds of commuting from another municipality to Stockholm for pupils in private schools are 3.61 times the odds for pupils in public schools. Moreover, for each significant regression effect the p-value is smaller than 0.05, so the 95% confidence interval for the corresponding odds ratio (OR) excludes 1: a regression coefficient that is significantly different from zero gives an OR that is significantly different from 1.

For the variables used in the model, we obtained the smallest AIC using stepwise logistic regression. The confusion matrix was used to compare the actual numbers of pupils with the predicted numbers.

The confusion matrix result is given below.

TN = 9806 FP = 8653 FN = 5506 TP = 10054

Here TP = 10054 is the number of commuters correctly classified as positive, FN = 5506 is the number of commuters classified as negative, TN = 9806 is the number of non-commuters correctly classified as negative, and FP = 8653 is the number of non-commuters classified as positive.

Our model predicted that 18707 pupils would be commuting, while the actual number of commuting pupils is 15560.
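The predicted and actual totals, and the sensitivity and specificity reported for this model, follow directly from the four cells of the confusion matrix. As a worked check in R, using only the reported counts (the variable names are chosen for this example):

```r
# Counts as reported for the in-commuter model (2000-2017).
tn <- 9806
fp <- 8653
fn <- 5506
tp <- 10054

sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate

predicted_commuters <- tp + fp  # everyone classified as a commuter
actual_commuters <- tp + fn     # everyone who actually commutes

cat("Sensitivity:", round(sensitivity, 3), "\n")          # 0.646
cat("Specificity:", round(specificity, 3), "\n")          # 0.531
cat("Predicted commuters:", predicted_commuters, "\n")    # 18707
cat("Actual commuters:", actual_commuters, "\n")          # 15560
```

These values agree with the reported totals of 18707 and 15560, and with the reported sensitivity of 0.64 and specificity of 0.53 up to rounding.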

Figure: Area under the receiver operating characteristic curve (AUROC) graph and threshold for model 2a

We tried different threshold values, and after classification the threshold value of 0.39 gave the highest sensitivity and the lowest specificity. The obtained sensitivity is 0.64 and the specificity is 0.53.

The AUROC graph was used to decide which threshold value is suitable. The ROC graph summarizes all of the confusion matrices produced by the different thresholds, so it was used instead of calculating a confusion matrix for every threshold. An excellent model has an area under the curve (AUC) near 1, while a model that classifies no better than chance has an AUC near 0.5.

Table 2b: Estimates of the logistic model for in-commuters, year 2017

Variables | Coefficients | Odds ratio | Pair category | St. Error | Z-value | P-value
Intercept | -0.2556 | 0.7744 | - | 0.2465 | -1.037 | 0.2997
Age | -0.0159 | 0.9841 | - | 0.0141 | -1.134 | 0.2569
Female | 0.3280 | 1.3882 | Female vs male | 0.0230 | 14.261 | 2e-16 ***
Private school | 0.7616 | 2.1417 | Private vs public | 0.0233 | 32.628 | 2e-16 ***
Place of birth Scandinavia except Sweden | -0.1313 | 0.8769 | Scandinavia except Sweden vs Europe | 0.1688 | -0.778 | 0.4366
Place of birth Sweden | -0.2556 | 0.7744 | Sweden vs Europe | 0.0589 | -4.335 | 1.46e-05 ***
Place of birth outside Europe | -0.5848 | 0.5571 | Outside Europe vs Europe | 0.0664 | -8.809 | 2e-16 ***

Note: * significant at α = 0.05, *** significant at α = 0.001. The pair category column shows which category is compared with the reference category.

Table 2b shows that female, private school, and place of birth Sweden or outside Europe are statistically significant at α = 0.05, and indeed at α = 0.001, whereas age and place of birth Scandinavia except Sweden are not significant in 2017. The small p-values indicate that the significant covariates have important effects on the commuting pattern.

The estimated parameters for each category of the prognostic factors are not convenient to interpret directly, so we express the results as odds ratios instead. For example, the odds of commuting from another municipality to Stockholm (vs. not commuting) for female pupils are 1.388 times the odds for male pupils.

Similarly, the odds of commuting from another municipality to Stockholm for pupils in private schools are 2.14 times the odds for pupils in public schools.

For the variables used in the model, we obtained the smallest AIC using stepwise logistic regression.

The confusion matrix was used to compare the actual numbers of pupils with the predicted numbers. Our model predicted that 18707 pupils would be commuting, while the actual number of commuting pupils is 15560.

Figure: Area under the receiver operating characteristic curve (AUROC) graph and threshold for model 2b

We tried different threshold values, and after classification the threshold value of 0.41 gave the highest sensitivity and the lowest specificity. The obtained sensitivity is 0.65 and the specificity is 0.52.

The AUROC graph was used to decide which threshold value is suitable. The ROC graph summarizes all of the confusion matrices produced by the different thresholds, so it was used instead of calculating a confusion matrix for every threshold. An excellent model has an area under the curve (AUC) near 1, while a model that classifies no better than chance has an AUC near 0.5.

Comparing the two periods, 2000-2017 and 2017, a consistent pattern of prognostic factors was detected.

5. Conclusion

From the analysis of the commuting pattern, a number of conclusions are drawn.

The commuting pattern is affected by specific characteristics such as the choice of school, country of birth, age, and gender.

The effects are measured by the beta coefficients of model 1 (out-commuters) and model 2 (in-commuters), and odds ratios are used to interpret the impact of the covariates. In the first model (out-commuters), age, private school, and place of birth in Scandinavia or Sweden increase the odds of commuting. For example, for a one-unit increase in age, the odds of commuting from Stockholm to another municipality (vs. not commuting) are 1.032 times greater.

Pupils travel from one municipality to another because of their choice of school, and place of birth also affects commuting flows. The results for model 1 (out-commuters) show an association between the dependent variable and place of birth (Sweden): the odds of commuting from Stockholm to another municipality (vs. not commuting) for pupils born in Sweden are 1.12 times the odds for the reference category.

The results for model 2 (in-commuters) show a strong positive association between the dependent variable and private school, which has the largest effect on the odds. The odds of commuting from another municipality to Stockholm (vs. not commuting) for pupils in private schools are 3.6 times the odds for pupils in public schools. This suggests that pupils commute from other municipalities to Stockholm because they want to study in a private school.

Understanding this commuting pattern is important because it would help Statistics Sweden (SCB) forecast the number of pupils in each municipality. The statistical model proposed in this thesis can also be used to model commuting patterns in other municipalities, and it can be generalized to model commuting in the whole country. This type of prediction is beneficial for all municipalities: knowing how many pupils will commute next year helps each municipality plan whether to increase or decrease its budget as the number of pupils rises or falls.

6. Discussion

It was hard to take into account all municipalities because of the fact that some municipalities are larger than others, and the result could be biased for one municipality and unbiased for others.

The model is good in our case for prediction because, for those two models with these combinations of variables, we got the smallest AIC using stepwise logistic regression. AIC provides a means for model selection and describes how well a particular model fits the data. Concluding from the result, the predicted number of pupils will increase over time, because there are many factors affected commuting pattern such as private school and place of birth. Theirs exists another factor that may be affected commuting pattern; such as social factor; pupils want to study on another municipality because her or his friend study there, or they hear from friends that there exist a good school on another municipality. It would be interesting to see the effect of individual partners or family on the choice of commute.

As a further study, it is possible to build a model at the municipality level instead of the pupil level and to use a measure called “net commuting,” defined as the number of pupils who study in a municipality as a proportion of the registered population in the same municipality. A linear mixed model (LMM) is one option for taking the variation between municipalities into account; what characterizes a linear mixed model is that it contains both fixed effects and random effects:

$$y_{tj} = \beta_0 + \beta_t d_t + \alpha_j + \epsilon_{tj},$$

where $y_{tj}$ is the net commuting of municipality $j$ in year $t$, $\alpha_j$ is the municipality-specific random effect, $\epsilon_{tj}$ is the error term, and $\beta_0 + \beta_t d_t$ is the fixed effect. In the model, the net commuting of each municipality in a given year is assumed to deviate from the population mean.
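As an illustration of the random-intercept model described above, the following is a minimal sketch on simulated data. It assumes that the fixed part consists of year dummies, uses the `nlme` package (which the thesis does not name), and all object and column names (`panel`, `net_commuting`, `year_f`) are hypothetical.

```r
library(nlme)  # recommended package distributed with R; an assumption, not the thesis' choice

set.seed(2)
n_mun <- 30
years <- 2010:2018
panel <- expand.grid(municipality = factor(seq_len(n_mun)), year = years)

# Simulated net commuting: common level and trend plus a municipality-specific shift
alpha <- rnorm(n_mun, mean = 0, sd = 0.10)
panel$net_commuting <- 0.95 +
  0.01 * (panel$year - min(years)) +
  alpha[as.integer(panel$municipality)] +
  rnorm(nrow(panel), sd = 0.03)

# Year dummies as fixed effects, random intercept for each municipality
panel$year_f <- factor(panel$year)
fit <- lme(net_commuting ~ year_f, random = ~ 1 | municipality, data = panel)

fixef(fit)        # intercept and year effects (fixed part)
VarCorr(fit)      # variance of the random intercept and of the residual
head(ranef(fit))  # estimated municipality-specific deviations
```

The estimated random intercepts show how far each municipality lies from the overall mean, which is the municipal variation the model is meant to capture.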


References

Andersson, E., Malmberg, & Östh J. (2012). Travel to School distances in Sweden 2000-2006: Changing School geography with equality implications, Journal of Transport

Geogarpphy.vol23, p35-43.

Dobson A, J. & Barnett, A. J. (2008). An Introduction to Generalized linear models. 3rd ed. Chapman and Hall/CRC.

James, G., Hastie, T., Tibshirani, R. & Witten, D. (2013). An introduction to statistical learning. 3rd ed. New York.

Perkins, N. J. & Schisterman, E. F. (2006). The Inconsistency of “optimal” cutpoints obtained using Two Criteria based on the Receiver Operating Characteristic Curve, American Journal of Epidemiology. Vol163, nr7.

Park, H. (2013). An Introduction to Logistic Regression: From Basic Concepts to

Interpretation with Particular Attention to Nursing Domain, J Korean Acad Nurs, nr2, p.154-164.

Transport Analysis. Commuting in Stockholm, Gothenburg and Malmö- a current state analysis. Web address: www.trafa.se. Publisher: Brita Saxton. Published date:2011-05-31.

Westerlund O. (2001). Arbetslöshet, arbetsmarknadspolitikand geografisk rörlighet, Ekonomisk Debatt 2001, nr4, p.263–272.

XUE, J., Mccurdy, T., Burke, J., Bhaduri B., Liu, C., Nutaro, J. & Patterson, L. (2010). Analyses of school commuting data for exposure modeling purposes. Journal of Exposure Science and Environmental Epidemiology,vol20, p.69-78.


Appendix

setwd("P:/Data/BV/UA/S_Gymnasieskolan/Uppsats prognoser pendling")

# Install packages
install.packages("odbc")
install.packages("tidyverse")
install.packages("dplyr")
install.packages("glmnet")
# Graphs
install.packages("ggplot2")
install.packages("ROCR")
install.packages("dummies")
install.packages("caret")
install.packages("pROC")  # provides roc() and coords(), which are used below

# Open packages
library(caret)
library(ROCR)
library(glmnet)
library(dummies)
library(odbc)
library(tidyverse)
library(dplyr)
library(ggplot2)

# Connect to database (skolaupp)
con <- dbConnect(drv = odbc(), "skolaupp", encoding = "windows-1252")

# Data preparation
rtbland = dbGetQuery(con, "select LandsNamn, FodelseLand_gruppering = case
    when VarldsdelNamn = 'SVERIGE' then 'Sverige'
    when VarldsdelNamn = 'Norden utom Sverige' then 'Norden utom Sverige'
    when VarldsdelNamn = 'EU28 utom Norden' then 'EU28 utom Norden'
    when VarldsdelNamn = 'Okänt' then 'Okänt'
    when VarldsdelNamn is null then null
    else 'Utanför Europa'
  end
  from skolaupp.dbo.rtbland_stud_uppdrag;") %>% as_tibble()

# Data preparation
# Create one table per year from SQL with the variables of interest (Ar, personnr, hman, program, kommun, hkom)

elevgymn00 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2000;") %>% as_tibble()

elevgymn01 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2001;") %>% as_tibble()

elevgymn02 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2002;") %>% as_tibble()

elevgymn03 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2003;") %>% as_tibble()

elevgymn04 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2004;") %>% as_tibble()

elevgymn05 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2005;") %>% as_tibble()

elevgymn06 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2006;") %>% as_tibble()

elevgymn07 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2007;") %>% as_tibble()

elevgymn08 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2008;") %>% as_tibble()

elevgymn09 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2009;") %>% as_tibble()

elevgymn10 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2010;") %>% as_tibble()

elevgymn11 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2011;") %>% as_tibble()

elevgymn12 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2012;") %>% as_tibble()

elevgymn13 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2013;") %>% as_tibble()

elevgymn14 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2014;") %>% as_tibble()

elevgymn15 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2015;") %>% as_tibble()

elevgymn16 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2016;") %>% as_tibble()

elevgymn17 = dbGetQuery(con, "select Ar, personnr,hman, program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2017;") %>% as_tibble()

elevgymn18 = dbGetQuery(con, "select Ar, personnr,hman,program, kommun, hkom from skolaupp.dbo.elevgymn_stud_uppdrag where Ar = 2018;") %>% as_tibble()

# Rename kommun to skolkommun
elevgymn00 = elevgymn00 %>% rename(skolkommun = kommun)
elevgymn01 = elevgymn01 %>% rename(skolkommun = kommun)
elevgymn02 = elevgymn02 %>% rename(skolkommun = kommun)
elevgymn03 = elevgymn03 %>% rename(skolkommun = kommun)
elevgymn04 = elevgymn04 %>% rename(skolkommun = kommun)
elevgymn05 = elevgymn05 %>% rename(skolkommun = kommun)
elevgymn06 = elevgymn06 %>% rename(skolkommun = kommun)
elevgymn07 = elevgymn07 %>% rename(skolkommun = kommun)
elevgymn08 = elevgymn08 %>% rename(skolkommun = kommun)
elevgymn09 = elevgymn09 %>% rename(skolkommun = kommun)
elevgymn10 = elevgymn10 %>% rename(skolkommun = kommun)
elevgymn11 = elevgymn11 %>% rename(skolkommun = kommun)
elevgymn12 = elevgymn12 %>% rename(skolkommun = kommun)
elevgymn13 = elevgymn13 %>% rename(skolkommun = kommun)
elevgymn14 = elevgymn14 %>% rename(skolkommun = kommun)
elevgymn15 = elevgymn15 %>% rename(skolkommun = kommun)
elevgymn16 = elevgymn16 %>% rename(skolkommun = kommun)
elevgymn17 = elevgymn17 %>% rename(skolkommun = kommun)
elevgymn18 = elevgymn18 %>% rename(skolkommun = kommun)

# Data preparation
# Create tables from SQL with the variables of interest from RTB 2000-2018
RTB_2000 = dbGetQuery(con, "select ar ='2000',
PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from
skolaupp.dbo.RTB_2000_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2001 = dbGetQuery(con, "select ar ='2001',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2001_stud_uppdrag where AlderSlut between 16 and 18 ;") %>% as_tibble()

RTB_2002 = dbGetQuery(con, "select ar ='2002',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2002_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2003 = dbGetQuery(con, "select ar ='2003',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2003_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2004 = dbGetQuery(con, "select ar ='2004',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2004_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2005 = dbGetQuery(con, "select ar ='2005',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2005_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2006 = dbGetQuery(con, "select ar ='2006',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2006_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2007 = dbGetQuery(con, "select ar ='2007',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2007_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2008 = dbGetQuery(con, "select ar ='2008',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2008_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2009 = dbGetQuery(con, "select ar ='2009',
PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from
skolaupp.dbo.RTB_2009_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2010 = dbGetQuery(con, "select ar ='2010',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2010_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2011 = dbGetQuery(con, "select ar ='2011',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2011_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2012 = dbGetQuery(con, "select ar ='2012',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2012_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2013 = dbGetQuery(con, "select ar ='2013',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2013_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2014 = dbGetQuery(con, "select ar ='2014',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2014_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2015 = dbGetQuery(con, "select ar ='2015',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2015_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2016 = dbGetQuery(con, "select ar ='2016',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2016_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2017 = dbGetQuery(con, "select ar ='2017',

PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2017_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

RTB_2018 = dbGetQuery(con, "select ar

='2018', PersonNr,Kommun,AlderSlut,Fodelselandnamn,Kon from

skolaupp.dbo.RTB_2018_stud_uppdrag where AlderSlut between 16 and 18;") %>% as_tibble()

# Preparing the data
# Left join (RTB with elevgymn) for every year from 2000 to 2018
RTB2000_elevgymn = left_join(RTB_2000, elevgymn00, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2001_elevgymn = left_join(RTB_2001, elevgymn01, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2002_elevgymn = left_join(RTB_2002, elevgymn02, by =c("ar" = "Ar", "PersonNr" = "personnr"))


RTB2003_elevgymn = left_join(RTB_2003, elevgymn03, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2004_elevgymn = left_join(RTB_2004, elevgymn04, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2005_elevgymn = left_join(RTB_2005, elevgymn05, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2006_elevgymn = left_join(RTB_2006, elevgymn06, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2007_elevgymn = left_join(RTB_2007, elevgymn07, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2008_elevgymn = left_join(RTB_2008, elevgymn08, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2009_elevgymn = left_join(RTB_2009, elevgymn09, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2010_elevgymn = left_join(RTB_2010, elevgymn10, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2011_elevgymn = left_join(RTB_2011, elevgymn11, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2012_elevgymn = left_join(RTB_2012, elevgymn12, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2013_elevgymn = left_join(RTB_2013, elevgymn13, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2014_elevgymn = left_join(RTB_2014, elevgymn14, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2015_elevgymn = left_join(RTB_2015, elevgymn15, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2016_elevgymn = left_join(RTB_2016, elevgymn16, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2017_elevgymn = left_join(RTB_2017, elevgymn17, by =c("ar" = "Ar", "PersonNr" = "personnr"))

RTB2018_elevgymn = left_join(RTB_2018, elevgymn18, by =c("ar" = "Ar", "PersonNr" = "personnr"))

# Combine all RTB_elevgymn tables
RTB_elevgmn <- rbind(RTB2000_elevgymn, RTB2001_elevgymn, RTB2002_elevgymn, RTB2003_elevgymn,
                     RTB2004_elevgymn, RTB2005_elevgymn, RTB2006_elevgymn, RTB2007_elevgymn,
                     RTB2008_elevgymn, RTB2009_elevgymn, RTB2010_elevgymn, RTB2011_elevgymn,
                     RTB2012_elevgymn, RTB2013_elevgymn, RTB2014_elevgymn, RTB2015_elevgymn,
                     RTB2016_elevgymn, RTB2017_elevgymn, RTB2018_elevgymn)


# Create two dummy variables: Y1 for commuting from Stockholm (municipality code 0180) to another municipality,
# and Y2 for commuting from another municipality to Stockholm
RTB_elevgmn_ny = RTB_elevgmn %>%
  mutate(Y1 = case_when(Kommun == "0180" & Kommun == skolkommun ~ "0",
                        Kommun == "0180" & Kommun != skolkommun ~ "1")) %>%
  mutate(Y2 = case_when(skolkommun == "0180" & skolkommun == Kommun ~ "0",
                        skolkommun == "0180" & skolkommun != Kommun ~ "1"))

# Data preparation: convert country names to upper case
rtbland = rtbland %>% mutate(LandsNamn=toupper(LandsNamn))

# Combine RTB_elevgmn_ny with rtbland (create region of birth from country of birth)
temp = left_join(RTB_elevgmn_ny, rtbland, by = c("Fodelselandnamn" = "LandsNamn"))

# Clean the data by removing missing values
temp = temp %>% filter(!is.na(Fodelselandnamn))
temp = temp %>% filter(!is.na(Kon))
temp = temp %>% filter(!is.na(AlderSlut))
temp = temp %>% filter(!is.na(hman))
temp = temp %>% filter(!is.na(FodelseLand_gruppering))

# Change categorical variables to factors
temp$ar = factor(temp$ar)
temp$Kommun = factor(temp$Kommun)
temp$Fodelselandnamn = factor(temp$Fodelselandnamn)
temp$skolkommun = factor(temp$skolkommun)
temp$program = factor(temp$program)
temp$hman = factor(temp$hman)
temp$Kon = as.factor(temp$Kon)
temp$FodelseLand_gruppering = factor(temp$FodelseLand_gruppering)

# Cleaning the data
# (logical indexing is used because -which() would drop every row if no value were missing)
data1 <- temp[!is.na(temp$Y1), ]  # data used for model 1: commuting from Stockholm to another municipality
data2 <- temp[!is.na(temp$Y2), ]  # data used for model 2: commuting from another municipality to Stockholm
data1$Y1 <- as.factor(data1$Y1)
data2$Y2 <- as.factor(data2$Y2)
data1$Y2 <- NULL
data2$Y1 <- NULL

#########################################################
# model 1a (2000-2017)
#########################################################
data = data1 %>% filter(ar != "2018")  # data for model 1: years 2000 to 2017
logistic1 = glm(Y1 ~ AlderSlut + Kon + hman + FodelseLand_gruppering, family = binomial, data = data)
summary(logistic1)
fitted.values(logistic1)
coef(logistic1)

# 2018 data for pupils living in Stockholm; temp1 must exist before predict() is called
temp1 = temp %>% filter(ar == "2018", Kommun == "0180", !is.na(Y1))

# Predict the number of pupils commuting from Stockholm municipality to another municipality
# (columns 4, 6, 7 and 13 are AlderSlut, Kon, hman and FodelseLand_gruppering)
pred_model1a = predict(logistic1, newdata = temp1[, c(4, 6, 7, 13)], type = "response")

# The coords function is used to choose the best threshold
library(pROC)
roc_obj_model1a <- roc(temp1$Y1, pred_model1a)
plot(roc_obj_model1a)
coords(roc_obj_model1a, "best", "threshold")
pred_model1a = ifelse(pred_model1a > 0.3, 1, 0)

# Confusion matrix comparing actual and predicted data
table(temp1$Y1, pred_model1a)
# Total travelling (predicted)
sum(pred_model1a == 1)
# Total travelling (actual)
sum(temp1$Y1 == 1)

# Stepwise logistic regression (finds the combination of variables with the smallest AIC)
# stepAIC() comes from the MASS package, which is not attached above
step_model1a <- logistic1 %>% MASS::stepAIC(trace = FALSE)


#########################################################
# model 1b (2017)
#########################################################
# Build a model on data from year 2017
data = data1 %>% filter(ar == "2017")
logistic = glm(Y1 ~ AlderSlut + Kon + hman + FodelseLand_gruppering, family = binomial, data = data)
summary(logistic)

# Remove missing values and use the 2018 data to predict the number of pupils commuting from Stockholm to another municipality
temp1 = temp %>% filter(ar == "2018", Kommun == "0180", !is.na(Y1))

# Predict the number of pupils commuting from Stockholm municipality to another municipality
pred_model1b = predict(logistic, newdata = temp1[, c(4, 6, 7, 13)], type = "response")
histogram(pred_model1b)

# The coords function is used to choose the best threshold
roc_obj_model1b <- roc(temp1$Y1, pred_model1b)
plot(roc_obj_model1b)
coords(roc_obj_model1b, "best", "threshold")
pred_model1b = ifelse(pred_model1b > 0.3, 1, 0)

# Confusion matrix comparing actual and predicted data
table(temp1$Y1, pred_model1b)
# Total travelling (predicted)
sum(pred_model1b == 1)
# Total travelling (actual)
sum(temp1$Y1 == 1)

# Stepwise logistic regression (finds the combination of variables with the smallest AIC)
# stepAIC() comes from the MASS package, which is not attached above
step_model1b <- logistic %>% MASS::stepAIC(trace = FALSE)
coef(step_model1b)

#########################################################
# model 2a (2000-2017)
#########################################################
# Logistic model: commuting from another municipality to Stockholm municipality, using data2
logistic2 = glm(Y2 ~ AlderSlut + Kon + hman + FodelseLand_gruppering, family = binomial, data = data2 %>% filter(ar != "2018"))

# Remove missing values and use the 2018 data to predict the number of pupils commuting from another municipality to Stockholm municipality
temp2 = temp %>% filter(ar == "2018", skolkommun == "0180", !is.na(Y2))

# Predict the number of pupils commuting from another municipality to Stockholm municipality
pred_model2a = predict(logistic2, newdata = temp2[, c(4, 6, 7, 13)], type = "response")
histogram(pred_model2a)

# The coords function is used to choose the best threshold
roc_obj_model2a <- roc(temp2$Y2, pred_model2a)
plot(roc_obj_model2a)
coords(roc_obj_model2a, "best", "threshold")
pred_model2a = ifelse(pred_model2a > 0.39, 1, 0)

# Confusion matrix comparing actual and predicted data
table(temp2$Y2, pred_model2a)
# Total travelling (predicted)
sum(pred_model2a == 1)
# Total travelling (actual)
sum(temp2$Y2 == 1)

#########################################################
# model 2b (2017)
#########################################################
# Logistic model: commuting from another municipality to Stockholm municipality, year 2017
logistic2017 = glm(Y2 ~ AlderSlut + Kon + hman + FodelseLand_gruppering, family = binomial, data = data2 %>% filter(ar == "2017"))
summary(logistic2017)
# View(temp2[,c(4,6,7,13)])

# Remove missing values and use the 2018 data to predict the number of pupils commuting from another municipality to Stockholm municipality
temp2 = temp %>% filter(ar == "2018", skolkommun == "0180", !is.na(Y2))

# Predict the number of pupils commuting from another municipality to Stockholm municipality
pred_model2b = predict(logistic2017, newdata = temp2[, c(4, 6, 7, 13)], type = "response")

# The coords function is used to choose the best threshold
roc_obj_model2b <- roc(temp2$Y2, pred_model2b)
plot(roc_obj_model2b)
coords(roc_obj_model2b, "best", "threshold")
pred_model2b = ifelse(pred_model2b > 0.41, 1, 0)

# Confusion matrix comparing actual and predicted data
table(temp2$Y2, pred_model2b)
# Total travelling (predicted)
sum(pred_model2b == 1)
# Total travelling (actual)
sum(temp2$Y2 == 1)
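The threshold step in the code above relies on `coords(..., "best", ...)` from pROC, whose default criterion is Youden's index (sensitivity + specificity - 1). The following base-R sketch shows what that criterion computes, on simulated data. The vectors `y` and `score` are hypothetical stand-ins for the observed commuting indicator and the predicted probabilities, and the grid of thresholds is an arbitrary choice for illustration.

```r
# Minimal sketch on simulated data; y and score are hypothetical stand-ins.
set.seed(3)
n <- 1000
y <- rbinom(n, 1, 0.3)                    # 1 = commuter (simulated)
score <- plogis(-1 + 1.5 * y + rnorm(n))  # stand-in for predicted probabilities

# Sensitivity and specificity for a grid of candidate thresholds
thresholds <- seq(0.05, 0.95, by = 0.01)
sens <- sapply(thresholds, function(t) mean(score[y == 1] > t))
spec <- sapply(thresholds, function(t) mean(score[y == 0] <= t))

# Youden's index and the threshold that maximizes it
youden <- sens + spec - 1
best <- thresholds[which.max(youden)]

# Classify with the chosen threshold and compare with the actual outcome
pred <- ifelse(score > best, 1, 0)
table(actual = y, predicted = pred)
```

Taking the threshold directly from the ROC object, rather than typing a rounded value by hand, keeps the classification consistent with the fitted model if the data change.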
