A regression analysis of NHL cap hits


KTH Royal Institute of Technology

Mathematical statistics

Bachelor Thesis


Authors:

Stefan Nordenskjöld, Carl Flogvall

May 26, 2014


Abstract

This report studies whether a multiple linear regression can be used to predict the cap hit of NHL forwards. Data were collected from the 2010-2011, 2011-2012 and 2012-2013 seasons. The chosen variables were common hockey statistics and a few that are not hockey-related, such as origin and age.

The initial model was improved by removing insignificant covariates, identified by BIC-tests and p-values. The final model was fitted to 291 players and had an adjusted R²-value of 0,7820. Of the covariates, goals, assists and ice time had the biggest impact on a player’s cap hit.


1 Introduction

1.1 Background

1.1.1 Explaining hockey

Ice hockey (hereafter referred to as hockey) is a popular team sport in the northern hemisphere. It is played on an ice rink, and each team is allowed five players and one goaltender on the ice when the game starts. The goal of the game is to outscore the other team in the allotted time of 60 minutes. The surface of the rink is rectangular with rounded corners and is enclosed by a wall known as the boards. The most common formation is two defensemen and three forwards, where the forward positions are named right wing, center and left wing.

The NHL (National Hockey League) is located in North America and is considered by many to be the best hockey league in the world. It is a closed league and each team operates under a salary cap.


1.1.2 Dictionary

Goaltender - The player who guards the goal and whose job is saving shots from the opposing team and thereby preventing goals.

Defenseman - A position whose main priority is to prevent the opposing team’s goal scoring chances.

Forwards - A position whose main priority is to attack and generate goal scoring chances. A defensive forward also helps prevent the other team from scoring.

Center - The forward who is placed in the middle of the ice, away from the boards.

Goal - Shooting the puck into the net.

Assist - Awarded, when a goal is scored, to the teammates who last touched the puck before the goal scorer. At most two players can be awarded an assist for each goal.

Hit - A body check made by a player to an opponent.

Penalty minutes (PIM) - When a player violates the rules of the game, he is sent to the penalty box for a time that depends on the severity of the violation. This puts the penalized team at a disadvantage, as it is one player short during the penalty.

Plus/Minus - A statistic given to a player on the ice whenever a goal is scored at even strength (when each team has the same number of players on the ice). A point is awarded when his team scores and a point is deducted when his team is scored on.

Cap hit - A player’s total salary for a contract divided by the length of the contract. It is this number that is counted against the salary cap.

All-star game - The All-star game is an exhibition game held by the NHL, and the players are voted in by the fans. Usually the best players from every team are picked to attend the game.

Ice time - The amount of time a player is on the ice during active play per game.


1.2 Purpose

Most sport-related statistical models have focused on baseball or American football, while relatively few have dealt with hockey. It is therefore interesting to see how this sport can be analyzed with statistics. The objective of this report is to build as good a model as possible for estimating NHL players’ salaries. Regression analysis gives a great deal of information on how well statistics can predict a real outcome, and it can answer many questions. How good a model can be created from just basic knowledge of hockey? How much do goals and assists really affect a player’s salary, and which other commonly cited variables influence the dependent variable? It is also important to know how to present data and results. Why do the results turn out the way they do? Can errors occur in the model during the regression, and how can they be prevented? All of this is analyzed and tested in this report, in the hope that the statistical investigation yields useful results and answers.


2 Theory

2.1 Basic Theory

2.1.1 Econometrics

Econometrics is a field in which statistical methods are applied to economic problems. Observed data are given and applied to a chosen model.

2.1.2 Linear regression

Regression analysis is a statistical method whose purpose is to create a powerful model that describes observed data as well as possible in relation to reality. The goal is to find the relationship between the dependent variable y and its covariates x. The regression model is described by

y = xβ + e, (1)

where β is the coefficient estimated from the regression on the observed data and e is the stochastic error term. The error term is assumed to be normally distributed and describes the difference between the model’s estimated value and the observed value; it can be either positive or negative. The error terms are also assumed to be uncorrelated with each other. The error term contains, for example, measurement errors in the data and variables that cannot be, or are not, part of the model. The objective is to find a strong model with a clear statistical connection between the dependent variable and its covariates, where the error term has as little influence as possible.

The first covariate, x₀, is usually called the intercept and has the same value for all observations. The other covariates are chosen to make the model as powerful as possible. A linear regression assumes that the relationship between the dependent variable and the covariates is linear; this assumption is discussed later in this report. The quality of a regression analysis depends on the quality of the available data, because a good model needs data of high quality.


2.1.3 Dummy variable

A dummy variable has only two possible values, yes or no, coded as one and zero. For example, the variable can be male: an observation gets the value one if the person is male and zero otherwise. Dummy variables are useful for testing whether results differ between two groups of observations because of gender or any other characteristic that has two possible answers.

2.1.4 Some assumptions for a classic linear regression

1. The model needs to be linear in its coefficients, but the variables themselves do not need to be linear.

2. The covariates are assumed to have fixed values, given by the observations, during the regression.

3. Given the values of the covariates x, the mean of the error term is zero, as shown in the equation below

E(e_i|X) = 0, (2)

where X stands for all the covariates in the model. The motivation is that the error term consists of random influences that are not systematically related to the x-variables, so the assumption is regarded as reasonable. Because of this, the conditional mean can be written as

E(Y_i|X) = βX + E(e_i|X) = βX, (3)

which gives the model’s mean value.

4. The variance of every observation’s error term should be constant according to

Var(e_i|X) = σ². (4)

5. There should not be any correlation between two different error terms, which gives the relationship

cov(e_j, e_i|X) = 0, i ≠ j, (5)

where cov is the covariance.


6. To avoid multicollinearity, the covariates should not have an exact linear relation between themselves. This is discussed further in section 2.2.1.

7. The model needs to have a correct structure with relevant covariates and needs to have more observations than covariates.

8. The model’s error term should be normally distributed, i.e. e_i ∼ N(0, σ²). This implies that the mean is zero and the variance is constant for every term.

2.1.5 Ordinary least squares method (OLS)

A commonly used method for regression analysis is ordinary least squares (OLS). The error term of observation i is given by

\[ e_i = y_i - \sum_j x_{ij}\beta_j, \tag{6} \]

i.e. the difference between the observed result and the result of the model. The method minimizes the sum of the squared error terms e_i,

\[ \sum_i e_i^2 = \sum_i \Big( y_i - \sum_j x_{ij}\beta_j \Big)^2, \tag{7} \]

which is called the error sum of squares. The β-values are chosen to make this sum as small as possible.
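As an illustration of the minimization in (7), the following is a minimal sketch in Python (the thesis itself used Matlab). It solves the normal equations X'Xβ = X'y, which is one standard way of obtaining the least squares solution. The function names and the toy data are made up for this example and are not taken from the thesis.

```python
def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) beta = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    n, k = len(X), len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular (perfect multicollinearity)")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def error_sum_of_squares(X, y, beta):
    """Sum over i of (y_i - sum_j x_ij * beta_j)^2, as in equation (7)."""
    return sum((yi - sum(x * b for x, b in zip(row, beta))) ** 2
               for row, yi in zip(X, y))


# Toy data (invented): the first column is the intercept x_0 = 1.
X = [[1, 0.0], [1, 1.0], [1, 2.0], [1, 3.0]]
y = [1.0, 3.1, 4.9, 7.0]
beta = ols(X, y)
ess = error_sum_of_squares(X, y, beta)
```

Any other choice of β, for example a slightly shifted intercept, gives a larger error sum of squares on the same data, which is what the minimization in (7) expresses.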

2.1.6 Result and its interpretations

A regression analysis gives estimates of several quantities that are of interest when analyzing what is important for the dependent variable. The main one is the estimate of the parameter β obtained when the regression is run on the observations with the selected model. It shows how much influence each covariate has on the dependent variable. The estimated standard deviation of each coefficient, called the standard error, is also given and determines the width of the confidence interval. The confidence interval shows whether the estimate is separated from zero, i.e. whether it is statistically significant at a certain significance level. This can also be seen from an F- or t-test, which gives a p-value for each covariate.


2.1.7 R², ’the goodness of fit’

To check the explanatory power of the chosen model, it is common to look at the R²-value from the regression. It is given by the ratio between the variance of the model’s estimated values and the variance of the observed values according to

\[ R^2 = \frac{Var(x\hat{\beta})}{Var(y)} = 1 - \frac{Var(\hat{e})}{Var(y)}. \tag{8} \]

The value always lies between zero and one. The higher the R²-value, the better the covariates fit the dependent variable. The value never decreases when more covariates are added, and it is therefore common to use the adjusted value, which takes the degrees of freedom into account.
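The adjustment for degrees of freedom can be sketched as follows. The formula used is the standard adjusted R², which is an assumption here since the report does not state its exact definition; n is the number of observations and k the number of covariates excluding the intercept. With the values reported in tables 3 and 7, it reproduces the reported adjusted values 0,7830 and 0,7820 to four decimals.

```python
def adjusted_r2(r2, n, k):
    """Standard adjusted R^2 for n observations and k covariates
    (intercept not counted). Assumed definition, not quoted from the report."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R^2 values and sample size as reported in tables 3 and 7.
full_model = adjusted_r2(0.7920, n=291, k=12)
improved_model = adjusted_r2(0.7865, n=291, k=6)
print(round(full_model, 4), round(improved_model, 4))
```

The full model has the higher R² but only a marginally higher adjusted value, since it is penalized for its six additional covariates.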

2.1.8 The t-test

The t-test is used to test the null hypothesis that a coefficient is equal to zero. To test the hypothesis, the formula

\[ t = \frac{\beta_i}{SE(\beta_i)}, \tag{9} \]

is used, where SE stands for standard error. With the help of a t-table, the p-value can be calculated, i.e. the probability of observing a t-value at least this extreme if the coefficient were in fact zero. If the p-value is lower than the chosen level of significance, the coefficient is statistically significant.
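A small sketch of the calculation in (9), using the coefficient and standard error reported for Years from median in table 6. To stay within the Python standard library, the p-value is computed from the standard normal distribution, a large-sample approximation of the t-distribution. This is an assumption of the sketch, not the method used in the report, and it gives a p-value slightly below the reported 0,0180.

```python
import math


def t_statistic(beta, se):
    """Equation (9): the coefficient divided by its standard error."""
    return beta / se


def two_sided_p_value_normal(t):
    """Two-sided p-value under the standard normal approximation
    (assumed here in place of a t-table)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Years from median, improved model (table 6).
t = t_statistic(-0.0214, 0.0090)
p = two_sided_p_value_normal(t)
significant_at_5_percent = p < 0.05
```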

2.1.9 The F-test

The purpose of the F-test in this report is to test the significance of each covariate, i.e. the null hypothesis that β_i is equal to β_i⁰ = 0. It is calculated by

\[ F = \left[ \frac{\hat{\beta}_i - \beta_i^0}{SE(\hat{\beta}_i)} \right]^2. \tag{10} \]

The null hypothesis can be rejected if the p-value of the F-statistic is lower than the chosen level of significance, in which case the covariate is statistically significant.


2.1.10 The Bayesian information criterion (BIC)

The Bayesian information criterion (BIC) is used to determine whether a covariate is redundant. The aim is to make the value of

\[ BIC = n \cdot \ln(|\hat{e}|^2) + k \cdot \ln(n) \tag{11} \]

as small as possible, where |ê|² is the sum of the squared residuals, n is the number of observations and k is the number of covariates. By computing the BIC-value with and without a covariate, the two values can be compared to determine the covariate’s impact.
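The BIC formula can be sketched in a few lines. The sketch rests on three assumptions that are not stated in the report: that |ê|² is the residual sum of squares, that k counts the intercept, and that the residual sum of squares can be recovered from the reported SE of regression as SE² · (n − k). Under these assumptions the values land within about 0,1 of the BIC-values reported in section 4.

```python
import math


def bic(residual_sum_of_squares, n, k):
    """BIC = n * ln(|e|^2) + k * ln(n)."""
    return n * math.log(residual_sum_of_squares) + k * math.log(n)


def rss_from_se(se_of_regression, n, k):
    """Recover the residual sum of squares from the SE of regression,
    assuming SE^2 = RSS / (n - k) with k including the intercept."""
    return se_of_regression ** 2 * (n - k)


n = 291
# Full model: 12 covariates plus intercept, SE of regression from table 3.
bic_full = bic(rss_from_se(0.3633, n, 13), n, 13)
# Improved model: 6 covariates plus intercept, SE of regression from table 7.
bic_improved = bic(rss_from_se(0.3642, n, 7), n, 7)
```

The improved model has almost the same residual sum of squares but six fewer coefficients, so its penalty term k · ln(n) is smaller and its BIC is lower.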

2.2 Possible problems with a linear regression

2.2.1 Multicollinearity

Multicollinearity is a problem that occurs when two or more of the model’s covariates are linearly dependent. A common example is when both dummy variables female and male are chosen as covariates although they carry the same information: if male is zero, female is one, and vice versa. To avoid this, the model should be respecified so that one of the covariates is removed. It is also common for continuous variables to be almost linearly dependent, such as age and work experience, which often depend on each other.

The result of this phenomenon is that the estimated standard errors of the regression can become very large. The confidence intervals then become wider and more uncertain, which may lead to a covariate being rejected as insignificant.
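The male/female example can be checked numerically. In the sketch below (toy data, invented for illustration), the intercept column equals the sum of the two dummy columns, so the matrix X'X has determinant zero and the OLS coefficients are not uniquely determined. Dropping one dummy removes the problem.

```python
def gram_matrix(columns):
    """X'X for a design matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]


def determinant(m):
    """Laplace expansion along the first row; fine for tiny matrices."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j]
               * determinant([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))


intercept = [1, 1, 1, 1]
male = [1, 0, 1, 0]
female = [0, 1, 0, 1]  # always equal to 1 - male

det_with_both_dummies = determinant(gram_matrix([intercept, male, female]))
det_with_one_dummy = determinant(gram_matrix([intercept, male]))
```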

2.2.2 Endogeneity

Endogeneity occurs when the error term in a regression analysis is correlated with at least one of the covariates in the model. It makes the estimates unreliable, because OLS requires the error term to be uncorrelated with all covariates in the model. Measurement errors can also create endogeneity, which is one reason why accurate data are preferred.


2.2.3 Sample selection bias

Sample selection bias occurs when the observations for the model are chosen in an incorrect way, which may affect the estimated β-coefficients. A statistical study requires a proper random sample of observations. For example, suppose someone wants to analyze the difference between two school classes’ results on an exam. It is then not appropriate to choose one especially talented class and one ordinary class, because the talented class will obviously score higher, since the level of talent and ambition probably differs between the chosen classes. The same applies when adding a new observation: it has to be drawn at random, so that the observations do not differ systematically from the average and distort the regression result.

2.2.4 Simultaneity

Simultaneity occurs when the dependent variable affects one or more of the covariates. An example is supply and demand: when demand rises, the price goes up, so the dependent variable (demand) affects the covariate price. A dependent variable should not affect its covariates in a regression, because the covariates then become correlated with the error term and the estimates become unreliable.

2.2.5 Missing important covariates

If important covariates are not included in the model, the error term becomes large. This is because the true relationship for the dependent variable contains an important covariate that is missing from the regression. Adding the missing covariate improves the model, and the error term becomes smaller.

2.2.6 Heteroskedasticity

A widespread problem with statistical models is unequal variance in the error term, also known as heteroskedasticity. It can occur for many different reasons, such as the presence of outliers or the combination of observations with large differences in scale. One consequence of heteroskedasticity is that the F-test becomes unreliable, which could lead to faulty conclusions about the data.


3 Method

3.1 Choice of players

Players were selected under the following conditions.

- The player must have played at least one NHL game in each of the seasons 2010-2011, 2011-2012 and 2012-2013.

The model is meant for established NHL players, and the data should cover a longer period of time. The reason is that a single unusually good or bad season should not affect a player’s predicted salary too much.

- The player must have played at least 50 games in at least one of the seasons above.

This restriction removes players who were call-ups from the minor leagues, so that only established NHL players remain.

- The player cannot be on an entry-level contract.

Rookies in the league cannot freely negotiate their contracts, because entry-level salaries are capped. As a result, the contract does not represent the player and does not give a fair picture of his caliber.

- The player must have played as a forward in all seasons.

In some rare cases, a player can play both as a defenseman and as a forward. Since the two positions have different responsibilities, the stats of such players can be misleading, and they are therefore removed from this analysis.
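The four conditions above can be expressed as a simple filter. The record layout below (field names, position codes) is hypothetical and only meant to illustrate the selection logic; the report does not describe how the data were stored.

```python
from dataclasses import dataclass

SEASONS = ("2010-2011", "2011-2012", "2012-2013")
FORWARD_POSITIONS = {"C", "LW", "RW"}  # assumed position codes


@dataclass
class PlayerRecord:
    """Hypothetical record; field names are invented for this sketch."""
    name: str
    games_by_season: dict       # season -> NHL games played
    entry_level_contract: bool
    positions: set              # positions played over the three seasons


def is_eligible(player):
    games = [player.games_by_season.get(season, 0) for season in SEASONS]
    return (
        all(g >= 1 for g in games)           # played in every season
        and max(games) >= 50                 # at least 50 games in one season
        and not player.entry_level_contract  # not on an entry-level contract
        and bool(player.positions)
        and player.positions <= FORWARD_POSITIONS  # forward in all seasons
    )


players = [
    PlayerRecord("Established", {"2010-2011": 82, "2011-2012": 60, "2012-2013": 48}, False, {"C"}),
    PlayerRecord("Call-up", {"2010-2011": 10, "2011-2012": 12, "2012-2013": 8}, False, {"LW"}),
    PlayerRecord("Rookie", {"2010-2011": 70, "2011-2012": 80, "2012-2013": 45}, True, {"RW"}),
    PlayerRecord("Two-way", {"2010-2011": 75, "2011-2012": 78, "2012-2013": 40}, False, {"RW", "D"}),
]
selected = [p.name for p in players if is_eligible(p)]
```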

3.2 Choice of covariates

For the starting full model, the covariates were as follows.

Goals - In order to win games, a team has to score at least one goal. The hypothesis is that goals is the most important covariate.

Assist - Players who can generate offense and put teammates in good scoring positions will get a lot of assists.

PIM - Even though penalty minutes are not a good thing for a team, having them could indicate a player with a lot of physicality which in itself is often viewed as a positive strength.


Plus/Minus - In theory, players who are good at both offense and defense should have a good plus/minus rating, since their team scores while they are on the ice and does not concede many goals at the same time. The problem with this statistic is that players on bad teams will probably have a worse rating than players on good teams even if they are equally good, and it may therefore not affect the cap hit.

Hits - Players who are physical are preferred over players who are not.

Blocked shots - In order to block a shot, a player has to sacrifice his body to stop a scoring attempt. If the shot is blocked, the threat of a goal is removed.

Ice time - The better a player is, the more likely he is to receive more ice time. The only problem is that every team has an equal amount of ice time to divide among its players. A team full of good players will have the same total ice time as a team of bad players to share. Another possible problem is correlation to other stats.

Years from median age - In sports it is often said that players produce the most at a certain age, usually referred to as the peak. This covariate tests whether players are paid more during their peak years. For this model, the peak years are assumed to be around the median age.

All-star game - If a player is nominated to the All-star game, he is probably a fan favorite, which in turn is good for a team’s marketability.

Center - Centers as a rule have more defensive responsibility than wingers and also take face-offs. If a center produces as much offense as a winger, the center will be more valuable.

North American - This dummy is a test to see if origin has an effect on a player’s salary.

Missed games - If a player is injured a lot, his worth could be significantly reduced (depending on the injury). A player can also miss games because he is scratched, i.e. left out of the line-up.


3.3 How to improve the model

One way of improving the full model and finding out which covariates are redundant is to compare BIC-values. When one covariate is eliminated from the full model, the difference in BIC-value shows whether the covariate is needed.

If the BIC-value is lower without a covariate than with it, the covariate is redundant and can be removed from the model. For the starting full model, every covariate is tested in this way, and every covariate with a negative difference is removed. The same routine is then applied to the new model, and the procedure continues until no negative differences remain in the BIC-test.
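The elimination procedure described above can be sketched as a loop. The function `bic_of` stands for "fit the model with these covariates and return its BIC"; it is a placeholder supplied by the caller, and the toy version used here is invented purely so that the example can run without data.

```python
def eliminate_by_bic(covariates, bic_of):
    """Repeatedly drop every covariate whose removal lowers the BIC,
    until no removal gives a negative BIC difference."""
    current = list(covariates)
    while True:
        base = bic_of(current)
        redundant = [
            c for c in current
            if bic_of([d for d in current if d != c]) - base < 0
        ]
        if not redundant:
            return current
        current = [c for c in current if c not in redundant]


# Toy stand-in for a real regression: each covariate lowers the BIC by its
# "gain" but costs a fixed penalty of 5 (all numbers are made up).
GAINS = {"goals": 20.0, "ice_time": 40.0, "center": 1.0, "hits": 2.0}


def toy_bic(covariates):
    return 1000.0 - sum(GAINS[c] for c in covariates) + 5.0 * len(covariates)


kept = eliminate_by_bic(["goals", "ice_time", "center", "hits"], toy_bic)
```

In the toy example, center and hits are removed in the first round because their gain is smaller than the penalty, and the second round finds nothing more to remove.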

3.4 Doing a regression

When the data have been collected for each observation, the regression analysis is performed. This is done in Matlab, with the dependent variable cap hit as y and the observed data as the explanatory variables x. The regression is run and the results are analyzed. If something looks wrong or is not statistically significant, the starting equation is remodeled and a new regression is run. This continues until a satisfactory result is found, after which the outcomes of the regression are explained and evaluated with the help of the theory.


4 Results

4.1 The full model

The starting full model has the parameters given in the equation below

ln(Cap hit) = β₀(Intercept) + β₁(Goals per game) + β₂(Assists per game)
            + β₃(PIM per game) + β₄(Years from median)
            + β₅(Missed games) + β₆(Average ice time) + β₇(Plus/Minus)
            + β₈(Blocked shots per game) + β₉(North American)
            + β₁₀(All-star) + β₁₁(Center) + β₁₂(Hits per game) + e. (12)

BIC for the full model is calculated to

BIC_{Full model} = 1122,2.

The table below shows the BIC-values for the full model without one covariate at the time to see if the variable is redundant.

Covariate                 BIC without covariate   ∆BIC
Goals per game            1133,4                  11,2
Assists per game          1124,0                  1,8
PIM per game              1141,8                  19,6
Years from median         1123,2                  1,0
Missed games              1125,8                  3,6
Average ice time          1164,9                  42,7
Plus/Minus                1117,4                  -4,7
Hits per game             1117,5                  -4,6
Blocked shots per game    1120,0                  -2,2
North American            1116,7                  -5,4
All-star                  1117,1                  -5,1
Center                    1116,7                  -5,4

Table 1: BIC for the full model.

The covariates with negative difference can be removed from the model due to the criterion of a lower BIC-value.


The following table shows the results of the regression of the full model.

Covariate                 β-value    p-value   SE(β)
Intercept                 12,3835    0,0000    0,2022
Goals per game            1,6130     0,0001    0,3959
Assists per game          0,8292     0,0075    0,3080
PIM per game              0,2680     0,0000    0,0535
Years from median         -0,0231    0,0114    0,0091
Missed games              -0,0025    0,0029    0,0008
Average ice time          0,1207     0,0000    0,0170
Plus/Minus                0,2057     0,3491    0,2193
Hits per game             -0,0357    0,3218    0,0360
Blocked shots per game    -0,2454    0,0669    0,1334
North American            -0,0277    0,6248    0,0566
All-star                  -0,0555    0,4426    0,0721
Center                    0,0196     0,6953    0,0499

Table 2: Covariate table for the full model.

Observations              291
R-squared                 0,7920
Adjusted R-squared        0,7830
SE of regression          0,3633
Sum square residual       139,7507
F-statistic               88,2137
Prob of F-statistic       0,0000

Table 3: Miscellaneous information for the full model.

As seen, the p-values of the variables that are going to be removed are too high to reject the null hypothesis, so these variables have no statistical significance in the model. The p-value has to be less than five percent, since the confidence level is 95 percent. The R²-value and its adjusted value are high, which indicates a good fit between the dependent variable and its covariates. As seen from the β-values, the covariates affect the salary to different degrees.


The figure below illustrates how well the error term fits the normal distribution. For a perfect model, the data points would follow the line. There are some deviations in the full model: some of the observations do not fit the model, and the fit to the normal distribution is therefore not perfect.

Figure 1: Normal probability plot over the full model.

4.2 The improved model

Every covariate with a negative BIC difference is removed and is not included in the improved model for the new regression. The improved model has the parameters given in the equation

ln(Cap hit) = β₀(Intercept) + β₁(Goals per game) + β₂(Assists per game)
            + β₃(PIM per game) + β₄(Years from median)
            + β₅(Missed games) + β₆(Average ice time) + e, (13)

and the BIC-value for this model is calculated to

BIC_{Improved model} = 1095,7.


By comparing BIC-values one more time, we see if the model can be reduced any further.

Covariate                 BIC without covariate   ∆BIC
Goals per game            1117,2                  21,6
Assists per game          1109,4                  13,7
PIM per game              1115,6                  19,9
Years from median         1095,8                  0,1
Missed games              1098,2                  2,5
Average ice time          1138,5                  42,9

Table 4: BIC for the improved model.

Table 4 shows that there are no negative BIC differences, so none of the covariates are redundant and the model will not be improved by removing any more of them. As seen, the covariate Years from median improves the BIC-value only marginally.

Another normal probability plot is made, and as seen the result improves slightly compared to figure 1.

Figure 2: Normal probability plot over the improved model.


The error term is approximately normally distributed, as can be seen in the histogram below.

Figure 3: Histogram of the residual.

The figure below shows the error of each observation. The errors show no systematic pattern, which is a good sign with regard to endogeneity.

Figure 4: Plot of error to cap hit in US-dollars.

Figure 4 also confirms that the model is not perfect. It shows how much


The observations at the zero line are the best ones with the chosen model.

The figure below illustrates the impact of each covariate in the model. The contributions of Years from median and Missed games are negative, meaning that the yellow top shows the predicted cap hit. The intercept is not shown in this graph, because it is more interesting to see how the other covariates affect the cap hit.

Figure 5: Illustration of covariates impact on predicted cap hits.

Covariate           Goals per game   Assists per game   PIM per game
Average ice time    0.8104           0.8304             -0.4175

Table 5: Correlation table for the improved model.

Table 5 shows that there is correlation between average ice time and some of the other covariates.
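Correlations such as those in table 5 are typically Pearson correlation coefficients; that this is the measure used in the report is an assumption. The sketch below computes the coefficient for toy data, which are invented and chosen to be perfectly linear so that the result is easy to check.

```python
import math


def pearson_correlation(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Invented data for four players.
ice_time = [10.0, 14.0, 18.0, 22.0]      # minutes per game
goals_per_game = [0.1, 0.2, 0.3, 0.4]    # rises with ice time
pim_per_game = [1.2, 0.9, 0.6, 0.3]      # falls with ice time

r_goals = pearson_correlation(ice_time, goals_per_game)
r_pim = pearson_correlation(ice_time, pim_per_game)
```

A positive coefficient, as for goals and assists in table 5, means that the covariate tends to rise with ice time; a negative one, as for PIM, means that it tends to fall.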


The regression also gives us information about coefficients and significance. The following table shows the results of the regression of the improved model.

Covariate            β-value    p-value   SE(β)    Lower limit   Upper limit
Intercept            12,3646    0,0000    0,1947   11,9813       12,7479
Goals per game       1,7635     0,0000    0,3341   1,1058        2,4212
Assists per game     1,0735     0,0000    0,2429   0,5954        1,5517
PIM per game         0,2371     0,0000    0,0464   0,1457        0,3285
Years from median    -0,0214    0,0180    0,0090   -0,0390       -0,0037
Missed games         -0,0024    0,0047    0,0008   -0,0040       -0,0007
Average ice time     0,1044     0,0000    0,0145   0,0758        0,1331

Table 6: Covariate table for the improved model.
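The lower and upper limits in table 6 are consistent with a 95 percent confidence interval of the form β ± t · SE(β). The sketch below assumes a critical value of roughly 1,9684, which is the approximate 97,5 percent quantile of the t-distribution with 284 degrees of freedom (291 observations minus 7 estimated coefficients). Both the critical value and the degrees of freedom are assumptions of this example and are not stated in the report.

```python
T_CRITICAL = 1.9684  # assumed: approx. 97.5% quantile of t with 284 d.f.


def confidence_interval(beta, se, t_critical=T_CRITICAL):
    """Two-sided interval beta +/- t_critical * SE(beta)."""
    half_width = t_critical * se
    return beta - half_width, beta + half_width


# Goals per game, improved model (table 6).
lower, upper = confidence_interval(1.7635, 0.3341)
# The interval excludes zero, matching the low p-value in the table.
excludes_zero = lower > 0 or upper < 0
```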

Observations              291
R-squared                 0,7865
Adjusted R-squared        0,7820
SE of regression          0,3642
Sum square residual       138,7866
F-statistic               174,4107
Prob of F-statistic       0,0000

Table 7: Miscellaneous information for the improved model.

The values above are the ones used in our final improved model. All covariates are statistically significant, as seen from the p-values. The β-values of the coefficients differ a lot, and some are negative, meaning that the corresponding covariates lower the predicted salary. Goals and assists are important for a forward who wants to be well paid in the NHL. The goodness of fit is almost the same as in the full model, while the F-statistic has increased from 88,2137 to 174,4107.

The final improved model is

ln(Cap hit) = 12,3646(Intercept) + 1,7635(Goals per game)
            + 1,0735(Assists per game) + 0,2371(PIM per game)
            − 0,0214(Years from median) − 0,0024(Missed games)
            + 0,1044(Average ice time) + e. (14)
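To show how equation (14) can be used, the sketch below evaluates it for a hypothetical forward. The player’s numbers are invented, and the units (ice time in minutes per game, years from median as an absolute distance, missed games as a count) are assumptions, since the report does not spell them out. Because the dependent variable is ln(Cap hit), the prediction is obtained by exponentiating, and each coefficient acts multiplicatively on the cap hit.

```python
import math

# Coefficients of the final improved model, equation (14) / table 6.
COEFFICIENTS = {
    "goals_per_game": 1.7635,
    "assists_per_game": 1.0735,
    "pim_per_game": 0.2371,
    "years_from_median": -0.0214,
    "missed_games": -0.0024,
    "average_ice_time": 0.1044,
}
INTERCEPT = 12.3646


def predicted_log_cap_hit(player):
    """Right-hand side of equation (14) without the error term e."""
    return INTERCEPT + sum(COEFFICIENTS[name] * player[name]
                           for name in COEFFICIENTS)


# Hypothetical forward; all values are invented for illustration.
example_player = {
    "goals_per_game": 0.3,
    "assists_per_game": 0.4,
    "pim_per_game": 0.5,
    "years_from_median": 2,
    "missed_games": 5,
    "average_ice_time": 17,
}

log_cap_hit = predicted_log_cap_hit(example_player)
cap_hit_dollars = math.exp(log_cap_hit)

# Multiplicative effect on the cap hit of one extra unit of ice time.
ice_time_multiplier = math.exp(COEFFICIENTS["average_ice_time"])
```

Under the unit assumptions above, the multiplier for ice time is about 1,11, i.e. one extra unit of average ice time raises the predicted cap hit by roughly 11 percent, all else being equal.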


5 Discussion

5.1 The indifferent variables

The variables that were in the full model but excluded from the improved model are blocked shots, hits, North American, plus/minus, All-star and center.

The most surprising excluded variable was center, seeing how hockey experts often stress how important centers are. Perhaps the center’s significance is captured by the included covariates, making the dummy redundant, or maybe a center is not more valuable than a winger.

One fault with this covariate is that the players’ positions were taken from the NHL’s own website, which lists only one position per player. Some players play both on the wing and as center, for example the league stars Henrik Zetterberg and Patrick Kane. The exclusion of blocked shots and hits was less surprising, as teams often fill their bottom lines with grinders. Grinders are often low paid, and their main role is to play a physical game to wear the opponent down. They often rack up a lot of blocked shots and hits relative to their offensive production. Hits and blocked shots are also very subjective stats, as they can be hard to measure.

Origin does not seem to have an impact on a player’s salary, which can be seen as positive: the NHL does not appear to be xenophobic, and players are judged only on their on-ice production. Players who are All-stars do not make more money than their counterparts merely because they played in an All-star game. This could be because a player’s popularity among fans does not affect his wage. Plus/minus is one of the most referenced stats in hockey but does not seem to affect a player’s cap hit. It is a team-related statistic that depends on the player’s role in the team. Players on bad teams will have a worse plus/minus overall, and players who are matched against top offensive opponents will also have a worse rating.

5.2 The improved model covariates

It was no big surprise that goals and assists are two of the covariates of the improved model. They are the de facto way of measuring offensive production and how good a player is. The player awarded most valuable player for a season is usually one of the players who scored the most points.


PIM is the most confusing covariate, seeing how blocked shots and hits did not make it into the improved model. Looking at the PIM leaders in the NHL, many of them are enforcers. Enforcers are players whose key priority is to protect their teammates from dangerous hits by fighting any opponent who delivers them. This leads to them having a lot of penalty minutes but very little offensive production and ice time. This can be observed in figure 5 and table 5.

Years from median age has a negative β-value, meaning that the further a player is from the median age, the lower his cap hit. This may be because players get greedy during their prime years, or because teams are willing to pay more for players in an age group where they have experience but are not too old. This covariate also has little impact on the cap hit. Missed games also has a negative β-value, which means that missing games is bad for a player’s salary, which makes sense. Ice time has the best p-value of all covariates (apart from the intercept) and has the biggest effect.

It is interesting that ice time matters more than goals and assists, meaning that how much the coach trusts the player (by giving him ice time) is more important than the points he produces on the ice.

5.3 Errors in the model

The biggest weakness of this model is the way it is constructed. It is meant to describe what a player of a certain caliber makes, not what he should make.

Take for example the star Claude Giroux, who signed a low-paid contract when he was younger and had not yet broken through. He later developed into one of the best players in the NHL but still has a low-paid contract. The model only takes into account how his current contract relates to his current production; measured against his production when the contract began, the contract would look fairer. Another source of error is unrestricted free agency (UFA). Before a player reaches a certain age he is tied to a team, meaning he cannot negotiate contracts with other teams. As a result, players who reach that age and test the free agent market often get overpaid.

The three players with the greatest error in absolute dollars all took pay cuts. All of them did so because they wanted to win the playoffs. Every team can only spend so much on player salaries, and with the pay cuts the team could sign better players. Also two of the players’ salaries would have


there is no restriction in the model on how big a player’s cap hit could be.

The correlation test between ice time and the other covariates shows that goals, assists and ice time are strongly correlated, as was assumed. Since they all have great influence in hockey, we chose to keep all of them in the final model. They are correlated because a player has to be on the ice to score points, and the more he is on the ice, the more chances he has to score. That PIM is not positively correlated with ice time is probably because of the enforcers discussed earlier.

5.4 Conclusion

The model shows that statistics can be used to determine a player’s cap hit to some degree. It also shows that in hockey, unlike baseball, one cannot rely on stats alone, even when other factors are taken into account. This is probably because hockey is a more complex sport than baseball, where the roles on the field are clearer. Many things a hockey player does, for example board play or energy on the ice, do not show up in the stats. It is hard to quantify the details of the game. It is therefore more difficult to define a powerful model in hockey; in our model only simple stats such as goals, assists and ice time really affect the salary. When it was realized that very specific data would be needed to prevent some of the errors that occurred in the regression analysis, it was accepted that the model would not be a good fit for all players. Instead, as discussed above, it is possible to describe why some players do not fit the model.

The results show how hard it is to produce a good, reliable model for such a complex sport as hockey. Outliers from different sources made the error large for some players. The conclusion is therefore that hockey may not be the best sport for analyzing player salary in relation to goals and other covariates, since the error for some players gets too large. The model can give a hint of what the cap hit should be, but it is not wise to follow it blindly.


6 References

Books

Lang, Harald. Nov.2013. Topics on Applied Mathematical Statistics, chapter 1-2. Sweden: KTH

Gujarati, Damodar. 2011. Econometrics by Example, chapter 1-2. Great Britain: Palgrave Macmillan

Internet

NHL salary cap information. 2014. www.capgeek.com

NHL official homepage. 2014. http://www.nhl.com/ice/playerstats.htm

Yahoo sport website. 2014. www.sports.yahoo.com/nhl/

Sporting charts. 2014. www.sportingcharts.com/nhl/


Contents

1 Introduction

1.1 Background

1.2 Purpose

2 Theory

2.1 Basic Theory

2.2 Possible problems with a linear regression

3 Method

3.1 Choice of players

3.2 Choice of covariates

3.3 How to improve the model

3.4 Doing a regression

4 Results

4.1 The full model

4.2 The improved model

5 Discussion

5.1 The indifferent variables

5.2 The improved model covariates

5.3 Errors in the model

5.4 Conclusion

6 References