
Regression analysis: An evaluation of the influences behind the pricing of beer

Regressionsanalys: En utvärdering av influenserna bakom prissättningen av öl

Sara Eriksson and Jonas Häggmark

Spring semester 2017


1 Preface

This project is a bachelor thesis on multiple linear regression, created by Sara Eriksson and Jonas Häggmark during the spring semester of 2017.

We wish to thank our mentor, Pierre Nyquist, for all of his advice and guidance throughout the project.


2 Abstract

This bachelor thesis in applied mathematics is an analysis of which factors affect the pricing of beer on the Swedish market. A multiple linear regression model is created with the statistical programming language R through a study of the influences of several explanatory variables. These variables include, for example, country of origin, beer style, volume sold and a Bayesian weighted mean rating from RateBeer, a popular website for beer enthusiasts. The main goal of the project is to find significant factors and, as a direct consequence, a significant model without any influence of multicollinearity.

The regression analysis is based on a data set with 1413 observations, representing beers that sold over 1000 liters, among further restrictions, and is created from Systembolaget's sales statistics for 2016 and RateBeer. This number of observations represents 43% of Systembolaget's total assortment of beer.

The model is developed through a thorough residual analysis, transformations of variables, determination of multicollinearity and a validation of the absence of outliers and high leverage points. All of these steps aim at significance at the 95% level. In addition to the regression model, two submodels with associated box plots for the variable groups Country of Origin and Beer Style are created in order to analyze the relative importance of these variables. A k-fold cross validation study and three different variable selections are carried out for further adequacy checking; these are also given as recommendations for continued analysis.

The result shows that several different factors affect the pricing of beer. For example, higher alcohol by volume, sour beers and beers from New Zealand yield a higher price, while beers with high sales, lagers and Austrian beers show a negative tendency for the price. The result can be used as an example of the influences behind the pricing of beer in Sweden.

The first model in the analysis has 41 explanatory variables; in the final model the number of explanatory variables is reduced to 20, all of which are significant.


3 Sammanfattning

Detta kandidatexamensarbete i tillämpad matematik är en analys av vilka faktorer som påverkar prissättningen på öl på den svenska marknaden. En multipel regressionsmodell har skapats med det statistiska programmeringsspråket R genom en studie av influenserna för ett antal regressorvariabler. Dessa variabler inkluderar bland andra ursprungsland, ölstil, såld volym och ett Bayesiskt viktat medelvärde från RateBeer, vilket är en populär hemsida för ölentusiaster. Huvudmålet med projektet är att finna signifikanta faktorer och, som då medföljer, en signifikant modell utan någon influens av multikolinjäritet.

Regressionsanalysen är baserad på en uppsättning data för 1413 observationer representerande de öl som sålt mer än 1000 liter, bland ytterligare restriktioner, och är skapad från Systembolagets försäljningsstatistik från 2016 och RateBeer. Detta antal observationer representerar 43% av Systembolagets totala ölutbud.

Modellen är utvecklad genom en grundlig residualanalys, transformationer av variabler, bestämning av multikolinjäritet samt en validering av frånvaron av avvikande värden och punkter med högt inflytande. Allt detta för att nå en signifikansnivå på 95%. Utöver regressionsmodellen så skapas två submodeller med tillhörande låddiagram för variabelgrupperna Ursprungsland och Ölstil för att analysera betydelsen av dessa variabler sinsemellan. En k-fold korsvalideringsstudie samt tre olika variabelselektioner utförs för vidare lämplighetskontroll och dessa ges även som rekommendationer för fortsatt analys.

Resultatet visar att det finns flera olika faktorer som påverkar prissättningen på öl. Exempelvis ger ökad alkoholhalt, syrliga öl och öl från Nya Zeeland ett högre pris samtidigt som hög försäljning, lageröl och österrikisk öl visar en negativ tendens för priset. Resultatet kan användas som ett exempel på influenserna bakom prissättningen av öl i Sverige.

Den första modellen i analysen har 41 regressorer och i den slutliga modellen har antalet regressorer reducerats till 20 där alla är signifikanta.


List of Figures

1  Monks have a great historical influence on the art of brewing.
2  Survey carried out by SurveyMonkey.
3  Summary of the final cost of a craft beer due to each expense.
4  Normal probability and histogram plot for the residuals.
5  Ordinary and scaled residuals against the predicted values.
6  The model residuals versus the fitted values for the original model.
7  Normal Q-Q plot and histogram for the original model.
8  Scale-location plot for the original model.
9  Leverage plot for the original model.
10 Logarithmic transformation of the response variable and the variable representing item price.
11 Logarithmic transformation of the response variable and the variable representing volume in ml.
12 Second model, developed with logarithmic transformations.
13 Model developed through multicollinearity analysis.
14 Model B, developed through analysis of dummy variables.
15 Outliers and high leverage points.
16 Predicted price against actual price, original model.
17 Predicted price against actual price, final model.
18 Submodel country of origin.
19 Normal probability plots.
20 Submodel beer styles.
21 Countries of origin.
22 Beer styles.
23 Package.
24 Organic.
25 In stock.
26 Rated.
27 Final model.


Contents

1 Preface
2 Abstract
3 Sammanfattning
4 Introduction
  4.1 Background
    4.1.1 The History of Beer
  4.2 Purpose
  4.3 Problem Definition
  4.4 Data Set
  4.5 Problem Restrictions
  4.6 Literature Analysis
5 Mathematical Theory
  5.1 Multiple Linear Regression
    5.1.1 Heteroscedasticity
  5.2 Residual Analysis
  5.3 Model Adequacy
    5.3.1 Residual Plots
  5.4 Outliers and High Leverage Points
  5.5 Transformations of Variables
  5.6 Multicollinearity
  5.7 Variable Selection
  5.8 Cross Validation
6 Data Set
  6.1 Restrictions in Data
  6.2 List of Variables
  6.3 About RateBeer
7 Analysis and Model Development
  7.1 Residual Analysis
  7.2 Transformations of Variables
  7.3 Multicollinearity
  7.4 Analysis of Significance
  7.5 Outliers and High Leverage Points
  7.6 Cross Validation
  7.7 Variable Selection
8 Submodels
  8.1 Country of Origin
  8.2 Beer Styles
  8.3 Box Plots
9 Results
  9.1 Final Model
  9.2 Submodels
  9.3 Calculation Example
10 Discussion
  10.1 Analysis of Variables
    10.1.1 Quantitative Variables
    10.1.2 Dummy Variables Except Beer Styles and Countries of Origin
    10.1.3 Beer Styles
    10.1.4 Countries of Origin
  10.2 Looking Back at the Literature Analysis
  10.3 Recommendations
11 Appendix
  11.1 Variable Selection Tables
    11.1.1 Best Subset Selection
    11.1.2 Forward Subset Selection
    11.1.3 Backward Subset Selection


4 Introduction

In this project, a study of how different factors affected the pricing of beer at Systembolaget during 2016 is carried out. The study analyzes a multiple linear regression model with several influential parameters.

4.1 Background

The Swedish government holds a monopoly on alcohol sales through Systembolaget. This means that Systembolaget is the only store in the country allowed to sell alcoholic beverages above 3.5% alcohol by volume.

Since the beer industry is highly trendy at the time of writing, a closer look at this branch is interesting. The trend is strongly driven by the large increase in craft beers, and the number of breweries in Sweden is currently the greatest in its history. These breweries vary greatly in size, with the smallest having just a few workers while the biggest companies operate on an international scale. The beer market has expanded considerably over the last ten years and is still expanding due to demand and a growing interest. The trend suggests that this expansion will continue for at least another decade, and therefore an analysis of which parameters actually matter for the pricing of beer can be helpful when examining the market. To see which factors determine the pricing of beer in Sweden today, as well as their level of influence, a multiple regression model will be created and analyzed.

4.1.1 The History of Beer

About 10 000 - 15 000 years ago, humans ceased to be migratory hunters and gatherers and instead formed organized communities where they started to grow cereals. The first type of bread was made of barley; the bread was crumbled and added to water, producing a mash which was then prepared into a beverage said to make people "exhilarated, wonderful and blissful".

Remains of breweries and detailed descriptions of how to produce and drink beer, sometimes with complete registers of different types of beer, have been found at several locations in the former Sumerian kingdom in Mesopotamia. Similar remains have been found in other ancient civilizations, from the river Nile to Mount Ararat, from modern Egypt to Iraq and Iran. Objects from these findings are collected at the museum of the University of Pennsylvania, and through its research the remains of beer have been identified on an earthenware vessel from Iraq that is more than 5 000 years old.

As the knowledge of growing cereals spread north and west, the method of brewing followed in its tracks. The Romans were accustomed to wine but noted that people in the north were drinking beer. Many of the places where the beer industry has its strongest roots today are areas where the Celts settled early on, from central Europe to Ireland.

After the Middle Ages, the Christian monasteries became centers of agriculture, knowledge and science. The art of brewing was improved, at first to produce beer for the brothers and traveling pilgrims, and later as a way to finance monastic life.


Figure 1: Monks have a great historical influence on the art of brewing.

In many countries with ancient brewing methods, beer is considered part of the national identity. During the Middle Ages many royal courts were granted brewing rights in order to raise money for the city, and some noble families are still in the business. Finally, industrial capitalism gave the brewing industry the form it has today. [5]

4.2 Purpose

The results and the complete analysis in this thesis will be of interest to beer producers who are about to price and release a new product on the market. Analysts at Systembolaget may use the model to negotiate prices with producers and to examine the modelled prices against the market prices of products in their assortment. Customers at Systembolaget can compare the price of a desired beer with the model. The analysis can also have a potential impact on already set prices, since these can be compared with the model to determine whether a product is over- or underpriced.

4.3 Problem Definition

In this thesis an analysis of the pricing of beer is carried out. More precisely, the question is: which parameters affect the pricing of beer, and how significant are they? To answer this question, a large data set is collected and analyzed in order to obtain a model that describes the problem. The main goal of the project is to find significant factors and, as a direct consequence, a significant model without any influence of multicollinearity.


4.4 Data Set

The data set in this project consists of several parameters such as volume, sales, beer styles and a qualitative rating variable. The parameters are collected from Systembolaget's sales statistics for 2016, Systembolaget's website and from www.RateBeer.com. The qualitative rating from RateBeer was obtained in order to include an external variable apart from Systembolaget's own data.

4.5 Problem Restrictions

The sales statistics from Systembolaget for 2016 contain a large number of samples. Creating the data set proved to be unproblematic since all data was easily accessible on the web. The most difficult part of the data collection was to set the sales limit in such a way that enough observations were included while at the same time removing non-representative beers. The dummy variables are grouped under the condition that the number of observations is sufficient.

4.6 Literature Analysis

The number of previous studies directly related to the subject of this thesis is strictly limited. However, regression models have been created with emphasis on other alcoholic beverages, such as wine and whisky, in previous bachelor theses.

A bachelor thesis on the influences behind the pricing of wine [7] in Systembolaget's assortment indicates that the estimated coefficient for alcohol by volume is almost 17, i.e. the logarithm of the price per liter of wine increases with the alcohol by volume multiplied by a factor of 17. In addition to this estimate, another bachelor thesis, on the influences behind the pricing of malt whisky [11], states that the alcohol by volume is instead multiplied by a factor of 7.11 · 10^{-2}.

Alongside these mathematical theses there are several websites that discuss the different factors determining the prices of beers. According to SurveyMonkey [8] the factors that consumers are most interested in when choosing what beer to purchase are taste, price, style of beer and the brewery that produces the beer. The outcome of the survey resulted in the diagram in Figure 2.

Figure 2: Survey carried out by SurveyMonkey.


An article in the Huffington Post [6] states that the price of a craft beer can be divided into parts by expense. The aim of the article is to describe why a craft beer may be more expensive than a beer distributed by what the article calls macrobreweries. The division of expenses is presented graphically in Figure 3.

Figure 3: Summary of the final cost of a craft beer due to each expense.


5 Mathematical Theory

Regression analysis is a widely used statistical technique for investigating and modeling the relationship between variables.

5.1 Multiple Linear Regression

The multiple linear regression model is defined by

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon (1)

where y is the response variable, x_i, i = 1, ..., p, is the set of explanatory variables, \beta_i are the coefficients relating the response to the explanatory variables and \varepsilon is the random error component.

It is assumed that:

1. The relationship between the response y and the regressors is linear, at least approximately.
2. The error term \varepsilon has zero mean.
3. The error term \varepsilon has constant variance \sigma^2.
4. The errors are uncorrelated.
5. The errors are normally distributed.

These assumptions can be tested through residual analysis.

It is often more practical to write the multiple linear regression model in matrix notation,

y = X\beta + \varepsilon,

where

y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.

The vector of fitted values \hat{y} can be written as

\hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = Hy (2)

where H is the commonly used hat matrix, an n \times n matrix that maps the vector of observed values into the vector of fitted values. [1]
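As an illustration of equation (2), the least-squares estimate \hat{\beta} = (X'X)^{-1}X'y can be computed directly with elementary matrix operations. The thesis's own analysis uses R; the sketch below is a plain-Python illustration with invented example data, and the helper functions are not part of any library.

```python
# Sketch of the least-squares estimate beta_hat = (X'X)^{-1} X'y from
# equation (2), using plain Python lists. Data are invented for illustration.

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    # Dot product of each row of A with each column of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(M):
    """Gauss-Jordan inversion of a small nonsingular square matrix."""
    n = len(M)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [v - f * w for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]

def ols(X, y):
    Xt = transpose(X)
    XtX_inv = inverse(matmul(Xt, X))
    beta = matmul(matmul(XtX_inv, Xt), [[v] for v in y])
    return [b[0] for b in beta]

# Design matrix with an intercept column and two regressors (invented data).
X = [[1, x1, x2] for x1, x2 in [(1, 2), (2, 1), (3, 4), (4, 3), (5, 7)]]
y = [1 + 2 * x1 + 3 * x2 for _, x1, x2 in X]  # exact relation, no noise

beta_hat = ols(X, y)
print(beta_hat)  # close to [1.0, 2.0, 3.0]
```

Because the response is generated without noise, the fit recovers the coefficients up to floating-point error; a real analysis would of course use R's lm() or an equivalent routine.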

5.1.1 Heteroscedasticity

If assumption 3 is not fulfilled, i.e. the error term does not have constant variance, then heteroscedasticity is present. The linear model's standard errors, confidence intervals and hypothesis tests rely on this assumption. To remedy this problem, a transformation of the response y is one possible solution. [2]


5.2 Residual Analysis

The residual is the difference between the observed value y_i and the corresponding fitted value \hat{y}_i and is defined as

e_i = y_i - \hat{y}_i, \quad i = 1, \dots, n. (3)

The residuals have zero mean and their approximate average variance is estimated by the mean square error,

MS_{Res} = \frac{\sum_{i=1}^{n} e_i^2}{n - 2}. (4)

The residuals are not independent. However, this has little effect on their use for model adequacy checking as long as n is not small relative to p, the number of parameters. Scaling the residuals is helpful in finding observations that are outliers or extreme values; see section Outliers and High Leverage Points. There are four different scalings of residuals; these are listed below.

1. Standardized Residuals. The residual is scaled by the square root of its approximate average variance, and therefore has zero mean and approximately unit variance. This scaling is used to detect potential outliers; a large value (d_i > 3) is considered to indicate such an observation. The standardized residuals are defined as

d_i = \frac{e_i}{\sqrt{MS_{Res}}}, \quad i = 1, \dots, n. (5)

2. Studentized Residuals. To improve the residual scaling, since the mean square error used as variance is an approximation, the i:th residual is instead divided by its exact standard deviation. The studentized residual is defined as

r_i = \frac{e_i}{\sqrt{MS_{Res}(1 - h_{ii})}}, \quad i = 1, \dots, n (6)

where h_{ii} is the i:th diagonal element of the hat matrix. The variance of this scaled residual is constant and equal to 1, regardless of the location of x_i, when the form of the model is correct.

3. PRESS Residuals. If an observation i is very unusual with respect to the data set, the regression model may be overly influenced by it. The fitted value \hat{y}_i may then be very similar to the observed value y_i, so the residual e_i will be small and the outlier difficult to detect. The i:th PRESS residual is calculated by

e_{(i)} = \frac{e_i}{1 - h_{ii}}, \quad i = 1, \dots, n (7)

and possible high-influence points are marked where the PRESS residual is large.

4. R-student. The R-student residual is defined as

t_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, \dots, n (8)

where

S_{(i)}^2 = \frac{(n - p) MS_{Res} - e_i^2/(1 - h_{ii})}{n - p - 1}, \quad i = 1, \dots, n. (9)

Here S_{(i)}^2 is an estimate of the variance \sigma^2 and h_{ii} is the i:th diagonal element of the hat matrix. The R-student is similar to the studentized residual, although it will differ from it in many situations. If the i:th observation is influential, S_{(i)}^2 may differ from MS_{Res}, and the R-student will be more sensitive to this point. [1]
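The four scalings above can be sketched directly from equations (4)-(9). The residuals and leverages below are hypothetical values standing in for the output of a fit (the thesis computes these in R), with p chosen as an assumed parameter count:

```python
# Illustrative computation of the four scaled residuals (equations 4-9)
# for hypothetical raw residuals e_i and hat-matrix diagonals h_ii.
import math

e = [0.5, -1.2, 0.3, 1.5, -0.8, 0.1, 0.6, -0.4]   # hypothetical residuals
h = [0.10, 0.25, 0.15, 0.40, 0.20, 0.12, 0.18, 0.22]  # hypothetical leverages
n, p = len(e), 3                                   # p parameters (assumed)

ms_res = sum(ei**2 for ei in e) / (n - 2)                        # equation (4)
d = [ei / math.sqrt(ms_res) for ei in e]                         # standardized (5)
r = [ei / math.sqrt(ms_res * (1 - hi)) for ei, hi in zip(e, h)]  # studentized (6)
press = [ei / (1 - hi) for ei, hi in zip(e, h)]                  # PRESS (7)
s2 = [((n - p) * ms_res - ei**2 / (1 - hi)) / (n - p - 1)        # equation (9)
      for ei, hi in zip(e, h)]
t = [ei / math.sqrt(s2i * (1 - hi)) for ei, hi, s2i in zip(e, h, s2)]  # R-student (8)
```

Note that the studentized and R-student versions inflate residuals at high-leverage points (large h_ii), which is exactly what makes them useful for outlier detection.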

5.3 Model Adequacy

In order to quantify the extent to which the model fits the data, the residual standard error, the F- and t-statistics, the p-values and the R^2 statistic can be analyzed.

Roughly interpreted, a small p-value indicates that there is a relationship between the response y and the explanatory variable x to which the value belongs. A small value is typically 5% or 1%. This project aims to develop a model where all explanatory variables are significant, and therefore the p-value is of great interest.

Residual Standard Error. With each observation there is an associated error term \varepsilon, and the RSE is an estimate of the standard deviation of these error terms. It is the average amount by which the response deviates from the true regression line and is computed by

RSE = \sqrt{\frac{1}{n - 2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}. (10)

An acceptable RSE value depends on the problem context.

R^2 statistic. Independent of the scale of the response y, the R^2 statistic takes on a value between 0 and 1 which explains the proportion of variance accounted for by the model,

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}. (11)

The numerator of the fraction is the residual sum of squares, RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, and the denominator is the total sum of squares, TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2. An R^2 statistic close to 1 indicates that a large proportion of the variability in the response has been explained by the regression model.

In this report R^2 is also called Multiple R^2, since this is a multiple regression analysis, as opposed to the Adjusted R^2, which adjusts the Multiple R^2 for the number of explanatory variables as follows:

\text{Adjusted } R^2 = 1 - \frac{RSS/(n - p - 1)}{TSS/(n - 1)}. (12)

F-statistic. When testing the null hypothesis

H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0 (13)

versus

H_1: \beta_j \neq 0 \text{ for at least one } j \in \{1, \dots, p\}, (14)

it is examined whether there is a relationship between the response y and any of the p predictors x_i, i = 1, \dots, p. In a regression model this can be tested by computing the F-statistic,

F = \frac{(TSS - RSS)/p}{RSS/(n - p - 1)}. (15)


For a large number of observations n, even a value of the F-statistic only slightly larger than 1 may suggest that the null hypothesis can be rejected in favor of the hypothesis that there is a linear relationship between the response and the predictors.

t-statistic. Similar to the F-statistic, the value of the t-statistic is used for hypothesis testing. The t-statistic is defined by

t_i = \frac{\hat{\beta}_i - 0}{se(\hat{\beta}_i)} (16)

and measures the number of standard deviations that \hat{\beta}_i is away from 0, where se(\hat{\beta}_i) denotes the standard error of the i:th regression coefficient. In other words, the value of t_i is used to test whether the corresponding regression coefficient differs from zero. [2]
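The adequacy statistics in equations (10)-(12) and (15) can be sketched for hypothetical observed and fitted values (invented numbers; in the thesis these figures come from R's summary() of the fitted model):

```python
# Sketch of RSE, R^2, Adjusted R^2 and the F-statistic (equations 10-12, 15)
# for hypothetical observed values y and fitted values y_hat.
import math

y     = [10.0, 12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.5]   # invented data
y_hat = [10.4, 11.5, 14.6, 11.3, 14.2, 13.1, 15.5, 12.9]   # invented fit
n, p = len(y), 2                                            # p regressors (assumed)

rss = sum((yi - fi)**2 for yi, fi in zip(y, y_hat))         # residual sum of squares
y_bar = sum(y) / n
tss = sum((yi - y_bar)**2 for yi in y)                      # total sum of squares

rse = math.sqrt(rss / (n - 2))                              # equation (10)
r2 = 1 - rss / tss                                          # equation (11)
adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))          # equation (12)
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))            # equation (15)
```

As the text notes, the Adjusted R^2 is always at most the Multiple R^2, since it penalizes the number of explanatory variables.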

5.3.1 Residual Plots

By plotting the residuals it is possible to validate the model adequacy. When the residuals are plotted against the predicted values, a horizontal band containing the residuals suggests that the variance is constant, assumption 3 in section Multiple Linear Regression. The normal probability plot is analyzed to examine whether the data set is approximately normally distributed.

As a visual example, two plots from the model development in this thesis are presented. The normal probability plot and the histogram of the residuals, see Figure 4, indicate that the normality assumption is approximately fulfilled. However, some of the data diverge, as can be seen in the tails of the histogram and in the extremities of the normal probability plot.

Figure 4: Normal probability and histogram plot for the residuals.

In Figure 5 the ordinary residuals and the square root of the studentized residuals are plotted against the fitted values, respectively. The plots indicate that the residuals depend on the fitted values, which is a sign of heteroscedasticity.

Figure 5: Ordinary and scaled residuals against the predicted values.

5.4 Outliers and High Leverage Points

A point for which y_i is far from the value predicted by the model is called an outlier. Such points can arise for a variety of reasons and can have a severe effect on the regression model. Studentized residuals plotted against the predicted values are helpful for investigating outliers. If an outlier comes from a faulty measurement or analysis, it should be corrected or deleted from the data set.

Another problem that might appear in the model is an unusual value for x_i, a so-called high leverage point. A leverage point located remotely in x space compared to the rest of the sample may control certain model properties. If the point lies on the regression line it will not affect the estimates of the regression coefficients, but it will have a large effect on model statistics such as R^2, defined in section Model Adequacy, and on the standard errors of the regression coefficients. An influential point has an unusual location in terms of both x and y, and will have an impact on the regression line as it drags the line towards itself.

Influential points can be identified by analyzing the elements of the hat matrix H, where h_{ij} can be interpreted as the amount of leverage exerted by the j:th observation y_j on the i:th fitted value \hat{y}_i. The hat matrix diagonal is a standardized measure of the distance of the i:th observation from the center of the x space, so large diagonal values combined with a large residual indicate potentially influential observations.

To measure influence, Cook's distance D_i uses the squared distance between the least-squares estimate based on all n points, \hat{\beta}, and the estimate obtained by deleting the i:th point, \hat{\beta}_{(i)}. Points with large values of D_i have considerable influence on the least-squares estimates \hat{\beta}. D_i is compared against an F-distribution and, as a rule of thumb, values above 1 are considered influential. D_i can be expressed as

D_i = \frac{r_i^2}{p} \frac{\mathrm{Var}(\hat{y}_i)}{\mathrm{Var}(e_i)} = \frac{r_i^2}{p} \frac{h_{ii}}{1 - h_{ii}}, \quad i = 1, \dots, n. (17)

D_i provides information about the effect of observations on the estimated coefficients \hat{\beta} and the fitted values \hat{y}_i. [1] [2]
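The rule of thumb above can be sketched as a direct application of equation (17). The studentized residuals, leverages and parameter count below are hypothetical (the thesis itself obtains them from its R model fit):

```python
# Sketch of equation (17): Cook's distance from studentized residuals r_i
# and hat-matrix diagonals h_ii, flagging observations with D_i > 1.

r = [0.4, -2.8, 0.9, 1.1, -0.3]      # hypothetical studentized residuals
h = [0.15, 0.55, 0.10, 0.30, 0.20]   # hypothetical leverages
p = 3                                 # number of model parameters (assumed)

D = [ri**2 / p * hi / (1 - hi) for ri, hi in zip(r, h)]   # equation (17)
influential = [i for i, Di in enumerate(D) if Di > 1]      # rule of thumb
print(influential)  # → [1]
```

Only the second observation, which combines a large residual with high leverage, is flagged; a large residual or a high leverage alone is not enough.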

5.5 Transformations of Variables

Transformation of variables is a method used when the assumptions listed in section Multiple Linear Regression are not fulfilled. A usual starting point in regression analysis is the assumption of a linear relationship between y and the regressors, and a suitable transformation may linearize a non-linear function. If the scatter diagram of y against x indicates curvature, it may be possible to present the data in a linearized form using transformations such as the logarithmic or the reciprocal. These transformations are often selected empirically, but it is also possible to use objective techniques, such as the Box-Cox method, to specify an appropriate transformation.

The assumption of constant variance may also be violated; this is common if the response variable y follows a probability distribution in which the variance is related to the mean. It is important to find and correct a non-constant error variance, since it might otherwise lead to larger standard errors for the regression coefficients than necessary. The effect of a transformation is usually to give more precise estimates of the model parameters and to increase the sensitivity of the statistical tests. [1]
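As a toy illustration of a linearizing transformation (with invented, noise-free numbers, not data from the thesis): observations generated from the curved relation y = a e^{bx} become exactly linear after taking logarithms, so a simple least-squares line fitted to (x, ln y) recovers the original parameters.

```python
import math

# Hypothetical curved data following y = 2 * exp(0.5 * x) exactly.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# After a logarithmic transformation, ln(y) = ln(2) + 0.5 * x is linear in x,
# so a simple least-squares fit on the transformed data recovers the slope.
z = [math.log(yi) for yi in y]
x_bar, z_bar = sum(x) / len(x), sum(z) / len(z)
slope = (sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z))
         / sum((xi - x_bar)**2 for xi in x))
intercept = z_bar - slope * x_bar
print(slope, math.exp(intercept))  # close to 0.5 and 2.0
```

This is the same idea behind the logarithmic transformations of the response applied later in the thesis (Figures 10-12), where the transformation also stabilizes the variance.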

5.6 Multicollinearity

Regressors are said to be orthogonal if there is no linear relationship between them. If the regressors are instead nearly perfectly linearly related, inferences based on the model may be misleading or erroneous. This problem is called multicollinearity. In the multiple regression model the X matrix contains the regressor variables; its j:th column can be denoted X_j, so that X = [X_1, \dots, X_p], where X_j contains the n levels of the j:th regressor variable. Multicollinearity is defined in terms of the linear dependence of the columns of X: the vectors are linearly dependent if there is a set of constants s_1, \dots, s_p, not all zero, such that

\sum_{j=1}^{p} s_j X_j = 0. (18)

There are some useful methods for diagnosing multicollinearity. One is to examine the off-diagonal elements r_{ij} of the correlation matrix X'X. If regressors x_i and x_j are nearly linearly dependent, then |r_{ij}| will be near unity. Examining the simple correlation r_{ij} between the regressors is helpful only in detecting near-linear dependence between pairs of regressors; when more than two regressors are involved in a near-linear dependence, there is no assurance that any of the pairwise correlations r_{ij} will be large.

The diagonal elements of the matrix C = (X'X)^{-1} are also useful in detecting multicollinearity. The variance inflation factor (VIF) is

VIF_j = C_{jj} = (1 - R_j^2)^{-1} (19)

where R_j^2 is the coefficient of determination obtained when x_j is regressed on the remaining regressors. A calculated value exceeding 5 or 10 for any term in the model indicates that the associated regression coefficient is poorly estimated because of multicollinearity. [1] [2]
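Equation (19) can be sketched for the special case of two regressors, where R_j^2 reduces to the squared pairwise correlation. The data are invented to contrast a nearly collinear pair with a benign one (the thesis computes VIFs for its full model in R):

```python
# Sketch of the variance inflation factor, equation (19), in the
# two-regressor case where R_j^2 equals the squared pairwise correlation.
import math

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma)**2 for x in a)
                           * sum((y - mb)**2 for y in b))

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly 2 * x1: nearly collinear
x3 = [5.0, 1.0, 4.0, 2.0, 3.0]    # only weakly related to x1

vif_collinear = 1 / (1 - corr(x1, x2)**2)   # far above 10: multicollinearity
vif_benign = 1 / (1 - corr(x1, x3)**2)      # near 1: little linear dependence
print(vif_collinear, vif_benign)
```

The near-collinear pair produces a VIF in the hundreds, well past the 5-10 threshold, while the unrelated pair stays close to 1.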

5.7 Variable Selection

The subset selection approach tries to identify the predictors that are most related to the dependent variable, and then fits a least-squares model using only those predictors. The reduced models may then be evaluated with different model evaluation criteria to choose the best model. Three of the most common selection approaches are Best Subset Selection and the Forward/Backward Stepwise Selection algorithms.

The Best Subset Selection algorithm starts with no predictors. It first fits all models with exactly one of the p possible predictors and stores the best one, defined as the model with the largest value of R^2. It then does the same for models with 2, ..., p predictors, storing the best model for each number of predictors. Lastly, the single best model out of all stored models is chosen based on different evaluation criteria.

The Forward Stepwise Selection algorithm likewise starts with no predictors. It adds one predictor at a time, the predictor that yields the highest R^2 for the model, until there are p predictors.

The Backward Stepwise Selection algorithm is similar to the Forward algorithm, but starts with the full model of p predictors and instead removes one predictor at a time. [2]

5.8 Cross Validation

In regression analysis there are different resampling methods. These methods are used to obtain additional information about the fitted model. The data set is randomly divided into two sets, one training set and one validation set. The idea of these methods is to see how well the fitted regression model can estimate a sample from the validation set.

The validation set approach is a technique used in model selection to estimate the test error of a predictive model. The principle is to create validation sets from the training data: a linear model is fitted to the training set and used to predict the responses for the observations in the validation sets. From this, conclusions can be drawn about the model's performance in predicting new observations.

There are two potential drawbacks with the validation set approach:

1. The estimate of the test error rate can be highly variable, depending on which observations are divided into the training and validation sets.

2. Only a subset of the observations is used to fit the model. The performance of the model usually decreases when fewer observations are included, which can lead to an overestimation of the test error rate for the model fit to the entire data set.

In an attempt to address the drawbacks of the validation set approach, so-called k-fold cross validation can be applied to the models created through model development. The set of observations is divided into k folds of approximately the same size. The first fold is treated as a validation set and the model is fit on the k − 1 remaining folds. The procedure is repeated using every fold as validation set once, the mean square error, MSE, is calculated on the observations in the validation set each time, and the k-fold CV estimate is the calculated average: [2]

CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i . \qquad (20)
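The CV estimate in equation (20) can be sketched as follows (a minimal Python illustration on synthetic data, not the R workflow used later in the thesis):

```python
# Minimal k-fold cross validation for an OLS model: average the MSE over
# the k validation folds, as in equation (20). Synthetic data only.
import numpy as np

def kfold_cv_mse(X, y, k=10, seed=0):
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)          # k folds of (almost) equal size
    mses = []
    for i in range(k):
        val = folds[i]                      # fold i is the validation set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        Av = np.column_stack([np.ones(len(val)), X[val]])
        mses.append(np.mean((y[val] - Av @ beta) ** 2))
    return np.mean(mses)                    # CV_(k) = (1/k) * sum of MSE_i

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=300)
cv = kfold_cv_mse(X, y, k=10)
print(round(cv, 4))  # an estimate of the test MSE; small for this low-noise data
```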

6 Data Set

The data set used in this project is created from Systembolaget’s sale statistics for 2016 [13] and one explanatory variable is obtained from RateBeer [10].

To the sale statistics a set of numerical variables was added: alcohol by volume, number of months in stock up until the annual shift 2016/2017 and the number of stores selling the product.

In addition to the numerical variables, groups of dummy variables were created. These variables represent whether the product is in stock, the type of packaging, whether the product is organic, whether the product is rated at RateBeer, the product's country of origin and its beer style. In the dummy variable group Beer Style there is one category called other styles by Systembolaget, and in the Country of Origin group the variable other countries includes Barbados, Colombia, Trinidad, Japan, South Africa, India, Australia, Mexico, Turkey, Iceland, Serbia, Croatia, Cyprus, China, Israel, Greece, Bosnia, Jamaica, Estonia, Singapore, Kenya and Poland. None of these countries had more than 4 observations and they were therefore grouped together.

6.1 Restrictions in Data

Initially the data set included 3312 beers. After removing beers with unusual packaging, such as kegs, magnum bottles and multi-packs, and beers that sold less than 1000 liters, 1413 beers were left to analyze. The packaging restriction was made to eliminate probable outliers and high leverage points, and the sales restriction was made to retain a good sample size while removing the most unusual beers, since these would probably not be representative for the model.

A beer needs nine or more ratings in order to obtain a RateBeer rating; beers with fewer ratings have been regarded as unrated. It is assumed that the beers that are no longer in stock were available throughout 2016.

6.2 List of Variables

The response variable y is price per liter, where price is represented in Swedish kronor, and spans the interval from 21.6 SEK/liter to 252 SEK/liter with a median value of 71.21 SEK/liter.

The two following tables describe each explanatory variable in the data set, including their numbers of observations; the tables cover the dummy and numerical variables respectively.


Explanatory Variables: Dummy

Variable               Description                                  Number of observations
Package                1 for bottles and 0 for cans.                1239
Not in stock           1 for not in stock and 0 for in stock.       263
Organic                1 for organic and 0 for non-organic.         121
Unrated                1 for unrated and 0 for rated at RateBeer.   81
Belgian ale            Beer style                                   98
British American ale                                                573
German ale                                                          11
Dark lager                                                          27
Light lager                                                         429
Medium dark lager                                                   38
Other styles                                                        32
Porter/stout                                                        96
Sour                                                                44
Unclassified ale                                                    26
Wheat                                                               39
Austria                Country of origin                            7
Belgium                                                             74
Canada                                                              7
Czech Republic                                                      41
Denmark                                                             18
Finland                                                             6
France                                                              5
Germany                                                             55
Great Britain                                                       80
International                                                       48
Ireland                                                             10
Italy                                                               9
Netherlands                                                         9
New Zealand                                                         6
Norway                                                              9
Other countries                                                     34
Spain                                                               12
Sweden                                                              849
Thailand                                                            8
USA                                                                 127

Note: International as a country means that the beer probably is from a large, global brewery and that it is licence brewed in several different countries. Therefore it is not possible to label it as a specific country. [12]


Explanatory Variables: Numerical

Variable          Description                                                         Min.      Median    Max.
Alcohol           Alcohol by volume (ABV) percentage.                                 3.6       5.4       14
Item price        Price per beer in Swedish kronor.                                   8.9       24.9      189
Months in stock   How many months the product had been in Systembolaget's
                  assortment up until the annual shift 2016/2017.                     0         15        735
Number of stores  Number of Systembolaget stores selling the product as of when
                  this report is written. This is regarded as a measurement of
                  popularity.                                                         0         4         437
RateBeer rating   A qualitative subjective rating ranging between 0 and 100, where
                  the rating is rounded to the nearest integer, based on users'
                  ratings. RateBeer uses a Bayesian weighted mean for their
                  ratings.                                                            0         45        100
Sales             Total sales in liters in 2016.                                      1001.55   3752.1    9893495
Sales per month   Total sales in liters in 2016 divided by the number of months
                  the product had been in stock 2016.                                 83.4625   401.995   824457.9
Volume            Volume in milliliters.                                              250       330       1000

6.3 About RateBeer

RateBeer users can rate beers on a combination of scales. RateBeer uses a Bayesian weighted mean for their ratings, which means that the validity of any average is increased by the number of ratings.

As RateBeer themselves put it: ”We use a Bayesian weighted mean so that more ratings increase the score’s validity. Simply put, a beer that has one hundred 5.0 scores will have a score just thousandths of a point under five, whereas a beer that has only ten 5.0 scores might have a score a few tenths below a five. This not only helps us combat abuse but ensures a greater validity to our beer lists”. [9]

This weighted point average is different from RateBeer's consumer-friendly 100-point scale score, which has been used in this project. A beer's score is based on its percentile ranking among all beers. To assure the quality of their ratings, RateBeer deletes obviously bogus ratings, does not count a user's rating for a specific beer until the user has rated at least ten beers, and does not let brewers, or their affiliates, rate their own beers.

RateBeer is not the only well-known database to use the Bayesian formula. The Internet Movie Database, better known as IMDb, has also used the Bayesian formula to obtain a weighted mean for its movies based on user votes [4].

The Bayesian formula:

\mathit{Weighted\ rank} = \frac{v}{v + m} \cdot R + \frac{m}{v + m} \cdot C \qquad (21)

where R is the mean rating, C is the midpoint of the scale, v is the number of ratings for a beer and m is the minimum votes required to be listed in the top 50 beers list at RateBeer.
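The formula can be sketched in code. The values of v, m, R and C below are made-up illustrations (RateBeer does not publish its value of m), chosen to mirror the hundred-versus-ten example in the quote above:

```python
# Sketch of the Bayesian weighted mean in equation (21).
# m = 10 and C = 2.5 (midpoint of a 5-point scale) are assumed values,
# not RateBeer's actual parameters.
def weighted_rank(R, v, m, C):
    """R: mean rating, v: number of ratings, m: minimum votes, C: scale midpoint."""
    return v / (v + m) * R + m / (v + m) * C

# A beer with a hundred 5.0 ratings stays close to 5.0, while one with
# only ten 5.0 ratings is pulled noticeably toward the midpoint.
many = weighted_rank(R=5.0, v=100, m=10, C=2.5)
few = weighted_rank(R=5.0, v=10, m=10, C=2.5)
print(round(many, 3), round(few, 3))  # 4.773 3.75
```

With RateBeer's actual parameters the pull toward the midpoint would differ, but the qualitative behaviour is the same: more ratings increase the weight of the observed mean R relative to the prior C.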


7 Analysis and Model Development

The multiple regression model is built and analyzed in RStudio, an editor for R. The complete data set is read into R and a first multiple linear regression model is created. In this initial model the following beer parameters are located in the intercept: beers in stock, cans, non-organic, rated at RateBeer, other styles and other countries. Which variables are located in the intercept has no effect on the results, so the choice may be made arbitrarily. However, the actual coefficients do depend on this choice, since the interpretation of a variable coefficient is relative to the base case. Analyzing the model starts by interpreting the plots listed in the following section, Residual Analysis.
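The point that the choice of base case affects only the interpretation of the coefficients, not the fit itself, can be sketched as follows (made-up data, and plain least squares instead of R's lm):

```python
# Sketch: which dummy category is absorbed into the intercept does not
# change the fitted values, only the meaning of the coefficients.
import numpy as np

rng = np.random.default_rng(2)
group = rng.integers(0, 3, size=90)            # three categories, e.g. beer styles
y = np.array([10.0, 14.0, 17.0])[group] + rng.normal(scale=0.5, size=90)

def fit(base):
    """OLS with category `base` absorbed into the intercept."""
    others = [g for g in range(3) if g != base]
    X = np.column_stack([np.ones(len(y))] +
                        [(group == g).astype(float) for g in others])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta                             # fitted values

# Identical fitted values regardless of which category is the base case:
print(np.allclose(fit(0), fit(2)))  # True
```

The design matrices for different base cases span the same column space, so the projection of y, and hence every fitted value and residual, is unchanged.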

7.1 Residual Analysis

The plot of residuals versus fitted values, see Figure 6, indicates some tendency of non-linear behaviour in the model. The residuals and the fitted values seem to be dependent; the model shows heteroscedasticity and the variance is not constant.

Figure 6: The model residuals versus the fitted values for the original model.

The quantile-quantile, Q-Q, plot is used to determine if the data is normally distributed. This Q-Q plot shows heavy tails in the extremities, which indicates that the original data have more extreme values than expected for a normally distributed data set. This can also be seen in the histogram, see Figure 7.

Figure 7: Normal Q-Q plot and histogram for the original model.


As with the plot of residuals versus fitted values, the scale-location plot strengthens the conclusion that the raw data is heteroscedastic, see Figure 8.

Figure 8: Scale-location plot for the original model.

By analyzing the leverage plot, see Figure 9, it is clear that the data set contains neither any high leverage points nor any outliers; the whole data set is located within the accepted region. This is expected since the number of observations, n, in the data set is 1413 and therefore considered large. Section Outliers and High Leverage Points will present a numerical analysis of these points.

Figure 9: Leverage plot for the original model.

The summary of this initial model is noted for comparison against other models throughout the development, see Table 1.


Residual standard error   Multiple R²   Adjusted R²   F-statistic   P-value
7.769                     0.938         0.936         509           < 2.2 · 10⁻¹⁶

Table 1: Summary of the original model.

7.2 Transformations of Variables

A logarithmic transformation of the numerical variables, including the response variable, is evaluated; none of the dummy variables are transformed. If the relationship between a numerical variable and the response variable improves, that transformation is applied in the regression model.

An improvement is visible in Figure 10, where a transformation of the response variable and the variable representing item price is plotted step by step. In the first panel both variables are untransformed, in the second the response variable is transformed and in the last both variables are transformed. The linear behaviour between the variables noticeably improves, and the item price variable is therefore transformed in the regression model.

Figure 10: Logarithmic transformation of the response variable and the variable representing item price.

A transformation does not always improve the relationship between the response and explanatory variables. In Figure 11 it is clear that transforming the explanatory variable does not improve the linearity of the relationship. The transformation is made in the same manner as in Figure 10, and the decision is made not to include a transformation of this explanatory variable in the model.
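A rough numerical counterpart of this kind of transformation check can be sketched as follows. The thesis judges linearity visually from plots; the correlation coefficient on synthetic data is used here only as a crude stand-in for that judgement:

```python
# Sketch: compare how linear a response-predictor relation is before and
# after a log transform. Synthetic data with a deliberately non-linear
# (cubic, multiplicative-noise) relation, which is linear on the log-log scale.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1.0, 50.0, size=500)
y = 2.0 * x ** 3 * np.exp(rng.normal(scale=0.1, size=500))

r_raw = np.corrcoef(x, y)[0, 1]                  # linearity before transforming
r_log = np.corrcoef(np.log(x), np.log(y))[0, 1]  # linearity after log-log transform
print(r_raw < r_log)  # True: the log-log relationship is closer to linear
```

For a variable like volume in Figure 11, the analogous check would show no such improvement, motivating the decision to leave it untransformed.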


Figure 11: Logarithmic transformation of the response variable and the variable representing volume in ml.

After each numerical explanatory variable has been evaluated, four of them are transformed in the model: item price, alcohol, sales and sales per month. The response variable's logarithmic transformation is also applied in the model. The new model is plotted for comparison against the original model, see Figure 12. Judging from the plots of the fitted values against the residuals and of the fitted values against the square root of the standardized residuals, this model is close to being homoscedastic. There are still tails in the Q-Q plot, and there are points that might be influential according to the leverage plot.

Taking a closer look at observation number 148, which differs from the remaining data, it is found that this observation represents the only beer sold at 1000 ml, the largest volume in the data set.

Figure 12: Second model, developed with logarithmic transformations.

The summary of this model displays a large improvement in the F-statistic and in the residual standard error, which is now close to zero, see Table 2. Out of the 41 explanatory variables 21 are significant, compared to 19 significant variables in the original model.

Residual standard error   Multiple R²   Adjusted R²   F-statistic   P-value
0.0252                    0.996         0.996         9090          < 2.2 · 10⁻¹⁶

Table 2: Summary of the transformed model.

7.3 Multicollinearity

To detect multicollinearity between explanatory variables, the variance inflation factor, VIF, is calculated in R. In Table 3 all variables with a value larger than 10 are listed.

Variable   Ale: British/American   Light lager   Sweden   Sales   Sales per month
VIF        11.84                   11.11         10.75    23.64   20.54

Table 3: List of explanatory variables with a VIF value larger than 10.
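How VIF values like those in Table 3 are obtained can be sketched as follows: VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. The data below is synthetic, with one deliberately collinear pair (analogous to sales and sales per month):

```python
# Sketch of variance inflation factors from first principles.
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    n, p = X.shape
    out = []
    for j in range(p):
        # regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - (resid @ resid) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + rng.normal(scale=0.1, size=200)   # c nearly duplicates a
v = vif(np.column_stack([a, b, c]))
print(v.round(1))  # columns 0 and 2 show large VIFs, column 1 does not
```

Dropping either of the two collinear columns, as is done with sales below, brings the remaining VIF back toward 1.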

From these results a new model is developed. Sales and sales per month are directly related, and therefore one of them is excluded from the model. The sales per month variable is assumed to best describe seasonal beers and beers that are temporarily in stock, and is therefore the variable kept in the model. Furthermore, the response variable is the quotient of the item price and volume variables, so these two are also excluded from the model. The data is fitted to this model and plotted for analysis, see Figure 13.

Figure 13: Model developed through multicollinearity analysis.

By comparison with the previous model, Figure 12, this model is improved in the normal Q-Q plot, where the tails are lighter. The leverage plot no longer shows any possible influential points. The VIF values are calculated again; the variables with values larger than 10, together with the variable that exhibited multicollinearity in the previous calculation, are presented in Table 4. By excluding the sales variable, the VIF value for sales per month no longer exhibits multicollinearity.

Variable   Ale: British/American   Light lager   Sweden   Sales per month
VIF        11.74                   10.83         10.73    4.30

Table 4: List of explanatory variables with a VIF value larger than 10, together with the updated value for sales per month.

7.4 Analysis of Significance

After the numerical variables have been transformed, the model is further developed through an analysis of significance. As seen in section Multicollinearity, some of the dummy variables from the beer styles group have shown large VIF values. A new model, Model A, is tested where all different types of ales are grouped together as one dummy variable, as a result of their low significance and some high VIF values. The significance of the variables in this model is analyzed, and the decision is made to test whether some of the countries should be regrouped together with the variable other countries. The countries that are not significant are regrouped with the other countries located in the intercept, and a new model, Model B, is created. This regrouping means that the variables Belgium, Canada, Denmark, Finland, France, International, Ireland, Italy, Norway, Spain, Sweden, Thailand and USA are added to the intercept. The difference in significance is presented in Table 5.

The significance codes in the table are represented as:

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1.

Variable            Model A   Model B
Alcohol             ***       ***
Months in stock     ***       ***
Number of stores    ***       ***
Not in stock        **        **
Ale
Light lager         ***       ***
Medium dark lager   ***       ***
Dark lager          ***       ***
Porter/Stout        *         *
Sour                ***       ***
Wheat               **        **
Package             ***       ***
Belgium                       Regrouped
Denmark                       Regrouped
Finland                       Regrouped
France                        Regrouped
International                 Regrouped
Ireland                       Regrouped
Italy                         Regrouped
Canada                        Regrouped
Netherlands         .         .
Norway                        Regrouped
New Zealand         *         *
Spain                         Regrouped
Sweden                        Regrouped
Great Britain       .         **
Thailand                      Regrouped
Czech Republic      **        ***
Germany             ***       ***
USA                           Regrouped
Austria             .         *
Organic
Sales               ***       ***
RateBeer rating     ***       ***
Unrated             ***       ***

Table 5: Significance table.


Based on the resulting significance of Model B, this is the model chosen for further analysis. To meet the goal that all variables in the final model should be significant, the organic variable is removed from the model and the ale variable is regrouped together with the other styles variable, thereby adding it to the intercept. The final model is hereby chosen and presented in section Results.

Model B is represented in Figure 14 and Table 6 for comparison with the final model, see section Results.

Figure 14: Model B, developed through analysis of dummy variables.

Residual standard error   Multiple R²   Adjusted R²   F-statistic   P-value
0.203                     0.758         0.755         198           < 2.2 · 10⁻¹⁶

Table 6: Summary of Model B.

7.5 Outliers and High Leverage Points

The leverage plots have varied throughout the development, displaying different results as to whether there are possible influential points. By analyzing this numerically with Cook's distance and leverage testing, possible eliminations of observations are decided. In Figure 15 these numerically calculated values are represented. The first plot shows Cook's distance; as stated in section Mathematical Theory, a value larger than 0.5 indicates that the observation might be removed from the model. There are no values larger than 0.1 and therefore no observations are excluded due to Cook's distance. The second plot shows the leverage values; if the value for an observation exceeds 0.25, that observation may be removed. The largest value is approximately 0.2, so no observation is deleted due to its leverage value.

Figure 15: Outliers and high leverage points.
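The quantities behind a check like Figure 15 can be sketched as follows (synthetic data, not the thesis model): the leverages are the diagonal of the hat matrix and Cook's distances combine the residuals with those leverages, compared against the 0.5 and 0.25 thresholds used above:

```python
# Sketch: hat-matrix leverages h_ii and Cook's distances for an OLS fit.
import numpy as np

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=100)

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
lev = np.diag(H)                            # leverage values h_ii
resid = y - H @ y                           # residuals e_i
p = X.shape[1]
s2 = resid @ resid / (len(y) - p)           # residual variance estimate
cooks = resid ** 2 / (p * s2) * lev / (1 - lev) ** 2

# Count observations exceeding the thresholds used in the text
print((cooks > 0.5).sum(), (lev > 0.25).sum())
```

As a sanity check, the leverages always sum to the number of parameters p, since the hat matrix is a projection of rank p.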

7.6 Cross Validation

The k-fold cross validation is applied to three of the models obtained during the model development: the original linear regression model, the model with logarithmically transformed variables and the final model. k is set to ten, which generates ten folds to run as validation sets, and their respective CV values are computed. Table 7 presents these averages of the expected mean square errors for each model.

Model     Original Model   Transformed Model   Final Model
CV k=10   65.2             0.0432              0.0423

Table 7: Cross validation.

The table shows that the MSE has been reduced through the model development, and the final model is able to predict new observations with an MSE of 0.043. When comparing the cross validation plot for the original model, Figure 16, with that of the final model, Figure 17, there are some differences: the original model has problems predicting the most expensive observations, whereas the final model does not.


Figure 16: Predicted price against actual price, original model.

Figure 17: Predicted price against actual price, final model.


7.7 Variable Selection

The three methods for variable selection listed in section Mathematical Theory are fit to the final model to examine which variables are of most importance. Models with fewer than 3 variables are not considered, due to the aim of creating a model with a higher number of variables. The result is presented in the Appendix and is interpreted as follows.

All three methods indicate that the variables alcohol, sales per month and RateBeer rating are of the utmost importance.

Looking at the dummy variable group countries of origin, the backward selection method removes the entire group from the models, while the forward and best subset methods both select Czech Republic and Germany for models with 8 or more variables.

From the dummy variable group beer styles, the variables light lager and sour are chosen by all three methods for models with 5 or more variables. However, the backward method differs by also selecting medium dark lager and dark lager for models with 8 or more variables, whereas the two other methods remove all other styles from models with 10 or fewer variables.

If it were preferable to increase the model adequacy beyond the chosen final model, changes would be made according to these results from the variable selection methods. The goal of this project is, as stated before, to develop a model where all variables are significant, and a model with a large number of variables is preferable. Due to this goal the final model is not modified.


8 Submodels

Submodels concerning Country of Origin and Beer Styles have been created to analyze the effect on prices of models concerning only these groups of dummy variables respectively. These models contain only dummy variables as regressors.

8.1 Country of Origin

A regression model with price per liter as response variable and the countries listed in section List of Variables as explanatory variables has been created to study the variation due to the origin of a beer. Figure 18 presents the standard plots for this model. Cook's distance and leverage have been calculated numerically for the submodel, resulting in no observations being deleted from the model due to these values.

Figure 18: Submodel country of origin.

When working with dummy variables there are no possible transformations for the explanatory variables. The response variable has been logarithmically transformed; without this transformation the data set would have been skewed, see the normal probability plots in Figure 19. The left plot presents the original data, which is clearly skewed. To the right the response variable has been transformed and its normal probability plot is improved; the data is no longer skewed.


Figure 19: Normal probability plots.

8.2 Beer Styles

As in the previous section, a regression model with dummy variables as explanatory variables and price per liter as response is created; here the influences of the different styles of beer are analyzed. Again, the response variable is logarithmically transformed to prevent skewness. This model is presented in Figure 20. Cook's distance and leverage have been calculated numerically for this submodel as well, once more resulting in no observations being deleted from the model due to these values.

Figure 20: Submodel beer styles.


8.3 Box Plots

In the regression model there are six groups of dummy variables. In addition to the submodels created for two of these, their box plots are analyzed. A box plot is used when it is of interest to visualize the spread of different dummy variables, as opposed to examining, for example, the mean. Outliers that have not been detected earlier in the model development may appear, because when these data points were examined together with the total number of observations they were not extraordinary in price per liter. [3]

By studying the box plot of Country of Origin, Figure 21, the spread is interpreted as the variation in price per liter. The box plots for each country are placed side by side to see which countries have higher medians and which has the greatest spread. The dotted line represents the 95% confidence interval, the boxes are limited by their upper and lower quartiles, the median is marked in the box and outliers are plotted as dots. Conclusions drawn from this figure are that Belgium, Sweden and USA are the countries with the greatest variation in price per liter. Canada represents the highest median; however, its number of observations is only 7 and therefore very low compared to the total number of observations. A more likely candidate for the highest median is Belgium with its 97 observations.

Figure 21: Countries of origin.

In Figure 22 the box plots for the different styles of beer are shown. Unclassified ale, porter/stout and sour beers indicate a wide variation in price per liter. The sour style represents the highest median with its 44 observations, and light lager the lowest median with its 497 observations. Given the high number of observations for light lagers compared to the total number of observations, it is considered very likely that this is in fact the cheapest beer style in price per liter.


Figure 22: Beer styles.

The four remaining groups of dummy variables only contain two variables each, true or false. In the plot for package, with bottles and cans, the variation in price is seemingly the same; the canned beers are indicated by their median to be cheaper per liter, see Figure 23. The division of observations between organic and non-organic beers is heavily in favour of the non-organic ones, since only 121 of the beers are organic. However, the variation in price per liter for the two variables is similar, and the box plot in Figure 24 indicates that the non-organic beers are more expensive according to the variables' medians.

Figure 23: Package.

Figure 24: Organic.

In the data set there are 263 observations which are not in stock and 1178 observations which are. The box for the beers which are no longer in stock, see Figure 25, shows a slightly wider variation in price per liter. This might be due to a large spread in styles, alcohol by volume and so on for beers in the temporary batch. These beers are also more expensive per liter than the beers which are still in stock. There are only 81 observations missing a rating at RateBeer, and the medians for these and the rated ones are very similar, see Figure 26. This indicates that the price per liter does not rely on whether the beers are rated or not, although the variation in price is much larger for the rated observations.

Figure 25: In stock.

Figure 26: Rated.

References
