
A comparative study on Linear Regression and Neural Networks for estimating order quantities of powder blends

JACOB HALLMAN

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: September 17, 2019
Supervisor: Somayeh Aghanavesi
Examiner: Erik Fransén
School of Electrical Engineering and Computer Science
Host company: Sandvik Coromant

Swedish title: En jämförelsestudie mellan linjär regression och neurala nätverk för att estimera beställningskvantiteter av pulverblandningar


Abstract

This degree project, conducted at Sandvik Coromant, aims to predict powder blend inventory needs based on order quantity prognoses made by customers. Two different machine learning approaches were used: linear regression and neural network regression. Since a majority of the samples consisted of prognoses indicating that an order would be made without an order actually being placed, classification was applied to the entire dataset and regression to the samples with an order quantity above zero. This way, the regression model was not biased towards zero during training and gave more reasonable predictions for larger orders.

Both automated and manual feature engineering were performed in order to improve the results. It was concluded that the standard features combined with those generated by automated feature engineering performed best in the case of regression. For classification, a combination of the standard, automated, and manual feature sets performed better.


Sammanfattning

Detta examensarbete, utfört på Sandvik Coromant, ämnar att förutse inventering för pulverblandningar genom att använda prognoser givna av kunder. Två olika maskininlärningsmetoder användes: linjär regression och regression i neurala nätverk. Eftersom majoriteten av datapunkter bestod av prognoser som indikerade att en beställning ska placeras utan att en beställning gjordes bestämdes att klassificering bör utföras på all data, medan modellen för regression enbart tränas på data där beställningar utförts. På detta sätt blev regressionsmodellen inte viktad mot nollvärden och estimering av större ordrar blev mer precis. Både automatisk feature engineering och manuell feature engineering utfördes för att förbättra resultaten. I fallet med regression drogs en slutsats att ursprungliga features tillsammans med de automatiskt genererade gav bäst resultat. För klassificeringen förbättrades resultatet ytterligare av att även lägga till manuella features.


Contents

1 Introduction
  1.1 Background
  1.2 Objective
  1.3 Delimitations
2 Relevant theory
  2.1 Machine learning
    2.1.1 The Bias-Variance trade-off
  2.2 Linear regression
  2.3 Artificial neural networks
  2.4 k-fold Cross-Validation
  2.5 Model evaluation
    2.5.1 Mean squared error
    2.5.2 Mean absolute error
    2.5.3 Mean absolute percentage error
    2.5.4 R²
  2.6 Local Interpretable Model-Agnostic Explanations (LIME)
  2.7 Literature review
3 Method
  3.1 Choice of machine learning algorithms
  3.2 Collecting data
    3.2.1 Classification model reasoning
  3.3 Preprocessing
  3.4 Feature engineering
    3.4.1 Featuretools
    3.4.2 Manual features
  3.5 Feature selection
  3.6 Cross-validation
  3.7 Grid search
    3.7.1 Neural network architecture
  3.8 Model training
    3.8.1 Normalization
    3.8.2 Encoding
    3.8.3 Removing outliers
  3.9 Kruskal-Wallis H test
  3.10 Combining classification and regression
4 Results
  4.1 Classification
    4.1.1 Accuracy scores
  4.2 Regression
    4.2.1 Error distributions
  4.3 Kruskal-Wallis H test
  4.4 Cross-Validation
  4.5 Combining classification and regression
    4.5.1 Phase 1: Training
    4.5.2 Phase 2: Testing the models together
5 Discussion and conclusion
  5.1 Sustainability and ethical aspects
  5.2 Feature sets
  5.3 Recommendations
  5.4 Social relevance
  5.5 Future work
  5.6 Conclusion
Bibliography
A Feature selection scores

Terminology

Grade/powder grade: A certain type of powder blend of materials that is produced by Sandvik Coromant.

Insert: Small pieces of metal with hard surfaces that are used in many different areas. For instance, they are commonly used in metal-cutting tools in order to reduce wear on the tools.

Prognosis: An estimation made by a customer that approximates how much of a certain powder grade will be ordered during a specific year/quarter/month.

Order quantity: An actual order made by a customer. This is the target value that we want to predict.

Feature set terminology

Standard features: {powder grade, prognosis, customer, year, quarter}

Featuretools features: See Table 3.3 for regression and Table 3.4 for classification.

Manual features: See Table 3.5 for regression and Table 3.6 for classification.

A description of functions for features generated by Featuretools can be found in Appendix A, Table A.4.


1 Introduction

This chapter provides an introduction to the thesis, outlines the objective, and summarizes what is included in the scope of this study.

This degree project aims to investigate the prediction of powder needs through the use of automated machine learning models. The project was carried out at Sandvik AB, a global engineering group in mining and rock excavation, metal cutting, and materials technology [1]. As part of this engineering group, Sandvik Coromant is a company that focuses on supplying cutting tools and services to the metal-cutting industry. Their largest factory, located in Gimo, is the primary research location of this thesis [2].

For companies such as Sandvik Coromant that need to procure large amounts of materials for their production, precise inventory forecasts are essential. Combining different materials into a specific powder blend, or powder grade, is the first step in the process of creating a carbide insert. Note that a customer in this degree project represents one of the many manufacturing plants owned by Sandvik Coromant across the world; the powder blends are used to create carbide inserts and are thus distributed to these manufacturing plants. Currently, the customers estimate how much of each powder blend they need to purchase by leaving a prognosis for a certain time span. When it is time to order, they end up ordering a final amount, usually either more or less than the prognosis. The difference between a customer's actual purchase volume and the prognosis can be large and varies significantly between customers, making predictions that use the prognosis as a single feature difficult. The task of this degree project is to implement and compare linear regression and neural network regression in the context of estimating powder blend order quantities from customer prognoses.

Sandvik Coromant currently uses the prognoses as a baseline for powder blend inventory management.

1.1 Background

This thesis concerns the process of manufacturing carbide inserts, which are used in metal-cutting tools for machining high-temperature alloys and cast iron, among other areas. Inserts give a better finish on metal parts and allow for faster machining. The main component of a carbide insert, tungsten carbide, is a durable compound that allows the inserts to last longer and withstand higher temperatures. A regular insert consists of 80% tungsten carbide and 20% metal matrix composite, which binds the hard carbide grains together [3]. The composite is commonly made using cobalt, which increases the toughness of the carbide, thus reducing the wear of the surface on the final product. The carbide is milled with the cobalt and any other powders and then spray-dried before being pressed into an actual insert [4]. Different powder blends are used to create different inserts. This thesis focuses on predicting quantities of these powder blends, not of the inserts themselves.

Knowing how much powder is necessary to complete a particular order requires a good estimate. Since the customer predictions can deviate considerably from the final orders, an automatic machine learning model that can verify that the estimates are correct would make the process safer and more stable. The goal is to create such a model with the help of previous forecasts made by customers of Sandvik Coromant.

1.2 Objective

This degree project looks at two different approaches to estimating the amount of powder necessary to keep in inventory based on customer prognoses: linear regression and neural network regression. If the results are promising, the implementation that makes the most accurate predictions will be made available for Sandvik Coromant to use.

Linear regression is considered as an approach in this thesis because of its simplicity and its extensive past usage for similar problems. Neural networks, in turn, are gaining increased recognition and perform better than linear regression in cases where nonlinear relationships are present, which is why they are investigated as well. The degree project must also take the amount of data and the limited number of features into account when evaluating the final result.

All prognoses and order quantities that are shown in this degree project are in linearly scaled values of tonnes.

This thesis seeks to answer the following question:

Which machine learning model is more suitable for estimating the volume of customer orders based on previous prognoses: linear regression or neural network regression?

1.3 Delimitations

Looking into other types of algorithms would be interesting if the final result is inaccurate. However, this is not the primary goal of this degree project and is left as future work. The primary task is to find out whether linear regression or neural network regression performs better for the specified problem. In addition, an effort will be made to improve the results as much as possible for the machine learning approach that works best. Sandvik Coromant provides data on previous forecasts for powder purchases and the final order quantities.

The main frameworks used in this work are Scikit-learn [5] for linear regression and feature selection, and Skorch [6] for neural networks. Skorch is a wrapper around PyTorch that is compatible with Scikit-learn, which improves the workflow when combining the two. Python is used to write all the code because of its widespread adoption in the data science field. Pandas, a popular Python library for data structures and data analysis, is used for pre- and postprocessing. Prognoses are stored in Excel sheets and loaded into data frames and CSV files. Furthermore, the standard features are powder grade, year, quarter, prognosis, and customer. Feature engineering is performed to improve the results.


2 Relevant theory

This chapter presents an overview of the technical aspects of this work that are necessary to comprehend the rest of the thesis. A basic introduction to machine learning is given, followed by an elaboration of the machine learning algorithms used and a review of previous studies.

2.1 Machine learning

Machine Learning (ML) is the concept of enabling computers to learn by themselves. The more data that is available, the more complex the tasks that can be solved using machine learning [7]. Artificial Intelligence (AI), the simulation of human intelligence, has become a popular term lately, and machine learning is a subfield of that area. ML tasks can be solved using either supervised or unsupervised learning. Supervised learning involves the prediction of an output from one or more inputs, whereas unsupervised learning takes only inputs and finds relationships or structures among the data [8]. This thesis focuses on supervised learning, since old powder prognoses are used in the machine learning models to estimate new order quantities.

For supervised learning to work, the model has to be taught to make reasonable predictions by training it. The typical way to train a model is to randomly split the available data and use part of it for training and the rest for verification to make sure that the model is not too fitted to the training data. The phenomenon of fitting a model too much to the training data so that it does not perform well in a general case is called overfitting [8].


2.1.1 The Bias-Variance trade-off

Bias and variance are common concepts in statistical learning that refer to prediction error. A machine learning model aims to minimize prediction errors, hence minimizing bias and variance. Variance refers to the change in predictions when a different training set is used. In the ideal scenario, the differences in predictions between training sets are small; with high variance, minimal changes to the training set can give completely different predictions. Bias, on the other hand, refers to the error produced by simplifying the modeling of a problem too much. For instance, applying linear regression to a problem with apparent nonlinear relationships between variables will result in high bias no matter how many training observations there are.

In general, increasing flexibility of a model increases the variance and decreases the bias [8]. Thus, there is a trade-off between bias and variance, which should be balanced to achieve high model accuracy.

2.2 Linear regression

Regression analysis was initially popularized during the 19th century by Sir Francis Galton, according to Stanton [9]. Galton applied his methods to problems of heredity, but today linear regression is used in numerous areas. Linear regression is a way to model the relationship between a dependent variable and one or more independent variables.

The goal is to find a line that best fits a specific set of data points, usually by minimizing the sum of the squared errors of prediction [10].

A function of the following form can be used to make predictions with linear regression:

$\hat{y} = \theta_0 + \theta_1 x_1 \quad (2.1)$

where $\theta_0$ is the intercept term and $\theta_1$ is the slope [11]. This form is called simple linear regression, using only one independent variable to predict the value of a dependent variable. The outcome, $\hat{y}$, is the dependent variable, whereas $x_1$ is the independent variable in this case. Multiple linear regression is another form of linear regression, where multiple regressor variables are involved:

$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_i x_i \quad (2.2)$


where the $\theta_i$'s are regression coefficients that describe the starting point and slope of the line. The term linear does not refer to $\hat{y}$ being a linear function of the x's, but to the linearity of the parameters $\theta_0, \theta_1, \cdots, \theta_i$ [10]. Thus, the prediction line drawn by linear regression does not necessarily have to be straight but can be fit to curves in the data by including nonlinear predictors.
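As an illustration, the following is a minimal sketch of fitting a multiple linear regression of the form in Equation 2.2 with Scikit-learn; the data values are purely hypothetical.

```python
# A minimal sketch of multiple linear regression on hypothetical data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])  # x1, x2
y = np.array([5.0, 4.5, 7.0, 10.0])

model = LinearRegression().fit(X, y)
print(model.intercept_)   # theta_0
print(model.coef_)        # theta_1, theta_2
y_hat = model.predict(X)  # y_hat = theta_0 + theta_1*x1 + theta_2*x2
```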

2.3 Artificial neural networks

Artificial Neural Networks (ANN) have been around for a while, but the research potential is still huge, according to Silva et al. [12]. ANNs have been used successfully in both classification and regression problems. The foundation of a neural network is to model the nervous system of living beings: the basic idea is to build neurons, or nodes, that store and process information and to connect them using artificial synapses [13]. Typically, a neural network consists of multiple layers of nodes with data flowing between them. There is an input layer that receives data, zero or more hidden layers that compute internal operations, and finally an output layer responsible for producing the final output [12]. A typical illustration of a fully connected neural network with one input layer, one hidden layer, and an output layer is displayed in Figure 2.1.


Figure 2.1: An image of an artificial neural network with one input layer, one hidden layer, and one output layer. The input layer consists of four neurons, the hidden layer of five neurons and the output layer of three neurons.

Every connection, or synapse, of a neural network has weights associated with it that are adjusted during the training phase to minimize the output error. The weights are typically chosen randomly at the start of the training phase [14]. A node activates when a threshold of linear combinations of inputs is exceeded. Input values are multiplied by the weights of their corresponding connections, summed in the node, and finally processed by an activation function to determine whether the threshold was reached or not [15].

Different types of architectures can be used when building an ANN, one of which is the multiple-layer feedforward network, making use of one or several hidden neural layers, which will be the focus of this thesis. In such a network, the errors are back-propagated from the output layer to the hidden layers to adjust the weights, and better fit the model to the desired output.

2.4 k-fold Cross-Validation

k-fold cross-validation is a method for assessing how well a machine learning model works across different subsets of a dataset. Running cross-validation gives an idea of how well a machine learning model will generalize to data that is not used during training.


Figure 2.2: An illustration of 5-fold cross-validation.

Figure 2.2 illustrates how 5-fold cross-validation works. First, the dataset is shuffled. Then the value of k is set, which is 5 in this case. The dataset is split into k different folds, or groups. Next, each of the folds is used as a validation set exactly once. The results from each round can be compared at the end by some error or accuracy measure, and if they are relatively close to each other, the model is likely to generalize well.
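The following is a minimal sketch of this procedure with Scikit-learn, using synthetic data and a linear model as stand-ins; the shuffling, the k = 5 split, and the per-round scoring mirror the steps described above.

```python
# A minimal sketch of k-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Shuffle, split into k = 5 folds, and use each fold once for validation.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(scores)         # one R^2 score per round
print(scores.std())   # a small spread suggests the model generalizes well
```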

2.5 Model evaluation

To evaluate the appropriateness of a model, a performance metric of the fit is necessary. It is essential to know if a regression model is adequate, i.e., to determine if the model predicts the target variables within an acceptable range of accuracy. This section covers commonly used evaluation methods for measuring the accuracy of a regression model. The measures described here were picked because they each have different benefits; a more detailed explanation of the advantages of each error measure is included in the corresponding section below.

2.5.1 Mean squared error

Mean Squared Error (MSE) is a common measure to determine how well a predicted output matches the true output given a specific input. The mean squared error is given by

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2 \quad (2.3)$$

where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the ith observation. As the name suggests, a high MSE value implies that the predictions are far from the true values, while a low value means that the predictions are fairly accurate. While Equation 2.3 shows the MSE computed on the training data used to fit the model, this is not ideal. Instead, the accuracy on previously unseen data that was not used during the training phase is a better measure [8].

Root mean squared error

An alternative to MSE is Root Mean Squared Error (RMSE), the square root of the MSE. The advantage of using RMSE is that it is measured in the same units as the target variable, thus being easy to interpret.


2.5.2 Mean absolute error

Mean Absolute Error (MAE) calculates the mean of all recorded absolute errors, thus ignoring the sign of the error. MAE is given by

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{f}(x_i)\right| \quad (2.4)$$

where $\hat{f}(x_i)$ is the prediction that $\hat{f}$ gives for the ith observation. The advantage of MAE is that it is easy to interpret, and it is more robust to outliers since no squaring is involved.

2.5.3 Mean absolute percentage error

Mean Absolute Percentage Error (MAPE) is another example of a regression quality measure. MAPE is commonly used in situations where the prediction values are known to be well above zero. With several values close to or at zero, it can give a misleading picture of the error [16]. MAPE can be calculated as follows:

$$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right| \quad (2.5)$$

where $A_t$ is the actual value and $F_t$ is the predicted value. One attribute of MAPE that could be considered a downside is that it is biased in favor of predictions below the actual values.

2.5.4 R²

$R^2$, or the coefficient of determination, is a way of measuring the validity of a regression model. It can be interpreted as the proportion of the variance of the outcome that is explained by the model. The output ranges from 0 to 1, where a value of 1 indicates that every data point falls exactly on the regression line [17]. A value of 0.7 would mean that 70% of the variance in the dependent variable is explained by the regression. This value can be increased simply by including more independent variables, which is why an adjusted version of $R^2$ also exists. The following equation defines $R^2$:

$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \quad (2.6)$$

where $\hat{y}_i$ is the value of the dependent variable estimated by the regression equation for the ith observation, $y_i$ is the observed value of the dependent variable for the ith observation, and $\bar{y}$ is the mean of all observations of the dependent variable [17].
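For concreteness, the following is a minimal sketch computing the four measures above on a hypothetical pair of true and predicted values; MAPE is computed by hand, since it is unstable for targets near zero.

```python
# A minimal sketch of the evaluation measures on hypothetical values.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([10.0, 0.5, 30.0, 45.0])
y_pred = np.array([12.0, 1.0, 27.0, 48.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                       # same units as the target
mae = mean_absolute_error(y_true, y_pred)
mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))  # blows up near zero
r2 = r2_score(y_true, y_pred)
print(mse, rmse, mae, mape, r2)
```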

2.6 Local Interpretable Model-Agnostic Explanations (LIME)

LIME is a tool that attempts to interpret any machine learning model. Ribeiro, Singh, and Guestrin [18] describe that most state-of-the-art machine learning algorithms are black boxes and that building trust in a model by explaining how it makes choices is crucial. LIME does this by separating the explanations from the model (model-agnostic) and making predictions locally, close to the prediction that should be explained. Locally explaining a black-box model is much easier than trying to explain it globally, according to Ribeiro, Singh, and Guestrin [18]. Essentially, LIME perturbs the input data and observes how the output changes by learning a local linear model around it.

First, an instance X to be explained is picked. Then, X is perturbed to create feature data with slight modifications. A similarity distance measure is calculated between the original observation X and the perturbed observations. The machine learning model to be explained is then used to predict the outcome of the perturbed data. Next, a local linear model with m features is fit to the perturbed data, which explains the black-box model locally but most likely not globally. The resulting feature weights are finally used to describe the local behavior [18].
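The following is a minimal sketch of this procedure using the lime package on a toy regression model; the feature names and data are illustrative, not the ones from this thesis.

```python
# A minimal sketch of LIME on a tabular regression model.
import numpy as np
from sklearn.linear_model import LinearRegression
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 100, size=(200, 3))
y_train = 0.8 * X_train[:, 0] + rng.normal(scale=5, size=200)
model = LinearRegression().fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=["prognosis", "year", "quarter"],  # illustrative names
    mode="regression",
)
# Perturb one observation and fit a local linear model around it.
exp = explainer.explain_instance(X_train[0], model.predict, num_features=3)
print(exp.as_list())  # local feature weights for this single prediction
```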

2.7 Literature review

While there do not seem to be any papers addressing this exact problem, there are quite a few in the general area of prediction. Most previous works show that both linear regression and artificial neural networks can be used effectively for prediction and forecasting in various settings [19–22]. In most cases, ANNs seem to outperform linear regression, although some studies show minimal differences between the two. Furthermore, popular measures used to evaluate the performance of an algorithm in studies related to this subject are MSE, R², and MAPE.

One of the more common areas for regression is predictive analysis: modeling future opportunities and any risks that might be involved. More specifically, it is often used in demand or sales forecasting. For most businesses, estimating the number of products that will be sold within a certain period of time is very beneficial. A stock-out would be detrimental to total sales, and having too many copies of a product that never sells is another issue. Thus, to remain competitive as a company, it is almost necessary to have a good idea of an optimal inventory quantity [23].

Ahangar, Yahyazadehfar, and Pournaghshband [19] made use of neural network regression for stock price prediction. Moreover, they suggested that there is an increasing interest in ANNs for problems related to forecasting: the ability to capture nonlinear relationships and the complexity of new algorithms make them powerful, whereas linear methods have primarily been used in the past because of their simplicity. In their study, 10 economic and 30 financial variables were used in a general regression neural network with 3 layers. From those variables, three economic and four financial ones were used to estimate the stock price using Independent Components Analysis (ICA). The study concluded that the neural network method is more efficient than linear regression.

Sukhia, Khan, and Bano [23] researched demand forecasting using linear regression, backpropagation, and simple moving average algorithms. By looking at quarters, sales per quarter, and seasonal effects, they compared the actual demand of 12 different products against the forecasts of the algorithms. Applying the Economic Order Quantity (EOQ) model to the demand forecast produced by linear regression resulted in 57% savings compared to the simple moving average; the backpropagation algorithm had a corresponding value of 58%. Ultimately, it was concluded that the backpropagation algorithm performed slightly better than linear regression, since the inventory costs were lower when applying EOQ [23].

Some of the more common neural networks that use a multilayer feedforward type of architecture are the Multilayer Perceptron (MLP) and Radial Basis Function (RBF). A paper by Oludolapo, Jimoh, and Kholopane [24] suggests that RBF performs better than MLP in the case of predicting energy consumption.

Günay [25], however, made use of an MLP architecture to estimate electricity demand in Turkey with very high accuracy. He argues that correctly identifying the appropriate features is essential for an acceptable prediction. In the study, six different features were used, related to population, economy, and temperature. The results showed that the MLP model performed very well, even outperforming official predictions of electricity demand.

Moe and Fader [26] investigated the forecasting of new product sales using advance purchase orders, that is, orders made ahead of a product's release date. They used a mixed-Weibull model to estimate sales of new album releases and concluded that the resulting model could support a variety of sales patterns. Additionally, it was noted that the flexibility of the model makes it ideal for products with diffuse sales patterns, such as music or any other category that is experience-based and related to personal preferences.

Kuo, Hu, and Chen [27] proposed an RBF neural network for short-term sales forecasting of papaya milk. By using a hybrid of Particle Swarm Optimization (PSO) and a Genetic Algorithm (GA), they improved the learning of the RBF neural network.

Wu, Yan, and Fan [28] carried out a comparison of different models with sales trends as input to forecast new product sales. The Residual Sum of Squares (RSS) was used to measure the effectiveness of the models. An Exponential Smoothing (ES) algorithm weighted the more recent data heavily when fitting the model but ended up underestimating in increasing trends and overestimating in decreasing trends. Holt's method seemed to perform well, outperforming the AutoRegressive Moving Average (ARMA). However, the AutoRegressive Moving Average Vector (ARMAV) had the lowest RSS because it was able to capture more input factors.

Efendigil, Önüt, and Kahraman [29] presented an Adaptive Network-based Fuzzy Inference System (ANFIS) model and compared it to an ANN for demand forecasting in a multi-level supply chain setting. The study showed that ANFIS was more effective than the ANN in producing reliable forecasts.

Kong and Martin [30] considered a backpropagation neural network (BPN) to predict future sales volumes of a food product. The results were compared to a linear regression model currently used by the company from which the data was acquired. The BPN seemed to produce better results than linear regression, although factors such as advertising and competition were not taken into account, which might be significant according to the authors.

To summarize, most of the studies reviewed favor an ANN approach over linear regression for prediction in general, which also seems to be the case for sales forecasting. Nevertheless, some studies obtained similar results from the two.


3 Method

This chapter presents the details of the implementation and the reasoning behind some of those choices. The process of selecting features and training the various models is covered as well.

3.1 Choice of machine learning algorithms

The reason for choosing linear regression as an algorithm was its popularity in past forecasting work. Both linear regression and neural networks frequently appear in the previous work reviewed in this degree project. However, the fact that the prognoses were unreliable and differed substantially between customers made the task of predicting the final order quantities challenging. Because of this complexity, ANNs, which can adjust to nonlinearities in the dataset, were expected to outperform linear regression.

3.2 Collecting data

Sandvik Coromant supplied the data for this thesis, which consisted of old prognoses made by customers. The data was stored in multiple Excel files with prognoses from several years. For each categorical powder grade, prognoses were given per customer and time span. The time spans varied depending on the format of the Excel file.

In some cases yearly prognoses were used, while quarterly or monthly prognoses were more common in later years. The latest prognosis specified by a customer was always used. Empty cells were interpreted as meaning that the prognosis should not be updated.

In total, 2676 samples were available for training. Out of those samples, a majority were zero in the target column (order quantity), and some were zero in the prognosis column. The number of samples for each class can be found in Table 3.1.

Table 3.1: Number of samples in the dataset.

Class Sample count

Total number of samples 2676

Prognosis above zero, no order made 1687
Prognosis equal to zero, order made 171

Total number of orders made 989

Table 3.2 shows an example of what a sample data point can look like. The categorical grade, or powder grade, describes the type of powder blend. The customers are simply encoded into integers, and so are the years. Quarters are integers that align with the real quarters; thus, 2 in this case refers to Q2. Finally, the prognosis and the order quantity are shown. The order quantity is the value to predict, while the others are used as features.

Table 3.2: A mock-up example of what a sample with standard features can look like

Grade Customer Year Quarter Prognosis Order quantity

555 8 3 2 30 48

Furthermore, a density plot and histogram were created to show the distribution of prognoses and order quantities, illustrated in Figure 3.1. The majority of order quantity values were zero or close to zero, and several prognoses were in that range as well. This is evident because most values fall in the first histogram bins.


Figure 3.1: Density plot and Histogram of prognosis and order quantity.

3.2.1 Classification model reasoning

Considering that several samples were zero in the target value, a regression model trained on all data would be biased towards zero, making larger predictions unreliable. Running regression on the complete dataset would produce a model that might look good when observing error rates but would in reality be quite useless. Instead, it was decided that training a classification model on the entire dataset and then a regression model on the samples with an order quantity above zero was the best option. Thus, when trying to predict the order quantity of a new sample, classification is run first to determine whether an order is placed or not. If the classification model predicts that an order will be placed, regression is executed on that sample to get a predicted order quantity.

3.3 Preprocessing

Since different layouts were used in the Excel files over the years, a significant amount of parsing had to be done to make the data usable for model training. Additionally, dealing with different time spans was an issue. The time spans were given in either years, quarters, or months. The year was extracted from the date as a separate feature, and another feature was used for the quarters. In the case of monthly prognoses, the months were assigned to the quarters they belonged to, whereas yearly prognoses were mapped to a fabricated quarter.

The information in each Excel file was parsed into several CSV files, which were later merged into one CSV file. Since prognoses were updated as the date of the order came closer, only the latest version of each prognosis was used in this thesis. Furthermore, the Sandvik Coromant database was queried for each customer to find the order quantity corresponding to each prognosis. In addition, several properties with information related to each powder grade were also fetched from the database.
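The following is a minimal sketch of the kind of parsing involved, assuming pandas; the file name and column names are hypothetical stand-ins for the actual Sandvik Coromant files.

```python
# A minimal sketch of parsing one Excel layout into a CSV file.
# The file name, sheet layout, and column names are illustrative only.
import pandas as pd

df = pd.read_excel("prognoses_2018.xlsx")    # one of several layouts

# Derive year and quarter features; monthly prognoses map to quarters.
dates = pd.to_datetime(df["date"])
df["year"] = dates.dt.year
df["quarter"] = dates.dt.quarter

df.to_csv("prognoses_2018.csv", index=False)  # merged later into one file
```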

3.4 Feature engineering

The standard features available were prognosis, powder grade, customer, year, and quarter, together with 23 additional features associated with each powder grade that were fetched from the Sandvik Coromant database. Most of the additional features from the database did not score well during feature selection, and none of them were used in the final experiments. Feature engineering was performed to create more relevant features and improve the results further.

3.4.1 Featuretools

As a first step, the open-source Python framework Featuretools was used to generate new features automatically. The CSV file with all of the data was split into three tables, Grades, Customers, and Prognoses, with one-to-many relationships from Grades and Customers to Prognoses. This structure, similar to that of a Structured Query Language (SQL) database, allows Featuretools to calculate new features based on the relationships between the tables. The features that were created typically consisted of rather simple operations, such as the mean of the prognoses for each customer or the max value for each grade. In total, 54 features were generated by Featuretools. However, not all of the features were relevant, which is why feature selection had to be performed to choose the ones with the biggest impact on the result.
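The following is a minimal sketch of such a setup, using the pre-1.0 Featuretools API (entity_from_dataframe and normalize_entity); the data values and primitive choices are illustrative.

```python
# A minimal sketch of automated feature generation with Featuretools
# (pre-1.0 API); the data and primitives are illustrative only.
import featuretools as ft
import pandas as pd

prognoses = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "customer": [8, 8, 3, 3],
    "grade": [555, 612, 555, 700],
    "prognosis": [30.0, 12.0, 45.0, 5.0],
})

es = ft.EntitySet(id="orders")
es = es.entity_from_dataframe(entity_id="prognoses",
                              dataframe=prognoses, index="id")
# One-to-many relationships from customers/grades to prognoses.
es = es.normalize_entity(base_entity_id="prognoses",
                         new_entity_id="customers", index="customer")
es = es.normalize_entity(base_entity_id="prognoses",
                         new_entity_id="grades", index="grade")

# Deep feature synthesis generates aggregates such as
# customers.MEAN(prognoses.prognosis) automatically.
features, defs = ft.dfs(entityset=es, target_entity="prognoses",
                        agg_primitives=["mean", "sum", "max", "std"])
```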


Table 3.3: Featuretools regression features chosen after feature selection.

Feature                                 Description
customers.MEAN(prognoses.prognosis)     Mean of the prognosis column (for each customer)
customers.SUM(prognoses.prognosis)      Sum of the prognosis column (for each customer)
grades.MAX(prognoses.prognosis)         Max value of the prognosis column (for each grade)
grades.SUM(prognoses.prognosis)         Sum of the prognosis column (for each grade)
customers.STD(prognoses.prognosis)      Dispersion of the prognosis column relative to the mean, ignoring NaN (for each customer)

Table 3.4: Featuretools classification features chosen after feature selection.

Feature                                 Description
customers.SUM(prognoses.year)           Sum of the year column (for each customer)
grades.SUM(prognoses.year)              Sum of the year column (for each grade)
customers.COUNT(prognoses)              Number of prognoses (for each customer)
customers.MAX(prognoses.prognosis)      Max value of the prognosis column (for each customer)
customers.MODE(prognoses.grade)         The most commonly repeated grade (for each customer)

As can be seen in Tables 3.3 and 3.4, the features generated by Featuretools are mostly self-explanatory. These are only the features that were chosen after feature selection, not all of them. With the help of the relationships between the tables, Featuretools automatically created important features related to each grade and customer. Although 54 features were created, only five were used for regression and four for classification. The features were selected through a combination of running several feature selection tests with various scoring functions and examining how many features seemed appropriate to improve the results.

The feature selection process will be discussed further in Section 3.5.

3.4.2 Manual features

Another 30 experimental manual features were created by performing certain calculations. Similar to the Featuretools features, only a subset of these was selected in the end. For classification, eight of the manual features were used in the final model, while the regression model was tested with another six manual features that scored highly in feature selection.

The goal of the manual features was to create features that also take the target values into account. Observing common patterns between prognoses and order quantities after grouping the data by customer, by grade, and by customer-grade combination was something Featuretools was unable to do, since it only looked at the features themselves. These features were added after splitting the dataset into train and test sets, since the target values were used for most of the calculations; performing the operations with target values included on the test set would produce unrealistic results and not properly reflect a real-world scenario. The main intention of creating features that look into historical purchasing trends was to capture how much customers tend to underestimate or overestimate with their prognoses on average, as in the sketch below.
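The following is a minimal pandas sketch of a feature in this spirit: scaling the prognosis by each customer's historical order-to-prognosis ratio. The column names are illustrative, and the exact formula used in the thesis may differ.

```python
# A minimal sketch of one manual feature computed with pandas groupby.
# As described above, it must be computed on the training split only,
# since it uses the target values.
import pandas as pd

train = pd.DataFrame({
    "customer": [8, 8, 3, 3],
    "prognosis": [30.0, 10.0, 45.0, 20.0],
    "order_quantity": [48.0, 9.0, 40.0, 25.0],
})

# Average factor by which each customer's orders exceed their prognoses.
train["ratio"] = train["order_quantity"] / train["prognosis"]
scale = train.groupby("customer")["ratio"].mean()

# Prognosis column rescaled by each customer's historical bias,
# loosely in the spirit of the C_P_O / C_P_U features in Table 3.5.
train["customer_scaled_prognosis"] = (
    train["prognosis"] * train["customer"].map(scale))
```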

Some of the features calculated with the help of target values achieved high scores during feature selection. The most important ones used for regression after feature selection are shown in Table 3.5. For classification, another set of features was used, presented in Table 3.6. These tables contain abbreviations for the feature names, since the actual names used in the code were long in order to be descriptive.


Table 3.5: Manual regression features chosen after feature selection.

Feature    Description
C_P_O      customer_prognosis_overestimated. Prognosis column scaled by the average overestimation percentage (for each customer).
C_P_U      customer_prognosis_underestimated. Same as C_P_O but for underestimation.
G_P_O      grade_prognosis_overestimated. Prognosis column scaled by the average overestimation percentage (for each grade).
G_P_U      grade_prognosis_underestimated. Same as G_P_O but for underestimation.
C_G_P_O    customer_grade_prognosis_overestimated. Prognosis column scaled by the average overestimation percentage (for each customer and grade combination).
C_G_P_U    customer_grade_prognosis_underestimated. Same as C_G_P_O but for underestimation.


Table 3.6: Manual classification features chosen after feature selection.

Feature    Description
C_G_O_P    customer_grade_overestimate_percent. The number of overestimations divided by the total number of estimations (for each customer and grade combination).
C_G_U_P    customer_grade_underestimate_percent. Same as C_G_O_P but for underestimation.
C_O_P      customer_overestimate_percent. The number of overestimations divided by the total number of estimations (for each customer).
C_U_P      customer_underestimate_percent. Same as C_O_P but for underestimation.
C_O_C      customer_overestimate_count. Count of overestimations made (for each customer).
C_U_C      customer_underestimate_count. Same as C_O_C but for underestimation.
C_E_C      customer_estimate_count. Total number of estimations made (for each customer).
C_G_D      customer_grade_demand. Total number of samples where the order quantity is larger than 0 (for each customer and grade combination).

3.5 Feature selection

Feature selection was performed on the complete dataset for classification and on the part of the dataset with order quantities above zero for regression. There are mainly two approaches to feature selection: wrapper and filter. The wrapper method uses a predictive model to choose features, while filtering looks at how strong a relationship each feature has with the output variable. A filtering approach was used as the first step in this degree project, since it is computationally fast and gives a decent overview of how well each feature performs individually. Scikit-learn was chosen to perform these calculations. F-regression and mutual information were used as scoring functions. F-regression is a univariate correlation calculation, using the standard deviation and the mean, computed between each regressor and the target. Mutual information regression measures the dependency between the regressor and the target using nonparametric methods based on entropy estimation from k-nearest-neighbor distances. While F-regression can capture the degree of linear dependence between two variables, mutual information can capture any statistical dependency but requires more samples to be accurate, according to Scikit-learn [31].

A χ² statistical test was used for classification feature selection, computing χ² statistics between each feature and the class. Tables of the 20 best features for each scoring function, as well as all of the standard features, can be found in Appendix A. Most of the higher-scoring features use the prognosis in some way, which was no surprise. Also, grade, quarter, and year all scored very poorly on their own, which raised the question of how important they were. Furthermore, the manual features scored well, which was to be expected since there was no train/test split of the dataset before feature selection.
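The following is a minimal sketch of filter-based selection with the three scoring functions named above; the data shapes and targets are illustrative.

```python
# A minimal sketch of filter-based feature selection with scikit-learn.
import numpy as np
from sklearn.feature_selection import (SelectKBest, chi2, f_regression,
                                       mutual_info_regression)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 10))
y_reg = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)  # regression target
y_cls = (X[:, 1] > 0.5).astype(int)                    # classification target

# Rank features for regression with F-scores and mutual information.
f_scores, _ = f_regression(X, y_reg)
mi_scores = mutual_info_regression(X, y_reg)

# Chi-squared test for classification (requires non-negative features).
selector = SelectKBest(score_func=chi2, k=5).fit(X, y_cls)
X_best = selector.transform(X)
```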

After deriving the essential features using the filtering method, they were narrowed down further by using a wrapper method. The wrapper method was used to build trust in the model and to see which features contributed most to the predictions. The specific tool utilized in this degree project was LIME [18]. While the article itself mainly focused on building trust in classifiers, the tool supports regression as well. While LIME did not answer which features worked best on a global scale, it did help determine which features contributed to several predictions.

The top five Featuretools and manual features were selected from each scoring table in Appendix A. The most relevant ones, which performed well with both the filtering and wrapper feature selection methods, are shown in Tables 3.3, 3.4, 3.5, and 3.6. These feature sets were the ones used for the later comparisons, which are explained further in the Results chapter. The wrapper method also revealed that the standard features were more important than the univariate filtering approach suggested. Thus, all of the standard features were used for both regression and classification.


3.6 Cross-validation

Cross-validation was used to explore how well the models work across different training sets. According to James et al. [8], a k of either 5 or 10 is typically used, as both are a good choice in terms of the bias-variance trade-off. Thus, a k of 10 was used when applying cross-validation in this thesis.

3.7 Grid search

A neural network requires that the hyperparameters are tuned to get the most out of it. These hyperparameters were selected by performing an exhaustive cross-validated grid search. Five cross-validation folds were used for the grid search; five is the default value and was chosen for this degree project as a good trade-off between the number of folds and computational speed. The grid search was divided into three rounds, since it would be computationally expensive to tune everything at once. The rounds are presented in Table 3.7 and Table 3.8. The best-performing value for each hyperparameter in a round was carried forward to the next round; consequently, the final hyperparameters are given in the third round.
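The following is a minimal sketch of one such round with Scikit-learn's GridSearchCV; `net`, `X`, and `y` are assumed to be the skorch estimator sketched in Section 3.7.1 and the preprocessed training data, and the parameter names follow skorch's conventions.

```python
# A minimal sketch of one grid-search round; `net`, `X`, and `y` are
# assumed to exist (see Section 3.7.1 and the preprocessing steps).
from sklearn.model_selection import GridSearchCV

param_grid = {
    "lr": [0.001, 0.01, 0.1],   # round 2: tune the learning rate
    "max_epochs": [1000],
    "batch_size": [500],
}
search = GridSearchCV(net, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)      # carried forward to the next round
```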


Table 3.7: Regression grid search hyperparameter tuning.

Round Hyperparameter Values

Round 1

Learning rate 0.01 - - -

Epochs 500 1000 2000 -

Batch size 60 200 300 500

Hidden components 100 - - -

Dropout 0 - - -

Hidden layers 1 - - -

Round 2

Learning rate 0.001 0.01 0.1 -

Epochs 1000 - - -

Batch size 500 - - -

Hidden components 100 - - -

Dropout 0 - - -

Hidden layers 1 - - -

Round 3

Learning rate 0.01 - - -

Epochs 1000 - - -

Batch size 500 - - -

Hidden components 50 100 250 500

Dropout 0 0.2 - -

Hidden layers 1 2 3 -


Table 3.8: Classification grid search hyperparameter tuning.

Round Hyperparameter Values

Round 1

Learning rate 0.01 - - -

Epochs 60 200 300 -

Batch size 60 200 300 500

Hidden components 100 - - -

Dropout 0 - - -

Hidden layers 1 - - -

Round 2

Learning rate 0.001 0.01 0.1 -

Epochs 60 - - -

Batch size 300 - - -

Hidden components 100 - - -

Dropout 0 - - -

Hidden layers 1 - - -

Round 3

Learning rate 0.01 - - -

Epochs 60 - - -

Batch size 300 - - -

Hidden components 100 250 300 500

Dropout 0 0.2 - -

Hidden layers 1 2 - -

3.7.1 Neural network architecture

This section covers the architecture used for the neural networks. For classification, early stopping was used, since it often reached a good result quickly. Backpropagation was used to train both models.

The regression neural network used Rectified Linear Unit (ReLU) activation functions in the hidden layers. This prevented any negative target predictions, since ReLU returns zero for negative input. The loss function used for regression was Mean Squared Error (MSE) with a mean reduction, and Adam was chosen as the optimizer. A visualization of the architecture of the regression network is shown in Figure 3.2.


Figure 3.2: Regression neural network architecture. [Diagram: input (num_features) → hidden (250) → dropout (0.2) → hidden (250) → hidden (250) → output (1).]

In the figure, each hidden layer is followed by a ReLU activation function. A dropout layer was added as a regularization technique in both the regression and the classification network; the dropout layer randomly drops some of the inputs to a layer (20% in both models in this case).
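The following is a minimal sketch of this network as read from Figure 3.2 and Table 3.7, implemented with PyTorch and wrapped in skorch; the feature count is an illustrative assumption.

```python
# A minimal sketch of the regression network with PyTorch and skorch.
import torch
import torch.nn as nn
from skorch import NeuralNetRegressor

class RegressionNet(nn.Module):
    def __init__(self, num_features, hidden=250, dropout=0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Dropout(dropout),                  # regularization
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                 # single regression output
        )

    def forward(self, x):
        return self.layers(x)

net = NeuralNetRegressor(
    RegressionNet,
    module__num_features=30,        # illustrative feature count
    criterion=nn.MSELoss,           # MSE loss with mean reduction
    optimizer=torch.optim.Adam,
    lr=0.01, max_epochs=1000, batch_size=500,
)
```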

The classification neural network used a ReLU activation function in the hidden layer. The loss function used for classification was Negative Log Likelihood Loss (NLLLoss), which operates on log-probabilities of each class. These probabilities were obtained in the neural network by using Softmax activation in the output layer. Since only binary classification was used in this degree project, there were two classes (either an order was placed or not). Adam was chosen as the optimizer for classification as well. A visualization of the architecture of the classification network is shown in Figure 3.3.


Figure 3.3: Classification neural network architecture. [Diagram: input (num_features) → hidden (250) → dropout (0.2) → output (2) with Softmax.]
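A corresponding minimal sketch of the classification network follows, including skorch's early-stopping callback; since NLLLoss expects log-probabilities, LogSoftmax is used in the output layer here. The feature count and patience value are illustrative assumptions.

```python
# A minimal sketch of the classification network with early stopping.
import torch
import torch.nn as nn
from skorch import NeuralNetClassifier
from skorch.callbacks import EarlyStopping

class ClassificationNet(nn.Module):
    def __init__(self, num_features, hidden=250, dropout=0.2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_features, hidden), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 2),        # two classes: order / no order
            nn.LogSoftmax(dim=-1),       # NLLLoss expects log-probabilities
        )

    def forward(self, x):
        return self.layers(x)

clf = NeuralNetClassifier(
    ClassificationNet,
    module__num_features=30,                # illustrative feature count
    criterion=nn.NLLLoss,
    optimizer=torch.optim.Adam,
    lr=0.01, max_epochs=60, batch_size=300,
    callbacks=[EarlyStopping(patience=5)],  # stop when validation loss stalls
)
```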

3.8 Model training

3.8.1 Normalization

Normalization was performed using Scikit-learn's MinMaxScaler to bring all features into the range 0 to 1. This was particularly necessary to get the neural network to learn properly: for neural networks, the features should be measured in comparable ranges so that there is no bias towards improperly scaled columns.

3.8.2 Encoding

While most features were continuous, some of the standard ones were categorical. The customers, quarters, and years were all one-hot encoded, since they were limited to an adequate number of unique values. The powder grades, however, were label encoded, since far more unique values were present, which would have created many feature columns relative to the size of the dataset. Had more data been available, one-hot encoding might have been used for the powder grades as well.
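The following is a minimal sketch of the scaling and encoding steps with pandas and Scikit-learn; the column names and values are illustrative.

```python
# A minimal sketch of the encoding and normalization steps.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.DataFrame({
    "grade": [555, 612, 555],
    "customer": [8, 3, 8],
    "quarter": [2, 4, 1],
    "prognosis": [30.0, 5.0, 120.0],
})

# One-hot encode the low-cardinality categoricals ...
df = pd.get_dummies(df, columns=["customer", "quarter"])
# ... but label encode the high-cardinality powder grades.
df["grade"] = LabelEncoder().fit_transform(df["grade"])

# Scale every feature into [0, 1] so no column dominates the network.
scaled = MinMaxScaler().fit_transform(df)
```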


3.8.3 Removing outliers

There are numerous ways to detect outliers, using either visualization or statistical methods, or even manually determining which samples are relevant. Another common approach is to identify them using the Z-score. While there were some samples with larger order quantities than usual, these should not be considered outliers, since they were valid data points. Instead, the only samples removed from the dataset were powder grades that had never been ordered, since those samples could belong to either a new product or an outdated version of a product.
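The following is a minimal pandas sketch of that filtering step; the column names are illustrative.

```python
# A minimal sketch of dropping grades that have never been ordered.
import pandas as pd

df = pd.DataFrame({
    "grade": [555, 555, 612, 700],
    "order_quantity": [48.0, 0.0, 0.0, 25.0],
})

ordered_totals = df.groupby("grade")["order_quantity"].sum()
never_ordered = ordered_totals[ordered_totals == 0].index  # e.g. grade 612
df = df[~df["grade"].isin(never_ordered)]
```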

Figure 3.4 displays a scatter plot of the entire dataset, color-coded by powder grade to visualize outliers and to identify whether the grades had any relation to them. It shows that most prognoses and order quantities fall within a specific range and that larger orders were rarer. Moreover, powder grades above 900 seem to account for a significant portion of these larger orders. Additionally, larger prognoses consistently led to an order being made, even if they were imprecise.

Figure 3.4: Scatter plot of all samples (prognosis on the x-axis, order quantity on the y-axis), colored by powder grade. The color bar on the right represents the colors for the categorical powder grade values. The dotted line displays the ideal scenario where prognoses are equal to the order quantities.


3.9 Kruskal-Wallis H test

Each model was trained and tested ten times in total for every feature set. A Kruskal-Wallis test of statistical significance was therefore performed to verify that the result samples for the different feature sets originated from different distributions [32]. Using such a test, it was possible to confirm that the results obtained for each feature set were not entirely random but had their own distributions.

3.10 Combining classification and regression

As a final test, a trained classification model was combined with a trained regression model to predict unseen data. This test was performed by reserving 20% of the entire dataset for final testing and training the classification model on the rest of the data. The regression model was then trained on the samples with an order quantity above zero.


4 Results

This chapter introduces the main results of this thesis. Additionally, differences between the algorithms and feature sets that were investigated are presented.

First, the results of the classification comparisons are presented. This includes comparing the linear approaches to the neural network approach, as well as differences between the feature sets. Next, the same comparisons are shown for regression.

For every experiment, 80% of the data is used for training and 20% for testing. Then, 20% of the training data is used as a validation set for the neural network classification model. By doing so, the validation loss can be monitored for early stopping in order to prevent overfitting. No validation set is used for regression, since early stopping did not improve the performance of the regression model. All of the accuracy and error distributions are calculated on the test set unless otherwise specified. Since the train/test split impacts the results, a total of 10 tests are performed per experiment, and the means of the errors/accuracy are calculated.

From here on, the feature sets will be referred to as Standard, Featuretools, and Manual features, with combinations of these enclosed in curly brackets. These feature sets are the features selected for regression and classification after feature selection. The table references corresponding to the features of each feature set can be found in the Terminology section at the beginning of this thesis.


4.1 Classification

Below are the results of classification accuracy using the different feature sets. Logistic regression is the linear approach used in this case. Polynomial logistic regression, a special case of logistic regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial, is also investigated. That model uses the same function as the logistic regression approach, but the second degree of every numerical feature column in the dataset is computed before training. The classification model classifies a sample as either 1 or 0, representing whether an order will be made or not.
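The following is a minimal sketch of the two linear classifiers on hypothetical data; appending the square of each numerical column approximates the polynomial variant described above.

```python
# A minimal sketch of logistic and polynomial logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 4))
y = (X[:, 0] + X[:, 1] ** 2 > 1.0).astype(int)  # hypothetical labels

logistic = LogisticRegression().fit(X, y)

# Second-degree terms for every numerical feature column before training.
X_poly = np.hstack([X, X ** 2])
logistic_poly = LogisticRegression().fit(X_poly, y)
```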

4.1.1 Accuracy scores

Tables of accuracy scores for the different feature sets are presented here. As mentioned earlier, mean values are calculated over ten runs in order to account for the variance and make accurate comparisons. The whole dataset of 2676 points is used for the classification training, with 7 of the samples discarded as outliers.



Figure 4.1: Bar plot showing mean classification accuracy over 10 runs using Standard (S), {Standard, Featuretools} (SF), and {Standard, Featuretools, Manual} (SFM) feature sets. The error bars display standard deviations for each feature set.

As Figure 4.1 shows, accuracy scores are consistently higher when using the {Standard, Featuretools, Manual} feature set. The linear approaches seem to work as well as the neural network: the mean ANN accuracy is only very slightly ahead after adding the Manual features, and given the standard deviation error bars, it cannot be concluded whether the linear approaches or the neural network is the better option. The error bars also show that the differences between runs are minimal, which suggests that the classification model is stable in general.

4.2 Regression

This section will cover the results acquired from linear regression and neural network regression. A total of 989 samples from the whole dataset are used for regression training.


4.2.1 Error distributions

An in-depth comparison between the different approaches and feature sets using common error measures is shown in this section. Similar to the classification accuracy, the regression error measures are calculated over ten separate runs as well.

Table 4.1: Means and standard deviations for regression errors over 10 runs using Standard features.

Model Function R2 RMSE MAE MAPE

Linear  MEAN  0.6495  20.0152  9.1026  1001.4328
ANN     MEAN  0.7327  12.2307  4.2590   217.4704
Linear  STD   0.0555   3.5845  0.9152   187.8917
ANN     STD   0.2055   4.0569  0.7283   189.6530

Table 4.2: Means and standard deviations for regression errors over 10 runs using {Standard, Featuretools} features.

Model Function R2 RMSE MAE MAPE

Linear  MEAN  0.6231  22.4856  9.1943  1160.8954
ANN     MEAN  0.8167  10.4761  3.8995   364.2408
Linear  STD   0.1600   6.5802  0.9098   228.5471
ANN     STD   0.1305   2.6423  0.6455   163.4619

Table 4.3: Means and standard deviations for regression errors over 10 runs using {Standard, Featuretools, Manual} features.

Model Function R2 RMSE MAE MAPE

Linear  MEAN  -2.9537  49.8508  18.2882  1239.9574
ANN     MEAN   0.4427  25.1303   7.4033   238.8325
Linear  STD    2.5908  20.5145   5.3646   307.4138
ANN     STD    1.0867  11.3817   2.5774   109.1275

Using the {Standard, Featuretools} features with an ANN model produces the best predictions. However, the error distributions show that either the Standard or the {Standard, Featuretools} feature set is a viable option. Linear regression is more consistent when only the Standard features are used. When the Manual features are added, the predictions overfit to the training data, but the ANNs deal with this better than linear regression. This is not surprising, since several regularization methods were applied to the neural networks.

4.3 Kruskal-Wallis H test

A Kruskal-Wallis H test is applied to the distributions of classification accuracy and regression error measures to verify that the distributions from different feature sets are divergent. The null hypothesis of the Kruskal-Wallis H test is that the data samples were drawn from the same distribution. If the null hypothesis is rejected, there is enough statistical evidence to conclude that one or more feature sets dominate another feature set in terms of accuracy/error distributions. Usually, a p-value lower than 0.05 suggests that the samples are drawn from different distributions.

Table 4.4 shows the Kruskal-Wallis results for classification accuracy. Looking at the p-values, it is evident that the different feature sets for each model have different distributions and are not just random.

Table 4.4: Kruskal-Wallis results on classification accuracy between Standard, {Standard, Featuretools}, and {Standard, Featuretools, Manual} feature sets.

Model               H          p
Linear              19.4167    0.00006077
Linear polynomial   20.3544    0.00003803
ANN                 19.3954    0.00006143


Table 4.5: Kruskal-Wallis results on regression error distributions between Standard, {Standard, Featuretools}, and {Standard, Featuretools, Manual} feature sets.

Model    Error measure    H          p
Linear   R2               18.9858    0.00007538
Linear   RMSE             12.1471    0.00230299
Linear   MAE              13.9845    0.00091897
Linear   MAPE             3.6723     0.15943339
ANN      R2               11.0271    0.00403178
ANN      RMSE             16.1368    0.00031329
ANN      MAE              17.3600    0.00016995
ANN      MAPE             7.1329     0.02825594

Table 4.5 shows the Kruskal-Wallis results for the regression error distributions of each error measure. Since p < 0.05 in almost every case, the errors are drawn from different distributions. The exception is the linear model's MAPE (p = 0.1594), for which it cannot be concluded that the error values originated from different distributions. This is not surprising, since MAPE fluctuated considerably between feature sets.

4.4 Cross-Validation

A 10-fold cross-validation test is used for the best performing feature sets for both classification and regression to get a better sense of how the models perform on different training sets. R2 is used as the measure for regression, while accuracy is used for classification. Each classification cross-validation test is carried out on all 2676 data points. Each regression cross-validation test is carried out on the 989 order samples with positive order quantity values. The ten folds which the data is distributed into are selected randomly.
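As a sketch of the procedure, the following runs 10-fold cross-validation with scikit-learn on synthetic stand-in data; the estimator, feature matrix, and targets are all placeholders, not the models or data used in this study.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score

# Synthetic stand-in for the 989 regression samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(989, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=989)

model = LinearRegression()
kf = KFold(n_splits=10, shuffle=True, random_state=0)  # random fold assignment
for i, (tr, te) in enumerate(kf.split(X), start=1):
    model.fit(X[tr], y[tr])
    print(f"Round {i}: train R2={r2_score(y[tr], model.predict(X[tr])):.4f}, "
          f"test R2={r2_score(y[te], model.predict(X[te])):.4f}")
```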


Table 4.6: Cross-validation test for classification with a {Standard, Featuretools, Manual} feature set.

Round       Train accuracy    Test accuracy
Round 1     0.931             0.914
Round 2     0.934             0.910
Round 3     0.936             0.887
Round 4     0.937             0.906
Round 5     0.929             0.906
Round 6     0.933             0.925
Round 7     0.928             0.929
Round 8     0.932             0.917
Round 9     0.930             0.921
Round 10    0.938             0.887

As Table 4.6 shows, the reliability of the classification is rather high and does not differ much between the different validation sets.

Table 4.7: Cross-validation test for regression with a {Standard, Featuretools} feature set.

Round       Train R2    Test R2
Round 1     0.9952      0.9442
Round 2     0.9956      0.9381
Round 3     0.9911      0.7580
Round 4     0.9960      0.9720
Round 5     0.9913      0.8592
Round 6     0.9955      0.7327
Round 7     0.9921      0.8799
Round 8     0.9912      0.9594
Round 9     0.9964      0.9191
Round 10    0.9938      0.9116

While the R2 scores are fairly consistent, there are some significant dips (e.g., rounds 3 and 6), suggesting that the model trained on certain folds generalizes less well to the corresponding test set.


4.5 Combining classification and regression

Both classification and regression are put together to work as one model in this section. Samples classified as 1 are passed on to regression testing. Out of all 2676 data points, 20% are set aside to test both models together at the end. The classification model is trained on all remaining data points, while the regression model is trained on the data points with positive order quantity values, just as before.
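A minimal sketch of this two-stage procedure is given below; `clf`, `reg`, and `X` are hypothetical placeholders for the trained classifier, the trained regressor, and a preprocessed feature matrix, not this study's actual objects.

```python
import numpy as np

def predict_order_quantities(clf, reg, X):
    """Stage 1 classifies order / no order; stage 2 regresses quantities."""
    quantities = np.zeros(len(X))        # default: no order -> quantity 0
    will_order = clf.predict(X) == 1     # stage 1: which prognoses lead to orders
    if will_order.any():
        # Stage 2: estimate quantities only for samples classified as 1.
        quantities[will_order] = np.ravel(reg.predict(X[will_order]))
    return quantities
```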

Contrary to the previous experiments, the models in this section are only trained once, since they have to be stored as separately trained models and then loaded together to be tested on the new data. For each model, the MinMaxScaler is saved as a Python pickle so that it can be loaded later to scale the test data in the same way. The one-hot encoder has to be pickled as well, since there could be category values present in the original dataset but not in the reserved dataset; refitting an encoder on the reserved data alone would then produce a different number of features.
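A minimal sketch of this persistence step is shown below, assuming scikit-learn preprocessing objects; the file names and the tiny training arrays are illustrative stand-ins only.

```python
import pickle
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Stand-in training columns: one numeric, one categorical.
X_train_numeric = np.array([[1.0], [5.0], [10.0]])
X_train_categorical = np.array([["A"], ["B"], ["A"]])

scaler = MinMaxScaler().fit(X_train_numeric)
# handle_unknown="ignore" keeps the feature count fixed even if the
# reserved data contains category values never seen during training.
encoder = OneHotEncoder(handle_unknown="ignore").fit(X_train_categorical)

with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
with open("encoder.pkl", "wb") as f:
    pickle.dump(encoder, f)

# Later, in the test phase, reload and transform the reserved data identically:
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
with open("encoder.pkl", "rb") as f:
    encoder = pickle.load(f)
```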

4.5.1 Phase 1: Training

This is the first phase of creating the combined model, which consists of training both the classification and regression models. A train/test split of 80/20 is done for the training phase. The accuracy and error distributions shown are calculated on the test sets.

Classification training accuracy

A total of 2140 samples are used for the classification training. An accuracy of 0.874 was achieved on the test set of this training phase.

Regression training

A total of 790 of the 2140 previously mentioned data points are used for regression training. The error distributions achieved are presented in Table 4.8.

Table 4.8: Regression errors of final ANN regression model using {Standard, Featuretools} features.

R2        RMSE       MAE       MAPE
0.9136    10.6146    4.1752    229.8546


[Figure 4.2, scatter plot: true vs. predicted order quantities, both axes spanning 0 to 400.]

Figure 4.2: The training results of the final ANN regression model using a {Standard, Featuretools} feature set. Predicted vs true order quantities are plotted along with the perfect prediction line.

As Figure 4.2 shows, the predictions are reasonably good and distributed rather close to the perfect prediction line, yielding a decent R2 score.

4.5.2 Phase 2: Testing the models together

This is the second phase of creating the combined model, which evaluates how well the classification and regression work on the 20% of the data reserved from the entire dataset. No train/test split is performed here since the models are already trained. Instead, the already trained classification model is tested on all of the data points reserved in phase 1. The regression model is then used on the samples classified as 1, in other words, those that the classification model believes will lead to an order being placed.

A total of 536 data points were reserved from the start to test the combined model. An accuracy of 0.879 was achieved on this test set for the trained classification model from phase 1.

Table 4.9 shows how many of the classification model's 0 and 1 predictions were correct and compares them to the true counts.


Table 4.9: Classification counts for order and no-order samples of the final ANN classification model. The number of correctly predicted samples is compared to the true count.

Category        Correctly predicted    True
Order (1)       166                    199
No order (0)    305                    337

Table 4.10: Regression errors of final ANN regression model using {Standard, Featuretools} features.

R2        RMSE      MAE       MAPE
0.7611    7.6624    3.9717    NaN

Figure 4.3 displays the predictions of the testing phase. Compared to Figure 4.2, the predictions are more scattered, thus lowering the R2 score significantly.

[Figure 4.3, scatter plot: true vs. predicted order quantities, both axes spanning 0 to 120.]

Figure 4.3: The testing results of the final ANN regression model using a {Standard, Featuretools} feature set. Predicted vs true order quantities are plotted along with the perfect prediction line.

Overall, the combined model produces reasonably good results. The classification in particular seems to be working well, whereas the regression leaves room for improvement.


Discussion and conclusion

This chapter discusses the results that were acquired throughout this study. Furthermore, the work is concluded with recommendations for Sandvik Coromant and future work that remains to be done.

This degree project aimed to find out whether linear regression or neural network regression was the more effective approach for estimating Sandvik Coromant's powder inventory needs based on customer prognoses. Previous research showed that neural networks were the most favorable option for forecasting and prediction in general, and this study confirmed that the ANN was the better approach for regression analysis in this case. The relation between input and output was therefore likely nonlinear and more complex than what linear regression could explain. However, the classification results show that logistic regression and neural network classification give comparable results.

All of the error distributions were lowest for the {Standard, Featuretools} feature set in the case of regression, except for MAPE. There could be multiple reasons for this. MAPE has several drawbacks, one being that it is undefined for data points whose true value is 0, which is why the MAPE is NaN in the error calculations of the combined model. Additionally, MAPE is biased in favor of predictions below the target: for non-negative forecasts, the percentage error of an under-prediction is bounded at 100%, while over-predictions can yield arbitrarily large errors. This would explain the fluctuation between feature sets.
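For reference, the standard definition of MAPE (cf. Section 2.5.3) makes both issues visible; the concrete numbers below are purely illustrative.

```latex
% Standard MAPE definition; undefined whenever some true value y_i = 0.
\mathrm{MAPE} \;=\; \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|
```

For a true value of 10, an under-prediction of 0 contributes 100% while an over-prediction of 30 contributes 200%; with non-negative forecasts, an under-prediction can never contribute more than 100%, whereas over-predictions are unbounded.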

As the results demonstrated, there was high variance in the accuracy of the predictions despite using multiple methods to make the model generalize better. Nevertheless, this was not very surprising because of

