Evaluation of different machine learning models for the prediction of electric or hybrid vehicle buyers and identification of the characteristics of the buyers in the EU


Ziaul Islam Chowdhury & Iskanter Bensenousi

Supervisor Shujun Wang Karlskrona, Sweden September 2020

MBA Thesis

DEPARTMENT OF INDUSTRIAL ECONOMICS www.bth.se/mba


This thesis is submitted to the Department of Industrial Economics at Blekinge Institute of Technology in partial fulfilment of the requirements for the Degree of Master of Science in Industrial Economics and Management. The thesis is awarded 15 ECTS credits.

The author(s) declare(s) that they have completed the thesis work independently. All external sources are cited and listed under the References section. The thesis work has not been submitted in the same or similar form to any other institution(s) as part of another examination or degree.

Author information:

Ziaul Islam Chowdhury zich18@student.bth.com Iskanter Bensenousi iskbensenousi@gmail.com

Department of Industrial Economics Blekinge Institute of Technology SE-371 79 Karlskrona, Sweden Website: www.bth.se

Telephone: +46 455 38 50 00 Fax: +46 455 38 50 57


Abstract

This document was created under the MBA Programme of BTH and is the final thesis of the students Iskanter Bensenousi and Ziaul Islam Chowdhury. The authors have technical and managerial backgrounds in electrical and software engineering and have tried to combine their engineering expertise with the business knowledge acquired from the MBA Programme in this thesis.

The main goal of this thesis is to evaluate different machine learning models in order to classify buyers of an electric or a hybrid vehicle and to identify the characteristics of the buyers in the European Union.

Machine learning algorithms and techniques were adopted to analyze the dataset and to create models that could predict, with a certain accuracy, the customer’s willingness to buy an EV. Identification of the characteristics of the buyers was based on the most important features identified by the machine learning models and on statistical analysis.

The research consisted of exploratory and explanatory methods (mixed method) with quantitative and qualitative techniques. Quantitative techniques were applied to convert categorical values to ordinal and nominal numeric values, to establish cause-and-effect relationships between the variables through statistical analysis, and to apply machine learning methods to the dataset. The quantitative results were then analyzed with quantitative and qualitative techniques in order to identify the characteristics of the buyers.

The data analytics part relied on a large, publicly available dataset from the EU containing transport and mobility related data. From the experiments with logistic regression, support vector machine, random forest, gradient boosting classifier and artificial neural network (ANN), it was found that the ANN is the best model to identify who won’t buy an EV and the gradient boosting classifier is the best model to identify who would like to buy an EV.

ML-based feature importance identification methods (MDI, permutation feature importance) were used to analyze the characteristics of the buyers. The major buyer characteristics found in this thesis are environmental concern, knowledge of car sharing, country of residence, education, preference for controlling road traffic, gender, incentive and location of residence.

The authors recommend green marketing as a potential enabler of a faster and larger adoption of electric vehicles in the market, as concern about environmental impact was found to be the most significant characteristic of the buyers. Finally, the authors recommend that future researchers fine-tune the algorithms extensively in order to achieve better accuracy and collect primary data based on the most important features identified in this thesis.

Keywords: Electric Vehicles, Automotive Industry, Machine Learning, Buyer’s characteristics, Potential buyers of EV.


Acknowledgements

I would like to dedicate this thesis work to my parents (Sirajul Islam Chowdhury and Rowshan Ara Islam). A very special thanks goes to my son Nio Beck and my partner Jessica Beck. My son was born in the middle of the MBA study, and it would never have been possible to complete the study without their support.

I am grateful to Shujun Wang, the mentor of the thesis, for her support and guidance.

Karlsruhe, September 2020 Ziaul Islam Chowdhury

I would like to thank my parents and dedicate this thesis to them. A special mention also goes to my partner and the friends without whom I wouldn’t be the person I am today. I would like to thank my best man and closest friend for his advice during this thesis. Last but not least, I am grateful to the thesis mentor Shujun Wang for her support and guidance.

Athens, September 2020 Iskanter Bensenousi


Table of contents

1. Introduction
1.1. Problem discussion
1.2. Problem formulation
1.3. Purpose
1.4. Delimitations
1.5. Thesis structure

2. Theoretical framework and Literature review
2.1. Predictive analytics and machine learning
2.2. Classification problems and solution approaches
2.3. Machine learning algorithms for classification
2.3.1. Logistic Regression
2.3.2. Support Vector Machine
2.3.3. Ensemble Methods (Random Forest Classifier, Gradient Boosting Classifier)
2.3.4. Artificial Neural Networks
2.3.5. How to evaluate classification models
2.4. Characteristics of the buyers of EV/HV
2.5. Conceptual model

3. Methodology
3.1. Research method
3.2. Method details
3.2.1. Pre-processing
3.2.2. Train and evaluate machine learning models
3.2.3. Result evaluation
3.3. Required tools and materials
3.4. Consequences
3.4.1. Validity and reliability
3.4.2. Ethical issues

4. Empirical findings or Results
4.1. Setup of development environment
4.1.1. Dataset
4.1.2. Python Tools and Libraries
4.1.3. Keras & Tensorflow
4.1.4. IBM SPSS
4.2. Pre-processing of dataset
4.2.1. Balancing the distribution of output variable
4.2.2. Feature extraction
4.2.3. Converting textual categorical values into numeric ordinal or nominal values
4.2.4. Computation of boxplots
4.3. Logistic Regression
4.3.1. Model details
4.3.2. Hyperparameter Tuning
4.3.3. Result
4.4. Support Vector Machine (SVM)
4.4.1. Hyperparameter Tuning
4.4.2. Result by using 2014 dataset
4.4.3. Result by using 2018 dataset
4.5. Ensemble Methods
4.5.1. Model details
4.5.2. Hyperparameter Tuning
4.5.3. Result by using 2014 dataset
4.5.4. Result by using 2018 dataset
4.6. Artificial Neural Network
4.6.1. Model details
4.6.2. Result
4.6.3. Result from SPSS
4.7. Identification of the most important features
4.7.1. Mean decrease impurity (MDI) based feature importance
4.7.2. Permutation feature importance
4.8. Identification of the characteristics of the buyers
4.8.1. EV buyers are highly concerned on the environmental impact
4.8.2. Demographic location of the buyers plays an important role
4.8.3. Buyers of EV would like to subscribe to car sharing
4.8.4. People who would like to control road traffic are the probable buyer
4.8.5. Traffic congestion is an obstacle
4.8.6. Male population buy more EV
4.8.7. Age slightly impacts the decision
4.8.8. Highly educated buyers will most likely consider an EV
4.8.9. Buyers from large city or metropolitan area consider EV as an option
4.8.10. Transfers between modes during frequent trip play little role
4.8.11. Most buyers drive short distance and duration
4.8.12. Government incentive impacts the decision of the buyers
4.8.13. Parking problem influences the decision of the buyers
4.8.14. Non underground, light train or tram travelers are more interested in EV
4.8.15. Number of vehicles in household influences buyer’s behavior
4.8.16. Buyers who make or don’t make most frequent trip by walking want EV
4.8.17. Employed and unemployed buyers are interested in EV
4.8.18. Buyers with different professions decide differently
4.8.19. Buyers with 2 or more household members are more interested in EV
4.8.20. Most frequent & non-frequent bus travelers both are interested in EV
4.8.21. Buyers who make frequent trips are more interested in EV
4.8.22. People living in the center of city are potential buyers of an EV
4.8.23. Buyers with car driving license are more interested in EV
4.8.24. Most frequent car drivers and non-drivers both are more interested in EV
4.8.25. Buyers with lower-middle and middle income-level are more interested in EV

5. Analysis
5.1. Analysis on the evaluation of the ML models
5.1.1. Accuracy, precision, recall and F1-score
5.1.2. Computation speed
5.2. Analysis of the Buyer’s Characteristics

6. Conclusions
6.1. Summary
6.2. Research question and the answers
6.3. Recommendations
6.4. Implications
6.5. Limitations and future research

7. References
8. Appendix
8.1. Appendix A: List of dropped features
8.2. Appendix B: Converted nominal and ordinal categorical values
8.3. Appendix C: Flowchart of SVM model
8.4. Appendix D: Selected Python Code
8.4.1. Preprocessing
8.4.2. Gradient Boosting Classifier Model
8.4.3. Common Functions
8.4.4. Gradient boosting classifier permutation importance


List of Tables

Table 1: Structure of the confusion matrix
Table 2: Output variable
Table 3: Statistical description of age, country and income level
Table 4: Values of age category feature
Table 5: Confusion matrix of logistic regression model based on 2014 dataset
Table 6: Performance metrics of logistic regression model based on 2014 dataset
Table 7: Confusion matrix of SVM model based on 2014 dataset
Table 8: Performance metrics of SVM model based on 2014 dataset
Table 9: Confusion matrix of SVM model based on 2018 dataset
Table 10: Performance metrics of SVM model based on 2018 dataset
Table 11: Confusion matrix of random forest and gradient boosting classifiers based on 2014 dataset
Table 12: Performance metrics of random forest and gradient boosting classifiers based on 2014 dataset
Table 13: Confusion matrix of random forest & gradient boosting classifiers based on 2018 dataset
Table 14: Performance metrics of random forest & gradient boosting classifiers based on 2018 dataset
Table 15: Performance metrics of MLP based on 2014 dataset by using Keras
Table 16: Accuracy of MLP by using SPSS
Table 17: Aggregated list of important features sorted by weight
Table 18: Country wise buyer’s decision to buy an EV
Table 19: Count of considering EV/HV on next purchase based on the number of vehicles per household
Table 20: Count of considering EV/HV on next purchase based on profession level
Table 21: Accuracy, precision, recall and F1-score of ML models
Table 22: List of dropped features
Table 23: Encoding of textual values to numerical values


List of Figures

Figure 1: Classification task
Figure 2: General approach of building a classification model
Figure 3: Random forest algorithm
Figure 4: Multi-layer Perceptron for classification [MLP]
Figure 5: Stimulus Response Model of the Buyers Behavior
Figure 6: Conceptual model
Figure 7: Block diagram of the ML model evaluation method
Figure 8: Block diagram of the research method of buyer’s characteristics
Figure 9: Bar chart of the descriptive analysis between education and consider_ev_buy
Figure 10: Setup of tools and software for the result and analysis
Figure 11: Distribution of categories of the target variable
Figure 12: Boxplots of features: gender, work status, number of household members, income level, location of residence
Figure 13: Boxplots of features: centre or suburbs interest, public transport service, car driving license, number of vehicles in household, knowledge of what car sharing is
Figure 14: Boxplots of features: intention to subscribe to car sharing economy, most frequent trip-walking, most frequent trip-bicycle, most frequent trip as a car driver, most frequent trip as a car passenger
Figure 15: Boxplots of features: most frequent trip train, most frequent trip metro, most frequent trip tram, most frequent trip bus, most frequent trip motorcycle
Figure 16: Boxplots of features: destination of most frequent trip, frequency of most frequent trip, problem of most frequent trip congestion, problem of most frequent trip parking, problem of most frequent trip bicycle lanes
Figure 17: Boxplots of features: problem of most frequent trip infrequent transportation, problem of most frequent trip lack of transportation, problem of most frequent trip-none, transfers between modes during frequent trip, frequent trip duration (in minutes)
Figure 18: Boxplots of features: frequent trip distance, concern environmental impact, preference in paying tolls or traffic limitation, age, countries incentives
Figure 19: Boxplots of features: countries offering tax benefits, education, profession
Figure 20: Flow chart of logistic regression model
Figure 21: ROC curve of logistic regression model
Figure 22: ROC curve of SVM model
Figure 23: ROC curve of SVM model based on 2018 dataset
Figure 24: Flow chart of ensemble method models
Figure 25: ROC curve of random forest model
Figure 26: ROC curve of gradient boosting model
Figure 27: ROC curve of random forest model (2018)
Figure 28: ROC curve of gradient boosting model (2018)
Figure 29: Flow chart of multilayer perceptron model
Figure 30: ANNs Architecture
Figure 31: ROC curve of MLP generated by SPSS based on 2014 dataset
Figure 32: Feature importance from random forest classifier model
Figure 33: Feature importance from gradient boosting model
Figure 34: Permutation feature importance of random forest classifier model
Figure 35: Permutation feature importance of gradient boosting classifier model
Figure 36: Count chart of concern on environmental impacts on the target variable y (consider EV/HV on next purchase)
Figure 37: Count chart of “would subscribe to car sharing if available” on the target variable y
Figure 38: Count chart of “would subscribe car sharing” with respect to the “know what car sharing is”
Figure 39: Count chart of “Preference_tolls_or_traffic_limitation” with respect to the target variable y
Figure 40: Count chart of problem most frequent trip congestion on the target variable y
Figure 41: Count chart of gender on the target variable y
Figure 42: Count chart of age categories on the target variable y
Figure 43: Count chart of education level on the target variable y
Figure 44: Count chart of location of residence on the target variable y
Figure 45: Count chart of transfer between modes during frequent trip on the target variable y
Figure 46: Count chart of frequent trip distance and target variable y
Figure 47: Count chart of frequent trip duration and target variable y
Figure 48: Count chart of country with incentive on the target variable y
Figure 49: Count chart of “problem most frequent trip parking” on the target variable y
Figure 50: Count chart of “most frequent underground or light train” on the target variable y
Figure 51: Count chart of “most frequent trip walk” on the target variable y
Figure 52: Count chart of “work status” on the target variable y
Figure 53: Count chart of “education level” on “profession level”
Figure 54: Count chart of “household members” on the target variable y
Figure 55: Count chart of “most frequent trip bus” on the target variable y
Figure 56: Count chart of “frequency most frequent trip” on the target variable y
Figure 57: Count chart of “centre or suburbs” on the target variable y
Figure 58: Count chart of “car driving license” on the target variable y
Figure 59: Count chart of “most frequent trip car as driver” on the target variable y
Figure 60: Count chart of income level on the target variable y
Figure 61: Flow chart of support vector machine model


List of abbreviations

AI Artificial Intelligence
ANN Artificial Neural Network
API Application Programming Interface
BEV Battery Electric Vehicle
CO2 Carbon dioxide
CEO Chief Executive Officer
CEV Combustion Engine Vehicle
CSE Center for Sustainable Energy
CSV Comma Separated Values
DNN Deep Neural Network
EC European Commission
ECOC Error-correcting output coding
EEA European Environment Agency
EOL End of Line
EU European Union
EV Electric Vehicle
GM Green Marketing
HV Hybrid Vehicle
ICE Internal Combustion Engine
IDE Integrated Development Environment
IS Information Systems
MDI Mean Decrease Impurity
ML Machine Learning
MLP Multi-Layer Perceptron
NLP Natural Language Processing
PCA Principal Component Analysis
PHEV Plug-in Hybrid Electric Vehicle
PM Particulate Matter
ReLU Rectified Linear Unit
ROC Receiver Operating Characteristic
SVM Support Vector Machine


1. Introduction

1.1. Problem discussion

According to the European Environment Agency (EEA), road transport is the second largest contributor of particulate matter (PM), which causes pollution and various health issues (Hromádko & Miler, 2012). Cars cause 12% of the CO2 emissions in the European Union (EU) (EC - Reducing CO2 emissions from passenger cars). A study conducted by Greenpeace found similar results: cars running on petrol or diesel were identified as key contributors to the current level of air pollution (Greenpeace, 2020).

The necessity of decarbonization and a sustainable future was recognized in a study conducted by Rubens (Rubens, 2019). Governments and policy makers worldwide are trying to control the pollution caused by combustion engine vehicles (CEVs); the EU has adopted a series of legislative measures to reduce the emission rates of new cars. In 2016, fourteen countries set national targets for the deployment of electric vehicles between 2020 and 2030. Similarly, the State Council of China set a plan to sell 35 million new energy vehicles by 2025 (Wikipedia_Vehicles_China).

Two main characteristics of electric vehicles (EVs), “zero tail-pipe emission” and “low maintenance cost”, have played major roles in their growing popularity (Abas, et al., 2019). EVs can be environmentally competitive, emitting less greenhouse gas than conventional vehicles, when renewable energy sources are used in the energy chain and the efficiency of the power plants is improved (Abas, et al., 2019).

The slow growth of electric vehicle sales was identified in the study by Lieven et al., who mention that only 162 out of 3.8 million registered vehicles were battery electric (Lieven, et al., 2011). They conducted a survey of 1152 German individuals and predicted that only 5% of the sample were potential buyers of EVs (Lieven, et al., 2011). This is good progress in a country with a tradition of manufacturing automobiles with internal combustion engines and may sum up to 175000 EVs. Tsakalidis and Thiel found that over one million EVs and hybrid vehicles (HVs) registered in the EU represented only 2% of the market share from 2010 to 2017 (Tsakalidis & Thiel, 2018). Despite the progress in the deployment of EVs, only 0.2% of the global passenger vehicle fleet were EVs in 2016 (Rubens, 2019).

As the electric vehicle market is still new and represents a small fraction of the automobile market, there is considerable scope for a study that evaluates a set of machine learning models in order to identify potential buyers of EVs. The authors carefully reviewed previous studies and could not find any in which the results of multiple machine learning models were evaluated to classify EV buyers in the EU. There are a few studies in which the characteristics of EV buyers are analyzed; these are used as references against which the results of this thesis are compared.

A couple of previously published journal articles have been used as references for this study. Rubens conducted a study to identify the buyers of EVs after the early adopters in the Nordic countries, based on unsupervised learning (i.e. K-means clustering) (Rubens, 2019). Lieven, et al. forecast the potential of the EV market in Germany based on 14 categories (Lieven, et al., 2011). Burghard & Dütschke conducted a study to identify the percentage of EV buyers who want shared mobility such as car sharing (Burghard & Dütschke, 2019). A study by Christidis & Focas focused on identifying the variables that affect the purchase of an EV by using machine learning (ML) (Christidis & Focas, 2019). The ways in which governments can promote technological change in the electric car industry were examined in the study by Meckling & Nahm (Meckling & Nahm, 2018).

1.2. Problem formulation

This study belongs to two research areas, machine learning models and buyer characteristics, and has a twofold goal. First, it focuses on the evaluation of major machine learning models in order to classify who would like to buy an EV or an HV. The core idea is to apply data science techniques to a publicly available dataset and to perform preprocessing, model creation, hyperparameter optimization and evaluation of the results. The second aim of this study is to identify the major characteristics of the buyers of EVs or HVs from the dataset by using the machine learning models and predictive analytics.
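The workflow named above (preprocessing, model creation and evaluation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the thesis: it assumes pandas and scikit-learn, the toy records and the category values are hypothetical stand-ins for the survey data, and hyperparameter optimization is omitted for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy records; the real survey has far more rows and features.
data = pd.DataFrame({
    "education": ["primary", "secondary", "tertiary", "tertiary",
                  "secondary", "primary", "tertiary", "secondary"] * 5,
    "country": ["SE", "DE", "EL", "SE", "DE", "EL", "DE", "SE"] * 5,
    "consider_ev_buy": [0, 0, 1, 1, 0, 0, 1, 1] * 5,
})
X = data[["education", "country"]]
y = data["consider_ev_buy"]

# Preprocessing: ordered categories become ordinal integers,
# unordered categories become one-hot (nominal) columns.
preprocess = ColumnTransformer([
    ("ordinal",
     OrdinalEncoder(categories=[["primary", "secondary", "tertiary"]]),
     ["education"]),
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])

# Model creation: preprocessing and classifier combined in one pipeline.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Evaluation on a held-out split, stratified on the target variable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
report = classification_report(
    y_test, y_pred, output_dict=True, zero_division=0)
print(report["accuracy"])
```

Keeping the encoders inside the pipeline means they are fitted on the training split only, which avoids leaking information from the test split into the preprocessing step.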

Research question:

“Evaluation of different machine learning models for the prediction of electric or hybrid vehicle buyers and identification of the characteristics of the buyers in the EU”.

1.3. Purpose

The main purpose of the study is to understand which machine learning model provides the best performance on the classification task described in section 1.2. This will provide insight into the preprocessing steps, the important hyperparameters and the overall quality of the dataset. Identification of the buyer characteristics, which is the second purpose of this study, will help us to understand which characteristics play a major role in the decision to buy an EV. The better we understand the buyers’ characteristics, the better the automotive industry can address the obstacles behind the slow transformation of the automotive market and boost the sale of EVs.

1.4. Delimitations

The research is performed on an EU dataset: an EU travel survey conducted in 2014 that contains 26605 samples. The dataset is focused on the propensity to purchase a hybrid or electric vehicle (European Commission, Joint Research Centre (JRC), 2014). Hence, the analysis and results are not suitable for use outside the EU. Furthermore, according to Christidis & Focas, there is high variability among EU member states. For example, Scandinavian countries seem to adopt and adapt to EVs faster than others. Hence, the characteristics may not equally represent the buyers in every EU country. As the buyers’ characteristics were identified from a secondary dataset, important characteristics that are not included in the dataset as features may be missing from the findings.


1.5. Thesis structure

This MBA thesis contains six chapters. It starts with the introduction, where the background of the research topic is explained. In the second chapter, the relevant literature is presented: the theories connected to the topic of the thesis are explained and previously conducted studies are reviewed.

Chapter three describes the type of research method used in the thesis and the details of the method. Chapter four presents the empirical findings and the results of the machine learning models and the buyers’ characteristics. It starts with data engineering, which includes the conversion of categorical features to ordinal and nominal values, the filtering out of irrelevant features and the creation of new features from the existing ones. Machine learning models are created with a set of standard hyperparameters and the preprocessed dataset is fed into the models. Based on the results of the models, the hyperparameters are adjusted to maximize performance. The machine learning models are then used to identify the most important features in order to identify the characteristics of the buyers. The performance of the machine learning models and the characteristics of the buyers are analyzed in chapter five. The concluding chapter includes a summary of the findings, recommendations, limitations of the study and opportunities for future research.


2. Theoretical framework and Literature review

This chapter provides the theoretical foundation of machine learning models and buyers’ characteristics. It starts with the definition of predictive analytics and machine learning, followed by the definition of classification problems and how to solve them. The classification algorithms used in this study are then explained. In the second part of the chapter, the authors define the buyers and their characteristics and review relevant studies. At the end of the chapter, a conceptual model is created and explained.

2.1. Predictive analytics and machine learning

Predictive analytics uses statistical and data mining techniques to determine what may happen in the future (Sharda, et al., 2017). Kelleher et al. have defined predictive data analytics as the art of building a model and using it to make predictions based on historical data (Kelleher, et al., 2015).

Predictive analytics uses a model to make predictions and machine learning is used to train the models (Kelleher, et al., 2015).

Machine learning is the capability of artificial intelligence (AI) systems to acquire knowledge by extracting patterns from raw data (Goodfellow, et al., 2016). The introduction of machine learning has enabled computers to handle real-life problems and make decisions that appear subjective (Goodfellow, et al., 2016). An AI task is represented as a set of features, and the feature set is provided to the machine learning algorithm that solves the task. Identifying and extracting features from a real-world problem is a difficult task (Goodfellow, et al., 2016). Representation learning is an approach where the machine learns not only the mapping from a representation to the output but also the representation itself (Goodfellow, et al., 2016).

Géron has created a checklist to be used in order to short-list promising models (Géron, 2017).

According to him, the process should start with training different models with the standard parameters.

The performance of the models is measured using N-fold cross-validation, from which the mean and standard deviation of the performance measure are calculated. The most significant variables of each model are then analyzed.

The next step is to analyze the types of errors that the models make and to find a way to avoid these errors. A quick round of feature selection and engineering is then performed. All the steps should be iterated quickly once or twice. The top three to five models should then be short-listed, with a preference for models that make different types of errors.

According to Géron, the hyperparameters of the short-listed models should be tuned by using cross-validation. Once there is confidence in the final model, the test set is used to measure its performance and to estimate the generalization error (Géron, 2017).
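The cross-validation step of this checklist can be sketched in a few lines of Python. The sketch below is an illustration only and not the implementation used in this thesis: the helper names (`k_fold_indices`, `cross_val_scores`), the majority-class baseline and the toy data are assumptions made for the example, and the folds are simple contiguous blocks without shuffling.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_scores(X, y, fit, predict, k=5):
    """Train on k-1 folds, test on the held-out fold, return one accuracy per fold."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores


# Hypothetical baseline model: always predict the majority class of the training folds.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


X = [[i] for i in range(10)]          # toy feature vectors (assumption)
y = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]    # toy class labels (assumption)

scores = cross_val_scores(X, y, fit_majority, predict_majority, k=5)
mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)
print(scores, mean_score, std_score)
```

Comparing the mean and the standard deviation of the fold scores across candidate models gives a first indication of which models are promising and how stable their performance is.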

2.2. Classification problems and solution approaches

The outputs of the dataset are given in the form of categories. The task is to assign a potential buyer of an EV to one of these categories. Classification is a pervasive problem in which objects are assigned to one of several predefined categories (Tan, et al., 2014). In the real world, classification is the most frequently used machine learning task (Sharda, et al., 2018). It is best suited to predicting or describing binary or categorical datasets (Tan, et al., 2014).

The input of a classification task is a collection of records called instances, each of which is characterized by a tuple (attribute set x, class label y) (Tan, et al., 2014).

[Figure 1 shows the flow: Input: attribute set (x) → Classification model → Output: class label (y)]

Figure 1: Classification task

Figure 1 shows that the classification model uses the attribute set to assign an instance to one of the categories. The feature set may contain discrete as well as continuous features (Tan, et al., 2014).

Classification differs from regression in the characteristics of the output: in a regression model the output y is continuous, whereas in a classification model the output belongs to one of the predefined categories (Tan, et al., 2014). A classification model can be used as an explanatory tool that distinguishes objects of different classes, as well as a predictive tool that predicts the class of unknown records (Tan, et al., 2014).

Binary classification is a machine learning problem where the task of the classifier is to assign the elements of a given set to one of two groups (Wikipedia:Binary_classification). The continuous values of a dataset can be converted artificially to binary values by using a cutoff value: the resulting value is positive when the original value is higher than the cutoff and negative when it is lower than the cutoff (Wikipedia:Binary_classification).

In many real-world problems such as text and face classification, there are more than two categories in the dataset. One way of dealing with a multiclass problem is to decompose it into $K$ binary problems, known as the one-against-rest (1-r) approach (Tan, et al., 2014). Another approach, called one-against-one (1-1), is to construct $K(K-1)/2$ binary classifiers, where each classifier distinguishes between a pair of classes $(y_i, y_j)$ (Tan, et al., 2014). In both approaches, the predictions made by the binary classifiers are combined in order to classify the test instance (Tan, et al., 2014). Generally, a voting scheme is used to combine the predictions of the binary classifiers, where the class which receives the highest number of votes wins (Tan, et al., 2014). Alternatively, the outputs of the binary classifiers can be transformed into probability estimates and the class with the highest probability is then assigned (Tan, et al., 2014). The disadvantage of the (1-r) and (1-1) approaches is that they are sensitive to the errors of the binary classifiers (Tan, et al., 2014). The prediction may end in a tie or be wrong if one of the classifiers makes a mistake (Tan, et al., 2014).
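The number of binary classifiers required by the two decompositions, and the voting scheme of the one-against-one approach, can be illustrated with a short Python sketch. The class labels and the pairwise decisions below are hypothetical values chosen for the example; they are not taken from the dataset used in this thesis.

```python
from collections import Counter
from itertools import combinations


def n_binary_classifiers(k):
    """Number of binary classifiers each decomposition needs for k classes."""
    return {"one_against_rest": k, "one_against_one": k * (k - 1) // 2}


def one_against_one_vote(classes, pairwise_winner):
    """Combine pairwise decisions by majority vote.

    pairwise_winner maps each class pair (a, b) to the class that the binary
    classifier trained on that pair predicted for the test instance.
    """
    votes = Counter(pairwise_winner[pair] for pair in combinations(classes, 2))
    return votes.most_common(1)[0][0], dict(votes)


classes = ["petrol", "hybrid", "electric"]  # illustrative labels (assumption)

# Hypothetical outputs of the three pairwise classifiers for one test instance.
pairwise_winner = {
    ("petrol", "hybrid"): "hybrid",
    ("petrol", "electric"): "electric",
    ("hybrid", "electric"): "hybrid",
}

predicted, votes = one_against_one_vote(classes, pairwise_winner)
print(n_binary_classifiers(len(classes)), predicted, votes)
```

If the classifier for the pair ("hybrid", "electric") had voted for "electric" instead, the class "electric" would have won, which illustrates how a single mistaken binary classifier can change the final prediction.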

Error-correcting output coding (ECOC) is a more robust method for handling multiclass classification tasks. It is inspired by the way messages are sent across noisy channels: redundancy is added to the transmitted message in the form of a codeword. The receiver uses the codeword to detect errors in the received message. When the number of errors is small, the erroneous message may be recovered.
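Applied to classification, the idea is commonly realized by giving each class its own bit-string codeword, training one binary classifier per bit, and assigning the class whose codeword is closest to the predicted bits in Hamming distance. The following Python sketch shows only this decoding step; the four 7-bit codewords and the predicted bit string are illustrative assumptions.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(codewords, predicted_bits):
    """Return the class whose codeword is closest to the predicted bits."""
    distances = {
        label: hamming_distance(code, predicted_bits)
        for label, code in codewords.items()
    }
    return min(distances, key=distances.get), distances


# Illustrative codewords for four classes; every pair differs in four bits,
# so a single wrong binary classifier can still be corrected.
codewords = {
    "y1": (1, 1, 1, 1, 1, 1, 1),
    "y2": (0, 0, 0, 0, 1, 1, 1),
    "y3": (0, 0, 1, 1, 0, 0, 1),
    "y4": (0, 1, 0, 1, 0, 1, 0),
}

# Hypothetical outputs of the seven binary classifiers; the first bit is wrong.
predicted_bits = (0, 1, 1, 1, 1, 1, 1)

label, distances = ecoc_decode(codewords, predicted_bits)
print(label, distances)
```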

A classifier is a systematic approach to building a classification model from a given input dataset (Tan, et al., 2014). From a given training set containing examples of inputs and outputs, supervised learning algorithms learn how to associate them (Goodfellow, et al., 2016). The output may be provided by a human (“supervisor”) when it is difficult to collect automatically (Goodfellow, et al., 2016).

During the training process, each classification technique uses a learning algorithm to identify the model that best fits the relationship between the feature set (input) and the class label (output) (Tan, et al., 2014).

The model generated by the learning algorithm should both fit the input data and correctly predict the class labels of records it has not seen before (Tan, et al., 2014).

Training set:

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test set:

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

Training set → Learning algorithm → Learn model (induction) → Model → Apply model (deduction) → Test set

Figure 2: General approach of building a classification model

The general approach of building a classification model is explained in Figure 2 (Tan, et al., 2014). The training set is provided to the learning algorithm in order to learn and build a classification model. The model is then applied to the test set with unknown class labels.

2.3. Machine learning algorithms for classification

Several algorithms are used to solve classification problems, such as decision tree analysis, statistical analysis, neural networks, case-based reasoning, Bayesian classifiers, genetic algorithms and rough sets (Sharda, et al., 2018). Statistics-based classification methods such as discriminant analysis and logistic regression rely on assumptions about the data, such as independence and normality, which are unrealistic and limit their use in many projects (Sharda, et al., 2018). Due to time constraints, five classification algorithms are used in this work; they are discussed in the following sections.

2.3.1. Logistic Regression

Logistic regression is a regression algorithm that can be used for classification tasks by estimating the probability that an instance belongs to a specific class (Géron, 2017). It applies a threshold to this probability to determine whether an instance belongs to the positive class (labeled 1) or the negative class (labeled 0) (Géron, 2017). The following formula is used to calculate the probability (Géron, 2017):

$$\hat{p} = h_{\theta}(\mathbf{x}) = \sigma(\theta^{T} \cdot \mathbf{x})$$

where $\sigma(\cdot)$ is the sigmoid function, $\mathbf{x}$ is the feature vector, $\theta^{T}$ is the transpose of the model’s parameter vector $\theta$, and $h_{\theta}$ is the hypothesis function.

Sigmoid function is defined by the following formula (Géron, 2017):

$$\sigma(t) = \frac{1}{1 + e^{-t}}$$

Finally, model prediction uses the following formula (Géron, 2017):

$$\hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases}$$
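The three formulas can be combined into a small Python sketch. The parameter vector `theta` and the two feature vectors are hypothetical values chosen for illustration; in practice the parameters are learned from the training data.

```python
import math


def sigmoid(t):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def predict_proba(theta, x):
    """Estimated probability of the positive class: sigmoid(theta^T . x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def predict(theta, x, threshold=0.5):
    """Class prediction: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(theta, x) >= threshold else 0


theta = [-1.0, 2.0]   # hypothetical parameters: bias weight and one feature weight
x_pos = [1.0, 1.5]    # the first entry is the constant bias feature x0 = 1
x_neg = [1.0, 0.2]

print(predict_proba(theta, x_pos), predict(theta, x_pos))
print(predict_proba(theta, x_neg), predict(theta, x_neg))
```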

2.3.2. Support Vector Machine

The support vector machine (SVM) is one of the most influential supervised learning methods (Goodfellow, et al., 2016). It is a classification method that originated in statistical learning theory (Tan, et al., 2014).

In an SVM, the decision boundary is represented by a subset of the training samples called “support vectors”. A major advantage of the method is that it works well with high-dimensional data, which means that the curse of dimensionality problem is avoided (Tan, et al., 2014). SVM training can be formulated as a convex optimization problem, so the global minimum of the objective function can be found (Goodfellow, et al., 2016). A drawback of kernel machines is the high computational cost of training when the dataset is large (Goodfellow, et al., 2016).

A linear SVM is also known as a maximal margin classifier; it searches for the hyperplane that provides the largest margin, because the capacity of a linear model is inversely related to its margin (Tan, et al., 2014). The maximal margin minimizes the worst-case generalization error of a linear SVM (Tan, et al., 2014). The linear function that defines the decision boundary of a linear SVM is written as follows (Goodfellow, et al., 2016):

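$$\mathbf{w}^{T}\mathbf{x} + b = b + \sum_{i=1}^{m} \alpha_i \mathbf{x}^{T}\mathbf{x}^{(i)}$$

where $\alpha$ is a vector of coefficients, $\mathbf{x}^{(i)}$ is a training example, $\mathbf{w} \in \mathbb{R}^n$ is a vector of parameters, $\mathbf{x} \in \mathbb{R}^n$ is the input, $b$ is the bias term and $m$ is the number of training examples.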

Prediction can be made by using the following formula (Goodfellow, et al., 2016):

$$f(\mathbf{x}) = b + \sum_{i} \alpha_i \, k(\mathbf{x}, \mathbf{x}^{(i)})$$

where $k$ is the kernel function.
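The prediction formula can be sketched in Python as follows. The support vectors, the coefficients and the bias are hypothetical values; a trained SVM would obtain them by solving the optimization problem. With the linear kernel, the result coincides with $\mathbf{w}^{T}\mathbf{x} + b$ for $\mathbf{w} = \sum_i \alpha_i \mathbf{x}^{(i)}$, and the same function works with a non-linear kernel such as the radial basis function (RBF) kernel.

```python
import math


def linear_kernel(x, z):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel; gamma is an assumed hyperparameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_function(x, support_vectors, alphas, b, kernel):
    """f(x) = b + sum_i alpha_i * k(x, x_i); the sign of f(x) gives the class."""
    return b + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))


# Hypothetical support vectors with signed coefficients (class label folded in).
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, -0.5]
b = 0.0
x = (2.0, 0.0)

linear_score = decision_function(x, support_vectors, alphas, b, linear_kernel)
rbf_score = decision_function(x, support_vectors, alphas, b, rbf_kernel)

# Equivalent explicit weight vector for the linear kernel: w = sum_i alpha_i * x_i.
w = [sum(a * sv[j] for a, sv in zip(alphas, support_vectors)) for j in range(2)]
explicit_score = linear_kernel(w, x) + b

print(linear_score, explicit_score, rbf_score)
```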

2.3.3. Ensemble Methods (Random Forest Classifier, Gradient Boosting Classifier)

An ensemble is a group of predictors (Géron, 2017). In ensemble methods, a prediction model is created by combining the strengths of a set of simple base models (Hastie, et al., 2008). The aggregated result of a group of predictors is often better than the performance of the best individual predictor (Géron, 2017). An ensemble classifier performs better than a single classifier when two necessary conditions are fulfilled: the base classifiers should be independent of each other, and each of them should perform better than a classifier which guesses randomly (Tan, et al., 2014).

A random forest consists of decision tree classifiers and belongs to the class of ensemble methods (Tan, et al., 2014). Each tree of a random forest is generated based on the values of an independent set of random vectors with a fixed probability distribution, and the predictions made by the multiple decision trees are combined, as shown in Figure 3 (Tan, et al., 2014).

[Figure 3 depicts three steps. Step 1: create random vectors from the original training data D (randomize), giving D1, D2, ..., Dt-1, Dt. Step 2: use the random vectors to build multiple decision trees T1, T2, ..., Tt-1, Tt. Step 3: combine the decision trees into T*.]

Figure 3: Random forest algorithm

Hastie et al. have explained the formula of random forest classification:

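$$\hat{C}_{rf}^{B}(x) = \text{majority vote}\ \{\hat{C}_b(x)\}_{1}^{B}$$

where $\hat{C}_b(x)$ is the class prediction of the $b$-th random forest tree and $B$ is the number of trees.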
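The combination step of the formula can be illustrated with a minimal Python sketch. The three "trees" below are hand-written rules that stand in for trained decision trees, and the feature names (`income`, `cars_owned`, `age`) and thresholds are hypothetical; only the majority vote itself corresponds to the formula.

```python
from collections import Counter

# Hypothetical stand-ins for trained decision trees (assumption for illustration).
trees = [
    lambda x: "buy" if x["income"] > 50 else "not_buy",
    lambda x: "buy" if x["cars_owned"] >= 2 else "not_buy",
    lambda x: "buy" if x["age"] < 45 else "not_buy",
]


def forest_predict(trees, x):
    """Majority vote over the class predictions of all trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0], dict(votes)


x = {"income": 60, "cars_owned": 1, "age": 40}
prediction, votes = forest_predict(trees, x)
print(prediction, votes)
```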

Random forests are popular because their performance is similar to that of boosting on many problems and because they are simpler to train and tune (Hastie, et al., 2008).

Gradient boosting is a well-known boosting algorithm which adds predictors to an ensemble sequentially, each predictor trying to correct its predecessor (Géron, 2017). Each new predictor is fitted to the residual errors made by the previous predictor (Géron, 2017).
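The idea of fitting each new predictor to the residual errors of its predecessors can be sketched as follows. For simplicity, the sketch uses a regression setting with one-split trees (stumps) on a single feature and a learning rate of 1; the helper names and the toy data are assumptions, and a gradient boosting classifier applies the same principle to the gradient of a classification loss.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds):
    """Add stumps sequentially; each one is fitted to the current residuals."""
    stumps = []
    residuals = list(ys)
    for _ in range(n_rounds):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        residuals = [r - stump(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(s(x) for s in stumps)


def sse(model, xs, ys):
    """Sum of squared errors of the model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5, 6]                  # toy feature values (assumption)
ys = [1.0, 1.2, 2.9, 3.1, 5.0, 5.2]      # toy targets (assumption)

sse_1 = sse(gradient_boost(xs, ys, 1), xs, ys)
sse_3 = sse(gradient_boost(xs, ys, 3), xs, ys)
print(sse_1, sse_3)
```

The training error after three rounds is lower than after one round, because every additional stump removes part of the error left by the ensemble built so far.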

2.3.4. Artificial Neural Networks

Although simple machine learning algorithms work well on a variety of problems, they have been unsuccessful in solving problems such as speech and object recognition (Goodfellow, et al., 2016). Goodfellow et al. have identified the challenges that traditional machine learning algorithms face. The “curse of dimensionality” occurs when a machine learning problem has a high number of dimensions: the problem becomes very difficult because the number of distinct possible configurations increases exponentially with the number of variables (Goodfellow, et al., 2016). Many simple machine learning algorithms rely on smoothness to generalize well, but because of the statistical challenges involved in such problems, these algorithms fail to scale (Goodfellow, et al., 2016). Deep learning reduces the generalization error on sophisticated tasks by introducing explicit and implicit priors (Goodfellow, et al., 2016).

Artificial neural networks (ANNs), inspired by the architecture of the brain, are at the core of deep learning (Géron, 2017). An ANN consists of nodes and directed links, analogous to the structure of the human brain (Tan, et al., 2014). ANNs are scalable, powerful and versatile methods for solving highly complex machine learning problems, such as speech recognition services like Apple’s Siri, video recommendation services and learning to play games (Géron, 2017).

Tan et al. have identified a set of characteristics of ANN:

• Multilayer neural networks with at least one hidden layer are universal approximators, so they can be used to approximate any target function (Tan, et al., 2014). An appropriate network topology must be chosen in order to avoid overfitting the model.

• Because the weights are learned automatically in the training step, ANNs can handle redundant features (Tan, et al., 2014).

• As ANNs are sensitive to noise in the training data, a validation set has to be used to estimate the generalization error and/or the weights have to be decreased by some factor at each iteration (Tan, et al., 2014).

• An ANN trained with the gradient descent method may converge to a local minimum; a momentum term can be added to the weight update formula in order to escape it (Tan, et al., 2014).

• Training a multilayer ANN with a large number of hidden nodes is a time-consuming process (Tan, et al., 2014).

A multi-layer perceptron (MLP) consists of one input layer, one or more hidden layers and an output layer (Géron, 2017). Every layer except the output layer contains a bias neuron and is fully connected to the next layer (Géron, 2017). The training algorithm feeds each training instance to the network and computes the output of every neuron in each consecutive layer (Géron, 2017). It then measures the output error of the network as the difference between the actual output and the desired output (Géron, 2017).

Géron has given an example of a modern MLP used for classification with the rectified linear unit (ReLU) function and a Softmax layer (shown in Figure 4).

[Figure 4 shows, from bottom to top: an input layer (inputs x₁ and x₂ plus a bias neuron), a hidden layer (e.g. ReLU) and a Softmax output layer.]

Figure 4: Multi-layer Perceptron for classification [MLP]

The signal flows in one direction in Figure 4, which means that the MLP is a feedforward neural network (Géron, 2017). The Softmax function in the output layer handles multiple classes by first calculating a score for each class and then estimating the probability of each class from the scores (Géron, 2017). The following equation is used to calculate the Softmax score of class $k$:

$$s_k(\mathbf{x}) = \theta_k^{T} \cdot \mathbf{x}$$

where $\theta_k$ is the dedicated parameter vector of each class.
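A forward pass through a small MLP of this kind can be sketched in Python. The standard Softmax function turns the class scores $s_k$ into probabilities via $\hat{p}_k = e^{s_k} / \sum_j e^{s_j}$. The network size and all weights below are hypothetical values chosen for illustration; they are not the trained parameters of the models in this thesis.

```python
import math


def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def softmax(scores):
    """Convert class scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the maximum improves numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(x, W1, b1, W2, b2):
    """Input layer -> hidden ReLU layer -> Softmax output layer."""
    hidden = relu(dense(x, W1, b1))
    return softmax(dense(hidden, W2, b2))


# Hypothetical weights: 2 inputs, 2 hidden neurons, 3 output classes.
W1 = [[1.0, -1.0], [0.5, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 1.0], [1.0, -1.0], [0.0, 0.0]]
b2 = [0.0, 0.0, 0.0]

probabilities = mlp_forward([1.0, 2.0], W1, b1, W2, b2)
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```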

2.3.5. How to evaluate classification models

Sharda et al. have identified the factors used to assess a classification model. The most frequently used assessment factor is prediction accuracy (Sharda, et al., 2018). The second assessment factor is the computational cost, in terms of the speed required to generate and use the model; the higher the speed, the better the model (Sharda, et al., 2018). The model’s robustness plays an important role because it has to make reasonably accurate predictions even though the data contains noise and erroneous values (Sharda, et al., 2018). The selected model must also use an architecture which supports scalability when a large amount of data is used. Lastly, the model must be interpretable, which means that it has to provide insight into how it makes its predictions (Sharda, et al., 2018).

Novaković et al. have also identified major criteria for the evaluation of classification models. As the target function of a classification problem is discrete, the value of the class attribute must be categorical (Novaković, et al., 2017). The notion of an error is the basic idea behind the evaluation of a classification model; an error occurs if the predicted value is different from the actual class of the example (Novaković, et al., 2017).

Accuracy is defined as the ratio of correctly classified examples (Novaković, et al., 2017). It summarizes the performance of a model in a single number and can be calculated by using the following equation (Tan, et al., 2014):

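$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$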

The main drawback of accuracy is that it ignores the differences between error types and depends on the class distribution of the dataset (Novaković, et al., 2017). It is important to distinguish between different types of errors. In a disease detection problem, the model should give higher priority to identifying the patients who have the disease; it is acceptable to a certain extent if the model identifies a healthy person as sick, because a further laboratory test may identify the person as healthy again (Novaković, et al., 2017).

The confusion matrix is a widely used tool for evaluating the performance of a classification model and consists of the counts of correctly and incorrectly predicted test records (Tan, et al., 2014). It captures the details of the evaluation phase of the model, from which the performance measures are calculated (Kelleher, et al., 2015). The following table shows the structure of the confusion matrix (Kelleher, et al., 2015).

Table 1: Structure of the confusion matrix

                    Prediction: Positive    Prediction: Negative
Target: Positive    True Positive           False Negative
Target: Negative    False Positive          True Negative

The above table shows that the confusion matrix consists of four components: true positive, false positive, true negative and false negative. True positive is the number of test instances where the target value was positive and the classifier also identified it as positive. False positive is the number of test instances where the target value was negative but the classifier identified it as positive. False negative is the number of test instances where the target value was positive but the classifier identified it as negative. Lastly, true negative is the number of test instances where the target value was negative and the classifier also identified it as negative.

The following formulas are used to calculate accuracy metrics (Sharda, et al., 2017):

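$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where $TP$, $TN$, $FP$ and $FN$ are the numbers of true positives, true negatives, false positives and false negatives.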

Precision is the fraction of predicted positive cases which are actually positive (Novaković, et al., 2017). A classifier commits a low rate of false positive errors when its precision is high (Tan, et al., 2014).

In contrast, recall measures the fraction of positive examples that are correctly predicted by the classifier, and it is equivalent to the true positive rate (Tan, et al., 2014). It is possible to construct a baseline model that maximizes one of the two metrics but not the other; for example, a classifier that predicts every sample as positive achieves perfect recall, but its precision will be poor (Tan, et al., 2014).

F1 measure summarizes recall and precision and represents a harmonic mean between them (Tan, et al., 2014). It is calculated by using the following equation:

$$F_1 = \frac{2 \times TP}{2 \times TP + FP + FN}$$
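The metrics derived from the confusion matrix can be computed with a few lines of Python. The label vectors below are toy values chosen for illustration, with 1 denoting the positive class; the function names are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 from the confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # toy target values (assumption)
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]   # toy predictions (assumption)

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

In this example the F1 measure (0.667) lies between precision (0.6) and recall (0.75) and is closer to the smaller of the two, as expected of a harmonic mean.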

When the models are applied to the test set, the goal is to choose the model with the highest accuracy or the lowest error rate (Tan, et al., 2014). In the training phase, a model has to be selected which has the right complexity and is not susceptible to overfitting. In the cross-validation method, the data is partitioned into two equal-sized subsets, one of which is used for training and the other for testing (Tan, et al., 2014). The roles of the subsets are then swapped, so that each subset is used once for training and once for testing; this is called two-fold cross-validation (Tan, et al., 2014). The total error is obtained by summing the errors of both runs.

The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate. It is one of the tools commonly used with binary classification algorithms (Géron, 2017). In other words, the ROC curve plots sensitivity versus 1 − specificity. The dotted line in the following figure shows the ROC curve of a random classifier. The ROC curve of a good classifier should stay as far away as possible from that of the random classifier, towards the top-left corner (Géron, 2017).
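The points of a ROC curve can be obtained by sweeping a decision threshold over the scores produced by a classifier, as the following Python sketch shows. The labels and scores are toy values chosen for illustration, and the area under the curve (AUC), computed here with the trapezoidal rule, is a common single-number summary that is not discussed further in this section.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) for every distinct threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [1, 1, 0, 1, 0, 0]                # toy labels (assumption)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]    # toy classifier scores (assumption)

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(points, auc)
```

A random classifier follows the diagonal from (0, 0) to (1, 1) and has an area of 0.5, whereas the toy classifier above reaches an area of about 0.89.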

2.4. Characteristics of the buyers of EV/HV

The first step in understanding the behavior of customers is to define the common characteristics of the customer group. This allows the relevant stakeholders to segment the market and conduct market analysis more effectively. According to the Marketing Insider and Riley, the characteristics of the buyers act as a black box that filters the market and other stimuli (i.e. economic, political, social and technological) and produces buyer responses as the outcome, which is defined as the behavior of the buyers (Marketing Insider, 2020) (Riley, 2019). It is depicted in the figure below.

Figure 5: Stimulus Response Model of the Buyers Behavior

Dütschke et al. have identified four major customer groups that would potentially be interested in buying an EV in the EU (Dütschke, et al., 2013). The identified categories are the technology enthusiasts, the environmentally aware, the urban individualists and the well-off consumers. Hardman et al. showed that a common characteristic of EV consumers in the USA is that they already own, on average, 2.5 other vehicles, which surpasses the national average ownership of individual cars (Hardman, et al., 2016). This could be an indicator that in the USA, owning an EV is still a luxury choice rather than a transportation necessity.

In China, by comparison, Zhang and Bai identified that the number of vehicles owned by each household was also a key factor affecting the willingness of buyers to purchase an EV (Zhang & Bai, 2011).

Research conducted by Peters and Dütschke in the EU identified age and the number of other vehicles as key characteristics of the early adopters of EVs. They also found that middle-aged men living in a household that owns several other vehicles are more likely to purchase an EV (Peters & Dütschke, 2014). According to Plötz et al., people living in rural areas in Germany were more likely to buy an EV than people living in large cities, which could be the result of a characteristic representing concern about environmental impacts (Plötz, et al., 2014). This is aligned with Wietschel et al., who identified that the main buyers of EVs in Germany up to the year 2020 would be males aged between 40 and 50 living in a suburban or rural area with their families (Wietschel, et al., 2012).

Furthermore, Bechhold et al. identified that the following characteristics were linked to an increased willingness of buyers to purchase an EV (Bechhold, et al., 2017). With respect to age, young to middle-aged people were more likely to buy an EV. With respect to education, highly educated people were more interested in EVs. Household size was also a significant factor; families raising children and preparing to grow further were more likely to own an EV. Additionally, household type was identified as an important parameter, where owners of single detached dwellings and protected garages were more likely to purchase an EV. Furthermore, people living in the suburbs tend