
Master Thesis, 30 ECTS

Master of Science in Industrial Engineering and Management, 300 ECTS

Spring 2021

Customer loyalty, return and churn prediction through machine learning methods

for a Swedish fashion and e-commerce company

Master Thesis Study

Anida Granov


Copyright © 2021 Anida Granov. All rights reserved.

CUSTOMER LOYALTY, RETURN AND CHURN PREDICTION THROUGH MACHINE LEARNING METHODS

Submitted in fulfillment of the requirements for the degree Master of Science in Industrial Engineering and Management

Department of Mathematics and Mathematical Statistics
Umeå University
SE-907 87 Umeå, Sweden

Supervisors:
Mohammad Ghorbani, Umeå University
Cagri Emre Korkmaz, NA-KD

Examiner:
Natalya Pya Arnqvist, Umeå University


Abstract

The analysis of gaining, retaining and maintaining customer trust is a highly topical issue in the e-commerce industry, needed to mitigate the challenges of increased competition and volatile customer relationships that follow from the increasing use of the internet to purchase goods. This study is conducted at the Swedish online fashion retailer NA-KD with the aim of gaining better insight into the customer behavior that determines purchases, returns and churn. The objectives of this study are therefore to identify the group of loyal customers and to construct models that predict customer loyalty, frequent returns and customer churn. Two separate approaches are used for solving the problem. First, a clustering model is constructed to divide the data into customer segments that can explain customer behaviour. Then, a classification model is constructed to classify the customers into the classes of churners, returners and loyal customers, based on the exploratory data analysis and previous insights and knowledge from the company. Using the unsupervised machine learning method K-prototypes clustering for mixed data, six clusters are identified and defined as churned, potential and loyal customers, together with Brand Champions, indecisive shoppers and high-risk churners. The supervised classification method of bias-reduced binary logistic regression is used to classify customers into the classes of loyal customers, customers with frequent returns, and churners. The final models had an accuracy of 0.68, 0.75 and 0.98 for the three separate binary classification models classifying Churners, Returners and Loyalists, respectively.

The disposition of the report. The report is divided into seven chapters. Chapter 1 contains a general overview of the e-commerce industry. Chapter 2 presents the problem statement, followed by the company description and the objectives of the project. After the literature review in Chapter 3, Chapter 4 states the methodology used in the project, including the theory, data explanation and models. Chapter 5 presents the data analysis and modelling. Chapter 6 presents the results of the study. Finally, this thesis ends with discussion and conclusions in Chapter 7.


Sammanfattning

The analysis of increasing, retaining and maintaining customer trust is a highly topical issue within the e-commerce industry, in order to meet the challenges of increased competition and unstable customer relationships that follow from the increasing use of the internet to purchase goods.

This study is conducted at the Swedish fashion and e-commerce company NA-KD, with the aim of gaining better insight into the customer behavior that influences purchases, returns and decisions to leave the company (churn). The objective of this study is thus to identify the company's loyal customer group, and to construct models for predicting customer loyalty, frequent returns and customer churn. The study comprises two separate approaches, where a clustering model is constructed to separate the data into different customer segments in order to explain customer behavior. A classification model is then constructed to classify and predict the customers in the classes 'churner', 'returner' and 'loyal', based on an exploratory data analysis and previous insights and knowledge from the company. Using the unsupervised clustering machine learning method K-prototypes for mixed data, the following six clusters are identified and defined: churned, potential and loyal customers, together with Brand Champions, indecisive shoppers and high-risk churners. The supervised classification method bias-reduced logistic regression is used to classify customers into the classes of loyal customers, customers who make frequent returns, and customers who have left the company. The final models have an accuracy of 0.86, 0.75 and 0.98 for the three separate binary classification models classifying customers into the groups 'churner', 'returner' and 'loyal', respectively.

The disposition of the report. The report is divided into seven chapters. Chapter 1 contains a general overview of the e-commerce industry. Chapter 2 presents the problem statement, followed by a company description and the objectives of the project. The following chapter, Chapter 3, presents similar studies, followed by Chapter 4, which covers the methodology of the study, including theory, data description and models. Chapter 5 contains a descriptive analysis of the data and the modelling. Chapter 6 presents the results of the study and is followed by the final part, Chapter 7, containing the discussion and conclusions.


Acknowledgements

The completion of this thesis study would not have been possible without the support and extensive knowledge of several people whose names will be mentioned in this section.

I cannot begin without expressing my thanks to my supervisor from Umeå University, Mohammad Ghorbani, for the valuable experience and insight in conducting and writing scientific reports. Thank you for always asking the right questions and giving me helpful practical suggestions.

I would also like to extend my deepest gratitude to my supervisor from NA-KD, Cagri Emre Korkmaz, for believing in me and giving me the opportunity to be the first one in history to perform a master thesis project at NA-KD.

Someone whose help cannot be overstated is my advisor from NA-KD, Burak Arca. Thank you for always taking the time to support and guide me through the project. Without your extensive knowledge and insightful suggestions, the final result would not be what it is today.

Finally, I would like to express my sincere gratitude to my family and partner for their constant encouragement and support throughout my time at university and this thesis.


Contents

1 Introduction 1
   1.1 Motivation 1
2 Project description 3
   2.1 Problem definition 3
   2.2 Company description 3
   2.3 Limitations 4
   2.4 The objectives 4
   2.5 Project structure 4
3 Literature review 6
   3.1 CLV estimation based on RFM analysis 6
   3.2 An extended RFM ranking by K-means clustering 7
   3.3 Customer segmentation on behavioural data 7
   3.4 Using SOM, K-means and the LRFM model for customer segmentation 8
   3.5 Return prediction within e-retail 8
   3.6 Customer churn prediction 8
   3.7 Churn prediction through hybrid classification 9
   3.8 Customer purchase behavior analysis through a two staged approach 9
   3.9 Comparison between K-prototypes and K-means 10
4 Methodology 11
   4.1 Data description 11
   4.2 Data mining 12
   4.3 Theory 14
      4.3.1 Non-parametric Statistics 14
      4.3.2 Dimensionality Reduction 16
      4.3.3 Clustering techniques 19
      4.3.4 Classification 21
   4.4 Model setup 22
5 Data Analysis and Modelling 24
   5.1 Exploratory Data Analysis 24
   5.2 Modelling 38
      5.2.1 Dimensionality reduction 38
      5.2.2 Constraints of the study 39
      5.2.3 The Google BigQuery Machine Learning tool 39
6 Result 42
   6.1 Clustering 42
   6.2 Classification 52
      6.2.1 Definition of classes 52
      6.2.2 Feature selection 52
      6.2.3 Logistic Regression 53
7 Discussion and conclusions 56
   7.1 Discussion 56
   7.2 Conclusion 57
      7.2.1 Future recommendations 57
Bibliography 59
A A/B testing 61
B Contingency tables 64
C Correlation triangle of numerical features 66
D Bias Reduced Logistic Regression 68
   D.1 Classifying churned customers 68
   D.2 Classifying customers of frequent returns 70
   D.3 Classifying loyal customers 71


Chapter 1

Introduction

1.1 Motivation

By definition, e-commerce refers to the sale of goods or services through the use of the internet. The easily accessible e-commerce of today's modern world is a win-win situation for both businesses and customers. The sheer amount of information available to customers online can be all that is needed to make a purchase decision. For businesses, e-retail can be used to increase brand awareness and reduce the cost of physical stores.

However, the constant increase in online shopping is also accompanied by challenges for businesses. With online stores only a click away, competition in the market increases and customer relationships become more volatile, since customer preferences can no longer be observed as directly as in physical stores, where personal service is available. This is one factor behind the lower proportion of loyal customers within e-commerce in comparison to physical stores (Seippel; 2018).

In addition, online marketing techniques such as websites and social media are both cheaper and more effective in the long run than traditional marketing techniques. Website visitors contribute to a large collection of data, including web traffic data and personal information recorded through cookies. This immense collection of data opens up the possibility of studying customer preferences and purchasing behavior. As Coppola (2021) states in her article on the history and future of e-commerce, the development of e-commerce is highly dependent on the development of technology and will continue to be so in the future. According to the discussion by Coppola (2021) of Statista's statistics on global e-commerce, the internet and digitalization have widely increased the use of e-commerce. An estimated 1.92 billion people used e-retail to purchase goods or services in 2019, accounting for 14.1% of global retail sales. This share is also expected to increase every year, reaching an estimated 21.8% of global retail sales in 2024, due to the increasing accessibility of the internet around the world and the variety of online platforms available for quick purchases.

The 99Firms e-commerce statistics for the year 2020 (99Firms; 2020) predict that 95% of all purchases will be made via e-commerce by 2040. A 2019 study by the media and research firm DigitalCommerce 360 also shows that 61% of respondents stop comparing companies on other websites after finding a product they like. This highlights the importance of providing a website with easily accessible information about products. The results of the survey also showed the importance of factors such as free shipping, ease of returns and low cost of returns when comparing different companies.


In line with this, a study on the influence of e-commerce on customer behavior by Mittal (2013) found that the most important factors for online shoppers are search features that enable customers to find the products they are looking for. It is also stated that providing third-party verification on the website, as well as information about the company such as customer service, location, a phone number and a help button, can increase customers' trust in shopping from e-retailers. The study highlights the importance of website reputation, payment security and post-purchase services such as shipping and returns for increased customer satisfaction.

A customer who makes multiple purchases spends on average four times as much money as a customer who makes only one purchase (Blevins; 2020). This shows how important it is for companies to build and maintain a loyal customer base. The process of winning customers is called customer acquisition and lays the foundation for growing the customer base. The next step is to keep the customers gained through acquisition, which defines customer retention. Customer retention is usually less expensive than acquiring new customers, so evaluating customer retention is critical for businesses. A study on customer segmentation highlights the importance of customer retention, which has increased in recent years, while the acquisition of new customers has decreased. The study presents the Pareto principle and concludes that most of a company's revenue is generated by only 20% of its customers (Joy Christy et al.; 2018). By targeting this customer segment, a business can tailor marketing plans and advertising campaigns to reduce marketing costs while generating the same revenue as when marketing to all customers indiscriminately.

Another important challenge for e-commerce businesses is the frequent returns that occur among customers. The average return rate for purchases from online stores is 25%, while the return rate for purchases from physical stores is only 8% (Charlton; 2020). To retain satisfied customers, returns must be easily accessible and not a barrier to purchasing an item. At the same time, returns usually have a negative impact on the business due to the cost of staff and resources, as well as the risk of not being able to resell the returned items. The trade-off between satisfied customers and the increased costs associated with returns is highly topical in the fashion industry, where a large share of sales takes place online.

Customer satisfaction is not the only metric to measure. Customer attrition, also known as customer churn, refers to the percentage of customers who no longer use a company's service, and it is very important to businesses. By identifying potential churners in a timely manner, actions can be taken to prevent customers from leaving the client base.

Therefore, the analysis of gaining, retaining and maintaining customer trust is a highly topical issue in the e-commerce industry today, needed to mitigate the above challenges in the long run.


Chapter 2

Project description

2.1 Problem definition

This study is conducted at the Swedish online fashion retailer NA-KD with the aim of gaining better insight into the customer behavior that determines purchases, returns and churn. The company has seen a surge in popularity since its inception, resulting in a user base of 8 million monthly visitors to its website (NA-KD; n.d.). Today, marketing takes place with manual methods and third-party email marketing. There are some previous studies in the area of customer segmentation at the company, where a so-called RFM analysis has been performed, but it has not yet been applied. RFM segmentation is based solely on customer transactions, through the constructed metrics Recency, Frequency and Monetary. By extending the customer segmentation to include other entities such as demographics and web traffic, the company can uncover additional customer behaviors related to purchase, return and churn, and find appropriate marketing strategies tailored to the specific target groups.

2.2 Company description

NA-KD was founded in 2015 and has been listed among the top 20 fastest growing companies in Europe (NA-KD; 2020). Today, the company consists of just over 250 employees and is headquartered in Gothenburg. The company has five locations of offices, warehouses and factories across Europe and, in addition to Gothenburg, can be found in Stockholm and Landskrona in Sweden, Krakow in Poland, and Istanbul in Turkey. In addition, the company is globally represented by more than 600 retailers worldwide and delivers to more than 100 countries every month (NA-KD; n.d.).

The Business Intelligence department at the head office in Gothenburg is responsible for supporting data-driven decision making across the organisation. The department was established in 2019 and acts with an overall view of the organisation and its departments, providing data collection, measurements, reports, analysis and insights, as well as tracking actions within the organisation. The Business Intelligence department at NA-KD also acts as a product manager, performing tests related to specific actions (Korkmaz; 2021). During the thesis work, the Business Intelligence team supervises and provides detailed facts and insights about the company on the technical aspects of the study.

The Performance Marketing and Sales department of NA-KD takes care of various customer relationship management services. Focusing on the continuous development of marketing strategies, its main tools are email, web push, SMS notifications and referral programs. The latter is active in five different countries and contributes around 1% of the new customers per year (Iyigun; 2021). The main emphasis is put on email marketing as an effort for customer retention. The department works strategically with other departments such as Business Intelligence. The Performance Marketing and Sales department provides insight into the current marketing strategy and focus areas, and will draw relevant marketing strategies from the results of the thesis study.

2.3 Limitations

The analysis will be based on customer transactions and website traffic, limited to customers with at least one purchase during the investigated period and an account on the company's website. This is due to the lack of stored historical data for customers without accounts, since a customer needs to consent to the collection of data based on their purchasing behavior. In addition, the data is limited to contain only real customers and orders, and hence excludes e.g. wholesalers that purchase significantly large quantities of goods. The investigated period is limited to 2018-06-01 until 2021-01-01, as no earlier data is available from Google Analytics.

2.4 The objectives

Considering the challenges faced by the online e-retailer NA-KD in customer acquisition, retention and churn, as well as the frequent returns and the lack of previously conducted customer segmentation, the objectives of this thesis study are:

- to find the group of loyal customers;

- to create a model for predicting customer loyalty;

- to create a model for predicting whether a customer will make frequent returns;

- to create a model for the prediction of customer churn

where the classification of loyal customers, customers of frequent returns and churned customers should be done with three separate classification models.

2.5 Project structure

The main goal of this thesis is to build and deliver a model that segments NA-KD customers into similar groups and predicts customer loyalty, churn and returns. The model should be based on both transactional data and the existing segmentation model RFM, but extended to include other entities that may be relevant in explaining customer behavior. The final model should provide a set of customer segments (clusters) such that customers within each cluster are similar to each other and different from customers in the other clusters. The exploratory data analysis, as well as the information gathered from the clustering analysis, is used together with information and definitions from the company to define the classes regarding customer behavior in terms of loyalty, churn and return activities.

The thesis work is split into the following phases:

- Literature review to investigate algorithms and methods that have been used in other studies for solving similar kinds of problems.

- Implementation of chosen clustering algorithm.

- Evaluation of results and further exploratory analysis of the detected customer seg- ments to find purchase patterns within these segments.

- Implementation of suitable classification models.

- Model evaluation and analysis.

The analysis uses the web analytics service Google Analytics to access website traffic data, and the Google Cloud data warehouse BigQuery to access transactional data and perform data mining. Statistical analysis, including the initial exploratory analysis and modelling, is performed in the R programming language.


Chapter 3

Literature review

The subject of customer segmentation to discover information about customer behaviour is highly topical and is used by some of the largest companies in the world, such as Netflix, Google and Amazon. Below, different approaches to analyzing customer behavior, similar to the objectives of this thesis, are reviewed. Starting with the most primitive and commonly used RFM segmentation method, the literature review proceeds to more computationally expensive segmentation techniques, such as hybrid approaches of machine learning algorithms. These are some of the methods previously used for customer segmentation and classification. The final methods used in this work are presented in the Methodology chapter (Chapter 4).

3.1 CLV estimation based on RFM analysis

Customer lifetime value (CLV) is used in a case study by Khajvand et al. (2011) for customer segmentation through two different approaches: an RFM analysis, and an extended RFM with an additional feature representing the total number of items purchased by a customer in addition to the number of orders. According to Khajvand et al. (2011), the RFM model is the simplest and most powerful model for studying customer behavior in the context of customer relationship management. The RFM model is defined by the metrics Recency, the time elapsed since the last purchase; Frequency, the total number of orders made during a specific time period; and Monetary, the total amount of money spent by the customer during the given time period. For a customer to be considered most loyal and a profit driver, Recency should be low, indicating a higher likelihood of repeated purchases; Frequency should be high, indicating greater loyalty to the company; and the Monetary value should be high, providing information about the importance of the customer. Based on the RFM metrics, the authors used the K-means algorithm to cluster the customers into segments with similar values. The clustering analysis revealed no significant difference between the two approaches, RFM and extended RFM. The traditional RFM model was hence used for further analysis, where the CLV value is calculated for each cluster using the weighted RFM method (see more details in Khajvand et al. (2011)). This method uses expert information from the sales department to determine the metrics of greatest weight and importance for the CLV. The CLV values were later used to assign CLV rankings to each segment, providing a final financial overview of the customer segments that can be used in future marketing strategies.
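To illustrate how the three metrics are derived from raw transactions, the following minimal R sketch (not taken from the reviewed study) computes Recency, Frequency and Monetary per customer; the table tx, its columns and the end-of-period date are hypothetical placeholders.

    # Hypothetical transaction table: one row per order
    tx <- data.frame(
      customer = c("a", "a", "b", "c", "c", "c"),
      date     = as.Date(c("2020-11-02", "2020-12-20", "2020-07-15",
                           "2020-09-01", "2020-10-10", "2020-12-28")),
      amount   = c(40, 55, 120, 30, 45, 60)
    )
    end_of_period <- as.Date("2021-01-01")

    # Recency, Frequency and Monetary per customer
    rfm <- do.call(rbind, lapply(split(tx, tx$customer), function(d) {
      data.frame(
        customer  = d$customer[1],
        recency   = as.numeric(end_of_period - max(d$date)),  # days since last order
        frequency = nrow(d),                                  # number of orders
        monetary  = sum(d$amount)                             # total amount spent
      )
    }))
    print(rfm)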


3.2 An extended RFM ranking by K-means cluster- ing

RFM analysis is a well-known technique for evaluating customer value based on transactional data that reveals the customers' purchase behavior. Joy Christy et al. (2018) examined this technique and extended it by applying three different machine learning based clustering algorithms, K-means, Fuzzy C-means and Repetitive Median based K-means, to the resulting RFM values. The aim of the study was to find the clustering method with the best results in terms of iterations, cluster compactness and execution time. The dataset available was the transaction data of customers from an online retail store over a study period of one year. The data consisted of eight features, such as customer ID, product code, product name, price, and date and time of purchase, to name a few. The RFM metrics were calculated based on the given attributes and ranked in a specific order to be added to the final dataset later. The Repetitive Median based K-means (RM K-means) algorithm performed best, with a silhouette width of 0.49 in comparison to K-means' 0.33 and fuzzy K-means' 0.43. The RM K-means algorithm also outperformed the other algorithms in terms of execution time, with 1.49 seconds compared to 2 seconds for K-means and 24 seconds for fuzzy K-means, and in terms of the number of iterations, with 2 iterations in comparison to 4 and 193 for traditional K-means and the fuzzy variant, respectively.

3.3 Customer segmentation on behavioural data

To gain insight into the customer behavior of an e-marketplace application for second-hand vintage clothing, Aziz (2017) investigated customer preferences in a master's thesis at Uppsala University. The study explored the research questions of whether the company's available data could be used to segment customers into groups with similar preferences, what size of segments might be reasonable, and whether the segments could be used to target customers. Aziz (2017) performed data pre-processing to create a ranking matrix based on users and brands available on the website. The dimensionality of the data matrix was reduced using Principal Component Analysis, and customers without sufficient activity were removed from the data. He then used the reduced data matrix to perform a clustering analysis based on the K-means algorithm and the cosine similarity measure. The appropriate number of clusters was determined with the Silhouette and Elbow methods, resulting in a set of three clusters. However, with a Silhouette score of 0.32, this was still not an optimal clustering proposal (for an appropriate clustering proposal, the Silhouette value should be larger than 0.5). In addition, a website was constructed with easily accessible visualization tools that the company could use for further analysis. The author highlights the importance of exploring multiple clustering algorithms to determine the most accurate model, as well as modifying the pre-processing phase of the analysis with different thresholds.


3.4 Using SOM, K-means and the LRFM model for customer segmentation

Ait daoud (2015) extended the traditional RFM model for segmenting customer behavior by including the length of the customer's relation to the company, L, and called the proposed model the LRFM model. The LRFM model is used to perform an initial segmentation of the customers of a Moroccan online store, which is then fed into two clustering techniques: the Self-Organizing Map method (SOM) and the K-means clustering method. The unsupervised SOM method provided the best number of proposed clusters, which was later used in the K-means algorithm. The study resulted in nine different clusters, for which the LRFM metrics were compared to examine the behavior of each customer segment. The customers with the highest LRFM values are treated as the most loyal customers, who contribute highly frequent purchases of high monetary value, a long-term relation to the company, and a recency implying that the customer has recently been active with purchases.

3.5 Return prediction within e-retail

Al Imran and Amin (2020) examined the disproportion of return events occurring among online shoppers in comparison to traditional offline shoppers. Returns are one of many reasons for decreased profitability among e-retailers. There are several potential explanations for the high percentage of returns, such as flexible return policies, damages, delays, and mismatched expectations, to name a few. In this study, Al Imran and Amin (2020) utilized a state-of-the-art (SOTA) predictive modelling approach to find the most accurate classification model among XGBoost, LightGBM, CatBoost and TabNet, along with the traditional Decision Tree algorithm as a baseline. The available dataset included 12 features based on transactional data, among them a feature indicating whether an order was returned or not. The model with the highest performance measure was the TabNet deep-learning based algorithm. The analysis further investigated the most influential features of the data in explaining the return events, finding that order location and payment method have the greatest impact on returns, followed by promotional orders and shopping cart orders that indicate whether multiple items or a single item were purchased.

3.6 Customer churn prediction

The challenge of customer churn in telecommunications has been investigated by Tsai and Lu (2009) by considering hybrid models that combine two different neural network techniques for churn prediction. Customer churn prediction is considered in many data mining methods, with the aim of describing the data and predicting unknown or future values of features. The data mining techniques investigated in this study are the supervised classification method Back-Propagation Artificial Neural Networks (ANN) and the unsupervised clustering method Self-Organizing Maps (SOM). The study investigates the performance differences of two serially combined hybrid approaches: the combination of ANN and ANN, and the combination of SOM and ANN. The latter hybrid approach, which combines clustering and classification, provides the ability to preprocess and identify patterns within groups that can later be used to classify or predict future values. The training set used to construct the classification model is hence based on the clustering result. The authors also compared the hybrid models with the baseline model of a single ANN. The study found that the combination of ANN+ANN outperformed both the baseline model and SOM+ANN in terms of prediction accuracy and Type I and II errors. The SOM+ANN also outperformed the baseline model. However, when fuzzy testing data was introduced, the SOM+ANN was outperformed by both the ANN+ANN and the baseline model of a single ANN. This suggests that the ANN+ANN combination had greater performance and stability than the compared models.

3.7 Churn prediction through hybrid classification

Caigny et al. (2018) present a new classification algorithm using a hybrid approach of logistic regression combined with decision trees to investigate customer churn prediction. The new method is called the logit leaf model (LLM) and is compared against ordinary decision trees (DT), logistic regression (LR), random forest (RF) and logistic model trees (LMT). The authors emphasize the two main performance areas of customer churn models: predictive performance and model comprehensibility. The trade-off between these performance measures is the main decision point when modelling with classification algorithms. The DT algorithm is more comprehensible, while LR has higher predictive performance. The LLM algorithm has both high predictability and sufficient comprehensibility. The resulting segmentation of the LLM algorithm has also proven valuable in churn research for investigating churn drivers within specific segments. The LLM algorithm uses decision trees to find subsets of customers, where each segment is represented by a terminal node in the tree, the leaf. A logistic regression based on forward-selected variables is then fitted to each customer segment separately, providing probabilities for each instance in the segments. The study, conducted on several datasets, compared performance measures such as the area under the receiver operating characteristic curve (AUC) and top decile lift (TDL), ranking the algorithms on each dataset so that a lower average rank indicates better performance. The average AUC and TDL ranks of DT and LR were 4.857, 4.929 and 3.286, 3.357, respectively. Comparing the LLM classification algorithm with its building blocks DT and LR, the results of the study show that LLM performs better than DT and LR, with average AUC and TDL ranks both equal to 1.786. The authors showed that the LLM performed at least as well as the random forest procedure RF, but with greater tractability and actionability due to the resulting clusters and associated regression models.

3.8 Customer purchase behavior analysis through a two staged approach

An analysis of customers' purchase behavior in online stores using a two-stage approach is conducted by Piskunova and Klochko (2020) in their study, which examined customers' purchases in a Ukrainian online store over a period of two years. The approach used in the study consisted of a first stage of segmenting customers into similar groupings using machine learning clustering techniques, followed by a second stage of constructing a classification algorithm used for continuously updating customer segments as well as assigning segments to new clients. The purchasing activity of the customers was evaluated with the classic RFM model, extracting the recency, frequency and monetary value of each customer. The RFM metrics were then used as the main features for the remainder of the study. The K-means algorithm was chosen as the clustering algorithm. The Elbow and Silhouette methods were used to determine the number of clusters. Since these methods gave differing results of 3, 6, 8 and 2 clusters, the R package NbClust was used to test the number of clusters using 26 additional criteria, resulting in a proposal of 3 clusters, with 6 clusters as the next best option. By validating the results based on 3 or 6 segments together with business insights, the final model was set to use 6 clusters. The classification algorithm was selected by comparing the accuracy of five different classification models. The final classification model was a Random Forest model with an accuracy of 0.99.

3.9 Comparison between K-prototypes and K-means

The K-means and K-prototypes clustering algorithms are compared by Ruberts (2020) for clustering mixed data sets. The author presents a study on already preprocessed and cleaned Google Analytics data from the online community Kaggle. By using Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) as a comparison method, the groups in the data can be visually represented in two dimensions for the different clustering techniques. The UMAP embedding requires a Yeo-Johnson transformation of the numerical features and one-hot encoding of the categorical features, which are embedded separately and combined by conditionally embedding the numerical features on the categorical ones. The result of the UMAP embedding is a scatterplot in which the data are visualized in two dimensions. The author starts by performing the K-means clustering method, which requires a numerical dataset. After one-hot encoding the categorical features and applying the Yeo-Johnson transformation to the data, the now more Gaussian-like data is used to fit a K-means model with an initial number of 15 clusters based on the UMAP visualisation. The K-prototypes method uses the mixed data directly, applying the transformation to the numerical features and leaving the categorical features unprocessed. By colouring the UMAP scatter plot according to the two different clustering techniques, the differences among the groups are presented, where the K-prototypes algorithm results in clearer boundaries between the groups as well as more evenly distributed groups. By building a LightGBM classification model on top of the clustering model, the author evaluates the models' distinctiveness with the cross-validated F1 score and the clusters' informativeness using SHAP feature importance. The resulting cross-validated F1 scores for the K-means and K-prototypes methods were 0.986 and 0.97 respectively, indicating that the K-prototypes clusters are adequate and discriminative despite the slightly lower score. The SHAP values of the classifier present four dominant numerical features for the K-means method, while the K-prototypes method locates 10 important features out of a total of 14, with the categorical features being of higher importance. The study concludes that clusters based on K-prototypes are therefore more informative, due to the higher importance of categorical features, and hence should be used by marketers for customer segmentation.


Chapter 4

Methodology

This chapter contains a presentation of the dataset, followed by the data processing and a description of the features, theory and model setup.

4.1 Data description

The present data is based on a subset of customer transactions and web traffic data during the investigated period from 2018-06-01 to 2021-01-01. To exclude irrelevant and incomplete features, as well as duplicates and inappropriate formats, cleaning the data takes a large part of the pre-processing stage in order to obtain a proper dataset for analysis. The customers in this dataset are limited to being users of NA-KD and thus having created an account on the company's website. The data is extracted from a database in the BigQuery data warehouse of Google Cloud, and web traffic data is extracted from Google Analytics. This results in a dataset with a sample size of 1.4 million customers. Due to confidentiality, some metrics are encrypted and some are not visualized at all in this report.

Figure 4.1: The sources used to create the final dataset in Google BigQuery.


4.2 Data mining

Data cleansing takes place in the Google BigQuery data warehouse by extracting relevant features, eliminating duplicates, and putting the data into the desired format. New features are extracted from the raw data using data mining techniques to explain customer behavior. Return ratios and the total length of the customer's relationship with the company are some examples of newly constructed variables. The final sample data consists of 1,355,533 customers and 25 associated variables that potentially explain customer purchasing behavior. Figure 4.1 shows the process applied to the data warehouse to create the final dataset in Google BigQuery. The features of the dataset are presented and briefly explained in Table 4.1.

To reduce the number of distinct values of the categorical variables in the data, subgroups were constructed based on the most frequently occurring values. Since the company ships to over 100 countries, it is necessary to divide the countries into smaller subgroups. Based on the customer base during the investigated period, the top five countries were retained, accounting for 87% of the customers. The remaining countries were grouped together, resulting in a country variable with six levels: A, B, C, D, E and F.
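A minimal R sketch of this kind of recoding follows; the customers data frame, its country column and the country codes are hypothetical, since the actual country names are anonymized in this report.

    set.seed(1)
    # Hypothetical customer table with a raw country column
    customers <- data.frame(
      country = sample(c("SE", "DE", "NL", "NO", "DK", "FR", "PL", "IT"),
                       size = 1000, replace = TRUE,
                       prob = c(0.3, 0.2, 0.15, 0.12, 0.1, 0.05, 0.05, 0.03))
    )

    # Keep the five most frequent countries; pool the rest into one level
    top5 <- names(sort(table(customers$country), decreasing = TRUE))[1:5]
    customers$countryGroup <- factor(
      ifelse(customers$country %in% top5, customers$country, "Other")
    )
    table(customers$countryGroup)  # six levels in total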

The categorical variable explaining the most commonly used payment method is grouped into KLARNA, Other and Both, which represent whether the customer uses the KLARNA payment method, any other method, or both KLARNA and another method equally often. This division of payment methods is based on the option of invoice purchases through KLARNA, which clearly separates this payment method from the others.

To examine the length of the relation between the customer and the company, the date of the first order on the company's website is extracted and compared to the date of the most recent order. This is the only metric that contains information from beyond the investigated period.

NA-KD uses 95 different delivery methods with different characteristics. Some are called premium, which deliver the packages directly to the customers' homes, and some express, which deliver the packages faster than the others. The third delivery option is called standard. Using only these three subsets risks excluding some important information, so the variable deliveryMode is included with further details of the delivery method. This feature explains whether the package has been delivered through a mailbox, directly to the customer's home or office, or through traditional parcel shops where the customers pick up their items themselves. The feature is relevant because the definitions of premium and standard shipment delivery differ among countries; e.g., home delivery can be defined as standard delivery in one country while being a premium option in another. The metric is coded as A, B, C and D.

In Table 4.1, various e-commerce specific terms are used to explain the metrics. A more detailed explanation of these terms can be found in Table 4.2. The channels, divided into first, last and only channel fractions, are also split into lower, middle and upper funnels. This division is based on the reach of the channel, that is, the share of traffic generated to the website. The upper funnel reaches customers who have no incentive to buy anything; examples of such channels would be Facebook ads. The middle funnel reaches customers who have an incentive to buy an item, but no incentive for it to be from the specific company NA-KD. Examples of these channels could be Google Shopping, where the customer can search for a specific item and be directed to the company's website via Google.

hashedEmail: Encrypted customer emails
country: The country related to the delivery location
mostFreqPayMethod: The most frequently used payment method
relationLength: The number of days between the first and last purchase
Recency: The number of days between the end of the investigated period and the last purchase date
totalOrders: The total number of orders during the investigated period
salesQuantity: The total number of items purchased
returnQuantity: The total number of items returned
returnRate: The proportion of returned items
netRevenue: The revenue of purchased items minus the value of returned items; the total net revenue of the customer
mostFreqDeliveryMethod: The most frequently used delivery method
deliveryMode: The mode of delivery
discountedSalesRatio: The proportion of items purchased at a discount, through a voucher code or on sale
mostFreqReturnReason: The most frequently stated return reason
avgReturnTime: The average time between a purchase and the associated return
mostFreqFirstChannel: The most frequently used channel to first visit the website
mostFreqLastChannel: The most frequently used channel to make a purchase
mostFreqOnlyChannel: If a customer uses only one channel to enter and make a purchase, this channel is referred to as the only channel; this metric presents the most frequently used only channel
mostFreqDevice: The most frequently used device of the customer, desktop or mobile
mostFreqItem1: The most frequently purchased item category
mostFreqItem2: The second most frequently purchased item category
mostFreqItem3: The third most frequently purchased item category
pageviewPerSession: The average number of pageviews per session
conversionRate: The total number of orders divided by the total number of sessions
oneSizeItems: The number of items purchased with only one size per item
multiSizeItems: The number of items purchased with several sizes of the same item

Table 4.1: Features in the dataset with associated descriptions.

The upper and middle funnels are both generated through paid advertisements. The lower funnel reaches customers who have an incentive to purchase an item from a particular company. These customers are reached through channels in a natural way, and such channels are thus called "organic" channels. They are created through unpaid advertising, for example the word of mouth that brings the company name to the customer's attention. Customers can then reach the website either directly through the correct web address or by searching for the company through various search engines.

Session: A session is a collection of actions performed on the website during a specified period of time, for NA-KD set to 30 minutes. A single session can include multiple interactions such as page views and transactions. The duration of a session can be determined both by measuring time in seconds and by the number of page views during the session.

Pageviews: A pageview is the instance of a new page being loaded in a browser. The total number of pages visited, the pageviews, increases every time a customer enters a new page on the website, where each returning page is counted.

Conversion: Conversion describes the process when a website visitor completes the action of making a purchase. The customer-level conversion rate used in this thesis represents the percentage of sessions that end in a purchase.

Voucher Code: A voucher code is used at checkout to receive a discount specified by the company, often only valid for a certain period of time.

Channel: Sales channels explain the way the company enters the market to increase sales. At the most detailed level, NA-KD's channels consist of 24 different segments. In this study, the channels are divided into three subgroups: lower, middle and upper funnel channels.

Table 4.2: Commonly used terms and their associated explanations.

4.3 Theory

This section includes theory on the statistical and machine learning methods used in the study, such as nonparametric tests, dimensionality reduction, clustering and classification.

4.3.1 Non-parametric Statistics

From the exploratory data analysis in Chapter 5, one can conclude that the data does not follow the normal distribution; hence, nonparametric statistical tests need to be used to examine the significant effects within the data. The theory of the nonparametric tests used, the Kruskal-Wallis H-test and the Wilcoxon rank-sum (Mann-Whitney) test, is briefly explained in the following section. For further reading, see Montgomery (2017) and Corder and Foreman (2014).

Kruskal-Wallis H-Test Statistic

When comparing the means of more than two independent samples, the nonparametric Kruskal-Wallis H-test is a good fit; it is the nonparametric counterpart of the parametric one-way ANOVA. The Kruskal-Wallis H-test compares the sample medians θ_i, with the corresponding null hypothesis

H_0 : \theta_1 = \theta_2 = \cdots = \theta_k,

where k ≥ 2. In symmetrical distributions the mean and the median coincide, so under the symmetry assumption comparing medians is the same as comparing means.

A significant result of the Kruskal-Wallis H-test states that at least one of the population means is different from the others, but not where the difference occurs. To conduct a Kruskal-Wallis test, we first rank the observations y_ij in ascending order and replace each observation with its rank R_ij, where the smallest observation receives rank 1. In the case of ties (observations with the same value), the average rank is assigned to each of the tied observations. The Kruskal-Wallis H-test statistic is given by

H = \frac{1}{S^2} \left[ \sum_{i=1}^{k} \frac{R_i^2}{n_i} - \frac{N(N+1)^2}{4} \right] \qquad (4.1)

where N is the total number of observations, n_i denotes the number of observations in the ith treatment, R_i is the sum of the ranks in the ith treatment, and S^2 is the variance of the ranks, expressed as

S^2 = \frac{1}{N-1} \left[ \sum_{i=1}^{k} \sum_{j=1}^{n_i} R_{ij}^2 - \frac{N(N+1)^2}{4} \right].

In the case of no ties, or when the number of ties is moderate, the test statistic H is given by the simpler form

H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1), \qquad (4.2)

which is obtained by substituting into (4.1) the variance of the ranks when there are no ties, S^2 = N(N+1)/12. When n_i ≥ 5, the test statistic H approximately follows, under the null hypothesis, the Chi-square distribution with k − 1 degrees of freedom, and hence the null hypothesis is rejected if H > \chi^2_{\alpha, k-1}.

If there are ties in the ranking of values, a correction must be made. A corrected H-statistic is created by dividing the H-statistic in (4.2) by the correction factor

C_H = 1 - \frac{\sum (T^3 - T)}{N^3 - N},

where T represents the number of tied values in each set of ties and N is the total number of values from all samples.
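For reference, the test is available in base R as stats::kruskal.test, which applies the Chi-square approximation and the tie correction automatically; a minimal sketch on hypothetical, non-normal data comparing a metric across three groups:

    set.seed(1)
    g <- factor(rep(c("A", "B", "C"), each = 30))           # three independent samples
    y <- c(rexp(30, 1/40), rexp(30, 1/55), rexp(30, 1/70))  # skewed observations
    kruskal.test(y ~ g)  # H statistic, Chi-square approximation with k - 1 = 2 df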

The Wilcoxon-Mann-Whitney test

The Mann-Whitney U-test, also known as the Wilcoxon-Mann-Whitney test or Wilcoxon rank-sum test, is a nonparametric statistical test for comparing the means of two independent continuous populations X_1 and X_2 that are assumed not to be normally distributed. The distributions of the populations X_1 and X_2 are, however, assumed to be continuous and to have the same shape and variance, differing only (possibly) in their locations. Formally, the Wilcoxon rank-sum test is used to test the null hypothesis H_0 : μ_1 = μ_2. The corresponding parametric test is the two-sample independent t-test.

Assume that two independent samples x_1 and x_2, of sizes n_1 and n_2 with n_1 ≤ n_2, have been drawn from the populations X_1 and X_2. The Mann-Whitney U-test pools the values of the two samples and ranks them in ascending order, to determine whether the values of the two samples are randomly mixed in the ranking or clustered at opposite ends. If two or more observations are tied (identical), the mean of the ranks that would have been assigned had the observations differed is used. If the two samples do not differ, the rank order will be random, while a clustering of the values of one sample indicates a difference between the two samples.

Let U_1 be the sum of the ranks in the smaller sample, and U_2 the sum of the ranks in the larger one. Then,

U_2 = \frac{(n_1 + n_2)(n_1 + n_2 + 1)}{2} - U_1.

When the sample means do not differ, we expect the sums of the ranks of the two samples to be nearly equal after adjusting for the difference in sample size. Consequently, if the sums of the ranks differ greatly, we conclude that the means are not equal.

The U statistic is examined for significance by comparison with the values in a table of critical values. For large samples, i.e., when n_1 and n_2 are moderately large, say more than eight, the distribution of U_1 can be well approximated by the normal distribution (see details in Montgomery and Runger (2018)), with mean

\mu_{U_1} = \frac{n_1(n_1 + n_2 + 1)}{2}

and variance

\sigma^2_{U_1} = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}.

The statistic

z = \frac{U_1 - \mu_{U_1}}{\sigma_{U_1}}

can then be used as a test statistic for the Wilcoxon rank-sum test, and the appropriate critical region is |z| > z_{α/2}, z > z_α, or z < −z_α, depending on whether the test is two-tailed, upper-tailed, or lower-tailed.
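In base R the test is provided by stats::wilcox.test, which switches to the normal approximation for larger samples; a minimal sketch on hypothetical data:

    set.seed(1)
    x1 <- rexp(40, 1/5)  # hypothetical sample from population X1
    x2 <- rexp(60, 1/7)  # hypothetical sample from population X2
    # Two-sided rank-sum (Mann-Whitney) test of H0: mu1 = mu2
    wilcox.test(x1, x2, alternative = "two.sided")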

4.3.2 Dimensionality Reduction

The data used in the study contains more than 20 features, where performing the well-known method of Principal Component Analysis (PCA) could possibly lead to dimension reduction. To analyse mixed data containing both numerical and categorical variables, such as the data used in this study, an extended version of standard multivariate principal component analysis, PCAmix, from the R package PCAmixdata, can be used. The PCAmix algorithm consists of a fusion of ordinary principal component analysis (PCA) and multiple correspondence analysis (MCA), applied to the numerical and categorical variables respectively (Chavent et al.; 2017).


Multiple Correspondence Analysis (MCA) is used to analyze the n × p qualitative data matrix X, where n denotes the number of observations and p the number of categorical variables. Each of the categorical features has m_j levels, and the total number of levels over all features is denoted by m. Coding each level as a binary indicator, the indicator matrix G of size n × m is constructed. The observations are all weighted by 1/n, resulting in N = (1/n) I_n, where I_n is the identity matrix of size n, and the m levels of the categorical features are weighted by n/n_s, where n_s represents the total number of observations belonging to the sth level. The metric explaining the distance between two different observations, and also giving greater weight or importance to rare levels, is

M = \mathrm{diag}\left(\frac{n}{n_s},\; s = 1, \ldots, m\right).

The centered G is denoted by Z, with total inertia m − p. The generalized singular value decomposition of Z gives the factor coordinates of the levels,

A^* = M V \Lambda = M A,

where V represents the matrix of eigenvectors, Λ is the diagonal matrix of singular values (the square roots of the eigenvalues λ_i), and A = VΛ is the factor loadings matrix of standard PCA.

MCA satisfies the property (4.3), where each element a*_si of A* represents the mean value of the standardized factor loadings of the observations in level s:

a^*_{si} = \frac{n}{n_s} a_{si} = \frac{n}{n_s} z_s^T N u_i = \bar{u}_{is}, \qquad (4.3)

where z_s denotes the sth column of Z, u_i = f_i / \sqrt{\lambda_i} is the ith left singular vector associated with the ith principal component f_i, and \bar{u}_{is} is the mean value of the loadings of the observations in level s. The eigenvalue λ_i equals the sum of the correlation ratios η²(f_i | x_j), each measuring the variance of the ith principal component f_i explained by the jth categorical feature x_j:

\lambda_i = \|a_i\|^2_M = \|a^*_i\|^2_{M^{-1}} = \sum_{j=1}^{p} \eta^2(f_i \mid x_j).

The PCAmix algorithm used to analyze mixed data is defined for n observations described by p_1 quantitative variables and p_2 qualitative variables. Together they form the n × p_1 quantitative matrix X_1 and the n × p_2 qualitative matrix X_2, where the total number of levels of the qualitative variables is denoted by m.

The algorithm consists of three steps. The first step includes pre-processing of the numerical data matrix X_1, the second step consists of factor coordinates processing of the qualitative data set X_2, and the third step is the squared loading processing, where the resulting loadings are defined as the squared correlations for the quantitative variables and the correlation ratios for the qualitative variables.

Step 1: pre-processing

i. Build the real matrix Z = [Z_1, Z_2] of dimension n × (p_1 + m), where Z_1 is the standardized X_1 and Z_2 is the centered indicator matrix G of X_2. This follows the same procedures as in standard PCA and standard MCA, respectively.

ii. Build the diagonal matrix N of the weights of the rows of Z. The weight 1/n is usually applied to the rows of Z, such that N = (1/n) I_n.

iii. Build the diagonal matrix M containing the weights of the columns of Z, where the first p_1 numerical columns are weighted by 1, according to standard PCA, and the last m columns of categorical levels are weighted by n/n_s, where n_s, s = 1, ..., m, represents the number of observations in the sth level, according to standard MCA. The resulting matrix,

M = \mathrm{diag}\left(1, \ldots, 1, \frac{n}{n_1}, \ldots, \frac{n}{n_m}\right),

indicates that the distance between two rows of Z is a mixture of the distance measure used in standard PCA, the Euclidean distance, and the weighted χ² distance used in standard MCA.

Step 2: factor coordinates processing

i. The generalized singular value decomposition of Z, using the metrics N and M, gives the decomposition

Z = U \Lambda V^T,

where the rank of Z is denoted by r.

ii. The factor coordinate matrix of dimension n × r is defined as

F = Z M V, \qquad (4.4)

and can be computed directly from the GSVD as

F = U \Lambda. \qquad (4.5)

iii. The matrix of the factor coordinates of the p_1 quantitative variables and the m levels of the qualitative variables is

A^* = M V \Lambda,

where A* is split into A*_1, containing the factor coordinates of the numerical variables, and A*_2, containing the factor coordinates of the m levels of the categorical variables.

Step 3: squared loading processing

The squared loadings are defined as the contributions of each variable to the variances of the principal components. Since Var(f_i) = λ_i and λ_i = \|a_i\|^2_M = \|a^*_i\|^2_{M^{-1}}, the contributions can be calculated directly from A*. Formula (4.6) presents the contribution c_ji of the variable x_j to the variance of the PC f_i:

c_{ji} = \begin{cases} a_{ji}^2 = a_{ji}^{*2}, & \text{if the variable } x_j \text{ is numerical} \\ \sum_{s \in I_j} \frac{n}{n_s} a_{si}^2 = \sum_{s \in I_j} \frac{n_s}{n} a_{si}^{*2}, & \text{if the variable } x_j \text{ is categorical} \end{cases} \qquad (4.6)

where the levels of the qualitative variable x_j are collected in the set I_j.


4.3.3 Clustering techniques

The clustering techniques used in this study are the K-means clustering algorithm for numerical data and the K-prototypes algorithm for mixed data sets. A brief explanation of the algorithms is given below; for more details, see Gan (2011) and Huang (1998).

The K-means algorithm

The K-means algorithm takes a numerical data set X = {x_0, x_1, ..., x_{n−1}} of n records and an integer k in {1, 2, ..., n} representing the number of clusters given to the algorithm. The K-means algorithm partitions the dataset into k clusters, denoted C_0, C_1, ..., C_{k−1}, by minimizing the objective function

E = \sum_{i=0}^{k-1} \sum_{x \in C_i} D(x, \mu_i), \qquad (4.7)

where the distance measure is denoted by D(·, ·) and μ_i, the mean of cluster C_i, is

\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x.

Equation (4.7) can be rewritten as

E = \sum_{i=0}^{n-1} D(x_i, \mu_{\gamma_i}), \qquad (4.8)

where γ_i denotes the cluster membership of x_i and is equal to j if the observation x_i belongs to the cluster C_j. Using an iterative process, the K-means algorithm minimizes the objective function, with the first k records of X set as the initial cluster centers. Based on the initial cluster centers, denoted μ_0^{(0)}, μ_1^{(0)}, ..., μ_{k−1}^{(0)}, the cluster memberships γ_i^{(0)} are updated by

\gamma_i^{(0)} = \arg\min_{0 \le j \le k-1} D(x_i, \mu_j^{(0)}), \quad i = 0, \ldots, n-1, \qquad (4.9)

that is, γ_i^{(0)} is set to the index of the cluster to which x_i has the smallest distance. The K-means algorithm then updates the cluster centers based on the cluster memberships, as shown in equation (4.10):

\mu_j^{(1)} = \frac{1}{|\{i : \gamma_i^{(0)} = j\}|} \sum_{i :\, \gamma_i^{(0)} = j} x_i, \quad j = 0, 1, \ldots, k-1. \qquad (4.10)

These updates of cluster memberships and cluster centers are repeated until either no change in cluster memberships occurs or the maximum number of iterations is reached.
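For reference, base R's stats::kmeans implements this type of iterative procedure (with smarter center initialization than taking the first k records); a minimal sketch on standardized numerical data, using the built-in iris measurements as stand-in features:

    set.seed(1)
    X <- scale(iris[, 1:4])  # numerical features, standardized
    km <- kmeans(X, centers = 3, iter.max = 100, nstart = 25)
    km$centers          # the cluster means mu_0, ..., mu_{k-1}
    table(km$cluster)   # the cluster memberships gamma_i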


The K-prototypes algorithm

To cluster a dataset consisting of a mixture of quantitative and qualitative variables, the K-prototypes algorithm of Huang (1998) can be used. The algorithm takes a mixed-type dataset X = {x_0, x_1, ..., x_{n−1}} with n observations and d attributes. The first p variables are assumed to be quantitative/numerical, whilst the remaining d − p are assumed to be qualitative/categorical. The distance between two observations x and y in the dataset X is defined as

D_{\mathrm{mix}}(x, y, \lambda) = \sum_{h=0}^{p-1} (x_h - y_h)^2 + \lambda \sum_{h=p}^{d-1} \delta(x_h, y_h), \qquad (4.11)

where λ is a balancing weight used to avoid putting a heavier weight on either type of attribute, x_h and y_h are the hth components of x and y respectively, and δ(·, ·) represents the simple matching function

\delta(x_h, y_h) = \begin{cases} 0, & \text{if } x_h = y_h \\ 1, & \text{if } x_h \neq y_h \end{cases}

The K-prototypes algorithm minimizes the objective function

P_\lambda = \sum_{j=0}^{k-1} \sum_{x \in C_j} D_{\mathrm{mix}}(x, \mu_j, \lambda), \qquad (4.12)

where the function D_mix(·, ·, λ) is defined in (4.11), k denotes the number of clusters, C_j is the jth cluster, and μ_j denotes the center of the jth cluster, also called the prototype.

The K-prototypes algorithm iterates to minimize the objective function in (4.12) until a stopping condition is reached. The algorithm initializes the k cluster centers, denoted μ_0^{(0)}, μ_1^{(0)}, ..., μ_{k−1}^{(0)}, randomly from the dataset. The updated cluster memberships γ_0, γ_1, ..., γ_{n−1} are obtained by

\gamma_i^{(0)} = \arg\min_{0 \le j \le k-1} D_{\mathrm{mix}}(x_i, \mu_j^{(0)}, \lambda). \qquad (4.13)

When the cluster memberships have been updated by (4.13), the K-prototypes algorithm continues by updating the prototypes of the clusters using

\mu_{jh}^{(1)} = \frac{1}{|C_j|} \sum_{x \in C_j} x_h, \quad \text{for } h = 0, 1, \ldots, p-1,

and

\mu_{jh}^{(1)} = \mathrm{mode}_h(C_j), \quad \text{for } h = p, p+1, \ldots, d-1,

where mode_h(C_j) represents the most common categorical value of the hth variable in cluster C_j, and

C_j = \{x_i \in X : \gamma_i^{(0)} = j\}.

For the distinct values A_{h0}, A_{h1}, ..., A_{h, m_h−1} that the hth variable can take, where m_h represents the total number of distinct values of attribute h, let the number of records in cluster C_j taking the value A_{ht} be defined as

f_{ht}(C_j) = |\{x \in C_j : x_h = A_{ht}\}|, \quad t = 0, 1, \ldots, m_h - 1. \qquad (4.14)

Then mode_h(C_j) in (4.15) can be obtained as

\mathrm{mode}_h(C_j) = \arg\max_{0 \le t \le m_h - 1} f_{ht}(C_j), \quad h = p, p+1, \ldots, d-1. \qquad (4.15)

The above steps are repeated by the K-prototypes algorithm until either the maximum number of iterations is reached or no further changes in cluster memberships occur.

4.3.4 Classification

To perform binary classification within the data, the logistic regression model with bias reduction is used to avoid overfitting problems with the model. The following is a brief explanation of the techniques; more information can be found in Ratner (2012), Kosmidis and Firth (2010), and Firth (1993).

Logistic regression model

The logistic regression model (LRM) classifies individuals into two distinct classes, for example a buyer and a non-buyer. In the LRM, we assume that the response variable Y_i is a Bernoulli random variable which takes the value 1 with probability π_i and 0 with probability 1 − π_i. Based on the independent variables X_1, X_2, ..., X_p of each individual, the logistic regression model classifies the individual into one of the two classes through the logit response function. This function is given by

E(Y) = \frac{\exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)},

where β_i, i = 0, 1, ..., p, represents the logistic regression coefficients. In the LRM, we assume that E(Y) is related to X_1, X_2, ..., X_p by the logit function. It is easy to show that

\frac{\pi_i}{1 - \pi_i} = \frac{E(Y)}{1 - E(Y)} = \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p).

The above quantity is called the odds, which has a straightforward interpretation: if the odds is 2 for a particular value of x, it means that a success is twice as likely as a failure at that value of the regressor x.

Parameter estimation and bias reduction in logistic regression

The logistic model, where y_i ∼ Bernoulli(π_i), can be fitted by estimating the parameters using the maximum likelihood method. The first step is to construct the likelihood function, which is a function of the unknown parameters. Let l(β) be the log-likelihood function for the vector of parameters β = (β_1, β_2, ..., β_p) of the model. An MLE of β is obtained by solving the score equation

S(\beta) = \nabla_\beta\, l(\beta) = 0.
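As an illustration of how such a model can be fitted in practice, a minimal R sketch follows, assuming the brglm2 package, whose brglmFit method plugs Firth-type bias-reduced estimation into glm; the data frame and covariates are hypothetical.

    library(brglm2)

    set.seed(1)
    d <- data.frame(
      churned    = rbinom(300, 1, 0.3),  # hypothetical binary response
      recency    = rexp(300, 1/60),      # hypothetical covariates
      returnRate = runif(300)
    )

    # method = "brglmFit" replaces plain maximum likelihood with
    # bias-reduced estimation of the coefficients beta
    fit <- glm(churned ~ recency + returnRate, family = binomial("logit"),
               data = d, method = "brglmFit")
    summary(fit)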
