Omnichannel path to purchase : Viability of Bayesian Network as Market Attribution Models

N/A
N/A
Protected

Academic year: 2021

Share "Omnichannel path to purchase : Viability of Bayesian Network as Market Attribution Models"

Copied!
65
0
0

Loading.... (view fulltext now)

Full text


Linköping University | Department of Computer and Information Science

Master’s thesis, 30 ECTS | Statistics and Machine Learning

2020 | LIU-IDA/STAT-A--20/003--SE

Omnichannel path to purchase

Viability of Bayesian Network as Market Attribution Models

Anubhav Dikshit

Supervisor: Hao Chi Kiang
Examiner: Anders Nordgaard



Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

Market attribution is the problem of interpreting the influence of advertisements on the user's decision process. Market attribution is a hard problem, and it matters commercially: advertising is a significant source of Google's revenue. There are broadly two types of attribution models: data-driven and heuristic. This thesis focuses on data-driven attribution models and explores the viability of using Bayesian Networks as market attribution models, benchmarking their performance against a logistic regression. The data used in this thesis was preprocessed using an undersampling technique. Furthermore, multiple techniques and algorithms to learn and train Bayesian Networks are explored and evaluated.

For the given dataset, it was found that a Bayesian Network can be used for market attribution modeling and that its performance is better than the baseline logistic model.

Keywords: Market Attribution Model, Bayesian Network, Logistic Regression.


Acknowledgments

I would like to thank everyone at my university for making my time during my master's a pleasant experience. There will always be a little bit of LiU in me wherever I go. I would further like to thank Fanny, Anna, and Viking at Nepa for providing me the opportunity to experience Nepa and meet so many wonderful people.

I would like to thank my supervisor Hao Chi for all the feedback and motivation that he provided. Our conversations have always been profound and thought-provoking. I would also like to thank Yusur Almutair, my opponent, who has been instrumental in ensuring that this thesis conforms to academic standards.

Finally, I would like to thank my family for all the support and encouragement throughout the master's program, especially my wife, Chetana Deshpande.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

1 Introduction
   1.1 Nepa
   1.2 Background
   1.3 Objective

2 Data
   2.1 Overview
   2.2 Data summary
   2.3 Data Pre-Processing
      2.3.1 Imbalanced data
      2.3.2 Estimation of optimal undersampling ratio
      2.3.3 Distribution of Data

3 Theoretical Background
   3.1 Bayesian Networks
   3.2 Learning a Bayesian Network
      3.2.1 Structure Learning
      3.2.2 Parameter Learning
   3.3 Bayesian Network Algorithms
      3.3.1 Constraint-based algorithms
      3.3.2 Score-based algorithms
      3.3.3 Hybrid algorithms
   3.4 Conditional independence tests
   3.5 Inference on Bayesian Network
      3.5.1 Exact Inference
      3.5.2 Approximate Inference
   3.6 Logistic Regression
      3.6.1 Log-likelihood
      3.6.2 The Hessian
      3.7.4 Positive Class Accuracy
      3.7.5 Balanced Accuracy
   3.8 Techniques for Measuring Association
      3.8.1 Chi-Square Test (χ² test)
      3.8.2 Brute Force Search
      3.8.3 Multiple Correspondence Analysis

4 Methods
   4.1 Baseline Logistic Regression Model
      4.1.1 Optimal Decision Threshold of Logistic Regression
   4.2 Bayesian Network Model
      4.2.1 White-list Creation Using Hypothesis Testing
      4.2.2 White-list Creation Using Multiple Correspondence Analysis
      4.2.3 White-list Creation Using Grid Search
      4.2.4 Black-list Creation Using Domain Knowledge
      4.2.5 Optimal Restart Using Cross-Validation
   4.3 Bayesian Network Model Comparison
   4.4 Attribution Formula

5 Results
   5.1 Baseline Logistic Regression Model
   5.2 Bayesian Network Models
   5.3 Distribution of Attribution
   5.4 Comparison of Attribution

6 Discussion
   6.1 Results
   6.2 Method
      6.2.1 Undersampling
      6.2.2 Bayesian Network Modeling
      6.2.3 Attribution Formula
   6.3 Ethical Considerations
   6.4 Future Work
   6.5 Delimitations

7 Conclusion

Bibliography

A Appendix


List of Figures

2.1 Visual depiction of random undersampling, adapted from [18]
2.2 Plots of performance metrics vs. undersampling ratio, optimal undersampling ratio 1.4
2.3 Plot to check overfitting by comparing train and valid data-set balanced accuracy
3.1 DAG mapping probability, source from [31]
3.2 DAG and CPDAG, adapted from [31]
3.3 MCA notation, source [MCA_slide]
3.4 Point cloud of categories/columns, source [MCA_slide]
4.1 Plot showing the model metrics vs. the decision boundary cutoff
4.2 Testing for over-fitting at different decision boundary cutoffs
4.3 Plot showing the variance explained vs. eigenvectors
4.4 Plot showing the contribution of variables towards eigenvector 1
4.5 Plot showing the contribution of variables towards eigenvector 2
4.6 BIC vs. iteration using Grid Search
4.7 Balanced Accuracy vs. iteration using Grid Search
4.8 Balanced Accuracy vs. random restarts
5.1 Plot of Bayesian Network model with the highest accuracy, built using Hill climbing algorithm with white-list using Chi-square test
5.2 Distribution of attribution values for variables from Bayesian Network Model
5.3 Distribution of attribution values for variables from Logistic Model
A.1 Plot of Model using MMHC algorithm with white-list using Chi-square test
A.2 Plot of Model using Tabu Search algorithm with white-list using Chi-square test
A.3 Plot of Model using RSMAX2 algorithm with white-list using Chi-square test
A.4 Plot of Model using RSMAX2 algorithm with white-list using Grid Search
A.5 Plot of Model using Hill climbing algorithm with white-list using Grid Search
A.6 Plot of Model using MMHC algorithm with white-list using Grid Search
A.7 Plot of Model using Tabu Search algorithm with white-list using Grid Search
A.8 Plot of Model using RSMAX2 algorithm with white-list using MCA
A.9 Plot of Model using MMHC algorithm with white-list using MCA
A.10 Plot of Model using Hill climbing algorithm with white-list using MCA


List of Tables

2.1 Sample data
2.2 Variable description
2.3 Proportion of conv==0 by column value
3.1 Confusion matrix for a binary class problem
4.1 Hypothesis table with P-values, tested at 95% significance
5.1 Performance metrics for Logistic Model
5.2 Performance of models using different algorithms and white-list techniques
5.3 Mean attribution values of variables for Bayesian Network Model and Logistic Model
A.1 Logistic Model Summary


1 Introduction

This chapter provides a brief introduction to the concepts of 'Market Attribution' and 'Path to Purchase,' as well as to the company and the purpose of the thesis project undertaken.

1.1 Nepa

This thesis was done in collaboration with Nepa AB. Nepa is a global consumer science marketing company based in Stockholm, Sweden, with offices in Helsinki, Oslo, Copenhagen, London, and Mumbai.

1.2 Background

Shao and Li [1] define 'Market Attribution' as the problem of interpreting the influence of advertisements on the user's decision process. The goal of attribution modeling is to pinpoint the credit assignment of each positive user (a customer who made a purchase or clicked on an ad) to one or more advertising touchpoints.

In 1898, E. St. Elmo Lewis developed a theoretical model to help understand the customer journey/purchase funnel. This model is called the 'AIDA' model [2]. The stages proposed by this model are:

1. Awareness – the customer is aware of the existence of a product or service.

2. Interest – actively expressing an interest in a product group.

3. Desire – aspiring to a particular brand or product.

4. Action – taking the next step towards purchasing the chosen product.

Although the 'AIDA' model has its flaws, such as assuming a linear step-by-step process behind the purchase decision, as mentioned in [3], the 'AIDA' model still frames the customer purchase journey while navigating through multiple touchpoints. Path to purchase analysis refers to the analysis of the sequence of channels (touchpoints) that customers were exposed to throughout the 'purchase funnel.'

Previous methods for 'Market Attribution' include heuristic approaches such as the 'Last Click Rule' or the 'First Click Rule,' which assign 100% of the credit to the last interaction/touchpoint (e.g., the last advertisement source) or the first interaction (e.g., the first advertisement source), respectively. These are highly flawed models, as pointed out by [5], since they ignore the whole journey of the customer/transaction. Thus, data-driven techniques were deployed to perform market attribution. These models use statistical and machine learning techniques.

Shao and Li were the first to pioneer attribution using data-driven methods [1]. They chose a 'Logistic Regression model with bagging' to reduce estimation variability and a probabilistic model to compute the probability of conversion by each channel/touchpoint. 'Game Theory' based models were explored in [6]. This paper viewed the influence of variables from a causal framework, and its attribution formula has been adopted in the current thesis. Models with 'carryover' and 'spillover' effects were explored in [7], where each channel/touchpoint was not viewed as acting in isolation but had some effect from other channels/touchpoints influencing the customer's decision. 'Mutually Exciting Point Process' models were explored in [8], where the effects of a channel/touchpoint were modeled as random effects with some interaction between past events, thus accounting for a time effect. The use of 'Hidden Markov Models (HMM)' was explored in [9], where, along with HMM, the concept of the conversion/purchase funnel was used. Econometric models were explored in [10], where attribution was based on return on investment (ROI) calculations performed using time series analysis. 'Directed Markov' graph models for attribution were explored in [11]. Under these models, the present action was assumed to depend on the last k actions.

Apart from the literature review papers, the master's thesis by Neville titled "Channel attribution modelling using clickstream data from an online store" [12] has been a reference for the current thesis. Neville compared the 'Last Click Rule' with a logistic regression-based market attribution model; the confidence interval for attribution was computed using bootstrapping. For this thesis, the choice of modeling technique is primarily driven by personal interest and has been narrowed to 'Bayesian Networks.' Another reason that serves as validation of this choice is that Bayesian Networks broadly satisfy the three properties ('Data driven,' 'Fairness,' 'Interpretability') proposed by [6]. The paper "A Bayesian Network Model of the Consumer Complaint Process" [13] provided assurance of the viability of Bayesian Networks in an industry setting.

1.3 Objective

This thesis attempts to answer the following research questions:

1. Can Bayesian Networks be a good fit for market attribution modeling?

2. Can Bayesian Networks outperform the baseline model in terms of accuracy, considering that they emphasize 'Interpretability'?


2 Data

2.1 Overview

Twice a week, about 2,000 people (paid by Nepa) participate in an online survey. This survey is conducted multiple times (6-12 times); the reason for multiple rounds is that the average respondent answers one survey per week. Respondents are selected after an initial screening process that filters for people interested in the product that the survey concerns. The survey consists of questions about the different sources of advertisement (channels/touchpoints). Whether or not the respondent ended up purchasing the product, the primary aim of the survey is to capture the whole purchase journey, or path to purchase.

Because data capture is via a survey, there is an inherent lack of 'ground truth,' i.e., there is no guarantee that a customer who reports a purchase actually made one and remembers it correctly, and vice versa. However, it is a general belief that admitting to something is far more trustworthy than omitting it. Thus, from the data, understanding the interactions between sources of advertisements and the psyche of respondents is of greater significance than predicting a purchase.

Each respondent was given a unique 'respondentid,' and the touchpoints mentioned by them were captured in the form of binary flags. Every respondent ends their journey either with a purchase or a non-purchase. If a respondent makes a purchase, it is flagged as a successful conversion and reflected as '1' in the column 'conv.'

The sample data with touchpoints for each respondent is shown in table 2.1.

2.2 Data summary

There are 26 variables in the dataset used in the project; however, only 24 of them were used as model inputs, since 'respondentid' is just an identifier and 'conv' is a flag indicating conversion. Table 2.2 describes each variable.


Table 2.1: Sample data

respondentid     conv   ads_on_tv   brand_website   social_media   instore_research
0948279158856    0      1           0               1              0
0948279393368    1      1           0               0              1
0948279446624    1      0           1               1              0

Table 2.2: Variable description

Variable                                  Description

respondentid                              Unique ID for respondent
conv                                      Did the respondent make a purchase?
ads_on_radio_streaming                    Did the respondent hear an advertisement on the radio?
ads_on_tv                                 Did the respondent view an advertisement on TV?
banner_ads_online_not_social              Did the respondent view a banner advertisement on a website?
brand_website                             Did the respondent visit the website of the brand conducting the survey?
friend_family_recommendation              Was the respondent recommended the product by family or friends?
i_saw_an_offer_promotion                  Did the respondent encounter a promotional offer?
i_saw_something_new                       Did the respondent witness a new product?
instore_research                          Did the respondent inquire about a product while at the store (talking to staff)?
magazine_or_newspaper_ads                 Did the respondent see an advertisement in a magazine or newspaper?
online_retailer_research                  Did the respondent research a product by visiting an online retailer?
online_retailer_visit                     Did the respondent visit any online retailer sites (e.g., Amazon)?
online_video_ad                           Did the respondent see a video advertisement while surfing?
outdoor_ads                               Did the respondent view a banner advertisement outdoors?
previous_shopping_list                    Was the product part of the shopping list?
promo_coupon_leaflet_from_retailer        Did the respondent encounter a promotional offer from a retailer?
promo_coupon_leaflet_not_from_retailer    Did the respondent encounter a promotional offer not from a retailer?
recipe_site                               Did the respondent visit a recipe site?
researched_on_search_engine               Did the respondent research a product using a search engine?
saw_a_product_display                     Did the respondent see a demo or active display of the product?
saw_a_sign_poster                         Did the respondent witness a product advertisement on a poster?
search_engine_ads                         Did the respondent encounter an advertisement while using a search engine?
social_media                              Did the respondent follow a brand on social media or hear about a product through social media?
there_was_a_seasonal_event_or_occasion    Was there a seasonal event or occasion for the product?
male                                      Was the respondent male?

The dataset used in the project has 9,489 rows, with 2,088 unique respondents. The ratio of conversion to non-conversion is about 1:5.

2.3 Data Pre-Processing

Having seen the data and its features, it is time to focus on data pre-processing, where the different techniques that were applied to the dataset are discussed. The problem of imbalanced data and the various methods to tackle it are discussed in the subsequent sections.

2.3.1 Imbalanced data

Imbalanced data is defined as "data with an unequal number of examples in each of its classes" in [14]. The data provided by Nepa has one conversion (conv == 1) for every five non-conversions (conv == 0); thus, it is clearly imbalanced. This imbalance/rarity of conversions is expected, given that public consumption of resources cannot increase with an increase in marketing expenditure. Theoretically speaking, an imbalanced dataset is not a problem; however, as mentioned in [15], "The key point is that relative rarity/class imbalance is a problem only because learning algorithms cannot efficiently handle such data". A dataset having few instances of one class often leads to a learning algorithm being unable to generalize the behavior of the minority class, leading to the algorithm performing poorly in terms of predictive accuracy [16]. When the data is unbalanced, standard machine learning algorithms that maximize overall accuracy tend to classify all observations as majority class instances [17].

When a logistic model and a Bayesian network were trained on the given dataset, both models were almost naive models (classifying everything as the majority class), thus confirming the effect of imbalanced data on model performance. Some of the techniques to tackle the problems arising from imbalanced data are: undersampling, oversampling, boosting, and bagging.

Figure 2.1: Visual depiction of random undersampling, adapted from [18]

Undersampling is defined in [17] as downsizing the majority class by removing observations at random until the dataset is balanced. The oversampling technique involves sampling the positive (minority) examples with replacement to match the number of negative (majority) examples [19]. The technique of making inferences from resampling the dataset is called bootstrapping [20]. Training models on these sub-samples and averaging the predictions is termed bagging [21].

For the given dataset, the undersampling technique was used to tackle the imbalanced data problem. There are many ways to perform undersampling; one such technique is random undersampling, in which the majority class is downsized by removing observations at random until the dataset is balanced [17] (see figure 2.1). This method was chosen as it is the recommended starting point for practical applications according to [22].
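The thesis performs this step in R; as an illustrative sketch only (function and field names are hypothetical), random undersampling of the majority class can be written as:

```python
import random

def random_undersample(rows, label_key="conv", ratio=1.0, seed=1):
    """Randomly drop majority-class rows (label 0) until there are
    `ratio` majority rows per minority row (label 1)."""
    rng = random.Random(seed)
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    n_keep = min(int(ratio * len(minority)), len(majority))
    return minority + rng.sample(majority, n_keep)

# Toy data with the thesis's 1:5 class ratio: 2 conversions, 10 non-conversions.
toy = [{"conv": 1}] * 2 + [{"conv": 0}] * 10
balanced = random_undersample(toy, ratio=1.0)  # now 2 conversions, 2 non-conversions
```

With `ratio=1.4` (the value estimated in section 2.3.2), 1.4 non-conversions are kept per conversion instead of full 1:1 balance.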

Having decided to use undersampling, the problem of finding the optimal undersampling ratio (the ratio of majority class to minority class instances in the data) remains. In [23], it was argued that resampling to full balance (a 1:1 ratio) is not necessarily optimal for the predictive performance of the model, and that the optimal ratio differs across datasets. A logistic regression model was used to find the optimal balancing/undersampling ratio for the given dataset, and the predictive performance of the model was measured for different undersampling ratios using cross-validation.

2.3.2 Estimation of optimal undersampling ratio

In statistical modelling and machine learning, it is a common practice to split a dataset of n data points into three parts: n_train, n_validation, and n_test. The model is trained on n_train and tuned on n_validation. Finally, the model is trained on n_train + n_validation and tested on n_test. Here the data was split in the ratio of 60-20-20. The logistic model was trained on the 'training' dataset, and its performance was measured on the 'validation' dataset.

Figure 2.2: Plots of performance metrics vs. Undersampling ratio, optimal undersampling ratio 1.4

The optimal undersampling ratio was determined using metrics like 'accuracy,' 'balanced accuracy' (the mean of positive- and negative-class accuracy), 'F1 score,' and 'positive class accuracy' (accuracy in identifying successful conversions) on the 'validation' data (see figure 2.2). Datasets with varying undersampling ratios were used to train a logistic model, and the performance metrics were plotted. The optimal point (red line in figure 2.2) of the undersampling ratio corresponds to 1.4 (for every conversion there must be 1.4 non-conversions). As shown in [25], 'accuracy' and 'F1 score' suffer from attenuation due to imbalanced distributions. In [26], a new metric called 'Index of Balanced Accuracy' was used, which is based on a modified version of balanced accuracy (using the geometric mean instead of the arithmetic mean). 'Balanced accuracy' was chosen as the metric to measure the performance of the logistic model, since it gives equal importance to classifying both classes ('conversion' and 'non-conversion') and is readily available in popular R packages such as 'caret' [27]. Thus, the undersampling ratio at which the model obtained the maximum balanced accuracy was chosen.
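The ratio sweep above reduces to computing balanced accuracy per candidate ratio and taking the argmax. A minimal Python sketch (the thesis used R and 'caret'; `evaluate` is a hypothetical stand-in for "undersample, fit the logistic model, score on validation"):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of positive-class and negative-class accuracy."""
    pos = [t == p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [t == p for t, p in zip(y_true, y_pred) if t == 0]
    return 0.5 * (sum(pos) / len(pos) + sum(neg) / len(neg))

def best_undersampling_ratio(ratios, evaluate):
    """Pick the ratio maximising validation balanced accuracy.
    `evaluate(ratio)` is assumed to undersample at `ratio`, fit the
    model, and return the validation balanced accuracy."""
    return max(ratios, key=evaluate)

# Stand-in evaluate() peaking at 1.4, mimicking the curve in figure 2.2:
ratios = [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
best = best_undersampling_ratio(ratios, lambda r: 1 - abs(r - 1.4))
```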


Figure 2.3: Plot to check overfitting by comparing train and valid data-set balanced accuracy

In the endeavor to train a model with good predictive performance, it is a good idea to check for 'overfitting.' A model that is more flexible than it needs to be, such that it yields a small training error but a significant test error, is said to be 'overfitting' the data [28]. One can check for this issue by looking at the predictive performance on the 'training' and 'validation' datasets (see figure 2.3). From the figure, one can see that the balanced accuracy on 'training' and 'validation' follows a similar trend, and there is no significant deviation between them; thus there is no 'overfitting' of the data.

2.3.3 Distribution of Data

The data provided by Nepa consists of 9,489 rows and 26 columns, with 1,745 conversions (conv == 1); after performing the random undersampling, the data consisted of 4,188 rows. The proportion of non-conversions (conv == 0) against each variable, as shown in table 2.3, is used to check that the distribution of the data is not significantly changed post undersampling.

For each variable, the ratio (instances of variable == 1) / (instances of conv == 0) is compared before and after undersampling, since the underlying proportions, and thus the distribution of the data, must not change drastically. E.g., for the variable ads_on_tv, the proportions before and after undersampling can be read from table 2.3.


Table 2.3: Proportion of conv==0 by column value

Variable                                  Pre Undersampling     Post Undersampling
                                          Var=0    Var=1        Var=0    Var=1

ads_on_radio_streaming                    0.98     0.02         0.98     0.02
ads_on_tv                                 0.80     0.20         0.85     0.15
banner_ads_online_not_social              0.96     0.04         0.92     0.08
brand_website                             0.97     0.03         0.99     0.01
friend_family_recommendation              0.91     0.09         0.95     0.05
i_saw_an_offer_promotion                  0.63     0.37         0.83     0.17
i_saw_something_new                       0.92     0.08         0.95     0.05
instore_research                          0.98     0.02         0.99     0.01
magazine_or_newspaper_ads                 0.95     0.05         0.96     0.04
online_retailer_research                  0.97     0.03         0.98     0.02
online_retailer_visit                     0.94     0.06         0.95     0.05
online_video_ad                           0.94     0.06         0.96     0.04
outdoor_ads                               0.95     0.05         0.95     0.05
previous_shopping_list                    0.98     0.02         0.98     0.02
promo_coupon_leaflet_from_retailer        0.89     0.11         0.91     0.09
promo_coupon_leaflet_not_from_retailer    0.95     0.05         0.96     0.04
recipe_site                               0.99     0.01         0.99     0.01
researched_on_search_engine               0.97     0.03         0.98     0.02
saw_a_product_display                     0.72     0.28         0.81     0.19
saw_a_sign_poster                         0.92     0.08         0.95     0.05
search_engine_ads                         0.96     0.04         0.98     0.02
social_media                              0.96     0.04         0.96     0.04
there_was_a_seasonal_event_or_occasion    0.99     0.01         1.00     0.00
male                                      0.60     0.40         0.58     0.42
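The distribution check behind table 2.3 can be reproduced in a few lines; a Python sketch for illustration (column names as in table 2.1, not the thesis's R code):

```python
def nonconv_proportion(rows, var):
    """Among non-conversions (conv == 0), the fraction with var == 1,
    i.e., the 'Var=1' columns of table 2.3."""
    neg = [r for r in rows if r["conv"] == 0]
    return sum(r[var] for r in neg) / len(neg)

# The check: this proportion should be similar before and after undersampling.
pre = [{"conv": 0, "ads_on_tv": 1}, {"conv": 0, "ads_on_tv": 0},
       {"conv": 0, "ads_on_tv": 0}, {"conv": 1, "ads_on_tv": 1}]
p = nonconv_proportion(pre, "ads_on_tv")  # 1 of 3 non-conversions saw a TV ad
```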

(17)

3 Theoretical Background

This chapter presents the mathematical and statistical background of the various techniques used in this thesis, starting with Bayesian networks, their design and how they work, and ending with the choice of algorithms and techniques.

3.1 Bayesian Networks

Before diving into Bayesian Networks, one needs to understand some fundamental building blocks of these networks.

Directed acyclic graphs (DAG): A DAG is a finite, directed graph with no directed cycles. A directed graph G = (V, E) consists of two sets: a finite set V of elements called vertices and a finite set E of elements called edges [29]. Directed acyclic graphs have the following properties: all arcs are directed, the network must not contain any self-loop, and the network must not contain any cycle (a directed path in which the starting vertex of its first edge equals the ending vertex of its last edge).

Parent and child node: An arc is a directed link between two nodes (random variables), usually assumed to be distinct. Nodes linked by an arc have a direct parent-child relationship: the node at the tail of the arc is the parent node, and the node at the head of the arc is the child node [30].

Skeleton: An undirected graph can always be constructed from a directed or partially directed one by substituting all the directed arcs with undirected ones, and such a graph is called the skeleton or the underlying undirected graph of the original graph [30].

V-structures: Two nodes being parents of another node such that the two parents are not linked [30].
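The "no directed cycle" requirement in the DAG definition above can be checked mechanically with a depth-first search; a minimal Python sketch for illustration:

```python
def is_dag(vertices, edges):
    """Check whether the directed graph G = (V, E) is acyclic,
    using depth-first search: a back edge to an 'in progress'
    node signals a directed cycle."""
    adj = {v: [] for v in vertices}
    for a, b in edges:
        adj[a].append(b)
    state = {v: "unvisited" for v in vertices}

    def has_cycle(v):
        state[v] = "in_progress"
        for w in adj[v]:
            if state[w] == "in_progress":  # back edge: directed cycle
                return True
            if state[w] == "unvisited" and has_cycle(w):
                return True
        state[v] = "done"
        return False

    return not any(state[v] == "unvisited" and has_cycle(v) for v in vertices)

# A -> B -> C is a DAG; adding C -> A closes a cycle.
```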


Figure 3.1: DAG mapping probability, source from [31]

A DAG can encode the probability distribution of X; formally, the DAG is called an 'independence map' of the probability distribution of the random variables X, with graphical separation (⫫_G) implying probabilistic separation (⫫_P) [31]. Estimation of a DAG from data is difficult and computationally non-trivial due to the enormous size of the space of DAGs [32].

The graphical separation property is akin to the concept of the "Markov blanket". Let V be a set of random variables, P be their joint probability distribution, and X ∈ V; then a Markov blanket M of X is any set of variables such that X is conditionally independent of all the other variables given M [33]. Mathematically, this is expressed as:

X ⫫_P V ∖ (M ∪ {X}) | M    (3.1)

Any directed arc from a variable A to C indicates a direct stochastic dependency. Thus, the lack of any arc connecting two nodes implies that these nodes are either i) marginally independent or ii) conditionally independent given a subset of the rest [34].
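In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and its children's other parents, so it can be read directly off the arc list. A small Python sketch (illustrative only):

```python
def markov_blanket(node, edges):
    """Markov blanket of `node` in a DAG given as (parent, child) arcs:
    its parents, its children, and its children's other parents."""
    parents = {a for a, b in edges if b == node}
    children = {b for a, b in edges if a == node}
    spouses = {a for a, b in edges if b in children and a != node}
    return parents | children | spouses

# Classic v-structure A -> C <- B: the blanket of A is {C, B},
# because B is a co-parent of A through their common child C.
```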

Bayesian Networks (BNs) are defined as a class of graphical models composed of a set of random variables X = {X_i, i = 1, 2, ..., m} and a directed acyclic graph (DAG), denoted G = (V, E), in which each node v_i ∈ V corresponds to a random variable X_i [34].

Figure 3.2: DAG and CPDAG, adapted from [31]

A particular class of graph is the Completed Partially Directed Acyclic Graph (CPDAG). A CPDAG is a graph which has both directed and undirected edges [35]. The CPDAG of a (partially) directed acyclic graph is the partially directed graph built on the same set of nodes, keeping the same v-structures and the same skeleton, completed with the compelled arcs [30]. Two DAGs having the same CPDAG are equivalent in the sense that they result in Bayesian Networks describing the same probability distribution [30].

3.2 Learning a Bayesian Network

The task of fitting a Bayesian network is called learning, and this has two steps: i) learning the structure, i.e., the graph, which accounts for the conditional independences present in the data; ii) learning the parameters of the local and global distributions implied by the graph structure learnt in the previous step.

For a given dataset D, a global distribution of X, the DAG represented by G, and Θ as its parameters, learning is given as equation 3.2, from [36]:

P(G, Θ | D) = P(G | D) · P(Θ | G, D)    (3.2)

where the left-hand side is the whole learning problem, the factor P(G | D) is structure learning, and the factor P(Θ | G, D) is parameter learning.

From [36], one can see that the joint probability distribution of X with parameters Θ can be decomposed into one local distribution for each X_i, conditional on its parents Parents(X_i). Mathematically, this can be expressed as:

P(X | G, Θ) = ∏_{i=1}^{N} P(X_i | Parents(X_i); Θ_{X_i})    (3.3)
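The factorisation in equation (3.3) can be evaluated numerically as a product of local conditional probability tables (CPTs). A small Python sketch, with a hypothetical two-node network over thesis-like variable names (the structure and numbers are invented for illustration):

```python
def joint_prob(assignment, cpts, parents):
    """Evaluate P(X = x) via equation (3.3): the product over nodes of
    P(Xi | Parents(Xi)). `cpts[node]` maps (parent values, node value)
    to a probability; `parents[node]` is a tuple of parent names."""
    p = 1.0
    for node, value in assignment.items():
        pa_vals = tuple(assignment[pa] for pa in parents[node])
        p *= cpts[node][(pa_vals, value)]
    return p

# Hypothetical network ads_on_tv -> conv:
parents = {"ads_on_tv": (), "conv": ("ads_on_tv",)}
cpts = {
    "ads_on_tv": {((), 1): 0.2, ((), 0): 0.8},
    "conv": {((1,), 1): 0.5, ((1,), 0): 0.5,
             ((0,), 1): 0.1, ((0,), 0): 0.9},
}
# P(ads_on_tv=1, conv=1) = 0.2 * 0.5 = 0.1
```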

3.2.1 Structure Learning

From [36], one can see that structure learning consists of finding the DAG G that encodes the dependence exhibited by the data, thereby maximizing P(G | D). Thus, the structure learning part of equation 3.2 leads to the following:

P(G, D) = P(G) · P(D | G) = P(G) ∫ P(D | G, Θ) P(Θ | G) dΘ    (3.4)

P(D | G) ∝ ∫ P(D | G, Θ) P(Θ | G) dΘ = ∏_{i=1}^{N} [ ∫ P(X_i | Parents(X_i), Θ_{X_i}) P(Θ_{X_i} | Parents(X_i)) dΘ_{X_i} ]    (3.5)

In structure learning, one can either: i) rely on experts with domain knowledge, in the form of a white-list (arcs that must be included in the network) or a black-list (arcs that must never be present in the network), or ii) use the available data and perform a conditional independence test for each arc. In [31], it was mentioned that in a Bayesian Network (BN) there exists a hierarchy of variables: the variables that are thought of as "causes" are placed above the variables that are deemed "effects," and the "confounding" variables are present at the top of the network.
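White-lists and black-lists simply force or forbid arcs in the space a structure-search algorithm may explore. A toy Python sketch (the node names and lists below are hypothetical, not the thesis's actual constraints):

```python
def candidate_arcs(nodes, whitelist, blacklist):
    """Directed arcs a structure search may consider: start from all
    ordered pairs, drop black-listed arcs, force white-listed arcs in."""
    arcs = {(a, b) for a in nodes for b in nodes if a != b}
    return sorted((arcs - set(blacklist)) | set(whitelist))

# Hypothetical domain knowledge: 'conv' must not cause any touchpoint
# (black-list), and ads_on_tv -> conv is forced in (white-list).
nodes = ["ads_on_tv", "brand_website", "conv"]
black = [("conv", "ads_on_tv"), ("conv", "brand_website")]
white = [("ads_on_tv", "conv")]
allowed = candidate_arcs(nodes, white, black)
```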


3.2.2 Parameter Learning

From [37], one can see that if one is to assume parameters in local distribution are indepen-dent, then the parameter part of equation 3.2 becomes the following:

P(Θ | G, D) = ∏_{i=1}^{N} P(Θ_Xi | Parents(Xi), D)  (3.6)

After structural learning, the problem of estimating the parameters of the global distribution is broken into estimating the parameters of the local distributions. Of the many possible estimation methods, Bayesian posterior estimators were used, since [38] hinted that the posterior can be represented in a compact factorized form and the computation can be sped up by using Bayes' theorem.

3.3 Bayesian Network Algorithms

Having seen the two types of learning that occur in Bayesian Networks, it is now time to look at the various algorithms that can be used to build/learn a Bayesian Network. There are broadly three types of algorithms: constraint-based algorithms, score-based algorithms and hybrid algorithms.

3.3.1 Constraint-based algorithms

Constraint-based algorithms are largely based on the work of Pearl and Verma [39][40]. The core principle is that the model learns a DAG using conditional independence tests. The commonly used tests for conditional independence are the 'mutual information test' and the 'exact student's t-test' for discrete and continuous BNs (Bayesian Networks) respectively [34]. Some examples of constraint-based algorithms are: PC, Grow-Shrink (GS), Max-Min Parents & Children (MMPC), etc. From [34], a template for constraint-based structure learning algorithms is given below:


Algorithm 1: A template for constraint-based structure learning algorithms

Input: a dataset containing the variables Xi, i = 1, 2, ..., m
Output: a completed partially directed acyclic graph

(1) Phase 1: learning Markov blankets (optional)
    (a) For each variable Xi, learn its Markov blanket B(Xi).
    (b) Check whether the Markov blankets B(Xi) are symmetric, e.g. Xi ∈ B(Xj) ⇔ Xj ∈ B(Xi). Assume that nodes for which symmetry does not hold are false positives and drop them from each other's Markov blankets.

(2) Phase 2: learning neighbours
    (a) For each variable Xi, learn the set N(Xi) of its neighbours (i.e., the parents and the children of Xi). Equivalently, for each pair Xi, Xj, i ≠ j, search for a set S_{Xi,Xj} ⊂ V (including S_{Xi,Xj} = ∅) such that Xi and Xj are independent given S_{Xi,Xj} and Xi, Xj ∉ S_{Xi,Xj}. If there is no such set, place an undirected arc between Xi and Xj (Xi − Xj). If B(Xi) and B(Xj) are available from Phase 1 (a) and (b), the search for S_{Xi,Xj} can be limited to the smallest of B(Xi) \ Xj and B(Xj) \ Xi.
    (b) Check whether the N(Xi) are symmetric, and correct asymmetries as in Phase 1 (b).

(3) Phase 3: learning arc directions
    (a) For each pair of non-adjacent variables Xi and Xj with a common neighbour Xk, check whether Xk ∈ S_{Xi,Xj}. If not, set the direction of the arcs Xi − Xk and Xk − Xj to Xi → Xk and Xk ← Xj to obtain a v-structure Vl = Xi → Xk ← Xj.
    (b) Set the direction of arcs that are still undirected by applying the following two rules recursively:
        (i) If Xi is adjacent to Xj and there is a strictly directed path from Xi to Xj (a path leading from Xi to Xj containing no undirected arcs), then set the direction of Xi − Xj to Xi → Xj.
        (ii) If Xi and Xj are not adjacent but Xi → Xk and Xk − Xj, then change the latter to Xk → Xj.

3.3.2 Score-based algorithms

Score-based algorithms are heuristic optimization techniques: these models try to optimize some score (e.g. BIC, mutual information, log-likelihood ratio) for a given DAG. Score-based algorithms tend to produce models with higher likelihood than constraint-based and hybrid algorithms, and produce the largest networks, allowing good propagation of evidence [37].

Most scores have tuning parameters, whereas conditional independence tests (mostly) do not. Thus, as mentioned in [41], an effective method to select the optimal learning parameters for score-based algorithms is to perform a grid search.

Some examples of score-based algorithms are: Hill Climbing and Tabu Search. The Hill Climbing algorithm is as follows [30]:


Algorithm 2: Hill Climbing Algorithm

(1) Choose a network structure G over V, usually (but not necessarily) empty.
(2) Compute the score of G, denoted as Score(G).
(3) Set maxscore = Score(G).
(4) Repeat the following steps as long as maxscore increases:
    (a) for every possible arc addition, deletion or reversal not resulting in a cyclic network:
        (i) compute the score of the modified network G*, Score(G*);
        (ii) if Score(G*) > Score(G), set G = G* and Score(G) = Score(G*).
    (b) update maxscore with the new value of Score(G).
(5) Return the DAG G.

3.3.3 Hybrid algorithms

Hybrid algorithms combine constraint-based and score-based algorithms to offset their respective weaknesses and produce stable network structures [30]. These algorithms reduce the solution space of DAGs using conditional independence tests and use the score to zero in on an optimal DAG [30].

As mentioned in [31], constraint-based algorithms are usually faster, score-based algorithms are more stable, and hybrid algorithms are at least as good as score-based algorithms and often a bit faster.

Some examples of hybrid algorithms are: Max-Min Hill Climbing (MMHC) and General 2-Phase Restricted Maximization. From [42], the Max-Min Hill Climbing (MMHC) algorithm is as follows:


Algorithm 3: MMPC and MaxMinHeuristic

procedure MMPC(T, D)
    (1) Input: target variable T; data D
    (2) Output: the parents and children of T in any Bayesian network faithfully representing the data distribution
    (3) Phase I: Forward
        (a) CPC = ∅
        (b) repeat
            (i) <F, assocF> = MaxMinHeuristic(T, CPC)
            (ii) if assocF ≠ 0 then CPC = CPC ∪ {F}
        (c) until CPC has not changed
    (4) Phase II: Backward
        (a) for all X ∈ CPC do
            (b) if ∃ S ⊆ CPC s.t. Ind(X; T | S) then CPC = CPC \ {X}
        (c) end for
    (5) return CPC
end procedure

procedure MaxMinHeuristic(T, CPC)
    (1) Input: target variable T; subset of variables CPC
    (2) Output: the maximum over all variables of the minimum association with T relative to CPC, and the variable that achieves the maximum
    (3) assocF = max_{X∈V} MinAssoc(X; T | CPC)
    (4) F = argmax_{X∈V} MinAssoc(X; T | CPC)
    (5) return <F, assocF>
end procedure


Algorithm 4: MMPC and MMHC

procedure MMPC(T, D)
    (1) CPC = MMPC(T, D)
    (2) for every variable X ∈ CPC do
        (a) if T ∉ MMPC(X, D) then CPC = CPC \ {X}
        (b) end for
    (3) return CPC
end procedure

procedure MMHC(D)
    (1) Input: data D
    (2) Output: a DAG on the variables in D
    (3) Restrict:
        (a) for every variable X ∈ V do PC_X = MMPC(X, D)
    (4) Search:
    (5) Starting from an empty graph, perform Greedy Hill-Climbing with operators add-edge, delete-edge, reverse-edge. Only try operator add-edge Y → X if Y ∈ PC_X.
    (6) Return the highest-scoring DAG found.
end procedure

3.4 Conditional independence tests

The log-likelihood ratio is one of the conditional independence tests used during the structural learning of Bayesian Networks [30]. Conditional independence tests focus on the presence or absence of individual arcs: since each arc/edge encodes a probabilistic dependence, conditional independence tests can be used to assess whether that probabilistic dependence is supported by the data [30]. Thus, if the null hypothesis stating conditional independence is rejected (e.g. by a chi-square test [42]), then the arc can be considered for inclusion in the graph [30]. For a network of three nodes 'A', 'B' and 'C' in a DAG, the null hypothesis is that 'A' is probabilistically independent (⊥⊥_P) of 'B' given 'C', and the log-likelihood ratio G² is given by 3.9 from [30].

H0 : A ⊥⊥_P B | C  (3.7)


G²(A, B | C) = Σ_{a∈A} Σ_{b∈B} Σ_{c∈C} (n_abc / n) · log( (n_abc · n_++c) / (n_a+c · n_+bc) )  (3.9)

In 3.9 the categories of 'A' are denoted by a ∈ A, the categories of 'B' by b ∈ B and the categories of 'C' by c ∈ C. The number n_abc denotes the number of observations for the combination of a category a of 'A', a category b of 'B' and a category c of 'C'. A "+" subscript denotes summation over an index and is used to indicate the marginal counts for the remaining variables. So, for example, n_a+c is the number of observations for a and c obtained by summing over all the categories of 'B', from [30].

A good example of the usage of conditional independence tests is given in [42]. The p-values from the χ² test are used to accept or reject the null hypothesis (at significance level α). For the variables where the null hypothesis cannot be rejected, a test for measuring association is performed (e.g. G²), where lower p-values imply higher association.
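The statistic in 3.9 can be computed directly from a three-way contingency table. In the sketch below, the 2×2×2 counts are hypothetical and deliberately constructed so that A and B are exactly conditionally independent given C, so G² should come out as (numerically) zero:

```python
# Numerical sketch of the G^2 statistic in equation 3.9 for a 2x2x2 table.
# Within each level of C the table is an exact product of its margins,
# so A and B are conditionally independent given C and G^2 ~ 0.
import numpy as np

n = np.zeros((2, 2, 2))                               # n[a, b, c]
n[:, :, 0] = np.outer([100, 300], [200, 200]) / 400   # slice for C = 0
n[:, :, 1] = np.outer([250, 250], [100, 400]) / 500   # slice for C = 1

N = n.sum()
n_ppc = n.sum(axis=(0, 1))          # n_{++c}
n_apc = n.sum(axis=1)               # n_{a+c}, indexed [a, c]
n_pbc = n.sum(axis=0)               # n_{+bc}, indexed [b, c]

g2 = 0.0
for a in range(2):
    for b in range(2):
        for c in range(2):
            if n[a, b, c] > 0:
                g2 += (n[a, b, c] / N) * np.log(
                    n[a, b, c] * n_ppc[c] / (n_apc[a, c] * n_pbc[b, c]))
print(round(g2, 12))  # ~0 under exact conditional independence
```

Replacing the counts with a table in which the A–B association survives conditioning on C would make G² strictly positive.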

3.5 Inference on Bayesian Network

Suppose one has a Bayesian network B, with DAG G and parameters Θ. Evidence E is used to calculate the posterior distribution, i.e. P(X | E), and this remains the most powerful aspect of Bayesian Networks [43]. For example, if the random variables of the network are X, Y and Z, and Z takes on value z in evidence E1, then E1 = (?, ?, z) and P(X | E1) = P(X | Z = z), from [43].

From [30], the posterior probability is given as:

P(X | E, B) = P(X | E, G, Θ)  (3.10)

Queries on the Bayesian network are termed "conditional probability queries (CPQ)". Conditional probability queries are concerned with the distribution of a subset of variables Q = X_j1, ..., X_jl given some evidence E on another set X_i1, ..., X_ik of variables in X [30]. The two sets of variables can be assumed to be disjoint; the distribution of interest is given by equation 3.11, while the marginal posterior probability distribution is given by equation 3.12, from [30].

CPQ(Q | E, B) = P(Q | E, G, Θ) = P(X_j1, ..., X_jl | E, G, Θ)  (3.11)

P(Q | E, G, Θ) = ∫ P(X | E, G, Θ) d(X \ Q)  (3.12)

The process of propagating the effects of evidence is called belief propagation: the belief in X under the Bayesian Network B is updated using the evidence E and is given by 3.13 from [30]. Thus 3.12 becomes:

P(Q | E, G, Θ) = ∏_{i: Xi ∈ Q} ∫ P(Xi | E, Parents(Xi), Θ_Xi) dXi  (3.13)

There are two types of algorithms for belief propagation, ’Exact Inference’ and ’Approxi-mate Inference’.

3.5.1 Exact Inference

Exact inference algorithms compute P(Q | E, G, Θ) analytically, typically through local computations on the graph structure such as variable elimination or junction-tree algorithms [30].

3.5.2 Approximate Inference

Approximate inference algorithms use Monte Carlo simulations to sample from the global distribution of X and thus estimate P(Q | E, G, Θ). In particular, they generate a large number of samples from B and estimate the relevant conditional probabilities by weighting the samples that include both E and Q = q against those that include only E [30]. In computer science, these random samples are often called particles, and the algorithms that make use of them are known as particle filters or particle-based methods [30].

There exist two techniques for approximate inference, namely 'logic sampling' and 'likelihood weighting'. From [30], logic sampling combines rejection sampling and uniform weights, essentially counting the proportion of generated samples including E that also include Q = q. However, this method has a flaw: a high proportion of samples is rejected if the evidence E is a rare occurrence [30].

This flaw of rejecting a high number of samples is fixed by 'likelihood weighting', where the generated samples include the evidence E by design. However, this means that one is no longer sampling from the original Bayesian Network, but from a second Bayesian Network in which all the nodes X_i1, ..., X_ik in E are fixed. This network is called the mutilated network [30].
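Likelihood weighting can be demonstrated on the smallest possible network. In this sketch (all probabilities hypothetical), the network is X → Y, the evidence fixes Y = 1, and each particle samples X only and carries the weight P(Y = 1 | X):

```python
# Minimal likelihood-weighting sketch for the two-node network X -> Y.
# Evidence E fixes Y = 1; particles are drawn for X and weighted by
# P(Y = 1 | X), so no sample is ever rejected.
import numpy as np

rng = np.random.default_rng(42)
p_x1 = 0.3                        # P(X = 1)
p_y1_given_x = {0: 0.2, 1: 0.9}   # P(Y = 1 | X)

samples = rng.random(100_000) < p_x1                 # particles for X
weights = np.where(samples, p_y1_given_x[1], p_y1_given_x[0])
estimate = weights[samples].sum() / weights.sum()    # ~ P(X = 1 | Y = 1)

# Exact answer by Bayes' theorem, for comparison:
exact = p_x1 * 0.9 / (p_x1 * 0.9 + (1 - p_x1) * 0.2)
print(round(estimate, 3), round(exact, 3))
```

With logic sampling, every particle with Y = 0 (roughly 59% of them here) would be thrown away; likelihood weighting keeps all particles and down-weights the unlikely ones instead.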

3.6 Logistic Regression

Logistic Regression is a technique to model the probability of an observation belonging to a specific class; mathematically, "the expected value of Y, given the value(s) of X" [44] can be expressed as the following:

E(yi | xi) = S(β0 + β1 · xi)  (3.14)

where

S(t) = 1 / (1 + exp(−t))  (3.15)

The β in equation 3.14 is a vector of coefficients, which are the parameters of the model, generally estimated using maximum likelihood estimation (MLE). In the logistic model, the output variable yi is a Bernoulli random variable, and the probability of yi = 1 is given by:

P(yi = 1 | xi) = S(xi β)  (3.16)

3.6.1 Log-likelihood

The likelihood for a given observation (xi, yi) is given by equation 3.17:

L(β; yi, xi) = [S(xi β)]^yi [1 − S(xi β)]^(1−yi)  (3.17)

The log-likelihood of the logistic model with N observations, input feature matrix X and output vector y is given by 3.18:

l(β; y, X) = Σ_{i=1}^{N} [ −ln(1 + exp(xi β)) + yi xi β ]  (3.18)

Proof of equation 3.18:


l(β; y, X) = ln L(β; y, X)
  = ln ∏_{i=1}^{N} [S(xi β)]^yi [1 − S(xi β)]^(1−yi)
  = Σ_{i=1}^{N} [ yi ln S(xi β) + (1 − yi) ln(1 − S(xi β)) ]
  = Σ_{i=1}^{N} [ yi ln( 1 / (1 + exp(−xi β)) ) + (1 − yi) ln( exp(−xi β) / (1 + exp(−xi β)) ) ]
  = Σ_{i=1}^{N} [ ln( exp(−xi β) / (1 + exp(−xi β)) ) + yi ( ln( 1 / (1 + exp(−xi β)) ) − ln( exp(−xi β) / (1 + exp(−xi β)) ) ) ]
  = Σ_{i=1}^{N} [ ln( 1 / (1 + exp(xi β)) ) + yi ln( 1 / exp(−xi β) ) ]
  = Σ_{i=1}^{N} [ −ln(1 + exp(xi β)) + yi xi β ]
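The identity proved above is easy to check numerically. The sketch below evaluates both the Bernoulli form and the simplified form of 3.18 on toy data (design matrix and coefficients are arbitrary) and confirms they agree:

```python
# Numerical check of equation 3.18: the Bernoulli log-likelihood equals
# the simplified form sum_i [-ln(1 + exp(x_i b)) + y_i x_i b].
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))          # toy design matrix
beta = np.array([0.5, -1.0, 2.0])     # arbitrary coefficients
eta = X @ beta                        # linear predictor x_i beta
S = 1.0 / (1.0 + np.exp(-eta))        # logistic function, equation 3.15
y = (rng.random(50) < S).astype(float)

ll_bernoulli = np.sum(y * np.log(S) + (1 - y) * np.log(1 - S))
ll_simplified = np.sum(-np.log(1 + np.exp(eta)) + y * eta)
print(np.isclose(ll_bernoulli, ll_simplified))  # True
```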

3.6.2 The Hessian

The matrix of second derivatives of the log-likelihood with respect to the parameters β is called the Hessian of the log-likelihood. It is given by 3.19:

∇_ββ l(β; y, X) = − Σ_{i=1}^{N} xi^T xi S(xi β)[1 − S(xi β)]  (3.19)

Proof of equation 3.19:

∇_ββ l(β; y, X) = ∇_β ( ∇_β l(β; y, X) )
  = ∇_β Σ_{i=1}^{N} [yi − S(xi β)] xi
  = − Σ_{i=1}^{N} xi ∇_β S(xi β)
  = − Σ_{i=1}^{N} xi^T xi S(xi β)[1 − S(xi β)]

In [45], it was shown using the central limit theorem that the distribution of the parameter estimates can be approximated by a normal distribution with mean equal to the MLE β̂ and covariance given by the negative inverse of the Hessian. Mathematically, this is the following:

N( β̂, [ Σ_{i=1}^{N} xi^T xi S(xi β̂)[1 − S(xi β̂)] ]^(−1) )  (3.20)
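The gradient and Hessian above are exactly what Newton-Raphson needs to compute the MLE, and the inverted Hessian then gives the approximate standard errors of 3.20. A minimal sketch on simulated data (coefficients and sample size are hypothetical):

```python
# Newton-Raphson MLE for logistic regression using the gradient and the
# Hessian of equation 3.19, plus standard errors from equation 3.20.
import numpy as np

rng = np.random.default_rng(7)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + feature
true_beta = np.array([-0.5, 1.2])
p = 1 / (1 + np.exp(-X @ true_beta))
y = (rng.random(n) < p).astype(float)

beta = np.zeros(2)
for _ in range(25):                       # Newton-Raphson iterations
    s = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - s)                  # gradient of the log-likelihood
    hess = -(X.T * (s * (1 - s))) @ X     # Hessian, equation 3.19
    beta = beta - np.linalg.solve(hess, grad)

cov = np.linalg.inv(-hess)                # asymptotic covariance, eq. 3.20
se = np.sqrt(np.diag(cov))
print(np.round(beta, 2), np.round(se, 3))
```

The estimates should land close to the true coefficients, with the standard errors quantifying the remaining sampling uncertainty; the thesis uses this same normal approximation to obtain a distribution of attribution values for the logistic model.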

3.7 Metrics for Evaluation of Model Performance

Having seen the two modeling techniques, it is time to focus on the metrics that allow us to compare and evaluate the predictive performance of the models.

3.7.1 Bayesian Information Criterion (BIC)

One of the scores used by score-based algorithms is the Bayesian Information Criterion (BIC), which provides a goodness-of-fit metric penalized for model complexity [37]: for a network G with d parameters fitted to n observations, it takes the form log P(D | G, Θ̂) − (d/2) log n.


3.7.2 F1 score

Table 3.1: Confusion matrix for a binary class problem

                   Predicted positive     Predicted negative
  Positive class   True Positive (TP)     False Negative (FN)
  Negative class   False Positive (FP)    True Negative (TN)

The harmonic mean of precision and recall is termed the "F1 score".

precision = true positives / (true positives + false positives)  (3.22)

recall = true positives / (true positives + false negatives)  (3.23)

F1 = 2 · precision · recall / (precision + recall)  (3.24)

3.7.3 Accuracy

Accuracy is the fraction of cases for which the model is correct [46] and is given by the following:

Accuracy = (true positives + true negatives) / (true positives + true negatives + false positives + false negatives)  (3.25)

3.7.4 Positive Class Accuracy

Positive Class Accuracy, or positive predictive value, is defined as:

Positive Class Accuracy = true positives / (true positives + false positives)  (3.26)

3.7.5 Balanced Accuracy

Balanced Accuracy is the arithmetic mean of 'Recall (Sensitivity)' and 'Specificity'. Balanced Accuracy is a good metric to use in the case of an imbalanced dataset [47] and is given by equation 3.29.

Recall or Sensitivity = true positives / (true positives + false negatives)  (3.27)

Specificity = true negatives / (true negatives + false positives)  (3.28)

Balanced Accuracy = [ true positives / (true positives + false negatives) + true negatives / (true negatives + false positives) ] / 2  (3.29)
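All of the metrics in sections 3.7.2-3.7.5 follow from the four cells of Table 3.1. A short worked example on a hypothetical confusion matrix:

```python
# Metrics of sections 3.7.2-3.7.5 computed from a hypothetical confusion
# matrix (cell counts follow the layout of Table 3.1).
tp, fn, fp, tn = 40, 10, 20, 30

precision = tp / (tp + fp)                           # eq. 3.22
recall = tp / (tp + fn)                              # eq. 3.23 / 3.27
f1 = 2 * precision * recall / (precision + recall)   # eq. 3.24
accuracy = (tp + tn) / (tp + tn + fp + fn)           # eq. 3.25
ppv = tp / (tp + fp)                                 # eq. 3.26
specificity = tn / (tn + fp)                         # eq. 3.28
balanced_accuracy = (recall + specificity) / 2       # eq. 3.29

print(round(f1, 3), round(accuracy, 3), round(balanced_accuracy, 3))
# 0.727 0.7 0.7
```

Note how accuracy and balanced accuracy coincide here only by construction; on an imbalanced dataset such as the one in this project they diverge, which is why balanced accuracy is preferred.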

3.8 Techniques for Measuring Association

In modelling, it is often important to understand which variables/columns are associated with the outcome variable/column (variable 'conv'). Three techniques are explored and used in the project: i) Chi-Square Test, ii) Brute Force Search, and iii) Multiple Correspondence Analysis.


3.8.1 Chi-Square Test (χ² test)

The Chi-Square (χ²) test can be performed on a 2 × 2 contingency table and can serve as a simple measure of association between other variables and the outcome variable (variable 'conv'). One of the advantages of this test is that it is non-parametric and thus robust with respect to the distribution of the data [48].

Let X1, X2, ..., Xn be independent samples from some distribution, such that

P(Xij = 1) = 1 − P(Xij = 0) = pj, for all 1 ≤ j ≤ k  (3.30)

Each Xi consists of exactly k − 1 zeros and a single one, where the one is in the component of the success category at trial i. The implication of equation 3.30 is that Var(Xij) = pj · (1 − pj), from [49].

The Pearson chi-square statistic is given by:

χ² = Σ_{j=1}^{k} (Nj − n · pj)² / (n · pj)  (3.31)

χ² is Pearson's cumulative test statistic, which asymptotically follows a χ² distribution when there is no association between the two variables [49].
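In practice the test is run on an observed 2 × 2 table. The sketch below (assuming SciPy is available; the counts are hypothetical) tests whether exposure to a touch-point is associated with the outcome 'conv':

```python
# Pearson chi-square test of association on a hypothetical 2x2 contingency
# table: rows = exposed / not exposed, columns = conv = 1 / conv = 0.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[120, 80],
                  [90, 110]])

# correction=False disables Yates' continuity correction so the statistic
# matches the plain formula in equation 3.31.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)
print(round(stat, 3), dof)  # 9.023 1
```

With this table the p-value falls well below 0.05, so the null hypothesis of no association would be rejected and the variable would become a white-list candidate (as done in section 4.2.1).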

3.8.2 Brute Force Search

For a given dataset and all its features, there exists an optimal solution that models the outcome variable as a function of the other variables. This exercise can be perceived as a search problem with the objective of finding the model with the maximum accuracy/minimum BIC, with a guaranteed globally optimal solution: Brute Force Search. The biggest issue here is the time taken for the search, O(n^k) [50]. However, from [50], any possibility of reducing the solution space reduces the search time to a more manageable level.

3.8.3 Multiple Correspondence Analysis

Multiple correspondence analysis (MCA) aims at analyzing the structure of relationships existing in a set of categorical variables by explaining the associations through their projection into a space with a reduced number of dimensions, almost always two [51]. MCA is an extension of correspondence analysis (CA), which allows one to analyze the pattern of relationships of several categorical dependent variables. MCA is largely based on Principal Component Analysis, with the difference that the variables are categorical instead of quantitative [52]. CA and MCA are both special cases of weighted principal component analysis [53]. Furthermore, MCA can be obtained by applying correspondence analysis to an indicator matrix, although the interpretation of the correspondence analysis then needs to be adapted.

Multiple correspondence analysis also has the attractive property of optimality of scale values with some adjustment (thanks to achieving maximum intercorrelation and thus maximum reliability in terms of Cronbach's alpha) [53].


Figure 3.3: MCA notation, source [54]

Consider the following: if there are I observations and K nominal variables with J_k levels each, such that Σ J_k = J, then the indicator matrix is denoted by X. Performing Correspondence Analysis on the matrix X provides two sets of factor scores, one for the rows and one for the columns (see figure 3.4).

Figure 3.4: Point cloud of categories/columns, source [54]

The variance of a category k is given by equation 3.32.

Var(k) = d²(k, O) = Σ_{i=1}^{I} (1/I) x²_ik = Σ_{i=1}^{I} (1/I) (y_ik / p_k − 1)² = 1/p_k − 1  (3.32)

The implication of 3.32 is that the rarer a category k, the larger the distance between the point Mk representing it and the origin O; thus, in MCA, uncommon categories are located far from the origin [55].

The distance between a pair of categories k and k′ is given by equation 3.33, where p_kk′ denotes the proportion of individuals carrying both categories.

d²(k, k′) = Σ_{i=1}^{I} (1/I) (y_ik / p_k − y_ik′ / p_k′)² = (p_k + p_k′ − 2 p_kk′) / (p_k p_k′)  (3.33)


The implication of 3.33 is that the distance between two categories is driven by the individuals they do not share: the fewer individuals carry both categories, the larger the distance between them [55].

The inertia of the k-th category is given by equation 3.34.

Inertia(k) = (p_k / J) · d²(k, O) = (1 − p_k) / J  (3.34)

The implication of 3.34 is that MCA identifies rare categories but does not exaggerate the influence of sporadic ones [55].
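Equation 3.32 is straightforward to verify numerically on a single hypothetical indicator column:

```python
# Numerical check of equation 3.32: the variance of a category k equals
# 1/p_k - 1. The indicator column below is hypothetical.
import numpy as np

I = 200                           # number of observations
y_k = np.zeros(I)
y_k[:40] = 1                      # 40 of 200 individuals carry category k
p_k = y_k.mean()                  # p_k = 0.2

x_k = y_k / p_k - 1               # profile coordinate as in equation 3.32
var_k = np.sum(x_k ** 2) / I
print(var_k, 1 / p_k - 1)         # both equal 4.0
```

Halving p_k roughly doubles the variance, matching the statement that rare categories sit far from the origin, while the inertia of 3.34 stays bounded by 1/J.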


4 Methods

This chapter provides an overview of the steps followed in the project. The aim of this chapter is to facilitate replicability conforming to the expectations from a scientific report.

4.1 Baseline Logistic Regression Model

To compare the viability of the Bayesian Network as a market attribution model, a logistic regression model was created as a benchmark. Two constraints were placed on the model:

• No variable should be dropped from the model, as the end-users of the model are interested in attribution values for all touch-points.

• No interaction terms should be present, since these would result in multiple combinations that are often tricky to evaluate.

4.1.1 Optimal Decision Threshold of Logistic Regression

Cross-validation was used to find the optimal decision threshold/cutoff (the probability at which a sample is classified as 1 in binary 0/1 data): model metrics such as F1 score, accuracy and balanced accuracy were plotted against the cutoff. The point at which the balanced accuracy is maximised (as seen in figure 4.1) is deemed the optimal decision threshold/cutoff. Furthermore, overfitting was checked visually in figure 4.2.
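The threshold sweep can be sketched as follows. The scores and labels below are synthetic stand-ins for the cross-validated predictions; the point is only to show how the cutoff maximising balanced accuracy is located:

```python
# Sketch of the decision-threshold sweep: pick the cutoff that maximises
# balanced accuracy on held-out scores. Scores and labels are synthetic.
import numpy as np

rng = np.random.default_rng(3)
n = 4000
y = (rng.random(n) < 0.3).astype(int)                 # imbalanced labels
# class-conditional scores: positives tend to score higher than negatives
scores = np.where(y == 1, rng.beta(5, 2, n), rng.beta(2, 5, n))

def balanced_accuracy(y, scores, cutoff):
    pred = (scores >= cutoff).astype(int)
    tp = np.sum((pred == 1) & (y == 1)); fn = np.sum((pred == 0) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0)); fp = np.sum((pred == 1) & (y == 0))
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

cutoffs = np.linspace(0.05, 0.95, 91)
bas = [balanced_accuracy(y, scores, c) for c in cutoffs]
best = cutoffs[int(np.argmax(bas))]
print(round(best, 2), round(max(bas), 3))
```

On imbalanced data like this, the balanced-accuracy-optimal cutoff usually differs from the naive 0.5, which is exactly the effect figure 4.1 visualises.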


Figure 4.1: Plot showing the model metrics vs. the decision boundary cutoff

Figure 4.2: Testing for over-fitting at different decision boundary cutoff

4.2 Bayesian Network Model

The Bayesian Network models are controlled by the following inputs:

• White-list – the node pairs which should always be connected by an edge

• Black-list – the node pairs which should not be connected by an edge

• Number of restarts – the number of random restarts the algorithm performs while trying to converge to an optimum (ideally a global one)

• Algorithm – the algorithm to be used for building the Bayesian Network

4.2.1 White-list Creation Using Hypothesis Testing

To understand the given dataset and its interactions with 'conversion', the formulated hypotheses were tested using the Chi-square test (see section 3.8.1). These hypotheses were validated by the supervisor at Nepa (Viking Fristrom).

The following is the result of the hypothesis testing:

The following is the result of the hypotheses testing:

Table 4.1: Hypothesis table with p-values, tested at the 5% significance level

Variable | Hypothesis | P-value | Result
ads_on_radio_streaming | Creates brand awareness but does not immediately lead to purchase | 0.525237381 | Hypothesis rejected
ads_on_tv | Creates brand awareness but less than radio, does not immediately lead to purchase | 0.00049975 | Failed to reject
banner_ads_online_not_social | Most likely to get blocked, but creates brand awareness | 0.00049975 | Failed to reject
brand_website | A customer actively checking a brand's website may become a long-term customer | 0.00149925 | Failed to reject
friend_family_recommendation | Depending on how much we value them, a recommendation may lead to purchase | 0.00049975 | Failed to reject
i_saw_an_offer_promotion | Works for budget shoppers, may lead to a few purchases | 0.00049975 | Failed to reject
i_saw_something_new | Very similar to i_saw_heard_something_else_in_store; unlikely to lead to purchase since this largely depends on the type of people | 0.00049975 | Failed to reject
instore_research | People who research products are already invested in the decision to purchase | 0.00049975 | Failed to reject
magazine_or_newspaper_ads | Similar to saw_a_product_display, creates brand awareness but does not immediately lead to purchase | 0.051974013 | Hypothesis rejected
online_retailer_research | This may still be window shopping but a bit more likely to purchase | 0.270364818 | Hypothesis rejected
online_retailer_visit | People are mostly up to window shopping but sometimes purchase | 0.053473263 | Hypothesis rejected
online_video_ad | Would create brand awareness but might lead to being annoyed | 0.011494253 | Failed to reject
outdoor_ads | Creates brand awareness but does not immediately lead to purchase | 0.781109445 | Hypothesis rejected
previous_shopping_list | Things present on a previous shopping list are part of the routine | 0.24037981 | Hypothesis rejected
promo_coupon_leaflet_from_retailer | Works for budget shoppers, may lead to a few purchases | 0.050474763 | Hypothesis rejected
promo_coupon_leaflet_not_from_retailer | Works for budget shoppers, may lead to a few purchases | 0.009995002 | Failed to reject
recipe_site | If a recipe calls for an item, one is highly likely to purchase it | 0.888055972 | Hypothesis rejected
researched_on_search_engine | Similar to instore_research, highly likely to make the purchase | 0.020989505 | Failed to reject
saw_a_product_display | Creates brand awareness but does not immediately lead to purchase | 0.00049975 | Failed to reject
saw_a_sign_poster | Creates brand awareness but does not immediately lead to purchase | 0.00049975 | Failed to reject
search_engine_ads | Same as a billboard, but people use ad blockers, so this may only lead to brand awareness | 0.020489755 | Failed to reject
social_media | Social media is tricky: it can turn a simple ad into a loyal customer, thus a weak expectation | 0.864067966 | Hypothesis rejected
there_was_a_seasonal_event_or_occasion | During a seasonal event, one is highly likely to purchase the corresponding item (gul during Christmas) | 0.033983008 | Failed to reject
male | Women tend to respond better to in-store touch-points such as coupons | 0.284857571 | Hypothesis rejected

The variables that showed association with 'conversion' under the chi-square test at the 0.05 significance level were considered as candidate variables for the white-list. However, this test has its flaws in that it assumes no confounding variables (a confounder influences both the dependent variable and an independent variable [56]). Still, these variables may prove to be a good starting point for inclusion in the white-list.

4.2.2 White-list Creation Using Multiple Correspondence Analysis

Using Multiple Correspondence Analysis (see 3.8.3) on the data, the variables that contributed most to the eigenvectors associated with conversion (conv == 1) were obtained. These variables were evaluated as a white-list.


Figure 4.3: Plot showing the Variance Explained vs. Eigenvectors


Figure 4.5: Plot showing the contribution of variables towards Eigenvector 2

4.2.3 White-list Creation Using Grid Search

A brute-force technique to find the variables related to 'conversion' (to be used as a white-list) is to try out all combinations of the variables present in the data. However, this would be computationally too expensive, as mentioned in [50]: the combinations amount to Σ_{i=1}^{25} C(25, i) = 33,554,431. Leveraging some domain knowledge, my supervisor (Anna Lundmark) at Nepa reduced this space to about 55,455 combinations.

Figures 4.6 and 4.7 show Bayesian networks built with the different combinations of variables as white-lists, learned using hill climbing (considered more stable than constraint-based algorithms [31]), together with their performance in terms of BIC score and balanced accuracy. The combination of variables with the best performance (highest balanced accuracy) was selected.


Figure 4.6: BIC vs. Iteration using Grid Search

Figure 4.7: Balanced Accuracy vs. Iteration using Grid Search

4.2.4 Black-list Creation Using Domain Knowledge

Unlike the white-list, there is no data-driven method to create the black-list, since it is largely based on domain knowledge. Thus, with the help of my supervisor (Anna Lundmark), 132 black-list rules were created.


4.2.5 Optimal Restart Using Cross-Validation

As mentioned in [57], random restarts are a better method than grid search for hyperparameter tuning. However, the optimal number of restarts can itself be determined using a grid search and cross-validation. Figure 4.8 shows the effect of the number of restarts on balanced accuracy.

Figure 4.8: Balanced Accuracy vs. Random Restarts

4.3 Bayesian Network Model Comparison

The techniques mentioned in sections 4.2.1, 4.2.2 and 4.2.3 provide different white-lists as starting points for the Bayesian Network models. Section 4.2.4 provided the black-list edges, allowing domain knowledge to be infused into the models, and section 4.2.5 provided an optimal range for the number of restarts. What remains is to evaluate the different algorithms that can be used to learn from the data. These models, learned using different algorithms, are benchmarked against the baseline model (see 4.1) and evaluated on metrics such as BIC, accuracy and balanced accuracy.

4.4 Attribution Formula

Two attribution formulas are used in this project; using these formulas, the efficacy of the touchpoints/variables is measured. Equation 4.1 is used for the Bayesian Network and is taken from [6] and [30]. Equation 4.1 adapted for the logistic model yields equation 4.2.

Attribution(w) = E[conversion = 1 | w = 1] − E[conversion = 1 | w = 0]  (4.1)

Attribution(w) = 1 / (1 + exp(−(β0 + β1 · 1))) − 1 / (1 + exp(−(β0 + β1 · 0)))  (4.2)
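The two formulas agree when the data actually follow the logistic model. The sketch below uses hypothetical coefficients for a single touch-point w: equation 4.2 is evaluated in closed form, and equation 4.1 is estimated as a difference of empirical conversion rates on data simulated from the same model:

```python
# Sketch of the two attribution formulas (4.1 and 4.2) on a toy logistic
# model with hypothetical coefficients for a single touch-point w.
import numpy as np

beta0, beta1 = -1.0, 0.8

def S(t):                       # logistic function, equation 3.15
    return 1 / (1 + np.exp(-t))

# Equation 4.2: model-based difference for w = 1 versus w = 0
attribution_logistic = S(beta0 + beta1) - S(beta0)

# Equation 4.1: difference of empirical conversion rates, estimated from
# data simulated under the same model
rng = np.random.default_rng(11)
w = rng.integers(0, 2, 200_000)
conv = (rng.random(w.size) < S(beta0 + beta1 * w)).astype(int)
attribution_empirical = conv[w == 1].mean() - conv[w == 0].mean()

print(round(attribution_logistic, 3), round(attribution_empirical, 3))
```

In the thesis the two formulas are applied to different models (4.1 to the Bayesian Network, 4.2 to the logistic benchmark), so their attribution values need not coincide on real data, which is part of what chapter 5 compares.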


5 Results

In this chapter, the results for the models described in chapter 4 are presented. The models using different white-list methods and algorithms are compared and benchmarked against the baseline logistic model. All models are trained on the dataset formed by combining the 'training' and 'validation' datasets and tested on the 'test' dataset, as mentioned in section 2.3.2. The distribution of attribution for the touchpoints/variables is also shown.

5.1 Baseline Logistic Regression Model

Table 5.1: Performance metrics for the Logistic Model

Model          | Balanced Accuracy | F1 Score  | Accuracy | Positive Class Accuracy
Logistic Model | 0.5974451         | 0.7067669 | 0.627685 | 0.5357143

5.2 Bayesian Network Models

Shown below are the performance metrics of various models sorted in decreasing order of balanced accuracy.

Table 5.2: Performance of models using different algorithms and white-list techniques

Model | Balanced Accuracy | F1 score | Accuracy | Positive Class Accuracy | BIC
Hill climbing model, white-list using Chi-sq test | 0.6100052 | 0.7353207 | 0.650358 | 0.5822785 | -86189.53
MMHC model, white-list using Chi-sq test | 0.5995583 | 0.7264493 | 0.6396181 | 0.5625 | -86253.907
Tabu model, white-list using Chi-sq test | 0.5989452 | 0.7311828 | 0.6420048 | 0.5701754 | -86190.331
RSMAX2 model, white-list using Chi-sq test | 0.593064 | 0.7221719 | 0.6336516 | 0.5523013 | -86260.938
RSMAX2 model, white-list using Grid Search | 0.5920758 | 0.7210145 | 0.6324582 | 0.55 | -8530597.499
Hill climbing model, white-list using Grid Search | 0.5910877 | 0.7198549 | 0.6312649 | 0.5477178 | -8530522.464
MMHC model, white-list using Grid Search | 0.5805931 | 0.7131222 | 0.6217184 | 0.5313808 | -8530593.077
Tabu model, white-list using Grid Search | 0.570051 | 0.7086331 | 0.6133652 | 0.5172414 | -8530523.264
RSMAX2 model, white-list using MCA | 0.5116315 | 0.7356863 | 0.597852 | 0.4637681 | -20271.464
MMHC model, white-list using MCA | 0.505655 | 0.7315542 | 0.5918854 | 0.4285714 | -20287.556
Hill climbing model, white-list using MCA | 0.5051371 | 0.7319749 | 0.5918854 | 0.4264706 | -20220.175


Figure 5.1: Plot of the Bayesian Network model with the highest accuracy, built using the Hill climbing algorithm with a white-list from the Chi-square test

5.3 Distribution of Attribution

From table 5.2, the model with the highest accuracy (as well as balanced accuracy) is chosen as the best performing model and benchmarked against the baseline for attribution values. The distribution of attribution values for the variables is shown in figure 5.2. For the logistic regression model, the distribution of attribution values (figure 5.3) was calculated using the Hessian of the likelihood of the logistic regression (see section 3.6.2).

Figure 5.2: Distribution of attribution values for variables from the Bayesian Network Model

Figure 5.3: Distribution of attribution values for variables from Logistic Model

5.4 Comparison of Attribution

The mean values from the distributions of attribution values (see figures 5.2 and 5.3) are given below for both models.


Table 5.3: Mean attribution values of variable for Bayesian Network Model and Logistic Model

Variable                                  Bayesian Network Model  Logistic Regression Model  BN Model Rank  Log Model Rank
there_was_a_seasonal_event_or_occasion     0.31                    0.04                       1             12
instore_research                           0.27                   -0.07                       2             23
i_saw_an_offer_promotion                   0.22                    0.10                       3              7
saw_a_sign_poster                          0.11                    0.05                       4             10
researched_on_search_engine                0.10                    0.13                       5              5
friend_family_recommendation               0.10                    0.05                       6              8
saw_a_product_display                      0.09                    0.21                       7              3
search_engine_ads                          0.07                    0.16                       8              4
i_saw_something_new                        0.07                    0.11                       9              6
online_video_ad                            0.06                   -0.05                      10             22
promo_coupon_leaflet_not_from_retailer     0.05                    0.03                      11             13
ads_on_radio_streaming                     0.04                   -0.03                      12             19
ads_on_tv                                  0.03                    0.39                      13              1
online_retailer_research                   0.02                   -0.02                      14             18
brand_website                              0.01                    0.02                      15             16
outdoor_ads                                0.01                    0.05                      16             11
previous_shopping_list                     0.00                   -0.01                      17             17
magazine_or_newspaper_ads                  0.00                    0.05                      18              9
social_media                               0.00                   -0.18                      19             24
promo_coupon_leaflet_from_retailer         0.00                   -0.03                      20             21
online_retailer_visit                      0.00                    0.02                      21             15
recipe_site                                0.00                    0.26                      22              2
male                                       0.00                    0.02                      23             14
banner_ads_online_not_social              -0.17                   -0.03                      24             20
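The degree of disagreement between the two rankings in table 5.3 can be summarised with Spearman's rank correlation. A minimal sketch, computed directly from the two rank columns of the table (n = 24 variables, no ties):

```python
# BN-model and logistic-model ranks for the 24 variables, in table 5.3 order.
bn_rank  = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
            13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
log_rank = [12, 23, 7, 10, 5, 8, 3, 4, 6, 22, 13, 19,
            1, 18, 16, 11, 17, 9, 24, 21, 15, 2, 14, 20]

n = len(bn_rank)
d2 = sum((b - l) ** 2 for b, l in zip(bn_rank, log_rank))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))   # Spearman's formula for untied ranks
print(round(rho, 3))                     # → 0.269
```

A value of roughly 0.27 confirms that the two models agree only weakly on which variables matter most.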

6 Discussion

6.1 Results

From tables 5.1 and 5.2, one can see that the Bayesian network outperforms the logistic regression by about 3% in terms of accuracy. Of the many algorithms that can be used to build the Bayesian network, the 'Hill-climbing' algorithm with the 'Chi-square test' for creating a white-list was found to perform best. As mentioned in [37], the predictive performance of these algorithms is principally data specific. However, these results still contradict the results obtained in [37] (where a constraint-based algorithm was more accurate than a score-based algorithm). Furthermore, one finds that using the brute-force grid search technique to find white-list variables yields the models with the highest BIC values, yet these models do not necessarily predict better than the others, further reinforcing the value of domain knowledge.
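To make the score-based search concrete, the sketch below implements a minimal BIC-scored hill-climbing structure search over three synthetic binary variables. All names, data, and the true structure (A → B, A → C) are invented for illustration; this is only a toy version of the search (it considers edge additions only, while production implementations also try deletions and reversals, and it ignores the white-list mechanism entirely):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

# Three synthetic binary variables with true structure A -> B, A -> C.
n = 3000
A = rng.integers(0, 2, n)
B = (rng.random(n) < np.where(A == 1, 0.8, 0.2)).astype(int)
C = (rng.random(n) < np.where(A == 1, 0.7, 0.3)).astype(int)
data = np.column_stack([A, B, C])
names = ["A", "B", "C"]

def node_bic(child, parents):
    """BIC contribution of one binary node given its parent set."""
    ll, params = 0.0, 2 ** len(parents)   # one free probability per parent config
    for cfg in product([0, 1], repeat=len(parents)):
        mask = np.ones(n, dtype=bool)
        for p, v in zip(parents, cfg):
            mask &= data[:, p] == v
        m = int(mask.sum())
        if m == 0:
            continue
        k = int(data[mask, child].sum())
        for c in (k, m - k):
            if c > 0:
                ll += c * np.log(c / m)
    return ll - 0.5 * np.log(n) * params

def creates_cycle(parent_sets, child, parent):
    """Would adding the edge parent -> child close a directed cycle?"""
    stack, seen = [child], set()
    while stack:
        v = stack.pop()
        if v == parent:
            return True
        seen.add(v)
        stack += [w for w in range(len(parent_sets))
                  if v in parent_sets[w] and w not in seen]
    return False

# Greedy hill climbing over single-edge additions, starting from the empty graph.
parent_sets = [set() for _ in names]
improved = True
while improved:
    improved, best = False, None
    for p in range(len(names)):
        for c in range(len(names)):
            if p == c or p in parent_sets[c] or creates_cycle(parent_sets, c, p):
                continue
            gain = (node_bic(c, sorted(parent_sets[c] | {p}))
                    - node_bic(c, sorted(parent_sets[c])))
            if gain > 1e-9 and (best is None or gain > best[0]):
                best = (gain, p, c)
    if best is not None:
        parent_sets[best[2]].add(best[1])
        improved = True

edges = {(names[p], names[c]) for c in range(len(names)) for p in parent_sets[c]}
print(edges)
```

Because BIC is score-equivalent, the search can recover either member of the Markov equivalence class, but the learned skeleton matches the generating structure.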

Comparing figures 5.2 and 5.3, it is evident that the attribution values for the Bayesian network have much lower variance than those for the logistic model. The low variance in attribution values could be due to the choice of model or to the formula used to calculate attribution (equation 4.1 vs. 4.2).

From table 5.3, one can see that the relative ranking of the variables in terms of their attribution values varies significantly between the two models. Assuming that the Bayesian network model is closer to the true model (owing to its higher accuracy), let us discuss why the top five variables (ranked by attribution value) for the logistic model do not reflect reality.

• ads_on_tv: has the highest variance in figure 5.3, and thus its attribution estimate is unreliable. Furthermore, in figure 5.1, one can see that this is a confounding variable (it influences both the dependent variable and an independent variable [56]), which explains its low rank (13) under the Bayesian network.

• recipe_site: from the model summary in Appendix table A.1, we can see that this is not a significant variable given its p-value. Hence, the very inclusion of this variable in the model can be questioned. In the Bayesian network (figure 5.1), it is evident that this variable is a terminal node, which explains its low rank (22).
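The confounding point above can be illustrated with a small simulation (names and effect sizes hypothetical): when a variable Z drives both a touchpoint X and the purchase outcome Y, a logistic regression that omits Z assigns X a sizeable coefficient even though X has no direct effect, while adjusting for Z removes it:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical confounder Z (say, overall media exposure) drives both a
# touchpoint X (analogous to ads_on_tv) and the purchase outcome Y.
Z = rng.integers(0, 2, n).astype(float)
X = (rng.random(n) < 0.2 + 0.6 * Z).astype(float)   # Z makes X more likely
p_y = 1.0 / (1.0 + np.exp(-(-1.0 + 1.5 * Z)))       # X has NO direct effect on Y
Y = (rng.random(n) < p_y).astype(float)

def logit_coef(design, y, iters=30):
    """Newton-Raphson logistic regression; returns the coefficient vector."""
    beta = np.zeros(design.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        W = p * (1.0 - p)
        beta += np.linalg.solve((design * W[:, None]).T @ design,
                                design.T @ (y - p))
    return beta

ones = np.ones((n, 1))
b_marginal = logit_coef(np.hstack([ones, X[:, None]]), Y)              # omits Z
b_adjusted = logit_coef(np.hstack([ones, X[:, None], Z[:, None]]), Y)  # includes Z

# The marginal model credits X with a large effect that vanishes once the
# confounder is adjusted for.
print(round(b_marginal[1], 2), round(b_adjusted[1], 2))
```

This is exactly the mechanism by which a confounded touchpoint can receive an inflated attribution rank under the logistic model while the Bayesian network, which represents the confounding structure explicitly, ranks it low.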

References
