
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2016

Constructing decision trees for user behavior prediction in the online consumer market

DENNIS FOKIN

JOEL HAGROT


Constructing decision trees for user behavior prediction in the online consumer market

Beslutsträdskonstruktion för förutsägning av användarbeteende i internethandeln

DENNIS FOKIN AND JOEL HAGROT

Degree Project in Computer Science, DD143X

Supervisor: Kevin Smith

Examiner: Örjan Ekeberg

CSC, KTH 2016-05-11


Abstract

This thesis intends to investigate the usefulness of various aspects of product data for user behavior prediction in the online shopping market. Specifically, a data set from BestBuy was used, containing information regarding what product a user clicked on given their search query.

Decision trees are machine learning algorithms used for making predictions. The decision tree algorithm ID3 was used because of its simplicity and interpretability. It uses information gain to measure how different attributes help the tree split the set into smaller subsets. The approach was to use one decision tree for each product in the data set, and analyze the distribution of the attributes’ maximum information gains in the root splits across the various trees. For each of these splits, all possible pivot values (a pivot value being the value split on) were attempted, and the pivot values were recorded to analyze which ones resulted in the highest gain.

The results show that how well the query string matches the product title and the product description are the two most important aspects, followed by the product’s novelty. The number of days since the last two reviews were written before the query proved a decent way to identify trends.

The paper also presents how the attributes were used by analyzing the pivot value distributions, with the conclusion that many attributes were used in similar ways for most products, suggesting it might be possible to create a universal tree applicable to all products.

Regarding the usefulness of decision trees, it was found that they are not very efficient for highly volatile databases, such as those found in the online shopping market. The notion of a universal tree, however, suggests that future work might investigate whether their efficiency could be improved using this more flexible approach.


Sammanfattning (Swedish abstract)

This report investigates the usefulness of product data for predicting user behavior in online retail. A data set from BestBuy was used, containing information about the search a user performed and which product the user ultimately clicked on.

Decision trees are machine learning algorithms that can be used to make predictions. The decision tree algorithm ID3 was implemented because of its simplicity and interpretability. It uses information gain to measure how different attributes help the tree split the set. The approach was to construct one decision tree for each product in the data set, and to analyze the distribution of the attributes’ maximum information gains in the root split across the various trees. A split was attempted at every possible pivot value (a pivot value being the value the split occurred on), and these were stored for the analysis of which pivot values gave rise to the highest information gains.

The results show that how well the search string matches the product’s title and description are the two most important aspects, followed by the product’s novelty. The number of days since the two most recent reviews were written before the search proved a decent way to identify trends.

The report also shows how the attributes were used by analyzing the pivot value distributions, with the conclusion that most attributes were used in similar ways for most products, which suggests that a universal decision tree could be applied to all products.

Regarding the usefulness of decision trees, it was concluded that they are not particularly efficient for volatile databases, such as those found in online retail. However, the idea of universal trees suggests that future work could investigate how decision trees could be made more efficient.


Contents

1 Introduction
  1.1 Problem statement

2 Background
  2.1 Related work
  2.2 Decision trees

3 Methodology
  3.1 Webshops’ common attributes
  3.2 The data set
  3.3 BestBuy API
  3.4 Implementation
      3.4.1 Attributes
      3.4.2 The algorithm
  3.5 Hypothesis

4 Results
  4.1 Gains of the attributes
  4.2 Usage of the attributes

5 Discussion
  5.1 Importance of attributes
  5.2 Usage of attributes
  5.3 Usefulness of decision trees
  5.4 Conclusion
  5.5 Critical analysis

6 References


Chapter 1

Introduction

User behavior prediction is an important activity in many different areas, especially consumer behavior. Clothing companies could utilize it to appeal to more customers, for example by recommending clothing articles to their customers based on what they have previously bought. The popular online video streaming service Netflix uses it to recommend new movies based on how the user has previously rated other movies (Bennett & Lanning, 2007).

With user behavior prediction, future user behavior is predicted by analyzing current and previous behavior from users. Predicting how users behave is crucial in order for companies to accurately target their primary audience. This applies both to local ice cream stands and giant consumer electronics stores like BestBuy.

Equipped with this knowledge, online shopping web sites such as BestBuy would be able to predict what products their users are interested in buying, by analyzing what they search for.

Early ways of predicting the consumer market’s behavior involved time series analysis and other methods based on historical data. However, more recent research has found data mining to be an increasingly effective method. Zheng et al (2013) showed that neural networks predict customer restaurant preference consistently and accurately.

Another data mining approach is the decision tree. For example, decision trees have been used earlier for predicting pancreatic cancer (Yu et al, 2005).

This paper intends to investigate how the ID3 decision tree algorithm uses various product data. The ID3 algorithm is a machine learning algorithm used for making predictions on large data sets, given a binary question. It chooses among a set of attributes to split on and measures the information gain a split yields when splitting on a certain value, which in this report will be referred to as the pivot value. It is a greedy algorithm that, at any given step, splits on the attribute and value yielding the highest gain. The goal is to create leaf nodes that are as pure as possible - a pure node contains only elements with the same answer to the binary question.


1.1 Problem statement

The aim of this study is to investigate which product attributes will be favored by the ID3 decision tree construction algorithm when making predictions in the online shopping market.

The study will use the information gain measurement to draw conclusions as to which product attributes are more important in terms of user behavior prediction, and the pivot value distributions to analyze which values of the attributes best help to identify queries leading to a click on a product.

The intent is to investigate the following question:

What general information regarding products, available at BestBuy, will result in the highest information gain in a decision tree, using the ID3 algorithm?

This will also help investigate what available information regarding products is the most important when it comes to user behavior prediction at webshops in general.


Chapter 2

Background

This chapter provides an overview of previous work done on decision trees as well as an introduction to how they work.

2.1 Related work

Decision trees and other data mining techniques have been designed and utilized for predicting many different things. Yu et al (2005) used decision trees to predict pancreatic cancer by serum biomarkers. Their results showed that the decision trees successfully separated the pancreatic cancer patients from the control group, achieving high sensitivity and specificity, and showing that decision trees possess great potential not only in the field of medicine but for diagnosis and prediction in general. Gutierrez and Leroy (2008) implemented decision trees to predict crime reporting. They justified the use of decision trees by stating that they are easy to interpret, and explained that even though they might be a simple representation of knowledge, decision trees can still generate practical solutions to complicated problems. Bao and Intille (2004) determined that decision trees showed the best performance when predicting what everyday activity people were doing, by attaching accelerometers to the subjects and studying the data.

It is clear that decision trees are widely used. However, there is a lack of research regarding how relevant decision trees are, and which product details are of interest, when predicting user behavior in the online shopping market. This gap motivated the present study.

2.2 Decision trees

Decision trees are used for splitting a large set of data into smaller classes. Figure 2.1 shows how a decision tree could be used to predict why people buy ice cream.

Each level of the tree corresponds to a decision, and each node and leaf consists of a class of data that are similar with respect to some target variables. There are various types of variables accepted by decision trees: nominal (categorical and non-ordered), ordinal (categorical and ordered) and interval values (ordered values that can be averaged). "Categorical" means that they consist of discrete categories with no inherent value that could be computed upon. "Ordered" variables have an ordering. For instance, temperature, when described as the categories "cold", "warm" and "hot", would be ordinal.

Figure 2.1: Example of a decision tree.

A key ingredient in decision trees is having "pure sets": sets that do not need to be split further. In figure 2.1, the only pure set is when the weather is warm and the customers have extra money. In all the other sets we have some uncertainty.

In order to obtain a decision tree where every leaf is a pure set, the tree needs to be split further until only pure sets are left.

Often pure sets are difficult, and sometimes even impossible, to achieve. In these cases it is important to terminate before the subsets become too small, since small subsets are more likely to give inaccurate results because of idiosyncrasies (Neville, 1999). If we continue splitting, the tree will be too reliant on the training data and will thus produce inaccurate results when applied to the test data; this is known as overfitting.

Naturally, the question arises of which attributes the splits should occur on. According to Neville (1999), a common method is to exclude attributes that have little to no correlation with the target. Neville mentions that though this is a good start, it does not take into consideration dependencies between several inputs.

ID3, or Iterative Dichotomiser 3, is a decision-tree-constructing algorithm that uses entropy. Entropy is a measure of how certain one can be that an element of a set is of a certain type (in figure 2.1, the types would be "yes" or "no"), and it is calculated to determine the purity of a set or a set of sets. Mathematically, entropy is calculated by taking a set S that produces n messages m1, m2, ..., mn, where the probability of producing message mi is pi. The formula for entropy is thus:


H(P) = -\sum_{i=1}^{n} p_i \log_2(p_i) \qquad (2.1)

where P = p_1, p_2, \ldots, p_n.
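As a concrete illustration, equation 2.1 can be computed directly from a vector of probabilities. This is a sketch of ours, not code from the thesis; the function name is an assumption:

```cpp
#include <cmath>
#include <vector>

// Entropy H(P) = -sum_i p_i * log2(p_i), with the convention 0 * log2(0) = 0.
double entropy(const std::vector<double>& probs) {
    double h = 0.0;
    for (double p : probs) {
        if (p > 0.0)
            h -= p * std::log2(p);
    }
    return h;
}
```

A 50/50 split over two types yields exactly 1 bit of entropy, while a set containing only one type yields 0, matching the notion of a pure set above.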

The second aspect of ID3 is information gain. Gain is determined by partitioning a set T into subsets T1, T2, ..., Tn (Squire, 2004). The information gain from such a partition is defined as:

Gain(T) = H(T) - \sum_{i=1}^{n} \frac{|T_i|}{|T|} H(T_i) \qquad (2.2)

The main idea in ID3 is to calculate the gain for each attribute and select the attribute with the highest gain, in order to minimize the combined entropy of the resulting subsets (Squire, 2004). The greater the gain, the better that attribute is suited for splitting. The end goal of the algorithm is, as mentioned above, to end up with leaf nodes that are as pure as possible.
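This selection criterion can be sketched as follows (our sketch, not the thesis implementation): the gain of a candidate partition is the parent entropy minus the size-weighted entropies of the subsets, per equation 2.2. Sets are represented here simply as per-class counts:

```cpp
#include <cmath>
#include <vector>

// Entropy of a set given the count of elements in each class.
double entropyOfCounts(const std::vector<int>& counts) {
    int total = 0;
    for (int c : counts) total += c;
    if (total == 0) return 0.0;
    double h = 0.0;
    for (int c : counts) {
        if (c > 0) {
            double p = static_cast<double>(c) / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// Gain(T) = H(T) - sum_i (|T_i|/|T|) * H(T_i), for a partition of T into subsets.
double informationGain(const std::vector<int>& parent,
                       const std::vector<std::vector<int>>& subsets) {
    int total = 0;
    for (int c : parent) total += c;
    double gain = entropyOfCounts(parent);
    for (const auto& s : subsets) {
        int size = 0;
        for (int c : s) size += c;
        gain -= (static_cast<double>(size) / total) * entropyOfCounts(s);
    }
    return gain;
}
```

For a parent set of five positives and five negatives, the perfect partition {(5,0), (0,5)} yields a gain of 1 bit, while an uninformative partition yields 0.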


Chapter 3

Methodology

In this paper, a dataset from BestBuy was analyzed (BestBuy, 2012). The data from BestBuy consisted of mobile web users’ search queries and clicks, as well as the times at which they occurred, spanning three months. We refer to the context of each query - the data identifying one query from another, such as the query string and the time of the query - as its query context.

A decision tree resulting from the conclusions of this report would calculate the probability of a user clicking on a product given their search query. A probability for every product would be calculated by constructing one decision tree per product, and following the path in the decision tree corresponding to the values for a certain query context. The probability would then be found in the leaf node.

The experiments in this report are, however, limited to trying out binary splits on the various attributes when calculating the root node for each tree. This is because the set is largest at the root node, thus giving more reliable results for this specific problem definition.

All splits were binary, and the value split upon is referred to as the pivot value. All queries with a value less than or equal to the pivot value were put in the left subset, while the rest were put in the right subset.
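The binary split just described can be sketched as follows (a hypothetical helper of ours; the values stand in for one attribute of the query contexts):

```cpp
#include <utility>
#include <vector>

// Binary split: values less than or equal to the pivot go into the left
// subset, the rest into the right subset.
std::pair<std::vector<double>, std::vector<double>>
binarySplit(const std::vector<double>& values, double pivot) {
    std::vector<double> left, right;
    for (double v : values) {
        if (v <= pivot) left.push_back(v);
        else            right.push_back(v);
    }
    return {left, right};
}
```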

We defined a set of relevant attributes that the algorithm would choose from in order to split the data.

3.1 Webshops’ common attributes

In order to find common general attributes that all products at webshops share, different stores were studied. We defined a webshop as a store where consumers can order and buy products via the Internet, regardless of whether the store also has physical locations. This led us to popular websites such as Target, Walmart and BestBuy.

To broaden the investigation, Swedish webshops were also studied, including Siba, Elgiganten and Netonnet. This study concluded that every single product in these stores shares common attributes, including: the title of the product, the description of the product, reviews and ratings.


3.2 The data set

BestBuy has published several files online for the public to analyze and make computations on. In this case, the data set consisted of several files, of which only the training file, train.csv, was of interest to us.

The training file contained data on what item a user clicked on after making a search. It contained 42,366 rows, each containing the following:

• User: the user’s id.

• Sku: "stock-keeping unit", which is the item the user clicked on.

• Category: The category the sku belongs to, for example "video games".

• Query: The query the user searched for.

• Click_time: What time the sku was clicked on.

• Query_time: What time the query was searched on.

Each row in the training file thus constituted the context of a query.

All queries were made in the "Xbox 360 Games" category ("abcat0701002").

The experiments were made on the training file exclusively since this report is concerned with the construction of the trees and not their performance.

3.3 BestBuy API

BestBuy gave direct access to product data and review data, with JSON as a possible output format. The JSON format was used because of its simplicity and the availability of the external C++ library JsonCpp; no format offers any real advantage in terms of the amount of information, so the format and library may be chosen arbitrarily.

Querying the product database for a certain sku yielded the product’s name, release date, date of entry into the database, price, category, description, product sales ranking and much more.

Querying the review database for a certain sku yielded all reviews in an array, with their rating, the submitter’s name, their comment and the time of submission.

3.4 Implementation

3.4.1 Attributes

Every attribute needs to help the algorithm distinguish between different query contexts, or no information gain will result from a split on it. For example, attributes such as "price" and "sales ranking" have no historic data available (and no individual preferences between users), meaning they will have the same value for all query contexts, since the product’s price or sales ranking will be the same for all of them. Thus, their gain will always be zero, and they will never be used by the ID3 algorithm even if included.

To help distinguish between query contexts, the time of the query and the query string were used. Given these, the attributes that could be calculated from the available databases, and that were identified as relevant according to section 3.1, were:

• descMatch: The fraction of the words in the query that were featured (case insensitive) in the description of the product.

• titleMatch: The fraction of the words in the query that were featured (case insensitive) in the title of the product.

• popularityWeek: A metric with no historical data in the product data; this, along with several other metrics mentioned below, had to be calculated from the review data. This attribute was calculated by counting the number of reviews posted in the seven days leading up to the query, and was thus a measure of the current popularity of the product.

• popularityMonth: Just like popularityWeek, but for all the reviews posted in the month leading up to the query.

• popularity: The number of reviews posted up to the time of query; had to be calculated from the review data as well.

• ratingWeek: The average rating of all reviews posted in the seven days prior to the query. It had to be calculated from the review data.

• ratingMonth: Just like ratingWeek, but for all the reviews posted in the month leading up to the query.

• rating: The average rating at the time of query, which is what the user would have seen as the product’s rating in the list of search results. This attribute was also calculated from the review data.

• daysSinceRelease: The number of days since the product was entered into BestBuy’s database ("startDate" in BestBuy’s products API) or since the product’s release date ("releaseDate" in BestBuy’s products API). If both were available and valid, the earliest was chosen. If the chosen year was greater than or equal to 2020, or if both were unavailable or invalid, the product was omitted from the experiment. To avoid enormous numbers of distinct possible values to split upon in the algorithm, days were chosen as the unit rather than hours or seconds. If this value was negative, the product was not available at the time of the query, meaning the product was completely irrelevant to that query context; this was used to exclude such query contexts from the set used in the product’s tree.


• daysSinceLastTwoReviews: The number of days since the last two reviews were posted before the query. It had to be calculated from the review data. We chose to look for the last two reviews rather than one, since a single review is more likely to be a mere coincidence. Looking for more than two reviews, however, might make the attribute less useful, since reviews are seldom posted for most products, and thus many query contexts would have zero as the value for this attribute.

• hasClickedBefore: A binary attribute: 1 if the user had clicked on the product before, 0 if not.

A day, a week and a month were defined as 86,400 seconds, 604,800 seconds and 2,628,000 seconds (one twelfth of an average year), respectively.
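As a sketch of how a match fraction such as titleMatch or descMatch could be computed: the tokenization and case-handling details below are our assumptions, not taken from the thesis code.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Lower-case a string, for case-insensitive matching.
std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// Fraction of the words in the query that appear (case-insensitively)
// in the given text, e.g. a product title or description.
double matchFraction(const std::string& query, const std::string& text) {
    std::istringstream words(toLower(query));
    const std::string lowerText = toLower(text);
    std::string word;
    int total = 0, found = 0;
    while (words >> word) {
        ++total;
        if (lowerText.find(word) != std::string::npos) ++found;
    }
    return total == 0 ? 0.0 : static_cast<double>(found) / total;
}
```

For instance, under these assumptions the query "xbox game" against the title "Xbox 360 action game" would yield a titleMatch of 1.0.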

3.4.2 The algorithm

The implementation was programmed in C++, using JsonCpp and Boost external libraries.

First, a list of distinct skus was put together from the training data. The algorithm then iterated through the list, requesting and parsing product and review data once before creating DataObjects for the particular product, each DataObject containing the calculated values of the attributes for its query context. The DataObjects represent a query context and all the attribute values for it. They formed a large set - as mentioned earlier, at most 42,366 in size, depending on how many had positive daysSinceRelease values - which was sent into the algorithm to calculate the gains for the various attributes in the root, to see what the maximum possible gain was for splits on each attribute; no full trees were constructed, as that would be beyond the scope of this paper.
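A hypothetical sketch of what such a DataObject might look like; the field names follow the attribute list in section 3.4.1, but the layout and the target field are our assumptions:

```cpp
// Hypothetical sketch of a DataObject: one query context for one product,
// holding the attribute values from section 3.4.1 (field names are ours).
struct DataObject {
    double descMatch;             // fraction of query words found in the description
    double titleMatch;            // fraction of query words found in the title
    int popularityWeek;           // reviews posted in the 7 days before the query
    int popularityMonth;          // reviews posted in the 30.4 days before the query
    int popularity;               // reviews posted up to the time of the query
    double ratingWeek;            // average rating of the week's reviews
    double ratingMonth;           // average rating of the month's reviews
    double rating;                // average rating at the time of the query
    int daysSinceRelease;         // days since release or database entry
    int daysSinceLastTwoReviews;  // days since the last two reviews before the query
    bool hasClickedBefore;        // had this user clicked the product before?
    bool clickedProduct;          // target (our assumption): did the query end in a click?
};
```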

There was also a mechanism for ignoring queries made before a product was released or entered into BestBuy’s database (i.e. a negative daysSinceRelease value), since the product did not exist at that point, making them completely irrelevant.

Products without the necessary information to calculate the attribute values for each query context were omitted: invalid products (not found in the database); products lacking review data (the attributes relying on review data could not be calculated in this case); and products lacking startDate and releaseDate, or with typos in both of them (e.g. a releaseDate in the year 2049; in the particular case of a typo, the lowest year was used, and if it was still greater than or equal to 2020, the product was omitted), which would make daysSinceRelease impossible to calculate.

The input to the algorithm was a set of DataObjects, each representing a query context from the data set, with instance variables holding the values of the various attributes earlier identified as relevant and possibly helpful by the authors, as well as a two-dimensional vector, maxSplits, in which to store the results.

The maxSplits structure contained, for each attribute, a vector of pairs: one pair for each split on the attribute. The first element in the pair was the maximum gain for the attribute in the respective tree, and the second element the pivot value for the split yielding this gain. For each attribute, a split was attempted on the set at every distinct value for the attribute, and the resulting gain was calculated. If the gain was positive, a split would be plausible, and the pivot value was stored along with the gain in maxSplits. If the gain was 0, the gain was still stored, in order to make the comparisons fair - otherwise, attributes not split on as frequently might get higher averages, which would be misleading in terms of their usefulness; the pivot value was not stored in this case, however, since this would yield undefined pivot value results. All splits were binary.
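The root-split search just described can be sketched for a single attribute as follows. This is our reconstruction under the stated rules (binary split attempted at every distinct value, keeping the gain-maximizing pivot), not the authors’ code:

```cpp
#include <cmath>
#include <set>
#include <utility>
#include <vector>

// Entropy of a binary-labeled set given (positives, total elements).
static double binaryEntropy(int pos, int total) {
    if (total == 0) return 0.0;
    double h = 0.0;
    for (double p : {static_cast<double>(pos) / total,
                     static_cast<double>(total - pos) / total}) {
        if (p > 0.0) h -= p * std::log2(p);
    }
    return h;
}

// For one attribute, try a binary split at every distinct value and return
// (maximum gain, pivot value yielding it) -- mirroring one maxSplits entry.
// values[i] is the attribute value of query context i, clicked[i] its label.
std::pair<double, double> bestRootSplit(const std::vector<double>& values,
                                        const std::vector<bool>& clicked) {
    const int n = static_cast<int>(values.size());
    int pos = 0;
    for (bool c : clicked) pos += c;
    const double parentH = binaryEntropy(pos, n);
    double bestGain = 0.0, bestPivot = 0.0;
    for (double pivot : std::set<double>(values.begin(), values.end())) {
        int leftN = 0, leftPos = 0;
        for (int i = 0; i < n; ++i) {
            if (values[i] <= pivot) { ++leftN; leftPos += clicked[i]; }
        }
        const int rightN = n - leftN, rightPos = pos - leftPos;
        const double gain = parentH
            - (static_cast<double>(leftN) / n) * binaryEntropy(leftPos, leftN)
            - (static_cast<double>(rightN) / n) * binaryEntropy(rightPos, rightN);
        if (gain > bestGain) { bestGain = gain; bestPivot = pivot; }
    }
    return {bestGain, bestPivot};
}
```

Running this once per attribute per product, and collecting the resulting (gain, pivot) pairs, would populate a structure like maxSplits.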

Once all root splits for all trees were calculated, maxSplits was iterated through to produce the results.

3.5 Hypothesis

We expected titleMatch and descMatch to be the most favored attributes by the algorithm, as they intuitively suggest which products are relevant for the query and should help greatly in distinguishing between query contexts.

The popularity- and rating-related attributes were also expected to be somewhat useful, since they should help detect trends: the former is a measure of the overall trend in public attention, while the latter is a measure of the trend in the public perception of the product - if a product had recently received many positive reviews at the time of query, this attribute’s week-variant could help attribute less importance to the average rating and identify a shift in people’s mindset towards the product. The rating is visible when choosing among products and should greatly influence the probability of a customer clicking on it. We expected the popularity attributes to perform better than the rating attributes, since the rating is likely to stay roughly the same.

The regular popularity and rating attributes were not expected to perform as well as their week-/month-variants, since they do not take time into account. Differing popularity or rating values on a week or month basis, which would be identified by popularityWeek/popularityMonth and ratingWeek/ratingMonth, should help detect sudden changes. Sudden changes in rating are less likely, though, so this notion is much less certain for the rating attributes. Whether the month-related attributes are better than their week-counterparts will show how quickly trends die, but we had no hypothesis for this.

We believed the daysSinceRelease attribute would perform well, since it represents the novelty of a product, and one might expect most products, especially games, to be much more popular upon release. Our hypothesis was that it would perform as well as the popularity-related attributes.

The daysSinceLastTwoReviews attribute is another trend-detecting attribute, whose pivot value will show how many days may pass for two reviews to signify an upwards trend in popularity. We expected it to perform on par with the popularityWeek or popularityMonth attribute.


The hasClickedBefore attribute might perform well if there are enough users with multiple entries in the data set. It also depends on whether the fact that a product has been clicked on before is roughly equally important for all users. We believed so, and we believed it would perform among the best.


Chapter 4

Results

Two tables are presented below: One for the maximum gain of all attempted splits on each attribute, including when the gain was 0, and one with the pivot values for all attempted splits on each attribute when the gain was positive. The maximum gain shows how useful an attribute was to the decision tree algorithm, and the pivot value shows which value yielded the maximum gain. The pivot value thus shows which value best helped identify queries leading to clicks on a product.

The tables in this section display the average and the standard deviation (referred to as σ), as well as the values that are the foundation for the box plots, namely: minimum, lower quartile (Q1), median, upper quartile (Q3) and maximum. Table 4.2 also displays the number of splits (that is, the number of splits with a positive gain, i.e. the number of pivot values).

All values are rounded to five significant figures.

4.1 Gains of the attributes

89 out of 413 products had all the necessary information in order to calculate the values for all attributes. In table 4.1 and figure 4.1, presented below, all attributes thus had 89 gain values.

Table 4.1: Maximum gains of the 89 root splits for each attribute.

Attribute                Average       σ            Min           Q1            Median        Q3            Max
descMatch                0.00347435    0.012743     6.35304e-006  0.00015444    0.000476321   0.00177311    0.10133
titleMatch               0.00359302    0.0132951    1.91705e-006  0.000223249   0.000645093   0.00209685    0.112829
popularityWeek           0.000176399   0.00108236   0             0             0             1.98355e-005  0.00987867
popularityMonth          0.000454866   0.00239731   0             0             0             6.10359e-005  0.0186516
popularity               0.000431985   0.00213683   0             0             0             2.15803e-005  0.0150421
ratingWeek               0.000246993   0.00141835   0             0             0             1.95821e-005  0.0100762
ratingMonth              0.000319583   0.00159536   0             0             0             4.89423e-005  0.0130282
rating                   0.000377328   0.00190143   0             0             0             1.83368e-005  0.0150421
daysSinceRelease         0.000906207   0.0024262    3.14196e-005  9.23961e-005  0.000202777   0.000594331   0.0170462
daysSinceLastTwoReviews  0.000497384   0.00168404   0             4.82439e-005  0.00012159    0.000377627   0.0150421
hasClickedBefore         2.89144e-005  5.54114e-005 0             9.13592e-008  7.75611e-006  2.83969e-005  0.000288352


Figure 4.1: Box plot showing the gain distribution for the attributes, in base-2 logarithmic scale.



4.2 Usage of the attributes

For pivot values, only the splits resulting in positive gain were included in the results. The first column contains the number of splits (which is also the number of pivot values).

A box plot is presented as well, for each attribute.

Note that the hasClickedBefore attribute is always split on 0 since it’s a binary attribute.

Table 4.2: Pivot values resulting in the maximum gain for each split, when the gain was positive.

Attribute                Splits  Average    σ          Min  Q1         Median     Q3         Max
descMatch                89      0.0930571  0.0663728  0    0.050706   0.0833333  0.137626   0.428571
titleMatch               89      0.0753123  0.0448148  0    0.0487805  0.0645161  0.0960061  0.176471
popularityWeek           32      0.40625    1.0265     0    0          0          0          5
popularityMonth          38      0.5        1.65036    0    0          0          0          10
popularity               30      44.1667    79.8954    0    1          10         56         338
ratingWeek               32      0.792067   1.59745    0    0          0          0          4.84615
ratingMonth              38      0.794892   1.58064    0    0          0          0          4.70588
rating                   27      3.18352    1.8655     0    1          4.16       4.5        4.91667
daysSinceRelease         89      517.337    438.114    0    161.5      378        821        1949
daysSinceLastTwoReviews  71      256.366    249.274    0    81         184        310        1066
hasClickedBefore         71      0          0          0    0          0          0          0

Figure 4.2: Box plot showing the pivot value distribution for the descMatch attribute (x-axis: fraction of the words in the query found in the description).


Figure 4.3: Box plot showing the pivot value distribution for the titleMatch attribute (x-axis: fraction of the words in the query found in the title).

Figure 4.4: Box plot showing the pivot value distribution for the popularityWeek attribute (x-axis: the number of reviews posted in the seven days prior to the query).

Figure 4.5: Box plot showing the pivot value distribution for the popularityMonth attribute (x-axis: the number of reviews posted in the 30.4 days prior to the query).


Figure 4.6: Box plot showing the pivot value distribution for the popularity attribute (x-axis: the total number of reviews posted prior to the query).

Figure 4.7: Box plot showing the pivot value distribution for the ratingWeek attribute (x-axis: the average rating of the reviews posted in the seven days prior to the query).

Figure 4.8: Box plot showing the pivot value distribution for the ratingMonth attribute (x-axis: the average rating of the reviews posted in the 30.4 days prior to the query).

Figure 4.9: Box plot showing the pivot value distribution for the rating attribute (x-axis: the average rating of the reviews posted prior to the query).


Figure 4.10: Box plot showing the pivot value distribution for the daysSinceRelease attribute (x-axis: the number of days since release).

Figure 4.11: Box plot showing the pivot value distribution for the daysSinceLastTwoReviews attribute (x-axis: the number of days since the last two reviews).

Figure 4.12: Box plot showing the pivot value distribution for the hasClickedBefore attribute (0 is false, 1 is true).


Chapter 5

Discussion

This chapter first discusses the gain and pivot distributions, and then extends this to a hypothetical discussion on the usefulness of decision trees. The discussion is summed up in the "Conclusion" subsection, followed by a critical analysis of the work done in this paper.

5.1 Importance of attributes

This section discusses the gains of the attributes, presented in table 4.1 and figure 4.1. The gain is a measure of how useful an attribute is to the ID3 algorithm.
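As a reminder of what these numbers measure, here is a minimal sketch of ID3's entropy and information-gain computation for a binary pivot split. The function names and attribute values below are illustrative only, not taken from the study's implementation or data set:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels (clicked / not clicked)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(values, labels, pivot):
    """Gain of splitting the numeric attribute `values` at `pivot` (<= goes left)."""
    left = [l for v, l in zip(values, labels) if v <= pivot]
    right = [l for v, l in zip(values, labels) if v > pivot]
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder

# Toy titleMatch fractions for six query contexts, 1 = clicked on the product.
title_match = [0.0, 0.05, 0.08, 0.12, 0.25, 0.30]
clicked = [0, 0, 0, 1, 1, 1]
print(information_gain(title_match, clicked, 0.08))  # 1.0: a perfectly pure split
```

A gain of 1.0 is the theoretical maximum for a binary class; the gains reported in table 4.1 are far smaller because real query contexts are rarely separated this cleanly.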

Table 4.1 shows that titleMatch and descMatch were clear winners, with average gains at approximately 0.0035 and medians at approximately 0.0006 and 0.0005, respectively. This corresponds with the hypothesis. Since titleMatch was the most important attribute, most users evidently searched for a particular name or brand and found it. The descMatch attribute was likely less favored because descriptions don't always contain the title or name of the product. They do, however, contain other useful terms whenever a customer searches for a category or certain attributes of a product, e.g. "action game". Both have similarly large standard deviations, much larger than those of the other attributes. This could be explained by the fact that many queries don't directly include a title or relevant "tag" or category, rendering these attributes much less useful. They also become less useful for products whose titles include words that also appear in other titles or descriptions; the ideal situation is when a product's title is a unique name or word.

The daysSinceRelease attribute is next, with an average gain of about 0.0009 and a median of approximately 0.0002. This attribute reflects a product's novelty. The result means that the novelty of a product is much more important than its contemporary popularity trend, since daysSinceRelease vastly outperformed all popularity- and rating-related attributes. It could also be because the latter simply didn't manage to identify trends adequately (as can be seen in e.g. figure 4.5 - most DataObjects had an upper quartile at zero because of the low frequency of review posting for most products). This also contradicts the hypothesis somewhat, as daysSinceRelease proved much more important than the popularity- and rating-related attributes.

The daysSinceLastTwoReviews attribute performed almost as well as daysSinceRelease, its gain's average and median being approximately 0.0005 and 0.0001, respectively. This means it is a decent trend indicator.

The rating and popularity attributes are next, with average gains of about 0.0004 and medians at 0. It is reasonable that the popularity-related attributes were more important than the rating-related attributes, since we don't believe a product's rating or approval among the public changes much with time.

The trend should thus be better measured by the number of, rather than the rating of, reviews. This is shown in the results, and goes hand in hand with the hypothesis.

The popularityMonth attribute somewhat outperformed popularity, and outperformed popularityWeek by a much larger margin. The same goes for ratingMonth, in that it outperformed ratingWeek. This means trends usually last for more than a week, up to a month.

The hasClickedBefore attribute performed worst in terms of average, though its median was higher than that of all the popularity and rating attributes. The reason it didn't perform as well as hypothesized is most likely that most users only appear once in the data set, which results in many DataObjects having a 0 value, or no value at all (which in practice also means 0), since those users didn't appear more than once. For the vast majority of DataObjects, then, whether the user clicked on the product or not is completely independent of the hasClickedBefore attribute value, resulting in a very low gain.

5.2 Usage of attributes

This section discusses the pivot values of the attributes, presented in table 4.2 and figures 4.2 through 4.12. The pivot values can be used to interpret how the decision tree used the attributes.

It turned out that descMatch and titleMatch were usually split at around 0.07-0.09 - this means that less than 1/10 of the words in the query would be found in the description or title, respectively, for the algorithm to put a query context into the left subset. Since the algorithm tries to achieve maximum purity in its subsets at any given split, this suggests that finding more than roughly 1/10 of the words in either title or description results in a likely click, judging by the information gain. The standard deviation isn't negligible for either, however, so this limit shifted a bit between products.
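The pivot search itself - trying every observed value as a pivot and keeping the one with maximum gain, as described in the method - can be sketched as follows. The helper names and toy data are ours, not from the study's implementation:

```python
import math

def entropy(labels):
    """Shannon entropy of a non-empty list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def split_gain(values, labels, pivot):
    """Information gain of splitting at `pivot` (<= goes left)."""
    left = [l for v, l in zip(values, labels) if v <= pivot]
    right = [l for v, l in zip(values, labels) if v > pivot]
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in (left, right) if s)
    return entropy(labels) - remainder

def best_pivot(values, labels):
    """Try every observed attribute value as a pivot, keep the max-gain one."""
    return max(set(values), key=lambda p: split_gain(values, labels, p))

# Toy descMatch values for five query contexts, 1 = clicked on the product.
desc_match = [0.0, 0.04, 0.07, 0.10, 0.20]
clicked = [0, 0, 0, 1, 1]
print(best_pivot(desc_match, clicked))  # 0.07: separates clicks from non-clicks
```

The pivot distributions in figures 4.2 through 4.12 are simply the outcomes of this search across all root splits.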

The daysSinceRelease attribute has an average pivot value of 515 days, with a fairly large standard deviation. This is about how old a product would usually be for the split to distinguish queries resulting in clicks from those that didn't, meaning that video games are the most popular within roughly their first year and a half after release.


The daysSinceLastTwoReviews attribute's pivot value was largely at 200, suggesting that reviews posted with at least that frequency (once every 100 days) might indicate the product is still perceived as relevant or new by customers.

The rating attribute was usually split at approximately 3.2. This is lower than might be expected, given the inflation in ratings - most products achieve at least 3. It suggests that the only way to identify queries leading to a click on the product using the rating is when the rating is low and then changes drastically. The standard deviation is fairly high, however, and the maximum is at 4.9, meaning that the quantity of reviews could be considered more important than the actual rating.

The ratingWeek attribute has a surprisingly low average of about 0.8; the same goes for ratingMonth. It's not likely that the actual rating would be this low, since the average pivot value of rating is about 3.2. We believe it can rather be explained by the fact that for most products, reviews are seldom posted, and the number of query contexts where no reviews were posted in the seven days prior to the query is thus very large. This can be seen in the median and quartile values, which are all at zero. This leads to a ratingWeek/ratingMonth of 0 in those contexts, pulling the average and the quartiles down. The low average suggests that the algorithm often chose to split DataObjects with a positive ratingWeek/ratingMonth from those without; just a few reviews might suggest a product is trending. The popularityWeek/popularityMonth attributes have similarly low averages, and it seems ratingWeek/ratingMonth and popularityWeek/popularityMonth were used for the same purpose, which was originally the purpose of just popularityWeek/popularityMonth: to determine whether a product is trending in attention (not in terms of rating or perception).
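For concreteness, here is a sketch of how these review-based attributes could be derived from a product's review dates. The window lengths (7 and 30.4 days) follow the attribute definitions, but the exact computation - in particular treating daysSinceLastTwoReviews as the age of the second-most-recent review - is our reading of the definitions, not the study's code:

```python
from datetime import date

def review_features(review_dates, query_date):
    """Derive the review-based trend attributes for one query context."""
    # Ages (in days) of all reviews posted before the query, most recent first.
    days_ago = sorted((query_date - d).days
                      for d in review_dates if d <= query_date)
    return {
        "popularityWeek": sum(1 for a in days_ago if a <= 7),
        "popularityMonth": sum(1 for a in days_ago if a <= 30.4),
        "popularity": len(days_ago),
        # Days back to the second-most-recent review, if there are two.
        "daysSinceLastTwoReviews": days_ago[1] if len(days_ago) >= 2 else None,
    }

reviews = [date(2012, 1, 5), date(2012, 3, 1), date(2012, 3, 28)]
print(review_features(reviews, date(2012, 4, 1)))
```

The corresponding ratingWeek/ratingMonth attributes would average the ratings of the reviews falling inside the same windows, defaulting to 0 when a window is empty - which is exactly what pulls their pivot averages down.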

5.3 Usefulness of decision trees

This section aims to hypothetically discuss the use of decision trees, given the results in this paper.

Decision trees have many advantages, as stated in chapter 2. They are easily interpretable and fast once constructed - two factors that might be important depending on the usage. Naturally, they also possess many disadvantages, especially for this specific task. For example, if you wanted to actually construct the trees, should one tree be constructed for all the products? The tree would most likely have to be enormous. Thus a tree needs to be created for every single product. That might be problematic in its own way if we're dealing with millions of products. In this case, however, the number of products is relatively small.

When creating one tree for each product, even just iterating through all trees is a time-consuming process, and this approach is probably not a good use of decision trees - especially not when new products are entered into the database with little training data available due to their novelty. Products with less training data (newer or less popular products) produce poor decision trees on their own.


This approach to decision trees, where one tree is trained for every product, might thus not be well suited to a volatile database such as that of a webshop.

The fact that certain attributes seem vastly favored, largely independent of the product (as shown by the fairly low standard deviations for some of the attributes' gains and pivot values), suggests that it might be possible to create a universal tree. This could be explored in another report, using the conclusions made here, to investigate the actual usefulness of decision trees in this context.
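As an illustration only, such a universal tree could be hard-coded from the average pivot values reported here (roughly 0.08 for the match attributes and 515 days for daysSinceRelease). No tree of this kind was actually trained or evaluated in this study, and the thresholds and function name below are ours:

```python
def predict_click(title_match, desc_match, days_since_release):
    """Hypothetical universal tree built from the average root pivots
    reported in this paper. Illustrative only - not trained or evaluated."""
    if title_match > 0.08:           # enough query words found in the title
        return True
    if desc_match > 0.08:            # enough query words found in the description
        return True
    return days_since_release <= 515  # otherwise, fall back on novelty

print(predict_click(0.15, 0.0, 900))  # title matches well -> True
```

A real universal tree would of course have to be learned from the pooled training data rather than assembled by hand, but the low variance in the pivot values suggests it might generalize across products.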

To really make a good user behavior prediction algorithm, we believe decision trees would have to be combined with static attributes (those that don’t have unique user preferences or contexts) like sales ranking and price to bias popular and cheaper products. These static attributes are often intuitively very important but can’t be included in a decision tree algorithm as they don’t help in distinguishing between query contexts.

Although our initial goal and focus was specifically on decision trees, another method would have been to implement a time series analysis, since we actually have obtained data over time. Naturally, a comparison could have been made between the two to determine how well decision trees and time series perform for this task.

Nonetheless, this paper focuses solely on decision trees and how they should be constructed. A comparison between two completely different algorithms would be the basis for another paper.

5.4 Conclusion

This paper identified certain attributes that were vastly favored by the ID3 algorithm. The conclusion is, in short, that how well the query string matches the product's title and description, the novelty of the product and the time since the last two reviews are important in helping the algorithm distinguish between different query contexts - more so, in fact, than the rating and the number of reviews.

Thanks to the general characteristics of the attribute preferences across different products' trees, it might be possible to construct a universal tree applicable to every single product, with no further training necessary. This would remove the need to train a tree for every product once the training data has been used to build the general tree.

5.5 Critical analysis

The fact that the training data only contained one category, "Xbox 360 Games", makes the results and the analysis relevant only to Xbox 360 games and, most likely, games for other platforms. We believe the findings can be extended to other categories as well - especially movies, music and other media - simply because we believe customers tend to search for a specific title to the same extent in these cases, rather than just a category or tag, e.g. "action movie". This renders titleMatch and descMatch equally useful for other media or entertainment product categories. The other attributes should also be similarly applicable - a product's novelty is likely as relevant for games as for movies, but might be less useful for product categories whose products have a longer lifespan before they are perceived as less relevant, e.g. furniture or even some types of electronics equipment. This means that the usefulness of the daysSinceRelease attribute might vary somewhat between categories. The usefulness of the popularity- and rating-related attributes, however, is likely the same no matter the category - they identify trends, and we believe the significance of an increase in attention for a product is the same for all categories.

There was no experiment regarding the trees’ actual performance in terms of the accuracy of the predictions. The conclusions are thus limited to being suggestions for the behavior of customers at webshops, but aren’t verified guidelines for how to make accurate decision trees. This would have to be explored in another paper.

Something that we couldn’t take into account was the ordering of the search results. It wasn’t provided in the data set, and throughout this report it is assumed that it was completely random. It is likely, however, that the search results were organized in terms of relevance (to the query string) or price. This would mean that the test subjects in the training data might have already been affected by the ordering, since people are likely to choose the top-most search results first.

ID3 was chosen as the construction algorithm since it is one of the more commonly taught algorithms. We could have compared it with others, for example CART. However, that would be out of scope for this paper, since we were more interested in which variables and attributes yield a higher and more accurate result than in which specific algorithm constructs a better tree.


Chapter 6

References

Bao, L., & Intille, S. S. (2004). Activity recognition from user-annotated acceleration data. In Pervasive computing (pp. 1-17). Springer Berlin Heidelberg.

Bennett, J., & Lanning, S. (2007). The netflix prize. In Proceedings of KDD cup and workshop (Vol. 2007, p. 35).

BestBuy (2012). Data Mining Hackathon on (20 mb) Best Buy mobile web site - ACM SF Bay Area Chapter. [ONLINE] Available at: https://www.kaggle.com/c/acm- sf-chapter-hackathon-small. [Accessed 4 May 2016].

Goel, S., Hofman, J. M., Lahaie, S., Pennock, D. M., & Watts, D. J. (2010). Predicting consumer behavior with Web search. Proceedings of the National Academy of Sciences, 107(41), 17486-17490.

Gutierrez, J., & Leroy, G. (2008). Using decision trees to predict crime reporting. Advanced Principles for Improving Database Design, Systems Modeling, and Software Development, eds. J. Erickson and K. Siau, IGI Global, 132-145.

Hall, L. O., Liu, X., Bowyer, K. W., & Banfield, R. (n.d.). An analysis of neural network versus decision tree performance on a bio-informatics problem. University of South Florida.

Hobbs, G. (n.d.). Decision Trees as a Predictive Modeling Method. Department of Statistics, West Virginia University. http://www.wuss.org/proceedings10/analy/3055_2_ANL-Hobbs.pdf

Lee, J. J., McCartney, R., & Santos Jr, E. (2001). Learning and Predicting User Behavior for Particular Resource Use. In FLAIRS Conference (pp. 177-181).

Maimon, O., & Rokach, L., eds. (2010). Data Mining and Knowledge Discovery Handbook, 2nd ed. Springer Science+Business Media. p. 164.

Neville, P. G. (1999). Decision trees for predictive modeling. SAS Institute Inc.

Radinsky, K., Svore, K., Dumais, S., Teevan, J., Bocharov, A., & Horvitz, E. (2012). Modeling and predicting behavioral dynamics on the web. In Proceedings of the 21st international conference on World Wide Web (pp. 599-608). ACM.

Squire, David. (2004). CSE5230 Tutorial: The ID3 Decision Tree Algorithm. Monash University.

Yu, Y., Chen, S., Wang, L. S., Chen, W. L., Guo, W. J., Yan, H., Zhang, W. H., Peng, C. H., Zhang, S. H., Li, H. W., & Chen, G. Q. (2005). Prediction of pancreatic cancer by serum biomarkers using surface-enhanced laser desorption/ionization-based decision tree classification. Oncology, 68(1), 79-86.

Zheng, B., Thompson, K., Lam, S. S., Yoon, S. W., & Gnanasambandam, N. (2013). Customers' Behavior Prediction Using Artificial Neural Network. In IIE Annual Conference. Proceedings (p. 700). Institute of Industrial Engineers-Publisher.

