
Linköping University | Department of Computer and Information Science
Master Thesis | Division of Statistics and Machine Learning
Spring 2021 | LIU-IDA/STAT-A--21/023--SE

A Semi-Supervised Support Vector Machine for a Recommender System

Applied to a real-estate dataset

Author: José Jaime Méndez Flores

Supervisor: Michał Krzemiński

Examiner: Krzysztof Bartoszek


Upphovsrätt

Detta dokument hålls tillgängligt på Internet – eller dess framtida ersättare – under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida https://ep.liu.se/ .

Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: https://ep.liu.se/.

© 2021 José Jaime Méndez Flores

Abstract

Recommender systems are widely used in e-commerce websites to improve the customer's buying experience. In recent years, e-commerce has been expanding quickly, and its growth accelerated during the COVID-19 pandemic, when customers and retailers were asked to keep their distance and stay at home during lockdowns. There is therefore an increasing demand for good recommendations that improve the users' shopping experience. In this master's thesis a recommender system for a real-estate website is built, based on Support Vector Machines (SVM). The main characteristic of the model is that it is trained with a small number of labelled samples and the remaining unlabelled samples, using a semi-supervised machine learning paradigm. The model is constructed step by step, from the basic SVM to the semi-supervised Nested Cost-Sensitive Support Vector Machine (NCS-SVM). We then compare our model using four different kernel functions: Gaussian, second-degree polynomial, fourth-degree polynomial, and linear. We also compare a user with strict housing requirements against a user with vague requirements. We finish with a discussion focusing principally on parameter tuning, and briefly on the model's downsides and ethical considerations.


Acknowledgments

I am grateful to my supervisor Michał Krzemiński, my examiner Krzysztof Bartoszek and my colleague Jun Li for enlightening my path towards completing this master's thesis with their experience, knowledge and advice.

In any adventure, the weight is easier to carry if you share it with friends. Thanks to the friends I made along the way, especially Agustin, Bayu, Ismaïl, Jooyoung, and Marcos. Of course, many thanks to my best teammate Kelly Ying Luo; she is a friend, a teacher and a truly kind-hearted person.

I am grateful to my all-life friends for making me travel, party, laugh, debate and think no matter the distance.

I am immensely grateful to my family for giving me the education, support, guidance and resources without which this master's thesis would not have been possible. My mother, father and sister are the kindest and most understanding people I know, and I am fortunate to always have an ear to listen to my angst, fears, failures, worries and all the emotions that one has while studying far from home, on one's own, with a pandemic going on.

Contents

Chapter 1. Introduction ... 1

Motivation ... 1

Background on recommender systems ... 2

1.2.1. Recommender systems ... 2

1.2.2. Basic models of recommender systems ... 3

1.2.3. Collaborative Filtering ... 3

1.2.4. Content-based recommender systems ... 3

1.2.5. Knowledge-based recommender systems ... 4

1.2.6. Hybrid recommender systems ... 4

Overview of the learning paradigms in Machine Learning ... 4

1.3.1. Unsupervised learning ... 5

1.3.2. Supervised learning ... 6

1.3.3. Semi-supervised learning: learning from labelled and unlabelled data ... 7

Research questions ... 8

Delimitations ... 8

Chapter 2. Data ... 9

Data Source ... 9

2.1.1. Raw dataset ... 9

Secondary data ... 12

Chapter 3. Theoretical Background ... 13


3.3.1. Kernel Trick ... 20

3.3.2. Basic Kernels ... 23

Semi-supervised Support Vector Machine ... 25

Cost-Sensitive Support Vector Machine ... 26

Nested Cost-Sensitive Support Vector Machine ... 27

3.6.1. First step: Finite family of nested sets ... 28

3.6.2. Second step: Interpolation ... 28

Model Assumptions in Semi-Supervised learning and parameter tuning. ... 29

Chapter 4. Methodology ... 31

Software ... 31

Sub-setting the training data for an “active user” ... 31

Model selection ... 32

Calculating the NCS-SVM ... 32

Stopping in the low-density region ... 33

Returning results and plotting ... 35

Chapter 5. Results and Analysis ... 36

Comparing different kernels ... 36

Comparing a user with dense preferences vs. user with loose preferences ... 42

Evaluation with users that have closed a deal ... 47

Chapter 6. Conclusions ... 49

Discussion ... 49

Method Criticism ... 51

Further research ... 51

Ethical considerations ... 51

Bibliography ... 53

List of Figures

Figure 1 Hypothetical scenario with datapoints describing the size of the property and the monthly rent to pay ... 5

Figure 2 Results of k-means clustering algorithm on hypothetical data ... 6

Figure 3 Supervised classifier results on different values of monthly rent and size ... 7

Figure 4 Maximal Margin Classifier ... 15

Figure 5 Inseparable Case, note that each point has its own value of 𝛏 ... 18

Figure 6 The observations belong to two different classes but there is no linear boundary between them ... 21

Figure 7 SVM with linear kernel used in three different datasets ... 23

Figure 8 SVM with polynomial kernel of sixth degree used in three different datasets ... 24

Figure 9 Radial kernel used in three different datasets ... 25

Figure 10 Example of different boundaries obtained using three different cost asymmetries γ = 0.5, 0.7, 0.9 ... 27

Figure 11 Low density criterion principle. Left side: low-density region; Right side: high-density region ... 30

Figure 12 Example of 3D scatter plot with recommendations... 35

Figure 13 Graphic demonstration of the recommendations using Gaussian kernel with sigma = 1 ... 37

Figure 14 Graphic demonstration of the recommendations using second-degree polynomial kernel ... 39

Figure 15 Graphic demonstration of the recommendations using fourth-degree polynomial kernel ... 40

Figure 16 Graphic demonstration of the recommendations using linear kernel ... 41


Chapter 1.

Introduction

This master's thesis aims to develop a recommender system in a semi-supervised manner. Hence, the introduction chapter comprises five main sections: the motivation, a background on recommender systems, an overview of the machine learning paradigms, the research objectives and the delimitations. The second section explains what recommender systems are and how they are classified according to how they are modelled. The third section explains supervised, unsupervised and semi-supervised machine learning. The fourth and fifth sections present this master's thesis' goals and limitations.

Motivation

During the COVID-19 pandemic in 2020, many populations worldwide were urged to stay at home and work remotely, which resulted in an increase in the demand for e-commerce that is expected to continue in the long run [1]. As more e-commerce websites and marketplaces with thousands of retailers and items are created, there is an increased need for better recommender systems that improve the users' shopping experience and reduce the frustration of fruitless navigation through overwhelming amounts of items.

Recommender systems try to infer customer interests by using search fields and various sources of data like item descriptions, item ratings, item specifications or requirements, user interactions or implicit feedback, similarities between users, similarities between items, and more. As recommender systems are becoming part of our everyday consuming habits, it is worth having a vast variety of models that can analyse and process these data sources and return meaningful items that generate benefits to the user and the retailers.

This master's thesis aims to build a recommender system based on Support Vector Machines and apply it to anonymized data from an e-real-estate business. Support Vector Machines, or SVMs, are used for pattern classification, which is the task of categorizing objects into a "class" or label. There are two general approaches to developing classifiers [2]: 1) a non-parametric approach, in which no prior knowledge of the data distributions is assumed; and 2) a parametric approach, where prior knowledge is assumed. A model based on an SVM belongs to the non-parametric approach, which uses input-output pairs to learn the decision functions that classify future inputs into one of the classes. In particular, the SVM model used in this work is called the Nested Cost-Sensitive Support Vector Machine (NCS-SVM), which makes it possible to incrementally add properties to a list of recommendations, starting with the items that are most similar to those the user has shown interest in. The NCS-SVM is trained under a machine learning paradigm called semi-supervised learning, which requires only partially labelled data during the training phase.


The dataset, which is publicly available on Kaggle*, has been generated by the e-real-estate system. The data have been anonymized and some variables have been encrypted. This dataset was chosen because of personal interest in the explosive growth of e-commerce and in the traditional, ever-growing real-estate business.

Background on recommender systems

This section is meant to inform the reader about non-technical concepts that are important to understand the aim of the model developed in this master's thesis. It starts by explaining what a recommender system is in section 1.2.1; section 1.2.2 then briefly describes the basic categories of recommender systems, followed by a more detailed summary of each category in sections 1.2.3 to 1.2.6. Section 1.3 gives a brief overview of the learning paradigms in machine learning. In particular, we want the reader to become familiar with the less common paradigm of semi-supervised learning (section 1.3.3), on which the Nested Cost-Sensitive SVM is based.

1.2.1. Recommender systems

Recommender systems are computer algorithms used to filter out items that a particular user would be interested in. These algorithms use various sources of data to infer customer interest in the items, which can be some form of explicit feedback, like the now common rating systems on Amazon, Mercado Libre, Alibaba and other marketplaces, or clicking the thumbs up/down after watching video content on Netflix or YouTube. The source of data can also be implicit; for instance, a search for the term "apartment" on Google or browsing through different real-estate publications on Instagram could be viewed as an expression of interest in those items.

The basic principle of these algorithms is that there are correlations between the active users and their item-specific activities. For instance, a person whose activity shows that they are young and single is more likely to look for an apartment than for a family house.

Recommender systems are of business interest because they help to increase product sales by showing the users personalised items to attract their attention. But even if the final goal is to increase sales, the specific goal for the algorithm is not to measure item sales but to show items that are relevant and new to the user. Showing irrelevant results is not desired because this would not allow companies to consummate their sales and, on the contrary, it could make the users abandon the website. At the same time, the recommendations should not be too narrow: if a user has shown interest in several Christopher Nolan movies, it would be more interesting to the user to be shown more diverse results, like a few Nolan movies, a few Spielberg movies and some Ridley Scott movies.

1.2.2. Basic models of recommender systems

Following the classification of recommender systems in [4], there are three models of recommender systems: 1) collaborative filtering, 2) content-based recommender systems, and 3) knowledge-based recommender systems. Each will be explained in the following sections. It is worth mentioning that these basic models can be combined into hybrid systems to overcome the disadvantages of each method working on its own.

1.2.3. Collaborative Filtering

Collaborative filtering was proposed in 1990 by two scientists at Xerox PARC to reduce the annoyance of inbox overload. I encourage reading the full story in [5], as it is interesting, though not relevant to this work. Collaborative filtering models use the collective influence of the different users to make recommendations. The underlying principle is to impute ratings for the items that the active user has not seen by finding correlations between the active user's collected item ratings and those of their peers. This means that users with preferences like those of the active user can supply recommendations of items that are new to the active user. Additionally, the items that the active user has rated can be used to produce item-item recommendations for them. There exist two methods for collaborative filtering: 1) neighbourhood-based methods and 2) model-based methods.

1. Neighbourhood-based methods: The ratings of the user-item combinations can be predicted based on similarities with the users’ neighbours.

a. User-based collaborative filtering: Users who are like-minded with the active user A are used to make recommendations to A. The neighbours are found with similarity functions like those described in [6].

b. Item-based collaborative filtering: A set of items S, rated by user A, that are the most similar to target item I is determined. Then the ratings of S are used to predict if A would like I.

2. Model-based methods: In model-based methods, predictions of the user ratings are made using Machine Learning and Data Mining methods. The range of possible methods is very broad, but the more scalable ones, like decision trees or decision rules, are preferred because the websites often deal with thousands of users and items.

1.2.4. Content-based recommender systems

In content-based recommender systems, text-mining techniques are used on the items' descriptions to make recommendations. These models are used more often in contexts where it is not possible to make collaborative filtering recommendations. For example, when a user has liked an item but there is no access to more users, one can use the item description to search for keywords and match those keywords with the keywords of the rest of the items. Some disadvantages of content-based recommender systems are that they may provide obvious recommendations, which can lead to poor diversity, and that they do not return recommendations to new users who have not rated any items.

1.2.5. Knowledge-based recommender systems

In knowledge-based recommender systems, ratings are not used for the purpose of recommendations. Instead, the user "feeds" the system with constraints or cases that are used as an anchor to recommend similar items. These models are often used for items that are not purchased often, as in the real-estate, automotive or hospitality industries, where there may not be enough user ratings to make recommendations. As opposed to collaborative filtering and content-based systems, knowledge-based systems allow the users to explicitly specify what they want; similarities can then be computed between the customer requirements, the customer constraints, and the attributes of the items.

A simple case of a knowledge-based recommender would be a laptop e-store, where users are often allowed to constrain the results by minimum and maximum price, brand, memory size, etc.

1.2.6. Hybrid recommender systems

Each of the earlier mentioned recommender systems may work well in different scenarios. Each uses distinct types of inputs, and the outcomes may be better or worse depending on what the data allows them to do. Good data and a well-chosen model will result in good recommendations, while poor data or the wrong model may result in non-optimal recommendations.

One can use several types of recommender systems for the same task if a variety of inputs is available. In such cases, the strengths of the different recommender systems work together to achieve a better approach; this is called hybridization. To create these models, techniques similar to those used in machine learning ensemble methods are applied to improve the effectiveness of the otherwise isolated models.

Overview of the learning paradigms in Machine Learning

Figure 1 Hypothetical scenario with datapoints describing the size of the property and the monthly rent to pay

Prior to describing the different paradigms, recall that an instance $\mathbf{x}$ is an abstract representation of an object. Each instance is represented by a $p$-dimensional vector $\mathbf{x} = (x_1, \dots, x_p) \in \mathbb{R}^p$, where each of the $p$ dimensions is called a feature. A training sample is a collection of instances $\{\mathbf{x}_i\}_{i=1}^{n} = \{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, which is going to be the input of the learning algorithm. We assume that each training sample is sampled independently from an unknown underlying distribution, $\{\mathbf{x}_i\}_{i=1}^{n} \sim P(\mathbf{x})$.

Also, recall that for each training instance there could be a label 𝑦 that may come from a finite set of values like {expensive, cheap} or {big, medium, small} which are called classes. The classes can be encoded, for instance, expensive = 1 and cheap = -1, thus 𝑦 ∈ {−1,1}. One can also find labels with continuous values, for instance, if we want to predict the price of rent according to the size of the property.

1.3.1. Unsupervised learning

Unsupervised learning algorithms are trained with $n$ instances $\{\mathbf{x}_i\}_{i=1}^{n}$ where none of the instances has been labelled; therefore there is no explicit pattern to supervise how the individual instances should be handled. Some unsupervised learning tasks [7] are:

• clustering, whose goal is to separate the $n$ instances into clusters or groups;

• outlier detection, to identify which instances are different from the majority;

• dimensionality reduction, whose aim is to reduce the dimensionality of the feature vector while keeping as much of the training sample characteristics intact.

As an example of an unsupervised technique, consider searching for two or four clusters on Figure 1 with a clustering algorithm. Look, for instance, at the k-means clusters shown in Figure 2.


Figure 2 Results of k-means clustering algorithm on hypothetical data

The results of the clustering algorithm should be followed by a subject-matter expert’s interpretation, since the algorithm has not been given any sign about what these clusters are. In our example, assume that the two clusters on the top are from city 1 and the two on the bottom are from city 2. Then assume that the clusters from the left are old properties and the clusters on the right are very new buildings. We may identify what the algorithm with two clusters has found as the discrimination between the properties from city 1 (red) and the properties from city 2 (green). Similarly with four clusters, we realize that the algorithm has learned to discriminate between each of these groups: {blue = (city 1, old), cyan = (city 1, new), red = (city 2, old), green = (city 2, new)}.
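To make the clustering step above concrete, the following minimal R sketch runs k-means on simulated rent and size values (the data, variable names and cluster counts are illustrative assumptions, not the thesis data); the interpretation of the resulting clusters still has to be supplied by the analyst, as discussed above.

```r
# Minimal sketch: k-means on simulated monthly-rent / size data (hypothetical values).
set.seed(1)
props <- data.frame(
  unit_area    = c(rnorm(50, 60, 8),    rnorm(50, 120, 10),
                   rnorm(50, 65, 8),    rnorm(50, 125, 10)),
  monthly_rent = c(rnorm(50, 4e6, 4e5), rnorm(50, 5e6, 4e5),
                   rnorm(50, 7e6, 4e5), rnorm(50, 8e6, 4e5))
)

# Standardize the features so that the rent scale does not dominate the distances.
km2 <- kmeans(scale(props), centers = 2, nstart = 25)
km4 <- kmeans(scale(props), centers = 4, nstart = 25)

# Cluster sizes; labels such as "city 1, old" are an interpretation added afterwards.
table(km2$cluster)
table(km4$cluster)
```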

1.3.2. Supervised learning

In supervised learning, the training sample consists of pairs, each containing an instance $\mathbf{x}$ and a label $y$. Let the domain of instances be $\mathcal{X}$ and the domain of the labels be $\mathcal{Y}$, and let $P(\mathbf{x}, y)$ be an unknown joint probability distribution on instances and labels $\mathcal{X} \times \mathcal{Y}$.

Given a training sample $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n} \sim P(\mathbf{x}, y)$, a supervised learning algorithm [7] trains a function $f: \mathcal{X} \mapsto \mathcal{Y}$ from some function family $\mathcal{F}$, such that $f(\mathbf{x})$ predicts the true label $y$ on new data $\mathbf{x}$, where $(\mathbf{x}, y) \sim P(\mathbf{x}, y)$ as well.

Moreover, depending on the domain of $y$, supervised learning problems are divided into classification and regression problems [7]. If the domain $\mathcal{Y}$ is continuous, the problem is a regression problem; if $\mathcal{Y}$ is a finite discrete set, it is a classification problem.


Figure 3 Supervised classifier results on different values of monthly rent and size

1.3.3. Semi-supervised learning: learning from labelled and unlabelled data

Although the typical learning paradigms are supervised and unsupervised learning, there are times where we have available labels for only part of the data or, like in this thesis, only one class is labelled but not the rest. For these cases, there is another paradigm somewhere between supervised and unsupervised learning: semi-supervised learning. In this paradigm, there is some quantity of labelled instances as well as some unlabelled instances. Most of the semi-supervised learning strategies are extensions of either supervised or unsupervised learning with the goal of including more information from the other learning paradigm [7]. Some problems usually solved by semi-supervised learning are the following:

• semi-supervised classification: The training data consists of $l$ labelled instances $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$ and $u$ unlabelled instances $\{\mathbf{x}_j\}_{j=l+1}^{l+u}$. The goal is to use these labelled and unlabelled instances to train a classifier $f$ that is better than the standard supervised classifier trained on the labelled data alone;

• constrained clustering: The training data consists of unlabelled instances, as well as "supervised information" about the clusters. The "supervised information" can be, for example, whether two instances $\mathbf{x}_i, \mathbf{x}_j$ can or cannot be in the same cluster, or constraints about the size of the clusters. The goal is to obtain better clustering than the unsupervised clustering algorithm.


In this master’s thesis, we will build a set of recommendations based on partially labelled data using a semi-supervised learning approach, because the dataset allows knowing what the active users like but not what they do not like. More information about the data is given in Chapter 2 and about the semi-supervised model in Chapter 3. The ultimate goal of the semi-supervised model is to find an item set that would fit the profile of a specific active user.

Research questions

The main goal of this thesis is to explore the concepts associated with Support Vector Machines (SVM) and apply them in a semi-supervised approach to a real-estate recommender system. Although there is some literature about SVMs for collaborative filtering, such as [8, 9], the application of SVMs to partially labelled data has been little explored. Therefore, the following research questions are explored:

1. How can Support Vector Machines be used to construct a real-estate recommender system in a semi-supervised manner when there is partial information on the labels?

2. What is the effect of implementing different kernel functions on that real-estate recommender system?

3. How can the hyperparameters be tuned, in an objective manner, in such a recommender system?

Delimitations

This study is restricted to the use of Support Vector Machines as the method for the recommender system, even if other machine learning techniques may be more suitable for this kind of recommender system. The kernel functions used are the basic ones (see Chapter 3.3.2), although it is possible to create one's own kernel function if it meets specific criteria [2]. For the sake of simplicity, some variables were excluded from the model (see Chapter 4.2).


Chapter 2.

Data

Section 2.1 shows where the data come from and the tables that comprise them. Each variable of each table is described, and one example of each variable is shown. Section 2.2 presents a transformation applied to the raw data so that it can be used.

Data Source

The data is a mirror of a log from an online e-real estate system†. An e-real estate system makes the process of finding the right home easier and less tedious by bringing custom-chosen items to the sight of the customer. This way, the client can focus on exploring the attributes that make relevant houses stand out rather than cherry-picking and scrolling through long lists of items before finding an item of their interest.

2.1.1. Raw dataset

The original dataset consists of two tables, 1) one containing the activities or logs of some of the subscribed users and 2) another holding the attributes of the properties.

The user logs table has 4 variables and 323,893 rows, as described in Table 1.

Table 1 Users-logs raw data

Variable | Description | Examples (values)
item_id | Unique item identifier | 00062bc5-2535-4b1e-bbcb-228526c990b8
user_id | Unique user identifier | 182aa519-83a8-848f-84a1-8697046d84c2
event_type | User interaction with the item | seen
create_timestamp | Timestamp when the user made the (event_type) action | 2020-02-03 15:47:25

There are a total of eighteen distinct types of interactions, as described in Table 2, note that the events are in “chronological” order from top to bottom (e.g., seen on website → requested visit → negotiated a deal → bought/rent the property).


Table 2 User's interactions descriptions

Event name | Meaning
seen in list | The user saw the item on the website list
seen | The user clicked on the item to see its description
suggest new | User asked the Address team for suggestions
suggest_similar | User requested similar properties
sent_catalog_link | Address team sent the catalogue link via SMS
Visit_request-new | User requested to visit the property address
Visit_request-cancelled | User cancelled the visit request
Visit-new | Property visit scheduled
Visit-cancelled | Property visit cancelled after being scheduled
Visit-unsuccess | Home visit done but user did not like the property
Visit-success | Home visit done and user liked the property
Meeting_request-new | Meeting to negotiate deal requested by user
Meeting_request-cancelled | User cancelled the request of a meeting to negotiate deal
Meeting-new | Meeting to negotiate deal scheduled
Meeting-cancelled | Meeting to negotiate deal cancelled
Meeting-unsuccess | Meeting to negotiate deal completed, but deal was unsuccessful
Meeting-success | Meeting to negotiate deal finished
Deal Success | Closed deal, property rented to user


Table 3 Properties raw data

Variable | Description | Examples
item_id | Unique item identifier | 00062bc5-2535-4b1e-bbcb-228526c990b8
deposit | Deposit price | 64800000
monthly_rent | Monthly rent price | 4320000
district_uuid | Unique district identifier | 97c9535e-3985-47ce-a84c-a962c838a76b
room_qty | Number of rooms in property | 2
unit_area | Area of property | 116
has_elevator | Boolean value indicating if property has elevator | FALSE
building_floor_count | Number of units in the same floor | 3
unit_floor | Nth floor where property is located | 1
has_storage_area | Boolean value indicating if property has storage area | TRUE
property_age | Years since property was built | 16

While processing the data, some rows were removed from the dataset described in Table 3:

• The values of monthly rent below the 1st percentile and those above the 99th percentile. This trimming removes rent prices that are too cheap or free, and a few properties that are too expensive.

• The properties that do not include their district_uuid or room_qty. The variable district_uuid is used to make suggestions (see Chapter 4.2) and room_qty is part of the model variables (see Chapter 4.4).
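A minimal base-R sketch of the two removal steps listed above; the data frame name `properties` and the assumption that missing values are coded as NA are ours, while the column names follow Table 3.

```r
# Sketch of the row-removal step (data frame name and NA coding are assumptions).
# 1) Keep only monthly rents between the 1st and 99th percentiles.
q <- quantile(properties$monthly_rent, probs = c(0.01, 0.99), na.rm = TRUE)
properties <- properties[properties$monthly_rent >= q[1] &
                         properties$monthly_rent <= q[2], ]

# 2) Drop properties with a missing district_uuid or room_qty.
properties <- properties[!is.na(properties$district_uuid) &
                         !is.na(properties$room_qty), ]
```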


Secondary data

Because the user interactions log focuses on the times at which the different events happened rather than on the interactions themselves, it was necessary to re-arrange that table so that it is easier to visualize the different interactions that the users had with the properties. The resulting dataset has a column for each interaction, and each field holds the number of times that the user has had a certain interaction with a certain property. After the transformation, the new user interactions data has 20 columns (see Table 4) and 287,435 observations.

Table 4 User's interactions counts, secondary data

Variable | Description | Example
item_id | Unique item identifier | 00062bc5-2535-4b1e-bbcb-228526c990b8
user_id | Unique user identifier | 182aa519-83a8-848f-84a1-8697046d84c2
[event_type] (× 18) | Number of times the user had this interaction with the item | 2

The rows belonging to new users were removed because they do not contain information from which a model for those users could be created. These rows are characterized by being filled with zeros, meaning that the user has not had any interaction with the website beyond creating their account.
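A possible sketch of the re-arrangement and filtering described above, using dplyr and tidyr (the thesis does not state which tools were used for this step); the table `user_logs` follows Table 1, and since all-zero rows can only arise if the counts are joined against the full user list, the final filter is shown for completeness.

```r
library(dplyr)
library(tidyr)

# Count how many times each (user, item, event) combination occurs, then spread
# the 18 event types into columns: one count column per event type (Table 4).
user_item_counts <- user_logs %>%
  count(user_id, item_id, event_type) %>%
  pivot_wider(names_from = event_type, values_from = n, values_fill = 0)

# Remove rows whose event counts are all zero (users with no interactions).
event_cols <- setdiff(names(user_item_counts), c("user_id", "item_id"))
user_item_counts <- user_item_counts %>%
  filter(rowSums(across(all_of(event_cols))) > 0)
```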

Summarizing, the properties data used is that of Table 3 and the user interactions data is that of Table 4. Chapter 4.2 addresses which features of Table 3 are going to be used in the model, as well as which events from Table 4 express an interest in an item. Table 1 and Table 2 are deprecated in favour of Table 4.


Chapter 3.

Theoretical Background

In this section, we recall the underlying principles of the SVM’s theory and then expand on it, until the proposed semi-supervised model is introduced. This way, it will be easier for the reader to understand the principles of the suggested method, and to compare it with the classical supervised formulation.

First, the maximal margin classifier is explained and then extended to a more generic form, the Support Vector Machine (SVM). Because the SVM is a supervised model and we do not have all the labels, we must extend it to a semi-supervised form, the Cost-Sensitive Support Vector Machine (CS-SVM). Finally, the CS-SVM, which returns a single solution, is extended to a model capable of returning nested solutions, the Nested Cost-Sensitive Support Vector Machine (NCS-SVM). One can think of the nested solutions as clusters that expand from the instances provided by the active user to the rest of the dataset.

Maximal margin classifier

The basic geometric concept of the maximal margin classifier is the hyperplane, defined in section 3.1.1. A hyperplane can be used to separate instances belonging to two distinct categories, this is a hyperplane classifier, defined in section 3.1.2. The maximal margin classifier in section 3.1.3 is the best solution to the hyperplane classifier. Because the data usually is not linearly separable, it is possible to make a maximal margin classifier where a small number of instances are on the wrong side of the hyperplane, this is called a soft-margin maximal margin classifier and is explained in 3.1.4.

3.1.1. The hyperplane

In a $p$-dimensional space, a hyperplane is an affine subspace of dimension $p-1$ that does not necessarily pass through the origin. In mathematical terms, a hyperplane in the $p$-dimensional space is represented by

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0, \qquad (3.1)$$

where $\beta_0, \dots, \beta_p$ are scalars, not all equal to 0. If an instance $X = (X_1, X_2, \dots, X_p)$ satisfies (3.1), then it is said that $X$ lies on the hyperplane.

Suppose instead that $X$ does not satisfy (3.1), and

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p > 0, \qquad (3.2)$$

or

$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p < 0; \qquad (3.3)$$

then these two equations tell us that $X$ lies on one or the other side of the hyperplane. In other words, we can think of the hyperplane as dividing the $p$-dimensional space into two halves.

3.1.2. Classification using a Separating Hyperplane

Suppose that $X$ is an $n \times p$ data matrix that consists of $n$ training observations in a $p$-dimensional space and that these observations belong to two classes $C \in \{-1, 1\}$. We also have a test observation $\mathbf{x}^* = (x_1^*, \dots, x_p^*)^\top$. The goal is to correctly classify $\mathbf{x}^*$ into one of the $C$ classes using its feature measurements. For this, suppose that it is possible to construct a hyperplane that separates the training observations according to their class label (cf. [10]), that is:

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0, \quad \text{if } y_i = 1 \qquad (3.4)$$

and

$$\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0, \quad \text{if } y_i = -1. \qquad (3.5)$$

Thus, a separating hyperplane has the property that

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) > 0 \quad \text{for all } i = 1, \dots, n. \qquad (3.6)$$

If the hyperplane described in (3.6) exists, it can be used to construct a classifier in which the observation 𝐱∗ is assigned to the class depending on which side of the hyperplane is located.

More formally, the coefficients $\beta_1, \dots, \beta_p$ are called weights and can be grouped into a $p$-dimensional vector $\mathbf{w}$, while $\beta_0$ can be considered the bias term $b$. Then (3.6) can be expressed as [2]:

$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 0, \quad \text{for all } i = 1, \dots, n. \qquad (3.7)$$

For a new observation $\mathbf{x}^*$, the decision to classify it as $-1$ or $1$ is determined by the sign of

$$D(\mathbf{x}^*) = \mathbf{w}^\top \mathbf{x}^* + b. \qquad (3.8)$$

3.1.3. The Maximal Margin Classifier

The maximal margin classifier uses the separating hyperplane that is farthest from the training observations. By computing the perpendicular distance from each training observation to a given hyperplane, we can choose the minimum of such distances. This minimum distance is known as the margin; the maximal margin hyperplane is the hyperplane for which the margin is the largest.

For a maximal margin classifier with coefficients $\beta_0, \beta_1, \dots, \beta_p$, the test observation $\mathbf{x}^*$ will be classified based on the sign of $D(\mathbf{x}^*) = \beta_0 + \beta_1 x_1^* + \beta_2 x_2^* + \cdots + \beta_p x_p^*$, just as in (3.8).

Figure 4 Maximal Margin Classifier

In Figure 4 we see three observations that are lying on the dashed lines showing the maximum margin and that are equidistant from the maximal margin hyperplane. These three observations are called the support vectors of the maximal hyperplane, shown as an orange line. Because the maximal margin hyperplane depends directly on the support vectors, if one of these points is moved even slightly, then the maximal margin hyperplane will be moved as well.

To construct the maximal margin hyperplane on linearly separable data, the following should be done. The Euclidean distance from the separating hyperplane to a training data sample $\mathbf{x}$ is given by $|D(\mathbf{x})| / \|\mathbf{w}\|$ [2]; then all the training data must satisfy

$$\frac{y_k D(\mathbf{x}_k)}{\|\mathbf{w}\|} \geq \delta, \quad \text{for } k = 1, \dots, n, \qquad (3.9)$$

where $\delta$ is the margin.

If $(\mathbf{w}, b)$ is a solution (where $\mathbf{w}$ and $b$ are the coefficients of the hyperplane), then $(a\mathbf{w}, ab)$ is also a solution, where $a$ is a scalar. Then we impose the constraint

$$\delta\,\|\mathbf{w}\| = 1. \qquad (3.10)$$

From (3.9) and (3.10), we need to find the $\mathbf{w}$ with the minimum Euclidean norm that satisfies (3.7), by solving the following minimization problem for $\mathbf{w}$ and $b$ [2]:

$$\text{minimize } Q(\mathbf{w}, b) = \frac{1}{2}\|\mathbf{w}\|^2, \qquad (3.11)$$
$$\text{subject to } y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1, \quad \text{for all } i = 1, \dots, n. \qquad (3.12)$$

This formulation makes our problem a convex optimization problem [2, 12]. In fact, since the squared Euclidean norm in (3.11) makes the objective function quadratic and (3.12) consists of linear inequality constraints, we have a so-called quadratic programming problem. Because of the assumption of linear separability, there exist $\mathbf{w}$ and $b$ that satisfy (3.12). One of the advantages of this optimization problem is that there is only one global minimum, even if the solutions attaining it are non-unique. The vectors that satisfy $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ are the support vectors.

The problem (3.11)-(3.12) can be converted into an equivalent dual optimization problem [2], which is the problem that one usually solves in practice. To do so, one must first convert (3.11) and (3.12) into the unconstrained problem

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^\top\mathbf{w} - \sum_{i=1}^{n} \alpha_i \{ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \}, \qquad (3.13)$$

where the function $L(\mathbf{w}, b, \boldsymbol{\alpha})$ is called the Lagrangian and $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_n)^\top$ are the nonnegative Lagrange multipliers [2]. The Lagrangian must be maximized with respect to $\boldsymbol{\alpha}$ and minimized with respect to the primal variables $\mathbf{w}$ and $b$, i.e., a saddle point has to be found. The optimal solution satisfies the following Karush-Kuhn-Tucker (KKT) complementarity conditions of optimization theory [12]:

$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = 0, \qquad (3.14)$$
$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = -\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad (3.15)$$

where

$$\alpha_i \{ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \} = 0, \quad \text{for } i = 1, \dots, n, \qquad (3.16)$$
$$\alpha_i \geq 0, \quad \text{for } i = 1, \dots, n. \qquad (3.17)$$

From (3.16), either $\alpha_i = 0$, or $\alpha_i \neq 0$ and $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ must be satisfied. In this context, the training data with $\alpha_i \neq 0$ are the support vectors. By using the Lagrangian, (3.14) and (3.15) are reduced to:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i, \qquad (3.18)$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0. \qquad (3.19)$$

Substituting (3.18) and (3.19) into (3.13), the dual problem can be formulated as:

$$\text{maximize } L(\boldsymbol{\alpha}) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j, \qquad (3.20)$$
$$\text{subject to } \sum_{i=1}^{n} y_i \alpha_i = 0, \quad \alpha_i \geq 0 \ \text{for all } i = 1, \dots, n. \qquad (3.21)$$

The concept of zero duality gap in quadratic programming means that the values of the primal and dual problem functions coincide at the optimal solution, if they exist. This problem can be solved efficiently with a specialized algorithm to solve Quadratic Programming problems.
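As an illustration of how the dual (3.20)-(3.21) can be handed to an off-the-shelf quadratic programming solver, the sketch below uses the quadprog package on a linearly separable toy dataset (this is not the solver or the data used in the thesis); a small ridge is added to the quadratic term because quadprog::solve.QP requires strict positive definiteness.

```r
library(quadprog)

# Toy linearly separable data: 20 points per class in two dimensions.
set.seed(2)
X <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),
           matrix(rnorm(40, mean =  2), ncol = 2))
y <- rep(c(-1, 1), each = 20)
n <- nrow(X)

Q    <- (y %*% t(y)) * (X %*% t(X))   # Q_ij = y_i y_j x_i' x_j, cf. (3.20)
Dmat <- Q + diag(1e-8, n)             # small ridge for numerical positive definiteness
dvec <- rep(1, n)                     # solve.QP minimizes 1/2 a'Da - d'a, i.e. -L(alpha)
Amat <- cbind(y, diag(n))             # column 1: equality sum_i y_i alpha_i = 0 (3.21)
bvec <- rep(0, n + 1)                 # remaining columns: alpha_i >= 0

alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution

# Recover the hyperplane from the support vectors, cf. (3.18) and (3.23).
sv <- which(alpha > 1e-5)
w  <- colSums(alpha[sv] * y[sv] * X[sv, , drop = FALSE])
b  <- mean(y[sv] - X[sv, , drop = FALSE] %*% w)
```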

From (3.18), the decision function is given by

$$D(\mathbf{x}) = \sum_{i \in S} \alpha_i y_i \mathbf{x}_i^\top \mathbf{x} + b, \qquad (3.22)$$

where $S$ is the set of support vector indices [2]. Therefore, only the support vectors are necessary to classify a sample. And from the KKT conditions given by (3.16), $b$ equals

$$b = y_i - \mathbf{w}^\top \mathbf{x}_i \quad \text{for } i \in S. \qquad (3.23)$$

Thus, an unlabelled data sample $\mathbf{x}$ will be classified into

$$\begin{cases} \text{Class 1}, & \text{if } D(\mathbf{x}) > 0, \\ \text{Class 2}, & \text{if } D(\mathbf{x}) < 0. \end{cases} \qquad (3.24)$$

Because we construct this classifier assuming that the data is linearly separable, this classifier is known as a hard-margin classifier [2].

3.1.4. The non-separable case

It is common that the assumption of linear separability is not fulfilled by the data. When using linearly inseparable data, the hard-margin classifier has no feasible solution, as there is no "space to fit" a margin. In such cases, we must weaken our constraints and allow for some violations of the margin by introducing the nonnegative slack variables $\boldsymbol{\xi}$ into (3.7):

$$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \text{for all } i = 1, \dots, n. \qquad (3.25)$$

Figure 5 Inseparable Case, note that each point has its own value of 𝛏

With the introduction of the slack variables, a feasible solution always exists [2]. For a training point $\mathbf{x}_i$, if its corresponding slack variable satisfies $0 < \xi_i < 1$, the data point does not have the maximum margin but is still correctly classified (see $\mathbf{x}_j$ in Figure 5). On the other hand, if the corresponding slack variable satisfies $\xi_i \geq 1$, the data point is misclassified by the optimal hyperplane (see $\mathbf{x}_i$ in Figure 5).

To find the optimal hyperplane for which the number of misclassified training data is minimum, the following optimization problem should be solved (cf. [2]):

$$\text{minimize } Q(\mathbf{w}, b, \boldsymbol{\xi}) = \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{\ell}\sum_{i=1}^{n}\xi_i^{\ell}, \qquad (3.26)$$
$$\text{subject to } y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \ \text{for all } i = 1, \dots, n, \qquad (3.27)$$

where $C$ and $\ell$ are the margin parameters that determine the trade-off between the maximization of the margin and the minimization of the classification error; the case $\ell = 1$ is called the L1 soft-margin classifier.

To solve the problem of the L1 soft-margin classifier, the method of Lagrange multipliers is used similarly to the hard-margin classifier (cf. [2]):

$$L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\big(y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{n}\beta_i\xi_i, \qquad (3.28)$$

where $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_n)^\top$ and $\boldsymbol{\beta} = (\beta_1, \dots, \beta_n)^\top$.

The optimal solution satisfies the following KKT conditions:

$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \mathbf{w}} = \mathbf{0}, \qquad (3.29)$$
$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial b} = 0, \qquad (3.30)$$
$$\frac{\partial L(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\beta})}{\partial \boldsymbol{\xi}} = \mathbf{0}, \qquad (3.31)$$
$$\alpha_i\big(y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i\big) = 0, \quad \text{for } i = 1, \dots, n, \qquad (3.32)$$
$$\beta_i\xi_i = 0, \quad \text{for } i = 1, \dots, n, \qquad (3.33)$$
$$\alpha_i \geq 0, \quad \beta_i \geq 0, \quad \xi_i \geq 0, \quad \text{for } i = 1, \dots, n. \qquad (3.34)$$

By differentiating the Lagrangian as in (3.29) and solving for $\mathbf{w}$:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i, \qquad (3.35)$$

the derivatives (3.30) and (3.31) are reduced to

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad (3.36)$$
$$\alpha_i + \beta_i = C, \quad \text{for } i = 1, \dots, n. \qquad (3.37)$$

By substituting (3.35), (3.36) and (3.37) into the Lagrangian (3.28), the dual problem is obtained:

$$\text{maximize } W(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j \mathbf{x}_i^\top\mathbf{x}_j, \qquad (3.38)$$
$$\text{subject to } \sum_{i=1}^{n} y_i\alpha_i = 0, \quad C \geq \alpha_i \geq 0 \ \text{for all } i = 1, \dots, n. \qquad (3.39)$$


The difference between the soft margin classifier and the hard-margin classifier is that α𝑖 cannot exceed C; therefore, the Lagrangian is bounded. The inequalities in (3.39) are the box constraints.

There are three cases for $\alpha_i$ (cf. [2]):

1. $0 < \alpha_i < C$. Then $y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i = 0$ and $\xi_i = 0$. Then $\mathbf{x}_i$ is an unbounded support vector.

2. $\alpha_i = C$. Then $y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i = 0$ and $\xi_i \geq 0$. Therefore $\mathbf{x}_i$ is a bounded support vector. There are two subcases for a bounded support vector: if (i) $0 \leq \xi_i < 1$, $\mathbf{x}_i$ is correctly classified, and if (ii) $\xi_i \geq 1$, $\mathbf{x}_i$ is misclassified.

3. $\alpha_i = 0$. Then $\xi_i = 0$ and $\mathbf{x}_i$ is correctly classified.

The decision function is the same as in the hard-margin classifier (3.22).

In order to ensure the precision of the calculations, $b$ is calculated from the unbounded support vectors:

$$b = \frac{1}{|U|}\sum_{i \in U}\big(y_i - \mathbf{w}^\top\mathbf{x}_i\big), \qquad (3.40)$$

where $U$ is the set of unbounded support vector indices.

Therefore, an unknown data sample 𝐱 is classified with the same decision function used for the hard-margin classifier (3.24).

Support Vector Machines

The support vector machine (SVM) is an extension of the soft margin classifier that allows for the creation of more general decision surfaces than just a separating hyperplane. To achieve this, a special kind of function 𝛟(𝐱) that maps the input data 𝐱1, … , 𝐱𝑛∈ 𝒳 into a high-dimensional feature space ℋ is used, and the linear separation is done there [12]. This function is called kernel and the procedure of mapping 𝐱 into a high-dimensional feature space is called the kernel trick, further explained in Chapter 3.3.

Kernel Trick and Kernels

3.3.1. Kernel Trick

When the two classes cannot be separated by a linear boundary in the input space (see Figure 6), a more flexible decision boundary can be obtained by enlarging the feature space, because, in the enlarged feature space, the boundary is linear. In other words, to allow linear separability, the input space is mapped into a high-dimensional space called the feature space.

Figure 6 The observations belong to two different classes but there is no linear boundary between them

Using a nonlinear function $\boldsymbol{\phi}$ that maps the $m$-dimensional input vector $\mathbf{x}$ to the $l$-dimensional feature space, $\boldsymbol{\phi}(\mathbf{x}) = (\phi_1(\mathbf{x}), \dots, \phi_l(\mathbf{x}))^\top$, the linear decision function in the feature space is

$$D(\mathbf{x}) = \mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}) + b, \qquad (3.41)$$

where $\mathbf{w}$ is an $l$-dimensional vector and $b$ is the bias term.

If a symmetric function $K(\mathbf{x}, \mathbf{x}')$ satisfies

$$\sum_{i,j=1}^{n} h_i h_j K(\mathbf{x}_i, \mathbf{x}_j) \geq 0, \quad \text{for all } i, j = 1, \dots, n, \qquad (3.42)$$

where $h_i \in \mathbb{R}$, then, according to the Hilbert-Schmidt theory [13], there exists a mapping function $\boldsymbol{\phi}(\mathbf{x})$ that maps $\mathbf{x}$ into the dot-product feature space and satisfies

$$K(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}^\top(\mathbf{x})\,\boldsymbol{\phi}(\mathbf{x}'). \qquad (3.43)$$

Then

$$\sum_{i,j=1}^{n} h_i h_j K(\mathbf{x}_i, \mathbf{x}_j) = \left(\sum_{i=1}^{n} h_i \boldsymbol{\phi}^\top(\mathbf{x}_i)\right)\left(\sum_{i=1}^{n} h_i \boldsymbol{\phi}(\mathbf{x}_i)\right) \geq 0 \qquad (3.44)$$

(cf. [2]).


The condition (3.42) is called Mercer's condition, and a function that satisfies (3.42) or (3.44) is called a Mercer kernel, a positive-semidefinite kernel, or just a kernel. A way to think about the kernel is as a function whose purpose is to quantify the similarity of two observations [10].

The kernel trick [11] refers to the use of the Mercer kernel $K(\mathbf{x}, \mathbf{x}')$ in training and classification instead of $\boldsymbol{\phi}(\mathbf{x})$. Kernel-based methods are statistical learning methods that make use of the kernel trick to improve their generalization and/or classification abilities. The improvement comes from the observation that the kernel trick allows using a simple function of two vector variables to implicitly calculate a dot product in an enlarged feature space.
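To make the kernel trick concrete, the short numerical check below (our illustration, not part of the thesis) verifies that the second-degree polynomial kernel $(\mathbf{x}^\top\mathbf{x}' + 1)^2$ equals the dot product of an explicit feature map $\boldsymbol{\phi}$ for two-dimensional inputs, so the enlarged feature space never has to be constructed.

```r
# Explicit degree-2 polynomial feature map for a 2-dimensional input vector.
phi <- function(x) c(1, sqrt(2) * x[1], sqrt(2) * x[2],
                     x[1]^2, x[2]^2, sqrt(2) * x[1] * x[2])

x  <- c(0.5, -1.2)
xp <- c(2.0,  0.3)

k_implicit <- (sum(x * xp) + 1)^2    # kernel evaluated directly in the input space
k_explicit <- sum(phi(x) * phi(xp))  # dot product in the 6-dimensional feature space

all.equal(k_implicit, k_explicit)    # TRUE: both give the same similarity value
```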

Using the kernel, the dual problem is as follows:

$$\text{maximize } W(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j), \qquad (3.45)$$
$$\text{subject to } \sum_{i=1}^{n} y_i\alpha_i = 0, \quad C \geq \alpha_i \geq 0 \ \text{for all } i = 1, \dots, n. \qquad (3.46)$$

The optimization problem is a concave quadratic programming problem, because $K(\mathbf{x}, \mathbf{x}')$ is a positive semidefinite kernel, and it has a global optimum solution [2]; the KKT complementarity conditions are the following:

$$\alpha_i\left(y_i\left(\sum_{j=1}^{n} y_j\alpha_j K(\mathbf{x}_i, \mathbf{x}_j) + b\right) - 1 + \xi_i\right) = 0, \quad \text{for } i = 1, \dots, n, \qquad (3.47)$$
$$(C - \alpha_i)\,\xi_i = 0, \quad \text{for } i = 1, \dots, n, \qquad (3.48)$$
$$\alpha_i \geq 0, \quad \xi_i \geq 0, \quad \text{for } i = 1, \dots, n, \qquad (3.49)$$

with decision function

$$D(\mathbf{x}) = \sum_{i \in S}\alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b, \qquad (3.50)$$

where $S$ is the set of support vector indices and $b$ is again obtained from the unbounded support vectors, analogously to (3.40):

$$b = \frac{1}{|U|}\sum_{i \in U}\left(y_i - \sum_{j \in S}\alpha_j y_j K(\mathbf{x}_j, \mathbf{x}_i)\right),$$

where $U$ is the set of unbounded support vector indices. Therefore, an unlabelled sample will be classified according to:

$$\begin{cases} \text{Class 1}, & \text{if } D(\mathbf{x}) > 0, \\ \text{Class 2}, & \text{if } D(\mathbf{x}) < 0. \end{cases} \qquad (3.53)$$

We see that, as a result, every dot product is replaced by a nonlinear kernel function.

3.3.2. Basic Kernels

For completeness, a brief description of three commonly used kernels follows. They will allow for a more intuitive analysis of the results in the next chapters.

Linear Kernels

When a classification problem is linearly separable in the input space, it is not necessary to map the input space into a high-dimensional space. In this case the linear kernel is used, which makes the SVM equivalent to the L1 soft-margin classifier:

$$K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top\mathbf{x}'. \qquad (3.54)$$

The linear kernel quantifies the similarity of a pair of observations using Pearson correlation. Figure 7 shows an example of a linear kernel used on three datasets. Note that the original data is opaque while the kernel result has transparency. For the second and third datasets, the linear kernel was not able to find a solution.


Polynomial Kernels

A polynomial kernel of degree $d$, where $d \in \mathbb{N}$ and $d > 0$, is given by

$$K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top\mathbf{x}' + 1)^d. \qquad (3.55)$$

It corresponds to fitting a support vector classifier in a higher-dimensional space of polynomials of degree $d$. When $d = 1$, the kernel in (3.55) is the linear kernel plus 1, and it can be made equivalent to the linear kernel by adjusting the value of $b$ in the decision function. In general, it can be proven that polynomial kernels satisfy Mercer's condition [2].

Figure 8 shows the same datasets as for the linear kernel, but this time using a support vector machine with a sixth-degree polynomial kernel. Notice that the complexity of the polynomial function makes it possible to correctly classify the single points found in the first and second lines of the third dataset.

Radial Basis Function Kernels

A kernel with the radial basis function is given by

$$K(\mathbf{x}, \mathbf{x}') = \exp\!\big(-\sigma\|\mathbf{x} - \mathbf{x}'\|^2\big), \qquad (3.56)$$

with a parameter $\sigma > 0$ that controls the radius. The decision function of the radial basis function kernel is localized around the support vectors, as Figure 9 shows for the same three datasets. This local behaviour of the radial kernel does not allow the inclusion of the single points on the first and second lines in the third dataset; note also the rounded edges of the boundaries.
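Since the implementation in this thesis computes kernel matrices with kernlab (see Chapter 4.1), the sketch below shows how the three basic kernels above can be evaluated on a small data matrix; the data, the value of sigma and the polynomial degree are illustrative assumptions.

```r
library(kernlab)

set.seed(3)
X <- matrix(rnorm(20), ncol = 2)   # 10 observations with 2 features each

K_lin  <- kernelMatrix(vanilladot(), X)                                # linear kernel (3.54)
K_poly <- kernelMatrix(polydot(degree = 2, scale = 1, offset = 1), X)  # (x'x + 1)^2, cf. (3.55)
K_rbf  <- kernelMatrix(rbfdot(sigma = 1), X)                           # radial basis kernel (3.56)

dim(K_lin)   # each result is a 10 x 10 similarity (kernel) matrix
```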

Semi-supervised Support Vector Machine

In our problem, the data can show which properties the user is interested in, but not those that the user is not interested in. Hence, this is not a usual binary supervised classification problem that can be solved with the Support Vector Machine from Chapter 3.3. However, we recall from Chapter 1.3.3 that it is possible to train a machine learning model with some quantity of labelled data and the rest unlabelled, using a semi-supervised learning paradigm.

In the set-up of semi-supervised learning used in the following chapters, the term "unchanged" will be used interchangeably with "labelled". One can understand the "unchanged" label as a training instance $\mathbf{x}$ that will always have the same label $y = 1$ or $y = -1$, no matter what the result of its decision function $D(\mathbf{x})$ is. Similarly, the term "changed" will be used interchangeably with "unlabelled" for those training instances $\mathbf{x}$ whose label can be $y = 1$ or $y = -1$ depending on the sign of $D(\mathbf{x})$. For the model in this thesis, the candidates to be recommended come exclusively from the instances with the "changed" label.

With the semi-supervised SVM, the aim is to extend the original SVM to create an unbalanced classifier in which different costs will be applied to the false positives and false negatives. This unbalanced classifier is called the Cost-Sensitive Support Vector Machine (CS-SVM) [14], further explained in Chapter 3.5.

Then, the CS-SVM will be extended once again to be able to grow a set of nested solutions. The nested solutions can be viewed as growing a recommendation set, starting close to the labelled or "unchanged" instances and gradually adding more samples into the cluster. This model is called the Nested Cost-Sensitive Support Vector Machine (NCS-SVM) [15], further explained in Chapter 3.6.

Cost-Sensitive Support Vector Machine

Within the paradigm of supervised learning, the formulation of the Support Vector Machine penalizes misclassifications of the two classes equally, through the parameter $C$. In semi-supervised learning, however, it is more appropriate to think of two costs: if one of the "unchanged" instances is wrongly classified, its penalty should be greater. The Cost-Sensitive SVM or CS-SVM achieves this by adding two parameters to the problem formulation. The first parameter is the cost asymmetry $\gamma$ [16], defined as

$$\gamma = \frac{C_+}{C_+ + C_-}, \qquad (3.58)$$

where $C_+$ and $C_-$ are the costs for the classes $+1$ and $-1$, respectively. This parameter controls the trade-off between the false-positive and false-negative rates. In practice, a range $[0, 1]$ of $\gamma$ is chosen and a classifier is calculated for each value of $\gamma$. When $\gamma = 0.5$ the algorithm reduces to the standard SVM [14].

The second parameter that the CS-SVM adds is the regularization parameter $\lambda$ [16], given by

$$\lambda = \frac{1}{C_+ + C_-}. \qquad (3.59)$$

Let $I_+ = \{i : y_i = +1\}$ be the set of labelled training samples ("unchanged") and $I_- = \{i : y_i = -1\}$ be the set of unlabelled samples, which we want to classify as $-1$ or $+1$. The CS-SVM optimization problem is

$$\text{minimize } Q(\mathbf{w}, \boldsymbol{\xi}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + \gamma\sum_{i \in I_+}\xi_i + (1-\gamma)\sum_{i \in I_-}\xi_i, \qquad (3.60)$$
$$\text{subject to } y_i\,\mathbf{w}^\top\boldsymbol{\phi}(\mathbf{x}_i) \geq 1 - \xi_i, \quad \xi_i \geq 0 \ \text{for all } i = 1, \dots, n.$$

To gain intuition on how the different values of $\gamma$ affect the boundaries in a simple two-dimensional example, see Figure 10; the points on one side of each boundary are classified into one class and the points on the other side into the other class.

As with the original SVM, the problem (3.60) is solved through its dual, which can be written as

$$\text{maximize } W(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2\lambda}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j), \qquad (3.61)$$
$$\text{subject to } 0 \leq \alpha_i \leq \mathbf{1}_{\{y_i<0\}} + y_i\gamma, \quad \text{for all } i = 1, \dots, n,$$

where the indicator function $\mathbf{1}_{\{y_i<0\}}$ returns 1 for $y_i = -1$ and 0 otherwise.

As in the original SVM, a test sample $\mathbf{x}$ is classified according to the sign of the decision function

$$D(\mathbf{x}) = \frac{1}{\lambda}\sum_{i \in S}\alpha_{i,\gamma}\, y_i K(\mathbf{x}_i, \mathbf{x}), \qquad (3.62)$$

where $S$ is the set of indices of the support vectors.

Figure 10‡ Example of different boundaries obtained using three different cost asymmetries, γ = 0.5, 0.7, 0.9

Nested Cost-Sensitive Support Vector Machine

There are situations where it is desirable to have nested solutions, as in clustering; "nested solutions" means that each previous solution is a subset of the new solution. For such situations, [15] describes the Nested Cost-Sensitive SVM or NCS-SVM. The NCS-SVM forces the boundaries obtained for different cost asymmetries $\gamma$ to be nested.

The Nested Cost-Sensitive SVM is constructed in a two-step process [15]. In the first step, a finite number of cost asymmetries $0 = \gamma_1 < \gamma_2 < \cdots < \gamma_{M-1} < \gamma_M = 1$ is chosen a priori, and a family of nested decision sets is generated at these preselected asymmetries. This is achieved by adding constraints to the dual problem (3.61) of the Cost-Sensitive SVM. In the second step, the solution coefficients of the finite nested collection are interpolated to a continuous nested family for all $\gamma$.


3.6.1. First step: Finite family of nested sets

The NCS-SVM finds decision functions simultaneously for the cost asymmetries $\gamma_1, \gamma_2, \dots, \gamma_M$ by maximizing the sum of the standard CS-SVM duals (3.61) at each $\gamma$ and by imposing additional constraints that force the sets to be nested. The new dual problem for the NCS-SVM becomes:

$$\underset{\boldsymbol{\alpha}_1, \dots, \boldsymbol{\alpha}_M}{\text{maximize}} \ W(\boldsymbol{\alpha}) = \sum_{m=1}^{M}\left[\sum_{i=1}^{n}\alpha_{i,m} - \frac{1}{2\lambda}\sum_{i,j=1}^{n}\alpha_{i,m}\alpha_{j,m}y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)\right], \qquad (3.63)$$
$$\text{subject to } 0 \leq \alpha_{i,m} \leq \mathbf{1}_{\{y_i<0\}} + y_i\gamma_m, \quad \forall i, m, \qquad (3.64)$$
$$y_i\alpha_{i,1} \leq y_i\alpha_{i,2} \leq \cdots \leq y_i\alpha_{i,M}, \quad \text{for all } i = 1, \dots, n, \qquad (3.65)$$

where $\boldsymbol{\alpha}_m = (\alpha_{1,m}, \dots, \alpha_{n,m})$ and $\alpha_{i,m}$ is the coefficient for point $\mathbf{x}_i$ and cost asymmetry $\gamma_m$. The optimal solution $\boldsymbol{\alpha}_m^* = (\alpha_{1,m}^*, \dots, \alpha_{n,m}^*)$ defines the decision function $D_{\gamma_m}(\mathbf{x})$ (see (3.66)) and its corresponding decision set $\hat{G}_{\gamma_m}$ (see (3.67)) for each $m$:

$$D_{\gamma_m}(\mathbf{x}) = \frac{1}{\lambda}\sum_{i=1}^{n}\alpha_{i,m}^*\, y_i K(\mathbf{x}_i, \mathbf{x}), \qquad (3.66)$$
$$\hat{G}_{\gamma_m} = \{\mathbf{x} : D_{\gamma_m}(\mathbf{x}) > 0\}. \qquad (3.67)$$

3.6.2. Second step: Interpolation

To interpolate for an intermediate cost asymmetry $\gamma$ between two of the preselected cost asymmetries $(\gamma_1, \gamma_2)$, we can write $\gamma = \epsilon\gamma_1 + (1-\epsilon)\gamma_2$ for some $\epsilon \in [0, 1]$. Then the coefficients $\alpha_i^*(\gamma)$ are defined by [15]:

$$\alpha_i^*(\gamma) = \epsilon\,\alpha_{i,1}^* + (1-\epsilon)\,\alpha_{i,2}^*. \qquad (3.68)$$

Thus, the positive decision set at cost asymmetry $\gamma$ is

$$\hat{G}_{\gamma} = \left\{\mathbf{x} : D_{\gamma}(\mathbf{x}) = \frac{1}{\lambda}\sum_{i=1}^{n}\alpha_i^*(\gamma)\, y_i K(\mathbf{x}_i, \mathbf{x}) > 0\right\}. \qquad (3.69)$$
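A small R sketch of the interpolation step (3.68)-(3.69); `alpha1` and `alpha2` stand for the optimal NCS-SVM coefficient vectors at two adjacent preselected cost asymmetries, and `K`, `y` and `lambda` are assumed to come from the fitted model.

```r
# Interpolated coefficients for gamma = eps * gamma1 + (1 - eps) * gamma2, cf. (3.68).
interpolate_alpha <- function(alpha1, alpha2, eps) {
  eps * alpha1 + (1 - eps) * alpha2
}

# Decision values at the interpolated gamma, cf. (3.69).
# K: kernel matrix of the n training points against the points to be scored;
# y: the +1/-1 training labels; lambda: the regularization parameter.
decision_values <- function(alpha, y, K, lambda) {
  as.vector(t(K) %*% (alpha * y)) / lambda
}

# Usage (all inputs assumed to come from the fitted NCS-SVM):
# alpha_mid <- interpolate_alpha(alpha1, alpha2, eps = 0.5)
# in_set    <- decision_values(alpha_mid, y, K, lambda) > 0   # membership in G_hat
```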

The decision sets produced by the NCS-SVM are nested: for any two cost asymmetries $0 \leq \gamma_\epsilon < \gamma_\delta \leq 1$, we have $\hat{G}_{\gamma_\epsilon} \subset \hat{G}_{\gamma_\delta}$.

Proof:

1. Let $\boldsymbol{\alpha}_1^*$ and $\boldsymbol{\alpha}_2^*$ denote the optimal solutions for $\gamma_1$ and $\gamma_2$. Then $\sum_i \alpha_{i,1}^* y_i K(\mathbf{x}_i, \mathbf{x}) \leq \sum_i \alpha_{i,2}^* y_i K(\mathbf{x}_i, \mathbf{x})$. Therefore $\hat{G}_{\gamma_1} = \{\mathbf{x}: D_{\gamma_1}(\mathbf{x}) > 0\} \subset \hat{G}_{\gamma_2} = \{\mathbf{x}: D_{\gamma_2}(\mathbf{x}) > 0\}$. This result can be extended to the subsequent solutions $\boldsymbol{\alpha}_3^*, \boldsymbol{\alpha}_4^*, \dots, \boldsymbol{\alpha}_M^*$, whose decision sets satisfy $\hat{G}_{\gamma_3} \subset \hat{G}_{\gamma_4} \subset \cdots \subset \hat{G}_{\gamma_M}$.

2. The linear interpolation (3.68) and the nesting constraints (3.65) imply $y_i\alpha_{i,1}^* \leq y_i\alpha_i^*(\gamma) \leq y_i\alpha_{i,2}^*$, which, in turn, leads to $\sum_i \alpha_{i,1}^* y_i K(\mathbf{x}_i, \mathbf{x}) \leq \sum_i \alpha_i^*(\gamma) y_i K(\mathbf{x}_i, \mathbf{x}) \leq \sum_i \alpha_{i,2}^* y_i K(\mathbf{x}_i, \mathbf{x})$. Therefore, if $\gamma_m < \gamma < \gamma_{m+1}$, then $\hat{G}_{\gamma_m} \subset \hat{G}_{\gamma} \subset \hat{G}_{\gamma_{m+1}}$.

3. Consider an arbitrary $0 \leq \gamma_\epsilon < \gamma_\delta \leq 1$. If $\gamma_\epsilon \leq \gamma_m \leq \gamma_\delta$ for some $m$, then $\hat{G}_{\gamma_\epsilon} \subset \hat{G}_{\gamma_\delta}$ by the results in step 2. Suppose this is not the case and assume $\gamma_1 < \gamma_\epsilon < \gamma_\delta < \gamma_2$ without loss of generality. Then there exist $\epsilon > \delta$ such that $\gamma_\epsilon = \epsilon\gamma_1 + (1-\epsilon)\gamma_2$ and $\gamma_\delta = \delta\gamma_1 + (1-\delta)\gamma_2$. Suppose $\mathbf{x} \in \hat{G}_{\gamma_\epsilon}$. Then $\mathbf{x} \in \hat{G}_{\gamma_2}$; hence $D_{\gamma_\epsilon}(\mathbf{x}) = \frac{1}{\lambda}\sum_i (\epsilon\alpha_{i,1}^* + (1-\epsilon)\alpha_{i,2}^*)\, y_i K(\mathbf{x}_i, \mathbf{x}) > 0$ and $D_{\gamma_2}(\mathbf{x}) = \frac{1}{\lambda}\sum_i \alpha_{i,2}^* y_i K(\mathbf{x}_i, \mathbf{x}) > 0$. By forming $(\delta/\epsilon)D_{\gamma_\epsilon}(\mathbf{x}) + (1 - \delta/\epsilon)D_{\gamma_2}(\mathbf{x})$, we obtain $D_{\gamma_\delta}(\mathbf{x}) = \frac{1}{\lambda}\sum_i (\delta\alpha_{i,1}^* + (1-\delta)\alpha_{i,2}^*)\, y_i K(\mathbf{x}_i, \mathbf{x}) > 0$. Thus $\hat{G}_{\gamma_\epsilon} \subset \hat{G}_{\gamma_\delta}$.

With this demonstration we proved that any two sets from the NCS-SVM are nested if a positive semi-definite kernel is used. This property will be later exploited to add recommendations into the active user’s recommendation set.

Model Assumptions in Semi-Supervised Learning and Parameter Tuning

A particularly important requirement about semi-supervised learning that has not been discussed in this thesis yet, is that one must make the correct assumptions about the link between the distribution of the unlabelled data 𝑃(𝐱) and the target label 𝑦. If the assumptions hold, then using the labelled and unlabelled data results in a more reliable estimate of the model [7].

Thus far, the CS-SVM does not make any assumption [17] and relies on the unbalanced costs for the "labelled" and "unlabelled" samples. On the other hand, the NCS-SVM only assumes that the kernel is positive semidefinite [15]. But there is one last step missing in the model which requires a strong assumption: parameter tuning. The parameters to be tuned are the cost asymmetry $\gamma$, the regularization $\lambda$ and the choice of kernel. Parameter tuning, for the proposed model, is about setting a break point in the set of recommendations at the point where the recommendations stop being meaningful to the user. The assumption being made is that the "changed" and "unchanged" distributions are clustered together in the input space [14], the so-called "cluster assumption".

In [18], we can find insights about the semi-supervised learning assumptions and their linkage. The authors argue that the cluster assumption is a generalization of three other assumptions:


1. Smoothness assumption: For two input instances $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$ that are close to each other in the input space, their corresponding labels $y, y'$ should be the same.

2. Low-density assumption: This assumption implies that the decision boundary of the classifier should pass through a low-density region of the input space. If the boundary passes through a high-density region, it could be crossing through the middle of a cluster.

3. Manifold assumption: States that the input space is composed of multiple lower-dimensional manifolds§ on which all instances lie, and that instances in the same manifold have the same label.

Then, considering that the cluster assumption is a generalization of the low-density assumption, parameter tuning can be done by analysing the boundary between the two labels. The basic idea is that the boundary should pass through the low-density region separating the labels. To analyse the density, we can measure the distances between the samples that are closest to the boundary and the boundary itself. These distances are inversely related to the density in these regions.

Figure 11. Low-density criterion principle. Left side: low-density region; right side: high-density region.

Figure 11 is a simple diagram with six instances whose colours represent the two labels; the red line is the boundary that separates them. The closest instances of the black label are connected to the closest instances of the white label: the closest instance of one label is connected to the closest instance of the other, the second closest to the second closest, and so on.
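A minimal sketch of how this criterion could be computed is given below. It assumes that a fitted model provides a vector of decision values D(x) for the labelled samples and uses |D(x)| as a proxy for the distance to the boundary; the function name, its arguments and the label encoding (+1 for “unchanged”, -1 for “changed”) are assumptions made only for this illustration.

low_density_score <- function(decision_values, labels, k = 3) {
  # |D(x)| of the k samples of each label that lie closest to the boundary,
  # paired as in Figure 11; larger summed distances suggest that the
  # boundary crosses a low-density region
  d_unchanged <- head(sort(abs(decision_values[labels ==  1])), k)
  d_changed   <- head(sort(abs(decision_values[labels == -1])), k)
  sum(d_unchanged) + sum(d_changed)
}

# hypothetical usage: keep the parameter setting with the largest score
# scores <- sapply(seq_along(gammas), function(j) low_density_score(D[, j], y))
# gamma_best <- gammas[which.max(scores)]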


Chapter 4.

Methodology

Software

The models were implemented in R, according to the theory presented in Chapter 3. The implementation of the Nested Cost-Sensitive SVM is an adaptation of the MATLAB code written by the original author of [15], available in [20]. To make the adaptation easier, the R packages pracma [21] and modopt.matlab [22] (for the quadratic programming solver) were used. The package pracma provides a catalogue of numerical analysis and linear algebra functions with MATLAB-like names, which makes porting easier, and the package modopt.matlab provides optimization functions with a MATLAB-like interface. To compute the kernel matrices, the package kernlab [23] provides the function kernlab::kernelMatrix along with the kernel functions in kernlab::dots.
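As a hedged illustration of this tooling (a toy example, not the actual NCS-SVM port), the sketch below computes a Gaussian kernel matrix with kernlab and solves a standard SVM-style dual with the MATLAB-style quadprog interface provided by modopt.matlab; the data, the box constraint and the assumption that the solver returns the solution in the list element x (following the MATLAB convention) are placeholders.

library(kernlab)
library(modopt.matlab)

set.seed(1)
X <- matrix(rnorm(20), ncol = 2)               # toy data: 10 samples, 2 features
y <- c(rep(1, 5), rep(-1, 5))

K <- kernelMatrix(rbfdot(sigma = 1), X)        # Gaussian kernel matrix
# other kernels used in the thesis: polydot(degree = 2), polydot(degree = 4)
# and vanilladot() for the linear kernel

# toy dual of the form  min 0.5 a'Ha + f'a  s.t.  sum(y_i a_i) = 0, 0 <= a_i <= 1
# (the actual H, f and constraints of the NCS-SVM are those given in Chapter 3)
H   <- outer(y, y) * as.matrix(K) + diag(1e-8, 10)   # small ridge for stability
f   <- rep(-1, 10)
Aeq <- matrix(y, nrow = 1)
beq <- 0
lb  <- rep(0, 10)
ub  <- rep(1, 10)

sol   <- quadprog(H, f, NULL, NULL, Aeq, beq, lb, ub)
alpha <- sol$x                                 # dual solution (assumed element name)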

Sub-setting the training data for an “active user”

The starting point for a recommender system is the active user, the user who is searching for a new apartment or house to live in. At first, the active user may browse through an area of interest (city, district or block) where they want to live and click on the properties that catch their attention to see their descriptions and features. Those clicks may be counted implicitly as positive feedback or a “like”, as may the action of asking for similar properties by clicking a button named “suggest similar”; according to the dataset these two actions are independent of each other, i.e., the user can ask for similar properties without visiting the property homepage. The items in which the user has shown interest correspond to the “unchanged” label for training the model. Once the user has shown interest in a small set of items, the model can be built. In this thesis, ten items were considered the minimum amount of labelled data required to build the model.
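A hedged sketch of this sub-setting step is shown below. The data frame interactions, its columns user_id, property_id and action, and the action values "property_view" and "suggest_similar" are hypothetical placeholders for the actual schema of Table 4, and active_user stands for the identifier of the current user.

# collect the active user's implicit positive feedback ("unchanged" items)
liked <- unique(interactions$property_id[
  interactions$user_id == active_user &
    interactions$action %in% c("property_view", "suggest_similar")
])

# at least ten labelled items are required before building the model
stopifnot(length(liked) >= 10)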

For the “changed” label, two sets of properties are created: (i) a set of “collaborative” properties, i.e. suggestions derived from similar users, and (ii) a set of properties located in the same districts as those liked by the user and those in the collaborative set. The aim of creating these two sets is to reduce the size of the unlabelled data and, at the same time, to focus on the properties that could be most interesting to the active user.

To create the first set, the users’ interaction dataset (see Table 4) is used. In this dataset, the interest is in finding “similar users”. A similar user is a user who has also seen at least one of the properties that the active user has seen. The list of similar users is then used to extract the properties that these similar users have liked.

For the second set, a list of districts is created from the properties belonging to the liked set and the first set. This list of districts is then used to filter the properties data and extract all the properties located in those districts. Since the data is anonymized, it cannot be verified whether the location of the properties contributed by this set makes sense. If the observations from this set exceed 300 properties, 300 random samples are taken from it.
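The construction of the two candidate sets could look roughly as follows. As above, the objects interactions and properties and their columns (user_id, property_id, district) are hypothetical placeholders, and liked is the set of properties the active user has shown interest in.

# (i) collaborative set: properties liked by users who interacted with at
# least one of the active user's liked properties
similar_users <- unique(interactions$user_id[
  interactions$property_id %in% liked & interactions$user_id != active_user
])
collaborative <- setdiff(
  unique(interactions$property_id[interactions$user_id %in% similar_users]),
  liked
)

# (ii) district set: properties in the same districts as the liked and
# collaborative properties, capped at 300 random samples
districts <- unique(properties$district[
  properties$property_id %in% c(liked, collaborative)
])
district_set <- setdiff(
  properties$property_id[properties$district %in% districts],
  c(liked, collaborative)
)
if (length(district_set) > 300) district_set <- sample(district_set, 300)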

Model selection

Since the dataset presented in Chapter 2 tells us what the users are interested in but not what they are not interested in, the standard supervised SVM for two-class classification from Chapter 3.2 cannot be used. Instead, a semi-supervised SVM must be applied, in which the model can be trained using only one label. A semi-supervised alternative to the standard SVM is the one-class support vector machine (OC-SVM) [12]. OC-SVMs can be trained using only the “labelled” data points and aim to estimate the underlying probability density. However, an issue arises because the parameters of the kernel function do not fit every user, the result is static (not nested), and the data from the two “unlabelled” sets is not exploited to supply information to the model.

Therefore, the NCS-SVM presented in Chapter 3.6 is proposed as the recommender model because it allows growing nested clusters with information from both the “labelled” and the “unlabelled” data. These nested clusters mean that the model is not static, because it is possible to gradually increase the extent of the cluster up to the total size of the dataset by varying the value of γ. Note that, even though both sets are used for training, the model is not a supervised binary classification problem, because the real class of the unlabelled data is not available. Note also that this is not an unsupervised clustering model, because the labelled data is used to grow the cluster while the unlabelled data supplies additional information about the characteristics of the cluster.
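The following minimal sketch illustrates how the nested structure could be used operationally: given a matrix of decision values evaluated over a grid of cost asymmetries, increasing γ only ever adds candidates to the recommendation set. The function and object names are illustrative and not part of the thesis implementation.

recommend <- function(D, candidate_ids, j) {
  # D: matrix of decision values, one row per candidate property and one
  # column per value of gamma on the grid; recommend a candidate when D > 0
  candidate_ids[D[, j] > 0]
}

# nestedness guarantees that a smaller gamma yields a subset of the
# recommendations obtained with a larger gamma, e.g.
# all(recommend(D, ids, 3) %in% recommend(D, ids, 8))   # TRUE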

Calculating the NCS-SVM

To train the Nested Cost-Sensitive SVM, the properties in which the active user has shown interest hold the “unchanged” labels. The collaborative and the district-filtered sets are the “changed” instances, those for which we want to predict the eventual interest of the user.
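A hedged sketch of this label assignment, reusing the hypothetical objects liked, collaborative and district_set introduced above, is:

# +1 encodes "unchanged" and -1 encodes "changed", matching the convention
# used in the earlier sketches
train_ids    <- c(liked, collaborative, district_set)
train_labels <- ifelse(train_ids %in% liked, 1, -1)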

Only three of the ten property features were chosen, to keep the presentation of the results simple, although more variables can be added to make the model more complex and give better recommendations to the users. The chosen features were the monthly rent, unit area and room quantity. Monthly rent and unit area as these continuous variables are considered particularly
