
Efficient Features for Movie Recommendation Systems

SUVIR BHARGAV

Master's Degree Project, Stockholm, Sweden, October 2014

XR-EE-KT 2014:012


Efficient Features for Movie Recommendation Systems

SUVIR BHARGAV

Master's Thesis at VionLabs AB
Supervisor: Roelof Pieters

Examiner: Markus Flierl

XR-EE-KT 2014:012


Abstract

User written movie reviews carry substantial amounts of movie-related features such as descriptions of location, time period, genres, characters, etc. Using natural language processing and topic modeling based techniques, it is possible to extract features from movie reviews and find movies with similar features. In this thesis, a feature extraction method is presented and the use of the extracted features in finding similar movies is investigated. We perform text pre-processing on a collection of movie reviews, extract topics from the collection using topic modeling techniques, and store the topic distribution for each movie.

A similarity metric, the Hellinger distance, is then used to find movies with a similar topic distribution. Furthermore, the extracted topics are used as an explanation during subjective evaluation. Experimental results show that our extracted topics represent useful movie features and that they can be used to find similar movies efficiently.


Acknowledgements

This thesis has been carried out at Vionlabs AB. From the initial idea to the final execution, everyone at Vionlabs supported the endeavour to build and create something around movies and technology. I would like to thank my supervisor, Roelof Pieters, for his guidance and for the many endless discussions around NLP, topic modeling and movie recommendation systems.

I would also like to thank the main author of the Gensim library, Radim Řehůřek, for his endless suggestions and ideas. I extend my gratitude to the great community of programmers and engineers who took the time to reply and give suggestions to my questions on Stack Overflow.

I would like to thank my coordinator and examiner, Markus Flierl, for giving valuable guidance and suggestions at each stage of the project.

I would also like to thank all the movie judges at Vionlabs for their time and effort in rating movies. In the end, I would like to thank my family and friends, who constantly supported me throughout the thesis.


Contents

List of Figures

1 Introduction
1.1 Question
1.2 Goals
1.3 Outline

2 Background
2.1 Movie Data Processing: A Literature review
2.2 Document representation
2.3 Topic Modeling
2.3.1 Overview
2.3.2 Latent Dirichlet Allocation
2.4 Similarity Metrics
2.4.1 Cosine Similarity
2.4.2 Kullback-Leibler (KL) divergence
2.4.3 Hellinger Distance

3 Recommendation based on Movie Topics
3.1 User Reviews of Movies as Data
3.2 Text Preprocessing
3.3 Feature Extraction
3.3.1 Overview
3.3.2 Movie Topics
3.4 Topic Similarity

4 Experimental Setup and Results
4.1 Experimental Setup
4.1.1 Text processing
4.1.2 Training LDA model
4.1.3 Calculating movie similarity
4.2 Evaluation
4.2.1 Evaluation criteria
4.2.2 Web based movie evaluation setup
4.3 Results
4.3.1 Evaluation result
4.3.2 Rating correlation
4.3.3 Observations on Subjective evaluation

5 Conclusion and Future Directions
5.1 Conclusion
5.2 Future Directions
5.2.1 Movie review preprocessing
5.2.2 Building complex topic models

Bibliography


List of Figures

2.1 Vector Space Model of documents. Figure by Pyevolve [24].
2.2 The graphical model for latent Dirichlet allocation. Each node is a random variable in the generative process. The shaded circle represents the observed variable, i.e. the words of the documents; unshaded circles are hidden variables. Plates represent replication: N denotes the words within a document and D the collection of documents. Figure taken from Blei's paper [20].
2.3 Angle between two documents in a 2-d document-term space.
3.1 The overall system showing all steps involved. The system works by preprocessing reviews, training an LDA model and extracting topics from it. The topics are later used to find similar movies.
3.2 Screenshot showing a sample movie review taken from IMDB. Highlighted words are relevant features that can be used for finding similar movies.
3.3 Collection and preprocessing of movie reviews.
3.4 Preprocessing of movie reviews is done in parallel by spawning subprocesses for the available number of CPU cores. Representation inspired by Chris Kiehl's blog [37].
3.5 Tree showing the NLTK based chunking technique applied on movie data.
3.6 Sample topics generated from user movie reviews for the movie Gravity.
3.7 Cosine similarity and Hellinger distance show a strong positive correlation. The X-axis shows the similarity score for the Hellinger distance, whereas the Y-axis represents the cosine similarity score.
4.1 A tree diagram showing the movie review corpus.
4.2 Chart showing the genres of popular movies from the last 10 years.
4.3 A visualization showing 20 topics generated from 100 movie reviews. The vertical axis represents movie reviews denoted by their corresponding ids, while the horizontal axis represents movie topics.
4.4 Front page of the movie evaluation system, showing five target movies. A user clicks on a target movie and five similar movies are presented for evaluation.
4.5 Web based movie evaluation system. Shown on the left is a target movie; the front page upon log-in showed 10 target movies.
4.6 Movie evaluation system with explanation.
4.7 Result of average rating for Genre (top) and Genre with explanation (bottom).
4.8 Result of average rating for Mood (top) and Mood with explanation (bottom).
4.9 Result of average rating for Plot (top) and Plot with explanation (bottom).
4.10 Result of average rating for Overlap (top) and Overlap with explanation (bottom).
4.11 Average ratings for the movie topics.
4.12 Strong positive correlation between Genre and Mood.
4.13 Strong positive correlation of ratings between two judges. The judges agree most on rating 1, then on ratings 2 and 3.


Chapter 1

Introduction

The advent of movie streaming services has made thousands of movies available at the click of a button [1]. We now have movies not only from Hollywood, but also from international cinema, documentaries, indie movies, etc. With so many movies at hand, the consumer faces the dilemma of what to watch. At the end of the day, people just want to relax and watch something that matches their mood, taste and style. This is where Recommendation Systems (RS) can help, suggesting movies that match user taste and viewing habits. In order to recommend movies, we need to understand movies first. The more we understand movie features (genre, keywords, mood, etc.), the better recommendations we can serve.

Commercial streaming services such as Netflix [2] and Jinni [3] combine semantic information about movies with user ratings to get an optimal hybrid RS. However, they still depend on human taggers [4], [5] for the basic feature representations needed to classify movies or songs. Although the results obtained from human taggers are quite good, such an approach is definitely not scalable when tagging hundreds of thousands of movies or millions of daily generated videos.

For a system to understand a movie, it needs movie features such as the movie cast, genre, plot, etc. With this information, a system can better categorize movies. User written movie reviews are one such source of features. They carry substantial amounts of movie related information such as location, time period, genre, lead characters and memorable scene descriptions.

Since a user written movie review contains both useful (i.e. keywords) and useless (i.e. stopwords) information, some text pre-processing is required before it can be used by a RS. With pre-processed movie data, the next step is to find a good feature representation for movies. In this thesis, we explore feature extraction from movie reviews using Natural Language Processing (NLP) and topic modeling techniques and use the extracted features to find similar movies. The experiments are done on a small set of movies to show that movie topics are efficient features for RS.


1.1 Question

• Is it possible to extract or generate movie features from user reviews?

• Is it possible to use extracted features to find similar movies?

• What is a good feature and how can we distinguish good features from bad ones?

1.2 Goals

The goals of this master thesis are:

• Extract movie features from user reviews of movies.

• Investigate extracted features to find similar movies.

• Draw conclusions about the performance of the developed prototype system.

1.3 Outline

This work is presented in the following chapters:

• Chapter 2 discusses the background study done during the project. Technical concepts that have been used in the project will be presented.

• Chapter 3 presents a recommendation system based on movie topics. Major steps involved in topic extraction are discussed in detail. The chapter closes by discussing the implementation of similarity metrics used to find similar movies based on topics.

• Chapter 4 presents the experimental setup, evaluation system and results.

• Chapter 5 concludes the project and discusses future directions.


Chapter 2

Background

A Recommendation System provides items and suggestions to a person based on his or her interests and past usage history. Such systems are the backbone of many of today's content streaming services, such as Netflix (www.netflix.com), Pandora (www.pandora.com) and YouTube (www.youtube.com). Recommendation Systems (RS) are usually classified based on the approach used to filter information: content based filtering, collaborative filtering (based on user activities) and hybrid (combining both).

Collaborative filtering based RS have seen much interest lately because of the Netflix competition [6], whereas content based systems face the challenge of efficient feature representation of meta-data from audio, video and text. Luckily for the movie domain, a lot of textual information is readily available, such as plot lines, dialogues and reviews.

2.1 Movie Data Processing: A Literature review

Movie data in the form of keywords, scripts, dialogue and reviews has been used in research over the past decade [7]–[10]. [8] explores movie recommendation using cultural metadata such as user comments, plot outlines, keywords, etc., and shows the highest precision with user comments. The report [9] discusses movie classification using NLP based methods such as a Named Entity Recognizer (NER) and a Part-of-Speech (POS) tagger with movie scripts as input. It concludes that NLP based features perform well compared to non-NLP features (without the use of NER and POS), although it reports only 50% accuracy because of the small corpus size.

We decided to use movie reviews written by moviegoers primarily because a) they are easily available [7], [11] and computationally inexpensive to process on off-the-shelf hardware; b) each movie review can be considered a single document representing the movie, which allows us to use document based classification methods on movies; and c) the simple heuristic of combining several user written reviews of a single movie into a single document has the potential to reveal semantic patterns at the movie level.

Furthermore, combining all the individual movie documents into a collection allows us to explore patterns across the collection (essentially, across genres).

In order to use movie reviews as data, it is necessary to remove irrelevant words, symbols, HTML tags, etc. In NLP, a large number of open source tools and libraries [12]–[14] are available and used as the first step in any kind of text processing. Chapter 7 of [12] describes the steps involved in extracting information from text, and [15] uses the NLTK toolkit for the stopword removal and stemming steps. Noisy text data can drastically affect the result of any kind of NLP based model training. Text filtering removes unnecessary information and allows us to use complex mathematical models on the data.

The paper [8] compares the results obtained by preprocessing meta-data; it simply computes the cosine similarity from a document-term matrix of movie data. Although the paper showed the highest precision with user comments, it did not analyze the data further with advanced techniques such as LSA. After preprocessing, reducing the data dimensionality is the next step in feature extraction.

The document-term matrix is used as the input to many semantic analysis techniques, from the basic tf-idf scheme to complex models such as LSI, LSA and LDA. These dimensionality reduction techniques can yield semantic features of big data on off-the-shelf hardware [13]. Such models are interesting candidates for investigating semantic concepts in movie data. The thesis [10] studies sentiment analysis on movie reviews using LSA but concludes that the dimensions that capture the most variance are not always the most discriminating features for classification.

On the other hand, [16] shows interesting results when using topic modeling in a content based RS. Probabilistic topic modeling allows us to extract hidden features, i.e. "topics", from documents. LDA, a model based on topic modeling, shows good results in both document clustering [17] and recommendation systems [18], [19]. It can capture important intra-document statistical structure by considering mixture models for the exchangeability of both words and documents within a corpus [20]. With probabilistic techniques such as LDA, it is possible to derive semantic similarities from textual movie data, and such extracted semantic information can be used to find similar movies. Moreover, LDA can assign topic distributions to new unseen documents, an important requirement for building a scalable movie RS, since it should be trivial to add new movies on a regular basis. For a RS, computing similarity is an essential part, be it similarity of content or of user ratings.

Clustering, an unsupervised classification of patterns, is a technique applied to movie meta-data in RS. The review paper on clustering [21] briefly discusses similarity measures and emphasizes that similarity is fundamental to clustering. [22], [23] provide detailed studies of commonly used similarity measures in text clustering. Since the input data for our project, i.e. movie reviews, are text documents, the similarity measures discussed in [22] are a natural starting point.

Once the similarity between movies is computed, it is important to evaluate the obtained result. For unsupervised learning techniques such as LDA, evaluation is still a challenge. Since our project is based on movies, subjective evaluation is an obvious choice, as the movies are ultimately watched by people.

In the end, even though topic modeling has shown good results in recommender systems [16], [19], it has hardly been explored for movie recommendation. Understanding movie data still faces challenges, and we need algorithms with semantic understanding to solve them.

2.2 Document representation

Before stepping into NLP based techniques, it is important to understand basic document representation. Let's say we have a set of documents and, for the sake of simplicity, each document consists of a single sentence. We can represent such a model in a vector space as shown in Figure 2.1. Such a representation is called the Vector Space Model (VSM). Each word corresponds to a dimension and each document is a vector with non-negative values on each dimension. Figure 2.1 is an example in a 3-dimensional space, but in practice the document space usually runs into tens of thousands of dimensions. The VSM allows us to take 2- and 3-dimensional geometric formulae and extend them to m dimensions, where m is the number of distinct terms appearing in the set of documents.

Figure 2.1. Vector Space Model of documents. Figure by Pyevolve [24].

To represent document-term as a vector, consider each word as a term. Obviously, some terms appear more frequently and are considered more important for a document. Let $D = \{d_1, \ldots, d_n\}$ be a corpus of documents and $T = \{t_1, \ldots, t_m\}$ the set of distinct terms occurring in $D$. Let $tf(d, t)$ denote the frequency of term $t \in T$ in document $d \in D$. For document $d$, we can then define the $m$-dimensional vector $\vec{t}_d$ [22] as

$$\vec{t}_d = (tf(d, t_1), \ldots, tf(d, t_m))$$

In practice, more complicated schemes such as tf-idf weighting are used, but for basic prototyping the document-term vector $\vec{t}_d$ is a good starting point.
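To make this representation concrete, here is a minimal Python sketch of building such term-frequency vectors; the toy documents and vocabulary are hypothetical and stand in for real reviews.

from collections import Counter

def tf_vector(tokens, vocabulary):
    # map a tokenized document to its m-dimensional term-frequency vector
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

docs = [["space", "shuttle", "debris", "space"],
        ["space", "drama", "thriller"]]
vocabulary = sorted({term for doc in docs for term in doc})  # the term set T
print(vocabulary)
print([tf_vector(doc, vocabulary) for doc in docs])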

2.3 Topic Modeling

2.3.1 Overview

Modeling text corpora is a central problem of information retrieval (IR) and classification. tf-idf, a scheme widely used in the IR domain, is based on the document-term methodology. It describes the importance of a word to a document in the collection and reduces documents to a fixed-length matrix representation. But tf-idf gives hardly any insight into intra- and inter-document statistical structure, and it still produces a term × document sized matrix, quite a high dimension. To tackle these problems, Latent Semantic Indexing (LSI) was proposed, which applies singular value decomposition to the document-term matrix. LSI, being a dimensionality reduction technique, quickly became popular. In 1999, Hofmann proposed an improvement over LSI called probabilistic LSI (pLSI) [25]. pLSI models each word in a document as a sample from a mixture model [20], thereby representing a document in terms of a probability distribution over "topics". The mixture components representing topics are basically multinomial random variables.

Although an improvement over LSI, pLSI lacked a probabilistic model at the level of documents. This led to the pLSI parameters growing linearly with the size of the corpus. Another challenge was assigning topic proportions to new unseen documents. Improving on pLSI's shortcomings, the LDA model was introduced by David Blei [20].

Before going into LDA, it is important to distinguish between features and hidden features. In image analysis, a feature is a "point of interest" for image description. A "good feature" is said to have useful properties [26], such as being

• perceptually meaningful (as to humans)

• analytically special (eg. maxima)

• identifiable on different images

Hidden features, mostly used in statistical and probabilistic modeling, are hidden random variables that are inferred from observed variables. In the topic modeling sense, the hidden variables are topics representing the thematic structure of a document collection, and the observed variables are the words of the documents.

2.3.2 Latent Dirichlet Allocation

In the original paper [27], a topic is defined as a distribution over a fixed vocabulary. Such a distribution allows us to represent a document in terms of multiple topics with different proportions, thereby making it easier to classify, store and find similar documents in a collection.

LDA defines a generative process for documents under the assumption that the topics are generated first, before the documents. Hence, when training with the number of topics set to 100, we are basically assuming that there are 100 topics in the collection of documents. For each document in the collection, the words are generated in a two-stage process [27]:

1. Randomly choose a distribution over topics.

2. For each word in the document

a) Randomly choose a topic from the distribution over topics in step 1.

b) Randomly choose a word from the corresponding distribution over the vocabulary.

The above process reflects the idea of LDA that documents exhibit multiple topics. Step 1 shows that each document exhibits the topics in different proportions. Further, each word within each document is picked from one of the topics (step 2b), where the selected topic is chosen from the per-document distribution over topics (step 2a).
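This two-stage story can be simulated directly. The following sketch, with a hypothetical two-topic vocabulary, draws a per-document topic distribution from a Dirichlet and then samples each word; it only illustrates the generative process, not the MALLET implementation used later.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["space", "shuttle", "drama", "thriller", "debris", "poetic"]
K = 2  # assumed number of topics
# toy topics: each row is a distribution over the vocabulary (a beta_k)
topics = np.array([[0.4, 0.3, 0.0, 0.0, 0.3, 0.0],   # a "space" topic
                   [0.0, 0.0, 0.4, 0.3, 0.0, 0.3]])  # a "drama" topic

def generate_document(n_words, alpha=0.5):
    theta = rng.dirichlet([alpha] * K)  # step 1: this document's topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                    # step 2a: pick a topic
        words.append(rng.choice(vocab, p=topics[z]))  # step 2b: pick a word
    return theta, words

theta, words = generate_document(8)
print(theta, words)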

The generative process for LDA can be written as the joint distribution of the hidden and observed variables:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \quad (2.1)$$

where $\beta_{1:K}$ are the topics and each $\beta_k$ is a distribution over the vocabulary. $w_d$ are the observed words for document $d$, and $w_{d,n}$ is the $n$th word in document $d$. The topic proportions for the $d$th document are $\theta_d$, where $\theta_{d,k}$ is the topic proportion for topic $k$ in document $d$. The topic assignments for the $d$th document are $z_d$, where $z_{d,n}$ is the topic assignment for the $n$th word in document $d$. Figure 2.2 shows the graphical model of LDA with three levels. First, $\alpha$ and $\eta$ are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables $\theta_d$ are document-level variables, sampled once per document. Finally, the variables $z_{d,n}$ and $w_{d,n}$ are word-level variables, sampled once for each word in each document [27].

After obtaining the joint distribution, we can compute the conditional distribution of the hidden variables (the topics) given the observed variables (the words). In Bayesian statistics, this is called the posterior of the hidden variables given the observed variables:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})} \quad (2.2)$$


Figure 2.2. The graphical model for latent Dirichlet allocation. Each node is a random variable in the generative process. The shaded circle represents the observed variable, i.e. the words of the documents; unshaded circles are hidden variables. Plates represent replication: N denotes the words within a document and D the collection of documents. Figure taken from Blei's paper [20].

The numerator is the joint distribution of all the random variables, and the denominator is the marginal probability, which sums over all possible ways of assigning each observed word of the collection to one of the topics [27]. As this denominator is exponentially expensive to compute, various techniques are used to approximate the posterior. We used the MALLET [28] package, which uses Gibbs sampling for posterior approximation.

As mentioned in [27], relaxing and extending the statistical assumptions made by LDA can narrow the topics down to specific semantic patterns. Nowadays, topic modeling implementations have been optimized with features such as online learning for documents arriving in a stream and multi-threading support.

2.4 Similarity Metrics

Finding similar movies for a target movie is the objective of a content based RS. Media content can be in the form of audio, video and text. In our case, each movie is represented by a single text document consisting of its movie reviews; hence, it is useful to look at the similarity metrics currently used in the document clustering domain. In document clustering, closeness between documents is defined in terms of the similarity or distance between them. In the rest of this chapter, some commonly used similarity metrics are discussed.

2.4.1 Cosine Similarity

Cosine Similarity (CS) is the most used measure of document similarity. Its usage can be seen in the information retrieval domain, for example for measuring similarity between documents with data obtained from the LSI algorithm [29]. In order to measure the similarity of two documents, we calculate the cosine of the angle between the two term vectors of the documents. Figure 2.3 shows the angle in a two-dimensional document space.

Figure 2.3. Angle between two documents in a 2-d document-term space.

Given two documents $\vec{t}_a$ and $\vec{t}_b$, their cosine similarity is represented by

$$\mathrm{docsim}_{cs}(\vec{t}_a, \vec{t}_b) = \frac{\vec{t}_a \cdot \vec{t}_b}{|\vec{t}_a| \times |\vec{t}_b|} \quad (2.3)$$

where $\vec{t}_a$ and $\vec{t}_b$ are $m$-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$. It is important to note that for documents the tf-idf weights are non-negative; hence, the CS always lies in $[0, 1]$.
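A minimal NumPy sketch of eq. (2.3), with hypothetical term vectors:

import numpy as np

def cosine_similarity(ta, tb):
    # eq. (2.3): cosine of the angle between two term vectors
    ta, tb = np.asarray(ta, dtype=float), np.asarray(tb, dtype=float)
    return ta.dot(tb) / (np.linalg.norm(ta) * np.linalg.norm(tb))

# hypothetical term-frequency vectors over a three-term vocabulary
print(cosine_similarity([1, 2, 0], [2, 1, 1]))  # approx. 0.73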

Although cosine similarity is a widely used similarity metric, it is important to consider metrics based on probability distributions when the input data is a topic distribution. The Kullback-Leibler divergence has been shown to effectively cluster text data using both terms [22] and topic distributions [30].

2.4.2 Kullback-Leibler (KL) divergence

In the field of information theory, a document is described by a probability distribution of terms. We can then calculate the similarity between two documents as the distance between the two corresponding probability distributions [22]. For two distributions P and Q, the KL divergence of Q from P is

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \quad (2.4)$$

In other words, KL divergence of Q from P is a measure of the information lost when Q is used to approximate P [31].

The limitation of KL divergence, when using it for similarity between documents based on a probability distribution of topics, is that it is not symmetric. For a distance measure to be considered a metric of similarity, it must be symmetric, i.e. the distance from x to y must equal the distance from y to x. For KL divergence, consider the following equation, again in the document scenario:

$$D_{KL}(\vec{t}_a \,\|\, \vec{t}_b) - D_{KL}(\vec{t}_b \,\|\, \vec{t}_a) = \sum_{t=1}^{m} (w_{t,a} + w_{t,b}) \log \frac{w_{t,a}}{w_{t,b}} \quad (2.5)$$


Since the above expression is not zero in general, KL divergence is not symmetric. One solution is to use the arithmetic average of $D_{KL}(P \| Q)$ and $D_{KL}(Q \| P)$; another is to calculate the Hellinger distance (HL) for such cases [32], [33]. In this work, we explored the HL distance further.
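The following sketch illustrates eq. (2.4), its asymmetry, and the arithmetic-average symmetrization on hypothetical, strictly positive distributions (strict positivity avoids log(0)):

import numpy as np

def kl(p, q):
    # eq. (2.4): D_KL(P || Q) for strictly positive discrete distributions
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

print(kl(p, q), kl(q, p))           # the two values differ: KL is not symmetric
print(0.5 * (kl(p, q) + kl(q, p)))  # symmetric arithmetic average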

2.4.3 Hellinger Distance

The Hellinger distance is a metric of similarity between two probability distributions. For probability distributions $P = \{p_i\}_{i \in [n]}$ and $Q = \{q_i\}_{i \in [n]}$ supported on $[n]$, the Hellinger distance [34] between P and Q is defined as

$$h(P, Q) = \frac{1}{\sqrt{2}} \cdot \left\| \sqrt{P} - \sqrt{Q} \right\|_2 \quad (2.6)$$

It is important to note that for cosine similarity a higher value is better, whereas for the Hellinger distance a smaller value represents more similarity.
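A minimal sketch of eq. (2.6); the inputs are hypothetical topic distributions:

import numpy as np

def hellinger(p, q):
    # eq. (2.6): ranges from 0 (identical) to 1 (disjoint support)
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

print(hellinger([0.7, 0.2, 0.1], [0.1, 0.3, 0.6]))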

The motivation to improve movie recommendation was the initial push to explore NLP and topic modeling techniques. With this knowledge of document processing, topic modeling and similarity measures, we can now discuss the design approach taken during the project implementation.


Chapter 3

Recommendation based on Movie Topics

This chapter discusses the implementation of the major steps involved in prototyping a movie-topic based RS. The overall system is visualized in four steps, as shown in Figure 3.1.

Figure 3.1. The overall system showing all steps involved. The system works by preprocessing reviews, training an LDA model and extracting topics from it. The topics are later used to find similar movies.

Summarizing the system:

1. Two datasets were created and preprocessed for the experiment:

• Corpus A, a set of user written movie reviews extracted from the web; essentially, a list of 943 popular movies from the last 10 years, rated by users on the web.

• Corpus B, a list of ten target movies representing popular genres, hand-picked by two movie lovers who later evaluated the results.

2. An LDA based model is trained on Corpus A to generate movie topics.


3. Using the trained model, indexes of the topic distributions for both Corpus A and the unseen Corpus B are created.

4. Using similarity metrics, a list of five similar movies for each target movie is created and presented for evaluation.

To implement the above system, Python [35] is used as the programming language of choice because of the large ecosystem of machine learning (ML) tools and libraries around it. Python based ML systems are easy to scale, as most of the open source libraries are memory efficient and support multiple threads of execution.

We start by analyzing and pre-processing the movie data. Next, feature extraction is performed on the processed data. Finally, the extracted features are used to compute similarity between movies.

3.1 User Reviews of Movies as Data

Figure 3.2. Screenshot showing a sample movie review taken from IMDB. Highlighted words are relevant features that can be used for finding similar movies.

Movie reviews are widely available in audio, video and text form, so we needed to narrow down our choice of initial data. We decided to use text based movie reviews, as they are easy to extract over the Internet and have low computational complexity when prototyping with different algorithms. Reviews are written either by movie critics or by users. Basing our feature extraction on movie critic reviews could result in a biased view of a movie. Combining a large number of user written reviews and using them as the source for our feature extraction system has the benefit that we might pick up semantic patterns agreed upon by a wide audience of cinema. Figure 3.2 shows the kind of semantic patterns that we want to extract in this project. In the sample review for the movie Gravity (Figure 3.2), observe the description of another movie, Apollo 13: users connect movies while writing reviews, and this can be useful for finding semantic patterns across movies belonging to the same genre.

In this report, we use the term "document" to be consistent with the IR and topic modeling domains, but in our experimental setup a document consists of the user written reviews of one movie and thus represents that movie.

3.2 Text Preprocessing

In Natural Language Processing, a corpus is a collection of text data [36] used for verifying hypotheses about language, such as extracting features from text or finding patterns of word usage. For the movie review data, we collected the text and followed the preprocessing steps shown in Figure 3.3. During preprocessing, irrelevant words such as {of, and, or} are removed using a common English stopword list.

Figure 3.3. Collection and preprocessing of movie reviews.


Figure 3.4. Preprocessing of movie reviews is done in parallel by spawning subprocesses for the available number of CPU cores. The representation above is inspired by Chris Kiehl's blog [37].

Next, NLTK's default lemmatizer is used for lemmatisation. It uses the WordNet database (http://wordnet.princeton.edu/) to look up lemmas. A lemmatizer reduces all derivationally related forms of a word to a common base form; for example, the word "cars" is reduced to "car". This allows us to keep the concept words and remove other forms of the same word in the corpus.

Since text preprocessing is done on 1k movie reviews, it is useful to process them in parallel. Figure 3.4 shows the multiprocessing approach taken to implement the preprocessing of movie reviews in parallel; a minimal sketch is given below. The Python multiprocessing package is used, as it allows spawning new processes to utilize the multiple processors of a given machine [35]. This saves time during prototyping and allows us to scale the system.
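A minimal sketch of this preprocessing pipeline, assuming the NLTK data packages (punkt, stopwords, wordnet) have been downloaded; the sample reviews are hypothetical:

import multiprocessing as mp
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(review_text):
    # tokenize, drop stopwords and punctuation, lemmatize one movie review
    tokens = word_tokenize(review_text.lower())
    tokens = [t for t in tokens
              if t not in STOPWORDS and t not in string.punctuation]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

if __name__ == "__main__":
    reviews = ["The cars raced through the debris of the space shuttle.",
               "A bleak, poetic drama with philosophical narration."]
    # one worker subprocess per available CPU core, as in Figure 3.4
    with mp.Pool(processes=mp.cpu_count()) as pool:
        print(pool.map(preprocess, reviews))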

With the preprocessed data at hand, we explored a number of techniques in the NLP domain. We experimented with chunk extraction on the movie data. Chunking is useful for segmenting and labelling multi-token sequences in a sentence; one such result is shown in Figure 3.5.

Figure 3.5. Tree showing the NLTK based chunking technique applied on movie data.
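As an illustration, a small NLTK chunking sketch with a hypothetical sentence; the noun-phrase grammar shown is an assumption, not the exact one used in the project (assumes the punkt tokenizer and POS tagger models are downloaded):

import nltk

sentence = "Sandra Bullock floats through the orbital debris field"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# toy noun-phrase grammar: optional determiner, adjectives, then nouns
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)
print(tree)  # NP subtrees group multi-token phrases such as "orbital debris field"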


Although a chunking based approach is useful for tasks such as information extraction, it is not the right tool for analyzing semantic patterns in large volumes of unlabeled text such as movie reviews. In the IR domain, analyzing large amounts of unlabeled text is a common requirement, and this motivated us to look at IR techniques such as LSI and LDA.

3.3 Feature Extraction

3.3.1 Overview

The goal of feature extraction is to transform data from images or text into numerical features for the purpose of analysis. In text processing, techniques such as the document-term methodology convert text documents into numerical data. We can then easily feed such matrix-form data into machine learning algorithms to observe the thematic structure of documents. Mathematical techniques such as Latent Semantic Indexing (LSI) are then used to project the document-term matrix from a high dimensional to a lower dimensional space in order to identify semantic meaning and similarity between documents. LSI is basically an application of Singular Value Decomposition (SVD) to a document-term matrix. Another approach in text processing is to express words and documents in terms of probability distributions, leading to models useful for finding semantic information. Probabilistic LSI (pLSI) and Latent Dirichlet Allocation (LDA) are such probabilistic models. Compared to LDA, pLSI provides no probabilistic model at the level of documents, and for analyzing movies it is necessary to model at the level of movies in a collection. Another benefit of LDA is that it fits new unseen documents (new upcoming movies in our case) better, an important requirement for a movie recommendation system.

In Figure 3.2, we can observe that the movie review talks about the concept of space with words such as "science" and "cosmic", and about genres with words such as "drama" and "thriller". Hence, a single movie review blends multiple topics in different proportions. Essentially, a movie is a combination of different genres, where each genre can be represented in a different proportion. As discussed in Section 2.3.2, the LDA model matches this idea of representing a document (a movie in our case) with multiple topics. Hence, we experimented with LDA modeling on the movie review dataset to analyze the movie topics.

3.3.2 Movie Topics

For this project, Gensim's [13] Python wrapper for MALLET's LDA [28] is used. MALLET has a number of benefits, such as multi-threading support and a fast implementation of Gibbs sampling. In order to generate movie topics, we first train the LDA model on the 1k movie review corpus. We then obtain the topic distribution of a movie by passing its reviews to the trained LDA model.
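A minimal sketch of this training step, using the LdaMallet wrapper that shipped with Gensim versions of that era; the toy texts stand in for the preprocessed 1k review corpus, and the MALLET path is a placeholder for a local install:

from gensim import corpora
from gensim.models.wrappers import LdaMallet

# toy token lists standing in for the preprocessed review documents
texts = [["shuttle", "debris", "space"],
         ["bleak", "poetic", "drama"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

mallet_path = "/path/to/mallet/bin/mallet"  # placeholder: local MALLET install
lda = LdaMallet(mallet_path, corpus=corpus, num_topics=100, id2word=dictionary)

# topic distribution for one movie's (preprocessed) review document
bow = dictionary.doc2bow(["shuttle", "debris", "space"])
print(lda[bow])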

Figure 3.6 shows five topics generated from reviews of the movie Gravity; each column represents one topic. It can be observed from the figure that topics t1, t2 and t4 represent the movie Gravity with words such as "shuttle", "exploration", "debris" and "adrenaline". Topics t3 and t5 do not give an accurate description and need more filtering in order to obtain better topics.

Prototyping with the review dataset gave us the following insights about the quality of topics:

• Preprocess the reviews extensively and remove unnecessary words.

• Use descriptive reviews, as they are more useful than reviews carrying only sentiment value.

Ultimately, training an LDA model on movie reviews is just one step towards good movie features. As a post-processing step, similarity measures can be used to find movies with similar topic distributions.

3.4 Topic Similarity

During prototyping we explored the commonly used similarity measures Cosine Similarity (CS), Kullback-Leibler (KL) divergence and the Hellinger distance (HL). As mentioned in Section 2.4.2, KL divergence is a non-symmetric measure. Hence, we calculated both CS and HL as similarity metrics for the ten target movies against the corpus of 1k reviews. Similarity values were then converted to a common similarity score of 0-100 for comparison. Figure 3.7 shows the positive correlation obtained from 50 movie scores computed separately for both CS and HL. Since the movie topics are probability distributions, we used the Hellinger distance as the similarity measure for the experimental setup. The distance metric is applied as follows (a sketch follows the list):

1. Index the topic distributions of the query movie q and the movie corpus C.

2. Apply distance metric formula on indexed q and C.

3. Sort and pick the top five movies.
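A sketch of these three steps on hypothetical dense topic distributions; to_dense converts Gensim-style sparse (topic, proportion) pairs, and the toy index stands in for the real 1k-movie index:

import numpy as np

def to_dense(sparse_topics, num_topics):
    # expand Gensim-style (topic_id, proportion) pairs to a dense vector
    dense = np.zeros(num_topics)
    for topic_id, proportion in sparse_topics:
        dense[topic_id] = proportion
    return dense

def hellinger(p, q):
    # eq. (2.6): smaller values mean more similar distributions
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

def top_similar(query, index, k=5):
    # steps 2 and 3: score the whole corpus, sort, keep the top k movies
    scored = [(title, hellinger(query, dist)) for title, dist in index.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# step 1 (toy): an index of dense topic distributions, one per corpus movie
index = {"movie_a": np.array([0.7, 0.1, 0.1, 0.1]),
         "movie_b": np.array([0.1, 0.1, 0.1, 0.7])}
query = np.array([0.6, 0.2, 0.1, 0.1])  # hypothetical target-movie distribution
print(top_similar(query, index, k=5))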

Topic 1: clooney, willis, bullock, sandra, gravity, mcclane, debris, brucewillis, justin, shuttle
Topic 2: suicide, flashbacks, philosophical, bleak, symbolism, narration, linear, exploration, poetic, artsy
Topic 3: man, damaged, outset, cards, watcher, whathappens, nerve, onhis, maintains, atfirst
Topic 4: johnson, weapons, assassin, bullets, installment, bullet, adrenaline, actionsequences, matrix, combat
Topic 5: gaps, lasted, ialso, welldone, ifeel, posted, edit, knowthat, insteadof, ahuge

Figure 3.6. Sample topics generated from user movie reviews for the movie Gravity.


Figure 3.7. Cosine similarity and Hellinger distance show a strong positive correlation. The X-axis shows the similarity score for the Hellinger distance, whereas the Y-axis represents the cosine similarity score.


Chapter 4

Experimental Setup and Results

4.1 Experimental Setup

We started the setup by collecting the necessary movie reviews from the web. A list of top movies from the last 10 years was created, consisting of roughly the top 50 movies from each year between 2004 and 2013. IMDB creates a popular movie list for each year based on user votes (e.g. http://www.imdb.com/search/title?year=2013,2013&title_type=feature); such a list represents a good mixture of the popular genres liked by moviegoers. With the list of 943 movies, user written movie reviews were scraped and stored in raw HTML format. Next, the text content was extracted from the HTML files using BeautifulSoup [38], an open source library.

The extracted text is stored in a directory where each text file consists of the user written movie reviews for a single movie. In total, the corpus has 943 movie review documents, shown as a tree structure in Figure 4.1. The prepared corpus is a balanced mixture of popular genres, as shown in Figure 4.2, which allows us to experiment without bias towards any particular movie genre.

A point to note is that we also created a corpus by processing the Large Movie Review Dataset [39], but due to the computational complexity we decided to scale down and prototype on the smaller dataset mentioned above.

4.1.1 Text processing

For text processing of the reviews we used NLTK [40], an open source library. First, we iterate over the corpus and tokenize each file. Tokens are the basic elements of text mining, allowing us to analyze and process text at the word level. Since we now have access to individual words, we remove punctuation and unwanted words, i.e. stopwords. Stopwords are high-frequency grammatical words which are usually ignored, as they do not provide any useful information; examples are {other, there, the, of, are}. We used NLTK's default stopword list for the English language (the Snowball list, http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/english.stop).


src
  movie-reviews-10-years
    movie1.txt
    movie2.txt
    ...
    movie943.txt

Figure 4.1. A tree diagram showing the movie review corpus.

Figure 4.2. Chart showing the genres of popular movies from the last 10 years.

We kept all word tokens longer than two alphabetical characters. Words occurring only once in the whole corpus were also removed, as sketched below.
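A minimal sketch of these two filtering rules on hypothetical token lists:

from collections import Counter

texts = [["shuttle", "debris", "space", "xx"],
         ["space", "drama", "thriller"]]

# keep only alphabetic tokens longer than two characters
texts = [[t for t in doc if t.isalpha() and len(t) > 2] for doc in texts]

# drop words that occur only once in the whole corpus
freq = Counter(t for doc in texts for t in doc)
texts = [[t for t in doc if freq[t] > 1] for doc in texts]
print(texts)  # only "space" occurs twice, so it is all that survives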

During prototyping, we found a few meaningless topics (from the movie recommendation point of view; topics containing words such as good, great, bad, excellent, review, film) and created an additional stopword list out of them. The visualization in Figure 4.3 shows meaningless topics as high density blue columns; these columns represent topics with common words present throughout the corpus. With the new list as feedback to the system, we re-preprocessed our corpus and obtained improved topics.


Figure 4.3. A visualization showing 20 topics generated from 100 movie reviews. The vertical axis represents movie reviews denoted by their corresponding ids, while the horizontal axis represents movie topics.


4.1.2 Training LDA model

The movie review corpus is passed to Gensim's [13] Python wrapper for MALLET's LDA [28]. We tested the quality of the generated topics with 100, 150 and 250 topics and found 100 topics to be the right size for our 1k movie review corpus; having too few or too many topics can hurt topic quality. Training a model with 100 topics is also computationally efficient and saves time, which allows us to repeat the process with different inputs (a sketch of the comparison is given below). With the generated movie topics, we decided to investigate further and experimented with using the topics to find similar movies.
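A sketch of this comparison, re-using a corpus, dictionary and MALLET path prepared as in Section 3.3.2 (toy values shown; the show_topics output format may vary across Gensim versions):

from gensim import corpora
from gensim.models.wrappers import LdaMallet

# toy stand-ins for the real preprocessed corpus and a local MALLET install
texts = [["shuttle", "debris", "space"],
         ["bleak", "poetic", "drama"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
mallet_path = "/path/to/mallet/bin/mallet"

# retrain with each candidate topic count and print a few topics for inspection
for k in (100, 150, 250):
    lda = LdaMallet(mallet_path, corpus=corpus, num_topics=k, id2word=dictionary)
    for topic in lda.show_topics(num_topics=5, num_words=10):
        print(k, topic)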

4.1.3 Calculating movie similarity

Once the model is trained on the movie reviews, we can infer topic distributions for new, unseen movies by passing their reviews to the model in the same way as the original corpus. We indexed and stored the topic distributions for the ten target movies that we wish to use during evaluation. Using the HL distance metric, we then find movies with a similar distribution. The final result is saved in JSON format and passed to the web based evaluation system, as sketched below.
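A sketch of this inference-and-export step; it composes helpers from the earlier sketches — preprocess() from Section 3.2, the trained lda and dictionary from Section 3.3.2, and to_dense()/top_similar()/index from Section 3.4 — all of which are assumptions of this illustration:

import json

target_reviews = {"gravity": "A lone astronaut drifts among shuttle debris ..."}

results = {}
for title, review in target_reviews.items():
    bow = dictionary.doc2bow(preprocess(review))
    query = to_dense(lda[bow], num_topics=100)  # infer topics for the unseen movie
    results[title] = [movie for movie, _ in top_similar(query, index, k=5)]

with open("similar_movies.json", "w") as fh:
    json.dump(results, fh, indent=2)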

4.2 Evaluation

The goal of our experimental evaluation is twofold: first, to evaluate the performance of the system; second, to verify the topics themselves and their effectiveness in representing movies. We chose subjective evaluation with explanation, as it fits this twofold goal. Subjective evaluation measures are expressions of the users about the system or their interaction with the system [41]. It is commonly used to evaluate the usability of recommender systems.

During implementation, we borrowed ideas from approaches used in RS with explanation [42]. A traditional RS behaves like a black box to the end user, which leaves the user confused as to why a particular movie has been recommended. A RS with explanation can help the user understand the system better and complement it by giving feedback. For our evaluation, the movie topics are presented as an explanation of each recommended movie and as a criterion to receive feedback on it.

4.2.1 Evaluation criteria

Table 4.1 shows the five criteria of the evaluation system. Genre, Mood and Plot are basic to movie similarity. We observed the presence of actor names in the extracted movie topics; hence, evaluating the effect of actor overlap could be useful.


Criteria              Explanation
Genre                 Similarity of genres between the target movie and the recommended movies.
Mood                  Similarity of mood.
Plot                  Similarity of plot.
Overlap               Overlap of actors/actresses/director or lead cast.
Topic-relevance-score Relevance of the topics as an explanation of the recommended movies.

Table 4.1. Evaluation criteria used in our web based movie evaluation system.

4.2.2 Web based movie evaluation setup

A web based movie evaluation system was created to evaluate the results obtained from our experimental setup. Figure 4.4 shows the home page of the system. It allows users to log in over the web and rate movies. Evaluation proceeds by

• presenting a target movie and a recommended movie; the judges then rate the recommended movie based on the evaluation criteria,

• then showing an explanation in the form of movie topics; the judges re-rate the movie after reading the explanation.

We decided to show an explanation for each recommended movie in order to evaluate how well the topics represent a movie. Figures 4.5 and 4.6 show the two step evaluation described above. For each target movie, five movies are presented following these steps. The ratings are then saved to a database and later used to analyze the results. For the project, three judges were invited to rate movies; our judges are regular moviegoers and watch movies from a wide spectrum of genres.


Figure 4.4. Front page of the movie evaluation system, showing five target movies. A user clicks on a target movie and five similar movies are presented for evaluation.


Figure 4.5. Web based movie evaluation system. Shown on the left is a target movie; the front page upon log-in showed 10 target movies.


Figure 4.6. Movie evaluation system with explanation.


4.3 Results

Initially we did an evaluation for ten target movies, but realized that subjective evaluation is a slow process. Hence we updated the system and did the evaluation for five target movies only. Figures 4.7-4.11 show the evaluation results for the evaluation criteria. For each criterion, we also show the movie topics as an explanation; the result of the evaluation with explanation is shown in the bottom visualization of each figure. The movie judges rated each movie on a scale of 1-4: 1 for "Not Similar", 2 for "Somewhat Similar", 3 for "Similar" and 4 for "Perfect", meaning that the user is happy with the recommended movie.

4.3.1 Evaluation result

Some observations about the results:

• Out of the five movies given to the judges for each evaluation, one movie had a similarity score of 40-50% and hence received the lowest ratings, whereas for most of the other movies the scores were in the range of 50-70%.

• As shown at the top of Figure 4.7, the genre criterion shows 30-35% of ratings between 2 and 3, with a median of 3 for genre only and 2 for genre with explanation. As observable in the re-ratings (bottom figure), the movie topics differ slightly from the judges' understanding of genre, but overall the judges agree with movie topics as an explanation of movie genre.

• As shown at the top of Figure 4.8, the mood criterion shows 25-30% of ratings between 2 and 3, with a median of 2. Both genre and mood information have been captured quite well by the movie topics. As observable, the ratings (top) and re-ratings (bottom) are almost the same for the mood evaluation; hence, the judges agree with movie topics as an explanation of the mood aspect of movies.

• As shown in Figure 4.9, the majority of recommended movies are not similar at all in terms of movie plot, with 40% of ratings given to "Not Similar". This shows that capturing the plot is much more difficult than capturing genre or mood. The ratings (top) and re-ratings (bottom) are almost the same for the plot evaluation; hence, the judges agree that plot information is not well captured by movie topics, and more information is needed to recommend movies with similar plots.

• In the LDA model, neither the order of words nor the order of documents is considered. For modeling movie plot information, a time based description of concepts and events is important: in order to better extract plot information, topics would have to evolve over the timeline of a movie.


• The overlap criterion was rated "Not Similar" with 60-70% of the ratings. This is understandable, as our collection contains only 1k movies and it is difficult to find actor overlaps within such a small corpus. Again, the ratings (top) and re-ratings (bottom) are almost the same; hence, the judges agree that actor overlaps are not well captured by movie topics.

• Figure 4.11 shows the ratings for the topic relevance score. A combined 73% of ratings lie between 2 and 3, which represents the overall usefulness of the topics in finding similar movies and as an explanation.

• We did the subjective evaluation on a small scale, as rating movies is a slow process; the judges needed time to watch previously unseen movies before rating them.

• Overall, the genre, mood and topic relevance score criteria have shown useful results.

It is important to observe that the topics generated by an LDA model change every time the model is re-trained. Running the model anew generates a different set of topics, slightly changed from the previous one. Hence, the final list of similar movies might change as well, depending on the generated topics.

4.3.2 Rating correlation

• Although we analyzed the other criteria for correlation as well, we observed a strong positive correlation only between genre and mood, as shown in Figure 4.12.

• For the correlation between the judges' ratings, we analyzed all the ratings. Figure 4.13 shows a strong positive correlation between two judges: they agree with each other on 34.4% of ratings for rank 1, followed by 13.6% for rank 2.

4.3.3 Observations on Subjective evaluation

Subjective evaluation is useful for getting feedback on recommended movies. Our judges gave feedback about the movie topics and showed interest in rating topics individually in a future evaluation. This could be useful for filtering noise and maintaining a top rated list of movie topics. Although our system has only 100 topics, topic rating could be highly relevant when building a hierarchical list of movie topics, as the key challenge with a higher number of topics is to maintain the good topics and remove the bad ones. In the end, subjective evaluation is time constrained, as it is a slow process to evaluate movies and topics individually, but the outcome is a quite accurate assessment of the extracted features, the recommended movies and the system itself.


Figure 4.7. Result of average rating for Genre (top) and Genre with explanation (bottom).


Figure 4.8. Result of average rating for Mood (top) and Mood with explanation (bottom).


Figure 4.9. Result of average rating for Plot (top) and Plot with explanation (bottom).


Figure 4.10. Result of average rating for Overlap (top) and Overlap with explanation (bottom).


Figure 4.11. Average ratings for the movie topics.

Figure 4.12. Strong positive correlation between Genre and Mood.


Figure 4.13. Strong positive correlation of ratings between two judges. The judges agree most on rating 1, then on ratings 2 and 3.


Chapter 5

Conclusion and Future Directions

5.1 Conclusion

In this project, we developed a prototype system for extracting movie features, i.e. topics. We trained a model on a collection of movie reviews and used the trained model to find similar movies based on the Hellinger distance between movie topic distributions. The evaluation results show that such an approach gives good results even with a small movie collection: the movie topics are efficient features, as they perform fairly well in capturing movie genre and mood. The movie plot results are somewhat satisfactory but call for more descriptive plot information and better methods that can capture the story line. Our small movie corpus resulted in very few overlaps between actors. Topics as an explanation in movie recommendation are quite useful but need to be fine-tuned with the ability to rate individual topics; user rated movie topics could then be used as feedback to the system.

Finally, movie topics are efficient features for movie recommendation systems, as they represent the semantic patterns behind movies. With user movie reviews as data, movie topics capture essential movie aspects such as genre and mood. Our prototyping approach to feature extraction has the potential to scale to a large number of movies.

5.2 Future Directions

In this project, we considered user written movie reviews for extracting features. Such a method could be extended or combined with other forms of movie meta-data such as plots, genres and keywords. With the recent advancements in deep learning, it would be interesting to study the effect of using LDA as a preprocessing step in deep learning analysis of movie reviews. In the following, we discuss a few interesting future directions.


5.2.1 Movie review preprocessing

The basic LDA model itself does not care about word order. As is easily observable, word order matters in several cases, especially for bi-gram movie keywords such as "dark comedy" or "nordic horror". We did a small experiment with bi-grams but ended up with noisy bi-gram based movie topics, as the bi-grams were not consistent in their representation. Approaches in language construction [43] could be used to create multi-word movie keywords. Finally, extracting and using word constructions from movie reviews has the potential to further capture movie semantics. A sketch of bi-gram detection is given below.
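For reference, a minimal sketch of bi-gram detection with Gensim's Phrases model; the toy texts are hypothetical, and the min_count/threshold values would need tuning on real reviews:

from gensim.models.phrases import Phrases, Phraser

texts = [["dark", "comedy", "with", "bleak", "humor"],
         ["a", "dark", "comedy", "about", "nordic", "horror"],
         ["nordic", "horror", "meets", "dark", "comedy"]]

bigram = Phraser(Phrases(texts, min_count=2, threshold=1.0))
print([bigram[doc] for doc in texts])  # frequent pairs merge into tokens like "dark_comedy"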

5.2.2 Building complex topic models

The LDA model can be considered a base model, and more complex models can be built on top of it according to the needs we have of the data at hand. The Correlated Topic Model (CTM) [44] and the Dynamic Topic Model (DTM) [45] are such models built on top of LDA. For example, DTM could be used to observe changing movie patterns over the years. With TV shows being made for 10-15 seasons, DTM could highlight the rise and fall of characters over the seasons.

Topic models can also be extended to include additional information such as meta-data. For example, author-topic models attach topic proportions to authors, making it possible to calculate author similarity based on topic proportions [27]. Hierarchical LDA models [46] are another direction to explore, as extending hundreds of topics to thousands could represent a wide spectrum of movie genres. Recommending movies based on the topics liked by users, and rating the topics themselves, are some of the ways to improve the extracted topics and build a system based on topic modeling.

With so many choices of streamable content, the challenge is to efficiently extract features from all forms of meta-data, recommend relevant content to the end user, and keep serendipity in the recommendations.


Bibliography

[1] J. Booton, "One-click Netflix button to make movie streaming even easier", Fox Business, Aug. 2011. [Online]. Available: http://www.foxbusiness.com/markets/2011/01/04/click-netflix-button-appear-remote-controls-movie-streaming/ (visited on Jun. 11, 2014).

[2] A. C. Madrigal, "How Netflix reverse engineered Hollywood", The Atlantic, Jan. 2014. [Online]. Available: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/ (visited on May 13, 2014).

[3] "Me TV: how Jinni is revolutionizing search", Forbes. [Online]. Available: http://www.forbes.com/sites/dorothypomerantz/2013/02/18/me-tv-how-jinni-is-revolutionizing-search/ (visited on May 13, 2014).

[4] B. Fritz, "Cadre of film buffs helps Netflix viewers sort through the clutter", Los Angeles Times, Sep. 2012, ISSN: 0458-3035. [Online]. Available: http://articles.latimes.com/2012/sep/03/business/la-fi-0903-ct-netflix-taggers-20120903 (visited on May 15, 2014).

[5] J. Layton, "How Pandora Radio works", May 2006. [Online]. Available: http://computer.howstuffworks.com/internet/basics/pandora.htm.

[6] X. Amatriain, "Netflix recommendations: beyond the 5 stars (part 1)", The Netflix Tech Blog. [Online]. Available: http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html (visited on Jun. 11, 2014).

[7] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques", in Proceedings of EMNLP, 2002, pp. 79–86.

[8] S. Ah and C.-K. Shi, "Exploring movie recommendation system using cultural metadata", in 2008 International Conference on Cyberworlds, Sep. 2008, pp. 431–438. DOI: 10.1109/CW.2008.13.

[9] A. Blackstock and M. Spitz, "Classifying movie scripts by genre with a MEMM using NLP-based features", Stanford, student report for the M.Sc. course Natural Language Processing, Jun. 2008. [Online]. Available: http://nlp.stanford.edu/courses/cs224n/2008/reports/06.pdf.

[10] R. Berendsen, "Movie reviews: do words add up to a sentiment?", PhD thesis, Rijksuniversiteit Groningen, Sep. 2010.

[11] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, "GroupLens: an open architecture for collaborative filtering of netnews", in Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work (CSCW '94), Chapel Hill, North Carolina, USA: ACM, 1994, pp. 175–186, ISBN: 0-89791-689-1. DOI: 10.1145/192844.192905.

[12] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, 1st ed. O'Reilly, 2009.

[13] R. Řehůřek and P. Sojka, "Software framework for topic modelling with large corpora", in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta: ELRA, May 2010, pp. 45–50. [Online]. Available: http://is.muni.cz/publication/884893/en.

[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: machine learning in Python", Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[15] J. Vig, S. Sen, and J. Riedl, "Tagsplanations", in Proceedings of the 13th International Conference on Intelligent User Interfaces (IUI '09), 2009. DOI: 10.1145/1502650.1502661.

[16] T. Luostarinen and O. Kohonen, "Using topic models in content-based news recommender systems", in Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), Oslo, Norway: Linköping University Electronic Press, May 2013, p. 239, ISBN: 978-91-7519-589-6. [Online]. Available: http://emmtee.net/oe/nodalida13/conference/11.pdf.

[17] R. K. V and K. Raghuveer, "Legal documents clustering using latent dirichlet allocation", International Journal of Applied Information Systems, vol. 2, no. 6, pp. 27–33, May 2012.

[18] R. Krestel, P. Fankhauser, and W. Nejdl, "Latent dirichlet allocation for tag recommendation", in Proceedings of the Third ACM Conference on Recommender Systems (RecSys '09), 2009. DOI: 10.1145/1639714.1639726.

[19] C. Wang and D. M. Blei, "Collaborative topic modeling for recommending scientific articles", in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), San Diego, California, USA: ACM, 2011, pp. 448–456, ISBN: 978-1-4503-0813-7. DOI: 10.1145/2020408.2020480.

[20] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation", Journal of Machine Learning Research, vol. 3, pp. 993–1022, Mar. 2003, ISSN: 1532-4435. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944937.

[21] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review", ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999. DOI: 10.1145/331499.331504.

[22] A. Huang, "Similarity measures for text document clustering", in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand, 2008, pp. 49–56.

[23] S. Bordag, "A comparison of co-occurrence and similarity measures as simulations of context", in Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing, 2008, pp. 52–63. [Online]. Available: http://dl.acm.org/citation.cfm?id=1787584.

[24] C. Perone, "Machine learning :: cosine similarity for vector space models (part III)", Pyevolve blog, Sep. 2013. [Online]. Available: http://pyevolve.sourceforge.net/wordpress/?p=2497.

[25] T. Hofmann, "Probabilistic latent semantic indexing", in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 1999, pp. 50–57.

[26] A. Aichert, "Feature extraction techniques", in CAMP Medical Seminar, 2008.

[27] D. M. Blei, "Introduction to probabilistic topic models", Communications of the ACM, 2011. [Online]. Available: http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf.

[28] A. K. McCallum, "MALLET: a machine learning for language toolkit", 2002. [Online]. Available: http://mallet.cs.umass.edu.

[29] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008, ISBN: 0521865719.

[30] D. Olszewski, "Fraud detection in telecommunications using Kullback-Leibler divergence and latent dirichlet allocation", in Adaptive and Natural Computing Algorithms, ser. Lecture Notes in Computer Science, A. Dobnikar, U. Lotrič, and B. Šter, Eds., vol. 6594, Springer Berlin Heidelberg, 2011, pp. 71–80, ISBN: 978-3-642-20266-7. DOI: 10.1007/978-3-642-20267-4_8.

[31] Wikipedia, "Kullback–Leibler divergence", Sep. 2014. [Online]. Available: https://en.wikipedia.org/wiki/Kullback-Leibler_divergence.
