Labeling Moods of Movies by Processing Subtitles
PETER SVENSSON YOUSSEF TAOUDI
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract
Labeling movies by mood is a feature that is useful for recommendation engines in modern movie streaming applications, as it could improve user experience by recommending more relevant movies to users. This thesis describes the development of a mood labeling feature that labels movies by processing movie subtitles through Natural Language Processing. Movies are processed by analysing subtitles to predict the mood of a movie through computational methods. The prototype utilizes movies pre-labeled with moods to construct a lexicon that contains information about the defining attributes of moods in movie subtitles. Using the constructed lexicon, the similarities between a movie subtitle and a lexicon can be compared to calculate the probability that a movie belongs to a specific mood. Four moods were chosen for analysis in the prototype: fear, sadness, joy, and surprise.
The Naive Bayes method was chosen as the classifier for the prototype. A Naive Bayes classifier observes each occurring word in a movie without consideration of the context of the word in a text or sentence. The results showed that the classifier had trouble distinguishing between the moods. However, for all configurations of the prototype, the classifier showed higher precision for the mood fear compared to the other moods. Overall, the classifier performed poorly and did not produce a reliable result.
Keywords - Mood Labeling, Natural Language Processing, Naive Bayes, Text Classification
Sammanfattning
Labeling movies by mood is a feature that is useful for recommendation engines in modern movie streaming applications. Movie recommendation based on mood could improve the user experience on movie streaming platforms by recommending more relevant movies to users. This thesis describes the development of a prototype for labeling movies by mood by processing the movie's subtitles using Natural Language Processing methods. Movies are processed by analysing subtitles to determine the mood of a film. The prototype uses movies pre-labeled with moods to construct a lexicon containing information about the defining attributes of a mood in movie subtitles. Using a constructed lexicon, the similarities between a movie subtitle and a lexicon can be compared to calculate the probability that a movie belongs to a specific mood. Four moods were chosen for analysis in the prototype: fear, sadness, joy, and surprise.
The Naive Bayes method was chosen as the classifier for the prototype. A Naive Bayes classifier observes each occurring word without regard to the word's context in a sentence or text. The results showed that the classifier had trouble distinguishing between the moods. For all configurations of the prototype, however, the classifier showed higher precision for fear compared to the other moods. Overall, the classifier performed poorly and did not produce a reliable result.
Keywords - Mood Classification, Language Technology, Naive Bayes, Text Classification
1 Introduction 1
1.1 Background . . . . 2
1.2 Problem . . . . 2
1.3 Purpose . . . . 3
1.4 Goal . . . . 3
1.5 Methods . . . . 3
1.6 Stakeholders . . . . 4
1.7 Delimitations . . . . 4
1.8 Benefits, Ethics and Sustainability . . . . 4
1.9 Outline . . . . 6
2 Moods and Natural Language Processing 7
2.1 Moods and Emotions . . . . 7
2.1.1 Basic Emotions . . . . 8
2.1.2 Moods in Film . . . . 9
2.2 Natural Language Processing . . . . 10
2.2.1 Word Sense Disambiguation . . . . 11
2.2.2 Phases of Corpus Processing . . . . 12
2.2.3 Natural Language Levels . . . . 13
2.2.4 Similarity Measurement . . . . 15
2.2.5 Naive Bayes . . . . 16
2.2.6 Accuracy, Precision and Recall . . . . 20
2.3 Database . . . . 21
2.4 Related Work . . . . 21
2.5 Software and Frameworks . . . . 23
2.5.1 Python . . . . 23
2.5.2 Natural Language toolkit . . . . 24
2.5.3 Beautiful Soup . . . . 24
2.5.4 MySQL . . . . 24
3 Methodology and Methods 26
3.1 Research Methods . . . . 26
3.2 Data Collection . . . . 27
3.2.1 Literature Study . . . . 27
3.2.2 Interview . . . . 28
3.3 Data Analysis . . . . 29
3.4 Research Quality Assurance . . . . 30
3.5 Software Quality Assurance . . . . 30
3.6 Software Process Model . . . . 31
3.6.1 Reuse Model . . . . 31
3.6.2 Waterfall Model . . . . 32
3.6.3 Incremental Model . . . . 34
4 Development Process 36
4.1 Prerequisites . . . . 37
4.2 Implementation . . . . 37
4.3 Testing . . . . 38
5 Prerequisites 39
5.1 Interview . . . . 39
5.2 Selecting Moods for Prototype . . . . 40
5.3 Selecting Sample Movies . . . . 40
6 Implementation of Prototype 42
6.1 Text Processing . . . . 43
6.1.1 Removing Noise . . . . 43
6.1.2 Stop Words . . . . 43
6.1.3 Stemming . . . . 45
6.1.4 Term Frequency . . . . 46
6.2 Learning . . . . 46
6.3 Classification . . . . 47
7 Testing and Results 50
8 Conclusion and Future Work 55
8.1 Discussion . . . . 55
8.2 Conclusion . . . . 56
8.3 Future Work . . . . 57
A Training and Test Movies 58
B Stop Words 63
C Interview with the Swedish Film Institute 64
D Implementation 65
Bibliography 66
2.1 Figure illustrating the Bag of Words method . . . . 13
2.2 Illustration of the conditional probability for the events H and E in Bayes' theorem . . . . 17
2.3 A figure representing the structure of the attributes (A) with regard to a class node (C) in the Naive Bayes classifier . . . . 18
3.1 Figure illustrating the flowchart of the methods applied for literature study . . . . 28
3.2 Figure illustrating the waterfall model . . . . 33
3.3 Figure illustrating a sprint in the Scrum model . . . . 34
4.1 Figure illustrating the development process of the prototype . . 36
6.1 Figure illustrating the processes of the prototype . . . . 42
6.2 A subtitle formatted as HTML. . . . 43
6.3 Figure illustrating stemming . . . . 45
7.1 Results with stemming, with stop words removed . . . . 51
7.2 Results without stemming, with stop words removed . . . . 52
7.3 Results without stemming, without stop words removed . . . . 52
7.4 Results with stemming, without stop words removed . . . . 53
6.1 A table displaying the frequency for the ten most occurring words in movies for the moods 'Joy' (left table) and 'Fear' (right table) without removing stop words . . . . 44
6.2 A table displaying the frequency for the ten most occurring words in movies for the moods 'Joy' (left table) and 'Fear' (right table) after removing stop words . . . . 45
6.3 Table illustrating structure of lexicon . . . . 46
7.1 Table displaying the accuracy of the prototype for different configurations . . . . 51
Introduction
Natural language has developed and evolved as a method of communication between humans [1]. The field of Natural Language Processing centers around how computers can be used to understand and manipulate natural language in text or speech. The goal of research within this field is to develop techniques for machines to understand and handle natural language in order to perform tasks [2, p. 51]. Applications include machine translation, natural text processing, user interfaces, speech recognition and artificial intelligence [2]. In this project, Natural Language Processing techniques are used to develop a prototype that labels movies by moods through processing movie subtitles.
Moods and emotions have a very close relation even though they are separable from each other [3]. What sets a mood apart from an emotion is that moods are longer-lasting, while emotions exist in a very short time frame. Emotions are often related to a specific event or person, while moods can exist without attachment to any particular event or person. Moods also tend to be less intense than emotions [3]. Emotions can be categorized into basic emotions as described by Ekman [4, pp. 45-60] and Plutchik [5]. Ekman [4] lists six basic emotions: happiness, sadness, anger, fear, disgust, and surprise. Plutchik [5] lists eight basic emotions: ecstasy, grief, vigilance, amazement, terror, rage, loathing and admiration. Basic emotions are primal and hard-wired in our brains; humans respond to them quickly, which helps them survive and avoid dangerous situations [5]. The distinction between a mood and an emotion is minimal, and the terms are thus used interchangeably in this thesis.
1.1 Background
Sentiment Analysis, within Natural Language Processing, is the process of computationally analysing a piece of text to identify or categorize a reader's or writer's sentiment or opinion [6]. Usually, Sentiment Analysis is used to determine the positive or negative polarity of a piece of text [7]. Sentiment Analysis for multiple categories can also be done [8]. In this thesis, Sentiment Analysis is used to determine the moods of different movies by analyzing their subtitles.
Sentiment Analysis uses Text Classification to categorize sentiments and opinions [8, p. 24]. Text classification is the method of assigning texts to different topics or categories [8]. It is done by observing patterns in text, such as the structure of words and sentences or the word frequencies in movies of different moods [9]. An algorithm that classifies text is called a classifier.
1.2 Problem
A problem in sentiment analysis and text classification is that pre-labeled text must be collected before any classification can be done [9]. Generally, classification cannot be done before the patterns in text for the different categories have been observed. A pattern could, for example, be a word structure or word frequency that coincides with a specific class [9, p. 221]. To find patterns for each category, a classifier must rely on knowledge from previously analyzed data, called a training set [9]. Movies and their Swedish subtitles for the chosen moods had to be collected before implementing the classifier.
There are many obstacles and difficulties in extracting sentiments and patterns from text. Human language is subjective, and the same word or sentence can be ambiguous or give different impressions when used in different contexts. This ambiguity can be problematic when trying to identify underlying moods in a word or sentence, because they may vary from context to context [10, p. 180].
This can be summed up with the following problem statement:
How can a system label the mood of a movie by analysing the movie subtitles?
1.3 Purpose
The purpose of this thesis is to present the development of a prototype for labeling moods of movies with Swedish subtitles. The ability to correctly label movies by mood could be useful for recommendation engines in movie streaming platforms. Mood labels could be used to recommend more relevant movies to a user based on the moods of the movies in the user's history. An effective recommendation engine could improve user experience and increase watch-time on a movie streaming platform.
1.4 Goal
The goal of this project is to develop and present a prototype for labeling moods of movies through processing movie subtitles by applying Natural Language Processing techniques.
1.5 Methods
There are two categories of research methods, Qualitative and Quantitative [11]. Qualitative research methods involve gaining an understanding of underlying reasons, opinions and motivations. This understanding can be reached by performing observations, conducting interviews and reading literature related to the research. Quantitative research methods focus on objective measurements and/or mathematical analysis of collected data. They rely on large sets of data collected through surveys, case studies or experiments to ensure a valid outcome.
Research methods of qualitative character were the main focus in this project. The data collection involved defining moods for the chosen movies, which required understanding the characteristics that define a particular mood in Swedish text. Although qualitative research methods were used, the project still involved processing a rather large data set. If proper qualitative research regarding the moods of movies had not been performed, the outcome of the data would have been useless or invalid.
1.6 Stakeholders
SF Studios [12] is a Swedish movie production and distribution company. For distribution to private customers, SF Studios has a platform called SF Anytime [13]. On this platform, customers can rent movies and series on demand. The platform is available via a web page, a smartphone application and a TV application.
SF Anytime expressed a desire for a mood labeling feature for movies on their platform, intended to improve their recommendation system.
1.7 Delimitations
Due to the complex nature of moods, which are subjective to an observer, the project and the prototype covered four moods: fear, joy, sadness and surprise.
Since the project focused on the technical aspect of developing a prototype, no consideration was given to philosophical questions regarding the proper definition of what a mood is.
The subtitles were analysed word by word, meaning that the context of a word and the structure or grammar of a sentence were not taken into account in classification and analysis.
The prototype is only compatible with Swedish subtitles, since the database of SF Anytime [13] only contains subtitles for the Scandinavian countries.
Time performance and speed optimization were not prioritized in the devel- opment of the prototype.
1.8 Benefits, Ethics and Sustainability
A benefit of integrating the prototype with a movie recommendation engine could be an improved experience for the end user. Movies recommended to a user could have higher relevance and keep the user active longer. A more active user would generate more income for the owner of the platform.
Since the intended use of the prototype is to create movie recommendations by mood, the platform that integrates the prototype could construct a mood profile of a user based on the mood labels of the user's watched movies. An ethical issue might arise since this information could reveal personal information about a user's mental state and provide information for targeted advertising.
Sustainability refers to the concept of Sustainable Development [14], which is used to identify development that meets the needs of the present without compromising the ability of future generations to meet their needs [14]. Sustainable development is divided into three areas: ecological, social, and economic sustainability.
Ecological sustainability relates to the function of the biochemical system of the Earth [15]. Services and products that are produced must take into consideration the water, air, land, biodiversity, and ecological services of the Earth [15]. The production of services and goods must not overload the capacity of the ecosystem and must ensure that nature is given time to regenerate the resources of the ecosystem [15]. The prototype has no effect in this aspect.
Social sustainability concerns the psychological and physical needs of the individual [16]. This includes human rights, justice, and quality of life for each individual. The prototype has no effect in this area.
There are two major definitions of economic sustainability [17]. The first definition is from an ecological and social sustainability perspective: economic growth must not have a negative impact on the environment or on social sustainability. The second definition is from an economic perspective, where economic growth is desired. The second definition allows for economic growth at the expense of natural resources and welfare [17]. The prototype might have an impact in this area. If the prototype is integrated in a movie recommendation system, it could increase the amount of money a user spends on the platform by recommending more relevant movies to the user. A user might spend more money on the platform than intended, which could have a negative effect on the personal finances of an individual.
1.9 Outline
Chapter 2 of this thesis, Moods and Natural Language Processing, discusses emotions, moods as labels and which emotions are identifiable in movies. Furthermore, the chapter introduces the topic of Natural Language Processing and discusses different problems and techniques in the area of analyzing human language relevant to the project.
In chapter 3, Methodology and Methods, the methods and methodologies used in the project are discussed. This includes software development, research, data collection and analysis methods.
The development process for the prototype is presented in chapter 4, Development Process. This chapter discusses the three phases of the development process: Prerequisites, Implementation and Testing.
The first of the three phases in the development process is described in chapter 5, Prerequisites. The chapter describes the data collection process of the development.
Chapter 6 describes the Implementation phase of the development process. In this chapter, all functionality of the prototype is explained along with the application of Natural Language Processing methods for classifying the mood of a movie.
Chapter 7 of the thesis, Testing and Results, presents the estimated precision and accuracy of the classifier for the different moods. The chapter also discusses how the prediction tests were constructed.
Chapter 8, Conclusion and Future Work, discusses the validity and accuracy of the results and any improvements that could be made to the final prototype.
Moods and Natural Language Processing
This chapter discusses the theoretical background that was used in the development of the prototype. The chapter is divided into five sections. The first section, section 2.1, describes moods and emotions on a human level, the concept of basic emotions and the use of moods in film. Section 2.1 also gives an insight into which moods were suitable as classes for text classification in the prototype. The ability to classify text into categories was crucial to the functionality of the prototype. Movie subtitles had to be classified into moods; therefore, section 2.2 aims to explain how Natural Language Processing techniques can be used to classify texts into categories. Section 2.3 presents how the movie data were collected. Section 2.4 presents related work in this field. Section 2.5 discusses the programming languages and frameworks used for the development.
2.1 Moods and Emotions
Very often the two terms, mood and emotion, are confused with each other or used in the wrong context [18]. David Watson [19] defines moods as "transient episodes of feeling or affect" [19], which in itself indicates a strong relation between moods and emotions. Duration is the most significant characteristic that sets moods and emotions apart. An emotion is usually intense and very short in duration, arising from an encounter with meaningful stimuli that requires quick adaptation in behavior and response [18]. The nature of this reaction is sometimes argued to be a remnant of our primal instincts, a trigger of the basic "fight or flight" behavior [5].
Emotions can be seen as reactions or responses to a specific event; they do not arise randomly and for no reason [19]. In contrast, a mood is much longer in duration, from a few hours to a couple of days. Moods can also arise from internal processes and do not need a clear, externally identifiable object as a cause. Moods are not as strong as emotions, but the two are, as mentioned, very strongly related. Being in a "good" mood makes a person more prone to experiencing positive emotions, while being in a "bad" mood makes a person more inclined to experience negative emotions [3].
2.1.1 Basic Emotions
There is a close relationship between moods and emotions, and the theoretical background is therefore based on research regarding emotions rather than moods. A major reason for using the term emotion is the theory of basic emotions.
Plutchik [5] has developed a model for representing human emotions. In this model, Plutchik placed the primary, or basic, emotions in the center. The argument for this architecture is that other emotions are mixtures of the eight basic emotions [20]. In the original model, the proposed basic emotions were: joy, sorrow, anger, fear, acceptance, disgust, surprise and expectancy. These emotions were selected since they polarize each other in pairs. Naturally, over the years, there have been other suggestions for the basic emotions and their number. The basic emotions of the model today are: ecstasy, grief, vigilance, amazement, terror, rage, loathing, and admiration [5].
Ekman [21] has conducted extensive research concerning the relation between emotions and facial expressions. Through this research, Ekman found that facial expressions of certain emotions appeared to be universal [22]. These emotions can be identified as separate, discrete states. The strongest evidence for distinguishing one emotion from another comes from research on facial expressions, and through this method Ekman recognizes six different basic emotions: happiness, sadness, anger, fear, disgust and surprise [21]. Ekman also argues that there is no need to take an evolutionary view of the emotions to be able to successfully identify them. Social learning is seen as a major contributor to these basic emotions, regardless of culture.
2.1.2 Moods in Film
Despite its neglect in film philosophy and film theory, mood has long been recognized as important to a film's aesthetics. One term that has often been used in a movie context for referring to the mood of a film is the German word Stimmung [23], meaning mood or atmosphere. Moods contribute to the composition of the universe where the film takes place. The mood of a certain cinematographic universe sets the baseline for what events, actions and situations the viewer might find interesting, odd, disturbing or otherwise emotionally significant. Moods can also serve as tools to mold or disturb the narrative flow of a film or aid in a transition of the narrative [23].
Greg M. Smith [24] claims that the "...primary emotive effect of film is to create mood" [24]. Movies can invoke both emotions and moods. A viewer might become angry for a brief moment during one scene, while the movie in its entirety invokes a mood of happiness. In this context, the issue of art moods versus human moods arises. Saying that a movie is sad is metaphorical, since a human mood is a discrete mental state and a movie cannot possess a mental state [25]. A movie can be used to invoke different moods in viewers, but it cannot have human moods itself. In film and literature, the mood of a work is its emotional character or tone. The mood of a film can be seen as a combination of all the small elements of the film that together characterize the overall experience. A sad film might fail to invoke sadness and a happy film might fail to invoke happiness, but the film extends an invitation to "feel". It is up to the viewer to accept or reject the invitation [23].
The Swedish Film Institute [26] conducted a study in the fall of 2018 to map what moods the Swedish film audience wished to experience when watching a film [27]. The study focused on four basic moods: excitement/surprise, happiness/laughter, emotional/cry, and fear/discomfort. It showed that the majority of the film audience wished to feel excitement/surprise when watching a film. The results were most prominent among audience members aged 40 to 54, living in rural areas and with low income. Among women, the wish to feel happiness/laughter was particularly popular [27].
When asked what evokes the mood of excitement/surprise, the film audience's answers showed a strong correlation between a mood and certain genres. A particularly strong correlation was evident between excitement/surprise and the genres action, crime and thriller. Answers from the study also showed a strong correlation between the mood of happiness/laughter and the genre comedy [27].
2.2 Natural Language Processing
Natural Language Processing is a field in computer science concerning how to manipulate natural language in the form of text or speech through computational methods [2]. One area within Natural Language Processing, text classification, is particularly important for this degree project. Text classification is the problem of determining which category a body of text belongs to. Bodies of text in Natural Language Processing applications are represented as corpora (the plural of corpus).
A corpus is a body of text [28, p. 6] that is used to perform statistical analysis [29]. Corpora serve as building blocks of data that are used to build up large lexicons [9]. Analysing corpora is done by statistically probing and manipulating text. Some corpora contain noise that must be filtered out for better results [29]. One type of noise in corpora is markup [28, p. 475], an annotation in a document that explains the structure or format of a text [28, p. 123]. Another type of noise that occurs in written corpora is function words. Function words are short grammatical words, such as it, in and for, that generally dominate the word population in text and may need to be accounted for when processing a corpus [28, pp. 20-23].
There are many different types of corpora that can be used for Natural Language Processing applications. The leading platform for building Python programs with human language data, the Natural Language Toolkit (NLTK) [30], uses four different types of corpora: Isolated Corpus, Categorized Corpus, Overlapping Corpus and Temporal Corpus [9].
The simplest type of corpus used by NLTK, the isolated corpus, is a standard collection of text without any categorization. If a corpus is grouped into different types of categories, it is called a categorized corpus. An overlapping corpus is a categorized corpus with overlapping categories. The final type of corpus used by NLTK is called a temporal corpus. A temporal corpus is a collection of usages of text for a specific period in time [9].
Another representation of text, particularly useful for text classification, is the lexicon [31, p. 19]. A lexicon is a collection of information about words that belong to a specific category [32]. Each set in a lexicon is called a lexical entry. Each lexical entry contains a word and information about that particular word [32]. Lexicons can be used to store information about word frequency for a given category, which is needed in text classification [33]. Information needed for probabilistic approaches in large lexicons should always be collected computationally, as it is infeasible to do manually. Lexical entries are collected by analysing adequate corpus data [33].
Text classification can be done using a Corpus-Based Approach. A corpus-based classifier relies on using corpora to build lexicons for specific categories that can then be used for analysis [8, p. 95]. There is, however, one major problem with a corpus-based text classification approach: the Word Sense Disambiguation problem.
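The corpus-based approach can be sketched as follows: a word-frequency lexicon is built per category from pre-labeled corpora. This is a minimal illustration, not the thesis implementation; the function name and the toy mood data are hypothetical.

```python
from collections import Counter

def build_lexicons(labeled_corpora):
    """Build one frequency lexicon per category from pre-labeled corpora.

    labeled_corpora: dict mapping a category (mood) to a list of texts.
    Returns a dict mapping each category to a Counter of word frequencies.
    """
    lexicons = {}
    for category, texts in labeled_corpora.items():
        counter = Counter()
        for text in texts:
            counter.update(text.lower().split())
        lexicons[category] = counter
    return lexicons

# Toy example with two mood categories
corpora = {
    "joy": ["what a wonderful happy day", "happy laughter everywhere"],
    "fear": ["the dark house was silent", "a scream in the dark"],
}
lexicons = build_lexicons(corpora)
print(lexicons["joy"]["happy"])   # 2: 'happy' occurs twice in the joy corpus
print(lexicons["fear"]["dark"])   # 2: 'dark' occurs twice in the fear corpus
```

A classifier can then compare a new subtitle against these per-category frequency lexicons.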
2.2.1 Word Sense Disambiguation
Word Sense Disambiguation is the classification problem of determining which sense or meaning of a specific word is activated in a specific context. There are three classes of Word Sense Disambiguation methods used in the field of artificial intelligence: Supervised, Unsupervised and Knowledge-Based Disambiguation [34].
Supervised disambiguation is based on learning and makes use of annotated corpora. The input data are marked with the classes or categories they belong to, which is used to further build on a lexicon [35]. The corpora can also be annotated with weights reflecting how important the text data are for a certain category [36]. Supervised learning is generally used as a classification task [28].
As opposed to supervised disambiguation, unsupervised disambiguation uses only raw corpora for learning. Unsupervised learning often starts by making use of knowledgeable sources that it can improve further. Unsupervised algorithms often use a technique called clustering, a way of analysing corpora by grouping similar objects into the same category [28].
The final technique, the Knowledge-Based technique, relies primarily on dictionaries or lexicon bases [36]. Knowledge-based methods are usually based on already developed and well-established lexicons. The problem with knowledge-based methods is that they only perform well with very large dictionaries. These are often general and may not be suitable for niche projects [35]. A knowledge-based approach is sometimes called a dictionary-based approach [8, p. 91].
2.2.2 Phases of Corpus Processing
The method for processing a corpus consists of three phases: Text Pre-processing, Text Representation and Knowledge Discovery [37, pp. 388-390].
The first phase, text pre-processing, is the process of filtering out noise from text [38]. Corpora scraped from the web can, for example, be annotated with markup. Markup is a type of noise that must be taken into account when mining data on the web [9]. A popular format for web pages is HTML, which is a markup language [39], meaning it has tags and markings that denote the structure of a document [40]. Another, more common, type of noise is function words [28], also called stop words [38]. Stop words can be removed by using a stop word dictionary [38]. It is also important that all words in a corpus are in the same letter case: a word at the beginning of a sentence starts with an uppercase letter and will not be seen by a computer as the same word as its lowercase form [29, p. 71].
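The pre-processing steps described above (markup removal, case normalization, stop word filtering) can be sketched as follows. This is a simplified illustration, not the thesis implementation: the tag-stripping regular expression stands in for a proper HTML parser such as Beautiful Soup, and the tiny Swedish stop-word list is a hypothetical stand-in for a full stop word dictionary.

```python
import re

# A tiny illustrative Swedish stop-word list (hypothetical; a real
# application would use a complete stop word dictionary).
STOP_WORDS = {"och", "i", "det", "en", "att", "som"}

def preprocess(subtitle_html):
    """Strip markup, lowercase, tokenize and remove stop words."""
    # Remove HTML-style markup tags (simplified; real subtitle files
    # may require a proper parser).
    text = re.sub(r"<[^>]+>", " ", subtitle_html)
    # Normalize case so 'Huset' and 'huset' count as the same word,
    # then split into word tokens.
    tokens = re.findall(r"\w+", text.lower())
    # Filter out stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("<i>Det var en mörk natt</i> och huset var tyst"))
# ['var', 'mörk', 'natt', 'huset', 'var', 'tyst']
```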
Stemming is an important term in text pre-processing. "Stemming is the process of removing affixes from a word in order to obtain its stem, and a stemmer is an algorithm that performs this process" [41]. The complexity of stemmers varies from language to language. Languages such as Arabic, where many affixes can be used on words, are more difficult than languages such as English [41].
Stemmers can become quite complex and may be difficult to implement or may require a large amount of data. There are many stemmers available for English, but not very many exist for Swedish. Stemming is particularly useful for smaller entities of text. A similar technique called Lemmatization can be used instead to convert words into their grammatical base form [41].
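To illustrate what a (heavily simplified) Swedish stemmer does, the sketch below strips a few common inflectional suffixes. The suffix list is a toy assumption for illustration only; a real implementation would use an established algorithm such as the Snowball stemmer, which NLTK provides for Swedish.

```python
# Illustrative suffixes only, longest first; a real Swedish stemmer
# (e.g. Snowball) uses a much more careful rule set.
SUFFIXES = ["arna", "erna", "orna", "ande", "aren", "en", "ar", "er", "or", "a"]

def naive_stem(word):
    """Strip the longest matching suffix, keeping a stem of >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("bilarna"))  # 'bil'  (the cars  -> car)
print(naive_stem("husen"))    # 'hus'  (the houses -> house)
```

With stemming, inflected forms such as "bilar" and "bilarna" map to the same lexical entry, which reduces sparsity in the frequency lexicon.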
The second phase of corpus processing is defining data structures for text representation. Text can be represented as corpora or lexical entities. The data structure of corpora and lexical entries is an important choice for text analysis. The simplest model is the Bag of Words model [9, p. 50], where each word is one entry in the data set. The data set is unordered, and the only information stored about a word in the bag is the frequency with which it appears in a corpus [42, p. 65]. Figure 2.1 illustrates how raw text can be transformed into a Bag of Words data structure that contains the frequency of each word in the text.
Figure 2.1: Figure illustrating the Bag of Words method. Figure inspired by Manning [28]
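The Bag of Words transformation illustrated in Figure 2.1 can be sketched with Python's standard library (a minimal illustration, not the thesis implementation):

```python
from collections import Counter

def bag_of_words(text):
    """Unordered word -> frequency mapping; all word order is discarded."""
    return Counter(text.lower().split())

bag = bag_of_words("the dog chased the cat")
print(bag["the"])  # 2
print(bag["dog"])  # 1
```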
Another model that could be used is the Vector Space Model. In the Vector Space Model, a document is represented as a count vector. Each entry in the vector is called a feature and represents one entry in the document vector [28]. While the two are similar, a document vector generally contains more information than a Bag of Words.
The third and final phase is called Knowledge Discovery. Knowledge Discovery means extracting knowledge from data [43]. It is in this phase that the data from the text representation phase is actually used, through machine learning or data mining methods [37, p. 390] such as the Naive Bayes classifier [44, p. 20]. For example, the Bag of Words could be used to create or build upon lexical entries in a lexicon.
2.2.3 Natural Language Levels
Before processing a corpus, one must first decide which parts of natural language to analyse. Natural Language Processing is a broad field that concerns processing text on many different levels. Text can be extracted and processed on six different levels: the morphological, lexical, syntactic, semantic, discourse and pragmatic levels [2, p. 56]. Different techniques should be used depending on the level at which a corpus will be processed.
The morphological level deals with the smallest parts of words, such as prefixes and suffixes [2, p. 56]. A major focus at the morphological level is word formation: the same word can appear in different forms depending on its inflection [28]. Examples of techniques within the scope of morphology are Stemming and Lemmatization [45], which are used in text pre-processing.
Following the morphological level is the lexical level, where analysis is applied at the scope of words [2, p. 56]. This level focuses on which part of speech a particular word is used as. A part of speech can be seen as a grammatical category for words [28]. The technique used for labeling words with their grammatical category is called Part-of-Speech Tagging (POS-tagging) [46]. Stockholm University [47] provides an open-source part-of-speech tagger for Swedish text called Stagger [48].
The syntactic level in Natural Language Processing deals with the structure of sentences [2, p. 56]. In linguistics, syntax is the set of rules in a language that concern the form of sentences [32]. The form of a sentence should not be confused with morphological structure, which only concerns words. Since sentences can be structured in many different ways, it is not feasible to manually capture all possible sentence patterns; instead, computational approaches must be taken. One approach is to construct syntactic N-Trees [49], tree data structures with n-tuples that can be used to store the structures of sentences.
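As a rough illustration, the structure of a sentence can be stored as nested n-tuples forming a tree; the bracketing below is hand-made for illustration, not the output of any particular parser:

```python
# Each node is an n-tuple: a label followed by its subtrees; leaves are words.
sentence_tree = (
    "S",
    ("NP", ("Det", "the"), ("N", "dog")),
    ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "cat"))),
)

def leaves(tree) -> list[str]:
    """Recover the words of the sentence from the tree's leaf nodes."""
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    return [word for child in children for word in leaves(child)]

print(" ".join(leaves(sentence_tree)))  # → the dog chased the cat
```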
The semantic level covers the meaning of words and sentences [2, p. 56]. This level focuses on the meaning of a word or sentence without taking outside context into account [32]. Lexical semantics can be studied through two different approaches. The first approach is to study the meaning of individual words. The second is to study how the meanings of words relate to each other and how multiple words can be combined into the meaning of a sentence [28, pp. 109-110]. Examples of applications at this level are identifying words with similar meanings (synonyms), words with opposite meanings (antonyms), and the meanings of ambiguous words [28, p. 110].
The discourse level concerns the relationships between sentences in a text. Analysis at this level mainly focuses on anaphors [28]. Anaphors are expressions that refer back to previous expressions in the same text [32]. For example, a person who has previously been introduced in a text may later be referred to as ’He’ or ’the Man’. For a program to know who ’He’ or ’the Man’ refers to, it must be able to identify the person in the context of the text. If ’He’ and ’the Man’ refer to the same person, they are anaphorically related [28].
Pragmatics is the study of how knowledge about the world interacts with the literal meaning of text. Due to the difficulty of modeling the complexity of the world and the lack of data, this area of Natural Language Processing has not received much attention [28]. The semantic representation, or logical form, of an utterance is distinct from its pragmatic interpretation [50]. Pragmatics considers language as an instrument of communication: what people actually mean when they use language, and how a listener interprets it. According to Jenny Thomas, pragmatics considers the negotiation of meaning between speaker and listener, the context of the utterance, and the meaning potential of an utterance [51]. This can be illustrated by a simple example.
Utterance: “Do you know what time it is?”
Pragmatic meaning: Why are you late? Response: Explanation or apology.
Literal meaning: What time is it? Response: A time
The above example illustrates the ambiguity of natural language, along with the difficulty of modeling the real world when performing Natural Language Processing.
2.2.4 Similarity Measurement
Similarity measurement is the means of determining how well a particular corpus or vector fits a certain pattern or category by quantifying the relationship between different features [52]. Similarity measurement is useful when testing supervised dictionary algorithms or when learning unsupervised dictionary algorithms [34, p. 57].
If both the input document and the lexicon are represented as vectors in the
same term space, a measure of the similarity between them can be computed.
Similarity can be calculated as the probability that a certain query belongs to a certain category. Similarity can only be measured through probabilistic models if quantifiable data is available. To obtain quantifiable data, there must be some form of weighting in the proposed data structures of a lexicon. The most effective weighting principle for documents is based on estimating the relevancy of a query to a certain document or lexicon [44].
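When a document and a lexicon are represented as vectors in the same term space, one common similarity measure is the cosine of the angle between them. The sketch below is a generic illustration with made-up weights, not necessarily the measure used in the prototype:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors in the same term space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

document = [2.0, 1.0, 0.0]  # hypothetical term counts for a document
lexicon = [4.0, 2.0, 0.0]   # hypothetical weights for a mood lexicon
print(round(cosine_similarity(document, lexicon), 6))  # → 1.0 (same direction)
```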
2.2.5 Naive Bayes
The Naive Bayes classifier is a probabilistic classifier that is useful for text classification. A classifier is an algorithm that performs classification by mapping input data to a category. Naive Bayes applies Bayes theorem [53], a mathematical formula for calculating conditional probabilities that is mainly used in statistics. Conditional probability is the probability of one event occurring, given that another event has already occurred [54].
P(H|E) = P(E|H) × P(H) / P(E)    (2.1)
In Bayes theorem, presented in formula 2.1, P(H|E) is the likelihood of event H given that E is true. In the same way, P(E|H) is the likelihood of event E given that H is true. The probability P(H) is the likelihood of event H without any knowledge of E, and P(E) is the likelihood of event E without any knowledge of H. P(E|H) × P(H) can be written as P(H ∩ E), which is the probability of both H and E being true. P(H ∩ E) also equals P(H|E) × P(E).
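As a worked numeric example of formula 2.1 (the probabilities below are made-up values, chosen only for illustration):

```python
# Made-up example: H = "movie has the mood fear", E = "the word 'scream' occurs".
p_h = 0.2           # P(H): prior probability of the mood fear
p_e_given_h = 0.3   # P(E|H): probability 'scream' occurs given the mood is fear
p_e = 0.12          # P(E): overall probability that 'scream' occurs

# Bayes theorem: P(H|E) = P(E|H) * P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 4))  # → 0.5
```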
Figure 2.2 illustrates the probability of the events as two intersecting circles in a Venn diagram. The diagram displays the denominator P(E) in formula 2.1 as the right circle, P(H) as the left circle, and the numerator, P(E|H) × P(H), as the intersection of the two circles. In the Venn diagram, the probability P(H|E) would thus be the size of the intersection divided by the size of the right circle.
Figure 2.2: Illustration of the conditional probability for the events H and E.
Figure by the authors.
The Naive Bayes classifier assigns a class to a corpus given its vector representing the occurrences of words, also referred to as features or attributes, in that specific corpus. A classifier is constructed by using a collection of labeled training corpora to estimate the parameters for each class. Classification can then be performed by assigning a new corpus to the class that is most likely to have generated that specific corpus [55]. The Naive Bayes classifier is very effective and is one of the simplest of the Bayesian classifiers.
The Naive Bayes classifier assumes that all attributes of a vector are independent of each other, also called conditional independence [56]. In reality, this assumption is far from correct, since there are strong correlations between the occurrences of words in a corpus, but the Naive Bayes classifier generally performs well regardless. Since the attributes are treated as unrelated, they can be learned separately, which simplifies learning when a large number of attributes are involved, as is often the case when working with corpora. In Naive Bayes, each attribute node has no parent except the class node, as shown in Figure 2.3 [56].
There are two models of how occurrences of attributes in a corpus can be represented before the Naive Bayes assumption is applied [55]. In both models the order of words is lost, so both can be seen as a Bag of Words [9]. In the first model, a corpus is represented as a vector of binary attributes indicating which words occur or do not occur in the corpus. The number of occurrences of a word is not represented in the vector. When calculating the probability of a corpus, the probabilities of all attribute values, including non-occurrence, are multiplied together.

Figure 2.3: A figure representing the structure of the attributes, A_n, with regard to a class node, C, in the Naive Bayes classifier. Figure inspired by H. Zhang [56]

In the second model, a corpus is represented by the set of word occurrences in the document, along with the number of occurrences of each particular word. When calculating the probability of a corpus in the second model, only the probabilities of the occurring words are multiplied [55]. For example, the probability of a document belonging to a class can be calculated by multiplying the conditional probabilities of the features in the corpus [34, p. 11], as illustrated in formula 2.2, where the features are words. In formula 2.2, P(word_i|class) is the probability of a word belonging to a class and P(corpus|class) is the probability of the entire text belonging to a class. When classifying a corpus, the probability that the corpus belongs to each possible class must be calculated. The class that gives the highest probability is the class into which the corpus is categorized.
P(corpus|class) = ∏_{i=1}^{a} P(word_i|class)    (2.2)
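Formula 2.2 can be sketched directly in code; the per-word probabilities below are made-up values for illustration:

```python
# Hypothetical per-word probabilities P(word_i | class) for one class.
word_probs = {"dark": 0.02, "scream": 0.05, "night": 0.03}

def corpus_probability(words: list[str], probs: dict[str, float]) -> float:
    """Multiply P(word_i|class) for every word in the corpus (formula 2.2)."""
    product = 1.0
    for word in words:
        product *= probs[word]
    return product

print(corpus_probability(["dark", "scream", "night"], word_probs))
```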
The Naive Bayes classifier encounters a problem when the number of words in a corpus grows large. The probability of a word belonging to a class is a value between zero and one. Since the probability factors for each word in the corpus are multiplied with each other, the overall product becomes very small. This number can become so small that it cannot be represented as floating point data and is rounded to zero [57]. This would introduce invalid zero probabilities into the calculations. A common way of overcoming this issue is to represent the probabilities as log-probabilities [57]. The transformation to logarithms works well because the logarithm of a product is the sum of the logarithms. Formula 2.3 shows how formula 2.2 is logarithmized to calculate the logarithmic probability P_logarithmic(corpus|class) of a corpus in a specific class. Since the logarithm of a product is the sum of the logarithms, the probabilities P(word_i|class) in 2.3 are summed instead of multiplied. This results in more manageable numbers, but since the logarithm is taken of values between zero and one, the final value will be a negative number.
P_logarithmic(corpus|class) = Σ_{i=1}^{a} ln P(word_i|class)    (2.3)
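A short sketch with made-up probabilities shows why the logarithmic form of formula 2.3 stays representable even when the direct product of formula 2.2 underflows to zero:

```python
import math

# A long corpus: 1000 words, each with a hypothetical probability of 0.01.
probs = [0.01] * 1000

product = 1.0
for p in probs:
    product *= p          # direct product (formula 2.2) underflows to zero...
print(product)            # → 0.0

log_prob = sum(math.log(p) for p in probs)  # ...but the log-sum does not
print(log_prob)           # → ≈ -4605.17 (a negative number, as expected)
```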
Another obstacle in the Naive Bayes implementation arises when the classifier encounters a word that was not in the training corpora, meaning that the word does not exist in the lexicon of the classifier. The classifier then calculates the probability of such a missing word as zero. Since the classifier calculates the probability of a corpus belonging to a specific class through multiplication of the probabilities of each word belonging to the class, an introduced zero results in a zero product and zero probability for the corpus belonging to that class [9]. This can have a dramatic effect, as the probability is calculated as zero even if all words but one matched a certain class lexicon. This can, however, be solved by using a statistical technique called Smoothing [58], where the classifier assumes that all words have been seen one more time than they actually have. Smoothing leads to a previously unseen word being handled as if it had been included in the training corpora one time. This removes any unwanted zero probabilities from the final probability calculation. This technique is sometimes referred to as the Laplace estimator or Additive Smoothing [58] due to the minimum-one-occurrence approach for all words.
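Following the description above, add-one smoothing can be sketched as follows. The counts and vocabulary are made-up; the denominator adds one pretend occurrence per known word:

```python
# Hypothetical word counts for one class, built from training corpora.
class_counts = {"dark": 4, "scream": 9, "night": 7}
vocabulary_size = len(class_counts)
total = sum(class_counts.values())  # 20 word occurrences in this class

def smoothed_probability(word: str) -> float:
    """Laplace (add-one) smoothing: every word is counted one extra time."""
    count = class_counts.get(word, 0)
    return (count + 1) / (total + vocabulary_size)

print(smoothed_probability("scream"))   # seen word: (9 + 1) / 23
print(smoothed_probability("sunrise"))  # unseen word: nonzero instead of zero
```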
P_smoothing(word|class) = (occ(word, class) + 1) / (Σ_{i=1}^{|V|} occ(word_i, class) + |V|)