Labeling Moods of Movies by Processing Subtitles
PETER SVENSSON YOUSSEF TAOUDI
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Abstract
Labeling movies by mood is a feature that is useful for recommendation engines in modern movie streaming applications, as it could improve user experience by recommending more relevant movies to users. This thesis describes the development of a mood labeling feature that labels movies by processing movie subtitles through Natural Language Processing. Movies are processed by analysing subtitles to predict the mood of a movie through computational methods. The prototype utilizes movies pre-labeled with moods to construct a lexicon that contains information about the defining attributes of moods in movie subtitles. Using the constructed lexicon, the similarities between a movie subtitle and a lexicon can be compared to calculate the probability that a movie belongs to a specific mood. Four moods were chosen for analysis in the prototype: fear, sadness, joy, and surprise.
The Naive Bayes method was chosen as the classifier for the prototype. A Naive Bayes classifier observes each occurring word in a movie without consideration of the context of the word in a text or sentence. The results showed that the classifier had trouble distinguishing between the moods. However, for all configurations of the prototype, the classifier showed higher precision for the mood fear compared to the other moods. Overall, the classifier performed poorly and did not produce a reliable result.
Keywords - Mood Labeling, Natural Language Processing, Naive Bayes, Text Classification
Sammanfattning
Labeling movies by mood is a feature that is useful for recommendation engines in modern movie streaming applications. Movie recommendation based on mood could improve the user experience on movie streaming platforms by recommending more relevant movies to users. This thesis describes the development of a prototype for labeling movies by mood by processing the movie's subtitles using Natural Language Processing methods. Movies are processed by analysing subtitles to determine the mood of a film. The prototype uses movies pre-labeled with moods to construct a lexicon containing information about the defining attributes of a mood in movie subtitles. Using a constructed lexicon, the similarities between a movie subtitle and a lexicon can be compared to calculate the probability that a movie belongs to a specific mood. Four moods were chosen for analysis in the prototype: fear, sadness, joy, and surprise.
The Naive Bayes method was chosen as the classifier for the prototype. A Naive Bayes classifier observes each occurring word without regard to the word's context in a sentence or text. The results showed that the classifier had trouble distinguishing between the moods. For all configurations of the prototype, however, the classifier showed higher precision for fear compared to the other moods. Overall, the classifier performed poorly and did not produce a reliable result.
Keywords - Mood Classification, Language Technology, Naive Bayes, Text Classification
1 Introduction 1
1.1 Background . . . . 2
1.2 Problem . . . . 2
1.3 Purpose . . . . 3
1.4 Goal . . . . 3
1.5 Methods . . . . 3
1.6 Stakeholders . . . . 4
1.7 Delimitations . . . . 4
1.8 Benefits, Ethics and Sustainability . . . . 4
1.9 Outline . . . . 6
2 Moods and Natural Language Processing 7
2.1 Moods and Emotions . . . . 7
2.1.1 Basic Emotions . . . . 8
2.1.2 Moods in Film . . . . 9
2.2 Natural Language Processing . . . . 10
2.2.1 Word Sense Disambiguation . . . . 11
2.2.2 Phases of Corpus Processing . . . . 12
2.2.3 Natural Language Levels . . . . 13
2.2.4 Similarity Measurement . . . . 15
2.2.5 Naive Bayes . . . . 16
2.2.6 Accuracy, Precision and Recall . . . . 20
2.3 Database . . . . 21
2.4 Related Work . . . . 21
2.5 Software and Frameworks . . . . 23
2.5.1 Python . . . . 23
2.5.2 Natural Language toolkit . . . . 24
2.5.3 Beautiful Soup . . . . 24
2.5.4 MySQL . . . . 24
3 Methodology and Methods 26
3.1 Research Methods . . . . 26
3.2 Data Collection . . . . 27
3.2.1 Literature Study . . . . 27
3.2.2 Interview . . . . 28
3.3 Data Analysis . . . . 29
3.4 Research Quality Assurance . . . . 30
3.5 Software Quality Assurance . . . . 30
3.6 Software Process Model . . . . 31
3.6.1 Reuse Model . . . . 31
3.6.2 Waterfall Model . . . . 32
3.6.3 Incremental Model . . . . 34
4 Development Process 36
4.1 Prerequisites . . . . 37
4.2 Implementation . . . . 37
4.3 Testing . . . . 38
5 Prerequisites 39
5.1 Interview . . . . 39
5.2 Selecting Moods for Prototype . . . . 40
5.3 Selecting Sample Movies . . . . 40
6 Implementation of Prototype 42
6.1 Text Processing . . . . 43
6.1.1 Removing Noise . . . . 43
6.1.2 Stop Words . . . . 43
6.1.3 Stemming . . . . 45
6.1.4 Term Frequency . . . . 46
6.2 Learning . . . . 46
6.3 Classification . . . . 47
7 Testing and Results 50
8 Conclusion and Future Work 55
8.1 Discussion . . . . 55
8.2 Conclusion . . . . 56
8.3 Future Work . . . . 57
A Training and Test Movies 58
B Stop Words 63
C Interview with the Swedish Film Institute 64
D Implementation 65
Bibliography 66
2.1 Figure illustrating the Bag of Words method . . . . 13
2.2 Illustration of the conditional probability for the events H and E in Bayes' theorem . . . . 17
2.3 A figure representing the structure of the attributes (A) with regard to a class node (C) in the Naive Bayes classifier . . . . 18
3.1 Figure illustrating the flowchart of the methods applied for literature study . . . . 28
3.2 Figure illustrating the waterfall model . . . . 33
3.3 Figure illustrating a sprint in the Scrum model . . . . 34
4.1 Figure illustrating the development process of the prototype . . 36
6.1 Figure illustrating the processes of the prototype . . . . 42
6.2 A subtitle formatted as HTML. . . . 43
6.3 Figure illustrating stemming . . . . 45
7.1 Results with stemming, with stop words removed . . . . 51
7.2 Results without stemming, with stop words removed . . . . 52
7.3 Results without stemming, without stop words removed . . . . 52
7.4 Results with stemming, without stop words removed . . . . 53
6.1 A table displaying the frequency for the ten most occurring words in movies for the moods 'Joy' (left table) and 'Fear' (right table) without removing stop words . . . . 44
6.2 A table displaying the frequency for the ten most occurring words in movies for the moods 'Joy' (left table) and 'Fear' (right table) after removing stop words . . . . 45
6.3 Table illustrating structure of lexicon . . . . 46
7.1 Table displaying the accuracy of the prototype for different configurations . . . . 51
Introduction
Natural language has developed and evolved as a method of communication between humans [1]. The field of Natural Language Processing centers around how computers can be used to understand and manipulate natural language in text or speech. The goal of research within this field is to develop techniques for machines to understand and handle natural language in order to perform tasks [2, p. 51]. Applications include machine translation, natural text processing, user interfaces, speech recognition and artificial intelligence [2]. In this project, Natural Language Processing techniques are used to develop a prototype that labels movies by moods through processing movie subtitles.
Moods and emotions have a very close relation even though they are separable from each other [3]. What sets a mood apart from an emotion is that moods are longer-lasting, while emotions exist in a very short time frame. Emotions are often related to a specific event or person, while moods can exist without attachment to any particular event or person. Moods also tend to be less intense than emotions [3]. Emotions can be categorized into basic emotions as described by Ekman [4, pp. 45-60] and Plutchik [5]. Ekman [4] lists six basic emotions: happiness, sadness, anger, fear, disgust, and surprise. Plutchik [5] lists eight basic emotions: ecstasy, grief, vigilance, amazement, terror, rage, loathing and admiration. Basic emotions are primal and hard-wired in our brains; humans respond to them quickly, which helps them survive and avoid dangerous situations [5]. The distinction between a mood and an emotion is minimal, and the terms are thus used interchangeably in this thesis.
1.1 Background
Sentiment Analysis, within Natural Language Processing, is the process of computationally analysing a piece of text to identify or categorize a reader's or writer's sentiment or opinion [6]. Usually, Sentiment Analysis is used to determine the positive or negative polarity of a piece of text [7]. Sentiment Analysis for multiple categories can also be done [8]. In this thesis, Sentiment Analysis is used to determine the moods of different movies by analyzing their subtitles.
Sentiment Analysis uses Text Classification to categorize sentiments and opinions [8, p. 24]. Text classification is the method of assigning texts to different topics or categories [8]. It is done by observing patterns in text, such as the structure of words and sentences or the word frequencies in movies of different moods [9]. An algorithm that classifies text is called a classifier.
1.2 Problem
A problem in sentiment analysis and text classification is that pre-labeled text must be collected before any classification can be done [9]. Generally, classification cannot be done before the patterns in text for the different categories have been observed. A pattern could, for example, be a word structure or word frequency that coincides with a specific class [9, p. 221]. To find patterns for each category, a classifier must rely on knowledge from previously analyzed data, called a training set [9]. Movies and their Swedish subtitles for the chosen moods had to be collected before implementing the classifier.
There are many obstacles and difficulties in extracting sentiments and patterns from text. Human language is subjective, and the same word or sentence can be ambiguous or give different impressions when used in different contexts. This ambiguity can be problematic when trying to identify underlying moods in a word or sentence, because they may vary from context to context [10, p. 180].
This can be summed up with the following problem statement:
How can a system label the mood of a movie by analysing the movie subtitles?
1.3 Purpose
The purpose of this thesis is to present the development of a prototype for labeling moods of movies with Swedish subtitles. The ability to correctly label movies by mood could be useful for recommendation engines in movie streaming platforms. Mood labels could be used to recommend more relevant movies to a user based on the moods of the movies in the user's history. An effective recommendation engine could improve user experience and increase watch-time on a movie streaming platform.
1.4 Goal
The goal of this project is to develop and present a prototype for labeling moods of movies through processing movie subtitles by applying Natural Language Processing techniques.
1.5 Methods
There are two categories of research methods, Qualitative and Quantitative [11]. Qualitative research methods involve gaining an understanding of underlying reasons, opinions and motivations. This understanding can be reached by performing observations, conducting interviews and reading literature related to the research. Quantitative research methods focus on objective measurements and/or mathematical analysis of collected data. They rely on large sets of data collected through surveys, case studies or experiments to ensure a valid outcome.
Research methods of qualitative character were the main focus in this project. The data collection involved defining moods for the chosen movies, which required understanding the characteristics that define a particular mood in Swedish text. Although qualitative research methods were used, the project still involved processing a rather large data set. If proper qualitative research regarding the moods of movies had not been performed, the outcome of the data would have been useless or invalid.
1.6 Stakeholders
SF Studios [12] is a Swedish movie production and distribution company. For distribution to private customers, SF Studios has a platform called SF Anytime [13]. On this platform, customers can rent movies and series on demand. The platform is available via a web page, a smartphone application and a TV application.
SF Anytime expressed a desire for a mood labeling feature for movies on their platform, intended to improve their recommendation system.
1.7 Delimitations
Due to the complex nature of moods, which are subjective to an observer, the project and the prototype covered four moods: fear, joy, sadness and surprise.
Since the project focused on the technical aspect of developing a prototype, no consideration was given to philosophical questions regarding the proper definition of what a mood is.
The subtitles were analysed word by word, meaning that the context of a word and the structure or grammar of a sentence were not taken into account in classification and analysis.
The prototype is only compatible with Swedish subtitles, since the database of SF Anytime [13] only contains subtitles for the Scandinavian countries.
Time performance and speed optimization were not prioritized in the devel- opment of the prototype.
1.8 Benefits, Ethics and Sustainability
A benefit of integrating the prototype with a movie recommendation engine could be an improved experience for the end user. Movies recommended to a user could have higher relevance and keep the user active longer. A more active user would generate more income for the owner of the platform.
Since the intended use of the prototype is to create movie recommendations by mood, the platform that integrates the prototype could construct a mood profile of a user based on the mood labels of the user's watched movies. An ethical issue might arise since this information could reveal personal information about a user's mental state and provide information for targeted advertising.
Sustainability refers to the concept of Sustainable Development [14], which is used to identify development that meets the needs of the present without compromising the ability of future generations to meet their needs [14]. Sustainable development is divided into three areas: ecological, social, and economic sustainability.
Ecological sustainability relates to the function of the biochemical system of the Earth [15]. Services and products that are produced must take into consideration the water, air, land, biodiversity, and ecological services of the Earth [15]. The production of services and goods must not overload the capacity of the ecosystem and must ensure that nature is given time to regenerate the resources of the ecosystem [15]. The prototype has no effect in this aspect.
Social sustainability concerns the psychological and physical needs of the individual [16]. This includes human rights, justice, and quality of life for each individual. The prototype has no effect in this area.
There are two major definitions of economic sustainability [17]. The first definition is from an ecological and social sustainability perspective: economic growth must not have a negative impact on the environment or on social sustainability. The second definition is from an economic perspective, where economic growth is desired. The second definition allows for economic growth at the expense of natural resources and welfare [17]. The prototype might have an impact in this area. If the prototype is integrated in a movie recommendation system, it could increase the amount of money a user spends on the platform by recommending more relevant movies to the user. A user might spend more money on the platform than intended, which could have a negative effect on the personal finances of an individual.
1.9 Outline
Chapter 2 of this thesis, Moods and Natural Language Processing, discusses emotions, moods as labels and which emotions are identifiable in movies. Furthermore, the chapter introduces the topic of Natural Language Processing and discusses different problems and techniques in the area of analyzing human language relevant to the project.
In chapter 3, Methodology and Methods, the methods and methodologies used in the project are discussed. This includes software development, research, data collection and analysis methods.
The development process for the prototype is presented in chapter 4, Development Process. This chapter discusses the three phases of the development process: Prerequisites, Implementation and Testing.
The first of the three phases in the development process is described in chapter 5, Prerequisites. The chapter describes the data collection process of the development.
Chapter 6 describes the Implementation phase of the development process. In this chapter, all functionality of the prototype is explained along with the application of Natural Language Processing methods for classifying the mood of a movie.
Chapter 7 of the thesis, Testing and Results, presents the estimated precision and accuracy of the classifier for the different moods. The chapter also discusses how the prediction tests were constructed.
Chapter 8, Conclusion and Future Work, discusses the validity and accuracy of the results and any improvements that could be made to the final prototype.
Moods and Natural Language Processing
This chapter discusses the theoretical background that was used in the development of the prototype. The chapter is divided into five sections. The first section, section 2.1, describes moods and emotions on a human level, the concept of basic emotions and the use of moods in film. Section 2.1 also gives an insight into which moods were suitable as classes for text classification in the prototype. The ability to classify text into categories was crucial to the functionality of the prototype. Movie subtitles had to be classified into moods; therefore, section 2.2 aims to explain how Natural Language Processing techniques can be used to classify texts into categories. Section 2.3 presents how the movie data were collected. Section 2.4 presents related work in this field. Section 2.5 discusses the programming languages and frameworks used for the development.
2.1 Moods and Emotions
Very often the two terms, mood and emotion, are confused with each other or used in the wrong context [18]. David Watson [19] defines moods as "transient episodes of feeling or affect" [19], which in itself indicates a strong relation between moods and emotions. Duration is the most significant characteristic that sets moods and emotions apart. An emotion is usually intense and very short in duration, arising from an encounter with meaningful stimuli that requires quick adaptation in behavior and response [18]. The nature of this reaction is sometimes argued to be a remnant of our primal instincts, a trigger of the basic "fight or flight" behavior [5].
Emotions can be seen as reactions or responses to a specific event; they do not arise randomly and for no reason [19]. In contrast, a mood is much longer in duration, from a few hours to a couple of days. Moods can also arise from internal processes and do not need a clear, externally identifiable object as a cause. Moods are not as strong as emotions, but the two are, as mentioned, very strongly related. Being in a "good" mood makes a person more prone to experiencing positive emotions, while being in a "bad" mood makes a person more inclined to experience negative emotions [3].
2.1.1 Basic Emotions
There is a close relationship between moods and emotions, and the theoretical background is therefore based on research regarding emotions rather than moods. A major reason for using the term emotion is the theory of basic emotions.
Plutchik [5] has developed a model for representing human emotions. In this model, Plutchik placed the primary, or basic, emotions in the center. The argument for this architecture is that other emotions are mixtures of the eight basic emotions [20]. In the original model, the proposed basic emotions were: joy, sorrow, anger, fear, acceptance, disgust, surprise and expectancy. These emotions were selected since they polarize each other in pairs. Naturally, over the years, there have been other suggestions for the basic emotions and their number. The basic emotions of the model today are: ecstasy, grief, vigilance, amazement, terror, rage, loathing, and admiration [5].
Ekman [21] has conducted extensive research concerning the relation between emotions and facial expressions. Through this research, Ekman found that facial expressions of certain emotions appeared to be universal [22]. These emotions can be identified as separate, discrete states. The strongest evidence for distinguishing one emotion from another comes from research on facial expressions, and through this method Ekman recognizes six different basic emotions: happiness, sadness, anger, fear, disgust and surprise [21]. Ekman also argues that there is no need to take an evolutionary view of the emotions to be able to successfully identify them. Social learning is seen as a major contributor to these basic emotions, regardless of culture.
2.1.2 Moods in Film
Despite its neglect in film philosophy and film theory, mood has long been recognized as important to a film's aesthetics. One term that has often been used in a movie context for referring to the mood of a film is the German word Stimmung [23], meaning mood or atmosphere. Moods contribute to the composition of the universe where the film takes place. The mood of a certain cinematographic universe sets the baseline for what events, actions and situations the viewer might find interesting, odd, disturbing or otherwise emotionally significant. Moods can also serve as tools to mold or disturb the narrative flow of a film or aid in a transition of the narrative [23].
Greg M. Smith [24] claims that the "...primary emotive effect of film is to create mood" [24]. Movies can invoke both emotions and moods. A viewer might become angry for a brief moment during one scene, while the movie in its entirety invokes a mood of happiness. In this context, the issue of art moods versus human moods arises. Saying that a movie is sad is metaphorical, since a human mood is a discrete mental state and a movie cannot possess a mental state [25]. A movie can be used to invoke different moods in viewers, but it cannot have human moods itself. In film and literature, the mood of a work is its emotional character or tone. The mood of a film can be seen as a combination of all the small elements of the film that together characterize the overall experience. A sad film might fail to invoke sadness and a happy film might fail to invoke happiness, but the film extends an invitation to "feel". It is up to the viewer to accept or reject the invitation [23].
The Swedish Film Institute [26] conducted a study in the fall of 2018 to map what moods the Swedish film audience wished to experience when watching a film [27]. The study focused on four basic moods: excitement/surprise, happiness/laughter, emotional/cry, and fear/discomfort. It showed that the majority of the film audience wished to feel excitement/surprise when watching a film. The results were most prominent among audience members aged 40 to 54, living in rural areas and with low income. Among women, the wish to feel happiness/laughter was particularly popular [27].
When asked what evokes the mood of excitement/surprise, the film audience's answers showed a strong correlation between a mood and certain genres. A particularly strong correlation was evident between excitement/surprise and the genres action, crime and thriller. Answers from the study also showed a strong correlation between the mood of happiness/laughter and the genre comedy [27].
2.2 Natural Language Processing
Natural Language Processing is a field in computer science concerning how to manipulate natural language in the form of text or speech through computational methods [2]. One area within Natural Language Processing, text classification, is particularly important for this degree project. Text classification is the problem of determining which category a body of text belongs to. Bodies of text in Natural Language Processing applications are represented as corpora (the plural of corpus).
A corpus is a body of text [28, p. 6] that is used to perform statistical analysis [29]. Corpora serve as building blocks of data that are used to build up large lexicons [9]. Analysing corpora is done by statistically probing and manipulating text. Some corpora contain noise that must be filtered out for better results [29]. One type of noise in corpora is markup [28, p. 475], an annotation in a document that explains the structure or format of a text [28, p. 123]. Another type of noise that occurs in written corpora is function words. Function words are short grammatical words, such as it, in and for, that generally dominate the word population in text and may need to be accounted for when processing a corpus [28, pp. 20-23].
There are many different types of corpora that can be used for Natural Language Processing applications. The leading platform for building Python programs with human language data, the Natural Language Toolkit (NLTK) [30], uses four different types of corpora: Isolated Corpus, Categorized Corpus, Overlapping Corpus and Temporal Corpus [9].
The simplest type of corpus used by NLTK, the isolated corpus, is a standard collection of text without any categorization. If a corpus is grouped into different types of categories, it is called a categorized corpus. An overlapping corpus is a categorized corpus with overlapping categories. The final type of corpus used by NLTK is called a temporal corpus. A temporal corpus is a collection of usages of text for a specific period in time [9].
Another representation of text, particularly useful for text classification, is the lexicon [31, p. 19]. A lexicon is a collection of information about words that belong to a specific category [32]. Each set in a lexicon is called a lexical entry. Each lexical entry contains a word and information about that particular word [32]. Lexicons can be used to store information about word frequency for a given category, which is needed in text classification [33]. Information needed for probabilistic approaches in large lexicons should always be collected computationally, as it is infeasible to do manually. Lexical entries are collected by analysing adequate corpus data [33].
Text classification can be done using a Corpus-Based Approach. A corpus-based classifier relies on using corpora to build lexicons for specific categories that can then be used for analysis [8, p. 95]. There is, however, one major problem with a corpus-based text classification approach: the Word Sense Disambiguation problem.
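The corpus-based approach can be sketched as follows: a word-frequency lexicon is built per category from pre-labeled corpora. This is a minimal illustration, not the thesis implementation; the function name and the toy mood data are hypothetical.

```python
from collections import Counter

def build_lexicons(labeled_corpora):
    """Build one frequency lexicon per category from pre-labeled corpora.

    labeled_corpora: dict mapping a category (mood) to a list of texts.
    Returns a dict mapping each category to a Counter of word frequencies.
    """
    lexicons = {}
    for category, texts in labeled_corpora.items():
        counter = Counter()
        for text in texts:
            counter.update(text.lower().split())
        lexicons[category] = counter
    return lexicons

# Toy example with two mood categories
corpora = {
    "joy": ["what a wonderful happy day", "happy laughter everywhere"],
    "fear": ["the dark house was silent", "a scream in the dark"],
}
lexicons = build_lexicons(corpora)
print(lexicons["joy"]["happy"])   # 2: 'happy' occurs twice in the joy corpus
print(lexicons["fear"]["dark"])   # 2: 'dark' occurs twice in the fear corpus
```

A classifier can then compare a new subtitle against these per-category frequency lexicons.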
2.2.1 Word Sense Disambiguation
Word Sense Disambiguation is the classification problem of determining which sense or meaning of a specific word is activated in a specific context. There are three classes of Word Sense Disambiguation methods used in the field of artificial intelligence: Supervised, Unsupervised and Knowledge-Based Disambiguation [34].
Supervised disambiguation is based on learning and makes use of annotated corpora. The input data are marked with the classes or categories they belong to, which is used to further build on a lexicon [35]. The corpora can also be annotated with weights reflecting how important the text data are for a certain category [36]. Supervised learning is generally used as a classification task [28].
As opposed to supervised disambiguation, unsupervised disambiguation uses only raw corpora for learning. Unsupervised learning often starts by making use of knowledgeable sources that it can improve further. Unsupervised algorithms often use a technique called clustering, a way of analysing corpora by grouping similar objects into the same category [28].
The final technique, the Knowledge-Based technique, relies primarily on dictionaries or lexicon bases [36]. Knowledge-based methods are usually based on already developed and well-established lexicons. The problem with knowledge-based methods is that they only perform well with very large dictionaries. These are often general and may not be suitable for niche projects [35]. A knowledge-based approach is sometimes called a dictionary-based approach [8, p. 91].
2.2.2 Phases of Corpus Processing
The method for processing a corpus consists of three phases: Text Pre-processing, Text Representation and Knowledge Discovery [37, pp. 388-390].
The first phase, text pre-processing, is the process of filtering out noise from text [38]. Corpora scraped from the web can, for example, be annotated with markup. Markup is a type of noise that must be taken into account when mining data on the web [9]. A popular format for web pages is HTML, which is a markup language [39], meaning it has tags and markings that denote the structure of a document [40]. Another, more common, type of noise is function words [28], also called stop words [38]. Stop words can be removed by using a stop word dictionary [38]. It is also important that all words in a corpus are in the same letter case: a word at the beginning of a sentence starts with an uppercase letter and will not be seen by a computer as the same word as its lowercase form [29, p. 71].
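The pre-processing steps described above (markup removal, case normalization, stop word filtering) can be sketched as follows. This is a simplified illustration, not the thesis implementation: the tag-stripping regular expression stands in for a proper HTML parser such as Beautiful Soup, and the tiny Swedish stop-word list is a hypothetical stand-in for a full stop word dictionary.

```python
import re

# A tiny illustrative Swedish stop-word list (hypothetical; a real
# application would use a complete stop word dictionary).
STOP_WORDS = {"och", "i", "det", "en", "att", "som"}

def preprocess(subtitle_html):
    """Strip markup, lowercase, tokenize and remove stop words."""
    # Remove HTML-style markup tags (simplified; real subtitle files
    # may require a proper parser).
    text = re.sub(r"<[^>]+>", " ", subtitle_html)
    # Normalize case so 'Huset' and 'huset' count as the same word,
    # then split into word tokens.
    tokens = re.findall(r"\w+", text.lower())
    # Filter out stop words.
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("<i>Det var en mörk natt</i> och huset var tyst"))
# ['var', 'mörk', 'natt', 'huset', 'var', 'tyst']
```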
Stemming is an important term in text pre-processing. "Stemming is the process of removing affixes from a word in order to obtain its stem, and a stemmer is an algorithm that performs this process" [41]. The complexity of stemmers varies from language to language. Languages such as Arabic, where many affixes can be used on words, are more difficult than languages such as English [41].
Stemmers can become quite complex and may be difficult to implement or may require a large amount of data. There are many stemmers available for English, but not very many exist for Swedish. Stemming is particularly useful for smaller entities of text. A similar technique called Lemmatization can be used instead to convert words into their grammatical base form [41].
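To illustrate what a (heavily simplified) Swedish stemmer does, the sketch below strips a few common inflectional suffixes. The suffix list is a toy assumption for illustration only; a real implementation would use an established algorithm such as the Snowball stemmer, which NLTK provides for Swedish.

```python
# Illustrative suffixes only, longest first; a real Swedish stemmer
# (e.g. Snowball) uses a much more careful rule set.
SUFFIXES = ["arna", "erna", "orna", "ande", "aren", "en", "ar", "er", "or", "a"]

def naive_stem(word):
    """Strip the longest matching suffix, keeping a stem of >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("bilarna"))  # 'bil'  (the cars  -> car)
print(naive_stem("husen"))    # 'hus'  (the houses -> house)
```

With stemming, inflected forms such as "bilar" and "bilarna" map to the same lexical entry, which reduces sparsity in the frequency lexicon.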
The second phase of corpus processing is defining data structures for text representation. Text can be represented as corpora or lexical entities. The data structure of corpora and lexical entries is an important choice for text analysis. The simplest model is the Bag of Words model [9, p. 50], where each word is one entry in the data set. The data set is unordered, and the only information stored about a word in the bag is the frequency with which it appears in a corpus [42, p. 65]. Figure 2.1 illustrates how raw text can be transformed into a Bag of Words data structure that contains the frequency of each word in the text.
Figure 2.1: Figure illustrating the Bag of Words method. Figure inspired by Manning [28]
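The Bag of Words transformation illustrated in Figure 2.1 can be sketched with Python's standard library (a minimal illustration, not the thesis implementation):

```python
from collections import Counter

def bag_of_words(text):
    """Unordered word -> frequency mapping; all word order is discarded."""
    return Counter(text.lower().split())

bag = bag_of_words("the dog chased the cat")
print(bag["the"])  # 2
print(bag["dog"])  # 1
```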
Another model that could be used is the Vector Space Model. In the Vector Space Model, a document is represented as a count vector. Each entry in the vector is called a feature and represents one entry in the document vector [28]. While the two are similar, a document vector generally contains more information than a Bag of Words.
The third and final phase is called Knowledge Discovery. Knowledge Discovery means extracting knowledge from data [43]. It is in this phase that the data from the text representation phase is actually used, through machine learning or data mining methods [37, p. 390] such as the Naive Bayes classifier [44, p. 20]. For example, the Bag of Words could be used to create or build upon lexical entries in a lexicon.
2.2.3 Natural Language Levels
Before processing a corpus, one must first decide which parts of natural language to analyse. Natural Language Processing is a broad field that concerns processing text on many different levels. Text can be extracted and processed on six different levels: the morphological, lexical, syntactic, semantic, discourse and pragmatic levels [2, p. 56]. Different techniques should be used depending on the level at which a corpus will be processed.
The morphological level deals with the smallest parts of words, such as prefixes and suffixes [2, p. 56]. A major focus at the morphological level is word formation: the same word can appear in different forms depending on its inflection [28]. Examples of techniques within the scope of morphology are Stemming and Lemmatization [45], which are used in text pre-processing.
Following the morphological level is the lexical level, where analysis is applied at the scope of words [2, p. 56]. This level focuses on which part of speech a particular word is used as. A part of speech can be seen as a grammatical category for words [28]. The technique used for labeling words with their grammatical category is called Part-of-Speech Tagging (POS-tagging) [46]. Stockholm University [47] provides an open-source part-of-speech tagger for Swedish text called Stagger [48].
The syntactic level in Natural Language Processing deals with the structure of sentences [2, p. 56]. In linguistics, syntax is the set of rules in a language that concern the form of sentences [32]. The form of a sentence should not be confused with morphological structure, which only concerns words. Since sentences can be structured in many different ways, it is not feasible to manually capture all possible sentence patterns; instead, computational approaches must be taken. One approach is to construct syntactic N-Trees [49], tree data structures with n-tuples that can be used to store the structures of sentences.
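As a rough illustration, the structure of a sentence can be stored as nested n-tuples forming a tree; the bracketing below is hand-made for illustration, not the output of any particular parser:

```python
# Each node is an n-tuple: a label followed by its subtrees; leaves are words.
sentence_tree = (
    "S",
    ("NP", ("Det", "the"), ("N", "dog")),
    ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "cat"))),
)

def leaves(tree) -> list[str]:
    """Recover the words of the sentence from the tree's leaf nodes."""
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    return [word for child in children for word in leaves(child)]

print(" ".join(leaves(sentence_tree)))  # → the dog chased the cat
```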
The semantic level covers the meaning of words and sentences [2, p. 56]. This level focuses on the meaning of a word or sentence without taking outside context into account [32]. Lexical semantics can be studied through two different approaches. The first approach is to study the meaning of individual words. The second is to study how the meanings of words relate to each other and how multiple words can be combined into the meaning of a sentence [28, pp. 109-110]. Examples of applications at this level are identifying words with similar meanings (synonyms), words with opposite meanings (antonyms), and the meanings of ambiguous words [28, p. 110].
The discourse level concerns the relationships between sentences in a text. Analysis at this level mainly focuses on anaphors [28]. Anaphors are expressions that refer back to previous expressions in the same text [32]. For example, a person who has previously been introduced in a text may later be referred to as ’He’ or ’the Man’. For a program to know who ’He’ or ’the Man’ refers to, it must be able to identify the person in the context of the text. If ’He’ and ’the Man’ refer to the same person, they are anaphorically related [28].
Pragmatics is the study of how knowledge about the world interacts with the literal meaning of text. Due to the difficulty of modeling the complexity of the world and the lack of data, this area of Natural Language Processing has not received much attention [28]. The semantic representation, or logical form, of an utterance is distinct from its pragmatic interpretation [50]. Pragmatics considers language as an instrument of communication: what people actually mean when they use language, and how a listener interprets it. According to Jenny Thomas, pragmatics considers the negotiation of meaning between speaker and listener, the context of the utterance, and the meaning potential of an utterance [51]. This can be illustrated by a simple example.
Utterance: “Do you know what time it is?”
Pragmatic meaning: Why are you late? Response: Explanation or apology.
Literal meaning: What time is it? Response: A time
The above example illustrates the ambiguity of natural language, along with the difficulty of modeling the real world when performing Natural Language Processing.
2.2.4 Similarity Measurement
Similarity measurement is the means of determining how well a particular corpus or vector fits a certain pattern or category by quantifying the relationship between different features [52]. Similarity measurement is useful when testing supervised dictionary algorithms or when learning unsupervised dictionary algorithms [34, p. 57].
If both the input document and the lexicon are represented as vectors in the
same term space, a measure of the similarity between them can be computed.
Similarity can be calculated as the probability that a certain query belongs to a certain category. Similarity can only be measured through probabilistic models if quantifiable data is available. To obtain quantifiable data, there must be some form of weighting in the proposed data structures of a lexicon. The most effective weighting principle for documents is based on estimating the relevancy of a query to a certain document or lexicon [44].
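When a document and a lexicon are represented as vectors in the same term space, one common similarity measure is the cosine of the angle between them. The sketch below is a generic illustration with made-up weights, not necessarily the measure used in the prototype:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors in the same term space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

document = [2.0, 1.0, 0.0]  # hypothetical term counts for a document
lexicon = [4.0, 2.0, 0.0]   # hypothetical weights for a mood lexicon
print(round(cosine_similarity(document, lexicon), 6))  # → 1.0 (same direction)
```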
2.2.5 Naive Bayes
The Naive Bayes classifier is a probabilistic classifier that is useful for text classification. A classifier is an algorithm that performs classification by mapping input data to a category. Naive Bayes applies Bayes theorem [53], a mathematical formula for calculating conditional probabilities that is mainly used in statistics. Conditional probability is the probability of one event occurring, given that another event has already occurred [54].
P(H|E) = P(E|H) × P(H) / P(E)    (2.1)
In Bayes theorem, presented in formula 2.1, P(H|E) is the likelihood of event H given that E is true. In the same way, P(E|H) is the likelihood of event E given that H is true. The probability P(H) is the likelihood of event H without any knowledge of E, and P(E) is the likelihood of event E without any knowledge of H. P(E|H) × P(H) can be written as P(H ∩ E), which is the probability of both H and E being true. P(H ∩ E) also equals P(H|E) × P(E).
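As a worked numeric example of formula 2.1 (the probabilities below are made-up values, chosen only for illustration):

```python
# Made-up example: H = "movie has the mood fear", E = "the word 'scream' occurs".
p_h = 0.2           # P(H): prior probability of the mood fear
p_e_given_h = 0.3   # P(E|H): probability 'scream' occurs given the mood is fear
p_e = 0.12          # P(E): overall probability that 'scream' occurs

# Bayes theorem: P(H|E) = P(E|H) * P(H) / P(E)
p_h_given_e = p_e_given_h * p_h / p_e
print(round(p_h_given_e, 4))  # → 0.5
```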
Figure 2.2 illustrates the probability of the events as two intersecting circles in a Venn diagram. The diagram displays the denominator P(E) in formula 2.1 as the right circle, P(H) as the left circle, and the numerator, P(E|H) × P(H), as the intersection of the two circles. In the Venn diagram, the probability P(H|E) would thus be the size of the intersection divided by the size of the right circle.
Figure 2.2: Illustration of the conditional probability for the events H and E.
Figure by the authors.
The Naive Bayes classifier assigns a class to a corpus given its vector representing the occurrences of words, also referred to as features or attributes, in that specific corpus. A classifier is constructed by using a collection of labeled training corpora to estimate the parameters for each class. Classification can then be performed by assigning a new corpus to the class that is most likely to have generated that specific corpus [55]. The Naive Bayes classifier is very effective and is one of the simplest of the Bayesian classifiers.
The Naive Bayes classifier assumes that all attributes of a vector are independent of each other, also called conditional independence [56]. In reality, this assumption is far from correct, since there are strong correlations between the occurrences of words in a corpus, but the Naive Bayes classifier generally performs well regardless. Since the attributes are treated as unrelated, they can be learned separately, which simplifies learning when a large number of attributes are involved, as is often the case when working with corpora. In Naive Bayes, each attribute node has no parent except the class node, as shown in Figure 2.3 [56].
There are two models of how occurrences of attributes in a corpus can be represented before the Naive Bayes assumption is applied [55]. In both models the order of words is lost, so both can be seen as a Bag of Words [9]. In the first model, a corpus is represented as a vector of binary attributes indicating which words occur or do not occur in the corpus. The number of occurrences of a word is not represented in the vector. When calculating the probability of a corpus, the probabilities of all attribute values, including non-occurrence, are multiplied together.

Figure 2.3: A figure representing the structure of the attributes, A_n, with regard to a class node, C, in the Naive Bayes classifier. Figure inspired by H. Zhang [56]

In the second model, a corpus is represented by the set of word occurrences in the document, along with the number of occurrences of each particular word. When calculating the probability of a corpus in the second model, only the probabilities of the occurring words are multiplied [55]. For example, the probability of a document belonging to a class can be calculated by multiplying the conditional probabilities of the features in the corpus [34, p. 11], as illustrated in formula 2.2, where the features are words. In formula 2.2, P(word_i|class) is the probability of a word belonging to a class and P(corpus|class) is the probability of the entire text belonging to a class. When classifying a corpus, the probability that the corpus belongs to each possible class must be calculated. The class that gives the highest probability is the class into which the corpus is categorized.
P(corpus|class) = ∏_{i=1}^{a} P(word_i|class)    (2.2)
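Formula 2.2 can be sketched directly in code; the per-word probabilities below are made-up values for illustration:

```python
# Hypothetical per-word probabilities P(word_i | class) for one class.
word_probs = {"dark": 0.02, "scream": 0.05, "night": 0.03}

def corpus_probability(words: list[str], probs: dict[str, float]) -> float:
    """Multiply P(word_i|class) for every word in the corpus (formula 2.2)."""
    product = 1.0
    for word in words:
        product *= probs[word]
    return product

print(corpus_probability(["dark", "scream", "night"], word_probs))
```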
The Naive Bayes classifier encounters a problem when the number of words in a corpus grows large. The probability of a word belonging to a class is a value between zero and one. Since the probability factors for each word in the corpus are multiplied with each other, the overall product becomes very small. This number can become so small that it cannot be represented as floating point data and is rounded to zero [57]. This would introduce invalid zero probabilities into the calculations. A common way of overcoming this issue is to represent the probabilities as log-probabilities [57]. The transformation to logarithms works well because the logarithm of a product is the sum of the logarithms. Formula 2.3 shows how formula 2.2 is logarithmized to calculate the logarithmic probability P_logarithmic(corpus|class) of a corpus in a specific class. Since the logarithm of a product is the sum of the logarithms, the probabilities P(word_i|class) in 2.3 are summed instead of multiplied. This results in more manageable numbers, but since the logarithm is taken of values between zero and one, the final value will be a negative number.
P_logarithmic(corpus|class) = Σ_{i=1}^{a} ln P(word_i|class)    (2.3)
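A short sketch with made-up probabilities shows why the logarithmic form of formula 2.3 stays representable even when the direct product of formula 2.2 underflows to zero:

```python
import math

# A long corpus: 1000 words, each with a hypothetical probability of 0.01.
probs = [0.01] * 1000

product = 1.0
for p in probs:
    product *= p          # direct product (formula 2.2) underflows to zero...
print(product)            # → 0.0

log_prob = sum(math.log(p) for p in probs)  # ...but the log-sum does not
print(log_prob)           # → ≈ -4605.17 (a negative number, as expected)
```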
Another obstacle in the Naive Bayes implementation arises when the classifier encounters a word that was not in the training corpora, meaning that the word does not exist in the lexicon of the classifier. The classifier then calculates the probability of such a missing word as zero. Since the classifier calculates the probability of a corpus belonging to a specific class through multiplication of the probabilities of each word belonging to the class, an introduced zero results in a zero product and zero probability for the corpus belonging to that class [9]. This can have a dramatic effect, as the probability is calculated as zero even if all words but one matched a certain class lexicon. This can, however, be solved by using a statistical technique called Smoothing [58], where the classifier assumes that all words have been seen one more time than they actually have. Smoothing leads to a previously unseen word being handled as if it had been included in the training corpora one time. This removes any unwanted zero probabilities from the final probability calculation. This technique is sometimes referred to as the Laplace estimator or Additive Smoothing [58] due to the minimum-one-occurrence approach for all words.
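Following the description above, add-one smoothing can be sketched as follows. The counts and vocabulary are made-up; the denominator adds one pretend occurrence per known word:

```python
# Hypothetical word counts for one class, built from training corpora.
class_counts = {"dark": 4, "scream": 9, "night": 7}
vocabulary_size = len(class_counts)
total = sum(class_counts.values())  # 20 word occurrences in this class

def smoothed_probability(word: str) -> float:
    """Laplace (add-one) smoothing: every word is counted one extra time."""
    count = class_counts.get(word, 0)
    return (count + 1) / (total + vocabulary_size)

print(smoothed_probability("scream"))   # seen word: (9 + 1) / 23
print(smoothed_probability("sunrise"))  # unseen word: nonzero instead of zero
```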
P_smoothing(word|class) = (occ(word, class) + 1) / (Σ_{i=1}^{|V|} occ(word_i, class) + |V|)