Linköping University | Department of Computer and Information Science (IDA)
Bachelor's Thesis, 18 hp | Cognitive Science
Spring term 2021 | LIU-IDA/KOGVET-G--21/012--SE

Exploring Hybrid Topic Based Sentiment Analysis as Author Identification Method on Swedish Documents

Jakob Bremer

Supervisor: Arne Jönsson
Examiner: Erik Marsja


Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: https://ep.liu.se/.


Abstract

The Swedish national bank has had shifting policies regarding publicity and confidentiality in the publishing of texts within the bank. For some time, texts written by commissioners within the bank were published anonymously. Later, the bank revoked the confidentiality policy and published all documents publicly again. This led to an emerging interest in possible shifts in attitude toward topics discussed by the commissioners when writing anonymously versus publicly.

On a request based on this interest, ongoing analyses are being conducted with the help of language technology, where topics are extracted from the anonymous and public documents respectively. The aim is to find topics related to individual commissioners in order to identify, as accurately as possible, which of the anonymous documents was written by whom. To discover unique relations between the commissioners and the generated topics, this thesis proposes hybrid topic based sentiment analysis as an author identification method, using sentiments toward topics as identifying features of commissioners. The results showed promise for the proposed approach. However, further research comparing it with other acknowledged author identification methods is needed to confirm some level of efficacy, especially on documents whose topics are closely similar.

Key words: Natural Language Processing, Author identification, Sentiment analysis, Topic modeling


Contents

Copyright
Abstract
1 Introduction
  1.1 Background
  1.2 Aim
  1.3 Research question
  1.4 Proposed approach
2 Theory
  2.1 Author identification
  2.2 Sentiment analysis
  2.3 Topic modeling
    2.3.1 Latent Dirichlet Allocation
3 Method
  3.1 Data
  3.2 Choosing representing words in topics
  3.3 Sentiment Analysis
    3.3.1 Word sense lexicon
    3.3.2 Implementation of VADER
  3.4 Identification of speeches
    3.4.1 Topic and sentiment based identification
    3.4.2 Baseline
4 Experiments
  4.1 Performance Evaluation
    4.1.1 Topic and sentiment based prediction
    4.1.2 Baseline
  4.2 Results
5 Discussion
  5.1 Discussion of results
    5.1.1 Discussion of the data
    5.1.2 Discussion of the proposed method
  5.2 Future research
    5.2.1 Possibilities of the used method
    5.2.2 Possibilities with other methods
6 Conclusion


1. Introduction

1.1 Background

The Swedish national bank has had shifting policies regarding publicity and confidentiality in the publishing of texts within the bank. For some time, documents of various types, written by commissioners within the bank, were published anonymously. Later, the bank revoked the confidentiality policy, letting all documents be published publicly again. This led to an emerging interest in possible shifts in attitude toward topics discussed by the commissioners when writing anonymously versus publicly.

On a request based on this interest, ongoing analyses are being conducted with the help of language technology, where topics are extracted from the anonymous and public documents respectively. The aim is to find topics related to individual commissioners for the purpose of identifying, as accurately as possible, which of the anonymous documents was written by whom. A research group is conducting this project for/with Swe-Clarin.

Swe-Clarin is a unit of CLARIN (Common Language Resources and Technology Infrastructure), a European research infrastructure focused on language technology which aims to improve and develop infrastructure for speech- and text-based e-science, to digitalize material and advanced language technological tools, and to aid research applications in various ways. This thesis is part of the project mentioned above, as a means of exploring possible enhancements of author identification. Possible improvements in matching commissioners with anonymous documents will be investigated by analyzing the sentiments expressed towards at least the main topics related to each commissioner. By tracking every individual author's expressed sentiments toward topics, the possibility of matching them with anonymous documents is hypothesized to increase. The main project's initial approach was to use only the generated topics to match anonymous and public texts by finding identifying features of each commissioner. The accuracy was not satisfactory, and the researchers hypothesized that the commissioners are likely to have written about the same or similar economy- or bank-related topics, so that, based solely on topics, it is not easy to predict who wrote what.


1.2 Aim

The aim of this project is to explore the possibilities of enhancing topic-based author identification with sentence-level sentiment analysis of topics on Swedish documents, and to investigate how this can contribute to the task of predicting the speakers of speeches related to the Swedish national bank. To the best of my knowledge, this is the first use of hybrid topic based sentiment analysis (HTBSA) in this domain (bank- and economy-related documents), in Swedish, for the purpose of predicting authors.

1.3 Research question

Could hybrid topic based sentiment analysis be used to enhance the possibilities of author identification on documents where topics within the text and among the authors are closely similar?

1.4 Proposed approach

The idea of this project is to apply sentence-level sentiment analysis to the word distributions of automatically generated topics occurring in each commissioner's speeches, in order to obtain the general sentiment each commissioner expresses towards each generated topic. This information serves as identifying factors for each commissioner and is thus used to predict the speaker of speeches with similar identifying factors.
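The procedure just described can be sketched at a high level. All helper names below (`author_sentiment_profile`, `predict_author`, `score_sentence`) are hypothetical placeholders for illustration, not the implementation actually used in this thesis.

```python
# Sketch of the proposed identification idea: build a per-commissioner
# profile of average sentiment toward each topic, then match an unknown
# document to the commissioner with the closest profile.

def author_sentiment_profile(speeches, topics, score_sentence):
    """For one commissioner, average the sentiment of every sentence
    that mentions a word from each topic's word distribution.
    `speeches` is a list of speeches, each a list of sentences."""
    profile = {}
    for topic_id, topic_words in topics.items():
        scores = [
            score_sentence(sentence)
            for speech in speeches
            for sentence in speech
            if any(word in sentence.split() for word in topic_words)
        ]
        # Average sentiment toward this topic; 0.0 if never mentioned.
        profile[topic_id] = sum(scores) / len(scores) if scores else 0.0
    return profile

def predict_author(unknown_profile, known_profiles):
    """Pick the commissioner whose topic-sentiment profile is closest
    (smallest summed absolute difference) to the unknown document's."""
    def distance(p, q):
        return sum(abs(p[t] - q[t]) for t in p)
    return min(known_profiles,
               key=lambda a: distance(unknown_profile, known_profiles[a]))
```

The distance measure here is one simple choice among many; the thesis's actual matching criterion is described in section 3.4.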

As mentioned in section 1.1, this thesis is a reaction to the unsatisfactory results of using only topic modeling to identify authors. The suspected cause of why topics alone did not give satisfactory results was pointed out in a recent article by Zechner (2020), arguing against previous methods in the author identification field and their possibly misleading accuracies. He stated that, when conducting author identification tasks, it is topics, and identifying features of those, that are found rather than authors. He investigated this by using different books by the same author for identification and prediction. He tried to predict the author of one book with the true author as the goal, and treated the same author, represented by different books, as pseudonymous “false” authors. If it were really the author that the author identification methods identify, the “false” authors should, to some extent, be chosen as authors as well, since they are the same author, only attached to different books. His investigation showed that this was generally not the case: the “false” authors (the same author under a pseudonym) were not chosen by the methods as the predicted authors. Thus, he argues that in some models, possibly used on datasets where the authors write about different topics, the topics become the authors and vice versa. But if these models were used on datasets where the authors write about the same or similar topics, how would the accuracy be affected, and how could one get around the problem? If it were possible to extract from the texts the individual's, in this case the commissioner's, expressed sentiments towards the various topics, an increase in accuracy could be achieved, since this would introduce distinctions between the otherwise similar topics, thus making it possible to predict authors of anonymous texts in very specific domains with higher accuracy.

The procedure leading up to the identification of authors is similar to previous work, especially the relatively novel method Hybrid Topic Based Sentiment Analysis (HTBSA) (Bansal & Srivastava, 2018). In Bansal and Srivastava's work the method is used to predict elections based on tweets, and as described later in this thesis, the proposed method is similar to and inspired by it. HTBSA is based on four steps: (1) generate topics; (2) find the word distributions of the topics; (3) find topic sentiments based on the word distributions; (4) calculate sentiment scores of tweets. Because tweets are very short texts, Bansal and Srivastava decided to use the BTM topic model in step one since it, in contrast to the LDA topic model, models word co-occurrences directly instead of modeling documents. In step three, the sentiment score for each topic is generated by looking up the sentiment of each topic's top 20 words in the SentiWordNet lexicon. The overall sentiment of each topic is the weighted mean of the vector of the words' sentiment scores. The fourth step, finding a sentiment score for each tweet, was achieved by assuming that each tweet is a mixture of weighted topics, represented by the proportion of each topic in the tweet. Hence, the sentiment score of each tweet is the weighted mean of the topic sentiments.
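Steps three and four of HTBSA can be sketched as follows. The rank-based weighting in `topic_sentiment` is an assumption for illustration (the weighting scheme is not specified here), and `lexicon` stands in for a SentiWordNet-style word-to-score lookup.

```python
# Sketch of HTBSA steps (3) and (4), assuming topics and a sentiment
# lexicon are already available.

def topic_sentiment(top_words, lexicon):
    """Step 3: weighted mean of the top words' lexicon sentiment scores.
    Here each word is weighted by its rank within the topic (an assumed
    scheme); unknown words score 0.0 (neutral)."""
    weights = [len(top_words) - i for i in range(len(top_words))]
    scores = [lexicon.get(w, 0.0) for w in top_words]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def tweet_sentiment(topic_mix, topic_scores):
    """Step 4: a tweet is a weighted mixture of topics, so its sentiment
    is the proportion-weighted mean of the topic sentiment scores."""
    return sum(prop * topic_scores[t] for t, prop in topic_mix.items())
```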

This project's aim, however, is identification of authors, and in the author identification field it has recently been investigated whether sentiment analysis can enhance identification of authors (Chen et al., 2017; Martins et al., 2019). These authors hypothesized that applying sentiment analysis to social media messages, thus analyzing emotions, might increase the ability to predict the author by identifying the author's emotional profile. However, they did not use it to enhance topic modeling as an approach to authorship identification, which is the aim of this project.

Topic modeling has also been used for author identification before, e.g. with the author-topic model (Rosen-Zvi et al., 2004), an extension of the Latent Dirichlet Allocation (LDA) topic model that includes author information. Each author is associated with a multinomial distribution over topics, and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. Their model might therefore work well on documents with great variation and distance between the topics. However, as suggested by Zechner (2020), it might not work well on documents with closely similar topics. The scope of this thesis is to try to tackle that kind of problem. Topic modeling as an approach will still be essential, but the authorship identification will be based on the general sentiment scores expressed towards topics occurring in the authors' documents. The general sentiment score will be based on distributions of sentiments expressed, on the sentence level, towards topic-representing words occurring in the authors' speeches.

Thus, as stated earlier, both sentiment analysis and topic modelling have been proposed for the purpose of author identification but, to the best of my knowledge, not as a hybrid method where the sentiments of topics are analyzed (at the very least, not on Swedish documents). The hybrid approach HTBSA, proposed by Bansal and Srivastava (2018) and used to predict elections, serves as inspiration for the research conducted in this project, which can be described as explorative rather than hypothesis testing, since the language limits the usability of various possible methods (as will be further discussed in section 5).

The theory behind the proposed methods is explained in section 2, and the procedure is explained in more depth in section 3. The results are presented in section 4 and discussed in section 5, along with suggestions for future work based on the results and on the different available sentiment analysis, topic modeling and author identification methods.


2. Theory

In this section, definitions and explanations of the major fields and theories are presented, equipping the reader with fundamental knowledge that will aid the understanding of the project in the following sections.

2.1 Author identification

The classic model of authorship attribution was presented by Mosteller and Wallace (1964), who used it, with Bayesian statistics, to classify The Federalist Papers, with the aim of attributing 12 disputed papers to either Hamilton or Madison; training was done on the remaining papers with known authors.

Applications of author identification today are mainly found in plagiarism detection, author detection (e.g. on historical letters or anonymous texts), forensics and, more recently, fake news detection (e.g. Potthast et al., 2010; Su et al., 2008; Martins et al., 2019). The field has become more popular in recent years, and the number of different approaches has increased and/or been enhanced by, for instance, different machine learning models (Kale & Prasad, 2017); attention has also turned to applying these techniques to several authors at once using informal text that is limited in quantity (Chen et al., 2017). For instance, Silva et al. (2011) worked on authorship attribution in Portuguese tweets and achieved attractive results (up to 95% accuracy) using Support Vector Machines (SVMs) and a small amount of data (from 100 to 2,000 tweets).

Topic modeling has had a role in authorship identification ranging all the way back to Blei et al.'s (2003) article on Latent Dirichlet Allocation (LDA), their generative model for text corpora. It is based on hierarchical Bayesian networks, and information about low-level features, such as documents, stylistic markers and topics, is included. It was extended by Rosen-Zvi et al. (2004) to include authorship information, and Seroussi et al. (2012) applied it to authorship attribution. At this point it became evident that including information in the classification process about the a priori likelihood of authors writing about a specific topic, and about how doing so influences style markers, increases the accuracy of the classification. A difference in this thesis is that, when making authorship attributions, a closed-world assumption is not needed with regard to topics.

Another evaluation of features across topics was conducted by Menon and Choi (2011). They used an SVM classifier in their experiments, with features including n-grams of part-of-speech tags and function words. The experiments varied in the disjointness of the topics and the likelihood of a particular topic. All experiments showed that function words achieve the best results, though they sometimes fell short of the combination of all considered features.

2.2 Sentiment analysis

Sentiment analysis, or opinion mining, as explained by Jurafsky and Martin (2020), is a type of text categorization that extracts sentiment: the expressed positive, neutral or negative orientation towards an entity. This field of natural language processing can also be described as the computational study of authors' opinions, emotions and attitudes towards other individuals, events or topics (Medhat et al., 2014).

Most commonly, sentiment analysis and opinion mining are treated as two terms for the same notion, though some suggest that they differ slightly (Tsytsarau & Palpanas, 2012). Opinion mining can be described as extracting and analyzing the author's opinion about an entity, while sentiment analysis identifies the sentiment expressed in a document and then analyzes it. Thus, the target of sentiment analysis is to discover opinions, identify the sentiments they express, and then classify their polarity.

Sentiment analysis has most commonly been applied as a method for classifying documents, such as reviews, to obtain general opinions about movies, products etcetera. However, as technology develops, applications and methods become outdated, and sentiments are analyzed at increasingly fine-grained levels. Several opinions can be expressed in a single document, and therefore sub-fields have been developed, such as sentence-level sentiment analysis (Meena & Prabhakar, 2007) and topic-based sentiment analysis (Liu, 2012). It has also been shown that the performance of sentiment analysis can be improved by using word embeddings and semantic similarity, since this decreases the number of features (Poria et al., 2016). With the help of vectors, words' meanings can be used as one feature: synonyms such as oculist and eye-doctor will be treated as one feature instead of two, since they have the same embedding in text. Embeddings are extracted by looking at which words surround a word of interest. Both oculist and eye-doctor are likely to be found with the same surrounding words, which represent their embedding, for instance words like eye, hospital or examined.
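The idea that words with similar contexts collapse into one feature can be illustrated with cosine similarity. The three vectors below are invented toy values, not real word2vec output; a real system would learn them from the surrounding-word contexts just described.

```python
# Toy illustration: words whose (invented) context vectors are very
# similar can be merged into a single feature via a similarity threshold.
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors: the first two mimic words sharing contexts like
# "eye", "hospital", "examined"; the third a word from another domain.
vectors = {
    "oculist":    [0.90, 0.80, 0.10],
    "eye-doctor": [0.85, 0.82, 0.12],
    "currency":   [0.10, 0.20, 0.90],
}
```

With these toy values, `cosine(vectors["oculist"], vectors["eye-doctor"])` is close to 1, so a threshold (say 0.95) would merge the pair into one feature, while "currency" stays separate.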

Aside from product reviews, sentiment analysis can also be applied to stock markets, news articles or political debates (Yu et al., 2013; Hagenau et al., 2013; Tao et al., 2012; Maks & Vossen, 2012). Opinions expressed towards certain candidates or parties in elections can be analyzed and, in some cases, e.g. in Bansal and Srivastava's (2018) application mentioned earlier, used as a tool for predicting elections.


In this thesis it is suggested as a tool for predicting authors, in combination with topic modeling. The sentiment analysis tool suggested for use on the generated topics is the state-of-the-art rule-based model VADER (Valence Aware Dictionary and sEntiment Reasoner), which has been shown, in some cases, to outperform human sentiment classification of tweets (Hutto & Gilbert, 2014), and thus also other methods for sentiment analysis, such as Maximum Entropy (ME), Naive Bayes (NB) and Support Vector Machines (SVM). With the VADER model it is possible to conduct an overall sentiment intensity analysis of short texts, outputting three numerical values (within the range 0 to 1) that correspond to the neutral, positive and negative intensity of the short text and sum to 1. The polarity with the highest intensity score is the assigned sentiment of the short text. In addition to being the state-of-the-art model at the time of writing, it is suitable for this project because it was made for short texts. The goal with VADER in this thesis is to conduct a sentence-level sentiment analysis of the sentences in which the topic words occur, in order to achieve a topic based sentiment analysis.
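The shape of this output can be illustrated with a toy scorer. This is not VADER itself (which uses a large human-validated lexicon plus heuristics for negation, intensifiers, punctuation etc.); the tiny lexicon below is invented, and the function only mimics the neutral/positive/negative proportions that sum to 1.

```python
# Toy rule-based scorer mimicking the *shape* of VADER's output:
# three proportions in [0, 1] that sum to 1. The lexicon is invented.

LEXICON = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

def polarity_proportions(sentence):
    """Return the fraction of positive, neutral and negative tokens."""
    tokens = sentence.lower().split()
    pos = sum(1 for t in tokens if LEXICON.get(t, 0) > 0)
    neg = sum(1 for t in tokens if LEXICON.get(t, 0) < 0)
    neu = len(tokens) - pos - neg
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {"pos": pos / total, "neu": neu / total, "neg": neg / total}
```

As in VADER's usage here, the polarity with the highest proportion would be taken as the assigned sentiment of the sentence.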

2.3 Topic modeling

The field of topic modeling was born from the larger field of Information Retrieval (IR) (Baeza-Yates & Ribeiro-Neto, 1999), whose goal is to extract valuable and analyzable data, such as topics or aspects, from text documents. Topic models use unsupervised learning to extract categories of words from larger sets of texts (Jurafsky & Martin, 2020). Jurafsky and Martin explain how semantic fields are related to topic modeling: one can find relatedness between words if they belong to the same semantic field. Semantic fields are sets of words which occur within the same semantic domain and together carry structured relations. For instance, words like balance, currency, hausse and/or baisse might be found within the same semantic field as stock market. Automatically generated aspects or topics are therefore not single words, and have no general descriptive word; they are clusters of words without any explicit description. Cluster is a commonly used concept in topic modeling, and can in this context be explained as a collection of objects which are "similar" among themselves and "dissimilar" to objects belonging to other clusters (Buyya et al., 2016). Thus, a challenge in topic modelling is to achieve easily interpreted and semantically coherent clusters representing topics.

One of the more popular approaches at the moment is generative probabilistic topic modelling (Blei, 2012). One such topic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003), today one of the most frequently used. LDA is a three-level hierarchical Bayesian model, in which each item of a collection of discrete data, such as a text corpus, is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In text modeling, the topic probabilities give an explicit representation of a document (Blei et al., 2003). As mentioned earlier, Jurafsky and Martin (2020) and Martins et al. (2019) state that using word embeddings and semantic similarity among words for sentiment analysis of documents (and topics) should enhance the results. Thus, topic modelling approaches which focus on semantic similarities might increase the coherence among topic words, and with that the interpretability of the relatedness among the words representing a topic. This, in turn, might lead to a more representative sentiment analysis of the topics identifying authors.

Ideas based on the word2vec framework (Mikolov et al., 2013) are increasingly being integrated with topic modeling methods, or used as topic modeling methods in their own right, thanks to word2vec's strong ability to find word relations. Two such ideas inspired the models (1) LDA2vec (Moody, 2016) and (2) topic2vec (Niu & Dai, 2015). (1) Word2vec captures powerful relations between words, but its vectors are not interpretable and do not represent documents, while LDA is interpretable by humans but does not model local word relations. Combined, both document and word topics are built, made more interpretable and supervised, and the topics can range over clients, times and documents. (2) The topic2vec model is instead based solely on vectors. A joint embedding of document and word vectors is created in a vector space, where the algorithm then aims to find dense clusters of documents and identifies which words attracted those documents together. Each dense area is considered a topic, and the words that attracted the documents to the dense area are its topic words.

Other possibilities, not yet suggested in the author identification field, could be implementations of "all-inclusive" models of topic modeling and sentiment analysis, such as Aspect-based sentiment analysis (ABSA) models. ABSA results from the observation that a sentiment analysis of, for instance, a review only shows a general sentiment, while expressed sentiments can of course vary among aspects of the object the review concerns. A classification of opinions in a review of a new smartphone may find it generally negative, yet within the review the author will most likely comment on several specific aspects of the smartphone. The author will have an individual and probably subjective opinion towards aspects like the camera, processor performance, screen quality etcetera, which can vary between authors and across the chosen spectrum of sentiment polarity (Liu, 2012). The most common uses of sentiment analysis of topics are on review or Twitter data, where the goal is to find out what people think about a particular product or subject and its features (Lek & Poo, 2015; Brody & Elhadad, 2010; Zhai et al., 2011). Thus, the topics are predefined and the goal is, for instance, to find out opinions about a smartphone and its new camera. The topics are generated based on the word camera, creating a topic with words which are synonyms, hyponyms, hypernyms etcetera of the word camera.

In this project, by contrast, the topics are generated based on the content of the corpus, without a general word representing each topic. A topic is represented by the sum of its parts, and thus the general sentiment expressed towards it cannot show exactly what the person expressing the sentiment has opinions about. That, however, is not the most important thing in this case, where the goal is to predict the speaker. What matters most in this project is to find each commissioner's expressed sentiment towards the generated topics (regardless of which topics they are and what they concern) and thereby find similarities with the sentiments expressed towards topics generated from the data set with unknown speakers.

However, the limitations that led to difficulties in implementing and experimenting with these methods largely concern the small data size (relative to what is suggested for the methods mentioned above) and the language of the documents (these methods could to some extent be used with the help of language transformers, though not without affecting the results). This will be discussed further in sections 3 and 5.

2.3.1 Latent Dirichlet Allocation

The LDA topic model was used to generate the topics for this project. The same model that was used in the earlier mentioned prior project (section 1.1), when author identification was attempted with the help of topic modeling, is used in this project, since the same data set is used and modeled. The LDA implementation preceding this project generated the economic and bank related topics whose top words were used for the sentence-level sentiment analysis to identify authors. Section 3.2 describes how the number of top words was decided in order to achieve the highest possible accuracy when predicting authors, but first the model itself is described below.

As described by Blei et al. (2003), LDA models the generation of a document collection in three steps: (1) for each document, a distribution over topics is sampled from a Dirichlet distribution; (2) for each word in the document, a single topic is chosen according to this distribution; (3) each word is sampled from a multinomial distribution over words specific to the sampled topic.
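These three steps can be sketched as a small simulation. The vocabulary, topic-word distributions and sizes below are toy values for illustration; a Dirichlet sample is built from independent Gamma draws so that only the standard library is needed.

```python
# Sketch of LDA's three-step generative story (Blei et al., 2003).
import random

def dirichlet(alphas):
    """Sample from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def choose(items, probs):
    """Draw one item according to the given probabilities."""
    return random.choices(items, weights=probs, k=1)[0]

def generate_document(n_words, alpha, phi, vocab):
    """phi[t] is topic t's word distribution over `vocab`."""
    n_topics = len(phi)
    # (1) sample this document's topic distribution from Dirichlet(alpha)
    theta = dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        # (2) choose a single topic for this word according to theta
        z = choose(range(n_topics), theta)
        # (3) sample the word from the chosen topic's word distribution
        words.append(choose(vocab, phi[z]))
    return words
```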

This generative process corresponds to the hierarchical Bayesian model in figure 1. In the model, ϕ denotes the topic distribution matrix, with a multinomial distribution over V vocabulary items for each of T topics, drawn independently from a symmetric Dirichlet(β) prior. θ is the matrix of document-specific mixture weights for these T topics, each drawn independently from a symmetric Dirichlet(α) prior. For each word, z denotes the topic responsible for generating that word, drawn from the θ distribution for that document, and w is the word itself, drawn from the topic distribution ϕ corresponding to z. Estimating ϕ and θ provides information about the topics that participate in a corpus and the weights of those topics in each document, respectively.
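In Blei et al.'s (2003) notation, the dependencies just described correspond, for a single document of N words, to the joint distribution

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)
```

where the probability of the document itself is obtained by integrating over θ and summing over the topic assignments z.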

Figure 1: Graphical model representation of the LDA model. The boxes are "plates", representing replicates. The inner plate represents the repeated choice of topics and words within a document; the outer plate represents documents. The corpus-level parameters α and β are assumed to be sampled once in the generation of a corpus. The document-level variables θ_d are sampled once per document. Finally, the word-level variables z_dn and w_dn are sampled once per word in each document (Blei et al., 2003).


3. Method

The major steps in the process of reaching a result in this project are explained thoroughly below, going into detail about the models and methods used to generate topics, to analyze the sentiments of those topics, and finally to predict the authors of unidentified speeches based on the results of the prior two steps. The procedure is inspired by HTBSA, with the difference that LDA is used instead of BTM, because the documents are much larger than the tweets Bansal and Srivastava (2018) used for their predictions.

3.1 Data

The data set used in this project for topic modeling and sentiment analysis was compiled from sixteen collections of speeches held by commissioners in the Swedish National Bank, represented by the capital letters A to P in tables 1-5 below to safeguard their integrity. The speeches differ in size with regard to the number of sentences (see table 1). Some collections, those of the commissioners represented as A, D, K and N in table 1, were considered extreme values and were thus discarded. The total number of remaining sentences in the data set sums to 22 316 sentences, with 677 872 words unevenly split over a total of 474 documents. This is considered small compared to corpora used for purposes similar to the one in this study, which affected the choice of methods and tools for e.g. topic modelling, as will be touched upon in section 5. Therefore, even though the sizes of the commissioners' collections of speeches (seen in table 1) are somewhat uneven, they are all used, in order not to further reduce the total amount of data.

The total data set is used when training the LDA model. For the sentiment analysis, the data set is split into training and test sets, since the purpose of the sentiment analysis is to predict the speakers of unseen data, and a gold standard is essential for evaluating the results. The test set was created by extracting 25% of each commissioner's file of speeches before conducting the sentiment analysis (see table 2).
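The per-commissioner 25% hold-out can be sketched as follows; the shuffling and the fixed seed are assumptions for illustration, and the exact extraction procedure in the thesis may differ.

```python
# Sketch of splitting off 25% of each commissioner's sentences as a
# held-out prediction set, leaving the rest for the sentiment analysis.
import random

def split_per_author(sentences_by_author, test_fraction=0.25, seed=42):
    """Return (train, test) dicts mapping each author to sentence lists."""
    rng = random.Random(seed)
    train, test = {}, {}
    for author, sentences in sentences_by_author.items():
        shuffled = sentences[:]          # copy so the input is untouched
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        test[author] = shuffled[:n_test]
        train[author] = shuffled[n_test:]
    return train, test
```

Splitting per author, rather than over the pooled corpus, keeps every commissioner represented in both sets despite the uneven collection sizes in table 1.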


Commissioner  Sentences
A             12
B             2150
C             1694
D             385
E             5055
F             2647
G             2203
H             1896
I             1086
J             1683
K             17
L             3155
M             1437
N             31
O             3385
P             2770

Table 1: Representation of the dataset with each commissioner in the left column and their number of sentences in the dataset in the right column.

Commissioner  Analysis set  Prediction set
B             1612          538
C             1270          424
E             3791          1264
F             1985          662
G             1652          551
H             1422          474
I             814           272
J             1262          421
L             2366          789
M             1077          360
O             2388          997
P             2077          693

Table 2: A representation of the remaining dataset after removal of the extreme values, split in two for the purpose of obtaining a set for prediction (the right column).


3.2 Choosing representing words in topics

Clusters of words representing each topic can contain hundreds of words, so for this task it was necessary to rank the words in order to choose the k most representative words in each topic. First, not all words from each topic could be used for the sentiment analysis, since that would result in excessively long iterations and RAM overload. Second, representativeness decreases the further down the list of topic words one goes. By default, the words in each topic cluster are sorted by their frequency of occurrence in the semantic field of the other words representing the topic (see figure 2). Based on this, and on Jurafsky and Martin's (2020) explanation of how word embeddings can enhance sentiment analysis, it became clear that the clusters needed to be cropped to some extent to avoid overfitting: the sentiment analysis tool treats the words used as similar, so author prediction based on generally expressed sentiments would presumably flatten. Synonyms, hypernyms and hyponyms of the top words tend to appear further down the hierarchy of topic words, and the sentiments towards those words would not contribute to the general sentiment. The evaluation of the best number of representative words is described in section 4, and future work on word-embedding-based rankings of topic words is discussed in section 5.
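Cropping a topic cluster to its k most representative words can be sketched as below. The `(word, weight)` input mirrors the shape of, e.g., gensim's `LdaModel.show_topic` output; the function name and the Swedish example words are hypothetical illustrations, not the thesis's actual topics.

```python
def top_k_topic_words(topic_word_weights, k=25):
    """Crop a topic's ranked (word, weight) list to its k most
    representative words, highest weight first."""
    ranked = sorted(topic_word_weights, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Hypothetical topic-word weights for one topic:
topic = [("inflation", 0.041), ("ränta", 0.035),
         ("bank", 0.012), ("penningpolitik", 0.030)]
print(top_k_topic_words(topic, k=3))
```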


Figure 2: A visualization of the generated topics from the interactive visualization tool LDAvis (Sievert & Shirley, 2014). On the right, a term bar chart represents overall term frequency (blue) and estimated term frequency within the topic (red) for a topic marked in the global topic view on the left.

3.3 Sentiment Analysis

This section examines and describes the sentiment analysis model and the word sense lexicon used for finding the general opinions expressed towards each topic. Furthermore, the usage and output are explained, since the goal is to use the sentiment scores for predicting the speakers of speeches, in contrast to the other usages of sentiment analysis of topics mentioned in section 2.

3.3.1 Word sense lexicon

Since the corpus used in this project is Swedish, it was essential to find a Swedish sentiment and word sense lexicon. The best currently available lexicon for Swedish sentiment analysis and word senses is SenSALDO, created for SWE-CLARIN (Rouces et al., 2018). It is described as a comprehensive sentiment lexicon for the Swedish language, containing 65 953 items (text word forms) as well as 7 618 word senses. The sentiment analysis model used in this project, and made compatible with Swedish documents thanks to the SenSALDO lexicon, is VADER.

3.3.2 Implementation of VADER

VADER, the parsimonious rule-based model originally made for English by Hutto and Gilbert (2014), is currently a state-of-the-art model for sentiment analysis. VADER can conduct an overall sentiment intensity analysis on short texts, outputting three numerical values between 0 and 1 that correspond to the neutral, positive and negative intensity of the text. The application of the model follows Hutto and Gilbert's usage of the code for short texts.

The code generating sentiment scores towards each topic iterates through each file containing a commissioner's speeches. Every sentence in which one of a topic's representative words occurs is checked for word senses; sentences found to contain both a topic word and a word sense are saved and labeled sense sentences. The code then iterates through the sense sentences related to the words in each topic and conducts a sentiment intensity analysis, whose component values between 0 and 1 are combined into a compound value between -1 and 1 representing the polarity from negative (-1) to positive (1). A value > 0.05 is interpreted as positive, a value < -0.05 as negative, and scores in between are tagged as neutral. These tags are assigned to each topic word for every sense sentence it occurs in. Thus, each topic word receives as many tags as the number of sense sentences it occurred in for each commissioner, and the most frequent sentiment tag (positive/negative/neutral) is considered the general opinion towards that word (see table 3).
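The thresholding and majority-tag steps above can be sketched as follows, assuming VADER-style compound scores in [-1, 1]; the function names and example scores are hypothetical.

```python
from collections import Counter

def polarity_tag(compound):
    """Map a VADER-style compound score in [-1, 1] to a polarity tag,
    using the +/-0.05 thresholds described in the text."""
    if compound > 0.05:
        return "positive"
    if compound < -0.05:
        return "negative"
    return "neutral"

def general_opinion(compound_scores):
    """The majority polarity tag over all sense sentences in which a
    topic word occurs is taken as the general opinion towards it."""
    tags = Counter(polarity_tag(c) for c in compound_scores)
    return tags.most_common(1)[0][0]

print(polarity_tag(0.3), polarity_tag(-0.2), polarity_tag(0.01))
print(general_opinion([0.4, 0.6, -0.3, 0.0]))
```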

The next step in the code is to summarize the sentiment scores for each topic and add them to a dictionary with the structure {commissioner: {topic n: {Positive: x, Negative: y, Neutral: z}}}, where n is the topic number and x, y and z are the general sentiment scores. This gives a general overview of the sentiments expressed towards each topic (see table 4) and is used for predictions on an identical dictionary generated from the test set, where the key commissioner is instead labeled file k, with k a unique number within the range of the number of commissioners.
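Building that nested dictionary can be sketched as below. The flat `(commissioner, topic, tag)` input format is a hypothetical stand-in for the tagging step's output, not the thesis's exact intermediate representation.

```python
def summarize_topic_sentiments(tagged_word_occurrences):
    """Aggregate polarity tags into the structure
    {commissioner: {topic: {"Positive": x, "Negative": y, "Neutral": z}}}."""
    summary = {}
    for commissioner, topic, tag in tagged_word_occurrences:
        topic_scores = summary.setdefault(commissioner, {}).setdefault(
            topic, {"Positive": 0, "Negative": 0, "Neutral": 0})
        topic_scores[tag] += 1
    return summary

# Hypothetical tagged occurrences for two commissioners:
occurrences = [("E", 0, "Positive"), ("E", 0, "Neutral"),
               ("E", 0, "Positive"), ("F", 1, "Negative")]
summary = summarize_topic_sentiments(occurrences)
print(summary)
```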


H Stat 88 Pos: 44, Neu: 16, Neg: 28

L Stat 87 Pos: 25, Neu: 20, Neg: 42

M Stat 60 Pos: 12, Neu: 13, Neg: 31

Table 3: An extract from the sentiment analysis, showing the sentiment polarities (rightmost column) found in the sense sentences (their count in the second column from the right) in which the word Stat (state, second column from the left) occurred in three authors' (leftmost column) collections of speeches, thus identifying their opinions towards that word.

E Topic 0: Pos: 94, Neu: 151, Neg: 106

Topic 1: Pos: 10, Neu: 15, Neg: 16 ...

F Topic 0: Pos: 41, Neu: 22, Neg: 30

Topic 1: Pos: 4, Neu: 1, Neg: 2 ...

Table 4: Example output of the topic sentiment analysis for two commissioners (left column), showing two of their topics (middle column) and the sentiment scores found towards each (right column).

3.4 Identification of speeches

This section describes the reasoning behind the prediction of speakers and the code used to accomplish the topic- and sentiment-based prediction. The baseline, and how it was used to evaluate the prediction, is also described below.

3.4.1 Topic and sentiment based identification

This procedure is similar to the baseline procedure, except that the sentiment scores are used to further sift the candidates instead of randomly choosing a candidate as speaker. The general sentiment scores expressed towards each topic are used to find the candidate who, based on the scores, is the most probable speaker of each speech.

One issue had to be resolved before score-based predictions could be made: since the test set is smaller, it yields far fewer scores. The number of sense sentences grows almost linearly with the total number of sentences, so the sentiment scores representing the opinions expressed towards each topic in the test set cannot be matched directly with those of the main data set. The scores therefore had to be converted to proportions to make the predictions possible.


Each positive, negative and neutral score was converted to a proportion of the sum of the sentiment scores expressed towards the topics by each commissioner within their speeches. This preserves the general sentiment, since the highest score still has the largest proportion within each dictionary, and every score is represented by a number between 0 and 1 in both the main and the test data set. Predictions based on matching general sentiment scores thereby become possible (compare tables 4 and 5).
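The conversion to proportions can be sketched as below, normalizing each count by the commissioner's total sentiment count over all topics; the function name and the two-topic example counts are illustrative, so the resulting proportions differ from table 5, which divides by a total over many more topics.

```python
def to_proportions(topic_sentiments):
    """Convert raw Positive/Negative/Neutral counts into proportions of
    one commissioner's total sentiment count, making train and test
    dictionaries comparable despite their different sizes."""
    total = sum(count for scores in topic_sentiments.values()
                for count in scores.values())
    return {topic: {label: count / total for label, count in scores.items()}
            for topic, scores in topic_sentiments.items()}

# Hypothetical raw counts for one commissioner over two topics:
scores = {0: {"Positive": 94, "Neutral": 151, "Negative": 106},
          1: {"Positive": 10, "Neutral": 15, "Negative": 16}}
props = to_proportions(scores)
print(round(props[0]["Positive"], 4))
```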

So far, candidates have been selected based on identical or nearly identical occurrences of topics, and the sentiment scores have been converted to proportions so that the opinions expressed in the test and main data sets can be matched. If a positive score accounts for a high proportion of the summed score towards topic n in file k, and one or more candidates have a similarly high positive proportion towards the same topic n, it is likely that one of them holds that opinion about that topic. The next step was therefore to assign to each topic in each file the most probable owner of the opinion expressed towards it, based on similarity between sentiment proportions. Each topic in each file then had one probable owner of the opinions expressed towards it, and the most probable speaker of each file was predicted by checking who was the most common "owner" of its topics. If commissioner y had been chosen as the most probable owner of the most topics in file k, then commissioner y was predicted to be the speaker of those speeches. If a file had no co-occurring candidates at all, the prediction was based on who had the sentiment proportions most similar to those of the file's matching topics.
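The final majority-vote step can be sketched as follows; the helper name and the per-topic owner list are hypothetical, and ties simply fall to the first most common owner.

```python
from collections import Counter

def predict_speaker(topic_owners):
    """Predict the speaker of a file as the commissioner assigned as the
    most probable 'owner' of the largest number of its topics."""
    return Counter(topic_owners).most_common(1)[0][0]

# Hypothetical topic-by-topic owners for one test file, e.g. from
# sentiment-proportion matching:
owners = ["E", "F", "E", "L", "E"]
print(predict_speaker(owners))
```

Here commissioner E owns three of the five topics and is therefore predicted as the speaker.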

E Topic 0: Pos: .0186, Neu: .0283, Neg: .0205

Topic 1: Pos: .0057, Neu: .0090, Neg: .0215 ...

F Topic 0: Pos: .0218, Neu: .0124, Neg: .0162

Topic 1: Pos: .0067, Neu: .0042, Neg: .0107 ...

Table 5: The same example output as in table 4, of the topic sentiment analysis of two authors (left column) and two of their topics (middle column), but with the generated sentiment scores represented as proportions of each author's total number of sentiment scores (right column).

3.4.2 Baseline

The baseline represents how many speakers are likely to be correctly predicted while ignoring predictors. It is calculated with one of the most common baseline methods, the Zero-Rule (Zero-R) classifier (Choudhary & Gianey, 2017). Zero-R relies solely on the target, ignores predictors, and simply chooses the most recurrent value in a frequency table. Compared to a simple random prediction algorithm it is considered better, since it uses more information about the problem to create a prediction rule. In this case, the baseline is based on the results from training the classification methods.
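Zero-R can be sketched in two standard variants; the example values are hypothetical, and it is the thesis's reported figure of 4.54 (section 4) that suggests a numeric, mean-style summary rather than a pure mode.

```python
from collections import Counter
from statistics import mean

def zero_r_classification(labels):
    """Zero-R for class labels: ignore predictors and always 'predict'
    the most frequent label in the target."""
    return Counter(labels).most_common(1)[0][0]

def zero_r_regression(values):
    """Zero-R for numeric targets: always predict the mean value."""
    return mean(values)

# Hypothetical outputs (e.g. correctly predicted authors per setting):
outcomes = [4, 5, 4, 7, 4, 3]
print(zero_r_classification(outcomes))
print(zero_r_regression(outcomes))
```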


4. Experiments

This section describes the validation procedure and presents the results of the proposed approach. A comparison of the baseline and the topic- and sentiment-based prediction is presented below.

4.1 Performance Evaluation

The first variable to be experimented with was the number of words representing each individual topic in the sentiment analysis and the prediction of authors. This was evaluated by predicting authors with different numbers of representative words from each topic cluster, to find the number providing the best results. The tested values ranged from 10 to 55 representative words, giving varying results. The results are presented in diagram 1 and table 6, showing that the best predictions were obtained with 20-30 representative topic words.

As the number of representative words increases, the number of sense sentences increases roughly linearly, and the prediction accuracy peaks between 20 and 30 representative words. The entire data set contains 22 316 sentences, and during testing the number of sense sentences greatly exceeds that number. This suggests that the same words occur in the same sentences and that the same sense sentences are used multiple times, so that at some point the generated sentiments stop contributing to the general sentiment. Even though each sentiment is expressed towards a different word, the sense words found in the sense sentences have most likely already been used multiple times, leading to the decline in correctly predicted authors seen in diagram 1 and table 6.
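The sweep over candidate values of k can be sketched as below. The callback and its dummy accuracy profile are hypothetical placeholders for the full topic/sentiment pipeline; only the tested range (10-55) and the peak region (20-30, best score 7 of 12) come from the text.

```python
def evaluate_k_values(predict_with_k, k_values=range(10, 56, 5)):
    """Run the author-prediction pipeline for each number of
    representative topic words and record the number of correctly
    predicted authors per setting."""
    return {k: predict_with_k(k) for k in k_values}

# Dummy stand-in for the real pipeline: accuracy peaks for 20 <= k <= 30.
results = evaluate_k_values(lambda k: 7 if 20 <= k <= 30 else 4)
best_k = max(results, key=results.get)
print(best_k, results[best_k])
```

With ties broken towards the smallest k, this selects k = 20, mirroring the "highest score with lowest effort" choice in section 4.1.1.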


Diagram 1: The number of correctly predicted authors (y-axis) depending on the number of representative topic words used for sentiment analysis and prediction (x-axis).

Table 6: The same conditions as in diagram 1, additionally showing the number of found sense sentences (right column) depending on the chosen number of topic words (left column) and how it affected the prediction results (middle column: the number of correctly predicted authors out of 12).

4.1.1 Topic and sentiment based prediction

As seen in diagram 1 and table 6 in the previous section, the highest number of correctly predicted authors was 7 out of 12, i.e. roughly 58.3% of speakers were correctly predicted, obtained with 20 representative words (the number giving the highest score at the lowest cost).


4.1.2 Baseline

The baseline was obtained with a Zero-R algorithm calculating the major recurrences in the outputs of the different classes of the proposed approach, where the classes are represented by the different numbers of topic-representing words. The output values in the middle column of table 6 were used as input to the algorithm, which calculated the baseline to be 4.54 correctly predicted authors. In other words, the baseline prediction accuracy without predictors is 4.54/12 × 100 ≈ 37.83%.

4.2 Results

The results showed that the proposed approach is more effective than predicting speakers without predictors (proposed approach: 58.3% versus baseline: 37.83% correctly predicted speakers). This comparison shows that the proposed approach can be effective. It can, and needs to, be improved considerably in comparison with other author prediction methods; such improvements are discussed in section 5.


5. Discussion

This section discusses the results in comparison with other, similar tasks, and possible future steps in this particular branch of author identification. Some of the biggest limitations of this work, described in more depth later in this section, concern the language of the documents and the novelty of the associated field. A large number of methods for sentiment analysis, topic modeling and topic-based sentiment analysis are presented in the literature, each better than the others in different ways, though some are hard to apply for particular purposes or unsuitable for particular domains. This project needed a method suitable for author identification on Swedish documents covering economic and bank-related topics. This section therefore also suggests methods that might be made compatible with the task at hand in the future.

5.1 Discussion of results

The results presented in this thesis are not, in comparison with other author prediction methods on other documents, competitive with state-of-the-art results. As mentioned in section 2.1, Silva et al. (2011) achieved 95% prediction accuracy on unknown documents with the help of SVM, though not on the same data or within the same domain. Thus, until more comparative experiments have been conducted, the 58.3% prediction accuracy presented in section 4.2 shows that there is potential in identifying authors with the help of sentiment analysis and topic modeling, as suggested by Chen et al. (2017) and Rosen-Zvi et al. (2004) in section 2.1.

5.1.1 Discussion of the data

There are several potential ways to improve the results, and one possible drawback is the data, or this particular method combined with this data. Author identification within this economy/bank-related domain is, to the best of my knowledge, almost non-existent in the academic literature, so few data-related comparisons can be made. As mentioned in section 2.1, Zechner (2020) published a paper suggesting that topics and authors are represented in a similar manner in most other author identification tasks. He argues that if one tries to predict authors in, e.g., newspapers, the predictions are easy to make, since the topics discussed vary: each author speaks about an individual topic or an individual set of topics. Thus, if you can identify topics, you can identify authors; they are linked together. One can therefore theorize that in a domain like the one used in this project, predictions are bound to be harder, since the authors are most likely talking about the same economy/bank-related topics. If other renowned author identification methods (primarily used on newspapers or other corpora with highly varying topics) were used on the same corpus as in this project, their accuracy might not be much better than the one achieved here, since they are effectively looking for topics as authors.

5.1.2 Discussion of the proposed method

This thesis's proposed method is not commonly used within the field of author identification and is therefore not known as either the best or the worst; it is under exploration, especially on Swedish documents. The obtained results are not satisfactory compared to the accuracies of other author identification projects, if one expects the 95% accuracy obtained by Silva et al. (2011) using SVM. However, based on Zechner's (2020) paper, as mentioned above, if methods such as Silva et al.'s were used on this project's data, they would presumably not achieve a much better accuracy than the one presented here: other methods are most commonly used and evaluated on corpora with more varied topics than the ones generated from the data used in this project.

Sentiment analysis of topics might be the best way to discover identifying factors of authors in corpora similar to the one in the current study. In corpora like this, where the authors talk about the same topics, sentiment analysis makes each author's relation to the topics unique. Until it has been shown whether the accuracies presented in much of the author identification field are misleading or not, one can theorize that it is not authors that previous methods identify but topics, as suggested by Zechner. This needs to be investigated further before any conclusions can be drawn, but other methods' accuracies might be similar to the one in this paper if they were applied to corpora similar to the one used in this study.

A great deal of the time available for this project was spent researching previously used methods for author identification and sentiment analysis of topics. Some were tested as possible methods and later discarded; others were discarded even before that, generally because they could not currently be applied to Swedish documents or to smaller data sets. The proposed approach is therefore worth further exploration, since other methods exist but are not yet compatible. These methods are discussed in section 5.2, which proposes future approaches.


5.2 Future research

First and foremost, it would be most interesting to avoid the major language-related limitations of this project, in order to make comparative studies possible in which renowned, high-accuracy author identification methods are applied to similar topic-specific data. This is of particular interest since Zechner (2020) hypothesized that topics are what most commonly identifies authors in most author identification tasks. If such studies were conducted on the same data used in this project, rather than on similar data in English, it would be necessary to enhance the methods used here or to find better methods compatible with Swedish documents, since there is considerable potential for increasing the accuracy.

5.2.1 Possibilities of the used method

The method as it stands can be experimented with and investigated further, for instance by re-ranking the generated topic words (Alokaili et al., 2019) to increase interpretability and thereby possibly improve prediction accuracy: the representative words may become more similar, enhancing the sentiment analysis, as suggested by, e.g., Jurafsky and Martin (2020) and Martins et al. (2019). If re-ranking of topic words makes the topics more discernible, the general sentiment scores might become more distinctly different.

In addition to this way of improving the proposed approach, one could also investigate how the predictions would be affected by adding more features for identifying authors. Evaluative research could be done in the same manner as Menon and Choi (2011), mentioned in section 2.1, who evaluated different features when using SVM in an author attribution task. In this project, one such feature manipulation could be the removal of words lacking, or with low, semantic orientation, to increase the share of representative topic words that provide a sentiment score.

5.2.2 Possibilities with other methods

As mentioned in section 2, Jurafsky and Martin (2020) and Martins et al. (2019) state that word embeddings and semantic similarity among the words used for sentiment analysis in documents (and topics) should enhance the results. Section 2 proposed several methods focusing on semantic similarity when generating topics, some based on the word2vec framework by Mikolov et al. (2013). These methods looked very attractive for the aims of this project, but after some attempts it was found that the limitations of the data set (its small size and the fact that it is in Swedish) made them incompatible. There are, however, future possibilities of making these methods work, at least for Swedish. The topic2vec model (Niu et al., 2015) can be made compatible with Swedish with the help of language transformers (Ma et al., 2019), though this affects the outcome: some words are not translated correctly, which would not affect the generation or interpretability of topics, but would affect the sentiment analysis, since incorrectly transformed words might not represent the correct word sense. Thus, either a top2vec model needs to be built for Swedish documents, or some kind of control of the translated words is needed. A larger data set would also be preferable for using the top2vec model, since it is limited to fairly large corpora containing many documents; it was originally made compatible with the 20 newsgroups data set, which contains 18 846 documents. It is not yet compatible with corpora containing as few documents as in this case, with its comparatively small number of 474 documents.

In addition to top2vec, Aspect-Based Sentiment Analysis (ABSA) was mentioned in section 2.3 as a possible model for author identification on Swedish documents. This might become possible in the near future, since Rouces et al. (2020) recently created an annotated corpus for ABSA in Swedish. Schouten and Frasincar (2016) identify semantically rich, concept-centric, aspect-level sentiment analysis as one of the most promising future research directions, so making it compatible with tasks such as this one might lead to bright horizons. This might therefore be the most fruitful direction for author identification with sentiment analysis of topics, if an annotation tool becomes available for Swedish following the work of Rouces et al. (2020).

5.3 Ethical Consideration

A field like author identification does not come without a need for ethical considerations, since it, as in this case, involves attributing authors to anonymous texts. Nothing guarantees that the prediction of authors of anonymous texts is completely accurate, or accurate at all, since it is impossible to know who the author actually is. If 95% accuracy is achieved on a task where a key makes it possible to retrospectively check the predictions, that does not automatically mean 95% accuracy on anonymous documents, where checking is impossible. Furthermore, without a key it is not possible to state which 5% of the authors were incorrectly predicted.


This does not mean that information of this kind could not be used in a bad manner. It is important to be watchful of tools like these being used for bad influence by, for instance, political parties. They could easily be used for propaganda, claiming that some leader or other influential person anonymously wrote something invidious "with 95% certainty". If an author identification tool were used for such purposes, it is gravely important to be source-critical and to bear in mind that the accuracy measure stems from use on other documents, where the predictions could be checked.


6. Conclusion

This thesis integrated three different natural language processing fields: topic modeling, sentiment analysis and author identification. Topic modeling had previously been used as a method for author identification, and sentiment analysis has been suggested to enhance it; this project therefore aimed to explore an integration of the two for the purpose of identifying authors, or in this case, speakers. The aim of the thesis

- exploring the possibilities of enhancing topic modeling with sentiment analysis on Swedish documents to investigate how this can contribute to the task of predicting speakers of Swedish national bank related speeches

was fulfilled, since it was shown that sentiments towards topics can be used as identifiers of authors, even though the results can be greatly improved. After discussing possible reasons for the less than satisfactory results compared to accuracies achieved on other corpora with other author identification methods, it was shown that future research is needed within this branch of the field introducing topic-based sentiment analysis.

It has, however, recently been suggested that author identification might amount to topic identification, indicating that some accuracies in the academic literature of the field may be misleading (Zechner, 2020). Author identification methods used on corpora containing many different and distant topics have a much easier task, since the probability of two authors discussing identical topics is extremely small. On corpora similar to the one used in this thesis, where every document touches every topic to some extent and all authors are very likely to discuss identical topics, identification becomes more difficult. Thus, the results and procedure of this thesis might serve as a baseline in future work using sentiment analysis of topics to identify authors in Swedish or other documents with similar topics among the authors. This also indicates that the accuracy of this project can possibly be considered good until other author identification methods have been tried on this or similar corpora.

To conclude, the accuracy was satisfactory given the aim and compared with the baseline, since it shows the potential of the proposed approach as a tool for enhancing author identification in documents with little semantic distance between the topic clusters. This was an explorative thesis introducing a novel branch of author identification for Swedish documents, and many suggestions for future improvements and explorations with this approach were proposed, both for Swedish exclusively and for other languages.


References

Alokaili, A., Aletras, N., & Stevenson, M. (2019). Re-Ranking Words to Improve Interpretability of Automatically Generated Topics. https://www.aclweb.org/anthology/W19-0404.pdf.

Bansal, B., & Srivastava, S. (2018). On predicting elections with hybrid topic based sentiment analysis. Procedia Computer Science, 135, 346-355. https://doi.org/10.1016/j.procs.2018.08.183.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022. https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf.

Brody, S., & Elhadad, N. (2010). An Unsupervised Aspect-Sentiment Model for Online Reviews. In HLT-NAACL 2010. https://www.aclweb.org/anthology/N10-1122.pdf.

Buyya, R., Calheiros, R. N., & Vahid Dastjerdi, A. (2016). Big Data. Cambridge: Elsevier Inc.

Chen, H., Lin, M., & Wei, Y. (2006). Novel association measures using web search with double checking. In Proceedings of the 21st International Conference of COLING and the 44th Annual Meeting of ACL, 1009-1016. https://www.aclweb.org/anthology/P06-1127/.

Chen, L., Gonzalez, E., & Nantermoz, C. (2017). Authorship Attribution with Limited Text on Twitter. http://cs229.stanford.edu/proj2017/final-reports/5241953.pdf.

Choudhary, R., & Gianey, H. K. (2017). Comprehensive Review On Supervised Machine Learning Algorithms. 2017 International Conference on Machine Learning and Data Science (MLDS), 37-43. https://ieeexplore.ieee.org/document/8320256.

Church, K. W., & Hanks, P. (1989). Word association norms, mutual information and lexicography. Annual Conference of the Association for Computational Linguistics, 27, 76-83.


Hagenau, M., Liebmann, M., & Neumann, D. (2013). Automated news reading: stock price prediction based on financial news using context-capturing features. Decision Support Systems, 55(3), 685-697. https://doi.org/10.1016/j.dss.2013.02.006.

Hutto, C. J., & Gilbert, E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. https://www.researchgate.net/275828927.

Jurafsky, D., & Martin, J. H. (2020). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition (2nd edition). London: Pearson.

Kale, S. D., & Prasad, R. S. (2017). A Systematic Review on Author Identification Methods. International Journal of Rough Sets and Data Analysis (IJRSDA), 4(2), 81-92. https://www.igi-global.com/article/a-systematic-review-on-author-identification-methods/178164.

Lek, H. H., & Poo, D. C. C. (2013, November 4-6). Aspect-based Twitter Sentiment Classification [paper presentation]. 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, Herndon, VA, USA. https://ieeexplore.ieee.org/document/6735273.

Lin, D., & Wu, X. (2009). Phrase clustering for discriminative learning. In Proceedings of ACL, 1030-1038, Suntec, Singapore. ACL.

Liu, B. (2012). Sentiment Analysis and Opinion Mining. San Rafael: Morgan & Claypool Publishers.

Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference of WWW, 342-351. https://doi.org/10.1145/1060745.1060797.

Maks, I., & Vossen, P. (2012). A lexicon model for deep sentiment analysis and opinion mining applications. Decision Support Systems, 53, 680-688. https://doi.org/10.1016/j.dss.2012.05.025.

Ma, X., Zhang, P., Zhang, S., Duan, N., Hou, Y., Song, D., & Zhou, M. (2019). A Tensorized Transformer for Language Modeling.

Martins, R., Almeida, J. J., Henriques, P., & Novais, P. (2019). A sentiment analysis approach to increase authorship identification. In Hall, J. G. (ed.), Expert Systems, 38(3). Hoboken: John Wiley & Sons Ltd.

Meena, A., & Prabhakar, T. V. (2007). Sentence Level Sentiment Analysis in the Presence of Conjuncts Using Linguistic Analysis. In Amati, G., Carpineto, C., & Romano, G. (eds), Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer: Heidelberg.

Menon, R. K., & Choi, Y. (2011). Domain Independent Authorship Attribution without Domain Adaptation. In Proceedings of the International Conference Recent Advances in Natural Language Processing. https://www.aclweb.org/anthology/R11-1043.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781.

Moody, C. E. (2016). Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec. https://arxiv.org/abs/1605.02019.

Mosteller, F., & Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley.

Niu L., Dai, X., Zhang, J., & Chen, J. (2015). Topic2Vec: Learning distributed representations of topics. 2015 International Conference on Asian Language Processing (IALP).

https://ieeexplore.ieee.org/document/7451564.

Palme, J., & Berglund, M. (2002). Anonymity on the Internet. Retrieved August 15, 2009.

https://people.dsv.su.se/~jpalme/society/anonymity.pdf.

Poria, S., Chaturvedi, I., Cambria, E., & Bisio, F. (2016, July 24-29). Sentic LDA: Improving on LDA with semantic similarity for aspect-based sentiment analysis [paper presentation]. 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada.

https://ieeexplore.ieee.org/abstract/document/7727784/authors#authors.


Potthast, M., Stein, B., Barron-Cedeno, A., & Rosso, P. (2010). An evaluation framework for plagiarism detection. Proceedings of the 23rd international conference on computational linguistics.

https://dl.acm.org/doi/10.5555/1944566.1944681.

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The Author-Topic Model for Authors and Documents. In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004).

https://arxiv.org/abs/1207.4169.

Rouces, J., Borin, L., Tahmasebi, N., & Rødven Eide, S. (2018). SenSALDO: a Swedish Sentiment Lexicon for the SWE-CLARIN Toolbox.

https://ep.liu.se/ecp/159/018/ecp18159018.pdf.

Schouten, K., & Frasincar, F. (2016). Survey on Aspect-Level Sentiment Analysis. IEEE Transactions on Knowledge and Data Engineering, 28(3), 813-830.

https://ieeexplore.ieee.org/document/7286808.

Seroussi, Y., Bohnert, F., & Zukerman, I. (2012). Authorship attribution with author aware topic models. In Proceedings of the 50th annual meeting of the Association for Computational Linguistics, 264–269. Stroudsburg, PA, USA: Association for Computational Linguistics.

Sievert, C., & Shirley, K. E. (2014). LDAvis: A method for visualizing and interpreting topics. In Workshop on Interactive Language Learning, Visualization, and Interfaces at the Association for Computational Linguistics. DOI: 10.13140/2.1.1394.3043.

Silva, R. S., Laboreiro, G., Sarmento, L., Grant, T., Oliveira, E., & Maia, B. (2011). ‘twazn me!!! ;(’ Automatic Authorship Analysis of MicroBlogging Messages. In Muñoz, R., Montoyo, A., & Métais, E. (eds) Natural Language Processing and Information Systems. NLDB 2011. Lecture Notes in Computer Science, 6716, 161-168. Springer, Berlin, Heidelberg.

https://doi.org/10.1007/978-3-642-22327-3_16.

Su, Z., Ahn, B., Eom, K., Kang, M., Kim, J., & Kim, M. (2008). Plagiarism detection using the Levenshtein distance and Smith-Waterman algorithm. Innovative Computing Information and Control, 569–569.


Tao, X., Peng, Q., & Cheng, Y. (2012). Identifying the semantic orientation of terms using S-HAL for sentiment analysis. Knowl-Based Syst, 35, 279-289.

https://doi.org/10.1016/j.knosys.2012.04.011.

Tsytsarau, M., & Palpanas, T. (2012). Survey on mining subjective data on the web. Data Min Knowl Discov, 24, 478-514.

https://doi.org/10.1007/s10618-011-0238-6.

Yu, L., Wu, J., Chang, P., & Chu, H. (2013). Using a contextual entropy model to expand emotion words and their intensity for the sentiment classification of stock market news. Knowl-Based Syst, 41, 89-97.

https://www.sciencedirect.com/science/article/abs/pii/S095070511300004X.

Zhai, Z., Liu, B., Xu, H., & Jia, P. (2011). Clustering product features for opinion mining. In Proceedings of the 4th International Conference of WSDM, 347–354.
