Detection of deceptive reviews: using classification and natural language processing features

(1)

UPTEC F 16056

Examensarbete 30 hp November 2016

Detection of deceptive reviews

using classification and natural language processing features

Johan Fernquist

(2)

(3)

Teknisk- naturvetenskaplig fakultet UTH-enheten

Besöksadress:

Ångströmlaboratoriet Lägerhyddsvägen 1 Hus 4, Plan 0

Postadress:

Box 536 751 21 Uppsala

Telefon:

018 – 471 30 03

Telefax:

018 – 471 30 00

Hemsida:

http://www.teknat.uu.se/student

Abstract

Detection of deceptive reviews using classification and natural language processing features

Johan Fernquist

With the great growth of open forums online where anyone can give their opinion on everything, the Internet has become a place where people are trying to mislead others. By assuming that there is a correlation between a deceptive text's purpose and the way to write the text, our goal with this thesis was to develop a model for detecting these fake texts by taking advantage of this correlation.

Our approach was to use classification together with three

different feature types, term frequency-inverse document frequency, word2vec and probabilistic context-free grammar. We have managed to develop a model which have improved all, to us known, results for two different datasets.

With machine translation, we have detected that there is a possibility to hide the stylometric footprints and the

characteristics of deceptive texts, making it possible to slightly decrease the accuracy of a classifier and still convey a message.

Finally we investigated whether it was possible to train and test our model on data from different sources and managed to achieve an accuracy hardly better than chance. That indicated the resulting model is not versatile enough to be used on different kinds of deceptive texts than it has been trained on.

Examinator: Tomas Nyberg Ämnesgranskare: Sofia Cassel Handledare: Marianela García Lozano

(4)

(5)

Populärvetenskaplig sammanfattning

På Internet existerar grupper av användare som har till syfte att sprida rykten eller falska uttalanden. Dessa användare försöker att påverka och vilseleda andra för någons vinning. Med den ökande mängden bedrägliga texter på Internet kan människor börja göra beslut baserade på felaktig och vilseledande information. Den stora ökningen av öppna plattformar på Internet har gett en stor möjlighet för detta, och olika typer av organisationer har med framgång kunnat sprida sin propaganda. Detta medför att det nns ett behov av att automatiskt kunna upptäcka huruvida texter är falska eller inte.

Vi har utvecklat en modell som kan avgöra om texter från två olika dataset innehål- lande recensioner är sanna eller falska. För att utveckla vår modell har vi använt oss av klassiceringsalgoritmen support vector machines (SVM). Algoritmen använder en datamängd med känt klassvärde för att hitta mönster som skiljer de olika klasserna åt.

I vårt fall har SVM använts för att hitta mönster som kännetecknar om recensionerna är sanna eller falska. Modellen har sedan testats genom att klassicera nya texter.

För att använda SVM krävs det att man representerar sin data matematiskt. Vi har gjort matematisk representation av vår data med tre olika metoder vilka är TF-IDF, word2vec och PCFG.

TF-IDF står för term frequency - inverse document frequency och genererar ett värde för varje ord i ett dokument och dokumentsamling. Detta värde indikerar hur viktigt ett ord är och ökar proportionellt mot antalet förekomster av ordet i dokumentet, men straas också av förekomsten av ordet i hela dokumentsamlingen.

En word2vec modell skapas med en dokumentsamling. När modellen skapats erhåller varje ord i dokumentsamlingen en punkt i ett koordinatsystem. I detta koordinatsystem placeras ord som förekommer i liknande sammanhang närmare varandra.

PCFG är ett sätt att matematiskt modellera beståndsdelarna i ett språk. Vi har gjort mätningar på vår data bestående av recensioner genom att ta reda på vilka gramma- tiska beståndsdelar recensionerna är uppbyggda av. Dessa beståndsdelar består av regler där en ordklass kopplas samman med antingen ord eller andra ordklasser.

Dessa tre metoder har använts separat, men vi har även kombinerat de två bästa metoderna med förhoppningen att erhålla en ännu bättre modell. Detta resulterade i en modell med mätningar från metoderna TF-IDF och PCFG som förbättrade tidigare forskningsresultat där andra klassicerat samma recensioner.

Vi har även undersökt hur vår bästa modells resultat påverkas av att vi använt maskin- översättning. Vi skapade en modell med en del av våra recensioner och översatte sedan resterande recensioner till olika språk, och sedan tillbaka till originalspråket och testade att klassicera dem. Ju mer lingvistiskt olikt originalspråket det översatta språket var och ju er översättningar som gjordes, desto sämre presterade vår modell. Försämrin- gen motsvarade dock bara några få procentenheter och vår modells förmåga att nna falska recensioner var fortfarande hög.

Slutligen undersökte vi hur mångsidig modellen var genom att skapa modellen med recensioner från ett dataset och testa med recensioner från ett annat. På grund av olikheten mellan dataseten uppnådde vi ett resultat som var knappt bättre än slumpen.

(6)

(7)

Introduction

With the great growth of online booking agencies, forums and shops, such platforms have become a great forum for advertisers, marketers, etc., to try to inuence people that their product or service is the best one oered. These open platforms where almost anyone can write a review visible for all visiting the page has made them a target for deceptive messaging. For example, this can be reviews written by people hired by companies with the purpose to highlight the company's competitive advantages to inuence other people to make them buy their product or service. Another example with deceptive reviews can be rivaling companies, trying to make competitors' products or services look worse. For various reasons, messages can be contradictory, false, ambiguous and biased. These examples demonstrate applications where deceptive messaging has become a problem where authors are actually not proles managed by trustworthy persons but rather individuals on a mission of someone's benecial interest.

There is a need of automatic methods which are able to detect these fake reviews and Swedish defence research agency (FOI) are interested in these automatic methods.

It already exists models for detecting fake reviews. In collaboration with FOI, we will develop some of the existing models and evaluate new models which are able to automatically detect whether reviews are fake or not.

1.1 Problem statement

The main goals of this thesis are to:

Improve and evaluate a model that is able to classify whether a text is deceptive or truthful using machine learning techniques

Examine how machine translation aects the linguistic patterns of deception and truthfulness

Examine the versatility of the trained classication models in terms of classication of new datasets

(12)

CHAPTER 1. INTRODUCTION

1.2 Delimitations

Even though it might be interesting for FOI to investigate and detect what the review is trying to say someone, the focus with this work is only to detect whether the review is truthful or not. In this thesis we will only focus on how people are writing deceptive reviews, not why.

We will only use machine learning and natural language processing (NLP) techniques to classify the reviews. NLP is the study of human evolved languages with the use of computer science.

We will only look at reviews from datasets consisting of both deceptive and truthful reviews. These reviews are relatively short texts which consists of 2 to 40 sentences.

This is mainly because there exists previous work where the same data has been used [1, 2, 3], and therefore we can compare our results with theirs. We will not investigate longer texts such as magazines and books or other media such as pictures. We will only look on how the text is structured and its components, not investigate the meta data such as the time the review was posted or which user that wrote it.

(13)

Chapter 2

Background

In this chapter, we present the client and their interest in this thesis and we also introduce the basic concepts regarding the dierent scientic elds and theories used in this thesis. We make it clear why this work is needed and summarize related work.

2.1 The client

This thesis is made on behalf of Swedish defence research agency (FOI). FOI is a Swedish state-mandated funded research agency. One of FOI's main missions is to perform research, development and investigations regarding the national security and military defence. In the department of decision support systems, the focus is to develop systems which help people to get an understanding of a situation based on all kinds of data such as web-pages, reports and structured databases.

2.2 Deceptive texts on the Internet

There is a phenomenon on the Internet called astroturng where people are paid to convey a message for a client [4] and there are known astroturng cases [5]. These users might have been provided with some kind of manuscript which work as a guideline while writing the message to be spread. This might result in that all these users in some way will write relatively similar messages, regarding the use of keywords, structure, etc. In the case with fake reviews, people are on some kind of mission to emphasize or vilify something, most likely for their own or anyone else's gain and there might be some kind of similarity to these deceptive texts. For example, it is known that deceptive reviews are more likely to hold an excessive amount of positive or negative words [1]. In 2012, it was believed that by 2014, 10 to 15% of all reviews on social media would be fake and paid for by companies [6].

(14)

CHAPTER 2. BACKGROUND

2.3 Data mining, machine learning and classication

Data mining is a scientic eld that deals with nding patterns and trends in large datasets with the use of statistical methods in combination with computational algorithms from articial intelligence (AI) . The main purpose with data mining is to extract information from a dataset and be able to use that information for future use.

John McCarthy was the one that coined the term articial intelligence in 1955, and his denition yields "the goal of AI is to develop machines that behave as though they were intelligent" [7].

From AI, the sub-eld of machine learning was evolved and can be dened by "Field of study that gives computers the ability to learn without being explicitly programmed"

[8]. In this thesis we will focus on the machine learning task classication where the problem is to identify which class new observations belong to. This is done by training a model with a data containing observations with already known classes.

An example of a classication task can be to predict whether a tumor is malignant or benign. Data from previous patients such as tumor size and age can be used as features to build a model for the classication. In gure 2.1 we have an example with two previously unseen samples classied. The measured features are the age of the patient and the size of the tumor.

Figure 2.1: An example of two previously unseen tumor samples being classied In gure 2.2 we have two dierent classes represented as blue and red dots in a 2D- space. The axes represent two dierent features. For this classication task we use these pre-labeled classes and their measured features to classify new and previously unseen samples, in the gure represented as yellow dots.

As in gure 2.2 we have a group of samples with their known class but we want to build a classier to be able to classify new previously unseen samples. For each sample, we have a number of features measured. Each feature corresponds to a dimension in a multiple dimension space with the total number of dimensions equal to the total number of features. In gure 2.2, we have two features and therefore a two dimensional space. One popular algorithm for classication is support vector machines (SVM) [9], which will be used in this thesis.

(15)

Figure 2.2: Two dierent labeled classes (red and blue dots) and previously unseen samples (yellow dots) in a 2D-space

2.4 Vector space modelling

A vector space model (VSM) is a numerical representation of data on vector form. In a VSM, each dimension corresponds to some feature of the data. If we for example want to represent each document's text in a corpus as a VSM, we can create a VSM where the dimensionality of each document's VSM will be the same as the total number of unique words in the corpus. The occurrence of a word in a text is represented with a non-zero value in the vector. Vector space modelling is a common way to represent data as input for machine learning algorithms.

2.5 Natural language processing

NLP is the study of natural language (NL) with the use of AI, computer science and computational logistics. A NL refers to a by human evolved language such as English and Swedish, which has evolved by the usage of humans rather than through conscious intent and planning. A NL can have the forms of speech and writing but should not be mixed up with formal languages such as programming languages and languages invented to study linguistics. It is said that the scientic eld of NLP started to evolve in 1950s when Alan Turing wrote and published the article Computing Machinery and Intelligence. One of the most used and researched tasks of NLP are machine translation which is the task to automatically translate a text from one NL to another. Another commonly used task is parsing which is the task of creating a parse tree from a given sentence. In this thesis, we will use NLP tools to obtain dierent mathematical representations of our data.

2.6 Related work

In 2011, a research group from Cornell University gathered 400 5-star reviews of the 20 most popular hotels in the Chicago area from TripAdvisor [1]. Then they gathered 400 deceptive positive reviews using Amazon mechanical turk (AMT). With this dataset the group managed to classify whether a review is deceptive or truthful with an accuracy of 89.8% using linguistic inquiry and word count (LIWC) in combinations

(16)

with bi-grams on a linear SVM classier. LIWC [10] is a word counting tool which relates words to categories such as psychology and grammar. A bi-gram is a sequence of two consecutive letters or words. The group has also written a paper [2] where they used standard n-gram based SVM, but this time gathered 1- and 2-star reviews from TripAdvisor for the 20 same most popular hotels in the Chicago area. They also gathered 400 deceptive negative reviews, once again using AMT. The results for this experiment was similar to the previous and they received an accuracy of 89.3% for the positive reviews and 86.8% for the negative reviews. The group also made a combination of the two datasets holding both positive and negative reviews and obtained an accuracy of 88.4% for testing with the positive reviews and 86% for testing on the negative reviews.

In [3], the same dataset as in [1] has been used. SVM has been used, but it is not clear what type of kernel that has been used. What distinguishes this paper from earlier is the use of probabilistic context-free grammar (PCFG) as input for the SVM algorithm.

In combination with unigrams, the group manages to receive an accuracy of 91.2%.

To the best of our knowledge, this is the best result obtained on this dataset. We have not discovered any other previous work where a combination of negative and positive reviews had been used when training the SVM with PCFG features.

In [11] a dataset written in Dutch is presented. It consists of 540 reviews where half are deceptive and the other half truthful. In the paper they managed to obtain an accuracy of 72.2% using a linear SVM with bigrams as features.

The use of word2vec in combination with SVM has not been done in large scale, and to the best of our knowledge never with the purpose to detect deception, like we are doing in this thesis. In [12], both term frequency - inverse document frequency (TF- IDF) and word2vec has been used as features, both alone and in combination for a SVM to cluster the category of newsgroups posts. The dimensionality for word2vec was 100 and for each document, every word in the document had their corresponding vectors added and then used as features. For word2vec alone the accuracy was 84%, for TF-IDF alone 88% and in combination an accuracy of 90% was obtained.

Regarding classication, SVM has outperformed other classication algorithms such as K-Nearest Neighbors [13] and even Naive Bayes [14], in cases where the data has been text documents and the features stylometric. It has been proved that the J48 algorithm which generates a decision tree and SVM outperformed each other on dierent kinds of data [15] and for binary outcomes, the algorithms SVM, Random Forest and Adaptive Boosting are known to be ideal [16].

There exists work regarding deceptive detection on texts where the authors have tried to hide their own writing style. In [15], they have used dierent kinds of feature types for deceptive detection. One is called write print with features such as average words in sentence, number of characters per words, number of large words and percentage of letters and digits. They also used a feature type called content specic features where the topic of a word is taken into account. They have tried to detect both imitation where authors have tried to imitate the writing style of another person and obfuscation where they tried to hide their own writing style. By using write prints together with SVM they managed to detect both imitation and obfuscation with a F-measure of 85%

for imitation detection and 89.5% for obfuscation detection.

(17)

CHAPTER 2. BACKGROUND In [17], the research group has tried to anonymize texts with the purpose of inves- tigating if it is possible that with simple means try to anonymize text an by that hid the authors stylometric patterns. In the paper, the group used Google Translate and Bing to do one- and two-steps translations to either German, Japanese or both, and then back to English. These machine translated texts were then matched with non-translated documents from the same authors. The result was that the accuracy dropped between 15% and 35%, in some cases with loss of the text's intention.

(18)

(19)

Chapter 3

Theory

In this chapter, essential parts from the scientic elds covered in this thesis will be mentioned. We will go through some of the mathematics behind the various methods for feature generation and classication.

3.1 Support vector machines

The development of SVM has been going on since the 1960s, but it was not until 1992 that the algorithm was of the form it has today and got the name support vector machines. The important milestones are presented in the paper A training algorithm for optimal margin classiers by Boser, Guyon and Vapnik [18]. SVM became popular because of its empirical good results and its versatile applications such as text and image recognition and bioinformatics. SVM is robust to large number of features and small number of samples and is able to learn not just simple but also highly complex classication models.

For SVM we assume that each sample consists of a vector space model with n features and we have a set of N samples which will give us a set ~x1, ~x2, . . . , ~xN in Rⁿ where every sample has a class y. Each ~x starts in origo and points to it's feature values, therefore the vectors can also be treated as points. Each sample's class y is the value we will predict for future samples and can only take the value -1 or 1 for a binary classier.

There are two main problems to be solved for SVM. These are:

1. To insert a decision surface in the space that separates the two dierent classes 2. To have the largest margin between the border samples

The decision surface is a Rⁿ⁻¹ dimensional hyperplane in Rⁿ, which makes it a binary classier. The equation of a hyperplane is

~

w · ~x + b = 0 (3.1)

where ~w is the direction of the hyperplane's normal, b is the position of the plane in that direction and ~x is a vector from the origin to an arbitrary point on the hyperplane.

(20)

CHAPTER 3. THEORY

Equation (3.1) holds for Rⁿ when n ≥ 3. The hyperplane is the classier surface that separates the two classes from each other. For our binary examples, the hyperplane will be a line. In gure 3.1 we have inserted a number of hyperplanes to separate the classes.

Figure 3.1: A set of dierent hyperplanes separating the two classes

There exist a nite number of hyperplanes which separates the two classes, but SVM

nds the hyperplane which maximizes the distance between the samples on the bound- aries. The gap is the distance of two parallel hyperplanes (same ~w) with at least one sample (vector) for each of the classes laying on the surface of the class' hyperplane.

The samples that contributes to calculate the margin are called support vectors, which is the reason for the name of the algorithm.

Figure 3.2: A hyperplane with maximum margin between the two classes

In gure 3.2 we have the hyperplane with the maximum margin separating the two classes. The dotted lines in the gure correspond to the hyperplanes with the support vectors on the surface. The gap is the distance between the two hyperplanes

~

w · ~x + b = 1 and ~w · ~x + b = −1 (3.2) where right-hand side (RHS) corresponds to the class y. Since we want to maximize the distance D between the two hyperplanes, we have to maximize

D = |b₁− b₂|

|| ~w|| (3.3)

where ||~w|| is the euclidean length, also called the L2-norm. For an arbitrary vector ~a in a n dimensional space, the euclidean length ||~a|| is calculated as

||~a|| = q

a²₁+ a²₂+ · · · + a²_n. (3.4)

(21)

CHAPTER 3. THEORY Since b1 = b + 1and b2 = b − 1 in equation (3.3), the equation we need to maximize is

D = 2

|| ~w||. (3.5)

We can instead say that we want to minimize ||~w||.

3.1.1 Primal and dual formulations

The primal formulation of the linear SVM yields:

Minimize

n

X

i=1

wi s.t. yi( ~w · ~x + b) − 1 ≥ 0for i = 1, . . . , N. (3.6) This is a so called convex quadratic programming (QP) optimization problem. QP is a mathematical optimization problem where we try to eithther minimize or maximize a quadratic function. The meaning of convex is that there will always exists a local minimum that is also a global minimum.

The equation (3.6) can be reformulated as a dual form which is also a convex QP but with N variables ai where N is the number of samples. The dual formulation of the linear SVM yields:

Maximize

n

X

i=1

a_i−1 2

n

X

i=1

a_ia_jy_iy_jx~_ix~_j s.t. ai ≥ 0 and

n

X

i=1

a_iy_i = 0.

for i = 1, . . . , N.

(3.7)

Then ~w is dened in terms of ai: ~w = Pⁿ

i=1

a_iy_ix~_i and the solution to the problem becomes the function f(~x) = sgn(Pⁿ

i=1

a_iy_ix~_i· ~x + b).

3.1.2 Soft margin SVM

The formulations seen in equation (3.6) and (3.7) are so called hard-margin linear SVM. If there are outliers or noisy data in the sample data or if the data is non-linear, then the previously seen SVM formulations will not be able to nd a hyperplane that separates the two classes. In gure 3.3 a soft-margin classier is shown. By adding a slack variable to each instance, we are able to classify problems where the data is non-linear and noisy with the cost of misclassifying some of the training samples.

For each instance ξi ≥ 0, a slack-variable is assigned. The slack-variable can be seen as the distance from the misclassied instance to the separating hyperplane. The magnitude of how much misclassication that is tolerated is controlled by the variable C. For large values of C, the soft-margin SVM behaves as the hard-margin SVM, and for small values of C, we admit miss-classications in the training data.

The linear soft-margin primal formulation yields Minimize

n

X

i=1

wi+

n

X

i=1

ξi s.t. yi( ~w · ~x + b) ≥ 1 − ξi for i = 1, . . . , N, (3.8)

(22)

CHAPTER 3. THEORY

Figure 3.3: A soft-margin SVM on not linear separable data

and the linear soft-margin dual formulation yields

Minimize

n

X

i=1

a_i− 1 2

N

X

i,j=1

a_ia_jy_iy_jx~_i· ~x_j s.t. 0 ≤ ai≤ C and

N

X

i=1

a_iy_i = 0 for i = 1, . . . , N.

(3.9)

After the hyperplane with the maximal margin is set, it is possible to classify the new samples. From the samples in gure 2.2 the unknown samples would have been classied as represented in gure 3.4.

Figure 3.4: The previously unseen samples now classied

In some cases, the samples can be hard to separate linearly, as in gure 3.5. A solution to this problem is to use kernels which map the data into a higher dimensional feature space where the data is linearly separable.

(23)

CHAPTER 3. THEORY

Figure 3.5: A 2D-space where there exists no linear hyperplane

3.2 Context-free grammars

Sentences can be divided into dierent small grammatical parts called constituents. A context-free grammar (CFG) is a common way to mathematically model constituent structures in NL. In CFG, we determine a parse tree to grammatical analyze a given sentence.

A context free grammar G is represented on the form G = hT, N, S, Ri where

T is the set of terminals (lexicon)

N is the set of non-terminals

S is the start symbol

R is rules of the form A → B, where A ∈ N and B ∈ (T ∪ N) and the grammar G is said to generate a language L.

Here we present a set of rules where the nonterminal expression NP (noun phrase) can be composed either by a PN (proper noun) or a determiner (Det) and a nominal The two latter rules shows that a nominal can be composed by one or more Nouns.

N P → P N N P → Det N oun V P → V erb N P N om → N oun N om → N om N oun

Here we have a set of rules with terminals on the right side of the rules.

Det →"a"

Det →"the"

N uon →"cat"

N oun →"ball"

V erb →"took"

With these rules, we can construct a parse tree of the sentence "a cat took the ball"

which is presented in gure 3.6.

(24)

CHAPTER 3. THEORY

S

VP NP

Noun ball Det

the Verb took NP

Noun cat Det

a

Figure 3.6: Exempel of a parse tree of the sentence "a cat took the ball"

We want to generate a tree representing a sentence, where every node in the tree is a grammar rule. This method of dividing a sentence into a tree structure is called parsing. A parser is a program which in some way indicates whether a specic sentence is accepted by the grammar, but it also gives us the parse trees for the string.

3.2.1 Probabilistic context free grammar

A PCFG is a CFG but with a fth element. It is dened as G = hT, N, S, R, Di where D is a function which for each rule R assigns a probability p and refers to the probability that that particular rule occurs. For PCFG the probability for each parse tree τ for a given sentence S is dened as

P (τ, S) =Y

n∈τ

p(R(n)) (3.10)

which corresponds to the product of all probabilities of all the rules to expand each node n in τ. For a PCFG it also applies that

∀i, X

j

R(N_i→ B_j) = 1 (3.11)

where i is the total number of non-terminals and B ∈ (T ∪ N). This means that the sum of the probability for all rules with a given LHS is always 1. For example, say we have a grammar with the following rules,

N P → P N [0.6]

N P → Det N oun [0.4]

V P → V erb N P [1]

N om → N oun [0.2]

N om → N om N oun [0.8]

where NP is composed by a PN with probability 0.6 and a determiner followed by a Noun with probability 0.4.

In this thesis, we will use software which already has a given grammar holding both rules and probabilities. This algorithm will return a parsed tree for given sentences.

(25)

CHAPTER 3. THEORY

x₂ w₂

Σ ^f

Activate function

y Output

x₁ w₁

x3 w3

Weights Inputs

Figure 3.7: The computational principle of a neuron with three inputs

3.3 Neural networks

Neural network modeling is greatly inspired by our knowledge of how the human brain works. In the human brain, each neuron is a computational unit which gets a number of inputs through its input wires. Each neuron does some computation of the incoming signals in the form of small pulses of electricity and then sends the output to other neurons in the brain. An articial neural network, from this on called neural network (NN), works by the same principle. The neuron is a computational unit which gets V number of inputs x1, x₂, . . . , x_V through each of its input wires, each with a weight factor w. When training a neural network, w is calibrated for further use. The inputs are summarized and then used as input in an activation function f. f can for example be set to a logistic function. In gure 3.7 we show an example of how a neuron can be illustrated. In this case, the output y of the neuron will be

y = f (x1w1+ x2w2+ x3w3). (3.12) In a NN, neurons are connected to each other by taking the output of one or several other neurons as input.

3.4 Word2vec

Word2vec is a word embedding tool, which is the collective name of methods used to generate VSM representations of words. Word2vec is a multi layer NN like in gure 3.8. A layer consists of a set of neurons which all are connected to all of the neurons in the surrounding layers, as seen in the gure. For word2vec, the multi layer NN has a hidden layer which transforms the inputs to something that the output layer can use.

Word2vec uses one of two models, skip-gram or continuous bag of words (CBOW) to build a multidimensional space for each word in a corpus. By a given window size, skip gram tries to predict nearby words around current word. For CBOW, the model is trying to predict the current word given the nearby words.

(26)

CHAPTER 3. THEORY

Input

layer Hidden

layer Output

layer Input 1

Input 2 Input 3 Input 4 Input 5

Ouput

Figure 3.8: A multi-layer NN

A word2vec model is trained by using a corpus as input. After the training every word in the corpus will receive a coordinate in a multidimensional space. This coordinate is represented as a vector. In this space, words appearing in similar contexts are closer to each other. The dimension of this space is the number of neurons in the hidden layer. In an ideal word2vec space, the distance between the words "queen" and "king"

would be the same as the distance between the words "woman" and "man" as shown in gure 3.9. This because after training, each element in the vector corresponds to some abstract way of the meaning of the word. Words such as "king" and "queen" are similar in ways such as royalty, wealth and power. In the same way, "man" and "king"

are similar in ways such as masculinity and femininity and the opposites to "queen"

and "woman". This results in that "man" and "king" are as close to each other in signicance as "woman" and "queen".

Figure 3.9: Example of a word2vec space

Word2vec has a variety of extensions such as Item2vec [19], where items rather than words are clustered together and doc2vec [20], which is used to label documents by cor- relate words and labels. For a complete explanation of word2vec and the mathematics behind it, see [21].

For example, suppose that we have a corpus with the words "the", "cat" and "dog"

and we have trained word2vec in multidimensional space where n = 4 and received the word vectors represented in table 3.1. Then suppose we want to train SVM for

(27)

CHAPTER 3. THEORY a document with just the sentence "the dog". Then the VSM representation for that document will be the sum of the vectors for the word "the" and the word "dog" which in this example will be [0.3, 0.5, 0.7, 0.7].

Table 3.1: Example of three word vectors as output from word2vec the [0.2, 0.4, 0.1, 0.2]

cat [0.1, 0.0, 0.4, 0.7]

dog [0.1, 0.1, 0.6, 0.5]

3.5 Term frequency and inverse document frequency

The term frequency (TF) and inverse document frequency (IDF) are two statistical measurements which are often combined to calculate TF-IDF value [22]. The TF-IDF value is a weight of the importance of a word in a corpus. The importance increases proportional to the number of occurrences of the word in the document but is then penalized by the occurrence of the word in the corpus.

The TF corresponds to how frequently a term occurs in a document and the weight for term t in document d is calculated as

w_t,d=

(log ft,d, if ft,d> 0

0, otherwise (3.13)

where ft,dis the frequency of t in d.

The IDF score measures how important a term is by weighting down words which occur in more documents. For term t, the IDF score is calculated as

idft= log(N nt

) (3.14)

where N is the total number of documents and nt is the number of documents that t occurs in.

The TF-IDF of t in d is then

TF-IDFt,d= log ft,d· log(N nt

). (3.15)

Suppose that we have a two document corpus consisting of just the two documents

"the dog" and "the cat". By calculating the TF-IDF weights for this corpus, we might end up with something like the vectors in table 3.2.

Table 3.2: Example of the TF-IDF weights for a corpus with the two documents "the cat" and "the dog".

the cat [0.815, 0.0, 0.58]

the dog [0.0, 0.815, 0.58]

As seen in both vectors, one of the elements is 0, which corresponds to the word not appearing in that specic document. The element with 0.58 corresponds to the word

"the", and the element 0.815 corresponds to cat or dog.

(28)

(29)

Chapter 4

Design and implementation

In this chapter we describe how the dierent steps of the work in this thesis are designed, motivate our dierent methods for the steps and how the tests are implemented.

We will describe what data that will be used, how it will be gathered and how it will be processed before used for the dierent feature generating methods. We will also describe how we use the SVM algorithm and how to nd the best parameter values and describe how the dierent features are generated for SVM. Finally we will go through the dierent evaluation methods to validate our work. Each step of the process is shown in gure 4.1. All programming tools used in this thesis are also mentioned.

Data

gathering Preprocessing Feature

generation Feature

selection Classication Evaluation

TF-IDF

·Download

·Translate

·Tokenizing

·Stopwords

·Stemming word2vec K-Best SVM ·Perf.meas.

·Compare

PCFG

Figure 4.1: Process chart from data to evaluation

(30)

CHAPTER 4. DESIGN AND IMPLEMENTATION

4.1 Data gathering

We will use two dierent datasets in this thesis. One of the datasets that is used is the same as used in [2] and the dataset is gathered by the authors of the paper. The data is called AMT and consists of 800 trustworthy and 800 deceptive hotel reviews, all written in English. Half of the reviews of the dataset are positive (corresponding to 5-stars rating) and the other half negative (corresponding to 1- or 2-stars ratings).

The positive truthful reviews are gathered from TripAdvisor and the negative truthful reviews are gathered from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp. All deceptive reviews are gathered using Amazon's Mechanical Turk where users have gotten paid to write deceptive reviews. A clarication of how the number of reviews are divided in the dataset are shown in table 4.1.

Table 4.1: How the number of reviews are divided for the AMT dataset Deceptive Truthful

Positive 400 400

Negative 400 400

The other dataset is the so called CLiPS stylometry investigation corpus (CSI) [11].

It is a Dutch written corpus holding both essays and reviews. The review part of the dataset holds 1298 reviews, both deceptive and truthful and positive and negative which will be used in this thesis. The division of reviews are shown in table 4.2. All reviews are written by students taking Dutch prociency courses at the university of Antwerpen and the topics of the reviews are musicians, food chains, books, smart phones and movies.

Table 4.2: How the number of reviews are divided for the CSI dataset Deceptive Truthful

Positive 319 323

Negative 330 326

To use the AMT and CSI datasets together, we will translate the Dutch texts to English. We also want to explore how the performance of our classication changes if we are trying to classify reviews which have been translated to other languages and then back to the language we are classifying with. We will give the datasets an index corresponding to how the data has been translated. For example, AMTeng→rus→eng

means that we have translated the AMT dataset from English to Russian and then back to English again.

4.2 Preprocessing

To use the data for the dierent methods, the data needs to be preprocessed. This step is needed to ensure that the input data will not hold unnecessary information which can impair the outcome of the classication. For our data, the reviews are

(31)

CHAPTER 4. DESIGN AND IMPLEMENTATION already relatively clean, with one review per text document. To get easier and faster access to the reviews, they will be converted to json-objects and be stored in json les where each document holds all the reviews of a specic type (deceptive or truthful and positive or negative).

We also modify the review text using the following commonly used methods: stop word removal, stemming, and tokenization. Stop words are commonly occurring words.

These common words are ltered out because they are considered to have a small impact during the training and classication. There is no universal list of the stop words, the words considered as stop words can be chosen dierently. Within stop words, it is common to chose words from the word classes pronouns and prepositions.

Other common stop words in the English language are words such as "a, an, and, but, or" and common verbs. In this thesis we will train models both with and without stop words to calculate whether there is a change in accuracy for the classication. In [12], the training has been made both with and without stop words and there has been a notable dierence in accuracy, motivating us to do the same. The list of stop words used in this thesis are the ones in the list provided in the NLTK package [23].

A tokenizer takes a text and divides it into an vector with a set of strings. Our methods for building features will use single strings as inputs and will not be able to treat complete texts as inputs. To be able to treat every word in a text as a single unit, it has to be tokenized.

We will also use stemming during the preprocessing. Stemming is a method for group- ing together dierent forms of a word such as "catching", "catches" and "catch" to be treated as one single item. This has been done together with SVM in [24] with varying results but in some cases the result was improved using stemming.

4.3 Feature generation

In this thesis we will use SVM as classication algorithm and SVM needs features as inputs. To use SVM, we have to decide what kind of features we will use. SVM needs a VSM representation of each document called feature vector f such as [f1, f₂, . . . , f_n] where fi is the value of feature i and n the number of features. It can be tricky to know what features to be used. For this thesis, we have decided to use three dierent kinds of features generated from NLP methods, and compare the performance when using them with SVM. The dierent features will also be used together with the expectation to get a higher classication rate.

The rst feature type to use will be from the statistical method TF-IDF [22]. All of the essentials have been covered in 3.5. For TF-IDF, we have a corpus with a number of documents. The features generated from TF-IDF are the calculated TF-IDF weight for each word in each document. This results in that every document will have a feature vector with the length of the total number of words in the corpus.

The second feature type to be used in this thesis, will be the output vectors of word2vec [25]. The essentials have been covered in 3.4. The expectation is that some words and context of words are more used in deceptive texts rather than in truthful texts. For each review, all the vectors for each of the words appearing in the review will then be

(32)

CHAPTER 4. DESIGN AND IMPLEMENTATION summarized and used as the feature vector such as

f (d_i) =

Ti

X

t=1

word2vec(t), (4.1)

where Ti are the all the terms in document i and word2vec(t) the output vector from word2vec of term t. Each review will by this obtain a multidimensional vector which will be used for training the SVM algorithm. This method of adding the corresponding words for each document has been done in [12] with an accuracy above 80%, motivating us to do the same.

The nal feature type that will be used is the output of the grammar modeling tool PCFG [26]. The purpose is to determine if the review is false written not regarding the topic and what words that are used but rather grammars and the structure of the text to determine what is a typical writing pattern for a certain text purpose.

We will use the Berkeley parser to parse sentences [27]. The parser already has a grammar which is trained on texts from the "Wall Street Journal". For a given sentence, the parser builds a tree of the rules with the highest probabilities corresponding to the sentence and returns the tree with the highest probability. For each review, every sentence will be parsed into a tree and will receive a list of every rule in that tree. Each of the rule lists from a review will then be concatenated. The features will be generated by encode the rule lists as TF-IDF values where every rule works as a term. This method of using TF-IDF on the rule lists is used in [3] and gave satisfying results. We will use four dierent types of rules with examples from gure 3.6:

Unlexical rules All rules except those where the RHS is a terminal, e.g. NP → Noun

Lexical rules All rules including those where the RHS is a terminal, e.g. Det → "a"

Unlexical rules with grandparent node All rules except those where the RHS is a terminal with the grandparent node, e.g. S^NP → Noun

Lexical rules with grandparent node All rules including those where the RHS is a terminal with the grandparent node, e.g. NP ^Det → "a"

We can also use features from dierent methods by concatenate feature vectors such as

f = [f₁¹, f₂¹, . . . , f_n¹, f₁², f₂², . . . , f_m²] (4.2) where f¹ and f² are two feature vectors of length m and n respectively, which are merged to a new feature vector f.

4.4 Feature selection

Classication can perform poorly when some features are irrelevant. By using feature selection and removing irrelevant features, we are able to both increase the accuracy and decrease the training time [28]. This will be done using a feature selection method where we select the K number of features with the highest value using the χ²-test

(33)

CHAPTER 4. DESIGN AND IMPLEMENTATION [29]. The χ²-test is a measurement of how features are dependent of class and how each feature value diers from an expected value based on the null hypothesis that there should be no correlation between features and class. By calculating which of the K features that are most likely to be class dependent, we can remove the rest which are most likely to be class independent. There are earlier works of classication which have performed well using the χ²-test together with SVM [30].

4.5 Classication

The classication algorithm that will be used in this thesis is SVM. The essentials of SVM has been covered in 3.1. In this section we will motivate why we used SVM and how we used the algorithm. As mentioned in [31], it is sometimes necessary to apply scaling to the feature vectors with the purpose to avoid features of greater numerical range to dominate features with smaller numerical range. It is recommended to scale to a range between 0 and 1, which might be tested in this thesis. Features generated from word2vec are known to perform well with SVM [12] and SVM is in general a good algorithm for text classication [14, 16] making it the natural classication algorithm to be used for this thesis. For all of the papers using the same dataset that will be used in this thesis [1, 2, 3], SVM has outperformed the other algorithms, making SVM a natural choice for this task.

We will only use the linear kernel in this paper because it is documented that it performs well for text classication [32]. The SVM algorithm used in this thesis will solve the dual formulation of the problem. The soft margin parameter C has to be found by testing for dierent values, and from [31] we are told that testing with dierent exponentials of C as C = 2⁻⁵, 2⁻³, . . . , 2¹⁵will help us nd a good value for C. To nd optimized parameters, it is common to apply grid search. Grid search will be done with cross-validation (CV). CV is an statistical algorithm to predict estimation error.

In our case we will use cross validation for our SVM algorithm while we are testing the new parameter. This means that we divide our dataset into k smaller subsets. For training we are leaving one of the k subsets out, and then use that subset for validation i.e. using the trained model to classify the subset and calculating the accuracy. This is done k times and the estimation error is the mean of the k prediction errors. CV k times is also called k-fold. The reason we will use CV is because we want to make sure that all data is used both for training and validation and that we have k tests, which increases our condence in the reliability of the model performance. Cross validation also prevents us from receiving a high accuracy for the training data that does not

t the test data (also called overtting). All of the papers with results that will be compared to our results have used 5-fold cross validation, which motivates us to do the same.

4.6 Program and libraries

For this thesis the programming language of use is Python [33]. Python is a widely used programming language and works well for general purposes. One of the reasons that Python is used for this thesis is the high availability of dierent NL and NLP

(34)

program packages that can be imported and used in Python. The dierent packages that are imported and used in Python for this thesis are explained here.

NLTK [23] The NLTK package holds a various set of tools for NLP. The package provides us with a list of stop words that will be ltered. Functions for stemming are also included in the package.

Scikit-learn [34] scikit-learn is a machine learning package and holds a lot of simple and ecient tools for classication and clustering. For this thesis we will use the package's functions for SVM, feature selection and TF-IDF.

Gensim [35] Gensim is designed to process raw, unstructured digital texts. We will use Gensim's inbuilt function for word2vec to create a word2vec model for feature generation.

Textblob Textblob is a package for translating texts online which is using Google translate. It will be used to machine translate reviews.

JPype JPype is a program which allows to use Java class libraries in Python. The reason JPype is used is to have access to the Berkeley parser.

4.7 Test setups

We will do three dierent type of tests, one for each of the goals in this thesis. These tests and how they are set up are described in this section.

4.7.1 Improve the others

One of the goals with this thesis is to develop a model that improves the results from [1, 2, 3, 11]. We will divide the AMT dataset into two subsets which will be refereed to as AMT+ which only hold the positive reviews and AMT- which only hold the negative reviews. In order to use the exact same data setup as in [2], AMT will also be divided into two other sets, AMT_TO+ and AMT_TO- where the index indicates which part of the data that is only used to test on. In the paper, both the AMT+ and AMT-

datasets are divided into ve subsets where four of the subsets from each dataset are used for training. In the AMT_TO+ case, the last fth part of AMT_-is completely held out and the last fth part of AMT+is used for testing, for that part of the 5-fold cross validation. This means that we train on 1280 reviews and test on 160 reviews for the combined cases.

For each of our datasets CSI, AMT+ AMT-, AMTTO+ and AMTTO- we will build models with features of the following types:

TF-IDF both with and without stemming and with and without stopwords included

Word2vec both with and without stemming and with and without stopwords included

(35)

PCFG with stopwords and without stemming both with and without lexicalized nodes and grandparent nodes

A combination with the feature-vectors generated from the methods giving the best accuracies

For TF-IDF and PCFG, we will use feature selection and test dierent number of features. For word2vec, we will test for dierent dimensions of the word vectors. To use PCFG, a grammar le for the used language is required. We do not have a grammar

le for Dutch so we will use PCFG on CSI_dut→eng instead of CSI.

We will also combine AMT and CSIdut→eng to one big dataset and investigate the performance of our best model on this dataset.

4.7.2 Trick the classier

We will also use machine translation to investigate how the classication model performs on translated data. This test might tell us whether it is possible to hide the stylometric footprint and decrease the accuracy of a classier and still be able to convey a message. This will be done with the best classier from 4.7.1. We will do this on both datasets separately. To compare how the dierent translations aects the classication, we need to have some kind of base case to compare with. For AMT, the base case will be the AMT reviews in English, but since we can not use all features methods on CSI in Dutch due to the lack of a Dutch grammar le, the base case for the CSI dataset will be the reviews translated to English i.e. CSI_dut→eng. This means that we will have two dierent cases, one where we are doing the dierent translations on AMT and the other when we are doing the translations on CSI_dut→eng. The datasets will be translated in the following ways:

AMTeng→swe→eng

AMTeng→rus→eng

AMTeng→swe→rus→eng

CSIdut→eng→swe→eng

CSIdut→eng→rus→eng

CSIdut→eng→swe→rus→eng

The reason we chose Swedish is because of its structural similarities to English and Dutch, and the reason we chose Russian is its dierences to English and Dutch. This because Swedish, English and Dutch are Germanic languages, and Russian is a Slavic language [36]. The assumption is that the more times the data is translated, and the more linguistic dierent the translated language is from the original language, the more will its stylometric footprint be hidden and result in a decrease of the accuracy. This might occur at the expense of the quality of the texts regarding sentence structure and comprehensibility.

4.7.3 Train one - test another

To get an idea of how useful the classication models are, we will take the best classier from 4.7.1 and use two dierent datasets for training and testing. We will do this to

(36)

investigate how the classication model performs when trained on one dataset, and tested on another. This will be done in two dierent tests:

train on AMT and test on CSI_dut→eng

train on CSI_dut→engand test on AMT

4.8 Evaluation

The evaluation will be done by measuring the performance with accuracy, F-measure, precision and recall. These methods are described in this section.

With the performance measured, we are then able to compare our results with others' results.

4.8.1 Performance measure

We want to be able to measure how well our classication model will perform. For a binary classication where the outcome is either positive or negative, each outcome of the classication will take one of these options:

true positive (TP) A sample that is classied as positive and its true class is positive false positive (FP) A sample that is classied as positive but its true class is nega-

tive

true negative (TN) A sample that is classied as negative and its true class is negative

false negative (FN) A sample that is classied as negative but its true class is positive

By measuring the value of these, we are able to calculate four common performance measurements. Accuracy A is the most common one and tells us how well the classi-

cation algorithm classies true results and is calculated as A = T P + T N

T P + F P + T N + F N. (4.3)

For this thesis we will also calculate the precision, recall and F-measure for our classication algorithms. The precision tells us how many of the positive classications that are true as

p = T P

T P + F P (4.4)

and the recall tells us how many of the positive samples that classies as true as r = T P

T P + F N. (4.5)

The F-measure is a combined metric which takes a balanced average of p and r. The F-measure is calculated as

F1 = 2 · p · r

p + r. (4.6)

(37)

Chapter 5

Results

In this chapter we present the results for each of our tests. We present the best results for each dataset and model.

5.1 Improve the others

In this section we present the results when testing the dierent models with dierent feature types. The section consists of one table for each dataset which holds the number of TP, FP, TN and FN for each feature type and conguration. We also present the results obtained when using our best model while combining the two datasets.

Finally we present a table with the most deceptive and truthful TF-IDF terms for both datasets.

Table 5.1: Best result for each feature type and conguration on AMT₊

Feature type Conguration TP FP TN FN

TF-IDF

unstemmed w/o stopwords 358 51 361 30 unstemmed with stopwords 341 41 384 34

stemmed w/o stopwords 365 46 355 34

stemmed with stopwords 422 37 300 41 word2vec

stemmed with stopwords 513 59 172 56 PCFG

w/o grandparent node, w/o lexical rules 343 87 263 107 w/o grandparent node, with lexical rules 396 44 329 31 with grandparent node, w/o lexical rules 279 74 335 112 with grandparent node, with lexical rules 430 40 299 31 PCFG +TF-IDF w/o grandparent node, with lexical rulesunstemmed with stopwords 485 37 247 31

(38)

CHAPTER 5. RESULTS

Table 5.2: Best result for each feature type and conguration on AMT-

TF-IDF

unstemmed w/o stopwords 319 108 320 53 unstemmed with stopwords 304 64 336 96 stemmed w/o stopwords 283 114 364 39 stemmed with stopwords 429 98 204 69 PCFG

Table 5.3: Best result for each feature type and conguration on AMTTO+

TF-IDF

stemmed w/o stopwords 532 76 554 118 stemmed with stopwords 653 95 426 106 PCFG

Table 5.4: Best result for each feature type and conguration on AMTTO-

TF-IDF

stemmed w/o stopwords 542 207 371 160 stemmed with stopwords 477 90 623 90 PCFG

(39)

CHAPTER 5. RESULTS

Table 5.5: Best result for each feature type and conguration on CSI

TF-IDF

stemmed w/o stopwords 551 111 529 107 stemmed with stopwords 611 122 463 102 word2vec

unstemmed w/o stopwords 543 145 432 178 unstemmed with stopwords 670 218 252 158 stemmed w/o stopwords 523 163 454 158 stemmed with stopwords 554 184 379 181 PCFG

Table 5.6: Best result for the best classier with AMT and CSI_dut→eng in combination

PCFG +TF-IDF w/o grandparent node, with lexical rulesunstemmed with stopwords 1196 241 1282 179

Table 5.7: The terms from the AMT and CSI datasets with highest TF-IDF values

AMT CSI

Top deceptive Top truthful Top deceptive Top truthful TF-IDF terms TF-IDF terms TF-IDF terms TF-IDF terms

staying oor restaurant years

food breakfast romantic chain

husband michigan ministry movies

looking small café fan

anyone nights pizza liked

recently booked Antwerpen have

visit free comedy remains

ever excellent read ago

luxury bar friendly beautiful

went large cosy opinion

Detection of deceptive reviews: using classification and natural language processing features

Examensarbete 30 hp November 2016

Detection of deceptive reviews

using classification and natural language processing features

Johan Fernquist

Abstract

Detection of deceptive reviews using classification and natural language processing features

Populärvetenskaplig sammanfattning

Contents

Chapter 1

Introduction

1.1 Problem statement

1.2 Delimitations

Chapter 2

Background

2.1 The client

2.2 Deceptive texts on the Internet

2.3 Data mining, machine learning and classication

2.4 Vector space modelling

2.5 Natural language processing

2.6 Related work

Chapter 3

Theory

3.1 Support vector machines

3.2 Context-free grammars

Σ f

3.3 Neural networks

3.4 Word2vec

3.5 Term frequency and inverse document frequency

Chapter 4

Design and implementation

4.1 Data gathering

4.2 Preprocessing

4.3 Feature generation

4.4 Feature selection

4.5 Classication

4.6 Program and libraries

4.7 Test setups

4.8 Evaluation

Chapter 5

Results

5.1 Improve the others

2.3 Data mining, machine learning and classication

Σ ^f

4.5 Classication