DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2020

Identification of machine-generated reviews
1D CNN applied on the GPT-2 neural language model

STAFFAN AL-KADHIMI

PAUL LÖWENSTRÖM

KTH ROYAL INSTITUTE OF TECHNOLOGY


Identification of machine-generated reviews
1D CNN applied on the GPT-2 neural language model

STAFFAN AL-KADHIMI

PAUL LÖWENSTRÖM

Degree Project in Computer Science, DD142X
Date: June 8, 2020
Supervisor: Christopher Peters
Examiner: Pawel Herman

School of Electrical Engineering and Computer Science

Swedish title: Identifiering av maskingenererade recensioner: 1D CNN applicerat på den neurala språkmodellen GPT-2


Abstract

With recent advances in machine learning, computers are able to create more convincing text, raising concerns about an increase in fake information on the internet. At the same time, researchers are creating tools for detecting computer-generated text.

Researchers have been able to exploit flaws in neural language models and use them against themselves; for example, GLTR provides human users with a visual representation of texts that assists in classification as human-written or machine-generated.

By training a convolutional neural network (CNN) on GLTR output data from analysis of machine-generated and human-written movie reviews, we are able to take GLTR a step further and use it to automatically perform this classification. However, using a CNN with GLTR as the main source of data for classification does not appear to be enough to be on par with the best existing approaches.


Sammanfattning

With the latest advances in machine learning, computers can create more and more convincing text, which raises concern about increased false information on the internet. At the same time, this is counterbalanced by researchers creating tools to identify computer-generated text.

Researchers have been able to exploit weaknesses in neural language models and use these against them. For example, GLTR provides users with a visual representation of texts, as an aid for classifying them as human-written or machine-generated.

By training a convolutional neural network (CNN) on output data from GLTR analysis of machine-generated and human-written movie reviews, we take GLTR a step further and use it to perform the classification automatically. However, using a CNN with GLTR as the main data source does not appear to be sufficient to classify at a level comparable to the best existing methods.


Contents

1 Introduction
  1.1 Research Question
  1.2 Scope
  1.3 Approach
  1.4 Thesis Outline
2 Background
  2.1 Neural language models
    2.1.1 Transfer learning
    2.1.2 GPT-2
  2.2 Detecting machine-generated text
    2.2.1 GAN-based approaches
    2.2.2 Neural LMs as weapons against themselves
  2.3 Text classification using CNNs
3 Method
  3.1 Text preparation stage
  3.2 GLTR stage
  3.3 CNN training stage
  3.4 CNN testing stage
4 Results
  4.1 Overall performance
  4.2 Impact of text length
5 Discussion
  5.1 Limitations
  5.2 Future Work
6 Conclusions
Bibliography
A Structure Diagram
B Review Samples


Chapter 1

Introduction

“OpenAI has published the text-generating AI it said was too dangerous to share” —The Verge, November 2019. [1]

Never before has it been easier to publish information, and now with the recent improvements in machine learning, computers can create text that is hard to differentiate from human writing.

Simple methods for machine generation of text have existed for a relatively long time, but in recent years it has been possible to utilize deep learning to achieve better results (Gatt and Krahmer 2017). In 2019, the research organization OpenAI released GPT-2, one of the latest additions to the text-generating neural language models, which has been recognized for its capabilities in generating false information, including fake news [1] (Radford et al. 2019).

This has given rise to concerns about the future and the dangers of text-generating artificial intelligence (AI) [1] (Radford et al. 2019).

As technology continues to improve, we may need ways to distinguish real from fake automatically, considering the potentially massive amount of false information that could be published on different media. Detection of automatically generated text is something that is being actively researched. For example, GLTR is a tool that has been recognized for how it can help people detect machine-generated text by visualizing important data given by GPT-2. [2]

[1] Vincent, James (2019). OpenAI has published the text-generating AI it said was too dangerous to share. URL: https://www.theverge.com/2019/11/7/20953040/openai-text-generation-ai-gpt-2-full-model-release-1-5b-parameters (visited 2020-02-15)

[2] Quach, Katyanna (2019). Remember the OpenAI text spewer that was too dangerous to release? Fear not, boffins have built a BS detector for it. URL: https://www.theregister.co.uk/2019/03/11/openai_gltr_ai/ (visited 2020-05-13)

Today people choose movies, restaurants, airlines, hair salons, and more based on online reviews that they read. The integrity of reviews could be challenged if computers start to fill the internet with computer-generated reviews. This could in turn deceive people into buying a product or service or, more seriously, cripple companies and potential adversaries.

The number of fake reviews on the web today is likely large; Akoglu, Chandy, and Faloutsos (2013) estimate that around 20% of the reviews on Yelp are faked by paid human writers. Automatically generating convincing reviews is also already possible, as evidenced by Adelani et al. (2019). In a capitalist society where every company wants to outperform its competitors, it may only be a matter of time before computer-generated reviews become a common weapon in the company arsenal.

1.1 Research Question

The purpose of our work is to investigate a possible extension of GLTR that could be used as part of spam filters or similar utilities.

In this thesis, we will therefore study: “Is it possible to automatically classify reviews as human-written or machine-generated, using the text body mapped to GLTR values as input?”

Under the assumption that the answer to the aforementioned research question is “yes”, the following will be studied as subquestions:

• “How well would this adaptation work for detecting reviews written by models other than the one GLTR was trained on?”

• “How does the length of a given review influence the classification per-formance?”

• “Would this adaptation be on par with other automatic detection methods?”

1.2 Scope

There are many ways to generate text, so we chose to limit our study to the GPT-2 language model only, and to its smaller versions (124M and 355M) due to limited available resources. There are also many types of reviews; by limiting ourselves to a single topic we reduce the amount of variation in our text, and thus both reduce the complexity of our project and make our tool more useful in practice for entities such as websites. In our case, we focus on movie reviews from the media website and database IMDB, which makes our work potentially useful for identifying computer-generated reviews posted there.

1.3 Approach

In short, we utilize GPT-2 to estimate how likely it is that the model itself would produce text similar to what is being analyzed. A similar project, GLTR, has previously analyzed text with this approach, with promising results (Gehrmann, Strobelt, and Rush 2019), and we integrate it as the core of our framework. However, unlike GLTR, which is a tool for making it easier for humans to identify machine-generated text (Gehrmann, Strobelt, and Rush 2019), we use a convolutional neural network to automatically classify the text based on the data extracted from the analysis.

1.4 Thesis Outline

The following chapter explores the current state of fake text and review detection, in particular related work that has emerged following the development of GPT-2 and similar research. It also introduces the building blocks of the project: neural language models, GPT-2, GLTR, and convolutional neural networks.

This is followed by Chapter 3, where we describe our approach in detail and go through the structure of the neural network we create. In Chapter 4, we present the results and compare the models we create. Chapter 5 is then devoted to a discussion of the results as well as limitations and future work, and finally, Chapter 6 holds the conclusions, where we summarize our work and discuss it in a broader context.


Chapter 2

Background

2.1 Neural language models

A language model (LM) essentially describes the probability of each possible token (e.g., a word) appearing as the next one given an input text (Mikolov et al. 2010). Traditional models are based on n-grams and look at the n previous words to decide what word should follow (Arisoy et al. 2012).
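As a toy illustration of the counting behind such traditional models, the sketch below estimates next-word probabilities from word-pair (bigram) counts; the corpus and the word-level tokenization are made up for the example and are not from the thesis.

from collections import Counter, defaultdict

# Toy corpus (made up); real n-gram models are estimated from far larger corpora.
corpus = "the movie was great . the movie was boring .".split()

# Count how often each word follows each other word (bigram counts).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev: str, nxt: str) -> float:
    """Estimate P(next word | previous word) from the bigram counts."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(p_next("movie", "was"))    # 1.0: "was" always follows "movie" in this corpus
print(p_next("was", "great"))    # 0.5: "great" and "boring" are equally likely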

More recently, neural networks have risen into the spotlight for language modeling due to their handling of data sparseness: by embedding words in a continuous space, neural networks avoid the sparseness issues where small changes in probabilities create a big impact (Arisoy et al. 2012). In particular, recurrent neural networks (RNNs) with long short-term memory have been able to build on previous approaches by enhancing the network's memory capability (Józefowicz et al. 2016).

Vaswani et al. (2017) introduced transformer architectures, which have the potential to take the place of RNNs. While similar to RNNs in that they have a memory, transformers do not rely on the sequence being processed in order, enabling far more parallelisation during training and leading to new advances in neural language modeling (Vaswani et al. 2017).

2.1.1 Transfer learning

Machine learning has historically needed a large amount of data to produce good results, but in recent years a lot of progress has been made in reducing the amount of data required (Wang and Zheng 2015). This has been made possible thanks to transfer learning (Wang and Zheng 2015). Transfer learning is the practice of taking an already trained model from a similar area and then tuning it by continuing to train it on data relevant to the problem at hand (Wang and Zheng 2015).

2.1.2 GPT-2

GPT-2 is a large language model based on a transformer neural network architecture. It has been trained on a large and broad dataset, which helps it achieve state-of-the-art results in text generation (Radford et al. 2019). It is able to generate text that is up to 1,024 tokens in length, with tokens usually being words, parts of words, or individual characters (Radford et al. 2019; Sennrich, Haddow, and Birch 2016). Through transfer learning, it is possible to relatively quickly re-train the model to generate text about a specific topic (Radford et al. 2019). The full-size model is called 1.5B and contains 1.5 billion parameters, but there are also three smaller versions: 124M, 355M, and 774M, with 124 million, 355 million, and 774 million parameters respectively (Solaiman et al. 2019).

Generation parameters used by GPT-2 include Top-K, Top-P and temperature (Solaiman et al. 2019). Top-K constrains the language model to only select from the k most likely tokens to appear next, Top-P constrains it to select with a specific cumulative token probability cutoff, and the temperature controls randomness (Solaiman et al. 2019). A lower temperature results in less randomness and more repetitive text (Solaiman et al. 2019).
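The sketch below shows how these three parameters can interact when picking the next token from a vector of logits over the vocabulary. It is an illustrative reimplementation in PyTorch, not GPT-2's own sampling code, and details such as applying Top-K before Top-P are assumptions.

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Pick the next token id from a 1-D vector of logits over the vocabulary."""
    # Temperature: divide logits before softmax; lower values sharpen the
    # distribution, giving less randomness and more repetitive text.
    probs = torch.softmax(logits / temperature, dim=-1)

    # Top-K: keep only the k most probable tokens.
    if top_k > 0:
        top_probs, top_ids = probs.topk(top_k)
        probs = torch.zeros_like(probs).scatter_(-1, top_ids, top_probs)

    # Top-P (nucleus): keep the smallest set of tokens whose cumulative
    # probability exceeds p.
    if top_p < 1.0:
        sorted_probs, sorted_ids = probs.sort(descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Drop a token if the cumulative probability *before* it already
        # exceeds top_p (so the token that crosses the threshold is kept).
        drop = cumulative - sorted_probs > top_p
        sorted_probs[drop] = 0.0
        probs = torch.zeros_like(probs).scatter_(-1, sorted_ids, sorted_probs)

    probs = probs / probs.sum()          # renormalize what is left
    return int(torch.multinomial(probs, num_samples=1))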

In the context of review generation, Adelani et al. (2019) show that it is fully possible to use GPT-2 124M for this purpose by generating thousands of fake Amazon and Yelp reviews with the help of transfer learning. They also demonstrate that it can be difficult to identify the reviews as fake: in their evaluation, the evaluator was presented with three machine-generated reviews as well as one human-written review and had to pick out the human-written one.

2.2 Detecting machine-generated text

2.2.1 GAN-based approaches

Generative adversarial networks (GANs) are a machine learning approach where, simply speaking, a generator neural network and a discriminator neural network improve by continuously competing with each other (Creswell et al. 2018). This is done by having the discriminator learn how to distinguish real from fake, and the generator learn to generate results that deceive the discriminator (Creswell et al. 2018). GANs have seen a notable amount of use in multimedia-related applications, but are not exclusive to the multimedia field (Creswell et al. 2018).

One example of fake text detection using GANs is Grover, a model trained on news articles that is able to identify “neural fake news”, as the authors label it, with 73% accuracy (Zellers et al. 2019). Adelani et al. (2019) attempted to use Grover in a zero-shot (non-fine-tuned) approach to detect the fake Amazon and Yelp reviews that they had generated; this resulted in an equal error rate of 41% for distinguishing between real and fake.

2.2.2 Neural LMs as weapons against themselves

Researchers have found ways to use neural language models to detect text written by neural language models. For example, Solaiman et al. (2019) have developed what they call a “GPT-2 output detector”, which uses the RoBERTa neural language model (from Liu et al. (2019)) trained on GPT-2 outputs to return a probability for a given text to be generated by GPT-2. They also found that classification of shorter texts is particularly difficult.

Solaiman et al. (2019) evaluated multiple variations of training on GPT-2 outputs. When trained and tested on texts generated by GPT-2 124M with temperature 1, they were able to obtain an accuracy of 99.2%. With 355M texts, the accuracy was slightly reduced, to 98.7%. The detector transferred well to texts generated by a GPT-2 model of a different size, with a 1.9 percentage point reduction in accuracy with the 124M-based detector on 355M texts, and a 0.3 percentage point increase the other way around.

GLTR

GLTR is an analysis tool built upon GPT-2 intended to be used by humans to identify machine-generated text (Gehrmann, Strobelt, and Rush 2019). For each token in a given text, GLTR retrieves the probability as well as absolute rank (i.e. the rank in the probability list for that token) that GPT-2 gives for the token. The tool visually presents this to the user by colour coding the tokens based on their absolute rank, while simultaneously displaying graphs of the distribution of the values and allowing the user to hover over the tokens to see additional data (Gehrmann, Strobelt, and Rush 2019).

The tool exploits an assumed flaw in large language models, that the tokens generated by those are biased towards what is likely to appear, unlike human-written text which tends to have a more varied language use (Gehrmann, Strobelt, and Rush 2019).
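A minimal sketch of the per-token values GLTR works with is shown below. Computing them with the Hugging Face transformers GPT-2 implementation is an assumption for illustration only; GLTR ships with its own GPT-2 backend.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def rank_and_prob(text: str):
    """For each token after the first, return (absolute rank k, probability p)
    that GPT-2 assigns to it given the preceding tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids       # (1, T)
    with torch.no_grad():
        logits = model(ids).logits                              # (1, T, vocab)
    probs = logits[0, :-1].softmax(dim=-1)                      # predicts token i+1 from the prefix
    targets = ids[0, 1:]                                        # the tokens actually written
    p = probs[torch.arange(targets.numel()), targets]
    k = (probs > p.unsqueeze(1)).sum(dim=-1) + 1                # rank 1 = most probable token
    return list(zip(k.tolist(), p.tolist()))

print(rank_and_prob("The acting in this movie was great.")[:3])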


Figure 2.1: A screenshot of the GLTR web interface (Gehrmann, Strobelt, and Rush 2019)

In their study, Gehrmann, Strobelt, and Rush (2019) showed that untrained humans can use the tool to detect machine-written text with an accuracy of 72.3%. Adelani et al. (2019) created an automatic detector based on the values of the bars in GLTR's top-k count chart (see Figure 2.1), but were less successful, with an overall equal error rate of 38.5%, though this was reduced to 22.5% when GLTR results were fused with results from Grover and the GPT-2 output detector using logistic regression.

2.3 Text classification using CNNs

Originally, CNNs were developed for image classification (Masko and Hensman 2015). They have risen in popularity in multiple fields, one of which is natural language processing, because they can find patterns between words based on their proximity to one another (Masko and Hensman 2015). This allows them to see patterns some other types of neural networks cannot (Masko and Hensman 2015).

In natural language processing, one-dimensional CNNs are suitable, as this type of CNN focuses less on the location of the features, which can be good for text as the found features might not be highly connected to the position in the text (Kiranyaz et al. 2019). 1D CNNs have lower computational complexity since they do operations on simple arrays instead of matrices (Kiranyaz et al. 2019). Studies have also shown that in some cases 1D CNNs perform better than their counterparts in tests with limited labeled data (Kiranyaz et al. 2019).


Figure 2.2: An example of what the structure of a 1D CNN can look like (Hou, Adhikari, and Cheng 2018)

In a neural network, the data passes through multiple layers (Altenberger and Lenz 2018). A CNN generally consists of multiple convolutional layers, pooling layers, and at the end, fully connected layers (Masko and Hensman 2015; Altenberger and Lenz 2018) (see Figure 2.2).

The convolutional layers are made out of trainable kernels, each of which learns to recognize a feature (Khan et al. 2019; Masko and Hensman 2015). Each kernel goes through all the positions in the input and outputs a feature map (Kiranyaz et al. 2019). The feature map contains information on how well the kernel matches each position in the input data (Khan et al. 2019). An activation function is typically applied afterwards to obtain non-linearity in the output (Altenberger and Lenz 2018).

Pooling layers are normally found between convolutional layers, and are used to reduce the spatial size of the representation, which lowers needed computation and can in many cases improve the results by helping control overfitting (Palaz, Collobert, and Magimai-Doss 2013).

The fully connected layers are used for classification and produce the scores for each individual class (Masko and Hensman 2015).

Perhaps contrary to the name, it is possible to have multi-dimensional data in 1D CNNs; the “1D” in this case refers to the kernel moving in one dimension (Kiranyaz et al. 2019). When multi-dimensional data is used, normalization of the input data is a common method to ensure that each channel (dimension) is treated equally by the neural network and thus prevent certain channels from having a larger impact than others (Bishop 1995).
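For the two GLTR channels used later in this thesis (absolute rank k and probability p), such normalization could look like the sketch below. The thesis does not state whether or how it normalizes its input, so the z-scoring here is an assumption.

import torch

def normalize_channels(x: torch.Tensor) -> torch.Tensor:
    """Z-score each channel over the whole dataset so that a channel with large
    values (e.g. absolute rank) does not dominate a channel with small values
    (e.g. probability). x has shape (reviews, channels, tokens)."""
    mean = x.mean(dim=(0, 2), keepdim=True)
    std = x.std(dim=(0, 2), keepdim=True)
    return (x - mean) / (std + 1e-8)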

One common problem with convolutional neural networks is overfitting (Dietterich 1995). This is when the network sees patterns that it incorrectly identifies as features specific to only the dataset it is training on (Dietterich 1995), meaning that the model will perform badly on a dataset other than the one it trained on (Dietterich 1995).

There are multiple ways of reducing the risk of overfitting; these include dropout layers in the neural network, which remove random neurons during execution, preventing neurons from learning to depend on each other (Srivastava et al. 2014), as well as batch normalization, which reduces the variance in the distribution of layer inputs, and in doing so also accelerates training (Ioffe and Szegedy 2015).


Chapter 3

Method

Our approach is largely built upon GLTR. However, instead of having a human look at the results and classify the text, we trained a 1D CNN to do it based on GLTR analysis data of a large number of human-written and machine-written reviews. The decision to use a CNN was made because of its potential in detecting patterns.

The approach can be divided into four stages. A simplified diagram showing parts of the structure of the program and how some values were gathered is included in Appendix A.

3.1 Text preparation stage

We started with a dataset that is a collection of 50,000 movie reviews from IMDB, originally intended for training binary sentiment classification models (Maas et al. 2011). It was deemed a reasonable choice with respect to its size given the available computing power, as well as the length of its entries, which mostly fits the length limitations of GPT-2.

We extracted 46,000 of the reviews in the IMDB dataset and divided them into five parts (a slicing sketch follows the list):

• Two separate sets of 15,000 texts for training GPT-2, which we will refer to as I_A and I_B respectively.

• 10,000 texts for training GPT-2 models for GLTR to use, which will be referred to as I_G.

• 5,000 texts for use as a set of human-written reviews in our CNN training data, referred to as R.


• 1,000 texts for use as a set of human-written reviews in our CNN test data, referred to as T.
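The split can be expressed as simple slicing; the sketch below assumes the 46,000 extracted reviews are already shuffled and held in a list of strings, which is an assumption about the preprocessing.

def split_imdb(reviews: list[str]) -> dict[str, list[str]]:
    """Divide 46,000 IMDB reviews into the five parts listed above."""
    assert len(reviews) >= 46_000
    return {
        "I_A": reviews[:15_000],          # fine-tuning GPT-2 for generation, set 1
        "I_B": reviews[15_000:30_000],    # fine-tuning GPT-2 for generation, set 2
        "I_G": reviews[30_000:40_000],    # fine-tuning the GPT-2 models used by GLTR
        "R":   reviews[40_000:45_000],    # human-written reviews for CNN training
        "T":   reviews[45_000:46_000],    # human-written reviews for CNN testing
    }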

To facilitate keeping track of the various datasets that we use, we will introduce a naming convention for them. G(n, t) will be used to refer to a training dataset of generated texts which were created by a GPT-2 model with n million parameters and temperature t, and H(n, t) will have the same purpose except that it is used to refer to a testing dataset instead.

Using the gpt-2-simple Python library, we then used the texts from I_A to fine-tune the 124M and 355M GPT-2 models and generate 5,000 reviews each in G(124, 0.875) and G(355, 0.875) respectively. We then did the same with I_B as 124M and 355M fine-tuning input to retrieve 1,000 reviews each in H(124, 0.7), H(124, 0.875), and H(124, 1.0), and in H(355, 0.875), respectively. We configured gpt-2-simple to not use any other token selection constraints.
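A sketch of this fine-tune-and-generate step with gpt-2-simple is shown below; the dataset file name, run name, step count, and batch size are assumptions, since the thesis does not list its exact settings.

import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="124M")          # also done for "355M"

sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              dataset="I_A.txt",               # assumed file holding the I_A reviews
              model_name="124M",
              run_name="imdb_124M",
              steps=3000)                      # assumed number of iterations

# Temperature 0.875 as in G(124, 0.875); no top_k/top_p constraints are set,
# matching "no other token selection constraints".
fake_reviews = gpt2.generate(sess,
                             run_name="imdb_124M",
                             temperature=0.875,
                             nsamples=5000,
                             batch_size=20,
                             return_as_list=True)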

Randomly picked human-written and generated review samples can be found in Appendix B.

3.2 GLTR stage

We trained two separate models to use as part of GLTR, one based on the base model with 124M parameters and another based on the base model with 355M parameters. Both were fine-tuned using the texts from I_G. This is done to ensure that we do not create bias by using the same texts to train our GLTR models as the ones used for generating fake reviews, and to accurately reflect reality, where we would not have access to the dataset used to train the fake text generators. Compared to the models we use for text generation, the models used for GLTR were trained for 1/3 of the number of iterations, to avoid an overfitting-like situation where our CNN is too adapted to the dataset that it was trained on.

To ensure that the model does not have a bias based on the length of the text, we enforce that there are roughly the same number of fake and real reviews of each length. We do this by removing entries in the datasets so that there is an equal number of entries in each interval of 32 GPT-2 tokens (a balancing sketch follows). This is done both for the training and the testing data.
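The balancing step can be sketched as below; reviews are assumed to be given as lists of GPT-2 token ids, and discarding the surplus at random is an assumption.

import random
from collections import defaultdict

def balance_by_length(real: list[list[int]], fake: list[list[int]], bucket: int = 32):
    """Keep an equal number of real and fake reviews in every interval of
    `bucket` GPT-2 tokens, discarding the surplus at random."""
    buckets = defaultdict(lambda: {"real": [], "fake": []})
    for r in real:
        buckets[len(r) // bucket]["real"].append(r)
    for f in fake:
        buckets[len(f) // bucket]["fake"].append(f)

    kept_real, kept_fake = [], []
    for b in buckets.values():
        n = min(len(b["real"]), len(b["fake"]))
        kept_real += random.sample(b["real"], n)
        kept_fake += random.sample(b["fake"], n)
    return kept_real, kept_fake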

Using GLTR, we calculated and extracted the probability p and absolute rank k of each token of each entry in G ∪ R. This information was then mapped to an x × 2 × 1024 tensor appropriate for use as neural network training input data, where x is the number of reviews left after the removal previously described. The three dimensions represent the reviews, data channels ([k, p]), and token indices respectively. Values past the end of reviews were padded with zeros. We also built a vector of length x, corresponding to the tensor and describing the ground truth, where each value is either 0 (human-written review) or 1 (computer-generated review), appropriate for use as neural network training target data.
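Packing the GLTR output into these tensors can be sketched as follows; the exact in-memory format of the GLTR output is an assumption.

import torch

def to_training_tensors(encoded_reviews, labels, max_len: int = 1024):
    """Build the x × 2 × 1024 input tensor and the length-x target vector.

    encoded_reviews: list of per-review [(k, p), ...] pairs from GLTR
    labels: list of 0 (human-written) / 1 (machine-generated)
    """
    x = len(encoded_reviews)
    inputs = torch.zeros(x, 2, max_len)          # zero padding past the end of each review
    for i, review in enumerate(encoded_reviews):
        for j, (k, p) in enumerate(review[:max_len]):
            inputs[i, 0, j] = k                  # channel 0: absolute rank
            inputs[i, 1, j] = p                  # channel 1: probability
    targets = torch.tensor(labels, dtype=torch.float32)
    return inputs, targets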

We will define a notation for GLTR as: gltr_n(D) = the output of GLTR with a model with n million parameters run on dataset D, with the output in the form [k, p].

3.3 CNN training stage

We used the PyTorch machine learning library for Python to feed the training data into a one-dimensional CNN (a PyTorch sketch of the architecture follows the list). This neural network consists of:

1. an input layer, with two channels
2. a one-dimensional convolutional operation, with a kernel size of 7 and depth 16
3. a batch normalization function, with depth 16
4. a rectifying activation function (ReLU)
5. a dropout layer, with probability 0.2
6. a one-dimensional convolutional operation, with a kernel size of 5 and depth 32
7. a batch normalization function, with depth 32
8. a rectifying activation function (ReLU)
9. a dropout layer, with probability 0.5
10. a one-dimensional max pooling layer, with kernel size 2
11. a one-dimensional convolutional operation, with a kernel size of 5 and depth 64
12. a rectifying activation function (ReLU)
13. a one-dimensional max pooling layer, with kernel size 2
14. a flattening layer, that flattens starting from the second dimension
15. a fully connected layer, reducing the number of channels to 1
16. an output layer, with a single channel
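A sketch of this architecture in PyTorch is given below; the class name, the absence of padding, and the way the flattened size is inferred are assumptions, not taken from the thesis code.

import torch
import torch.nn as nn

class GLTRReviewCNN(nn.Module):
    """1D CNN over per-token GLTR features (channel 0: absolute rank k, channel 1: probability p)."""

    def __init__(self, seq_len: int = 1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels=2, out_channels=16, kernel_size=7),  # (2)
            nn.BatchNorm1d(16),                                        # (3)
            nn.ReLU(),                                                 # (4)
            nn.Dropout(p=0.2),                                         # (5)
            nn.Conv1d(16, 32, kernel_size=5),                          # (6)
            nn.BatchNorm1d(32),                                        # (7)
            nn.ReLU(),                                                 # (8)
            nn.Dropout(p=0.5),                                         # (9)
            nn.MaxPool1d(kernel_size=2),                               # (10)
            nn.Conv1d(32, 64, kernel_size=5),                          # (11)
            nn.ReLU(),                                                 # (12)
            nn.MaxPool1d(kernel_size=2),                               # (13)
        )
        # Trace a dummy input once to find the flattened feature size.
        with torch.no_grad():
            flat = self.features(torch.zeros(1, 2, seq_len)).flatten(1).shape[1]
        self.classifier = nn.Sequential(
            nn.Flatten(start_dim=1),                                   # (14)
            nn.Linear(flat, 1),                                        # (15)-(16)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2 channels [k, p], 1024 tokens) -> (batch, 1) logit
        return self.classifier(self.features(x))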

The training ran for 32 epochs using a binary cross entropy loss function with logits and an Adam optimizer with learning rate 0.001 (a training-loop sketch follows the next paragraph). We trained six CNN models:

• M_a, trained on gltr_124(G(124, 0.875) ∪ R)
• M_b, trained on gltr_124(G(355, 0.875) ∪ R)
• M_c, trained on gltr_124(G(124, 0.875) ∪ G(355, 0.875) ∪ R)
• M_d, trained on gltr_355(G(124, 0.875) ∪ R)
• M_e, trained on gltr_355(G(355, 0.875) ∪ R)
• M_f, trained on gltr_355(G(124, 0.875) ∪ G(355, 0.875) ∪ R)

When training, 1/6 of the dataset is used for validation.
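A corresponding training-loop sketch, using the loss, optimizer, epoch count, and validation split described above, is shown below; the batch size and the absence of early stopping are assumptions.

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

def train_cnn(inputs: torch.Tensor, targets: torch.Tensor) -> GLTRReviewCNN:
    """Train one CNN on GLTR-encoded reviews.

    inputs:  (x, 2, 1024) tensor of [k, p] channels
    targets: (x,) float tensor of labels, 0 = human-written, 1 = machine-generated
    """
    dataset = TensorDataset(inputs, targets)
    n_val = len(dataset) // 6                             # 1/6 held out for validation
    train_set, _val_set = random_split(dataset, [len(dataset) - n_val, n_val])
    loader = DataLoader(train_set, batch_size=32, shuffle=True)   # assumed batch size

    model = GLTRReviewCNN()
    criterion = torch.nn.BCEWithLogitsLoss()              # binary cross entropy with logits
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for _ in range(32):                                   # 32 epochs
        model.train()
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = criterion(model(xb).squeeze(1), yb)
            loss.backward()
            optimizer.step()
    return model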

3.4 CNN testing stage

We used the CNN models generated in the training stage on data that had been GLTR-encoded in the GLTR stage.

M_a, M_b, and M_c were tested on all gltr_124(H) as well as gltr_124(T). M_d, M_e, and M_f were tested on all gltr_355(H) as well as gltr_355(T).


Chapter 4

Results

Tables 4.1 and 4.2 show the accuracy of our CNN models using 124M- and 355M-based GLTR respectively. Tables 4.3 and 4.4 are structured in a similar way, and show the equal error rates with the different GLTR models. Finally, Tables 4.5 and 4.6 as well as Figure 4.1 show the correlation between the length of a review and the accuracy of our CNN models.

4.1 Overall performance

The true negative (human-written review) results in Tables 4.1 and 4.2 are based on four different runs due to the length bias prevention performed in the GLTR stage, and the numbers presented are averages of those runs. It should be noted that the results of these different runs are almost equal; the small differences are due to the length bias prevention resulting in different entries being removed in each run. We generated the results displayed in Tables 4.1 and 4.2 with a classification threshold value that attempts to maximize the total number of true positives and true negatives. This threshold is calculated from the average validation output value (a sketch follows).
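The threshold selection can be sketched as below; whether the average is taken over raw logits or sigmoid outputs is not stated, so averaging sigmoid outputs is an assumption.

import torch

def choose_threshold(model, val_inputs: torch.Tensor) -> float:
    """Classification threshold: the average model output over the validation set."""
    model.eval()
    with torch.no_grad():
        scores = torch.sigmoid(model(val_inputs).squeeze(1))
    return scores.mean().item()

# A review is then labelled machine-generated when its score exceeds the threshold:
# is_fake = torch.sigmoid(model(test_inputs).squeeze(1)) > choose_threshold(model, val_inputs)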

Training models on a combination of 124M and 355M texts improved the worst-performing testing results and worsened the best-performing results, resulting in a more stable overall performance. We had two outliers when using GLTR based on the 355M model: 124M texts with t = 0.7 and 124M texts with t = 0.875, where the combined training results were equal to or better than the best non-combined results (see Table 4.4). However, some of these differences are within the margin of error.

Evaluating the equal error rates, one can see that when the training dataset and testing dataset are generated with the same temperature and with GPT-2 models of the same size, and the GLTR model is also trained with this size, the CNN is able to yield favourable results (see Tables 4.3 and 4.4).


Testing dataset \ Training dataset | 124M, t = 0.875 | 355M, t = 0.875 | Combined
124M, t = 0.7                      | 86.3 ± 2.3      | 96.8 ± 1.2      | 95.8 ± 1.3
124M, t = 0.875                    | 86.2 ± 2.4      | 95.2 ± 1.5      | 94.4 ± 1.6
124M, t = 1.0                      | 79.9 ± 2.8      | 92.9 ± 1.8      | 91.5 ± 2.0
355M, t = 0.875                    | 60.5 ± 3.3      | 80.8 ± 2.7      | 76.5 ± 2.9
Human-written                      | 86.7 ± 2.3      | 70.5 ± 3.1      | 70.6 ± 3.1

Table 4.1: Accuracy in percent, with GLTR based on the 124M model and tested/trained on different datasets with temperature t. All confidence intervals are 95% and are based on a normal approximation interval.

Testing dataset \ Training dataset | 124M, t = 0.875 | 355M, t = 0.875 | Combined
124M, t = 0.7                      | 66.0 ± 3.2      | 44.4 ± 3.3      | 71.3 ± 3.0
124M, t = 0.875                    | 67.8 ± 3.2      | 34.3 ± 3.3      | 68.3 ± 3.2
124M, t = 1.0                      | 66.7 ± 3.3      | 24.3 ± 3.0      | 56.2 ± 3.5
355M, t = 0.875                    | 62.1 ± 3.3      | 94.0 ± 1.6      | 91.7 ± 1.9
Human-written                      | 71.9 ± 3.1      | 84.3 ± 2.5      | 71.9 ± 3.1

Table 4.2: Accuracy in percent, with GLTR based on the 355M model and tested/trained on different datasets with temperature t. All confidence intervals are 95% and are based on a normal approximation interval.


The accuracy of the model trained on 355M generated reviews using 355M GLTR values drops as the temperature of the 124M test reviews increases, from 44.4% to 24.3% (see Table 4.2). However, the accuracy was 94.0% when the model was tested on 355M generated reviews, and 84.3% on human-written reviews. The results had less variation when trained on 124M texts, with the accuracy ranging from 62.1% to 71.9%.


Testing dataset \ Training dataset | 124M, t = 0.875 | 355M, t = 0.875 | Combined
124M, t = 0.7                      | 13.4 ± 1.6      | 11.9 ± 1.5      | 14.4 ± 1.7
124M, t = 0.875                    | 13.4 ± 1.7      | 14.3 ± 1.7      | 15.1 ± 1.7
124M, t = 1.0                      | 16.1 ± 1.8      | 16.5 ± 1.9      | 17.6 ± 1.9
355M, t = 0.875                    | 26.2 ± 2.1      | 23.8 ± 2.1      | 25.5 ± 2.1

Table 4.3: Equal error rate in percent, with GLTR based on the 124M model and tested/trained on different datasets with temperature t. All confidence intervals are 95% and are based on a normal approximation interval.

Testing dataset \ Training dataset | 124M, t = 0.875 | 355M, t = 0.875 | Combined
124M, t = 0.7                      | 31.3 ± 2.2      | 33.2 ± 2.2      | 28.6 ± 2.1
124M, t = 0.875                    | 30.2 ± 2.2      | 37.0 ± 2.3      | 30.0 ± 2.2
124M, t = 1.0                      | 29.9 ± 2.3      | 42.9 ± 2.5      | 35.3 ± 2.4
355M, t = 0.875                    | 32.7 ± 2.3      | 9.2 ± 1.4       | 15.9 ± 1.8

Table 4.4: Equal error rate in percent, with GLTR based on the 355M model and tested/trained on different datasets with temperature t. All confidence intervals are 95% and are based on a normal approximation interval.

4.2 Impact of text length

As seen in Tables 4.5 and 4.6, as well as Figure 4.1, our approach performs increasingly better as the length of the text increases. With very short texts, it yields a very poor accuracy; however, it quickly starts achieving better results. Since the accuracy increases quickly, it also reaches its peak early and then stagnates. On our better test runs it reaches up to around 90% (see Table 4.5), while this value is around 75% on one of our worse runs (see Table 4.6).


Tokens      | 0–63  | 64–127 | 128–255 | 256–383 | 384–511 | 512–767 | 768–1024
Sensitivity | 1/22  | 56/91  | 300/339 | 176/186 | 84/89   | 68/71   | 21/21
Specificity | 22/22 | 82/91  | 296/339 | 155/186 | 72/89   | 66/71   | 18/21

Table 4.5: Correlation between text length (in terms of GPT-2 tokens) and accuracy of M_a tested on 124M, t = 0.875 generated and human-written reviews.

Tokens      | 0–63  | 64–127 | 128–255 | 256–383 | 384–511 | 512–767 | 768–1024
Sensitivity | 12/30 | 60/109 | 253/357 | 145/189 | 71/89   | 56/68   | 19/22
Specificity | 18/30 | 72/109 | 258/357 | 139/189 | 67/89   | 48/68   | 16/22

Table 4.6: Correlation between text length (in terms of GPT-2 tokens) and accuracy of M_f tested on 124M, t = 0.7 generated and human-written reviews.

[Figure omitted: bar chart of average accuracy (0–100%) over token-count buckets 0–63, 64–127, 128–255, 256–511, and 512–1023.]

Figure 4.1: Correlation between text length (in terms of GPT-2 tokens) and average accuracy of all models M. All confidence intervals are 95% and are based on a Wilson score interval.


Chapter 5

Discussion

One cannot expect to get a perfect accuracy, because some machine-generated reviews will be indistinguishable from a human-written review. Possible reasons for this include the text simply being well-written (which logically is possible even if the generator is bad), consisting largely of generic elements (e.g. “I loved the acting”), or being too short. When a review is short, it gives our models less to work with and makes it more difficult to guess whether the text is real or fake. This can be seen in our results, where the models perform much better on longer texts, while they appear to guess randomly on the shorter ones or assign the same label to all of them. This lowers our overall accuracy.

In order to reduce bias based on the length of the texts, we ensured that there was an equal number of real and fake reviews of roughly the same length in the training data. This does not seem to have had an effect on the models' performance on very short reviews, where the network ended up with an accuracy of roughly 50% regardless.

In terms of equal error rates, the best results are achieved when the training dataset and testing dataset are generated with the same temperature and with GPT-2 models of the same size, and the GLTR model is also trained with this size. This is expected; the k and p values that are given by GLTR for training would be most similar to the testing data when all parameters are the same, and this provides the CNN with the best circumstances to obtain good results. The results from 124M-based GLTR were, to our surprise, better than the ones from 355M-based GLTR. The CNN did not perform as well on text generated by 355M GPT-2 when trained on 124M text, which was expected, as the larger model should create more human-like text. However, the CNN appears to be able to perform well on the 355M texts if it is trained on them, but sadly that resulted in it taking a hit on its total accuracy, because it then had a harder time identifying the human-written reviews.

Interestingly, the CNN model trained using 355M GLTR data was not able to extrapolate on the 124M model texts as well as the reverse (124M GLTR data on the 355M model texts), regardless of which dataset was used for training. From the results we can see that, in general, our CNN performs worse the higher the temperature the 124M generator used, which indicates that the generators with greater variation are able to deceive it. This could mean that the models trained using 355M GLTR data classified anything that looked varied as human-written. In turn, this could be caused by a form of overfitting; seeing as the equal error rate produced by the 355M GLTR-based model was 8 percentage points higher than the 124M GLTR-based one, it is possible that it focused on patterns that are exclusive to its corresponding GLTR model.

In comparison to the GPT-2 output detector by Solaiman et al. (2019), which uses the RoBERTa neural language model, we can clearly see that our results fall short. Solaiman et al. (2019) achieved accuracy rates around 97–99% when tested on text generated by the GPT-2 models we studied with a fixed temperature, while the results of our approach are not as good. Nonetheless, the result is still a clear indication that the CNN is doing more than arbitrarily guessing when it comes to identifying fake reviews.

Looking at the equal error rates with 124M-based GLTR on reviews generated by a fine-tuned 124M GPT-2 model, our models performed better than the models Adelani et al. (2019) examined. However, they trained their model on reviews from Amazon and Yelp, while our dataset contains longer texts, which helps improve our results; in reality, the overall performance of the models is likely closer than it looks. The GLTR-based method that Adelani et al. (2019) evaluated only examined the overall structure of the text and did not use a fine-tuned GLTR model. Meanwhile, our method used a CNN to scan through fine-tuned GLTR values of all tokens, potentially missing the whole picture but finding the small indications in the sentences. In the most comparable results, our method obtained an equal error rate of 13.4%, while theirs was 38.5%.

An automated system for detecting fake reviews can be designed in a large variety of ways. Perhaps a review website manager wants to remove all reviews that are detected as fake, and in that case it might be ideal to only do so when it is very certain that a fake review has been found. Or perhaps one only wants to allow reviews that one is very certain are human-written and flag other ones for manual review. By changing the classification threshold value, it is possible to configure the classifier for a number of different use cases.


5.1 Limitations

Using the IMDB dataset carries a potential risk, as we do not know how many computer-generated reviews it contains. However, we are fairly certain that if there are any, the amount is negligible and should not affect our models.

One thing to keep in mind is that this approach would likely not be effective with reviews that are generated using non-language-model-based approaches such as Grover. This is because GLTR exploits a flaw in large language models specifically. In addition to this, as we have not tested our models on non-GPT-2-based generators, we cannot conclude that our approach translates to other language-model-based generators either.

The discussion above assumes that the CNN can handle the GLTR output to an optimal level, but in reality it could be that the GLTR values contain more than enough information to get a great result, and that it is the CNN model which is suboptimal for this problem. The fact that GPT-2 uses transformers and is a different model type could hinder the CNN's capabilities to pick up the patterns found in GPT-2.

5.2 Future Work

Pursuing GLTR alone may not be ideal, as its ceiling seems to be lower than that of, for instance, the GPT-2 output detector by Solaiman et al. (2019). Perhaps combining GLTR with a state-of-the-art detection model would yield better results.

It could be interesting to look at absolute rank and probability from other neural language models as well, and include this when training the CNN. GLTR data as input to the neural network worked with GPT-2-generated texts; expanding this to a collective of absolute rank and probability from multiple top models could potentially create a more versatile classifier.


Chapter 6

Conclusions

Based on our results, we can relatively safely say that we can indeed utilize GLTR to automatically classify reviews as human-written or machine-generated. Whether we can do so reliably is, however, unclear.

Our results are at their best when GLTR used the same model size as the one that generated the text. This leads to questions about its capabilities with regard to identifying fake text from models that differ more.

We believe that a tool like this should not be used by itself, but rather that it might be useful as one of several parts of a larger fake/spam review detection framework. The limitations that come with using a CNN and restricting the model to only GLTR values appear to be too great; the results yielded are not comparable to the best classifiers in the field, and more information or another approach is needed to make more accurate classifications.

Today, short, generic reviews are already vulnerable to being faked by means of automatic generation. In the future, we may not be able to distinguish between real and fake content at all, and may instead need to focus on authenticating the user as a human, or even take more radical measures such as shifting to relying on curation of sources of information. In other words, perhaps writing an online review will not be easy for everyone later on.


Bibliography

Adelani, David Ifeoluwa et al. (2019). "Generating Sentiment-Preserving Fake Online Reviews Using Neural Language Models and Their Human- and Machine-based Detection". In: CoRR abs/1907.09177. arXiv: 1907.09177.

Akoglu, Leman, Rishi Chandy, and Christos Faloutsos (2013). Opinion Fraud Detection in Online Reviews by Network Effects.

Altenberger, Felix and Claus Lenz (2018). "A Non-Technical Survey on Deep Convolutional Neural Network Architectures". In: CoRR abs/1803.02129. arXiv: 1803.02129.

Arisoy, Ebru et al. (2012). "Deep Neural Network Language Models". In: Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-Gram Model? On the Future of Language Modeling for HLT. WLM '12. Montreal, Canada: Association for Computational Linguistics, pp. 20–28.

Bishop, Christopher M. (1995). Neural Networks for Pattern Recognition. USA: Oxford University Press, Inc. ISBN: 0198538642.

Creswell, A. et al. (2018). "Generative Adversarial Networks: An Overview". In: IEEE Signal Processing Magazine 35.1, pp. 53–65.

Dietterich, Tom (Sept. 1995). "Overfitting and Undercomputing in Machine Learning". In: ACM Comput. Surv. 27.3, pp. 326–327. ISSN: 0360-0300. DOI: 10.1145/212094.212114.

Gatt, Albert and Emiel Krahmer (2017). "Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation". In: CoRR abs/1703.09902. arXiv: 1703.09902.

Gehrmann, Sebastian, Hendrik Strobelt, and Alexander M. Rush (2019). "GLTR: Statistical Detection and Visualization of Generated Text". In: CoRR abs/1906.04043. arXiv: 1906.04043.

Hou, Jie, Badri Adhikari, and Jianlin Cheng (2018). "DeepSF: Deep Convolutional Neural Network for Mapping Protein Sequences to Folds". In: Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. BCB '18. Washington, DC, USA: Association for Computing Machinery, p. 565. ISBN: 9781450357944. DOI: 10.1145/3233547.3233716.

Ioffe, Sergey and Christian Szegedy (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: CoRR abs/1502.03167. arXiv: 1502.03167.

Józefowicz, Rafal et al. (2016). "Exploring the Limits of Language Modeling". In: CoRR abs/1602.02410. arXiv: 1602.02410.

Khan, Asifullah et al. (2019). "A Survey of the Recent Architectures of Deep Convolutional Neural Networks". In: CoRR abs/1901.06032. arXiv: 1901.06032.

Kiranyaz, Serkan et al. (2019). 1D Convolutional Neural Networks and Applications: A Survey. arXiv: 1905.03554 [eess.SP].

Liu, Yinhan et al. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach". In: CoRR abs/1907.11692. arXiv: 1907.11692.

Maas, Andrew L. et al. (June 2011). "Learning Word Vectors for Sentiment Analysis". In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, pp. 142–150.

Masko, David and Paulina Hensman (2015). The Impact of Imbalanced Training Data for Convolutional Neural Networks.

Mikolov, Tomas et al. (2010). "Recurrent neural network based language model". In: INTERSPEECH.

Palaz, Dimitri, Ronan Collobert, and Mathew Magimai-Doss (2013). "Estimating Phoneme Class Conditional Probabilities from Raw Speech Signal using Convolutional Neural Networks". In: CoRR abs/1304.1018. arXiv: 1304.1018.

Radford, Alec et al. (2019). Language Models are Unsupervised Multitask Learners.

Sennrich, Rico, Barry Haddow, and Alexandra Birch (Aug. 2016). "Neural Machine Translation of Rare Words with Subword Units". In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, pp. 1715–1725. DOI: 10.18653/v1/P16-1162.

Solaiman, Irene et al. (2019). Release Strategies and the Social Impacts of Language Models. arXiv: 1908.09203 [cs.CL].

Srivastava, Nitish et al. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". In: Journal of Machine Learning Research 15.56, pp. 1929–1958.

Vaswani, Ashish et al. (2017). "Attention Is All You Need". In: CoRR abs/1706.03762. arXiv: 1706.03762.

Wang, D. and T. F. Zheng (2015). "Transfer learning for speech and language processing". In: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1225–1237.

Zellers, Rowan et al. (2019). "Defending Against Neural Fake News". In: CoRR abs/1905.12616. arXiv: 1905.12616.


Appendix A

Structure Diagram

[Diagram omitted. It shows how the IMDB texts flow through the four stages: the text preparation stage (two sets of 15,000 texts for fine-tuning GPT-2 for generation, 10,000 texts for fine-tuning the GPT-2 models used by GLTR, the human-written training and testing sets for the CNN, and the generated training and testing reviews from the 124M and 355M models), the GLTR stage (GLTR-encoding of the real and generated reviews with the 124M and 355M GLTR models), the CNN training stage (CNN models trained on 124M texts and on 355M texts), and the CNN testing stage (results of each CNN model on 124M- and 355M-generated test texts).]


Appendix B

Review Samples

Each section contains one review from our datasets. If the review was generated, t refers to the GPT-2 temperature that it was generated with.

Generated by fine-tuned GPT-2 124M, t = 0.7

The main problem with The College Girl Murders is that it feels too long, trying to be a psychological study, while coming over as a somewhat contrived character piece. The longer it tries to do something with the story, the better.

It's too long, trying to be longer makes it feel mushy, mush that very much resembles the same thing that grabs you from watching Alrite TV.

The attempt at making a better ending was hit and miss, but that's not going to be the last thing. College Girl Murders hits the nail squarely on the head with unforgivable force.

Avoid this mess.


Generated by fine-tuned GPT-2 124M, t = 0.875

What a great gem of a movie, with really great acting by Mira Sorvino, and wonderful film-making. The film is actually worse than "Rosemary's Baby" as some movie reviewers have here. I was expecting to be really disappointed by this movie. I actually wish I hadn't watched it--it was bad in almost every way. I loved the idea of the story: Is there any other grisely person out there, who devotes his life to the same, more worthless, self-indulgent, bottom-line life he or she says he or she wants.

But it gets worse, and better. The actors all did a very good job, but some did a poor job, too. The house was very un-interesting and almost pitch-perfect, but the story--which had a strong hint of going for it--was very dull and boring.

That other movie, "La Femme Nikita" (1967) which is also great, is also a better and more realistic (and

immediate!) example of bad movie making.

My rating: 1/2 out of ****. 80 mins. R for violence.

Generated by fine-tuned GPT-2 124M, t = 1.0

The script is so good that when I watched it I thought to myself, 'What the hell is this?' But when I saw it I said to myself, "It must be suspense! It must be a creepy tale!" And it is, really really suspenseful.

The actors are very good especially Daniel Day Blue. I am not sure if you know him by name, but he is very

compelling as the Grinch.

A good tale buried in the mystery literature of the schools of our era.


Generated by fine-tuned GPT-2 355M, t = 0.875

for years we've been watching Japanese anime (mainly The One Piece) on the big screen, but now with the release of the latest installment of Naruto, it's time for the US

release to catch up.

Well, there wasn't really an US release because, as in the 80's, there weren't many films made in Japan. But I guess it doesn't really matter.

This film is the third, and sadly final installment of the Naruto trilogy, and it's wonderfully superb. This is a film that has it all, visual art, comedy, drama, comedy, and everything in between. It has badass characters, cool gadgets, a love-hate relationship, and even a bit of comedy. Plus, it has a teenage boy that is constantly shadowed by a mysterious entity, and a girl that is rather timid. I kinda liked her.

If you've never seen Naruto, this is the film to see. It has a fantastic story, and every scene has a great visual with excellent dialogue and excellent acting. The

characters are all adorable and endearing, and the main villains are just simply awful. Again, I only gave a seven because of the terrible ending, and it does seem like it's a little overkill, but it's better than the last film.

And speaking of the final film, this is also rather enjoyable.


From IMDB

I rented this movie because the DVD cover made it look like it was going to be a ridiculous college comedy like van wilder or animal house. I took it to my friend house to watch for movie night. We ended up stopping it 15 minutes into the film, and watched Copper Mountain instead. I don't know if any of you have seen Copper Mountain, but it isn't great either. However, I would have to say that the Alan Thick Jim Carrey Duo made it a more enjoyable watch.

I later finished Puddle Cruiser. This movie was slow and the humor was forced. This movie reminded me of some stinkers that I saw in some of my earlier production classes in college. I was left wondering "was this the film that enabled Broken Lizard to make Supertroopers?" Also how could this movie suck so bad? Supertroopers was good and Club Dread was decent. Don't see this movie!

