
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING,

SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2019

Generative Adversarial Networks in Text Generation

ZESEN WANG

KTH ROYAL INSTITUTE OF TECHNOLOGY


Master Thesis Report

Generative Adversarial Networks in Text Generation

Zesen Wang

zesen@kth.se

School of Electrical Engineering and Computer Science

Master’s Programme in Machine Learning (TMAIM)

KTH Royal Institute of Technology

Supervisor at EECS: Håkan Lane
Supervisor at Seavus: Reijo Silander, Emilio Marinone
Examiner: Pawel Herman


Abstract

The Generative Adversarial Network (GAN) was first proposed in 2014, and it has been studied and developed intensively in recent years. It has obtained great success on problems that cannot be explicitly defined by a mathematical equation, such as generating realistic images. However, since the GAN was initially designed to solve problems in a continuous domain (image generation, for example), the performance of GAN in text generation is still developing, because sentences are naturally discrete (no interpolation exists between "hello" and "bye").

The thesis first introduces fundamental concepts in natural language processing, generative models, and reinforcement learning. For each part, some state-of-the-art methods and commonly used metrics are introduced. The thesis also proposes two models, for random sentence generation and for summary generation based on context, respectively. Both models involve the GAN technique and are trained on large-scale datasets. Due to the limitation of resources, the models are designed and trained as prototypes. Therefore, they cannot achieve state-of-the-art performance. However, the results still show the promising performance of applying GAN to text generation. The thesis also proposes a novel model-based metric to evaluate the quality of a summary with reference to both the source text and the summary.


Sammanfattning

(Swedish abstract, translated:) The Generative Adversarial Network (GAN) was first introduced in 2014 and has been studied and developed intensively in recent years. GAN has achieved great success on problems that cannot be explicitly defined by a mathematical equation, such as generating realistic images. However, since GAN was originally designed to solve problems in a continuous domain (image generation, for example), the performance of GAN in text generation is still developing, because sentences are naturally discrete (no interpolation exists between "hello" and "bye").

The thesis introduces fundamental concepts in natural language processing, generative models, and reinforcement learning. For each part, some state-of-the-art methods and commonly used metrics are introduced. The thesis also proposes two models, for random sentence generation and for context-based summary generation, respectively. Both models involve the GAN technique and are trained on large-scale datasets.

Due to the limitation of resources, the model is designed and trained as a prototype. Therefore, it cannot achieve state-of-the-art performance. The results nevertheless show the promising performance of applying GAN to text generation. The thesis also proposes a novel model-based metric to evaluate the quality of a summary with reference to both the source text and the summary.


Contents

1 Introduction
  1.1 Background & Motivation
  1.2 Research Question
2 Background Research
  2.1 Overview
  2.2 Natural Language Processing
    2.2.1 Sequence Analysis
    2.2.2 Word Embedding
    2.2.3 Attention
    2.2.4 Pointer-Generator Network
    2.2.5 Evaluation
  2.3 Generative Models
    2.3.1 Variational Auto-Encoder
    2.3.2 Variants of Generative Adversarial Networks
    2.3.3 Adversarial Text Generation
  2.4 Reinforcement Learning
    2.4.1 Policy Gradient
3 Methods
  3.1 Dataset
    3.1.1 AMI Meeting Corpus
    3.1.2 LMDB Movie Review Dataset
    3.1.3 CNN News Dataset
  3.2 Pre-processing
    3.2.1 Tokenization
    3.2.2 Word Filtering
    3.2.3 Data Filtering
    3.2.4 Word Embedding
  3.3 Model
    3.3.1 Sentence Generation
    3.3.2 Summary Generation
4 Results
  4.1 Test Data
  4.2 Model Size
  4.3 Training Time & Inference Time
  4.4 Results & Performance Comparison
    4.4.1 Sentence Generation
    4.4.2 Summary Generation
5 Discussion
  5.1 Performance Analysis
    5.1.1 Sentence Generation
    5.1.2 Summary Generation
    5.1.3 Limitation
  5.2 Sources of error
  5.3 Ethics & Sustainability Analysis
6 Conclusions
  6.1 Contribution
  6.2 Future Work
References
Appendix
A Implementation Detail
  A.1 Sentence Generation
  A.2 Summary Generation


1 Introduction

1.1 Background & Motivation

Recently, Generative Adversarial Networks (GANs for short) have shown excellent performance on image generation [18] and style transfer of images [19]. However, GANs do not work well for text generation, because GANs were initially defined for real-valued continuous data, while text is naturally discrete. The discrete property of text means that one cannot find an interpolation between, for example, "See you later" and "Good morning". To solve this problem, several methods have been proposed, and two trends can be identified. One trend is to use a continuous approximation of the discrete output to make the loss function differentiable, so that GANs can be trained with gradient descent; some methods use a softmax to approximate the one-hot encoding (a vector with a single high value while the others are low) [21][24]. The other trend is to introduce algorithms from Reinforcement Learning to train GANs in discrete settings. SeqGAN designs the generator as a policy on top of a pre-trained Maximum Likelihood Estimation (MLE) model and uses the discriminator to provide rewards, so the generator can be trained using the policy gradient [22]. Another application of a Reinforcement Learning algorithm is [25], which uses actor-critic methods.

However, some papers do not directly show their results, and some methods' results are not satisfying. For example, some results in [20] still contain grammatical errors, and [22] only reports metric scores without any generated samples. Also, the state-of-the-art evaluation of summaries is based on standard tests, including BLEU [12], ROUGE [13] and other tests that use exact word matching as their basic principle. This may be problematic when evaluating rephrased summaries. Moreover, since the application of GAN to text generation is new, much more research and investigation can be done in this area.

The reason that GAN is suitable for this problem is that the evaluation of the quality of generated text is complex and hard to define. No loss function can define how realistic a sentence is or how correct a summary is for a paragraph. GAN has great potential for solving problems with such implicit definitions.

The thesis is of interest because the current performance and evaluation methods in text generation are still developing and not yet satisfactory. The applications of the thesis are broad. Examples include the summarization of large amounts of information, such as meeting transcripts. It can also support knowledge publishing by generating summaries for content, which allows readers to learn about a publication faster. The summary-generation technique can also be used in chatbots as a pre-processing step for long conversations, which helps the AI understand the context better.

1.2 Research Question

The primary purpose of the thesis is to test whether combining text-generation models with GAN techniques can improve the performance of text generation compared with the original text-generation models.

The research area of the thesis is in Generative Adversarial Network (GAN), Natural Language Processing (NLP) and Reinforcement Learning (RL).


2 Background Research

2.1 Overview

This chapter aims to provide a reference for all work and techniques related to this project. The literature review was done in the areas of GAN, NLP and RL.

2.2 Natural Language Processing

Natural Language Processing is a sub-field of computer science, information engineering and artificial intelligence concerned with the interaction between computers and human (natural) languages. In this section, the related techniques are introduced to provide the technical background of the thesis in this area.

2.2.1 Sequence Analysis

Long Short-Term Memory

Recently, recurrent neural networks are widely used to analyze and extract information from sequential data. Speech recognition and semantic analysis are two typical examples of tasks to which recurrent neural networks are applied.

Figure 1: Left: the structure of an RNN. Right: the unfolded structure of an RNN ($h$, $h_{t-1}$, $h_t$ and $h_{t+1}$ are identical units) [1]

The recurrent neural network is an artificial neural network composed of repeated units connected in sequence. Moreover, the Long Short-Term Memory (LSTM) [1] has proved to be a good choice for the unit.

The basic LSTM cell transition equations are the following:

$$
\begin{aligned}
i_t &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}),\\
f_t &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}),\\
o_t &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}),\\
u_t &= \tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}),\\
c_t &= i_t \odot u_t + f_t \odot c_{t-1},\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{1}
$$

where $x_t$ is the input at the current time step, $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes element-wise multiplication. Intuitively, the forget gate $f_t \in [0, 1]$ controls the extent to which the previous memory cell is forgotten, the input gate $i_t \in [0, 1]$ controls how much each unit is updated, and the output gate controls the exposure of the internal memory state. $c_t$ stands for the cell state at step $t$, and $h_t$ denotes the hidden state at step $t$, which is fed to the unit in the next step.
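To make the transition concrete, the following is a minimal NumPy sketch of a single step of the basic LSTM cell in Eq. (1). The per-gate parameter dictionaries and the function name are illustrative assumptions for the example, not the thesis implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One step of the basic LSTM cell (Eq. 1).

    W, U, b are dicts holding the parameters of the
    input (i), forget (f), output (o) and update (u) gates.
    """
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
    u_t = np.tanh(W["u"] @ x_t + U["u"] @ h_prev + b["u"])   # candidate update
    c_t = i_t * u_t + f_t * c_prev                           # new cell state
    h_t = o_t * np.tanh(c_t)                                 # new hidden state
    return h_t, c_t
```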


Peephole LSTM

The difference between the basic LSTM cell and the peephole LSTM is that the updated current state is used for output via the read gate, as opposed to the prior state read by the basic LSTM cell.

$$
\begin{aligned}
i_t &= \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}),\\
f_t &= \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}),\\
o_t &= \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}),\\
u_t &= \tanh(W^{(u)} x_t + U^{(u)} c_{t-1} + b^{(u)}),\\
c_t &= f_t \odot c_{t-1} + i_t \odot u_t,\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{2}
$$

Tree LSTM

Figure 2: Chain Structured LSTM vs Tree Structured LSTM [3]

Compared with the standard LSTM, the Tree-LSTM [3] has the characteristics that a unit can have several children and that the selection of children is dynamic.

The paper [3] proposes two kinds of structures for the Tree-LSTM for different cases.

1. Child-Sum Tree-LSTMs:

$$
\begin{aligned}
\tilde h_j &= \sum_{k \in C(j)} h_k,\\
i_j &= \sigma(W^{(i)} x_j + U^{(i)} \tilde h_j + b^{(i)}),\\
f_{jk} &= \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}),\\
o_j &= \sigma(W^{(o)} x_j + U^{(o)} \tilde h_j + b^{(o)}),\\
u_j &= \tanh(W^{(u)} x_j + U^{(u)} \tilde h_j + b^{(u)}),\\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k,\\
h_j &= o_j \odot \tanh(c_j)
\end{aligned}
\tag{3}
$$

where $C(j)$ denotes the set of all children of node $j$, and the other notation is similar to the standard LSTM.


Also, the structure can be used with dependency parsing. Some libraries, such as spaCy, can pre-process the text to find the dependencies, and the Tree-LSTM can then be constructed according to the dependency structure of the words (a small spaCy example is given after this list).

2. N-ary Tree-LSTMs:

The N-ary Tree-LSTM can be used on tree structures where the branching factor is at most N.

$$
\begin{aligned}
i_j &= \sigma\Big(W^{(i)} x_j + \sum_{l=1}^{N} U^{(i)}_l h_{jl} + b^{(i)}\Big),\\
f_{jk} &= \sigma\Big(W^{(f)} x_j + \sum_{l=1}^{N} U^{(f)}_{kl} h_{jl} + b^{(f)}\Big),\\
o_j &= \sigma\Big(W^{(o)} x_j + \sum_{l=1}^{N} U^{(o)}_l h_{jl} + b^{(o)}\Big),\\
u_j &= \tanh\Big(W^{(u)} x_j + \sum_{l=1}^{N} U^{(u)}_l h_{jl} + b^{(u)}\Big),\\
c_j &= i_j \odot u_j + \sum_{l=1}^{N} f_{jl} \odot c_{jl},\\
h_j &= o_j \odot \tanh(c_j)
\end{aligned}
\tag{4}
$$

where $h_{jl}$ and $c_{jl}$ stand for the hidden state and the cell state of the $l$-th child of node $j$, respectively.

The main advantage is that in the forget gate, each child is assigned a separate set of parameters $U^{(f)}_{kl}$. This allows the network to learn more fine-grained conditioning on the states of a unit's children than the Child-Sum Tree-LSTM.

Generally speaking, the Tree-LSTM is a structure for analyzing text in a dependency-aware way. It might also be used as a way to generate sentences: an idea could be that the generator generates sentences clause by clause, so that each time it produces a simple clause, and a compound sentence is built up in a tree structure step by step.
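As a small illustration of the dependency pre-processing with spaCy mentioned above, the snippet below prints the head/child relations from which a Tree-LSTM could be constructed. The model name `en_core_web_sm` is just one commonly available English pipeline and not a choice made in the thesis.

```python
import spacy

nlp = spacy.load("en_core_web_sm")          # small English pipeline
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Each token knows its syntactic head and its children,
# which together define the dependency tree.
for token in doc:
    children = [child.text for child in token.children]
    print(f"{token.text:<6} dep={token.dep_:<6} head={token.head.text:<6} children={children}")
```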

Gated Recurrent Unit

The Gated Recurrent Unit [4] is designed to solve the vanishing gradient problem of a standard recurrent neural network.

The transitions are expressed in the following equations:

$$
\begin{aligned}
z_t &= \sigma\big(W^{(z)} x_t + U^{(z)} h_{t-1} + b^{(z)}\big),\\
r_t &= \sigma\big(W^{(r)} x_t + U^{(r)} h_{t-1} + b^{(r)}\big),\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tanh\big(W^{(h)} x_t + U^{(h)} (r_t \odot h_{t-1}) + b^{(h)}\big)
\end{aligned}
\tag{5}
$$

where $z_t$ denotes the update gate and $r_t$ denotes the reset gate.

The update gate and the reset gate are the core components that make the GRU work. The update gate helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future. The reset gate is used by the model to decide how much of the past information to forget.
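For comparison with the LSTM sketch above, a single GRU step following Eq. (5) can be written as below; again, the function name and parameter layout are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One step of the GRU (Eq. 5): z is the update gate, r the reset gate."""
    z_t = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])              # update gate
    r_t = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])              # reset gate
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r_t * h_prev) + b["h"])  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                         # new hidden state
```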


2.2.2 Word Embedding

Word embedding is the collective name for a set of language modeling and feature learning techniques in NLP where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimensionality. It is essential for the project because the input of the GAN is the transcript, consisting of sentences of varying lengths. Therefore, word embedding techniques are needed to transform words into fixed-length, lower-dimensional representations.

Word2Vec

The core model for Word2Vec [8] is the skip-gram model, which is shown in Figure 3.

Figure 3: Skip-gram model ($w_t$ stands for the vector of the word at the $t$-th position) [8]

The idea of the skip-gram model is to learn word vector representations that are good at predicting nearby words. So the objective of the skip-gram model is to maximize the average log probability of predicting nearby words:

$$
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t)
\tag{6}
$$

The probability $p(w_{t+j} \mid w_t)$ is defined by the softmax function:

$$
p(w_O \mid w_I) = \frac{\exp\big(v_{w_O}'^{\top} v_{w_I}\big)}{\sum_{w=1}^{W} \exp\big(v_w'^{\top} v_{w_I}\big)}
\tag{7}
$$

where $v_w$ and $v_w'$ are the "input" and "output" vector representations of $w$, and $W$ is the number of words in the vocabulary. However, Eq. 7 has a large computational cost (proportional to $W$) when calculating the gradient $\nabla \log p(w_O \mid w_I)$. The paper [8] therefore considers two approximations.


1. Hierarchical Softmax: The hierarchical softmax uses a binary tree representation of the output layer with the W words as its leaves. So the probability of a specific word can be calculated by the path from the root to the leaf.

$$
p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\Big(\sum_{x \in ch(n(w,j))} \mathbb{1}[n(w, j+1) = x] \cdot v_x'^{\top} v_{w_I}\Big)
\tag{8}
$$

where $L(w)$ is the length of the path from the root to the leaf, $\sigma$ is the sigmoid function, $\mathbb{1}[X] = 1$ if $X$ is true and $\mathbb{1}[X] = -1$ if $X$ is false, $n(w, j)$ is the $j$-th node on the path from the root to the leaf ($n(w, 1) = \text{root}$ and $n(w, L(w)) = w$), and $ch(x)$ is the set of children of node $x$.

Since it is a binary tree, the summed probability of two siblings is equal to the probability of their parent, because $\sigma(x) + \sigma(-x) = 1$. Therefore, the design makes sure that the sum of the probabilities of all leaves is 1.

The main advantage of the hierarchical softmax is that it reduces the computational cost: it only costs $O(\log_2 W)$ to calculate the probability of one word. In the paper, a Huffman tree is constructed according to the word frequencies so that the efficiency can be improved even further.

2. Negative Sampling: Another alternative is Noise Contrastive Estimation (NCE) [9]. In the Word2Vec paper [8], the method is simplified to Negative Sampling (NEG). The main idea is to approximately maximize the log probability of the softmax; it is an approximation because it only includes $k$ random samples from other words in the objective. The objective becomes:

$$
\log \sigma\big(v_{w_O}'^{\top} v_{w_I}\big) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \big[\log \sigma\big(-v_{w_i}'^{\top} v_{w_I}\big)\big]
\tag{9}
$$

The experiment part of the paper [8] reports the best performance for $k = 15$. The vectors for each word can then be trained efficiently. The trained vectors have hundreds of dimensions (typically 300), and the vectors of similar words are similar (they have a higher cosine similarity). The trained vectors can be used as the input of more advanced tasks such as sentiment analysis [36].

GloVe

The main idea of GloVe [10] is to utilize the statistics of word occurrences in a corpus. GloVe is short for Global Vector because the global corpus statistics are captured directly by the model.

Probability and Ratio        k = solid     k = gas       k = water     k = fashion
P(k | ice)                   1.9 × 10⁻⁴    6.6 × 10⁻⁵    3.0 × 10⁻³    1.7 × 10⁻⁵
P(k | steam)                 2.2 × 10⁻⁵    7.8 × 10⁻⁴    2.2 × 10⁻³    1.8 × 10⁻⁵
P(k | ice) / P(k | steam)    8.9           8.5 × 10⁻²    1.36          0.96

Table 1: Co-occurrence probabilities for target words "ice" and "steam" with selected context words from a 6-billion-token corpus.

Firstly, it introduces the word-word co-occurrence matrix $X$. $X_{ij}$ denotes the number of times that word $j$ occurs in the context (in the same sentences) of word $i$. Moreover, $X_i = \sum_j X_{ij}$ represents the total number of words appearing in the context of word $i$. The probability that word $j$ appears in the context of word $i$ is defined as $P_{ij} = P(j \mid i) = X_{ij} / X_i$.

In this scenario, if word $i$ has a closer relation with word $k$ than word $j$ has, the ratio $P_{ik} / P_{jk}$ should be large. For example, in Table 1, P(solid | ice) / P(solid | steam) is much larger than 1, while P(gas | ice) / P(gas | steam) is much smaller than 1. When word $k$ is unrelated, or equally related, to word $i$ and word $j$, the ratio should be close to 1.

The model is designed to solve a least-squares problem:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \big(w_i^{\top} \tilde w_j + b_i + \tilde b_j - \log X_{ij}\big)^2
\tag{10}
$$

and the final vector for each term is

$$
W_i = w_i + \tilde w_i
\tag{11}
$$

where $f(x)$ is a weighting function used to prevent rare co-occurrences and frequent co-occurrences from being over-weighted; it is defined as

$$
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha}, & \text{if } x < x_{\max}\\
1, & \text{otherwise}
\end{cases}
\tag{12}
$$

$\tilde w$ and $w$ are two sets of vectors for each word. The model is designed in this way because the paper [33] shows that training multiple models and combining the results helps prevent overfitting.

GloVe shows excellent performance on tests like word similarity and word analogy [10]. By extracting information from corpus statistics, it also provides a viable way to generate word vectors.
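As a small illustration of Eqs. (10)-(12), the sketch below evaluates the weighting function and the weighted least-squares objective for a dense co-occurrence matrix. The values x_max = 100 and α = 0.75 are the defaults reported in the GloVe paper; the variable names and the dense-matrix simplification are assumptions made for this example.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x) from Eq. (12)."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """Weighted least-squares objective J from Eq. (10).

    X          : (V, V) co-occurrence counts
    W, W_tilde : (V, d) word and context vectors
    b, b_tilde : (V,)   biases
    """
    mask = X > 0                                   # log X_ij is only defined for X_ij > 0
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    err = np.where(mask, pred - np.log(np.where(mask, X, 1.0)), 0.0)
    return np.sum(glove_weight(X) * err ** 2)
```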


2.2.3 Attention

Attention is one of the most recent advancements in NLP, and it has helped improve the performance of neural machine translation applications. Attention tries to capture the high-level meaning of the text by calculating dependencies (represented by weights, as described below) between words. Therefore, it is a technique that can be used to extract information from sentences and, further, to generate summaries.

Attention is All You Need

Vaswani et al. [11] introduced the Transformer in a paper published by Google in 2017. The paper describes the Transformer model used for text translation, which relies on the self-attention technique.

Figure 4: Illustration of Attention [11]

Figure 4 shows how the attention structure is designed. $Q$ and $K$ stand for the query vectors (of dimension $d_k$) and the key vectors (of dimension $d_k$), which are used to calculate the weights for the values $V$. In the Scaled Dot-Product Attention, $Q$ is combined with $K$ by a dot product, the result is divided by $\sqrt{d_k}$ (the scaling), and a softmax is applied to the result, which gives the weights for the values $V$. Finally, the weights are multiplied with $V$, which yields the final vectors. Expressed as a formula, the process is Eq. 13:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^{\top}}{\sqrt{d_k}}\Big) V
\tag{13}
$$

The right part of Figure 4 shows the Multi-Head Attention. $V$, $K$ and $Q$ are fed into the Scaled Dot-Product Attention after a linear transform, the results are concatenated, and a linear transform is applied to the concatenated result, which is the final result of the Multi-Head Attention. In Eq. 13, $d_k$ is the dimension of the key vectors, which is used to normalize the result and reduce the variance during training [11]. As a formula, it is expressed as in Eq. 14:

$$
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O\\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
\end{aligned}
\tag{14}
$$
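The scaled dot-product attention of Eq. (13) can be sketched in a few lines of NumPy; the single-head, unbatched shapes used here are a simplification for illustration, not how a full Transformer is implemented.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (13): softmax(QK^T / sqrt(d_k)) V for single-head, unbatched inputs.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
    return weights @ V                                       # (n_q, d_v)
```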

The complete structure of Transformer is shown in Figure 5.


Figure 5: Illustration of the structure of the Transformer. Left: encoder consisting of N = 6 repeated and stacked layers. Right: decoder consisting of N = 6 repeated and stacked layers. [11]

2. Feed Forward: Feed Forward stands for the point-wise feed-forward network. It consists of two linear transforms with a ReLU activation in between.

$$
\mathrm{FFN}(x) = \max\{0, x W_1 + b_1\} W_2 + b_2
\tag{15}
$$

3. Multi-Head Attention: The structure of attention layer is shown in Figure 4. The Q, K and V are generated by a linear transform from the input, respectively.

2.2.4 Pointer-Generator Network

The pointer-generator model [27], proposed in 2017, is an improved sequence-to-sequence attention model. The purpose of the model is to generate a summary for a paragraph of text.

Figure 6 shows the structure of the pointer-generator model. The red part is the encoder, which is a bidirectional LSTM, and the yellow part is the decoder, which is a single-layer LSTM generating one token at each time step. The mechanisms it uses can be summarized as follows.

1. Attention: Similar to the attention model [11], at each time step the decoder unit generates an attention distribution, which depends on the hidden states of the encoder $h_i$ and the hidden state $s_t$ of the current decoder step.

$$
\begin{aligned}
e_i^t &= v^{\top} \tanh(W_h h_i + W_s s_t + b_{\mathrm{attn}})\\
a^t &= \mathrm{softmax}(e^t)
\end{aligned}
\tag{16}
$$

where $a^t$ is the attention distribution. The context vector at time step $t$ can be calculated as the weighted sum of the hidden states of the encoder:

$$
h_t^* = \sum_i a_i^t h_i
\tag{17}
$$

Figure 6: Illustration of the pointer-generator model [27]

Based on the context vector, the model generates a probability distribution over the pre-built vocabulary.

$$
P_{\mathrm{vocab}} = \mathrm{softmax}\big(V'(V[s_t, h_t^*] + b) + b'\big)
\tag{18}
$$

2. Pointer-Generator: Unlike common text-summarization algorithms, the pointer-generator model can generate a summary by copying words that are out-of-vocabulary but appear in the original text ("3-0" and names, for example). These tokens are rare in the dataset, but they may carry information necessary for the summary.

The model uses the attention distribution as the probability distribution of selecting a token from the source text.

The final probability distribution is based on both probability distributions on the source text and the pre-built vocabulary.

$$
P(w) = p_{\mathrm{gen}} P_{\mathrm{vocab}}(w) + (1 - p_{\mathrm{gen}}) \sum_{i:\, w_i = w} a_i^t
\tag{19}
$$

where $p_{\mathrm{gen}}$ is the probability (or weight) that the word is generated from the pre-built vocabulary.

$$
p_{\mathrm{gen}} = \sigma\big(w_{h^*}^{\top} h_t^* + w_s^{\top} s_t + w_x^{\top} x_t + b_{\mathrm{ptr}}\big)
\tag{20}
$$

where $w_{h^*}$, $w_s$, $w_x$ are learnable vectors, and $b_{\mathrm{ptr}}$ is a learnable scalar.

3. Coverage mechanism: To improve the performance of the model and to let the generated summary cover as much of the source text as possible, the model introduces a coverage mechanism. It measures coverage by summing up the attention distributions of all previous steps and comparing this sum with the current attention distribution, to see whether the current time step is covering a different part of the source text.

$$
c^t = \sum_{i=0}^{t-1} a^i
\tag{21}
$$

and the coverage loss is calculated as

$$
\mathrm{covloss}_t = \sum_i \min(a_i^t, c_i^t)
\tag{22}
$$


The model is trained in a supervised manner, which is to optimize the sum of the log probabilities of generating the ground truth at each time step. Combined with the coverage loss, the final objective function is

$$
\mathrm{loss}_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)
\tag{23}
$$
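To illustrate how Eq. (19) combines the vocabulary distribution with the copy (attention) distribution, here is a hedged NumPy sketch. The fixed-size extended vocabulary and the variable names are assumptions made for the example, not the implementation in [27].

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attention, source_ids, extended_size):
    """Eq. (19): P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: w_i = w} a_i^t.

    p_vocab       : (vocab_size,) distribution over the pre-built vocabulary
    attention     : (src_len,) attention weights over source positions
    source_ids    : (src_len,) id of each source token in the extended vocabulary
    extended_size : vocab_size + number of source-only (OOV) tokens
    """
    p_final = np.zeros(extended_size)
    p_final[: len(p_vocab)] = p_gen * p_vocab             # generation part
    # copy part: scatter-add attention mass onto the tokens it points at
    np.add.at(p_final, source_ids, (1.0 - p_gen) * attention)
    return p_final
```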

The next part introduces beam search [28], the algorithm used in the pointer-generator network [27] to extract the best sequence.

The algorithm has a similar idea to breadth-first search. The difference is that at each time step, it only keeps the top-β results and expands them in the next step. Finally, the results are selected among the kept candidates. Since some candidates are discarded, the algorithm is a greedy algorithm. When β is infinite, the algorithm is equal to breadth-first search, and when β is 1, the algorithm generates a sequence by selecting the token with the largest probability at each time step, which is usually much worse than the result when β is larger than or equal to 4.

In the pointer-generator network, the author uses 4 as the beam width [27].
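A generic beam-search sketch matching the description above is shown below. The `step_fn` interface (returning log-probabilities of the next token given a prefix) and the end-of-sequence id are assumptions made for the example, not the interface used in [27].

```python
import numpy as np

def beam_search(step_fn, start_id, eos_id, beam_width=4, max_len=30):
    """Keep the top-`beam_width` partial sequences at every step.

    step_fn(prefix) -> 1-D array of log-probabilities over the vocabulary.
    """
    beams = [([start_id], 0.0)]                      # (token ids, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = step_fn(seq)
            top_ids = np.argsort(log_probs)[-beam_width:]
            for tok in top_ids:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        # keep only the best `beam_width` expansions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```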

The result of the paper [27] is tested with the ROUGE metric which will be introduced in Section 2.2.5.

Model ROUGE-1 ROUGE-2 ROUGE-L

seq-to-seq + attention baseline (150k vocab) 30.49 11.17 28.08

seq-to-seq + attention baseline (50k vocab) 31.33 11.81 28.83

pointer-generator 36.44 15.66 33.42

pointer-generator + coverage 39.53 17.28 36.38

Table 2: Result of the pointer-generator model compared with the result of seq-to-seq model [27]

The results show that the pointer-generator is promising compared with the seq-to-seq model. Moreover, the mechanism of directly copying words from the source text means that the model does not have to rely on the inaccurate word embeddings caused by rare words. Therefore, it should be a good structure to use and test.

2.2.5 Evaluation

The method for the evaluation of the results is critical for the project, since it determines how the performance of the project is measured. In the following sections, three kinds of standard metrics for the evaluation of summary quality are introduced.

ROUGE score

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation [13], is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of (human-produced) reference summaries or translations.

The most common set of metrics in ROUGE is ROUGE-N. Formally, ROUGE-N is an n-gram recall between a candidate summary and a set of reference summaries. ROUGE-N is computed as follows:

$$
\text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)}
\tag{24}
$$

where $n$ stands for the length of the n-gram, $\mathrm{gram}_n$, and $\mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)$ is the maximum number of n-grams co-occurring in the candidate summary and the set of reference summaries.
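The ROUGE-N recall of Eq. (24) can be computed with simple n-gram counting, as in the sketch below; the single-reference setting and whitespace tokenization are simplifying assumptions for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=2):
    """Eq. (24) for a single reference: matched n-grams / reference n-grams."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0
```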


Another set of metrics in ROUGE is ROUGE-L (Longest Common Subsequence). Suppose we want to measure the similarity between two summaries X and Y, where the length of X is n and the length of Y is m. The LCS-based F-measure of the similarity between X and Y is:

$$
R_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X, Y)}{m}, \qquad
P_{\mathrm{lcs}} = \frac{\mathrm{LCS}(X, Y)}{n}, \qquad
F_{\mathrm{lcs}} = \frac{(1 + \beta^2) R_{\mathrm{lcs}} P_{\mathrm{lcs}}}{R_{\mathrm{lcs}} + \beta^2 P_{\mathrm{lcs}}}
\tag{25}
$$

where $\mathrm{LCS}(X, Y)$ stands for the length of the longest common subsequence between the two summaries. In Eq. 25, $\beta$ is a parameter used to set the weights on $R_{\mathrm{lcs}}$ and $P_{\mathrm{lcs}}$. The introduction of ROUGE [13] suggests that $\beta$ should be chosen very large, which means that only $R_{\mathrm{lcs}}$ is considered.

"One advantage of using LCS is that it does not require consecutive matches but in-sequence matches that reflect sentence level word order as n-grams. The other advantage is that it automatically includes the longest in-sequence common n-grams. Therefore no predefined n-gram length is necessary." [13]

BLEU score

BLEU (bilingual evaluation understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human [12]. BLEU was one of the first metrics to claim a high correlation with human judgments of quality and remains one of the most popular automated and inexpensive metrics [12]. The core metric of BLEU is modified unigram precision. When it is performed on n-grams,

$$
\text{BLEU-N} = \frac{\sum_{C \in \{\text{Candidates}\}} \sum_{\text{n-gram} \in C} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{C' \in \{\text{Candidates}\}} \sum_{\text{n-gram}' \in C'} \mathrm{Count}(\text{n-gram}')}
\tag{26}
$$

where $\mathrm{Count}_{\mathrm{clip}} = \min\{\mathrm{Count}, \mathrm{Max\_Ref\_Count}\}$, which means that the count of an overlapping n-gram is clipped by the maximum number of times the n-gram occurs in any single reference. The metric can also be applied to unigrams. An example provided by the paper [12] is

Candidate: the the the the the the the

Reference 1: The cat is on the mat. (Score: 2/7)
Reference 2: There is a cat on the mat. (Score: 1/7)

The standard unigram precision against the first reference would be 7/7 = 1, which is not reasonable.
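The modified (clipped) unigram precision illustrated by this example can be sketched as follows; multiple references are handled by clipping each candidate n-gram count at its maximum count in any single reference. Note that the sketch reproduces the combined 2/7 score, rather than the per-reference scores quoted above.

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision: Count_clip = min(Count, Max_Ref_Count) (cf. Eq. 26)."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.lower().split())
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split()).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / max(sum(cand.values()), 1)

# The candidate "the the the the the the the" against the two references
# gives 2/7, the clipped count of "the" divided by the candidate length.
print(modified_precision("the the the the the the the",
                         ["The cat is on the mat.", "There is a cat on the mat."]))
```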

METEOR score

The Metric for Evaluation of Translation with Explicit Ordering (METEOR) is proposed in the paper [42]. The main differences it introduces are

1. The word matching in METEOR involves stemming and synonyms.


2.3 Generative Models

2.3.1 Variational Auto-Encoder

Figure 7: Traditional Auto-Encoder [14]

As shown in Figure 7, in a traditional Auto-Encoder the encoder network encodes the image into a latent vector, and the decoder network decodes the latent vector back to the original image. The authors of [15] want to generate realistic images by sampling latent vectors and decoding them. However, the distribution of the latent vectors is unknown in the traditional Auto-Encoder. Therefore, some regularization has to be added to the latent space, forcing it to fit a certain distribution (a unit Gaussian distribution, for example). Then one can sample random vectors from the distribution and generate images with the decoder.

Figure 8: Variational Auto-Encoder [14]

Intuitively, the Variational Auto-Encoder [15] is a network that achieves this goal. In the Variational Auto-Encoder, the latent vector is interpreted as the concatenation of $d_L$ independent random variables, which means that each image is projected to a multivariate normal distribution.

In order to interpret the projection process, the encoder network generates a mean vector µ and a standard deviation vector σ (as shown in Figure 8) to represent the parameters of the multivariate normal distribution.

The objective of the Variational Auto-Encoder is

$$
\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{KL}}
= \mathcal{L}_{\mathrm{recon}} + \big[\mu^2 + \sigma^2 - \log(\sigma^2) - 1\big]
\tag{27}
$$

The first part is the reconstruction loss, which can be a mean-squared error or a cross-entropy loss depending on the type of input data. The second part represents the KL-divergence between the multivariate normal distributions generated from the images and the unit normal distribution. The reconstruction loss makes sure that the information is kept in the latent distribution, and the KL-divergence loss makes sure that the generated distribution is close to the unit normal distribution. Note that the two losses cannot be minimized at the same time: when the KL-divergence is 0, all images generate the same unit normal distribution, and no information is kept. Therefore, the KL-divergence is a regularization that tries to shape the overall distribution towards a unit normal distribution.
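A minimal sketch of the objective in Eq. (27), with a binary cross-entropy reconstruction term, is given below. Summing the KL term over latent dimensions and omitting the usual factor of 1/2 follow the form written above; all names are illustrative assumptions.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Eq. (27): reconstruction loss plus the KL regularizer mu^2 + sigma^2 - log(sigma^2) - 1.

    x, x_recon  : (d,) input and reconstruction (values in [0, 1])
    mu, log_var : (latent_dim,) encoder outputs
    (The standard VAE objective carries an extra factor of 1/2 on the KL term.)
    """
    eps = 1e-7
    recon = -np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps))
    sigma2 = np.exp(log_var)
    kl = np.sum(mu ** 2 + sigma2 - np.log(sigma2 + eps) - 1.0)
    return recon + kl
```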

2.3.2 Variants of Generative Adversarial Networks

The following sections introduce variants of the basic GAN. The variants are mostly used for generating images instead of text, but their overall structures, loss-function designs, and training methods can provide directions for further research.

Generative Adversarial Network

The Generative Adversarial Network was first introduced in the paper [16] in 2014. The core idea is that by applying a transformation (the generator) to a random distribution, it can transform the random distribution to approximate the distribution of real data, and the network used to measure the distance between the two distributions is the discriminator.

In a GAN, the generator and the discriminator are defined by neural networks $G(z, \theta_g)$ and $D(x, \theta_d)$. $z$ is multi-dimensional random noise with a prior distribution (usually uniform or normal). The input of $G$ is the random noise $z$, and the output of $G$ is a generated sample. $x$ stands for the input of $D$, which can be a real sample or a generated sample, and the output of $D$ is the probability that the input sample is real (range 0 - 1). $\theta_g$ and $\theta_d$ are the parameters of the generator and the discriminator, respectively.

The algorithm trains the discriminator and the generator in turn, which is a min-max game represented by the following objective:

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
\tag{28}
$$

The proof in the Generative Adversarial Network paper [16] shows that optimizing the objective $V(D, G)$ amounts to minimizing the Jensen-Shannon (JS) divergence between the two distributions. The explicit algorithm for training a GAN is:

for number of training iterations do
    for k steps do
        • Sample a minibatch of m noise samples $\{z^{(1)}, \ldots, z^{(m)}\}$ from the noise prior $p_g(z)$.
        • Sample a minibatch of m samples $\{x^{(1)}, \ldots, x^{(m)}\}$ from the data-generating distribution $p_{\mathrm{data}}(x)$.
        • Update the discriminator by ascending its stochastic gradient:
          $\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \big[\log D(x^{(i)}) + \log(1 - D(G(z^{(i)})))\big]$
    end for
    • Sample a minibatch of m noise samples $\{z^{(1)}, \ldots, z^{(m)}\}$ from the noise prior $p_g(z)$.
    • Update the generator by descending its stochastic gradient:
      $\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))$
end for


The training converges when $p_g = p_{\mathrm{data}}$, which is called the Nash equilibrium.
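A hedged, minimal TensorFlow 2 sketch of the training loop above is given below for a toy continuous-data case (1-D Gaussian samples). The network sizes, optimizer settings and k = 1 discriminator step are arbitrary illustrative choices, and the generator update uses the common non-saturating variant of the loss rather than descending log(1 − D(G(z))); none of this is the configuration used in the thesis.

```python
import tensorflow as tf

latent_dim, batch_size = 8, 64
G = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                         tf.keras.layers.Dense(1)])                        # generator
D = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="relu"),
                         tf.keras.layers.Dense(1, activation="sigmoid")])  # discriminator
bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-3)
d_opt = tf.keras.optimizers.Adam(1e-3)

def train_step(real_x):
    z = tf.random.normal([batch_size, latent_dim])
    # Discriminator update: label real samples 1 and generated samples 0
    with tf.GradientTape() as tape:
        d_loss = bce(tf.ones_like(D(real_x)), D(real_x)) + \
                 bce(tf.zeros_like(D(G(z))), D(G(z)))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    # Generator update (non-saturating): push D to label generated samples as real
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones_like(D(G(z))), D(G(z)))
    g_opt.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss

for step in range(1000):
    real = tf.random.normal([batch_size, 1], mean=4.0, stddev=0.5)  # toy "real" data
    train_step(real)
```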

Conditional Generative Adversarial Network

Conditional Generative Adversarial Network [17] is a variant of normal GAN which is used to generate samples based on some conditions. For example, it can be used to generate hand-written numbers (MNIST) given the number.

In the first version of the GAN, $p_g$ is based only on the prior distribution of the noise $z$, while in the conditional GAN, the samples are generated based on both the noise $z$ and the condition $y$. Also, the discriminator takes both the sample $x$ and the condition $y$ as input. The illustration (Figure 9) of the structure of a simple conditional adversarial network is shown below.

Figure 9: Structure of a simple conditional adversarial network [17]

The objective function of the two-player minimax game is

$$
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))]
\tag{29}
$$

The training algorithm is identical to the one in traditional GAN [16].

Wasserstein Generative Adversarial Network

When the concept of the GAN was first introduced in [16], the commonly used objective function was Eq. 28, where x is real data, z is noise, D is the discriminator network and G is the generator. Optimizing the objective function is proven to minimize the Jensen-Shannon (JS) divergence between the distribution of x and that of G(z) [16]. The Wasserstein GAN [18] proposes to use the Earth-Mover (Wasserstein) distance to replace the JS divergence, which makes the training more stable and fixes the problem of mode collapse [18].

$$
\min_G \max_{D \in \mathcal{D}} \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]
\tag{30}
$$

where $\mathcal{D}$ stands for the set of 1-Lipschitz functions [31]. To enforce the 1-Lipschitz constraint on the discriminator, [18] discusses weight clipping, but it performs poorly in many cases. Instead, [18] proposes to add a gradient penalty to the loss function, which performs better than weight clipping. The loss function with gradient penalty is

$$
L = \mathbb{E}_{x \sim p_{\mathrm{data}}}[D(x)] - \mathbb{E}_{z \sim p_z}[D(G(z))]
+ \lambda\, \mathbb{E}_{\hat x \sim p_{\hat x}}\big[(\|\nabla_{\hat x} D(\hat x)\|_2 - 1)^2\big]
$$

where $p_{\hat x}$ samples uniformly on the linear interpolations between real samples and generated samples [18]. The added second part of the equation is the gradient penalty.

Figure 10 below shows that the gradient penalty provides a more stable training process and utilizes the capacity of the network, while weight clipping may make the weights gather around −0.05 and 0.05, which reduces the capacity of the network.

Figure 10: Comparison Between Weight Clipping and Gradient Penalty [18]
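For reference, the gradient-penalty term of the loss above can be sketched as a TensorFlow function. The interpolation between real and generated samples and λ = 10 follow the usual WGAN-GP recipe, the inputs are assumed to be 2-D (batch, features), and the function name is illustrative.

```python
import tensorflow as tf

def gradient_penalty(discriminator, real, fake, lam=10.0):
    """lambda * E[(||grad_x_hat D(x_hat)||_2 - 1)^2], with x_hat sampled uniformly
    on the line between a real sample and a generated sample."""
    eps = tf.random.uniform([tf.shape(real)[0], 1], 0.0, 1.0)
    x_hat = eps * real + (1.0 - eps) * fake
    with tf.GradientTape() as tape:
        tape.watch(x_hat)
        d_hat = discriminator(x_hat)
    grads = tape.gradient(d_hat, x_hat)
    norms = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=-1) + 1e-12)
    return lam * tf.reduce_mean(tf.square(norms - 1.0))
```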

2.3.3 Adversarial Text Generation

Currently, there are some papers on generating text using adversarial training. The following sections introduce papers that use adversarial training but no reinforcement learning algorithms. Even though they generate text without a given context, their methods and network structures can serve as references for the structure of a GAN for text generation.

Adversarial Text Generation Without Reinforcement Learning

Donahue and Rumshisky proposed LaTextGAN [30] as a method to generate text without reinforcement learning.

Figure 11 is an illustration of the structure. The structure of LaTextGAN consists of two parts: one part is a Variational Auto-Encoder, and the other part is a GAN. The Variational Auto-Encoder is constructed using LSTMs with a cell size of 100 for the encoder and 600 for the decoder. The generator and the discriminator are built using the ResNet structure with 40 layers each, and all layers have the same dimension of 100.


Figure 11: Structure of LaTextGAN [30]

In the second stage, the autoencoder is fixed, and the training is performed on the generator and the discriminator. $z$ is sampled from a multivariate Gaussian distribution. The training algorithm and the objective are adopted from the Wasserstein GAN, which uses the Earth-Mover distance and the gradient penalty, and the discriminator is trained 10 times per generator step. As for the results (Table 3), LaTextGAN achieves a better result in the human evaluation, which shows the potential of GAN in text generation.

Model More Realistic Less Realistic Equally Realistic BLEU Score

LaTextGAN 13.9% 55.6% 30.5% 0.678

NLM 12.2% 46.6% 41.2% 0.643

VAE 6.0% 81.6% 12.4% 0.688

Table 3: Human evaluation of model-generated sentences as more realistic, less realistic, or equally realistic with respect to real English sentences. BLEU-4 score calculated on a held-out validation set.

Adversarial Generation of Natural Language

Adversarial Generation of Natural Language [20] is a paper published in 2017. The work generates text using a GAN alone, and it obtains state-of-the-art performance on a Chinese poem dataset.

The general structure of the model is shown in Figure 12. The yellow blocks stand for the output of the generator. The real data is represented by 1-hot vectors. For the generated samples, a noise vector sampled from a Gaussian distribution is fed into the generator, and the output is a probability distribution over the vocabulary.

The work introduces two sets of choices for the generator and the discriminator. It achieves the best BLEU score among all compared methods.

1. Peephole LSTM: The first choice is that the generator is a recurrent neural network with the peephole LSTM cell as its basic unit, and a shared affine transformation is applied to the outputs of the RNN to map them to a probability distribution. The discriminator is also an RNN with LSTM units. The state generated by the last unit is passed through a regression to produce the score indicating whether the sample comes from the real data or is generated.


Figure 12: Structure of the Network [20]

For the training, it adapts the objective function from WGAN [18] which uses the gradient penalty.

For the experiments, since it uses the probability distribution (softmax) as the output, it can only use a subset of the datasets, whose vocabulary size is about 30k. This means that the method cannot be generalized to cases with a large vocabulary.

Figure 13: Results on Chinese Poem Dataset [20]

Figure 13 shows that the method attains state-of-the-art performance on the Chinese Poem Dataset. Generally speaking, the paper presents a high-quality method to generate text. However, it cannot be generalized to a large corpus, and no latent vectors are produced that could be used to generate a summary in a second step (it has no encoding of the real data).

Generate Text via Adversarial Training

This is a paper published at the NIPS Workshop on Adversarial Training in 2016 [21]. The main difference it makes is that, instead of using the standard GAN loss function, it uses a loss function that matches the distributions of the features.

In Figure 14, it shows the general structure of the network.

Figure 14: Left: Illustration of the textGAN model. The discriminator is a CNN, and the sentence decoder is an LSTM. Right: the structure of the LSTM model [21]

1. CNN Discriminator: A sentence is represented by a matrix $X \in \mathbb{R}^{k \times T}$, where $k$ is the dimension of the word embedding and $T$ is the length of the sentence. A convolutional kernel $W_c \in \mathbb{R}^{k \times h}$ is applied on the sentence. In practice, $h$ takes multiple sizes, and the concatenated results (denoted by the feature vector $f$) are passed through a softmax layer to map them to $D(x) \in [0, 1]$.

2. LSTM Generator: As shown in Figure 14, the noise vector $z$ is fed to every cell, so the probability of a length-$T$ sentence $\tilde s$ given the noise vector $z$ is

$$
p(\tilde s \mid z) = p(w_1 \mid z) \prod_{t=2}^{T} p(w_t \mid w_{<t}, z)
\tag{31}
$$

The $t$-th word $w_t$ is determined by $w_t = \arg\max(V h_t)$, where $h_t$ is the hidden state of cell $t$. The output $y_t$ is determined by $y_t = W_e[w_t]$. The paper uses an approximated discretization to solve the problem that gradient descent fails on the discrete problem:

$$
y_t = W_e \, \mathrm{softmax}(V h_t \cdot L)
\tag{32}
$$

When $L \to \infty$, the term approximates $y_t = W_e[w_t]$.

For the training objective, the iterative optimization scheme consists of two steps:

$$
\begin{aligned}
\text{minimizing: } \mathcal{L}_D &= -\mathbb{E}_{s \sim S} \log D(s) - \mathbb{E}_{z \sim p_z(z)} \log[1 - D(G(z))]\\
\text{minimizing: } \mathcal{L}_G &= \mathrm{tr}\big(\Sigma_s^{-1} \Sigma_r + \Sigma_r^{-1} \Sigma_s\big) + (\mu_s - \mu_r)^{\top}\big(\Sigma_s^{-1} + \Sigma_r^{-1}\big)(\mu_s - \mu_r)
\end{aligned}
\tag{33}
$$

where $\Sigma_s$, $\Sigma_r$ are the covariance matrices of the real and synthetic sentence feature vectors $f_s$, $f_r$, respectively, and $\mu_s$, $\mu_r$ denote the mean vectors of $f_s$, $f_r$, respectively. $\Sigma_s$, $\Sigma_r$, $\mu_s$ and $\mu_r$ are empirically estimated on each minibatch.

The model is trained on BookCorpus dataset [26] (70 million sentences) and Arxiv dataset (5 million sentences).


2.4 Reinforcement Learning

2.4.1 Policy Gradient

Policy gradient methods [34] are a type of reinforcement learning techniques that rely upon optimizing parametrized policies for the expected return (long-term cumulative reward) by gradient descent. They do not suffer from many of the problems that have been marring traditional reinforcement learning approaches such as the lack of guarantees of a value function, the intractability problem resulting from uncertain state information and the complexity arising from continuous states and actions.

Sequence Generative Adversarial Nets with Policy Gradient

The characteristic of SeqGAN [22] is that, instead of treating the networks simply as a classifier and a generator, the generator is designed as a policy and the discriminator is designed as an interactive environment. The goal of the generator is to obtain a higher expected reward from the environment (D), and the goal of the discriminator is to give real data a higher reward and synthetic data a lower reward.

Figure 16: The Illustration of SeqGAN [22]

Figure 16 shows the general structure of SeqGAN. The true data and the synthetic data are in the form of sequences, and the synthetic data is generated by sampling from the policy (the generator). The quality of the policy is evaluated by sampling a sequence up to time t; Monte-Carlo searches are then performed from time t to time T to estimate the score for the sample.

For the discriminator, it is updated according to the standard GAN objective,

$$
\mathcal{L}_D = -\mathbb{E}_{Y \sim p_{\mathrm{data}}}[\log D_{\phi}(Y)] - \mathbb{E}_{Y \sim G_{\theta}}[\log(1 - D_{\phi}(Y))]
\tag{34}
$$

For the generator, it is updated according to the policy gradient,

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{t=1}^{T} \mathbb{E}_{Y_{1:t-1} \sim G_{\theta}} \Big[\sum_{y_t \in \mathcal{Y}} \nabla_{\theta} G_{\theta}(y_t \mid Y_{1:t-1}) \cdot Q^{G_{\theta}}_{D_{\phi}}(Y_{1:t-1}, y_t)\Big]\\
\theta &\leftarrow \theta + \alpha_h \nabla_{\theta} J(\theta)
\end{aligned}
\tag{35}
$$

where $\theta$ stands for the parameters of the generator, $\alpha_h$ is the learning rate, $Y$ is a sequence sampled from the generator distribution, and $y_t$ is the token at time step $t$. The scores for the sequences are estimated by Monte-Carlo search, as shown in Figure 16.

SeqGAN can be applied to text generation, music generation [22] and many other sequence-generation applications.

There is also another REINFORCE-based method that uses an actor-critic approach [25] and performs well at generating sequences.

Improving Conditional Sequence Generative Adversarial Networks by Stepwise Evaluation


Figure 17: Illustration of the model [23]

The model contains two parts, the generator and the discriminator. Both the generator and the discriminator have an encoder and a decoder. The encoders take the input sentence as input, and their final states are fed to the decoders.

In the decoder of the generator, at each time step a token is generated by sampling from the word distribution. In the decoder of the discriminator, at each time step it takes the token generated at the current time step and outputs a score directly estimating the state-action value $\hat Q(s_i, y_i)$; the score for the whole response is given by the mean of the scores at all time steps.


3 Methods

3.1 Dataset

Three datasets are used in this project. They are AMI Meeting corpus [35], LMDB movie review dataset [37] and CNN News dataset [38]. The scale of the dataset after applying filtering is shown in Section 3.2.3.

3.1.1 AMI Meeting Corpus

The AMI Meeting Corpus [35] is a multi-modal dataset consisting of 100 hours of meeting recordings, and the data is transcribed into text. The dataset has been widely used for analyzing and evaluating meeting transcripts. The AMI Meeting Corpus contains a rich set of human-provided labels. Every word in the dataset is labeled with the time when it was spoken and with its speaker. Every sentence is labeled with its type, which can be "decision", "action", or one of seven other types [35]. The dataset also includes both extractive and abstractive summaries, which are linked to each other.

3.1.2 LMDB Movie Review Dataset

The LMDB movie review dataset was originally fetched by crawler scripts and was used by the paper [37]. Every review in the dataset is labeled with the score given by the user. The paper initially used the dataset for sentiment analysis [37]. Although the LMDB movie review dataset is of lower quality than the AMI Corpus, because reviews may contain typos and special tokens, it has more variety and a larger quantity of sentences.

3.1.3 CNN News Dataset

The CNN news dataset was first collected and used by the paper [38]. The original purpose of the dataset is to generate summaries, which exactly fits the needs of the thesis. The data comes in pairs, with both a context and highlights. In the thesis, the dataset was downloaded using the crawler scripts provided in a GitHub repository [39].

3.2 Pre-processing

3.2.1 Tokenization

The first step of the pre-processing is to tokenize the data. spaCy [40], an open-source Python library for natural language processing, is used for this. The library includes a model that can be used to split paragraphs into words. However, the model is not stable for some words such as "Let's" and "he's", which may cause double-counted tokens when training the model. Therefore, some word filtering is done in the next step.

3.2.2 Word Filtering

The reason for the word filtering is that the spaCy library [40] performs inconsistently on some words. Therefore, the first step is to remove some common short forms to make sure that the same word is tokenized in the same way in different contexts.

In the thesis, the stop words (for example, “ok”, “ah”, etcetera) are not removed from the original sentences even though it is a typical operation for word filtering. The reason is that the goal of the thesis is to generate sentences or summaries that look real, which means they should contain all stop words. Therefore, the stop words in the training data should not be removed.

3.2.3 Data Filtering

Contexts with more than 1000 tokens are removed, summaries with more than 120 tokens are removed, and contexts with more than 200 out-of-vocabulary words are removed.

For the dataset used by the sentence generation, there are 634943 samples of data (sentences) after filtering. The majority of the dataset is LMDB Movie Review dataset. For the dataset used by the summary generation, there are 67005 pairs of data (source text - summary pairs) after filtering.

3.2.4 Word Embedding

The last step of the pre-processing is the word embedding. As introduced in Section 2.2.2, word embedding transforms a word into a fixed-length vector that represents the meaning of the word, and word embeddings have proved to be a good choice as input to neural networks.

In the thesis, the Word2Vec model [8] is used. The implementation and the training of the model come from Gensim [41], an open-source library for unsupervised topic modeling and natural language processing using modern statistical machine learning. A visualization of the trained word embedding is shown in Figure 18.


In the thesis, two sets of word embeddings are trained, for the text-generation model and the summary-generation model, respectively. The trainings are based on the respective training datasets of the two models.
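A minimal sketch of training a Word2Vec model with the Gensim 4 API on pre-tokenized sentences is shown below. The parameter values (vector size 300, skip-gram with negative sampling) mirror the typical settings discussed in Section 2.2.2 but are not the exact hyperparameters used in the thesis, and the two-sentence corpus is only a placeholder.

```python
from gensim.models import Word2Vec

# `sentences` is assumed to be the tokenized corpus: a list of lists of tokens.
sentences = [["the", "movie", "was", "great"],
             ["the", "film", "was", "boring"]]

model = Word2Vec(sentences,
                 vector_size=300,   # dimensionality of the word vectors
                 window=5,          # context window size
                 sg=1,              # use the skip-gram model
                 negative=15,       # negative sampling
                 min_count=1)

vec = model.wv["movie"]                    # 300-dimensional vector for "movie"
similar = model.wv.most_similar("movie")   # nearest words by cosine similarity
```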

3.3 Model

The project has two stages. One is to generate random, readable synthetic sentences, and the other is to generate a summary based on a paragraph of text. Therefore, this chapter introducing the models is divided into two sections, for sentence generation and summary generation, respectively.

3.3.1 Sentence Generation

The idea of the design of the sentence-generation model is from the paper [23] that is introduced in section 2.4.1. In the paper [23] using step-wise evaluation, the model takes a sequence as an input and outputs its response. In the project, the sentence generation model is to generate random readable sentences. Therefore, the model is designed as a simplified version of the model described in the paper [23].

The model contains two parts, the generator and the discriminator (Figure 19 and Figure 20). The training process has two phases and therefore involves two sets of loss functions, which are introduced in the following section.

Figure 19: Illustration of the Sentence-Generation Model. The overall score given by the discriminator is the mean of the step-wise scores, $\bar Q = \frac{1}{n} \sum_{i=1}^{n} Q(s_i \mid x_{1:i})$.

Model Structure


Figure 20: Left: Generator Unit (word embedding → LSTM → dense layer → softmax → word distribution). Right: Discriminator Unit (word embedding → LSTM → dense layer → sigmoid → score).

1. Generator: The generator is a single-layer LSTM. For each time step, the LSTM cell takes the state from the previous time step and the embedding of the previously generated token as input. By applying a dense layer and a softmax activation on the output, the word distribution is generated.

$$
\begin{aligned}
o_{t+1}^G, s_{t+1}^G &= \mathrm{LSTM}^G(i_t^G, s_t^G)\\
P_{\mathrm{vocab}} &= \mathrm{softmax}\big(W_o^G o_{t+1}^G + b_o^G\big)
\end{aligned}
\tag{36}
$$

where $o_t^G$ is the output at time step $t$, $s_t^G$ is the cell state at time step $t$, and $W_o^G$ and $b_o^G$ are the learnable weights of the dense layer.

After obtaining the probability distribution over the vocabulary, multinomial sampling is done based on the categorical distribution given by the generator unit. For each time step, the generated token is transformed into its word embedding and fed to the discriminator unit.

2. Discriminator: The discriminator is also a single-layer LSTM, represented by the red units in Figure 19. For each time step, the LSTM cell takes the state from the previous time step and the token generated by the generator unit in the current step. The output of the LSTM cell goes through a dense layer and is transformed into a scalar, which represents the score for the current time step.

$$
\begin{aligned}
o_{t+1}^D, s_{t+1}^D &= \mathrm{LSTM}^D(i_{t+1}^D, s_t^D)\\
Q_{t+1} &= \sigma\big(w_o^D o_{t+1}^D + b_o^D\big)
\end{aligned}
\tag{37}
$$

where $\sigma$ is the sigmoid function, $w_o^D$ and $b_o^D$ are the learnable weights of the dense layer, and $Q_{t+1}$ stands for the score $Q(s_{t+1} \mid x_{1:t+1})$ given to the current time step.

Loss Function & Training

1. Phase 1 (maximum likelihood optimization): Since the result space in text generation is vast (if the vocabulary size is $v$ and the maximal number of time steps is $t$, the space can be $v^t$), it is necessary to pre-train the generator to reduce the search space, so that the training does not fall into bad local minima when training the whole network together.

The pre-training maximizes the log-likelihood that the generator generates the same sentences as in the dataset. The loss function is

$$
\mathcal{L}_G = -\mathbb{E}_{x \sim P_D} \Big[\log p(x_1) + \sum_{i=2}^{T} \log p(x_i \mid x_{1:i-1})\Big]
\tag{38}
$$

where $x \sim P_D$ means that $x$ is a sample from the dataset, and $T$ is the length of the sample $x$.


2. Phase 2 (adversarial training): For the discriminator, the loss function is the standard GAN objective. For the generator, the loss function is designed based on the policy gradient theorem introduced in Section 2.4.1, which is to maximize the probability that the generator obtains a high score in the environment controlled by the discriminator.

$$
\mathcal{L}_D = -\mathbb{E}_{x \sim P_D}[\log D(x)] - \mathbb{E}_{x' \sim P_G}[\log(1 - D(x'))]
\tag{39}
$$

$$
\mathcal{L}_G = -\mathbb{E}_{x \sim P_G} \Big[\sum_{i=1}^{T} \big(Q(s_i \mid x_{1:i}) - V(x_{1:i})\big) \cdot \log p_G(x_i \mid x_{1:i-1})\Big]
\tag{40}
$$

where $V(x_{1:i})$ is the baseline at the current state given by the baseline model. The purpose of introducing the baseline model is to reduce the variance of the loss function and accelerate the training process, as mentioned in Section 2.4.1. The loss function for the baseline model $V$ is

$$
\mathcal{L}_V = \mathbb{E}_{x \sim P_G, P_D} \Big[\frac{1}{T} \sum_{i=1}^{T} \big[V(x_{1:i-1}) - Q(x_{i+1} \mid x_{1:i})\big]^2\Big]
\tag{41}
$$

The whole training process is given in the following algorithm.

Initialize generator G, discriminator D and baseline V
Pre-train generator G using Eq. (38)
for number of training iterations do
    for i = 1 to D-iteration do
        Sample x from the real data
        Sample x' from the generator
        Update D using Eq. (39)
    end for
    Update V using Eq. (41)
    Sample x from the real data
    Sample x' from the generator
    Update G using Eq. (40)
end for

Algorithm 1: Training Algorithm (Sentence Generation)
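To make the generator and baseline updates of Eqs. (40) and (41) concrete, the following hedged sketch computes the losses for one sampled sentence from per-step quantities. The tensor shapes and names are illustrative assumptions and not the thesis implementation.

```python
import tensorflow as tf

def generator_pg_loss(log_probs, q_scores, baselines):
    """Eq. (40) for a single sampled sentence.

    log_probs : (T,) log p_G(x_i | x_{1:i-1}) of the sampled tokens
    q_scores  : (T,) step-wise scores Q(s_i | x_{1:i}) from the discriminator
    baselines : (T,) baseline values V(x_{1:i}) from the baseline model
    """
    advantage = tf.stop_gradient(q_scores - baselines)   # no gradient through D or V
    return -tf.reduce_sum(advantage * log_probs)

def baseline_loss(baselines, q_scores):
    """Eq. (41): mean squared error between baseline predictions and scores."""
    return tf.reduce_mean(tf.square(baselines - tf.stop_gradient(q_scores)))
```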

How the model is explicitly designed and how the hyperparameters are selected are introduced in Appendix A.

Result Generation

The results are generated by multinomial sampling using the TensorFlow API. Since the method of extracting results is based on step-wise categorical sampling, the model may generate different results when it is run multiple times.

3.3.2 Summary Generation

The idea of the design of the generator of the summary-generation model is from the paper [27] which is introduced in Section 2.2.4, and the design of the discriminator of the summary-generation model is from the paper [23] which is introduced in section 2.4.1. Note that the encoder of the pointer-generator network is a single-layer bi-directional LSTM.

The pointer-generator network is designed to be trained in a supervised manner, and it has achieved great results [27]. The pointer mechanism used in the network allows it to directly copy words from the context, which provides a way to deal with rare words and out-of-vocabulary words. However, the design of the loss function can still be improved.


1. MLE Loss: The MLE loss, which forces the network to generate exactly the same summary as the ground truth, might be a problem. A good summary should cover all critical points in the context, but it should not be restricted to one particular result; the summary can be rephrased in different ways that are also useful summaries.

2. Coverage Loss: The coverage loss helps the network distribute the attention evenly over the context. However, a summary does not need to cover all information from the context; some details should be ignored in a summary. Also, the coverage loss does not place any restriction on the order of the information that the summary covers.

Generally speaking, the pointer-generator network can be improved by changing the loss function. By adopting the state-of-the-art step-wise evaluation discriminator from the paper [23], the proposed method combines the pointer-generator network with the step-wise evaluation discriminator.

Figure 21: Illustration of the Generator of the Summary-Generation Model

Model Structure


Figure 22: Illustration of the Discriminator of the Summary-Generation Model. The discriminator takes the source text together with the reference or generated summary and outputs a score D(x, y).

2. Discriminator: The structure of the discriminator is the same as the structure in the GAN with step-wise evaluation [23], which is introduced in Section 2.4.1.

Loss Function & Training

1. Phase 1 (maximum likelihood optimization): The first phase of the training is to pre-train the generator to maximize the log-likelihood of generating exactly the same summary as the reference summary. Therefore, the loss function is designed as

$$
\mathcal{L} = -\mathbb{E}_{(x, y) \sim P_D} \Big[\frac{1}{T} \sum_{i=1}^{T} \log p(y_i \mid x, y_{1:i-1})\Big]
\tag{42}
$$

2. Phase 2 (coverage optimization): The second phase of the training is to pre-train the generator to optimize both the log-likelihood and the coverage loss. Therefore, the loss function is designed as

$$
\mathcal{L}_{\mathrm{cov}} = -\mathbb{E}_{(x, y) \sim P_D} \Big[\frac{1}{T} \sum_{t=1}^{T} \big[\log p(y_t \mid x, y_{1:t-1}) - \min\{a^t, c^t\}\big]\Big]
\tag{43}
$$

3. Phase 3 (pre-train discriminator): The next step is to pre-train the discriminator before conducting the adversarial training. The score of a sample $(x, y)$ given by the discriminator is

$$
D(x, y) = \frac{1}{T} \sum_{t=1}^{T} Q(y_t \mid x, y_{1:t-1})
\tag{44}
$$

According to the discriminator loss of a standard GAN, the loss function of the discriminator is

$$
\mathcal{L}_D = -\mathbb{E}_{(x, y) \sim P_D}[\log D(x, y)] - \mathbb{E}_{(x, y') \sim P_G}[\log(1 - D(x, y'))]
\tag{45}
$$

4. Phase 4 (adversarial training): In the last phase, the generator and the discriminator are trained adversarially; while the generator is updated, the discriminator is kept fixed. The loss of the generator is designed following the step-wise evaluation [23], which is

$$
\mathcal{L}_G = -\mathbb{E}_{(x, y) \sim P_G} \Big[\sum_{i=1}^{T} \big(Q(s_i \mid x, y_{1:i-1}) - V(x, y_{1:i-1})\big) \cdot \log p_G(y_i \mid x, y_{1:i-1})\Big]
\tag{46}
$$

The baseline model is designed to predict the baseline scores for all time steps given the context and all previous tokens.

$$
\mathcal{L}_V = \mathbb{E}_{(x, y) \sim P_D, P_G} \Big[\sum_{t=1}^{T} \big(V(x, y_{1:t-1}) - Q(x, y_{1:t})\big)^2\Big]
\tag{47}
$$

The whole training algorithm is described in Algorithm 2.

Initialize generator G, discriminator D and baseline V
Train generator G using Eq. (42)
Train generator G using Eq. (43)
for number of training iterations do
    Update D using Eq. (45)
    Update V using Eq. (47)
end
for number of training iterations do
    for i = 1 to D-iteration do
        Sample x from real data
        Sample x' from the generator
        Update D using Eq. (45)
    end
    Update V using Eq. (47)
    Sample x from real data
    Sample x' from the generator
    Update G using Eq. (46)
end

Algorithm 2: Training Algorithm (Summary Generation)
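To illustrate the quantities that appear in the updates above, the following is a minimal NumPy sketch of the score in Eq. (44) and the per-sample losses in Eqs. (46) and (47), assuming the step-wise scores, baseline values and token log-probabilities have already been computed; it is a sketch, not the thesis implementation, and the array names are assumptions.

import numpy as np

def discriminator_score(q_scores):
    """Eq. (44): sequence-level score as the mean of the step-wise scores Q."""
    return float(np.mean(q_scores))

def generator_loss(q_scores, baselines, log_probs):
    """Eq. (46): step-wise policy-gradient style loss for one sampled summary.

    q_scores:  (T,) step-wise scores Q(y_i | x, y_1:i-1) from the discriminator
    baselines: (T,) baseline values V(x, y_1:i-1)
    log_probs: (T,) log p_G(y_i | x, y_1:i-1) of the sampled tokens
    """
    advantage = q_scores - baselines
    return float(-np.sum(advantage * log_probs))

def baseline_loss(q_scores, baselines):
    """Eq. (47): squared error between the baseline and the step-wise scores."""
    return float(np.sum((baselines - q_scores) ** 2))

# Toy example with a three-token summary.
q = np.array([0.9, 0.4, 0.7])
v = np.array([0.5, 0.5, 0.5])
logp = np.array([-0.2, -1.1, -0.4])
print(discriminator_score(q), generator_loss(q, v, logp), baseline_loss(q, v))

In practice, the expectations in the equations would be approximated by averaging these per-sample values over a batch of sampled summaries.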

Result Generation


4 Results

Implementation Details

The full implementation is available in a GitHub repository to make the thesis reproducible. The link to the repository and some implementation details are given in Appendix A.

4.1 Test Data

In the experiments for sentence generation, there is no need for test data since no input is needed from the dataset.

In the experiments for summary generation, the whole dataset (CNN News Dataset [38]) is divided into two parts: training data (90%, 55 024 samples) and test data (10%, 6 112 samples). Only the test data are used in the experiments reported here.

4.2 Model Size

For the sentence-generation model, the size of the model is around 3.2 GB. The size of the pre-built vocabulary is around 200k.

For the summary-generation model, the size of the model is around 880 MB. The size of the pre-built vocabulary is around 70k.

4.3 Training Time & Inference Time

1. For the sentence-generation model, the model is trained on Google Cloud with one Nvidia P100 GPU. The pre-training took 14 hours for 47k steps (around 2.3 epochs), and the adversarial training took 15 hours for only one epoch.

The inference time for the sentence-generation model is around 0.008 seconds per sample (about 0.16 seconds per batch).

2. For the summary-generation model, the model is trained on Google Cloud with one Nvidia P100 GPU. Phase 1 took 14.5 hours for 12 epochs, Phase 2 took 2.4 hours for 2 epochs, Phase 3 took 3.3 hours for one epoch, and Phase 4 took one day for 6k training loops (one training loop consists of three discriminator updates, one generator update and one baseline update).


4.4 Results & Performance Comparison

4.4.1 Sentence Generation

First, the model is used to generate around 12k sentences for further testing. Some of them are listed in Appendix B.1.

The experiment compares the percentage of k-gram overlap between the generated sentences and the training dataset, where a k-gram is a sequence of k consecutive words from a sentence.
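One possible way to compute such an overlap percentage is sketched below in plain Python; this is an illustration under the assumption that the overlap is counted over distinct k-grams, and it is not necessarily the exact script used for Table 4.

def kgram_overlap(generated, training, k):
    """Percentage of distinct k-grams in the generated sentences that also
    appear somewhere in the training sentences."""
    def kgrams(sentences):
        grams = set()
        for sentence in sentences:
            tokens = sentence.split()
            grams.update(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
        return grams

    gen_grams = kgrams(generated)
    if not gen_grams:
        return 0.0
    return 100.0 * len(gen_grams & kgrams(training)) / len(gen_grams)

generated = ["it is a classic and one", "but i i would watch it again"]
training = ["it is a classic movie", "i would watch it again"]
print(kgram_overlap(generated, training, k=2))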

K-gram        2-gram   3-gram   4-gram
Overlap (%)    82.49    50.51    20.61

Table 4: K-gram overlap between 12k generated samples and the dataset

Some observations from the generated results are listed as follows.

1. The sentences are mostly about movies.

2. Most sentences are grammatically correct, although minor errors still exist, such as “but i i would” and “it is a classic and one”. Nevertheless, the results show promising performance for the sentence-generation model.

3. The model does generate new 2-, 3- and 4-gram patterns. Also, none of the 12k generated sentences appears verbatim in the training dataset.

4. Training the model is time-consuming, mainly because of the dense layer used to produce the probability distribution over the vocabulary.

4.4.2 Summary Generation

The results on the test data are generated using the models from different training phases. The design of the experiments follows the paper [27]. Showing results for multiple phases makes the effects of the individual mechanisms clearer.

Some of the results are shown in Appendix B.2.

Standard Tests

ROUGE scores and METEOR scores are calculated on the test data. Note that the results of the original paper [27] were obtained on a dataset different from the one used in the thesis, so they are given only for reference.

Metrics (10^-2) \ Models    PG (no coverage)   PG (coverage)   PG (GAN)   PG (paper) [27]
ROUGE-1                      29.55              33.10           29.70      39.53
ROUGE-2                       9.37              10.78            8.54      17.28
ROUGE-L                      21.52              23.51           21.15      36.38
METEOR (exact)                7.72               9.53            8.21      17.32
METEOR (+stem/para)           8.39              10.35            9.02      18.72
BLEU (weighted)              44.57              44.65           41.31      -

Table 5: ROUGE and METEOR scores of different training methods

In Table 5, the METEOR scores are calculated under two modes: exact match, and exact match + stem + paraphrase. In the second mode, the weights for the scores of the three matching ways are 1.0, 0.5 and 0.5, respectively. The METEOR scores are calculated with the software provided by the author [42]. For the BLEU score, the weights for BLEU-1, BLEU-2, BLEU-3 and BLEU-4 are all 0.25, and the BLEU score is calculated with the NLTK library.
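For reference, a weighted BLEU score of this form can be computed with NLTK roughly as follows; this is a sketch of the metric call only, and the tokenization, smoothing choice and corpus-level aggregation used in the thesis may differ.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "pop singer jason derulo says he is a fan of the show".split()
hypothesis = "jason derulo says he likes the show".split()

# Equal weights of 0.25 for BLEU-1 to BLEU-4; smoothing avoids zero scores
# when a higher-order n-gram has no match in short sentences.
score = sentence_bleu(
    [reference],
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(score)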


Table 5 shows that, in the standard tests, the performance of the model becomes worse after the adversarial training. The model trained with the pointer-generator and the coverage mechanism shows the best performance. The results also indicate that none of the models implemented in the thesis outperforms the original paper [27], which means the results in the thesis are not state-of-the-art.

Attention Heat Map Visualization

The following figures visualize the attention using heat maps, and the heat map in Figure 23 is a small example of how an attention heat map represents the information.

Figure 23: A simple example of an attention heat map. The x-axis is the source text, and the y-axis is the summary. The color in cell a(i, j) represents the attention value showing how much the i-th word in the summary focuses on the j-th word in the source text; a brighter color means a larger value. The arrows show the direction of the summary (y-axis) and the source text (x-axis), respectively.

Figure 23 is an example of an attention heat map, and Figures 24 - 27 use the same format. Figure 23 shows that the attention heat map has good accuracy, meaning that every word in the summary places relatively high attention on the words with a similar meaning in the source text. For example, “jason derulo" in the summary refers to “pop singer jason derulo" in the source text.
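Heat maps of this kind can be produced with standard plotting tools; the following matplotlib sketch is only an illustration of the format described above (the attention matrix and token lists here are dummy values), not the plotting code used for the thesis figures.

import numpy as np
import matplotlib.pyplot as plt

def plot_attention(attention, summary_tokens, source_tokens):
    """attention: array of shape (len(summary_tokens), len(source_tokens));
    brighter cells correspond to larger attention values."""
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.imshow(attention, aspect="auto", cmap="viridis")
    ax.set_xticks(range(len(source_tokens)))
    ax.set_xticklabels(source_tokens, rotation=90)
    ax.set_yticks(range(len(summary_tokens)))
    ax.set_yticklabels(summary_tokens)
    ax.set_xlabel("source text")
    ax.set_ylabel("summary")
    fig.tight_layout()
    plt.show()

source = "pop singer jason derulo says that he sings".split()
summary = ["jason", "derulo", "says"]
attention = np.random.dirichlet(np.ones(len(source)), size=len(summary))  # dummy rows summing to 1
plot_attention(attention, summary, source)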


Figure 24: Attention Heat Map of Ground Truth

Figure 25: Attention Heat Map After Phase 1 (no coverage)

Figure 26: Attention Heat Map After Phase 2 (with coverage)

Figure 27: Attention Heat Map After Phase 4 (GAN)

In Figures 24 - 27, the red lines at the bottom show the intervals in the source text focused on by certain parts of the summary, and the white dashed lines are auxiliary guides that show the intervals more clearly. The data used in this set of visualizations is shown in Appendix B.2.1.

The purpose of Figures 24 - 27 is to show which parts of the source text are covered by the summaries produced by models from different training phases. Intuitively, a good summary should cover content similar to that of the ground-truth summary.


By comparing the red lines vertically, readers can easily see which parts are covered differently from the ground truth.

The desirable outcome of the visualization is that the heat map should cover the same content as in the heat map from the ground truth.

The results show that

1. From Figure 25, without the coverage mechanism, the sample generated by the model tends to repeat content in the summary.

2. From Figure 26, after training with the coverage mechanism, the sample generated by the model tends to cover more content than before.

3. From Figure 27, after training the model with the discriminator, the generated summary has a better sequential structure, and it covers more of the same content as the ground truth compared with the result before the adversarial training.

In order to validate that the results from Phase 4 have attention distributions closer to the ground truth, another experiment is described in the following section.

Attention Distribution Difference

Since the standard tests are mostly based on exact word matching, they may not be objective when the summaries are rephrased. A good summary should cover all the content covered in the ground-truth summary, and the similarity of the attention coverage, i.e. the similarity of the attention distributions, provides a good way of evaluating this. In order to compare the similarity of the attention distributions, a straightforward method is proposed. Assume that there are two attention matrices A_1 \in \mathbb{R}^{L_1 \times L_s} and A_2 \in \mathbb{R}^{L_2 \times L_s}, where L_1 and L_2 are the lengths of the respective summaries and L_s is the length of the source text.

The attention matrices are generated by feeding the respective summaries to the decoder of the pointer-generator network and collecting the attention distribution at each time step, which yields a matrix of size (summary length) × (source-text length). The distribution difference D(A_1, A_2) is calculated by

D(A_1, A_2) = \sum_{i=1}^{L_s} \left| c_i^1 - c_i^2 \right| \quad (48)

where

c_i^j = \sum_{t=1}^{L_j} A_j^{(t,i)} \quad (49)

The method sums the attention along the summary dimension and uses the L1 norm to evaluate the difference between the two attention matrices.
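A minimal NumPy sketch of Eqs. (48) and (49), given only as an illustration of the definitions above rather than the exact evaluation code of the thesis:

import numpy as np

def attention_difference(a1, a2):
    """Eq. (48): L1 distance between the column sums of two attention matrices.

    a1: (L1, Ls) attention matrix produced for the first summary
    a2: (L2, Ls) attention matrix produced for the second summary
    Both matrices must share the same source length Ls.
    """
    c1 = a1.sum(axis=0)          # Eq. (49): sum the attention over the summary dimension
    c2 = a2.sum(axis=0)
    return float(np.abs(c1 - c2).sum())

# Two summaries of different lengths over the same four-token source text.
a1 = np.array([[0.7, 0.1, 0.1, 0.1],
               [0.1, 0.6, 0.2, 0.1]])
a2 = np.array([[0.6, 0.2, 0.1, 0.1],
               [0.2, 0.5, 0.2, 0.1],
               [0.1, 0.1, 0.7, 0.1]])
print(attention_difference(a1, a2))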

The experiment is performed with the models from different training phases on the test data, and the mean of D is reported. The purpose is to compare the models from different phases on this metric to show which mechanism improves the performance. Note that in Table 6, the ground-truth attention distribution is generated by feeding the ground-truth summary to the decoder of the pointer-generator network and collecting the attention distribution at each time step.

The mean difference D with the ground truth is calculated as the mean, over all test data, of the attention distribution differences between the attention generated by the model from a given phase and the attention generated from the ground-truth summary.

The results comparing models from different phases are in Table 6.

References
