TEXT BLOCK PREDICTION AND ARTICLE RECONSTRUCTION USING BERT
Submitted by
Andreas Walter Estmark
A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a two-year Master of Arts degree
in Statistics in the Faculty of Social Sciences
Supervisors Måns Magnusson
Andreas Östling
Spring, 2021
ABSTRACT
Kungliga biblioteket (National Library of Sweden, KB) uses Optical Character Recognition (OCR) engines to extract and segment texts from their archive of daily newspaper articles. These systems are good at extracting and segmenting text on the paragraph level and lower (i.e., sentences, words, and characters), but less so on the article level, resulting in the segmentation of articles into text blocks that are not attached to their articles. In this thesis, BERT, a natural language processing (NLP) model, is fine-tuned on newspaper articles and used to reconstruct these articles by predicting whether a text block is the next one or not. A small data set of 127 text blocks from 21 articles is used. The best-performing BERT model achieved an accuracy of 94% on text block pair prediction when the blocks are ordered, which resulted in 13 reconstructed articles. The performance was reduced when selecting from all possible, unordered text pairs. It was also found that BERT performs well on clustering text blocks from the same articles.
CONTENTS
Introduction
1.1 Related Work
1.1.1 Syntactic Tasks
1.1.2 Article Extraction
1.2 Objective
Theory
2.1 Feed-Forward Neural Network (FFNN)
2.1.1 Parameter Updates
2.2 Vector Representations of Words
2.2.1 Tokenization
2.3 Transformer
2.3.1 Self-Attention
2.3.2 Layer Normalization and Residual Connection
2.4 Transfer Learning
2.5 Bidirectional Encoder Representations from Transformers (BERT)
2.5.1 Pre-Training
2.5.2 Fine-Tuning
Data
3.1 Descriptive Statistics
3.2 Pre-Processing
Methodology
4.1 Models
4.2 Training
4.2.1 Generating Train Data Structure
4.2.2 Hyperparameters
4.3 Test Data Structures
4.3.1 Text Block Prediction (TBP)
4.3.2 Page Block Prediction (PBP)
4.4 Metrics
Results
5.1 TBP results
5.2 PBP results
Discussion
Conclusion
Acknowledgments
Appendix
1 INTRODUCTION
Kungliga biblioteket (National Library of Sweden, KB) and other heritage institutions worldwide with extensive collections of texts are digitizing their printed materials. Among other printed materials, KB collects, preserves, and digitizes Swedish daily newspapers (Kungliga biblioteket 2020). The newspapers are digitized using Optical Character Recognition (OCR) systems, and their corpus "KubHist" contains Swedish daily newspapers spanning from 1645 to today (Dannells et al. 2019). Each issue is photographed and processed with OCR software (Adesam et al. 2019). OCR systems will continue to play a role in the digitization process while newspaper articles continue to be printed. These systems achieve high accuracy on text data and are easily accessible with open-source alternatives, making them a popular tool to digitize texts (Holley 2009).
OCR systems convert image input into readable text data (Saha et al. 2010). They segment and extract texts from images, such as newspapers, into units of text (often entire paragraphs¹, but more generally text that spans a specific part of a page), titles, images, and smaller units like dates and authors. The OCR segmentation is valuable for research on newspaper articles. The caveat is that segmenting only into standalone text blocks serves a limited purpose. Extracting whole articles would therefore be of major value to stakeholders.
OCR engines do not segment text blocks into articles, which implies that text blocks cannot be assigned to articles using OCR alone. Additionally, articles can span multiple pages, increasing the difficulty of assigning text blocks to their respective articles. OCR data are not solely segmented vertically or horizontally, which compounds the challenge of working with OCR'd data.
OCR systems can generate incorrect characters, cut paragraphs into multiple parts, and capture paragraphs incorrectly (Bjork et al. 2018). Tesseract is an accurate and open-source OCR system (Zacharias et al. 2020). Figure 1² shows how the Tesseract OCR system (see "An Overview of the Tesseract OCR Engine") can segment a page from a newspaper. Red lines indicate text units, and green lines indicate images. Figure 1 shows an example of inaccurate segmentation: the top left article (see Figure 1a) was segmented into multiple text blocks.
¹ Paragraphs aim to be coherent and contained units of text with a structure (Hearst 1997).
² KB does not use Tesseract on newspaper articles; this figure showcases how an OCR engine processes a page.
(a) First page of the issue
(b) Top left article of the front page. The OCR engine has incorrectly segmented the body of text, indicated by the many horizontal lines.
Figure 1: Segmentation of newspaper (Bjork et al. 2018).
Dannells et al. (2019) describe the need to improve OCR systems due to the increased demand for correct rendering of printed materials driven by "data driven research". This increased need to improve the OCR process is likely due to recent advances in Natural Language Processing (NLP) algorithms. NLP is the intersection of computational linguistics, statistics, and computer science. These algorithms make it possible for computers to read, analyze, and understand texts, more specifically their semantics, context, and syntax. Recent advancements in NLP include the Transformer, an NLP model that has achieved record-breaking results on benchmark NLP tasks (Vaswani et al. 2017).
To train a model on article reconstruction, a definition of what constitutes a reconstructed article is warranted. The chosen definition is the correct prediction of the text block pairs within an article and of its beginning and end. Reconstructing an article therefore consists of:
• Correct classification of the first and last text pairs (i.e., where the article overlaps with other articles at the beginning and end). In practice, this means using the ending of one article and the beginning of another as combined input.
• Correct classification of the text block pairs that make up the article, i.e., all text block pairs within it.
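The definition above can be sketched as a simple check over predicted pair labels. This is only an illustration under stated assumptions, not the procedure used later in the thesis: the function name `is_reconstructed` and the label convention (1 for "is the next block", 0 for "is not the next block") are hypothetical.

```python
def is_reconstructed(internal_pairs, boundary_pairs):
    """Return True if an article counts as reconstructed.

    internal_pairs: predicted labels for consecutive text block pairs inside
        the article; every one of them should be 1 ("is the next block").
    boundary_pairs: predicted labels for the pairs that straddle the article's
        beginning and end; every one of them should be 0 ("is not the next block").
    """
    inside_ok = all(label == 1 for label in internal_pairs)
    borders_ok = all(label == 0 for label in boundary_pairs)
    return inside_ok and borders_ok


# Hypothetical predictions for an article with four text blocks (three internal pairs)
print(is_reconstructed([1, 1, 1], [0, 0]))  # every pair classified correctly
print(is_reconstructed([1, 0, 1], [0, 0]))  # one internal pair missed
```

A single misclassified pair, internal or at the border, is enough for the article to count as not reconstructed under this definition.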
1.1 Related Work
The specific task of article reconstruction is relatively unexplored in NLP, but adjacent language tasks have been more explored. The relation between sentences and article extraction are two areas that tie into text block pair prediction and article reconstruction.
1.1.1 Syntactic Tasks
BERT is a recently published, Transformer-based language model that achieves record-breaking results on different language understanding tasks, including document classification (Adhikari et al. 2019, Devlin et al. 2018) and relation extraction (Shi & Lin 2019). Shi & Demberg (2019) claim that BERT's ability to capture implicit discourse relations in texts is due to its next sentence prediction (NSP) task. NSP is one of BERT's two pre-training tasks.
Shi & Demberg (2019) motivate the choice of NSP by the fact that predicting the next likely segment helps with discourse expectations. Implicit discourse relation classification aims to classify whether texts are implicitly connected. An implicit relation between two events means that no explicit connective words tie the texts together. Examples of explicit connectives presented in the paper are "but," "because," and "however."
Shi & Demberg (2019) argue that the NSP task is a good fit for implicit discourse relations because it allows for the representation of what a typical text block would look like, and they conclude that the next sentence prediction task gives BERT a good ability to capture semantic connections. Their research raises the question of how BERT can be used to reconstruct articles, since it uses one of BERT's pre-training tasks.
Ghosh et al. (2016) present a Contextual Long Short-Term Memory network (CLSTM) and evaluate it on three different NLP tasks: word prediction, sentence topic prediction, and next sentence selection. This research presents one approach to how the relation between texts can be modeled. Their approach to next sentence selection is to provide candidate sequences and predict which one is the most probable conditioned on the sentences the model has already seen. The model trains on sentences that are part of the same sequence, meaning they have the same topic. They use supervised learning to label text segments with their topic. The model predicts the last sentence given the first sentences and the shared context. The final model obtains an accuracy of 63%. This method requires a label for each text segment to predict the next segment. It also requires deciding candidate segments where some can be of the same context; however, text blocks from the same article will naturally share context.
1.1.2 Article Extraction
There is previous research on identifying article structure. Palfray et al. (2012) use conditional random fields to extract newspaper articles from images. They define an article as a structure with a title, followed by a "text entity" and ending with either a horizontal separator or a new title. They use machine learning to segment articles by defining their general logical structure and then detecting structural entities such as horizontal and vertical separators.
They use a test dataset of 42 images, and their method identified 85.84% of the articles on the test data. However, this method requires steps before the OCR process.
1.2 Objective
This thesis aims to explore and study the challenge of combining text blocks segmented by an OCR engine by analyzing and implementing a predictive model for text pair prediction, and ultimately to reconstruct the daily newspaper articles.
The two main research questions are:
1. How well can a pre-trained NLP model predict if the second text block of a text pair, digitized by OCR, is, in fact, the subsequent text block?
2. How can the model subsequently be applied to reconstruct OCR’d newspaper articles?
BERT, a Transformer-based, state-of-the-art language model, obtains record-breaking results on NLP task sets such as the General Language Understanding Evaluation (GLUE) benchmark (Devlin et al. 2018). BERT is used in this thesis to answer the research questions. The thesis also compares the fine-tuned BERT with only the pre-trained version of BERT on the same classification task.
2 THEORY
Devlin et al. (2018) proposed BERT based on the Transformer architecture (Vaswani et al. 2017), which is a deep learning architecture widely used in NLP. The Transformer consists of encoders and decoders with sub-layers of self-attention and feed-forward neural networks. This section outlines the theoretical background of the Transformer and BERT and why the architecture is the method of choice to answer the research questions.
2.1 Feed-Forward Neural Network (FFNN)
The Feed-forward neural network is the simplest neural network, where the data travels exclusively in one direction. The architecture applies a linear transformation followed by a non-linear transformation and has become a prevalent method in NLP. A fully connected FFNN consists of an input layer, a selected number of hidden layers, and an output layer. Figure 2 portrays a fully connected FFNN where the circles indicate nodes that are connected in the direction of the arrows.
The number of nodes (or neurons) in a layer is also called the layer's size. Figure 2 shows an FFNN with an input layer of three nodes, three nodes in the only hidden layer and one in the output layer. The first layer (green nodes) is the input, denoted as $x_i$, where $i = 1, \dots, n$ and $n$ is the input size, three in the figure. The middle layer (blue nodes) is denoted as $h_j$, where $j = 1, \dots, 3$ indexes the nodes in the middle layer, and $y$ denotes the output. The yellow circles indicate the bias, and the total number of biases in the network equals the sum of the hidden size and output size.
Figure 2: A fully connected Feed-forward neural network for binary classification
There are three biases in the hidden layer, denoted as $b^{(1)}_j$, $j = 1, \dots, 3$, and represented by a yellow circle with three lines. Each of the three hidden nodes is the weighted sum of the input nodes in the preceding layer plus a bias, followed by a non-linear activation function $\sigma$.
$$h_1 = \sigma\left(w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + w^{(1)}_{13} x_3 + b^{(1)}_1\right)$$
$$h_2 = \sigma\left(w^{(1)}_{21} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{23} x_3 + b^{(1)}_2\right)$$
$$h_3 = \sigma\left(w^{(1)}_{31} x_1 + w^{(1)}_{32} x_2 + w^{(1)}_{33} x_3 + b^{(1)}_3\right),$$
where
$$\sigma(x) = \frac{1}{1 + e^{-x}},$$
which is called the sigmoid activation function. The input-to-node connections in the hidden layer have weights, denoted by $W^{(1)}$, which is a parameter matrix with superscript 1 because it is the weight matrix of the first and only hidden layer. The output of this FFNN is expressed the following way:
$$y = \sigma\left(w^{(2)}_1 h_1 + w^{(2)}_2 h_2 + w^{(2)}_3 h_3 + b^{(2)}\right), \tag{1}$$
where $y$ is the weighted sum of the hidden nodes with the second-layer weights plus a bias, to which an additional activation is applied. The last two layers can be compactly presented by the two expressions below, where the activation function operates element-wise.
$$h = \sigma\left(W^{(1)} x + b^{(1)}\right) \tag{2}$$
$$y = \sigma\left(w^{(2)} h + b^{(2)}\right). \tag{3}$$
A neural network without a hidden layer with a linear activation function, and one output node, on the other hand, is a linear regression model. If there is more than one output node, it is a multivariate regression model.
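The forward pass of the small network in Figure 2 can be sketched in a few lines of plain Python. This is a minimal illustration of Equations 2 and 3; the function name `forward` and all parameter values are made up for the example and are not taken from the thesis.

```python
import math


def sigmoid(z):
    """Sigmoid activation, 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Forward pass of a 3-3-1 feed-forward network with sigmoid activations."""
    # Equation 2: each hidden node is sigma(weighted sum of the inputs + bias)
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    # Equation 3: the output is sigma(weighted sum of the hidden nodes + bias)
    y = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
    return h, y


# Arbitrary illustrative parameters (hypothetical values)
x = [1.0, 0.5, -1.0]
W1 = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [-0.5, 0.1, 0.1]]
b1 = [0.0, 0.1, -0.1]
w2 = [0.7, -0.3, 0.2]
b2 = 0.05

h, y = forward(x, W1, b1, w2, b2)
print("hidden layer:", h)
print("output:", y)
```

Because the last activation is a sigmoid, the output lies strictly between 0 and 1 and can be read as the probability of the positive class in binary classification.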
Another activation function is the softmax activation function, of which the sigmoid is a special case. The softmax function is commonly applied as the last activation function in neural networks. For example, if Figure 2 had two output nodes and the softmax function as the last activation, the output would be normalized into a probability distribution over the two classes. Besides the softmax, other common activation functions are ReLU and GELU (Figure 3).
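The relation between the softmax and the sigmoid can be checked numerically. The sketch below assumes a two-class softmax where the second logit is fixed at zero; the value of `z` is arbitrary.

```python
import math


def softmax(logits):
    """Normalize a list of logits into a probability distribution."""
    m = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


z = 1.3  # arbitrary logit
probs = softmax([z, 0.0])

# With two classes and the second logit fixed at 0, the first softmax
# probability e^z / (e^z + 1) coincides with the sigmoid of z.
print(probs[0], sigmoid(z))
print(sum(probs))
```

The two printed probabilities agree up to floating-point error, and the softmax output sums to one.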
(a) Sigmoid function (b) Rectified Linear Unit (ReLU) function.
(c) Gaussian Error Linear Units (GELU) function
Figure 3: Three common activation functions
BERT uses the Gaussian Error Linear Unit (GELU) activation function in the FFNN (Devlin et al. 2018) instead of ReLU. Hendrycks & Gimpel (2016) show that GELU performs better than ReLU in NLP tasks. GELU is approximated with the following expression:
$$\mathrm{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715x^{3}\right)\right]\right). \tag{4}$$
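As a quick sanity check of the approximation, the tanh form can be compared with the exact definition of GELU, $x\,\Phi(x)$, where $\Phi$ is the standard normal cumulative distribution function (Hendrycks & Gimpel 2016). The function names below are chosen for this sketch only, and the constant 0.044715 is the one given by Hendrycks & Gimpel.

```python
import math


def gelu_exact(x):
    """GELU as x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


for x in [-3.0, -1.0, 0.0, 0.5, 2.0]:
    print(x, gelu_exact(x), gelu_tanh(x))
```

For the inputs shown, the two versions differ only in the later decimal places, which is why the cheaper tanh form is commonly used in practice.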
2.1.1 Parameter Updates
A neural network updates its weights to reduce its error. These weights are in practice updated using gradient descent, an optimization algorithm. First, the FFNN expressions outlined are computed, and some value of a loss function is calculated to measure the error.
An example of a loss function is the cross-entropy loss function, where the loss value increases as predictions deviate further from the true label. This is also the loss function used in this thesis, where the two classes are whether a text block is the next text block or not.
Below is an expression of the cross-entropy loss function for two classes (also known as the negative log of the Bernoulli distribution).
$$-\log L(\hat{p} \mid y) = -y \log(\hat{p}) - (1 - y)\log(1 - \hat{p}), \tag{5}$$
where $L$ is the likelihood function, $\hat{p}$ is the probability for the positive class and $1 - \hat{p}$ is the probability for the negative class. The true label is $y$.
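Equation 5 translates directly into code. The sketch below is illustrative; the clipping constant `eps` is an addition to avoid taking the logarithm of zero and is not part of the equation.

```python
import math


def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Negative Bernoulli log-likelihood for one example.

    y: true label (0 or 1); p_hat: predicted probability of the positive class.
    """
    p_hat = min(max(p_hat, eps), 1.0 - eps)  # clip to keep the logarithms finite
    return -y * math.log(p_hat) - (1 - y) * math.log(1.0 - p_hat)


# The loss grows as the prediction moves away from the true label y = 1
for p in [0.9, 0.6, 0.1]:
    print(p, binary_cross_entropy(1, p))
```

In the setting of this thesis, `y = 1` would correspond to "the second text block is the next block" and `p_hat` to the model's predicted probability of that class.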
An optimization algorithm aims to minimize the value of the loss function with respect to its parameters, θ. Stochastic gradient descent (SGD) is a gradient-based optimization algorithm that approximates the actual gradient of the loss function. It uses only a single, randomly selected training example at a time (the superscript i in Equation 6) to compute the gradient of the loss function and update the weights. This method is more efficient than computing the gradient for the entire training data. Mini-batch gradient descent is another gradient descent algorithm that instead uses a batch of training examples to calculate the mean gradient and update the weights. This increases the computation speed compared to SGD.
Ruder (2016) presents SGD in the following way:
$$\theta_t = \theta_{t-1} - \eta \cdot \nabla_{\theta} J\left(\theta; x^{(i)}; y^{(i)}\right), \tag{6}$$
where $\theta$ represents the parameters of the loss function, $\eta$ is the learning rate, which determines the size of each step taken towards the minimum of the loss function, and $J(\theta)$ is the loss function, which is differentiable with respect to its parameters $\theta$. Kingma & Ba (2015) first suggest Adam (short for Adaptive Moment Estimation), a common stochastic optimization algorithm. The algorithm uses estimates of the mean and the variance of the loss function's gradient. It scales well to high-dimensional tasks and outperforms other stochastic optimization algorithms (see Kingma & Ba 2015 for details).
$$\theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \tag{7}$$
where $\hat{m}_t$ is an average of the past gradients and $\hat{v}_t$ is an average of past squared gradients. $\hat{m}_t$ is the bias-corrected estimate of the gradient's mean and a function of $\beta_1$. $\hat{v}_t$ is the bias-corrected estimate of the gradient's variance and a function of $\beta_2$. $\beta$ denotes the exponential decay terms, and $\epsilon$ is a small constant that prevents division by zero. The optimizer updates the weights and biases of the neural network in the direction opposite to the loss function's gradient.
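The two update rules can be compared on a toy problem. The sketch below minimizes the one-dimensional loss $J(\theta) = (\theta - 3)^2$, which is chosen purely for illustration; the learning rate and the number of steps are arbitrary, and the default values of `beta1`, `beta2` and `eps` follow Kingma & Ba (2015).

```python
import math


def grad(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def sgd_step(theta, g, lr):
    """Plain gradient step: move against the gradient, scaled by the learning rate."""
    return theta - lr * g


def adam_step(theta, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1.0 - beta1) * g      # running average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g  # running average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)         # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)         # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta_sgd = 0.0
theta_adam, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):
    theta_sgd = sgd_step(theta_sgd, grad(theta_sgd), lr=0.1)
    theta_adam, m, v = adam_step(theta_adam, grad(theta_adam), m, v, t)

print(theta_sgd, theta_adam)  # both should end up close to the minimum at 3
```

On a real network the gradient would come from a mini-batch of training examples rather than from a closed-form expression, and each parameter would keep its own `m` and `v`.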
I use an updated version of Adam in fine-tuning BERT for article reconstruction, Adam with decoupled weight decay (AdamW), which allows for separate optimizing of the weight decay and learning rate (Loshchilov & Hutter 2017). Weight decay is a regularization technique to prevent a model from overfitting by adding a small penalty term to the loss function, penalizing big weights.
2.2 Vector Representations of Words
Word embeddings are techniques to map words to vectors of real numbers for text analysis (Mandelbaum & Shalev 2016). Embeddings are dense vector representations of words, and an example implementation is to encode the meaning of a word so that its vector is close to those of similar words in vector space. Figure 4 presents an example of two word embeddings for "Cat" and "Dog".
Figure 4: Word embedding example of the words cat and dog with each represented by a dense vector of arbitrary length.
Figure 4 portrays two 1x3 vectors, exemplifying how two words can be embedded. Each element in the vector represents some characteristic; for example, the last element in Figure 4 can represent "animal". This method of embedding words is called dense word embeddings, where each element contains a floating-point value, in contrast to one-hot encoding, where a 1 in a sparse n by n matrix represents a specific word.
Like the weights in an FFNN, the word vectors are trainable parameters. A popular word embedding model is Word2vec (Mikolov et al. 2013), which uses a two-layered neural network to learn word embeddings. The embeddings contain information; for example, the words in Figure 4 would have similar values in some elements because both are domesticated animals, illustrated by the last element of each vector. However, this word embedding method creates context-independent representations.
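The difference between dense embeddings and one-hot encoding can be made concrete with cosine similarity. The vectors below are invented for illustration (they are not the values in Figure 4), and the word "car" is added only as a contrast.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical dense 1x3 embeddings; the last element loosely encodes "animal"
dense = {
    "cat": [0.2, -0.4, 0.9],
    "dog": [0.3, -0.1, 0.8],
    "car": [0.9, 0.5, -0.7],
}

# One-hot encoding: one dimension per vocabulary word, a single 1 per vector
vocab = list(dense)
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

print(cosine(dense["cat"], dense["dog"]))      # similar words, high similarity
print(cosine(dense["cat"], dense["car"]))      # dissimilar words, low similarity
print(cosine(one_hot["cat"], one_hot["dog"]))  # one-hot vectors carry no similarity
```

Dense vectors let related words end up close together, whereas any two distinct one-hot vectors are orthogonal.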
Devlin et al. (2018) state that BERT incorporates context from both directions. BERT uses contextual word representations, where each token's representation is a function of both the token itself and the rest of the input. Liu et al. (2020) attribute record-breaking results on NLP tasks to this type of embedding.
2.2.1 Tokenization
Tokenization is a method that splits words based on their existence in a vocabulary. The standard way to represent words is to assign each a unique ID that maps to an initial word embedding. Using the vocabulary created by KB, the phrase "Kungliga biblioteket digitaliserar sina samlingar" tokenizes into "Kungliga, Biblioteket, digital, ##iserar, sina, samlingar".
"[D]igitaliserar" is divided into two smaller words, "digital" and "##iserar". If a word splits into subwords, all components after the first one are prefixed with ##. The tokens have the following mapping IDs:
[11798, 14551, 16949, 13446, 604, 21315],
and these integers map to an initial dense embedding. The first step checks whether the word is in the vocabulary. The second step is to look for words within the original word. The example shows that "digital" is in the vocabulary. "[I]serar" is also tokenized. Furthermore, a word is split into individual characters if not in the vocabulary. Subwords or even single characters represent words that are not in the vocabulary (Wu et al. 2016).
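The splitting procedure can be sketched as a greedy longest-match-first search, which is how WordPiece-style tokenizers are commonly implemented. The tiny vocabulary below is hypothetical and far smaller than KB's real vocabulary, and the fallback to an `[UNK]` token when no piece matches is an assumption of this sketch.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter substring
        if piece is None:
            return [unk]  # no piece matched at this position
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (hypothetical)
vocab = {"Kungliga", "biblioteket", "digital", "##iserar", "sina", "samlingar"}
sentence = "Kungliga biblioteket digitaliserar sina samlingar"
tokens = [tok for word in sentence.split() for tok in wordpiece(word, vocab)]
print(tokens)
```

With single characters added to the vocabulary (for example "d" and "##i"), the same loop would fall back to character-level pieces for unseen words instead of returning `[UNK]`.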
2.3 Transformer
Vaswani et al. (2017) present the Transformer, a deep learning model consisting of stacked Transformer layers with encoders and decoders. They employ six layers in the paper. A reason for the success of the Transformer architecture is its modeling of context-dependent word representations using the attention mechanism. Figure 5 shows half of the Transformer's architecture, known as the encoder, which is the block BERT uses.
The encoder uses word embeddings of dimension $d_{\text{model}}$ as input, explained in Section 2.2. Self-attention is applied in the "Multi-Head Attention" sublayer in the figure, which is followed by a fully connected, position-wise FFNN (see Section 2.1). A residual connection and a layer normalization wrap both sub-layers in the encoder. The output of the encoder is a new vector representation for each token.
Figure 5: Model architecture of the encoder. Figure derived from (Vaswani et al. 2017)
2.3.1 Self-Attention
Two sublayers make up the encoder, and the first is the focus of this section. Self-attention is a mechanism that computes the relations between words in a sequence. These relations are encoded into the representations of the sequence that the function outputs. A simple example of the attention mechanism follows to show how the function represents tokens.
Each word embedding is multiplied with three pre-trained matrices, known as Value, Key and Query, whose weights determine how important the other tokens in the input are when creating the current token's representation (Clark et al. 2019). The word embeddings $x_1$ and $x_2$ are multiplied with the matrices $W^V$, $W^Q$ and $W^K$, creating the $v$, $q$ and $k$ vectors seen at the bottom of Figure 6.
Figure 6: $x_1$ and $x_2$ represent two word embeddings. These are generally vectors of much larger size; for example, BERT's embeddings are 768 in length. The matrices in the top-right corner are attention weights and are multiplied with the word embeddings to output the vectors at the bottom.
The two example word embeddings in Figure 6, $x_1$ and $x_2$, each have a dimension of 1x3.
The second step of the attention function is to compute the dot product of $q$ and $k$ for each embedding. Each dot product is scaled by $1/\sqrt{d_k}$, known as the scaling factor. These two steps are outlined below for both words. $s_{11}$ and $s_{12}$ are the two output values for $x_1$.
$$s_{11} = \frac{q_1 \cdot k_1}{\sqrt{d_k}}, \quad s_{21} = \frac{q_2 \cdot k_1}{\sqrt{d_k}}, \quad s_{12} = \frac{q_1 \cdot k_2}{\sqrt{d_k}}, \quad s_{22} = \frac{q_2 \cdot k_2}{\sqrt{d_k}},$$
where the two obtained values for the first token subsequently go through the softmax activation function. The two softmax scores multiply with respective value vectors in the next step. The last step is to sum the weighted value vectors. Therefore, how "important" $x_2$ is to $x_1$ has been encoded into $x_1$'s representation. Equation 8 describes a compact representation of the attention formula.
$$\mathrm{Attention}(Q, K, V) = \sigma\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \tag{8}$$
where $\sigma$ here denotes the softmax function.
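The steps of the attention function can be traced with two toy tokens. The sketch below implements Equation 8 with plain Python lists; the embeddings and the projection matrices `W_Q`, `W_K` and `W_V` hold made-up values and do not correspond to Figure 6.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; returns the new representations and the weights."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    # Dot products q_i . k_j, each scaled by 1 / sqrt(d_k)
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    # Softmax over each row, so the weights for one token sum to 1
    weights = [softmax(row) for row in scores]
    # Weighted sum of the value vectors
    return matmul(weights, V), weights


# Two tokens with hypothetical 1x3 embeddings and 3x2 projection matrices
X = [[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
W_V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
Z, weights = attention(Q, K, V)
print("attention weights:", weights)
print("new representations:", Z)
```

Row i of `weights` states how much each token contributes to the new representation of token i.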
Figure 7: Example of how the word "it" encapsulates the words "the animal didn't" and "was too tired", highlighted in orange and green. Figure from (Jay Alammar 2018).
In practice, $x_1$ and $x_2$ are instead much longer representations and are concatenated into a matrix. The attention function outputs a matrix of new representations, collectively referred to as an attention head. Moreover, instead of obtaining a single attention head, Vaswani et al. (2017) compute multiple attention mechanisms in parallel and independently. This is called multi-head attention, and they apply it because it allows the model "to jointly attend to information from different representation subspaces at different positions".
$$Z = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(H_1, \dots, H_h)W^{O} \tag{9}$$
$$H_i = \mathrm{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right), \tag{10}$$
where $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ and $W^{O}$ are matrices that have been trained and were initially drawn from the Normal distribution (Xiong et al. 2020), and $h$ is the number of heads. Concat is short for concatenation and means to join together, so the attention heads are joined together into one big matrix. $W^{O}$ is an additional pre-trained matrix that is multiplied with the concatenated attention heads to condense the output for the next step. This final matrix has information from $h$ heads. Vaswani et al. (2017) discuss how each attention head learns to perform different tasks, yielding results that are easier to interpret.
Figure 7 shows the focus of two heads concerning the word "it". This figure displays how two attention heads capture different word relations. The number of heads is scalable; Vaswani et al. (2017) employ eight attention heads.
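Equations 9 and 10 can be illustrated with two heads on toy input. In the sketch below the same input matrix `X` plays the role of Q, K and V (self-attention), and the per-head projection matrices simply select different coordinates; all sizes and values are assumptions made for this example.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per attention head."""
    # Equation 10: every head attends independently in its own subspace
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Equation 9: concatenate the heads token by token, then project with W_O
    concat = [sum((H[i] for H in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)


# Hypothetical toy sizes: 2 tokens, model dimension 4, h = 2 heads of size 2
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
P_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # selects coordinates 1-2
P_b = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # selects coordinates 3-4
heads = [(P_a, P_a, P_a), (P_b, P_b, P_b)]
W_O = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity

Z = multi_head(X, heads, W_O)
print(Z)
```

The output has one row per token and the same width as the input, so it can be passed on to the next sub-layer unchanged in shape.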
2.3.2 Layer Normalization and Residual Connection
The output of each encoder sub-layer passes through a residual connection followed by a normalization layer. Xu et al. (2019) show that layer normalization enables faster training in the Transformer. LayerNorm rescales the input and produces the following as output for each sub-layer: $\mathrm{LayerNorm}(x + \mathrm{sublayer}(x))$, where $x$ is either the word embedding or the output from the previous sublayer, and $\mathrm{sublayer}(x)$ is the function currently applied in the encoder. He et al. (2015) show that residual connections help retain information from previous layers and improve accuracy. Figure 8 illustrates an example residual connection with two hidden layers and ReLU as the activation in between.
Figure 8: Illustration of residual connection. Figure derived from (He et al. 2015).
The output from the first encoder sub-layer continues to a position-wise FFNN where there are no dependencies between the token representations. Similar to the network outlined in Section 2.1, a two-layered neural network is applied to the attention output with ReLU as activation. Position-wise implies that the FFNN is applied to each token representation independently. Another LayerNorm function is applied to the FFNN output.
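The wrapping of both sub-layers can be sketched for a single token vector. The example below is a simplification: the layer normalization omits the learned gain and bias parameters, and the attention output and FFNN weights are made-up values.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer_out):
    """LayerNorm(x + sublayer(x)): residual connection followed by normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])


def relu(v):
    return max(0.0, v)


def ffn(x, W1, b1, W2, b2):
    """Two-layer position-wise FFNN applied to one token representation."""
    h = [relu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)]


# One token representation and a hypothetical attention output for it
x = [1.0, 2.0, 3.0]
attn_out = [0.5, 0.5, 0.5]
z = add_and_norm(x, attn_out)  # first sub-layer: attention + residual + LayerNorm

W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # 3 -> 2 (hypothetical weights)
b1 = [0.0, 0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 2 -> 3
b2 = [0.0, 0.0, 0.0]
out = add_and_norm(z, ffn(z, W1, b1, W2, b2))  # second sub-layer
print(out)
```

Because `ffn` only ever sees one token vector at a time, applying it to every token in a sequence involves no interaction between positions, which is what "position-wise" refers to.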
2.4 Transfer Learning
Transfer learning (TL) transfers "knowledge" from having solved broad tasks to similar, more domain-specific tasks (Zhuang et al. 2019). It aims to address the challenge of both labeled data and compute power being scarce. A TL model has its weights trained in advance, meaning it does not have to train from scratch. For example, if a model is trained to detect dogs, transfer learning makes it possible for a model, without much training, to perform well on detecting cats, too (Neyshabur et al. 2021). An FFNN trains on detecting dogs by updating its parameters (weights and biases), and if the same model is used to detect cats, it is possible to use the pre-trained weights from the network trained on dogs.
Generally, NLP models are pre-trained on some language task(s) using an unlabelled corpus. Pre-training aims to create models with a general language understanding, and in NLP, pre-training models aim to capture general language patterns, enabling more domain-specific models to utilize the gained knowledge from the pre-training. Next sentence prediction and masked token prediction are two examples of pre-training tasks in NLP.
2.5 Bidirectional Encoder Representations from Transformers (BERT)
BERT has become a popular choice in NLP. Its success is partly due to its pre-training tasks that encode non-directional context and the relationship between two text segments in its token representations. Goldberg (2019) shows that BERT performs well on finding syntactical patterns in the English language, making it an appropriate model for classifying whether a text block is the next one or not.
BERT was initially configured with two architectures, a base and a large version, where the base version has 12 encoders with 768 as the size of the hidden layers and 12 attention heads (see Section 2.3.1), totaling 110M parameters. The model was trained on BooksCorpus and the English Wikipedia, containing 3300M words combined. The original BERT uses WordPiece embeddings with a vocabulary of 30522 tokens (Devlin et al. 2018).
In addition to token embeddings, BERT adds two more embedding layers to its input. Segment embedding embeds to which input segment the tokens belong. In Figure 9, A represents tokens in the first segment, and B represents tokens in the following segment. The third embedding layer consists of positional embedding, representing the position of the tokens in the input. The position embedding consists of two parts, an integer of its location and how close it is to the other words in the sequence. The three embedding layers in BERT sum element-wise to form BERT's input (Klein et al. 2017).
Figure 9: BERT embedding layers. If the input is a single segment, the segment embeddings only consist of A, and the first [SEP] token is discarded. The maximum number of tokens BERT can use as input is 512, including special tokens (Devlin et al. 2018).
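The element-wise sum of the three embedding layers can be sketched as three lookup tables. Everything in the example is a toy assumption: the hidden size, the table sizes, the random initial values and the token ids are chosen for illustration only.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size; BERT base uses 768


def table(rows):
    """Create a small lookup table of randomly initialized embedding vectors."""
    return [[random.gauss(0.0, 0.02) for _ in range(HIDDEN)] for _ in range(rows)]


token_emb = table(10)    # hypothetical vocabulary of 10 token ids
segment_emb = table(2)   # row 0 = segment A, row 1 = segment B
position_emb = table(8)  # toy maximum length of 8 (BERT allows 512)


def embed(token_ids, segment_ids):
    """Sum token, segment and position embeddings element-wise for each position."""
    return [
        [t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[pos])]
        for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids))
    ]


# Made-up ids for an input of the form: [CLS] a b [SEP] c [SEP]
token_ids = [1, 5, 6, 2, 7, 2]
segment_ids = [0, 0, 0, 0, 1, 1]
E = embed(token_ids, segment_ids)
print(len(E), len(E[0]))
```

The two [SEP] tokens share a token id but receive different input vectors, because their segment and position embeddings differ.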
2.5.1 Pre-Training
BERT is pre-trained on two self-supervised tasks using unlabeled data: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM allows for non-directional training by masking random input tokens and predicting them. NSP helps the model to understand sentence relationships by predicting if a text segment is the subsequent segment or not.
NSP aims to represent the relation between two text segments, allowing for downstream tasks built on similar relationships, such as text pair relationships. Devlin et al. (2018) state that removing NSP from pre-training reduces performance on different NLP tasks. BERT uses two text segments from the corpus for the NSP task: 50% of the time the second segment is the actual next segment, and 50% of the time it is a randomly selected segment used as a negative example. Figure 10 visualizes both pre-training tasks, where the classification token [CLS] is prepended to the input.
Figure 10: Architecture from BERT for next sentence prediction and Masked LM (Devlin et al. 2018). Sentence A has n tokens and sentence B has m.
Special token C, seen in the top-left corner, is the final hidden vector of the [CLS] token. This is used "as the aggregate sequence representation for classification tasks" (Devlin et al. 2018), and its representation is used in binary classification tasks. A classification layer transforms the [CLS] token output into a 2x1 vector or a single node. A softmax activation function outputs the probability that segment B follows segment A. In the pre-training of the original BERT, the final model achieves 97%-98% accuracy on NSP (Devlin et al. 2018).
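The construction of NSP training pairs (a true successor about half of the time, a random segment otherwise) can be sketched as follows. This is a simplification for illustration: the function name `make_nsp_pairs` is hypothetical, and the negative example is drawn from the same list of segments, whereas Devlin et al. (2018) sample it from the corpus.

```python
import random


def make_nsp_pairs(segments, rng):
    """Build (segment_a, segment_b, label) triples; label 1 means b follows a."""
    pairs = []
    for i in range(len(segments) - 1):
        if rng.random() < 0.5:
            # Positive example: the actual next segment
            pairs.append((segments[i], segments[i + 1], 1))
        else:
            # Negative example: any segment except a itself and its true successor
            candidates = [s for j, s in enumerate(segments) if j not in (i, i + 1)]
            pairs.append((segments[i], rng.choice(candidates), 0))
    return pairs


segments = ["s0", "s1", "s2", "s3", "s4"]  # placeholder text segments
pairs = make_nsp_pairs(segments, random.Random(0))
for a, b, label in pairs:
    print(a, b, label)
```

The same scheme carries over to text blocks: consecutive blocks from an article form positive pairs, and blocks that do not follow each other form negative pairs.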
MLM predicts the most probable token for each masked token using $T_i$, and the output layer for the masked tokens applies the softmax activation function. $T_i$ denotes the final hidden vector representation for input token $i$. 15% of the input tokens are selected for prediction. In 80% of these cases, a [MASK] token replaces the selected token. 10% of the time, the token is replaced with a random token. In the remaining 10% of the cases, the token is left unchanged, reducing the difference between pre-training and fine-tuning, where input is not masked. Both pre-training tasks run in parallel, so the NSP task uses incomplete data.
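The masking procedure of Devlin et al. (2018) selects 15% of the tokens and then applies an 80/10/10 split to the selected tokens. The sketch below follows that scheme; the vocabulary and input are toy values, and special tokens such as [CLS] and [SEP] are not treated separately here.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return the corrupted token list and a dict {position: original token}."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected for prediction
        targets[i] = tok  # the model must predict the original token here
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, targets


vocab = ["katt", "hund", "hus", "bil", "bok"]  # toy vocabulary
rng = random.Random(1)
tokens = [rng.choice(vocab) for _ in range(1000)]
masked, targets = mask_tokens(tokens, vocab, rng)
print(len(targets), "of", len(tokens), "tokens selected for prediction")
```

Only the positions stored in `targets` contribute to the MLM loss; all other positions pass through unchanged.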
Devlin et al. (2018) state that the training loss is the sum of the mean NSP likelihood and the mean MLM likelihood, and the model is trained to minimize the combined loss. The combined loss function is a linear combination of the two loss functions (Zhang et al. 2020); both tasks apply cross-entropy and the AdamW optimizer (see Section 2.1).
2.5.2 Fine-Tuning
BERT is relatively easy to fine-tune compared to other recently published language models. ELMo, for example, requires a task-specific architecture for every downstream task, whereas BERT requires only comparatively small changes (Zhang et al. 2020). Applying BERT to a domain-specific task requires one or more additional layers on top of the final layer. The input is either a single sentence or a pair of sentences. In binary classification tasks, a linear classifier is added to the pre-trained model. The pre-trained parameters are also updated during fine-tuning.
3 DATA
The training data are retrieved from Mediearkivet³, a regularly updated database of Swedish daily newspapers. 2469 articles were retrieved, and every row in the data represents a text block. The articles were printed in Sweden either in late 2019 or throughout 2019, depending on each newspaper's availability. Some newspapers have premium services, which reduces the number of articles available. For example, because few articles from Svenska dagbladet and Dagens nyheter were available, their articles span all of 2019 instead of only part of the year. They are therefore less represented in the training data.
Test data are made available by KB and consist of articles printed in 2017 and onward, converted into readable text with OCR. The first ten pages, excluding the cover, of an Expressen issue printed in 2020 were retrieved. Additionally, articles were retrieved from other newspapers represented in the training data and from more local newspapers, with one page collected per newspaper. Day, month, and newspaper are all randomly selected⁴. Only text blocks that form articles were selected. The final size of the test data is 127 text blocks from 21 articles. Some of these articles are on the same page.
3.1 Descriptive Statistics
The training data consist of 76629 text blocks from 2469 articles, where the mean number of words per article is 495, counting both words and stand-alone characters⁵. Text blocks can be anything from a few words to whole articles, which is due to how the training data were retrieved.
The test data consist of 127 text blocks from 21 articles, where the mean number of words per article is 475. Figure 11 displays the distribution of the number of words per article in both data sets. The dotted blue lines indicate the mean number of words per article.
³ https://www.ub.gu.se/sv/databaser/mediearkivet
⁴ If the generated page number only contains stock values or ads, the next page with articles is selected.
⁵ For example, a hyphen (-) with white space on both sides also counts as a word, so the mean is slightly higher than it would be if only words were counted.
(a) Frequency of articles based on number of words used in training the model.
(b) Frequency of articles based on number of words used as test data.
Figure 11: Data
Summary statistics for the data sets are in Table 1. The mean number of words per text block is much lower in the training data than in the test data because each sentence is a text block in the retrieval process. However, as described in Section 4.2.1, the text blocks in the training data are joined together to form longer sequences.
Table 1: Summary statistics

                                 Train data               Test data
Variable                         Articles   Text blocks   Articles   Text blocks
Number of units                  2469       76629         21         127
Mean number of words per unit    495        15.918        475        73.3
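The word counts behind these means can be sketched as follows, assuming that a "word" is any whitespace-separated piece of text. This matches the remark that a stand-alone hyphen is counted, but the exact counting procedure used for the thesis is not shown here, and the function names and example blocks are hypothetical.

```python
def count_words(text):
    # Split on any whitespace; every non-empty piece counts as one "word",
    # so a stand-alone hyphen is counted as well.
    return len(text.split())

def mean_words_per_unit(units):
    # Mean number of words over a list of units (articles or text blocks).
    return sum(count_words(u) for u in units) / len(units)

blocks = ["Regeringen lade fram budgeten - trots kritik.", "Läs mer på sidan 4."]
print(count_words(blocks[0]))        # 7, of which one is the hyphen
print(mean_words_per_unit(blocks))   # 6.0
```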
3.2 Pre-Processing
The training and test data are pre-processed in two different ways. In the training data, recurring phrases introduced by the retrieval process are removed. Examples are "Aftonbladet eller artikelförfattaren.", "Läs hela artikeln", and "Alla artiklar är skyddade enligt". Dates, authors, symbols, titles, and image captions are all removed. The titles of the articles in the test data are not included because a title can consist of a few words yet generate multiple text blocks; such short blocks are unlikely to be valuable to the model and would likely only introduce noise.
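The removal of recurring phrases can be sketched as a simple string replacement. The phrase list below contains only the three examples named above, not the full list used for the thesis, and the function name `strip_recurring_phrases` is hypothetical.

```python
RECURRING_PHRASES = [
    "Aftonbladet eller artikelförfattaren.",
    "Läs hela artikeln",
    "Alla artiklar är skyddade enligt",
]

def strip_recurring_phrases(block, phrases=RECURRING_PHRASES):
    # Delete every occurrence of each recurring phrase from the text block.
    for phrase in phrases:
        block = block.replace(phrase, "")
    # Collapse the whitespace left behind by the removals.
    return " ".join(block.split())

raw = "Kommunen bygger en ny skola. Läs hela artikeln"
clean = strip_recurring_phrases(raw)
```

A block that consists only of such a phrase becomes an empty string and can then be dropped.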
4 METHODOLOGY
This section outlines the methodology of the thesis. First, the models and training are explained. The next part explains the different data formats used. Finally, the last part outlines the metrics used to evaluate the models.
4.1 Models
The BERT model trained by the National Library of Sweden (KB) is used to answer the research questions. "[B]ert-base-Swedish-cased"⁶ is trained with the same experimental settings that Google originally used⁷. One difference is the tokenization algorithm: KB used SentencePiece to split words into tokens during training (Malmsten et al. 2020).
A comprehensive list of the text materials used in pre-training is in Table 2. The corpus is heavily skewed towards newspapers and consists of more than 200M sentences and 3000M tokens. The authors removed text blocks shorter than ten sentences and set the vocabulary size to 50325, making the vocabulary considerably larger than Google's. The vocabulary size is motivated by the many compound words in the Swedish language (Malmsten et al. 2020). The paper does not report the accuracy achieved on the pre-training tasks.
Table 2: Resources used in pre-training BERT

Material
Digitized newspapers
Official Reports of the Swedish Government
Legal e-deposits
Social media
Swedish Wikipedia

Source: (Malmsten et al. 2020)
To answer the research questions, KB's BERT is fine-tuned on binary text block prediction, which is similar to the NSP pre-training task. Two text blocks, separated by a [SEP] token, are fed to the model. The fine-tuned BERT is also used to reconstruct articles and is compared against the BERT base version, which has only been pre-trained.
⁶ Trained on lower-cased Swedish text.
⁷ https://github.com/Kungbib/swedish-bert-models.
For sequence classification tasks, the last layer of BERT base is a dense layer that applies a linear transformation to the [CLS] token representation. For fine-tuning, another dense layer with two output nodes and a softmax activation is added on top of this layer. The added layer is a fully connected classification layer with weights $W \in \mathbb{R}^{2 \times 768}$, where two is the number of labels and 768 is the hidden size. The labels are "next" and "not next" and refer to whether two text blocks are in sequence or not. The same modification is made to the base version.
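The added classification layer can be sketched in plain Python, without a deep learning library. The weights and the [CLS] vector below are random stand-ins rather than trained values, and the bias term `b`, the label order, and the function names are assumptions made for illustration.

```python
import math
import random

HIDDEN_SIZE = 768
NUM_LABELS = 2      # assumed order: index 0 = "next", index 1 = "not next"

def softmax(logits):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, W, b):
    # Linear layer: one logit per label, computed as W h + b.
    logits = [sum(w * h for w, h in zip(row, cls_vector)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)

rng = random.Random(0)
W = [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)] for _ in range(NUM_LABELS)]
b = [0.0] * NUM_LABELS
cls_vector = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_SIZE)]  # stand-in for BERT's output
probs = classify(cls_vector, W, b)
```

The two entries of `probs` sum to one and can be read as the probabilities of "next" and "not next" for the text block pair.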
4.2 Training
BERT is trained with different hyperparameter settings on the daily newspaper articles retrieved from Mediearkivet. The training data consist of 2469 articles, of which approximately 80% are used for training and approximately 20% (500 randomly selected articles) for validation. BERT is then tested on the set of OCR'd daily newspaper articles provided by the National Library of Sweden.
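A minimal sketch of such a split is given below. It assumes that the split is made at the article level, so that all text blocks of an article end up in the same set, and it uses integer identifiers as placeholders for the articles; the function name and the seed are hypothetical.

```python
import random

def split_articles(article_ids, n_validation=500, seed=0):
    # Draw the validation articles at random; the rest are used for training.
    rng = random.Random(seed)
    validation = set(rng.sample(article_ids, n_validation))
    train = [a for a in article_ids if a not in validation]
    return train, sorted(validation)

article_ids = list(range(2469))
train_ids, val_ids = split_articles(article_ids)
print(len(train_ids), len(val_ids))   # 1969 500
```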
4.2.1 Generating Train Data Structure
I adopted the data format of the next sentence prediction task in order to train BERT on both correct and incorrect text block pairs⁸. The adapted function introduces negative sampling and generates text pairs in BERT's input format. Some hyperparameter values need to be set, such as the total number of tokens and the probability of sampling a random block as the second segment. Since the maximum number of tokens in BERT is 512, the maximum input is 509 tokens plus 3 special tokens. The number of non-special tokens is divided by two, generating two segments that are often equal in length, and a [PAD]⁹ token fills the empty slots.
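The token budget can be illustrated with a simplified sketch. It is not the adapted Hugging Face code: the tokens are placeholders for the output of the tokenizer, the first segment is simply capped at half of the budget, and the function name `pack_pair` is hypothetical.

```python
MAX_LEN = 512        # maximum number of tokens BERT accepts
NUM_SPECIAL = 3      # one [CLS] and two [SEP]

def pack_pair(tokens_a, tokens_b, max_len=MAX_LEN):
    budget = max_len - NUM_SPECIAL                # 509 non-special tokens
    len_a = min(len(tokens_a), budget // 2)       # first segment: at most half the budget
    len_b = min(len(tokens_b), budget - len_a)    # second segment: what is left
    seq = ["[CLS]"] + tokens_a[:len_a] + ["[SEP]"] + tokens_b[:len_b] + ["[SEP]"]
    # Fill the remaining slots with [PAD] so every input has the same length.
    return seq + ["[PAD]"] * (max_len - len(seq))

packed = pack_pair(["a"] * 300, ["b"] * 100)
```

In this example the first segment is truncated to 254 tokens, the second keeps its 100 tokens, and the remaining 155 slots are padded.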
⁸ Code adapted from: https://github.com/huggingface/transformers/blob/master/src/transformers/data/datasets/language_modeling.py
⁹ The loss function ignores the [PAD] token in the PyTorch library.
Table 3: Example output from train data

Text block pairs                                                           Label
[CLS] Article 1, text blocks 1 & 2 [SEP] Article 2, text block 1 [SEP]      "not next"
[CLS] Article 1, text block 3 [SEP] Article 2, text blocks 1 & 2 [SEP]      "not next"
[CLS] Article 2, text block 1 [SEP] Article 2, text blocks 2 & 3 [SEP]      "next"
[CLS] Article 3, text block 1 [SEP] Article 3, text blocks 2 & 3 [SEP]      "next"
[CLS] Article 3, text block 1 [SEP] Article 1, text blocks 1-3 [SEP]        "not next"
Table 3 illustrates the generated data structure. The example uses three articles with three text blocks each. The probability of selecting a random text block from another article is 0.5, the same value used in training. Negative samples are labeled "not next", and true text pairs are labeled "next". The first row in the table is labeled "not next" because the first text block of the second article is not a continuation of the first article. Row 4 is labeled "next" because the second and third text blocks of article 3 follow its first text block.
The function generates 7500 true and false text block pairs in total from the training data.
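The negative sampling can be sketched as follows. This is a simplified illustration rather than the adapted Hugging Face function: it generates one pair per article, ignores the token budget, and represents each article as a list of text block strings. The function name `make_pairs` and the example data are hypothetical.

```python
import random

def make_pairs(articles, p_random=0.5, seed=0):
    """articles: list of articles, each a list of text blocks (strings)."""
    rng = random.Random(seed)
    pairs = []
    for idx, blocks in enumerate(articles):
        if len(blocks) < 2:
            continue                                 # no following block exists
        split = rng.randint(1, len(blocks) - 1)      # where the first segment ends
        first = " ".join(blocks[:split])
        others = [i for i, a in enumerate(articles) if i != idx and a]
        if others and rng.random() < p_random:
            # Negative sample: continue with blocks from a different article.
            other = articles[rng.choice(others)]
            start = rng.randrange(len(other))
            second, label = " ".join(other[start:]), "not next"
        else:
            # Positive sample: continue with the rest of the same article.
            second, label = " ".join(blocks[split:]), "next"
        pairs.append((first, second, label))
    return pairs

articles = [[f"A{a}-b{b}" for b in range(1, 4)] for a in range(1, 4)]
pairs = make_pairs(articles)
```

Each resulting tuple corresponds to one row of the kind shown in Table 3, before the special tokens are added.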
4.2.2 Hyperparameters
The batch size is 8, and the number of epochs is three or four, in line with the number of epochs used for the downstream tasks in the original BERT paper (Devlin et al. 2018). The same applies to the values of the weight decay. The number of warm-up¹⁰ steps was selected conservatively. The following hyperparameter values are used in training:
• Batch size ∈ {8}
• Epochs ∈ {3, 4}
• Weight decay ∈ {0.01, 0.05}
• Warm-up steps ∈ {50, 100}
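If every combination of these values were tried, the search space could be enumerated as in the sketch below. Whether all combinations are actually trained is not stated at this point, so the full grid is an assumption, and the dictionary keys are hypothetical names.

```python
from itertools import product

grid = {
    "batch_size": [8],
    "epochs": [3, 4],
    "weight_decay": [0.01, 0.05],
    "warmup_steps": [50, 100],
}

# One dictionary per combination of hyperparameter values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))   # 1 * 2 * 2 * 2 = 8 combinations
```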
10