Text Block Prediction and Article Reconstruction Using BERT


TEXT BLOCK PREDICTION AND ARTICLE RECONSTRUCTION USING BERT

Submitted by

Andreas Walter Estmark

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a two-year Master of Arts degree

in Statistics in the Faculty of Social Sciences

Supervisors Måns Magnusson

Andreas Östling

Spring, 2021


ABSTRACT

Kungliga biblioteket (National Library of Sweden, KB) uses Optical Character Recognition (OCR) engines to extract and segment texts from its archive of daily newspaper articles. These systems are good at extracting and segmenting text at the paragraph level and below (i.e., sentences, words, and characters), but less so at the article level, which results in articles being segmented into text blocks that are not attached to their articles. In this thesis, BERT, a natural language processing (NLP) model, is fine-tuned on newspaper articles and used to reconstruct these articles by predicting whether a text block is the next one or not. A small data set of 127 text blocks from 21 articles is used. The best performing BERT achieved an accuracy of 94% on text block pair prediction when the blocks are ordered, which resulted in 13 reconstructed articles. The performance was reduced when selecting from all possible, unordered text pairs. It was also found that BERT performs well on clustering text blocks from the same articles.


Introduction
1.1 Related Work
1.1.1 Syntactic Tasks
1.1.2 Article Extraction
1.2 Objective

Theory
2.1 Feed-Forward Neural Network (FFNN)
2.1.1 Parameter Updates
2.2 Vector Representations of Words
2.2.1 Tokenization
2.3 Transformer
2.3.1 Self-Attention
2.3.2 Layer Normalization and Residual Connection
2.4 Transfer Learning
2.5 Bidirectional Encoder Representations from Transformers (BERT)
2.5.1 Pre-Training
2.5.2 Fine-Tuning

Data
3.1 Descriptive Statistics
3.2 Pre-Processing

Methodology
4.1 Models
4.2 Training
4.2.1 Generating Train Data Structure
4.2.2 Hyperparameters
4.3 Test Data Structures
4.3.1 Text Block Prediction (TBP)
4.3.2 Page Block Prediction (PBP)
4.4 Metrics

Results
5.1 TBP results
5.2 PBP results

Discussion

Conclusion

Acknowledgments

Appendix


1 INTRODUCTION

Kungliga biblioteket (National Library of Sweden, KB) and other heritage institutions worldwide with extensive collections of texts are digitizing their printed materials. Among other printed materials, KB collects, preserves, and digitizes Swedish daily newspapers (Kungliga biblioteket 2020). The newspapers are digitized using Optical Character Recognition (OCR) systems, and their corpus "KubHist" contains Swedish daily newspapers spanning from 1645 to today (Dannells et al. 2019). Each issue is photographed and processed with OCR software (Adesam et al. 2019). OCR systems will continue to play a role in the digitization process as long as newspaper articles continue to be printed. These systems achieve high accuracy on text data and are easily accessible through open-source alternatives, making them a popular tool for digitizing texts (Holley 2009).

OCR systems convert image input into readable text data (Saha et al. 2010). They segment and extract texts from images, such as newspapers, into units of text (often entire paragraphs¹, but more generally text that spans a specific part of a page), titles, images, and smaller units like dates and authors. The OCR segmentation is valuable for research on newspaper articles. The caveat is that segmenting only into standalone text blocks serves a limited purpose. Extracting whole articles would therefore be of major value to stakeholders.

OCR engines do not group text blocks into articles, which implies that text blocks cannot be assigned to articles using OCR alone. Additionally, articles can span multiple pages, increasing the difficulty of assigning text blocks to their respective articles. OCR data are not segmented solely vertically or horizontally, which aggravates the challenge of working with OCR'd data.

OCR systems can generate incorrect characters, cut paragraphs into multiple parts, and capture paragraphs incorrectly (Bjork et al. 2018). Tesseract is an accurate, open-source OCR system (Zacharias et al. 2020). Figure 1² shows how the Tesseract OCR system (see "An Overview of the Tesseract OCR Engine") can segment a page from a newspaper. Red lines indicate text units, and green lines indicate images. Figure 1 shows an example of inaccurate segmentation: the top left article (see Figure 1a) was segmented into multiple text blocks.

¹ Paragraphs aim to be coherent and contained units of text with a structure (Hearst 1997).

² KB does not use Tesseract on newspaper articles; this figure only showcases how an OCR system processes a page.

(a) First page of the issue

(b) Top left article of the front page. The OCR engine has incorrectly segmented the body of text, indicated by the many horizontal lines.

Figure 1: Segmentation of newspaper (Bjork et al. 2018).

Dannells et al. (2019) describe the need to improve OCR systems because "data driven research" has increased the demand for correct rendering of printed materials. This increased need to improve the OCR process is likely due to recent advances in Natural Language Processing (NLP) algorithms. NLP is the intersection of computational linguistics, statistics, and computer science. These algorithms make it possible for computers to read, analyze, and understand texts, more specifically their semantics, context, and syntax. Recent advancements in NLP include the Transformer, an NLP model that has achieved record-breaking results on benchmark NLP tasks (Vaswani et al. 2017).

To train a model on article reconstruction, a definition of what a reconstructed article is is warranted. The chosen definition is the correct prediction of the text block pairs within an article and of its beginning and end. Reconstructing an article consists of:

• Correct classification of the first and last text pairs, i.e., the pairs where the article overlaps with a neighboring article, with the end of one article and the beginning of another as combined input.

• Correct classification of the text block pairs that make up the article, i.e., all text block pairs within it.
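The definition can be made concrete with a small check. The sketch below is only an illustration under assumed conventions: the function name `is_reconstructed` and the 0/1 label encoding (1 for "the second block continues the first", 0 otherwise) are hypothetical and not taken from the thesis.

```python
def is_reconstructed(true_labels, predicted_labels):
    """Return True if every text block pair touching the article is classified correctly.

    Both arguments are lists of 0/1 labels for consecutive text block pairs.
    The first and last entries are the boundary pairs shared with neighbouring
    articles; the entries in between are the pairs inside the article.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return all(t == p for t, p in zip(true_labels, predicted_labels))


# Hypothetical article with four text blocks: three inner pairs (label 1)
# framed by two boundary pairs (label 0).
truth = [0, 1, 1, 1, 0]
all_correct = is_reconstructed(truth, [0, 1, 1, 1, 0])
one_missed = is_reconstructed(truth, [0, 1, 0, 1, 0])
```

Under this reading, a single misclassified pair, whether at a boundary or inside the article, means the article does not count as reconstructed.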

1.1 Related Work

The specific task of article reconstruction is relatively unexplored in NLP, but adjacent language tasks have been explored more. The relation between sentences and article extraction are two areas that tie into text block pair prediction and article reconstruction.

1.1.1 Syntactic Tasks

BERT is a recently published, transformer-based language model that achieves record-breaking results on different language understanding tasks, including document classification (Adhikari et al. 2019, Devlin et al. 2018) and relation extraction (Shi & Lin 2019). Shi & Demberg (2019) claim that BERT's ability to capture implicit discourse relations in texts is due to its next sentence prediction (NSP) task. NSP is one of BERT's two pre-training tasks.

Shi & Demberg (2019) motivate the choice of NSP by the fact that predicting the next likely segment helps with discourse expectations. Implicit discourse relation classification aims to classify whether texts are implicitly connected. An implicit relation between two events means that no explicit connective words tie the texts together. Examples of explicit words presented in the paper are "but," "because," and "however."

Shi & Demberg (2019) argue that the NSP task is a good fit for implicit discourse relation classification because it allows for the representation of what a typical text block would look like, and they conclude that the next sentence prediction task gives BERT a good ability to capture semantic connections. Their research raises the question of how BERT can be used to reconstruct articles, since it uses one of BERT's pre-training tasks.

Ghosh et al. (2016) present a Contextual Long Short-Term Memory network (CLSTM) and evaluate it on three different NLP tasks: word prediction, sentence topic prediction, and next sentence selection. This research presents one approach to how the relation between texts can be modeled. Their approach to next sentence selection is to provide candidate sequences and predict which one is the most probable, conditioned on the sentences the model has already seen. The model trains on sentences that are part of the same sequence, meaning they have the same topic. They use supervised learning to label text segments with their topic. The model predicts the last sentence given the first sentences and the shared context.

The final model obtains an accuracy of 63%. This method requires a label for each text segment to predict the next segment. It also requires deciding on candidate segments, some of which can share the same context; however, text blocks from the same article will naturally share context.

1.1.2 Article Extraction

There is previous research on identifying article structure. Palfray et al. (2012) use conditional random fields to extract newspaper articles from images. They define an article as a structure with a title, followed by a "text entity", and ending with either a horizontal separator or a new title. They use machine learning to segment articles by defining their general logical structure and then detecting structural entities such as horizontal and vertical separators.

They use a test dataset of 42 images, and their method identified 85.84% of the articles in the test data. However, this method requires steps before the OCR process.

1.2 Objective

This thesis aims to explore and study the challenge of combining text blocks segmented by an OCR system by analyzing and implementing a predictive model for text pair prediction and ultimately reconstructing the daily newspaper articles.

The two main research questions are:

1. How well can a pre-trained NLP model predict if the second text block of a text pair, digitized by OCR, is, in fact, the subsequent text block?

2. How can the model subsequently be applied to reconstruct OCR’d newspaper articles?

BERT, a transformer-based, state-of-the-art language model, obtains record-breaking results on NLP task sets such as the General Language Understanding Evaluation (GLUE) benchmark (Devlin et al. 2018). BERT is used in this thesis to answer the research questions. The thesis will also compare the fine-tuned BERT with only the pre-trained version of BERT on the same classification task.


2 THEORY

Devlin et al. (2018) proposed BERT based on the Transformer architecture (Vaswani et al. 2017), which is a deep learning architecture widely used in NLP. The Transformer consists of encoders and decoders with sub-layers of self-attention and feed-forward neural networks. This section outlines the theoretical background of the Transformer and BERT and why the architecture is the method of choice to answer the research questions.

2.1 Feed-Forward Neural Network (FFNN)

The feed-forward neural network is the simplest neural network, where the data travel exclusively in one direction. The architecture applies a linear transformation followed by a non-linear transformation and has become a prevalent method in NLP. A fully connected FFNN consists of an input layer, a selected number of hidden layers, and an output layer. Figure 2 portrays a fully connected FFNN where the circles indicate nodes that are connected in the direction of the arrows.

The number of nodes (or neurons) in a layer is also called the layer's size. Figure 2 shows an FFNN with an input layer of three nodes, three nodes in the only hidden layer, and one node in the output layer. The first layer (green nodes) is the input, denoted $x_i$, where $i = 1, \ldots, n$ and $n$ is the input size (three in the figure). The middle layer (blue nodes) is denoted $h_j$, where $j = 1, \ldots, 3$ indexes the nodes in the middle layer, and $y$ denotes the output. The yellow circles indicate the biases, and the total number of biases in the network equals the sum of the hidden size and the output size.

Figure 2: A fully connected Feed-forward neural network for binary classification


There are three biases in the hidden layer, denoted $b^{(1)}_j$, $j = 1, \ldots, 3$, and represented by a yellow circle with three lines. Each of the three hidden nodes is the weighted sum of the input nodes in the preceding layer plus a bias, followed by a non-linear activation function $\sigma$:

$$
\begin{aligned}
h_1 &= \sigma\left(w^{(1)}_{11} x_1 + w^{(1)}_{12} x_2 + w^{(1)}_{13} x_3 + b^{(1)}_1\right) \\
h_2 &= \sigma\left(w^{(1)}_{21} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{23} x_3 + b^{(1)}_2\right) \\
h_3 &= \sigma\left(w^{(1)}_{31} x_1 + w^{(1)}_{32} x_2 + w^{(1)}_{33} x_3 + b^{(1)}_3\right),
\end{aligned}
$$

where

$$\sigma(x) = \frac{1}{1 + e^{-x}},$$

which is called the sigmoid activation function. The input-to-node connections in the hidden layer have weights, denoted by $W^{(1)}$, which is a parameter matrix with superscript 1 because it is the weight matrix of the first and only hidden layer. The output of this FFNN is expressed in the following way:

$$y = \sigma\left(w^{(2)}_1 h_1 + w^{(2)}_2 h_2 + w^{(2)}_3 h_3 + b^{(2)}\right), \tag{1}$$

where the argument of $\sigma$ is the weighted sum of the hidden nodes, using the second set of weights, plus a bias, to which an additional activation is applied. The last two layers can be compactly presented by the following two expressions, where the activation function operates element-wise:

$$h = \sigma\left(W^{(1)} x + b^{(1)}\right) \tag{2}$$

$$y = \sigma\left(w^{(2)} h + b^{(2)}\right). \tag{3}$$
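As an illustration of Equations 2 and 3, the following minimal sketch computes the forward pass of a 3-3-1 network in plain Python. The weight and input values are made up for the example and the function names are hypothetical; nothing here is taken from the thesis's own implementation.

```python
import math


def sigmoid(z):
    """Sigmoid activation, sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Forward pass of a network with one hidden layer and one output node."""
    # Equation 2: h = sigma(W1 x + b1), one weighted sum per hidden node.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Equation 3: y = sigma(w2 h + b2).
    y = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
    return h, y


# Made-up parameters for a network with three inputs and three hidden nodes.
x = [1.0, 0.0, -1.0]
W1 = [[0.5, -0.2, 0.1],
      [0.3, 0.8, -0.5],
      [-0.6, 0.4, 0.2]]
b1 = [0.0, 0.1, -0.1]
w2 = [0.7, -0.3, 0.5]
b2 = 0.05

h, y = forward(x, W1, b1, w2, b2)
```

Because the last activation is a sigmoid, `y` always lies strictly between 0 and 1 and can be read as the probability of the positive class.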

A neural network without a hidden layer, with a linear activation function and one output node, is, on the other hand, a linear regression model. If there is more than one output node, it is a multivariate regression model.

Another activation function is the softmax activation function, of which the sigmoid is a special case. The softmax function is commonly applied as the last activation function in neural networks. For example, if Figure 2 had two output nodes and the softmax function as the last activation, the output would be normalized into a probability distribution over the two classes. Besides the softmax, other common activation functions are ReLU and GELU (Figure 3).
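The relation between the softmax and the sigmoid can be checked numerically. The sketch below is an illustration only: it assumes a two-class softmax where the second logit is fixed at zero, in which case the first softmax probability equals the sigmoid of the first logit.

```python
import math


def softmax(z):
    """Normalize a list of logits into a probability distribution."""
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


z = 1.3  # arbitrary example logit
two_class = softmax([z, 0.0])
p_softmax = two_class[0]
p_sigmoid = sigmoid(z)
```

Both `p_softmax` and `p_sigmoid` evaluate to $e^{z} / (e^{z} + 1)$, which is why a single sigmoid output node suffices for binary classification.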

(a) Sigmoid function

(b) Rectified Linear Unit (ReLU) function

(c) Gaussian Error Linear Unit (GELU) function

Figure 3: Three common activation functions

BERT uses the Gaussian Error Linear Unit (GELU) activation function in the FFNN (Devlin et al. 2018) instead of ReLU. Hendrycks & Gimpel (2016) show that GELU performs better than ReLU in NLP tasks. GELU is approximated with the following expression:

$$\sigma(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715x^{3}\right)\right]\right). \tag{4}$$
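To give a feeling for how close the approximation is, the sketch below compares it with the exact definition $x\,\Phi(x)$, where $\Phi$ is the standard normal CDF, using the constant 0.044715 from Hendrycks & Gimpel (2016). The function names are hypothetical and the comparison points are arbitrary.

```python
import math


def gelu_exact(x):
    """Exact GELU, x * Phi(x), written with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


points = [-3.0, -1.0, 0.0, 0.5, 1.0, 2.0]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in points)
```

On these points the two versions agree closely, and unlike ReLU the function is smooth and slightly negative for small negative inputs.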

2.1.1 Parameter Updates

A neural network updates its weights to reduce its error. In practice, these weights are updated using gradient descent, an optimization algorithm. First, the FFNN expressions outlined above are computed, and the value of a loss function is calculated to measure the error.

An example of a loss function is the cross-entropy loss function, where the loss value increases as predictions deviate further from the true label. This is also the loss function used in this thesis, where the two classes are whether a text block is the next text block or not.

Below is an expression of the cross-entropy loss function for two classes (also known as the negative log-likelihood of the Bernoulli distribution):

$$-\log L(\hat{p} \mid y) = -y \log(\hat{p}) - (1 - y) \log(1 - \hat{p}), \tag{5}$$

where $L$ is the likelihood function, $\hat{p}$ is the probability for the positive class, and $1 - \hat{p}$ is the probability for the negative class. The true label is $y$.
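A short numerical sketch of Equation 5 shows how the loss grows as the predicted probability moves away from the true label. The clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero; it is not part of the equation.

```python
import math


def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Cross-entropy loss for one example with true label y in {0, 1}."""
    p = min(max(p_hat, eps), 1.0 - eps)  # keep log() finite
    return -y * math.log(p) - (1 - y) * math.log(1.0 - p)


# True label 1 ("is the next text block") with increasingly wrong predictions.
loss_good = binary_cross_entropy(1, 0.9)
loss_unsure = binary_cross_entropy(1, 0.6)
loss_bad = binary_cross_entropy(1, 0.1)
```

An uninformative prediction of 0.5 gives a loss of $\log 2$ regardless of the label, which is a useful reference point when reading training curves.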


An optimization algorithm aims to minimize the value of the loss function w.r.t. its parameters, θ. Stochastic gradient descent (SGD) is a gradient-based optimization algorithm that approximates the actual gradient of the loss function. It uses only a single, randomly selected training example at a time (indexed by the superscript i in Equation 6) to compute the gradient of the loss function and update the weights. This method is more efficient than computing the gradient for the entire training data. Mini-batch gradient descent is another gradient descent algorithm that instead uses a batch of training examples to calculate the mean gradient and update the weights. This increases the computation speed compared to SGD.

Ruder (2016) presents SGD in the following way:

$$\theta_t = \theta_{t-1} - \eta \cdot \nabla_\theta J(\theta; x^{(i)}; y^{(i)}), \tag{6}$$

where $\theta$ represents the parameters of the loss function, $\eta$ is the learning rate, which determines the step size taken towards the minimum of the loss function, and $J(\theta)$ is the loss function, which is differentiable w.r.t. its parameters $\theta$. Kingma & Ba (2015) first suggest Adam (short for Adaptive Moment Estimation), a common stochastic optimization algorithm. The algorithm uses the mean and the variance of the loss function's gradient. It scales well to high-dimensional tasks and outperforms other stochastic optimization algorithms (see Kingma & Ba 2015 for details).

$$\theta_t = \theta_{t-1} - \eta \cdot \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \tag{7}$$

where $\hat{m}_t$ is an average of the past gradients, $\hat{v}_t$ is an average of the past squared gradients, and $\epsilon$ is a small constant that prevents division by zero. $\hat{m}_t$ is the bias-corrected estimate of the gradient's mean and a function of $\beta_1$. $\hat{v}_t$ is the bias-corrected estimate of the gradient's variance and a function of $\beta_2$. $\beta_1$ and $\beta_2$ denote the exponential decay terms. The optimizer updates the weights and biases of the neural network in the direction opposite to the loss function's gradient.
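The Adam update can be sketched in a few lines of plain Python. This is a minimal illustration following Kingma & Ba (2015), not the optimizer code used in the thesis: the function name `adam_step`, the toy loss $J(\theta) = \theta^2$, and the learning rate 0.01 are assumptions made for the example, and `eps` is the small constant in the denominator.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; t is the step count from 1."""
    # Exponentially decaying averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction, needed because m and v start at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Step against the gradient, scaled per parameter.
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


# Toy problem: minimize J(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = [1.0], [0.0], [0.0]
for t in range(1, 2001):
    grad = [2.0 * theta[0]]
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.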

I use an updated version of Adam when fine-tuning BERT for article reconstruction: Adam with decoupled weight decay (AdamW), which allows the weight decay and the learning rate to be optimized separately (Loshchilov & Hutter 2017). Weight decay is a regularization technique that prevents a model from overfitting by adding a small penalty term to the loss function, penalizing large weights.


2.2 Vector Representations of Words

Word embeddings are techniques that map words to vectors of real numbers for text analysis (Mandelbaum & Shalev 2016). Embeddings are dense vector representations of words. A typical use of word embeddings is to encode the meaning of a word so that its vector lies close to the vectors of similar words in vector space. Figure 4 presents an example of two word embeddings, for "Cat" and "Dog".

Figure 4: Word embedding example of the words cat and dog, each represented by a dense vector of arbitrary length.

Figure 4 portrays two 1x3 vectors, exemplifying how two words can be embedded. Each element in a vector represents some characteristic; for example, the last element in Figure 4 can represent "animal". This method of embedding words is called dense word embedding, where each element contains a floating-point value, in contrast to one-hot encoding, where a 1 in a sparse n by n matrix represents a specific word.
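The contrast between dense and one-hot representations can be illustrated with cosine similarity. The vectors below are invented for the example (they are not the values in Figure 4), and the third word "car" is added only to have something dissimilar to compare against.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up dense embeddings; the last element loosely plays the role of "animal".
dense = {
    "cat": [0.2, -0.4, 0.9],
    "dog": [0.3, -0.1, 0.8],
    "car": [-0.7, 0.6, 0.0],
}

# One-hot vectors: a single 1 per word in a vocabulary-sized vector.
vocab = ["cat", "dog", "car"]
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

dense_cat_dog = cosine(dense["cat"], dense["dog"])
dense_cat_car = cosine(dense["cat"], dense["car"])
one_hot_cat_dog = cosine(one_hot["cat"], one_hot["dog"])
```

With dense vectors, "cat" ends up closer to "dog" than to "car", whereas any two distinct one-hot vectors are orthogonal and therefore carry no notion of similarity.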

Like the weights in an FFNN, the word vectors are trainable parameters. A popular word embedding model is Word2vec (Mikolov et al. 2013), which uses a two-layer neural network to learn word embeddings. The embeddings carry information about the words; for example, the words in Figure 4 would have similar values in some elements because both are domesticated animals, as illustrated by the last element of each vector. However, this word embedding method creates context-independent representations.

Devlin et al. (2018) state that BERT incorporates context from both directions. BERT uses contextual word representations, where a token's representation is a function of both the token itself and the rest of the input. Liu et al. (2020) attribute record-breaking results on NLP tasks to this type of embedding.


2.2.1 Tokenization

Tokenization is a method that splits words based on their existence in a vocabulary. The standard way to represent words is to assign each one a unique ID that maps to an initial word embedding. Using the vocabulary created by KB, the phrase "Kungliga biblioteket digitaliserar sina samlingar" tokenizes into "Kungliga, Biblioteket, digital, ##iserar, sina, samlingar".

"[D]igitaliserar" is divided into two smaller words, "digital" and "##iserar". If a word splits into subwords, all components after the first one are prefixed with ##. The tokens have the following IDs:

[11798, 14551, 16949, 13446, 604, 21315],

and these integers map to an initial dense embedding. The first step is to check whether the word is in the vocabulary. The second step is to look for words within the original word. The example shows that "digital" is in the vocabulary. "[I]serar" is also tokenized. Furthermore, a word is split into individual characters if it is not in the vocabulary. Subwords or even single characters thus represent words that are not in the vocabulary (Wu et al. 2016).
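The splitting of "digitaliserar" can be reproduced with a greedy longest-match-first procedure, which is how WordPiece tokenizers for BERT commonly operate. The sketch below is a simplified illustration: `toy_vocab` is a made-up stand-in for KB's vocabulary, and falling back to an `[UNK]` token when no piece matches is an assumption of this sketch.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword tokens by greedy longest-match-first."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first and shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get the prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no matching piece at this position
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical miniature vocabulary for the example phrase.
toy_vocab = {"Kungliga", "digital", "##iserar", "sina", "samlingar"}
split_word = wordpiece("digitaliserar", toy_vocab)
whole_word = wordpiece("sina", toy_vocab)
```

Each resulting token would then be looked up in the vocabulary to obtain its integer ID and, from that, its initial embedding.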

2.3 Transformer

Vaswani et al. (2017) present the Transformer, a deep learning model consisting of stacked layers of encoders and decoders. They employ six layers in the paper. A reason for the success of the Transformer architecture is its modeling of context-dependent word representations using the attention mechanism. Figure 5 shows half of the Transformer's architecture, known as the encoder, which is the block BERT uses.

The encoder takes word embeddings of dimension $d_{model}$ as input, as explained in Section 2.2. Self-attention is applied in the "Multi-Head Attention" sublayer in the figure, which is followed by a fully connected, position-wise FFNN (see Section 2.1). A residual connection and a layer normalization wrap both sub-layers in the encoder. The output of the encoder is a new vector representation for each token.


Figure 5: Model architecture of the encoder. Figure derived from (Vaswani et al. 2017)

2.3.1 Self-Attention

Two sublayers make up the encoder, and the first sublayer is the focus of this section. Self-attention is a mechanism that computes the relations between words in a sequence. These relations are encoded into the representations of the sequence that the function outputs. A simple example of the attention mechanism follows to show how the function represents tokens.

Each word embedding is multiplied with three pre-trained matrices, known as Value, Key, and Query. Their weights determine how important the other tokens in the input are when creating the current token's representation (Clark et al. 2019). The word embeddings $x_1$ and $x_2$ are multiplied with the $W^V$, $W^Q$, and $W^K$ matrices, creating the $v$, $q$, and $k$ vectors seen at the bottom of Figure 6.


Figure 6: $x_1$ and $x_2$ represent two word embeddings. These are generally vectors of much larger size; for example, BERT's embeddings are 768 in length. The matrices in the top-right corner are attention weights and are multiplied with the word embeddings to output the vectors at the bottom.

The two example word embeddings in Figure 6, $x_1$ and $x_2$, each have a dimension of 1x3.

The second step of the attention function is to compute the dot product of $q$ and $k$ for each embedding. Each dot product is scaled by $\sqrt{d_k}^{-1}$, known as the scaling factor, where $d_k$ is the dimension of the key vectors. These two steps are outlined below for both words; $s_{11}$ and $s_{12}$ are the two output values for $x_1$.

$$s_{11} = \frac{q_1 \cdot k_1}{\sqrt{d_k}}, \quad s_{21} = \frac{q_2 \cdot k_1}{\sqrt{d_k}}, \quad s_{12} = \frac{q_1 \cdot k_2}{\sqrt{d_k}}, \quad s_{22} = \frac{q_2 \cdot k_2}{\sqrt{d_k}},$$

where the two obtained values for the first token subsequently go through the softmax activation function. In the next step, the two softmax scores are multiplied with their respective value vectors. The last step is to sum the weighted value vectors. In this way, how "important" $x_2$ is to $x_1$ has been encoded into $x_1$'s representation. Equation 8 gives a compact representation of the attention formula, where $\sigma$ denotes the softmax function:

$$\text{Attention}(Q, K, V) = \sigma\left(\frac{QK^T}{\sqrt{d_k}}\right)V. \tag{8}$$
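The steps of Equation 8 can be written out for small matrices in plain Python. This is a sketch for illustration only: the query, key, and value rows are made-up numbers standing in for the projected embeddings, and the softmax is applied row-wise, one row per query token.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


# Two tokens with made-up three-dimensional q, k, and v vectors.
Q = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
K = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
Z = attention(Q, K, V)
```

Each row of `Z` is a mixture of the value vectors, so every token's new representation contains information about the other token in proportion to its attention weight.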


Figure 7: Example of how the word "it" encapsulates the words "the animal didn't" and "was too tired", highlighted in orange and green. Figure from Jay Alammar (2018).

In practice, $x_1$ and $x_2$ are much longer representations and are concatenated into a matrix. The attention function outputs a matrix of new representations, collectively referred to as an attention head. Moreover, instead of obtaining a single attention head, Vaswani et al. (2017) compute multiple attention mechanisms in parallel and independently. This is called multi-head attention, and they apply it because it allows the model "to jointly attend to information from different representation subspaces at different positions".

$$Z = \text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \ldots, H_h)W^O \tag{9}$$

$$H_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V), \tag{10}$$

where $h$ is the number of heads and $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ are trained matrices that were initially drawn from the Normal distribution (Xiong et al. 2020). Concat is short for concatenation and means to join together, so the attention heads are joined together into one big matrix. $W^O$ is an additional pre-trained matrix that is multiplied with the concatenated attention heads to condense the output for the next part of the encoder. This final matrix has information from $h$ heads. Vaswani et al. (2017) discuss how each attention head learns to perform different tasks, yielding results that are easier to interpret.

Figure 7 shows the focus of two heads concerning the word "it". The figure displays how two attention heads capture different word relations. The number of heads is scalable; Vaswani et al. (2017) employ eight attention heads.


2.3.2 Layer Normalization and Residual Connection

The output of each encoder sub-layer passes through a residual connection and a normalization layer. Xu et al. (2019) show that layer normalization enables faster training in the Transformer. LayerNorm rescales its input, and each sub-layer produces the following output: LayerNorm(x + sublayer(x)), where x is either the word embedding or the output from the previous sublayer, and sublayer(x) is the function currently applied in the encoder. He et al. (2015) show that residual connections help retain information from previous layers and improve accuracy. Figure 8 illustrates an example residual connection with two hidden layers and ReLU as activation in between.

Figure 8: Illustration of residual connection. Figure derived from (He et al. 2015).

The output from the first encoder sub-layer continues to a position-wise FFNN, where there are no dependencies between the token representations. Similar to the network outlined in Section 2.1, a two-layer neural network with ReLU as activation is applied to the attention output. Position-wise implies that the FFNN is applied to each token representation independently. Another LayerNorm function is applied to the FFNN output.
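The expression LayerNorm(x + sublayer(x)) can be sketched for a single token vector as follows. This is a simplified illustration: the learned gain and bias of layer normalization are omitted, the constant `eps` is an assumed value for numerical stability, and the toy sublayer merely halves its input in place of attention or the FFNN.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    residual = [xi + si for xi, si in zip(x, sublayer(x))]
    return layer_norm(residual)


def toy_sublayer(x):
    # Stand-in for self-attention or the position-wise FFNN.
    return [0.5 * xi for xi in x]


token = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(token, toy_sublayer)
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
```

The residual term keeps the original input in the sum, so information from earlier layers survives even when the sublayer contributes little.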

2.4 Transfer Learning

Transfer learning (TL) transfers "knowledge" from having solved broad tasks to similar, more domain-specific tasks (Zhuang et al. 2019). It aims to address the challenge of both labeled data and compute power being scarce. A TL model has its weights already trained, meaning it does not have to train from scratch. For example, if a model is trained to detect dogs, transfer learning makes it possible for the model, without much additional training, to perform well on detecting cats, too (Neyshabur et al. 2021). An FFNN trains on detecting dogs by updating its parameters (weights and biases), and if the same model is then used to detect cats, it can start from the pre-trained weights of the network trained on dogs.

Generally, NLP models are pre-trained on one or more language tasks using an unlabelled corpus. Pre-training aims to create models that capture general language patterns, enabling more domain-specific models to utilize the knowledge gained from pre-training. Next sentence prediction and masked token prediction are two examples of pre-training tasks in NLP.

2.5 Bidirectional Encoder Representations from Transformers (BERT)

BERT has become a popular choice in NLP. Its success is partly due to its pre-training tasks, which encode non-directional context and the relationship between two text segments in its token representations. Goldberg (2019) shows that BERT performs well at finding syntactic patterns in the English language, making it an appropriate model for classifying whether a text block is the next one or not.

BERT was initially released in two configurations, a base and a large version. The base version has 12 encoders with a hidden size of 768 and 12 attention heads (see Section 2.3.1), totaling 110M parameters. The model was trained on BooksCorpus and the English Wikipedia, containing 3300M words combined. The original BERT uses WordPiece embeddings with a vocabulary of 30522 tokens (Devlin et al. 2018).

In addition to token embeddings, BERT adds two more embedding layers to its input. The segment embedding encodes which input segment each token belongs to. In Figure 9, A represents tokens in the first segment, and B represents tokens in the following segment. The third embedding layer is the positional embedding, representing the position of the tokens in the input. The position embedding consists of two parts: an integer for the token's location and how close it is to the other words in the sequence. The three embedding layers in BERT are summed element-wise to form BERT's input (Klein et al. 2017).


Figure 9: BERT embedding layers. If the input is a single segment, the segment embeddings only consist of A, and the first [SEP] token is discarded. The maximum number of tokens BERT can use as input is 512, including special tokens (Devlin et al. 2018).

2.5.1 Pre-Training

BERT is pre-trained on two self-supervised tasks using unlabeled data: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). MLM allows for non-directional training by masking random input tokens and predicting them. NSP helps the model understand sentence relationships by predicting whether a text segment is the subsequent segment or not.

NSP aims to represent the relation between two text segments, allowing for downstream tasks built on similar relationships, such as text pair relationships. Devlin et al. (2018) state that removing NSP from pre-training reduces performance on different NLP tasks. BERT uses two text segments from the corpus for the NSP task, and 50% of the time, the second segment is replaced with a random text segment. That is, BERT trains on next sentence prediction by looking at consecutive sentences in the corpus and randomly selecting other sentences to use as negative examples. Figure 10 visualizes both pre-training tasks, where the classification token [CLS] is prepended to the input.
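The construction of NSP examples can be sketched as follows. This is an illustration under assumptions rather than BERT's actual data pipeline: the function name `make_nsp_pairs` is hypothetical, negatives are drawn from a different document than the first segment (a simplification of sampling from the whole corpus), and at least two documents are required.

```python
import random


def make_nsp_pairs(documents, rng):
    """Build (first, second, label) triples; label 1 means 'second follows first'."""
    pairs = []
    for doc_idx, segments in enumerate(documents):
        for i in range(len(segments) - 1):
            if rng.random() < 0.5:
                # Positive example: the true next segment.
                pairs.append((segments[i], segments[i + 1], 1))
            else:
                # Negative example: a random segment from another document.
                others = [j for j in range(len(documents)) if j != doc_idx]
                other_doc = documents[rng.choice(others)]
                pairs.append((segments[i], rng.choice(other_doc), 0))
    return pairs


# Two made-up documents; the first character marks the document of origin.
documents = [["a1", "a2", "a3"], ["b1", "b2"]]
pairs = make_nsp_pairs(documents, random.Random(0))
```

The same pair structure carries over to text block prediction, with text blocks taking the place of sentences.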


Figure 10: Architecture of BERT for next sentence prediction and masked LM (Devlin et al. 2018). Sentence A has n tokens and sentence B has m.

The special token C, seen in the top-left corner, is the final hidden vector of the [CLS] token. It is used "as the aggregate sequence representation for classification tasks" (Devlin et al. 2018), and its representation is used in binary classification tasks. A classification layer transforms the [CLS] token output into a 2x1 vector or a single node. A softmax activation function outputs the probability that segment B follows segment A. In pre-training of the original BERT, the final model achieves 97%-98% accuracy on NSP (Devlin et al. 2018).

MLM predicts the most probable token for each masked token using $T_i$, and the output layer for the masked tokens applies the softmax activation function. $T_i$ denotes the final hidden vector representation of the i-th input token. 15% of the input tokens are selected for prediction. In 80% of these cases, a [MASK] token replaces the selected token. 10% of the time, the token is replaced with a random token. In the remaining 10% of the cases, the token is left unchanged, reducing the difference between pre-training and fine-tuning, where the input is not masked. Both pre-training tasks run in parallel, so the NSP task uses incomplete data.
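The masking scheme described by Devlin et al. (2018), where 15% of the tokens are selected and then treated according to an 80/10/10 split, can be sketched as below. The function name `mask_tokens` and the tiny vocabulary are made up for the illustration, and special tokens such as [CLS] and [SEP] are not treated separately in this simplified version.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (model_input, targets); targets is None where nothing is predicted."""
    model_input = list(tokens)
    targets = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected for prediction
        targets[i] = token  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            model_input[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            model_input[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return model_input, targets


rng = random.Random(0)
sentence = ["Kungliga", "biblioteket", "digitaliserar", "sina", "samlingar"] * 20
masked, targets = mask_tokens(sentence, ["x", "y"], rng)
```

Only the positions with a target contribute to the MLM loss; all other positions pass through unchanged.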

Devlin et al. (2018) state that the training loss is the sum of the mean NSP likelihood and the mean MLM likelihood, and the model is trained to minimize the combined loss. The combined loss function is a linear combination of the two loss functions (Zhang et al. 2020). Both tasks apply cross-entropy and the AdamW optimizer; see Section 2.1.

2.5.2 Fine-Tuning

It is relatively easy to fine-tune BERT compared to other recently published language models. ELMo, for example, requires a task-specific architecture for every downstream task. BERT, on the other hand, only requires comparatively small changes (Zhang et al. 2020). Applying BERT to domain-specific tasks requires one or more additional layers on top of the final layer. The input is either a pair of sentences or a single sentence. In binary classification tasks, a linear classifier is added to the pre-trained model. The pre-trained parameters are also updated when fine-tuning the model.

3 DATA

The training data are retrieved from Mediearkivet³, a database of Swedish daily newspapers that is regularly updated. 2469 articles were retrieved, where every row in the data represents a text block. The articles were printed in Sweden in late 2019 or throughout 2019, depending on the newspapers’ availability. Some newspapers have premium services, which reduces the number of articles available. For example, articles by Svenska dagbladet and Dagens nyheter span all of 2019 instead of only a part of 2019 because only a limited number of their articles were available. They are therefore less represented in the training data.

Test data are made available by KB and consist of articles printed in 2017 and onward, converted into readable text with OCR. The first ten pages from an Expressen issue printed in 2020, excluding the cover, were retrieved. Additionally, articles from other newspapers in the training data and from more local ones were also retrieved. One page per newspaper is collected. Day, month, and newspaper are all randomly selected⁴. Only text blocks that form articles were selected. The final size of the test data is 127 text blocks from 21 articles. Some of these articles are on the same page.

3.1 Descriptive Statistics

The training data consist of 76629 text blocks from 2469 articles, where the mean number of words per article is 495 words and characters⁵. Text blocks can be anything from a few words to whole articles. This is due to how the training data were retrieved.

The test data consist of 127 text blocks from 21 articles, where the mean number of words per article is 475. Figure 11 displays the frequency of words for the articles in both data sets. The dotted blue lines indicate the mean number of words per article.

³ https://www.ub.gu.se/sv/databaser/mediearkivet

⁴ If the generated page number only contains stock values or ads, the next page with articles is selected.

⁵ For example, a hyphen (-) surrounded by two white spaces also counts as a word, and therefore, the mean is slightly higher than if only words were counted.

(a) Frequency of articles based on number of words used in training the model.

(b) Frequency of articles based on number of words used as test data.

Figure 11: Data

Summary statistics for the data sets are in Table 1. The mean number of words per text block is much lower in the training data than in the test data because each sentence is a text block in the retrieval process. However, as will be described in Section 4.2.1, the text blocks in the training data are joined together to form longer sequences.

Table 1: Summary statistics

                                 Train data               Test data
Variables                        Articles   Text blocks   Articles   Text blocks
Number of units                  2469       76629         21         127
Mean number of words/unit type   495        15.918        475        73.3

3.2 Pre-Processing

The training and test data are pre-processed in two different ways. Recurring phrases in the training data that stem from the retrieval process were removed. Examples are: ”Aftonbladet eller artikelförfattaren.”, ”Läs hela artikeln”, ”Alla artiklar är skyddade enligt”. Dates, authors, symbols, titles, and image captions are all removed. The titles of the articles in the test data are not included because they can consist of a few words but generate multiple text blocks, making them short and unlikely to be valuable to the model, likely only introducing noise.

4 METHODOLOGY

This section outlines the thesis’s methodology. First, the models and training are explained. The following part explains the different data formats used in the thesis. Finally, the last part outlines the metrics used to evaluate the models.

4.1 Models

The BERT model trained by the National Library of Sweden is used to answer the research questions. ”[B]ert-base-Swedish-cased”⁶ is trained with the same experimental settings Google originally used⁷. One difference is the tokenization algorithm, where KB instead used ”SentencePiece” to split words and tokenize when training (Malmsten et al. 2020).

A comprehensive list of the text materials used in pre-training is in Table 2. The corpus is heavily skewed towards newspapers and consists of more than 200M sentences and 3000M tokens. The authors decided to remove text blocks shorter than ten sentences, and the vocabulary size was set to 50325, making it much more extensive than Google’s. The vocabulary size is motivated by the many compound words in the Swedish language (Malmsten et al. 2020). The paper does not report the accuracy on the pre-training tasks.

Table 2: Resources used in pre-training BERT

Material
Digitized newspapers
Official Reports of the Swedish Government
Legal e-deposits
Social media
Swedish Wikipedia

Source: (Malmsten et al. 2020)

KB’s BERT is fine-tuned on binary text block prediction, similar to the NSP pre-training task, to answer the research questions. This means that two text blocks are fed to the model, separated by a [SEP] token. The fine-tuned BERT is also used to reconstruct articles and is compared against the BERT base version, which has only been pre-trained.

⁶ Trained on lower-cased Swedish text.

⁷ https://github.com/Kungbib/swedish-bert-models.

The last BERT base layer for sequence classification tasks is another dense layer, applying a linear transformation over the [CLS] token representation. On top of this layer, another dense layer with two output nodes and a softmax activation is added, which is the fine-tuning step. The added layer is a fully connected classification layer with weights $W \in \mathbb{R}^{2 \times 768}$. The number of labels is two, and the hidden size is 768. The labels are "next" and "not next" and refer to whether two text blocks are in sequence or not. The same modification is done to the base version.
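As an illustration of the added classification layer, the following is a minimal pure-Python sketch of a 2×768 linear layer followed by a softmax. The bias term, the label order, and all numeric values are assumptions made for the example; they are not taken from the trained models.

```python
import math

HIDDEN_SIZE = 768
NUM_LABELS = 2  # index 0: "next", index 1: "not next" (label order is an assumption)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, bias):
    """Apply W h + b to the [CLS] representation h, then softmax."""
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Toy values only: a constant [CLS] vector and hand-picked weights.
cls_vector = [0.01] * HIDDEN_SIZE
weights = [[0.02] * HIDDEN_SIZE, [-0.02] * HIDDEN_SIZE]  # shape 2 x 768
bias = [0.0] * NUM_LABELS
probs = classify(cls_vector, weights, bias)
```

The two outputs sum to one, so the first entry can be read as the probability of "next" under the assumed label order.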

4.2 Training

BERT with different hyperparameter settings is trained on the daily newspaper articles retrieved from Mediearkivet. The training data consist of 2469 articles, of which 80% are used for training and 20% (500 randomly selected articles) for validation. BERT is then tested on the set of OCR’d daily newspaper articles provided by the National Library of Sweden.

4.2.1 Generating Train Data Structure

I adopted the data format of the next sentence prediction task in order to train BERT on both correct and incorrect text block pairs⁸. It introduces negative sampling and generates text pairs in a BERT format. Some hyperparameter values need to be set, such as the total number of tokens and the probability of sampling a random block as the second segment. Since the maximum number of tokens in BERT is 512, the maximum input is 509 tokens plus 3 special tokens. The number of non-special tokens is divided by two, generating two segments that are often equal in length, where a [PAD]⁹ token fills the empty slots.

⁸ Code adapted from: https://github.com/huggingface/transformers/blob/master/src/transformers/data/datasets/language_modeling.py

⁹ The loss function ignores the [PAD] token in the PyTorch library.

Table 3: Example output from train data

Text block pairs                                                            Label
[CLS] Article 1, text blocks 1 & 2 [SEP] Article 2, text block 1 [SEP]      "not next"
[CLS] Article 1, text block 3 [SEP] Article 2, text blocks 1 & 2 [SEP]      "not next"
[CLS] Article 2, text block 1 [SEP] Article 2, text blocks 2 & 3 [SEP]      "next"
[CLS] Article 3, text block 1 [SEP] Article 3, text blocks 2 & 3 [SEP]      "next"
[CLS] Article 3, text block 1 [SEP] Article 1, text blocks 1-3 [SEP]        "not next"

Table 3 illustrates the generated data structure. The number of articles in the example is three, with three text blocks each. The probability of selecting a random text block from another article is 0.5, which is also used in training. Negative samples are labeled "not next", and true text pairs are labeled "next". The first row in the table is labeled "not next" because the first text block of the second article is not a continuation of the first article. Row 3, in contrast, is labeled "next" because the second and third text blocks of article 2 follow the first text block of that article.

The function generates 7500 true and false text block pairs in total from the training data.
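The pair generation described above can be sketched at the text block level as follows. This is a simplified illustration and not the adapted Hugging Face code: it ignores the token budget and padding, and the function name `make_pairs` and the toy articles are hypothetical.

```python
import random

def make_pairs(articles, rng, random_next_prob=0.5):
    """Build (segment_a, segment_b, label) tuples from lists of text blocks.

    With probability random_next_prob, segment B is drawn from another
    article and the pair is labeled "not next"; otherwise segment B is the
    continuation of segment A and the pair is labeled "next".
    """
    pairs = []
    for idx, blocks in enumerate(articles):
        if len(blocks) < 2:
            continue  # an article needs at least two blocks to be split
        split = rng.randint(1, len(blocks) - 1)
        segment_a = " ".join(blocks[:split])
        if len(articles) > 1 and rng.random() < random_next_prob:
            other = rng.choice([i for i in range(len(articles)) if i != idx])
            take = rng.randint(1, len(articles[other]))
            segment_b = " ".join(articles[other][:take])
            label = "not next"
        else:
            segment_b = " ".join(blocks[split:])
            label = "next"
        pairs.append((segment_a, segment_b, label))
    return pairs

# Toy data: three articles with three text blocks each, as in Table 3.
articles = [[f"a{i}b{j}" for j in range(1, 4)] for i in range(1, 4)]
pairs = make_pairs(articles, random.Random(1))
for segment_a, segment_b, label in pairs:
    print(f"[CLS] {segment_a} [SEP] {segment_b} [SEP] -> {label}")
```

In the thesis setup the segments are additionally truncated or padded so that each pair fits within the 512-token limit.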

4.2.2 Hyperparameters

The batch size is 8, and the number of epochs is three or four, motivated by the number of epochs used in the downstream tasks in the original BERT paper (Devlin et al. 2018). The same applies to the values of the weight decay. The number of warm-up¹⁰ steps was selected conservatively. The following hyperparameter values are used in training:

• Batch size : ∈ {8}

• Epochs: ∈ {3, 4}

• Weight decay: ∈ {0.01, 0.05}

• Warm-up steps: ∈ {50, 100}

¹⁰ A linear warm-up is applied, meaning that the learning rate increases linearly from 0 to the initial learning rate during the warm-up steps.

The initial learning rate for the optimizer is 5e-5 with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, which are used in all models. Moreover, the dropout probability is 0.1 for all layers. The training/validation losses should be interpreted with caution since the training and validation data are not identical in structure to the test data. The validation loss is used to assess model overfit.
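The warm-up behaviour can be sketched as a small schedule function. The linear warm-up from 0 to the initial learning rate follows the description above; the linear decay to 0 after the warm-up and the value of `total_steps` are assumptions made for the example.

```python
def learning_rate(step, base_lr=5e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at a given optimizer step.

    Linear warm-up from 0 to base_lr over warmup_steps, followed by a
    linear decay to 0 at total_steps (the decay phase is an assumption).
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Learning rate at a few selected steps (total_steps is hypothetical).
schedule = [learning_rate(s) for s in (0, 50, 100, 550, 1000)]
```

With 50 instead of 100 warm-up steps, the peak learning rate is reached earlier and the model spends fewer updates at very small step sizes.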

4.3 Test Data Structures

Two test data formats are used to evaluate the models. The first format is called Text block prediction, where the models only look at the actual next text block or a text block from the following article. The text blocks are presented in chronological order. The second format is called Page block prediction, and the models predict each text block with all the other text blocks in the same article or same page. This structure aims to increase the similarity to accessing OCR’d articles in practice.

4.3.1 Text Block Prediction (TBP)

This format presents the text blocks in order, article by article. This is generally not how OCR-processed data are presented, but the format is compatible with the definition of a reconstructed article. The 21 articles are combined into one column, where each row is a text block. The first row is the first text block from the first article and the last row is the last text block from the last article. There are 106 text pairs that constitute a ”true pair” and are thus labeled "next". There are 20 pairs labeled "not next" because they contain text blocks from two different articles. The last text block of the data does not have a next text block.

BERT initially predicts whether the second text block from the first article is the next text block of the first text block. The model iterates over the data set and continues through the articles, with the last prediction being made on the last two text blocks from the last article. This way, the model inputs are text blocks i and i + 1, where i = 1, 2, ..., 126.
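The construction of the TBP pairs and their ground-truth labels can be sketched as below. The per-article block counts are hypothetical (the thesis only reports the totals of 127 text blocks and 21 articles), and `tbp_pairs` is an illustrative name.

```python
def tbp_pairs(article_ids):
    """Pair each text block with the block that follows it in the data.

    article_ids[i] is the article that text block i belongs to, with the
    blocks listed in reading order. A pair is "next" if both blocks belong
    to the same article and "not next" otherwise.
    """
    pairs = []
    for i in range(len(article_ids) - 1):
        label = "next" if article_ids[i] == article_ids[i + 1] else "not next"
        pairs.append((i, i + 1, label))
    return pairs

# Hypothetical block counts: 20 articles with 6 blocks and one with 7 (127 in total).
block_counts = [6] * 20 + [7]
article_ids = [a for a, n in enumerate(block_counts) for _ in range(n)]
pairs = tbp_pairs(article_ids)
n_next = sum(label == "next" for _, _, label in pairs)
n_not_next = len(pairs) - n_next
```

Regardless of how the 127 blocks are distributed over the 21 articles, this yields 126 pairs, of which 106 are "next" and 20 are "not next".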

4.3.2 Page Block Prediction (PBP)

Another data format is warranted to increase the similarity between how OCR’d data are presented in practice and the data used in this thesis. Instead of three text blocks from an article creating two pairs, i.e., blocks one-two and two-three as in TBP, this format creates six. BERT predicts on every possible text pair, separately and without repetition, from articles on the same page.

For each page, n(n − 1) is computed to obtain the number of unique text pairs on that page, where n is the number of text blocks on the page. As an example, one article with six text blocks yields 30 different text pairs. Applying n(n − 1) to all pages yields 1342 text pairs from the 21 articles, each with an obtained "next" probability. In this format, the models are presented with all possible text pairs and assign each a probability.
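The pair count can be checked with a short sketch that enumerates all ordered pairs of distinct text blocks on a page. The block identifiers are placeholders.

```python
from itertools import permutations

def page_pairs(block_ids):
    """All ordered (reference, candidate) pairs of distinct blocks on one page."""
    return list(permutations(block_ids, 2))

pairs_three = page_pairs(["b1", "b2", "b3"])  # 3 * 2 = 6 pairs
pairs_six = page_pairs(range(6))              # 6 * 5 = 30 pairs
```

The order matters because the pair (b1, b2) asks whether b2 follows b1, which is a different question from whether b1 follows b2.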

The reference text block denotes the text block that is matched against several other text blocks, and the candidate text blocks denote those other text blocks. In order to simplify and reduce the number of possible pairs, text blocks from different pages cannot form pairs. The number of reference text blocks is equal to the number of text blocks in the data (127). Each reference text block is matched with j candidate text blocks, where j is the number of text blocks on the same page minus one.

For each reference block, Algorithm 1 selects the pair with the highest "next" probability among the candidates and labels it "next". If the selected pair matches the ground truth, the model has correctly predicted the pair. The algorithm starts with the first text block in the data as the reference, looks at the other text blocks (candidates), and selects the pair that has the highest "next" probability. It then moves to the second text block in the data and looks at the other text blocks.

Algorithm 1 Selecting candidate text blocks

1: procedure
2:   for i in reference text blocks
3:     for j in candidate text blocks
4:       $\hat{y}_{ij} \leftarrow \arg\max_{j} (\hat{p}_{i}) == j$
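Algorithm 1 can be read as the following minimal sketch, in which each reference block is assigned the candidate with the highest "next" probability. The dictionary layout, the function name `select_next`, and the probabilities are assumptions made for the example; ties are resolved by taking the first maximum.

```python
def select_next(next_probs):
    """Select, for every reference block, the candidate with the highest "next" probability.

    next_probs[reference][candidate] holds the predicted probability that
    the candidate block follows the reference block.
    """
    selected = {}
    for reference, candidates in next_probs.items():
        selected[reference] = max(candidates, key=candidates.get)
    return selected

# Toy probabilities for one page with three text blocks (hypothetical values).
next_probs = {
    "b1": {"b2": 0.91, "b3": 0.40},
    "b2": {"b1": 0.10, "b3": 0.75},
    "b3": {"b1": 0.30, "b2": 0.20},
}
selected = select_next(next_probs)
```

Note that every reference block receives a selected candidate, including the last block of an article, which has no true next block.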

Examples of both formats are in Figure 12. Text block prediction (Figure 12a) iteratively predicts whether the following text block is the next one, whereas page block prediction (Figure 12b) predicts for all text blocks on the same page. The text block in green is the first segment (the reference text block), and the text block(s) in red are the second segment (the candidate text blocks). In some cases, multiple articles were on the same page, which means that the number of candidate blocks increases because the model also predicts text pairs from different articles.

(a) Text block prediction example where the first two text blocks are used as input.

(b) Page block prediction example. Green is reference and red the candidates. If the red box directly under the green would have the highest probability, it would get label 1

Figure 12: Single example for either structure. The green box is the first text block in the data.

Additionally, the obtained probabilities for all 1342 text block pairs from the page block prediction format are examined without using Algorithm 1 to label one reference-candidate pair as "next" per reference block. Instead, any pair whose "next" probability is ≥ 0.5 is labeled "next". The threshold of 0.5 was selected in order to observe how pairs within the same article and pairs from different articles are predicted.
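The thresholding step can be sketched as follows. The function name `clustering_rates` and the toy probabilities are hypothetical, and the sketch assumes that both pair types occur at least once.

```python
def clustering_rates(pairs, threshold=0.5):
    """Share of pairs labeled as expected when thresholding the "next" probability.

    pairs is a list of (same_article, prob_next) tuples. Returns the
    percentage of same-article pairs labeled "next" and the percentage of
    different-article pairs labeled "not next".
    """
    same = [p >= threshold for is_same, p in pairs if is_same]
    diff = [p < threshold for is_same, p in pairs if not is_same]
    return 100 * sum(same) / len(same), 100 * sum(diff) / len(diff)

# Toy input: four same-article pairs and two different-article pairs.
toy_pairs = [(True, 0.99), (True, 0.62), (True, 0.31), (True, 0.80),
             (False, 0.12), (False, 0.55)]
within, between = clustering_rates(toy_pairs)
```

Unlike Algorithm 1, this labeling allows a reference block to have several "next" candidates or none at all.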

4.4 Metrics

Several metrics commonly used in statistical analysis are used to evaluate the models’ performance. The accuracy, F1-score, number of reconstructed articles, and Matthews correlation coefficient (MCC) are computed to help evaluate the binary classification tasks.

Equation 11 is known as the F1-score, which is the harmonic mean of recall and precision. A score of 1 indicates perfect precision and recall. The F1-score is a standard metric when the number of true negatives is high. This is the case for the second test data format.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{11}$$

Precision is the number of true positives (TP) divided by the number of predicted positives (TP+FP). In the text blocks domain, true positives are the number of true text block pairs that the model also predicted as true. The other component of the F1-score is recall, which is the number of true positives divided by the total number of true text block pairs (TP+FN).

$$\text{Precision} = \frac{TP}{TP + FP}, \tag{12}$$

$$\text{Recall} = \frac{TP}{TP + FN}, \tag{13}$$

where:

• TP = True Positive

• TN = True Negative

• FP = False Positive

• FN = False Negative

MCC aims to capture the quality of a binary classification task (Rothman 2021) and is similar to the chi-square statistic. It is a balanced metric when the occurrence imbalance for the two classes is large. A value of 1 represents perfect prediction.

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \tag{14}$$

$$|MCC| = \sqrt{\frac{\chi^2}{n}}, \tag{15}$$

where $n$ is the number of observations. $MCC \in [-1, 1]$.
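Equations 11 to 14 can be computed directly from the four confusion matrix counts, as in the sketch below. The function name `binary_metrics` is illustrative, and the example counts are those reported for the fine-tuned model in Table 5(a).

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1-score, and MCC from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Counts from Table 5(a): rows are predictions, columns are the truth.
metrics = binary_metrics(tp=102, fp=3, fn=4, tn=17)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty classes; if any of the four sums in the MCC denominator is zero, the metric is undefined and the division fails.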

Lastly, the ratio of reconstructed articles is computed. For a given article that is not the first or last in the data set, three conditions must hold for the article to count as reconstructed.

1. The pair formed by the last text block of the preceding article and the first text block of the current article is predicted "not next".

2. Every text block pair within the article is predicted "next".

3. The pair formed by the last text block of the current article and the first text block of the succeeding article is predicted "not next".
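The three conditions above can be checked programmatically, as in the following sketch. It assumes the TBP format, where `predictions[i]` is the predicted label for the pair of text blocks i and i + 1; the function name and the toy data are hypothetical.

```python
def count_reconstructed(article_ids, predictions):
    """Count articles for which all three reconstruction conditions hold.

    article_ids[i] is the article of text block i (blocks in reading order);
    predictions[i] is the predicted label for the pair (block i, block i + 1).
    The first and last articles are skipped because they lack one boundary pair.
    """
    starts, ends = {}, {}
    for i, article in enumerate(article_ids):
        starts.setdefault(article, i)
        ends[article] = i

    reconstructed = 0
    for article, start in starts.items():
        end = ends[article]
        if start == 0 or end == len(article_ids) - 1:
            continue
        boundaries_ok = (predictions[start - 1] == "not next"
                         and predictions[end] == "not next")
        inner_ok = all(predictions[i] == "next" for i in range(start, end))
        if boundaries_ok and inner_ok:
            reconstructed += 1
    return reconstructed

# Toy data: four articles; the boundary between articles 2 and 3 is mispredicted.
article_ids = [0, 0, 1, 1, 1, 2, 2, 3]
predictions = ["next", "not next", "next", "next", "not next", "next", "next"]
n_reconstructed = count_reconstructed(article_ids, predictions)
```

In the toy data only article 1 is reconstructed: article 2 fails the third condition, and articles 0 and 3 are excluded as the first and last articles.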


5 RESULTS

This section presents the results for the data structures for the fine-tuned BERT models and BERT base. The first part presents the metrics on the text block prediction format, and the second part presents the metrics for the page block prediction format. The training and validation losses for the fine-tuned models are found in the Appendix. Each fine-tuned BERT trained for three to five hours.

5.1 TBP results

All five models achieved an accuracy above 90%. 11-13 articles were correctly reconstructed, and the more balanced metrics display good performance. The model that achieved the highest accuracy ran for four epochs with 100 warm-up steps and weight decay 0.01. It achieved an accuracy of 94.4%, marginally better than that of the other fine-tuned versions and BERT base. As seen in Table 4, there is not a large difference between the fine-tuned versions and BERT with only pre-trained weights. The base version achieved an accuracy of 93.7%, almost as high as the fine-tuned model with the highest accuracy.

Table 4: Metrics for text block prediction

                         BERT_1                BERT_2                BERT_base
Metrics                  3 epochs   4 epochs   3 epochs   4 epochs
Accuracy                 0.913      0.944      0.913      0.929      0.937
F1 score                 0.947      0.967      0.947      0.957      0.962
Articles reconstructed   12/19      13/19      12/19      12/19      11/19
MCC                      0.71       0.796      0.71       0.751      0.762

Note: BERT_1: warm-up steps 100 and weight decay 0.01. BERT_2: warm-up steps 50 and weight decay 0.05.

The models were less accurate at predicting "not next" for overlapping text pairs, i.e., pairs that span two articles, as seen in the MCC values in Table 4. Table 5 presents confusion matrices for the fine-tuned model with the highest accuracy and for BERT base; they differ by only one prediction.

Table 5: Confusion matrices for text block prediction

(a) Fine-tuned BERT
                         Truth
                         Positive   Negative
Prediction   Positive    102        3
             Negative    4          17

(b) BERT base
                         Truth
                         Positive   Negative
Prediction   Positive    102        4
             Negative    4          16

Based on the results in Table 5, each BERT model correctly classified 102 out of the 106 true text block pairs and correctly classified overlapping text blocks 16-17 times out of 20.

Figure 13 illustrates an example article with three green boxes. The red boxes portray another article.

Figure 13: Illustration of text block prediction. Only for illustrative purposes, does not present actual data.

5.2 PBP results

All models achieve an accuracy of more than 80% on the page block prediction format. The more balanced metrics, such as the F1-score, show that the models struggled to predict the correct reference-candidate block pairs. The best performing fine-tuned model correctly predicted 20 out of 106 correct reference-candidate pairs. BERT base achieved higher values for all metrics. Although the values of the balanced metrics are low, they still indicate that some syntax appears to have been represented.

Table 6: Metrics for page block prediction

            BERT_1                BERT_2                BERT_base
Metrics     3 epochs   4 epochs   3 epochs   4 epochs
Accuracy    0.85       0.845      0.854      0.84       0.874
F1          0.145      0.119      0.169      0.093      0.275
Precision   0.132      0.108      0.154      0.084      0.252
Recall      0.16       0.132      0.189      0.104      0.302
MCC         0.064      0.035      0.091      0.006      0.207

Overall, the models perform quite poorly on selecting the correct reference-candidate pair. BERT base correctly selected 12 more pairs than the fine-tuned model with the highest accuracy and obtained considerably higher values on the more balanced metrics.

Another difference between the fine-tuned models and the base model is how confident each is in its predictions. For the fine-tuned BERT model, the median of the softmax probabilities for "next" is 0.999, whereas the BERT base model has a median of 0.525.

Table 7: Confusion matrices for page block prediction

(a) Fine-tuned BERT
                         Truth
                         Positive   Negative
Prediction   Positive    20         110
             Negative    86         1126

(b) BERT base
                         Truth
                         Positive   Negative
Prediction   Positive    32         95
             Negative    74         1141

Lastly, the performance on the page block prediction format is considered where each of the 1342 pairs is labeled 1 if the probability of "next" is > 0.5, without applying Algorithm 1. Table 8 shows the percentage of text block pairs within the same article that were labeled "next" and the percentage of text block pairs whose two segments are from different articles that were predicted "not next".

The fine-tuned models classify more than 85% of the text block pairs within the same articles as "next", and about 70% of the text block pairs whose two segments are not from the same article are predicted as "not next". BERT base also scores high on both metrics. Compared to the fine-tuned models, it performs better on the text pairs from different articles.

Table 8: Article clustering

                                   BERT_1                BERT_2                BERT_base
Metrics                            3 epochs   4 epochs   3 epochs   4 epochs
% "next" within same article       87.7%      85.2%      89%        87.1%      69%
% "not next" between articles      69.8%      71.1%      69.3%      71%        78.4%

Note: The values in the table should not be confused with the values in Tables 6 and 7. The values are from two different thresholds.

Figure 14 illustrates example articles. Like in the previous figure, the green boxes make up one article, and the red boxes make up another. The double-headed arrows represent text blocks classified as "next" by the models.

Figure 14: Illustration of page block prediction. Only for illustrative purposes, does not present actual data.

6 DISCUSSION

The thesis explores how well BERT can predict text block pairs and reconstruct articles.

If OCR’d text blocks were processed and obtained in the order shown in Figure 13, then the method of inputting text pairs chronologically to BERT to reconstruct articles can be applied with high accuracy. To an extent, this confirms what Shi & Demberg (2019) state about what BERT encodes regarding the semantic connection between two input segments. However, some text blocks in this thesis are short due to the OCR, which reduces the information the encoder can utilize.

The syntactic information to represent is reduced for split paragraphs. The balanced metrics in Table 6 show that BERT’s representations capture some syntactic patterns, but the models perform quite poorly on the page block prediction format. Goldberg (2019) shows that BERT can represent syntax, and thus a possible explanation for the poor performance is that the syntactic information is sparse when text blocks are randomly split. Therefore, it is not unreasonable that the relationship between OCR’d text blocks can be anything but logical. Given that circumstance, it is interesting how well humans annotate text blocks in order.

Table 8 shows how BERT encodes context into the representations. This is what Liu et al. (2020) also show. Given that text blocks from the same articles share context, this result is not surprising.

7 CONCLUSION

In conclusion, the findings show that BERT, depending on the presentation of the data, can be used to reconstruct OCR’d daily newspaper articles. BERT can be used with high accuracy to predict whether a text block of a text pair is the subsequent text block from OCR’d articles, given that the text blocks are ordered. All models obtain high accuracy on both true and false text pairs. Regarding article reconstruction, with the specific data format presented, BERT can be used with quite a high accuracy.

For data structures more similar to how they are in practice, BERT can cluster text blocks from the same articles, since BERT classified most of those pairs as ”next”. The same applies, almost equally, to text pairs where the two text blocks are from different articles: there, BERT classifies a large share as ”not next”. BERT thus clusters text blocks from the same articles and separates text blocks from different articles.

Due to the accurate prediction of text pairs from the same article, it would be valuable to see how BERT’s performance would change if the training setup were more similar to predicting the next segment. An alternative is to train on the following text block in an article together with another section from the same article that does not directly follow, or a section from a random article. The model then trains on both syntax and context. Moreover, the test data structure can be improved so that if text blocks one and two are predicted to be a pair, then text block three cannot be tested against text block two in a later iteration. The latter is purely a technical task.

8 ACKNOWLEDGMENTS

I want to thank Måns Magnusson and Andreas Östling for their valuable input and patience throughout this project. I also extend my thanks to Kungliga biblioteket (National Library of Sweden, KB) / KBLab for providing the necessary material and resources. A special thanks to Faton Rekathati at KBLab.

REFERENCES

Adesam, Y. et al. (2019), ‘Exploring the quality of the digital historical newspaper archive kubhist’, 2364, 9–17.

URL: http://ceur-ws.org/Vol-2364/1_paper.pdf

Adhikari, A., Ram, A., Tang, R. et al. (2019), ‘Docbert: BERT for document classification’, CoRR abs/1904.08398.

URL: http://arxiv.org/abs/1904.08398

Bjork, L., Johansson, T. & Dannells, D. (2018), ‘Evaluation and refinement of an enhanced ocr process for mass digitisation’, [PowerPoint Presentation] .

URL: https://spraakbanken.gu.se/sites/spraakbanken.gu.se/files/OCR.pdf

Clark, K., Khandelwal, U., Levy, O. & Manning, C. D. (2019), ‘What does BERT look at? an analysis of bert’s attention’, CoRR abs/1906.04341.

URL: http://arxiv.org/abs/1906.04341

Dannells, D., Johansson, T. & Bjork, L. (2019), ‘Evaluation and refinement of an enhanced ocr process for mass digitisation.’.

URL: http://ceur-ws.org/Vol-2364/9_paper.pdf

Devlin, J., Chang, M., Lee, K. & Toutanova, K. (2018), ‘BERT: pre-training of deep bidirectional transformers for language understanding’, CoRR abs/1810.04805.

URL: http://arxiv.org/abs/1810.04805

Ghosh, S., Vinyals, O., Strope, B. et al. (2016), ‘Contextual LSTM (CLSTM) models for large scale NLP tasks’, CoRR abs/1602.06291.

URL: http://arxiv.org/abs/1602.06291

Goldberg, Y. (2019), ‘Assessing bert’s syntactic abilities’, CoRR abs/1901.05287.

URL: http://arxiv.org/abs/1901.05287

He, K., Zhang, X., Ren, S. et al. (2015), ‘Deep residual learning for image recognition’, CoRR abs/1512.03385.

URL: http://arxiv.org/abs/1512.03385

Hearst, M. A. (1997), ‘Texttiling: Segmenting text into multi-paragraph subtopic passages’, Comput. Linguist. 23(1), 33–64.

URL: https://www.aclweb.org/anthology/J97-1003.pdf

Hendrycks, D. & Gimpel, K. (2016), ‘Bridging nonlinearities and stochastic regularizers with gaussian error linear units’, CoRR abs/1606.08415.

URL: http://arxiv.org/abs/1606.08415

Holley, R. (2009), ‘How good can it get? analysing and improving ocr accuracy in large scale historic newspaper digitisation programs’, D-Lib Magazine: The Magazine of the Digital Library Forum 15.

Jay Alammar (2018), ‘The illustrated transformer’.

URL: http://jalammar.github.io/illustrated-transformer/

Kingma, D. P. & Ba, J. (2015), ‘Adam: A method for stochastic optimization’.

URL: http://arxiv.org/abs/1412.6980

Klein, G., Kim, Y., Deng, Y. et al. (2017), Opennmt: Open-source toolkit for neural machine translation, in ‘Proc. ACL’.

URL: https://doi.org/10.18653/v1/P17-4012

Kungliga biblioteket (2020), ‘Kungliga biblioteket’.

URL: https://www.kb.se/in-english/about-us/the-national-library-of-sweden.html

Liu, Q., Kusner, M. J. & Blunsom, P. (2020), ‘A survey on contextual embeddings’, CoRR abs/2003.07278.

URL: https://arxiv.org/abs/2003.07278

Loshchilov, I. & Hutter, F. (2017), ‘Fixing weight decay regularization in adam’, CoRR abs/1711.05101.

URL: http://arxiv.org/abs/1711.05101

Malmsten, M., Börjeson, L. & Haffenden, C. (2020), ‘Playing with words at the national library of sweden – making a swedish bert’.

URL: https://arxiv.org/pdf/2007.01658.pdf

Mandelbaum, A. & Shalev, A. (2016), ‘Word embeddings and their use in sentence classification tasks’, CoRR abs/1610.08229.

URL: http://arxiv.org/abs/1610.08229

Mikolov, T., Chen, K., Corrado, G. et al. (2013), ‘Efficient estimation of word representations in vector space’.

URL: http://arxiv.org/abs/1301.3781

Neyshabur, B., Sedghi, H. & Zhang, C. (2021), ‘What is being transferred in transfer learning?’.

URL: https://arxiv.org/pdf/2008.11687.pdf

Palfray, T., Hebert, D., Nicolas, S., Tranouez, P. & Paquet, T. (2012), ‘Logical segmentation for article extraction in digitized old newspapers’, CoRR abs/1210.0999.

URL: http://arxiv.org/abs/1210.0999

Rothman, D. (2021), Transformers for Natural Language Processing: Build innovative deep neural network architectures for NLP with Python, Packt Publishing, Birmingham, England.

Ruder, S. (2016), ‘An overview of gradient descent optimization algorithms’, CoRR abs/1609.04747.

URL: http://arxiv.org/abs/1609.04747


Saha, S., Basu, S., Nasipuri, M. et al. (2010), ‘A hough transform based technique for text segmentation’, CoRR abs/1002.4048.

URL: http://arxiv.org/abs/1002.4048

Shi, P. & Lin, J. (2019), ‘Simple BERT models for relation extraction and semantic role labeling’, CoRR abs/1904.05255.

URL: http://arxiv.org/abs/1904.05255

Shi, W. & Demberg, V. (2019), ‘Next sentence prediction helps implicit discourse relation classification within and across domains’, pp. 5790–5796.

URL: https://www.aclweb.org/anthology/D19-1586

Vaswani, A., Shazeer, N., Parmar, N. et al. (2017), ‘Attention is all you need’, CoRR abs/1706.03762.

URL: http://arxiv.org/abs/1706.03762

Wu, Y., Schuster, M., Chen, Z. et al. (2016), ‘Google’s neural machine translation system: Bridging the gap between human and machine translation’, CoRR abs/1609.08144.

URL: http://arxiv.org/abs/1609.08144

Xiong, R., Yang, Y., He, D., Zheng, K. et al. (2020), ‘On layer normalization in the transformer architecture’, CoRR abs/2002.04745.

URL: https://arxiv.org/abs/2002.04745

Xu, J., Sun, X., Zhang, Z. et al. (2019), ‘Understanding and improving layer normalization’, CoRR abs/1911.07013.

URL: http://arxiv.org/abs/1911.07013

Zacharias, E., Teuchler, M. & Bernier, B. (2020), ‘Image processing based scene-text detection and recognition with tesseract’.

Zhang, A., Lipton, Z. C., Li, M. & Smola, A. J. (2020), Dive into Deep Learning.

URL: https://d2l.ai

Zhuang, F., Qi, Z., Duan, K. et al. (2019), ‘A comprehensive survey on transfer learning’, CoRR abs/1911.02685.

URL: http://arxiv.org/abs/1911.02685


9 APPENDIX

Table 9 presents the loss values after the last epoch, for all fine-tuned BERT models.

Table 9: Loss values for the last epoch

BERT

1

BERT

2



1 INTRODUCTION

OCR systems convert image input into readable text data (Saha et al. 2010). They segment and extract texts from images, such as newspapers, into units of text (often entire paragraphs).

Red lines indicate text units, and green lines indicate images. Figure 1 shows an example of inaccurate segmentation: the top left article (see Figure 1a) was segmented into multiple text blocks.

Paragraphs aim to be coherent and self-contained units of text with a structure (Hearst 1997).

KB does not use Tesseract on newspaper articles; this figure showcases how an OCR engine processes a page.

(a) First page of the issue

(b) Top left article of the front page. The OCR engine has incorrectly segmented the body of text, indicated by the many horizontal lines.

Figure 1: Segmentation of newspaper (Bjork et al. 2018).

To train a model on article reconstruction, a definition of a reconstructed article is needed. The chosen definition is the correct prediction of all text block pairs within an article, together with its beginning and end. Reconstructing an article consists of:

• Correct classification of the first and last text pairs (i.e., where the article overlaps with other articles at its beginning and end). In practice, this means using the ending of one article and the beginning of another as combined input.

• Correct classification of the text block pairs that make up the article, i.e., all text block pairs.
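The definition above can be sketched as a small check. This is a minimal illustration and not the implementation used in the thesis: the function name is_reconstructed, the block identifiers, and the dictionary of pair predictions are all hypothetical. A prediction of True is assumed to mean that the model classified the second block of a pair as the continuation of the first.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (block id, candidate next block id); hypothetical ids


def is_reconstructed(
    blocks: List[str],
    predictions: Dict[Pair, bool],
    prev_block: Optional[str] = None,
    next_block: Optional[str] = None,
) -> bool:
    """Return True if an article counts as reconstructed under the definition.

    blocks: the article's text blocks in their true order.
    predictions: model output per ordered pair, True meaning "is next".
    prev_block / next_block: the neighbouring blocks from other articles,
    if any, which form the boundary pairs.
    """
    if not blocks:
        return False

    # Every consecutive pair inside the article must be predicted "is next".
    # A missing prediction is treated as a failure.
    for pair in zip(blocks, blocks[1:]):
        if not predictions.get(pair, False):
            return False

    # Boundary pairs with other articles must be predicted "not next".
    if prev_block is not None and predictions.get((prev_block, blocks[0]), True):
        return False
    if next_block is not None and predictions.get((blocks[-1], next_block), True):
        return False
    return True


predictions = {("a1", "a2"): True, ("a2", "a3"): True, ("a3", "b1"): False}
print(is_reconstructed(["a1", "a2", "a3"], predictions, next_block="b1"))  # True
```

Under this sketch a single misclassified pair, internal or at a boundary, is enough for the article to count as not reconstructed, which is why reconstruction is a stricter measure than pair-level accuracy.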

1.1 Related Work

The specific task of article reconstruction is relatively unexplored in NLP, but adjacent language tasks have been explored more. The relation between sentences and article extraction are two areas that tie into text block pair prediction and article reconstruction.

1.1.1 Syntactic Tasks

BERT is a recently published transformer-based language model that achieves record-breaking results on different language understanding tasks, including document classification (Adhikari et al. 2019, Devlin et al. 2018) and relation extraction (Shi & Lin 2019). Shi & Demberg (2019) claim that BERT’s ability to capture implicit discourse relations in texts is due to its next sentence prediction (NSP) task. NSP is one of BERT’s two pre-training tasks.
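For NSP, BERT receives two text segments packed into a single input sequence and classifies whether the second segment follows the first. The sketch below only illustrates how such a pair is laid out, with the special tokens [CLS] and [SEP] and a segment id per token. The function name pack_pair is hypothetical, and whitespace splitting is an assumption made for brevity; BERT itself uses WordPiece tokenization.

```python
from typing import List, Tuple


def pack_pair(first: str, second: str) -> Tuple[List[str], List[int]]:
    """Lay out two text segments as one BERT-style pair input.

    Whitespace splitting stands in for WordPiece tokenization
    (an assumption for illustration only).
    """
    a = first.split()
    b = second.split()

    # [CLS] opens the sequence; [SEP] closes each segment.
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]

    # Segment id 0 covers [CLS], the first segment, and its [SEP];
    # segment id 1 covers the second segment and the final [SEP].
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair("the cat", "sat down")
print(tokens)
print(segment_ids)
```

The same layout applies when the two segments are text blocks rather than sentences, which is what makes NSP a natural starting point for text block pair prediction.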

Ghosh et al. (2016) present a Contextual Long short-term memory network (CLSTM) and evaluate it on three different NLP tasks: word prediction, sentence topic prediction, and next sentence selection. This research presents one approach to how the relation between texts can be modeled. Their approach to next sentence selection is to provide candidate

The final model obtains an accuracy of 63%. This method requires a label for each text segment to predict the next segment. It also requires choosing candidate segments, some of which may share the same context; however, text blocks from the same article will naturally share context.

1.1.2 Article Extraction

They use a test dataset of 42 images, and their method identified 85.84% of the articles on the test data. However, this method requires steps before the OCR process.