
Teknik och samhälle

Datavetenskap och medieteknik

Bachelor Thesis

15 credits, undergraduate level

Machine Learning explainability in text classification for Fake News detection


Abstract

Fake news detection has gained interest in recent years. This has led researchers to look for models that can classify text for the purpose of fake news detection. While new models are developed, researchers mostly focus on the accuracy of a model. Little research has been done on the explainability of Neural Network (NN) models constructed for text classification and fake news detection. When trying to add a level of explainability to a Neural Network model, a lot of different aspects have to be taken into consideration. Text length, pre-processing, and complexity play an important role in achieving successful classification. The model's architecture has to be taken into consideration as well. All these aspects are analyzed in this thesis. In this work, an analysis of attention weights is performed to give an insight into NN reasoning about texts. Visualizations are used to show how two models, a Bidirectional Long-Short Term Memory Convolutional Neural Network (BiDir-LSTM-CNN) and Bidirectional Encoder Representations from Transformers (BERT), distribute their attention while training and classifying texts. In addition, statistical data is gathered to deepen the analysis. After the analysis, it is concluded that explainability can positively influence the decisions made while constructing a NN model for text classification and fake news detection. Although explainability is useful, it is not a definitive answer to the problem. Architects should test and experiment with different solutions to be successful in effective model construction.


Acknowledgements


Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Research question
  1.4 Limitations

2 Method
  2.1 Data-sets and pre-processing
    2.1.1 News corpus data-set
    2.1.2 Summarization
    2.1.3 Stemming
    2.1.4 Lemmatization
  2.2 Neural Network architectures
    2.2.1 Embedding layer
    2.2.2 Bidirectional Recurrent Layer with Long-Short Term Memory
    2.2.3 Convolutional Layer
    2.2.4 Attention Mechanism
    2.2.5 Encoder-Decoder
    2.2.6 Transformer
  2.3 Models
    2.3.1 Bidirectional Long-Short Term Memory Neural Network with Attention mechanism and Convolutional layer (BiDir-LSTM-CNN)
    2.3.2 Bidirectional Encoder Representation from Transformers (BERT)
  2.4 Method execution
  2.5 Method discussion

3 Results
  3.1 Visualizations
  3.2 Data-sets used and texts presented in the analysis
  3.3 Models’ accuracy
  3.4 Attention weights coverage
  3.5 Attention weights span over nearby words
  3.6 Domain specific words
  3.7 Parts of speech

4 Analysis
  4.1 Architectures
  4.2 Length of input-data
  4.3 Domain specific words
  4.4 Pre-processing
  4.5 Parts of speech

5 Conclusion

6 Future work

A Overview of attention visualizations for BiDir-LSTM-CNN and BERT - all cases
B Visual representations of Attention Mechanism for BiDir-LSTM-CNN - ”sports”
C Visual representations of Attention Mechanism for BiDir-LSTM-CNN - ”tech”
D Visual representations of Attention Mechanism for BERT - ”sports”
E Visual representations of Attention Mechanism for BERT - ”tech”
F Statistical data


Abbreviations

BERT - Bidirectional Encoder Representation from Transformers

BiDir - Bidirectional

CNN - Convolutional Neural Network

LSTM - Long-Short Term Memory

ML - Machine Learning

NLTK - Natural Language Tool Kit

NN - Neural Network

RNN - Recurrent Neural Network

SVM - Support Vector Machine

1 Introduction

In this section, an introduction to the thesis is presented. First, in section 1.1 a brief background to the subject is laid out, presenting the overall scope. Later, in section 1.2 an issue regarding the explainability of Machine Learning Neural Network models for text classification and fake news detection is presented. In section 1.3 a research question is formulated, clarifying the subject of this work. Finally, limitations are presented in section 1.4, framing the capacity of this thesis.

1.1 Background

The Internet became a fast and cost-effective way of sharing information. With its rise, more and more people rely on online media to provide information, news, and facts about the world via news feeds, online newspapers, and social media [1]. The Internet is reaching a broader public than any paper publication ever did. With the vast amount of information it provides, an everyday user cannot keep up with fact-checking everything he or she consumes. A lack of in-depth knowledge on a subject gives online sources the benefit of trust from the public, which counts on the medium to provide reliable and honest information.

As the amount of honest information available on the Internet has grown, so has the spread of misinformation [2]. Distribution of false claims can happen unknowingly, by providing facts based on unreliable sources, or knowingly, to confuse, mislead, and deceive people. Whatever the case may be, fake news, as it came to be known, has turned into a rising problem Internet users face today. For this reason, the process of detecting false information has become a subject of extensive study in the research community. There is a need for a system that makes it possible to detect, classify, and eliminate news articles providing malicious content. Systems based on manual identification of this type of material are costly and time-consuming. That is why an effort is being made to find an effective Machine Learning (ML) approach to solve these issues.

There have been many different solutions published so far, while researchers are trying to find an efficient ML model for tackling the problem. Many of those solutions are based on different ML architectures, approaches to data representation, and Natural Language Processing.


In this work, an attempt is made to understand how data representation and architecture influence ML understanding of texts, in the context of text classification.

1.2 Problem

The work done in the process of creating an ML model for text classification using a Neural Network (NN) architecture can be simplified to a couple of basic steps. First, a model is researched. Later, a data-set is fed into it, and the model is trained. Lastly, the accuracy of fitting the model to the data is used as a measure of its usefulness. Some parameters of a given architecture are changed along the way to maximize the result. The problem is that researchers seldom ask why a Neural Network model is effective or not, and focus only on the result. The essence of this work is to peek inside the black box that is a NN and try to understand the reasoning behind it. There is very little to no research done on the subject of providing explainability of Neural Network systems for text classification. Hopefully, this analysis helps future researchers pursuing effective Machine Learning Neural Network models for text classification and fake news detection, by providing a deeper understanding of the explainability behind them.

1.3 Research question

The research question is derived directly from the problem description, that is:

• How can we provide explainability of neural-based models for the task of fake news detection via text-classification?

1.4 Limitations

The focus of this work is not on producing a perfect neural network model. The neural network models used are derived from previous work and have shown acceptable results. The models chosen are a means of getting results for the analysis, hopefully leading to a better, general understanding of Neural Networks in Machine Learning and text classification.


Other limitations of this work involve data pre-processing, reduction of layers' dimensionality, and noise cleaning. All of these operations have a positive impact on model performance. Unfortunately, every pre-processing step brings a risk of distorting the meaning of a text. Although data pre-processing in this work is minimal, it still introduces some constraints to the analysis.

In this work, the focus lies on exploring Deep Artificial Neural Networks only. Study [3] shows that Deep Neural Networks outperform other traditional methods of text classification. For this reason, methods like Support Vector Machines (SVM), Bayesian methods, and Decision Trees are omitted from this analysis.

The analysis involves text classification based on word patterns found by the models. No study is done in the field of text classification based on fact-checking or other techniques involving source analysis. In this work, only the architecture and text processing are taken into consideration.

2 Method

In this thesis, an analysis of two different ML Deep Learning Neural Network models is performed: one based on a Convolutional Neural Network architecture (CNN) and a Bidirectional Recurrent Neural Network architecture (BiDir-LSTM), and a second one based on Bidirectional Encoder Representations from Transformers (BERT). These two Neural Network architectures are used because of the difference in their approach to textual data processing. They are representative of the two most popular, general approaches to text classification: the first is based on sequential text processing, and the second on processing the text as a whole. Both architectures give good results and are actively developed by researchers today. The two models make use of the same type of basic layers, however in different ways. One of the layers both models use is an Attention Layer. This layer plays a special role in the analysis. In addition to being an important component of both models, it is also used as a visualization tool, giving a deeper insight into the results produced by the Neural Networks.

In addition, a variety of data-sets are used in the analysis. All of the sets used are part of the same larger data-set, the ”Fake News Corpus” [4]. By categorizing the sets into two domains, ”tech” and ”sports”, two smaller sets are constructed for this analysis. Before processing the data-sets through the Neural Network models, data pre-processing is performed. This data preparation involves stop-word removal, lemmatization, stemming, and text summarization. All of the aspects of the model architectures used and of the data preparation are explained in detail next.

First, data pre-processing is discussed in section 2.1. Later, in section 2.2, all of the important layers are explained. Next, in section 2.3, the model architectures are outlined. Method execution is explained after that, in section 2.4, where the whole process of training and generating results is presented. Finally, the method description is closed with some final notes and observations in section 2.5. Although an effort is made to explain all of the techniques used, a basic knowledge of the subject is expected.


2.1 Data-sets and pre-processing

Pre-processing improves model performance by cleaning data-sets of unnecessary words and characters. In this section, an in-depth description of the data-sets used is presented, along with a detailed explanation of how they are prepared and the processes they undergo before they are used in model training and classification.

2.1.1 News corpus data-set

The data-sets used in this analysis are derived from the ”Fake News Corpus” set [4]. This set is composed of over nine million texts, each labeled as one of 11 categories: false, satire, bias, conspiracy, state, junksci, hate, clickbait, unreliable, political, true. Additionally, this set is categorized by topic using the Term Association Technique [5]. For the comparison in this thesis, two smaller sets, one for each category, are derived from the ”Fake News Corpus”. A number of articles labeled ”true” or ”false”, and categorized as ”sports” or ”tech”, are pulled out of the collection. A detailed outline of all sets used is presented in table 1.

Set 1: 1000 randomly chosen articles categorized as ”sports” and labeled ”true” or ”false” are used for the first set. The composition is balanced between 500 articles labeled ”true” and an additional 500 labeled ”false”.

Set 2: The second set is composed of 1000 randomly chosen articles categorized as ”tech”. The composition is also balanced between 500 articles tagged ”true” and an additional 500 tagged ”false”.

To deepen the analysis, an additional 16 sets are derived from those two, giving in total 18 balanced sets, each containing 1000 articles: 9 sets for each category (”sports”, ”tech”). Both categories contain articles subjected to different pre-processing operations. In each category there is a ”vanilla” set of articles, a set of articles summarized to a maximum of 5 sentences each, and a set of articles summarized to a maximum of 10 sentences each. Summarizations are based on pioneering work done on the subject by H. P. Luhn [6], who successfully summarized texts taking into consideration word frequencies and stop-word (a, an, etc.) removal. Later, in each group of ”vanilla” texts, texts summarized to 5 sentences, and texts summarized to 10 sentences, additional variations are made using lemmatization and stemming. All of the methods used in data preparation are described in the following sections.

Table 1: A summary of all of the data-sets used in the analysis.

Data-set       Size   Summarized         Stemmed   Lemmatized
1  ”sports”    1000   -                  -         -
2  ”sports”    1000   -                  Yes       -
3  ”sports”    1000   -                  -         Yes
4  ”sports”    1000   max 5 sentences    -         -
5  ”sports”    1000   max 5 sentences    Yes       -
6  ”sports”    1000   max 5 sentences    -         Yes
7  ”sports”    1000   max 10 sentences   -         -
8  ”sports”    1000   max 10 sentences   Yes       -
9  ”sports”    1000   max 10 sentences   -         Yes
10 ”tech”      1000   -                  -         -
11 ”tech”      1000   -                  Yes       -
12 ”tech”      1000   -                  -         Yes
13 ”tech”      1000   max 5 sentences    -         -
14 ”tech”      1000   max 5 sentences    Yes       -
15 ”tech”      1000   max 5 sentences    -         Yes
16 ”tech”      1000   max 10 sentences   -         -
17 ”tech”      1000   max 10 sentences   Yes       -
18 ”tech”      1000   max 10 sentences   -         Yes


2.1.2 Summarization

Text summarization is utilized in many Neural Network models used for text classification, which is why it is included in this analysis. The method of text summarization used here is based on the Term Frequency-Inverse Document Frequency (TF-IDF) method described in [7]. It produces a summary accuracy of 68%, which is better than other similar summarization methods available. Summary accuracy is a measure of meaning conveyance: it describes how closely the meaning of a summary generated by the model resembles the meaning of a summary a human would create. The technique produces summaries by extracting sentences from a text. Sentences in summaries are the same as they are in the original texts, preserving the order of appearance.

The method is based on a statistical approach, where the frequency with which a word appears in a document gives it a TF-IDF score.

\[
\mathrm{TF} = \frac{\text{Total appearances of a word in the document}}{\text{Total words in the document}} \tag{1}
\]

\[
\mathrm{IDF} = \log\!\left(\frac{\text{Total number of sentences}}{\text{Number of sentences containing the word}}\right) \tag{2}
\]

\[
\text{TF-IDF score} = \mathrm{TF} \times \mathrm{IDF} \tag{3}
\]

Common words like ”a”, ”are”, and ”in” are omitted from scoring. Since every sentence is bound to contain them, they do not really contribute to the result.

After giving a score to words, every sentence in a document gets an additional score on its own. This sentence score is based on the sum of the TF-IDF scores of individual words in that sentence.

\[
\text{Sentence score} = \sum_{\text{words in sentence}} \text{TF-IDF score} \tag{4}
\]

Sentences with the highest scores are understood as those containing the most relevant information in the document.
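A minimal Python sketch of this scoring scheme is given below; the thesis does not publish its implementation, so the function name, the sentence splitter, and the small stop-word list are illustrative assumptions.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "are", "in", "the", "is", "of", "to"}  # assumed minimal list

def summarize(text, max_sentences=5):
    """Extractive summary: keep the highest-scoring sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = [[w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
                 for s in sentences]
    all_words = [w for sent in tokenized for w in sent]
    tf = {w: n / len(all_words) for w, n in Counter(all_words).items()}          # eq. (1)
    idf = {w: math.log(len(sentences) / sum(w in s for s in tokenized))          # eq. (2)
           for w in tf}
    scores = [sum(tf[w] * idf[w] for w in sent) for sent in tokenized]           # eq. (3)+(4)
    best = sorted(sorted(range(len(sentences)),
                         key=lambda i: scores[i], reverse=True)[:max_sentences])
    return " ".join(sentences[i] for i in best)                                  # keep order
```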

Before attempting TF-IDF, texts are pre-processed with the help of the Python Natural Language ToolKit module (NLTK) [8]. Stop-words are removed to declutter texts from words that do not contribute to the result.

2.1.3 Stemming

The stemming used in this work applies a suffix-stripping technique, removing suffixes like ”-ed”, ”-ize”, ”-s”, ”-de”, and ”mis” from a word. The Porter Stemmer, as it is called, reduces words like ”cats” to ”cat”, and words like ”trouble”, ”troubling”, and ”troubled” to ”trouble”. On the other hand, words like ”misstated” are reduced to ”misstat”. This technique is used in various Neural Networks for text classification. It helps simplify data inputs to the network, reducing dimensions and complexity. This process positively contributes to results in many different supervised architectures, as presented in [9]. In this analysis, the Python NLTK package [8] is used to stem the data-sets.
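For illustration, NLTK's Porter stemmer can be applied as follows (a sketch; the exact stems depend on the NLTK version).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["cats", "trouble", "troubling", "troubled", "misstated"]
print({w: stemmer.stem(w) for w in words})
# "cats" becomes "cat"; the three "trouble" variants collapse to a single stem,
# and "misstated" is cut down to a non-dictionary form.
```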

2.1.4 Lemmatization

This type of data pre-processing also uses Python's NLTK package [8]. Lemmatization, unlike stemming, reduces words to their proper dictionary root form. While words like ”cats” are still reduced to the root form ”cat”, words like ”misstated”, on the other hand, are not. Lemmatization is often used as an alternative to stemming if the dictionary vocabulary is to be preserved.
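A comparable sketch using NLTK's WordNet lemmatizer is shown below (the WordNet corpus has to be downloaded once; this is not the thesis code).

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)      # required once for the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))       # reduced to the dictionary form "cat"
print(lemmatizer.lemmatize("misstated"))  # left unchanged (the default POS is noun)
```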

In theory, text simplification, whether it is stemming or lemmatization, helps reduce the dimensionality of a model without a significant negative effect on the training accuracy. As shown in [11], models using texts that undergo these pre-processing operations can still achieve an acceptable result. The stemming technique is quicker than lemmatization, but texts can end up containing nonsensical, non-dictionary words. For a better understanding of Neural Network reasoning, both alternatives are tested and compared in this analysis.

Lastly, both sets undergo simple, overall basic pre-processing and cleaning operations. Stop-words like ”a”, ”an”, and ”the” are removed while training, and texts are checked for artifacts. Stray letters, parentheses, and non-UTF-8 encoded characters like the copyright sign © and the registered sign ® are also removed. All of these are considered noise data, not contributing to the result.
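The kind of cleaning described above could look roughly like the sketch below; the regular expressions and the use of NLTK's English stop-word list are assumptions, since the thesis does not list its exact rules.

```python
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP = set(stopwords.words("english"))

def clean(text):
    """Remove stop-words, stray symbols, and non-ASCII characters such as the (c)/(R) signs."""
    text = text.encode("ascii", errors="ignore").decode()    # drop non-ASCII artifacts
    tokens = re.findall(r"[A-Za-z']+", text)                 # drop parentheses and stray chars
    return " ".join(t for t in tokens if t.lower() not in STOP)

print(clean("The team © won (the) match ®"))   # -> "team won match"
```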

2.2 Neural Network architectures

2.2.1 Embedding layer

In the embedding layer, every word in the input is assigned a value. This value is associated with an embedding lookup table. In this table, additional values are stored, corresponding to the similarity of words with each other. There are two possibilities for obtaining the embedding values. First, the embedding layer's values can be trained; in this method, an additional model is developed that generates these values from a given data-set. Another possibility is to use an already available embedding lookup table. The second option is used in this thesis: an existing embedding lookup table, GloVe 300d, developed by Stanford University researchers [10], is used in this analysis.
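For illustration, a pre-trained GloVe file can be turned into an embedding matrix roughly as follows; the file name and the toy vocabulary are assumptions.

```python
import numpy as np

EMB_DIM = 300
embeddings = {}
with open("glove.840B.300d.txt", encoding="utf-8") as f:   # assumed local GloVe 300d file
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype="float32")

word_index = {"stadium": 1, "camera": 2}                    # illustrative tokenizer vocabulary
matrix = np.zeros((len(word_index) + 1, EMB_DIM))           # row 0 reserved for padding
for word, idx in word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        matrix[idx] = vector                                # unknown words stay all-zero
# `matrix` can now initialise a (typically frozen) Embedding layer
```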

2.2.2 Bidirectional Recurrent Layer with Long-Short Term Memory

In a simple Recurrent Neural Network (RNN), the current network state knows only about past inputs. This can be a disadvantage in text processing, since words in a text are connected not only to those before them but also to those appearing after. For this reason, a network that can process textual data within its context, in both directions, is much more suitable. A Bidirectional (BiDir) Neural Network has exactly this kind of ability. In a BiDir network, two sets of hidden states are maintained. The first is a forward state, where the network processes inputs from beginning to end, like in any typical RNN. The second is a backward state, where the input is processed back to front. Knowledge of the context on both sides of the current input improves the results of models dealing with sequential data like texts [12]. The BiDir architecture is presented in figure 1.


Figure 1: Bidirectional Neural Network layer with two hidden states (h): forward, and backward. Input-data (X) is processed from both sides by two hidden layers simultaneously, and fed into the output. [23].


A BiDir network is suitable for processing short textual data like words and sentences. Unfortunately, it has problems when dealing with extensively long data, like paragraphs and articles. In this situation, a problem of vanishing or exploding gradients can occur. Since states in the BiDir network are passed from one hidden layer to another, there is a chance that with lengthy texts the gradient starts approaching a minimum or a maximum value. While weights in the hidden states are updated, the result can in the end be overwhelmed by some dominant values. This situation, if it occurs, renders the network unstable and in effect useless. For this reason, another type of component is added to the structure, a Long-Short Term Memory cell (LSTM), shown in figure 2. This cell is responsible for retaining at least a part of a hidden state, even when dealing with lengthy texts. The weights that are at risk of overflowing are now ”regulated” by the addition of values carried by the LSTM layer. The LSTM hinders the gradient from destabilizing the network. Using LSTM in conjunction with BiDir makes it possible to avoid gradient problems in data with dependencies extended over a longer text.
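A minimal Keras sketch of a bidirectional LSTM layer is shown below; the layer sizes are illustrative, and the thesis does not state which framework was used.

```python
import tensorflow as tf
from tensorflow.keras import layers

# One LSTM reads the sequence forward, another reads it backward; by default the
# two hidden-state sequences are concatenated at every time step.
bilstm = layers.Bidirectional(layers.LSTM(64, return_sequences=True))

x = tf.random.normal((2, 100, 300))   # (batch, words, embedding dimension)
h = bilstm(x)
print(h.shape)                        # (2, 100, 128): 64 forward + 64 backward units
```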


2.2.3 Convolutional Layer

The Convolutional Neural Network (CNN), as explained in [12], is one of the most popular Neural Network architectures used today. First developed for 2-dimensional image processing, it was with time adapted to other tasks like text processing. The main operation associated with a CNN is the convolution operation. Partially connected convolution layers substantially decrease the number of weights (parameters) in the network, making them very efficient in the process. How convolution operates is explained next.

Convolution works by taking data into consideration in the context of its surroundings. It uses dependencies and relationships present in the data to achieve learning. This property makes it a good tool for text processing, since words in a text are not isolated but sequential, dependent on one another. In the context of image data, it makes sense to assume that nearby pixels are typically more relevant to each other than pixels that are far away. The same analogy can be projected onto textual data, where words also have a relation with each other [15].

A CNN is built with inner layers (dimensions), where in every layer some features of the input layer are convolved into the next layer. Since there are many inner layers, different features of the input are exposed and projected onto those layers. This gives the network the ability to capture salient features, dependencies among words, and their surroundings. In figure 3, C1 and C2 represent inner layers subjected to the convolution operation. Different filters are used to highlight and extract different features from the input layer. This operation is repeated in the hidden layers, by applying filters to the layers representing the output of the previous filters.

An important aspect of inner layers is that they are not necessarily fully connected, but rather represent some partial region of the previous layer. This characteristic demands a careful approach while designing a model. While only a spatial region of one layer is represented in the next, the parameters are shared across all. Both the local connectivity between inner layers and the parameter sharing reduce dimensionality in the network, making it memory efficient.
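For text, the convolution is typically one-dimensional over word positions; a small Keras sketch (the sizes are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Each filter slides a window of `kernel_size` words over the sequence, so every
# output position mixes a word with its immediate neighbours.
conv = layers.Conv1D(filters=128, kernel_size=5, activation="relu", padding="same")

x = tf.random.normal((2, 100, 300))   # (batch, words, embedding dimension)
y = conv(x)
print(y.shape)                        # (2, 100, 128): one feature map per filter
```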

Figure 3: A typical CNN layer with inner convolutional layers. C1 with 6 filters projecting on inner layers, representing different partial relationships of the input layer [13].

2.2.4 Attention Mechanism

First developed by Bahdanau [16], and later refined by Luong [17], the Attention Mechanism is a key addition to architectures dealing with lengthy textual data. The main idea of an Attention Mechanism is to establish a connection between the input layer and the output layer by assigning additional weights (attention) to every word that is processed by the network. The weights are carried over the hidden layers of the network, even when the input data is overly long. Words having a substantial influence on the result retain a higher weight value than words that are less important. The attention mechanism improves the translation of longer sentences, making it an additional improvement over LSTM.

Both models have an Attention Layer implemented in their architecture. The Attention layer, thanks to its construction, makes it possible to observe the level of interest a Neural Network model has in different aspects of the data used in training and classification. In the case of this analysis, the aspects of interest are words and sentences in any given input to the network. As a by-product of assigning attention weights to words, it is possible to extract those values for the analysis.
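The idea of one weight per word can be sketched with a simple dot-product attention in NumPy; this is a generic illustration, not the exact mechanism implemented in the two models.

```python
import numpy as np

def attention(hidden_states, query):
    """Dot-product attention: one weight per hidden state (i.e. per word)."""
    scores = hidden_states @ query                     # (seq_len,)
    scores -= scores.max()                             # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights
    context = weights @ hidden_states                  # weighted sum of the states
    return weights, context

h = np.random.rand(5, 4)      # toy example: 5 "words", hidden size 4
q = np.random.rand(4)
w, ctx = attention(h, q)
print(w.round(3), w.sum())    # the weights sum to 1; a larger weight = more attention
```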

2.2.5 Encoder-Decoder

Introduced by Google researchers in 2014 [18], this architecture is used to map an input data sequence to an output data sequence. Both input and output sequences have a fixed maximum length, not necessarily equal to each other. The layers, as seen in figure 4, are constructed with either RNNs or LSTMs and an Attention Mechanism. The Encoder-Decoder is composed of two processing blocks, the encoding block and the decoding block. After feeding data to the encoder, hidden states are produced and passed further down the network. The hidden state produced by the last layer of the encoding block is vectorized and passed as the first hidden state to the decoding block. The decoder, making use of the context information gathered in the Attention Mechanism and its last hidden state, maps an input to an appropriate output. The real value of this architecture is in the possibility of mapping input data of one length to an output of a different length, while simultaneously paying attention to the context in which it occurs.

Figure 4: Encoder-Decoder architecture based on RNNs, mapping input data to the output data. An additional intermediate Attention Mechanism is used for context extraction [14].

2.2.6 Transformer

This architecture was first described in [19]. It is similar to the Encoder-Decoder in that it is also used in sequence-to-sequence translation. The difference is that no RNNs or LSTMs are used in this implementation. The architecture is based only on an Attention Mechanism and a feed-forward network. Since there is no sequential processing in this structure, an additional dimension is added to every word. This dimension represents the relative position of every word in the context. This can be seen in figure 5, as the area marked ”positional encoding”. After the hidden states are produced in the encoder, and initial masking operations are performed in the decoder, the hidden states of the encoder are introduced as the initial states of the decoder. This process of passing the hidden state is similar to the one described for the Encoder-Decoder in the earlier section. This architecture has proven to achieve better results than the Encoder-Decoder architecture.
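In the original Transformer paper, this positional information is a fixed sinusoidal signal added to the word embeddings; a sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                       # word positions
    i = np.arange(d_model)[None, :]                         # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions
    enc[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions
    return enc

pe = positional_encoding(seq_len=50, d_model=300)
# `pe` is added element-wise to the word embeddings so the model knows each word's position
```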

Figure 5: Transformer architecture mapping input data to the output data. Implementation without RNNs, with Attention Mechanisms only [20].

2.3 Models

In this section two models used in the analysis are presented. A description of each model and an overall architecture is presented here.

2.3.1 Bidirectional Long-Short Term Memory Neural Network with Attention Mechanism and Convolutional Layer (BiDir-LSTM-CNN)

This Neural Network model is built with all of the basic layers presented in section 2.2. The layers used in this model are a BiDir-LSTM, a CNN, and an Attention Mechanism. According to the authors [21], the model uses the CNN layer for extraction of a high-level representation from the Embedding layer, and the BiDir-LSTM for extraction of past and future context. The Attention Mechanism is used to give focus to different outputs from the hidden layers of the BiDir-LSTM. The model has proven to be very effective in text classification. In the original work, it was tested on a wide variety of different data-sets, giving a mean accuracy of 90%. This model was chosen for the analysis since it is representative of the sequential approach to text processing and has proven to give similarly good results independently of the data-set used.

The model's architecture is shown in figure 6 and works as follows. First, a data-set is pre-processed, tokenized, and fed into the Embedding layer. Later, high-level feature extraction is performed by the CNN layer. After that, the features are fed into the BiDir layer. There, a forward and a backward inner layer extract context from past and future sequences and pass them further into two Attention layers, one for each of the forward and backward states. The Attention Mechanisms process long-term dependencies further. Finally, both contexts are concatenated and processed through a last, simple feed-forward layer, outputting the classification.
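A compact Keras sketch of this pipeline is shown below. The layer sizes and the single-unit attention scorer used to pool each direction are assumptions made for illustration, not the authors' exact implementation [21].

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB_SIZE, EMB_DIM = 500, 20000, 300             # assumed hyper-parameters

def attention_pool(states):
    """Score every time step, softmax the scores, return the weighted sum of states."""
    scores = layers.Dense(1)(states)                        # (batch, time, 1)
    weights = layers.Softmax(axis=1)(scores)                # attention distribution over time
    return layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([weights, states])

inputs = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)         # could be GloVe-initialised
conv = layers.Conv1D(128, 5, activation="relu", padding="same")(emb)
fwd, bwd = layers.Bidirectional(layers.LSTM(64, return_sequences=True),
                                merge_mode=None)(conv)      # keep the two directions separate
context = layers.Concatenate()([attention_pool(fwd), attention_pool(bwd)])
outputs = layers.Dense(1, activation="sigmoid")(context)    # probability of "true" vs "false"

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```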


Figure 6: Overall architecture of a BiDir-LSTM-CNN model used in the analysis [22]. Input data is processed by the layers and outputted as a probability of a text being true or false.


2.3.2 Bidirectional Encoder Representation from Transformers (BERT)

This Neural Network model is based on a paper [25] published by Google researchers in 2019. It is chosen for the comparison in this analysis because of its innovative approach to text classification. The architecture makes use of a partial Transformer structure and BiDir layers. From the Transformer, the encoder block is utilized for context extraction, and the BiDir layer is used for classification. The architecture of this model is presented in figure 7. BERT differs from the BiDir-LSTM-CNN model in the way it processes data. It does not look at the input in sequences, from left to right and from right to left. Instead, it processes the input as a whole, like a Transformer would do. After learning the context in the encoding block, the outputs are fed into the BiDir layers, where they are trained according to a given classification.

BERT is a very effective model for text classification. Unfortunately, it is also very resource demanding. For this reason, a slightly simplified version of BERT, called Distilled BERT (DistilBERT), is used in this thesis. This implementation, as described in [26], achieves 98% of BERT's accuracy, with greater speed and lighter processing. The architecture simplifies the classification layer and optimizes dimensionality without sacrificing the quality of the outcome from the network.
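A minimal sketch of inference with the Hugging Face `transformers` implementation of DistilBERT is shown below; the thesis does not state which library or checkpoint was used, so the model name and this API choice are assumptions.

```python
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, output_attentions=True)

inputs = tokenizer("The new phone ships with a larger screen and camera.",
                   return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)   # probabilities for the two labels
attentions = outputs.attentions                 # one tensor per layer
print(probs, attentions[-1].shape)              # (batch, heads, tokens, tokens)
```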


2.4 Method execution

First, the data-sets (table 1) are created according to the descriptions in section 2.1. Next, the two models, BiDir-LSTM-CNN and BERT, are built in accordance with their architectures in section 2.3. Later, every data-set is trained on both models. Then a prediction is obtained and the Attention Mechanism's weights are extracted. To aid the analysis, attention weights are presented visually as color-coded texts, representing the interest given by the network to every word in the data processed. Visualizations based on those attentions can be seen in appendices A-E. Lastly, statistical data is collected to strengthen the analysis. Statistics representing attention distribution among words in all of the texts give a deeper insight into the NN classification process.

Visualizations are generated with a color-coded scheme and should be understood as follows. A background color is used for every word in a text. The coloration represents the level of attention a model pays to a particular word while training and classifying a text. Words marked in red should be understood as those not having a substantial influence on the model's classification. In contrast, words marked in green should be recognized as those having a strong effect on the outcome of the text's classification. Following this understanding, yellow words should be interpreted as those with a moderate influence on a model's classification. In addition to the 3 principal colors used as backgrounds, a color strength, or alpha, is used to further define the levels of attention. Words with stronger colors, the ones with a higher alpha value for any of the 3 principal colors, have a higher influence on classification than words with the same color and a weaker alpha value. For BERT, the yellow background color is not present; in this model, the intermediate weak attention is omitted from the visualization due to technical difficulties. This difference in visualizations has no effect on the results and analysis. An entire lack of coloration indicates a virtually complete absence of attention for a particular word.
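A sketch of how attention weights could be turned into such color-coded text is shown below; the thresholds and RGBA colors are assumptions chosen to match the described scheme, not the exact values used for the figures.

```python
import html

def colour_for(weight, strong=0.66, moderate=0.33):
    """Map a normalised attention weight in [0, 1] to a background colour."""
    if weight >= strong:
        return f"rgba(0, 180, 0, {weight:.2f})"        # green: strong attention
    if weight >= moderate:
        return f"rgba(255, 200, 0, {weight:.2f})"      # yellow: moderate attention
    return f"rgba(255, 0, 0, {max(weight, 0.1):.2f})"  # red: weak attention

def render(words, weights):
    """Return an HTML string with one coloured <span> per word."""
    return " ".join(
        f'<span style="background-color:{colour_for(w)}">{html.escape(t)}</span>'
        for t, w in zip(words, weights))

print(render(["the", "team", "won", "the", "match"], [0.05, 0.90, 0.70, 0.05, 0.40]))
```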

In addition, statistical data is generated. This data shows a cumulative distribution of attention levels among different parts of speech in the texts examined. Statistical data is produced for each of the two text categories, ”sports” and ”tech”, and for each of the models, BiDir-LSTM-CNN and BERT. The statistical data, together with the visual representation of the attention weights, is believed to give insight into the NN classification process.


2.5 Method discussion

Text processing is problematic since a text has to be viewed not only as a collection of words but also as a whole semantic structure. The context in which words appear in sentences has a significant influence on the training process. Words exist in relation to each other, as well as in relation to the whole text. The positioning of words needs to be taken into account: words appearing at the beginning of a sentence, at the end of a sentence, or in the middle have different influences on the Neural Network's reasoning about them. Text length is something that has to be taken into consideration as well. Some inputs can be a couple of words long, and others can be full-fledged articles. Other aspects like language diversity and the semantic domain can further add to the overall complexity. This is the reason why two models with two distinct approaches to text processing, and an extensive, versatile group of texts, are chosen for this analysis.

The BiDir-LSTM-CNN model's approach is sequential, front to back and back to front. The context is extracted in chunks with the help of the CNN, and the local dependencies are highlighted with the BiDir-LSTM. BERT, on the other hand, processes inputs as a whole, extracting context and dependencies from both sides at the same time. Both models, among other building blocks, include the Attention Mechanism. This mechanism is used, in this thesis, for visualization purposes. The task of extracting attention weights is complicated and difficult. Since the models are built with the help of available Python libraries, some modifications had to be made to these packages to facilitate this functionality. The process of creating layers, models, attention weight extraction, and visualization is beyond the scope of this work, which is why it is not described in detail in this thesis.

3 Results

In this section, all of the results are provided in detail. Figures are gathered in the appendices for easy access; here, a description of the results is provided. First, a summarization of the results is laid out, with references to the appropriate tables and figures. Later, the results are explained in detail. The results covered are:

• Visualizations of the attention weights for typical texts.

• Classification accuracy - statistics.

• Attention weights coverage in texts - statistics.

• Attention span over nearby words - statistics.

• Domain specific words - statistics.

• Parts of speech in every attention type - statistics.

• Parts of speech in all attentions together - statistics.

Gathering data for the analysis, and building and training the models, is a difficult and time-consuming task. It takes about 200 hours to train the models and produce the results: a sum of about 100 hours is needed to train the models, and an additional 100 to gather data for the analysis. Although the time taken to train the models is substantial, it is not as drastic as the time presented in the original work [26]. Additional steps are taken to improve the efficiency of this analysis. The data-sets used in this analysis are smaller, and the models are simplified according to the description in section 2.3.2. These actions positively affected the time needed for producing the results.

After training both models and extracting the attention weights, visualizations of the data and statistics are generated. While visualizations are made for a larger number of texts, only 2 particular examples are presented in this thesis. The texts presented exemplify typical results achieved for the majority of input data, for both contextual domains (”sports”, ”tech”), in all data-sets used. An overall visual overview of the results can be seen in appendix A.


In tables 4 - 5, the percentual share of a text covered by each attention type is shown, for both models. In tables 6 - 9, the average attention span is presented. The results of the two models show a different attention stretch over nearby words; here, the average number of nearby words covered with the same attention type is presented for both models and all data-sets. In all tables, additional cumulative data is presented as well: average values for all texts, and for texts belonging to one of the 3 text-length groups, are also included. In tables 10 - 13, the percentual share of context-specific words is presented. The context-specific words are those used in assigning texts to a particular domain; in this case the 2 domains used are ”sports” and ”tech”. The percentual share is a ratio obtained by looking at the number of context-specific words among all of the words the models paid attention to. In tables 14 - 21, statistical data representing shares of parts of speech is given. First, in tables 14 - 17, the most significant percentual shares of parts of speech in a particular attention type (strong, weak, etc.) are presented. Later, in tables 18 - 21, the percentual share of parts of speech in all of the attentions for a given text is presented.

3.1 Visualizations

The results for the BiDir-LSTM-CNN model are color-coded with the red-yellow-green scheme. The colors correspond to the level of the NN's attention to words while training and classifying texts. In BERT, the yellow background color is not present; in this model, the lack of color represents the model's absence of interest in a particular word. For BiDir-LSTM-CNN, the attention thresholds are divided into thirds of the attention weight range, one for each attention type. For BERT, on the other hand, the division is into halves, one for each attention type. More about the color-coding schemes can be found in section 2.4.

As presented in appendices A-E, several different results have been achieved. The visualizations show differences in the attention paid by the networks to words, depending on the texts' lengths, word diversity, level of pre-processing, and contextual domain. The results also show how a NN shifts focus to different words when the same text is structured differently.

3.2 Data-sets used and texts presented in the analysis

The texts chosen for the visualization illustrate a typical result achieved in training and classification. For ”sports”, a representative text for the domain is chosen. It is a typical text with sports as its subject. It is used in the training of both models, with sports as its context, and the models are expected to learn to classify this text with this domain in mind. For ”tech”, a representative example of a domain-specific text is chosen as well. In this case, the models' effort to reason about the text within the boundaries of the domain is also expected. The text holds well to the subject of technology. By learning on domain-specific data-sets, models can achieve a better understanding of the importance of individual words for that domain. Both texts can be seen in appendices A - E. Both texts are classified correctly.

3.3 Models’ accuracy

The accuracy of a model is a measure of the number of correctly classified texts in a data-set. For the original texts, the accuracy is 85% for BiDir-LSTM-CNN and 52% for BERT. All accuracy values for both models are presented in tables 2 and 3. In both cases, the models' classification accuracies are presented for all data-set types. The percentage values correspond to the level of accuracy a particular model has in its classification decisions. In this thesis' implementation of the models, the inputs differ from those used in the original scientific work. This is the reason why the results achieved here deviate from those in the original papers [21], [26]. The models were originally not developed for classifying fake news. Nonetheless, both models show a consistent classification accuracy above 50%, i.e., better than a 50/50 guess of the actual label. For ”sports”, a noticeable advantage is demonstrated by BiDir-LSTM-CNN on the original text and the original lemmatized text, with values of 85% and 66%. BERT, despite consistently low values, demonstrates slightly higher values for the original text, the original text lemmatized, and the text summarized to 5 sentences. For ”tech”, both models show high values for the original text and the text summarized to 5 sentences. The highest value for BiDir-LSTM-CNN in ”tech” is achieved on the lemmatized text summarized to 5 sentences. For BERT, the highest value in ”tech” is achieved on the text summarized to 5 sentences.

Table 2: BiDir-LSTM-CNN and BERT ”sports” - Accuracy for all texts.

”sports”                      BiDir-LSTM-CNN accuracy   BERT accuracy
Original text                 85%                       52%
Original text lemmatized      66%                       50%
Original text stemmed         51%                       51%
Text max 5 sen.               50%                       51%
Text max 5 sen. lemmatized    50%                       50%
Text max 5 sen. stemmed       50%                       50%
Text max 10 sen.              51%                       50%
Text max 10 sen. lemmatized   50%                       50%
Text max 10 sen. stemmed      55%                       50%

Table 3: BiDir-LSTM-CNN and BERT ”tech” - Accuracy for all texts.

”tech”                        BiDir-LSTM-CNN accuracy   BERT accuracy
Original text                 52%                       51%
Original text lemmatized      50%                       50%
Original text stemmed         51%                       50%
Text max 5 sen.               53%                       52%
Text max 5 sen. lemmatized    56%                       51%
Text max 5 sen. stemmed       50%                       50%

3.4 Attention weights coverage

Attention weights coverage represents the percentage of a text covered by attention weights. BERT's results show a more selective approach to assigning attention to words, while BiDir-LSTM-CNN shows a broader attention coverage. As presented in tables 4 - 5, BiDir-LSTM-CNN has double the coverage of strong attention compared to BERT. For moderate attention in BiDir-LSTM-CNN and weak attention in BERT, the situation seems to be reversed: in this group, it is BERT that has double the attention coverage in comparison to BiDir-LSTM-CNN. The weak attention for BiDir-LSTM-CNN and no attention for BERT seem to be roughly the same.

3.5 Attention weights span over nearby words

Attention weights span represents the number of adjacent words covered with the same attention type. Both models' architectures and approaches to input-data processing influence the classification process. For BiDir-LSTM-CNN, attentions in the original, longer texts tend to have attention spans stretching over a higher number of words than in BERT. As presented in the statistical data in tables 6 - 7, a median of 39 and 129 words in succession are covered by strong or weak attention in BiDir-LSTM-CNN. BERT, on the other hand, has an attention span of 2 words, as can be seen in tables 8 - 9. The median is used since average values are exaggerated due to BiDir-LSTM-CNN's overblown attention span coverage in some of the texts. Nevertheless, this difference in attention spans indicates BiDir-LSTM-CNN's broader contextual consideration of texts compared to BERT. A median taken across all of the text groups shows a span of 15 and 9 words for strong attention for BiDir-LSTM-CNN in the two domains. Meanwhile, BERT holds a 2-word attention span across all of the texts. Although BiDir-LSTM-CNN has a higher attention span than BERT, there is a lot of volatility in its statistical numbers.
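The span statistic itself is straightforward to compute once every word has been given a discrete attention type; a small sketch with illustrative label names:

```python
from itertools import groupby

def attention_spans(labels):
    """Lengths of runs of adjacent words sharing the same attention type."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(labels)]

labels = ["strong", "strong", "weak", "weak", "weak", "none", "strong"]
print(attention_spans(labels))
# [('strong', 2), ('weak', 3), ('none', 1), ('strong', 1)]
```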

3.6 Domain specific words

The data-sets used were assigned to one of the two contextual groups, ”sports” and ”tech”. The contextual grouping was done with the help of keyword association: some words in the texts were characterized as belonging to a particular domain. For example, the word ”stadium” is characteristic of the ”sports” domain.

models with very few exceptions. For summarized texts, the data shows a higher variation of results for both models, with values ranging from 0% to 8% share of domain-specific words. For the moderate, weak, and no attention categories, there is a lot of difference in values between the models and the types of text. The shares of domain-specific words differ drastically depending on the text's length and the type of pre-processing operation the text was exposed to. All of the specifics can be seen in the tables.

3.7 Parts of speech

Not only words from a specific domain but also different parts of speech influence the learning and classification process. In tables 14 to 21, the shares of words belonging to a particular part of speech are presented. Both models show an overwhelming dominance of nouns, followed by verbs, as the parts of speech with the highest attention. With a couple of exceptions, these two groups of words seem to have the foremost influence on both models' reasoning about the texts. For the shares in each of the attention groups presented in tables 14 - 17, and for the shares in all of the attentions together presented in tables 18 - 21, the dominant parts of speech are nouns and verbs. In tables 14 - 17, it can be seen that in every attention group, for both models, nouns and verbs are the dominant parts of speech taken into consideration by the models. In tables 18 - 21 as well, it can be seen that of all of the words the models paid attention to in a particular text, nouns are the type of word both models predominantly paid strong attention to.

4 Analysis

Many different aspects have to be taken into account when analyzing the classification process of NN models for text classification and fake news detection. First, the model itself and its architecture have to be taken into consideration. A model's learning and classification capabilities are affected by the way it is designed. Second, the data and its representation influence the learning process. The texts on which models are trained impact the way a model processes data while classifying new texts. A text's length, its complexity, its stylistic domain, lemmatization, and stemming influence the models' reasoning as well. All of these aspects influence the way models learn to distinguish what is important, and what is not, in a given classification. The results of learning, and how they affect the models' ability to classify, can be analyzed by studying attention weights. Such a study gives a degree of explainability of NN models for text classification and fake news detection. In this section, the two models are analyzed from the perspective of all the aforementioned aspects affecting the learning and classification processes in fake news detection.

4.1 Architectures

The two models chosen for this analysis differ from each other in the way they process the input data. These architectural differences translate into the way attention is assigned to words in a processed text. As can be seen in section 3 and appendices A - E, both models demonstrate a different approach to distributing their attention to words in all of the texts examined. BiDir-LSTM-CNN's approach to data processing is sequential: words are processed from left to right and from right to left. This processing strategy causes the model to pay attention to a larger number of words. Attentions in this model tend to spread over a larger number of words, sometimes covering whole sentences. BERT, on the other hand, tends to be more selective in distributing its attention weights. Attentions in this model are scattered throughout the text. BERT looks at the whole contextual structure of every text it is processing. For BiDir-LSTM-CNN to make a decision when classifying a text, broader contextual data is needed. For the original text presented in appendix B, for example, this model pays strong attention to almost every word present in the text to make a classification decision. The model seems to struggle to cherry-pick nuances existing in texts that could be used to generalize the classification

have a very wide attention span. BERT, on the other hand, demonstrates a very different approach to attention distribution: the attention weights are scattered throughout the text. This behavior, too, is a consequence of the way BERT is constructed. It seems that its architecture, where the whole contextual structure of a given text is processed simultaneously, helps in identifying nuances in a given text. BERT seems to cope better with finding keywords in texts for a given classification. As can be seen in appendices A - E, the model generalizes its classification process, basing decisions on particular words in the texts. There is no situation where broader contextual attention is used in any of the original texts for this model.

4.2 Length of input-data

The length of the input data is another important factor in a model's ability to learn. All of the nuances and dependencies in texts have to be carried over from one layer to another. If the input data is very long, this can lead to a corruption of the learning process; an example of such corruption is the problem of vanishing/exploding gradients explained in section 2.2.2. The two models take different approaches to processing texts of different lengths. For BiDir-LSTM-CNN, when a text is very long, the model seems to try to pay attention to every word in the text or to ignore large chunks of it. It seems that this model cannot cope with overly long texts and tries to find a broader context to ease the classification process. This problem of not coping with long texts seems to be supported by the way the model processes short texts: in this case, BiDir-LSTM-CNN seems to be more selective in its distribution of attention weights in some situations, while paying attention to whole texts in others. BERT, on the other hand, is consistent in its attention distribution. This model is more selective in its choice of the parts of the text important for classification. Text length does not influence BERT's way of distributing weights. This observation seems logical when the architecture is taken into account. BiDir-LSTM-CNN, having problems coping with longer texts, tries to overcompensate by paying attention to a larger group of words. In contrast, BERT has the same approach to finding contextual dependencies in a text, independently of its length. From the accuracy standpoint, it is BiDir-LSTM-CNN that is most effective in classifying text. This model not only has the highest accuracy


4.3 Domain specific words

Texts belonging to a certain domain are characterized by the presence of words typical for that domain. This contextual repetitiveness of domain-specific words eases the models' ability to learn dependencies in texts. It helps to identify the words or phrases having the biggest impact on the classification, by having all of the texts in contextual coherence with each other. The models exhibit different approaches to text processing when the domain in which words appear within a text changes. The same contextual domain of all of the texts provided for learning helps to find nuances and dependencies, with the domain as an underlying subject. Both models tend to pay strong or at least moderate attention (weak in the case of BERT) to words identified as belonging to a particular domain. Both models exhibit a pattern of reasoning based on the contextual domain in which they are trained. For ”sports”, words like ”stadium” and ”team” are of interest, while for ”tech”, words like ”screen” and ”camera” are interesting. This behavior is due to the learning process itself. While models learn to classify texts from the same contextual domain, and the same words keep coming up over and over again in different texts, they learn to pay attention to those words. These words become the central interest of the models while developing generalized dependencies, which are later used in classification.

4.4 Pre-processing

The pre-processing techniques used in this analysis are lemmatization and stemming. These techniques are explained in section 2.1. From a human perspective, such pre-processing does not help but rather distorts texts, clouding their meaning. From a NN perspective, however, these operations help to minimize the level of complexity of texts. For a NN model, there is no difference whether words are grammatically correct or not. NN models do not operate on an understanding of texts like humans do. For a NN model, every word is a variable, with connections and dependencies to other words in a particular text. By minimizing the number of unique words, for example by stemming words to their root form, the complexity of each text is reduced. Models can still find relationships between words, just not based on their grammatical structure. For BiDir-LSTM-CNN, the effect of pre-processing can be seen in the ”sports” texts presented in the appendices. Texts subjected to lemmatization and stemming tend to be paid attention to in a more selective manner than is the case with the original, non-pre-processed texts. Since the vocabulary in

in appendix C. In this example, the original and stemmed texts keep the word ”going” in its original form. This causes BiDir-LSTM-CNN not to pay attention to this part of the text. On the other hand, when the word ”going” is lemmatized and simplified to its root form ”go”, the model starts to perceive this part of the text as an important aspect of the classification. In every text where words are in their original form, some dependencies are missed, since there are too many variables. Once words are brought to their root forms, new dependencies are observed, giving a better generalization of the classification. Simplifying texts and finding more dependencies does not guarantee that a model is going to be successful. From the accuracy perspective, the models tend to achieve better accuracy on texts that are not subjected to pre-processing operations. In almost all cases, on variations of texts subjected to pre-processing operations, the models get worse accuracy values than is the case with the original texts. It cannot be concluded whether it is stemming or lemmatization that has the more positive effect on classification. However, by comparing original texts with texts that are pre-processed, it can be concluded that both models tend to get better classification results when the texts are not processed in any way.

4.5 Parts of speech

The most dominant parts of speech used in classification are nouns and verbs. For both models, in all texts examined, nouns and verbs get the highest attention. As can be seen in tables 16 - 21, the models seem to recognize the fact that other parts of speech, like adjectives, pronouns, and adverbs, are much the same for all texts. What makes a particular sentence unique, and what makes a difference in a text, are the nouns and verbs used. The models recognize this and adapt to pick them out. While learning, and later when classifying, these two types of words end up getting the highest attention of all attention types. While training, the models develop an understanding of the importance of nouns and, in combination with a contextual domain, can distinguish which nouns should be paid strong attention to and which ones should be omitted while classifying. The evidence of the models' understanding of the importance of nouns and verbs can be seen in statistical tables 17 and 21. Not only in each particular group of attentions but also for the share in all of the attentions combined, these two groups of parts of speech are indisputably dominant. This translates well to a human understanding of the language. Using only nouns and verbs, it would be

5 Conclusion

The architecture is not the definitive reason why a model is successful or not. For it to be successful, a lot of different aspects have to be taken into consideration. The data used in learning, its length, complexity, and pre-processing have a big impact on the learning process. This is why a level of explainability of NN models for text classification and fake news detection has a positive impact on the creation of successful models. Two architectures have been analyzed in this thesis: first, BiDir-LSTM-CNN, which looks at the data sequentially, word by word, from left to right and from right to left; second, BERT, which finds dependencies between words across the whole text. Both models have proven to be successful to different degrees when classifying texts for fake news detection. The outcome of the analysis showed how different the approaches of these two models to learning and classifying texts are. BiDir-LSTM-CNN takes larger chunks of text under consideration or ignores them entirely. BERT is more selective in distributing attention. There is also a big difference in both models' learning processes when it comes to shorter texts or texts pre-processed in some way. Sometimes, as is the case for BiDir-LSTM-CNN when a text is too long and the model cannot handle it, the model tries to overcompensate by paying attention to the whole text. In this situation, pre-processing, although messy from a human perspective, can help reduce text complexity and ease the model's learning. On the other hand, pre-processing texts does not necessarily mean that a model is going to achieve adequate accuracy. It seems that a NN architect must balance the representation of the data-set between the length of the input and its complexity to achieve the best accuracy possible. Another aspect of learning is the fact that nouns and verbs are the primary candidates to influence a model's learning and classification. This is why the conclusion is that the best-suited texts for learning are those rich in words belonging to these parts of speech. BiDir-LSTM-CNN is superior in its accuracy, scoring 85% on the original texts and 66% on the original lemmatized texts. In contrast, BERT is consistent, with an accuracy of around 50% for all data-sets. Overall, the explainability of NN models cannot be reduced to text length, pre-processing, or word patterns. A NN architect should experiment with different text representations to get the best results. The conclusion is that it is possible to provide a degree of explainability to NN models for text classification and fake news detection, as shown in the analysis. Explainability of neural-based models for the task of fake news detection via text classification can be provided by taking a closer look at the state of the data-set used and the model's architecture. However, although explainability


6 Future work

In this thesis, an analysis was made of the models' ability to detect patterns when classifying texts for fake news detection. Future work could expand this into a broader statistical analysis. A higher number of data-sets could be used in training and in generating visualizations and statistical data, which would give an even deeper understanding of the subject. Moreover, the influence of context on the learning and classification processes could be analyzed further by extending the data-sets with additional randomized sets that do not belong to any domain. These additional data-sets could be cross-checked with the domain-specific sets used in this thesis. More extensive training on a larger number of texts labeled as true or false would also be beneficial for examining text classification in fake news detection. Finally, an experiment involving training the models on texts from one domain and using them to classify texts from a different domain could give more information about attentions and their distributions.


A Overview of attention visualizations for BiDir-LSTM-CNN and BERT - all cases

Figure organization:

• Figure 8, BiDir-LSTM-CNN ”sports”.

• Figure 9, BERT ”sports”.

• Figure 10, BiDir-LSTM-CNN ”tech”.

• Figure 11, BERT ”tech”.

Overview organization:

1. Original text.

2. Original text - lemmatized.

3. Original text - stemmed.

4. Text summarized max 5 sentences.

5. Text summarized max 5 sentences - lemmatized.

6. Text summarized max 5 sentences - stemmed.

7. Text summarized max 10 sentences.

8. Text summarized max 10 sentences - lemmatized.

9. Text summarized max 10 sentences - stemmed.


B Visual representations of Attention Mechanism for BiDir-LSTM-CNN - ”sports”


Figure 17: BiDir-LSTM-CNN ”sports” - Summarized max 5 sentences in training, stemmed, typical example.

Figure 18: BiDir-LSTM-CNN ”sports” - Summarized max 10 sentences in training, typical example.


C Visual representations of Attention Mechanism for BiDir-LSTM-CNN - ”tech”


Figure 24: BiDir-LSTM-CNN ”tech” - Summarized max 5 sentences in training, typical example.

Figure 25: BiDir-LSTM-CNN ”tech” - Summarized max 5 sentences in training, lemmatized, typical example.


D Visual representations of Attention Mechanism for BERT - ”sports”


Figure 35: BERT ”sports” - Summarized max 5 sentences in training, stemmed, typical example.

Figure 36: BERT ”sports” - Summarized max 10 sentences in training, typical example.


E Visual representations of Attention Mechanism for BERT - ”tech”


Figure 42: BERT ”tech” - Summarized max 5 sentences in training, typical example.

Figure 43: BERT ”tech” - Summarized max 5 sentences in training, lemmatized, typical example.


F Statistical data

Attention share - the percentage of a text covered by a given attention type.

Attention span - the number of adjacent words covered by the same attention type.
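Both statistics can be computed directly from a per-word list of attention labels. The sketch below is a minimal illustration assuming such a hypothetical list (one label per word, e.g. "strong", "weak", "none") has already been produced by the visualization step; it is not the exact script behind the tables.

from itertools import groupby
from statistics import mean, median

def attention_share(labels):
    # Percentage of the text covered by each attention type.
    total = len(labels)
    return {lab: round(100 * labels.count(lab) / total) for lab in set(labels)}

def attention_span(labels):
    # Average and median number of adjacent words carrying the same attention type.
    runs = {}
    for lab, group in groupby(labels):
        runs.setdefault(lab, []).append(len(list(group)))
    return {lab: (round(mean(r)), round(median(r))) for lab, r in runs.items()}

labels = ["strong", "strong", "strong", "weak", "weak", "none", "strong"]
print(attention_share(labels))   # {'strong': 57, 'weak': 29, 'none': 14} (order may vary)
print(attention_span(labels))    # {'strong': (2, 2), 'weak': (2, 2), 'none': (1, 1)}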

Table organization:

• Table 4, BiDir-LSTM-CNN and BERT ”sports” attention shares in texts.

• Table 5, BiDir-LSTM-CNN and BERT ”tech” attention shares in texts.

• Table 6, BiDir-LSTM-CNN ”sports” average attention span in texts.

• Table 7, BiDir-LSTM-CNN ”tech” average attention span in texts.

• Table 8, BERT ”sports” average attention span in texts.

• Table 9, BERT ”tech” average attention span in texts.

• Table 10, BiDir-LSTM-CNN ”sports” percentage share of context words.

• Table 11, BiDir-LSTM-CNN ”tech” percentage share of context words.

• Table 12, BERT ”sports” percentage share of context words.

• Table 13, BERT ”tech” percentage share of context words.

• Table 14, BiDir-LSTM-CNN ”sports” percentage share of parts of speech in a given attention.

• Table 15, BiDir-LSTM-CNN ”tech” percentage share of parts of speech in a given attention.

• Table 16, BERT ”sports” percentage share of parts of speech in a given attention.

• Table 17, BERT ”tech” percentage share of parts of speech in a given attention.

• Table 18, BiDir-LSTM-CNN ”sports” percentage share of parts of speech in all attentions.


Table 4: BiDir-LSTM-CNN and BERT ”sports” overall attention coverage of texts.

Coverage                      BiDir-LSTM-CNN              BERT
                              Strong / Moderate / Weak    Strong / Weak / No attent.
Original text                 82% / 18% / 0%              31% / 41% / 28%
Original text lemmatized      21% / 7% / 72%              40% / 39% / 21%
Original text stemmed         91% / 2% / 7%               37% / 39% / 24%
Text max 5 sen.               4% / 17% / 79%              48% / 24% / 28%
Text max 5 sen. lemmatized    31% / 69% / 0%              45% / 45% / 10%
Text max 5 sen. stemmed       10% / 24% / 66%             55% / 31% / 14%
Text max 10 sen.              55% / 33% / 12%             45% / 46% / 9%
Text max 10 sen. lemmatized   11% / 32% / 57%             35% / 55% / 10%
Text max 10 sen. stemmed      64% / 10% / 26%             41% / 50% / 9%
Avg. original texts           64% / 9% / 27%              36% / 40% / 24%
Avg. max 5 sen. texts         15% / 37% / 48%             50% / 33% / 17%


Table 5: BiDir-LSTM-CNN and BERT ”tech” overall attention coverage of texts.

Coverage                      BiDir-LSTM-CNN              BERT
                              Strong / Moderate / Weak    Strong / Weak / No attent.
Original text                 41% / 57% / 2%              32% / 48% / 20%
Original text lemmatized      100% / 0% / 0%              34% / 46% / 20%
Original text stemmed         97% / 2% / 1%               46% / 42% / 12%
Text max 5 sen.               63% / 36% / 1%              49% / 40% / 11%
Text max 5 sen. lemmatized    95% / 5% / 0%               47% / 47% / 6%
Text max 5 sen. stemmed       91% / 8% / 1%               32% / 59% / 9%
Text max 10 sen.              22% / 29% / 49%             35% / 52% / 13%
Text max 10 sen. lemmatized   13% / 48% / 39%             34% / 55% / 11%
Text max 10 sen. stemmed      99% / 1% / 0%               39% / 51% / 10%
Avg. original texts           79% / 20% / 1%              37% / 46% / 17%
Avg. max 5 sen. texts         83% / 16% / 1%              43% / 48% / 9%


Table 6: BiDir-LSTM-CNN ”sports” average and median word attention span in texts.

BiDir-LSTM-CNN ”sports”       Average word attention span   Median word attention span
                              Strong / Moderate / Weak      Strong / Moderate / Weak
Original text                 23 / 5 / 0                    16 / 5 / 0
Original text lemmatized      19 / 4 / 41                   20 / 4 / 44
Original text stemmed         410 / 2 / 17                  410 / 2 / 17
Max 5 sen.                    1 / 2 / 23                    1 / 2 / 23
Max 5 sen. lemmatized         5 / 9 / 0                     5 / 9 / 0
Max 5 sen. stemmed            3 / 3 / 17                    3 / 3 / 17
Max 10 sen.                   52 / 9 / 5                    52 / 6 / 5
Max 10 sen. lemmatized        5 / 5 / 12                    5 / 3 / 10
Max 10 sen. stemmed           19 / 2 / 8                    7 / 2 / 4
Avg. original texts           39 / 5 / 33                   18 / 4 / 38
Avg. max 5 sen. texts         3 / 5 / 13                    3 / 3 / 17
Avg. max 10 sen. texts        20 / 5 / 9                    7 / 2 / 4


Table 7: BiDir-LSTM-CNN ”tech” average and median word attention span in texts.

BiDir-LSTM-CNN ”tech”         Average word attention span   Median word attention span
                              Strong / Moderate / Weak      Strong / Moderate / Weak
Original text                 17 / 7 / 1                    12 / 5 / 1
Original text lemmatized      453 / 0 / 0                   453 / 0 / 0
Original text stemmed         446 / 3 / 1                   446 / 3 / 1
Max 5 sen.                    16 / 9 / 1                    14 / 9 / 1
Max 5 sen. lemmatized         70 / 5 / 0                    70 / 5 / 0
Max 5 sen. stemmed            68 / 3 / 1                    68 / 3 / 1
Max 10 sen.                   6 / 3 / 9                     6 / 2 / 11
Max 10 sen. lemmatized        3 / 3 / 5                     2 / 3 / 3
Max 10 sen. stemmed           174 / 1 / 0                   174 / 1 / 0
Avg. original texts           129 / 5 / 1                   18 / 4 / 1
Avg. max 5 sen. texts         37 / 6 / 1                    20 / 5 / 1


Table 8: BERT ”sports” average and median word attention span in texts.

BERT ”sports”                 Average word attention span   Median word attention span
                              Strong / Weak / No attent.    Strong / Weak / No attent.
Original text                 2 / 2 / 1                     2 / 2 / 1
Original text lemmatized      2 / 2 / 1                     2 / 2 / 1
Original text stemmed         2 / 2 / 1                     2 / 2 / 1
Max 5 sen.                    3 / 2 / 1                     2 / 1 / 1
Max 5 sen. lemmatized         1 / 1 / 1                     1 / 1 / 1
Max 5 sen. stemmed            3 / 1 / 1                     2 / 1 / 1
Max 10 sen.                   2 / 2 / 1                     2 / 2 / 1
Max 10 sen. lemmatized        2 / 2 / 1                     2 / 2 / 1
Max 10 sen. stemmed           2 / 2 / 1                     2 / 2 / 1
Avg. original texts           2 / 2 / 1                     2 / 2 / 1
Avg. max 5 sen. texts         2 / 1 / 1                     1 / 1 / 1
Avg. max 10 sen. texts        2 / 2 / 1                     2 / 2 / 1


Table 9: BERT ”tech” average and median word attention span in texts.

BERT ”tech”                   Average word attention span   Median word attention span
                              Strong / Weak / No attent.    Strong / Weak / No attent.
Original text                 1 / 2 / 1                     1 / 2 / 1
Original text lemmatized      2 / 2 / 1                     1 / 2 / 1
Original text stemmed         2 / 3 / 1                     2 / 2 / 1
Max 5 sen.                    3 / 2 / 1                     2 / 1 / 1
Max 5 sen. lemmatized         2 / 2 / 1                     2 / 2 / 1
Max 5 sen. stemmed            2 / 3 / 1                     2 / 2 / 1
Max 10 sen.                   2 / 3 / 1                     1 / 2 / 1
Max 10 sen. lemmatized        2 / 2 / 1                     2 / 2 / 1
Max 10 sen. stemmed           2 / 3 / 1                     2 / 2 / 1
Avg. original texts           2 / 2 / 1                     2 / 2 / 1
Avg. max 5 sen. texts         2 / 2 / 1                     2 / 2 / 1
Avg. max 10 sen. texts        2 / 3 / 1                     1 / 2 / 1


Table 10: BiDir-LSTM-CNN ”sports” percentage share of context-word attentions in all attentions.

BiDir-LSTM-CNN ”sports”       All / Strong / Moderate / Weak
Original text                 3% / 3% / 2% / 5%
Original text lemmatized      3% / 2% / 8% / 3%
Original text stemmed         3% / 3% / 0% / 0%
Text max 5 sen.               3% / 0% / 14% / 0%
Text max 5 sen. lemmatized    3% / 10% / 0% / 0%
Text max 5 sen. stemmed       3% / 25% / 0% / 0%
Text max 10 sen.              3% / 5% / 0% / 0%
Text max 10 sen. lemmatized   3% / 7% / 3% / 2%
Text max 10 sen. stemmed      3% / 4% / 0% / 0%


Table 11: BiDir-LSTM-CNN ”tech” percentage share of context-word attentions in all attentions.

BiDir-LSTM-CNN ”tech”         All / Strong / Moderate / Weak
Original text                 3% / 4% / 2% / 0%
Original text lemmatized      3% / 3% / 0% / 0%
Original text stemmed         3% / 3% / 0% / 0%
Text max 5 sen.               2% / 0% / 3% / 0%
Text max 5 sen. lemmatized    1% / 1% / 0% / 0%
Text max 5 sen. stemmed       1% / 1% / 0% / 0%
Text max 10 sen.              3% / 3% / 4% / 2%
Text max 10 sen. lemmatized   3% / 8% / 2% / 1%
Text max 10 sen. stemmed      3% / 3% / 0% / 0%


Table 12: BERT ”sports” percentage share of context-word attentions in all attentions.

BERT ”sports”                 All / Strong / Weak / No attent.
Original text                 4% / 3% / 5% / 1%
Original text lemmatized      4% / 3% / 4% / 11%
Original text stemmed         4% / 2% / 6% / 2%
Text max 5 sen.               7% / 0% / 10% / 17%
Text max 5 sen. lemmatized    7% / 7% / 8% / 0%
Text max 5 sen. stemmed       7% / 6% / 10% / 0%
Text max 10 sen.              4% / 4% / 5% / 0%
Text max 10 sen. lemmatized   4% / 2% / 6% / 0%
Text max 10 sen. stemmed      4% / 7% / 2% / 0%


Table 13: BERT ”tech” percentage share of context-word attentions in all attentions.

BERT ”tech”                   All / Strong / Weak / No attent.
Original text                 3% / 4% / 3% / 2%
Original text lemmatized      3% / 5% / 4% / 0%
Original text stemmed         3% / 4% / 1% / 10%
Text max 5 sen.               1% / 0% / 3% / 0%
Text max 5 sen. lemmatized    1% / 0% / 2% / 0%
Text max 5 sen. stemmed       1% / 0% / 2% / 0%
Text max 10 sen.              3% / 3% / 4% / 0%
Text max 10 sen. lemmatized   3% / 4% / 3% / 0%
Text max 10 sen. stemmed      3% / 3% / 4% / 6%


Table 14: BiDir-LSTM-CNN ”sports” percentage share of parts of speech in a given attention. Only the 5 parts of speech with the highest values are presented (N - Noun, V - Verb, J - Adjective, P - Pronoun, R - Adverb).

BiDir-LSTM-CNN ”sports”       Strong (N/V/J/P/R)          Moderate (N/V/J/P/R)        Weak (N/V/J/P/R)
Original text                 32% 15% 3% 8% 8%            24% 27% 2% 15% 8%           27% 19% 5% 8% 5%
Original text lemmatized      21% 22% 5% 10% 11%          29% 15% 6% 3% 3%            33% 16% 2% 8% 7%
Original text stemmed         32% 16% 3% 8% 7%            37% 27% 0% 9% 0%            20% 29% 0% 20% 11%
Text max 5 sen.               0% 0% 0% 0% 100%            57% 0% 0% 0% 0%             20% 4% 8% 4% 8%
Text max 5 sen. lemmatized    10% 10% 10% 0% 10%          35% 0% 4% 4% 9%             0% 0% 0% 0% 0%
Text max 5 sen. stemmed       25% 0% 0% 0% 25%            50% 12% 0% 0% 0%            19% 0% 10% 5% 10%
Text max 10 sen.              22% 12% 5% 8% 7%            24% 24% 0% 18% 12%          31% 6% 6% 6% 6%
Text max 10 sen. lemmatized   29% 7% 0% 0% 0%             10% 23% 3% 10% 10%          30% 8% 5% 3% 8%
Text max 10 sen. stemmed      19% 18% 4% 4% 9%            44% 0% 0% 11% 0%            32% 13% 3% 19% 10%
