
Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 15 ECTS | Computer Science

2021 | LIU-IDA/LITH-EX-G--21/067--SE

Comparing and contrasting the dissemination cascades of different topics in a social network

What are the lifetimes of different topics and how do they spread

Jämförelse av spridningskaskader för olika ämnen i ett socialt nätverk

Linus Käll, Simon Pertoft

Supervisor: Niklas Carlsson
Examiner: Marcus Bendtsen

Linköpings universitet, SE–581 83 Linköping


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication, barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Abstract

The web has granted everyone the opportunity to freely share large amounts of data. Individuals, corporations, and communities have made the web an important tool in their arsenal. These entities are spreading information online, but not all of it is constructive. Some spread misinformation to protect themselves or to attack other entities or ideas on the web. Checking the integrity of all the information online is a complex problem and an ethical solution would be equally complex. Multiple latent factors decide how a topic spreads and finding these factors is non-trivial.

In this thesis, the patterns of different topics are compared with each other and with the generalized patterns of fake, true, and mixed news, using Latent Dirichlet Allocation (LDA) topic models. We look at how the dissemination of topics can be compared through different metrics, and how these can be calculated through networks related to the data.

The analyzed data was collected using the Twitter API and news article scrapers. From this data, custom corpora were made through lemmatization and the filtering of unnecessary words and characters. The LDA models were made using these corpora, making it possible to extract the latent topics of the articles. By plotting the articles according to their most dominant topic, graphs for the popularity, size, and other distribution statistics could easily be drawn. From these graphs, the topics could be compared to each other and be categorized as fake, true, or mixed news by looking at their patterns and novelty. However, this raised the question of whether it would be ethical to generalize topics in this way. Suppressing or censoring an article because it contains a lot of novel information might hide constructive novelties and violate freedom of speech. Finally, this thesis presents the means for further work in the future, which could involve collecting one large, continuous dataset for a fair and accurate comparison between topics.


Acknowledgments

We want to thank:

• Niklas Carlsson for the opportunity to study within the field of artificial intelligence and pointing us in the right direction.

• Alireza Mohammadinodooshan for teaching us about topic modeling and continuously supporting us throughout the project.

• Carl Terve, Mattias Erlingsson, Otto Heino, and Richard Johansson for the great discussions while working on the joint project and throughout the whole term.

• Martin Christensson and William Holmgren for sharing their data with us when we had a hard time finding a good dataset, and for helping us quickly get started with our work.


Contents

Abstract

Acknowledgments

Contents

List of Figures

List of Tables

1 Introduction
   1.1 Motivation
   1.2 Aim
   1.3 Research questions
   1.4 Contributions
   1.5 Delimitations

2 Background
   2.1 Terminology
   2.2 Machine Learning (ML)
   2.3 Topic Modeling (TM)
   2.4 Latent Dirichlet Allocation (LDA)
   2.5 The important matrices for LDA: M1, M2, and M3
   2.6 Term Frequency - Inverse Document Frequency (TF-IDF)
   2.7 Cumulative Distribution Function (CDF)
   2.8 Complementary Cumulative Distribution Function (CCDF)
   2.9 Impact Coefficient (IC)
   2.10 Cascade

3 Related works
   3.1 Metrics, false news, and hoaxes
   3.2 Properties of the disseminating contents
   3.3 Communities and events
   3.4 Influence and efficiency
   3.5 Predicting dissemination

4 Method
   4.1 Tools which can be used for data collection
   4.2 Twitter terms of service and GDPR
   4.3 Preprocessing and cleaning data
   4.4 Creating an LDA model
   4.5 Optimizing an LDA model
   4.6 Creating plots
   4.7 Limitations

5 Results
   5.1 The datasets, or corpora, received
   5.2 Optimizing the LDA models
   5.3 Mid-February 2021 results
   5.4 Spring 2020 results
   5.5 Comparing two datasets

6 Discussion
   6.1 Results
   6.2 Method
   6.3 The work in a wider context

7 Conclusion
   7.1 Future work

Bibliography

A Appendices
   A.1 Fall 2020 results
   A.2 Early-February 2021 results


List of Figures

2.1 Cascade of (re)tweets with size 8.

4.1 This is how the collected data was formatted before being used to create the LDA model. All fields within #{ .. } are variables. The symbols ", .." signal that more instances of the preceding data structure can exist.

4.2 Distribution of documents over the number of words. The left subfigure shows a more complete version of the distribution while the right subfigure depicts a small portion for few words. Not shown are a few documents which are up to 47,000 words large, which would obscure the pattern seen in the figure.

4.3 Python code depicting how the M1 matrix was created and how it was used to generate a model, from which the M2 and M3 matrices can be extracted. The M2 matrix is not extracted in this code.

4.4 The coherence values of one of the corpora used in this thesis. The graph converges at around 14 topics, which means that any local maximum after this could be used to create an optimal model for the corresponding corpus. However, a smaller value would be preferred for ease of use.

5.1 Popularity of all topics in the corpus made from the Mid-February 2021 data.
5.2 Size of all topics in the corpus made from the Mid-February 2021 data.
5.3 CDF of all topics in the corpus made from the Mid-February 2021 data.
5.4 CCDF of all topics in the corpus made from the Mid-February 2021 data.
5.5 Boxplot of all topics in the corpus made from the Mid-February 2021 data.
5.6 Popularity of all topics in the corpus made from the Spring 2020 data.
5.7 Size of all topics in the corpus made from the Spring 2020 data.
5.8 CDF of all topics in the corpus made from the Spring 2020 data.
5.9 CCDF of all topics in the corpus made from the Spring 2020 data.
5.10 Boxplot of all topics in the corpus made from the Spring 2020 data.

A.1 Popularity of all topics in the corpus made from the Fall 2020 data.
A.2 Size of all topics in the corpus made from the Fall 2020 data.
A.3 CDF of all topics in the corpus made from the Fall 2020 data.
A.4 CCDF of all topics in the corpus made from the Fall 2020 data.
A.5 Boxplot of all topics in the corpus made from the Fall 2020 data.
A.6 Popularity of all topics in the corpus made from the Early-February 2021 data.
A.7 Size of all topics in the corpus made from the Early-February 2021 data.
A.8 CDF of all topics in the corpus made from the Early-February 2021 data.
A.9 CCDF of all topics in the corpus made from the Early-February 2021 data.
A.10 Boxplot of all topics in the corpus made from the Early-February 2021 data.
A.11 Popularity of all topics in the corpus made from the March 2021 data.
A.12 Size of all topics in the corpus made from the March 2021 data.
A.13 CDF of all topics in the corpus made from the March 2021 data.
A.14 CCDF of all topics in the corpus made from the March 2021 data.


List of Tables

5.1 The number of topics and passes which were used for each corpus. All corpora span one week after the start date.
5.2 ICs of the topics within the Mid-February 2021 and Spring 2020 corpora. The topics colored purple and orange are related to Covid-19 and Donald Trump, respectively.


1 Introduction

In this chapter, we introduce the topic, scope, and contributions of this thesis.

1.1 Motivation

On the internet, a lot of information is constantly spreading and dissipating. For example, topics surrounding COVID-19 continue to spread as of this writing, while a meme could gain extreme popularity overnight only to be forgotten some days later. Some individuals, such as Donald Trump, have been thought to try to utilize or control these torrents of information to their advantage for political or economic purposes [1]. This has inspired a desire to validate these kinds of weaponized pieces of information and to understand the patterns they can exert in the way they spread. By detecting and marking news as fake, it could be possible to improve the integrity of the web.

However, as the internet can display a lot of different patterns, and as dissemination cascades of information often are entangled, solving this problem is complex. Technologies that can help in this endeavor are Natural Language Processing (NLP) and Topic Modeling (TM), which are subsets of artificial intelligence methods.

1.2 Aim

This thesis aims to derive the patterns of different topics that spread throughout a social network, and to compare and contrast these patterns. The results are then analyzed and discussed to gain insight into how different kinds of information spread compared to false news. To do this, we created tools that can generate plots, calculate different statistics, and preprocess data. This thesis presents examples of results these tools can produce.

1.3 Research questions

1. When fake news and false statements are being spread, they are sometimes turned into memes, which induce further dissemination of the information. This can cause even broader dissemination when people perceive these memes as true information, which feeds the creation of even more memes. These kinds of feedback loops could cause enormous dissemination cascades. Do such feedback loops happen within social networks, and is it possible to detect similar feedback loops for other topics?

2. How do distinct topics differ in the way they disseminate in a social network, compared to false news, which has been found to disseminate far and wide due to novelty [1]?

3. How do distinct topics differ in the way they disseminate in a social network, compared to each other?

1.4 Contributions

The contributions of this thesis are the tools created for processing tweet data and related articles and for creating graphs. These tools in turn produced multiple instances of data, looking at different metrics for the topics related to news articles posted over a week. Two of these instances were analyzed by applying the different concepts, theories, and ideas presented in the Related works chapter below.

1.5 Delimitations

As access and storage space were limited, samples of news articles posted on Twitter were used as the data for this analysis. As the data consisted of subsets of all data on Twitter, the accuracy of the models can vary depending on the samples collected. This means that the results are statistical, which can cause some inaccuracies when applied to the real world. Due to time limitations and the large size of the data, most of the steps in the method were conducted with samples of the data.


2 Background

This chapter explains some of the technologies and theories referenced throughout this thesis. The main concepts are those connected to topic modeling and its different implementations.

2.1 Terminology

Some terminology and their meanings will be necessary to understand this thesis:

• The term false news is used as a more neutral synonym for fake news.

• A word is a sequence of letters which is not interrupted by punctuation or spaces.

• A document refers to a list of words and will be used interchangeably with "article" in this thesis.

• A corpus refers to a list of documents.

• A bag of words or BoW is a list of tuples (index, occurrences), where the index represents a unique word in the document corresponding to the BoW, and the occurrences are how many times that unique word occurs in the same document. (A small code sketch follows this list.)

• A stop list or a list of stop words is a collection of words that bring no real value when trying to define a document's topic. Conjunctions are typical stop words as they bring no real information to a sentence, but other words can be added to the stop list to make a document more coherent to the LDA algorithm [2]. A few examples would be "ourselves", "with", "in", "all", "no", "why", et cetera.

• A topic is a normalized distribution of words in a document, corpus, or other collection of words. It is normalized to make it easier to compare two documents. As an example, a topic could be "artificial intelligence", which could have a distribution over words like "machine", "deep", "topic", "modeling", "learning", et cetera.

• Lemmatization is the act of transforming inflected forms of words into their basic form, or lemma. This is useful because it simplifies the data.

• A social network graph represents a social network in graph form. In this thesis, it refers to a network of users and how they are connected through follows.

• A feedback loop can happen when an idea is tweaked and passed back and forth between two or more nodes, or groups of nodes. A loop like this can continue until the idea cannot be tweaked anymore. A cycle observed when looking at information disseminating in a social network graph would be a feedback loop.
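To make the BoW representation concrete, here is a minimal sketch using Gensim (the library used later in this thesis); the toy documents are invented for illustration:

    from gensim.corpora import Dictionary

    # Two toy documents, already tokenized and lemmatized.
    docs = [["machine", "learning", "machine"], ["deep", "learning"]]

    dictionary = Dictionary(docs)      # maps each unique word to an integer index
    bow = dictionary.doc2bow(docs[0])  # BoW of the first document, e.g.
    # [(index_of("learning"), 1), (index_of("machine"), 2)]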

2.2 Machine Learning (ML)

This technology is used to derive black boxes, whose inner workings often are neural networks, that can transform inputs into desired outputs. A high-level explanation can be summarized in the following two steps.

1. Train the black box using training data, which contains inputs and corresponding outputs. For each tuple (input, true output) in the training data, feed the black box's output and the true output to a back-propagation method, which adjusts the black box's neural network for the next iteration. This step is iterated multiple times.

2. Evaluate the black box using testing data, which contains inputs and corresponding outputs. For each tuple (input, true output) in the testing data, if the black box's output was correct, increment the count of correct outputs. This step can be used to calculate the accuracy of the black box as the percentage (number of correct outputs) / (number of tuples) [3].

2.3 Topic Modeling (TM)

Topic modeling is a subcategory of natural language processing that focuses on finding latent topics in a corpus. These topics can be used to categorize the documents in the corpus. A topic model can be developed in the two following ways.

For simplicity, unsupervised learning is explained first. With this method, the model is expected to find latent topics when fed training data. According to the normalized distribution of words in each document, the document is determined to contain certain latent topics, as probabilities. Documents that have similar probabilities for most, if not all, topics will create clusters of documents akin to each other. These clusters help determine the latent topics.

Supervised learning is when the topics are known beforehand. The goal of this method is to find how the known topics are related to the latent topics in the corpus and to find the discriminative curve between the known topics, specifically which N-dimensional volume each known topic occupies, where N is the number of unique words present in the model [3].

There are several topic models, but the one used and discussed in this thesis is Latent Dirichlet Allocation.

2.4 Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is a "generative probabilistic model for collections of discrete data" and "a three-level hierarchical Bayesian model" [4]. As the name suggests, hidden (latent) Dirichlet distributions are derived from discrete data to generate, or more accurately derive, topics. It is used in unsupervised learning.

First, the corpus is filtered to remove inflections and stop words, as these are unnecessary for determining topics; such words or symbols could be different forms of a more basic word, or punctuation. The filtered corpus is transformed into a list of BoWs, and a Dictionary is created which can map indexes in the BoWs to the corresponding unique words. These two data structures are used in the construction of the LDA model.

When a basic LDA model has been created, multiple filtered corpora can be fed into it for training. When it is sufficiently trained, it can take in an arbitrary document and accurately determine the latent topics it contains, as probabilities.


2.5 The important matrices for LDA: M1, M2, and M3

Three matrices are of interest when working with LDA. The first one contains how many times each word occurs in each document; it is used to create the initial model, which contains data about the two other matrices. We call this matrix M1. The second matrix contains how each word permeates each topic, in other words, the distribution of words for each topic. We call this matrix M2. The last matrix contains how much, and which, topics permeate each document. We call this matrix M3.
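As a rough sketch of how these matrices appear when working with Gensim (the variable names are ours; compare Figure 4.3):

    # M1: one bag of words per document, i.e. word counts per document.
    M1 = [dictionary.doc2bow(doc) for doc in tokenized_corpus]

    # After training an LDA model on M1:
    M2 = model.get_topics()  # word distribution per topic: (num_topics, vocabulary size)
    M3 = [model.get_document_topics(bow) for bow in M1]  # topic mixture per document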

2.6 Term Frequency - Inverse Document Frequency (TF-IDF)

The TF-IDF function is the product of the TF and IDF functions for a chosen word, document, and respective corpus [5]. This function can be applied to all words in a corpus to decrease the weight of insignificant words and increase the weight of important words. The combined output is called the TF-IDF_M1 matrix, which can be used instead of the original M1 matrix to create an LDA model.
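The thesis does not spell out the exact variant used, but a common textbook form of the function is

    tf-idf(w, d) = tf(w, d) × log(N / df(w))

where tf(w, d) is the number of occurrences of the word w in document d, N is the number of documents in the corpus, and df(w) is the number of documents containing w.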

2.7 Cumulative Distribution Function (CDF)

The CDF can be expressed as CDF(x) = F_X(x) = P(X ≤ x). It calculates the probability that a random value drawn from the distribution X is less than or equal to x. For a more general description: the x-axis tells us how large a specific data point is, while the y-axis tells us how much of the data is less than or equal to the current x value. For this thesis, it tells us what percentage of the (re)tweet cascades in a topic are of size x or smaller.

2.8 Complementary Cumulative Distribution Function (CCDF)

The CCDF can be expressed as CCDF(x) = F̄_X(x) = P(X ≥ x), which means that it calculates the probability that a random value drawn from the distribution X is greater than or equal to x. This is the complement of the CDF and can be written as 1 − CDF.

2.9 Impact Coefficient (IC)

The idea of the impact coefficient comes from the viral coefficient [6]. The viral coefficient is how many individuals outside a set can successfully be added to the set by existing members. In this thesis, it means, for example, that if each person who (re)tweets about a topic on average causes 100 further retweets, the viral coefficient would be 100 for that topic. The impact coefficient of a topic, on the other hand, is how many retweets original tweets of that topic get on average. If people A, B, C, and D got 3, 11, 2, and 4 retweets on their respective original tweets about a topic, the impact coefficient would be (3 + 11 + 2 + 4) / 4 = 5. In other words, the impact coefficient for a topic is the average number of retweets for original tweets about that topic. This metric will be used to represent the initial growth, or impact, of a tweet. Compared to the viral coefficient, this metric only takes the first depth of a cascade into account.
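In code, the IC of a topic reduces to a simple average (a sketch; retweet_counts is assumed to hold the retweet count of each original tweet about the topic):

    def impact_coefficient(retweet_counts):
        # Average number of retweets per original tweet about a topic.
        return sum(retweet_counts) / len(retweet_counts)

    impact_coefficient([3, 11, 2, 4])  # 5.0, matching the example above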

2.10 Cascade

A cascade is a tree of retweets, where a tweet is the root. The size of this cascade is the number of retweets plus the root tweet.

Figure 2.1: Cascade of (re)tweets with size 8 (a root tweet and its retweets forming a tree).

3 Related works

Here we present previous works related to the dissemination of information online.

3.1 Metrics, false news, and hoaxes

Vosoughi et al. [1] explored how verified true and false news stories, posted during the period from 2006 to 2017, disseminated through Twitter. They collected continuous data from this period, generated network graphs, and applied LDA modeling using 200 topics. From this, it was possible to measure different metrics. These metrics are:

• Depth over time, measured as the longest retweet chain in the network at any time,

• Size over time, measured as the total number of nodes in the network at any time,

• Max-breadth over time, measured as the largest set of nodes at any depth at any time in the network, and

• Structural virality, "a measure that interpolates between content spread through a single, large broadcast and that which spreads through multiple generations, with any one individual directly responsible for only a fraction of the total spread".

What they found was that fake news stories were mostly made up of novel information and that such information often scores higher in all the metrics mentioned above. Also, they found that the CCDF of false news often has a convex shape: large initial values which quickly approach 0% for larger values. The opposite was found for true news, which has a concave shape. Mixed news has a convex shape for earlier values, like false news, but a concave shape for larger values. In other words, mixed news often spreads the most.

To compare the dissemination of different topics, the metrics above could be used; however, other similar metrics could be used as well. Zhou et al. [7] presented metrics that are calculated over the number of hops instead of time. This means that if the original tweet X got the retweet Y at time A, and retweet Z at time A+10, then Y and Z would be parallel for the metrics using hops, but not for the metrics using time.

Tambuscio et al. [8] investigated the dissemination of hoaxes and how the availability of fact-checking may contain their dissemination. They regarded the hoaxes as viruses, and once a user would become "infected", they would have a probability of spreading it to other "uninfected" users. They looked at the fact-checking activity and found a threshold for the probability of a hoax being verified. At this threshold, the dissemination of the hoax is likely to be contained.

3.2 Properties of the disseminating contents

Heimbach et al. [9] investigated how the properties of content can affect the likelihood of the content being shared with peers online, in other words, how the properties of content affect its virality. When analyzing the data they collected, they found that the posts followed Zipf's law. In general, this means that by sorting a distribution from largest to smallest and normalizing it, each element's size would be inversely proportional to its size rank. Emotionality caused the articles to be shared less often than articles that did not contain strong emotions, such as sadness. Users from Twitter and Google+ shared more articles about technology, science, politics, and business than about other topics. Users from all the platforms they investigated shared original content more often, indicating that originality affects virality. Like Berger and Milkman [10], they also found that complexity can cause virality.

Beyond the measures of content virality and content popularity mentioned by Heimbach et al., they refer to the work by Guerini et al. [11]. They discovered that their metrics of virality, buzz, appreciation, raising-discussion, and controversiality all could be automatically predicted with their framework. Looking at the metrics, buzz and appreciation correlated, while controversiality was independent of those two. However, the raising-discussion metric seemed to correlate with all of the above. Two other metrics were also looked at: positive and negative buzz, called white and black buzz respectively. They found that their data was mostly black buzz, but that this might be due to the lack of any other way of expressing dislike on Digg. Like Twitter, Digg does not have any kind of "dislike" button but has a "like" button.

3.3 Communities and events

Weng et al. [12] looked at content virality from the perspective of communities, studying how memes spread within and beyond these communities. They found that a low concentration of a meme could imply universal interest, while a high concentration could indicate interest in only a few communities.

Looking at virality from the perspective of events, De Leng et al. [13] investigated the dissemination of tweets during the National Hockey League and the 2015 Stanley Cup playoffs. They found that during the event, the ratio of original tweets to retweets was higher. During the climaxes, the tweet rate was seven to eight times higher than the in-game baseline. After the game, the tweet rates could be even higher. Looking at the geolocation of the tweets, they could see distributions with heavy-tail patterns, centered on the competing cities.

Leskovec et al. [14] studied how different topics, including memes and news, spread across an audience. This was done to represent how the news cycle works online. Their approach was to track specific phrases, and their variants, on mainstream news media sites and blogs. Their approach was effective when using simple phrases as a record when tracing the spread of an idea. They found that when one of these phrases reached its usage peak on news sites, on average 2.5 hours later the phrase would reach its usage peak on blogs. For any phrase, looking at the ratio of its number of mentions in the blogs and news sites, a heartbeat pattern could be seen as the news sites mention it before the blogs. Six to nine hours afterward, the ratio converges at a higher value than before, as the phrase keeps getting mentioned in blogs.

3.4 Influence and efficiency

Bakshy et al. [15] looked at quantifying influence on Twitter. They found that individuals with influence were more likely to create tweets that would go viral, but that it was not certain. They observed that it could be possible to predict the future influence of a user from their properties on the platform; particularly, past influence and the number of followers could be used for a prediction. However, their statistics predicted that it would be more cost-effective to pay individuals with average or less influence to spread a message than to pay individuals with a lot of influence to do the same.

3.5 Predicting dissemination

Zhao et al. [16] looked at predicting rumors on Twitter without concerning themselves with whether the rumor was true or not. They found many statistics which could predict the beginning of a rumor. However, they found that looking at the number of tweets that seek to verify the integrity of the statement was an efficient method for predicting rumors. In other words, if there are a lot of tweets questioning the integrity of a tweet, then it is likely to go viral.

Kupavskii et al. [17] investigated if it was possible to predict a specific tweet's cascade size based on different factors. They used a machine learning algorithm to predict the cascade size of a tweet at different times in the future. The factors used in the prediction could be placed in one of four categories of features. The social features of an initial tweet take into account the person posting the tweet: how many followers they have, how many times the user showed up in the dataset, how many posts the user has made, etc. The content features focus on the contents of a tweet and contain features like how long the tweet is, whether it contains positive or negative emojis, whether it has exclamation points or question marks, etc. There were two other categories as well: the time-sensitive features and the social features of the users viewing the tweet. Of these two, only the time-sensitive features were implemented and tested. They found that combining more features from all the tested categories improved the results of the model.

4 Method

This chapter describes the method used to transform a corpus into an LDA model and how such a model is used to calculate the CCDF, popularity, and total size of retweets over time for each topic. The corpus had accompanying information related to what kind of interactions each document had experienced over time, such as retweets, mentions, quotes, and likes. When the CDF, CCDF, size, and popularity had been extracted, the topics were named by looking at which words permeate each topic and connecting these words to an event or entity. Python with Jupyter was the programming environment of choice for this thesis [18]. To begin with, data had to be collected, as described below.

4.1 Tools which can be used for data collection

First, the tweet data used was collected using the Twitter API by contacting the "/2/tweets/search/all" endpoint for each news article [19]. Each news article's URL was used as a parameter together with "-is:retweet" to exclude retweets, and the results were then saved as CSV files. After the tweets were collected, the texts of the articles were extracted from the URLs using the libraries news-please and news-crawler, and a custom library based on BeautifulSoup4 [20]. These texts were then saved as JSON files.

The process of collecting the data was outside the scope of this thesis. The data used in this thesis was provided by another group of students, who used these tools.
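For illustration only (the collection itself was done by the other group, and the exact parameters used are not known to us), a query against this endpoint might look roughly as follows; BEARER_TOKEN and article_url are placeholders:

    import requests

    # BEARER_TOKEN and article_url are placeholders (see text).
    resp = requests.get(
        "https://api.twitter.com/2/tweets/search/all",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={
            "query": f'url:"{article_url}" -is:retweet',
            "tweet.fields": "created_at,public_metrics,author_id",
        },
    )
    tweets = resp.json().get("data", [])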

4.2 Twitter terms of service and GDPR

To respect the users of Twitter, Twitter's terms of service, and GDPR, the data presented was stripped of any personal information, and all users were assigned a random ID to protect them from being tracked, in case any data related to users would be displayed in the results.

4.3 Preprocessing and cleaning data

The text from each URL was used to create an LDA model using the NLP library Gensim. First, the CSV files from Twitter were preprocessed by extracting the relevant fields: the time each tweet was created, the tweet's ID, the user's ID, and the number of retweets, quotes, mentions, and likes each tweet got. These extracted pieces of data from each CSV file were then combined with the corresponding JSON files to create new JSON files with the structure described in Figure 4.1.

{
    #{URL}: {
        "text": #{ARTICLE_TEXT},
        "tweet_ids": {
            #{TWEET_ID}: {
                "user_id": #{USER_ID},
                "time": #{UTC_TIME_CREATED_AT},
                "likes": #{NUMBER_OF_LIKES},
                "retweets": #{NUMBER_OF_RETWEETS},
                "quotes": #{NUMBER_OF_QUOTES},
                "mentions": #{NUMBER_OF_MENTIONS}
            }, ..
        }
    }, ..
}

Figure 4.1: This is how the collected data was formatted before being used to create the LDA model. All fields within #{ .. } are variables. The symbols ", .." signal that more instances of the preceding data structure can exist.

After the custom datasets were created, they were all fed into a cleaning function, which processes the text field in one of the custom JSON files. Firstly, it lemmatizes the contents of the "text" field using the NLTK library's WordNetLemmatizer [21]. Secondly, it feeds the lemmatized text into Gensim's preprocess_string function, which takes in a string and a list of closures to be applied to the string. These closures are filters that remove certain characters or substrings. A few of the filters available in the Gensim library were used, such as strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric, and strip_short [22]. After each corpus had been filtered, the NLTK English stop words were removed from it [23].
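A minimal sketch of such a cleaning function, assuming the required NLTK resources have been downloaded (the function and variable names are ours):

    from gensim.parsing.preprocessing import (
        preprocess_string, strip_tags, strip_punctuation,
        strip_multiple_whitespaces, strip_numeric, strip_short,
    )
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    FILTERS = [strip_tags, strip_punctuation, strip_multiple_whitespaces,
               strip_numeric, strip_short]

    def clean_document(text):
        # Lemmatize each word, apply the Gensim filters, then drop stop words.
        lemmatized = " ".join(lemmatizer.lemmatize(w) for w in text.lower().split())
        tokens = preprocess_string(lemmatized, FILTERS)
        return [t for t in tokens if t not in stop_words]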

Lastly, one more iteration of cleaning was done in a more manual way, to process tokens that were not handled in the first iteration, such as emojis, unusual whitespace characters, and non-Latin letters. Documents containing such tokens were usually removed.

As shown in Figure 4.2b, some articles contained few English words. When printed out, these articles often contained only a single word repeated, unusual symbols, and other content that would not bring any value to this analysis; such articles were thus removed.

Looking at Figure 4.2a, a lot of documents have fewer than 1,000 words, and the largest document had approximately 47,000 words. The distribution resembles a right-skewed normal distribution.

(a) Distribution from 0 to 3,500 words. (b) Distribution from 0 to 60 words.

Figure 4.2: Distribution of documents over the number of words. The left subfigure shows a more complete version of the distribution while the right subfigure depicts a small portion for few words. Not shown are a few documents which are up to 47,000 words large, which would obscure the pattern seen in the figure.

4.4 Creating an LDA model

When all the preprocessing was done, the cleaned corpus was used to create LDA models. Two functions were suggested for creating models: Gensim's LdaMulticore function, and Gensim's LdaMallet wrapper function together with MALLET from the University of Massachusetts [24]. Firstly, a dictionary was created to assign an ID to each unique word. Next, an M1 matrix was derived using the doc2bow() method of the created dictionary. These two data structures were then used as parameters for the creation of an LDA model. Optionally, the M1 matrix could have been processed using the TF-IDF function before creating the model; however, this is not possible when using MALLET. After a model had been created, the M3 matrix was created from it using its get_document_topics() method, which returns the percentage each topic assumes in a specified document. Figure 4.3 lists the code used to create a model.

4.5 Optimizing an LDA model

As the data used in this thesis was almost nine gigabytes large, random samples of roughly 5% of each corpus (an arbitrarily chosen fraction) were used to estimate the optimal number of topics and passes. These values were then used in the creation of the final models, which were created using 100% of each corpus. This significantly sped up the process of optimizing the models, at the cost of some words being left out, which could have caused some inaccuracies.

To optimize the model, multiple models of the same corpus were created with different numbers of topics. Gensim coherence models created from these models were then used to compare them to each other. The coherence of an LDA model is just an arbitrary number in the range (0, 1), which only applies to models created from the same data. In other words, it is unreasonable to compare the coherence values of models that are created from different corpora.
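A sketch of this comparison, reusing the structures from Figure 4.3 (the coherence measure shown, "c_v", is an assumption; Gensim supports several):

    from gensim.models import CoherenceModel
    from gensim.models.ldamulticore import LdaMulticore

    def coherence_for(num_topics):
        model = LdaMulticore(corpus=M1, num_topics=num_topics,
                             id2word=corpus_dictionary, passes=2, random_state=0)
        cm = CoherenceModel(model=model, texts=clean_corpus,
                            dictionary=corpus_dictionary, coherence="c_v")
        return cm.get_coherence()

    # Coherence as a function of the number of topics, as plotted in Figure 4.4.
    scores = {k: coherence_for(k) for k in range(2, 26, 2)}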

When multiple coherence values had been collected from the sample of a corpus, these values were plotted as a function of the number of topics. From this plot, it was determined how many topics should be used. As can be seen in Figure 4.4, the graph converges after a certain number of topics. One of the local maxima after this point is used when creating the final model.

After determining the number of topics to be used for each corpus, the next step was finding a good value for the number of passes. Here, the same process of creating multiple models was used; however, the number of topics was set to the estimated optimal value found in the previous step. This yielded a graph similar to the one in Figure 4.4.

When the estimated optimal values for the number of topics and passes had been determined, the final models were created using these values. These models were then used to create plots for the final step of the method.


import gensim

# Build the dictionary and the M1 matrix (one bag of words per document).
corpus_dictionary = gensim.corpora.Dictionary(clean_corpus)
M1 = [
    corpus_dictionary.doc2bow(doc)
    for doc in clean_corpus
]

# This is the optional TF-IDF step
# tfidf_model = gensim.models.TfidfModel(M1)
# M1 = tfidf_model[M1]

# gensim.models.wrappers.LdaMallet() could be used instead
model = gensim.models.ldamulticore.LdaMulticore(
    corpus=M1,
    num_topics=200,
    passes=2,
    random_state=0,
    id2word=corpus_dictionary,
    workers=6
)

# Extract the M3 matrix: the topic mixture of each document.
M3 = [
    model.get_document_topics(doc)
    for doc in M1
]

Figure 4.3: Python code depicting how the M1 matrix was created and how it was used to generate a model, from which the M2 and M3 matrices can be extracted. The M2 matrix is not extracted in this code.

4.6 Creating plots

When the M3 matrix was created, it was possible to extract the most prominent topic of each document, referred to as the document's "dominant topic". When each document's dominant topic had been extracted from M3, the documents with the same dominant topic were grouped into collections. These collections were placed in a list, with each index corresponding to a topic's ID. This list is referred to as the "topics list"; a sketch of the grouping is given below.
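A sketch of this grouping, continuing from the M3 matrix created in Figure 4.3:

    # Group document indexes by their dominant (highest-probability) topic.
    topics_list = [[] for _ in range(model.num_topics)]
    for doc_id, doc_topics in enumerate(M3):
        dominant_topic, _ = max(doc_topics, key=lambda pair: pair[1])
        topics_list[dominant_topic].append(doc_id)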

When all this data had been compiled, it was possible to plot the figures needed for analysis. For each topic, five plots were created using the topics list and the data about each document in the custom corpus from Figure 4.1. The popularity plot was made by calculating the number of retweets of a topic within three-hour periods. The size plot was made by accumulating the number of retweets over time. The box plots were created from the popularity data. The final plots contained the CDF and CCDF of the data. These were calculated by taking all tweets from each topic in the topics list and collecting how many retweets each tweet had. Then, for each cascade size from zero and up, the CDF was determined by counting how many cascades were less than or equal to the current cascade size. The CCDF was calculated the same way, except counting how many cascades were greater than or equal to each cascade size.


Figure 4.4: The coherence values of one of the corpora used in this thesis, plotted as coherence value against the number of topics. The graph converges at around 14 topics, which means that any local maximum after this could be used to create an optimal model for the corresponding corpus. However, a smaller value would be preferred for ease of use.

4.7 Limitations

When the topics had been derived from each article, all but the most dominant topic were stripped from each article. This could cause some inaccuracies but allowed for a simpler analysis.

Because of the limited access to data from Twitter, it would be hard to create a social network graph, since Twitter only allows developers to collect 15 instances of user data per 15 minutes [25, 26]. In the worst case, it would take years to create the social network graph needed to analyze the propagation of topics in a social network.

When trying to optimize a model, it is necessary to create multiple models of the same data with different numbers of topics and passes. Because this process could take multiple weeks or even months to complete due to the size of the data analyzed, it was necessary to take random samples of the data to optimize the number of topics and passes. The optimal numbers of topics and passes used in the method were thus estimates, which should still yield close to optimal results.

Because the networks of the tweets were not collected, it was not possible to present any data related to depths after the first level.

5 Results

In this thesis, we use a dataset collected by a different student group. The data collection was outside the scope of this thesis. Here we describe how we processed and analyzed the dataset.

5.1 The datasets, or corpora, received

We received five datasets. As mentioned in the method, these datasets contained information related to tweets that mention URLs linking to news articles. The texts from these articles were also available in the datasets. Each dataset spanned seven days, give or take 24 hours. Why these datasets span about seven days, and why the recordings began on the days they did, were things outside our control.

5.2 Optimizing the LDA models

Running the optimization with 5% of each full corpus resulted in the numbers of topics and passes listed in Table 5.1.

Name                 Start date  Start day  Number of topics  Number of passes
Spring 2020          20-03-31    Tuesday    10                14
Fall 2020            20-03-31    Saturday   8                 11
Early-February 2021  21-02-01    Monday     14                16
Mid-February 2021    21-02-14    Sunday     15                10
March 2021           21-03-02    Tuesday    14                16

Table 5.1: The number of topics and passes which were used for each corpus. All corpora span one week after the start date.

These values are not the absolute optimal values; local maxima with values similar to the global maxima were chosen.

5.3 Mid-February 2021 results

Figure 5.1 to Figure 5.5 show the results made from the Mid-February 2021 data. The left subplots contain the topics which are on average smaller than those in the right subplots.

Looking at Figure 5.1 and Figure 5.2, the "Former President Trump" topic is on a downward trend, while President Biden seems like a regularly occurring topic with a few spikes in popularity. The "US new bills 2021" topic also looks regular and seems to correlate with different topics such as "Black lives matter", "Economics", and "Family". One could suspect that the sudden large incline for both the former and current presidents after the sizeable increase in the "Texas power crisis" topic could be the beginning of a feedback loop. It could also be that they both only gave their statements on the topic and left it at that.

In the CCDF in Figure 5.4, the "Former President Trump" topic was included in both graphs, making the topics easier to compare. The patterns of the topics can be examined and compared to each other to determine whether they are made up of mostly true, false, or mixed news. If a topic's CCDF has a convex shape, it is likely either mixed or false news. All topics except "New technology" and "Mexican/sports news" have a convex shape; these topics with more concave shapes likely contain true news. The topics which have convex shapes but suddenly go to 0% for larger cascades are likely made up of false news, such as "Former President Trump", "Black lives matter", "US police scandals", and "Texas power crisis". The other convex topics that do not suddenly go down to 0%, and instead continue further to the right, are likely mixed news. Examples of such topics are "Mars space mission", "Covid-19 vaccines", and "Covid-19 restrictions".

Looking at Figure 5.3, 30% of the cascades related to the “Medical research” topic have a size greater than 1. For most of the other topics, circa 20% of the cascades have a size greater than 1 instead. This might suggest that the “Medical research” topic disseminates more broadly.

Let us focus on the "Texas power crisis" topic. In Figure 5.2, we can see that until the 120-hour mark, the topic grows exponentially. In Figure 5.5, we can also see that the first quartile of the topic is broader compared to the other topics. This might suggest that multiple smaller influencers can cause further dissemination than fewer larger influencers, as Bakshy et al. presented [15].


Figure 5.2: Size of all topics in the corpus made from the Mid-February 2021 data.

Figure 5.3: CDF of all topics in the corpus made from the Mid-February 2021 data.

5.4 Spring 2020 results

Figure 5.6 to Figure 5.10 show the results made from the Spring 2020 data. The left subplots contain the topics which are on average smaller than those in the right subplots.

Looking at Figure 5.6, Figure 5.7, and Figure 5.9 the same way, one could say that "Spanish tech news", "Covid-19 and Los Angeles", "Covid-19 and health", and "US China news" would likely contain false news. Similarly, "Covid-19 and work" and "Covid-19 and the economy" would mostly consist of mixed news, while "Virus research", "Competitive sports", "Donald Trump 2020 US election", and "Discussing time at work and home" would be mostly true news. The "Covid-19 and health" and "Covid-19 and the economy" topics seem to correlate with each other. Early in the week, these two topics also seem to correlate with "US China news" and "Covid-19 and work". This could suggest that the "Covid-19 and health" and "Covid-19 and the economy" topics spread at the same time, but that "Covid-19 and the economy" spread further, as it was considered mostly mixed news according to its pattern, while "Covid-19 and health" could be considered mostly false news according to its pattern. In Figure 5.8, we can see when a topic reaches 100%, which tells us how large cascades related to that topic can get. For example, most of the cascades related to "Spanish tech news" are smaller than 100. In Figure 5.10, none of the topics individually stand out, compared to the "Texas power crisis" topic in Figure 5.5.

5.5 Comparing two datasets

From the two datasets presented above, the ICs of similar topics can be compared. In Table 5.2, the topics from each corpus are displayed together with their respective ICs. First, the topics related to Covid-19 in the two corpora are compared using the average IC of those topics. The average ICs of Covid-19-related topics in the Mid-February 2021 and Spring 2020 corpora are 3.19 and 4.22 respectively. This means that the impact of Covid-19-related topics decreased by 1.03 retweets per original tweet. Looking at the two topics related to former President Trump, the IC increased from 4.91 to 5.71. Comparing the two topics related to the economy, the IC increased from 2.45 to 2.75.

Mid-February 2021 corpus                       Spring 2020 corpus
Topic                     Impact Coefficient   Topic                              Impact Coefficient
New technology            2.28                 Spanish tech news                  4.12
Mexican/sports news       3.35                 Competitive sports                 5.23
Economics                 2.75                 Virus research                     7.18
Covid-19 restrictions     1.3                  Donald Trump 2020 US election      4.91
Medical research          5.56                 Discussing time at work and home   6.84
Family                    2.76                 Covid-19 and Los Angeles           4.26
Black lives matter        4.84                 US China news                      4.47
US new bills 2021         2.55                 Covid-19 and health                4.93
Covid-19 vaccines         2.72                 Covid-19 and work                  2.29
Mars space mission        5.05                 Covid-19 and the economy           2.45
Royal Family on Facebook  3.52
Former President Trump    5.71
US police scandals        4.69
President Biden           1.59
Texas power crisis        3.81

Table 5.2: ICs of the topics within the Mid-February 2021 and Spring 2020 corpora. The topics colored purple and orange (in the original layout) are related to Covid-19 and Donald Trump, respectively.


Figure 5.7: Size of all topics in the corpus made from the Spring 2020 data.


Figure 5.9: CCDF of all topics in the corpus made from the Spring 2020 data.

6 Discussion

Here we discuss the results, method, and the work in a wider context.

6.1 Results

The custom corpora that were created for this thesis contained a lot more data than what was used. For the popularity, size, boxplot, CDF, and CCDF plots, only retweets were used, but as seen in Figure 4.1, the fields "likes", "quotes", and "mentions" were available as well. These unused fields could have been used to measure sentiment, which could tell us how emotions can lead to further dissemination.

The final models used for the results all had a coherence score above 0.5, which does not imply that the models were good or bad, only that they were better than models with a coherence score below 0.5. As the optimization of the final models was done using samples made from a random 5% of the custom corpora, the final models would only be as good as the random samples created during the optimization. This could cause issues when trying to replicate the results presented in this thesis.

Similarly to Heimbach et al. [9], we found that topics related to technology, science, politics, and business disseminated more than other topics. Because their findings came from comparing different services to each other, the increased activity in these topics might be unrelated to our findings. Also like Heimbach et al., we found that original content, or original topics in this case, was more likely to disseminate further. If we look at the "Texas power crisis" topic in the Mid-February 2021 dataset, we can see an exponential increase in the topic's activity, suggesting that the topic might be new. Searching for the topic in Google Trends [27], we can see that it peaks around the time the Mid-February 2021 dataset was conceived, with little to no activity outside of this week. This reinforces the idea that the "Texas power crisis" topic was new and that originality affects the dissemination of a topic.

The work of Vosoughi et al. [1] suggested that false news disseminated much more in all of their metrics. The works of Tambuscio et al. [8] and Zhao et al. [16] suggest that if the probability of a topic's integrity being verified is lower than a certain threshold, the topic will continue to disseminate further, which reinforces the statement of Vosoughi et al. This might also be the reason why mixed news disseminates the furthest [1]: because the news is a mix of true and false content, it is hard to verify. This gives a low chance of verifying the integrity of the topic, causing the topic to disseminate deeper and broader.

6.2 Method

We spent a significant amount of time trying to identify a good dataset to analyze for this project. Due to the lack of public social networking datasets, this is a challenging task. About halfway through the project, we joined forces with a student group collecting data from Twitter that fit our needs.

Due to time constraints, the creation of the models could have been better. When first optimizing a model, a random set of documents is extracted from the corresponding corpus. This new, random, and smaller corpus was used to estimate the original corpus, meaning important information, such as the distributions of words and the words themselves, could sometimes be lost. This somewhat hinders replicability, as the sample is random each time. Preferably, the whole corpus would have been used in the optimization process, but unfortunately, this would have taken too much time. If the whole corpus had been used instead, any large or small inaccuracies caused by creating random samples would no longer be an issue.

Also, instead of picking global maxima for coherence optimization, local maxima in the region where the coherence value has almost converged were usually picked, to decrease the number of topics and passes needed by the model and thereby save time.

In the end, only the LdaMulticore function was used, as the LdaMallet function would have taken too much time to generate models. This is unfortunate, as MALLET seems to be able to yield better results more often than not. If time was not a problem, the models of Gensim and MALLET would have been compared to each other for each corpus, to expand the available options for further optimization. Furthermore, there was no time to compare LdaMulticore with and without the TF-IDF function, meaning TF-IDF was never used.

In our early tests, we compared LdaMallet to LdaMulticore. Looking at the resulting topics produced by both functions, MALLET made word distributions that focused on fewer, more important words, causing the topics to be more distinct. This might be comparable to the results the TF-IDF function could produce together with LdaMulticore, which was not tested in this thesis.

A lot of time was needed to find data, and almost as much time was needed to process and analyze it. When performing these experiments in the future, more time should be spent on collecting better data, doing better optimizations, and doing a more extensive analysis. For example, if the dataset were one large, continuous stream, we would have had access to the complete lifetimes of multiple topics. This would make the analysis more accurate, compared to analyzing samples of topics' lifetimes where the before and after are unknown.

Also, using (re)tweets as a metric for measuring the dissemination of information online is not always accurate, because the links shared through these (re)tweets can have widely differing clicks-per-retweet ratios [28].

6.3 The work in a wider context

As topics were compared to the patterns of false, true, and mixed news seen in the results of Vosoughi et al. [1], and as the topics were placed in one of these categories by observation, one could consider this result ethically incomplete. For example, we considered the "Black lives matter" topic mostly false news. Without looking further into it, one might conclude that "Black lives matter" is a false statement. However, we believe it is probably the case that most articles discussing the topic are spreading novel information related to it. As false news is mostly made up of novel information [1], considering the topic "Black lives matter" mostly false news would probably be a false positive. If the topic does consist of mostly false news, it might be articles trying to attack the efforts of those who wish to spread the true message of "Black lives matter".

Also, generalizing articles as true, false, or mixed could hurt the articles themselves. If an author posts a revolutionary article with a lot of new and novel information, their article could be automatically suppressed or censored, since there could be systems in the future designed to combat false news by detecting novel information. This would likely not happen, but it demonstrates the dangers of generalizing.

As with any work trying to manipulate the trends and natural discussion on the web, ethical problems arise. If we look at the work of Heimbach et al., we can see a focus on how to create content to make it more viral on specific platforms. If we take this into account and apply the works of Zhao et al. and Kupavskii et al., it might be possible to lead trends and manipulate the dissemination of information and topics online.

Leading different topics and kinds of information comes down to balancing the integrity and availability of information, and doing so in a way that is both practical and ethical.


7 Conclusion

At the beginning of this thesis, two research questions were presented. Regarding feedback loops, it was hard to determine whether the topics found in the results caused each other’s activity to increase. One could instead extract a social network graph from Twitter and analyze the cycles that occur when tweets spread; combined with topic modeling, it might then be possible to see a feedback loop where a topic rotates through a cycle, as sketched below. In the plots, one could occasionally see a topic receive a lot of activity and, a few hours later, another topic gain a similar amount of activity. This could not be considered a feedback loop, as the pattern did not continue.
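
As a sketch of that follow-up idea, one could build a directed retweet graph and enumerate its cycles, each being a candidate path along which a topic could repeatedly re-enter the same accounts; networkx is our choice here, and the edge list is toy data, not something extracted in this thesis:

    import networkx as nx

    # Each edge points from the retweeting account to the original author.
    edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice")]
    G = nx.DiGraph(edges)

    for cycle in nx.simple_cycles(G):
        print("potential feedback path:", " -> ".join(cycle))

Observing a cycle in the graph would not by itself prove a feedback loop; one would still need to check that the same topic’s activity repeats around that cycle over time.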

Regarding how different topics spread compared to false news, one could look at the highlights of the plots. It could be argued that new topics often come with a lot of novelty, and thus exhibit more false news according to the pattern. However, as false news mostly contains novel information, it could simply be that because a topic is new, almost all of its information is novel, and it therefore exhibits a pattern similar to false news. As topics become established, one could also argue that they lose their novelty and would thus be considered more true over time.

However, classifying the integrity of articles about a topic by only measuring the amount of novel content would not be ethical. It could cause censorship of articles that contain a lot of new and useful novelties, even though the integrity of these articles is not compromised. Combating false news should be done with care so as not to violate freedom of speech. In the future, novelty detection might be one of the steps toward separating true information from false, making the web safer through increased integrity.

7.1 Future work

We consider this work just the beginning of a larger project. Different theories and models have been studied and tools have been created; the next step could be to collect a more extensive, continuous dataset. This dataset should contain data regarding interactions (such as the number of retweets, mentions, and clicks), temporal information, and the networks of the original tweets. This would allow us to compare the topics using the metrics discussed in the Related work chapter.


If the number of clicks were available, it would be possible to see whether it is the topics within the articles or the topics of the headlines that are disseminating. It would then also be possible to see which topics are more likely to be clicked.

The tool should be expanded to use both Gensim and MALLET when optimizing the models. As mentioned before, MALLET seems to produce better results in some situations; however, it is slower than Gensim.
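
A sketch of how the tool could score both back ends on the same corpus, assuming Gensim 3.8.x (whose wrappers module still ships LdaMallet), a local MALLET installation, and the texts/corpus/dictionary objects from earlier; the MALLET path is a placeholder:

    from gensim.models import LdaMulticore, CoherenceModel
    from gensim.models.wrappers import LdaMallet

    def c_v(model, texts, dictionary):
        """Score a trained model with the c_v coherence measure."""
        return CoherenceModel(model=model, texts=texts,
                              dictionary=dictionary,
                              coherence='c_v').get_coherence()

    gensim_lda = LdaMulticore(corpus, num_topics=20, id2word=dictionary)
    mallet_lda = LdaMallet('/path/to/mallet', corpus=corpus,
                           num_topics=20, id2word=dictionary)

    print('Gensim :', c_v(gensim_lda, texts, dictionary))
    print('MALLET :', c_v(mallet_lda, texts, dictionary))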

We did not look at the contents of the news articles beyond the latent topics, nor at how they were written. To increase the depth of the analysis, we could look at the sentiment and emotionality metrics described by Heimbach et al. [9]. By classifying the writing style of a news article, or of its headline, it would be possible to correlate different writing styles with how a topic disseminates. It would also be possible to calculate how different writing styles disseminate compared to each other and how this relates to the dissemination of topics. By implementing their work together with this thesis, we could expand the work of both fields.
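
As a hedged illustration of what such a per-article feature could look like, NLTK’s VADER analyzer is one readily available sentiment scorer; it is our stand-in here, not the instrument used by Heimbach et al.:

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download('vader_lexicon')   # one-time download of the lexicon
    sia = SentimentIntensityAnalyzer()

    headline = "Revolutionary breakthrough stuns researchers"
    scores = sia.polarity_scores(headline)  # neg/neu/pos plus a compound score
    print(scores['compound'])               # > 0.05 is commonly read as positive

Each article or headline would then carry a sentiment score next to its topic assignment, letting one correlate writing style with dissemination.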

In future work, it would be interesting to look at the tweet-to-retweet ratio for different topics to estimate how much original content each topic has compared to how many reposts, retweets, etc. As Heimbach et al. [9] discovered, originality seems to correlate with virality, making it an interesting metric. We could also look at the percentage of (re)tweets questioning or verifying a topic as an extra metric for verifying the integrity of news stories [8], together with the typical patterns of true, false, and mixed news [1].
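
The metric itself is simple; a minimal sketch, assuming per-topic counts of original tweets and retweets have already been extracted (the topic names and numbers below are toy data):

    topic_counts = {
        "texas power crisis": {"tweets": 1200, "retweets": 5400},
        "black lives matter": {"tweets": 800, "retweets": 9100},
    }

    for topic, counts in topic_counts.items():
        # Share of the topic's activity that is original content.
        ratio = counts["tweets"] / (counts["tweets"] + counts["retweets"])
        print(f"{topic}: {ratio:.2%} original")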

One of the greatest tasks would be compiling the findings, concepts, theories, and ideas from the related works into one or more cohesive models that can categorize the dissemination and virality of different topics and kinds of information.


Bibliography

[1] Soroush Vosoughi, Deb Roy, and Sinan Aral. “The spread of true and false news online”. In: Science 359.6380 (2018), pp. 1146–1151. DOI: 10.1126/science.aap9559.

[2] Christopher Fox. “A Stop List for General Text”. In: SIGIR Forum 24.1–2 (Sept. 1989), pp. 19–21. DOI: 10.1145/378881.378888.

[3] M. I. Jordan and T. M. Mitchell. “Machine learning: Trends, perspectives, and prospects”. In: Science 349.6245 (2015), pp. 255–260. DOI: 10.1126/science.aaa8415.

[4] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet Allocation”. In: J. Mach. Learn. Res. 3 (Mar. 2003), pp. 993–1022. ISSN: 1532-4435.

[5] Gerard Salton and Christopher Buckley. “Term-weighting approaches in automatic text retrieval”. In: Information Processing and Management 24.5 (1988), pp. 513–523.

[6] Vandita Grover. The Viral Coefficient: Your 2020 Guide to Viral Marketing. https://www.martechadvisor.com/articles/social-media-marketing-2/viral-coefficient-and-tips-for-viral-marketing/. [Online; accessed 24-May-2021]. 2020.

[7] Xinyi Zhou and Reza Zafarani. “A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities”. In: ACM Comput. Surv. 53.5 (Sept. 2020). DOI: 10.1145/3395046.

[8] Marcella Tambuscio, Giancarlo Ruffo, Alessandro Flammini, and Filippo Menczer. “Fact-Checking Effect on Viral Hoaxes: A Model of Misinformation Spread in Social Networks”. In: Proceedings of the International Conference on World Wide Web. WWW ’15 Companion. Florence, Italy: Association for Computing Machinery, 2015, pp. 977–982. DOI: 10.1145/2740908.2742572.

[9] Irina Heimbach, Benjamin Schiller, Thorsten Strufe, and Oliver Hinz. “Content Virality on Online Social Networks: Empirical Evidence from Twitter, Facebook, and Google+ on German News Websites”. In: Proceedings of the ACM Conference on Hypertext & Social Media. HT ’15. Guzelyurt, Northern Cyprus: Association for Computing Machinery, 2015, pp. 39–47. DOI: 10.1145/2700171.2791032.

[10] Jonah Berger and Katherine L Milkman. “What makes online content viral?” In: Journal of marketing research 49.2 (2012), pp. 192–205.


[11] Marco Guerini, Carlo Strapparava, and Gozde Ozbal. “Exploring text virality in social networks”. In: Proceedings of the International AAAI Conference on Web and Social Media. Vol. 5. 1. 2011.

[12] Lilian Weng, Filippo Menczer, and Yong-Yeol Ahn. “Virality prediction and community structure in social networks”. In: Scientific reports 3.1 (2013), pp. 1–6.

[13] Daniel de Leng, Mattias Tiger, Mathias Almquist, Viktor Almquist, and Niklas Carlsson. “A second screen journey to the cup: Twitter dynamics during the stanley cup playoffs”. In: Network Traffic Measurement and Analysis Conference (TMA). IEEE. 2018, pp. 1–8.

[14] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. “Meme-Tracking and the Dynamics of the News Cycle”. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’09. Paris, France: Association for Computing Machinery, 2009, pp. 497–506. DOI: 10.1145/1557019.1557077.

[15] Eytan Bakshy, Jake M Hofman, Winter A Mason, and Duncan J Watts. “Everyone’s an influencer: quantifying influence on twitter”. In: Proceedings of the fourth ACM interna-tional conference on Web search and data mining. 2011, pp. 65–74.

[16] Zhe Zhao, Paul Resnick, and Qiaozhu Mei. “Enquiring Minds: Early Detection of Rumors in Social Media from Enquiry Posts”. In: Proceedings of the International Conference on World Wide Web. WWW ’15. Florence, Italy: International World Wide Web Conferences Steering Committee, 2015, pp. 1395–1405. DOI: 10.1145/2736277.2741637.

[17] Andrey Kupavskii, Liudmila Ostroumova, Alexey Umnov, Svyatoslav Usachev, Pavel Serdyukov, Gleb Gusev, and Andrey Kustarev. “Prediction of Retweet Cascade Size over Time”. In: Proceedings of the ACM International Conference on Information and Knowledge Management. CIKM ’12. Maui, Hawaii, USA: Association for Computing Machinery, 2012, pp. 2335–2338. DOI: 10.1145/2396761.2398634.

[18] Project Jupyter. Jupyter. https://jupyter.org/. [Online; accessed 21-May-2021]. 2021.

[19] Twitter, Inc. Twitter API. https://developer.twitter.com/en/docs/twitter-api/. [Online; accessed 18-May-2021]. 2021.

[20] Leonard Richardson. Screen-scraping library beautifulsoup4. https://pypi.org/project/beautifulsoup4/. [Online; accessed 18-May-2021]. 2020.

[21] Steven Bird, Edward Loper, and Ewan Klein. NLTK 3.6.2 documentation. https://www.nltk.org/api/nltk.stem.html?highlight=wordnetlemmatizer. [Online; accessed 26-April-2021]. 2021.

[22] Radim Řehůřek. Functions to preprocess raw text. https://radimrehurek.com/gensim_3.8.3/parsing/preprocessing.html. [Online; accessed 26-April-2021]. 2019.

[23] Steven Bird, Edward Loper, and Ewan Klein. Installing NLTK Data. https://www.nltk.org/data.html. [Online; accessed 26-April-2021]. 2021.

[24] Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu/. [Online; accessed 26-April-2021]. 2018.

[25] Twitter, Inc. Rate limits. https://developer.twitter.com/en/docs/twitter-api/rate-limits. [Online; accessed 26-April-2021]. 2021.

[26] Twitter, Inc. Follow, search, and get users. https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/follow-search-get-users/api-reference/get-followers-ids. [Online; accessed 26-April-2021]. 2021.

[27] Google Inc. Google Trends: “texas power crisis”. https://trends.google.com/trends/explore?date=today%205-y&geo=US&q=texas%20power%20crisis. [Online; accessed 27-May-2021]. 2021.


[28] Jesper Holmström, Daniel Jonsson, Filip Polbratt, Olav Nilsson, Linnea Lundström, Sebastian Ragnarsson, Anton Forsberg, Karl Andersson, and Niklas Carlsson. “Do We Read What We Share? Analyzing the Click Dynamic of News Articles Shared on Twitter”. In: Proceedings of the IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ASONAM ’19. Vancouver, British Columbia, Canada: Association for Computing Machinery, 2019, pp. 420–425. DOI: 10.1145/3341161.3342933.


A Appendices

A.1 Fall 2020 results

The graphs from Figure A.1 to Figure A.5 are the resulting plots made from the Fall 2020 data.


Figure A.2: Size of all topics in the corpus made from the Fall 2020 data.


Figure A.4: CCDF of all topics in the corpus made from the Fall 2020 data.


A.2 Early-February 2021 results

The graphs from Figure A.6 to Figure A.10 are the resulting plots made from the Early-February 2021 data.

Figure A.6: Popularity of all topics in the corpus made from the Early-February 2021 data.


Figure A.8: CDF of all topics in the corpus made from the Early-February 2021 data.


A.3 March 2021 results

The graphs from Figure A.11 to Figure A.15 are the resulting plots made from the March 2021 data.

Figure A.11: Popularity of all topics in the corpus made from the March 2021 data.


Figure A.13: CDF of all topics in the corpus made from the March 2021 data.
