
Degree Project in Technology, First Cycle, 15 credits
Stockholm, Sweden 2017

Analysing Credibility of Twitter Users Using the PageRank Algorithm

ALICE HEAVEY
ELIN KARAGÖZ

KTH
School of Computer Science and Communication


Analysing Credibility of Twitter Users Using the PageRank Algorithm

ALICE HEAVEY
ELIN KARAGÖZ

Bachelor’s Thesis in Computer Science
Date: June 1, 2017
Supervisor: Alexander Kozlov
Examiner: Örjan Ekeberg

Swedish title: Analys av trovärdighet hos Twitteranvändare med hjälp av PageRank-algoritmen

School of Computer Science and Communication


Abstract

In a time when information and opinions are to a large extent shared via social media, it is important to find a way to determine how credible the content is. The purpose of this study is to investigate whether PageRank-based algorithms can be used to determine how credible a Twitter user is, based on how much the user's posts are retweeted by other users. Two different algorithms based on PageRank have rated the credibility of Twitter users in a network. This ranking has been compared with a manual credibility check on the users to determine how close to reality the credibility distribution from the algorithms is. The results show that the algorithms can be said to perform better than random, but they still assign inaccurate credibility scores to many users. The simplicity of the algorithms is an advantage compared to other methods used in previous research. The conclusion is that the algorithms in their current states are not suitable for determining the credibility of Twitter users.


Sammanfattning

In a time when information and opinions are to a large extent shared via social media, it is important to find a way to determine how credible this content is. The purpose of this study is to investigate whether algorithms based on the PageRank algorithm can determine how credible a Twitter user is, based on how much the user's posts have been shared by other users. Two different algorithms based on PageRank have ranked the credibility of Twitter users in a network. This ranking has then been compared with a manual credibility assignment for the users, in order to determine how close to reality the algorithms' credibility distribution is. The results show that the algorithms can be considered to perform better than chance, but that they nevertheless assign an incorrect credibility to many users in the network. The trivial nature of the algorithms gives them an advantage over the algorithms used in previous studies. The conclusion is that the algorithms in their current form are not suited to establishing the credibility of Twitter users.


Contents

1 Introduction
  1.1 Problem Definition
  1.2 Scope and Constraints
  1.3 Outline
2 Background
  2.1 Twitter
    2.1.1 Users
    2.1.2 Tweet
    2.1.3 Retweet
    2.1.4 Hashtag
  2.2 PageRank
  2.3 Previous Studies
    2.3.1 Credibility on Twitter
    2.3.2 Evaluating Credibility on Twitter
3 Method
  3.1 Literature Study
  3.2 Selection of Data
  3.3 Data Collection and Storage
    3.3.1 Twitter APIs
    3.3.2 Python
    3.3.3 Tweepy
    3.3.4 Neo4j
    3.3.5 Database Schema
  3.4 Heuristic
    3.4.1 PageRank-Reset
    3.4.2 PageRank-Keep
    3.4.3 LogRank
  3.5 Manual Credibility Check
    3.5.1 User Types
    3.5.2 Verification Score
    3.5.3 Follower Count Score
    3.5.4 Spam
  3.6 Manual Control of Results
  3.7 Estimating Credibility Distribution in User Population
4 Results
  4.1 PageRank-Reset
    4.1.1 Manual Credibility Check of PageRank-Reset
  4.2 PageRank-Keep
    4.2.1 Manual Credibility Check of PageRank-Keep
  4.3 Estimated Credibility Distribution in Population
  4.4 Observations
    4.4.1 Spam
    4.4.2 Credible Users with Low LogRank
    4.4.3 Circular Endorsement
5 Discussion
6 Conclusion
  6.1 Future Research
Bibliography
A Tables


Chapter 1

Introduction

Today an increasing part of society consumes news through social media outlets such as Twitter and Facebook. As a result, concerns are raised about the credibility of the news stories shared in these non-classical environments. In a media feed where news stories from established media houses are interspersed with those from lesser-known news websites and personally authored posts from individual users, it becomes ever more difficult to discern what is true and what is not.

The problem of fake news stories spreading on social media sites has in recent years become ever more pronounced, and discussions around the responsibility of said media sites have arisen. The sheer amount of new content posted every day prohibits manual assessment of credibility. One way to tackle these problems could be to develop an algorithm for automatically analysing the credibility of content published and shared online. If such an algorithm is found and it performs well, it could be used for automatically labelling online content as credible or non-credible, which could help prevent fake news stories from spreading. However, a badly performing algorithm could, if implemented, cause real damage to the trust of the users, as it could mislabel credible online content as non-credible, and vice versa.

One way to tackle this issue is to use a version of the PageRank algorithm, which is used by Google to rank websites tracked by their search engine, based on links from other websites pointing to them[4]. By applying the same concepts, but to Twitter users and retweets instead of websites and links, credibility could be distributed over a network of users based on how they interact with each other.

PageRank has been used in areas other than website ranking. For example, it has been used to predict the use of public spaces[13], and it is used in Twitter's follower recommendation service[6]. Its widespread use, together with its simplicity, makes it interesting to evaluate the algorithm's performance in credibility analysis, which is why it has been chosen for this study.

1.1 Problem Definition

This study will investigate if the PageRank algorithm can be used to successfully determine the credibility of Twitter users. This will be done by formulating and testing a heuristic for ranking the credibility of Twitter users based on retweet data.


1.2 Scope and Constraints

The focus of the study is to investigate the possibility of ranking the credibility of Twitter users based on their activity on Twitter. Every attempt is based on the PageRank algorithm.

A user is considered credible by the algorithms if the user is retweeted by users with high credibility. To determine the success of the heuristic, samples of rated users will be examined manually, based on a number of criteria.

The dataset is constrained to users associated with a set of hashtags defined in the method chapter of this report.

1.3 Outline

Chapter 2 introduces necessary terminology, concepts and general background regarding the subject. In chapter 3 the implementations used to extract data from Twitter and the algorithms used to interpret the data are explained. Chapter 4 presents the results of the analysis of Tweets and Twitter users, and these results are discussed in chapter 5. In chapter 6 a conclusion of the study is presented together with possible future research areas.


Chapter 2

Background

The aim of the background is to introduce the notions, concepts and techniques which are used in this study. In section 2.1 the properties of the Twitter network are presented, in section 2.2 the PageRank algorithm is explained, and in section 2.3 previous studies on automated credibility analysis on Twitter are presented.

2.1 Twitter

Twitter is an online social networking community founded in 2006, where users interact by posting messages known as "Tweets", restricted to 140 characters. While membership is required for someone who wants to post a Tweet, non-members can read Tweets posted by members of Twitter. As of June 2016, Twitter had 313 million monthly active users and 1 billion unique monthly visits to sites with embedded Tweets[12]. On an average day 500 million Tweets are posted[15], making Twitter an important source of information.

2.1.1 Users

A user is someone who creates content, and interacts with other users, on Twitter. A user interacts with other users by posting Tweets, or by retweeting, sharing and liking other users' Tweets. Users can follow each other[11], and Tweets posted by followed users appear in the user's Twitter feed.

2.1.2 Tweet

Tweets constitute the Twitter content and are posted by Twitter users. Each Tweet is composed of up to 140 characters, and can also contain photos, videos and links. Other users can post, retweet, quote, share and like the Tweet.

2.1.3 Retweet

A retweet is a repost or a forward of a Tweet. If a user retweets a Tweet it will appear in the Twitter feeds of the user’s followers.


2.1.4 Hashtag

Tweets can contain hashtags, which can be used to index keywords or subjects for a Tweet[10]. Hashtags are words preceded by a hash sign (#), and they are used as a way to categorise posts. Below follow four hashtags relevant to this study.

#svpol

Hashtag used mainly by politicians and politically involved Twitter users to comment on current events in Swedish politics and society in general[16].

#svtagenda

Hashtag collecting comments about the live television program Agenda. Agenda focuses on the most important events in Sweden and the world[2].

#svtopinion

Hashtag collecting comments about the live television debate program Opinion live and debate articles from SVT[3].

#opinionlive

Hashtag collecting comments about the live television debate program Opinion live[1].

2.2 PageRank

PageRank is an algorithm most famous for being used by Google in order to rank websites. The algorithm represents the Internet as a directed graph, where nodes represent websites, and the directed edges represent hyperlinks from one website to another. The rank of a certain website is determined by the quality of the links pointing to it, which in turn is determined by the rank of the pointing site. The algorithm is run iteratively, repeatedly calculating and updating the rank of the sites, until a stable ranking of all the websites in the graph is achieved[4]. For a more thorough description and pseudocode of the simplest version of the algorithm, see section 3.4.1.

2.3 Previous Studies

In this section three different approaches for analysing credibility in Twitter events are presented. An event is a collection of Tweets on the same subject, such as a current news event. The approach presented in 2.3.1 calculates event credibility based on different features extracted from associated users and Tweets, and from the event itself. Having been cited over 800 times, this article can be considered to have laid down the fundamentals for credibility analysis on Twitter. In 2.3.2 two methods, BasicCA and EventOptCA, are presented. They both take a cue from the method in 2.3.1, but refine the credibility scores through a PageRank-inspired process. Together these two studies are considered to give an overview of the field, as well as the state of the art, of credibility analysis of Twitter users.


2.3.1 Credibility on Twitter

This study was conducted on around a million Tweets collected over a two-month period. The study examines discussion topics on Twitter by studying bursts of activity on the same topic. Relevant features are extracted from each labelled topic, and these are used to build a classifier that attempts to automatically determine if a topic corresponds to newsworthy information, and to automatically assess its level of credibility. The classification takes the text content, the network of the user and propagation (retweets and previous Tweets) into account when assigning credibility. After this, each item is assessed on its level of credibility through surveys answered by a group of human judges. The accuracy of the methods is around 70-80%[5].

2.3.2 Evaluating Credibility on Twitter

This study was conducted on two datasets, each containing millions of Tweets.

BasicCA

BasicCA constructs a network of nodes consisting of users, Tweets and events. The credibility of the different nodes is initialised using an extended version of the method in 2.3.1. The credibility scores are then propagated iteratively through the network, using a PageRank-inspired approach, through which the credibility of a node is directly affected by the credibility of its neighbours. This iterative process continues until the difference between iterations falls below a threshold value. The accuracy of the BasicCA method is about 76%[14].

EventOptCA

EventOptCA is a slightly enhanced version of the BasicCA method. The difference lies in the iterative process, during which the credibility scores are propagated through the node network. In each iteration, a separate graph of events is created. This graph is used to separately calculate new credibility scores for the events in the network, using Quadratic Optimization. This approach helps similar-seeming events to get similar scores. The accuracy of the EventOptCA method is about 86%[14].


Chapter 3

Method

In this chapter the method of this study is presented. In section 3.1 the literature study is presented, sections 3.2 and 3.3 explain the selection and collection of data, and section 3.4 presents the heuristic and the two approaches to the PageRank algorithm used in this study. In sections 3.5 and 3.6 the Manual Credibility Check and its application are explained. Finally, in section 3.7 the estimation of the actual credibility in the network is presented.

The method used in this study for estimating the credibility of Twitter users takes its cue from the study presented in 2.3.2, by using the PageRank algorithm to calculate credibility over a network of users. The approach in this study is a vastly simplified version of the one used in that study.

3.1 Literature Study

A literature study was conducted in order to gain knowledge about necessary aspects of the field. Relevant academic articles and papers about Twitter credibility analysis have been used, as well as the official Twitter documentation when extracting data from Twitter. The articles that were used are Evaluating Event Credibility on Twitter[14] and Information Credibility on Twitter[5]. An overview of these articles can be found in the Background chapter under section 2.3. The literature study was also used to obtain a benchmark for how well an algorithm can be expected to perform compared to earlier analyses.

3.2 Selection of Data

Due to the vast amount of Tweets posted every day, limitations have to be set on what Tweets and users to include in this study. In order to get a useful data set, it is desirable to constrain the collected Tweets and their users in such a way that the network they form contains a sufficient amount of user interactions in the form of retweets. This can be done by constraining the collection to users who have posted Tweets containing a certain set of hashtags, as hashtags are used to gather Tweets and users around a certain topic, making them form what can be regarded as a community. In this study such a community is assumed to show, on average, a higher number of user interactions than the Twitter network as a whole.


For this study, Tweets containing any of the following four hashtags have been selected for the data set: #svpol, #svtagenda, #svtopinion, and #opinionlive. These hashtags were chosen as they are all associated with actively discussed, and highly polarising, topics. A set of Tweets based around them is therefore assumed to provide a vast diversity in credibility.

In the cases where a collected Tweet is a retweet of, or quotes, another Tweet, this other Tweet and its author are also added to the dataset. The same goes for user mentions: if a user is mentioned in a Tweet, that user becomes a part of the dataset.

3.3 Data Collection and Storage

The data is collected from the Twitter APIs, using Python and Tweepy, and stored in a Neo4j database. The data consists of Tweets posted between March 21st and April 4th 2017 that contain any of the following four hashtags: #svpol, #svtagenda, #svtopinion, and #opinionlive. The gathered collection consists of 7648 Tweets, 4271 users, and 6351 retweets. Every Tweet and retweet in the collection belongs to a user in the collection.

3.3.1 Twitter APIs

The Twitter APIs provide programmatic access to Twitter data. Tweets up to a week old can be accessed through the Search API, which allows for querying for Tweets containing certain words. In this study those words have been the four above-mentioned hashtags. The Twitter Search API has a rate limit, constraining the number of requested Tweets over a 15-minute window to 5000[9].

3.3.2 Python

Python is a high-level programming language, supporting modules, packages and libraries. The version used in this study is 2.7.10[7].

3.3.3 Tweepy

Tweepy is a Python library for accessing the Twitter APIs. The version used in this study is 3.3.0[17].
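The collection step described above can be sketched as follows. This is a minimal illustration assuming the Tweepy 3.x interface (`API.search` and `Cursor`); the function names, placeholder credentials, and the `limit` parameter are this sketch's own, not taken from the authors' code. Written in Python 3 for readability, though the study used Python 2.7.

```python
# Sketch of collecting Tweets for the four study hashtags via Tweepy 3.x.
# Credentials below are placeholders; names are illustrative assumptions.

HASHTAGS = ["#svpol", "#svtagenda", "#svtopinion", "#opinionlive"]

def build_query(hashtags):
    """The Search API treats OR-joined terms as 'match any of these'."""
    return " OR ".join(hashtags)

def collect_recent(limit=500):
    """Page through recent matching Tweets (requires valid credentials)."""
    import tweepy  # imported here so the query helper stays dependency-free
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    # wait_on_rate_limit makes Tweepy sleep through the 15-minute window
    api = tweepy.API(auth, wait_on_rate_limit=True)
    # Cursor pages through search results; .items(limit) caps the total
    return [status for status in
            tweepy.Cursor(api.search, q=build_query(HASHTAGS)).items(limit)]
```

For example, `build_query(HASHTAGS)` yields the single query string `"#svpol OR #svtagenda OR #svtopinion OR #opinionlive"` sent to the Search API.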

3.3.4 Neo4j

The graph database tool Neo4j is used to store data and to manage the relationships between the collected datasets[8].

3.3.5 Database Schema

Nodes in the database are in the form of Users and Tweets, and relationships in the form of Tweets and Retweets, as seen in figure 3.1.


Figure 3.1: Relationships Between Users and Tweets. A TWEETS edge from a user to a Tweet indicates that the user is the original author of the Tweet, whereas a RE-TWEETS edge indicates that a user has retweeted the Tweet.
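The schema in figure 3.1 could be populated with Cypher statements along these lines. This is a hypothetical sketch: the thesis does not show its queries, and the property key `id` and relationship name `RETWEETS` are assumptions here. `MERGE` makes the inserts idempotent, so a user or Tweet encountered twice is stored only once.

```python
# Hypothetical Cypher for the figure 3.1 schema (User/Tweet nodes,
# TWEETS and RETWEETS edges). Property keys and edge names are assumed.

def tweet_statement():
    """User is the original author of a Tweet."""
    return ("MERGE (u:User {id: $user_id}) "
            "MERGE (t:Tweet {id: $tweet_id}) "
            "MERGE (u)-[:TWEETS]->(t)")

def retweet_statement():
    """User has retweeted an existing Tweet."""
    return ("MERGE (u:User {id: $user_id}) "
            "MERGE (t:Tweet {id: $tweet_id}) "
            "MERGE (u)-[:RETWEETS]->(t)")
```

A Neo4j driver session could then run, for instance, `session.run(tweet_statement(), user_id=42, tweet_id=1001)` once per collected Tweet.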

3.4 Heuristic

A heuristic is formulated based on the assumption that a user's credibility depends on the credibility of the users who retweet them. This heuristic is evaluated using two versions of the PageRank algorithm, as described in the subsections below.

A graph of users is constructed from the collected data set of users and Tweets, as seen in figure 3.2. In the user graph, an edge from user X to another user Y indicates that user X has retweeted user Y. The weight of the edge describes the proportion of X's retweets that go to user Y.

Figure 3.2: Construction of a user graph from the collected dataset.
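The edge weights just described can be computed from a list of observed retweets, as in this sketch. The function name, the `(retweeter, author)` input shape, and the choice to skip self-retweets are assumptions of this illustration, not the thesis code.

```python
# Build the weighted user graph of figure 3.2 from one
# (retweeter, author) pair per observed retweet.
from collections import Counter, defaultdict

def build_retweet_ratio(retweets):
    """Return ratio[x][y] = share of x's retweets that went to user y."""
    counts = defaultdict(Counter)
    for retweeter, author in retweets:
        if retweeter != author:        # assumed: self-retweets carry no endorsement
            counts[retweeter][author] += 1
    # Normalise each user's outgoing counts so their weights sum to 1
    return {x: {y: n / sum(c.values()) for y, n in c.items()}
            for x, c in counts.items()}
```

For example, if user a retweeted b twice and c once, the edge a→b gets weight 2/3 and a→c gets weight 1/3.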

The PageRank algorithm repeatedly calculates and updates the rank of each user in the graph, until it stabilises due to changes in ranks between iterations being negligibly small.

The difference between the two algorithms presented below is that the first one, PageRank-Reset, overwrites the old ranks with the newly calculated ones, while in the second, PageRank-Keep, the newly calculated ranks are added to the previously calculated rank.


3.4.1 PageRank-Reset

This is a direct implementation of the simplest version of the PageRank algorithm.

The list retweet_ratio is initialised to contain the edge weights of each directed edge in the graph. That is, retweet_ratio[i,j] contains the proportion of user i's retweets that go to user j. Each user's rank is initialised to 1/n, where n is the number of users in the graph. The users' ranks are stored in pageRank.

The algorithm iteratively computes new ranks for all users in the graph, until the pageRank list stabilises between iterations. The values in pageRank are moved to pageRank_old, and pageRank is reset to 0 for each user. The new rank for user i is computed as the sum of pageRank_old[k] * retweet_ratio[k,i] for all users k, k ≠ i, in the graph. This means that the new rank for the i:th user is based entirely on the previous ranks of the other users that have retweeted user i.

After a new pageRank has been computed for each user, the pageRank list is normalised, so that the sum of all pageRank values is equal to 1.

Pseudocode PageRank-Reset

retweet_ratio[i, j] = proportion of user i's retweets that go to user j
pageRank[i] = 1/n, i = 1..n

while pageRank not stabilised do
    pageRank_old = pageRank
    pageRank = [0, 0, ..., 0]
    for i = 1..n do
        for k = 1..n, k != i do
            pageRank[i] += pageRank_old[k] * retweet_ratio[k, i]
        end
    end
    pageRank = normalise(pageRank)
end
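The pseudocode above can be written as a short runnable Python 3 sketch. The convergence tolerance, iteration cap, and dict-based graph representation are choices of this sketch; the thesis does not specify them.

```python
def pagerank_reset(ratio, users, tol=1e-9, max_iter=100):
    """ratio[k][i]: proportion of user k's retweets that go to user i.
    Ranks are zeroed and rebuilt from scratch in every iteration.
    All users referenced in ratio must appear in the users list."""
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(max_iter):
        old = rank
        rank = {u: 0.0 for u in users}            # the "Reset" step
        for k in users:
            for i, w in ratio.get(k, {}).items():
                rank[i] += old[k] * w
        total = sum(rank.values())
        if total > 0:                             # normalise to sum 1
            rank = {u: r / total for u, r in rank.items()}
        if max(abs(rank[u] - old[u]) for u in users) < tol:
            break                                 # stabilised
    return rank
```

Two users who only retweet each other form a symmetric graph and end up sharing the total rank equally.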

3.4.2 PageRank-Keep

The list retweet_ratio is initialised to contain the edge weights of each directed edge in the graph. That is, retweet_ratio[i,j] contains the proportion of user i's retweets that go to user j. Each user's rank is initialised to 1/n, where n is the number of users in the graph. The users' ranks are stored in pageRank.

The algorithm iteratively computes new ranks for all users in the graph, until the pageRank list stabilises between iterations. The values in pageRank are copied to pageRank_old. The new rank for user i is computed as the sum of pageRank_old[k] * retweet_ratio[k,i] for all users k, k ≠ i, in the graph. This means that the new rank for user i is based on both i's previous rank, and the previous ranks of the other users that have retweeted user i.

After a new pageRank has been computed for each user, the pageRank list is normalised, so that the sum of all pageRank values is equal to 1.


Pseudocode PageRank-Keep

retweet_ratio[i, j] = proportion of user i's retweets that go to user j
pageRank[i] = 1/n, i = 1..n

while pageRank not stabilised do
    pageRank_old = pageRank
    for i = 1..n do
        for k = 1..n, k != i do
            pageRank[i] += pageRank_old[k] * retweet_ratio[k, i]
        end
    end
    pageRank = normalise(pageRank)
end
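The Keep variant differs from the Reset sketch only in that each iteration starts from the previous ranks instead of zeros. As before, the tolerance and iteration cap are this sketch's own choices.

```python
def pagerank_keep(ratio, users, tol=1e-9, max_iter=100):
    """Same as the Reset variant, except that each iteration adds the new
    contributions on top of the previous ranks instead of zeroing them."""
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(max_iter):
        old = dict(rank)
        new = dict(old)                           # the "Keep" step: start
        for k in users:                           # from the previous ranks
            for i, w in ratio.get(k, {}).items():
                new[i] += old[k] * w
        total = sum(new.values())                 # always > 0 here, since
        rank = {u: r / total for u, r in new.items()}  # old ranks sum to 1
        if max(abs(rank[u] - old[u]) for u in users) < tol:
            break
    return rank
```

On the symmetric two-user graph the result matches the Reset variant (0.5 each), but a user who is never retweeted decays gradually rather than dropping to zero in one iteration, which is the behaviour described for this algorithm.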

3.4.3 LogRank

In each iteration, both PageRank algorithms assign a PageRank to every user in the network. In this study, a credibility score called LogRank is calculated for each user based on the PageRank calculated by the algorithm:

LogRank(user_i) = log2( PageRank(user_i) / AvgPageRank )    (3.1)

This was used to get a smaller range of values to work with, which allows for clearer presentation in graphs and tables.
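Equation (3.1) is a one-liner in code. The handling of zero ranks as negative infinity is an assumption of this sketch, made to mirror the -inf rows in the result tables.

```python
import math

def logrank(pagerank):
    """Equation (3.1): LogRank(i) = log2(PageRank(i) / average PageRank).
    Users whose rank has dropped to zero get -inf (assumed here, matching
    the -inf rows of the result tables)."""
    avg = sum(pagerank.values()) / len(pagerank)
    return {u: math.log2(r / avg) if r > 0 else float("-inf")
            for u, r in pagerank.items()}
```

A user with exactly the average PageRank gets LogRank 0, one with double the average gets 1, one with half the average gets -1, and so on.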

3.5 Manual Credibility Check

A manual user credibility check is developed, according to a number of selected criteria that, in combination with each other, are assumed to be crucial in deciding credibility, or at least to give an indication of the credibility of a user. The criteria differ for accounts associated with individuals and for those associated with organisations. This check is made in order to evaluate the performance of the algorithm, and it is constructed to be easy to apply manually, while maintaining as high objectivity and correctness as possible.

The manual credibility check gives a user a score from 1 to 3, based on the average of a verification score and a follower count score. This score is not to be confused with the one given as LogRank, but is solely connected to the manual credibility check.

3.5.1 User Types

Each user is assigned to the category whose profile best suits the account.

Person

The user account is associated with a private or public person. This person could have links to one or more organisations, but the account is intended for personal use.


Organisation

The user account is associated with an organisation, news agency, authority or political party.

3.5.2 Verification Score

Person

For personal accounts, the verification scores can be seen in table 3.1.

Table 3.1: Verification Score for Person User Accounts

Score  Description
3      Account belongs to a politician, celebrity, or business leader
2      Person behind account can be identified through name and profile picture
1      Person behind account cannot be identified

Organisation

For accounts belonging to organisations, the verification scores can be seen in table 3.2.

Table 3.2: Verification Score for Organisation User Accounts

Score  Description
3      The organisation has a legally responsible person
1      A legally responsible person cannot be identified

If the user is the Twitter account of a news agency: is there a publisher, and does the editorial office have an official address? This information indicates that the news agency is serious and can take responsibility for its content.

Assumptions

Due to accountability, a user account connected to a physical person is assumed to have more credible content than one that has no such connection.

3.5.3 Follower Count Score

The more followers a user has, the more credible the user is considered to be. The scoring is assigned as seen in table 3.3.

Table 3.3: Follower Count Score for all User Accounts

Score  Description
3      The user account has over 5000 followers
2      The user account has between 1000 and 4999 followers
1      The user account has between 0 and 999 followers


Assumptions

The approval of a large group of users boosts the credibility of a user, leading to the assumption that a user with a large number of followers is more credible than a user with a smaller number of followers.

3.5.4 Spam

If a user posts a large number of similar Tweets within short periods of time, or on average uses over three hashtags in their Tweets, the user is considered to be producing spam posts. Tweets are considered similar if they all contain a small set of the same hashtags, phrases and links, or consist of posts with repetitive content. Posts are considered repetitive if they contain the same content over and over again with little or no alteration. Users with this behaviour are considered non-credible. All spam users receive a manual credibility score of 1, regardless of their number of followers.

Assumptions

If a Twitter account contains spam posts, this is considered bad for the user's credibility, since such posts or retweets are generated to gain followers or attention, and do not contain actual content.
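The rules of sections 3.5.1-3.5.4 can be combined into one scoring function. The dict fields used here (`type`, `public_figure`, `identifiable`, `legally_responsible`, `spam`, `followers`) are hypothetical encodings of the manually judged criteria, and treating exactly 5000 followers as the high band is this sketch's assumption, since the thesis tables leave that boundary ambiguous.

```python
# Sketch of the manual credibility check; field names are assumptions.

def verification_score(user):
    """Tables 3.1 and 3.2."""
    if user["type"] == "organisation":
        return 3 if user.get("legally_responsible") else 1
    if user.get("public_figure"):      # politician, celebrity, business leader
        return 3
    if user.get("identifiable"):       # real name and profile picture
        return 2
    return 1

def follower_score(followers):
    """Table 3.3 (exactly 5000 treated as the high band here)."""
    if followers >= 5000:
        return 3
    if followers >= 1000:
        return 2
    return 1

def manual_credibility(user):
    """Average of the two sub-scores; spam accounts always score 1."""
    if user.get("spam"):
        return 1.0
    return (verification_score(user) + follower_score(user["followers"])) / 2
```

So a verified politician with 12000 followers scores 3.0, an unidentifiable account with 300 followers scores 1.0, and a spam account scores 1.0 regardless of followers.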

3.6 Manual Control of Results

All users with an assigned LogRank of three or higher are evaluated using the manual credibility check. Users with a LogRank of two or less are not evaluated, as there are too many users in the network for one to manually check everyone within a reasonable time. The line was drawn at this value both because it gave a manageable number of users to check manually, and because users with a LogRank of two or less were not considered to stand out enough, given how many other users scored higher.

3.7 Estimating Credibility Distribution in User Population

In order to evaluate the efficiency of the algorithms used in this study, the credibility distribution across the entire population of 4271 users is estimated. The distribution is estimated by evaluating the manual credibility of a randomly chosen sample of 354 users. The results can then be used to estimate the distribution across the population, assuming a multinomial distribution.
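A per-category estimate of this kind can be sketched as below. The normal-approximation margin z·sqrt(p(1-p)/n) with z = 1.96 is one standard reading of a "standard error at a 95% confidence level" for multinomial proportions; the thesis does not spell out its exact formula, so this is an assumption of the sketch.

```python
import math

def population_estimates(sample_counts, z=1.96):
    """For each manual-credibility category, return the sample proportion
    and a normal-approximation margin z * sqrt(p * (1 - p) / n), where n
    is the total sample size (354 in the study)."""
    n = sum(sample_counts.values())
    return {cat: (k / n, z * math.sqrt((k / n) * (1 - k / n) / n))
            for cat, k in sample_counts.items()}
```

For a category observed in half of a 354-user sample, this gives a proportion of 0.5 with a margin of about ±0.052, i.e. an estimated population share of roughly 45-55%.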


Chapter 4

Results

In this chapter the results from both approaches of the PageRank algorithm are presented, as well as the estimated credibility distribution in the entire population of users.

4.1 PageRank-Reset

The PageRank-Reset algorithm terminates after four iterations on the collected data set. Out of the 4271 users in the network, 397 users receive a positive LogRank after the first iteration, as shown in table 4.1. Out of these, 114 users receive a LogRank of three or higher. In the succeeding iterations the number of users with a positive LogRank decreases steadily, until only two are left.

Table 4.1: LogRank distribution after each iteration of the PageRank-Reset algorithm. Rows show LogRank, and columns show iterations.

LogRank      1     2     3     4
11           1
10           1
9            1     3     1
8            1     3     2
7            2     4
6            5     7
5           17    12
4           27    17
3           55
2          113
1          177
-inf      3874  4227  4265  4269

Increased  397    44     6     2
Decreased 3874  4227  4265  4269

4.1.1 Manual Credibility Check of PageRank-Reset

After the first iteration, the manual credibility check finds that two of the three users rated highest by the algorithm are spam accounts. The rest of the accounts that the algorithm considers to have high credibility show a large spread of credibility scores after the manual check. At this point the credibility scores of the algorithm do not correlate with the credibility scores of the manual check, as seen in table 4.2 a).

After the second iteration there is only one user among those rated highest by the algorithm who is considered to have a low credibility score by the manual check (rated below 2). The rest of the users have fairly high credibility according to the manual check. This is shown in table 4.2 b).

After the third iteration there are only six users with a positive LogRank left, and they all have a high credibility score from the algorithm, as seen in table 4.2 c). Of these six users, four are considered very credible after the manual check.

After the fourth iteration the algorithm stabilises and there are only two users left. They are considered to be credible by both the algorithm and the manual credibility check. This is shown in table 4.2 d).

In the following iterations the two remaining users take turns endorsing each other with credibility, and the algorithm is terminated.

Figure 4.1 shows how the manual rating scores are distributed among the Twitter users considered credible by the PageRank-Reset algorithm.

Table 4.2: Manual credibility score for each iteration of the PageRank-Reset algorithm. The column Rating contains the manual credibility scores.

a) Iteration 1

Rating  Count  Percent
3        33    30.8%
2.5      19    17.8%
2        25    23.4%
1.5      19    17.8%
1         7     6.1%
Sum     114

b) Iteration 2

Rating  Count  Percent
3        11    25.0%
2.5      11    25.0%
2         9    20.5%
1.5      11    25.0%
1         2     4.5%
Sum      44

c) Iteration 3

Rating  Count  Percent
3         2    33.3%
2.5       2    33.3%
2         1    16.7%
1.5       1    16.7%
1         0     0.0%
Sum       6

d) Iteration 4

Rating  Count  Percent
3         2    100%
2.5       0     0.0%
2         0     0.0%
1.5       0     0.0%
1         0     0.0%
Sum       2


Figure 4.1: Distribution of manual credibility score among users assigned a LogRank of three or higher, for each iteration of the PageRank-Reset algorithm.

After the first iteration, 27.6% of the users that are considered credible by the algorithm are considered to have low credibility by the manual credibility check. As the total number of users drops for each iteration, the number of users who are considered non-credible decreases. After the fourth iteration, all of the users considered credible by the algorithm are also considered credible by the manual check. This is shown in figure 4.1.

4.2 PageRank-Keep

After 17 iterations of the PageRank-Keep algorithm, there are only two users left with a positive ranking. The rest of the users receive a lower ranking for each iteration, while the two positive users remain the only ones who are considered credible. This is when the algorithm terminates. Since the credibility score assigned after the first iteration remains through the following iterations, users who would lose their credibility quickly in the PageRank-Reset algorithm remain credible longer in the PageRank-Keep algorithm.

Table 4.3 shows the distribution of LogRank after every other iteration. As seen in the table, up until the seventh iteration the number of users with increased (non-negative) LogRank remains stable, and thereafter it decreases steadily until the 17th iteration, in which only two users remain.


Table 4.3: LogRank distribution after every other iteration of the PageRank-Keep algorithm. Rows show LogRank, and columns show iterations. Only odd iterations are shown in the table due to the small changes between each iteration.

LogRank  1  3  5  7  9  11  13  15  17

11 1 1 1 1

10 1 1 1 1 1

9 1

8 1 2

7 2 1 2 1

6 2 1 5 6 3 1

5 3 10 12 8 9

4 4 25 23 25 10 7 1

3 32 29 40 29 23 7

2 46 93 81 52 39 17 6

1 135 66 63 102 76 28 7 1

0 175 171 171 171 63 39 11 3

-1 3874 3874 171 62 23 7 1

-2 3874 63 40 8 2

-3 3874 171 73 17 6

-4 3874 63 30 6

-5 171 33 13

-6 3874 62 20

< -6 3874 4108 4221

Increased 397 397 397 397 226 101 27 6 2

Decreased 3874 3874 3874 3874 4045 4170 4244 4265 4269

4.2.1 Manual Credibility Check of PageRank-Keep

Figure 4.2 shows a manual credibility check on users predicted to be credible by the PageRank-Keep algorithm. During the first five iterations, the number of users considered credible by the PageRank-Keep algorithm increases steadily. In relative terms, the number of users with score 1.5 increases the most, by 183% between iterations 1 and 5. In absolute terms, the number of users with scores 1.5 and 3 each increased by eleven between the same iterations. After the fifth iteration the number of users considered credible decreases, while the number of users considered non-credible remains steady through the rest of the iterations, until iteration 15, when only users considered credible remain. For more exact numbers, see Table A.1.


Figure 4.2: The distribution of manual credibility scores after each iteration, among users who have scored a LogRank of three or higher from the PageRank-Keep algorithm.

4.3 Estimated Credibility Distribution in Population

Results from the manual credibility check on the randomly selected sample, drawn from the entire population of collected Twitter accounts, are shown in Table 4.4. The column s.e. contains the standard errors of the sample proportions, assuming a multinomial probability distribution; confidence intervals at the 95% level are obtained as the estimate plus or minus 1.96 standard errors.

Table 4.4: Manual credibility score in sample of population

Rating  Count  Percent  s.e.
3        33     9.3%    1.5%
2.5      16     4.5%    1.1%
2        55    15.5%    1.9%
1.5     122    34.5%    2.5%
1       128    36.2%    2.6%
Sum     354
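The standard errors above can be reproduced with the usual per-category proportion formula for a multinomial sample, se = sqrt(p(1 - p)/n); a minimal sketch using the counts from Table 4.4:

```python
import math

# Sample counts per manual credibility rating (from Table 4.4).
counts = {"3": 33, "2.5": 16, "2": 55, "1.5": 122, "1": 128}
n = sum(counts.values())  # 354 sampled users

for rating, k in counts.items():
    p = k / n                              # sample proportion
    se = math.sqrt(p * (1 - p) / n)        # per-category standard error
    lo, hi = p - 1.96 * se, p + 1.96 * se  # 95% confidence interval
    print(f"rating {rating}: p = {p:.1%}, s.e. = {se:.1%}, CI = [{lo:.1%}, {hi:.1%}]")
```

Scaling these proportions and intervals by the population size (4271 users) yields the estimated counts and confidence limits reported in Table A.2.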

Figure 4.3 shows the estimated number of users in the entire population for each manual credibility score. According to the estimation, a majority of circa 70% of the users in the population have a low or very low manual credibility score. Around 15% of users have a medium score, and the remaining 15% have a high or very high manual credibility score. For exact numbers, see Table A.2.


Figure 4.3: Estimated distribution of manual credibility score in the entire population. The error bars mark the confidence intervals.

4.4 Observations

4.4.1 Spam

Four users were assigned high LogRank scores by the PageRank algorithms, but were considered to be posting spam in the manual check. In the PageRank-Reset algorithm, two of these users appeared with a LogRank of 3 or higher in the first iteration. None of them remained in succeeding iterations. In the PageRank-Keep algorithm, these four users maintained a LogRank of 3 or higher until the 10th iteration. In the 13th iteration, all of them had a LogRank under 3.

Spam User 1

The user has posted 884 of the 7648 Tweets in the dataset, which is almost 12% of all Tweets. On average, each Tweet contains 6.25 hashtags. The user has been retweeted a total of 325 times. The user is considered a spam user due to repetitive content in the posted Tweets, as well as the excessive use of hashtags.

Spam User 2

The user has posted one Tweet in the dataset. The Tweet contains nine hashtags, and has been retweeted by 140 users, 139 of which have not posted a single Tweet in the dataset.

The user is considered a spam user due to an excessive use of hashtags.

Spam User 3

The user has posted 77 Tweets in the dataset, and has been retweeted a total of 416 times. On average, each Tweet contains 1.09 hashtags. The user is considered a spam user due to repetitive content in the posted Tweets.

Spam User 4

The user has posted 8 Tweets in the dataset, and has been retweeted a total of 121 times. On average, each Tweet contains 0.625 hashtags. The user is considered a spam user due to repetitive content in the posted Tweets.

4.4.2 Credible Users with Low LogRank

According to the estimation of the credibility distribution in the entire population of users in section 4.3, there should be around 600 credible or very credible users in the population. The PageRank-Reset algorithm identifies 52 of these in the first iteration, which corresponds to circa 8.7% of these 600 users. The PageRank-Keep algorithm identifies at most 46, or 7.7%, of these users. This implies that over 90% of the credible or very credible users are not identified by the algorithms.
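Framed as a retrieval task, these percentages are the algorithms' recall of credible users. A quick check of the stated figures (the 600-user denominator is the thesis's own rounded estimate from the population extrapolation):

```python
# Rough recall check against the thesis's ~600 credible-user estimate
# (ratings 3 and 2.5 in the population estimate: 398 + 193 ≈ 600).
credible_estimate = 600
identified_reset = 52  # credible users found in PageRank-Reset's first iteration
identified_keep = 46   # best case observed for PageRank-Keep

recall_reset = identified_reset / credible_estimate
recall_keep = identified_keep / credible_estimate
print(f"PageRank-Reset recall: {recall_reset:.1%}")  # 8.7%
print(f"PageRank-Keep recall:  {recall_keep:.1%}")   # 7.7%
```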

Below are examples of why credible users were not identified by the PageRank algorithms.

Quoting Tweets

A user not participating in discussions around the four hashtags may become part of the dataset through quoted Tweets. One example is NASA, which became part of the dataset through Spam User 1, who quoted one of NASA's Tweets and, among five hashtags, included #svpol, #svtopinion, and #svtagenda in the quoting Tweet. The quoted Tweet is the only Tweet posted by NASA in the dataset, and it lacks relevance to any of the four hashtags around which this study is conducted. As a result, the collected dataset does not contain any retweets of the NASA Tweet, causing the PageRank algorithms to assign NASA a low LogRank.

User Mentions

There have been occurrences of Swedish and international celebrities becoming part of the dataset due to being mentioned in a Tweet containing any of the four hashtags around which this study is conducted. These users have not themselves posted any Tweets in the dataset, and hence do not have any retweeted Tweets in the dataset. This causes the PageRank algorithms to assign them a low LogRank.

Followers

A low number of followers will lead to a low number of retweets. Since the algorithms assign credibility based on the number of retweets from other users, some accounts that actually are credible will be assigned a low LogRank by the algorithm. An example of this is the Twitter accounts of some Swedish authorities that are considered to be credible, but have follower counts between 1000 and 4999.

4.4.3 Circular Endorsement

In both versions of the PageRank algorithm, the two users remaining when the algorithms terminate were the same. Both user accounts belong to a Swedish news service, and they only retweet each other. In the PageRank algorithm, this circular endorsement ensures that they maintain their credibility, without "leaking" it to other users.


Chapter 5

Discussion

For both PageRank-Reset and PageRank-Keep, the group of users ranked credible by the algorithm contains a considerably higher proportion of users with a manual credibility rating of 2.5 or 3 than a random sample of the same size would. In this regard the algorithms perform better than random. However, the results of the study omit many users who could in fact be considered credible. The reason for this could be that the only way for the algorithm to determine whether a user is credible is to check how many times the user has been retweeted by other users. If a user who exists in the network is not retweeted, the user will be perceived as non-credible. In reality, the lack of retweets does not mean that the user does not distribute credible material; it may, for example, be due to the fact that the user did not fully, or at all, participate in the discussions in which the hashtag was intended to be used. The user may have ended up in the network by mistake. One example of this is the case of NASA, whose Twitter account was quoted by another user who included a hashtag in the Tweet with the quote, as explained in section 4.4.2. The user with the original post, in this case NASA, then becomes part of the network without actively participating and is therefore not directly retweeted, leading to an unfair mistrust of the user. One way to avoid such situations could be to restrict the network solely to users who use the investigated hashtags in their own posts, thus omitting users who are quoted and who have no association with the network. However, since the users in the network are collected based on specific hashtags, users who are considered irrelevant in the context can in that way be interpreted as having low credibility.

In PageRank-Reset, endorsements from users who are not themselves retweeted, and thus not considered credible, do not generate credibility for the users they are retweeting. This prevents non-credible users from receiving undeserved credibility. Generally, users who are considered spam users receive a large number of retweets from other users who possess few retweets themselves. The algorithm identifies these users as low-credibility users, making their endorsements useless. However, in some cases a user who is retweeted by many non-credible users (for example, users with few followers) might actually distribute credible content. With this algorithm, that user will be dismissed.

In order for a user to maintain credibility through the iterations, it is required that the users retweeting them have maintained their credibility in the previous iteration. In the first iteration, retweets from any user generate a credibility score. In the second iteration, only retweets from users who received retweets in the first iteration generate credibility. In the third iteration, only retweets from users who received retweets in the second iteration generate credibility, and so on. However, it is possible to circumvent this property when a group of users enters a circular endorsement and never retweets anyone outside the group. Such a group maintains its credibility through each iteration, since its members form an infinite chain of retweets among themselves. This behaviour was observed in the collected dataset, as described in section 4.4.3. The algorithm is based on the assumption that a user cannot endorse itself, but a group whose members endorse each other keeps its mutual credibility score within the group, which by extension is the same thing as endorsing yourself.
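The circular-endorsement effect can be illustrated on a toy retweet graph. The propagation rule below is a deliberately simplified sketch (not the thesis's exact PageRank-Reset update): each iteration, an author's new score is the sum of its retweeters' current scores, split over how many authors each retweeter endorses.

```python
# Edges point retweeter -> author. A and B only retweet each other;
# C endorses A and D endorses C, but nobody retweets C or D.
retweets = [("A", "B"), ("B", "A"),
            ("C", "A"),
            ("D", "C")]

users = {u for edge in retweets for u in edge}
out_degree = {u: sum(1 for r, _ in retweets if r == u) for u in users}
score = {u: 1.0 for u in users}

for _ in range(10):
    new = {u: 0.0 for u in users}
    for retweeter, author in retweets:
        # Score flows from retweeter to author, diluted by how many
        # distinct endorsements the retweeter hands out.
        new[author] += score[retweeter] / out_degree[retweeter]
    score = new

# The mutually retweeting pair keeps a positive score indefinitely,
# while users outside the cycle drain to zero.
print(score)  # A and B stay positive; C and D end at 0.0
```

The two-user cycle behaves as a score sink: once the chain of incoming endorsements behind C and D dries up, only the self-reinforcing pair retains credibility, mirroring the Swedish news-service accounts observed in the dataset.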

In the case of PageRank-Keep, it is possible for a user to rise in rank simply by being endorsed by users without credibility, since the credibility score is maintained through every iteration. This makes it possible for spam users to rise in rank. An example of this is Spam User 2, who receives a high credibility score in the first iteration of PageRank-Keep but has only one Tweet in the network, and that Tweet is retweeted many times. This is described in section 4.4.1.

When conducting the manual credibility check, it might be hard for the person estimating the score to avoid being biased by the actual content of the account being evaluated. One's own opinions must not interfere with the manual check. In order to achieve objectivity when manually estimating credibility, strict guidelines have been used during the evaluation. However, since the check is coarse-grained, it will inevitably misjudge some of the users. For example, a user who might not be credible can be assigned a higher credibility score than they deserve, and a user who is considered credible might be assigned a lower credibility score than they should receive. A good example of the latter is the Twitter accounts of Swedish authorities, which in reality are very credible but have few followers, which lowers their credibility score according to the criteria followed when assigning the score.

Considering the time and resources available when conducting this study, it was not possible to execute a more elaborate investigation. Limitations in both resources and time are also the reason why the study had to be restricted regarding the collection of data. If more time had been at hand, Tweets could have been collected over a period longer than 15 days, giving a more accurate representation of the behaviour in the network, as this would have resulted in a higher number of registered user interactions. This would have given the algorithm more data to work on, likely rendering a different result. Due to this, it is hard to draw any definitive conclusions regarding the accuracy of the methods evaluated in this study.

Previous studies have found successful methods to determine credibility. Information Credibility on Twitter [5] uses semantic analysis, interpreting the actual content of posts in social media feeds together with metadata, achieving good results. Evaluating Event Credibility on Twitter [14] combines this method with a PageRank-inspired algorithm, achieving slightly better results. Using these kinds of methods was considered too complicated and time-consuming for this study. Since methods for interpreting semantics and sentiments in Twitter posts could not be used, it was not possible to reach the same levels of accuracy in credibility determination as the previous studies did. On the other hand, both of these studies were conducted on much larger datasets than the study presented in this paper (millions of Tweets, as opposed to thousands). This makes it hard to compare results in an accurate way.

If the implementations used in this study were to be used on social media, for example Twitter, in their current state, they could have a negative impact regarding credibility. The somewhat doubtful credibility rating could contribute to the spread of fake news, suppress real news, and lower the credibility of Twitter in general as a source of information.


Chapter 6

Conclusion

The results show that the PageRank algorithm, although it can be considered to perform better than random, is not efficient for evaluating user credibility. On the limited dataset that was used, several non-credible users received a high rank, while many credible users received a lower rank. This would likely change if a larger dataset were used. In both versions of the PageRank algorithm tested in this study, the two users receiving the highest rank when the algorithms terminate are both considered to be of very high credibility. However, all other users in the network are assigned a low credibility score, including several users who should be considered credible. The conclusion is that the PageRank algorithm is most likely not suitable for successfully determining the credibility of Twitter users.

6.1 Future Research

The credibility of information shared on the internet is a relatively unexplored topic scientifically, with much left to research. This thesis has focused on determining the credibility of Twitter users, and thereby whether the Tweets from these users are credible. Several more parameters can be taken into account when distributing credibility in a network of Tweets.

An initial suggestion for future research on this subject would be to test the algorithm on a larger dataset of users, collected over a longer period of time. Furthermore, the Twitter APIs offer a vast amount of metadata associated with both users and Tweets, and this data could be incorporated into an evaluation algorithm in order to achieve better results.


Bibliography

[1] Sveriges Television AB. Opinion live, 2017. URL https://www.svtplay.se/opinion-live. Last visited 2017-04-11.

[2] Sveriges Television AB. SVT Agenda, 2017. URL http://www.svt.se/agenda. Last visited 2017-04-11.

[3] Sveriges Television AB. SVT Opinion, 2017. URL https://www.svt.se/opinion. Last visited 2017-04-11.

[4] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Seventh International World-Wide Web Conference (WWW 1998), 1998. URL http://ilpubs.stanford.edu:8090/361/1/1998-8.pdf.

[5] C. Castillo, M. Mendoza, and B. Poblete. Information credibility on Twitter. WWW, pages 675–684, 2012.

[6] P. Gupta et al. WTF: The who-to-follow system at Twitter. WWW, 2013.

[7] Python Software Foundation. What is Python? Executive summary, 2017. URL https://www.python.org/doc/essays/blurb. Last visited 2017-04-20.

[8] Neo Technology Inc. Neo4j, 2017. URL http://www.neo4j.com. Last visited 2017-04-11.

[9] Twitter Inc. The Search API, 2017. URL https://dev.twitter.com/rest/public/search. Last visited 2017-04-24.

[10] Twitter Inc. Using hashtags on Twitter, March 2017. URL https://support.twitter.com/articles/49309. Last visited 2017-03-27.

[11] Twitter Inc. Users – Twitter Developers, 2017. URL https://dev.twitter.com/overview/api/users. Last visited 2017-03-27.

[12] Twitter Inc. Twitter usage, February 2017. URL https://twitter.com. Last visited 2017-03-27.

[13] B. Jiang. Ranking spaces for predicting human movement in an urban environment. International Journal of Geographical Information Science, 23(7):823–837, 2006. doi: 10.1080/13658810802022822.

[14] M. Gupta, P. Zhao, and J. Han. Evaluating event credibility on Twitter. In Proceedings of the 2012 SIAM International Conference on Data Mining (SDM), pages 153–164, 2012.

[15] Internet Live Stats. Twitter usage statistics, 2017. URL http://www.internetlivestats.com/twitter-statistics. Last visited 2017-05-09.

[16] Svpol. Twitter hashtag intended for discussions of Swedish politics, 2017.

[17] Tweepy. Tweepy, 2017. URL http://www.tweepy.org. Last visited 2017-04-11.


Appendix A

Tables

Table A.1: The distribution of manual credibility scores after each iteration, among users who have scored a LogRank of three or higher from the PageRank-Keep algorithm. Rows show manual credibility score and columns show iterations.

Rating   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
3       15  17  20  25  26  22  22  19  12  10  10   8   4   2   2
2.5     12  17  19  19  20  20  20  18  16  12  13  10   6   3   1
2        7   8  10  12  12  10  10   9   6   4   4   4   0   0   0
1.5      6   6  11  17  17  14  14  11  10   5   7   6   4   2   0
1        4   6   6   7   7   6   6   5   4   4   4   4   2   1   0
Sum     44  54  66  80  82  72  72  62  48  35  38  32  16   8   3

Table A.2: The estimated number of users in the entire population for each manual credibility score. The column s.e. contains the standard errors, and the columns CI Lower and CI Upper contain the lower and upper limits of the confidence intervals at a 95% confidence level.

Rating  Count  s.e.  CI Lower  CI Upper
3        398    66    269       528
2.5      193    47    101       285
2        664    82    502       825
1.5     1472   108   1260      1683
1       1544   109   1331      1758
Sum     4271

