
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer Science

Bachelor thesis, 16 ECTS | Datateknik

2017 | LIU-IDA/LITH-EX-G--17/001--SE

Counting the clicks on Twitter

–

A study in understanding click behavior on Twitter

Filip Polbratt, Olav Nilsson

Supervisor : Niklas Carlsson Examiner : Nahid Shahmehri



Copyright

The publishers will keep this document online on the Internet – or its possible replacement – for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Filip Polbratt, Olav Nilsson


Students in the 5 year Information Technology program complete a semester-long software development project during their sixth semester (third year). The project is completed in mid-sized groups, and the students implement a mobile application intended to be used in a multi-actor setting, currently a search and rescue scenario. In parallel they study several topics relevant to the technical and ethical considerations in the project. The project culminates by demonstrating a working product and a written report documenting the results of the practical development process including requirements elicitation. During the final stage of the semester, students create small groups and specialise in one topic, resulting in a bachelor thesis. The current report represents the results obtained during this specialisation work. Hence, the thesis should be viewed as part of a larger body of work required to pass the semester, including the conditions and requirements for a bachelor thesis.


Abstract

Social media has a large impact on our society. News articles are often accessed and shared through different social media sites; in fact, today the most common way to enter a website is from social media. However, due to technical restrictions in what information these sites make public, it is often not possible to access click information from social media. This complicates, for example, the analysis of the popularity dynamics of news articles. In this thesis, we work around that problem: by using the API of a URL-shortener service, we can extract click counts for shared links. We only look at content shared on Twitter, because Twitter has the friendliest view on sharing data for research purposes. To test this methodology we conduct a small pre-study in which we look at how biased news articles are shared on Twitter compared to more objective content. There are three parts to investigating the biased content. The first part is to extract Bitly links from Twitter. The second part is to examine each link and decide whether it leads to a news article. Finally, we determine whether the news article is biased. For this third step, we use two different approaches. First, we build a computational linguistics tool, a Naive Bayes classifier, from already classified training data. Second, we classify articles as biased or not by domain, where an article is considered biased if the domain it resides on has a high content of biased articles. Our analysis of a sample data set collected over a week shows that biased content is clicked over a longer period of time than non-biased content.


Acknowledgments

We would like to thank our supervisor Niklas Carlsson and our two reviewers Carl Nykvist and Linus Sjöström. Finally, we want to thank Anna Vapen, Anton Forsberg, and Karl Andersson, whose code we used as a basis when building our program.


Contents

Abstract
Acknowledgments
Contents
List of Figures
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research questions
2 Theory
2.1 Related work
2.2 Classification
2.3 Statistical tools for analysis
3 Method
4 Results
5 Discussion
5.1 Results
5.2 Method
5.3 The work in a wider context
6 Conclusion
Bibliography


List of Figures

2.1 Distribution functions for a fair six-sided dice
2.2 A fictitious CCDF in log-log scale for the cups of coffee consumed by a student
2.3 Plotting the regression line and its confidence interval
3.1 A schematic figure of how the data mining is done
3.2 A closer look at how the data collection works
4.1 CCDF distributions after five days in lin-lin scale
4.2 CCDF distributions after five days in log-log scale
4.3 CCDF distributions after two hours in log-log scale
4.4 CCDF distributions after one day in log-log scale
4.5 CCDF distributions after five days in log-log scale


1

Introduction

In the last decade, social media has grown explosively and today plays an important role in billions of people's daily lives. Traffic from social media is now the most common way to access websites [20].

Twitter is one of the largest social media platforms, with over 300 million users. The posts are called tweets and are limited to 140 characters; it is therefore important that the characters are used sparingly. To make the most out of the available space, an important tool is a URL-shortener. Twitter has an integrated URL-shortener called t.co, but some users prefer an external URL-shortener instead; a benefit of this is that it is possible to keep track of how many times each such link has been clicked. One commonly used external shortener is Bitly, and it is Bitly links that this thesis investigates. Most importantly, the Bitly API provides valuable information about what sort of content is receiving the most attention.

In social media, a growing problem is very biased content and fake news. According to Business Insider, during the 2016 American election the 20 most shared fake news stories were shared more on Facebook than the 20 most shared legitimate news stories [15]. For example, the most shared article during the election was "Pope Francis Shocks World, Endorses Donald Trump for President, Releases Statement", an article from Ending the Fed; that article was fake [15]. A study by Ipsos on behalf of BuzzFeed News showed that 75% of Americans who had seen a fake news headline also believed it was true [18].

It is not just Americans who struggle with what news to trust. Classifying whether content is biased is a complicated problem. In this thesis, we try two different methods. First, we use a simple machine learning tool called a Naive Bayes classifier. With Naive Bayes, we use training data consisting of texts that have been manually classified as highly biased news or legitimate news. Once we have a correct set of training data, we use it to train our model. The model calculates the probability that words appear in legitimate news articles and in biased news articles; more about this in the theory chapter. The idea behind the method is that biased news uses different language than non-biased content. The second method is to classify domains as biased news sites or legitimate news sites, depending on the occurrence of articles that have been manually classified as biased. A potential third approach would be to classify every article manually, but determining the level of truth of every article is a time-consuming task, so we have not proceeded with that method.


In this thesis, we are using Twitter's and Bitly's APIs. From Twitter's API, we collect tweets, as they are posted, that contain certain keywords; in our case, we are looking for Bitly links. With Bitly's API we get information about how many times each link has been clicked. We are interested in seeing how many times a link has been clicked at fixed intervals over a span of time, to see differences in how material from different categories spreads.

1.1

Motivation

What is happening on social media is affecting society more and more. That is why it is becoming increasingly important to know what content is getting the most attention on social media. In this thesis, we present a methodology for getting this information. Detecting fake or highly biased content has received more attention since Trump won the American election, but the amount of research in this field is still very small.

1.2

Aim

The primary goal of this thesis is to create a crawling/parsing tool for finding out what content is read on Twitter. After developing the tool, we test it and investigate the differences in how biased and non-biased content is read.

1.3

Research questions

1. Is it possible to create a working methodology to know what sort of content is read on Twitter?

2. Are there any differences in how biased and objective news are clicked on Twitter?


2

Theory

2.1

Related work

Clicks on social media: One of the few studies regarding what is clicked on social media was published in 2016 by Gabielkov et al. [7]. In their study, they found that the distribution of when a news article gets shared on Twitter is significantly different from the distribution of when people read the article: the article is read over a longer period of time than it is shared.

Biased content: There is a wide span between what is biased and what is not, from hoaxes on satire pages to real articles on legitimate news sites to disinformation, with a wide variety of different types of texts in between.

How fake news and biased content are shared has not been studied thoroughly; we know little about how fake news is shared and which people share the stories, and there is much potential for studies in this area. Allcott et al. [1] looked at how many people remembered fake news headlines during the American election of 2016. They did this using a survey in which they asked respondents whether they had seen the headlines or not, and compared the results with made-up placebo headlines. In the study, there was only one percentage point of difference between respondents who thought they had seen a published fake news headline and those who thought they had seen a made-up one. They also found that half of the readers thought the article was real. They do not directly draw any conclusion about whether fake news affected the election or not.

In a 2016 study by Garrett et al. [8] about misperceptions due to ideologically biased online news media in the 2012 American presidential election, the authors chose to classify entire news outlets by their ideological bias. They estimated the bias from the language used to describe an issue, from the bias of the editorial content, and from the political beliefs of an outlet's readership. The method used by Garrett et al. for estimating the bias based on language was the text-analytic technique Contrast Analysis of Semantic Similarity (CASS) by Holtzman et al. [9]. This method does not determine the objectivity of news reporting in the traditional journalistic sense. Instead, it gives a score on how negative and positive terms are associated with the possible biases, where a score of zero is said to be objective. This means that what traditionally would be considered objective reporting of, for example, a scandal might be considered biased by CASS, since many negative terms would be associated with one possible bias.

As we stated earlier, not much research has been done on how to inhibit the sharing of very biased and fake news on social media. Some noticeable work to look at is how the companies themselves work to inhibit the sharing of fake news, although much of this work is not publicly available. Some of the work that has been released is from Facebook: they have said that they are using third-party companies which can flag articles as fake news, and when a user then tries to share such an article, they will receive a notice that the article is disputed [5].

Popularity dynamics: In a study from 2009, Leskovec et al. [11] developed a framework for tracking the popularity of short, distinct phrases and small variations of them, and showed how this can be used for representing the news cycle. In their study, they concluded that the volume y of a phrase (its mentions at the moment) can be described by a decreasing exponential function y = e^{-bx} in both directions away from the peak intensity. However, for an 8-hour window around the peak intensity, the volume behaved more in a logarithmic fashion, y = a|log(|x|)|, where x = 0 is the location of the peak. They also note that the decrease after the peak is much faster than the build-up before the peak. Plots of the volume x against the number of phrases with a higher volume y, and similar plots, were noted to follow a power-law distribution y = ax^b.

For a longer perspective than the news cycle, which is only a few days long, there is a 2011 study of the popularity dynamics of user-generated videos on YouTube by Borghol et al. [4]. They find that most videos, sampled randomly from recently uploaded videos, reach their popularity peak within the first six weeks of their lifetime, but for some videos this peak comes much later. However, the viewing rate at peak popularity is approximately independent of when peak popularity is reached. They also find that the lifetime of a video can be split into three phases: before, at, and after peak popularity. Within these three phases, the views gained per week can be approximated by lognormal distributions. They also show that the current popularity of, particularly, young videos is not a good predictor of future popularity.

In a later study, Borghol et al. [3] explored the popularity dynamics of YouTube videos with very similar content. When the content can be considered near-identical copies, the most important factor for gaining view count is the already achieved view count, in a rich-get-richer behaviour, and early uploaders have a first-mover advantage. However, in the very early life of a video, its view count gain is better explained by the uploader's view count. They also find that inaccurate conclusions can be reached when not controlling for content.

2.2

Classification

Labeling documents according to what class or classes they belong in can be done in several ways. Starting from a set of labelled training data, a system can learn what pattern is characteristic of the data sharing a label. In machine learning, this is known as classification. Support Vector Machines, Naive Bayes classifiers, and k-nearest neighbours are examples of methods for labelling an unknown document according to patterns observed in documents with known labels [12]. Rennie et al. [16] achieved a score of 93.4% on the Industry Sector data set and 86.7% on 20 Newsgroups with a tuned Multinomial Naive Bayes (MNB) classifier; Industry Sector and 20 Newsgroups are two commonly used data sets for testing the accuracy of a classifier.

Naive Bayes

Naive Bayes assumes, naively, that every pair of features used in the classification process is independent of each other. This assumption is rarely true in the real world, but the method works quite well despite that, and it has an optimal time complexity. Because of this, Naive Bayes has become popular as a simple machine learning tool for classifying texts into different categories. Naive Bayes classifies a text from a vector of features x = (x_1, ..., x_n) describing the text being classified, where P(C_k | x_1, ..., x_n) is the probability of this feature vector occurring in a class C_k of texts. Bayes' law states that the posterior probability equals the prior probability times the likelihood divided by the evidence:

P(A|B) = \frac{P(A)\, P(B|A)}{P(B)}, \quad (2.1)

where P(C_k) is the probability that a text in the training data belongs to class k out of all the classes. Bayes' law can then be written with our terminology as:

P(C_k|x) = \frac{P(C_k)\, P(x|C_k)}{P(x)}. \quad (2.2)

Since the numerator can be rewritten as P(C_k, x_1, ..., x_n) and independence between the features is assumed, it is possible to write:

P(C_k|x) = \frac{P(C_k) \prod_{i=1}^{n} P(x_i|C_k)}{P(x)}. \quad (2.3)

The denominator P(x) is a constant across all the possible classes, since it only depends on the text being classified, not on the class. This means that the results for each class for one text are scaled by the same number; thus, we can ignore it:

P(C_k|x) \propto P(C_k) \prod_{i=1}^{n} P(x_i|C_k). \quad (2.4)

Texts classified by a Naive Bayes classifier are labelled with the label y of the class that received the highest probability for the feature vector of that text, since this is the most likely class:

y = \underset{k \in 1...K}{\arg\max}\; P(C_k|x) = \underset{k \in 1...K}{\arg\max}\; P(C_k) \prod_{i=1}^{n} P(x_i|C_k). \quad (2.5)

For classifying texts, we use MNB, which is a version of Naive Bayes classification. MNB is suitable for text classification and differs from other common versions of Naive Bayes classifiers by using a feature vector x where each feature has a count f_i, the number of occurrences of that feature in a text [12]. Typically f_i is a non-negative integer, but this is not necessarily the case, since the feature frequency can be normalised. Equation 2.6 shows the likelihood of a feature vector x belonging to a class k in an MNB classifier. The probabilities of the features of a class sum to one, i.e. \sum_i P(x_i|C_k) = 1 [16].

P(x|C_k) = \frac{(\sum_i f_i)!}{\prod_i f_i!} \prod_{i=1}^{n} P(x_i|C_k)^{f_i} \quad (2.6)

What features are selected to be in the vector x depends on how the classifier has been trained. Possibilities for directing the classifier's training include, but are not limited to: excluding the most common words in a language (stop words), setting a predetermined maximum number of features, smoothing the feature vector in some way, choosing what kind of features should be used, and setting the prior probability of each class to be uniform.
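To make this concrete, the following is a minimal sketch of MNB text classification with scikit-learn, the library used in this thesis. The two training texts and their labels are made up for illustration; the parameter values mirror the settings reported later in the thesis (unigrams up to 4-grams, 800 features, α = 0.25), but the real pipeline and training data differ.

    # A minimal MNB sketch with scikit-learn; training data is hypothetical.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train_texts = ["pope endorses candidate", "council approves budget"]  # made up
    train_labels = ["biased", "objective"]                                # made up

    # Feature selection: unigrams up to 4-grams, at most 800 features, lowercased.
    vectorizer = CountVectorizer(ngram_range=(1, 4), max_features=800, lowercase=True)
    X = vectorizer.fit_transform(train_texts)

    clf = MultinomialNB(alpha=0.25)  # alpha is the smoothing term from equation 2.7
    clf.fit(X, train_labels)

    # Label a new text with the most probable class (equation 2.5).
    print(clf.predict(vectorizer.transform(["senate passes budget bill"])))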

Smoothing

Since Naive Bayes classification works by multiplying the probabilities of each feature in a text, the probability for a class can become zero, and the text will thus not be classified as that class even if it should be. This happens if a combination of feature and class did not occur in the training data: that feature will then have a probability of zero for that class, P(x_i|C_k) = 0. Zero counts and low counts of features can be ameliorated by using a smoothed estimation of the probability of a feature and class combination, instead of dividing the feature's occurrences in a class, N_i, by the sum of the occurrences of all the features in the class, N:

P(x_i|C_k) = \frac{N_i + \alpha}{N + \alpha n}. \quad (2.7)

This is the general case of the smoothing that can be used for estimating the probability P(x_i|C_k), and can be found in textbooks such as [12] and in articles [16]. The term n is the number of features used for classifying the text. If α = 1, this is called Laplace smoothing or add-one smoothing.
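As a small numeric illustration of equation 2.7 with made-up counts, Laplace smoothing keeps a feature that never occurred in a class from zeroing out the whole product in equation 2.4:

    # Illustration of equation 2.7; all counts here are hypothetical.
    alpha = 1.0   # Laplace (add-one) smoothing
    n = 800       # number of features used by the classifier
    N = 10000     # total feature occurrences in the class
    N_i = 0       # occurrences of feature i in the class

    p = (N_i + alpha) / (N + alpha * n)
    print(p)  # 1/10800 rather than 0, so the product in equation 2.4 survives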

Feature selection

Rennie et al. [16] emphasise that since Naive Bayes classification assigns classes according to which class the observed features are most significant for, it is important to select and weight the features used by the classifier in a way that models text well. There are many methods for modelling a text, some having several variations. A selection of the variations that have been considered in this project is presented below.

Term frequency: The best representation of how common a feature f_i is in a text is not always the raw frequency of the feature. The probability of a feature appearing many times in the same text can be severely underestimated by the MNB model, thus overestimating the importance of those occurrences [16]. How common a feature is in a text, its term frequency (tf), can therefore be represented by a different measure, e.g.

f_i' = tf = \log(f_i + 1). \quad (2.8)

Term frequency–inverse document frequency: According to Manning et al. [12], some features are less informative of what class a text belongs to than other features, due to how common the feature is across all texts. For example, the word "Nobel" would be common in texts regarding a Nobel Prize. However, despite a high tf, "Nobel" would have little discriminating power in determining which category of Nobel Prize a text belongs to, due to how common it is in all categories. Since common features are common according to their nature, they are likely to appear in a text. To counteract this, the occurrences of features can be weighted by how common or rare the feature is over all the texts in the training data, the feature's inverse document frequency (idf). In this thesis, we have used two examples of idfs:

f_i' = idf = \log\left(\frac{n}{n_f}\right), \quad (2.9)

f_i' = idf = \log\left(\frac{1+n}{1+n_f}\right) + 1. \quad (2.10)

Here, n is the number of texts in the corpus and n_f is the number of texts where the feature appears. The definition of idf in equation 2.9 is suggested in textbooks (e.g. [12]), and equation 2.10 is the function used by the Python module scikit-learn in version 0.18.1 of the API.

Taking the product of the term frequency and the inverse document frequency gives the term frequency–inverse document frequency (tf-idf) [12] of a feature:

f_i' = tf \cdot idf. \quad (2.11)


For example, using equations 2.8 and 2.10, we get the tf-idf function:

f_i' = \log(f_i + 1) \cdot \left(\log\left(\frac{1+n}{1+n_f}\right) + 1\right). \quad (2.12)

This transform can, but does not always, provide a better sense of how important a feature is for a text. Other possible transforms to the term frequency include, for example, transforming based on document length [16].
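The transform in equation 2.12 is straightforward to implement directly from equations 2.8 and 2.10; the corpus counts in this sketch are made up:

    # Sketch of equation 2.12 built from equations 2.8 and 2.10.
    import math

    def tf_idf(f_i, n, n_f):
        tf = math.log(f_i + 1)                        # equation 2.8
        idf = math.log((1.0 + n) / (1.0 + n_f)) + 1   # equation 2.10
        return tf * idf                               # equation 2.12

    # A feature seen 5 times in a text, appearing in 10 of 1000 texts,
    # gets a larger weight than one appearing in 900 of 1000 texts.
    print(tf_idf(5, 1000, 10))   # rare across the corpus -> large weight
    print(tf_idf(5, 1000, 900))  # common across the corpus -> small weight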

N-grams and bag-of-words: The features used to classify texts are the words the texts consist of (unigrams, bag-of-words) or sets of words in a specific order (n-grams) [12]. In a bag-of-words representation, the order of the words does not matter, only the number of occurrences. For example, "Sweden closes border to Denmark" would be considered identical to "Denmark closes border to Sweden" in this representation, as shown in the sketch below. A model allowing for larger n-grams would be able to tell the difference; however, in exchange, such a model requires more texts in the training data than a unigram model. Additionally, it is also possible to use the letters of the words (as character n-grams) to identify features, since otherwise "presumptuous" and "presumptions" would be considered two completely different features and could indicate different classes; this is also a good guard against spelling errors.
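The border example can be reproduced with scikit-learn's CountVectorizer; this sketch only illustrates the two representations, not the thesis's actual feature setup:

    # Unigrams cannot separate the two example sentences; bigrams can.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Sweden closes border to Denmark", "Denmark closes border to Sweden"]

    unigrams = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)
    bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(docs)

    # Identical rows for unigrams (same bag of words), different for bigrams.
    print((unigrams[0] != unigrams[1]).nnz)  # 0: indistinguishable
    print((bigrams[0] != bigrams[1]).nnz)    # > 0: distinguishable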

2.3

Statistical tools for analysis

To analyse the data we have collected with our program, we used a type of distribution function, the cumulative distribution function (CDF), and some derivatives of it to represent our results.

Cumulative distribution function

Integrating the ubiquitous probability density function (PDF) over the area from minus infinity to x, where PDF_X(x) is the probability that a random variable X takes the value x, gives the CDF of that PDF. Thus CDF_X(x) is the probability that X takes a value less than or equal to x, i.e. P(X ≀ x). In the case of a continuous PDF it can be written as:

CDF_X(x) = \int_{-\infty}^{x} PDF_X(t)\, dt. \quad (2.13)

In the case of a discrete PDF, a probability mass function (PMF), as in our case, the CDF can be expressed as:

CDF_X(x) = \sum_{x_i \le x} PMF_X(x_i). \quad (2.14)

The CDF can be used to answer questions regarding the probability that something less than or equal to x will occur. We use the simple example of a die roll, illustrated in Figure 2.1, to show the relationship between the PMF and the CDF. Figure 2.1 (a) shows the PMF of a roll of a fair six-sided die, and Figure 2.1 (b) shows its companion CDF. As can be seen in the PMF, the probability for each possible outcome is one sixth, i.e. PMF_X(x_i) = 1/6 for all x_i ∈ {1, 2, 3, 4, 5, 6}. For example, the probability of rolling less than or equal to 3 with a fair six-sided die is 1/2, as can be seen in Figure 2.1 (b).

Complementary cumulative distribution function

The CDF has a complementary function, the CCDF, such that CDF_X(x) + CCDF_X(x) = 1, which gives CCDF_X(x) = 1 - CDF_X(x). The CCDF is also known as the survival function or the reliability function, as it represents the probability that a variate X takes a value larger than x, i.e. P(X > x). In Figure 2.1 (c) the CCDF value at 4 is 1/3 (CCDF_X(4) = 1/3), because the probability of rolling a 5 or a 6 (the possible values higher than 4) is 1 out of 3 rolls.

(a) The PDF (b) The CDF (c) The CCDF
Figure 2.1: Distribution functions for a fair six-sided dice
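The values behind Figure 2.1 can be recomputed with a few lines of Python, using equation 2.14 and the relation CCDF_X(x) = 1 - CDF_X(x); this is purely illustrative:

    # PMF, CDF and CCDF for a fair six-sided die.
    pmf = {x: 1.0 / 6 for x in range(1, 7)}

    def cdf(x):
        return sum(p for xi, p in pmf.items() if xi <= x)  # equation 2.14

    def ccdf(x):
        return 1.0 - cdf(x)

    print(cdf(3))   # 0.5, as read from Figure 2.1 (b)
    print(ccdf(4))  # 1/3, as read from Figure 2.1 (c)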

In our analysis, we can use these functions to evaluate whether there is any difference in the probability of a biased news article or a normal article being clicked more than a number of times x.

This distribution is especially interesting to look at after different intervals of time, for example after two hours, one day, and one week, to see if there is any difference in how the two types of articles spread after we start measuring them.

Complementary cumulative distribution function in logarithmic scale

Using different scales than linear when plotting the respective axes can reveal information about trends in the data that is not readily apparent in the linear-linear scale. For example, when plotting with a linear y-axis and a logarithmic x-axis, a logarithmic function such as y = a log x + b will look like a straight line. If the logarithmic scale is on the y-axis and the x-axis is linear, exponential functions, y = e^{ax+b}, will appear as straight lines. In a log-log scale, functions of the form y = ax^b will look straight. The relationship y = ax^b is also known as the power law; an example of it is the relationship between the radius r of a circle and its area πrÂČ.

In the 1-hour plot of Figure 2.2, it is possible to see that the probability of a student having more than a number x of cups of coffee in one hour appears to behave in a power-law fashion (y = ax^b).

Pearson correlation coefficient

If a change of scale reveals an apparently linear representation of the data, it may be worthwhile to test how linear this transform of the data is. The linear correlation between two data sets, X and Y, can be measured by Pearson's correlation coefficient, also known as r.


Figure 2.2: A fictitious CCDF in log-log scale for the cups of coffee consumed by a student

Pearson's correlation coefficient of two data sets can be calculated from:

r = \frac{c_{xy}}{s_x s_y}, \quad (2.15)

where c_{xy} is the covariance between the x and y values in the data sets and is calculated through:

c_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \quad (2.16)

Here \bar{x} and \bar{y} are the arithmetic means of their respective data sets. In equation 2.15, s_x and s_y are the standard deviations of x and y, respectively. The standard deviation of x (and similarly for y) is calculated by:

s_x = \sqrt{\frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2}. \quad (2.17)

The value of Pearson's correlation coefficient is a real number such that -1 ≀ r ≀ 1. If r = -1 there is a total negative linear correlation between the data sets, and if plotted the data makes a perfectly straight line with a downward slope. Total positive linear correlation, a perfectly straight line with a positive slope, gives a value of 1. If there is no linear correlation, the value of Pearson's correlation coefficient is 0. However, even if r = 0, that does not mean there is no relationship between the two data sets; only that there is not a linear one [2].

The data of the 1-hour plot in Figure 2.2, suspected to have a power-law relationship, has an r value of roughly -0.97 when both data sets have been transformed by a logarithm. This means that a power-law relationship between the x-data and the y-data is likely.
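The check described above, log-transforming both data sets and computing r, can be sketched as follows; the data points are made up, and NumPy's corrcoef is used in place of evaluating equations 2.15 to 2.17 by hand:

    # Pearson's r on log-transformed data as a power-law indicator.
    import numpy as np

    x = np.array([1, 2, 4, 8, 16], dtype=float)   # hypothetical cups of coffee
    y = np.array([0.9, 0.35, 0.14, 0.05, 0.02])   # hypothetical CCDF values

    r = np.corrcoef(np.log(x), np.log(y))[0, 1]
    print(r)  # close to -1 suggests a power-law relationship y = a * x**b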

Least squares regression

It is possible to fit a line to the data of the 1-hour plot. This can be done using least squares regression. In the linear case this would be for a function of the form y = a + bx, and the coefficients can be determined through:

b = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad a = \bar{y} - b\bar{x}, \quad (2.18)

where S_{xy} and S_{xx} are given by:

S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \quad \text{and} \quad S_{xx} = \sum_{j=1}^{n} (x_j - \bar{x})^2. \quad (2.19)

To handle the case with the suspected power law from Figure 2.2, so that y = Ax^B, it is possible to calculate A and B from the logarithms of the x and y data [19]:

b = \frac{n \sum_{i=1}^{n} (\ln x_i \ln y_i) - \sum_{i=1}^{n} (\ln x_i) \sum_{i=1}^{n} (\ln y_i)}{n \sum_{i=1}^{n} (\ln x_i)^2 - \left(\sum_{i=1}^{n} (\ln x_i)\right)^2}, \quad (2.20)

a = \frac{\sum_{i=1}^{n} (\ln y_i) - b \sum_{i=1}^{n} (\ln x_i)}{n}, \quad (2.21)

where A is equivalent to e^a and B is equivalent to b.
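Equations 2.20 and 2.21 translate directly into code. The following sketch fits y = Ax^B to synthetic data that follows an exact power law, so the fit should recover the parameters:

    # Direct implementation of equations 2.20 and 2.21 (data is made up).
    import numpy as np

    def fit_power_law(x, y):
        lx, ly = np.log(x), np.log(y)
        n = len(x)
        b = (n * np.sum(lx * ly) - np.sum(lx) * np.sum(ly)) / \
            (n * np.sum(lx ** 2) - np.sum(lx) ** 2)   # equation 2.20
        a = (np.sum(ly) - b * np.sum(lx)) / n          # equation 2.21
        return np.exp(a), b                            # A = e**a, B = b

    x = np.array([1, 2, 4, 8, 16], dtype=float)
    y = 0.5 * x ** -1.5                                # exact power law
    print(fit_power_law(x, y))                         # approximately (0.5, -1.5)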

To evaluate the results from the linear regression, it is possible to use a confidence band to tell, with a selectable level of confidence 1 - p, the area in which the theoretical true regression line will be. This is typically done with a confidence level of 95%, 99% or 99.9%, by adding and subtracting a margin of error corresponding to the selected level of confidence. The confidence interval around a point on the regression line is:

I_{\mu_0} = \left(\hat{y} - t_{p/2}(n-2)\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}},\;\; \hat{y} + t_{p/2}(n-2)\, s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}\right), \quad (2.22)

where x_0 is the x for which this confidence interval is calculated. The s in equation 2.22 is the estimate of the standard deviation:

s = \sqrt{\frac{S_{yy} - S_{xy}^2 / S_{xx}}{n - 2}}. \quad (2.23)

The S_{yy} in equation 2.23 is the sum of the squared distances between the mean of y, \bar{y}, and the y of each point: S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2. The t_{p/2}(n-2) in equation 2.22 is the critical value for the confidence level 1 - p with n samples [2].

It is also possible to test whether there is statistical significance to the slope b of the regression line, by creating a confidence interval I_b for b and testing whether the slope of the null hypothesis is inside this interval:

I_b = \left(b - t_{p/2}(n-2) \frac{s}{\sqrt{S_{xx}}},\;\; b + t_{p/2}(n-2) \frac{s}{\sqrt{S_{xx}}}\right). \quad (2.24)

Typically, the null hypothesis being tested for the slope is that there is no relationship between x and y, i.e. that b = 0 is a possibility. The alternative is that b ≠ 0, and the null hypothesis is then rejected in favour of the hypothesis that there is a relationship. The hypothesis being tested can also be whether two lines regressed from different data sets are the same or not [2].

The plot of the coffee-drinking students from Figure 2.2 has now turned into the plot in Figure 2.3, where the 1-hour plot is shown together with its regression line and the confidence interval for that regression. The lower limit of the interval is below zero except for the first point, and because of this it does not show on the graph. The regression line in this plot is y = 0.457x^{-2.94}, and the 95% confidence interval of the slope is I_b = (-3.07, -2.82). Since 0 is not between -3.07 and -2.82, we can reject the null hypothesis that there is no relation between x and y.
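A sketch of the slope test in equation 2.24, using equations 2.18, 2.19 and 2.23 for the intermediate quantities and SciPy for the critical value; the variable names are ours, not from the thesis code, and the inputs are assumed to be log-transformed as in the power-law regression above:

    # Confidence interval for the regression slope (equation 2.24).
    import numpy as np
    from scipy import stats

    def slope_interval(x, y, p=0.05):
        n = len(x)
        xm, ym = x.mean(), y.mean()
        Sxx = np.sum((x - xm) ** 2)                     # equation 2.19
        Sxy = np.sum((x - xm) * (y - ym))               # equation 2.19
        Syy = np.sum((y - ym) ** 2)
        b = Sxy / Sxx                                   # equation 2.18
        s = np.sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))   # equation 2.23
        t = stats.t.ppf(1 - p / 2, n - 2)               # critical value
        return b - t * s / np.sqrt(Sxx), b + t * s / np.sqrt(Sxx)

    # If 0 lies outside the returned interval, the no-relationship
    # null hypothesis is rejected at the chosen confidence level.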


Figure 2.3: Plotting the regression line and its confidence interval

Goodness-of-fit: How well the regression line explains the values in Y is given by the coefficient of determination rÂČ [14]. For a simple linear regression, this coefficient is the square of Pearson's correlation coefficient from equation 2.15. The rÂČ value gives the percentage of the variance in y that is explained by the model used to describe y. It should be noted that rÂČ is only an indication: if the fitted line is systematically over- and under-predicting the data, the rÂČ value may still be high even though the line is a bad fit [6].


3

Method

In this project we use Python 2.7 throughout. This version of Python is used since one of the libraries used in this thesis is not available for Python 3. The first part is to start listening to the Twitter stream; due to restrictions in Twitter's API, we receive at most 1% of the tweets. The stream gives us a live feed of tweets as they are posted. To access Twitter's API, we use a Python library called Tweepy. What we are looking for are tweets that contain a Bitly link, Bitly being a commonly used URL-shortener. The tweets collected over 10 minutes are saved in JSON format in a .txt file. A minimal sketch of this collection step is shown below.
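This sketch assumes the Tweepy 3.x API that was current when the thesis was written; the credentials are placeholders, and the real program's 10-minute batching and error handling are omitted:

    # Collecting tweets that mention Bitly links via the streaming API.
    import json
    import tweepy

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")

    class BitlyListener(tweepy.StreamListener):
        def on_status(self, status):
            # Append the raw tweet to the current batch file.
            with open("tweets.txt", "a") as f:
                f.write(json.dumps(status._json) + "\n")

    # Track tweets containing "bit.ly"; the stream is capped at ~1% of all tweets.
    tweepy.Stream(auth=auth, listener=BitlyListener()).filter(track=["bit.ly"])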

The next part is to expand the shortened URLs to full-length URLs. This is done with Bitly's API. We are limited in how many requests we can send to Bitly at a time, which is one bottleneck for how much data we can analyse. To mitigate this, we only expand unique links; there is no need to expand the same link twice. This should still give a good representation of all the links collected and of the entire tweet stream.
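A sketch of the expansion step, assuming the Bitly v3 REST endpoint available at the time; the access token is a placeholder and error handling is omitted:

    # Expanding a batch of unique Bitly links to their long URLs.
    import requests

    def expand(short_urls, token):
        resp = requests.get(
            "https://api-ssl.bitly.com/v3/expand",
            params={"access_token": token, "shortUrl": short_urls},  # up to 15 per batch
        )
        data = resp.json()["data"]["expand"]
        return {e.get("short_url"): e.get("long_url") for e in data}

    # Only previously unseen links are expanded, to save API requests.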

After we have extracted the URLs, we want to determine whether they lead to news pages or not. To do this, we compare the expanded URL from the tweet to the URLs of several news sites that we have collected in a list. This list contains a mixture of sites that are known to have published fake news and sites that are considered objective. The sites that have published false/fake news are a selection of sites identified as having done so in a study by BuzzFeed [17].

The web page at the news site is then classified as either real or fake news. This is done with the help of machine learning: we use an MNB model for the text classification. The training data for our classifier consists of 40 texts that had been classified as fake news and 40 texts that had been classified as ordinary articles. With the help of this training data, the program learns which sorts of words and word combinations are used in fake news and which are used in ordinary articles. The classifier was then evaluated against 40 articles, 20 each of fake and real, from the same set of articles as the training articles. All the articles used for training and evaluation were about the American election of 2016 and had been classified manually [17].

Once we had collected a data set with links to news articles, we started to record how many times the links to these articles were clicked. This information is collected through Bitly's API. Data on how many times the links were clicked was gathered at two-hour intervals and saved to a new .txt file. The last thing we do before starting to analyse the data is to set a threshold: at least 50% of the clicks must have been collected during the test period. This is done to make sure we are only looking at recently published news articles; a sketch of this filter follows the figures below.


Figure 3.1: A schematic figure of how the data mining is done

Figure 3.2: A closer look how the data collection works

Figure 3.1 and Figure 3.2 respectively provide an overview of the entire program and a closer look at the data collection part.
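The 50% threshold mentioned above can be expressed as a simple filter; the click counts in the example are made up:

    # Keep a link only if at least half of its final clicks arrived
    # while we were tracking it (i.e. the article is recently published).
    def passes_threshold(clicks_at_start, clicks_at_end):
        gained = clicks_at_end - clicks_at_start
        return clicks_at_end > 0 and gained / float(clicks_at_end) >= 0.5

    print(passes_threshold(100, 500))  # True: most clicks happened during tracking
    print(passes_threshold(400, 500))  # False: the article was already old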

In the last part of the project, we analysed the data to see if there are any differences in the patterns of how fake news is shared compared to real news.

Limitations

What is biased content?

There is no well-defined, reliable system of measurement to tell whether an article should be considered biased, real, or somewhere in between.

Bitly links compared to all shared links

In this thesis, we only look at links from Bitly, because those are the only links for which we can get statistics on how many times they have been clicked. We do not know if the group of users who prefer Bitly differs from those who use Twitter's built-in URL-shortener.


Do the clicks come from real persons?

It is not possible to know whether it is real people clicking on the links or some sort of crawling tool (like our own).

API restriction

The APIs from Twitter and Bitly have some restrictions. From Twitter's API, it is possible to see at most 1% of the tweets being posted at the moment. Still, this gives us a sufficient amount of data if the script runs for a while.

From Bitly, the limitations are that over one connection it is only possible to request the expansion of 15 links at a time in a batch, and all additional types of information must be fetched with one request per type per link. This made this part quite time-consuming. Multi-threading the program in these sections helped somewhat by giving us more concurrent connections; however, Bitly limits the number of connections to five at a time.

HTML-extraction

The first stage in the process of turning the HTML code of a website into text for the classifier is to remove all HTML code inside tags of the script or style type. After that, any tags with an attribute indicating a comment section for the article on the web page are removed, since these are not actually part of the article itself and can sometimes, collectively, be much longer than the article. Finally, the text of the web page is extracted using the Python module Beautiful Soup for HTML parsing.
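A condensed sketch of this extraction stage with Beautiful Soup 4; the class name used to detect comment sections here is hypothetical, since the real heuristic is not specified in detail:

    # Turning a web page's HTML into plain text for the classifier.
    from bs4 import BeautifulSoup

    def extract_text(html):
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()                             # drop script/style contents
        for tag in soup.find_all(class_="comments"):    # hypothetical comment marker
            tag.decompose()                             # drop comment sections
        return soup.get_text(separator=" ", strip=True)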


4

Results

Starting in the afternoon on the third of May 2017, we collected tweets for 24 hours and then tracked them for five days. From the Twitter stream, we collected approximately 11 million tweets, each containing a Bitly link. For every 10-minute interval, we sampled 1000 of the collected Bitly links that we had not seen before. Those links were run through the Bitly API, and we extracted the long URLs corresponding to the shortened links. In total, we have looked at 144 000 long URLs. We then kept the URLs that led to English-language news sites and stored these as pointing to news articles. The Bitly links to these articles were tracked for 5 days.

Out of the 835 links that were tracked by the program, 207 were classified as biased by our MNB classifier and 628 were classified as objective. Several of the discovered articles were older and saw most of their growth before they were tracked by the program. To remove these from our analysis, we introduced a threshold: at least 50% of the clicks recorded when we stop tracking a link should have happened during the tracking. This gave us 310 links to articles classified as not biased and 99 classified as biased. For our selection of articles classified by domain, we decided on using articles from BBC as our objective articles; BBC is well known for publishing legitimate news [10]. Articles from Breitbart represent our biased news. The reason for using Breitbart for the biased category is that it was prominent in the manually classified training data for the MNB classifier.

First, we present the plots of the CCDFs, that is, the likelihood that an article of a certain class will receive more than a given number of clicks, with the classes determined by the MNB classifier.

As can be seen in Figure 4.1, both classes have very similar forms, but it is difficult to make out details. The form they follow looks similar to y = mx^{-k}; thus we transform the plots so that both x and y are on logarithmic scales instead of linear.

The plots from Figure 4.1 now look much more like straight lines in Figure 4.2. Attempting linear regression on these plots when the data is log-transformed gives us a line for the articles classified as objective by the MNB classifier: y = 2.83x^{-0.453}. For the plot of the biased articles in Figure 4.2 (b), the linear regression gives the line y = 1.62x^{-0.354}. The 95% confidence interval of the slope coefficient in the regression for the line in Figure 4.2 (a) is I_b = (-0.453, -0.453), and as the slope coefficient of the plot in Figure 4.2 (b), -0.354, is not between these values, the slopes are significantly different at this confidence level. Note that we have chosen to round our values to three significant figures, despite our calculations providing us with 15 figures. This is due to our measurements simply not having the precision necessary to be confident of such specific values. We will continue to round to three significant figures in this thesis.

(a) Classified as objective by MNB (b) Classified as biased by MNB
Figure 4.1: CCDF distributions after five days in lin-lin scale

(a) Classified as objective by MNB (b) Classified as biased by MNB
Figure 4.2: CCDF distributions after five days in log-log scale

The coefficient of determination rÂČ for the line in Figure 4.2 (b) is 0.878, and for the line in Figure 4.2 (a) it is 0.918. This means that the fitted lines explain 87.8% and 91.8%, respectively, of the variance in the data.

(a) BBC (b) Breitbart

Figure 4.3: CCDF distributions after two hours in log-log scale

Performing linear regression on the data in Figure 4.3 (a) gives the line y = 1.65x^{-0.495} and a confidence interval of I_b = (-0.495, -0.495). For Figure 4.3 (b), the line produced by linear regression is y = 1.26x^{-0.638}; thus the two lines are significantly different. The coefficient of determination of the line in Figure 4.3 (b) is 0.937, and 0.941 in Figure 4.3 (a).

After one day of tracking the number of clicks that our selected articles have received, the linear regression lines are now y = 2.04x^{-0.459} for the articles published by BBC and y = 1.67x^{-0.485} for those by Breitbart. The confidence interval I_b for the linear regression in Figure 4.4 (a) is now (-0.46, -0.459). This means that the slopes are still statistically significantly different. The goodness-of-fit is rÂČ = 0.914 for BBC and rÂČ = 0.892 for Breitbart.

(a) BBC (b) Breitbart
Figure 4.4: CCDF distributions after one day in log-log scale

(a) BBC (b) Breitbart

Figure 4.5: CCDF distributions after five days in log-log scale

When five days of tracking have passed, linear regression gives the line y = 2.53x^{-0.479} for the plot in Figure 4.5 (a) and y = 1.72x^{-0.468} for the plot in Figure 4.5 (b). The I_b for the slope describing the articles published by BBC is (-0.481, -0.478). The slopes are again significantly different. The coefficient of determination is 0.909 for BBC and 0.883 for Breitbart.

Figure 4.6: CDF comparing the rate at which clicks are gained for Breitbart and BBC

In Figure 4.6 we see a graph showing the normalised sum of all clicks on articles from one site at a time t against the final sum of clicks on that site. We can see that BBC has a steeper click curve than Breitbart. After six hours, Breitbart had 42% of its total clicks while BBC had 75%. Summing the clicks for all articles, regardless of classification, we saw that it took 10 hours until 50% of the clicks were reached, compared to how many clicks had been received after 5 days.

Deriving a PDF from a CCDF

It is possible to derive the corresponding PDFs to these CCDFs if desired. After five days of tracking clicks, the CCDF for the BBC articles could be described by y = 2.53x^{-0.479}. Since the CCDF is 1 - CDF, and the CDF is \int_{-\infty}^{x} PDF(t)\, dt, the PDF is the negative derivative of the CCDF, which gives y = 1.21x^{-1.48}, rounded to three figures.
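The arithmetic can be verified symbolically, for example with SymPy, by taking the negative derivative of the fitted CCDF:

    # Check: the PDF is the negative derivative of the CCDF (CCDF = 1 - CDF).
    import sympy as sp

    x = sp.symbols("x", positive=True)
    ccdf = 2.53 * x ** -0.479
    pdf = -sp.diff(ccdf, x)
    print(pdf)  # 1.21187*x**(-1.479), i.e. roughly 1.21*x**(-1.48)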


5

Discussion

In this chapter, we discuss our results and what conclusions we can draw from them. We also discuss our choice of method. Finally, we set our work in a wider context.

5.1

Results

It is always a complicated problem to present large amounts of collected data in an understandable way; in this study, our collected data totals approximately 13 GB. In the results, we have used CDF and CCDF distributions to visualise the data we have collected. However, when we plot the graphs over the collected data, we use a simple and rather naive log-log transform and then perform linear regression on the transformed data.

Furthermore, for both approaches, the slope of the CCDF was steeper for objective articles after five days. This means that fewer of the articles classified as objective, by either method, had unusually many clicks. Conversely, a relatively larger group of the articles classified as biased were much more popular to click than the rest. However, early on the slope was steeper for the articles published by Breitbart, meaning that they were more evenly popular initially.

Newman [13] suggests that, technically, determining a power-law distribution for discrete quantities (as is usually the case when measuring, since a measurement usually yields a number rather than a continuous variable) should be done differently. This is more complicated to calculate and requires more specialised functions than the integrals used for the continuous case. Newman also recommends using a cutoff value to set a minimum value from which data points will be used for determining a power-law distribution, since the tail of the data is the most interesting part for a power-law distribution anyway. Further, Newman mentions that simply fitting a slope with linear regression to the log-transformed data is known to introduce systematic biases into the exponent and should not be relied upon, despite being the most common method. Taking this into consideration, the slopes we have fitted might differ significantly from their true slopes. The CCDF graphs for BBC and Breitbart appear to behave according to a power law roughly after x = 10, where the data appear to be closer to a straight line when plotted in log-log. For the graphs for the MNB classifier, the corresponding number appears to be x = 20. If these thresholds were used, the slopes would likely be steeper. Depending on how much the discarded points influenced the fitting of the slope, it might be that some of the slopes would no longer be statistically significantly different.


We can see from the graph comparing the rate at which BBC and Breitbart gained clicks that there was a noticeable difference. One idea about why we see this difference is that BBC content is clicked only for as long as it is interesting as a news article.

5.2

Method

When we trained our MNB classifier, we used scikit-learn's GridSearchCV function to tune the parameters of our classifier so that it would be as good as possible at classifying the articles in our training data. The chosen settings were that unigrams up to 4-grams were used as features and 800 features were selected. All characters were turned into lowercase and no stop words were used. To smooth the probabilities of the features, a smoothing of α = 0.25 was selected. Using tf-idf was an option, but it did not provide the best result against our training data, so it was not used in the final classifier. When we tested the classifier against our evaluation data, it received a score of 89.7% correct classifications. During some of our earlier, shorter tests of the program, we noticed that the classifier gives questionable results for some inputs. For articles that did not concern American politics, the assigned classes were not as accurate, and if the program selects an article for tracking that is not in English, the classifier will happily classify that article as well. Additional classifier stages, one to filter for articles written in English and one following it to filter for American politics, applied before the current classifier, might help in selecting which tracked articles to analyse. As the classifier works right now, it only has two possible classes to assign an article to; another possible improvement would therefore be to introduce a class for articles that are neither highly biased nor completely objective, but slightly biased.

A big concern is that we were only able to find statistics for Bitly links, and it is quite hard to know whether these links are a good representation of links on Twitter in general. Another problematic part is that we are only able to see the total number of clicks a link has received. In some cases, we saw the number of clicks for an article suddenly grow from 113 clicks to 47208 clicks in two hours, one day after we started tracking it. This was probably because the link was shared somewhere else, but in the way we analyse the data, we do not take this into account.

5.3

The work in a wider context

In a wider context, analysing what is written on social media is a growing research field; Twitter in particular is popular due to its friendly view on sharing data. This is still one of the first studies that analyse clicks instead of just shares. As we have seen in this thesis, there is no easy way to get the number of clicks on articles. The method used here is far from perfect, but it fills a void in understanding our social behaviour on Twitter.

We investigate how different news articles are clicked on Twitter. At first sight, it is tempting to instead measure how different individuals read content on Twitter, but when doing this sort of study it is important to take privacy into account, since sensitive information such as political leanings might otherwise be exposed.


6

Conclusion

We have created a working methodology for understanding click behaviour on Twitter. We have done this with the help of statistics from the URL-shortener service Bitly. This answers our first research question: it is possible to create a working methodology for knowing what is clicked on Twitter. However, it is worth noting that the methodology has some weaknesses. We compared how biased content was clicked to more objective news articles, using two different approaches. The first was a simple machine learning tool, MNB. This method is good for classifying into categories, but we had issues due to trying to classify texts well outside the training data. However, the survival functions for the two classes as labelled by the classifier were statistically significantly different and showed similarities to the domain classification method. Secondly, we categorised by simply looking at the domain an article resided on. With both approaches, we could see that the slope of the CCDF is steeper for objective articles after five days: objective articles have fewer outlier articles that are much more clicked than is typical. We have also seen that the period during which an article is relevant is shorter for articles published by BBC than for those by Breitbart. Out of the total number of clicks generated on a news site, 75% were generated in the first six hours for BBC articles and 42% for Breitbart articles. This answers our second research question. The study we did was quite small, but our results point towards there being a difference in how biased and non-biased content is clicked on Twitter.


Bibliography

[1] Hunt Allcott and Matthew Gentzkow. Social Media and Fake News in the 2016 Election. Working Paper 23089. National Bureau of Economic Research, Jan. 2017. DOI: 10.3386/w23089.

[2] Gunnar Blom, Jan Enger, Gunnar Englund, Jan Grandell, and Lars Holst. Sannolikhetsteori och statistikteori med tillÀmpningar. 5th ed. Studentlitteratur, 2004.

[3] Youmna Borghol, Sebastien Ardon, Niklas Carlsson, Derek Eager, and Anirban Mahanti. "The Untold Story of the Clones: Content-agnostic Factors That Impact YouTube Video Popularity". In: Proceedings of the ACM SIGKDD. KDD. Beijing, China, 2012, pp. 1186–1194. DOI: 10.1145/2339530.2339717.

[4] Youmna Borghol, Siddharth Mitra, Sebastien Ardon, Niklas Carlsson, Derek Eager, and Anirban Mahanti. "Characterizing and Modelling Popularity of User-generated Videos". In: Perform. Eval. 68.11 (Nov. 2011), pp. 1037–1055. DOI: 10.1016/j.peva.2011.07.008.

[5] "'Disputed by multiple fact-checkers': Facebook rolls out new alert to combat fake news". In: The Guardian (Mar. 2017). URL: https://www.theguardian.com/technology/2017/mar/22/facebook-fact-checking-tool-fake-news.

[6] Jim Frost. Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit? 2013. URL: http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit (visited on 06/01/2017).

[7] Maksym Gabielkov, Arthi Ramachandran, Augustin Chaintreau, and Arnaud Legout. "Social Clicks: What and Who Gets Read on Twitter?" In: Proceedings of the ACM SIGMETRICS. SIGMETRICS. Antibes Juan-les-Pins, France, 2016, pp. 179–192. DOI: 10.1145/2896377.2901462.

[8] R. Kelly Garrett, Brian E. Weeks, and Rachel L. Neo. "Driving a Wedge Between Evidence and Beliefs: How Online Ideological News Exposure Promotes Political Misperceptions". In: Journal of Computer-Mediated Communication 21.5 (2016), pp. 331–348. DOI: 10.1111/jcc4.12164.


[9] Nicholas S. Holtzman, John Paul Schott, Michael N. Jones, David A. Balota, and Tal Yarkoni. "Exploring media bias with semantic analysis tools: validation of the Contrast Analysis of Semantic Similarity (CASS)". In: Behavior Research Methods 43.1 (2011), pp. 193–200. DOI: 10.3758/s13428-010-0026-z.

[10] Jasper Jackson. "BBC rated most accurate and reliable TV news, says Ofcom poll". In: The Guardian (Dec. 2015). URL: https://www.theguardian.com/media/2015/dec/16/bbc-rated-most-accurate-and-reliable-tv-news-says-ofcom-poll (visited on 06/01/2017).

[11] Jure Leskovec, Lars Backstrom, and Jon Kleinberg. "Meme-tracking and the Dynamics of the News Cycle". In: Proceedings of the ACM SIGKDD. KDD. Paris, France, 2009, pp. 497–506. DOI: 10.1145/1557019.1557077.

[12] Christopher D. Manning, Prabhakar Raghavan, and Hinrich SchĂŒtze. Introduction to Information Retrieval. Cambridge University Press, Apr. 2009.

[13] M. E. J. Newman. "Power laws, Pareto distributions and Zipf's law". In: Contemporary Physics 46.5 (2005), pp. 323–351.

[14] Hossein Pishro-Nik. The First Method for Finding ÎČ0 and ÎČ1. URL: https://www.probabilitycourse.com/chapter8/8_5_2_first_method_for_finding_beta.php (visited on 06/01/2017).

[15] Rob Price. "The fact fake news 'outperformed' real news on Facebook proves the problem is wildly out of control". In: Business Insider (Nov. 2016). URL: http://nordic.businessinsider.com/fake-news-outperformed-real-news-on-facebook-before-us-election-report-2016-11?r=UK&IR=T (visited on 06/01/2017).

[16] Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. "Tackling the Poor Assumptions of Naive Bayes Text Classifiers". In: Proceedings of the Twentieth International Conference on Machine Learning. ICML. Washington, DC, USA: AAAI Press, 2003, pp. 616–623.

[17] Craig Silverman. "This Analysis Shows How Viral Fake Election News Stories Outperformed Real News On Facebook". In: BuzzFeed News (2016). URL: https://www.buzzfeed.com/craigsilverman/viral-fake-election-news-outperformed-real-news-on-facebook (visited on 06/01/2017).

[18] Craig Silverman and Jeremy Singer-Vine. "Most Americans Who See Fake News Believe It, New Survey Says". In: BuzzFeed News (Dec. 2016). URL: https://www.buzzfeed.com/craigsilverman/fake-news-survey?utm_term=.jk8NxNRlP#.qpxALAqMW (visited on 06/01/2017).

[19] Eric W. Weisstein. Least Squares Fitting–Power Law. 2017. URL: http://mathworld.wolfram.com/LeastSquaresFittingPowerLaw.html (visited on 06/01/2017).

[20] Danny Wong. In Q4, Social Media Drove 31.24% of Overall Traffic to Sites. Jan. 2015. URL: http://blog.shareaholic.com/social-media-traffic-trends-01-2015/ (visited on 06/01/2017).
