
Linköpings universitet SE–581 83 Linköping

Linköping University | Department of Computer and Information Science

Bachelor’s thesis, 16 ECTS | Link Usage

2019 | LIU-IDA/LITH-EX-G--19/036--SE

Longitudinal measurements of

link usage on Twitter

Longitudinella mätningar av länkanvändning på Twitter

Oscar Järpehult and Martin Lindblom

Supervisor: Niklas Carlsson    Examiner: Markus Bendtsen


Upphovsrätt

Detta dokument hålls tillgängligt på Internet - eller dess framtida ersättare - under 25 år från publiceringsdatum under förutsättning att inga extraordinära omständigheter uppstår.

Tillgång till dokumentet innebär tillstånd för var och en att läsa, ladda ner, skriva ut enstaka kopior för enskilt bruk och att använda det oförändrat för ickekommersiell forskning och för undervisning. Överföring av upphovsrätten vid en senare tidpunkt kan inte upphäva detta tillstånd. All annan användning av dokumentet kräver upphovsmannens medgivande. För att garantera äktheten, säkerheten och tillgängligheten finns lösningar av teknisk och administrativ art.

Upphovsmannens ideella rätt innefattar rätt att bli nämnd som upphovsman i den omfattning som god sed kräver vid användning av dokumentet på ovan beskrivna sätt samt skydd mot att dokumentet ändras eller presenteras i sådan form eller i sådant sammanhang som är kränkande för upphovsmannens litterära eller konstnärliga anseende eller egenart.

För ytterligare information om Linköping University Electronic Press se förlagets hemsida http://www.ep.liu.se/.

Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/hers own use and to use it unchanged for non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.


Students in the 5 year Information Technology program complete a semester-long software development project during their sixth semester (third year). The project is completed in mid-sized groups, and the students implement a mobile application intended to be used in a multi-actor setting, currently a search and rescue scenario. In parallel they study several topics relevant to the technical and ethical considerations in the project. The project culminates by demonstrating a working product and a written report documenting the results of the practical development process including requirements elicitation. During the final stage of the semester, students create small groups and specialise in one topic, resulting in a bachelor thesis. The current report represents the results obtained during this specialisation work. Hence, the thesis should be viewed as part of a larger body of work required to pass the semester, including the conditions and requirements for a bachelor thesis.


Abstract

When Twitter launched with its unique 140-character limit on posts, the use of link shorteners was brought forth. These were the only way to fit long URLs in tweets until Twitter solved the problem by providing its own integrated link shortener. This study investigates how links are used on Twitter. It includes both a careful data collection using multiple APIs and an analysis of the collected data, providing new insight into this topic.

It was found that a small set of internet domains accounts for a large share of the links found in posted tweets. This set of top-occurring domains did not necessarily reflect the domains typically found on common internet top lists. Looking at link shorteners in posted tweets, we found that "bit.ly" was the most common one. Thanks to our method of collecting data, we could also look up the number of clicks that "bit.ly" links had received. We compared this click data to the number of retweets of the tweets containing these links, which led to some interesting discoveries regarding the ratio between the two.


Acknowledgments

We would like to give special thanks to our supervisor Niklas Carlsson, who has been very helpful in answering questions and helping us find solutions to problems we encountered, as well as putting us on the right track as to which research questions would be beneficial for us to study.


Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Research Questions
1.4 Contributions
1.5 Delimitations
2 Background
2.1 Twitter
2.2 URL Shortening
2.3 Domain Top Lists
2.4 Related Work
3 Method
3.1 Data Collection
3.2 Data Analysis
3.3 Limitations
4 Results
4.1 Domain Statistics
4.2 User Statistics
4.3 Bitly Link Interaction
4.4 Verified vs Non-verified Users
4.5 Miscellaneous Statistics
5 Discussion
5.1 Results
5.2 Method
5.3 The Work in a Wider Context
6 Conclusion
6.1 Future Work


List of Figures

3.1 Overview that shows the data flow and process communication of the tweet collection.
3.2 Overview that shows the data flow and process communication of the additional data collector.
3.3 Overview of how the timing for the different phases looked.
4.1 Distribution of domains for different classes in top 1M lists.
4.2 Distribution of domain rank.
4.3 5x5 scatter plot that shows the frequencies and ranks of domains in the top 25 of the categories All links, Shortened links, Bitly links, Alexa ranking and Majestic ranking.
4.4 Distribution of the age of users' accounts at the time of posting their tweet.
4.5 Distribution of the number of tweets favourited by users at the time of posting their tweet.
4.6 Distribution of the number of tweets posted by users at the time of posting their tweet.
4.7 Ratio between tweets favourited and tweeted by users at the time of posting their tweet.
4.8 Distribution of the number of followers for users at the time of posting their tweet.
4.9 Distribution of the number of friends for users at the time of posting their tweet.
4.10 Followers-to-friends ratio for users at the time of posting their tweet.
4.11 Distribution of the delays for Bitly data to be retrieved after retweet data for tweets.
4.12 Two scatter plots of Bitly clicks-to-retweets ratio.
4.13 Clicks-to-followers ratio for Bitly links.
4.14 Heat-map of retweets vs followers.
4.15 Heat-map of retweets vs number of tweets.
4.16 Heat-map of followers vs number of tweets.
A.1 Followers-to-friends ratio for users at the time of posting their tweet, side-by-side view for all categories.


List of Tables

3.1 Clicks from different time spans and different sources for a Bitly link. The symbol in each cell indicates which information can be extracted from the same API call.
4.1 Number of tweets included in different categories.
4.2 Top 20 most frequent domains for two different categories.
4.3 Top 20 most frequent domains for three different categories.
4.4 Number of unique users for each category and how many of those users are verified.
4.5 Breakdown of the number of geo-coded tweets for each category and the percentage those tweets make up of the total number of tweets for that category.
4.6 Breakdown of top 5 languages for tweets of the four different categories.
4.7 Breakdown of top 5 languages for users that posted tweets of the four different categories.
A.1 All collected shorteners sorted on domains and how many we were able to resolve to the full domain.


1 Introduction

As the world wide web expands, more and more people can keep in touch and share their lives online via social media sites. With only one click, one can reach millions of users worldwide and share everything from plain text to media such as pictures or videos. One of the largest social media sites is Twitter, with 326 million monthly active users [12]. With this large user base, Twitter becomes a good source for analyzing patterns among users and gaining an understanding of what type of content is being shared online.

1.1 Motivation

The basis for this project stems from the importance of knowing how users behave on social media. This data can certainly be informative to the public, but more importantly, it can serve as a source of knowledge for future academic research for others to build upon. For the wider audience, insight into how links are used can help educate on users' habits and how these might differ depending on, for example, culture, location or age. Going forward, as the world wide web expands and becomes a bigger part of society, this could help tell a story about our society as well.

1.2 Aim

With this project, we aim to provide a better understanding of how links are used on the social media site Twitter. We are particularly interested in understanding the usage of normal links, shortened links and links from the specific shortening service Bitly, and the differences between them. Our aims are achieved by collecting tweets and creating a dataset that is then analyzed to provide a foundation upon which we draw conclusions and answer our research questions. Throughout the project, the dataset serves as the central pillar from which we take information. We look into what other researchers have found and compare this to our own findings.

1.3 Research Questions


1. With links being posted on Twitter, is it possible to see any patterns with regard to which domains are tweeted?

2. With Twitter being a social media site, are there any correlations to be found between users and how they interact with tweets containing links?

1.4 Contributions

We provide a methodology for creating a versatile dataset with data from Twitter and Bitly that can be used for a variety of studies. Furthermore, we provide insight into various areas of link usage and user behaviour on Twitter for others to build upon in the future, and make key statements regarding the link-usage landscape on Twitter.

1.5 Delimitations

Since a tweet contains a lot of data, such as images, videos and other media-related data, we limit what fields are saved to the dataset. Media fields and user-created text fields are ignored to avoid complications when saving to file. To make sure information can still be found post-collection, the id that points to the tweet is always saved. To keep the collection manageable to process, we only collect tweets over a time period spanning seven days, so that the dataset is not too large and can be analyzed in a reasonable amount of time.


2 Background

2.1 Twitter

Twitter is an online social media service that provides users with the ability to interact with each other through various means such as tweeting, following and retweeting. This concept forms the foundation for what type of data we are able to, and will, collect during our project [32].

Types of Tweets

Since there are different types of tweets and we will mention them throughout our report, we present them below and explain the differences.

General Tweets

A “General Tweet” is original content that may contain text, photos, a GIF or video. The tweet will appear in the Home timeline of users who follow the sender of the tweet [35].

Mentions

A mention has all the attributes of a “General Tweet” but also contains another Twitter account's username preceded by the "@" symbol. In addition to the followers' Home timelines, a mention will also appear in the Home timeline of the mentioned user [35].

Replies

A reply is a response to another tweet and is displayed in conjunction with the original tweet. It is possible to reply to another reply [35].

Retweets

A retweet is a re-posted tweet. It appears as the original tweet but also contains a special icon showing that it is a retweet. A retweet may also contain a comment from the user who re-posts the tweet [35].


Follow and Friend

To follow means to subscribe to another user in order to get updates on their activity. A friend is someone that a user follows, and a follower is someone that follows the user. This means that the number of friends is how many accounts a user follows, and the number of followers is the number of accounts that follow the user [27].

Twitter API

Twitter provides an application programming interface (API) that allows anyone to collect public tweet and user data via HTTP requests [29]. Tweets requested through the API are returned as JSON objects which contain all information about the tweet and the user who sent it [33]. There are two ways to collect tweets for free: searching tweets and streaming realtime tweets. The search endpoint allows searching against a sample of tweets posted in the past 7 days, with a rate limit of 450 requests per 15-minute window where each request may return 0 to 100 tweets [34]. Streaming realtime tweets returns around 1% of all tweets posted at any time, with the ability to add custom filters to the stream [25].
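To illustrate the JSON objects returned by the API, the sketch below parses a trimmed-down tweet payload and pulls out the links it contains. This is an illustrative example, not the authors' code; the sample payload is invented but follows the documented `entities.urls` structure of a tweet object.

```python
import json

def extract_expanded_urls(tweet_json):
    """Return the original (expanded) URLs found in one tweet's JSON payload."""
    tweet = json.loads(tweet_json)
    # Twitter wraps every link in a t.co short URL; the original address
    # is exposed in the "expanded_url" field of each URL entity.
    return [u["expanded_url"] for u in tweet.get("entities", {}).get("urls", [])]

# Trimmed-down, invented payload following the documented structure.
sample = '''{
  "id_str": "1122334455",
  "text": "Interesting read https://t.co/abc123",
  "entities": {"urls": [{"url": "https://t.co/abc123",
                         "expanded_url": "https://example.com/article"}]}
}'''

print(extract_expanded_urls(sample))  # ['https://example.com/article']
```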

2.2 URL Shortening

URL shortening is a way to make URLs on the web considerably shorter while still directing to the same page. This makes it more convenient to share a URL, and it may even be required when sharing content on sites like Twitter, or over SMS, where there is a character limit. For example, the URL “https://a-very-long-url.com” can be shortened to “https://sho.rt/3wK5”. The short URL can then be shared, and when it is accessed it will redirect to the original long URL. Most URL shortening services work by storing long URLs with a unique key in a database; when the key is accessed, the long URL is returned. In the previous example, the unique key is “3wK5”. Old URLs that are not accessed frequently may be deleted to make room for new ones.
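The key-in-a-database scheme just described can be sketched in a few lines. This is a minimal in-memory sketch under our own assumptions (base-62 keys from a running counter, a hypothetical "sho.rt" domain), not how any particular shortening service is implemented.

```python
import string

ALPHABET = string.digits + string.ascii_letters  # base-62 key alphabet

class UrlShortener:
    """In-memory sketch of the key-to-URL lookup scheme described above."""

    def __init__(self, domain="https://sho.rt"):
        self.domain = domain
        self.db = {}        # key -> long URL, the "database"
        self.counter = 0

    def _next_key(self):
        # Encode a running counter in base 62 to get short, unique keys.
        n = self.counter
        self.counter += 1
        key = ""
        while True:
            n, r = divmod(n, 62)
            key = ALPHABET[r] + key
            if n == 0:
                return key

    def shorten(self, long_url):
        key = self._next_key()
        self.db[key] = long_url
        return f"{self.domain}/{key}"

    def resolve(self, short_url):
        # A real service would answer this lookup with an HTTP redirect.
        key = short_url.rsplit("/", 1)[1]
        return self.db[key]

shortener = UrlShortener()
short = shortener.shorten("https://a-very-long-url.com")
print(short, "->", shortener.resolve(short))
```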

A URL shortener can be either public for general use or internal for a specific purpose. YouTube and Twitter have their own internal shorteners, youtu.be and t.co. Youtu.be is only used for sharing videos on YouTube, and t.co is used on Twitter to automatically shorten URLs in tweets. A public general shortener allows anyone to shorten any URL. Bitly, goo.gl and TinyURL work like this, with Bitly being the most common one.

Bitly

Bitly offers the ability to view statistics for every URL by requesting the URL with a + sign at the end. These statistics contain data on each click on the Bitly link, with parameters like referrer, location and date. The statistics page also includes the URL the Bitly link redirects to and the date when it was created. Bitly also provides an API which makes it possible to access this data using HTTP requests to different endpoints [6].

2.3 Domain Top Lists

An internet domain top list is simply a list of the x most popular domains found on the internet. There are several ways to construct such a list, varying from gathering user data in search engines to counting DNS requests for specific domains. There are multiple lists available online, but only two of them will be used in this thesis.

Alexa

The “Alexa top 1 million”, provided by Alexa, is the most popular and widely used top list around [26]. The data for this list is gathered in two ways: user data from browser extensions and data from sites running the Alexa script [1]. Alexa claims that the data gathered from extensions comes from over 25,000 extensions used by millions of people [3], and in order for a site to run the Alexa script it has to certify its metric [1]. The full list is not available for free and can only be accessed for a monthly fee [2]; therefore an old version from 16-05-2014 will be used in this thesis.

Majestic

The “Majestic million”, provided by Majestic, is a less commonly used top list than Alexa's, but the full list is available for free [23]. The data used to construct the list is gathered by crawling the web and counting the number of referring subnets for each individual domain [23]. This means that, unlike Alexa, Majestic does not take into account how often a link is clicked. The Majestic list used in this thesis was downloaded from their web page on 07-05-2019.

2.4 Related Work

Internet Domains

Scheitle et al. [26] discuss how three different internet domain top lists compare to each other. The top lists analyzed are Alexa, Majestic and Cisco Umbrella. Cisco Umbrella, like Alexa and Majestic, is a list of 1 million unique domains, but the data is gathered from the Umbrella global network, which handles over 100 billion DNS requests each day across 65 million users spread over 165 countries [11]. A consequence is that Umbrella does not fully reflect how people browse the web, since DNS requests are also performed by mobile devices and internet of things (IoT) devices. However, both Alexa and Majestic have their own flaws. When comparing the domains present in a list one day but not the next, it is shown that the Alexa list on some days differs by up to 50%. This effect has a high impact on the domains with a lower rank, while the highest-ranked domains are more consistent. The Majestic list was found to be highly consistent from day to day, which can be explained by it only counting links, which is more static than people's daily use. However, Majestic also includes “hidden” links that may be loaded but not knowingly requested by humans, resulting in a list that may not reflect how humans interact with links on the web. It is concluded that when using an internet domain top list in a study, it is important to consider the purpose of the study: for investigating how humans browse the web the best option is Alexa, while for analyzing the structure of the Internet, Majestic is the way to go [26]. Furthermore, Gill et al. [15] explain that a large proportion of the internet traffic typically associated with user browsing comes from a small subset of domains.

Twitter Streaming API

Morstatter et al. [24] describe how the streaming API from Twitter differs from the firehose API, also provided by Twitter. The streaming API is free to use whereas the firehose is a paid service. The largest difference is that the streaming API provides 1% of tweets posted in real time, whereas the firehose provides all tweets. They wanted to find out how well the data received from the streaming API represents all of the tweets. They accomplished this by collecting data from both endpoints and then performing a comparative analysis. One comparison looked at how the top n hashtags of each dataset compared to each other. They found that the data from the streaming API performed well for large n but not very well for small n. To test this further, they did another run where, instead of using the data from the streaming API, they sampled randomly from the firehose data, and found that a random sample from the firehose represented the top hashtags better for all n than the streaming API did. They attributed this to there probably being some bias in how Twitter selects which tweets to provide through the streaming API. How this “filter” works is unknown. Another observation was that most geo-tagged tweets appeared in the data from the streaming API.

There has also been another study looking at how the streaming API functions. Campan et al. [8] used the free streaming API during the FIFA 2018 World Cup to see how filtering on different topics affected how representative the collected data was of all tweets. They did this to answer the question of whether data collection from the streaming API can be useful for academic research. Since they only looked at the streaming API, there are some limitations as to what conclusions can be drawn from their results. For instance, they state that when filtering for keywords that have fewer than six hundred tweets per minute, it is “likely” that all tweets will be obtained; this is left somewhat vague since they do not compare against the full firehose API. It is also discussed whether the sampling of tweets for the stream is deterministic or random, and they conclude that it is most likely deterministic. They go on to explain that in some cases, such as with very popular keywords, this could lead to a biased selection of tweets being presented to the user. Another interesting observation was that there seemed to be a limit on how many tweets could be retrieved when using filtering, while there seemed to be no such limit when using no filter. It should be noted that they found Twitter data useful for academic research, but one should be careful when choosing to filter, as it can create biased data that might be less useful depending on the type of research conducted.

Choudhury et al. [10] looked into how studies about information diffusion are impacted by different types of sampling strategies, and found that studies of concentrated events benefit from selecting samples focused on the location the event is connected to. Regarding data sampling, Llewellyn et al. [22] looked at gathering data from different channels on Twitter with the aim of studying the same topic (Brexit). They found that different channels produced different results even though the topic was the same.

Twitter User Behavior

Gabielkov et al. [13] claim that there is a lot of research into sharing behaviour on social media but not much into how often links are clicked. Motivated by this, and similar to the work presented here, both Gabielkov et al. [13] and Holmström et al. [17] combine the use of the Twitter API and the Bitly API to investigate click behaviour and contrast it with retweet behaviour. Gabielkov et al. [13] show that it is possible to predict the click behaviour for a given day on social media based on early sharing activity the same day, whereas Holmström et al. [17] focus on temporal click dynamics for links to the news articles of a selected set of news websites.

When it comes to user interactivity on the Twitter platform, Krishnamurthy et al. [21] found that users with more than 250 followers post status updates more often than those that follow more than 250 people. Cha et al. [9] looked into what makes a user influential on Twitter. They found that having a high number of followers does not necessarily make a user influential, and that retweets are driven by the content of a tweet while mentions are driven by the popularity of the user.

Garimella et al. [14] used Twitter data to study the political polarization in the United States between 2009 and 2016. They found that polarization had increased over those years, although they did not present in which direction. On the topic of content classification, Iman et al. [18] considered the problem of topic classification on Twitter and found that the three most informative parts of a tweet are, in order: term (any word in the tweet that is not a hashtag), location and hashtag.

URLs that are frequently tweeted are usually of two conflicting types: either they come from sites of high quality, or they are spam [19]. Several studies have also been conducted on spam in URL shorteners, with the mutual result that a large proportion of the analyzed links are classified as spam [4, 20, 36, 16].


3 Method

The method we used to answer the posed research questions was to perform a longitudinal measurement study of tweets posted on Twitter. We accomplished this by collecting data from slightly more than twenty-five million tweets over the span of seven days. The collection of these tweets was carried out between 26-04-2019 and 03-05-2019.

3.1 Data Collection

To gather our data, we decided to collect tweets with Twitter's streaming API for one week. For every tweet we extracted the data needed for our dataset and stored it locally. In addition, we collected extra data for each tweet 24 hours after it was posted. This data was the number of retweets for the tweet and, in the case that a Bitly link was found in the tweet, information about the Bitly link.

We decided to collect data in smaller intervals to lower the risk of losing a lot of data if an error were to occur during the collection. Each interval was set to four hours, and 20 hours after an interval was done we started to gather the extra data for that interval. This process resulted in a total of 42 smaller datasets over one week that we put together to construct our final dataset.
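The interval arithmetic above can be made concrete with a small scheduling sketch. The dates come from the text; everything else (variable names, the exact midnight start time) is an assumption for illustration.

```python
from datetime import datetime, timedelta

COLLECTION_START = datetime(2019, 4, 26)  # start date taken from the text
INTERVAL = timedelta(hours=4)             # one chunk of collected tweets
EXTRA_DATA_DELAY = timedelta(hours=20)    # wait after a chunk closes

# Seven days of four-hour chunks gives the 42 datasets mentioned above.
n_intervals = int(timedelta(days=7) / INTERVAL)
print(n_intervals)  # 42

# For each chunk: when it starts and when its extra data is gathered
# (24 h after the chunk started = 4 h of collection + 20 h of waiting).
schedule = [(COLLECTION_START + i * INTERVAL,
             COLLECTION_START + i * INTERVAL + INTERVAL + EXTRA_DATA_DELAY)
            for i in range(n_intervals)]
```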

API Limits

The first problem we encountered when starting to implement our data collection was the various limits of the APIs we used. Since we chose to collect tweets with the stream, we did not have any limits except for only getting around 1% of the total amount of tweets. However, when using other endpoints provided by the Twitter API, there was always a rate limit involved. These rate limits differ between endpoints, but for the one we used to gather retweet data the limit was 900 requests per 15-minute window [31]. We found that this limit was no issue for us, since we could process tweets at a higher rate than the stream provided us with new ones.

Where we did have a big problem with rate limits was when using the Bitly API. We found that the documentation for the Bitly API was not clear on the rate limits for different endpoints. The only rate limit listed was for shortening URLs, which had limits of 1,000 calls per hour and 100 calls per minute [7]. However, Gabielkov et al. found that the number of calls that could be performed each hour was 200 [13]. We conducted our own test and discovered that our rate limit was 1,000 requests per hour for all endpoints that we used. We also had access to four Bitly accounts, which allowed us to make 4,000 requests per hour when switching between accounts.
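The account-switching idea can be sketched as a simple round-robin over access tokens. The token names are invented, and this is only an illustration of the rotation scheme, not the collection code itself.

```python
from itertools import cycle

# Hypothetical access tokens for the four Bitly accounts; names invented.
TOKENS = ["token_a", "token_b", "token_c", "token_d"]
PER_ACCOUNT_LIMIT = 1000  # requests per hour, as measured in our own test

token_pool = cycle(TOKENS)

def next_token():
    """Round-robin over the accounts so each stays under its own limit,
    giving len(TOKENS) * PER_ACCOUNT_LIMIT requests per hour combined."""
    return next(token_pool)

print(len(TOKENS) * PER_ACCOUNT_LIMIT)  # 4000
print([next_token() for _ in range(6)])
```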

First Phase Data Collection

This section explains how we collected tweets for our study. We streamed tweets from Twitter's API and wrote them to the file system for later use.

Figure 3.1: Overview that shows the data flow and process communication of the tweet collection.

The overview in Figure 3.1 shows the structure of the application we used to collect tweets for this project. What is shown is an abstraction, which means that things are simplified and some details are left out in order to reduce complexity and provide an easier-to-understand workflow. Details will be discussed and presented as they become relevant throughout this chapter. A full arrow, as seen in Figure 3.1, indicates the flow of data, and a dashed arrow shows that there is communication between processes. We utilized multiple processes in our application to allow for parallel execution of tasks. To identify the elements of the figure we refer to their shapes: a process is indicated by a rectangle, data storage (temporary and permanent) by a rectangle with rounded corners, and outside data sources such as APIs by parallelograms. Color is used to indicate that some processes are more involved with each other than with others.

We now present the task of each of the different processes. This first description is brief; we will later provide a more in-depth explanation of how they cooperate and why we chose this implementation. Master (M) is responsible for scheduling and starting up the other processes: it handles what should happen, when it should happen and who should do it. Tweet Gatherer (TG) has the sole task of opening a stream to the Twitter API (TAPI) and constantly receiving tweets from it. Tweet Queuer (TQ) adds said tweets to a queue. Tweet Writer (TW) reads tweets from the queue and writes them to a file on the File System (FS). The Second Phase Data Collection (SPDC) is more complex and will have its functionality described in its own section; briefly, it takes the file written to the FS, gathers additional data in the form of retweets and information from Bitly, and then writes that to the FS as well. All of these processes run in parallel and pass data between them as they finish with it.


To look into how this all works, we will go through the full arrows and describe the process of collecting a tweet. It starts with TG opening a stream (Arrow 1) to the TAPI, which will continually send live tweets to TG. TG then sends these over a pipe to TQ. TQ's only task is to add this data to the Queue (Q). This implementation might look a bit odd, since TG could just add to Q immediately and the intermediate TQ would not be needed; this was also the case in earlier iterations of our application. There were, however, problems with that implementation in regards to speed. Since TAPI requires you not to fall too far behind when consuming the stream, it is necessary to keep even footing with it. This was not the case when TG was enqueuing tweets directly: it was too slow, and over the course of a collection this would cause us to fall too far behind and be disconnected by TAPI. Using a pipe (Arrow 2) and the intermediate process gave us a faster flow. The drawback is that the pipe has a blocking read call, and we wanted the ability to do other work in TW even if there were no tweets in line. This is why we went with the intermediate step plus the queue instead of a pipe directly from TG to TW. Once a tweet is enqueued in Q, TW continually reads from it, unpacks the tweet object and filters out only the parts that we deemed useful for this project. Which fields were kept is further discussed in the section “Dataset Structure”. Once this filtration was done, the data was written to the FS as a comma-separated values (CSV) file. Furthermore, TW did not write to the same file over the entire duration but rather, as explained earlier, changed files every four hours. M was responsible for this scheduling and told TW which file it should change to and when. This gave us four-hour chunks of data that could then be passed on to the SPDC for further processing.
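The pipe-plus-queue handoff described above can be sketched as a minimal producer-consumer pipeline. This is an illustrative sketch, not the authors' code: threads and `queue.Queue` stand in for the separate processes and the OS pipe, and the field names follow Twitter's tweet JSON.

```python
import csv
import io
import queue
import threading

def tweet_queuer(pipe_out, q):
    """TQ: drain the 'pipe' from the gatherer so TG never blocks."""
    for tweet in iter(pipe_out.get, None):  # None acts as a shutdown sentinel
        q.put(tweet)
    q.put(None)  # forward the sentinel to the writer

def tweet_writer(q, out_file):
    """TW: dequeue tweets, keep only the fields deemed useful, write CSV."""
    writer = csv.writer(out_file)
    for tweet in iter(q.get, None):
        writer.writerow([tweet["id_str"], tweet["created_at"], tweet["text"]])

pipe, q = queue.Queue(), queue.Queue()  # stand-ins for the pipe and Q
out = io.StringIO()                     # stand-in for the CSV file on the FS

threads = [threading.Thread(target=tweet_queuer, args=(pipe, q)),
           threading.Thread(target=tweet_writer, args=(q, out))]
for t in threads:
    t.start()

# TG would feed tweets from the stream; here we inject one fake tweet.
pipe.put({"id_str": "1", "created_at": "Fri Apr 26 12:00:00 2019", "text": "hi"})
pipe.put(None)
for t in threads:
    t.join()
print(out.getvalue().strip())
```

In the real application the stages are separate OS processes, which is what makes the extra TQ hop worthwhile: the gatherer's only job is to keep up with the stream.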

Second Phase Data Collection

Collecting extra data from different APIs with different limitations created some complications that led to a more complex solution than the rest, and therefore it has its own section. It is important to mention that the SPDC was spawned as a new process for every four-hour chunk of data that the collector created. These processes were started 20 h after the collection of a chunk was finished, which meant that the additional data was gathered 24 h (4 h + 20 h) after the collection of that chunk was started.

Figure 3.2: Overview that shows the data flow and process communication of the additional data collector.

Figure 3.2 shows an overview of how we collected the additional data. It follows the same notation as Figure 3.1. To explain the Second Phase Data Collection (SPDC) we will look at its sub-processes and what they do. It spawns two of these: one is the Retweet Retriever (RR) and the other is the Bitly Retriever


3.1. Data Collection

                From All Sources   From Twitter
All Time               X                X
Since Posting          O                O

Table 3.1: Clicks from different time spans and different sources for a Bitly link. Cells with the same symbol can be extracted from the same API call.

(BR). These run in parallel and each communicates with a different API. RR communicates with the Twitter API (TAPI), the same API as before, but here we used another endpoint that functions as a search terminal where we could look up information about tweets (in batches of 100). BR communicates with the Bitly API (BAPI), which allowed us to look up information about Bitly links.

RR starts by reading a file from the File System (FS) and adds every tweet to a batch. When the batch reaches 100 entries it is sent off to the TAPI, which returns a list containing all the information about the tweets. Here we were only interested in the number of retweets each tweet had amassed since it was posted. This was saved to a CSV file together with the corresponding tweet id and a time stamp of when the data was retrieved.
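The batching logic in RR can be sketched as below. This is a simplified sketch: `lookup_batch` is a hypothetical callable standing in for the actual TAPI request, and is assumed to map a list of tweet ids to a dictionary of retweet counts.

```python
def batches(tweet_ids, size=100):
    # Buffer ids and yield them in groups of `size` (the per-request
    # maximum of the lookup endpoint), plus a final partial batch
    batch = []
    for tid in tweet_ids:
        batch.append(tid)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def collect_retweet_counts(tweet_ids, lookup_batch):
    # `lookup_batch` wraps one TAPI call: ids -> {id: retweet_count}
    counts = {}
    for batch in batches(tweet_ids):
        counts.update(lookup_batch(batch))
    return counts
```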

In parallel, BR reads from the same file and filters out only the tweets that contain Bitly links, from which the Bitly links are extracted. For every link we wanted to know three things: to what URL the Bitly link redirected, when the shortened link was created, and how many clicks it had received from different sources during different time periods. This is shown in Table 3.1. The click statistics are divided into four categories: all clicks from all sources ever, all clicks from Twitter ever, clicks from all sources since the posting of the tweet, and clicks from Twitter since the posting of the tweet. Making a separate call to the BAPI for each piece of information would result in six calls per link, which would quickly hit the rate limit. As seen in Table 3.1, cells with the same symbol can be extracted from the data of the same API call, which brought it down to four calls per link. To speed things up, BR created four threads for each Bitly link and executed each of the calls to the BAPI in parallel on its own thread. When this information was returned from the BAPI it was written to a CSV file on the FS containing the tweet id, the gathered data and a time stamp of when it was retrieved.
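The four parallel requests per link can be sketched with a thread pool. The `fetchers` mapping is hypothetical: each entry stands in for one of the four wrapped BAPI calls.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_link_stats(link, fetchers):
    # Run the four per-link requests concurrently, one per thread, and
    # gather the results under their field names
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(fetch, link)
                   for name, fetch in fetchers.items()}
        return {name: future.result() for name, future in futures.items()}
```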

Phase Overview

To provide a better understanding of the timings we used for the two collection phases, we here present a more in-depth view of this matter.

Figure 3.3: Overview of how the timing for the different phases looked.

As mentioned, chunks were four hours long, and after one chunk is done we wait 20 hours before gathering the second-phase data for the collected chunk. The figure illustrates the ideal scenario where everything is perfectly timed. To avoid clutter we have not drawn all the connections between the different first- and second-phase chunks, but each pair is connected: 1.1 to 2.1, 1.2 to 2.2, 1.3 to 2.3 and


so on. As said, this is the ideal scenario where everything takes four hours. That held for the first phase, where every chunk was four hours long, but not for the second phase: some chunks took longer than four hours and therefore overlapped with each other. This could have caused problems due to rate limits, but since we had access to four Bitly accounts we could run up to four chunks in parallel without problems. Still, many of the second-phase chunks turned out longer than four hours, which caused a discrepancy between the retweet data and the Bitly data of a chunk, because the retweet data collection finished faster than the Bitly one. This is discussed in further detail in the result section.

Dataset Structure

Below is a list of all the fields that we kept from the tweet data in the first step, where we collected tweets. Some fields do not always contain data; these are marked in italics.

• Tweet id

• When the tweet was posted

• The id of the place where the user posted from

• The name of the place where the user posted from

• The country of the place where the user posted from

• Coordinates of the user when the tweet was posted

• What language the tweet was posted in

• A list with the hashtags in the tweet

• A list with the URLs in the tweet

• If it is a retweet or not

• If it is in reply to another tweet

• Id of the user that posted

• When the user was created

• How many followers the user has

• How many friends the user has

• How many tweets the user has tweeted

• How many tweets the user has favourited

• Whether or not the user is verified

• What language the user uses

The following columns are the data collected in the retweet step of the SPDC. This data is retrieved for all tweets in the collection.

• The number of retweets the tweet has received since posting

• When the retweet count was retrieved

The following columns are the data collected in the Bitly step of the SPDC. This data is retrieved for all tweets containing a Bitly URL.

• A list of the total number of clicks the Bitly links have received

• A list of the total number of clicks the Bitly links have received that originate from Twitter

• A list of the number of clicks the Bitly links have received since the tweet was posted

• A list of the number of clicks the Bitly links have received that originate from Twitter since the tweet was posted

• A list of the URLs the Bitly links redirect to

• A list of time stamps for when the Bitly links were created


All of these are then merged into one big dataset where every row of the CSV file is a tweet; if there is no data for a specific column of a tweet, the cell is left empty.
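The merge can be sketched as a left join on tweet id. Field and column names here are illustrative; the real data would be read with `csv.DictReader` from the per-step CSV files.

```python
def merge_on_tweet_id(base_rows, extra_rows, extra_fields):
    # Left-join extra per-tweet columns onto the tweet rows; tweets with
    # no match get empty strings, mirroring the empty CSV cells
    extras = {row["tweet_id"]: row for row in extra_rows}
    merged = []
    for row in base_rows:
        extra = extras.get(row["tweet_id"], {})
        merged.append({**row, **{f: extra.get(f, "") for f in extra_fields}})
    return merged
```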

3.2 Data Analysis

Before analyzing our data we started by combining the 42 smaller dataset chunks into one large dataset, which we then combined with the data from the SPDC, that is, the number of retweets as well as the additional Bitly data. This produced a CSV file of around 5.2 GB, which was still manageable to work with. We nevertheless decided to create three subsets of it: one containing all tweets with links in them, one containing all tweets with shortened links in them, and a last one containing only tweets with Bitly links in them. This further streamlined the analysis, since for most things we wanted to compare these different sets against each other.
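The subset construction can be sketched as below. The shortener list here is an abridged stand-in for the full list in Chapter A, and the tweet field name `urls` is illustrative.

```python
from urllib.parse import urlparse

# Abridged stand-in for the full shortener list in Chapter A
SHORTENER_DOMAINS = {"bit.ly", "youtu.be", "goo.gl", "ow.ly", "tinyurl.com"}

def domain(url):
    # Host part of the URL, lower-cased for comparison
    return urlparse(url).netloc.lower()

def split_subsets(tweets):
    # Build the three nested subsets: tweets with links, with shortened
    # links, and with Bitly links (each a subset of the previous)
    link = [t for t in tweets if t["urls"]]
    shortener = [t for t in link
                 if any(domain(u) in SHORTENER_DOMAINS for u in t["urls"])]
    bitly = [t for t in shortener
             if any(domain(u) == "bit.ly" for u in t["urls"])]
    return link, shortener, bitly
```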

Once we had these subsets and the big dataset, we pre-processed the data with Python and created files that we could import into Matlab and plot. The pre-processing step mostly served to reduce the amount of code we had to write in Matlab, since we were more comfortable with Python and used Matlab almost exclusively for plotting.

3.3 Limitations

Due to the nature of working with large amounts of data, network connections and external APIs, we ran into some limitations when performing our data collection. Below we present the ones that affected our work.

Collection Time

Since our collection only ran for one week we cannot conclude whether there are any weekly patterns in the behaviour of users on Twitter. We have also not taken into account whether any special events or holidays during our collection might have had an impact on the data.

Collection Time Variation

Because we collect tweets for four hours and then wait 20 hours before collecting additional data, the time between posting and the additional-data collection varies between tweets. For example, the download of retweet counts often finished in less than four hours, meaning that the time between posting and retweet data is less than 24 hours for a tweet collected late in the four-hour interval. For the Bitly data the effect was the opposite, as gathering the Bitly data often took longer than four hours. With this variation it is hard to draw any conclusions on the retweets-vs-clicks ratio.

Failed Requests

If a request for retweets failed we retried it once, then dropped it and moved on to the next one. For the Bitly data we did not retry failed requests at all and instantly moved on to the next request. We made the decision that it was better to lose some data than to delay the collection of all other data. As a result, we only perform our analysis on the tweets for which we managed to gather all required data.

Clicks from Bitly

When using the Bitly API to get the number of clicks, there are two major distinctions that cannot be made. First, we cannot guarantee that clicks come from a particular tweet.


This means that the data could show that a tweet had a big impact on the number of clicks for a link while in reality all clicks came from another tweet that we did not catch while streaming. The second distinction that cannot be made concerns multiple clicks from the same location. While Bitly does not count “spam-clicking” as multiple unique clicks, multiple clicks performed a few seconds apart will still count as unique clicks [5]. As a result, we cannot attribute each click to a unique user.

Shortener Lookup

While we did not have any problems looking up the full URL for Bitly links, we had some problems with other shorteners. We found that the shortener “goo.gl” was problematic to look up programmatically because our script was flagged as a bot after just a few lookups. For other shorteners we found that a lot of URLs redirected to an invalid page. In the cases where we could not get the full URL we discarded the short URL. See Table A.1 in Chapter A for the full list of looked-up shorteners.

Top Lists

Since the top lists can change depending on when they are downloaded, the lists that we use might not reflect how the top domains were distributed at collection time. This is especially true for the Alexa list, which is from 2014.


4 Results

The total number of tweets collected during the seven-day period summed up to 25.5 million. We now present the results of our analysis, which is based on the collected tweets and different subsets of this data.

Category            Count      %
All Tweets          25482108   100%
Link Tweets         4026101    15.8%
Shortener Tweets    322954     1.27%
Bitly Tweets        159143     0.625%

Table 4.1: Amount of tweets that are included in different categories.

Table 4.1 presents the four categories that are used throughout this document. The count is the number of tweets of that category that we collected, and the last column is the percentage of all tweets that the category makes up. “All Tweets” are all collected tweets, “Link Tweets” are tweets that contain at least one link, “Shortener Tweets” are tweets that contain a link whose domain is a shortener (to see which domains we considered shorteners, see the list in Chapter A) and “Bitly Tweets” are tweets that contain a link whose domain is “bit.ly”.

From these tweets we were able to extract a total of 4.2 million URLs, of which 333,616 were shortened links. To determine the most popular domains we extracted the domain from each URL and counted the occurrences of each domain. The top 20 most common domains are shown in Table 4.2a. Note that Twitter is by far the most common domain; this is because every retweet contains the URL of the original tweet. To give a better overview of URL shorteners, Table 4.2b displays the top 20 most common shortener domains. The most common link shortener was Bitly with 164,307 occurrences. This differs from the number of tweets that contained at least one Bitly link, which was 159,143, showing that some tweets contained more than one Bitly link.
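The domain counting behind Table 4.2 can be sketched in a few lines; the function name is illustrative.

```python
from collections import Counter
from urllib.parse import urlparse

def top_domains(urls, n=20):
    # Extract the host name of every URL and count occurrences;
    # most_common returns (domain, count) pairs in descending order
    counts = Counter(urlparse(u).netloc.lower() for u in urls)
    return counts.most_common(n)
```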


    Domain              Occurrences
 1  twitter.com           2166840
 2  du3a.org               187580
 3  bit.ly                 164307
 4  facebook.com           122999
 5  youtu.be               116849
 6  instagram.com           54980
 7  showroom-live.com       42351
 8  curiouscat.me           41691
 9  youtube.com             29447
10  peing.net               25007
11  dlvr.it                 19039
12  fllwrs.com              17613
13  goo.gl                  14472
14  open.spotify.com        14089
15  twcm.co                 12265
16  naver.me                11301
17  pscp.tv                 10893
18  ow.ly                   10059
19  blbrd.cm                 9964
20  swarmapp.com             9752

(a) Top domains overall.

    Domain         Occurrences
 1  bit.ly              164307
 2  youtu.be            116849
 3  goo.gl               14472
 4  ow.ly                10059
 5  buff.ly               8742
 6  tinyurl.com           5871
 7  j.mp                  3034
 8  wp.me                 2617
 9  lnkd.in               2385
10  is.gd                 2300
11  ouo.io                1005
12  po.st                  400
13  cort.as                243
14  flic.kr                225
15  tcrn.ch                217
16  bit.do                 187
17  eepurl.com             173
18  migre.me               116
19  snip.ly                108
20  bc.vc                   83

(b) Top shortener domains.

Table 4.2: Top 20 most frequent domains for two different categories.

4.1 Domain Statistics

Results regarding domains and how they appeared in our dataset are gathered in this section, which presents the distribution as well as the frequency of domains. We also looked at how phishing domains are posted, but we found none.

Top Domains

From our data we constructed three different sets: all links, shortened links and Bitly links. Bitly links is a subset of shortened links, which in turn is a subset of all links. These sets are compared to each other in order to determine whether there is any difference in how the various types of links are used. In these sets, every shortened URL was translated to the URL that the short URL redirected to, and finally all URLs were grouped by domain. Worth noting is that some shortened URLs were invalid or deleted, in which case they were discarded from the sets. This was very common for “goo.gl”, “ow.ly” and “buff.ly”, where most URLs could not be translated to the full URL. The top 20 most common domains for each set are shown in Table 4.3. Each domain is listed with its total occurrences and its rank in Alexa's and Majestic's top one million domain lists. A dash means that the domain does not have a rank among the top one million.

Popularity Distribution

This section shows what the distribution of links between seven different classes looks like. The domain of every link is assigned to one of the following classes: Alexa [1-10]; Alexa [11-100]; Alexa [101-1K]; Alexa [1001-10K]; Alexa [10001-100K]; Alexa [100001-1M]; other (non-ranked). This is done for each of the link subsets, and a fourth set containing only non-Bitly shorteners is also added. The results are displayed in Figure 4.1a. Figure 4.1b shows the same classes but with the Majestic Million instead of the Alexa top 1M. When comparing the two figures there is a clear distinction between the classes [1-10] and [11-100]. This is because


    Domain              Occur.     Alexa     Maj.
 1  twitter.com         2167059       12        4
 2  du3a.org             187580        -        -
 3  youtube.com          147359        2        3
 4  facebook.com         123883        3        2
 5  instagram.com         57117       15        7
 6  showroom-live.com     42356     4156    77018
 7  curiouscat.me         41691     4915   168225
 8  peing.net             25007     6472   228312
 9  twittascope.com       23174   301905        -
10  dlvr.it               19700        -    11127
11  fllwrs.com            17613    59565   831014
12  open.spotify.com      14160        -      219
13  lawson.co.jp          13264    35589    17836
14  twcm.co               12265        -        -
15  naver.me              11310   177121    23425
16  pscp.tv               10895     2836     1428
17  blbrd.cm               9964   210800        -
18  swarmapp.com           9752    73711    29610
19  cas.st                 8642        -        -
20  shindanmaker.com       8326     8309    32267

(a) Top domains for all links.

    Domain             Occur.     Alexa    Maj.
 1  youtube.com        117912         2       3
 2  twittascope.com     23173    301905       -
 3  lawson.co.jp        13226     35589   17836
 4  k.kakaocdn.net       5457         -       -
 5  img1.daumcdn.net     5168         -       -
 6  linkedin.com         2327        43       6
 7  instagram.com        2137        15       7
 8  t1.daumcdn.net       1846         -       -
 9  reddit.com           1521        16      43
10  youtu.be             1510     31627      14
11  google.com           1343         1       1
12  cards.twitter.com    1195         -       -
13  extratv.com          1106    118225   12330
14  el-nacional.com       980      1253    8829
15  mayla.jp              927         -       -
16  facebook.com          884         3       2
17  54.202.34.80          861         -       -
18  careerarc.com         822     70517   91327
19  uls.her.jp            805         -       -
20  drive.google.com      792         -      39

(b) Top domains for shortened links.

    Domain             Occur.     Alexa    Maj.
 1  twittascope.com     23173    301905       -
 2  lawson.co.jp        13226     35589   17836
 3  k.kakaocdn.net       5457         -       -
 4  img1.daumcdn.net     5164         -       -
 5  instagram.com        2133        15       7
 6  t1.daumcdn.net       1843         -       -
 7  reddit.com           1518        16      43
 8  google.com           1333         1       1
 9  youtu.be             1320     31627      14
10  cards.twitter.com    1194         -       -
11  extratv.com          1106    118225   12330
12  youtube.com          1040         2       3
13  el-nacional.com       980      1253    8829
14  mayla.jp              927         -       -
15  54.202.34.80          861         -       -
16  facebook.com          824         3       2
17  careerarc.com         822     70517   91327
18  uls.her.jp            805         -       -
19  drive.google.com      789         -      39
20  cdiscount.com         781       780   11783

(c) Top domains for Bitly links.

Table 4.3: Top 20 most common domains for each of the three sets.


“twitter.com”, which is the most frequent domain, is assigned to [11-100] in Alexa but to [1-10] in Majestic.
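The assignment of a domain's rank to one of the seven classes can be sketched as follows (function and constant names are illustrative):

```python
import bisect

CLASS_BOUNDS = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
CLASS_LABELS = ["[1-10]", "[11-100]", "[101-1K]", "[1001-10K]",
                "[10001-100K]", "[100001-1M]", "other"]

def rank_class(rank):
    # Map an Alexa/Majestic rank to its popularity class; domains
    # missing from the top-1M list (rank None) fall into "other"
    if rank is None:
        return "other"
    return CLASS_LABELS[bisect.bisect_left(CLASS_BOUNDS, rank)]
```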

(a) Alexa top 1M. (b) Majestic top 1M.

Figure 4.1: Distribution of domains for different classes in top 1M lists.

Domain Frequencies

In Figure 4.2a we can see that a small number of domains make up a large part of all links. In all three cases, the top 10 most common domains constitute over 90 percent of the total number of links. The different classes all show the same distribution pattern, but the first domain in “all links” constitutes a somewhat larger part compared to “link shorteners” and “Bitly links”. Figure 4.2b shows how the number of unique domains differs between the classes and also suggests that the pattern holds for all ranks.
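The curves in Figure 4.2 can be computed from the per-domain counts roughly as follows. This is a sketch of the pre-processing (our actual plotting was done in Matlab); the function name is illustrative.

```python
import numpy as np

def rank_cdf(domain_counts):
    # Sort the domain frequencies in descending order (rank 1 = most
    # frequent) and accumulate the fraction of all links covered by the
    # top-k domains; the CCDF is the complement of the CDF
    freqs = np.sort(np.asarray(domain_counts, dtype=float))[::-1]
    cdf = np.cumsum(freqs) / freqs.sum()
    return cdf, 1.0 - cdf
```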

(a) CDF (b) CCDF

Figure 4.2: Distribution of domain rank.

Relative Ranks and Frequencies

This section presents a 5x5 pair-wise scatter plot that shows the relation between five different categories in our dataset. The first is the number of occurrences of unique domains in all tweets containing links, the second is the number of occurrences of unique domains in all tweets containing a shortened link, and then the same for tweets containing Bitly links. The last two categories are the rank of domains in the Alexa top one million dataset and the rank of domains in the Majestic Million dataset.


Figure 4.3: 5x5 scatter plot that shows the frequencies and ranks of domains in the top 25 of the categories All links, Shortened links, Bitly links, Alexa ranking and Majestic ranking.

To describe Figure 4.3 in more detail, the axes show two different things: for the columns and rows of “All links”, “Shortened links” and “Bitly links” the unit is the number of occurrences of domains in that category, while for “Alexa ranking” and “Majestic ranking” the unit is the rank of the domain, where a lower value means a higher rank. The plot contains the top 25 domains from each category; even if a domain appears in one category's top 25 but not in another's, it is still plotted. The red crosses mark domains that are frequent in our collection but do not appear in the top one million of either ranking; they only appear in the columns and rows of “Alexa ranking” and “Majestic ranking”. Since we do not know the ranking of these domains, we assigned all of them the same value, one million and one (1000001), to make them easier to distinguish from the others. The blue rings are domains that have both an occurrence count and a rank in the two datasets underlying a given subplot. What we can observe is a roughly linear relationship between “Shortened links” and “Bitly links”; beyond that, there are no other obvious relationships. We can, however, spot some minor interesting things. The first is that the highest-ranking domain in both “Alexa ranking” and “Majestic ranking” is present in all categories and is the same in both ranking datasets. The second is that the highest-ranking domain is never the most frequent domain in our collection in any of the categories. For example, in the Bitly-Alexa subplot we can see that the most frequent domain is also the one with the lowest rank among Alexa's top one million.

Phishing Domains

During our collection we did not find any links that matched the database of Phishtank. This was the case both for links posted on Twitter and for shortened links. There were a few cases of what one could call spam and bot behaviour, with accounts tweeting


similar tweets, probably in an attempt to draw users to their sites, but none of these were flagged by Phishtank.

4.2 User Statistics

Age

We will here present figures showing the distribution of account ages for users at the time of posting their tweets.

(a) CDF (b) CCDF

Figure 4.4: Distribution of the account age of users at the time of posting their tweet.

In Figure 4.4a we can see that all four categories are biased towards older accounts rather than new ones, tweets containing Bitly links even more so than the others. Looking at the starting point of the plot, we can see that there is no real difference in the number of fresh accounts used for posting between the different categories of tweets.

The CCDF plot in Figure 4.4b tells the same story as Figure 4.4a: the account age of users tends towards older rather than newer accounts. It also shows that there is no real discrepancy in age between the oldest accounts of the four categories. A bigger fraction of the accounts that posted a Bitly link during our collection are, however, older than in the other categories.

Favourites

The number of tweets that users have favourited is presented here to give an overview of what the interaction rate of users looked like during the collection.

Figure 4.5a shows that more of the users that posted Bitly links during our collection had favourited low numbers of tweets. The jump at around 2·10^2 for both “Bitly tweets” and “Shortener tweets” shows that there is a correlation between these two and that a few accounts are responsible for this jump.

The CCDF plotted in Figure 4.5b shows that “All tweets” is more back-heavy when it comes to favourites, i.e. these users interact more with other tweets than the other categories do, while users who tweeted Bitly links during the collection interact the least.

Number of Tweets

This section presents results on the number of tweets posted by a user.


(a) CDF (b) CCDF

Figure 4.5: Distribution of the number of tweets favourited by users at the time of posting their tweet.

(a) CDF (b) CCDF

Figure 4.6: Distribution of the number of tweets posted by users at the time of posting their tweet.

Figure 4.6a shows that for low numbers of tweets there is not really a difference between the categories, but at the higher end of the spectrum we can see a clear difference before the last jump.

The CCDFs in Figure 4.6b show the same thing; in particular, there is a huge jump in probability. Here, however, the difference between the categories is clearer. Towards the end of the plot we can see that users that posted a tweet containing a Bitly link during our collection were more numerous.

Favourites to Tweets

To look further into the relation between tweeting and interacting with tweets, we here present plots showing the ratio between favourited tweets and posted tweets for users.

There is a real discrepancy between the categories here, as can clearly be seen in Figure 4.7a. What we observe is that users that posted a Bitly link in general tweet more than they favourite other tweets. This goes hand in hand with what has been observed in the two previous sections. Figure 4.7b shows the same: users that posted a tweet containing a Bitly link interact less with other tweets and post more tweets in comparison to the other categories.


(a) CDF (b) CCDF

Figure 4.7: Ratio between tweets favourited and tweeted by users at the time of posting their tweet.

Followers

This section contains plots showing the distribution for the number of followers a user had.

(a) CDF (b) CCDF

Figure 4.8: Distribution of the number of followers for users at the time of posting their tweet.

Figure 4.8a shows that users that tweeted a Bitly link have a higher probability of having more followers than users in the other categories. This can be seen even more clearly in Figure 4.8b. Despite this, we can also see that the user with the most followers did not post a tweet containing any link.

Friends

Here we will show the distribution of friends for users in our dataset. A friend is an account that a user follows on Twitter.

Figures 4.9a and 4.9b tell the same story: the Bitly users include many users with many friends, but the accounts with the most friends are actually those that did not post a link during our collection, but rather posted a normal tweet.

Followers vs Friends

We found that a scatter plot of followers vs friends for users produced an interesting figure, showing both the relation between them and a few limitations when it comes to following users.


(a) CDF (b) CCDF

Figure 4.9: Distribution of the number of friends for users at the time of posting their tweet.


Figure 4.10 shows that the followers-to-friends ratio follows a distinguishable pattern, marked with two lines. The data shows that in order to have more than 5,000 friends a user needs to have at least as many followers as friends. The occurrences of users who have over 5,000 friends and still more friends than followers can be explained by the fact that it is possible to lose followers while keeping all friends. We can also see that users follow a trend of roughly equal following, though it could also be interpreted as slightly skewed towards more friends than followers. For a figure showing each of the categories plotted on its own, see Figure A.1a in Chapter A.

Unique and Verified Users

Here we show how many unique and verified users there are in our dataset. A verified user is an account of public interest [30].

Category            Unique Users   Verified   %
All Tweets          12253599       53326      0.44%
Link Tweets         2905502        28736      0.99%
Shortener Tweets    245984         2859       1.16%
Bitly Tweets        112682         1856       1.65%

Table 4.4: Amount of unique users for each category and how many of those users are verified.

Table 4.4 shows that, in terms of percentage, each category has more verified users than the one before it. In other words, verified users make up a bigger share of the Bitly subset than of the other subsets.

4.3 Bitly Link Interaction

In this section we look into how the number of clicks for a Bitly link found in a tweet relates to the number of retweets of the same tweet. Since the Bitly statistics cannot tell us whether a click came from the tweet we collected, we ignore Bitly links that were created more than 10 minutes before the tweet was posted. By doing this we can be sure that no clicks come from an old tweet containing the same link. Another factor to take into account is that the data gathered in the second phase was collected at different times for each tweet.
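The 10-minute filtering rule can be sketched as follows, assuming both timestamps have been converted to Unix seconds; the field names are illustrative.

```python
MAX_LINK_AGE_S = 10 * 60  # links older than 10 minutes are ignored

def fresh_bitly_rows(rows):
    # Keep a row only if its Bitly link was created at most 10 minutes
    # before the tweet was posted, so that no earlier tweet can have
    # contributed clicks to the link
    return [r for r in rows
            if 0 <= r["tweet_posted"] - r["link_created"] <= MAX_LINK_AGE_S]
```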

Figure 4.11 shows, as a CDF, the distribution of the delay between retrieving the retweet data and retrieving the Bitly data, for the tweets where we could collect all data in phase 2. Two lines are displayed showing at which times the probability is 0.33 and 0.66. The delay times for these probabilities are 105 minutes and 240 minutes respectively, meaning that if we were to pick only the tweets with a maximum delay of 105 minutes we would keep only 33% of all tweets.

Figure 4.12a shows the ratio between retweets and clicks for Bitly links that were no older than 10 minutes when the tweet was posted. The two thresholds shown in Figure 4.11 are also included, which means that “105 Minutes” is a subset of “240 Minutes”, which is a subset of “All”.

We can see that there is a compact cluster right below the “Equal Ratio” line, which means that a lot of tweets have more retweets than clicks on the embedded Bitly link. There is also no distinguishable difference between the patterns of the different sets, which suggests that the delay in the second-phase collection had little impact on the clicks-to-retweets ratio.

Figure 4.12b shows the logarithmic average of Bitly clicks per retweet count for all Bitly links that were no older than 10 minutes when the tweet was posted. Similar to Figure 4.12a there is a cluster under the “Equal Ratio” line, but the figure also shows that tweets with fewer than 30 retweets tend to have a higher clicks-to-retweets ratio.


Figure 4.11: Distribution of the delay between retrieving the retweet data and the Bitly data for a tweet.

(a) Bitly clicks-to-retweets-ratio for different sets. (b) Logarithmic average of Bitly clicks per retweet.


4.4 Verified vs Non-verified Users

In this section we investigate whether there are any noticeable differences between verified and non-verified users.

Bitly Clicks

When comparing the number of clicks for Bitly links, we discard all links that were older than 10 minutes when the corresponding tweet was posted.

(a) Non-verified users. (b) Verified users.

Figure 4.13: Clicks-to-followers ratio for Bitly links.

When comparing Figure 4.13a and Figure 4.13b we can see that verified users tend to get more clicks if they have more followers, while non-verified users seem to be able to get up to 10,000 clicks even with a low number of followers. This is probably because some links are posted multiple times by different users, which we cannot detect; we noticed this kind of behaviour when manually searching for different Bitly links on Twitter. It makes it hard to draw further conclusions about the behaviour of tweets posted by non-verified users. The effect of links being posted multiple times could also occur for links posted by verified users, but should not have a meaningful impact there. Something to take note of is that even though a verified user may have a lot of followers, only a small number of those followers interact with their tweets.

Followers, Number of Tweets and Retweets

This section presents three pairs of heat-maps that compare different user statistics between verified and non-verified users. The first one is retweets vs followers.

The heat-maps in Figure 4.14 show a difference between non-verified and verified users: there is an indication of a vertical streak of higher density in Figure 4.14a, while there is more of a blob formation in Figure 4.14b. These higher-density structures indicate different things. The vertical streak indicates that there are many accounts with around the same number of followers that have received very different numbers of retweets, while the blob formation indicates that many accounts with the same number of followers also received the same number of retweets.

Figure 4.15 shows how non-verified and verified users differ when it comes to retweets vs the number of tweets they post. The interesting thing here is that the pattern seen in Figure 4.14 still holds: the vertical streak and the blob formation are still there, albeit at slightly different values. This indicates that the relation between retweets and number of tweets looks much the same as the relation between retweets and followers.


4.4. Verified vs Non-verified Users

(a) Non-verified users. (b) Verified users.

Figure 4.14: Heat-map of retweets vs followers.

(a) Non-verified users. (b) Verified users.

Figure 4.15: Heat-map of retweets vs number of tweets.

To strengthen this observation, we now present heat-maps of the relation between followers and the number of tweets for verified and non-verified users.

This last pair, seen in Figure 4.16, shows that there is a linear relationship between followers and the number of tweets for a user. Combined with the results above on how retweets relate to these two fields, this shows that followers and number of tweets are interchangeable when compared against retweets. Furthermore, because of the linear relationship between followers and number of tweets, we can say that those who post more tweets will most likely also gain more followers.
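The linearity visible on the log-log axes of Figure 4.16 can be quantified with a Pearson correlation of the log-transformed values. A sketch on synthetic data (the proportionality and noise levels below are illustrative assumptions, not fitted to the thesis data):

```python
import numpy as np

rng = np.random.default_rng(1)
followers = rng.lognormal(mean=6, sigma=2, size=5000)
# Tweets roughly proportional to followers, with multiplicative noise
tweets = followers * np.exp(rng.normal(0, 0.3, size=5000))

# Pearson correlation of the log-transformed values: close to 1 for
# a linear streak on log-log axes like the one in Figure 4.16.
r = np.corrcoef(np.log10(followers), np.log10(tweets))[0, 1]
```

A value of `r` near 1 indicates the kind of linear log-log relationship described above.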

As can be seen in the heat-maps, the highest-density regions (yellow) in Figures 4.14 and 4.15 are bunched up at 10^0 = 1 retweets. We attribute this to a specific group of users that we call informational high-volume posters. These accounts have many followers and post many tweets but get very few retweets. Upon manually sampling accounts from this category, we found some common denominators for all of them. First, as stated, they post many tweets, multiple ones per day. Second, they have many followers. Third, the tweets they post are mostly informational; one example we found was a verified Twitter account providing live traffic updates about accidents, blockades and other traffic-related information. Fourth, the few retweets these accounts received almost always came from the same one or two accounts. These are the observations we made, but we do not have a concrete explanation for why they receive so few retweets despite so many followers and posts.
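The pattern described above can be expressed as a simple filter over per-user statistics. A sketch with illustrative thresholds and field names of our own choosing (the thesis does not define exact cut-offs):

```python
def is_informational_high_volume(user, follower_min=10_000,
                                 tweet_min=10_000, retweet_max=1):
    """Heuristic matching the observed pattern: many followers,
    many posted tweets, but (almost) no retweets received.
    Thresholds are illustrative, not the ones used in the thesis."""
    return (user["followers"] >= follower_min
            and user["tweets"] >= tweet_min
            and user["retweets"] <= retweet_max)

users = [
    {"followers": 50_000, "tweets": 80_000, "retweets": 0},   # traffic-update style account
    {"followers": 50_000, "tweets": 500,    "retweets": 300}, # ordinary account
]
flagged = [u for u in users if is_informational_high_volume(u)]
```

Only the first synthetic account matches the heuristic.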



(a) Non-verified users. (b) Verified users.

Figure 4.16: Heat-map of followers vs number of tweets.

4.5 Miscellaneous Statistics

There is some data that we encountered during our research that is not directly connected to our research questions but still gives good insight and might be of use for future work. This data is presented briefly in this section.

Category          Total      Geo-coded  %
All Tweets        25482108   284633     1.12%
Link Tweets       4026101    66364      1.65%
Shortener Tweets  322954     1952       0.604%
Bitly Tweets      159143     981        0.616%

Table 4.5: Breakdown of the number of geo-coded tweets for each category and the percentage those tweets make up of the total number of tweets for that category.

Table 4.5 shows an overview of how many geo-coded tweets there are for each category and the percentage those tweets make up within that category. This can be an indication of how many users have their location services turned on. In this case we can see that roughly 1% of all users have their location services turned on for Twitter.
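The percentages in Table 4.5 follow directly from the counts; recomputing them is a quick sanity check:

```python
# (total tweets, geo-coded tweets) per category, from Table 4.5
categories = {
    "All Tweets":       (25_482_108, 284_633),
    "Link Tweets":      (4_026_101,  66_364),
    "Shortener Tweets": (322_954,    1_952),
    "Bitly Tweets":     (159_143,    981),
}
shares = {name: 100 * geo / total
          for name, (total, geo) in categories.items()}
```

Rounded, the values match the table (e.g. 1.12% of all tweets are geo-coded).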

Language         Count      %
1 English        7899732    31.0%
2 Japanese       5021194    19.7%
3 Spanish        2159004    8.47%
4 Undetermined   1660485    6.52%
5 Portuguese     1556532    6.11%

(a) Top 5 languages for all tweets.

Language         Count      %
1 English        1543250    38.3%
2 Japanese       606613     15.1%
3 Spanish        358266     8.9%
4 Arabic         338184     8.4%
5 Undetermined   272351     6.76%

(b) Top 5 languages for link tweets.

Language         Count      %
1 English        124018     38.4%
2 Japanese       68786      21.3%
3 Spanish        31064      9.62%
4 Korean         29480      9.13%
5 Undetermined   21832      6.76%

(c) Top 5 languages for shortener tweets.

Language         Count      %
1 English        67265      42.3%
2 Japanese       34880      21.9%
3 Spanish        16417      10.3%
4 Korean         11137      7%
5 Undetermined   7927       4.98%

(d) Top 5 languages for Bitly tweets.

Table 4.6: Breakdown of top 5 languages for tweets of the four different categories.


Language         Count      %
1 English        11688519   45.9%
2 Japanese       4399110    17.3%
3 Spanish        2360274    9.26%
4 Portuguese     1710093    6.71%
5 Thai           1129995    4.43%

(a) Top 5 languages for all users.

Language         Count      %
1 English        2030855    50.4%
2 Japanese       537240     13.3%
3 Spanish        403139     10.0%
4 Arabic         242842     6.03%
5 Portuguese     238772     5.93%

(b) Top 5 languages for users that posted link tweets.

Language         Count      %
1 English        153798     47.6%
2 Japanese       63500      19.7%
3 Spanish        34968      10.8%
4 Korean         15773      4.88%
5 Thai           8567       2.65%

(c) Top 5 languages for users that posted shortener tweets.

Language         Count      %
1 English        78661      49.4%
2 Japanese       32397      20.4%
3 Spanish        17693      11.1%
4 Korean         7420       4.66%
5 French         3187       2%

(d) Top 5 languages for users that posted Bitly tweets.

Table 4.7: Breakdown of top 5 languages for users that posted tweets of the four different categories.

The sub-tables seen in Tables 4.6 and 4.7 show the top 5 languages for both tweets and users on Twitter. We also provide statistics on how this distribution looks for the subset categories used throughout the document, as well as the percentage each language makes up. One observation is that there is a discrepancy between the percentage of users that tweet in English and the percentage that have their account language set to English.
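Rankings like those in Tables 4.6 and 4.7 can be computed by counting the `lang` field of the collected tweet objects (the field name follows the Twitter API's tweet JSON, where "und" denotes an undetermined language). A minimal sketch:

```python
from collections import Counter

def top_languages(tweets, n=5):
    """Rank the `lang` field of tweet dicts and return
    (language, count, percentage) tuples for the top n."""
    counts = Counter(t.get("lang", "und") for t in tweets)
    total = sum(counts.values())
    return [(lang, c, 100 * c / total)
            for lang, c in counts.most_common(n)]

# Tiny synthetic sample for illustration
sample = [{"lang": "en"}, {"lang": "en"}, {"lang": "ja"}, {"lang": "und"}]
ranking = top_languages(sample)
```

Applied to each tweet subset (all, link, shortener, Bitly), this yields one sub-table per category.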


5 Discussion

To further expand on the reasoning behind the results presented in Chapter 4, we will look into how and why some of the results look as they do. Furthermore, we will look at how to improve our methodology and evaluate the limitations described in Chapter 3. This chapter also presents why the work we have done might be useful in a wider context.

5.1 Results

Many interesting results were obtained during this project; here we list the most significant ones that help answer our research questions.

When it comes to top domains, Twitter accounts for a huge share of occurrences. This is because every time a tweet is mentioned or retweeted, the link to it is embedded in the tweet, which makes the number of Twitter links higher than any other. Another domain that stuck out is "du3a.org", which is number two on our list; we have no explanation for why this domain is tweeted so much. Other than that, the top entries are big sites, along with a few other outliers similar to "du3a.org". These seem to be part of some sort of Twitter-specific behaviour, as they do not rank very high on Alexa or Majestic.

Our top shortener is "Bitly" by a good margin; after that, "Youtube"'s shortener is also posted a lot. The explanation is that when you share a video from "Youtube", the link is automatically shortened. This can be observed by going to any "Youtube" video and clicking the share button, which provides a shortened link. The other shorteners were not used to nearly the same extent during our collection.

After expanding all shortened links to their redirection targets and looking at the top domains, the distribution is somewhat different, but not hugely so. The interesting part is to look at the links from different subsets. Starting with the links from link shorteners, we can see that "Youtube" is in the lead, since all "youtu.be" links redirect to that domain. The same cannot be said for "Bitly", which is instead used to shorten a variety of links and therefore does not hold the top spot. One interesting domain is "twittascope.com", which we found points to a horoscope site; we observed that all of the "twittascope.com" links were shortened with "Bitly".
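Expanding a shortened link amounts to following its HTTP redirect chain and keeping the domain of the final URL. A sketch of the domain-extraction step (the `requests` call in the comment is an illustrative assumption; any HTTP client that follows redirects works):

```python
from urllib.parse import urlparse

def final_domain(url):
    """Domain of a (resolved) link, e.g. the URL reached after
    following a shortener's redirect chain."""
    return urlparse(url).netloc.lower().removeprefix("www.")

# Resolving a shortener requires a network request; with the
# `requests` library this could look like:
#   resolved = requests.head(short_url, allow_redirects=True).url
#   domain = final_domain(resolved)
```

Counting `final_domain` over all resolved links gives the top-domain distribution discussed above.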

There was a clear difference in the popularity distribution of domains depending on whether we compared against "Alexa" or "Majestic". When compared to "Alexa", about 3 · 10^5 of our domains
