A Comparison of Clustering the Swedish Political Twittersphere Based on Social Interactions and on Tweet Content

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2016

THOMAS VAKILI

KTH

Bachelor’s Thesis in Computer Science at CSC

Supervisor: Mårten Björkman

Examiner: Örjan Ekeberg

Abstract

This thesis evaluates and compares two strategies for clustering users in Sweden’s political Twittersphere: clustering based on tweet content and clustering based on social interaction data.

Users were detected by filtering a stream of tweets on a list of politically charged keywords. The 10 % of the detected users with the most followers were selected, and their social interaction data as well as 2,000 of their latest tweets were downloaded. The gathered data was used to construct one similarity matrix for each of the strategies studied. Spectral clustering of the matrices was performed to form two separate sets of clusters, one based on tweet content and one based on social interactions.

Referat

A comparison between clustering the Swedish political Twittersphere based on content and on social interactions

This thesis evaluates and compares two different strategies for clustering users in Sweden’s political Twittersphere: clustering based on tweet content and clustering based on social interactions.

Chapter 1

Introduction

Social media usage has increased greatly during the last few years, and an estimated 76 % of all Swedes use social media in some form (Findahl & Davidsson 2015). It is not uncommon to use social media channels as a primary source for finding news and for forming opinions on contemporary issues. Around 500 million tweets are posted to Twitter every day (Krikorian 2013), and the opinions and news articles that reach a given user depend on what data breaks through this noise of information.

Regardless of which strategies a user employs in his or her search for information, the information and the opinions that dominate the stream of tweets will be affected by the size and strength of the various political camps posting to Twitter. This may in turn skew the Twitter user’s view of which political sentiments dominate and which news is to be considered important (Van Alstyne & Erik 2005). Theories about which political affiliations dominate on Twitter are mostly based on anecdotes and the speculators’ own Twitter feeds. While there have been numerous studies on clustering tweets and Twitter users, relatively few of these studies have focused on the political side of Twitter. There is reason to believe that the Twitter habits of political Twitter users differ from those of users tweeting about technology and hobbies, as political discourse tends to be more polarized.

Developing methods for finding and analyzing political clusters on Twitter can be useful both for informing the public and for helping policymakers gain a better understanding of what their constituents want.

1.1 Related Work

Previous studies such as Gruzd et al. (2014) have examined political polarization on Twitter. There does, however, seem to be a gap concerning the technical aspects of clustering political Twitter users. Studies have tended to focus more on the consequences and existence of polarization rather than what factors are effective parameters for clustering users.

The research that has been leveled at developing clustering techniques, such as the results put forward by Zhang et al. (2012) and by Ifrim et al. (2014), has focused on topic based clustering. This approach has been successful in detecting events and in determining the topics that distinguish various users, such as whether users primarily tweet about technology or about politics. However, there are reasons to believe that more detailed information such as a user’s political affiliation may be hard to detect because of the short and terse nature of Twitter posts.

Investigating this will not only provide information for furthering the clustering methodology but may also lead to insights on social media behaviour. Discovering what behaviours are indicative of political affiliation can provide insights into the nature of political exchanges online, to what degree cross-partisan conversations are taking place, and whether social media is acting as a force for ideological polarization or as a force for unity.

1.2 Problem Statement

This paper will study the Swedish political Twittersphere by clustering Twitter users based on their tweets as well as on their social interactions and then analyzing and comparing the different clusters formed. This can then be used to evaluate the two clustering approaches and the goal is to answer the following question:

What are the advantages and drawbacks of clustering political Twitter accounts based on tweet content or on social interactions?

Chapter 2

Theoretical Background

In this section we shall define the terminology that will be used in the rest of this report as well as develop an understanding for the techniques that will be further described in the method section.

2.1 Twitter

Twitter is a social media service focused on microblogging. The service is centered around short status updates called tweets that must adhere to a 140 character limit. These tweets are then gathered on the author’s timeline.

Figure 2.1. A tweet by Elin Andersson retweeting Isobel Hadley-Kamptz’s tweet about net neutrality.

The social dimension of Twitter revolves around the concept of users following each other (an account following a user is called that user’s follower, and an account that a user follows is called that user’s friend). When a user follows another user, updates from the followed user’s timeline will appear on the follower’s home page. Tweets can also be retweeted, which results in the tweet being reissued by the retweeting user on his or her own timeline.

2.2 Clustering - Unsupervised Categorization

An important application of machine learning is categorizing data into different groups. The two main schools of machine learning, supervised and unsupervised learning, map to their corresponding methods of categorization: classification and clustering. Classification techniques categorize data based on precategorized training data whereas unsupervised clustering works by finding groups within a data set based on distance or similarity metrics between its data points.

This study is about clustering, the unsupervised method for categorizing data. Specifically, the clustering algorithm used in this study is called spectral clustering. This algorithm works by solving a relaxed version of the minimum normalized cut problem. The problem is solved by dividing a graph such that the combined weights of the edges between the new subgraphs are as small as possible compared to the weights between nodes within the subgraphs (von Luxburg 2007). Subgraphs (i.e. clusters) found by the spectral clustering algorithm have member nodes that are heavily connected to each other.

In our case, every user is a node and the graph is represented by an adjacency matrix. In the context of spectral clustering, this matrix is called the similarity matrix and is the main input for the algorithm. Constructing the similarity matrix is analogous to constructing an adjacency matrix for a complete graph, where the nodes are users and the edges between them are weighted by their calculated similarities. The spectral clustering algorithm is then provided this similarity matrix and uses its eigenvalues and eigenvectors to reduce the dimensionality of its data. Finally, this lower dimension data is clustered using a second clustering technique and the resulting clusters are returned.

One challenging aspect of the spectral clustering algorithm is that it requires a prespecified cluster count. This value, conventionally denoted as k, varies depending on the goal of the clustering and the structure of the data. Too large values for k result in natural clusters being scattered and too small values for k lead to large clusters that lack in meaning. There is no consensus on how to choose the number of clusters, and different data call for different strategies.
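To make the idea behind spectral clustering concrete, the following is a minimal pure-Python sketch of the simplest case, k = 2. It is not the implementation used in this study: the function name `spectral_bipartition`, the use of power iteration, and the toy similarity matrix are all assumptions made for illustration. The sketch finds the second eigenvector of the normalized similarity matrix and assigns each node to a cluster based on the sign of its entry.

```python
import math


def spectral_bipartition(W, iterations=200):
    """Split a weighted graph into two clusters (relaxed normalized cut, k = 2).

    W is a symmetric similarity matrix given as a list of lists; every row
    is assumed to have a positive sum.
    """
    n = len(W)
    degree = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) for d in degree]

    # M = D^(-1/2) W D^(-1/2) + I. M shares its eigenvectors with the
    # normalized Laplacian; adding I keeps all eigenvalues non-negative,
    # so power iteration converges to the largest remaining one.
    M = [[W[i][j] * inv_sqrt[i] * inv_sqrt[j] + (1.0 if i == j else 0.0)
          for j in range(n)] for i in range(n)]

    # The leading eigenvector of M is known in closed form: D^(1/2) * 1.
    top = [math.sqrt(d) for d in degree]
    top_norm = math.sqrt(sum(x * x for x in top))
    top = [x / top_norm for x in top]

    v = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        # Remove the trivial leading component, then apply M and renormalize.
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]

    # v now approximates the second eigenvector; the sign of each entry
    # assigns the corresponding node to one of the two clusters.
    return [0 if x < 0 else 1 for x in v]


# Hypothetical toy data: two tight groups of three users, weakly linked.
W = [[0.0 if i == j else (0.9 if (i < 3) == (j < 3) else 0.1)
      for j in range(6)] for i in range(6)]
labels = spectral_bipartition(W)
```

On this toy matrix the first three nodes receive one label and the last three the other. For larger k, several eigenvectors are kept and a second clustering step is run on them, as described above.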

2.3 Cosine Similarity

A similarity metric is needed to establish the adjacency of the nodes in the graph that is to be clustered. Not all data is available neatly packed in numerical form, but the algorithms at our disposal are based on mathematical models concerning numbers and vectors. A crucial part of conducting research using machine learning is translating real world data into vectorized data that can be run through the algorithms.

Cosine similarity is a measure of the similarity between two vectors, and is well suited for use with text (Alodadi et al. 2015). The data at hand is converted into vectors and the similarity between two vectors is determined by measuring the angle between them. The cosine similarity of two vectors a and b is calculated as follows:

\[ \mathrm{similarity}(a, b) = \frac{a \cdot b}{|a| \cdot |b|} \]
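The formula translates directly into code. The sketch below assumes that vectors are stored sparsely as dictionaries mapping a key (for example a word) to a weight; the function name and the choice to return 0 for an empty vector are illustrative assumptions, not details taken from the study.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity({"x": 1.0}, {"x": 3.0})
no_overlap = cosine_similarity({"x": 1.0}, {"y": 2.0})
```

Vectors pointing in the same direction score 1 regardless of their length, while vectors with no shared non-zero dimensions score 0.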

2.4 Hybrid TF-IDF

TF-IDF stands for Term Frequency - Inverse Document Frequency and is a metric for assigning weighted scores to words based on their importance in characterizing a certain document (Ramos 2003). The term frequency part of the metric is simply the number of times a given term (in our case, a word) occurs in an examined document, divided by the number of terms in the document. This value is then combined with the inverse document frequency of the word, calculated as the total number of documents divided by the number of documents in which the term occurs, usually on a logarithmic scale. This penalizes words that occur very frequently in other documents and that are unlikely to be characteristic of the document we are examining.

By combining the two values we are able to find words that are likely to be characteristic of a given document because of their high term frequency, while at the same time filtering out words that occur often but are not characteristic of the document. Examples of words that are frequent but not interesting in this context, and that are penalized by the algorithm, are stop words such as “the”, “a”, “of”, etc. There are many different variations of TF-IDF, and in this thesis we will be using a variant adapted specifically for Twitter. It is called the Hybrid TF-IDF algorithm and was developed by Sharifi et al. (2010). This variant defines the document as the concatenation of a user’s tweets when calculating the term frequency, but then redefines the documents as the individual sentences in the body of the tweets when calculating the document frequency. The mathematics can be summarized as follows:

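\[ \mathrm{tf\text{-}idf}(w) = \mathrm{tf}(w) \cdot \log_2(\mathrm{idf}(w)) \]

\[ \mathrm{tf}(w) = \frac{\#\,\text{Occurrences of the word } w \text{ in all posts}}{\#\,\text{Words in all posts}} \]

\[ \mathrm{idf}(w) = \frac{\#\,\text{Sentences in all posts}}{\#\,\text{Sentences in which the word } w \text{ appears}} \]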

The justification for this adaptation is the short nature of tweets which hampers the quality of traditional TF-IDF implementations, where the document is defined the same way for both the TF and the IDF calculations.
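The three formulas can be sketched in a few lines of Python. The sketch assumes that a user’s posts have already been split into sentences and tokenized into words; how the study performed sentence splitting and tokenization is not specified here, so the input format, the function name `hybrid_tf_idf` and the example words are illustrative assumptions.

```python
import math
from collections import Counter


def hybrid_tf_idf(sentences):
    """Score every word in a user's posts with the Hybrid TF-IDF formulas.

    sentences: list of sentences, each given as a list of word tokens.
    Returns a dict mapping each word to its score.
    """
    words = [word for sentence in sentences for word in sentence]
    term_counts = Counter(words)
    # Count each word at most once per sentence for the document frequency.
    sentence_counts = Counter(
        word for sentence in sentences for word in set(sentence))
    total_words = len(words)
    total_sentences = len(sentences)
    return {
        word: (term_counts[word] / total_words)
        * math.log2(total_sentences / sentence_counts[word])
        for word in term_counts
    }


# Hypothetical, already tokenized sentences from one user's tweets.
example_sentences = [
    ["jobs", "and", "tax"],
    ["school", "and", "tax"],
    ["and", "health"],
    ["and", "jobs"],
]
scores = hybrid_tf_idf(example_sentences)
```

In this example the word "and" appears in every sentence, so its idf is 1 and its score is 0, which is the stop word penalty described above. The resulting dictionaries can be used directly as the sparse word vectors that are compared with cosine similarity.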

Chapter 3

Method

The study was conducted in several steps. First, data was collected from Twitter. Then, content based and social interactions based similarity matrices were constructed. After this, the clustering algorithm was applied to both matrices. Lastly, the clusters formed were analyzed and evaluated.

Figure 3.1. Flow chart illustrating the process of gathering user data, computing the similarities and clustering.

3.1 Data Collection

Data was collected using the Twitter Streaming API by listening to the stream and filtering on the list of political keywords described further in Appendix A. The tweets and their metadata were then saved to disk. Tweets were collected for seven days to account for various temporal differences in the tweeting habits of different users. This data was then used to extract the users, who were assumed to be political tweeters by virtue of being detected in the stream. User data included in our data set consisted of all users’ follower lists, their friend lists, and 2,000 of their most recent tweets.

Because of the scope, time and resources available for this study we chose to limit ourselves to a subset of the users found. The 12,682 detected users were sorted based on their follower count and the top 10 % of the users were chosen for further analysis and thus the size of our sample was limited to 1,268 users.
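The selection step can be sketched as follows. The function name and the synthetic follower counts are hypothetical, and rounding down is an assumption that happens to reproduce the reported sample size of 1,268 out of 12,682 users.

```python
def top_fraction_by_followers(follower_counts, fraction=0.10):
    """Return the users with the most followers.

    follower_counts: dict mapping a user name to that user's follower count.
    The number of users kept is rounded down (an assumption).
    """
    keep = int(len(follower_counts) * fraction)
    ranked = sorted(follower_counts, key=follower_counts.get, reverse=True)
    return ranked[:keep]


# Synthetic stand-in for the detected users: user i has i followers.
detected = {f"user{i}": i for i in range(12682)}
sample = top_fraction_by_followers(detected)
```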

3.2 Calculating the Content Similarity Between Users

Cosine similarity was the similarity measure of choice for measuring content similarity. The vectors representing the tweets’ contents were constructed using the Hybrid TF-IDF explained in the background section of this thesis.

Vectors were constructed in which every word found in a user’s 2,000 tweet corpus was assigned an index in a word vector. The elements of the vector were then assigned the TF-IDF values of their corresponding words, where the TF-IDF value was based on the given user’s corpus of tweets. The similarity between two users was then calculated by measuring the cosine similarity of the given users’ word vectors.

3.3 Calculating the Similarities Between Users’ Social Interactions

The similarity of the users’ follower lists was determined by calculating the number of shared followers and dividing this by the maximum possible number of shared followers, i.e. the number of followers of the user with the fewest followers. This is motivated by the fact that there was a large disparity in follower counts, which in turn affected the similarity between users. Follower count was assumed to be a poor indicator of dissimilarity in the context of political affiliation. This method was also used to compute the similarity between users’ friend lists. The measure can be summarized as follows, given two user sets A and B:

\[ \mathrm{similarity}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \]
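With follower or friend lists stored as Python sets of user ids, the measure is a one-liner. The function name and the handling of empty lists are assumptions made for this sketch.

```python
def overlap_similarity(a, b):
    """|A ∩ B| / min(|A|, |B|) for two sets of user ids."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # assumed convention: an empty list shares nothing
    return len(a & b) / smaller


# A small account whose followers all also follow a much larger account.
small = {1, 2, 3}
large = set(range(1, 1001))
nested = overlap_similarity(small, large)
```

Because the denominator is the size of the smaller set, the small account above is maximally similar to the large one even though the follower counts differ by orders of magnitude, which is the motivation given above.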

The similarity of the users’ follower and friend lists was then combined with a value reflecting whether or not two given users followed each other, as this is assumed to be an indicator of similarity and social closeness. A value of 1 was given when the follow relationship was mutual, while a value of 0.5 was given when it was unilateral. If neither user followed the other, the value was set to 0.

Another included metric was the similarity of the users’ retweet networks. These were constructed by creating vectors to represent all the users retweeted by a given user. The retweet similarity of two users was calculated using the cosine similarity of the users’ corresponding retweet vectors.
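The follow relationship value can be sketched as below, assuming the friend lists are available as a dictionary from each user to the set of accounts that user follows. The name `follow_value` and the data layout are hypothetical.

```python
def follow_value(u, v, friends):
    """Return 1.0 for a mutual follow, 0.5 for a one-way follow, else 0.0.

    friends: dict mapping each user to the set of accounts that user follows.
    """
    u_follows_v = v in friends.get(u, set())
    v_follows_u = u in friends.get(v, set())
    # Each direction of the relationship contributes half of the value.
    return (u_follows_v + v_follows_u) / 2


# Hypothetical friend lists for three accounts.
friends = {
    "anna": {"bertil", "cecilia"},
    "bertil": {"anna"},
    "cecilia": set(),
}
```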

3.4 Constructing the Similarity Matrices

Two similarity matrices were constructed, one based on content and one based on social interactions. The content based similarity matrix was constructed by computing the content similarity between every pair of users.

Constructing the social interactions based similarity matrix, while similar, required an additional weighting step as it included all of the four different measures described in the previous section. Four matrices were constructed, one for each social interactions based similarity measure. These four matrices were then weighted and summed as follows:

Social Matrix = Retweets · 0.35 + Mutuality · 0.35 + Followers · 0.15 + Friends · 0.15

The weights used in this study were determined experimentally, as there does not seem to exist any conclusive research regarding the optimal weighting of similarity matrices based on social interactions. The weights were chosen by systematically evaluating the clusters formed by differently weighted social matrices and choosing the weights that seemed to give the best clusters.
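The weighted sum can be sketched with plain nested lists. The weights are the ones stated above; the function name `combine_matrices`, the dictionary layout and the 2 × 2 example matrices are assumptions made for illustration.

```python
WEIGHTS = {"retweets": 0.35, "mutuality": 0.35, "followers": 0.15, "friends": 0.15}


def combine_matrices(matrices, weights=WEIGHTS):
    """Element-wise weighted sum of equally sized square similarity matrices.

    matrices: dict mapping a measure name to its n x n matrix (list of lists).
    """
    n = len(next(iter(matrices.values())))
    return [
        [sum(weights[name] * matrices[name][i][j] for name in weights)
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical similarity values for a single pair of users.
example = {
    "retweets": [[1.0, 0.2], [0.2, 1.0]],
    "mutuality": [[1.0, 1.0], [1.0, 1.0]],
    "followers": [[1.0, 0.4], [0.4, 1.0]],
    "friends": [[1.0, 0.0], [0.0, 1.0]],
}
social = combine_matrices(example)
```

Since the weights sum to 1 and each measure lies between 0 and 1, the combined similarity also stays between 0 and 1.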

3.5 Clustering

As was mentioned earlier, a challenge when using spectral clustering is determining k: the number of clusters to cluster for. We determined this value by experimenting with the data set for different values of k. The value settled on was k = 100. This value as well as the similarity matrices were used as input for the spectral clustering algorithm.

The spectral clustering implementation used in this study is the one provided by scikit-learn, a machine learning toolkit for the Python programming language (Garreta & Moncecchi 2013). The algorithm chosen for clustering after dimensionality reduction was the discretization algorithm further described in Yu et al. (2003).

3.6 Analyzing the Clusters

The quality of the clusters was determined by manually inspecting them. A cluster was determined to be of high quality if there was a high degree of similarity in the ideology, partisanship or profession of its users. This was determined by examining the self-provided descriptions of the users as well as by inspecting their tweets.

Chapter 4

Results

The results of the various similarity measures are visualized by presenting a sorted similarity matrix for each clustering approach as well as a table of significant users in high quality clusters. Each pixel in the image represents the similarity between a given pair of users, with higher levels of similarity resulting in darker shades. Clusters can be spotted by looking at the diagonal of the matrix.
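Sorting a similarity matrix into clusters amounts to reordering its rows and columns so that users with the same cluster label become adjacent. The following sketch shows one way to do this; the function name and the 4 × 4 example are hypothetical and not taken from the study’s code.

```python
def sort_by_cluster(similarity, labels):
    """Reorder rows and columns so that members of a cluster are adjacent.

    Returns the reordered matrix and the permutation that was applied.
    """
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    reordered = [[similarity[i][j] for j in order] for i in order]
    return reordered, order


# Hypothetical data: users 0 and 2 are similar, as are users 1 and 3.
similarity = [
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, 0.8],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, 0.8, 0.1, 1.0],
]
labels = [1, 0, 1, 0]
sorted_matrix, order = sort_by_cluster(similarity, labels)
```

After reordering, the high similarities sit in square blocks along the diagonal, which is what appears as dark squares in the figures.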

4.1 Content Based Clustering

Clustering on content resulted in 100 clusters. The median cluster size was 11, with the smallest cluster containing a single user and the largest containing 40 users.

Figure 4.1. Content based similarity matrix sorted into clusters and centered around the diagonal.

Table 4.1. A sample of interesting clusters formed by content based clustering.

| Cluster | Size | Characteristic users | Explanation |
|---|---|---|---|
| #0 | 26 | kentekeroth, ingridcarlqvist, RichardJomshof, Avpixlat, friatider | Far right |
| #9 | 22 | LOSverige, FacketKommunal, TCOSverige, Unionen, Eva_Nordmark | Labour unions |
| #10 | 10 | Utrikesdep, CBildt, Bistandsdebatt, FNforbundet, Palmecenter | Foreign policy |
| #15 | 17 | sr_nyheter, sr_ekot, medierna, MarcusEkot, LindaThulin | Radio Sweden |
| #17 | 8 | annaklarabratt, Feministerna, SorayaPostFi, gudschy, StinaSvensson | Feminism |
| #27 | 12 | ModeratSthlm, sthlmssossar, sthlmsvanstern, MP_Sthlms_Stad, svtstockholm | Stockholm |
| #31 | 15 | kdriks, BuschEbba, skyttedal, Ladaktusson, CarolineSzyber | Christian Democrats (KD) |
| #43 | 25 | interasistmen, nyheteridag, William_Hahne, VingeHenrik, StiftelsenExpo | Mix of far right and anti-racists |
| #58 | 28 | nya_moderaterna, KinbergBatra, alliansswe, kentpersson, GustafReinfeldt | Moderate Party (M) |
| #87 | 37 | RebeccaWUvell, hanifbali, Rodgronrora, LinusBylund, pophoger | Various right wingers |

4.2 Clustering Based on Social Interactions

Clustering on social interactions resulted in 90 clusters with at least one user and 10 empty clusters. The median cluster size was 9, with the largest cluster containing 59 users.
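Summary figures of this kind can be derived from the list of cluster labels. The sketch below counts empty clusters explicitly, since labels that are never assigned do not show up in the label list. Whether the reported median includes the empty clusters is not stated, so including them here is an assumption, as are the function name and the toy labels.

```python
from collections import Counter
from statistics import median


def cluster_size_summary(labels, k):
    """Summarize cluster sizes for labels in the range 0 to k - 1."""
    counts = Counter(labels)
    # Clusters that received no users get an explicit size of zero.
    sizes = [counts.get(cluster, 0) for cluster in range(k)]
    return {
        "empty": sizes.count(0),
        "median": median(sizes),
        "largest": max(sizes),
    }


# Toy example: six users, four requested clusters, cluster 2 stays empty.
summary = cluster_size_summary([0, 0, 0, 1, 1, 3], k=4)
```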

Figure 4.2. Interactions based similarity matrix sorted into clusters and centered around the diagonal.

Table 4.2. A sample of interesting clusters formed by interactions based clustering.

| Cluster | Size | Characteristic users | Explanation |
|---|---|---|---|
| #18 | 28 | kentekeroth, ingridcarlqvist, RichardJomshof, Avpixlat, friatider | Far right |
| #31 | 29 | kdriks, BuschEbba, skyttedal, markus_uvell, ulfekman | Christian Democrats (KD) and right wingers |
| #33 | 16 | Utrikesdep, Sverigesriksdag, vinnovase, SwedeninEU, arbetsmarkdep | Government agencies |
| #38 | 37 | gudschy, Feministerna, SorayaPostFi, PolitismSE, detljuvalivet | Leftist identity politics |
| #44 | 45 | RebeccaWUvell, aClassicLiberal, hanifbali, Ivarpi, pophoger | Various right wingers |
| #53 | 19 | sr_nyheter, sr_ekot, nordbergj, MarcusEkot, LindaThulin | Radio Sweden |
| #55 | 29 | birgittaohlsson, liberalerna, FredrikMalm, LUFswe, RobertHannah85 | Liberal Party |
| #68 | 59 | socialdemokrat, bladetledare, ssu_sverige, sthlmssossar, Aida_Hadzialic | Social Democrats (S) |
| #71 | 22 | LOSverige, DagensArena, SvenskHandel, AlmegaAB, Eva_Nordmark | Labour related |
| #84 | 46 | nya_moderaterna, KinbergBatra, alliansswe, kentpersson, CBildt | Moderate Party (M) |

Chapter 5

Discussion

In this part of the report the results presented in the previous chapter will be discussed. What do they mean, and what are their implications? The aim of this chapter is to answer these questions and provide a context for the data presented in the results chapter. The chapter ends with the conclusions drawn from the results together with some suggestions for future research.

5.1 Content Based Clustering

The feature extraction techniques used in this thesis are commonly used for summarizing tweets (Nichols et al. 2012, Inouye et al. 2011) and for grouping tweets based on their topics (Phuvipadawat et al. 2010). Thus, it is no surprise that clustering users based on the contents of their tweets is biased towards users tweeting about similar ideas.

Two interesting clusters presented in the results are clusters 0 and 43. Cluster number 0 has a high ideological coherence, with all users either describing themselves as “Sverigevänner” (a euphemism for anti-immigrant) or being self-declared members of anti-immigrant organizations. On the other hand, cluster number 43 not only contains several anti-immigrant politicians and organizations, but also contains the Twitter accounts of the anti-racist organizations Expo and Interasistmen. This is an example of how clustering on content can produce clusters with users interested in a topic, in this case immigration, but with diametrically opposed stances.

Clusters 31 and 58 are good examples of strong partisan clustering. In the case of cluster 58, all but one user listed their party affiliation with the Moderate Party in their description. That last user is the account of “Alliansen”, the coalition led by the Moderate Party that governed Sweden between 2006 and 2014. A similar result for the Christian Democrats is obtained in cluster number 31, which is also heavily partisan, and there are other examples of strongly partisan clusters in the result set beyond those presented in the table.

One markedly topic based cluster is cluster number 27, which contains users with a very weak partisan similarity. This cluster contains users with a connection to the city of Stockholm, and includes the Stockholm accounts for the Moderate Party, the Left Party, the Social Democrats and the Green Party.

5.2 Clustering Based on Social Interactions

The ideological clusters formed through clustering on social interactions are coherent and relatively large. There is also less scattering of the clearly partisan users. Some are purely partisan clusters, such as the clusters 55, 68 and 84. There are also examples of ideological but less partisan clusters such as cluster 38, which consists of leftist identity political users and contains 37 users.

An interesting result is cluster number 71. When clustering on social interactions, the clustering algorithm failed to discriminate between labour unions and employers’ associations and placed LOSverige, the account of Sweden’s largest blue collar labour union, together with AlmegaAB, an employers’ association.

5.3 Comparison of the Clustering Techniques

As can be seen in fig. 4.1 and fig. 4.2, both clustering techniques resulted in a high degree of scattering of similar users. This is illustrated by the large number of dark pixels that are not centered around the diagonal. The problem is more severe when clustering based on content, but this is because there was a high degree of similarity between all users regardless of which cluster they were ultimately assigned to.

There is a tendency for less topic based clustering in the interactions based clustering results. CBildt, the account used by Sweden’s former foreign minister Carl Bildt of the Moderate Party, was placed in a foreign policy oriented cluster when clustering on content. When clustering on social interactions, he ended up in a strongly partisan Moderate Party cluster. The first clustering is clearly more satisfying from a topic based perspective, while the latter clustering is better if we are interested in grouping users based on their political affiliations.

An exception to the proposed pattern of social interactions being less prone to topic based clustering is cluster number 71 in table 4.2. This cluster can be compared to content cluster number 9 in table 4.1. The clustering on social interactions performed worse when considering partisanship and placed labour union LOSverige and employers’ association AlmegaAB in the same cluster. The content based cluster number 9 was a more purely labour union based cluster.

Overall, however, the partisan clusters formed by social interactions are larger and more coherent. There is also less scattering of self described partisans. For example, the Moderate Party was clustered into cluster number 58, with 28 users, when clustering on content. When clustered on social interactions, the corresponding cluster number 84 ended up containing 46 users.

5.4 Ethics

Clustering based on political affiliation is in itself something that must be carried out with caution. Politics is a sensitive subject, and publishing or assigning someone’s political orientation can result in personal embarrassment and even threats to a person’s security. The methods presented here could possibly be used to catalogue people based on their political views, which would be ethically questionable. The results presented in this study only concern the top 10 % of active Twitter users in the political Twittersphere. As such, they are considered to be open about their political orientations and the sensitivity of this information is far less than if we had published the machine learned political orientations of ordinary citizens. We have also avoided listing underage Twitter users.

5.5 Method Critique

It is possible that the results of this study were affected by the decision to only examine the top 10 % users with the highest follower count. Computing a similarity matrix is by its very nature a $\Omega(n^2)$ calculation, and the rate limits imposed on Twitter’s API limited the amount of data that was feasible to gather.
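As a worked illustration of this quadratic growth, and assuming that every similarity measure is symmetric so that each unordered pair of users only needs to be computed once, the number of pairwise similarities for the sample and for the full set of detected users is:

\[ \binom{n}{2} = \frac{n(n-1)}{2}, \qquad \binom{1268}{2} = 803{,}278, \qquad \binom{12682}{2} = 80{,}410{,}221 \]

Including all detected users would thus have required roughly 100 times as many similarity computations per measure, in addition to ten times as many rate limited API requests.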

The nature of the results as well as the difficulty of choosing a good k was affected by the decision not to clean up the data set. The broad range of keywords used in the Twitter filter when streaming political tweets allowed several Twitter accounts to be included in the data set, even though they had little to do with Swedish politics. These were left in the data set as it was deemed interesting to evaluate how effectively the different clustering approaches handled the noise introduced by these accounts. The vast majority of these users ended up together in various small clusters, which drove up the cluster count but did not adversely affect the quality of the more relevant clusters.

One could also question how the quality of the weights used when constructing the social interactions matrix affected the results. It is certainly true that the weights arrived at through experimentation are unlikely to be optimal. However, the problem statement did not concern finding optimal weights, and the weights chosen were good enough for the purposes of this report. The trend that clustering based on social interactions provided more ideologically centered clusters than content based clustering was apparent even with the possibly suboptimal composition of the social interactions matrix, and would likely have been more pronounced with better weights.

5.6 Conclusion

In this study, users in the Swedish political Twittersphere were clustered on their social interactions as well as on the content of their tweets using spectral clustering. The goal of this was to provide insights as to what the advantages and drawbacks of these two strategies are.

The results of this study show that it is feasible to form political clusters with both content based and social interactions based similarity measures using spectral clustering. Content based clustering tended to form more topically oriented clusters whereas social interactions based clustering performed better when the aim was to find clusters centered around a political party or an ideology.

5.6.1 Future Research

As has been mentioned previously in the report, there is a gap in research concerning how to weight different similarity measures when clustering. Researching whether there exists an optimal weighting strategy and what such a strategy looks like would allow for improvements in the results found in this report.

We also believe that the results from the content based clustering could be improved by performing sentiment analysis on the tweets. In our analysis, the individual words are treated as if they are independent of each other. Analyzing user sentiments towards certain topics, such as the various political parties, may provide deeper insights regarding the political orientations of the users and would be an interesting subject for further investigation.

Bibliography

Alodadi, M., Mohammad, A. & Janeja, V. P. (2015), Similarity in patient support forums using TF-IDF and cosine similarity metrics, in ‘2015 International Conference on Healthcare Informatics’.

Findahl, O. & Davidsson, P. (2015), Svenskarna och internet, Technical report, IIS.

Garreta, R. & Moncecchi, G. (2013), Learning scikit-learn: Machine Learning in Python, Packt Publishing Ltd.

Gruzd, A., Anatoliy, G. & Jeffrey, R. (2014), ‘Investigating political polarization on twitter: A canadian perspective’, Policy & Internet 6(1), 28–45.

Ifrim, G., Shi, B. & Brigadir, I. (2014), Event detection in twitter using aggressive filtering and hierarchical tweet clustering, in ‘SNOW-DC@WWW’, pp. 33–40.

Inouye, D., David, I. & Kalita, J. K. (2011), Comparing twitter summarization algorithms for multiple post summaries, in ‘2011 IEEE Third Int’l Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third Int’l Conference on Social Computing’.

Krikorian, R. (2013), ‘New tweets record per second, and how!’, https://blog.twitter.com/2013/new-tweets-per-second-record-and-how. Accessed: 2016-3-23.

Nichols, J., Jeffrey, N., Jalal, M. & Clemens, D. (2012), Summarizing sporting events using twitter, in ‘Proceedings of the 2012 ACM international conference on Intelligent User Interfaces - IUI ’12’.

Phuvipadawat, S., Swit, P. & Tsuyoshi, M. (2010), Breaking news detection and tracking in twitter, in ‘2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology’.

Ramos, J. (2003), Using tf-idf to determine word relevance in document queries, in ‘Proceedings of the first instructional conference on machine learning’.

Sharifi, B., Beaux, S., Mark-Anthony, H. & Kalita, J. K. (2010), Experiments in microblog summarization, in ‘2010 IEEE Second International Conference on Social Computing’.

Van Alstyne, M. & Erik, B. (2005), ‘Global village or Cyber-Balkans? modeling and measuring the integration of electronic communities’, Manage. Sci. 51(6), 851–868.

von Luxburg, U. (2007), ‘A tutorial on spectral clustering’, Stat. Comput. 17(4), 395–416.

Yu, Yu & Shi (2003), Multiclass spectral clustering, in ‘Proceedings Ninth IEEE International Conference on Computer Vision’.

Zhang, Y., Wu, Y. & Yang, Q. (2012), ‘Community discovery in twitter based on user interests’, Journal of Computational Information Systems 8(3).

Appendix A

Keywords used when filtering the stream of tweets

| Keyword | Explanation |
|---|---|
| moderaterna | The Moderate Party |
| socialdemokraterna | The Social Democrats |
| centerpartiet | The Centre Party |
| liberalerna | The Liberals |
| folkpartiet | The old name for the Liberals |
| kristdemokraterna | The Christian Democrats |
| sverigedemokraterna | The Sweden Democrats |
| SD | The commonly used abbreviation for the Sweden Democrats |
| vänsterpartiet | The Left Party |
| miljöpartiet | The Green Party |
| politik | The Swedish word for “politics” |
| parti | The Swedish word for “party” |
| alliansen | The right-of-centre alliance consisting of the Christian Democrats, the Liberals, the Centre Party and the Moderate Party |
| regeringen | The Swedish word for “the government” |
| rödgröna | An informal term for the centre-left bloc usually described as consisting of the Social Democrats and the Green Party but also commonly including the Left Party |
| riksdagen | The Swedish word for “the Riksdag”, which is the name of the Swedish parliament |
| #svpol | A hashtag used for tweeting about Swedish politics |
