Recommending Hashtags for Tweets Using Textual Similarity and Geographic Data

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2017

JONATHAN BERGLIND

MIKAEL FORSMARK

KTH ROYAL INSTITUTE OF TECHNOLOGY

Master in Computer Science

Date: June 4, 2017

Supervisor: Roberto Guanciale

Examiner: Örjan Ekeberg

Swedish title: Föreslå hashtags till tweets med textbaserad likhet och geografisk data

Abstract

Twitter is one of today’s largest and most popular social networks. The users of the service generate huge amounts of data each day and rely heavily on the service to help them find interesting tweets quickly. Hashtags aid in this, but they depend on users choosing to include the correct and commonly used hashtags for the topic of their tweet. Hashtag recommendation has been a target of research before, with varying results. This thesis proposes a method that takes the location of the users into account when making recommendations. The method generated improved results over using only similar tweets as a basis for recommendation. Several factors could affect the result: how different variations of vocabulary in the tweets are handled, how many tweets the suggestions can be picked from, and how the combination of similarity and geographic ranking functions. This leads to the conclusion that geographic data can be used to improve hashtag suggestions, but that a different approach to handling similarity, or alternative combinations of similarity and geographic ranking, could produce a different result.

Summary (translated from the Swedish Sammanfattning)

Twitter is one of today’s largest and most popular social networks. The users of the service produce large amounts of data every day and expect the service to help them find interesting tweets quickly. This is the role of hashtags, but the concept presupposes that users choose to include commonly occurring hashtags that correctly reflect the content of the tweet. Automatic hashtag recommendation has therefore been a popular research topic in recent years, with varying results. This study examines a recommendation method that takes the geographic position of the user into account in order to recommend hashtags that are as fitting as possible. The results show that this method generally recommends more fitting hashtags than methods that recommend hashtags solely by analyzing the similarity between tweets. Various factors may at the same time affect the results, such as the handling of different variants of vocabulary, how many tweets the method can suggest hashtags from, and how the combination of recommendation based on similarity and on geographic position should work. This leads to the conclusion that geographic data can be used to improve hashtag recommendation, but that a different approach to handling similarity, and alternative combinations of similarity ranking and geographic ranking, may lead to a different result.

Chapter 1 Introduction

As the software market has evolved considerably over the last 20 years, good user experience has become an essential part of development in order to attract more users and handle the increasing competition from other software companies. Good user experience often includes features that make the interaction between the user and the application as simple as possible in order to minimize user effort and speed up the interaction [7]. One way of achieving this is to analyze user data in order to predict the user’s interaction behavior. Such analyses can be used to create shortcuts to functionality that the user is likely to want, since the user has used that functionality before.

One example of this is topic recommendation, which is the process of assigning a topic to text data [5]. A service that has previously been connected to this research area is Twitter, a social media service where users post short messages, so-called tweets. These tweets can be indexed by one or several words describing the topic of the content, known as hashtags. Including hashtags allows the tweet to be discovered by users who are interested in the topic that the hashtag represents. For hashtags to serve their purpose, it is therefore important to tag the tweet with a suitable topic. To help and encourage users to include appropriate hashtags, recommendations based on the content of the tweet could be shown to the user while the tweet is being written. Twitter does not currently offer this possibility, and the complexity of the task has led to several papers within the area.

Data mining is the process of discovering useful patterns and trends in large data sets [10]. The growing ability to store user data, combined with the challenges of developing well-functioning prediction algorithms, has resulted in some companies announcing user prediction competitions along with public datasets to encourage research within the area of user behavior prediction. Netflix [12], for example, arranged a competition in 2009 where the competitors were given a rich user dataset and were expected to develop an algorithm that would suggest movies to a user based on the dataset. Competitions like these show that there is a great demand for reliable user prediction algorithms among different types of stakeholders.

1.1 Purpose

The Twitter API allows anyone to collect tweet data, and this possibility, in combination with the absence of an actual hashtag recommendation feature in Twitter, has drawn attention to that specific research field. Kywe et al. [9], for example, discuss a method where hashtags are suggested based on the content of a tweet and the user’s preferences, while Godin et al. [5] use natural language processing techniques on the content of the tweet to deliver appropriate recommendations. These papers, among others, have shown promising results, but the question is whether the hashtag suggestions would be even better with additional parameters.

One possible additional parameter is geographic information about the tweet. By comparing a tweet with tweets that are similar in content and geographically nearby, the accuracy of the suggestion algorithms could increase when a tweet is written in a crowded area, for example at a small concert in the middle of New York. It would therefore be interesting to investigate whether additional information about the tweet’s location would result in a better hashtag recommendation algorithm.

1.2 Problem Statement

This thesis aims to answer the following questions.

• Can geographic data about nearby users with similar tweet patterns be used to suggest hashtags that a Twitter user would like to use?

• What are the difficulties with predicting this kind of user behaviour?

By investigating the possibility of recommending hashtags based on nearby users with similar tweet patterns, one possible outcome of this report is not only a solution to this specific problem, but also a concept that could be applied to similar problems. Regardless of the result, this report also aims to document the difficulties and challenges of developing this kind of solution in order to contribute to future research within the area.

1.3 Outline

This paper contains six chapters. Chapter 1, Introduction, introduces the subject and the purpose of the thesis. Chapter 2, Background, presents related studies and relevant background information. Chapter 3, Method, shows the procedure of the thesis and motivates the chosen approach and experiments. Chapter 4, Results, presents the produced results of the thesis. Chapter 5, Discussion, analyzes the results and discusses possible improvements. Chapter 6, Conclusion, summarizes the discussion and presents conclusions.

Chapter 2 Background

Fundamental terms related to the project are described in this chapter. The social network Twitter and its applications are explained, with a deeper presentation of hashtags and previous studies within hashtag recommendation. Textual similarity is then explained with a focus on text clustering, and the Jaccard Similarity Coefficient and its usage in this project are established. Finally, Natural Language Processing and its significance for creating clusterable text are briefly introduced, together with an introduction to geographic distance calculation.

2.1 Twitter

Twitter is an online social network that originally launched in 2006 at Twitter.com. The service lets its users stay connected with each other by posting short messages, known as tweets, which may contain text or links to photos and videos. A tweet consists of a maximum of 140 characters and is posted to the profile of the user, sent to the user’s followers and searchable on the platform [19]. Users are able to draft and post tweets, and to consume tweets from other users, through an array of connected devices such as smartphones, computers and other devices that make use of the public API of the service. The 313M active users (as of March 2016) [17] generate huge amounts of tweets each day. One example is the 40M tweets measured during a single hour on the day of the 2016 U.S. presidential election [13]. This massive amount of data makes it hard for users to find tweets related to their own interests.

2.2 Hashtag Recommendation

2.2.1 Trends and Hashtags

To aid users in finding interesting tweets and like-minded users to connect with, Twitter provides the concept of hashtags. A hashtag, which is a word preceded by the #-symbol, is used to index keywords or topics on Twitter [21]. Hashtags are usually included in a tweet by putting the hash sign before a topic word in an already written sentence, or by appending them to the end of the tweet. On Twitter.com and in its clients, users can press a hashtagged word in any message to be shown other tweets containing that specific hashtag. This makes it a powerful tool for helping users connect with other users all over the world that they do not already follow.

Since any word can be a hashtag, it can be hard for a user to choose what to tag when drafting a tweet. Therefore, when a user writes the #-symbol before a word, the service searches for similar hashtags and shows them as suggestions. While this helps when the user has an idea of what to tag, there is at the time of writing no way to get suggestions based on the content of a completed tweet without hashtags. This problem, often known as hashtag recommendation, is an interesting one and has been approached multiple times by different groups since the launch of the service [23] [9] [5].

2.2.2 Geotagged Tweets and Users

A location may be attached to a tweet as it is posted. This is done either by selecting a place from a list of nearby places, or allowing the service to attach a precise location using coordinates from the GPS of the device being used [18].

Apart from tagging specific tweets with geographic data, users may also specify their location in their profile on a city-level granularity. These geotagged tweets and users can then be found, for example, by using the Twitter Advanced Search [20] to search for tweets near a specific place.

Geographic data of tweets could possibly be used as a parameter when recommending hashtags to improve the predictions. When tweet-specific geographic data is missing, the location listed in a user’s profile might be used as a backup. According to previous studies, less than 1% of all tweets posted on Twitter are geotagged, while many more users choose to include a location in their profile [11].

2.2.3 Previous Studies

As mentioned previously, several studies have been published on the topic of predicting hashtags for completed tweets.

One of the simpler approaches to the problem finds the most similar tweets in a group of tweets [23]. Similarity was calculated using term frequency-inverse document frequency (tf-idf), which is explained further in Section 2.3.1, and tweets scoring over a threshold value were included in the set of similar tweets. A set of hashtags was then extracted from this set and ranked. As noted in the paper, the ranking of the candidates is crucial for the success of recommendations due to the limited space for displaying them and the limited cognitive capacity of the user when writing a tweet. The paper proposes three different methods for ranking hashtags:

• OverallPopularityRank - Number of occurrences of the hashtag in the whole dataset.

• RecommendationPopularityRank - Number of occurrences of the hashtag in the set of similar tweets.

• SimilarityRank - Score for how similar the tweets that contained the hashtag are (using the same tf-idf metric).

The best method, SimilarityRank, reached recall values of about 45-50%, where recall is calculated by dividing the number of correctly recommended hashtags by the number of original hashtags of the tweet.

Work published in 2012 by Kywe et al. [9] proposed a method that, apart from tweet content, also considered user preferences. User preferences were represented as a vector of preference weights, one for each hashtag in a dictionary of all hashtags. A method very similar to the one mentioned above was used for finding similar tweets. The results showed increased recall values for the selected hashtags over methods not taking user preferences into account.

Alternative work from 2013 by Godin et al. [5] used a natural language processing technique known as Latent Dirichlet Allocation (LDA) to recommend hashtags for a tweet. LDA is a hidden topic model, often used to discover the general topics in large documents, assuming there exists a topic model underlying the data. They found that they were able to recommend hashtags in a more general way than other methods, since the hashtag did not need to exist in the dataset beforehand. As a consequence, recall could not be used for measuring the results. Instead, a team of two persons evaluated whether the suggested tag represented the topic of the tweet and could be used as a hashtag.

Geographic data of tweets has, to the best of our knowledge, not been used in hashtag recommendation before.

2.3 Textual Similarity

2.3.1 Text Clustering

One established way of grouping tweets by textual similarity is text clustering [23], where a set of text data is divided into different sets (also known as clusters) based on its content. Text clustering is often performed using the vector space model. With a set of n texts containing ω words, a text data element can be represented as below, where j ∈ {1...n}.

$$T_j = (w_{1,j}, w_{2,j}, \ldots, w_{\omega,j})$$

How similar two text vectors are considered depends on how many words they have in common, and the text vectors can be combined into a word-document matrix $w_{i,j}$. This matrix has the two dimensions word and document, and each entry represents the weight given to a specific word i in a document j. This weight is calculated as the product of the frequency of the word in the document, $tf_{i,j}$, and the inverse document frequency of the word, $idf_i$, which decreases as the number of documents containing the word grows.

$$w_{i,j} = tf_{i,j} \cdot idf_i$$

This kind of matrix gives information on how important a word is in a specific document and can be used when determining the similar-ity between text data [16].
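As a rough illustration of the weighting described above, the following sketch computes tf-idf weights for a few tokenized texts. The function name `tf_idf_weights`, the use of raw counts for tf and the logarithmic form log(n / df) for idf are assumptions made for this example; the text does not fix a particular variant.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {word: weight} dict per document, with weight = tf * idf.

    Assumptions for this sketch: tf is the raw count of the word in the
    document and idf is log(n / df), where df is the number of documents
    that contain the word.
    """
    n = len(documents)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

docs = [["concert", "tonight", "concert"], ["traffic", "tonight"]]
weights = tf_idf_weights(docs)
```

With this variant, a word that occurs in every document (here "tonight") receives the weight 0, while a word that is frequent in one document but rare in the collection (here "concert") receives the highest weight.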

One algorithm that is commonly used within text clustering is the K-Means algorithm, a partitioning algorithm that returns a final partition of clusters that are initially constructed around K commonly occurring words, so-called centroids. Texts are placed in the cluster with the most similar centroid [16]. The similarity calculation can be based on the distance between the cluster and the text, and one way of calculating this distance is using the Jaccard Similarity Coefficient [4], which is described in Subsection 2.3.2.

The text clustering technique is not used in this thesis, since a less advanced similarity calculation was sufficient to investigate the results with geographic ranking. It is nevertheless useful to be familiar with the concept, since it is a possible future extension of the hashtag suggestion algorithm of this thesis.

2.3.2 Jaccard Similarity Coefficient

The Jaccard Similarity Coefficient, also known as the Jaccard Index, is a statistical measure used to compare the similarity and diversity of sets of data [2]. Given two sets A and B, a similarity coefficient between the two sets can be calculated using the formula below.

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

The result is a number between 0 and 1 called a Jaccard coefficient. The larger the number produced by two sets, the greater their similarity [4].
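To make the coefficient concrete, the sketch below applies it to the word sets of two short example texts. The function name `jaccard` and the choice to return 0 for two empty sets (where the formula is undefined) are assumptions of this example.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of words."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(set_a & set_b) / len(union)

tweet_a = "great concert in new york tonight".split()
tweet_b = "concert tonight was great".split()
score = jaccard(tweet_a, tweet_b)
```

The two example texts share three words ("great", "concert", "tonight") out of seven distinct words in total, which gives a coefficient of 3/7.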

2.4 Natural Language Processing

2.4.1 Definition

Natural language processing, also known as Computational Linguistics, is an area within computer science with the purpose of understanding, learning and producing human language content using computational techniques. There are several applications of natural language processing involving different modalities, such as reading and summarizing printed text, recording and processing spoken sentences, and mining social media to analyze the age and gender of the author based on the scraped data [8].

2.4.2 Stop Words

When preparing text for analysis and clustering, a common preparation stage is to remove so-called stop words from the text in order to highlight the words that actually bring semantics to the text [1]. Stop words are often defined as the most common words of the language of the text and often include prepositions, like “on”, “in” and “between”, and conjunctions, like “and”, “or” and “but”. By removing this kind of word, the risk of matching text data just because they contain the same set of redundant words is eliminated, and the chance of finding texts that actually share the same semantics is increased. This follows from the fact that the distance calculation between the text data will only be done on words that actually bring a specific meaning to the text [22].
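The effect of stop word removal on a similarity calculation can be illustrated with a small sketch. The stop word list, the helper names `tokens` and `jaccard`, and the two example sentences are all made up for this illustration; the list actually used in the study is the one referenced in Appendix A.

```python
# Tiny illustrative stop word list (an assumption, not the list from Appendix A).
STOP_WORDS = {"on", "in", "between", "and", "or", "but", "the", "is", "at"}

def tokens(text, stop_words=frozenset()):
    """Lowercase, split on whitespace and drop the given stop words."""
    return {w for w in text.lower().split() if w not in stop_words}

def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

t1 = "the game is on in the city"
t2 = "the cat is in the garden"

# The texts only share the stop words "the", "is" and "in".
before = jaccard(tokens(t1), tokens(t2))
after = jaccard(tokens(t1, STOP_WORDS), tokens(t2, STOP_WORDS))
```

Before removal the two unrelated sentences obtain a coefficient of 0.375 purely because of shared stop words; after removal the coefficient is 0.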

2.5 Approximating Geographical Distance

Due to the uneven surface of the earth, distance calculation formulas may only approximate the geographical distance between two points. There are numerous formulas for distance calculations, all based on different abstractions providing different levels of accuracy.

2.5.1 Haversine Formula

The Haversine Formula abstracts the surface of the earth as a sphere and calculates the great-circle distance between two points on it as an approximation. The distance is calculated as follows [15]:

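$$d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos(\varphi_1)\cos(\varphi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right)$$

where d is the distance between the two points, r is the radius of the sphere, ϕ1 and ϕ2 are the latitudes of the points in radians, and λ1 and λ2 are the longitudes of the points in radians.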
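The Haversine Formula can be sketched in a few lines of code. The function name `haversine_km`, the input in degrees and the mean earth radius of 6371 km are assumptions of this example; the clamping of the intermediate value to at most 1 only guards against floating point rounding for nearly antipodal points.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean earth radius

def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 so rounding errors cannot push asin outside its domain.
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))

# Approximate city-centre coordinates, used only as an example.
new_york = (40.7128, -74.0060)
los_angeles = (34.0522, -118.2437)
distance = haversine_km(*new_york, *los_angeles)
```

Two points on the equator that are 90 degrees of longitude apart come out as a quarter of the circumference of the sphere, which is a convenient sanity check for an implementation like this one.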

Chapter 3 Method

3.1 Data Collection

In order to recommend hashtags for tweets based on similar tweets and their location, a dataset containing these features was needed. A search for a dataset with these features was therefore performed in the initial phase of the project. Since Twitter is a popular research topic a lot of different datasets are publicly available. However, most of them lack the geographic features needed for this study and only one sufficient dataset was found.

The dataset used in this study contains 3,844,612 tweets from 115,886 users with locations. The locations of the users are self-labeled on a city-level granularity. The data was gathered from September 2009 to January 2010 for use in the paper “You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users” [3] and is licensed under a Creative Commons Attribution Noncommercial 3.0 United States License.

As noted in Section 3.5, the dataset used in this study does not contain geotagged tweets but rather geotagged users, meaning that the assumption is made that every distinct user has produced all tweets from a single location. This will undoubtedly have affected the results, as is discussed later in Section 5.1.

An alternative to this dataset was considered: using the Twitter Streaming API to collect data as it is being posted to Twitter. Considering the time constraints of the project and the fact that only few tweets are geotagged anyway (see 2.2.2), the API alternative was turned down in favor of the publicly available dataset.

The dataset consists of two files, one with user IDs along with the name of their location and the other with tweets along with the ID of the user that posted the tweet.

3.2 Data Preprocessing

In order for the data to be easy to work with within the scope of the project, some preprocessing steps were taken.

For the file containing user IDs with their location names, this meant converting locations to geographical coordinates to allow easy distance comparison between users. This was done using the Google Maps Geocoding API [6] by specifying the location name as the address parameter and saving the coordinates of the first hit on file along with the user ID. The entries for which the API did not return any hits were discarded.

Additional preprocessing steps were taken regarding the file containing the tweets. Firstly, all tweets were lowercased and stripped of links and everything that is not text. Secondly, the hashtags of the tweets were extracted, and the hash signs were removed from the tweet while the tagged words themselves were left in place. This decision is based on the observation that most hashtags are part of a sentence and not just appended at the end of a tweet. Thirdly, all words matching a list of common stop words [14] (see 2.4.2) were removed from the tweets. The complete list of stop words used in the preprocessing step is included in Appendix A. Finally, the geographical coordinates of the tweet were looked up from the previously processed file with user locations. Processed tweets were then saved on file as a quadruple of (User ID, User Coordinates, Processed tweet, Extracted hashtags). Tweets that lacked either hashtags or user coordinates were discarded during this process.

The preprocessing steps left a final set of 380,636 tweets.
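The tweet preprocessing steps described in this section could look roughly like the sketch below. It is not the code used in the study: the regular expressions, the function name `preprocess`, the shortened stop word list and the decision to keep only the characters a-z and 0-9 as "text" are assumptions made for illustration.

```python
import re

# Illustrative subset only; the study uses the list in Appendix A.
STOP_WORDS = {"a", "an", "and", "at", "in", "is", "on", "or", "the", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
NON_TEXT_RE = re.compile(r"[^a-z0-9\s]")  # assumption: ASCII letters and digits count as text

def preprocess(tweet):
    """Return (words, hashtags) for a raw tweet."""
    text = URL_RE.sub(" ", tweet.lower())      # lowercase and strip links
    hashtags = HASHTAG_RE.findall(text)        # extract the hashtags
    text = NON_TEXT_RE.sub(" ", text)          # drops '#' but keeps the tagged word
    words = [w for w in text.split() if w not in STOP_WORDS]
    return words, hashtags

words, hashtags = preprocess("Great #concert in NYC tonight! http://t.co/abc #live")
```

Note that the tagged words "concert" and "live" remain in the processed word list, in line with the decision to remove only the hash signs.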

3.3 Dataset Analysis

In order to provide some insight into the preprocessed data, the following table and figures were produced.

| Metric | Value | Note |
|---|---|---|
| Number of tweets | 380,636 | |
| Distinct users | 52,742 | Average 7.2 tweets/user |
| Distinct locations of users | 2,216 | Average 23.8 users/location |
| Total hashtags used | 529,727 | |
| Average hashtags per message | 1.39 | |
| Distinct hashtags used | 82,936 | |
| Hashtags occurring less than 5 times | 69,740 | 84% of hashtags |
| Hashtags occurring less than 3 times | 62,039 | 75% of hashtags |
| Hashtags occurring once | 51,571 | 62% of hashtags |

Table 3.1: The table shows statistics about the preprocessed data for some metrics deemed interesting to the results of the study.
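Statistics of the kind shown in Table 3.1 can be derived from the extracted hashtags with a simple frequency count. The sketch below uses a made-up toy input, and the function name `hashtag_stats` and the keys of the returned dictionary are hypothetical.

```python
from collections import Counter

def hashtag_stats(hashtag_lists):
    """Summarize hashtag usage given one list of hashtags per tweet."""
    if not hashtag_lists:
        raise ValueError("at least one tweet is required")
    counts = Counter(tag for tags in hashtag_lists for tag in tags)
    if not counts:
        raise ValueError("at least one hashtag is required")
    total = sum(counts.values())   # total hashtags used
    distinct = len(counts)         # distinct hashtags used
    return {
        "total": total,
        "distinct": distinct,
        "avg_per_tweet": total / len(hashtag_lists),
        "share_once": sum(1 for c in counts.values() if c == 1) / distinct,
    }

stats = hashtag_stats([["a", "b"], ["a"], ["c"], ["a", "d"]])
```

In the toy input, the hashtag "a" is used three times and the other three hashtags once each, so three out of four distinct hashtags occur only once.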

[Plot “Number of hashtags per tweet”: x-axis hashtags per tweet (1 to 10, > 10), y-axis percentage of all tweets (0% to 80%).]

Figure 3.1: The figure shows the percentage of tweets containing the corresponding number of hashtags. More than 75% of all tweets contain only 1 hashtag, and a considerably smaller fraction (around 15%) contain 2 hashtags, leaving approximately 10% for tweets containing more than 2 hashtags.

[Plot “User Locations”: x-axis longitude (−130 to −60), y-axis latitude (20 to 55).]

Figure 3.2: The figure shows the distinct locations of the users over a coastline map of the USA. The dense clusters of points are centered around the biggest cities.

3.4 Hashtag Recommendation

The aim of this paper is to find a suitable set of hashtags for any given tweet and to see if geographic data can influence the set of suggested hashtags in a positive way. This is done by analyzing the tweet without hashtags together with a set of tweets already containing hashtags along with the locations of the authors. Recommendations are computed through the following steps:

1. Measuring how similar the tweet without hashtags is to every other tweet with hashtags in a set of training tweets to get a similarity score.

2. Extracting the hashtags for all tweets scoring a positive similarity score.

3. Ranking the extracted hashtags using one of the following methods:

(a) SimilarityRank - ranking hashtags by the similarity score between the origin tweet (the training tweet the hashtag was extracted from) and the tweet without hashtags, in descending order.

(b) GeoRank - ranking hashtags by the geographical distance between the origin tweet and the tweet without hashtags, in ascending order.

4. Presenting the top-k ranked hashtags as recommendations for the tweet without hashtags.

Similarity score is calculated using the Jaccard Similarity Coefficient (see 2.3.2) and geographical distance is calculated using the Haversine Formula (see 2.5.1).
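The recommendation procedure of this section can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function and parameter names are hypothetical, a hashtag that occurs in several training tweets is scored by its best-ranked origin tweet, and ties are broken by the other criterion and then alphabetically.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient of two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) pairs given in degrees."""
    phi1, lam1, phi2, lam2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin((lam2 - lam1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, h)))

def recommend(words, coord, training, k=5, method="similarity"):
    """Return the top-k hashtags; training holds (words, coord, hashtags) tuples."""
    if method not in ("similarity", "geo"):
        raise ValueError("method must be 'similarity' or 'geo'")
    best = {}  # hashtag -> best sort key seen so far (smaller is better)
    for t_words, t_coord, t_tags in training:
        sim = jaccard(words, t_words)          # similarity to each training tweet
        if sim <= 0:
            continue                           # only positive similarity scores
        dist = haversine_km(coord, t_coord)
        # SimilarityRank: descending similarity. GeoRank: ascending distance.
        # The secondary key is an assumed tie-breaker.
        key = (-sim, dist) if method == "similarity" else (dist, -sim)
        for tag in t_tags:
            if tag not in best or key < best[tag]:
                best[tag] = key
    ranked = sorted(best, key=lambda tag: (best[tag], tag))
    return ranked[:k]                          # top-k recommendations

training = [
    (["concert", "tonight", "nyc"], (40.71, -74.01), ["livemusic"]),
    (["concert", "tonight", "great", "show"], (34.05, -118.24), ["show"]),
    (["traffic", "jam"], (40.71, -74.01), ["commute"]),
]
query_words, query_coord = ["great", "concert", "tonight"], (40.73, -73.99)
by_similarity = recommend(query_words, query_coord, training, method="similarity")
by_distance = recommend(query_words, query_coord, training, method="geo")
```

In this toy example the most similar training tweet is far away, so the two ranking methods return the same two candidates in opposite order, while the hashtag of the unrelated third tweet is never suggested.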

3.5 Limitations

The dataset that this thesis is based on was collected from September 2009 to January 2010 and contains 115,886 Twitter users and 3,844,612 tweets from these users [3]. This data is presented in triplets of (userID, tweetID, tweet). The locations of the users are found in a complementary dataset containing tuples of (userID, coordinates), which means that this thesis assumes that every distinct user is producing all of their tweets from the same location.

Only tweets containing one or more hashtags are used in the construction of the training set and test set. When a tweet is tested, its original hashtags are removed before it enters the suggestion algorithm. When the algorithm returns a result, the set of suggestions is compared with the set of original hashtags in order to establish the correctness of the result.

3.6 Evaluation

3.6.1 Recall and Precision

Recall and precision values were chosen as the metric for measuring the results of the recommendations, defined as follows.

$$\text{Recall} = \frac{|H_{recommended} \cap H_{original}|}{|H_{original}|}$$

$$\text{Precision} = \frac{|H_{recommended} \cap H_{original}|}{|H_{recommended}|}$$

where $H_{recommended}$ is the set of recommended hashtags for a tweet and $H_{original}$ is the original set of hashtags present in the tweet, which the algorithm is trying to recommend.

The recall value for a recommendation thus describes how many of the relevant hashtags the algorithm managed to predict. The precision value describes how accurate the recommendations were.
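As a worked example of the two metrics, the sketch below evaluates five made-up recommendations against three made-up original hashtags. The function name `recall_precision` and the convention of returning 0 when a set is empty are assumptions of this example.

```python
def recall_precision(recommended, original):
    """Return (recall, precision) for one tweet."""
    recommended, original = set(recommended), set(original)
    hits = len(recommended & original)
    # Assumption: a metric is defined as 0 when its denominator would be 0.
    recall = hits / len(original) if original else 0.0
    precision = hits / len(recommended) if recommended else 0.0
    return recall, precision

recall, precision = recall_precision(
    ["nyc", "concert", "music", "live", "fun"],   # recommended hashtags
    ["concert", "live", "jazz"],                  # original hashtags
)
```

Two of the three original hashtags are found, giving a recall of 2/3, while only two of the five recommendations are correct, giving a precision of 2/5.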

3.6.2 Tests

From the processed data, 10,000 tweets were picked out at random for use as a testing set. Hashtag recommendations were computed for each tweet in the testing set using the remaining tweets from the processed data as input. Recommendations were computed using SimilarityRank and GeoRank separately, and the recall and precision values for each tweet were recorded.

To address concerns about the GeoRank method prioritizing hashtag recommendations from tweets written by the same user as the testing tweet (the same location giving a geographical distance of 0), another set of tests was performed where tweets from the same user were removed for both ranking methods.
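The test setup can be sketched as a random split followed by an optional filter that removes training tweets written by the author of the tweet under test. The dictionary layout with a `user_id` key, the function names and the fixed seed are hypothetical choices for this illustration.

```python
import random

def split_dataset(tweets, test_size, seed=0):
    """Randomly pick test_size tweets as the test set; the rest is training data."""
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    indices = list(range(len(tweets)))
    rng.shuffle(indices)
    test_idx = set(indices[:test_size])
    test = [t for i, t in enumerate(tweets) if i in test_idx]
    train = [t for i, t in enumerate(tweets) if i not in test_idx]
    return train, test

def candidates_for(test_tweet, train, exclude_same_user=False):
    """Training tweets that recommendations for test_tweet may be drawn from."""
    if not exclude_same_user:
        return train
    return [t for t in train if t["user_id"] != test_tweet["user_id"]]

tweets = [{"user_id": i % 3, "text": f"tweet {i}"} for i in range(10)]
train, test = split_dataset(tweets, test_size=4)
filtered = candidates_for(test[0], train, exclude_same_user=True)
```

Running every test tweet once with `exclude_same_user=False` and once with `exclude_same_user=True` corresponds to the two sets of tests described above.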

Chapter 4 Results

After running several tests based on different divisions of the original dataset into training set and test set, the results generally followed the pattern as shown in the figures below.

[Plot “Mean Recall Values of 10,000 tweets”: x-axis top-k suggested hashtags (1 to 10), y-axis mean recall value (0.4 to 0.75); series SimilarityRank and GeoRank.]

Figure 4.1: The figure shows how the recall value varies as the number of suggested hashtags increases when also recommending hashtags from the user’s own tweets. The x-axis represents the number of hashtags that were suggested by the algorithm, sorted either by similarity in content or by geographical distance. The y-axis represents the mean recall value of all tweets in the test set, that is, the mean fraction of the hashtags of the original tweet that were suggested by the algorithm. The curve marked with squares represents the mean recall value when the suggested hashtags are sorted by similarity in content, and the curve marked with dots represents the mean recall value when the suggested hashtags are sorted by geographical distance.

[Plot “Mean Precision Values of 10,000 tweets”: x-axis top-k suggested hashtags (1 to 10), y-axis mean precision value (0.05 to 0.6); series SimilarityRank and GeoRank.]

Figure 4.2: The figure shows how the precision value varies as the number of suggested hashtags increases when also recommending hashtags from the user’s own tweets. The x-axis represents the number of hashtags that were suggested by the algorithm, sorted either by similarity in content or by geographical distance. The y-axis represents the mean precision value of all tweets, that is, the mean fraction of the suggested hashtags that actually belonged to the original tweet. The curve marked with squares represents the mean precision value when the suggested hashtags are sorted by similarity in content, and the curve marked with dots represents the mean precision value when the suggested hashtags are sorted by geographical distance.

[Plot “Mean Recall Values of 10,000 tweets, No hashtags from same user”: x-axis top-k suggested hashtags (1 to 10), y-axis mean recall value (0.2 to 0.6); series SimilarityRank and GeoRank.]

Figure 4.3: The figure shows how the recall value varies as the number of suggested hashtags increases when a hashtag is not recommended if it is picked from one of the user’s own tweets. The axes and curves represent the same values as in Figure 4.1.

[Plot “Mean Precision Values of 10,000 tweets, No hashtags from same user”: x-axis top-k suggested hashtags (1 to 10), y-axis mean precision value (0.05 to 0.4); series SimilarityRank and GeoRank.]

Figure 4.4: The figure shows how the precision value varies as the number of suggested hashtags increases when a hashtag is not recommended if it is picked from one of the user’s own tweets. The axes and curves represent the same values as in Figure 4.2.

The figures above show how the recall value increases as the number of suggested hashtags increases. At the same time, the risk that the algorithm recommends unsuitable hashtags increases, as the precision values decrease when the number of suggested hashtags increases.

By comparing the results when including and not including hashtags from the user’s previous tweets in the set of suggested hashtags, the figures also show that the geographical ranking performs better if the algorithm is allowed to suggest hashtags from tweets that the user has produced before. The similarity ranking performs better than the geographic ranking if hashtags from tweets that the user has produced before are excluded.

Chapter 5 Discussion

In this chapter, the meaning of the results and the reliability of the algorithm are discussed by taking a look at the conditions and methods that the solution is based on, while also identifying difficulties in hashtag recommendation. The efficiency of the algorithm is also analyzed in order to identify potential improvements. Finally, the mechanisms behind the algorithms and challenges along the way are discussed in order to find sources of error.

5.1 Performance and Reliability

As the numbers presented in Chapter 4 show, the hashtag suggestion based on geographic ranking suggests more correct hashtags when the suggested hashtag can be picked from tweets that the user has produced before. Meanwhile, when hashtags cannot be picked from the user’s previous tweets, the results are reversed and the ranking based solely on similarity reaches a significantly higher recall value. The reason behind these results can be viewed in different ways. With an average of 7.2 tweets per user and an average of 23.8 users per location, each user’s tweets make up a significant percentage of the tweets at the user’s location. Since the tweets from a distinct user are often parts of the same discussions or based on the same vocabulary in this dataset, it is quite likely that hashtags from the user’s previous tweets will be ranked highly, both by similarity and geographically. This tendency is further reinforced since the dataset only maps one location per user, which means that all of the user’s tweets are treated as published at the same location. These results are therefore quite strongly affected by the limitations of the dataset, and further tests with more detailed data could lead to a more scientifically interesting analysis of different users’ vocabulary in a specific geographical area and of whether this can be used for hashtag recommendations.

Another attribute of the dataset that could benefit the ranking of hashtags from other users’ tweets is that all users located in a specific city are assigned the same coordinates, since the users’ locations were originally specified only by city name in the dataset. As a result, all tweets produced in a city get exactly the same geographical rank, leaving the final ranking to be decided by similarity in content. This behaviour is particularly noticeable in big cities like New York, where around 7,000 users are located. It is therefore not possible to rank hashtags geographically within a city, and this is another flaw where further tests with more detailed user data could give numbers that better reflect reality. However, since there are still 2,216 distinct coordinates in the dataset, the geographical ranking still gives an indication of how the hashtag suggestion performance is affected.

Another characteristic of the data, and of tweets in general, that affects the result negatively is that users often use few hashtags in their tweets. The 529,727 unique hashtags result in an average of 1.39 hashtags per tweet in the dataset, and since a suggested hashtag is considered suitable only if it is one of the original hashtags of the test tweet, it is almost impossible to achieve high recall and precision values. One way of getting better results could be to consider a hashtag suggestion successful also when the suggested hashtag is spelled almost the same as the correct hashtag, which could also be a good way to make users use the same hashtag rather than several variants of the word. Ideally, such a spell check would consider a misspelled hashtag a match only if the algorithm knows that the intended meaning of the suggested hashtag is the same as that of the correct hashtag. A simpler approach would be to match misspelled hashtags only if the edit distance is very short, but this could lead to strange suggestions. It is hard to estimate how much the recall values would rise if this step were added, but the results would definitely improve.
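The simpler approach, matching on a short edit distance, can be illustrated with a minimal sketch. It is not part of the implemented algorithm: the function names `edit_distance` and `is_hit` and the default threshold `max_distance=1` are assumptions made for this example, and Levenshtein distance is assumed as the edit distance.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_hit(suggested: str, original: str, max_distance: int = 1) -> bool:
    """Hypothetical relaxed hit test: accept a suggestion that is within
    max_distance edits of the original hashtag (case-insensitive)."""
    return edit_distance(suggested.lower(), original.lower()) <= max_distance
```

With this threshold, `is_hit("stokholm", "stockholm")` accepts the misspelling, but `is_hit("cat", "car")` also returns `True` although the two words mean different things, which is the kind of strange match mentioned above.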

5.2 Efficiency

As the algorithm is constructed, the Jaccard Similarity Coefficient between the test tweet and each of the 380,636 tweets in the training set is first calculated in order to find tweets that are similar in content in some way. These somewhat similar tweets are then sorted, either by how similar they are to the test tweet or by their geographical distance to the author of the test tweet. This method is sufficient for datasets of this size, but since every test tweet is compared against every training tweet, the cost grows with the size of the training set and quickly becomes impractical on real-world data, where millions of tweets are produced every hour. It would therefore be desirable to narrow down the set of candidate tweets before applying the algorithm in order to generate suggestions faster, for example by initially clustering newly produced tweets based on their similarity to tweets in existing clusters. The algorithm could then be run only on tweets from the same cluster as the test tweet, which would give a faster result. At the same time, this would probably lead to a noticeable change in the results, since searching only a certain cluster would exclude potentially suitable hashtags that belong to other clusters.
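The exhaustive comparison described above can be sketched as follows. This is an illustration, not the thesis implementation: the data layout (a list of `(word_set, hashtags)` pairs), the name `similar_tweets` and the `min_similarity` threshold for "somewhat similar" are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard Similarity Coefficient |A ∩ B| / |A ∪ B| of two word sets."""
    union = a | b
    if not union:
        return 0.0  # two empty tweets share nothing to compare
    return len(a & b) / len(union)


def similar_tweets(test_words, training, min_similarity=0.0):
    """Score every training tweet against the test tweet (linear scan)
    and return the sufficiently similar ones, most similar first.

    training: list of (word_set, hashtags) pairs (assumed layout).
    """
    scored = []
    for words, hashtags in training:  # one comparison per training tweet
        score = jaccard(test_words, words)
        if score > min_similarity:
            scored.append((score, hashtags))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

The loop performs one set comparison per training tweet for every test tweet, which is why the running time depends directly on the size of the training set.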

5.3 Significant Challenges when Suggesting Hashtags

During the construction of the hashtag suggestion algorithm, several challenges were identified that could affect the result depending on how they were solved. Having continuously analyzed the possible outcomes of these decisions and the way the algorithm would have to be constructed, we are confident that the paths we have chosen match the way we want the algorithm to work. It is also important to remember that a change in these parameters will probably affect the results in some way. Most of these factors have been described in detail in Section 5.1 and Section 5.2, and a brief summary is presented below.

• Big variation in vocabulary and hashtags. As described in Section 5.1, many users misspell words and use their own unique vocabulary when producing tweets. This affects the result negatively, since the algorithm currently considers a suggested hashtag a hit only if it is spelled exactly the same as the original hashtag. One way of solving this is to correct the suggested hashtags and suggest them as the correctly spelled word to which they have the shortest edit distance, but this could also lead to suggested hashtags that fit worse. It is also hard to determine when a hashtag should be corrected and when it is supposed to be spelled as it already is.

• Target which tweets in the dataset to test. As described in Section 5.2, it becomes unsustainable to test the similarity between a tweet and all the training tweets if the number of training tweets approaches the total number of tweets on the Twitter service. One solution is to first place tweets in clusters, where each cluster contains tweets that are similar to each other in some way, and then run the test within a certain cluster, but this could also exclude potentially suitable results that are placed in other clusters. Testing the similarity against all tweets works fine with this dataset since it is relatively small, but the algorithm should be improved if it is to be used on considerably larger amounts of data.

• Combination of similarity and geographic ranking. When ranking by both similarity and geographic data, different results are generated depending on the order in which the rankings are performed. If the algorithm first picked out the tweets within a certain geographic range and then ranked them by similarity, it would most probably perform badly unless a local event that all users are tweeting about is going on. With this in mind, the algorithm is instead designed to pick out the tweets that are somewhat similar in content and then rank them by geographic distance. This gives a better result, since the hashtags to be geographically ranked are picked from tweets that are already similar to the test tweet in some way. At the same time, this makes it hard to determine whether a suggested hashtag is good mostly because it was picked from a similar tweet or mostly because it was published nearby.
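The chosen order, first filtering by similarity and then ranking by geographic distance, can be sketched as below. This is a hedged illustration rather than the actual implementation: the dictionary layout of a tweet, the name `rank_geographically`, the `min_similarity` threshold and the Earth radius constant are assumptions, while the Jaccard coefficient and the haversine formula are the measures named in Chapter 2.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, assumed for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, h)))


def jaccard(a, b):
    """Jaccard Similarity Coefficient of two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_geographically(test_words, test_pos, training, min_similarity=0.0):
    """Step 1: keep tweets that are somewhat similar in content.
    Step 2: order them by distance to the test tweet's author.
    Returns hashtags in ranked order without duplicates.

    training: list of dicts with keys "words", "pos" (lat, lon)
    and "hashtags" (assumed layout).
    """
    candidates = [t for t in training
                  if jaccard(test_words, t["words"]) > min_similarity]
    candidates.sort(key=lambda t: haversine_km(*test_pos, *t["pos"]))
    suggestions = []
    for tweet in candidates:
        for tag in tweet["hashtags"]:
            if tag not in suggestions:
                suggestions.append(tag)
    return suggestions
```

Because the similarity score is only used as a filter in this sketch, a nearby tweet with low similarity is ranked above a distant tweet with identical wording, which illustrates why it is hard to tell which of the two factors made a suggestion good.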

Chapter 6 Conclusion

This thesis shows that geographic data can be used to suggest hashtags for a tweet, with some reservations. The version of the algorithm that ranks the suggested hashtags only by their tweets’ similarity in content to the test tweet performs better than the geographic ranking when hashtags from the author’s own tweets cannot be suggested. The version that ranks the suggested hashtags by geographic distance achieves a better result if hashtags from the author’s own tweets can also be suggested. This means that the version that ranks the suggested hashtags by geographic distance is preferable, since hashtags from the author’s own tweets are often suitable suggestions. An even more realistic result could be obtained with a dataset containing more tweets, more users and more detailed locations.

Significant challenges in suggesting hashtags for tweets are the variation in the vocabulary that users employ when composing tweets, the selection of which tweets to suggest hashtags from in order to produce a result quickly, and how to combine the ranking by similarity with the ranking by geographic distance. Possible countermeasures are to correct and gather different variations of a hashtag around a common hashtag and then suggest that hashtag, and to only suggest hashtags from tweets that belong to the same pre-computed cluster as the test tweet, but it is not clear how these measures would affect the result.

Bibliography

[1] Charu C Aggarwal and Chandan K Reddy. “An Introduction to Cluster Analysis”. In: Data Clustering - Algorithms and Appli-cations. University of Minnesota: Taylor & Francis Group, 2014, p. 7. URL: https : / / www - cambridge - org . focus . lib . kth.se/core/books/mining- of- massive- datasets/

C1B37BA2CBB8361B94FDD1C6F4E47922.

[2] Charu C Aggarwal and Chandan K Reddy. “Data Mining”. In: Mining of Massive Datasets. Stanford University: Cambridge Uni-versity Press, 2014. URL: http : / / www . crcnetbase . com .

focus.lib.kth.se/isbn/9781466558212_.

[3] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. “You Are Where You Tweet: A Content-based Approach to Geo-locating Twitter Users”. In: Proceedings of the 19th ACM International Con-ference on Information and Knowledge Management. CIKM ’10. Toronto, ON, Canada: ACM, 2010, pp. 759–768. ISBN: 978-1-4503-0099-5.

DOI: 10.1145/1871437.1871535.URL: http://doi.acm.

org/10.1145/1871437.1871535_.

[4] R Ferdous and M Shameem. “An efficient k-means algorithm integrated with Jaccard distance measure for document cluster-ing”. In: (2009).URL: http://ieeexplore.ieee.org.focus.

lib.kth.se/stamp/stamp.jsp?arnumber=5340335.

[5] Fréderic Godin et al. “Using Topic Models for Twitter Hashtag Recommendation”. In: Proceedings of the 22Nd International Con-ference on World Wide Web. WWW ’13 Companion. Rio de Janeiro, Brazil: ACM, 2013, pp. 593–596. ISBN: 978-1-4503-2038-2. DOI:

10.1145/2487788.2488002. URL: http://doi.acm.org/

10.1145/2487788.2488002.

(34)

34 BIBLIOGRAPHY

[6] Google Developers. Google Maps Geocoding API. Accessed 2017-04-24. 2017.URL: https://developers.google.com/maps/

documentation/geocoding/.

[7] Marc Hassenzahl. “User Experience (UX): Towards an Experi-ential Perspective on Product Quality”. In: Proceedings of the 20th Conference on L’Interaction Homme-Machine. IHM ’08. Metz, France: ACM, 2008, pp. 11–15. ISBN: 978-1-60558-285-6. DOI: 10.1145/

1512714.1512717. URL: http://doi.acm.org/10.1145/

1512714.1512717.

[8] J Hirschberg and C Manning. “Advances in natural language processing”. In: (2015).URL: http://science.sciencemag.

org.focus.lib.kth.se/content/349/6245/261_.

[9] Su Mon Kywe et al. “On Recommending Hashtags in Twitter Networks”. In: Social Informatics: 4th International Conference, SocInfo 2012, Lausanne, Switzerland, December 5-7, 2012. Proceedings. Ed. by Karl Aberer et al. Berlin, Heidelberg: Springer Berlin Heidel-berg, 2012, pp. 337–350.ISBN: 978-3-642-35386-4.DOI: 10.1007/

978-3-642-35386-4_25_. URL: http://dx.doi.org/10.

1007/978-3-642-35386-4_25.

[10] Daniel T. Larose and Chantal D. Larose. “An Introduction to Data Mining”. In: Discovering Knowledge in Data. John Wiley & Sons, Inc., 2014, pp. 1–15.ISBN: 9781118874059. DOI: 10.1002/

9781118874059 . ch1_. URL: http : / / dx . doi . org / 10 .

1002/9781118874059.ch1_.

[11] Jalal Mahmud, Jeffrey Nichols, and Clemens Drews. “Home Lo-cation IdentifiLo-cation of Twitter Users”. In: ACM Trans. Intell. Syst. Technol. 5.3 (July 2014), 47:1–47:21. ISSN: 2157-6904. DOI: 10 .

1145 / 2528548. URL: http : / / doi . acm . org / 10 . 1145 /

2528548_.

[12] Netflix, Inc. Netflix Prize. 2009.URL: http://netflixprize.

com_.

[13] New York Times. For Election Day Influence, Twitter Ruled Social Media. Accessed 2017-03-21. Nov. 2016. URL: https : / / www . nytimes.com/2016/11/09/technology/for-election-day - chatter - twitter - ruled - social - media . html ? _r=0.

(35)

BIBLIOGRAPHY 35

[14] Ranks NL. Default English Stopwords List. Accessed 2017-04-24.

2017.URL: http://www.ranks.nl/stopwords/.

[15] C. C. Robusto. “The Cosine-Haversine Formula”. In: The Ameri-can Mathematical Monthly 64.1 (1957), pp. 38–40.ISSN: 00029890,

19300972.URL: http://www.jstor.org/stable/2309088.

[16] Magnus Rosell. “Text Clustering Exploration”. In: (2009). URL:

http : / / kth . diva - portal . org / smash / get / diva2 :

209282/FULLTEXT01.pdf_.

[17] Twitter, Inc. Company | About. Accessed 2017-03-21. 2017. URL:

https://about.twitter.com/company_.

[18] Twitter, Inc. FAQs about adding location to your Tweets. Accessed 2017-03-21. 2017. URL: https : / / support . twitter . com /

articles/78525.

[19] Twitter, Inc. New user FAQs. Accessed 2017-03-21. 2017.URL: https: / / support . twitter . com / articles / 101125 # Trend _ Location.

[20] Twitter, Inc. Twitter Advanced Search. Accessed 2017-03-21. 2017.

URL: https://twitter.com/search-advanced.

[21] Twitter, Inc. Using hashtags on Twitter. Accessed 2017-03-21. 2017.

URL: https://support.twitter.com/articles/49309?

lang=en_.

[22] W. John Wilbur and Karl Sirotkin. “The automatic identifica-tion of stop words”. In: Journal of Informaidentifica-tion Science 18.1 (1992), pp. 45–55.DOI: 10.1177/016555159201800106. eprint: http: / / dx . doi . org / 10 . 1177 / 016555159201800106. URL:

http://dx.doi.org/10.1177/016555159201800106.

[23] Eva Zangerle, Wolfgang Gassler, and Gunther Specht. “Recommending#-tags in twitter”. In: Proceedings of the Workshop on Semantic

Adap-tive Social Web (SASWeb 2011). CEUR Workshop Proceedings. Vol. 730. 2011, pp. 67–78.

(36)

Appendix A Stop Words

a about above after again against all am an and any are aren’t as at be because been before being below between both but by can’t cannot could couldn’t did didn’t do does doesn’t doing don’t down during each few for from further had hadn’t has hasn’t have haven’t having he he’d he’ll he’s her here here’s hers herself him himself his how how’s i i’d i’ll i’m i’ve if in into is isn’t it it’s its itself let’s me more most mustn’t my myself no nor not of off on once only or other ought our ours ourselves out over own same shan’t she she’d she’ll she’s should shouldn’t so some such than that that’s the their theirs them themselves then there there’s these they they’d they’ll they’re they’ve this those through to too under until up very was wasn’t we we’d we’ll we’re we’ve were weren’t what what’s when when’s where where’s which while who who’s whom why why’s with won’t would wouldn’t you you’d you’ll you’re you’ve your yours yourself yourselves
