
Bachelor Thesis

Evaluation of a Prototype for Relevance Profiling

Karl-Johan Alm


Abstract

Only a small portion of the information generated online is relevant to any given person.

In this thesis, a prototype for determining a relevance value based on sets of data for some topic is evaluated to determine its viability in a future product called Votia.

To achieve this, an evaluation model was defined based on “accuracy” and “efficiency” for various machine learning algorithms applied to various types of data found in a tweet — a short user message on the Twitter platform — such as the message, relations between users and the tweeter, users’ general behavior characteristics, and geographic data. A system was set up to fetch and convert Twitter data into data fitting the prototype, with the hypothesis that (1) the Twitter data model could be mapped into the Votia data model, from which user behavior could be predicted at an adequate accuracy, and that (2) user behavior could be predicted to some degree from isolated sets of data.

Data from Twitter was obtained by taking a random sample of users — the main actors — and then loading their and their friends’ timelines. The data was processed, identifying interactivity between the set of users and their friends, in particular who retweeted what. A number of machine learning algorithms, such as the Naïve Bayes classifier, were tested on this data and evaluated according to the model.

In the case of user relation, data was instead obtained by identifying a number of the top Twitter users, and the evaluation revolved around grouping their followers based on how similarly they behaved.

The evaluation shows that predicting user behavior from isolated sets of data is not applicable in the given environment, and that the data set must be analyzed in a more integrated manner, e.g. by grouping similar users together. As the input data sets are arbitrary, each being analyzed in specific ways, a pipeline with processing modules that not only analyze the data sets in terms of relevance, but also perform preprocessing, is suggested. Examples of preprocessing might be filtering, adjusting data for use by subsequent modules, or rejecting the data outright at an early stage.

Keywords: machine learning, twitter, information, grouping, social networks, relations


Tables and figures

Figures

6.1. Data prediction using an SGD classifier with linear SVM . . . 20
6.2. Real data accuracy rate over number of positive hits . . . 20

Tables

6.1. Top (registered) tweets by activity . . . 21
6.2. Group statistics, with applicable followers . . . 21


Contents

Abstract iii

Tables and figures v

Contents viii

1. Introduction 1
1.1. Research problem . . . 2
1.2. Purpose . . . 2
1.3. Goal . . . 2
1.4. Method . . . 2
1.5. Stakeholders . . . 3
1.6. Delimitations . . . 3
1.7. Security and ethics . . . 3
1.8. Thesis structure . . . 3
2. Method 5
2.1. Research phases . . . 5
2.2. Evaluation model . . . 5
2.2.1. Accuracy . . . 6
2.2.2. Efficiency . . . 6
3. Twitter and Votia 9
3.1. Twitter . . . 9
3.2. Votia . . . 9
3.3. Twitter data into Votia data . . . 10
4. Machine learning 11
5. Relevance profiling 13
5.1. Defining the research problem . . . 13
5.1.1. Learning from data . . . 13
5.1.2. Profiling a user . . . 14
5.2. Defining the data . . . 14
5.3. Software . . . 14
5.4. Algorithms . . . 15
5.5. Tweet text . . . 15
5.6. Geographic location . . . 15
5.7. Time . . . 16
5.8. User relation . . . 16
5.8.1. Groups . . . 16
6. Evaluation 19
6.1. Tweet text . . . 19
6.2. Geographic location . . . 20
6.3. User relation . . . 20
6.3.1. Applied and neutral affinity . . . 21
7. Results 23
7.1. Tweet text . . . 23
7.2. Geographic location . . . 23
7.3. User relation . . . 23
8. Analysis 25
9. Conclusions 27
9.1. Future work . . . 27
10. References 29
A. Synthesized Data Generation 31
A.1. Vocabulary . . . 31
A.2. Users . . . 31
A.3. Polls . . . 32
A.4. Specific instances used in the thesis . . . 32
A.4.1. Noise test . . . 32
A.4.2. Voting frequency test . . . 32
B. Geographic data structure from Twitter 33
C. Twitter Data Collection Strategy 35


1. Introduction

The last few decades have been aptly labeled “the Information Age,” propelled by the equally aptly named “Digital Revolution” which started in the early eighties. Throughout this period, the availability of information has increased without interruption, not only in quantity and quality, but also in format and characteristics, as can be observed on sites such as the “Internet World Statistics” web site [1] or in articles such as Bret Swanson’s “The Coming Exaflood” [2]. Information today possesses properties whose correlation did not exist a year or decade prior, such as a Tweet, a short message on the social platform Twitter∗, which may contain links, pictures, and/or geographic information provided by its author as they see fit.

Unprecedented privacy implications abound, and the general Internet populace surprises — and in some cases disheartens — traditional privacy advocates in its willingness to publicize seemingly private matters without reservation, e.g. through Facebook†, Twitter or similar social networks. In “Millennials and Privacy in the Information Age: Can They Coexist?” [3, p. 35] Yadin speaks of “The Right to Privacy,” saying that “There is a long debate about the specific definition of these rights. However, even when considering the basic definitions adopted by the U.N. in the Universal Declaration of Human rights, recent events undermine them,” going on to mention terrorist threats as an argument for, and “rapid technological advancement clashes with ‘the right to be let alone’ ” as an argument against.

This “public-making” trend has plenty of arguably destructive examples, where jobs were lost or reputations ruined. In an ABC News article, “a supervisor of the high school math and science program in Cohasset, Mass., was forced to resign [that] week after parents spotted Facebook comments she wrote describing students as ‘germ bags’ and parents as ‘snobby’ and ‘arrogant’ ” [4]. A Connecticut official was similarly fired because he “wrote on Facebook that his first day on the job involved ‘counseling an administrator to retire or face termination,’ and ended the comment with a smiley face.” [5] Plenty more examples exist, and About.com’s career planning page [6] even advises users openly that “bad behavior can make you lose your job.”

The important question, however, is not whether this is acceptable — such a question is irrelevant, as the trend is already embraced by the global online community: “Millennials will continue to expose their lives, ignoring potential negative implications and undermining the ‘old’ privacy norms.”‡ [3, p. 37] — but where to draw the line: to take responsibility for a service (to “do good”) by e.g. ensuring implication awareness with the user, and to establish a solid foundation upon which this previously unheard-of level of interaction can stand productive and remain a positive force in society.

There is a wide range of online social networks. The “List of social networking websites” on Wikipedia [7] shows Facebook, Twitter, Google+, Bebo, and Tagged in the 100 million and up

http://www.twitter.com/

http://www.facebook.com/

The author describes “Millennials” as follows: “Millennials (also referred to as Gen Y) were born between 1981 and 2000. People belonging to this generation were influenced by the rapid expansion of technology and media, violence, widespread drug usage, and unprecedented immigration growth” [3, p. 33]



range (counting general and not country-specific services), but after these the subscriber count rapidly halves in just a few steps and continues to drop. From a global “non-niche” perspective, a few monolithic corporations consequently host the majority of the social networks in use. These giants surround themselves with a thriving ecosystem of applications, made possible through well-crafted APIs (Application Programming Interfaces), which allow regular users to create additional content for use within the service in question. The result is layer upon layer of opportunities for users to hook themselves into “mini-services” on the platform, each of which is a piece of the puzzle that makes up the (public) profile of the user in question. Properly mapped out, there is no limit to the level of detail at which a user’s interests, habits and opinions can be categorized using these pieces.

All this rests on little more than a corporation’s word that it will “do good” — which, unbelievable as it may seem, is sufficient for the one billion users Facebook had as of October 2012.

The urgent question that arises is “why?”, and the answer may lie in the benefits: by accurately profiling a user’s interests, habits and opinions, a service can make qualified guesses as to what the user wants at any given point in time. With the amount of information produced each day in our immediate vicinity, and with only a fraction of that information being of interest to us (presumably), it is perhaps not all too surprising that we would make such big sacrifices in return for this “filtering service.”

The unparalleled amount of noise that surrounds us today shows no signs of abating, and the knife’s-edge balance between what we offer of ourselves and what we keep to ourselves spreads into the realm of the philosophical.

1.1. Research problem

Given a “topic,” in the form of a set of keywords, and a “profile”, a set of data from a number of sources related to some user, can a system be developed which accurately and efficiently determines the “relevance” of the given topic for the given user, where “relevance” is defined as a scalar describing how likely the user is to benefit from being presented with the topic?

1.2. Purpose

The purpose of the thesis was to evaluate a prototype for determining the relevance as defined in 1.1 above.

1.3. Goal

The goal was to implement a functional (proven) version of the system into a product called Votia, a “social opinion network” of sorts, in which the system would be used to determine which polls to present to the user, and how to present them.

1.4. Method

The method is described separately in Chapter 2 on page 5.


1.5. Stakeholders

Aside from KTH, The Royal Institute of Technology in Stockholm, Sweden, and the student, the project involved Grafikbolaget AB∗ as the company owning the Votia trademark and product, with Anders Jildén as the main on-site mentor.

1.6. Delimitations

In this thesis, the important aspects “security” and “ethics” have not been evaluated. The potential implications in these aspects are discussed in 1.7 below.

1.7. Security and ethics

This report involves the maintenance of sensitive data that could have serious consequences if handled improperly. The system builds upon a foundation of mutual trust between the users of the system and its owners, and as such, the importance of addressing security-related and ethical issues cannot be overstated.

The information obtained by the system is used to profile the interests of the user. As such, keeping the user as anonymous as possible (to the system itself) is essential. Should the data related to users somehow leak, it must not be possible to identify a user’s real identity based on the data, except for public-figure users, such as politicians or celebrities choosing to use the system “officially.”

Twitter’s “Developer Rules of the Road” has a Principles section with the sentence “Don’t surprise users”, a powerful and accurate principle that should be adhered to here as well. Using the system, users should be aware of what is happening. They should not be surprised by unexpected information getting into the hands of unexpected people. When user information is shared, careful consideration must be made as to whether it should be shared at all, and if so, whether the user should be explicitly asked to allow the sharing.

Even if kept supposedly anonymous, leaked information can still be used to identify individuals, as Thelma Arnold found out firsthand when New York Times reporters tracked her down based on her supposedly anonymous AOL search history, as “AOL Searcher No. 4417749” [8]. As such, making sure the data is secure is essential, no matter how anonymized it may ultimately be.

1.8. Thesis structure

The thesis is divided into 9 chapters.

Introduction

Chapter 1 (p. 1) introduces the topic, giving background information on social networks, privacy, and users’ choice to offer up privacy.

http://grafikbolaget.com


Method

Chapter 2 (p. 5) describes the method used in the evaluation of the prototype, including definitions for the key aspects.

Twitter and Votia

Chapter 3 (p. 9) gives a brief introduction to Twitter, and describes how the relationship between the Twitter data model and the Votia data model was defined in the research.

Machine learning

Chapter 4 (p. 11) introduces the concept of machine learning, describing key theoretical concepts.

Relevance profiling

Chapter 5 (p. 13) goes into specific details on how the prototype was implemented in practice.

Evaluation

Chapter 6 (p. 19) describes the evaluation process, focusing on the execution rather than the results.

Results

Chapter 7 (p. 23) contains the results of the evaluation, in the form of reports with data from the evaluation.

Analysis

Chapter 8 (p. 25) contains the analysis, where the implications of the evaluation results are discussed.

Conclusions

Conclusions and suggestions for future research are presented in Chapter 9 (p. 27).


2. Method

In this chapter, the research method is presented. Chapter 2.1 describes the phases of the research. Chapter 2.2 presents the evaluation model.

2.1. Research phases

The work was divided into three phases: (1) literature and study phase, (2) evaluation model defining phase, and (3) evaluation.

The literature and study phase consisted of finding material on machine learning, online privacy implications, ethics definitions, and similar. Keywords used in KTHB [9], KTH’s electronic library, were “data mining”, “community mapping”, “search engines”, “social networking” and “information filtering”, among others. Google Scholar was also often used to find potential sources that were then located within e.g. Engineering Village [10]. Beyond this, several books [11] [12] and a video-recorded course [13] on the subject of machine learning were used.

This phase provided the machine learning theory necessary to perform the experiments required in this thesis and, combined with the literature on social networking, online privacy, and similar, the insights needed to define the evaluation model.

The evaluation model defining phase consisted, as the name suggests, of defining the evaluation model and of doing further research in existing literature on the subject, but also of setting up a working developer environment with actual data, and a tentative, working version of the prototype being developed. This latter part was necessary in order to properly support the decisions related to the evaluation model with real data.

The evaluation phase consisted of evaluating the prototype, in various incarnations, noting the pros and cons of each.

During the evaluation, both the algorithms being evaluated and the data upon which the evaluations were made were improved upon and as such changed over time. All of the above phases thus include data gathering.

2.2. Evaluation model

The evaluation model comprises two key aspects: “accuracy” and “efficiency”. In the definitions, the following key expressions are used:

• an item is a particular result, such as an entry in a search query or a targeted ad,
• a property is a defining attribute of some item or group of items, such as “Politics” or “Kista, Stockholm, Sweden, Europe” or “2013-03-31 17:25”,
• realm of interest is used to describe a given interest of a user; users have (usually) multiple realms of interest, and these are not necessarily interchangeably interwoven,
• relevance is a scalar indicator of how useful or interesting a given item is to the user; a successful guess means the item is relevant, which is the opposite of the item being noise,
• user environment defines everything that surrounds or makes up a user, such as their age.


A note on the realms of interest: “politics,” “Kista, Stockholm,” and “health care” are examples of realms of interest, as are “politics in Kista, Stockholm” and “health care politics in Kista, Stockholm,” but it is important to note that someone with an interest in “health care politics in Kista, Stockholm” may not care at all about other, non-health care related kinds of “politics in Kista, Stockholm” or about “politics” or “Kista, Stockholm” in general. Chances are higher, however, that they do, compared to users with no related realm at all.

2.2.1. Accuracy

“Accuracy” describes how accurate the system is at determining how relevant a given item is to the user, and how quickly the system learns and adapts to input from the user or changes in the user’s environment that affect the results. An item can be relevant in a number of ways.

It can be related, meaning it is properly identified as being of similar kind as something else that the user has expressed interest in.

It can be interesting, meaning it has been (correctly) identified as an item that the user would find useful, even if no direct proof that such was the case existed beforehand.

It can be profile-derived, meaning the user has been (correctly) profiled (based on their circle of friends, profile information, geographical location, the author, time or date, or other salient attributes) to find the item of interest.

It can be present, meaning a decision was (correctly) made that the item is present enough (in time or in geographical proximity to the user or some location the user has marked as interesting) that the user would consider it valuable, even if it did not fall into the user’s realms of interest.

As each user is influenced differently by the above factors, the accuracy also takes into account how these are individually weighted in the system.

The scoring of accuracy is done as follows: a number of random sets of data of increasingly large sizes is used for learning, after which the system is asked to predict the relevance of the items in another given set of items.

The resulting data is then compared to the actual relevance, which is determined in various ways depending on the kind of data being examined. For binary data, where users e.g. do or do not click on a link, an item is counted as a “yes” guess if its relevance (with items sorted in descending order of relevance) places it among the top k items, where k is the total number of actual “yes” answers in the set. In other words, in a set with 5 items, where 2 answers are “yes” and 3 are “no”, and the items (in descending order of predicted relevance) are yes, no, yes, yes, no, the system receives 1 + 1 correct answers and 1 + 2 wrong answers, giving it a 2/5 = 40% accuracy (for that set). For numerical data, the factors above are combined as appropriate (where possible) and the result is then compared, in relative terms, to the relevance value produced by the system.

Each set and its resulting accuracy is then weighted (sets with smaller amounts of data are given a higher impact on the final result, to reward systems that adapt quickly) and the weighted sum makes up the accuracy score for the given system.
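As a concrete illustration, the binary scoring rule above can be sketched in a few lines of Python; the function name and the list encoding are ours, for illustration only, and not part of the prototype:

def set_accuracy(relevance, labels):
    # Binary accuracy as defined above: the top-k items by predicted
    # relevance (k = number of actual "yes" answers) count as "yes"
    # guesses; everything else counts as "no".
    k = sum(labels)  # labels: 1 = yes, 0 = no
    ranked = sorted(range(len(labels)), key=lambda i: relevance[i], reverse=True)
    predicted_yes = set(ranked[:k])
    correct = sum(1 for i, label in enumerate(labels)
                  if (i in predicted_yes) == (label == 1))
    return correct / len(labels)

# The worked example: items already in descending relevance order,
# actual answers yes, no, yes, yes, no -> 2/5 = 40% accuracy.
print(set_accuracy([5, 4, 3, 2, 1], [1, 0, 1, 1, 0]))  # 0.4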

2.2.2. Efficiency

“Efficiency” describes how efficient the system is in terms of processing power and memory usage. A system which gives a 0.1% increase in accuracy but requires 100 times more processing power and/or memory would be more accurate but highly inefficient.


Although efficiency was ultimately not taken into account in the report, its definition is still of interest for potential future evaluations.


The efficiency of the system is scaled to where the most efficient system has a 100 score in its respective factor (processing power and/or memory usage). These are then weighted and combined into an average efficiency which represents the system.

Any system that is not as efficient as the most efficient system receives a score proportional to the latter. For example, if the memory usage of a given system S was 4 MB/s and the best system B only used 3 MB/s, S would receive a score (in terms of memory usage) of 75 (3/4). If S processed 1,000 entries per second and the current best, B, processed 1,500 entries per second, S would receive a score of 67 (1000/1500).
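A minimal Python sketch of this proportional scoring (the helper name and its parameters are illustrative assumptions, not part of the evaluation code):

def efficiency_score(value, best, higher_is_better=True):
    # The best system scores 100; others score proportionally to it.
    ratio = value / best if higher_is_better else best / value
    return round(100 * ratio)

print(efficiency_score(4, 3, higher_is_better=False))  # memory usage: 75
print(efficiency_score(1000, 1500))                    # throughput: 67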


3. Twitter and Votia

This chapter describes what Twitter and Votia are, and how Twitter data is mapped into Votia data in the research.

3.1. Twitter

Twitter is a social network where users “tweet” short messages restricted to 140 characters. Users follow and are followed by other users, and whenever a user tweets, their followers see that tweet in their timelines. Besides tweeting, users can also retweet other users’ tweets, thus propagating the tweet so that their followers see it as well.

Users can reference other Twitter users, using “@username”, and the users mentioned will see the message even if they do not follow the tweeter in question. Furthermore, users can make use of arbitrary hash tags, “#word”, to tag their messages, a sort of self-categorizing mechanism, and users can follow and/or view a hashtag timeline.

Users can view other users’ timelines as well, simply by viewing somebody’s profile. An exception to this is protected users — anyone can choose to be a protected user, which means they have to approve who follows them, and only those who follow the protected user can see their timeline.

Users can optionally choose to include geographic information with their tweets, which means the location at which the user tweeted is included in the tweet as GPS coordinates (generated via Geo-IP∗ or, in the case of cell phones, a combination of the cell phone’s GPS and Geo-IP).

Despite the simplicity of the platform, Twitter is huge, and is used for both political and commercial reasons by governments and corporations throughout the world. For example, the @BarackObama Twitter account is actively used by the U.S. government (and, at times, by Obama himself) to spread messages of various kinds.

3.2. Votia

Votia is a social opinion network, where users create polls and/or vote on existing polls. Users listen to and are listened to by other users, and whenever a user creates or votes on a poll, their listeners see the activity in their timelines. Besides creating and voting on polls, users can also promote existing polls, to indicate that they find the poll interesting or important, and promoted polls appear in listeners’ timelines as well.

When creating a poll, a user can reference other public users, indicating that said users are related to the poll in question (such as the two president candidates of an election). A public user is a user who turned on the public flag, which by default is off. This is the equivalent of a protected user in Twitter, except in Votia, everyone is protected by default.

http://ipinfodb.com/ip_database.php shows geographic information about the visitor using their IP address. The use of both geo-IP and GPS means a user on a device without GPS (e.g. a regular desktop computer) can provide their location as well.


Users can also choose to tag their poll, using regular hash tags like in Twitter, or using a separate tag field. Tags can be searched within the system by users, just like hash tags in Twitter.

Users can choose to tag their poll with geographic information. This can be their current location, or an arbitrary location as a point or polygon drawn on a map, or as a place (street, city, country, ...). Users can search or listen to specific geographic areas, which means a matching poll for a listened to area will appear in the user’s timeline.

3.3. Twitter data into Votia data

The hypothesis was that a prototype could be developed by obtaining Twitter data and mapping it into Votia data, due to the similarities between the two services, in particular on the data model level:

• a tweet is a poll, if the tweet has at least one retweet by someone
• a retweet of a tweet corresponds to a vote on the poll
• following someone is listening to the person
• being followed is being listened to

By fetching a random sample of users, and their followers, and then mapping tweets and retweets into polls and votes, research could be done to determine an efficient system for predicting user behavior in terms of what they would find interesting (i.e. what they would retweet).
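A minimal Python sketch of this mapping might look as follows; the dictionary layout and field names are assumptions for illustration (the actual prototype converted the data with Ruby scripts and stored it in MongoDB, as described in Chapter 5.3):

def tweets_to_polls(tweets):
    # A tweet becomes a poll only if somebody retweeted it; each retweet
    # becomes a (yes) vote on that poll.
    polls = []
    for tweet in tweets:
        if tweet["retweets"]:
            polls.append({
                "question": tweet["text"],
                "creator": tweet["user"],
                "votes": [{"voter": rt["user"], "answer": "yes"}
                          for rt in tweet["retweets"]],
            })
    return polls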


4. Machine learning

This chapter covers the basic machine learning and related concepts used in this thesis. The information described here is compiled from Harrington’s “Machine Learning in Action” [11], Mason’s “An introduction to machine learning with web data” [13], and the scikit-learn [14] documentation.

Supervised and unsupervised learning

Supervised learning is where you train the machine from a set of existing data with given answers. Unsupervised learning is where the machine learns by itself, following predefined rules or algorithms.

Classification

Classification is where data is sorted (classified) into distinct categories.

Naïve Bayes classification

A method of classification called Naïve Bayes classification is used in this report. It is called naïve because it presumes that every piece of data is independent of the other pieces of data. Bayes classification is a statistical method following Bayes’ theorem, which states that the probability of A given B is

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

that is, the probability that A holds given B is the probability of the opposite situation (B given A) times P(A), divided by P(B).∗

Regression

Regression is the alternative to classification. “The difference between regression and classification is that in regression our target variable is numeric and continuous.” [11, p. 178]

Related to regression are regression equations, which are regular equations with weights on each component. Regression equations are the results of regression.

Stochastic Gradient Descent (SGD)

The scikit-learn developers describe SGD as follows: “the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).” In other words, for each sample in the data set being tested, the model is adjusted in increasingly small quantities.

If A is “has cancer” and B “is a smoker” then, if we know P (A), P (B) and P (B|A) — the probability of a person being a smoker given that they have cancer — we can deduce the probability of a person having cancer from the knowledge that they’re a smoker.


Support Vector Machine (SVM)

According to Wikipedia [15], Support Vector Machines are “supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis.” The scikit-learn documentation states that SVM is “widely regarded as one of the best text classification algorithms” [14], with the caveat that it is “a bit slower than naïve Bayes.” Harrington states that “Support vector machines are considered by some people to be the best stock classifier. By stock, I mean not modified.” [11, p. 101]

Count Vectorizing and Term Frequency times Inverse Document Frequency

Count Vectorizing is the process of converting text features such as words into unique numeric representations.

Term Frequency is a measurement of how often a feature occurs in some data. Term Frequency times Inverse Document Frequency (TF-IDF) is a downscaling of the former, used to neutralize the bias caused by input data of varying size (short documents and long documents should be treated equally in terms of which class they belong to in a classification problem).
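In scikit-learn, the two steps correspond to CountVectorizer and TfidfTransformer; a toy example (the documents are made up):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["football is great", "great football great game"]
counts = CountVectorizer().fit_transform(docs)    # words -> count matrix
tfidf = TfidfTransformer().fit_transform(counts)  # downscale by document frequency
print(tfidf.toarray())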

Stemming

Stemming is where words are manipulated so that they become more uniform, thus assisting in finding similarities between texts that use different variations of the same (key) word. For example, the word “football” and the word “footballs” both have the stem “football”. Without stemming, the system would assume that the two words are different. Porter stemming is a common variant of stemming used in machine learning. The variant used here is the implementation in NLTK’s Stem module, which “follows the algorithm presented in Porter, M. ‘An algorithm for suffix stripping.’ Program 14.3 (1980): 130-137.” with some modifications∗.
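For example, with NLTK’s Porter implementation (a two-line sketch):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("footballs"))  # "football", the same stem as "football"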

Positives and negatives

In this report, positives refer to the set of outputs from a machine learning operation, or from labeled data, for which interest was predicted — tweets that the system expects the user to retweet. The negatives are the complement set of the positives.

http://nltk.org/_modules/nltk/stem/porter.html


5. Relevance profiling

This chapter further defines the research problem (5.1) and how the input data was processed to fit into the prototype model (5.2), and details the decisions made in terms of software (5.3) and algorithms (5.4). The last part of the chapter goes into specifics for how the input data sets were handled.

5.1. Defining the research problem

The research problem as defined in Chapter 1.1 (p. 2) needs some clarification before going into more specifics on the actual machine learning part, in particular the method by which the set of input data for a given tweet was to be handled, not to mention how individual users would be “learned”.

5.1.1. Learning from data

The hypothesis:

1. each data entity in the data set is dealt with in its own, defined way
2. data entities may, however, affect each other
3. some data entities may have multiple processing steps for parts of their content
4. the parameters for the processing steps in the data sets are a part of the learning process and thus change, in the form of influence weights
5. results from analyses expire

The second item became apparent fairly quickly in the research. Due to the dominating amount of negative values in whether a user retweets a given tweet that appears in their timeline, the system degraded, in terms of false positives, as the amount of data increased (see Chapter 6.1 (p. 19)). To combat this bias and filter out some of the “negative noise,” other data, such as relations between the users involved, became necessary in the learning process.

The fourth item basically means that users care differently about different types of data, such as how popular a tweet’s creator is, or whether the tweet is in some geographically significant location. The system weighs each processing step depending on how influential its corresponding data seems to be over the user’s actions, and for very low weights, the processing step may be skipped entirely.

The last (fifth) item means that learned results are systematically weeded out of the system as they become old, or if the system has a redundant amount of data.

The set of input data then results in a set of relevance values, normalized to a uniform scale so that they are neutral in relation to each other. Finally, this set of outputs is combined into a weighted average which is the actual relevance of the given tweet, for the given user. The weight here is the influence weight defined in the fourth item.


5.1.2. Profiling a user

The above provides information about a specific user, but due to a reasonably high probability of common interests it also provides information about their circle of friends, location, age group, interest group, and so on. As such, the profile of a user may be affected by the profiles of other users, and their behavior may affect the profiles of others. This kind of “influence-learning” was left out of this report at first, but ultimately turned out to be a vital part of it. See Chapter 8 (p. 25).

5.2. Defining the data

The desired data is described according to the following key concepts:

1. relations (friend, follower, or neither) are defined between the users in the system
2. polls are defined, and include the question, poll creator, geographic location (optional), time, etc.
3. polls can be viewed, promoted, demoted and voted upon
4. votes include a reference to the poll and the answer (e.g. yes/no), as well as the voting user, time, etc.

Since this kind of data was unavailable, an approximation was made using Twitter data∗ in the following manner:

1. relations are a perfect match as defined in Twitter, where users follow and/or are followed by other users
2. any tweet that is retweeted is counted as a poll, and tweets contain the desired data
3. every poll that is in a user’s timeline is “viewed”; every poll that is voted on is promoted; demotion is disregarded in this prototype
4. any retweet is counted as a vote on the target tweet; as it is hard to say whether a retweet signals agreement or disagreement (a “yes” vote or a “no” vote), votes are here simplified to always mean agreement

Via the Twitter API, user profiles (including users they follow) and tweet timelines were obtained and mapped according to the above approximation into a database, and the prototype analysis was performed on this data. The complete strategy defining this process can be found in Appendix C (p. 35).

5.3. Software

The following software was used in the evaluation:

• Ruby was chosen as the language used to mine data from Twitter, as it had sophisticated modules† available for authenticating to and communicating with the Twitter API‡. It was also used to convert data from the Twitter model to the Votia model, as defined in 5.2 above, and for preparing test data for the machine learning algorithm tests.

For information on what Twitter, Tweets, Retweets, etc. are, please see https://twitter.com/about and related pages.

The Twitter gem, at https://github.com/sferik/twitter, and the TweetStream gem, at https://github.com/intridea/tweetstream.

http://dev.twitter.com/


• Simple JSON serialization was used to keep the raw results of the Twitter API calls. Since the API limits the number of calls a user can make, the mining operation was separated from the processing operation, so that duplicate fetches would be kept to a minimum.
• MongoDB was used as the database back end for storing the (processed) Votia representation of the data.

• Python was used for the actual machine learning code, for a number of reasons. Most of the scientific community tends to prefer Python, and as a consequence, Python has readily available modules for the algorithms used in machine learning, such as matrix manipulation, linear algebra, k-means, etc., with NLTK∗ and NumPy.

• In particular, Scikit-learn, a machine learning library for Python, was used extensively to try out various machine learning algorithms on the data.

5.4. Algorithms

Following the advice of Harrington in his book “Machine Learning in Action”[11, p. 11], the following set up was used as the starting point:

• Supervised over unsupervised learning: As I was trying to “predict or forecast a target value,” supervised learning was recommended. In my case, in determining the relevance of a given poll, the data would be the poll and its creator, and the answer would be defined according to the Accuracy model in Chapter 2.2.1 (p. 6). Ultimately, unsupervised learning was a part of the system as well, in that the system would adjust itself by seeing whether users responded according to expectations.

• Classification over regression: Although I was aiming to obtain a continuous relevance value defining how interesting a given poll is, the input data was discrete (either a user did or did not retweet (vote on) the tweet (poll)), and as such classification was chosen as the starting point.

5.5. Tweet text

The text of the tweet was analyzed by stemming the words, trimming out common words (e.g. “and” or “the”), and analyzing the resulting words individually.

The data was preprocessed and stored in a simple text format in chronological order, where for each retweet (yes) a 1 was printed, and for each ignored tweet (no) a 0, followed by the text content.
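A line in such a dump, and a sketch of a loader for it, might look as follows (the sample contents and the helper are illustrative, not the actual preprocessing scripts):

# Each line: "<label> <stemmed tweet text>", e.g.
#   1 love footbal game
#   0 weather rain today
def load_labeled_tweets(path):
    labels, texts = [], []
    with open(path) as f:
        for line in f:
            label, _, text = line.rstrip("\n").partition(" ")
            labels.append(int(label))
            texts.append(text)
    return labels, texts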

Text analysis was done using synthesized data as well as real data.

5.6. Geographic location

Geographic data for a tweet was available as a GPS point (latitude and longitude) and/or as a set of coordinates (as longitude and latitude GPS points — note the swapped order) and/or as a place, a construct containing type (e.g. “city”), country, name, and similar, as well as a bounding box which is a polygon of GPS points. See Appendix B for an example of the geographic part of a tweet.

Natural Language ToolKit, http://nltk.org/


Geographic data was also potentially available in the form of a user location, which is e.g. “Stockholm, Sweden”.

Geographic data could be relevant in two ways: a user may be interested in the location itself, or the location is close to the user. One might argue that these are the same thing (a user is interested in some specified set of locations, plus their own location), but in reality, a user may not at all care about what’s going on around them, even though they care passionately about what is happening at their favorite sports bar, or at home, or their kid’s day care center.

Both types of data were analyzed in the same way: if a tweet had geographic data, it was compared to the locations in the set of interesting points. The distance to the closest of these points determined the geographic relevance.
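A sketch of this distance test in Python; the haversine formula is a standard way to compute distances between GPS points, but the inverse-distance relevance at the end is our assumption, since the prototype only specifies that the closest distance determines the relevance:

from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    # Great-circle distance in kilometers between two (lat, lon) points.
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def geo_relevance(tweet_point, interesting_points):
    # Relevance decreases with the distance to the closest point of interest.
    nearest = min(haversine_km(tweet_point, p) for p in interesting_points)
    return 1.0 / (1.0 + nearest)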

5.7. Time

Time was a factor in judging relevance based on the time when the tweet was created, as well as the time difference between when it was created and when the user saw/retweeted it. From this, a number of things can be discerned about the user’s behavior.

For instance, some users only skim the 20-30 newest tweets in their timeline, while others continue reading where they left off, even if it means hundreds of tweets. From the former type of user, activity (such as retweeting) can be used to detect their presence on the platform, and from there, whether they would retweet something can be predicted simply by seeing whether the item is within visible range.

Some users only use the system at certain given times. Some users only care about very new information, while others are more age-forgiving.

5.8. User relation

In its simplest form, a user relation was defined as the probability that a user A would retweet any tweet created by a user B, i.e. the number of retweets by A of tweets by B over the total number of tweets by B. While easy to compute, more accurate models were desired, using groups and concepts from basic set theory.

To properly evaluate how user relations played a role in relevance, a separate data set was obtained from Twitter. Users in the top 12 Twitter user range∗ were used as “sources” — they are referred to as super users from here on, not to be confused with the *NIX term — and the user base was made up of random followers of these 12 users. The followers were then analyzed in terms of how similarly they behaved to each other, and groups were set up based on this analysis.

5.8.1. Groups

A group in this context was a list of users (its members), and a source (a super user). Every group had an affinity score, which gave the overall similarity between all of its users. Each user had an affinity weight associated with each group they were a member of, which indicated how tuned they were to the group in question.

The affinity score was simply defined as the portion of similar retweeting behavior for all users over the retweets by all users.

The users were: TheEllenShow, jtimberlake, BarackObama, justinbieber, katyperry, ladygaga, Oprah, taylorswift13, rihanna, shakira, Cristiano, britneyspears


Given the set of n tweets T_1, T_2, \ldots, T_n which makes up the tweets seen by all the members in the group, and a set of m user retweet sets R_1, R_2, \ldots, R_m, where all tweets retweeted by user i are stored in R_i, the affinity was defined as

$$A = \frac{|R_1 \cap R_2 \cap \cdots \cap R_m|}{|R_1 \cup R_2 \cup \cdots \cup R_m|} \qquad (5.1)$$

i.e. the intersection of the sets of retweets made by each user divided by the union of the same sets. Since the union is the total number of retweets by the group, this can be simplified to

$$A = \frac{|R_1 \cap R_2 \cap \cdots \cap R_m|}{n}$$

Whenever a group’s source tweeted, the group’s interest was determined through machine learning of the tweet text.

The (user relation) relevance value for a user was computed as follows. Let the user belong to the set of groups G, of which the subset S ⊆ G, with ℓ = |S|, consists of the groups that were positive for a given tweet; let A_1, A_2, \ldots, A_ℓ be their affinity scores and W_1, W_2, \ldots, W_ℓ the user’s associated affinity weights. The relevance value was then given as

$$\sum_{k=1}^{\ell} W_k \cdot A_k \qquad (5.2)$$

i.e. the sum over the positive groups of the user’s affinity weight times the corresponding group’s affinity score. Groups with negatives did not affect the relevance value (i.e. there was no subtraction of values).

In this report, groups were restricted to one source, and the source was restricted to one of the twelve super users. In reality, a group can be anything (some keyword, some common user profile data, or similar), which should be apparent from the description above.

Note that, as a consequence of Eqn. (5.1), a group is considered to have retweeted the combined list of tweets for all users in the group. This affects the evaluation in a number of interesting ways; see Chapter 6.1 (p. 19) on tweet text evaluation.
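Expressed with Python sets, the two equations reduce to a few lines (the function names and arguments are ours; retweet_sets holds one set of tweet ids per member, and n_seen is the number of tweets seen by the group):

def neutral_affinity(retweet_sets, n_seen):
    # Eqn. (5.1), simplified form: tweets retweeted by *every* member
    # over the number of tweets seen by the whole group.
    shared = set.intersection(*retweet_sets)
    return len(shared) / n_seen

def user_relation_relevance(positive_groups):
    # Eqn. (5.2): sum of affinity weight times affinity score over the
    # user's groups that were positive for the given tweet.
    return sum(w * a for w, a in positive_groups)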


6. Evaluation

The evaluation was done separately for each type of data: tweet text (6.1), geographic location (6.2), and user relation (6.3).

6.1. Tweet text

The tweet text was analyzed using a number of Naïve Bayes classifiers and Support Vector Machines (via an SGD classifier).

Figure 6.1a shows the algorithm used with synthesized data iterated over a variable noise ratio (chance to randomize∗ outcome in data generation) between 0.0 (0%) and 0.5 (50%), and a variable input quantity between 500 and 5000 polls. The synthesized data used a randomly generated vocabulary of 1000 unique pretend words of which 40 were designated “keywords” (words of interest to the (fake) users), with a 50% probability of “actually being interesting” to a given user. We return to this figure again below.

Figure 6.1b uses the same algorithm, but noise is set at 0.0, keywords is 1000 of 1000, and the interest probability for keywords varies between 0.10 and 0.90. Note that this is not the same as there being a 10% chance a user will vote on a given poll (see Appendix A (p. 31)).

The accuracy at each iteration is smoothed out by taking the average of three executions per iteration, to dampen spikes due to uncommon input due to randomization.

The data was processed in scikit-learn using a count vectorizer, through a TF-IDF (Term Frequency times Inverse Document Frequency) transformer, then finally through an SGD classifier with hinge loss, L2 penalty†, an alpha of 10^-3 and 5 iterations.
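This corresponds roughly to the following scikit-learn pipeline; the input file name is hypothetical (see the loader sketch in Chapter 5.5), and newer scikit-learn versions spell the iteration parameter max_iter rather than the older n_iter:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

labels, texts = load_labeled_tweets("timeline_dump.txt")
clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("sgd", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, max_iter=5)),
])
clf.fit(texts, labels)
predictions = clf.predict(["some unseen tweet text"])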

Equivalent calculations using naïve Bayes classifiers were made, with variably lower accuracies, consistent for nearly every calculation in 6.1.

The problem with the method can be noted in (b), where the prediction accuracy drops drastically as the retweeting frequency increases beyond low values, and only manages a 70% accuracy at 3,000 data entries. Since the accuracy is determined as the number of accurate predictions, this includes both positive and negative answers. In the end, the system is almost always guessing negatively, and due to the low amount of positive input, this gives a very high accuracy.

With real input, the problem is much more pronounced. In the synthesized tests, 500 input keys were considered low, but with real data, 100 retweets over a user’s available timeline‡ was the average, meaning a mere 3% positive input. Figure 6.2 shows the accuracy for a number of sample users, again including both positive and negative input in the accuracy. At 4-6 positives, the system has learned hundreds of negatives, and while the overall accuracy is basically 100%, the positive accuracy is in the 0-10% range. Adding a class weight of 100 to the positive label∗ gives a 25-30% hit rate for positives, but negatives drop down toward 30-40% or lower as well.

Not invert, which means there is a 50% chance the random outcome is the correct value. I chose to do this to avoid unintentional effects in the learning process (a system that guesses 100% wrong also guesses 100% right, if you simply invert the answers).

Standard regularizer for linear SVM models — http://scikit-learn.org/dev/modules/generated/sklearn.linear_model.SGDClassifier.html

Base use of the Twitter API is restricted to the most recent 3,600 tweets of a user’s timeline.

Figure 6.1.: Data prediction using an SGD classifier with linear SVM in scikit-learn, plotting prediction accuracy against the number of data entries and the noise intensity/retweet frequency. (a) Synthesized data with variable noise and quantity: the more noise, the worse the accuracy, indicating algorithm correctness. (b) Variable retweet frequency and quantity: low data quantity and/or high retweeting frequency (50%+) give low accuracy, except for low quantity combined with low frequency (30%-). [Plots not reproduced.]

Figure 6.2.: Real data accuracy rate over number of positive hits; the overall accuracy stays between 0.95 and 1.0 across 4-16 positive hits. [Plot not reproduced.]

6.2. Geographic location

Of all the tweets obtained from Twitter, approximately 15% had geographic data embedded in them. Due to the unexpectedly low amount of data, geographic evaluation was not possible.

6.3. User relation

The data consisted of approximately 60,000 users and 20,000 retweets, obtained by fetching retweeters for recent statuses for the 12 source users. Table 6.1 shows the top 20 retweets† in the system.

Meaning the system should modify the positive probability by 100 before determining whether the input is positive or negative.

The number of retweets seen here is not the actual number of retweets, but the number of fetched retweets, along with the retweeting user, etc. The tweets themselves have tens or hundreds of thousands of retweets.


Rank  Retweets  User
1     670       justinbieber
2     625       justinbieber
3     531       justinbieber
4     481       justinbieber
5     470       justinbieber
6     452       justinbieber
7     444       justinbieber
8     382       justinbieber
9     364       justinbieber
10    355       justinbieber
11    340       justinbieber
12    322       justinbieber
13    316       justinbieber
14    315       justinbieber
15    311       justinbieber
16    306       justinbieber
17    300       taylorswift13
18    295       justinbieber
19    294       justinbieber
20    274       taylorswift13

Table 6.1.: Top (registered) tweets by activity

Super user     Followers  Groups  Members (avg/group)  Coverage  Affinity (neutral)
TheEllenShow   17/5329    5       8 (1.6)              0.12      0.36 (0.35)
jtimberlake    9/4800     2       3 (1.5)              1.02      0.64 (0.35)
BarackObama    55/5653    2       4 (2.0)              0.06      0.41 (0.43)
justinbieber   541/4904   1545    417 (0.27)           0.18      0.31 (0.29)
katyperry      126/4981   207     103 (0.5)            0.29      0.32 (0.27)
ladygaga       53/3465    11      15 (1.36)            0.27      0.36 (0.31)
Oprah          13/5806    0       0                    -         -
taylorswift13  126/4436   328     92 (0.28)            0.29      0.31 (0.26)
rihanna        92/3674    44      45 (1.02)            0.44      0.36 (0.28)
shakira        47/5137    57      39 (0.68)            0.73      0.37 (0.23)
Cristiano      47/3910    39      40 (1.03)            0.82      0.42 (0.25)
britneyspears  60/4053    29      37 (1.28)            0.35      0.36 (0.3)

Table 6.2.: Group statistics: applicable followers (those with 10 or more actions) over total followers, groups, members (average members per group in parentheses), coverage — the portion of retweets over minimum seen tweets (which may be above 1.0 as the retweet count may include “unseen” tweets by some users) — and affinity (neutral affinity in parentheses).


6.3.1. Applied and neutral affinity

To form groups, a number of algorithms were tested, in particular with regard to the affinity threshold. Because the quality of the input was so dependent on the frequency of positives, groups with a high total retweet count were desired, even at the expense of affinity (Eqn. 5.1 (p. 17)). The ideal group is thus one with a large number of members, a large number of shared retweets, and a large number of positives, even if not shared between the members.

As such, an applied affinity was used which took this requirement into account. This affinity will be referred to as the affinity, and the previous one will be referred to as the neutral affinity from here on.

Initial groups were generated by iterating over the known followers for each super user, and combining users into groups in all instances where the resulting group affinity would not go below a given threshold. The applied affinity A', with A being the neutral affinity, was thus determined as

$$A' = \left(0.9 + \frac{|R_1 \cup R_2 \cup \cdots \cup R_m|}{\sigma}\right) \cdot A \qquad (6.1)$$

where R_i is the list of tweets which member i retweeted, and σ is the number of tweets seen by all the members in the group, which means the factor may be above 1.0 in the instances where a lot of activity happened which some members did not see∗. For a group with 600 total retweets across all the users and 1,200 total seen tweets, the applied affinity would be (0.9 + 600/1200) \cdot A = 1.4A, i.e. it would “boost” the neutral affinity by 40% due to the coverage.
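As a sketch (the function and its parameters are illustrative), Eqn. (6.1) amounts to:

def applied_affinity(total_retweets, seen_tweets, neutral_affinity):
    # Eqn. (6.1): boost the neutral affinity by the group's coverage.
    return (0.9 + total_retweets / seen_tweets) * neutral_affinity

# The worked example above: 600 retweets over 1,200 seen tweets.
print(applied_affinity(600, 1200, 1.0))  # 1.4, i.e. a 40% boost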

At first, only an affinity threshold was used, but this resulted in groups with too low a neutral affinity (where only a few tweets were shared across all the members, out of hundreds). With an affinity threshold of 0.30 / 0.20 (secondary / neutral), groups with the characteristics described in Table 6.2 were formed, showing the statistics from group generation for each user. Notably, the obtained data for the “Oprah” account had too few applicable followers — followers who had retweeted at least 10 times in the known data set — to form any groups at all, which suggests that its followers are less prone to act (retweet) than in the other cases. It stands in contrast to the “justinbieber” account, with over 500 applicable followers and over 1,500 groups.

Here, “seen” is also used for instances where the system does not know, due to lack of timeline data for a member.


7. Results

7.1. Tweet text

Due to the lack of extensive data, and due to the overestimated frequency at which users retweet, none of the tested methods performed well. One big reason was the overwhelming amount of negative input, which skewed the classifiers to the point where all guesses became negative, with positive input as rare as 0.1% or below. Another reason was simply the lack of data: the average sampled user retweeted about 100-200 times throughout the available timeline history. A third reason was that isolated analysis of the poll text alone ignores a number of factors that play too big a role, such as who tweeted, what time the tweet occurred, and similar.

7.2. Geographic location

Lack of data prevented this evaluation from being executed.

7.3. User relation

As can be noted in Table 6.2 (p. 21), the applicability varies depending on the source, which may be due to a number of reasons, among them the lack of extensive data. Most of the data shows that generating groups of users based on a source and their relative similarity in reference to said source is applicable, with applicable followers matched into at least one group in most cases, and a reasonably high coverage for the sources with enough applicable followers.


8. Analysis

The prototype was not successful, due to a failure to produce reliable predictions for tweet text, an essential part of the system. In order to properly predict user behavior, isolating the classifier to only seeing the text is not applicable. Instead, the classifier must be more context aware, which can be achieved by calling the classifier on the user groups, rather than the users themselves. Groups with a strong affinity and enough positives to properly teach the classifier could be used to give a “sub-answer” which, when combined with the results of the other related groups, the user’s own history and other related factors, could provide adequate prediction accuracy for individual users.

Mapping Twitter data into Votia data had its problems. In Votia, a single poll is expected to have anything from a few to hundreds or thousands of votes and promotions, whereas in Twitter, there are many more tweets than there are retweets. Mapping tweets to polls and retweets to votes was unsuccessful, due to this inverted ratio relationship, except in certain given circumstances as seen in the user relation evaluation, Chapter 6.3 (p. 20).

On a more fundamental level, a tweet is by nature not the same as a poll. For one, Votia does not (at this point) have the 140 character limit for polls, which means a poll may have more details to assist in the machine learning process, and for another, arguably more importantly, a tweet is a short message about anything at all, whereas a poll is an explicit question or statement of an opinion. This difference of intent may ultimately have bigger consequences on the accuracy of prediction than initially expected.


9. Conclusions

The problem formulated in Chapter 1.1 (p. 2) cannot be satisfactorily solved based on this report. Several clues as to how one might solve it have become apparent, however, such as the necessity to more tightly combine the data sets, and to hierarchically structure the system on groups of similar users, rather than on the individual users themselves.

Representing Votia data as mapped Twitter data had its weaknesses, in particular in how users were selected in the text evaluation (completely at random). Only a subset of Twitter users retweet other users’ tweets, and retweets between regular individuals are far rarer than expected. Instead, a number of top Twitter users should have been chosen (as was done in the user relation evaluation), and the research based on their followers. This would have had its flaws as well, however, as most users only retweet a small portion of the total tweets made by the source user, and some measurement based on the total number of actual retweets may have been required (such as a per-user threshold for whether they should even be polled at all, as a separate machine learning problem that filtered out negatives).

However, using Twitter data to do further research is warranted. With proper selection of the sources, and with better understanding of the overall Twitter scene, an applicable algorithm for determining relevance of topics, or at least the foundation for one, can most likely be created.

9.1. Future work

Focus should be placed on the potential in group generation, in particular in how to efficiently generate groups according to the ideals as presented in Chapter 6.3 (p. 20). Groups with high spread across many tweets and a high affinity could provide the prediction accuracy necessary to reliably generate a relevance value for users.


10. References

[1] Miniwatts Marketing Group, “World internet users statistics usage and world population stats.” http://www.internetworldstats.com/stats.htm. Accessed: 2013-03-31.

[2] B. Swanson, “The coming exaflood.” http://online.wsj.com/article/SB116925820512582318.html. Accessed: 2013-03-31.

[3] A. Yadin, “Millennials and privacy in the information age: Can they coexist?,” Technology and Society Magazine, IEEE, vol. 31, no. 4, pp. 32–38, 2012.

[4] K. M. Heussner, “Teacher loses job after commenting about students, parents on facebook.” http://abcnews.go.com/Technology/facebook-firing-teacher-loses-job-commenting-students-parents/story?id=11437248#.UVgreRkxrRs. Accessed: 2013-03-31.

[5] Associated Press, “Conn. official losing job after facebook posts.” http://www.boston.com/news/education/k_12/articles/2010/09/15/conn_official_losing_job_after_facebook_posts/. Accessed: 2013-03-31.

[6] D. R. McKay, “Bad behavior that can make you lose your job.” http://careerplanning.about.com/od/workplacesurvival/tp/harm_your_job.htm. Accessed: 2013-03-31.

[7] Wikipedia, “List of social networking websites.” http://en.wikipedia.org/wiki/List_of_social_networking_websites. Accessed: 2013-04-05.

[8] M. Barbaro and T. Zeller Jr., “A face is exposed for aol searcher no. 4417749.” http://www.nytimes.com/2006/08/09/technology/09aol.html?_r=0. Accessed: 2013-05-09.

[9] KTH, “KTH bibliotek.” http://www.kth.se/kthb.

[10] Elsevier B.V., “Engineering village.” http://www.engineeringvillage.com.focus.lib.kth.se/controller/servlet/Controller?CID=quickSearch&database=3.

[11] P. Harrington, Machine Learning in Action. Manning, 2012.

[12] T. Segaran, Programming Collective Intelligence. O’Reilly, 2007.

[13] H. Mason, “An introduction to machine learning with web data,” May 2011.

[14] scikit-learn developers, “scikit-learn: machine learning in python.” http://scikit-learn.org/stable/. Accessed: 2013-05-12.

[15] Wikipedia, “Support vector machine.” http://en.wikipedia.org/wiki/Support_vector_machine. Accessed: 2013-05-26.

(38)
(39)

Appendix A. Synthesized Data Generation

The synthesized data was generated as Votia data, which means tweets are referred to as polls and retweets as votes.

Synthesized data was generated based on the following parameters:

• Noise (float): 0.0 to 1.0. The probability that a user's vote action is randomized.

• Mains (int): how many main actors are in the synthesized scene.

• Friends (int): how many friends (per actor) should participate.

• Vocabs (int): how many words should exist in the vocabulary.

• Keywords (int): how many of the above words should be keywords, i.e. words that hold some interest to users.

• Interest (float): 0.0 to 1.0. The probability, for each user and each keyword, that the user is interested in that keyword.

• Intensity (int): for every user and for every keyword that they find interesting, their interest in that keyword is set to a random value between 1 and the intensity.

• Threshold (int): the threshold for triggering a vote on a poll; for every poll, if the sum of the intensities of the keywords appearing in the poll reaches the threshold, the user will vote on the poll (see the sketch after this list).

• Polls (int): the number of polls to create in total.
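As a minimal Python sketch of the resulting vote decision (function and parameter names are illustrative, and the exact behavior of the noise randomization is an assumption; the thesis only specifies that vote actions "get randomized"):

    import random

    def would_vote(interests, poll_text, threshold, noise=0.0):
        # interests maps keyword -> intensity for one user.
        # Assumption: "randomized" means a fair coin flip on the vote.
        if random.random() < noise:
            return random.random() < 0.5
        score = sum(interests.get(word, 0) for word in poll_text.split())
        # "Above the threshold" is read as >= here, so that the
        # threshold-1 simplification in A.4 makes any interested user vote.
        return score >= threshold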

For every new synthesized setup, any existing data was wiped, including users, polls, and other settings.

A.1. Vocabulary

The vocabulary was set up as "vocabs" number of unique words, using a pretend language whose structure resembles that of Japanese (consonants and vowels rigidly dependent on each other). An example might look like

“sagyo siyieezo gyonomu zegikobio soheru ozeomi chyugyue suteicha uiiefyi gukorora tumiyeruta”

Keywords were then chosen randomly (distinctly) out of the vocabulary.
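A sketch of how such a vocabulary could be generated in Python; the syllable inventory and word lengths are assumptions, as the thesis does not specify them:

    import random

    CONSONANTS = ["k", "s", "t", "n", "h", "m", "y", "r", "g", "z", "b", "ch"]
    VOWELS = ["a", "i", "u", "e", "o"]

    def make_word():
        # Rigid consonant-vowel pairs give Japanese-like syllables.
        return "".join(random.choice(CONSONANTS) + random.choice(VOWELS)
                       for _ in range(random.randint(2, 5)))

    def make_vocab(vocabs, num_keywords):
        # Draw until we have "vocabs" unique words, then pick the keywords.
        words = set()
        while len(words) < vocabs:
            words.add(make_word())
        vocab = sorted(words)
        return vocab, random.sample(vocab, num_keywords)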

A.2. Users

"Mains" number of times, a new main user is set up, and their interests are generated by iterating over the keywords using the configuration parameters above. After this, "friends" number of other users are created, and the main user is set to follow each of them.
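A minimal sketch of the interest generation, assuming the parameters above (names are illustrative):

    import random

    def generate_interests(keywords, interest, intensity):
        # Each keyword interests the user with probability `interest`;
        # the strength of an interest is uniform in 1..intensity.
        return {kw: random.randint(1, intensity)
                for kw in keywords
                if random.random() < interest}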


A.3. Polls

“Polls” number of times, the poll text is created according to the following method:

1:  procedure PollText(Vocab, Keywords)
2:      text ← random element from Vocab
3:      words, kwords ← 0
4:      while words < 3 or rand() < 0.8 − 0.1 · words do
5:          word ← random element from Vocab
6:          repeat line 5 while kwords > 0 and word in Keywords and rand() < 0.8
7:          text ← text + " " + word
8:          words ← words + 1
9:          kwords ← kwords + 1 if word in Keywords

Above, rand() means a random float between 0.0 and 1.0, so that rand() < 0.8 translates to “80% chance”. Note that there is no check for unique poll texts (which would be odd — people can and do write the same thing as each other all the time).
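For clarity, a direct Python rendering of the procedure follows (a sketch; the actual generator code may differ in details):

    import random

    def poll_text(vocab, keywords):
        text = random.choice(vocab)
        words = kwords = 0
        while words < 3 or random.random() < 0.8 - 0.1 * words:
            word = random.choice(vocab)
            # Line 6: once a keyword is present, re-draw further keywords
            # with 80% probability, presumably to avoid keyword-saturated polls.
            while kwords > 0 and word in keywords and random.random() < 0.8:
                word = random.choice(vocab)
            text += " " + word
            words += 1
            if word in keywords:
                kwords += 1
        return text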

After creating the poll text, a random main user is selected and a random friend of theirs is selected as the person creating the poll. The main user is then asked whether they would vote on the poll or not, using the intensity/threshold parameters above.

A.4. Specific instances used in the thesis

A.4.1. Noise test

The noise test (Fig 6.1a (p. 20)) used 1 main, 50 friends, 1000 vocabs, 40 keywords, 0.50 interest, 10 intensity, and 1 threshold. The threshold of 1 is a simplification: whenever a user is interested in any keyword appearing in a poll, they will always vote on that poll.

A.4.2. Voting frequency test

The voting frequency test (Fig 6.1b) used 0.0 noise, 1 main, 50 friends, 1000 vocabs, 200 keywords, 10 intensity, and 1 threshold. Again, the simplification used above applies here as well. Worth noting is that the number of keywords was set to 200, one fifth of the total vocabulary. The reason is that with too high a keyword-to-vocabulary ratio, the chance of at least one interesting keyword appearing in a poll of roughly 3 to 15 words approaches 100% (with keyword ratio r and n words per poll, the probability of at least one keyword is 1 − (1 − r)^n; at r = 0.5 and n = 10 this is already about 99.9%), and such a data set is pointless to test anything against.


Appendix B. Geographic data structure from Twitter

"geo":{ "type":"Point", "coordinates":[ 20.68685367, -103.37596178 ] }, "coordinates":{ "type":"Point", "coordinates":[ -103.37596178, 20.68685367 ] }, "place":{ "id":"7807880b17af73d8", "url":"http:\/\/api.twitter.com\/1\/geo\/id\/7807880b17af73d8.json", "place_type":"city", "name":"Guadalajara",

"full_name":"Guadalajara, Jalisco",

"country_code":"MX", "country":"M\u00e9xico", "contained_within":[], "geometry":null, "polylines":[], "bounding_box":{ "type":"Polygon", "coordinates":[ [ [ -103.410515, 20.6005851 ], [ -103.410515, 20.7527191 ], [

(42)

-103.262645, 20.7527191 ], [ -103.262645, 20.6005851 ] ] ] }, "attributes":{} }, 34
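Note that "geo" and "coordinates" describe the same point with the axis order swapped: "geo" (Twitter's older, deprecated representation) lists [latitude, longitude], whereas "coordinates" lists [longitude, latitude].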


Appendix C. Twitter Data Collection Strategy

TWITTER DATA COLLECTION STRATEGY SPECIFICATION

Revision : 3

Date : 2013-04-26

Author : Karl-Johan Alm <kjalm at kth dot se>

===========================================================================
INTRODUCTION
===========================================================================

This document describes the strategy used in obtaining data from Twitter in the "relevance profile" bachelor thesis from Spring, 2013.

It also describes how this data is mapped into a data representation of the "Votia" application.

A 'global' file exists; its sole purpose is to keep track of which users are currently being actively tracked, and in which scene each of them is located. This file is used to do two things: to select users distinctly across scenes, and to prepare the parameters used by the global stream reader.

The actual data is divided into independent sets called 'scenes'. Each scene consists of (main) users and their friends and, for each user, status updates of varying amounts, including retweets and responses. It also holds metadata relevant for expanding the scene (more below). Each scene is stored in its own folder, and contains one 'scene' file with information about the participating users, as well as one file per user, named after their respective user ids.
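A hypothetical on-disk layout consistent with this description (folder and file names beyond 'global' and 'scene' are illustrative, not fixed by this spec):

    global                 # tracked users, and which scene each belongs to
    scenes/
        scene-001/
            scene          # participating users + scene expansion metadata
            1234567890     # one file per user, named after the user id
            9876543210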

The Twitter API is rate limited on a per-request-type basis, with a 15-minute window, restricting each user (the system currently acts as a single user).


Further information can be found below.

Due to this limiting, the collection strategy is defined to gradually and systematically increase the number of scenes as well as the granularity of the respective scenes over time. Thus, quantity and quality both increase with time. See below.
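As an illustration of how a collector can respect this limiting, consider the following sketch (header names follow Twitter's v1.1 REST API; the retry policy is an assumption, not taken from the actual collector):

    import time
    import requests

    def rate_limited_get(url, params, auth):
        # Retry a GET until it is not rejected for rate limiting (HTTP 429),
        # sleeping until the window resets as reported by the API.
        while True:
            resp = requests.get(url, params=params, auth=auth)
            if resp.status_code != 429:
                return resp
            reset = int(resp.headers.get("x-rate-limit-reset", time.time() + 900))
            time.sleep(max(1, reset - int(time.time())))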

[For the rest of this document, please contact the author at the above address.]
