
February 2013

Tweet Collect: short text message collection using automatic query expansion and classification



Erik Ward

The growing number of Twitter users creates large amounts of messages that contain valuable information for market research. These messages, called tweets, are short, contain Twitter-specific writing styles and are often idiosyncratic, which gives rise to a vocabulary mismatch between the keywords typically chosen for tweet collection and the words users actually use to describe television shows. A method is presented that uses a new form of query expansion that generates pairs of search terms and takes into consideration the language usage of Twitter, in order to access user data that would otherwise be missed. Supervised classification, without manually annotated data, is used to maintain precision by comparing collected tweets with external sources. The method is implemented, as the Tweet Collect system, in Java, utilizing many processing steps to improve performance.

The evaluation was carried out by collecting tweets about five different television shows during their time of airing, indicating, on average, a 66.5% increase in the number of relevant tweets compared with using the title of the show as the search terms, at 68.0% total precision. Classification gives a slightly lower average increase of 55.2% in the number of tweets, but a greatly increased total precision of 82.0%.

The utility of an automatic system for tracking topics that can find additional keywords is demonstrated. Implementation considerations and possible improvements that can lead to better performance are discussed.

Sponsor: KDDI R&D Laboratories, Göran Holmquist Foundation. ISSN: 1401-5749, UPTEC IT 13 003


Social media such as Twitter are growing in popularity and large numbers of messages, tweets, are written every day. These messages contain valuable information that can be used for market research, but they are very short, 140 characters, and in many cases exhibit an idiosyncratic manner of expression. To reach as many tweets as possible about a particular product, for example a TV program, the right search terms must be available; one Twitter user does not necessarily use the same words as another to describe the same thing. Different groups thus use different language and jargon. In the text of Twitter messages this is obvious: we can see how some users employ certain so-called hashtags to express themselves, along with other linguistic variations. This leads to what is usually called the vocabulary mismatch problem.

In order to collect as many Twitter messages as possible about various products, a system that can generate new search terms has been developed, here called Tweet Collect. By analyzing which words provide the most information, generating pairs of words that describe different things, and taking into account the language used on Twitter, new search terms are created from the original search terms, so-called query expansion. Beyond collecting the tweets that match the new search terms, a machine learning algorithm determines whether or not these tweets are relevant, in order to increase precision.


This thesis expands upon:

Erik Ward, Kazushi Ikeda, Maike Erdmann, Masami Nakazawa, Gen Hattori, and Chihiro Ono. Automatic Query Expansion and Classification for Television Related Tweet Collection. Proceedings of Information Processing Society of Japan (IPSJ) SIG Technical Reports, vol. 2012, no. 10, pp. 1-8, 2012.

Acknowledgment

I wish to thank the Göran Holmquist Foundation and the Sweden Japan Foundation for travel funding.


Corpus A set of documents, typically in one domain.

Relevance feedback Update a query based on documents that are known to be relevant for this query.

Table of Notations

Ω The vocabulary: the set of all known terms.
t Term: a word without spacing characters.
q Query: a set of terms. q ∈ Q ⊂ D.
C Corpus: a set of documents.
d Document: a set of terms. d ∈ D, where D is the set of all possible documents.
tf(t, d) Term frequency: an integer-valued function that gives the frequency of occurrence of t in d.
df(t) Document frequency: the number of documents in a corpus that contain t.
idf(t) lg(1/df(t))
R Set of related documents; used for automatic query expansion.

Contents

1 Introduction

2 Background
   2.1 Twitter
      2.1.1 Structure of a tweet
      2.1.2 Accessing twitter data: Controlling sampling
      2.1.3 Stratification of tweet users and resulting language use
   2.2 Information retrieval
      2.2.1 Text data: Sparse vectors
      2.2.2 Term weights based on statistical methods
      2.2.3 The vocabulary mismatch problem
      2.2.4 Automatic query expansion
      2.2.5 Measuring performance
      2.2.6 Software systems for information retrieval
   2.3 Topic classification
   2.4 External data sources

3 Related work
   3.1 Relevant works in information retrieval
      3.1.1 Query expansion and pseudo relevance feedback
      3.1.2 An alternative representation using Wikipedia
   3.2 Classification
      3.2.1 Television ratings by classification
      3.2.2 Ambiguous tweets about television shows
      3.2.3 Other topics than television
   3.3 Tweet collection methodology
   3.4 Summary

4 Automatic query expansion and classification using auxiliary data
   4.1 Problem description and design goals
   4.2 New search terms from query expansion
      4.2.1 Co-occurrence heuristic
      4.2.2 Hashtag heuristic
      4.2.3 Algorithms
      4.2.4 Auxiliary Data and Pre-processing
      4.2.5 Twitter data quality issues
      4.2.6 Collection of new tweets for evaluation
   4.3 A classifier to improve precision
      4.3.1 Unsupervised system
      4.3.2 Data extraction
      4.3.3 Web scraping
      4.3.4 Classification of sparse vectors
      4.3.5 Features
      4.3.6 Classification
   4.4 Combined approach

5 Tweet Collect: Java implementation using No-SQL database
   5.1 System overview
   5.2 Components
      5.2.1 Statistics database
      5.2.2 Implementation of algorithms
      5.2.3 Twitter access
      5.2.4 Web scraping
      5.2.5 Classification
      5.2.6 Result storage and visualization
   5.3 Limitations

6 Experiments and data
      6.1.1 Auxiliary data
      6.1.2 Experiment parameters
      6.1.3 Evaluation
   6.2 Results
      6.2.1 Ambiguity
      6.2.2 Classification
      6.2.3 System results

7 Analysis
   7.1 System results
   7.2 Generalizing the results
   7.3 Evaluation measures
   7.4 New search terms
   7.5 Classifier performance

8 Conclusions and future work
   8.1 Applicability
   8.2 Scalability
   8.3 Future work
      8.3.1 Other types of products and topics
      8.3.2 Parameter tuning
      8.3.3 Temporal aspects
      8.3.4 Understanding names
      8.3.5 Improved classification
      8.3.6 Ontology
      8.3.7 Improved scalability and performance profiling

Bibliography

Appendices


List of Figures

2.1 The C4.5 classifier
3.1 Approach to classifying tweets, here for television shows, but the same approach applies for other proper nouns.
4.1 Visualization of the fraction of tweets by keywords for the show Saturday night live; here different celebrities that have been on the show dominate the resulting twitter feed.
4.2 Conceptual view of collection and classification of new tweets.
5.1 Conceptual view of collection and classification of new tweets.
5.2 How the different components are used to evaluate system performance. This does not represent the intended use case, where collection, pre-processing and classification is an ongoing process.
6.1 Results of filtering auxiliary data to improve data quality. Note that the first filtering step is not included here and these tweets represent strings containing either the title of a show or the title words formed into a hashtag.
6.2 Fraction of tweets by search terms for How I met your mother.
6.3 Fraction of tweets by search terms for The X factor.
7.1 Ways to generate training data from auxiliary data. Here we have two…


List of Tables

2.1 Examples of two vector representations of the same document. In this example the vocabulary is severely limited and readers should imagine a vocabulary of several thousands of words and the resulting, sparse, vector representations. Note that capitalization is ignored, which is very common in practice.
2.2 Inverted index structure. We can look up which documents contain certain words by grouping document numbers by the words that are included in the document. If we wish to support frequency counts of the words we store not only document numbers but tuples of (number, frequency).
3.1 Methods that I use from related works in my combined approach of query expansion and classification.
4.1 Expansion terms for the show "How I met your mother" using equation 2.7, and resulting search terms by hashtag, mention and co-occurrence heuristics. Note that a space means conjunction and a comma means disjunction. This used data where tweets mentioning other shows have been removed.
4.2 Search terms generated for the television shows The vampire diaries and The secret circle using a moderately sized data set.
5.1 List of dependencies organized by (sub)component.
6.1 TV shows used for collecting tweets with new search terms. Shows marked with "*" are aired as reruns multiple times every day.
6.2 Text sources used for comparing with tweets.
6.3 Number of tweets collected for the different TV shows during 23h30min.
6.4 Percentage of tweets containing the title that are related to the television show.
6.5 Classification results when using manually labeled test data as training data with 10-fold cross validation.
6.6 Classification results when using training data generated from the same…
6.7 …c_baseline(tweet) = related
6.8 System performance using automatic query expansion, before and after classification. The subscript c denotes results after classification.
7.1 95% confidence interval for accuracy with training data generated from the same external sources; training examples are from all five shows.
7.2 First 13 term pairs for AQE using top 40 terms to form pairs and virtual documents of size 5. Also visible is a bug where I do not remove hashtags from consideration when forming pairs.
B.1 Results of classification of annotated test data with linear support vector machines. Text data is treated as sparse vectors.

List of Algorithms


Introduction

“Ultimately, brands need to have a role in society. The best way to have a role in society is to understand how people are talking about things in real time.”

– Jean-Philippe Maheu, Chief Digital Officer, Ogilvy [19]

Adoption of social media has increased dramatically in recent years and millions of users use social media services every day. There are, for example, 806 million Facebook users [11] and 140 million twitter users [5]. Since the creation of material is decentralized and requires no permission, enormous quantities of unstructured, uncategorized information are created by users every minute; 340 million twitter messages are authored every day [5].

This development has co-occurred with the explosion of data generation in general and presents IT practitioners and computer scientists with new unsolved problems, but also with opportunities for new business. Industry has quickly realized that there is value in all this unstructured data, coining the equivocal term big data. The market for managing big data has grown faster than the IT sector in general and showed a growth of 10% annually, to $100 billion in 2010 [3].

The multitude of data available yields unprecedented opportunities to gain insight into what people are thinking and what they want; to conduct automated market research. This knowledge is extremely valuable for public relations and for advertising. One maturing technology that attempts to analyze users' opinions in text is sentiment analysis; another is estimating the ratings of television programs from the number of twitter messages written [39][35]. But for these technologies to be truly useful, text about the topics of interest needs to be collected in a reliable and representative fashion.

Classification and information retrieval techniques can be used to improve the quality and reach of twitter message collection. However, human text, especially in a social media setting, is often very vague, and one problem is to find the many messages that do not explicitly mention the topic that one wishes to analyze: the vocabulary mismatch problem [12]. Mitigating these difficulties is the focus of this thesis.

A crucial part of the process of conducting market research on a topic, such as determining sentiment towards a certain product or estimating ratings, is to get a good sample of messages. When gathering messages in social media, keywords determined by an analyst are often used, as in [38], [39] and [35]. I argue that this method ignores a large fraction of the messages relating to certain topics and thus detrimentally affects the validity of the results of later analysis. The idiosyncratic and novel language use on twitter, driven by the short message length, results in a vocabulary mismatch that can be mitigated by the use of a systematic method to find the messages not covered by using the title, or other manually selected terms, as search terms.

At its core the research problem addressed by this thesis work is:

Collect as many tweets as possible about a specified product.

This goal is to be solved using methods suitable for a running and scalable system. I use the term product instead of topic here since this is closer to most business goals of tweet analysis and it reflects the experiments that I have carried out. In essence, I wish to optimize tweet collection.

To improve tweet collection I present and test the use of streaming retrieval with additional keywords determined using relevance feedback techniques and automatic query expansion (AQE), as seen in information retrieval, particularly in ad-hoc retrieval. By comparing term distributions in sets of messages about different topics I determine descriptive terms for each topic that yield improved recall when included as search terms. By also classifying the retrieved tweets as either relevant or irrelevant to the topic, higher precision can be achieved. Classification also, in part, deals with the issue of ambiguity [40].

The proposed method is evaluated by collecting tweets about television shows using streaming retrieval for popularity estimation, but the method is not limited to this domain.

This thesis consists of the following chapters:

Chapter 2 An overview of general techniques.

Chapter 3 Related work in the area of tweet retrieval and classification.

Chapter 4 Methods that I have employed.

Chapter 5 Prototype system that I developed.

Chapter 6 Experiments and data.

Chapter 7 Analysis of results.


Background

This chapter presents the concepts and technologies involved, in particular: twitter, information retrieval (IR) and classification. The problem of conducting market research is first presented in terms of how to get data from twitter, then in terms of how to find which of this data to use, by IR and classification.

This chapter is intended to introduce the most common techniques used when finding relevant information, but these techniques are not used in the standard way in my proposed method. Instead they serve as the inspiration for, and implementation building blocks of, the method. A holistic, conceptual view of the proposed method is introduced in chapter 4, and it could be useful to read that chapter first if the reader is already familiar with the concepts presented here.

Subsection 2.1.2 is technical and subject to the changing implementation of the twitter API. However, it is necessary to analyze the API to understand the limitations present when working with twitter data, since this represents the access point used by researchers and other third parties.

2.1 Twitter

Twitter is a growing social media site where users can share short text messages of 140 characters: tweets. A user base of 140 million users [5] makes it a very interesting source of data. The data that I am interested in collecting is the messages where users write about certain TV programs, to use for market research.

2.1.1 Structure of a tweet

Below is a hypothetical twitter message highlighting different features. The format of the message is very similar to what one can find in actual tweets and it is evident that much of the information has been shortened to fit in 140 characters.


User: Erik Ward
User tag: erikuppsala
Text: I am writing a master thesis http://bit.ly/KdMep5 @kddird #KDDI
Time-stamp: 17:22 PM - 12 Oct 2012 via web

The very short text message format has given rise to several conventions adopted by the community:

Retweet The letters "RT" at the start of a message indicate that it is a copy of another message.

User tag A unique string associated with each twitter account.

Reply and mentions The @<uid> sign indicates that the message is directed towards a specific user with user tag <uid>, or refers to that user.

Hashtags A '#' sign followed by a keyword denotes the user-selected category of the message (one category for each unique keyword string). Hashtags are unorganized and work by gentleman's agreement.

Short URLs Several services provide a way to shorten URLs, such as transforming http://www.it.uu.se/edu/exjobb/helalistan to http://bit.ly/KdMep5 by redirecting through their site.
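To make these conventions concrete, the following sketch extracts hashtags, mentions and shortened URLs from the hypothetical tweet above using simplified regular expressions. The patterns are illustrative only; real tweet tokenization rules are more involved than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified extraction of tweet features; real tokenization rules vary. */
public class TweetFeatures {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");
    private static final Pattern MENTION = Pattern.compile("@(\\w+)");
    private static final Pattern SHORT_URL = Pattern.compile("https?://\\S+");

    static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) hits.add(m.group());
        return hits;
    }

    public static void main(String[] args) {
        String text = "RT I am writing a master thesis http://bit.ly/KdMep5 @kddird #KDDI";
        System.out.println("Retweet:  " + text.startsWith("RT"));   // "RT" prefix convention
        System.out.println("Hashtags: " + findAll(HASHTAG, text));  // [#KDDI]
        System.out.println("Mentions: " + findAll(MENTION, text));  // [@kddird]
        System.out.println("URLs:     " + findAll(SHORT_URL, text));
    }
}
```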

2.1.2 Accessing twitter data: Controlling sampling

In essence, accessing twitter data is done by collecting tweets that contain certain keywords or are written by specific users. What the keywords are for finding tweets about television shows and how they are obtained are described in chapter 4. But, even if these keywords are known, access to the data is limited because of the underlying medium.

The basic approach is sampling at different times using standard HTTP GET requests, the so-called REST approach. Each sample has an upper limit on how many tweets are retrieved, and a user is allowed only a certain number of calls per hour.

Conceptually: the Twitter company maintains a buffer of tweets of a fixed size that is indexed by a full text index for Boolean search. This FIFO cache is replaced with new tweets at different rates depending on the rate at which tweets are produced. Users are allowed to query this very large cache of tweets and thus gain access only to the fraction of results that were produced in a fairly recent time period. Furthermore, not all tweets that are produced are available through this method, and the complexity of a query is limited.


“Please note that Twitter’s search service and, by extension, the Search API is not meant to be an exhaustive source of Tweets. Not all Tweets will be indexed or made available via the search interface." – https://dev.twitter.com/docs/api/1.1/get/search/tweets (accessed Oct. 16, 2012)

Besides the fact that not all tweets are accessible through the REST approach, there are further complications. These limitations have to do both with the fact that the number of tweets per request is limited and with the fact that the number of requests is limited. If tweets are produced faster than the requests are issued, the surplus tweets are dropped without warning; this can happen if one wishes to track many different keyword sets. If they are produced more slowly, each request will return many previously seen tweets (wasting bandwidth). The following is also stated in the API documentation, making long queries harder to use in this setting:

“Limit your searches to 10 keywords and operators.” – https://dev.twitter.com/docs/api/1.1/get/search/tweets

Twitter data can also be accessed in a streaming fashion in two ways:

1. Access all incoming tweets or a sample of all incoming tweets. Accessing a random sample of all tweets is not attractive for our application and obtaining all tweets is a very data intensive streaming service requiring a contract with retailers.

2. Access all messages that match a Boolean query, e.g. “My friend has a dog” and “My father drives a Volvo” will match q = (My ∧ dog) ∨ (My ∧ Volvo). This sample is limited to at most 1% of all tweets but represents the most exhaustive way of collecting tweets containing certain keywords.
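As a toy illustration of the Boolean matching in the second access method, the sketch below evaluates a query given as a disjunction of conjunctions against a tweet's term set. This is only a conceptual model of the streaming filter, not Twitter's implementation:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of streaming keyword filtering: a query is a disjunction of conjunctions. */
public class BooleanFilter {
    /** Each inner list is a conjunction; the query matches if any conjunction matches. */
    static boolean matches(List<List<String>> query, String tweet) {
        Set<String> terms = new HashSet<>(Arrays.asList(tweet.toLowerCase().split("\\s+")));
        for (List<String> conjunction : query) {
            if (terms.containsAll(conjunction)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // q = (my AND dog) OR (my AND volvo)
        List<List<String>> q = List.of(List.of("my", "dog"), List.of("my", "volvo"));
        System.out.println(matches(q, "My friend has a dog"));      // true
        System.out.println(matches(q, "My father drives a Volvo")); // true
        System.out.println(matches(q, "I like cats"));              // false
    }
}
```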

When researchers evaluate their twitter related research it is common to use a static data set composed of messages collected for a certain query over a period of time [10][26]. One important such data set is the TREC microblog corpus. I will revisit various sampling issues in chapter 3.

In my project a combination of methods is used: the REST search method to acquire a large sample of tweets for many different topics over a long period of time, and the streaming method to track a specific topic in an exhaustive way.

2.1.3 Stratification of tweet users and resulting language use

It is safe to assume that different groups of twitter users use different language to describe their thoughts. Certain trends, e.g. in hashtag use, spread to different groups of users depending on their position in the social network and on other factors such as their interests. There is support for this assumption in work done at KDDI R&D, where feature extraction was used to extract the terms used by different demographic groups, showing that the terms used differ [23].

If we expand our assumption slightly we can also assume that an analyst that selects keywords to use for tweet collection need not necessarily be aware of the language use of different strata. It is therefore possible to achieve an improvement in recall if we can catch other types of language use.

In the proposed method we start with the title of a television show as the basis for our analysis, see chapter 4, but it is not hard to imagine that the jargon of users is not an exact specification and that they will sometimes use the title combined with words that are more specific to their writing style, demographic and social context. These words could include slang expressions and hashtags.

2.2 Information retrieval

This section is based upon the book Introduction to Information Retrieval by Manning, Raghavan and Schütze [25] and summarizes the key concepts of information retrieval that are used in this thesis.

The task of finding the correct content in a large collection of documents is often called information retrieval (IR). Most work in IR focuses on finding the text document that, according to the models employed, the user wants, although there are several applications in which IR is extended to other content such as images, video or audio recordings.

A typical task for a commercial IR system is ad hoc retrieval: find the best documents related to a set of user-supplied search terms, a query. This thesis is not concerned with ad hoc retrieval, since the topics for which I want to retrieve documents are automatically generated or known beforehand; nevertheless, a great deal of overlap exists between the more traditional techniques of IR and my proposed method, described in chapter 4. Specifically, my method uses queries, represents documents in a similar way and builds upon an existing IR system intended for ad-hoc retrieval.

2.2.1 Text data: Sparse vectors

Textual data is composed of strings of characters where one can choose different scopes of analysis; common strategies are to regard a document as one scope or to consider parts of documents as scopes of their own, e.g. 100-word segments or the different structural parts of the documents such as titles, captions and main text.


Table 2.1: Examples of two vector representations of the same document. In this example the vocabulary is severely limited and readers should imagine a vocabulary of several thousands of words and the resulting, sparse, vector representations. Note that capitalization is ignored, which is very common in practice.

Vocabulary: 'a', 'brown', 'dog', 'fox', 'i', 'is', 'jumped', 'lazy', 'over', 'quick', 'the', 'this'
Document: "The quick brown fox jumped over the lazy dog"
Words (lex. order): 'brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the', 'the'
Boolean vector: (0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0)
Frequency vector: (0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 2, 0)

A well-known concept is lexicographic ordering, and this definition of how strings are ordered can be used to transform text into a vector representation in a concise way. Assume that one knows all words that can appear in a language; call this set the vocabulary, Ω. Using the vocabulary one can transform any document into a vector representation by counting the number of occurrences of each word and outputting these counts in lexicographical order: create a feature vector for the string. Table 2.1 shows an example of the process. In practice we do not know all words that can appear in texts, but we can ignore words that have not been seen before or add them to the vocabulary, since the vectors are never formed explicitly.
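A minimal sketch of this transformation, assuming the fixed vocabulary of table 2.1:

```java
import java.util.Arrays;
import java.util.List;

/** Builds BOW vectors over a fixed vocabulary, as in table 2.1. */
public class BagOfWords {
    public static void main(String[] args) {
        List<String> vocab = List.of("a", "brown", "dog", "fox", "i", "is",
                "jumped", "lazy", "over", "quick", "the", "this");
        String doc = "The quick brown fox jumped over the lazy dog";

        int[] tf = new int[vocab.size()];
        for (String word : doc.toLowerCase().split("\\s+")) {
            int i = vocab.indexOf(word);     // linear scan; a real system uses a hash map
            if (i >= 0) tf[i]++;             // unknown words are ignored here
        }
        int[] bool = new int[tf.length];
        for (int i = 0; i < tf.length; i++) bool[i] = tf[i] > 0 ? 1 : 0;

        System.out.println("Boolean:   " + Arrays.toString(bool));
        System.out.println("Frequency: " + Arrays.toString(tf));
    }
}
```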

Conceptually, the Boolean model, used by the twitter search API, creates Boolean vectors as in table 2.1 for all documents and all queries. A bit-wise matching is then performed and the documents with one or more matches are eligible results of a search. Since the data is sparse, many optimizations in storage and computation are possible, but they are omitted here.

For twitter messages the Boolean model makes a lot of sense: tweets are very short and each word can thus be assumed to be very important to the message. Furthermore, it is the method that requires the least computation and storage, so it can be implemented efficiently with an inverted index, see table 2.2.

If we study table 2.2 further we can see that the more common a word is, the more documents will be stored in the index entry for that word, possibly slowing down retrieval. For common data sets of documents, often called corpora and denoted by C, an important phenomenon called Zipf's law has been observed: the probability mass of words decreases sharply according to how common they are. So roughly: the second most common word is about half as common as the most common word, and so on. This empirical law means that a very small number of words are present in most documents. To reduce the space and time needed for look-up in an inverted index these words, often called stop words, are commonly ignored.


Table 2.2: Inverted index structure. We can look up which documents contain certain words by grouping document numbers by the words that are included in the document. If we wish to support frequency counts of the words we store not only document numbers but tuples of (number, frequency).

Documents:
  1  "a dog"
  2  "a brown fox"

Index:
  'a'      1, 2
  'brown'  2
  'dog'    1
  'fox'    2
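A minimal inverted index along the lines of table 2.2 can be kept as a hash map from terms to posting lists. This is a conceptual sketch, not the Terrier-based index actually used in the system:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal inverted index: term -> list of document numbers (postings). */
public class InvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Avoid duplicate postings for repeated terms in the same document.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) docs.add(docId);
        }
    }

    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), List.of());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, "a dog");
        index.add(2, "a brown fox");
        System.out.println(index.lookup("a"));   // [1, 2]
        System.out.println(index.lookup("fox")); // [2]
    }
}
```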

There are many extensions and interpretations one can use. The basic building blocks for ranking are the following two assumptions:

• The more common a word is inside a document, the more relevant the document is to a query containing this word.

• The more documents that contain a word, the less important the word is.

This leads us to the very common representation of texts as tf · idf vectors (term frequency, inverse document frequency vectors). For each word that we keep track of in the vocabulary we also keep track of the number of documents that the term appears in and calculate idf(t) = \lg(1/|\{d \mid t \in d\}|), d \in C. When a query q = \{t_1, t_2, \ldots\} is issued we sum up, for each document d, the tf · idf elements that match that document:

\mathrm{Score}(q, d) = \sum_{t \in q \wedge t \in d} w(t, d) = \sum_{t \in q \wedge t \in d} \mathrm{tf}(t, d) \cdot \mathrm{idf}(t) \qquad (2.1)

One can also compare two documents, or a query and a document, in the form of tf · idf vectors using any distance metric, most notably the cosine distance:

\cos(\vec{d}_i, \vec{d}_j) = \frac{\vec{d}_i \cdot \vec{d}_j}{\lVert \vec{d}_i \rVert \, \lVert \vec{d}_j \rVert} \qquad (2.2)

A keen reader will note that by using frequency vectors to represent text we have lost a lot of information, namely the ordering of words. This model may appear simplistic but has been shown to work well in practice. Because order is not accounted for, the model is often called the BOW (bag of words) model.

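To make equations 2.1 and 2.2 concrete, the following sketch computes tf·idf weights over a hypothetical three-document corpus and the cosine similarity between two weighted vectors; it uses the thesis's idf(t) = lg(1/df(t)) definition verbatim:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** tf-idf weighting (eq. 2.1) and cosine similarity (eq. 2.2) over a toy corpus. */
public class TfIdf {
    static Map<String, Integer> termFreq(String doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc.toLowerCase().split("\\s+")) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    static double idf(String term, List<Map<String, Integer>> corpus) {
        long df = corpus.stream().filter(d -> d.containsKey(term)).count();
        return df == 0 ? 0.0 : Math.log(1.0 / df) / Math.log(2); // lg(1/df(t)), as in the thesis
    }

    static Map<String, Double> weights(Map<String, Integer> doc, List<Map<String, Integer>> corpus) {
        Map<String, Double> w = new HashMap<>();
        doc.forEach((t, tf) -> w.put(t, tf * idf(t, corpus)));
        return w;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        for (double v : a.values()) na += v * v;
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> corpus = List.of(
                termFreq("the quick brown fox"),
                termFreq("the lazy dog"),
                termFreq("a brown dog"));
        Map<String, Double> d1 = weights(corpus.get(0), corpus);
        Map<String, Double> d3 = weights(corpus.get(2), corpus);
        System.out.println("cos(d1, d3) = " + cosine(d1, d3));
    }
}
```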

2.2.2 Term weights based on statistical methods

One can formalize the two basic assumptions used in the tf · idf vector representation and instead use statistical methods. The intuition is the same, but the models are more formal, allowing IR systems to benefit from the vast knowledge in probability theory and statistics. For instance, it is possible to weigh terms in documents according to some function, f, defined on the two probabilities p(t|d), the probability of term t given a document d, and p(t|C), the probability of a term in the whole corpus.

w(t, d) \sim f(p(t|d), p(t|C)) \qquad (2.3)

p(t|d) = \frac{\mathrm{tf}}{TF} \qquad (2.4)

p(t|C) = \frac{\mathrm{tf}_C}{|C|} \qquad (2.5)

where tf is the number of times we see t in document d, TF is the number of terms in d, tf_C is the number of times we see t in the whole collection, and |C| is the number of terms in the collection. Note that we are essentially estimating the probability distribution of terms in different sets using the assumption that terms are distributed according to the mean (maximum likelihood estimation).

These methods are usually similar to hypothesis testing in that we choose terms for which we reject the null hypothesis that p(t|d) has a good fit with p(t|C). Another way is to consider the information content of seeing a term in a document, for instance using the Kullback-Leibler divergence. It is also possible to use an urn model that explicitly considers the size of documents, TF, instead of just p(t|d) = tf/TF, such as the divergence from randomness framework [7].

2.2.3 The vocabulary mismatch problem

When a user is searching for a set of relevant documents it is a typical case that the user and the authors of those documents use a different vocabulary (not to be confused with the vocabulary of all seen terms Ω). This means that not all relevant documents are found or that ranking of documents is not in line with what the user finds important.

This problem can also be seen when one tries to conduct market research using twitter data. It is not hard to imagine that authors of tweets use a different vocabulary and jargon than an analyst that selects keywords to search for.

2.2.4 Automatic query expansion

Automatic query expansion techniques are commonly categorized into local and global methods. Local methods use the results obtained from the first query to find new search terms, while global methods often use global semantic information such as Wordnet, or use auxiliary corpora. An excellent review of query expansion is available in the survey by Carpineto and Romano [12]. In the local case query expansion is often pseudo relevance feedback, related to the concept of relevance feedback. In relevance feedback, an ad-hoc IR technique, users are asked to grade results and a new search is carried out taking into account the best rated results of the first query. If instead the top k results of ranked ad-hoc retrieval are assumed to be relevant, here denoted as R, the pseudo relevant set, one can use this set to get new query terms. A common technique is Rocchio's algorithm, where a BOW query vector is moved towards relevant documents according to different weights.

\vec{q}_m = \alpha \, \vec{q}_0 + \beta \, \frac{1}{|R|} \sum_{\vec{d}_j \in R} \vec{d}_j \qquad (2.6)

We see that the original query vector \vec{q}_0 of tf · idf elements is multiplied by the scalar α and added to a summation of the tf · idf vectors of the documents in R, multiplied by another scalar β. Note that we need to use a similarity-based score for a new ranking of results, as in equation 2.2, rather than a summation of all w(t), t ∈ q_m, as in equation 2.1, to make use of query expansion with re-weighting.

For our purpose the re-weighting is not very interesting, since we are interested only in obtaining new search terms. If we consider the vector

\frac{1}{|R|} \sum_{\vec{d}_j \in R} \vec{d}_j

as a list instead and sort the elements in descending order of weight, we can use the first L elements as additional terms.
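A sketch of this term-selection step, assuming documents are already given as tf·idf-weighted vectors (the example weights are hypothetical):

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Selects expansion terms: highest average tf-idf weight over the pseudo-relevant set R. */
public class RocchioTerms {
    static List<String> expansionTerms(List<Map<String, Double>> relevantVectors, int L) {
        Map<String, Double> centroid = new HashMap<>();
        for (Map<String, Double> doc : relevantVectors)
            doc.forEach((term, w) -> centroid.merge(term, w / relevantVectors.size(), Double::sum));
        return centroid.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(L)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        List<Map<String, Double>> R = List.of(
                Map.of("himym", 3.0, "barney", 2.0, "tv", 0.5),
                Map.of("himym", 2.5, "legendary", 1.5, "tv", 0.5));
        System.out.println(expansionTerms(R, 3)); // [himym, barney, legendary]
    }
}
```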

The statistical method in section 2.2.2 can easily be extended to query expansion, where instead of p(t|d) one considers p(t|R), with R a set of known "relevant" documents. Using the \chi^2 statistic we can perform AQE by ranking expansion terms according to highest score and using the top K ones:

\mathrm{score}(t) = \chi^2(t) = \frac{(p(t|R) - p(t|C))^2}{p(t|C)} \qquad (2.7)

where p(t|R) = \mathrm{tf}_R / |R| and p(t|C) = \mathrm{tf}_C / |C|.

In my proposed method I calculate a metric similar in spirit to confidence for the rule: keyword → group of records.

Instead of \chi^2, there are other ways of hypothesis testing that can be used to find good descriptors of sets, such as the AIC (Akaike information criterion), which compares different models, in this case rules of the form above [6].
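The following sketch ranks candidate expansion terms by equation 2.7, estimating p(t|R) and p(t|C) from raw term counts; all counts here are hypothetical:

```java
import java.util.Comparator;
import java.util.Map;

/** Ranks candidate expansion terms by the chi-squared score of eq. 2.7. */
public class ChiSquaredExpansion {
    static double score(long tfR, long sizeR, long tfC, long sizeC) {
        double pR = (double) tfR / sizeR;   // p(t|R): estimate from the relevant set
        double pC = (double) tfC / sizeC;   // p(t|C): estimate from the whole corpus
        return pC == 0 ? 0.0 : (pR - pC) * (pR - pC) / pC;
    }

    public static void main(String[] args) {
        long sizeR = 10_000, sizeC = 1_000_000;   // total term counts in R and C (hypothetical)
        Map<String, long[]> counts = Map.of(      // term -> {tf in R, tf in C}
                "himym", new long[]{120, 400},
                "watching", new long[]{300, 25_000},
                "the", new long[]{700, 70_000});
        counts.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        e -> -score(e.getValue()[0], sizeR, e.getValue()[1], sizeC)))
                .forEach(e -> System.out.printf("%-10s %.6f%n",
                        e.getKey(), score(e.getValue()[0], sizeR, e.getValue()[1], sizeC)));
    }
}
```

A topic-specific term like "himym" scores far above a globally common word like "the", whose distribution in R matches its distribution in C.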

2.2.5 Measuring performance

In ad hoc retrieval, performance is perhaps best measured by what users think of the results returned. This is a very time consuming and expensive process, so most IR systems are tested on annotated test collections such as the TREC [4] collections. Here a set of queries is supplied together with a list of relevant documents for each; the IR system is then tested on how well it can retrieve the predetermined relevant documents. Often only the first couple of results are measured, but for the problem I am concerned with, maximum recall, this is not an option: I will instead sample the results to determine overall performance, see chapter 6.

For each query q and underlying information need, each document d is in one of the four categories:

True positive A document that is relevant and was retrieved by the IR system in response to q. Denoted by tp.

True negative A document that is not relevant and was not retrieved, tn.

False positive A non-relevant document wrongly assumed to be relevant and thus retrieved, fp.

False negative A document that is relevant and was not retrieved, fn.

From these simple definitions several metrics have been developed. The two most common measures are:

\mathrm{precision} = \frac{tp}{tp + fp} \qquad (2.8)

\mathrm{recall} = \frac{tp}{tp + fn} \qquad (2.9)

where precision reflects how many of the retrieved results are relevant and recall how many of the relevant results are made available. It is desirable to maximize both metrics, but they are naturally opposed goals: if one returns all documents in the collection then recall is maximized (1.0), and if no documents are returned then precision is maximized. In actual systems, increasing one often decreases the other. Therefore the harmonic mean of the two, called the F-measure, is often used to measure the IR system:

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
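A quick worked example of the three measures, with hypothetical counts:

```java
/** Worked example of precision, recall and F-measure from raw counts. */
public class Metrics {
    public static void main(String[] args) {
        int tp = 80, fp = 20, fn = 40; // hypothetical counts; tn is not needed here
        double precision = (double) tp / (tp + fp);              // 0.80
        double recall = (double) tp / (tp + fn);                 // about 0.67
        double f1 = 2 * precision * recall / (precision + recall);
        System.out.printf("precision=%.2f recall=%.2f F1=%.2f%n", precision, recall, f1);
    }
}
```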


These metrics can also be used in other cases where an asymmetric importance is assigned to two different classes: related results are important and unrelated ones unimportant. A typical case is classification where one wants to filter out all unrelated records but keep all related ones. The difference between filtering and IR is blurred when there are temporal relevance demands on search results. In the extreme case, used in this thesis, where no results are stored for a standing query on streaming data, they become inseparable.

2.2.6 Software systems for information retrieval

There are many software systems that are perfectly suited for information retrieval, employing different versions of the inverted index idea presented in section 2.2.1. Relational database management systems, or other systems optimized for e.g. fast indexing instead of consistency, can be used. There are many specialized information retrieval systems, and they can be called No-SQL databases because they do not adhere to SQL specifications.

I have worked with one such specialized information retrieval (No-SQL) system, Terrier 3.5 [31]. It is a relatively mature research system dedicated to information retrieval, with open source code, good documentation and community support. Its design is focused on experimentation and configuration, and it is written entirely in Java, giving me a good trade-off between performance, scalability, stability, and ease of implementation and experimentation. One drawback is that there is no query language; if one wishes to do other things than document search, this has to be done in source code. This is still preferable, since custom search operations are exactly the idea of this software: Terrier is designed for easy modification of the open source code and easy configuration. In contrast, many SQL systems would require foreign functions or possibly even recompilation to do what I wanted to do.

2.3 Topic classification

The act of retrieving the top ranked documents for a query is itself a form of classification of all the documents in the corpus. But in the case of twitter no ranking is done of the results of streaming retrieval, so we can explicitly introduce a classification step here to improve precision. To clarify:

1. Obtain a query vector q.

2. If desired, create a new query vector q* with AQE.

3. Rank all documents according to q* by sorting their scores, for example scores obtained by using equation 2.2.

4. The k highest ranked documents are considered related.


In an ad-hoc IR system the users can themselves decide what value of k is acceptable and in that way allow for maximum recall by looking at more and more results. In practice this will result in very low precision. By instead classifying all results (giving them a score of either 1 or 0) rather than ranking them, I suggest that one can achieve more favorable results in terms of overall F-measure.

A supervised classifier is a function or program that has a training step to modify its behavior. In this step it is fed data that is similar to the data it will later be asked to classify, and it can generalize various properties of the data in order to make an informed decision later [18]. I will assume that the reader is familiar with supervised classification and instead focus on concepts that are important for my proposed method.

The sparse vector formats listed in table 2.1 are not necessarily optimal for classification, and if we can introduce some form of processing to include background knowledge in the features used, there is a possibility of improving the results. In chapters 3 and 4 I will elaborate on this idea further, but the basic scheme used is to focus as much as possible on transforming a sparse text representation into a more concise representation based on comparing it with other texts.

Our classifier will need to make a decision for each tweet: whether or not it is relevant to our information need. The information need in IR is usually expressed as a query, but in supervised classification we describe it in the form of training examples and, in the proposed method, also as external sources.

Figure 2.1: The C4.5 classifier

C4.5 is a commercial decision tree classifier. It is an extension of a basic decision tree induction algorithm [32] and thus creates rules for partitioning data into classes based on purity measures.

In C4.5, and in the open source Java implementation J48 that I used, additional measures are taken to reduce over-fitting, handle missing values, and make other improvements [34].
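For reference, a minimal sketch of training and applying J48 through the Weka API; the file name and attribute layout are hypothetical, and the actual features used are described in chapter 4:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Minimal Weka J48 usage: train on an ARFF file, classify its first instance. */
public class J48Example {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file whose last attribute is the nominal class (related/unrelated).
        Instances data = DataSource.read("tweets.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setOptions(new String[]{"-C", "0.25", "-M", "2"}); // default pruning settings
        tree.buildClassifier(data);

        double label = tree.classifyInstance(data.firstInstance());
        System.out.println("Predicted class: " + data.classAttribute().value((int) label));
    }
}
```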

2.4 External data sources

External data sources can be used both for query expansion, by looking for context of the original search terms [12], and for classification, by comparing them with our data. The key issue is that we can find representative information about the information need. Many researchers have focused on the link structure of external resources, but due to time constraints I have not considered this angle of approach and have instead focused on using the text of the sources themselves.


Wikipedia The on-line encyclopedia Wikipedia (wikipedia.org) is the largest such resource of its kind. It has been used by many researchers in text mining and I will use it to provide context for my classifier.

EPG EPGs (electronic program guides) contain the airtime of a show, the most prominent actors of the show and a short synopsis. Several companies provide API access to EPG data, such as Rovi Corporation.

Web pages If we have web pages that are relevant to our information need we can use them as additional background information.


Related work

In this chapter I will investigate related work regarding the collection of tweets on certain topics. One topic that is of special interest is television, and thus much of the related work presented will be about the problem of finding and identifying television related tweets.

Perhaps the most straightforward interpretation of the problem of identifying television related tweets is as a classification task. The solution taken by most researchers is that of supervised classification of tweets by topic. A training set of labeled data is used to train a classifier such as a support vector machine, and the approach is tested on an unused part of the training set, typically using k-fold cross validation. But in reality a running system must deal not only with classifying tweets but also with retrieving them from a large database, such as the twitter API, and therefore formulating the problem as an IR task is also attractive. I will also go through the approaches taken by different authors for collecting tweets.

My proposed method, see chapter 4, is a combination of query expansion in tweet retrieval and classification of tweets and thus these two research areas are directly related to my work. However I have not found any directly comparable results.

3.1 Relevant works in information retrieval


3.1.1 Query expansion and pseudo relevance feedback

One promising idea is to account for the change of language use over time. Massoudi et al. [26] use a retrieval model based on language modeling, where query expansion terms are selected according to the recency of the tweets they appear in and the number of tweets they appear in.

Another variant of temporal pseudo relevance feedback used for analyzing twitter messages is to build a fully connected graph of initial search results where edges are weighted by their terms' temporal correlation, similar to the approach above. PageRank is then used on this graph to find the most relevant documents. The temporal profiles were built with four-hour intervals on a fairly small corpus of twitter messages, and PageRank is not suited for working with this kind of graph, so it is not surprising that this TREC submission [2]¹ was unsuccessful.

A very interesting use of Wikipedia is an AQE approach where anchor texts are used as expansion terms. In [8] Wikipedia is indexed and searched with the same query terms as an original query against a blog collection; the top Wikipedia documents returned are analyzed to find popular anchor texts that link to the highest ranked Wikipedia pages. These anchor texts are then used as expansion terms, resulting in an improvement over a baseline.

In [15] Efron uses hashtags to improve the effectiveness of twitter ad hoc retrieval. By analyzing a corpus of twitter messages he creates subsets where one hashtag is present and fits a probabilistic language model to each such subset. A language model is also fitted to each query, and the models that correspond best to the query model (according to Kullback-Leibler divergence) have their hashtags added as additional query terms. This approach provides a modest improvement, but I think that creating a language model from just the query terms is risky since there is so little evidence present in a query of a few words.

Papers submitted for the TREC micro-blog track of 2011 represent the use of different IR techniques for twitter search, including topic modeling, different forms of query expansion, extensive use of hashtags and many other approaches. Many of the papers are not published in peer-reviewed journals but nevertheless represent the latest research in this area. The main evaluation measure was precision at 30 results, averaged over the different queries, and the best results were in the 40% range [2].

The approach taken by Bhattacharya et al. [2]² is particularly interesting, since they report one of the best unofficial test scores, more than 60% P@30, and use an IR methodology perhaps best suited to structured (XML) retrieval. They use Indri, which employs a combination of language modeling and inference in Bayesian networks. They create different regions from the tweet and external sources that can be treated using different language models, and combine the similarity of a query with each of these regions in a Bayesian network. The external sources are web pages from URLs in the tweets and definitions of hashtags from a community based site, i.e. they expand tweets to include externally referenced information.

¹ Qatar Computing Research Institute submission to the TREC 2011 microblog track.
² Bhattacharya et al., University of Iowa (UIowaS), at the TREC 2011 microblog track.

3.1.2 An alternative representation using Wikipedia

Since my goal is to achieve good recall while maintaining precision, I have looked at the work of Gabrilovich et al. with much interest. They use an alternative representation they call explicit semantic analysis (ESA), contrasting with LSA (latent semantic analysis), where Wikipedia serves as the basis for the representation of documents.

In [20] the basic idea of ESA is presented. Each word in a text is associated with a set of Wikipedia pages. They create this representation by building an inverted index of Wikipedia, and instead of using it for look-up they use the index structure itself as the representation, i.e. each word is represented by a sparse vector with all Wikipedia pages as elements. A word such as Sweden might appear in thousands of articles, but using a tf · idf scheme the word might have the greatest association with articles about Sweden. To make the system feasible, only the top k articles and the weights corresponding to the term frequency in those articles are kept. For a text, the vectors of the words in it are summed to create a document vector. This approach works best for small texts, so it should be ideal for use with twitter messages.
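A sketch of the summation step described above; the word-to-article association weights here are invented for illustration, whereas a real ESA system derives them from an inverted index of Wikipedia:

```java
import java.util.HashMap;
import java.util.Map;

/** ESA sketch: a document vector is the sum of its words' Wikipedia-article vectors. */
public class EsaSketch {
    /** word -> (Wikipedia article -> association weight), truncated to the top-k articles. */
    static final Map<String, Map<String, Double>> WORD_VECTORS = Map.of(
            "sweden", Map.of("Sweden", 5.2, "Stockholm", 1.3),
            "hockey", Map.of("Ice hockey", 4.8, "Sweden", 0.4));

    static Map<String, Double> documentVector(String text) {
        Map<String, Double> docVector = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            Map<String, Double> wv = WORD_VECTORS.get(word);
            if (wv != null) wv.forEach((article, w) -> docVector.merge(article, w, Double::sum));
        }
        return docVector;
    }

    public static void main(String[] args) {
        // Two texts sharing no words can still overlap in Wikipedia-space.
        System.out.println(documentVector("sweden hockey"));
        // {Sweden=5.6, Stockholm=1.3, Ice hockey=4.8}
    }
}
```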

The alternative representation is used to build an ad-hoc IR system where queries are also transformed into Wikipedia-space and compared, with e.g. the cosine similarity, to the texts in the collection. Using only ESA results in poor performance but very impressive abilities to associate queries and texts that do not share a single word with each other, which highlights the possibility of greatly increased recall. In unison with a BOW IR system and automatic feature selection using the information measure, the method yields good results [17]. But this method can definitely cause a loss in precision for some queries, because unrelated Wikipedia pages may contain the same word.

3.2 Classification


3.2.1 Television ratings by classification

Arguing that conventional TV ratings, the so-called Nielsen ratings, are outdated, Wakamiya et al. employ an alternative method for estimating the number of viewers [39]. In their paper they present a method that uses tweets to calculate the ratings, using a large data set from the Twitter Open API. The data was geotagged and filtered by keywords such as TV and watching, and later filtered further.

Here the key problem of identifying which messages are related to a particular TV show is addressed. As seen in other works, additional information about the television programs is used, here in the form of an electronic program guide (EPG). Textual similarity is then computed between the set of collected tweets and EPG entries. As far as I know, Wakamiya and her colleagues are unique in also incorporating both temporal and spatial information to make the decision.

The textual similarity is based on the Jaccard similarity coefficient, and a morphological analysis function is used to compare only nouns, possibly due to the way the EPG is structured. In contrast, one could for instance imagine that verbs such as watching could be useful, an observation made in other related works. To facilitate the large number of text comparisons required, an inverted index was employed. The use of spatial relevance is motivated by the need to determine which TV station the author of a tweet was watching; it might therefore be unnecessary in the general problem of determining whether or not a tweet is related to a particular TV program.
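For reference, the Jaccard coefficient on term sets is the size of the intersection divided by the size of the union; a small sketch (illustrative only, not Wakamiya et al.'s implementation):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Jaccard similarity between two texts' term sets: |A ∩ B| / |A ∪ B|. */
public class Jaccard {
    static double similarity(String a, String b) {
        Set<String> sa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> sb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<String> union = new HashSet<>(sa);
        union.addAll(sb);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(similarity("watching the x factor", "the x factor live show")); // 0.5
    }
}
```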

Hypothesizing that users write about TV shows in close temporal proximity to the broadcast, a temporal relevance score is included in the final relevance measure, a quotient of the three similarity scores, which is then used to match a tweet to the highest rated EPG entry, corresponding to one television broadcast. The sought-after popularity measure of how many people watched the show can then be calculated.

Experimental results indicate high precision for the proposed method but possibly low recall. Regrettably, no discussion of the statistical significance of the acquired ratings was present.

3.2.2 Ambiguous tweets about television shows


Figure 3.1: Approach to classifying tweets, here for television shows, but the same approach applies for other proper nouns.

In this two-classifier approach, a large unlabeled data set is labeled by the first classifier, which is also used to extract additional information such as new search terms for the twitter streaming API. The second classifier is used to make a decision about a tweet m and a show i having the title s_i:

f(i, m) = \begin{cases} 1 & \text{if } m \text{ is a reference to show } s_i \\ 0 & \text{otherwise} \end{cases}

The first classifier is a binary classifier that also models the function above; it does however use fewer features than the second classifier, as seen below. The training and testing data is generated using twitter's streaming API, where one searches for keywords and gets a statistically significant sample back. The search terms used include not only the name of the show but also alternative titles found at IMDB.com and TV.com. A set of labeled data was manually created for eight shows, and this data set is then used for training and validation of the first classifier.

The features used for the first classifier differ from the previous approaches listed, since they are not directly related to textual similarity using a bag of words model. Instead a combination of features is used:

• Manually selected terms and language patterns of interest.
  - Television terms such as watching, episode.
  - Network terms such as cnn, bbc.
  - Regular expressions capturing episode and season information, e.g. S[0-9]+E[0-9]+.

• Automatically captured language patterns.
  - From a large data set (10 million tweets), replace titles and hashtags with …
  - Use s_i to check for the presence of the uppercase string.
  - Check if there is more than one title that is not an ambiguous word (according to Wordnet).

• Textual comparison with external sources using the cosine similarity measure and the bag of words model.
  - Characters of the show
  - Actors of the show
  - Words from the Wikipedia page

Most features are treated as binary values, 1 if a positive match was found and 0 otherwise; the rest are scaled to the unit interval.

After training on a few thousand twitter messages, the first classifier is used to classify the large unlabeled data set, yielding labels for each twitter message. This data is then used as training data, together with the original data set, for the second classifier, which uses all of the features of the first classifier and three additional feature types. Interestingly, new, more refined rules are captured from this new labeled data set, as well as new search terms.

Using the features listed above, different classifiers were tested for the two classifier roles; support vector machines and rotation forests [36] were deemed the best. An F-measure of 85.6% was the best result achieved by the latter classifier in 10-fold cross validation on the initial labeled data set. To summarize: several interesting features are combined with the common textual similarity measures often used in information retrieval; the two-classifier approach slightly increases the F-measure and also generalizes quite well to unseen shows.

Parts of this approach can certainly be applied to classifying new tweets retrieved using query expansion, and in my method I use a similar approach with slightly different features, see chapter 4.

3.2.3 Other topics than television

Other authors tackle the related and very similar problem of identifying which tweets are about a certain company and which are not. Company names can be ambiguous in much the same way as television shows and programs. As stated in the task definition of the WePS-3 workshop of the National University of Distance Education (UNED): "Nowadays, the ambiguity of names is an important bottleneck for these experts" [1], referring to experts in on-line reputation management. The task outlined in the workshop included data to be analyzed.


…manually recorded related words. The features used were co-occurrence counts of words in the tweet with the different profiles. Experimental results were positive and indicate the need for high quality external information.

One idea to improve recall in classification is to cluster messages; however, this method typically suffers from poor results if applied to plain term occurrence vectors. Perez et al. find terms from the corpus of twitter messages they are working on to help clustering methods [33]. They call their method self-term expansion methodology and achieve improvements in recall and precision by finding a set of additional terms for ambiguous company names. Words that co-occur with company names in tweets labeled true in the training set are added to each tweet containing the company name in the test set. Unfortunately the paper is very vague in its method description, and using unsupervised clustering with k-means with k = 2 does not seem like a promising idea for classification; however, the method could be used as a query expansion technique.

3.3 Tweet collection methodology

In chapters 1 and 2 I argued that the common approach of accessing twitter data for various research projects is lacking in reach, or recall, of the data that is considered for sampling. To see this we can consider the methodology used by some of the related work presented in this chapter.

Regarding ad-hoc information retrieval of tweets, such as [37], [16]; [10], which use the TREC microblog data set; and [26], which employed query expansion, it is not clear if we can compare effectiveness of tweet collection. In relation to market research, it is an open question whether results achieved on a small data set, sampled for a shorter period of time and annotated with a modest number of query–relevance judgment pairs, are applicable to the problem of obtaining as many related tweets as possible. We are most interested in evaluations done with the constraints of up to date, inclusive tweet collection in place. Nevertheless, many of the techniques used are certainly interesting.

In [28], Mitchell et al. evaluate a system they have set up for on-line television in which social media is integrated. Twitter is used to present tweets about the currently viewed program. Here the twitter API is used and a simple search for the program's title is employed to retrieve relevant messages. Their work represents the basic use of twitter for retrieving TV related tweets; unfortunately, recall and precision are not evaluated.

The work done on classifying TV-program related tweets [39], [35] and works about classifying other ambiguous topics such as [40] use test sets collected using simple rules, such as using the title of the topic, or manually selected keywords. A limited form of query expansion is used in [9] to generate the data set: all hashtags found in the data set retrieved by searching for "#worldcup" are recursively used to search for new tweets. In [30] the streaming API is used and messages are classified in a streaming fashion; however, the search terms used are manually selected.

Wakamiya et al., who employ an alternative method for estimating the number of viewers by counting certain tweets [39], do not use titles of TV programs directly. Instead, a large data set collected from the Twitter API during one month was used, in which all available geotagged data of Japanese origin was filtered using manually selected Japanese keywords equivalent to words such as "TV" and "watching". Experimental results indicate high precision for the proposed method but possibly low recall. Regrettably, there is no discussion of the statistical significance of the ratings acquired.

Dan et al. [35][13][14] use an approach that achieves an F-measure of 89%. However, their results are only valid as a measure of an overall system if all the relevant tweets can be found using the title of the show as a search term.

3.4 Summary

Ad-hoc search of Twitter messages typically uses text indexing, with either the common tf ∗ idf scheme or a language model approach. Even though there are many differences between ad-hoc search of microblogs and of web documents [37], techniques learned from established IR methods can certainly be applied.

Text classification typically represents texts as tf vectors and either does supervised training directly on the sparse vectors or extracts features to use for training. When dealing with short documents such as tweets, external sources are often used, and the best results [14] come from including hand-crafted features and mining very simple rules from a bootstrapped sample. Unsupervised approaches are less successful when dealing with short text messages, as described in [33], and a clustering can never be maintained for the incoming stream of data in a live application.

Table 3.1: Methods that I use from related works in my combined approach of query expansion and classification.

Method                                                                  Reference
Use hashtags as expansion terms                                         [15]
Look up URL contents in tweets                                          [2]
Use Wikipedia as a way to compare tweets                                [20], [13]
Use EPGs as a way to compare tweets                                     [39]
Co-occurrence with name to get additional terms                         [33]
Multiclass supervised classification of tweets using external sources   [13], [41]

Since we are treading in somewhat unknown territory, the streaming retrieval of television-related tweets, I have perhaps included some works that can be considered only peripherally related.


Automatic query expansion and classification using auxiliary data

This chapter contains a conceptual overview of the methods used. I start by elaborating on the problem statement in chapter 1 and then continue to describe the two parts of my approach: AQE (automatic query expansion) and supervised classification. Some initial experiments are described, since they guided my development of various features of the method that was later used for the larger experiments.

4.1 Problem description and design goals

To clarify the intuition behind which methods are used I repeat and elaborate the research problem definition from chapter 1:

Get as many tweets as possible about a specified product.

With the following constraints:

Tweet availability  The Twitter API determines how tweets can be accessed.

Precision  We do not want false positives.

Scalability  The resources required in terms of CPU time, memory consumption and disk should increase sub-linearly with the number of tweets we collect, disregarding storage of the collected tweets.

Recency  Results need to be made available in a timely fashion. We need to access the tweets about a product as soon as possible, ideally in real time.

Resolution  We need to be able to tell when a specific tweet was created.

We can also formulate the problem in terms of a search problem:


Retrieve tweets about products by searching (Boolean match) for keywords about these products.

We can prove by counterexample that this method does not retrieve all related tweets. Furthermore, many keywords that are good descriptors for products are also good descriptors for other classes of tweets: there is a problem of ambiguity.

Another interpretation is in terms of a supervised classification problem:

Classify all tweets as related to the product or not

Superficially this looks like an attractive approach: we have now gained a good method to weed out ambiguous tweets. But there are several problems:

Multiple-class problem  If we need to track more than one product it becomes a multiple-class problem. To solve this we need access to labeled data for each product.

Unbalanced classes  The ratio of tweets that are about a specific product is vanishingly small. Let us assume this ratio is 0.1% and that the false positive rate is 0.1%. With perfect recall the retrieved results will be almost evenly split, 50.02% true positives against 49.97% false positives, because the false positives are drawn from the 99.9% of tweets that are not relevant. Even this classifier performance is completely unrealistic, which makes the actual problem harder still.
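Spelling out the arithmetic behind these figures, using the rates assumed above:

\[
\text{precision} = \frac{tp}{tp + fp} = \frac{0.001}{0.001 + 0.001 \cdot 0.999} \approx 50.02\%,
\qquad
\frac{fp}{tp + fp} = \frac{0.000999}{0.001999} \approx 49.97\%
\]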

The class distribution problem is not insurmountable, but we must give up recall to achieve reasonable precision. This is directly opposed to our goal yet unavoidable; we must instead focus on giving up as little recall as possible. Compared to treating the problem as a search problem, we are still ahead in achieving our goal.

There is an important caveat here: if we have high recall of the related class and a consistent false positive rate, we can still track the change in the number of tweets.
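To make this caveat concrete, one way to write it down; the decomposition and the symbols n_obs, n_rel and α are introduced only for this sketch and are not notation used elsewhere in this document:

\[
n_{\mathrm{obs}}(t) = n_{\mathrm{rel}}(t) + \alpha \cdot n_{\mathrm{background}}(t)
\]

If the false positive rate α and the background volume are stable over time, changes in the observed count n_obs(t) mirror changes in the relevant count n_rel(t).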

The worst problem is acquiring training data; it is not financially viable to manually annotate data for a given product unless that product is very important. When tracking television programs this manual labor is staggering: there are hundreds of large shows and programs, of very different genres, in the US alone.

4.2 New search terms from query expansion

The main idea of my approach to the problem described in section 4.1 is to combine the strengths of both search and classification by removing the analyst who selects search terms and replacing them with an automatic method.


Original search expression    Expanded search expression
"hello"                       "hello" ∨ "hi" ∨ "greetings" ∨ ...

Here we assume that some query expansion method generates additional terms and that we can retrieve the tweets that match one or more of these terms. As described in section 4.2.1, I will refine this approach somewhat to search not only for a disjunction of single search terms but for a disjunction of pairs of search terms, where both terms in a pair must be present in the retrieved message (logical AND). If we revisit the example above it could look something like this:

Original search expression    Expanded search expression
"hello"                       "hello" ∨ ("hi" ∧ "greetings") ∨ ...

Since I want the Tweet Collect system to need only a list of product names to work, the original search expression will be a conjunction of the words that make up the product name. For the television show "How I met your mother" the original search expression will be: "how" ∧ "I" ∧ "met" ∧ "your" ∧ "mother".
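As a small illustration of this step, a minimal Java sketch of building the original conjunction query from a title; the class name is hypothetical, and the sketch relies on the Twitter search convention that whitespace-separated terms are implicitly ANDed:

import java.util.Locale;

public final class QueryBuilder {

    // "How I met your mother" -> "how i met your mother", which the Twitter
    // search API interprets as a conjunction of all five terms.
    public static String conjunctionQuery(String productTitle) {
        String[] terms = productTitle.toLowerCase(Locale.ROOT).trim().split("\\s+");
        return String.join(" ", terms);
    }

    public static void main(String[] args) {
        System.out.println(conjunctionQuery("How I met your mother"));
    }
}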

Taking inspiration from the automatic query expansion techniques listed in chapter 2, we can treat a set R of tweets that contain the exact title of a product as pseudo-relevant tweets; they are generated by the original query. Given a larger population of tweets C, where R ⊂ C, we can calculate many different statistics about the terms present in R. From these statistics we can generate well-chosen additional search terms. This method is very simple and highly effective, given that our initial assumptions hold: the tweets in R are actually relevant, and the statistics in our corpus approximate the true distributions.

I have performed some small-scale tests with query expansion methods that use different ranking criteria for new search terms and found that χ² (equation 2.7) performed the best. Furthermore, the exact choice of ranking method is not very important:

“However, several experiments suggest that the choice of the ranking function does not have a great impact on the overall system performance as long as it is used just to determine a set of terms to be used in the expanded query [Salton and Buckley 1990; Harman 1992; Carpineto et al. 2001]”[12]

I also did some small-scale tests in which I applied the query expansion method from [26] directly, but this resulted in very poor expansion terms; they could be dismissed upon inspection and by test searches on Twitter.
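Because equation 2.7 is not reproduced in this chapter, the following Java sketch assumes the χ² term-scoring form from the query expansion literature [12], score(t) = (p_R(t) − p_C(t))² / p_C(t); the class and method names are hypothetical:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class TermRanker {

    // Returns the K candidate terms with the highest assumed chi-square score.
    // pR maps a term to its probability in the pseudo-relevant set R, pC to its
    // probability in the whole collection C. Terms are only scored when they
    // are over-represented in R (cf. the p_R > p_C test in Algorithm 1 below).
    public static List<String> topK(Map<String, Double> pR, Map<String, Double> pC, int k) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Double> e : pR.entrySet()) {
            double pr = e.getValue();
            double pc = pC.getOrDefault(e.getKey(), 0.0);
            if (pc > 0 && pr > pc) {
                scored.add(Map.entry(e.getKey(), (pr - pc) * (pr - pc) / pc));
            }
        }
        scored.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            top.add(scored.get(i).getKey());
        }
        return top;
    }
}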

4.2.1 Co-occurrence heuristic


terms in virtual documents consisting of V tweets: the tweet containing the first term in the conjunction pair, the ⌊V/2⌋ tweets collected just before it, and the ⌊V/2⌋ tweets collected just after it. Rank the pairs according to their modified Dice coefficient:

\[
\tilde{D} = \frac{2 \cdot \widetilde{df}_{u \wedge v}}{df_u + df_v} \qquad (4.1)
\]

where \widetilde{df} represents the document frequency among the virtual documents in the pseudo-relevant set and df the document frequency in the collection as a whole.
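A direct Java transcription of equation 4.1, assuming the document frequencies have already been counted; the class and parameter names are illustrative:

public final class Dice {

    // Equation 4.1: dfUvVirtual is the number of virtual documents in the
    // pseudo-relevant set containing both terms u and v; dfU and dfV are the
    // collection-wide document frequencies of u and v.
    public static double modifiedDice(long dfUvVirtual, long dfU, long dfV) {
        if (dfU + dfV == 0) {
            return 0.0; // neither term occurs in the collection
        }
        return 2.0 * dfUvVirtual / (dfU + dfV);
    }
}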

4.2.2 Hashtag heuristic


Algorithm 1 Top(K, R): produces an array of single search terms.

1: R is an array of relevant tweets tw_l, 1 ≤ l ≤ N.
2: for all terms t ∈ ∪_i tw_i do
3:     if p_R > p_C then
4:         use equation 2.7 to calculate score(t) and add ⟨t, score(t)⟩ to list l
5:     end if
6: end for
7: Sort l in order of score(t).
8: Let top[K] be an array of terms t_i.
9: top ← the K terms with the largest score(t).
10: return top

Algorithm 2 Pairs: produces the pairs of search terms used.

1: Let R be an array of relevant tweets tw_l, 1 ≤ l ≤ N.
2: top ← Top(K, R)
3: Let pairs[K · (K − 1)/2] be an array of ⟨String, String, Integer⟩.
4: for all terms t_i ∈ top do
5:     T_u ← {tweets tw | t_i ∈ tw}
6:     for all terms t_j ∈ top | j > i do
7:         T_v ← {tweets tw | t_j ∈ tw}
8:         for all tw_l ∈ T_v do
9:             vd ← tw_(l−2) @ tw_(l−1) @ ... @ tw_(l+2)
10:            if t_i ∈ vd then
11:                ⟨t_i, t_j, count⟩ ← pairs[index(i, j)]
12:                pairs[index(i, j)] ← ⟨t_i, t_j, count + 1⟩
13:            end if
14:        end for
15:    end for
16: end for

4.2.3 Algorithms

Algorithm 2 describes how to get pairs of terms and their counts. Note that the nested loops on lines 6-14 correspond to doing a join between the tweets that contain t_i and the virtual documents, formed by t_j, that contain t_i. This can be implemented, for example, as a single pass over the time-ordered tweets; a sketch follows.
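One possible realization of this join, sketched in Java under the assumption that the collected tweets are kept in collection order and each tweet's terms are stored in a hash set, so that membership tests against the ±2-tweet virtual document window of line 9 are constant time; all names are hypothetical and this is not the actual Tweet Collect code:

import java.util.List;
import java.util.Set;

public final class PairCounter {

    // Counts how many virtual documents centered on tweets containing tj
    // also contain ti, i.e. the co-occurrence count accumulated in Algorithm 2.
    public static int countPair(List<Set<String>> tweets, String ti, String tj) {
        int count = 0;
        for (int l = 0; l < tweets.size(); l++) {
            if (!tweets.get(l).contains(tj)) {
                continue; // tw_l is not in T_v
            }
            int from = Math.max(0, l - 2);             // window tw_(l-2) .. tw_(l+2)
            int to = Math.min(tweets.size() - 1, l + 2);
            for (int m = from; m <= to; m++) {
                if (tweets.get(m).contains(ti)) {
                    count++;                           // ti found in the virtual document
                    break;
                }
            }
        }
        return count;
    }
}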
