Sentiment and growth of different news categories on Twitter: A study in Natural Language Processing

DEGREE PROJECT IN COMPUTER SCIENCE, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2019

Diar Sabri

Navid Haghshenas

We would like to thank our supervisor, Prof. Arvind Kumar, for the patient guidance, encouragement and advice he has provided throughout this thesis. We have been lucky to have a supervisor who knew how to guide us during this research and who asked all the right questions to lead us down the right path.

Degree Programme in Computer Science and Engineering

Date: June 2019

Supervisor: Arvind Kumar

Examiner: Örjan Ekeberg

Swedish title: Sentiment och tillväxt av olika nyhetskategorier på Twitter

School of Electrical Engineering and Computer Science

Abstract

In this age of digitalization, people have begun to change their news consumption behavior. More than half of the world's population has internet access and thereby readily available platforms for acquiring and disseminating news. Twitter has become a staple for news discussions and has even turned into a stable news source. This study aims to examine Twitter as a legitimate news medium by looking at how selected news categories differ in public interest and opinion. The categories are celebrity, crime, economy, politics and global news. For each category we examined the relative growth rate and the public sentiment during a period from 10 hours before to 14 hours after an event occurred. The results show that the political news category received the most positive public sentiment and attracted the highest public interest. Furthermore, the political news category also had the lowest fluctuation in public sentiment. At the other end of the spectrum, the crime category had both the most negative public sentiment and the lowest relative growth rate.

Sammanfattning (Summary)

In today's digitalized world, people have begun to change their news consumption habits. More than half of the world's population has access to the internet and thereby to platforms for obtaining and spreading news. Twitter has developed into a primary actor for news discussions and has partly been transformed into a stable news source. This study aims to examine whether Twitter is a legitimate actor in the news industry by reviewing five news categories and analyzing the public interest in and sentiment toward them. This was done by studying the development from 10 hours before to 14 hours after a news event. The categories are celebrity, crime, economy, politics and global news. The results indicate that the political news category had the highest relative growth rate and also the most positive sentiment. In addition, politics had the least fluctuation in sentiment. On the other hand, the crime category had both the lowest relative growth rate and the most negative sentiment.

Chapter 1 Introduction

Ever since their rise to popularity, social media platforms have had a profound impact on the way internet users receive and share information.

One platform in particular, Twitter, has played a vital part in the news industry for many years. Since its creation in 2006, Twitter has been used to interact with people around the globe via short messages. These messages came to be known as tweets, and billions of them are published by over 300 million monthly users (Grothaus 2018). Most tweets broadcast subjective opinions or circulate information. Nowadays it is common to use Twitter as a news medium by following different news sources. Because of its extensive worldwide reach, Twitter has also been embraced by many politicians and corporations.

The way news-related tweets spread and grow has lately become a concern in Western countries, because such tweets can persuade the public to think or act in a certain way. Politics is arguably one of the most precarious news categories in this sense. Events such as Donald Trump's success in the 2016 United States election are clouded by the effect Twitter had on the election. Special Counsel Robert Mueller attested that the presidential election was heavily influenced by propaganda campaigns, many of which focused primarily on Twitter (Muller 2019). Many high-profile figures such as Roger Stone, Sean Hannity and Michael Flynn Jr. interacted with fake Twitter profiles created by the Russian interference operations. In our time, Twitter seems to hold more influence than most large news sources such as CNN or The New York Times. This should come as no surprise, since prominent figures usually have their own “official” Twitter accounts. This fact alone is usually enough to make news appear more reliable, since the tweets become a first-hand source (Grinberg et al. 2016).

Internet users are usually well aware of the potential effects Twitter can have. It has played a great part in bringing freedom of speech to many countries worldwide, going as far as offering a pathway for freeing countries from dictatorship and oppression. An example of this is the Arab Spring, which was mainly triggered by communication and revolt preparations on Facebook and Twitter. The effect of major social media was especially clear in Egypt and Tunisia, where the governments were overthrown as a result of the uprisings (Blakemore 2019).

1.1 Problem statement

Following the growth of Twitter as a news source, as well as a direct way for people in power (politicians, celebrities, prominent financiers and large corporations) to reach the public, it is important to acknowledge the negative aspects of this medium. Since user activity is highest in the U.S., the study is restricted to American data. The purpose of this thesis is to examine Twitter as a legitimate news medium. It is a two-part study, covering both the interest in and the opinions on certain news categories defined by us. The news categories were selected based on our own judgment of which categories are most distinguishable and most prominent: celebrity, crime, economy, politics and global news. The research question is as follows:

How do news categories differ in regard to Twitter activity?

The definition of ”activity” is a two-part factoring of relative growth rate (interest) and sentiment (opinions) in the U.S.

The hypothesis of this study is that the celebrity category will be the most notable in terms of interest. In terms of opinions, crime will be the most negative; no hypothesis is made about which category will be the most positive.

1.2 Approach

The approach of this thesis is a quantitative study of Twitter data, i.e. tweets.

Each news category is analyzed using a two-factor approach in which both the sentiment values and the relative growth rate are calculated. The results for both factors are obtained through a study in natural language processing.

Furthermore, a literature study is performed in order to obtain the relevant background information.

1.3 Thesis outline

The following chapter of the thesis paper presents relevant background information, definitions and research related to the field. The subsequent chapter presents the methodology of the quantitative study. The fourth chapter depicts the results obtained from the research. The fifth chapter discusses the outcome and also how the shortcomings of the methodology affected the results. The sixth and final chapter summarizes the results and concludes the study by answering the research question.

Chapter 2 Background

The news industry has been a staple in people's lives for centuries, providing information on a wide spectrum of matters. In recent decades the industry has been changing as user behavior and consumption patterns have shifted. This is the foundation of our thesis and the primary motivation for examining Twitter as a legitimate actor in the news industry. We first describe how the news industry has evolved and then turn to the role Twitter plays in user behavior. Since this is a study in computer science, a section on natural language processing follows. Finally, a section on related and previous research concludes the background.

2.1 News industry

The news industry has been a source of information for the general population for centuries. The medium on which this industry has historically relied is printed paper. An early form of public news can be traced back to the Aztecs of Mesoamerica, who hung colored banners in the main square to spread certain news. The first published newspapers that we know of and would recognize today emerged in Germany in the 17th century, the earliest dating back to 1605 (Newspaper Industry, History of 2019, Weber 2006).

Printed news grew in popularity and the industry had no clear rival until the start of the 20th century, when radio made its entrance. The arrival of television further weakened the previously strong position of the newspaper industry. This forced newspapers to adapt to the changing market and thus the tabloid was created. Tabloids were a form of newspaper that was more colorful and visually appealing. They were also smaller and were published weekly or even more sporadically, in contrast to the daily publication of the ordinary newspaper (ibid).

The newspapers were not done adapting to the market, though. In the 1980s computers came into play and it became clear that the newspaper industry had to adapt to this new technology as well. Throughout the twentieth century, from radio to television and finally to the computer, the newspaper industry had to adjust to changing markets in order not to fall. This affected the number of daily newspapers, which declined steadily during the entire twentieth century. It culminated in the newspapers finally joining the new-media arena in the 1990s by establishing an online presence on the internet.

The news media today have begun to question the future market share of the newspaper industry in contrast to new forms of news consumption, in the face of changing user behavior (Ahlers 2006). In this age of digital convergence, the end users of the news industry have begun to change how they consume news and media. A recent study investigated how the new age of media and digitalization affects the way people gather, process, and circulate information.

The study focused on six outlets: Twitter, blogs, online communities, online portals, online newspapers and offline media such as newspapers. The conclusion was that “people-based” media played a critical role in diffusing the latest information to the public, the most prominent being Twitter. Traditional news media such as newspapers and television, in contrast, adopt the information and broadcast it to the wider public (Sung & Hwang 2014).

Amid these changing times for the news industry, a new way of receiving, processing and disseminating news has emerged. Social media have grown tremendously since the beginning of the 21st century. According to recent studies, almost seven in ten Americans use at least one social media site. The same source states that just ten years earlier the figure was around half of that (Social Media Fact Sheet 2018). In 2018, news use across social media was examined. The report states that a little over two-thirds of Americans occasionally get news on social media. Among the social media sites, Facebook is the most frequently used, but our focus is on Twitter, from which around 12% of Americans receive news at least occasionally. Another interesting measurement is how much users are exposed to news while on the respective social media site: 71% of Twitter users are exposed to tweets related to news (Shearer & Matsa 2018).

Due to the direct and unfiltered communication that Twitter enables, news organizations have begun to use Twitter as a tool for reaching a broader and bigger audience. Twitter is mainly used as a channel for breaking news, which allows reporters to take on a different role than they previously had in the history of journalism (Broersma & Graham 2012). Michael Schudson wrote a book about the power of news. A running theme throughout the book is that news has to be perceived as public knowledge.

Schudson argues that news should be regarded as a form of social culture. Another important point that Schudson addresses is the power of publishers. Comparing newspapers from 1895 with newspapers from 1995, he describes the change in the news industry as follows:

“Reporters are far freer from marching in step behind an editorial line set by the publisher than they once were.” (Schudson 1996)

Michael Schudson wrote another book, Discovering the News. In this book, Schudson states that during the first couple of decades of the twentieth century it was uncommon for journalists to see a separation between facts and values. He also states that objectivity is defined by precisely that: a divide between facts and values. This changed after the First World War, when journalists reflected on the wartime propaganda and the public relations that followed. This was a turning point in journalistic history, and it cemented a belief in the journalistic mindset that mere facts cannot always be trusted. Schudson states that a world which the interested parties had constructed was a world where naive empiricism could not last (Schudson & Leonard 1979). This led to a more modern definition of objectivity, according to which a statement can be trusted if it can be submitted to established rules deemed legitimate by a professional community. This essentially implies that facts are not merely irrefutable metrics, but consensually validated statements about the world.

2.2 Twitter usage

For many years, Twitter and its users have provided the web with information, emotions, and entertainment. It has developed from a micro-blogging service into an extensive news platform, to the point where journalists turn to the official Twitter accounts of public figures for material. This is especially common in the United States, where Twitter usage is the highest in the world. A study from 2013 showed that 16 percent of all American web traffic was on Twitter, a share that had doubled from only two years prior (Brenner & Duggan 2013).

A study reviewing Twitter as a news source showed that, over the sampling period, 946 news stories or segments from the seven most prominent news outlets (The New York Times, Washington Post, ABC News, CBS News, NBC News, Fox News Network, and CNN) used Twitter as a news source. It should be noted that the study was performed in 2011, which is a long time ago relative to how long Twitter has been active. Several statistics were revealed in this study, one of which showed that Twitter accounted for over 57% of all news sources in media. Another metric in the report showed that approximately one third of all Twitter-originated TV news concerned U.S. politics, while the share was slightly lower (30%) for newspapers. This ratio was similar for the other sampled news categories.

According to the same study, TV news journalists have more frequent deadlines and, in order to keep up with competitors, must rely on Twitter more than newspaper journalists do (Moon & Hadley 2014).

According to a study from the City University of Hong Kong, there are two reasons behind the popularity of Twitter. The first is the quality of the content uploaded to Twitter. As time has passed and Twitter has grown, users have become more critical of the information they receive on the service. In addition, content managers constantly filter out unpleasant content such as gore, violence, and pornography to make the user experience more gratifying (Liu et al. 2010).

The second factor is that people nowadays appreciate exploring modern technology. Twitter's functionality, while simple, allows for a gratifying and contemporary user experience, and being the first of its kind has certainly aided its growth (ibid).

2.3 Natural language processing

Natural language processing (NLP) is a way for computer programs to interpret natural language, both spoken and written. Researchers in the field develop tools that can manipulate such language in order to execute certain tasks. Linguistic properties such as nouns and verbs, as well as grammatical frameworks, are used in NLP utilities. To achieve this, knowledge resources such as dictionaries, grammatical rules, abbreviations and synonyms are usually essential to the work (Kao & Poteet 2007).

Early research and development in NLP established fundamental capabilities such as translation from text to machine code, speech recognition and speech synthesis. Nowadays, the functionality and precision of such utilities have been enhanced and further developed into speech-to-speech translation programs, analysis of information on social media, and more (Hirschberg & Manning 2015).

2.4 Sentiment analysis

One example of modern NLP, and the one used in this report, is sentiment analysis. It is a method for extracting opinions and emotions from text and evaluating their polarity (Paltaglou & Thelwall 2012). The most common approaches are either machine learning based or lexicon based. Research at KTH concluded that a lexicon-based approach is better suited for sentiment analysis on social media data such as Twitter (Johansson & Lilja 2016). A lexicon-based approach classifies data into different categories and uses lexicons to evaluate the sentiment rating of a given word. Furthermore, the text as a whole is examined and the contextual setting is taken into account.
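To make the lexicon-based idea concrete, the following is a minimal sketch in base R. It is not the implementation used in this thesis: the lexicon `toy_lexicon`, the negator list and the function `score_text` are hypothetical and chosen only for illustration, and scaling by the square root of the word count is an assumption of the sketch.

```r
# Toy lexicon: word -> polarity score (hypothetical values, for illustration only)
toy_lexicon <- c(good = 1, great = 2, happy = 1, bad = -1, terrible = -2, sad = -1)
negators <- c("not", "no", "never")

score_text <- function(text, lexicon = toy_lexicon) {
  # Lowercase, drop everything except letters and spaces, split into words
  words <- strsplit(gsub("[^a-z ]", "", tolower(text)), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(0)

  # Look up each word; words missing from the lexicon score 0
  scores <- unname(lexicon[words])
  scores[is.na(scores)] <- 0

  # Minimal context handling: flip the sign of a word preceded by a negator
  negated <- c(FALSE, head(words, -1) %in% negators)
  scores[negated] <- -scores[negated]

  # Scale by the square root of the word count (an assumption of this sketch)
  sum(scores) / sqrt(length(words))
}

score_text("This is great news")       # positive score
score_text("This is not good at all")  # negative score
```

The negation step is a very reduced stand-in for taking the contextual setting into account; a real lexicon-based tool handles many more cases, such as amplifiers and sentence boundaries.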

2.5 Previous research

This section aims to provide an insight into how Twitter has been used and can be used for research purposes. In summary, it indicates a wide array of research applications ranging from tweet legitimacy to earthquake detection to health diagnostics, and more. This is due to the fast-paced nature of Twitter accompanied by the big cluster of discussion topics and an arsenal of different Twitter usages.

In 2019, a study of fake news tweets during the 2016 U.S. election campaign was carried out to investigate whether Twitter propaganda had taken place during this period. The results showed that 6% of the news on Twitter was fake, and that only 0.1% of users accounted for the spread of fake news across the platform. Mostly conservative voters on the extreme right were affected by fake news. The study was conducted after controversies regarding pro-Trump propaganda spread on Twitter, which was said to have affected the voting choice of many Americans (Grinberg et al. 2016). This was recently confirmed by the findings of Robert Mueller's report (Muller 2019).

In 2011, the sentiment of stock-market-related Twitter feeds was analyzed along seven different mood dimensions. The collected tweets specifically concern the closing values of the Dow Jones Industrial Average. The main purpose of the study was to predict future changes in the market using sentiment analysis. The findings show that predictions of the daily closing values were 87.6% accurate (Bollen et al. 2011).

Castillo et al. analyze methods for automating the assessment of tweet credibility using machine learning. More specifically, they implemented a supervised learning method. Using this approach, they labeled tweets on “top trending” subjects as either credible or not credible, based on information about the user and the tweet. These labels were then compared with assessments made by people, showing a precision of 70-80% (Castillo et al. 2013).

By using Twitter's real-time reactions to different events, researchers have created a system for reporting earthquakes in Japan. This was implemented by treating each user as a sensor and applying traditional earthquake-detection approaches. The probabilistic spatiotemporal model could locate the center of an event based only on Twitter data. Due to the large number of tweets concerning Japan's numerous earthquakes, the algorithm could detect earthquakes with 96% accuracy (Sakaki et al. 2010).

In 2015, a model based on the language and habitual patterns expressed in tweets was created to identify the correlation between risk factors for heart disease and negative emotions, such as anger and unhealthy social relations. The model also took factors such as income and academic background into account. The cross-sectional regression model based only on Twitter language produced more accurate results than a previously used model that looked at 10 different demographic, socioeconomic, and health risk factors (Eichstaedt et al. 2015).

2.6 Definitions

Below are a few phrases which need definite descriptions. These definitions will be used throughout this thesis.

2.6.1 News categories

As previously mentioned, we have defined five news categories: crime, politics, economy, celebrity and global news. In this report, crime news is limited to more brutal crimes committed by civilians, such as homicide. Political news relates to decisions made by powerful politicians (presidents, prime ministers etc.). The economy category consists of stories about decisions made in the banking world and economic developments that affect society as a whole. The celebrity category has a broader range than the previous ones, covering any and all news related to celebrities. Finally, global news can be of any category but must be a significant piece of news that reaches around the globe.

2.6.2 Application Programming Interface - API

In general, an API is an interface that developers can use to enable communication between programs and applications of different kinds. Once the communication has been set up, developers can transfer information from one platform or programming environment to another. To fulfill the purpose of this report, two APIs have been used in our program.

Twitter Developer API

This platform includes several tools that give access to Twitter's open database for use in applications, statistical analysis and more. These tools are categorized into Standard, Premium and Enterprise. For this report, we have used the Standard API, which offers a 7-day tweet search and is free of charge.

Google Maps API

This API allows the user to customize maps with their own content and imagery for display on web pages and mobile devices. It includes four basic map types (satellite, road map, terrain and hybrid), which can be modified using layers, styles and libraries. Furthermore, it is possible to query this API in order to extract positional information.

Chapter 3 Methodology

3.1 Data acquisition

In order to properly perform the quantitative study required in this thesis, we developed a systematic approach to handling the data. All of this was done in the R programming environment within RStudio. The first step was acquiring the data. To do this, we used the Twitter Developer API in conjunction with a third-party package called rtweet. rtweet provides multiple functions designed to extract data, i.e. tweets, from Twitter's REST and streaming APIs. We primarily relied on the search_tweets function, for which we mainly took advantage of the following arguments.

Argument          Description
type              Specifies which type of tweets should be returned. Must be one of recent, mixed or popular.
q                 Query to be searched. Must be a character string of at most 500 characters. Boolean operators such as AND and OR can be used in conjunction with parentheses.
include_rts       Boolean value which specifies whether to include retweets.
n                 Number of tweets to collect, with a maximum of 18000 tweets every 15 minutes.
retryonratelimit  If the provided n exceeds the limit of 18000 tweets, this boolean can be set in order for the collection to stall and proceed after 15 minutes.
langs             Restricts the collection to tweets in the specified language.
geocode           Geographical restriction, given as coordinates and a radius.
since             Restricts the collection to tweets newer than the specified date.
until             Restricts the collection to tweets older than the specified date.

Table 3.1: Arguments of the search_tweets function used for gathering tweets via both the Twitter Developer API and the Google Maps API

An example of the source code for one of the queries in our data collection process:

    library(rtweet)

    # Collect original tweets (no retweets) about Nipsey Hussle from within the U.S.
    x <- search_tweets(type = 'recent', q = 'nipsey hussle', include_rts = FALSE, n = 36000,
                       retryonratelimit = TRUE, langs = 'EN',
                       geocode = lookup_coords('country:us', key = GOOGLE_MAPS_KEY),
                       # the source listing gave the invalid date 2019-03-32;
                       # the following calendar day is assumed here
                       since = '2019-03-30', until = '2019-04-01')

The results of the search query generated by the above R code, received via the Twitter Developer API, were then stored unprocessed in a .csv file. Every news incident therefore had a separate file. This way, we had all the raw data for each news incident before moving on to the next steps of our quantitative inquiry.

3.1.1 Chosen news

For each piece of news, all occurrences on Twitter were collected by querying for keywords related to the news at hand. Abbreviations have not been taken into consideration. The selected news for each of the categories and their respective keywords are listed below. The keywords are separated by commas, which represent the OR operator in the actual search query. The time stamps listed have been converted to local Swedish time (UTC+2) and represent the time of the event. All time stamps have been rounded down to the nearest hour. The categories and the accompanying news are as follows: crime (1-2), politics (3-4), economy (5-6), celebrity (7-8) and global (9-10).

1. STEM School shooting - Two students carried out a shooting at the STEM School in Denver, Colorado. An 18-year-old student was killed and eight others were injured (Denver7 2019).

Keywords: highlands ranch, colorado school shooting, tony spurlock, stem shooting

Timestamp: 7 May 2019, 21:00

2. Nipsey Hussle - The Grammy-nominated rapper Nipsey Hussle was killed in a shooting outside a store in Los Angeles. Two others were wounded by gunfire (Blankstein & Johnson 2019).

Keywords: nipsey hussle, nipsey dead, nipsey shot, nipsey killed

Timestamp: 1 April 2019, 01:00

3. Dianne Feinstein - California senator Dianne Feinstein met with a youth climate activist group known as the Sunrise Movement. The discussion between her and the group of children reportedly got “heated”, which later drew criticism of the politician (Friedman 2019a, Beckett 2019).

Keywords: dianne feinstein, sunrise movement, feinstein children, green new deal feinstein

Timestamp: 22 February 2019, 19:00

4. Citizens United / Adam Schiff - The Citizens United organization won a Supreme Court case in 2010 that removed federal restrictions prohibiting unions and corporations from spending money to influence federal elections (Chillizza 2014). In early May 2019, Representative Adam Schiff proposed a constitutional amendment to overturn this ruling (Schouten 2019).

Keywords: adam schiff, citizens united schiff, citizens united adam, overturn citizens united

5. Kraft Heinz - The shares of Kraft Heinz, one of the largest food corporations in North America, fell by 25% in one day, causing a loss of $4 billion for Warren Buffett (Helmore 2019).

Keywords: kraft heinz, warren buffett loss, berkshire hathaway shares

Timestamp: 22 February 2019, 21:00

6. Unemployment rate - Due to increased hiring by companies, the unemployment rate in the U.S. hit its lowest level in 49 years (Rugaber 2019).

Keywords: unemployment rate, low unemployment

Timestamp: 3 May 2019, 15:00

7. Precious Harris - The sister of the rapper “T.I.” (full name Clifford Harris Jr.) was in a car accident a few days prior to her death (Pasquini 2019).

Keywords: precious harris, ti sister, t i sister, precious death, precious accident

8. Met Gala - An annual fundraising event at which celebrities wear formal outfits inspired by that year's theme (Friedman 2019b).

Keywords: met gala, m e t gala

Timestamp: 6 May 2019, 21:00

9. Sri Lanka Easter Bombings - During Easter of 2019, suicide bombers set off eight explosions in different churches and hotels in Colombo, Sri Lanka. A total of 258 people were killed and over 500 others were injured in these attacks (Ethirajan 2019).

Keywords: sri lanka, easter bombings, easter terrorists, church bombings easter

Timestamp: 20 April 2019, 16:00

10. Archie - The Duke and Duchess of Sussex named their newborn baby “Archie” (BBC 2019).

Keywords: archie, royal baby

Timestamp: 8 May 2019, 14:00
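The keyword convention and the 24-hour collection window used for the news items above can be sketched in base R as follows. The helper names `build_query`, `floor_hour` and `event_window` are hypothetical and not taken from our program, and wrapping each keyword phrase in parentheses is an assumption about how the OR query was assembled.

```r
# Keywords for one news item, as listed above (a comma stands for OR)
keywords <- c("nipsey hussle", "nipsey dead", "nipsey shot", "nipsey killed")

# Hypothetical helper: group each phrase in parentheses and join with OR
build_query <- function(keywords) {
  query <- paste(sprintf("(%s)", keywords), collapse = " OR ")
  stopifnot(nchar(query) <= 500)  # the query may be at most 500 characters
  query
}

# Hypothetical helper: round a time stamp down to the nearest hour
floor_hour <- function(t) as.POSIXct(trunc(t, units = "hours"))

# Hypothetical helper: window from 10 hours before to 14 hours after the event
event_window <- function(event_time, before_h = 10, after_h = 14) {
  list(start = event_time - before_h * 3600,
       end   = event_time + after_h * 3600)
}

query <- build_query(keywords)

# 1 April 2019, 01:00 in Swedish time (UTC+2) is 31 March 2019, 23:00 UTC
event <- floor_hour(as.POSIXct("2019-03-31 23:00:00", tz = "UTC"))
win <- event_window(event)
```

Working in UTC internally and converting only for presentation avoids ambiguity around daylight saving time, which is a design choice of this sketch rather than something stated in the method.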

3.2 Pre-processing

When the data acquisition code has finished executing, we are left with a data frame containing the raw data received from the Twitter API. This raw data has to be processed before we can apply our analytic techniques. The data frame has M rows and 88 columns, M being the number of tweets returned by the search query. For the purpose of our thesis, we are mainly interested in the following columns.

Column      Description
user_id     Unique user id for each user
created_at  Time and date of the tweet
lang        Language of the tweet
location    Location of the user
country     Country of the user
text        Tweet text, in the form of a string

Table 3.2: Important columns of each tweet

We initially thought that each tweet had coordinates in the form of longitude and latitude. This proved not to be the case: very few of the tweets had coordinates readily available, so we needed another method of locating the users. We instead relied on the location field of the user profile, which was filled in for almost all of the tweets we gathered. This is where we capitalized on the versatility of the Google Maps API. By querying the Google Maps API with the location provided by each user, we were able to extract the corresponding coordinates.

To ease this process, we built a function in R which took the location data of a tweet as its argument and returned the coordinates in the form of longitude and latitude. One problem we encountered was that the locations sometimes contained accented characters, as in Española, New Mexico. We noticed that the Google Maps API did not work if the provided argument had any accents and therefore built a function to remove them. Española, New Mexico would thus be converted to Espanola, New Mexico, which can be passed to the Google Maps API in order to obtain the coordinates.
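A function for removing accents could look roughly like the following base R sketch. The name `strip_accents` is hypothetical, and the character mapping is illustrative: it covers only some common lowercase Latin letters and is not the mapping used in our program.

```r
# Hypothetical helper: replace common accented lowercase letters with ASCII ones.
# The mapping is illustrative and not exhaustive.
strip_accents <- function(x) {
  accented <- paste0(
    "\u00e1\u00e0\u00e2\u00e4\u00e3",  # a with acute, grave, circumflex, diaeresis, tilde
    "\u00e9\u00e8\u00ea\u00eb",        # e variants
    "\u00ed\u00ec\u00ee\u00ef",        # i variants
    "\u00f3\u00f2\u00f4\u00f6\u00f5",  # o variants
    "\u00fa\u00f9\u00fb\u00fc",        # u variants
    "\u00f1\u00e7"                     # n with tilde, c with cedilla
  )
  plain <- "aaaaaeeeeiiiiooooouuuunc"
  # chartr maps the i-th character of 'accented' to the i-th character of 'plain'
  chartr(accented, plain, x)
}

strip_accents("Espa\u00f1ola, New Mexico")
```

Unicode escapes are used instead of literal accented characters so that the sketch does not depend on the encoding of the source file.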

Another part of our data processing was cleaning the text of undesirable elements such as URLs and non-alphabetical characters. Once the coordinates had been gathered, the data also had to be filtered so that only tweets with coordinates within the borders of the United States remained. We noticed that large parts of our data originated from countries other than the U.S. This led to a substantially smaller data set than what we started out with. Initially, a total of around 27,000 tweets were gathered. After pre-processing, we were left with approximately 9,300 tweets.
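The two pre-processing steps described here can be sketched in base R as follows. The helpers `clean_text` and `in_us_box` are hypothetical, the example tweets are made up, and the rectangular bounding box is only a rough stand-in for the actual borders of the United States (it covers the contiguous states and also includes parts of Canada and Mexico).

```r
# Hypothetical helper: strip URLs, then anything that is not a letter or a space
clean_text <- function(text) {
  text <- gsub("https?://[^[:space:]]+", " ", text)  # remove URLs
  text <- gsub("[^A-Za-z ]", " ", text)              # remove non-alphabetical characters
  trimws(gsub(" +", " ", text))                      # collapse repeated spaces
}

# Assumed rough bounding box for the contiguous United States (illustrative only)
in_us_box <- function(lat, lng) {
  lat >= 24.5 & lat <= 49.5 & lng >= -125 & lng <= -66.9
}

# Made-up example rows: one located in Los Angeles, one in Stockholm
tweets <- data.frame(
  text = c("RIP Nipsey!! https://t.co/abc123", "Sad news today #nipsey"),
  lat  = c(34.05, 59.33),
  lng  = c(-118.24, 18.07),
  stringsAsFactors = FALSE
)

tweets$text <- clean_text(tweets$text)
us_tweets <- tweets[in_us_box(tweets$lat, tweets$lng), ]
```

A filter against real country borders would require polygon data or a reverse-geocoding service; the box is used here only to keep the sketch self-contained.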

3.3 Data analysis

The data analysis in this study primarily relied on three variables: the time, sentiment, and coordinates of each tweet.

3.3.1 Relative growth rate

The relative growth rate (RGR) is the growth rate relative to size. This makes a comparison between the news categories possible, despite a difference in the number of tweets each category produces. The relative growth rate is calculated by the following formula, where W stands for the combined number of tweets for each news category, T represents the hour, and the subscripts 1 and 2 denote two points in time.

\[
\text{Relative Growth Rate (RGR)} = \frac{\ln W_2 - \ln W_1}{T_2 - T_1}
\]
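The formula can be sketched in a few lines of Python. The function names and the example counts are hypothetical, and the sketch assumes that W is a positive tweet count observed at each hour; how the thesis handled hours with zero tweets is not stated, so the sketch simply rejects them.

```python
import math


def relative_growth_rate(w1: float, w2: float, t1: float, t2: float) -> float:
    """RGR = (ln W2 - ln W1) / (T2 - T1) for positive counts W1 and W2."""
    if w1 <= 0 or w2 <= 0:
        raise ValueError("tweet counts must be positive to take the logarithm")
    if t1 == t2:
        raise ValueError("the two points in time must differ")
    return (math.log(w2) - math.log(w1)) / (t2 - t1)


def hourly_rgr(counts):
    """RGR between each pair of consecutive hourly counts."""
    return [relative_growth_rate(a, b, 0, 1) for a, b in zip(counts, counts[1:])]


# Hypothetical counts: doubling in the first hour, no change in the second.
print(hourly_rgr([50, 100, 100]))
```

Because the rate is based on the difference of logarithms, a doubling from 50 to 100 tweets yields the same RGR as a doubling from 500 to 1,000, which is what makes categories of different sizes comparable.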

3.3.2 Sentiment analysis

A sentiment analysis was performed on the data after pre-processing. In our case, this means recognizing and extracting subjective information about the different news from the text field of each tweet. Our methodology here relies on Sentimentr, an R package that is one of the most widely used utilities for sentiment analysis on Twitter.


Sentimentr. With a constantly up-to-date database and the adoption of nine dictionaries, this package provides accurate calculations when extracting a sentiment value from a given text. The creator of the package took the theoretical foundations of another sentiment analysis package, Syuzhet, into consideration when composing it (Rinker 2019). Syuzhet accesses four different dictionaries and is in turn based upon another tool called coreNLP, which has been developed at Stanford University (Jockers 2017).
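To give an intuition for dictionary-based sentiment scoring, a toy sketch follows. It is not Sentimentr's algorithm and does not use its dictionaries: the lexicon, the negation rule and the function name are all invented for illustration. Dividing by the square root of the word count is one common way to normalise for text length and is used here as an assumption.

```python
import math
import re

# Toy lexicon and negator list, invented for this example only.
TOY_LEXICON = {"good": 1.0, "great": 1.0, "happy": 0.75,
               "bad": -1.0, "terrible": -1.0, "sad": -0.75}
NEGATORS = {"not", "no", "never"}


def sentence_sentiment(sentence: str) -> float:
    """Score a sentence with a word lexicon and a one-word negation rule."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    total = 0.0
    for i, word in enumerate(words):
        polarity = TOY_LEXICON.get(word, 0.0)
        # Flip the polarity when the preceding word is a negator.
        if polarity and i > 0 and words[i - 1] in NEGATORS:
            polarity = -polarity
        total += polarity
    # Normalise by the square root of the sentence length.
    return total / math.sqrt(len(words))


print(sentence_sentiment("This is great"))
print(sentence_sentiment("not good"))
```

A score above zero indicates positive wording and a score below zero negative wording, which matches the sign convention of the sentiment values reported in Chapter 4.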

3.3.3 Change point analysis

As the results will show, a simple chart of the sentiment and RGR is hard to read and to extract meaningful information from. To analyze the meaningful changes in our time frame, we therefore performed a change point analysis (Killick 2017). A change point indicates that the statistical properties of the series differ before and after that point in time. In this thesis, it allows us to put a number on the amount of changes during our 24h periods. Applied to sentiment data, it in turn yields valuable information about the fluctuation of public sentiment: essentially, an insight into how often the users change their views regarding certain news categories.

Our methodology here again relies on an R package, aptly named changepoint. We used the package's mean function when calculating the change points, which detects changes in the mean. This function has several arguments, the most important ones being the penalty and the method. The method argument specifies which underlying method to use. The creators of the package describe it in an article in the Journal of Statistical Software, noting that the pruned exact linear time (PELT) method was accurate while also being computationally faster than the segmentation methods it was compared with (Killick & Eckley 2014).

The other important argument, the penalty, determines how strict the detection of a change point is. Working with change points involves a good deal of trial and error when determining which method and penalty value to use. We used the PELT method, and for the penalty argument we used two approaches. One was a logarithmic approach where the penalty value was set to 1.5*log(n), in accordance with the examples in the journal article (ibid.). The other was the Modified Bayes Information Criterion (MBIC) (Zhang & Siegmund 2007).
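The idea behind PELT with a logarithmic penalty can be sketched as follows. This is an independent, simplified Python illustration and not the changepoint package's implementation: it assumes a squared-error cost for changes in the mean, omits the MBIC penalty, and reports a change point as the index at which a new segment starts. The example series is hypothetical.

```python
import math


def pelt_mean_changepoints(values, penalty):
    """Return sorted indices at which a new segment (new mean) starts."""
    n = len(values)
    # Prefix sums allow any segment cost to be evaluated in constant time.
    prefix, prefix_sq = [0.0], [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
        prefix_sq.append(prefix_sq[-1] + v * v)

    def cost(start, end):
        """Sum of squared deviations from the mean of values[start:end]."""
        total = prefix[end] - prefix[start]
        return (prefix_sq[end] - prefix_sq[start]) - total * total / (end - start)

    best = [0.0] * (n + 1)     # best[t]: minimal penalised cost of values[:t]
    best[0] = -penalty         # the first segment carries no penalty
    previous = [0] * (n + 1)   # previous[t]: start of the last segment ending at t
    candidates = [0]

    for t in range(1, n + 1):
        scored = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], previous[t] = min(scored)
        # Pruning step: discard s once best[s] + cost(s, t) exceeds best[t].
        candidates = [s for value, s in scored if value - penalty <= best[t]]
        candidates.append(t)

    # Walk back through the stored segment starts to collect change points.
    changepoints = []
    t = n
    while t > 0:
        t = previous[t]
        if t > 0:
            changepoints.append(t)
    return sorted(changepoints)


series = [0.0] * 12 + [5.0] * 12          # hypothetical hourly sentiment values
penalty = 1.5 * math.log(len(series))     # logarithmic penalty, 1.5*log(n)
print(pelt_mean_changepoints(series, penalty))
```

A larger penalty makes each additional change point more expensive and therefore yields fewer detections, which is consistent with the two penalties in the thesis producing different counts for the same series.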


Chapter 4 Results

The results from the data analysis are presented in this chapter. As previously mentioned, the quantitative research is based on the approximately 9,300 tweets that remained after pre-processing around 27,000 tweets. The reduction is in part due to the fact that a great many tweets that were supposedly published in the U.S. were actually published by, for example, South American and Australian Twitter users.

The results are summarized in two charts and two tables. The charts are constructed by aligning the event timestamps listed in section 3.1.1 to hour 0. The time series cover 10 hours before and 14 hours after each event.
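The alignment to hour 0 can be sketched as below. The function name, the event time and the tweet times are hypothetical, and the sketch assumes that hour 0 is the first full hour starting at the event timestamp, so that hour -1 covers the sixty minutes immediately before it.

```python
from collections import Counter
from datetime import datetime, timedelta


def hourly_counts(tweet_times, event_time, hours_before=10, hours_after=14):
    """Count tweets per hour relative to the event (hour 0 starts at the event)."""
    counts = Counter()
    for ts in tweet_times:
        # Floor division keeps tweets just before the event in hour -1.
        offset = int((ts - event_time).total_seconds() // 3600)
        if -hours_before <= offset < hours_after:
            counts[offset] += 1
    # Return every hour in the window, including hours without tweets.
    return {h: counts.get(h, 0) for h in range(-hours_before, hours_after)}


event = datetime(2019, 5, 6, 18, 0)                    # hypothetical event time
tweets = [event - timedelta(minutes=30),               # falls in hour -1
          event + timedelta(minutes=10),               # falls in hour 0
          event + timedelta(hours=13, minutes=59)]     # falls in hour 13
print(hourly_counts(tweets, event))
```

With the default arguments the result contains 24 hourly bins, from hour -10 to hour 13, which corresponds to the window of 10 hours before and 14 hours after the event.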


Figure 4.1: Number of Tweets over 24h

The chart above shows that the celebrity category reached the highest peak in tweets per hour. It also had the highest activity prior to the 0-hour mark, which should come as no surprise since at least one of the subjects, the MET gala, was already known to the public as a yearly event. Furthermore, the crime category had the lowest activity, with just below 200 tweets per hour at one point in its active period. It had no activity before the 0-hour mark, since the topics in question were not known before the events occurred. The other categories achieved similar results, with an activity of somewhere between 200 and 250 tweets at their climax.

The chart above alone is not sufficient to warrant any conclusions on interest, which is why an RGR analysis was performed. Such an analysis does not discriminate on size or time and is therefore more suitable for comparative research.

Category     Crime      Politics   Economy    Celebrity   Global
RGR (%)      -10.284    15.223     2.409      13.973      8.294
Sentiment    -0.152     0.326      -0.116     -0.029      0.198
#Tweets      36         90         84         117         61

Table 4.1: Averages across all categories over 24 hours

Table 4.1 summarizes the important means of each category. The mean relative growth rate represents the average hourly RGR over our time period. The first two columns were opposites among our categories, crime showing the lowest and politics the highest mean RGR. For the sentiment values, the results yet again show that the average sentiment was highest for politics and lowest for crime.

Figure 4.2: Celebrity metrics over 24h

Figure 4.2 visualizes the RGR and sentiment values for the celebrity category over the specified time period. One important aspect of this thesis is evaluating how the public sentiment changes during this time period. Extracting such information from the chart by eye is nigh impossible, and certainly not scientific. Therefore, a change point analysis was performed.

Category      Crime   Politics   Economy   Celebrity   Global
Logarithmic   8       13         14        16          12
MBIC          3       7          11        15          10
Average       5.5     10         12.5      15.5        11

Table 4.2: CPA using the PELT method and two different penalties, logarithmic & MBIC, and the average of these two

Table 4.2 shows the number of change points in each category over the 24h period. Since the crime category had no activity prior to the events breaking out, a fair comparison between crime and the other categories is not really possible unless we exclude the inactive hours. The four remaining categories indicate that changes in sentiment occur quite frequently: the sentiment fluctuates for around half of the period. Excluding crime, politics had the fewest changes and the celebrity category the most.


Chapter 5 Discussion

5.1 Result analysis

Returning to the hypothesis, we can conclude that our initial assumption was partially correct. The results show that we were incorrect regarding public interest, since the RGR metric was highest for the political category and not the celebrity one, though only by a small margin. Our hypothesis regarding the sentiment was correct: crime showed the lowest average sentiment over 24h.

Table 4.1 shows clearly distinct sentiment values across the categories. Surprisingly, the highest sentiment was found in politics. Both the RGR and the sentiment were highest in the politics category and lowest in the crime category.

Since a change point analysis was performed, it is necessary to reflect on why the results appear the way they do. An interesting figure is yet again that of the politics category. Only 10 changes occurred (the average in Table 4.2), the lowest among the four categories excluding crime, even though politics had both the highest sentiment and the highest RGR. This indicates that while politics had the highest sentiment, the very same category had the least fluctuation in public opinion. Essentially, the users felt the most positive about political news while at the same time not straying far from the public opinion. This suggests that the political category exhibits hive-mind behavior. On the opposite side of the spectrum, the celebrity category had the highest number of changes in public opinion. This indicates that celebrity news items are the most polarizing events, which might originate from the fact that celebrities by nature draw a lot of attention, both negative and positive.

An interesting off-topic reflection can be made relating to the current political climate in the U.S. The map below visualizes the red and blue states in the 2016 presidential election (Zifan 2016). Looking at Figure A.5 in the Appendix, it can be noted that the political sentiment, when overlaid with the political climate, indicates a clear relation between the public interest and the political views. For example, in the Washington & New York area, a mostly positive public opinion can be seen. At the same time, most of that area is blue, i.e. Democratic.

Another point of interest concerns the crime category. Viewing the dense cluster of data points around Denver, Colorado in Figure A.2, it can be noted that the public sentiment is overwhelmingly negative. At the same time, that specific county is Democratic.

Figure 5.1: Map over red and blue states in the 2016 election


5.2 Approach analysis

Something to ponder is the precision of the sentiment analysis and our specific methodology for applying it. Sentimentr, our package of choice, is obviously not accurate in every single instance. We can, however, compare the precision of Sentimentr with that of other similar sentiment analysis implementations. One such comparison, between Sentimentr, meanr, the coreNLP/Java approach and the Syuzhet approach, found that Sentimentr performed the best among these four prominent methods. coreNLP, on the other hand, was the most similar to human-scored sentiment values, followed by Sentimentr. It can be concluded that our sentiment analysis is at least grounded in a precise approach (Rinker 2019).

Furthermore, our data acquisition might not contain all the data available regarding our chosen news. We used the standard search API, which is available free of charge, and not all tweets are indexed or available when querying it (Twitter n.d.). This affects the results, since the very foundation of this thesis relies on extracting all tweets related to a certain event.

Finally, regarding the acquisition methodology, it can be noted that an exhaustive collection of all tweets related to a news event is nigh impossible. More specifically, our search terms might not have been comprehensive enough to capture all tweets related to a specific news piece.

5.3 Future research

One of the most glaring faults of our research is that we only had two pieces of news per category. This means that any scientific conclusion rests on quite weak ground. The more news items gathered per category, the stronger the argument becomes. We initially assumed that two news items per category would be sufficient for reaching a more general conclusion. An obstacle was that the free version of the Twitter API only provides data for the last 7 days, which in turn forced us to always be up to date on American news. Even if we had constantly followed the news, it would be challenging to find news items that relate to only one news category. For instance, we at first hesitated to include the news regarding the death of Nipsey Hussle, since he is a celebrity. After examining the data, we noted that there was no mention of him prior to him getting shot that specific day, and therefore classified it as news relating to the crime category.

Finally, building upon the reflections in section 5.1, future research could establish a more scientific correlation between public sentiment and political perspectives based on social media data. This was not the main focus of the thesis, and thus not much effort was spent on this point.


Chapter 6 Conclusion

The results demonstrate that the political news category had the highest public interest, followed by news relating to celebrity, global, economy and crime. Politics likewise received the highest public appraisal among the news categories while also showing the least fluctuation in public sentiment. Our hypothesis was correct on the point of the crime category having the worst public opinion, but incorrect on the celebrity category attaining the highest interest.


Appendix A Appendix

The maps below depict the sentiment values by coordinates. The colors range from negative (red) to positive (green).

Figure A.1: Celebrity Sentiment


Figure A.2: Crime Sentiment

Figure A.3: Economy Sentiment


Figure A.4: Global Sentiment

Figure A.5: Political Sentiment


List of Figures

4.1 Number of Tweets over 24h
4.2 Celebrity metrics over 24h
5.1 Map over red and blue states in the 2016 election
A.1 Celebrity Sentiment
A.2 Crime Sentiment
A.3 Economy Sentiment
A.4 Global Sentiment
A.5 Political Sentiment


List of Tables

3.1 Function for gathering tweets via both the Twitter Developer API and the Google Maps API
3.2 Important columns of each tweet
4.1 Averages across all categories over 24 hours
4.2 CPA using the PELT method and two different penalties, logarithmic & MBIC, and the average of these two


Bibliography

Ahlers, D. (2006), News consumption and the new electronic media, in ‘The International Journal of Press/Politics’, Vol. 11, Belfer Center for Science and International Affairs at Harvard University’s Kennedy School of Government, pp. 23–52.
URL: https://doi.org/10.1177/1081180X05284317

BBC (2019), ‘Royal baby: Duke and duchess of sussex name son archie’.
URL: https://bbc.in/2Ye68Hl

Beckett, L. (2019), ‘“you didn’t vote for me”: Senator dianne feinstein responds to young green activists’.
URL: https://bit.ly/2BN0hjC

Blakemore, E. (2019), ‘What was the arab spring and how did it spread?’.
URL: https://bit.ly/2KBWVFh

Blankstein, A. & Johnson, A. (2019), ‘Rapper nipsey hussle killed in shooting outside his l.a. store’.
URL: https://nbcnews.to/2U5LjAg

Bollen, J., Mao, H. & Zeng, X. (2011), ‘Twitter mood predicts the stock market’, Journal of Computational Science 2(1), 1–8.
URL: https://doi.org/10.1016/j.jocs.2010.12.007

Brenner, J. & Duggan, M. (2013), ‘The demographics of social media users - 2012’.
URL: https://bit.ly/2Qg7XU6

Broersma, M. & Graham, T. (2012), ‘Social media as beat’, Journalism Practice 6(3), 403–419.
URL: https://doi.org/10.1080/17512786.2012.663626

Castillo, C., Mendoza, M. & Poblete, B. (2013), ‘Predicting information credibility in time-sensitive social media’, Internet Research: Electronic Networking Applications and Policy 23.
URL: https://doi.org/10.1108/IntR-05-2012-0095

Chillizza, C. (2014), ‘How citizens united changed politics, in 7 charts’.
URL: https://wapo.st/2WdzFDZ

Denver7, T. (2019), ‘1 student dead, 8 injured in highlands ranch school shooting, officials say’.
URL: https://bit.ly/2VkjvZD

Eichstaedt, J. C., Schwartz Hansen, A. & Kern, M. L. (2015), ‘Psychological language on twitter predicts county-level heart disease mortality’, Psychological Science 26, 159–169.
URL: https://doi.org/10.1177/0956797614557867

Ethirajan, A. (2019), ‘What is the met gala, and who gets to go?’.
URL: https://bbc.in/2J5alsu

Friedman, L. (2019a), ‘Dianne feinstein lectures children who want green new deal, portraying it as untenable’.
URL: https://nyti.ms/2VhjnWp

Friedman, V. (2019b), ‘What is the met gala, and who gets to go?’.
URL: https://nyti.ms/2jx1YrW

Grinberg, N., Joseph, K. et al. (2016), Fake news on twitter during the 2016 u.s. presidential election, in ‘Science 25’, Vol. 363, pp. 374–378.
URL: https://doi.org/10.1126/science.aau2706

Grothaus, M. (2018), ‘Twitter’s q3 earnings by the numbers’.
URL: https://bit.ly/2UgVo8R

Helmore, E. (2019), ‘Kraft heinz share plunge loses warren buffett $4bn in one day’.
URL: https://bit.ly/2U52IoE

Hirschberg, J. & Manning, C. (2015), Advances in natural language processing, in ‘Science 17’, Vol. 349, pp. 261–266.
URL: https://doi.org/10.1126/science.aaa8685

Jockers, M. (2017), ‘Introduction to the syuzhet package’.
URL: https://bit.ly/30Dtmsx

Johansson, H. & Lilja, A. (2016), ‘Method performance difference of sentiment analysis on social media databases’.
URL: https://bit.ly/2XvG3U8


Kao, A. & Poteet, S. (2007), Natural language processing and text mining, Springer-Verlag London, pp. 1–3.
URL: https://doi.org/10.1007/978-1-84628-754-1

Killick, R. (2017), ‘Introduction to optimal changepoint detection algorithms’.
URL: https://bit.ly/2MoLHXr

Killick, R. & Eckley, I. (2014), ‘changepoint: An r package for changepoint analysis’, Journal of Statistical Software, Articles 58(3), 1–19.
URL: https://doi.org/10.18637/jss.v058.i03

Liu, I. L. B., Cheung, C. M. K. & Lee, M. K. O. (2010), Understanding twitter usage: What drive people continue to tweet, p. 92.
URL: https://bit.ly/31g37c8

Moon, S. & Hadley, P. (2014), Routinizing a new technology in the newsroom: Twitter as a news source in mainstream media, in ‘Journal of Broadcasting & Electronic Media’, Vol. 58, pp. 289–305.
URL: https://doi.org/10.1080/08838151.2014.906435

Muller, R. S. (2019), ‘Report on the investigation into russian interference in the 2016 presidential election’.
URL: https://bit.ly/2vbycyT

Newspaper Industry, History of (2019), Encyclopedia of Communication and Information.
URL: https://bit.ly/2Hy9qQk

Paltaglou, G. & Thelwall, M. (2012), Twitter, myspace, digg: Unsupervised sentiment analysis in social media, in ‘Transactions on Intelligent Systems and Technology’, Vol. 3, ACM Transactions on Intelligent Systems and Technology.
URL: http://doi.acm.org/10.1145/2337542.2337551

Pasquini, M. (2019), ‘T.i.’s older sister precious harris dead at 66 following car accident’.
URL: https://bit.ly/2VZm0AO

Rinker, T. (2019), ‘Sentimentr’.
URL: https://bit.ly/2HzIQqc

Rugaber, C. (2019), ‘Unemployment hits 49-year low as us employers step up hiring’.
URL: https://bit.ly/2vCEWG6


Sakaki, T., Okazaki, M. & Matsuo, Y. (2010), Earthquake shakes twitter users: Real-time event detection by social sensors, pp. 851–860.
URL: https://doi.org/10.1145/1772690.1772777

Schouten, F. (2019), ‘Rep. adam schiff introduces constitutional amendment to overturn citizens united’.
URL: https://cnn.it/2JZc1V9

Schudson, M. (1996), Three hundred years of the american newspaper, in ‘The Power of News’, Harvard University Press, pp. 37–53.

Schudson, M. & Leonard, T. (1979), Discovering the news: A social history of american newspapers, in ‘The Journal of American History’, Vol. 66, pp. 9–38.

Shearer, E. & Matsa, K. (2018), ‘News use across social media platforms 2018’.
URL: https://pewrsr.ch/2CEU7m1

Social Media Fact Sheet (2018), Pew Research Center.
URL: https://pewrsr.ch/2JY9kTD

Sung, M. & Hwang, J. S. (2014), Who drives a crisis? the diffusion of an issue through social networks, in ‘Computers in Human Behavior’, Vol. 36, Department of Advertising & PR, Chung-Ang University.
URL: https://doi.org/10.1016/j.chb.2014.03.063

Twitter (n.d.), ‘Search tweets’.
URL: https://bit.ly/2N3xFGX

Weber, J. (2006), Strassburg, 1605: The origins of the newspaper in europe, in ‘German History’, Vol. 24, pp. 387–412.
URL: https://doi.org/10.1191/0266355406gh380oa

Zhang, N. R. & Siegmund, D. O. (2007), A modified bayes information criterion with applications to the analysis of comparative genomic hybridization data, in ‘Biometrics’, Vol. 63, pp. 22–32.
URL: https://doi.org/10.1111/j.1541-0420.2006.00662.x

Zifan, A. (2016), ‘Presidential election by county’.
URL: https://bit.ly/2HRC9zV


TRITA-EECS-EX-2019:362


Contents

Chapter 1 Introduction

1.1 Problem statement

1.2 Approach

1.3 Thesis outline

Chapter 2 Background

2.1 News industry

2.2 Twitter usage

2.3 Natural language processing

2.4 Sentiment analysis

2.5 Previous research

2.6 Definitions

2.6.1 News categories

2.6.2 Application Programming Interface - API

Chapter 3 Methodology

3.1 Data acquisition

3.1.1 Chosen news

3.2 Pre-processing

3.3 Data analysis

3.3.1 Relative growth rate

3.3.2 Sentiment analysis

3.3.3 Change point analysis

Chapter 4 Results

Chapter 5 Discussion

5.1 Result analysis

5.2 Approach analysis

5.3 Future research

Chapter 6 Conclusion

Appendix A Appendix

List of Figures

List of Tables

Bibliography