
Master Thesis in Statistics and Data Mining

What makes an (audio)book popular?

Arian Barakat

Division of Statistics and Machine Learning

Department of Computer and Information Science


Supervisor

Måns Magnusson

Examiner


Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement. For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

© Arian Barakat

Abstract

Audiobook reading has traditionally been used for educational purposes but has in recent times grown into a popular alternative to the more traditional means of consuming literature. In order to differentiate themselves from other players in the market, but also to provide their users with enjoyable literature, several audiobook companies have lately directed their efforts toward producing their own content. Creating highly rated content is, however, no easy task, and one recurring challenge is how to make a bestselling story. In an attempt to identify latent features shared by successful audiobooks and to evaluate proposed methods for literary quantification, this thesis employs an array of frameworks from the fields of Statistics, Machine Learning and Natural Language Processing on data and literature provided by Storytel, Sweden's largest audiobook company.

We analyze and identify important features from a collection of 3077 Swedish books concerning their promotional and literary success. By considering features from the aspects Metadata, Theme, Plot, Style and Readability, we found that popular books are typically published as part of a book series, cover 1-3 central topics, and write about, e.g., daughter-mother relationships and human closeness, but also that they contain, on average, a higher proportion of verbs and a lower proportion of short words. Despite successfully identifying these and other factors, we recognized that none of our models predicted "bestseller" adequately and that future work may need to study additional factors, employ other models or even use different metrics to define and measure popularity.

From our evaluation of the literary quantification methods, namely topic modeling and narrative approximation, we found that these methods are, in general, suitable for Swedish texts but that they require further improvement and experimentation to be successfully deployed for Swedish literature. For topic modeling, we recognized that the sole use of nouns provided more interpretable topics and that the inclusion of character names tended to pollute the topics. We also identified and discussed the possible problem of word inflections when modeling topics for morphologically complex languages, and noted that additional preprocessing treatments such as word lemmatization or post-training text normalization may improve the quality and interpretability of topics. For the narrative approximation, we discovered that the method currently suffers from three shortcomings: (1) unreliable sentence segmentation, (2) unsatisfactory dictionary-based sentiment analysis and (3) the possible loss of sentiment information induced by translations. Despite only examining a handful of literary works, we further found that books originally written in Swedish had narratives that were more cross-language consistent compared to books written in English and then translated into Swedish.

Keywords: Audiobooks, Bestsellers, Algorithmic Criticism, Large-scale Literary Analysis, Natural Language Processing, Gaussian Processes, Topic Modeling


Acknowledgments

When I began this thesis, little did I know about the challenges that were waiting ahead and how deep a rabbit hole this project could be. Although I faced difficulties at times, this thesis has admittedly been an exciting experience, one that has also changed how I view literature. Most importantly, this thesis has reignited my passion for reading, and for that I would like to thank Mikael Holmquist and the people at Storytel for making this project possible.

I would also like to express my heartfelt gratitude to my supervisor Måns Magnusson for the never-ending enthusiasm and support throughout the thesis. Thank you for helping me navigate the ocean of words and letters; without you, I would have been lost at c. Finally, I would like to thank my friends and family, including my girlfriend, for encouraging and supporting me in my academic aspirations. If there is one thing I have learned, it is that the only true wisdom is in knowing you know nothing and that education is a never-ending story. As Socrates once said: "Education is the kindling of a flame, not the filling of a vessel."

Linköping/Stockholm, May 2018


Contents

Abstract iii

Acknowledgments iv

Contents v

List of Figures vii

List of Tables ix

1 Introduction 1

1.1 Background and Motivation . . . 1

1.2 Related Work . . . 2

1.3 Aim and Research Question . . . 4

1.4 Delimitation . . . 5

1.5 Outline . . . 5

2 Material 6

2.1 Corpus and Metadata . . . 6

2.2 Popularity and Reading activity . . . 7

2.3 The Amsterdam Slavic Parallel Aligned Corpus . . . 8

2.4 Additional Resources . . . 8

2.4.1 Sentiment Lexicons . . . 8

3 Methodology 10

3.1 Approach . . . 10

3.2 Preprocessing . . . 12

3.2.1 Part-of-Speech Tagging . . . 13

3.2.2 Named Entity Recognition . . . 14

3.3 Literary Quantification . . . 15

3.3.1 Approximating Narratives through Emotional Trajectories . . . 15

3.3.2 Thematic identification using Latent Dirichlet Allocation . . . 20

3.3.3 Linguistic Patterns and Authorial Style . . . 23

3.4 Predictive Modeling . . . 24

3.4.1 Generalized Linear Models . . . 24

3.4.2 Feature Selection using Elastic-Net Regularisation . . . 26

3.5 Evaluation . . . 27

3.5.1 Accuracy, Precision, Recall and F1-Score . . . 27

3.5.2 Cluster Validation . . . 28

3.5.3 Cross-Validation and Hold-Out method . . . 28

3.5.4 Evaluating Topic models . . . 29


4.1 Theme: Identifying the Literary Content . . . 31

4.1.1 Evaluation: Likelihood of Hold-Out Set . . . 32

4.1.2 Evaluation: Model Diagnostics . . . 34

4.1.3 Evaluation: Human Judgement . . . 36

4.1.4 Summary: Choice of Topic model . . . 37

4.2 Plot: Approximating and Clustering Narratives . . . 38

4.2.1 Choice of Sentiment Lexicon . . . 38

4.2.2 Choice of Kernel Function and Hyperparameters . . . 41

4.2.3 Cross-Lingual comparison of Narrative Approximation . . . 45

4.2.4 Clustering Narratives . . . 48

4.3 Predicting Popularity and Reading activity . . . 51

4.3.1 Features and Modeling Approach . . . 51

4.3.2 Hyperparameter Selection . . . 52

4.3.3 Prediction Accuracy: How well can we predict audiobook popularity? . . . 55

4.3.4 Model Inference: What makes an audiobook popular? . . . 58

5 Conclusion 62

A Appendix A 64

A.1 Sentiment Dispersion . . . 64

A.2 Clustering of Narratives . . . 67

B Appendix B 69

B.1 Hyperparameter Selection . . . 69

B.2 Model Inference . . . 71

B.2.1 Coefficients . . . 71

B.2.2 Thematic Content . . . 71

Bibliography 73


List of Figures

3.1 Outline of the Methodology. . . . 11

3.2 Example of Narrative approximation (Sentiment Time Series) for Jane Austen's Pride and Prejudice. The blue data in the background represent the raw sentence-level sentiments, the black line the rolling mean with window size 0.1 and the red line the approximated narrative using Jockers' Syuzhet DCT transformation with low-pass size 5. Sentiment values rescaled to [−1, 1]. . . . 15

3.3 Example of Narrative alignment using Dynamic Time Warping. Figure (a) displays two approximated narratives along the narrative time, where the grey lines between the narratives illustrate the alignment. Figure (b) presents the same two narratives together with the constructed n-by-m matrix and the warping path between them, which minimizes the distance caused by, e.g., time shifts. . . . 20

3.4 Simplified Overview of Latent Dirichlet Allocation. The black points represent both the bag-of-words representation wd of a document d and the word types, where each word type v can belong to multiple topics (the colors) simultaneously. Each topic is defined as a distribution over word types. . . . 23

4.1 Distribution of inferred predictive log likelihood for all documents in the hold-out set given topic model configurations 1.1-1.3, 2.1-2.3 and 3.1-3.3 trained on the corpus of 3077 books. Note that comparison between configuration groups is not possible due to different text segmentation. . . . 34

4.2 Model Diagnostics (Document-topic entropy, Topic coherence and Kullback-Leibler divergence from a uniform distribution) for the different configurations: 1.1-1.3, 2.1-2.3, 3.1-3.3. The diagnostic values for each feature group are presented column-wise, where the different colors represent the different topic parameter values. . . . 35

4.3 Top 200 words (word clouds) of the four most predominant topics using topic model 3.2. The figure highlights the potential problem of word inflections that may lead to an overrepresentation of a word in the topics. . . . 37

4.4 The assignment of sentiment covariance between events given a narrative distance ∆t using the squared exponential kernel with different length scales and σf = 1. . . . 42

4.5 Marginal log likelihood log p(y|t, θ) (y-axis) of the approximated narrative shape (GP) using different hyperparameters: length scale ℓ (x-axis), noise parameter σn (colours), the Gavagai Swedish Lexicon to extract sentiments and a fixed value σf = 1. . . . 43

4.6 Approximated Narrative shapes using a GP with different length scales ℓ, a fixed noise parameter σn = 1 and the Gavagai Swedish Lexicon to extract sentiments. The amount of sentiment smoothing (generalization) increases as ℓ → ∞ (left to right), visualized by the different facets. . . . 44

4.7 Cross-Lingual comparison of Narrative Approximation for books originally written in Swedish. Top: Gaussian Process with ℓ = 20, σf = 1 and σn = 1. The bands around the mean function (MAP estimate) are the 95% confidence intervals. Middle: Rolling Mean with window size 0.1. Bottom: Syuzhet DCT Transform with a low-pass size of 5. . . . 46

4.8 Cross-Lingual comparison of Narrative Approximation for books originally written in English. Top: Gaussian Process with ℓ = 20, σf = 1 and σn = 1. The bands around the mean function (MAP estimate) are the 95% confidence intervals. Middle: Rolling Mean with window size 0.1. Bottom: Syuzhet DCT Transform with a low-pass size of 5. . . . 47

4.9 Silhouette Coefficient for K = {2, 3, . . . , 50} using the Average, Complete and Ward's linkage criteria with Dynamic Time Warping and Euclidean distance as distance measures. . . . 49

4.10 The Different Narrative Shapes with LOESS estimation for every Cluster. 1: Icarus (The Hill), 2: Man in a Hole (The Valley), 3: Cinderella (The Camel Shape), 4: The Rollercoaster, 5: Oedipus (The Dual Valley), 6: Rags to Riches (The Rise), 7: Riches to Rags (The Fall). Note that the narratives in each cluster may visually appear dissimilar due to time shifts. . . . 50

4.11 The minimum achieved CV-score (Deviance) using the optimal λ for the different feature group models at α = {0, 0.05, 0.1, . . . , 1}. Note that the y-axis is bounded between [4000, 20000]. . . . 53

4.12 The minimum achieved CV-score (MSE) using the optimal λ for the different feature group models at α = {0, 0.05, 0.1, . . . , 1}. . . . 54

4.13 Promotional Success: Density of residuals for predicting the number of Unique Bookmarks for the hold-out set using different models with the corresponding optimal hyperparameters α and λ. Vertical lines represent the first, second and third quartiles. . . . 56

4.14 Literary Success: Density of residuals for predicting the Average Finishing Degree (log-odds) for the hold-out set using different models with the corresponding optimal hyperparameters α and λ. Vertical lines represent the first, second and third quartiles. . . . 57

A.1 Sentiment Dispersion for "The Emperor of Portugalia" . . . 64

A.2 Sentiment Dispersion for "The Heart of a Woman" . . . 65

A.3 Sentiment Dispersion for "A Christmas Carol" . . . 65

A.4 Sentiment Dispersion for "A Fool, Free" . . . 66

A.5 Sentiment Dispersion for "Angels & Demons" . . . 66

A.6 Top 24 Narratives regarding Average Finishing Degree (AFD), Literary success, with the criterion of having more than 100 listeners. The colors indicate the assigned plot cluster. The black line represents the expected narrative shape for every literary work and the band around the line the 95 percent confidence interval. . . . 67

A.7 Top 24 Narratives regarding the number of Unique Bookmarks (UB), Promotional success. The colors indicate the assigned plot cluster. The black line represents the expected narrative shape for every literary work and the band around the line the 95 percent confidence interval. . . . 68


List of Tables

2.1 Summary of the Corpus' Genre and Category Distribution . . . 7

2.2 Summary of Popularity and Reading activity as the number of Unique Bookmarks (UB) and Average Finishing Degree (AFD) . . . 7

2.3 Examples of manually annotated sentences from The Amsterdam Slavic Parallel Aligned Corpus. The sentences are independently annotated by two persons to avoid subjective biases. . . . 8

3.1 List of features used to represent the corpus. . . . 12

3.2 SUC Part-of-Speech Tags with Swedish Examples . . . 14

3.3 Example of Part-of-Speech and Named Entity tagging using Stagger by Östling . . . 14

3.4 Model notation for the Latent Dirichlet Allocation (LDA) . . . 21

3.5 Definition and Interpretation of the LIX Readability Index . . . 24

4.1 The different topic model configurations for identifying the thematic content of the literary works. The configurations are arranged into three main groups regarding chunk size and the inclusion of word types (POS tags), where PN = Proper Nouns and NN = Nouns . . . 32

4.2 Overall (sum) inferred log predictive likelihood for all documents in the hold-out set given a topic model trained on the corpus of 3077 books. Higher values indicate a better fit. Note that comparison between configuration groups is not possible due to different text segmentation. . . . 33

4.3 Sentiment classification of the 103 aligned sentences from the ASPAC Corpus using different Swedish sentiment lexicons. The "Gold" column represents the gold standard, the manually annotated sentiments, where −1, 0 and 1 denote negative, neutral and positive sentences respectively. . . . 39

4.4 Sentiment classification of the 103 aligned sentences from the ASPAC Corpus using different English sentiment lexicons. The "Gold" column represents the gold standard, the manually annotated sentiments, where −1, 0 and 1 denote negative, neutral and positive sentences respectively. . . . 40

4.5 A subset of sentences from the ASPAC corpus where either all the Swedish or all the English lexicons failed to classify the sentiment correctly (hard cases). The examples highlight the potential problems of valence shifters and context-dependent sentiments. The "Gold" column represents the gold standard, the manually annotated sentiments, where −1, 0 and 1 denote negative, neutral and positive sentences respectively. . . . 41

4.6 Number of features within the feature groups: Metadata, Plot, Topic, Style, Readability and All. The latter group is the union of the former feature groups. . . . 52

4.7 Optimal hyperparameters for each feature group model with corresponding CV-scores (deviance) and number of non-zero coefficients for each configuration when modeling Promotional Success . . . 53

4.8 Optimal hyperparameters for each feature group model with corresponding CV-scores (MSE) and number of non-zero coefficients for each configuration when modeling Literary Success . . . 54

4.9 Promotional Success: Prediction accuracy measured as the Mean Absolute Percentage Error (MAPE) and Mean Squared Prediction Error (MSPE) of the log response using different model configurations with corresponding optimal hyperparameters α and λ. The green and red marked models represent the superior and inferior models respectively. . . . 55

4.10 Literary Success: Prediction accuracy measured as the Mean Absolute Percentage Error (MAPE) and Mean Squared Prediction Error (MSPE) using different model configurations with the corresponding optimal hyperparameters α and λ. The green and red marked models represent the superior and inferior models respectively. . . . 56

4.11 Inferred coefficients for Promotional Success for different models using the optimal α and λ. Note that the table does not present all coefficients and that vertical dots imply additional features. . . . 59

4.12 Inferred coefficients for Literary Success for different models using the optimal α and λ . . . 60

B.1 Number of non-zero coefficients for predicting Promotional Success (Unique Bookmarks) for the different feature group models at α-levels {0, 0.05, 0.1, . . . , 1}. The number of non-zero coefficients is obtained by choosing the optimal λ for every model at each α-level. . . . 69

B.2 Number of non-zero coefficients for predicting Literary Success (log-odds Average Finishing Degree) for the different feature group models at α-levels {0, 0.05, 0.1, . . . , 1}. The number of non-zero coefficients is obtained by choosing the optimal λ for every model at each α-level. . . . 70

B.3 Inferred coefficients (All) for Promotional Success for the Theme/Topic and Metadata models using the optimal α and λ . . . 71

B.4 Number of Central Topics among the Top 300 books regarding Average Finishing Degree. A central topic is defined as a topic with a proportion greater than 0.1. Note that there may exist books with all topic proportions less than 0.1 . . . 72

B.5 Top 20 influential Topics for Promotional Success . . . 72


1

Introduction

”A reader lives a thousand lives before he dies. The man who never reads lives only one”

- George R.R. Martin

1.1

Background and Motivation

Storytelling has been a central activity in many cultures as a means of entertainment, education or instilling moral values. It is what links us to our past, but it also provides us with a window into different perspectives. From the oral traditions of folktales and mythologies to the more modern means of written text, stories reach back as far as humans do. More recently, the trend of consuming literary works on the run has become a popular alternative to the more conventional forms of reading. Audiobook reading has traditionally been used for educational purposes but has in recent times found its way to a broader audience with its characteristic marriage of our two oldest storytelling techniques.

Although audiobooks materialize in many formats, it is nowadays more popular to listen to literary works through platforms provided by companies such as Storytel. Several of these companies have lately, similar to their film counterpart Netflix, started to create their own exclusive content as a channel to differentiate themselves from other players in the market. However, producing highly rated content is not an easy task, and one recurring challenge is how to make a bestselling story. Regardless of how stories are told, whether they are written on a piece of paper or narrated by a person, some stories appear to be more popular than others. How is it that stories such as Homer's epic Odyssey are still acclaimed among present-day people and that books like To Kill a Mockingbird and The Da Vinci Code are featured on prestigious lists such as The New York Times Bestseller list [46, 8]? Are bestselling stories positive outliers, or do these stories share latent features that may explain their popularity?

The traditional means of identifying key aspects of literary works originates in the field of literary criticism. The undertaking involves a deeper understanding of the text through a careful review and synthesis of a work's literary elements, such as narrative mode and authorial style. Although these are the core methods for scholars, the conventional form of literary analysis limits the research to a handful of works due to the manual nature of the task. The digitalization of the humanities, together with recent advancements in textual and quantitative methods, has, however, paved the way for a more profound approach to large-scale literary text analysis that is beyond human capabilities [42].

The Bestseller Code, released in 2016, identified a set of aspects that arguably explain the success of bestsellers using quantitative methods. While a portion of the findings points toward already recognized literary elements, the authors also identified a set of subtle signals separating bestsellers from non-bestsellers. The study focused on a corpus of English novels and used solely a dichotomized popularity metric indicating whether a book was featured on the NYT bestseller list [5]. While simplification of real-world complexity is often necessary, it is critical to recognize that reducing a complex notion such as popularity into binary form can often lead to the loss of valuable information, and that access to more precise data may reveal an entirely different set of significant aspects.

1.2

Related Work

The idea of applying quantitative methods to digitized works is not a new concept, and it has shown great success in many applications. The more modern form of the practice has, however, gained impetus with the increase in computational power and aid from the fields of Statistics, Natural Language Processing (NLP) and Machine Learning (ML), advancements that have resulted in tools directly applicable to the analysis of literature. Despite the maturity of some of the methods, it was only recently that this array of technologies was consolidated and applied to the literary domain, into what some refer to as reading machines or algorithmic criticism [36]. The methods suggested for textual analysis range from topic, syntactic and sentiment analysis to named-entity recognition and part-of-speech tagging, which for instance are adopted in [5] and [20].

The premise of quantitative text analysis is that texts possess distinct and quantifiable features or markers that are inherently determined by the author's idiolect and the work's anatomy. These features are effectively used to distinguish texts written by different authors, as authors tend to be consistent in their use of language [23, 44]. However, identifying relevant attributes is one of the many challenges of quantitative text analysis, and many have, for this reason, adopted ideas from frameworks traditionally used in literary studies and other domains. The general approach of algorithmic criticism is suggested to be carried out in a two-step procedure, consisting of a pass over the text to extract features and a subsequent modeling step [5]. Despite the coarse systematization of the methodology, the process of feature extraction is usually further decomposed into more granular elements and is, in that sense, not much different from the decompositions made by literary scholars when studying texts.

Previous research points toward a range of principal elements that separate bestsellers from non-bestsellers, which are rarely described by a single aspect but rather by the synthesis of distinct ones. In addition to the literary elements previously mentioned, research also nominates aspects such as the thematic content and the narrative structure of a story as potential differentiators. Archer and Jockers agree with these findings in [5] and [20] from the studies they conducted on bestselling and non-bestselling novels, as mentioned earlier. In their research, they conclude that components such as the overall theme of a novel, usually consisting of a mixture of topics, and character attribution, especially regarding the protagonist, may influence a book's popularity. Moreover, how the author delivers the content through his or her writing style, and how the author evokes emotions through the plot, are also considered contributing factors for popular literature.

The thematic content and the plot constitute the core elements of the narrative, and it is therefore not surprising that previous research regards these aspects as central parts of "the perfect anatomy of bestsellers" [5]. Jockers relates these discoveries to the literary terms Fabula and Syuzhet¹, originating in Russian formalism, but they are arguably also associated with Aristotle's thoughts on the narrative described in his Poetics. Archer and Jockers suggest that successful stories are limited to three to four central topics occupying 30 percent of the pages and that the plot exhibits a regular beating rhythm, usually induced by conflicts throughout the narrative. The authors found that the use of Latent Dirichlet Allocation (LDA), introduced by Blei et al. [11], with the parameter restriction of 500 topics and the sole inclusion of nouns was favorable for the thematic breakdown, while dictionary-based sentiment analysis showed promising results for extracting the emotional structure of the novels. Despite the existence of more accurate methods of measuring emotional language in texts [25], the authors found that the use of sentiment lexicons is the most effective method of estimating the narrative structure. From the raw narratological sequence of positive and negative sentences, Jockers approximates individual plots using a novel method inspired by the domain of signal processing, in the form of the Fourier transformation [21]. Archer and Jockers later suggest that the various plots belong to one of seven fundamental plot shapes.

While Jockers’ novel method opens up a new and exciting way of analyzing literature, potential drawbacks of the approach have been pointed out [45, 41]. Both Swafford and Schmidt raise the inherent problem of wrong sentiment classification due to the absence of particular sentiment words in a lexicon, and the ringing artifacts introduced by the Fourier transformation that may misrepresent the underlying sentiment. Additional methods for estimating the narrative are presented by [38] and [17], where the authors of the former also address the task of clustering plot shapes from a quantitative perspective. In their study, Reagan et al. investigate three different clustering methods using Principal Component Analysis (PCA), Hierarchical Clustering and Self-Organizing Maps (SOM), all of which point toward the presence of six fundamental plot shapes. An alternative method to approximate the flexible and complex latent function of a narrative is to utilize the framework of Gaussian Processes (GP), as introduced in [49]. While GPs are not typically used in the field of literary text analysis, Beck [7] shows the benefits of applying them when analyzing emotional language that exhibits noisy behavior.
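As a rough illustration of the GP alternative, the sketch below fits a Gaussian Process with a squared-exponential kernel plus observation noise to a synthetic noisy sentiment series. scikit-learn's implementation stands in for whatever tooling [7] or [49] actually use, and all data here are simulated.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic "sentence sentiments": a slow narrative arc plus heavy noise
t = np.arange(200, dtype=float)
true_arc = np.sin(t / 200 * 2 * np.pi)   # one rise-and-fall "plot"
y = true_arc + rng.normal(scale=0.8, size=t.size)

# Squared-exponential (RBF) kernel: nearby narrative positions have highly
# correlated sentiment; the WhiteKernel absorbs sentence-level noise
kernel = 1.0 * RBF(length_scale=20.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(t[:, None], y)

mean, std = gp.predict(t[:, None], return_std=True)
# `mean` is the smoothed narrative shape; mean +/- 1.96*std gives a 95% band
```

Unlike the hard low-pass cutoff of the DCT approach, the GP's length scale is learned from the data and the posterior standard deviation quantifies how uncertain the estimated shape is at every point in the narrative.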

In addition to the thematic and narrative structure of stories, the characters are also suggested to play a central part in a good story [5]. The fact that one-fifth of the bestselling novels used in The Bestseller Code included references to characters in the title supports this further. Archer and Jockers utilized dependency parsing to extract the character attributions described in the novels. They found that bestsellers tend to portray a character with words such as grab and do, think and ask, look and hold, while non-bestsellers describe a character with fewer verbs. Although state-of-the-art algorithms usually identify entities with success, they tend to exhibit difficulties in recognizing characters not referred to by name [5]. Archer and Jockers, however, managed to conclude a range of character attributions that influence literary popularity by using local dependencies between characters and verbs.

The underlying theory behind the study of linguistic and authorial style is that the author's idiolect, consisting of the choice and spelling of words, sentence and paragraph structure, but also grammar and the use of punctuation, constitutes a unique authorial fingerprint. As the authorial style is the mechanism through which the plot, theme and characters are delivered, research suggests that the authorial fingerprint plays an essential role in well-liked books [5]. The practice of stylometry, the application of linguistic and authorial style, is successfully applied in, e.g., forensic linguistics, and Jockers describes markers such as lexical variety, type-token ratio and hapax richness (hapax legomena) as important features [22], which is also supported by Juola and Stamatatos in [23] and [44]. Both Juola and Stamatatos introduce additional stylometric markers, where the latter classifies these markers into the categories of lexical, character, syntactic, semantic and application-specific features.

¹Fabula and Syuzhet are terms originating in the Russian school of literary criticism (Russian formalism) and employed in narratology to describe narrative construction. The Fabula is the content and chronological sequence of events constituting the story, while the Syuzhet is the order in which the events are presented. In other words, the Fabula is the story and the Syuzhet the plot.

Archer and Jockers continue by expanding the concept of how the authorial style influences popularity, introducing the aspect of how the impression of a novel may tell bestsellers apart from non-bestsellers. The aspect of metadata includes, for instance, a novel's title, as titles tend to provide the reader with clues by capturing story events and settings but also, depending on the specificity of the title, the relevance to a reader [5]. Titles often communicate the authors' condensed impressions of their work, and by choosing the right words, the author can use the title as a persuasive advertisement. Jockers goes further in [20] by suggesting additional bibliographic metadata, such as the author's gender, birthplace and age, that may be used as valuable information to reveal literary trends or popular literature.

An aspect that has not received the same degree of attention as the previously mentioned elements of popular literature is the ease with which a reader understands a text, commonly referred to as readability. In [6], Bailin and Grafstein suggest that text passages with a higher degree of readability communicate their content more effectively, thereby reducing the reading effort and extending the reading enjoyment, an essential element for audiobook reading. There exists plenty of research on how to measure readability, which has resulted in a range of metrics covering syntactic as well as lexical complexity [6]. LIX is one formula, among many, that uses various attributes of the text to measure readability, and the use of such text attributes appears to be the conventional approach for the task. Falkenjack et al. enumerate a set of features in [15] and [14], grouped into shallow, lexical, morpho-syntactic and syntactic features, that are directly applicable for extracting readability features for audiobooks. In [14], Falkenjack et al. also propose that readability can be measured with relatively high accuracy without the need for dependency parsing, which usually exhibits high complexity.
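As an illustration, the LIX readability index mentioned above combines the average sentence length with the proportion of long words (more than six characters). A minimal sketch, using naive whitespace and punctuation segmentation rather than a proper tokenizer:

```python
import re


def lix(text):
    """LIX readability: words per sentence + 100 * (share of words > 6 chars).

    A sketch using naive regex segmentation; a production version would use
    a real sentence splitter and tokenizer.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)
```

Higher LIX values indicate harder texts; Swedish children's books typically score far below official or academic prose.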

It is clear that there exists a vast collection of textual features that may distinguish bestsellers from non-bestsellers, and these features are not necessarily mutually exclusive to a single aspect. Archer and Jockers demonstrate this clearly in [5] with a list of 28 000 parsed features covering the various mentioned aspects. However, they successfully reduced the list to a set of 2799 explanatory variables using modeling and regularization techniques such as Nearest Shrunken Centroid (NSC). In addition to Archer and Jockers' attempt to model popular literature, research also proposes Principal Component Analysis (PCA) as an effective approach for feature reduction [23]. The two other models deployed in [5] are the K-Nearest Neighbors (KNN) and the Support Vector Machine (SVM), with predictive accuracies of 90 and 70 percent, respectively. Despite NSC's inferior performance to the KNN, with an accuracy of 79 percent, the authors underlined the benefits of the method considering its ability to reduce features and the ease of interpreting the model.

1.3

Aim and Research Question

There exists an extensive interest among publishers as well as storytellers in identifying qualities that may explain the popularity of audiobooks. Previous research has recognized a collection of aspects that describe bestsellers in written form; however, few to no studies have worked toward finding elements that distinguish bestsellers from non-bestsellers within the domain of audiobooks. This thesis aims to bridge this gap by identifying features that explain popularity and usage among audiobooks using quantitative methods and data from Storytel. The mindset of this undertaking will be of the exploratory and confirmatory sort, yet retaining the aim and focus on how to operationalize different literary qualities into content production. The main research questions of this thesis are as follows:

• What makes Swedish audiobooks popular?


In an attempt to limit the span of the first research question, we define the problem as a hierarchical structure composed of the previously recognized aspects, but also of hypotheses made by Storytel. Each branch of this representation comprises distinct features associated with each aspect. The main aspects we intend to investigate in this thesis are Metadata, Theme, Plot, Style and Readability.

As presented in the related work, many methods have been proposed and applied to decode the latent factors shared by popular literature. The second research question aims to investigate whether these methods are suitable for quantifying Swedish literature. We focus primarily on the proposed frameworks for identifying the thematic content and for approximating the plot posed by the narrative structure.

1.4

Delimitation

Digital Humanities is in its essence interdisciplinary, merging new technology with the more classical academic disciplines. The undertaking of identifying factors that explain popularity among audiobooks may, for this reason, include a wide range of fields. Although it is important to study the problem from different perspectives, we analyze the problem solely from the quantitative viewpoint, as the substantial span of disciplines required for a complete analysis is beyond the scope of the thesis.

While computers have the ability to study literature on a microscopic level, the aim of operationalizing the findings restricts us to investigating only a subset of all possible elements that may apply to content production. As we strive to understand and decode the conclusions of this thesis, we also consider the limitation of using modeling techniques with interpretability as one of their prominent features. The limited time at our disposal for this thesis also restricts us from carrying out a complete exploration of the feature landscape.

Despite having access to user activity from multiple countries and literary work in different languages, the thesis limits itself to the Swedish market as a result of the company's interest in analyzing this particular market. Under this limitation, we only consider texts written in or translated to the cohort's official language (Swedish), as the majority of the cohort are native Swedish speakers.

1.5

Outline

The rest of the thesis is structured as follows. Chapter 2 presents an overview of the material used in this thesis, covering the corpus, the popularity measure and additional external material necessary for this project. Following this, we present an overview of the adopted methodology, where the subsequent sections present the theory in more detail. Chapter 4 presents the results and discussion of the literary quantification and modeling. Finally, a conclusion is given in Chapter 5.


2

Material

”There is more treasure in books than in all the pirate’s loot on Treasure Island.”

— Walt Disney

The material used in this thesis is composed of data provided by Storytel, which is formed by two parts: (1) a collection of books (denoted as the corpus) with matching metadata and (2) popularity and reading activity data aggregated by book. In addition to the aforementioned material, a subset of The Amsterdam Slavic Parallel Aligned Corpus (Swedish and English translations) is used together with a collection of sentiment lexicons for evaluation purposes. We provide a further description of each part separately in the sections below.

2.1

Corpus and Metadata

The corpus is a selection of 14 509 books written in or translated to Swedish, with release dates ranging from 1901 to 2018. Despite the relatively large collection, only a subset of 3077 books is accompanied by popularity and reading activity data that also fulfill two criteria: (1) existing in both an e-book and an audiobook version and (2) being listed within literary genres or categories that exhibit a fictional narrative. Storytel provides books within a wide range of genres and categories; however, as we aim to analyze literary elements mainly found in fictional work, we only include literature within the genres and categories listed in Table 2.1. Each book is also accompanied by basic metadata such as book name, ISBN¹, author(s) name, a short description and whether the book belongs to a series or not.

¹ ISBN is an International Standard Book Number consisting of 13 digits that identifies the product's registrant as well as the specific title, edition and format, used for instance by publishers and booksellers for ordering, listing, sales records and stock control purposes.



Genre           | N    | Genre               | N
Classics        | 87   | Harlequin           | 39
Crime           | 862  | Short stories       | 114
Erotica         | 73   | Teens & Young Adult | 84
Fantasy & SciFi | 162  | Thrillers           | 554
Fiction         | 1102 |                     |
Total           | 3077 |                     |

Table 2.1: Summary of the Corpus' Genre and Category Distribution

2.2

Popularity and Reading activity

In contrast to the traditional approach of determining bestsellers from lists of top-selling or frequently borrowed titles, the material used in this thesis captures the notion of book popularity through user activity obtained from Storytel's platform. Although the data covers user activity from multiple countries, only data concerning the Swedish market has been selected, as a result of the limitation previously stated in section 1.4. Book popularity and reading activity are captured by the two following variables: (1) the number of Unique Bookmarks (UB)² and (2) the Average Finishing Degree (AFD), both aggregated by book. The former variable measures the number of users with at least one bookmark for a given book, indicating the number of individuals that have begun reading a book, while the latter measures the average narrative coverage among the listeners. Since users may have bookmarks for both the audiobook and e-book version of the literary work, the highest of the two is selected at the user level for the AFD variable, followed by an aggregation using the arithmetic mean. A general assumption is that books with a higher number of bookmarks and finishing degree are more successful compared to books with lower values.

It is worth pointing out that the provided data is only a snapshot of the state of user activity, which may affect the popularity measure of newly added literature. To mitigate the potential problem of misrepresenting a book's popularity, only user updates older than 42 days, but younger than 730 days, contribute to the measure of popularity³. Moreover, this rule allows us to interpret the variable as the likelihood of completing a book, as the criterion gives the reader a chance to finish a book once started.
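The aggregation described above (per-user maximum over formats, then the arithmetic mean per book) can be sketched as follows; the record layout and field order are hypothetical, not Storytel's actual schema:

```python
from collections import defaultdict


def aggregate_popularity(records):
    """Aggregate (book_id, user_id, format, finishing_degree) records into
    {book_id: (unique_bookmarks, average_finishing_degree)}.

    A sketch with an assumed record layout: for each user, the highest
    finishing degree over the audiobook/e-book formats is kept, then
    averaged per book.
    """
    per_user = defaultdict(dict)  # book_id -> {user_id: max finishing degree}
    for book, user, _fmt, fd in records:
        per_user[book][user] = max(per_user[book].get(user, 0.0), fd)
    return {
        book: (len(users), sum(users.values()) / len(users))
        for book, users in per_user.items()
    }
```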

Variable                       | N    | Mean     | St. Dev. | Min | Median | Max
Unique Bookmarks (UB)          | 3077 | 4608.108 | 7452.303 | 1   | 2161   | 100563
Average Finishing Degree (AFD) | 3077 | 0.726    | 0.183    | 0   | 0.757  | 1

Table 2.2: Summary of Popularity and Reading activity as the number of Unique Bookmarks (UB) and Average Finishing Degree (AFD)

Table 2.2 presents the summary of the Unique Bookmarks (UB) and Average Finishing Degree (AFD) variables. The table reveals that there exists a skewness in the distribution of UB, with the median book having around 2000 readers, while the mean reaches a value of about 4000 and the most successful book, in terms of number of readers, reaches just above 100 000 readers. The AFD variable, however, displays a relatively symmetrical distribution, with the median almost coinciding with the location of the mean.

² Note that a user may have several bookmarks for a book; however, in this thesis, these are considered as a single unique bookmark, as we intend to measure the number of unique users that have chosen a particular book.

³ This rule implies that if, e.g., a user managed to complete a book 12 days before the popularity and reading activity were extracted and summarized, then this user update will not be reflected in the Unique Bookmarks or Average Finishing Degree variable.



2.3

The Amsterdam Slavic Parallel Aligned Corpus

The Amsterdam Slavic Parallel Aligned Corpus (ASPAC) is a collection of aligned literature translated into Swedish and English, accessed from Språkbanken [35]. For this thesis, we consider a subset of 103 randomly selected sentences from the corpus that we manually annotate for the purpose of evaluating different sentiment lexicons⁴. A few examples of the aligned sentences are provided in Table 2.3, with the manually annotated sentiments in the right column. Although the majority of the selected sentences are classified as neutral, we regard the sentences as a relatively good representation of literary texts, as the distribution of sentiments is reasonably representative and the included novels are from different time periods.

Alice In Wonderland (1865)
  Swedish:   - Hm!
  English:   "Ahem!" said the Mouse with an important air.
  Sentiment: Neutral

The Girl With The Dragon Tattoo
  Swedish:   Det tog henne en stund att snirkla sig fram på den halvt övervuxna vägen, och ännu längre tid att hitta stigen ut till Gottfrieds stuga.
  English:   It took her a while to wind her way along the half-overgrown road, and even longer to find the path to Gottfried's cabin.
  Sentiment: Neutral

Winnie-The-Pooh (1926)
  Swedish:   - Jag bara undrade... Ungefär så stor som Nasse, sa han sorgset för sig själv.
  English:   "I just wondered... About as big as Piglet," he said to himself sadly.
  Sentiment: Negative

The Diary Of A Young Girl
  Swedish:   Pim är inte heller så hjärtlig längre.
  English:   Even Pim is not as nice as he used to be.
  Sentiment: Negative

The Hobbit Or There And Back Again
  Swedish:   Lyckligtvis sken inte solen, så ingen förrädisk skugga uppstod, och ödet var nådigt - han nös inte på en bra stund.
  English:   Luckily there was no sun at the time to cast an awkward shadow, and for a mercy he did not sneeze again for a good while.
  Sentiment: Positive

...

The Hobbit Or There And Back Again
  Swedish:   Är det ett brott att gå vilse i skogen, att vara hungrig och trött och bli snarad av spindlar?
  English:   "Is it a crime to be lost in the forest, to be hungry and thirsty, to be trapped by spiders? Are the spiders your tame beasts or your pets, if killing them makes you angry?"
  Sentiment: Negative

Table 2.3: Examples of manually annotated sentences from The Amsterdam Slavic Parallel Aligned Corpus. The sentences are independently annotated by two persons to avoid subjective biases.

2.4

Additional Resources

2.4.1

Sentiment Lexicons

There exist a number of sentiment lexicons/dictionaries that may be appropriate for detecting sentiments in Swedish texts. Together with the Amsterdam Slavic Parallel Aligned Corpus described above, we evaluate the following lexicons for identifying and approximating the narratives among the literary works in the corpus.

AFINN (English & Swedish): Words rated between -5 and 5, manually annotated by Finn Nielsen [32]. The Swedish lexicon is translated from English.

NRC Word-Emotion Association Lexicon (English & Swedish): Words rated between -1 and 1, developed by Mohammad, Saif M., and Turney, Peter D. as the NRC Emotion Lexicon. The lexicon contains the sentiments negative and positive but also the emotions anger, anticipation, disgust, fear, joy, sadness, surprise and trust. The Swedish lexicon is translated from English using Google Translate (November 2017) [29, 21].

Språkbanken Sentiment Lexicon (Swedish): Words rated between -3 and 3 using a semi-supervised approach. By Bianka Nusko, Nina Tahmasebi and Olof Mogren [33].

Gavagai API (English & Swedish): A proprietary lexicon accessed through an API. The lexicon contains the sentiments and emotions positivity, negativity, fear, hate, love, skepticism, violence, and desire [39, 24].

Gavagai Dictionary (English & Swedish): A compiled version of the proprietary lexicon described above (based on the results from the API requests). Used for practical and comparison reasons when compared to the other lexicons.

Syuzhet Dictionary (English): Words rated between -1 and 1, developed by the Nebraska Literary Lab [21].

The performance of these lexicons will also be used as the basis for our assessment of how suitable dictionary-based sentiment analysis is for Swedish literature.


3

Methodology

”If I have seen further it is by standing on ye shoulders of Giants”

- Sir Isaac Newton

This chapter introduces the reader to the methodology of this thesis as well as to the theoretical frameworks that constitute the method. We begin with an overview of the adopted approach, followed by sections that outline the different frameworks used for quantifying the corpus. The chapter also presents the features we intend to extract to represent the corpus, which are used together with the adopted modeling techniques to identify the latent factors among popular literature.

3.1

Approach

The undertaking of identifying the latent factors of popular literature is approachable in many ways, especially as the notion of popularity rarely is limited to a single definition. The access to user data, as presented in Chapter 2, allows us to elaborate on the definition of popularity by breaking the notion into two facets. We define popularity as (1) the promotional success of a book, captured by the number of unique bookmarks (UB), and (2) a book's literary success, quantified by the average finishing degree (AFD). Although the target variable for literary success is not a probability by definition, we adopt the interpretation of AFD as the likelihood of finishing a book in terms of logarithmic odds. We formulate the undertaking as a prediction problem and limit the scope of the thesis to the literary aspects described in section 1.3.

In a multiple outcome problem, such as the one described above, one could argue that the appropriate modeling approach would be to model the targets jointly, or even expand the model to a multivariate version, particularly in the presence of correlation between the target variables or the error terms. However, regarding our two-faceted representation of popularity, the potential problem of inferring correlation from aggregated data and the different domains of the target variables, we consider the approach of modeling these two notions of popularity separately. This arrangement also provides the opportunity for identifying the optimal model, i.e. the features and hyperparameters, for each definition of popularity. In short, we define popularity as a two-sided notion, modeled as a log-linear (Poisson) model for the promotional success and a logit model (log-odds) for the literary success:

Promotional Success:
$$\text{Popularity}_{UB} \sim \text{Poisson}(g^{-1}(\eta)), \qquad g(\mu_{UB}) = \log \mu_{UB} = \eta = \beta_0 + \beta^T X \quad (3.1)$$

Literary Success:
$$U_{AFD} := \log\left(\frac{\text{Popularity}_{AFD}}{1 - \text{Popularity}_{AFD}}\right), \qquad U_{AFD} \sim \text{Normal}(g^{-1}(\eta), \nu^2), \qquad g(\mu_{AFD}) = \mu_{AFD} = \eta = \beta_0 + \beta^T X \quad (3.2)$$

where $\beta_0$ and $\beta$ are the coefficients under each model, $U_{AFD}$ the log-odds transformation of the average finishing degree, $\mu$ the expected number of unique bookmarks or log-odds average finishing degree, and $X$ the feature representation of the literature. These models are further described in section 3.4.1.
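The two target constructions and their links can be sketched with numpy; this is an illustration, not the thesis implementation, and the clipping constant `eps` is an assumed safeguard for AFD values of exactly 0 or 1, where the log-odds are undefined:

```python
import numpy as np


def logit_afd(afd, eps=1e-3):
    """U_AFD = log(p / (1 - p)), the log-odds of the average finishing degree.

    Values are clipped away from 0 and 1 (assumed handling of boundary cases).
    """
    p = np.clip(np.asarray(afd, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))


def poisson_mean(beta0, beta, X):
    """Inverse log link for the promotional-success model:
    E[UB] = g^{-1}(eta) = exp(beta0 + X beta)."""
    return np.exp(beta0 + np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float))
```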

Since the thesis aims to study a similar problem as queried in [5], it adopts a conceptually equivalent approach to that suggested by Archer and Jockers for the preprocessing and feature construction steps. The general framework of this thesis thus comprises a three-step procedure, as illustrated in Figure 3.1: (1) text preprocessing, (2) literary quantification (feature construction) and (3) predictive modeling. In addition to the aforementioned procedures, we also evaluate the suitability of the suggested methods on Swedish books, as this is a part of the research question.

Figure 3.1: Outline of the Methodology.

The general idea behind literary quantification is to represent the text as a point in the "literature space", with the features defining a book's position in that vector space. We classify these features conceptually as either macroscopic or microscopic, relating for instance to concepts such as the thematic and narratological structure but also to the textual traits associated with the author's style. It is clear that there may exist numerous features with predictive power, but as the purpose of this thesis is to operationalize the findings into content production, we confine the extent of features with interpretability in mind. The features examined in this thesis are presented in Table 3.1, which we divide into different literary elements that form the hierarchical representation described in section 1.3.



Aspect      | Feature         | Description                                                                              | Type [Domain]
Metadata    | TitleLength     | Title length as the number of words                                                      | Discrete
Metadata    | CharacterRef    | Character reference in the title                                                         | Binary
Metadata    | LocationRef     | Location reference in the title                                                          | Binary
Metadata    | TitlePoSFreq    | Title part-of-speech frequencies, [1xP] vector for each book                             | Discrete
Metadata    | Genre           | Genre/Category                                                                           | Categorical
Metadata    | inSeries        | If the book belongs to a series                                                          | Binary
Theme/Topic | TopicProportion | Topic proportion, [1xK] vector for each book                                             | Continuous [0,1]
Theme/Topic | TopicEntropy    | Topic allocation as the empirical entropy (log base 10)                                  | Continuous
Plot        | PlotCluster     | Type of narrative shape                                                                  | Categorical
Style       | DTP             | Document term proportion, [1xV] vector for each book                                     | Continuous [0,1]
Style       | DPoSP           | Document part-of-speech proportion, [1xP] vector for each book                           | Continuous [0,1]
Style       | TTR             | Type-token ratio, number of word types divided by the number of tokens                   | Continuous [0,1]
Style       | Hapax Ratio     | Number of singletons divided by the number of tokens                                     | Continuous [0,1]
Style       | Lexical Density | Word allocation as the empirical entropy over the document-term frequency (log base 10)  | Continuous
Style       | DL              | Document length as the number of words                                                   | Discrete
Readability | MSL             | Mean sentence length as the average number of words per sentence                         | Continuous
Readability | MWLC            | Mean word length as the average number of characters per word                            | Continuous
Readability | LIX             | Readability index, ratio of words longer than 6 characters coupled with average sentence length | Continuous

Table 3.1: List of features used to represent the corpus.

3.2

Preprocessing

The first step toward quantifying literature is to preprocess the corpus. The preprocessing carried out in this thesis covers basic text operations, such as tokenization and text segmentation, as well as more advanced methods for annotating and extracting information from the literature. In this section, we describe in particular the Part-of-Speech (POS) tagging and Named-Entity Recognition (NER) frameworks, both of which are performed using the open-source implementation Stagger by Östling [34].



3.2.1

Part-of-Speech Tagging

To identify syntactical patterns in the author's use of language, we use the framework of Part-of-Speech (POS) tagging. POS, also known as word classes or syntactic categories, are sets of words grouped by linguists based on their syntactic behavior and, depending on the granularity of the classification, are often composed of classes such as nouns, verbs, pronouns and articles. Due to the morphological variety of words, it is not unusual to classify words into a more fine-grained POS-tag system, as suggested by the Stockholm-Umeå Corpus (SUC)¹ and seen in Table 3.2.

The task of POS-tagging is often reformulated into a classification problem and approached either as a rule-based or a supervised machine learning problem. Many methods have been developed over the years, usually associated with dynamic programming or probabilistic models. Examples are algorithms such as the Viterbi algorithm and probabilistic models such as the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM), but also other classification methods such as Decision Trees or K-Nearest Neighbours [26]. In [34], Östling presents an open-source tagger for Swedish texts (Stagger) based on an averaged perceptron, with a per-token accuracy reaching 96.6 percent. In addition to part-of-speech tagging, Stagger also performs NER-tagging and lemmatization of tokens. The averaged perceptron in Stagger assumes a set of feature functions $\varphi_s(h_i, t_i)$ paired with feature weights $\alpha_s$ and an input sequence of the form $w = (w_1, w_2, \ldots, w_n)$. In the task of POS-tagging, the sequence typically takes the form of a sentence, where $(w_1, \ldots, w_n)$ consequently are the words in the sentence. Stagger uses a set of both history-dependent and history-independent binary features, where the former utilize previously assigned tags $(t_{i-1}, t_{i-2}, \ldots)$ and are for instance given by

$$\varphi_s(h_i, t_i) = \begin{cases} 1 & \text{if } t_i = a,\; t_{i-1} = b \\ 0 & \text{otherwise} \end{cases} \quad (3.3)$$

while the history-independent features only consider the current sequence position and are for instance given by

$$\varphi_s(h_i, t_i) = \begin{cases} 1 & \text{if } t_i = a,\; w_i = v \\ 0 & \text{otherwise} \end{cases} \quad (3.4)$$

where $h_i$ is the historical context, and $t_i$ and $w_i$ denote a tag and the word type, respectively, at position $i$ in a sentence. For a word sequence of length $n$ and $d$ feature functions, a scoring function over the sentence $w = (w_1, w_2, \ldots, w_n)$ using a tag sequence $t$ is given by

$$\text{score}_w(t) = \sum_{i=1}^{n} \sum_{s=1}^{d} \alpha_s \varphi_s(h_i, t_i) \quad (3.5)$$

and a sentence $w$ is assigned the sequence of tags that maximizes the scoring function:

$$\hat{t} = \underset{t}{\arg\max}\; \text{score}_w(t) \quad (3.6)$$
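The scoring and argmax in equations (3.5) and (3.6) can be sketched as below. The feature functions and weights here are toy examples (not Stagger's actual feature set), and the argmax enumerates all tag sequences exhaustively, which is only feasible for tiny tagsets; a real tagger would use beam search or Viterbi-style decoding:

```python
from itertools import product


def score(sentence, tags, weights):
    """Sum of weighted binary features over positions: a history-dependent
    (previous tag, current tag) feature and a history-independent
    (current word, current tag) feature."""
    total = 0.0
    prev = "<s>"  # sentence-start placeholder tag
    for word, tag in zip(sentence, tags):
        total += weights.get(("bigram", prev, tag), 0.0)  # history-dependent
        total += weights.get(("emit", word, tag), 0.0)    # history-independent
        prev = tag
    return total


def best_tags(sentence, tagset, weights):
    """Equation (3.6): the tag sequence maximizing the score, by brute force."""
    return max(product(tagset, repeat=len(sentence)),
               key=lambda t: score(sentence, t, weights))
```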

¹ The Stockholm-Umeå Corpus (SUC) is a collection of Swedish texts available at Språkbanken from the 1990s, consisting of one million words in total. The texts are annotated with Part-of-Speech tags, morphological analysis and lemma, as well as some structural and functional information. For detailed information, see https://spraakbanken.gu.se/eng/resources/suc


Pos-Tag | Explanation        | Example | Pos-Tag | Explanation          | Example
AB      | Adverb             | inte    | DT      | Determiner           | denna
HA      | Relative Adverb    | när     | HD      | Relative Determiner  | vilken
HP      | Relative Pronoun   | som     | HS      | Relative Possessive  | vars
IE      | Infinitive Marker  | att     | IN      | Interjection         | ja
JJ      | Adjective          | glad    | KN      | Conjunction          | och
NN      | Noun               | pudding | PC      | Participle           | utsänd
PL      | Particle           | ut      | PM      | Proper Noun          | Mats
PN      | Pronoun            | hon     | PP      | Preposition          | av
PS      | Possessive         | hennes  | RG      | Cardinal number      | tre
RO      | Ordinal number     | tredje  | SN      | Subjunction          | att
UO      | Foreign Word       | the     | VB      | Verb                 | kasta
MAD     | Major Delimiter    | .       | MID     | Minor Delimiter      | ,
PAD     | Pairwise Delimiter | )       |         |                      |

Table 3.2: SUC Part-of-Speech Tags with Swedish Examples

3.2.2

Named Entity Recognition

Characters are a central part of many stories, as they frequently are involved in the different types of conflicts and tensions, as well as the resolutions, that unfold in the narrative. While the theme or the narrative may be the reason why many people read, it is often through the characters and their interaction with the setting that a reader understands and relates to the emotions in the events of a story. To identify named entities in the literature, we use the framework of Named-Entity Recognition (NER).

A named entity (NE) is a discrete real-world object with an attached name, classified into pre-defined categories such as person, organization, product, location, quantity or geopolitical entity. The undertaking of NER is usually broken down into two subtasks: identifying the boundaries of a named entity (segmentation) and the identification (classification) of the NE type [9]. There exist several approaches to NER, one of which is the rule-based approach. This approach builds upon linguistic, grammar-based and hand-crafted rules that exploit features common to NEs, e.g. word capitalization and word shapes. However, the shortcomings of this approach include the difficulty of creating rules that are general enough, but also word ambiguity, which is typical for other NLP techniques as well. Other approaches include supervised learning, one instance of which is the model described in the previous section.

Token    | POS-Tag | NE-Tag | NE-Type
1 Hi     | IN      | O      | _
2 ,      | MID     | O      | _
3 my     | PS      | O      | _
4 name   | NN      | O      | _
5 is     | VB      | O      | _
6 King   | PM      | B      | person
7 Julian | PM      | I      | person

Table 3.3: Example of Part-of-Speech and Named Entity tagging using Stagger by Östling

Named entities are frequently expressed in a multi-word setup, in particular when it comes to named persons. To solve the problem of identifying two separate entities when there is in fact only one, the classification task is formulated as a sequence labeling problem, namely as a beginning (B), inside (I) and outside (O) token sequence labeling task for each entity type. Table 3.3 presents an example of the Part-of-Speech and Named Entity tagging using Stagger.
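Recovering multi-word entities from a BIO-labeled token sequence amounts to grouping each B tag with its following I tags. A minimal sketch of this decoding step:

```python
def decode_bio(tokens, tags, types):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans.

    tags are 'B', 'I' or 'O'; types carry the NE type for B/I positions.
    """
    entities, current, current_type = [], [], None
    for tok, tag, typ in zip(tokens, tags, types):
        if tag == "B":                       # a new entity begins
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [tok], typ
        elif tag == "I" and current:         # continuation of the open entity
            current.append(tok)
        else:                                # 'O' closes any open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities
```

Applied to the token sequence in Table 3.3, the two PM tokens tagged B and I are merged into a single person entity.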



3.3

Literary Quantification

3.3.1

Approximating Narratives through Emotional Trajectories

The following subsections describe the novel method introduced by Jockers in [21] for approximating the narrative in literary work, as well as our proposed approach using the framework of Gaussian Processes, which we view as an extension to current research. Moreover, we explain the clustering method used for organizing similar narrative shapes into distinct groups, which we later use as input features in the modeling phase.

3.3.1.1 The Syuzhet DCT Approach

Jockers presents a novel approach for approximating the narrative structure with the premise that shifts in the emotional language may serve as a reliable approximation of the underlying narrative. The method builds upon the rejected thesis of the American writer Kurt Vonnegut², which describes the shape of plots as the relative emotional trajectory on the "Beginning - End" (the narrative time) and "Ill Fortune - Good Fortune" (sentiment polarity) axes, as shown in Figure 3.2. The suggested method comprises three steps: (1) segmentation, (2) sentence-level sentiment analysis and (3) approximation of the narrative from the raw sentence-level sentiments using the Syuzhet DCT Transform (a low-pass filtered Fourier transform).

Figure 3.2: Example of narrative approximation (sentiment time series) for Jane Austen's Pride and Prejudice. The blue data in the background represent the raw sentence-level sentiments, the black line the rolling mean with window size 0.1 and the red line the approximated narrative using Jockers' Syuzhet DCT transformation with low-pass size 5. Sentiment values rescaled to [-1, 1].


The method begins with segmenting the text into sentences using a rule-based approach, i.e., regular expressions, where for instance periods and exclamation marks are typical boundaries that mark the beginning or the end of a sentence. Each sentence is then parsed using the mentioned dictionary-based sentiment approach, where the words in each sentence are assigned a sentiment value according to the word's defined emotional intensity and polarity in a compiled sentiment lexicon. These values may for instance range between [-5, 5], [-3, 3] or [-1, 1], as described in Chapter 2. In the absence of a sentiment value, the word is assigned a value of zero (neutral). The emotional score of each sentence is then estimated by aggregating the word-sentiment values. This approach implies that, for a given set of sentence sentiments, the emotional trajectory of a book is represented as a vector y = {y1, y2, ..., yS}, where ys is the aggregated emotion score for sentence s and S the total number of sentences.

To illustrate, let the sentence "This is an awful threat" be defined as a collection of words such that s = [w1, ..., w5]. Assuming a sentiment lexicon with values -1 and +1 for negative and positive words, respectively, the emotion score ys for sentence s is computed as follows:

s = [w1, w2, w3, w4, w5] = [This, is, an, awful, threat]
    Parsing:     [0, 0, 0, -1, -1]
    Aggregation: -2
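The parsing and aggregation steps above can be sketched in a few lines; the toy lexicon here is illustrative only, not one of the evaluated lexicons:

```python
def sentence_sentiment(sentence, lexicon):
    """Dictionary-based scoring: sum of per-word sentiment values,
    defaulting to 0 (neutral) for words missing from the lexicon."""
    return sum(lexicon.get(word.lower(), 0) for word in sentence.split())


# Toy lexicon with values -1/+1 for negative/positive words (illustrative)
toy_lexicon = {"awful": -1, "threat": -1}
```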

To infer a general narrative structure from the raw sentence sentiments, which frequently exhibit an irregular pattern, as shown in Figure 3.2, the method includes a third step that extracts the "macro shape" of the narrative. In [21], Jockers adopts a combination of a discrete cosine transformation and a low-pass filter, where the emotional trajectory of a book is treated as analogous to a combination of functions oscillating at different frequencies. The Syuzhet DCT Transform is given by:

Algorithm: Syuzhet DCT Transform
Input:  EmotionScores, a sequence of size S (sentence-level sentiments)
Input:  LowPassSize, the number K of DCT coefficients to keep. Default = 5
Output: Transformed values representing the approximated narrative

/* Assuming vectorized operations */
begin
    dctTransformed <- discreteCosineTransform(EmotionScores);
    /* Filter out high-frequency information by keeping the K first DCT
       coefficients when applying the IDCT. Thus, only the remaining
       low-frequency structure is used to obtain the approximated narrative. */
    dctOut <- inverseDiscreteCosineTransform(dctTransformed, K = LowPassSize);
    if Scale = True then
        /* Rescaling to [-1, 1] */
        dctOut <- 2 * (dctOut - min(dctOut)) / (max(dctOut) - min(dctOut)) - 1;
    end
    return dctOut
end


3.3. Literary Quantification

where the Discrete Cosine Transformation (DCT) for a data sequence of size S is given by

\[
G_y(0) = \frac{\sqrt{2}}{S}\sum_{s=0}^{S-1} y_s, \qquad k = 0
\]
\[
G_y(k) = \frac{2}{S}\sum_{s=0}^{S-1} y_s \cos\!\left(\frac{\pi k (2s+1)}{2S}\right), \qquad k = 1, \dots, S-1 \tag{3.7}
\]

and G_y(k) is the kth DCT coefficient. The inverse Discrete Cosine Transformation (IDCT) is given by

\[
y_s = \frac{1}{\sqrt{2}} G_y(0) + \sum_{k=1}^{K-1} G_y(k) \cos\!\left(\frac{\pi k (2s+1)}{2S}\right), \qquad s = 0, \dots, S-1 \tag{3.8}
\]

Thus, the combination of the Syuzhet DCT Transform and the low-pass filter allows us to see the overall structure of the narrative (i.e., the low-frequency structure) while filtering out high-frequency information [3].

3.3.1.2 The Gaussian Process Approach

To overcome the debated problems with Jockers's method mentioned in Section 1.2, i.e., the ringing artifacts introduced by the Syuzhet DCT Transform and the uncertainty of the extracted sentiment values, we propose an alternative method for representing the narrative from sentiment values using a probabilistic approach. The method is based on the framework of Gaussian Processes (GP) as introduced in [49], which we view as a natural extension to Jockers's method, since it builds upon the prior procedure of sentence splitting and sentiment analysis as previously described.

The theoretical advantages of using a GP for extracting the underlying narrative from raw sentiments are two-fold: (1) the non-parametric form of the model and (2) the probabilistic nature of the framework. The first property allows us to avoid assuming or specifying the unknown narrative form a priori, whereas the second provides the means for taking the uncertainty of the sentence-level sentiments into account for any given sentiment lexicon. Let t = {t_1, t_2, ..., t_S} be the normalized narrative time, where each time point represents the narrative-time location of sentence s, and let y = {y_1, y_2, ..., y_S} be the emotional trajectory (the observed sentence sentiment values) for a given book, as described in the previous section. While the GP, in theory, is a potentially infinite-dimensional generalization of a Gaussian distribution, the model is bounded by the number of sentences in any given book, given our definition of the input space. Following the notation above, we define the observed sentence value as

\[
y = f(t) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma_n^2) \tag{3.9}
\]

where f(t) is the underlying structure of the narrative and ε is the noise, i.e., the sentiment-estimation error of the dictionary-based analysis for any given sentiment lexicon.

A GP is defined as a stochastic model over the latent function f(·) and is given by

\[
f(t) \sim \mathcal{GP}(m(t), K(t, t'))
\]
\[
m(t) = \mathbb{E}[f(t)]
\]
\[
K(t, t') = \mathbb{E}[(f(t) - m(t))(f(t') - m(t'))] \tag{3.10}
\]

where m(t) is the mean function and K(t, t') the covariance function that specifies the similarity, or covariance, between any two given time points. This means that any collection of function values follows a multivariate normal distribution such that



[f(t_1), f(t_2), ..., f(t_S)]^T ~ N(m, K). To obtain the posterior of the latent narrative f(t), the GP combines the information in the observed data (the likelihood) with the prior belief about the unknown f(t) such that

\[
p(f \mid t, y) = \frac{p(y \mid t, f)\, p(f)}{p(y \mid t)} \tag{3.11}
\]

The specification of the prior p(f) usually involves the mean function m(t) = 0, which in the case of sentiment values translates to an unobserved sentiment or a neutral sentence. Since neutral sentences are defined by the absence of emotion words in a dictionary, we treat these sentences as unobserved information, where the uncertainty is handled by the GP prior, which also incorporates the error term ε. This arrangement also allows us to reduce the complexity and running time of the modeling by making the input space sparse.
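Concretely, this means dropping the neutral (zero-score) sentences from the training set before fitting the GP. A minimal sketch with made-up scores:

```python
import numpy as np

scores = np.array([0, -2, 0, 0, 1, 0, 3, 0])  # toy sentence sentiments
t = np.linspace(0, 1, len(scores))            # normalized narrative time
mask = scores != 0                            # neutral sentences -> unobserved
t_obs, y_obs = t[mask], scores[mask]          # sparse GP training inputs
```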

To infer the narrative structure at time points t_* = {1, 2, 3, ..., 100}, which represent the narrative time in percentages, we simply condition f_* on the known information, which gives the following exact distribution

\[
p(f_* \mid t, y, t_*) = \mathcal{N}(f_* \mid \tilde{m}_*, \tilde{K}_*) \tag{3.12}
\]

where

\[
\tilde{m}_* = K(t_*, t)\,[K(t, t) + \sigma_n^2 I]^{-1} y
\]
\[
\tilde{K}_* = K(t_*, t_*) - K(t_*, t)\,[K(t, t) + \sigma_n^2 I]^{-1} K(t, t_*) \tag{3.13}
\]

From the equations above, it is evident that the choice of kernel is essential to the posterior distribution of f(·). There exist many different covariance functions that can be used to add prior structural assumptions, such as smoothness, periodicity or hierarchical structure, to the model. A common kernel is the squared exponential kernel, also known as the radial basis function, shown in Equation 3.14. The kernel is governed by two parameters: the length-scale ℓ, which determines the smoothness, and σ_f, which controls the scaling of the covariance.

\[
k_{se}(t, t') = \sigma_f^2 \exp\!\left(-\frac{(t - t')^2}{2\ell^2}\right) \tag{3.14}
\]
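The posterior in Equations 3.12–3.13 with the squared exponential kernel can be sketched in a few lines of NumPy. This is an illustrative implementation, not the thesis's own code, and the hyperparameter values are arbitrary defaults:

```python
import numpy as np

def k_se(t1, t2, sigma_f=1.0, length_scale=0.1):
    """Squared exponential kernel, Eq. 3.14."""
    d = t1[:, None] - t2[None, :]
    return sigma_f**2 * np.exp(-d**2 / (2 * length_scale**2))

def gp_posterior(t, y, t_star, sigma_n=0.5, **kern):
    """Posterior mean and covariance of f_* given (t, y), Eqs. 3.12-3.13."""
    K = k_se(t, t, **kern) + sigma_n**2 * np.eye(len(t))  # K(t,t) + noise
    K_s = k_se(t_star, t, **kern)                         # K(t_*, t)
    m_star = K_s @ np.linalg.solve(K, y)                  # Eq. 3.12 mean
    K_star = k_se(t_star, t_star, **kern) - K_s @ np.linalg.solve(K, K_s.T)
    return m_star, K_star
```

In practice one would solve the linear systems via a Cholesky factorization rather than repeated `solve` calls, but the direct form above mirrors the equations most closely.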

Most kernels, and consequently GPs, rely on appropriate hyperparameters θ, whose choice is usually formulated as a model selection problem. A typical approach is to maximize the log marginal likelihood with respect to the training data:

\[
\log p(y \mid t, \theta) = -\frac{1}{2} y^T \bar{K}^{-1} y - \frac{1}{2} \log |\bar{K}| - \frac{n}{2} \log 2\pi \tag{3.15}
\]

where \(\bar{K} = K(t, t) + \sigma_n^2 I\). The first term represents the data fit, the second is a complexity penalty that depends on the input space, and the third is a normalization constant. Although maximizing the marginal likelihood is valuable in many situations, our intention of representing the narrative structure with a smooth plot shape means that the hyperparameters that best fit the raw sentiments are not necessarily the most useful ones, which makes hyperparameter evaluation inherently difficult in this context. This leaves us with the strategy of finding a set of hyperparameters that retains the general sentiment fluctuations while smoothing out the short-term noise.
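Equation 3.15 is usually evaluated via a Cholesky factorization for numerical stability. A self-contained sketch with the squared exponential kernel (hyperparameter values are illustrative, not those used in the thesis):

```python
import numpy as np

def log_marginal_likelihood(t, y, sigma_f=1.0, length_scale=0.1, sigma_n=0.5):
    """Eq. 3.15: data fit - complexity penalty - normalization constant."""
    n = len(y)
    d = t[:, None] - t[None, :]
    K_bar = sigma_f**2 * np.exp(-d**2 / (2 * length_scale**2)) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K_bar)                        # K_bar = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K_bar^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                 # = 0.5 * log|K_bar|
            - 0.5 * n * np.log(2 * np.pi))
```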

To approximate the narratives in the corpus from the raw sentiment values, we use a self-implemented algorithm for the Gaussian Process approach described above, based on the pseudo-code presented in [37].

3.3.1.3 Hierarchical Clustering of Narratives

While every book forms a unique story, a collection of stories is often arrangeable into different types of narratives based on, e.g., the events and sentiments in the literature. To identify
