Corpus methods in linguistics and NLP Lecture 4: Quantitative methods
UNIVERSITY OF GOTHENBURG
Richard Johansson
November 17, 2015
the remaining lectures
I all: quantitative methods (today)
I technical track: distributional semantics (Nov 24)
I nontechnical track: historical corpora (Nov 26)
I technical track: large-scale data processing (Dec 1)
I nontechnical track: investigations in syntax (Dec 8)
basic idea in quantitative investigations
I our research question might be hard to measure directly!
I a proxy is something that is easy to observe, but that we think gives us information about what we really care about
I so we need to reduce the research question to something that can be measured numerically and automatically
I for instance, we're interested in text complexity and we use the sentence length as a proxy
I the selection of a proxy must be well motivated unless it is already well established
I don't forget that the proxy is a proxy
overview
basics about frequencies
measuring text complexity
measuring association
measuring differences
frequencies
I the most basic quantitative measure is probably some sort of frequency
I for instance, that's what you use in the two search assignments
I absolute frequency: how many X are there?
I relative frequency: how common is X?
I comparing frequencies: is X more common than Y?
example: word frequency table in Korp
example: word frequency table in AntConc
terminology: type/token distinction
I a word token is a single occurrence of a word: a corpus consists of tokens
I a word type is a unique word: a vocabulary consists of types
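To make the distinction concrete, here is a minimal sketch (the example sentence is made up) that counts tokens and types:

```python
# Token vs. type counts for a tiny made-up text.
text = "the cat sat on the mat and the dog sat too"

tokens = text.split()   # every occurrence counts
types = set(tokens)     # each unique word counted once

print(len(tokens), len(types))  # 11 8
```

Real corpora of course call for proper tokenization rather than whitespace splitting.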
properties of word frequencies
I word frequencies follow a Zipf distribution
I the corpus is dominated by common word types
I the 50 most frequent word types cover about 41% of the corpora at the Swedish Language Bank
I the vocabulary is dominated by rare word types
I of the 11,927,698 observed word types, 7,881,083 (66%) occur only once (hapax legomena)
I in an English corpus I investigated, about 42% of the word types occurred only once
I a couple of references on word distributions:
I Baayen, R.H. 2001. Word frequency distributions. Kluwer.
I Baroni, M. 2009. Distributions in text. In Corpus linguistics:
An international handbook, Volume 2. Mouton de Gruyter.
practical consequences of Zipf's law
I the Zipf distribution has a heavy tail
I there are lots of rare words
I even if each rare word occurs with a very small probability, it's very likely to encounter some rare word
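A quick way to see this in practice is to compute the frequency spectrum of a corpus: how many types occur once, twice, and so on. The toy token list below is invented for illustration:

```python
from collections import Counter

tokens = "a a a b b c d e f g".split()  # toy corpus

freqs = Counter(tokens)             # type -> frequency
spectrum = Counter(freqs.values())  # frequency -> number of types with it

# even in this tiny sample, 5 of the 7 types are hapax legomena
print(spectrum[1], len(freqs))  # 5 7
```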
overview
basics about frequencies
measuring text complexity
measuring association
measuring differences
simple complexity measures: overview
I complexity measures of various kinds are used in a wide range of research
I examples:
I how does the language change in people affected by Alzheimer's disease? Hirst & Wei Feng: Changes in Style in Authors with Alzheimer's Disease, English Studies 93:3, 2012
I which features are typical of translated text? Volansky et al.: On the Features of Translationese, Literary and Linguistic Computing
I what features from L1 can we see in L2? Tetreault et al.: A Report on the First Native Language Identification Shared Task, WS on Innovative Use of NLP for Building Educational Applications
I for an overview of a wide range of complexity measures, see K. Heimann Mühlenbock, I see what you mean, dissertation, University of Gothenburg, 2013
average sentence length
ASL = number of words / number of sentences

[figure: distribution of sentence length (meningslängd) vs. relative frequency (relativ frekvens), plotted in Korp for three corpora: Läkartidningen (mean 18.28), Läsbart (10.85), Samhällsvetenskap (20.02)]
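Given tokenized and sentence-split text, the measure is a one-liner; whether punctuation tokens count as words is a design choice. The two sentences below are invented examples:

```python
# Average sentence length = number of words / number of sentences.
sentences = [
    ["Flickan", "sover", "."],
    ["Pojken", "läser", "en", "bok", "."],
]

asl = sum(len(s) for s in sentences) / len(sentences)
print(asl)  # 4.0 (punctuation tokens counted as words here)
```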
type/token ratio
I simple idea: if the text is more complex, the author uses a more varied vocabulary, so there is a larger number of word types (unique words)
I the type/token ratio is defined as:
TTR = number of word types / number of word tokens
I TTR is very popular, e.g. when studying how child language develops over time, or the effect of some impairment
problems with TTR
I despite its popularity, using TTR can be risky, because it is sensitive to the size of the text
I a 2000-word text doesn't normally have twice the vocabulary size of a 1000-word text!
I extreme case: if the text consists of one word, the TTR is 100%
I see for instance B.J. Richards (1987): Type/Token Ratios:
What Do They Tell Us? J. of Child Language 14:201-9.
I solution: split the text into pieces of the same length, compute TTR over each piece, and take the average
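The chunk-and-average fix is often called the standardized (or mean segmental) type/token ratio. A minimal sketch, where the window size is a free parameter:

```python
def sttr(tokens, window=1000):
    """Mean TTR over consecutive fixed-size chunks; falls back to
    plain TTR for texts shorter than one window."""
    ratios = []
    for i in range(0, len(tokens) - window + 1, window):
        chunk = tokens[i:i + window]
        ratios.append(len(set(chunk)) / window)
    if not ratios:  # text shorter than one window
        return len(set(tokens)) / len(tokens)
    return sum(ratios) / len(ratios)
```

Unlike raw TTR, doubling a text by repeating it leaves this score unchanged.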
readability indices
I a readability index attempts to quantify how hard a text is to read
I the Flesch-Kincaid score:
FC = 206.835 − 1.015 · (nbr of tokens / nbr of sentences) − 84.6 · (nbr of syllables / nbr of tokens)
I in Swedish readability research, the LIX score (läsbarhetsindex) is popular:
LIX = (nbr of tokens / nbr of sentences) + 100 · (nbr of tokens of length > 6 / nbr of tokens)
I <25: very easy (e.g. children's books)
I . . .
I >60: very dense (e.g. scientific prose)
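The LIX formula translates directly into code; this sketch assumes the text is already tokenized and sentence-split, and the one-sentence example is made up:

```python
def lix(sentences):
    """LIX = tokens/sentences + 100 * long_tokens/tokens,
    where a long token has more than 6 characters."""
    tokens = [t for s in sentences for t in s]
    long_tokens = [t for t in tokens if len(t) > 6]
    return len(tokens) / len(sentences) + 100 * len(long_tokens) / len(tokens)

# 4 tokens in 1 sentence, 2 of them longer than 6 characters
print(lix([["ett", "mycket", "komplicerat", "exempel"]]))  # 54.0
```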
other measures. . .
I lexical density:
LD = number of content words / number of word tokens
I frequencies for the most common function words
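Lexical density presupposes PoS-tagged input so that content words (roughly nouns, verbs, adjectives, adverbs) can be separated from function words. The tag set and the tagged sentence below are assumptions for illustration; any tagger's output can be substituted:

```python
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed tag inventory

tagged = [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"),
          ("slept", "VERB"), ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")]

content = [w for w, tag in tagged if tag in CONTENT_TAGS]
ld = len(content) / len(tagged)  # 4 content words out of 7 tokens
print(round(ld, 3))  # 0.571
```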
example: the effect of Alzheimer's disease
I see Hirst & Wei Feng: Changes in Style in Authors with Alzheimer's Disease, English Studies 93:3, 2012
example: selecting suitable sentences for a learner
I what learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences?
I Flickan sover. ('The girl sleeps.') → A1
I Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat- och konditionsvanor och socialekonomiska faktorer i åldrarna 35-54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO). ('During the preparation period, a baseline study was carried out to map, among other things, diabetes heredity, oral glucose tolerance, diet and exercise habits, and socioeconomic factors in the 35-54 age range in several municipalities within the Northwestern and Southeastern healthcare districts.') → C2
I Pilán et al. (2014): Rule-based and machine learning approaches for second language sentence-level readability
Ildikó's sentence complexity measures
overview
basics about frequencies
measuring text complexity
measuring association
measuring differences
measures of strength of association
I what are the most typical objects of the verb write?
I what words would we expect to see near the word Lumpur?
I with association measures, we can discover statistical connections between linguistic units
I . . . for instance, between words
I idioms, fixed phrases, clichés, collocations, . . .
I naively, we could count the most frequently occurring pairs
I but this can be misleading: why?
measures of strength of association: better
I most association measures used in practice check whether two word types x and y co-occur unexpectedly often
I that is, more often than what could be expected from x and y considered independently
I a comprehensive survey of association measures:
I Evert, S. 2005. The statistics of word cooccurrences: Word pairs and collocations. IMS, Stuttgart.
example of an association measure: PMI
I we define PMI (pointwise mutual information) as
PMI = log( P(x, y) / (P(x) · P(y)) )
(where P(x) means the probability of observing x, etc.)
I intuition: this score is
I a large positive number if x and y co-occur more often than would be expected by chance
I and conversely, a large negative number if x and y co-occur less often than expected
I near zero if they are independent
I remember: the κ score for inter-annotator agreement compares observed and expected agreement in the same way
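A sketch of PMI estimated from raw counts of adjacent word pairs; the toy corpus is invented, and in practice one would restrict attention to pairs above some minimum frequency:

```python
import math
from collections import Counter

tokens = "new york is a big city and new york is busy".split()
bigrams = list(zip(tokens, tokens[1:]))

unigram = Counter(tokens)
bigram = Counter(bigrams)
N, M = len(tokens), len(bigrams)

def pmi(x, y):
    p_xy = bigram[(x, y)] / M               # joint probability
    p_x, p_y = unigram[x] / N, unigram[y] / N
    return math.log(p_xy / (p_x * p_y))

print(pmi("new", "york") > 0)  # True: the pair co-occurs more than chance
```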
improved PMI: LMI and normalized PMI
I PMI has a tendency to give too much importance to rare words
I Adam Kilgarriff has suggested a PMI variant that he terms Lexicographer's Mutual Information:
LMI = count(x, y) · log( P(x, y) / (P(x) · P(y)) )
I Gerlof Bouma (2009) proposed a normalized PMI that gives values between −1 and 1:
NPMI = log( P(x, y) / (P(x) · P(y)) ) / ( −log P(x, y) )
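The normalization can be checked directly on hand-picked probabilities: NPMI reaches 1 when x and y always occur together, and 0 under independence.

```python
import math

def npmi(p_xy, p_x, p_y):
    # NPMI = PMI / (-log P(x, y)), bounded between -1 and 1
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

# perfect association: x and y always occur together
print(round(npmi(0.01, 0.01, 0.01), 6))  # 1.0
# independence: P(x, y) = P(x) * P(y)
print(round(npmi(0.01, 0.1, 0.1), 6))    # 0.0
```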
example: frequencies vs normalized PMI in Europarl
most frequent bigrams (raw counts):
of the 137051
in the 77307
to the 53954
, the 52695
, and 45135
on the 43023
the European 41172
that the 40803
and the 38411
the Commission 36225
, I 34503
for the 32333
to be 31677
President , 29707
, which 29358
it is 27150

highest-scoring bigrams (normalized PMI):
Middle East 0.971
Amendments Nos 0.968
Member States 0.934
Amendment No 0.918
Human Rights 0.902
Mr President 0.846
next item 0.846
human rights 0.840
United Nations 0.840
laid down 0.838
once again 0.826
United Kingdom 0.792
European Union 0.792
Of course 0.780
With regard 0.769
carried out 0.763
AntConc
word sketches
I corpus search tools such as the Sketch Engine and Korp can make word sketches (Korp: ordbild)
I for a word: print the words most strongly associated with it
overview
basics about frequencies
measuring text complexity
measuring association
measuring differences
measuring differences between corpora
I is news text more similar to fiction than to chat logs?
I useless question: we need to operationalize it somehow
I we first make a profile of each corpus that summarizes some statistical properties we're interested in
I we then use some measure to compare the profiles
I for an overview, see Kilgarriff (2001) Comparing corpora
I in the examples, we'll assume that the profiles are based on word frequencies, but in principle we could use some other information as well (e.g. PoS tags or syntax)
example: frequencies in Europarl and Wikipedia
Europarl:
the 893691
of 448588
to 429379
and 353080
in 276178
that 223706
is 214560
a 210742
I 140165
for 139694
on 125430
this 118811
be 116139
we 109064
not 97283
are 96053

Wikipedia:
the 791106
of 517461
and 358059
in 276556
to 271909
a 248975
is 156074
as 106784
was 98185
by 93357
that 93127
for 92585
with 79604
on 71310
are 65161
from 59820
comparing frequency proles
I Kilgarriff's article describes a number of ways to compare frequency profiles
I for instance, comparing distributions or ranks
I Spearman's rank correlation coefficient compares how much the ranks differ between two frequency tables:
r = 1 − (6 · Σ di²) / (n · (n² − 1))
where di is the rank difference for word i, and n the number of words
I the maximal value is 1: the profiles are identical
I can be computed in Excel and many data analysis packages
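The formula is straightforward to implement; this sketch assumes the two inputs are rank lists over the same n words, with no ties:

```python
def spearman(ranks_a, ranks_b):
    """r = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the rank difference for word i."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 (identical rankings)
print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (reversed rankings)
```

For data with ties, a library implementation such as scipy.stats.spearmanr is a safer choice.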
example: comparing three corpora
I frequency profiles (the top 500 words) for two news corpora, Dagens Nyheter (1987) and Göteborgsposten (2009), and for a collection of novels
I comparison using Spearman's coefficient:

          DN 1987   GP 2009   novels
DN 1987   1         0.98      0.89
GP 2009             1         0.88
novels                        1
using the comparison function in Korp
I in addition to just printing a measure of the differences between the corpora, it can be informative to highlight the differences
words in easy-to-read and academic writing
PoS tags in easy-to-read and academic writing
verb inection in easy-to-read and academic writing
adjectives describing girls and boys in Göteborgsposten
introduction to the next assignment (technical track)
I measuring associations:
I what are the most important objects of the verb eat?
I comparing genres