(1)

Corpus methods in linguistics and NLP Lecture 1: Introduction

UNIVERSITY OF GOTHENBURG

Richard Johansson

November 3, 2015

(2)

overview

course-related matters
overview of corpora
  types of corpora
  collecting corpora
adding structure to the text
quick overview of search tools

(3)

teaching: technical and nontechnical tracks

- MLT students and doctoral students in NLP follow the technical track
- the others follow the nontechnical track

(4)

lectures

- 8 lectures in total (possibly 9)
- the first four are relevant to all:
  - introduction and overview (this lecture)
  - annotation: adding linguistic information to corpora
  - treebanks: syntactically annotated corpora
  - quantitative methods & statistics
- lectures mainly for an NLP audience:
  - distributional semantics: discovering word meaning
  - large-scale data processing
  - (possibly one on corpora in information retrieval)
- lectures mainly for a linguistic audience:
  - historical corpora: why more difficult than modern?
  - (possibly: introduction to clustering and topic modeling)
  - quantitative investigations in syntax

(5)

literature

- we will use a few parts of the online book edited by Wynne (2004): Developing Linguistic Corpora: a Guide to Good Practice
- in addition, the web page contains pointers to a number of articles

(6)

assignments

                             technical   nontechnical
designing annotation             •            •
searching Swedish corpora                     •
searching in treebanks           •            •
finding collocations             •
distributional semantics         •
using Spark (VG)                 •
project (PhD)                    •

- the first three assignments are already out
- we will discuss them the next few times we meet

(7)

project

- pick a corpus-related topic connected to your research
- formulate a research question
- carry out an investigation (quantitative or qualitative)
- write a report (December–January)
- present it at a seminar (December)

(8)

overview

course-related matters
overview of corpora
  types of corpora
  collecting corpora
adding structure to the text
quick overview of search tools

(9)

what is a corpus?

- a corpus (pl. corpora; Swedish en korpus, korpusar) is a collection of authentic text
  - selected,
  - annotated (linguistically analyzed),
  - and computerized (stored in electronic form),
  - for a specific purpose (but it can of course be reused for other purposes)

(10)

why use corpora in natural language processing?

- in development of NLP systems:
  - training of statistical / machine learning systems, e.g. estimating HMM probabilities for a tagger
  - ...but also development of knowledge-based systems, e.g. Grammatical Framework
- evaluation:
  - what is the accuracy of our tagger?
- discovery:
  - collocations: discovering patterns
  - clustering of words, documents, ...
  - distributional vectors
  - topic models
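As a concrete picture of the first point, the transition probabilities of an HMM tagger can be estimated by simple counting over a tagged corpus. The toy tagged data below is invented for illustration; a real system would count over something like the Brown Corpus:

```python
from collections import Counter

# toy PoS-tagged corpus (invented for illustration)
tagged = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
          ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]

# count tag unigrams and tag bigrams
tags = [tag for _, tag in tagged]
unigrams = Counter(tags)
bigrams = Counter(zip(tags, tags[1:]))

def transition_prob(prev_tag, tag):
    """Maximum-likelihood estimate of P(tag | prev_tag)."""
    return bigrams[(prev_tag, tag)] / unigrams[prev_tag]

print(transition_prob("DT", "NN"))   # 1.0: DT is always followed by NN here
```

Emission probabilities P(word | tag) can be estimated the same way, by counting (word, tag) pairs.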

(11)

but how about linguistics (and the humanities in general)?

- linguistics is much more empirical nowadays than it used to be
- corpus methods allow the linguist to work more efficiently and objectively
- ...but also to pose new research questions

(12)

criticism of the use of corpora in linguistics

- Chomsky (recently): "Corpus linguistics doesn't mean anything. It's like saying suppose a physicist decides [...] that instead of relying on experiments, what they're going to do is take videotapes of things happening in the world and they'll collect huge videotapes of everything that's happening and from that maybe they'll come up with some generalizations or insights."
- but it's not clear what the linguistic counterpart of a controlled experiment in the hard sciences would be
  - introspection? can hardly be called scientific!
  - interviews? should be OK, but costly and limited
  - eye tracking, etc.? also OK, but even more costly and limited

(13)

typical uses of corpora (1)

- quantitative investigations, most importantly frequencies:
  - syntax: are short noun phrase objects more often fronted?
  - sociolinguistics: is the passive voice more frequent among academics?
  - lexicography: what are the senses of the English word line?
  - dialectology: how common is the word bamba as a function of the distance from Gothenburg?
  - press history: how often was Napoleon mentioned in Swedish newspapers from the early 19th century?
  - cultural history: what crimes were people convicted of in 17th-century Stockholm?

(14)

typical uses of corpora (2)

- finding attestations:
  - is that word used nowadays?
  - is it possible to extract more than one wh-pronoun in English?

(15)

typical uses of corpora (3)

- cultural heritage:
  - documenting a moribund language
  - giving people access to older stages of the language, e.g. old law texts, runestones

(16)

but why not just use Google?

- probably the largest corpus around
- but Google searches are not reproducible
  - Kilgarriff (2007): "Googleology is bad science"
  - we know nothing about the selection of the data
- but still nice if we just want to find an attestation: "Can you say that?"
- ...and a rough indication of relative frequencies:
  - is "believe in" more common than "believe on"?

(17)

short and biased history of (computerized) corpora

- Index Thomisticus (1946–2005)
  - http://www.corpusthomisticum.org
- Brown Corpus (1967)
- Press-65 (1970)
- early treebanking: Talbanken / MAMBA (1976–78)
- spoken: London–Lund (1980)
- parallel: HANSARD (~1990)
- Penn Treebank (1995)
- Web as a Corpus (~2003) (SweWAC 2010)

(18)

overview

course-related matters
overview of corpora
  types of corpora
  collecting corpora
adding structure to the text
quick overview of search tools

(19)

types of corpora: a few dimensions

- modality: written, spoken, sign language, multimodal, ...?
- genre / domain: fiction, news, debates, ...?
- speaker/writer: normal, child, aphasic, learner, ...?
- language(s): one, two, many?
  - and if more than one: what is the relation between the languages? parallel, comparable?
- size
- annotation: just words, syntax, dialogue, ...?

(20)

some examples of corpora: typical mixed general language

- Stockholm–Umeå Corpus (general written Swedish):
  - 500 texts, ~2,000 words each
  - 9 main genres, with subgenres:
    - K – imaginative prose
      - KK – general fiction
      - KL – science fiction and mystery
      - KN – light reading
      - KR – humour
- Brown Corpus (Francis and Kučera, 1964)
- British National Corpus

(21)

some examples of corpora: limited amount

- written Finnish Romani (Borin, 2000):
  - 110,000 words
  - a significant part of the total written production in this language
- Gothic: the Silver Bible and a few fragments

(22)

some examples of corpora: parallel

- Europarl: http://www.statmt.org/europarl
  - proceedings of the European Parliament 1996–2006
  - the proceedings are translated into all the member languages
  - 20 languages, ~10–50 million words per language
  - all languages linked sentence by sentence to English

    EN: Technical requirements for inland waterway vessels (vote)
    SV: Tekniska föreskrifter för fartyg i inlandssjöfart (omröstning)

- a similar corpus: the Canadian Hansard (English–French)
- the Bible (or parts of it) in 100 languages: http://homepages.inf.ed.ac.uk/s0787820/bible/
- UN charters...

(23)

some examples of corpora: comparable

- Wikipedia (http://www.wikipedia.org, ...)
- linked article–article (several hundred languages)
- the texts are related but not (often) direct translations
- also contains useful semi-structured information

  Sweden, officially the Kingdom of Sweden, is a Scandinavian country in Northern Europe. It borders Norway and Finland, and is connected to Denmark by a bridge-tunnel across the Öresund. At 450,295 square kilometres (173,860 sq mi), Sweden is the third-largest country in the European Union by area, with a total population of over 9.8 million. Sweden consequently has a low population density of 21 inhabitants per square kilometre (54/sq mi), with the highest concentration in the southern half of the country. ...

  La Svèsia (449 964 km2; 9 082 995) la xe un Stado del nord de l'Eoropa inte la Penisola Scandinava. La so cavedal la xe Stocolma che la gà 765 044 abitanti. Altre cità importante le xe Göteborg, Uppsala, Malmö e Lund. La confina co la Norveja a nord-ovest e co la Finlandia a est; la xe ...

(24)

some examples of corpora: parentchild dialogues

- CHILDES (http://childes.psy.cmu.edu)

*NOR: did you wash your hands Jinny ?

*JIN: yeah .

*NOR: you did already ?

*JIN: already .

*NOR: you forgot the candy off from around your mouth huh ?

<n is wiping J's face>

*JIN: yeah .

*NOR: <laughs> okay <% N goes back to sit down> .

*JIN: do I look dirty ?

(25)

overview

course-related matters
overview of corpora
  types of corpora
  collecting corpora
adding structure to the text
quick overview of search tools

(26)

selection of a corpus

- top-down:
  - why? what is the purpose of the corpus?
  - what? what type of language are we interested in?
- bottom-up:
  - from where? what material can we access?
  - how? what is the method for gathering the material?

(27)

corpus as a sample

- a corpus is a sample in the statistical sense
- the situation is similar to carrying out an opinion poll:
- we select a sufficiently large, representative sample from a well-defined population
- then we can carry out investigations with some degree of statistical certainty

(28)

what's the population? what's representative?

- the sample is representative if what's true about the sample is also true in general
- but: is the New York Times representative of English in general?
- it is probably more meaningful to ask whether it's representative of a genre
- in practice, it's hard to determine whether a corpus is representative
  - Clear (1992): Corpus sampling
  - Biber (1993): Representativeness in corpus design
- corpus collectors sometimes also try to make the corpus balanced, which is equally hard to define
- it is probably more important to document the composition

(29)

the effect of sampling on NLP systems

- the way the corpus has been sampled also has implications in NLP
- at least for data-driven NLP systems that learn from a corpus
- well-known example: the WSJ part of the Penn Treebank, often used for developing syntactic parsers for English
  - vocabulary effects
  - the relative frequencies of e.g. PoS tags differ between genres
  - constructions: for instance, questions are rare in the WSJ

(30)

example: the written part of the BNC

http://www.natcorp.ox.ac.uk

DOMAIN                   %        TIME                   %
Imaginative          21.91        1960–74             2.26
Arts                  8.08        1975–93            89.23
Belief and thought    3.40        Unclassified        8.49
Commerce/finance      7.93
Leisure              11.13        MEDIUM                 %
Natural/pure science  4.18        Book               58.58
Applied science       8.21        Periodical         31.08
Social science       14.80        Misc. published     4.38
World affairs        18.39        Misc. unpublished   4.00
Unclassified          1.93        To-be-spoken        1.52
                                  Unclassified        0.40

(31)

example: the spoken part of the BNC

REGION           %        CONTEXT-GOVERNED       %
South        45.61        Educational        20.56
Midlands     23.33        Business           21.47
North        25.43        Institutional      21.86
Unclassified  5.61        Leisure            23.71
                          Unclassified       12.38

INTERACTION      %
Monologue    18.64
Dialogue     74.87
Unclassified  6.48

(32)

availability of text

- sometimes, we don't have the luxury of selecting a representative sample: we just have to take what we can get
  - for instance Finnish Romani, Runic Swedish, ...
- copyright issues:
  - published on the web ≠ freely available!
  - there is a risk that the work you do will be wasted:
    - ESPC, the English–Swedish Parallel Corpus
    - Twitter corpora
- also, technical issues:
  - fiction is harder to access than web-published text (news, blogs, ...)
  - so at the Department of Swedish we have a huge amount of web data but much smaller amounts of fiction, academic text, etc.

(33)

example: the Koala corpus of contemporary Swedish

- for the recent Koala corpus, we selected five subcorpora where we were sure that the texts were legally kosher:
  - blogs: the authors agreed to release their texts
  - fiction: we selected texts under a Creative Commons license
  - European Parliament proceedings: they are public
  - Wikipedia: everything is under CC
  - government press releases

(34)

overview

course-related matters
overview of corpora
  types of corpora
  collecting corpora
adding structure to the text
quick overview of search tools

(35)

making the text usable for automatic tools

- to be able to carry out quantitative research on text, we need to add more information to the texts:
  - information about the text: metadata
  - information inside the text: annotation
- but first, what do we mean by "the text"?

(36)

storing language in our computers

- plain text file: the file contains only the letters
  - this is what we tend to use when processing corpora
- rich text: not only the letters, but also formatting information such as font, size, color
  - for instance Word, PDF, HTML, ...
  - will typically be converted into plain text before inclusion in a corpus
- other media require complex conversion methods:
  - scanned page: requires OCR
  - recording: requires speech recognition

  Pierre Vinken, 61 years old, will join the board as nonexecutive director Nov. 29.

(37)

the anatomy of a plain text le

- in a text file, the letters (the character symbols) are stored in a sequence
- the character symbols are defined by an international standard called Unicode

(38)

character encoding

- when the text is stored in a file (or transmitted over the Internet), the character symbols are converted into bytes (numbers)
- there are several encoding systems
- nowadays, UTF-8 dominates: it can handle all of Unicode
- a few years back, many language-specific encodings were in use...
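In Python, the conversion between character symbols and bytes is explicit, which also shows what happens when the wrong encoding is used for decoding:

```python
# The same character symbols become different byte sequences
# depending on the encoding system.
s = "på svenska"                    # 'å' is a single Unicode character

utf8 = s.encode("utf-8")            # 'å' becomes two bytes (0xC3 0xA5)
latin1 = s.encode("latin-1")        # 'å' becomes one byte (0xE5)
print(len(utf8), len(latin1))       # 11 10

# decoding UTF-8 bytes with the wrong encoding produces mojibake
print(utf8.decode("latin-1"))       # pÃ¥ svenska
```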

(39)

when we accidentally use the wrong encoding. . .

(40)

are there letters for my language?

- http://www.unicode.org/charts
- most known writing systems (living and dead) are now standardized in Unicode
- for instance, several kinds of runes (see http://www.unicode.org/charts/PDF/U16A0.pdf), but not some lesser-known varieties (e.g. Dalrunor)
- there are also a number of semi-standardized writing systems

(41)

some languages require more complex rendering

- Arabic is written from right to left, and the shape of a letter depends on its position in the word:

  kaf, teh, 'alef, beh →

- Indic scripts (e.g. Devanagari) and Hangul (Korean) combine vowel symbols and consonant symbols in irregular ways
- but even in the Latin script, we use ligatures such as fi and fl

(42)

processing rich text formats

- rich text formats such as Word and PDF are difficult to handle for corpus processing tools, so we typically want to convert them into plain text
- examples:
  - pdftotext: PDF → plain text
  - wvtext: Word → plain text
- the output from these tools often needs a bit of polishing
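A minimal Python sketch of that polishing step; the cleanup rules here (rejoining hyphenated words, removing hard line breaks inside paragraphs) are illustrative assumptions about the raw output, not features of the conversion tools themselves:

```python
import re

def polish(raw):
    """Light cleanup of plain text extracted from a PDF."""
    # rejoin words hyphenated across line breaks: "col-\nlection" -> "collection"
    text = re.sub(r"-\n(?=\w)", "", raw)
    # a single newline is a layout line break: replace it with a space
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # keep blank lines as paragraph breaks
    text = re.sub(r"\n{2,}", "\n\n", text)
    return text

raw = "a corpus is a col-\nlection of text\n\nnext paragraph"
print(polish(raw))   # rejoined words, one line per paragraph
```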

(43)

using text from the web

- boilerplate removal: http://code.google.com/p/boilerpipe

(44)

from image to text: Optical Character Recognition

1.

Lund – Halmstad.

Färden från Lund gjordes genom Engelholm till Margretetorp.

Uppkomne på höjden af Hallandsås nedsände vi afskedstagande blickar öfver en bördig sträckning af Skåne.

Engeltoftas välbyggda sätesgård med dithörande utbrytningar och skimrande kyrkor i förening med täcka lundar pryda den vackra slätten. . . .

(45)

when OCR goes wrong. . .

- from Dalpilen (1923): "Talet utmynnade i ett rungande teve för jubilaren, ännu så kraftfull och ungdomlig." ('The speech ended in a resounding teve for the celebrant, still so vigorous and youthful'; the OCR software has misread leve, 'cheer', as teve, 'TV set')

(46)

when OCR goes wrong (again). . .

- let's look at the popularity of funk music using the Google Ngram Viewer: http://books.google.com/ngrams


http://languagelog.ldc.upenn.edu/nll/?p=2848

(48)

metadata: information about the document as a whole

- language
- creation or publication time
- the author:
  - native language
  - location
  - gender
  - age
- genre
- modality: written? spoken?
- topic classification (e.g. a library classification system)

(49)

adding structure to the text

- to be able to do anything interesting with the text, we need to go beyond the letters and add some structure
- typically, we start by segmenting (splitting) the text into manageable pieces: sentences and words
- then, we add linguistic annotation: morphology, syntax, discourse, ...

(50)

sentence splitting and word tokenization

- the first step in text processing is typically to split it into sentences and words (tokens)
- how do you think this could be done automatically?
- typical method for sentence splitting: look for end-of-sentence punctuation followed by a capital letter
- typical method for word segmentation (tokenization): look for letters between spaces or punctuation marks
- we also need to define what we mean by a word!
  - is cannot one word or two?
  - how about don't or Mary's?
  - how about clitics in Romance languages, e.g. Italian farcela?
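The two typical methods above can be sketched with regular expressions. This is deliberately naive; the corner cases discussed on the following slides are exactly what breaks it:

```python
import re

def split_sentences(text):
    """Split at end-of-sentence punctuation followed by whitespace
    and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

def tokenize(sentence):
    """A token is a run of word characters or a single punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Mary saw the dog. It barked!"
print(split_sentences(text))    # ['Mary saw the dog.', 'It barked!']
print(tokenize("It barked!"))   # ['It', 'barked', '!']
```

Note that tokenize("don't") yields three tokens (don, ', t), which is one reason we need to define what we mean by a word.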


(52)

tokenization and sentence splitting: tricky cases

- automatic tokenization and sentence splitting can be done quite reliably for normal text
- but there are some corner cases...

  "...an account with the U.S. Treasury to buy Savings Bonds online..."
  "...then I went back to the U.S. My dad and I moved..."

- hyphenated words: should we remove the hyphen?
- tokenization in languages that don't use spaces is nontrivial:

(53)

adding linguistic annotation

- if we want to carry out more complex linguistic investigations, we need linguistic annotation:
  - part-of-speech tags: this is a noun
  - morphological analysis: it is in the singular
  - syntactic analysis: it is the subject of the sentence
  - ...
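In practice, such annotation is often stored one token per line with tab-separated columns. The three-column layout below (word, PoS tag, morphology) is an illustrative assumption; real formats such as CoNLL-U use more columns:

```python
# word <tab> PoS tag <tab> morphology (illustrative layout)
annotated = """\
the\tDT\t-
dog\tNN\tSG
barks\tVBZ\tSG.3"""

tokens = [line.split("\t") for line in annotated.splitlines()]

# once parsed, simple linguistic queries become list operations,
# e.g. find all nouns:
nouns = [word for word, pos, morph in tokens if pos == "NN"]
print(nouns)   # ['dog']
```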

- the linguistic annotation can be added manually (high cost, high quality) or automatically (low cost, low quality)
- more about annotation in the next two lectures!

(54)

overview

course-related matters
overview of corpora
  types of corpora
  collecting corpora
adding structure to the text
quick overview of search tools

(55)

corpus statistics

- frequencies:
  - which word is most frequent?
  - which words are most typical of this corpus, compared to the general language?
  - development over time (neologisms, new senses)
  - words, phrases, syntactic constructions, ...
- co-occurrences:
  - which words tend to occur close to the word jazz?
  - which verbs take the noun cake as an object?
  - how do we normally translate the word case from English into Swedish?
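Both kinds of statistics come down to counting over tokens; a minimal sketch with a small invented text:

```python
from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the cat".split()

# frequencies: which word is most frequent?
freq = Counter(tokens)
print(freq.most_common(1))   # [('the', 4)]

# co-occurrences: which words appear within 2 tokens of 'sat'?
window = 2
cooc = Counter()
for i, word in enumerate(tokens):
    if word == "sat":
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[tokens[j]] += 1
print(cooc.most_common())
```

Real collocation finding normalizes these raw counts with an association measure (more on that in the statistics lecture).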

(56)

concordancers

- a concordancer creates a concordance: a list of the contexts in which a search term appears
- today, many concordancers are web-based
- for instance, Korp: http://spraakbanken.gu.se/korp
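The core of a concordancer is a keyword-in-context (KWIC) listing, which is easy to sketch over a tokenized text:

```python
def concordance(tokens, term, width=3):
    """List every occurrence of `term` with `width` tokens of
    context on each side."""
    lines = []
    for i, word in enumerate(tokens):
        if word == term:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}")
    return lines

tokens = "he plays jazz and she plays jazz piano every day".split()
for line in concordance(tokens, "jazz"):
    print(line)
# he plays [jazz] and she plays
# and she plays [jazz] piano every day
```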

(57)

stand-alone concordance / search tools

- AntConc: http://www.laurenceanthony.net/software.html
- WordSmith (Windows only): http://www.lexically.net/wordsmith/

(58)

purpose of concordances

- overview of the usage of a word: its senses, etc.
- typical usage, e.g. for examples in dictionaries
- common co-occurrences
- finding idioms and collocations

(59)

structural searches

- for instance TIGERSearch (lecture 3 and assignment)
- ANNIS (http://corpus-tools.org/annis/) is a web-based alternative

(60)

structural searches (2)

- another example with Korp:

(61)

time, location, . . .

- occurrences of the word kommunism ('communism') in Swedish newspapers 1820–1920

(62)

next lecture: annotation

- adding linguistic information to corpora: annotation
- describing the linguistic model
- managing the annotation project
- measuring annotation reliability
- quick survey of tools for manual and automatic annotation
