(1)

Corpus methods in linguistics and NLP Lecture 2: Annotation

UNIVERSITY OF GOTHENBURG

Richard Johansson

(2)


today's lecture

- the Swedish research council has awarded you 10 MSEK for annotating a large corpus
- how do you organize this project?

(3)

today's lecture: overview

- we need annotation in order to carry out nontrivial linguistic investigations and to build NLP systems
- annotation model: how do we describe linguistic structure (and other things) systematically?
- format: how do we store the linguistic structure in a file?
- some case studies
- the annotation process
- a few annotation tools

(4)


overview

annotation models
how to store the annotation in files
examples of annotation tools
the annotation process

(5)

annotation models

- annotation model: how do we describe linguistic structure (and other things) systematically?
- it will possibly build on some linguistic theory or description, but made more specific
- if you have some software engineering experience: designing an annotation model is similar to object-oriented design

(6)


formalizing the vocabulary and the structure

- what are the units that we describe?
  - phonemes? words? phrases? sentences? passages? documents?
- which different types of units are there?
  - noun phrase, verb phrase, clause, ...
- any other attributes of the units?
  - gender of the author, publication date, ...
  - case, tense, number, ...
- how are the units connected?
  - e.g. organized hierarchically in a phrase structure tree
- what different types of relations are there?
  - subject, object, adverbial, ...

(7)

a few examples

- we'll have a look at some different types of annotation models
- and a few examples of each type

(8)


simplest case: categorization

- we assign a category to the document (or sentence, or passage) as a whole
- the category is taken from a predefined set of categories
- examples:
  - is this email a real message or junk mail (spam)?
  - what is the gender of the author?
  - genre (editorial, learner material, spoken, ...)
  - topic (music, sports, economy, ...)
- more complex cases:
  - hierarchical category systems, e.g. what we have in libraries

(9)

example: categorization by difficulty level

- what learner level (e.g. according to CEFR) do you need to understand the following Swedish sentences?
  - Flickan sover. → A1
  - Under förberedelsetiden har en baslinjestudie utförts för att kartlägga bland annat diabetesärftlighet, oral glukostolerans, mat och konditionsvanor och socialekonomiska faktorer i åldrarna 35-54 år i flera kommuner inom Nordvästra och Sydöstra sjukvårdsområdena (NVSO resp SÖSO). → C2

(10)


example of multiple categories: relevance annotation for information retrieval

- given an information need, for instance "is there a relation between the consumption of red meat and colon cancer?"
- is the document relevant or not?

(11)

word tagging

- we have one or more annotations for each word
- classical case: morphological (part-of-speech) annotation

  The/determiner  rain/noun  falls/verb  hard/adverb  ./punctuation

- another example: word sense annotation (e.g. WordNet)

  The  rain/rain#n#1  falls/fall#n#5  hard/hard#a#4  .

(12)


example: the Stockholm-Umeå Corpus

(columns: token, morphological tag, lemma)

Avspänningen            NN.UTR.SIN.DEF.NOM      avspänning
mellan                  PP                      mellan
stormaktsblocken        NN.NEU.PLU.DEF.NOM      stormaktsblock
och                     KN                      och
nedrustningssträvanden  NN.NEU.PLU.IND.NOM      nedrustningssträvande
i                       PP                      i
Europa                  PM.NOM                  Europa
har                     VB.PRS.AKT              ha
inte                    AB                      inte
mycket                  PN.NEU.SIN.IND.SUB/OBJ  mycket
motsvarighet            NN.UTR.SIN.IND.NOM      motsvarighet
i                       PP                      i
Mellanöstern            PM.NOM                  Mellanöstern
.                       DL.MAD                  .
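
As a small illustration (not from the original slides, and the actual SUC file format may differ), a column-oriented file with one token per line, holding word, tag and lemma, could be read with a few lines of Python; the file name is hypothetical:

    # read a whitespace-separated file with columns: word, tag, lemma
    with open("suc_sample.txt", encoding="utf-8") as f:
        tokens = [line.split() for line in f if line.strip()]

    for word, tag, lemma in tokens:
        print(f"{word}\t{tag}\t{lemma}")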

(13)

segmentation

- we split the text into segments; each segment has some label
- a couple of examples:
  - segmentation of a text by language
  - dialogue act tagging, e.g. Stolcke et al. (1997)

(14)


bracketing

- we select and label some parts of the text
- example: named entity annotation:

  [United Nations]ORG official [Ekeus]PER heads for [Baghdad]LOC .

- the MPQA corpus of opinion annotation (Wiebe et al., 2005):

(15)

graph

- a graph is formally a set of nodes (or vertices) connected by edges
- the nodes and edges may have labels
- example: the graph nodes are the words, and they are connected by edges representing syntactic and semantic relations

  [figure: a graph over the sentence "We were expecting prices to fall", with syntactic edges labeled ROOT, SBJ, VC, OPRD, OBJ, IM and semantic edges labeled A1]
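
To make the data structure concrete (a sketch that is not part of the original slides), such a labeled graph can be stored simply as lists of edges; the attachments below are illustrative guesses, the figure gives the real ones:

    # words of the example sentence, indexed from 0
    words = ["We", "were", "expecting", "prices", "to", "fall"]

    # edges as (head, dependent, label); None marks the artificial root
    syntactic_edges = [
        (None, 1, "ROOT"),
        (1, 0, "SBJ"),
        (1, 2, "VC"),
        (2, 3, "OBJ"),
        (2, 4, "OPRD"),
        (4, 5, "IM"),
    ]
    semantic_edges = [
        (2, 3, "A1"),   # "prices" as an argument of "expecting"
        (5, 3, "A1"),   # "prices" as an argument of "fall"
    ]

    for head, dep, label in syntactic_edges + semantic_edges:
        head_word = "ROOT" if head is None else words[head]
        print(f"{head_word} --{label}--> {words[dep]}")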

(16)


example: Abstract Meaning Representation

- http://amr.isi.edu/index.html

  photo by Jeff Flanigan, Chuan Wang, and Yuchen Zhang

(17)

important special case of graph: tree

- a tree is a graph where each node except the root has exactly one incoming edge
- the most well-known use of trees is syntax annotation (more about syntax in the next lecture)

(18)


example of another tree: rhetorical structure tree

- trees are used not only for syntax, but also in annotations such as discourse/rhetorical trees, e.g. the Rhetorical Structure Theory treebank (Marcu, 1997)

(19)

feature structures, formulas

- feature structure representations (e.g. HPSG) and logical formulas are equivalent to graphs
- the difference is just a matter of presentation
- example: representation of discourse structure using a feature structure and a graph (Baldridge and Lascarides, 2006)

(20)


discussion: annotating coreference (anaphora)

CHEYENNE - Two more teens charged in connection with a vehicle theft and burglary scheme earlier this year have admitted their conduct.

Aaliyah Cotton, 18, the alleged ringleader of the trio, pleaded guilty Oct.

26 before Laramie County District Judge Steven Sharpe to counts of felony theft of a vehicle and conspiracy to commit burglary.

Then Angel J. Gonzalez, 18, pleaded guilty Thursday before Laramie County District Judge Catherine Rogers to conspiring to commit burglary and misdemeanor property destruction.

Gonzalez told Rogers the group went "car hopping," which means they checked for open car doors and stole valuables.

"We went to dierent - multiple - places," Gonzales said. "I only opened one car door, and that's it."

(21)

the first assignment

- you'll consider four different scenarios
- think of how to represent the information and design the annotation
  - the structure: what units are we speaking about, and how do they fit together?
  - the vocabulary: what different types of units and relations are there?
- for some of the tasks, I don't really know the answer!

(22)


multilayered annotation

- annotation is often organized into layers: more complex annotation may be built on top of simpler annotation
- for instance, in the Penn Treebank (Marcus et al., 1993):
  - the text is split into sentences and tokens
  - part-of-speech tags are annotated for each token
  - phrases are annotated on top of the part-of-speech tags
- other annotation projects have added additional annotation on top of the PTB:
  - PropBank and NomBank for verb and noun semantics, respectively (Palmer et al., 2005; Meyers et al., 2004)
  - the Penn Discourse Treebank for discourse structure (Prasad et al., 2008)

(23)

example: syntactic and frame-semantic annotation

- the SALSA project created frame-semantic annotation on top of the TIGER treebank of German
- Burchardt et al. (2006): The SALSA corpus: a German corpus resource for lexical semantics

(24)


overview

annotation models
how to store the annotation in files
examples of annotation tools
the annotation process

(25)

example of what not to do!

(26)


example of format, late 1970s

(27)

examples of primitive coding systems

- categories: using file names or directories
- Excel sheets
- home-made text formats
  - e.g. the Penn Treebank uses round brackets for the structure

  [figure: a Penn Treebank-style bracketed tree, with node labels such as SBARQ, SQ, VP, WHADVP, ADVP, PRP, the function tag SBJ, and a trace *T*]
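
As an aside (not from the original slides), bracketed Penn Treebank-style structures can be parsed with, for example, NLTK's Tree class; the tiny tree below is just a made-up illustration:

    from nltk import Tree

    # a made-up bracketed string in Penn Treebank style
    bracketed = "(S (NP (DT The) (NN rain)) (VP (VBZ falls) (ADVP (RB hard))) (. .))"

    tree = Tree.fromstring(bracketed)
    tree.pretty_print()        # draw the tree as ASCII art
    print(tree.pos())          # [('The', 'DT'), ('rain', 'NN'), ...]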

(28)


encoding of structure with XML

- XML (Extensible Markup Language) is a standard for encoding structured data
- XML consists of text mixed with structural tags
- start and end tags are used to encode nesting
- tags can have attributes

  <document author="Selma Lagerlöf" title="Anna Svärd">
    <chapter title="KARLSTADSRESAN">
      <paragraph>
        <sentence>
          <word pos="CONJ">Men</word>
          <word pos="NAME">Thea</word>
          <word pos="NAME">Sundler</word>
          <word pos="VERB">bar</word>
        </sentence>...
      </paragraph>...
    </chapter>...
  </document>
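
As an illustration (not part of the original slides), a well-formed file in this kind of format could be read with Python's standard xml.etree.ElementTree module; the file name below is hypothetical:

    import xml.etree.ElementTree as ET

    # parse the corpus file and print (word, part-of-speech) pairs per sentence
    tree = ET.parse("anna_svard.xml")
    for sentence in tree.getroot().iter("sentence"):
        pairs = [(w.text, w.get("pos")) for w in sentence.iter("word")]
        print(pairs)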

(29)

XML is a meta-format

- XML is a general format for encoding structure: a meta-format
- the format itself knows nothing of what we are trying to achieve
- so we still need to decide how the structures are encoded

(30)


example of name bracketing using XML (alternative 1)

- one option: names are annotated directly inside the text (inline)

  <DOCUMENT>
  Om man till exempel tänker på <PERSON>Charlotte Löwensköld</PERSON>, så hade hon också velat förmå honom att resa till <LOCATION>Karlstad</LOCATION> och försona sig med sin mor.
  </DOCUMENT>

(31)

example of name bracketing using XML (alternative 2)

- with separate annotation of the words:

  <DOCUMENT>
    <w>Om</w>
    <w>man</w>
    <w>till</w>
    <w>exempel</w>
    <w>tänker</w>
    <w>på</w>
    <PERSON>
      <w>Charlotte</w>
      <w>Löwensköld</w>
    </PERSON>
    <w>,</w>
    ...

(32)


example of name bracketing using XML (alternative 3)

- stand-off: the linguistic annotation is separate from the text (possibly in another file), and refers to the text

  <DOCUMENT>
    <TEXT>Om man till exempel tänker på Charlotte Löwensköld, så hade hon också velat förmå honom att resa till Karlstad och försona sig med sin mor.</TEXT>
    <NAMES>
      <PERSON start="31" end="51"/>
      <LOCATION start="140" end="148"/>
    </NAMES>
  </DOCUMENT>
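
To make stand-off annotation concrete (a sketch not taken from the slides; whether offsets are 0- or 1-based, and whether the end is inclusive, depends on the annotation scheme), the annotated spans can be recovered by slicing the text with the stored offsets:

    import xml.etree.ElementTree as ET

    # read a stand-off annotated document in the format sketched above
    root = ET.parse("standoff_example.xml").getroot()   # hypothetical file name
    text = root.find("TEXT").text

    for ann in root.find("NAMES"):
        start, end = int(ann.get("start")), int(ann.get("end"))
        # assumes 0-based, end-exclusive offsets; adjust to the scheme actually used
        print(ann.tag, repr(text[start:end]))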

(33)

standardized formats

- the most well-known format for linguistic annotation is TEI (Text Encoding Initiative)
  - the first version of the standard was released in 1990
  - the latest incarnation (version 5) in 2007
- a newer format, more widely used in NLP: LAF (Linguistic Annotation Format), an ISO standard

(34)


overview

annotation models
how to store the annotation in files
examples of annotation tools
the annotation process

(35)

annotation tools

- to be productive, it is essential that we use a practical annotation tool
- for some typical tasks (e.g. tagging, bracketing) there are several tools available
- we'll exemplify annotation tools for a few different tasks

(36)


example of an annotation tool for bracketing: Callisto

(37)

a web-based tool: Brat (here, names)

(38)

-20pt

UNIVERSITY OF GOTHENBURG

word sense annotation (in-house tool)

(39)

frame semantic annotation with SALTO

(40)


automatic annotation tools

- for many classical linguistic annotation tasks, there are automatic tools that can do the job for us:
  - part-of-speech annotation: part-of-speech taggers
  - morphological features: morphological analyzers
  - syntactic annotation: syntactic parsers
  - name annotation: named entity recognizers
  - anaphora annotation: coreference resolvers
  - ...
- so why don't we just use these tools instead of annotating?

(41)

automatic annotation tools: pros

- with automatic tools, we can annotate lots of text!
- for instance, at the Department of Swedish, we have about 10 billion words in our collections
- the largest manually annotated corpora tend to have a few million annotated words

(42)


automatic annotation tools: cons

- but the automatic tools sometimes make mistakes
- the error rate depends on how hard the task is
- also, it's a question of resources: did the automatic system have access to linguistic resources such as morphological lexicons or grammars?
- if a linguist is interested in a complex and rare phenomenon, it's unlikely that the automatic system will be able to find it reliably
- they also work better for English than for other languages...

(43)

automatic annotation tools: effects of domain / genre

- a typical part-of-speech tagger has an error rate of about 1 in 40 on English news text
- but on social media, the error rate will probably be something like 1 in 7
  - Foster et al. (2011): From News to Comment: Resources and Benchmarks for Parsing the Language of Web 2.0
- a similar situation arises if we apply modern tools to older stages of a language
  - Pettersson et al. (2012): Parsing the Past - Identification of Verb Constructions in Historical Text

(44)


example: Stanford CoreNLP

- http://nlp.stanford.edu:8080/corenlp

(45)

example: Språkbanken's annotation lab

- http://spraakbanken.gu.se/korp/annoteringslabb

(46)


semi-automatic annotation

- even if the automatic tools aren't reliable enough for high-quality annotation, maybe they can still be useful?
- can we get something that is quite OK, and then have annotators clean up the errors?
- this is quite common:
  - the Penn Treebank used an automatic part-of-speech tagger as a first step
  - modern Web-based tools such as WebAnno learn during annotation: after you've annotated for a while, they will give suggestions
- but is this safe?
  - isn't there a risk of bias and error reinforcement because annotators are lazy or insecure?

(47)

overview

annotation models
how to store the annotation in files
examples of annotation tools
the annotation process

(48)


example of a process

- to make the work more systematic, it can be useful to reflect on how the annotation process will work
- MAMA (Model-Annotate-Model-Annotate; Pustejovsky and Stubbs, 2012)
  - start with quick cycles (pilot studies)
  - until the specifications are fairly stable

(49)

annotation manual / specifications

- when we have formalized the model, we need to write down a manual that describes it
- the clarity of the manual will influence the quality of the annotation
- a few useful things to include:
  - the purpose of the annotation
  - definitions of the concepts in the model
  - ... and practical explanations of how they are applied
  - a reasonable number of examples
  - descriptions of common hard cases and borderline situations

(50)


example: Karin's manual

(51)

example: the manual for the syntactic annotation in Koala

(52)


getting annotators

- after the initial pilot studies, it is time to find someone to do the hard work
- how do we find suitable annotators?
  - hire students you know?
  - consultancy companies such as Academic Work?
- ideally, the annotators should annotate some initial training material, after which they get feedback

(53)

crowdsourcing

- crowdsourcing: using a large pool of non-expert annotators instead of trained linguists
- the most well-known framework is Amazon Mechanical Turk: http://mturk.com
  - it takes a bit of work to set up the system
  - it works best if the annotation can be split into several small steps
- also: games with a purpose

(54)


does crowdsourcing work in practice?

- is it possible to do anything if the task is linguistically complex?
- can we use it if the language is not English?
- risk of cheating!
- is it even ethical?
- arguments for:
  - Snow et al. (2008): Cheap and fast - but is it good? Evaluating non-expert annotations for NL tasks
  - Hovy, Plank, and Søgaard (2014): Experiments with crowdsourced re-annotation of a POS tagging data set
- arguments against:
  - Adda et al. (2011): Crowdsourcing for language resource development: Critical analysis of Amazon Mechanical Turk overpowering use

(55)

example: the people's synonym lexicon Synlex

(56)


adjudication

- after annotation, a specialist can go through a part of the annotation to estimate the overall quality
- adjudication: going through the annotations where the annotators disagree
- (but of course it can happen that annotators are wrong even when they agree)

(57)

measuring agreement

- if we have more than one annotator, we may investigate the quality of the annotation by comparing the annotations to each other: the inter-annotator agreement
- simplest idea: just see how often the annotators agree

(58)


example

- we want to build a spam filter, so we annotate a corpus of 10 emails using two annotators
- they agree in 80% of the cases
- is this any good?

  Jane  Joe
   Y     N
   N     N
   N     N
   N     N
   N     N
   N     N
   N     N
   N     Y
   N     N
   N     N

(59)

Cohen's κ

- estimate a chance agreement probability P(e): if the two annotators were completely independent, what is the chance that they agree?
- compare this to the estimated probability of agreement P(a): we want the agreement to be better than chance

  κ = (P(a) − P(e)) / (1 − P(e))

- rule of thumb:
  - < 0.4: no good

(60)


example, continued: computing κ

- as we saw previously, the probability P(a) of agreement was 0.8
- if the annotators are independent, what is the probability that they agree by chance?
- that is, both select Y or both select N:

  P(e) = 0.9 · 0.9 + 0.1 · 0.1 = 0.82

- so we get

  κ = (0.8 − 0.82) / (1 − 0.82) ≈ −0.11

  i.e. the agreement is no better than chance
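
As a quick check (this sketch is not from the original slides), the same computation can be done in a few lines of Python from the Jane/Joe annotations above:

    from collections import Counter

    # Jane's and Joe's spam labels for the 10 emails (from the table above)
    jane = ["Y", "N", "N", "N", "N", "N", "N", "N", "N", "N"]
    joe  = ["N", "N", "N", "N", "N", "N", "N", "Y", "N", "N"]

    n = len(jane)
    p_a = sum(a == b for a, b in zip(jane, joe)) / n    # observed agreement: 0.8

    # chance agreement: probability that two independent annotators pick the same label
    freq_jane, freq_joe = Counter(jane), Counter(joe)
    p_e = sum((freq_jane[lab] / n) * (freq_joe[lab] / n) for lab in {"Y", "N"})  # 0.82

    kappa = (p_a - p_e) / (1 - p_e)
    print(round(kappa, 2))    # about -0.11: no better than chance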


(61)

if the κ value is low

- problems with the linguistic model?
- imprecise annotation manual?
  - have we described the common hard cases?
- have the annotators received proper instructions and training?
- is the task inherently hard?

(62)


variations on κ

- Cohen's κ can be used if we have two annotators
- Fleiss' κ: more than two annotators
- Krippendorff's α: different types of disagreements
- Artstein and Poesio (2008) give an overview of inter-annotator agreement measures
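
For practical use (an illustration not taken from the slides), NLTK's nltk.metrics.agreement module implements several of these coefficients; the data is given as (annotator, item, label) triples, here the Jane/Joe example again:

    from nltk.metrics.agreement import AnnotationTask

    triples = [("jane", str(i), lab) for i, lab in enumerate("YNNNNNNNNN")] + \
              [("joe",  str(i), lab) for i, lab in enumerate("NNNNNNNYNN")]

    task = AnnotationTask(data=triples)
    print(task.kappa())   # Cohen's kappa (pairwise for two annotators)
    print(task.alpha())   # Krippendorff's alpha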

(63)

inter-annotator agreement for non-categorical data

- κ and its cousins can be used if we are annotating categories
- but what if we annotate something more complex, e.g. a treebank or a dialogue corpus?
- much harder to define the notion of chance agreement in that case
- most annotation projects tend to use simple evaluation metrics, for instance precision/recall, attachment accuracy
- exception: Skjærholt (2014) presented a new IAA measure for syntax

(64)


next lecture: treebanks

- syntactically annotated corpora: treebanks
- and syntactic search tools

