Research on historical data at Språkbanken
Dana Dannélls
Språkbanken
Department of Swedish Language, University of Gothenburg
Språkbanken Kick-off meeting 2015-01-27
Digital historical texts
• Digitization of historical texts aims at preserving cultural heritage and making it more accessible.
• Ongoing digitization initiatives seek to create digital text resources which can be searched and processed by machines.
• Together with the availability of historical text resources in digital form, there is a growing interest in applying NLP methods and tools to historical texts.
Digitized historical Swedish texts
• Old Swedish (fornsvenska)
  • starts with Latin-script manuscripts of law texts (around 1225)
  • ends with the (pre-)publication of the Gustav Vasa Bible (1526)
• Modern Swedish (nysvenska)
  • Early Modern Swedish (äldre nysvenska)
    • starts from the Gustav Vasa Bible (1526)
    • ends with Olof v. Dalin’s Then Swänska Argus (1732)
  • Late Modern Swedish (yngre nysvenska)
    • starts from 1732
    • ends with August Strindberg’s Röda rummet (1879)
Material includes codices, legal manuscripts, religious prose, poetry, non-fiction, letters, etc.
Characterization of historical texts
• lack of standardized orthography
• old words and word forms
• spelling variation
• grammar change: inflectional complexity, word order, subordination
• OCR errors
Problems for LT: limited resources, lack of annotated material, lack of grammar and morphological descriptions, no native speakers.
Old Swedish
Han beddisalmosoaf petro ok johanne Tha han saa themtil byriäat inga jmönstrit Petrussagde tilhans Jak hafuir ey gul ällär silfuervtan thz somiakhafuir gifuirjak thik jihesu christi nazareni nampn stat vp okgak Ok ginstan grep sanctus petrus hans höghre hand ok vplypte han ok ämbratfestos hans sinor ok fötir ok han sprang vp stodh ok gik in j mönstrit medhär thöm gangande ok springande oklowande gudh okalt folkit saa han gangande oklofuandegudh ok kiändohanathanvar then sami som saat formönstersinsport thiggiande almoso ok vndradho mykit a thz somhonomvarhänt ok then tidhthe hioldopetrumociohannem lop alt folkit til therävndrande
Early Modern Swedish
Ingen lärer kunna neka, at ju sådane Skriffter hafwa stor nytta med sig, som, på ett angenämt och lustigt sätt, föreställa Lärdomar och Wettenskaper; Derföre hafwa och de gamla, under roliga Dikter, liufliga Samtahl eller nöysamma Historier, underwisat Folket om Dygden, och likasom skiämtewijs förehållit dem alfwarsamma Sede-Läror. I nyare tider, och än i dag, se wi äfwen, hos kloka Nationer, sådane Skriffter med mycken nytta utgifwas och älskas. Men fast än hwarken de gamla sådane Läro-sätt skulle älskat eller nyare frägdade Folckeslag dem älska, så wete wi dock at Hof-smaken . . .
Digital image (Then Swänska Argus)
Exploring sentence boundaries
Thaa kesarinnan fik höra ath then wnga herrän war kommen thaa tilreddhe hon sigh mz iomfrum och klädom som hon aldra bäst kunne och kom gaangande til konungen och tilhans son ther the saatho baaden til sammankkonungen bad hänne sätia sig när sonnenk kesarinnan sade til konungen herra är tättha edher son som saa länghe hafuer borto waridh när the wisa mästarakkonungen sade ya män jak kan ekke wettha hurw thz gaar til thy han wil inthe tala,kthaa sadhe hon herra antuarden honnom mik jak skal wil wäl göra honnom talande och togh honom widh haandena och wille hafuan mz sigh,kthaa warde han sig och wille ekke mz,kfadren bad honom gaa mz hännek thaa negh han sinom fader ödhmyuklighan och war honom lydogerk kesarinnan ledde honom in j en kammara och badh alla wtgaa och satte honnom oppaa en sänga stok när sigh och sadhe, hiärtans käre
diocleciane, (Sju vise mästare C, 1492)
Machine learning with HaCOSSA
Sentence annotation:
• Hamburg Corpus of Old Swedish with Syntactic Annotations (Höder 2011): annotation for clauses
• Construct sentence-like annotation: 8k sentences in 137k words
• Average sentence length: 16.5 tokens
Sentence as a sequence of tags:
• S|L0(L1M∗)?R
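The tag pattern above can be read as a regular expression over per-token tags. A minimal sketch, assuming that S marks a one-token sentence and that L0, L1, M and R mark the first, second, middle and last tokens of a longer sentence. These readings, the single-character encoding and the name `sentence_spans` are assumptions made for illustration, not details from the slides.

```python
import re

# Assumed one-character encoding of the per-token tags, so that a position
# in the encoded string equals a token index.
TAG_TO_CHAR = {"S": "S", "L0": "a", "L1": "b", "M": "m", "R": "r"}

# S | L0 (L1 M*)? R, written over the encoded characters.
SENTENCE = re.compile(r"S|a(?:bm*)?r")


def sentence_spans(tags):
    """Split a tag sequence into (start, end) token spans, one per sentence."""
    encoded = "".join(TAG_TO_CHAR[tag] for tag in tags)
    spans = []
    position = 0
    while position < len(encoded):
        match = SENTENCE.match(encoded, position)
        if match is None:
            raise ValueError(f"tag sequence is not well-formed at token {position}")
        spans.append((match.start(), match.end()))
        position = match.end()
    return spans


if __name__ == "__main__":
    print(sentence_spans(["L0", "L1", "M", "M", "R", "S", "L0", "R"]))
```

Constraining the tagger output to this pattern guarantees that every predicted tag sequence decodes into complete sentences.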
Sentence boundaries evaluation
All features, 10-fold: precision 82.9, recall 66.4
All features, leave-one-out: precision 76.0, recall 58.2
Lexical information helps, but its generalization is difficult
Simple spelling normalization helps, especially precision in the leave-one-out setting (more than +5%)
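For reference, precision and recall over sentence boundaries can be computed as below. This is a minimal sketch that assumes boundaries are compared as sets of token positions. The slides do not say how the scores were computed, and the function name and toy positions are hypothetical.

```python
def boundary_precision_recall(predicted, gold):
    """Precision and recall of predicted boundary positions against gold ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy token positions, not data from the evaluation on the slide.
    gold_boundaries = [5, 12, 20]
    predicted_boundaries = [5, 20, 27, 31]
    print(boundary_precision_recall(predicted_boundaries, gold_boundaries))
```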
Spelling variation
o → u: 0.2 arvuþi ærvoþi;
æ → e: 0.27 ær er;
au → ö: 0.31 barnlös barnalös barnalaus;
pt → ft: 0.42 apter after æftær;
gg# → g#: 0.43 væg vægg vegg;
þer → n: 0.44 maþer man;
th → þ: 0.44 oþolskipti othalskipte;
mp → m: 0.45 hamn hampn;
eli → li: 0.45 lastelika lastlika;
ghi → i: 0.62 aplöia opplöghia.
(Ahlberg & Bouma 2012)
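One way to use weighted rules like these is to expand a word into spelling variants and keep the cheapest way of reaching each variant. The sketch below is an illustration under assumptions, not the method of the cited work: rules are applied left to right only, costs are summed, `#` marks the end of the word, and `variants` and `max_cost` are hypothetical names.

```python
import heapq

# (left-hand side, right-hand side, weight), copied from the list above.
RULES = [
    ("o", "u", 0.2), ("æ", "e", 0.27), ("au", "ö", 0.31), ("pt", "ft", 0.42),
    ("gg#", "g#", 0.43), ("þer", "n", 0.44), ("th", "þ", 0.44),
    ("mp", "m", 0.45), ("eli", "li", 0.45), ("ghi", "i", 0.62),
]


def variants(word, max_cost=1.0):
    """Return {variant: cheapest total cost} for variants within max_cost."""
    start = word + "#"  # explicit end-of-word marker
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        cost, form = heapq.heappop(heap)
        if cost > best[form]:
            continue  # a cheaper route to this form was already expanded
        for lhs, rhs, weight in RULES:
            new_cost = cost + weight
            if new_cost > max_cost:
                continue
            index = form.find(lhs)
            while index != -1:
                candidate = form[:index] + rhs + form[index + len(lhs):]
                if new_cost < best.get(candidate, float("inf")):
                    best[candidate] = new_cost
                    heapq.heappush(heap, (new_cost, candidate))
                index = form.find(lhs, index + 1)
    return {form[:-1]: cost for form, cost in best.items()}


if __name__ == "__main__":
    for form, cost in sorted(variants("vægg").items(), key=lambda item: item[1]):
        print(f"{form}\t{cost:.2f}")
```

The generated variants could then be looked up in a lexicon, with the summed cost used to rank competing matches.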
Spelling variation evaluation
Links up 96% of Fornsvenska Textbanken
Estimated correctness in top 3: 78%
Does not handle morphology: öknen → ökn.N
Problems with compound splitting: villhonnugh → vilder.A + hunagh.N
POS-tagging
Projecting syntactic information
[Figure: timeline from 1526 to 1873, marking Swart (1560) and Dalin (1732–1734)]
POS-tagging
Extending Hunpos with morphological information
• Idea: use the model for contemporary Swedish, and plug in an extra morphology for historical language.
• Swedberg and Dalin (Borin & Forsberg 2008)
+ komm NN → VB
+ doch UO → AB (foreign words)
+ ähra PM → NN (proper nouns)
- Stolpe NN → PM (proper/common nouns)
- moste VB → NN (errors in the morphology)
- wisa VB → NN (morphology not yet covering)
öfverväxa vb inf aktiv
öfvervälta vb inf aktiv
öfvervintra vb inf aktiv
öfvervinna vb inf aktiv
öfvervika vb inf aktiv
prisar vb pres sg ind aktiv
prisas vb pres sg ind s-form
prise vb pres pl ind p1 aktiv
prises vb pres pl ind p1 s-form
prisen vb pres pl ind p2 aktiv
prisens vb pres pl ind p2 s-form
prisa vb pres pl ind p3 aktiv
Morphology
Semi-supervised learning?
Can we make abstractions from known inflection tables?
Paradigm induction
Can we use these to predict inflection of unseen words?
Lexicon construction
(Ahlberg, Forsberg & Hulden 2014)
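The two questions above can be illustrated with a deliberately simplified sketch. It abstracts an inflection table into a list of suffixes around the longest common prefix, then reuses that pattern for an unseen word. The common-prefix abstraction is a stand-in chosen for brevity and is not the method of the cited work. The function names are hypothetical, and the generated forms are produced mechanically and are unverified.

```python
def common_prefix(forms):
    """Longest string that every form starts with."""
    prefix = forms[0]
    for form in forms[1:]:
        while not form.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def induce_paradigm(forms):
    """Abstract an inflection table into its suffixes (stem + suffix per slot)."""
    stem = common_prefix(forms)
    return [form[len(stem):] for form in forms]


def inflect(known_form, paradigm, slot=0):
    """Fill the whole table for a word, given one form and the slot it fills."""
    suffix = paradigm[slot]
    if not known_form.endswith(suffix):
        return None  # the paradigm does not fit this word
    stem = known_form[:len(known_form) - len(suffix)]
    return [stem + ending for ending in paradigm]


if __name__ == "__main__":
    # Present-tense forms of "prisa" as listed earlier in the slides.
    table = ["prisar", "prisas", "prise", "prises", "prisen", "prisens", "prisa"]
    paradigm = induce_paradigm(table)
    print(paradigm)
    # Hypothetical unseen word; the output is a prediction, not attested data.
    print(inflect("älskar", paradigm))
```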
Morphology
Paradigm induction
Morphology
Lexicon induction
Morphology evaluation
Modern languages:
Table accuracy of 76.50–98.00%
Form accuracy of 91.81–99.58%
Old Swedish: (Adesam et al. 2014)
Table accuracy of ∼54.00%
Fornsvenska reader
http://demo.spraakdata.gu.se/fsvreader/
Morfologilabbet
http://spraakbanken.gu.se/karp/morfologilabbet/
OCR error correction
• OCR Swedish text with high quality, independent of the quality of the print.
• OCRopus: open-source OCR engine, neural-network based
• Material:
  • Blackletter texts printed between approx. 1600 and 1800
  • Olof v. Dalin’s Swänska Argus (1732–1734, Stockholm)
Post-processing
Noisy channel model:
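$$\arg\max_{\mathit{orig}} \; p(\mathit{orig} \mid \mathit{ocr}) = \arg\max_{\mathit{orig}} \; \left[ p(\mathit{ocr} \mid \mathit{orig}) \cdot p(\mathit{orig}) \right]$$
• error model (EM) = p(ocr|orig)
• language model (LM) = p(orig)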
Kolak, O. (2005). OCR post-processing for low density languages. In Proceedings of human language technology conference and conference on empirical methods in natural language processing (HLT/EMNLP).
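The noisy channel decision rule can be sketched in a few lines. In this minimal illustration the candidate list, the probability tables and the name `best_correction` are all hypothetical; the numbers are toy values and not estimates from the material described in the slides.

```python
import math


def best_correction(ocr, candidates, error_model, language_model):
    """Pick the candidate orig that maximizes p(ocr|orig) * p(orig)."""
    def score(orig):
        # Sum of logs instead of a product, to avoid numerical underflow.
        return math.log(error_model[(ocr, orig)]) + math.log(language_model[orig])
    return max(candidates, key=score)


if __name__ == "__main__":
    # Toy probabilities, invented for the example.
    error_model = {("cller", "eller"): 0.05, ("cller", "celler"): 0.02}
    language_model = {"eller": 0.01, "celler": 0.0001}
    print(best_correction("cller", ["eller", "celler"], error_model, language_model))
```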
Error model
Estimates the probability that a certain transformation can occur to a string
ocr: apitel cller dcl är ey så wäl utarbetad,
orig: Capitel eller del är ey så wäl utarbetad,
ocr: dock hålst welat blifwa wid det sättet,
orig: dock hälst welat blifwa wid det sättet,
ocr: wårt Bärck.
orig: wårt Wärck.
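Aligned pairs like these can be turned into raw confusion counts, which is one possible starting point for estimating p(ocr|orig). The sketch below uses Python's `difflib.SequenceMatcher` for the character alignment. That choice and the name `count_confusions` are assumptions, since the slides do not say how the alignment was done.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_confusions(pairs):
    """Count (orig segment, ocr segment) differences over aligned line pairs."""
    counts = Counter()
    for orig, ocr in pairs:
        matcher = SequenceMatcher(None, orig, ocr, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # An empty string on either side means a deletion or insertion.
                counts[(orig[i1:i2], ocr[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    pairs = [
        ("Capitel eller del är ey så wäl utarbetad,",
         "apitel cller dcl är ey så wäl utarbetad,"),
        ("dock hälst welat blifwa wid det sättet,",
         "dock hålst welat blifwa wid det sättet,"),
        ("wårt Wärck.", "wårt Bärck."),
    ]
    for (orig_segment, ocr_segment), n in count_confusions(pairs).most_common():
        print(repr(orig_segment), "->", repr(ocr_segment), n)
```

Turning the counts into probabilities would additionally require normalizing by how often each original segment occurs, plus smoothing for unseen confusions.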
Language model
• trigram model: constructed from a training corpus
• unigram model: words and their frequencies (72,359 entries)
• word-based model: word lists constructed from the Swedberg and Dalin full-form dictionary (567,108 entries)
Post-processing evaluation
Argus (1732–1734): CER 11–15%, WER 40–50%
Mixed texts (1600–1800): CER 17–25%, WER 55–60%
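Character error rate (CER) and word error rate (WER) are both edit distances normalized by the length of the reference, computed over characters and over words respectively. A minimal sketch using a standard Levenshtein distance. The function names are illustrative, and the example line is one of the ocr/orig pairs shown earlier in the slides.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_unit in enumerate(reference, start=1):
        current = [i]
        for j, hyp_unit in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_unit != hyp_unit),   # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference."""
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = "wårt Wärck."
    ocr_output = "wårt Bärck."
    print("CER:", error_rate(reference, ocr_output))
    print("WER:", error_rate(reference.split(), ocr_output.split()))
```

A single wrong character counts once towards CER but makes the whole word wrong for WER, which is consistent with WER being much higher than CER in the figures above.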
OCR cloud
http://demo.spraakdata.gu.se/ocr/
Word Sense Changes
1. Automatically find word senses
• Unsupervised Word Sense Discrimination
2. Track the senses over time to find change
Experiments on The Times Archive, London (1785–1985)
First we had to correct OCR errors!
Increased number of clusters: on average 24% more clusters, 61% more clusters before 1815.
[Figure: time axis with points t1, t2, t3, t4, ti and tnow]
Evaluation
• Evolved sense (broadened/narrowed)
  • Personal computer, mobile phone, email
• Novel related sense
  • Music tape, computer mouse
• Novel unrelated sense
  • (new word, sense), e.g., Internet
  • (existing word, new sense), e.g., rock music
• Existing sense
  • Stone sense of rock
  • Deer, horse, …
Exp 1 counts recall in any unit
Exp 2 counts recall in correct form
95% recall in our units, 82% in correct form!
On average, 6.3–9.4 years to find change from first cluster evidence.
It takes 29–32 years to find change from definition.
Correct OCR errors
For the Kubhist data:
• Use a sliding window method to create a graph
• Cluster the graph using word sense discrimination
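The first of the two steps above can be sketched as follows: slide a fixed-size window over the token stream and add an edge for every pair of words that fall inside the same window. The window size, the weighting by co-occurrence count and the name `cooccurrence_graph` are assumptions. The clustering step is not shown, since the slides do not specify the algorithm.

```python
from collections import Counter, defaultdict


def cooccurrence_graph(tokens, window=3):
    """Undirected graph: graph[a][b] = number of windows where a and b co-occur.

    `window` is the number of consecutive tokens covered by one window.
    """
    graph = defaultdict(Counter)
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + window]:
            if right != left:  # no self-loops
                graph[left][right] += 1
                graph[right][left] += 1
    return graph


if __name__ == "__main__":
    # Toy token stream built from words in the example cluster below.
    tokens = "hvete korn rag hvete korn".split()
    for word, neighbours in cooccurrence_graph(tokens, window=2).items():
        print(word, dict(neighbours))
```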
Many spelling variations end up in the same cluster. Examples:
{Hvitliafre, hvithafpe, rag, hvete, svarthafre, hvithafro, hvithafre, korn, slipsten, ny, kora}
{Fianinon, planlnon, fotograflapparatcr, planinon, pnino, orafofoner, orgel, flaninon, fotografiapparate, rurafofoner, grafofoner, fotograflapparater}