
(1)

Information Retrieval - IR (Informationssökning)

Karin Friberg Heppin

Språkdata
Institutionen för svenska språket, Göteborgs Universitet

(2)

Example: Google (Web search)

(3)

Retrieving documents

• Goal = find documents in a large document collection that are relevant to an information need

[Diagram: an information need is formulated as a query; the IR system matches the query against the document collection (retrieval) and returns an answer list]

(4)

What is Information Retrieval?

Retrieval of unstructured data. Usually it is:
• Retrieval of text documents in big collections
• Web search

But it can also be:
• Image retrieval
• Video retrieval
• Music retrieval
• …

(5)

IR research ranges from…

… computer science - researching the framework, the modeling of relations between documents and queries …

… over linguistics - optimizing the language of document representation and queries …

… to information science - investigating the search behavior and preferences of the users.

(6)

Information retrieval

Information retrieval is about finding documents relevant to an information need among documents that are stored and indexed. This is done by posing a query to a search engine, which matches the terms used as search keys against the terms used to index the documents.

(7)

Information retrieval

The documents are stored in an index.
Queries are formulated - formal representations of information needs.
Words in the queries are matched against words in the documents (the index).
Output - a list of documents.
The better the match between the query and a document, the higher the document is ranked.

(8)

The goal of Information retrieval

• Separate relevant from non-relevant documents
• Rank relevant documents above non-relevant ones
• Rank highly relevant documents above less relevant documents

(9)

Relevance

The relevance of a document is a measure of how well the document satisfies the user's information need

- A relevant document contains the information that a person was looking for when they submitted a query to the search engine

- Many factors influence a person’s decision about what is relevant: e.g., task, context, novelty, style

- Topical relevance (same topic) vs. user relevance (what is useful for the user)

- Relevance may be binary: relevant/irrelevant or have a multilevel scale: e.g. 0-3

(10)

What IR is usually not about

Not about structured data

Retrieval from databases is usually not considered IR

• Queries to databases assume that the data is in a standardized format

Database queries have a right answer

- How much money did you make last year?

IR problems usually don’t

- Find all documents relevant to “hippos in a zoo”

(11)

Databases vs. IR

What we're retrieving
- Databases: Structured data. Clear semantics based on a formal model.
- IR: Mostly unstructured. Free text with some metadata.

Queries we're posing
- Databases: Formally (mathematically) defined queries. Unambiguous.
- IR: Vague, imprecise information needs (often expressed in natural language).

Results we get
- Databases: Exact. Always correct in a formal sense.
- IR: Sometimes relevant, often not.

(12)

What to study in IR

• Document and query indexing
  - How to best represent their contents?
• Query evaluation (or retrieval process)
  - To what extent is a document relevant to a query?
• System evaluation
  - How good is a system?
  - Are the retrieved documents relevant? (precision)
  - Are all the relevant documents retrieved? (recall)

(13)

How do we find the documents?

Possible approaches:

1. String matching (linear search in documents)
   - Slow
   - Difficult to improve
   - Like browsing through a whole book
2. Indexing
   - Fast
   - Flexible to further improvement
   - Like consulting the index at the back of the book

(14)

Word-Level Issues

• Morphological variation = different forms of the same concept
  - Inflectional morphology: same part of speech (break, broke, broken; sing, sang, sung; etc.)
  - Derivational morphology: different parts of speech (destroy, destruction; invent, invention, reinvention; etc.)
• Synonymy = different words, same meaning ({dog, canine, doggy, puppy, etc.} → concept of dog)
• Polysemy = same word, different meanings
  - Bank: financial institution or side of a river?
  - Crane: bird or construction equipment?
  - Is: depends on what the meaning of "is" is!

(15)

Text operations

Tokenization - turn a sequence of characters into a sequence of words

Remove punctuation and digits

Stop word list - remove common function words

Find multi-word phrases that should be treated as one word

Stemming/lemmatization - merge different inflected and derived forms into a stem or base form

(16)

Tokenization

Tokenization: separating text into terms

Word separators: Space - most common separator

But it is not so easy
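A minimal sketch of whitespace-and-punctuation tokenization in Python (illustrative only; it already mishandles the problem cases on the following slides):

```python
import re

def tokenize(text):
    """Lower-case the text and split it into terms.

    Deliberately simple: any run of non-alphanumeric characters acts as a
    separator, which breaks apostrophes, hyphens, and acronyms apart.
    """
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

print(tokenize("I'll send you Luca's book"))
# ['i', 'll', 'send', 'you', 'luca', 's', 'book']
```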

(17)

What is a word / token?

I'll send you Luca's book

IBM360, IBM-360, ibm 360

Richard Brown brown paint

flowerpot flower-pot flower pot

(18)

Capitalized words can have different meaning from lower case words

Bush, Apple

But they can also represent the same word: Horse, horse, HORSE

Apostrophes can be a part of a word, a part of a possessive, or just a mistake

rosie o'donnell, can't, don't, 80's, 1890's men's straw hats, master's degree

england's ten largest cities, shriner's

Tokenizing Problems

(19)

Numbers are vague

1992 - year or number?

Numbers can be important, including decimals:

nokia3250, top 10 courses, united 93, B-52, 7-eleven

Periods can occur in numbers,

abbreviations, URLs, ends of sentences, and other situations

I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

Tokenizing Problems

(20)

Small words can be important in some queries, usually in combinations:
xp, ma, pm, el paso, j lo, world war II

Both hyphenated and non-hyphenated forms of many words are common

Sometimes the hyphen is not needed:
e-bay, wal-mart, active-x, cd-rom, t-shirts

At other times, hyphens should be considered either as part of the word or as a word separator:
winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking

Tokenizing Problems

(21)

Many morphological variations of words

In most cases, these have the same or very similar meanings

Stemmers attempt to reduce morphological variations of words to a common stem

Usually involves removing suffixes

The alternative is to use several variations of the words in the queries

Stemming/lemmatization

(22)

Stemming vs lemmatization

Lemmatization:
- Dictionary-based: uses lists of related words
- Produces morphologically valid units

Stemming:
- Algorithmic: uses a program to determine related words
- The product is not necessarily a morphologically valid unit
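A small illustration of the contrast using NLTK (an assumption: nltk is installed and its WordNet data downloaded; this is one possible toolkit, not the one used in the course):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
# Setup assumed: pip install nltk, then nltk.download('wordnet')

stemmer = PorterStemmer()          # algorithmic suffix stripping
lemmatizer = WordNetLemmatizer()   # dictionary-based lookup

for word in ["studies", "running", "better"]:
    print(word,
          "stem:", stemmer.stem(word),
          "lemma:", lemmatizer.lemmatize(word, pos="v"))
```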

(23)

Over-stemming/under-stemming

Over-stemming - errors of commission: incorrectly lumps unrelated terms together
- doe/doing
- execute/executive
- ignore/ignorant

Under-stemming - errors of omission: fails to lump related terms together
- create/creation
- europe/european
- cylinder/cylindrical

(24)

Effect of stemming/lemmatization

Generally a small but significant effectiveness improvement

Crucial for some languages: 5-10% improvement for English, up to 50% in Arabic

Example of mean relative improvement due to stemming:
+4% English, +4% Dutch, +7% Spanish, +9% French, +15% Italian, +19% German, +29% Swedish, +34% Bulgarian, +40% Finnish, +44% Czech

(25)

Stemmer Comparison

(26)

Full text index or not

Undesirable tokens can be eliminated:
- Non-content-bearing tokens
- Special characters
- Numbers: dates, amounts…
- Very short or very long tokens, ...

Not as important anymore, as storage space is cheap

(27)

Stopping

Removed:
- Function words (determiners, prepositions), which have little meaning on their own
- Words with high occurrence frequencies

Reason:
- To reduce index space, improve response time, improve effectiveness

Downside:
- Stopwords can be important in combinations: "to be or not to be"

(28)

Example of a stopword list

a, about, an, are, as, at, be, by, com, for, from, how, i, in, is, it, of, on, or, that, the, this, to, was, what, when, where, who, will, with, www
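A minimal sketch of stopping applied to a token stream (the list above as a Python set; illustrative only):

```python
STOPWORDS = {"a", "about", "an", "are", "as", "at", "be", "by", "com", "for",
             "from", "how", "i", "in", "is", "it", "of", "on", "or", "that",
             "the", "this", "to", "was", "what", "when", "where", "who",
             "will", "with", "www"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))
# ['not'] -- nearly the whole phrase disappears, the downside noted on the Stopping slide
```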

(29)

Normalization

Diacritics differ from one language to another: résumé, Äpfel

Often they distinguish the meaning: tache (spot, stain) vs. tâche (task)

Spelling of named entities may change with languages:
Gorbachev ↔ Gorbacheff ↔ Gorbachov
Mona Lisa ↔ La Joconde ↔ La Gioconda

(30)
(31)

Problems

• Synonymy - different words, same meaning – Effect: lower recall

• Polysemy - same word, different meanings – Effect: lower precision

dog, canine, doggy, puppy → concept of dog

Bank: financial institution or side of a river?

Crane: bird or construction equipment?

(32)

Paraphrase

Who killed Abraham Lincoln?

(1) John Wilkes Booth killed Abraham Lincoln.

(2) John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln’s life.

When did Wilt Chamberlain score 100 points?

(1) Wilt Chamberlain scored 100 points on March 2, 1962 against the New York Knicks.

(2) On December 8, 1961, Wilt Chamberlain scored 78 points in a triple overtime game. It was a new NBA record, but Warriors coach Frank McGuire didn’t expect it to last long, saying, “He’ll get 100 points someday.” McGuire’s prediction came true just a few months later in a game against the New York Knicks on March 2.

Language provides different ways of saying the same thing

(33)

Ambiguity in Action

• Different documents with the same keywords may have different meanings…

What do frogs eat?

(1) Adult frogs eat mainly insects and other small animals,

including earthworms, minnows, and spiders.

(2) Alligators eat many kinds of small animals that live in or near the water, including fish,

snakes, frogs, turtles, small mammals, and birds.

(3) Some bats catch fish with their claws, and a few species eat lizards, rodents, small birds, tree frogs, and other bats.

keywords: frogs, eat

What is the largest volcano in the Solar System?

(1) Mars boasts many extreme geographic features; for example, Olympus Mons is the largest volcano in the solar system.

(2) The Galileo probe's mission to Jupiter, the largest planet in the Solar system, included amazing photographs of the volcanoes on Io, one of its four most famous moons.

(3) Even the largest volcanoes found on Earth are puny in comparison to others found around our own cosmic backyard, the Solar System.

keywords: largest, volcano, solar, system

(34)

Information retrieval

(35)

Document (example, in Swedish)

<DOC>

docid: gp1994-3

<TEXT>

Magasinera TV ! Ett förslag i frågetramsets tid

B ARNEN SÄGER att det var synd om TV-tittarna på nollkanalens tid då det bara fanns en kanal .

Inget att välja på .

Man försöker då förklara för dem att det fortfarande inte finns några valmöjligheter .

Programutbudet är nämligen sådant att det alltid bara finns en enda möjlig kanal eftersom programmen i de båda övriga ( jag har tre ; ett , två och fyra ) är fullkomligt outhärdliga ,

Loket och Oldsberg eller Det kommer merar eller frågetävling , frågetävling eller frågtävling .

</TEXT> </DOC>

(36)

Indexes

• Indexes are data structures designed to make search faster
• Text search has unique requirements, which leads to unique data structures
• The most common data structure is the inverted index - a general name for a class of structures - 'inverted' because documents are associated with words, rather than words with documents

(37)

Inverted index construction (Sec. 1.2)

[Diagram: documents to be indexed ("Friends, Romans, countrymen.") → Tokenizer → token stream (Friends Romans Countrymen) → Linguistic modules → modified tokens (friend roman countryman) → Indexer → inverted index (a postings list of document ids for each of friend, roman, countryman)]
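A minimal sketch of this pipeline in Python (the lower-casing and trailing-"s" stripping stand in for the linguistic modules; real systems use proper stemming or lemmatization):

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping doc id -> text. Returns term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-zåäö0-9]+", text.lower()):  # tokenizer
            term = token.rstrip("s")          # crude stand-in for the linguistic modules
            index[term].add(doc_id)           # indexer: record the posting
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({1: "Friends, Romans, countrymen.",
                              2: "So are they all, all honourable men."})
print(index["friend"])   # [1]
```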

(38)


Four documents to index

(39)

Simple inverted index

(40)

Inverted index with counts

• supports better ranking algorithms

(41)

Inverted index with positions

• supports proximity matches

(42)

Proximity Matches

• Matching phrases or words within a window
  - e.g., 'tropical fish', or 'find tropical within 5 words of fish'
• Word positions in inverted lists make these types of query features efficient
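A sketch of a positional inverted index and a window-based proximity check (illustrative; normalization omitted and the document is invented):

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs: dict doc id -> text. Returns term -> {doc id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def within_window(index, t1, t2, k, doc_id):
    """True if t1 and t2 occur within k words of each other in doc_id."""
    return any(abs(p1 - p2) <= k
               for p1 in index[t1].get(doc_id, [])
               for p2 in index[t2].get(doc_id, []))

idx = build_positional_index({1: "tropical fish live in tropical aquariums"})
print(within_window(idx, "tropical", "fish", 5, 1))   # True
```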

(43)

Boolean Retrieval

• Weights assigned to terms are either "0" or "1"
  - "0" represents "absence": term isn't in the document
  - "1" represents "presence": term is in the document
• Build queries by combining terms with Boolean operators
  - AND, OR, NOT
• The system returns all documents that satisfy the query

Why do we say that Boolean retrieval is "set-based"?
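Because each term's posting list is a set of documents, and AND, OR and NOT map to set intersection, union and complement. A sketch with an invented three-term toy index:

```python
# Toy inverted index: term -> set of documents containing the term
index = {"brutus": {1, 2, 4},
         "caesar": {1, 2, 3, 4},
         "calpurnia": {2}}
all_docs = {1, 2, 3, 4}

def posting(term):
    return index.get(term, set())

# brutus AND caesar AND NOT calpurnia
result = posting("brutus") & posting("caesar") & (all_docs - posting("calpurnia"))
print(sorted(result))   # [1, 4]
```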

(44)

AND/OR/NOT

[Venn diagram: sets A, B and C of documents within the set of all documents, illustrating AND (intersection), OR (union) and NOT (complement)]

(45)

Boolean queries: Exact match

• In the Boolean retrieval model a query is a Boolean expression:
  - Boolean queries use AND, OR and NOT to join query terms
• Views each document as a set of words
• Is precise: a document either matches the condition or it does not
  - Perhaps the simplest model to build an IR system on
• Primary commercial retrieval tool for 3 decades
• Many search systems you still use are Boolean:
  - Email, library catalog, Mac OS X Spotlight

Sec. 1.3

(46)

Example of a search topic (in Swedish)

<num> 033 </num>

<SV-title> Cancergenetik </SV-title>

<SV-desc> Sök dokument som behandlar nyligen gjorda upptäckter om sambandet mellan gener och cancer

</SV-desc>

<SV-narr> Man har nyligen upptäckt den gen som har samband med vissa tumörer. Alla dokument som innehåller information om detta nya

diagnosinstrument är relevanta. </SV-narr>

</top>

(47)

Query expansion

Baseline
#sum(upptäckt samband gen cancer)

Inflections
#sum(#syn(upptäckt upptäckten upptäckter) #syn(samband sambandet) #syn(gen genen gener) #syn(cancer cancern))

Inflections + synonyms
#sum(#syn(upptäckt upptäckten upptäckter) #syn(samband sambandet orsak orsaka) #syn(gen gener arvsmassa arvsmassan) #syn(cancer cancern tumör tumörer))

(48)

The result for baseline query 1

1 Q0 ntdk-0077 1 -6.68144 indri 1 Q0 nmdc-0652 2 -6.76833 indri 1 Q0 ltxx-1297 3 -6.813 indri 1 Q0 ntdk-0074 4 -6.88695 indri 1 Q0 ltxx-2525 5 -6.94228 indri 1 Q0 svrd-1865 6 -6.9971 indri 1 Q0 mdll-0226 7 -7.01527 indri 1 Q0 pfzr-0047 8 -7.02031 indri 1 Q0 nmdc-0937 9 -7.07394 indri 1 Q0 vrdg-0434 10 -7.12576 indri 1 Q0 svdx-0089 11 -7.13687 indri 1 Q0 svdx-0087 12 -7.15393 indri 1 Q0 svrd-0614 13 -7.19322 indri 1 Q0 svrd-1443 14 -7.19562 indri 1 Q0 nmdc-0843 15 -7.20493 indri 1 Q0 flkb-0015 16 -7.23677 indri 1 Q0 nmdc-0621 17 -7.29184 indri 1 Q0 fyss-0025 18 -7.30071 indri 1 Q0 bmsx-2066 19 -7.30643 indri 1 Q0 bmsx-0044 20 -7.31138 indri 1 Q0 ltxx-1087 21 -7.34271 indri 1 Q0 ltxx-3768 22 -7.4061 indri 1 Q0 svrd-1867 23 -7.43562 indri 1 Q0 dnxx-0683 24 -7.51783 indri 1 Q0 dagm-0549 25 -7.54264 indri 1 Q0 ltxx-1255 26 -7.57046 indri 1 Q0 bmsx-0021 27 -7.57146 indri 1 Q0 frsk-0461 28 -7.5796 indri 1 Q0 svrd-1862 29 -7.58203 indri 1 Q0 bmsx-2305 30 -7.58903 indri 1 Q0 nmdc-0086 31 -7.59473 indri 1 Q0 ntdk-1325 32 -7.62016 indri 1 Q0 nmdc-0392 33 -7.63131 indri 1 Q0 lkbk-0015 34 -7.63304 indri

1 Q0 gpxx-0023 35 -7.64715 indri 1 Q0 svrd-1866 36 -7.66673 indri 1 Q0 fyss-0002 37 -7.67032 indri 1 Q0 ltxx-0478 38 -7.67949 indri 1 Q0 elli-0015 39 -7.69401 indri 1 Q0 ltxx-0943 40 -7.69744 indri 1 Q0 svrd-1096 41 -7.70263 indri 1 Q0 mdll-0185 42 -7.70869 indri 1 Q0 frsk-0262 43 -7.71503 indri 1 Q0 frsk-0477 44 -7.71505 indri 1 Q0 frsk-0497 45 -7.72518 indri 1 Q0 dagm-0552 46 -7.72721 indri 1 Q0 elli-0017 47 -7.73269 indri 1 Q0 elli-0012 48 -7.73473 indri 1 Q0 bmsx-2006 49 -7.73993 indri 1 Q0 nmdc-0144 50 -7.74181 indri 1 Q0 schr-0074 51 -7.74259 indri 1 Q0 dnxx-0184 52 -7.77071 indri 1 Q0 nmdc-0306 53 -7.77286 indri 1 Q0 elli-0013 54 -7.7747 indri 1 Q0 ltxx-2526 55 -7.78337 indri 1 Q0 ntdk-2403 56 -7.78677 indri 1 Q0 dagm-0645 57 -7.78689 indri 1 Q0 nmdc-0803 58 -7.7875 indri 1 Q0 ltxx-1043 59 -7.78828 indri 1 Q0 schr-0049 60 -7.78855 indri 1 Q0 ltxx-0944 61 -7.79202 indri 1 Q0 schr-0043 62 -7.79536 indri 1 Q0 dagm-1070 63 -7.79913 indri 1 Q0 ltxx-2514 64 -7.80579 indri 1 Q0 bmsx-1981 65 -7.81331 indri 1 Q0 ntdk-0677 66 -7.82148 indri 1 Q0 mrck-0062 67 -7.82615 indri

1 Q0 mdll-0205 68 -7.84119 indri 1 Q0 bmsx-0035 69 -7.84249 indri 1 Q0 mrck-0087 70 -7.84605 indri 1 Q0 ntdk-2710 71 -7.84657 indri 1 Q0 mrck-0050 72 -7.8479 indri 1 Q0 dagm-0742 73 -7.85168 indri 1 Q0 whpl-0930 74 -7.85455 indri 1 Q0 schr-0040 75 -7.86163 indri 1 Q0 schr-0061 76 -7.86744 indri 1 Q0 bmsx-2140 77 -7.8778 indri 1 Q0 mrck-0086 78 -7.88484 indri 1 Q0 ntdk-2471 79 -7.88642 indri 1 Q0 bmsx-2863 80 -7.89441 indri 1 Q0 pfzr-0054 81 -7.90202 indri 1 Q0 mrck-0074 82 -7.90292 indri 1 Q0 mdll-0361 83 -7.90519 indri 1 Q0 schr-0042 84 -7.90544 indri 1 Q0 abxx-0103 85 -7.91486 indri 1 Q0 svrd-0113 86 -7.91705 indri 1 Q0 dnxx-0105 87 -7.91795 indri 1 Q0 mrck-0063 88 -7.92239 indri 1 Q0 schr-0044 89 -7.92376 indri 1 Q0 bmsx-0045 90 -7.93384 indri 1 Q0 svrd-0032 91 -7.94811 indri 1 Q0 ltxx-3941 92 -7.95155 indri 1 Q0 bmsx-0020 93 -7.95468 indri 1 Q0 ltxx-0918 94 -7.95792 indri 1 Q0 nmdc-0713 95 -7.96319 indri 1 Q0 nmdc-0729 96 -7.9632 indri 1 Q0 dnxx-0101 97 -7.97049 indri 1 Q0 frsk-0575 98 -7.97514 indri 1 Q0 mdll-0057 99 -7.9756 indri 1 Q0 dnxx-0360 100 -7.97735 indri

(49)

Key word based queries

The result of word queries for most retrieval models is the set of documents containing at least one of the words of the query

The resulting documents are ranked according to their degree of similarity with respect to the query

To support ranking, two statistics on word occurrences inside texts are commonly used

The first is called term frequency and counts the number of times a word appears inside a document

The second is called inverse document frequency and is based on the number of documents in which a word appears

(50)

Formal Information need / Topic

<TOP>

<TOPNO>1</TOPNO>

<TITLE> HDL och LDL </TITLE>

<DESC> Hur påverkar en fettsnål, energifattig kost

koncentrationerna av HDL och LDL i blodet? </DESC>

<NARR> Relevanta dokument ska innehålla information om

lipoproteinerna HDL och LDL och deras funktion, samt beskriva hur en kostförändring med fettsnål, energifattig kost påverkar koncentrationen av dessa i blodet.

</NARR>

</TOP>

(51)

Queries

A query is the formulation of a user information need put to the system

Keyword based queries are popular, since they are intuitive, easy to express, and allow for fast ranking

However, a query can also be a more complex combination of operations using different kinds of operators

(52)

Query in the Indri/Lemur query language

#combine(

#syn( fettsnål energifattig )

#syn( kost diet föda dricka kosthållning kostförändring)

#syn( koncentration halt)

#syn( ldl kolesterol lågdensitetslipoprotein kolesterolester #uw5(lätt lipoprotein))

#syn( hdl kolesterol högdensitetslipoprotein kolesterolester lipoprotein )

#syn( blod ) )

(53)

Facets

#combine(

#syn( fettsnål energifattig )

#syn( kost diet föda dricka kosthållning kostförändring)

#syn( koncentration halt)…

Corresponds to the Boolean statement:

(fettsnål OR energifattig) AND (kost OR diet OR föda OR dricka OR kosthållning OR kostförändring) AND (koncentration OR halt) …

(54)

Facets

[Diagram: three facets, each a column of terms joined by OR - (fettsnål OR energifattig), (kost OR diet OR föda OR dricka OR kosthållning OR kostförändring), (koncentration OR halt) - with the facets joined by AND]

(55)

Evaluation

For search, evaluation measures based on relevance are used

- For example: recall, precision, average precision at different recall levels
- Recall is the proportion of relevant documents retrieved
- Precision is the proportion of retrieved documents that are relevant

(56)

Comparing queries

• Baseline
  #sum(upptäckt samband gen cancer)
• Inflections
  #sum(#syn(upptäckt upptäckten upptäckter) #syn(samband sambandet) #syn(gen genen gener) #syn(cancer cancern))
• Inflections + synonyms
  #sum(#syn(upptäckt upptäckten upptäckter) #syn(samband sambandet orsak orsaka) #syn(gen gener arvsmassa arvsmassan) #syn(cancer cancern tumör tumörer))

(57)

Zipf's Law

Distribution of word frequencies is very skewed:
a few words occur very often, many words hardly ever occur

The two most common words, "the" and "of", make up about 10% of all word occurrences in English text documents

(58)

Resolving power of terms

(59)

Terms in the MedEval test collection

(60)

Efficient and non-efficient terms in MedEval

(61)

tf*idf weighting scheme

• tf = term frequency
  - frequency of a term/keyword in a document
  - The higher the tf, the higher the importance (weight) for the document
• df = document frequency
  - number of documents containing the term
  - the distribution of the term
• idf = inverse document frequency
  - the unevenness of the term's distribution in the corpus
  - the specificity of the term to a document
  - The more evenly the term is distributed, the less specific it is to a document

weight(t,D) = tf(t,D) * idf(t)
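A small sketch of this weighting in Python (the slide does not fix a particular idf formula; log(N / df) below is one common choice, assumed here for illustration):

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: dict doc_id -> list of terms. Returns doc_id -> {term: tf*idf weight}."""
    n_docs = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

docs = {1: ["cancer", "gene", "cancer"], 2: ["gene", "therapy"]}
print(tf_idf_weights(docs)[1]["cancer"])   # 2 * ln(2/1) ≈ 1.386
```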

(62)

Recall/precision

The Cranfield experiments culminated in the modern metrics of precision and recall

Recall ratio: the fraction of relevant documents retrieved

Precision ratio: the fraction of documents retrieved that are relevant

(63)

Why Ranked Retrieval?

• Arranging documents by relevance is
  - Closer to how humans think: some documents are "better" than others
  - Closer to user behavior: users can decide when to stop reading
• Best (partial) match: documents need not have all query terms
  - Although documents with more query terms should be "better"
• Easier said than done!

(64)

How good are the retrieved docs?

• Precision : Fraction of retrieved docs that are relevant to user’s information need

• Recall : Fraction of relevant docs in collection that are retrieved

• More precise definitions and measurements to follow in later lectures


Sec. 1.1

(65)

Effectiveness

precision = |retrieved relevant documents| / |retrieved documents|

recall = |retrieved relevant documents| / |(known) relevant documents|
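A hypothetical worked example (numbers invented for illustration): if the collection contains 100 known relevant documents and the system retrieves 60 documents, 40 of which are relevant, then

```latex
\text{precision} = \frac{40}{60} \approx 0.67,
\qquad
\text{recall} = \frac{40}{100} = 0.40
```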

(66)

Find black cats

[Illustration: two retrieval strategies for the query "black cats", with different precision/recall trade-offs - which is better depends on what you want]

(67)

Ranked Retrieval

• Order documents by how likely they are to be relevant to the information need

– Present hits one screen at a time

– At any point, users can continue browsing through ranked list or reformulate query

• Attempts to retrieve relevant documents directly, not merely provide tools for doing so
