
Textual information retrieval

An approach based on language modeling and neural networks

by

Apostolos A. Georgakis

Academic dissertation which, with due permission of the Vice-Chancellor's Office of Umeå University for the award of the degree of Doctor of Philosophy, will be publicly defended in N420, Naturvetarhuset, on Friday ("fredag"), 15 April 2004, at 13:00.

The dissertation will be defended in English.

Faculty opponent: Professor Timo Honkela, Helsinki University of Technology, Helsinki, Finland.

Department of Applied Physics and Electronics
Umeå University
Umeå, 2004


Organization: UMEÅ UNIVERSITY, Dept. of Applied Physics and Electronics, SE-901 87 Umeå, Sweden
Document name: DOCTORAL DISSERTATION
Date of issue: February 2004
Author: Apostolos A. Georgakis

Title: Textual information retrieval: An approach based on language modeling and neural networks

Abstract: This thesis covers topics relevant to information organization and retrieval. The main objective of the work is to provide algorithms that can elevate the recall-precision performance of retrieval tasks in a wide range of applications, ranging from document organization and retrieval to web-document pre-fetching and finally clustering of documents based on novel encoding techniques.

The first part of the thesis deals with the concept of document organization and retrieval using unsupervised neural networks, namely the self-organizing map, and statistical encoding methods for representing the available documents as numerical vectors. The objective of this section is to introduce a set of novel variants of the self-organizing map algorithm that address certain shortcomings of the original algorithm.

In the second part of the thesis the latencies perceived by users surfing the Internet are shortened with the usage of a novel transparent and speculative pre-fetching algorithm. The proposed algorithm relies on a model of behaviour for the user browsing the Internet and predicts his future actions. In modeling the user's behaviour the algorithm relies on the contextual statistics of the web pages visited by the user.

Finally, the last chapter of the thesis provides preliminary theoretical results along with a general framework on the current and future scientific work. The chapter describes the usage of the Zipf distribution for document organization and the usage of the adaboosting algorithm for the elevation of the performance of pre-fetching algorithms.

Key words: Self-organizing map, marginal median, Wilcoxon test, language modeling, pre-fetching, Zipf distribution, adaboost.

Language: English
ISBN: 91-7305-623-5
Number of pages: 176
Date: 20 February 2004

Signature: Apostolos A. Georgakis


Version 1.0

Title page: Replace "fredag" (Friday) with "torsdag" (Thursday).

Page 21: Replace

A. Georgakis, C. Kotropoulos, A. Xafopoulos and I. Pitas,

“Marginal median WEBSOM for information organization and retrieval”, submitted to Elsevier Neural Networks Journal, (accepted for publication).

The main content of this paper can be found in chapter 8. This paper offers a set of variants to the SOM algorithm based on the marginal median operator.

A. Georgakis, C. Kotropoulos, and I. Pitas, “A variant of the SOM algorithm for document organization and retrieval based on the Wilcoxon test”, submitted to IEEE Tr. on Neural Networks, (accepted for publication).

This paper is the basis for chapter 9. It offers a variant to the VSM and a metric that employs this variant and is based on the Wilcoxon non-parametric statistical test in an effort to elevate the performance of the SOM algorithm.

A. Georgakis and H. Li, “User behaviour modeling and content based speculative web page retrieval”, submitted to IEEE Tr. on Knowledge and Data Engineering, (accepted for publication).

This paper offers the groundwork for chapter 10. It offers a pre-fetching algorithm for assisting an end-user surfing the Internet by pre-downloading web pages that have higher probability to be requested by the user.


with

“Marginal median WEBSOM for information organization and retrieval”, submitted to Elsevier Neural Networks Journal, in press.

This paper is the basis for chapter 8. It offers a set of variants of the basic SOM algorithm that are based on the marginal median operator. The development of the variants is motivated by the SOM algorithm's poor robustness properties in the presence of outliers. In the first variant, the real-valued data of the input space are quantized into 256 quantization levels, whereas the second variant avoids any quantization error.

A. Georgakis, C. Kotropoulos, and I. Pitas, “A variant of the SOM algorithm for document organization and retrieval based on the Wilcoxon test”, submitted to IEEE Tr. on Neural Networks.

This paper is the basis for chapter 9. It offers a novel modeling method that belongs to the general framework of Salton's VSM. However, it proposes a new encoding variant that provides an ordering of the document bigrams, instead of document terms, based on their frequencies observed in the training corpus. Moreover, it proposes a metric that employs this variant and is based on the Wilcoxon non-parametric statistical test in an effort to elevate the performance of the SOM algorithm.

A. Georgakis and H. Li, “User behaviour modeling and content based speculative web page retrieval”, submitted to IEEE Tr. on Knowledge and Data Engineering, (accepted for publication).

This paper offers the groundwork for chapter 10. It offers a novel transparent and speculative pre-fetching algorithm for Web documents. The objective is to model the behaviour of a user browsing the Internet and predict future actions on his behalf in order to reduce the perceived latency when surfing the Internet. By accurately predicting and pre-fetching (in the browser cache) the documents that the user will request in the near future, the algorithm can reduce the overall response times.

A. Georgakis and H. Li, “An adaboosting approach to web documents pre-fetching”. Manuscript to be submitted to IEEE Transactions on Knowledge and Data Engineering.

A. Georgakis, C. Kotropoulos and H. Li, “The generalized Zipf distribution for document organization and retrieval”.

Manuscript under preparation.


with “where n is the number of m-dimensional vectors”

Page 69: Line 24, replace
$$F_n(x) = \Pr\big(x_{q(n)} \le x\big) = \sum_{j=1}^{N} \binom{n}{i} P^{i}(x)\,\big[1 - P(x)\big]^{\,n-i}$$
with
$$F_n(x) = \Pr\big(x_{q(n)} \le x\big) = \sum_{i=1}^{N} \binom{n}{i} P^{i}(x)\,\big[1 - P(x)\big]^{\,n-i}$$

Page 70: Line 5, replace
$$f_n(x) = N \binom{N-1}{n-1} P^{\,i-1}(x)\,\big[1 - P(x)\big]^{\,n-i}\, p(x)$$
with
$$f_n(x) = N \binom{N-1}{n-1} P^{\,n}(x)\,\big[1 - P(x)\big]^{\,N-n}\, p(x)$$

Page 105: Line 9, replace “However” with “Moreover”

Page 147: Line 25, replace
$$D_{ij} = h\big(g(\mathbf{x}_i, \mathbf{x}_j)\big) = h\!\left(\frac{(x_i + x_i x_j)^2}{x_i}\right) = h\big(x_i + x_j + x_i x_j\big)$$
with
$$D_{ij} = h\big(g(\mathbf{x}_i, \mathbf{x}_j)\big) = h\!\left(\frac{(x_{ik} + x_{ik} x_{jk})^2}{x_{ik}}\right) = h\Big(x_i + x_j + \sqrt{\mathbf{x}_i^{\top}\mathbf{x}_j}\,\Big)$$

Pages 147 & 148: Replace “negation” with “rejection”

Page 150: Line 13, replace “for m a given number” with “for a given number m”


Textual information retrieval

An approach based on language modeling and neural networks

Apostolos A. Georgakis

Department of Applied Physics and Electronics, Umeå University

Umeå 2004


Department of Applied Physics and Electronics
Umeå University
SE-901 87 Umeå, Sweden

Copyright © 2004 by Apostolos A. Georgakis

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without the prior permission of the author.

ISBN 91-7305-623-5

Author's email: apostolos.georgakis@tfe.umu.se
Typeset by the author in LaTeX 2ε
Printed by Print & Media, Umeå University, Umeå, 2004


To my parents, Anastasio and Maria,

who always supported and believed in me.


Contents

List of Figures 7

List of Tables 13

1 Symbols and abbreviations 15

2 Mathematical notions 17

3 Acknowledgement 19

4 List of publications 21

5 Abstract 23

6 Information retrieval: An introduction 25

6.1 General description . . . 25

6.1.1 Document organization . . . 26

6.1.2 Document retrieval . . . 28

6.2 Text processing . . . 29

6.2.1 Corpus description . . . 29

6.2.2 Text preprocessing . . . 30

6.3 Language modeling . . . 32

6.3.1 Boolean model . . . 32

6.3.2 Fuzzy set model . . . 33

6.3.3 Probabilistic model . . . 33

6.3.4 Vector space model . . . 34

6.3.4.1 Smoothing . . . 35

6.3.4.2 Forward history . . . 36

6.3.4.3 Feature vectors . . . 36


6.4 Dimensionality reduction . . . 37

6.4.1 Projections . . . 38

6.4.1.1 Principal component analysis . . . 38

6.4.1.2 Random projection . . . 39

6.4.2 Other dimensionality reduction techniques . . . 39

6.5 Performance evaluation . . . 39

6.5.1 Recall and precision . . . 40

6.5.2 Fb-measure . . . 42

6.5.3 Statistical evaluation of the performance measures . . . 42

6.5.3.1 Two-way ANOVA . . . 42

6.5.3.2 Two-sample t-test . . . . 43

6.6 Overview & Contribution of the thesis . . . 43

Bibliography 45

7 Self-Organizing Maps 49

7.1 General description . . . 49

7.2 Mathematical description of SOM . . . 54

7.2.1 Definitions . . . 55

7.2.2 Initialization . . . 55

7.2.3 SOM algorithm . . . 56

7.3 Theoretical analysis . . . 58

7.3.1 Convergence and ordering . . . 58

7.3.2 Topology preservation . . . 58

7.3.3 Probability density matching and magnification . . . 59

7.4 Different SOM architectures . . . 59

7.4.1 Batch SOM . . . 59

7.4.2 Growing grid . . . 59

7.4.3 WEBSOM . . . 60

Bibliography 61

8 Order statistics in IR 65

8.1 General description . . . 65

8.2 Reference vector updating . . . 66

8.2.1 Erroneous behaviour of the SOM algorithm . . . 67

8.2.2 Marginal ordering . . . 68

8.2.3 Marginal Median SOM . . . 71

8.2.4 Robustness properties . . . 72

8.3 Variants of the SOM algorithm . . . 72

8.3.1 MMQ-SOM . . . 73

8.3.2 MMWQ-SOM . . . 73

8.3.3 Experimental results . . . 74


8.3.3.1 Mean squared error . . . 80

8.3.3.2 Critical comments . . . 80

8.3.4 Assessment of document organization . . . 81

8.4 Conclusions . . . 83

Bibliography 87

9 Wilcoxon SOM and paired comparisons in IR 89

9.1 General description . . . 89

9.1.1 Language modeling and the Wilcoxon metric . . . 89

9.1.2 Alternative Wilcoxon metric . . . 95

9.1.3 A brief example on paired comparisons . . . 96

9.2 Wilcoxon SOM variant . . . 99

9.2.1 Experimental results . . . 101

9.2.1.1 Mean squared error . . . 101

9.2.1.2 Critical comments . . . 101

9.2.2 Assessment of document organization . . . 105

9.2.3 Statistical evaluation . . . 110

9.3 Conclusions . . . 110

Bibliography 113

10 User behaviour modeling and web page pre-fetching 115

10.1 General description . . . 115

10.2 Related Work . . . 117

10.2.1 Pre-loading . . . 120

10.2.2 Pre-fetching . . . 120

10.2.3 Other noticeable work . . . 120

10.3 Document pre-fetching . . . 121

10.3.1 Web document preprocessing . . . 121

10.3.2 Contextual statistics evaluation . . . 123

10.3.3 User profile creation . . . 123

10.3.4 Pre-fetching algorithm . . . 125

10.3.4.1 Weight due to anchored text . . . 125

10.3.4.2 Weight due to path of followed links . . . 125

10.3.5 User feedback . . . 126

10.3.6 Profile ageing . . . 127

10.4 Resolving disambiguates . . . 128

10.5 Assessment of the pre-fetching capabilities . . . 131

10.5.1 Recall-precision . . . 135

10.6 Statistical evaluation . . . 135

10.6.1 Two-way ANOVA . . . 136

10.6.2 Two-sample t-test . . . 138


10.7 Conclusions . . . 138

Bibliography 141

11 Present scientific course 145

11.1 Zipf distributed data and IR . . . 145

11.1.1 General description . . . 145

11.1.2 Proposed metric . . . 147

11.1.3 The Zipf distribution and the proposed metric . . . 149

11.1.4 Hypothesis test evaluation . . . 155

11.1.5 Clustering algorithms . . . 156

11.1.5.1 K-means algorithm . . . 157

11.1.5.2 Proposed variants . . . 157

11.1.6 Training sets . . . 158

11.1.6.1 Reuters-21578 and its feature vectors . . . 158

11.1.6.2 Artificially generated Zipf random vectors . . . 160

11.1.7 Appendix A . . . 160

11.1.8 Appendix B . . . 162

11.2 Adaboosting and web pre-fetching . . . 166

11.2.1 Adaboost algorithm . . . 166

Bibliography 171

Index 175


List of Figures

6.1 A typical IR system. The main blocks in the diagram are the information repository, the queries aiming at the repository and the algorithm employed by the IR system in organizing the information. 27

6.2 An IR system oriented for document organization and retrieval. 28

6.3 The frequencies of the annotation categories for: (a) the Reuters-21578 corpus, and (b) the Hypergeo corpus. 31

7.1 The two dimensional topology of the computational layer of the SOM algorithm. The input layer can be seen on the top of the figure (bold face arrows pointing downwards). The notation xj corresponds to the vector-valued observations that are fed to the network and wl corresponds to the reference vectors of the neurons. 50

7.2 The three dimensional topology of the computational layer for the SOM algorithm. The input layer can be seen on the top of the figure (bold face arrows pointing downwards). The notation xj corresponds to the vector-valued observations that are fed to the network whereas wl corresponds to the reference vectors of the neurons. 51

7.3 A hexagonal lattice with six neighboring neurons placed at distance equal to one from the center neuron. 51

7.4 A 2D orthogonal lattice and two of its variants: (a) a four-connected neighborhood with four neurons placed at distance equal to one from the center neuron, and (b) an eight-connected neighborhood with four neurons placed at distance equal to one (gray colored neurons) from the central neuron and four neurons placed at distance equal to √2 (white colored neurons). 52

7.5 The 3D orthogonal lattice with the neighboring neurons placed at three different distances; neighbors at distance: (a) 1, (b) √2, and (c) √3. 53


7.6 The bell shaped Gaussian neighborhood function. 54

7.7 Two different time dependent neighborhood functions for a 2D lattice for: (a) an orthogonal topological neighborhood, and (b) a hexagonal topological neighborhood (t1 < t2 < t3). 55

7.8 The Voronoi tessellation partitioning of a 2D input space. The vectors assigned in a particular region have the same reference vector. 56

8.1 The component-wise ordering of the feature vectors. The components of the feature vectors are column-wise sorted (each dimension independently). To the left, the vectorial components are not ordered. To the right, the vectorial components are ordered along each of the Nw dimensions. 69

8.2 The determination of the marginal median for five feature points on a 2D space. The coordinates of the marginal median are found in each axis independently and then an artificial point is created at the position pointed by these coordinates. 70

8.3 The updating of the reference vector ws with the feature vector xi. For each component of an "unseen" feature vector xi the correct position is identified using binary search and the component is inserted at the appropriate position. 73

8.4 Word categories map using the MMWQ-SOM for the Hypergeo corpus on an 11 × 11 neural network. The highlighted neurons correspond to word categories related to "accommodation" (left) and "sightseeing" (middle and right). 75

8.5 Word categories map using the MMWQ-SOM for the Reuters-21578 corpus on a 15 × 15 neural network. The highlighted neurons correspond to word classes related to "finance" (top left), and "oil" and "energy" (bottom right). 76

8.6 The three distinct steps in the formation of the document vector aj. From the raw textual data (top left) to the stemmed document (bottom left) and the histogram of the word categories (middle right). 77

8.7 The document map constructed for the Reuters-21578 corpus for a 9 × 9 neural network using the MMWQ-SOM. The document titles are listed for each one of the highlighted document classes. 78

8.8 The document map constructed for the Hypergeo corpus for a 7 × 7 neural network using the MMWQ-SOM. The document titles as well as their respective URL addresses are listed for each one of the highlighted classes. 79

8.9 The mean squared error curves for the basic SOM algorithm and the MMWQ-SOM variant for: (a) an 11 × 11 neural network using the Hypergeo corpus, and (b) a 15 × 15 neural network for the Reuters-21578 corpus. 82


8.10 (a) The average recall-precision curves for the basic SOM, the batch SOM and the MMWQ-SOM variant for the "Mergers & Acquisitions (acq)" category of the Reuters-21578 corpus, respectively, and (b) the average recall-precision curves for each one of the architectures for the "Earnings and Earnings Forecasts (earn)" category of the Reuters-21578 corpus. 84

8.11 The average recall-precision curves for the basic SOM, the batch SOM and the MMWQ-SOM variant for the Hypergeo corpus. The sample test document was classified into the "history" category. 85

9.1 Dimensionality reduction approach of the indicator and the feature vectors. After the thresholding, only the top portion is retained in both vectors. The notations bi(·) and xi(·) correspond to the ordered bigrams and their frequencies, respectively. 91

9.2 The distance between two instances of the same bigram (i.e. the jth bigram) in two different indicator vectors. It is equal to the range of the indices after the permutation defined by arranging the bigram probabilities in descending order. 92

9.3 The distances between the original document and its modified version for: (a) the document with ID = 1 from the Reuters-21578 corpus, and (b) the document with ID = 5214 from the same corpus. 97

9.4 The Wilcoxon distances between the original document and its modified version for: (a) the document with ID = 1 from the Reuters-21578 corpus, and (b) the document with ID = 5214 from the same corpus. 98

9.5 The block diagram of the formation of the indicator and the feature vector respectively and their clustering into contextually similar collections. 102

9.6 The document map constructed for the training documents of the Reuters-21578 corpus for a 9 × 9 neural network using the standard SOM algorithm. The highlighted nodes correspond to document clusters related to "debts" (top left), and "corporate economical results" (top and bottom right). 103

9.7 The document map constructed for the training documents of the Reuters-21578 for a 9 × 9 neural network using the Wilcoxon SOM variant. The highlighted neurons correspond to document clusters related to "financial debts" (top middle and left), "bonds" (bottom left), and "corporate economic results" (bottom right). 104

9.8 The block diagram for the formation of the DM for the on-line SOM algorithm as well as its variants proposed in the present thesis. 106

9.9 The mean squared error curves for the standard SOM and the Wilcoxon SOM using a 9 × 9 neural network for the Reuters-21578 corpus. 107


9.10 The average recall-precision curves of the standard SOM and the Wilcoxon SOM variant for: (a) the "Mergers & Acquisitions (acq)" category, (b) the "Earnings and Earnings Forecasts (earn)" category, (c) the "Grain (grain)" category, and (d) the "Money & Foreign Exchange (money-fx)" category. 108

9.11 The F1-measure curves of the standard SOM and the Wilcoxon SOM variant for: (a) the "Mergers & Acquisitions (acq)" category, (b) the "Earnings and Earnings Forecasts (earn)" category, (c) the "Grain (grain)" category, and (d) the "Money & Foreign Exchange (money-fx)" category. 109

10.1 Three different methods of communication between the client and the web server through the usage of: (a) a cache repository, (b) a cache repository and a pre-fetching agent working in parallel with the client, and (c) a cache repository and a pre-loading agent working in cooperation with the web server. 119

10.2 Block diagram of the proposed pre-fetching algorithm. 122

10.3 The bigrams constituting the web pages viewed by the user are arranged into descending frequency order. The bigrams with low frequency (grayed area) are rejected from the rest of the process. 124

10.4 The feedback from the user is in the form of keywords that signify his special interest. 128

10.5 User behaviour and the corresponding frequency variations over time. Each one of the iterations corresponds to 1000 visited web pages. The iteration with index number zero corresponds to the newest user profile whereas the index number 10 corresponds to the oldest profile. 129

10.6 The GUI of the Muhci project where one can see the outbound links of the web page under consideration. The links that were pre-downloaded in the local cache are indicated by the symbol [R]. 132

10.7 Statistical description of: (a) the distribution of the outbound links in the corpus of visited web pages, and (b) the distribution of the length of these web pages. 133

10.8 Statistical description of: (a) the number of bigrams per page after the preprocessing steps, and (b) the number of bigrams per outbound link. 134

10.9 The recall-precision curves for each one of the four pre-fetching algorithms. 136

10.10 The GUI of the Muhci project where one can see the performance of the pre-fetching algorithm. 137


10.11 Statistical evaluation for the recall-precision curves of the pre-fetching algorithms with: (a) the box and whisker plot for the two-way ANOVA between the four tested algorithms, and (b) the confidence intervals for the mean precision values of the four pre-fetching algorithms. 139

11.1 The divergence viewed under different metrics. The grayed area corresponds to the divergence measured from: (a) the Euclidean distance, which relies on the shaded area between the distribution functions, and (b) the proposed metric that is based on the shaded area in the bottom left side of the plot. 149

11.2 The probability density function for the Zipf distribution for Nw = 100 and for (a) θi = 1.35, (b) θj = 1.55, and (c) the product zijm. 152

11.3 The support regions for the null and the alternative hypothesis for the RV Dij for Nw = 2000, θi = 1.35 and θj = 1.55 at a significance level of α = 0.90. (Important remark: although the above graph implies a uniform distribution this is not true. The slope of the line in the graph approaches zero but is still significantly different from this value.) 156

11.4 Block diagram from the formation of the feature vectors based on the Reuters-21578 corpus until the clustering algorithms that will partition the set of feature vectors into collections of "similar" vectors. 159

11.5 The formation of the feature vectors based on the Reuters-21578 corpus. The notations N1, N2, . . . correspond to the number of stems found in each document set. The notation Nw corresponds to the cardinality of the set containing the stems that are common in all the document sets. The last term corresponds to the OOV term. 160

11.6 The generation of the artificial feature vectors. (a) The original distribution generated using Eq. (11.7) with θi = 1.35, and (b) the modified distribution with θi = 1.37. 161

11.7 The formation of the sequence of training examples. 169

11.8 The adaboosting algorithm used in cooperation with a pre-fetching algorithm. 170


List of Tables

6.1 Corpora statistics for the Hypergeo and the Reuters-21578 corpus. 32

6.2 Contingency table for evaluating the retrieval performance. 40

6.3 Overview of the recall-precision computation for a specific annotation category. 41

6.4 Frequency of the annotation categories for the Reuters-21578 corpus used in document retrieval. 42

8.1 Overview of the marginal median SOM variant. 71

9.1 The Wilcoxon distance between sample document pair (A, B). 99

9.2 The Wilcoxon distance between sample document pair (A, C). 100

9.3 Assessment of precision rates for the Wilcoxon SOM variant and the standard SOM. 111

10.1 Five outbound links with equal weights. 131

10.2 Two-sample t-test between the four algorithms. 138

11.1 Distribution tables for the RV Dij and for Nw = 2000 at α = 0.90 (10% confidence level). 164

11.2 Distribution tables for the RV Dij and for Nw = 2000 at α = 0.95 (5% confidence level). 165

11.3 Overview of the generalized adaboost algorithm. 167


Chapter 1

Symbols and abbreviations

ANN  Artificial neural network
ANOVA  Analysis of variance
cdf  Cumulative distribution function
CLT  Central limit theorem
DM  Document map
GUI  Graphical user interface
HTML  Hypertext markup language
IR  Information retrieval
ISP  Internet service provider
LVQ  Learning vector quantization
MLE  Maximum likelihood estimator
MMQ-SOM  Marginal median quantized SOM
MMSOM  Marginal median SOM
MMWQ-SOM  Marginal median without quantization SOM
MSE  Mean squared error
NLI  Nonlinear interpolation
NLP  Natural language processing
OOV  Out of vocabulary
PCA  Principal component analysis
pdf  Probability density function
ROI  Region of interest
RV  Random variable
SGML  Standard generalized markup language
SOM  Self organizing map
URL  Uniform resource locator
VSM  Vector space model
VQ  Vector quantization


WCM Word categories map

WVM Wilcoxon vector median

WWW World wide web


Chapter 2

Mathematical notions

a  Significance level
a(·)  Learning rate parameter
b  Indicator vector
b_i  ith word bigram
d_im  Distance between two elements of the indicator vectors
E(·)  Mean value
F_A  Statistic for the two-way ANOVA test
F_b  F_b-measure (performance evaluation)
g_i  Forward history of length i
h_i  Backward history of length i
h_si(·)  Neighborhood function for the SOM algorithm
H_Nw  N_w-th harmonic number of order θ
k  Cardinality of the vocabulary (|V| = k)
L  Number of neurons employed by the SOM algorithm
l_i  ith eigenvalue of a covariance matrix
N  Number of feature vectors (documents in the corpus)
N_w  Dimensionality of the feature vectors
N(µ, σ²)  Normal distribution with mean value µ and variance σ²
o_i  ith eigenvector of a covariance matrix
P  Precision ratio (performance evaluation)
R  Recall ratio (performance evaluation)
S_m  Sentence of length m word stems
t, n  Time or iterations
U  Uniform distribution
V  Vocabulary (set of distinct word stems)
w_i  ith word stem
w_i  ith neuron of the SOM algorithm


W_im  Wilcoxon distance between the ith and mth feature vectors
x  Feature vector

θ Zipf distribution parameter

σ(·) Width of the neighborhood function


Chapter 3

Acknowledgement

The content of the present dissertation resulted from meaningful and dedicated research work by the author. But it did not result solely from the author's efforts. A number of people assisted and contributed to the preparation of the present work.

The first years of the author's research life were greatly influenced by Dr. Ioannis Pitas, professor in the Artificial Intelligence and Information Analysis laboratory (AIIA) of the department of Informatics, Aristotle University of Thessaloniki, Greece, and Dr. Constantine Kotropoulos, assistant professor in the same laboratory. I would like to thank both of them, from the depths of my heart, for accepting me in their group and introducing me to the wonderful world of science. A plethora of meaningful ideas and suggestions, which form the basis of many sections of the present thesis, originated from their rampant imagination.

In the succeeding years, my scientific life was deeply shaped by Dr. Haibo Li, professor in the Digital Media Laboratory (DML) of the department of Applied Physics and Electronics (TFE), Umeå University, Sweden, who accepted me at his laboratory and gave me the opportunity to expand my horizons. His insightful perception and clear guidance helped me fine-tune my abilities and led me to the final destination, which is the present thesis.

Furthermore, I am greatly indebted to Prof. Staffan Andersson, who is "prefekt" (head of the department) at TFE, Dr. Adi Anani, who is a lecturer at TFE, Mrs. Annemaj Nilsson, who is the heart of TFE as "Ekonomi- och personaladministratör" (economics and personnel administrator), and last but not least Mrs. Linda Johansson, the "Systemingenjör" (system administrator) at TFE, for the zeal they showed in assisting me with numerous aspects of either the scientific work at TFE in general and in the DML in particular, or the everyday life in Umeå.

Furthermore, the author acknowledges the help of all the past and present Ph.D. students, researchers and members of both the AIIA and the DML laboratories. Namely, he would like to thank the following: G. Albanidis (AIIA), N. Basiou (AIIA), I. Buciu (AIIA), Z. Cernekova (AIIA), M. Gordan (AIIA), A. A.-Hamam (DML), S. L. Hung (DML), E. Kalkopoulou (AIIA), J. Karlsson (DML), I. Kotsia (AIIA), M. Krinidis (AIIA), S. Krinidis (AIIA), X. Laftsidis (AIIA), E. Loutas (AIIA), A. Nikolaidis (AIIA), N. Nikolaidis (AIIA), K. Prorok (DML), P. Rydesäter (DML), S. Sjögren (DML), U. Söderström (DML), V. Solachidis (AIIA), J. Sun (DML), A. Tefas (AIIA), S. Tsekeridou (AIIA), A. Vogiatzi (AIIA), A. Xafopoulos (AIIA), Z. Yao (DML).

This work has been funded by the Greek Secretariat for Research and Technology. It was also supported by the European Community through the following research projects:

Hypergeo European Union Information Society Technology (IST) project: "HYPERGEO: Easy and friendly access to geographic information for mobile users" (IST-1999-11641).

Muhci European Union Research Training Network (RTN) project: "MUHCI: Multi-modal Human Computer Interaction" (HPRN-CT-2000-00111).


Chapter 4

List of publications

Peer review journal papers

A. Georgakis, C. Kotropoulos, A. Xafopoulos and I. Pitas, "Marginal median WEBSOM for information organization and retrieval", submitted to Elsevier Neural Networks Journal, (accepted for publication).

The main content of this paper can be found in chapter 8. This paper offers a set of variants to the SOM algorithm based on the marginal median operator.

A. Georgakis, C. Kotropoulos, and I. Pitas, “A variant of the SOM algorithm for document organization and retrieval based on the Wilcoxon test”, submitted to IEEE Tr. on Neural Networks, (accepted for publication).

This paper is the basis for chapter 9. It offers a variant to the VSM and a metric that employs this variant and is based on the Wilcoxon non-parametric statistical test in an effort to elevate the performance of the SOM algorithm.

A. Georgakis and H. Li, "User behaviour modeling and content based speculative web page retrieval", submitted to IEEE Tr. on Knowledge and Data Engineering, (accepted for publication).

This paper offers the groundwork for chapter 10. It offers a pre-fetching algorithm for assisting an end-user surfing the Internet by pre-downloading web pages that have higher probability to be requested by the user.


Peer review conference papers

A. Georgakis, C. Kotropoulos and I. Pitas, "A combination of R-estimates and Wilcoxon test for document organization and retrieval", CD-ROM Proc. of IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, Grado-Gorizia, Italy, June 2003.

A. Georgakis, C. Kotropoulos, and I. Pitas, "A SOM variant based on the Wilcoxon test for document organization and retrieval", Proc. of Int. Conf. on Artificial Neural Networks (ICANN), Madrid, Spain, pp. 993-998, August 2002.

A. Georgakis, C. Kotropoulos, A. Xafopoulos and I. Pitas, “Document organization and retrieval using SOM’s and statistical language modeling”, Proc. of ICEIS Conf. Pattern Recognition in Information Systems (PRIS), Setubal, Portugal, pp. 149-160, July 2001.

A. Georgakis, C. Kotropoulos, A. Xafopoulos and I. Pitas, "MM-WEBSOM: A variant of WEBSOM based on order statistics", CD-ROM Proc. of IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, Baltimore, U.S.A., June 2001.

A. Georgakis, C. Kotropoulos, N. Bassiou and I. Pitas, “Hypergeo: A data organization and retrieval system for tourist information”, Proc. of IASTED Conf. on Applied Informatics, Innsbruck, Austria, pp. 719-724, February 2001.

A. Georgakis, C. Kotropoulos, and I. Pitas, "Organization and retrieval of tourist content information using statistical language processing techniques", Proc. of 21st Annual Glossology Sector Meeting of the Philosophical Department of the Aristotle University of Thessaloniki, Thessaloniki, Greece, pp. 110-120, 2001.


Chapter 5

Abstract

This thesis covers topics relevant to information organization and retrieval. The main objective of the work is to provide algorithms that can elevate the recall-precision performance of retrieval tasks in a wide range of applications, ranging from document organization and retrieval to web-document pre-fetching and finally clustering of documents based on novel encoding techniques.

The first part of the thesis spans chapters 6 to 9 and deals with the concept of document organization and retrieval using unsupervised neural networks, namely the self-organizing map, and statistical encoding methods for representing the available documents as numerical vectors. The objective of this section is to introduce a set of novel variants of the self-organizing map algorithm that address certain shortcomings of the original algorithm. The variants are motivated by the algorithm's poor robustness properties in the presence of outliers in the input space. Furthermore, a novel modeling method that belongs to the general framework of Salton's VSM is also proposed. However, the proposed modeling method provides an ordering of the document bigrams, instead of document terms, based on their frequencies observed in the training corpus. The replacement of the term histograms by sets of high-frequency bigrams improves the recall and precision ratios in document retrieval tasks.

In the second part of the thesis the latencies perceived by users surfing the Internet are shortened with the usage of a novel transparent and speculative pre-fetching algorithm. The proposed algorithm relies on a model of behaviour for the user browsing the Internet and predicts his future actions. In modeling the user's behaviour the algorithm relies on the contextual statistics of the web pages visited by the user. The collected statistical data are used in a weighting scheme which assigns weights to the outbound links of a particular web page and subsequently retrieves the most probable links.

Finally, the last chapter of the thesis provides preliminary theoretical results along with a general framework on the current and future scientific work. The chapter describes the usage of the Zipf distribution for document organization and the usage of the adaboosting algorithm for the elevation of the performance of pre-fetching algorithms.


Chapter 6

Information retrieval: An introduction

"Vandra blott. Vägen vet var du ska gå." (Wander on; the road knows where you are going.)

Are Nymph

This chapter opens with a short definition of the scientific area that goes under the name information retrieval. It continues with the organization and retrieval of textual information. Then, it proceeds with a short description of the text preprocessing methods applied on document collections and the language modeling techniques available today that are used to encode the documents into numerical vectors. Next, a short introduction to dimensionality reduction techniques is provided. Finally, some performance evaluation methods are presented. The chapter concludes with the overview and contribution of the thesis.

6.1 General description

For the past sixty years the problem of information retrieval (IR) has attracted ever-increasing attention. The above problem, roughly stated, can be formulated as follows: there is a vast collection of information (text, video, multimedia or image) which a user is interested in accessing as fast as possible but, above all, effectively.

A shortcoming that emerges here, as the information volume increases, is that relevant information gets ignored since it often remains undiscovered, which degrades the effectiveness of any system designed in that direction.

In principle, an IR system is simple and can be described as follows: suppose that there is a collection of objects (in any of the above mentioned categories: text, video, multimedia or image) spanning more than two different thematic areas, and an end-user is interested in retrieving objects from the collection that pertain to a particular thematic area or that are relevant to a formulated question (request or query). In theory, the user can access all the objects in the collection, thus retaining the relevant objects and discarding all the others. This would constitute a "perfect" retrieval. Unfortunately, this approach is obviously impractical due to time constraints and the waste of the user's effort and computational resources.

The advance of high-speed digital computers was once regarded to be the answer to the previous problem. Namely, they were considered to be able to "read" the entire collection of objects and to "extract" the objects that are relevant to the user's query. It soon became apparent that using the objects in their native format not only caused input and storage problems but also left unsolved the problem of characterizing the objects' content.

Automatic characterization is a very vicious problem, in which the machine "reads" each object from the collection and attempts to extract information, both syntactic and semantic, in order to use that information in deciding whether the object is relevant or not to a particular query (user- or machine-defined). The difficulty is not only knowing how to extract the information but also how to use the extracted information in evaluating relevance to the given query.

Following the characterization of the objects comes the notion of information "relevance", which is at the center of the IR community's attention. The purpose of an IR system is to retrieve as many of the relevant objects as possible (even all, if possible) and at the same time to retrieve as few of the non-relevant objects as possible (none, if possible). For the retrieval to be effective, the characterization of the objects should be such that the objects relevant to a query are retrievable in response to that particular query and the non-relevant ones are not.
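The retrieval goal just stated is what the recall and precision ratios used throughout the thesis (section 6.5.1) quantify. A minimal sketch, with invented document IDs purely for illustration:

```python
def recall_precision(retrieved: set, relevant: set) -> tuple:
    """Recall: fraction of the relevant objects that were retrieved.
    Precision: fraction of the retrieved objects that are relevant."""
    hits = len(retrieved & relevant)          # relevant AND retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# 10 relevant documents, 8 retrieved, 6 of the retrieved are relevant.
print(recall_precision({1, 2, 3, 4, 5, 6, 7, 8},
                       {1, 2, 3, 4, 5, 6, 11, 12, 13, 14}))   # (0.6, 0.75)
```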

In the past, human indexers were used (and in some cases are still employed) to index object collections like newspaper archives and library records. The indexers attempt to anticipate the index terms an average user would employ to characterize each object whose content is being described. In an automatic indexing scheme, the system is assumed to represent the objects of the collection in an unparalleled manner, as if human indexers were employed. Unfortunately, since the indexing process is non-formalizable, there does not exist a closed-form solution.

Figure 6.1 depicts a typical IR system. The diagram shows three components: the collection of objects (the input space, or information repository), the organized collection of these objects (organized according to the IR system) and the user- or machine-defined queries, which are the final stage in an IR system.

6.1.1 Document organization

Document organization has been a vibrant research and development area at the center of the IR community for the past 30 years. The primary goal of document organization is document indexing [Jon97; Kor97]. In document indexing, the body of the documents is decomposed into terms (index terms) that are used as indicators of the documents' contents. Owing to the explosive growth of the available textual data and the emergence of new application demands, the aforementioned scope has grown far beyond the primary goal. Nowadays, document classification and clustering are also taken into consideration [Yat99; Seb02]. Document classification (or categorization) is the process of automatically assigning documents to a set of predefined categories, where a document can belong to zero or more categories. On the other hand, document clustering refers to the partitioning of the collection of documents (i.e., the corpus) into clusters of semantically related documents.

Figure 6.1: A typical IR system. The main blocks in the diagram are the information repository, the queries aiming at the repository and the algorithm employed by the IR system in organizing the information.

Figure 6.2 depicts an IR system oriented for document organization and retrieval (document retrieval will be covered in 6.1.2). The corpus (in the upper left corner) is encoded and subsequently the queries (middle top) are used to extract documents relevant to the query.

Prior to the document indexing step, and due to the nature of the algorithms employed in the current thesis, the available textual data have to be transcribed into a numerical form. Among the most widely accepted encoding models used by the IR community [Yat99], namely the Boolean, the probabilistic, the fuzzy, and the vector space model (VSM), the last is the most appropriate.

This thesis will focus on the VSM [Sal83b], where the available textual data of the corpus and the queries used in the training and the retrieval phase are represented by high-dimensional vectors. Each vector element corresponds to a different word type, that is, a distinct word appearance in the corpus [Yat99]. It is generally agreed that the contextual similarity between documents carries over to their vectorial representation. Therefore, the similarity can be assessed by the use of any vector metric [Yat99]. Subsequently, the numerically encoded documents can be organized using any of the available clustering algorithms. One such well-known document organization algorithm, capable of creating semantically related document collections, is the WEBSOM [Kas98b; Kas98d; Koh98; Lag99; Koh00]. The WEBSOM is based on the Self-Organizing Map (SOM) [Koh01]. The current thesis will focus on the SOM algorithm (chapter 7) and will provide variants of it (chapters 8 and 9) in an effort to elevate its performance in terms of its document organization and retrieval capabilities.

Figure 6.2: An IR system oriented for document organization and retrieval.
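A minimal sketch of this encoding and of one possible vector metric; cosine similarity is used here purely as an illustration, not as the metric the thesis itself develops (chapters 8 and 9 propose their own):

```python
import math
from collections import Counter

def doc_vector(stems: list, vocabulary: list) -> list:
    """Encode a stemmed document as a vector of term frequencies,
    one component per word type in the vocabulary."""
    counts = Counter(stems)
    return [counts[w] for w in vocabulary]

def cosine(a: list, b: list) -> float:
    """One common vector metric: the cosine of the angle between vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["hotel", "beach", "museum", "price"]                 # toy vocabulary
d1 = doc_vector(["hotel", "beach", "hotel", "price"], vocab)  # [2, 1, 0, 1]
d2 = doc_vector(["museum", "hotel", "beach"], vocab)          # [1, 1, 1, 0]
print(cosine(d1, d2))   # contextually similar documents score close to 1
```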

6.1.2 Document retrieval

Document retrieval (or searching) has been an equally important area in the IR community and involves, given a query, the retrieval of relevant documents from the corpus. The retrieval efficiency and the retrieval speed can be elevated by exploiting the collections of semantically related documents that are generated by the aforementioned clustering process [Sal91]. In a cluster-based document IR system, a large corpus is partitioned into clusters and each of the clusters is represented by its reference vector (chapter 7).

In order to retrieve documents relevant to a user-defined query, the query undergoes the same encoding process (sections 6.2.2 and 6.3) and is transcribed into a numerical vector. Subsequently, the similarity between the encoded query and the reference vectors corresponding to the aforementioned clusters is assessed. The cluster closest to the query corresponds, with high probability, to a subset of the corpus with documents relevant to the user-defined query. Furthermore, the documents marked as being relevant to the query can also be ranked in decreasing similarity order with the help of a vector metric [Yat99].
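The two-step lookup just described, first the closest cluster, then a ranking inside it, can be sketched as follows; the reference vectors are assumed to come from the clustering step, and the cosine function from the sketch above stands in for the vector metric:

```python
def retrieve(query_vec: list, clusters: list, metric) -> list:
    """clusters: list of (reference_vector, member_document_vectors) pairs.
    Step 1: pick the cluster whose reference vector is most similar
            to the encoded query.
    Step 2: rank that cluster's documents in decreasing similarity order."""
    reference, members = max(clusters, key=lambda c: metric(query_vec, c[0]))
    return sorted(members, key=lambda d: metric(query_vec, d), reverse=True)

# Usage sketch: ranked = retrieve(doc_vector(query_stems, vocab), clusters, cosine)
```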

6.2 Text processing

Due to the nature of the algorithms that will be presented in the following sections, the textual data cannot be directly interpreted or used by these algorithms. Because of this, an indexing procedure that projects a document onto a compact representation of its content needs to be uniformly applied to training and test documents.

6.2.1 Corpus description

A variety of parameters for an IR system can be evaluated without consulting the end-user; despite that, some end-user feedback will be necessary at the end of a controlled IR experiment. But it is sometimes hard to control and replicate an experiment that involves end-users. For that reason, document test collections have been developed to supply an unbiased test-bed for evaluating IR systems. These collections are created by potential users, and once they are built they can be used to evaluate IR systems without any further user feedback.

The performance of each algorithm, either introduced in the thesis or available in the literature, is evaluated here on document retrieval. That is, the algorithms used in the thesis will be used to build IR systems which will be evaluated using some available test collections. The training of each algorithm has been performed on three document collections, namely the Hypergeo corpus, the Reuters-21578 corpus [Lew97] and the CISI corpus.

The Hypergeo corpus comprises 606 HTML files manually collected over the Internet. These files are web pages of touristic content, mostly from Greece, Spain, Germany, and France. They were collected during the European Union IST-funded project HYPERGEO. The selected files are annotated by dividing them into 19 categories related to tourism, such as accommodation, history, geography, etc., so that ground truth is incorporated into the files.

The second corpus is the Distribution 1.0 of the Reuters-21578 text categorization collection compiled by David Lewis [Lew97]. It consists of 21578 documents which appeared on the Reuters newswire in 1987. The documents are marked up using SGML tags and are manually annotated according to their content into 135 topic categories. Figure 6.3 depicts the topic frequencies, with the topics arranged in lexicographical order, for both corpora. From the figure it is evident that a large portion of the annotation categories are not used at all or are used rarely. Furthermore, a large portion of the documents are multiply annotated.


Finally, the last corpus used is the CISI collection, which is a relatively small corpus with only 1460 documents; still, it is used extensively in IR tasks.

6.2.2 Text preprocessing

Prior to the indexing of the documents, and due to the nature of the algorithms, the available textual data have to be transcribed into a numerical form. A series of actions were taken in order to encode the textual data into numerical vectors.

These steps are:

Markup language cleaning: During the first step, the HTML and SGML tags and entities are removed.

Text cleaning: Text cleaning refers to the removal of URLs, email addresses, numbers, and punctuation marks. The sole punctuation mark left intact is the full stop which is preserved in order to provide a sentence delimiter. This is done because the context for a given word is confined by the limits of the sentence. Furthermore, the collocations (i.e., expressions consisting of two or more words) are meaningful only within the limits of a sentence [Man99].

Stopping: Common English words such as articles, determiners, prepositions, pronouns, conjunctions, complementizers, abbreviations and some frequent non-English terms are removed.

Stemming: Stemming refers to the elimination of word suffixes, to shrink the vocabulary without significantly altering the context. It can be considered as an elementary clustering technique, with the word roots (stems) forming distinct classes. The underlying assumption for the successful usage of a stemming program, called stemmer, is that the morphological variants of words are semantically related [Fra92]. The commonly used Porter stemmer was applied to each corpus [Por80].
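A compressed sketch of these four steps, plus the frequency thresholding described in the next paragraph. The stopword list and the suffix stripper below are toy stand-ins for the real resources (a full stopword list and the Porter stemmer [Por80]):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}   # toy subset

def preprocess(raw: str) -> list:
    """Markup cleaning, text cleaning, stopping and (toy) stemming."""
    text = re.sub(r"<[^>]+>", " ", raw)                         # drop HTML/SGML tags
    text = re.sub(r"https?://\S+|\S+@\S+|\d+", " ", text)       # URLs, emails, numbers
    text = re.sub(r"[^\w\s.]", " ", text).lower()               # keep only the full stop
    stems = []
    for sentence in text.split("."):                            # full stop = delimiter
        for token in re.findall(r"[a-z]+", sentence):
            if token not in STOPWORDS:
                # crude suffix stripping, a stand-in for Porter's algorithm
                stems.append(re.sub(r"(ing|ed|es|s)$", "", token))
    return stems

def threshold(corpus_stems: list, min_freq: int = 20) -> list:
    """Eliminate stems whose corpus-wide frequency is below min_freq."""
    freq = Counter(s for doc in corpus_stems for s in doc)
    return [[s for s in doc if freq[s] >= min_freq] for doc in corpus_stems]
```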

Finally, prior to encoding the documents into vectors, the stems whose frequency was below a certain threshold were eliminated. For the above action the threshold was set to 20. Table 6.1 depicts some statistical data for the Hypergeo and the Reuters-21578 corpora. The third column of Table 6.1 contains the number of documents that were retained after the completion of all the aforementioned preprocessing steps. It must be noted that the number of retained documents in the Reuters-21578 corpus is nearly 12% lower than its initial value. This is due to the fact that some documents did not contain textual information to start with or lost all their textual information due to the preprocessing and the thresholding steps.

Furthermore, the resulting Reuters-21578 corpus was partitioned into two distinct sets, a training set and a test set, according to the recommended Modified Apte split of the collection [Lew97]. The first set was used for document clustering during the training phase of the algorithms, whereas the second set was used to assess the quality of document clustering through retrieval experiments that employ its documents as query-documents during the test phase.



Figure 6.3: The frequencies of the annotation categories for: (a) the Reuters-21578 corpus, and (b) the Hypergeo corpus.


Table 6.1: Corpora statistics for the Hypergeo and the Reuters-21578 corpus.

Corpus          Original documents   Retained documents   Word tokens   Stem types (before / after thresholding)
Hypergeo        606                  606                  290973        16397 / 1524
Reuters-21578   21578                19043                2642893       28670 / 4671


6.3 Language modeling

Language modeling can be seen as a case of statistical inference, which makes inferences about the unknown distribution of data [Man99]. This approach divides the problem into three possibly overlapping areas: division of training data into equivalence classes, estimator selection for the classes, and combination of estimators. A plethora of language models can be found in [Goo00]. In the following, a brief description of the Boolean, the probabilistic, the fuzzy and the VSM will be supplied.

6.3.1 Boolean model

The Boolean model is the first approach to language modeling in the IR community. In the Boolean model the documents and the queries are encoded into index terms. Using Boolean logic, the query terms and their corresponding sets of documents can be combined to form new sets of documents.

In Boolean logic there are three basic operators, namely AND, OR and NOT. Using a combination of the above operators and the index terms available, an IR system based on the Boolean model is theoretically capable of answering a user-defined query.
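A minimal sketch of such a system over an inverted index; the index data are invented for illustration:

```python
# Inverted index: each term maps to the set of documents containing it.
index = {
    "hotel":  {1, 2, 5},
    "beach":  {2, 3, 5},
    "museum": {3, 4},
}
all_docs = {1, 2, 3, 4, 5}

# Query: hotel AND beach AND NOT museum
result = index["hotel"] & index["beach"] & (all_docs - index["museum"])
print(result)   # {2, 5} -- a plain set: every document is either retrieved or not
```

Note that the answer carries no ordering among its members, which is exactly the first of the flaws listed next.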

Unfortunately, the Boolean model has a series of flaws [Sal83a]:

The main disadvantage is that the model does not provide a ranking of retrieved documents according to decreasing probability of relevance.

The model either retrieves a document or not.

Only keyword based queries can be handled (no document-based queries).

Formulation of the query is difficult.

All query terms are considered to be equal: they are either present or not.


The absence of an index term in a document leads to a false index for that document.

6.3.2 Fuzzy set model

In fuzzy set modeling the corpus documents are assigned different membership degrees to the set defined by an index term. That is, two different documents that seem to be relevant to a particular term of the query are assigned different membership degrees in the set of relevant documents for the term under consideration. Although it is known that a document contains the particular term, some documents are more relevant than others. For the degree of membership for a single term, one of the document term weighting approaches can be used [Sal88; Lee95].

6.3.3 Probabilistic model

In the probabilistic model, an IR system responds to a query by ranking the documents in the corpus in order of decreasing probability of usefulness to the query [Rob77].

Let $V = \{w_1, w_2, \ldots, w_k\}$ denote the vocabulary of distinct word stems. A probabilistic language model assigns to each sentence $S_m \equiv \{w_{(1)}, w_{(2)}, \ldots, w_{(m)}\}$ a probability $P(S_m)$. In the definition of the sentence $S_m$, the notation $(\cdot)$ denotes a word stem drawn from the set $V$.

The a priori probability $P(S_m) = P(w_{(1)}, w_{(2)}, \ldots, w_{(m)})$ for the word sequence $S_m$ can be expressed as a product of the conditional probabilities $P(w_{(i)} \mid w_{(1)} w_{(2)} \cdots w_{(i-1)}) = P(w_{(i)} \mid S_{i-1})$, using the chain rule:

$$P(S_m) = P(w_{(1)}) \cdot \prod_{i=2}^{m} P(w_{(i)} \mid S_{i-1}). \tag{6.1}$$

Equation (6.1) is the basis for both the probabilistic model and the VSM (section 6.3.4).

In the probabilistic model the interest is focused on the estimation of the probability that a specific document will be regarded as relevant with respect to a user-defined query. The probability of relevance is a function of the presence or absence of query terms in the document [Fuh92]. The probability is evaluated based on the following assumption: the terms forming the documents of the corpus are distributed differently within relevant and non-relevant documents.

The probabilistic model does not rely on term weighting, which is very important, and it has been one of the most important language models for this very reason. Unfortunately, the distribution of the terms over the relevant and non-relevant documents is not always available. Furthermore, the model supplies only a partial ranking of the documents and it does not allow the user to control the retrieved set of documents.


6.3.4 Vector space model

The VSM also stems from Eq. (6.1). In this model the task is reduced to the estimation of the probabilities appearing on the right-hand side of Eq. (6.1). In what follows, the word sequence $S_{i-1}$ is referred to as the $(i-1)$-length backward history $h_{i-1}$ of the underlying stochastic process for $P(S_m)$, where $h_n \in V^n$, $n \in \mathbb{N}^+$ [Bro92]. All possible conditional probabilities that the model estimates are the parameters of the model, which are $k^n$ in number.

Related to the division of training data into classes is the type of model used. A widely used language model is the $n$-gram model. When using this model for $n \in \mathbb{N}^+$, the probability $P(S_m)$ can be approximated by restricting the history to the preceding $n-1$ words, except for the first few words of the sequence for which fewer than $n-1$ words exist:

$$P^{(n)}(S_m) = P(w_{(1)}) \cdot \prod_{i=2}^{m} P\big(w_{(i)} \mid h_{\min\{n-1,\,i-1\}}\big). \tag{6.2}$$

Generally $h_l = \emptyset$ when $l < 1$. Also $P(w_{(\cdot)} \mid \emptyset) = P(w_{(\cdot)})$. Another moot point is the value of $n$, which is related to the number of equivalence word bins. Higher $n$ values result in more bins, and vice versa. This constitutes a reliability-discrimination compromise. Generally, higher $n$ values require larger corpora in order to provide robust estimates, owing to the much larger number of model parameters that need to be estimated.

The next issue is the selection of an appropriate statistical estimator for the parameters of the $n$-gram models. The straightforward approach for the estimation of the conditional probabilities $P(w_{(i)} \mid h_{i-1})$ is the well-known maximum likelihood estimator (MLE), which uses the notion of relative frequencies [Ney97]. The MLE of the latter conditional probabilities is given by the number of occurrences of the word sequence constituted by the word $w_{(i)}$ and its preceding history, divided by the number of occurrences of this history:

$$\hat{P}^{(n)}_{MLE}(w_{(i)} \mid h_{i-1}) = \frac{c(S_i)}{c(h_{i-1})}, \quad 1 < n < i \tag{6.3}$$

where $c(S_i)$ is the number of occurrences of the word sequence $S_i$ in the corpus. For bigram models, (6.3) simplifies to:

$$\hat{P}^{(2)}_{MLE}(w \mid h_1) = \frac{c(h_1, w)}{c(h_1)}. \tag{6.4}$$

If $k$ denotes the size of the corpus, that is, the number of word tokens in the corpus, the unigram MLE estimate is $\hat{P}^{(1)}_{MLE}(w_{(i)}) = c(w_{(i)})/k$. For the zerogram case, $\hat{P}_U(w_{(i)}) = 1/k$, with $U$ implying a uniform estimate.
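A small sketch of the MLE estimates of Eqs. (6.3)-(6.4) on a toy corpus:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

unigram = Counter(corpus)                   # c(w)
bigram = Counter(zip(corpus, corpus[1:]))   # c(h1, w)
k = len(corpus)                             # corpus size in word tokens

def p_mle_unigram(w: str) -> float:
    return unigram[w] / k                   # the unigram MLE, c(w)/k

def p_mle_bigram(w: str, h: str) -> float:
    return bigram[(h, w)] / unigram[h] if unigram[h] else 0.0   # Eq. (6.4)

print(p_mle_bigram("cat", "the"))   # c(the, cat)/c(the) = 2/3
```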


6.3.4.1 Smoothing

Considering the number $k^n$ of potential $n$-grams, especially as $n$ increases, and the limited input in terms of both the lexical units and the syntactic limitations of the corpus, it is concluded that the training corpus only covers a small percentage of the potential $n$-grams. This fact becomes even worse when natural languages, which have very large $k$, are being modeled. To deal with this sparsity problem of missing, over- and under-estimated $n$-grams, two methods are usually used, namely building equivalence word classes and smoothing the estimates.

Smoothing is a very useful technique used in the construction of robust language models. One smoothing method is interpolation, which performs a suitably weighted combination of order-$n$ and lower-order estimates [Bec99; Goo01]. During or after smoothing, the resulting probability estimates are normalized so that they sum to unity per vocabulary word. Several types of interpolation have been proposed in the speech literature [Bec99; Jel99; Man99]. The nonlinear interpolation method lies among the best models. Let $\bar{h}_n$ be the generalized history of length $n$, which refers to histories of length less than $n$. Let $c_t(h_n)$ denote the number of $n$-grams which have exactly $t$ occurrences and whose history is $h_n$. Accordingly, the unseen $n$-grams beginning with $h_{n-1}$ are $c_0(h_{n-1})$ in total. The nonlinear interpolation (NLI) estimate of the backward conditional $n$-gram probabilities, for $n > 1$, is:

$$\hat{P}^{(n)}_{NLI}(w \mid h_{n-1}) =
\begin{cases}
\dfrac{\max\{c(h_{n-1}, w) - \delta(h_{n-1}, w),\, 0\}}{c(h_{n-1})} + \dfrac{\delta(h_{n-1}, w)\,\big(k - c_0(h_{n-1})\big)}{c(h_{n-1})}\,\hat{P}_Q(w \mid \bar{h}_{n-1}), & \text{if } c(h_{n-1}) > 0\\[2ex]
\hat{P}_Q(w \mid \bar{h}_{n-1}), & \text{if } c(h_{n-1}) = 0
\end{cases} \tag{6.5}$$

where $\delta(h_{n-1}, w)$ denotes a discount parameter. The estimate $\hat{P}_Q(w \mid \bar{h}_{n-1})$ is a suitably selected one that takes into consideration a truncated generalized history. The parameter $Q$ is usually replaced by the same estimate (here NLI), MLE or $U$, and $\bar{h}_{n-1}$ by $h_{n-1}$ or $\emptyset$. Also, $\hat{P}(w \mid \emptyset) = \hat{P}(w)$. The above NLI estimate is used on bigrams, with the parameters $\delta$, $Q$ substituted as in:

$$\hat{P}^{(2)}_{NLI}(w \mid h_1) = \frac{\max\big\{c(h_1, w) - D^{(2)}(h_1),\, 0\big\}}{c(h_1)} + \frac{D^{(2)}(h_1)\,\big(k - c_0(h_1)\big)\,\hat{P}^{(1)}_{MLE}(w)}{c(h_1)} \tag{6.6}$$

where $c(h_1)$ is non-zero due to the fact that all encountered words are included in the vocabulary.
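A sketch of the bigram estimate of Eq. (6.6), reusing the toy counts from the MLE sketch above. Two assumptions of this illustration: the discount $D^{(2)}(h_1)$ is taken as a constant, and $k$ in the factor $(k - c_0(h_1))$ is read as the vocabulary size $|V|$ (chapter 2), so that the estimates normalize:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
V = set(corpus)                              # vocabulary (assumed: k = |V| below)
tokens = len(corpus)                         # number of word tokens
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_nli_bigram(w: str, h: str, D: float = 0.5) -> float:
    """Absolute discounting with the freed probability mass
    redistributed through the unigram MLE estimate, as in Eq. (6.6)."""
    c_h = unigram[h]
    if c_h == 0:
        return unigram[w] / tokens           # back off to the unigram MLE
    c0 = sum(1 for v in V if bigram[(h, v)] == 0)   # types never seen after h
    discounted = max(bigram[(h, w)] - D, 0.0) / c_h
    backoff = D * (len(V) - c0) * (unigram[w] / tokens) / c_h
    return discounted + backoff

# The estimates sum to one over the vocabulary for any seen history:
print(sum(p_nli_bigram(w, "the") for w in V))   # 1.0 (up to rounding)
```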
