• No results found

With or without context

N/A
N/A
Protected

Academic year: 2022

Share "With or without context"

Copied!
2
0
0

Loading.... (view fulltext now)

Full text

(1)

With or without context

Automatic text categorization using semantic kernels

Johan Eklund

Academic dissertation for the Degree of Doctor of Philosophy in Library and Information Science at the University of Bor˚as to be publicly defended on Friday 15 April at 13.00 in lecture room C203,

the University of Bor˚as, All´egatan 1, Bor˚as The Swedish School of Library and Information Science

The University of Bor˚as

(2)

Title: With or without context: Automatic text categorization using semantic kernels

Language: English, with a summary in Swedish

Available: http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-8949 ISBN (printed version) 978-91-981654-8-7

ISBN (digital version) 978-91-981654-9-4 ISSN 1103-6990

In this thesis text categorization is investigated in four dimensions of analysis: theoretically as well as empirically, and as a manual as well as a machine-based process. In the first four chapters we look at the theoretical foundation of subject classification of text documents, with a certain focus on classification as as a procedure for organizing documents in libraries.

A working hypothesis used in the theoretical analysis is that classification of documents is a process that involves translations between statements in different languages, both natural and artificial. We further investigate the close relationships between structures in classification languages and the order relations and topological structures that arise from classification.

A classification algorithm that gets a special focus in the subsequent chapters is the support vector machine(SVM), which in its original formulation is a binary classifier in linear vector spaces, but has been extended to handle classification problems for which the categories are not linearly separable. To this end the algorithm utilizes a category of functions called kernels, which induce feature spaces by means of high-dimensional and often non-linear maps. For the empirical part of this study we investigate the classification performance of semantic kernels generated by different measures of semantic similarity. One category of such measures is based on the latent semantic analysis and the random indexing methods, which generates term vectors by using co-occurrence data from text collections. Another semantic measure used in this study is pointwise mutual information. In addition to the empirical study of semantic kernels we also investigate the performance of a term weighting scheme called divergence from randomness, that has hitherto received little attention within the area of automatic text categorization.

The result of the empirical part of this study shows that the semantic kernels generally outperform the “standard” (non-semantic) linear kernel, especially for small training sets.

A conclusion that can be drawn with respect to the investigated datasets is therefore that semantic information in the kernel in general improves its classification performance, and that the difference between the standard kernel and the semantic kernels is particularly large for small training sets. Another clear trend in the result is that the divergence from randomness weighting scheme yields a classification performance surpassing that of the common tf-idf weighting scheme.

Keywords: automatic text categorization, subject classification, machine learning, computational linguistics, support vector machines, semantic kernels, term weighting, divergence from randomness

References

Related documents

46 Konkreta exempel skulle kunna vara främjandeinsatser för affärsänglar/affärsängelnätverk, skapa arenor där aktörer från utbuds- och efterfrågesidan kan mötas eller

Byggstarten i maj 2020 av Lalandia och 440 nya fritidshus i Søndervig är således resultatet av 14 års ansträngningar från en lång rad lokala och nationella aktörer och ett

Omvendt er projektet ikke blevet forsinket af klager mv., som det potentielt kunne have været, fordi det danske plan- og reguleringssystem er indrettet til at afværge

I Team Finlands nätverksliknande struktur betonas strävan till samarbete mellan den nationella och lokala nivån och sektorexpertis för att locka investeringar till Finland.. För

Both Brazil and Sweden have made bilateral cooperation in areas of technology and innovation a top priority. It has been formalized in a series of agreements and made explicit

För att uppskatta den totala effekten av reformerna måste dock hänsyn tas till såväl samt- liga priseffekter som sammansättningseffekter, till följd av ökad försäljningsandel

The increasing availability of data and attention to services has increased the understanding of the contribution of services to innovation and productivity in

Av tabellen framgår att det behövs utförlig information om de projekt som genomförs vid instituten. Då Tillväxtanalys ska föreslå en metod som kan visa hur institutens verksamhet