Classification along Genre Dimensions
Exploring a Multidisciplinary Problem
Mikael Gunnarsson
Academic dissertation for the Degree of Doctor of Philosophy in Library and Information Science at the University of Borås, to be publicly defended on Friday 29 April 2011 at 13.00 in lecture room M506, University of Borås, Allégatan 1, Borås
Swedish School of Library and Information Science, University of Borås
Title: Classification along Genre Dimensions
Language: English, with a short summary in Swedish
Available: http://hdl.handle.net/2320/7920
ISBN 978-91-85659-72-2; ISSN 1103-6990
This thesis treats the sociotechnical notion of genre as a conflation of a communicative situation and a community of practices involved in producing and using documents.
It explores the ways in which documents may be mapped to the sociocultural contexts from which they emanate. In other words, it is concerned with the classification of documents along genre dimensions, with the purpose of supporting information seeking.
The thesis positions itself within Library and Information Science in two parts.
Firstly, a theoretical framework for classification along genre dimensions is developed based on relevant theories and practices from Library and Information Science, as well as from sociologically motivated Linguistics and neighbouring domains. Secondly, a setup for experiments, including feature derivation and reannotation of existing corpora, is designed in order to explore the relationship between text documents and genres, and the extent to which a mapping of documents to genres can be realized in real-world applications.
The experimental part of the thesis relies on an existing corpus for genre classification research, used in comparable research and supplemented here with a slight extension.
In the experiments, combinations of feature sets and target genres are evaluated, using traditional estimators of classification performance.
The outcome of the first part of the work indicates that the notion of genre with respect to classification is largely undertheorized in Library and Information Science.
We need to know more about the nature of different genres, how to robustly identify the documents of a genre, and the impact genres have on information seeking. Interdisciplinary collaborative research would be most beneficial in these efforts. The results of the experiments of the second part are fairly inconclusive for the evaluation of feature sets, but it can be concluded that the optimal combination of feature sets and target genres is a crucial issue for high performance, and worthy of more investigation.
Keywords: Genre, Library and information science, Document studies, Classification, Knowledge organization, Library classification, Text linguistics, Sociolinguistics, Speech act theory, Machine learning, Support vector machines, k-NN classification, K-means clustering