
Degree project in Computer Science, Second cycle

Automatic web page categorization using text classification methods

Tobias Eriksson


Automatic web page categorization using text classification methods

Automatisk kategorisering av webbsidor med textklassificeringsmetoder

Master's Degree Project in Computer Science
CSC School of Computer Science and Communication

Author: Tobias Eriksson
Author E-mail: tobier@csc.kth.se
Supervisor: Johan Boye
Examiner: Jens Lagergren
Project provider: Whaam AB

September 8, 2013


Abstract

Over the last few years, the Web has virtually exploded with an enormous amount of web pages of different types of content. With the current size of the Web, it has become cumbersome to try and manually index and categorize all of its content. Evidently, there is a need for automatic web page categorization.

This study explores the use of automatic text classification methods for the categorization of web pages. The results in this paper are shown to be comparable to results in other papers on automatic web page categorization, however not as good as results on pure text classification.

Referat

Automatisk kategorisering av webbsidor med textklassificeringsmetoder

Over the last few years the Web has exploded in size, with millions of web pages of widely varying content. The enormous size of the Web makes it unmanageable to manually index and categorize all of this content. Evidently, automatic methods for categorizing web pages are needed.

This study examines how methods for automatic text classification can be used for the categorization of web pages. The results obtained in this report are comparable to results in other literature in the same field, but do not reach the level of the results in studies on pure text classification.


Contents

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Scope
  1.4 Purpose and contribution
  1.5 Methodology
  1.6 Related work
  1.7 Description of this document

2 Background
  2.1 Machine learning
    2.1.1 Overview
    2.1.2 Types of machine learning
  2.2 Natural language processing
    2.2.1 Overview
    2.2.2 N-grams
    2.2.3 Tokenization
    2.2.4 Term frequency - inverse document frequency
    2.2.5 Stemming
    2.2.6 Bag-of-words model

3 Automatic text categorization
  3.1 Overview
  3.2 Naive Bayes
  3.3 Multinomial Naive Bayes
  3.4 TWCNB
  3.5 k-nearest-neighbour
  3.6 Support Vector Machine
    3.6.1 Linear SVM
    3.6.2 Nonlinear SVM
    3.6.3 Multi-class SVM
  3.7 N-Gram based classification

4 Automatic web page categorization
  4.1 Overview
  4.2 Data gathering
  4.3 Choice of classifier
  4.4 Training
  4.5 Feature selection and extraction
    4.5.1 Plain text conversion
    4.5.2 Tokenization and stemming
  4.6 Document representation
  4.7 Categorization of unlabeled documents

5 Implementation
  5.1 Overview
  5.2 The content framework package
  5.3 The ml framework package
  5.4 The nlp framework package
  5.5 The tools framework package
  5.6 The web framework package

6 Results
  6.1 Classifiers
  6.2 Data and experiments
  6.3 Evaluation metrics
  6.4 Per-classifier performance
    6.4.1 Micro-averaged F-measure
    6.4.2 Macro-averaged F-measure
    6.4.3 Raw classification accuracy

7 Conclusions
  7.1 Concerns about the data
  7.2 Suggestions for improvements in future work
    7.2.1 Data sets
    7.2.2 Improved feature selection
    7.2.3 Utilizing domain knowledge
    7.2.4 Utilizing web site structure
    7.2.5 Alternative approaches
  7.3 Overall conclusion

Bibliography
  Articles and publications
  Books
  Internet resources


List of Figures

1.1 The number of Internet hosts (1994-2012) according to the ISC Domain Survey [46].
1.2 Overview of the methodology used in this thesis.
3.1 Example of a maximum margin separating two classes of data points [52].
4.1 Training the classifier for web page categorization.
4.2 An example of a web page.
6.1 Mean Micro F-Measure Score with 95% confidence interval shown.
6.2 Mean Macro F-Measure Score with 95% confidence interval shown.
6.3 Mean Classification Accuracy with 95% confidence interval shown.


List of Tables

2.1 N-grams of different lengths for the word APPLE
6.1 Distribution of samples over the categories in the data.
6.2 Micro-averaged F1 score for the different classifiers, averaged over all instances.
6.3 Macro-averaged F1 score for the different classifiers, averaged over all instances.
6.4 Accuracy of the different classifiers in terms of correct classification rate, averaged over all instances.


Chapter 1

Introduction

1.1 Background

The World Wide Web has grown tremendously in the last few years, with at least 14 billion web pages indexed by search engines such as Google and Bing [40]. The Internet Systems Consortium publishes quarterly statistics on the number of reachable hosts on the Internet, shown in figure 1.1. The Web has content of virtually any category imaginable: entertainment, fashion, news, multimedia, science and technology (to name a few). The Open Directory project [42] tries to take on the daunting task of categorizing the Web. It is a human-edited index, with over 5 million web pages listed in over 1 million categories. While this is an impressive feat, it becomes clear that a human-edited index of the Web is not viable as the Web continues to grow.

Figure 1.1: The number of Internet hosts (1994-2012) according to the ISC Domain Survey [46].

One of the applications of machine learning is to automatically classify documents into predefined sets of categories, based on their textual content. This is known as text classification. Often these documents are news articles, research papers and the like. A web page is essentially a text document, and can often be considered to be of a certain category or subject. Therefore it seems probable that proven machine learning methods for text classification can be used to assign categories to web pages.

1.2 Problem statement

This master's thesis is an exploratory study, with the goal of trying to answer the question: how can machine learning and natural language processing tools for text classification be used to perform automatic web page categorization? Machine learning and natural language processing methods and tools for text classification are studied, and a few methods are selected. From this, a general method for automatic web page categorization is proposed, and evaluated by experimentation on real web pages.

1.3 Scope

Given that there are many methods for text classification, this thesis only focuses on a selected few. Document classification papers most often use documents in a single language (English is very common), and this is also the case in this thesis.

1.4 Purpose and contribution

The purpose of this thesis is to try and identify a method for web page classification with good performance, in terms of classification accuracy, that is acceptable in practical applications. The author hopes that the thesis inspires others to identify machine learning methods other than the ones studied for use with the categorization method proposed in this report, and to expand upon it.

1.5 Methodology

The approach to automatic web page categorization proposed in this report is based on contemporary methods for performing automatic text classification. This involves extracting the textual, natural language content from the web page and encoding the document as a feature vector with natural language processing methods. This section gives a brief overview of the methodology used in this thesis. The methodology is summarized in figure 1.2.

Figure 1.2: Overview of the methodology used in this thesis.

First, a study of relevant document classification literature was performed to get an understanding of the field and find suitable machine learning algorithms and natural language processing methods that can be used for web page classification. There were no hard criteria for selecting a specific algorithm or method.

The general guidelines are:

• The algorithm or method should be cited in the literature.
• The algorithm or method should have a readily available implementation in Java, or be easily implemented in a short time.
• Preferably, if a machine learning algorithm, it should have been part of a comparative study in document classification.

This is followed by proposing a general method of automatic web page categorization using text classification methods. To evaluate this method, a framework for automatic web page categorization is developed that implements the general method and includes support for several of the algorithms and tools that have been studied. To establish how well the general method performs, several experiments are done by performing web page categorization using the framework (with the different classifiers) on different data sets of real web page data. The performance of the framework is documented, and statistical analysis of the results is done.

From these results, conclusions about the method can be drawn and any further work proposed.

The real web pages are given by the project provider, Whaam¹. Whaam is a discovery engine based on the idea of social surfing. On Whaam, users store links to web pages in link lists that can be shared with friends. Links are separated into a set of eight broad categories. The categories are Art & Design, Fashion, Entertainment, Blogs, Sport, Lifestyle, Science & Tech, and Miscellaneous. The web pages have been labeled by the users of Whaam, which brings up the issue of the quality of the labeling. One of the concerns is that a cursory examination of the data shows that many samples have been labeled as Miscellaneous when they should have been labeled more specifically. For example, many links to musical artists are in the Miscellaneous category, but should probably belong to the Entertainment category. Another concern is that many samples are media-centered, thus having minimal textual content other than basic meta-information. For example, the following is the textual content taken from a typical Entertainment sample from the video-sharing website YouTube²:

Uploaded on Dec 23, 2011

An acoustic rendition of Silent Night by Matt Corby.

Merry Christmas from Matt and team!

Samples from the Science & Technology category, for example, are comparatively large in terms of textual content, as illustrated by the following extracted text³:

Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR, and JavaScript). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language - it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection.

Clojure is a dialect of Lisp, and shares with Lisp the code-as-data philosophy and a powerful macro system. Clojure is predominantly a functional programming language, and features a rich set of immutable, persistent data structures. When mutable state is needed, Clojure offers a software transactional memory system and reactive Agent system that ensure clean, correct, multithreaded designs.

I hope you find Clojure's combination of facilities elegant, powerful, practical and fun to use. The primary forum for discussing Clojure is the Google Group - please join us!

Of course, the texts above are not the full content of the two samples but the most relevant parts by simple selection. Still, this illustrates the significant difference in the amount of textual content between the different categories.

¹ See http://www.whaam.com
² For reference, at the time of this writing, the sample referred to is http://www.youtube.com/watch?v=z6_Hcr5o_0M.
³ Again, for reference, at the time of this writing, the sample referred to is http://clojure.org/.

1.6 Related work

There has been much research done on text classification over the years, and also on different methods for performing categorization of web pages. To get an idea of what work has been done in the past, in this section we will briefly look at some research papers on automatic web page categorization. A fairly recent paper by Qi and Davison lists multiple methods [21] that are of interest.

An interesting paper by Kan [13] studies the use of URLs for web page categorization. The assumption is that the URL inherently encodes information about a page's category. In the paper, Kan describes several methods for extracting tokens from URLs. One of them uses a finite state transducer to expand URL segments, such as "cs" in http://cs.cornell.edu to "computer science". These tokens are then used as the data for the classifier. The experiments used a Support Vector Machine as the classifier, and tried using different sources of text data as a comparison to using URLs only.

A paper written by Daniele Riboni in 2002 focused on feature selection for web page classification [23]. It highlighted the fact that categorizing web pages is different from, but related to, text classification, mainly because of the presence of HTML markup. Riboni chose a set of sources for text in HTML: the content of the BODY tag, the content of the META tag, the content of the TITLE tag, and combinations of these. Riboni experimented with several feature selection techniques based on information gain, word frequency, and document frequency. The experiments were made using a large set of samples from the subcategories of the Science directory from Yahoo!⁴, using a Naive Bayes classifier and a kernel perceptron.

Not unexpectedly, there has been work done on methods for automatic web page categorization that are not based purely on text classification. An interesting paper by Attardi et al. [1] discusses categorization by context. The intended application is cataloguing, by spidering from a root web page. The paper makes the assumption that a web page can be categorized by the links pointing to it, either directly by the anchor text itself or by text in the same context, i.e. surrounding it. Of course, this can be less useful if the link has a generic anchor text and occurs in a generic context, such as "Read more about it here". Another paper [31] explored the possibility of using mixed Hidden Markov models for clickstreams⁵ as a model for web page categorization.

⁴ http://www.yahoo.com

This report takes a similar approach to web page categorization as [23], but with more emphasis on studying several classifiers and different sources of text. While Riboni used the whole content of the BODY tag, in this report we will explore further by using different parts under the BODY tag as sources of text. Chapter 6 explains these sources in more detail. This report utilizes classic text classification methods, thus making it possible to make relevant comparisons to other studies done on both text classification and web page categorization that use text classification methods.

1.7 Description of this document

Introduction The introduction gives a brief background to the problem statement, the problem statement itself, and discusses the methodology and related work.

Background This chapter introduces the reader to machine learning and natural language processing.

Automatic text categorization This chapter presents the identified machine learning methods for text classification, and describes the algorithms and their reported performance.

Automatic web page categorization This chapter presents a general method for automatic web page categorization using text classification methods.

Implementation This chapter briefly discusses the implementation of the categorization framework.

Results This chapter presents the experiments performed, and discusses the results.

Conclusions This chapter presents the conclusions drawn based on the experimental results, and proposes further work.

⁵ A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing. Users' clicks are logged inside the web server.


Chapter 2

Background

This chapter introduces the fields of machine learning and natural language processing, with some specific concepts that are relevant to later parts of this thesis. This chapter is meant to give a quick overview of the concepts for the reader with minimal to no knowledge of these fields, without going into details. The reader is encouraged to explore the cited literature for more detailed information.

2.1 Machine learning

2.1.1 Overview

The core of the topic of this paper is learning. Let us take a moment and define what learning is. The Oxford English Dictionary defines learning as "the acquisition of knowledge or skills through study, experience, or being taught". Computers can be taught to learn from data, and if we put it into human behavioural terms we can say that they learn from experience. American computer scientist Tom M. Mitchell defines machine learning as

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. [35]

Mitchell gives an example:

A computer program that learns to play checkers might improve its performance as measured by its ability to win at the class of tasks involving playing checkers games, through experience obtained by playing games against itself. [35]

Applications of machine learning include text classification [26], computer vision [24] and medical diagnosis [15].

2.1.2 Types of machine learning

There are several types of machine learning. Supervised learning uses a provided training set of examples with correct responses. Based on this training set, the algorithm generalizes to respond correctly to all possible inputs. On the other hand, unsupervised learning does not rely on provided correct responses for the training, but instead tries to identify similarities between the inputs so that inputs that have something in common are grouped together. A combination of the two former types is reinforcement learning, where the algorithm is notified that it has made an error but is not told how to correct it. Instead, the algorithm has to explore and try out different possible solutions until it makes a correct prediction [34].

Other types of machine learning are evolutionary learning [34], semi-supervised learning [36] and multitask learning [3].

2.2 Natural language processing

2.2.1 Overview

Natural language processing, or NLP for short, is a field of computer science and linguistics concerned with the interaction between human natural languages and computers. Some major applications of NLP include information retrieval, dialogue and conversational agents, and machine translation [32]. NLP is a large field, and therefore this section will only briefly focus on the concepts needed for later chapters in this thesis.

2.2.2 N-grams

An N-gram is a contiguous sequence of n items from a given sequence of text. The items are usually letters (characters), but can also be other things depending on the application (e.g. words). Typically, one slices a word into a set of overlapping N-grams, and usually pads the words with blanks¹ to help with start-of-word and end-of-word situations [32]. Refer to table 2.1 for an example.

Table 2.1: N-grams of different lengths for the word APPLE

N   Resulting N-grams
2   _A, AP, PP, PL, LE, E_
3   _AP, APP, PPL, PLE, LE_, E__
4   _APP, APPL, PPLE, PLE_, LE__, E___

¹ In this report, blanks are represented by underscores.
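To make the padding and slicing concrete, the following is a minimal Java sketch of character N-gram generation in the style of table 2.1. The class and method names are illustrative and not taken from the framework described in chapter 5; String.repeat assumes Java 11 or later.

import java.util.ArrayList;
import java.util.List;

public class NGramExample {

    // Returns the overlapping character N-grams of a word, padded with one
    // leading blank and (n - 1) trailing blanks, written as underscores here.
    static List<String> ngrams(String word, int n) {
        String padded = "_" + word + "_".repeat(n - 1);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [_AP, APP, PPL, PLE, LE_, E__], matching table 2.1.
        System.out.println(ngrams("APPLE", 3));
    }
}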

2.2.3 Tokenization

Tokenization is the process of breaking up a text into words, called tokens. Tokenization is relatively straightforward in languages such as English, but is particularly difficult for languages such as Chinese that have no word boundaries [9]. In English, words are often separated from each other by blanks or punctuation. However, this does not always apply since, for example, "Los Angeles" and "rock 'n' roll" are each often considered a single word.

2.2.4 Term frequency - inverse document frequency

Consider the case where we query a system for the sentence "the car". As "the" is a very common word in English, it will appear in many documents. The result of the query would then be documents not relevant to "car", as "the" likely appears in all of the documents in the system. Instead we want to highlight the importance of the term "car". Term frequency - inverse document frequency, or tf-idf for short, is a measure of how important a term is in a document [32].


The term frequency tf_i of a term i can be chosen to be the raw frequency of the term in the document. Other possibilities include boolean frequencies and logarithmically scaled frequencies [32]. The problem with the term frequency is that it considers all terms equally important. Thus we introduce a factor that discounts the importance of a term if it appears in many documents in the set. This approach defines the weight w_i of the term i as:

    w_i = \mathrm{tf}_i \cdot \log \frac{n}{n_i}    (2.1)

where n is the total number of documents in the set, and n_i is the number of documents in the set in which term i occurs.
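As an illustration of equation 2.1, the following Java sketch computes tf-idf weights from tokenized documents. It is a minimal example with hypothetical names and is not part of the framework described in chapter 5.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfExample {

    // documents: each document is a list of (stemmed) tokens.
    // Returns one weight map per document, with w_i = tf_i * log(n / n_i).
    static List<Map<String, Double>> tfIdf(List<List<String>> documents) {
        int n = documents.size();

        // n_i: number of documents in which term i occurs.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : documents) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> weights = new ArrayList<>();
        for (List<String> doc : documents) {
            // tf_i: raw frequency of term i in this document.
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            Map<String, Double> w = new HashMap<>();
            tf.forEach((term, f) ->
                w.put(term, f * Math.log((double) n / docFreq.get(term))));
            weights.add(w);
        }
        return weights;
    }
}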

2.2.5 Stemming

Many natural languages are inflected, meaning that words sharing the same root can be related to the same topic [27]. In information retrieval and classification, one often wishes to group words with the same meaning to the same term. An example of this are the terms "cat" and "cats", which can be represented by the term "cat". In some algorithms, the stems produced by the stemming algorithm are not words, which can appear incorrect in terms of the natural language. However, this is not seen as a flaw but rather as a feature [27]. Stemming algorithms can roughly be classified as affix-removing, statistical or mixed.

The affix removal algorithms can be as simple as removing n tokens from the term, or removing -s from plural words. One stemmer for suffix removal is the Porter Stemming Algorithm, which is very popular and can perhaps be considered the de-facto standard for English words [27]. It is based on performing a set of steps for reducing words to stems using rules for transformation. An example of a rule in the Porter Stemmer is

SSES → SS

i.e. if the word ends in "sses", change the suffix to "ss" [20].

There are some approaches to stemming that are statistical. For example, there has been work done on using Hidden Markov Models for performing unsupervised word stemming [27]. Finally, there are methods that mix techniques from both affix stemmers and statistical stemmers.
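As a toy illustration of affix-removal rules such as SSES → SS, the sketch below applies a few Porter-style suffix rewrites in Java. This is only an illustration of the idea; a real implementation would use a full Porter stemmer, for example the one shipped with Apache Lucene mentioned in chapter 5.

public class SuffixRuleExample {

    // Applies a few Porter step-1a style suffix rules; not the full algorithm.
    static String stemStep(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2);  // SSES -> SS
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2);  // IES -> I
        }
        if (word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);  // S -> (empty)
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stemStep("caresses")); // caress
        System.out.println(stemStep("ponies"));   // poni
        System.out.println(stemStep("cats"));     // cat
    }
}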

2.2.6 Bag-of-words model

The bag-of-words model is a simplifying representation of documents. A bag-of-words is an unordered set of words, with their exact positions ignored [33]. The simplest representation of a bag-of-words is a binary term vector, with each binary feature indicating whether a vocabulary word does or does not occur in the document [32]. For example, assuming the vocabulary is

    V = \{ \text{dog}, \text{cat}, \text{fish}, \text{iguana} \}

the term vector of a document containing only the words "cat" and "iguana" would be

    \vec{d} = (0, 1, 0, 1)


Another, in some cases more useful, term vector representation for the bag-of-words model is to use word frequency as the feature elements. In information retrieval and classification, various term weighting schemes are used that emphasize the relative importance of a term in its context (for example tf-idf weighting).
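A minimal Java sketch of the bag-of-words encoding follows, producing a frequency-based term vector over a fixed vocabulary; the names are illustrative only.

import java.util.Arrays;
import java.util.List;

public class BagOfWordsExample {

    // Encodes a tokenized document as a term-frequency vector over a fixed
    // vocabulary; words outside the vocabulary are simply ignored.
    static int[] termVector(List<String> vocabulary, List<String> tokens) {
        int[] vector = new int[vocabulary.size()];
        for (String token : tokens) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                vector[index]++;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary = List.of("dog", "cat", "fish", "iguana");
        List<String> document = List.of("cat", "iguana");
        // Prints [0, 1, 0, 1], matching the example above.
        System.out.println(Arrays.toString(termVector(vocabulary, document)));
    }
}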


Chapter 3

Automatic text categorization

3.1 Overview

Automatic text categorization¹ is a supervised learning task, defined as assigning pre-defined category labels to new documents based on the likelihood suggested by a training set of labeled documents [29].

Let us illustrate the concept with an example. Imagine we want to classify the following text into a predefined set of categories:

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only.

Let us assume that the set of categories is a set of authors. Then, the task would be to determine which author has most likely written the given text. For the text above, the correct category would be Charles Dickens.

We want to use machine learning methods to classify this text. To do this, we must encode the text as a feature vector. The simplest approach is to represent the document by a bag-of-words feature vector with the features being word occurrences. However, as we will see in the following sections, the features are usually term weights as determined by a term weighting scheme. For simplicity's sake, we decide that the vocabulary of the training documents is the small set of words:

V = {matchbook, times, age, wisdom, hope, despair}

We can encode the document, based on the frequency of the distinct words, as the bag-of-words feature vector:

    \vec{d} = (0, 2, 2, 1, 1, 1)

¹ Text categorization is also known as text classification, document categorization, and document classification. These terms are used interchangeably in this document.


where the first feature represents the frequency of "matchbook" in the text, the second the frequency of "times" in the text, and so on. The vector representation \vec{d} of the document is a point in a feature space of dimension |V|. Using this feature vector, we apply a machine learning algorithm to determine its category.

In this chapter, we will look at a selection of document classification methods that appear in the studied literature. Essentially, these are the methods that were deemed suitable for experimentation with automatic web page categorization during the literature study. We will take a brief look at how the methods work, and how well they perform at document classification tasks in the literature.

3.2 Naive Bayes

A Naive Bayes classifier is a probabilistic classifier based on applying Bayes' theorem, with the naive assumption of independence between features. What these features are varies from application to application; in text categorization the features are usually term (word) weights calculated using tf-idf or another weighting scheme. These classifiers are commonly studied in machine learning [29], and are frequently used because they are fast and easy to implement [22]. The basic idea of the Naive Bayes classifier is that we want to construct a decision rule d that labels a document with the class that yields the highest posterior probability:

    d(X_1, \ldots, X_n) = \arg\max_c P(C = c \mid X_1 = x_1, \ldots, X_n = x_n)    (3.1)

This is known as a maximum a posteriori or MAP decision rule. However, the posterior probabilities are usually not known. The trick is to make the rather naive assumption that all features X_1, \ldots, X_n are conditionally independent. Bayes' rule states that:

    P(C = c \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{P(C = c) \, P(X_1 = x_1, \ldots, X_n = x_n \mid C = c)}{P(X_1 = x_1, \ldots, X_n = x_n)}    (3.2)

Applying the naive assumption of independence between the features:

    P(X_1 = x_1, \ldots, X_n = x_n \mid C = c) = \prod_{i=1}^{n} P(X_i = x_i \mid C = c)    (3.3)

From this we see that:

    P(C = c \mid X_1 = x_1, \ldots, X_n = x_n) \propto P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)    (3.4)

Using the above we can write the decision rule 3.1 as:

    d(X_1, \ldots, X_n) = \arg\max_c P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)    (3.5)

The assumption of independence between features is a strong one. If one thinks about a typical document, it seems unlikely that its features occur independently of one another. Not only that, but an increase in the number of features makes the model describe noise in the data rather than the actual underlying relationship between features (known as overfitting) [17]. However, the performance of Naive Bayes classifiers varies in the literature. In some cases, NB is one of the poorest of the classifiers compared [11, 12, 29], but in some studies they are comparable to the best known classification methods [5].

There are several variants of the NB classifier, which differ mainly by the assumptions they make regarding the distribution of P(X_i = x_i \mid C = c) [49].


3.3 Multinomial Naive Bayes

A classic approach [49] for text classification is the Multinomial Naive Bayes (MNB) classifier, which models the distribution of words (features) in a document as multinomial, i.e. the probability of a document given its class is a multinomial distribution [18]. The estimated individual probability for P(X_t = x_t \mid C = c) is generally written as [18, 49]:

    \hat{P}(X_t = x_t \mid C = c) = \frac{N_{ct} + \alpha}{N_c + \alpha|V|}    (3.6)

where |V| is the size of the vocabulary, N_{ct} is the number of times feature t appears in class c in the training set, and N_c is the total count of features of class c. The constant \alpha is a smoothing constant used to handle the problems of overfitting and the edge case where N_{ct} = 0. Setting \alpha = 1 is known as Laplace smoothing [18, 49]. Applying this estimate to the decision rule 3.5 we get

    d(X_1, \ldots, X_n) = \arg\max_c P(C = c) \prod_{i=1}^{n} \hat{P}(X_i = x_i \mid C = c)
                        = \arg\max_c P(C = c) \prod_{i=1}^{n} \frac{N_{ci} + \alpha}{N_c + \alpha|V|}    (3.7)

In the literature the minimum-error classification rule is more often used [14, 22]:

    d(X_1, \ldots, X_n) = \arg\max_c \left[ \log P(C = c) + \sum_{i=1}^{n} f_i \log \hat{P}(X_i = x_i \mid C = c) \right]
                        = \arg\max_c \left[ \log P(C = c) + \sum_{i=1}^{n} f_i \log \frac{N_{ci} + \alpha}{N_c + \alpha|V|} \right]    (3.8)

where f_i is the frequency of word x_i.
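To make the decision rules 3.6-3.8 concrete, here is a small Java sketch of a Multinomial Naive Bayes classifier with Laplace smoothing, working in log space. It is a simplified illustration with hypothetical names, not the WEKA implementation used in chapter 5.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MultinomialNaiveBayesExample {

    private final double alpha = 1.0;                                            // Laplace smoothing
    private final Map<String, Integer> docsPerClass = new HashMap<>();           // for P(C = c)
    private final Map<String, Map<String, Integer>> countPerClass = new HashMap<>(); // N_ct
    private final Map<String, Integer> totalPerClass = new HashMap<>();          // N_c
    private final Set<String> vocabulary = new HashSet<>();                      // V
    private int totalDocs = 0;

    // Train on one labeled, tokenized document.
    void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = countPerClass.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            vocabulary.add(token);
            counts.merge(token, 1, Integer::sum);
            totalPerClass.merge(label, 1, Integer::sum);
        }
    }

    // Decision rule 3.8: arg max over classes of
    // log P(C = c) + sum over token occurrences of log((N_ct + alpha) / (N_c + alpha * |V|)).
    String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            double score = Math.log((double) docsPerClass.get(label) / totalDocs);
            double denominator = totalPerClass.getOrDefault(label, 0) + alpha * vocabulary.size();
            Map<String, Integer> counts = countPerClass.get(label);
            for (String token : tokens) {
                double nct = counts.getOrDefault(token, 0);
                score += Math.log((nct + alpha) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}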

3.4 TWCNB

Transformed Weight-Normalized Complement Naive Bayes (TWCNB) is a variant of the MNB classifier that, according to the original authors, "fixes many of the classifier's problems without making it slower or significantly more difficult to implement" [22].

While similar to MNB, one of the differences is that the tf-idf normalization transformation is part of the definition of the algorithm. The main difference between the two, however, is that TWCNB estimates the conditioned feature probabilities by using data from all classes apart from c [14, 22]. The estimated probability is defined as [22]:

    \hat{\theta}_{ic} = \frac{\alpha + \sum_{k=1}^{|C|} d_{ik}}{\alpha|V| + \sum_{k=1}^{|C|} \sum_{x=1}^{|V|} d_{xk}}, \quad k \neq c \wedge k \in C    (3.9)

where |V| is the size of the vocabulary and d_{ik}, d_{xk} are the tf-idf weights of words i and x in class k. Now, we apply this estimate as a normalized word weight [22]:

    w_{ci} = \frac{\log \hat{\theta}_{ic}}{\sum_k \log \hat{\theta}_{kc}}    (3.10)

Letting \hat{P}(X_i = x_i \mid C = c) = w_{ci} and applying it to the decision rule 3.8 we get:

    d(X_1, \ldots, X_n) = \arg\max_c \left[ \log P(C = c) + \sum_{i=1}^{n} f_i w_{ci} \right]
                        = \arg\max_c \left[ \log P(C = c) + \sum_{i=1}^{n} f_i \frac{\log \hat{\theta}_{ic}}{\sum_k \log \hat{\theta}_{kc}} \right]    (3.11)

The literature shows that TWCNB performs about equally well as or better than MNB [14, 22].

3.5 k-nearest-neighbour

k-nearest-neighbor (kNN) is a well known statistical approach to classification that has been widely studied over the years, and was applied early to text categorization tasks [29]. In the kNN algorithm the data points lie in a feature space, and what the points represent depends on the application. In text classification tasks, the components of the feature vector are usually term (word) weights (as for the NB classifiers).

The algorithm is quite simple. It determines the category of a test document t based on the voting of a set of k documents that are nearest to t in terms of distance, usually Euclidean distance [28]. In some applications, Euclidean distance can start to become meaningless if the number of features is high, which reduces the accuracy of the classifier. However, kNN is still regarded as one of the top performing classifiers on the Reuters corpus [29]. The basic decision rule given a test document t for the kNN classifier is [10]:

    d(t) = \arg\max_c \sum_{x_i \in kNN} y(x_i, c)    (3.12)

where y(x_i, c) is a binary classification function for training document x_i (which returns 1 if x_i is labeled with c, or 0 otherwise). This rule labels t with the category that is given the most votes in the k-nearest neighborhood.

The rule can also be extended by introducing a similarity function s(t, x_i), labeling t with the class that has the maximum similarity to t [10, 29]:

    d(t) = \arg\max_c \sum_{x_i \in kNN} s(t, x_i) \, y(x_i, c)    (3.13)

The latter, weighted decision function is thought to be better than the former and is more popular [10].

3.6 Support Vector Machine

Support Vector Machines (SVM) are a relatively new learning approach introduced by Vapnik in 1995 [37] for solving two-class pattern recognition problems. Empirical evidence suggests that the SVM is one of the best techniques for performing automatic text categorization [29]. The SVM problem is commonly solved using quadratic programming techniques [30].


Figure 3.1: Example of a maximum margin separating two classes of data points [52].

3.6.1 Linear SVM

The SVM method is defined over a vector space where the problem is to find a decision surface that best separates two classes of data points. To do this, we introduce a margin between the two classes [29]. Similarly to the kNN and NB classifiers, the data points represent term (word) weights when the task is text classification. Figure 3.1 shows an example of a margin that separates two classes of data points in a two-dimensional feature space of linearly separable classes. The solid line is an example of a decision surface that separates the two classes, and the dashed lines parallel to the solid one show how much the decision surface can be moved without risking misclassification of data points. The SVM solves the problem of finding a decision surface that maximizes the margin. The decision surface as seen in figure 3.1 is expressed as [29, 30]:

    \vec{w} \cdot \phi(\vec{x}) - b = 0    (3.14)

where \vec{x} is an arbitrary data point to be classified, and \phi is a transformation function on the data points. The vector \vec{w} and constant b are learned from a training set of linearly separable data. Let D = \{(y_i, \vec{x}_i)\} of size N denote the training set, where y_i \in \{\pm 1\} is the classification of \vec{x}_i (+1 being a positive example for the given class, -1 a negative example) [29]. The SVM problem is then to find \vec{w} and b that satisfy the following two constraints [30]:

    \vec{w} \cdot \phi(\vec{x}_i) - b \geq +1 \quad \text{for } y_i = +1
    \vec{w} \cdot \phi(\vec{x}_i) - b \leq -1 \quad \text{for } y_i = -1    (3.15)

Refer to figure 3.1 for a graphical example. The circled points on the dotted lines are the support vectors that give the method its name. The decision function for an unlabeled document d, represented by the feature vector \vec{x}_d, is [30]:

    d(\vec{x}_d) = \begin{cases} +1 & \text{if } \vec{w}^T \cdot \phi(\vec{x}_d) + b > k \\ -1 & \text{otherwise} \end{cases}    (3.16)

where k is a user-defined threshold.


3.6.2 Nonlinear SVM

If we take an arbitrary text document, it seems very unlikely that the data in the feature space will be linearly separable. Thus we want a decision surface that can separate nonlinear data points. It can be shown that we can reformulate the decision function as [30]

    d(\vec{x}_d) = \begin{cases} +1 & \text{if } \sum_{i=1}^{N} \alpha_i y_i K(\vec{x}_d, \vec{x}_i) + b > k \\ -1 & \text{otherwise} \end{cases}    (3.17)

where \alpha_i \geq 0. The function K(\vec{x}, \vec{x}_i) is called a kernel function and allows us to have a decision surface for non-linearly separable data. An example of such a kernel function is the polynomial kernel K(\vec{x}, \vec{x}_i) = (\vec{x}^T \cdot \vec{x}_i + 1)^d [30].
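Given the learned coefficients, decision function 3.17 is straightforward to express in code. The Java sketch below assumes the \alpha_i, y_i, support vectors and b have already been obtained from a quadratic programming solver, which is not shown; the polynomial kernel of degree 2 is only an example choice.

public class SvmDecisionExample {

    // Polynomial kernel K(x, x_i) = (x^T x_i + 1)^d with d = 2, as an example.
    static double kernel(double[] x, double[] xi) {
        double dot = 0;
        for (int j = 0; j < x.length; j++) {
            dot += x[j] * xi[j];
        }
        return Math.pow(dot + 1.0, 2);
    }

    // Decision function 3.17: compare sum_i alpha_i * y_i * K(x_d, x_i) + b
    // against a user-defined threshold k (0 is the usual choice).
    static int classify(double[][] supportVectors, double[] alpha, int[] y,
                        double b, double threshold, double[] xd) {
        double sum = 0;
        for (int i = 0; i < supportVectors.length; i++) {
            sum += alpha[i] * y[i] * kernel(xd, supportVectors[i]);
        }
        return (sum + b > threshold) ? +1 : -1;
    }
}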

3.6.3 Multi-class SVM

The SVM as described above is a binary classifier. However, in many practical applications there are usually multiple categories that a data point can belong to, which is certainly the case in non-trivial document classification applications. There are several studied methods for multi-class classification [2, 8]; the usual approach is to combine several binary SVMs to produce a single multi-class SVM.

In 2005, Duan & Keerthi made an empirical study of some multi-class methods; these are summarized in the following paragraphs. For a given multi-class problem, let M denote the number of classes and \omega_i, i = 1, \ldots, M denote the M classes. For binary classification, we will refer to the two classes as positive and negative.

The first method that we will look at is the so-called one-versus-all winner-takes-all method. The method constructs M binary classifiers. The ith classifier output function p_i is trained taking the examples from \omega_i as positive and all others as negative. A new document \vec{t} is assigned the class with the largest value of p_i [2].

Another method is the one-versus-one max-wins voting method, which constructs one binary classifier for every pair of distinct classes, all together M(M-1)/2 classifiers. The binary classifier C_{ij} is trained taking the examples from \omega_i as positive and the examples from \omega_j as negative. When classifying a new document \vec{t}, all the classifiers take a vote on which of their two classes \omega_i, \omega_j should be assigned to \vec{t}. After all classifiers have voted, the class with the most votes is assigned to \vec{t} [2].

Yet another approach is known as pairwise coupling. It works under the assumption that the output of each binary classifier can be interpreted as the posterior probability of the positive class. The strategy is then to combine the outputs of all one-versus-one binary classifiers to obtain estimates of the posterior probabilities p_i = P(\omega_i \mid \vec{t}). The classifier chooses the class that yields the highest p_i; for details see [2].
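The one-versus-one max-wins voting scheme reduces to a simple counting loop once the M(M-1)/2 binary classifiers are trained. The Java sketch below assumes a hypothetical BinaryClassifier interface whose predict method returns +1 for its "i" class and -1 for its "j" class; it is an illustration, not the WEKA multi-class machinery used in the implementation.

import java.util.HashMap;
import java.util.Map;

public class OneVersusOneExample {

    // Hypothetical binary classifier for the class pair (i, j):
    // returns +1 if the document looks like class i, -1 if like class j.
    interface BinaryClassifier {
        int predict(double[] features);
    }

    // classifiers[i][j] is trained on classes i (positive) and j (negative), for i < j.
    static int classify(BinaryClassifier[][] classifiers, int numClasses, double[] features) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < numClasses; i++) {
            for (int j = i + 1; j < numClasses; j++) {
                int winner = classifiers[i][j].predict(features) > 0 ? i : j;
                votes.merge(winner, 1, Integer::sum);
            }
        }
        int best = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes.getOrDefault(c, 0) > votes.getOrDefault(best, 0)) {
                best = c;
            }
        }
        return best;
    }
}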

3.7 N-Gram based classification

In 1994, Cavnar and Trenkle proposed an N-gram-based method of text categorization [4], which was used for language and subject classification of USENET² newsgroup articles.

² USENET is a worldwide distributed Internet discussion system. It is one of the oldest computer network communications systems still in widespread use.


Common sense dictates that human languages invariably have some words which occur more frequently than others. Zipf's law, as re-stated by Cavnar and Trenkle, expresses this as:

The nth most common word in a human language text occurs with a frequency inversely proportional to n. [4]

An implication of this law is that there is always a set of words in a language that dominates most of the other words in terms of frequency of use. In English, for example, the most frequently used words are function words such as "the", "be", and "to" [48]. Cavnar and Trenkle state that the law also implies that there is always a set of words that is more frequent for a specific subject. For example, articles about sports may have many mentions of "football", "players" and "fumble", while articles about computers and technology more frequently mention "computer", "programmer" and the like.

A possible conclusion at this point is that if articles in a specific language or on a specific subject have a set of words more frequent than others, the same articles should also have a set of N-grams that are more frequent than others. This seems to be a reasonable conclusion, as experiments show that using N-grams for language identification is reliable [4, 7], and fairly reliable for subject identification [4].

The method proposed by Cavnar and Trenkle is based on comparing N-gram frequency profiles of a set of training documents to those of test documents, and categorizing the latter based on distance measures. The steps for profiling a document are as follows [4]:

• Tokenize the text, discarding digits and punctuation. Pad the tokens with sufficient blanks before and after.
• Generate all possible N-grams (including blanks) for each token, for N = 1 to 5.
• Count the occurrences of each N-gram.
• Sort the N-grams in decreasing order of occurrence, i.e. the most frequent N-gram is the first element in the sorted list.

By generating N-grams for all the training documents and merging the profiles of documents in the same category, a profile of N-gram frequencies is made for each category. The first 300 or so N-grams are considered to be very language dependent, and are thus removed from the profile [4].

Categorizing an incoming document is straightforward: generate an N-gram frequency profile for the document, measure the distance to each category profile and pick the category with the minimum distance. The profile distance measure used by Cavnar and Trenkle is simple: take two N-gram profiles and calculate a simple rank-order statistic (called the out-of-place measure) by measuring how far out of place an N-gram in one profile is from its place in the other profile. For each N-gram in the document profile, find its counterpart in the category profile and calculate how far out of place it is in terms of position (rank) in the profile. We can formulate the out-of-place measure of an N-gram n in profiles i and j as

    d_n(i, j) = \begin{cases} M & \text{if } n \text{ is not in both profiles } i \text{ and } j \\ |\text{rank}(n, i) - \text{rank}(n, j)| & \text{otherwise} \end{cases}    (3.18)

where M is a pre-defined maximum out-of-place value given if the N-gram does not exist in both profiles, and rank(n, i) is the rank of the N-gram n in profile i [4]. Using the distance measure in equation 3.18, we can construct the decision rule:

    d(t) = \arg\min_c \sum_{i \in P_t} d_i(P_t, P_c)    (3.19)

where P_c and P_t are the N-gram profiles for class c and document t respectively.
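The out-of-place measure 3.18 and decision rule 3.19 translate directly into code once a profile is represented as a rank-ordered list of N-grams. A minimal Java sketch follows; the profile construction (tokenization, counting, sorting and truncation to a maximum length) is assumed to have been done already, and the names are illustrative.

import java.util.List;
import java.util.Map;

public class OutOfPlaceExample {

    // Out-of-place measure (3.18) of N-gram n between a document profile and a
    // category profile, both given as rank-ordered lists (most frequent first).
    static int outOfPlace(String ngram, List<String> docProfile, List<String> catProfile, int maxPenalty) {
        int docRank = docProfile.indexOf(ngram);
        int catRank = catProfile.indexOf(ngram);
        if (docRank < 0 || catRank < 0) {
            return maxPenalty;
        }
        return Math.abs(docRank - catRank);
    }

    // Decision rule (3.19): pick the category whose profile has the smallest
    // total out-of-place distance to the document profile.
    static String classify(List<String> docProfile, Map<String, List<String>> categoryProfiles, int maxPenalty) {
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (Map.Entry<String, List<String>> entry : categoryProfiles.entrySet()) {
            long distance = 0;
            for (String ngram : docProfile) {
                distance += outOfPlace(ngram, docProfile, entry.getValue(), maxPenalty);
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }
}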

Experiments indicate that this method works very well for language classification and fairly well for subject classification, achieving as high as an 80% classification rate for the latter [4]. Cavnar and Trenkle note that a higher subject classification rate could probably be achieved by removing very frequently used words of the language from the data, i.e. using a stop list. It is interesting to see a reasonably good classification rate, given that the method can be seen as less sophisticated when compared to other approaches.


Chapter 4

Automatic web page categorization

4.1 Overview

In this chapter a general approach to categorizing web pages is presented. To solve the problem of automatic web page categorization, we will make use of automatic text categorization and natural language processing techniques as described in previous chapters. Before we go into details, we should define what we mean by a web page.

Definition 4.1. A web page is a web document that is encoded in the HTML or XHTML format, and can be accessed either locally or from a remote web server with the use of a web browser.

This definition is introduced to emphasize that when we refer to a web page, we refer to the actual document containing the HTML or XHTML information. The basic assumption of this method is that document classification can be directly applied to automatic web page categorization, given that the web documents are transformed into plain text documents. The method is deliberately kept very general, so that it can easily be customized for the specific practical application. For example, the method makes no assumptions about which document classifier is used.

4.2 Data gathering

For a web page to be used for categorization, it must first be somehow acquired and stored in a convenient format (which depends on the requirements of the application). For example, the document can be downloaded off the web server and its content stored in a relational database. Another, very simple, approach is to download the document and store it as is on the local file system.

It is advisable to gather the whole web page data set (both training and test samples) once and store it locally, in case the parameters of the classifier are to be tweaked during training. This is important for the simple reason that gathering thousands of web pages from the Web is time consuming even if automated.

4.3 Choice of classifier

The choice of classifier depends on the practical requirements of the application. Of course, one would preferably want to use a classifier that yields good results in the established literature for document classification. However, there may be specific requirements in the application that narrow the selection of classifier, such as the computational complexity of the training and/or testing phase. As mentioned earlier, the rest of this chapter makes no assumption about which classifier is being used.

4.4 Training

The training process requires that a set of categories has been defined, and the training documents need to be labelled with their respective categories. The chosen classifier must be trained on a training set of web pages, and preferably evaluated on a smaller testing set. The evaluation is useful for determining the optimal values of any parameters of the classifier on the available training data. Figure 4.1 illustrates the training process.

Figure 4.1: Training the classifier for web page categorization.

In the following sections, we will look closer at the individual steps of the training process.

4.5 Feature selection and extraction

A web page document itself is not very suitable for categorization, as it by definition contains HTML or XHTML. In order to use machine learning techniques for document classification, we need to reduce the web page document to a plain text document. This plain text document must then be transformed into a feature vector suitable for use with the chosen machine learning algorithm.

4.5.1 Plain text conversion

Converting a web page to a plain text document can be done quite straightforwardly by applying a regular expression that matches HTML and/or XHTML tags and replacing the occurrences with empty strings. However, web pages contain elements that are not part of the content per se, but are part of the web page design (navigation menu entries, for example) and may not be relevant for categorization purposes. Figure 4.2 shows a very simple example of a web page using HTML markup. A better approach is to make use of a parser to extract the desired content from the web page. The exact content that is extracted depends on the application, but a general approach could be to extract paragraphs and headings. The parser should implement standards from organizations such as the World Wide Web Consortium (W3C) [50] or the Web Hypertext Application Technology Working Group (WHATWG) [51].


<!DOCTYPE html>

<html>

<head>

<title>A web page</title>

</head>

<body>

<h1>A header</h1>

<p>Some text! And a <a href="#">link</a></p>

</body>

</html>

Figure 4.2: An example of a web page.
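As a concrete illustration of the parser-based approach, the following sketch uses the jsoup library (which the implementation described in chapter 5 also uses) to pull the title, headings and paragraphs out of a page such as the one in figure 4.2. The exact tag selection is an application choice, and the class name is illustrative.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PlainTextExtractionExample {

    // Extracts title, heading and paragraph text from an HTML document,
    // roughly corresponding to the T, H and P sources used in chapter 6.
    static String extract(String html) {
        Document doc = Jsoup.parse(html);
        String title = doc.title();
        String headings = doc.select("h1, h2, h3, h4, h5, h6").text();
        String paragraphs = doc.select("p").text();
        return title + " " + headings + " " + paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>A web page</title></head>"
                + "<body><h1>A header</h1><p>Some text! And a <a href=\"#\">link</a></p></body></html>";
        // Prints: A web page A header Some text! And a link
        System.out.println(extract(html));
    }
}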

4.5.2 Tokenization and stemming

With the web document converted into a plain text document, the next step is to extract the words that make up the textual content. Blindly applying a tokenizer at this point would not be advisable, seeing as the number of distinct words used throughout the entire training set is likely incredibly large. Therefore we need to reduce the number of distinct tokens, i.e. reduce the number of dimensions in the feature vector. How can we do this?

The first, easy, method of reducing the dimensions of the feature vector is to use a stop list to remove very common words such as "the" in English. These words are common in all sorts of text and are uninteresting for categorization purposes. However, the number of distinct tokens is still likely to be large. On the remaining tokens, stemming is performed to end up with a final set of stemmed tokens that make up the vocabulary. This will assign words in the same context to the same token (for example, "car" and "cars" will be assigned to the same token "car"). The number of distinct stems is very likely to be less than the number of distinct words, thus it is likely that stemming will reduce the number of dimensions significantly. The stemming algorithm to be used depends on the language of the web page, thus it may be necessary to classify the language of a certain web page using a language classifier¹ before stemming can be performed.
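A sketch of this step using Apache Lucene (the library used in the implementation, see chapter 5) follows. It assumes a reasonably recent Lucene version in which EnglishAnalyzer performs tokenization, stop-word removal and Porter-style stemming in one analysis chain; the class name is illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeAndStemExample {

    // Turns plain text into a list of lower-cased, stop-word-filtered,
    // stemmed tokens using Lucene's English analysis chain.
    static List<String> tokens(String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream stream = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();
        }
        return result;
    }
}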

We are almost ready to define the feature vector. However, for determining term weights, we also need a vocabulary. We define it as

Definition 4.2. For a set of documents \vec{d} = \{d_1, \ldots, d_N\}, its vocabulary V is the set of tokens that are extracted and stemmed from the documents in \vec{d}.

¹ A language classifier has been used in the implementation when filtering the web database, so that all pages used in the experiments are most likely in English. For more details refer to chapter 5.


4.6 Document representation

Let V be the vocabulary of the training set as stated in definition 4.2. We represent a web page as the feature vector

    \vec{d} = (w_1, \ldots, w_i, \ldots, w_N)    (4.1)

where w_i is the weight of term i and N is the size of the vocabulary V. Less formally, a web page is represented as a bag-of-words vector with weights representing the stemmed tokens.

There are several weighting schemes to be considered for the weights w_i [25]. A good choice is the popular tf-idf weighting scheme [41], described in section 2.2.4. In some algorithms, such as TWCNB, the weighting scheme is part of the algorithm. Therefore the weights can also simply be frequencies of the tokens, if the algorithm applies its own weighting scheme. In all but the simplest weighting schemes, the vocabulary must be extended with term frequencies. This is most easily done with a mapping of stemmed tokens to frequencies, either as part of the vocabulary or as a separate structure.

4.7 Categorization of unlabeled documents

Given an unlabeled document, it is fairly straightforward to categorize it using the trained classifier. As with the training data, the unlabeled document must be processed and turned into a feature vector. This is done by performing the feature extraction as described in previous sections, using the vocabulary of the training set. The feature vector of the unlabeled document is then given to the classifier's decision function, which will in turn make a prediction based on its training and return a category for the unlabeled document.


Chapter 5

Implementation

This chapter gives a brief overview of the automatic web page categorization framework that was implemented to evaluate the performance of different text classification algorithms on real web page data, as discussed in Chapter 6. The purpose of this chapter is not to convey details, but to give an overview of what technologies were used to build the framework and give proper credit to these projects.

5.1 Overview

The implementation of the automatic web page categorization framework (hereafter simply referred to as the framework) was done in Java, due to it being cross-platform and having a high availability of third-party libraries for tasks relating to machine learning and natural language processing. The framework makes use of a number of library packages to provide the methods of machine learning and natural language processing needed for automatic web page categorization, as described in Chapter 4. It also provides ways of transforming web page document data into formats needed by some of the third-party packages used. The core packages of the framework are the following:

content Interfaces and classes for extracting content from downloaded web documents.

ml Provides an interface for categorizing web pages into a set of predetermined categories, and implementation classes using concrete classifiers.

nlp Classes that use natural language processing tools to transform raw text documents into token streams.

tools Command-line tools for crawling a specific web page database used in the experiments, and for batch processing of web pages into other formats required by some of the third-party libraries.

web Classes for representing and managing web pages.

The framework also consists of several packages dealing with the experiments discussed in Chapter 6; however, these are not discussed here.


5.2 The content framework package

The content package is responsible for extracting content from web pages. For convenience, the package provides a number of different content extractors that specialize in extracting different sets of data from a web page, which is used for experimentation (see Chapter 6).

The content extractors are by themselves very simple: they get the desired content of the page (see section 5.6 for how this is done) and perform natural language processing as described in Chapter 4, using the nlp package (see section 5.4).

5.3 The ml framework package

The ml package provides category predictors using the classifiers described in Chapter 3. The classifiers are trained by taking a set of training documents which have been processed by the nlp package, converting these into feature vectors (as described in Chapter 4), and training on these. Categorization is performed similarly, by turning the unlabelled page into a feature vector and running the decision function of the classifier.

The MNB, kNN and SVM classifier implementations are provided by the WEKA machine learning Java library [39], while the TWCNB implementation is provided by the Apache Mahout machine learning Java library [44].

5.4 The nlp framework package

The nlp package provides a tool for natural language processing of English-language documents. Essentially, it takes a raw plain text document, i.e. the parsed content from a web page, and performs tokenization, stop-word removal, and stemming. This tool is used by the content package (see section 5.2) to fully extract content from a web page into a suitable format (i.e. a stream of stemmed tokens).

The bulk of the work is performed by the Apache Lucene Java library, which has many packages for natural language processing in different languages [43].

5.5 The tools framework package

The tools package implements various command-line tools. The most important tools are those that convert a set of training documents into formats suitable to be read by the classifiers. The WEKA classifiers expect documents to be in the ARFF format, which is a file format that describes a list of instances sharing a set of attributes [47]. The TWCNB classifier uses the Sequence file format, which is a binary key/value-pair format [38].

Also included in this package are tools for traversing and filtering the link database from the project provider, Whaam¹.

5.6 The web framework package

The web package implements classes for representing and managing web pages from the link database. The web page representation makes use of the jsoup HTML parser [45] for extracting text when called by the content package (see section 5.2).

¹ http://whaam.com/


Chapter 6

Results

This chapter presents the results of a set of evaluations performed using the framework described in Chapter 5. First we present the classifiers used and their parameters (if any), followed by the data and experiments. Then we continue to discuss the different types of evaluation metrics used for measuring the success of the results. The chapter concludes by presenting the results based on the evaluation metrics.

6.1 Classifiers

For the evaluation, we compare Support Vector Machines (SVM), k-nearest-neighbour (kNN), Multinomial Naive Bayes (MNB), Term-Weighted Complementary Naive Bayes (TWCNB) and the Cavnar-Trenkle N-Gram-based classifier (N-Gram).

The SVM is configured to use the Radial Basis Function (RBF) kernel with γ = 0.01; the RBF kernel is considered to be the most popular kernel for Support Vector Machines [6]. For kNN, the number of nearest neighbours is k = 30, as preliminary testing indicated that this was a fairly good compromise between a high and a low k. For the N-Gram classifier, the profiles are set to have a maximum length of 400, as suggested by the results of [4].
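For reference, the sketch below shows how these settings could be expressed with the WEKA classes mentioned in Chapter 5; the exact construction used in the framework may differ.

import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.RBFKernel;
import weka.classifiers.lazy.IBk;

public class ClassifierSetupSketch {
    public static void main(String[] args) {
        // SVM with an RBF kernel and gamma = 0.01.
        RBFKernel rbf = new RBFKernel();
        rbf.setGamma(0.01);
        SMO svm = new SMO();
        svm.setKernel(rbf);

        // kNN with k = 30 nearest neighbours.
        IBk knn = new IBk();
        knn.setKNN(30);

        // Multinomial Naive Bayes needs no parameters.
        NaiveBayesMultinomial mnb = new NaiveBayesMultinomial();

        System.out.println("k = " + knn.getKNN() + ", gamma = " + rbf.getGamma());
    }
}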

6.2 Data and experiments

To reiterate, the data used in the evaluation is a link database from the project provider Whaam (see http://www.whaam.com), from which web pages were extracted. Whaam is a discovery engine based on the idea of social surfing. On Whaam, users store links to web pages in link lists that can be shared with friends. Links are separated into a set of eight broad categories: Art & Design, Fashion, Entertainment, Blogs, Sport, Lifestyle, Science & Tech, and Miscellaneous. The data is real in the sense that the web pages have been categorized by real users in an existing public product.

As described in Chapter 1, the scope of the evaluation is to only consider web pages that are in English. A cursory examination of the data showed that the vast majority of links point to English and Swedish web sites, with many of the Swedish samples being blogs. The evaluation therefore focuses on English web sites to avoid any artificial grouping of the data that depends on language rather than on the real content (the concern being that any web page in Swedish would most likely be categorized as a blog, independent of its actual content). All of the English samples have been extracted using the tools described in Chapter 5; the distribution of the samples over the different categories is shown in table 6.1.

Table 6.1: Distribution of samples over the categories in the data.

    Category                   Number of samples
    Art & Design                              17
    Fashion                                   20
    Entertainment                            789
    Blogs                                    334
    Sport                                    319
    Lifestyle                               1125
    Science & Tech                           597
    Miscellaneous                           2546
    Total number of samples                 5747

In order to see which parts of web pages are relevant to categorization, several sources were used for gathering textual content from the web pages in the experiment. The chosen sources were inspired by the work done by Kwok on document representation for automatic classification [16], but more so by the paper by Riboni on feature selection for web page categorization [23].

For each data source, five instances were generated. In each instance, the samples that go into the training set and the test set, respectively, are randomized.
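A sketch of the kind of randomized split that produces one such instance is shown below; the 80/20 ratio and the fixed seed are illustrative assumptions, as the split proportions are not restated here.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomSplitSketch {
    // Shuffle the labelled samples and split them into a training and a test
    // set. Each of the five instances would use a new shuffle (new seed).
    public static <T> List<List<T>> split(List<T> samples, long seed) {
        List<T> shuffled = new ArrayList<>(samples);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) (shuffled.size() * 0.8); // assumed 80/20 split
        List<List<T>> result = new ArrayList<>();
        result.add(new ArrayList<>(shuffled.subList(0, cut)));               // training set
        result.add(new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // test set
        return result;
    }
}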

Seven different text sources were used, namely:

• T, the content of the <title> tag;
• H, the contents of all <h1>, ..., <h6> tags;
• P, the contents of all <p> tags;
• TH, the contents of the T and H sources;
• HP, the contents of the H and P sources;
• TP, the contents of the T and P sources;
• THP, the contents of the T, H and P sources.

If we apply this to the simple web page shown in figure 4.2, the content from the different sources can be represented by the following multisets, with HTML markup and punctuation removed:

T   = {a, web, page}
H   = {a, header}
P   = {some, text, and, a, link}
TH  = {a, web, page, a, header}
HP  = {a, header, some, text, and, a, link}
TP  = {a, web, page, some, text, and, a, link}
THP = {a, web, page, a, header, some, text, and, a, link}

Of course, as described in Chapter 4, these sources would be processed and encoded as feature vectors and not used as-is.


6.3 Evaluation metrics

A commonly used metric for judging classifier performance [29] is the micro-averaged F_1 score. It gives equal weight to each document and is therefore considered an average over all document/category pairs; it tends to be dominated by the classifier's performance on common categories [19, 29].

The F_1 score takes a value between 0 and 1, where 0 is the worst possible score and 1 is the best possible score. It is calculated using the precision (p) and recall (r) measures, defined as [19]:

p_i = \frac{TP_i}{TP_i + FP_i}, \qquad r_i = \frac{TP_i}{TP_i + FN_i}    (6.1)

where TP_i is the number of documents correctly labelled with class i (known as true positives); FP_i is the number of documents incorrectly labelled with class i (known as false positives); and FN_i is the number of documents that should have been labelled with class i but were not (known as false negatives). The F_1 measure for class i is then expressed as [19, 29]:

F_i = \frac{2 p_i r_i}{p_i + r_i}    (6.2)

The global precision and recall values are obtained by summing over all individual decisions [19]:

p = \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} (TP_i + FP_i)}, \qquad r = \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} (TP_i + FN_i)}    (6.3)

where M is the number of categories. The micro-averaged F_1 score is then defined as in equation 6.2, but using the global precision and recall values [19]:

F_1 (micro-averaged) = \frac{2pr}{p + r}    (6.4)

Seeing that table 6.1 shows that the data is dominated by a few categories, it also seems appropriate to have a metric that shows how well a classifier performs on less common categories. The macro-averaged F_1 score is understood to do just this [29]. It is calculated by first computing the individual F_1 values for the different categories, and then taking their average [19]:

F_1 (macro-averaged) = \frac{\sum_{i=1}^{M} F_i}{M}    (6.5)
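To make the definitions concrete, the sketch below computes both the micro- and macro-averaged F_1 from per-category TP/FP/FN counts; the counts used in main are made-up numbers, not results from the experiments, chosen only to show that the micro average is dominated by the large category.

public class F1Sketch {
    // Micro-averaged F1: sum the per-category counts first, then compute
    // a single global precision and recall (equations 6.3 and 6.4).
    static double microF1(int[] tp, int[] fp, int[] fn) {
        double tpSum = 0, fpSum = 0, fnSum = 0;
        for (int i = 0; i < tp.length; i++) {
            tpSum += tp[i];
            fpSum += fp[i];
            fnSum += fn[i];
        }
        double p = tpSum / (tpSum + fpSum); // global precision
        double r = tpSum / (tpSum + fnSum); // global recall
        return 2 * p * r / (p + r);
    }

    // Macro-averaged F1: compute F1 per category (equations 6.1 and 6.2),
    // then average over the M categories (equation 6.5).
    static double macroF1(int[] tp, int[] fp, int[] fn) {
        double sum = 0;
        for (int i = 0; i < tp.length; i++) {
            double p = tp[i] / (double) (tp[i] + fp[i]);
            double r = tp[i] / (double) (tp[i] + fn[i]);
            sum += (p + r) > 0 ? 2 * p * r / (p + r) : 0;
        }
        return sum / tp.length;
    }

    public static void main(String[] args) {
        // Made-up counts for three categories, one large and two small.
        int[] tp = {90, 5, 5};
        int[] fp = {10, 5, 5};
        int[] fn = {10, 5, 5};
        System.out.printf("micro F1 = %.3f, macro F1 = %.3f%n",
                microF1(tp, fp, fn), macroF1(tp, fp, fn));
        // micro F1 = 0.833, macro F1 = 0.633
    }
}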

To be thorough, the raw accuracy (simply the percentage of correct classifications) is also used as a performance measure. This measure is less informative, as it does nothing to show how well the classifier performs on individual categories.

6.4 Per-classifier performance

6.4.1 Micro-averaged F-measure

Table 6.2 shows the micro-averaged F_1 score for the different classifiers, averaged over the five instances for each source, with the standard deviation in parentheses. The same results are also shown in figure 6.1. What is observed here is that the
