
Document and Image Classification with Topic Ngram Model

TONG SHEN

Master's Thesis at NADA
Supervisor: Hedvig Kjellström

Examiner: Danica Kragic


Abstract

Latent Dirichlet Allocation (LDA) is a popular probabilistic model for information retrieval. Many extended models based on LDA have been introduced during the past 10 years. In LDA, a data point is represented as a bag (multiset) of words. In the text case, a word is a regular text word, but other types of data can also be represented as words (e.g. visual words). Due to the bag-of-words assumption, the original LDA neglects the structure of the data, i.e., all the relationships between words, which leads to information loss. As a matter of fact, the spatial relationship is important and useful. In order to explore the importance of this relationship, we focus on an extension of LDA called the Topic Ngram Model, which models the relationship among adjacent words. In this thesis, we first implement the model and use it for text classification.

Furthermore, we propose a 2D extension, which enables us to model spatial relationships of features in images.


Referat

Document and Image Classification with the Topic Ngram Model

Latent Dirichlet Allocation (LDA) is a popular probabilistic model for information retrieval. Several extended models based on LDA have been introduced during the past 10 years. In LDA, a data point is represented as a bag (multiset) of words. In the text case, a word is simply an ordinary text word, but other types of data can also be represented as words (e.g. visual words). Due to the bag-of-words assumption, the original LDA neglects the structure of the data, i.e., all the relationships between words, which leads to information loss. In fact, the spatial relationship is important and useful. In order to explore the importance of this relationship, we focus on an extension of LDA called the Topic Ngram Model, which models the relationship between adjacent words. In this thesis, we first implement the model and use it for text classification.

Furthermore, we propose a 2D extension, which enables us to model spatial relationships of features in images.


Acknowledgment

I would like to thank Professor Hedvig Kjellström, who was my supervisor for the thesis. During the project, she gave me a great deal of help, inspiration and encouragement. Whenever I had questions, she would give me feedback promptly, which enabled me to solve the problems quickly. I would also like to thank Cheng Zhang, who was my co-supervisor for the thesis. She always explained the problems very clearly to me, which helped me gain more insight into the thesis.


Contents

Acknowledgment
1 Introduction
  1.1 Contribution
  1.2 Outline
2 Background
  2.1 Related Work
  2.2 Spatial Structure
    2.2.1 Text Structure
    2.2.2 Image Structure
  2.3 Classification
3 Standard LDA
  3.1 Terms
  3.2 Multinomial Distribution and Dirichlet Distribution
  3.3 Model
  3.4 Inference
4 Topic Ngram Model
  4.1 Model
  4.2 Inference
    4.2.1 Topic Assignment
    4.2.2 State Variable
5 Implementation
  5.1 Training
  5.2 Estimation
  5.3 Programming
6 Experimentation
  6.1 Experiments
  6.2 Confusion Matrix
  6.3 Topic Merging
  6.4 Effect of Parameters
    6.4.1 K
    6.4.2 α
    6.4.3 β
    6.4.4 γ and δ
  6.5 No-Stop-Words Corpus
7 Extended Model in Image Classification
  7.1 Model
  7.2 Implementation
    7.2.1 Training
    7.2.2 Estimation
  7.3 Experiments
8 Conclusion
  8.1 Future work
Bibliography


Chapter 1

Introduction

Probabilistic models, which have become more and more popular in recent years, are an important branch of machine learning, and studies in machine learning have benefited greatly from these models. A probabilistic model is based on probability theory. By using various kinds of distributions and Bayes' rule, models can effectively describe the data and infer the information we need. As computational power has significantly improved, we can process large and complex data more efficiently using probabilistic models.

One of the important goals in machine learning is dimensionality reduction. How to represent data is a big issue in machine learning because high-dimensional data is complicated and difficult to interpret. To perform classification or clustering, we prefer simplified data that is defined in a lower-dimensional space.

There are multiple approaches to dimensionality reduction, such as PCA (Principal Component Analysis) [1] and Isomap [2]. PCA is a linear dimensionality reduction algorithm that seeks the directions along which the data has the greatest variance and represents the data using those directions. Isomap is a nonlinear method that aims to determine the intrinsic geometry of the data and then simplify it.

Among numerous dimensionality reduction algorithms, there is a family of models such as [3][4] called topic models, in which documents are modeled as a mixture of topics, which are latent variables, and each basic unit, a word, is associated with a topic. There is usually a generative process in these models where topics, words and word assignments are generated through a series of assumptions and conditional probabilities. Given a collection of documents, parameters can be obtained by estimating the posterior probability according to the generative process.

Latent Dirichlet Allocation (LDA) [3] is a prevalent model that has inspired many variants for topic modeling. LDA assumes that each document has a Dirichlet distribution over topics and that the same words can be drawn from different topics. More importantly, the whole process is based on the bag-of-words assumption, which means the words are drawn independently without any order. Although the assumption is not realistic, it simplifies the issue and can solve many problems in practice.


LDA is not perfect because of its bag-of-words assumption, which leads to information loss. Many modified models have been proposed in order to retain more spatial information. In this thesis, we focus on one of these models, called the Topic Ngram Model (TNG) [5], where words are no longer isolated from each other; instead, they can form phrases.

1.1 Contribution

Our contribution is that we implemented TNG in C++ and evaluated its performance in text classification using a benchmark text corpus. Moreover, since the original model is only for text, which is one-dimensional information, we also created an extension of the model that can deal with images, which are two-dimensional information. The performance of this extension in image classification was also evaluated and compared with standard LDA.

1.2 Outline

The thesis is divided into 8 chapters. Chapter 2 describes the background of this research and the problem we are facing. Chapters 3 and 4 are mainly about the theory of standard LDA and TNG, including the model structures and inference methods. Chapter 5 gives the details of the implementation. The experimentation on text classification is then presented in Chapter 6. The extension of TNG is introduced in Chapter 7, where the difference from the original TNG is explained and the results are discussed. Finally, there is a conclusion, in which we give a brief summary and discuss future work.


Chapter 2

Background

2.1 Related Work

After LDA was published, many extended versions were proposed. Blei et al. introduce an extended LDA, the Correlated Topic Model (CTM) [6], where the Dirichlet distribution over topics is replaced with a logistic normal distribution. The model can therefore be more expressive because it captures the correlation between topics. Its drawback is the non-conjugacy of the multivariate Gaussian distribution with the multinomial distribution, which sacrifices some computational convenience. Nevertheless, it can still outperform LDA in many circumstances. Kim et al. propose an LDA with hierarchical corpora [7]. By organizing documents in a multilevel form before inference, this model can discover topics more accurately, especially for complicated corpora. In [8], an online version of LDA is proposed.

Since the original LDA performs inference in batch mode, it can hardly analyze a large collection of documents efficiently, because it must scan the entire corpus in each iteration. Online LDA models are able to deal with documents in a stream. By analyzing a mini-batch at a time and performing updates based on the weighted previous parameters, online LDA can converge as well as or better than static LDA models. Moreover, it saves a large amount of computational time.

The Dynamic Topic Model (DTM) [9] is another variant of LDA, which focuses on the time evolution of topics. Given a series of corpora sorted by time, the model can estimate the tendency of topics and then make predictions about future topics. To some extent, the model is composed of a sequence of original LDA models, each of which represents the state of topics at a certain time.

Thanks to the unsupervised way of learning, LDA models do not require much manual labor. However, from another angle, these models may not always give satisfactory results. If we were able to provide the models with some guidance, it would be easier for them to obtain more reasonable patterns and make effective predictions. Therefore, many supervised LDA models have been proposed. Blei et al. introduce a single-labeled LDA model [10], in which each document has a value indicating its content or category. After training all the documents associated with corresponding labels, any unseen query can be evaluated with a label showing its relation to other documents. To strengthen the supervision, Ramage et al. present a more elaborate model with multi-labeled documents [11].

Instead of allowing the model to assign topics by itself, they constrain the documents to certain topics. With this constraint, irrelevant topics are ignored and the related topics are emphasized. As seen from the experiments, the model can be very competitive in applications such as topic visualization, tagged document visualization and so on.

Another direction of research on topic models is non-parametric modeling.

Although LDA learns in an unsupervised way, it still requires setting the number of topics manually before training. Since our target is dimensionality reduction, using too many topics defeats the purpose; moreover, increasing the number of topics correspondingly increases the computational time. Conversely, using too few topics also produces bad results, because the data is underfitted and more dimensions may be required to describe it. Hierarchical Dirichlet Processes (HDP) [12] have proved to be a good solution to this kind of problem. HDP is inspired by the Dirichlet Process (DP), which is an approach to representing non-parametric mixture models. By including a hierarchical structure in the model, HDP makes it possible for different groups of data to share mixture components, which is exactly the situation in LDA. Additionally, an online HDP model [13] has also been introduced for dealing with streams of data.

It is worth noting that LDA can be applied not only to text retrieval, but also to image analysis such as image classification and image segmentation. In [14], the LDA model is used to discover the objects in images. In analogy to document processing, words in this setting are codewords (or textons), which are the basic units of images. However, unlike text words, which have exactly the same form in different documents, codewords can hardly be identical. As a result, codewords should be sufficiently tolerant to variation and robust across images. In practice, SIFT features are often adopted. Fei-Fei et al. present a modified LDA model for classifying scene images into their corresponding categories [15].

In their model, some supervision is involved, as the training images are annotated with a tag. According to the results, although the performance is not as good as that of approaches involving a great deal of annotation work, it greatly reduces the tedious manual work and shows potential for better performance.

One of the problems that LDA faces is its bag-of-words assumption. It indeed simplifies the structure and reduces computational time, but it suffers from a large amount of information loss, such as the relations between words or the spatial correlation among image features, which are considered to play a significant role in documents.

Researchers have therefore started to include more information in the model. In [16], in order to recognize objects, images are first segmented so as to retain partial spatial information, and then each segment is viewed as a separate document. After training, all the separated parts are sorted and the most related part is chosen as the desired object. Similarly, in [17], the images are also segmented first, but the difference is that each segment is not a separate document; instead, it is a sub-document with a topic assignment. After inference, for segmentation, the parts with the same topic can merge and form the appropriate segmentation; for classification, the most related topic is selected as the category. It can be seen that due to the extra information, the performance is greatly improved. As for text processing, there have also been many attempts. One model [18] combines HMM and LDA and intends to simultaneously capture the long-range dependency, which is indicated by topics, and the short-range dependency, which occurs within sentences. Words are thus put into either the topic class or syntactic classes. It is a good attempt to extract topics while retaining the function of the words in sentences. Wallach introduces a model that tries to capture the relation between every two adjacent words [19].

Specifically, the word is not only dependent on the global topic distribution, but also on the previous word. The model in [5] has a more elaborate structure, in which an extra variable, a state variable, is used to determine whether to treat two words as a phrase. This model is the one we are going to explore; it is interesting to study how important the spatial relationship is.

2.2 Spatial Structure

2.2.1 Text Structure

Imagine that we only have a bag of words without any order. It might be possible to roughly get some of the meaning conveyed by the article, but we can never know the exact meaning. This is why we need grammar: only when people follow the grammar can they understand each other. Although it is difficult to capture all the grammar in documents, even a little would be very helpful. In the model we plan to study, frequently occurring phrases can be captured and treated as one unit instead of two words. For example, "Red Square" is a place in Moscow and the two words should be used together. If they are separated, we lose information. Therefore, when analyzing the document, "Red Square" should be a phrase rather than two words.

2.2.2 Image Structure

Dealing with images is more complex. One reason is that text already has words that can be used directly as tokens, but images have no such ready-made words. Thus, we have to extract features from images and create a visual word vocabulary manually.

The second reason is that, unlike text, where all words lie in a one-dimensional array, features in images lie in a two-dimensional space. In order to apply TNG to images, modification is required.

Visual words are features extracted from the images. There are many features that can be used. In this thesis, we use SIFT [20] features. SIFT features are invariant to many factors such as scale, rotation and illumination, which means the features can represent images in a more general way; two similar features can be very close in the feature space.


SIFT features can be used as global features or local features. If we used global features, it would be difficult to decide the neighbors of a feature, since the interest points are not arranged in a regular form. Dense SIFT features are suited to our situation. More specifically, we extract SIFT features in local patches. To transform the image data into text-like data, some additional work is involved. First, the image is divided into equal-sized patches and each patch is represented by a SIFT feature extracted from the patch. Then K-means clustering is performed to group the features into K clusters, which creates a visual word vocabulary. In this way, all images can be represented by two-dimensional text-like information.
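To make the pipeline concrete, the following is a minimal sketch of the patch-and-cluster step, assuming OpenCV's SIFT implementation and k-means; the patch size and vocabulary size K are illustrative choices, not values taken from this thesis.

#include <opencv2/opencv.hpp>
#include <vector>

// Extract one dense SIFT descriptor per equal-sized patch of a grayscale image.
cv::Mat denseSift(const cv::Mat& img, int patch = 16) {
    std::vector<cv::KeyPoint> kps;
    for (int y = patch / 2; y < img.rows; y += patch)
        for (int x = patch / 2; x < img.cols; x += patch)
            kps.emplace_back((float)x, (float)y, (float)patch);
    cv::Mat desc;
    cv::SIFT::create()->compute(img, kps, desc);   // one 128-d row per patch (CV_32F)
    return desc;
}

// Cluster all descriptors of the training images into K visual words.
cv::Mat buildVocabulary(const std::vector<cv::Mat>& allDesc, int K = 500) {
    cv::Mat data, labels, centers;
    cv::vconcat(allDesc, data);
    cv::kmeans(data, K, labels,
               cv::TermCriteria(cv::TermCriteria::EPS + cv::TermCriteria::COUNT, 50, 1e-3),
               3, cv::KMEANS_PP_CENTERS, centers);
    return centers;                                // K x 128 cluster centers
}

Mapping each patch descriptor to the index of its nearest center then turns an image into a two-dimensional grid of visual words.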

Compared to text documents, the spatial structure in images is geometric. If some visual words always appear together, then they have a geometric relation and it is more reasonable to treat them as one unit. Because images are more complex than text, it is interesting to investigate how important these small geometric relations are in classification.

2.3 Classification

The main goal of this thesis is to classify text documents as well as images. Generally, there are two sets of data: one is for training and the other is for estimation. Each document or image is associated with a class. First we use the training set to train the model, making it find patterns for the classes. Then we use the other dataset to test whether the classification is successful.

In practice, there are multiple ways of classifying labeled documents or images.

One method is to include supervision in the model as in [10][11], in which labels are a part of the model. Since the classification is integrated into the model, no extra step is required to perform classification. The other method, which is what we use, is to treat classification as a separate step that utilizes the result of the model's unsupervised learning. More specifically, after training and validation, each document is represented by a vector, which is the topic distribution of the document. Thanks to this step, we are able to map the complex data into a low-dimensional representation learned from the data, which enables more robust classification.

A classifier such as an SVM would be a good choice, but in this thesis we do not use an SVM because it involves more factors to tune due to the different choices of kernel function. Instead, a simple approach is adopted. After obtaining all vectors, we simply find a center for each category in the training set, which is the average of all topic distribution vectors belonging to that category. Labeling a test document is then nothing but finding the center nearest to its vector.
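As an illustration of this nearest-center scheme, here is a minimal sketch; the Euclidean distance is our own assumption, since the thesis does not fix a particular metric.

#include <cmath>
#include <limits>
#include <map>
#include <vector>

using Vec = std::vector<double>;

// Average the theta vectors of each training class to get one center per class.
std::map<int, Vec> classCenters(const std::vector<Vec>& thetas, const std::vector<int>& labels) {
    std::map<int, Vec> sum;
    std::map<int, int> cnt;
    for (size_t i = 0; i < thetas.size(); ++i) {
        Vec& s = sum[labels[i]];
        if (s.empty()) s.assign(thetas[i].size(), 0.0);
        for (size_t k = 0; k < thetas[i].size(); ++k) s[k] += thetas[i][k];
        ++cnt[labels[i]];
    }
    for (auto& [c, s] : sum)
        for (double& v : s) v /= cnt[c];
    return sum;
}

// Label a test document by the nearest class center.
int classify(const Vec& theta, const std::map<int, Vec>& centers) {
    int best = -1;
    double bestDist = std::numeric_limits<double>::max();
    for (const auto& [c, center] : centers) {
        double d = 0.0;
        for (size_t k = 0; k < theta.size(); ++k)
            d += (theta[k] - center[k]) * (theta[k] - center[k]);
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
}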


Chapter 3

Standard LDA

In Standard LDA, documents are assumed to be exchangeable in the corpus and the words in each document are also exchangeable within the document. The latter assumption is widely known as the "bag-of-words" assumption, in which the order of words is ignored. However, the "bag-of-words" assumption does not imply that the words are independent and identically distributed [3]. In the generative process, the words are generated based on how topics are distributed, which is what we are going to focus on.

3.1 Terms

In the context of text retrieval, terms such as “corpus”, “document” and “word” are usually adopted. In other fields of application, there are corresponding terms by analogy. The notation in this thesis will follow the original usage.

Given a corpus, a vocabulary with size V can be obtained, which means there are V unique words. Therefore, a word is represented by its index in the vocabulary.

w_i is the ith word in a document. A document with N_d words can then be represented by w_d = {w_1^{(d)}, w_2^{(d)}, . . . , w_{N_d}^{(d)}}, where d is the index of the document in the corpus. The whole corpus with M documents is denoted by D = {w_1, w_2, . . . , w_M}. Suppose there are K topics; each word in document d is assigned a topic z_i^{(d)}.

3.2 Multinomial Distribution and Dirichlet Distribution

The multinomial distribution describes the counts of k possible outcomes over n independent trials. It has the probability mass function:

Pr(n | x) = \frac{n!}{n_1! \cdots n_k!} \prod_{i=1}^{k} x_i^{n_i}

The Dirichlet distribution is a family of distributions that can be used as the prior for the multinomial distribution due to its conjugacy. It is usually viewed as a distribution over distributions, with hyper-parameters. If a vector follows a Dirichlet distribution, x ∼ Dir(α), it has the density:

Pr(x; α) = \frac{Γ(\sum_{i=1}^{k} α_i)}{\prod_{i=1}^{k} Γ(α_i)} \prod_{i=1}^{k} x_i^{α_i − 1}

Without the normalizer, the product of the likelihood and the prior merges into another Dirichlet:

Pr(n | x) Pr(x; α) ∝ \prod_{i=1}^{k} x_i^{n_i + α_i − 1}

Pr(x | n, α) = Dir(n + α)

Then the posterior distribution can be simplified by the conjugacy and easily calculated.

The hyper-parameter, α in this case, plays an important role in the distribution.

It controls how concentrated the distribution is. From another angle, it serves as pseudo-counts in the posterior distribution. If large hyper-parameters are chosen, the actual counts can hardly have a significant effect, and vice versa. Besides, it also prevents zero counts in the real data from producing zero probabilities.
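As a small numerical illustration of the pseudo-count effect (the numbers are ours, not from the thesis): suppose k = 3 and the observed counts are n = (2, 1, 0). The posterior is Dir(n + α), with mean

E[x_i | n, α] = \frac{n_i + α_i}{\sum_{j=1}^{k} (n_j + α_j)}

With a symmetric α = 0.1, the mean is (2.1, 1.1, 0.1)/3.3 ≈ (0.64, 0.33, 0.03), so the data dominates; with α = 10, it is (12, 11, 10)/33 ≈ (0.36, 0.33, 0.30), so the prior dominates. In both cases the outcome with zero count keeps a nonzero probability.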

3.3 Model

LDA is a generative probabilistic model. The basic idea of LDA is that each document is a mixture of topics. Topics with a high proportion have a major effect on the content of the document, while topics with a small proportion affect the document only slightly. The word distribution also differs between topics. For example, if a document is about economic policies in education, then economy and education are the major topics, which implies that words such as university, school, money and tuition fee have a high probability of appearing in the document. But there could be other words that seem unrelated to these topics, such as cat, dog and mouse. It seems unlikely, but there is still a possibility, if the author of the document mentioned them as a metaphor or used them in some example.

The whole generative process can be described as follows.

For the whole corpus:
Draw φ_z ∼ Dir(β) for each topic.

For each document w_d in D:
1. Draw θ_d ∼ Dir(α)
2. For each word w_i in document d:
   a) Draw z_i^{(d)} ∼ Cat(θ_d)
   b) Draw w_i^{(d)} ∼ Cat(φ_{z_i^{(d)}})

Figure 3.1. LDA graphical model (from the original LDA paper [3]).

Table 3.1. Notation in standard LDA:
V: Number of unique words
M: Number of documents
K: Number of topics
N_d: Number of words in document d
w_i^{(d)}: The ith word in document d
z_i^{(d)}: The ith topic assignment in document d
θ_d: Topic distribution of document d
φ_z: Word distribution of topic z
α: Hyper-parameter for θ
β: Hyper-parameter for φ

Dir stands for the Dirichlet distribution. Formally, the hyper-parameters should be vectors, but in our case we use a scalar to describe a symmetric distribution, which means that the hyper-parameter is the same for all elements and there is no bias towards particular outcomes before training. Cat stands for the Categorical distribution, a special multinomial distribution that draws only one sample. Table 3.1 summarizes the notation.
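As a concrete illustration of this process (not code from the thesis), the following C++ sketch samples one synthetic document; Dirichlet draws are obtained by normalizing independent Gamma variates, and Cat(·) is std::discrete_distribution. The sizes and hyper-parameters are arbitrary.

#include <random>
#include <vector>

// Draw a symmetric Dirichlet sample by normalizing Gamma(alpha, 1) variates.
std::vector<double> drawDirichlet(double alpha, int dim, std::mt19937& rng) {
    std::gamma_distribution<double> g(alpha, 1.0);
    std::vector<double> x(dim);
    double sum = 0.0;
    for (double& v : x) { v = g(rng); sum += v; }
    for (double& v : x) v /= sum;
    return x;
}

int main() {
    const int K = 10, V = 1000, Nd = 200;
    const double alpha = 0.1, beta = 0.1;
    std::mt19937 rng(0);

    // phi_z ~ Dir(beta) for each topic (corpus level)
    std::vector<std::vector<double>> phi;
    for (int z = 0; z < K; ++z) phi.push_back(drawDirichlet(beta, V, rng));

    // theta_d ~ Dir(alpha), then z_i ~ Cat(theta_d), w_i ~ Cat(phi_{z_i})
    std::vector<double> theta = drawDirichlet(alpha, K, rng);
    std::discrete_distribution<int> drawZ(theta.begin(), theta.end());
    std::vector<int> doc(Nd);
    for (int i = 0; i < Nd; ++i) {
        int z = drawZ(rng);
        std::discrete_distribution<int> drawW(phi[z].begin(), phi[z].end());
        doc[i] = drawW(rng);
    }
    return 0;
}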

The generative process can also be described by a graph, which is shown in Figure 3.1. In the graph, the plates denote replication of nodes and the shaded node indicates that the variable is observed. It is worth noting that φ is a global variable and is only sampled once for the entire corpus. θ is a document-level variable, which is sampled for each document. z and w are word-level variables, which are sampled for each word. Our goal is to obtain φ and θ after training. θ then becomes a lower-dimensional representation of each document, which reflects the most important feature of LDA: dimensionality reduction. φ, as a global variable, will be used to test queries.

According to the model, the joint probability of document d is given by:

Pr(θ^{(d)}, z^{(d)}, w^{(d)}, φ; α, β) = \prod_{n=1}^{K} Pr(φ_n | β) \, Pr(θ^{(d)} | α) \prod_{i=1}^{N_d} Pr(z_i^{(d)} | θ^{(d)}) \, Pr(w_i^{(d)} | z_i^{(d)}, φ_{z_i^{(d)}})

For the whole corpus, the probability can be expressed as:

Pr(θ, z, w, φ; α, β) = \prod_{n=1}^{K} Pr(φ_n | β) \prod_{d=1}^{M} Pr(θ^{(d)} | α) \prod_{i=1}^{N_d} Pr(z_i^{(d)} | θ^{(d)}) \, Pr(w_i^{(d)} | z_i^{(d)}, φ_{z_i^{(d)}})

In order to get the most fitting φ and θ, we must maximize the probability of the observed documents, Pr(w | α, β). Marginalizing the latent variables, we obtain:

Pr(w | α, β) = \int\!\!\int \prod_{n=1}^{K} Pr(φ_n | β) \prod_{d=1}^{M} Pr(θ^{(d)} | α) \prod_{i=1}^{N_d} \sum_{z_i^{(d)}} Pr(z_i^{(d)} | θ^{(d)}) \, Pr(w_i^{(d)} | z_i^{(d)}, φ_{z_i^{(d)}}) \, dφ \, dθ

In the usual way, we could compute the derivative of this expression, set it equal to zero, and thereby reach the optimal solution. Unfortunately, due to the coupling of z and φ, it is impossible to calculate this precisely. To solve such intractable problems, two main methods, Variational Inference and Gibbs Sampling, are often used in practice; they are discussed in more detail in the next section.

3.4 Inference

Variational Inference and Gibbs Sampling are two good approaches to dealing with inference problems that cannot be solved analytically.

The basic idea of the Variational Inference approach is to factorize the model and use variational parameters to describe the distributions of the latent variables, so as to break the coupling among variables. The goal is then to find a lower bound of the likelihood using Jensen's inequality and maximize that lower bound.

The advantage of this method is that it converges very fast. However, it requires a complex derivation of the equations especially for some complex models. Besides, the approximation is not always accurate.

On the contrary, Gibbs Sampling is easy to implement and can obtain a better approximation if the iteration is long enough, but the drawback is that it is very time-consuming. The main idea behind it is to sample each latent variable in turn while treating the other variables as known quantities. It has been proved that although the variables are initialized randomly, they converge after enough iterations. Because of its simplicity, we will focus on this approach and show more details.

Gibbs Sampling is a particular method from Markov Chain Monte Carlo (MCMC) theory, which attempts to approximate complex distributions by drawing samples sequentially, generating a Markov chain. More specifically, in the LDA model what we need to know is all the topic assignments; θ and φ can then be calculated easily. Therefore, for each assignment, it is necessary to know its distribution conditioned on all the other assignments, excluding the current one. The equation is given by:

Pr(z_i^{(d)} = j | w, z_{-i}^{(d)}; α, β) = \frac{Pr(z_i^{(d)} = j, z_{-i}^{(d)}, w; α, β)}{Pr(z_{-i}^{(d)}, w; α, β)}

∝ Pr(z_i^{(d)} = j, z_{-i}^{(d)}, w; α, β)

∝ Pr(w_i^{(d)} | z_i^{(d)} = j, z_{-i}^{(d)}, w_{-i}^{(d)}; α, β) \, Pr(z_i^{(d)} = j | z_{-i}^{(d)}, w_{-i}^{(d)}; α, β)

∝ Pr(w_i^{(d)} | z_i^{(d)} = j, z_{-i}^{(d)}, w_{-i}^{(d)}; α, β) \, Pr(z_i^{(d)} = j | z_{-i}^{(d)})

where z_{-i}^{(d)} represents all the remaining assignments excluding the ith in document d and w_{-i}^{(d)} means all words except for the current one.

For the two parts, we are going to integrate θ and φ out and represent them by empirical counts.

Pr(w_i^{(d)} | z_i^{(d)} = j, z_{-i}^{(d)}, w_{-i}^{(d)}; α, β) = \int Pr(w_i^{(d)} | z_i^{(d)} = j, φ_j) \, Pr(φ_j | z_{-i}^{(d)}, w_{-i}^{(d)}; α, β) \, dφ_j = \frac{n^{-i}_{j, w_i^{(d)}} + β}{n^{-i}_{j, (·)} + V β}

Pr(z_i^{(d)} = j | z_{-i}^{(d)}) = \int Pr(z_i^{(d)} = j | θ_d) \, Pr(θ_d | z_{-i}^{(d)}) \, dθ_d = \frac{q^{-i}_{d, j} + α}{q^{-i}_{d, (·)} + K α}

Pr(z_i^{(d)} = j | w, z_{-i}^{(d)}; α, β) ∝ \frac{n^{-i}_{j, w_i^{(d)}} + β}{n^{-i}_{j, (·)} + V β} \cdot \frac{q^{-i}_{d, j} + α}{q^{-i}_{d, (·)} + K α}

n and q are used to count words and topic assignments respectively: n^{-i}_{j, w_i^{(d)}} is the count of word w_i^{(d)} in topic j without the current word; n^{-i}_{j, (·)} is the corresponding marginalized count; q^{-i}_{d, j} stores the count of topic j in document d excluding the current assignment.

After normalization, a new sample can be drawn as an update for z_i^{(d)}. It is obvious from the last step of the equation that we always use all the remaining data as the evidence when predicting the current one. Starting from a random initialization, after sufficient iterations the latent variables tend to reach where they are supposed to be, and finally we can find our approximate solution. Next, θ and φ can be computed by:

θ_z^{(d)} = \frac{α + q_{d,z}}{K α + q_{d,(·)}}, \qquad φ_{z,w} = \frac{β + n_{z,w}}{V β + n_{z,(·)}}
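In code, one collapsed Gibbs step for a single token follows the last proportionality directly. The sketch below uses our own count-array names (n, nSum, q); the per-document denominator is omitted because it is constant in j and disappears after normalization.

#include <random>
#include <vector>

// n[j][w]: tokens of word w assigned to topic j; nSum[j]: row sums of n;
// q[j]: topic counts in the current document; z: current assignment of this token.
int resampleTopic(int w, int z,
                  std::vector<std::vector<int>>& n, std::vector<int>& nSum,
                  std::vector<int>& q,
                  double alpha, double beta, int V, std::mt19937& rng) {
    const int K = (int)n.size();
    // remove the current token from all counts
    --n[z][w]; --nSum[z]; --q[z];

    // unnormalized conditional; (N_d - 1 + K*alpha) is constant in j and dropped
    std::vector<double> p(K);
    for (int j = 0; j < K; ++j)
        p[j] = (n[j][w] + beta) / (nSum[j] + V * beta) * (q[j] + alpha);
    std::discrete_distribution<int> draw(p.begin(), p.end());
    int newZ = draw(rng);

    // add the token back with its new assignment
    ++n[newZ][w]; ++nSum[newZ]; ++q[newZ];
    return newZ;
}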

The above is a general introduction to the standard LDA model. In the following, we extend it and introduce the Topic Ngram Model, which can be more effective.


Chapter 4

Topic Ngram Model

In standard LDA, the bag-of-words assumption indeed simplifies the computation and avoids many complex issues, but at the same time it brings a large amount of information loss. From a semantic and linguistic point of view, the relations among words are a significant part of analyzing a document and can be essential when performing text retrieval.

As we discussed in Section 2.2.1, "Red Square" has a special meaning beyond two simple words. If we treat the words in isolation, we may lose some important information. In this chapter, we relax the bag-of-words assumption and try to integrate the search for phrases into LDA. The model we are going to use is TNG (Topic Ngram Model). The notation and inference are kept consistent with standard LDA for comparison.

Figure 4.1. TNG graphical model. (from the original TNG paper [5])


4.1 Model

Figure 4.1 depicts the structure of TNG. Compared with standard LDA, it has a new latent variable besides the topic assignment z, which is the state variable x.

The purpose of this variable is to determine whether two adjacent words should form a phrase. As a consequence, a word depends not only on the topic assignment, but also on the previous word due to the phrase connection. Thus we need additional parameters to describe the extra distribution.

The following is the generative process of this model; all the notation is given in Table 4.1.

Table 4.1. Notation in TNG:
V: Number of unique words
M: Number of documents
K: Number of topics
N_d: Number of words in document d
w_i^{(d)}: The ith word in document d
z_i^{(d)}: The ith topic assignment in document d
x_i^{(d)}: The ith state variable in document d
θ_d: Topic distribution of document d
φ_z: Word distribution of topic z
ψ_{zw}: Distribution of state variables given previous topic z and previous word w
σ_{zw}: Distribution of words given current topic z and previous word w
α: Hyper-parameter for θ
β: Hyper-parameter for φ
γ: Hyper-parameter for ψ
δ: Hyper-parameter for σ

For the whole corpus:
1. Draw φ_z ∼ Dir(β) for each topic.
2. Draw ψ_{zw} ∼ Dir(γ) for each topic and word.
3. Draw σ_{zw} ∼ Dir(δ) for each topic and word.

For each document d:
Draw θ_d ∼ Dir(α)
For each word:
1. Draw x_i^{(d)} ∼ Ber(ψ_{z_{i−1}^{(d)} w_{i−1}^{(d)}})
2. Draw z_i^{(d)} ∼ Cat(θ_d)
3. Draw w_i^{(d)} ∼ Cat(σ_{z_i^{(d)} w_{i−1}^{(d)}}) if x_i^{(d)} = 1; draw w_i^{(d)} ∼ Cat(φ_{z_i^{(d)}}) if x_i^{(d)} = 0

Ber represents the Bernoulli distribution, which is a special case of the Categorical distribution with only two outcomes. According to the generative process, if the word is not related to any phrase, it is generated in the same way as in standard LDA. If the word is supposed to form a phrase with the previous word, the choice of word depends on the current topic as well as the previous word.

Since the first word of a document has no previous word, its state variable is always set to zero. Another issue is how to determine the topic assignment of a phrase, as the two words might belong to different topics. In practice, we can either force the latter word to inherit the topic of the former word, or choose the latter word as the central word and use its topic. In this thesis, we use the second method.
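For reference, the joint distribution implied by this generative process is written out below; this is our own summary of Figure 4.1, with the factors for the first word of each document understood accordingly.

Pr(w, z, x, θ, φ, ψ, σ; α, β, γ, δ) = \prod_{z} Pr(φ_z | β) \prod_{z, w} Pr(ψ_{zw} | γ) \, Pr(σ_{zw} | δ) \prod_{d=1}^{M} Pr(θ_d | α) \prod_{i=1}^{N_d} Pr(x_i^{(d)} | ψ_{z_{i−1}^{(d)} w_{i−1}^{(d)}}) \, Pr(z_i^{(d)} | θ_d) \, Pr(w_i^{(d)} | x_i^{(d)}, φ_{z_i^{(d)}}, σ_{z_i^{(d)} w_{i−1}^{(d)}})

where the last factor equals Pr(w_i^{(d)} | φ_{z_i^{(d)}}) when x_i^{(d)} = 0 and Pr(w_i^{(d)} | σ_{z_i^{(d)} w_{i−1}^{(d)}}) when x_i^{(d)} = 1.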

4.2 Inference

4.2.1 Topic Assignment

In comparison with standard LDA, we should update not only the topic assignments but also the state variables. Thus, for each token in a document, two separate update steps are performed. Let us start with the topic assignment:

Pr(z_i^{(d)} = j | w, z_{-i}^{(d)}, x; α, β, γ, δ) = \frac{Pr(z_i^{(d)} = j, z_{-i}^{(d)}, x, w; α, β, γ, δ)}{Pr(z_{-i}^{(d)}, x, w; α, β, γ, δ)}

∝ Pr(z_i^{(d)} = j, z_{-i}^{(d)}, x, w; α, β, γ, δ)

∝ Pr(w_i^{(d)} | z_i^{(d)} = j, z_{-i}^{(d)}, x, w_{-i}^{(d)}; α, β, γ, δ) \, Pr(z_i^{(d)} = j | z_{-i}^{(d)}, x, w_{-i}^{(d)}; α, β, γ, δ)

∝ Pr(w_i^{(d)} | z_i^{(d)} = j, z_{-i}^{(d)}, x, w_{-i}^{(d)}; α, β, γ, δ) \, Pr(z_i^{(d)} = j | z_{-i}^{(d)})

The second part is identical to what is in standard LDA, where:

Pr(z_i^{(d)} = j | z_{-i}^{(d)}) = \frac{q^{-i}_{d, j} + α}{q^{-i}_{d, (·)} + K α}

As to the first part, the result is dependent on the state variable of w_i^{(d)}. If x_i^{(d)} = 0, w_i^{(d)} will have no relation to the adjacent words and only be affected by the current topic, which leads to an identical calculation as in standard LDA:

Pr(w_i^{(d)} | z_i^{(d)} = j, z_{-i}^{(d)}, x, w_{-i}^{(d)}; α, β, γ, δ) = \frac{n^{-i}_{j, w_i^{(d)}} + β}{n^{-i}_{j, (·)} + V β}


If x_i^{(d)} = 1, things become different. w_i^{(d)} will depend on the previous word and the current topic, which is controlled by σ. Therefore, the equation becomes:

Pr(w_i^{(d)} | z_i^{(d)} = j, z_{-i}^{(d)}, x, w_{-i}^{(d)}; α, β, γ, δ) = \frac{m^{-i}_{j, w_{i−1}^{(d)}, w_i^{(d)}} + δ}{m^{-i}_{j, w_{i−1}^{(d)}, (·)} + V δ}

where m^{-i}_{j, w_{i−1}^{(d)}, w_i^{(d)}} represents the count of w_i^{(d)} with topic j and previous word w_{i−1}^{(d)}.

Combining the two conditions, we obtain the distribution of z_i^{(d)}:

Pr(z_i^{(d)} = j | w, z_{-i}^{(d)}, x; α, β, γ, δ) ∝
\frac{n^{-i}_{j, w_i^{(d)}} + β}{n^{-i}_{j, (·)} + V β} \cdot \frac{q^{-i}_{d, j} + α}{q^{-i}_{d, (·)} + K α}   if x_i^{(d)} = 0
\frac{m^{-i}_{j, w_{i−1}^{(d)}, w_i^{(d)}} + δ}{m^{-i}_{j, w_{i−1}^{(d)}, (·)} + V δ} \cdot \frac{q^{-i}_{d, j} + α}{q^{-i}_{d, (·)} + K α}   if x_i^{(d)} = 1
(4.1)

4.2.2 State Variable

Similarly, for the update of state variables, we get:

Pr(x_i^{(d)} = j | w, z, x_{-i}^{(d)}; α, β, γ, δ) ∝ Pr(w_i^{(d)} | x_i^{(d)} = j, x_{-i}^{(d)}, w_{-i}^{(d)}, z; α, β, γ, δ) \, Pr(x_i^{(d)} = j | x_{-i}^{(d)})

For the second part:

Pr(x_i^{(d)} = j | x_{-i}^{(d)}) = \frac{p^{-i}_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, j} + γ}{p^{-i}_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, (·)} + 2γ}   (4.2)

where p^{-i}_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, j} stores the count of state j with previous topic z_{i−1}^{(d)} and previous word w_{i−1}^{(d)}, which is the prior.

As to the first part, the equations involved are the same as in the last section. Taking this into consideration, we have:

Pr(x_i^{(d)} = j | w, z, x_{-i}^{(d)}; α, β, γ, δ) ∝
\frac{n^{-i}_{z_i^{(d)}, w_i^{(d)}} + β}{n^{-i}_{z_i^{(d)}, (·)} + V β} \cdot \frac{p^{-i}_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, j} + γ}{p^{-i}_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, (·)} + 2γ}   if j = 0
\frac{m^{-i}_{z_i^{(d)}, w_{i−1}^{(d)}, w_i^{(d)}} + δ}{m^{-i}_{z_i^{(d)}, w_{i−1}^{(d)}, (·)} + V δ} \cdot \frac{p^{-i}_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, j} + γ}{p^{-i}_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, (·)} + 2γ}   if j = 1
(4.3)


Certainly, both update steps need to be normalized in order to draw new samples. After random initialization and sufficient iterations, the model tends to converge and the other parameters can be calculated by:

θ_z^{(d)} = \frac{α + q_{d,z}}{K α + q_{d,(·)}}, \qquad φ_{z,w} = \frac{β + n_{z,w}}{V β + n_{z,(·)}}

ψ_{z,w,k} = \frac{γ + p_{z,w,k}}{2γ + p_{z,w,(·)}}, \qquad σ_{z,w,v} = \frac{δ + m_{z,w,v}}{V δ + m_{z,w,(·)}}


Chapter 5

Implementation

The model is trained in a batch manner using a set of training examples. In the training phase, parameters including θ, φ, ψ and σ, are obtained and the model is created. After the training phase, the global parameters, φ, ψ and σ, do not update anymore. Given a set of novel data examples, the model estimates the topic mixture θ for each example that best explains the data. Then θ becomes a low-dimensional representation of the document and can be used to perform classification, ranking, etc. Our following implementation will be based on the two phases.
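One possible in-memory layout for the global model state is sketched below; the names and container choices are ours, not from the thesis, and ψ and σ are stored sparsely because most (topic, previous word) pairs never occur.

#include <array>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Global parameters kept fixed after the training phase.
struct TngModel {
    int K = 0;   // number of topics
    int V = 0;   // vocabulary size
    // phi[z][w]: word distribution of topic z
    std::vector<std::vector<double>> phi;
    // psi and sigma are indexed by (topic z, previous word w); a sparse map keyed by
    // z * V + w keeps memory manageable for large vocabularies.
    std::unordered_map<int64_t, std::array<double, 2>> psi;    // -> P(x = 0), P(x = 1)
    std::unordered_map<int64_t, std::vector<double>> sigma;    // -> distribution over V words
};
// The per-document output of either phase is the topic mixture theta (length K),
// which is the low-dimensional representation used for classification.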

5.1 Training

The pseudo code of training is as follows:

Input: w, α, β, γ, δ, V, M, K
Output: z, x, θ, φ, ψ, σ
Declare: n (n_{zw}: K × V), q (q_{dz}: M × K), p (p_{zwx}: K × V × 2), m (m_{zw'w}: K × V × V), z (z_i^{(d)}: M × N_d), x (x_i^{(d)}: M × N_d)

initialize z, x randomly and force x_1^{(d)} = 0 for all d
calculate n, q, p, m
while current iteration < max iteration do
  for d = 1 to M do
    for i = 1 to N_d do
      update z_i^{(d)}:
        q_{d, z_i^{(d)}} --
        if x_i^{(d)} = 0 then
          n_{z_i^{(d)}, w_i^{(d)}} --;  n_{z_i^{(d)}, (·)} --
        else
          m_{z_i^{(d)}, w_{i−1}^{(d)}, w_i^{(d)}} --;  m_{z_i^{(d)}, w_{i−1}^{(d)}, (·)} --
        end if
        if i ≠ document end then
          p_{z_i^{(d)}, w_i^{(d)}, x_{i+1}^{(d)}} --
        end if
        if x_i^{(d)} = 0 then
          for j = 1 to K do
            Pr(z_i^{(d)} = j) = (α + q_{d,j}) × (β + n_{j, w_i^{(d)}}) / (V β + n_{j, (·)})
          end for
        else if x_i^{(d)} = 1 then
          for j = 1 to K do
            Pr(z_i^{(d)} = j) = (α + q_{d,j}) × (δ + m_{j, w_{i−1}^{(d)}, w_i^{(d)}}) / (V δ + m_{j, w_{i−1}^{(d)}, (·)})
          end for
        end if
        normalize Pr(z_i^{(d)} = j) for all j
        sample z_i^{(d)} ∼ Cat(Pr(z_i^{(d)}))
        q_{d, z_i^{(d)}} ++
        if x_i^{(d)} = 0 then
          n_{z_i^{(d)}, w_i^{(d)}} ++;  n_{z_i^{(d)}, (·)} ++
        else
          m_{z_i^{(d)}, w_{i−1}^{(d)}, w_i^{(d)}} ++;  m_{z_i^{(d)}, w_{i−1}^{(d)}, (·)} ++
        end if
        if i ≠ document end then
          p_{z_i^{(d)}, w_i^{(d)}, x_{i+1}^{(d)}} ++
        end if
      update x_i^{(d)}:
        if i = 1 then
          x_i^{(d)} = 0
        else
          p_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, x_i^{(d)}} --
          if x_i^{(d)} = 0 then
            n_{z_i^{(d)}, w_i^{(d)}} --;  n_{z_i^{(d)}, (·)} --
          else
            m_{z_i^{(d)}, w_{i−1}^{(d)}, w_i^{(d)}} --;  m_{z_i^{(d)}, w_{i−1}^{(d)}, (·)} --
          end if
          Pr(x_i^{(d)} = 0) = (γ + p_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, 0}) × (β + n_{z_i^{(d)}, w_i^{(d)}}) / (V β + n_{z_i^{(d)}, (·)})
          Pr(x_i^{(d)} = 1) = (γ + p_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, 1}) × (δ + m_{z_i^{(d)}, w_{i−1}^{(d)}, w_i^{(d)}}) / (V δ + m_{z_i^{(d)}, w_{i−1}^{(d)}, (·)})
          normalize Pr(x_i^{(d)} = 0) and Pr(x_i^{(d)} = 1)
          sample x_i^{(d)} ∼ Ber(Pr(x_i^{(d)}))
          p_{z_{i−1}^{(d)}, w_{i−1}^{(d)}, x_i^{(d)}} ++
          if x_i^{(d)} = 0 then
            n_{z_i^{(d)}, w_i^{(d)}} ++;  n_{z_i^{(d)}, (·)} ++
          else
            m_{z_i^{(d)}, w_{i−1}^{(d)}, w_i^{(d)}} ++;  m_{z_i^{(d)}, w_{i−1}^{(d)}, (·)} ++
          end if
        end if
    end for
  end for
end while
calculate θ_z^{(d)} = (α + q_{d,z}) / (K α + q_{d,(·)}),  φ_{z,w} = (β + n_{z,w}) / (V β + n_{z,(·)}),  ψ_{z,w,k} = (γ + p_{z,w,k}) / (2γ + p_{z,w,(·)}),  σ_{z,w,v} = (δ + m_{z,w,v}) / (V δ + m_{z,w,(·)})

One thing that needs attention is that when we calculate θ, not all counts q are used.

As we discussed in the previous part, if two words belong to a phrase, only the topic of the latter word is considered, which means two topics in a phrase should merge into one. This is what we call topic merging and more details will be given in the experiment part.

5.2 Estimation

In this phase, the trained model is used to estimate the latent variables from a set of novel data examples. The pseudo code is given by:

Input: w, α, β, γ, δ, K, N_d
Output: z, x, θ
Declare: q (q_{dz}: 1 × K), z (z_i: 1 × N_d), x (x_i: 1 × N_d)

initialize z, x randomly and force x_1 = 0
calculate q
while requirement not satisfied do
  for i = 1 to N_d do
    update z_i:
      q_{d, z_i} = q_{d, z_i} − 1
      if x_i = 0 then
        for j = 1 to K do
          Pr(z_i = j) = (α + q_{d,j}) × φ_{j, w_i}
        end for
