
STOCHASTIC EM FOR GENERIC TOPIC MODELING USING PROBABILISTIC PROGRAMMING

Submitted by Robin Saberi Nasseri

A thesis submitted to the Department of Statistics in partial fulfillment of the requirements for a two-year Master of Arts degree in Statistics in the Faculty of Social Sciences

Supervisor Måns Magnusson

Spring, 2021


ABSTRACT

Probabilistic topic models are a versatile class of models for discovering latent themes in document collections through unsupervised learning. Conventional inferential methods lack the scaling capabilities necessary for extensions to large-scale applications. In recent years Stochastic Expectation Maximization has proven scalable for the simplest topic model: Latent Dirichlet Allocation. Performing analytical maximization is unfortunately not possible for many more complex topic models. With the rise of probabilistic programming languages, the ability to infer flexibly specified probabilistic models using sophisticated numerical optimization procedures has become widely available. These frameworks have however mainly been developed for optimization of continuous parameters, often prohibiting direct optimization of discrete parameters. This thesis explores the potential of utilizing probabilistic programming for generic topic modeling using Stochastic Expectation Maximization with numerical maximization of discrete parameters reparameterized to unconstrained space. The method achieves results of similar quality as other methods for Latent Dirichlet Allocation in simulated experiments. Further application is made to infer a Dirichlet-multinomial Regression model with metadata covariates. A real dataset is used and the method produces interpretable topics.

Keywords: SEM, topic model, probabilistic programming, LDA, DMR, TFP


Contents

1 Introduction
  1.1 Aim
  1.2 Related Research
  1.3 Research Questions
2 Theory
  2.1 Probabilistic Topic Models
    2.1.1 Latent Dirichlet Allocation (LDA)
    2.1.2 Dirichlet-multinomial Regression (DMR)
  2.2 Topic Model Inference
    2.2.1 Collapsed Gibbs Sampling (CGS)
    2.2.2 Expectation Maximization (EM)
    2.2.3 Stochastic Expectation Maximization (SEM)
    2.2.4 SEM for LDA
    2.2.5 Gradient-based Optimization
  2.3 Methods
    2.3.1 Tensorflow Probability (TFP)
    2.3.2 Gradient SEM
    2.3.3 Convergence of Gradient SEM
3 Experiments
  3.1 Experiment 1
  3.2 Experiment 2
  3.3 Experiment 3
4 Results
  4.1 Experiment 1
  4.2 Experiment 2
  4.3 Experiment 3
5 Discussions and Conclusion


1 Introduction

Within the field of natural language processing, probabilistic topic models have proven very useful for constructing semantic models of text. This is a class of models that extracts a set of topics from a corpus of documents in an unsupervised fashion by modeling the topics as latent variables. Latent Dirichlet Allocation (LDA) is the simplest of these and views documents as mixture distributions over topics and topics as distributions over words (Blei et al. 2003).

When estimating LDA models the goal is to evaluate the topic distribution conditioned on the observed documents, also known as the posterior distribution. Computing it is however intractable and suitable techniques for approximation have to be used. Popular inference methods in this scenario are Collapsed Gibbs Sampling (CGS), Variational Expectation Maximization (VEM), and variants thereof (Blei 2012). CGS is a Markov chain Monte Carlo (MCMC) method that simulates a target distribution by repeatedly sampling a subset of variables conditioned on the rest of the parameters. VEM instead takes the approach of finding, from a broad family of parameterized distributions, one that is close to the target distribution through optimization techniques (ibid.).

With data becoming increasingly large there comes the need for highly scalable methods. The CGS procedure for topic models is inherently sequential with a computational complexity of O(N · K), where N is the total number of word tokens and K is the number of topics. Assuming that larger corpora require more topics, this results in poor scalability (Magnusson 2018). VEM has similar scaling issues in the form of its memory requirements, most notably the need to store variational parameters in an N × K array (Zaheer et al. 2015). Currently, large topic models typically use K = 1000 topics, but increasing this number dramatically is a key factor for successfully extending topic models to larger applications (Y. Wang et al. 2015). For extensions to web-scale corpora Yuan et al. (2015) used 1 million topics. Another notable issue with VEM is that it (potentially severely) underestimates the posterior variance, which often leads to poor quality results (Buntine and Jakulin 2012).

Stochastic Expectation Maximization (SEM) serves as a potentially attractive solution to these problems by being parallelizable while having a similar memory footprint as the Gibbs sampler (Zaheer et al. 2016). This is accomplished by approximating the E-step by imputing values through a single Monte Carlo (MC) draw. Due to the stochasticity introduced by SEM there may be additional robustness properties to be had. In contrast to standard Expectation Maximization (EM), which is deterministic, SEM has shown such properties when it comes to poor initialization (Dias and Wedel 2004; Zaheer et al. 2015) and to getting stuck in suboptimal modes, by randomly getting "pushed" out of them (Jank 2006).

With greater scalability there comes the potential for more elaborate models. An increasingly popular way of building probabilistic models is through the use of probabilistic programming languages (PPL). These allow for rapid model specification and flexibility, utilizing generic inferential methods and a modular syntax. As described by Holtzen et al. (2020), this level of generalization has been made possible by putting restrictions on which programs can be executed. As most PPLs focus on inference using continuous parameters, they utilize assumptions such as strict differentiability, which consequently prohibits optimization of discrete parameters. This is unfortunate as topic models have discrete components. The restriction can however be worked around by using bijective transformations to estimate the discrete parameters in continuous space. Reparameterization however alters the posterior geometry and can have a great influence on the performance of inference algorithms, for better or worse (Gorinova et al. 2020).

Utilizing the PPL toolkit with SEM for topic modeling is interesting for several reasons, mainly that analytical solutions for the M-step are not available or are difficult to derive for models more complex than LDA. Slight modifications to model specification can currently require great deals of expertise and effort (Foulds et al. 2015). PPLs offer simple modular syntaxes which automatically derive the expressions necessary for inference, making model building and tweaking quick. Inference in the M-step could make use of the sophisticated gradient-based optimization procedures readily available in PPLs, which potentially would allow the method to be applied seamlessly to a wide range of models. Other than simply offering flexibility, the optimization procedures provide tools suitable for optimizing the large number of parameters in big topic models. These are especially attractive for extensions incorporating deep architectures, such as in Zaheer et al. (2017) and Benton and Dredze (2018), where gradient-based optimization is implemented on the deep components. Concerning scalability, the E-step and Stochastic-step (S-step) of SEM are parallelizable and PPLs have built-in GPU support for optimization which could be utilized in the M-step.


1.1 Aim

This thesis aims to explore the potential of SEM for generic topic model inference utilizing a probabilistic programming framework. This is done by implementing SEM with gradient-based optimization in the M-step, where the constrained parameters are mapped to continuous space using bijective transformations.

1.2 Related Research

In recent years, developments in LDA model inference using Stochastic Expectation Maximization (SEM) have yielded impressive results. Zaheer et al. (2015) derive SEM for LDA with the idea that it can leverage the computational strengths of both EM and CGS: the EM algorithm is easily parallelized and CGS has a relatively small memory footprint. The three methods are compared for another model from the class of latent variable models, namely the Gaussian mixture model. It was found that performance is comparable across a range of configurations and that SEM appears more robust to extreme and poor initialization. Most notably, SEM was more robust than Gibbs sampling in the realistic case of initializing to components uniformly at random, as Gibbs in this case depends on pure luck in early iterations.

Zaheer et al. (2016) extended this work, introducing the inference method Exponential Stochastic Cellular Automaton (ESCA) for LDA. The algorithm maps inference in latent variable models with complete data likelihood in the exponential family to a Stochastic Cellular Automaton (SCA), allowing for great scalability. It utilizes the fact that, when employing S-steps, the analytical M-step operates entirely on sufficient statistics. Rather than performing each of the SEM steps explicitly, this allows for only performing a series of S-steps updating sufficient statistics count matrices, with the E- and M-step calculations being made on the fly. The algorithm thus implicitly simulates SEM with a mapping that is easily parallelized and distributed. It achieved comparable results at over an order of magnitude greater speed compared to the popular inference algorithms CGS and Collapsed Variational Inference (CVB0). ESCA has further been extended for efficient inference using GPUs (Li et al. 2017; S. Wang et al. 2020) and for making progress in streaming LDA on, for example, news or social media feeds (Tristan et al. 2017).

Mimno and McCallum (2008) introduce Dirichlet-multinomial Regression (DMR), which models the document-topic prior distribution through a Generalized Linear Model (GLM) using metadata covariates. Inference is conducted using CGS for the topic model components, while optimization for the regression components makes use of the quasi-Newton optimization algorithm L-BFGS. This model is extended in Benton and Dredze (2018), where the regression component is switched out for a neural network, improving convergence speed and allowing for image metadata through the use of convolutional layers. Their estimation method is very similar to that in the original DMR paper. They made use of the Adadelta optimizer and found adaptive learning rates to always be better than Gradient descent (GD) with tuned learning and decay rates. Zaheer et al. (2017) develop Latent LSTM Allocation (LLA), where sequential text data is efficiently modeled utilizing the power of Long Short-Term Memory (LSTM) but with a reduced number of parameters, achieved by lifting the modeling of temporal dynamics from word to topic level. They use a partially gradient-based SEM algorithm for inference, utilizing the same analytical M-step as in LDA on the topic-word component and Stochastic Gradient Descent (SGD) using the Adam optimizer for inference on the LSTM component.

1.3 Research Questions

The research questions in this thesis can be summarized as follows:

1. Can SEM for topic models utilize the probabilistic programming framework for numerical gradient-based optimization by performing maximization of discrete parameters in continuous space through bijective transformations?

2. How does SEM with a numerical gradient-based maximization step perform against CGS and SEM with an analytical M-step for the LDA model?

3. Can an early stopping criterion in the numerical gradient-based maximization step be used to reduce computational cost while still maintaining performance?

4. Can Gradient SEM be implemented to infer a DMR model on a real dataset and produce interpretable topics?


2 Theory

This section aims to describe the underlying theory of the topic models studied in this thesis and the inference algorithms used to estimate them.

2.1 Probabilistic Topic Models

The class of probabilistic topic models is used to extract latent topics from collections of text documents known as corpora. In the following sections two popular topic models will be covered, namely LDA and the DMR model, which incorporates metadata covariates on a document level to model the topics.

2.1.1 Latent Dirichlet Allocation (LDA)

LDA is the simplest topic model and is based on the intuition that documents contain mixtures of topics and that each document has its unique mixture (Blei et al. 2003). To capture this the model assumes a generative process, visualized in Figure 1, in which each document d . . . D has its own mixture distribution of topics denoted θd and each topic k . . . K is a distribution over words φk; both of these are Categorical distributions. They have their respective conjugate prior distributions Dirichlet(α) and Dirichlet(β). The parameters α, β determine the concentration of the Dirichlet distributions and are arrays of size D × K and K × V respectively, where V is the size of the vocabulary. They are however typically chosen to be symmetric and to have their own constant values, which is the reason for the common slight notational abuse of denoting them as scalars.

Figure 1: LDA model (plate diagram with nodes α, θ, z, w, φ, β over plates D and K)

Documents are generated by first drawing a distribution over words φk for each topic k . . . K, and for each document d . . . D drawing a topic distribution θd. Word tokens in the documents are then generated by a two-stage process in which, for word token wi in document d, a topic indicator zid is drawn from θd, indicating which topic the word token belongs to. In the next step the word token is generated by sampling from the topic-word distribution φk corresponding to k = zid. Formally the complete generative process of the corpus is as follows:

1. For each topic k . . . K:
   (a) φk ∼ Dirichlet(β)
2. For each document d . . . D:
   (a) θd ∼ Dirichlet(α)
   (b) For each word token i:
      i. zid ∼ Categorical(θd)
      ii. wid ∼ Categorical(φzid)

Randomly sampling words in such a manner implies that the ordering of words within documents does not matter, which is the bag-of-words assumption.
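As a concrete illustration of this generative process, the following minimal NumPy sketch simulates a toy corpus with symmetric priors; the function and variable names are chosen here for illustration and are not part of the thesis implementation.

```python
import numpy as np

def simulate_lda(D, K, V, N_d, alpha, beta, seed=0):
    """Minimal sketch of the LDA generative process with symmetric priors."""
    rng = np.random.default_rng(seed)
    phi = rng.dirichlet(np.full(V, beta), size=K)     # topic-word distributions, K x V
    theta = rng.dirichlet(np.full(K, alpha), size=D)  # document-topic distributions, D x K
    docs, topics = [], []
    for d in range(D):
        z = rng.choice(K, size=N_d, p=theta[d])             # topic indicators z_id
        w = np.array([rng.choice(V, p=phi[k]) for k in z])  # word tokens w_id
        docs.append(w)
        topics.append(z)
    return phi, theta, docs, topics

phi, theta, docs, z = simulate_lda(D=3, K=2, V=3, N_d=400, alpha=0.1, beta=0.1)
```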

2.1.2 Dirichlet-multinomial Regression (DMR)

In many cases documents have associated metadata which could be incorporated into a topic model, for example who authored a document. Developed by Mimno and McCallum (2008), the DMR model extends LDA by modelling the document-topic prior through a Generalized Linear Model (GLM) with an arbitrary number of features. This resemblance to LDA is clear from the model illustration in Figure 2. Here α is modelled as a GLM with observed metadata covariates X and regression coefficients (including an intercept) λ ∼ Normal(µ, σ²), where µ, σ² are hyperparameters in the model.

Figure 2: DMR model (plate diagram extending Figure 1 with covariates X and coefficients λ ∼ Normal(µ, σ²) over a K × P plate, which together determine α)


An important property of DMR compared to other earlier topic models with covariates is that DMR can handle covariates of any data type. Sampling is further as cheap and simple as for LDA due to all of the metadata information being taken into account by θ (Mimno and McCallum 2008).
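To make the prior construction concrete, the sketch below builds document-specific Dirichlet concentrations α_dk = exp(x_dᵀλ_k) from a covariate matrix with an intercept column; the sizes and values are illustrative and not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, P = 5, 3, 2                                      # documents, topics, features (incl. intercept)
X = np.hstack([np.ones((D, 1)),                        # intercept column
               rng.integers(0, 2, size=(D, P - 1))])   # e.g. a binary metadata covariate
lam = rng.normal(0.0, np.sqrt(0.5), size=(K, P))       # regression coefficients lambda ~ Normal(0, 0.5)
alpha = np.exp(X @ lam.T)                              # D x K document-specific concentrations
theta = np.vstack([rng.dirichlet(a) for a in alpha])   # theta_d ~ Dirichlet(alpha_d)
```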

2.2 Topic Model Inference

Estimation of different topic models typically makes use of the same fundamental techniques as in the simple LDA case, with some extensions and adaptations. Focus will therefore be on LDA inference when explaining the methods. When estimating topic models the length of the vocabulary V is fixed to include the words present in the corpus. Document lengths are also fixed as the model is conditioned on the data and hence also on document size. The LDA Dirichlet prior concentration parameters α and β are > 0 and are treated as hyperparameters. Small values of α and β correspond to the document-topic distributions and topic-word distributions being sparse, respectively. Tweaking these values can thus be used to incorporate prior knowledge into the model. They are typically selected on an ad-hoc basis, often to values of 0.1 and 1/K for either or both hyperparameters (George and Doss 2018).

Inferring an LDA model is formulated as a problem of computing the posterior distribution in Equation 1. The denominator is intractable, which is the fundamental reason that approximate techniques have to be used.

$$p(\theta, \phi, z \mid w, \alpha, \beta) = \frac{p(\theta, \phi, z, w \mid \alpha, \beta)}{p(w \mid \alpha, \beta)} \quad (1)$$

2.2.1 Collapsed Gibbs Sampling (CGS)

CGS is an MCMC method for approximating a distribution by, in each state, sampling a subset of variables conditioned on the rest of the variables. Griffiths and Steyvers (2004) employed this method for LDA, where it implies simulating the word-topic posterior distribution P(z|w), from which φ and θ easily can be estimated. This is done by constructing a Markov chain where new states are reached by sampling, word by word, new topic indicators given all other variables. With the notation of Magnusson (2018) this is described by

$$P(z_i = k \mid w_i, z_{\neg i}) \propto \frac{n^{(v)}_{k,v(w_i)} + \beta}{\sum_{v}^{V} n^{(v)}_{k,v} + V\beta} \cdot \frac{n^{(d)}_{d(w_i),k} + \alpha}{\sum_{k}^{K} n^{(d)}_{d(w_i),k} + K\alpha}. \quad (2)$$

Here zi is the current topic indicator and z¬i denotes all other topic indicators to be conditioned on. The count of how many times word token wi has been assigned to topic k is n(v)k,v(wi), and the number of times topic indicator zi occurs in the current document is n(d)d(wi),k. Subscript v is used to index words in the vocabulary. The probability of assigning topic indicator zi = k, conditioned on the word token wi and all other topic indicators, is thus quite intuitively a function of how that word is assigned to topics in other cases and which mixture of topic proportions is present in the current document (Griffiths and Steyvers 2004).

As can be seen in Equation 2, when sampling topic indicator zi all other topic indicators have to be conditioned on, making the Gibbs sampler inherently serial. Sampling a topic indicator also has a sampling complexity of O(K). Under the assumption that the number of topics increases with the size of the corpora, these two aspects result in poor scalability of the method (Magnusson 2018). Other problems with CGS are that one has to assess the Markov chain convergence, decide when and how many samples to collect from the posterior distribution, and decide how to pool the results from these samples. These issues are typically sidestepped by letting the chain run for as long as is computationally feasible and then collecting a single sample, which is neither very efficient nor accurate (Teh et al. 2007).
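A minimal NumPy sketch of one collapsed Gibbs sweep implementing Equation 2 is shown below; it is not the thesis code, and the count-matrix names are chosen here for illustration. The document-side normalizer Nd + Kα is constant for a given token, so it is dropped and the probabilities renormalized.

```python
import numpy as np

def cgs_sweep(docs, z, n_kv, n_dk, n_k, alpha, beta, rng):
    """One collapsed Gibbs sweep over all word tokens (cf. Equation 2)."""
    K, V = n_kv.shape
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            k_old = z[d][i]
            # Remove the token's current assignment from the count matrices
            n_kv[k_old, v] -= 1; n_dk[d, k_old] -= 1; n_k[k_old] -= 1
            # Conditional topic probabilities given all other indicators
            p = (n_kv[:, v] + beta) / (n_k + V * beta) * (n_dk[d] + alpha)
            k_new = rng.choice(K, p=p / p.sum())
            # Add the new assignment back
            z[d][i] = k_new
            n_kv[k_new, v] += 1; n_dk[d, k_new] += 1; n_k[k_new] += 1
    return z
```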

2.2.2 Expectation Maximization (EM)

For latent variable models, inferring parameter estimates analytically is typically not possible and approximation techniques have to be used. The EM algorithm is one such method; it alternates between an Expectation-step (E-step) and a Maximization-step (M-step) in order to reach parameter estimates. In the E-step the expectation of the latent variables is computed, and the M-step consists of updating model parameters by maximizing the likelihood function conditioned on the expectation of the latent variables. In Dempster et al. (1977) it is shown that each iteration of this procedure monotonically increases the likelihood and that the same procedure can be applied for maximum a posteriori (MAP) estimation with the slight modification of including the log prior density term during maximization. This latter case can then be seen as performing the M-step with regularization. The EM procedure is iterated and guaranteed to converge to a stationary point for a wide range of models under mild assumptions (Wu 1983).


2.2.3 Stochastic Expectation Maximization (SEM)

The Stochastic Expectation Maximization (SEM) algorithm differs from EM in that the E-step is approximated by imputing values from a single Monte Carlo draw in a Stochastic-step (S-step). Another common abbreviation for this method is StEM and the S-step is sometimes referred to as a Simulation-step. For the methods used in this thesis these two steps are explicitly separate but will for convenience be referred to as a single SE-step when applicable.

SEM was developed to solve several drawbacks that the standard EM algorithm suffers from (Diebolt and Celeux 1993). The E-step in EM can be intractable or expensive, it is sensitive to starting values, and the stationary point to which it converges may be either a local maximum or a saddle point (Diebolt and Ip 1996). Approximating the E-step may skip the need for evaluating expensive or intractable expressions. As will be seen, further computational benefits may be had in the form of reduced memory requirements, as only the simulated values have to be stored, and in that the M-step may simplify to more convenient expressions.

Instead of deterministically converging to a stationary fixed point, SEM converges to a stationary normal distribution centred at the maximum likelihood estimate (MLE) under mild technical assumptions (Nielsen et al. 2000). The stochasticity induced makes SEM less sensitive to poor initialization (Dias and Wedel 2004; Zaheer et al. 2015) and able to be "pushed out" from unstable suboptimal modes (Jank 2006) and saddle points (Delyon et al. 1999).

2.2.4 SEM for LDA

In the supplementary material of Zaheer et al. (2015) the EM and SEM algorithms are derived for LDA and the main results for SEM are presented in this section using the notation of this thesis. It is worth noting that both EM and SEM have the same E-step for LDA, but that EM has a much more complicated M-step. As only SEM is to be implemented in this thesis, the derivation of the M-step for EM is not pursued here.

The derivation of the SEM steps starts by finding a lower bound for the posterior, which is displayed in Equation 3. Here Nd denotes the number of word tokens in document d.

$$F(q, \theta, \phi) = -\sum_{d=1}^{D}\sum_{i=1}^{N_d} D_{KL}\big(q(z_{id} \mid w_{id}) \,\|\, p(z_{id} \mid w_{id}, \theta_d, \phi)\big) + \sum_{d=1}^{D}\sum_{i=1}^{N_d} \log p(w_{id} \mid \theta_d, \phi) + \sum_{k=1}^{K} \log p(\phi_k \mid \beta) + \sum_{d=1}^{D} \log p(\theta_d \mid \alpha) \quad (3)$$

The KL divergence, denoted DKL, is a measure of how different one probability distribution is compared to another, in this case how different the model-implied distribution q is from the true distribution p. In the E-step the parameters θ, φ are fixed and q is computed. As q only occurs in the negative KL divergence term, this is equivalent to minimizing the divergence between q and p. KL divergence is nonnegative and therefore has its minimum at 0, which occurs when the approximated and true distributions are the same. This leads to the E-step expression displayed in Equation 4.

It quite intuitively corresponds to computing, for all word tokens in all documents, the respective probabilities of belonging to each topic k . . . K given the current model parameter estimates. The S-step then consists of sampling a topic indicator zid ∼ Categorical(qid1, . . . , qidK) for every word token in the corpus.

$$q(z_{id} = k \mid w_{id}) = \frac{\theta_{dk}\,\phi_{k w_{id}}}{\sum_{k'=1}^{K} \theta_{dk'}\,\phi_{k' w_{id}}} \quad (4)$$

Going on to the M-step, as the Categorical parameters θ, φ lie on the simplex, constrained optimization using Lagrange multipliers is used. Differentiating and solving the Lagrangian yields the analytical M-step solutions shown in Equation 5.

$$\theta_{dk} = \frac{N_{dk} + \alpha - 1}{N_d + K\alpha - K}, \qquad \phi_{kv} = \frac{N_{kv} + \beta - 1}{N_k + V\beta - V} \quad (5)$$

Here Ndk is the number of words in document d belonging to topic k and Nd is the total number of words in the document. Likewise, Nkv is the number of times word v is categorized as belonging to topic k and Nk is the total number of words belonging to that topic. All of these are sufficient statistics that fit in two arrays of size D × K and K × V respectively. This lets the M-step for SEM be performed with minimal memory requirements, in contrast to standard EM which has more complicated expressions including all values of q in Equation 4.

As K, V and all of the sufficient statistics are counts, it is apparent from Equation 5 that α, β must be > 1, as otherwise θ, φ could take on negative values. This is due to a property of the Dirichlet distribution of only having an analytical maximizing solution in this range. For values of 0 < α, β < 1 it is maximized in the simplex corners and for α, β = 1 it is uniform on the simplex and has no maximum. This restriction is unfortunate as α, β have a big impact on the model and commonly are set to values such as 1/K or 0.1, or potentially tuned around the restricted range (George and Doss 2018).
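Putting Equations 4 and 5 together, one Analytical SEM iteration for LDA can be sketched as below; this is a NumPy illustration assuming α, β > 1, not the thesis implementation.

```python
import numpy as np

def analytical_sem_iteration(docs, theta, phi, alpha, beta, rng):
    """One SEM iteration for LDA: SE-step via Eq. 4 plus a single draw,
    then the analytical M-step of Eq. 5 (valid only for alpha, beta > 1)."""
    D, K = theta.shape
    V = phi.shape[1]
    N_dk = np.zeros((D, K))
    N_kv = np.zeros((K, V))
    # SE-step: compute q(z_id = k | w_id) and impute a topic indicator per token
    for d, doc in enumerate(docs):
        for v in doc:
            q = theta[d] * phi[:, v]
            k = rng.choice(K, p=q / q.sum())
            N_dk[d, k] += 1
            N_kv[k, v] += 1
    # M-step: closed-form MAP updates on the sufficient statistics
    theta = (N_dk + alpha - 1) / (N_dk.sum(axis=1, keepdims=True) + K * alpha - K)
    phi = (N_kv + beta - 1) / (N_kv.sum(axis=1, keepdims=True) + V * beta - V)
    return theta, phi
```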


2.2.5 Gradient-based Optimization

Gradient descent (GD) is in many cases a suitable technique for parameter optimization. Performing optimization to find a minimum or a maximum is equivalent and these terms will be used interchangeably throughout this thesis. The algorithm, displayed in Equation 6, consists of computing the gradient of some objective function J with respect to a model parameter θ and taking a step of size η in the opposite direction of the gradient, resulting in the update θs → θs+1. While Equation 6 is written in scalar form and for θ, this procedure is typically performed for all parameters simultaneously.

$$\theta_{s+1} = \theta_s - \eta \cdot \nabla_\theta J(\theta_s) \quad (6)$$

In vanilla GD the gradients are computed for the whole training dataset. This often does not fit efficiently in memory and becomes slow. Stochastic Gradient Descent (SGD) aims to solve this by computing the gradient and taking steps using a subset of the data. Technically SGD does this for one observation at a time, while Mini-batch GD does this for some bigger subset. In practice there seldom is any reason to use only one observation at a time, and a mini-batch is almost always used. It is therefore common to refer to mini-batch GD as simply SGD, which will be done from now on.

An important property of both GD and SGD is that they guarantee convergence to a local minimum, even escaping saddle points, under mild technical assumptions together with a not too aggressive learning rate and the strict saddle assumption (Lee et al. 2016; Ge et al. 2015). Having the learning rate decrease over time is called decay and may help if the learning rate is set too aggressively. The strict saddle assumption implies the need for at least one direction in which the gradient curvature is strictly negative. Du et al. (2017) found that the time it takes to escape may however at worst be exponentially large for GD, a result they expect extends to SGD. Another area where the vanilla versions of GD and SGD struggle is the ravines commonly found around minima. Here the gradient is steeper in some directions, which causes steps to oscillate over the short axis, making progress along the long axis slow. Modifying the update rule to include a momentum term that incorporates a fraction of past updates helps remedy this and makes convergence faster. Consecutive steps taken in different directions then become small, which reduces the oscillating behavior, while steps in the same direction accumulate, which speeds up the descent down the steep direction (Ruder 2016).

There exist many sophisticated optimization procedures incorporating versions of the aforementioned techniques amongst others. One efficient extension well suited for a wide range of problems is the Adam optimizer introduced by Kingma and Ba (2014). It uses first and second moment estimates of the gradient to individually adapt the learning rates for each parameter. Such procedures are commonly referred to as adaptive learning rates. These estimates are incorporated as exponentially decaying averages, with the first-order term acting like momentum. Adam is robust, requires little tuning, and works well for sparse gradients and non-stationary objectives (ibid.).
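For reference, a minimal NumPy sketch of the Adam update just described (following Kingma and Ba 2014) is given below; the experiments in this thesis rely on the optimizer implementations provided by the TF framework rather than code like this.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates (t is the 1-based step count)."""
    m = b1 * m + (1 - b1) * grad          # exponentially decaying mean of gradients (momentum-like)
    v = b2 * v + (1 - b2) * grad ** 2     # exponentially decaying mean of squared gradients
    m_hat = m / (1 - b1 ** t)             # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return theta, m, v
```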

2.3 Methods

In the following sections the implementation of SEM with a numerical gradient-based M-step is described and a heuristic argument for its convergence is given.

2.3.1 Tensorflow Probability (TFP)

Tensorflow Probability (TFP) is built on top of the popular machine learning (ML) library Tensorflow (TF). It is used to compute gradients for arbitrary parameters in probabilistic models. In the TF framework objects are stored in arrays of arbitrary dimensions called tensors and computations are made by composing these and operations into nodes in directed graphs (Abadi et al. 2016). Models are specified using simple modular building blocks and computational graphs are used to describe how data flows through during computation, allowing for complex models to be specified in a straightforward manner. Structuring tensors and operations in a computational graph has several key benefits making it suitable for large-scale gradient-based inference. It allows for distributing subgraph computations to different devices and makes differentiation of variables in complex models much simpler by utilizing backpropagation (ibid.).

To compute the derivative for some parameter the system simply finds and backtracks the path between the relevant nodes in the computational graph, applying the chain rule to compute partial derivatives in each step.

By adding parameterized probability distributions and bijectors as building blocks, TFP extends the library to a wide range of probabilistic models while still utilizing the same underlying techniques that make TF suitable for ML tasks. Bijectors are deterministic transformations that map each element of a set X to the set Y with a one-to-one correspondence. Each element in X thus has a unique pairing in Y and vice versa, with the mapping function being invertible.

This allows for describing a transformed probability according to Equation 7, where F is the bijective function and DF−1(y) is the inverse of the Jacobian of F (Dillon et al. 2017).

$$p_Y(y) = p_X(F^{-1}(y))\,\big|DF^{-1}(y)\big| \quad (7)$$

In TFP, bijectors are used to gain access to an exponentially wider number of distributions from an underlying smaller subset. They can further be chained together to allow for more complex transformations. Models with such transformed probability distributions still have differentiable parameters, as the information necessary to invert the transformation is stored by TFP (ibid.). Estimation of constrained parameters can utilize bijectors to serve as a map between the constrained and unconstrained space. The parameters of a discrete probability distribution can for example be transformed to unconstrained space using a bijection. They are then estimated using standard optimization techniques for continuous parameters and transformed back into constrained space by applying the inverted bijective function.
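The sketch below illustrates this idea with TFP's SoftmaxCentered bijector: a single document-topic distribution is optimized in unconstrained space against a Dirichlet-multinomial log posterior. The counts and hyperparameters are illustrative, and this is a minimal example of the reparameterization idea rather than the thesis implementation.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

K = 3
counts = tf.constant([5., 1., 0.])           # hypothetical topic counts N_dk for one document
alpha = tf.fill([K], 1.1)                    # Dirichlet prior concentration

bij = tfb.SoftmaxCentered()                  # maps R^(K-1) <-> the K-simplex
theta_free = tf.Variable(tf.zeros([K - 1]))  # unconstrained parameter to optimize
opt = tf.keras.optimizers.Adam(learning_rate=0.1)

for _ in range(200):
    with tf.GradientTape() as tape:
        theta = bij.forward(theta_free)      # back to the simplex
        # Negative log posterior: multinomial counts likelihood + Dirichlet prior
        loss = -(tf.reduce_sum(counts * tf.math.log(theta))
                 + tfd.Dirichlet(alpha).log_prob(theta))
    grads = tape.gradient(loss, [theta_free])
    opt.apply_gradients(zip(grads, [theta_free]))

print(bij.forward(theta_free).numpy())       # approaches (counts + alpha - 1), normalized
```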

2.3.2 Gradient SEM

The algorithm proposed in this thesis is SEM with the M-step performed using numerical gradient-based optimization, referred to as Gradient SEM. The name is inspired by Lange (1995), where Gradient EM using Newton-Raphson optimization was first introduced. SEM with analytical maximization will from here on be referred to as Analytical SEM in order to clearly distinguish between the algorithms.

The reason for using Gradient SEM is that more complex topic models often do not have analytical solutions to the M-step and that PPLs allow for generic and sophisticated optimization. For some topic models there is a consideration to run the algorithm partially gradient-based, with analytical maximization for some component. An example of this is LLA, where φ and the Long Short-Term Memory (LSTM) component are conditionally independent given z. In Zaheer et al. (2017) they utilize this by implementing an LLA inference algorithm consisting of gradient-based optimization for the LSTM component together with the same analytical M-step as in LDA for optimizing φ. While utilizing such a method is cheaper, it is not always applicable nor preferable. As discussed in Section 2.2.4, performing analytical maximization requires the Dirichlet hyperparameters to be > 1. This does not encompass the commonly used values of 0.1 and 1/K. In George and Doss (2018) it is shown that the choice of α, β has great influence on the final model, and they implement an approach of searching for optimal hyperparameter values, which most often results in values < 1. So while analytical or partially gradient-based SEM has computational advantages, the completely gradient-based implementation in this thesis has the advantage of potentially finding better final solutions as any α, β > 0 is possible.

Concerning the implementation of the Gradient SEM algorithm there is an additional consideration of how many gradient steps s to take in each M-step. The number of steps necessary for convergence varies greatly with optimizer, hyperparameters, data, and during execution. Performing many unnecessarily small steps can be expensive. In Section 2.3.3 there is a heuristic argument for overall algorithm convergence which assumes that the optimization procedure in every single M-step converges. Given situations in which overall convergence holds, this is likely too strict an assumption and will in practice be relaxed by putting some cap on s in order to reduce computational cost. While taking fewer gradient steps reduces the computational burden of the M-step, it may also slow down global convergence if too few steps are taken, mitigating any gains or even increasing cost. Different implementations of stopping criteria will be used in this thesis and a simulation studying the properties of varying η and s is performed in Experiment 2.
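For orientation, the outer Gradient SEM loop with a capped number of gradient steps per M-step can be sketched as follows; `m_step_gradient_update` is a placeholder for one numerical optimization step on the reparameterized parameters and is not a function from the thesis code.

```python
import numpy as np

def gradient_sem(docs, theta, phi, n_iter, s_max, m_step_gradient_update, rng):
    """Sketch of Gradient SEM: SE-step by single Monte Carlo imputation, then
    at most s_max numerical gradient steps on the transformed parameters."""
    D, K = theta.shape
    V = phi.shape[1]
    for _ in range(n_iter):
        # SE-step: impute topic indicators and accumulate sufficient statistics
        N_dk = np.zeros((D, K))
        N_kv = np.zeros((K, V))
        for d, doc in enumerate(docs):
            for v in doc:
                q = theta[d] * phi[:, v]
                k = rng.choice(K, p=q / q.sum())
                N_dk[d, k] += 1
                N_kv[k, v] += 1
        # M-step: numerical optimization of the complete-data log posterior,
        # capped at s_max gradient steps (an early-stopping rule could also be used)
        for _ in range(s_max):
            theta, phi = m_step_gradient_update(N_dk, N_kv, theta, phi)
    return theta, phi
```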

The implementation of Gradient SEM in this thesis is done using the PPL Tensorflow Probability. Here the model's generative process is easily specified using building blocks of parameterized probability distributions. Model specification essentially consists of translating from plate notation, as in Figure 1 and Figure 2, to appropriate distributions in TFP, together with specifying all transformations and relationships. Rather than specifying the generative process on a word level, draws from the Categorical distributions are summed to form Multinomial distributions. This describes the generative process on a document level in terms of sufficient statistics and is an equivalent way of formulating a topic model with computational benefits. If the model were formulated on a word level it would produce ragged arrays not suitable for the Single Instruction Multiple Data (SIMD) parallelism used in TFP (Piponi et al. 2020). SIMD is a form of parallelized computation where a single operation is applied to a batch of multiple vectorized data points at the same time. The document-level formulation is further beneficial as it allows the Gradient SEM algorithm to run on compact arrays of sufficient statistics, alleviating considerable memory pressure.
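Under this document-level formulation, the M-step objective given the imputed topic indicators can be expressed through Multinomials over the sufficient-statistic count matrices plus the Dirichlet priors (the multinomial coefficients are constants with respect to θ, φ). The sketch below shows one way of writing this objective in TFP; it assumes α and β are given as concentration vectors of length K and V, and is an illustration rather than the thesis code.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def complete_data_log_posterior(N_dk, N_kv, theta, phi, alpha, beta):
    """Complete-data log posterior given imputed topic indicators, expressed
    through the D x K and K x V sufficient-statistic count matrices (floats)."""
    log_lik = (
        tf.reduce_sum(tfd.Multinomial(total_count=tf.reduce_sum(N_dk, -1),
                                      probs=theta).log_prob(N_dk))    # topic counts per document
        + tf.reduce_sum(tfd.Multinomial(total_count=tf.reduce_sum(N_kv, -1),
                                        probs=phi).log_prob(N_kv)))   # word counts per topic
    log_prior = (tf.reduce_sum(tfd.Dirichlet(alpha).log_prob(theta))  # alpha: length-K vector
                 + tf.reduce_sum(tfd.Dirichlet(beta).log_prob(phi)))  # beta: length-V vector
    return log_lik + log_prior
```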

The simple GD and the sophisticated Adam optimization algorithms are used in this thesis. The former is not expected to work particularly well and requires more tuning, but is included as a form of baseline Gradient SEM method. It will however be implemented with decay and momentum to improve performance, while still being referred to as GD. The advantages of Adam discussed in Section 2.2.5 make it attractive for generic topic modeling. It works well in a wide range of cases and has the ability to handle the sparse gradients notorious for high-dimensional topic models. Optimization is performed using the full batch of data rather than using SGD. This is due to the data being structured in the form of two arrays of sufficient statistics, making a mini-batch implementation a bit unclear.

For the simplex parameters θ, φ, direct optimization is not possible in the TF framework.

Bijective transformations will therefore be used to transform the parameters to continuous space, where they are to be optimized. The bijective transformation used here is the centered softmax function displayed in Equation 8, which serves as the map between the simplex and continuous space. Denote θ̃ as θ transformed to continuous space; then Equation 8 shows the transformation of θ̃ back to θ on the simplex. This transformation is performed for all documents d . . . D and topics k . . . K. Here it is presented for θ but is used in the same way for φk for all topics k . . . K and words v . . . V, except that φ is normalized to a probability distribution over words.

Centering is represented by c.

$$\mathrm{SoftmaxCentered}(\tilde\theta_{dk}) = \frac{\exp(\tilde\theta_{dk} - c)}{\sum_{k'=1}^{K} \exp(\tilde\theta_{dk'} - c)} \quad (8)$$

In order to further add speed and flexibility to model specification, automated procedures are used for deriving the model log probability, which serves as the objective function. Optimization in the M-step may sometimes result in parameter estimates that are numerically 0, especially when using an aggressive learning rate. When computing the model log probabilities this numerical instability sometimes results in attempting to compute log(0), which is undefined. This is solved by rounding simplex parameter values up to an arbitrarily small floor of 1e−18 and then normalizing the distribution. This is performed after each gradient step. The chosen value is small enough to have a negligible effect on results while at the same time ensuring numerical stability.
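The numerical safeguard described above can be expressed as a small helper like the following (a sketch, with the function name chosen here for illustration):

```python
import tensorflow as tf

def floor_and_renormalize(theta, floor=1e-18):
    """Clip simplex entries at a tiny floor and renormalize, so that
    log(theta) is always defined when evaluating the objective."""
    theta = tf.maximum(theta, floor)
    return theta / tf.reduce_sum(theta, axis=-1, keepdims=True)
```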

Concerning the implementation of the DMR model, a burn-in period of LDA is typically run to acquire initial estimates of θ, φ (Mimno and McCallum 2008; Benton and Dredze 2018).

This is a useful technique for topic models as they are multimodal and initialization affects the solution. Utilizing initialization strategies has been shown to speed up training and achieve better results (Roberts et al. 2016). Here the burn-in will consist of training the DMR model using Gradient SEM while optimizing only θ, φ. Initializing all regression coefficients to 0, this equates to training an LDA model with α = 1, since for all documents d . . . D and topics k . . . K

$$\alpha_{dk} = \exp(x_d^T \lambda_k) = \exp(x_d^T 0) = 1.$$

While not explicitly necessary, having the same α value during the burn-in and when starting full model training intuitively may help to ensure a smooth transition between the two phases.

In a PPL framework optimizing an implicit LDA model through a DMR model specification with frozen parameters has the added benefit of not having to recalculate gradients.

2.3.3 Convergence of Gradient SEM

In contrast to the EM algorithm, whose general convergence properties were laid out in early works such as Dempster et al. (1977) and Wu (1983), convergence of SEM was only proven theoretically for special cases, such as the 2-component mixture in Diebolt and Celeux (1993). A heuristic argument for the convergence of SEM is that while the EM algorithm is guaranteed to improve the likelihood in each iteration, SEM does so on average. The S-step however dramatically complicates derivation of a rigorous mathematical proof for the general case. Nielsen et al. (2000) took the elegant approach of showing that iterations of the SEM algorithm form a Markov chain, which under technical assumptions converges to a stationary normal distribution centered at the MLE. The relevant assumption necessary here is that maximization needs to be possible, which implies the existence of an MLE. These assumptions are in general fulfilled even for complex topic models such as LLA (Zaheer et al. 2017).

Given all of this there is a heuristic argument for the convergence of the Gradient SEM algorithm. As mentioned, SEM with an analytical M-step has the theoretical guarantee of converging to a local mode of the likelihood function, which results in an MLE. Using GD to maximize a function converges to a local optimum, like an analytical solution would, under additional mild technical assumptions, namely the strict saddle assumption and a not too aggressive learning rate (Lee et al. 2016). The Gradient SEM algorithm for topic models should thus converge given the additional assumptions required for GD convergence and that GD is run until convergence in each M-step. Convergence properties of Gradient SEM are studied through simulation in Experiments 1 and 2.


3 Experiments

In the following sections the purpose and design of each experiment are described in detail. The first two experiments are simulation-based, with Experiment 1 aiming to compare the performance of Gradient SEM with other algorithms. Experiment 2 studies the properties of varying the number of gradient steps taken in each M-step. Lastly, in Experiment 3 a DMR model is estimated with Gradient SEM on a real dataset and the resulting topics are examined.

3.1 Experiment 1

In this experiment the methods CGS, Analytical SEM and Gradient SEM are compared in terms of performance. For Gradient SEM the optimizers GD and Adam are used. The experimental setting in which the methods are compared is inspired by Jonasson and Magnusson (2021), in which algorithms are compared on a simulated toy corpus by counting the number of times they converge to known modes of varying quality. The simulated toy corpus consists of D = 3 documents, each of size Nd = 400, d = 1, . . . , D, with a vocabulary of size V = 3. The first document is comprised of 9/10 of word 1 and 1/10 of word 2. Documents 2 and 3 consist of only words 2 and 3 respectively. With the setting K = 2 and α, β = 1 there are 3 known modes of varying quality the algorithms can converge to, and how often they end up in these can be used to assess their performance. When Jonasson and Magnusson (ibid.) performed this experiment using CGS, it converged to the best mode about 30% of the time and to the other modes roughly 35% of the time each, with the worst mode being reached slightly more often. In each run of the algorithms, the best mode achieved is counted towards performance, as in practice parameter estimates from the best solutions are stored and used for the final model. The modes are in terms of the log-posterior density given by Equation 9. Here C is a constant set to 0 in this implementation.

$$L(w, z) = \sum_{k}^{K}\sum_{v}^{V} \log\Gamma(n_{vk} + \beta) - \sum_{k}^{K} \log\Gamma\Big(\sum_{v}^{V} n_{vk} + V\beta\Big) + \sum_{d}^{D}\sum_{k}^{K} \log\Gamma(n_{kd} + \alpha) - \sum_{d}^{D} \log\Gamma\Big(\sum_{k}^{K} n_{kd} + K\alpha\Big) + C. \quad (9)$$
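As an illustration, the toy corpus and the log-posterior evaluation of Equation 9 (with C = 0) can be sketched as follows; the variable names are chosen here for illustration and this is not the thesis code.

```python
import numpy as np
from scipy.special import gammaln

# Toy corpus: document 1 is 9/10 word 0 and 1/10 word 1;
# documents 2 and 3 contain only words 1 and 2 respectively.
docs = [np.array([0] * 360 + [1] * 40), np.full(400, 1), np.full(400, 2)]
K, V, alpha, beta = 2, 3, 1.1, 1.1

def log_posterior(docs, z, K, V, alpha, beta):
    """Evaluate the collapsed log posterior of Eq. 9 from topic indicators z."""
    D = len(docs)
    n_vk = np.zeros((V, K))
    n_kd = np.zeros((K, D))
    for d, (w, zd) in enumerate(zip(docs, z)):
        np.add.at(n_vk, (w, zd), 1)
        np.add.at(n_kd, (zd, d), 1)
    return (gammaln(n_vk + beta).sum() - gammaln(n_vk.sum(axis=0) + V * beta).sum()
            + gammaln(n_kd + alpha).sum() - gammaln(n_kd.sum(axis=0) + K * alpha).sum())
```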

As explained in Section 2.2.4, SEM does not have analytical solutions for α, β ≤ 1, so all comparisons are therefore made with the similarly small hyperparameter values α = β = 1.1. The experiment is repeated 1000 times for each algorithm. Topic indicators are initialized uniformly at random and the parameters θ, φ as the sample proportions. All algorithms are run for 50 iterations, which is more than enough to converge given reasonable hyperparameter values where applicable. Reasonable in this case means that the hyperparameters are set such that convergence is possible and not very slow; these criteria are not met when, for example, the learning rate is too aggressive or too small respectively. While CGS in practice is run for much longer, it very often converges to a local mode quickly on the toy corpus. The local modes in this corpus are very stable and CGS is thus unable to escape them, so running it for longer should not affect the results.

Due to some of the compared algorithms being fundamentally different, there is some discrepancy in how they are run to ensure fair comparisons. For the Gradient SEM algorithms, each M-step is set to take s = 5 gradient steps. This is due to the SE-step being extremely quick for this small corpus, making the cost of the M-step disproportionately large. In contrast, taking for example only s = 1 step would for many configurations result in longer overall convergence. Five gradient steps are thus considered a middle ground where the M-steps often should converge. Further properties of putting a cap on the maximum number of gradient steps in the M-step are studied in Experiment 2.

Concerning hyperparameter tuning, the learning rate η is tuned for Adam by a grid search over the 20 values in the space 0.05, . . . , 1 with increments of 0.05. For SEM with GD, a uniform random search is performed over 20 combinations of values with 0 < η < 0.5, momentum between 0.5 and 0.9, and decay between 0.95 and 0.99. The reason for searching a narrower range of η values for GD is that reasonable values are more plausible in this range.

3.2 Experiment 2

For the Gradient SEM algorithm, convergence in every single M-step is likely not necessary for overall algorithm convergence. As discussed in Section 2.3.2, there are many factors affecting convergence in the M-step and several plausible configurations for achieving reasonable results. In this experiment Gradient SEM algorithm convergence and performance is studied for different configurations of η and s. If performance of similar quality can be achieved while taking fewer gradient steps, it could be a viable option for reducing the computational cost of the M-step, unless global convergence is slowed down so much that the algorithm would have to be run for much longer.


This experiment is performed on the same toy corpus with the same model hyperparameter values α = β = 1.1 as in Experiment 1, with performance being measured in the same way.

Only the Adam optimizer is considered due to its properties discussed in Section 2.2.5, which make it suitable for topic modeling. The learning rate η is studied for three different orders of magnitude, 0.01, 0.1 and 1, which are considered to be low, reasonable, and high respectively in this experimental setting. The number of gradient steps s is studied for values of 1, 5 and 15. Taking only one step should take very long to converge, while 15 is most of the time more than enough, except potentially with the smallest step size. Five gradient steps are considered a middle ground. All 9 combinations of these parameter values are studied.

3.3 Experiment 3

In the third experiment a DMR model is inferred using Gradient SEM on a real, moderately sized dataset. This is a very experimental procedure and focus is on constructing the algorithm and examining whether it can produce interpretable results. The dataset used is a subset of the 2008 presidential election political blogs corpus developed by Eisenstein and Xing (2010). It contains posts of at least 200 words from seven political blogs. The corpus includes ratings of whether a blog is labeled as conservative or liberal, which will be used as a covariate in the DMR model. In Roberts et al. (2016) the dataset is stemmed and cleaned using a standard stopwords list. Stemming is the process of trimming the ends of words, for example reducing a word in plural form to singular, so that occurrences of both forms contribute to the same count. Stopwords are common words which in themselves do not contribute to the meaning of a document and are therefore removed. The dataset is further trimmed of rare words which occur in less than 1% of the documents. A random sample of 5000 cleaned documents is available in the STM R package (Roberts et al. 2019) and is used in this experiment. The cleaned corpus consists of more than 3 million word tokens and has a vocabulary size of 2632 words.

The political blogs dataset has been subject to several previous studies, though with other topic models. In Romney et al. (2015) structural topic models (another covariate-compatible topic model) were estimated for K = 5, 20, 100 on this same dataset. They show that the number of topics shifts the level of abstraction for topic interpretation. For K = 20 topics were interpretable on a fairly high level but not too general. This same number of topics is used in this experiment.


Concerning the DMR implementation, model hyperparameters are set with the same specification as in the original DMR paper (Mimno and McCallum 2008). Here β = 0.01 and the regression coefficients λ have their priors set to zero-mean normal distributions with variances 0.5 and 10 for the regression coefficients and the intercept respectively. The model is initialized with 50 iterations of the burn-in procedure described in Section 2.3.2. After burn-in the algorithm is run in increments of 50 iterations, after which convergence is assessed and the algorithm either proceeds or is terminated. Each M-step is set to consist of a maximum number of gradient steps s = 500, terminating earlier if there has been no improvement in the objective function value for 5 steps. The value of s = 500 is set to ensure convergence in each M-step within some reasonable threshold.


4 Results

4.1 Experiment 1

In this first experiment CGS, Analytical SEM and Gradient SEM with the two different optimizers GD and Adam are compared. One thing that immediately stands out in Figure 3, where global convergence is shown, is the erratic behavior of GD. Even tuned, it has big difficulties converging. The other algorithms behave quite similarly to each other, with CGS being slightly faster and Gradient SEM using Adam being noticeably slower, but not by any considerable margin. Table 1 shows how often the different algorithms converge to the different modes (or fail to converge) as measured by the log marginal posterior L. CGS and Analytical SEM perform quite similarly, with the latter having slightly better performance. For the Gradient SEM methods, the simple GD implementation performs very poorly while Adam reaches the best mode more often than any of the other algorithms.

Table 1: Mode (as log posterior density) relative frequencies by algorithm on toy corpus

L CGS SEM SEM+GD SEM+Adam

-545.6 0.34 0.42 0.08 0.63

-648.1 0.30 0.32 0.41 0.21

-676.9 0.36 0.26 0.19 0.16

Higher values of L correspond to better modes.

Each algorithm is run 1000 times.


Figure 3: Algorithm convergence for LDA model on toy data

Figure 4: Gradient SEM with Adam convergence by learning rate and gradient steps cap (log marginal posterior against iterations; panel columns s = 1, 5, 15 and rows η = 0.01, 0.1, 1)


4.2 Experiment 2

Here the properties of capping the number of gradient steps s in each M-step are studied for Gradient SEM with the Adam optimizer, using three learning rates of different orders of magnitude. Relative counts of global convergence to the different modes (or non-convergence) are presented in Table 2. It is observed that the Gradient SEM algorithm performs reasonably well over most configurations, with the exception of some relatively poor performance for s = 1 and slow convergence for η = 0.01. Extreme η values have varying degrees of convergence issues for s = 1, 5. Examining Figure 4 shows that for η = 1 algorithm behavior is more erratic and sometimes has difficulties converging, while η = 0.01 behaves as if it is converging but too slowly to finish within the window of the run. Observing the best mode counts in Table 2 shows that taking s = 5 steps outperforms the other settings except for when η = 0.01.

Table 2: Mode (as log posterior density) relative frequencies by learning rate η and gradient step cap s for Gradient SEM with the Adam optimizer on the toy corpus

                 s = 1                      s = 5                      s = 15
L         η = 0.01  η = 0.1  η = 1   η = 0.01  η = 0.1  η = 1   η = 0.01  η = 0.1  η = 1
-545.6       0       0.22    0.23      0.31     0.60    0.31      0.38     0.46    0.19
-648.1       0       0.50    0.45      0.33     0.22    0.28      0.37     0.33    0.44
-676.9       0       0.28    0.29      0.16     0.19    0.37      0.25     0.21    0.37
NA           1       0       0.02      0.20     0       0.04      0        0       0

Higher values of L correspond to better modes.
Each algorithm is run 500 times.
NA implies non-convergence of the algorithm within the given number of iterations.


4.3 Experiment 3

The DMR model converged after 200 total iterations, including the 50 burn-in iterations. Burning in for 50 iterations seems to be a bit excessive in this case, as only very small improvements in the objective function are achieved at the end. During model training the M-step cap of s = 500 is often hit in early iterations, with the M-step almost always converging after that. In Table 3 the top ten words, as measured by largest φ value, are presented for every topic. Many of the topics are interpretable, consisting of fairly coherent words relating to subjects such as:

economics, energy, foreign/military, parties, candidates, campaign/polling, and legislation.

Table 3: Top words by topic for DMR model on political blogs dataset

Rank \ Topic    1    2    3    4    5    6    7    8    9    10

1 billion presid support american elect say use iran make sen

2 compani like hous forc mccain hillari said world even hillari

3 new polit vote iraqi vote sarah bush american get today

4 tax american will troop democrat said peopl secur year say

5 price think bill said will biden presid govern peopl said

6 report one presid year hillari john state nuclear can john

7 energi can republican bush clinton campaign tortur militari one barack

8 year peopl democrat will state obama right will like campaign

9 hous will bush war campaign palin law war will mccain

10 oil obama senat iraq obama mccain court attack polit obama

Rank \ Topic   11   12   13   14   15   16   17   18   19   20

1 nation way today russia ohio run illeg israel work last

2 drill speech mccain end republican question case countri global advis

3 money thing lieberman state senat attack nation forc want call

4 said countri issu iran parti call say can much polit

5 last nation secur govern poll governor constitut pakistan way support

6 congress just gop secur john republican year weapon conserv new

7 govern america sen militari candid eastern one terrorist chang democrat

8 million make parti countri barack sen legisl iraq democrat wright

9 feder say administr nation new updat justic intellig just attack

10 bill right said presid voter joe will use time candid

A word ranked 1 implies that the word has the highest probability of occurring for the given topic.

The words have been stemmed.


5 Discussions and Conclusion

In this thesis two simulation-based experiments have been conducted studying the properties of Gradient SEM for LDA, and one experiment implementing DMR on a real dataset. Concerning Research Question 1, it is apparent from the results that the PPL framework can be utilized for implementing SEM with a numerical gradient-based optimization M-step for topic modeling, despite the lack of functionality for optimizing discrete parameters explicitly. The limitation can be worked around by performing parameter estimation in continuous space through the use of bijective transformations.

In Experiment 1 several algorithms for topic model inference were compared, namely CGS, Analytical SEM, and Gradient SEM. For the latter, the Adam and GD optimization procedures were used with tuned hyperparameters. With performance assessed in terms of the number of times each algorithm converged to the best mode, CGS and Analytical SEM both performed relatively well, with the latter converging to the best mode slightly more often and CGS converging quicker. For the Gradient SEM algorithms, SEM with GD performed extremely poorly and SEM with Adam outperformed all other algorithms. From Figure 3 it is observed that SEM with GD exhibits erratic behavior. It is not considered practically useful, as more sophisticated optimizers can be used rather than spending time and computation tuning it.

While the Adam algorithm looks well behaved (much like Analytical SEM), its results should be interpreted carefully, as the dataset has favourable conditions for performing gradient-based optimization under some configurations. This is exemplified well in Experiment 2, where taking only s = 1 gradient step each iteration in combination with a high learning rate of η = 1 achieves sub-par (but not terrible) performance. The toy dataset for α = β = 1.1 has its two best modes located where word tokens within a document are allocated to the same topic, which is likely to occur when θd has values close to 0 and 1 for d . . . D. Taking a large gradient step when optimizing θ in continuous space may yield such estimates when transforming the parameters back to the simplex. While the modes are highly polarized, these conditions are relatively close to reality as topic models typically aim for sparse solutions. This is in practice typically done by setting weak priors. The experiment highlights an interesting property of Gradient SEM which may be exploited: the ability to overshoot could potentially be utilized for reaching better modes quicker by developing global learning rate schedules. It is also worth noting that even when tuned and presented with favorable conditions, Gradient SEM with the GD optimizer performed very poorly.


A fairer comparison between CGS and Analytical SEM on the one hand and Gradient SEM with Adam on the other is obtained by looking at the results for the latter when run with a small learning rate in Experiment 2. With η = 0.01, the gradient steps are so small that the algorithm cannot exhibit the behavior described above. For η = 0.01 and s = 15, Gradient SEM with Adam performs similarly to CGS and Analytical SEM. This answers Research Question 2: Gradient SEM can perform on par with these methods, at least in a small-scale simulated environment.

Continuing with Experiment 2, Gradient SEM was run with learning rates of different orders of magnitude and with a varying number of gradient steps s taken in each M-step. One result that stands out in Table 2 is that the algorithm performed better for s = 5 than for s = 15, which hints at potential benefits of adding additional noise in each iteration. Research Question 3 concerns whether capping the number of gradient steps s in each M-step can cut computational costs while maintaining comparable quality. Table 2 shows that reducing the number of gradient steps from s = 15 to s = 5 gives overall similar performance, but Figure 4 makes it apparent that capping the number of gradient steps per M-step can slow down global convergence considerably; taking a few more steps than s = 5 in this implementation may have been a better option. Answering Research Question 3: it should be possible to gain some computational benefit by capping the number of gradient steps per M-step while still maintaining quality results, but it is unclear whether this benefit is of much practical use given the overall slower convergence.

The DMR model estimated with Gradient SEM in Experiment 3 produced many interpretable topics, answering Research Question 4. These topics are also similar to those obtained with a structural topic model in Romney et al. (2015). Concerning implementation, the within-M-step stopping criterion consists of taking at most s = 500 gradient steps, terminating early if the cost function has not improved in the last five steps. This seems to have worked well, with the algorithm hitting the cap only in very early iterations, although most iterations still took several hundred steps to converge. The number of steps needed for within-M-step convergence clearly scales with corpus size and may also be limited by factors such as the size of the largest document. The fact that Gradient SEM is parallelizable alleviates this problem considerably, but methods that speed up M-step convergence are of great interest.
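The following is a minimal sketch of this kind of within-M-step stopping rule, assuming an eager-mode `loss_fn` that returns the current value of the objective being minimized and a list of trainable `variables`; the function name and structure are illustrative rather than the thesis code.

```python
import tensorflow as tf

def capped_m_step(loss_fn, variables, optimizer, max_steps=500, patience=5):
    """Take gradient steps until the step cap is hit or the loss has not
    improved for `patience` consecutive steps."""
    best = float("inf")
    stale = 0
    for _ in range(max_steps):
        with tf.GradientTape() as tape:
            loss = loss_fn()
        grads = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(grads, variables))
        current = float(loss)
        if current < best:
            best, stale = current, 0
        else:
            stale += 1
            if stale >= patience:
                break
```

The same loop structure also makes it easy to experiment with smaller step caps, as discussed under Research Question 3.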


To conclude, topic model inference with Gradient SEM utilizing numerical gradient-based optimization in a PPL framework has shown results comparable to other popular methods.

Making more large-scale comparisons on real datasets is of great interest. The results from Experiment 2 hint that inducing noise through stopping criteria may have beneficial properties, which should be examined further; this may help the algorithm escape unattractive regions of the gradient landscape, something that the development of global learning rate schedules could also be used for. While Gradient SEM is parallelizable, there is considerable potential to speed up the M-step, for instance by using more sophisticated optimization procedures. Mini-batching of documents is possible, but how best to implement it is unclear; potentially, mini-batches of documents and topics drawn from the sufficient-statistics matrices could be used to perform full-scale SGD and improve large-scale performance. Such developments could make Gradient SEM a viable method for large-scale generic topic model inference.


References

Abadi, Martín et al. (2016). “TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems”. In: arXiv preprint arXiv:1603.04467.

Benton, Adrian and Mark Dredze (2018). “Deep Dirichlet Multinomial Regression”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 365–374.

Blei, David M (2012). “Probabilistic Topic Models”. In: Communications of the ACM 55.4, pp. 77–84.

Blei, David M, Andrew Y Ng, and Michael I Jordan (2003). “Latent Dirichlet Allocation”. In: The Journal of Machine Learning Research 3, pp. 993–1022.

Buntine, Wray L and Aleks Jakulin (2012). “Applying Discrete PCA in Data Analysis”. In: arXiv preprint arXiv:1207.4125.

Delyon, Bernard, Marc Lavielle, and Eric Moulines (1999). “Convergence of a Stochastic Approximation Version of the EM Algorithm”. In: The Annals of Statistics 27.1, pp. 94–128. ISSN: 0090-5364.

Dempster, Arthur P, Nan M Laird, and Donald B Rubin (1977). “Maximum Likelihood from Incomplete Data via the EM Algorithm”. In: Journal of the Royal Statistical Society: Series B (Methodological) 39.1, pp. 1–22.

Dias, José G and Michel Wedel (2004). “An Empirical Comparison of EM, SEM and MCMC Performance for Problematic Gaussian Mixture Likelihoods”. In: Statistics and Computing 14.4, pp. 323–332.

Diebolt, Jean and Gilles Celeux (1993). “Asymptotic Properties of a Stochastic EM Algorithm for Estimating Mixing Proportions”. In: Stochastic Models 9.4, pp. 599–613.

Diebolt, Jean and Eddie HS Ip (1996). “Stochastic EM: Method and Application”. In: Markov Chain Monte Carlo in Practice. Springer, pp. 259–273.

Dillon, Joshua V et al. (2017). “TensorFlow Distributions”. In: arXiv preprint arXiv:1711.10604.

Du, Simon S, Chi Jin, Jason D Lee, Michael I Jordan, Barnabas Poczos, and Aarti Singh (2017). “Gradient Descent Can Take Exponential Time to Escape Saddle Points”. In: arXiv preprint arXiv:1705.10412.


Eisenstein, J. and E. Xing (2010). The CMU 2008 Political Blog Corpus. Technical Report. Carnegie Mellon University, School of Computer Science, Machine Learning Department.

Foulds, James, Shachi Kumar, and Lise Getoor (2015). “Latent Topic Networks: A Versatile Probabilistic Programming Framework for Topic Models”. In: International Conference on Machine Learning. PMLR, pp. 777–786.

Ge, Rong, Furong Huang, Chi Jin, and Yang Yuan (2015). “Escaping from Saddle Points – Online Stochastic Gradient for Tensor Decomposition”. In: Conference on Learning Theory. PMLR, pp. 797–842.

George, Clint P and Hani Doss (2018). “Principled Selection of Hyperparameters in the Latent Dirichlet Allocation Model”. p. 38.

Gorinova, Maria, Dave Moore, and Matthew Hoffman (2020). “Automatic Reparameterisation of Probabilistic Programs”. In: International Conference on Machine Learning. PMLR, pp. 3648–3657.

Griffiths, Thomas L and Mark Steyvers (2004). “Finding Scientific Topics”. In: Proceedings of the National Academy of Sciences 101.suppl 1, pp. 5228–5235.

Holtzen, Steven, Guy Van den Broeck, and Todd Millstein (2020). “Scaling Exact Inference for Discrete Probabilistic Programs”. In: Proceedings of the ACM on Programming Languages 4.OOPSLA, pp. 1–31.

Jank, Wolfgang (2006). “The EM Algorithm, Its Randomized Implementation and Global Optimization: Some Challenges and Opportunities for Operations Research”. In: Perspectives in Operations Research. Springer, pp. 367–392.

Jonasson, Johan and Måns Magnusson (2021). “Rapid Mixing in Unimodal Landscapes and Efficient Simulated Annealing for Multimodal Distributions”. In: arXiv preprint arXiv:2101.10004.

Kingma, Diederik P and Jimmy Ba (2014). “Adam: A Method for Stochastic Optimization”. In: arXiv preprint arXiv:1412.6980.

Lange, Kenneth (1995). “A Gradient Algorithm Locally Equivalent to the EM Algorithm”. In: Journal of the Royal Statistical Society: Series B (Methodological) 57.2, pp. 425–437.

Lee, Jason D, Max Simchowitz, Michael I Jordan, and Benjamin Recht (2016). “Gradient Descent Converges to Minimizers”. In: arXiv preprint arXiv:1602.04915.
