
Linköpings universitet

Master’s thesis, 30 ECTS | Computer Science

2020 | LIU-IDA/LITH-EX-A--20/007--SE

Using Primary Dynamic Factor Analysis on repeated cross-sectional surveys with binary responses

Primär Dynamisk Faktoranalys för upprepade tvärsnittsundersökningar med binära svar

Arvid Edenheim

Supervisor: Martin Singull
Examiner: Cyrille Berger


Copyright

The publishers will keep this document online on the Internet - or its possible replacement - for a period of 25 years starting from the date of publication barring exceptional circumstances.

The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its www home page: http://www.ep.liu.se/.

Abstract

With the growing popularity of business analytics, companies experience an increasing need for reliable data. Although the availability of behavioural data showing what consumers do has increased, access to data showing consumer mentality, what consumers actually think, remains heavily dependent on tracking surveys. This thesis investigates the performance of a Dynamic Factor Model using respondent-level data gathered through repeated cross-sectional surveys. Through Monte Carlo simulations, the model was shown to improve the accuracy of brand tracking estimates by double-digit percentages, or equivalently to reduce the required amount of data by more than a factor of 2 while maintaining the same level of accuracy. Furthermore, the study showed clear indications that even greater performance benefits are possible.


Acknowledgments

I would first like to thank Nepa for giving me the opportunity to write this thesis at their company. I would also like to direct my sincerest gratitude towards my supervisors at Nepa, Goran Dizdarevic and Patrik Björkholm, for providing great support and expertise, without whom this thesis would not have seen the light of day. A special mention to the Data Science team for putting up with my vigorous gesticulation and verbal confusion with ggplot. Furthermore, I want to thank Martin Singull and Cyrille Berger at Linköping University for providing valuable feedback. Finally, a very special thanks to Lisa Yuen, my devoted cheerleader, for encouragement and listening to endless monologues about this thesis.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
List of Algorithms
1 Introduction
  1.1 Motivation
  1.2 Aim
  1.3 Research Questions
  1.4 Delimitations
2 Theory
  2.1 Factor Analysis
  2.2 Dynamic Factor Analysis
  2.3 State-Space Model
  2.4 Kalman Filter
  2.5 Kalman Smoother
  2.6 Estimating Model Parameters
  2.7 Evaluation Metrics
3 Method
  3.1 Data Sources
  3.2 Primary DFA Model
  3.3 Algorithm
  3.4 Evaluation
4 Results
  4.1 Simulated Data
  4.2 Real Data
5 Discussion
  5.1 Results
  5.2 Method
  5.3 Source Criticism
  5.4 The work in a wider context
6.1 Future Work
Bibliography
A Appendix
  A.1 Excerpts from Simulated Data
  A.2 Fitted DFA Models

List of Figures

2.1 Common Factor Model
2.2 Comparison of Factor Rotations
2.3 Graphical Representation of a State-Space Model
2.4 Prediction, Filtering and Smoothing
2.5 Update cycle of the Kalman Filter
4.1 Example of a fitted DFA model on simulated (binary) data
4.2 Distribution of results with simulated continuous data
4.3 Distribution of test results with simulated discrete data
4.4 Distribution of test results with simulated binary data
4.5 Distribution of test results with diagonal error-covariance matrix
4.6 Distribution of results with simulation from real data - One brand
4.7 Distribution of results with simulation from real data - Multiple brands
5.1 Visualization of DFA and Sample Mean over different sample sizes
5.2 DFA with diagonal error covariance matrix fitted on real data
5.3 DFA fitted on simulated binary data generated from real respondent data
5.4 Comparison of sample sizes: 25 vs. 200 respondents
5.5 Convergence of the DFA model - Log-likelihood and Parameter difference
A.1 Fitted DFA on binary respondent data
A.2 Fitted DFA on binary respondent data


List of Tables

2.1 Factor Extraction Techniques
3.1 Example Survey Questions
3.2 Closed-ended question - Which brand do you prefer?
3.3 Model configurations
4.1 Comparison of moving averages with different order
4.2 Results with simulated continuous data
4.3 Results with simulated discrete data
4.4 MAE and RMSE scores with simulated binary data
4.5 MAE and RMSE scores with simulated binary data and diagonal error-covariance matrix
4.6 RMSE and MAE scores from real-time estimation at step T and T-1
4.7 MAE and RMSE scores with simulation from real data - One brand
4.8 MAE and RMSE scores with simulation from real data - Multiple brands
5.1 Summary of test results

List of Algorithms

2.1 Kalman Filter
2.2 Kalman Smoother
2.3 Expectation Maximization
3.1 Modified Kalman Filter


Acronyms

AIC Akaike Information Criterion
DFA Dynamic Factor Analysis
DLM Dynamic Linear Model
EM Expectation Maximization
FA Factor Analysis
KF Kalman Filter
KPI Key Performance Indicator
M Measure
MA Moving Average
MAD Mean Absolute Deviation
MAE Mean Absolute Error
MAPE Mean Absolute Percentage Error
ML Maximum Likelihood
MLE Maximum Likelihood Estimate
RMSE Root Mean Squared Error
SDFA Structural Dynamic Factor Analysis
SM Sample Mean
SSM State-Space Model
W Week

1 Introduction

When conducting social survey research with the aim of tracking the state of and changes in population sentiment, researchers and businesses typically resort to two methods: longitudinal surveys and repeated cross-sectional surveys [10]. A longitudinal study aims to follow the same panel composition of respondents over time, whereas a cross-sectional design implies collecting data on different measures from multiple respondents at a single point in time [2]. Consequently, a repeated cross-sectional study follows the same survey structure for each measurement, but gathers data from panels with varying respondent composition over time.

As the longitudinal study requires the respondents to be the same for all timesteps, it is significantly more costly and relies on high respondent retention [46, 17]. However, an apparent advantage (among others) of using longitudinal panel data is the ability to observe changes at the individual level [17]. Although lacking some of the benefits of longitudinal data, repeated cross-sectional surveys are by far the most prevalent in the marketing context [10]. The advantages of this method include lower costs, the possibility to introduce additional respondents and the option to aggregate multiple measurements in order to obtain a larger sample size over a coarser interval [46, 17].

Considering the benefits above and the relevance in the context of tracking consumer effects of marketing efforts, this thesis will focus on repeated cross-sectional surveys performed by repeated sampling from non-overlapping panels of respondents. The general purpose of such studies is not to track changes at the individual level, but to observe changes in the state of the population from which the respondents were sampled, henceforth referred to as the population mean. A key assumption made in these studies is that while the individuals may be different for each point of measurement, they, when aggregated, still represent estimates of the same population mean [10]. This aggregated sample, computed by averaging over all respondents surveyed in the same time period, is referred to as the sample mean and constitutes the estimate of the population mean at that point in time.

An issue with using data gathered from repeated cross-sectional surveys is sampling error. The objective of performing tracking studies can be reduced to this one significant challenge of discerning real changes in population mean from changes related to differences in panel composition or common method bias, a problem that becomes increasingly difficult when the population is heterogeneous and sample sizes are small [10]. In order to mitigate this, the current industry standard method is to impose a moving average on the sample means, resulting in a generally smoother fit at the expense of increasingly inert dynamics.

An alternative method that has gained traction in the marketing literature is Dynamic Factor Analysis (DFA). The proposed models differ significantly in their implementations, and include both secondary analysis models such as [11] and primary analysis models such as [29, 28, 10], where secondary refers to models taking the sample mean as input, whereas primary models utilize individual-level data in an attempt to capture dynamics at the respondent level [34]. The use of a primary model can be motivated by considering three structures, or patterns, that typically are present in repeated cross-sectional data [10]:

1. Within-measure inter-temporal dependence. As the population mean reflects the collective state of the population, it is intuitive to imagine that it will not change drastically from one time period to the next, a property referred to as temporal dependence.

2. Between-measure co-movements in population means. Different measures gathered from the same survey should be correlated at an aggregated level over time. By considering correlations between population means, it is possible to compare a measure's movement with the expected movement according to the dynamics of the correlation.

3. Between-measure cross-sectional co-variations (also referred to as individual-level response heterogeneity [28]). This refers to correlation on the primary level, stating that measures can be correlated on the respondent level due to factors such as common method bias, halo effect1, panel composition or other respondent characteristics. When aggregated, changes related to these sampling errors could be mistaken for real between-measure co-movements in population means.

By leveraging the information above, the primary model can borrow information across time and between measures in order to filter idiosyncratic movements in sample means and better discern real changes in population means from those related to sampling error.

This thesis applies a Primary Dynamic Factor model to repeated cross-sectional survey data. It investigates the fitness and capacity of the model when the individual-level data is continuous, discrete and binary. Furthermore, the model is evaluated and compared with industry standard methods, and the implications of this comparison are presented.

1.1 Motivation

Respondent data can take different forms depending on the type of question and the context in which the data will be used. It may be interval or ratio-scaled as in surveys following customer satisfaction for instance, where the questions often are designed to be answered with a rank or score of sorts. The data may also be entirely binary, which is often the case in brand awareness tracking. If the study follows a number of different brands, each brand’s occurrence in a respondent’s answer would then be indicated with a one or a zero. However, when aggregated, this data would still be converted to a ratio-scaled sample mean, enabling the use of DFA models.

While existing studies stress the superiority of DFA models and put them to use in real environments (see [28, 10] for examples), they overlook the presence of binary respondent data, in addition to lacking rigorous simulation studies of the model's performance under different conditions. This thesis evaluates the performance of the DFA model proposed by Du and Kamakura [10] using Monte Carlo simulations with a wide range of settings, in collaboration with Nepa, a consumer research company located in Stockholm.

1 The tendency for positive impressions of a person, company, brand or product in one area to positively influence opinions in other areas.

1.1.1 Nepa

Nepa is a Swedish consumer science company founded in 2006. One of their core business pillars is Brand Tracking, that is, measuring the performance of a specific brand over time through Key Performance Indicators (KPIs). These KPIs are assumed to reflect the consumer's attitude towards and awareness of the brand and are typically used to evaluate the effects of marketing campaigns. In the context of Nepa, this is performed with data gathered weekly by repeated cross-sectional studies.

1.2 Aim

The aim of this research is to investigate the benefit of using Primary Dynamic Factor Analysis on data obtained from repeated sampling on non-overlapping panels of respondents, and to study the impact on estimation error when the sampled data is continuous, discrete and binary.

1.3 Research Questions

1. Can a lower estimation error be obtained when using a Dynamic Factor Analysis model on respondent data gathered from repeated cross-sectional surveys, compared to a moving average?

2. What estimation error can be achieved with a Dynamic Factor Analysis model when the respondent data is binary and how does this compare to numerical values?

1.4 Delimitations

As the model subject to evaluation was administered by Nepa, and this evaluation in itself provides a sufficient scope, no alternative models other than those used as baseline will be considered. Furthermore, although DFA shares conceptual similarities with conventional Factor Analysis (FA), only those areas of FA with relevance to DFA will be described, as they solely constitute a means to an end. Finally, Monte Carlo simulations based on real respondent data, as opposed to simulated, require a sample population more substantial than that available in this research, which is why it is left for further research.


2 Theory

This chapter presents the theoretical concepts upon which this research is based.

2.1 Factor Analysis

Factor Analysis (FA) is a multivariate statistical method for dimension reduction and structure detection. The general idea of factor analysis in terms of dimension reduction is to describe a set of M observed variables, or indicators, by the covariance of a set of N common latent factors, such that M ≫ N [19, 10]. By doing this, factor analysis aims to describe every observed variable as a linear combination of latent factors, which together compose a more parsimonious representation of the observed data [40, 12]. These factors are common in the sense that each factor describes the covariance of multiple observed variables, and latent in the sense that they are not observed, but inferred from the values of the observed variables [12]. Furthermore, the latent factors are derived in such a way that they align with subsets of variables that are highly correlated within the subset, but feature relatively small correlations with variables in other subsets. This facilitates interpretation of said factors as a representation of the underlying constructs that cause the correlation [19].

2.1.1 Model

A basic model describing the relation between an observed variable and the latent factors is

x_i = λ_{i1} F_1 + λ_{i2} F_2 + ... + λ_{iN} F_N + e_i,    (2.1)

where x_i is the i-th observed variable, λ_{ij} is the factor loading of factor j on variable i, describing the linear dependence between the observed variable and the latent factor, that is, the strength of the association between the factor and the variable, and e_i is the unique factor, which exerts linear influence on a single variable and is sometimes referred to as specific error [19, 4, 13]. This is deceivingly similar to, but is not to be mistaken for, the multivariate regression model,


Figure 2.1: Common Factor Model

where the factors F_i can be observed. Subsequently, Equation (2.1) can be generalised into the expression

\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{pmatrix} =
\begin{pmatrix} \lambda_{11} & \lambda_{12} & \cdots & \lambda_{1N} \\ \lambda_{21} & \lambda_{22} & \cdots & \lambda_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda_{M1} & \lambda_{M2} & \cdots & \lambda_{MN} \end{pmatrix}
\begin{pmatrix} F_1 \\ F_2 \\ \vdots \\ F_N \end{pmatrix} +
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_M \end{pmatrix},    (2.2)

which can be described by the more general factor model

X_{M×1} = A_{M×N} F_{N×1} + e_{M×1}.    (2.3)

In this, X represents the vector of observed measures, A is the loading matrix describing the linear dependencies, and e is the vector of specific factors, which contains the idiosyncratic variance in the observations not described by the latent factors. The factor model described above comes with a set of essential assumptions [19]:

1. The factors have a variance of 1 and are independent, such that Cov(F) = I, where I is the identity matrix.

2. The unique variance has zero mean, that is, E(e) = 0.

3. The covariance matrix of the specific factors is diagonal, such that Cov(e) = Ψ = diag(Ψ_1, ..., Ψ_M).


It is important to note that the assumptions above are based on the notion of the orthogonal factor model where the factors are independent, in contrast to the oblique factor model which does not enforce a diagonal factor covariance matrix [19]. However, that model is out of scope for this research and will not be considered any further.

Given these assumptions and the notations above, the covariance for the observed variables can be derived as

Cov(X) = Σ = AA′ + Ψ,
Cov(X, F) = A.    (2.4)

For future purposes, it is useful to introduce additional notation related to (2.4). We estimate the covariance matrix Σ and the correlation matrix Cor(X) by

Σ̂ = S,    Ĉor(X) = R,    (2.5)

where S denotes the sample covariance matrix and R the sample correlation matrix obtained from the observations [19]. For a factor analysis to prove useful, it is necessary that the sample covariance matrix S, and hence R, deviates distinctly from a diagonal matrix. If that is not the case, it indicates that the observed variables are not related and an underlying construct will not be identifiable [19].

The sum of the squared factor loadings for each variable, that is, for each row in the loading matrix A, is called the communality of that variable. This is a measure of the amount of variance in the observed variable explained by the latent factors [4]. Conversely, the variance not described by the communality is called the unique variance, denoted Ψ_i [19]:

Var(x_i) = λ_{i1}² + ... + λ_{iN}² + Ψ_i,    i = 1, ..., M.    (2.6)
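To make equations (2.4) and (2.6) concrete, the following small sketch (Python with NumPy, using an arbitrary made-up loading matrix and unique variances, not values from this thesis) reconstructs the model-implied covariance Σ = AA′ + Ψ and the communalities:

import numpy as np

# Hypothetical loading matrix A (M = 4 observed variables, N = 2 factors)
A = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9],
              [0.2, 0.6]])
Psi = np.diag([0.35, 0.47, 0.18, 0.60])    # unique (specific) variances

Sigma = A @ A.T + Psi                      # model-implied covariance, eq. (2.4)
communality = (A ** 2).sum(axis=1)         # sum of squared loadings per variable
unique_var = np.diag(Sigma) - communality  # equals diag(Psi), eq. (2.6)

print(np.round(Sigma, 2))
print(np.round(communality, 2), np.round(unique_var, 2))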

2.1.2 Factor Extraction

In order to compute the model parameters listed in the common factor model (2.3), factor analysis resorts to a considerable number of different methods. An extensive, but not comprehensive, list of factor extraction techniques as compiled by [38] can be found in Table 2.1. However, as the majority of these methods have little relevance for this research, they will only be introduced briefly to inform the reader of the alternatives and promote further reading. Common to all of them is that they intend to calculate the factor loadings in the loading matrix A that most accurately reproduce the sample correlation matrix R according to (2.4) and (2.5). While every method comes with its own set of advantages and drawbacks, they all tend to generate similar factor solutions when the underlying constructs in the data are fairly clear [13].


Principal Factors: Employs the notion of squared multiple correlations (SMC) to estimate the communalities in an iterative approach. It extracts orthogonal factors of decreasing importance, where the order in which the factors are generated determines the amount of variance each factor accounts for.

Maximum Likelihood Factoring: Seeks to estimate numerical values for the factor loadings that are most likely to have produced the observed data.

Unweighted Least Squares Factoring: Aims to directly minimize the squared residuals between the off-diagonal elements of the sampled and reconstructed correlation matrices. The diagonal elements are then inferred from the calculated estimate.

Generalized Least Squares Factoring: Similar to unweighted least squares, but with the difference that it applies a weight to each variable in accordance with its correlation with other variables, where highly unique variables that do not correlate with other variables are treated as less important. Naturally, the resulting model will fit the variables strongly driven by the factors better.

Table 2.1: Factor Extraction Techniques

2.1.2.1 Factor Extraction with Maximum Likelihood

The Maximum Likelihood (ML) method aims to numerically infer the Maximum Likelihood Estimate (MLE) of the factor loadings and the unique variances. In order for the ML method to be applicable, the observed variables are assumed to be sampled from a multivariate Gaussian distribution [12]. If this criterion is not satisfied, the resulting factor structure may be distorted and unreliable, in which case methods such as Principal Factoring may be preferable. An advantage of the ML method is that it enables statistical evaluation of how well the factors are able to explain the variance seen in the sample correlation matrix R. This goodness-of-fit is not provided by other methods [4].

2.1.3 Factor Selection

Before applying the factor extraction method, the first decision is to determine the appropriate number of factors to extract. The realistic goal is not to derive a representation that accounts for all variance in the response variables, but rather to find a sufficient number of factors that accounts for the general correlations in the data [12]. There is a general consensus that underfactoring, that is, extracting too few factors, is worse than overfactoring [44, 14]. Underfactoring may cause variables to incorrectly load on a factor only because the actual factor responsible for the variance was excluded from the model, causing the underlying meaning of the factor to be obscured [13]. On the other hand, overfactoring generally results in small loadings on the redundant factors, having only minor effects on the overall explained variance [13, 14]. However, overfactoring by definition defeats the purpose of dimension reduction by complicating the general factor representation and may cause false indications of structures in the data, which is why it should be avoided [13].

There are a number of methods for factor selection, the most common being ML or eigenvalue analysis methods such as the Kaiser-Guttman criterion, the scree test and parallel analysis. However, as these are beyond the scope of this thesis, they will not be elaborated further.


2.1.4 Factor Rotation

While the primary objective of factor analysis may be to find a set of latent factors that parsimoniously represent the data, the second objective is to detect structural dependencies among the observed variables. In order to obtain factors that are easier to interpret, the extracted factors often require rotation. This does not affect the quality of the extracted factors in terms of model fit, since the rotation only considers mathematically equivalent solutions in terms of the total amount of explained variance [38, 12]. With this in mind, the purpose of the rotation can be regarded as analytically determining which of these equivalent solutions is the most meaningful [12]. This section will present two different types of rotations: the orthogonal rotation and the oblique rotation. There exists a multitude of different methods for performing both types of rotations, with the most common orthogonal rotations being Varimax, Quartimax and Equamax, whereas the oblique methods include Direct Oblimin, Quartimin and Promax. Similarly to factor extraction, more distinct underlying constructs in the original data will result in more consistent results when comparing different factor rotation methods. Moreover, possible differences after factor extraction tend to be less discernible after rotation [38].

Figure 2.2: Comparison of Factor Rotations

The primary difference between the orthogonal and the oblique rotation may be deduced from their names. As can be observed in Figure 2.2, the orthogonal rotation is constrained to return factors which are uncorrelated, meaning they are required to retain the 90° angle between factors, whereas the oblique rotation allows factors to feature angles smaller or greater than 90° if that provides a better solution, given a specific rotation criterion [12]. As previously stated, both rotations result in mathematically equivalent factor models in terms of communalities and the total variance explained by the factors. The only difference is in how the variance is distributed over the factors [12].


2.1.4.1 Orthogonal Rotation

The reasoning behind orthogonal rotations is that factors are more interpretable if they are uncorrelated, which makes the factor loadings represent correlations between the factors and the observed variables [4]. This property facilitates the identification of an underlying meaning of a factor, since the variation of a variable can be traced directly to the factors on which it loads highly [38]. However, this reasoning is potentially deceptive if the factors are in fact expected to correlate. Enforcing an orthogonal structure on the factors does not alter the true underlying constructs in the data, but may cause a misrepresentation wherein variables are forced to correlate with a specific factor only because the factor in turn correlates with another factor that influences the variable [12]. Orthogonal rotations should thus be used sensibly unless the underlying processes are known to be somewhat independent [38]. Orthogonal rotations are based on the notion that there exists a simplicity criterion, that is, a property of the loading matrix, that describes the ability of the matrix to produce a simple factor structure. In the case of the Varimax rotation, this property is defined as the sum of the variances of the squared factor loadings, which typically results in the factor loadings being either near one or near zero. The rotation then seeks the solution of factor loadings which maximizes this property [12].

2.1.4.2 Oblique Rotation

The oblique rotation can be regarded as the more general factor rotation. While the orthogonal rotation seeks the mathematically equivalent solution that maximizes the simplicity criterion among the orthogonal factor solutions, the oblique rotation maximizes this criterion over all equivalent solutions. If the best solution lies in the orthogonal solution space, the oblique rotation will identify it; otherwise it will generate an inter-correlated factor structure. Contrary to popular belief, the oblique rotation neither causes nor requires the factors to be correlated, but instead generates a factor structure more true to the underlying structures in the data by allowing correlations among factors to exist. As such, the oblique rotation provides additional information about the underlying structures in the data, which is ignored when enforcing an orthogonal representation [12]. Bearing this in mind, the oblique rotation generally yields a more complex factor structure. Although the solution is simpler with respect to the criterion, two factors may correlate to a degree that makes it difficult to differentiate between them in terms of underlying significance [38].

2.1.5 Factor Scores

Factor scores are the estimated values of the factors F. As they are unobservable, they can be thought of as the score that would have been observed if it would have been possible to observe the factor directly [4].

2.1.6 Factor Interpretation

After acquiring a factor model, it might be of interest to interpret the meaning of the factors in terms of the latent constructs in the data that they describe. This is performed by examining which variables load substantially on which factors and interpreting the thematic meaning of the grouped variables, that is, the variables significantly driven by changes in the same factors, as well as the potential significance of the variables not explained by said factors. Determining the meaning of groups of variables requires knowledge of the data and of the context from which the data is gathered [12]. However, when using FA solely for dimension reduction, the need to interpret the factors is reduced.


2.2 Dynamic Factor Analysis

Dynamic Factor Analysis (DFA) is conceptually similar to FA in that it is a method for dimension reduction and structure detection. However, it comes with a key distinction. FA assumes that all observations are independently sampled, which means that it is indifferent to the order in which the observations are arranged and will disregard any longitudinal dependencies, which is not a desirable trait in time-series analysis [11]. To mitigate this, DFA assumes temporal interdependencies in the latent factors. In DFA, each latent factor is modelled to follow an AR(1) process, meaning that the state of the latent factor at a certain timestep t is a consequence of the state of the factor at the previous timestep t − 1 and a state shock containing the state change, according to the state equation

z_t = z_{t−1} + ξ_t,    ξ_t ∼ N(0, Ω),    (2.7)

where z_t is the vector of latent factor states at timestep t and ξ_t denotes the latent state shocks at timestep t. Theoretically, the order is not bound to be 1, as described by [27, 26], but is set to 1 by convention and for relevance to this research. In the context of factor analysis, the state vector z_t can be incorporated into the factor model (2.3), resulting in the observation equation

x_t = A z_t + e_t,    e_t ∼ N(0, Σ),    (2.8)

where x_t contains the observed variables at timestep t, A is the loading matrix and e_t contains the unique factors at timestep t.

The methods used for extracting the factors differ from those used in traditional factor analysis and typically (see [27, 10, 11, 47, 31, 41]) involve techniques from state-space modelling such as the Kalman filter and smoother, as well as the Expectation Maximization algorithm, outlined in Sections 2.4, 2.5 and 2.6.1, respectively.
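Before turning to that estimation machinery, the generative structure in (2.7)-(2.8) can be illustrated with a short simulation sketch (assumed NumPy code with arbitrary example dimensions and noise levels):

import numpy as np

rng = np.random.default_rng(1)
T, n_factors, n_series = 100, 2, 6

A = rng.normal(size=(n_series, n_factors))   # loading matrix
Omega = np.diag([0.05, 0.02])                 # state shock covariance (diagonal)
Sigma = 0.3 * np.eye(n_series)                # unique-factor covariance

z = np.zeros((T, n_factors))                  # latent factor states, eq. (2.7)
x = np.zeros((T, n_series))                   # observed variables, eq. (2.8)
for t in range(1, T):
    z[t] = z[t - 1] + rng.multivariate_normal(np.zeros(n_factors), Omega)
    x[t] = A @ z[t] + rng.multivariate_normal(np.zeros(n_series), Sigma)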

2.3 State-Space Model

A State-Space Model (SSM) is based on the notion that the behaviour of a dynamic system can be described by a set of two equations: a state equation describing the transition dynamics between consecutive states, and an observation equation representing the uncertainty introduced by observation [3]. The concept of state is key to this description, articulately defined by Haykin [16] as “the minimal set of data that is sufficient to uniquely describe the unforced dynamical behavior of the system”. It is often denoted z_t, meaning the specific state at time t. Typically, the state is latent, that is, it cannot be quantified directly, but is inferred through measurements x_t containing some amount of measurement noise [16].

The state process is generally assumed to be a Markov process, meaning the future states z_{t+1:T} are conditionally independent of past states z_{0:t−1} given the present state z_t. A second assumption of state-space models is that all observations x_t are conditionally independent given the state z_t [36, 30]. These assumptions can be generalized as

p(z_t | z_{0:t−1}, x_{1:t−1}) = p(z_t | z_{t−1}),    (2.9)

p(x_t | z_{0:t}, x_{1:t−1}) = p(x_t | z_t),    (2.10)


Figure 2.3: Graphical Representation of a State-Space Model

where (2.9) is called the transition probability and (2.10) the emission probability [40, 3].

If the state and observation variables are discrete, the state-space model is often referred to as a hidden Markov model [3], whereas the continuous case is simply described as an SSM, or specified as one of its variants, such as the Dynamic Linear Model (DLM) [36].

2.3.1 Dynamic Linear Model

A Dynamic Linear Model (DLM) is a special continuous case of the general state-space model, assuming linear dependencies and Gaussian transition and emission models [3]. An example of a DLM, as defined in [39, 42], is

p(z_t | z_{t−1}) = N(z_t | A z_{t−1} + B u_t, R),    (2.11)
p(x_t | z_t) = N(x_t | C z_t, Q).    (2.12)

The analogous general formulation of the DLM described above is the state equation

z_t = A z_{t−1} + B u_t + e_t,    e_t ∼ N(0, R),    (2.13)

and observation equation

x_t = C z_t + δ_t,    δ_t ∼ N(0, Q).    (2.14)

In the equations above, A is a square matrix that relates the previous state to the new state according to the dynamics of the model. Similarly, B incorporates inputs from the control variable u_t into the new state [42]. As the problem at hand does not require any control inputs, B and u_t will not be considered any further. Similar to A and B, C is the measurement matrix that relates the state to the measurement [16], similar to the loading matrix in factor analysis. Finally, Q is the measurement covariance matrix, describing the measurement noise, and R is the process covariance matrix, representing the uncertainty introduced by the state transition [39].

In order to perform inference on the latent states, one must compute the conditional densities p(z_s | x_{1:t}). Typically, it is practical to divide this problem into three separate cases, illustrated in Figure 2.4. When t < s, the problem is referred to as prediction, when t = s it is called filtering, and when t > s it is referred to as smoothing [30, 36]. In practice, this implies that filtering only considers information from past and present observations, whereas smoothing takes all observations into account. However, while filtering can be performed sequentially in a forward recursion as new data becomes available, the smoothed distributions are computed through backward recursion, starting from the last time step, and thus have to be recalculated in order to incorporate new data [36].

Figure 2.4: Prediction, Filtering and Smoothing

2.4 Kalman Filter

Originally proposed by Rudolf E. Kalman in 1960 [20], the Kalman Filter is a forward recursion algorithm for continuous states with the purpose of predicting the true latent states given a sequence of noisy measurements and a linear-Gaussian system model [3]. In essence, it provides a set of recursive mathematical equations for efficiently calculating a least-squares solution, which can be utilized to estimate past, present and future states [42].

The implementation of the Kalman Filter as interpreted by Thrun, Burgard and Fox [39] is outlined in Algorithm 2.1. The linear process that the algorithm is trying to estimate is assumed to be governed by equations (2.13) and (2.14).

Algorithm 2.1. Kalman Filter

1: function KalmanFilter(µ_{t−1}, Σ_{t−1}, u_t, z_t)
2:   µ_{t|t−1} = A_t µ_{t−1} + B_t u_t
3:   Σ_{t|t−1} = A_t Σ_{t−1} A_t^T + R_t
4:   K_t = Σ_{t|t−1} C_t^T (C_t Σ_{t|t−1} C_t^T + Q_t)^{−1}
5:   µ_{t|t} = µ_{t|t−1} + K_t (z_t − C_t µ_{t|t−1})
6:   Σ_{t|t} = (I − K_t C_t) Σ_{t|t−1}
7:   return µ_{t|t}, Σ_{t|t}
8: end function

For an intuitive overview of the operations of the Kalman Filter, Algorithm 2.1 can be divided into two parts: prediction and correction, where rows 2-3 pertain to prediction and rows 4-6 to correction. The reason for this separation is that during the prediction steps, the algorithm provides an a priori estimate of the current state and error covariance based solely on the state and covariance from the previous time step. During the correction, the predicted error covariance is used to compute the Kalman gain, which is a weight describing to what degree the observation should be trusted. Using this knowledge, a new measurement is incorporated to obtain a corrected a posteriori estimate of the state and error covariance, calculated as the weighted mean of the prediction and the measurement. This procedure is then repeated for all time steps [42].

Figure 2.5: Update cycle of the Kalman Filter
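As a minimal illustration, Algorithm 2.1 translates almost line by line into NumPy; the sketch below keeps the Thrun et al. notation (A, B, C for the system matrices, R and Q for process and measurement covariance) and assumes time-invariant matrices for brevity:

import numpy as np

def kalman_filter_step(mu, Sigma, u, z, A, B, C, R, Q):
    # Prediction (rows 2-3): a priori state mean and covariance
    mu_pred = A @ mu + B @ u
    Sigma_pred = A @ Sigma @ A.T + R
    # Correction (rows 4-6): Kalman gain and a posteriori estimates
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Q)
    mu_new = mu_pred + K @ (z - C @ mu_pred)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_pred
    return mu_new, Sigma_new

# Tiny usage example: a scalar random walk observed with noise (made-up values)
A = B = C = np.eye(1)
R, Q = np.array([[0.01]]), np.array([[0.5]])
mu, Sigma = np.zeros(1), np.eye(1)
for z_obs in [0.1, 0.3, 0.2, 0.4]:
    mu, Sigma = kalman_filter_step(mu, Sigma, np.zeros(1), np.array([z_obs]), A, B, C, R, Q)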

2.4.1 Lind's Modified Kalman Filter

As computing the Kalman gain as described in Algorithm 2.1 requires the inversion of matrices proportional to the sample size of the survey, that is, N × N, attempts have been made to find a more parsimonious approach [22, 21, 25]. One such, seemingly undiscovered, approach was proposed by Jo T. Lind in 2005 [25]. Lind suggests that the correction step, that is, rows 4-6 in Algorithm 2.1, can be written as

Σ_{t|t} = (Σ_{t|t−1}^{−1} + Q_t^{−2} A^T N_t A)^{−1},
µ_{t|t} = µ_{t|t−1} + Q_t^{−2} N_t Σ_{t|t} A^T (z̄_t − A µ_{t|t−1}),    (2.15)

resulting in the inversion of matrices of dimension n × n, with n being the number of latent factors such that n ≪ N [25]. While Lind mainly highlights the computational benefits of this modification, Du and Kamakura put it to use and suggest that it can be combined with the primary DFA model in Section 2.2 in order to produce more accurate estimates compared to a secondary analysis [25, 10].
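The computational point can be sketched in code: with N_t i.i.d. respondent vectors per period, the correction only needs the respondent mean and inverts n × n matrices, where n is the number of latent factors. The snippet below is a generic information-form update under the model y_jt = A z_t + e_jt with e_jt ∼ N(0, Σ); it illustrates the same idea as (2.15) but is not claimed to reproduce Lind's exact equations:

import numpy as np

def corrected_moments(mu_prior, Sigma_prior, A, Sigma_err, y_resp):
    # y_resp holds the N_t respondent rows (N_t x N) observed in this period
    n_resp = y_resp.shape[0]
    y_bar = y_resp.mean(axis=0)                        # sufficient statistic
    Sigma_err_inv = np.linalg.inv(Sigma_err)
    # Posterior covariance: only an n x n inverse, n = number of latent factors
    Sigma_post = np.linalg.inv(np.linalg.inv(Sigma_prior)
                               + n_resp * A.T @ Sigma_err_inv @ A)
    mu_post = mu_prior + n_resp * Sigma_post @ A.T @ Sigma_err_inv @ (y_bar - A @ mu_prior)
    return mu_post, Sigma_post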

2.5 Kalman Smoother

The Kalman smoother, often referred to as the Rauch-Tung-Striebel smoother [32, 16, 36, 3], is a forward-backward recursion algorithm for computing the conditional state densities. In contrast to the Kalman Filter presented in Section 2.4, which only considers the past observations x_i where i ∈ [1, t], the Kalman smoother considers all available observations, both past and future, to determine the optimum state estimates [16]. Other methods exist for computing the same smoothed estimates [16, 15]. While these produce similar results, they often consist of three-step solutions: 1) a forward filter, 2) a backward filter, and 3) a smoother combining the results from 1) and 2). The Kalman smoother is more efficient in that it integrates the output of the Kalman filter directly in the backward pass, making it a two-step algorithm [16].


The complete update equations as presented in [3, 16, 36] are given in Algorithm 2.2, followed by clarifications of each step, where

Algorithm 2.2. Kalman Smoother

1: function KalmanSmoother(µ_{t|T}, Σ_{t|T})
2:   K_{t−1} = Σ_{t−1|t−1} A′ Σ_{t|t−1}^{−1}
3:   µ_{t−1|T} = µ_{t−1|t−1} + K_{t−1} (µ_{t|T} − µ_{t|t−1})
4:   Σ_{t−1|T} = Σ_{t−1|t−1} + K_{t−1} (Σ_{t|T} − Σ_{t|t−1}) K′_{t−1}
5:   return µ_{t−1|T}, Σ_{t−1|T}
6: end function

Σ_{t|t}, Σ_{t+1|t}, µ_{t|t} and µ_{t+1|t} are given by the Kalman filter. The complete smoothing process proceeds as follows [16]:

1. The Kalman filter is applied to the observations using forward recursion as described in Section 2.4.

2. The recursive smoother equations as described in Algorithm 2.2 are applied to the output from 1., starting at time step T and propagating backwards.

As the algorithm starts from the last timestep and propagates backwards, it never updates the state estimates at that specific timestep, resulting in estimates possibly less true to the dynamics of the model in comparison to the other timesteps.
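A sketch of Algorithm 2.2 as a NumPy backward pass is given below; mu_f/P_f are the filtered moments, mu_p/P_p the one-step predictions, and A the transition matrix, all assumed to be stored from a forward pass such as Algorithm 2.1:

import numpy as np

def rts_smoother(mu_f, P_f, mu_p, P_p, A):
    # Backward recursion of Algorithm 2.2 over T stored filter steps
    T = len(mu_f)
    mu_s = [m.copy() for m in mu_f]
    P_s = [P.copy() for P in P_f]
    for t in range(T - 2, -1, -1):                        # start at T-1, propagate backwards
        K = P_f[t] @ A.T @ np.linalg.inv(P_p[t + 1])      # smoother gain (row 2)
        mu_s[t] = mu_f[t] + K @ (mu_s[t + 1] - mu_p[t + 1])        # row 3
        P_s[t] = P_f[t] + K @ (P_s[t + 1] - P_p[t + 1]) @ K.T      # row 4
    return mu_s, P_s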

2.6 Estimating Model Parameters

Consider the problem of finding the MLE given the likelihood function

L(θ; X) = p(X | θ) = ∫ p(X, Z | θ) dZ,    (2.16)

where X is the set of observed data, Z is the set of latent data and θ are the model parameters. For mathematical convenience, it is customary to consider the corresponding log-likelihood function ℓ(θ) = ln L(θ; X). The MLE can then be found as

θ_MLE = argmax_θ ℓ(θ).    (2.17)

As it turns out, this is not a trivial task when the integral in (2.16) does not have a closed-form solution [8]. The presence of the integral prevents the logarithm from acting directly on the joint distribution, resulting in non-trivial expressions for the MLE [3]. This problem requires iterative methods in order to estimate the likelihood.

2.6.1 Expectation Maximization

The Expectation Maximization (EM) algorithm is a general two-step iterative technique for computing maximum likelihood estimates for probabilistic models with latent variables, that is, missing data or unknown parameters [8]. It is general in that it provides a probabilistic framework for performing inference on the missing variables with a variety of implementations, and has been used widely in a multitude of applications [3].


Consider the problem outlined in (2.17). In order to solve this, it is practical to introduce the notation θ^old, θ^new and Q(θ, θ^old), where

Q(θ, θ^old) = ∫ p(Z | X, θ^old) ln p(X, Z | θ) dZ.    (2.18)

Let the set {X, Z} denote the complete data set, which makes the complete-data log-likelihood function ln p(X, Z | θ). Since the complete data is not given, it is necessary to compute the posterior distribution of the latent data, that is, p(Z | X, θ), which hopefully brings some clarity to the definition of Q(θ, θ^old). This is where the ingeniousness of the EM algorithm takes place. By dividing the problem into two steps, the Expectation (E) step and the Maximization (M) step, it is possible to iteratively converge to a distribution for the complete-data log-likelihood. In the E-step, the posterior of the latent data is evaluated using the parameters θ^old. This distribution is then used to find the expectation of the complete-data log-likelihood, which corresponds to evaluating the expression in (2.18). In the subsequent M-step, this expectation is maximized in order to obtain the new parameters θ^new, corresponding to the expression

θ^new = argmax_θ Q(θ, θ^old).    (2.19)

In order to start the algorithm, it is necessary to choose initial values for the parameters θ^old. The true key to why this method is viable is the property that each iteration, as it turns out, will increase the likelihood of the incomplete data. Convergence is thus declared when the likelihood increases by less than some chosen stopping criterion, or similarly when the parameters satisfy θ^new ≈ θ^old [3].

In order to summarize the conclusions presented in this section, the general EM-algorithm as presented by [3] is rendered in its entirety below.

Algorithm 2.3. Expectation Maximization

1: Choose initial settings for the parameters θ^old.
2: E step: Evaluate p(Z | X, θ^old).
3: M step: Evaluate θ^new given by Equation (2.19), where Q(θ, θ^old) is given in Equation (2.18).
4: Check for convergence of either the log-likelihood or the parameter values. If the convergence criterion is not satisfied, then let θ^old ← θ^new and return to step 2.
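To make the loop in Algorithm 2.3 concrete, the sketch below runs EM on a deliberately simple univariate local-level model (z_t = z_{t−1} + w_t, y_t = z_t + v_t), where the E-step is a scalar Kalman filter and smoother and the M-step updates the two variances in closed form. This is only a toy stand-in for the DFA estimation described in Chapter 3, with all values chosen arbitrarily:

import numpy as np

rng = np.random.default_rng(0)
T, q_true, r_true = 300, 0.05, 0.5
z = np.cumsum(rng.normal(0.0, np.sqrt(q_true), T))         # latent random walk
y = z + rng.normal(0.0, np.sqrt(r_true), T)                 # noisy observations

def e_step(y, q, r, m0=0.0, P0=10.0):
    # Scalar Kalman filter + RTS smoother: posterior moments of the latent states
    n = len(y)
    m_p, P_p = np.zeros(n), np.zeros(n)                     # one-step predictions
    m_f, P_f = np.zeros(n), np.zeros(n)                     # filtered moments
    for t in range(n):
        m_p[t] = m_f[t - 1] if t > 0 else m0
        P_p[t] = (P_f[t - 1] if t > 0 else P0) + q
        K = P_p[t] / (P_p[t] + r)
        m_f[t] = m_p[t] + K * (y[t] - m_p[t])
        P_f[t] = (1.0 - K) * P_p[t]
    m_s, P_s, P_lag = m_f.copy(), P_f.copy(), np.zeros(n)
    for t in range(n - 2, -1, -1):                          # backward pass
        J = P_f[t] / P_p[t + 1]
        m_s[t] = m_f[t] + J * (m_s[t + 1] - m_p[t + 1])
        P_s[t] = P_f[t] + J ** 2 * (P_s[t + 1] - P_p[t + 1])
        P_lag[t + 1] = J * P_s[t + 1]                       # Cov(z_t, z_{t+1} | y_1:T)
    return m_s, P_s, P_lag

def m_step(y, m_s, P_s, P_lag):
    # Closed-form variance updates from the expected sufficient statistics
    e_dz2 = ((P_s[1:] + m_s[1:] ** 2)
             - 2 * (P_lag[1:] + m_s[1:] * m_s[:-1])
             + (P_s[:-1] + m_s[:-1] ** 2))                  # E[(z_t - z_{t-1})^2]
    e_res2 = (y - m_s) ** 2 + P_s                           # E[(y_t - z_t)^2]
    return e_dz2.mean(), e_res2.mean()

q, r = 1.0, 1.0                                             # step 1: initial parameters
for _ in range(500):
    m_s, P_s, P_lag = e_step(y, q, r)                       # step 2: E-step
    q_new, r_new = m_step(y, m_s, P_s, P_lag)               # step 3: M-step
    if max(abs(q_new - q), abs(r_new - r)) < 1e-6:          # step 4: convergence check
        q, r = q_new, r_new
        break
    q, r = q_new, r_new
print(f"estimated q = {q:.3f} (true {q_true}), r = {r:.3f} (true {r_true})")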

2.7 Evaluation Metrics

This section presents three performance metrics commonly used in time series analysis and forecasting to evaluate model accuracy: Root Mean Squared Error, Mean Absolute Error and Mean Absolute Percentage Error [1, 24, 6]. There exist multiple other performance measures that could have proved useful to this study, but since the evaluation process in itself only requires one or two indicators, the ones presented in this chapter are those often used for similar analyses and subject to between-metric comparisons due to their different characteristics [18, 43, 6, 24].


2.7.1 Mean Absolute Error

Mean Absolute Error (MAE), or Mean Absolute Deviation (MAD) when the measure of central tendency is chosen to be the mean, is a measure of the average magnitude of the error,

MAE = (1/T) Σ_{t=1}^{T} |y_t − ỹ_t|.    (2.20)

It is calculated by summing the absolute residuals between the mean ỹ_t and the prediction y_t and averaging over the number of predictions. It is a linear score in that it assigns equal weight to all errors, regardless of their magnitude [24].

2.7.2 Mean Absolute Percentage Error

Mean Absolute Percentage Error (MAPE) is a generic measure obtained by computing the error relative to the magnitude of the measurement,

MAPE = (1/T) Σ_{t=1}^{T} |(y_t − ỹ_t) / y_t|.    (2.21)

It is generic in that it is expressed as intuitive percentages and thus can be compared over different datasets, a property commonly referred to as scale-independence. An evident flaw with this approach is that it is not applicable for measures y_t = 0 and results in a skewed distribution for measures in the vicinity of zero [18].

2.7.3 Root Mean Squared Error

RMSE is the square root of the mean of the squared errors,

RMSE = sqrt( (1/T) Σ_{t=1}^{T} (y_t − ỹ_t)² ).    (2.22)

By taking the square root, the unit of the metric is obtained on the same scale as the unit of the measurement [37]. By squaring the errors, RMSE penalizes variance, as it gives more weight to errors with larger magnitude and less weight to errors with smaller magnitude [6].
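The three metrics in (2.20)-(2.22) translate directly into code. In the sketch below, pred plays the role of y_t and target the role of ỹ_t, and the example values are made up:

import numpy as np

def mae(pred, target):
    return np.mean(np.abs(pred - target))                   # eq. (2.20)

def mape(pred, target):
    return np.mean(np.abs((pred - target) / pred))          # eq. (2.21), undefined for pred == 0

def rmse(pred, target):
    return np.sqrt(np.mean((pred - target) ** 2))           # eq. (2.22)

# Example: MAE <= RMSE, and RMSE reacts more strongly to the single large error
pred = np.array([0.50, 0.52, 0.48, 0.70])
target = np.array([0.51, 0.50, 0.49, 0.50])
print(mae(pred, target), rmse(pred, target), mape(pred, target))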

2.7.4 Comparison of Performance Metrics

In the research on evaluation metrics, there has been substantial debate regarding the characteristics and suitability of certain metrics, as well as their performance [6, 43, 18]. Considering the metrics presented in this section, the debate on MAE vs. RMSE is particularly polarized. Following an article published by Willmott and Matsuura in 2005 [43], where they proposed avoidance of RMSE in favor of MAE, many researchers chose to follow their advice [6]. While acknowledging some of the criticism of RMSE to be valid, Chai and Draxler [6] argued the eligibility of both metrics, as well as proposing RMSE to be the superior metric when the error distribution is expected to be Gaussian.

Considering the practical properties of the measures, a generally accepted premise is that MAE ≤ RMSE. Although both metrics will indicate increased variance, RMSE is considerably more sensitive to outliers compared to MAE and will grow increasingly larger than MAE as the error variance increases [6, 43]. As such, RMSE should prove to be the more useful metric if large variance is particularly undesirable [24].

Due to their methodological similarities, MAE and MAPE will generally be correlated [24]. While MAPE has the advantage in terms of interpretability and comparability of the results, it comes with considerable disadvantages in terms of applicability [37].


3 Method

This chapter presents the methodological approach used in this research. The first section describes the data used, as well as the methods used for simulating the data. The second section presents the DFA model derived by combining the techniques presented in Chapter 2. Finally, Section 3.3 defines the EM algorithm as well as the update equations used in its M step, and Section 3.4 describes the methods used for evaluating the results.

The terms population and respondent will be used frequently throughout this chapter, where population refers to the complete information of a variable represented by a typically large set of samples, with population mean describing the assumed true state of a variable. A respondent pertains to a single entity of the population.

3.1 Data Sources

3.1.1 Simulation

In order to evaluate the performance of the model, it is necessary to know the true means for each timestep. This was achieved by simulation, where each simulated time-series y_it ∈ y_t was assumed to follow a random walk according to the transition model

y_t = y_{t−1} + e_t,    e_t ∼ N(ξ, Ω),    (3.1)

where ξ_i ∈ ξ is a random drift component unique to each generated time-series and Ω is a generated positive semi-definite covariance matrix common to all time-series. In order to replicate authentic conditions, the simulated data was required to satisfy the three patterns of correlation mentioned in Section 1.1,

1. within-measure inter-temporal dependence,

2. between-measure co-movements in population means,

3. between-measure cross-sectional co-variance.

By following the Markov process in equation (3.1), within-measure inter-temporal dependencies were incorporated into the data. Between-measure co-movements were imposed by generating a non-diagonal covariance matrix Ω for the step e_t, resulting in correlated population means. Correlated population means are a prerequisite for representing the data with a factor structure [10]. Finally, in order to introduce between-measure cross-sectional co-variations on the respondent level, the primary respondent data was sampled from emission models with a non-diagonal covariance matrix. This also allowed evaluation of the performance of the model with different levels of correlation on cross-sectional measures. The exact models used can be found in Sections 3.1.1.1 and 3.1.1.2.

A population was generated through repeated multivariate sampling from the model described in equation (3.1), where each timestep was sampled N times, corresponding to the population size. A panel of respondents could then be generated by randomly selecting n samples from the population at each timestep, where both n and N were selected such that n ≪ N, and N was typically large. This made it possible to re-sample from the same population and evaluate performance on the same time-series with different sample sizes, simulating different numbers of respondents.
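A compact sketch of this simulation setup (assumed NumPy code with arbitrary example sizes) generates the correlated random-walk means of equation (3.1) and re-draws a small panel of respondent indices out of the population at every timestep:

import numpy as np

rng = np.random.default_rng(2)
T, n_series = 156, 4                        # weekly means for 4 measures (example values)
N_pop, n_panel = 10_000, 50                 # population size and weekly panel size

xi = rng.normal(0.0, 0.002, n_series)       # per-series drift, as in eq. (3.1)
L = rng.normal(0.0, 0.02, (n_series, n_series))
Omega = L @ L.T                             # positive semi-definite step covariance

y = np.full((T, n_series), 0.5)             # population means, random walk of eq. (3.1)
for t in range(1, T):
    y[t] = y[t - 1] + rng.multivariate_normal(xi, Omega)

# A fresh panel of n respondents (indices into the N-respondent population) each week
panel_idx = [rng.choice(N_pop, size=n_panel, replace=False) for _ in range(T)]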

The key assumption made is that although the panel composition of respondents may vary for each timestep, the panels still represent the same population mean at the aggregated level. In contrast to a longitudinal study, where the individual respondents are the same for each timestep and thus constitute the total population, the approach used in this research focuses on reproducing structures 1. and 2. described above, under the assumption that the respondents are sampled from the same population, as per [10]. The specific processes by which the different types of data were sampled from the time-series differ somewhat between numerical and binary respondent data, but the data remains continuous when aggregated to a population mean.

3.1.1.1 Numerical Data Sampling

The numerical data can be divided into two categories: discrete and continuous. The process through which they were generated differed only in that the discrete data was obtained by rounding the continuous data to the closest integer. The actual sampling was performed using the multivariate Gaussian emission model

ŷ_jt = y_t + ψ_jt,    ψ_jt ∼ N(0, Σ),    (3.2)

where y_it ∈ y_t is the value of time-series i at time t and Σ is the generated covariance matrix imposing between-measure cross-sectional co-variations at the primary level. Thus, ŷ_ijt corresponds to respondent j's answer to measure i at time t. Excerpts from the simulated continuous and discrete respondent-level data can be found in Appendix A.1.1 and A.1.2, respectively.
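A minimal sketch of the emission step in equation (3.2), with a made-up error covariance; the discrete variant is obtained by rounding the continuous draws:

import numpy as np

rng = np.random.default_rng(3)

def sample_respondents(y_t, Sigma, n_resp, rng):
    # Draw n_resp continuous answers around the population mean y_t, eq. (3.2)
    psi = rng.multivariate_normal(np.zeros(len(y_t)), Sigma, size=n_resp)
    continuous = y_t + psi
    discrete = np.rint(continuous)           # discrete variant: round to nearest integer
    return continuous, discrete

y_t = np.array([3.2, 4.1, 2.8])              # example means for three measures
Sigma = np.array([[1.0, 0.4, 0.2],
                  [0.4, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])           # cross-sectional respondent covariance
cont, disc = sample_respondents(y_t, Sigma, n_resp=25, rng=rng)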

3.1.1.2 Binary Data Sampling

Although binary distributions technically are a special case of discrete distributions, limited to only two values, the binary case required a slightly different approach. In order to impose between-measure cross-sectional co-variance on the primary level, the population had to be sampled with covariance from the generated time-series and result in binary response variables. This was no trivial task, as standard discrete distributions such as the Poisson, Bernoulli and binomial do not incorporate a multivariate covariance structure out of the box. Persisting with the analogy that each respondent could be regarded as a multivariate Bernoulli trial, each time-series was normalized to the interval [a, b], where a, b ∈ [0, 1], which enabled the use of the actual time-series as the probability in a Bernoulli trial. The property that E(X) = p if X ∼ Bernoulli(p) implied that the population mean would approximately yield the probability with which each respondent was sampled, which was the time-series itself. Multivariate Bernoulli sampling is not a standard feature in most statistical software and thus had to be simulated through sampling from a multivariate Gaussian distribution and binarization of the output. Although this approach was a result of reasoning in combination with trial and error, support for it can be found in [23], which presents a conceptually similar method utilizing a Gaussian approximation to simulate correlated binary variables. The complete process proceeded as follows:

1. Select the marginal probabilities p_i for each binary variable and the covariance matrix Σ describing the desired covariance between the binary variables.

2. Set µ_i = Φ^{−1}(p_i), where Φ^{−1}(p) is the probit function.

3. Draw a sample s from N(µ, Σ).

4. Binarize the sample such that x_i = 0 if s_i < 0 and x_i = 1 if s_i > 0. For s_i = 0, x_i was assigned uniformly at random.

The steps outlined above allowed the use of a structured covariance matrix when sampling from a multivariate Bernoulli distribution, resulting in between-measure co-variance on the respondent level. An excerpt from the simulated binary respondent-level data can be found in Appendix A.1.3.
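The four steps amount to probit-style thresholding of a correlated Gaussian draw. A minimal sketch (assumed NumPy code; NormalDist.inv_cdf serves as Φ^{−1}, and Σ is assumed to have unit diagonal so the marginal probabilities come out as p_i):

import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)

def correlated_bernoulli(p, Sigma, n_resp, rng):
    # Draw n_resp correlated binary response vectors with marginal probabilities p
    mu = np.array([NormalDist().inv_cdf(pi) for pi in p])   # step 2: mu_i = Phi^{-1}(p_i)
    s = rng.multivariate_normal(mu, Sigma, size=n_resp)     # step 3: Gaussian draw
    x = (s > 0).astype(int)                                  # step 4: binarize at zero
    ties = s == 0                                            # ties broken uniformly
    x[ties] = rng.integers(0, 2, size=ties.sum())
    return x

p = [0.35, 0.60, 0.20]                                       # step 1: marginal probabilities
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])                           # unit-diagonal latent covariance
answers = correlated_bernoulli(p, Sigma, n_resp=1000, rng=rng)
print(answers.mean(axis=0))                                   # close to p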

3.1.2 Panelist Data

Although the simulation process described in Section 3.1.1 aimed at reproducing true conditions, the Primary DFA model used in this research was designed under very specific assumptions and with the purpose of identifying the exact structures imposed on the simulated data. In order to evaluate the performance of the model when said structures may be weak or even non-existent, the model was trained using actual respondent data provided by Nepa, containing raw respondent-level data from three different cross-sectional tracking studies carried out during 156 consecutive weeks. Due to clientele confidentiality and proprietary reasons, the data itself is not included in this paper. However, a selection of example questions similar to those in the surveys from which the data was gathered can be seen in Table 3.1. Furthermore, anonymized figures displaying the fitted models can be found in Appendix A.2.

Ad Awareness: Which computer manufacturers have you recently seen advertisements from?
Awareness: Which computer manufacturers have you seen or heard of?
Preference: Which computer manufacturer would you select if you were able to choose freely?
Consideration: Which other computer manufacturer would you also consider?

Table 3.1: Example Survey Questions


The respondent data was exclusively binary and also contained closed-ended questions, such as Preference, which resulted in a single response. If the output from a closed-ended question is translated into the desired format, where each respondent's answers pertain to a row of binary outputs as in Appendix A.1.3, each row would contain a single one with the remaining entries being zero. This scenario differed slightly from those described previously, where each row could have multiple non-zero entries. Since no actual covariance occurs on the respondent level, the use of a non-diagonal cross-sectional covariance matrix Σ was impractical, possibly even nonsensical, and required a slightly modified model in which Σ was constrained to be diagonal.

Respondent Brand 1 Brand 2 Brand 3

1 0 1 0

2 0 0 1

3 1 0 0

Table 3.2: Closed-ended question - Which brand do you prefer?
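The sketch below illustrates how a list of closed-ended answers could be translated into rows with a single non-zero entry; the brand names and answers are hypothetical and the code is only meant to reproduce the structure of Table 3.2.

```python
import numpy as np

brands = ["Brand 1", "Brand 2", "Brand 3"]
answers = ["Brand 2", "Brand 3", "Brand 1"]    # hypothetical closed-ended responses

# One row per respondent, with a single 1 in the column of the chosen brand
rows = np.zeros((len(answers), len(brands)), dtype=int)
for i, answer in enumerate(answers):
    rows[i, brands.index(answer)] = 1
print(rows)
```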

3.1.3 Sampling from Panelist Data

In an attempt to provide sampling conditions considerably closer to reality, while maintaining the possibility to evaluate performance, the provided panelist data was used to generate the parameters from which a population could be sampled according to the methods described in Section 3.1.1. The means y_{it} were generated by evaluating a moving average of the four most recent measurements for each measure, which guaranteed within-measure inter-temporal dependencies as well as imposing between-measure co-movements according to the structures contained in the data. Instead of generating a random respondent covariance matrix Σ as in Section 3.1.1, it was computed directly as the covariance matrix of the raw respondent data.
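A sketch of how such parameters could be derived from raw panelist data is given below. The data layout (one row per respondent with a `week` column), the function name and the use of pandas are illustrative assumptions; only the four-measurement moving average and the empirical covariance follow the description above.

```python
import numpy as np
import pandas as pd

def simulation_parameters(respondents: pd.DataFrame):
    """Derive sampling parameters from raw respondent data.

    respondents : one row per respondent, binary measure columns plus a 'week' column.
    """
    measures = respondents.drop(columns="week")

    # Weekly population means per measure, smoothed with a moving average
    # over the four most recent measurements (trailing window).
    weekly_means = measures.groupby(respondents["week"]).mean()
    smoothed_means = weekly_means.rolling(window=4, min_periods=1).mean()

    # Respondent-level covariance matrix computed directly from the raw data,
    # replacing the randomly generated Sigma of Section 3.1.1.
    sigma = np.cov(measures.to_numpy(), rowvar=False)
    return smoothed_means, sigma
```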

3.2 Primary DFA Model

The Primary DFA model used was that proposed by Du and Kamakura [10], which is an implementation of the model presented in Section 2.2. In addition to the components of the original model, it also includes a covariate term similar to the input control variable from the Kalman Filter in Section 2.4. This term enabled the option to include known covariates such as seasonality and marketing investments related to the analyzed measure.

The state equations for the DFA model are given by

µ_t = A z_t + B u_t, (3.3)

z_t = z_{t-1} + ξ_t,    ξ_t ∼ N(0, Ω),    z_0 ∼ N(a_0, Ω_0), (3.4)

where µ_t denotes the N × 1 population mean vector at timestep t, z_t denotes the M × 1 vector of latent states, A is the N × M loading matrix that translates the M latent states into the N population means, u_t is an N × 1 vector of known covariates and B is the N × M loading matrix that translates the covariates into the population means µ_t [10]. ξ_t is the M × 1 vector of latent state shocks, i.e., it contains the real state changes from step t − 1 to step t. Finally, Ω is the state covariance matrix, which was diagonal in order to maintain independent factors and thus facilitate interpretability of the extracted factors.

The corresponding observation equation was

y_{it} = A z_t + B u_t + e_{it},    e_{it} ∼ N(0, Σ), (3.5)

where y_{it} denotes the vector of responses from respondent i at timestep t, e_{it} denotes the sample noise representing respondent i's deviation from the population mean, and Σ is the error covariance matrix incorporating the cross-sectional between-measure co-variations. In order to make the analogy to state-space modelling more explicit, the equations (3.3), (3.4) and (3.5) can be generalized into the state-space representation with the state equation

z_t = z_{t-1} + ξ_t,    ξ_t ∼ N(0, Ω), (3.6)

and observation equation

y_t = J_t A z_t + J_t B u_t + e_t,    e_t ∼ N(0, I_{N_t} ⊗ Σ), (3.7)

where

y_t ≡ (y'_{1t}  y'_{2t}  . . .  y'_{N_t t})', (3.8)

J_t ≡ 1_{N_t} ⊗ I_N, (3.9)

and N_t denotes the number of respondents at timestep t, not to be confused with the number of measures N. In equation (3.9), ⊗ denotes the Kronecker product, giving J_t the dimensions (N N_t × N). Thus, the model parameters that needed to be estimated were A (N × M), B (N × M), Σ (N × N), Ω (M × M), Ω_0 (M × M) and a_0 (M × 1).
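To make the generative form of equations (3.3)-(3.7) concrete, the following sketch simulates population means from the state-space model. It is only an illustration under assumed dimensions and variable names, not the estimation code used in the study.

```python
import numpy as np

def simulate_dfa_means(A, B, u, Omega, a0, Omega0, T, seed=None):
    """Simulate population means mu_t from the DFA state-space model.

    A      : (N, M) loading matrix mapping latent states to measures
    B      : loading matrix for the known covariates u[t]
    u      : (T, K) known covariates, one row per timestep
    Omega  : (M, M) diagonal state covariance
    a0, Omega0 : mean and covariance of the initial state z_0
    """
    rng = np.random.default_rng(seed)
    M = A.shape[1]
    z = rng.multivariate_normal(a0, Omega0)                    # z_0 ~ N(a0, Omega0)
    mu = np.empty((T, A.shape[0]))
    for t in range(T):
        z = z + rng.multivariate_normal(np.zeros(M), Omega)    # random-walk state, eq. (3.4)
        mu[t] = A @ z + B @ u[t]                               # population means, eq. (3.3)
    return mu
```

Respondent-level responses could then be obtained by drawing, for each timestep, N_t noise vectors e_{it} ∼ N(0, Σ) around mu[t], or by binarizing as in Section 3.1.1.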

3.3 Algorithm

This section describes, part by part, the complete algorithm used in this research, starting with the Expectation Maximization (EM) algorithm used for inference, moving on to the method for selecting the number of latent factors, after which all parts are compiled into a sequential list in the order in which they were carried out.

3.3.1 EM

A number of different approaches have been proposed in order to perform inference on the model parameters in the context of state space models, including examples such as Principal Component Analysis in combination with a Kalman Smoother [9] and a univariate Kalman Filter and Smoother aiming to mitigate the curse of dimensionality [21]. However, the most widely used approach in DFA, as well as state-space modelling in general, is the EM algorithm introduced in Section 2.6.1 (see [10, 11, 35, 31, 27] for successful examples). The implementation of the EM algorithm used in this research utilizes a Kalman Filter and Smoother as the E-step following the research of [10, 25], and parameter update equations based on the likelihood function of [35] in the M-step, which are equivalent to the MLE of the Q-function described in equation (2.18).

3.3.1.1 E-step

The Kalman Filter used in the forward pass of the E-step was that proposed by [10], which is an implementation of Lind’s modified Kalman Filter described in Section 2.4.1.

Algorithm 3.1. Modified Kalman Filter

1: function KalmanFilter(A, B, Σ, Ω, Ω_0, a_0, N, ȳ, u)
2:   z_{0|0} = a_0,  var(z_{0|0}) = Ω_0
3:   for t ← 1, ..., T do
4:     z_{t|t-1} = z_{t-1|t-1}
5:     var(z_{t|t-1}) = var(z_{t-1|t-1}) + Ω
6:     var(z_{t|t}) = [var(z_{t|t-1})^{-1} + A' N_t Σ^{-1} A]^{-1}
7:     z_{t|t} = z_{t|t-1} + var(z_{t|t}) A' N_t Σ^{-1} [ȳ_t − A z_{t|t-1} − B u_t]
8:   end for
9:   return z_{t|t}, z_{t|t-1}, var(z_{t|t}), var(z_{t|t-1}) for t = 1, ..., T
10: end function
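A direct NumPy transcription of Algorithm 3.1 could look as follows. The function signature and array shapes are illustrative assumptions; `y_bar` is taken to hold the mean response vector ȳ_t for each timestep.

```python
import numpy as np

def kalman_filter(A, B, Sigma, Omega, Omega0, a0, N, y_bar, u):
    """Forward pass of the modified Kalman Filter (Algorithm 3.1).

    A, B   : loading matrices for latent states and covariates
    Sigma  : error covariance; Omega, Omega0 the state covariances
    N      : (T,) number of respondents per timestep
    y_bar  : (T, N_meas) mean response vector per timestep
    u      : (T, K) known covariates
    """
    T, M = len(y_bar), len(a0)
    Sigma_inv = np.linalg.inv(Sigma)
    z_f, z_p = np.empty((T, M)), np.empty((T, M))          # filtered / predicted states
    V_f, V_p = np.empty((T, M, M)), np.empty((T, M, M))    # filtered / predicted covariances
    z, V = np.asarray(a0, float), np.asarray(Omega0, float)
    for t in range(T):
        z_p[t], V_p[t] = z, V + Omega                                                  # lines 4-5
        V_f[t] = np.linalg.inv(np.linalg.inv(V_p[t]) + A.T @ (N[t] * Sigma_inv) @ A)   # line 6
        z_f[t] = z_p[t] + V_f[t] @ A.T @ (N[t] * Sigma_inv) @ (y_bar[t] - A @ z_p[t] - B @ u[t])  # line 7
        z, V = z_f[t], V_f[t]
    return z_f, z_p, V_f, V_p
```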

The output from the Kalman Filter in Algorithm 3.1 is used as input to the Kalman Smoother proposed by [10].

Algorithm 3.2. Modified Kalman Smoother

1: function KalmanSmoother(z_{t|t}, z_{t|t-1}, var(z_{t|t}), var(z_{t|t-1}), A, Σ, N)
2:   for t ← T, ..., 1 do
3:     L_{t-1} = var(z_{t-1|t-1}) var(z_{t|t-1})^{-1}
4:     z_{t-1|T} = z_{t-1|t-1} + L_{t-1}(z_{t|T} − z_{t|t-1})
5:     var(z_{t-1|T}) = var(z_{t-1|t-1}) + L_{t-1}[var(z_{t|T}) − var(z_{t|t-1})]L'_{t-1}
6:   end for
7:   cov(z_{T|T}, z_{T-1|T}) = [I − var(z_{T|T}) A' N_T Σ^{-1} A] var(z_{T|T-1})
8:   for t ← T, ..., 2 do
9:     cov(z_{t-1|T}, z_{t-2|T}) = var(z_{t-1|t-1}) L'_{t-2} + L_{t-1}[cov(z_{t|T}, z_{t-1|T}) − var(z_{t-1|t-1})]L'_{t-2}
10:  end for
11:  return z_{t|T}, var(z_{t|T}), cov(z_{t|T}, z_{t-1|T})
12: end function

In Algorithms 3.1 and 3.2, z_{t|t-1} denotes the predicted current state at timestep t given the previous state at timestep t − 1 and z_{t|t} denotes the corrected current state, with the analogous definitions applying to the error covariances var(z_{t|t}) and var(z_{t|t-1}). z_{t|T} and var(z_{t|T}) denote the smoothed state and error covariance at timestep t given all timesteps. Finally, the term cov(z_{t|T}, z_{t-1|T}) denotes the covariance of the uncertainties about the state transition from t to t − 1.
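The backward pass of Algorithm 3.2, including the lag-one covariance recursion, could be transcribed as below, consuming the output of the filter sketch given earlier. Array index j corresponds to timestep j + 1, and the smoothing of the initial state z_{0|T} is omitted for brevity; names and shapes are again illustrative.

```python
import numpy as np

def kalman_smoother(z_f, z_p, V_f, V_p, A, Sigma, N):
    """Backward pass of the modified Kalman Smoother (Algorithm 3.2)."""
    T, M = z_f.shape
    z_s, V_s = z_f.copy(), V_f.copy()              # smoothed states and covariances
    L = np.empty((T - 1, M, M))                    # smoother gains
    for t in range(T - 1, 0, -1):                  # lines 2-6: smooth states backwards
        L[t - 1] = V_f[t - 1] @ np.linalg.inv(V_p[t])
        z_s[t - 1] = z_f[t - 1] + L[t - 1] @ (z_s[t] - z_p[t])
        V_s[t - 1] = V_f[t - 1] + L[t - 1] @ (V_s[t] - V_p[t]) @ L[t - 1].T

    # lines 7-10: lag-one covariances cov(z_{t|T}, z_{t-1|T})
    Sigma_inv = np.linalg.inv(Sigma)
    C = np.empty((T - 1, M, M))
    C[T - 2] = (np.eye(M) - V_f[T - 1] @ A.T @ (N[T - 1] * Sigma_inv) @ A) @ V_p[T - 1]
    for t in range(T - 1, 1, -1):
        C[t - 2] = (V_f[t - 1] @ L[t - 2].T
                    + L[t - 1] @ (C[t - 1] - V_f[t - 1]) @ L[t - 2].T)
    return z_s, V_s, C
```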

3.3.1.2 M-step: Parameter Update Equations

In the subsequent M-step, the model parameters were updated with the maximum likelihood estimates using the equations provided by Du and Kamakura [10], which in turn were derived from the ML function elaborated by Shumway and Stoffer [35, 36].

The update equations for the initial state and state covariance at t = 0 are

â_0 = z_{0|T}, (3.10)

and

Ω̂_0 = var(z_{0|T}). (3.11)

The loading matrices A and B for the latent states and covariates are updated with the equation

[\hat{A} \; \hat{B}] =
\left[\sum_{t=1}^{T}\sum_{i(t)=1}^{N_t}
\begin{pmatrix} y_{i(t)} z_{t|T}' & y_{i(t)} u_t' \end{pmatrix}\right]
\left[\sum_{t=1}^{T} N_t
\begin{pmatrix} z_{t|T} z_{t|T}' + \mathrm{var}(z_{t|T}) & z_{t|T} u_t' \\ u_t z_{t|T}' & u_t u_t' \end{pmatrix}\right]^{-1},
\qquad (3.12)

the error covariance is updated with the equation

\hat{\Sigma} = \frac{1}{\sum_{t=1}^{T} N_t}
\sum_{t=1}^{T}\sum_{i(t)=1}^{N_t}
\left[(y_{i(t)} - \hat{A} z_{t|T} - \hat{B} u_t)(y_{i(t)} - \hat{A} z_{t|T} - \hat{B} u_t)'
+ \hat{A}\,\mathrm{var}(z_{t|T})\,\hat{A}'\right],
\qquad (3.13)

and finally, the state covariance matrix is updated with the equation

\hat{\Omega} = \frac{1}{T}\sum_{t=1}^{T}
\mathrm{diag}\Big[(z_{t|T} - z_{t-1|T})(z_{t|T} - z_{t-1|T})' + \mathrm{var}(z_{t|T}) + \mathrm{var}(z_{t-1|T})
- \mathrm{cov}(z_{t|T}, z_{t-1|T}) - \mathrm{cov}(z_{t|T}, z_{t-1|T})'\Big].
\qquad (3.14)
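To illustrate the closed-form nature of the M-step, the sketch below implements the updates (3.10), (3.11), (3.13) and (3.14), taking the smoothed quantities, including the smoothed initial state, as NumPy arrays. The joint update of A and B in (3.12) follows the same pattern and is omitted for brevity; all names and array layouts are illustrative assumptions.

```python
import numpy as np

def m_step_updates(y, u, A, B, z_sm, V_sm, C_sm):
    """Closed-form M-step updates for a0, Omega0, Sigma and Omega.

    y    : list of (N_t, N) response arrays, one per timestep t = 1..T
    u    : (T, K) known covariates
    A, B : current loading matrices
    z_sm : (T+1, M) smoothed states z_{t|T}, including t = 0
    V_sm : (T+1, M, M) smoothed state covariances var(z_{t|T})
    C_sm : (T, M, M) smoothed lag-one covariances cov(z_{t|T}, z_{t-1|T})
    """
    T = len(y)
    a0_hat = z_sm[0]                           # equation (3.10)
    Omega0_hat = V_sm[0]                       # equation (3.11)

    # Equation (3.13): error covariance averaged over all respondents and timesteps
    n_total = sum(len(y_t) for y_t in y)
    Sigma_hat = np.zeros((A.shape[0], A.shape[0]))
    for t, y_t in enumerate(y, start=1):
        resid = y_t - A @ z_sm[t] - B @ u[t - 1]           # (N_t, N) residuals
        Sigma_hat += resid.T @ resid + len(y_t) * A @ V_sm[t] @ A.T
    Sigma_hat /= n_total

    # Equation (3.14): diagonal state covariance from smoothed state increments
    Omega_hat = np.zeros(A.shape[1])
    for t in range(1, T + 1):
        dz = z_sm[t] - z_sm[t - 1]
        S = np.outer(dz, dz) + V_sm[t] + V_sm[t - 1] - C_sm[t - 1] - C_sm[t - 1].T
        Omega_hat += np.diag(S)
    Omega_hat = np.diag(Omega_hat / T)
    return a0_hat, Omega0_hat, Sigma_hat, Omega_hat
```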

Convergence was evaluated partly from the change in parameter values between iterations, but also using the log-likelihood function described by [10],

-2 \ln L(X, Z, \theta) \propto
\ln|\Omega_0| + \mathrm{tr}\Big\{\Omega_0^{-1}\big[(z_{0|T} - a_0)(z_{0|T} - a_0)' + \mathrm{var}(z_{0|T})\big]\Big\} + T \ln|\Omega|
+ \mathrm{tr}\Big\{\Omega^{-1}\sum_{t=1}^{T}\big[(z_{t|T} - z_{t-1|T})(z_{t|T} - z_{t-1|T})' + \mathrm{var}(z_{t|T}) + \mathrm{var}(z_{t-1|T})
- \mathrm{cov}(z_{t|T}, z_{t-1|T}) - \mathrm{cov}(z_{t|T}, z_{t-1|T})'\big]\Big\}
+ \sum_{t=1}^{T} N_t \ln|\Sigma|
+ \mathrm{tr}\Big\{\Sigma^{-1}\sum_{t=1}^{T}\sum_{i(t)=1}^{N_t}\big[(y_{i(t)} - \hat{A} z_{t|T} - \hat{B} u_t)(y_{i(t)} - \hat{A} z_{t|T} - \hat{B} u_t)' + \hat{A}\,\mathrm{var}(z_{t|T})\,\hat{A}'\big]\Big\},
\qquad (3.15)

where tr denotes the trace, defined as

tr(A) = \sum_{i=1}^{n} a_{ii} = a_{11} + a_{22} + \cdots + a_{nn}. (3.16)

3.3.2 Determining the Number of Factors

Throughout all experiments performed in this study, each configuration was fitted with a factor structure containing the same number of latent factors as measures in the input data.

¹The reader should be advised that the expression within [brackets] is intended as a single-row matrix and any
