SEQUENCE CLASSIFICATION ON GAMIFIED BEHAVIOR DATA FROM A LEARNING MANAGEMENT SYSTEM

Predicting student outcome using neural networks and Markov chain

Master Degree Project in Data Science Two years Level 30 ECTS

Spring term 2020 Niclas Elmäng

Supervisor: Niclas Ståhl

Supervisor: Alberto Montebelli Examiner: Huseyin Kusetogullari


Abstract

This study has investigated whether it is possible to classify time series data originating from a gamified learning management system. Using school data provided by the gamification company Insert Coin AB, the aim was to help distribute the teacher's supervision more efficiently towards students who are more likely to fail, motivated by the possibility of increasing student retention and completion rates. This was done by using Long short-term memory networks, convolutional neural networks, and Markov chains to classify time series of event data. Since the classes are balanced, the classification was evaluated using only the accuracy metric. The neural networks show positive results, but overfitting occurs strongly for the convolutional network and less so for the Long short-term memory network. The Markov chain shows potential, but further work is needed to mitigate the strong correlation between sequence length and likelihood.

Keywords: Long Short-term Memory, Convolutional neural network, Markov Chain, Time series Classification, Gamification


1 Introduction
2 Background
2.1 Gamification
2.2 Insert Coin
2.2.1 The data
2.3 Artificial intelligence
2.3.1 Artificial neural networks
2.3.2 Back-propagation
2.3.3 Recurrent neural networks
2.3.4 Long short-term memory
2.3.5 Convolutional neural networks
2.4 Markov chain
2.5 Related work
2.6 Ethical issues
3 Method
3.1 Problem definition
3.2 Research question
3.3 Metrics
3.4 Validity threats
4 Implementation
4.1 General preprocessing
4.2 ANN data preprocessing
4.2.1 Experiment
4.3 MC data preprocessing and experiment
5 Results
5.1 ANN performance
5.2 MC performance
5.3 Discussion
6 Conclusions
References


1 Introduction

Increased student motivation and student retention are always of interest to schools and educators (Buckley and Doyle, 2016). The inclusion of gamification in a course can lead to positive benefits based on the idea that it is used to influence behavior. Insert Coin AB is a software company that delivers gamification as a service and they want to be able to use the data they collect to further improve the customer’s, i.e. the school’s, service.

A problem presented by Insert Coin AB, concerning the student event data they provided, is whether it is possible to distinguish between students who might pass or fail the final exam.

The data consists of timestamped events originating from the student’s actions in their learning management system. By applying machine learning to this binary classification problem, the teachers can use this knowledge to distribute their attention and supervision more efficiently to students who are in greater need. As such, the motivation behind this study is the potential of increased student retention and completion rate.

The method in this study was to use the timestamped gamified event data in the context of Long short-term memory and convolutional neural networks and Markov chains.

The data for the neural networks takes the form of a time series of categorical values where each unique event is a category. In contrast, the Markov chain treats the events in the data as unique states to transition between while the classification is done by using two separate Markov chains, one for each class. The likelihood of each class is determined by evaluating the probability of a student’s sequence of events with each class of Markov chain.

The predictions using the neural networks are evaluated using the accuracy metric. The testing shows positive results, or high accuracy, although there are some concerns with overfitting, especially for the convolutional neural network. The Markov chain shows promising potential, but experimentation runs short as there is a strong correlation between the length of a sequence and its likelihood. The longer, or more unique, a sequence is, the lower the likelihood is. In many cases, the sequence is not recognized at all by the Markov chain.

2 Background

This section discusses the area around the problem in this study, such as gamification, Insert Coin and their dataset, artificial intelligence, neural networks, and finally some related work that has been done.

2.1 Gamification

“Gamification is using game-based mechanics, aesthetics and game thinking to engage people, motivate action, promote learning, and solve problems” (Kapp, 2012, p. 10).

Incorporating game-based behavior and thinking into non-game contexts, such as education, has sparked the interest of many educators. Using gamification in an educational environment stems from the idea that gamification influences student behavior. While games can bring out emotions such as curiosity, frustration, and joy, they also enable students to engage and be more productive. Because gamification brings game-based mechanics and game design into a non-game environment, the possibility of increasing student motivation and capturing their attention and interest is of great importance to educators (Buckley and Doyle, 2016).

2.2 Insert Coin

The Swedish company Insert Coin AB, founded in 2012, works specifically with the concept of gamification. Using the 'Software as a Service' (SaaS) approach, they deliver their so-called Gamify the World ENgine, abbreviated GWEN. This engine is purposefully made to reduce the complex overhead of implementing gamification features. In other words, it offers expansive gamification with minimal effort from the user. A user, in this case, is the customer using GWEN in whatever service they provide.

While GWEN offers an easier, less troublesome implementation of commonly used gamification patterns, it is also meant to be a more streamlined and adaptable tool.

Many situations are thus covered by giving users the ability to mix and match between a set of so-called modules. Examples of modules are common gamification elements such as levels, achievements, missions, and challenges.

2.2.1 The data

As mentioned earlier, Insert Coin AB provided a dataset for this study. This dataset comes from a school that wanted gamification implemented in its learning management system. Students were supposed to use this system in parallel with their studies. This is where GWEN is used to determine the applicable reward or response based on the actions performed by the students.

The raw dataset consists of 92 students and 20 events with the features id, createdAt, event, type, userId, and klarat av prov. The id is unique for every entry while userId is unique for every user. The createdAt feature is the timestamp, and event and type represent the action and what type of action it was. Some of the possible actions present in the event feature are variations of 'turn in assignment' and 'visit course feed'. Finally, the klarat av prov feature is the label, representing whether the student passed or failed the final exam. 54% of the students passed, leaving 46% who failed, see Figure 1.


Figure 1. Pie chart showing the distribution between the two classes 'Pass' and 'Fail', represented by True and False, respectively.

2.3 Artificial intelligence

The term artificial intelligence (AI) was first coined by John McCarthy (1955) and he says that “every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” These machines, much like humans, are supposed to be able to solve some problems and answer questions. The advantage machines have is that they can solve difficult problems much faster than what humans are capable of. Common problems can be image classification, text generation, and prediction of different varieties.

While AI defines the general concept of machines being able to solve “problems now reserved for humans” (McCarthy, 1955) and in a smart or human-like way, machine learning (ML) defines a more self-sustained approach. Coined and popularized by Samuel (1959), ML procedures consist of feeding a computer “the rules of the game, a sense of direction and a redundant and incomplete list of parameters”. Samuel demonstrated this learning-behavior by allowing the computer to iteratively play the game Checkers until it was able to play a better game than that of its creator, i.e. the programmer. Dietterich (1997) discusses some common emergent problems for machine learning such as knowledge discovery, language processing, robot control, as well as older established problems such as speech, face- and text-recognition.

2.3.1 Artificial neural networks

Artificial neural networks (ANNs) are a subset of methods in ML where the intent is to mimic how the human brain functions and processes information via so-called neurons (Graves, 2012). These artificial neurons are interconnected to receive input from one or more sources and send their output to other neurons. The layers of a neural network are built from distributed neurons, often with specific functions or operations assigned to each layer. The layers consist of an input layer, any number of hidden layers, and an output layer.

The depth of the network is decided by the number of hidden layers between the input and output layers, where more layers result in a deeper network. This setup of several layers of consecutive calculations means that the network can, in theory, adapt its output space to any non-linear problem space, given that the activation functions are non-linear. The layers extract an increasingly complex representation of the feature set, starting from the input layer, feeding through the hidden layers, and ending in the output layer.

The simplest configuration of an ANN is a multilayer perceptron (MLP) (Gardner and Dorling, 1998), also called a feed-forward network, shown in Figure 2 below.

Multilayer in this case means that there are more than two layers, i.e. more than just the input and output layers. Every neuron in each layer is also fully interconnected with every neuron in the neighboring layers, i.e. the layers are dense.

Figure 2. Visualization of a dense MLP with 3 input nodes, 4 hidden nodes, and 2 output nodes. Between each layer are the connection weights W.

The output of a neuron is defined as some activation function applied to the sum of all weighted inputs plus a bias. The output y of the neuron is then defined as

y = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right),  (1)

where \sigma is the chosen activation function, n is the number of inputs, w_i is the weighted connection to input x_i, and b is the bias. The bias acts as a constant input of 1 with a trainable weight that can shift the activation function either left or right, i.e. it controls where the activation function intercepts the y-axis. The bias allows the network to fit an even wider variety of complex inputs.

The choice of activation function is important in deciding whether the output of the network will be non-linear or not. There are many kinds of activation functions with varying degrees of non-linearity and other constraints, but a common activation function is the sigmoid function, defined as

\sigma(x) = \frac{1}{1 + e^{-x}}.  (2)
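For concreteness, a minimal NumPy sketch of Eqs. (1) and (2) is shown below; the function names are illustrative and not taken from the thesis code.

    import numpy as np

    def sigmoid(x):
        # Eq. (2): the logistic sigmoid activation function.
        return 1.0 / (1.0 + np.exp(-x))

    def neuron_output(x, w, b):
        # Eq. (1): y = sigma(sum_i w_i * x_i + b) for one artificial neuron.
        return sigmoid(np.dot(w, x) + b)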

2.3.2 Back-propagation

The next step after feeding all inputs into an ANN is the training step. The training process uses the back-propagation algorithm (Gardner and Dorling, 1998) and requires a loss function to evaluate the current configuration of the network. To know how to improve the configuration it is common to use a gradient-based optimizer. The optimizer updates the weights in the direction indicated by the derivative of the activation function. The bias weights are also updated this way, but since the bias node always outputs a constant value of 1, it takes no inputs from previous layers. The sigmoid function has a simple derivative, defined as

\sigma'(x) = \sigma(x)(1 - \sigma(x)),  (3)

and with it, the optimizer can minimize the error between the resulting output and the expected output. Mean square error (MSE) is a simple example of a loss function and stochastic gradient descent (SGD) of an optimizer. MSE is defined as

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2,  (4)

where Y_i are the predicted values and \hat{Y}_i are the expected values. A general formulation of SGD can be defined as

w_{k+1} = w_k - \alpha_k \tilde{\nabla} f(w_k),  (5)

where \tilde{\nabla} f(w_k) := \nabla f(w_k; x_{i_k}) is the gradient of the loss function f, e.g. Euclidean distance, cross-entropy, or MSE (Pascanu et al., 2013b), evaluated on the data batch x_{i_k} (Wilson et al., 2017).
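A minimal NumPy sketch of Eqs. (4)–(5) follows; batching and the learning-rate schedule are left out, and the names are illustrative.

    import numpy as np

    def mse(y_pred, y_true):
        # Eq. (4): mean squared error over a batch of predictions.
        return np.mean((y_pred - y_true) ** 2)

    def sgd_step(w, grad, lr=0.01):
        # Eq. (5): w_{k+1} = w_k - alpha_k * grad f(w_k), with a fixed step size.
        return w - lr * grad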

An adaptive approach to optimization is the RMSProp algorithm. It is used to implement an adaptive and separate learning rate for each of the weights in the network. The idea behind this is that the magnitude of the gradients can be quite different from layer to layer. Because of this, it can be hard to decide on one global learning rate. RMSProp attempts to solve this problem by keeping a moving average of the squared gradient for each weight, defined by Tieleman and Hinton (2012) as follows

\upsilon(w, t) = \gamma \upsilon(w, t - 1) + (1 - \gamma)\left(\frac{\partial E}{\partial w}(t)\right)^2,  (6)

where E, w, and t are the error, weight, and epoch, respectively, and \gamma is the forgetting factor, which is typically set to 0.9.
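Below is a minimal NumPy sketch of Eq. (6) together with the per-weight update it enables; the update rule itself is standard RMSProp rather than something spelled out in the text above, so treat the constants as illustrative.

    import numpy as np

    def rmsprop_step(w, grad, v, lr=0.001, gamma=0.9, eps=1e-8):
        # Eq. (6): moving average of the squared gradient, kept per weight.
        v = gamma * v + (1.0 - gamma) * grad ** 2
        # Per-weight adaptive step: scale the gradient by the root of the moving average.
        w = w - lr * grad / (np.sqrt(v) + eps)
        return w, v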

2.3.3 Recurrent neural networks

Recurrent neural networks (RNNs), as explained by Pascanu et al. (2013a), are networks with discrete-time simulation capabilities. This temporal behavior is implemented by introducing feedback loops from nodes in a previous layer to nodes in the current layer (Karim et al., 2018). This recurrent connection makes it possible to blend and reuse outputs from previous timesteps with the current timestep.

The RNN, much like MLP, includes an input node and an output node that takes in outside data and yields results, respectively. The difference is that RNNs can process inputs of variable length thanks to their hidden nodes, or internal state. Figure 3 below shows the input node as X, the hidden node as h, the output node as O. It also shows how an RNN is “unfolded” by expanding the recurrent part of the network into multiple directionally connected layers.

Figure 3. “Recurrent neural network unfold” by fdeloche / CC BY-SA 4.0

Pascanu et al. (2013a) describe the RNN, or the "discrete-time dynamical system", as containing an input x_t, an output y_t and a hidden state h_t, where t represents time:

h_t = f(x_t, h_{t-1}) = \phi(W h_{t-1} + U x_t),  (7)

y_t = f_o(h_t, x_t) = \phi_o(V h_t),  (8)

where W, U, and V are the transition, input, and output matrices, respectively, while \phi and \phi_o are element-wise nonlinear functions such as sigmoid, tanh, or ReLU.
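A minimal NumPy sketch of one timestep of Eqs. (7)–(8), using tanh and sigmoid as the assumed nonlinearities:

    import numpy as np

    def rnn_step(x_t, h_prev, W, U, V):
        # Eq. (7): new hidden state from the previous hidden state and the input.
        h_t = np.tanh(W @ h_prev + U @ x_t)
        # Eq. (8): output computed from the new hidden state.
        y_t = 1.0 / (1.0 + np.exp(-(V @ h_t)))
        return h_t, y_t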

2.3.4 Long short-term memory

A problem that RNNs have is that the hidden layer's influence is prone to the vanishing or exploding gradient problem. This is because the repeated multiplication across layers and their recurrent connections can make the gradient shrink or grow exponentially (Graves, 2012). Looking at Eq. (3) above, the derivative of the sigmoid function is always less than one. As such, there is a risk that the gradient becomes increasingly small after several iterations. The hidden state in Figure 3 is the propagated information recurrently traveling from one timestep to the next and, through repeated multiplication with a too small or too large gradient, it can vanish or explode, thus making it impossible to accurately take "temporally distant events" into account (Pascanu et al., 2013b).

Long short-term memory (LSTM) is an improvement over regular RNNs, as they mitigate the vanishing gradient problem. The structure of an LSTM network is built up with memory blocks, in place of the RNNs hidden state. These memory blocks contain memory cells that are controlled by the input, output, and forget-gates, allowing the LSTM unit, shown in Figure 4 below, to process information over many more timesteps. This finally mitigates the vanishing gradient by not allowing the different gates to open and update any values, as long as the activation is near 0 (Graves, 2012).

More formally, Karim et al. (2018) summarize the definition of an LSTM as

g_u = \sigma(W_u h_{t-1} + I_u x_t),  (9)

g_f = \sigma(W_f h_{t-1} + I_f x_t),  (10)

g_o = \sigma(W_o h_{t-1} + I_o x_t),  (11)

g_c = \tanh(W_c h_{t-1} + I_c x_t),  (12)

m_t = g_f \odot m_{t-1} + g_u \odot g_c,  (13)

h_t = \tanh(g_o \odot m_t),  (14)

where, for every timestep t, a hidden vector h_t and a memory vector m_t are maintained and responsible for state updates and outputs. The gates g_u, g_f, g_o, g_c are computed using the logistic sigmoid function \sigma, the recurrent weight matrices W_u, W_f, W_o, W_c, the projection matrices I_u, I_f, I_o, I_c, the previous hidden vector h_{t-1} and the input x_t. \odot denotes element-wise multiplication.
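A minimal NumPy sketch of one LSTM step, Eqs. (9)–(14); the dictionary-of-matrices layout is an illustrative choice, not the thesis code.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, m_prev, W, I):
        # W and I hold the recurrent and projection matrices for gates 'u', 'f', 'o', 'c'.
        g_u = sigmoid(W['u'] @ h_prev + I['u'] @ x_t)   # update gate, Eq. (9)
        g_f = sigmoid(W['f'] @ h_prev + I['f'] @ x_t)   # forget gate, Eq. (10)
        g_o = sigmoid(W['o'] @ h_prev + I['o'] @ x_t)   # output gate, Eq. (11)
        g_c = np.tanh(W['c'] @ h_prev + I['c'] @ x_t)   # candidate memory, Eq. (12)
        m_t = g_f * m_prev + g_u * g_c                  # memory update, Eq. (13)
        h_t = np.tanh(g_o * m_t)                        # hidden output, Eq. (14)
        return h_t, m_t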


Figure 4. “Long Short-Term Memory” by fdeloche / CC BY-SA 4.0

2.3.5 Convolutional neural networks

Convolutional neural networks (CNNs), much like ANNs, are inspired by biological traits where CNNs are based on the idea to mimic the visual cortex part of the brain.

The CNN architecture, shown in Figure 5, consists of convolutional and pooling/subsampling layers, followed by dense layers producing the final output (Rawat and Wang, 2017). The different layers of a CNN serve specific purposes. Starting with the convolutional layers, their purpose is to extract features from the input, be it images or other types of data with learnable patterns. These features are extracted by analyzing small parts of the input using the parameters filters and kernel size. The number of feature maps corresponds to the number of filters used, while the kernel size defines the size of the sliding window used by the filters from which they extract feature maps. Rawat and Wang (2017) formally define it as

\gamma_k = f(W_k * x),  (15)

where the kth output feature map \gamma_k is computed by applying a nonlinear activation function f to the 2D convolution * between the kernel W_k and the input x.
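A minimal NumPy sketch of Eq. (15) for a single-channel input and one kernel; as in most deep learning libraries, the 'convolution' is implemented as a cross-correlation, and ReLU is an assumed choice for f.

    import numpy as np

    def conv2d_relu(x, kernel):
        # Valid convolution of a 2D input x with a single 2D kernel, followed by ReLU.
        kh, kw = kernel.shape
        oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
        out = np.zeros((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
        return np.maximum(out, 0.0)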


Figure 5. “Typical cnn” by Aphex34 / CC BY-SA 4.0

The pooling layers are used to reduce the resolution of the input feature maps. This is done by looking at a section of the feature map and reducing it down to one value. The reason behind this is that the original, higher resolution input is prone to distortions and translations that can affect the result. A common way is to employ max-pooling where the maximum value is chosen to represent the region. Rawat and Wang (2017) define it formally as

\gamma_{kij} = \max_{(p,q) \in \aleph_{ij}} x_{kpq},  (16)

where the kth feature map output \gamma_{kij} is obtained by taking the maximum over the elements x_{kpq} at locations (p, q) inside the pooling region (filter region) \aleph_{ij}.
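A minimal NumPy sketch of Eq. (16) with non-overlapping pooling regions (the pool size is illustrative):

    import numpy as np

    def max_pool2d(feature_map, pool=2):
        # Reduce each pool x pool region of the feature map to its maximum value.
        h, w = feature_map.shape
        h2, w2 = h // pool, w // pool
        regions = feature_map[:h2 * pool, :w2 * pool].reshape(h2, pool, w2, pool)
        return regions.max(axis=(1, 3))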

2.4 Markov chain

Markov chains (MCs) are modeled after transition probability matrices and can efficiently model complex stochastic processes. Some applications where MCs have been used successfully are in probabilistic models such as wind speed modeling (Sahin and Sen, 2001), contact-based disease spreading (Gomez et al., 2010), credit risk modeling (Wozabal and Hochreiter, 2012), and genetic algorithms (Suzuki, 1995).

A sequence of states 𝑋1, 𝑋2, 𝑋3, … , 𝑋𝑡 is called an MC if it satisfies the property that 𝑋𝑡 is only dependent on the previous state 𝑋𝑡−1, or defined more specifically as

P[X_t \in A \mid X_0, X_1, \ldots, X_{t-1}] = P[X_t \in A \mid X_{t-1}],  (17)

where 𝐴 can be any set of states and 𝑃 is a conditional probability (Gilks et al., 1995).
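A library-free sketch of how a first-order MC can be estimated from event sequences: transition probabilities are simply the normalized counts of observed state pairs. The function name and data layout are illustrative.

    from collections import defaultdict

    def fit_transition_probs(sequences):
        # Count observed transitions a -> b over all sequences.
        counts = defaultdict(lambda: defaultdict(int))
        for seq in sequences:
            for a, b in zip(seq[:-1], seq[1:]):
                counts[a][b] += 1
        # Normalize each row so that probs[a][b] = P(X_t = b | X_{t-1} = a), cf. Eq. (17).
        probs = {}
        for a, row in counts.items():
            total = sum(row.values())
            probs[a] = {b: c / total for b, c in row.items()}
        return probs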

2.5 Related work

Karim et al. (2018) mention that there has been an abundance of time series data in the last decade, e.g. weather readings, financial recordings, and industrial observations. While the application of deep learning in those scenarios has proven successful, they propose two deep learning models that can outperform the current best models.

The proposed models are augmentations of a fully convolutional neural network (FCN) using either a basic LSTM module (LSTM-FCN) or an LSTM module with an attention mechanism (ALSTM-FCN). The network architecture consists of two paths, or views, that the input data takes: the CNN path and the LSTM path. While the two paths are concatenated before a final SoftMax activation function, the difference between the two proposed models lies in the LSTM side of the architecture: it uses either a basic LSTM or the proposed Attention LSTM, while everything else stays the same.

The input data takes the form of either an 𝑁 length univariate time series for the FCN path or a multivariate single-step time series with 𝑁 variables for the LSTM path. This is done by transposing the temporal dimension in a dimension shuffle layer. This proved to increase the efficiency of their models by significantly decreasing the training time, as well as decreasing the tendency to overfit on smaller sequences.

A different application to time series problems is anomaly detection using MC (Vasheghani Farahani et al., 2019). They also start by pointing out the abundance of time series data that is being collected in real-time, such as telecommunications, network traffic, power usage, etc. Once an anomaly is detected, either from streaming data (online) or stored data (offline), it can be investigated and explained by a domain expert. They successfully implement their proposed method and perform tests with positive results.

The contribution Vasheghani Farahani et al. (2019) bring is in the form of pattern-wise anomaly detection where smaller sub-sequences of a time series are investigated.

Approaching the problem from an MC perspective comes from the idea that an anomalous behavior can be represented using a sequence of observations. Their algorithm determines the likelihood that a sequence of events is anomalous by using k-means clustering. The clustering is done by assuming that anomalies will end up outside of a normal cluster, the distance between normal and anomalous clusters is large, and finally that clusters with few objects represent anomalies. The different clusters can be thought of as states in the MC. As such, the likelihood that a subsequence is an anomaly is determined by the likelihood that consecutive subsequences in the time series will end up in an anomaly state/cluster.

2.6 Ethical issues

A concern regarding gamified education is that it can affect student behavior negatively. Does the gamification layer advocate cheating that was not possible before?

Will the students learn cheating behavior that does not apply to real life? By including a gamified layer in the course curriculum, the consequences of cheating can be downplayed and treated as not so serious events. Cheating in the digital world can be very different from cheating in the real world.

Gal-Oz and Zuckerman (2015) discuss the implications of cheating in online games and in a gamified fitness system. Cheating in online games comes from the act of abusing the game mechanics in unintended ways to give the player an unfair advantage over other players. Gamified systems, on the other hand, are different from online games in that they are a hybrid between purely functional software and games. As such, Gal-Oz and Zuckerman state that such a system offers an inferior experience while also having a greater impact on real life. In the case of a gamified fitness system where the rewards are given based on physical sensors, users can abuse this weakness to 'fabricate false detection' and cheat the system.

Another concern is whether the implementation of gamification addresses exploitative and manipulative issues. Publicly showcasing the performance of students in the form of leaderboards or deliberately using provocative phrasing are two examples of the aforementioned issues (Kim and Werbach, 2016). Furthermore, Bogost (2015) claims that gamification is “exploitationware” and that it is there to replace real incentives with fictional ones, requiring no real investment.

3 Method

This section presents the overall problem definition of the study, including its aim and motivation, the research question that is focused on throughout the study, and the hypothesized outcome.

3.1 Problem definition

The purpose of this study was to investigate whether it is possible to determine if a student will pass the final exam or not, based on gamified behavior data from a school's learning management system. A big part of this study was to explore the different machine learning models LSTM, CNN, and Markov chains. A more thorough background is given for LSTM, detailing where it came from and why. As such, the ANN and RNN models are discussed but not used in the results. The results are compared between the chosen models in terms of prediction accuracy. The complexity of data processing and implementation is also discussed.

Using gamification in an educational environment has a positive effect on student learning (Buckley and Doyle, 2016). Although the effect is generally positive, it also depends on whether the student is motivated intrinsically or extrinsically, i.e. whether they are motivated by personal goals or material goals. With gamification in mind, the benefit made possible by machine learning is that the teacher's attention and supervision can be more efficiently distributed to students in need. The more efficient the supervision, the better the chances are for students to pass the final exam.

3.2 Research question

The research question in this study focuses on if the data provided is sufficient for the given problem. More specifically, is the gamified behavior data generated by student interaction enough to effectively classify students into ‘pass’ or ‘fail’ classes? Thus, the final research question developed is: Can the outcome of a student’s final exam be predicted using machine learning methods and time-series data from a gamified learning management system?


The hypothesis, given the pattern-recognition capabilities of the CNN, the longer-term memory of the LSTM, and the probability distributions of the MC, is that these different models can solve the problem to a satisfying degree. Although the data requires model-specific preprocessing, the hyperparameters and data input need to be kept as equal as possible to give the models the same conditions.

3.3 Metrics

Because the research question is a binary classification problem, common approaches to measuring performance are the metrics precision, recall, and accuracy. Precision measures how many of the positive predictions were actually positive, while recall measures how many of the actual positives were correctly predicted. Accuracy measures the overall proportion of correct predictions. They are each defined as

\mathrm{Precision} = \frac{tp}{tp + fp},  (18)

\mathrm{Recall} = \frac{tp}{tp + fn},  (19)

\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn},  (20)

where tp, tn, fp, fn are the true positive/negative and false positive/negative predictions (Seliya et al., 2009). The positive and negative outcomes are represented by the pass and fail classes, respectively. A true positive means that a student whose true class is pass was correctly classified as pass, while a false positive means that a student whose true class is fail was incorrectly classified as pass. True negatives and false negatives are defined analogously for the fail class.

The potential problem with using accuracy as a metric arises in the case of imbalanced classes. Given a sample dataset where 95% of the samples are positive and 5% negative, simply classifying all samples as positive would yield an accuracy of 95%, which gives a false indication of the performance. This is known as the accuracy paradox (Abma, 2009, pp. 86–87) and there are ways to circumvent it. A balanced accuracy can be created by normalizing the true positives and true negatives by their class sizes and dividing their sum by two, defined as

\mathrm{Balanced\ Accuracy} = \frac{TPR + TNR}{2}.  (21)
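A minimal sketch of Eqs. (18)–(21) with 'pass' encoded as 1 and 'fail' as 0; zero-division guards are omitted for brevity.

    def classification_metrics(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp)                          # Eq. (18)
        recall = tp / (tp + fn)                             # Eq. (19)
        accuracy = (tp + tn) / (tp + tn + fp + fn)          # Eq. (20)
        balanced = (tp / (tp + fn) + tn / (tn + fp)) / 2    # Eq. (21): (TPR + TNR) / 2
        return precision, recall, accuracy, balanced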

In this case, the data provided by Insert Coin AB is balanced, with a nearly 50/50 split between the two class labels. Thus, the decision was made to only use the accuracy metric to evaluate the performance of the classifiers. The setup to test the accuracy required extra care to make sure that training and testing were always done on the same samples for all scenarios.

The results from an MC can be measured using log probability, i.e. taking the logarithm of whatever probability is generated. This transforms the product of transition probabilities into a sum of log probabilities, which can more easily be added to or subtracted from other log probabilities. The log probability of a single transition can be defined as

\log \text{probability} = \log(P(X_t \mid X_{t-1})),  (22)

where P(X_t \mid X_{t-1}) is the probability of observing state X_t given state X_{t-1}. The resulting values lie in the range (-\infty, 0] ("Maximum Likelihood Estimation for Markov Chains," 2009).
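A minimal sketch of how Eq. (22) can be summed over a whole event sequence, given a transition-probability table such as the one sketched in section 2.4; unseen transitions are assigned a log probability of negative infinity.

    import math

    def sequence_log_probability(seq, probs):
        # probs[a][b] = P(X_t = b | X_{t-1} = a); seq is a list or string of event symbols.
        total = 0.0
        for a, b in zip(seq[:-1], seq[1:]):
            p = probs.get(a, {}).get(b, 0.0)
            total += math.log(p) if p > 0 else float("-inf")
        return total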

3.4 Validity threats

Analyzing and mitigating possible validity threats is a critical process in questioning the validity of an experiment's results (Feldt and Magazinius, 2010). Validity is adequate if the results are valid for the group that conducted the study, or if it is possible to generalize the results to a broader group of interest. However, this does not imply that adequate validity is the most general validity. If a study was made by an organization and designed to only answer questions relevant to them, the results only need to be relevant for them (Wohlin et al., 2012, p. 102).

The conclusion validity threats, low statistical power, fishing and error rate, and reliability of measures (Wohlin et al., 2012, pp. 104–105) are considered mitigated by taking extra care in the preprocessing steps. Low statistical power was mitigated automatically through the fact that the dataset provided has balanced classes, i.e. no need to up- or down-sample. Fishing and error rates were mitigated by comparing the prediction accuracy of the two neural network models with all hyperparameters set as equal as possible. Finally, the reliability of measures was mitigated by ensuring repeatability between tests through consistent splits into train and test sets.

4 Implementation

This section describes the implementation part of the study, such as how the data processing is done as well as the chosen machine learning models and their settings.

Some problems are discussed and their solutions, when applicable, are motivated for each approach.

4.1 General preprocessing

Insert Coin AB provides data originating from a school’s learning management system.

The data consists of events in the form of JSON objects with supporting information such as which user it was and when it happened. The first step was to extract only the information of interest for a time series prediction, such as the user, the timestamp, and the event itself. To make the events easier to read and, more importantly, to make some events less ambiguous, they were processed into much shorter strings with much clearer names, see Figure 6. Finally, a specific event related to the actual test, called 'tag:prov', was removed completely from the dataset. The idea behind this decision was that this event would influence the resulting class too much to make it a challenge, as well as playing with the idea that this problem is meant to be 'real-time', i.e. that classifications can be evaluated at any time throughout the course.

Figure 6. "Pre" preprocessing shown on the left and "post" preprocessing on the right. A side effect of merging some events led to the number of events going down from 20 to 16.

4.2 ANN data preprocessing

The processed dataset consists of 92 unique users and 16 unique events, compared to the 20 events originally present. If the dataset is to be used by the chosen neural network models, it requires further processing, as well as a split into training, validation, and test sets. Because the events are categorical values, the decision was made to one-hot encode the events at each timestep. One-hot encoding turns categorical values, like the names of events, into vectors with one column per event and a 1 or 0 indicating whether the specific event is present. As such, the resulting dataset contains a time series of vectors containing 1s and 0s, split into 60% training, 20% validation, and 20% test sets.

To further exploit the fact that the dataset originates from a school course, the start and end dates are used to pad every user’s time series into the same length, making the time series 114 days long. This was done by one-hot encoding the summed events of each day and padding empty days with a ‘no_activity’ event. As such, the order of the events throughout the day is no longer considered, in favor of instead representing multiple events at once per day. Figure 7 below shows a heatmap of the events throughout the duration of the course. The heatmap is generated by summing the number of events that occurred on each day.
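A minimal sketch of the daily encoding described above: each user becomes a 114 x (number of events) matrix, with empty days marked by the 'no_activity' event. Whether 'no_activity' is counted among the 16 events or added on top is not stated, so here it is simply appended as an extra column; the helper name and data layout are illustrative.

    import numpy as np

    def encode_user(day_events, event_names, n_days=114):
        # day_events: dict mapping day index (0..n_days-1) to the set of events that day.
        vocab = list(event_names) + ["no_activity"]
        index = {e: i for i, e in enumerate(vocab)}
        x = np.zeros((n_days, len(vocab)), dtype=np.float32)
        for day in range(n_days):
            events = day_events.get(day)
            if events:
                for e in events:                    # several events can share a day
                    x[day, index[e]] = 1.0
            else:
                x[day, index["no_activity"]] = 1.0  # pad empty days
        return x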


Figure 7. The number of events per day throughout the course, where the top image is the first half and the bottom image is the second half. The x-axis represents the days and the y-axis consists of each event. The darker the color, the more activity of that specific event occurred.

4.2.1 Experiment

The LSTM and CNN models were implemented using Keras (Chollet and others, 2015) and TensorFlow (Martín Abadi et al., 2015) and everything between them was kept as equal as possible. The exceptions, of course, are where the two models differ in functionality, e.g. the CNN's use of the hyperparameters kernel and filter and its convolutional layers, as well as the LSTM's use of LSTM layers. The common aspects are the output layer, the number of input neurons/units, the optimizer function, the loss function, and finally the preprocessed data used. Because the LSTM required preprocessing to work, the CNN was tested using the same preprocessed data. Since this study concerns a binary classification problem, the output layer was defined as a dense layer with one unit using the sigmoid activation function. For the optimizer, RMSProp was chosen along with the loss function binary cross-entropy.
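A hedged sketch of how the two Keras models could look, given the shared settings above (RMSProp, binary cross-entropy, a single sigmoid output unit, and inputs of shape 114 days x 16 one-hot features); the layer sizes and kernel size are illustrative, not the exact thesis architecture.

    from tensorflow.keras import layers, models, optimizers

    def build_lstm(timesteps=114, n_events=16, lr=0.001):
        model = models.Sequential([
            layers.LSTM(16, input_shape=(timesteps, n_events)),
            layers.Dense(1, activation="sigmoid"),      # binary pass/fail output
        ])
        model.compile(optimizer=optimizers.RMSprop(learning_rate=lr),
                      loss="binary_crossentropy", metrics=["accuracy"])
        return model

    def build_cnn(timesteps=114, n_events=16, lr=0.001):
        model = models.Sequential([
            layers.Conv1D(filters=16, kernel_size=3, activation="relu",
                          input_shape=(timesteps, n_events)),
            layers.MaxPooling1D(pool_size=2),
            layers.Flatten(),
            layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=optimizers.RMSprop(learning_rate=lr),
                      loss="binary_crossentropy", metrics=["accuracy"])
        return model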

Because the number of events is so few and the classification is binary, the networks were made as simple as possible, i.e. keeping the number of neurons and layers low. This was done to make them less complex and less prone to overfitting, as well as keeping the training time low; the narrow time allotted for this study limited the number of quantitative experiments that could be done.

Quantitative tests were done on the two different network models, where the common hyperparameters were tested using different values. To ensure valid comparative results, the dataset was split the same every time. As such, the training data and testing data was the same for every model and variation of their parameters. Thus, the prediction results could be accurately compared between all results, mitigating some conclusion validity threats.

The test setup consisted of iteratively creating, training, and testing prediction accuracy of the two neural network models while changing the learning rate, batch size, and class weights independently for each session. This was done by deciding on a range of values to investigate and extracting five linearly spaced values in that range, thus creating 30 different models to compare.
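A hedged sketch of the sweep for one of the hyperparameters (the learning rate); batch size and class weights would be swept analogously, always on the same train/test split. The range comes from section 5, while x_train, y_train, x_test, y_test and the build_lstm/build_cnn helpers from the sketch above are assumed to exist.

    import numpy as np

    learning_rates = np.linspace(0.001, 0.05, 5)         # five linearly spaced values
    results = {}
    for lr in learning_rates:
        for name, builder in [("cnn", build_cnn), ("lstm", build_lstm)]:
            model = builder(lr=lr)
            model.fit(x_train, y_train, epochs=300, verbose=0)
            _, acc = model.evaluate(x_test, y_test, verbose=0)
            results[(name, round(float(lr), 3))] = acc   # accuracy per model and value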

4.3 MC data preprocessing and experiment

MC requires a different setup compared to the ANNs. The data was not one-hot encoded but instead was kept as is concerning the ordering. The event strings were translated into single symbols and grouped into a new string representing the sequence of events carried out by each user. As such, the actual order is preserved but compressed into one string, see Figure 8 below.

Figure 8. The left table shows the conversion between event name and symbol, while the right table shows what a typical event sequence looks like.

Using the python library Pomegranate (Schreiber, 2018), two different kinds of MCs were created from the samples in the dataset: one based on users with class Pass and one based on users with class Fail. Each kind had five MCs created, using k ∈ [1, …, 5]. The idea behind creating two versions is that the different MCs will give different probabilities based on the class of the user. For example, evaluating a user who has failed using the MC sampled on users who have failed should result in a high probability, or a log probability close to zero. Comparing the same user with the MC sampled on users who have passed should instead result in a low probability, or a much larger negative log probability.

The symbol-to-event mapping shown in Figure 8 is:

Symbol  Event
A       COURSE_FEED_VISITED
B       COURSE_WORK_FEED_VISITED
C       turned_in:film
D       tag:prov
E       turned_in:exit
F       tag:medhjälpare
G       turned_in:quiz
H       title:exit
I       title:quiz
J       turned_in:diagnos
K       turned_in
L       tag:kurs
M       turned_in:provquiz
N       COURSE_WORK_GRADED_VISITED
O       COURSE_EXPLORED
P       COURSE_WORK_MISSING_VISITED

The right-hand table of Figure 8 lists, for each user, the full event string together with the pass/fail label.


The test setup consisted of running predictions and retrieving a similarity value, or probability value, for every user's sequence of events with every MC model. Because there were k ∈ [1, …, 5] MCs for each class, the resulting data consisted of 10 log probabilities for each user. A user was then evaluated based on which class of MC model it best aligned with, that is, which class gave a log probability closest to zero.

Unfortunately, this approach presented several problems that would require substantial further work, as discussed in section 5.3.
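A hedged sketch of the per-class setup and the decision rule, assuming the pre-1.0 pomegranate API (MarkovChain.from_samples and log_probability); the order parameter k and the exact calls used in the thesis are not reproduced here.

    from pomegranate import MarkovChain

    def fit_class_chains(sequences_pass, sequences_fail):
        # Each sequence is a list of event symbols, e.g. list("AABAB...").
        mc_pass = MarkovChain.from_samples([list(s) for s in sequences_pass])
        mc_fail = MarkovChain.from_samples([list(s) for s in sequences_fail])
        return mc_pass, mc_fail

    def classify(sequence, mc_pass, mc_fail):
        # Predict the class whose chain assigns the log probability closest to zero.
        lp_pass = mc_pass.log_probability(list(sequence))
        lp_fail = mc_fail.log_probability(list(sequence))
        return 1 if lp_pass > lp_fail else 0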

5 Results

This section presents the results from the test experiments, as well as discussing the relation to the data. Interesting findings are mentioned, and some decisions are motivated.

5.1 ANN performance

The difference in performance between the CNN and LSTM models is demonstrated using grouped bar charts, where each group showcases the difference in accuracy for the parameter in question. Every parameter is kept equal where possible, and each model has been trained for 300 epochs.

Figure 9 below displays the accuracy for five different learning rates, linearly spaced in the range [0.001, 0.05]. The tendency seems to be that the higher the learning rate, the lower the accuracy. It also seems that the CNN model generally achieves higher accuracy than the LSTM model for varying learning rates, except for a learning rate of 0.013 where they are equal, which is fairly probable given how small the test set is. The highest accuracy achieved is 0.89, by the CNN model with a learning rate of 0.001. The same learning rate for the LSTM model gives an accuracy of 0.53. Given the balanced classes, the CNN's results are rather good while the LSTM's results are close to random guessing.

Figure 9. Accuracy per model for various learning rates. The learning rate controls how much the network updates the weights after each iteration.


Settling on a learning rate of 0.013, where both CNN and LSTM performed equally, different batch sizes in the range [1, 20] were then compared. As displayed in Figure 10 below, the results start out similarly to the learning rate experiment: the accuracy of the CNN started out higher but eventually became similar to or slightly below that of the LSTM. In general, a larger batch size produces higher accuracy for both models but, once again, the test set, although balanced, was limited in size. The highest accuracy was tied between the CNN with batch sizes of 1 and 20 and the LSTM with a batch size of 10. The CNN and LSTM models achieved equal accuracy at batch size 15.

Figure 10. Accuracy per model for various batch sizes. The batch size controls how many samples are processed before the weights are updated.

Continuing with the learning rate and batch size set to 0.013 and 1 respectively, the class weight parameter was tested. The normal usage of this parameter is to make an unbalanced dataset more balanced during training, but in this case it is used to emphasize the 'Pass' class. The parameter starts with both classes weighted equally and ends with the positive class weighted 10 times more. Looking at Figure 11, the LSTM model barely shifts in accuracy whereas the CNN model starts high but ends up declining. The maximum accuracy of 0.84 is achieved by the CNN model.


Figure 11. Accuracy per model for various class weights. The class weight is intended to be used on unbalanced datasets to make the influence of each class more even. The intention in this case is to be able to focus more on the 'Pass' class.

5.2 MC performance

The prediction using MCs showed promising results at first. The resulting log probabilities generated by each MC for every user seem to be in line with what is expected, e.g. the positive MC applied to negative users results in mostly -inf values and vice versa. The problem emerges when retrieving the log probability for positive users on the positive MC and vice versa. This log probability varies from around −6 all the way down to −∞, which means that there are positive sequences not recognized at all by the positive MCs and vice versa. As such, there is a strong correlation between the length of a sequence and its log probability, see Figure 12. This problematic correlation has been left untouched in the interest of time, in favor of the ANNs.


Figure 12. Tableau dashboard showing the correlation between the length of a sequence and its negative log probability for every user, colored according to their class. The left plot represents a positive MC while the right represents a negative one.

5.3 Discussion

An interesting phenomenon that occurred during the training of the ANNs was that the loss for the CNN model varied drastically between the training and validation sets, see Figure 13. The CNN can learn the training set with 100% accuracy while the validation accuracy remains around 50-60%. Although most CNN training graphs look like this, the model is still able to predict the correct class on the test set with high accuracy. There is a similar story with the LSTM model, although it is not as quick and determined to overfit, which can be seen in how sporadic the loss and accuracy are in Figure 14.

Figure 13. A typical loss (left) and accuracy (right) training graph for the CNN model. It shows overfitting on the training set.


Figure 14. A typical loss (left) and accuracy (right) training graph for the LSTM model. It seems to be able to overfit, but not as easily as the CNN model.

The data setup can play a big role in how the different models perform. The CNN is very vulnerable to this because of how it looks at smaller parts of the input and learns their patterns. Because the input structure stays the same, e.g. no warping or rotation like in an image, it is very easy for the CNN to learn which patterns correlate with each class. The MC has the same vulnerability relating to how the data input is set up. While the MC just used the raw input sequence, creating and padding empty days with a made-up event could have influenced the CNN a lot.

Because the dataset only contains 92 unique users, the training performance might be compromised. Ideally, there would be more users to train on but since this is data provided by Insert Coin AB collected during one course, the data that exists is what the models will have to train on. Generating artificial data would completely negate the human behavior aspect and as such invalidate the gamification aspects of the data.

The reason why the LSTM manages to peak at a batch size of 10 is unclear. It may be that the test set contains 19 users and a batch size of almost half that, combined with the chosen learning rate, happened to favor it at those settings. Also, since the experiment only includes 5 different batch sizes, it is unclear what would happen if the batch size were increased further. Most likely the performance would diminish even more, because the batch size would be much larger than the number of unique users in the test set.

At first, the ANNs were tested using one neuron per timestep, i.e. 114 neurons. The resulting models tended not to overfit, but also not to learn the dataset as well as expected. The number of neurons was later changed to one per feature, i.e. 16 neurons. This led to better overall accuracy and prediction results on both the training set and the test set.

The number of events per user declined sharply as the course progressed, indicating that many users stopped using the service altogether, as can be seen in Figure 15 below.

Further investigation of the data also indicates that all users who have the event 'tag:prov' passed the test, while users who lack that event failed the test. This further motivates the decision to remove that event from the dataset, as preliminary testing of the ANNs with the event included shows that they achieve an accuracy of basically 100%.

Figure 15. Timeline of event activity where the number of events per day is shown as stacked bars, colored per event.

The problem with the MCs and the sequence length correlation could be further developed by including sliding windows on the sequences. This could mitigate the problem by instead averaging a probability over the whole sequence. By removing the sequence length bias, sequences of varying lengths could be more easily compared in future experiments.
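A hedged sketch of the sliding-window idea, reusing the sequence_log_probability helper sketched in section 3.3; the window size is illustrative and would need tuning.

    def windowed_log_probability(seq, probs, window=10):
        # Average the log probability over fixed-size windows so that sequence
        # length no longer dominates the score.
        scores = []
        for start in range(max(1, len(seq) - window + 1)):
            scores.append(sequence_log_probability(seq[start:start + window], probs))
        return sum(scores) / len(scores)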

Although MCs with a higher k than 1 lead to fewer −∞ log probabilities, the overall performance was still the same. There were far too many ill-fitting log probabilities between the two classes of MC and users for this approach to be viable. Since the model generation is an exponentially increasing task, testing much higher k was not feasible for this project, both in terms of time and hardware limitations.

6 Conclusions

The benefit of machine learning and gamification is that the teacher’s attention and supervision can be more efficiently distributed to students in need. As such, using data originating from a school’s gamified learning management system, the problem of classifying users based on whether they will pass or not has been investigated.

The implementation consisted of preprocessing the gamified data, provided by Insert Coin AB, into one-hot encoded sequences compatible with neural networks and Markov chains. The testing and experimentation consisted of evaluating and comparing the accuracy of different hyperparameters for an LSTM and a CNN, as well as the potential application of Markov chains.

The LSTM and CNN models manage to achieve high accuracy after being trained on a balanced dataset, while the MC shows potential to do the same but suffers from a problematic correlation between sequence length and probability. Thus, the hypothesis that the models can, with high accuracy, correctly classify the user's sequence of events is somewhat correct, falling short only for the MC results.

Contributions present in this study are the data preprocessing, the experimentation with LSTM and CNN models, and the discussion of the viability of MCs. Furthermore, Insert Coin AB can also benefit from insights gained in this study by ensuring their data sources are more easily compatible with this kind of process. They can also make sure that whatever data they receive has all the necessary background information, e.g. whether the usage of the gamified system is forced upon the users or not.

CNNs have unused functionality for handling inputs of varying length. It would be interesting to experiment more with the data representation and allow the CNN to adapt to the actual data. Preprocessing the data as heavily as was done here can introduce problems that normally should not be there, e.g. a correlation between 'no_activity' and 'fail'. It would be more beneficial for the CNN to learn patterns in the unprocessed data (i.e. without artificially inflated events) instead.

A real-time aspect was proposed by Insert Coin AB and could be investigated by generating classification reports at different time intervals, e.g. every week or after every big assignment. This would make the results more similar to a setting where actual real-time streamed data is provided continuously, as well as being much more interesting for the teachers, as they probably already know which students will pass or not by the end of a course.

As shown in the heatmap in Figure 7, the activity declined rapidly about halfway through. This way, the second half of the data becomes very sparse and less reliable. It would be interesting to experiment with data where the usage of the gamified system was more enforced by either Insert Coin AB or the school itself. This in conjunction with data from other students at universities or high schools, as well as different academic subjects, could be interesting to investigate. The impact gamification and the applied machine learning can have on students is a future research area that could be of interest to teachers, schools, as well as the students themselves. This is because increased student retention and student motivation is something many want (Buckley and Doyle, 2016).

References
