
Representations of data for

repetitive user behaviors using Markov Chains

Master’s Degree Project in Informatics, Two Years Level, ECTS

Spring term 2020
David Grimmer

Supervisors: Niclas Ståhl, Alberto Montebelli
Examiner: Huseyin Kusetogullari


Abstract

This report presents problematic issues with analyzing user behaviors of a repetitive nature for anomaly detection using the Markov Chain model. The users in the data tend to use certain events in succession for long periods of time. Doing the same thing in succession can be normal, but when can these users be considered to have an abnormal behavior? The work done in this report presents two alternative ways of representing the data that let the Markov Chain capture large sections of repeating events without needing to increase the order of the Markov Chain. The presented representations show promising results for an increase in the Markov Chain's capability to distinguish users of a repetitive nature from each other, and suggestions for future development are given.


1 Introduction 1

1.1 Related work 2

2 Problem definition 3

2.1 Problem statement 3

2.2 Strategy 3

2.3 Research aim 4

2.4 Motivation 4

3 Method 5

3.1 Markov Chain 5

3.2 The Data 6

3.3 Test statements 8

3.4 Test 1 9

3.5 Test 2 9

3.6 Test 3 10

4 Results 12

4.1 Test 1 13

4.2 Test 2 14

4.3 Test 3 16

5 Discussion 19

5.1 General 19

5.2 Test 1 19

5.3 Test 2 20

5.4 Test 3 20

5.5 Conclusion 21

6 Contribution 21

7 Future work 22

References 23

Appendix 1 24


1 Introduction

The work done in this report addresses problems that arise when analyzing user behaviors with a repetitive nature for anomaly detection. Repetitive nature refers to users with the tendency of doing the same thing in succession for long periods of time. Doing the same thing for some time can be normal, but when does it become an abnormal behavior?

The aim of this project is to present alternative ways to represent the sequences describing the user’s activity history to make it easier for the Markov Chain model to distinguish large sections of repetitive actions without losing detailed information for smaller sections.

This work is in cooperation with Insert Coin AB (2012), which provided the data used in this project. Insert Coin AB is a company located in Gothenburg that specializes in gamification.

They develop and maintain their product GWEN, which is a gamification engine that provides a gamification interface to companies who want to implement a gamified solution in their system.

Deterding et al. (2011) describe gamification as using game design elements in a non-game context to make activities more game-like. Implementing gamification can increase user engagement (Terrill, 2008). Hamari et al. (2014) describe in their work that gamification gives positive effects, but that these effects depend greatly on the context in which it is implemented and on the users using it.

GWEN, Gamify the World ENgine, is a service developed by Insert Coin and is used to streamline the implementation of a gamification layer on top of external systems. GWEN is constructed of multiple modules. Each module handles a common gamification element such as levels, missions, achievements, or challenges, and the modules are completely independent of each other. This allows great flexibility in which modules are combined to suit the needs of the client, providing a tailored gamification layer that gives the users a balanced and well-designed gamification experience.

In this report we present two alternative ways to represent the data to better capture how users behave. The first representation works by creating a set of states for each type of event (previously referred to as actions) to represent different repetitions and calculating the transition probabilities between those states. The second representation works by creating states that represent the events and the repetitions separately from each other and calculating transitions between those states. Both presented representations are compared to the results obtained from the unmodified original sequences to establish a baseline of how the presented representations perform.

This report presents findings of interest which suggest that the presented representations can distinguish normal users from flagged users (who often have highly repetitive behavior) better than analyzing the original sequences does.


1.1 Related work

Markov Chains have been used on many occasions for anomaly detection with great results, such as for detecting intrusions into computer and network systems (Ye, 2000). The work done by Boldt et al. (2020) presents an anomaly detection design for Markov Chains. This is done by constructing different temporal resolutions of the sequences and then training the models on the different resolutions. They use Markov Chains with absorbing states to then cluster users by using the absorbing states as labels. The results from the different Markov Chains are then combined to determine whether a user is an anomaly or not. Hoang and Hu (2004) used a Hidden Markov Model (HMM) for anomaly detection of system calls in operating systems. An HMM is similar to a Markov Chain but is based on indirect observation of “hidden” states. The purpose of their work is to present a training scheme that reduces the training time of HMMs by subdividing the sequences of observations, training submodels, and later combining the submodels into a complete HMM. Anomaly detection can be achieved by training on normal behavior and then distinguishing impostors that deviate significantly from it. They note that false alarms are a potential problem, since it is impossible to build a complete database covering all scenarios.

Markov Chains are a common approach for analyzing behaviors such as describing movements of taxis in a network in regards to the relationship between customers and the waiting time for taxi drivers (Wong et al., 2005) or modeling animals’ behavior response (Yang and Chao, 2005).

The article by Avery and Henderson (1999) describes how Markov Chain models can be used to analyze sequences of DNA. They emphasize that increasing the order of the Markov Chain brings exponentially increasing computational complexity in the training process. They also point out a suggested two-way relationship: a high-order Markov Chain needs longer sequences, and longer sequences need a high-order Markov Chain to be analyzed efficiently.

Many of the articles that can be found regarding Markov Chains aim to detect a smaller subsequence or pattern of interest. To solve our problem, however, we need a solution that can also capture information from larger subsequences when the behavior is of a repetitive nature. This project analyzes behavior for anomaly detection and tackles the exponentially increasing computational complexity that arises when large subsequences must be analyzed to capture the repetitive nature of the data.


2 Problem definition

2.1 Problem statement

This work is in cooperation with Insert Coin AB, which provides a gamification layer on top of, in this case, a streaming website. This service lets people earn experience and unlock rewards and achievements based on their actions. These actions are sent to Insert Coin and stored as event data, representing what players do in the system. The service has had instances of players trying to exploit the system in various ways, which can lead to an unfair situation for other players, e.g. in regards to leaderboard placement. Insert Coin seeks to use the data to identify users with abnormal behavior (that is to say, behavior that deviates from the average user) who are likely trying to cheat or exploit the system. An example of this could be users who have found an unintended loophole in the system that rewards them too much, giving them an inappropriate amount of rewards such as experience points.

The problem with analyzing the data comes from the repetitive nature of the users' behavior, which is the focus of this project. Repetitive nature refers to users with the tendency of doing the same thing in succession for long periods of time. Doing the same thing for some time can be normal, but when does it become an abnormal behavior? The example in figure 1 shows a sequence of events (marked by different colors) and a black indicator under the sequence that illustrates how a subsequence is analyzed. When sweeping through the sequence, many of the subsequences will consist of only the orange event, and an orange event following a section of orange events will therefore be considered very likely.

This can be problematic when the sections of repeating events become abnormally large, because this would be missed when analyzing smaller subsequences at a time. Increasing the size of the subsequences to cover those larger sections of repetitions is also problematic because of the resulting increase in model complexity.

Figure 1: A fabricated sequence containing multiple types of events (represented by different colors) that exemplifies a sequence with a repetitive nature. The black mark under the sequence shows how a subsequence, in the left slot, is used to calculate the probability of the next event, the right slot. Note that larger sections of repetitiveness are common in the data and that this is a simplified example.
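The sweep in figure 1 can be sketched as a sliding window over the sequence. The following Python is illustrative only; the function name and the toy sequence are not from the project:

```python
def subsequences(sequence, k):
    """Yield (context, next_event) pairs for an order-k sweep.

    The context is the k events preceding next_event, mirroring the
    black indicator in figure 1 (left slot -> right slot).
    """
    for i in range(k, len(sequence)):
        yield tuple(sequence[i - k:i]), sequence[i]

# A toy repetitive sequence: runs of "O" (orange) dominate, so most
# contexts are ('O', 'O') followed by yet another 'O'.
pairs = list(subsequences("OOOOOGOOOOO", k=2))
```

With a repetitive sequence like this one, the vast majority of the extracted pairs are a run of the same symbol predicting that same symbol again, which is exactly the imbalance the text describes.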

2.2 Strategy

This project will be evaluated by measuring how likely the sequences are to exist under the proposed representations of the data. These results will be compared to those gained from analyzing unmodified sequences from the same data. The average probability for each sequence will be presented, as well as the worst probability within these sequences. The number of sequences that are unrecognized by the model will also be compared throughout the different tests.


Validity threats to this project arise from the lack of similar research regarding the presented problem. Combined with some flaws in the data, which are discussed later, this threatens to make the project an isolated problem for this particular company. The proposed representations of repetitive sequences are experimental in nature and might not be the optimal way of representing this kind of data; room for potential improvement and alternative approaches should be kept in mind. Fishing and error rate, as described by Wohlin et al. (2012), may also be threats to this project.

2.3 Research aim

The aim of this project is to develop different representation strategies that handle repetitive behavior sequences by constructing better states for the Markov Chain, so that the model can better distinguish users of a repetitive nature from each other.

This report has the intent to answer the following question:

● Can alternative representations of the data facilitate a Markov Chain’s capability to analyze sequences of a repetitive nature?

● What seems to be the strengths and weaknesses of those representations?

● Can we detect anomalies based on these representations of data?

The hypothesis for this study is that a restructuring of the sequences is needed to improve the Markov Chain model's understanding of repetitive sequences.

2.4 Motivation

To our knowledge, no directly related work has addressed the problem of analyzing sequences with a high imbalance in which states are used, with large sections staying in the same state over and over. Repeating the same event is normal to some degree, but when does it become abnormal? This, together with the request from Insert Coin, is the main motivation for this study.

From a gamification point of view, finding users who cheat or exploit the system is of interest in order to keep a balanced and fair challenge for all users. A game needs designed rules, and by allowing cheaters the game system collapses. Normal users might lose motivation to continue using the system if the leaderboard is occupied by obvious cheaters with unrealistic scores. One could also speculate that cheating in a gamified system appears rather consequence-free, which could lead to even more exploiting and cheating. There is always an ethical dilemma in what information about the users should be stored. Storing information about what users do in this gamified system can be motivated by the developers' need to further understand their users and improve the system. For this project, analyzing behaviors from stored data can be justified by the aim of detecting abnormal behaviors that threaten the gamified rule system, the experience of other users, and their enjoyment in continuing to use the system.

From a machine learning point of view, being able to analyze data to find anomalies in user behavior is of interest for building a secure system. In this instance, a user's behavior being repetitive does not necessarily suggest any abnormal behavior, but if the behavior becomes too repetitive it can. This makes for an interesting scenario to handle appropriately.

3 Method

3.1 Markov Chain

The Markov Chain is used in this project due to the model's previous successful results in anomaly detection (Ye, 2000; Boldt et al., 2020) as well as in behavior analysis (Wong et al., 2005; Yang and Chao, 2005).

A Markov Chain is a stochastic process that describes the probability of a given sequence of events, where the probability of an event depends only on the previous k events, with k being the order of the Markov Chain (later referred to as the k-value). Markov Chains are so-called “memory-less”, meaning that the probability of a future action does not depend on the steps that led up to the current state; this independence is called the Markov property (Meyn and Tweedie, 2012).

A Discrete-Time Markov Chain (DTMC) is a time- and event-discrete stochastic process (Fabien, 2019). It relies on the Markov property that there is a limited dependence within the process. States are observed as q_1, q_2, ... and are commonly defined over a finite space S = {1, ..., Q}:

P(q_t | q_1 ... q_{t-1}) = P(q_t | q_{t-k} ... q_{t-1}), where k = 1 or 2

When k = 1: P(q_t | q_1 ... q_{t-1}) = P(q_t | q_{t-1})

P(q_1, q_2, ..., q_T) = P(q_1) P(q_2 | q_1) ... P(q_T | q_{T-1})
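A first-order chain can be estimated from data and a sequence scored with the factorization above. This is a minimal plain-Python sketch under these definitions, not the implementation used in the project:

```python
from collections import Counter, defaultdict

def fit_first_order(sequences):
    """Estimate a_ij = P(q_t = j | q_t-1 = i) from transition counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    # Normalize each row so it sums to 1 (the stochastic property)
    return {i: {j: c / sum(row.values()) for j, c in row.items()}
            for i, row in counts.items()}

def sequence_probability(A, seq):
    """P(q_1 ... q_T) via the chain of one-step transitions
    (the initial factor P(q_1) is ignored here for simplicity)."""
    p = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        p *= A.get(prev, {}).get(nxt, 0.0)  # unseen transition -> 0
    return p

A = fit_first_order(["AAB", "ABB"])
# Observed transitions from 'A': A once, B twice, so P(A->B) = 2/3
```

Each row of the estimated matrix sums to 1, matching the stochasticity requirement discussed below for the transition matrix A = [a_ij].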

Fabien describes properties of states: a state is Transient if the probability of returning to it is lower than 1 (this is the most common kind of state), Recurrent if the probability of returning to it is 1, and Absorbing if the probability of staying in the same state is 1. A DTMC is irreducible if all states can be reached in a finite number of steps; this results in a strongly connected graph. A state is Periodic if it requires more than 1 step to return to itself; otherwise it is called Aperiodic. A state with a probability of staying in itself, a self-loop, is always Aperiodic.

The probability of a sequence of states is calculated using Bayes Rule. The DTMC is said to be homogeneous if the time t does not affect the transition probability, so that:

P(q_t = j | q_{t-1} = i) = P(q_{t+k} = j | q_{t+k-1} = i) = a_ij

All transitions can be summarized in a transition matrix:

A = [a_ij], i ∈ 1...Q, j ∈ 1...Q


To fulfill the property of being stochastic, the matrix may contain only non-negative values and each row of the transition matrix must sum to 1. Figure 2 illustrates an example of transition probabilities between the states S1 and S2.

Figure 2: Illustration of transition probabilities between two states as well as transitions back to themselves.

The transition matrix corresponding to the illustration in figure 2 is shown in figure 3.

Figure 3: Transition matrix representing the state transition probabilities, where each row represents the transitions from one state.

The library provided by Pomegranate (2016) is used to implement the Markov Chain. Parameters such as the k-value are varied throughout the different tests.
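The effect of the k-value can be illustrated with a plain-Python order-k model. This sketch is not the Pomegranate API, only an equivalent of the counting such a model performs; all names here are hypothetical:

```python
from collections import Counter, defaultdict
from math import log

def fit_order_k(sequences, k):
    """Count next-event frequencies for every k-length context."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[tuple(seq[i - k:i])][seq[i]] += 1
    return counts

def log_probability(counts, seq, k):
    """Sum of log transition probabilities for a sequence; -inf marks
    an unrecognized sequence (an unseen context or transition)."""
    total = 0.0
    for i in range(k, len(seq)):
        row = counts.get(tuple(seq[i - k:i]))
        if not row or seq[i] not in row:
            return float("-inf")
        total += log(row[seq[i]] / sum(row.values()))
    return total

model = fit_order_k(["AABAAB"], k=2)
# "AAB" only uses transitions seen in training; "AAA" does not
```

Note how the number of possible contexts grows with k, which is the complexity increase the report repeatedly refers to, and how any unseen context immediately yields -infinity, matching the "unrecognized" users in the result tables.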

3.2 The Data

The data comes from a streaming website that has implemented a gamified system for users while they watch streams, for example in the form of quizzes. The data contains the actions users perform in this gamified system, e.g. updating their profile or starting a new quiz.

Available features represented in the data are:

● Entry ID

● Timestamp

● Event information

● Type of event


● User ID

The feature “event information” comes as JSON objects of different sizes and internal structures, resulting in an unstructured feature. User IDs have been depersonalized to prevent ethical issues in handling personal information.

Entry ID is used to prevent duplicated entries. Timestamp is used to determine the order of the events in the sequences. Event information is used as the events in the sequences. Type of event is not used because the events are analyzed individually. User ID is used to organize the sequences.

The data also contains a list of 56 user IDs that have previously been flagged for inappropriate behavior. These users were found manually by the provider and marked as suspicious. Note that users who are not currently flagged could also be behaving badly but have not yet been noticed.

Figure 4 shows the recorded activity of “normal users” (green) compared to the flagged users (red). The line labeled “breakpoint 2” marks the start of the time window used in this project. There are no described restrictions on the order in which events can occur.

Figure 4: Shows the number of recorded activities, on a logarithmic scale, of flagged and normal users over time. The breakpoint indicates the start of the user data used in this project. Bars for the 2 groups are not stacked but overlay each other; the number on each bar shows the exact number of recordings. For a full-sized version, see Appendix 1.

The data obtained from Insert Coin contained some breakpoints where the structure of the data completely changed: new events began being used while others were discarded. This can be seen in figure 5, where the zoomed-in section shows a more detailed view of a break in the structure. The original data spanned 15 months but was reduced to 2.5 months so that we could work on consistent data. The time window closest to the current version of the system was chosen, from breakpoint 2 and forward. The original data consisted of ~25 million recordings from 91 916 users; the selected time window had ~2 million recordings from 45 647 users.

Figure 5: Presents the number of recorded usages of different events (represented by colors) over time, to showcase different breakpoints, such as in the zoomed section, where the structure of which events are being used changes.

Of the 56 flagged users, 31 were active before and during the currently selected time window. One user, during the selected time window, used two events no other user was using. Further investigation found that this particular user used one event that was 6 months outdated and another event that was unique to this user. This user also accounted for 21% of the total recordings during this time and was removed during preprocessing, being considered noise in the data.

The different events were converted into single symbols and concatenated into a sequence of symbols for each user. For readability, the symbols chosen were capital letters of the alphabet, 9 in total. The users were then split into a train and a test set with a 90/10% division in an attempt to capture the average “normal behavior” among the training users. Flagged users are kept in a separate set. Models are trained only on the training set of users, and the sets are kept consistent over all tests done in this project.
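The preprocessing described above can be sketched as follows. The helper names and the example event types are hypothetical; only the letter mapping and the 90/10 user split follow the text:

```python
import random
import string

def to_symbol_sequences(events_by_user):
    """Map each event type to a capital letter and build one symbol
    sequence per user (events assumed already sorted by timestamp)."""
    event_types = sorted({e for evs in events_by_user.values() for e in evs})
    symbol = dict(zip(event_types, string.ascii_uppercase))
    return {user: "".join(symbol[e] for e in events)
            for user, events in events_by_user.items()}

def split_users(users, train_fraction=0.9, seed=0):
    """Split users (not individual events) into train and test sets."""
    users = sorted(users)
    random.Random(seed).shuffle(users)
    cut = int(len(users) * train_fraction)
    return users[:cut], users[cut:]

seqs = to_symbol_sequences({"u1": ["quiz_start", "profile_update"]})
# event types sorted: profile_update -> A, quiz_start -> B, so u1 -> "BA"
```

Splitting by user rather than by event keeps each user's full sequence in exactly one set, which is what allows the sets to stay consistent across all tests.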

3.3 Test statements

This section aims to clarify later descriptions of the different tests.

Event - refers to the recorded action for the users.

Sequence - refers to a chain of events that describe the user’s activity over time.

State - refers to the internal state in the Markov Chain model. The sequence from the data, without any alternative representation, will result in a 1:1 representation for each event as a state in the Markov Chain model.


Users within the train, test, and flagged datasets are consistent throughout all tests and the models are trained on users in the training dataset.

When giving examples of sequences, “-” denotes the separation between the events in a sequence or between resulting states.

3.4 Test 1

The first test intends to establish a ground truth for how the Markov Chain model performs on the provided data. In this test, the original sequence for each user is used to train the model. This results in an internal transition graph similar to the one illustrated in figure 6, where each event is directly translated to a corresponding state. There is no guarantee that there are transitions from every state to every other state, but this is likely since there are no known restrictions on the order in which users can perform actions.

Figure 6: This graph shows transitions between states, as well as transitions back to themselves, as a standard representation of the Markov Chain.

3.5 Test 2

The sequences are restructured to represent repetitions of the same event in a different way. Each event type is represented by multiple states, with repetitions of that particular event resulting in states that represent both the type of the event and a fixed number of repetitions; this is illustrated in figure 7. The state S3 in the illustration represents the largest number of repetitions for the different events. When the number of repetitions in the original data is very high, the state representing the highest repetition transitions back to itself until the remainder is smaller than what the state represents. The set of fixed repetitions for the states is described as R_n = {2^0, 2^1, ..., 2^(n-1)}. All event types are represented with the same number of states.


Figure 7: Shows the possible transitions between states in the presented representation. Different events (colored areas) consist of states that represent that event at a fixed repetition. S3 represents the highest repetition, making it transition to itself when representing large sections of repetitions.

When representing a sequence of repeating events with the corresponding states, the largest valid repetition state is prioritized, and the remainder is reconstructed in the same way.

Example:

When 3 repetition values {1, 2, 4} are used for the states representing the event “A”, a sequence would be represented as follows:

A - A - A - A - A - A - A - A - A - A - A gives: A4 - A4 - A2 - A1

which represents eleven “A” in succession.
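The greedy decomposition in the example can be sketched as follows; this is a sketch of the rule as described, with the repetition set {1, 2, 4}, not the project's code:

```python
from itertools import groupby

def encode_repetitions(sequence, reps=(1, 2, 4)):
    """Split every run of identical events greedily into the largest
    valid repetition states, e.g. eleven 'A' -> A4, A4, A2, A1."""
    states = []
    for event, run in groupby(sequence):
        remaining = len(list(run))
        while remaining:
            # Largest repetition state that still fits the remainder
            r = max(r for r in reps if r <= remaining)
            states.append(f"{event}{r}")
            remaining -= r
    return states

assert encode_repetitions("A" * 11) == ["A4", "A4", "A2", "A1"]
```

Because the repetition set always contains 2^0 = 1, the greedy loop terminates for any run length, and no information about the total count is lost.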

3.6 Test 3

Instead of creating multiple states for each event, as in Test 2, this strategy separates events and repetitions from each other. Each event is represented by one state that leads to another state representing a repetition threshold. This means that every section where an event appears, no matter how many times the same event occurs in succession, will use the same state corresponding to the event, followed by a transition to one of the repetition states. The repetition states represent thresholds, and the highest valid threshold state is used to represent the repeated usage of the same event. This approach introduces rounding down of repeating events to specified thresholds. This is illustrated in figure 8, where 3 event states are represented by {S_e1, S_e2, S_e3} and 3 repetition threshold states by {S_r1, S_r2, S_r3}. Because the remainder is discarded, the event state visited after a repetition state will not be the same event state as the one previous to the repetition state. This is shown to the right in the illustration, which shows two steps between states (marked in red); S_e1 cannot be visited as the next transition after a first step to S_r1. This representation makes all the states periodic.

Figure 8: Shows the possible transitions between states in the presented representation. A) Shows transitions between states representing 3 events and 3 repetition thresholds. B) Gives an example of 2 steps of transitions (red arrows). Since repetitions are rounded down to a threshold, E1 cannot be visited directly again when transitioning from R1.

Example:

Having 3 repetition states for thresholds of {1, 5, 10} and the events “A” and “B”, the following sequence would be represented with states as:

A - A - B - B - B - B - B - A gives: A - 1 - B - 5 - A - 1

Note that the 2 first “A” in this example are rounded down to the closest valid repetition state (which represents the repetition threshold of 1).
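The threshold rounding can be sketched analogously; again a sketch of the described rule with the thresholds {1, 5, 10}, not the project's code:

```python
from itertools import groupby

def encode_thresholds(sequence, thresholds=(1, 5, 10)):
    """Replace each run with its event state followed by the largest
    threshold state not exceeding the run length (remainder discarded)."""
    states = []
    for event, run in groupby(sequence):
        length = len(list(run))
        states.append(event)
        # Round the run length down to the nearest threshold
        states.append(str(max(t for t in thresholds if t <= length)))
    return states

assert encode_thresholds("AABBBBBA") == ["A", "1", "B", "5", "A", "1"]
```

Unlike the Test 2 encoding, this one discards the remainder above the chosen threshold, which is what prevents the same event state from being revisited directly after its repetition state.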


4 Results

The following section presents the results from the different tests, with a detailed table for different k-values (previously mentioned as the order of the Markov Chain), which describe how many previous states are taken into account when probabilities are calculated. The scale represents the log probability (the logarithm of a probability) with values from 0 to -infinity, where -infinity represents unrecognized sequences. Probabilities close to zero belong to the sequences that are the most likely to exist; therefore, max shows the best probability found, min the worst probability, and avg the average probability for users in the different groups of users.
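The per-user measurements (avg, min, and the fraction of unrecognized users) can be computed as sketched below, assuming a list of per-step log probabilities for each user; the function and variable names are hypothetical:

```python
from math import isinf

def summarize(step_log_probs_by_user):
    """Return (avg, min) log probability per recognized user and the
    fraction of unrecognized users (any step at -infinity)."""
    per_user = {}
    unrecognized = 0
    for user, steps in step_log_probs_by_user.items():
        if any(isinf(p) for p in steps):
            unrecognized += 1  # excluded from the measured values
            continue
        per_user[user] = (sum(steps) / len(steps), min(steps))
    return per_user, unrecognized / len(step_log_probs_by_user)

stats, frac = summarize({"u1": [-0.5, -1.5], "u2": [float("-inf")]})
# u1 -> avg -1.0, worst -1.5; u2 is unrecognized (frac 0.5)
```

Excluding unrecognized users before averaging is what causes the measured values to vary between models, as noted below.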

As the k-value increases, the minimum required length of the sequences increases as well, resulting in many sequences being too short. These users are discarded before the tests are performed, and the percentage of discarded users is reported under “reduced by” in the tables.

Users unrecognized by the Markov Chain model are excluded when considering the log probabilities for the different users. The percentage of users from the different sets of data who are unrecognized can be seen under “unrecognized” in the tables. Note that as users become unrecognized, they no longer affect the measured values. This contributes to more variation in the log probabilities between different models.

Distributions for each test are presented as bar charts showing how many users are in each interval of probabilities. Note that the scales showing the number of users are logarithmic.


4.1 Test 1

Table 1 shows the gathered measurements from Test 1, where the original sequences are used to train the Markov model. Train users and test users have similar values when looking at the averages of the 2 types of measurements, while the average probability for the flagged users is better on average but worse for the worst analyzed subsequence. Looking at a k-value of 7, we can see that roughly 60% of the users in train and test are discarded for having fewer than 8 events in total.

Table 1: Measured values regarding the probability of existing for each user’s sequence on average and for the worst subsequence for each user.

The distribution of users by their average probability is presented in figure 9 for a k-value of 7. Looking at where the still-recognized flagged users fall in the distribution, we can see that they are close to zero, which indicates their high average probability of existing.


Figure 9: Shows the distribution of average probability for each user from the train (dark blue), test (light blue) and flagged (red) group of users.

4.2 Test 2

The results for this test were obtained using 5 states for each event type, where the states represent repetitions of {2^0, 2^1, 2^2, 2^3, 2^4} for the 9 different types of events. Table 2 shows the gathered measurements from Test 2; noticeable is the low number of k-values, caused by the fast increase in complexity of the model. The average probabilities for users in train, test, and flagged are much closer to each other than in the previous test, both for the average probability for each user and for the worst subsection. It is also noticeable that the share of unrecognized flagged users is high for a k-value of 3.


Table 2: Measured values regarding the probability of existing for each user’s sequence on average and for the worst subsequence for each user.

The distribution of users by their average probability is presented for k-value 2 in figure 10 and k-value 3 in figure 11. The purpose of showing multiple distributions is to showcase changes between different k-values. The flagged users that are recognized are further from a high probability of existing than in the previous test.

Figure 10: Shows the distribution of average probability for each user from the train (dark blue), test (light blue) and flagged (red) group of users.


Figure 11: Shows the distribution of average probability for each user from the train (dark blue), test (light blue) and flagged (red) group of users.

4.3 Test 3

Table 3 shows how many times users change which event they are using. A user with 0 changes can still have a varied sequence length but uses only one type of event. A user with 1 change can likewise have a sequence of any length but changes to another event once.

Table 3: Counts how many users in the data have a recorded “change” of which event they are using. A change of 0 means that they only use one type of event and never use any other.


Table 4 shows the gathered measurements from Test 3, where event states and repetition states are separated. The thresholds used for the repetition states in this test are {1, 5, 10, 25, 50, 100, 250} events in succession, rounding the number of repetitions down to the nearest threshold. The average probability for the flagged users is noticeably worse when compared to the corresponding value for train and test users. A large difference in how many users are unrecognized between test and flagged users can also be seen, where the percentage for the test users remains somewhat low.

Table 4: Measured values regarding the probability of existing for each user’s sequence on average and for the worst subsequence for each user.

The distribution of users by their average probability is presented for k-value 4 in figure 12 and k-value 5 in figure 13. The purpose of showing multiple distributions is to showcase changes between different k-values. Flagged users appear spread out in the distribution, with few of them close to 0. More train and test users appear closer to zero than in previous tests.


Figure 12: Shows the distribution of average probability for each user from the train (dark blue), test (light blue) and flagged (red) group of users.

Figure 13: Shows the distribution of average probability for each user from the train (dark blue), test (light blue) and flagged (red) group of users.


5 Discussion

5.1 General

We start by discussing the data used during this project. The structure of the original data changed on multiple occasions, which led us to use a much smaller portion of the data. This raises some questions about the list of flagged users. There are no timestamps for when they were flagged, suggesting that some of them might have been exploiting an old version of the system in a way that is no longer possible, or that the user corrected their behavior after a warning. If this is the case, their inappropriate behavior is missing from the data while they are still marked as flagged. This is a likely problem for this project, given the high activity of flagged users before the selected time window and the fact that 79.5% of the flagged users active during the time window had been active before.

Another indication that the data contains flaws is the user removed in the preprocessing step who used events no other user was using. It becomes hard to decide which behaviors are actually good or bad since no domain expert was present during this project. There is also an imbalance in recorded user activity: some users had extremely long sequences while 18% of the users had a sequence of length 1, which can lead to problems in finding a suitable k-value, as described by Avery and Henderson (1999).

Hoang and Hu (2004) describe in their paper the importance of having clean normal behavior as training data, and this is questionable in this project. It is difficult to know whether the “normal users” are all behaving normally or whether some of them are undetected users that should have been flagged. The consequence of having abnormal users in the training data is that a highly active abnormal user (one with a long sequence) would adjust the Markov Chain model to consider these abnormal transitions between states more likely.

A user being unrecognized by the Markov Chain model does not necessarily mean it is an anomaly, but the difference between the test data and the flagged users could still indicate a difference in their behavior.

5.2 Test 1

Looking at the average log probability for the users, we can see that train and test users have similar values of -0.677 and -0.703. This is not the case when looking at flagged users, whose average value is only -0.126, suggesting that the average flagged user disappears in the crowd. This can be seen more easily in the distribution, where flagged users appear close to zero.
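To illustrate the kind of score discussed here, the following is a minimal sketch of the average log10 probability of a sequence’s transitions under a first-order (k=1) chain. The transition probabilities below are made-up values for illustration, not the ones estimated in this project:

```python
import math

def avg_log_prob(sequence, transitions):
    """Average log10 probability of the transitions in a sequence
    under a first-order Markov chain (k=1).

    `transitions` maps (from_state, to_state) pairs to probabilities.
    """
    logs = [math.log10(transitions[(a, b)])
            for a, b in zip(sequence, sequence[1:])]
    return sum(logs) / len(logs)

# Illustrative probabilities: a strong self-transition for event A.
transitions = {("A", "A"): 0.9, ("A", "B"): 0.1, ("B", "A"): 1.0}
```

With a 0.9 self-transition for A, a purely repetitive sequence such as A-A-A-A scores log10(0.9), roughly -0.046, i.e. close to zero, which mirrors how highly repetitive users can blend in with the crowd under this representation.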

The problem with representing the data in its original sequence is the imbalance between detecting large sections with the same event in succession (which requires a high k-value) and being able to recognize new data. A high k-value increases the number of possible combinations of events, which can cause problems when new data is introduced. We can see in table 1 that flagged users are more often unrecognized than test users, with a difference of 20.64%. A low k-value should make the Markov Chain model consider repetitive behavior very likely, since those cases let the model see a state transitioning to itself many times. This can be seen for k=1 with a difference of 0.814 between the average of flagged users and that of test and train users, whereas for k=7 the difference is 0.577.

5.3 Test 2

Looking at the average probability among the users, we can see that the average values for test users and train users are similar, while the value for flagged users is slightly higher. This is also the case when looking at the average value of the worst probability for the users. This would suggest that this representation is better at recognizing flagged users.

A potential problem is that an event appearing in runs of different lengths can be represented very differently: 15 events of the same type in succession are represented by transitions between 4 states (A8–A4–A2–A1), while 16 events in succession are represented by only one state (A16). States are also disconnected from each other in the sense that A2 and A4 are individual states with no internal relationship to one another.
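The decomposition described above behaves like a greedy power-of-two split of each run length. The following sketch, with hypothetical function and state names that merely follow the report’s A1/A2/A4 notation, illustrates why 15 and 16 repetitions end up represented so differently:

```python
def run_to_states(event, run_length):
    """Decompose a run of identical events into power-of-two states.

    A run of 15 "A" events becomes [A8, A4, A2, A1],
    while a run of 16 becomes the single state [A16].
    """
    states = []
    # Find the largest power of two that fits in the run.
    power = 1
    while power * 2 <= run_length:
        power *= 2
    remaining = run_length
    while remaining > 0:
        # Step the power down until it fits the remainder.
        while power > remaining:
            power //= 2
        states.append(f"{event}{power}")
        remaining -= power
    return states
```

A run length of 2^n compresses to one state, while a length of 2^n - 1 expands to n states, so two nearly identical behaviors produce very different state sequences.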

Having states that represent repetitions of each event increases the total number of states drastically. This causes the models to become fairly complex quickly as the k-value increases. It was noticeable in this project, as the available hardware only managed to train models up to a k-value of 3.

When comparing unrecognized users between the test and flagged groups, we can see a noticeable difference: 82.86% of the flagged users are unrecognized, while 22.59% of the test users are. This suggests a noticeable difference between the behaviors of the two groups.

5.4 Test 3

An improvement of this representation compared to the one in Test 2 is that whenever an event is used, the Markov Chain model goes to the same state representing that event, regardless of how many times the event is used in succession. This reduces the required number of states, allowing larger k-values. The representation also guarantees that the event state after a repetition state is never the same as the event state before it, which reduces the number of possible combinations.
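A minimal sketch of this representation follows, assuming the sequence is first run-length encoded and each run is emitted as an event state followed by a repetition state. The bucket thresholds and state names here are illustrative assumptions; the report only specifies that the smallest repetition threshold used was 1:

```python
from itertools import groupby

def encode_sequence(events, thresholds=(1, 5, 25, 125)):
    """Encode a sequence as alternating event and repetition states.

    Each run of identical events becomes the event itself followed by
    a repetition state naming the smallest threshold bucket that fits,
    so two successive event states are guaranteed to differ.
    """
    states = []
    for event, run in groupby(events):
        length = len(list(run))
        bucket = next((t for t in thresholds if length <= t), None)
        rep = f"rep<={bucket}" if bucket is not None else f"rep>{thresholds[-1]}"
        states.extend([event, rep])
    return states
```

For example, the sequence A-A-A-B would become the states A, rep<=5, B, rep<=1 under these assumed buckets: the run length is preserved coarsely while the event states themselves never repeat back to back.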

This representation requires the sequences to “change” which event is being used. Table 3 makes it very clear that 53.46% of the users in the data changed the type of event they used only once. As a result, many users are discarded when models with higher k-values are trained.

This representation seems more suitable when users are encouraged to do different things in the system, so that “changes” in which events are used are recorded.

When comparing unrecognized users between the test and flagged groups, we can see a noticeable difference: 74.29% of the flagged users are unrecognized, while 9.22% of the test users are. Similar to Test 2, this suggests a noticeable difference between the behaviors of the two groups. Compared to Test 2, flagged users in this test are less often unrecognized, but the difference between test and flagged users is greater, which suggests a better understanding of the differences between the two groups.

5.5 Conclusion

This project concludes that how the data is represented is of great importance to how models such as the Markov Chain perceive it when analyzing sequences of a repetitive nature. The project presents promising results for ways of representing such data for analysis.

Of the representations presented in this report, the one in Test 3 indicates the best capability to capture the behavior of repetitive sequences, as it seemingly detects a larger difference in behavior between test users and flagged users than the other two tests. With the promising results presented here, a system that encourages more uniform user activity, and the inclusion of a domain expert for evaluation, the representation in Test 3 has the potential to be used for anomaly detection with future development.

Currently, better data is needed to evaluate this work’s true impact. With correctly labeled data, it would be clear whether the unrecognized users are indeed anomalies and whether the unrecognized test users are in fact anomalies or should be considered false positives. With better labeling, the unrecognized users could also be compared with the work of Boldt et al. (2020) with absorbing states.

6 Contribution

This project highlights a seemingly new problem concerning repetitive sequences for Markov Chains and presents two ways of representing the data that capture the length of repetitive sections while preserving the information about which events are used. The different tests, especially Test 3, provide promising results that could be of interest to others facing similar problems when analyzing repetitive user behaviors.

This report contributes insights about the data intended for Insert Coin AB, as well as results from this study on how user activity can be represented to seemingly better capture repetitive sequences for future analysis, such as anomaly detection.

Hoang and Hu (2004) mention the problem of false alarms caused by the difficulty of building a database of all possible scenarios. The representation presented in Test 3 has the potential to reduce the number of possible scenarios for higher k-values by compressing sequences through an alternative representation. If the “normal behavior” is captured well, this representation can consider transitions unknown to the Markov Chain as anomalies, similarly to the absorbing-states solution by Boldt et al. (2020). More testing needs to be done before unrecognized users can confidently be considered anomalies.

Further tests, conducted in a more controlled environment with more complete data, are needed before any significant importance to the community can be concluded.


7 Future work

Of the representations presented in this project, Test 3 seems the most promising for future work and development, mainly because of its large difference in unrecognized users between the test and flagged groups and its low number of internal states. One modification to this approach would be to remove the repetition threshold of 1 and make “5 or less” the smallest threshold, enabling higher thresholds instead. The difference between doing the same thing 250 or 500 times in a row can be argued to be more important to represent than the difference between doing an event 1 or 4 times in a row.

This project could be further developed by excluding, from the current 39 flagged users, those who were active before the current time window, in an attempt to counteract the mentioned problem of not knowing when the users became flagged. This would leave only 8 flagged users compared to the ~45 000 normal users, but it would still be interesting to investigate.

The current version of the gamification system might not be best suited for gathering this kind of data. It appears that users do not need to use it if they do not want to; they might try it a few times and never use it again, which can explain why so many users have such low recorded activity. A user base with higher activity would be advantageous for capturing “normal behavior” and would give a more balanced dataset for easier analysis.

Dividing the sequences based on time would also be a potential improvement to this project.

Currently, the order in which events are done is preserved and represented by the sequence, but there is no representation of how active the users are: there is no difference between a user that does 100 events in one day and one that does 1 event over 100 days. Because of this, the small time window of 2.5 months would suggest a smaller impact on this project. This becomes less of an issue when simpler systems reject users that do things “too fast”, but the phenomenon of users being extremely active has been observed during this project, suggesting that no such system is in use. Combining multiple Markov Chain models to mutually decide whether or not a user is an anomaly, inspired by the work done by Boldt et al. (2020), would also be an interesting addition to this project.


References

Avery, P.J., Henderson, D.A., 1999. Fitting Markov chain models to discrete state series such as DNA sequences. J. R. Stat. Soc. Ser. C Appl. Stat. 48, 53–61. https://doi.org/10.1111/1467-9876.00139

Boldt, M., Borg, A., Ickin, S., Gustafsson, J., 2020. Anomaly detection of event sequences using multiple temporal resolutions and Markov chains. Knowl. Inf. Syst. 62, 669–686. https://doi.org/10.1007/s10115-019-01365-y

Deterding, S., Dixon, D., Khaled, R., Nacke, L., 2011. From Game Design Elements to Gamefulness: Defining Gamification. In: Proceedings of the 15th International Academic MindTrek Conference: Envisioning Future Media Environments, MindTrek 2011, pp. 9–15. https://doi.org/10.1145/2181037.2181040

Fabien, M., 2019. Markov Chains and HMMs: Main concepts, properties, and applications. Towards Data Science. https://towardsdatascience.com/markov-chains-and-hmms-ceaf2c854788 [accessed 5.14.20].

Hamari, J., Koivisto, J., Sarsa, H., 2014. Does Gamification Work? A Literature Review of Empirical Studies on Gamification. In: 2014 47th Hawaii International Conference on System Sciences (HICSS), IEEE, Waikoloa, HI, pp. 3025–3034. https://doi.org/10.1109/HICSS.2014.377

Hoang, X.A., Hu, J., 2004. An efficient hidden Markov model training scheme for anomaly intrusion detection of server applications based on system calls. In: Proceedings of the 2004 12th IEEE International Conference on Networks (ICON 2004), IEEE, Singapore, pp. 470–474. https://doi.org/10.1109/ICON.2004.1409210

Insert Coin - We Gamify The World, 2012. https://insertcoin.se/ [accessed 5.28.20].

Meyn, S., Tweedie, R., 2012. Markov Chains and Stochastic Stability. Springer Science & Business Media.

Schreiber, J., 2016. Pomegranate. GitHub repository. https://github.com/jmschrei/pomegranate

Terrill, B., 2008. My Coverage of Lobby of the Social Gaming Summit. http://www.bretterrill.com/2008/06/my-coverage-of-lobby-of-social-gaming.html [accessed 4.14.20].

Wong, K.I., Wong, S.C., Bell, M.G.H., Yang, H., 2005. Modeling the bilateral micro-searching behavior for urban taxi services using the absorbing Markov chain approach. J. Adv. Transp. 39, 81–104. https://doi.org/10.1002/atr.5670390107

Yang, H., Chao, A., 2005. Modeling animals’ behavioral response by Markov chain models for capture–recapture experiments. Biometrics, 1010–1017.

Ye, N., 2000. A Markov Chain Model of Temporal Behavior for Anomaly Detection.

Wohlin, C., Runeson, P., Höst, M., Ohlsson, M.C., Regnell, B., Wesslén, A., 2012. Experimentation in Software Engineering. Springer Science & Business Media.


Appendix 1
