Information Engineering with E-Learning Datasets

STEVEN A. GERICK

Master in Machine Learning (TMAIM)
Date: September 17, 2019

Supervisor: Johan Gustavsson, György Dán
Examiner: Olov Engwall

School: Electrical Engineering and Computer Science (EECS)
Host Company: Sana Labs AB

Swedish Title: Informationsutveckling inom E-Lärande


Abstract

The rapid growth of the E-learning industry necessitates a streamlined process for identifying actionable information in the user databases maintained by E-learning companies. This paper applies several traditional mathematical and machine learning techniques to one such dataset with the goal of identifying patterns in user proficiency that are not readily apparent from simply viewing the data. We also analyze the applicability of such methods to the dataset in question and to datasets like it. We find that many of the methods can reveal useful insights into the dataset, even when some methods are limited by the database structure and even when the database places fundamental limits on the fraction of variance that can be explained. Such methods are much more applicable when dataset records carry clear timestamps and student grades have fine resolution. Finally, we suggest several changes to the way data is gathered and recorded in order to make the mass application of machine learning techniques feasible for more datasets.


Sammanfattning

Rapid growth in the E-learning industry makes fast and generalizable methods for information engineering with E-learning databases necessary. This work applies several traditional mathematical and machine learning methods to one such database in order to identify patterns in user proficiency that cannot easily be discovered by reading through the database. The work also analyzes the generalizability of the methods: where they can be used, their drawbacks, and what a database must satisfy for the methods to be readily applied. We find that many of the methods can reveal structures and patterns in the database even though the methods are limited in effectiveness and generalizability. The methods are also easier to apply when the database records are associated with clear timestamps and student grades have high resolution. We propose changes to the data collection process that could simplify parallelizable, large-scale application of machine learning methods to many databases simultaneously.


Contents

1 Introduction 1

1.0.1 The Research Question . . . 3

1.0.2 Conditions and Limitations . . . 4

2 Background 5

2.1 Candidate Methods and Prior Work . . . 5

2.2 Dataset Issues to Consider . . . 11

2.3 Selected Methods for this Work . . . 13

3 Methodology 16

3.1 Database details . . . 16

3.2 Research Design . . . 16

3.3 Dataset Limitations: Assessing Impact . . . 18

3.3.1 Score Clipping . . . 18

3.3.2 Students with Multiple Exam Times . . . 24

3.3.3 Users with No Activity . . . 30

3.4 Implementation . . . 30

3.4.1 Feature Selection . . . 30

3.4.2 Latent Factor Analysis . . . 32

3.4.3 Linear Classification . . . 33

3.4.4 SVM . . . 33

3.4.5 Linear Regression . . . 34

3.4.6 SVR . . . 34

4 Results 36

4.1 Analysis of Dataset Limitations . . . 36

4.1.1 Score clipping . . . 36

4.1.2 Students with Multiple Exam Times . . . 37

4.2 Results for Machine Learning Methods . . . 38

4.2.1 How to read the results . . . 38


4.2.2 Latent Factor Analysis . . . 39

4.2.3 Linear Classification . . . 45

4.2.4 SVM . . . 46

4.2.5 Linear Regression . . . 47

4.2.6 SVR . . . 48

5 Discussion 50

5.1 Analysis . . . 50

5.1.1 Score Clipping . . . 50

5.1.2 Students with Multiple Exam Times . . . 52

5.1.3 Latent Factor Analysis . . . 53

5.1.4 Linear Classification . . . 53

5.1.5 SVM . . . 55

5.1.6 Linear Regression . . . 55

5.1.7 SVR . . . 57

5.1.8 Summary of Classifier and Regressor Methods . . . . 57

5.2 Implications . . . 59

5.3 Future work . . . 61

5.4 Ethics . . . 63

6 Conclusion 65

Bibliography 66


Chapter 1 Introduction

Recent years have seen rapid growth in the E-learning market, a market which primarily provides online tools and applications meant to replace or supplement at least some of the educational activities that traditionally take place in schools and corporate workshops. Common offerings of companies in this market include mobile learning apps, application simulation tools, podcasts, and virtual classrooms. The industry was worth approximately 162 billion USD in 2015 and is expected to reach a value of approximately 325 billion USD by 2025 [1]. Some notable examples of companies in this industry include:

• Language learning software Rosetta Stone, whose parent company was worth 130 million USD at the end of 2013. The company dropped to negative net worth in 2018, mostly due to being outcompeted by other software such as Duolingo [2].

• Language learning website and app Duolingo, worth 700 million USD in 2017 [3].

• Online Courses collection Lynda, purchased for 1.5 billion USD in 2015 and since rebranded as LinkedIn Learning [4].

Despite major disruptions in the sector due to competition and the rapid introduction of new software paradigms for E-learning, the industry continues to balloon in value, with some companies such as Duolingo achieving almost brand-name status. The industry thus promises to produce significant new business value over at least the next decade while simultaneously revolutionizing the learning environment for end users, increasing both the accessibility and the quality of educational materials available to a general audience.


More recently, some companies have begun incorporating machine learning into the way their platforms interact with students; Duolingo, for example, conducts research into modelling the kinds of errors students make and the kinds of challenges that increase student engagement (the time students spend on the platform) [5]. Research such as that conducted at Duolingo shows that machine learning techniques can be used to better understand users, to provide user-specific feedback, and to increase student engagement and proficiency (correctness of student responses to questions).

Although a 2017 survey found that almost all businesses gather data to improve their business operations, only 39% were using a big data solution to gain business insights, with 23% using some other solution and 38% having no solution for insights at all [6]. Companies using non-big-data solutions, or no solution at all, stand to gain significant business benefits and improved user satisfaction from the development of big data analytics, especially in an environment like online learning where nearly every detail of a user's interaction with the product can be recorded.

Previous research in learning theory has established the effectiveness of several different methods for predicting user proficiency and user engagement.

Here, user proficiency is primarily defined by scores on questions and/or exams, and user engagement is primarily defined either as average daily activity or as the average time a new student stays active before leaving the learning environment permanently. Predicting these metrics allows companies to offer "personalized learning" as part of their services, where the product models each student's personal habits and takes individual courses of action to maximize their learning potential or their time spent using the product. For example, Lindsey et al. found that personalizing the choice of study material recommended to students dramatically increased the length of time users could reliably recall study materials [7]. Some of the methods used in previous work are listed below in section 2.1. Realistically, many datasets cannot support some or even most of these methods, depending on the size and scope of the dataset at hand.

With so many businesses gathering data but not utilizing it, and with the developers of big data solutions having finite resources, there is substantial motivation to develop methods for identifying which businesses can benefit the most from the deployment of big data tools, without the lengthy financial cost and development cycles required for the development of a fully functional analytics tool.

The goal of this thesis is to conduct a preliminary investigation into a dataset collected by one such company and hosted at a machine learning service provider (Sana Labs AB), identifying both the database features that contribute positively to the strength of machine learning models and the features that detract from it. By isolating these features in mock datasets and evaluating the accuracy of various machine learning models of the dataset, components of the dataset can be evaluated and their information content inferred. Importantly, we do not choose the dataset we work with based on how large it is or how professional the database schema is; we instead work with a smaller dataset from a company, one that was not designed for machine learning analysis. We anticipate that such datasets are more likely to cause problems for machine learning techniques, and are therefore more likely to inspire concrete suggestions for what such companies need to change about their database models in order to integrate machine learning analysis easily. Sana Labs primarily provides AI services to education platforms, and hopes to use the work from this thesis to simplify the identification of potential businesses to partner with.

1.0.1 The Research Question

In this paper, we seek to answer the question: "Which machine learning algorithms provide the most accurate predictions of a user's grade in online courses, and which design paradigms present in datasets used for these purposes impede the accuracy and/or mass application of these predictions?"

We answer this through analysis of both the accuracy of several machine learning methods and the estimated impact of adverse database paradigms on that accuracy, performed on our sample dataset and generalized into recommendations for datasets in this field as a whole. Ultimately, we also wish to investigate methods for predicting how users behave in such datasets, but in this work the focus on predicting user behavior is mostly limited to identifying methods that can be used to that end in future work. That is, we focus mainly on user proficiency (i.e. the user's final scores). The research question seeks to qualify and quantify the accuracy of various machine learning techniques in order to determine which database formats can easily be modelled, to suggest improvements in database format, and to streamline the preprocessing of data for use in machine learning tools.


1.0.2 Conditions and Limitations

This thesis primarily seeks to innovate through evaluating the issues surrounding the application of various machine learning methods to datasets that were defined and collected without the intent of immediate data analysis. Before beginning, we expected the dataset evaluated in this thesis to have significant flaws in its schema that would complicate attempts to analyze it with machine learning methods and force us to exclude certain analysis methods entirely. Qualifying the impact of these flaws in a systematic manner can expand the scope of datasets considered for machine learning analysis without forcing the company to define and gather a new database.

The ultimate goal – which this thesis only lays the foundations for – is that a dataset can be fed into a series of models, some of which may use results from previous tools in the chain. Both the applicable models and the desired output metrics depend on the type and amount of data available. Methods selected here should be able to be incorporated into further tools, either as features or as supplementary predictors, but the work in this thesis is intended merely to provide concrete steps for such future work, not actual design of future methods.

The dataset investigated in this paper comes from one of Sana Labs' partners, which provides online courses for a standardized certification (hereafter "the partner"). For confidentiality reasons we cannot reveal the identity of the partner. The dataset consists primarily of records of user interactions with questions in the study forms, as well as the final grades the users received on the exam. User interactions consist of entries containing a user ID, question ID, answer, score, time spent, and timestamp. Additional tables provide the score each user received on the exam and the module IDs for each question. Other datasets typically contain at minimum user IDs, question IDs, answers, scores on questions, and timestamps, all of which are found in this dataset as well. The dataset analysed here is unusual in that it has a clearly defined final score (the one received on the exam); most other online learning courses have no such fixed goal and do not build toward a single final exam that tests what users have learned.

The partner's goal with each student is profit, and it benefits from methods that accurately predict which students will fail, since these help minimize the number of failing users and the money spent offering such students additional resources. The student's immediate goal with the service is to pass the exam and receive their certification, and they benefit from the same methods through personalized recommendations on how to maximize their chances of passing the exam.


Chapter 2 Background

This chapter primarily covers the state of the art in the field of e-learning and other relevant fields, such as traditional learning, recommender systems, and machine learning at large. We also address specific considerations for applying the state of the art to our dataset (which may not satisfy the requirements for some methods), and present the final selection of methods that we apply to predict user behavior and exam scores in our dataset.

2.1 Candidate Methods and Prior Work

For this work, we seek to apply several different methods and compare the practicalities of preprocessing, parameter selection, and postprocessing. We also seek to qualify which aspects of the database make any of these steps difficult to define or hard to transfer to other datasets. In general, this means that we prefer methods that give easily interpretable results and can be compared to each other. The following is a list of candidate machine learning methods that have seen use in multiple fields. Each is accompanied by a short description of its operation, some examples of applications, and a justification for why the method is or is not applied in this work according to the above criteria. Due to the sparsity of literature within the field of E-learning, most of these methods are justified not on the grounds of prior success in similar work, but on the grounds of success across a wide variety of unrelated fields.

Spaced repetition Beginning with Hermann Ebbinghaus in 1880-1885 and formalized through subsequent replication [8], spaced repetition predicts that the likelihood a user answers a question correctly decays roughly exponentially with time, with the decay rate decreasing with the number of times the user has attempted or reviewed the problem. This method can be applied to any dataset where concrete times are available for when the student studies material multiple times, and can be used to predict the likelihood that the next attempt results in a correct answer. It can also be used as a model for suggesting when and what content users should review. Bower and Rutson-Griffiths have found that spaced repetition is effective for the Test of English for International Communication (TOEIC) exam, with the number of repetitions positively correlated with exam score [9]. Schimanke et al. provide an overview of the neurological mechanisms that support the use of spaced repetition, the most important being the need for multiple distinct and separated review sessions and the need for sleep [10]. Works in this field vary wildly in the precise choice of learning curve and the way repetition is integrated into the final product; for example, Schimanke, Mertens, and Vornberger found that mobile games typically cannot use the classical spaced repetition approach and must adapt the timescales to their individual products [11]. Murre and Dros provide an overview of the different equations used in the field as well as their goodness of fit to replication data [8].

Our dataset contains a list of user interactions with precise timestamps, so we could analyze the probability of a user producing the correct answer using spaced repetition models. If accurate, such an analysis could be used as a model for recommending interventions that help users answer questions more accurately and in turn learn the material better, or for dynamically adjusting the difficulty of lessons. Developing a spaced repetition model requires careful parameter selection, including the choice of equation to model user memory, the model's parameters, the selection of relevant data, and the choice of evaluation metric. We therefore chose not to apply spaced repetition modelling in this work, where other options give quicker and possibly more accurate results, and reserve it for future work.
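For illustration only (no such model is fitted in this work), a minimal Ebbinghaus-style forgetting curve of the kind used in the spaced repetition literature could look like the following Python sketch; the initial half-life and the half-life growth factor are hypothetical parameters, not values estimated from any dataset.

    import math

    def recall_probability(elapsed_days, n_reviews,
                           initial_half_life=1.0, growth=2.0):
        """Toy forgetting curve: recall probability decays exponentially
        with time, and each completed review lengthens the half-life.
        All parameter values are illustrative, not fitted."""
        half_life = initial_half_life * growth ** n_reviews
        return 0.5 ** (elapsed_days / half_life)

    # Example: probability of a correct answer 3 days after the 2nd review
    print(recall_probability(elapsed_days=3, n_reviews=2))  # about 0.59

In a real model the two parameters (and possibly the functional form itself) would have to be chosen and fitted per dataset, which is exactly the parameter-selection burden cited above for deferring this method.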

A/B Testing Traditional A/B testing requires deploying two products to subsets of the user population: the current version of the product, "A", and a test version, "B". Typically, B is offered to a small randomized subset of the user base, and after a period of data gathering, user choices under the two products are compared to decide whether B improves user behavior according to the developer's goals. This process is typically not mathematically complex; with sufficient time and a large user base, the two groups of users will either differ in the target metric (e.g. proficiency or time spent on the platform) in a statistically significant way or not. In the former case, the product version with the superior target metric is used to develop another test version or is deployed as the new A product.

The selection, deployment, and maintenance of a separate B product can often be expensive and logistically difficult, especially when the simultaneous use of two versions of the product raises ethical questions about product quality or when the two products require separate technical support procedures. There are, however, methodologies that seek to reduce the burden of testing and deployment, such as automated testing of modular, genetically evolving tests [12], and offline A/B tests, described in the next item in this list.

Creating a live A/B testing model is considered a last resort for the analysis methods discussed in this paper; the express intent of this work is to avoid such expensive and time-consuming procedures by finding ways to analyze the user database rather than intervening live, as described in the introduction.

Offline A/B Testing Offline A/B testing is essentially equivalent to A/B testing in nature, but does not involve actually deploying a B product; instead it infers B's effect from existing data gathered under the A product. If the only difference between the A and B products is in the likelihood of taking certain actions, and the A product is capable of making the same recommendations as the B product (albeit with different probability), the B product can be evaluated using only the data gathered under the A policy. The evaluation effectively reweights the observed data so that data points that would be more likely under B than under A are given higher weights. When the data gathered under A is sufficient, the predicted benefit of the B policy can differ significantly from that of the A policy. This can save significant financial and labor resources compared to deploying several possibly inferior B products, though results are only comparable to online A/B testing when a large amount of data is gathered and the A policy is somewhat nondeterministic. The amount of data required can become very large if the number of possible recommendations is large, creating a severe bias-variance tradeoff that requires context-dependent mitigation techniques [13]. These mitigation techniques include, but are not limited to, segregating users into separately weighted groups, dynamically reweighting individual users, and various resampling techniques.


Our dataset of about 900 users studying over a thousand questions is far too small to have significant overlap in user behavior, and thus offline A/B testing is expected to have unacceptably high bias and/or variance if applied here. Therefore, we do not attempt to apply offline A/B tests in this work.
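For concreteness, the reweighting idea behind offline A/B evaluation can be written as a small inverse propensity estimate; the logged rewards and the action probabilities below are hypothetical illustrations, not values drawn from the partner dataset.

    import numpy as np

    def offline_ab_estimate(rewards, prob_a, prob_b):
        """Self-normalized inverse propensity estimate of the mean reward
        policy B would have obtained, using only data logged under policy A.
        rewards[i]: observed outcome of the i-th logged interaction
        prob_a[i]:  probability that A takes the logged action
        prob_b[i]:  probability that B would take the same action"""
        weights = prob_b / prob_a            # upweight events more likely under B
        return np.sum(weights * rewards) / np.sum(weights)

    # Hypothetical log gathered under policy A
    rewards = np.array([1.0, 0.0, 1.0, 1.0])
    prob_a = np.array([0.5, 0.3, 0.5, 0.2])
    prob_b = np.array([0.7, 0.1, 0.7, 0.2])
    print(offline_ab_estimate(rewards, prob_a, prob_b))

The variance of such an estimate grows quickly when prob_a is small for actions that B favors, which is the bias-variance tradeoff referred to above and the reason this method is not applied to our small dataset.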

Challenge personalization Personalizing the difficulty level can improve user engagement and learning by finding a "sweet spot" between a challenging difficulty level that encourages the user to learn and an easy difficulty level that keeps user self-esteem and interest from dropping so low that they stop using the product. This typically involves dynamically adjusting the difficulty, increasing it when the user performs too well and decreasing it when the user performs too poorly according to some thresholds. Although some prior works have found that user engagement is maximized when task difficulty is neither too easy nor too difficult [14], later replication studies have shown that this is not always the case [15][16], and that users may in fact prefer problems with a challenge of restricted scope [17].

In our dataset, such models could serve as a metric for predicting user engagement, specifically as a predictor from feedback on questions to the length of time between review sessions. If applied, we would expect at least to be able to make users study material more comprehensively and more often. However, this work primarily focuses on modelling the final user scores, where this metric is less useful. We therefore leave this analysis for future related work, due to the non-straightforward nature of defining a "challenge" for the user given the low overlap in user behavior in our database and the existence of external study materials (both of which should make an analysis of challenge less informative).

Regression methods When the metric of interest is a clear final number, such as a student's score on a final test, traditional or machine learning regression methods can be applied to model this metric as a function of predictors constructed from the user data. Particular candidates include traditional linear regression, support vector regression (SVR), and neural networks, all of which differ in their approach to optimizing the prediction and all of which are discussed separately in section 2.3. All such methods attempt either to minimize the prediction error or to maximize the fraction of variance explained. Regression methods have a long and varied history owing to the variety of methods employed and the simplicity of the earlier ones. Classical regression is a very old technique with correspondingly varied applications, such as robotics [18], satellite image processing [19], soil science [20], and of course E-learning [21].

Both the linear and support vector regression models are applied in this work in an attempt to model user scores after defining a suitable vector representation of users. This representation and the mode of operation of these algorithms are discussed in more detail in the final selection of methods. We recommend that neural networks be applied in a later work and added to the suite used in this paper.

Classification methods Similarly to the regression methods above, if the dataset has a clear final metric taking one of a set of classes (e.g. 'failed', 'passed'), linear classification, support vector machines (SVMs), decision trees, and again neural networks can be applied. These methods tend to be the most straightforward to apply, as they require comparatively little data preprocessing. Some works, such as that by Chen, have applied several of these methods to the same data to compare their efficacy simultaneously [22]. All of these methods operate differently and can be measured according to several measures of performance mostly independently of the choice of model, including accuracy, specificity, sensitivity, and the receiver operating characteristic (ROC).

Both linear classification and SVMs are explained and applied in this work to attempt to predict whether users failed or passed the exam, with decision trees and neural network classification models left for a future work due to time constraints.

Time Series Analysis Methods Models such as a recurrent neural network (RNN) can operate directly on datasets presented as a stream of user events and predict user behavior at any point, whereas other methods typically require a static representation of user behavior. This means that RNNs see many applications to time-series-like data, such as blood glucose levels over the course of a day [23], developments in the size of cracks in alloys [24], and even the mimicry of user behaviors [25].

When the observations are irregularly spaced, a magnitude-to-frequency transform (such as the Fourier or Haar transform) must first be applied to the data before it is used as input to a neural network [26]. As with other neural networks, selecting hyperparameters for RNNs can be a time-consuming, context-dependent process.

Because our dataset contains timestamps for user events, we expect that much of the user behavior is hidden in the time components, and the best model for user behavior is likely some form of time series analysis. However, since designing time series analysis methods using neural networks can be complex and time consuming and our data is not regularly spaced, we opt not to apply any time series methods in this work. We expect the design of features and the selection of a network architecture to take significant time while offering only minor improvements over the accuracy of other methods. In other words, we begin with other methods, which are expected to return acceptable (although not optimal) results with much less time spent on algorithm design, and leave time series analysis for a future investigation.

Latent Factor Analysis This method is also known as principal component analysis (PCA). In cases where feature selection is trivial (e.g. the data is already present in vectorized form), correlations can be extracted between features, between users, and in the feature-user space using matrix singular value decomposition. Latent factor analysis is primarily used in recommender systems, such as the work of Wang et al. [27], but has also been used, mostly under the name principal component analysis, for e.g. modelling diets [28] and detecting fruit ripeness [29].

We use latent factor analysis in this work to find correlations between users and questions; these correlations can be considered abstract "concepts" that represent underlying hidden themes in the data. If successful, we may be able to explain what some of these components represent in both the question space and the user space, and possibly use this to infer which aspects of the user interactions are particularly important for predicting user success.

It is important to note that despite the abundance of candidate methods, relatively few seem to have been systematically applied to the field of E-learning, with organized research such as that conducted by Duolingo [5] being a major exception. The field of E-learning also does not appear to be heavily researched: a quick search combining "machine learning" with another field keyword in KTH's Primo literature indexing service shows that published literature on machine learning and e-learning is similar in volume to machine learning combined with meteorology, pharmaceuticals, or chemical engineering, as shown in table 2.1. We thus mostly have to draw inspiration from other fields when selecting which methods to use, as we cannot motivate most of the above candidate methods based on precedent specifically within E-learning.

2nd Search Term Results

"robotics" 15128

"taxonomy" 11480

"finance" 7874

"proteomics" 7033

"sociology" 5668

"systems engineering" 4948

"nutrition" 3546

"astronomy" 3133

"geology" 3234

eLearning/"e-learning" 2464

"pharmaceuticals" 2401

"meteorology" 2359

"chemical engineering" 2203

"analytical chemistry" 1111

"aquaculture" 603

Table 2.1: Some examples of number of search results for articles in machine learning in various fields

2.2 Dataset Issues to Consider

Two significant issues with this dataset prevent the application of some of the candidate methods, and contributed to the final selection of methods to be used in this thesis. We also have a third issue that guides our analysis later in this work.

1. The first issue is that no timestamps are provided for the users' exam scores. This prevents, for example, the construction of score predictors based on the time between user review sessions and the final exam. As shown in our later analysis of this subject in section 4.1.2, plots of user activity indicate that many users have taken the exam multiple times and that most users take the exam on different dates; reconstructing approximate exam dates is feasible, but with no way to assess the accuracy of these reconstructions, any analysis using them should avoid drawing firm conclusions.


2. The second issue is that although the exams are scored on a scale from 0 to 100, none of the 934 students in the dataset has a reported score that is both below the passing score and greater than zero. That is, all failing students have their scores reported as zero. This thresholding or "hiding" of scores for failing students makes regression of student scores much more difficult, as two students with nearly identical study habits scoring just below and just above the passing threshold are reported with massively different scores. Analysis later in this work shows the expected impact of this issue on regression techniques.

3. The third issue is that many users have access to supplementary study materials. As an extreme case, among the users who have no activity at all in the database, some passed the exam and some did not. With no activity in the database, we cannot possibly differentiate between a passing and a failing user, and this places an upper limit on the accuracy, specificity, and sensitivity of any classification method we may apply, even assuming these are the only users that cannot be accurately modelled. Specifically, about 20 percent of the failing users and about 40 percent of the passing users have no activity in the database, or about 35 percent of all users (a breakdown of user event counts is shown in figure 2.1). If we assume that all students with activity, and only those students, can be correctly classified, the optimal classifier would correctly classify every user with activity and guess randomly for the rest. The optimal sensitivity is then all of the 60 percent of passing users with activity plus half of the 40 percent without, i.e. 80 percent; the optimal specificity is all of the 80 percent of failing users with activity plus half of the 20 percent without, i.e. 90 percent; and the optimal accuracy is all of the 65 percent of users with activity plus half of the 35 percent without, i.e. 82.5 percent. To avoid the difficulty of analyzing and optimizing these metrics when the best possible score for an optimal classifier is less than 1, we simply ignore users with no activity when applying machine learning techniques to predict their score. Note, however, that many users remain with only minimal activity in the system, such as viewing questions two or three times. We expect that even after ignoring the users with no activity, many users with minimal activity still make perfectly accurate classification inherently impossible regardless of the choice of model and features.

Figure 2.1: Plot of the logarithm of the number of user events versus the number of users. Users with no activity are more likely to pass than users in general, implying that they are using external study materials. About one third of students have no activity at all; we choose not to fit the y axis to their count because it would nearly flatten the rest of the graph, and because we ignore students with no activity in our analyses.

2.3 Selected Methods for this Work

Although many of the methods in the preceding list can theoretically be applied to the dataset we are working with, some of them have mutually exclusive requirements for preprocessing and parameter selection. Most also differ qualitatively in which aspects of the dataset they attempt to model, and require separate analysis methods. As an exploratory thesis, this work restricts itself to the latent factor model and a subset of regression and classification methods, with the other applicable methods left to a later work.


The following methods are selected from the previous list to be used in this work to analyze user behavior. As discussed in their corresponding sections in the above list, we expect all of these methods to be applicable to the dataset to be used, though we take an agnostic approach as to whether we expect them to perform well given the dataset issues.

Latent Factor Analysis In a dataset like the one in this work, one simple matrix to define for analysis is a user by question-answered matrix, with the resulting components explaining user-user correlation, question-question correlation, and orthogonal components of these correlations. This method was selected due to its success in the literature at finding hidden correlations between items in other databases without the need to define features (i.e. it is a form of unsupervised learning). More information can be found in the preceding method overview section.

Linear Classification Linear classification models have seen application in, for example, text classification [30] and biochemical receptor classification [31]. Our dataset's final target metric can be bucketed into the two classes "passing" and "failing", so the task can easily be stated as a classification problem. Our features are, however, not straightforward to design, as we are working with irregularly spaced time series observations, and preprocessing is required to engineer the features for classification. It is important that the features provide enough information to differentiate between the classes, without using so many features that the model overfits the training data. This method was selected due to its long history as a tried, proven, and simple method that can provide a baseline.

SVM SVMs have seen application in, for example, landslide prediction [32] and cancer gene identification [33]. An SVM would require the same preprocessing as linear classification and similarly output a prediction of pass or fail. Unlike linear classification, using an SVM requires selecting a kernel function as a distance metric between points, and some kernels also require careful selection of a scaling constant. This method was selected for its ability to accurately handle non-linear features (unlike linear classification), its ability to ignore outliers, and because Chen's work found SVMs to perform better than linear or logistic regression and simple neural networks [22].

Linear Regression Using the same preprocessed features as the linear classification or SVM system, we can attempt to model users' actual scores instead of whether they pass or fail, and then perform the classification in post-processing by checking whether a predicted score lies above or below the passing score. This method has the additional advantage that its main processing stage yields more intuitive predictor strengths; if accurate, it tells us exactly how quickly the score is predicted to change with a given change in a user metric (i.e. the gradient). This method was selected for similar reasons as linear classification: it is an older method with closed-form solutions that require no parameter tuning, and can thus provide a baseline for other regression methods and possibly for classification as well.

Support Vector Regression SVR is a slight modification of support vector machine classification, and has been used in polyp detection [34] and image reconstruction [35]. SVR would use the same preprocessed features and regress to user scores using a machine similar to an SVM; slightly different constraints mean that SVR minimizes the deviation of a small subset of points from the prediction instead of maximizing the distance of a small subset of points from a margin. Like SVM, SVR requires the selection of a kernel function as a distance metric, some of which require an additional scaling parameter. SVR was chosen for this work for its ability to accurately regress non-linear data (unlike linear regression) and its ability to ignore outliers.

Additionally, random forests and artificial neural networks were considered for this work and are within the scope of the existing preprocessing, but were excluded due to time restrictions; we leave these to a future work. We justify the future investigation of random forests with literature precedent in image segmentation [36], liver fibrosis detection [37], and peptide identification [38]. Future investigation of neural networks is justified by literature precedent in image segmentation [39] and construction cost prediction [40] (for regression), and diabetes classification [41], sentiment analysis [42], and pollen classification [43] (for classification).

The actual mathematics behind each of these methods, along with any accompanying pre- or post-processing required for their application, is explained in the corresponding section of the methodology chapter.


Chapter 3

Methodology

This chapter describes how we attempt to quantify the impact of the dataset schema issues, followed by a breakdown of the specifics of how the methods selected in the previous chapter are applied and how data is preprocessed for them.

3.1 Database details

Three datasets within one database were used for this work: a user dataset, a question dataset, and a user-question (event) dataset.

The user dataset consists of entries containing a user ID and reported score.

The event dataset consists of entries containing a user ID, question ID, answer (-1 if not answered, 0 through 4 if answered), grade (-1 if not graded, 0 or 1 if wrong or correct, respectively), time spent attempting the problem, and timestamp for the attempt. The question dataset consists of entries containing a question ID and module ID.

Some general statistics on the database are given in table 3.1. Some statistics are censored to a small number of significant figures to respect the privacy of the partner.

Statistic                                              Value
Number of users                                        ≈34000
  of which with a reported score                       934
  of which with activity                               603
  of which passed                                      ≈830
  of which failed                                      ≈105
Number of attempts (users with score)                  ≈1500000
Minimum timestamp (users with score)                   ≈2017-01-18
Maximum timestamp (users with score)                   ≈2018-05-04
Average time spent on questions (users with score)     ≈45 s
Number of modules                                      39
Number of questions                                    1350
Average attempts per user with score                   1616.8
Fraction of attempts answered per user with score      ≈0.70
Fraction of answers correct per user with score        ≈0.72

Table 3.1: Sample statistics from the datasets used in this work

3.2 Research Design

As discussed in the background, our goal is to quantify how accurate the various applied machine learning techniques are, and to quantify the impact of aspects of poor database design in our dataset. For the dataset we are working with from the partner, the three major design flaws described in section 2.2 are evident before we even begin applying analysis methods, as they complicate the formulation of a problem definition, the choice of preprocessing techniques, and the choice of analysis methodology. Dataset flaw analysis is therefore split into its own section (section 3.3) ahead of the rest, because these issues around problem formulation must be addressed before analysis of the algorithm results can make sense.

Estimating the impact of score clipping on the methods used is difficult in this work because of the large number of possible predictors and the fact that we are simultaneously attempting to estimate the impact of the predictors on the score. To isolate the effect of the threshold, we instead use a simple fabricated dataset with a single linear predictor, and show how various levels of thresholding affect the R^2 values identified for predictors.

After addressing the dataset problems, the rest of the chapter focuses on the individual analysis of our selected methods. As discussed in the background chapter, these methods are latent factor analysis, linear and SVM classification, and linear and SVR regression, in that order. Latent factor analysis is an outlier among these methods because it is an unsupervised learning method that, unlike the others, does not seek to predict user scores. Separating the remainder into regression and classification methods is logical because there is a naturally defined system of "classes" on the scores: failing and passing. As shown in the analysis of database flaws in section 4.1.1, it is all the more important to separate these two because one of the classes is already collapsed into a single score value.

3.3 Dataset Limitations: Assessing Impact

3.3.1 Score Clipping

To analyse the impact of score clipping (i.e. the conversion of scores below a certain threshold to 0), we measure the effect of similar clipping on a mock dataset with known properties.

To generate scores, we start with 500 000 points uniformly spaced on a line from (0,0) to (100,100) in the X-Y plane. Depending on the desired R^2 value, a multiplier is calculated to satisfy this R^2, and each point's score (y coordinate) is multiplied by a random number between this multiplier and 1. We derive the relationship between the multiplier and the true R^2 value of such a dataset as follows:

R^2 \equiv 1 - \frac{\sigma^2_{RES}}{\sigma^2_{TOT}}    (3.1)

where \sigma^2_{RES} is the variance of the data points about the regressor's predicted value and \sigma^2_{TOT} is the total variance of the data about its own mean. Given R^2, we wish to generate points that regress to a particular line, with variance about the line determined by R^2. Given an x coordinate, the score y is drawn from a distribution uniform between abx and ax, with -1 < b < 1. Since R^2 is given and we take a = 1 (i.e. the highest score assigned to x is x), we need to find the b for which the optimal linear regressor of the data attains the desired R^2 value.

For any x, the score is uniformly distributed between abx and ax, so the mean score for that x is (abx + ax)/2 = a(1+b)x/2. All of these means lie on a line through the origin, so the optimal linear regressor is that same line, with y-intercept 0 and slope a(1+b)/2. The variance about this line at a given x is the squared width of the interval divided by 12 (the variance of a uniform distribution over such an interval), i.e. (ax - abx)^2 / 12 = a^2 (1-b)^2 x^2 / 12. The expected value of this variance across all possible values of x, i.e. \sigma^2_{RES}, is

(27)

Z

100 0

a

2

(1 − b)

2

x

2

12 · 100 dx = 2500

9 a

2

(1 − b)

2

(3.2)

The total variance of the data is more involved. From the previous mean argument, we know that the expected value of the score is the mean of the regressor line over x, or 25a(1 + b). The expected score variance given x is the expected squared difference between this overall mean and the generated score, which is uniformly distributed between abx and ax. That is, this expected variance given x is:

\sigma^2_{TOT}(Y \mid X = x) = \int_{abx}^{ax} \frac{(25a(1+b) - t)^2}{a(1-b)x}\, dt = \frac{a^2}{3(1-b)x}\left((25(1+b) - bx)^3 - (25(1+b) - x)^3\right)    (3.3)

The total variance in the data is the expected value of this across all values of x, or

\sigma^2_{TOT} = \int_0^{100} \frac{a^2}{300(1-b)x}\left((25(1+b) - bx)^3 - (25(1+b) - x)^3\right) dx = \frac{625}{9} a^2 (7b^2 - 2b + 7)    (3.4)

Thus,

R^2 = 1 - \frac{\tfrac{2500}{9} a^2 (1-b)^2}{\tfrac{625}{9} a^2 (7b^2 - 2b + 7)} = 1 - \frac{4(1-b)^2}{7b^2 - 2b + 7}    (3.5)

This function does not pass the horizontal line test and therefore cannot strictly be inverted; however, it can be inverted when we restrict -1 < b < 1. Rearranging (3.5) gives the quadratic (7R^2 - 3)b^2 - 2(R^2 + 3)b + (7R^2 - 3) = 0, and the root lying on this branch corresponds to

b = \frac{(R^2 + 3) - 4\sqrt{3}\sqrt{(1 - R^2)R^2}}{7R^2 - 3}    (3.6)


Figure 3.1: Dataset of 500K points for R2 = 0.

As expected, this returns a b value of -1 when R^2 = 0 (i.e. scores for x are randomly distributed between -x and x), and a b value of 1 when R^2 = 1 (i.e. all scores for x are exactly x).

For our dataset, b < 0 is nonsensical, as it implies negative scores. The minimum realistic R^2 therefore occurs at b = 0 (i.e. scores for x uniformly distributed between 0 and x), and this minimum R^2 is 1 - 4/7 ≈ 0.4286.

Sample datasets for fixed R^2 values are shown in figures 3.1 through 3.6. Due to the large number of points (500K), individual points are not distinguishable; instead the charts show roughly the region in which data can be found.

After generating these datasets, several trials were run in which progressively more of the data was censored by replacing scores below a threshold with zeros (i.e. clipping), and a linear regressor was fit and its R^2 value recorded. Because the datasets can be generated with arbitrary true R^2 values, we can plot the true R^2 value against the one detected by the regressor on the thresholded data. The clipping levels of 25, 50, 75, and 100 were chosen simply to give equally spaced samples between 0 and 100. The results are visualized and analysed in section 4.1.1.
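A minimal sketch of this experiment, using equation (3.6) for the multiplier b and NumPy/scikit-learn as assumed tooling (the exact implementation used in this work may differ in details such as the random number generation):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def b_from_r2(r2):
        # Equation (3.6): lower multiplier b giving a dataset with true R^2 = r2
        return ((r2 + 3) - 4 * np.sqrt(3) * np.sqrt((1 - r2) * r2)) / (7 * r2 - 3)

    def make_dataset(r2, n=500_000, seed=0):
        rng = np.random.default_rng(seed)
        x = np.linspace(0, 100, n)
        b = b_from_r2(r2)
        y = x * rng.uniform(b, 1.0, size=n)   # score drawn uniformly in [bx, x] (a = 1)
        return x.reshape(-1, 1), y

    def clipped_r2(x, y, threshold):
        y_clipped = np.where(y < threshold, 0.0, y)   # score clipping
        model = LinearRegression().fit(x, y_clipped)
        return model.score(x, y_clipped)              # R^2 detected on clipped data

    x, y = make_dataset(r2=0.8)
    for threshold in (0, 25, 50, 75, 100):
        print(threshold, round(clipped_r2(x, y, threshold), 3))

Note that b_from_r2 is undefined at R^2 = 3/7 (the denominator vanishes), which is exactly the minimum realistic R^2 derived above; target values should be chosen away from that point.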


Figure 3.2: Dataset of 500K points for R2 = 0.2.

Figure 3.3: Dataset of 500K points for R2 = 0.4. This value is just below the minimum R2 = 0.428 for a reasonable b; i.e. b is approximately 0.


Figure 3.4: Dataset of 500K points for R2 = 0.6.

Figure 3.5: Dataset of 500K points for R2 = 0.8.


Figure 3.6: Dataset of 500K points for R2 = 1. All points lie exactly on the regressor line.


3.3.2 Students with Multiple Exam Times

This thesis primarily deals with the issue of multiple exam dates by doing analysis under three different subsets of the user interaction records. All three policies will be compared in the analysis to determine which policy provides the best result (and therefore the best method of adjusting for the multiple exam times). These subsets of user interaction records are:

1. All data (hereafter policy 1)

2. Only data before the most recent likely test date identified by a sliding window algorithm (hereafter policy 2)

3. Only the data for users with a single identified test window (students who have unambiguous test-taking times; hereafter policy 3, approximately 470 of 930 students)

We chose to call these selections "policies" rather than proper datasets because of the significant overlap in content between them and because they were all sourced from a single larger dataset. That is, there is no significant difference between the content in each dataset other than our choice of which data from the original dataset to exclude.

The candidate test dates for policy 2 are selected by calculating a sliding window over the logged events of each user. Given a date D and a student ID x, the value of the activity count is ln(|A|), where A is the set of all events for x starting at date D and ending 41 days later. The window is evaluated every 1/8 month, or approximately 3.805 days. These parameters were chosen so that non-zero window steps accurately reflect student activity during a study period, while allowing high enough resolution to see the drops in activity between study periods that normally occur even when the student takes the test in two consecutive seasons. Starting with event counts for all users for every day during the approximately 2017-01-01 to 2018-06-01 period, each bin contains a count of user interactions over the 41-day period starting at a new date, with each new bin falling 3.805 days after the previous one. The start date of the current bin is always kept as a floating point number, but the actual day index for the start of the window is this number rounded to the nearest integer. A candidate test-taking date is any date where the bin count drops to 0 after a period of non-zero activity, excluding the bin corresponding to the end of observations. In some cases, multiple test windows are identified because user activity drops to 0, starts again, and then drops to 0 again; in this case, the most recent candidate is used.
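A condensed sketch of this window scan for a single user follows; pandas is an assumed tool and the input format (a list of event timestamps) is our assumption, so treat this as an illustration of the procedure rather than the exact implementation.

    import numpy as np
    import pandas as pd

    def candidate_test_dates(event_times, start, end,
                             window_days=41, step_days=30.44 / 8):
        """Return candidate exam dates for one user: window start dates where
        the 41-day activity window drops to zero after a period of activity."""
        days = (pd.DatetimeIndex(event_times) - pd.Timestamp(start)).days.to_numpy()
        horizon = (pd.Timestamp(end) - pd.Timestamp(start)).days
        candidates, seen_activity = [], False
        t = 0.0                                      # window start kept as a float
        while t <= horizon - window_days:
            lo = int(round(t))                       # rounded to a day index per step
            count = np.sum((days >= lo) & (days < lo + window_days))
            if count > 0:
                seen_activity = True
            elif seen_activity:                      # activity stopped: candidate date
                candidates.append(pd.Timestamp(start) + pd.Timedelta(days=lo))
                seen_activity = False
            t += step_days
        return candidates                            # policy 2 keeps the most recent one

Only the zero/non-zero distinction matters for candidate detection, so the logarithm ln(|A|) mentioned above is needed only for the activity plots, not for this scan.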


Figure 3.7: Raw interaction count data for users 0-8, binned by week. The y axis is the logarithm of the event count plus one. All of these students are identified as failing in the database.

After the selection of a policy, records are drawn from the database and are either used as is (policy 1), stripped of all records occurring after a user's most recent test date (policy 2), or stripped of all records for users with multiple test dates (policy 3). These records are then run through preprocessing as described in the next section.

Plots of raw user activity counts and of binned user activity under policy 2 are presented in figures 3.7 through 3.14 to illustrate some properties of student activity counts. They are meant to convince the reader that many users appear to have taken the exams multiple times, and to show the high variability between users and the difficulty of extracting any useful information at all from these plots in certain cases (e.g. users with no activity, or with only a small amount of activity on a single day). The users were selected for visualization such that some figures show activity only for passing students and others only for failing students. Within their pass/fail group, students were otherwise selected arbitrarily, as small contiguous ranges of student IDs.


Figure 3.8: Windowed interaction count data for users 0-8

Figure 3.9: Raw interaction count data for users 9-17, binned by week. All of these students are identified as failing in the database.


Figure 3.10: Windowed interaction count data for users 9-17

Figure 3.11: Raw interaction count data for users 18-26, binned by week. All of these students are identified as failing in the database.


Figure 3.12: Windowed interaction count data for users 18-26. All of these students are identified as failing in the database.


Figure 3.13: Raw interaction count data for users 410-418, binned by week. All of these students are identified as passing in the database.

Figure 3.14: Windowed interaction count data for users 410-418


3.3.3 Users with No Activity

Unfortunately, there is nothing we can do to adjust for or predict the impact of the users with no activity, other than either to disregard all such users or to revise our accuracy metrics upwards by a small amount to account for the impossibility of predicting some of the data points. We choose to disregard these users, as selecting a way to revise the metrics upwards is significantly more complex than simply restricting the analysis to users with activity.

3.4 Implementation

This section covers the selection of features to be used in the linear and support vector models, as well as the selection of parameters to be used for all selected models in general.

3.4.1 Feature Selection

All methods except the latent factor analysis used a set of 13 predictors, drawn as statistics over a user’s activity gathered under each policy:

#   Name           Description
1   avg_grd_ans    proportion of answers correct
2   avg_ans        proportion of attempts answered
3   var_ans_time   0.01 * standard deviation of attempt time
4   frac_m_att     fraction of modules with at least one question attempt
5   frac_m_ans     fraction of modules with at least one attempt answered
6   frac_m_corr    fraction of modules with at least one answer correct
7   frac_q_att     fraction of unique questions with at least one attempt
8   frac_q_ans     fraction of unique questions with at least one answer
9   frac_q_corr    fraction of unique questions with at least one correct answer
10  frac_last_wk   fraction of attempts falling between the test interval and the week prior
11  log_attempts   logarithm of number of attempts
12  log_answered   logarithm of number of answers
13  log_correct    logarithm of number of correct answers

Two metrics (average grade over all questions and average proportion of questions answered) are suspected to be important as trivial representations of user engagement and skill, though they may not ultimately have an impact on the final score. These two also inspired the separation of most other predictors into groups of three: one for attempts, one for answers, and one for correctness. Some users fail to attempt most of the problems in the database (and thus have no records for these problems). Others may attempt a large number of problems but not answer them (shown in the database as an answer and grade of -1). Some users may attempt and answer a large number of problems but not answer them correctly, in which case their grade for those answers is reported as 0.

Since questions belong to a finite set of natural modules, statistics regarding the fractions of modules attempted or completed form a natural set of components for our users. The number of unique questions is also finite and thus provides another natural set of components. The number of total attempts is not bounded and often differs dramatically between users in an almost exponential distribution, so a logarithm seemed fitting here (and since we ignore users with no activity, the logarithm is always defined).

We also expect, from the raw data on users and from previous studies in educational science, that users who cram (spend significant portions of study time in the days just before the exam) do more poorly than those who study over long periods [44][45][46], so the fraction of studying done in the last few days (here, a week) is also expected to have an impact on user scores.

The final metric of variance in answer time is mainly intended as a control to check for effect size, as we expect it to not have a significant correlation with score.
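To make the predictor definitions concrete, the following sketch shows how a few of the 13 statistics could be computed for a single user from the event records; the column names and the assumption that module IDs have been joined in from the question table are ours, not the partner's schema.

    import numpy as np
    import pandas as pd

    def user_features(events: pd.DataFrame, n_modules=39, n_questions=1350):
        """Compute a subset of the 13 predictors for one user's events.
        Assumed columns: question_id, module_id (joined from the question
        table), answer, grade, time_spent."""
        answered = events[events["answer"] != -1]
        correct = events[events["grade"] == 1]
        return {
            "avg_grd_ans": len(correct) / max(len(answered), 1),
            "avg_ans": len(answered) / len(events),
            "var_ans_time": 0.01 * events["time_spent"].std(),
            "frac_m_att": events["module_id"].nunique() / n_modules,
            "frac_q_att": events["question_id"].nunique() / n_questions,
            "log_attempts": np.log(len(events)),
            "log_answered": np.log(max(len(answered), 1)),
            "log_correct": np.log(max(len(correct), 1)),
        }

The remaining predictors (the answered/correct variants of the module and question fractions, and frac_last_wk) follow the same pattern, with frac_last_wk additionally requiring the reconstructed test date from section 3.3.2.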

All metrics except the logarithmic measures and the variance fall naturally between 0 and 1. The logarithmic measures reach a maximum value of about 9.5 with a mean of 7, while the variance reaches a maximum of about 15 with a mean of about 1.3. The linear classification and regression methods do not require normalized data, while the SVM and SVR methods use a library that automatically normalizes the data in preprocessing, so we do not need to preprocess the four non-normalized metrics ourselves.

Latent factor analysis uses a different data matrix, described in its own subsection later in this chapter.

All classification and regression methods are run for 1000 trials, except for polynomial kernels with SVM or SVR, which are instead run with 100 trials due to slow training times. Each trial consists of a random partition of the 934 users into 80% training samples and 20% test samples. The large number of trials allows for an estimate of the standard error in metrics such as R^2, sensitivity, and specificity. In this work, sensitivity is the likelihood that a passing student is correctly identified as passing, while specificity is the likelihood that a failing student is correctly identified as failing.
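The trial protocol can be summarized by the sketch below. The model object is a stand-in for any of the classifiers in this chapter (anything with fit/predict), and the split and metric definitions follow the text; the implementation details beyond that are our assumptions.

    import numpy as np

    def run_trials(X, passed, model, n_trials=1000, test_frac=0.2, seed=0):
        """Repeated random 80/20 splits. sensitivity = P(predicted pass | passed),
        specificity = P(predicted fail | failed). Assumes both classes appear
        in each test split."""
        rng = np.random.default_rng(seed)
        n_test = int(len(X) * test_frac)
        sens, spec = [], []
        for _ in range(n_trials):
            idx = rng.permutation(len(X))
            test, train = idx[:n_test], idx[n_test:]
            model.fit(X[train], passed[train])
            pred = model.predict(X[test])
            y = passed[test]
            sens.append(np.mean(pred[y == 1] == 1))
            spec.append(np.mean(pred[y == 0] == 0))
        stderr = lambda v: np.std(v) / np.sqrt(len(v))
        return np.mean(sens), stderr(sens), np.mean(spec), stderr(spec)

The standard errors reported later in this work are taken over trials in exactly this sense: the spread of the per-trial metric divided by the square root of the number of trials.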


3.4.2 Latent Factor Analysis

Latent factor analysis is typically used to visualize correlations in a dataset composed of two different kinds of datapoints. Here, these two kinds of datapoints are the users and the questions. By applying latent factor analysis here, we may be able to discover correlations between users and questions that are not readily apparent from simply looking at records in the database.

A matrix is constructed to represent which questions each user has attempted. This matrix is $n \times m$, where $n$ is the number of users and $m$ is the number of unique questions. A large, ID-contiguous section comprising approximately one third of the users, none of whom have any activity, was excluded from this matrix, leaving a total of approximately 600 users and 5000 questions (where approximately three quarters of question IDs do not correspond to a question with activity in the database). We calculate the singular value decomposition (SVD) of this matrix $M$, and use an expectation-maximization algorithm with a Bayesian information criterion to identify the optimal number of classes for both users and questions, as described below.

Singular Value Decomposition (SVD) of a matrix $M$ is accomplished by calculating the eigenvectors of $MM^T$ (the columns of $U$), the eigenvectors of $M^TM$ (the columns of $V$), and the square roots of their shared nonzero eigenvalues (the singular values, collected in the diagonal matrix $\Sigma$). These matrices satisfy $U\Sigma V^T = M$. $U$ contains eigenvectors that represent correlations between users and some underlying hidden "information space" we wish to discover, while $V^T$ contains eigenvectors that represent correlations between this hidden information space and the questions. $\Sigma$ contains the strengths of these correlations, in the form of singular values corresponding to the vectors.

The Expectation Maximization (EM) algorithm we use to classify points is a Gaussian mixture model from scikit-learn. For a given number of classes, one n-dimensional normal distribution per class is initialized with a random mean and covariance matrix, and each point is assigned a probability of having been generated by each class. Each class's mean and covariance are then re-estimated from the points, weighted according to their membership probability for that class, and the process is repeated until the class parameters converge. To avoid overfitting, we cannot use all of the approximately 900 components in the decomposition for classification; we instead use only the strongest 15.

The Bayesian Information Criterion is $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$, where $k$ is the number of model parameters, $n$ is the number of points, and $\hat{L}$ is the maximized likelihood of the model. It therefore grows linearly with the number of parameters, logarithmically with the number of points, and shrinks as the goodness of the model fit increases. Smaller BIC values are better, and imply that the model has improved more than would be expected by chance from introducing new parameters. This criterion is typically used to avoid overfitting when a large number of parameters is available.
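A minimal sketch of this pipeline, assuming M is the user-question attempt matrix described above and keeping only the 15 strongest components; the range of candidate class counts is illustrative, not a value reported in this work.

import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_users(M, n_components=15, max_classes=10, seed=0):
    # SVD: M = U * diag(S) * Vt. Rows of U place users in the hidden
    # "information space"; rows of Vt do the same for questions.
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    user_coords = U[:, :n_components] * S[:n_components]

    # Fit Gaussian mixtures with an increasing number of classes and
    # keep the one with the smallest BIC (smaller is better).
    best_model, best_bic = None, np.inf
    for k in range(1, max_classes + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed)
        gmm.fit(user_coords)
        bic = gmm.bic(user_coords)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model, best_model.predict(user_coords)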

Because we do not have access to additional information on the questions (such as what their text is), we cannot qualitatively attempt to explain what these classes of questions correspond to. We can, however, still make some numerical judgements about the resulting classification in terms of how well it separates questions, and how much cleaner such a classification seems compared to the human-indexed question classes.

3.4.3 Linear Classification

The predictors are regressed onto the scores using linear regression. Since approximately five-sixths of the students passed, we attempt to partially mitigate class bias by using weighted least squares regression, with weights inversely proportional to class frequency (where the two classes are passing and failing).

We have $n$ users and 13 predictors, so our predictor matrix $D$ is $n \times 13$. Our score vector $y$ is $n \times 1$, where passing scores are replaced with 1 and failing scores are kept as 0. Our estimate of the best predictor weights $b$ is the vector that best satisfies $Db = y$ in the least-squares sense under a diagonal weight matrix $W$, given by $b = (D^T W D)^{-1} D^T W y$. Given a new user's predictor vector $x$, we predict that the user passes if $xb > 0.5$.

\[
W = \frac{1}{|A|}
\begin{pmatrix}
1_A(x_1) & 0 & \cdots & 0 \\
0 & 1_A(x_2) & & \\
\vdots & & \ddots & \\
0 & & & 1_A(x_n)
\end{pmatrix}
+
\frac{1}{|B|}
\begin{pmatrix}
1_B(x_1) & 0 & \cdots & 0 \\
0 & 1_B(x_2) & & \\
\vdots & & \ddots & \\
0 & & & 1_B(x_n)
\end{pmatrix}
\tag{3.7}
\]

where $A$ is the set of users that failed, $B$ is the set of users that passed, and $1_A$, $1_B$ are the indicator functions of those sets.
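A minimal sketch of this weighted least-squares classifier, assuming D is the n x 13 predictor matrix and y the 0/1 pass labels; it is an illustration of equation (3.7), not the original implementation.

import numpy as np

def fit_weighted_ls(D, y):
    # Per-sample weights inversely proportional to class frequency,
    # i.e. 1/|A| for failing users and 1/|B| for passing users.
    w = np.where(y == 1, 1.0 / np.sum(y == 1), 1.0 / np.sum(y == 0))
    W = np.diag(w)
    # Solve the weighted normal equations (D^T W D) b = D^T W y.
    return np.linalg.solve(D.T @ W @ D, D.T @ W @ y)

def predict_pass(D_new, b, threshold=0.5):
    # A new user is predicted to pass if x b exceeds the threshold.
    return D_new @ b > threshold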

3.4.4 SVM

The predictors are used to train an SVM from the scikit-learn library, with results presented for sigmoid, linear, polynomial degree 3, and radial basis function kernels.

Given a list of points $x_i \in X$ and classes $y_i \in Y$, SVM calculates a hyperplane that maximizes the distance between the hyperplane and the closest points of each class. In its dual formulation, this amounts to maximizing:


\[
\sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} y_i c_i \, (x_i \cdot x_j) \, y_j c_j
\]

such that

\[
\sum_{i=1}^{n} c_i y_i = 0 \quad \text{and} \quad 0 \le c_i \le \frac{1}{2n\lambda} \quad \forall i
\]

The dot product is typically calculated through the use of a kernel or "pseudo- distance" metric instead of direct representation of the vectors in a high-dimensional space.

The hyperplane vector $w$ is recovered from the optimal coefficients as

\[
w = \sum_{i=1}^{n} c_i y_i x_i
\]

The SVC class from scikit-learn uses λ = 1 and optimizes the c parameters through gradient descent [47], and allows the choice of radial basis, sigmoid, linear, or polynomial kernels. The kernels are described below:

Linear        $x_i \cdot x_j \equiv \langle x_i, x_j \rangle$ (dot product)
Polynomial    $x_i \cdot x_j \equiv (\gamma \langle x_i, x_j \rangle + r)^d$
Radial Basis  $x_i \cdot x_j \equiv \exp(-\gamma \lVert x_i - x_j \rVert^2)$
Sigmoid       $x_i \cdot x_j \equiv \tanh(\gamma \langle x_i, x_j \rangle + r)$

The data points are weighted such that each point has a weight inversely proportional to its class's frequency, as in the linear classification method above.
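A sketch of this setup under stated assumptions: class_weight="balanced" in scikit-learn reweights points inversely to class frequency (up to a constant factor), mirroring the weighting above, and the kernel hyperparameters shown are library defaults rather than values tuned for this dataset.

from sklearn.svm import SVC

# One classifier per kernel discussed in the text.
kernels = ["linear", "poly", "rbf", "sigmoid"]
models = {k: SVC(kernel=k, degree=3, gamma="auto", class_weight="balanced")
          for k in kernels}
# Example usage within one trial (X_train, y_train, X_test assumed given):
# models["rbf"].fit(X_train, y_train)
# predictions = models["rbf"].predict(X_test)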

3.4.5 Linear Regression

This method is identical to linear classification, except that passing scores are not replaced with 1, and we predict that a user passes if $xb > c$, where $c$ is the passing score for this exam. That is, we regress the predictors onto the actual score (0-100) rather than onto the pass/fail class, and the model predicts the actual score rather than the class.
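A short continuation of the earlier least-squares sketch, assuming D, y_pass, and D_new are as before, scores holds the raw 0-100 exam scores, and passing_score is the exam's cutoff; all names are illustrative.

import numpy as np

# Same class-balanced weighting, but the regression target is the raw score
# and the prediction is thresholded at the passing score.
w = np.where(y_pass == 1, 1.0 / np.sum(y_pass == 1), 1.0 / np.sum(y_pass == 0))
W = np.diag(w)
b = np.linalg.solve(D.T @ W @ D, D.T @ W @ scores)
predicted_pass = D_new @ b > passing_score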

3.4.6 SVR

The predictors are used to train an SVR regressor from scikit-learn's SVR class, with results presented for sigmoid, linear, polynomial degree 3, and radial basis function kernels. Gamma is set to 'auto', meaning that the gamma scale parameter is set automatically to $\frac{1}{d\,\sigma^2_X}$, where $d$ is the number of predictors and $\sigma^2_X$ is the variance of $X$.

SVR uses a slightly different loss metric than SVM; it minimizes:

\[
\frac{1}{2} \lVert w \rVert^2
\]

such that

\[
\lvert y_i - (\langle w, x_i \rangle + b) \rvert \le \varepsilon \quad \forall i
\]
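A sketch of the SVR setup with gamma set to 'auto' as stated above; as with the classifiers, each kernel would be trained and evaluated over repeated random splits, and regression quality can be summarized with $R^2$ via the model's score method. Variable names for the splits are assumptions.

from sklearn.svm import SVR

# One regressor per kernel discussed in the text.
kernels = ["linear", "poly", "rbf", "sigmoid"]
regressors = {k: SVR(kernel=k, degree=3, gamma="auto") for k in kernels}
# Example usage within one trial:
# regressors["rbf"].fit(X_train, scores_train)
# r2 = regressors["rbf"].score(X_test, scores_test)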


Chapter 4 Results

This chapter shows results both for the dataset schema analysis and for the individual regression and classification problems.

4.1 Analysis of Dataset Limitations

Recall that we addressed three major problems in the methodology: non-passing scores were clipped to 0, many users appear to have multiple exam dates, and many users have no activity in the database at all. Here we analyze how each of these problems affects our dataset.

4.1.1 Score clipping

The impact of score clipping on identified R2 values is visualized in figures 4.1 and 4.2. The x axis is the true R2 value, guaranteed through careful data generation as proved in section 3.3.1, while the y axis is the R2 value of the best linear regressor to the data. In both figures, the blue line without clipping functions as a control; without clipping, the linear regressor gives a perfectly accurate estimate of the percentage of variance explained by the best linear fit.

This blue curve is therefore simply the line $y = x$. The other curves clip data below 25 (yellow), 50 (green), and 75 (red) to 0 in figure 4.1, or remove such data entirely in figure 4.2. The blue baseline curve can be considered an extreme example where the clipping threshold is set to 0.

Figure 4.1 shows that increasing the clipping threshold causes the linear regressor to underestimate the percentage of variance that can be explained by a linear predictor when this percentage is very large, and to overestimate it when this percentage is small.
