
IT 20 042
Degree project (Examensarbete), 30 credits
July 2020

Semi-supervised learning with HALFADO: two case studies


Abstract

Semi-supervised learning with HALFADO: two case studies

Moustafa Aboushady

This thesis studies the HALFADO algorithm [1], a semi-supervised learning algorithm designed for detecting anomalies in complex information flows. The report assesses HALFADO's performance in terms of detection capabilities (precision and recall) and computational requirements. We compare the results of HALFADO with a standard supervised and unsupervised learning approach. The results of two case studies are reported: (1) HALFADO as applied to a FinTech example with a flow of financial transactions, and (2) HALFADO as applied to detecting hate speech in a social media feed. Those results point to the benefits of using HALFADO in environments where one has only modest computational resources.

Subject reader: Niklas Wahlström. Supervisor: Kristiaan Pelckmans


Popular Scientific Summary

Almost everyone nowadays has a smartphone, a laptop, a tablet, or even an Internet of Things (IoT) device. People expect their requests on these devices to be handled and processed instantaneously, especially if they work in a specific domain, e.g. the stock market. At the same time, these devices generate continuous data streams that usually represent a behaviour, or even a change in behaviour. The need to address real-time information and decision-making constraints in mobile/ubiquitous data stream analysis has led to an increased demand for efficient and scalable online learning algorithms, which process and analyze data in real time and adapt to any changes in behaviour.

In this thesis we introduce HALFADO, an algorithm for processing data and detecting anomalies/faults in real time. Our goal is to investigate HALFADO's implementations in terms of detection capabilities, computational requirements, and its scalability for use in many different domains. We therefore show its ability in two different applications while being implemented on modest hardware: (1) applied to detect fraudulent transactions in a flow of financial transactions, and (2) applied to detect hate speech in a social media feed.

HALFADO could be of interest to anyone who wants to discover anomalies in a certain data flow in real time, or who works in a field where anomalies are critical to detect in real time, e.g. a healthcare monitoring system, where an anomaly would be a patient having a serious change in heart rate that requires instant handling of the situation.

The implementations of HALFADO are beneficial to different parties: in industry, to any entity interested in getting insights on data in real time, and even to individuals who in some cases need help as fast as possible, as in the above-mentioned healthcare system.


Contents

1 Introduction
  1.1 Online Learning
  1.2 Anomaly/Fault detection
2 Theory
  2.1 Online learning vs Offline/Batch learning
  2.2 Supervised Learning
  2.3 Unsupervised Learning
  2.4 Semi-Supervised Learning
  2.5 Evaluation Metrics
3 Methods
  3.1 HALVING for online supervised learning
    3.1.1 HALVING for Prediction From Expert Advice
    3.1.2 The HALVING Algorithm: Soft implementation using constant factor Alpha
  3.2 FADO for online unsupervised learning
  3.3 HALFADO for online semi-supervised learning
4 HALFADO in Two Case Studies
  4.1 Case study in Fin-Tech
    4.1.1 Dataset
    4.1.2 Setup
    4.1.3 Performance
  4.2 Case Study in Social Media
    4.2.1 Dataset
    4.2.2 Setup
    4.2.3 Performances
    4.2.4 Performance comparison
5 Discussion
  5.1 Performance
  5.2 Existence Condition
6 Conclusion
  6.1 Conclusion of the work
  6.2 Open Problems
    6.2.1 Feature selection in HALFADO
    6.2.2 Unsupervised Online Learning


1 Introduction

1.1 Online Learning

In Machine Learning (ML), data is a crucial component. It decides how well your ML model will perform on unseen data: more data leads to better generalization (prediction power over unseen data). As described in [1], there are two design options based on the nature of the modeling pipeline through which you receive your data. The first is to build your learning model while your data is at rest (batch learning), and the second is when your data is flowing in streams into the model (online learning). Batch learning is the more traditional approach: it splits the dataset into two sub-sets for training and testing, and with that comes the underlying assumption that the test data have similar statistics to the training data [2]. It also assumes that the data is stored and can be accessed several times; that assumption, however, imposes several resource constraints, including storage and computational power.

More than ever, the volume of data streams has increased exponentially due to, amongst other things, advances in hardware technology. Applications such as financial processing [3], sensor networks [4], web logs, and sentiment analysis continuously generate fresh data. These datasets often become so large that it might be infeasible to store them [5]. Moreover, some critical applications like healthcare monitoring require that these data flows be analyzed in real time. As a result, online learning has gathered more attention in recent years as the solution for continuous data stream analysis.

Online learning is concerned with learning a pattern incrementally by processing examples one at a time, as defined in the overview by Widmer et al. [6]. It is performed in a sequence of consecutive rounds, and can be thought of as answering a sequence of questions [7]. In the case of online classification, the Yes or No answers point to the target classes, and a question is classified into either the Yes class or the No class. The goal of online learning for classification is to make as few mistakes as possible, that is, to minimize the total number of erroneous classifications [6]. In contrast to batch learning, online learning has the flexibility to scale and adapt to changes in the data properties, to process data in real time with limited resources, and to discover and learn new patterns in continuous data streams. However, batch learning can also be used for the analysis of continuous data streams, as suggested in [2], by keeping a buffered dataset of past data records. The model is then re-trained at regular intervals on such a new batch, hence adapting to the changes reflected in this batch. Although this paradigm makes batch learning more flexible to changes, it does not necessarily address the computing and storage resources required. As presented in [8], online learning is also needed to address the real-time information and decision making constraints on mobile/ubiquitous


data stream analysis. A wide range of tasks nowadays comes with such temporal constraints (see https://www.sciencedirect.com/topics/computer-science/temporal-constraint), which increases the need for efficient and scalable online learning algorithms. In many application areas it is impossible to make assumptions regarding the distribution of the data, as the sequence of the data can be deterministic [7]. Examples of online prediction problems include online regression, prediction with expert advice, online detection, and the multi-armed bandit problem in a limited feedback setting [7].

Online learning is a common technique in areas of machine learning where it is computationally infeasible to train on the entire dataset. Thus, it processes incoming real-time data sequentially.

1.2 Anomaly/Fault detection

Anomaly/fault detection is a method of detecting outliers that do not conform to the normal behavior (or normal pattern). The anomaly/fault detection task is to recognize the presence of these outliers with respect to a model definition of "normal" [9]. Anomalies are defined not by their own characteristics, but in contrast to what is normal. You may not know what the anomalies will look like, but you can build a system to detect them in contrast to what you have discovered and defined as the normal pattern; see [10] for an example.

In our financial transactions case (section 4.1), where fault detection is called fraud detection [11], the supervised approach can be used when there is a dataset available with the records (transactions) labeled as e.g. "normal" or "fraudulent". The labels help the algorithm form an idea of what a normal or fraudulent transaction looks like. However, we focus on online fault/fraud learning using a semi-supervised approach, where the records (transactions) arrive at a high frequency and are not labeled in advance. After a while, the algorithm has formed a pattern for a "normal" transaction, and counts every transaction that deviates enough from the normal pattern as a possibly "fraudulent" transaction. Those potentially fraudulent transactions are then submitted for follow-up analysis, and hence acquire a "label". This then allows the detection algorithm to learn.

The detection of anomalies in high-frequency streams of high-dimensional (financial, social media) data poses challenges beyond the reach of many existing approaches. The algorithm dubbed FADO (Online Fault Detection), which is the basis of this thesis, is described in [12] and elaborated in the work [11]. This FADO approach is based on tools from machine learning [13][14], as in [15], and introduces a deterministic detection technique for streams of data as an alternative in case the usual stochastic models are not applicable [12].


The streaming model deviates from the traditional training-testing divide in that at no point does one have access to a representative training data set.

This thesis continues the exploration of methods of fault detection for streaming data, based on techniques of semi-supervised machine learning. We will, however, focus on a different algorithm named HALFADO. The algorithm is described in section 3.3. This report explores HALFADO from a performance, application-oriented and computational perspective. A second aim of this project is to explore the implementational limits of this technique. That is, we will provide evidence that this powerful algorithm can be implemented and run on very modest hardware.


2 Theory

2.1 Online learning vs Offline/Batch learning

In offline or batch learning, a model is derived from a historical dataset. After learning, the model is validated on fresh test data. In online learning, however, one takes one observation at a time, makes a prediction for it, observes the outcome and updates the model accordingly. The exact setting is given in Algorithm 1. In this setting, training and testing are intertwined.

Algorithm 1: Online Learning
initialize f_0
for t = 1, 2, 3, ... do
  (1) Measure x_t
  (2) Predict f_{t-1}(x_t)
  (3) Update f_{t-1} → f_t if needed.
end for

We can have supervised, unsupervised or semi-supervised learning both in the batch setting and in the online setting. For example, one can do supervised learning in the batch setting, for example using deep neural networks, or in the online setting, for example using the HALVING algorithm. Table 1 exemplifies this.

                           Online Learning    Batch Learning
Supervised learning        HALVING            Deep learning
Unsupervised learning      FADO               K-means
Semi-supervised learning   HALFADO            Label propagation

Table 1: A characterization of learning algorithms.

Online learning approaches are more memory-efficient and more adaptable. They are more memory-efficient because an observation does not need to be stored once it has been processed, and more adaptable because no assumption is made about the distribution of the data. As the data distribution morphs or drifts due to, say, changing customer behaviour, the model can adapt on the fly to keep pace with trends in real time. To achieve something similar with offline learning, one would have to keep a sliding window of the data and retrain the model every time (see https://dziganto.github.io/data%20science/online%20learning/python/scikit-learn/). These properties of online learning are helpful in our two use cases for the following reasons:

• The memory efficiency of HALFADO is particularly helpful in the Social Media case. A large number of tweets is posted every day and, although more data generally gives a more accurate model, storing millions of tweets is not an option, so online learning is by far more suitable for the job.

• Its adaptability comes into play in the financial transaction use case: fraud in finance usually follows changes in customers' spending habits, and the ability to detect that in real time, or at the very minimum to raise an alarm for further inspection, is crucial.

2.2 Supervised Learning

Supervised learning algorithms learn from the labels of the datapoints. In other words, the data has an input-output format. The input is usually a vector of features, and the output is the label that classifies the input into different categories. Consider for example the social media case study: here we consider a list of tweet messages, each represented as a numerical vector, with each tweet labeled as either neutral or as hate speech. A supervised algorithm analyzes the training data and produces a model that can, with a certain accuracy, assign an output/label to an unseen input/test data point, by generalizing the rule it learned during the analysis of the training data.

Using the Scikit-learn (sklearn) library, a machine learning library for the Python programming language, a tweet is converted into a numerical vector using the Bag of Words (BoW) model and Term Frequency-Inverse Document Frequency (TF-IDF); the vectors are then analyzed by a Naive Bayes classifier. We use the list of 3000 standard words in the Oxford dictionary as a reference vocabulary. Any tweet can then be encoded as a fixed-length vector with the length of the vocabulary of known words, where the value in each position is computed from the frequency of the corresponding reference word in the tweet.

• The first step in the BoW model is to convert each tweet into a vector of numbers, by representing each tweet with a vector/sparse matrix with the size of the vocabulary, holding the count of each vocabulary word in the tweet at the position of the corresponding word. Here is a small example. Let the vocabulary we built be represented as a dictionary ["This": 0, "thesis": 1, "is": 2, "really": 3, "confusing": 4, "I": 5, "don't": 6, "understand": 7], where the keys are words in the document and the values are indexes. Now take the first tweet in the document to convert to a vector of numbers; this first tweet is "This is good, this I understand", and a vector representation of it would be [2, 0, 1, 0, 0, 1, 0, 1]. Each number corresponds to the count of the corresponding vocabulary word found in the tweet: since "this" appeared twice, there is a count of 2 at the index of "This" in the vocabulary, and "don't" was not mentioned in the tweet, which is why its count is 0. This counting is done by a CountVectorizer.

• The second step is to replace the pure word frequencies with a TfidfVectorizer. TF-IDF represents each word in a tweet with a score calculated with the TF-IDF method instead of the BoW frequency count. TF-IDF stands for Term Frequency-Inverse Document Frequency, and accounts for the fact that words occurring very often are usually not very informative. For example, the word "the" will appear in many tweets, and its overall frequency count will not be very meaningful in the encoded vectors.

• Term Frequency: this summarizes how often a given word appears within a document.

• Inverse Document Frequency: this downscales words that appear a lot across documents.

For this method, since we used a CountVectorizer to get the word counts, we skip TF and use IDF. The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word; in our example above, that is "this" at index 0. We thus end up with an encoded vector like

[1, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3]    (1)

The values are then normalized between 0 and 1, and these encoded vectors can next be used for the analysis by the supervised learning algorithm. A short sketch of this pipeline is given below.
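A minimal sketch of this pipeline, assuming scikit-learn; the example tweets and labels are hypothetical and only illustrate the flow from CountVectorizer over TfidfTransformer to a Naive Bayes classifier:

# Sketch: BoW counts -> TF-IDF re-weighting -> Naive Bayes, with placeholder data.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

tweets = ["This is good, this I understand",
          "This thesis is really confusing",
          "I don't understand"]
labels = [0, 0, 1]                                   # hypothetical labels (1 = hate speech)

vectorizer = CountVectorizer()                       # BoW: raw word counts
counts = vectorizer.fit_transform(tweets)
tfidf = TfidfTransformer()                           # re-weight the counts with IDF
features = tfidf.fit_transform(counts)

clf = MultinomialNB().fit(features, labels)          # train the Naive Bayes classifier

new_counts = vectorizer.transform(["I really don't understand this"])
print(clf.predict(tfidf.transform(new_counts)))      # predicted label for an unseen tweet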

2.3 Unsupervised Learning

Unlike supervised learning, unsupervised learning allows learning without labels. One of the methods used in unsupervised learning is cluster analysis, which is used as an exploration method of a dataset so as to group data based on similarity. For this algorithm, the input would be tweets (without labels assigned), and since many machine learning algorithms deal only with numerical representations, e.g. vectors of numbers, some very useful methods to obtain such representations in Natural Language Processing (NLP) are Word to Vector (Word2Vec) and skip-grams.


Word2vec is a group of related models that are used to produce word embeddings. These representations are based on shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words (see https://en.wikipedia.org/wiki/Word2vec). In other words, word2vec builds vector representations of words, as illustrated in Figure 1, under the assumption that two words in the same context share a similar meaning and hence will have similar vector representations in the model.

Figure 1: Vector representations of words using Word2Vec

Skip-gram is one of the techniques used in word2vec to get a vector representation based on context ("similar representations for words with the same context"), as shown by Figure 2b (see https://jalammar.github.io/illustrated-word2vec/). It updates vectors with a similar context to be closer to each other, by moving a sliding window (see Figure 2a) over all tweets and trying to predict the surrounding words of a target word (the center word in the window) during the training phase of the word2vec model. By the end of that phase, skip-gram will have updated the vectors of words with a similar context, in other words, words that often appear together.

Now that the word2vec model is trained and a vocabulary has been built together with its corresponding context-based vector representations, we use that as the input to the clustering analysis part using the K-means algorithm.

Clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data. It has two properties:


(a) Skip-gram’s sliding window (b) Skip-gram’s Architecture

Figure 2: Skip-Grams

• All the data points inside the same cluster should be similar to each other.

• The data points from different clusters should be as different as possible (see Figure 3).

Clustering is used in a wide range of real-world applications, e.g. recommendation systems, image segmentation, etc.

Figure 3: Points in each cluster are completely different

K-means is a distance-based algorithm: it assigns a point to a cluster based on the distance between that point and the cluster's centroid, with the objective of minimizing the sum of distances between the points in a cluster and their cluster centroid. K-means goes through a few steps to achieve the grouping/clustering of points in a dataset:


• Choose the number of clusters k.
• Choose k random points as the cluster centroids.
• Assign each point to the closest cluster centroid.
• Re-calculate the centroids of all clusters.

The algorithm will stop when:

1. the centroids no longer change after recalculation,
2. the data points stay in the same clusters after recalculation, or
3. the maximum number of iterations has been reached.

Figure 4 shows how our training data points are clustered/grouped together based on context. Next comes the part where we want to test the strength of our algorithm on new test data points. Since clustering is only an explorative approach, in other words it does not have separate training and testing stages, adding new data points could potentially change the structure learned from the old data points. To tackle this problem, we use the vocabulary learned by the trained word2vec model, with its corresponding context-based vectors, as a look-up matrix for the new data points, and then use the trained clustering algorithm to predict which cluster a new data point belongs to. This is done under the assumption that the training set is large enough for the word2vec model to learn a vocabulary that also includes most of the words in the new data points; that way, the clustering algorithm will not change the discovered structure, it will only predict based on the knowledge gained from that structure. To make this clearer, here is an example of how this is done. Let this list of words be a new data point/tweet: ["He", "hates", "these", "people"]. Under our assumption, the vocabulary from the training phase includes most of these words, so we use the trained word2vec model to get the vector representation of each word from the look-up matrix, and then let the k-means model predict which cluster each one belongs to. We end up with [1, 1], a vector that holds the cluster number to which each word belongs (pronouns are not considered by the algorithm). The words 'hates' and 'people' give the algorithm the context it needs to assign them to the cluster that groups hate speech. Therefore, this sentence would be classified as a hate tweet. A small sketch of this pipeline is given below.
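A minimal sketch of this pipeline, assuming the gensim (version 4 or later) and scikit-learn libraries; the tokenized tweets are placeholder data and the hyperparameters are illustrative:

# Sketch: train word2vec (skip-gram), cluster the word vectors, predict for new words.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

tweets = [["he", "hates", "these", "people"],
          ["what", "a", "lovely", "day"],
          ["people", "love", "kindness"]]

w2v = Word2Vec(sentences=tweets, vector_size=50, window=2, min_count=1, sg=1)  # sg=1: skip-gram

words = list(w2v.wv.index_to_key)                     # the learned vocabulary
vectors = np.array([w2v.wv[w] for w in words])
kmeans = KMeans(n_clusters=2, n_init=10).fit(vectors) # two clusters: hate / neutral context

new_tweet = ["he", "hates", "people"]
known = [w for w in new_tweet if w in w2v.wv]         # look up only words seen in training
print(kmeans.predict(np.array([w2v.wv[w] for w in known])))  # cluster id per word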

Figure 4: Two clusters for our classes after k-means converges

2.4 Semi-Supervised Learning

Semi-Supervised learning (SSL) is a learning method that lies in between supervised and unsupervised learning: some of the data is labeled and the rest is unlabeled. Labeled data is expensive and time consuming to obtain, as it requires a lot of human involvement to annotate it; unlabeled data is comparatively easy to collect, but there have been only a few ways to model it. The goal of semi-supervised learning is to understand how combining labeled and unlabeled data may change the learning behavior, and to design algorithms that take advantage of such a combination. Semi-supervised learning is of great interest in machine learning and data mining because it can use readily available unlabeled data to improve supervised learning tasks when the labeled data is scarce or expensive. There are different semi-supervised learning settings (e.g. semi-supervised classification and constrained clustering) [16].

2.5 Evaluation Metrics

The next step after testing a machine learning algorithm is to measure how well it did, that is, to evaluate the generalization it gained during the training phase, and to see whether the parameters need tuning or the implementation needs changing to get a better result. That is done by utilizing evaluation metrics. However, since there are various metrics for evaluating ML models in different applications, choosing the right metric is not a simple job, so knowing about them and when to use them is crucial. Here we introduce the metrics we used to evaluate the algorithms mentioned above, and later show how well each algorithm performed according to these metrics.

Confusion Matrix This is not a metric in itself, but it is beneficial to know. Figure 5 shows an example of how it is used with a supervised learning algorithm, showing how many entries the algorithm labelled correctly and how many it did not.


Figure 5: A sample confusion matrix

This matrix offers four combinations of labels, from the predicted classes versus the actual classes as follows:

• True-Positive (TP): in our example, when a cat is correctly predicted as a cat.
• False-Negative (FN): when a cat is incorrectly predicted as a non-cat.
• False-Positive (FP): when a non-cat is incorrectly predicted as a cat.
• True-Negative (TN): when a non-cat is correctly predicted as a non-cat.

These four numbers are the parameters for some of the next evaluation metrics.

Accuracy Considered to be the simplest metric, it is defined as

Accuracy = (number of correct predictions / total number of predictions) × 100    (2)

To continue with the example used in the confusion matrix (see Figure 5), the accuracy would be

Accuracy = (90 + 940)/(1000 + 100) = 93.6%

Since accuracy isn’t a good evaluation metric for a lot of ML models, specially in case of imbalanced data, where samples that belong to one class are far more numerous than the other, hence it’s accuracy would be high, since one will have a lot of correct predictions that belong to the larger class. Thus, if your object is to measure the algorithm’s ability to correctly predict samples that belong to the smaller class–in like manner

(18)

in the case of this thesis, where detecting ”anomalies” is the goal–then the accuracy isn’t a very good way of showing that. Instead we use other metrics that can evaluate the prediction power in each class e.g. precision and recall.

Precision A class-specific performance metric, defined as follows:

Precision = TP / (TP + FP)

Following the cats example:

Precision[Cats] = 90/(90 + 60) = 60%
Precision[Non-cats] = 940/950 = 98.9%

Recall Another class-specific metric, also known as Sensitivity, defined as:

Recall = TP / (TP + FN)    (3a)

Recall[Cats] = 90/100 = 90%    (3b)
Recall[Non-cats] = 940/1000 = 94%    (3c)

The difference between precision and recall, using the cats example, is that precision is the number of samples correctly predicted as cats divided by all samples predicted as cats, while recall is the number of samples correctly predicted as cats divided by all cat samples in the dataset. Figure 6 helps to see the difference more clearly; a short sketch computing both metrics is given below.
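A minimal sketch that reproduces the precision and recall of the cats example with scikit-learn; the label arrays below are reconstructed from the counts in the confusion matrix (100 cats, 1000 non-cats):

# Sketch: precision and recall from the cats/non-cats confusion matrix.
from sklearn.metrics import precision_score, recall_score

y_true = [1] * 100 + [0] * 1000                          # 1 = cat, 0 = non-cat
y_pred = [1] * 90 + [0] * 10 + [1] * 60 + [0] * 940      # 90 TP, 10 FN, 60 FP, 940 TN

print(precision_score(y_true, y_pred))                   # 90 / (90 + 60) = 0.60
print(recall_score(y_true, y_pred))                      # 90 / (90 + 10) = 0.90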

Figure 6: Precision and Recall

ROC curve The Receiver Operating Characteristic curve is another metric used to evaluate the performance of a binary classifier, as a function of the true positive rate against the false positive rate. An ideal classifier would have an ROC graph that reaches a true positive rate of 100% with zero increase in the false positive rate; since in reality that is almost never the case, we calculate how many entries the algorithm predicts correctly (TP) as the number of false positives (FP) increases (see https://www.dezyre.com/data-science-in-python-tutorial/performance-metrics-for-machine-learning-algorithm/). Those two rates are then plotted against each other, as shown in Figure 7.


Figure 7: An example of a Receiver Operating Characteristic (ROC) curve. The model associated to the blue line achieves almost optimal trade-off, while the model associated to the dashed line is equivalent to a random approach.
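A minimal sketch of how such a curve can be computed with scikit-learn, assuming a classifier that outputs scores or probabilities; the labels and scores below are synthetic:

# Sketch: compute the ROC curve points and the area under the curve.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7])   # hypothetical classifier scores

fpr, tpr, thresholds = roc_curve(y_true, scores)   # false/true positive rate per threshold
print(list(zip(fpr, tpr)))                         # the points that form the ROC curve
print(roc_auc_score(y_true, scores))               # area under the ROC curve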


3 Methods

3.1 HALVING for online supervised learning

3.1.1 HALVING for Prediction From Expert Advice

Prediction is about speculating on the next elements of a sequence of events. Examples of prediction problems include (1) forecasting tomorrow's weather at a certain location, or (2) guessing whether stocks will go up or down in the near future [15]. There are many ways this can be done; here we describe one particular method, the Halving Algorithm. In this algorithm, we have an unknown sequence of bits y_1, y_2, ..., y_t ∈ {0, 1}. At each time step, a learner/forecaster makes a prediction p̂_t ∈ {0, 1}. To form p̂_t, the learner/forecaster relies on the advice given by a set of N experts, where each expert casts a vote (f_{1,t}, ..., f_{N,t}) with f_{i,t} ∈ {0, 1} being the prediction made by expert i for the next bit y_t [15]. The advice is then based on the majority of the votes: if the number of experts that voted 1 is larger than the number of experts that voted 0, the learner/forecaster takes the advice of the majority and predicts p̂_t = 1, and vice versa. Then the true value of y_t is revealed: if p̂_t ≠ y_t, i.e. the majority was wrong, the learner/forecaster removes all experts that made a mistake (in this example, the experts who voted 1). The goal is to minimize the number of mistakes.
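A minimal Python sketch of this (hard) Halving procedure; the bit sequence and the experts' votes are placeholder data for illustration:

# Sketch: majority vote over the active experts; wrong experts are removed on a mistake.
def halving(bits, expert_votes):
    """bits: list of 0/1 outcomes; expert_votes[t][i]: vote of expert i at step t."""
    active = set(range(len(expert_votes[0])))            # start with all experts active
    mistakes = 0
    for t, y in enumerate(bits):
        votes = [expert_votes[t][i] for i in active]
        prediction = 1 if sum(votes) > len(votes) / 2 else 0   # majority vote
        if prediction != y:
            mistakes += 1
            active = {i for i in active if expert_votes[t][i] == y}  # drop the wrong experts
        if not active:                                    # no perfect expert existed
            break
    return mistakes, active

bits = [1, 0, 1]
expert_votes = [[1, 1, 0], [0, 1, 1], [1, 1, 0]]          # three experts, three rounds
print(halving(bits, expert_votes))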

3.1.2 The HALVING Algorithm: Soft implementation using constant factor Alpha

The previous implementation of the Halving Algorithm assumed that some expert never makes a mistake. Working without that assumption means that we iterate knowing that at some point we might end up with no experts in the set. Therefore, a new extension was proposed [15] that is very similar to the original one, but where we add weights w_i = 1 to the experts i = 1, 2, ..., N. The advice is now based not only on the majority of the experts' votes, but also on their weights: if the total weight of the experts that voted 1 is larger than the total weight of the experts that voted 0, then p̂_t = 1, and vice versa. When the true bit y_t is revealed, the learner/forecaster multiplies the weights of the experts that made a mistake by a constant factor 0 < β < 1, that is w_i ← βw_i, which down-scales the weights of those experts that were wrong so that their vote now has less influence.
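A minimal sketch of this soft (weighted) variant, with placeholder bits, votes and a hypothetical value of β:

# Sketch: weighted majority; wrong experts are down-weighted by beta instead of removed.
def weighted_halving(bits, expert_votes, beta=0.5):
    weights = [1.0] * len(expert_votes[0])
    mistakes = 0
    for t, y in enumerate(bits):
        weight_for_1 = sum(w for w, v in zip(weights, expert_votes[t]) if v == 1)
        weight_for_0 = sum(weights) - weight_for_1
        prediction = 1 if weight_for_1 > weight_for_0 else 0    # weighted majority vote
        if prediction != y:
            mistakes += 1
        weights = [w * beta if v != y else w                    # down-weight the wrong experts
                   for w, v in zip(weights, expert_votes[t])]
    return mistakes, weights

print(weighted_halving([1, 0, 1], [[1, 1, 0], [0, 1, 1], [1, 1, 0]]))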

3.2 FADO for online unsupervised learning

Online Fault Detection (FADO), developed by Kristiaan Pelckmans [12], is the algorithm lying at the basis of this thesis: it offers an alternative, deterministic detection methodology to the usual stochastic methods, specifically for instances in which a stochastic setting cannot be used. Moreover, in one of the cases implemented in this thesis, we work with a stream of financial transactions that consists mostly of normal, non-fraudulent transactions, with only a small fraudulent (non-normal) part; the transactions are clearly not sampled randomly, and therefore a stochastic setting was not used.

FADO (see Algorithm 2) tries to infer this property ("normal" or "fraudulent") of a transaction after each iteration, which in this case is a time step. For simplification it is assumed that we receive one transaction per time step; FADO then makes a decision after each time step, and this setting is what makes it online and unsupervised [12]. An alarm in Algorithm 2 only indicates that there is something wrong and that more analysis (inspection of the transaction) is needed, and it decides how FADO uses the exploration-exploitation trade-off, by taking two values:

• Alarm is false : that leads to further learning (exploration).

• Alarm is true: the model was adequate ("the transaction is faulty"), leading to exploitation.

Algorithm 2: FADO
initialize w_0 = 0_d
for t = 1, 2, ... do
  (1) Receive transaction y_t ∈ R^n.
  (2) Raise an alarm if ‖y_t − w_{t−1}‖_2 > ε, and set
      v_t = (y_t − w_{t−1}) / ‖y_t − w_{t−1}‖_2 ∈ R^n.
  (3) If an alarm is raised, update
      w_t = w_{t−1} + γ_t v_t.
      Otherwise, set w_t = w_{t−1}.
end for

Algorithm 2: A formal setup of FADO, where a transaction is represented as a vector y_t ∈ R^n and where no side-information is available for this transaction. The fraudulent or non-fraudulent decision is encoded by a vector w ∈ R^n and a parameter ε > 0 [12].
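For illustration, here is a minimal Python sketch of this update loop, assuming numpy; the threshold ε, the step size γ and the synthetic transaction stream are placeholder values (the exact step-size schedule is given in [12]):

# Sketch: FADO-style loop; raise an alarm on a large deviation and nudge the model.
import numpy as np

def fado(transactions, epsilon=1.0, gamma=0.1):
    w = np.zeros(transactions.shape[1])        # the current model of "normal" behaviour
    alarms = []
    for t, y in enumerate(transactions):
        residual = y - w
        if np.linalg.norm(residual) > epsilon: # deviation too large: raise an alarm
            alarms.append(t)
            v = residual / np.linalg.norm(residual)
            w = w + gamma * v                  # update the model towards the transaction
    return w, alarms

rng = np.random.default_rng(0)
stream = rng.normal(size=(1000, 5))            # placeholder stream of 5-dimensional transactions
print(fado(stream)[1][:10])                    # indices of the first few alarms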


3.3 HALFADO for online semi-supervised learning

HALFADO is a semi-supervised algorithm, lying between supervised and unsupervised learning in terms of labeling: we only need to label some samples to help the algorithm. Its advantage is that it does not need a huge annotation effort like supervised learning, and it furthermore outperforms unsupervised learning. The algorithm extends the usage of the Halving algorithm through the expert framework [17], and works on a meta-level instead of on an explicit numerical representation as in FADO. In the previous section we saw that the Halving algorithm uses a majority rule of θ = 50%, that is, more than 50% of the experts have to raise an alarm about a transaction/tweet in order for it to be inspected. HALFADO, however, uses a majority rule of only θ = 1%. This is beneficial when the active set gets relatively small, e.g. 100 experts, leading to an inspection whenever a single expert raises an alarm [17].

Algorithm 2: expert(msg, A)
Result: yh, a binary vector representing the vote of each expert
initialize m : integer, ns : integer, P : m x ns matrix
yh ← [0] * len(A)
for i = 1, 2, ..., len(A) do
  if msg · P[A[i]] > 0 then
    yh_i ← 1
  else
    yh_i ← 0
  end if
end for
return yh

The expert algorithm (Algorithm 2) takes as input msg, a binary vector explained in section 4.2.2, and A, a list that maintains the indexes of the experts that are in the pool/active set.


Algorithm 3: HALFADO
initialize: yh, a vector to hold the experts' votes; A, the set of active experts
for t = 1, 2, ... do
  (0) Receive a msg/transaction xh
  (1) yh = the votes of all experts
  (2) Raise an alarm if any expert voted 1 (True)
  if an alarm is raised then
    (3) prediction = the majority vote
    (4) Manually inspect the msg/transaction
    (5) actual = the actual label
  end if
  if prediction != actual then
    (6) update A
  end if
end for

The above pseudocode for the HALFADO algorithm shows, at a high level, the steps of the algorithm designed for the social media case. The vote of each expert is computed as the dot product of the binary vector representing a tweet (xh) and the vector representing that expert, that is, yh[i] = xh · P[A[i]] for i = 1, 2, ..., N; section 4.2.2 further explains this equation. In step 6 we update the active set of experts A by the method mentioned in section 3.1.2.

Algorithm 4 shows how an alarm is raised in HALFADO with the alarm method, which checks the yh vector (the vector representing the experts' votes); if at least one expert voted 1, an alarm is raised. In the main algorithm we only make a prediction if an alarm is raised; the prediction is then made with the pred_label method (see Algorithm 5), according to the majority rule, where, as mentioned before, the majority is 1% of the experts in the pool/active set.


Algorithm 4: alarm(yh)
Result: alarm : bool
initialize alarms ← [False] * len(yh)
for i = 1, 2, ..., len(yh) do
  if yh[i] > 0 then
    alarms[i] ← True
  else
    alarms[i] ← False
  end if
end for
if alarms contains any True value then
  alarm ← True
else
  alarm ← False
end if
return alarm

Algorithm 5: pred_label(yh)
if sum(yh) > 0.01 * len(yh) then
  pred ← 1
else
  pred ← 0
end if
return pred
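As a companion to the pseudocode above, here is a minimal Python sketch of one HALFADO step, assuming numpy; the projection matrix P, the inspect() oracle and the synthetic message stream are placeholder stand-ins for the real experts and the manual inspection:

# Sketch: expert votes via random projections, alarm, 1% majority, hard update.
import numpy as np

ns, m, theta = 500, 1000, 0.01            # vector length, number of experts, majority rule
rng = np.random.default_rng(0)
P = rng.normal(size=(m, ns))              # one random projection per expert
active = list(range(m))                   # indexes of the active experts

def inspect(msg):                         # hypothetical stand-in for the manual inspection
    return int(msg.sum() > ns / 2)

def halfado_step(msg):
    global active
    votes = (P[active] @ msg > 0).astype(int)             # each active expert's 0/1 vote
    if votes.any():                                        # alarm: at least one expert voted 1
        prediction = int(votes.sum() > theta * len(active))
        actual = inspect(msg)
        if prediction != actual:
            # hard update: keep only the experts that voted correctly
            active = [e for e, v in zip(active, votes) if v == actual]

for _ in range(100):
    halfado_step(rng.integers(0, 2, size=ns))              # placeholder binary messages
print(len(active))                                         # experts remaining in the pool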


4 HALFADO in Two Case Studies

In this section, the datasets for both cases are explained, along with the structure of an incoming transaction/tweet and the setup of each case. Our setup for implementing both cases (Fin-Tech and Social Media) is the same: it is divided over three parts, although the corresponding part in each case is rather different in design. Two of these parts run on a personal/desktop computer, and the third on a Raspberry Pi; the computer and the Raspberry Pi are connected via either an ethernet cable or wirelessly. The implementation of HALFADO on the Raspberry Pi has two versions, one written entirely in C for the financial transactions case, and the other written in Python for the social media case. The reason for having two cases/implementations is to show how flexible HALFADO is, and its ability to adapt to two different domains. The part where the implementation is mostly the same is the visualization part, which uses Python code on a desktop computer to visualize the results. After processing the tweets/transactions on the Raspberry Pi, the Pi sends the results back via the socket API. The results here are tp, fp, fn, tn, the number of processed tweets/transactions, all tweets/transactions that were identified as hate/fraudulent, etc. In the social media case, for example, one could return all sorts of results, depending on the kind of insights one is looking to get from the data; for instance, it can return the rate at which hate tweets are being posted by users. There are many possible analyses that can be done over the results, and a lot of information to gain. The visualisation was implemented as a Python app with Flask and Angular.js, to view the detection in real time.

4.1 Case study in Fin-Tech

4.1.1 Dataset

The dataset used here was locally generated; the structure of a transaction was based on the financial transaction structure proposed in [18]. Each generated transaction represents a simple buy/sell transaction, and consists of these features: ISIN number, quantity, unit price, issuer ID, receiver ID, and transaction type (buy, sell).

4.1.2 Setup

As mentioned before the setup contains the following parts:

Python app on computer to generate transactions Each feature in a transaction (see section 4.1.1) was carefully generated to mimic a real-life financial transaction as closely as possible, since financial transactions are sensitive information and we could not get a real dataset of transactions. Each generated transaction was sent to the Raspberry Pi through Python's socket API; a small sketch of this sender/receiver pair is given after the next item.

Figure 8: An example showing the structure of a generated transaction

HALFADO algorithm on Raspberry Pi This part took the longest to develop compared to its equivalent in the social media case; it is written in C from scratch. We start by managing a queue that listens on the socket for incoming transactions, and once one is received, it is handed to HALFADO. As an extension of the previously mentioned Halving algorithm, we maintain a pool of experts; the voting mechanism here, however, is given by hand-written/hard-coded rules. To clarify, the voting is based on the country code from which the transaction originated. We initialized a set of country codes, which were also used to generate the transactions; each expert is assigned one code and votes 1 if that code appears in the transaction (which raises an alarm), otherwise it votes 0. The inspection is also done by hand-written rules: we have a small sub-set of country codes such that, when one of them is found in a transaction, the true label of that transaction is fraudulent. Hence, all experts voting on country codes that exist in this small sub-set will, over the lifetime of the algorithm's run, keep larger weights, while the other experts are down-weighted.
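For illustration, here is a minimal sketch of such a sender/receiver pair written in Python using the standard socket module (the receiver on the Pi is written in C in the thesis); the host, port and JSON transaction format are placeholders:

# Sketch: push generated transactions from the desktop and hand them to HALFADO on the Pi.
import json
import socket

HOST, PORT = "192.168.1.10", 5005          # hypothetical address of the Raspberry Pi

def send_transaction(tx: dict) -> None:
    """Runs on the desktop: serialize one generated transaction and push it out."""
    with socket.create_connection((HOST, PORT)) as s:
        s.sendall(json.dumps(tx).encode() + b"\n")

def receive_loop(handle) -> None:
    """Runs on the Pi: accept connections and hand each transaction to the detector."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("", PORT))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            with conn:
                for line in conn.makefile():
                    handle(json.loads(line))   # e.g. one HALFADO step per transaction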

4.1.3 Performance

As our implementation for this case was based on hard-coded rules for the experts to vote on, and on transactions that we generated ourselves, the performance measurement is mostly relative to that setting and therefore cannot be compared to the performance of other algorithms. The detection capability, namely precision and recall, is not very informative in this setting, so we only present the computational requirements on both a personal computer and a Raspberry Pi. Table 2 shows how many transactions HALFADO can process per second on average, and the total time it took to process 500,000 transactions on both the personal computer and the Raspberry Pi.

                    Transactions per second   Total processing time
Personal computer   1189                      7.6 min
Raspberry Pi        724                       12.15 min

Table 2: A comparison of HALFADO's computational time on a personal computer (2015 laptop, Ubuntu, 4 GB RAM) and a Raspberry Pi (version 4 model B, 4 GB RAM).

4.2 Case Study in Social Media

4.2.1 Dataset

We use the Twitter API to access and download tweets; for confidentiality reasons we only keep the text of each tweet from all the information retrieved from the API, and delete the other information.

The Twitter API offers access to all features of Twitter without having to go through the website interface. To access the API, one has to go through the following steps:

• Have a Twitter account.

• Using the Twitter account, apply for developer access.

• Create an application with the developer access; this will generate the API credentials used for accessing Twitter.

To access the API from the app, one needs to provide the following keys, which can be found on the application created with the developer access:

• Consumer key.
• Consumer secret key.
• Access token key.
• Access token secret key.

These keys are specific to the application created, so sharing them does not work. Having all of that set up grants the ability to:


• Send a tweet.

• Search twitter for tweets, e.g. to get all tweets within a specific time period, or tweets that include a specific keyword.

• Keep or remove re-tweets.

4.2.2 Setup

The setting of the two cases' setups is similar; however, the design and implementations are significantly different. The following parts were developed in a much more dynamic, adaptable approach than the static Fin-Tech approach mentioned above.

Python app to get the stream of tweets from the API on the computer We set up a stream listener that gets all information from users tweeting/retweeting in the English language, drop all user, time and location information keeping only the text of the tweet, and then send the retrieved tweet over sockets to the Raspberry Pi; a short sketch of such a listener is given below.
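A minimal sketch of such a listener, assuming the tweepy library (3.x API); the credentials and track keywords are placeholders, and forward_to_pi() stands in for the socket code that ships each tweet text to the Raspberry Pi:

# Sketch: stream English-language tweets and forward only the text.
import tweepy

def forward_to_pi(text: str) -> None:
    ...   # send the bare tweet text over the socket, as in the Fin-Tech sketch

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        forward_to_pi(status.text)         # keep only the text, drop user/time/location

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
stream = tweepy.Stream(auth=auth, listener=TweetListener())
stream.filter(track=["the", "a", "is"], languages=["en"])   # placeholder keyword list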

HALFADO algorithm implemented on Raspberry Pi In this version, we have a queue that maintains the tweets coming in over the socket; any incoming tweet is appended to that queue, the algorithm checks it, and when a tweet is ready, it is popped off the queue to be processed.

BoW As mentioned in section 2.2, the BoW model was used there to build the vocabulary; in this Social Media case it was built manually in its simplest form, namely by taking the words from the Oxford 3000 word list and saving them in our BoW list. Our goal here is to convert each tweet into a numerical representation to be processed by the algorithm. This is done by taking a tweet out of the queue, converting it to a list of words, checking whether each word is in the BoW list, and building a corresponding binary vector with a value of 1 if the corresponding word exists in the BoW list and 0 otherwise, ending up with a binary vector representation of the tweet (a short sketch is given below).
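A minimal sketch of this binary encoding; the short word list is a placeholder for the Oxford 3000 list:

# Sketch: encode a tweet as a binary presence vector over a fixed vocabulary.
import string

oxford_words = ["this", "is", "good", "people", "hate", "understand"]   # placeholder vocab

def encode(tweet: str) -> list:
    """Return a binary vector: 1 if the vocab word occurs in the tweet, else 0."""
    cleaned = tweet.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = set(cleaned.split())
    return [1 if word in tokens else 0 for word in oxford_words]

print(encode("This is good, this I understand"))   # -> [1, 1, 1, 0, 0, 1]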

Halving algorithm We initialize the pool of experts, and pass the binary vector to get each expert's vote. This is done in the following way. Let ns be the length of the binary vector, and m the number of experts in the pool; we used ns = 500 and m = 1000. We build P, an m x ns matrix, and fill each row with draws from a random normal distribution (Figure 9 shows an example of P). We then take the dot product of the binary vector representing the tweet with each row of P. If the result of the dot product is larger than 0, the vote of that expert is 1, and otherwise 0. With the votes of all experts collected in a list, we first check whether any of them voted 1, and if so, we make our own prediction based on the majority vote with θ = 1%: if the number of experts that voted 1 is larger than 1% of the total number of experts, our prediction is 1, and 0 otherwise. We now have a predicted label for whether this tweet is a hate tweet or not. If the prediction was 1, a manual inspection follows that reveals the actual label. This is how the algorithm learns: if the inspection led to labeling the tweet as "hate", the algorithm learns what a "hate" tweet looks like. Furthermore, when the predicted label is equal to the actual label, we know that at least 1% of the experts were right, and we move on to process the next tweet. Otherwise the majority failed, and we have to update the pool of experts, either with the hard approach, by eliminating all experts that voted wrongly, or the soft approach, by down-weighting them. We usually use the soft way when we want to keep a large number of experts for as long as possible throughout the run of the application, because the hard way relies on eliminating experts after each iteration. Here, however, we use the hard way, so we update the pool by eliminating experts.

Figure 9: m = 10, and ns = 3, P is a 10 x 3 matrix

4.2.3 Performances

For comparison purposes, and since both the supervised and unsupervised algorithms were trained offline on a dataset of tweets, we present the results of the social media case study by reading tweets line by line from a dataset instead of reading them as streams directly through the API; the dataset was obtained from the API and compiled into a csv file of 500,000 tweets. As mentioned in section 3.3, HALFADO only makes predictions on tweets for which an alarm was raised. Thus, our evaluation metrics are changed a bit: for example, when computing the accuracy, the denominator is the number of inspected tweets instead of 500,000. Precision and recall for HALFADO and for the unsupervised and supervised algorithms are shown below, along with the ROC curve for the supervised algorithm (Figure 10).

Algorithm                          Precision   Recall    Time
Semi-supervised (HALFADO)          37.20%      32.85%    10 min
Unsupervised (K-means)             50%         61%       22 min
Supervised (Naive Bayes)           73%         80%       0.15 min

Table 3: Performance metrics achieved on the social media case.

As both the supervised and unsupervised implementations used a batch/offline learning model, it is only natural for an online learning algorithm like HALFADO to come after them in terms of precision and recall, and that is what Table 3 shows. A better option for a fair comparison would have been to compare HALFADO to other online supervised and unsupervised algorithms; however, most of the designs/libraries/implementations of currently existing online supervised and unsupervised algorithms are either weakly documented or not supported well enough for us to customize them to our cases.

Table 3 also shows that our implementation of the unsupervised algorithm achieved a precision of 50%; that is to be expected, since we used an explorative approach (cluster analysis) for classification.


4.2.4 Performance comparison

The following figures show the comparison of the three algorithms in terms of precision and recall (see Figure 11), and in terms of the processing time for 500,000 tweets (see Figure 12).

Figure 11: The Supervised algorithm achieved a precision and recall of (73%, 80%), and the Unsupervised algorithm achieved (50%, 61%), while HALFADO achieved (37.20%, 32.85%)


Figure 12: The processing of 500000 tweets with the Supervised algorithm took 0.15 min, and the Unsupervised algorithm took 22.08 min, while HALFADO only took 10.5 min on a raspberry pi

5 Discussion

5.1 Performance

The evaluations done in this thesis have compared HALFADO to two other implementations, of a supervised and an unsupervised algorithm; HALFADO comes in the middle in terms of computation time, and scores lower than the other two in terms of precision and recall. The point was to show how HALFADO performs compared to well-designed algorithms that are used in applications everywhere nowadays. In [17], however, HALFADO's performance is compared to other detectors that are based on the occurrence of a single word from the Oxford 3000 word list.

This performance, with its computational power and efficiency, allows for the processing of more than 180 tweets per second on a Raspberry Pi, all without optimized code or powerful external libraries for Natural Language Processing, unlike the other two algorithms.

Most of the computation time was spent in computing the experts themselves, that is, constructing the random projections of the BoW representation, mentioned before as the m x ns matrix P; the computational overhead of implementing HALFADO on top of this group of experts is very small [17].


Each expert can be thought of as picking up on a different interesting feature in a message/vector, on which it bases its vote; HALFADO's training/learning is then really nothing more than eliminating experts whose vote was based on a badly discovered feature, rather than the optimization of a cost/loss function [17].

5.2 Existence Condition

The previous setup for HALFADO is based on an assumption called the existence condition, which states that among the m experts there is at least one perfect expert, i.e. an expert that makes the right vote throughout the processing of all the data. This assumption is vital for both the theoretical and the practical parts of HALFADO. If the assumption fails, it consequently leads to a failure in detecting the anomalous/out-of-the-ordinary data points, bringing an unbounded number of false positives; and since an expert is eliminated after a single mistake, regardless of how good it was, the condition becomes ever more likely not to be maintained. In practice, however, setting a large enough number of experts at the beginning makes it likely that the condition is satisfied. Modifying HALFADO to the weighted version discussed in section 3.1.2 would help in avoiding the issues with that condition. However, this would be at the expense of computational resources. [17]

Section 3.1.2 explains the implementation of the weighted version. Simply put, all the experts have a weight, and when an expert makes a mistake, instead of being eliminated, its weight is reduced. This scheme, however, requires all experts to be checked at each iteration, hence the effect on the computational effort. The weighted version was introduced to keep as many experts in the pool/active set for as long as possible, and to make sure that the pool of experts is never empty. In order to increase the robustness in the direction of allowing some errors, HALFADO could alternatively evict an expert that made a mistake from the active set only with a certain probability 0 < α < 1; by choosing α sufficiently small, a few mistakes will likely be tolerated. Here we set α = 0.01 (a small sketch of this eviction rule is given below).
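A minimal sketch of this probabilistic eviction rule; the function name and arguments are hypothetical helpers for illustration:

# Sketch: a wrong expert is removed only with probability alpha; correct experts stay.
import random

def evict_probabilistically(active, votes, actual, alpha=0.01):
    """Keep correct experts; remove a wrong expert only with probability alpha."""
    return [e for e, v in zip(active, votes)
            if v == actual or random.random() > alpha]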


6 Conclusion

6.1 Conclusion of the work

HALFADO offers an efficient way for online detection of anomalous/out-of-the-ordinary events in data streams that is both theoretically proven and practically reliable and fast [17], and that allows working with non-numerical representations of data streams from different fields, without the need for feature design [17]. We presented two case studies of HALFADO's usage, in FinTech and in Social Media, which showed that detection can be done in a non-stochastic expert setting, since neither the individual financial transactions nor the social media messages/tweets were sampled from a stochastic model.

This thesis showed how HALFADO performs compared to other, supervised and unsupervised, algorithms, but the main goal of that comparison was to present HALFADO as a valid option for anomaly detection in the context of real-time information flows, and to invite others to invest in building ML models/libraries for handling streams. Detection problems have been researched thoroughly in different domains, but only a few such models have been built (e.g. Creme), and even those do not have that much support.

6.2 Open Problems

6.2.1 Feature selection in HALFADO

With IoT/ubiquitous devices constantly generating data, we enter the age of big data and big data mining, that is, getting useful insights from that data. Naturally, the dimensionality of the data is very high (dimensions represent features), and not all features are relevant or important for training an ML model. To deal with this high dimensionality, feature selection is applied to choose only the most relevant features that best represent the data. However, feature selection in its traditional sense assumes that all features and data are presented to a model before it starts selecting features (batch learning). That is not always the case, in particular when data arrives in a sequential manner over time (online learning), as in some real-world applications, e.g. object detection, where it is infeasible to wait until all data is available to perform feature selection. Hence, we need to apply online feature selection upon each data point's arrival.


Despite the rapid increase in applications that try to utilize online feature selection in their implementations, there are still a lot of problems with online feature selection; the survey by Hu et al. [19] presents some of them as follows:

• Existing online feature selection algorithms mainly focus on the handling of single label classification. However, in many scenarios, the instances may have two or more labels.

• In real-world online applications, the quality of data cannot be guaranteed, such as lack of attribute values, noisy data and so on.

• With the rapid growth of the amount of data, centralized online feature selection algorithms will become increasingly unable to meet the requirements on computational performance; thus, distributed online feature selection algorithms will become another challenge in the future.

HALFADO is one of the online learning algorithms that does not utilize feature selection in its implementation; this, however, could be the topic of further research, as it would boost HALFADO's performance.

6.2.2 Unsupervised Online Learning

While HALFADO learns and predicts in a semi-supervised online manner, with the intervention of human help in some cases to validate and further inspect, unsupervised online learning requires no human help, and its applications and research solutions still lag behind supervised online learning. In [20] an unsupervised online learning approach is used to detect moving objects (continuous data streams) using an automatic labeler based on the background subtraction technique, which labels the data points before feeding them to the learner. The labeler has a poor overall accuracy, but the technique still produced an effective learner. This is a good idea for the problem it was designed for, but in critical systems, for example, failing too often (making a lot of mistakes) could have very adverse effects. Another example is introduced by Parveen et al. [21], who use unsupervised online learning for insider threat detection, by extracting the most repetitive patterns and storing them in a compressed dictionary, then comparing the test data to these stored patterns; if it differs from these patterns by a certain percentage, it is considered an anomaly. Although this setting addresses some of the problems of dealing with unbounded, dynamic data streams, we cannot truly call it an online learning approach, since the processing is not done in real time and some storage resources are needed to maintain the dictionaries. Furthermore, it uses the batch learning paradigm of having training and test sets.

This shows the need for new algorithms that exploit the benefits of deploying an adaptable and efficient unsupervised online learning model that can handle many types of data as they are.


7 Further Work

A step beyond the current HALFADO setup, where the stream comes in at one place (centralized), would be to consider the case where multiple streams come from different sources (decentralized); this can be the case, for example, when an international entity has servers/nodes all over the world. If we have a setup of the semi-supervised algorithm on each of the servers/nodes across the world, we can couple the algorithms on the nodes by sending update messages between nodes whenever necessary. This would be an instance of distributed ML.


References

[1] Ekaba Bisong. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. Apress, Berkeley, CA, 1st edition, 2019.

[2] Yuwei Cui, Subutai Ahmad, and Jeff Hawkins. Continuous online sequence learning with an unsupervised neural network model. Neural computation, 28(11):2474–2504, 2016.

[3] Yunyue Zhu and Dennis Shasha. Statstream: Statistical monitoring of thousands of data streams in real time. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases, pages 358–369. Elsevier, 2002.

[4] Gianmarco De Francisci Morales, Albert Bifet, Latifur Khan, João Gama, and Wei Fan. IoT big data stream mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2119–2120, 2016.

[5] Charu C Aggarwal. Data streams: models and algorithms, volume 31. Springer Science & Business Media, 2007.

[6] Gerhard Widmer and Miroslav Kubat. Learning in the presence of concept drift and hidden contexts. Machine learning, 23(1):69–101, 1996.

[7] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194, 2012.

[8] Mohamed Medhat Gaber, João Gama, Shonali Krishnaswamy, João Bártolo Gomes, and Frederic Stahl. Data stream mining in ubiquitous environments: state-of-the-art and current directions. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(2):116–138, 2014.

[9] Terran D Lane. Machine learning techniques for the computer security domain of anomaly detection. Doctoral thesis, Purdue University, 2002.

[10] Ted Dunning and Ellen Friedman. Practical machine learning: a new look at anomaly detection. O’Reilly Media, Inc., 2014.

[11] Miel Verkerken. Monitoring Financial Transactions: Efficient Algorithms for Streaming Data. Master’s thesis, Uppsala University, Uppsala, Sweden.

[12] Kristiaan Pelckmans. FADO: A deterministic detection/learning algorithm. CoRR, abs/1711.02361, 2017.

[13] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer-Verlag New York, 2nd edition, 2009.


[14] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. Vorabversion eines Lehrbuchs, 5, 2016.

[15] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[16] Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.

[17] Kristiaan Pelckmans and Moustafa Aboushady. Detection in Real-time Information Flows. Under review.

[18] Zahra Rabiei. Identifying intended and unintended errors in financial trans-actions: a case study. Master’s thesis, Uppsala University, Sweden, 2017.

[19] Xuegang Hu, Peng Zhou, Peipei Li, Jing Wang, and Xindong Wu. A survey on online feature selection with streaming features. Frontiers of Computer Science, 12(3):479–493, 2018.

[20] Vinod Nair and James J Clark. An unsupervised, online learning framework for moving object detection. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., volume 2, pages II–II. IEEE, 2004.

[21] Pallabi Parveen and Bhavani Thuraisingham. Unsupervised incremental sequence learning for insider threat detection. In 2012 IEEE International Conference on Intelligence and Security Informatics, pages 141–143. IEEE, 2012.

