
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Analysis of machine learning for human motion pattern recognition on embedded devices

A study of the implementation of machine learning on embedded devices and important aspects regarding hardware limitations

TOMAS FREDRIKSSON RICKARD SVENSSON

KTH

SKOLAN FÖR INDUSTRIELL TEKNIK OCH MANAGEMENT


Analysis of machine learning for human motion pattern recognition on embedded devices

TOMAS FREDRIKSSON AND RICKARD SVENSSON

Master in Mechatronics
Date: August 16, 2018

Supervisor: Joakim Gustavsson (joagusta@kth.se)
Examiner: Martin Törngren (martint@kth.se)

Swedish title: Analys av maskininlärning för igenkänning av mänskliga rörelser på inbyggda system

School of Industrial Technology and Management


Master of Science Thesis TRITA-ITM-EX 2018:464

Analysis of machine learning for human motion pattern recognition on embedded devices

Tomas Fredriksson and Rickard Svensson

Approved:
Examiner: Martin Törngren
Supervisor: Joakim Gustavsson
Commissioner: Tritech Technology AB
Contact person: Ludvig Michaelsson

Abstract

With an increasing number of connected devices and the recent surge of artificial intelligence, the two technologies need more attention to fully bloom as useful tools for creating new and exciting products.

As machine learning traditionally is implemented on computers and online servers, this thesis explores the possibility to extend machine learning to an embedded environment. This evaluation of existing machine learning in embedded systems with limited processing capabilities has been carried out in the specific context of an application involving classification of basic human movements. Previous research and implementations indicate that it is possible with some limitations. This thesis aims to answer which hardware limitation affects classification the most and what classification accuracy the system can reach on an embedded device.

The tests used human motion data from an existing dataset and covered four different machine learning algorithms on three devices.

Support Vector Machine (SVM) was found to perform best compared to CART, Random Forest and AdaBoost. It reached a classification accuracy of 84.69% between the six included motions, with a classification time of 16.88 ms per classification on a Cortex M4 processor. This is the same classification accuracy as the one obtained on the host computer with more computational capabilities. Other hardware and machine learning algorithm combinations had a slight decrease in classification accuracy and an increase in classification time. Conclusions could be drawn that the memory on the embedded device affects which algorithms can be run and the complexity of data that can be extracted in the form of features, while processing speed mostly affects classification time. Additionally, the performance of the machine learning system is connected to the type of data that is to be observed, which means that the performance of different setups differs depending on the use case.

Keywords: ai, machine learning, embedded systems, internet of things, human motion analysis, support vector machines, decision tree, random forest, features, cortex m-series


Examensarbete TRITA-ITM-EX 2018:464

Analys av maskininlärning för igenkänning av mänskliga rörelser på inbyggda system

Tomas Fredriksson and Rickard Svensson

Godkänt:
Examinator: Martin Törngren
Handledare: Joakim Gustavsson
Uppdragsgivare: Tritech Technology AB
Kontaktperson: Ludvig Michaelsson

Sammanfattning

Antalet uppkopplade enheter ökar och det senaste uppsvinget av artificiell intelligens driver forskningen framåt till att kombinera de två teknologierna för att både förbättra existerande produkter och utveckla nya. Maskininlärning är traditionellt sett implementerat på kraftfulla system, så därför undersöker den här masteruppsatsen potentialen i att utvidga maskininlärning till att köras på inbyggda system. Den här undersökningen av existerande maskininlärningsalgoritmer, implementerade på begränsad hårdvara, har utförts med fokus på att klassificera grundläggande mänskliga rörelser. Tidigare forskning och implementation visar på att det ska vara möjligt med vissa begränsningar. Den här uppsatsen vill svara på vilken hårdvarubegränsning som påverkar klassificering mest samt vilken klassificeringsgrad systemet kan nå på den begränsade hårdvaran.

Testerna inkluderade mänsklig rörelsedata från ett existerande dataset och omfattade fyra olika maskininlärningsalgoritmer på tre olika system. SVM presterade bäst i jämförelse med CART, Random Forest och AdaBoost. Den nådde en klassifikationsgrad på 84,69% på de sex inkluderade rörelsetyperna med en klassifikationstid på 16,88 ms per klassificering på en Cortex M-processor. Detta är samma klassifikationsgrad som en vanlig persondator når med betydligt mer beräkningsresurser. Andra hårdvaru- och algoritmkombinationer visar en liten minskning i klassificeringsgrad och ökning i klassificeringstid.

Slutsatser kan dras att minnet på det inbyggda systemet påverkar vilka algoritmer som kan köras samt komplexiteten i datan som kan extraheras i form av attribut (features). Processeringshastighet påverkar mest klassificeringstid. Slutligen är prestandan för maskininlärningssystemet bunden till typen av data som ska klassificeras, vilket betyder att olika uppsättningar av algoritmer och hårdvara påverkar prestandan olika beroende på användningsområde.

Nyckelord: ai, maskininlärning, inbyggda system, internet of things, mänskliga rörelser, support vector machines, decision tree, random forest, features, cortex m-series


Acknowledgments

To our families and friends, thank you for your support during this thesis and our previous academic years. This achievement would have been nearly impossible without you.

To our supervisor at The Royal Institute of Technology, Joakim Gustavsson, thank you for your enthusiastic support and valuable feedback during the early stages of the project. We wish you the best in the future! To our examiner, Martin Törngren, thank you for your feedback and guiding us in the academic world. Also to our coordinator, Damir Nesic, thank you for helping us with all administrative tasks and feedback regarding the report.

To everyone at Tritech Technology AB thank you for welcoming us with open arms during this thesis. Especially to our supervisor Ludvig Michaelsson for helping us form the topic and giving us valuable feedback throughout the entire project. We also want to thank our manager Boe Sjögren for giving us the opportunity to do this thesis. We want to thank our fellow thesis colleagues at the company for all the feedback and interesting discussions during this time. Everyone else at the office deserves to be mentioned for always assisting us and keeping a good spirit as well as providing us with useful information that helped us during the thesis.

Thank you all for supporting us during this thesis and our academic years, you have all made this possible.

With our warmest regards,
Tomas Fredriksson & Rickard Svensson
Stockholm, Sweden, 12th June, 2018


Contents

Abstract
Sammanfattning
Acknowledgments
Glossary
Acronyms

1 Introduction
1.1 Purpose
1.2 Scope
1.2.1 Delimitations
1.3 Methodology
1.4 Ethics
1.5 Disposition

2 Background
2.1 Machine learning algorithms
2.1.1 K Nearest Neighbour
2.1.2 Naïve-Bayes
2.1.3 Artificial Neural Networks
2.1.4 Decision Trees
2.1.5 Support Vector Machines
2.1.6 Ensemble Classifiers
2.2 Features
2.2.1 Feature extraction
2.2.2 Feature selection
2.3 Hardware study
2.3.1 Microcontroller
2.3.2 Inertial Measurement Unit
2.4 Sensor fusion
2.4.1 Kalman filter
2.4.2 Complementary filter
2.4.3 Mahony
2.4.4 Madgwick
2.5 Related Works

3 Implementation
3.1 Experimental Setup
3.2 Hardware Setup
3.3 Machine Learning Algorithms
3.4 Feature Selection
3.5 Dataset
3.6 Feature Extraction
3.7 Test Design
3.8 Measurements

4 Results
4.1 Feature Selection
4.2 Feature Extraction
4.3 Machine Learning Algorithms
4.3.1 Classification Accuracy
4.3.2 Performance on Target Hardware

5 Discussion
5.1 Research Questions
5.1.1 Classification grade
5.1.2 Hardware components effect
5.2 Implementation
5.2.1 Feature Selection
5.2.2 Feature Extraction
5.2.3 Machine Learning Algorithms
5.2.4 Hardware
5.2.5 Dataset
5.3 Ethical considerations
5.4 Conclusion
5.5 Future work
5.5.1 Future research
5.5.2 Recommendations for ML on embedded systems

Bibliography

A Features UCI Dataset


List of Figures

1.1 Flowchart of the different phases during the thesis work.
2.1 Model of a neural network with one hidden layer.
2.2 A decision tree with 4 leaf nodes.
2.3 A decision boundary with margin to its three support vectors.
2.4 Data set of observations in a two-dimensional feature space {x1, x2}.
2.5 Principal components {y1, y2} of the data set x.
2.6 Reduced dimensionality of the feature space by projecting observations on y1.
2.7 Rotational axis of angular motion.
2.8 Block representation of Kalman's algorithm.
2.9 Block representation of Complementary filter.
2.10 Block representation of Mahony's algorithm.
2.11 Block representation of Madgwick's algorithm.
3.1 Overview of experimental setup for testing and evaluating machine learning system.
4.1 Mutual information feature selection algorithm with k selected features.
4.2 Mutual information feature selection algorithm with k selected features, close look on less than 80 selected features.
4.3 Feature selection algorithm f_select with a percentile of selected features.
4.4 Feature selection algorithm f_select with a percentile of selected features, close look on less than 15% selected features.
4.5 Classification time per classification and classification accuracy for a variation of number of features.
4.6 Number of features selected and classification accuracy.
4.7 Performance for SVM algorithm with specific features.
4.8 Performance for Decision Tree algorithm with specific features.
4.9 Performance for Random Forest algorithm with specific features.
4.10 Performance for AdaBoost algorithm with specific features.
4.11 Performance for SVM algorithm with specific features and only three classes to be classified.
4.12 Performance for Decision Tree algorithm with specific features and only three classes to be classified.
4.13 Performance for SVM algorithm with specific features and three similar classes to be classified.
4.14 Performance for Decision Tree algorithm with specific features and three similar classes to be classified.
4.15 Classification test for machine learning algorithms with respective target hardware, measuring classification time with feature extraction.
4.16 Classification test for machine learning algorithms with respective target hardware, measuring only the classification time.


List of Tables

3.1 Hardware used for testing the machine learning algorithms classification grade.
3.2 Algorithms used in experiments in the thesis.
3.3 List of operations performed by each algorithm in the feature extraction step.
4.1 Description of feature naming conventions.
4.2 Unused feature types that existed in the dataset.
4.3 Relevant features naming convention.
4.4 Accuracy performance for machine learning algorithms on specific sets of features.
4.5 Specific features for each machine learning algorithm.
4.6 Size of header file for each machine learning algorithm with six classes.
4.7 Size of header file for each machine learning algorithm with three classes.
4.8 Size of header file after compilation for each machine learning algorithm with six classes.
4.9 Classification accuracy with specific hardware and machine learning setup.
4.10 The six classes and their index in the dataset that are being classified during the tests.
4.11 Confusion matrix for SVM machine learning algorithm, with actual class on the left-hand side and predicted class on top.
4.12 Confusion matrix for Decision Tree machine learning algorithm, with actual class on the left-hand side and predicted class on top.
4.13 Confusion matrix for Random Forest machine learning algorithm, with actual class on the left-hand side and predicted class on top.
4.14 Confusion matrix for AdaBoost machine learning algorithm, with actual class on the left-hand side and predicted class on top.
4.15 Classification tests on hardware with elapsed time for each test, with feature extraction.
4.16 Classification tests on hardware with elapsed time for each test, without feature extraction.

Glossary

Accuracy The degree to which the result of a measurement, calculation, or specification conforms to the correct value.

Bias The tendency to output a certain value such as a class or a data signal.

Classification Assigning each input to a finite number of predetermined categories.

Correlation A mutual relationship or connection between two or more things, such as data or features.

Covariance A measure of how much two variables vary together; related to variance, which describes how much a single variable varies on its own.

Curse of Dimensionality As dimensions are added, the difficulty of performing calculations increases exponentially.

Diverse A condition of being composed of different elements, such as feature sensitivity.

Feature An individual measurable property or characteristic of a phenomenon being observed.

Learning Improving the performance on future tasks based on observations from past data.

Overfitting When a classifier becomes too closely fitted to the training data.

Pruning When unimportant parts of a tree are removed.

Quaternion An extended number system for handling complex numbers.


Regression The desired output contains one or several continuous variables.

Underfitting When a classifier is too loosely formed to the training data.

Variance The probabilistic distribution of classifications between the classes.

Window A collection of time-dependent data samples, usually a fixed amount, on which classification can then be performed.

Acronyms

AHRS Attitude and Heading Reference Systems
ALU Arithmetic Logic Unit
ANN Artificial Neural Network
CART Classification And Regression Tree
CPU Central Processing Unit
DAGSVM Directed Acyclic Graph Support Vector Machine
DOF Degrees of Freedom
FA Factor Analysis
FPU Floating Point Unit
GA Genetic Algorithm
ICA Independent Component Analysis
IMU Inertial Measurement Unit
IoT Internet of Things
k-NN K Nearest Neighbour
MARG Magnetic, Angular Rate and Gravity
MEMS Micro-Electro-Mechanized-Systems
MI Mutual Information
PC Principal Component
PCA Principal Component Analysis
PCs Principal Components
RAM Random-Access Memory
ROM Read-Only Memory
RP Random Projections
SFS Sequential Feature Selection
SVM Support Vector Machine
UART Universal Asynchronous Receiver/Transmitter
USB Universal Serial Bus


1 Introduction

Machine learning is a field of study among engineering disciplines which allows a computer system to learn without being explicitly programmed [1]. The topic is currently receiving a lot of attention and many speculate about its usefulness for innovation and product development [2]. One application where machine learning could be relevant is extracting complex information from raw data [3]. With the increasing availability of sensors in new applications and the number of sensors in existing products [4, 5], further extensions of these applications are possible, for example identifying reckless driving instead of just analysing high or low speeds [6].

Tritech Technology AB¹ is a consulting company that specializes in embedded and connected systems. The company is involved in projects focusing on developing intelligent products, which makes machine learning relevant for further exploration of new products. Specifically, motion recognition and monitoring could be applied to some of the products that are currently being developed.

Using machine learning to analyse and classify motion based on data provided by various sensors is traditionally done on high-performance computers [7–13], which raises the question of how the reduction in computational power and memory could change the performance of the system and its ability to run machine learning algorithms with a high percentage of correctly classified motions. The effectiveness of machine learning algorithms is often related to the amount of data available to train on [14]. There are several motion data sets available online, for example a dataset from UC Irvine [15], which can be used to evaluate activity recognition on embedded platforms.

¹ http://www.tritech.se/

1.1 Purpose

The thesis will evaluate how selected machine learning algorithms perform when classifying motions on an embedded platform. Basic human motions such as sitting, standing up, walking and climbing stairs will be used throughout the thesis. Different machine learning algorithms will be trained with the sampled data on a high-performance unit and then run for classification on an embedded processing unit. The selection of algorithms for the project will be a part of the in-depth study. The thesis should answer the following questions:

RQ1 What percentage of correctly classified motion types can the ma- chine learning algorithm reach on the limited processing unit?

RQ2 Which hardware limitation (Processing unit, memory or sensor sampling frequency) is affecting classification the most?

1.2 Scope

The project will involve configuring machine learning algorithms, feature extraction and selection for classifying basic human motion. The accuracy will be based on the percentage of correctly classified human motions. True positives will be considered correct and all other cases are considered incorrectly classified. These machine learning algorithms will be run on an embedded platform with a limited time window for the classification process, providing further limitations.

The hardware used in the implementation will be off-the-shelf embedded controllers. The components that are focused on include processor speed, memory size, sampling frequency and the Floating Point Unit (FPU).

Testing all microcontrollers is not plausible within this thesis, so three Cortex M-series microcontrollers from ARM² were selected as a suitable scope, as these are commonly used on the market [16]. The processing of sensor data will be included to create a realistic load on the microcontroller. However, sensors will not be implemented in the system.

² https://www.arm.com/

The machine learning algorithms will be trained on a computer which has more computational power and then run on the embedded devices.

Training the algorithm on the embedded device would require more computational power and that will not be focused on during this thesis.

The motion data set used for the thesis will be an activity recognition data set from an external source [15]. The dataset has a high number of measurements that are already tagged with the class they belong to; collecting and tagging such an amount of data within the project would be a very time-demanding task and outside of the time frame of the thesis.

1.2.1 Delimitations

Other features and algorithms might be suitable for classification of other motions or activities, such as sports or vehicle dynamics; however, this is outside the scope of this thesis. The machine learning algorithms used during the project will be existing algorithms, meaning no new algorithms will be developed. The selection of sensors will not be a part of the project and no other components of a microcontroller will be analysed, such as bus speed and bus size. No sensor data of our own will be acquired and tested on the machine learning system; however, this will be considered when designing the system and further discussed in Chapter 5.

1.3 Methodology

The thesis aims to follow the research and implementation method proposed in this section. It is launched with a planning phase, where a project plan and a detailed time plan are considered deliverables and are required to move on to the next phases. All the phases are explained in the flowchart in Fig. 1.1.

Figure 1.1: Flowchart of the different phases during the thesis work.

The research phase includes a pre-study that runs in parallel with the planning phase in order to gain more knowledge about the topic and to form a basis for the research questions defined during this phase. To be able to implement the theoretical framework on a test platform, an in-depth research phase will also be conducted, consisting of the qualitative method of document and article analysis, regarding:

• State-of-the-art motion recognition

• Machine learning algorithms

• Feature extraction, selection and detection

• Hardware and test platform study

• Sensor fusion

• Filtering techniques

The implementation and testing phase will have an experimental approach and be based on the theoretical framework founded in the previous phases. The verification part (during the testing and post-testing phase) will include quantitative data gathering and analysis to experimentally iterate and improve the system design for classification of motion types. This will include training the system with the motion data, concluded with running tests on the motion data to validate the performance of the embedded system.


1.4 Ethics

When collecting the training and testing data, whether the data collection is performed within the project or the data set is provided by an external source, the characteristics of the specific individual must be considered when discussing the ability to correctly classify the movement each time. Basic movements could be largely affected by a style of walking, long-lasting injuries, the shape of the relevant body parts, weight, etc. It should also be noted that the shoes and the ground on which the data collection is done will affect the data characteristics. If these things are included in the data collection, one issue that could arise is the topic of integrity. In insurance-related questions, for example, this would be very helpful for the companies because they could use this information to provide a fair price for a customer's insurance. But this would have some implications, among them the issue of these companies being granted the power to decide what risks the customer constitutes.

For the collection of data for the project, another thing that should be considered is the safety of the test subjects. Depending on the setup of the data collection, it should be noted that experimenting with humans entails further implications and difficulties. Would the collection of data expose any of the subjects to danger and, if so, how would this be minimised? These are considerations that need to be made when designing the tests and evaluation methods for the project. This also applies when discussing the integrity of the test subjects; the anonymity of all test subjects should be guaranteed, as well as the ability to be named in the research as a contributor to the project. This is indeed a delicate matter that would require some reflection and agreements among the parties involved.

The project should also consider the use cases for the technologies approached by this project. Machine learning and motion recognition could be applied in many different fields, both useful for the improvement and development of today's society in its entirety and including many new useful products and services. Machine learning could, however, be harmful in many aspects in terms of surveillance and prediction, and have a negative impact on society and human rights. With larger companies gaining information about human interaction and mapping out patterns in their lifestyle, these companies are gaining a lot of power. The ethical issues regarding artificial intelligence are very complex and should saturate every machine learning project.

1.5 Disposition

For this report, Chapter 2 will expand on the theoretical framework upon which the implementation is based. Related work that is relevant for the thesis will be summarised in Section 2.5. The method for designing the implementation will be described in Chapter 3. Furthermore, the results from the implementation will be presented in Chapter 4.

Finally, Chapter 5 will elaborate on the project as a whole and discuss system design, results, conclusions and future work.


2 Background

This chapter will establish a frame of reference for the theoretical background used in the implementation of the thesis. The information is intended to provide a general knowledge about machine learning algorithms, feature selection and extraction, the hardware included in the experiments and some information regarding sensors and filtering techniques.

2.1 Machine learning algorithms

The concept of machine learning is to provide a computer system with the ability to learn a concept without being explicitly designed for it [1]. For a system to be considered learning, the performance should improve on future tasks depending on observations made on past data [17]. However, the learning depends much on the nature of the system and the data to be processed. Furthermore, the type of feedback in the system also distinguishes the type of learning, where the main types of learning are unsupervised learning (where the system is learning patterns in the input without any explicit feedback), reinforcement learning (where the system learns from a series of consequences, including rewards or punishments) and supervised learning (where the system is provided with pairs of input-output data and learns to create output data from provided input) [3]. Another way of categorising machine learning would be by analysing the desired output; the categories are described as classification (where the goal is to assign each input to a finite number of predetermined categories), regression (where the desired output contains one or several continuous variables) or clustering (where the goal is to discover groups of similar data without having predetermined categories such as in classification, which makes clustering unsupervised in nature) [18].

A related field of study to machine learning is statistical learning, which refers to a set of tools for practical machine learning approaches [6]. The idea is to include all hypotheses and not just the one that is most likely. By weighing their probability and including all of them in the learning model, the whole problem is reduced to probabilistic logic [3]. However, an issue with including large sets of hypotheses or data pools is the concept of overfitting [3, 6]. If the provided model is fitted too closely to the training data, future predictions could be incorrect due to the new data not matching perfectly with the data used for training [19]. Another problem that arises in more complex models is the curse of dimensionality [20]. When the dimensionality and volume increase, the distance between data points also increases, which means that the amount of data required to remain statistically significant increases exponentially. Furthermore, underfitting is a problem in statistical learning where the model fails to correctly correspond to the real problem, which also affects the predictability [19].

As learning is applied to numerous problems that differ in complexity and size, there are countless unique and different algorithms that are optimal for different problems. Current research suggests that when comparing different algorithms for specific problems, the performance differs due to the unique nature of each problem [21–27]. Some of the more common classification algorithms are described in the following sections.

2.1.1 K Nearest Neighbour

K Nearest Neighbour (k-NN) is a basic method where the distances in feature space from a new data point to old labelled data points are used to classify the new data point. The k in k-NN is the number of neighbours that should be taken into account when classifying new data. A higher k leads to more neighbours being taken into account and a less biased model. The whole training set is stored in memory to compare distances against, which leads to a larger memory requirement unless unnecessary data points are removed [28].
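As an illustration of the method and of the memory trade-off described above, the following is a minimal sketch of k-NN classification, assuming NumPy and scikit-learn are available; the data, the six-class labelling and the choice k = 5 are made up for the example and are not taken from the thesis setup.

# A minimal k-NN sketch on synthetic "window" feature vectors (illustrative only).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 10))     # 500 stored training windows, 10 features each
y_train = rng.integers(0, 6, size=500)   # six hypothetical motion classes
X_new = rng.normal(size=(3, 10))         # new windows to classify

# The whole training set is kept in memory; a larger k averages over more
# stored neighbours when voting on the class of a new window.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print(clf.predict(X_new))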

2.1.2 Naïve-Bayes

Naïve-Bayes is a machine learning method that uses probability to classify data samples. It uses Bayes' theorem [17], where the probability of C given x is

P(C | x) = P(x | C) P(C) / P(x)    (2.1)

For classification of multiple classes and features, the decision rule is

c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)    (2.2)

where c_NB is the predicted class and xi are the measured features. If the absolute probabilities were entered, this could give the correct class every time; this is however not plausible, as the absolute probabilities are unobtainable [17].
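The decision rule in Eq. (2.2) can be written out directly. Below is a minimal sketch assuming Gaussian per-feature likelihoods, which is an assumption made for the example rather than something stated above; data and function names are illustrative.

# A minimal Naive-Bayes sketch: argmax_c log P(c) + sum_i log P(x_i | c),
# with Gaussian per-feature likelihoods estimated from the training data.
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, means and variances from training data."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class maximising the log-posterior for each sample."""
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                             + (X[:, None, :] - means[None, :, :]) ** 2
                             / variances[None, :, :]).sum(axis=2)
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(X[:5], *params))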

2.1.3 Artificial Neural Networks

In an Artificial Neural Network (ANN), a number of virtual nodes are used to calculate the probability of a data sample being a specific class.

The nodes are placed in layers and have weighted connections with each other, as can be seen in Fig. 2.1. The input nodes react to the input data and send signals to the next layer, called a hidden layer as it is hidden from the outside by other layers. Depending on the signal and the weights (multipliers), different nodes on the second layer will send their signal to the next hidden layer [29].

The number of layers and the number of nodes in each layer of the network change the complexity of the ANN and its ability to classify complex data [29].


Figure 2.1: Model of a neural network with one hidden layer.

2.1.4 Decision Trees

Decision trees can be pictured as a tree made out of nodes, see Fig. 2.2. At each node a small decision is made, for example a true-or-false statement. Each data sample gets checked at the first node and, depending on whether the statement is true or false, the data continues left or right, and it continues in such fashion until it reaches a leaf node. When the data reaches a leaf node in a tree it is classified. Different paths in the tree lead to different leaves, meaning to different classes. Several leaves might however have the same class [17].

Figure 2.2: A decision tree with 4 leaf nodes.

To make the tree as efficient as possible, the checks should be on the features that separate the data the best, in descending order [17]. For example, when classifying what type of fruit a fruit is, checking if the fruit is a sphere would remove a lot of potential fruit, as it would then be spherical, like an apple or orange, or not, like a banana or a pear. Checking if it grows on trees would not be as helpful, as many fruits do. The shape feature then has a larger information gain than whether the fruit grows on trees or not. The selection of these features is explained in Section 2.2.2. When growing a decision tree, a lower number of nodes leads to a more efficient classifier in terms of the number of checks that need to be done before classifying the data. Too few nodes could however lead to misclassification due to the tree having too high variance. Many nodes could lead to a finer classification that can react to smaller changes in the data, but it would have a higher bias, which could lead to misclassifications if the bias is too high.


It could also slow down the classification time, as more statements need to be checked before a classification is reached. A method to reduce the unnecessary nodes is pruning the tree, where nodes and branches of nodes that give little change in classification percentage are removed from the tree. This improves speed, decreases the memory load and can improve the accuracy of correct classifications [17].

CART

Classification And Regression Tree (CART) is a greedy decision tree with binary splits [30]. It is called a greedy algorithm because it selects the split with the best outcome at the time and does not take the previous or the following splits into consideration. Binary splits mean that each decision node splits into strictly two new nodes, and not more, which some decision trees allow. CART uses the Gini index, calculated as

Gini index = 1 − Σ_{i=1}^{k} p_i²    (2.3)

which is a measure of impurity used for calculating the best feature to split on [31]. p_i is the probability of class i out of the k possible classes given a feature [32].
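A minimal sketch of Eq. (2.3) and of how a CART-style learner could use it to rate a candidate binary split; the labels of the hypothetical split below are made up for the example.

# Gini impurity of a label set and the size-weighted impurity of a binary split.
import numpy as np

def gini_index(labels):
    """Gini impurity: 1 - sum_i p_i^2 over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(left_labels, right_labels):
    """Size-weighted Gini impurity of a candidate binary split; lower is better."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_index(left_labels) \
         + (len(right_labels) / n) * gini_index(right_labels)

left = np.array([0, 0, 0, 1])     # mostly class 0
right = np.array([1, 1, 1, 0])    # mostly class 1
print(gini_index(left), split_impurity(left, right))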

C4.5

C4.5 allows multiway splits instead of just binary splits. The algorithm includes pruning of the tree and it can handle incomplete data points. It uses the information gain ratio to decide good splits instead of the Gini index [32]. The information gain ratio is the ratio between the information gain and the split information, given as

Gain ratio = Information gain / Split information    (2.4)

The split information is the potential information that is gained from splitting the training data into n subsets. The information gain is the potential information that is gained towards classification by the split. By maximising this ratio, C4.5 reduces the risk of splits that create an unnecessarily large number of subsets [33]. For example, when classifying fruit, splitting on the time of harvest would generate a large number of subsets: even if the information gain towards classification were high, the split information would also be very large, so the gain ratio would be small and it would be considered a bad feature to split on.

Reduced Error Pruning Tree

Reduced Error Pruning Tree is a decision tree which applies reduced error pruning [34]. When performing reduced error pruning, the training data is split into two parts, one part for growing the tree and one part for pruning the tree. First a tree is grown that classifies the growing data as well as possible. The tree is then pruned bottom-up, meaning that it starts at the decision nodes furthest out on the tree. These nodes are evaluated to determine whether they could be replaced by a leaf that would classify all data that reaches it as the majority class of the possible classes at the decision node. If this is possible without decreasing the correct classification grade on the pruning dataset, then the node is replaced with the leaf. The pruning then continues up to the next node towards the root node. If it reaches a node which it cannot replace with a leaf without reducing the classification percentage, it will leave the node as a check and continue up. This method can be problematic if the training data is sparse, as it requires the training dataset to be split into two parts. It could also create problems if there exist some special cases in the data used for growing that do not exist in the data used for pruning. The nodes that classify these would be removed by the pruning, because the classification percentage on the pruning data is not affected by them and they are thereby considered unnecessary by the algorithm, even though they are required to classify all the data. If these issues do not exist, however, it can be an effective way to reduce the tree in order to reduce bias and increase performance due to fewer computations [35].

2.1.5 Support Vector Machines

When using SVM to classify data, a boundary in the form of a hyperplane is placed between the classes of the training data in the feature space. The separating hyperplane is placed so that the margin between it and the classes' data points is as large as possible. This can be seen in Fig. 2.3, where

w · x − b = 0    (2.5)

is the hyperplane, w the vector normal to the hyperplane and x a data point. The parameter b is related to the offset of the hyperplane; the distance from the origin to the hyperplane is b/‖w‖. The data points on the margin are called the support vectors. In some cases the classes cannot be separated with a plane due to data points being on the wrong side compared to the rest of the class; these are called outliers, and SVM can handle them by using soft margins, where it tries to minimise the number of outliers while still maximising the margin [36].

Figure 2.3: A decision boundary with margin to its three support vectors.

Some classes can be hard to separate from each other with a hyperplane. However, separation in a higher dimension can be easier, so support vector machines can use this to separate classes. The data points of the training data are projected onto a higher-dimensional hyperspace, where a hyperplane is used to separate the classes [36]. Performing the projection to hyperspace can be computationally demanding, so kernels are used instead. By using kernels, the same results as calculations in the higher-dimensional space can be reached without having to calculate the data points' positions in hyperspace, thus decreasing the computational load. More complex kernels can increase classification performance but can also increase the computational load. When finding the optimal hyperplane the computer needs to solve a quadratic problem. However, a least squares SVM can be used, which changes the quadratic problem into solving a system of linear equations [37]. This is not as computationally demanding and can be used with several different kernels, such as linear or radial [38].

SVM was originally created for binary classification, meaning it could only separate between two classes, but different methods to make it able to classify multiple classes have been developed since its creation. Some of these are one-against-one, one-against-all and Directed Acyclic Graph Support Vector Machine (DAGSVM) [39]. One-against-one received the highest score in a journal comparison [39]. For this method, k(k − 1)/2 binary classifiers are trained, where k is the number of classes. Each classifier is trained on separating two classes from each other. Voting is then used to decide which class a new data sample belongs to. In one-against-all, one binary classifier is created per class, where each one is trained to classify its own class as positive and the rest as negative. In DAGSVM the same number of classifiers as in one-against-one is trained, but a rooted binary directed acyclic graph is used to decide which class a new data sample belongs to. It has k(k − 1)/2 internal nodes and k leaves. When a new sample is classified, it starts at the root and is analysed by the first binary SVM; depending on the result it continues to the left or right node, and it continues in this way until it reaches a leaf and is classified as a particular class. This decreases the number of SVMs that need to evaluate each new data point, due to the behaviour of a binary tree, even if the number of trained SVMs might stay the same [40].

Two of the most common kernels that can be used in an SVM are explained in the following sections.

Linear SVM

When using a linear SVM, a linear kernel K, expressed as

K(x, x_i) = x^T x_i    (2.6)

is used. x and x_i represent two data points to be projected into hyperspace. The separating hyperplane can be written as a linear function. This is less computationally demanding than a more complicated kernel, which enables it to run faster on a limited processing unit compared to, for example, the radial SVM.

Radial SVM

When using a radial SVM, a radial kernel function K, expressed as

K(x, x_i) = exp(−‖x − x_i‖² / (2α²))    (2.7)

is used instead. x and x_i represent two data points to be projected into hyperspace and α is a variable used to tune the radial kernel. This radial kernel enables the boundary to be a linear hyperplane in the transformed feature space but nonlinear in the original feature space. This enables more complex data to be separated but is computationally heavier than the linear SVM.
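A minimal sketch contrasting the two kernels, assuming scikit-learn's SVC is available. Note that scikit-learn parameterises the RBF kernel as exp(−γ‖x − x_i‖²), so γ corresponds roughly to 1/(2α²) under the notation of Eq. (2.7); that mapping, the data and the parameter values are assumptions for the example, not statements from the thesis.

# Linear kernel (Eq. 2.6) versus RBF kernel (in the spirit of Eq. 2.7) on a
# synthetic problem whose class boundary is not linear.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (np.sin(X[:, 0]) + X[:, 1] > 0).astype(int)

linear_svm = SVC(kernel="linear")          # K(x, x_i) = x^T x_i
radial_svm = SVC(kernel="rbf", gamma=0.5)  # K(x, x_i) = exp(-gamma * ||x - x_i||^2)
for name, clf in [("linear", linear_svm), ("rbf", radial_svm)]:
    clf.fit(X, y)
    print(name, clf.score(X, y))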

2.1.6 Ensemble Classifiers

An ensemble classifier contains multiple classifiers that work together to find the correct classification. This is based on the concept that a crowd is smarter than any of its individuals on their own [41]. In some studies, ensemble classifiers outperform other classification methods [27]. The classifiers that make up the ensemble need to have an accuracy higher than random guessing and to be diverse from each other, which means that they should, if possible, rely on different variables to reach the same conclusion, but based on different data [42]. If the classifiers are not diverse, they can all have a bias towards a class or feature and give incorrect classifications. If they are diverse, however, the probability of them all being wrong decreases, and when combined with a majority vote the correct classification has a higher probability of gaining the majority than the incorrect one [27]. These ensembles can be made with only simple classifiers but can also be made up of more complex classifiers.

Examples of ensembles that use the same sort of classifiers are Random Forest [43], Bag of SFA Symbols [44] and Time Series Forest [45], where Random Forest is the most commonly used and is made out of decision trees. Some examples of ensembles that use different classifiers are Elastic Ensemble [46] and Collective of Transformation Ensembles [47]. Elastic Ensemble is a combination of 11 different nearest neighbour classifiers [46]. Collective of Transformation Ensembles contains the 11 classifiers from Elastic Ensemble but adds other sorts of classifiers, for a total of 35 different classifiers [47]. When improving the performance of an ensemble made out of similar classifiers, there are two major ways of doing this, called bagging and boosting.


Bootstrap Aggregating, Bagging

When bagging an ensemble, the classifiers are trained on randomly selected parts of the training data. This creates variance in the ensemble, which is good when classifying new data. Bagging is also good when the data contains a lot of noise, as the ensemble will not have such a strong bias [48]. This could however lead to a lack of performance if the increased variance is not necessary. If the training dataset is too small or not a perfect representation of the testing dataset, it can instead be beneficial [49]. Random Forest is an example of a bagging algorithm.

Boosting

When boosting an ensemble, the classifiers are trained in sequence and the performance of the last classifier will affect the next. This is done by adding weights to the training data. These weights are then changed based on whether the last classifier classified them correctly (reducing their weight) or incorrectly (raising their weight), effectively changing their importance to be classified correctly [50]. This creates lower variance in the ensemble, which makes it good at classifying if the training data is very representative of the test data. This could however lead to a heavy bias towards the training data, making it sensitive to noisy data [49].

AdaBoost is an example of a boosting algorithm.
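A minimal sketch of the two strategies using scikit-learn's RandomForestClassifier (bagging of decision trees) and AdaBoostClassifier (boosting of sequentially re-weighted learners); the data and the number of estimators are illustrative and do not reflect the thesis configuration.

# Bagging (Random Forest) versus boosting (AdaBoost) on a synthetic problem.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)

bagged = RandomForestClassifier(n_estimators=50, random_state=0)   # trees on bootstrap samples
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)      # sequentially re-weighted learners
for name, clf in [("random forest", bagged), ("adaboost", boosted)]:
    clf.fit(X, y)
    print(name, clf.score(X, y))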

2.2 Features

A feature is a characteristic of the data that is used to build a predictive model [18]. These attributes are derived from the initial set of data for the system to build a better model and to reduce the complexity.

This processing stage is called feature extraction in machine learning applications [51]. However, some data can be irrelevant for the model, which motivates an additional selection process to discard the attributes that appear to be irrelevant. This is called feature selection [6].


2.2.1 Feature extraction

Feature extraction is closely related to dimensionality reduction, where the idea is to transform the data to be useful for the application. This transformation is distinguished as two major types, linear and non-linear [52].

Principal Component Analysis

Principal Component Analysis (PCA) is a linear dimension reduction method and is considered the best method in terms of mean-square error [53]. Essentially, the idea is to reduce the dimensionality of the data set while maintaining as much of the variation as possible. This is achieved by introducing a new set of variables, the Principal Components (PCs), which consist of orthogonal linear combinations of the original data set [54]. PCA does not ignore covariances and correlations; however, it concentrates on the variances of the data set [52]. Suppose that for a set of p observations in an m-dimensional feature space, the data set is represented by a matrix A ∈ R^{p×m}. An example in two dimensions x = {x1, x2} will be used, see Fig. 2.4.

Figure 2.4: Data set of observations in a two-dimensional feature space {x1, x2}.

The first step of the PCA algorithm is, from the m-dimensional feature space x = [x1, . . . , xm], to look for a new set of features, y = [y1, . . . , ym], where y1 = α11 x1 + α12 x2 + · · · + α1m xm is the linear function of the elements of x that has the maximum variance, i.e. the first PC. The next step is to find the linear function y2 which is uncorrelated with y1 but still has the maximum variance of x. These steps are repeated until the k:th linear function yk is found, where k is the maximum number of uncorrelated functions yi in the data set:

y1 = α11 x1 + α12 x2 + · · · + α1m xm
y2 = α21 x1 + α22 x2 + · · · + α2m xm
...
yk = αk1 x1 + αk2 x2 + · · · + αkm xm    (2.8)

Up to a total of p PCs can be found. However, the aim is that the variation in x can be described by k PCs, where k ≪ p. The vectors

v1 = [α11, α12, . . . , α1m]
v2 = [α21, α22, . . . , α2m]
...
vk = [αk1, αk2, . . . , αkm]    (2.9)

are the eigenvectors of the correlation matrix of A and represent the coefficients of the PCs. As mentioned earlier, the first Principal Component (PC) is the direction of the maximum variance of the observed set of data points. The following PCs are orthogonal to the first, and together they describe the maximum variance for the entire data set. Thus, the eigenvectors make up a new feature space as in

y = {v1, v2, . . . , vk}    (2.10)

From the example with the two-dimensional feature space, this method can be applied to obtain a new feature space {y1, y2} which describes the data better than the original feature space, see Fig. 2.5.

Figure 2.5: Principal components {y1, y2} of the data set x.

By analysing the eigenvectors, specifically looking at the eigenvalues, it can be determined whether the corresponding vector contains relevant information. If an eigenvalue is higher, that specific direction contains a higher variance, which is good for reducing the dimensionality with minimal loss of information. Thus, sorting out the n eigenvectors with the lowest eigenvalues, a new feature space yr,

yr = {v1, v2, . . . , vk−n}    (2.11)

is obtained. These eigenvectors describe the reduced feature space and form a transformation matrix P ∈ R^{(k−n)×m} (with the vectors vi as rows) as

P = [v1; v2; . . . ; vk−n]    (2.12)

The observed data set can now be presented in the new reduced feature space as

Ar = (P A^T)^T ∈ R^{p×(k−n)}    (2.13)

When iterating the proposed example with a two-dimensional feature space, this method can be applied to reduce the dimensionality to a one-dimensional feature space. It is done by projecting the data onto the first PC, y1; this is represented in Fig. 2.6.

Figure 2.6: Reduced dimensionality of the feature space by projecting observations on y1.

As previously mentioned, PCA is a linear dimensionality reduction method, which introduces some problems when analysing the data. One issue that could arise is the presumption that the direction of highest variance is the most relevant to the dynamics, which might not be the case and has to be considered when using PCA [54].
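A minimal NumPy sketch of the steps in Eqs. (2.8)-(2.13): eigendecomposition of the correlation matrix of A followed by projection onto the leading PCs. The synthetic data, the standardisation step and the choice k = 2 are assumptions made for the example.

# PCA via the correlation matrix: find eigenvectors, keep the leading ones,
# and project the (standardised) data onto them as in Eq. (2.13).
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(100, 5))                  # p = 100 observations, m = 5 features
A[:, 1] = 0.9 * A[:, 0] + 0.1 * A[:, 1]        # introduce correlated features

A_std = (A - A.mean(axis=0)) / A.std(axis=0)   # standardised data, so cov(A_std) = corr(A)
R = np.corrcoef(A, rowvar=False)               # correlation matrix of the features
eigvals, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]              # sort PCs by decreasing variance

k = 2                                          # keep the k leading PCs
P = eigvecs[:, order[:k]].T                    # rows are the coefficient vectors v_i
A_r = A_std @ P.T                              # reduced data set, Eq. (2.13)
print(A_r.shape)                               # (100, 2)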

Factor Analysis

Another linear dimension reduction method is Factor Analysis (FA). While closely related to PCA, this method is used to consider different variables and combine them into common factors, thus reducing the dimension of the data set [52]. These factors are not directly measured; instead, they are hypothetical constructs that are used to represent these variables. An example could be data points from addition and division tests, which result in numerical ability as the common factor [55].

For a mathematical definition of the method, it is assumed that p is the number of variables {X1, X2, . . . , Xp} and m is the number of underlying factors {F1, F2, . . . , Fm}. Here Xj is the variable represented in the hidden factors,

Xj = aj1 F1 + aj2 F2 + · · · + ajm Fm + ej    (2.14)

where the weight of each factor is denoted aj1, aj2, . . . , ajm and the unique factor for the variable Xj is denoted ej. These weights give an understanding of how much each variable has contributed to each factor. Suppose that for each pair of variables Xi and Xj, factors should be found so that, when they are extracted, the correlation between the two variables is equal to zero. Thus, a correlation matrix R needs to be computed between the observed variables. For FA, the fundamental theorem is defined as

R_{m×m} − U²_{m×m} = F_{m×p} F′_{p×m}    (2.15)

where R_{m×m} is the correlation matrix between the observed variables, U²_{m×m} represents the diagonal matrix of unique variances of each variable and F_{m×p} is the matrix of common factor weights. This equation is solved using the eigenvalues and eigenvectors of the matrix. FA is a complex mathematical method and the criteria for choosing the factors are extensive, which further complicates the practical application. The limitations of FA include the lack of understanding of what these constructed common factors really correspond to. Correct interpretation of all data and accuracy among the common factors is something that is hard to achieve [55].

Independent Component Analysis

Independent Component Analysis (ICA) is, in comparison to the previously mentioned methods, a higher-order method that seeks linear projections that are as statistically independent as possible, which is a much stronger condition than just being uncorrelated [52]. The method can be applied in many dimensionality reduction problems as well as in the blind source separation concept, a common example being the cocktail party problem, where the system is listening to one person speaking in a noisy room full of people [56]. Suppose n observations of linear mixtures x1, x2, . . . , xn of n independent components, then

xj = aj1 s1 + aj2 s2 + · · · + ajn sn    (2.16)

for all j, where s is each independent component and a is the weight for each component. For ICA, each mixture value xj and independent component sk is a random variable, instead of a proper time signal. It can be assumed that the model is zero-mean. For this definition a vector-matrix notation is used, so x is the random vector with the mixture elements {x1, x2, . . . , xn} and s is the random vector with the elements {s1, s2, . . . , sn}. Furthermore, the matrix A is denoted with elements aij. The mixing model is then written as

x = As    (2.17)

The independent components s are latent variables, which means that they are non-observable. This means that all that can be observed is the random vector x, so estimations of A and s must be done with as general assumptions as possible. By assuming that the components si are statistically independent and also have a nongaussian distribution, the matrix A can be estimated. Then its inverse A⁻¹ is computed and the independent components s can be obtained by

s = A⁻¹x    (2.18)

This model can be extended with a noise term ui for more realistic applications; however, the noise-free model is difficult enough and is good enough for most applications [57].
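A minimal sketch of the mixing model in Eq. (2.17) and its inversion, using scikit-learn's FastICA as one common estimator of A and s; the two sources and the mixing matrix below are made up for the example.

# Mix two nongaussian sources with an (in practice unknown) matrix A and
# recover estimates of the independent components with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(5)
t = np.linspace(0, 1, 1000)
s = np.c_[np.sign(np.sin(8 * np.pi * t)),   # a square-wave-like, nongaussian source
          rng.laplace(size=1000)]           # a heavy-tailed source
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                  # mixing matrix
x = s @ A.T                                 # observed mixtures, x = As for each sample

ica = FastICA(n_components=2, random_state=0)
s_est = ica.fit_transform(x)                # estimated independent components
print(s_est.shape, ica.mixing_.shape)       # (1000, 2) (2, 2)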

Random Projections

A simple yet very powerful dimension reduction technique [52], Random Projections (RP) uses random projection matrices R to project the original data X ∈ R^p into lower-dimensional spaces, transforming the data into S ∈ R^k with k ≪ p by

S = RX    (2.19)

The random projection matrix R is constructed from independent and identically distributed zero-mean normal variables. The method is significantly quicker than even the relatively quick method PCA, which motivates the use of this specific method. The elements rij in R are often Gaussian distributed, but this is not a must [58]. It has been shown empirically that the results are comparable to PCA when using RP [59]. The method has been used in several applications, ranging from text recognition to human activity recognition [60].
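A minimal sketch of Eq. (2.19) with a Gaussian random matrix; the 1/√k scaling of the entries is a common convention and an assumption made here, not something specified above. scikit-learn's GaussianRandomProjection implements the same idea.

# Project p-dimensional data points (stored as columns of X) down to k dimensions.
import numpy as np

rng = np.random.default_rng(11)
p, k, n = 50, 10, 200                                 # original dim, reduced dim, samples
X = rng.normal(size=(p, n))                           # data with observations as columns
R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, p))    # i.i.d. zero-mean Gaussian entries
S = R @ X                                             # projected data, S = RX
print(S.shape)                                        # (10, 200)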


Regression

In statistical analysis, regression is a method for measuring and estimating the relationship between variables in a data set. In dimensionality reduction applications, the goal is to model a response variable y from a set of variables xi. Regression methods can be both linear and non-linear. An important criterion for regression methods is that all variables xi are uncorrelated and relevant to explaining the variation in y. However, this is often not the case, which is why these methods need to be implemented together with a feature selection method [52]. In a machine learning context, such methods including a feature selection part are called wrapper methods, which will be further elaborated in the next section.

2.2.2 Feature selection

The objective of feature selection includes providing a better understanding of the system or the model, improving efficiency by reducing the training times, and improving performance in terms of better accuracy [6, 61]. When performing feature selection, the main subsets of methods include filter methods [62], wrapper methods [63] and embedded methods [64]. Here the input data xij, yk consists of n samples (i = 1, . . . , n) with d variables (j = 1, . . . , d), where xi is the i:th sample and yk is the class label (k = 1, . . . , y) [61]. The different approaches will be elaborated in their respective subsections.

Filter methods

The main principle of filter methods is that they select features regardless of the model at hand. Within the set of all features, they select the best subset of features with regard to, for example, the correlation between variables. While these methods are effective in terms of computation time and may be more general in terms of reducing the risk of overfitting, they may select redundant features, because the methods do not consider the relationship between different features and the model domain; thus, a question of relevancy must be raised [61]. Two relevant and common filter feature selection methods are Correlation Criteria and Mutual Information [65]; these will be elaborated in the following sections.

Correlation Criteria

The Pearson correlation criterion is one of the simplest methods to score correlation [61]; it is defined as

R(i) = cov(xi, Y) / √(var(xi) · var(Y))    (2.20)

where xi is the i:th variable, Y is the output, cov() is the covariance and var() the variance. However, correlation criteria ranking can only detect linear dependencies between the target and the chosen variable.
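A minimal sketch of Eq. (2.20), scoring each feature against the output on synthetic data where only the first feature is informative; data and names are illustrative.

# Pearson correlation criterion R(i) for each feature column in X.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
Y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # only the first feature is informative

def pearson_score(x, y):
    """R(i) = cov(x_i, Y) / sqrt(var(x_i) * var(Y))."""
    cov = np.cov(x, y)[0, 1]
    return cov / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))

scores = [pearson_score(X[:, i], Y) for i in range(X.shape[1])]
print(np.round(scores, 2))   # the first feature should score close to 1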

Mutual Information

Another criterion, Mutual Information (MI), is defined by measuring the dependency between two variables. To elaborate the concept, the entropy first needs to be defined, as in [66],

$$H(Y) = -\sum_{y} p(y)\log\big(p(y)\big) \qquad (2.21)$$

where Eq. 2.21 describes the uncertainty in the output $Y$. If a variable $X$ is observed, the conditional entropy is given as

$$H(Y \mid X) = -\sum_{x}\sum_{y} p(x, y)\log\big(p(y \mid x)\big) \qquad (2.22)$$

Eq. 2.22 shows that when a variable $X$ is observed, the uncertainty in the output $Y$ is reduced. MI is thus defined as

$$I(Y, X) = H(Y) - H(Y \mid X) \qquad (2.23)$$

Eq. 2.23 shows that the MI will be zero if the variable $X$ and the output $Y$ are independent, and greater than zero if they are dependent. This enables the algorithm to avoid selecting features that carry similar information.
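A minimal sketch of Eqs. 2.21–2.23 for discrete variables is shown below; the estimator, the noise level and all variable names are assumptions made for the example (for continuous features one would typically bin the values or use a library estimator such as scikit-learn's mutual_info_classif).

```python
import numpy as np

def entropy(labels):
    """H(Y) = -sum_y p(y) log p(y), Eq. 2.21, for a discrete variable."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mutual_information(x, y):
    """I(Y, X) = H(Y) - H(Y|X), Eqs. 2.22-2.23, for two discrete variables."""
    h_y = entropy(y)
    h_y_given_x = 0.0
    for value in np.unique(x):
        mask = (x == value)
        h_y_given_x += mask.mean() * entropy(y[mask])  # p(x) * H(Y | X = x)
    return h_y - h_y_given_x

# Illustrative usage: x is a noisy copy of y, z is independent of y.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
x = y ^ (rng.random(1000) < 0.1)
z = rng.integers(0, 2, size=1000)
print(mutual_information(x, y), mutual_information(z, y))  # large vs. near zero
```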


Wrapper methods

Wrapper methods can be used to evaluate the relationship between different variables. The methodology evaluates subsets of features to detect possible interactions between them, and thus their relative usefulness in terms of predictive power. However, both the computation time and the risk of overfitting increase as the dimensionality increases [61]. These methods are closely related to regression in statistical analysis and, in a machine learning context, often combine feature selection and extraction to reduce the dimensionality and choose appropriate variables for the analysis [52].

Sequential selection algorithms

These methods are named sequential because of their iterative character [67]. Sequential Feature Selection (SFS) is an algorithm that starts with no features and adds one feature at a time; each expansion step is followed by an evaluation of the classification accuracy, and the candidate feature is permanently added to the subset only if the accuracy increases with its addition. The method does not consider dependencies between different features, which may create additional issues for the new feature subset [65]. There are other sequential methods, for example an inverted SFS (backward elimination) where the subset starts with all possible features and one feature is removed and the accuracy re-evaluated in each iteration [67]. A sketch of the forward variant is shown below.
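The following is a minimal sketch of forward SFS with cross-validated accuracy as the evaluation criterion; the linear SVM, the 3-fold cross-validation and the synthetic data are illustrative assumptions, not the setup used in the thesis.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def sequential_forward_selection(X, y, max_features):
    """Greedy forward SFS: in each round, try every remaining feature and
    keep the one that improves cross-validated accuracy the most."""
    selected, remaining, best_so_far = [], list(range(X.shape[1])), -np.inf
    while remaining and len(selected) < max_features:
        score, j = max(
            (cross_val_score(SVC(kernel="linear"),
                             X[:, selected + [cand]], y, cv=3).mean(), cand)
            for cand in remaining
        )
        if score <= best_so_far:
            break                      # no single feature improves accuracy further
        best_so_far = score
        selected.append(j)
        remaining.remove(j)
    return selected

# Illustrative usage on synthetic data; features 2 and 5 carry the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = (X[:, 2] + X[:, 5] > 0).astype(int)
print(sequential_forward_selection(X, y, max_features=3))
```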

Heuristic search algorithms

Heuristic search algorithms have the ability to rank different choices when, for example, selecting relevant features [65]. A common method, the Genetic Algorithm (GA), is based on the Darwinian theory of evolution and uses the principle of natural selection to find the best solution [68]. Candidate solutions compete against each other and their parameters are modified over generations, creating new versions from which the best features are eventually selected [52]. A compact sketch of a GA applied to feature selection follows below.
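Below is a compact, illustrative GA for feature selection: individuals are binary masks over the features, fitness is cross-validated accuracy, and evolution uses truncation selection, one-point crossover and bit-flip mutation. The population size, number of generations, mutation rate and the linear SVM are all assumptions made for this sketch, not details from the thesis.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Cross-validated accuracy of the feature subset encoded by a binary mask."""
    if not mask.any():
        return 0.0
    return cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=3).mean()

def ga_select(X, y, pop_size=20, generations=15, p_mut=0.1):
    d = X.shape[1]
    pop = rng.random((pop_size, d)) < 0.5                 # random initial subsets
    for _ in range(generations):
        fit = np.array([fitness(ind, X, y) for ind in pop])
        parents = pop[np.argsort(fit)][-pop_size // 2:]   # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, d)                       # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(d) < p_mut                 # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, children])
    fit = np.array([fitness(ind, X, y) for ind in pop])
    return pop[np.argmax(fit)]

# Illustrative usage: the returned mask should ideally include features 0 and 7.
X = rng.normal(size=(100, 10))
y = (X[:, 0] - X[:, 7] > 0).astype(int)
print(np.flatnonzero(ga_select(X, y)))
```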


Embedded methods

Embedded methods combine the two previous approaches in order to utilise their advantages in terms of generalisation, computation time and predictive power. The method performs classification and feature selection simultaneously, which aims to be more efficient by not splitting the data into separate training and validation sets [61]. The main approach is to include the feature selection in the training process and to provide the largest possible generalisation of the subset, thus minimising the risk of false classifications [64].

The risk is calculated differently for each embedded method. A common method is the max-relevance, min-redundancy (mRMR) method, which is based on MI [65]. The criterion maximises the relevance of the selected features to the target minus their mutual redundancy:

$$\max\left(\frac{1}{\|S\|}\sum_{i \in S} I(h, i) \;-\; \frac{1}{\|S\|^{2}}\sum_{i, j \in S} I(i, j)\right) \qquad (2.24)$$

where $S$ is the set of selected features, $\|S\|$ its size and $h$ the target classes, which depend on the problem characteristics. $I(i, j)$ is the MI between features $i$ and $j$.
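As an illustration, the snippet below greedily builds a subset according to the mRMR criterion in Eq. 2.24, using a simple histogram-based MI estimate. The binning, the greedy strategy and the synthetic data are assumptions of the sketch rather than details from the thesis.

```python
import numpy as np

def discrete_mi(a, b, bins=8):
    """Plug-in MI estimate for two 1-D variables after histogram binning."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    nz = p_xy > 0
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz])))

def mrmr_select(X, y, n_features):
    """Greedy max-relevance, min-redundancy selection following Eq. 2.24."""
    d = X.shape[1]
    relevance = np.array([discrete_mi(X[:, j], y) for j in range(d)])
    selected = [int(np.argmax(relevance))]          # start with the most relevant
    while len(selected) < n_features:
        candidates = [j for j in range(d) if j not in selected]
        scores = [
            relevance[j] - np.mean([discrete_mi(X[:, j], X[:, s]) for s in selected])
            for j in candidates
        ]
        selected.append(candidates[int(np.argmax(scores))])
    return selected

# Illustrative usage: feature 1 is a near copy of feature 0, so the
# selection should tend to avoid picking both.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=300)
y = (X[:, 0] + X[:, 4] > 0).astype(float)
print(mrmr_select(X, y, n_features=3))
```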

Another method is to use the weights of the classifier to rank features for removal. The weight $w_j$ is defined as

$$w_j = \frac{\mu_j(n) - \mu_j(m)}{\sigma_j(n) - \sigma_j(m)} \qquad (2.25)$$

where $\mu_j$ is the mean of the samples in classes $n$ and $m$ respectively, $\sigma_j$ is the variance of the respective class and $j$ runs from 1 to $D$. The weight vector $w$ can then be used to classify, because the features are ranked proportionally to their weights. The decision $D(x)$ is

$$D(x) = w(x - \mu) \qquad (2.26)$$

where $\mu$ is the mean of the data included in the decision. The change in $w_j$ can then be used to correspond to the removal of feature $j$.
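A small sketch of weight-based feature ranking in the spirit of Eq. 2.25 is given below. Note a deliberate deviation: the sketch uses the common signal-to-noise variant with the class spreads summed in the denominator, to avoid dividing by a near-zero difference; Eq. 2.25 as written in the text subtracts them. The data and all names are illustrative assumptions.

```python
import numpy as np

def weight_ranking(X, y, class_n, class_m):
    """Per-feature weight for two classes. The denominator sums the class
    spreads (signal-to-noise variant) rather than subtracting them as in
    Eq. 2.25, which keeps the sketch numerically well behaved."""
    Xn, Xm = X[y == class_n], X[y == class_m]
    w = (Xn.mean(axis=0) - Xm.mean(axis=0)) / (Xn.std(axis=0) + Xm.std(axis=0))
    order = np.argsort(np.abs(w))   # first entries = weakest features, candidates for removal
    return w, order

# Illustrative usage: only feature 2 separates the classes, so it should
# receive the largest |w_j| and appear last in the ranking.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)
X[y == 1, 2] += 3.0
w, order = weight_ranking(X, y, class_n=1, class_m=0)
print(order)
```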


Embedded methods perform better than filter methods as the number of training points increases. However, filter methods will always outperform both wrapper and embedded methods for small training set sizes [64].

2.3 Hardware study

To investigate how machine learning algorithms perform on a limited processing unit, and to identify which hardware limitations are crucial for the testing of the algorithms, this section presents research regarding microcontrollers and sensors.

2.3.1 Microcontroller

The microcontroller that will be used as a limited processing unit consists of several components, such as a Central Processing Unit (CPU), Random-Access Memory (RAM), Flash memory and sometimes an FPU. A microcontroller also contains interface circuitry for communicating outside of the chip, which usually connects to sensors and actuators [69].

The CPU contains an Arithmetic Logic Unit (ALU), which is the part that performs the arithmetic and logic calculations on integers based on the instructions given to it by the program. The performance of the CPU is partly determined by the size of the ALU. The standard size for microcontrollers has long been 8 bits [70], but 16-, 32- and 64-bit variants are now common as well [69]. The number of bits is the binary size of the numbers that the ALU can work with: an ALU with $n$ bits can natively represent the integers from 0 to $2^n - 1$, so an 8-bit ALU can handle numbers between 0 and $2^8 - 1 = 255$. Working with other numbers requires additional computations. A larger ALU enables larger values and adds processing power to the CPU, but it also adds complexity to the whole microcontroller, as the communication buses between the parts of the microcontroller need to handle the larger data sizes. The performance of the CPU also depends on the frequency at which the ALU performs its calculations; a higher frequency allows more calculations to be done within the same amount of time [69].
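As a small illustration of the $2^n - 1$ limit, the snippet below (in NumPy, purely for demonstration on a host computer) shows how a value exceeding the 8-bit range wraps around; on a real 8-bit ALU such a sum would likewise not fit in a single register and would require multi-word arithmetic.

```python
import numpy as np

# An 8-bit ALU natively represents 0 .. 2**8 - 1 = 255; larger results
# wrap around and must otherwise be handled with multi-word arithmetic.
a = np.array([250], dtype=np.uint8)
b = np.array([10], dtype=np.uint8)
print(a + b)                   # [4]   -> 260 mod 256, what fits in 8 bits
print(int(a[0]) + int(b[0]))   # 260   -> the result a wider ALU could hold directly
```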

To be able to handle data and store the program, the microcontroller requires memory. This usually comes in two forms: volatile and non-volatile memory. The non-volatile memory, also known as Flash or Read-Only Memory (ROM), is where the program with all the instructions for the CPU is stored. This memory persists even when power to the microcontroller is removed or lost. The volatile memory, however, loses its data when it is without power. It is in this memory that values used by the processor during runtime are stored and read [69].

Some embedded controllers have an FPU, a part of the controller that can handle calculations on floating-point numbers [71], in contrast to the ALU, which only operates on integers. Floating-point calculations can be performed in software on a microcontroller without an FPU, but this requires more processor time. Floating-point values are beneficial when, for example, working with high precision or very small numbers, as these require a fractional part.

Emulated Processing Unit

Instead of using a physical unit, it is possible to emulate a microcontroller in order to test its performance. A few software options exist, such as Simics¹ and OVPSim². This makes it possible to evaluate multiple different microcontrollers without the need for physical hardware.

2.3.2 Inertial Measurement Unit

These sensors are based on Micro-Electro-Mechanical Systems (MEMS) technology, that is, tiny integrated systems that combine mechanical and electrical components to sense, control and actuate on

¹ https://www.windriver.com/products/simics/

² http://www.ovpworld.org/
