Utilizing machine learning in wildlife camera traps for automatic classification of animal species: An application of machine learning on edge devices

(1)

Author: Niklas E

^RLANDSSON

Supervisor: Fredrik A

^HLGREN

Examiner: Daniel T

^OLL

Semester: VT21

Field of study: Computer Science Level: G2F

Degree project

Utilizing machine learning in wildlife camera traps for

automatic classification of animal species

An application of machine learning on edge devices

(2)

Abstract

A rapid global decline in biodiversity has been observed in the past few decades, especially in large vertebrates and the habitats supporting these animal populations. This widely accepted fact has made it very important to understand how animals respond to modern ecological threats and to understand the ecosystems functions. The motion activated camera (also known as a camera trap) is a common tool for research in this field, being well-suited for non-invasive observation of wildlife.

The images captured by camera traps in biological studies need to be classified to extract information, a traditionally manual process that is time intensive. Recent studies have shown that the use of machine learning (ML) can automate this process while maintaining high accuracy. Until recently the use of machine learning has required significant computing power, relying on data being processed after collection or transmitted to the cloud. This need for connectivity introduces potentially unsustainable overheads that can be addressed by placing computational resources on the camera trap and processing data locally, known as edge computing. Including more computational power in edge and IoT devices makes it possible to keep the computation and data storage on the edge, commonly re- ferred to as edge computing. Applying edge computing to the camera traps enables the use of ML in environments with slow or non-existent network access since their functionality does not rely on the need for connectivity.

This project shows the feasibility of running machine learning algorithms for the purpose of species identification on low-cost hardware with similar power to what is commonly found in edge and IoT devices, achieving real-time performance and maintaining high energy efficiency sufficient for more than 12 hours of runtime on battery power. Accuracy results were mixed, indicating the need for more tailor-made network models for performing this task and the importance of high quality images for classification.

i

(3)

List of Figures

1 An example of a DNN with one hidden layer. Source: [1]. . . 14

2 A Raspberry Pi 4B with added cooling fan. . . 17

3 A Coral dev board. . . 17

4 An OpenMV Cam H7+. . . 18

5 Diagram showing program operation for CPU only systems. . . 19

6 Diagram showing program operation when using an Edge TPU. . . 20

7 The Samsung smartphone used for video recording. . . 22

8 The feeding station setup used in the experiment. . . 22

9 Example of an optimal image depicting a Scarlet macaw. Source: [2] 23 10 Classification made using a ROI extracted by frame differencing. . 24

11 ROI used for the inferencing in Figure 10. . . 24

12 Bar chart showing the results from the Coral dev board. . . 28

13 Bar chart showing results from the Raspberry Pi. . . 29

14 Bar chart showing accuracy results from the OpenMV H7+. . . 30

(6)

List of Tables

1 Detailed information on the videos included in the test suite. . . 8

2 Performance results for the Coral dev board. . . 25

3 Performance results for the Raspberry Pi 4B. . . 26

4 Performance results for the OpenMV H7+. . . 26

5 Power consumption values for the Coral dev board. . . 27

6 Power consumption values for the Raspberry Pi 4B. . . 27

7 Power consumption values for the OpenMV H7+. . . 27

8 Accuracy percentage scores for the Coral dev board. . . 28

9 Accuracy percentage scores for the Raspberry Pi 4B. . . 29

10 Accuracy percentage scores for the OpenMV H7+. . . 30

(7)

List of code examples

1 Frame differencing method using OpenCV . . . 20

(8)

Terminology

ANN Artificial Neural Network DNN Deep Neural Network

CNN Convolutional Neural Network

MobileNets A CNN designed for mobile vision applications

ML Machine Learning

ROI Region of Interest QoS Quality of Service SBC Single-board computer

(9)

Acknowledgements

I would like to thank my supervisor Fredrik Ahlgren for all the valuable guidance and feedback throughout the process. Your insight really helped with keeping the project on course and focused.

(10)

1 Introduction

The past decades have brought with them a rapid decline in biodiversity across the globe and the degradation of natural habitats hosting animal populations.

This widely accepted fact has made it very important to understand how animals respond to modern ecological threats and to understand the ecosystems functions [3]. The motion activated camera (also known as a camera trap) is a commonly used non-invasive tool well-suited for remotely observing wildlife.

In recent years the camera trap has developed from a tool with limited uses to a device that is frequently used in many different ecological studies. Wearn et al.

described the camera trap as a highly effective survey method whose effectiveness will only continue to improve as technology advances, as seen previously in the transition from film cameras to digital cameras [4]. This transition and the subsequent improvement of digital camera technology in the years since has greatly increased the quantity of data that can be captured in one placement, tens of thousands of images before the battery runs out compared to the original film cameras 36 images that could fit on a single roll of film [5].

Trolliet et al. describes the difficulties of field work that are often hampered by time constraints and limited availability of personnel. Some studies also require working in very remote areas that are difficult to get to. Camera traps are well suited to overcome these limitations and enable researches to collect data consis- tently in these remote regions with less effort required. Today camera traps are used for a wide variety of studies such as monitoring animal populations, identifying animal behaviours, locating endangered species, identifying habitats and flora-fauna interactions [3].

1.1 Background

A traditional camera trap is most often used with a remote trigger, the most common type found today is the passive infrared (PIR) sensor. This sensor works by detecting heat energy emissions from objects. When this remote trigger is activated the camera takes a picture. The images captured by the camera need to be processed by classification to extract the information needed for a given study [6].

Until recently this classification has been an entirely manual and time consuming process which severely limits the efficacy of the method and can make it infea- sible to use if resources are limited. Norouzzadeh et al. describes the effort of manually classifying the Snapshot Serengeti dataset, which required a total of

˜14.6 years at 40 hours/week of effort from 30,000 volunteers. Norouzzadeh et al.

calculated that the use of machine learning with their model trained on the Snap- shot Serengeti dataset could save around 8.4 years of 40 hours/week of work, providing a significant efficiency gain [7].

Since the camera has no way of knowing if the trigger is activated by an animal or something else this results in many false positives. This makes the datasets bulkier and as a result makes manual processing more time consuming. Norouz- zadeh et al. gives an example from the Snapshot Serengeti dataset, where 75 % of the 3.2 million images contained in the dataset are classified as empty [7].

(11)

1.1.1 Developing solutions with machine learning

Over the past few years many efforts have been made to develop neural networks for automating the identification of wildlife species and behaviour making the traditionally manual classification effort much faster while still maintaining high accuracy, such as the works of Norouzzadeh et al. and Tabak et al. [7] [6].

Norouzzadeh et al. demonstrated that their automatic model trained on 48 species in the 3.2 million image Snapshot Serengeti dataset can save an estimated 8.4 years (>17000 hours at 40h/week)by classifying 99.3% of the dataset at the same accuracy of 96.6% as achieved by human volunteers. This efficiency gain according to the authors highlight the importance of using machine learning and deep neural networks to make camera trap data collection more available by reducing cost and complexity [7].

Further improvements in accuracy have been made by Tabak et al. [6], with their model trained on animals from five regions in the US calculated at 98% accuracy only using images not seen in training. Their model further reached an out-of- sample accuracy as high as 82% on a dataset from Canada and could correctly identify 94% of images containing an animal from those that were empty in an out-of-sample dataset from Tanzania. The model is also fast, being capable of classifying 2000 images per second on a laptop running MacOS with 16 Gigabytes of RAM.

1.1.2 Cloud or edge computing?

One commonality between all of the studies focusing the use of machine learning to aid wildlife camera trap studies is the use of central processing when evalu- ating the trained models. A recent trend in the field of machine learning is to place the computation directly on edge devices instead of relying on powerful centralised infrastructure to handle the very intensive workloads often associated with machine learning. At the time of writing there are no studies available focusing on the evaluation of using machine learning on the edge in the context of classifying data collected using wildlife camera traps.

Edge computing has many advantages, among them the reduced need for energy intensive data transmission, decreasing bandwidth use and improving latency and round-trip times in the process. Adegbija et al. describes the significant and potentially unsustainable energy overheads incurred in the transmission of data, where of one bit of data transmitted from a Rockwell Automation sensor can in certain circumstances require as much as 1500x - 2000x more energy compared to what is required to execute a single instruction locally [8].

Merenda et al. [9] show that interest in edge computing has risen massively in the past 5 years. Running machine learning applications locally on low performing hardware such as IoT devices has become a large area of interest and has been the subject of many studies such as the work of Abdel Magid et al. [10] exploring the feasibility of moving the machine learning computation closer to the edge.

Both Merenda et al. and Abdel Magid et al. argue that mission-critical tasks are dependent on reducing the edge-cloud delay.

(12)

1.1.3 Purpose and motivation

Camera traps are often deployed in remote areas where access to a stable and fast connection is often very limited. Sending large images to the cloud for real-time classification is bandwidth and energy intensive and high latency and round-trip times means that real-time classification is difficult to perform. Using central processing in the cloud for this purpose would therefore be inefficient compared to executing the classification on the edge device, in this case the camera trap. The hardware used as the camera traps in this project are small, low-cost single-board computers that are widely available to regular consumers.

This project aims to continue the current trend of applying machine learning at the edge by examining the feasibility of moving the classification of images from camera traps in the context of bird species classification to the camera trap itself.

The focus on the classification of bird species is done for the relative ease of experiment setup, this edge computing approach could most likely be applied with other types of classification problems as well.

The primary purpose of this approach is to decrease the cost often associated with camera trap studies both in money, personnel and time. This is achieved by automating the image classification using machine learning on the edge. This project also aims to increase the body of knowledge available to the AI research community in the context of applying ML solutions in the ecological and biological fields of study.

Adding the image classification to the camera trap makes it possible to identify if the image does contain a relevant animal before the image is saved, which should in theory significantly reduce the amount of false positive images contained in the dataset, if not eliminate them entirely.

All of these improvements would decrease the cost traditionally associated with wildlife camera trap studies and thus make possible more studies using this tech- nique. The decreased time it takes for extracting the data makes it possible to perform more studies in the same amount of time. Making this method of study feasible for projects with less resources can also have a positive impact on wildlife studies by decreasing the necessity for more invasive methods for data collection.

The energy saved by decreasing the need for transmitting large amounts of data will increase battery life and prevent the overheads that are inherent to the transmission of data to the cloud such as bandwidth bottlenecks, round-trip times and latency.

1.2 Problem formulation

1.2.1 Hypothesis

The hypothesis in this project is that applying the edge computing paradigm to a camera trap in the context of classifying bird species is feasible, and can achieve real-time classification performance, high accuracy and long operation using battery power. To achieve this widely available low-priced consumer hardware platforms can be used, such as the popular Raspberry Pi and Coral board. Using image classification will also significantly reduce the occurrence of empty images, reducing the total size of the collected dataset.

(13)

1.2.2 Research questions

The questions this project examines are:

• Which tested hardware platform is most suitable for use in terms of performance and real-time classification?

• Which tested hardware platform is most suitable for use in terms of accuracy and energy efficiency?

1.3 Limitations/Scope

The scope of this project is focused only on the application of machine learning on the edge in the context of image classification and wildlife camera traps.

This project uses a pre-trained inference model for all testing performed in the experiment. The model is published by Google and trained on the iNaturalist 2017 dataset, and is compiled for use with both CPU-only and Edge-TPU equipped systems. This project will not focus on training a custom model.

The hardware used for comparison in this project is the Raspberry Pi, Coral board and OpenMV H7+ camera. The respective platforms included cameras will not be used to minimize experiment variables.

This project will focus on the identification of bird species rather than identifying individual birds of the same species.

1.4 Related work

Some notable prior works in the context of machine learning in wildlife camera traps include an article by Norouzzadeh et al. [7], training a model that could identify animal species and behaviour from datasets collected from wildlife camera traps, achieving very high accuracy and the potential to save significant amounts of time and money compared to manual classification.

Other notable works include that made by Tabak et al., further improving accuracy and performance. Their work achieved a very high out-of-sample accuracy, as much as 82% [6].

Some notable generalised studies focusing on the need for moving machine learning closer to edge include the works of Merenda et al. and Abdel Magid et al.

Merenda et al. has shown that interest in this area of study has risen massively in the past 5 years, and outlines some of the challenges involved such as latency and privacy [9]. Abdel Magid et al. further highlights the importance of minimizing the edge-cloud delay in mission-critical systems in their work exploring the feasibility of moving machine learning to edge devices [10].

1.5 Overview

Section 2 gives an overview of the implemented solution and also information regarding the interpretation of the results, and how they are evaluated. Section 3 gives a short introduction to machine learning and deep learning, along with a description of neural networks, a type of deep learning. This section also gives

(14)

an introduction to the TensorFlow and TensorFlow Lite libraries and information regarding edge devices and their use in the context of machine learning.

Section 4 gives more details regarding the practical implementation such as the software development and the experiment setup. Section 5 shows the results from the controlled experiment performed in this project. The evaluated attributes are performance, energy efficiency and accuracy.

Section 6 discusses the method and the experiment, their limitations and alter- natives. This section also discusses the results. Section 7 describes the projects conclusions and provides suggestions for future work.

(15)

2 Method

To answer the research questions a controlled experiment was performed to compare three tested hardware platforms in regards to performance, accuracy and energy efficiency. The independent variable in the experiment is the hardware platform used for testing and the software developed for the test. The dependent variables are performance, accuracy and energy efficiency. The experiment is re- peated three times, once for each hardware platform and the dependent variable values are recorded and compared between the hardware platforms.

2.1 Hardware platforms

To achieve machine learning functionality on the camera trap the device must have more computational resources than a traditional camera trap. Jensen et al.

describe the complexity of balancing accuracy and performance when using machine learning on edge devices. If high accuracy is needed more complex algorithms must be used, which increases computational cost. Most low-cost hardware on the market do not contain high-performing hardware, making most full- size predictive models run too slow to be usable. This leads to the trade-off between running a less complex model which decreases accuracy or increasing the available resources, bringing downsides such as higher power draw, increased cost bigger physical size [11].

There are neural networks specifically designed for use with IoT, mobile and other low-power hardware in mind, for example MobileNets that can achieve good performance without a significant drop in accuracy [12]. These efficient algorithms together with libraries like TensorFlow lite, a scaled down machine learning framework based on TensorFlow has made possible the execution of on-device, real-time machine learning on low-cost devices such as the popular Raspberry Pi line of computers.

This project compares three low-cost options for enabling the use of machine learning in a camera trap application, one of them is the current (at the time of writing) Raspberry Pi 4B single-board computer. The second hardware platform examined is the Google Coral dev board with on-board Edge TPU accelerator for increasing NN inference performance. The third option examined in this project is the OpenMV Cam H7+, a micro-controller equipped camera board capable of running a Tensorflow Lite ML-model.

The Raspberry Pi 4 and the Coral board both include 4 GB of RAM. The OpenMV cam H7+ includes 1 MB of SRAM and 32 MB of SDRAM.

The tested hardware is run in their stock configuration and clock speeds. The Raspberry Pi has an added cooling fan to help manage the heat and prevent throttling of the CPU. The Coral board comes as standard with a cooling fan.

The Raspberry Pi and the Coral board are run with their respective Debian based included operating systems, Raspberry Pi OS 5.10 and Mendel Linux 5.2. No changes are made to the operating systems from their standard configurations.

The OpenMV H7+ is run through the OpenMV IDE connected and powered with a USB cable to a stationary computer running Windows 10.

(16)

2.2 Software development

The software was developed for the experiment to solve the problem of automatic identification of wildlife in camera trap images in the context of executing machine learning on edge devices. Three software artefacts are produced in this project, one software implementation for each hardware platform under test.

The software artefacts are developed in Python using the OpenCV and Tensor- Flow lite libraries. The development environment for the Raspberry Pi 4B and the Coral dev board is Raspberry Pi OS 5.10 using the Geany Python IDE. The environment for the OpenMV cam is the OpenMV IDE installed on a computer running Windows 10.

The program artefacts are designed to process pre-recorded video instead of a real-time video stream from a connected camera for testing purposes. To find regions of interest (ROIs) in a frame a frame differencing routine is implemented to find candidates for movement in a given frame. This ROI is then passed to the inferencing routine for classification, if a bird is deemed present the frame is saved.

The model used for classification in the project is published by Google and is pre- trained using the supervised approach on the iNaturalist 2017 dataset capable of identifying 964 species of birds. The iNaturalist 2017 dataset contains approxi- mately 496,000 training images of different species, 214,000 of these belonging in the Avian category containing 964 labels with a subset of these being annotated with bounding boxes [13]. The model expects an RGB input image of size 224 x 224 with three channels per pixel and outputs one or more probable classification labels with a softmaxed combined total confidence score of 1. The label with the highest score is the most probable classification.

The pre-trained model was originally developed and published with the Google AIY vision kit, a low-cost do-it-yourself Raspberry Pi based kit aimed primarily at makers and beginner programmers interested in machine learning and computer vision [14].

More details of the software development and implementation are found in Sec- tion 4.1.

2.3 Experiment outline

The hardware platforms are evaluated using a test suite of pre-recorded videos overlooking a feeding station, capturing different species of birds. The use of pre-recorded video instead of the tested platforms included cameras was done to eliminate any possible variables that could be introduced due to the difference in sensor quality and resolution between the cameras included with the tested hardware.

Birds were chosen as the subjects for the pre-recorded videos for testing. This is due to the risk of not being able to capture other wildlife in a reasonable time frame and the relative ease of setting up bird feeders and camera equipment in a relatively controlled manner. Two species of bird are included in the test: The Parus Major (Great tit) and the Corvus Monedula (Eurasian Jackdaw). The full

(17)

specifications of the videos used in the experiment such as species, resolution, zoom level and length can be found in Table 1.

Table 1: Detailed information on the videos included in the test suite.

Video Video 1 Video 2 Video 3 Video 4

Species Corvus Monedula Corvus Monedula Corvus Monedula Parus Major

Resolution 720p 1080p 1080p 1080p

Zoom 5x 5x 5x 5x

Length 3:10 0:39 0:48 1:28

The evaluation of the test suite is based on three attributes that are considered necessary to enable well-performing automatic species classification that is usable in the field. These attributes are performance, energy efficiency and accuracy.

More details on the experiment setup is outlined in Section 4.2.

2.4 Attribute measurement and evaluation

2.4.1 Measuring performance

To measure performance across all tested platforms the frames per second (FPS) metric is used. Achieving greater FPS will mean that the programs reaction time will be quicker, and can mean the difference between a captured image of an animal versus the animal running out of frame before the classification could be performed.

A target of ≥ 10 FPS will be the goal to reach for the test suite. This target is deemed to be acceptable for real-time use due to it being fast enough to classify animals in motion before they move out of frame. The neural network inference times will also be recorded and compared between the different systems.

The inference time is the primary driver in the software performance on the different hardware platforms. A quicker inference time will result in more frames being processed in a given amount of time, increasing FPS. The total percentage of frames passed to the inference routine is also recorded. A lower percentage of frames passed to the inference routine means that the routine is executed less often, increasing performance by not inferencing frames where no movement is detected.

2.4.2 Measuring energy efficiency

To measure the hardware platforms energy effectiveness the metric of performance (FPS) per watt is used. Performance per watt is a metric that can be easily compared between systems and can also be used to measure the increased efficiency of future hardware platforms. This measurement does not isolate the inference program bringing introducing some inaccuracy since the total board power consumption naturally includes other functions such as the OS and other background processes. Since all of the platforms to be tested includes this inaccuracy it will be overlooked in this project.

To measure the boards power consumption an energy meter plugged into a wall socket is used to show the board power consumption in watts, that combined

(18)

with the FPS calculation by the software is combined to provide the FPS/watt metric used. The energy meter used is the Luxorparts model 50003 with a consumption accuracy of +/- 0.2W.

Using the power consumption values a calculation is done to determine the expected runtime of the computer when using an iiglo Powerbank with a 30Ah capacity, providing a maximum of 5V 3.6A (18W). This battery bank uses a LiPo battery and is one of the largest available on the Swedish market that can provide the required minimum of 5V 3A specified by the Raspberry Pi and the Coral board. The expected battery life values are calculated using the formula (Ah∗V)/W = T where Ah is the capacity of the battery in Amp hours, V is the voltage of the battery, W is the power draw in watts. T is the expected battery run time in hours.

The energy efficiency test is performed on Video 4 (see Table 1), with energy consumption measurements taken every 30 seconds while the program is running.

The minimum, peak and average consumption values are then calculated.

2.4.3 Measuring accuracy

To measure the accuracy of the inference performed by the ML-enabled camera trap the amount of correctly classified images is compared to the amount of in- correctly classified images.

The threshold for the inference confidence score used in the experiment is 0.5, or 50 %. Only images classified with a confidence score greater than this threshold are saved and included in the accuracy calculations. The accuracy results are shown using graphs and an accuracy percentage ranging between 0 - 100%

that is calculated from the amount of correct classifications compared to the total amount of classifications. Two scores are calculated, one only including the correct species and one including classifications made within the same family of bird. In this test classifications made within the Corvidae and Paridae families are included.

The pre-recorded videos used in the experiment are manually classified as a comparison baseline for calculating the automatic classification accuracy. All videos used in the test contain only one species of bird. Multiple individuals of the same species can be present in the video frame at a given time. The accuracy measurement will only take into account the classified images and the accuracy of species identification by comparing the results to the manual classification effort.

2.5 Reliability and validity

It is important to acknowledge and address issues that can affect the reliability and validity of this study. Reliability concerns the reproducibility of the results gathered by this study. Validity comes in many forms, of particular interest in this study is the internal and external validity. Internal validity concerns the validity of the results and conclusions reached from the collected data and external validity concerns whether the generality of the results is justified.

(19)

2.5.1 Internal validity

One threat to the internal validity in this study is the need for conducting the experiment outdoors, where uncontrollable factors such as lighting, weather and wind has the potential to affect the accuracy. This threat is not easy to mitigate, and thus the accuracy results and conclusions can not be seen as representative for the accuracy of the inference model used in the test or the general accuracy of using machine learning in the context of wildlife camera traps.

2.5.2 External validity

One threat to the external validity of this study is the generality of the results and the conclusions taken from these. The experiment in this study mitigates this threat by using common measurement metrics for the attributes measured. This makes it possible to compare the results gathered in this study to future studies comparing the same attributes on different hardware and thereby accurately measure the performance difference.

2.5.3 Reliability

One threat to reliability in this study is the measurement method for performance when measuring the inference time on the different hardware platforms. Care is taken to place the timers used in the program as close to the algorithm as possible to make sure no other operations are included in the resulting inference time measurement.

2.6 Ethical considerations

Multiple field recordings of the experiment location containing different species of birds were needed to perform the experiment for this study. Recording an outside environment where the public has access brings a risk of recording un- knowing people passing by. To mitigate this the feeding station and cameras were installed in a closed backyard on private property and the recording was supervised throughout the entire recording duration, no cameras were permanently installed. After the experiments was conducted and the results were gathered the recordings were deleted.

(20)

3 Theory

3.1 Camera traps

A camera trap consists of a compact camera combined with a remote trigger.

When the trigger is activated the camera takes a picture. The most common type of remote trigger in use today is a motion sensor, usually a passive or an active infrared sensor. Camera traps are a commonly used tool in the biological research community today for a large variety of studies, such as population estimations, habitat identification, observing animal behaviours etc.

Camera traps are not a new technology. Camera traps have been in use for over 100 years, first pioneered by George Shiras who is widely considered one of the fathers of wildlife photography. He used tripwires as a remote camera trigger that when disturbed took a picture with a strong magnesium flash. This method made it possible for Shiras to obtain some of the first images of nocturnal wildlife in North America [5].

Camera traps did not become an accepted scientific tool overnight. The first efforts exploring the use of camera traps in scientific studies began in the 1950’s and camera traps did not see widespread use until the 1980’s with the advent of cheap 35mm film cameras. These early commercialisation efforts mainly targeted the hunting market though it did not take long for researchers to recognize the potential of these widely available devices for biological studies and conservation efforts. During this time camera traps saw use in two of the most cited papers in this field of study (Karanth [15] and Karanth et al. [16]), demonstrating that the use of camera trap data collection combined with the mark and recapture method could accurately estimate tiger population density [5].

Digital camera traps began replacing film camera traps in the early 2000’s and have largely phased out the older film camera traps today. This is mainly due to their capability of gathering significantly larger amounts of data than the 36 images that can fit on a camera film roll. This quantity coupled with the ever improving image quality of digital cameras have made the modern camera trap the highly effective and flexible data collection tool it is today, used by many researchers for a variety of different studies across the globe [5].

Several conservation groups like the WWF, Panthera and the Zoological society of London (ZSL) have also recently started using camera traps to combat poaching, in combination with a collection of sensors to recognize gunshots and vehicle movements as seen in ZSLs ”InstantDetect” system. Other methods have been explored, using the camera traps to assist park rangers in monitoring areas that are difficult to access for patrols. In this use-case the camera trap needs good connectivity to be able to send real-time data via WiFi or cellular connection to prevent data loss in case of theft or destruction, securing evidence and providing the capability of preventing poaching if this information can be transmitted to authorities in real-time [5].

(21)

3.2 A short introduction to Machine Learning

Machine learning is a branch of artificial intelligence focusing on algorithms that can be trained with the use of data and automatically improve their accuracy over time. The trained algorithms are used to find patterns and features in data to make decisions and predictions based on new data of a similar type.

An introductory article on the topic written by IBM lists examples of the use of machine learning commonly found in everyday life [17]:

”Digital assistants search the web and play music in response to our voice commands. Websites recommend products and movies and songs based on what we bought, watched, or listened to before. Robots vacuum our floors while we do. . . something better with our time.

. . . Medical image analysis systems help doctors spot tumours they might have missed. And the first self-driving cars are hitting the road.”

There are different methods of machine learning in common use today. The methods usually fall into one of three main categories, supervised learning, unsupervised learning and semi-supervised learning.

3.2.1 Supervised machine learning

Supervised machine learning uses a labelled training dataset, which means that the dataset contains both the input and the desired output, the features the model needs to identify. Supervised learning usually requires less data than unsupervised learning, and the models are easier to train having access to the desired result.

IBM describes a drawback to the supervised learning model in their introductory article, namely the expense of preparing fully labelled datasets and the danger of overfitting, i.e making a model too biased to the training data to be able to handle any variations [17].

3.2.2 Unsupervised Machine learning

Unsupervised learning uses unlabelled data and extracts patterns and hidden features. This according to IBM can be very useful for anomaly detection [18], as found in email spam filters for example [17].

Nvidia describes in their blog post on the topic further uses for anomaly detection, such as bank transaction monitoring to prevent fraud [19]. Nvidia further describes other methods of organization of the data in unsupervised learning, such as clustering, association and auto-encoding.

Clustering is one of the most common methods of organizing the data when using unsupervised learning. In this method the model in training looks for similarities in the dataset and divides it into different clusters or groups. Nvidia gives an example in their blog post regarding the grouping of birds by approximate species, by clustering similar traits together [19].

(22)

3.2.3 Semi-supervised machine learning

Semi-supervised learning is a middle-ground between supervised and unsupervised learning. Using a small labelled dataset as a guideline for extracting relevant features from a larger unlabelled dataset can improve the accuracy significantly without having to use a fully supervised learning method with often very expensive labelled datasets [17]. One common use case for this learning method is in the medical field to classify CT scans or MRI scans. Manually labelling an entire dataset of these scans is time-intensive and expensive, instead a trained professional can label a small subset of scans for diseases improving the accuracy of the learning network with a relatively small amount of labelled data. [19].

3.2.4 The ML method used in the project

The pre-trained model used in this project experiment was developed by Google using the supervised machine learning method, based on the iNaturalist 2017 dataset bird subcategory containing 964 labels [13]. This method fits well for the purpose of bird species classification when following the guidelines described by IBM: a well-defined problem (identifying bird species) with a known set of expected results (bird species) [20]. This makes unsupervised learning unde- sirable since we do not need to find patterns and anomalies in the data. The dataset was already fully labelled when published making the advantages of semi-supervised learning irrelevant since the effort of fully labelling the dataset was already made.

3.2.5 Deep learning

Deep learning is a subset of machine learning. A deep learning algorithm is designed to learn the same way as a human brain does. According to IBM [17] most deep learning models are unsupervised or semi-supervised.

One type of deep learning model commonly used is the artificial neural network (ANN). Merenda, et al. describes a neural network as a combination of weights and biases [9]. An artificial neural network is made of nodes or neurons that are connected to each other and grouped in layers. Different layers in the network perform different actions on their inputs. A neural network with one or more ”hidden” layers between the input layer and the output layer is called a deep neural network (DNN). The neuron layers communicate with each other via weighted connections called edges, where the weight determines an edges relative importance.

When used for classification tasks the final layer of a DNN is a softmax function that outputs a value between 0 and 1 for each class, with all class outputs being equal to 1. This value is interpreted as the probability of an image belonging to a given class, with a higher value meaning more confidence in the result [7].

The use of deep learning and neural networks in the context of image recognition has become more widespread in the last decade. A convolutional neural network (CNN) is a type of feedforward DNN where the layers of neurons uses convolutional operations. CNNs are often used for the task of image recognition and video analysis, and has over the years become more efficient to the point

(23)

Figure 1: An example of a DNN with one hidden layer. Source: [1].

that CNNs such as MobileNets can achieve high performance on low-power devices [12]. DNNs are commonly used in other problem domains as well, such as speech recognition and machine translation [7].

3.3 The TensorFlow ML library

TensorFlow is an open-source software platform for machine learning development, providing multiple libraries and tools for creating machine learning applications. TensorFlow was developed by researchers working on the Google Brain team, TensorFlow being the second-generation system made by this team for developing ML solutions succeeding the earlier DistBelief system used by Google since 2011 [21].

TensorFlow is platform-independent and flexible, supporting a variety of algorithms for training and inference for deep neural network models. An application built using TensorFlow can be executed with little or no change on a wide variety of systems with differing resources, from mobile phones to large-scale dis- tributed systems. TensorFlow also supports different tools and libraries to extend its functionality, such as TensorFlow Hub. TensorFlow Hub provides a library of pre-trained TF and TFLite models that can be implemented with a small amount of code, making it easy to implement common ML tasks without requiring the effort of training a custom model from scratch [22].

Since its release TensorFlow has been used for researching and deploying machine learning systems in many areas of computer science such as computer vision, natural language processing and speech recognition [23].

TensorFlow was released in November 2015 under the Apache 2.0 license, maintaining two APIs for Python and C++ and providing reference implementation and documentation.

(24)

3.3.1 TensorFlow Lite

TensorFlow lite is a toolset for TensorFlow designed primarily for use in edge devices and other power-restricted environments. It enables the use of on-device ML-inference with low latency and smaller binary sizes, from around 1 MB when all supported operators are linked to around 300 KB when limited to support for InceptionV3 and MobileNets image classification models. This variation in binary size enables the use of TensorFlow lite on a large variety of edge and embedded devices, from small microcontrollers to more traditional CPU equipped devices such as mobile phones or Single-board computers (SBCs). One example of a microcontroller running TensorFlow lite is the OpenMV camera family, covered in more detail in Section 3.4.3.

According to TensorFlow the use of TensorFlow lite for machine learning on edge and embedded devices will improve some areas of concern [24]:

• Latency: No round-trip time to a server when performing on-device execution.

• Privacy: No data needs to leave the device.

• Connectivity: No internet connection is required, useful in bandwidth limited areas or areas with network congestion.

• Power consumption: Network connections increase the power consumption of edge devices.

TensorFlow lite contains an optimized interpreter designed to be as small and fast as possible to achieve good performance on edge and embedded devices [25].

TensorFlow lite also includes a converter that converts full-size trained Tensor- Flow models to TensorFlow lite models via the use of FlatBuffer serialization [26].

A FlatBuffer (an open-source serialization library) is designed to be as memory efficient as possible, requiring no extra memory allocations (in C++) than that of the buffer to access the data. FlatBuffers are designed to not require any pars- ing or unpacking enabling quick access to the serialized data without additional overhead [27].

3.4 Edge devices and the edge computing paradigm

Traditionally edge devices were only serving as entry and exit points on the bor- ders of large networks, controlling access and quality-of-service (QoS) to these.

The need for more computing power on the edge gave rise to the increasingly popular edge computing paradigm due to the increasing popularity of IoT enabled devices [28].

Interest in edge computing has risen massively in the past 5 years owing to the concerns of edge-cloud latency and bandwidth saturation as well as grow- ing privacy and reliability concerns, for example in tasks like health monitoring [9].

Running machine learning applications locally on low performing hardware such as edge devices and embedded systems has become a large area of interest and

(25)

has been the subject of many studies such as the work of Abdel Magid et al. [10]

exploring the feasibility of moving the machine learning computation closer to the edge. Both Merenda et al. and Abdel Magid et al. argue that mission-critical tasks are dependent on reducing the edge-cloud latency.

Adegbija et al. gives an example IoT/edge use case with high potential impact in the area of medical diagnostics. Portable MRI and ultrasound devices transmit several gigabytes of high resolution images to medical personnel to be processed.

This system when scaled to include multiple devices that transmit these large amounts of data can result in bottlenecks and cause problems in scenarios such as medical emergencies where low latency is vitally important. Instead of transmitting this data, edge computing can be applied by equipping the medical devices with enough computational resources to process the data that it generates. Only this locally processed data is transmitted to the medical personnel, minimizing the amount of raw data transmission thereby reducing the risk of bandwidth bottlenecks, improving latency and improving energy consumption [8].

In recent years the capability of low-cost hardware has risen to the point of being able to run machine learning algorithms in real-time with good performance, making it possible to move many traditionally centralised ML tasks to the edge devices [29]. This also provides solutions to the primary concerns of energy consumption, bandwidth bottlenecks and latency while also alleviating the concern of privacy since no data needs to leave the device.

A popular device for use in edge computing and embedded systems is the Rasp- berry Pi series of single-board computers. More machine learning focused devices have also been developed in recent years to accelerate ML processes on the edge, such as the Google Coral family of hardware and the OpenMV camera family. These three products are compared in this project.

3.4.1 Raspberry Pi

The Raspberry Pi line of single-board computers (SBCs) was developed by the Raspberry Pi foundation, with its first generation computer releasing in Febru- ary 2012. The Raspberry Pi line of computers are low-cost and very modular, making them popular for use in industry applications and hobbyist projects alike [30].

According to Raspberry Pi around 44% of all Raspberry Pi computers sold have gone to industrial customers with over 35+ million boards sold to this market segment, and products like the Compute module series were developed to target embedded applications [31].

The current model of Raspberry Pi is the model 4 B, released in 2019 using a Broadcom BCM2711 Quad core Cortex-A72 (ARM v8) 64 bit system-on-a-chip (SoC) with the choice of 2, 4 or 8 gigabytes of RAM. A Raspberry Pi model 4B is shown in Figure 2.

3.4.2 Google Coral

Google coral is a hardware and software platform specifically targeting neural network inferencing on embedded systems. Coral offers a range of products

(26)

Figure 2: A Raspberry Pi 4B with added cooling fan.

ranging from accelerator cards in USB and PCI-express formats to complete development kits and system-on-modules for embedded systems. All coral products include the Edge-TPU coprocessor, an application-specific integrated circuit (ASIC) that is designed to execute neural network inferencing at high speed and low power cost [32].

The development board offered by coral is similar in form factor to a raspberry pi, designed for prototyping ML-applications for embedded systems. The dev board comes with an NXP i.MX8M SoC with 4 ARM-Cortex A53 cores and 1 Cortex M4F core, the Edge-TPU ASIC and the choice of 1, 2 or 4 gigabytes of RAM [33]. A Coral dev board is shown in Figure 3.

Figure 3: A Coral dev board.

3.4.3 OpenMV

The OpenMV camera series is a range of micro-controller equipped camera modules capable of running machine learning algorithms and neural network inferencing, with support for TensorFlow lite neural network models. They require no operating system, making them very power efficient consuming around 200mA of power when processing images. They are programmed in MicroPython. The camera modules are expandable with with shields similar to that of an Arduino board, using a standardised I/O pin layout for easy creation of custom circuits for use with the camera [34].

The camera board used in this project is the OpenMV Cam H7+, using the STM32H743II ARM Cortex M7 processor with 32MBs of SDRAM + 1MB of SRAM. This RAM capacity makes it possible to run larger TensorFlow lite binaries using this camera than other models in the range that only include 1MB of SRAM [35]. An OpenMV Cam H7+ is shown in Figure 4.

(27)

Figure 4: An OpenMV Cam H7+.

4 Implementation

4.1 Software development

To perform automatic species classification in a camera trap software was developed. Three programs have been developed in Python (MicroPython for the OpenMV) using TensorFlow lite for inferencing using the pre-trained model published by Google. The programs are prototypes and are not currently capable of running from a live video feed, instead running inference on pre-recorded videos.

The most compute intensive part of the program is the inference routine, where the model is applied to try to identify a species of bird in the frame. This execution can take a very long time, as much as 120ms on the Raspberry Pi. The execution of this part of the code must be run as few times as possible by im- plementing movement detection, and only running the inference routine when sufficient movement has been detected in the frame.

To do this a frame differencing routine was developed to mimic the actions of a PIR sensor. This frame differencing identifies movement in a frame by comparing the current frame to a background frame, and extracting the region of the frame where the movement occurred. This region-of-interest (ROI) is then passed to the inference routine for classification, only running the inference when a suffi- ciently large ROI has been detected. The use of an ROI for inferencing was also an attempt to increase the accuracy of the pre-trained model which is not originally intended for use in real-world conditions, as described further in Section 4.2.

A condensed version of the program operation is shown in Figure 5. This flowchart depicts the common steps the programs perform to achieve automatic species identification. The Coral board is slightly different as shown in Figure 6, the inference is performed on the Edge TPU instead of the CPU. This improves performance significantly and as seen in Section 5.1 is a significant factor in the increased performance of the Coral board.

Due to difference in software support different libraries were used to perform some tasks like frame resizing, frame differencing and inferencing. The Coral dev board uses the Pycoral library to load Tensorflow lite models compiled for the EdgeTPU. The Raspberry Pi uses a TensorFlow lite interpreter wheel that only

(28)

Figure 5: Diagram showing program operation for CPU only systems.

contains functionality for CPU inference, while the OpenMV cam uses its built in TensorFlow lite library for inference. The OpenMV cam lacks functionality for applying a threshold mask as seen in OpenCV, this made is impossible to use the same motion detection method by finding contours as used in the Coral board and Raspberry Pi.

The frame differencing method using OpenCV is shown in Example 1, a slightly modified application of the method described by Adrian Rosebrock in his tuto- rial series on OpenCV [36]. The absolute difference between a background frame and the current frame - both converted to grayscale from the original BGR and Gaussian blurred - is calculated and a threshold applied to the result. If the difference is above the threshold specified the pixel area is set to a value of 255, and the other areas set to 0.

The findContours method is then applied to the resulting threshold image containing the candidate ROIs. The list of ROIs is then iterated over and the biggest ROI is found and returned. Only one ROI is returned to the inference routine to speed up execution time of the inference routine.

(29)

Figure 6: Diagram showing program operation when using an Edge TPU.

1 def find_ROI (gray , firstFrame ):

2 # Minimum area for ROI 3 min_area = 500

4 # Calculate difference between background and current frame

5 diff = cv2. absdiff ( firstFrame , gray )

6 # Keep only the difference deemed significant , as specified

7 #by the threshold , dilate to fill gaps 8 threshold = cv2. threshold (diff , 25, 255 ,

9 cv2. THRESH_BINARY )[1]

10 threshold = cv2. dilate ( threshold , None , iterations =2) 11 # Find the contours of the thresholded areas in the frame 12 cnts = cv2. findContours ( threshold . copy () ,cv2.

RETR_EXTERNAL ,

13 cv2. CHAIN_APPROX_SIMPLE )

14 cnts = imutils . grab_contours ( cnts ) 15 # Values for placing the largest ROI 16 mx = (0 ,0 ,0 ,0)

(30)

17 mx_area = 0 18

19 # Iterate through the found contours 20 for c in cnts :

21 #If a contours is below the minimum area specified , drop it

22 if cv2. contourArea (c) < min_area :

23 continue

24 # Draw a bounding box around the ROI and calculate area 25 (x, y, w, h) = cv2. boundingRect (c)

26 area = w * h

27 # Keep only the biggest ROI found 28 if area > mx_area :

29 mx = (x,y,w,h)

30 mx_area = area

31

32 return mx

Example 1: Frame differencing method using OpenCV

The result of the inference method is a classification label that had the highest confidence score, and the confidence score itself. If this confidence score is higher than 0.5 (50%) then the video frame and the ROI for that frame is saved.

The lack of software library support was very noticeable on the OpenMV camera.

There was no way to implement a frame differencing method that was function- ing as the OpenCV method used for the Coral and Raspberry Pi. The OpenMV does have functions to implement frame differencing, but the threshold function did not support masking the image to like OpenCV. This and other limitations made it necessary to perform some functions manually, like scaling the frames in the video and providing the inference function with a manually set static ROI to use for the entire video duration. Even with this functionality replaced by manual effort the performance when running inference with the pre-made model was lacking, only being capable of 0.15 fps on average.

The videos used for the OpenMV camera are the same videos as used by the other computers, but only one frame per second of video is run through the inference method to make running the test suite less time consuming. This also decreased the chance of test failure and lowered the impact of having to reset the test if something went wrong.

4.2 Experiment setup

The experiment was performed using four pre-recorded videos. The videos were recorded using a Samsung Galaxy S8 smartphone and a Logitech Brio 4K web camera. The Samsung smartphone is shown in Figure 7 installed on a tripod.

The use of pre-recorded video instead of the tested platforms included cameras was done to eliminate any possible variables that could be introduced due to the difference in sensor quality and resolution. Since none of the included cameras with the tested systems were cross-platform compatible the decision was made to use external cameras to make sure that the same source footage was used.

(31)

Figure 7: The Samsung smartphone used for video recording.

A feeding station was set up as shown in Figure 8 with a variety of foods, such as seeds, nuts and tallow balls to attract as wide a variety of species as possible.

This attempt was not very successful however, only two species of birds were selected for the final test videos. This was due to weather conditions and camera problems including artefacts in the image due to the use of digital zoom.

Figure 8: The feeding station setup used in the experiment.

The two species that were used in the test was the Corvus Monedula (Eurasian Jackdaw) and the Parus Major (Great Tit), as specified in Table 1. These species were by far the most common sight at the feeding station and provided good contrast to the test due to their difference in size and color.

Three videos used in the test included the Corvus Monedula recorded in both 720p and 1080p. This was done to determine if any improvement in accuracy could be gained by higher resolution source material and what impact this would have on the performance. The fourth video contains the Parus Major, only recorded in 1080p. All videos were recorded using the same zoom level of 5x from the same position to better frame the feeding station.

Other species seen but not included in the test due to insufficient video quality was the Pica Pica (Eurasian Magpie), Phasianus Colchicus (Common Pheasant) and the Carduelis Carduelis (European Goldfinch).

The use of a pre-trained classification model was a limiting factor in this experi-

(32)

ment, since the model is not made to be used in a real-world context. The model is very accurate when run on a well cropped image with great detail, not so much in a low detail wide angle view with a changing background and differing lighting conditions. Figure 9 shows a good example of an optimal image for the pre-made Tensorflow lite model, a well-framed image with good detail.

Figure 9: Example of an optimal image depicting a Scarlet macaw. Source: [2]

(33)

To overcome this limitation frame differencing was used in the software to extract a ROI from the larger image in an attempt to isolate the movement in the frame (which is presumably a bird) to increase the accuracy of the model.

Figure 10 shows the results of a classification made using the ROI extracted through the use of frame differencing. Figure 11 shows the ROI used in the inferencing process, resized to the models expected size of 224x224.

Figure 10: Classification made using a ROI extracted by frame differencing.

Figure 11: ROI used for the inferencing in Figure 10.

(34)

5 Results

5.1 Performance

5.1.1 Coral dev board

The results for the Coral dev board are shown in Table 2. The coral dev board could not reach 30 fps in any tested resolution, this being the threshold for true real-time classification for standard 30fps video.

Stepping up to 1080p the performance drops by 42.4% from 21.2 fps average to 12.21. This is above the threshold of 10fps that is deemed suitable for real-time use in this project. The average inference time across all 4 tests is calculated to 3.65ms.

The frames processed and frames inferenced fields show the total amount of frames processed in each video and the amount of frames out of the total that were processed through the inference routine. These values are then calculated to gain the total inference percentage of the video that the inference routine was executed on. This value gives an indication if the frame differencing used for motion detection is effective at reducing the amount of time the inference routine is executed.

The video processing time refers to the total time taken to process the video. The Avg. inference time shows the average time for executing a frame through the inference routine.

Table 2: Performance results for the Coral dev board.

Frames processed 5713 1197 1460 2667

Frames inferenced 4918 1192 1445 2537

Total inference % 86% 99% 99% 95%

Avg. inference time 3.53ms 3.84ms 3.7ms 3.54ms

Video processing time 269.53s 98.06s 120.6s 221.35s

Avg. FPS 21.2 12.21 12.11 12.05

5.1.2 Raspberry Pi 4

The results for the Raspberry Pi 4B are shown in Table 3. The Raspberry Pis average inference time across all 4 tests was 118.9ms, slower than the Coral dev boards 3.65ms. The Raspberry Pi 4B did not achieve a FPS greater than 10 in any tested resolution, seeing a performance drop of 24% when stepping up to 1080p.

(35)

Table 3: Performance results for the Raspberry Pi 4B.

Total inference % 86% 99% 99% 95%

Avg. FPS 7.16 5.47 5.53 5.75

5.1.3 OpenMV Cam H7+

See Table 4 for the OpenMV cameras performance results. Inference times of over 5 seconds across the entire test suite. Note that the amount of frames per video are lowered to 1 frame per second of video to make the tests run within a reasonable time frame as explained in Section 4.1.

Table 4: Performance results for the OpenMV H7+.

Total inference % 100% 100% 100% 100%

Avg. FPS 0.15 0.15 0.15 0.15

5.2 Energy efficiency

The result of the Coral dev board energy efficiency test is shown in Table 6.

The average power consumption is≈7.8% higher than the Raspberry Pi 4B. Due to the higher performance of the Coral board the FPS/W when using the averaged energy consumption value is 0.97, 94% higher than the Raspberry Pi 4B.

(36)

Table 5: Power consumption values for the Coral dev board.

Power Consumption Idle Load Min. Load Avg. Load Max.

Coral dev board 5.4 W 10.9 W 12.3 W 14.2 W

Avg. FPS/W N/A 1.1 0.97 0.84

Battery life (hours)

30 Ah capacity 27.7 h 13.76 h 12.19 h 10.56 h

5.2.2 Raspberry Pi 4

The result of the Raspberry Pi 4B energy efficiency test is shown in Table 6.

The average FPS/W when using the averaged energy consumption value is 0.5, 48.5% lower compared to the Coral dev boards 0.97 FPS/W.

Table 6: Power consumption values for the Raspberry Pi 4B.

Raspberry Pi 4B 4.25 W 10.3 W 11.4 W 14.2 W

Avg. FPS/W N/A 0.55 0.5 0.4

30 Ah capacity 35.29 h 14.56 h 13.15 h 10.56 h

5.2.3 OpenMV Cam H7+

Table 7 shows the energy efficiency results for the OpenMV H7+. The Micro- controller in the OpenMV cam is very efficient and the board is capable of being powered through an ordinary USB port on a computer. This leads to very long calculated battery run times to the point that the device can remain idle for over 10 days, and under a worst case power draw load for over 5 days.

The performance is the lowest of the three tested devices, with an average FPS/W of 0.14, compared to the Raspberry Pis 0.5 and the Coral boards 0.97.

Table 7: Power consumption values for the OpenMV H7+.

OpenMV H7+ 0.6W 0.85W 1.05W 1.2W

Avg. FPS/W N/A 0.17 0.14 0.12

30 Ah capacity 250h 176.5h 142.8h 125h

5.3 Accuracy

The accuracy results for the Coral dev board are shown in Figure 12. The accuracy scores for species and family classifications are shown in Table 8.

(37)

The accuracy scores for the Coral dev board are low with species accuracy below 15% in video 1 and 0% in videos 2 and 3 with video 4 (the only video not depicting jackdaws) being the outlier with species accuracy of 87.3%. The results show a difficulty of correctly classifying the Eurasian Jackdaw, with the results worsening as resolution increases to 1080p in video 2 and 3. When accounting for the classifications within the Corvidae family the results are better, reaching a high of 43.7% family accuracy in video 3. Video 4 is an outlier here also, reaching a family accuracy of 89.3%.

The total amount of classifications in video 4 is much higher due to confidence scores regularly being above 70%, going as high as 98% as the software progresses through the frames in the video.

Table 8: Accuracy percentage scores for the Coral dev board.

Correct species 5 0 0 220

Correct family 12 2 7 225

Total classifications 49 16 16 252

Accuracy(species) 10.2% 0% 0% 87.3%

Accuracy(family) 24.5% 12.5% 43.7% 89.3%

Video 1 Video 2 Video 3 Video 4

0 100 200 300

49

16 16

252

5 0 0

220

12 2 7

225

Classifications

Total Correct Correct family

Figure 12: Bar chart showing the results from the Coral dev board.

5.3.2 Raspberry Pi 4

The accuracy results for the Raspberry Pi 4B are shown in Figure 13.

The accuracy percentage scores for the Raspberry Pi 4B are shown in Table 9.

The accuracy results achieved by the Raspberry Pi 4 are close to those achieved by the Coral board. The difference in the model used by the Coral dev board is that it is Edge TPU compiled, otherwise the models are identical. The same difficulty of accurate classification of the Jackdaw in videos 1, 2 and 3 are seen in

(38)

the Raspberry Pi 4 aswell, with species accuracy at 11.9% for video 1 and 0% for videos 2 and 3. The family accuracy scores are 1.6% better in video 1, 0.8% better in video 2 and 10.4% worse in video 3 compared to the Coral board.

The Raspberry Pi achieves 2.5% better species accuracy scores and 1.4% better family accuracy scores in video 4 compared to the Coral board. The Raspberry Pi captured 5 fewer total frames and correctly classified one more frame compared to the Coral board.

Table 9: Accuracy percentage scores for the Raspberry Pi 4B.

Correct species 5 0 0 221

Correct family 11 2 5 224

Total classifications 42 15 15 247

Accuracy(species) 11.9% 0% 0% 89.5%

Accuracy(family) 26.1% 13.3% 33.3% 90.7%