
DEGREE PROJECT IN COMPUTER ENGINEERING, FIRST CYCLE, 15 CREDITS

STOCKHOLM, SWEDEN 2020

Eye Tracking Using a Smartphone Camera and Deep Learning

Blickspårning med mobilkamera och djupinlärning

ADAM SKOWRONEK

OLEKSANDR KULESHOV

KTH ROYAL INSTITUTE OF TECHNOLOGY


Eye Tracking Using a Smartphone Camera and Deep Learning

___________________________________________________________________

Blickspårning med mobilkamera och djupinlärning

Adam Skowronek Oleksandr Kuleshov

Degree Project in Computer Engineering, First Cycle, 15 credits

Supervisor at KTH: Jonas Willén
Examiner: Ibrahim Orhan

TRITA-CBH-GRU-2020:293
KTH School of Engineering Sciences in Chemistry, Biotechnology and Health
141 52 Huddinge, Sweden


Acknowledgements

The authors would like to extend their gratitude towards Jonas Willén, the supervisor for this project, for guidance and support with the project and the writing of the thesis report. We would also like to thank Gustaf Öqvist Seimyr for his guidance and help, and Lexplore AB for providing us with the opportunity to perform the project.


Abstract

Tracking eye movements has been a central part in understanding attention and visual processing in the mind. Studying how the eyes move and what they fixate on during specific moments has been considered by some to offer a direct way to measure spatial attention. The underlying technology, known as eye tracking, has been used in order to reliably and accurately measure gaze. Despite the numerous benefits of eye tracking, research and development as well as commercial applications have been limited due to the cost and lack of scalability which the technology usually entails. The purpose and goal of this project is to make eye tracking more available to the common user by implementing and evaluating a new promising technique. The thesis explores the possibility of implementing a gaze tracking prototype using a normal smartphone camera. The hypothesis is to achieve accurate gaze estimation by utilizing deep learning neural networks and personalizing them to fit each individual. The resulting prototype is highly inaccurate in its estimations; however, adjusting a few key components such as the neural network initialization weights may lead to improved results.

Keywords: eye tracking, smartphone, RGB camera, convolutional neural networks, deep learning, computer vision


Sammanfattning

Tracking eye movements has been a central part of understanding attention and visual processing in the brain. Studying how the eyes move and what they focus on during specific moments has by some been considered a way to measure visual attention. The underlying technology, known as eye tracking, has been used to reliably and accurately measure gaze direction. Despite the advantages of eye tracking, research and development as well as commercial products have been limited by the cost and the lack of scalability that the technology often entails. The purpose and goal of this work is to make eye tracking more accessible to everyday users by implementing and evaluating a new, promising technique. The work investigates the possibility of implementing a gaze tracking prototype using an ordinary smartphone camera. The hypothesis is that accurate gaze tracking can be achieved through the use of deep learning and neural networks, personalized to fit the individual. The resulting prototype is highly imprecise in its estimation of gaze direction; however, adjusting a few key components, such as the initialization weights of the neural network, may lead to better results.

Keywords: eye tracking, smartphone, RGB camera, convolutional neural networks, deep learning, computer vision


Table of contents

1 Introduction
1.1 Problem
1.2 Goal
1.3 Societal impact
1.4 Delimitations
2 Background
2.1 Eye tracking
2.2 Eye tracking techniques
2.2.1 Scleral search coil
2.2.2 Electro-oculography
2.2.3 Video-based eye tracking
2.2.3.1 Invasiveness
2.2.3.1.1 Head mounted systems
2.2.3.1.2 Desktop systems
2.2.3.2 Camera and light setup
2.2.3.2.1 Infrared/active light
2.2.3.2.2 Visible/passive light
2.2.3.2.3 Appearance-based/machine learning
3 Method
3.1 Literature study
3.2 Chosen method
3.2.1 Public dataset - GazeCapture
3.2.2 Data preparation
3.2.3 Custom calibration app
3.2.4 Base model
3.2.5 Fine-tuned base model
3.2.6 Personalized model
3.3 Testing and evaluation
3.4 Developer tools and environment
4 Implementation and results
4.1 Data generation and processing
4.2 Base model
4.2.1 Architecture
4.2.2 Gaze estimation results
4.3 Fine-tuned base model
4.3.1 Gaze estimation results
4.4 Personalized model
4.4.1 Gaze estimation results
5 Discussion and analysis
5.1 Analyzing the results
5.1.1 Base model
5.1.2 Fine-tuned model
5.1.3 Personalized model
5.2 Discussion
5.2.1 Method and implementation
5.2.2 Evaluation
5.2.3 Solutions
6 Conclusion
6.1 Concluding statements
6.2 Future work
References


1 Introduction

Tracking eye movements has been a central part in understanding attention and visual processing in the mind. Studying how the eyes move and what they fixate on during specific moments in time when a person is performing a certain task has been considered to offer a direct way to measure spatial attention, and by some to provide a window into the brain[1].

Visual search, perception, and reading comprehension[2] are among the areas of vision research to which it has been applied. Apart from research regarding vision, analyzing eye movements has been of interest to a much broader community, with applications in areas ranging from customer behavior[3] to driving[4-6] and game theory[7].

The underlying technology, known as eye tracking, has been used extensively to reliably and accurately measure eye movements[8][9]. Several different methods and systems have been developed in order to track the gaze of a person. They all come with certain trade-offs like invasiveness, complexity, and cost; however, all of them have the same goal: tracking a person's point of gaze accurately, reliably, and consistently. A widespread and commonly available technological solution to the issue of accurately and reliably tracking a person’s gaze is therefore desirable.

1.1 Problem

Despite the numerous benefits of eye tracking, research and development as well as commercial applications using this technique have been limited due to the high cost and lack of scalability which the technology usually entails. The equipment typically used for eye tracking in real-world settings is both expensive and non-scalable because of the specialized hardware required for most state-of-the-art eye tracking systems. This typically includes the use of infrared light sources and multiple high spatio-temporal resolution infrared cameras, head mounted contraptions, or other invasive equipment. Cheaper alternatives which utilize only normal RGB cameras exist; however, their accuracy has until recently proven to be much lower than that of their infrared counterparts. Since accuracy and precision are of utmost importance to most eye tracking tasks, RGB image systems have thus often not been useful enough to implement in real-world applications. Achieving accurate, reliable, and robust eye tracking using simple, regular RGB cameras, such as those built into smartphones or webcams for desktops/laptops, would cut costs drastically and make the technology more scalable and available to a broader community due to their pervasiveness in society. Recent advancements in research on single RGB camera gaze estimation have shown the potential of developing such a solution.


1.2 Goal

The purpose of this project is to implement and evaluate a technique for an eye tracking system using only a single, regular RGB camera and a computer as hardware, thereby making the technology more available. The goal of the thesis is to develop a prototype for measuring eye movements based on image analysis of video footage collected by the camera. The idea is to create a prototype accurate enough to be capable of tracking and analyzing a person's reading comprehension.

1.3 Societal impact

Collecting image data of people’s faces and measuring where they focus their gaze has the potential of violating their privacy and integrity if used irresponsibly or without their explicit consent. Eye tracking technology could therefore potentially be used for malicious intent such as involuntary surveillance and/or control. Making the technology more widely available, however, also offers the possibility of increasing the benefits received by society as a whole, since it would ease the development of new applications to help people. For instance, eye tracking could be used in solutions to alleviate some of the problems facing people with motor disabilities by enabling gaze-controlled user interfaces, or it could be used to help identify students with reading difficulties early on in their lives by analyzing their reading patterns. A significant advantage of making eye tracking technology more available is the substantial reduction in development cost for such applications. By eliminating the need for specialized hardware and instead relying solely on commonly available equipment, developing products is made easier and cheaper. Utilizing equipment which many people already possess, such as smartphones, limits the amount of special equipment and hardware needed in order to employ the technology, which also reduces its environmental impact.

1.4 Delimitations

The prototype will be built using only a smartphone and a computer. The system is meant to function with any Full HD (1920x1080) RGB camera, and thus the choice of smartphone will not be crucial to the function of the system since most modern smartphones possess such a camera. The system will be created with the goal of tracking the user’s eye movements projected onto a smartphone screen. The prototype does not need to run on the smartphone in its entirety; it is sufficient that the video stream is collected with the phone's camera and that the gaze estimation is projected onto the phone’s screen. The estimation of gaze points need not be performed in real time, which should increase the precision of the model compared to previous findings. The prototype will not take into account or compensate for special factors such as lighting conditions, or user behavior such as head movements or wearing glasses.


2 Background

In this section, a thorough walkthrough of previous work in the field as well as relevant literature, concepts and theory related to eye tracking will be presented.

2.1 Eye tracking

Eye tracking is part of the broader field of computer vision, which deals with how computers may gain high-level understanding by acquiring, processing, and analyzing images or video. Eye tracking, or gaze tracking, is the process of measuring the point of a person’s gaze, where and what they are looking at, or the motion of the eyes during a certain task. It is a common method for observing the allocation of visual attention. The human eye consists of multiple components: the pupil, retina, cornea, sclera, and iris among others. The different processes and activities these components perform as we focus our visual attention during specific tasks can be captured and used as signals in various applications[10]. The goal of tracking a person's eyes is to determine the gaze direction towards certain objects or focus points.

Eye tracking is an active field of study and new methods of improvement are frequently published, yet many challenges remain unsolved. Different gaze estimation techniques can be used to determine the gaze position, some of which rely on the connection between data from an image of the eye(s) or face and the point of focus. Other methods rely on specific facial features and the relationship between them[11]. Several video-based eye tracking methods involve the concept of eye detection, which is the process of locating the eye region from an image of the face. For this to happen, the facial features first need to be located and extracted. There are several methods used for this process, some of which will be explored in further detail below. Video, or camera based, eye tracking can typically be grouped into feature-based and appearance-based methods.

Feature-based methods use geometric features of the eye. They identify features of the eye(s) such as the pupil, iris, and the eye corners in order to directly calculate the gaze direction. Camera calibration prior to use is crucial in these systems. The point of gaze is determined by modelling the general physical structure of the eye in geometric form to calculate a three-dimensional gaze vector. An important factor is finding features that are less sensitive to changes in lighting conditions and viewpoint, i.e. more consistently visible, so that tracking remains robust under different circumstances. These methods are nevertheless still prone to error in bright lighting conditions, such as outdoors, where the features may be impossible to detect. They require specialized hardware and algorithms, and are constrained by factors such as limited head movement and the environment[11].

Appearance-based methods track eye movements by photometric appearance. The method achieves gaze estimation by directly mapping relevant image contents to positions on the screen. Appearance-based methods usually do not require calibration by the user prior to using the system since the mapping is made directly on the image data. These methods are less constrained by environmental factors such as light, head pose/angle, and head movement, yet there are also challenges facing this type of technology precisely because of these conditions. Extensive research has not been conducted on the effect of different illumination conditions, but it is hypothesized that it negatively affects the ability to detect the eyes during poor lighting conditions. The appearance-based methods typically do not use special equipment; rather, they simply use RGB or grayscale images collected from regular cameras[11][12].

2.2 Eye tracking techniques

Many methods and techniques have been used to measure eye movements. These methods can typically be divided into three major categories:

1. Detecting movements on an object attached to the eye (scleral search coil).

2. Measurement of electric potentials using electrodes placed around the eyes (electro-oculography).

3. Video-based eye tracking without direct contact to the eye.

2.2.1 Scleral search coil

This method uses an object in direct contact with the eye, a special contact lens, to measure the movements of the eyes. The lens contains mirrors attached to a wire coil that moves in a magnetic field. The field induces a voltage in the coil to produce a signal that represents the eye position. Eye movements can also be recorded by monitoring the infrared light reflected by the mirrors, which eye trackers can use to accurately move an image in line with the eye movement[12].

This method allows the measurement of eye movement in horizontal, vertical and torsion directions[13]. The advantages of this technique are its high accuracy, good resolution, 3D data representation, and high sampling rate, whereas the main disadvantages are its complicated implementation, invasiveness, and high equipment cost. Magnetic search coils are the method of choice for researchers studying the dynamics and underlying physiology of eye movement, although many medical labs refrain from using this approach due to its disadvantages[13-15].

2.2.2 Electro-oculography

In this approach, electric potentials are measured with sensors (electrodes) placed around the eyes. The eyes act as the origin of a steady electric potential field, which can be modelled as being generated by a dipole with its positive pole at the cornea and its negative pole at the retina. By recording small potential differences in the skin around the eyes, the position of the eyes can be estimated. If the eye moves from a centered position toward one of the electrodes, this electrode sees the positive side of the retina and the opposite electrode sees the negative side of the retina. Consequently, a potential difference occurs between the electrodes. Assuming that the resting potential is constant, the recorded potential is a measure of the eye's position[12][16].

This technique is typically limited to lab environments since it requires special equipment and supervision. It is a relatively cheap and easy method, but highly invasive, and thus not suitable for everyday use. A major advantage of this method is its ability to measure eye movements even while the eyes are closed, for instance, while sleeping[17][18].

2.2.3 Video-based eye tracking

This category is rather large, and will therefore be divided into smaller sub-categories. The optical methods can be categorized based on invasiveness as well as the type of camera and light source used.

2.2.3.1 Invasiveness

Eye tracking systems utilizing cameras can be divided into the more invasive head mounted systems and the less invasive desktop systems. Head mounted systems make use of some sort of contraption attached to the subject's head, typically glasses or a helmet. These systems usually consist of a camera for each eye and multiple infrared LEDs. A desktop system consists of one or more cameras of different variety, stationed facing the subject and near the object of interest on which the subject is meant to focus their gaze.

2.2.3.1.1 Head mounted systems

Attaching the camera(s) to the head comes with several benefits when tracking the eyes. By decreasing the distance between the camera lens and the eyes, the eye movements can be captured more accurately since a clearer image with much higher resolution of the eye can be extracted. Furthermore, because the relationship between the eye and the camera is static and near, there will be no inconsistencies in the eye tracking footage based on user behavior, such as the eyes appearing partially closed due to panning, rolling, or tilting of the head. This is an advantage because it limits the amount of possible errors; for instance, there is no need to identify and extract the eyes from the camera recording, and thus no room for failure in this regard. Because the camera is attached to the head and follows its movements, there is also no need to compensate for the head position in the gaze direction. The drawbacks of head mounted systems are their relative expensiveness, invasiveness, and inability to scale since they are often limited to a lab environment.


2.2.3.1.2 Desktop systems

Desktop systems use some non-intrusive, remote method for measuring eye movements. A typical setup of a desktop system is a single or multiple cameras set up remotely to record the eyes and face, and a computer which processes and analyzes the data.

Desktop systems are widely used for eye tracking for the benefits of being non-invasive, more scalable, mobile, and relatively inexpensive compared to other methods. They do have their downsides however, mainly accuracy. The longer distance between the camera and subject leads to a lower resolution of the eye region, with fewer pixels depicting the eye and thus a less accurate and less detailed perception of it. Furthermore, by setting up the camera in a fixed position in space, separated from the head, another problem arises: the camera does not know where the subject is located within its field of view. The camera has no notion of where the subject exists in space, nor where their eyes are located. This problem can be resolved in a number of ways, one of which is through the use of multiple cameras where each of them has a specific task, for instance, one for keeping track of the eyes and one for capturing the head location, as seen in Majaranta & Bulling[10] and Bulling & Gellersen[19]. In these types of solutions, all of the data captured from the cameras are combined to estimate the subject's gaze. The benefit of this method is that the use of multiple cameras allows for much more head motion[12]. When instead using a single camera, the subject needs to be detected within the image/videostream somehow. A method presented by Cheung & Peng[20] for single camera eye tracking involves first detecting the subject’s face, then identifying the eye region, and finally extracting and calculating the subject’s gaze direction based on their eye movements.

Detecting and extracting the eye region from the subject’s face proves yet another challenge due to head movement and position. This challenge is typically resolved either through using some sort of algorithm to compensate for head movements, using a stabilization tool such as a chin rest, or by assuming the subject keeps their head completely still and straight during the eye tracking procedure, which puts higher demands on the subject. Many single camera methods that are commonly used in commercial systems include the use of infrared light to produce a point of reference for the gaze estimation.

2.2.3.2 Camera and light setup

The most widely used type of eye tracking systems are video-based, where the camera focuses on the eyes while recording their movements as the subject views some sort of stimulus. Several modern state-of-the-art eye tracker systems in commercial use consist of a camera and an infrared or near-infrared light source, but other systems using normal cameras with visible light also exist. Different camera setups are used depending on whether feature-based or appearance-based approaches are used.


2.2.3.2.1 Infrared/active light

Most infrared, or active light methods are feature-based. In this method, an infrared light source is used to shine into the eyes in order to produce a reflection on the eye’s surface. This reflection is subsequently recorded by a video camera and combined with other eye features, which are then analyzed to measure eye movement and gaze direction. Corneal reflection and the center of the pupil are features typically tracked, although other features are also used[21]. The output vector created by the center of the pupil and the corneal reflection (glint) on the eye’s surface is used to calculate and estimate the point at which the subject is currently looking[11].

Within infrared/near-infrared eye tracking, there are two types of tracking techniques: bright-pupil and dark-pupil. The difference between the two depends on the angle from which the light source shines upon the eye. In bright-pupil eye tracking, the illumination source is in line with the optical path, and the pupil appears bright. This comes with the benefit of creating a higher contrast between the iris and the pupil, allowing for more reliable eye tracking among subjects with different iris pigmentation, while significantly reducing the effects of interference caused by obstructions, such as eyelashes. In dark-pupil eye tracking, the illumination source is instead out of line with the optical path, which does not affect the appearance of the pupil since the retroreflection from the retina is directed away from the camera[22].

The benefits of the infrared methods are their high accuracy, ability to handle disturbances such as blinking, and in the case of bright-pupil eye tracking, the ability to record eye movements in varying lighting conditions, ranging from total darkness to very bright[12]. The downsides are their relative expensiveness and the requirement for special hardware, limiting the ability to scale.

2.2.3.2.2 Visible/passive light

Methods for eye tracking with a regular camera, such as a smartphone camera or web camera, employ different techniques to achieve the goal of eye tracking. Certain techniques based on visible or natural light are considered to be substitutes for the infrared versions[12].

In feature-based methods, visible light instead of infrared may be used as the illumination source to create the corneal reflection, although this may cause some discomfort to the subject[22][23]. The systems based on visible light have limitations, especially in bright environments due to poor contrast and light variations in the visible spectrum.

Sugano et al.[24] propose an eye tracking system where an incremental learning method is used for a single camera attached to a computer monitor. The system also implements head pose estimation by using a 3D rigid facial mesh. Yamazoe et al.[25] propose a method based on facial feature recognition using a single video camera. Wang et al.[26] show a method based on iris detection using only one camera. Sesma et al.[27] use only a regular webcam to implement a gaze tracking system.


2.2.3.2.3 Appearance-based/machine learning

The use of artificial intelligence, specifically deep learning, in the pursuit of accurate and reliable eye tracking has seen a surge in recent years. The convolutional neural network in particular has yielded promising results for satisfactory eye tracking since it is designed specifically for image processing[28]. While the standard convolutional neural network structure seems to fit the task of eye tracking, designing a custom neural network tailored for the specific task at hand might yield better results[29].

The deep learning techniques generally work by using images of the face and/or eyes as input to the model, while using either screen coordinates or some other equivalent in a defined plane as corresponding targets for training. The model finds and learns the connection between the features in the images and the targets on the screen. Once trained and having learned the mapping between the input and output data, the model is used to predict output, which in this case is the gaze location coordinates, based on previously unseen input data.

The advantages of these types of methods are their low cost, simplicity, and relative insensitivity to user behavior such as head pose. They typically also do not require camera calibration prior to use. The drawbacks thus far have been low accuracy compared to methods such as feature-based infrared systems[12].


3 Method

In this section, a study of the relevant literature will be presented. An in-depth explanation of the chosen method and the underlying theory will also be provided.

3.1 Literature study

At the start of this thesis project, a study of the relevant literature was conducted. The primary focus was to select literature that fit the scope and delimitations stated in the first chapter of the report. The aim of the literature study was to get a brief overview of the available technology, the variety of different methods which currently exist on the market and in research, and to find an appropriate method to implement for this project.

Google Scholar was mainly used to search for appropriate research, complemented by the digital libraries of various research institutions and associations such as IEEE. Example search strings that were used when searching within these networks include variations of: “eye tracking using rgb camera”, “eye tracking using single rgb camera”, “eye tracking using webcam”, and “eye tracking using smartphone”.

There are several methods for estimating a person’s gaze. When considering the delimitations and scope of this study, many methods were excluded based on the initial literature study. For instance, some methods[30-34] make use of special equipment such as IR LEDs. Other methods use multiple cameras[10][19]. Some[35-37] use RGB-D cameras, meaning special depth sensing cameras, which have proven to be promising for eye tracking but are outside the delimitations of this project. Methods involving animation of the face and/or eyes[38][39] were investigated, but were ultimately concluded to be too complex a solution.

In the study “Real Time Eye Gaze Tracking With 3D Deformable Eye-Face Model”, Wang & Ji[40] propose a method more suitable for this project. They use only a regular RGB webcam as hardware and achieve somewhat acceptable results (3.5 degree error). Similarly, Cheung & Peng[20] achieved lower error (1.28 degrees without head movement and 2.27 degrees with minor head movement) in “Eye Gaze Tracking With a Web Camera in a Desktop Environment” using the same hardware as in the previously mentioned study. In “3D gaze estimation with a single camera without IR illumination”, Chen & Ji[41] managed to achieve an error of lower than 3 degrees. Recent approaches in machine learning for eye tracking have also been shown to be promising[42-47], achieving results comparable to the above mentioned studies involving single RGB cameras. Their accuracy, however, has so far been too low for the needs of most real-world eye tracking applications, especially when compared to state-of-the-art specialized eye tracker systems where the error margin is typically less than 1 degree[12].


3.2 Chosen method

The chosen method for creating an eye tracking system using only a single smartphone camera is based on the study “Accelerating eye movement research via accurate and affordable smartphone eye tracking” by Navalpakkam et al.[48]. The study presents a highly accurate implementation of eye tracking on smartphones compared to previous findings[42-45], and was therefore chosen to be the foundation for this project. By leveraging machine learning, they demonstrate a solution capable of reaching a high level of accuracy comparable to state-of-the-art eye trackers that are hundreds of times more expensive. The study will henceforth be referred to as “the original study” in this report.

The method is based on a personalized recognition of eye movements and can be divided into three major parts: the base model, the fine-tuned base model, and the personalized model. The first part involves training a convolutional neural network model on a publicly available dataset in order to create a generalized model capable of inferring the gaze location from eye images, to a certain level of accuracy. The second part uses calibration data collected from study participants, obtained through the use of a custom calibration app, in order to re-train the neural network model from the first part, with the goal of fine-tuning it and improving the model’s accuracy. In the third and final part, the system is personalized by adding an additional regression model which is trained on the calibration data from only a single participant, finalizing the system by adapting it to each individual. During inference, the fine-tuned base model and the regression model are applied in sequence to an image to generate the final estimated gaze location. An overview of the entire system and how the different components interact is shown in Figure 3.1.

Figure 3.1: Overview of the gaze estimation system[48].


3.2.1 Public dataset - GazeCapture

GazeCapture is a publicly available dataset specifically produced to aid in the task of eye tracking. Many such datasets exist; however, this is the largest in terms of amount of data as well as participants, with nearly 2.5M frames taken from over 1,450 individuals. It has approximately 30 times as many participants and 10 times as many frames as the second largest comparable dataset, while containing a significant amount of diversity in terms of pose and background illumination[49].

The dataset consists of frames and associated metadata from various recordings made with different iOS devices’ front facing camera while the participant is performing a calibration task. The metadata include device information, screen orientation information, and target/marker coordinates on which the user is presumably focusing their gaze at the time of the particular frame. The original study uses a subset of the dataset that accommodates their needs by only selecting frames from smartphone devices in the portrait screen orientation. After filtering out both irrelevant and non-valid frames (no face/eyes detected etc.), this subset contains 624,803 valid entries.

Figure 3.2: Preprocessed data. The figure shows the transformation each frame goes through prior to being fed to the machine learning model. A full face image is processed and results in both eye regions being cropped (left eye flipped horizontally), resized, and the eye corner pixel coordinates being extracted.

3.2.2 Data preparation

In order for the data to be used by the system, it first needs to be pre-processed, formatted, and adapted to fit the models. The transformation process which the data goes through is illustrated in Figure 3.2. The raw input data to the system consists of video recordings taken by a smartphone’s front-facing camera, divided into individual frames. From these input frames, additional metadata about the eye corner positions in terms of pixel coordinates relative to the original image are retrieved, and eye regions are located and extracted. This is achieved by automatically detecting the face and facial features in each frame with the use of a dedicated face detector, a pre-trained machine learning model. Ultimately, the eye region images are cropped based on the detected eye features, and the needed metadata are extracted. The cropped eye region images are resized to a resolution of 128x128 pixels, then augmented and normalized by subtracting the mean and dividing by the standard deviation of the pixel intensities in a channelwise manner. This makes the input data consistent and standardized across the entire dataset, removing irrelevant differences and preparing them for the models to use. Corresponding screen stimulus point coordinates viewed at the time of the specific frame are also added to the metadata.
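To make the cropping and normalization step concrete, the following is a minimal sketch under our own assumptions. It expects a BGR colour frame and two eye-corner pixel coordinates already produced by a face detector; the crop margin and the helper name are illustrative and not taken from the original study.

```python
import cv2
import numpy as np

def crop_and_normalize_eye(frame, inner_corner, outer_corner, out_size=128, flip=False):
    """Crop a square eye region around the two corners, resize and standardize it."""
    (x1, y1), (x2, y2) = inner_corner, outer_corner
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2          # eye centre in pixels
    half = max(int(abs(x2 - x1)), 16)                # square crop margin (assumed)
    crop = frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]
    crop = cv2.resize(crop, (out_size, out_size))
    if flip:                                         # the left eye crop is flipped horizontally
        crop = cv2.flip(crop, 1)
    crop = crop.astype(np.float32)
    mean = crop.reshape(-1, 3).mean(axis=0)          # channel-wise standardization:
    std = crop.reshape(-1, 3).std(axis=0) + 1e-7     # subtract mean, divide by std
    return (crop - mean) / std
```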

Figure 3.3: Custom calibration app. Nine symmetrically placed calibration points are shown on the screen successively during the calibration task. The participant is simultaneously being recorded with the phone’s front-facing camera.

3.2.3 Custom calibration app

In order to obtain participant calibration data for the later stages of the system, a custom mobile app is required. The purpose of the app is to record and store the front-facing camera feed while simultaneously displaying a visual stimulus on the screen. The objective is to capture the participant’s eye movements while they perform a visually stimulating task. A simple dot calibration is sufficient, where the subject is meant to focus their gaze upon dots appearing at various locations on the screen. Nine dots, symmetrically placed in a grid across the surface of the screen, are shown consecutively in intervals of three seconds, after which the calibration task is finished. Figure 3.3 shows screenshots from the app at different moments during the calibration task. The distance between the participant’s face and the camera lens is supposed to be far enough that the entire face is visible, but close enough that the face takes up the majority of the camera frame.


Figure 3.4: Base model architecture[48]. Size used for convolutional layers: Conv1 (7 x 7 x 32; stride = 2), Conv2 (5 x 5 x 64; stride = 2), Conv3 (3 x 3 x 128; stride = 1). Average pooling layers are of size 2 x 2. Size used for fully connected layers: FC1 (128), FC2 (16), FC3 (16), FC4 (8), FC5 (4), FC6 (2).

3.2.4 Base model

The base model is made up of a multilayer feed-forward convolutional neural network (ConvNet/CNN). It takes as input the cropped out eye region images of both eyes from each frame of a video recording by a smartphone’s front-facing camera.

The base model is trained using a subset of the publicly available dataset GazeCapture. Each eye is fed through a ConvNet tower consisting of three convolutional layers with an average pooling layer in between each one of the convolutional layers. The left eye crop is flipped horizontally to allow for shared weights between the ConvNet towers, simplifying the training process. Inner and outer eye corner pixel-coordinates, relative to the top-left corner of the original frame, are fed through three consecutive fully connected layers. Thereafter, the output from the ConvNet towers and the output from the fully connected layers are combined and sent through three additional fully connected layers. A regression head then outputs the estimated gaze location on the screen in the form of x and y coordinates. Rectified Linear Units (ReLUs) are used as nonlinearities for all layers except for the final fully connected output layer, which has no activation function. Figure 3.4 shows an illustration of the base model architecture.

3.2.5 Fine-tuned base model

In this stage, the base model is fine-tuned by re-training the network using the calibration data collected from all study participants through the custom app. The data collected by the app is prepared in a similar manner to the previously mentioned public dataset. Dot calibration during approximately 30 seconds per participant is used to generate input/target pairs by collecting and storing the video stream of the participant performing the calibration task, synchronized with the screen location of the displayed stimulus point. Images and marker locations on the screen serve as inputs and targets for the model, respectively. During fine-tuning of the model, all of the layer weights of the base model are allowed to be updated from the previously initialized weights of the pre-trained base model until the model converges. The model training parameters for fine-tuning are left unchanged compared to the base model.

3.2.6 Personalized model

In the final stage, the fine-tuned base model is personalized by fitting an additional regression model, support vector regression (SVR), at a user level in order to produce the final gaze estimation. The regression model is fitted to the output of the base model’s penultimate layer. Here, a high-level representation of the fine-tuned base model is extracted and the penultimate layer’s output is used as the feature representation. The first part is to extract the feature embedding of the fine-tuned base model using the calibration frames belonging to a particular participant. Then, the extracted features (from the dot calibration task) are used as the input data, and the ground truth marker locations on the screen as the targets, to fit the SVR model for that particular participant and time block.

3.3 Testing and evaluation

In order to measure the accuracy of the developed eye tracking system, a number of tests and associated evaluation of the results are made. The model accuracy is evaluated by measuring and comparing the difference in cm between the estimated gaze locations produced by the system and the ground truth marker locations displayed on the screen during the calibration task. The results from the evaluation are combined and the mean error is calculated as a representation of the overall model accuracy. Further, in order to test whether the system is accurate enough to follow a person’s gaze during reading, an additional task is performed. The reading accuracy is tested by the system being fed a camera recording of a participant performing a reading task in order to see if the estimation follows the presumed ground truth.
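A small sketch of the evaluation metric described above: the mean Euclidean distance in cm between estimated and ground-truth gaze points, both given as (x, y) coordinates relative to the camera. The function name and array shapes are illustrative.

```python
import numpy as np

def mean_gaze_error_cm(predicted, ground_truth):
    """Mean and standard deviation of the per-frame Euclidean error in cm."""
    predicted = np.asarray(predicted, dtype=float)        # shape (N, 2), in cm
    ground_truth = np.asarray(ground_truth, dtype=float)  # shape (N, 2), in cm
    errors = np.linalg.norm(predicted - ground_truth, axis=1)
    return errors.mean(), errors.std()
```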

3.4 Developer tools and environment

The eye tracking system is written primarily in Python and the models are trained using the built-in virtual GPU functionality of the Google Colaboratory IDE. Python is the most widely used programming language for machine learning, with much support, rich libraries and extensive documentation, and Google Colaboratory provides free GPU usage with Python integrated into it, making it a good option for machine learning model training. Data preparation and processing of the dataset is performed locally on the authors’ personal computers.


4 Implementation and results

In this section, the developed system, its implementation details, and its produced results will be presented. The gaze estimation system consists of four modules, the data pipeline and three models, and thus each module and its respective results will be presented in a dedicated section. Certain changes were made which deviate from the method described in chapter 3.

4.1 Data generation and processing

The first module of the gaze estimation system is data processing. The data chosen from the GazeCapture dataset and the calibration data obtained with the mobile app were prepared with a Python script according to the description provided in section 3.2.2. The OpenCV library[50] was used for image handling, including reading, writing, and resizing the images for optimal use by the machine learning models. In order to identify and extract the eye regions and obtain the eye corner landmarks from the original full face images, the 68 landmark face detection model from the Dlib library[51] was used. Figure 4.1 presents the facial landmarks recognized by the face detector, and its application to a facial image. The images were converted into grayscale prior to applying the face detection model since this increases performance. The images for the left and right eye regions were placed in separate folders, uniquely identified by concatenation of their original directory and image numbers from the dataset. The landmarks and targets belonging to each image were stored in arrays in JSON format following the same order as the images.

Figure 4.1: Face detector that generates 68 landmark coordinates for a provided image[52]. Used for identifying and extracting eye regions and obtaining inner and outer eye corner landmark coordinates.
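The following is a minimal sketch of how the eye corners can be obtained with Dlib's 68-landmark predictor, in line with the description above. The predictor file name is the one distributed with Dlib's examples; the exact indexing choices and function name are our own assumptions rather than the thesis's script.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def eye_corners(image_path):
    """Return the inner/outer corner pixel coordinates of both eyes, or None."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # detection is run on grayscale
    faces = detector(gray, 1)
    if not faces:
        return None                                # no face found: frame is discarded
    shape = predictor(gray, faces[0])
    # In the 68-point scheme, points 36-41 outline one eye and 42-47 the other;
    # 36/39 and 42/45 are the corner points of each eye.
    right = [(shape.part(i).x, shape.part(i).y) for i in (36, 39)]
    left = [(shape.part(i).x, shape.part(i).y) for i in (42, 45)]
    return {"right_eye": right, "left_eye": left}
```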

4.2 Base model

The second module of the gaze estimation system is the base model. The goal of the base model is to achieve a generalized, but not fully accurate, prediction model for gaze estimation. The model was constructed using the Keras API[53] with a TensorFlow backend. Keras offers a simple and flexible deep learning API for Python while also providing extensive documentation and developer guides.

Figure 4.2: Original (left) and revised (right) base model architectures. The eye corner landmark input was removed from the original architecture because it yielded poor results during model training.

4.2.1 Architecture

The architecture of the base model was changed from the original design by removing the eye corner landmarks input channel, since the model showed an increase in performance and gave more accurate results after the channel was removed. The original and revised architectures can be seen in Figure 4.2. The hyperparameters used for the base model are listed below, selected according to the original study; a Keras sketch of the revised architecture and training configuration follows the parameter list.

Weight related parameters:

- weight initialization: random
- weight regularization: disabled
- Conv dropout probability: 0.02
- FC4 dropout probability: 0.12

Batch normalization parameters:

- batch normalization: enabled
- moving average momentum: 0.9

Learning rate schedule parameters:

- initial learning rate: 0.016
- decay steps: 8000
- decay rate: 0.64
- decay type: ‘staircase’

Optimizer:

- Adam optimizer (β1 = 0.9, β2 = 0.999, ε = 1e-7)

Model training parameters:

- batch size: 256
- epochs: 300

Loss function:

- Euclidean distance error
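The sketch below is a minimal Keras rendering of the revised base model and of the training configuration listed above, as we interpret Figure 4.2. Layer sizes follow the Figure 3.4 caption; which fully connected layers remain after removing the landmark branch (FC1, FC4, FC5 and FC6 are kept here), and the exact placement of pooling, batch normalization and dropout, are assumptions on our part rather than details confirmed by the original study.

```python
import tensorflow as tf
from tensorflow.keras import Model, layers, optimizers

def eye_tower():
    """ConvNet tower for one 128x128x3 eye crop; reused for both eyes (shared weights)."""
    inp = layers.Input(shape=(128, 128, 3))
    x = layers.Conv2D(32, 7, strides=2, activation="relu")(inp)   # Conv1
    x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.AveragePooling2D(2)(x)
    x = layers.Conv2D(64, 5, strides=2, activation="relu")(x)     # Conv2
    x = layers.BatchNormalization(momentum=0.9)(x)
    x = layers.AveragePooling2D(2)(x)
    x = layers.Conv2D(128, 3, strides=1, activation="relu")(x)    # Conv3
    x = layers.Dropout(0.02)(x)                                   # Conv dropout
    x = layers.Flatten()(x)
    return Model(inp, x)

def build_base_model():
    tower = eye_tower()
    left = layers.Input(shape=(128, 128, 3), name="left_eye")
    right = layers.Input(shape=(128, 128, 3), name="right_eye")
    x = layers.Concatenate()([tower(left), tower(right)])
    x = layers.Dense(128, activation="relu")(x)                   # FC1
    x = layers.Dense(8, activation="relu")(x)                     # FC4
    x = layers.Dropout(0.12)(x)                                   # FC4 dropout
    x = layers.Dense(4, activation="relu")(x)                     # FC5 (penultimate layer)
    out = layers.Dense(2, activation=None)(x)                     # FC6: (x, y) gaze in cm
    return Model([left, right], out)

def euclidean_loss(y_true, y_pred):
    # Euclidean distance between predicted and true gaze points.
    return tf.sqrt(tf.reduce_sum(tf.square(y_true - y_pred), axis=-1))

lr = optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.016, decay_steps=8000, decay_rate=0.64, staircase=True)
model = build_base_model()
model.compile(optimizer=optimizers.Adam(learning_rate=lr, beta_1=0.9,
                                        beta_2=0.999, epsilon=1e-7),
              loss=euclidean_loss)
# model.fit([left_eye_crops, right_eye_crops], gaze_targets, batch_size=256, epochs=300)
```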

4.2.2 Gaze estimation results

Normally, the entire dataset is loaded into memory before model training. Since the prepared GazeCapture subset of 624,803 samples could not fit into memory all at once, a different approach had to be used. A function for loading smaller batches of data into memory successively, as needed by the model training algorithm, was therefore developed. Using this method however yielded poor results (see Figure 4.9) and decreased performance by a factor of 10; thus the decision was made to train the base model using smaller samples of data. Three models were trained for each category of samples (1000, 2000, 4000, and 8000). Model evaluation was made by comparing the distance between predicted/estimated gaze locations and ground truth gaze locations from a control sample containing 200 previously unseen observations from the same dataset. Figures 4.3-4.8 and Table 4.1 show the evaluation results of the base model after training on the smaller samples.
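The kind of successive batch loading described above could, for example, be expressed as a keras.utils.Sequence that reads one batch of eye crops from disk per call. This is a rough sketch under our own assumptions about file layout and naming; it is not the thesis's own loader (whose implementation details, and slowdown, are not reproduced here).

```python
import math
import numpy as np
import cv2
from tensorflow.keras.utils import Sequence

class EyeBatchLoader(Sequence):
    """Loads one batch of preprocessed 128x128 eye crops per __getitem__ call."""
    def __init__(self, left_paths, right_paths, targets, batch_size=256):
        self.left_paths, self.right_paths = left_paths, right_paths
        self.targets = np.asarray(targets, dtype=np.float32)   # (N, 2) gaze points in cm
        self.batch_size = batch_size

    def __len__(self):
        return math.ceil(len(self.targets) / self.batch_size)

    def _load(self, paths):
        batch = []
        for p in paths:
            img = cv2.imread(p).astype(np.float32)              # 128x128 eye crop
            mean = img.reshape(-1, 3).mean(axis=0)
            std = img.reshape(-1, 3).std(axis=0) + 1e-7
            batch.append((img - mean) / std)                    # channel-wise standardization
        return np.stack(batch)

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        return ([self._load(self.left_paths[sl]), self._load(self.right_paths[sl])],
                self.targets[sl])

# Usage: model.fit(EyeBatchLoader(left_files, right_files, gaze_targets), epochs=300)
```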

Figure 4.3: Base model gaze estimation results for the three models (panels A-C) trained on a sample size of 1000 observations. Blue markers represent ground truth and red markers represent corresponding predictions made by the model. Axes represent distance in cm from the center of the front-facing camera lens.


Figure 4.4: Base model gaze estimation results for the three models (panels A-C) trained on a sample size of 2000 observations. Blue markers represent ground truth and red markers represent corresponding predictions made by the model. Axes represent distance in cm from the center of the front-facing camera lens.

Figure 4.5: Base model gaze estimation results for the three models (panels A-C) trained on a sample size of 4000 observations. Blue markers represent ground truth and red markers represent corresponding predictions made by the model. Axes represent distance in cm from the center of the front-facing camera lens.

Figure 4.6: Base model gaze estimation results for the three models (panels A-C) trained on a sample size of 8000 observations. Blue markers represent ground truth and red markers represent corresponding predictions made by the model. Axes represent distance in cm from the center of the front-facing camera lens.

Table 4.1: Measured mean error and standard deviation in cm between predictions and ground truth marker locations for different sample sizes.

Mean error (cm)   A           B           C           Average
1000              4.64±1.85   3.82±2.06   4.12±1.94   4.19±1.95
2000              3.52±1.71   4.78±1.65   3.27±1.43   3.86±1.60
4000              2.62±1.46   3.08±1.35   2.11±1.70   2.60±1.50
8000              1.94±1.33   2.03±1.40   2.39±1.16   2.12±1.30



Figure 4.7: Charts depicting the distribution of prediction error in cm. W = 1000 observations, X = 2000 observations, Y = 4000 observations, Z = 8000 observations.


Figure 4.8: Training and validation loss charts for the base model. Illustrates loss during model training. W = 1000 observations, X = 2000 observations, Y = 4000 observations, Z = 8000 observations.

Figure 4.9: Failed attempts. Model failed to find a connection between input and targets.


4.3 Fine-tuned base model

The third module of the system is fine-tuning. Here, the base model is re-trained with user data obtained with the custom calibration app in order to improve prediction accuracy for the particular users of the system. Further training on a pre-trained model allows for keeping the neural network weights and biases of the pre-trained model, making weight initialization already adapted for the task at hand.
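A minimal sketch of this re-training step follows, assuming the pre-trained base model was saved to disk with model.save() and that the calibration frames have been run through the same preprocessing as the GazeCapture subset. File names and the validation split are illustrative assumptions, not values from the study.

```python
import tensorflow as tf

def euclidean_loss(y_true, y_pred):
    # Same Euclidean distance loss as used for the base model.
    return tf.sqrt(tf.reduce_sum(tf.square(y_true - y_pred), axis=-1))

def fine_tune(base_model_path, left_eyes, right_eyes, targets):
    """Re-train all layers of the saved base model on the calibration data."""
    model = tf.keras.models.load_model(
        base_model_path, custom_objects={"euclidean_loss": euclidean_loss})
    # All weights stay trainable; only the starting point differs from random init.
    model.fit([left_eyes, right_eyes], targets,
              batch_size=256, epochs=300, validation_split=0.1)
    return model
```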

The custom calibration app was developed using JavaScript and the React Native framework[54], running on the Expo platform. The React Native framework is open source and cross-platform, which means it is free to use and works for both Android and iOS. The device of choice for the calibration part of the project was the iPhone XS, mainly due to availability but also because it is an iOS device like all other devices in the GazeCapture dataset.

4.3.1 Gaze estimation results

The base model which appeared to be the most accurate according to the figures presented in section 4.2.2, and which also had the lowest measured mean error and standard deviation (model A trained on 8000 samples of data), was selected for fine-tuning. The model was re-trained with the calibration data collected from all study participants, of which there were 8 in total. Calibration data was obtained over a period of 27 seconds (3 seconds per dot, 9 dots in total), resulting in approximately 900 input/target pairs for each participant. Data from 7 participants was used for training while the remaining participant's data was used for model evaluation. Due to ambiguity in the original study, it is unclear whether they used all input/target pairs or merely a subset of ~100 input/target pairs from each participant. Models were therefore trained on both amounts, for a total of ~6300 and ~700 samples respectively. Figures 4.10-4.13 and Table 4.2 show the evaluation results of the fine-tuning stage.

Figure 4.10: Fine-tuning gaze estimation results (panels A-C) when training on a sample size of ~6300 observations. Blue markers represent ground truth and red markers represent corresponding predictions made by the model. Axes represent distance in cm from the center of the front-facing camera lens.


Figure 4.11: Fine-tuning gaze estimation results (panels A-C) when training on a sample size of ~700 observations. Blue markers represent ground truth and red markers represent corresponding predictions made by the model. Axes represent distance in cm from the center of the front-facing camera lens.

Table 4.2: Measured mean error and standard deviation in cm between predictions and ground truth marker locations for different sample sizes.

Mean error (cm)   A           B           C           Average
~6300             0.79±0.98   0.83±0.62   1.04±0.72   0.89±0.77
~700              2.68±1.82   2.84±1.79   3.04±1.69   2.85±1.77

Figure 4.12: Charts depicting the distribution of prediction error in cm. X = 6300 observations, Y = 700 observations.

Figure 4.13: Training and validation loss charts. Illustrates loss during model training. X = 6300 observations, Y = 700 observations.


4.4 Personalized model

The final module of the system, the personalization stage, deals with personalizing and fitting the system to each individual. In this part, a regression model was added to the fine-tuned base model’s penultimate layer (see Figure 4.2, layer FC5). The support vector regression (SVR) module[55] from the scikit-learn library was used for this purpose. Models trained on both ~6300 samples and ~700 samples were selected for personalization. The SVR hyperparameters were selected according to the original study as listed below; a sketch of this stage follows the list.

- kernel: ‘rbf’
- gamma: 0.06
- C: 20.0
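A minimal sketch of the personalization stage, assuming `fine_tuned` is the fine-tuned Keras model from the previous stage. scikit-learn's SVR predicts a single value, so one regressor per screen coordinate is fitted here; treating the two coordinates separately is our own choice, and all names are illustrative.

```python
import numpy as np
from sklearn.svm import SVR
from tensorflow.keras import Model

def personalize(fine_tuned, left_eyes, right_eyes, targets):
    """Fit per-user SVRs on the penultimate-layer (FC5) features of the model."""
    feature_extractor = Model(inputs=fine_tuned.inputs,
                              outputs=fine_tuned.layers[-2].output)
    feats = feature_extractor.predict([left_eyes, right_eyes])      # shape (N, 4)
    svr_x = SVR(kernel="rbf", gamma=0.06, C=20.0).fit(feats, targets[:, 0])
    svr_y = SVR(kernel="rbf", gamma=0.06, C=20.0).fit(feats, targets[:, 1])

    def predict(left, right):
        # Apply the CNN feature extractor and the per-coordinate SVRs in sequence.
        f = feature_extractor.predict([left, right])
        return np.column_stack([svr_x.predict(f), svr_y.predict(f)])

    return predict
```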

4.4.1 Gaze estimation results

The fine-tuned models which appeared to be the most accurate according to the figures presented in section 4.3.1, and which also had the lowest measured mean error and standard deviation, were selected for personalization: fine-tuned model A (trained on ~700 samples of data) and fine-tuned model B (trained on ~6300 samples of data). The selected models had the SVR model applied to them, which was subsequently trained on the calibration data from a particular participant. Data from 8 of the 9 dots in the calibration task was used for training, while the remaining data from one dot was used for evaluation. The total amount of calibration frames used for training in this part of the system was reduced to ~100 per user, following the recommendation from the original study on which the method is based. Two personalized models were developed for each of the 3 participants partaking in the personalization stage of the system, one for each of the 2 selected fine-tuned models. Figures 4.14, 4.15 and Table 4.3 show the evaluation results of the personalization stage.

The personalized models were also evaluated with a reading test where the user read a short text which appeared on the screen, while they were simultaneously being recorded by the phone’s front-facing camera. The reading task had no ground truth marker locations but was instead compared visually in order to estimate the accuracy of the presumed reading pattern of the user. Figure 4.16 shows results from the reading task.


Figure 4.14: Gaze estimation results for personalized model A, built on fine-tuned model A (trained on ~700 samples of data). Panels A and B are prediction results for differently placed dots viewed by the same participant. Blue markers represent ground truth and red markers represent corresponding predictions made by the model. Axes represent distance in cm from the center of the front-facing camera lens.

Figure 4.15: Gaze estimation results for personalized model B, built on fine-tuned model B (trained on ~6300 samples of data). Panels A and B are prediction results for differently placed dots viewed by the same participant. Blue markers represent ground truth and red markers represent corresponding predictions made by the model. Axes represent distance in cm from the center of the front-facing camera lens.

Table 4.3: Measured mean error and standard deviation in cm between predictions and ground truth marker locations for different personalized models. Personalized model A = personalized model built on fine-tuned model A (trained on ~700 observations), Personalized model B = personalized model built on fine-tuned model B (trained on ~6300 observations).

Mean error (cm)        A           B           Average
Personalized model A   0.11±0.05   3.45±0.10   1.78±0.08
Personalized model B   4.73±0.00   2.33±0.06   3.53±0.03



Figure 4.16: Reading task prediction results for a particular user. A = predictions from personalized model A. B = predictions from personalized model B. C = reading task. iPhone XS dimensions for comparison: (H:14.4cm, W:7.1cm).


5 Discussion and analysis

In this section, the results presented in chapter 4 will be analyzed and discussed. The emphasis will be put on how the different parts of the system may have impacted the results and how they may be improved.

5.1 Analyzing the results

5.1.1 Base model

According to the results presented in Figures 4.3-4.7 and Table 4.1, base model accuracy improves as the training data sample size increases, the best model reaching an accuracy of 1.93±1.33 cm on previously unseen data. For comparison, the original study achieved a base model accuracy of 1.92±0.20 cm. When observing the charts in Figure 4.7, it can be seen that the distribution of prediction error for the base model gets narrower as more samples are used for training. The distribution of error seems to hold the majority of the data points between 0.5-5 cm for all models. While this may seem low, one needs to consider the confines of machine learning, and the device type and its size in this particular case. The prediction space for the model is essentially limited to the edges of the plane created by the ground truth targets, thus reducing the possibility of producing estimations far outside the screen area. The model will typically not produce gaze estimations outside of the screen area, an issue which is more frequently associated with other methods of eye tracking, since it has not been exposed to targets located outside said area during training. The distance between the point of estimation and the ground truth was measured using the formula for Euclidean distance, as can be seen in Equation 4.1 below.

Looking at the figures, it appears that the models have a tendency to cluster the predictions, and a general bias for predicting gaze towards the center and right side of the screen. If Figure 4.8 is studied, it can be seen that the difference between training and validation loss is relatively low, indicating that the models likely do not suffer from overfitting. Overfitting is when the model overly adapts to the pattern of the training data, making it capable of producing accurate predictions for that particular set of data only, and nothing else. A large gap between training and validation loss may indicate that the model has overfitted to the training data, making it poor for generalization.

Figure 4.9 shows various failed attempts at base model training, where the model failed to generalize. These outcomes were most likely the result of unfortunate weight initialization, where the model got stuck in a certain pattern of learning and was unable to free itself.

distance = √((predₓ − trueₓ)² + (predᵧ − trueᵧ)²)    (4.1)

Equation 4.1: Formula used to calculate the distance between the estimated/predicted marker location and the ground truth marker location.


5.1.2 Fine-tuned model

For the fine-tuned models, highly accurate estimations were achieved when trained on the larger sample size, with mean error reaching below 1 cm. When instead using a smaller sample size, the model produces much poorer results, in fact, even worse results than the base model. Whether this is due to the rather extreme positions of the targets (edges of the screen) or the model performing poorly in general is difficult to say. Both models show a greater accuracy and/or bias towards the top and center of the prediction space. This might however be expected due to the eyes being more visible when looking straight at the screen as opposed to looking at the bottom, where the eyes may appear partially closed or hooded by the upper eyelid.

When examining the loss charts in Figure 4.13, a large difference between training and validation loss can be seen for the model trained on the larger sample of data. This indicates a high risk that the model has overfitted to the training data and will generalize poorly when looking at other areas of the screen. For the model trained on the smaller amount of data, the estimation results and the loss chart show barely any indication of overfitting; however, the estimations appear fairly inaccurate overall, as mentioned earlier.

5.1.3 Personalized model

From Figures 4.14 and 4.15 and Table 4.3, it may be concluded that both of the personalized model variants perform poorly. Compared to the prediction models in the earlier parts of the system, there seems to be no improvement in estimation accuracy; in fact, personalization appears to have had a worsening effect on system performance. Only one of the estimations, seen in chart A of Figure 4.14, achieved somewhat desirable results; however, since it is the only result of its kind, it may be seen as an outlier rather than a valid representation. For comparison, the original study achieved a personalized gaze estimation accuracy of 0.46±0.03 cm.

The estimation results from the reading task, shown in Figure 4.16, are difficult to evaluate since the specific reading pattern of the user is unknown. The results do, however, show estimations towards the correct screen area where the text is located. Although the predictions are not distributed as evenly across the stimulus area as one might expect, the model estimates the gaze to be towards the center and top area. The model also cuts off its predictions at around the center of the screen and generally does not predict below the point where the text ends. The model has, however, previously shown a general bias towards the center and top area of the screen, so it is highly likely that this plays a major part in the estimation results. A person with normal reading capabilities, which all participants in the study possess, does not display a reading pattern such as the one shown in Figure 4.16, and therefore the model is deemed inaccurate. Overall, the gaze estimation system performs poorly when considering the limitations and confines of the chosen method.


5.2 Discussion

5.2.1 Method and implementation

Due to a variety of reasons, the method described in chapter 3 could not be followed through as intended. The aspect which most likely had the largest impact on this and on the performance of the system, and which limited the influence of the other factors on the results, was the weight initialization of the base model. Receiving favorable weight initialization values can be crucial for reaching desirable results with machine learning, especially for large neural networks with many parameters. Sharing weights between the convolutional towers makes the model even more dependent on the weight initialization; had the weights not been shared, an unfavorable initialization for the layers processing one eye could still have been offset by a favorable initialization for the layers processing the other eye. As stated in section 4.2.1, the weights were initialized at random, in accordance with the original study. This has the potential of yielding both highly desirable and highly undesirable results. Seeds in random number generators are used to generate reproducible and consistent (pseudo)random values across runtimes. The issue lies in finding seed values which generate values favorable to weight initialization for the task at hand, which unfortunately the original study did not provide. Initially, no seed was used for weight initialization, which resulted in the base model exclusively producing outcomes such as those seen in the failed attempts in Figure 4.9. By empirically testing different seed values, a value was ultimately found which broke this pattern and allowed the model to somewhat consistently produce a favorable outcome. The consistency of desirable results was however lower than hoped for, with training succeeding approximately 60% of the time.
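For reference, reproducible weight initialization of this kind can be achieved by seeding the relevant random number generators before the model is built. The sketch below assumes TensorFlow/Keras and an illustrative seed value; neither the framework configuration nor the seed actually used in the project is reproduced here.

import random
import numpy as np
import tensorflow as tf

SEED = 42  # illustrative value only, not the seed used in the project

# Seeding all relevant generators makes the initial weights reproducible
# across runtimes, so a seed that happens to give a favorable initialization
# can be reused for later training runs.
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Layers created after this point draw their initial weights from the
# seeded generators.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    tf.keras.layers.Dense(2),
])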

One suspicion was that the eye corner landmarks, being numerically much larger values, may have overpowered the data from the eye images. The decision was therefore made to normalize the landmark values in the same fashion as the eye images. This proved to have little to no effect on the outcome, after which the landmarks were removed from the model altogether. This gave better results, both in terms of consistency and gaze estimation accuracy, and this architecture was therefore kept.
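For clarity, the normalization step that was tried before the landmarks were removed could look as in the sketch below, where pixel coordinates are scaled into the [0, 1] range in the same way as the image data. The exact scheme used in the project is not reproduced here, and the values are illustrative.

import numpy as np

def normalize_landmarks(landmarks_px, frame_width, frame_height):
    """Scale eye corner landmark coordinates from pixels to [0, 1] so that
    they are numerically comparable to normalized image input."""
    pts = np.asarray(landmarks_px, dtype=np.float32).reshape(-1, 2)
    pts[:, 0] /= frame_width    # x coordinates
    pts[:, 1] /= frame_height   # y coordinates
    return pts.flatten()

# Example: four eye corner points detected in a 1280x720 camera frame.
corners = [(410, 350), (470, 352), (810, 348), (870, 351)]
print(normalize_landmarks(corners, 1280, 720))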

The model was now capable of producing fairly desirable results consistently, approximately 95% of the time. However, the model had only been tested on small samples of data, between 2000 and 5000 observations in total. When the model was finally trained on the entire specified subset of the GazeCapture dataset, containing 624 803 observations, it again produced failed outcomes. Since the entire subset was too large to fit into memory at once, a function was developed for loading the data into memory in batches, as needed by the training algorithm. This function was constrained in terms of speed due to the high number of I/O operations required when loading data in batches from disk, and was thus terribly slow and inefficient. The training duration reached upwards of 10 hours, only to produce failed outcomes. Empirically testing new seed values for the base model until a favorable one was found would have been impossible within the time limitations of the project.
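A minimal sketch of such a batch-loading generator is given below. The file layout and names are assumptions for illustration; the project's actual implementation is not reproduced here.

import numpy as np

def batch_generator(file_paths, targets, batch_size):
    """Yield (inputs, targets) batches, loading observations from disk on
    demand instead of keeping the whole dataset in memory. Each entry in
    file_paths is assumed to point to one preprocessed .npy observation."""
    n = len(file_paths)
    targets = np.asarray(targets)
    while True:                                # loop indefinitely, one pass per epoch
        order = np.random.permutation(n)       # reshuffle the observations each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            batch_x = np.stack([np.load(file_paths[i]) for i in idx])   # one disk read per sample
            batch_y = targets[idx]
            yield batch_x, batch_y

The per-sample load calls are what make this approach I/O-bound; reading larger chunks per file, or using a prefetching input pipeline, would reduce the overhead.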

The decision was thus made to simply use one of the models trained on the smaller samples of data. Tests showed a general improvement of results as sample size increased, so a model trained on as much data as could fit into memory at once, 8000 observations, was chosen. The amount of data used for training obviously affects the model's ability to produce generalized gaze predictions across different users. When training the model, there are several features which need to be accounted for: illumination conditions, eye color, eye shape, distance to the camera, etc. If the model has not been exposed to different variations of these features during training, the prediction results will suffer as a consequence.

The fine-tuned model showed signs of overfitting to the calibration data when using the larger sample, and poor performance in general for the smaller sample. The suspicion is, however, that this is merely a consequence of not being able to train the base model on the full dataset, and that the fine-tuning stage itself would not have affected the results as much, had that been possible. The image quality, background illumination conditions, head pose and distance to the camera were set up to be optimal for eye tracking purposes when performing the calibration task. Taking this into consideration, combined with the fact that the amount of calibration data used for re-training/fine-tuning was almost as large as the initial base model training sample, it is no surprise that the model overfitted.

Regarding the personalized model, not enough data was collected to make conclusive statements or generalizations about the accuracy of the gaze estimation. A personalized model was developed for only three subjects, mainly chosen for their availability to participate as test subjects in the project.

5.2.2 Evaluation

Each model was evaluated by measuring the distance between the true calibration task target location and the corresponding predicted gaze location. The devices used for performing the calibration task were hand-held, both in the GazeCapture dataset and during calibration performed by the study participants. Further, there is no control mechanism to ensure that the subject is actually looking at the displayed target in each frame. This task can be quite demanding, as the gaze may drift away at any moment due to the difficulty of focusing on a small, stationary position for an extended amount of time.

5.2.3 Solutions

The identified main problem of the gaze estimation system stems from the weight initialization of the convolutional neural network. This issue seems to have a negative cascading effect on the entire system, rendering the additional steps more or less insignificant. A solution to the problem would be to use some sort of framework for analyzing and finding potentially favorable values for weight initialization on a larger scale. Tweaking the values of the model hyperparameters may also produce different results; for instance, increasing the learning rate and related parameters may allow the model to not get stuck in a loop of erroneous predictions.
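A simple version of such a search could evaluate a range of candidate seeds on a small subset of the data and keep the one with the lowest validation loss before committing to a full training run. The sketch below assumes TensorFlow/Keras and uses a stand-in network with placeholder data; it is not the project's actual model or search procedure.

import numpy as np
import tensorflow as tf

def build_model():
    # Stand-in for the actual gaze estimation network.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(2),
    ])

x = np.random.rand(500, 32).astype("float32")   # placeholder features
y = np.random.rand(500, 2).astype("float32")    # placeholder gaze targets

results = {}
for seed in range(10):                           # candidate seed values
    tf.random.set_seed(seed)
    np.random.seed(seed)
    model = build_model()
    # A slightly higher learning rate than the default is one of the
    # hyperparameter tweaks mentioned above (value is illustrative).
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005), loss="mse")
    history = model.fit(x, y, validation_split=0.2, epochs=3, batch_size=64, verbose=0)
    results[seed] = history.history["val_loss"][-1]

best_seed = min(results, key=results.get)
print("best seed:", best_seed, "validation loss:", results[best_seed])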

For the evaluation part, a moving, dynamic stimulus point could be used to increase the chance of the subject actually focusing their gaze on the displayed target. To ensure that the device is held completely still during calibration, a device stand could be used. Finally, more test subjects and more tests are needed in order to evaluate and verify the performance of the gaze estimation system. Tests could, for example, be made where the user reads a text that begins at the middle of the screen and ends at the bottom, in order to reveal whether the model actually predicts the gaze correctly or whether the results are simply due to prior estimation bias.


References
