
STOCKHOLM, SWEDEN 2020

Using Graph Neural Networks for Track Classification and Time Determination of Primary Vertices in the ATLAS Experiment

MATTIAS GULLSTRAND STEFAN MARAŠ

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ENGINEERING SCIENCES


Using Graph Neural Networks for Track Classification and Time Determination of Primary Vertices in the ATLAS Experiment

MATTIAS GULLSTRAND STEFAN MARAŠ

Degree Projects in Mathematical Statistics (30 ECTS credits)

Degree Programme in Industrial Engineering and Management (120 credits)
KTH Royal Institute of Technology, year 2019

Supervisors at KTH: Jimmy Olsson, Christian Ohm
Examiner at KTH: Jimmy Olsson


TRITA-SCI-GRU 2020:387 MAT-E 2020:094

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Abstract

Starting in 2027, the high-luminosity Large Hadron Collider (HL-LHC) will begin operation and allow higher-precision measurements and searches for new physics processes between elementary particles. One central problem that arises in the ATLAS detector when reconstructing event information is to separate the rare and interesting hard-scatter (HS) interactions from uninteresting pileup (PU) interactions in a spatially compact environment. This problem becomes even harder to solve at higher luminosities. This project relies on leveraging the time dimension and determining the time of the HS interactions to separate them from PU interactions, using information measured by the upcoming High-Granularity Timing Detector (HGTD). The current method relies on a boosted decision tree (BDT) together with the timing information from the HGTD to determine a time. We suggest a novel approach utilizing a graph attentional network (GAT), where each bunch-crossing is represented as a graph of tracks and the properties of the GAT are applied at track level, to investigate whether such a model can outperform the current BDT. Our results show that we are able to replicate the results of the BDT and even improve on some metrics, at the expense of increasing the uncertainty of the time determination. We conclude that although there is potential for GATs to outperform the BDT, a more complex model should be applied. Finally, we provide some suggestions for improvement and hope to inspire further study and advancements in this direction, which shows promising potential.

Keywords

Time determination, graph neural network, graph attentional network, HGTD, vertex, node classification, particle physics, machine learning

Abstract (Swedish)

Starting in 2027, the high-luminosity Large Hadron Collider (HL-LHC) will begin operation and enable higher-precision measurements and searches for new physics processes between elementary particles. A central problem that arises in the ATLAS detector when reconstructing particle collisions is to separate the rare and interesting interactions, so-called hard scatters (HS), from uninteresting pileup (PU) interactions in the compact spatial dimension. The difficulty of this problem increases at higher luminosities. With the help of measurements from the upcoming High-Granularity Timing Detector (HGTD), timing information related to the interactions will also be obtained. In this project, this information is used to determine the time of individual interactions, which can then be used to separate HS interactions from PU interactions. The current method uses a tree-based regression method, a so-called boosted decision tree (BDT), together with the timing information from the HGTD to determine a time. We propose a new approach based on a graph attention network (GAT), where each proton collision is represented as a graph of the particle tracks and the GAT properties are applied at track level. Our results show that we can replicate the BDT-based results and even improve on them, at the expense of increasing the uncertainty in the time determination. We conclude that although GAT models have the potential to outperform BDT models, more complex versions of the former should be applied. Finally, we give some suggestions for improvement which we hope can inspire further studies and progress in this area, which shows promising potential.

Keywords (Swedish)

Time determination, graph neural network, graph attention network, HGTD, vertex, node classification, particle physics, statistical learning

Acknowledgements

We would like to express our gratitude to Christian Ohm, researcher at the Particle and Astroparticle Physics division at KTH, for the opportunity to work on this project and for his tireless support during the project. We thank our mathematical statistics and course supervisor Jimmy Olsson, at the Department for Mathematical Statistics, for his thorough feedback on the scientific rigour and plan of the project. A thank you to Alexander Leopold, researcher at Sorbonne in Paris and the designer of the boosted-decision-tree-based approach, for taking the time to sit down and discuss the model and our thoughts around it. A thank you to the KTH-ATLAS team for listening and providing weekly feedback on the progress of the project. A special thank you to the HGTD Simulation, Performance & Physics team for providing key areas of improvement for the project to follow up on. A final thank you to Pawel Herman, associate professor at the Division for Computational Science and Technology at KTH, for his contributing thoughts on the report's neural network approach.

Stockholm, December 2020

Mattias Gullstrand and Stefan Maraš

Abbreviations

Adam Adaptive Moment Estimation
CNN Convolutional Neural Network
DNN Deep Neural Network
GAT Graph Attentional Network
GCN Graph Convolutional Network
GNN Graph Neural Network
GPU Graphics Processing Unit
HEP High Energy Physics
HGTD High-Granularity Timing Detector
HL-LHC High-Luminosity Large Hadron Collider
HS Hard Scatter
ITk Inner Tracker
LHC Large Hadron Collider
MC Monte Carlo
MLP Multi-Layer Perceptron
PU Pileup
RAM Random Access Memory
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
TDR Technical Design Report
VBF Vector Boson Fusion

Contents

1 Introduction
1.1 Background
1.2 Problem and Research Question
1.3 Purpose
1.4 Goal
1.5 Methodology
1.6 Delimitations
1.7 Outline

2 Conceptual Background
2.1 Physics background
2.1.1 The Standard Model
2.1.2 The ATLAS experiment
2.1.3 The ATLAS coordinate system
2.1.4 High-Granularity Timing Detector
2.1.5 Relevant physics processes
2.1.6 Particle parameters and variable explanation
2.2 Related work
2.2.1 Boosted decision tree
2.2.2 Related studies
2.2.3 Summary of related work

3 Mathematical background and engineering-related content
3.1 Basics of neural networks
3.1.1 Nodes
3.1.2 Activation function
3.1.3 Supervised learning
3.1.4 Learning rule
3.1.5 Batch learning and online learning
3.1.6 Loss function
3.1.7 Learning rate
3.1.8 The back-propagation algorithm
3.1.9 Dropout
3.1.10 Recall and precision
3.1.11 Precision-recall trade-off
3.1.12 Topologies
3.2 Overview of graph neural networks
3.2.1 Graph neural networks
3.2.2 Graph convolutional networks
3.2.3 Graph attention networks
3.3 Engineering-related and scientific content
3.3.1 Computational background
3.3.2 Data acquisition and processing
3.3.3 Choice of neural network model
3.3.4 Model pipeline
3.3.5 How to evaluate the model

4 Results
4.1 Final results
4.1.1 Step 1 - filter out tracks spatially incompatible with the HS vertex
4.1.2 Step 2 - using the vertex reconstruction
4.1.3 Steps 3 and 4 - using timing information
4.2 Comparison to the BDT benchmark

5 Analysis and conclusions
5.1 Analysis
5.1.1 The four-step model
5.1.2 Limitations of the graph construction
5.2 Conclusions
5.2.1 Key conclusions from the project
5.2.2 Future work
5.2.3 Final words

References


1 Introduction

1.1 Background

The Large Hadron Collider (LHC) is the world's largest and most powerful particle accelerator, where bunches of protons are accelerated to near light speed before colliding at four different locations along its circular tunnel.

The collisions will take place following a two-dimensional Gaussian distribution with standard deviations of 50 mm along the beam axis and 175 ps in time [1]. The LHC is currently preparing for an upgrade to its high-luminosity phase, which will allow higher rates of proton-proton collisions per bunch-crossing in order to increase the likelihood of producing interesting rare interactions between elementary particles. This implies that the number of collisions per proton bunch-crossing will increase from around 30 to 200 in the same region of space.

The ATLAS (A Toroidal LHC ApparatuS) experiment, one of the four major experiments at the LHC, is located at one of these collision points and is designed to explore these collisions in order to answer some of the most fundamental questions about the nature of the universe, especially concerning the Higgs boson and the nature of dark matter.

The collected data from one proton bunch­crossing is referred to as an event.

With this increase in the spatial density of collisions between protons, it becomes harder to separate the interesting so-called hard-scatter (HS) interactions from the uninteresting so-called pileup (PU) interactions in an event. Furthermore, the points where new particles are created out of the energy from the proton-proton collisions are referred to as vertices. The problem of separating HS from PU is due to the increasingly small distances between the vertices in three-dimensional Cartesian space, which makes it harder to discern which particle comes from which interaction.

Figure 1.1.1: Visualisation of the interactions in a single bunch crossing (event) in the z-t plane [2]. The interactions are equivalent to vertices as described in the problem formulation above.

To mitigate this problem for the ATLAS experiment at higher luminosity, a new detector will be installed that takes advantage of the fourth dimension, i.e. time. This new detector, called the High-Granularity Timing Detector (HGTD) [2], had its technical design report (TDR) approved by CERN in September 2020 and is designed to measure the time of particles produced in collisions in the ATLAS experiment with a precision of 30 ps. Figure 1.1.1 shows a visual representation of these interactions based on Monte Carlo simulated data. The reconstructed trajectories of charged particles from the collision point to the HGTD allow scientists to assign time measurements to the vertices they originated from. By combining the spatial and temporal information, the time of the proton-proton collision itself can be determined using statistical machine-learning (ML) methods. When the times of the collisions are determined, this information can be used to separate collisions in time even though they take place in a window of only a few hundred picoseconds.

In the past decade, a large growth in the application of deep neural networks (DNNs) has been seen for solving problems in a wide range of areas. In the context of high energy physics (HEP), neural networks have been explored for track reconstruction [3], data acquisition [4], and analysis and particle interpretation [5]. These methods have been especially beneficial at CERN and its LHC [6], where deep learning methods are utilized more and more frequently. More specifically, since data in particle physics can easily be represented by sets and graphs, graph neural networks (GNNs) offer key advantages in learning high-level features and making predictions and/or classifications based on these. Furthermore, GNNs have previously been shown to outperform traditional DNNs, especially on problems surrounding physics applications [7]. For these reasons, graph neural networks have great potential to be applied to the problem of classifying tracks as HS or PU and determining the times of HS vertices, while providing a beneficial foundation for future work on the subject.

For the challenge of discerning HS from PU interactions, a graph network is implemented by creating one graph for every event, where each node represents one track with a vector of track features. The training is then performed on the set of created graphs. The construction, training, and evaluation of such a network is the focus of this report.
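To make this concrete, the sketch below builds one such per-event graph from per-track feature vectors and truth labels. It assumes PyTorch Geometric, and the fully connected edge scheme and helper name are placeholders for illustration only, not the construction actually used later in this report.

```python
# Illustrative sketch only: one graph per event, one node per track.
# PyTorch Geometric, the fully connected edge scheme, and all names here are
# assumptions for illustration, not the construction used in this thesis.
import torch
from torch_geometric.data import Data

def build_event_graph(track_features, is_hs_labels):
    """track_features: (n_tracks, n_features) array, e.g. [pT, eta, z0, time, ...]
    is_hs_labels: (n_tracks,) array of truth labels, 1 = HS track, 0 = PU track."""
    x = torch.as_tensor(track_features, dtype=torch.float)   # node feature matrix
    y = torch.as_tensor(is_hs_labels, dtype=torch.float)     # per-node labels

    n = x.size(0)
    # Assumption: connect every pair of distinct tracks (no self-loops).
    # A k-nearest-neighbour graph in z0 (or in z0 and time) is a natural alternative.
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    edge_index = (torch.tensor(pairs, dtype=torch.long).t().contiguous()
                  if pairs else torch.empty((2, 0), dtype=torch.long))

    return Data(x=x, edge_index=edge_index, y=y)

# One Data object is created per event; training then runs over the resulting list of graphs.
```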

1.2 Problem and Research Question

A central problem in the ATLAS detector is how to separate interesting HS vertices from uninteresting PU vertices in a spatially compact environment. This problem becomes harder to solve at higher luminosities. Our project relies on leveraging and exploiting the time dimension by using high-precision timing measurements from the HGTD. Furthermore, initial studies have used a boosted decision tree (BDT) to determine the times of the HS vertices [2]. However, the currently used BDT has shortcomings in two domains: firstly, in 5% of the events, the time determination of the HS vertex is based completely on PU tracks, yielding an incorrect time determination; secondly, in 25% of the events, a time is not given at all. We are thus presented with two connected points of improvement where a more complex network focusing on track-level learning can be implemented to solve the given problem. Since GNNs have shown promising results in track classification and vertex finding, implementing a GNN has great potential to solve the problem and improve on the metrics in the TDR. In accordance with the problem formulation we present the following research question to guide this thesis:

• Can a graph neural network make better binary predictions of a track being HS or PU, and better time determinations of hard-scatter vertices, on simulated bunch-crossing data than the current boosted decision tree?

1.3 Purpose

The main purpose of the thesis is to analyse Monte Carlo (MC) simulated collision data and train a GNN to recognize patterns in the data and enhance the efficiency and certainty of the time determination, in order to solve the problem described in Section 1.2. While enhancing the performance of the time determination, the thesis also aims to analyze and exploit currently unused time information from the MC simulations. Not only does this study aim to achieve improved results; it may also inform how the BDT model is designed and operates, improving its performance as well. Furthermore, to solve the problem we suggest using a graph attentional network (GAT) (described in Section 3.2.3), which will be evaluated on its suitability for the problem and on how it performs compared to the BDT.

1.4 Goal

The goal of the project is to improve the classification of tracks belonging to the HS vertices in order to improve the time determination of these vertices. We thus aim to reduce the number of vertices that get no time (25%) as well as the number of vertices that receive an incorrect time due to using measurements of unrelated tracks (5% of the determinations).

1.5 Methodology

This degree project relies on collecting, analyzing, and pre-processing large data sets on which a graph neural network is trained to classify tracks (as HS or PU) and assign a time to the vertex of the classified tracks. As such, the report relies on a quantitative methodology with an inductive research approach, since we analyze data and then design and evaluate a model to learn patterns from that data. Furthermore, the results of the model are compared to the BDT, which serves as a benchmark. We thus apply a quantitative analytical methodology which can be summarized in the following steps:

• Objectives: State the purpose and the goal of the study being undertaken (presented in Sections 1.3-1.4).

• Relevant theory: Theoretical background including previous research and knowledge on the subject area (presented in Chapters 2-3).

• Research questions: A concise formulation of what will be studied (presented in Section 1.2).

• Data collection: What type of data is collected and how. Which quantitative data will be used, primary or secondary, and from where (presented in Section 3.3.2).

• Tests: The modeling of the data and the statistical tests performed to answer the research question (presented in Chapter 4).

• Results: Present the outcome of the neural network modeling and statistical tests (presented in Chapter 4).

• Conclusion and discussion: Answering the research question and discussing the results as well as the implication of the answered research question for future work (presented in Chapter 5).

1.6 Delimitations

We assume that the primary vertex is the one with the highest sum of transverse momentum squared (summed over the associated tracks) [8]; a compact expression for this criterion is given below. This may not always be true and can thus affect the learning, as it does not always align with reality. However, this is currently mitigated by excluding the small fraction of events where the assumption does not hold. There is a very large number of hyperparameter combinations that could be evaluated when designing the proposed neural network. For the sake of time, an extensive hyperparameter analysis will not be performed. Instead, the hyperparameter variations that are conducted focus on changing one variable at a time for smaller models and then applying those results to the final model. Another delimitation is the imbalanced data set (2% of the data are HS tracks), which poses a challenge for the learning of the neural network; this is currently mitigated by discarding tracks that are spatially incompatible with the assumed HS vertex. Filtering thresholds for the data are calculated on a restricted set of 20000 events instead of the full data set, as the process is time consuming and requires a lot of RAM.
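Written out, the vertex-selection assumption referred to above corresponds to the standard criterion below (the notation is ours, not taken verbatim from the thesis):

```latex
% Hard-scatter (primary) vertex assumption: the vertex whose associated tracks
% give the largest sum of squared transverse momentum.
\mathrm{PV} \;=\; \operatorname*{arg\,max}_{v \,\in\, \text{vertices}} \;\sum_{i \,\in\, \text{tracks}(v)} p_{T,i}^{2}
```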

1.7 Outline

Moving forward, Chapter 2 presents the conceptual background of the project; this includes a physics background explaining the context and a section on related work, including the currently used BDT model. Chapter 3 covers the mathematical background and the engineering-related content, including a detailed explanation of neural networks in general, the choice of model, and how the framework for data collection, training and evaluation is built. In Chapter 4, the results are presented and compared against the performance of the BDT model, and finally in Chapter 5 we present our conclusions and a discussion of future work as well as some final words on the project.


2 Conceptual Background

2.1 Physics background

This section presents the relevant physics background to contextualize the problem and make it more tangible for readers who do not have a background in experimental particle physics.

2.1.1 The Standard Model

The Standard Model of particle physics is a successful scientific theory describing the fundamental particles of the universe. It explains the smallest building blocks of nature, the constituents of all known matter in the universe, the fermions, which are divided into two types, quarks and leptons, and how these interact through the electromagnetic force and the strong and weak nuclear forces. Each group of fermions consists of six particles, divided into pairs forming three generations. The lightest and most stable particles make up the first generation, while the heavier and more unstable particles constitute the second and third generations. All stable matter is built from first-generation particles, as particles from higher generations quickly decay to more stable ones. The quarks are likewise divided into three generations: the up and down quarks in the first generation, the charm and strange quarks in the second generation, and the top and bottom/beauty quarks in the third generation. [9]

For each force mentioned above, there is a corresponding force-carrying particle, a boson. For the electromagnetic force, it is the photon, denoted γ. For the strong nuclear force, it is the gluon, denoted g. For the weak nuclear force, it is the Z and W bosons. There is also another boson included in the Standard Model, the Higgs boson. Discovered in 2012, the Higgs boson is the particle of the Higgs field, a field present in all of the universe which, by interaction, gives mass to other particles [9]. For a full visual representation of the Standard Model, see Figure 2.1.1.

Figure 2.1.1: Particles of the standard model [9].

The ATLAS experiment played a crucial part in the discovery of the long-sought Higgs boson (and thus the Higgs field), confirming an almost 50-year-old theory and the last missing ingredient of the Standard Model. The discovery of the Higgs boson was one of the main goals when building the LHC and one of the most important discoveries made so far at CERN [10].

Although the Higgs boson discovery was an important milestone for CERN, there are still many unanswered questions and discoveries to be made. One of these is the mystery of dark matter. Cosmological measurements and astronomical observations tell us that of all the matter in the universe, only 15% consists of ordinary matter, fermions, the particles that make up us, our surroundings, and everything that we can measure and interact with. The remaining 85% appears to consist of dark matter. The central property of dark matter is that it interacts only through gravity and weak interactions, and can thus not be detected easily with particle detectors [11]. Such a property makes it difficult to measure dark matter directly, as our measuring devices consist of ordinary matter. Nonetheless, due to very precise measurements of indirect evidence that support this theory, it has been accepted by the research community.

Researchers from all over the world are now searching for direct evidence of dark matter. To be able to answer this question, research groups within the ATLAS collaboration are searching for processes where dark matter is created in the collisions produced by the LHC.

2.1.2 The ATLAS experiment

The ATLAS detector is the largest of the four detectors at the LHC. The detector consists of multiple subsystems which are placed in a layer-wise fashion around the collision point (see Figure 2.1.2). The four major components are the Inner Detector, which measures the direction, momentum and charge of electrically charged particles; the Calorimeters, which measure the particles' energy; the Muon Spectrometer, which measures the momentum of muons; and the magnet systems, which bend the trajectories of charged particles. Note that the momenta of the particles are measured through the curvature of the tracks, which are bent by the magnetic fields. The Muon Spectrometer is located where only muons can reach. Because of this, measurements in the Muon Spectrometer can be used to classify tracks as muon tracks and further determine their momentum. [12]

2.1.3 The ATLAS coordinate system

The coordinate system used by the ATLAS experiment is given in Figure 2.1.3.

Figure 2.1.2: The ATLAS detector [13].

Important for this study are the z-axis, defined by the beam direction, the x-y plane, which is transverse to the z-axis (referred to as the transverse plane), and the polar angle θ, the angle between a track and the z-axis. Given θ, we can also define the pseudo-rapidity as

η = −ln(tan(θ/2))

Hence, the absolute value of the pseudo­rapidity is large when the angle between a track and the beam axis is small.
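As a quick numerical illustration of this relation (the angles below are chosen for illustration only):

```python
# eta = -ln(tan(theta/2)): tracks close to the beam axis (small theta) get large |eta|.
import math

def pseudorapidity(theta_rad):
    return -math.log(math.tan(theta_rad / 2.0))

for theta_deg in (90.0, 45.0, 10.0, 2.0):
    eta = pseudorapidity(math.radians(theta_deg))
    print(f"theta = {theta_deg:5.1f} deg -> eta = {eta:5.2f}")
# 90 deg gives eta = 0, while 2 deg gives eta of roughly 4, i.e. the edge of the
# forward region covered by the HGTD (2.4 < |eta| < 4.0).
```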

Another very important quantity is the longitudinal impact parameter of a track, z0, defined as the distance in z between the origin and the point of closest approach of the track to the origin. The resolution of z0 depends primarily on η and the transverse momentum of a track and is crucial for assigning tracks to the correct vertices. If the z0 resolution of a track is larger than the typical distance between vertices, it is impossible to know which vertex the track originated from. As shown in Figure 2.1.4, a track with a transverse momentum of 1 GeV and |η| = 3 has a z0 resolution of approximately 1 mm, which can be compared to the average of 1.8 vertices per mm in the ATLAS detector at high luminosity [2]. However, as shown in Figure 1.1.1, timing information can be used to separate vertices even if there are multiple vertices spatially compatible with a track. This is the purpose of the HGTD.


Figure 2.1.3: The ATLAS coordinate system [14].

Figure 2.1.4: Resolution of the longitudinal track impact parameter for muons with transverse momentum of 1 GeV (blue) and 10 GeV (green) as a function of η [2].


2.1.4 High­Granularity Timing Detector

The TDR for the HGTD of the high-luminosity LHC (HL-LHC) was approved in September of 2020, and the HGTD will be part of this new high-luminosity upgrade. Luminosity refers to the number of collisions taking place in an area during a certain time and is expressed in units of cm⁻²s⁻¹. The upgrade to high luminosity aims to yield an increase in luminosity by a factor of 10, to 10³⁵ cm⁻²s⁻¹. In the context of the ATLAS experiment, this implies an increase in the number of proton-proton collisions from around 30 to approximately 200 per bunch-crossing. Under these conditions, more tracks will be produced and they will thus become harder to discern from each other spatially. The HGTD is then able to register those particles which reach the detector and assign time information to them.

Structurally, the HGTD consists of two disks which cover the pseudo-rapidity range 2.4 < |η| < 4.0 (referred to as the forward region) with a radius of 120 mm < R < 640 mm. Each disk consists of three thinner disks (inner, middle and outer ring) stacked together and encapsulated by an envelope with a radial extent of 110 to 1000 mm. The disks consist of 2×4 cm² silicon sensors with 1.3×1.3 mm² pixels. The disk radius sets the limit on the angle from the beam line for which particles will enter the detector, and the size of the pixels determines the spatial uncertainty of the incoming particles. The timing detector will be located approximately 3.5 m from the collision point, one disk on each side, as shown in Figure 2.1.5. [2]

2.1.5 Relevant physics processes

By colliding protons in the LHC, new particles can be created, and if some of these particles have the proposed characteristics of dark matter, this could give direct evidence that a new dark matter particle has been created.

Invisibly decaying Higgs produced through vector boson fusion

One such potential process is when a Higgs boson is produced through something called vector boson fusion (VBF) and afterwards decays into invisible particles that cannot be detected (resulting in apparently non-conserved momentum). An example of a vector boson is the W boson, a force-carrier particle of the weak force (as described in Section 2.1.1). In a proton-proton collision, as the protons get close enough to each other, one quark of each proton can radiate a W boson. These two W bosons can then fuse together (vector boson fusion) and produce a Higgs boson. This Higgs boson can, with a certain probability, decay into invisible particles. According to the Standard Model, this probability is expected to be around 0.1%. However, if experimental evidence shows that the Higgs boson, produced through VBF, decays into invisible particles with a significantly higher probability than the derived one, this could show evidence of dark matter. This is the signal process of our analysis and it can be written as VBF H → inv. A Feynman diagram of this process is given in Figure 2.1.6.

Figure 2.1.5: The High-Granularity Timing Detector [2].

What is especially favorable with this specific process is that it has a relatively unique signature, which makes it more recognizable. As the Higgs boson decays into invisible particles, two jets are produced, sometimes reaching the forward region. In particle physics, jets are the experimental signatures of quarks and gluons produced in high-energy processes such as head-on collisions [16]. Since gluons and quarks have color charge (see Section 2.1.1) and cannot exist freely, they cannot be directly observed. Instead, quarks and gluons come together to form colorless hadrons, a process known as hadronisation, which leads to a collimated spray of particles known as a jet. Figure 2.1.7 illustrates a typical particle jet formed from a proton-proton collision.

Figure 2.1.6: Feynman diagram of an invisibly decaying Higgs produced through vector boson fusion. Denotations: q = quark, V = vector boson, H = Higgs and χ = invisible particle. [15]

Figure 2.1.7: An illustration of a jet formed from a proton-proton (pp) collision and the resulting collimated spray of particles (red to blue) [16].

The reconstructed tracks in jets that reach the forward region will be assigned a time by the HGTD and it will thus be easier to separate the interesting HS vertex from the uninteresting PU vertices.

Main background: Z + Jets

Another process that can produce similar measurements in the detector is when a Z boson is produced together with jets and decays into neutrinos. A corresponding Feynman diagram is given in Figure 2.1.8. As seen in the figure, this process produces one jet which may reach the HGTD. If it does, and if a jet produced by pileup also reaches the HGTD, this will give measurements similar to the footprint of the signal process. Further, as neutrinos are very hard to detect, this process could give rise to incorrect identifications of the VBF H → inv process. This process is therefore the main background process. If we can separate the two jets of the described process in time, we can identify this as a background process and remove it, to more accurately identify the interesting signal process.


Figure 2.1.8: Feynman diagram of the main background, where a Z boson is produced together with a jet. The Z boson decays into neutrinos, which are invisible to the detector. Denotations: q = quark, Z = Z boson, g = gluon and ν = neutrino [15].

2.1.6 Particle parameters and variable explanation

We begin by presenting some important expressions which are key to understanding the collisions and how the model will be applied to them. The first important expression is event(s), which signifies one bunch-crossing, currently containing around 30 pp interactions, where each interaction may produce several tracks (reconstructed from measurements in the ITk). As the LHC moves into its high-luminosity phase, the number of interactions will increase to around 200. For each simulated event, the number of interactions is denoted by µ, while the average number of interactions for a set of events is denoted by ⟨µ⟩. Figures presented in the TDR showcasing simulated events thus use the notation ⟨µ⟩ = 200 (see Figure 2.2.1a for example). To visualise the structure of an event and how PU affects the HS interaction, see Figures 2.1.9, 2.1.10, and 2.1.11, illustrating a simulated Z → µµ event in the ATLAS detector with PU values of µ = {2, 50, 140}. To the left in each figure is a cross section of the detector. To the right is the inside of the detector, shown along the beam axis (grey cylinder); the red jets contain particles from the interesting HS interaction, the yellow/orange tracks are particles from PU interactions, and the yellow and green squares are parts of the calorimeters that are activated by a particle passing through them.

The simulation closest to the conditions that this study focuses on is found for µ = 140, shown in Figure 2.1.11. Although the figure illustrates a Z → µµ event, the picture is analogous for a VBF H → inv event as well.

For each track in an event, the detector registers a number of track features, which signify the parameter values of a track. An example of a particle feature is the track's Cartesian coordinates measured from the centre of the collision beam. When discussing collisions in the detector, each point where a proton-proton collision occurs, and from which new particles are created, is called a vertex. In Figure 2.1.11, a vertex would be a point from which one or more of the yellow/orange tracks originate.

Figure 2.1.9: A simulated Z → µµ event in the ATLAS detector for µ = 2

Figure 2.1.10: A simulated Z → µµ event in the ATLAS detector for µ = 50

Figure 2.1.11: A simulated Z → µµ event in the ATLAS detector for µ = 140

Table 2.1.1 presents and explains the particle features that are relevant for this study.

The uncertainties on some of the quantities are estimated during the reconstruction process (by measurements in the detector).

2.2 Related work

This section presents the methods and work conducted on the topic so far. This includes the current method that classifies HS tracks and calculates the time of the primary vertex, and how background/signal algorithms have been constructed previously; finally, we present points of improvement for these methods.


Table 2.1.1: Table of track features relevant for this study

Parameter: Explanation
pT: Transverse momentum
η: Pseudo-rapidity, the angle of a track trajectory relative to the beam axis
z0: Longitudinal impact parameter of a track
σz0: Uncertainty in z0
q/p: Particle charge over momentum
σq/p: Uncertainty in q/p
d0: Transverse impact parameter
σd0: Uncertainty in d0
delta z: Difference between z0 and the z of the primary vertex (zpv)
delta z resunit: Difference between z0 and the z of the primary vertex (zpv), divided by √(σz0² + σzpv²)
time: Time of the track registered by the HGTD
time res: Time resolution of the track registered by the HGTD
is HS: Truth label indicating whether the track is from the HS vertex
distance primary vx: |z0 − zpv|
distance PU vx: The absolute distance from z0 to the closest pileup vertex
n PU vertices: Number of pileup vertices within 1 mm of z0
vertex id: The index of the reconstructed vertex that a track is said to originate from

2.2.1 Boosted decision tree

In the TDR for the HGTD, the present model which classifies the HS tracks and calculates the time of the primary vertex is a BDT developed by Alexander Leopold.

The time determination of the primary vertex with the BDT is described in Subsection 3.2.2 of the TDR and the process can be summarized as follows:

• For each event, an iterative time-clustering algorithm finds clusters of tracks which are within a window in the z-coordinate around the assumed HS vertex and which also have consistent times.

• The window around the assumed HS vertex is defined by the track z0 resolution, parameterized by the track pT and η.

• Time consistency of tracks in a cluster is ensured by the rule that the time of any track has to match that of any other track in the cluster within a window of 3σt, where σt is the quadrature sum (the square root of the sum of the squares) of the track-time errors for the two tracks being compared.

• The BDT algorithm is applied to identify the most likely HS cluster among the clusters derived from the iterative time-clustering algorithm.

• The BDT performs its optimization by taking in eight variables (presented in Table 2.2.1) as input and calculates the time of the HS vertex based on these.

Table 2.2.1: Table of BDT variables [2]. The eight input variables are:

• The weighted average (taking into account the corresponding track parameter error) of the transverse impact parameter
• The weighted average (taking into account the corresponding track parameter error) of 1 over the transverse momentum
• The uncertainty in cluster_q_over_p
• The uncertainty in cluster_d0
• The uncertainty on the weighted average of the longitudinal impact parameter
• The distance in z between the cluster's averaged z0 and the position of the primary vertex
• The significance in z between the cluster's averaged z0 and the position of the primary vertex
• The total sum of the transverse momentum squared (pT²) of the tracks
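To illustrate the clustering logic described above, the sketch below applies a 3σt time-compatibility test when grouping tracks. The quadrature combination of the two track-time errors, the greedy grouping, and all function names are simplifying assumptions made for illustration; they are not the exact TDR implementation.

```python
# Simplified illustration of the time-clustering compatibility rule.
# Assumption: the combined uncertainty sigma_t is the quadrature sum of the two
# track-time errors; the greedy grouping below is also only illustrative.
import math

def times_compatible(t1, s1, t2, s2, n_sigma=3.0):
    sigma_t = math.sqrt(s1 ** 2 + s2 ** 2)
    return abs(t1 - t2) <= n_sigma * sigma_t

def cluster_tracks(tracks):
    """tracks: list of (time, time_resolution) pairs, already restricted to a
    z0 window around the assumed HS vertex. Returns a list of clusters."""
    clusters = []
    for t, s in tracks:
        for cluster in clusters:
            # A track joins a cluster only if it is time-compatible with every member.
            if all(times_compatible(t, s, tc, sc) for tc, sc in cluster):
                cluster.append((t, s))
                break
        else:
            clusters.append([(t, s)])
    return clusters
```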

In the output of the model, the HS clusters (signal) are defined as those clusters which contain at least 50% HS tracks, as determined by the truth information from the MC samples of the VBF H → inv events (training data). The PU clusters (background), on the other hand, are defined as clusters containing only PU tracks. It is important to note that a track can belong to more than one cluster. [2]

The BDT determines the best cluster as the one containing at least three tracks with the maximum BDT output that passes a cut of 0.2, where the cut was chosen to ensure a background efficiency below about 10%. Finally, the results of the TDR showed that in 60% of the cases, the BDT chose the correct cluster. Furthermore, in 25% of the cases, no cluster is selected at all, while the remaining 15% of the cases correspond to mixed clusters with varying fractions of HS and PU tracks. In particular, in 5% of the total cases, the BDT assigns a time based purely on PU tracks, thus resulting in an incorrect time for the HS vertex [2]. The key results from the BDT for the signal process VBF H → inv can be found in Figure 2.2.1:


(a) The distribution of reconstructed t0 (t0,reco) minus truth t0 (t0,truth) for all vertices for which a t0 was found, separated into various categories based on the fraction of hard-scatter tracks in the selected cluster.

(b) Fraction of events as a function of the fraction of hard-scatter tracks in the selected cluster.

Figure 2.2.1: Graphs from the TDR showing the results of the BDT for the purpose of benchmarking.

2.2.2 Related studies

Since signal/background classification is an important part of experimental particle physics, many studies have been written on the subject, and different approaches have been taken to solve the problem of particle identification, i.e. discerning signal from background. We here summarize a few examples of such studies that use methodologies similar to those implemented in this project.

The first study that is presented gives a general formulation of the value of GNNs in particle physics [7]. The paper argues that since data in particle physics can usually be represented by sets and graphs, GNNs offer key advantages in learning high-level features and making predictions and/or classifications based on these. Furthermore, GNNs have previously proven to outperform traditional DNNs, especially for problems surrounding physics applications. The paper focuses on reviewing various GNN applications in high energy physics (HEP), such as different graph constructions, model architectures, learning objectives, and open particle physics problems where GNNs have promising potential. Some key areas of application that the paper suggests are event classification, PU mitigation via node classification, efficiency parameterization, and edge classification for charged-particle tracking. For event classification there is a recent example from Choma et al. [17], who implemented a graph convolution model to analyse data from the IceCube neutrino telescope. Using a GCN, the authors managed to yield a signal-to-background ratio about three times as large as the baseline for such a signal. There are many such examples, but the key point is that as CERN moves into its HL-LHC phase, GNNs can be a promising tool to infer new conclusions from the LHC experiments. Furthermore, the key findings from this paper are the successful applications of GNNs to event and node classification as well as PU mitigation, two very important aspects which our study also aims to tackle. [7]

In the second study [18], Bayirli utilises a Recurrent Neural Network (RNN) to analyse the hit-level and track-level Transition Radiation Tracker (TRT, part of the ATLAS detector) variables with the aim of enhancing the performance of the detector for particle identification. In the applied model, the best discriminating variables are chosen and combined in an RNN consisting of long short-term memory (LSTM) and feed-forward units. Proceeding, the author lists the hit-level and track-level variables and then performs a variable selection step to deduce which variables are the most important for describing the hits and tracks. The network model is then applied to data containing the most important variables as features for training, testing and validation. A hyperparameter analysis was also performed to tune the network for best performance. Finally, the model was evaluated using ROC curves and ROC AUC scores. Bayirli concluded that the implemented RNN model with LSTM units performed significantly better than the previous likelihood model. The key difference to take into account in the context of this project is that Bayirli's modeling relies on exploiting the sequence structure of the hits in the TRT detector, something that does not apply to our particular case. [18]

The third study focuses on jet classification by implementing a universal Set-to-Graph network model which uses all track information in a jet in order to determine whether a set of tracks originated from the same vertex inside the jet [19]. The focus of the study is solely on vertex finding, and it is challenged by two factors: firstly, secondary vertices can be in close proximity to the primary vertex and even to each other, making them difficult to discern spatially; secondly, the multiplicity of the charged particles in the respective vertices is very low, ranging from one to five. The proposed graph model is divided into three parts. The first part, denoted Φ, is an equivariant Set-to-Set function, which means that it learns a node representation; for this the authors use a DeepSet network [20]. The second part is a broadcasting layer, denoted β, which forms all possible k-tuples of nodes from the learned node representations. The final part, denoted Ψ, is a Multi-Layer Perceptron (MLP) that operates on each edge/hyper-edge to produce the final output, the edge prediction. The data trained on was a generated tt̄ sample, and the results showed that the Set-to-Graph model outperformed standard techniques while also improving on the ability to discern nearby vertices. Finally, the authors propose that further study could explore the application of this technique to more complicated decays, such as a boosted Higgs decaying to bb/cc, as well as to more complex data sets such as full detector simulations and PU interactions [19]. This last part is relevant for our study, as this is the type of problem we are working on, with the exception that the full detector simulations are limited to those events where at least one HS track reaches the HGTD.

2.2.3 Summary of related work

The most important messages for this study are the algorithm and results of the BDT used in the TDR as well as the success of GNNs in particle identification. Concerning the BDT, the results from the TDR show that in 60% of the cases the algorithm assigns the correct time, in 25% of the cases no time at all is given, and in the worst 5% of cases clusters are chosen based entirely on PU tracks, giving an inaccurate time determination. The two last cases thus provide good areas for improvement, where a more complex network that utilizes lower-level data could outperform the current model. The studies that have applied different neural networks (among them GNNs) to particle identification show promising results, where models can use fewer parameters while still performing as well as or better than previous methods. Utilizing a graph network approach, event information can be gathered at track level even though the data structure is irregular.


3 Mathematical background and engineering-related content

3.1 Basics of neural networks

This section presents the essential knowledge needed to understand a neural network and how it works in theory and in practice. Furthermore, the neural network that is used in this study will also be presented in detail.

3.1.1 Nodes

The nodes of a network are the points of input where data enters the network. In the case where these nodes are "hidden", meaning that they are part of the second or a deeper layer, the input is the output of the previous layer. In a neural network, each node has a weight assigned to each input, usually initialized to a random number between 0 and 1. The purpose of the node is to calculate a weighted aggregation of the input together with a bias, which is then used to decide whether the node (and the rest of the network) should activate or not, based on a specific threshold [21]. The weighted aggregation can most easily be described by the sum

z = Σ_{i=1}^{n} (x_i · w_i) + b,

where x_i is the i-th input of the node, w_i is its weight, b a bias, and z the node's output. Figure 3.1.1 presents a visual representation of a perceptron based on the McCulloch-Pitts neuron model from 1943 [21]. Encircled by the red square is the first part of the node, which calculates the sum z that is then sent through an activation function.


Figure 3.1.1: A single-layer perceptron with one node, showing the input and node calculation. Here x denotes a vector of size n containing all x_i, and w is likewise an n-vector containing all w_i. Finally, σ denotes the activation function and ŷ the output of the perceptron.

3.1.2 Activation function

Since the value z from a node may range from −∞ to +∞, we utilise an activation function, also referred to as a transfer function, to decide whether the output z of a node should be considered to indicate that the node is activated or not [22]. There are many different types of activation functions, depending on the network architecture used and the problem one is trying to solve. For the sake of brevity, only the activation functions relevant to this study are presented (where z is the input):

• ReLU (Rectified Linear Unit): σ(z) = max(0, z), where σ(z) ∈ [0, ∞)

• Sigmoid: σ(z) = 1 / (1 + e^(−z)), where σ(z) ∈ (0, 1)

The ReLU function is used for our multi-layer perceptron in order to be able to solve non-trivial groupings, while the sigmoid function is used in the GNN to signify the probability of a track being HS or PU. The threshold put on the sigmoid function can be summed up as follows:

if ŷ ≥ 0.5, assign track as HS
if ŷ < 0.5, assign track as PU        (3.1)

The output from the activation function is then ŷ = σ(z), which can be seen in Figure 3.1.2. Note that the calculations of z and σ(z) both occur inside the node, and the output is then ŷ.

Figure 3.1.2: A single-layer perceptron with one node, showing the activation function and output. Here x denotes a vector of size n containing all x_i, and w is likewise an n-vector containing all w_i. Finally, σ denotes the activation function and ŷ the output of the perceptron.
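A minimal numerical version of the node in Figures 3.1.1-3.1.2, with invented numbers: the weighted aggregation z, a sigmoid activation, and the threshold of Equation (3.1).

```python
# Single node: z = w.x + b, sigmoid activation, HS/PU threshold of Eq. (3.1).
# All numbers are invented for illustration.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.8, -1.2, 0.5])   # one track's (scaled) input features
w = np.array([0.4, 0.1, -0.3])   # weights, normally initialised at random
b = 0.05                         # bias

z = np.dot(w, x) + b             # weighted aggregation from Section 3.1.1
y_hat = sigmoid(z)               # output in (0, 1), read as P(track is HS)
label = "HS" if y_hat >= 0.5 else "PU"
print(f"z = {z:.3f}, y_hat = {y_hat:.3f} -> {label}")
```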

3.1.3 Supervised learning

Supervised learning means that the data comes with labels, which the network uses to learn whether the prediction or classification it has made is correct or not [22]. For our study, the labels are binary values referring to a track as being either HS or PU, i.e. 1 or 0. Looking back at Figure 3.1.2, the predicted label is denoted by ŷ while the true label is denoted by y, thus implying that y ∈ {0, 1}.

3.1.4 Learning rule

The learning rule of a network is in essence the logic of the learning algorithm, i.e. how the weights and biases in the network are updated so that the network "learns" the input that is provided. As with activation functions, there are many different types of learning rules, and for this reason we only present the one relevant for this study, namely the back-propagation ("backprop") algorithm driven by Adaptive Moment Estimation (Adam). The general idea of gradient-descent learning rules is to follow the gradient of the loss function (presented in Section 3.1.6) towards a minimum, ideally with a learning rule designed to find the global minimum (not always possible).

3.1.5 Batch learning and online learning

The main difference between online and batch learning is whether the network learns incrementally from an incoming stream of data. In essence, online learning means that the network learns (updates its weights) based on each new data point that is presented to it, while batch learning means that the network learns from data points in groups (batches) of size k, usually chosen as a power of 2. Another key difference between the learning methods is the computational power required, as well as how the weight updates are performed. During online learning, the network's weights are updated for each input, while batch learning performs a weight update for each batch. A batch is a group of data points that the network is learning from; the number of data points in a batch is known as the batch size. In general, since online learning is a one-pass method (going through the data in one sequence) while batch learning is a multi-pass method, online learning usually requires less computational power. However, for most problems where model evaluation is essential, batch learning is superior, since we make distributional assumptions about the training data that should carry over to the data set we subsequently test on; this is not the case for online learning. Online learning methods are thus more difficult to evaluate. In this study, batch learning will be used: relationships are learned in each batch of events, the network's weights are updated on that batch, and the procedure then moves on to the next batch to see whether the updated weights hold up, updating them again based on the new batch.

3.1.6 Loss function

The loss function measures how well the algorithm models the input data by penalizing deviations of the prediction from the true value [22]. If the deviations are too large, the loss function increases, i.e. a cost has been incurred. The aim of a network is to reduce the value of the loss function, thereby finding the best model prediction. Depending on the type of network and the problem at hand there are many different loss functions to consider. The two most common regression losses are the mean squared error,

MSE = (1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²,

and the mean absolute error,

MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|,

where n is the batch size. In our case we are mostly interested in binary classification, and for this we use the binary cross-entropy loss,

BCE = −(1/n) Σ_{i=1}^{n} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ].

In the BCE loss, the predictions ŷ_i are seen as probabilities, i.e. the probability of the correct label being y_i. As such, the log function serves to penalize large deviations from y_i. For example, if the label y_i = 1 and the prediction is ŷ_i = 0.01, the loss is high since the network has given a very poor prediction.

Using the learning rule, the network weights should then be updated such that the loss function decreases.
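As a check of the BCE expression above, the snippet below evaluates it both by hand and with PyTorch's built-in loss on a few invented HS/PU predictions; the two results coincide.

```python
# Binary cross-entropy on illustrative labels/predictions (values invented).
import torch

y_true = torch.tensor([1.0, 0.0, 0.0, 1.0])    # truth: HS = 1, PU = 0
y_pred = torch.tensor([0.9, 0.2, 0.4, 0.01])   # predicted probabilities

# Manual BCE, matching the formula in Section 3.1.6.
bce_manual = -(y_true * torch.log(y_pred)
               + (1.0 - y_true) * torch.log(1.0 - y_pred)).mean()

bce_builtin = torch.nn.BCELoss()(y_pred, y_true)
print(bce_manual.item(), bce_builtin.item())   # identical; dominated by the last,
                                               # badly wrong prediction (0.01 vs label 1)
```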

3.1.7 Learning rate

The learning rate is the hyper-parameter which determines the step size an optimizing algorithm takes at each training epoch towards a minimum of the loss function. It is one of the most important hyper-parameters in a network and may vary greatly in value depending on what type of network architecture is used and on what type of data. A very large learning rate can result in the algorithm converging fast but jumping "around" the minimum. A very small learning rate, on the other hand, can lead to very slow convergence and may also result in the network finding a local minimum before ever reaching the global minimum. To illustrate this, Figure 3.1.3 shows three hypothetical cases of a loss function L(w) in one dimension, where w denotes the weight vector of the network. The green dots in the graphs represent the steps taken towards a minimum of the loss function based on the size of the learning rate.

Figure 3.1.3: Illustration of the loss as a function of the network's weights, with steps taken towards a minimum. (a) Small learning rate. (b) Large learning rate. (c) Good learning rate.

Because of the challenges presented above, there are many suggested ways to mitigate them and choose learning rates. Adaptive learning rates are one example of how to mitigate this problem, by beginning with a larger learning rate to find a "slope" in the gradient and then lowering the learning rate over the following epochs, ensuring that the network reaches a minimum but can still find a global minimum. One example of an adaptive learning rate is to lower the learning rate by a factor of 0.1 as soon as the lowest loss value reached so far has not improved for multiple preceding epochs. Another technique for accelerating the convergence of the loss function is to use momentum. The Adam algorithm uses an adaptive learning rate which incorporates adaptive momentum, lowering the risk of getting stuck in a local minimum.
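One common realisation of this "reduce on plateau" idea is sketched below with PyTorch's scheduler; the placeholder model, learning rate, and patience value are illustrative assumptions only.

```python
# Reduce the learning rate by a factor of 0.1 when the monitored (validation)
# loss has not improved for `patience` epochs. Model and numbers are placeholders.
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)

# Inside the training loop, after computing the validation loss for the epoch:
# scheduler.step(val_loss)
```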

3.1.8 The back­propagation algorithm

To begin with, Adam [23] was chosen over regular gradient descent (GD) and stochastic GD (SGD) since it is more adaptive than SGD in the context of GCNs, while still reducing the computational burden in high-dimensional optimization problems that comes with regular GD. Adam, like SGD, thus achieves faster iterations in exchange for lower convergence rates. Since we rely on very high-dimensional data from a large sample, Adam provides key advantages where the trade-off becomes manageable using an adaptive learning rate. The practical difference between Adam, SGD, and GD is that SGD replaces the real gradient in GD, calculated using the entire data set, by a stochastic estimate calculated from a pseudo-randomly selected subset of the data. Adam goes a step further than SGD and introduces "momentum" terms to avoid getting stuck in local minima. To get a better understanding of the Adam algorithm, we present the mathematics of its implementation below. Firstly, the key parameters and variables of the algorithm are presented in Table 3.1.1.

Table 3.1.1: Table of algorithmic contents [23]

Variable/parameter  Domain  Explanation
t  Z+  Time step of the algorithm
m_t  R  The gradient at time t
v_t  R+  The squared gradient at time t
α  R+  Learning rate of the algorithm
θ  R  Parameters to optimize with respect to
L(θ)  R  Stochastic loss function
g_t  R  Gradient of L(θ) with respect to θ
ϵ  R+  Scaling parameter
β1, β2  [0, 1)  Hyper-parameters which control the decay rates of the exponential moving averages of m_t and v_t


To clarify further, the goal is to minimize E[L(θ)] with respect to θ. To do this we need to calculate the gradients g_t = ∇_θ L_t(θ) at each time step t [23]. To begin, the algorithm requires the values of β1, β2 ∈ [0, 1), α, and ϵ, as well as an initial parameter vector (usually the network weights) θ_0 and the loss function L(θ), to be known. We then initialize m_0 = 0, v_0 = 0, and t = 0, and start the optimization with the condition that as long as the parameter θ_t has not converged, we continue. The algorithm proceeds by updating the time step as t ← t + 1 and then calculating g_t = ∇_θ L_t(θ_{t−1}). In the next step, the moment estimates m_t and v_t are updated as:

m_t ← β1 · m_{t−1} + (1 − β1) · g_t
v_t ← β2 · v_{t−1} + (1 − β2) · g_t²        (3.2)

The authors however note that since m and v are initialised as zero vectors, they become biased towards zero, especially during the initial time steps and when the decay rates are small, i.e. when β1 and β2 are close to 1. These initialization biases can be mitigated by computing the following bias-corrected estimates:

m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)        (3.3)

In the last step, the algorithm then proceeds to update the parameters θ_t using the bias-corrected estimates:

θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ϵ)

We then go back to the top and check whether θ_t has converged; if it has not, we begin the process again with a new time step, and the algorithm continues from there. From empirical testing, the authors also suggest using the values β1 = 0.9, β2 = 0.999 and ϵ = 10⁻⁸. [23]
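The update rules above translate almost line by line into code. The sketch below applies them to a toy quadratic loss; the toy loss, step count, and learning rate are only there to make the example runnable and are unrelated to the configuration used in this thesis.

```python
# Adam update (Eqs. 3.2-3.3 plus the parameter update), with the authors'
# suggested beta1 = 0.9, beta2 = 0.999, eps = 1e-8. Toy loss L(theta) = ||theta||^2.
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # Eq. (3.2), first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # Eq. (3.2), second moment
    m_hat = m / (1 - beta1 ** t)                # Eq. (3.3), bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0, -3.0])                   # initial parameters
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 501):
    grad = 2.0 * theta                          # gradient of ||theta||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                                    # driven close to the minimum at 0
```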


3.1.9 Dropout

Dropout is a regularization technique designed to reduce overfitting when training a neural network. Dropout means randomly dropping out, i.e. turning off, nodes (either visible or hidden, or both) and their respective connections while training the network. The dropout rate is the fraction of the nodes which are omitted in each epoch of training and is thus a hyper-parameter which can be adjusted to tune the network to yield better results. [24] See Figure 3.1.4 for an illustration.

(a) General feed-forward network (b) Dropout with rate = 0.3

Figure 3.1.4: A general feed-forward network visualising the effect of dropout during training.
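As a hedged illustration of how dropout is used in practice, the snippet below adds a dropout layer to a small feed-forward model in PyTorch; the layer sizes and the rate of 0.3 are arbitrary choices for this example and are not the values used in this project.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Dropout(p=0.3),   # each hidden activation is zeroed with probability 0.3 during training
    nn.Linear(32, 2),
)

model.train()   # dropout active: random units are turned off in every forward pass
y_train = model(torch.randn(4, 10))

model.eval()    # dropout disabled at evaluation time; all activations are kept
y_eval = model(torch.randn(4, 10))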

3.1.10 Recall and precision

The concepts of recall and precision are mostly used in pattern recognition and classification problems, especially binary classification. They measure the relevance of the classified data. To understand these concepts, the classified data points are divided into four groups: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In binary classification, the model predicts whether a data point belongs to class 0 or class 1. In this context, positives are the data points that are predicted to belong to class 1. This group can now be separated


into two parts: true positives and false positives. The true positives are the samples correctly classified as 1, whereas the false positives are the data points that were predicted to belong to class 1 but actually belong to class 0. The same holds for the negatives. True negatives are the data points correctly classified as 0, while the false negatives are data points incorrectly classified as 0. Recall is the fraction of the actual positives that are ”picked up” (or recalled) by the model: ”Of all the actual positive data points, what percentage is captured by the model?” This number is calculated as TP / (TP + FN). Precision (or positive predictive value) is, on the other hand, the fraction of the predicted positives that are actual positives: ”When the model predicts that a data point is a positive, what is the probability that it is correct?” This number is calculated as TP / (TP + FP).
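As a small worked example, precision and recall can be computed directly from predicted and true labels; the label arrays below are made up purely for illustration.

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # actual classes
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # predicted classes

tp = np.sum((y_pred == 1) & (y_true == 1))    # correctly predicted positives
fp = np.sum((y_pred == 1) & (y_true == 0))    # predicted positive, actually negative
fn = np.sum((y_pred == 0) & (y_true == 1))    # predicted negative, actually positive

recall = tp / (tp + fn)       # fraction of actual positives that were recalled
precision = tp / (tp + fp)    # fraction of predicted positives that were correct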

3.1.11 Precision-recall trade-off

Given the definitions in the previous section, there is a direct trade-off between precision and recall. The only terms separating these metrics are FN and FP in the denominators of recall and precision, respectively. Trying to minimize FN (maximizing recall) often results in increasing FP (decreasing precision). How to handle this trade-off is highly dependent on the classification problem at hand.
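One way to see this trade-off in practice is to vary the decision threshold applied to the model's predicted scores, as in the sketch below; the scores and labels are invented purely for illustration.

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.35, 0.8, 0.6, 0.1, 0.7, 0.2])   # model output probabilities

for threshold in (0.3, 0.5, 0.7):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    # Lower thresholds recall more positives but admit more false positives.
    print(threshold, tp / (tp + fn), tp / (tp + fp))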

3.1.12 Topologies

A topology is the architecture of the network, i.e. how the layers are designed and how information is transferred between layers and nodes.

Multi-layer perceptron

An MLP is structured with an input layer consisting of n input nodes, one or more hidden layers of nodes, followed by an output layer. The structure forms a feed-forward network where information is passed from the input layer through the hidden layer(s) to reach the output layer. In the forward pass, every layer except the input layer applies a non-linear activation function to the values passed from the previous layer. In the backward pass, i.e. the learning phase, the network utilizes supervised learning with back-propagation. MLPs are useful in classification problems and fitness approximations. [22] Figure 3.1.5 illustrates a general two-layer MLP.


Figure 3.1.5: A multilayer perceptron with n inputs and a hidden layer of size m, showing the activation function and output. W^H denotes the weights used from the input layer to the hidden layer, and W^O likewise denotes the weights from the hidden layer to the output layer. z^H_i (i ∈ {1, ..., m}) denotes the i:th output of the hidden layer, while z^O_j (j ∈ {1, 2}) denotes the j:th output of the output layer.
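For concreteness, a two-layer MLP of this shape can be expressed in PyTorch roughly as follows; the input size n = 10, hidden size m = 32, the ReLU activation and the two output nodes are illustrative assumptions rather than the architecture used later in this work.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, n_inputs=10, n_hidden=32, n_outputs=2):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, n_hidden)   # weights W^H
        self.out = nn.Linear(n_hidden, n_outputs)     # weights W^O
        self.act = nn.ReLU()                          # non-linear activation

    def forward(self, x):
        z_h = self.act(self.hidden(x))   # hidden layer outputs z^H
        return self.out(z_h)             # output layer outputs z^O

mlp = MLP()
logits = mlp(torch.randn(4, 10))   # forward pass on a batch of 4 examples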


Deep neural networks

The difference between a shallow neural network and a deep neural network (DNN) is simply that a deep neural network has more than one hidden layer. DNNs are a family of NNs that use representation learning and can be trained with supervised, semi-supervised and unsupervised learning. The main idea is to use an unbounded number of layers, each with a bounded number of nodes and non-polynomial activation functions, to discern high-level features from the input data [22]. Taking image classification as an example problem, the first layers could be used to recognize edges while the last layers could identify attributes relevant for humans such as numbers, faces or animals. Examples of DNNs are Convolutional Neural Networks (CNNs), RNNs, GNNs and MLPs with more than one hidden layer.
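Continuing the MLP sketch above, the only structural change needed to turn the shallow network into a deep one is to stack additional hidden layers; the widths below are again arbitrary illustrative choices.

import torch.nn as nn

# A "deep" MLP: several hidden layers are stacked before the output layer.
deep_mlp = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),
)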

3.2 Overview of graph neural networks

Below, three types of GNNs are presented: the general graph neural network, the graph convolutional network (GCN), and the graph attentional network (GAT). These networks will be described by their respective characteristics and how they are related to the purpose and goal of this study.

3.2.1 Graph neural networks

Although Sperduti et al. [25] were the first to apply neural networks on directed acyclic graphs in 1997, it took until 2005 for the notion of GNNs to be fully outlined [26] and further elaborated by Scarselli et al. in 2009 [27]. For the purpose of clarity, the general GNN presented in this section (3.2.1) will focus on the model described by Scarselli et al. as it provides a more relevant and modern view of the network architecture.

In the broad sense, a graph network represents data using a graph structure and models the data as nodes connected by edges, either directed or undirected. Graph neural networks are DNNs which operate on a graph domain; their analysis is focused on node classification, link prediction and clustering. More specifically, the idea is to learn a function τ which maps a graph G and one of its nodes n to a vector of, in the case of classification problems, integers: τ(G, n) ∈ Z^m [27]. Furthermore, for graph-domain applications there are two broad classes to consider, graph-focused and node-focused, as named by Scarselli et al. For graph-focused applications, the function τ is

References
