
Implementation of Variational Autoencoder on the simulated particle collider data

by: Mário Cardoso

February 16, 2021

Supervisors: R. Gonzalez Suarez, O. Sunneborn Gudnadottir

Abstract

We study the possibility of applying deep learning algorithms, such as Variational Autoencoders, to simulated particle collider data in order to detect Beyond the Standard Model events. In this report, we apply three different processes of training on the data and present the results of said training on detecting anomalies. Links to the training and testing data can be found here: https://www.phenomldata.org/

Nuclear and Particle Physics Division

Department of Physics and Astronomy, Uppsala University (Sweden)


Contents

1 Introduction
  1.1 VAE description
2 Data Samples
3 Approach
4 Results
  4.1 Anomaly-free testing
  4.2 Testing on BSM data sample
  4.3 Testing on mixed data sample
5 Conclusion
References
List of Figures


1 Introduction

The Standard Model (SM) of particle physics has been widely accepted for its incredibly accurate predictions of several physics processes; nevertheless, it lacks an explanation for certain phenomena such as neutrino masses and Dark Matter (DM). There is therefore a need for extensions of this model, Beyond the Standard Model (BSM) theories, and searches for physical processes that support them are essential. This search for New Physics (NP) has been a primary objective of the Large Hadron Collider (LHC), which has boosted the study and implementation of Machine Learning (ML) as a tool to detect such events.

Several attempts have been made to detect BSM processes in the LHC data, the majority of them model dependent. This method consists of specifying a particular model and then searching for events that fit within it. So far there have been no observations of NP using this method, given the possibility of missing BSM events when constraining the data to fit a particular model. It is therefore important to increase the odds of observing NP in the data, since out of the ∼40 million events per second produced at the LHC only ∼1 000 events per second can be stored. One way of achieving this is to implement model-independent algorithms such as the Variational Autoencoder (VAE) [4, 5, 6], which detect anomalies (outliers) in the data as possible NP processes.

1.1 VAE description

The VAE is a directed probabilistic unsupervised model, approximated with artificial neural networks, that is used for dimensionality reduction. Fig. 1 shows a schematic representation of the VAE algorithm. It is trained on a set of input features that pass through an encoder, which compresses them into a lower-dimensional latent space (z); a decoder then decompresses the latent representation, returning the shape parameters that describe a probability density function (pdf) of each feature. In contrast to a simple Autoencoder (AE), the VAE models the latent space stochastically.


Figure 1: Schematic representation of the VAE algorithm. [10]

In this report we perform a preliminary test by following the procedure of Ref. [3] and implementing the built-in VAE model of the Pyod package [11], to assess the possibility of using this type of deep learning algorithm as an anomaly detector for NP events.
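To make the encode, sample, decode flow described above concrete, the following is a minimal sketch of a VAE forward pass with randomly initialised weights, purely for illustration. It is not the Pyod implementation used later in the report, and the layer sizes are arbitrary assumptions.

# Sketch of the VAE structure: encode to (mean, log-variance) of q(z|x),
# sample a latent vector z, decode z back to the feature space.
# Weights are random and untrained; this only illustrates the data flow.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, latent_dim = 12, 5, 2

def dense(x, w, b, activation=None):
    """One fully connected layer."""
    y = x @ w + b
    return np.tanh(y) if activation == "tanh" else y

# Random (untrained) weights for one encoder layer and one decoder layer.
w_enc, b_enc = rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden)
w_mu, b_mu = rng.normal(size=(n_hidden, latent_dim)), np.zeros(latent_dim)
w_lv, b_lv = rng.normal(size=(n_hidden, latent_dim)), np.zeros(latent_dim)
w_dec, b_dec = rng.normal(size=(latent_dim, n_hidden)), np.zeros(n_hidden)
w_out, b_out = rng.normal(size=(n_hidden, n_features)), np.zeros(n_features)

x = rng.normal(size=(1, n_features))        # one event with 12 features

# Encoder: compress x into the mean and log-variance of q(z|x).
h = dense(x, w_enc, b_enc, "tanh")
z_mean, z_log_var = dense(h, w_mu, b_mu), dense(h, w_lv, b_lv)

# Reparameterisation: draw a stochastic latent vector z. This sampling step
# is what distinguishes the VAE from a plain autoencoder.
z = z_mean + np.exp(0.5 * z_log_var) * rng.normal(size=z_mean.shape)

# Decoder: map z back to parameters describing the input features.
x_rec = dense(dense(z, w_dec, b_dec, "tanh"), w_out, b_out)
print(x_rec.shape)   # (1, 12): one reconstructed value per input feature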


2 Data Samples

As stated before, we use simulated data provided by the LHC Simulation Project [2]. Given that this is a small project, we only utilize 2 data sets: single top (SM) and gluino (BSM). The first data sample (single top) represents the process pp → t + jets (+2j), corresponding to σ = 130 pb, with N_events = 1 297 142. The second data sample (gluino) represents the process pp → g̃g̃ at 1 TeV (g̃ denoting the gluino), with σ = 0.20 pb and N_events = 2 013. Both sets have an integrated luminosity of 10 fb⁻¹ and are normalized to the cross section, such that the weight of each event is 1. These events fulfill a set of requirements:

– at least one (b)-jet with transverse momentum pT > 60 GeV and pseudorapidity |η| < 2.8, or
– at least one electron with pT > 25 GeV and |η| < 2.47, except for 1.37 < |η| < 1.52, or
– at least one muon with pT > 25 GeV and |η| < 2.7, or
– at least one photon with pT > 25 GeV and |η| < 2.37.

These data samples are stored in .csv files that were formatted with Pandas [9, 7], following the format of Ref. [2]:

eventID; processID; eventweight; MET; METphi; obj1, E1, pt1, eta1, phi1; obj2, E2, pt2, eta2, phi2; ...

We used the parameters of each file to compute 12 of the 21 high-level features (HF) presented in Ref. [3]. Out of the 21 HF we were not able to compute, for the isolated lepton, the three isolation quantities (CHPFIso, NEUPFIso, GAMMAPFIso), the absolute value of its transverse momentum, and the number of reconstructed hadrons (both charged and neutral), because the data chosen for this project do not encode the required information. Before computing the HF, we first filter our data set by requiring at least 1 lepton with pT > 22 GeV in each event. This requirement reflects the fact that many physics processes, such as the single top process considered here, have 1-lepton final states. After this selection we end up with N_events,single top = 156 896 and N_events,gluino = 638.
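The sketch below shows one way an event line in this format could be parsed and the "at least 1 lepton with pT > 22 GeV" filter applied. The exact object labels ('e-', 'e+', 'm-', 'm+') and the toy values are assumptions for illustration, not taken from the data files themselves.

# Hedged sketch: parse one semicolon-separated event (format of Ref. [2])
# and apply the lepton pT > 22 GeV selection described above.
LEPTON_LABELS = {"e-", "e+", "m-", "m+"}   # assumed labels for electrons/muons

def parse_event(line):
    """Split one event line into header information and a list of objects."""
    fields = [f.strip() for f in line.strip().strip(";").split(";")]
    event_id, process_id, weight, met, met_phi = fields[:5]
    objects = []
    for obj in fields[5:]:
        label, e, pt, eta, phi = [v.strip() for v in obj.split(",")]
        objects.append({"obj": label, "E": float(e), "pt": float(pt),
                        "eta": float(eta), "phi": float(phi)})
    return {"eventID": event_id, "processID": process_id,
            "weight": float(weight), "MET": float(met),
            "METphi": float(met_phi), "objects": objects}

def has_selected_lepton(event, pt_min=22.0):
    """Selection used in the report: at least one lepton above pt_min."""
    return any(o["obj"] in LEPTON_LABELS and o["pt"] > pt_min
               for o in event["objects"])

# Example usage on a toy line (all values are made up for illustration):
line = "1;0;1.0;45.2;0.3;j,210.0,180.5,0.4,1.2;e-,60.1,35.7,-0.8,2.9;"
event = parse_event(line)
print(has_selected_lepton(event))   # True: the electron has pT = 35.7 GeV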

The features used are listed below:

• The total lepton charge.
• ST, i.e. the scalar sum of the pT of all the jets, leptons, and photons in the event.
• The number of jets entering the ST sum (NJ).
• The invariant mass of the set of jets entering the ST sum (MJ).
• The number of these jets identified as originating from a b quark (Nb).
• The transverse mass, MT, of the isolated lepton l and the ETmiss system, defined as MT = √(2 · pT(l) · ETmiss · (1 − cos ∆φ)) (a small numerical sketch of this formula is given after this list).
• The number of selected muons (Nµ), i.e. all muons in the event with pT > 20 GeV.
• The invariant mass of this set of muons (Mµ). If Nµ = 0, then Mµ = 0.
• The absolute value of the total transverse momentum of these muons (pµT,TOT).
• The number of selected electrons (Ne), i.e. all electrons in the event with pT > 20 GeV.
• The invariant mass of this set of electrons (Me). If Ne = 0, then Me = 0.
• The absolute value of the total transverse momentum of these electrons (peT,TOT).
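As referenced in the MT item above, this is a small numerical sketch of the transverse-mass formula; the input values are arbitrary and only meant to illustrate the computation.

# MT = sqrt(2 * pT(lepton) * ETmiss * (1 - cos(dphi)))
import math

def transverse_mass(pt_lep, met, dphi):
    """Transverse mass of the lepton + missing-energy system."""
    return math.sqrt(2.0 * pt_lep * met * (1.0 - math.cos(dphi)))

# Toy values: a 35.7 GeV lepton, 45.2 GeV of missing ET, dphi = 2.6 rad.
print(round(transverse_mass(pt_lep=35.7, met=45.2, dphi=2.6), 1))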

The formatting and training of the data were run in a Google Colaboratory notebook [1], using the Pandas [9, 7] and Scikit-learn [8] packages. The VAE is taken from the Pyod package [11] and used for training and testing on the data.


3 Approach

One important aspect left to explain is how the model labels events as 1's (anomalies/outliers) and 0's (inliers). During training we set a parameter called contamination, which states the fraction of outliers in the training data. In our case, since we train the model on SM data, there should be no contamination. This is, however, not possible, because the VAE model from the Pyod package [11] requires us to choose a non-zero contamination value. To explain how this threshold is computed, imagine that we have 1000 events in our training set and apply a contamination of 0.01. This implies that the set will contain 10 outliers (#outliers = #events × contamination), so to find the testing-set threshold for outliers we look at the training scores and find the 10th-highest value, corresponding to the last outlier:

scores = [0.254, 0.364, ..., 9.756, ..., 268.275]

If 9.756 is the score of the 10th outlier, then this value defines the threshold in the training set. To get the overall threshold for the test set, we normalize it, i.e. we take 9.756 and divide it by the maximum value of the scores array, which gives an overall threshold of 0.0364. Any normalized test score above this number is considered an outlier.
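The snippet below is a hedged sketch of this threshold logic as described above, not of Pyod's internal code; the scores are random stand-ins for the VAE reconstruction scores, and normalising the test scores by the training maximum is one possible reading of the text.

# Threshold from contamination: with 1000 training events and
# contamination = 0.01, the 10th-highest training score is the last outlier,
# and the threshold is that score divided by the maximum training score.
import numpy as np

rng = np.random.default_rng(0)
train_scores = rng.exponential(scale=5.0, size=1000)    # stand-in for VAE scores
contamination = 0.01

n_outliers = int(len(train_scores) * contamination)     # 10 outliers
sorted_scores = np.sort(train_scores)                   # ascending order
raw_threshold = sorted_scores[-n_outliers]              # score of the last outlier
norm_threshold = raw_threshold / sorted_scores.max()    # e.g. 9.756 / 268.275 = 0.0364

# Any test score above the threshold (after the same normalisation) is an outlier.
test_scores = rng.exponential(scale=5.0, size=200)
labels = (test_scores / sorted_scores.max() > norm_threshold).astype(int)
print(n_outliers, round(norm_threshold, 4), labels.sum())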

We divide our testing into three parts. The first step is to train the VAE using only half of the SM single top data set, by splitting it into the even-numbered events and the corresponding odd-numbered events. Since N_events,single top = 156 896, both the training set and the testing set have N_events = 78 448 in this phase. In this step we expect to see no anomalies, i.e. all scores below the threshold, given that we are testing on SM data. We also use this test to study the configuration of the following parameters: the number of times the algorithm runs through the entire training set (epochs) and the number of neurons (a neuron being a mathematical function that sums weighted inputs and passes the result through a non-linear function), by testing the VAE on the other half of the SM data set.
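A hedged sketch of this first step is given below: an even/odd split of the single top sample and a Pyod VAE with [5, 5] hidden neurons trained for 20 epochs. The input file name is hypothetical, the contamination value is an illustrative assumption (the report does not state the exact setting), and the parameter names follow the Pyod versions available at the time of the report, so they may differ in newer releases.

# Even/odd split of the SM single top sample and a Pyod VAE sketch.
import numpy as np
from pyod.models.vae import VAE

hlf = np.load("single_top_hlf.npy")    # hypothetical file: (156896, 12) array of HF

train = hlf[0::2]                      # even-numbered events -> training set
test = hlf[1::2]                       # odd-numbered events  -> testing set

vae = VAE(encoder_neurons=[5, 5],
          decoder_neurons=[5, 5],
          epochs=20,
          contamination=0.01,          # Pyod requires a non-zero value; illustrative
          verbose=0)
vae.fit(train)

train_scores = vae.decision_scores_    # reconstruction-based outlier scores
test_scores = vae.decision_function(test)
test_labels = vae.predict(test)        # 1 = anomaly, 0 = inlier
print(test_labels.mean())              # fraction of events flagged as anomalies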

After comparing and choosing the best configuration, we will then proceed to train the machine on the entirety of the SM data and test it on the BSM gluino data set. Here we expect the model to predict only anomalies, i.e. all the scores being higher than the threshold.

Given that a real data sample would contain a mix of SM events and possible BSM events, we also want to study how the algorithm behaves when given a mixed test set. For this, we create a training set with half of the SM single top data, as in the first step; in this case, however, the test set is a mix of SM single top and BSM gluino data.


4 Results

4.1 Anomaly-free testing

In this stage, as stated above, we compare different configurations of epochs and number of neurons during the training and testing process. First we train the VAE on half of the SM data, choosing [5, 5] neurons and 100 epochs. While observing the training process, we noticed that after the 20th epoch both the loss and the validation loss (val_loss) barely fluctuated. We therefore compared the accuracy obtained with 100 epochs and with 20 epochs. Table 1 and Fig. 2 show the results of the training & testing and, as expected, the accuracy is unchanged. To save time and computing power we chose 20 epochs.

SM            Accuracy   Loss      Val_loss
epochs: 100   0.9903     11.1005   11.5783
epochs: 20    0.9903     12.2009    9.7880

Table 1: Accuracy, loss and validation loss of training the VAE with 100 epochs and 20 epochs.


Figure 2: Scores of SM training (blue) vs SM testing (red) with a threshold of 0.03627 (green). Right: distributions using epochs = 20; Left: distributions using epochs = 100. The x axis was normalized using the min-max method, while the counts were normalized using weights = 1/counts.

After setting the number of epochs to 20, we then compared the neuron configurations presented in Table 2 and Fig. 3. Since the number of features used is 12, we first computed the training using [12, 12] neurons and then with only [5, 5].

SM                   Accuracy   Loss      Val_loss
neurons = [5, 5]     0.9903     11.0591   10.4264
neurons = [12, 12]   0.9903     11.9049   11.0697

Table 2: Accuracy, loss and validation loss of training the VAE with the [5, 5] and [12, 12] neuron configurations.


Figure 3: SM training scores (blue) vs SM testing scores (red) with a threshold of 0.03627 (green). Left: scores using [5, 5] neurons; Right: scores using [12, 12] neurons, both computed at epochs = 20.

Since the accuracy was unchanged in both cases, we conclude that for these data sets there is no relevant difference between the two neuron configurations. For the upcoming tests we therefore use 20 epochs and the [5, 5] neuron configuration.

To obtain more information about the number of misidentified events, we computed the confusion matrix using Scikit-learn [8], which can be interpreted as shown in Table 3. This matrix allows us to visualize the performance of the algorithm: each row represents the predicted outliers/inliers, while each column gives the true value of the data.

Confusion Matrix        True value
Prediction              0      1
0                       TP     TN
1                       FN     FP

Table 3: Representation of a confusion matrix. TP (True Positives), FP (False Positives), FN (False Negatives) and TN (True Negatives).
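The confusion matrices in this report could be produced with Scikit-learn as mentioned above; a toy example is sketched below. Note that scikit-learn's default layout puts the true value on the rows and the prediction on the columns, i.e. the transpose of the tables shown here, and the labels in this example are invented for illustration.

# Toy confusion matrix with Scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # toy truth: 4 inliers, 4 outliers
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1])   # toy VAE predictions

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[3 1]    row 0: true inliers  -> 3 predicted 0, 1 predicted 1
#  [1 3]]   row 1: true outliers -> 1 predicted 0, 3 predicted 1
print(cm.T)  # transpose to match the rows = prediction layout of the tables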


4.2 Testing on BSM data sample

Now we apply our model to identify anomalies within the gluino BSM data set. The VAE is trained using the entirety of the SM single top data set, with the parameters chosen above. As previously mentioned, we expect the test prediction to be an array consisting of all 1's (anomalies), since all events from the BSM sample should be identified as such. Fig. 4 shows the representation of the training and testing, with an accuracy of 0.6912.

Figure 4: Scores of SM training (blue) vs BSM testing (red) with a threshold of 0.02816 (green).

From the training histogram we can see most of the data lying below the threshold, corresponding to SM data, as expected. The testing data show a small peak below the threshold, which, as seen in the corresponding confusion matrix in Table 5, amounts to 197 misidentified events. The low efficiency in this case might be due to the small amount of BSM data, as well as to training on only one kind of SM data.


Confusion Matrix        True value
Prediction              0      1
0                       0      197
1                       0      441

Table 5: Confusion matrix corresponding to the testing prediction on the BSM gluino data set, where the expected values are only ones, i.e. anomalies.

4.3 Testing on mixed data sample

Given that real data will be a mix of SM and BSM events, we want to observe how the algorithm behaves when testing on a mix of single top and gluino data. To do this, we create a testing set composed of half of the single top data and the whole of the gluino data, where the first 639 rows correspond to the BSM events. We use the other half of the SM data to train the model. After testing, we compare our prediction array to an array of the same length composed of 639 ones followed by zeros. Fig. 5 shows the training and testing results, with an accuracy of 0.9879.

Figure 5: Scores of SM training (blue) vs BSM+SM testing (red) with a threshold of 0.03627 (green).

From Fig. 5 we can see that the threshold correctly classifies most of the SM events in the testing data, as well as some of the BSM events (small red peak above the threshold). These score distributions are closer to what we would expect from real data. Looking at the confusion matrix in Table 6, we notice that the same numbers of SM and BSM events were wrongly identified as in the previous cases.

Confusion Matrix        True value
Prediction              0        1
0                       77 685   197
1                       763      441

Table 6: Confusion matrix corresponding to the testing prediction on the mixed data set, where the expected values are ones for the BSM events and zeros for the SM events.

For this test we were able to compute the Receiver Operating Characteristic (ROC) curve, which represents the diagnostic ability of our VAE to classify anomalies. The curve is created by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) for different threshold values. These quantities are computed using Eqs. (1) and (2), with the values obtained from the confusion matrices above.

TPR = TP / (TP + FN)    (1)

FPR = FP / (FP + TN)    (2)

For the other steps this was not possible, since for the first test we had FP = TN = 0 and for the second test TP = FN = 0; the ROC curve can only be computed when the true-value array contains a mix of outliers and inliers. In Fig. 6 we can observe that the curve lies above the diagonal TPR = FPR, which is a sign of a good classifier. The closer this curve gets to a step function, where all values of FPR correspond to TPR = 1, the better the classifier.
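A hedged sketch of how such a ROC curve could be obtained with Scikit-learn from the continuous VAE scores is given below. The scores here are random stand-ins, the class sizes mimic the mixed test set, and label 1 (anomaly) is taken as the positive class; this is a conventional choice for illustration rather than the exact convention of Eqs. (1) and (2).

# ROC curve from continuous outlier scores with Scikit-learn.
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)

# Toy mixed test set: 638 BSM events (label 1) and 78448 SM events (label 0),
# with BSM events tending to receive larger outlier scores.
y_true = np.concatenate([np.ones(638), np.zeros(78448)])
scores = np.concatenate([rng.normal(2.0, 1.0, 638), rng.normal(0.0, 1.0, 78448)])

fpr, tpr, thresholds = roc_curve(y_true, scores)   # TPR vs FPR over many thresholds
print(round(auc(fpr, tpr), 3))                     # area under the ROC curve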

Figure 6: ROC curve for the testing on a mixed data sample.

5 Conclusion

We study the possibility of using a deep learning algorithm (a VAE) for detecting possible BSM events in real collider data when trained on an SM data sample. Given that this was a small project and a simple test, we only used 2 simulated data samples: the SM single top data set and the BSM gluino data set. The value of this method lies in the fact that it is model independent, which may encompass BSM physics that is missed or discarded by model-specific searches.

The output of the method is a list of anomalous events that could be used by particle physicists to study possible new physics processes and then to perform model-specific searches on the data with higher efficiency. The authors of Ref. [3] also propose releasing the anomalous events as a catalog, to motivate open studies across the scientific community.

We began this experiment by training and testing on the SM data, which led us to choose a better configuration for the neurons and the epochs. While we expected no anomalies, we found that ∼1% of the SM events were identified as BSM events. We then used the whole SM single top data sample to train the algorithm and tested it on the BSM gluino data. In this case, since we were testing on only BSM events, we expected 100% anomalies, but only obtained ∼69% outliers. For the final part of this project, we trained the machine on half of the SM data sample and tested it on a mix of the gluino data and the other half of the single top data. Here we obtained an accuracy of 98.8%, but the confusion matrix shows that the algorithm wrongly identifies the same numbers of SM and BSM events as in the previous tests. This might be due to having worked with only 2 data sets, single top (SM) and gluino (BSM), but also to using fewer features, some of which might be crucial to properly identify events. Given more time, it would be possible to identify which events were wrongly classified and search for similarities between them, to understand why the algorithm misidentifies them.

For the future of this project, we advise experimenting on various data sets, both SM and BSM, as was done in Ref. [3], as well as building the VAE from scratch instead of relying on a pre-built model like the Pyod VAE, which gives less freedom to experiment with and control the algorithm. If accomplished, we could then start applying the VAE to real data and begin studying the events that the algorithm identifies as BSM.


References

[1] Ekaba Bisong. “Google Colaboratory”. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. Apress, 2019, pp. 59–64. doi: 10.1007/978-1-4842-4470-8_7. url: https://doi.org/10.1007/978-1-4842-4470-8_7.

[2] G. Brooijmans et al. “Les Houches 2019 Physics at TeV Colliders: New Physics Working Group Report”. In: PhysTeV Les Houches. arXiv: 2002.12220 [hep-ph].

[3] Olmo Cerri et al. “Variational Autoencoders for New Physics Mining at the Large Hadron Collider”. In: JHEP 05 (2019), p. 036. doi: 10.1007/JHEP05(2019)036. arXiv: 1811.10276 [hep-ex].

[4] Carl Doersch. Tutorial on Variational Autoencoders. 2021. arXiv: 1606.05908 [stat.ML].

[5] Diederik P Kingma and Max Welling. Auto-Encoding Variational Bayes. 2014. arXiv: 1312.6114 [stat.ML].

[6] Diederik P. Kingma and Max Welling. “An Introduction to Variational Autoencoders”. In: CoRR abs/1906.02691 (2019). arXiv: 1906.02691. url: http://arxiv.org/abs/1906.02691.

[7] Wes McKinney. “Data Structures for Statistical Computing in Python”. In: Proceedings of the 9th Python in Science Conference. Ed. by Stéfan van der Walt and Jarrod Millman. 2010, pp. 56–61. doi: 10.25080/Majora-92bf1922-00a.

[8] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[9] The pandas development team. pandas-dev/pandas: Pandas. Version latest. Feb. 2020. doi: 10.5281/zenodo.3509134. url: https://doi.org/10.5281/zenodo.3509134.


List of Figures

1  Schematic representation of the VAE algorithm. [10]
2  Scores of SM training (blue) vs SM testing (red) with a threshold of 0.03627 (green). Right: distributions using epochs = 20; Left: distributions using epochs = 100. The x axis was normalized using the min-max method, while the counts were normalized using weights = 1/counts.
3  SM training scores (blue) vs SM testing scores (red) with a threshold of 0.03627 (green). Left: scores using [5, 5] neurons; Right: scores using [12, 12] neurons, both computed at epochs = 20.
4  Scores of SM training (blue) vs BSM testing (red) with a threshold of 0.02816 (green).
5  Scores of SM training (blue) vs BSM+SM testing (red) with a threshold of 0.03627 (green).
6  ROC curve for the testing on a mixed data sample.

List of Tables

1  Accuracy, loss and validation loss of training the VAE with 100 epochs and 20 epochs.
2  Accuracy, loss and validation loss of training the VAE with the [5, 5] and [12, 12] neuron configurations.
3  Representation of a confusion matrix. TP (True Positives), FP (False Positives), FN (False Negatives) and TN (True Negatives).
4  Confusion matrix corresponding to the testing prediction on the SM data set, where the expected values are only zeros, i.e. no anomalies.
5  Confusion matrix corresponding to the testing prediction on the BSM gluino data set, where the expected values are only ones, i.e. anomalies.
6  Confusion matrix corresponding to the testing prediction on the mixed data set of SM and BSM events.
