Application of the Boosted Decision Tree Algorithm to Waveform Discrimination

Joakim Sjunnebo

sjunnebo@kth.se

SA104X Degree Project in Engineering Physics, First Level

Supervisor: Elena Moretti

Department of Physics, School of Engineering Sciences, KTH Royal Institute of Technology

Stockholm, Sweden

May 21, 2013

Abstract

The Polarised Gamma-ray Observer (PoGOLite) is a balloon-borne experiment aimed at measuring the polarisation of hard X-rays from astronomical sources. In the planned flight environment the neutron background is high. A smaller version of PoGOLite, named PoGOLino, was constructed with the goal of measuring the neutron background rates and was launched in March 2013.

The signals produced in the detectors of both these instruments give rise to waveforms of different shapes depending on the type of detector the interaction occurred in.

A method to distinguish between signal and background waveforms based on their shape has been developed. This was done using a machine learning algorithm called boosted decision trees, implemented in the software package Toolkit for Multivariate Data Analysis (TMVA). By constructing new discriminating variables the classification efficiency was improved.

The developed classification will be applied to the measurements taken during the 2013 flight of PoGOLino and the method can also be used for the data analysis of future PoGOLite measurements.

Chapter 1 Introduction

Measurements of electromagnetic radiation are essential to observational astrophysics.

High-energy photons in the X-ray and gamma-ray band can originate from several types of sources including neutron stars, pulsars and black hole binaries. There are several competing theories describing the creation and emission of this radiation. A photon flux can be characterised by several measurable parameters including energy, intensity, time variation, polarisation angle and polarisation degree. The first three of these have been measured extensively for many sources and over wide energy ranges. These measurements agree with several of the existing theories. Polarimetry, on the other hand, is still a relatively unexplored field, especially in the high-energy spectrum. Since the competing theories make different predictions about the polarisation, measuring the polarisation parameters would help discriminate between them.

1.1 PoGOLite

The Polarised Gamma-ray Observer (PoGOLite) [1] is a balloon-borne polarimeter seeking to measure the polarisation of hard X-rays and soft gamma rays in the 25-80 keV energy range from celestial objects such as those mentioned above. The primary and secondary targets are the Crab pulsar and the black hole Cygnus X-1, respectively. Since cosmic radiation is absorbed in the atmosphere, a high float altitude of ∼40 km is required. In this environment the amount of background radiation, primarily in the form of neutrons and charged particles, is high.

In preparation for PoGOLite a smaller version, the PoGOLite "Pathfinder" [1], was constructed. It was launched in July 2011, but due to a tear in the balloon the flight had to be terminated after approximately 7 hours. No scientific measurements were made; however, the functionality of the instruments and the background radiation were studied.

1.2 PoGOLino

The primary source of background for PoGOLite is neutrons [2]. No previous measurements have been made of the neutron energy spectrum or flux for the planned flight latitude and altitude of PoGOLite. In order to further understand the background radiation an even smaller version of PoGOLite, named PoGOLino [3], was constructed.

Its purpose was to measure the neutron background radiation at the flight latitude and altitude of PoGOLite. It was launched in March 2013 and performed more extensive measurements of the background radiation. Knowledge drawn from these measurements will be used for the flight preparation of the planned 2013 summer flight of PoGOLite.

1.3 Objective

The detection system in PoGOLite and PoGOLino consists of several different types of detectors. One type of detector is responsible for signal detection while the others are responsible for active and passive background reduction. A detector hit by a particle absorbs the incoming energy and scintillates, i.e. re-emits the energy in the form of light.

The material used in the signal detectors has a different scintillation decay time than the material used in the background detectors. This difference is reflected in output signals with different rise times. Consequently, determining if an event is to be classified as a signal event or a background event comes down to discriminating waveforms based on their rise time.

The primary goal of this project is to develop and implement a method to effectively discriminate events based on the shape of their waveform. This will be done using a machine learning algorithm called boosted decision trees [4], which is known to be suitable for these kinds of problems. The secondary goal is to apply the method to the data taken during the 2013 flight of PoGOLino. This will provide a better understanding of the background environment in which PoGOLite will operate and will be used for the preparation of its maiden flight. The method developed in this project can also be used for the data analysis of the PoGOLite measurements.

1.4 Outline

Chapter 2 starts out with the scientific goals for PoGOLino and continues with an overview of the atmospheric particle environment in which it operates. In chapter 3 a detailed description of the PoGOLino instrument and its constituents will be presented along with the outcome of the 2013 flight. Chapter 4 begins with a description and categorisation of the waveforms and thereafter focuses on describing the methods and tools used for the waveform discrimination. In chapter 5 the discriminating variables will be presented along with the testing results. This chapter will end with an application of the training to the 2013 flight data. In chapter 6 the effect of the discriminating variables on the training will be discussed and the results of the application to the 2013 flight data will be analysed. Finally, some conclusions are given in chapter 7.

Chapter 2 Atmospheric particle environment

2.1 Scientific goals

Simulations of the expected background for PoGOLite have shown that the dominant background is composed of neutrons [2], and a number of questions concerning the background radiation need to be answered before PoGOLite can be launched. Since no previous measurements have been made of the neutron energy spectrum for the flight environment of PoGOLite, i.e. 40 km altitude and 68° latitude, current preparations have been forced to rely on simulations. The neutron flux was, however, measured during the 2011 flight of the PoGOLite pathfinder. This flux is known to depend on the solar activity, which has a period of 11 years. At the time of the 2011 flight the solar activity was closer to its minimum; it has since been increasing quickly towards its maximum, and the neutron flux will therefore differ for the 2013 flight.

The polarimeter in PoGOLite is surrounded by a thick shield of polyethylene to reduce the background radiation. This shield shifts the energy spectrum towards lower energies and therefore simulations of background rates are highly dependent on how neutron transportation through polyethylene is modelled.

The primary scientific goal of PoGOLino is to measure the current neutron background [3]. By doing this the neutron energy spectrum for the planned flight altitude and latitude can be better understood and the current atmospheric particle simulations can be verified. Furthermore, by comparing with the 2011 flight data the neutron flux variation as a result of the solar activity can be measured. Lastly, by performing the measurements with one of the detectors unshielded and the rest shielded by polyethylene the simulations of neutron transportation through polyethylene can be verified.

The rest of this chapter will provide an overview of atmospheric particles and more specifically neutron background, together with the above-mentioned dependencies on altitude, latitude and solar activity.

2.2 Atmospheric particles

2.2.1 Cosmic rays and air showers

Cosmic rays are high-energy particles whose origin was up until recently unknown but which are now believed to originate from supernovae [5]. They consist of a mixture of particles including protons, α-particles, electrons and nuclei of heavier elements, and their energies span a very large range, from approximately 10⁸ eV to over 10²⁰ eV. Cosmic rays entering the atmosphere from space are called primary cosmic rays. These interact with the atmosphere and decay into what are called secondary cosmic rays. The secondary particles may then interact with the atmosphere and decay themselves, producing new particles with lower energy that repeat the process until the energy is too low to produce new particles and they are instead absorbed. This is called an air shower, and an example can be seen in figure 2.1. The secondary particles produced in the shower include neutrons, pions, positrons, muons and photons.

Figure 2.1: An example of an air shower. The direction of the shower is mainly in the direction of the momentum of the primary particle and it can grow to several kilometers both in the longitudinal and lateral directions. Taken from [6].

2.2.2 Altitude effects

The cosmic ray flux is known to be dependent on the altitude. An example can be seen in figure 2.2. At a certain altitude the cosmic ray flux has a distinct maximum called the Pfotzer peak [7]. In the figure this occurs at about 20 km whereafter the flux decreases to become stable from approximately 45 km. As will be discussed later in this chapter, the altitude of the Pfotzer peak is dependent on latitude and therefore the profile for PoGOLite flight latitudes will differ from the profile in the figure, which corresponds to a relatively low latitude.

The existence of an altitude dependence is explained by the air shower products.

The altitude of the first interaction of an air shower is dependent on the energy of the incoming primary particle, which in turn is related to the number of produced secondary particles. The flux will be very low at low altitudes since only very few particles have enough energy to reach those altitudes without being absorbed. Since the atmospheric density decreases exponentially with increasing altitude, it is nearly constant at very high altitudes, which explains the stable flux there. Between these extremes the flux reaches a maximum, a consequence of the maximum in the primary particle energy spectrum.

Figure 2.2: The measured counting rate as a function of atmospheric pressure (solid line) and altitude (dashed line). Taken from [7].

During the 2011 flight of the PoGOLite pathfinder the total event rate was measured by counting all the events in one of the detector cells, with a lower energy threshold set to reject noise. The result as a function of altitude can be seen in figure 2.3, which shows a Pfotzer peak at a higher altitude than in figure 2.2.

Figure 2.3: The counting rate as a function of altitude measured during the 2011 flight of the PoGOLite pathfinder. Taken from [2].

2.2.3 Solar modulation

The Sun continuously releases a stream of ionised particles, mostly protons and electrons, which is called the solar wind. The solar wind creates a "bubble" of charged particles surrounding the entire Solar System called the heliosphere. Incoming cosmic rays are modulated to lower energies and the least energetic are even deflected by the heliosphere, causing the cosmic ray flux to be affected. It has been observed that the solar activity, measured in number of observable sunspots and having a period of 11 years, is related to the amount of solar wind and therefore to the cosmic ray flux on Earth. This is an anti-correlation since when the solar activity and hence the amount of solar wind is high, the cosmic ray energy spectrum is more modulated and the flux on Earth lower.

2.2.4 Geomagnetic effects

Earth’s magnetic field is approximately the field of a magnetic dipole. The magnetic poles, whose positions vary with time, differ from the geographical poles. Since it is the geomagnetic field that is the cause of the flux effect, the relevant latitude is the magnetic latitude. Charged low-energy primary particles will be redirected by the magnetic field towards the closest magnetic pole. For a primary particle to enter the atmosphere it must have a higher energy than the cut-off rigidity, which as a consequence of the redirection of low-energy particles will be higher near the magnetic equator than at higher magnetic latitudes, see figure 2.4. This has the effect that the flux of low-energy particles is higher at higher magnetic latitudes. This in turn affects the altitude dependence of the cosmic ray flux, since a larger number of low-energy particles results in more high-altitude air showers and therefore a shift of the Pfotzer peak towards higher altitudes.

Figure 2.4: World map showing the cut-off rigidity in GeV. Taken from [8].

2.3 Neutron background

Due to the high planned flight altitude and latitude of PoGOLite the amount of cosmic ray background is expected to be high. The design is chosen so that as many of the cosmic rays as possible are rejected passively or actively. Two types of particle, gamma rays and neutrons, are still able to fake an event, that is, to produce a signal similar to that of an event triggered by a photon originating from the observed source. Simulations have shown that the gamma-ray rate is low but that the neutron rate is of the same order as the signal rate, see figure 2.5.

Figure 2.5: Simulated background rates at an altitude of approximately 38.6 km, divided into neutron, gamma-ray and total contribution compared to expected signal rates from a Crab and a mCrab source at zenith. Taken from [9].

Since the gamma-ray contribution is expected to be much lower than the neutron contribution only the neutron background is considered in this thesis.

2.3.1 Atmospheric neutron creation

Atmospheric neutrons are created when cosmic rays interact with the atmosphere. Low-energy cosmic rays mainly produce neutrons through the excitation and subsequent evaporation of atmospheric nuclei. The direction of neutrons created through this process is relatively uniform, as a result of the primary particles having a low energy. High-energy cosmic rays interact with the atmosphere through knock-on reactions or charge-exchanging events. Due to the high energy of the primary particles the direction of the neutrons created through these processes is anisotropic. The direction of the neutrons is altitude-dependent and as a result of backscattering the majority of particles will come from below at PoGOLite flight altitudes [2].

2.3.2 Neutron interaction with matter

Neutrons can interact with matter through three main processes: elastic scattering, inelastic scattering and absorption. These processes are illustrated in figure 2.6. Of these processes, elastic scattering in the fast scintillators (see chapter 3 for a description) is expected to be the most significant part of the PoGOLite background [2].

The effects on atmospheric particles mentioned in section 2.2 also apply to neutrons so that the neutron flux is dependent on altitude, latitude as well as solar activity.

Figure 2.6: Illustration of elastic scattering, inelastic scattering and absorption. Taken from [10].

Chapter 3 PoGOLino

3.1 Detector description

Scintillation light is the light produced when a particle travels through and ionises a luminescent material. Materials exhibiting this property are called scintillators. In PoGOLino neutrons are measured in two different ways, both of which are based on scintillation [3].

In the first way an incoming neutron is scattered off a nucleus in a plastic scintillator. The nucleus gains momentum, travels through the plastic and ionises it, causing scintillation light to be produced. The main source of background for the polarisation measurements in PoGOLite is neutron scattering in the plastic scintillators [3], depositing energies of the same order as the photons. As mentioned in section 1.1, the PoGOLite photon energy range is ∼10-100 keV. Since the scattered nucleus deposits approximately 10 % of its energy in the scintillator, the relevant nucleus energies are ∼0.1-1 MeV which in turn are produced upon scattering by ∼0.2-100 MeV neutrons.

A specific property of these plastic scintillators is the short scintillation light decay time of 2 ns. Due to this property they will be referred to as the "fast scintillators". One 5 mm thick plate of the fast scintillator is sandwiched between two 40 mm Bi₄Ge₃O₁₂ (BGO) crystals to form a Phoswich Detector Cell (PDC), where Phoswich stands for phosphor sandwich. The BGO crystals are used to reject gamma and charged particle background and have a longer scintillation decay time of 300 ns.

The second way of measuring neutrons also uses a PDC but with the fast scintillator replaced by a LiCaAlF₆ disk of the same size [3]. LiCaAlF₆ is a scintillator with a high cross section for thermal neutron capture [11]. The isotope produced by neutron capture will quickly decay through alpha emission which produces scintillation light, resulting in a distinct line in the measured energy spectrum. Figure 3.1 shows a picture of the LiCaAlF₆ disks and BGO crystals.

Figure 3.1: BGO crystals and LiCaAlF₆ disks. The four larger crystals at the top of the picture are BGOs and the two thinner hexagonal disks are LiCaAlF₆.

3.2 Read out

Each PDC is read out by a single photomultiplier tube [3] which multiplies the incoming scintillation photons. The signal from each photomultiplier tube reaches a central analog-to-digital converter (ADC) where it is digitised with a sampling rate of 37.5 MHz. When an event is registered, 50 sampled ADC values surrounding the trigger are stored to make up an event waveform. Determining in which material the interaction took place is done using waveform discrimination, which is described in chapter 4.

3.3 Instrument set up

A schematic overview of the PoGOLino instrument can be seen in figure 3.2. The instrument contains three separate PDCs: two with LiCaAlF₆ and one with a fast scintillator. One of the LiCaAlF₆ PDCs and the fast scintillator PDC are placed inside a shield of polyethylene, whereas the last LiCaAlF₆ PDC is unshielded. The reason for this arrangement is that the PoGOLite polarimeter is shielded by polyethylene to reduce background. By performing measurements with one LiCaAlF₆ PDC shielded and one unshielded, simulations of neutron transportation through the polyethylene can be verified. Furthermore, by shielding the fast scintillator PDC the expected neutron rates at the fast scintillators in PoGOLite can be measured, and the neutron interaction with the plastic scintillator can be tested since the measurements from the shielded LiCaAlF₆ PDC can be used for comparison [3].

The three PDCs, the shield, the photomultiplier tubes and the read out electronics are all placed inside a sealed aluminium pressure vessel.

Figure 3.2: Payload design of PoGOLino. The radius of the vessel is 17 cm and the height 50 cm. The maximum weight is 10 kg. Taken from [3].

3.4 2013 flight

PoGOLino was launched from Esrange in Kiruna at 17:17 on the 20th of March 2013 and the maximum altitude reached was approximately 30.5 km. At 20:15 the balloon was cut from the payload, after which the payload descended by parachute. The complete payload survived the landing near Muonio on the Swedish-Finnish border. The detailed flight data is still being analysed.

Chapter 4 Waveform discrimination

4.1 Waveforms

A waveform consists of the 50 sampled ADC values surrounding an event trigger. The difference in scintillation decay times of the detector components of the PDC is reflected in how the waveform rise times differ. The fast scintillator gives a short rise time of ∼0.1 µs whereas the BGO crystal gives a longer rise time of ∼0.3 µs and the LiCaAlF₆ gives an even longer rise time. This difference makes it possible to distinguish between the events.
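The rise times above can be related to sample points with a short calculation. The sketch below assumes only the 37.5 MHz sampling rate and the 50-sample waveform length quoted in section 3.2; the function and constant names are illustrative and not taken from the analysis code.

```python
SAMPLING_RATE_HZ = 37.5e6   # ADC sampling rate quoted in section 3.2
SAMPLES_PER_WAVEFORM = 50   # stored samples per event


def sample_period_us():
    """Time between two consecutive samples, in microseconds."""
    return 1e6 / SAMPLING_RATE_HZ


def rise_time_in_samples(rise_time_us):
    """Number of sample periods spanned by a given rise time."""
    return rise_time_us / sample_period_us()


if __name__ == "__main__":
    print(f"sample period: {sample_period_us() * 1e3:.1f} ns")
    print(f"waveform length: {SAMPLES_PER_WAVEFORM * sample_period_us():.2f} us")
    for label, rise_us in (("fast scintillator", 0.1), ("BGO", 0.3)):
        print(f"{label}: {rise_time_in_samples(rise_us):.2f} samples")
```

Under these assumptions a fast rise spans roughly four samples and a BGO rise roughly eleven, while the full waveform covers about 1.3 µs.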

Waveforms from the three scintillators are called fast (fast scintillator), middle (BGO) and slow (LiCaAlF₆) events with respect to their characteristic rise times.

In this thesis only events originating from the BGO-fast scintillator-BGO PDC are analysed and thus no LiCaAlF₆ events will be considered. As a result of this, BGO events will be called slow instead of middle for convenience. Furthermore, events originating from the fast scintillator will be called signal events and all other events will be called background events. Figure 4.1 depicts a pure signal event and three typical background events.

4.2 TMVA

ROOT is a C++ based framework for handling and analysing large amounts of data efficiently and was originally developed at CERN [12]. The Toolkit for Multivariate Data Analysis (TMVA) is a package that provides a machine learning environment for ROOT [4]. This environment is used for multivariate classification and regression using various methods included in the package. The methods all belong to the family of supervised learning algorithms. The task of these algorithms is to find a mapping function from a set of input data with known desired output. Once this function is determined it can be used to predict the correct output of any input. In the case of waveform discrimination the input would consist of two sets of waveforms; one categorised as signal events and one as background events. The act of determining the mapping function is generally called the training phase, in the sense that the user trains the program to distinguish between signal and background events. The training phase is the first of three phases associated with TMVA operation, the other two being the testing phase and the application phase.

(a) A pure fast event triggered by an interaction in a fast scintillator.

(b) A pure slow event triggered by an interaction in one of the BGO crystals.

(c) A mixed event. First it rises quickly, like a fast event, but then it continues to rise more slowly for some time, like a slow event. This kind of event is first triggered by an interaction in the fast scintillator and then in one of the BGO crystals.

(d) A chaotic event. Some events exhibit a more chaotic behaviour like this waveform. They are characterised by many sharp peaks and large portions of negative slope. However, the specific shapes differ between the events.

Figure 4.1: Four typical waveform shapes, each shown as ADC value versus sample point. Of these only (a) is characterised as a signal event; (b)-(d) are all background events.

During the training phase the user provides the signal and background sample and specifies the input variables used to describe the waveforms. The user can set individual weights for the events and choose which multivariate analysis methods to use. Each method has a set of options the user can configure to optimise the output.

During the testing phase the performance of the training is evaluated. TMVA automatically carries out the testing phase by performing tests on a subset of the provided sample, which is kept separate from the training sample and thus independent from it.

The results are summarised in plots that are conveniently accessed using a graphical user interface and a more detailed testing output is saved in a log file.
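The independence of the test sample can be illustrated with a simple hold-out split. This is a generic sketch and not TMVA's own splitting code; the function name `split_sample`, the 50 % test fraction and the fixed random seed are assumptions made for the example.

```python
import random


def split_sample(events, test_fraction=0.5, seed=0):
    """Shuffle the events and return disjoint (training, testing) lists."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled events are held out for testing only.
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    training, testing = split_sample(range(10))
    print("training:", training)
    print("testing: ", testing)
```

Because no event appears in both lists, a classifier that merely memorises the training events gains nothing on the test events.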

Figure 4.2 shows three examples of the test result plots. The background rejection versus signal efficiency plot (figure 4.2a) shows how contaminated by background a classified sample is for a given signal efficiency. Ideally the background rejection should be 1 for all signal efficiencies, corresponding to a straight line along the top of the plot.
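How a single point on such a curve is obtained can be sketched as follows. The example assumes that a larger classifier response means a more signal-like event and that an event is accepted when its response is at or above a threshold; the function name and the toy response values are invented for illustration.

```python
def efficiency_and_rejection(signal_scores, background_scores, threshold):
    """Return (signal efficiency, background rejection) for one cut value.

    An event is accepted as signal when its score is >= threshold.
    """
    sig_pass = sum(1 for s in signal_scores if s >= threshold)
    bkg_pass = sum(1 for b in background_scores if b >= threshold)
    signal_efficiency = sig_pass / len(signal_scores)
    background_rejection = 1.0 - bkg_pass / len(background_scores)
    return signal_efficiency, background_rejection


if __name__ == "__main__":
    # Toy classifier responses, not measured data.
    signal = [0.9, 0.8, 0.6, 0.4]
    background = [0.5, 0.3, 0.2, 0.1]
    for cut in (0.0, 0.45, 0.55, 1.0):
        print(cut, efficiency_and_rejection(signal, background, cut))
```

Scanning the threshold over the full range of responses traces out the whole curve, from accepting everything (efficiency 1, rejection 0) to rejecting everything (efficiency 0, rejection 1).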

The classifier response plot (figure 4.2b) shows the separation of the signal from background events after the classification. A separation value can be assigned to these kinds of histograms which would be 0 for a total overlap and 1 for no overlap. Ideally there should be no overlap between the signal and background histogram. This plot has the training and testing sample superimposed and shows two values called the Kolmogorov-Smirnov test values for the signal and background, respectively. A Kolmogorov-Smirnov test is a test for the equality of two distributions. In this case it is the equality of the training and testing classifier output histograms that is being tested. The Kolmogorov-Smirnov test value represents how well the trained classifier describes the test sample.

If this value is low the classifier is overtrained, which means that it is biased for the particular training sample and does not describe the testing sample well.
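The quantity underlying a Kolmogorov-Smirnov comparison is the largest distance between the two cumulative distributions. The sketch below computes that two-sample distance for unbinned values. It is an illustration only: the test value discussed above behaves like a probability (a low value signals disagreement), whereas this sketch returns the raw distance, for which a large value signals disagreement. The function name is an assumption.

```python
def ks_distance(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    largest = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every entry equal to x so that ties are handled together.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        largest = max(largest, abs(i / len(a) - j / len(b)))
    return largest


if __name__ == "__main__":
    print(ks_distance([1, 2, 3], [1, 2, 3]))        # identical samples
    print(ks_distance([1, 2, 3, 4], [3, 4, 5, 6]))  # partially overlapping
```

Applied to the classifier responses of the training and testing samples, a small distance would indicate that the classifier behaves the same way on both.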

(a) Signal efficiency curve. Here 70 % of the background events would be rejected and the rest accepted if all signal events were to be accepted.

(b) Classifier response histogram. The filled distributions are for the test sample and the dotted distributions are for the training sample. The overlap area can be accessed in the log file as the anticorrelated separation value. Note the Kolmogorov-Smirnov test values.

(c) Variable separation histogram.

Figure 4.2: Example of plots provided after the test phase.

Figure 4.2c shows a separation histogram for a variable. A separation value is defined for these plots analogously to that for the classifier response histograms.
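A measure with the limiting values described in this section (0 for total overlap, 1 for no overlap) is the separation defined in the TMVA Users Guide, ⟨S²⟩ = ½ ∫ (ŷ_S(y) − ŷ_B(y))² / (ŷ_S(y) + ŷ_B(y)) dy, where ŷ_S and ŷ_B are the normalised signal and background distributions of the quantity y. The binned estimate below assumes that definition; the number of bins and all names are illustrative choices.

```python
def separation(signal_values, background_values, n_bins=20):
    """Binned estimate of 0.5 * sum (s - b)^2 / (s + b), where s and b are
    the fractions of signal and background events falling in each bin."""
    lo = min(min(signal_values), min(background_values))
    hi = max(max(signal_values), max(background_values))
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values coincide

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
        return [c / len(values) for c in counts]

    sig = fractions(signal_values)
    bkg = fractions(background_values)
    return 0.5 * sum((s - b) ** 2 / (s + b)
                     for s, b in zip(sig, bkg) if s + b > 0)


if __name__ == "__main__":
    print(separation([1, 2, 3], [1, 2, 3]))        # total overlap
    print(separation([0.0, 0.1], [0.9, 1.0]))      # no overlap
```

The same function can be applied either to a single discriminating variable or to the classifier response, which mirrors the two uses of the separation value in the text.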

In the log file the discriminating variables are ranked according to two different qualities, namely separation value and importance. The variable importance describes the number of times a variable was used to discriminate during the training.

The training phase results in an XML file that describes the classification. This file can then be used to classify uncategorised waveforms during the final application phase.

4.3 Boosted decision trees

A decision tree is a structure consisting of a root node and a number of branch and leaf nodes (see figure 4.3). The root node and each branch node are associated with a binary question involving one of the discriminating variables. The tree is traversed from the root to one of the leaves with the path being determined by the answer to the question associated with each subsequent node. The leaves represent an event being either signal or background.

Figure 4.3: A schematic overview of a decision tree. The tree is traversed starting from the root node where the value of x_i in relation to c₁ determines the next node to read. This continues until a leaf node is reached and the event is classified as either signal or background.
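The traversal described above can be sketched in a few lines. The `Node` class, the direction of the comparison (an event goes to the `below` child when its variable is smaller than the cut) and the example cut values are all assumptions made for illustration; they do not reproduce a tree from the actual training.

```python
class Node:
    """Branch node (variable index and cut value) or leaf (label only)."""

    def __init__(self, label=None, variable=None, cut=None,
                 below=None, above=None):
        self.label = label        # "signal"/"background" for a leaf, else None
        self.variable = variable  # index of the variable tested at this node
        self.cut = cut            # cut value c in the question x[variable] < c
        self.below = below        # child followed when x[variable] < cut
        self.above = above        # child followed otherwise


def classify(node, event):
    """Walk from the root to a leaf and return the leaf label."""
    while node.label is None:
        node = node.below if event[node.variable] < node.cut else node.above
    return node.label


# Hypothetical two-level tree with made-up cuts on two variables.
tree = Node(variable=0, cut=5.0,
            below=Node(label="background"),
            above=Node(variable=1, cut=2.0,
                       below=Node(label="signal"),
                       above=Node(label="background")))

if __name__ == "__main__":
    print(classify(tree, [6.0, 1.0]))
```

Each event follows exactly one path, so the cost of classifying an event grows with the depth of the tree rather than with the total number of nodes.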

A tree is grown starting from the root node where the optimal split is calculated for each variable and the split that gives the best separation with respect to signal and background is applied to the data set. This procedure is repeated recursively on each subset until the size of the subset reaches a minimum value and a leaf is created. The type of the leaf is determined by the majority rule of the events that end up in the final subset.
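One way to make "best separation" concrete is the Gini index p(1 − p), where p is the signal fraction in a subset. This criterion is an assumption for the sketch, since the text does not state which one was used. For a single variable, the example scans candidate cuts and keeps the one that most reduces the event-weighted Gini index; repeating this for every variable and picking the overall best would give the split applied at a node.

```python
def gini(n_signal, n_background):
    """Gini index p * (1 - p), scaled by the number of events in the node."""
    total = n_signal + n_background
    if total == 0:
        return 0.0
    p = n_signal / total
    return total * p * (1.0 - p)


def best_cut(values, is_signal):
    """Return (cut, gain) maximising the Gini decrease for one variable.

    Events with value < cut go to one daughter node, the rest to the other.
    """
    n_sig = sum(is_signal)
    n_bkg = len(values) - n_sig
    parent = gini(n_sig, n_bkg)
    best = (None, 0.0)
    for cut in sorted(set(values))[1:]:
        left_sig = sum(1 for v, s in zip(values, is_signal) if v < cut and s)
        left_bkg = sum(1 for v, s in zip(values, is_signal) if v < cut and not s)
        gain = (parent - gini(left_sig, left_bkg)
                - gini(n_sig - left_sig, n_bkg - left_bkg))
        if gain > best[1]:
            best = (cut, gain)
    return best


if __name__ == "__main__":
    # Toy values of one variable: the two signal events have the lowest values.
    print(best_cut([1, 2, 3, 4], [True, True, False, False]))
```

A gain of zero for every cut means the variable cannot improve the purity of the node, in which case `best_cut` returns `(None, 0.0)`.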

To boost the classification performance and increase the stability with respect to statistical fluctuations in the training sample, multiple trees are grown to form a forest.

The events are given different weights for each tree resulting in different splits being made. During the application phase each event traverses every tree in the forest and is classified according to the majority vote of all trees.
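The reweighting and voting steps can be sketched as follows. The weight update shown is the AdaBoost scheme, in which events misclassified by the latest tree have their weights multiplied by (1 − err)/err; whether this particular scheme was used here is an assumption, as the text only states that the weights differ from tree to tree. The vote is the plain majority vote described above, and all function names are illustrative.

```python
def reweight(weights, misclassified):
    """AdaBoost-style update: boost the weights of misclassified events.

    weights       -- current event weights
    misclassified -- booleans, True where the latest tree was wrong
    Assumes the weighted error rate lies strictly between 0 and 0.5.
    """
    total = sum(weights)
    error = sum(w for w, m in zip(weights, misclassified) if m) / total
    alpha = (1.0 - error) / error
    boosted = [w * alpha if m else w for w, m in zip(weights, misclassified)]
    scale = total / sum(boosted)  # keep the sum of weights unchanged
    return [w * scale for w in boosted]


def majority_vote(votes):
    """Classify as signal if more than half of the trees vote signal."""
    n_signal = sum(1 for v in votes if v == "signal")
    return "signal" if 2 * n_signal > len(votes) else "background"


if __name__ == "__main__":
    print(reweight([1.0, 1.0, 1.0, 1.0], [True, False, False, False]))
    print(majority_vote(["signal", "signal", "background"]))
```

After the update the misclassified events carry half of the total weight, so the next tree is pushed towards splits that classify them correctly.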

4.4 Data selection

As noted in section 4.2, supervised learning algorithms require categorised events for training. Consequently, one set of signal events and one set of background events had to be manually selected. The data was collected by exposing a BGO-fast scintillator-BGO PDC to three different sources (Cs-137, Co-60 and Na-22) in a laboratory. The sources have different typical energies (662 keV for Cs-137, 1173 keV and 1332 keV for Co-60, and 511 keV and 1275 keV for Na-22) and therefore different ADC values. Using all three ensures that the classification is based solely on the shape of the waveform and not on the energy, which can vary.

It was known that providing the algorithm with a large training set benefits the classification. Furthermore, the required size of the training set for an arbitrarily good classification grows with the number of input variables [13]. With 50 variables prior to any variable reduction or addition, a set of ∼10⁴ events was deemed sufficient. By inspecting each waveform visually and categorising it as either signal or background, 3999 signal events and 8301 background events were selected and used in the analysis. To prevent bias from any of the sources an equal number of events was selected from each source, both for signal and background.

Chapter 5 Waveform analysis and results

This chapter contains the training and testing results. Section 5.1 presents the results obtained using only the 50 ADC values, without the addition of new variables. With the goal of improving these results new variables were created. These are described and defined in section 5.2. The classifier performance is evaluated for different numbers of variables in section 5.3. In section 5.4 the results obtained with the new variables are presented. Finally, the results of applying this classification to the PoGOLino 2013 flight data are presented in section 5.5.

5.1 Output of test phase without constructed variables

The first training was performed by adding each ADC value of the waveform as a separate variable. The background rejection versus signal efficiency curve is shown in figure 5.1.

Figure 5.1: Background rejection versus signal efficiency curve for the training using only the 50 ADC values as variables.

At a background rejection of 0.99 the signal efficiency was 0.698 and at 0.9 it was 0.982. The background rejection integral, which represents the total background rejection, was equal to 0.988. Shown in figure 5.2 is the classifier response histogram, which had a separation value of 0.839 and Kolmogorov-Smirnov test values of 0.724 and 0.998 for signal and background, respectively. The somewhat low Kolmogorov-Smirnov test value for the signal suggests that there is a risk of the classifier being overtrained. For a more detailed explanation of the output values and figures, see section 4.2 and figure 4.2.

Figure 5.2: Boosted decision tree response histogram for the training using only the 50 ADC values as variables. Notice the Kolmogorov-Smirnov test values in the upper-right part of the figure.

5.2 Description of variables

With the goal of improving the test results achieved in section 5.1, new discriminating variables were constructed. A large factor determining the quality of the training is the separation of the variables. Therefore, variables with a large separation value were chosen to be included in the new variable set. In total seven new variables were selected.

Their separation values and ranking are summarised in table 5.1. When comparing these separation values with the highest separation value of the 50 ADC value variables, which was 0.3022, it is evident that some of the new variables have a lower separation value than this. However, they still proved to benefit the training, something which will be discussed in chapter 6. The rest of this section contains a brief description and definition of each new variable.


Table 5.1: The new variables along with their respective ranking and separation value. For comparison the variable ranked as number six was the ADC value of sample point number seven with a separation value of 0.3022.

Variable                  Ranking   Separation value
s_max                     1         0.8037
slope_{max−2,max−1}       2         0.6910
ratio_{fast,slow}         3         0.6758
∆s_{max,fast}             4         0.4343
slope_{fast,max}          5         0.3713
σ_slope_{fast,fast+4}     7         0.2933
µ_slope_{fast,fast+4}     8         0.2838
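To make the separation values in table 5.1 concrete, the sketch below assumes they follow the definition in the TMVA Users Guide [15]: half the integral of (ŷ_S − ŷ_B)² / (ŷ_S + ŷ_B) over the variable, where ŷ_S and ŷ_B are the normalised signal and background distributions. The binning, the function name `separation` and the toy inputs are assumptions for illustration and are not taken from the thesis.

```python
def separation(sig_values, bkg_values, n_bins=20):
    """Binned estimate of the separation between two samples.

    Returns 0 for identical distributions and 1 for distributions
    that do not overlap in any bin.
    """
    lo = min(min(sig_values), min(bkg_values))
    hi = max(max(sig_values), max(bkg_values))
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range

    def fractions(values):
        counts = [0] * n_bins
        for x in values:
            # The largest value is placed in the last bin.
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    s = fractions(sig_values)
    b = fractions(bkg_values)
    return 0.5 * sum((si - bi) ** 2 / (si + bi)
                     for si, bi in zip(s, b) if si + bi > 0)


# Toy values, invented for illustration only.
print(separation([0.0, 0.1, 0.2], [0.8, 0.9, 1.0], n_bins=2))
```

A binned estimate of this kind depends on the chosen number of bins, so it should only be compared between variables that are binned in the same way.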

The rise time difference explained in section 4.1 can be exploited by calculating the maximum difference between ADC values separated by four and fifteen sample points [14].

The variable separated by four sample points is called the fast output and the other one the slow output. Quantitatively, the definition is:

Fast output: \max_{1 \le i \le 46} \bigl( v(s_i + 4) - v(s_i) \bigr)    (5.1)

and

Slow output: \max_{1 \le i \le 35} \bigl( v(s_i + 15) - v(s_i) \bigr),    (5.2)

where s_i is the i-th sample point and v(s_i) is the corresponding ADC value. Each waveform has 50 sample points, so i ∈ [1, 50]; the upper limits on i in the equations ensure that s_i + 4 and s_i + 15 stay within the waveform. The variables are illustrated for both a signal and a background waveform in figure 5.3.

As the figures suggest, the fast and slow output are approximately equal for signal events whereas the slow output will be much larger for background events. The validity of this hypothesis is confirmed if the fast and slow output are plotted in a two-dimensional histogram, as in figure 5.4. Two distinct branches emerge, one corresponding to signal events and one to background events.
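Equations 5.1 and 5.2 can be sketched in a few lines of Python. This is an illustration only and not the code used for the thesis. The helper names `max_rise` and `fast_slow_outputs` and the toy waveforms are assumptions, and indices are 0-based, so i = 1 in the equations corresponds to index 0 here.

```python
def max_rise(adc, lag):
    """Return (largest v(s_i + lag) - v(s_i), 0-based index i where it occurs)."""
    diffs = [adc[i + lag] - adc[i] for i in range(len(adc) - lag)]
    best_i = max(range(len(diffs)), key=diffs.__getitem__)  # first maximum on ties
    return diffs[best_i], best_i


def fast_slow_outputs(adc):
    """Fast output (lag 4), slow output (lag 15) and the index s_fast."""
    fast, s_fast = max_rise(adc, 4)
    slow, _ = max_rise(adc, 15)
    return fast, slow, s_fast


# Toy 50-sample waveforms, invented for illustration only.
sharp_rise = [0] * 10 + [10, 20, 30, 40] + [40] * 36   # rises within 4 samples
slow_ramp = list(range(50))                             # rises steadily

print(fast_slow_outputs(sharp_rise))
print(fast_slow_outputs(slow_ramp))
```

For the sharply rising toy waveform the two outputs coincide, whereas for the steady ramp the slow output is larger, mirroring the behaviour described for signal and background events.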


(a) Signal event: fast and slow output are about equal.

(b) Background event: slow output is greater than the fast.

Figure 5.3: Geometrical representation of the fast and slow output for a signal and a background event.

Figure 5.4: Two-dimensional histogram showing the fast and slow output variables for a mixed sample of events. Each point corresponds to one event. The two distinct branches correspond to background events (slow branch) and signal events (fast branch). Taken from [14].

The discriminating variable ratio_{fast,slow} used in the training is defined as the ratio of the fast output to the slow output. Combining the two variables into one minimises the total number of variables used in the training while conserving the waveform information, in line with the restriction on the number of variables described in section 4.4. The separation values for the fast and slow output variables were 0.2161 and 0.1927, respectively, whereas table 5.1 shows that ratio_{fast,slow} has a larger separation value of 0.6758.

In signal waveforms the global maximum occurs at approximately the same sample point, whereas it typically occurs a few sample points later for mixed events and even later for slow events, as in figure 4.1. This led to the construction of s_max, which is the sample point of the global maximum and is formally defined as

s_{max} = \arg\max_{1 \le i \le 50} v(s_i).    (5.3)

As can be seen in table 5.1, s_max has the largest separation value out of all variables. In figure 5.5 the position of s_max is indicated in both a signal and a background waveform.

(a) Signal event (b) Background event

Figure 5.5: An arrow from the maximum ADC value down to the corresponding sample point marks s_max.

The main difference between the fast and mixed waveforms is the slope right before the global maximum. To distinguish between these two similar waveform shapes, slope_{max−2,max−1} was constructed. It is the slope between the sample points two and one before the global maximum and is defined as

slope_{max-2,max-1} = v(s_{max} - 1) - v(s_{max} - 2).    (5.4)

The separation value for this variable was the second highest. In figure 5.6 the slope is illustrated in a signal and a background waveform.
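The position of the global maximum (equation 5.3) and the slope just before it (equation 5.4) can be sketched as follows. The function names, the 0-based indexing and the choice of the first maximum on ties are assumptions. The thesis does not say how a maximum within the first two sample points is handled, so the sketch simply raises an error in that case.

```python
def s_max(adc):
    """0-based index of the global maximum (first occurrence on ties)."""
    return max(range(len(adc)), key=adc.__getitem__)


def slope_before_max(adc):
    """v(s_max - 1) - v(s_max - 2), as in equation 5.4."""
    m = s_max(adc)
    if m < 2:
        # Assumption: this case is not covered by the source.
        raise ValueError("global maximum too close to the start of the waveform")
    return adc[m - 1] - adc[m - 2]


# Short toy waveform, invented for illustration only.
toy = [0, 1, 3, 7, 5, 2]
print(s_max(toy), slope_before_max(toy))
```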

Figure 5.6: The arrows indicate slope_{max−2,max−1}.

The rise time can be represented by the sample point difference between the global maximum and the fast output. This variable, called ∆s_{max,fast}, typically varies between signal and background waveforms and is quantitatively defined as

\Delta s_{max,fast} = s_{max} - s_{fast},    (5.5)

where s_{fast} is the first sample point of the fast output (the s_i that maximises the difference in equation 5.1). Figure 5.7 illustrates the sample point difference in a signal and a background waveform.

Figure 5.7: The left line marks s_{fast} and the right s_max. The arrow indicates the sample point difference ∆s_{max,fast}.

A variable that describes the slope of the entire rising part of the waveform was believed to efficiently discriminate signal from background. slope_{fast,max} is the slope between the fast output and the global maximum and is defined as

slope_{fast,max} = \begin{cases} \dfrac{v(s_{max}) - v(s_{fast})}{s_{max} - s_{fast}} & \text{if } s_{max} \neq s_{fast} \\ 0 & \text{if } s_{max} = s_{fast} \end{cases}    (5.6)

A geometrical representation of the slope is shown in figure 5.8.
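Equations 5.5 and 5.6 can be evaluated together once s_{fast} and s_max are known. The sketch below is self-contained and illustrative only. The function name `rise_variables`, the 0-based indexing and the toy waveform are assumptions.

```python
def rise_variables(adc, lag=4):
    """Return (delta_s, slope) following equations 5.5 and 5.6."""
    n = len(adc)
    # First sample point of the fast output (equation 5.1).
    s_fast = max(range(n - lag), key=lambda i: adc[i + lag] - adc[i])
    # Sample point of the global maximum (equation 5.3).
    s_max = max(range(n), key=adc.__getitem__)
    delta_s = s_max - s_fast
    # Equation 5.6 defines the slope as zero when both points coincide.
    slope = (adc[s_max] - adc[s_fast]) / delta_s if delta_s != 0 else 0.0
    return delta_s, slope


# Toy 50-sample waveform, invented for illustration only.
toy = [0] * 10 + [10, 20, 30, 40] + [35] * 36
print(rise_variables(toy))
```

The explicit zero branch avoids a division by zero for waveforms in which the fast output starts at the global maximum, for example a completely flat waveform.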

Figure 5.8: slope_{fast,max} as indicated by the arrow between the fast output sample point and the global maximum.

A slope variable similar to slope_{fast,max} is µ_slope_{fast,fast+4}, which is the mean slope between the fast output and the four following sample points and is defined as

\mu_{slope_{fast,fast+4}} = \frac{1}{4} \sum_{i = s_{fast}+1}^{s_{fast}+4} \frac{v(s_i) - v(s_{fast})}{s_i - s_{fast}}.    (5.7)


The slopes that are averaged are drawn as arrows in figure 5.9.

Figure 5.9: Geometrical representation of the slopes that are averaged in µ_slope_{fast,fast+4} and whose standard deviation is σ_slope_{fast,fast+4}.

The slopes from a fixed point to the successive sample points of the rising part of the waveform vary greatly for slow events and are more constant for signal-like events. This is exploited by constructing σ_slope_{fast,fast+4}, which is the standard deviation of the slopes between the fast output and the four following sample points and is defined as

\sigma_{slope_{fast,fast+4}} = \sqrt{ \frac{1}{4} \sum_{i = s_{fast}+1}^{s_{fast}+4} \left( \frac{v(s_i) - v(s_{fast})}{s_i - s_{fast}} - \mu_{slope_{fast,fast+4}} \right)^2 }    (5.8)

The standard deviation is calculated for the same slopes that µ_slope_{fast,fast+4} averages; these slopes are shown in figure 5.9.
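Equations 5.7 and 5.8 amount to the mean and the population standard deviation of four slopes measured from the same starting point. The following is a minimal sketch, with the function name `slope_mean_and_std`, the 0-based index `s_fast` and the toy input being assumptions.

```python
from math import sqrt


def slope_mean_and_std(adc, s_fast, n_points=4):
    """Mean (eq. 5.7) and standard deviation (eq. 5.8) of the slopes from
    sample s_fast to each of the next n_points samples."""
    slopes = [(adc[s_fast + k] - adc[s_fast]) / k for k in range(1, n_points + 1)]
    mean = sum(slopes) / n_points
    # Population standard deviation: divide by n_points, as in equation 5.8.
    std = sqrt(sum((s - mean) ** 2 for s in slopes) / n_points)
    return mean, std


# Toy values, invented for illustration only: a quadratic rise.
print(slope_mean_and_std([0, 1, 4, 9, 16], s_fast=0))
```

For a perfectly linear rise all four slopes are equal, so the standard deviation is zero, whereas a curved rise such as the toy input above gives a non-zero value.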

5.3 Output of test phase for different number of variables

As mentioned in section 4.4, the number of variables affects the classifier performance.

The dependence of classifier performance on the number of variables was investigated by performing the training for different numbers of variables and noting the output. This was done for between 7 and 57 variables, where 57 corresponds to using all 50 ADC value variables and the 7 constructed variables. The variable selection was based on the variable importance ranking. The investigated values were the signal efficiencies at background rejections of 0.9 and 0.99 and the Kolmogorov-Smirnov test values. The results can be seen in figure 5.10, where these values are plotted as a function of the number of variables used for training. The Kolmogorov-Smirnov test values vary greatly, whereas the signal efficiency at a background rejection of 0.99 is nearly constant. Based on the results in this plot it was decided that 48 was the optimal number of variables to use for the training, since it gave acceptable Kolmogorov-Smirnov values for both signal and background as well as a high signal efficiency at a background rejection of 0.9.


Figure 5.10: Classifier performance values as a function of the number of variables used for training. Based on this plot, 48 was considered the optimal number of variables. The efficiencies and Kolmogorov-Smirnov test values corresponding to 48 variables are indicated by the circles.

5.4 Output of test phase with constructed variables

Based on the results obtained in section 5.3 the new variable set used for the training consisted of the 7 new variables together with the 41 highest ranking ADC value variables.

The background rejection versus signal efficiency curve is shown in figure 5.11.

At a background rejection of 0.99 the signal efficiency was 0.826 and at 0.9 it was 0.985.

The integral was equal to 0.991. The classifier response histogram is shown in figure 5.12.

The separation value was 0.862 and the Kolmogorov-Smirnov test values 0.988 and 0.920 for signal and background, respectively. Both Kolmogorov-Smirnov test values are quite large, over 0.9, which suggests that the risk of the classifier being overtrained is small.

For a more detailed explanation of the output values and figures, see section 4.2 and figure 4.2.


Figure 5.11: Background rejection versus signal efficiency curve for the new variable set.

Figure 5.12: Boosted decision tree response histogram for the new variable set. Notice the Kolmogorov-Smirnov test values in the upper-right part of the figure.

5.5 Application to the 2013 flight

The developed classification was applied to the measurements taken during the 2013 flight of PoGOLino. Figure 5.13 shows the event rate, as measured by the shielded fast scintillator PDC, for two different background rejections as a function of altitude during the flight. In both cases the event rate is low, which results in quite large error intervals. However, a general trend can be observed, with a maximum event rate at approximately 20 km.
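The link between a low event rate and large error intervals can be illustrated with a short calculation. The source does not state how the error intervals in figure 5.13 were obtained; the sketch below assumes plain Poisson counting statistics, for which the uncertainty on N counted events is √N. The function name `rate_with_error` and the numbers are hypothetical.

```python
from math import sqrt


def rate_with_error(counts, live_time_s):
    """Event rate and its Poisson uncertainty, both in counts/s.

    Assumption: the only uncertainty is the statistical sqrt(N) on the
    number of counted events.
    """
    rate = counts / live_time_s
    error = sqrt(counts) / live_time_s
    return rate, error


# Hypothetical numbers: 25 events accepted in a 100 s altitude bin.
rate, error = rate_with_error(25, 100.0)
print(rate, error, error / rate)
```

Under this assumption the relative uncertainty is 1/√N, so an altitude bin with only a few tens of accepted events carries an uncertainty of tens of percent.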


[Plot for figure 5.13, "PoGOLino event rate": event rate (counts/s, 0 to 0.25) versus altitude (km, 5 to 35), with one curve for background rejection = 0.86 and one for background rejection = 0.9.]

Figure 5.13: The discriminated PoGOLino flight data showing the event rate as a function of altitude. The two different curves correspond to different background rejections in the classification.


Chapter 6 Discussion

6.1 Predicting the training performance

Two of the new variables had a lower separation value than one of the 50 ADC value variables (see table 5.1), but still proved to benefit the training in the form of a higher signal efficiency. Furthermore, the separation ranking and importance ranking differed for most variables with deviations of up to seven ranking places. Figure 5.10 also showed that the Kolmogorov-Smirnov test values varied greatly with the number of variables used for training. All this indicates that a more complex relationship exists between the separation, importance, number of variables, Kolmogorov-Smirnov test values and outcome of training. However, an investigation of this relationship is beyond the scope of this thesis.

6.2 Improvements using the new variable set

When comparing the test output of the training using only the 50 ADC value variables with that of the new variable set, a number of improvements are evident. In relative terms, the signal efficiency was 18.3 % and 0.3 % higher at background rejections of 0.99 and 0.9, respectively. The background rejection integral increased by 0.3 % and the classifier response separation value increased by 2.7 %. Both Kolmogorov-Smirnov test values were over 0.9, which suggests that the risk of the classifier being overtrained is small.
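The percentages above are relative changes between the values reported in sections 5.1 and 5.4. As a check, they can be reproduced with the short calculation below; the function name `relative_gain_percent` and the dictionary layout are chosen here for illustration.

```python
def relative_gain_percent(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


# (value with 50 ADC variables, value with the new variable set),
# taken from sections 5.1 and 5.4.
results = {
    "signal efficiency at background rejection 0.99": (0.698, 0.826),
    "signal efficiency at background rejection 0.9": (0.982, 0.985),
    "background rejection integral": (0.988, 0.991),
    "classifier response separation": (0.839, 0.862),
}

for name, (old, new) in results.items():
    print(f"{name}: {relative_gain_percent(old, new):.1f} %")
```

The largest gain is thus at the tightest working point, a background rejection of 0.99, while the other quantities were already close to their maximum before the new variables were added.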

The effect the new variables had on the classification was not as large as believed a priori. However, the training using only the 50 ADC value variables performed better than expected, so there was less room for improvement than believed beforehand. This supports the view of boosted decision trees as an algorithm that yields good results "out of the box", without the need for much tuning [15].


6.3 PoGOLino flight data compared to another measurement

The event rate achieved when discriminating the PoGOLino data was quite low (figure 5.13). This is mainly explained by the fact that the fast scintillator has a very low efficiency for neutron capture. This plot can be compared with a corresponding event rate versus altitude plot measured by the shielded LiCaAlF₆ PDC, shown as the red curve in figure 6.1. Since LiCaAlF₆ has a much higher efficiency for neutron capture, the event rate is much higher than in figure 5.13. However, the general trend in both plots seems to be in agreement. Both plots suggest that the maximum event rate occurs at an altitude of approximately 15 to 20 km, which was expected from simulations [2].

Figure 6.1: Red curve: event rate measured by the LiCaAlF₆ PDC as a function of altitude. These findings are yet to be published. Courtesy of the PoGOLino collaboration.


Chapter 7 Conclusions

PoGOLite is a balloon-borne polarimeter aimed at measuring the polarisation of hard X-rays from point sources. At the high planned flight altitude and latitude of PoGOLite the amount of neutron background is high. To study this background more extensively, PoGOLino was constructed with the goal of measuring the neutron background rates.

The signals produced by both PoGOLite and PoGOLino result in waveforms. There are different types of detectors in the instruments with different purposes. The events are distinguished based on the type of detector the interaction occurred in, and the different detectors give rise to differently shaped waveforms.

This thesis has developed a method to discriminate between signal and background waveforms based on their shapes using a machine learning algorithm called boosted decision trees implemented in the TMVA framework. It was found that the discrimination performed by the algorithm without adjusting any parameters or variables was already quite efficient. However, by constructing new variables that better described the waveform, the discrimination performance was further improved. Most notably, the signal efficiency at a background rejection of 0.99 increased by 18.3 %.

This study has not investigated adjustments of the finer algorithm parameters such as tree pruning preferences and forest sizes. Furthermore, the relationship between variable separation, variable importance, Kolmogorov-Smirnov test values, number of variables and efficiency of classification was found to be complex and was not further examined.

A further study of these aspects might lead to an even more efficient classification.

Since the method involves a human factor, in the sense that the training sample has to be constructed manually, there is a risk of a systematic error if the person selecting the training sample does not know exactly what a signal waveform looks like. Furthermore, if the person is inattentive at this stage, the categorisation might contain errors. One way to avoid these problems would be to gather the waveforms by exposing a single fast scintillator for the signal and a single BGO for the background. This data collection method can also be used to verify and validate the trained trees by applying the classification to these waveforms.

The developed classification was applied to the measurements taken during the 2013 flight of PoGOLino. The results show a maximum event rate at an altitude of approximately 20 km, which was expected from simulations. Further simulations will be performed to validate the response of the fast scintillator for neutron detection.

In the future, the method developed in this thesis can also be used for the data analysis of PoGOLite measurements.


Bibliography

[1] PoGOLite homepage: http://www.particle.kth.se/pogolite/, accessed 2013-04-28.

[2] M. Kole, “PoGOLite: 2011 flight results and 2012 pre-flight predictions”, Licentiate thesis, Royal Institute of Technology, Stockholm, 2012.

[3] M. Kole, “PoGOLino”, Department of Physics internal report, Royal Institute of Technology, Stockholm, 2013.

[4] TMVA homepage: http://tmva.sourceforge.net/, accessed 2013-04-28.

[5] M. Ackermann et al., “Detection of the Characteristic Pion-Decay Signature in Supernova Remnants”, Science, vol.339, no.6121, pp.807-811, 2013.

[6] P. Grieder, “Extensive Air Showers: High Energy Phenomena and Astrophysical Aspects. A Tutorial, Reference Manual and Data Book”, Heidelberg: Springer, 2010.

[7] G. Pfotzer, “Dreifachkoinzidenzen der Ultrastrahlung aus vertikaler Richtung in der Stratosphäre”, Zeitschrift für Physik, vol.102, no.1-2, pp.23-58, 1936.

[8] D. Smart, M. Shea, “Fifty years of progress in geomagnetic cutoff rigidity determinations”, Advances in Space Research, vol.44, no.10, pp.1107-1123, 2009.

[9] C. Bettolo, “Performance Studies and Star Tracking for PoGOLite”, Doctoral thesis, Royal Institute of Technology, Stockholm, 2010.

[10] J. Hong, “Development of neutron shields for gamma-ray telescopes in space and observation of galactic center sources by a balloon-borne gamma-ray telescope, GRATIS”, PhD thesis, Columbia University, 2002.

[11] H. Takahashi et al., “A Thermal-Neutron Detector with a Phoswich System of LiCaAlF6 and BGO Crystal Scintillators onboard PoGOLite”, Nuclear Science Symposium Conference Record (NSS/MIC), IEEE, pp.32-37, 2010.

[12] ROOT homepage: http://root.cern.ch/, accessed 2013-04-28.


[13] V.N. Vapnik, “An overview of statistical learning theory”, IEEE Transactions on Neural Networks, vol.10, no.5, pp.988-999, 1999.

[14] M. Kiss, “Pre-flight development of the PoGOLite Pathfinder”, Doctoral thesis, Royal Institute of Technology, Stockholm, 2011.

[15] A. Hoecker et al., “TMVA 4 Users Guide”, CERN, Switzerland, 2009.
