

IT 19 036

Degree Project 30 hp, July 2019

Classification of Seismic Body Wave Phases Using Supervised Learning

Gunnar Atli Eggertsson

Department of Information Technology


Faculty of Science and Technology, UTH unit

Visiting address:
Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0

Postal address:
Box 536, 751 21 Uppsala

Telephone:
018 – 471 30 03

Fax:
018 – 471 30 00

Web page:
http://www.teknat.uu.se/student

Abstract

Classification of Seismic Body Wave Phases Using Supervised Learning

Gunnar Atli Eggertsson

The task of accurately distinguishing between arrivals of different types of seismic waves is a common and important task within the field of seismology. For data generated by seismic stations operated by SNSN, this task generally requires manual effort. In this thesis, two automatic classification models which distinguish between two types of body waves, P- and S-waves, are implemented and compared, with the aim of reducing the need for manual input. The algorithms are logistic regression and a feed-forward artificial neural network. The applied methods use labelled historical data from seismological events in Sweden to train a set of classifiers, with a unique classifier associated with each seismic station. When evaluated on test data, the logistic regression classifiers achieve a mean accuracy of approximately 96% over all stations, compared to approximately 98% for the neural network classifiers. The results suggest that both implemented classifiers represent a good option for automatic body wave classification in Sweden.

Examiner: Jarmo Rantakokko
Subject reader: Kristiaan Pelckmans
Supervisor: Björn Lund

(4)
(5)

Popular Scientific Summary (translated from Swedish)

Seismology is the branch of the Earth sciences that uses ground motions, caused by the propagation of seismic waves through the Earth, to study both earthquake sources and the Earth's internal structure. The ground motions are recorded at seismic measuring stations, and seismic data are usually represented in the form of a seismogram: a time series showing amplitude as a function of time.

Seismic waves are divided into two main categories: body waves and surface waves. Body waves are further subdivided into P (pressure) waves and S (shear) waves. By using information from the two types of body waves, in particular the difference in their arrival times at several seismic stations, a seismic source can be located in time and space.

Seismic activity in Sweden is monitored by the Swedish National Seismic Network (SNSN), part of the Department of Earth Sciences at Uppsala University. SNSN currently operates more than 65 permanent seismic stations in Sweden. The data acquisition system used at SNSN is designed in such a way that when a seismic station detects an incoming energy transient, data about the transient is transmitted from the station to a central computer in Uppsala in a compact structure called a phase log. Each phase log contains a number of variables that characterize the detected transient.

Identifying the different kinds of seismic waves is a common and important task within the field of seismology. For data produced by the seismic stations operated by SNSN, this task generally requires manual effort and is therefore both time-consuming and expensive. It is consequently in SNSN's interest to develop automated methods that can make the identification more accurate.

In this thesis, two automatic classification models are implemented and compared for each seismic station operated by SNSN. The models are logistic regression and neural networks, and they use phase logs as input data to distinguish between P- and S-waves.

Phase logs from historical seismic events in Sweden are used to train the models.

When evaluated on test data, the logistic regression models achieve a mean accuracy of approximately 96% over all stations, compared to approximately 98% for the neural network models. The results suggest that both models represent a good option for automatic body wave classification in Sweden.


Acknowledgements

I would like to give special thanks to my supervisor, Björn Lund, and my co-supervisor, Peter Schmidt, at SNSN, for providing me with the opportunity to do this project, as well as for their invaluable feedback and guidance throughout the project work.

I also want to thank my reviewer, Kristiaan Pelckmans, for his helpful comments and feedback on the project.

Finally, I thank Guðrún Elín Jóhannsdóttir for dedicated proofreading and feedback on the report's setup and overall structure.


Contents

1 Introduction
  1.1 Setting
  1.2 Related Work
  1.3 Project Aim and Scope
  1.4 Report Structure
2 Background
  2.1 Earthquakes and Seismology
    2.1.1 Seismicity in Sweden
    2.1.2 SNSN
    2.1.3 Phase Logs
  2.2 Machine Learning
    2.2.1 General
    2.2.2 Supervised Learning
    2.2.3 Classification
    2.2.4 Learning Algorithm Selection
    2.2.5 Logistic Regression
    2.2.6 Gradient Descent
    2.2.7 Feed-Forward Neural Networks
    2.2.8 Feature Selection
    2.2.9 Model Evaluation
    2.2.10 Evaluation Metrics
    2.2.11 Class Imbalance
    2.2.12 Overfitting
    2.2.13 Hyperparameter Tuning
3 Data
  3.1 Creating Labelled Datasets
  3.2 Initial Data Analysis
    3.2.1 Feature Inconsistencies
    3.2.2 Feature Correlation
  3.3 Number of Data Examples
  3.4 Class Imbalance
4 Method
  4.1 Problem Statement
  4.2 Data Collection
  4.3 Learning Algorithm Selection
  4.4 Software Tools
  4.5 Feature Selection
  4.6 Data Cleaning
  4.7 Hyperparameter Tuning
  4.8 Model Construction
  4.9 Under-sampling
  4.10 Evaluation
5 Results
  5.1 Accuracy
  5.2 Precision
  5.3 Recall
  5.4 F1-Score
6 Discussion and Future Work
  6.1 Discussion
  6.2 Future Work
7 Conclusions
8 References
9 Appendix
  9.1 Hyperparameter Optimization
  9.2 Results - Logistic Regression
  9.3 Results - Neural Network


1 Introduction

1.1 Setting

Seismology is a sub-field of the Earth sciences which studies ground motion caused by the propagation of seismic waves through the Earth. Ground motions are recorded at locations known as seismic stations, and data is commonly represented in the form of a seismogram: a time series which depicts amplitude plotted against time [1].

Seismic waves are typically divided into two categories: body waves and surface waves. Body waves are further subdivided into P (pressure) waves and S (shear) waves. For the two types of body waves, information about the difference in their arrival times at several seismic stations can be used to estimate the location of a seismic source in space and time. Further, through the collection of absolute and differential travel times of the different waves found in a seismogram, combined with information about which types of waves are detected and which types are not, structural information about the Earth's interior can be obtained [2].

Seismic activity in Sweden is monitored by the Swedish National Seismic Network (SNSN), a part of the Department of Earth Sciences at Uppsala University [3]. SNSN currently operates over 65 active permanent seismic stations in Sweden. The data acquisition system used by SNSN is constructed in such a way that when a seismic station detects an incoming energy transient, data is transmitted from the station in a compact structure known as a phase log. Each phase log contains various different features which characterize the detected transient. The phase logs are transmitted on a single-station basis, which implies that each phase log is transmitted from a station to a central location irrespective of activity measured on nearby stations, and is thus treated as if it corresponded to a real earthquake [4].

One of the fundamental aspects of seismological data analysis is the "picking" of seismic wave phases. Picking, in this sense, refers to the process of marking the arrival times of seismic wave phases. At SNSN, picking is first performed automatically, as a part of the automatic detection and association processes. To distinguish between arrivals of P- and S-wave phases, the automatic system uses amplitude ratios between the vertical and horizontal components, obtained from phase logs. The accuracy achieved by this type of distinction is limited to around 60%. For this reason, the task of distinguishing between arrivals of P- and S-wave phases is also performed manually at SNSN. It is a time-consuming and expensive procedure, which serves as a motivation for SNSN to develop automatic methods that can make the distinction more accurately. Figure 1 shows an example of a window from the software used to manually pick arrival times of seismic body waves at SNSN.
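The idea of a ratio-based distinction can be illustrated with a minimal sketch. Note that the function, feature names and threshold below are hypothetical illustration values; the actual rule used by the SNSN detector is not documented here:

```python
def classify_by_amplitude_ratio(vertical_amp, horizontal_amp, thres=1.0):
    """Toy illustration of amplitude-ratio phase distinction: P-waves
    tend to produce relatively stronger vertical motion, S-waves
    relatively stronger horizontal motion. The threshold is a made-up
    value, not the one used in the SNSN system."""
    ratio = vertical_amp / horizontal_amp
    return "P" if ratio >= thres else "S"

print(classify_by_amplitude_ratio(3.2, 1.1))  # relatively strong vertical -> "P"
print(classify_by_amplitude_ratio(0.8, 2.5))  # relatively strong horizontal -> "S"
```

A single fixed threshold of this kind is what limits the automatic distinction to around 60% accuracy, which motivates the learned classifiers developed in this thesis.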


Figure 1: Window from the software used to manually pick arrival times of seismic body waves at SNSN.

1.2 Related Work

Böðvarsson et al. (1999) attempt to use an artificial neural network model of a type called Multi-Layer Perceptron (MLP) to improve the accuracy of automatic distinction between P- and S-wave phases. Their data consists of phase logs originating from seismic stations installed in Southern Iceland which use the same data acquisition system as SNSN uses at present. They report correct classification of more than 95% of the total wave phases used in their project [4]. The architecture of the applied MLP model consists of fourteen input features, two hidden layers containing 14 and 3 neurons, respectively, and an output layer of one neuron, whose value determines whether a phase log is classified as being generated by a P- or an S-wave.

Kortström et al. (2016) use a supervised learning algorithm termed support vector machines to automatically distinguish between different types of seismic events in Finland, such as earthquakes and blasts, by using differences in the distribution of signal energy between naturally occurring and human-induced seismic activity [5].

1.3 Project Aim and Scope

This project is inspired by the work of Böðvarsson et al. (1999). The high accuracy achieved using an artificial neural network classifier to distinguish between P- and S-wave phases in Iceland has prompted SNSN to attempt similar techniques in Sweden. The data acquisition system used at SNSN today is identical to the one used in Iceland, so the work of Böðvarsson et al. (1999) is expected to serve as a reference for this project.


The primary aim of the project is to apply supervised learning techniques to develop prototype classifiers for each of the seismic stations operated by SNSN. The classifiers will use processed phase information data from the SNSN detector software, in the form of phase logs. The job of the classifiers will be to use a set of input features from a phase log to predict whether the phase log was generated by the propagation of a P-wave or an S-wave. The data used to train the classifiers is historical data which consists of features from phase logs generated by seismic stations in Sweden. The data originates from different types of seismological events, including naturally occurring earthquakes, industry-related blasts and induced seismic events. The labelling of the data is done by matching phase logs with manually assigned wave types (P or S).

The problem is set up as a binary classification problem for each station. Two separate classification algorithms are implemented to perform the task. First, a simple algorithm named logistic regression is applied, and secondly a feed-forward artificial neural network (ANN) classifier, comparable to the one used by Böðvarsson et al. (1999), is applied. The idea behind first implementing logistic regression is to obtain a picture of how well a simple classification algorithm performs on the problem and to evaluate how much accuracy gain comes with a more complicated algorithm like an ANN. The results from the two algorithms are compared to one another, to the phase distinction done by the automatic system at SNSN, and to the results reported by Böðvarsson et al. (1999).

The eventual goal of the project is to create prototype classifiers that can classify previ- ously unseen phase logs as either P- or S-waves, with as high accuracy as possible.

1.4 Report Structure

The report is divided into seven sections, including the introduction. Section 2 provides background information for the project, from the domains of seismology and machine learning. Seismological terms used in the report are defined, and information about SNSN and seismicity in Sweden is provided. The field of machine learning is introduced generally, with emphasis on supervised learning methods and classification problems in particular. The two algorithms implemented in the project, logistic regression and artificial neural networks, are introduced, and the metrics needed for model evaluation are defined.

Section 3 discusses the characteristics of the raw phase log data used in the project. The process of creating labelled datasets is discussed along with discoveries about the data made during initial data analysis.

Section 4 discusses the methods applied in this project, from the formulation of the problem to the metrics used for evaluating the results.


Section 5 presents the project’s results, comparing the achieved evaluation metrics to each other and to the performance of the automatic phase distinction process currently in place at SNSN.

Section 6 discusses the achieved results and provides suggestions for possible future work related to the project.

Section 7 summarizes the project’s results and evaluates its overall success with respect to initial plans.


2 Background

2.1 Earthquakes and Seismology

Earthquakes are natural phenomena that occur when rock in the Earth undergoes frictional sliding or abrupt breaking, releasing energy in the form of elastic waves known as seismic waves. Earthquakes vary widely in size, ranging from being almost completely imperceptible to causing catastrophic disasters accompanied by human casualties [1].

The study of the propagation of seismic waves through the Earth is the domain of earthquake seismology. At present, earthquake seismology makes a case for being the most powerful of all geophysical methods for studying the Earth's interior [6]. This stems from the fact that, out of all waves encountered in geophysics, the physical properties of elastic waves yield the highest resolution of the Earth's internal structure [2].

Seismic waves are elastic waves that expand spherically away from their source and can propagate either along the Earth's surface or within its interior [2]. Seismic waves which propagate through the Earth's interior are known as body waves, whereas those that propagate along the surface are called surface waves [1].

The focus of this project, body waves, are further subdivided into P-waves, compressional waves that cause movement of particles parallel to the wave's propagation direction, and S-waves, shear waves that cause movement of particles perpendicular to the wave's propagation direction [1]. P-waves travel at a higher velocity than S-waves and represent the first detected motion from a source at a measuring station [2].

Ground motions caused by the propagation of seismic waves are recorded by instruments known as seismographs. In general, seismographs are comprised of a measuring device known as a seismometer, which measures ground motion, and a recording device which transmits data onto media from which it can be read, for example a disk [6]. Coupled together with a GPS antenna, these devices make up a traditional seismic station. A common location setting for a seismograph is on bedrock in a remote location, sufficiently far away from urban noise such as traffic [1].

Among the data received from seismic stations are seismograms, which show the motion of the ground as a function of time. Seismograms are the essential data used by seismologists to study the propagation of seismic waves [2]. A typical seismogram has time represented on the horizontal axis and wave amplitude on the vertical axis. The arrival time of a seismic wave is defined as the moment when the wave is first detected at a seismic station [1].


2.1.1 Seismicity in Sweden

From analyzing the distribution of earthquake epicenters around the globe it is evident that most of the Earth's seismicity takes place in fairly narrow zones that correspond to the Earth's plate boundaries. In these zones, relative movements of plates can build up enough energy to cause abrupt breaking of rock or frictional sliding on pre-existing faults. A small portion of earthquakes does, however, also take place in the interiors of plates. These are known as intraplate earthquakes. Earthquakes occurring in Sweden, for example, fall into this category [1].

Sweden is located in Northern Europe and lies far away from the nearest plate boundary. For this reason, seismic activity in Sweden is low. However, due to the relative strength and rigidity of the tectonic plates, earthquake-generating processes do occur in the interior of plates and can lead to earthquakes. This is the case in Sweden. Another source of earthquakes in Sweden is the land's post-glacial rebound, the process in which the land surface re-rises after being pushed down by the last ice age's ice sheet, causing deformation in the vertical and horizontal directions. Finally, activities relating to the Swedish mining industry can generate earthquakes, as blasting activities can weaken underlying rock, making it more susceptible to fracturing [7].

2.1.2 SNSN

The Swedish National Seismic Network (SNSN) is a part of the Department of Earth Sciences at Uppsala University. The network traces its origins back to the year 1904, when the first Nordic seismometer was installed in Uppsala. At present the network consists of over 65 active permanent seismic stations, widely spread over Sweden. Figure 2 shows the locations of the seismic stations operated by SNSN. Each station is connected to a central computer in Uppsala via the mobile network. Automatic analysis of incoming phase log data detects and locates seismic events such as earthquakes and blasts.

The network's primary purpose is to monitor local seismicity and collect data from seismological events originating in Sweden, along with larger events originating outside the country's borders. The data originating from local events in Sweden is, for example, used to study ongoing crustal deformation processes. Data from larger events originating outside Sweden provides the opportunity to enhance knowledge of the structure of the Earth's crust and upper mantle beneath the stations [8].

Statistics for 2018:

In 2018, the total number of seismic events recorded by SNSN in Sweden was:

• 695 earthquakes (≈ 1.9 per day)

• 6769 blasts from mining areas (≈ 18.5 per day)

• 3960 seismic events in vicinity of mining areas (≈ 10.8 per day)


Figure 2: Locations of seismic stations operated by SNSN, shown as triangles.

2.1.3 Phase Logs

When a seismic station in Sweden detects ground motion that exceeds a certain threshold, data is transmitted from the station in a compact format known as a phase log. Each phase log is 128 bytes in size and contains values for various different features which characterize the detected energy transient. The phase logs are transmitted on a single-station basis, so each detected phase is treated as if it corresponded to a real earthquake [4]. All generated phase logs are transmitted to the center in Uppsala, where a separate piece of software associates phase logs from different stations with each other to see which phase logs originate from the same source [9].

Details about how some feature values in the phase logs are computed are at present scarce and challenging to come by. This makes understanding the nature of some of the features, and assessing their predictive capabilities for distinguishing between P- and S-waves, a challenging prospect. The software which computes the generated phase log values dates back a few decades and contains limited documentation. For this reason, gaps currently exist in the knowledge at SNSN about the details of some of the phase logs' features.

Among the features included in each phase log are:

• The time at which a wave phase was detected
• Information about the detector
• Averages of signal and noise levels for the up-down, north-south and east-west components
• Time duration of a wave packet, along with references to previous and subsequent phases if more than one phase has been tracked at once
• Azimuth, coherency and velocity, computed using all three components of ground motion (up-down, north-south, east-west) [10]
• Spectral parameters, such as DC level and corner frequency, for the up-down, north-south and east-west components
• Wave type, as predicted by the automatic system

2.2 Machine Learning

2.2.1 General

In today's era, where enormous amounts of data are generated every day via a plethora of devices, the need for automated analysis procedures grows fast. This is where the field of machine learning enters the scene. Machine learning can be defined as a collection of algorithms that are capable of automatic data pattern identification and of making predictions about unseen data based on previous learning [11]. It is a sub-field of artificial intelligence based around the concept of machines' ability to learn from data and make informed decisions, with as little need for human interference as possible [12].

Machine learning is typically split into two primary branches, supervised and unsupervised learning. The main difference between the two branches lies mostly in the datasets used [11]. At the primary branches' intersection a third branch emerges, known as semi-supervised learning, which combines methods from both primary branches [13].

In supervised learning the goal is to use a labelled dataset $\{(x_i, y_i)\}_{i=1}^{N}$ to guide the process of producing a prediction model. Each $x_i$ represents a vector of features, inputs that influence the outputs $y_i$, which are termed labels [14] [11]. Supervised learning is further discussed in Section 2.2.2.

Unsupervised learning contains no labels for guidance, so no measurable outputs are present. Here, the dataset consists solely of feature vectors $\{x_i\}_{i=1}^{N}$, and the goal can, for example, be to detect data patterns or describe how the data points are clustered in space. These goals tend to be more ambiguous than the ones encountered in supervised learning, partly because it is not necessarily clear what sorts of patterns are being sought. The lack of labels also implies the lack of a clear error metric to evaluate results [14] [11].

Semi-supervised learning is a mixture of supervised and unsupervised learning. Here, the dataset consists of a mixture of labelled and unlabelled samples, usually with a larger number of unlabelled samples. As in supervised learning, the aim of applying semi-supervised learning is typically to produce a prediction model. Including the unlabelled samples in the dataset has the purpose of improving the eventual model [13].

2.2.2 Supervised Learning

Supervised learning is a branch of machine learning where labelled datasets $\{(x_i, y_i)\}_{i=1}^{N}$ are used to produce predictive models that can make predictions on unseen data [13]. "Labelled" in this sense implies that an output variable $y_i$ (label) is present in the dataset and is used as part of the learning process. Along with the labels, datasets are comprised of inputs in the form of feature vectors $x_i$, where each feature represents one input variable which affects the label. The supervised learning problem may be summarized in a simple way by thinking of the features as tools to predict the labels [14].

The nature of the labels $y_i$ differs between applications. Label types that can differ in size, where similar values represent natural similarity, are termed quantitative [14]. Regression is an example of a problem where the label types are quantitative, in the form of real numbers [11]. Label types that fall into one of a finite number of classes are termed qualitative [14]. Classification is an example of a problem where the label types are qualitative, $y_i \in \{1, ..., M\}$, where $M$ represents the number of distinct classes [11].

The nature of the features $x_i$ can also differ between applications. A set of feature vectors may include both quantitative and qualitative features, as well as features of a type termed ordered categorical, where no metrics are involved but some hierarchy exists between values. An example of a feature of ordered categorical type is one that can take values in a set such as {low, medium, high} [14].


2.2.3 Classification

Classification is one of the most important supervised learning problems encountered within machine learning. In classification the aim is to learn a model, via a classification algorithm, that maps input features $x$ to output labels $y$, where $y \in \{1, ..., M\}$ and $M$ represents the total number of distinct classes [11]. Subsequently, the learned model is applied to previously unseen, unlabelled data, with the aim of predicting the labels of the unlabelled examples. The output from such a model is typically either a direct label or a real number that can be interpreted in terms of the classes, for example a probability value [13].

If M = 2 in a classification problem, the problem is termed binary classification [11]. A well-known example of a binary classification problem is the classification of emails into spam/non-spam emails [13]. If M > 2 in a classification problem, the problem is termed multi-class classification [11].

2.2.4 Learning Algorithm Selection

An important question to address in any machine learning problem is the choice of a learning algorithm. Various different learning algorithms exist at present, each with its own advantages and disadvantages. Among the factors that need to be taken into consideration before choosing a learning algorithm are the need for explainability, dataset dimensions, feature types and training speeds [13]. In this project two learning algorithms are examined: logistic regression and feed-forward artificial neural networks.

2.2.5 Logistic Regression

Logistic regression is one of the fundamental classification algorithms in use today. Despite the name, logistic regression is not a regression algorithm. The naming of the algorithm rather stems from structural similarities with linear regression, a fundamental regression algorithm [13]. Like linear regression, logistic regression models the labels $y_i$ as linear functions of the features $x_i$. The logistic regression algorithm takes an extra step to ensure that the output falls within the interval $[0, 1]$. This is achieved by passing the linear combination of features through a special logistic function, often referred to as a sigmoid function (Equation (2)) [14]. The following discussion applies to binary classification.

Denote the labelled dataset by $\{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i$ represents a feature vector for one example that corresponds to label $y_i$. The labels $y_i$ can only be one of two classes, for example either negative or positive. Assume the dataset contains $p$ features in total, so for each example $i$ we have an input feature vector $\{x_i^{(j)}\}_{j=1}^{p}$.


Given a set of feature vectors $x_i$, a standard linear model predicts an output via Equation (1):

$$\hat{y}_i = f(x_i) = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_i^{(j)} \qquad (1)$$

where $\hat{\beta}_0$ represents a bias term and $\{\hat{\beta}_j\}_{j=1}^{p}$ represents a vector of coefficients, the model parameters [14].

The codomain of $f(x_i)$ in this case is $\mathbb{R}$, but for binary classification purposes it is required that $y_i$ only takes one of two possible values. For this purpose a function commonly referred to as a sigmoid function is introduced [13]. The sigmoid function is an "S-shaped", continuous function with codomain $[0, 1]$. It is defined as:

$$\sigma(x) = \frac{e^{x}}{e^{x} + 1} = \frac{1}{1 + e^{-x}} \qquad (2)$$

where $e$ represents Euler's number, the base of the natural logarithm. A plot of the sigmoid function can be seen in Figure 3.

Figure 3: Sigmoid function

Passing the linear combination of the input features, defined by Equation (1), through the sigmoid function produces an output $\sigma(\hat{y}_i)$ that fulfills $0 \leq \sigma(\hat{y}_i) \leq 1$:

$$\sigma(\hat{y}_i) = \sigma\!\left(\hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_i^{(j)}\right) = \frac{1}{1 + e^{-\left(\hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_i^{(j)}\right)}} \qquad (3)$$


Typical practice is then to define a threshold value that guides the classification. If the threshold value is set to, for example, $thres = 0.5$, outputs that fulfill $\sigma(\hat{y}_i) < 0.5$ are classified as negative, whereas outputs that fulfill $\sigma(\hat{y}_i) \geq 0.5$ are classified as positive [13].

To summarize, a prediction using logistic regression, given an input set of feature vectors $x_i$, is obtained by passing a linear combination of the input features through a sigmoid function and classifying the output based on a threshold value.
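The prediction step just summarized can be sketched in a few lines of Python (the parameter and feature values below are invented for illustration, not fitted to SNSN data):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function of Equation (2)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(beta0, beta, x, thres=0.5):
    """Logistic-regression prediction: pass a linear combination of the
    features through the sigmoid, then classify against a threshold.
    Returns (probability, predicted class)."""
    z = beta0 + sum(bj * xj for bj, xj in zip(beta, x))
    p = sigmoid(z)
    return p, 1 if p >= thres else 0

# Hypothetical fitted parameters and one two-feature example:
p, label = predict(beta0=-1.0, beta=[0.8, 1.5], x=[2.0, 0.5])
# z = -1.0 + 0.8*2.0 + 1.5*0.5 = 1.35, so p = sigmoid(1.35) > 0.5 -> class 1
```

In the setting of this thesis, class 1 could stand for one of the two body-wave types (P or S); the threshold 0.5 is the default choice discussed above.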

To fit the model parameters $\{\beta_j\}_{j=0}^{p}$, the most common method is the method of maximum likelihood, which is based on conditional probabilities $Pr(y_i \mid x_i)$ [14]. The idea is to maximize the likelihood of predictions being accurate by designing a "loss" function that needs to be maximized or minimized. A commonly used function for this purpose is the so-called log-likelihood function [13], also known as the binary cross-entropy function. If the log-likelihood is to be maximized, rather than minimized, it is defined as:

$$L(\beta) = \sum_{i=1}^{N} \left[ y_i \ln(\sigma(\hat{y}_i)) + (1 - y_i) \ln(1 - \sigma(\hat{y}_i)) \right] \qquad (4)$$

where $y_i$ represent the actual labels and $\sigma(\hat{y}_i)$ represent the outputs from Equation (3).

The intuition behind Equation (4) can be summarized as follows:

Denote the positive class as 1 and the negative class as 0.

• If the actual label is $y_i = 1$, a value of $\sigma(\hat{y}_i)$ above the threshold value is desired, for the example to be classified as positive. The second term in Equation (4) is then equal to zero and the loss function becomes:

$$L(\beta) = \sum_{i=1}^{N} \ln(\sigma(\hat{y}_i)).$$

Since $0 \leq \sigma(\hat{y}_i) \leq 1$, $L(\beta)$ increases as $\sigma(\hat{y}_i)$ increases. Therefore higher values of $\sigma(\hat{y}_i)$ increase the value of $L(\beta)$ that is being maximized.

• If the actual label is $y_i = 0$, a value of $\sigma(\hat{y}_i)$ below the threshold is desired, for the example to be classified as negative. The first term in Equation (4) is then equal to zero and the loss function becomes:

$$L(\beta) = \sum_{i=1}^{N} \ln(1 - \sigma(\hat{y}_i)).$$

Since $0 \leq \sigma(\hat{y}_i) \leq 1$, $L(\beta)$ increases as $\sigma(\hat{y}_i)$ decreases. Therefore lower values of $\sigma(\hat{y}_i)$ increase the value of $L(\beta)$ that is being maximized.
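This intuition can be checked numerically with a small sketch (the labels and probabilities below are made up for illustration):

```python
import math

def log_likelihood(y, p):
    """Log-likelihood of Equation (4), for labels y (0 or 1) and
    predicted probabilities p = sigma(y_hat), summed over all examples."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

y = [1, 0, 1]              # made-up labels
good = [0.9, 0.1, 0.8]     # predicted probabilities close to the labels
bad = [0.4, 0.6, 0.3]      # predicted probabilities far from the labels

# Confident, correct predictions yield a larger (less negative) value:
assert log_likelihood(y, good) > log_likelihood(y, bad)
```

Because each term $\ln(\cdot)$ is at most zero, $L(\beta)$ is always non-positive and reaches its maximum, zero, only for perfectly confident correct predictions.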

Fitting the model parameters $\{\beta_j\}_{j=0}^{p}$ is an optimization problem where the loss function $L(\beta)$ is optimized with respect to the parameters. The problem contains no closed-form solution, so the application of a numerical optimization method is required. An example of a numerical method commonly used for this purpose is gradient descent [13].


2.2.6 Gradient Descent

The most commonly applied numerical method for optimizing loss functions in machine learning is gradient descent. Gradient descent is an iterative algorithm which minimizes an objective function by updating the model parameters in the direction of the objective function's negative gradient with respect to those parameters. The size of the steps taken towards the objective function's local minimum on each iteration is controlled by a learning rate parameter [15].

Several variations of the gradient descent algorithm exist. Stochastic gradient descent is a commonly used variation which differs from traditional gradient descent in that parameter updates are performed for each data example, one at a time, rather than for the entire dataset [15]. Another widely used variation is the Adam optimizer, an extension of stochastic gradient descent which introduces learning rates that are adaptive and individual for each model parameter [16].
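The update rule described above can be sketched for a one-feature logistic regression model. The code below is a minimal illustration of batch gradient descent on the binary cross-entropy, not the implementation used in the thesis; learning rate and epoch count are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(xs, ys, lr=0.1, epochs=500):
    """Batch gradient descent for one-feature logistic regression,
    minimizing the binary cross-entropy. Returns (beta0, beta1)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the averaged loss w.r.t. the intercept and the slope.
        g0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)) / n
        g1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys)) / n
        b0 -= lr * g0  # step against the gradient
        b1 -= lr * g1
    return b0, b1

# A linearly separable toy problem: large x -> class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
b0, b1 = gradient_descent(xs, ys)
print(sigmoid(b0 + b1 * 2.0) > 0.5)   # True
print(sigmoid(b0 + b1 * -2.0) < 0.5)  # True
```

Stochastic gradient descent would instead perform the two update lines once per individual example within each epoch.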

2.2.7 Feed-Forward Neural Networks

Artificial Neural Network (ANN) is a term used for a collection of powerful supervised learning methods with a broad application spectrum; they can be applied for both regression and classification purposes. The terminology echoes the methods' origins, as they were initially developed with the aim of modelling neurons in the human brain. A typical representation of an ANN is a sequence of interconnected units, termed neurons, that are divided into distinct layers and connect to neurons in other layers. The layers of an ANN typically consist of a single input layer, a single output layer and zero or more hidden layers in between. Connections between individual neurons in a neural network represent biological synapses [14].

Various sub-classes of neural network methods with different characteristics exist at present. One of the most commonly used types is the feed-forward neural network, or multilayer perceptron (MLP), which is closely related to logistic regression [13]. The terminology stems from the fact that data is input to the network through the input layer and information is fed forward through the layers of the network to the output layer [17]. An example schematic diagram of a fully connected MLP, one where every output of a layer is connected to every input of the next layer, is shown in Figure 4.

For classification purposes, the aim of a multilayer perceptron is to approximate a mathematical function f_MLP(x) which takes a feature vector x as input and returns a label y. To achieve this, the MLP uses labelled training data to learn the model parameter values, commonly referred to as weights, that yield the best function approximation. The training data provides the output layer with instructions on what it needs to achieve: an output value close to y for each feature vector x. The actions of the other layers, needed to generate the required output, have to be determined by the learning algorithm.


Figure 4: Fully connected MLP

The term hidden layers refers to the fact that the required output from these layers is not specified by the training data [17].

Typically, MLPs can be interpreted as a nested structure of functions. As an example, a 3-layer MLP may be interpreted as a composition of three functions, f_3(f_2(f_1(x))), where f_1 represents the first hidden layer, f_2 the second hidden layer and f_3 the output layer. The input layer directly receives the dataset's input features, while all other layers in the network receive their inputs from the preceding layer [13].

In the typical MLP architecture, where each hidden layer and the output layer are computed as functions of their preceding layer, the layers (excluding the input layer) are computed as [17]:

h_1 = a_1(W_1^{T} x + b_1) \qquad (5)

h_2 = a_2(W_2^{T} h_1 + b_2) \qquad (6)

and so forth, where h_1 represents the first layer and h_2 represents the second layer, which is computed as a function of h_1. Each W_i represents a matrix of weights, which control the mapping from layer i to layer i + 1. Each b_i represents a vector of bias terms. The weights and the bias terms are the neural network's unknown parameters, which need to be tuned through training. The functions a_i are termed activation functions [17].

The choice of activation functions in a multilayer perceptron varies between applications. For binary classification problems the activation function of the output layer is typically the sigmoid function (Equation 2). Examples of other popular activation functions for neural networks include the hyperbolic tangent function (Equation 7) and the rectified linear unit function (Equation 8), defined as:

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \qquad (7)

\mathrm{relu}(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{otherwise} \end{cases} \qquad (8)

The hyperbolic tangent function has codomain [−1, 1], while the rectified linear unit function "rectifies" negative inputs to zero and returns positive inputs unaltered [13].
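Equations (5)-(8) can be combined into a tiny forward pass. The sketch below uses a hypothetical 2-2-1 network with made-up weights; it illustrates the layer computation a(W^T h + b), not any network trained in the thesis.

```python
import math

def tanh(x):   # Equation (7)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def relu(x):   # Equation (8)
    return x if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(h, W, b, act):
    """One fully connected layer: act(W h + b), where W is given as a
    list of weight vectors, one per output neuron."""
    return [act(sum(wi * hi for wi, hi in zip(w, h)) + bi)
            for w, bi in zip(W, b)]

# Hypothetical network: 2 inputs, one relu hidden layer, sigmoid output.
x = [0.5, -1.0]
h1 = dense(x, W=[[1.0, 0.5], [-0.5, 1.0]], b=[0.0, 0.1], act=relu)
y = dense(h1, W=[[1.0, 1.0]], b=[-0.2], act=sigmoid)[0]
print(0.0 < y < 1.0)  # True: the sigmoid output is a valid probability
```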

The task of training a neural network involves finding values for the weights which lead to a good fit to the training data. To measure a neural network model's fit for a binary classification problem, the binary cross-entropy loss function (Equation 4) is commonly used. The typical procedure for optimizing the loss function of a neural network is termed back-propagation. The back-propagation algorithm uses the chain rule for differentiation to compute the gradient of the loss function via a forward and a backward pass through the network [14].

The training of neural networks involves defining values for several hyperparameters. Examples of hyperparameters encountered for neural networks include the number of hidden layers, the number of neurons within each layer, the type of activation function and the initial weights. The tuning of hyperparameters typically combines background knowledge with experimentation. For the number of neurons within a hidden layer, having too many neurons is typically preferred over having too few, because too few neurons limit a model's flexibility to fit the data. A typical number of hidden-layer neurons lies between five and one hundred, depending on the dimensions of the data [14].

2.2.8 Feature Selection

Selecting the data features to use when constructing a machine learning model is a process that can significantly influence results. Feature selection involves manually and automatically extracting those features in a dataset that are most important for predicting a target label. Benefits of feature selection include shorter training times, increased accuracy and a lower risk of overfitting [18].


2.2.9 Model Evaluation

A critical part of a machine learning project is the assessment of a model’s performance.

Assessing the model's prediction performance on unseen test data gives a measure of how well the model generalizes and of its overall quality [14].

A typical practice for model assessment, in situations where data is plentiful, is to shuffle the dataset and divide it into three distinct subsets, termed the training set, validation set and test set. The point of such a division is to ensure that the model does not only make accurate predictions on the data with which it is trained, but can also generalize to make predictions on unseen data [11].

General rules for the split ratios of a dataset into training, validation and test sets are difficult to give, as the ratios should depend on the dataset's characteristics, such as signal-to-noise ratio and the amount of available data [14]. In a typical scenario, though, the training set is substantially larger than the validation and test sets, which are often of similar size [13]. An example of a split might be 50% for training and 25% each for validation and testing [14].

The training set is used to train the model by tuning its model parameters. The validation set is involved in the training process and guides the search for the optimal model parameter values by estimating prediction error. The test set is kept completely separate from the training process. Once the final model has been established it is evaluated on the test set by its ability to predict the labels of the test examples. This gives an estimate of how well the model generalizes [14].
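The shuffle-and-split procedure described above can be sketched as follows. This is an illustration using the example 50/25/25 ratios; the function name and fixed seed are not from the thesis.

```python
import random

def split_dataset(examples, val_frac=0.25, test_frac=0.25, seed=0):
    """Shuffle and split a dataset into training, validation and test
    sets (50/25/25 by default, as in the example split above)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 50 25 25
```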

In some cases the available data is scarce and the majority of it is required for training. Such situations can result in only small amounts of data being left for validation and testing, making the previously described method of assessment susceptible to inaccuracy. A widely used method of assessment in such cases is K-fold cross-validation, which measures a learning method's average generalization error when used on unseen test data [14].

The idea behind K-fold cross-validation is to divide the training set into K approximately equally sized subsets, termed folds. For each fold k ∈ {1, ..., K} the model is fitted to all folds except the k-th fold and tested on the k-th fold, producing the desired performance estimate. Once this process has been repeated for all K folds the performance estimates are averaged to produce a final score. K = 5 and K = 10 are common choices for K in cross-validation [11].
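The fold construction can be sketched as an index-splitting routine (a minimal illustration; in practice a library utility such as scikit-learn's KFold would typically be used):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k approximately equal folds and yield
    (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n))
    # The first n % k folds get one extra example each.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# With K = 5 and 10 examples, each fold holds out 2 examples.
folds = list(kfold_indices(10, 5))
print(len(folds), len(folds[0][1]))  # 5 2
```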


2.2.10 Evaluation Metrics

In order to evaluate a classification model, a way of measuring its performance is needed. Consider a binary classification model in which examples are classified as either positive or negative. Define:

• True Positives (TP): the number of examples correctly classified by the model as positive.

• False Positives (FP): the number of examples that in reality are negative but are classified by the model as positive.

• True Negatives (TN): the number of examples correctly classified by the model as negative.

• False Negatives (FN): the number of examples that in reality are positive but are classified by the model as negative.

Following is a description of some commonly applied evaluation metrics for binary classification models.

The confusion matrix conveniently summarizes a model's performance in terms of TP, FP, TN and FN in a matrix layout. For a binary classification problem it is a 2×2 matrix of counts, with one dimension representing the actual labels and the other representing the predicted labels. The values in the matrix represent the number of examples in each category. Studying the confusion matrix allows for a general overview of the model's performance and shows how the prediction errors are distributed over the classes [13]. Figure 5 shows a typical confusion matrix setup.

Accuracy represents the ratio of the number of correctly classified examples to the total number of classified examples. Using accuracy as an evaluation metric places equal emphasis on prediction errors for both classes. In some applications this may not be desirable, since prediction errors for one of the classes may have much higher significance than for the other. In such cases accuracy is unlikely to be an appropriate metric [13]. Accuracy is defined as:

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (9)

Precision represents how many, out of all examples classified as positive, actually are positive. Applications that place high significance on positive predictions being accurate require high precision [13]. Precision is defined as:

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (10)


Figure 5: Confusion matrix. Image credit: Narkhede, 2018 [19]

Recall represents how many, out of all examples that actually are positive, are correctly classified as positive. Applications that place high significance on correctly classifying all positive examples require high recall [13]. Recall is defined as:

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (11)

Another evaluation metric, based on precision and recall, is the F-measure, sometimes also referred to as the F-score or F1-score. It is defined as the harmonic mean of precision and recall [20]:

F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (12)

The F-score can be a useful metric in applications that do not place special emphasis on achieving either high precision or high recall over the other. It can be interpreted as a general measure of an algorithm's performance in predicting examples belonging to the positive class.
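Equations (9)-(12) can be computed directly from the four confusion matrix counts. A minimal sketch, with made-up counts for illustration:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F-score (Equations 9-12)
    from the four confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Hypothetical counts: 80 TP, 10 FP, 100 TN, 20 FN.
acc, prec, rec, f1 = metrics(80, 10, 100, 20)
print(round(prec, 3), round(rec, 3))  # 0.889 0.8
```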

2.2.11 Class Imbalance

The term class imbalance refers to cases where the number of examples belonging to each class is not equal. It is a common phenomenon in machine learning problems. Depending on the nature of the problem, class imbalance may also be expected, due to examples of one class being significantly more common in reality than examples of other classes. It is important to be aware of how class imbalance may affect the predictions acquired from a classification model. This is especially true when accuracy is used as an evaluation metric: a binary classification model trained on a dataset containing 95% of examples from one class and 5% from the other could yield results with very high accuracy by simply always predicting the larger class [21].

When dealing with datasets that contain imbalanced classes, alternative evaluation metrics to accuracy can be useful, e.g. precision and recall, which demonstrate how accurately a classification model predicts examples in the smaller classes. Other possible methods to combat class imbalance include under-sampling, where examples from the training set belonging to the larger class are removed, and over-sampling, where new examples based on the training set examples belonging to the smaller class are generated [21].
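Random under-sampling, as described above, can be sketched as follows. This is an illustration only; the function name and the fixed seed are not from the thesis, and the thesis does not state that this particular scheme was applied.

```python
import random

def undersample(examples, labels, seed=0):
    """Random under-sampling: keep all minority-class examples and an
    equally sized random subset of the majority class."""
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(examples, labels) if y == 1]
    neg = [(x, y) for x, y in zip(examples, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority)) + minority
    rng.shuffle(kept)
    return [x for x, _ in kept], [y for _, y in kept]

# 8 majority (0) vs 2 minority (1) examples -> balanced 2 + 2.
X, y = undersample(list(range(10)), [0] * 8 + [1] * 2)
print(sorted(y))  # [0, 0, 1, 1]
```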

2.2.12 Overfitting

Overfitting is a commonly encountered property of prediction models, characterized by the model generalizing poorly to new data. An overfitted model is capable of accurately predicting the labels of data examples from the training set but makes frequent errors when predicting labels of data examples not included in the training set [13]. Overfitting can occur when a model starts learning and adapting to minor details in the training data which likely originate from noise rather than from an actual signal [11].

2.2.13 Hyperparameter Tuning

The term hyperparameters refers to parameters which are model-specific and are defined before a model is trained. Values of hyperparameters, as opposed to model parameters, are not tuned by the learning algorithm but rather by the person implementing it, typically through a trial-and-error process [13]. Examples of hyperparameters include the type of activation function used in an ANN, the number of hidden layers in an ANN and the number of neurons within an ANN layer.

Among the methods used for hyperparameter tuning is a technique known as grid search. First, pre-defined lists of hyperparameter values to be tested for the chosen learning algorithm are prepared. Grid search then exhaustively creates models for all possible combinations of the pre-defined hyperparameter values and evaluates the performance achieved with each combination. The hyperparameter combination that yields the best results, assessed using a desired evaluation metric, is chosen as the optimal combination. Because grid search evaluates all possible combinations of hyperparameter values, computation times increase rapidly as the number of parameters and possible parameter values grows.
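The exhaustive search over combinations can be sketched with a generic scoring callback. This is an illustration of the idea, not the project's code (which uses scikit-learn's grid search utilities); the toy scoring function simply stands in for a cross-validated model evaluation.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustive grid search: evaluate every combination of the
    pre-defined hyperparameter values, return the best combination."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for cross-validated model accuracy.
grid = {"hidden_neurons": [5, 20, 100], "learning_rate": [0.001, 0.01]}
score = lambda p: -abs(p["hidden_neurons"] - 20) - abs(p["learning_rate"] - 0.01)
best, _ = grid_search(grid, score)
print(best)  # {'hidden_neurons': 20, 'learning_rate': 0.01}
```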


3 Data

The data used in the project is numerical data originating from phase logs transmitted from seismic stations in Sweden. The detection system used by SNSN is designed with thresholds low enough to record small earthquakes. The low detection thresholds also mean that ground motion caused by sources other than seismicity is more easily detected. A high portion of all phase logs transmitted to the central location in Uppsala does not originate from seismic activity but is instead caused by noise of various kinds, both anthropogenic and natural. At the central location, the different phase logs are collected and merged into a single list. An automatic process then combines two or more phase logs that are likely to have the same seismic origin into a single seismic event [4].

3.1 Creating Labelled Datasets

The information required to create the labelled datasets needed for supervised learning is provided by data analysts at SNSN. A data analyst distinguishes seismic events such as earthquakes and blasts from noise and marks the arrival times of seismic wave phases on seismograms. Information about the arrival times of different seismic waves, for example P- and S-waves, can subsequently be used to match phase logs with wave types. The labelled datasets used in this project are created by going through all recorded phase logs for every seismic station and extracting the ones that have been matched with seismic wave phases from seismic events such as earthquakes, blasts and induced events. This way, a labelled dataset is created for each seismic station, where the features consist of phase log features and the labels consist of the manually picked wave types.

3.2 Initial Data Analysis

3.2.1 Feature Inconsistencies

Initial data analysis reveals unexpected inconsistencies in several features that can be seen when they are plotted against time. This is mainly notable in the corner frequency and DC-level features for all three components. Figures 6 and 7 show examples of such inconsistencies for corner frequencies and DC-level in all three components at the seismic station Linköping (lnk). The figures show that the range of the feature values changes abruptly at two distinct time periods (in 2003 and 2007). Similar phenomena are observed at all stations. Details about the source of the inconsistencies are at present unknown, although they are suspected to be related to instrument changes at the stations and are unlikely to have a natural cause.


Figure 6: Corner frequencies as a function of time at seismic station lnk. Different colors represent different components (up-down, north-south, east-west).

Figure 7: DC-level as a function of time at seismic station lnk. Different colors represent different components (up-down, north-south, east-west).

3.2.2 Feature Correlation

Feature correlation analysis reveals high correlation between the different components of the signal- and noise-level features. This implies a strong, linear relationship between these features. Figure 8 shows a correlation matrix for a subset of data features from the seismic station Kurravara (kur) in Northern Sweden. The signal- and noise-level features are denoted by [z|n|e]_signal and [z|n|e]_noise, where z indicates the up-down component, n indicates the north-south component and e indicates the east-west component. The values shown in the matrix are the computed Pearson's correlation coefficients for each pair of features.

The figure shows that the correlation between different signal- and noise-level components is higher than between any other pair of features. Identical computations for other stations yield correlation matrices with the same characteristics.

Figure 8: Correlation matrix for different features at the seismic station Kurravara. Each row and each column represent one phase log feature. The values represent the Pearson correlation coefficient between the features. The strength of the relationship between two features is determined by the absolute value of the correlation coefficient; an absolute value of 1 implies a perfect linear relationship.
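The Pearson coefficients shown in Figure 8 can be computed as follows. This is a minimal pure-Python sketch of the formula; the project itself would typically obtain these values from a library routine.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length
    sequences: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related features give a coefficient of 1.
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # 1.0
```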


3.3 Number of Data Examples

The number of data examples (phase logs associated with a manually picked wave type) ranges from 1396 to 41278 for the different stations. The stations were constructed in different years and are located in areas with very different seismic settings, so variability in the number of data examples is expected. For each station, data is collected from the station's year of construction to the end of 2018. Figure 9 shows the number of data examples for each station. The majority of the data examples originate from industry-related blasts, but they also include naturally occurring earthquakes and induced events.

Figure 9: Number of data examples for all stations. The stations are ordered from the one containing the most examples to the one containing the fewest.

3.4 Class Imbalance

Figure 10 shows the proportion of examples that belong to the P-wave class for all stations. From the figure it can be seen that a degree of class imbalance is present in the data. On average, around 83% of the data available for each station belongs to the P-wave class. The most likely explanation is that in the case of blasts, which make up the majority of the data examples, S-wave arrivals on a seismogram can be significantly harder to detect than P-wave arrivals, leading to fewer S-wave phases being picked. Additionally, the software used to locate seismic events places substantially higher weight on P-wave arrival picks than on S-wave picks, so picking only a P-wave arrival can be a faster way of locating an event.


Figure 10: Proportion of P-wave examples for all stations


4 Method

4.1 Problem Statement

The problem is set up as a binary classification problem with the goal of classifying phase logs transmitted from seismic stations in Sweden as being generated by the propagation of either P- or S-waves. Individual models are developed for each station, 64 in total. The problem is solved using supervised learning.

4.2 Data Collection

The data used in the project is numerical data from phase logs generated at seismic stations in Sweden. The labelled datasets required for supervised learning are constructed by exhaustively going through all recorded phase logs for each seismic station and matching those that originate from actual seismic events with manually assigned labels (Section 3.1). P-waves are encoded as the negative class (0) while S-waves are encoded as the positive class (1).

4.3 Learning Algorithm Selection

The problem is solved using two types of supervised learning algorithms: logistic regression and the multilayer perceptron (MLP). The logistic regression algorithm (Section 2.2.5) is chosen because of its inherent simplicity and the desire to assess how accurately the problem can be solved using a simpler classifier than the one used by Böðvarsson et al. (1999). Other simple classification algorithms, such as K-nearest neighbours or decision tree classifiers, could also have been applicable for this purpose. The MLP algorithm (Section 2.2.7) is chosen because of the accurate results achieved with an MLP classifier under similar conditions by Böðvarsson et al. (1999).

4.4 Software Tools

The programming part of the project is done using the Python programming language.

In the data collection stage, the software libraries NumPy and Pandas are used.

The classification models are built using the Keras deep learning library, a high-level Python-based neural network API, with TensorFlow as the backend engine. The Scikit-learn machine learning library is used widely in conjunction with Keras in the project, for example during hyperparameter optimization, automatic feature selection and the computation of model evaluation metrics.

References
