Machine Learning Applied to Reach Classification in a Northern Sweden Catchment

Academic year: 2022

Master’s Thesis in Earth Sciences, 60 hp

Spring term (Vt) 2021

MACHINE LEARNING APPLIED TO REACH CLASSIFICATION IN A NORTHERN SWEDEN CATCHMENT

Mariana Dos Santos Toledo Busarello


Abstract

An accurate fine-resolution classification of river systems positively impacts the assessment and monitoring of water courses, as stressed by the European Commission’s Water Framework Directive. Being able to attribute classes using remotely obtained data makes it possible to classify reaches extensively without field work, with some methods also allowing identification of which features best describe each of the process domains. In this work, data from two Swedish sub-catchments above the highest coastline were used to train a Random Forest Classifier, a Machine Learning algorithm. The resulting model provided class predictions and analyses of the most important features.

Each study area was studied separately, then combined. In the combined case, the analysis was made with and without lakes in the data, to verify how their presence would affect the predictions. The results showed that the accuracy of the estimator was reliable; however, due to data complexity and imbalance, rapids were harder to classify accurately, with an overprediction of the slow-flowing class. Combining the datasets and keeping lakes in the data lessened the shortcomings of the imbalance. Using the feature importance and permutation importance methods, the three most important features identified were the channel slope, the median of the roughness in the 100-m buffer, and the standard deviation of the planform curvature in the 100-m buffer. This finding was supported by previous studies, but other variables expected to contribute strongly, such as lithology and valley confinement, were not relevant, which most likely relates to the coarseness of the available data. The most frequent errors were also mapped, showing some overlap between error hotspots and areas restored in 2010.

Keywords: channel type, geomorphology, Random Forest


Table of Contents

1. Introduction
1.1. Purpose
1.2. Hypothesis
1.3. Study areas
2. Materials and Methods
2.1. Channel data
2.2. Random Forest Classifier
2.4. Predictors
2.4.1. Terrain Analysis Metrics - Distribution Metrics
2.4.2. Channel slope
2.4.3. Surficial Geology
2.4.4. Surficial Geology Depth
2.4.5. Lithology
2.4.6. Block Presence
2.4.7. Confinement coefficient
2.4.8. Drainage area
2.4.9. Local drainage density
2.5. Statistical Analyses
2.5.1. Feature selection
2.5.2. Feature importance and permutation importance
2.5.3. Data Complexity Measures
2.5.4. Accuracy and ROC-AUC
2.5.5. Confusion matrix and performance assessment
2.5.6. Receiver Operator Characteristics – Area Under the Curve
2.5.7. High error frequency analysis
3. Results
3.1. Highest contributing variables
3.2. Accuracy analysis
3.3. Complexity analysis
3.4. Frequent error locations
4. Discussion
5. Conclusion
6. Acknowledgements
7. References
Appendices


1. Introduction

Classification works by grouping systems based on common characteristics, providing a consistent reference for those who work within the same area, where reproducibility is needed.

In hydrogeomorphology, attempts to classify river channels date back to the late 19th century, when Davis (1899) divided streams into young, mature, and old age. This division showed the relationship between river and landform development in his “cycle of erosion”.

Later classification works were based on several river aspects, generating a great number of classification systems. For example, Leopold and Wolman (1957) presented a morphological organization into straight, meandering, or braided channels. Strahler (1957) developed the popular stream-order classification, naming the most distant tributaries first order and increasing the order by one when two streams of the same order merge. Rosgen (1994) elaborated a detailed sorting using both channel and sediment attributes. A process-based framework was defined by Montgomery & Buffington (1997), assessing the channel response potential and conditions for mountain drainage basins.

The Swedish Agency for Marine and Water Management (SWAM, Havs- och Vattenmyndigheten) has a hydromorphologic classification system composed of seven classes.

It considers the stream’s slope, sinuosity, width/depth ratio, valley confinement, soil, and bottom material (Havs- och vattenmyndighetens författningssamling, 2013). Swedish rivers are boreal, shaped by the deglaciated landscapes they flow through (Snyder et al., 2008), and are separated into river-lakes, rapids, or slow-flowing reaches (Nilsson et al., 2002), with velocity and sedimentation rate providing the distinguishing attributes. In the Swedish context, geomorphic and hydrological processes play a key role in river restoration efforts (Polvi et al., 2020), with the channel categorization reflecting these features and grouping them.

Recently, Machine Learning (ML) has been used in Earth Sciences to build models from the available data (Solomatine et al., 2008), with applications in classification as well. Examples include classifying alluvial channels based on landscape characteristics and other controlling variables (Beechie and Imaki, 2014); automatic land-cover classification from aerial remote-sensing imagery using self-organizing feature maps and support vector machines (Velickov et al., 2000); a comparison between different algorithms for lithology classification using remotely sensed geophysical data (Cracknell & Reading, 2014); and support vector machines used to automate landform classification on Mars (Stepinski et al., 2007).

Mitchell (1997) defined ML as applications in which a computer looks for patterns in datasets and afterwards applies these learned patterns to new data, creating predictions. It can be either supervised, when examples of the desired output are provided, or unsupervised, when the algorithm is expected to look for patterns freely, grouping individuals based on the similarities found. Regression problems predict continuous data, while classification tasks do the same for categorical information.

1.1. Purpose

The Water Framework Directive (European Commission, 2000) highlights the importance of classifying water reaches to improve their assessment and management. Attempts to use automatic methods of channel classification in Sweden have been made before (Polvi & Nydahl, 2015), with some challenges related to the coarseness and small-scale aspect of the data found during the procedure. A ML approach could be useful in this regard due to its recent successful applications in geosciences.


1.2. Hypothesis

My hypothesis was that, with the use of the Random Forest estimator, it would be possible to recognize the important drivers of reach types, expected to be confinement and channel slope.

1.3. Study areas

The chosen study areas were two sub-catchments above the highest coastline in northern Sweden: Hjuksån and Bjurbäcken, both located in the Vindel River catchment, a part of the Ume River catchment (Figures 1-3).

Törnlund & Östlund (2002) described how the Vindel River was altered to allow timber to be floated downstream more easily. These alterations started in 1820 with the removal of stones from the streams, regulation of the water flow, and construction of floatway structures, persisting until the decline of timber floating in 1976.

In 2010, a restoration project called Vindel River LIFE started (LIFE08 NAT/S/000266; Vindel River LIFE, 2014). Its goal was to restore the river and its tributaries following two different techniques: first, traditional restoration, with habitat reconstruction, boulder insertion in the streams, dam removal, and construction of bypasses; and second, advanced restoration, with the insertion of large boulders and tree trunks from the surroundings of the reach, applied to 10 chosen sites.

Figure 1. Study area in the Umeälven catchment. Map created in ArcGIS Pro v2.7 (Esri, 2020). Data from the Swedish Meteorological and Hydrological Institute (SMHI, Sveriges meteorologiska och hydrologiska institut) and the Swedish Mapping, Cadastral and Land Registration Authority. Coordinate system: SWEREF99.


Figure 2. Hjuksån sub-catchment. Map created in ArcGIS Pro v2.7 (Esri, 2020), with images from Swedish National Satellite Database supplied by Swedish Mapping, Cadastral and Land Registration Authority (2020D). Coordinate system: SWEREF99.

Figure 3. Bjurbäcken sub-catchment. Map created in ArcGIS Pro v2.7 (Esri, 2020), with images from the Swedish National Satellite Database supplied by the Swedish Mapping, Cadastral and Land Registration Authority (2020D). Coordinate system: SWEREF99.


2. Materials and Methods

2.1. Channel data

The stream lines were obtained following the workflow outlined by Skog (2015) and were the result of work performed by a previous project assistant in the group (unpublished). First, a digital elevation model (DEM) with a resolution of 2 m was downloaded for each study area, developed by the Swedish Mapping, Cadastral and Land Registration Authority (Lantmäteriet) through Laser Imaging, Detection, And Ranging (LIDAR; Swedish Mapping, Cadastral and Land Registration Authority, 2020A). The “Fill” tool in ArcMap v10.7.1 (Esri, 2020) was then used to remove sinks, returning a depressionless DEM that was used as input for the “Flow Direction” tool. This tool produced a raster with the flow direction from each pixel to its neighbouring cell along the steepest downslope.

With this flow direction raster as input, the “Flow Accumulation” tool returned a raster with the flow accumulation of every cell. Next, this raster was multiplied by 4 (the pixel area, given the 2-m resolution) in the “Raster Calculator” to convert it to square meters. A conditional statement in the “Raster Calculator” then selected only the pixels above a certain catchment-area threshold for display. Finally, the “Raster to Polyline” tool created a polyline shapefile representing the stream line.

The main stream was identified manually by drawing a polyline starting at the outlet and moving upstream. When two streams of the same size merged, the values of the upstream cells were checked for the largest accumulation. The “Construct Points” tool created a point every 50 m along the stream line, and a process domain was attributed to each. The channel types were classified at these points based on orthophotos, assigning three classes: lakes (L), slow-flowing (SF) and rapids (R). In total there were 632 classified reaches in study area 1 (SA1) and 1224 in study area 2 (SA2), most of them 50-m long, with some exceptions at channel extremities. All the analyses in this work were performed on these channel data.

The data was analysed separately for each study area, dataset 1 being the data from Hjuksån and dataset 2 from Bjurbäcken. The location of lakes is recorded precisely in Swedish government records (SMHI, 2020), so there was no knowledge gap in that respect. However, the interaction between the three classes could reveal patterns as well, so two extra analyses were done by combining the two datasets: one with the three classes (combined multiclass dataset) and one with only R and SF (combined binary analysis). The total number of examples per class and dataset is shown in Tables 1 and 2, together with the proportion that each class represents in the whole dataset.

Table 1. Total number of examples in each class and the share of the total dataset they cover in the combined multiclass and binary datasets.

                   Combined dataset (multiclass)    Combined dataset (binary)
                   L       R       SF               R       SF
Total features     301     328     1227             328     1227
Proportion (%)     16.2    17.7    66.1             21      79

Table 2. Total number of examples in each class and the share of the total dataset they cover in datasets 1 and 2.

                   Dataset 1                Dataset 2
                   L       R       SF       L       R       SF
Total features     107     108     417      194     220     810
Proportion (%)     16.9    17.1    66       15.8    18      66.2
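The counts and proportions above are simple to reproduce with pandas; a minimal sketch, where `channel_type` is a hypothetical column name holding the L/R/SF labels of the combined multiclass dataset:

```python
import pandas as pd

# One row per 50-m reach; "channel_type" is a hypothetical column name
reaches = pd.DataFrame({
    "channel_type": ["L"] * 301 + ["R"] * 328 + ["SF"] * 1227
})

# Absolute counts per class, and each class's share of the whole dataset
counts = reaches["channel_type"].value_counts()
proportions = (reaches["channel_type"].value_counts(normalize=True) * 100).round(1)

print(counts.to_dict())       # {'SF': 1227, 'R': 328, 'L': 301}
print(proportions.to_dict())  # {'SF': 66.1, 'R': 17.7, 'L': 16.2}
```

The `normalize=True` option returns fractions directly, which matches the proportion rows of Tables 1 and 2 after scaling to percent.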

2.2. Random Forest Classifier

Predicting which class a reach belongs to using ML is a supervised classification task. ML is defined as computer algorithms that aim to automatically improve their performance in a task through experience, a process also known as training (Mitchell, 1997). During training, they use the available data to carry out a given task, improving their performance with experience. Tasks may vary, being either classification (for discrete data) or regression (for continuous data).

The algorithms can be divided into supervised or unsupervised. A supervised learning classification (Janikow, 1993) happens when the training data has each record defined by a pair composed of a class and a set of attributes, which will be used as examples of the classification. In this case, the task is to create a generalization of the data to describe the classes. Decision rules are the class separation logic, created when a rule-based framework is used to express the descriptions. These rules are later used to predict classes and properties for yet unseen data. In unsupervised learning, the labels are not known by the algorithm, which is free to find patterns, perform clustering and dimensionality reductions (Ghahramani, 2004).

One of these algorithms is the Decision Tree Classifier (Safavian & Landgrebe, 1991), which works by dividing complex decisions into smaller steps, starting at the root, where no division rule has been applied yet. All the attributes are sorted through the tree: from root to branches, and then to the leaf nodes, where a classification is finally returned (Figure 4).

Decision Trees can overfit the data, however (Ho, 1995), resulting in accuracy loss of the generalization on unseen data. A Random Forest (Breiman, 2001) is an algorithm that combines Decision Trees, reducing this undesirable effect by growing many independent trees, each one with a different sample of examples. Each tree decides on a class, casting a vote, with the most popular label being selected (Figure 5).

Guillon et al. (2020) selected the Random Forest Classifier as the estimator with the highest accuracy and stability for hydrologic and geomorphic classification. Accordingly, in this thesis I followed their workflow, producing some of the attributes they used for the case study and later using them as input data for the algorithm.

The estimator was executed using the Scikit-learn v0.23.2 (Pedregosa et al., 2011) RandomForestClassifier in Python v3.7.5. 80% of the data was used to train the algorithm and the prediction was applied to the remaining 20%. This process was repeated 1000 times (folds), resampling the training data every time to generate statistical distributions of the results. The subsequent analyses are described in section 2.5.
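The repeated-resampling loop described above can be sketched as follows; the synthetic data and the reduced fold count are stand-ins for the real 61-predictor reach table and the 1000 repetitions used in the thesis:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the reach table: 3 classes (L, R, SF) with an
# imbalance similar to the thesis data (roughly 16/18/66 %).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, weights=[0.16, 0.18, 0.66],
                           random_state=0)

accuracies = []
for fold in range(20):  # the thesis repeats this 1000 times
    # Resample the 80/20 train/test split every run to build a distribution
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=fold)
    clf = RandomForestClassifier(random_state=fold).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"accuracy over {len(accuracies)} folds: "
      f"{np.mean(accuracies):.3f} ± {np.std(accuracies):.3f}")
```

Collecting a score per fold, rather than a single score, is what allows the statistical distributions mentioned in the text to be reported.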

Figure 4. A Decision Tree. All the classes are contained in the first square (root, in dark grey), being partially divided by decisions (branches) until finally reaching a single class (leaf node). The examples that satisfy a condition are labelled True and grouped in grey squares, while those that do not are labelled False and grouped in white squares. Tree obtained using data from dataset 2.

Figure 5. Random Forest scheme. Each decision tree casts a vote for each example, with the most voted label chosen as the one the estimator applies to the example. The trees were plotted using data from dataset 2.


2.4. Predictors

Predictors are the input attributes used by the estimator to predict the labels. As mentioned above, the input data for supervised classification tasks is composed of several records, each characterized by the desired output value and the characteristics of the record (i.e., a label and its attributes – predictors). These examples are organized as columns in a comma-separated value (CSV) file, where a single column contains the labels for each record and several others hold the attributes.

In total, 118 predictors were obtained through ArcMap, QGIS and Python. For this study, they were either calculated directly from the area around each reach with zonal statistics, or had values extracted from rasters and vectors at the reaches’ midpoint. The complete table with all the predictors and their full names can be found in Appendix I, where the ones selected during the feature selection process are highlighted in grey.

2.4.1. Terrain Analysis Metrics - Distribution Metrics

Following Guillon et al. (2020), the Terrain Analysis Metrics – Distribution Metrics (TAM- DM) predictors were calculated. Eight TAMs were derived from this DEM using ArcMap: slope, aspect, flow direction, roughness, planform curvature, profile curvature, terrain ruggedness index and topographic position index. The slope represents the terrain steepness, and it was calculated using the planar method, where the tool obtains for each pixel the maximum value change rate to the pixels around it. Aspect is the slope direction, calculated by fitting a plane in the neighbouring cells around the pixel being processed, and then finding the direction of this plane.

Both the planform and profile curvatures were calculated using the “Curvature” tool, based on the surface’s second derivative. Planform curvature explains the flow convergence and divergence, oriented perpendicularly to the direction of the maximum slope (Conforti et al., 2010). A profile curvature, on the other hand, is oriented parallelly to the maximum slope direction, influencing the flow reduction and acceleration (Minár et al., 2013).

Roughness describes how the surface height deviates from an ideal smooth surface (Pollyea & Fairley, 2012). It was obtained using the R package Raster v3.4.5 (Hijmans et al., 2015), calculated as the maximum minus the minimum value of a pixel and the eight around it. The R package SpatialEco v1.3-6 (Evans & Ram, 2015) was used to calculate the terrain ruggedness index (TRI) and topographic position index (TPI). TRI measures terrain heterogeneity by calculating the elevation difference between a cell and its neighbours (Riley et al., 1999). TPI calculates the difference between the elevation of a cell and the mean elevation within a certain distance of it, returning a relative topographic position (Weiss, 2001).

A 100-m near-channel buffer was applied to each reach individually and, using the Rasterio v1.1.0 (Gillies et al., 2013) and Rasterstats v0.14 (Perry, 2013) packages with python, six Distribution Metrics (DM) were calculated for every TAM: mean, median, minimum, maximum, standard deviation and skewness; resulting in 54 features. The same process was applied one more time, now using a square-shaped buffer with 1100-m side, to capture differences between the hillslope and valley bottom. The value was decided based on the mean of the valley bottom width. With this, 54 additional features were obtained.
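In the thesis these statistics were computed with Rasterio and Rasterstats over the buffer polygons; the six distribution metrics themselves can be sketched with NumPy alone, here with a square mask standing in for the 100-m near-channel buffer geometry:

```python
import numpy as np

def distribution_metrics(values):
    """Six DMs used in the thesis: mean, median, min, max, std, skewness."""
    v = np.asarray(values, dtype=float)
    mean, std = v.mean(), v.std()
    # Fisher-Pearson moment coefficient of skewness (biased form)
    skew = ((v - mean) ** 3).mean() / std ** 3 if std > 0 else 0.0
    return {"mean": mean, "median": np.median(v), "min": v.min(),
            "max": v.max(), "std": std, "skew": skew}

# Toy TAM raster (e.g. roughness) and a mask standing in for the buffer
rng = np.random.default_rng(0)
raster = rng.random((50, 50))
buffer_mask = np.zeros((50, 50), dtype=bool)
buffer_mask[10:40, 10:40] = True  # stand-in for the 100-m buffer polygon

dm = distribution_metrics(raster[buffer_mask])
print({k: round(v, 3) for k, v in dm.items()})
```

Applying the function once per TAM raster and per buffer shape reproduces the 6 × 9 = 54 features per buffer described above.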

2.4.2. Channel slope

This predictor was defined by extracting values from the DEM to obtain the slope at the water surface of each reach, and differs from the TAM slope mentioned above. According to the Swedish Mapping, Cadastral and Land Registration Authority (2020B), the points returned during the scanning process used to construct the DEM were interpolated using the Triangulated Irregular Network method and smoothed where they were located over water, since regular LiDAR only returns readings on water under specific circumstances (Rak et al., 2017). In the case of lakes and the sea, this smoothing step uses the ground values closest to the water surface to obtain a uniform elevation value. For watercourses, the water surface becomes sloped in the DEM.

The starting (h0) and ending (hF) points of each reach section were sampled from the DEM with the tool “Sample raster values” from the QGIS v3.10.11 Toolbox (QGIS.org, 2021), returning the elevation of each point. The difference between the two was then divided by the section length, obtained using the QGIS geometry function “$length”. The calculation is described in Equation (1).

slope = (h0 − hF) / length    (1)
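Equation (1) is straightforward to apply per reach; a minimal sketch with hypothetical elevation values:

```python
def channel_slope(h0, hF, length):
    """Equation (1): water-surface slope of a reach (dimensionless)."""
    return (h0 - hF) / length

# A 50-m reach dropping 0.5 m has a slope of 0.01 (1 %)
print(channel_slope(h0=240.5, hF=240.0, length=50.0))  # 0.01
```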

2.4.3. Surficial Geology

A shapefile with surficial geology information (jordarter_25_100k_jg2.shp) was obtained from the Geological Survey of Sweden (SGU, Sveriges Geologiska Undersökning, 2014), at a scale of 1:100,000. The layer represents the geology type (or Quaternary deposit) at a depth of approximately 0.5 m, and the values are treated as categorical. To determine a category for each reach, the tool “v.distance” was used in QGIS, giving each example the same class as the nearest polygon.

2.4.4. Surficial Geology Depth

The data was sampled from jorddjup_10x10m.tif (SGU, 2020), a raster at the scale 1:100,000 (Daniels & Thunholm, 2014) with values of soil depth underneath the organic layer; however, it was empty in water areas. To fill these data gaps, the tool “r.surf.idw” in QGIS was used to perform an Inverse Distance Weighting interpolation.

2.4.5. Lithology

This predictor followed the same workflow as the Surficial Geology, using the shapefile berggrund_50_250k_geo_enh_y.shp (SGU, 2017), which represents the bedrock map layer at a scale of 1:50,000. Its values were treated as categorical as well.

2.4.6. Block Presence

This predictor followed the same workflow as the Surficial Geology, obtained using the data from jordarter_25_100k_bl.shp, at a scale of 1:100,000. It describes boulder presence at surface level (SGU, 2014). The values were also treated as categorical.

2.4.7. Confinement coefficient

Channel confinement is defined by Fryirs et al. (2016) as the percentage of a stream segment that is adjacent to a confining margin on either side of the banks, quantified as a numerical value for each reach ranging from 0 (unconfined) to 1 (confined). The calculation used the Confinement Toolbox (O’Brien et al., 2019) in ArcMap, requiring three types of data: the stream channel, a valley bottom polygon, and the buffered bankfull channel, as described below.

2.4.7.1. Stream channel

The tool only produces reliable results with reach lengths over 200 m, since it depends on calculations that consider the stream length and channel width (O’Brien et al., 2019). Considering this, the original stream channel was combined into a single polyline that was subsequently split into 200-m parts to be used as input. The coefficient values obtained were later connected back to the original 50-m reaches.

2.4.7.2. Valley bottom polygon

This layer was obtained using the “FluvialCorridor” toolbox (Roux et al., 2015) in ArcMap and the 2-m resolution DEM. It worked by comparing the fluvial network elevation with a detrended elevation model that was created by the workflow around the stream network input by the user.

First, the workflow required the depressionless DEM; then the “Int” tool was used to convert the float values of the raster into integers, the format required for the input. After these steps, the “ValleyBottom” tool was run and the parameters described below were filled in according to the catchment features.

“Smoothing tolerance” smooths the random variation from the stream network considering the tolerance value (in meters) input by the user. Taking into account the total length of the stream lines (36-km for study area 1 and 53-km for study area 2), the value of 1000-m produced the best results.

“Small buffer size” selects the area around the stream network for which the elevation values will be analysed, i.e., the stream elevation. For this work, the value of 50 m was chosen because it provides coverage for most of the fluvial network.

“Large buffer size” selects the area around the fluvial network that will undergo an elevation analysis to determine the valley bottom, creating the altimetric reference plane. To make sure that the whole valley bottom would be included, an overly large value of 3000 m was chosen.

“Disaggregation step” is the length of the segments into which the stream will be divided after being smoothed, therefore the value selected is 50-m since that was the length used for classification.

“ThresholdMIN” is the lowest value for which a cell will be included inside the valley polygon, used to compensate for unusually low values that might come up in the detrended DEM. Since no strong negative values were present, it remained 0.

“ThresholdMAX” is the maximum value that a cell can hold in the detrended DEM. The value used was the difference between discharge in spring (high-flow) and summer (low-flow) as determined by Nilsson et al. (1991) for the Vindel River: 2.4 m. However, the tool only takes integer numbers, so the value used was 2.

“Aggregation distance” is the maximum distance allowed between polygons in order to aggregate them and avoid discontinuities, determined empirically. The best results were achieved with 300-m as an input.

“Minimum area” is the lowest area that a polygon must have in order not to be deleted during the cleaning phase of the tool; the chosen value of 10 m² was empirically obtained.

“Minimum hole size” is the minimum size that a hole in the polygon can have in order not to be cleaned up, empirically determined as 10 m².

“Smoothing valley bottom” is the length used to smooth the random variation, empirically obtained as being 100-m.

2.4.7.3. Buffered Bankfull channel

The buffered bankfull channel polygon was obtained by tracing the active channel manually following orthophotos (Swedish Mapping, Cadastral and Land Registration Authority, 2020C). The polygon was buffered afterwards using twice the DEM resolution, resulting in a 4-m buffer.

The polygon is buffered afterwards using twice the DEM resolution, resulting in a 4m buffer.

2.4.8. Drainage area

The “Snap Pour Points” tool was used in ArcMap with two inputs: the flow accumulation raster and a constructed shapefile with the reaches’ midpoints. A 25-m range was selected to find the point around the reach with the highest value. Afterwards, these values were sampled using the “v.distance” tool in QGIS, and the formula [raster value] × [raster resolution]² was applied in the “Field Calculator” in ArcMap, returning the drainage area values.

2.4.9. Local drainage density

The local drainage density was obtained by dividing the drainage area values by the length of each individual reach (Danesh-Yazdi et al., 2017).

2.5. Statistical Analyses

All the data was stored as datasets using pandas v0.23.4 (McKinney, 2010) and geopandas v0.8.2 (GeoPandas developers, 2013). The results were obtained after 1000 runs of the estimator and subsequently analysed using statistical tools described below.

2.5.1. Feature selection

When many different predictors are available, not all of them may be relevant during the fitting and predicting process. In particular, correlated variables result in longer processing times and lower estimator performance. Therefore, it was important to perform a feature selection to remove the redundant ones from the analysis.

Hierarchical clustering analysis (Bridges Jr, 1966) works by computing the correlation matrix between the variables and, based on these values, grouping them hierarchically. The tighter the correlation between variables, the lower they are positioned in a dendrogram; as the tree increases in height, the criterion for clustering is loosened, grouping more features together. Using hierarchical clustering for feature selection has been shown to improve estimator processing time without losing accuracy (Butterworth et al., 2005), so this method was selected.

Using the cluster.hierarchy.dendrogram function in SciPy (Virtanen et al., 2020), a dendrogram was created (see Appendix III). To better understand if and how each variable was correlated, a heatmap was also generated by plotting the correlation results from the dendrogram against each other (see Appendix IV). The height chosen to cut the dendrogram and select the predictors was 1, empirically obtained to remove redundant features while still preserving accuracy. One variable from each cluster was then selected as input to the estimator, resulting in 61 predictors used as input to the Random Forest Classifier.
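This selection step can be sketched with SciPy as follows; the toy data, the Pearson-correlation distance, and the Ward linkage are illustrative assumptions, not necessarily the exact settings used in the thesis:

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
# Toy predictor matrix: 8 features, the last one a near-duplicate of the first
X = rng.normal(size=(200, 8))
X[:, 7] = X[:, 0] + rng.normal(scale=0.05, size=200)

# Distance between features: 1 - |correlation|, so redundant pairs are close
corr = np.corrcoef(X, rowvar=False)
dist = 1 - np.abs(corr)
np.fill_diagonal(dist, 0.0)
Z = hierarchy.ward(squareform(dist, checks=False))

# Cut the dendrogram at height 1, the threshold used in the thesis
cluster_ids = hierarchy.fcluster(Z, t=1.0, criterion="distance")

# Keep one representative feature per cluster (here: the first one seen)
representatives = {}
for i, cid in enumerate(cluster_ids):
    representatives.setdefault(cid, i)
selected = sorted(representatives.values())
print("clusters:", cluster_ids.tolist(), "selected:", selected)
```

The near-duplicate feature lands in the same cluster as the original, so only one of the two survives the selection, which is exactly the redundancy removal the text describes.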

2.5.2. Feature importance and permutation importance

To find out which predictor best describes each label, the Random Forest Classifier offers two tools for analysing the contribution of the features. The predictor importance was obtained using the function feature_importances_, based on the Gini importance, also known as mean decrease impurity (MDI; Breiman, 2002). It works by calculating how much a predictor has reduced the variability in the classes. However, the MDI can be biased by high cardinality in a variable (Strobl et al., 2007), that is, the number of unique values it contains. This can result in a predictor receiving a higher rank even though it did not contribute as much to the distinction of classes.

The alternative that does not suffer from this bias is the permutation_importance. Breiman (2001) described permuting a feature column after a first score (baseline metric) is obtained, then reassessing this score after the permutation, several times. The difference between the score before and after the permutation is the permutation importance.

Since several variables had low importance values close to each other, a threshold of 3% was applied to the feature importance results to reduce the number of predictors analysed, plotting only those with an importance above the threshold.

The final list of the most important variables was based on the output of both metrics.
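Both metrics are available in scikit-learn; a minimal sketch on synthetic data (the real input is the 61-predictor reach table):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# MDI-based importance: fast, but biased toward high-cardinality features
mdi = clf.feature_importances_

# Permutation importance: score drop after shuffling each column, measured
# on held-out data, so it is free of the cardinality bias
perm = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)

for i in range(X.shape[1]):
    print(f"feature {i}: MDI={mdi[i]:.3f}  "
          f"permutation={perm.importances_mean[i]:.3f}")
```

Note that the MDI values sum to 1 across features, while permutation importances are absolute score drops, so the two rankings are compared rather than the raw numbers.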

2.5.3. Data Complexity Measures

The Data Complexity Measures (DCMs) analysis was used to understand the impact of the feature selection on the dataset complexity, using the R package ECoL v0.3 (Lorena et al., 2019).

For classification problems, complexity can be studied to estimate how difficult the task of separating the data points into classes is (Lorena et al., 2019). It is usually affected by three factors: class ambiguity, data sparsity and dimensionality, and the complexity of the boundaries for class separation. Class ambiguity occurs when the classes cannot be discriminated by the data; sparsity returns information about an incomplete or sparse dataset; and the separating boundary concerns how small the minimum description of a class can be while remaining representative of the class.

DCMs can be grouped into six families: feature-based, linearity, neighbourhood, network, dimensionality, and class imbalance, with values always ranging from 0 to 1. The measurements described in detail in this section are those that differed by more than one and a half standard deviations between the pre-feature-selection and post-feature-selection analyses, with each family better addressing a different type of factor. A full list with all the definitions can be found in Appendix II.

2.5.3.1. Linearity measures

These measures indicate whether the classes can be linearly separated, which relates directly to the complexity of classifying a dataset. The sum of the error distance by linear programming (L1) sums the distances between misclassified examples and the linear classification boundary. A value equal to zero means that the problem is linearly separable and therefore not complex.

The error rate of the linear classifier (L2) returns the error rate of a linear Support Vector Machine (SVM) classifier. A value closer to 1 means more errors and, consequently, larger complexity. The non-linearity of a linear classifier (L3) detects the presence of concavities in the class boundaries, with values closer to 0 indicating a less complex problem.

2.5.3.2. Dimensionality measures

This measure analyses data sparsity in a simplified manner, according to the number of samples and the data dimensionality. The average number of features per dimension (T2) is the ratio of the dataset's dimensionality to its number of records, reflecting data sparsity. When there are many predictors for few examples, the distribution is sparse, hindering the effectiveness of an estimator, which is represented by values closer to 1.
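The thesis computed these measures with the R package ECoL; as an illustrative Python analogue (not the package's implementation), the two simplest measures, L2 and T2, could be sketched on synthetic data as:

```python
# Sketch: two of the complexity measures described above, computed directly.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# L2: error rate of a linear SVM fitted to the data (closer to 1 = harder)
svm = LinearSVC(random_state=0).fit(X, y)
L2 = 1.0 - svm.score(X, y)

# T2: dimensionality relative to the number of examples (features / samples);
# many predictors for few examples pushes this toward 1 (sparser data)
T2 = X.shape[1] / X.shape[0]
print(round(L2, 3), round(T2, 3))
```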


2.5.4. Accuracy and ROC-AUC

The accuracy_score metric computes the fraction of test samples that were correctly predicted by the algorithm (Pedregosa et al., 2011). The Receiver Operating Characteristic – Area Under the Curve (ROC-AUC) score was obtained with the roc_auc_score function, which measures whether the estimator assigns a higher score to a randomly chosen positive instance than to a negative one (Fawcett, 2006). The ROC curve is described in detail in section 2.5.6.
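A minimal sketch of the two scikit-learn metrics named above, on made-up labels and probabilities (not thesis outputs):

```python
# Sketch: accuracy_score on hard labels, roc_auc_score on class probabilities.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
acc = accuracy_score(y_true, y_pred)     # fraction of correct labels: 4/6

# ROC-AUC needs class probabilities; one-vs-rest handles the multiclass case
proba = np.array([[0.8, 0.1, 0.1], [0.3, 0.5, 0.2], [0.1, 0.8, 0.1],
                  [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.5, 0.3, 0.2]])
auc = roc_auc_score(y_true, proba, multi_class='ovr')
print(acc, round(auc, 3))
```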

2.5.5. Confusion matrix and performance assessment

For classification tasks, accuracy alone is not enough to determine the success of an estimator – it is also necessary to look at which classes were predicted well, and which were not. This becomes even more important when the number of examples per class is disproportionate, as is the case with the present dataset. The confusion matrix is used in such cases since it provides a class-level accuracy assessment (Stehman, 1997; Provost & Kohavi, 1998). With it, it is possible to tell how many examples were predicted as belonging to each class, and how many actually were in that class.

According to Fawcett (2006), when there are only two classes, the predictor has two choices: the example either belongs to the positive class (true positive – TP) or to the negative class (true negative – TN). In the case of a wrong classification, two types of error can be made: the predictor can guess that the example belongs to a class when it does not (type I error, false positive – FP), or it can guess that the record does not belong to a class when it actually does (type II error, false negative – FN).

The true positive rate (TPR, sensitivity, recall) is the proportion of correctly predicted positives among all positive examples, measuring how well the estimator identifies the TPs; it is calculated as TP/(TP+FN). The true negative rate (TNR, specificity) measures how well a predictor identifies the TNs among all negative examples, obtained as TN/(TN+FP).

The false positive rate (FPR) is the ratio of falsely predicted positives, calculated as FP/(FP+TN). Finally, the precision, obtained as TP/(TP+FP), indicates the proportion of returned positive results that are TPs. These results were presented in performance assessment tables.
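The per-class metrics above can be derived from a confusion matrix in a one-vs-rest fashion; a sketch on toy labels (not the thesis data):

```python
# Sketch: recall, specificity, FPR and precision per class, one-vs-rest,
# read off a scikit-learn confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred)

metrics = {}
for k in range(cm.shape[0]):
    TP = cm[k, k]                    # correctly assigned to class k
    FN = cm[k].sum() - TP            # class-k examples sent elsewhere
    FP = cm[:, k].sum() - TP         # other examples labelled as k
    TN = cm.sum() - TP - FN - FP
    metrics[k] = {'recall': TP / (TP + FN),
                  'specificity': TN / (TN + FP),
                  'FPR': FP / (FP + TN),
                  'precision': TP / (TP + FP)}
print(metrics)
```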

The same analysis was repeated for the dataset with three classes, combined and for each study area separately. Having another class enabled the estimator to be wrong in two different ways, e.g., for an example that belongs to class 1, it can wrongly guess that it is from either 2 or 3.

For this work, having information for all the classes was important; it would not be enough to represent the results only in terms of positives and negatives, even in the binary case. Therefore, the performance assessments were done separately for each class. The results were plotted using the median values of each matrix element after all the runs were finished.

2.5.6. Receiver Operator Characteristics – Area Under the Curve

In the Receiver Operator Characteristics (ROC) curve, the TPR of a classifier is plotted against its FPR, showing how the trade-off of the algorithm happens for each class. The area under the ROC curve is the AUC value, returning an evaluation of how much discrimination was provided for a pair of classes, ranging from 0 to 1. It is an analysis that complements the confusion matrices and performance assessment tables, showing how TPR behaves according to the FPR.

In the case of a multiclass problem, the classes are organized in a pairwise manner, e.g., plotting one class' performance against the other two, once for each class (Hand & Till, 2001), with the AUC values calculated from the ROC curves for each label. With several ROC curves present, averaging them can enhance the comprehension of the results (van Asch, 2013). The micro-average curve is made by combining the contributions of all classes to calculate the average of the curves, while the macro-average is based on the calculation of each class' curve separately, which are later averaged (Yang, 1999).
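The micro- and macro-averaging described above can be sketched with scikit-learn, binarising the labels so each class is scored one-vs-rest (toy probabilities, not thesis outputs):

```python
# Sketch: macro- vs micro-averaged AUC over one-vs-rest class indicators.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

y_true = np.array([0, 1, 2, 0, 1, 2, 0, 2])
proba = np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7],
                  [0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6],
                  [0.6, 0.3, 0.1], [0.3, 0.3, 0.4]])
Y = label_binarize(y_true, classes=[0, 1, 2])   # one indicator column per class

macro = roc_auc_score(Y, proba, average='macro')  # average of per-class AUCs
micro = roc_auc_score(Y, proba, average='micro')  # pool all decisions first
print(round(macro, 3), round(micro, 3))
```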

Following Saito & Rehmsmeier (2015), an imbalanced dataset might produce an over- or underestimation of FPs, depending on which class is unbalanced. Calculating the numbers of TPs and FPs from the ROC curve helped show that class imbalance cannot be detected from the curve alone. To verify this assertion, the method was adapted for multiclass datasets: the ROC curves were analysed individually, with the class represented in the curve considered positive and the other two classes, combined, negative. In the binary case, the macro-average curve was analysed, since it provides a better comprehension of estimators trained on imbalanced datasets (Ferri et al., 2003).

I arbitrarily chose a point on each curve where TPR = 0.9 and recorded its corresponding FPR value. This TPR was then multiplied by the number of examples (𝑛0) of the positive class in that dataset, returning the total TP at that point (Equation (2)).

𝑇𝑃 = 𝑇𝑃𝑅 × 𝑛0 (2)

After that, two different calculations were made. The first used the combined number of examples from the other two classes (𝑛1 and 𝑛2) as the negative class, multiplying this value by the FPR recorded from the graph. This was the 𝐹𝑃𝑖 of the datasets with class imbalance present (Equation (3)).

𝐹𝑃 𝑖 = 𝐹𝑃𝑅 × (𝑛1 + 𝑛2) (3)

In the second calculation, the two negative classes were treated as if they together had the same number of examples as the positive class (2 × 𝑛0), creating an artificially balanced version of the original datasets. This number was also multiplied by the FPR of the point on the curve, returning the 𝐹𝑃𝑏 for an ideal balanced dataset (Equation (4)).

𝐹𝑃 𝑏 = 𝐹𝑃𝑅 × 2 × 𝑛0 (4)

Both FP values obtained were then compared to show how the imbalance in the dataset produced an overestimation or underestimation, depending on which class is overrepresented.
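Equations (2)–(4) amount to simple arithmetic once a point on the curve is fixed; a sketch with made-up class counts (not the thesis' class sizes):

```python
# Sketch of Equations (2)-(4): TP and FP counts at a chosen ROC point,
# for the imbalanced dataset and its hypothetical balanced counterpart.
def tp_fp_at_point(TPR, FPR, n0, n1, n2):
    TP = TPR * n0                  # Equation (2)
    FP_i = FPR * (n1 + n2)         # Equation (3): real class proportions
    FP_b = FPR * 2 * n0            # Equation (4): balanced counterfactual
    return TP, FP_i, FP_b

# Illustrative counts: 100 positives vs 300 + 200 negatives
TP, FP_i, FP_b = tp_fp_at_point(TPR=0.9, FPR=0.15, n0=100, n1=300, n2=200)
print(TP, FP_i, FP_b)
```

With an overrepresented negative side (𝑛1 + 𝑛2 > 2𝑛0), 𝐹𝑃𝑖 exceeds 𝐹𝑃𝑏, illustrating the over/underestimation discussed above.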

2.5.7. High error frequency analysis

An opportunity that presents itself when ML is used in a geoscientific context is the possibility of locating the estimator's errors on maps, opening the way to find further patterns, absent from the predictors used, through overlap with other maps.

In the subsequent analysis, considering the accuracy results, the combined datasets were used to investigate the pattern of hits and misses. These were plotted as heatmaps over the study areas, according to how many times each example was wrongly or correctly classified over the 1000 folds. This provided a visual indication of the places where the classifier performed worse and better, based on the frequency of misses and hits.
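The hit/miss counting behind these heatmaps can be sketched as follows (synthetic data; 20 splits stand in for the 1000 folds, and the mapping step is omitted):

```python
# Sketch: how often each example is misclassified across repeated
# train/test splits - the per-reach frequency plotted in the heatmaps.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
miss = np.zeros(len(y))           # times each example was wrongly classified
seen = np.zeros(len(y))           # times each example appeared in a test set

splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=0)
for tr, te in splitter.split(X, y):
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    pred = rf.fit(X[tr], y[tr]).predict(X[te])
    miss[te] += pred != y[te]
    seen[te] += 1

# Miss frequency per example; these values would then be mapped per reach
freq = np.divide(miss, seen, out=np.zeros_like(miss), where=seen > 0)
print(freq.max())
```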

Afterwards, these examples were separated between those that were never right and those that were never wrong when a label was given. The difference in median values between only-hits and only-misses for each predictor was assessed, and the features that showed a difference over 20% were selected for a second analysis. Next, a Mann-Whitney U test (Mann & Whitney, 1947) was performed to find out whether the medians of the distributions were significantly different, alongside the predictors already pointed out by the permutation importance. All these variables were then plotted against one another in scatter plots to identify the thresholds at which the classifications would always be right or wrong.
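The Mann-Whitney U test step can be sketched with SciPy on synthetic predictor values (the group sizes and distributions here are illustrative, not the thesis data):

```python
# Sketch: testing whether a predictor's values differ between always-hit
# and always-miss examples, via a two-sided Mann-Whitney U test.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
hits = rng.normal(loc=1.0, scale=0.5, size=80)    # predictor at hit reaches
misses = rng.normal(loc=0.5, scale=0.5, size=40)  # predictor at miss reaches

stat, p = mannwhitneyu(hits, misses, alternative='two-sided')
print(p < 0.05)   # significant difference in distributions at the 5% level
```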

3. Results

After training and testing, the Random Forest Classifier provided adequate results for the reach classification. Accuracy is reported in section 3.2 and the most contributing features, in section 3.1. The complexity results can be found in section 3.3 while the error analysis is reported in 3.4.

3.1. Highest contributing variables

Figure 6 shows the relative importance of all the features that were above the threshold, compared by type of dataset. One of the predictors with the highest contribution to explaining the data was the channel slope, though with a considerable difference between datasets. Two other variables were above the threshold and common to all of them: median_100_roughness and std_100_planformcurve. Dataset 1 had eight more predictors besides the three common to all datasets: max_100_slope, mean_100_slope, mean_100_aspect, max_100_planformcurve, min_100_planformcurve, max_100_profilecurve, skew_100_flowdir, and mean_100_flowdir. Dataset 2, in turn, had five more predictors that fulfilled the criteria: median_1100_tpi, ldd_skm, mean_1100_elevation, min_1100_elevation, and X_coords.

The combined multiclass dataset had six more predictors above the threshold: skew_100_flowdir, mean_100_flowdir, median_100_planformcurve, min_1100_elevation, mean_1100_elevation, and X_coords. The combined binary dataset, on the other hand, also had skew_100_flowdir, mean_100_flowdir, median_100_planformcurve, and median_100_tpi.

The higher value of the slope importance in the combined binary dataset (20±2.7%) can be interpreted as this predictor being more effective in separating R from SF reaches, but less important when the separation involves the L class (11.1±1.4%). This can be verified in the boxplots of this variable by class (Figure 7), where the values for the L class visibly overlap strongly with those of the SF class. Consequently, this importance was reduced by the presence of the L label.

The same pattern happens with the median roughness, which had an importance of 8.6±1.4% in the binary dataset and 5.8±1.2% in the multiclass one. However, the importance of the planform curvature standard deviation in the 100-m buffer was 7.1±1.2% in the multiclass dataset and 6.1±1.2% in the binary one. Further analysis can be found in the Discussion section.

The boxplots of the three highly ranked predictors in both analyses show that the distribution is different depending on the classes and dataset analysed. In the std_100_planformcurve, the values are similar between dataset 1 (Figure 8B) and the combined dataset (Figure 8A) but differ from dataset 2 (Figure 8C) in scale and outliers.

The permutation importance also had the same three features at the top for all datasets: channel slope, median_100_roughness, and std_100_planformcurve (Figure 9).

The cardinality (Table 3) varied depending on the predictor and dataset analysed, usually keeping the proportion of unique values relative to the maximum number of examples.


The same situation is observed in the median_100_roughness boxplots (Figure 10), but not in the channel slope where the distribution looks homogeneous between the three (Figure 7).

Zoomed-in versions of the plots, which better differentiate values close to zero, can be found in Figure 11 and Figure 12.

Figure 6. Relative predictor importance, with the standard deviation plotted for each predictor. Only the predictors that contributed over 3% to the data prediction were selected, from all datasets.

The higher importance values in dataset 1 (Figure 6) show that the examples it contains were more easily separable by the most important features, even though this was not as visible in dataset 2. This can be seen in how the channel slope is distributed in the zoomed-in boxplots (Figure 12C), where the SF class interquartile range overlaps with the L label, while the R label's first quartile overlaps with the third quartile of the SF class.


The planform curvature boxplots (Figure 11) show that the distinction between the L and SF classes seen in the combined datasets and dataset 1 does not happen with the same intensity in dataset 2. The boxplots of the median roughness reinforce this notion (Figure 10), with the 95% confidence intervals of the SF and R labels closer to each other. The L class in dataset 1 stands apart from the other classes, but the distance is smaller in dataset 2, where an overlap between L's Q1 and SF's Q3 can be seen.

Figure 7. Boxplots with the data of channel slope in (A) combined dataset, (B) dataset 1, and (C) dataset 2, separated by class.


Figure 8. Boxplots with the data of the standard deviation of the planform curvature in the 100-m buffer in (A) combined dataset, (B) dataset 1, and (C) dataset 2, separated by class.

Figure 9. Top predictors according to permutation importance analysis using the (A) combined multiclass dataset, (B) combined binary dataset, (C) dataset 1, and (D) dataset 2.


Figure 10. Boxplots with the data of the median of the roughness in the 100-m buffer in (A) combined dataset, (B) dataset 1, and (C) dataset 2, separated by class.

Table 3. Number of unique values (cardinality) of each predictor above the threshold obtained by the feature importance separated by type of dataset. The maximum value is 1856 in the combined multiclass dataset, 1555 in the combined binary dataset, 632 in dataset 1, and 1224 in dataset 2.

Predictor                  Combined multiclass  Combined binary  Dataset 1  Dataset 2
max_100_slope              886                  735              277        610
mean_100_slope             1851                 1555             632        1220
mean_100_aspect            1852                 1555             632        1220
median_1100_tpi            491                  455              167        364
max_100_planformcurve      666                  541              230        437
min_100_planformcurve      737                  597              259        479
ldd_skm                    1835                 1538             626        1210
max_100_profilecurve       748                  607              242        507
median_100_tpi             756                  753              295        535
median_100_planformcurve   275                  275              150        126
mean_1100_elevation        1856                 1555             632        1219
skew_100_flowdir           1856                 1555             632        1224
mean_100_flowdir           1856                 1555             632        1224
std_100_planformcurve      1850                 1555             632        1219
min_1100_elevation         636                  572              198        438
X_coords                   1853                 1552             630        1223
median_100_roughness       1741                 1526             611        1145
slope                      1142                 1089             399        744


Figure 11. Zoomed in boxplots of the standard deviation of the planform curvature for (A) combined dataset, (B) dataset 1, and (C) dataset 2, separated by class. The interval displayed is from 0 to 4.

Figure 12. Zoomed in boxplots of the channel slope for (A) combined dataset, (B) dataset 1, and (C) dataset 2, separated by class. The interval displayed is from 0 to 0.01.


3.2. Accuracy analysis

The values for the accuracy and ROC-AUC scores differed from each other and between datasets (Figure 13). In the ROC-AUC evaluation, the multiclass dataset outperformed the others, followed by dataset 2, while dataset 1 and the combined binary dataset had the same performance. The combined binary dataset had the best accuracy score, followed by the multiclass one, then dataset 2. Dataset 1 was evaluated as the least accurate.

Figure 13. Accuracy and AUC scores.

The confusion matrix from dataset 1 (Figure 14A) showed that when the estimator guessed that an example belonged to the L class, it was wrong 17.6% of the time for examples that belonged to the SF class, being correct the rest of the time. When it classified an example as R, the algorithm never misclassified examples from the L or SF labels, but only 38% of the R examples were correctly predicted, with the rest classified as SF. For the predicted SF label, the estimator was wrong 13% of the time misclassifying R examples as SF, and 7% wrongly attributing the SF label to L examples.

Figure 14. Confusion matrix of the analysis using (A) dataset 1 and (B) dataset 2.

The confusion matrix of dataset 2 had some differences compared to dataset 1 (Figure 14B). The classifier was wrong 7.7% of the time when predicting the L label: 5.1% labelling an SF example as L and 2.6% guessing that an example from the R label was L. For the R class, it incorrectly assigned this label to SF examples 11.1% of the time, never wrongly classified L examples, and labelled correctly 88% of the time. However, only 18% of the R records were correctly classified, with most of them (79%) wrongly assigned to the SF class. The SF label was correctly classified 81.5% of the time, with only 1.8% of its examples receiving a different class; L records were incorrectly labelled as SF 1% of the time, and R records 17.4% of the time.

For the combined binary dataset (Figure 15B), the confusion matrix showed that when the predictor guessed the SF label, it was wrong 13% of the time, and when it did the same for R, it was wrong 9%. Moreover, it predicted the right class for R examples only 45% of the time, while for SF examples it was incorrect only 1.2% of the time.

The confusion matrix of the combined multiclass dataset (Figure 15A) indicated that the estimator was wrong 11.4% of the time when guessing that an SF example belonged to the L label, and 3.2% when assigning the L label to an R example. 15.4% of the L examples were misclassified as SF, but none as R. When predicting the R class, it was right 90.3% of the time; when it was wrong, it guessed that SF examples belonged to this class, never assigning it to an L example. Nevertheless, it failed to classify the R examples correctly 57.6% of the time, wrongly assigning them the SF label (54.5%) or the L label (3%). The estimator also predicted that an R record belonged to the SF class 12.9% of the time and an L record, 1.2% of the time. 2.8% of the SF labels were wrongly classified as L.

Figure 15. Confusion matrix of the analysis using the (A) multiclass and (B) binary dataset.

Comparing the performance tables of the classifier for the two datasets (Table 4) shows that the R class performed better in dataset 1 than in dataset 2, with higher recall, specificity, and precision, and lower FPR. Meanwhile, the SF and L classes performed better in dataset 2, with a reduction in FPR and an increase in the other metrics.

Table 4. Performance assessment of the classification task for dataset 1 and 2.

            Dataset 1           Dataset 2
            R     SF    L       R     SF    L
Recall      0.38  0.96  0.66    0.18  0.98  0.95
Specificity 1     0.52  0.97    0.99  0.55  0.98
FPR         0     0.47  0.03    0.01  0.44  0.01
Precision   1     0.8   0.82    0.89  0.81  0.92

The performance assessment (Table 5) indicated that, with three labels instead of two in the dataset, the recall and precision were reduced for both R and SF, while the specificity increased and the FPR decreased for the SF label, remaining unchanged for the R class. However, compared to datasets 1 and 2 separately, the values for the L label seem to have been averaged between them, while the R recall improved at the cost of some precision. The FPR of the SF label was highest in the binary dataset and lowest in the combined multiclass one, which also had higher specificity.


Table 5. Performance assessment of the classification task using the combined binary and multiclass datasets.

            Combined binary dataset   Combined multiclass dataset
            R     SF                  R     SF    L
Recall      0.45  0.99                0.42  0.96  0.87
Specificity 0.99  0.45                0.99  0.65  0.97
FPR         0.01  0.54                0.01  0.35  0.03
Precision   0.91  0.87                0.90  0.84  0.85

In the ROC curves of dataset 1 (Figure 16), the L curve had the same performance as the R one (AUC = 0.96), while the SF class had a lower value (AUC = 0.91). The micro- and macro-average curves both had the same value (AUC = 0.95). Dataset 2 (Figure 17) showed a perfect performance for the L label (AUC = 1); the R class had the lowest result of all the datasets (AUC = 0.93), with the SF class performing slightly better (AUC = 0.94). The micro-average outperformed the macro-average (AUC = 0.97 and 0.96, respectively). In the combined binary dataset (Figure 18), the R and SF curves had the same value (AUC = 0.94), with a different one for the micro- and macro-averages (AUC = 0.95). In the combined multiclass dataset (Figure 19), the best performance was from the L label (AUC = 0.99), visible from the curve's position closer to the north-western corner of the graph. The R and SF labels had the same performance (AUC = 0.95), with higher values for the micro- (AUC = 0.97) and macro-average (AUC = 0.96).

Figure 16. ROC curves of dataset 1, including micro-average, macro-average, and the AUC values of all curves.


Figure 17. ROC curves of the dataset 2, including micro-average, macro-average, and the AUC values of all curves.

Figure 18. ROC curves of the combined binary dataset, plus micro-average, macro-average, and the AUC values of all curves.


Figure 19. ROC curves of the combined multiclass dataset, plus micro-average, macro-average, and the AUC values of all the curves.

The TP and FP analysis based on the multiclass ROC curves (Table 6) showed that the overrepresented class (SF) had an underestimation of how many FPs it returns, since the values of FPi are lower than FPb. The opposite happened to the L and R classes, since they are underrepresented, with FPb lower than FPi. This was also observed in the combined binary dataset.

Table 6. Calculated values of TP, FPi, and FPb at the selected points in the ROC curves for each dataset. The balanced data considers the same number of examples for all the classes, while the imbalanced data maintains the original class proportions (Table 1 and Table 2).

                      class  FPR    TPR  TP    FPi  FPb
Dataset 1             L      0.1    0.9  96    52   21
                      R      0.15   0.9  97    78   62
                      SF     0.29   0.9  375   62   241
Dataset 2             L      0.007  0.9  174   7    3
                      R      0.18   0.9  198   180  79
                      SF     0.21   0.9  729   87   340
Combined multiclass   L      0.02   0.9  271   31   12
                      R      0.15   0.9  295   229  98
                      SF     0.19   0.9  1104  119  466
Combined binary       R/SF   0.18   0.9  295   220  59

3.3. Complexity analysis

The DCMs (Figure 20) showed that after the feature selection process, the complexity was reduced in some measures for both combined datasets (Figure 20A and Figure 20B), as seen in L1, L2, L3, and T2. In dataset 1 (Figure 20C), the complexity was reduced in L2, L3, and T2, while in dataset 2 (Figure 20D) this only happened in T2.


Figure 20. Data complexity measures of (A) combined multiclass dataset (sd=22.5), (B) combined binary dataset (sd=20.8), (C) dataset 1 (sd=4.8), and (D) dataset 2 (sd=4.9). The closer to 1 a measurement is, the more complex is the classification task, with exception to the C1 measurement where a higher value means that there is less class imbalance. The hatched bars depict the values obtained before the feature selection process in order to compare the change in complexity caused by removing some of the clustered variables from the dataset. Each measurement definition and explanation can be found in Appendix II.

3.4. Frequent error locations

The maps with the locations of most errors (Figure 21 and Figure 23) showed that their frequency and the number of distinct locations were higher in the combined binary dataset, for both study areas. The same pattern was also observed in the maps with the correctly classified examples (Figure 22 and Figure 24). Another difference between the study areas was that in Bjurbäcken the errors and correct classifications appear more spread along the stream lines.


Figure 21. Comparison between the location of the most frequent errors in Hjuksån for (A) combined multiclass and (B) combined binary dataset. Images from Swedish Mapping, Cadastral and Land Registration Authority, National Land Survey of Finland, Esri, HERE, Garmin, METI/NASA, USGS. Coordinate system: SWEREF99.

Figure 22. Comparison between the location of the most frequent correctly classified reaches in Hjuksån for (A) combined multiclass, and (B) combined binary dataset. Images from Swedish Mapping, Cadastral and Land Registration Authority, National Land Survey of Finland, Esri, HERE, Garmin, METI/NASA, USGS. Coordinate system: SWEREF99.


Figure 23. Comparison between the location of the most frequent errors in Bjurbäck for (A) combined multiclass dataset, and (B) combined binary dataset. Images from Esri, HERE, Garmin, METI/NASA, USGS. Coordinate system: SWEREF99.

Figure 24. Comparison between the location of the most frequent correctly classified reaches in Bjurbäck for (A) combined multiclass dataset, and (B) combined binary dataset. Images from Esri, HERE, Garmin, METI/NASA, USGS. Coordinate system: SWEREF99.

An overlap between the highest frequency of errors in both study areas and demonstration restoration sites was also observed (Figure 25 and Figure 26; Gardeström et al., 2013).


Figure 25. Hjuksån's heatmap and 2010 restoration location. Strong dark grey represents the hotspot of errors found using the combined multiclass dataset, and the identified white reach with black outline shows the stream that was under demonstration restoration in 2010. Database supplied by Esri, NLS, NMA, and USGS. Coordinate system: SWEREF99.


Figure 26. Bjurbäcken’s heatmap and 2010 restoration location. Strong dark grey represents the hotspot of errors found using the combined multiclass dataset, and the identified white reach with black outline shows the stream that was under demonstration restoration in 2010. Database supplied by Esri, NMA, and USGS. Coordinate system: SWEREF99.

The difference between the distributions of predictor values for the examples that were only hits and those that were only misses was inspected with the Mann-Whitney U test (Table 7). The results show that the difference was significant in all the predictors listed for the multiclass dataset, except for median_1100_planformcurve and median_100_profilecurve. In the binary dataset, the exceptions were the slope and median_100_profilecurve predictors.

The scatter plots of the three main predictors showed that slope values lower than 0.0175, median roughness below 0.05, and standard deviation of the planform curvature under 5 created zones where it was more likely that the class of an example would never be correctly guessed – the misclassified records had values under these thresholds. Other predictors showed up as relevant for separating hits from misses, but only those that produced visually separable scatter plots were plotted.

Table 7. p-values of the Mann-Whitney U test for the binary and multiclass datasets.

predictor                  p-value (binary)  p-value (multiclass)
skew_1100_elevation        1.2×10⁻⁷          0.03
std_1100_elevation         0.002             10⁻⁵
max_1100_roughness         0.03              0.01
median_1100_planformcurve  0.002             0.42
skew_100_roughness         3.6×10⁻¹¹         3.6×10⁻⁸
median_100_roughness       0.02              2.1×10⁻⁶
min_100_aspect             0.0002            0.007
median_100_flowdir         0.0006            0.0007
max_100_profilecurve       0.02              3.3×10⁻⁵
slope                      0.07              5.9×10⁻⁷
ldd_skm                    1.7×10⁻⁷          0.02
median_100_profilecurve    0.06              0.06


Figure 27. Scatter plots of the selected predictors' data for each example, plotted against one another with the data from the multiclass (A) and binary (B) datasets. Orange crosses are the errors and blue circles the correctly classified examples.

4. Discussion

The main goal of this study was to identify the features that best describe different process domains in the Swedish fluvial context, using ML. The results showed that, using the feature importance metric, several different predictors were selected throughout the analysis, differing depending on the dataset (Figure 6). However, only three were always above the 3% threshold and ranked high in the permutation importance analysis (Figure 9): the channel slope, the median of the roughness in the 100-m buffer, and the standard deviation of the planform curvature in the 100-m buffer.


The high frequency of values close to zero for the L class in the slope predictor was due to the way the raster was built. As detailed in the Materials and Methods section, since LiDAR does not return measurements over water, the raster was interpolated and smoothed using the ground readings closest to the water surface. This resulted in the non-null values almost always belonging to reaches labelled as either SF or R.

Considering how the DEM was created using LiDAR data, it is natural that the same pattern of the slope predictor was obtained again in the boxplots: values were closer to zero in the L class and less so for R and SF. Since the 95% interval around the median did not have an overlap between the R and SF label, this explains why the predictor is also relevant for the binary dataset distinction between classes.

This result is corroborated by the water surface slope having previously been used to differentiate between stream classes (Yang, 1971). Montgomery & Buffington (1997) identified bed slope as one of the main controls in their reach classification, even though the values overlapped between some channel types. The shear stress and sediment supply act on the channel bed surface, adjusting it; this is in turn reflected in the channel morphology, a notion supported by the organization of the relative roughness following the slope. However, water surface and channel bed slope are not necessarily identical. Palucis & Lamb (2017) pointed out that, in most natural streams, the variables controlling the state of the channel bed covary with the slope, which, in steep channel beds, makes it the prominent driver of morphology.

For the median roughness and the standard deviation of the planform curvature, an important detail is that these predictors were obtained through zonal statistics over a 100-m buffer. This means that, for some extensions of the river channel, the terrain surrounding the reach was included as well. Nonetheless, the same pattern of data distribution found in the slope predictor was repeated in the boxplots of these features: a higher concentration of values close to zero in the L class, and a more diverse distribution with higher values for R and SF.

The roughness represents the variation of local-scale topography (Pollyea & Fairley, 2012). This TAM has been successfully identified as an indicator of channel-bed morphology in mountainous areas, being able to distinguish between step-pools and riffle-pools using LiDAR data (Cavalli et al., 2008). In Arctic mountains, it has been recognized as one of the important features for the classification of landforms related to erosion and deposition processes (Mithan et al., 2019).

The planform curvature delineates processes of flow accumulation by convergence and divergence (Hu et al., 2021) and has been found to have an influence on river meandering (Güneralp & Rhoads, 2009). In downslope flow, slope erosion processes can be impacted by the planform curvature (Conforti et al., 2010), an effect also found in accumulation and denudation when the convergence and divergence are uniform (Minár et al., 2013).

The distribution metric (DM) identified as important for this feature was the standard deviation; it is therefore plausible to expect the planform curvature to have a lower dispersion of distribution in the case of the L class. This showed that the values inside the 100-m buffer were not widely dispersed, suggesting that there was little oscillation between flow divergence and convergence for this label. The R and SF classes, on the other hand, expressed higher standard deviation values (Figure 11), opposing the L label's results, which was expected as the characteristics of these process domains implied more dispersion. This scattering in the distribution of values was not conveyed by any other DM.
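The point that dispersion can separate classes where other distribution metrics cannot can be illustrated with a minimal synthetic example. The two curvature samples below are hypothetical, not thesis data: both are centered on zero (so medians are nearly identical), but the R-like sample oscillates far more strongly between convergence and divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical planform-curvature values inside a 100-m buffer:
# L-like reaches cluster near zero; R/SF-like reaches swing between
# convergence (negative) and divergence (positive).
curv_L = rng.normal(0.0, 0.05, 1000)
curv_R = rng.normal(0.0, 0.50, 1000)

# The medians barely differ, so that DM cannot separate the classes,
# while the standard deviations differ by roughly an order of magnitude.
med_gap = abs(np.median(curv_L) - np.median(curv_R))
std_ratio = np.std(curv_R) / np.std(curv_L)
```

This mirrors the boxplot pattern described above: central tendency alone hides the class difference that the standard deviation exposes.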

The remaining predictors listed as important by the feature importance method were not considered as relevant by the permutation importance method. As previously mentioned, the
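Such a divergence between the two methods is a known behavior of Random Forests: impurity-based (Gini) importance is computed on the training data and can be inflated for continuous, uninformative predictors that deep trees overfit, whereas permutation importance measured on held-out data is not. The sketch below reproduces this with entirely synthetic data (the predictor names and label rule are illustrative assumptions, not the thesis' variables).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
slope = rng.normal(size=n)   # informative predictor
noise = rng.normal(size=n)   # uninformative continuous predictor
X = np.column_stack([slope, noise])
# Noisy labels so the trees grow deep and overfit the noise feature.
y = (slope + 0.5 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

impurity = rf.feature_importances_           # Gini importance (training data)
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
```

Here the noise feature receives a clearly nonzero impurity importance but a permutation importance near zero on the test split, which is why the permutation-based ranking is generally the more trustworthy of the two.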
