2006:057 CIV
M A S T E R ' S T H E S I S
Spatial Relations in the Geochemistry of Swedish Lakes
Anders Selberg Magnus Westerstrand
Luleå University of Technology MSc Programmes in Engineering
Civil Engineering
Department of Civil and Environmental Engineering Division of Geographical Information Technology
2
Abstract
The detection of patterns in the spatial distribution of biotic and abiotic data is central in environmental monitoring programs. In this report we aimed to understand geographic aspects in data describing Swedish lake chemistry. We looked at two different aspects.
First we investigated if it is possible to divide Swedish lakes into geographic regions based on their geochemical composition. This was done in four ways: 1) using Kriging interpolation, maps on the lake chemistry were produced and interpreted visually, 2) trends in the data are established using Principal Components Analysis – PCA, and examined geographically, 3) the lakes are formed into clusters based on the lake chemistry data, and these clusters where examined geographically, and 4) lake chemistry data was compared with forest ecoregions. By doing so, we showed that it is difficult to divide the lakes into geographic regions where the lake geochemistry is significantly different among regions. However, the mountain area in the north‐
western part of the country was identified as one region. Some gradients where also identified, mainly an east‐west gradient in Norrland and a north‐south gradient in the middle of Sweden. The forest ecoregions in Sweden did partly represent the geographical differences in lake chemistry represented by principal components. The lakes where also grouped in clusters based on the principal components, and these clusters showed similarities with the forest ecoregions. However some adjustment of the ecoregions borders is needed to better fit the variations in water chemistry.
Second, we tried to understand the relationship between lake catchment and lake data. The GIS data describing catchment characteristics came from various sources and for some of them, we developed new methods for conversion to useful variables.
Due to collinearity and missing data in the variables representing catchment characteristics, we have not been able to test what effect the lakes’ catchments have on the lake chemistry. However, the quality of the geographical data that serves as base for these variables has been examined, and is generally considered to be of sufficient quality to be used in the analysis.
Sammanfattning
Att upptäcka mönster i den geografiska spridningen av biotiska och abiotiska data är viktigt för miljöövervakningsprogram. I den här rapporten presenteras försök att förstå geografiska aspekter i vattenkemidata rörande svenska sjöar. Två aspekter har undersökts.
För det första har vi undersökt möjligheterna att dela in Sveriges sjöar i geografiska regioner baserat på deras geokemiska innehåll. Detta har gjorts på fyra sätt: 1) Med Kriging som interpoleringsmetod har kartor på vattenkemin tagits fram och tolkats visuellt, 2) Trender i kemidata tas fram med principalkomponentsanalys, PCA, och undersöks geografiskt, 3) Sjöarna grupperas i kluster baserat på vattenkemin och dessa kluster undersöks geografiskt och 4) Vattenkemidata jämförs med skogsekoregioner. Genom detta har vi visat att det är svårt att dela upp sjöar i geografiska regioner där vattenkemin skiljer sig signifikant mellan regioner. De svenka fjällen, i nordvästra delen av landet, har dock identifierats som en region.
Vissa gradienter har också identifierats, framför allt en väst‐östlig gradient i Norrland och en nord‐sydlig i mitten av landet. Skogsekoregionerna i Sverige kan delvis beskriva skillnader i sjöarnas geokemi, här representerat av principal komponenter. Vidare uppvisar klusterbildningarna vissa likheter med ekoregionerna. Men för att regionerna bättre ska beskriva variationen i geokemi bör gränserna justeras något.
För det andra har vi försökt fastställa samband mellan sjökemi och företeelser inom avrinningsområde. De geografiska data som ligger till underlag för avrinningsområdenas företeelser kommer från olika källor och för vissa har vi utvecklat nya metoder för att omvandla dessa till användbara variabler. På grund av kolinjäritet och för många nollvärden hos vissa parametrar i dessa data har vi inte kunnat testa sambandet mellan dem och vattenkemidata signifikant. Vi har också undersökt kvaliteten på det geografiska data som legat till grund för avrinningsområdenas företeelser. Dessa data bedöms i det stora hela hålla tillräckligt god kvalitet för att användas i analysen.
4
Foreword
This report is the result of our senior‐year project, the master thesis on the MSc programme in Environmental Engineering, performed at Luleå University of Technology.
It combines Biogeochemistry and Geographical Information Science, fields that are represented at the two departments Civil and Environmental Engineering and Applied Chemistry and Geosciences. In this project, the two fields are used together as an interdiciplinary approach to answer one set of questions. Therefore, only one report has been written, although the thesis will result in two different specializations on the same MSc program. Selberg has focussed on the geographic aspects of each part in the project, and Westerstrand has focussed on the geochemistry and its correlations.
First and foremost we wish to thank our project coordinator Frauke Ecke, for her help and guidance.
We would also like to thank the following persons for helping us in our work in one way or another: Mats Nilsson at Sveriges lantbruksuniversitet, Tomas Lindberg at Sveriges geologiska undersökning, Johan Kristoffersson at Skogsvårdsstyrelsen, Kerstin Vännman and Lennart Widenfalk at Luleå University of Technology.
As the knowledge about the elements behaviour increases, the understanding of the ecosystems gets easier to grasp, hopefully this thesis is one step further in that journey.
Table of contents
1 Introduction 8
1.1 Background 8
1.2 Lake geochemistry 8
1.3 Geographic information 9
1.4 Problem Statement 11
1.5 Objective 11
2 Method 12
2.1 Lake data 12
2.2 Geochemical regionalization 13
2.2.1 Regionsalization 13
2.2.2 Processing of the initial data 14
2.2.2.a Outliers 14
2.2.2.b Detecting Outliers in the Lake Chemistry Data 15 2.2.2.c What we did 16
2.2.2.d Transformations 17
2.2.3 Principal Components Analysis 17
2.2.4 Interpolation 18
2.2.4.a Polynomial Interpolation 18
2.2.4.b Geostatistics and Kriging 19
2.2.5 Confirming regions using Kruskal‐Wallis test 21
2.2.6 Cluster Analysis 21
2.2.7 Comparison with forest ecoregions 22
2.3 Catchment and Lake Interaction 22
2.3.1 Catchment parameters 22
2.3.1.a Catchments and area 22
2.3.1.b Land use 27
2.3.1.c Watercourses and ditching 28
2.3.1.d Roads 28
2.3.1.e Altitude 29
2.3.1.f Forest parameters 29
2.3.1.g Clear‐cutting 32
2.3.1.h Bedrock 34
2.3.1.i Soils 34
2.3.2 Choosing parameters 37
3 Results 39
3.1 Regionalisation 39
3.1.1 Maps of each parameter 39
3.1.2 PCA and Clusters 40
3.1.3 Forest ecoregions and lake chemistry PCs 45
3.2 Catchment and lake interaction 48
6
4 Discussion 49
4.1 Regionalization 49
4.2 Catchment and lake interaction 50
5 Conclusions 52
6 References 53
1 Introduction
1.1 Background
The Swedish Environmental Protection Agency (EPA) is responsible for
environmental monitoring in Sweden. One monitoring program studies the water chemistry in Swedish lakes run by the Swedish University of Agricultural Sciences.
Most of the water quality data used in this thesis originates from the monitoring databases. The Environmental Monitoring Program in Sweden has conducted series of measurements since the early 1970’s.
Within the framework of the Swedish Environmental Monitoring Program and the VEGA project, (also financed by EPA) research is being conducted at the division of Applied Geochemistry at Luleå University of Technology, in order to study the connections between geochemistry in lakes and the occurrence of water vegetation.
One of the objectives for the VEGA project is to find plant‐based indicators, suitable for reflecting the geochemical state of lakes. These indicators will be used for determining the water quality, in addition to the data derived from the ordinary monitoring program.
The detection of patterns in the spatial distribution of biotic and abiotic data is central in environmental monitoring programs. Geographical differences make it more difficult to establish one single system of plant‐based indicators for all of Sweden. Hence, it is essential for the VEGA project to establish a system that is adjusted to fit the different hydro geochemical regions.
The lake biogeochemistry is not only affected by natural conditions, such as soil type, watershed size and topography. Anthropogenic factors influence as well. Two factors that have been identified, but not quantified, are forest clear‐cutting and ditching (Ecke, 2004).
Göransson (2003) studied the lake water chemistry, its synchronous variation and its dependence to ecoregions and catchment characteristics using data from 2000.
However, in order to answer the questions in this project, more detailed studies are needed.
1.2 Lake geochemistry
Geochemistry, the science about the chemical composition on the earth and solar system and their chemical development i.e. the motion of elements, is an essential part of environmental research. Without knowing what concentrations that occur naturally in nature, it is impossible to tell how or if there are any anthropogenic influence.
8
There are many natural and anthropogenic factors at a broad range of spatial scales, that could have an effect on lake geochemistry. Natural factors are both global and local. Natural global factors are e.g. temperature and precipitation, natural local factors could e.g. be bedrock, soil composition, forest type, size of the lake and area of the catchment.
Human impact on lake geochemistry have become more visible over the last years.
While global factors have bean heavily discussed over the last years local factors seems overshadowed. Global factors could e.g. be sulphide and nitrogen discharge from the industry. Local factors are e.g. represented by clear cutting, ditching, agriculture and traffic.
1.3 Geographic information What it is
Geographic information connects non‐spatial data to locations. This allows for analysing far more than the non‐spatial data itself. The geography can be subject to analysis, as well as connections between spatial and non‐spatial data. Geographical analysis is the study of how spatial location (and consequently shape, size, dispersion, movement etc) affect different processes and relations (Eklundh, 2001).
There are two fundamentally different approaches for representing geospatial data:
as objects and as fields. The object approach considers the world as built up by discrete objects like points, lines and areas. These objects are populated by attributional non‐
spatial data giving them their characteristics. The field approach views the world as continuous fields of attribute data, without any spatial demarcations. Where in the object approach, the attributes are linked to the object extents, fields provide attribute values at any location. (Heuvelink, 1998).
How it works
Geographical information systems, or GIS, are generally the type of computer software that is used for managing geographical data. There are many different definitions of a GIS, depending on who is using it. Generally, a GIS should include tools for storage, analysis and presentation. Often, however, the entire chain of data management can be kept within a GIS, thus including both production and editing, as well as more peripheral issues like meta data management, web interaction etc.
(Eklundh, 2001; www.gis.com, 2005)
Figure 1- A schematic picture
of a GIS (www.gis.com)
In a GIS, object representations are normally in vector format. They are built up by point vertices, which are
connected by topological relations. A database table connects the object with its attribute data. Fields on the other hand, are normally represented by rasters. They provide values of an attribute’s field in a uniform grid where each cell in the grid has one value of the attribute (Eklundh, 2001). Figure 1 presents the common overlay functionality of a GIS, here with both object and field representations.
Why we use it
Geographical information systems are necessary tools in this project for three main reasons. First, much of the input data used are in formats intended for GIS use. The information in their data cannot be extracted unless using a GIS. Second, much of the operations on and processing of the data are purely geographical and cannot be achieved in any other way than using specialized tools such as GIS. Among others, these include data overlay operations, analyses of spatial correlation structures, geographic interpolations, and proximity analyses. These are all operations whose results depend on the geographical location of data. Third, the visual aspects of presenting data are essential, too. In fact, one step in this project considers the visual interpretation of interpolated maps.
Geographical error behaviour
When regarding geographical data, it is common to talk about different types of errors, such as positional, which considers the purely geometric component, thematic, concerning classification correctness, temporal, the actuality and dating of the data, and logical, dealing with geometric topology and database issues. To this, we can add the errors in the attribute values, which are often unrelated to geography.
Error is normally defined as the difference between estimations and the actual, true value of a parameter. In other words, in order to establish the error, the true value has to be known. Because of always being a generalisation of the real world, geographical data are by nature erroneous. Therefore, an exact true value cannot be established, and what is considered the truth must be based on the issues at hand.
Uncertainty can be regarded as the distribution of errors around an unknown, true value, and the estimation can then be seen as a realisation of a stochastic variable belonging to this distribution. It is often both easier and cheaper to determine, monitor and model uncertainties than errors. Uncertainty is in its turn composed of accuracy and precision, where precision relates to the repeatability of the measurement, and accuracy is related to its bias. (Eklund, 2001; Heuvelink, 1998)
The output of a GIS operation is a function of its input values. Inaccurate or uncertain inputs will affect the outcome. In chapter 3, we will try to establish different types of quality descriptions of the data used in this project. Having information on uncertainties, such as RMSEs (Root Mean Square Errors), biases and classification errors, helps deciding whether and how to use the data and to perform further analyses. It also provides a possibility to determine the uncertainty and
10
reliability of the results. There are essentially two ways of determining error (or uncertainty) propagation in a GIS operation: analytically and stochastically. Analytical error propagation means regarding a GIS operation as a mathematical operation, using math laws to establish uncertainty in the output, based on input and algorithm. A stochastic approach, on the other hand, uses simulations. Realisations of the input parameter distributions are drawn based on their uncertainty, and the GIS operation is repeated for all combinations of input values. By examine how the outcome varies, one can establish how errors in input values affect the output.
Establishing all components for analytical error propagation can easily become very complex, especially when incorporating geometrical components, whereas the stochastic approach often requires lots of time and computing power. (Eklundh, 2001; Heuvelink, 1998, Zhang and Goodchild, 2002)
1.4 Problem Statement
Knowledge on what effect water chemistry has on aquatic plants is essential in order to understand how plants serve as indicators for water quality. Sweden can be divided into several regions, e.g. climatic and geological. For analyzing the water chemistry of Swedish lakes it also is important to know how and if the lakes can be divided in regions. This would have implications on plant indicator monitoring.
Natural and anthropogenic processes and conditions near a lake often affects the lakes geochemistry. Quantifying the effect of different factors on the water geochemistry is therefore important.
The use of geographical data in statistical analyses can be very useful. But it is also important to know how they work and what their limitations are. Consequently, an analysis of the geographical data itself is in order.
1.5 Objective
The purpose of this thesis is to understand the spatial pattern of the geochemical conditions in Swedish lakes and to search for connections between catchment characteristics and lake geochemistry. This is done by trying to answer the following questions. Is it possible to divide the lakes into geographic regions where the lake geochemistry is scientifically different from one region to another? How would those regions look? Could the Swedish forest ecoregions be such regions? Are there any relationships between the lake geochemistry and the lakes’ catchment characteristics?
Is the geographical data available of good enough quality to be trusted?
2 Method
This chapter describes how the two parts of the thesis were performed. First, we begin with a brief introduction to the lake data. Second, we present the regionalization procedure along with relevant theory. The final part of the chapter describes the work with the catchment/lake interaction analysis.
2.1 Lake data
The analysed lakes in this project are part of the Swedish national survey of 2000, a survey on water chemistry and microinvertebrates in Swedish lakes and streams (Wilander et al. 2003). They were selected based on both geographic and hydrochemical considerations, such as variance, size and dispersion.
The lakes were relatively equally distributed over the country. Amongst the 3,025 lakes included in the study, one third, or 1,206, were analysed for trace metals. The different aluminium fractions where only analysed in 317 lakes, due to the criteria that they should have a pH under 6.5 in the Swedish national survey 1995 to be included 2000. This could be compared to the fact that 75% of the lakes in the survey had a pH lower than 6.95.
In the Swedish national survey 2002, 34 variables were analyzed. In this thesis we will take a closer look on 26 of them (Table 1):
12
Table 1 – The 26 chemical variables included in this study Mean Minimu
m
Maximum Std.Dev.
pH 6,63 4,29 8,11 0,69
KOND 25 MS 7,75 0,52 60,7 9,78
Ca [mg/l] 8,96 0,06 108 13,6
Mg [mg/l] 1,45 0,04 13,39 1,56
Na [mg/l] 4,27 0,09 64 6,44
K [mg/l] 0,84 0,04 7,98 0,98
ALK 0,36 -0,14 5,37 0,63
SO4 [mg/l] 4,32 0,22 86,45 6,98
Cl [mg/l] 5,82 0,11 131,74 11,84
F [µg/l] 0,12 0,02 1,01 0,1
TOT N [µg/l] 271 1 547 158
TOT P [µg/l] 14 1 115 15
Si [mg/l] 1,59 0,01 6,5 1,24
TOC [µg/l] 11,4 0,2 51,2 7,3
Fe [µg/l] 264 7 998 253
Mn [µg/l] 32 1 344 42
Cu [µg/l] 0,6 0,1 5,5 0,5
Zn [µg/l] 2,3 0,1 23,4 2,5
Cd [µg/l] 0,014 0,001 0,166 0,015
Pb [µg/l] 0,3 0,01 3,22 0,34
Cr [µg/l] 0,38 0,03 3,97 0,3
Ni [µg/l] 0,7 0,03 8,95 0,82
Co [µg/l] 0,125 0,006 1,281 0,15
As [µg/l] 0,38 0,02 4,49 0,36
V [µg/l] 0,33 0,02 2,93 0,32
Al [µg/l] 102 1,6 548,5 104,6
The report following on the national survey of 2000 has attached a list of the analysis methods, detection range and relative uncertainty in annex 2 of Wilander et al.
(2003). It shows that all uncertainties in the chemical variables are in the range of one to ten percent, except for Cd, Cr and Tot‐P (15, 20 and 15 percent respectively).
2.2 Geochemical regionalization
One of the objectives of the VEGA project is to establish a monitoring system of plant indicators for geochemical conditions in lakes. One question that arises when constructing such a system is geographic variability. This first part of the report deals with trying to find areas, geographical regions, throughout Sweden where the dataset indicates some type of intrinsic similarity, e.g. with low variability within such regions and high variability among them.
2.2.1 Regionalization
There are a number of possible ways of trying to find large‐scale regionalization.
Before doing so, however, it needs to be defined what is meant by a region. Mayhew (1997) states that there is not one single definition of regionalization, other than being the demarcation of areas in such a way that the variability within them is little while
each area is different from the others. Finding what separates one region from another must therefore be based on the question at hand. In this case, the interest lies in finding out if the datasets themselves can reveal any regionalization based on similarities in the sampled measurement data. This would then be the definition of regions used in this project:
A region is a geographic area where a sufficient number of hydrogeochemical variables show low variability within the regions, but high variability among them.
There are three major issues in the process of regionalization. First is what data the analysis should be based on. Both the question what variables to include, as well wether to use the initial variables, groups of variables or categories based on the variables, must be addressed. Second is what method to use. Either the regions are demarcated manually, or some type of automatic/numeric/computerized method is used. The third issue is if and how the regions will be verified. If the demarcation method do not provide any verification, then the statistics of the regions are examined.
Different methods for finding or demarcating regions exist. For example, filter‐based methods are often used in image processing (e.g. in remote sensing) , and methods that identify borders and gradients are common in this group. They require variables to be represented as fields instead of point objects. (Östman, A., 2004; Lillesand, M. et al 2004)
Here, attempts will in a first step be made to establish regions by visually interpreting the data after having geographically interpolated them. With the help of Principal Components Analysis, major trends in the data are explored further. Also, using Cluster Analysis, the lakes are grouped into categories based on their chemical properties and then these are examined geographically.
Göransson (2003) investigated if the parameters from the 2000 survey can be linked to already predefined regions, ecoregions, but found no such connection. Here, we examine how the ecoregions correspond to both the result from the Principal Component Analysis as well as the Cluster categories.
2.2.2 Processing of the initial data
2.2.2.a Outliers
In order to receive reliable results from the interpolations, it must be made sure that the data provided are as proper as possible. Outliers and extreme values (measurement errors, extreme on‐site conditions etc.) in the data must thus be identified. If they are not part of the true distribution, they have to be excluded.
14
While examining the data closer it became clear that some values where very extreme. One example of this is iron, where the concentration varied from 4 μg/l to 1,536,560 μg/l. A very high variation, a variation that could not exist without anthropogenic influence. Thus, it became obvious that the extreme values were very important to examine.
These extreme values are often called outliers when you are handling statistical data.
Outliers are something that have been puzzling scientists for a very long time, and thus there are many definitions to what an outlier is.
One try to define outliers is found in the program that where used to make statistical calculations in this thesis, Statistica 6.0.
Outliers are atypical (by definition), infrequent observations; data points which do not appear to follow the characteristic distribution of the rest of the data. These may reflect genuine properties of the underlying phenomenon (variable), or be due to measurements errors or other anomalies which should not be modelled.
Unfortunately, there is no precise way to define an outlier with mathematic statistics.
Instead, there are several not so precise to choose from. When understanding this it becomes important to know your data, and it’s underlying phenomenon’s.
Our data consisted of several variables that all are examined one bye one for outliers.
Although most of the parameters are chemical concentrations, they depend on different phenomena. Most of them originates from the decomposition of the soil and bedrock in the catchments area. Precipitation, evaporation and temperature are other variables amongst many that change the chemical composition of the lakes. Many of them are also influenced by anthropogenic sources.
Knowing that the chemical composition is dependent on several variables it becomes hard to choose one single model to find outliers with. One would not want to exterminate to many because of the risk to eliminate whole regions, and not to few because it would not give a clear, maybe even a false, picture of the substances distribution over Sweden.
2.2.2.b Detecting Outliers in the Lake Chemistry Data
There are several statistical methods to detect outliers, of which we have tried a few in this project. At first, we tried to use simpler models like box plots, scatter plots and line plots. All showed a vast amount of outliers and most variables in such numbers that it was not trustworthy.
When taking a closer look at the data it became obvious that most of the data was not normally distributed. And after plotting the data to find out what type of distribution it belonged to or was closest to, we discovered that most variables distributions was pretty similar to the lognormal distribution. The long tail that shows up to the right for many variables indicates that we have a large group of outliers. Or that the nature behaves in such a way that it is lognormal distributed.
For example, it could and probably do exist smaller and extremer environments in a long country with big north south distances like Sweden. This would create a type of lognormal distribution when looking at the data for whole country. Environments that could play such a role is mire and peat land, the land below the high coast line and unusual types of bedrock.
Other things that have impact on the water geochemistry, pushing it in the lognormal direction, are anthropogenic sources. This could e.g. be road salt, liming or over fertilization, all of them quite frequently occurring but in different amounts at different parts of the country.
After concluding that it is possible to have this kind of distribution naturally, it was clear that we could not exclude all of the outliers that was noticed in the different box and scatter plots that was done earlier.
An even closer look on the data was necessary, and when examining the raw values for the different variables it looked like there where jumps in the data values.
Amongst the highest values there where groups that seamed to distance them self from the rest of the data. Investigating some of these groups made it clear that it sometimes appeared to be data values that could be called real outliers and should be removed. However, some times this was not the case. Instead it seamed like the values that formed the grope where a natural gathering of points depending on something in that geographic area. Thus, they should not be removed.
Our main interest is not trying to regionalize the natural variations in Sweden, but the actual. Considering all this, no earlier classification of Swedish water chemistry makes it possible to discover and eliminate the outliers that are important for this thesis.
2.2.2.c What we did
After all attempts to find a statistical way to handle outliers (see Appendix A), quite good knowledge about the data was gained. Deeper looks into some of the diagrams, such as box plots, line plots, scatter plots and combined plots was done and a combination of statistical and graphical methods was used to exterminate some outliers (see Appendix A).
16
The most important thing here was not to delete any regions. All values were always checked against statistical methods and against its coordinates to make sure that that did not happen.
One thing ‐ that was not done in this thesis but still is worth mentioning ‐ is that although all the extreme values have been deleted there exists outliers in different regions within the normal values for the whole of Sweden.
2.2.2.d Transformations
As mentioned, most of the parameters in the lake survey have exponential distributions. The rest present a normal distribution on their logarithmic values, except for one parameter, pH, which is more or less normally distributed. This is likely due to the fact that pH represent the negative logarithm of H+ concentration.
Although some forms of geographic interpolation, particularly some Kriging methods, need a normal distribution of the data to perform accurately, this is not the case for Ordinary Kriging used here (Wackernagel, 1995). Nor is it for polynomial interpolation.
2.2.3 Principal Components Analysis
Principal components analysis (PCA) involves a mathematical procedure that transforms a set of correlated response variables into a smaller set of uncorrelated variables called principal components (PCs) (Dallas E J 1998). In our case, this means that it might be possible to transform the many different chemical variables into just a few PCs.
PCA is a kind of multivariate regression that maximizes the explanation of the variation within the original set of variables using a few new vectors, or PCs. The first PC is selected so that it explains as much of the variation in the data as possible.
Each succeeding component then accounts for as much of the remaining variation as possible.
Each PC explains some of the variation; the first explains the most, the second less than the first, the third even less and so on until 100% of the variation is accounted for. However, for it to be any point in using PCA not all PCs can be used. To determine how many PCs to use, it is important to know how high percentage of the variance they explain i.e., to look at their eigenvalues and what loadings each PC contains of i.e. the correlation coefficient between a variable and the formed PC. The best case is to get as high percentage explained as possible with few PCs. A PC that explains a low percentage usually only consists of random noise.
There are different ways to calculate the PCs. If possible, the best way is to use the covariance‐variance matrix. The covariance‐variance matrix is calculated from the real values, i.e. no normalization is done. This gives PCs that explain the true
dimensionality of the variables. However, this is not always what is best for the analysis. For example, there is a big difference in the variation between Ca and As but that does not necessarily mean that calcium’s variation is of bigger importance for the lake geochemistry than that of arsenic. Therefore using the covariance‐
variance matrix in this case would give PC highly dominated by Ca since it is calculated from the variance.
Therefore, to get a meaningful result of the lakes’ variation, we standardize the variables before performing the PCA. This results in a correlation matrix from witch the PCs are calculated. Hence, all our PCs in this report are calculated with standardized values, this is important to have in mind when looking at the results.
This means that all our variables are given an equal weight in the analysis.
When the PCs are calculated, they are used to calculate the principle component scores (PCS). This means in this report that each lake is given a specific value calculated from the PC. If there are several PCs it is probable that each explains a different percentage of the dataset, the first most the second a little less and so on.
However, in the software used in this project, Statistica 6.0, the PCs values are standardized to have a SD equal to 1.0. Doing so erases the differences in explanation percentage between the different PC. This could be unfortunate, e.g. if the first PC accounts for 75% of the variation and the second PC accounts for 15 % of the variation it could men that the first PC is much more important. This could be solved either by unstandardizing the PCS or by weighting the PC. There also are times where despite the different percentage explanation between the PC they could be equally important for the analysis, therefore it will be clarified what has been done to the PCS every time they are used in the thesis.
2.2.4 Interpolation
In order to display the data in a clear and explicit way, we chose to present them as sufaces. This was because of two things: the difficulty to discern them as points, and to avoid the noise. Interpolated surfaces have the ability to eliminate both, so. To transform the point‐wise measurements to continuous surfaces, they need to be interpolated. Most geographically based interpolation methods consider spatial autocorrelation, which is the phenomenon that locations close to each other are more likely to be similar than locations further apart.
After reviewing different methods, two were chosen for interpolating our data:
Polynomial Interpolation and Ordinary Kriging.
2.2.4.a Polynomial Interpolation
Polynomial interpolation, or polynomial regression, is the process of adjusting a polynomial expression z=f(x, y) to best fit a dataset of points P. For each new point p0
18
in space, the value of z is based on this expression. This is done in three dimensions.
The spatial x‐ and y‐coordinates are used as independent variables, and the measured hydrochemical variables are set as the dependent, z, and one at a time.
This yields smooth surface maps with low short‐distance variation. The higher the degree of the polynomial used, the better the polynomial should fit the points in P, but also with a higher risk of strong oscillation in between them. Finding the polynomial best fitted to P is done using the Method of Least Squares.
Being a geometrical method rather than a statistical, polynomial interpolation needs no prior analysis or assumption of the dataset. For this reason, the results are left without support, as they cannot be linked to any theory regarding distribution, spatial autocorrelation et cetera, of the data. The benefit is that in the resulting smooth surfaces, regions can quite easily be identified. Even so, these regions might not contain any meaningful information, and must be interpreted with caution.
The software used for performing this interpolation, ArcGIS (ESRI, 2004), allows changing a number of parameters when adjusting the polynomial, such as degree of the polynomial, or the number of included points (i.e. measurements) in the neighbourhood around each new interpolated point. Limiting the number of points to include in the estimation, to anything less than all of P, results in different coefficients in the polynomial. The advantage is that by using a smaller number of points, this method yields a surface more compliant to adapt to the dataset. The surface fits more accurately to the points, but this happens with the cost of losing its smooth appearance, and possibly more identifiable, regions. The interpolation is called global if the polynomial is fitted to the entire P, and local if a smaller section is used, where only a few points are included at a time.
2.2.4.b Geostatistics and Kriging
The other method for interpolating that has been used here is Ordinary Kriging.
Kriging uses an entirely different approach than polynomials in trying to predict values on new locations; it uses weighted mean. Using measurements pi in the neighbourhood of the new point p0 and weighting the points differently, Kriging gives p0 the unit sum of the points pi’s values. Kriging bases its weights on the spatial autocorrelation in the dataset. This is examined with a geostatistical analysis that has to be performed on the dataset before interpolating. (Eklundh, 2001; Wackernagel, 1995; Östman, 1995).
Geostatistics is a comprising term for a number of statistical methods that are used for modelling the spatial behaviour of geographically distributed variables. It assumes that a variable Z(x) can be divided into different components and expressed as
Z ( ) x = m ( x ) + ε ( x ) + ε ′
Equation 1 – The geostatistical model for spatially distributed variables (Eklundh, 2001)
where
Z(x) is the spatial attribute on location x,
m(x) is a deterministic component of Z representing large‐scale fluctuation, often expressed as a constant or a low‐degree polynomial,
ε(x) is a zero‐mean, spatially autocorrelated random component, representing small‐
scale fluctuation and
ε’(x) is a random, spatially non‐autocorrelated term that sums up the remaining variation that cannot be explained by the above; i.e. the noise term. Among other things, measurement errors are comprised in ε’. (Eklundh, 2001; Heuvelink, 1998).
Variograms are used for establishing the autocorrelational behaviour of ε(x), and there are two types. First, the experimental variogram, which is based on the actual data set.
It describes how the variability between pairs of points depends on the distance that separates them. Normally, these pairs are grouped into distance intervals, and the mean variability of each group is plotted against this distance (actually the semivariance γ is plotted, which is half the variance). Next, to be able to use this connection in a mathematically correct way, a continuous function called a theoretical variogram is fitted to the experimental. The theoretical variogram is used in an equation system that minimizes the estimation variance in each new interpolated point p0. By solving the system, the Kriging weights are established and ε calculated.
Combining ε with m, the statistically most likely z value for p0 is attained For more details, see Wackernagel (1995).
The variogram has essentially three components. First is the range, meaning the maximum distance for any autocorrelation. Next is the sill, which is the variogram’s top semivariance γ, and the γ on distances beyond the range. The third component is the nugget, representing γ when the pair‐wise distance approaches zero. See Figure 2.
20
Figure 2- Examples of one-dimensional distributions with different autocorrelational behaviour.
Bottom row shows corresponding theoretical variograms. Sill is equal in all three and only the middle variogram has both range and nugget. (Freely after Sharov, A.; 2004)
When producing maps of the hydrochemical components, both polynomial and Ordinary Kriging interpolation were used. As very little was known about their spatial characteristics, the interpolation parameters were based only on our judgement from variogram and visual interpretation. In order to in full use the geostatistical benefits of kriging, one should be able to explain the variogram models found, i.e. relate them to the actual dispersion of each variable. (Wackernagel, 1995) As these were unknown, we only used the variograms that were best fitted.
One of the core issues in finding regions is what actually will separate them from one another. In this work, they have been chosen subjectively, just by trying to interpret the interpolated maps. In other words, nothing more than qualified guesses.
Therefore, some statistical confirmation of the chosen regions is useful.
2.2.5 Confirming regions using Kruskal‐Wallis test
The Kruskal‐Wallis test can be used if the data have an other distribution than the normal distribution. Kruskal‐Wallis is a test to compare the means between several groups to se if there are significant differences between the group means.
2.2.6 Cluster Analysis
Cluster analysis is a way to examine if a number of observations can be divided into groups, or as it is called here: clusters. The variables in each cluster should have more in common than the variables in the different clusters.
Statistica 6.0 where used and the K‐means clustering method chosen. There are three different ways to make the cluster analysis, it was determined that the “choose observations to maximize initial between‐cluster distances”‐method was most appropriate. This means that the number of clusters are determined by the user. The algorithm 1) selects the first N number of cases to be the respective cluster centers; 2) uses subsequent cases to replace previous cluster centers if their smallest distance to any of the cluster centers is larger than the smallest distance between clusters; and if this is not the case, then 3) replaces subsequent cases will initial cluster centers if their smallest distance from a cluster center is larger the distance of that cluster center from any other cluster center.
2.2.7 Comparison with forest ecoregions
Göransson (2003) has investigated if the forest ecoregions in Sweden correlates with the variability of lake geochemistry, but she found no such connection.
Here we will try to compare these ecoregions with the regions we established in this project.
From the SNA ‐ Sveriges Nationalatlas ‐ website, maps depicting the forest ecoregions in Sweden were downloaded. They were imported to ArcGIS and the ecoregions were digitized. Here we used the same classification as Göransson (2003), with six classes: 1) Arctic/alpine, 2) Northern boreal, 3) Middle boreal, 4) Southern boreal, 5) Boreo‐nemoral, and 6) Nemoral. See Figure 3. The lakes were classified accordingly.
2.3 Catchment and Lake Interaction
The purpose of this second part of the project is to understand the connections between catchment characteristics (natural and anthropogenic) and lake
geochemistry. For a subset of the sampled lakes, we establish a database of catchment specific characteristics based on geographical data. Their lake geochemistry is represented by the measurement parameters as described in Table 1 and added to the database. We then analyse the statistics of the data. The complete database can be found in Appendix C – Data.
Figure 3- Forest ecoregions in Sweden
2.3.1 Catchment parameters
The choice of catchment variables to include is limited to what data has been available to the project at low cost. The anthropogenic influence to be analysed is represented by forest clear‐cutting, and forest ditching. The clear‐cutting was available from a specific clear‐cutting dataset. In addition, soil and bedrock maps were available and included. Another data source provided land cover type, roads, watercourses (from which also forest ditching was derived) and altitude. A group of data sets describing forest parameters, such as wood type, wood volume and tree age were also available and used in the analysis. Lake liming and road salting were initially supposed to be included in the analysis, but since no reliable databases exists, these two factors were excluded from the analysis.
2.3.1.a Catchments and area
There were several things to take into consideration when choosing what catchments to include in the study. First, the lakes had to be included in the national survey.
22
Second, the data set describing clear‐cutting was limited to Norrbotten, and so became the catchments. A large number of objects in the clear‐cutting data were undated or with incomplete geometries. These had to be avoided in order to maintain some reliability in the analysis and this further reduced the number of possible catchments to include. Through this, the number of suitable lakes was strongly limited. Overall, 28 lakes were found and included. These are presented in Table 2 and Figure 4.
Table 2 – The 28 lakes who’s catchments are used in the project
ID Name X Y Area (m2)
0 Luspentjaure 7273480 1665970 1,559,669 1 Sotträsket 7277020 1628340 13,718,066 2 Melaksträsk 7276690 1689510 7,914,849 3 Stortjärn 7309240 1705870 33,385,637 4 Finnträsk 7288080 1714820 44,930,145 5 Karkjärv 7375640 1796970 8,324,816 6 Raitajärvi 7391320 1825490 109,775,116 7 Katkojärvi 7371750 1834540 6,623,266 8 Kerunkijärv 7366690 1852330 2,835,081 9 Vaikkojaur 7495890 1746500 42,219,531 10 Kattekjaure 7345420 1685490 2,742,890 11 Kuossaure 7429050 1594480 27,569,720 12 Nujaure 7492650 1656810 54,670,806 13 [Unnamed] 7510270 1686700 1,855,042 14 Temminkijärv 7546300 1768100 18,772,939 15 Käymäjärvi 7494240 1807390 50,189,909 16 Tuorpunjaure 7416880 1658740 7,257,846 17 Lahnajärvi 7434600 1787510 12,730,261 18 Saittajärv 7487820 1778090 70,419,924 19 Udtjajaure 7361060 1642720 45,039,926 20 Sandträsk 7245210 1665790 28,562,557 21 Norra Kroktjärnen 7260180 1663070 2,312,759 22 Suobdek 7299400 1637840 83,715,482 23 Östra Gangsjajaure 7323620 1637900 299,129,344 24 Vuotnajaur 7393470 1722510 16,753,005 25 Karhulompolo 7395650 1845090 4,399,198 27 Pounujärvi 7525860 1716550 736,917
The areas were visually demarcated from interpreting elevation lines (equidistance 10 m) with some additional help from surface features such as watercourses, lakes and wet or dry areas. The elevation lines are part of the Vägkartan dataset, described in the Land use paragraph, below.
Figure 4 – The 28 catchments included in the study
Uncertainty
In order to assess the uncertainty in catchment size, the geometric accuracy in the elevation data was used, since this had been used in the interpretation. Two sources were available. First, an 8‐bit raster DEM with some 20 unique values, representing the elevation strongly terraced. Second, there were elevation lines for every tenth meter in height. Since the DEM and its terraces was considered an unnatural representation of the terrain and the elevation lines could be interpolated to a smoother surface, the elevation lines were chosen for this task.
According to the producer Lantmäteriet, the mean error (RMSE) in elevation data is 2.5 meters. The sources for these data are essentially profile plates from stereo image interpretation, whose values have been interpolated to an elevation grid. The elevation lines have then been extracted from this grid (Lantmäteriet, 2001). Due to interpolation, it is likely that this error is spatially correlated. Although 2.5 meters in
24
measured points (on elevation lines, that is), it is also likely to find higher uncertainties in between them, and higher the further away from lines one look. A reasonable assumption is that it does not exceed 5 meters in RMSE, since the vertical distance between each line is 10 meters.
An attempt to examine the variability in catchment size was made for the lake Käymäjärvi. First, the elevation lines were kriged to a most‐likely elevation surface (Figure 5 a). By using a theoretical variogram with a spherical model manually defined as [12.5*Spherical(1000)+3.125*Nugget], a representation of its prediction standard error as a function of distance to measures was also created (Figure 5 b).
Next, eight simulations of autocorrelated error surfaces of N(0,1) were produced.
They were achieved by using an iterative method as described by Heuvelink (1998), although a bit simplified, resulting in simulations of so‐called autoregressive error fields. Their effective ranges were set to 1500 meters, with a zero nugget. This was done in ArcGIS using the Raster Calculator tool. The autoregressive approach is suitable when the correlation structure is only implicitly known and no actual measurements exist, as in this case (Heuvelink, 1998). Each iteration follows the presumption that the autocorrelation satisfies the model of linear nearest neighbour dependency described in Equation 2.
Z
( )
x = μ+∑
qi(
Z( )
xi −μ)
+rW( )
xEquation 2 – A model of linear nearest neighbour dependency (Heuvelink, 1998)
where Z is the simulated error in x, the mean μ = 0, qi and r determines the autocorrelation structure, xi is in the neighbourhood of x (here the neighbourhood is the four nearest raster cells), and W is a spatially independent noise field of N(0,1). qi was set to 0.247 and r to 0.012. In the Raster Calculator, iteration could look like
Iter237 = (0.247 * FocalSum(Iter236, IRREGULAR, kernel.txt, DATA)) + 0.012 * Normal
This operation was run until the maximum (absolute) difference between two successive fields on any location no longer exceeded a predefined value, in this case set to 0.01. An example is presented in Figure 5 c).
Each of the eight simulations was multiplied with the standard error surface and added to the kriged elevation surface. This yielded eight simulations of the terrain, (see example in Figure 5 d)), each one considering the initial elevation data, their uncertainty as stated by the producer, and the autocorrelation of the error distribution. Elevation lines were extracted from each surface and interpreted, along with the same features as initially (watercourses, wet areas etc), to alternative catchments for Käymäjärvi. All eight catchments are superimposed in Figure 5 f).
a) b) c)
d) e) f)
Figure 5 – a) Elevation lines over a kriged most-likely elevation surface, b) Prediction standard error of the kriged surface, c) Example of an autocorrelated error surface, created autoregressively, d) One of the realisations of the terrain simulation, e) The simulation variations of the 230 m and 260 m elevation lines, f) All eight catchments delineated using the simulation surfaces.
Lake Käymäjärvi has one of the flattest catchments in the study. Its area is therefore assumed to be more sensitive to variations in topography than others are. The results showed a 14 % standard deviation, see Table 3, which should be more than the average catchment. However, the results were biased: all simulations gave lower value than the initial estimation. This can be attributed to the grainier surface produced by the simulated elevation compared to the initial, having a more pronounced cut‐off effect on the watershed. In the final analyses, however, only the original estimation was used.
26
Table 3 – Simulation Results on the Käymäjärvi catchment area Simulation Area (m2)
1 40 686 516 2 49 277 813 3 35 716 699 4 45 160 145 5 25 785 911 6 37 380 963 7 44 362 761 8 38 276 260 Original estimation 50 189 909 Simulation Mean 39 580 883 Standard deviation 7 181 546 Median 39 481 388
2.3.1.b Land use
The vector‐based data set Vägkartan, a product from the Swedish surveying agency, Lantmäteriet, serves as basis for the land use parameters. Vägkartan in its digital form is a vectorised version of what was originally a 1: 100 000 paper map product. Its land cover data is accessible in polygon format and contains 12 land use classes. A combination of the tools DISSOLVE, CLIP, IDENTITY and AREA in ArcGIS was used for summing up the area for each class within each catchment.
Uncertainty
Vägkartan was put together from a number of sources: 1: 50 000 maps, orthophotos, and various records. The uncertainty in the data differs from one area to another, based on what source and method was used in the production. As a paper product, its intended viewing scale is 1: 50 000. According to Lantmäteriet a mean point error of 20 meters is probable (Lantmäteriet, 2002).
Goodchild and Zhang (2002) have investigated both analytically and by simulations, how point errors on line and polygon vertexes propagate to their lengths and areas.
They show the influence that also the numbers of vertices, spatial correlation of errors and polygon shape have on the uncertainty of both length and area estimates.
In the case of spatial error correlation, calculation complexity grows fast with the number of vertices. On the other hand, if no spatial correlation is assumed, the variance can be expressed as simply as l/2 times the positional error variance, l being the average length of boundary segments. If calculating assumed area variances on the mean segment length of the Vägkartan feature boundaries, which is 52 m, using the mean vertex error of 20 m, the standard deviation is 520 m2. Compared with the mean and (the assumingly more representative) median feature area, 1 900 000 m2 and 11 000 m2 respectively, we conclude that a 5 % standard deviation is likely.
However, as this depends on boundary resolution rather than on feature area, elongated objects are more sensitive of vertex errors than those that have a small