
DEGREE PROJECT IN MATHEMATICS, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Early-Stage Prediction of Lithium-Ion Battery Cycle Life Using Gaussian Process Regression

LOVE WIKLAND


Degree Project in Mathematical Statistics (30 ECTS credits)
Master's Programme in Applied and Computational Mathematics
KTH Royal Institute of Technology, 2020

Supervisor at ABB Corporate Research: Mikael Unge
Supervisor at KTH: Jimmy Olsson


TRITA-SCI-GRU 2020:078 MAT-E 2020:041

KTH Royal Institute of Technology
School of Engineering Sciences (SCI)


Abstract


Prediktion i tidigt stadium av litiumjonbatteriers livslängd med hjälp av Gaussiska processer


Acknowledgements

I wish to express my great gratitude to Principal Scientist Mikael Unge, my supervisor at ABB Corporate Research, for giving me the opportunity and showing me the trust to carry out this project. Dr Unge’s advice, guidance, and encouragement have been key factors in the success of this study. I also wish to thank Associate Professor Jimmy Olsson, my supervisor at KTH, for his enlightening discussions with me and his inspiring input on the presented work. Finally, a special thanks to my good friend Patrik Amethier for immensely helpful notes on structure, content, and presentation.


Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Project Relevance and Aim
  1.2 Previous Research
  1.3 Research Question
  1.4 Data
  1.5 Outline

2 Theoretical Background
  2.1 Lithium-ion Batteries
    2.1.1 General
    2.1.2 Overview of Battery Ageing
    2.1.3 Relating to the Regressor of the Variance-Model
  2.2 Mathematics
    2.2.1 Overview of Statistical Learning
      2.2.1.1 General
      2.2.1.2 Prediction and Inference
      2.2.1.3 Estimating f - Parametric and Non-Parametric Methods
      2.2.1.4 Supervised and Unsupervised Learning
      2.2.1.5 Model Assessment
      2.2.1.6 Linear Regression
    2.2.2 Gaussian Processes
      2.2.2.1 Gaussian Distributions and Identities
      2.2.2.2 Definition and Representation
      2.2.2.3 Covariance Functions
    2.2.3 Gaussian Process Regression
      2.2.3.1 Making Predictions
      2.2.3.2 Bayesian Model Selection

3 Methodology
  3.1 Method Description
    3.1.1 Models Considered
    3.1.2 Train and Test Data
    3.1.3 Model Fitting
    3.1.4 Feature Selection
    3.1.5 Model Evaluation
  3.2 Rationale for Method Choice
  3.3 Method Evaluation

4 Results and Discussion
  4.1 Linear Model Replication
    4.1.1 The Feature of the Variance-Model
    4.1.2 Replications and Comparisons
  4.2 Gaussian Process Regression Models
    4.2.1 GP-discharge Model
    4.2.2 GP-full Model
    4.2.3 GP-custom Model
  4.3 Complete Model Comparisons

5 Conclusion
  5.1 Summary and Conclusions
  5.2 Future Research

A Appendix


List of Figures

1.1 Histogram of cycle lives for the 124 cells.
2.1 (a) Main components of a Lithium-ion battery. (b) Charging process. (c) Discharging process.
2.2 Overview and mapping of stress factors, degradation mechanisms, and degradation modes for Lithium-ion battery ageing.
2.3 The voltage-discharge curve for a representative cell. The grey area represents energy dissipation of the cell. Image design by Severson et al.
2.4 Sampled functions from Gaussian process prior distributions using the squared exponential (RBF) (upper), rational quadratic (middle), and Matern ν = 3/2 (lower) kernel functions. The black lines indicate the mean and the grey areas the standard deviation at any input value.
2.5 Sampled functions from Gaussian process posterior distributions using the squared exponential (RBF) (upper), rational quadratic (middle), and Matern ν = 3/2 (lower) kernel functions. The black lines indicate the mean and the grey areas the standard deviation at any input value.
4.1 Log10 cycle life against the regressor of the variance-model.
4.2 Predicted cycle lives to observed cycle lives for the three replicated models. (a) Variance-model. (b) Discharge-model. (c) Full-model. Points on the black line are perfectly predicted. Image design by Severson et al.
4.3 GP-discharge model using the Matern 3/2 kernel. Upper left: variance-model from above, included for comparison. Upper right: predicted residuals to observed residuals after GP prediction. Lower left: the two upper models combined. Lower right: same as lower left, including 95% confidence bars from residual prediction. Points on the black line are perfectly predicted.
4.4 RMSE (upper) and percent error (lower) contributions for the primary (red) and secondary (orange) test sets using the linear (L) or Gaussian Process (GP, Matern 3/2 kernel) discharge extension. Matern kernel chosen due to overall good test scores.
4.5 GP-full model using the Matern kernel. Upper left: variance-model from above, included for comparison. Upper right: predicted residuals to observed residuals after GP prediction. Lower left: the two upper models combined. Lower right: same as lower left, including 95% confidence bars from residual prediction. Points on the black line are perfectly predicted.
4.6 RMSE (upper) and percent error (lower) contributions for the primary (red) and secondary (orange) test sets using the linear (L) or Gaussian Process (GP, Matern kernel) full extension. Matern kernel chosen due to overall good test scores.
4.7 Univariately fitted GP models using the regressor as indicated by the x-axis. Black line shows the predictive mean while the grey area represents one standard deviation of that mean. (a) RBF kernel. (b) Matern 3/2 kernel. (c) Dot-product, 2nd degree.
A.1 Input features to Log10 cycle life.
A.2 Input features to cycle life residuals, y − ŷvar.
A.3 Univariate GP-models using the squared exponential (RBF) kernel.
A.4 Univariate GP-models using the rational quadratic kernel.
A.5 Univariate GP-models using a second degree dot-product kernel.
A.6 Univariate GP-models using the Exponential Sinus kernel.


List of Tables

3.1 Summary of the models that will be considered. The first three are replications of the models presented by Severson et al. The following three are this thesis's contribution.
4.1 Performance of the original models by Severson et al., marked with *, and the replicated models. Severson et al. argue that the cell with the shortest lifetime is an outlier and does not represent the data set in general. As such, the results when excluding said cell are presented in parentheses.
4.2 Performance of the GP-discharge models for five different kernels. The two linear models at the top are the already presented original (*) discharge-model and replication discharge-model.
4.3 Performance of the GP-full models for five different kernels. The two linear models at the top are the already presented original (*) full-model and replication full-model.
4.4 Performance of the GP-custom models for five different kernels.
4.5 Performance of the original (*) models by Severson et al. with their linear replications underneath and finally the corresponding GP-models. The two bottom rows are from the custom GP model.
A.1 Each model's use of features.


Chapter 1

Introduction

1.1 Project Relevance and Aim

Rechargeable batteries are today's most commonly used devices for electrical energy storage. More specifically, Lithium-ion (Li-ion) batteries, a particular type of rechargeable battery, are increasingly used in numerous applications, ranging from portable devices to electric vehicles and grid energy storage systems. Favourable characteristics have driven their popularity and widespread adoption, for example low and falling costs of production, high energy densities, and long lifetimes [1]. Li-ion batteries experience performance deterioration with time and use, manifesting in capacity degradation, i.e. less charge can be stored in the battery, and loss of power [2]. This thesis focuses on the former, and in particular on prediction of cycle life. A battery's lifetime, or cycle life (the terms are used synonymously in this thesis), is the number of times the battery can be fully charged and fully discharged before its capacity reaches 80% of the initial capacity. A consequence of long lifetimes is delayed feedback concerning capacity degradation, as one has to charge and discharge a battery many times to measure the capacity fade, rendering performance evaluation time consuming and expensive.

Accurate early-stage prediction of battery performance would create new opportunities regarding production and use. Early-stage refers to using only data from the earlier cycles of the battery's life to predict future capacity and power, as opposed to actually charging and discharging all cycles while taking measurements until 80% capacity is reached, a much more expensive and time-consuming process. Early-stage predictions would enable manufacturers to accelerate development through fast evaluation of new cell technologies. Users would be given means to accurately predict capacity fade, thus enabling optimization regarding usage [3]. Such opportunities make advancements in early-stage prediction of cycle life for Li-ion batteries appealing. Therefore, this project


aims to investigate means of making such predictions using data-driven methods. In particular, a data set of 124 cells will be used, where lifetimes vary between around 150 and 2300 cycles, with a mean of 807 cycles, using only data from the first 100 cycles for training and prediction.

1.2 Previous Research

Data-driven prediction of battery health has gained increased attention over the past couple of years, in both academia and industry [2]. In particular, machine learning methods are used by a growing number of authors to predict cycle life or remaining useful life (RUL) of batteries, most frequently defined as the number of cycles left until discharge capacity has reached 80% of initial capacity. For example, Severson et al. [3] employ linear methods with a feature space partially derived from the discharge voltage curve. They achieve remarkable accuracy (around 10% mean error) in early-stage prediction of cycle life on a data set of 124 cells, with cycle lives ranging from 150 to 2300 cycles. Even though the authors of [3] consider a high-dimensional feature space of around 20 input variables, with the cycle lifetime being the target variable, most of the predictive accuracy lies with a single variable.

Other authors opt for more flexible models. Richardson et al. [4] employ Gaussian process regression (GPR) for short- and long-term capacity prediction. Additionally, the authors use explicit mean functions to achieve improved results compared to using the zero mean function. However, contrary to the authors of [3], they aim at modeling capacity vs. cycles, basing a cell's capacity prediction on how that particular cell's capacity has changed with the number of cycles. Percentage-wise, they consider a much larger part of the ageing data, i.e. not an early-stage prediction, compared to [3], and the data set considered contains far fewer cells.

Li et al. [5], similar to the authors of [4] and using the same data set, tackle the challenge of predicting capacity against the number of cycles. They do so using Gaussian process mixture models and show their superiority compared to using a single process. As in [4], they use a much larger percentage of the ageing data than the authors of [3]. Furthermore, in the same framework of modeling capacity vs. cycles, Liu et al. [6] use linear mean functions on Gaussian processes with the aim of k-steps-ahead prediction of capacity.


GPR has thus been used in the first framework, modeling capacity against cycle number, as described above, by [4–6]. To the best of the author's knowledge, however, GPR has not been used in the second framework, where one considers a different input-output variable setting: regressing cycle life on a high-dimensional feature space, as done in [3], where the features are derived from the batteries. Inspired by the statistical methods used to research the first framework, this thesis makes a relevant contribution to current research by considering GPR in the second framework, i.e. the usage of GPR to regress cycle life on a high-dimensional feature space.

1.3 Research Question

Given the impressive accuracy of Severson et al. [3] using only linear methods, this appears to be a reasonable starting point for future research. However, since Li-ion batteries age due to several complex mechanisms, the capacity fade process is generally non-linear [2]. Thus, a linear model, even when making use of non-linear feature transformations, might be a sub-optimal choice when trying to achieve the best prediction accuracy possible. Therefore, an extension of the linear model, using GPR, might be able to capture the non-linear tendencies and thus more accurately predict cycle life. This thesis will use the one-dimensional, linear variance-model from [3] as a starting point and then further investigate whether the available regressors could be of greater predictive value in a GP setting than in a linear one. Specifically, the residuals created when subtracting the variance-model lifetime predictions from the actual lifetimes will be used as target variables in a GPR setting. Explicitly, the project will investigate the following research question: Can the predictions of Li-ion battery lifetime by Severson et al. [3] be improved by extending their linear model and applying Gaussian process regression?

As mentioned above, Liu et al. [6] did put a linear mean function on a Gaussian process with the aim of k-steps-ahead prediction of Li-ion battery capacity. However, as they consider a continuous regression of capacity using only the number of cycles, the methods presented in this thesis provide a unique contribution. Given the insights presented in [3], and the features used to regress battery cells' single-valued cycle life, one is inclined to believe that accurate predictions can be made by combining linear methods with GPR.

1.4 Data



Figure 1.1: Histogram of cycle lives for the 124 cells.

The data set used is that of Severson et al. [3] and consists of 124 commercial lithium iron phosphate/graphite cells (A123 Systems, model APR18650M1A, 1.1 Ah nominal capacity). The cells are charged under different fast-charging conditions as this is an area of particular commercial interest. The differing charging conditions result in great variance concerning cycle life; among the 124 cells, cycle lives vary from around 150 to 2300 with a mean of 807 cycles and a standard deviation of 377 cycles. The data set contains continuously measured information regarding voltage, current, and cell can temperature for each cell and cycle respectively. Furthermore, it contains discharge capacity and other derivations that can be made from the already mentioned data. The authors of [3] state that, to the best of their knowledge, the data set is the largest of its kind. The author of this master's thesis has yet to discover a more extensive data set of similar information. For more information concerning the data, for example details concerning charging and discharging conditions or measurement devices, the reader is encouraged to consult the Methods section of [3]. Figure 1.1 shows a histogram of lifetimes for the 124 cells.

1.5 Outline

This initial chapter has presented the project's relevance and aim. Furthermore, we presented related previous research, which led us to the introduction of a specific research question and the data that will be used throughout the project. Chapter 2 presents a theoretical background for Li-ion batteries and their ageing, followed by basic theory for statistical learning, and finally more advanced theory on Gaussian process regression. Chapter 3 offers a method description, a motivation for why the described methods are chosen over other options, and finally an evaluation of the methods used. Chapter 4 presents and discusses the results, and Chapter 5 summarises the conclusions and suggests directions for future research.


Chapter 2

Theoretical Background

2.1 Lithium-ion Batteries

2.1.1 General

The main components of a Lithium-ion battery are the cathode and the anode, the storage sites for the lithium. The cathode is the positive electrode, made out of a lithium metal oxide, for example lithium iron phosphate in the LFP battery, while the negative electrode often consists of a carbon derivative, for example graphite. Connecting the two, allowing lithium ions to travel between the anode and cathode, is the electrolyte [7]. In order to separate the two electrodes, an insulating material, for example polypropylene, is used. The insulation inhibits the flow of electrons, thus preventing electrical short circuits, while the porosity of the material enables transportation of ionic charge carriers [8]. Finally, both the anode and the cathode have their own current collector, collecting the charge, i.e. the electrons, that are to power a device [7]. Figure 2.1 a), taken from [9], offers a graphic overview of the general Li-ion battery design.

When charging the battery, the power supplied separates the electrons from the lithium atoms in the cathode. The electrons run through the external circuit to the anode side while the lithium ions reach the anode side by traveling through the electrolyte. This is an unstable state. Consequently, when connected to an external device, for example a laptop, the electrons run through said device, which creates a powering current, to once again find themselves on the cathode side. Simultaneously, the lithium ions travel back to the cathode through the electrolyte, and the battery is back to the uncharged stable state [7]. Figure 2.1 b) and c), taken from [9], show an overview of the charge and discharge processes.




Figure 2.1: (a) Main components of a Lithium-ion battery. (b) Charging process. (c) Discharging process.

2.1.2 Overview of Battery Ageing

Battery ageing can be understood through stress factors, degradation mechanisms, and degradation modes, see Figure 2.2, taken from [2]. Stress factors, such as low temperature or high state of charge (SOC), cause degradation mechanisms. The degradation mechanisms are in turn widely categorized into three modes: increase of impedance, loss of lithium inventory (LLI), and loss of active material (LAM), the active material being that which is involved in the charge and discharge cycles. For example, plenty of stress factors cause the formation of solid electrolyte interface (SEI) on the surface of the negative electrode. The formation of SEI causes the internal resistance of the cell to increase, while also irreversibly consuming lithium ions, resulting in a permanent loss of lithium inventory and thus loss of charge carriers. LAM is the result of a number of factors. For example, structural changes in the electrodes can reduce the density of lithium storage sites, resulting in a reduced capacity to hold lithium ions [2]. For a more thorough account of how stress factors, degradation mechanisms, and degradation modes are connected, please see section 2 of [2].

2.1.3 Relating to the Regressor of the Variance-Model



Figure 2.2: Overview and mapping of stress factors, degradation mechanisms, and degradation modes for Lithium-ion battery ageing.

Figure 2.3: The voltage-discharge curve for a representative cell. The grey area represents energy dissipation of the cell. Image design by Severson et al.

The grey area between the two curves in Figure 2.3 represents the difference in energy dissipation for the cell, between the two cycles [3]. Let $\Delta Q_{100-10}(V)$ denote the curve of the hundredth cycle minus the curve of the tenth cycle, or, more generally, $\Delta Q(V)$ for differences between unspecified cycles. It turns out that regressors associated with $\Delta Q(V)$, a vector, offer great predictive ability. The variance-model uses the variance of the $\Delta Q_{100-10}(V)$ vector as a single regressor, hence the name variance-model. The mean of the $\Delta Q_{100-10}(V)$ vector is likewise among the available regressors.


When discussing the predictive ability of the variance regressor, the authors of [3] investigate the degradation mode of LAM, and more specifically LAM of the delithiated negative electrode (LAM$_{\mathrm{deNE}}$). Early on in a battery cell's life, LAM$_{\mathrm{deNE}}$ causes a shift in the voltage-discharge curve, as demonstrated in Figure 2.3, but has yet to affect actual discharge capacity, again evidenced by the two curves. The shift, however, induces severe capacity fade later on in the cell's life. Additionally, LAM$_{\mathrm{deNE}}$ only alters a fraction of the voltage-discharge curve, explaining the strong correlation between the variance of $\Delta Q(V)$ and cycle life. As this degradation mode manifests early in the voltage-discharge curve, without yet having an effect on capacity, regressors associated with $\Delta Q(V)$ enable accurate early-stage lifetime predictions [3]. Please see the section Rationalization of predictive performance in [3] for a more thorough discussion.

2.2 Mathematics

2.2.1 Overview of Statistical Learning

Sections 2.2.1.1 to 2.2.1.6 below are all based on chapter two of [10], which provides a more thorough account of the subjects covered. For an even more mathematically rigorous introduction, the reader is encouraged to consult [11].

2.2.1.1 General

Fundamentally, statistical learning refers to a collection of methods for estimating a function, f. In a general setting, we assume

$$Y = f(X) + \epsilon, \qquad (2.1)$$

where f is a fixed, however unknown, function. Y is referred to as the dependent variable, the response variable, or the output variable, whereas $X = (X_1, X_2, ..., X_p)$ consists of the input variables, also called predictors, and $\epsilon$ is a random error term.


2.2.1.2 Prediction and Inference

In the prediction framework, we are interested in making a prediction of Y given the data X. The response variable, Y, might be quantitative, in which case the prediction is referred to as regression. It might also be qualitative, or categorical, in which case the prediction is called classification. Prediction is done using

$$\hat{Y} = \hat{f}(X), \qquad (2.2)$$

where $\hat{Y}$ represents our prediction of Y using $\hat{f}$ as an estimate for f. When the aim is prediction, we are generally not very interested in the specific form of $\hat{f}$ as long as it makes accurate predictions for Y. The prediction accuracy of $\hat{Y}$ for Y can be described mathematically as

$$E[(Y - \hat{Y})^2] = E[(f(X) + \epsilon - \hat{f}(X))^2] = (f(X) - \hat{f}(X))^2 + E[\epsilon^2], \qquad (2.3)$$

where $E[\cdot]$ is the expected value operator. The first term on the rightmost side of equation (2.3) represents the reducible error and is a measure of how well $\hat{f}$ estimates f. However, even if one manages to produce a perfect estimation of f, that is $\hat{f} = f$, the accuracy of $\hat{Y}$ is still affected by the irreducible error, $E[\epsilon^2] = Var(\epsilon)$, the variance of $\epsilon$. In fact, the irreducible error poses an upper limit on the accuracy of $\hat{Y}$. In practice, this bound is almost always unknown.

When considering inference, one is interested in understanding how $X_1, X_2, ..., X_p$ might affect Y, but not necessarily in making predictions for Y given X. In this setting, contrary to that of prediction, we want to know the exact form of $\hat{f}$ because we want to understand how X affects Y, or in other words, how Y changes as a function of X. Questions that might be of interest include: firstly, which out of the p available predictors actually affect Y? In many cases it is only a fraction of all available predictors that have an effect on Y. Secondly, what is the relationship between Y and a given predictor? For example, it might be linear, quadratic, exponential, or something considerably more complex.


2.2.1.3 Estimating f - Parametric and Non-Parametric Methods

Let us assume we have access to a set of n different data points called training data. Training data, because we will use these points to teach, or train, our model to estimate f. Let $x_{ij}$ represent the value of the jth predictor for the ith data point (notation-wise, we use lower case letters to represent outcomes of random variables, and capital letters for the actual variables). We thus have $j = 1, 2, ..., p$ and $i = 1, 2, ..., n$. Now let $y_i$ be the response variable associated with the ith data point. Mathematically, our training data can be expressed as $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ with $x_i = (x_{i1}, x_{i2}, ..., x_{ip})^T$. Using the training data, we now want to find a function that accurately predicts Y given X. That is, we want to construct $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation of (X, Y). In general, there are two types of methods for doing so: parametric and non-parametric.

Parametric methods involve two steps. The first one is deciding on the functional form of f. An example is the linear functional form: $f(X) = \beta_0 + \beta^T X$, $\beta \in \mathbb{R}^p$. The second step is using the training data to train the model, or equivalently, to estimate the parameters, $\beta_0$ and $\beta$, associated with f so that $Y \approx \beta_0 + \beta^T X$. The second step is done using different algorithms. In the linear case, Ordinary Least Squares (OLS) is a common choice. Depending on the functional form, different algorithms for estimating the parameters might apply. When considering a parametric approach to estimate f, one reduces the problem of estimating f to the problem of estimating a set of parameters, generally a much easier task than fitting an entire function. However, if the initial assumption on the functional form of f is off, it is likely that $\hat{f}$ will have poor predictive accuracy. One might then choose a more flexible functional form for $\hat{f}$. However, this usually brings about more parameters to estimate as well as the risk of overfitting, the phenomenon that arises when a model essentially follows the errors too closely.

Non-parametric methods, such as k-nearest neighbors, contrary to parametric ones, make no explicit prior assumption on the functional form of f. Instead, they estimate f by coming as close to the training data points as possible under certain set demands on smoothness, making sure $\hat{f}$ is not too rough or wiggly. The advantage is that such an estimate has the potential of accurately fitting a wider range of functions. Generally, the disadvantage is that a larger number of data points is needed to accurately estimate f compared to a parametric approach.
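As an illustration of the two approaches, the following sketch fits a parametric linear model and a non-parametric k-nearest-neighbours model to the same synthetic data. The data-generating function, sample sizes, and variable names are hypothetical and only serve to contrast the two method types.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic training data: a non-linear "true" f plus noise (hypothetical example).
x_train = rng.uniform(0, 5, size=(40, 1))
y_train = np.sin(x_train).ravel() + rng.normal(scale=0.2, size=40)

# Parametric: assume a linear functional form and estimate beta_0, beta_1 with OLS.
linear = LinearRegression().fit(x_train, y_train)

# Non-parametric: k-nearest neighbours makes no assumption on the form of f.
knn = KNeighborsRegressor(n_neighbors=5).fit(x_train, y_train)

x_test = np.linspace(0, 5, 11).reshape(-1, 1)
print(linear.predict(x_test))  # straight-line fit
print(knn.predict(x_test))     # locally averaged, more flexible fit
```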

2.2.1.4 Supervised and Unsupervised Learning


In supervised statistical learning, each observation of the predictors, $x_i$, $i = 1, 2, ..., n$, is associated with a response, $y_i$. The problem is then to fit a model for either prediction or inference, or both. The $y_i$ points teach the model, or in other words, act as supervisors, hence the name.

When dealing with unsupervised statistical learning, the task is not as straightforward because the predictor data points, $x_i$, lack associated response variables $y_i$. The supervisors have disappeared, making the learning unsupervised. It no longer makes any sense to speak of prediction since there is no response variable to predict. Instead, one tries to understand relationships between different variables or observations. A popular tool within the unsupervised framework is cluster analysis, where one tries to group the data points into different clusters with regard to specific variables in order to gain further insight from the data.

2.2.1.5 Model Assessment

Unfortunately, there is no single method with superior performance over every other method and data set. Consequently, method choice is always dependent on the data in question. Which model to opt for is not always obvious, but there are measures for evaluating a model's predictive ability, thus giving an indication of which models are suitable for which data. Within the regression setting, one such measure is the mean squared error (MSE), given by

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2. \qquad (2.4)$$

It is the average of the squared sum of errors for the predictions that $\hat{f}$ makes for the true values, $y_i$. The closer the predictions are to the true values, the smaller the MSE. One makes a distinction between training MSE and test MSE. Training MSE is what we obtain when applying equation (2.4) to previously seen data points, or in other words, the points used when fitting the model. However, we are more interested in the model's performance on previously unseen data, or equivalently, data that has not been used to train the model. Evaluating equation (2.4) on such data is referred to as the test MSE. In practice, given a data set of n points, one might use the first k points to train the model and the other n - k points for evaluation. Variations of equation (2.4) might also be used; for example, taking the square root of (2.4) gives the root mean squared error (RMSE).
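To make the train/test distinction concrete, the following sketch fits a model on a training portion of a data set and evaluates equation (2.4), and its square root, on both the training points and held-out test points. The data and split sizes below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(124, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=124)

# Use the first k points for training and the remaining n - k for testing.
k = 80
X_train, y_train = X[:k], y[:k]
X_test, y_test = X[k:], y[k:]

model = LinearRegression().fit(X_train, y_train)

def mse(y_true, y_pred):
    # Equation (2.4): average squared prediction error.
    return np.mean((y_true - y_pred) ** 2)

train_mse = mse(y_train, model.predict(X_train))
test_mse = mse(y_test, model.predict(X_test))
print(f"train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
print(f"train RMSE {np.sqrt(train_mse):.4f}, test RMSE {np.sqrt(test_mse):.4f}")
```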

The expected value of the test MSE for an unseen data point, $(x_0, y_0)$, can mathematically be divided into three principal parts:

$$E\left[(y_0 - \hat{f}(x_0))^2\right] = Var(\hat{f}(x_0)) + \left[Bias(\hat{f}(x_0))\right]^2 + Var(\epsilon). \qquad (2.5)$$

The expected test MSE is the average test MSE one would obtain after repeatedly estimating f using many different training sets and, for each estimation, testing it at $x_0$.

As seen in equation (2.5), the expected test MSE is affected by both the variance and the bias of our model $\hat{f}$. The variance refers to the amount by which $\hat{f}$ changes using different training data sets. The estimated $\hat{f}$ will depend on the training data used, but preferably it should not vary too much for differing training sets, as such variation could be the result of overfitting. In general, more flexible methods have higher variance. The bias term is the result of approximating a real function with a simpler one. For example, when putting a linear functional form on the relation between Y and X, we assume that the relationship is indeed linear. It is, however, unlikely that the true relationship is perfectly linear, and therefore we have introduced bias into our model. Generally, more flexible methods are associated with less bias. The fact that bias generally decreases while variance generally increases with method flexibility is known as the bias-variance trade-off. As we seek to construct $\hat{f}$ in order to minimise the test MSE, we wish to use a method and model where both bias and variance are low.

2.2.1.6 Linear Regression

In simple linear regression, we assume a linear relationship between the response variable and a single predictor, i.e.

$$Y \approx \beta_0 + \beta_1 X. \qquad (2.6)$$

Using the training data, we estimate the parameters and make predictions:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x. \qquad (2.7)$$

In multiple linear regression, we once again assume a linear relationship, but this time between the response variable and multiple regressors, i.e.

$$Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p. \qquad (2.8)$$

For the sake of notation, we will write the above as $Y \approx \beta^T X$. Note that the constant term $\beta_0$ is included in $\beta$, and that the X-vector's first element is set to 1, indeed making equation (2.8) and $Y \approx \beta^T X$ equivalent. Using that notation, we make predictions using

$$\hat{y} = \hat{\beta}^T x, \quad \hat{\beta} \in \mathbb{R}^{p+1}, \; x \in \mathbb{R}^{p+1}. \qquad (2.9)$$


The parameters are estimated using least squares. Mathematically, this means one minimizes the residual sum of squares (RSS), given by

$$RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}\left(y_i - \hat{\beta}^T x_i\right)^2, \quad \hat{\beta} \in \mathbb{R}^{p+1}, \; x_i \in \mathbb{R}^{p+1}, \qquad (2.10)$$

with respect to $\hat{\beta}$. The $\hat{\beta}$ that minimizes equation (2.10) is the least squares estimate of $\beta$.
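As a concrete sketch of equations (2.9)-(2.10), the least squares estimate can be computed directly from the design matrix with a prepended column of ones for the intercept. The data, dimensions, and coefficient values below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.5, -1.0, 3.0])   # [beta_0, beta_1, ..., beta_p]
X1 = np.hstack([np.ones((n, 1)), X])          # first element of each x set to 1
y = X1 @ beta_true + rng.normal(scale=0.1, size=n)

# Least squares estimate: beta_hat minimises the RSS in equation (2.10).
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

rss = np.sum((y - X1 @ beta_hat) ** 2)
print(beta_hat, rss)
```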

2.2.2 Gaussian Processes

This section is based on chapters 2, 4, and 5 and Appendix A of Rasmussen & Williams' Gaussian Processes for Machine Learning, cited as [12]. For further detail on the subjects presented, see their original work.

2.2.2.1 Gaussian Distributions and Identities

A random vector $X = (X_1, X_2, ..., X_n)$ has a multivariate Gaussian, also called Normal, distribution if every linear combination of its elements has a one-dimensional Gaussian distribution. The joint probability density for such a distribution is given by

$$p(X) = (2\pi)^{-n/2}|K|^{-1/2}\exp\left(-\frac{1}{2}(X - m)^T K^{-1}(X - m)\right), \qquad (2.11)$$

where m is the mean vector and K is the symmetric and positive definite covariance matrix. Note that since $X \in \mathbb{R}^n$, it follows that $m \in \mathbb{R}^n$ and $K \in \mathbb{R}^{n \times n}$. One often writes $X \sim N(m, K)$. One might also let $X_1$ and $X_2$ be jointly normal, i.e.

$$\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \sim N\left[\begin{pmatrix} m_1 \\ m_2 \end{pmatrix}, \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix}\right]. \qquad (2.12)$$

Note that $K_{12}^T = K_{21}$. The marginal distribution of $X_1$ is then $X_1 \sim N(m_1, K_{11})$, while the conditional distribution of $X_1$ given $X_2 = x_2$ is

$$X_1 | X_2 = x_2 \sim N\left(m_1 + K_{12}K_{22}^{-1}(x_2 - m_2),\; K_{11} - K_{12}K_{22}^{-1}K_{21}\right). \qquad (2.13)$$
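The conditioning identity (2.13) is easy to verify numerically. The sketch below builds a small jointly Gaussian vector with an arbitrary, purely illustrative mean and covariance, and computes the conditional mean and covariance of the first block given the second.

```python
import numpy as np

# A 3-dimensional joint Gaussian, partitioned as X1 = first two components, X2 = last one.
m = np.array([0.0, 1.0, -1.0])
K = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.0, 0.2],
              [0.3, 0.2, 0.5]])

m1, m2 = m[:2], m[2:]
K11, K12 = K[:2, :2], K[:2, 2:]
K21, K22 = K[2:, :2], K[2:, 2:]

x2 = np.array([0.4])  # observed value of X2

# Equation (2.13): conditional mean and covariance of X1 | X2 = x2.
cond_mean = m1 + K12 @ np.linalg.solve(K22, x2 - m2)
cond_cov = K11 - K12 @ np.linalg.solve(K22, K21)
print(cond_mean)
print(cond_cov)
```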

2.2.2.2 Definition and Representation


A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A Gaussian process is completely specified by the process's mean and covariance function. A real process is often written as

$$f(X) \sim \mathcal{GP}\left(m(X), k(X, X')\right), \qquad (2.14)$$

with mean function m(X) and covariance function, also often called kernel function or just kernel, $k(X, X')$, defined as

$$m(X) = E[f(X)], \qquad (2.15)$$
$$k(X, X') = E\left[(f(X) - m(X))(f(X') - m(X'))\right]. \qquad (2.16)$$

Often, the mean function is set to zero, even though this need not be the case.

We view the random variables in a Gaussian process as representations of the value of f(X) at location X. As such, the definition above can be restated as: a Gaussian process on $\mathcal{X}$ is a collection of random variables $\{f(X)\}_{X \in \mathcal{X}}$ such that for any finite collection $(X_1, X_2, ..., X_n) \in \mathcal{X}^n$ the random vector $(f(X_1), f(X_2), ..., f(X_n))^T$ has a joint Gaussian distribution.

Once a mean and covariance function are defined, the process defines a distribution over functions. This can be further illustrated by sampling functions from the process, or equivalently, from the distribution. Figure 2.4, from [13], shows three different subplots; each subplot is associated with a specific kernel, and thus a distribution over functions. From each distribution, five sample functions have been drawn and plotted in their respective subplots. There are two key takeaways from this image. First, one can sample functions from a GP, just as one can sample outcomes from any other random variable's distribution. Second, different covariance functions define certain fundamental characteristics of the functions being sampled, for example their regularity.
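Prior samples of the kind shown in Figure 2.4 can be reproduced with scikit-learn's Gaussian process tools. The grid, kernel hyperparameters, and number of samples below are arbitrary choices for illustration, not the settings used in this thesis.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic, Matern

X_grid = np.linspace(-5, 5, 200).reshape(-1, 1)

kernels = {
    "RBF": RBF(length_scale=1.0),
    "Rational Quadratic": RationalQuadratic(length_scale=1.0, alpha=1.0),
    "Matern 3/2": Matern(length_scale=1.0, nu=1.5),
}

for name, kernel in kernels.items():
    # An unfitted GaussianProcessRegressor represents the prior; sample_y draws functions from it.
    gp_prior = GaussianProcessRegressor(kernel=kernel)
    samples = gp_prior.sample_y(X_grid, n_samples=5, random_state=0)
    print(name, samples.shape)  # (200, 5): five prior functions evaluated on the grid
```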

2.2.2.3 Covariance Functions

As we will see in the next section, the covariance function is a key ingredient in a Gaussian process predictor because it makes an assumption regarding the function we wish to learn. In the framework of supervised learning, similarity between data points is crucial because one expects that points with similar inputs, x, will have similar target values, y. The idea is that training points near a specific test point should be of high predictive value at that point. For Gaussian processes, it is the covariance functions that define that similarity, or nearness.


Figure 2.4: Sampled functions from Gaussian process prior distributions using the squared exponential (RBF) (upper), rational quadratic (middle), and Matern ν = 3/2 (lower) kernel functions. The black lines indicate the mean and the grey areas the standard deviation at any input value.

Covariance functions that depend only on $X - X'$ are called stationary, and those that depend only on $r = |X - X'|$ are called isotropic; they are invariant to all rigid motions. Certain kernels can be used both in an isotropic and an anisotropic manner. Functions that depend only on $X \cdot X'$ are called dot-product covariance functions and are invariant to rotations of the coordinates about the origin. The dot-product is further present in polynomial kernels. As Figure 2.4 above illustrates, different kernels specify distributions over different functions. In the next section, we will see how different kernels affect the learning of our process. Below, we present and categorise the most commonly used kernels in machine learning. Note also that sums of kernels are kernels, as are products and convolutions, making it possible to create different kinds of mixture kernels.

The isotropic squared exponential (SE) kernel, often referred to as RBF, is given by

$$k_{SE}(r) = \sigma_f^2 \exp\left(-\frac{r^2}{2l^2}\right), \qquad (2.17)$$

with the characteristic length-scale, l, as parameter. Note that most kernel functions have the signal variance, $\sigma_f^2$, included as a parameter; in a sense this is a measure of how much the functional values vary, as opposed to how much the observed target variables vary in relation to the function, which is given by the noise term. Remember, $Y = f(X) + \epsilon$. The isotropic Matern kernel is given by

$$k_{Matern}(r) = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\,r}{l}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2\nu}\,r}{l}\right), \qquad (2.18)$$


with parameters $\nu, l > 0$ and $K_{\nu}$ a modified Bessel function. A special case is $\nu = 1/2$, making the kernel an exponential one: $k(r) = \exp\left(-\frac{r}{l}\right)$.

The rational quadratic (RQ) kernel is also isotropic, given by

$$k_{RQ}(r) = \sigma_f^2\left(1 + \frac{r^2}{2\alpha l^2}\right)^{-\alpha}, \qquad (2.19)$$

with $\alpha, l > 0$ as parameters. The kernel can be seen as an infinite sum of SE-kernels with different length scales.

The inhomogeneous polynomial kernel is given by

$$k(X, X') = \sigma_f^2\left(\sigma_0^2 + X \cdot X'\right)^p, \qquad (2.20)$$

for some positive integer p. It contains the constant covariance term $\sigma_0^2$, which can also pose as a kernel itself.

2.2.3 Gaussian Process Regression

Gaussian Process regression is a Bayesian, non-parametric, supervised learning method. In Bayesian statistics, modeling is done by specifying a prior distribution - a specification that captures our previous knowledge or beliefs of what we are modeling, before seeing any data points. The prior is then combined with data points to create a posterior distribution - our updated belief of what we are modeling, after taking data points into account. The posterior is then used to create the predictive distribution - the distribution from which one makes predictions for the response variable in question, given new input variables.

2.2.3.1 Making Predictions

For GPR, the relation between these distributions is more easily understood through a general problem. Thus, consider a problem of regression with additive noise, i.e. $Y = f(X) + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$. We have n training data points $\{(x_i, y_i)\}_{i=1}^{n}$ that are outcomes of independent pairs (X, Y) and, as usual, X represents the input variables and Y the outcome variable. Furthermore, we have a set of test input variables $\{x_i\}_{i=n+1}^{n+k}$ for which we want to predict the test response variables $\{y_i\}_{i=n+1}^{n+k}$. For notational purposes, we let $\mathbf{x} = (x_1, x_2, ..., x_n)^T$, $\mathbf{y} = (y_1, y_2, ..., y_n)^T$, $\mathbf{x}' = (x_{n+1}, ..., x_{n+k})^T$, and $\mathbf{y}' = (y_{n+1}, ..., y_{n+k})^T$. As usual, capital letters refer to random variables while lower case letters refer to realised outcomes.


We start by specifying a Gaussian process, that is, a prior distribution over functions, $f(X) \sim \mathcal{GP}\left(m(X), k(X, X')\right)$, by convention with $m(X) = 0$. From such a prior one can randomly generate functions, particular outcomes from the prior distribution, exactly what was seen in Figure 2.4 above. The specification of the prior is important as it specifies a set of properties for the functions that will be considered for prediction. As our response variable is assumed to be a sum of f(X) and $\epsilon \sim N(0, \sigma^2)$, we get the covariance for the response variable as $cov(\mathbf{Y}) = K(\mathbf{X}, \mathbf{X}) + \sigma^2 I_n$, where K is the covariance matrix for $\mathbf{X}$ and $I_n$ the identity matrix of size n. The joint distribution of the noisy observations, $\mathbf{Y}$, and the test output functions, $f(\mathbf{X}') = (f(X_{n+1}), ..., f(X_{n+k}))^T$, is

$$\begin{pmatrix} f(\mathbf{X}') \\ \mathbf{Y} \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K(\mathbf{X}', \mathbf{X}') & K(\mathbf{X}', \mathbf{X}) \\ K(\mathbf{X}, \mathbf{X}') & K(\mathbf{X}, \mathbf{X}) + \sigma^2 I_n \end{pmatrix}\right]. \qquad (2.21)$$

With the joint prior specified, we will restrict this distribution to only contain functions that agree with the training data. To that end, having $\mathbf{X} = \mathbf{x}$, we condition on $\mathbf{Y} = \mathbf{y}$ using equation (2.13) to get the posterior distribution of $f(\mathbf{X}')$, our knowledge of the process after combining the prior and the training data. Thus, the posterior distribution for $f(\mathbf{X}')$ is Gaussian with mean and covariance given by

$$m_{f(\mathbf{X}')|\mathbf{X}=\mathbf{x},\mathbf{Y}=\mathbf{y}}(\mathbf{X}') = K(\mathbf{X}', \mathbf{x})\left[K(\mathbf{x}, \mathbf{x}) + \sigma^2 I_n\right]^{-1}\mathbf{y}, \qquad (2.22)$$

$$k_{f(\mathbf{X}')|\mathbf{X}=\mathbf{x},\mathbf{Y}=\mathbf{y}}(\mathbf{X}', \mathbf{X}') = k(\mathbf{X}', \mathbf{X}') - k(\mathbf{X}', \mathbf{x})\left[K(\mathbf{x}, \mathbf{x}) + \sigma^2 I_n\right]^{-1}k(\mathbf{x}, \mathbf{X}'). \qquad (2.23)$$

To get the predictive distribution for $\mathbf{y}' = (y_{n+1}, ..., y_{n+k})^T$, we proceed similarly as above and compute the joint distribution of $\mathbf{Y}$ and $\mathbf{Y}'$:

$$\begin{pmatrix} \mathbf{Y}' \\ \mathbf{Y} \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} K(\mathbf{X}', \mathbf{X}') + \sigma^2 I_k & K(\mathbf{X}', \mathbf{X}) \\ K(\mathbf{X}, \mathbf{X}') & K(\mathbf{X}, \mathbf{X}) + \sigma^2 I_n \end{pmatrix}\right], \qquad (2.24)$$

with mean and covariance given by

$$m_{\mathbf{Y}'|\mathbf{X}=\mathbf{x},\mathbf{Y}=\mathbf{y}}(\mathbf{X}') = K(\mathbf{X}', \mathbf{x})\left[K(\mathbf{x}, \mathbf{x}) + \sigma^2 I_n\right]^{-1}\mathbf{y}, \qquad (2.25)$$

$$k_{\mathbf{Y}'|\mathbf{X}=\mathbf{x},\mathbf{Y}=\mathbf{y}}(\mathbf{X}', \mathbf{X}') = k(\mathbf{X}', \mathbf{X}') + \sigma^2 I_k - k(\mathbf{X}', \mathbf{x})\left[K(\mathbf{x}, \mathbf{x}) + \sigma^2 I_n\right]^{-1}k(\mathbf{x}, \mathbf{X}'). \qquad (2.26)$$

Given a set of test input variables as stated above, $\mathbf{x}' = (x_{n+1}, ..., x_{n+k})^T$, we make predictions using the predictive mean evaluated at $\mathbf{X}' = \mathbf{x}'$, with confidence intervals obtained from the predictive covariance, again evaluated at $\mathbf{X}' = \mathbf{x}'$. Note that the predictive mean and the posterior mean coincide. Note further that these are linear combinations of the realised response variables $\mathbf{Y} = \mathbf{y}$: the model has learnt from the training data. Finally, note that the difference between the two covariance matrices is a term $\sigma^2 I_k$, as $\mathbf{Y}' = f(\mathbf{X}') + \epsilon$ by assumption.


Figure 2.5: Sampled functions from Gaussian process posterior distributions using the squared exponential (RBF) (upper), rational quadratic (middle), and Matern ν = 3/2 (lower) kernel functions. The black lines indicate the mean and the grey areas the standard deviation at any input value.

Figure 2.5, from [13], illustrates how the priors in Figure 2.4 are combined with data points to create posterior distributions. The data points are generated from a sine curve. There are two key takeaways. Firstly, the prior distribution becomes the posterior distribution after conditioning on seen data points. Secondly, different kernels result in different posterior distributions and thus different predictive distributions.
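A minimal sketch of equations (2.25)-(2.26), assuming a zero mean function and the squared exponential kernel. The toy data, noise level, and hyperparameters are arbitrary and only illustrate the mechanics of the predictive equations.

```python
import numpy as np

def k_se(A, B, sigma_f=1.0, l=1.0):
    # Squared exponential kernel matrix between the rows of A and B.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-d2 / (2 * l**2))

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 20).reshape(-1, 1)                 # training inputs
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=20)   # noisy training targets
x_star = np.linspace(0, 5, 7).reshape(-1, 1)             # test inputs
sigma_n = 0.1                                            # noise standard deviation

K = k_se(x, x) + sigma_n**2 * np.eye(len(x))             # K(x, x) + sigma^2 I_n
K_star = k_se(x_star, x)                                 # k(x', x)
K_ss = k_se(x_star, x_star)                              # k(x', x')

pred_mean = K_star @ np.linalg.solve(K, y)               # equation (2.25)
pred_cov = (K_ss + sigma_n**2 * np.eye(len(x_star))
            - K_star @ np.linalg.solve(K, K_star.T))     # equation (2.26)

pred_std = np.sqrt(np.diag(pred_cov))                    # 95% bars: mean +- 1.96 * std
print(pred_mean)
print(pred_mean - 1.96 * pred_std)
print(pred_mean + 1.96 * pred_std)
```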

2.2.3.2 Bayesian Model Selection

Bayesian model selection in the framework of GPR is based on maximising the evidence with regard to the families of covariance functions and said families' hyperparameters. This is done through the marginal likelihood, or evidence, $p(\mathbf{y}|X, \theta)$, which is the integral of the likelihood times the prior:

$$p(\mathbf{y}|X, \theta) = \int p(\mathbf{y}|\mathbf{f}, X, \theta)\, p(\mathbf{f}|X, \theta)\, d\mathbf{f}, \qquad (2.27)$$

where $\theta$ is the vector of hyperparameters specifying the covariance function, including the noise variance $\sigma^2$. For example, using the rational quadratic kernel in equation (2.19), $\theta = (\sigma_f^2, \alpha, l, \sigma^2)$. We call the integral in (2.27) the marginal likelihood due to the marginalisation over the


function values. In the framework of Gaussian process regression, the prior is Gaussian, i.e. $\mathbf{f}|X \sim N(0, k(X, X))$, and the likelihood, too, is Gaussian, with $\mathbf{y}|\mathbf{f} \sim N(\mathbf{f}, \sigma^2 I_n)$. Since a Gaussian distribution multiplied by another Gaussian distribution is also Gaussian, (2.27) is analytically tractable. When aiming to maximise the evidence, one might equivalently consider maximising the log evidence, given by

$$\log p(\mathbf{y}|X, \theta) = -\frac{1}{2}\mathbf{y}^T\left(k(X, X) + \sigma^2 I_n\right)^{-1}\mathbf{y} - \frac{1}{2}\log\det\left(k(X, X) + \sigma^2 I_n\right) - \frac{n}{2}\log 2\pi. \qquad (2.28)$$

Note that this result also follows from observing that $\mathbf{y} \sim N(0, k(X, X) + \sigma^2 I_n)$.

The three terms in (2.28) have interpretable roles. The data-fit term is the one involving the observed targets, $-\frac{1}{2}\mathbf{y}^T(k(X, X) + \sigma^2 I_n)^{-1}\mathbf{y}$. Then there is a complexity penalty term depending on the covariance function and the noise, $-\frac{1}{2}\log\det(k(X, X) + \sigma^2 I_n)$. The final term is merely a normalising constant. The first two terms stand in direct conflict with each other: the data-fit term increases with model flexibility while the complexity term decreases. The trade-off between the two is why the log marginal likelihood is frequently used in Bayesian model selection. In practice, (2.28) is usually maximised through the equivalent minimisation of the negative log-likelihood using gradient descent algorithms. It is, however, a non-convex optimisation problem, which implies that local optima may exist.
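As a sketch, the log marginal likelihood (2.28) can be evaluated directly for a given kernel matrix; scikit-learn's GaussianProcessRegressor performs the corresponding maximisation internally when fitted. The data and candidate length-scales below are arbitrary, chosen only to show how the evidence changes with the hyperparameters.

```python
import numpy as np

def log_marginal_likelihood(K, y, sigma_n):
    # Equation (2.28): data-fit term, complexity penalty, and normalising constant.
    n = len(y)
    Ky = K + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)                          # stable inverse and determinant
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha
    complexity = -np.sum(np.log(np.diag(L)))            # equals -1/2 log det(Ky)
    constant = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + constant

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 15).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.05, size=15)

d2 = (x - x.T) ** 2
for l in (0.05, 0.2, 1.0):                              # compare three length-scales
    K = np.exp(-d2 / (2 * l**2))                        # SE kernel with sigma_f = 1
    print(l, log_marginal_likelihood(K, y, sigma_n=0.05))
```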

One specific part of model selection is variable selection, or feature selection, the process of determining which of all available input variables should be included in the model. For Gaussian process regression, Automatic Relevance Determination (ARD), proposed by Neal [14] and further discussed in [12], is a feasible method. It entails extending the squared exponential kernel in (2.17) to

$$k_{SE}(X, X') = \sigma_f^2 \exp\left(-\frac{1}{2}(X - X')^T M (X - X')\right), \qquad (2.29)$$

with $M = \mathrm{diag}(\mathbf{l})^{-2}$, a square matrix of the same size as X, i.e. p. The $l_1, ..., l_p$ act as characteristic length-scales, one per input dimension; inputs whose length-scales become very large during hyperparameter optimisation have little influence on the predictions and can thus be deemed less relevant.
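In scikit-learn, the ARD form of (2.29) corresponds to giving the RBF kernel one length-scale per input dimension. The feature matrix, noise model, and bounds below are hypothetical placeholders rather than the thesis's configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 4))                        # four candidate input features
y = 2.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=60)  # only two matter

# One length-scale per dimension (anisotropic RBF) gives the ARD kernel of (2.29).
kernel = RBF(length_scale=np.ones(4), length_scale_bounds=(1e-2, 1e3)) \
         + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Large fitted length-scales indicate inputs with little influence on the prediction.
print(gp.kernel_.k1.length_scale)
```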


Chapter 3

Methodology

3.1 Method Description

3.1.1 Models Considered

The overall aim of this thesis is early-stage prediction of cycle life for Lithium-ion batteries. More specifically, we ask whether we can achieve better results than those presented in [3], under the same constraints with regard to data usage, using Gaussian process regression. To that end, we will consider six different regression models. The first three models will be purely linear, replications of the models offered in [3]. The following three will be combinations of a linear model and different GPR models.

As referred to by the authors of [3], we will call the first model the variance-model, a simple linear regression using an input variable of high predictive value, derived from the voltage-discharge curve of Li-ion batteries. The second model, the discharge-model, adds another five regressors to the variance-model. The final model of [3] is referred to as the full-model and removes some regressors used in the discharge-model while adding others, for a grand total of nine regressors, including the one used in the variance-model. The three models involving GPR are all based on the same structure: the linear variance-model as the initial predictor, followed by GPR to predict the residuals created by the variance-model. The first GP-model, the GP-discharge model, will use the same five regressors the linear discharge-model added, now in the framework of residual prediction. The same approach is extended to the GP-full model. Finally, we present a third model using the linear variance-model as base while adding a residual regression using regressors chosen with a mixture of methods. For regressor usage specification and construction, see Appendix Table A.1 and the list of regressor construction explanations. The models considered are summarized in Table 3.1.



Name           Method                      Nr. of Regressors   Note
Variance       Linear                      1                   Replication
Discharge      Linear                      6                   Replication
Full           Linear                      9                   Replication
GP-discharge   Linear + Gaussian Process   6                   This work
GP-full        Linear + Gaussian Process   9                   This work
GP-custom      Linear + Gaussian Process   8                   This work

Table 3.1: Summary of the models that will be considered. The first three are replications of the models presented by Severson et al. The following three are this thesis's contribution.

3.1.2 Train and Test Data

The data of 124 cells will be split into three different sections, as done in [3], to enable fair comparison: a training set of 41 points, a primary test set of 43 points, and finally a secondary test set of 40 points. The differentiation between the primary and secondary test sets originates from [3] and is justified by the fact that the latter was generated after their initial model development. As explained in 2.2.1.5, the training set will be used to fit our models while the two test sets will be used to assess model performance. The authors of [3] "stress the error metrics of the secondary test data set, as these data had not been generated at the time of model development and are thus a rigorous test of model performance". For the full account of which cell belongs to which data set, please see the supplementary information of [3], available in [15].

3.1.3 Model Fitting

Linear models are fitted using ordinary least squares. The GPR model design and fitting are specified below. All Gaussian process modeling is done using Python's scikit-learn package for GPR, available at [16].

1. Fit the simple linear regression variance-model using the training set: $y = \beta_0 + \beta_{var} x_{var}$.

2. For all batteries, predict cycle life using the variance-model: $\hat{y}_{var} = \hat{\beta}_0 + \hat{\beta}_{var} x_{var}$.

3. For all batteries, compute the cycle life residuals by subtracting the variance-model prediction from the observed cycle life: $y_{res} = y - \hat{y}_{var}$.


4. Train a Gaussian process regression model using the training data, where the above-mentioned residuals act as target values: $y_{res} = f_{GP}(x)$.

5. For all batteries, predict cycle life residuals using the GP-model: $\hat{y}_{res} = \hat{f}_{GP}(x)$.

6. For all batteries, the final cycle life prediction is the sum of the variance-model prediction and the GP-model residual prediction: $\hat{y} = \hat{y}_{var} + \hat{y}_{res}$.

Note that this design is mathematically equivalent to setting the mean function of the Gaussian process prior to the variance-model. We might also see it as a sum of different functions: $Y = g(X_{var}) + f(X) + \epsilon$, with $g(X_{var}) = \beta_0 + \beta_{var} X_{var}$ and $f(X) \sim \mathcal{GP}\left(0, k(X, X')\right)$. Practically, the steps outlined above are easy to implement, as sketched below. Steps 4-6 will be done using the five most commonly considered kernel functions, enabling thorough comparison.

3.1.4 Feature Selection

Concerning the methods used for input variable selection, or feature selection, a few things should be noted. Since the GP-discharge model and GP-full model use the same regressors as their corresponding linear models, feature selection is not relevant for these models. For the GP-custom model, however, the regressors are selected based on three different considerations. Firstly, Automatic Relevance Determination will be employed and assessed. Secondly, each regressor will be visually assessed when univariately fitted in various GPR settings, see Figures A.3 to A.7. Thirdly, systematic evaluation after removing single regressors will be used to determine said regressor's impact on performance.

3.1.5 Model Evaluation

All models will be evaluated on the primary and secondary test sets. The evaluation metrics used will be RMSE and average percentage error, as these are the ones used by [3]. We define RMSE as

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \qquad (3.1)$$

and average percentage error as

$$\%\,err = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - \hat{y}_i|}{y_i} \times 100. \qquad (3.2)$$
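The two metrics translate directly into code; a small sketch with hypothetical prediction arrays:

```python
import numpy as np

def rmse(y_true, y_pred):
    # Equation (3.1): root mean squared error, here in cycles.
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mean_percent_error(y_true, y_pred):
    # Equation (3.2): average absolute percentage error.
    return np.mean(np.abs(y_true - y_pred) / y_true) * 100

y_true = np.array([300.0, 800.0, 1500.0])   # observed cycle lives (hypothetical)
y_pred = np.array([330.0, 760.0, 1400.0])   # predicted cycle lives (hypothetical)
print(rmse(y_true, y_pred), mean_percent_error(y_true, y_pred))
```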



3.2 Rationale for Method Choice

In short, the method used in this thesis combines simple linear regression with Gaussian process regression on a large feature space. There are several reasons why this method seems to be a reasonable approach when aiming to predict cycle life for Li-ion batteries. Let us start by justifying the use of the variance-model as initial predictor, with the subsequent application of non-linear methods, in our case GPR, to account for the residuals. Firstly, the impressive accuracy achieved by the linear variance-model is an argument to use said model as an initial predictor. Secondly, even though the linear extensions, i.e. the discharge- and full-model of Severson et al., achieve slightly better results [3], it is clear that the added regressors have limited predictive power in a linear framework. Combining that insight with the fact that batteries age due to many different and complex mechanisms, see [2] for an illuminating account, it is reasonable to suspect that applying more flexible, non-linear methods to these added regressors might capture patterns that linear models otherwise fail to account for.

It remains, however, to justify the choice of Gaussian processes over other flexible methods such as support vector machines (SVMs) and neural networks (NNs). GPR is chosen over SVMs and NNs because it offers confidence intervals associated with the predicted values, and not just point predictions. This is a desired characteristic in practical implementations of the predictive model, both economically and from a safety point of view [2]. The relevance vector machine (RVM), the Bayesian counterpart of the support vector machine and a particular instance of a Gaussian process, first proposed by Tipping [17], is also capable of producing predictions with associated uncertainties. The author of this thesis opted for GPR as [2] stresses the need for large data sets when implementing RVM. Furthermore, [18] argues that RVM is a poor long-term predictor of RUL. However, the authors of [18] consider a different framework with regard to input-output variables than the one considered in this thesis. Due to this difference, it would nevertheless be interesting to apply RVM in a similar input-output variable fashion as GPR is applied in this thesis.


Systematically removing single regressors followed by refitting and evaluation is an inefficient but nevertheless potentially illuminating method. Further evaluation will be discussed in the next section.

3.3 Method Evaluation

A frequently mentioned drawback of Gaussian process regression is computational complexity [2]. Training of, and prediction using, Gaussian processes are of complexity $O(n^3)$ and $O(n^2)$, respectively, where n is the number of training points [12]. This can be compared to linear regression, where training is done in $O(np^2)$, with p the number of features used, and prediction is a single dot product of size p [20]. That being said, the comparisons in model performance between fully linear and Gaussian process models are fair only with regard to data availability, not with regard to computational complexity. Although the complexity of GPR does not pose a problem in this thesis's specific implementation, as our data set consists of only 124 points, it might be a limiting factor for later, practical, implementations of the model. When dealing with large data sets, there are several methods dedicated to decreasing the complexity of GPs, see for example [21]. Another drawback of Gaussian processes concerns the model's sensitivity to the choice of covariance function [2]. To mitigate this drawback, this thesis evaluates performance using the five most frequently used covariance functions. That does not, however, exclude the possibility of other covariance functions, for example different combinations of sums, products, or convolutions of the ones explored, being a superior choice. Such considerations for this specific learning task are left for future research.

Feature selection will be done using a number of methods. Firstly, Automatic Relevance Determination will be used. ARD is, however, problematic for two reasons. Firstly, the length scales themselves, $l_i$, $i \in \{1, ..., p\}$, are not consistently well-identified; only the ratios of $\sigma_f^2$ and $l_i$ are. Secondly, ARD consistently overestimates the relevance of


Chapter 4

Results and Discussion

4.1 Linear Model Replication

4.1.1 The Feature of the Variance-Model

One of the 19 available features has superior predictive power over all others. It was introduced and initially accounted for in subsection 2.1.3. It was derived in [3], where the rationale behind its predictive ability is further discussed. Here, we merely remind the reader that it is based on the variance of the voltage-discharge curve for differing cycles. Figure 4.1 shows the linear relationship between this variable and cycle life, after Log10 transformations of both variables. Although no model has yet been trained, a legend for the data sets is included, showing how our data sets are oriented for this variable. Note that the x-axis is Z-score transformed using the training data. For corresponding plots of all available regressors, see Appendix Figure A.1.

Figure 4.1: Log10 cycle life against the regressor of the variance-model.



Figure 4.2: Predicted cycle lives to observed cycle lives for the three replicated models. (a) Variance-model. (b) Discharge-model. (c) Full-model. Points on the black line are perfectly predicted. Image design by Severson et al.

4.1.2 Replications and Comparisons

The three models in [3] are replicated to the best of the author's ability. The results, when plotting the predicted cycle lives against the actual, observed cycle lives for the three models, are shown in Figure 4.2. Table 4.1 shows the models' performance with regard to RMSE and mean percent error. The first three rows, marked with *, are from Table 1 of [3]. The following three are the replications of this thesis. The results for the variance-model are very consistent. For the other two, however, there are some discrepancies in the results. For example, the discharge-model of [3] achieves 8.6% mean error on the secondary test set while the replicated one only achieves 12.3%. Similar discrepancies can be found for the full-model.



              RMSE (cycles)                              Mean percent error (%)
              Train   Primary test   Secondary test      Train   Primary test   Secondary test
Variance*     103     138 (138)      196                 14.1    14.7 (13.2)    11.4
Discharge*    76      91 (86)        173                 9.8     13.0 (10.1)    8.6
Full*         51      118 (100)      214                 5.6     14.1 (7.5)     10.7
Variance      104     138 (138)      196                 14.1    14.7 (13.2)    11.4
Discharge     67      115 (111)      178                 8.8     12.8 (9.5)     12.3
Full          54      178 (167)      189                 6.0     15.6 (9.1)     10.4

Table 4.1: Performance of the original models by Severson et al. (marked with *) and the replicated models. Severson et al. argue that the cell with the shortest lifetime is an outlier and does not represent the data set in general; the results when excluding said cell are presented in parentheses.

4.2 Gaussian Process Regression Models

The model design of the GPR models is given in Section 3.1.3 above. After predicting the cycle life using the replicated variance-model, and subtracting that prediction from the observed cycle life values, one is left with a set of residuals. Those residuals are plotted against 18 regressors in the appendix, Figure A.2. Figures A.3 to A.7, one for each kernel, show the results of fitting a GP model using the residuals as target values and the single regressor indicated along the x-axis as input variable.
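To make the two-stage construction concrete, the sketch below fits a linear variance-model, forms the residuals, and fits one GP residual model per kernel. All data are random placeholders, the variable names are hypothetical, and LinearRegression stands in for whatever linear fitting procedure the replication uses; the actual regressors, transformations, and hyperparameter settings are those described in Section 3.1.3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, Matern, RationalQuadratic,
                                              ExpSineSquared, DotProduct)

rng = np.random.default_rng(0)
X_var = rng.normal(size=(124, 1))     # stand-in for the variance-model regressor
X_extra = rng.normal(size=(124, 1))   # stand-in for the extension regressor(s)
y = 3.0 + 0.5 * X_var[:, 0] + 0.1 * rng.normal(size=124)   # stand-in for log10 cycle life

linear = LinearRegression().fit(X_var, y)      # step 1: linear variance-model
residuals = y - linear.predict(X_var)          # step 2: residual targets for the GP

kernels = {
    "RBF": RBF(),
    "RatQ.": RationalQuadratic(),
    "ExpSineSq.": ExpSineSquared(),
    "DotProd.": DotProduct() ** 2,             # dot-product kernel of second degree
    "Matern": Matern(nu=1.5),                  # Matern 3/2
}
gp_models = {
    name: GaussianProcessRegressor(kernel=k, alpha=1e-2, normalize_y=True).fit(X_extra, residuals)
    for name, k in kernels.items()
}
```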

4.2.1 GP-discharge Model



Discharge     RMSE (cycles)                              Mean percent error (%)
              Train   Primary test   Secondary test      Train   Primary test   Secondary test
Linear*       76      91 (86)        173                 9.8     13.0 (10.1)    8.6
Linear        67      115 (111)      178                 8.8     12.8 (9.5)     12.3
RBF           50      117 (116)      188                 6.4     10.4 (8.4)     12.3
RatQ.         50      117 (116)      188                 6.4     10.4 (8.4)     12.3
ExpSineSq.    44      117 (117)      194                 5.5     9.3 (8.5)      13.0
DotProd.      66      112 (107)      172                 9.2     12.6 (9.3)     9.9
Matern        45      116 (115)      185                 5.8     10.6 (8.5)     11.9

Table 4.2: Performance of the GP-discharge models for five different kernels. The two linear models at the top are the already presented original (*) discharge-model and replication discharge-model.

Figure 4.3: GP-discharge model using the Matern 3/2 kernel. Upper left: variance-model from above, included for comparison. Upper right: predicted residuals against observed residuals after GP prediction. Lower left: the two upper models combined. Lower right: same as lower left, including 95% confidence bars from the residual prediction. Points on the black line are perfectly predicted.

observed residual using the Matern kernel is seen. Thirdly, the lower left figure shows the sum of the linear prediction and the GP residual prediction. Finally, we present that same prediction with the 95% confidence bars, calculated in the GP prediction for the residuals.
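The combined prediction and the 95% bars shown in the lower panels can be formed as in the following self-contained sketch, again with placeholder data and hypothetical names; note that the interval reflects only the uncertainty of the GP residual prediction, as stated in the figure caption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(1)
X_var, X_extra = rng.normal(size=(124, 1)), rng.normal(size=(124, 1))
y = 3.0 + 0.5 * X_var[:, 0] + 0.1 * rng.normal(size=124)

linear = LinearRegression().fit(X_var, y)
gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), alpha=1e-2, normalize_y=True)
gp.fit(X_extra, y - linear.predict(X_var))

res_mean, res_std = gp.predict(X_extra, return_std=True)
combined = linear.predict(X_var) + res_mean    # lower-left panel: summed prediction
lower = combined - 1.96 * res_std              # lower-right panel: 95% bars from the
upper = combined + 1.96 * res_std              # GP residual prediction only
```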



Figure 4.4: RMSE (upper) and percent error (lower) contributions for the primary (red) and secondary (orange) test sets using the linear (L) or Gaussian process (GP, Matern 3/2 kernel) discharge extension. The Matern kernel was chosen due to its overall good test scores.

replication model and a GP model respectively. Figure 4.4 shows how the test sets, the performance scores, the replicated linear and the GP Matern model relate to one another, and to the overall distribution of observed cycle lives. For example, we see that the cells with the longest cycle lives contribute the most in absolute terms to prediction error, as seen to the right in the upper figure, especially to the primary GP score (rightmost point). However, with regard to percentage, it is in particular the cell with the shortest cycle life (leftmost, lower) that contributes to the high primary linear score. The lines indicate the averages over both test sets, for the replicated linear discharge-model and for the GP-discharge model. The GP line is just below in both cases. 47 out of the 83 test points (57%) are better predicted using the GP extension than with the linear extension.
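The per-point comparison behind the 57% figure can be computed as in the short sketch below; the function and variable names are hypothetical placeholders for the observed and predicted cycle lives of the test cells.

```python
import numpy as np

def share_better(y_test, pred_linear, pred_gp):
    """Fraction of test cells whose cycle life is predicted more accurately
    by the GP extension than by the linear extension."""
    better = np.abs(pred_gp - y_test) < np.abs(pred_linear - y_test)
    return better.mean()
```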



Full model    RMSE (cycles)                              Mean percent error (%)
              Train   Primary test   Secondary test      Train   Primary test   Secondary test
Linear*       51      118 (100)      214                 5.6     14.1 (7.5)     10.7
Linear        54      178 (167)      189                 6.0     15.6 (9.1)     10.4
RBF           29      115 (114)      191                 3.9     9.9 (8.3)      10.4
RatQ.         24      114 (114)      190                 3.3     9.8 (8.0)      10.3
ExpSineSq.    19      144 (142)      204                 2.4     12.3 (9.6)     12.2
DotProd.      53      111 (100)      186                 7.2     11.9 (6.8)     10.1
Matern        22      113 (113)      191                 2.9     9.6 (7.8)      10.3

Table 4.3: Performance of the GP-full models for five different kernels. The two linear models at the top are the already presented original (*) full-model and replication full-model.

The original discharge-model achieves remarkable performance on the secondary test, which is difficult to compare and discuss further as the replication model did not achieve the same result; see the discussion concerning data in 4.1.2 above. In summary, the results for the linear discharge-extension and the GP-discharge extension are very similar, with a few edge points seemingly causing the largest discrepancies in metric performance. A slight preference could be placed on the GP extension, as it predicts a majority of the points better and achieves slightly better overall performance scores.

4.2.2 GP-full Model



Figure 4.5: GP-full model using the Matern kernel. Upper left: variance-model from above, included for comparison. Upper right: predicted residuals against observed residuals after GP prediction. Lower left: the two upper models combined. Lower right: same as lower left, including 95% confidence bars from the residual prediction. Points on the black line are perfectly predicted.


The results for the GP-full model are further analysed. Firstly, the superior training scores are familiar, as this was also the case for the GP-discharge model. As in the discharge case, these are probably due to the greater flexibility of the GPs compared to the linear models, and thus a model fitted more closely to the training data. Secondly, concerning the lower RMSEs and percentage errors of the GPs compared to the linear model, they are, for the RMSEs, mainly due to a few points, as shown in the upper plot of Figure 4.6, and, for the percentage error, mainly due to the leftmost point in the lower plot of the same figure. A probable explanation of the results regarding the model with the dot-product kernel, i.e. good performance on the secondary test set but the worst on the training, might be that the dot-product of second degree is the least flexible kernel, and thus not as tuned into the training data as the other kernels. Remember that the train data and primary data come from the same data batch, while the secondary test data is from another. Differences can be seen in the clusters of A.1. From Figure 4.6 we note that the linear and GP points are, in general, closer together than in the discharge case, implying overall very similar predictions. However, arguably, the overall prediction accuracy is superior for the GP compared to the linear model, evidenced by the superior handling of outliers and a majority of points being better predicted by the GP. Consequently, the GP extension is preferred.

4.2.3 GP-custom Model

The discharge extension's, as well as the full extension's, use of features was predetermined by the choices made in [3]. For the custom version, on the other hand, the inclusion of a regressor was initially thought to depend on Automatic Relevance Determination. The results were not encouraging. Firstly, they were inconsistent: the length scales returned after fitting a GP using all available features and the RBF ARD kernel varied. Even when averaging over many rounds and comparing such averages, severe variations were present. This might be a result of the problems discussed in Method Evaluation 3.3, or of local minima in the optimisation problem of maximising the evidence.
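A sketch of the kind of ARD fit referred to here is given below, using scikit-learn's anisotropic RBF kernel on placeholder data; the real runs use all available regressors and the settings of Chapter 3, so the data, noise level, and variable names below are assumptions for illustration only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)
X = rng.normal(size=(124, 18))                 # placeholder for the available regressors
y = X[:, 0] + 0.1 * rng.normal(size=124)       # placeholder target (residuals)

ard_kernel = RBF(length_scale=np.ones(X.shape[1]))   # one length scale per feature
gpr = GaussianProcessRegressor(kernel=ard_kernel, alpha=1e-2,
                               normalize_y=True, n_restarts_optimizer=5)
gpr.fit(X, y)

# Small fitted length scales suggest relevant features and large ones irrelevant
# features, but the caveats discussed above (poorly identified scales, local
# optima in the evidence) mean the ranking should be read with care.
print(np.argsort(gpr.kernel_.length_scale))
```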



Custom model  RMSE (cycles)                              Mean percent error (%)
              Train   Primary test   Secondary test      Train   Primary test   Secondary test
RBF           40      110 (110)      184                 5.1     9.8 (8.1)      10.3
RatQ.         39      111 (110)      181                 5.0     9.8 (8.0)      10.0
ExpSineSq.    75      134 (134)      205                 9.5     11.2 (10.0)    12.4
DotProd.      64      117 (112)      172                 8.6     12.8 (9.0)     8.5
Matern        36      110 (109)      179                 4.6     9.7 (7.9)      9.9

Table 4.4: Performance of the GP-custom models for five different kernels.

Some regressors show promising results over multiple kernels, displaying patterns beyond a constant function at zero but still smoother than something unrealistically fluctuating. The five regressors showing the most promising results are shown in Figure 4.7, using three different kernel functions. As these regressors seem relevant for multiple kernels, one is inclined to believe that they actually are, which is why they will be considered for the GP-custom model. Furthermore, as the GP-full model has achieved the best results so far, its regressors will be further considered. Combining the results from ARD, the graphic analysis and the full model, a total of 12 regressors are considered when doing the final elimination, which removes a single regressor at a time and evaluates the overall results. Five regressors are removed this way, leaving a total of seven features for the custom extension, and thus eight regressors for the GP-custom model in total (seven plus one from the initial linear predictions).
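One possible automation of this final elimination step is sketched below. The thesis performs the step by inspecting the overall results manually, so the greedy stopping rule, the validation split, and all names here are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def backward_eliminate(X_train, r_train, X_val, r_val):
    """Greedy backward elimination on GP residual models: repeatedly drop a
    single regressor whenever its removal does not worsen validation RMSE."""
    def val_rmse(cols):
        gp = GaussianProcessRegressor(kernel=Matern(nu=1.5), alpha=1e-2, normalize_y=True)
        gp.fit(X_train[:, cols], r_train)
        return np.sqrt(np.mean((r_val - gp.predict(X_val[:, cols])) ** 2))

    keep = list(range(X_train.shape[1]))
    best = val_rmse(keep)
    improved = True
    while improved and len(keep) > 1:
        improved = False
        for j in list(keep):
            cols = [c for c in keep if c != j]
            score = val_rmse(cols)
            if score <= best:          # removal does not hurt: drop regressor j
                keep, best, improved = cols, score, True
                break
    return keep, best
```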

The RMSE and percentage error results for the GP-custom model are shown in Table 4.4. Some things stand out. Firstly, using the dot-product kernel of second degree, one achieves the best score reported on the secondary test, even better than the original discharge-model of [3]. Secondly, using the Matern kernel, the mean percent error is below 10% on both test sets for the first time. A comparison against the rest of the models, and a related discussion, follow in the next section.

4.3 Complete Model Comparisons



Figure 4.7: Univariately fitted GP models using the regressor indicated by the x-axis. The black line shows the predictive mean, while the grey area represents one standard deviation of that mean. (a) RBF kernel. (b) Matern 3/2 kernel. (c) Dot-product kernel of second degree.

When excluding said cell, the original full-model achieves the best percentage error of 7.5% (although the GP-full model achieved 6.8% on this metric). Furthermore, when examining the scores relating to the second test set, the GP models outperform the linear ones with regard to percentage error, the exception being the original discharge-model, which achieves 8.6%. That figure is, however, marginally beaten by the GP-custom dot-product model with its 8.5%. Finally, we note that the models that perform the best on the secondary test set, i.e. the original full-model and the GP-custom dot-product model, do not compete for the best performance on the primary or train scores.



Comparison    RMSE (cycles)                              Mean percent error (%)
              Train   Primary test   Secondary test      Train   Primary test   Secondary test
Variance*     103     138 (138)      196                 14.1    14.7 (13.2)    11.4
Variance      104     138 (138)      196                 14.1    14.7 (13.2)    11.4
Discharge*    76      91 (86)        173                 9.8     13.0 (10.1)    8.6
Discharge     67      115 (111)      178                 8.8     12.8 (9.5)     12.3
GP-Matern     45      116 (115)      185                 5.8     10.6 (8.5)     11.9
Full*         51      118 (100)      214                 5.6     14.1 (7.5)     10.7
Full          54      178 (167)      189                 6.0     15.6 (9.1)     10.4
GP-Matern     22      113 (113)      191                 2.9     9.6 (7.8)      10.3
DotProd.      64      117 (112)      172                 8.6     12.8 (9.0)     8.5
Matern        36      110 (109)      179                 4.6     9.7 (7.9)      9.9

Table 4.5: Performance of the original (*) models by Severson et al., each followed by its linear replication and, finally, the corresponding GP model. The two bottom rows are from the GP-custom model.

As before, we attribute the superior performance on the train data of the GP models to their greater flexibility compared to linear models. Furthermore, regarding the original full-model's performance on the primary test percentage error when excluding the outlier, 7.5%, it is not obvious that this establishes the original full-model's superiority; as the linear full replication fails to achieve similar accuracy, discrepancies in the data used seem likely, and when comparing the linear replication with its GP counterpart, the GP achieves greater accuracy. The fact that models with superior performance on the secondary test set do not compete for the best scores on the primary test set indicates a trade-off between the respective sets. The model that seems to find the best compromise is the GP-custom Matern kernel model, performing well overall but, in fact, not the best on any particular score. Finally, one may note that we have compared many scores on mean percent error throughout this chapter. This is mainly because the results for the RMSEs are more difficult to interpret. Cells with longer lives contribute more to this measure, relatively speaking, as indicated by Figures 4.4 and 4.6. This is also why removing the cell with the shortest life often lowers the percentage error considerably, with marginal effect on the RMSE.
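A toy calculation illustrates this point; the two cells and errors below are made up solely for illustration.

```python
import numpy as np

y_true = np.array([300.0, 2000.0])    # a short-lived and a long-lived cell
y_pred = np.array([400.0, 1850.0])    # errors of 100 and 150 cycles

# The long-lived cell dominates the RMSE (its squared error is 22500 vs 10000) ...
print(np.sqrt(np.mean((y_true - y_pred) ** 2)))            # about 127 cycles
# ... while the short-lived cell dominates the mean percent error (33.3% vs 7.5%).
print(100 * np.mean(np.abs(y_true - y_pred) / y_true))     # about 20.4%
```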


References
