Degree project in

Lifetime Analysis of Automotive Batteries using Random Forests and Cox Regression

JOHANNA ROSENVINGE


February 2013

Master’s Thesis in Computer Science at CSC
Supervisor: Örjan Ekeberg
Examiner: Anders Lansner
Project provider: Scania CV AB

TRITA xxx yyyy-nn


Abstract

Worn-out batteries are a frequent cause of unplanned immobilization of trucks, causing disrupted operations for haulage contractors. To avoid unplanned maintenance, it is desirable to estimate the battery lifetime accurately so that preventive replacements can be performed before the component fails. This master’s thesis has investigated how technical features and operational conditions influence the lifetime of truck batteries and how the risk of failure can be modeled.

A support vector machine classifier has been used to examine how well the available data discriminate the vehicles with battery failure from those without. The performance of the classifier, measured as the area under the receiver operating characteristic curve, was 70.54% and 76.95% for haulage and distribution vehicles respectively. Maximum likelihood estimation was applied to censored failure time data, showing that, if failures that occurred within 100 days after delivery were omitted, both failure data sets were normally distributed at the 95% significance level.

To investigate how different features influence the lifetime, random forests and Cox regression were applied to two different models: one intended to be applied to new vehicles and one for vehicles that have been operating for some time, hence having an age covariate. The results from the first model were satisfactory, with significant Cox coefficients and low Brier scores for both random forests and Cox regression. The second model, however, did not give credible results, having non-significant regression coefficients.


Sammanfattning

Lifetime analysis of automotive batteries using random forests and Cox regression

Worn-out batteries are a common cause of unplanned stops for trucks, which disrupts the haulage contractors’ planning. To avoid unplanned maintenance, it is desirable to be able to estimate the battery lifetime so that, based on that estimate, preventive replacements can be performed before the component fails. This master’s thesis has investigated how different technical features and operating conditions influence the lifetime of truck batteries and how the risk of battery failure over time can be modeled.

A support vector machine has been used to study how well the available data separate the vehicles with battery problems from those without. The performance of the classifier, measured as the area under the receiver operating characteristic curve, was 70.53% for haulage vehicles and 76.95% for distribution vehicles. Maximum likelihood estimation was applied to censored data of failure times. This analysis showed that the data sets for both vehicle types were normally distributed at the 95% significance level if the first hundred days after the vehicle was delivered to the customer were omitted.

To investigate how different features affect the lifetime, the methods random forests and Cox regression were applied to two different models. The first model is intended to be applied to new vehicles and the second to vehicles that have been in operation for some time, and which therefore have a variable describing the vehicle’s age. The results from the first model were satisfactory: the Cox coefficients were significant and the Brier scores were low, both for random forests and for Cox regression. The second model, however, did not give reliable results, as its regression coefficients were not significant.


Acknowledgements

First, I would like to express my gratitude to my supervisor at Scania, Thomas Claesson, whose expertise, understanding and patience have been invaluable to me. I would also like to thank Ann Lindqvist and Björn Y Andersson at Scania for helping me in procuring the data for this thesis. For their technical support and valuable contributions, I would like to thank Erik Frisk and Mattias Krysander at Linköping University and Gunnar Ledfelt at Scania. I would also like to thank Örjan Ekeberg, my supervisor at KTH, for his academic experience, as well as my examiner Anders Lansner.


Abbreviations

Ah Ampere-hour
ANN Artificial neural networks
AUC Area under curve
CBM Condition-based maintenance
MLE Maximum likelihood estimation
MLR Multiple logistic regression
PDF Probability density function
RBF Radial basis function
RF Random forests
ROC Receiver operating characteristic
RSF Random survival forests
RUL Remaining useful life
SOC State of charge
SOH State of health
SVM Support vector machine


Contents

1 Introduction
  1.1 Problem description
  1.2 Objectives with the thesis
  1.3 Outline

2 Background
  2.1 Battery characteristics and degeneration
    2.1.1 The lead-acid battery
    2.1.2 Degeneration processes and stress factors
  2.2 Component maintenance policies and remaining useful life
  2.3 Lifetime estimations of batteries
  2.4 Failure time analysis
    2.4.1 Reliability function
    2.4.2 Lifetime distribution and lifetime density
    2.4.3 Hazard function
    2.4.4 Censoring

3 Data
  3.1 Failure data
  3.2 Technical specification
  3.3 Operational data

4 Classification and feature selection
  4.1 Support vector machine
  4.2 Data preprocessing
  4.3 Feature selection
  4.4 Parameter selection
  4.5 Performance evaluation
  4.6 Implementation
  4.7 Data

5 Lifetime distribution estimations
  5.1 Maximum likelihood estimation
    5.2.1 Normal distribution
    5.2.2 Weibull distribution
    5.2.3 Gamma distribution
  5.3 Performance evaluation
  5.4 Implementation
  5.5 Data

6 Reliability models
  6.1 Random forests
    6.1.1 Splitting rules
  6.2 Cox regression model
    6.2.1 Assumption of the Cox model
    6.2.2 Estimation of baseline hazard
  6.3 Performance evaluation
  6.4 Implementation
  6.5 Data

7 Results
  7.1 Classification and feature selection
    7.1.1 Discriminating features
    7.1.2 Classification results
  7.2 Lifetime distribution estimations
    7.2.1 Haulage vehicles
    7.2.2 Distribution vehicles
    7.2.3 Goodness of fit
  7.3 Reliability model I
    7.3.1 Haulage vehicles
    7.3.2 Distribution vehicles
  7.4 Reliability model II
    7.4.1 Haulage vehicles
    7.4.2 Distribution vehicles

8 Discussion

9 Conclusion

Bibliography


Chapter 1

Introduction

Effective transportation of goods and people requires that vehicles are available when they are planned to operate. For safety reasons, it is crucial to avoid unplanned breakdowns that can cause traffic casualties and tailbacks. Disrupted operations can also entail fees for late delivery and destroyed goods for the haulage contractor. Accurate lifetime predictions for components enable preventive replacements to be performed more efficiently, avoiding unplanned maintenance. If the component lifetime can be estimated, reliability can be improved while unnecessary maintenance is avoided and maintenance can be scheduled more efficiently.

This thesis focuses on predictive replacements of automotive batteries, one of the more complex components in a vehicle to monitor and predict. A fixed, conservative battery replacement interval that covers all operating situations can be used to reduce the risk of replacing the battery too late. However, by using information about operating conditions, the battery replacement interval could be made flexible, reducing the maintenance cost for the haulage contractors.

1.1 Problem description

A discharged battery is the main cause of unplanned immobilization of long haulage trucks produced by Scania. Total power demands are constantly increasing, due to a larger number of electrically powered systems in modern vehicles. More vehicles are now equipped with features like windscreen heating, seat heating, kitchen equipment and communication systems. If a truck driver consumes too much electrical energy with the engine inoperative, the battery will not be able to deliver enough power to start the engine. Battery related problems are more likely to occur in winter, as the battery’s ability to accept charge drops in cold temperatures, which increases the time it takes to recharge the battery. Also, the load on the battery usually increases in colder weather. In most cases it is enough to charge the battery to restore acceptable functionality. In many situations of failure, however, the

Battery performance and useful lifetime are affected by various factors, such as operating temperature and discharging/charging cycles. The useful life is in this context defined as the period during which the battery is expected to be usable for the purpose for which it was acquired. The purpose of the automotive battery is to start the vehicle in all kinds of weather, but also to cover the electrical power needs of the vehicle when the alternator is switched off or cannot generate enough power.

The aging of batteries is a complex process in which several parallel degeneration processes are involved. As the battery becomes older, it loses performance and becomes more susceptible to failure. Today, satisfactory methods to estimate the health of the battery in a Scania vehicle are lacking. One way to prevent starting problems due to worn-out batteries is to change the battery more often. However, this can be unnecessarily expensive for the haulage contractor. For this reason, a better method to determine the risk of battery failure is desired.

1.2 Objectives with the thesis

The ambition of this master’s thesis is to develop a model that can predict the risk of failure of an automotive battery. The actual lifetime of a battery is typically random and unknown, and must therefore be statistically estimated from available sources of information.

This master’s thesis attempts to:

• analyze which variables discriminate the vehicles with battery failure within a distinct period of time from those without,

• examine if a classifier can separate these two groups of vehicles from the available data,

• model the distribution of lifetimes of automotive batteries,

• model the impact of explanatory variables on the lifetime distribution.

1.3 Outline

The thesis starts with a background chapter which first describes the theory of lead-acid battery characteristics and the dominating degeneration processes, aiming to give the reader a basic understanding of how different conditions and usage affect the battery lifetime. Next, the chapter discusses different maintenance policies for components and the state of research on lifetime estimation and failure time analysis of components in general and batteries in particular.

The background is followed by a chapter describing the data available in the project and the form in which the variables are given.

The description of the methods used in the project is divided into three chapters, each covering one or more of the objectives described in section 1.2. The first of these chapters contains the theory and implementation of the classification and feature selection used to analyze the first two thesis objectives. The second chapter contains the theory and implementation of maximum likelihood estimation, which partly addresses the third objective. The third and last method chapter describes the theory and implementation of the reliability models that are used to model the lifetime distributions and the impact of different variables on the lifetime. This analysis aims to cover the third and fourth objectives.

The method chapters are followed by results, discussion and conclusion.


Chapter 2

Background

2.1 Battery characteristics and degeneration

In a modern road vehicle, the electrical system is vital for mobility, safety and comfort. Since the electrical system in an automotive vehicle is required to be active before the alternator can function, some kind of battery is necessary. The original function of the battery was to support the starting, lighting and ignition system. In a modern vehicle, however, the list of electrical equipment is constantly growing, which imposes heavier loads on the battery.

The electrical system of a vehicle consists of a battery, an alternator, voltage control and protective devices, and the electrical loads. Once the engine is operating, the alternator generates electrical power and distributes it to the electrical energy consuming devices in the vehicle. The surplus energy is used to charge the battery. The battery and the alternator are placed in parallel across the system, and the battery is kept at the alternator output voltage, which is also applied directly to the loads.

2.1.1 The lead-acid battery

In automobiles, the lead-acid battery is by far the most common. The lead-acid battery has several advantages that make it suitable for the broad spectrum of automotive duties, and although new technologies are constantly being introduced on the market, it appears likely to retain its dominant position. Some advantages of the lead-acid battery are (Reasbeck and Smith, 1997):

• The ability to deliver very large currents over a short period of time, which is required to start the vehicle.

• The high cell voltage, which results in fewer cells per battery for a given voltage.

• The availability and low cost of the materials in the component.


• The high chemical stability over the range of temperatures automobiles nor- mally operate in.

The lead-acid battery, invented by Gaston Planté in 1859, was the first battery that could be recharged by passing a reverse current through it. Although the battery has developed and improved a great deal since then, the basic principles are still the same.

A battery is made up of several power producing cells connected in series or in parallel to achieve the desired voltage and capacity. The cells are normally identical and contribute the same amount of voltage. The voltage from each cell depends on the chemical reactions within the cell. Each cell consists of two electronically conducting plates, the electrodes, in contact with an ionically conducting phase, the electrolyte. Reactions occur at the surfaces of the plates; in these reactions, electrons are exchanged between the electrodes and the ions in the electrolyte. The circuit of charge flow is completed through an electric circuit between the electrodes.

When the lead-acid battery is fully charged, the positive electrode consists of lead dioxide (PbO2) and the negative electrode of metallic lead (Pb). The third active material is the sulfuric acid (H2SO4), which forms the conductive electrolyte between the electrodes. The electrolyte also contains water. On discharge, the following chemical reactions occur at the positive plate, at the negative plate and overall, respectively:

PbO2 + 4H^+ + SO4^2- + 2e^- → PbSO4 + 2H2O   (2.1)

Pb + SO4^2- - 2e^- → PbSO4   (2.2)

PbO2 + Pb + 2H2SO4 → 2PbSO4 + 2H2O   (2.3)

During discharge, the concentration of sulfuric acid in the electrolyte decreases. This fact has frequently been used to indicate the state of charge of the battery by measuring the density of the electrolyte. As the concentration of sulfuric acid decreases, the conductivity is reduced, which contributes to lower power output at low states of charge. At the electrodes, lead sulfate is produced during discharge. The layers of sulfate block the active materials in the electrodes. As more lead sulfate builds up, less current can be discharged and the power output drops. The charging process is the reverse, where electrons are forced from the positive plate to the negative plate.

The lead-acid cell consists of the following functioning parts:


Active materials

The porous lead dioxide on the positive plate and the porous metallic lead on the negative plate are the active materials that provide the electrode reactions.

Support grids

The purpose of the grids is to provide mechanical support for the porous active materials and to provide a low-resistance current path to the cell terminals.

Separators

The purpose of the separators is to prevent short-circuit through physical contact between positive and negative plates.

Electrolyte

The electrolyte provides an ionic path between the plates. In the lead-acid cell the electrolyte is also one of the active materials, as it provides the sulfuric acid for the electrode reactions.

Cell terminals

The cell terminals provide a low-resistance electrical connection to the outer circuit. Usually, all positive plates are linked to one terminal and all negative plates to another terminal.

Cell container

The cell container encloses the cell components. It is made of plastic, which is chemically resistant and insulating.

2.1.2 Degeneration processes and stress factors

As the battery becomes older, the inner structure of its components and materials changes. This leads to degradation of performance and eventually to the end of life of the battery. The following aging mechanisms dominate the lead-acid battery damaging process (Svoboda, 2004):

Corrosion of the positive grid

As parts of the grid corrode, the connection between the active material and the terminals is reduced. This causes a reduction in capacity. Grid corrosion also causes an increase in the internal resistance. Corrosion of the positive grid is a natural aging mechanism, due to the fact that the lead on the positive electrode is thermodynamically unstable. According to Ruetschi (2004), positive grid corrosion is probably the most frequent cause of failure in lead-acid automotive batteries.


Sulfation

During discharge, sulfate is created at both electrodes. When charged, the sulfate is dissolved and converted to lead dioxide on the positive electrode and metallic lead on the negative electrode. These are the fundamental reactions of the lead-acid battery. Under some conditions though, the sulfate is built up in large crystals that cannot be dissolved during the charging process. As a consequence, the amount of active material is decreased which leads to a loss of capacity.

Shedding

Sulfation can cause active material to detach from the electrodes, since sulfate crystals have a larger volume than lead oxide. This process is called shedding. Overcharging can also cause shedding, as gassing bubbles can force active material to detach from the electrode.

Active mass degradation

Active mass degradation is a process where the mechanical structure at the boundary between the electrolyte and the electrodes is changed. This leads to a decrease in the surface area, which reduces the capacity.

Water loss/drying out

The lead-acid batteries used in Scania vehicles require maintenance in the form of water addition. If the battery dries out, it can be damaged.

Electrolyte stratification

When the electrolyte stratifies, the acid with higher density sinks to the bottom. As a result, the chemical reactions are concentrated in the lower parts of the electrodes, which reduces the capacity. This can also cause corrosion in the parts of the electrodes where the acid concentration is lower.

The damage mechanisms discussed above are highly affected by certain operating conditions and usage patterns. The major factors affecting the aging processes, called stress factors, are identified by Svoboda (2004):

• Time at low state of charge (SOC)

• Ah-throughput (defined as the cumulative ampere-hour discharge in a one-year period, normalized in units of the battery nominal capacity)

• Charge factor (defined as the Ah charged divided by the Ah discharged over a period of time)


• Temperature
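As an illustration, the two usage-related stress factors defined above could be computed from logged current samples roughly as follows. This is a hypothetical sketch, not code or data from the thesis; the function name, the sampling scheme and the sign convention (negative current = discharge) are assumptions.

```python
def stress_factors(currents_a, dt_h, nominal_capacity_ah):
    """Return (Ah-throughput, charge factor) for one logging period.

    currents_a          -- evenly sampled battery currents in amperes
                           (assumed convention: negative = discharge)
    dt_h                -- sampling interval in hours
    nominal_capacity_ah -- nominal battery capacity in Ah
    """
    ah_discharged = sum(-i * dt_h for i in currents_a if i < 0)
    ah_charged = sum(i * dt_h for i in currents_a if i > 0)
    # Ah-throughput: cumulative Ah discharged, normalized by nominal capacity
    ah_throughput = ah_discharged / nominal_capacity_ah
    # Charge factor: Ah charged divided by Ah discharged
    charge_factor = ah_charged / ah_discharged
    return ah_throughput, charge_factor

# Four hourly samples: 20 Ah discharged, 25 Ah returned by the alternator
throughput, cf = stress_factors([-10.0, -10.0, 5.0, 20.0],
                                dt_h=1.0, nominal_capacity_ah=100.0)
# throughput = 0.2, cf = 1.25
```

Over a full year of samples, the same accumulation would be performed per battery to obtain the normalized one-year throughput in the definition above.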

Table 2.1 shows the relationships between some stress factors and aging processes.

Although the relationship matrix is a simplification, it clearly shows that a stress factor can have a positive correlation with some aging processes and a negative correlation with others. Another complicating factor is that the various aging processes interact in complex manners (Sauer and Wenzl, 2007).

Table 2.1: Relationships between stress factors and aging processes (Svoboda, 2004).

Long time at low SOC
  Corrosion: Indirect, through low acid concentration and low potentials.
  Sulphation: Strong positive correlation.
  Shedding: No direct impact.
  Water loss: None.
  AM degradation: None.
  Electrolyte stratification: Indirect effect through higher sulfation.

Ah-throughput
  Corrosion: None.
  Sulphation: No direct impact.
  Shedding: Impact through mechanical stress.
  Water loss: No direct impact.
  AM degradation: Impact.
  Electrolyte stratification: Strong positive correlation.

Charge factor
  Corrosion: Strong indirect impact through high polarization of electrodes.
  Sulphation: Negative correlation.
  Shedding: Strong impact through gassing.
  Water loss: Strong impact.
  AM degradation: No direct impact.
  Electrolyte stratification: Strong positive correlation.

Time between full charge
  Corrosion: Strong negative correlation.
  Sulphation: Strong positive correlation.
  Shedding: Negative correlation.
  Water loss: Negative correlation.
  AM degradation: No direct impact.
  Electrolyte stratification: Strong positive correlation.

Temperature
  Corrosion: Strong positive correlation.
  Sulphation: Correlation can be both negative and positive.
  Shedding: No direct impact.
  Water loss: Positive correlation.
  AM degradation: Low impact.
  Electrolyte stratification: No direct impact.

2.2 Component maintenance policies and remaining useful life

With optimal replacement intervals, system reliability can be improved, system failures prevented and maintenance costs reduced. Under some rare circumstances, the optimal replacement strategy is to replace the component after a failure has occurred, so-called reactive maintenance. The benefit of this strategy is that the lifetime of the component is maximally used, which reduces waste. However, in many situations a breakdown is costly, can cause a dangerous situation, and must be avoided.

A broadly practiced replacement technique is to use a predetermined maintenance schedule. This strategy reduces the probability of failure and the risk of unplanned downtime. Failures can still occur, and under certain operating conditions the prescribed interval might be too long or too short.

One way to make more accurate estimations of the component lifetime is to use information about the individual usage and conditions of the system or component. Condition-based maintenance (CBM) is a maintenance strategy that bases maintenance decisions on information collected by monitoring the condition of the system. Two important aspects of CBM are diagnostics and prognostics. Diagnostics attempts to detect faults that have occurred in the system, while prognostics deals with the prediction of faults before they occur.

The most common form of prognostics is to predict how much time is left before failure. By evaluating the component condition and data from past operations, the remaining useful life (RUL) of the component can be estimated. The RUL is the length of time from the present time to the end of the useful life of a component or system. The RUL is a random variable that depends on the current age of the component, the environment and the observed information about the health of the component. When the RUL of the component is continuously re-estimated, fluctuating usage is taken into account, which makes the estimate more accurate.

If Xt is defined as the random variable of the remaining useful life at time t and Yt as the operating history up to t, the probability density function (PDF) of Xt conditional on Yt is denoted f(xt|Yt). The estimation of RUL can be formulated as estimating f(xt|Yt) or E(Xt|Yt). If there is no information about Yt, the estimation of f(xt|Yt) becomes:

f(xt|Yt) = f(xt) = f(t + xt) / R(t)   (2.4)

where f(t + xt) is the value of the PDF at time t + xt and R(t) is the reliability function at t. If Yt is available, it provides more information that makes the estimation of RUL more accurate (Si, 2011).
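As a concrete check of equation (2.4), the following sketch evaluates it for an assumed exponential lifetime distribution (a made-up parameter, not a model or result from the thesis). Since the exponential distribution is memoryless, the conditional RUL density should be independent of the current age t.

```python
import math

# Exponential lifetime T ~ Exp(lam): f(t) = lam*exp(-lam*t), R(t) = exp(-lam*t)

def f(t, lam):                    # lifetime density
    return lam * math.exp(-lam * t)

def R(t, lam):                    # reliability function
    return math.exp(-lam * t)

def rul_density(x, t, lam):
    """f(x_t) = f(t + x_t) / R(t), equation (2.4)."""
    return f(t + x, lam) / R(t, lam)

# Memorylessness check: the density of surviving 200 more days is the
# same for a new unit (age 0) and for one that has run 500 days.
lam = 1 / 900.0                   # invented mean lifetime of 900 days
d0 = rul_density(200.0, 0.0, lam)
d500 = rul_density(200.0, 500.0, lam)
```

For a non-exponential lifetime (e.g. the Weibull or normal distributions used later in the thesis), the same formula yields a density that does depend on the current age.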

Prognosis approaches for estimating lifetime can be either physics-based or data driven. Algorithms that use the data driven approach for predicting lifetime construct models directly from data, rather than relying on physics or engineering principles. Dorner et al. (2005) investigate the former approach by constructing a physico-chemical aging model of the battery. The model is based on

of reactants. For any point in the battery at any time, the model provides state variables like potential, current density, state of charge, temperature, acid concentration etc. This information is used to quantify degradation processes and how these processes impact the battery performance. This model requires data from laboratory experiments on the aging mechanisms, which are usually very difficult to obtain. Due to the complex and non-linear behavior of the battery, authentic physics-based models that can be applied in varying operating conditions are difficult to achieve and rarely suitable (Sauer and Wenzl, 2007).

With data driven methods, the RUL model is fitted to the available data. These methods are suitable for complex systems like batteries where the chemical and physical processes, and their interactions, are difficult to represent analytically.

The data are generally of two main types: past recorded failure data and operational data. Here, operational data incorporate any data which may have an impact on the RUL, such as environmental information, performance information and information about the condition of the system or asset. The data driven approaches can be machine learning approaches, statistical approaches, or a mix of both.

Some research has focused on applying machine learning techniques to RUL estimation. Tian (2009) introduces a method to predict the remaining useful life of equipment using artificial neural networks (ANN). The model takes the age and condition monitoring measurement values at discrete inspection points as input and the life percentage as output. To reduce the effects of noise, the measurement series are fitted with a generalized Weibull distribution failure function. The ANN method is validated using data from condition monitoring of pump bearings. Yang and Widodo (2008) propose a regression-based support vector machine (SVM) method for machine prognosis. Based on previous state data, it attempts to predict the future state condition.

For risk analyses, it is essential to provide a probability distribution rather than a mean estimate of the time to failure (Wang and Christer, 2000). There is a large body of literature on the estimation of f(xt|Yt) using statistical methods. The statistical methods for estimating the RUL are, however, usually based on time series data about the state of health of the component, either directly or indirectly observed.

2.3 Lifetime estimations of batteries

In the past several decades, different approaches for the health management of batteries have been extensively studied in the literature. Traditional approaches have mostly focused on estimating the state of charge (SOC) rather than the state of health (SOH) or RUL. While SOH mainly concerns diagnostics, SOC and RUL are prognostic concerns. For applications where the variation in operating conditions is large, lifetime prediction is still at an early stage. There is also limited experience in lifetime estimation derived from real operating condition data.

Techniques based on statistical methods have been considered by some researchers.

Jaworski (1999) applied statistical parametric models to predict the time to failure of batteries exposed to varying temperatures. Saha et al. (2007) introduced a Bayesian learning framework to predict the remaining useful life of lithium-ion batteries. The approach combines a relevance vector machine (RVM) and a particle filter (PF) to generate a probability density function (PDF) for the end of life, from which the RUL is estimated. The RVM is a type of SVM constructed in a Bayesian framework and thus has probabilistic outputs. The PF is a technique for implementing a recursive Bayesian filter using Monte Carlo simulations to approximate the PDF. The model was built using internal parameters such as charge transfer resistance and electrolyte resistance, under the assumption that the parameters gradually change as the battery ages.

In a more recent paper, Saha et al. (2009) compared the autoregressive integrated moving average (ARIMA), extended Kalman filtering (EKF) and RVM-PF approaches for estimating RUL from experimental data from lithium-ion batteries. ARIMA is a model that is used to fit time series data in order to predict future points in the series. ARIMA models are used for observable non-stationary processes that have clearly identifiable trends. The EKF uses a series of measurements observed over time and produces statistically optimal estimates of the underlying system state. Compared to the traditional Kalman filter, the EKF can handle non-linear systems. In the first step of the EKF algorithm, a state transition model is used to propagate the state vector into the next time step. In the second step, a measurement of the system is used to correct the prediction. Their results showed considerable differences in the performance of the three approaches, with the Bayesian statistical approach outperforming the ARIMA and EKF methods.
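The two-step predict/correct cycle described above can be illustrated with a minimal scalar, linear Kalman filter sketch. This is not code from any of the cited papers; the EKF additionally linearizes nonlinear state and measurement models around the current estimate, and all parameter values below are invented for illustration.

```python
def kalman_step(x, p, z, a=1.0, q=0.01, h=1.0, r=0.25):
    """One predict/correct cycle for a scalar state.

    x, p -- current state estimate and its variance
    z    -- new measurement
    a, q -- state transition coefficient and process noise variance
    h, r -- measurement coefficient and measurement noise variance
    """
    # Step 1: propagate the state estimate into the next time step
    x_pred = a * x
    p_pred = a * a * p + q
    # Step 2: correct the prediction using the measurement
    k = p_pred * h / (h * h * p_pred + r)      # Kalman gain
    x_new = x_pred + k * (z - h * x_pred)
    p_new = (1.0 - k * h) * p_pred
    return x_new, p_new

# Noisy measurements of a roughly constant quantity near 1.0
x, p = 0.0, 1.0
for z in [1.1, 0.9, 1.05, 0.95]:
    x, p = kalman_step(x, p, z)
# x converges toward 1.0 and the variance p shrinks with each measurement
```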

2.4 Failure time analysis

Survival analysis is a field of statistics for modeling data that describe the time to an event. In short, it is the study of lifetimes and their distributions. Survival analysis is traditionally used in biostatistics, where the event usually is death. The methods have, however, spread to other disciplines and are often used to analyze events such as unemployment in economics, divorce in sociology and failure of mechanical systems in engineering. In engineering, the field of survival analysis is usually called failure time analysis.

The time to failure T may be thought of as a random variable. There are several representations of the distribution of T.


2.4.1 Reliability function

The reliability function, R(t), represents the probability that the time of failure is later than some time t:

R(t) = Pr(T > t)   (2.5)

It is usually assumed that R(0) = 1 and that R(t) → 0 as t → ∞.

2.4.2 Lifetime distribution and lifetime density

The cumulative distribution function, F (t), is the complement of the reliability function:

F(t) = Pr(T ≤ t) = 1 − R(t)   (2.6)

This function is also called the failure function as it represents the proportion of units that have failed up until time t. The derivative of the cumulative distribution function is the density function of the lifetime distribution:

f(t) = dF(t)/dt   (2.7)

where f(t) represents the unconditional instantaneous rate of failure at time t.

2.4.3 Hazard function

The probability that failure occurs in the next short period of time, given survival up to that time, is expressed by the hazard function h(t):

h(t) = lim_{∆t→0+} Pr(t ≤ T ≤ t + ∆t | T ≥ t) / ∆t = f(t) / R(t)   (2.8)

The hazard function equals the proportion of the population that fail per unit time t, among those still functioning at this point in time.

The cumulative hazard function H(t) is the area under the hazard curve up until time t:

H(t) = ∫_0^t h(u) du   (2.9)

A plot of H(t) shows the accumulated risk of failure up to any point in time.
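The relationships between f, R, h and H can be checked numerically. The sketch below (plain Python, with a made-up failure rate) verifies that the exponential distribution has a constant hazard h(t) = f(t)/R(t), and that integrating the hazard reproduces R(t) = e^(-H(t)):

```python
import math

# Exponential lifetimes with rate lam (hypothetical): f(t) = lam*e^(-lam*t),
# R(t) = e^(-lam*t), so the hazard h(t) = f(t)/R(t) is the constant lam.
lam = 0.002  # failures per day (made-up rate)
f = lambda t: lam * math.exp(-lam * t)
R = lambda t: math.exp(-lam * t)
hazard = lambda t: f(t) / R(t)

assert all(abs(hazard(t) - lam) < 1e-12 for t in (0.0, 100.0, 500.0))

# Cumulative hazard H(t) = integral of h from 0 to t (trapezoidal rule);
# for the exponential, H(t) = lam*t and hence R(t) = e^(-H(t)).
def cumulative_hazard(h, t, steps=1000):
    dt = t / steps
    return sum(0.5 * (h(i * dt) + h((i + 1) * dt)) * dt for i in range(steps))

H = cumulative_hazard(hazard, 500.0)
assert abs(R(500.0) - math.exp(-H)) < 1e-9
```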

2.4.4 Censoring

Often the data used in failure time analysis are censored. Censoring occurs when each observation results either in knowing the exact value or in knowing that the value lies within an interval. When data are right censored, a censored data point is above a certain known value but the exact value is unknown. In a failure analysis test, let n be the number of units. During time C, r failures are observed, where 0 ≤ r ≤ n. Since the times of failure for the remaining n − r units are unknown, but it is known that these times to failure are larger than C, the data are right censored.

In reliability analysis it is commonly assumed that the variables C and T are independent. This is called non-informative censoring. The distribution of survival times of units that are censored at a particular time is no different from that of units that are still observed at this time. One common type of independent censoring is simple type I censoring, where all subjects in the study are censored at the same, fixed time.
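Type I right censoring is easy to simulate. The sketch below (plain Python; hypothetical exponential lifetimes and a two-year observation window like the contract period) records each unit as a (time, δ) pair, with δ = 0 marking a censored observation:

```python
import random

random.seed(0)
# Hypothetical battery lifetimes in days (exponential with mean 900 days)
lifetimes = [random.expovariate(1 / 900) for _ in range(1000)]

C = 730  # type I censoring: every unit is observed for exactly two years
# Each observation is (time, delta): delta = 1 for an exact failure time,
# delta = 0 when only "survived past C" is known (right censored).
observations = [(min(t, C), 1 if t <= C else 0) for t in lifetimes]

n_failures = sum(d for _, d in observations)
n_censored = len(observations) - n_failures
assert all(t <= C for t, _ in observations)  # no recorded time exceeds C
```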


Chapter 3

Data

In this section, the available data used in the analyses in this report are described.

The possible analyses of battery lifetimes are heavily restricted by limitations in the data. First, there exists no database of complete lifetimes of truck batteries used in vehicles produced by Scania. Another problematic aspect is that truck batteries are relatively easy to replace, so batteries might be exchanged without Scania's knowledge, either by the haulage contractor or at a workshop not connected to Scania.

Another limitation is the variables available in the data. No direct measurement of the health of the battery is available. Instead, the analyses must be based on information about technical specification and operating conditions that are assumed to influence the lifetime. This information is however not available for all vehicles, which restricts the number of vehicles to base the analyses on.

The operational data that are recorded in the trucks are stored in accumulated form and are not time stamped. Therefore, a regular time series analysis is difficult to carry out.

3.1 Failure data

The analyses in the project are based on data from trucks delivered to customers in England between 2007 and 2010. Each truck has a repair contract that is valid for two years from the delivery date. The data contain information about which trucks have replaced their batteries during this period of time and when each replacement occurred. When using these data, two assumptions are made. First, it is assumed that no other battery replacements are made during these two years, that is, a haulage contractor is assumed to utilize the repair contract it has paid for. Second, it is assumed that there actually is a problem with the replaced battery, so each replacement is considered a battery failure.



The choice of using data from England was made due to its relatively long and uniform contract time. English truck drivers are also more frequent customers of the Scania Assistance service, due to England's strict regulations on when a driver is obligated to contact road assistance services.

The assistance data for battery components include the date of the assistance errand, as well as the action taken, for example a replacement or a jump start. The assistance errands of interest for a vehicle are those that occurred before the time of failure or before the contract expired. The data used in the analysis are the number of assistance errands for a vehicle that required a jump start during this period of time.

The data set consists of approximately 10 000 haulage vehicles and 3 500 distribution vehicles. A subset of this complete set, here denoted the operational set, consists of all vehicles for which complete information about technical specification and collected operational data is available. This set consists of about 1 400 haulage vehicles and 450 distribution vehicles.

3.2 Technical specification

For each vehicle, the sizes of the battery and the alternator are known. The battery size is measured in ampere-hours (Ah) and the alternator size in amperes (A). For haulage vehicles it is also known whether some sort of kitchen equipment, such as a microwave or a coffee maker, is installed. These features, together with their possible values, are presented in table 3.1.

Table 3.1: Technical specification data and possible values.

             Values        Unit
Battery      140/180/225   Ah
Alternator   80/100/150    A
Kitchen      Yes/No        -

3.3 Operational data

Operational data from trucks are automatically recorded and stored throughout the life of the vehicle. This data source provides information about the use and performance of the truck. The data are accumulated in bins, either as a scalar, a vector or a matrix depending on the nature of the variable, and stored each time the vehicle visits a workshop.

The operational variables used in the analysis in this report are presented in table 3.2. For voltage, the first element contains the percentage of time with a system voltage less than 26 V, the second element 26-26.5 V, the third 26.5-27 V etc. The last element contains the percentage of time with voltage above 30 V. For temperature, the first element contains the percentage of time with a measured ambient temperature below -30 °C. The following elements each cover an interval of 10 °C, hence the second element contains the percentage of time between -30 and -20 °C and the last element the percentage of time above 50 °C.

Each of the three method chapters that now follow contains a data section describing which data are used in that analysis.

Table 3.2: Operational data.

                                 Form
Number of kilometers driven      Scalar
Time in drive                    Scalar
Time idle with PTO*              Scalar
Time idle without PTO*           Scalar
Start time                       Scalar
Number of starts                 Scalar
Voltage                          Vector
Ambient temperature              Vector

* PTO (power take-off) means that power is taken from the truck engine to an attached application, for example a dump trailer or a crane.


Chapter 4

Classification and feature selection

In this section the methods used for classification and feature selection are described.

The purpose of the classification is to investigate whether the available data are sufficient to distinguish the vehicles with battery failure. The feature selection is used to identify properties that discriminate this group of vehicles. The result from the feature selection can also be used to improve the result of the classifier by omitting less relevant attributes.

4.1 Support vector machine

Support vector machine (SVM) is a supervised machine learning technique for data classification. Like other classification methods, the aim is to construct a model from training samples for which the class is known, and use this to predict the class label of unseen samples. The SVM algorithm was introduced by Vapnik in 1992 and has gained popularity due to its several advantages compared to other machine learning techniques. The classification performance is usually high and the method is considered to have high generalization performance. Also, training an SVM is guaranteed to find the globally optimal classifier, whereas training an ANN can end in a local minimum. This motivates the use of SVM in this thesis.

The basic idea is to map the training vectors into a high dimensional space where the SVM finds a separating hyperplane with the maximal margin. In principle, it is possible to transform any data set so that the classes can be separated linearly (Vapnik, 1998). In Figure 4.1 the maximum margin hyperplane is a line that separates two classes in two dimensions. The data points in each class that lie on the margins are called support vectors. The maximum margin hyperplane, and therefore the classification problem, is only a function of the support vectors, and not of all data points in the training set.



Figure 4.1: The solid line is the maximum margin hyperplane that separates the two classes. The samples on the dashed lines are the support vectors.

The hyperplane can be expressed as:

⟨w, x⟩ + b = 0,   w ∈ H, b ∈ R   (4.1)

where w is the normal vector of the plane, b is a parameter that determines the offset of the hyperplane from the origin and H is some dot product space.

For any testing instance x, the decision function is:

f(x) = sgn(⟨w, x⟩ + b)   (4.2)

The task of finding a maximum margin hyperplane can be formulated as a Lagrangian problem:

L(w, b, α) = (1/2)||w||² − Σ_{i=1}^{m} α_i [y_i(⟨w, x_i⟩ + b) − 1]   (4.3)

Using this method, the Lagrangian is minimized with respect to the variables w and b and maximized with respect to the Lagrange multipliers α_i. The multipliers reflect the weight given to each training sample.

The problem can be expressed in a dual form optimization problem. This requires that w is eliminated from 4.3. This can be accomplished by using the Karush-Kuhn-Tucker condition, which implies that the solution can be expressed as a linear



combination of the training vectors:

w = Σ_{i=1}^{m} α_i y_i x_i   (4.4)

Doing so, the dual optimization problem takes the following form:

max_{α ∈ R^m}  Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j ⟨x_i, x_j⟩

subject to  α_i ≥ 0, i = 1, ..., m,  and  Σ_{i=1}^{m} α_i y_i = 0   (4.5)

The Lagrangian formulation above may not be solvable if the data cannot be separated by a hyperplane. To solve this problem, SVMs use the kernel trick and soft margin classifiers. By using the kernel trick, the SVM can perform nonlinear classification. This is done by replacing every dot product with a nonlinear kernel function. A soft margin classifier permits mislabeling of some of the training samples if no hyperplane exists that can split the two classes. This is done by introducing an upper bound C on the Lagrange multipliers, thus limiting the influence of a single support vector. With kernel function and soft margin, the optimization becomes:

max_{α ∈ R^m}  Σ_{i=1}^{m} α_i − (1/2) Σ_{i,j=1}^{m} α_i α_j y_i y_j k(x_i, x_j)

subject to  0 ≤ α_i ≤ C, i = 1, ..., m,  and  Σ_{i=1}^{m} α_i y_i = 0   (4.6)

The decision function can then be expressed as:

f(x) = sgn( Σ_{i=1}^{m} α_i y_i k(x, x_i) + b )   (4.7)

There are three basic kernel functions that are commonly used and have shown good performance: polynomial, radial basis and sigmoid kernels. In the applications of SVM in this paper the linear kernel and the radial basis kernel have been used and compared. The Gaussian radial basis function (RBF) has good general performance.

It can be expressed as:

k(x_i, x_j) = e^{−γ||x_i − x_j||²}   (4.8)

The parameter γ is a measure of how similar samples are required to be.
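As an illustration, a minimal RBF kernel in plain Python; the points and γ values are arbitrary:

```python
import math

def rbf(x, z, gamma):
    """Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

assert rbf([1.0, 2.0], [1.0, 2.0], 0.5) == 1.0   # identical points
assert rbf([0.0, 0.0], [3.0, 4.0], 0.5) < 1e-5   # distant points
# Larger gamma -> samples must be closer to count as similar
assert rbf([0.0], [1.0], 2.0) < rbf([0.0], [1.0], 0.1)
```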

The overall effectiveness of SVM depends on the selection of the kernel and the values of the parameters C and γ.



4.2 Data preprocessing

SVM requires each data point to be represented as a vector of real numbers. Boolean attributes need to be converted into numeric data. To represent a histogram of length m, m elements in the feature vector are used.

To prevent attributes with large numeric ranges from dominating, the attributes are scaled.

The attributes are scaled to the range [-1, 1] by:

x' = 2 (x − m_i)/(M_i − m_i) − 1   (4.9)

where M_i and m_i are respectively the maximum and the minimum value of the ith attribute, and x' is the scaled value of x.

The training and testing sets are scaled using the same scaling factors.
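A sketch of equation 4.9, reusing the training min/max for the test set; the feature values below are invented:

```python
def scale_features(train, test):
    """Scale each attribute to [-1, 1] using the TRAINING min/max only
    (eq. 4.9), then apply the same factors to the test set.
    Assumes no attribute is constant over the training set."""
    n_feat = len(train[0])
    mins = [min(row[i] for row in train) for i in range(n_feat)]
    maxs = [max(row[i] for row in train) for i in range(n_feat)]

    def scale(rows):
        return [[2 * (row[i] - mins[i]) / (maxs[i] - mins[i]) - 1
                 for i in range(n_feat)] for row in rows]

    return scale(train), scale(test)

# Toy data: battery size in Ah and a made-up idle fraction
train = [[140, 0.2], [225, 0.8], [180, 0.5]]
test = [[180, 0.4]]
s_train, s_test = scale_features(train, test)
assert s_train[0] == [-1.0, -1.0]  # training minima map to -1
assert s_train[1] == [1.0, 1.0]    # training maxima map to +1
```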

4.3 Feature selection

In machine learning, feature selection is the technique of selecting the subset of most relevant features to be used in the model. A simple and effective approach is to calculate the F-score for each feature and drop features below a certain threshold. The F-score is a measure of how well a feature discriminates between the positive and negative data sets. For the ith feature, the F-score is defined as (Chen and Lin, 2006):

F(i) = [ (x̄_i^(+) − x̄_i)² + (x̄_i^(−) − x̄_i)² ] / [ (1/(n_+ − 1)) Σ_{k=1}^{n_+} (x_{k,i}^(+) − x̄_i^(+))² + (1/(n_− − 1)) Σ_{k=1}^{n_−} (x_{k,i}^(−) − x̄_i^(−))² ]   (4.10)

Here, n_+ and n_− are the number of positive and negative instances respectively; x̄_i, x̄_i^(+) and x̄_i^(−) are the mean values of the ith feature over the whole, positive and negative data sets; and x_{k,i}^(+) and x_{k,i}^(−) are the kth instances of the positive and negative sets. The larger the F-score, the more discriminative the feature.

A drawback of the F-score is that it does not take shared information among features into account. It is, however, a simple and effective method for ranking and measuring the importance of features, which motivates the use of the F-score feature selector in this thesis.
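Equation 4.10 can be computed directly. In the sketch below (plain Python, toy values) a feature whose class means are far apart relative to the within-class variance receives a much larger F-score than an uninformative one:

```python
def f_score(pos, neg):
    """F-score of one feature, given its values in the positive and
    negative classes (Chen and Lin, 2006, eq. 4.10)."""
    xp = sum(pos) / len(pos)          # positive-class mean
    xn = sum(neg) / len(neg)          # negative-class mean
    xall = sum(pos + neg) / (len(pos) + len(neg))  # overall mean
    numer = (xp - xall) ** 2 + (xn - xall) ** 2
    denom = (sum((x - xp) ** 2 for x in pos) / (len(pos) - 1)
             + sum((x - xn) ** 2 for x in neg) / (len(neg) - 1))
    return numer / denom

# Well-separated classes vs. thoroughly mixed values of the same numbers
discriminative = f_score([5.0, 5.1, 4.9, 5.2], [1.0, 1.1, 0.9, 1.2])
uninformative = f_score([5.0, 1.1, 4.9, 1.2], [1.0, 5.1, 0.9, 5.2])
assert discriminative > uninformative
```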

4.4 Parameter selection

The optimization of the parameters C and γ is done using ν-fold cross-validation and grid search. In ν-fold cross-validation, the training set is divided into ν subsets of equal size. Each of the subsets is used as the test set once, to evaluate a classifier that is trained on the remaining ν − 1 subsets. The cross-validation accuracy is the percentage of data points correctly classified. The motive of the cross-validation procedure is to prevent overfitting.

In grid search, various values of the parameters C and γ are tried out and the pair with the best cross-validation accuracy is picked.
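The procedure can be sketched with scikit-learn's SVC, which wraps LIBSVM, in place of the grid.py utility; the data here are synthetic and the grid is a coarsened version of the one used in the thesis:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the vehicle feature vectors
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Exponentially spaced grid, as in the thesis (C in 2^-5..2^15,
# gamma in 2^-15..2^3), here thinned out for speed
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 4)],
    "gamma": [2.0 ** k for k in range(-15, 4, 4)],
}

# Five-fold cross-validation over every (C, gamma) pair; the pair with
# the best cross-validation accuracy is picked
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
best_C = search.best_params_["C"]
best_gamma = search.best_params_["gamma"]
```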

4.5 Performance evaluation

To assess the performance of the model, the data were split randomly into training and testing sets in the ratio 60:40. The training set was used in parameter selection and in the construction of the SVM model, while the testing set was used only at the last stage to evaluate the performance.

To illustrate the performance of a binary classifier, a receiver operating characteristic (ROC) curve can be used. It is a plot with the true positive rate (TPR) on the y-axis and the false positive rate (FPR) on the x-axis. The TPR is also known as sensitivity and the FPR is one minus the specificity, or true negative rate (TNR).

Sensitivity = TP / (TP + FN)   (4.11)

Specificity = TN / (TN + FP)   (4.12)

The area under the ROC curve (AUC) has been shown to be a useful measure of classifier performance. A high value of the AUC implies that the model efficiently discriminates between the classes. The AUC varies between 0.5 (random guess) and 1.0 (a perfect classifier). Barakat and Bradley (2006) showed that ROC curves and AUC can provide a more reliable measure of quality than the accuracy.
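The AUC can also be computed without plotting the curve, via its rank interpretation: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A plain-Python sketch with toy labels and scores:

```python
def auc(labels, scores):
    """AUC via the rank (Mann-Whitney) formulation: fraction of
    positive/negative pairs where the positive is scored higher
    (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]
perfect = auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])  # every pair ordered
random_guess = auc(labels, [0.5] * 6)                  # all ties
assert perfect == 1.0
assert random_guess == 0.5
```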

To evaluate the performance of the SVM model, multiple logistic regression (MLR) analysis was performed with the same selected features. The modeling was performed on the training data set and the estimated coefficients were applied to the test data set to calculate the AUC values.

4.6 Implementation

To generate the SVM models, the SVM library LIBSVM was used. LIBSVM is a freely available software library implemented by Chang and Lin (2011).

For every feature, the F-score is calculated using a script in the LIBSVM library.

For a set of thresholds, the features with F-scores above each threshold are used to train and validate a classifier. The threshold that gives the best validation performance is used, and the features below this threshold are dropped.

Table 4.1: Variables used in the feature selection and SVM analysis.

Variable                            Number of features
Voltage histogram                   10
Temperature histogram               5
Battery size                        1
Kilometers per day                  1
Start time per start                1
Fraction of time idle with PTO      1
Fraction of time idle without PTO   1
Kilometers per start                1
Number of assistance errands        1
Kitchen equipment                   1
Alternator size                     1

The package also included a utility (grid.py) that was used to find the optimal values of the parameters C and γ. Five-fold cross-validation was used, with C spanned between 2^-5 and 2^15 and γ between 2^-15 and 2^3. It also provides functions for plotting ROC curves and calculating the AUC.

The logistic regression analysis was done in the statistical program R (R Development Core Team, 2008).

4.7 Data

A list of the features used in the feature selection and the SVM analysis is presented in table 4.1, along with the number of features it corresponds to in the feature vector.

Each element in the voltage and temperature histograms becomes a feature in the feature vector. The first two and the last three elements of the temperature vector are omitted, since the corresponding values are almost exclusively zero at these extreme temperatures. The kitchen feature is set to one if kitchen equipment is installed, otherwise it is zero. This feature is omitted for distribution vehicles since the corresponding value is zero in every case. The remaining features are set to their numerical values. Finally, the feature values are normalized as described in section 4.2.


Chapter 5

Lifetime distribution estimations

A simple approach to lifetime distribution modeling is to analyze if the times of failure follow some defined probability distribution. The model is simple in the sense that it does not investigate how different features influence the time to failure, it only includes the times to failure or, if the subject is right censored, the known time of survival.

5.1 Maximum likelihood estimation

Maximum likelihood estimation (MLE) is a widely used approach to parameter estimation and inference in statistics. Given a chosen probability distribution model and the observed set of data, the MLE method estimates the unknown model parameters such that the probability of obtaining the particular set of data is maximized.

Let f(x|θ) represent the PDF for a random variable x conditioned on a set of parameters θ. For n independent and identically distributed observations the joint density function is:

f(x_1, ..., x_n|θ) = Π_{i=1}^{n} f(x_i|θ) = L(θ|x_1, ..., x_n)   (5.1)

where L(θ|x_1, ..., x_n) is called the likelihood function. A maximum likelihood estimator is a value of the parameter θ such that the likelihood function is a maximum.

The maximum likelihood estimate for parameter θ is denoted θ̂. It is usually more convenient to work with the logarithm of the likelihood function, called the log-likelihood:

ln L(θ|x_1, ..., x_n) = Σ_{i=1}^{n} ln f(x_i|θ)   (5.2)

Since the logarithm function is a monotonically increasing function, maximizing ln L(θ|x_1, ..., x_n) is equivalent to maximizing L(θ|x_1, ..., x_n). If the log-likelihood function is differentiable, candidate estimates are found by solving the likelihood equation:

∂ ln L(θ|x_1, ..., x_n) / ∂θ = 0   (5.3)

For a sample containing both exact and right censored observations, the likelihood can be written as:

L(θ|x_1, ..., x_n) = Π_{i=1}^{n} f(x_i|θ)^{δ_i} [1 − F(x_i|θ)]^{1−δ_i}   (5.4)

where δ_i = 1 for an exact observation, δ_i = 0 for a right censored observation and F(x_i|θ) is the cumulative distribution function.

5.2 Common distributions

This section considers probability distributions which are most often used when modeling likelihood of failure: the normal, the Weibull and the gamma distributions.

5.2.1 Normal distribution

The normal (or Gaussian) distribution is a continuous probability distribution with a probability density function known as the Gaussian function:

f(x|µ, σ²) = (1/(σ√(2π))) e^{−((x−µ)²/(2σ²))}   (5.5)

The parameters µ and σ² are the mean and the variance of the distribution respectively. The cumulative distribution function for the normal distribution is:

F(x|µ, σ²) = (1/2) [1 + erf((x − µ)/(σ√2))]   (5.6)

where erf(x) is the error function. The normal distribution describes an increasing failure rate.

5.2.2 Weibull distribution

The probability density function of the Weibull distribution is:

f(x|α, β) = (β/α) (x/α)^{β−1} e^{−(x/α)^β}   (5.7)

The parameter α > 0 is the scale parameter and parameter β > 0 is the shape parameter. The cumulative distribution function for the Weibull distribution is:

F(x|α, β) = 1 − e^{−(x/α)^β}   (5.8)

The mean and the variance of the Weibull distribution are respectively:

α Γ(1 + 1/β)   (5.9)



α² [Γ(1 + 2/β) − Γ(1 + 1/β)²]   (5.10)

where Γ(x) is the Gamma function.

The Weibull distribution is often used to model the distribution of lifetimes of objects. The value of the parameter β is an indicator of the failure rate. If β < 1, the failure rate decreases over time, if β = 1 the failure rate is constant over time and if β > 1 the failure rate increases over time.
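The effect of β on the failure rate is easy to verify from the Weibull hazard h(t) = f(t)/R(t) = (β/α)(t/α)^(β−1); the scale value below is arbitrary:

```python
import math

def weibull_hazard(t, alpha, beta):
    """Weibull hazard h(t) = f(t)/R(t) = (beta/alpha)*(t/alpha)**(beta-1)."""
    return (beta / alpha) * (t / alpha) ** (beta - 1)

alpha = 1000.0  # hypothetical scale parameter, days

# beta < 1: hazard decreases over time (infant mortality)
assert weibull_hazard(100, alpha, 0.5) > weibull_hazard(500, alpha, 0.5)
# beta = 1: hazard is constant (reduces to the exponential)
assert math.isclose(weibull_hazard(100, alpha, 1.0),
                    weibull_hazard(500, alpha, 1.0))
# beta > 1: hazard increases over time (wear-out failures)
assert weibull_hazard(100, alpha, 2.0) < weibull_hazard(500, alpha, 2.0)
```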

5.2.3 Gamma distribution

The gamma distribution with shape parameter α and scale parameter β has the probability density function:

f(x|α, β) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β}   (5.11)

The cumulative distribution function for the gamma distribution is:

F(x|α, β) = γ(α, x/β) / Γ(α)   (5.12)

where Γ(α) is the complete Gamma function and γ(α, x) the lower incomplete Gamma function.

The mean and the variance of the gamma distribution are respectively:

αβ (5.13)

αβ²   (5.14)

The value of the parameter α determines the failure rate. With α > 1 the failure rate increases over time.

5.3 Performance evaluation

To evaluate the goodness of fit of the three distributions to the data, Pearson’s chi-squared test is applied. The value of the chi-squared test statistic is:

χ² = Σ_{i=1}^{k} (O_i − E_i)² / E_i   (5.15)

where k is the number of subintervals, O_i is the observed number of data points in interval i and E_i is the expected number of data points in interval i. The statistic value is compared to a chi-squared distribution with d = k − 1 − n degrees of freedom, where n equals the number of estimated parameters, and if:

χ² < χ²_{1−α,d}   (5.16)

it can be assumed at the significance level α that the data follow the assumed theoretical distribution.
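A sketch of the test (equations 5.15-5.16) using SciPy's chi-squared quantile function; the observed and expected counts are invented:

```python
from scipy.stats import chi2

observed = [18, 25, 30, 17, 10]  # hypothetical failure counts per interval
expected = [15, 27, 31, 18, 9]   # counts predicted by a fitted model

# Test statistic (eq. 5.15)
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

k = len(observed)       # number of subintervals
n_params = 2            # e.g. mu and sigma estimated for a normal fit
d = k - 1 - n_params    # degrees of freedom
alpha = 0.05
critical = chi2.ppf(1 - alpha, d)  # chi-squared quantile at 1 - alpha

# Accept the hypothesized distribution at the 95% level if eq. 5.16 holds
fits = chi_sq < critical
```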



5.4 Implementation

To fit the probability distributions to failure data using maximum likelihood estimation, the Statistics Toolbox in Matlab is used.

5.5 Data

In this analysis, technical specification and operational data are not needed. Therefore the complete data set can be used, where the only requirement is that the time of failure is known. The data set for haulage vehicles consists of 2107 observed failures and 7919 censored observations, for which no failure was observed within 730 days. For distribution vehicles, 469 failures were observed and 2987 were censored.


Chapter 6

Reliability models

In this chapter, the two methods used for constructing reliability models are explained. For each method, two models are developed. The aim of both models is to describe the influence of features on the reliability of the battery in the vehicle. They differ in the following sense:

Model I

The aim of model I is to describe the reliability as a function of time for a vehicle that is new. It therefore only includes features that are known or can be estimated before the vehicle has been in use.

Model II

When a vehicle has been operating for a time, more information about events and operating conditions is available, for example how many times a certain vehicle has called assistance service due to battery problems. Some data collected from the vehicle are also likely to be more accurate than the assumptions used in model I, such as the ambient temperatures or the average kilometers driven per day. Treating the time since the vehicle began operating as a feature, model II aims to describe the remaining lifetime for a vehicle in use.

The two methods used for modeling are random forests and Cox regression. Both methods can handle censored data, which most other regression methods, such as support vector regression, are not designed for. This is the main reason for using these two methods.

6.1 Random forests

Random forests (RF) is an ensemble machine learning method developed by Breiman (2001). Ensemble predictors are built from multiple base learners which can sig-
