Customer Support Process Analysis
Using statistics and modeling to analyze a global customer support process
Tobias Björch Fredrik Strålberg
June 14, 2016
Copyright © 2016 Tobias Björch and Fredrik Strålberg. All rights reserved.
CUSTOMER SUPPORT PROCESS ANALYSIS - USING STATISTICS AND MODELING TO ANALYZE A GLOBAL CUSTOMER SUPPORT PROCESS
Submitted in partial fulfillment of the requirements for the degree Master of Science in Industrial Engineering and Management
Department of Mathematics and Mathematical Statistics
Umeå University
SE-901 87 Umeå, Sweden
Supervisor: Konrad Abramowicz
Examiner: Leif Nilsson
Abstract
A key challenge for a company with global support is to provide high-quality service to its customers. Management of global support centers has to consider customer requirements, service agreements, budget, resources and more. Management therefore has limited room for testing new approaches, especially in global operations. This thesis uses statistics, modeling and discrete event simulation to analyze a global support process. The analysis is intended to provide approximate results that support decision making. The model representation of the global support process uses the non-parametric bootstrap to replicate the variability observed in the real-world system. Variability in the arrival process is accounted for by bootstrap block resampling. To describe the observed global support process, data has been collected from the case company. The simulation results are validated by comparison with the observed data. Because the simulation results support the validity of the model representation, potential process enhancements are then tested. Finally, the discussion covers the results from the tests of process enhancements and the validity of the simulation model.
Sammanfattning (Summary)
A challenge for a company with global support is to offer high-quality service to its customers. The management of global support centers must take into account customer requirements, service agreements, budget, resources and more. Management therefore has limited room for testing new approaches, especially in global operations. The aim of this work is to analyze a global support process using statistics, modeling and discrete event simulation. The analysis shall contribute approximate results that can be used as a basis for decisions. The model representation of the global support process applies non-parametric resampling to replicate the variability in the observed system. To replicate variability in the arrival process, resampling of blocks (bootstrap block resampling) is used. To describe the observed global support process, data collected from the company where the work was carried out is used. Results from simulation are validated through comparisons with observed data. The simulation results validate the model representation, and potential process improvements have therefore been tested.
Furthermore, a discussion is presented on the validation of the simulation model and on the results from tests of potential process improvements.
Swedish title: Analys av en supportprocess
To our families
Acknowledgements
We wish to thank the many people who have contributed to this thesis. Firstly, we would like to thank our supervisors at the case company. You helped us create the idea behind this thesis and guided us during our work. It has been a pleasure working with both of you.
Special thanks go to our supervisor Konrad Abramowicz for his encouragement, time, patience and his enthusiasm. You have inspired us to work hard even at times when the goal of this thesis felt distant. We have really appreciated all your support and it has been a pleasure getting to know you better.
Then we wish to thank our families and friends for their understanding and patience, during evenings and weekends, when we have been working on this thesis.
Finally, we would also like to show our gratitude to the employees at the case company who have been helpful and willing to answer all our questions.
Contents
1 Introduction
1.1 Background
1.1.1 Case company
1.1.2 Problem description
1.1.3 Possible ways to analyse a global support process
1.1.4 Process description
1.1.5 Service requirements and work schedule
1.2 Purpose
1.2.1 Potential process enhancement
1.3 Observed Data
1.4 Delimitations
1.5 Approach and Outline
2 Theory
2.1 Probability Theory
2.1.1 Sample space, events and probability
2.1.2 Axioms of probability
2.1.3 Random variable
2.1.4 Random variable characteristics
2.1.5 Expected value and variance
2.1.6 Distributions used in this thesis
2.2 Statistical Inference
2.2.1 Sample mean and variance
2.2.2 Hypothesis testing
2.2.3 Methods of tests
2.2.4 Inference about differences in means of populations
2.3 Stochastic Simulation
2.3.1 Pseudorandom numbers
2.3.2 Bootstrap
2.3.3 Bootstrap block resampling
2.4 Discrete Event Simulation
2.4.1 Warm-up period
2.5 Software
2.5.1 Software usage
2.5.2 SimEvents library
3 Data
3.1 Observed Data
3.1.1 Performance table
3.1.2 Time reporting table
3.1.3 Change history table
3.2 Data Processing
3.2.1 Arrival process
3.2.2 Model parameters
4 Method
4.1 Implementation
4.2 Model Representation
4.2.1 Attributes of customer support requests
4.2.2 Generate customer support request
4.2.3 Customer unit
4.2.4 Global support center
4.2.5 Product line maintenance
4.3 Simulation settings
4.3.1 Time
4.3.2 Work schedule
4.3.3 Input data
4.4 Representation of Potential Process Enhancement
4.4.1 Early routing
4.4.2 No individual assignment of customer support request in the global support centers
4.4.3 Number of engineers in product line maintenance
5 Results
5.1 Validation Of Model
5.1.1 Mean duration using original arrival process
5.1.2 Mean duration using bootstrap arrival process
5.1.3 Comparison of individual customer support request duration
5.2 Potential Process Enhancement
5.2.1 Early routing
5.2.2 No individual assignment of customer support request in the global support centers
5.2.3 Number of engineers in product line maintenance
6 Discussion and conclusion
6.1 Review
6.2 Validation of model
6.2.1 Evaluation of mean duration
6.2.2 Comparison of individual customer request support duration
6.3 Potential Process Enhancement
6.3.1 Early routing
6.3.2 No individual assignment of customer support request in the global support centers
6.3.3 Number of engineers in product line maintenance
6.4 Conclusions and Recommendations
References
A Appendix A - Individual times divided by different priorities using bootstrap arrival process
B Appendix B - Individual times divided by different customer regions using bootstrap arrival process
Abbreviations
SAP Systems, Applications and Products
CSR Customer Support Request
CU Customer Unit
GSC Global Support Center
PLM Product Line Maintenance
DES Discrete Event Simulation
OAP Original Arrival Process
BAP Bootstrap Arrival Process
ER Early Routing
NIAC No Individual Assignment of CSR
NEP Number of Engineers in PLM
1 Introduction
In this thesis we study a key challenge for a company with after-sales services, namely the management of service support centers. Management has to consider several aspects, such as customer requirements, budget restrictions, service agreements and available resources. One of the key challenges is therefore to offer short service times to customers while organizing resources efficiently. To simplify the task of organizing resources to meet future demand, it is of interest to have a mathematical method for evaluating the organisation and its service. We aim to use statistics, modeling and simulation to analyze a global support process, and we focus on finding an effective way to represent a real-world system by simulation. The thesis shall give decision makers a way to model a global support process, so that approximate results indicate how the process responds to changes.
1.1 Background
1.1.1 Case company
This thesis is carried out at a global telecommunication company that is part of changing the environment of communication technology. The case company provides equipment, software and services to enable transformation through mobility. Its leadership in technology and services has played an important role in the expansion and improvement of connectivity worldwide. The company is divided into several business units, which are supported by group functions such as sales, finance and human resources.
This thesis analyzes a global support process for customer service. The customers served by this support organisation are large corporations. Global support is offered for a fee: customers pay an annual service fee to get help with resolving problems regarding a product.
The support organization is divided into three levels: customer unit (CU), global support center (GSC) and product line maintenance (PLM). There are approximately 150 CUs around the world, 3 GSCs and 1 PLM unit. An illustration of the global support organization is shown in Figure 1.1. The geographical positions in the figure do not represent the actual positions of the units in the organization.
Figure 1.1: An illustration of the global support organization. The geographical positions do not represent the actual positions of the units.
1.1.2 Problem description
It is a challenge to provide service for advanced products such as telecommunication equipment. It can take several days to solve a problem. It is also a challenge to meet customer demands for short response time and short overall service time. Local customer units are needed to enable short response times, so the units are strategically located all over the world.
Customer units (CU) are local offices, and they are the first point of contact with global support. CUs aim to give short response times and to communicate in the customer's native language. The other challenge is to offer short overall service time. A company has to weigh having specialists spread across the local offices against having centralized specialist units that assist the local offices. Centralized specialist units are efficient because all local offices can request assistance from them, i.e. they share these resources.
Another challenge is to offer 24/7 support to customers from different locations. The purpose of having units spread across the globe is to gain efficiency by using global volume and global competence. It also enables shared work over different time zones.
The global support handles customer support requests (CSR). When a customer contacts the global support, a CSR is created. During the following support process the CSR is tracked and data is collected.
The support organisation has three support levels: CU, GSC and PLM. Each of them has its own sub process to support the customer. Together they constitute the global support process, which is described in more detail in Section 1.1.4.
It is difficult for a large scale organisation to analyze effects of changes. Therefore, a method to indicate effects of changes is of interest to study.
1.1.3 Possible ways to analyse a global support process
1.1.3.1 Value stream mapping
Lean is a systematic method for eliminating waste and creating value. Methods such as lean offer a way to analyse a process by observation. Value stream mapping is a method used to map activities in a process. First determine the calendar time for each activity (lead time).
Then measure the time a resource spends on each activity (processing time). Identify the time between activities and any loop-backs. These measures can be used to calculate the flow efficiency and to identify improvement areas (Bicheno et al. 2011).
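As a hedged illustration of the calculation mentioned above, the sketch below computes flow efficiency as total processing time divided by total lead time, which is the usual definition in the lean literature. The activity names and durations are invented for the example and are not taken from the case company.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    name: str                 # hypothetical activity label
    lead_time_h: float        # calendar time spent in the activity, in hours
    processing_time_h: float  # time a resource actively works on it, in hours


def flow_efficiency(activities):
    """Return total processing time divided by total lead time."""
    total_lead = sum(a.lead_time_h for a in activities)
    total_processing = sum(a.processing_time_h for a in activities)
    if total_lead <= 0:
        raise ValueError("total lead time must be positive")
    return total_processing / total_lead


# Invented numbers, for illustration only.
example = [
    Activity("pre-analysis", lead_time_h=4.0, processing_time_h=1.0),
    Activity("analysis", lead_time_h=30.0, processing_time_h=6.0),
    Activity("present solution", lead_time_h=6.0, processing_time_h=1.0),
]
efficiency = flow_efficiency(example)
```

A low value indicates that a request spends most of its calendar time waiting rather than being worked on, which is where value stream mapping would look for improvements.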
1.1.3.2 Performance measures
By observing the process and using average values of performance measures such as lead time, processing time and time between activities, it is possible to calculate average service times, capacity and other values of interest. These measures are examples of parameters that are helpful to managers. However, they do not account for the variation that may exist in the process.
1.1.3.3 Testing new support process
A third approach to analysing the global support process is to test new set-ups or a new process, i.e. a pilot study. A limited group tries a new process, and the performance measures of this new way of working are captured during the trial. This makes it possible to compare the performance measures with those of the original support process. The approach gives reliable results, but it requires extensive testing and involves several people, which makes it time consuming and costly.
1.1.3.4 Simulation
Simulation is another approach, which aims to represent the real-world process by using historical data. With simulation it is possible to test different set-ups without making any changes to the current organisation. One has to assume that the process can be replicated well in a simulation model.
Simulation models are used to solve problems and to provide results that support decision makers. One concern is to make sure that the results are accurate. There are methods to check whether the simulation is a good representation of the real-world process. Model validation and verification address the appropriateness of the representation. For a probabilistic model, which makes inference about random variables, properties such as mean and variance are used to determine validity. There are many different concepts within validation. Conceptual validation tests whether the theories and assumptions are reasonable for the intended purpose of the study. Face validation is a subjective assessment made by individuals who are knowledgeable about the system. They are asked to determine whether the model behaves reasonably with regard to the real-world system (Sargent 2011).
The global support process has several similarities with an incoming call center. According to Kim (2005, p. 390), call centers are commonly used by corporations for a wide range of activities. They can be planned to give service and support and to handle other types of customer inquiries. These types of call centers are commonly referred to as service centers. Calls occupy resources in the service center, and the management of a service center aims to achieve a high service level for customers while using resources efficiently. A service center handles entities with its resources.
Bouzada (2009) uses an empirical case from a call center to compare experimental methods (simulation) with analytical methods (queueing theory). The aim of the study is to compare the methods for dimensioning handling capacity. The results of Bouzada (2009) show that simulation has advantages over the analytical method for dimensioning handling capacity, mainly in complex organisations.
We conclude that these studies show that simulation is a suitable method to analyse a call center. Hence, we can consider simulation as a suitable approach to analyse the global support process.
1.1.4 Process description
As mentioned in Section 1.1.2, the global support handles CSRs. When a customer reports a problem to the global support, a CSR is created with a date and time stamp. During the following support process, variables regarding the CSR are collected in a database. This section describes the global support process, which is presented in Figure 1.2. We describe the sub processes in more detail in the following sections.
Figure 1.2: Global support process.
1.1.4.1 Customer support request duration
CSR duration is defined as the time elapsed from the creation of a CSR until it is closed.
Engineers within each of the three support levels can present a solution to a customer. If the solution resolves the problem, the CSR is closed. A CSR can be closed in any of the support levels, which means that the support process can look different for each CSR.
If a CSR is handled by all three support levels, the duration of the CSR is the time elapsed from start to CSR closed in PLM, as seen in Figure 1.2. If a CSR is handled by the first two levels, the duration is the time elapsed from start to CSR closed in GSC. If a CSR is handled only by the first support level, the duration is the time elapsed from start to CSR closed in CU.
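The definition of CSR duration can be sketched as follows. The data layout (a creation timestamp plus a mapping from the closing support level to its closing timestamp) is an assumption made for this illustration and is not the structure of the case company's data.

```python
from datetime import datetime

# Support levels in escalation order, as described in this section.
LEVELS = ("CU", "GSC", "PLM")


def csr_duration_hours(created, closed_by_level):
    """Return the time in hours from creation until the CSR is closed.

    closed_by_level maps the support level that closed the CSR to the
    closing timestamp (hypothetical layout, assumed for this sketch).
    """
    closing_times = [closed_by_level[lvl] for lvl in LEVELS if lvl in closed_by_level]
    if len(closing_times) != 1:
        raise ValueError("a CSR is closed in exactly one support level")
    closed = closing_times[0]
    if closed < created:
        raise ValueError("closing time precedes creation time")
    return (closed - created).total_seconds() / 3600.0


# Invented timestamps: created in a CU, escalated and closed in a GSC.
created = datetime(2015, 3, 2, 9, 0)
duration = csr_duration_hours(created, {"GSC": datetime(2015, 3, 4, 15, 30)})
```

The duration is measured in calendar time, so it includes any time the CSR spends waiting in a queue or on hold.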
1.1.4.2 Creation of a customer support request
Figure 1.3: CSR creation in the global support process
A CSR is created when a customer contacts someone within a CU and requires support.
CSRs can be created by all sorts of reasons and by any of the customers in the world.
Each time a CSR is created, it gets a date and time stamp and it is registered with a unique CSR ID. Creation of a CSR can be seen in Figure 1.3.
1.1.4.3 Assign engineer
After creation, a CSR needs to be assigned to a support engineer. If a support engineer is available, the CSR is assigned to that engineer. If all engineers are occupied, the CSR is put in a queue. The CSR queue holds CSRs that have not yet been assigned to an engineer. It also holds CSRs that have been analysed by an engineer but are awaiting additional information.
Assign engineer is a sub process that occurs on each support level, see Figure 1.4.
Figure 1.4: Assign Engineer in the global support process.
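A minimal sketch of the assign engineer step is given below. It assumes a first-in-first-out queue and ignores priorities and CSRs on hold, which is a simplification of the process described above. The class and identifier names are hypothetical.

```python
from collections import deque


class SupportLevel:
    """Toy model of the assign engineer step at one support level."""

    def __init__(self, engineers):
        self.idle = deque(engineers)  # engineers without a CSR
        self.queue = deque()          # CSRs waiting for an engineer (FIFO assumed)
        self.assigned = {}            # csr_id -> engineer

    def arrive(self, csr_id):
        """Assign the CSR if an engineer is idle, otherwise queue it."""
        if self.idle:
            self.assigned[csr_id] = self.idle.popleft()
        else:
            self.queue.append(csr_id)

    def release(self, csr_id):
        """The engineer finishes a CSR and takes the next queued one, if any."""
        engineer = self.assigned.pop(csr_id)
        if self.queue:
            self.assigned[self.queue.popleft()] = engineer
        else:
            self.idle.append(engineer)


level = SupportLevel(["eng_a", "eng_b"])
for csr in ("csr1", "csr2", "csr3"):
    level.arrive(csr)
waiting_before = list(level.queue)  # csr3 waits because both engineers are busy
level.release("csr1")               # eng_a becomes free and picks up csr3
```

In a discrete event simulation the calls to arrive and release would be triggered by arrival and service-completion events rather than called directly.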
1.1.4.4 Pre-Analysis
Figure 1.5: Pre-Analysis in the global support process
Pre-Analysis occurs immediately after an engineer has been assigned to a CSR. Pre-Analysis is the first analysis of the customer problem, performed by an engineer to narrow the search for a solution. In order to allocate resources to the "right" CSRs, all CSRs are assigned a priority according to the technical and/or commercial impact of the problem.
A CSR can have one of five different priorities. A customer with a high priority CSR requires short service time and can expect service at any time of day until the problem is resolved. A customer with a low priority CSR can expect longer service time, because service is only given during daytime when the support is open. Customers with low priority CSRs can also expect to be put in a queue while CSRs with higher priority are being served.
The CSR information established during the Pre-Analysis is stored in the business system Systems, Applications and Products (SAP). Pre-Analysis is performed on all support levels, although the process is more extensive for a CU engineer, since the CU is the first point of contact for a customer. Pre-Analysis is shown in Figure 1.5.
1.1.4.5 Analysis
Figure 1.6: Analysis in the global support process
Analysis is the largest sub process in the global support. Prior to the analysis, the support engineer has secured measurement data and/or a remote connection, which makes it possible to troubleshoot and try to resolve the problem together with the customer. More information may be required during analysis. The engineer then requests information either from the customer or from other support personnel who have previously been assigned to the CSR. The engineer puts the CSR on hold and starts working on another CSR. We refer to this scenario of a CSR on hold as a break. The time elapsed from the information request until the information is received is referred to as break time. The expressions break and break time are used throughout the thesis. This creates a loop that can occur several times, depending on the difficulty of isolating and resolving the problem. For example, a support engineer requires measurement data from the last week for a faulty product. The customer responds that it takes a day to get the information. The engineer pauses the analysis until the information is received, and can meanwhile continue to service other CSRs. The Analysis sub process is shown in Figure 1.6.
1.1.4.6 Escalation
If a support level determines that a CSR cannot be solved within that level, a decision can be made to escalate the CSR to the next level. This can happen in both CU and GSC. An escalation is thus a decision by the current organisational level to assign the CSR to the next level. The escalation sub process is visualized in Figure 1.7.
Figure 1.7: Escalation in the global support process.
1.1.4.7 Find and present solution
The support engineer isolates the problem and identifies a recovery procedure. The recovery procedure can take different forms, for example a software update, a hardware change or a product restart.
The support engineer presents the solution to the customer, who in turn tries to recover the faulty product. If the recovery procedure does not resolve the problem, it is rejected by the customer and the support engineer returns to Analysis. If the recovery procedure is accepted by the customer, the CSR is closed and all information regarding the CSR is updated in SAP, e.g. duration, support time and number of support activities.
Find and present solution is shown in Figure 1.8.
Figure 1.8: Find and present solution in the global support process.
1.1.5 Service requirements and work schedule
In Section 1.1.4 we describe the global support process. The organisation working according to this process is large and active all over the world, as seen in Figure 1.1. Having an organisation that offers service to global customers leads to a challenge in giving service during any time of day.
CSRs with priority 4 and 5 are given service 24/7, which requires all support levels to have personnel active at all times. CSRs with priority 1, 2 and 3 are given support during common office hours. We now explain how the organisation is scheduled to meet these service requirements. Figure 1.9 describes the path along which a CSR is escalated depending on the time of day.
Figure 1.9: Path of a CSR during common office hours.
Figure 1.9 shows the path of a CSR with any priority (1, 2, 3, 4 or 5). All CSRs are offered service during common office hours. CSRs with priority 1, 2 or 3 are served only during the units' common office hours.
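The service rule can be expressed as a small predicate. The sketch below is an illustration only: the office hours (Monday to Friday, 08:00 to 17:00 local time) are placeholder values, since the actual hours are not stated here, and the function names are hypothetical.

```python
from datetime import datetime

# Assumed office hours; placeholders, not values from the case company.
OFFICE_START_HOUR = 8
OFFICE_END_HOUR = 17
ALWAYS_SERVED = {4, 5}   # priorities served 24/7
OFFICE_ONLY = {1, 2, 3}  # priorities served during common office hours only


def is_office_hours(t):
    """True on weekdays between the assumed opening and closing hours."""
    return t.weekday() < 5 and OFFICE_START_HOUR <= t.hour < OFFICE_END_HOUR


def is_served(priority, t):
    """True if a CSR with this priority receives service at local time t."""
    if priority in ALWAYS_SERVED:
        return True
    if priority in OFFICE_ONLY:
        return is_office_hours(t)
    raise ValueError(f"unknown priority: {priority}")


saturday_night = datetime(2015, 6, 13, 23, 0)
tuesday_morning = datetime(2015, 6, 16, 10, 0)
served_high = is_served(5, saturday_night)   # priority 5 is served 24/7
served_low = is_served(2, saturday_night)    # priority 2 waits for office hours
```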
Figure 1.10: Path of a CSR with priority 4 or 5, during evenings and weekends, in the Global support.
In the handling of CSRs with priority 4 or 5 there is a handover procedure, which occurs when a GSC closes, see Figure 1.10. The handover procedure passes the service of a CSR from the closing GSC to the newly opened GSC. A reduced number of engineers work to service CSRs with these priorities. The global support has three support levels, CU, GSC and PLM, which we describe below.
Customer unit
There are approximately 150 CUs in the global support, each of which serves one or more customers in its vicinity. Due to the scale of this support level, we regard it as open at all times and able to serve all customers at any time of day. This is a simplified representation of the CUs. We elaborate on this topic in Section 4.2, where we describe our model representation.
Global support center
There are 3 GSCs in the global support. The geographic locations of the GSCs make it possible to offer service at any time of day by sharing work across time zones. Due to confidentiality agreements, the number of engineers in each GSC is referred to as $R_{GSC1}$, $R_{GSC2}$ and $R_{GSC3}$. These are reference values for the number of resources in each GSC during common office hours.
During weekends the global support only needs to serve CSRs with priority 4 or 5, so there is less need for service. Because of this, the number of support engineers in each GSC is reduced during weekends.
Product line maintenance
There is 1 PLM unit in the global support. Due to confidentiality agreements, we refer to the number of engineers in PLM as $R_{PLM}$. This is a reference value for the number of resources in PLM during common office hours.
Similar to the GSCs, there is less need for service of CSRs during evenings and weekends. Because of this, the number of support engineers is reduced during evenings and weekends.
1.2 Purpose
The purpose of this study is to use simulation to represent a real-world process, namely the global support process described in Section 1.1.4. The simulation model shall provide approximate results that can indicate the effects of organizational changes, and it shall make it possible to test hypothetical set-ups. This thesis shall increase the case company's knowledge of the support process. We answer the following questions in this thesis:
• How can the global support be effectively simulated?
• How can simulation provide approximate results that can support the decision makers in managing the global support?
1.2.1 Potential process enhancement
These potential improvements correspond to testing alternative set-ups of the support organisation or altering the global support process. We want to test the following potential process enhancements.
1.2.1.1 Early Routing
As described in Section 1.1.4, a CSR can be escalated to higher support levels. In the current organisation a CSR is created in one of the CUs and is escalated in turn to GSC and then to PLM.
It is therefore of interest to study the potential enhancement of early routing. We assign a person with knowledge about the process and common customer problems, whose role is to route each CSR to the support level that is most suitable given the CSR characteristics.
We assume that the people working with routing have the knowledge required to send each CSR to the "right" support level efficiently. Using observed data, we introduce an enhancement with routing functionality that uses probabilities based on historical data.
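One way to read routing by probabilities based on historical data is sketched below: the share of historical CSRs closed at each support level is used as the probability of routing a new CSR directly to that level. This is an illustrative assumption rather than the implementation used in the simulation model, and the history is invented.

```python
import random
from collections import Counter


def routing_probabilities(closing_levels):
    """Estimate P(level) from the levels that closed historical CSRs."""
    counts = Counter(closing_levels)
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


def route(probabilities, rng):
    """Draw a support level according to the estimated probabilities."""
    levels = list(probabilities)
    weights = [probabilities[level] for level in levels]
    return rng.choices(levels, weights=weights, k=1)[0]


# Invented history, for illustration only.
history = ["CU"] * 6 + ["GSC"] * 3 + ["PLM"] * 1
probs = routing_probabilities(history)
rng = random.Random(2015)  # fixed seed so the sketch is reproducible
routed = [route(probs, rng) for _ in range(1000)]
```

In practice the probabilities could be conditioned on CSR characteristics such as priority or region, so that the routing reflects more than the overall closing frequencies.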
1.2.1.2 No individual assignment of customer support request in the global support centers
For readability we refer to this process enhancement as NIAC in GSC. In Figure 1.2 we can see that there is a CSR queue on every support level. From Section 1.1.4.3 we know that this queue holds CSRs that have not been assigned to an engineer.
Consider a CSR that has been assigned to an engineer. If the engineer requests information from the customer, support is paused until the information is received, after which the same engineer continues to support the CSR. We introduce a potential enhancement where a CSR is placed back in the CSR queue when it is paused for an information request. With this potential enhancement we allow any engineer to continue where the last engineer stopped, without any loss of efficiency.
1.2.1.3 Number of engineers in product line maintenance
The third enhancement concerns the number of engineers in the PLM support level. We want to test different numbers of support engineers in the highest support level and test whether this affects the overall service time in the support process.
1.3 Observed Data
As part of this study, relevant data is collected to build a simulation model representing the global support process. The case company records data for each CSR. In total, 180 characteristics and 145 key figures are recorded. The data provided for this study is historical data recorded for CSRs during 2015. The data contains information such as:
• Date and time.
• Priority level.
• Origin such as customer, country and region.
• Support handling time carried out by each support level (CU, GSC, PLM).
• Number of support activities.
• Duration of CSR, i.e. time from creation until CSR is closed.
1.4 Delimitations
In order to make this thesis viable, a few limitations are introduced. The data considered in the thesis, see Section 1.3, comes from CSRs regarding one specific product of the global support service. Firstly, the reason for this is the amount and size of the data. This specific product might have complex correlations with other products, which are not considered in this thesis. Secondly, our thesis aims to effectively simulate the global support, and one would expect that a similar model could be used for other product types if that is of interest.
Data and information about the customer units (see Figure 1.1) are not attainable, due to the geographic locations and the number of CUs. Therefore, a simplified representation of the CUs is used.
In the global support process, two or more engineers might work on the same CSR at the same time. This event is difficult to track in the observed data and is therefore not considered in our simulation model.
1.5 Approach and Outline
In this thesis we use simulation to analyse and evaluate a global support process. More specifically, we study the flow of CSRs through a global support process. Selected parts of the CSR data mentioned in Section 1.3 are used to represent the real-world system in the simulation model.
In Chapter 2 we present the underlying theory used in the thesis, including probability theory, statistical inference and simulation theory. In Chapter 3 we present the considered data and the data processing. In Chapter 4 we present how the theory is applied, motivate the choice of simulation approach and describe the model representation. The results and the validation of the model are presented in Chapter 5, followed by discussion and conclusions in Chapter 6.
2 Theory
In this chapter we start by defining fundamental concepts in probability theory, followed by statistical inference. Then we define theory of stochastic simulation, and discrete event simulation. Finally we also present theory about software used in this thesis.
2.1 Probability Theory
2.1.1 Sample space, events and probability
Let S denote a sample space of the experiment, which is the set of all possible outcomes.
Any subset $A$ of the sample space is known as an event. For each event $A$ of an experiment having sample space $S$ there is a number $P(A)$, called the probability of event $A$.
2.1.2 Axioms of probability
P is a probability measure of event A for a sample space S if it satisfies the following axioms:
Axiom 1: $0 \le P(A) \le 1$
Axiom 2: $P(S) = 1$
Axiom 3: For any sequence of mutually exclusive events $A_1, A_2, \ldots$
\[ P\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{i=1}^{n} P(A_i), \quad n = 1, 2, \ldots \]
2.1.3 Random variable
Experiments are carried out to find a numerical quantity of interest. The resulting numerical quantity is an observation of a random variable. A set of observations of a random variable is called a sample.
There are two major types of random variables: discrete random variables and continuous random variables. Discrete random variables can take on a finite or at most countably infinite number of values. Continuous random variables can take on an uncountable number of values.
2.1.4 Random variable characteristics
A random variable can be described by some characteristics, that describe its possible out- comes. For a discrete random variable X, the likelihood of taking on a specific value, is given by the probability mass function p ( x ) which is defined by:
p ( x ) = P ( X = x ) .
For a continuous random variable the probability of taking on any given value is 0, because there are uncountably many values. Therefore it is more suitable to talk about the probability of ending up in a given set. A continuous random variable X has a probability density function (pdf) $f_X$, defined for all real numbers x and having the property that for any set A of real numbers:
$$P(X \in A) = \int_A f_X(x)\,dx.$$
Both types of random variables share a common characteristic, the cumulative distribution function (cdf) F, which is defined as
F ( x ) = P ( X ≤ x ) .
For further reading on the theory of random variables, we refer to Ross (2012).
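The relation between the probability mass function and the cdf can be sketched in a few lines. The random variable below (number of heads in two fair coin tosses) and the names `pmf` and `cdf` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical discrete random variable X: number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """F(x) = P(X <= x): sum the pmf over all possible values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

for x in (-1, 0, 1.5, 2):
    print(x, cdf(x))
```

The cdf is a step function here: it is 0 below the smallest possible value, jumps by p(x) at each possible value x, and equals 1 from the largest possible value onwards.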
2.1.5 Expected value and variance
The expected value of a random variable X is the weighted average of its possible values, where each value is weighted with the probability of X taking that value. The expected value of a random variable X is denoted by µ = E[X], and defined as:
$$\mu = E[X] = \begin{cases} \int_{-\infty}^{\infty} x f(x)\,dx, & \text{if } X \text{ is a continuous random variable,} \\ \sum_i x_i P(X = x_i), & \text{if } X \text{ is a discrete random variable.} \end{cases}$$
Variance is a measure of the variation in the possible values of the random variable X. If X is a random variable with mean µ, then the variance of X, denoted by $\sigma^2$, is defined as:
$$\sigma^2 = \mathrm{Var}(X) = E[(X - \mu)^2].$$
Another measure of variation is the standard deviation σ, which is defined as the square root of the variance.
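For a discrete random variable the definitions of expected value, variance and standard deviation can be evaluated directly. The following sketch uses a hypothetical pmf (number of heads in two fair coin tosses); the variable names are assumptions for the example.

```python
import math
from fractions import Fraction

# Hypothetical pmf: number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

# Expected value: each possible value weighted by its probability.
mu = sum(x * p for x, p in pmf.items())

# Variance: expected squared deviation from the mean.
var = sum((x - mu) ** 2 * p for x, p in pmf.items())

# Standard deviation: square root of the variance.
sd = math.sqrt(var)

print(mu, var, sd)
```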
2.1.6 Distributions used in this thesis
In this section we define the distributions used in this thesis.
2.1.6.1 Normal distribution
We say that a random variable X is normally distributed with parameters µ ∈ R and σ > 0, and denote it by
$$X \sim N(\mu, \sigma^2),$$
if X has the density
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\sigma^2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R}.$$
For a variable defined in this way, we have E(X) = µ and Var(X) = $\sigma^2$.
2.1.6.2 t-distribution
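We say that a random variable X is t-distributed with $\nu \in \mathbb{N}^+$ degrees of freedom, and denote it by
$$X \sim t_\nu,$$
if X has the density
$$f(x \mid \nu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \quad x \in \mathbb{R},$$
where the gamma function Γ is defined by
$$\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx.$$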
2.1.6.3 Chi-squared distribution
We say that a random variable X is chi-squared distributed with $k \in \mathbb{N}^+$ degrees of freedom, and denote it by
$$X \sim \chi^2(k),$$
if X has the density
$$f(x \mid k) = \frac{1}{2^{k/2}\,\Gamma\left(\frac{k}{2}\right)}\, x^{\frac{k}{2}-1} e^{-\frac{x}{2}}, \quad x > 0.$$
2.1.6.4 F-distribution
We say that a random variable X is F-distributed with parameters $d_1 \in \mathbb{N}^+$ and $d_2 \in \mathbb{N}^+$, and denote it by
$$X \sim F(d_1, d_2),$$
if X has the density
$$f(x \mid d_1, d_2) = \frac{1}{B\left(\frac{d_1}{2}, \frac{d_2}{2}\right)} \left(\frac{d_1}{d_2}\right)^{\frac{d_1}{2}} x^{\frac{d_1}{2}-1} \left(1 + \frac{d_1}{d_2}\,x\right)^{-\frac{d_1+d_2}{2}}, \quad x > 0,$$
where the function B(x, y), x > 0 and y > 0, is defined by
$$B(x, y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt.$$
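The four densities can be written out with the Python standard library as a way of checking the formulas. This is a minimal sketch, not the software used in the thesis. The function names are assumptions, and the beta function is evaluated through the known identity B(x, y) = Γ(x)Γ(y)/Γ(x + y) rather than by integration.

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * sigma2 * math.pi)

def t_pdf(x, nu):
    """Density of the t-distribution with nu degrees of freedom at x."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def chi2_pdf(x, k):
    """Density of the chi-squared distribution with k degrees of freedom, x > 0."""
    return x ** (k / 2 - 1) * math.exp(-x / 2) / (2 ** (k / 2) * math.gamma(k / 2))

def beta_fn(a, b):
    """Beta function via the identity B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def f_pdf(x, d1, d2):
    """Density of the F(d1, d2) distribution, x > 0."""
    return ((d1 / d2) ** (d1 / 2) * x ** (d1 / 2 - 1)
            * (1 + d1 * x / d2) ** (-(d1 + d2) / 2) / beta_fn(d1 / 2, d2 / 2))

print(normal_pdf(0.0, 0.0, 1.0), t_pdf(0.0, 1), chi2_pdf(2.0, 2), f_pdf(1.0, 2, 2))
```

A few values can be verified by hand: the standard normal density at 0 is 1/√(2π), the t-density with one degree of freedom at 0 is 1/π, the chi-squared density with k = 2 at x = 2 is e^{-1}/2, and the F(2, 2) density at x = 1 is 1/4.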
2.2 Statistical Inference
Statistical inference is the part of statistics that aims to derive properties of an underlying distribution, e.g. an unknown parameter θ, by analysing a set of data. Inferential statistical analysis draws conclusions about a population using hypothesis tests and estimates. The theory presented in this section can be found in (Alm and Britton 2008).
2.2.1 Sample mean and variance
Let us start by defining a sample $x = (x_1, x_2, \ldots, x_n)$ from the random variable X with the cdf $F_X(x; \theta)$. We introduce the point estimates of the expected value and variance, namely the sample mean and sample variance. The sample mean is defined by
$$\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \ldots + x_n}{n}.$$
Then, the sample variance is defined by
$$s^2 := \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2,$$
and the sample standard deviation is denoted by s.
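The point estimates above translate directly into code. The observations in `x` are made-up numbers used only to illustrate the formulas; the comparison with `statistics.variance` shows that the standard library uses the same n − 1 denominator.

```python
import statistics

x = [4.0, 7.0, 5.0, 8.0]  # hypothetical observations, not data from the case company
n = len(x)

x_bar = sum(x) / n                                  # sample mean
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)   # sample variance
s = s2 ** 0.5                                       # sample standard deviation

print(x_bar, s2, s)
print(statistics.mean(x), statistics.variance(x), statistics.stdev(x))
```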
2.2.2 Hypothesis testing
In the hypothesis testing framework we want to test a null hypothesis about the unknown parameter θ:
$$H_0 : \theta = \theta_0,$$
against an alternative hypothesis $H_1$. The alternative hypothesis can be of different types:
• Simple alternative hypothesis: $H_1 : \theta = \theta_1$
• Composite alternative hypothesis:
– one sided: $H_1 : \theta < \theta_0$ or $H_1 : \theta > \theta_0$
– two sided: $H_1 : \theta \neq \theta_0$
When a test is performed, the null hypothesis is stated together with the alternative hypothesis, for example
$$\begin{aligned} H_0 &: \theta = \theta_0 \\ H_1 &: \theta \neq \theta_0. \end{aligned}$$
When testing the null hypothesis there are two types of errors that can occur, Type 1 error and Type 2 error. These errors and how they emerge are summarized in Table 2.1.
Table 2.1: How Type 1 and Type 2 errors emerge
Decision \ Reality | $H_0$ false | $H_0$ true
Reject $H_0$ | Correct | Type 1 error
Do not reject $H_0$ | Type 2 error | Correct
The probability of making a Type 1 error is denoted by α, i.e.
$$\alpha = P(\text{Rejecting } H_0 \text{ when } H_0 \text{ is true}),$$
and the probability of making a Type 2 error is denoted by β, i.e.
$$\beta = P(\text{Not rejecting } H_0 \text{ when } H_0 \text{ is false}).$$
The quantity α is called the significance level of the test, and in most cases α is controlled when a test is constructed.
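The meaning of the significance level can be illustrated by simulation: if data are generated with the null hypothesis true, a test at level α should reject in roughly a fraction α of the repetitions. The sketch below assumes a two-sided z-test for the mean of normal data with known variance 1. This test, the function name `type1_rate` and all parameter values are assumptions for the illustration and are not taken from the thesis.

```python
import random
from statistics import NormalDist

def type1_rate(alpha=0.05, n=20, reps=4000, seed=1):
    """Estimate the Type 1 error rate of a two-sided z-test by simulation."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(reps):
        # Generate data with H0 true: mean 0, known standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * n ** 0.5  # standardized sample mean
        if abs(z) >= z_crit:
            rejections += 1  # H0 is true but rejected: a Type 1 error
    return rejections / reps

rate = type1_rate()
print(rate)
```

The estimated rate should lie close to the chosen α = 0.05, with deviations due to simulation noise.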
2.2.3 Methods of tests
There are three different methods to perform a test: the test variable method, the direct method and the confidence interval method. In this thesis we present the direct method.
If we want to test $H_0 : \theta = \theta_0$ at significance level α, we start by finding a reference variable $R_\theta$ for the parameter θ, whose distribution by definition does not depend on θ, regardless of what value θ takes. Then we choose the test variable
$$T(X) = R_{\theta_0}(X).$$
If the null hypothesis is true, then T(X) has a fully known distribution. In the direct method we find a test variable T(X) and then, from the result x of the experiment, calculate
$$p\text{-value} := P_{H_0}(\text{obtaining a value of } T(X) \text{ at least as extreme as the observed } T(x)),$$
where $P_{H_0}$ stands for the probability calculated under the assumption that the null hypothesis is true. Then, for a significance level α, we reject $H_0$ if p-value ≤ α and we do not reject $H_0$ if p-value > α.
In general, when we want to test $H_0 : \theta = \theta_0$ against:
• $H_1 : \theta > \theta_0$, then more extreme means greater than T(x). Hence, $p\text{-value} = P_{H_0}(T(X) \ge T(x))$
• $H_1 : \theta < \theta_0$, then more extreme means smaller than T(x). Hence, $p\text{-value} = P_{H_0}(T(X) \le T(x))$
• $H_1 : \theta \neq \theta_0$, then more extreme means that the absolute value of T(X) is greater than the absolute value of T(x). Hence, $p\text{-value} = P_{H_0}(|T(X)| \ge |T(x)|)$
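The three cases can be sketched as a single function. To keep the example self-contained, the test variable is assumed to follow a standard normal distribution under the null hypothesis. This is an illustrative assumption, and the function name `p_value` and the labels for the alternatives are hypothetical. For a symmetric null distribution the two-sided probability equals twice the upper tail beyond |T(x)|.

```python
from statistics import NormalDist

def p_value(t_obs, alternative):
    """p-value of the observed test variable, assuming T(X) ~ N(0, 1) under H0."""
    null = NormalDist(0.0, 1.0)
    if alternative == "greater":      # H1: theta > theta_0
        return 1.0 - null.cdf(t_obs)
    if alternative == "less":         # H1: theta < theta_0
        return null.cdf(t_obs)
    if alternative == "two-sided":    # H1: theta != theta_0
        return 2.0 * (1.0 - null.cdf(abs(t_obs)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

alpha = 0.05
p = p_value(2.3, "two-sided")
print(p, "reject H0" if p <= alpha else "do not reject H0")
```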
2.2.4 Inference about differences in means of populations
We now introduce procedures to test equality of means. Equality of two population means is tested with a two-sample t-test, while equality of several means is tested with Analysis of Variance (ANOVA).
2.2.4.1 Two-sample t-test
Assume that we have observed two independent samples $x = (x_1, x_2, \ldots, x_{n_1})$ from $X \sim N(\mu_1, \sigma_1^2)$ and $y = (y_1, y_2, \ldots, y_{n_2})$ from $Y \sim N(\mu_2, \sigma_2^2)$, where $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$ are the means and variances of the random variables X and Y respectively. Let $\mathbf{X}$ and $\mathbf{Y}$ be the vectors of random variables corresponding to the observed samples. In what follows, no assumption is made regarding the sizes of the two samples, and the variances are unknown. With $s_x^2$ and $s_y^2$ denoting the sample variances of the two samples, the reference variable for $\theta = \mu_1 - \mu_2$ is given by
$$R_{\mu_1 - \mu_2}(\mathbf{X}, \mathbf{Y}) = \frac{\bar{X} - \bar{Y} - (\mu_1 - \mu_2)}{\sqrt{\frac{s_x^2}{n_1} + \frac{s_y^2}{n_2}}},$$
which is approximately t-distributed with f degrees of freedom, where
$$\frac{1}{f} = \frac{1}{n_1 - 1}\,\frac{(n_2 s_x^2)^2}{(n_2 s_x^2 + n_1 s_y^2)^2} + \frac{1}{n_2 - 1}\,\frac{(n_1 s_y^2)^2}{(n_2 s_x^2 + n_1 s_y^2)^2}.$$
Then, to test $H_0 : \mu_1 - \mu_2 = 0$, we use the introduced reference variable and obtain the test variable
$$T(\mathbf{X}, \mathbf{Y}) = R_{\mu_1 - \mu_2}(\mathbf{X}, \mathbf{Y}) = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_x^2}{n_1} + \frac{s_y^2}{n_2}}},$$
which under $H_0$ is approximately t-distributed with f degrees of freedom. Now, for example, if $H_1 : \mu_1 - \mu_2 \neq 0$, we reject the null hypothesis if $P(|T(\mathbf{X}, \mathbf{Y})| \ge |T(x, y)|) = p\text{-value} \le \alpha$.
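The observed test variable and the approximate degrees of freedom can be computed with a few lines of standard-library Python. The function name `welch_statistic` and the two samples are hypothetical. The p-value itself requires the cdf of the t-distribution, which the standard library does not provide, so it is left out of this sketch.

```python
import math
import statistics

def welch_statistic(x, y):
    """Return the observed test variable T(x, y) and the degrees of freedom f."""
    n1, n2 = len(x), len(y)
    s2x, s2y = statistics.variance(x), statistics.variance(y)  # sample variances
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(s2x / n1 + s2y / n2)
    # 1/f as defined in the text, with a = n2 * s_x^2 and b = n1 * s_y^2.
    a, b = n2 * s2x, n1 * s2y
    inv_f = (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1)) / (a + b) ** 2
    return t, 1.0 / inv_f

x = [5.1, 4.9, 6.2, 5.8, 6.0]  # hypothetical sample from X
y = [4.2, 4.8, 4.0, 5.1]       # hypothetical sample from Y
t_obs, f = welch_statistic(x, y)
print(t_obs, f)
```

The value of f is in general not an integer. It always lies between min(n1, n2) − 1 and n1 + n2 − 2, which gives a quick sanity check on an implementation.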
2.2.4.2 Analysis of variance
We now consider the procedure for testing differences in means of several populations, which is the analysis of variance, ANOVA. Consider the representation of observations in Table 2.2.
Table 2.2: Representation of observations
Population (factor A) | Observations | Statistics | Distribution
1 | $x_{11}, x_{12}, \ldots, x_{1n_1}$ | $\bar{x}_{1\bullet}, s_1^2$ | $X_1 \sim N(\mu_1, \sigma^2)$
2 | $x_{21}, x_{22}, \ldots, x_{2n_2}$ | $\bar{x}_{2\bullet}, s_2^2$ | $X_2 \sim N(\mu_2, \sigma^2)$
$\vdots$ | $\vdots$ | |
p | $x_{p1}, x_{p2}, \ldots, x_{pn_p}$ | $\bar{x}_{p\bullet}, s_p^2$ | $X_p \sim N(\mu_p, \sigma^2)$
The statistics in Table 2.2 are defined as
$$\bar{x}_{i\bullet} = \frac{1}{n_i}\sum_{j=1}^{n_i} x_{ij}, \qquad s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_{i\bullet})^2.$$
In total we have $N = \sum_{i=1}^{p} n_i$ observations, and we define $\bar{x}_{\bullet\bullet}$ as the total average of all observations. To test the null hypothesis
$$\begin{aligned} H_0 &: \mu_1 = \mu_2 = \ldots = \mu_p \\ H_1 &: \text{at least one pair differs} \end{aligned}$$
we can make use of the assumption of equal variances. The main idea is to construct two estimators of $\sigma^2$ to make inference. First construct an estimator of $\sigma^2$ which is unbiased regardless of the null hypothesis. Then construct another estimator of $\sigma^2$ which is unbiased only under $H_0$. If the observed ratio of the two estimates deviates significantly from 1, we reject $H_0$. Let $X_{ij}$ be a random variable which corresponds to $x_{ij}$, $j = 1, 2, \ldots, n_i$ and $i = 1, 2, \ldots, p$. Further, let $\mathbf{X}$ be the vector containing all the variables $X_{ij}$, $j = 1, 2, \ldots, n_i$ and $i = 1, 2, \ldots, p$.
Each of $s_1^2(\mathbf{X}), \ldots, s_p^2(\mathbf{X})$ is an unbiased estimator of $\sigma^2$, regardless of whether the null hypothesis is true or not. We can pool all the estimators to obtain one:
$$s_e^2(\mathbf{X}) = \frac{\sum_{i=1}^{p} (n_i - 1)\, s_i^2(\mathbf{X})}{\sum_{i=1}^{p} (n_i - 1)} = \frac{\sum_{i=1}^{p} (n_i - 1)\, \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\bullet})^2}{N - p} = \frac{\sum_{i=1}^{p} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_{i\bullet})^2}{N - p} = \frac{SSE}{N - p} = MSE.$$
The abbreviation SSE stands for sum of squared errors, and MSE for mean square error. One can also show that $SSE/\sigma^2 \sim \chi^2(N - p)$.
The second estimator, which is unbiased only under the null hypothesis, is
$$S_A^2(\mathbf{X}) = \frac{\sum_{i=1}^{p} \sum_{j=1}^{n_i} (\bar{X}_{i\bullet} - \bar{X}_{\bullet\bullet})^2}{p - 1} = \frac{\sum_{i=1}^{p} n_i (\bar{X}_{i\bullet} - \bar{X}_{\bullet\bullet})^2}{p - 1} = \frac{SSA}{p - 1} = MSA.$$
The abbreviation SSA stands for factor A sum of squares, and MSA for factor A mean square. Moreover, one can prove that, under $H_0$, $SSA/\sigma^2 \sim \chi^2(p - 1)$, and that SSA is independent of SSE. Further, it is possible to show that if the null hypothesis is violated, the bias of MSA is positive, hence it overestimates the true value $\sigma^2$.
Now we can build the ratio
$$T(\mathbf{X}) = \frac{MSA}{MSE} \overset{\text{under } H_0}{\sim} F(p - 1, N - p).$$
Using the distribution results, we reject the null hypothesis if $P(T(\mathbf{X}) \ge T(x)) = p\text{-value} \le \alpha$.
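The two estimators and their ratio can be computed directly from grouped observations. The sketch below uses only built-in Python. The function name `one_way_anova` and the three small groups are hypothetical. The p-value would additionally require the cdf of the F-distribution, which is not part of the standard library and is therefore omitted.

```python
def one_way_anova(groups):
    """Return SSA, SSE and the observed ratio T(x) = MSA / MSE."""
    p = len(groups)                       # number of populations
    N = sum(len(g) for g in groups)       # total number of observations
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    # SSE: squared deviations of observations from their own group mean.
    sse = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    # SSA: squared deviations of group means from the total average,
    # weighted by the group sizes.
    ssa = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    mse = sse / (N - p)
    msa = ssa / (p - 1)
    return ssa, sse, msa / mse

groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # hypothetical data
ssa, sse, ratio = one_way_anova(groups)
print(ssa, sse, ratio)
```

For these data the group means are 2, 3 and 6, which gives SSE = 6 and SSA = 26 and hence a ratio of 13 (up to floating point rounding). A ratio this far above 1 indicates that the group means differ more than the within-group variation would explain under the null hypothesis.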
2.2.4.3 Normality assumption violation
In both test procedures introduced above, we assume that the underlying distributions are normal, and we use normality to construct the reference variable and the test variable. If we cannot ensure that a data set comes from a normal distribution, then we cannot guarantee the correctness of the introduced methods.
By considering our test variables from the previous sections:
$$T(\mathbf{X}, \mathbf{Y}) = R_{\mu_1 - \mu_2}(\mathbf{X}, \mathbf{Y}) = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s_x^2}{n_1} + \frac{s_y^2}{n_2}}}$$