
A Trade-based Inference Algorithm for Counterfactual Performance Estimation


DEGREE PROJECT IN MATHEMATICS,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2019

A Trade-based Inference Algorithm for Counterfactual Performance Estimation

SIMON ALMERSTRÖM PRZYBYL


Degree Projects in Mathematics (30 ECTS credits)
Degree Programme in Mathematics (120 credits)
KTH Royal Institute of Technology, year 2019
Supervisor at Intrum: Jim Idefeldt
Supervisor at KTH: Tatjana Pavlenko
Examiner at KTH: Tatjana Pavlenko


TRITA-SCI-GRU 2019:273
MAT-E 2019:70

Royal Institute of Technology
School of Engineering Sciences
KTH SCI
SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci


Abstract

A methodology for increasing the success rate in debt collection by matching individual call center agents with optimal debtors is developed. This methodology, called the trade algorithm, consists of the following steps. The trade algorithm first identifies groups of debtors for which agent performance varies. Based on these differences in performance, agents are put into clusters. An optimal call allocation for the clusters is then decided. Two methods to estimate the performance of an optimal call allocation are suggested. These methods are combined with Monte Carlo cross-validation and an alternative time-consistent validation procedure. Tests of significance are applied to the results and the effect size is estimated.

The trade algorithm is applied to a dataset from the credit management services company Intrum and is shown to enhance performance.

Sammanfattning

En metodik för att öka andelen lyckade inkassoärenden genom att para ihop telefonhandläggare med optimala gäldenärer utvecklas. Denna metodik, kallad handelsalgoritmen, består av följande steg. Handelsalgoritmen identifierar först grupper av gäldenärer för vilka agenters prestationsförmåga varierar. Utifrån dessa skillnader i prestationsförmåga placeras agenter i kluster. En optimal samtalsallokering för klustren bestäms sedan. Två metoder för att estimera en optimal samtalsallokerings prestanda föreslås. Dessa metoder kombineras med Monte Carlo-korsvalidering och en alternativ tidskonsistent valideringsteknik. Signifikanstester tillämpas på resultaten och effektstorleken estimeras.

Handelsalgoritmen tillämpas på data från kredithanteringsföretaget Intrum och visas förbättra prestanda.


Acknowledgements

I would like to thank:

First and foremost, my supervisor Jim Idefeldt at Intrum for guiding me throughout the thesis, being generous with his time, and encouraging me to explore new ideas. Dacil Ullman at Intrum for daring to take in a thesis worker and making it happen. Ekaterina Kruglov at Intrum for establishing the initial contact with Intrum. My academic supervisor Tatjana Pavlenko at KTH for her guidance and for taking me on as a student even though I was outside of her area of responsibility. Emilio Zamorano de Acha at Intrum for giving feedback on a draft of the thesis and also for preparing me for the presentation by discussing possible improvements. The entire Intrum Group Data Analytics team for listening to my half-time report presentation, giving feedback, and helping me learn Intrum's IT systems (with special thanks to Arturs Valujevs). Mickael Bäckman at Intrum for giving feedback on a draft of the thesis. Katarína Mpofu and Ioanna Zygaki at Intrum for being great lunch company and supporting me. Vanessa Söderberg at Intrum for inspiring and challenging me. Franz Österback, Samuel Kerneur, and Hans Ravén, all at Intrum, for brightening up the office. Intrum as a company for providing the resources and access to data. Josephine Sullivan at KTH for providing initial guidance for the thesis. The KTH mathematics student office for always being helpful. My classmates, especially Tobias Magnusson, for helping me make it through the tough courses. My friends for helping me relax from studying. Finally, I would like to thank my family Bogdan, Elisabet, Erik, and Tove for tirelessly caring for me.


Contents

1 Introduction
1.1 Project Description

2 The Trade Algorithm
2.1 Framework
2.2 Binary Trade
2.3 Overperformance
2.4 Estimating Performance
2.4.1 Scaling Method
2.4.2 Simulation Method
2.4.3 Comparison of Estimation Methods
2.5 Increasing Stability and Maximizing Performance
2.5.1 Agent Clustering
2.5.2 Optimal Debtor Partition
2.6 Validation Methods
2.6.1 Monte Carlo Cross-validation
2.6.2 Rolling Validation
2.7 Workflow
2.8 General Trade

3 Intrum Application
3.1 Dataset
3.2 Scorecard
3.3 Optimal Debtor Partition
3.4 Results for D
3.5 Results for D0
3.6 Further Results for D∗ = 31
3.7 Performance on Test

4 Conclusion
4.1 Joint Hypothesis Problem
4.2 Dynamic Effects
4.3 Obstacles for Implementation

5 Theoretical Background
5.1 Logistic Regression
5.2 Properties of Estimators
5.3 Distribution Tests
5.3.1 Fisher Sign Test
5.3.2 Wilcoxon-Mann-Whitney Rank Sum Test
5.3.3 Kolmogorov-Smirnov Test

Figures


Chapter 1

Introduction

1.1 Project Description

The purpose of this project is to investigate the following question:

Question. Can debt collection call center performance be increased by matching call center agents with debtors?

To illustrate the idea behind this question, consider a call center with two agents A_1 and A_2, and assume the debtors can be divided into two groups of debtors D_1 and D_2. The result of a call from an agent to a debtor is to either receive payment within a specified time period or not. Assume the following table describes the historical success rates of the call center, and assume these rates will remain the same in the future:

       D_1    D_2
A_1    0.43   0.46
A_2    0.39   0.44

Table 1.1: Historical success rates of a fictional call center.

The aggregate success rate of this call center is maximized by letting the agents specialize: if A_1 calls D_1, the loss in probability of success is 0.46 − 0.43 = 0.03 compared to the alternative (A_1 calling D_2). However, if A_2 calls D_1, the loss in probability of success is 0.44 − 0.39 = 0.05 compared to the alternative. Thus a larger loss is incurred when A_2 calls D_1 than when A_1 calls D_1. It follows that the aggregate success rate will be maximized by letting A_1 only call D_1 and A_2 only call D_2 (this will be proved formally in chapter 2).
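The argument above can be checked by brute force over the two possible one-call-per-agent allocations; a minimal sketch (the rates are those of Table 1.1, everything else is illustrative):

```python
from itertools import permutations

# Historical success rates from Table 1.1: rates[agent][group]
rates = [[0.43, 0.46],   # agent A1 calling D1, D2
         [0.39, 0.44]]   # agent A2 calling D1, D2

# Each agent makes one call and each group receives one call,
# so an allocation is a permutation of groups over agents.
best = max(permutations(range(2)),
           key=lambda alloc: sum(rates[i][g] for i, g in enumerate(alloc)))

print(best)  # (0, 1): A1 calls D1 and A2 calls D2, as argued above
```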

In this thesis, we will answer the question above by investigating the following more specific questions:


ii. Using historical data, how do we empirically estimate the performance of using an optimal call allocation?

iii. How do we increase the stability of our results?

iv. How should we divide debtors into groups to maximize performance?

The scope of the thesis is to answer these questions by constructing the trade algorithm (chapter 2) and then applying this algorithm to a real-world dataset (chapter 3) from the credit management services company Intrum.

For reasons we will present later in the thesis, we introduce an alternative way of measuring the performance of a call center: rather than using the success rate, we use a measure called actual overperformance. Intuitively, a 0.01 increase in actual overperformance corresponds to a 1 percentage point increase in the success rate for a given difficulty of the debtor population. For the Intrum dataset, we apply statistical tests which strongly reject the null hypothesis that the trade algorithm yields no increase in out-of-sample actual overperformance (with p-values being essentially zero for the relevant tests). More specifically, we obtain that the algorithm increases out-of-sample actual overperformance by at least 0.004 (at the 95% confidence level). This corresponds to raising the success rate of the call center from its current 38.8% to 39.2%, an increase of over 1% in the number of successful cases.


Chapter 2

The Trade Algorithm

2.1 Framework

In this section, we set up the basic framework for allocating calls.

• A call center consists of:

i. m agents denoted by A_1, ..., A_m.

ii. N debtors denoted by d_1, ..., d_N, to whom calls will be allocated.

iii. A sequence of strictly positive integers m_1, ..., m_m representing the number of calls each agent has to make. Note that \sum_i m_i = N has to hold.

The agents make calls to the debtors. Each call results in either receiving or not receiving payment within a specified time period, represented by the following random variable:

Y_d = \begin{cases} 1, & \text{if } d \text{ pays} \\ 0, & \text{if } d \text{ does not pay} \end{cases}

We denote P(Y_d = 1) by p(d) and call this the score of debtor d. Y_d is thus a Bernoulli variable with parameter p(d).

• A debtor partition is a partition of the set of N debtors into n subsets D_1, ..., D_n. A partitioned call center is a call center together with a debtor partition. Each D_j is called a debtor group. We denote the size |D_j| by n_j, and thus \sum_j n_j = N holds. For the remaining definitions of this section, let a partitioned call center be fixed.

• A call allocation is a matrix X = (x_{i,j}) of non-negative integers satisfying:

\sum_j x_{i,j} = m_i \quad (AC)
\sum_i x_{i,j} = n_j \quad (DC)


These conditions must hold for all i and j respectively; AC and DC denote the agent and debtor conditions. The number x_{i,j} represents the number of calls agent A_i will make to debtor group D_j, and the conditions ensure that all debtors are called and that each agent makes the right number of calls.

• Define the following random variables:

– Y_{i,j} = the number of successful calls from A_i to D_j.

– Y = the total number of successful calls; thus Y = \sum_{i,j} Y_{i,j} = \sum_d Y_d.

Use | to denote conditioning; for example, Y_{i,j} | X is Y_{i,j} given that the call allocation X is used.

• Adopt the following assumption:

Model Assumption 1. For each A_i, assume that the following expression is constant for all d ∈ D_j:

p_{i,j} = P(d \text{ pays} \mid A_i \text{ calls})

This model assumption specifies a fixed agent effect we assume is stable. We are therefore ignoring the effects of agents changing over time by, for example, learning, reacting to changes in their work tasks (e.g. changing their call allocation), and so on. Under model assumption 1, the probability of success for agent A_i in a given debtor group D_j is constant, so Y_{i,j} | X follows a binomial distribution B(x_{i,j}, p_{i,j}) and in particular its expectation is p_{i,j} x_{i,j}. Thus:

E[Y \mid X] = E\Big[\sum_{i,j} Y_{i,j} \,\Big|\, X\Big] = \sum_{i,j} E[Y_{i,j} \mid X] = \sum_{i,j} p_{i,j} x_{i,j}

Under model assumption 1, maximizing E[Y | X] with respect to X is therefore to maximize \sum_{i,j} p_{i,j} x_{i,j}.

• An optimal call allocation is a solution X^* = (x^*_{i,j}) to the following integer linear program ILP(m, n):

\arg\max_{x_{i,j}} \sum_{i,j} p_{i,j} x_{i,j} \quad \text{subject to AC and DC and } x_{i,j} \in \mathbb{N}

The integer linear program corresponding to a partitioned call center having as many agents as debtor groups (i.e. m = n), where also all m_i = 1 and all n_j = 1, is denoted by ILP(m). ILP(m) is used as a theoretical device to gain intuition. ILP(m, n) can easily be stated in canonical linear programming form by expressing the objective function as a scalar product and the constraints as Ax = n, where A is a fixed matrix, n is a fixed column vector, and x is a vectorized form of X.
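In this vectorized form, ILP(m, n) can be handed to an off-the-shelf MILP solver. A sketch using SciPy's HiGHS backend (assumes SciPy ≥ 1.9 for the `integrality` option; the function name and data layout are ours, not the thesis's):

```python
import numpy as np
from scipy.optimize import linprog  # requires scipy >= 1.9 for `integrality`

def optimal_call_allocation(p, m, n):
    """Solve ILP(m, n): maximize sum_ij p[i,j] * x[i,j] subject to
    row sums x[i,:] = m[i] (AC) and column sums x[:,j] = n[j] (DC)."""
    n_agents, n_groups = p.shape
    # Equality constraints on the row-major flattening of X.
    A_eq, b_eq = [], []
    for i in range(n_agents):            # AC: agent i makes m[i] calls
        row = np.zeros(n_agents * n_groups)
        row[i * n_groups:(i + 1) * n_groups] = 1
        A_eq.append(row); b_eq.append(m[i])
    for j in range(n_groups):            # DC: group j receives n[j] calls
        col = np.zeros(n_agents * n_groups)
        col[j::n_groups] = 1
        A_eq.append(col); b_eq.append(n[j])
    # linprog minimizes, so negate the objective; integrality=1 makes
    # every x_{i,j} an integer variable.
    res = linprog(-p.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), integrality=1, method="highs")
    return res.x.reshape(n_agents, n_groups)

# The 2x2 example of Table 1.1 with one call per agent and group:
p = np.array([[0.43, 0.46], [0.39, 0.44]])
X_star = optimal_call_allocation(p, m=[1, 1], n=[1, 1])
print(X_star)  # A1 -> D1 and A2 -> D2
```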


• The naive performance is:

B = \sum_{i,j} p_{i,j} \frac{m_i n_j}{N}

The naive performance is thus the theoretical aggregate performance obtained if all agents distribute their calls exactly according to the sizes of the debtor groups (ignoring that calls technically are integer objects). To see this, note that one term in the sum is:

p_{i,j} \cdot m_i \frac{n_j}{N} = (\text{probability of success of } A_i \text{ in } D_j) \times \underbrace{(\text{total number of calls of } A_i) \times (\text{share of debtors in } D_j)}_{\text{number of calls from } A_i \text{ to } D_j \text{ in the naive allocation}}
= \text{expected number of successes of } A_i \text{ in } D_j \text{ under the naive allocation}

• The gains from trade G are:

G = \frac{\sum_{i,j} p_{i,j} x^*_{i,j} - B}{N}

The gains from trade are thus the possible percentage point increase (in decimal form) in aggregate performance over the naive performance which can be obtained by reallocating calls optimally.

• The opportunity cost c_{i,j,k} of agent A_i calling debtor group D_j instead of another debtor group D_k is the reward lost by foregoing the alternative, that is:

c_{i,j,k} = p_{i,k} - p_{i,j}

In particular, c_{i,j,k} = -c_{i,k,j} holds.

• Without loss of generality, let A_1 and A_2 be two agents of a partitioned call center having opportunity costs c_{1,j,k} and c_{2,j,k}. Then A_1 has a comparative advantage over A_2 for D_j with respect to D_k if c_{1,j,k} ≤ c_{2,j,k}. The absolute difference |c_{1,j,k} − c_{2,j,k}| is the magnitude of the comparative advantage.

2.2 Binary Trade

When there are only two debtor groups, we denote c_{i,j,k} by c_{i,j}, which is unambiguous since there is only one debtor group other than D_j. We begin by proving the general version of the statement made in the introduction:

Proposition 1. If c_{2,1} ≠ c_{1,1}, then the unique solution to ILP(2) is obtained by letting each agent call the group for which the agent has a comparative advantage. The gains from trade are in this case:

G = \frac{|c_{2,1} - c_{1,1}|}{4}

In particular, the gains from trade increase as the magnitude of the comparative advantage increases.
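As a quick consistency check (our illustration, not part of the thesis), the rates of Table 1.1 give:

```latex
c_{1,1} = p_{1,2} - p_{1,1} = 0.46 - 0.43 = 0.03, \qquad
c_{2,1} = p_{2,2} - p_{2,1} = 0.44 - 0.39 = 0.05,
\qquad\Longrightarrow\qquad
G = \frac{|c_{2,1} - c_{1,1}|}{4} = \frac{0.02}{4} = 0.005.
```

Verifying directly: the specialized allocation succeeds with probability (0.43 + 0.44)/2 = 0.435 per call, the naive performance per call is B/N = (0.43 + 0.46 + 0.39 + 0.44)/4 = 0.43, and the difference is indeed 0.005.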


Proof. Assume without loss of generality that c_{2,1} > c_{1,1} (otherwise just relabel the debtor groups). We begin by simplifying the agent and debtor conditions for this case:

\begin{cases} x_{1,1} + x_{1,2} = 1 \\ x_{2,1} + x_{2,2} = 1 \\ x_{1,1} + x_{2,1} = 1 \\ x_{1,2} + x_{2,2} = 1 \\ x_{1,1}, x_{1,2}, x_{2,1}, x_{2,2} \in \mathbb{N} \end{cases}
\iff
\begin{cases} x_{1,2} = 1 - x_{1,1} \\ x_{2,1} = 1 - x_{1,1} \\ x_{2,2} = 1 - x_{1,2} = 1 - (1 - x_{1,1}) = x_{1,1} \\ x_{1,1} \in \mathbb{N} \text{ and } x_{1,1} \le 1 \end{cases}

The optimization problem ILP(2) therefore becomes:

\arg\max_{x_{i,j}} \; p_{1,1}x_{1,1} + p_{1,2}x_{1,2} + p_{2,1}x_{2,1} + p_{2,2}x_{2,2}
= \arg\max_{x_{1,1}} \; p_{1,1}x_{1,1} + p_{1,2}(1 - x_{1,1}) + p_{2,1}(1 - x_{1,1}) + p_{2,2}x_{1,1}
= \arg\max_{x_{1,1}} \; (p_{1,1} - p_{1,2} - p_{2,1} + p_{2,2})x_{1,1}
= \arg\max_{x_{1,1}} \; \big(p_{2,2} - p_{2,1} - (p_{1,2} - p_{1,1})\big)x_{1,1}
= \arg\max_{x_{1,1}} \; (c_{2,1} - c_{1,1})x_{1,1}

By assumption c_{2,1} − c_{1,1} is positive, and to maximize (c_{2,1} − c_{1,1})x_{1,1} we allocate the call of agent 1 to group 1. It follows from the constraints that agent 2 then has to allocate its call to group 2. In particular, the solution is unique. Note that if we had c_{2,1} = c_{1,1}, then any call allocation would be optimal. We calculate the gains from trade:

G = \frac{\sum_{i,j} p_{i,j} x^*_{i,j} - B}{N}
= \frac{p_{1,1} + p_{2,2} - \sum_{i,j} \frac{p_{i,j}}{2}}{2}
= \frac{2p_{1,1} + 2p_{2,2} - p_{1,1} - p_{1,2} - p_{2,1} - p_{2,2}}{4}
= \frac{p_{1,1} + p_{2,2} - p_{1,2} - p_{2,1}}{4}
= \frac{c_{2,1} - c_{1,1}}{4} ∎

We extend the situation above with one more agent:

Proposition 2. Consider ILP(3, 2) and assume:

m_1 \le n_1, \qquad m_2 \le n_2, \qquad c_{2,2} < c_{3,2} < c_{1,2}

Denote this problem by ILP*(3, 2). The optimal call allocation for ILP*(3, 2) is:

X^* = \begin{pmatrix} m_1 & 0 \\ 0 & m_2 \\ n_1 - m_1 & n_2 - m_2 \end{pmatrix}


Proof. It is clear that the agent and debtor conditions are satisfied by X^* (noting that n_1 − m_1 + n_2 − m_2 = N − m_1 − m_2 = m_3) and that all of its entries are non-negative integers; X^* thus defines a call allocation. To derive a contradiction, assume that some other call allocation X' ≠ X^* is optimal. In particular, at least one of the entries x'_{1,2} and x'_{2,1} is then non-zero (since both being zero gives the call allocation X^*); assume x'_{1,2} > 0. Then x'_{1,1} = m_1 − x'_{1,2} < n_1, thus at least one of the entries x'_{2,1} and x'_{3,1} is non-zero; assume x'_{2,1} > 0. Now let the agents A_1 and A_2 trade calls to define a new allocation:

X'' = \begin{pmatrix} x'_{1,1} + 1 & x'_{1,2} - 1 \\ x'_{2,1} - 1 & x'_{2,2} + 1 \\ x'_{3,1} & x'_{3,2} \end{pmatrix}

Then the difference between the values of the objective function for the call allocations X'' and X' is:

\sum_{i,j} p_{i,j}(x''_{i,j} - x'_{i,j}) = p_{1,1} - p_{1,2} - p_{2,1} + p_{2,2} = c_{1,2} + c_{2,1} = c_{1,2} - c_{2,2} > 0

which contradicts X' being optimal. The other scenarios not considered result in similar contradictions. ∎

The above proposition 2 motivates the following definition:

Definition 1. In a partitioned call center with three agents and two debtor groups satisfying

c_{2,2} < c_{3,2} < c_{1,2},

A_1 and A_2 are the heterogeneous agents (each having a particularly strong preference for one specific debtor group) and A_3 is the homogeneous agent (not having as pronounced a preference).

Intuitively, proposition 2 says that when the debtor groups are large compared to the number of calls the heterogeneous agents are allowed to make, only the heterogeneous agents will actively trade with each other.

Proposition 3. The gains from trade for a call center described by ILP*(3, 2) increase as the magnitude of the comparative advantage between the heterogeneous agents increases.

Proof. Calculating the gains from trade G and using n_1 + n_2 = N = m_1 + m_2 + m_3 gives:

G = \frac{\sum_{i,j} p_{i,j} x^*_{i,j} - B}{N}
= \frac{m_1 p_{1,1} + m_2 p_{2,2} + (n_1 - m_1) p_{3,1} + (n_2 - m_2) p_{3,2} - \sum_{i,j} p_{i,j} \frac{m_i n_j}{N}}{N}
= \frac{N m_1 p_{1,1} + N m_2 p_{2,2} + N(n_1 - m_1) p_{3,1} + N(n_2 - m_2) p_{3,2} - \sum_{i,j} p_{i,j} m_i n_j}{N^2}

The numerator is:

m_1 p_{1,1} \underbrace{(N - n_1)}_{n_2} + m_2 p_{2,2} \underbrace{(N - n_2)}_{n_1} - p_{1,2} m_1 n_2 - p_{2,1} m_2 n_1 + \underbrace{p_{3,1}\big(N(n_1 - m_1) - m_3 n_1\big) + p_{3,2}\big(N(n_2 - m_2) - m_3 n_2\big)}_{:= \gamma(p_{3,1}, p_{3,2}, m_1, m_2, m_3, n_1, n_2)}
= m_1 n_2 \underbrace{(p_{1,1} - p_{1,2})}_{c_{1,2}} + m_2 n_1 \underbrace{(p_{2,2} - p_{2,1})}_{c_{2,1} = -c_{2,2}} + \gamma(p_{3,1}, p_{3,2}, m_1, m_2, m_3, n_1, n_2)

\implies G = \frac{m_1 n_2 c_{1,2} - m_2 n_1 c_{2,2} + \gamma(p_{3,1}, p_{3,2}, m_1, m_2, m_3, n_1, n_2)}{N^2}

In particular:

\frac{\partial G}{\partial c_{1,2}} = \frac{m_1 n_2}{N^2} > 0, \qquad \frac{\partial G}{\partial c_{2,2}} = -\frac{m_2 n_1}{N^2} < 0

This completes the proof. ∎

2.3 Overperformance

Model assumption 1 is very restrictive: it assumes that the probability of success for agent A_i in debtor group D_j is the constant p_{i,j}. A less restrictive assumption is that for agent A_i and debtor group D_j there is a constant δ_{i,j} such that the agent perturbs the score p(d) of a debtor d ∈ D_j to p(d) + δ_{i,j}. We express this formally:

Model Assumption 2 (Overperformance Assumption). For each agent A_i and debtor group D_j there exists a constant −1 < δ_{i,j} < 1 such that for all d ∈ D_j:

P(d \text{ pays} \mid A_i \text{ calls}) = p(d) + \delta_{i,j}

For p(d) + δ_{i,j} to represent a probability, we require that 0 ≤ p(d) + δ_{i,j} ≤ 1 holds and therefore only consider such debtors d. Intuitively, p(d) is rarely very close to zero or one and the effect δ_{i,j} of the agent is likely very small, so this requirement is not a problem in practice.

Under model assumption 2 we have:

E[Y_{i,j} - E[Y_{i,j}] \mid X] = \sum_{\substack{d \in D_j \\ A_i \text{ calls}}} \big(p(d) + \delta_{i,j}\big) - \sum_{\substack{d \in D_j \\ A_i \text{ calls}}} p(d) = \delta_{i,j} x_{i,j}
\implies E[Y - E[Y] \mid X] = \sum_{i,j} \delta_{i,j} x_{i,j}

Under the overperformance assumption, maximizing E[Y − E[Y] | X] with respect to X is therefore to maximize \sum_{i,j} δ_{i,j} x_{i,j}. We refer to E[Y − E[Y] | X] as overperformance, since it measures the expected number of successful calls obtained by following the call allocation X compared to the expected number of successful calls according to score.


Introduce the following notation:

Q_1 = Y, \qquad Q_2 = Y - E[Y], \qquad q_{1,i,j} = p_{i,j}, \qquad q_{2,i,j} = \delta_{i,j}

Model assumption k then implies that:

q_{k,i,j} x_{i,j} = E[Q_{k,i,j}]

where Q_{k,i,j} is Q_k restricted to agent A_i and debtor group D_j. Furthermore, denote the optimization problem corresponding to model assumption k (maximizing \sum_{i,j} q_{k,i,j} x_{i,j}) by ILP(k, m, n). The following table describes the two model assumptions and their implications:

Model Assumption | Fixed Effect                                        | q_{k,i,j}                                          | Maximizes
1                | ∀d ∈ D_j: p_{i,j} = P(d pays | A_i calls)           | p_{i,j} = E[Y_{i,j} | X] / x_{i,j}                 | E[Y | X]
2                | ∀d ∈ D_j: δ_{i,j} = P(d pays | A_i calls) − p(d)    | δ_{i,j} = E[Y_{i,j} − E[Y_{i,j}] | X] / x_{i,j}    | E[Y − E[Y] | X]

Figure 2.1: Comparison of model assumptions.

Since E[Y] is a constant, ILP(k, m, n) maximizes the same objective function E[Y | X] both for k = 1 and k = 2. Still, the optimization problems ILP(1, m, n) and ILP(2, m, n) may yield different optimal solutions since they are based on different model assumptions.

We now provide an illustration of why one would like to use the overperformance assumption instead of model assumption 1:

Due to the operational details of the call center, situations occur where agents call debtors from different score distributions. The purpose of this section is to provide an illustration of why using the estimated overperformance δ̂_{i,j} rather than the success rate p̂_{i,j} to decide the call allocation is preferred in such cases.

Again consider the fictional call center from section 1.1. This time, in addition to calculating the historical success rate, assume we have also calculated the mean estimated score p̂(d). We estimate p̂(d) using a logistic regression procedure which will be described in section 5.2 (also see figure 2.5 for exactly what set we actually perform logistic regression on).

       D_1            D_2
A_1    (0.43, 0.41)   (0.46, 0.45)
A_2    (0.39, 0.37)   (0.44, 0.47)

Table 2.1: Estimated historical (success rate, mean score) of the fictional call center.

The calculated success rate is an estimate of p_{i,j}. The difference between the success rate and the corresponding mean score is an estimate of the overperformance constant δ_{i,j}. We will discuss these estimators formally in later sections.


       D_1     D_2
A_1    0.02    0.01
A_2    0.02   −0.03

Table 2.2: Estimated historical overperformance constants.

Overperformance measures performance by taking both success rate and mean score into account, and can therefore reveal that an agent whose performance seems low according to success rate is in fact performing well given the difficulty (mean score) of the called debtors. Using these overperformance constants to distribute calls, we obtain that A_1 should only call D_2 and A_2 should only call D_1, which is the opposite result compared to before (though of course in this case we have chosen the numbers to force this reversal).
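The reversal can be reproduced in a few lines; a sketch (the tuples are those of Table 2.1, and the allocation again maximizes the total over the two possible assignments):

```python
from itertools import permutations

# (success rate, mean estimated score) per agent and debtor group (Table 2.1)
stats = [[(0.43, 0.41), (0.46, 0.45)],   # A1 in D1, D2
         [(0.39, 0.37), (0.44, 0.47)]]   # A2 in D1, D2

# Estimated overperformance: success rate minus mean score (Table 2.2)
delta = [[rate - score for rate, score in row] for row in stats]

# Allocate by maximizing total overperformance instead of total success rate.
best = max(permutations(range(2)),
           key=lambda alloc: sum(delta[i][g] for i, g in enumerate(alloc)))
print(best)  # (1, 0): A1 calls D2 and A2 calls D1, reversing the earlier result
```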

2.4 Estimating Performance

In ordinary supervised learning problems, we can easily calculate, for example, the classification rate or mean squared error to measure performance. However, the question what would performance have been if the call allocation X̂* was used? does not have as clear an answer. In this and the following section we develop two estimation methods for answering this question.

Suppose we are given a dataset Data of outcomes from a call center and want to estimate the change in performance from using another call allocation than the one used to collect the data. First, define the training T ⊆ Data and validation V ⊆ Data sets of the data. To be able to estimate performance changes in-sample, we allow T = V, but typically T and V partition Data. Define all m_i and n_j in the agent and debtor conditions AC and DC of section 2.1 by measuring them on the validation set V of Data. This ensures that the optimal call allocation we decide respects the relative agent and debtor group sizes of the validation data (intuitively, we are not allowed to overwork certain agents or stop calling certain debtors).

We begin by introducing some notation: let T_{i,j} denote the set of calls in training T from agent A_i to debtor group D_j, and let V_{i,j} denote the corresponding set in validation V. For W ⊆ Data, let Q_k(W) denote Q_k restricted to W; e.g. for k = 1 this is the number of successful calls in W. The estimators Q̂_k(W) are defined as follows:

\hat{Q}_1(W) = Y(W)
\hat{Q}_2(W) = Y(W) - \sum_{d \in W} \hat{p}(d)

2.4.1 Scaling Method

While we cannot motivate the following estimation method rigorously, we believe it is a reasonable way to estimate performance:

1. Adopt one of the model assumptions k.

2. Since the adopted model assumption implies that q_{k,i,j} = E[Q_{k,i,j}]/x_{i,j}, estimate q_{k,i,j} by:

\hat{q}_{k,i,j} = \frac{\hat{Q}_k(T_{i,j})}{|T_{i,j}|}

In section 5.2, we prove properties of these estimators.

Definition 2. Replacing q_{k,i,j} in ILP(k, m, n) with its estimated value q̂_{k,i,j} defines the optimization problem ILPE(k, m, n). An estimated optimal call allocation is a call allocation X̂* solving ILPE(k, m, n).

3. Solve ILPE(k, m, n), giving X̂*.

4. Estimate the performance of X̂* in V by:

\hat{Q}_k(\hat{X}^*, V) = \sum_{i,j} \frac{\hat{Q}_k(V_{i,j})}{|V_{i,j}|} \hat{x}^*_{i,j}

Intuitively, we first adopt model assumption k, which means we assume there is some fixed agent effect for each agent A_i in each debtor group D_j. We then estimate these agent effects in training T to decide the call allocation. We then re-estimate the fixed agent effects in validation V, multiply these estimates with the number of corresponding calls, and finally sum them up to obtain an estimate of the out-of-sample performance of the decided call allocation. We call this method the scaling method since we are scaling the estimated validation performance by x̂*_{i,j}. We further define:

Definition 3. The actual overperformance of a call allocation X̂* is:

\frac{\hat{Q}_2(\hat{X}^*, V) - \hat{Q}_2(V)}{|V|}

Actual overperformance thus normalizes the scale to overperformance per call and translates the scale so that the actual overperformance of the call allocation the validation data was collected under is zero.
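Step 4 of the scaling method can be sketched as follows (a hypothetical helper, not code from the thesis; the data layout is an assumption, and every V_{i,j} is assumed non-empty):

```python
def scaled_performance(x_star, successes_val, scores_val=None):
    """Scaling-method estimate: sum_ij Q_k(V_ij) / |V_ij| * x*_ij.

    x_star[i][j]        -- calls allocated from agent (cluster) i to group j
    successes_val[i][j] -- list of 0/1 call outcomes observed in V_ij
    scores_val[i][j]    -- matching lists of estimated scores p_hat(d);
                           if given, k = 2 (overperformance), else k = 1.
    """
    total = 0.0
    for i, row in enumerate(x_star):
        for j, x in enumerate(row):
            outcomes = successes_val[i][j]
            q_hat = sum(outcomes)                    # Q1 = number of successes
            if scores_val is not None:
                q_hat -= sum(scores_val[i][j])       # Q2 = Y minus sum of scores
            total += q_hat / len(outcomes) * x       # scale by x*_ij
    return total
```

For instance, with `x_star = [[2, 0], [0, 2]]` the per-call success rates observed in V_{1,1} and V_{2,2} are simply scaled up to two calls each.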

Note that if we adopt the overperformance assumption and estimate performance, then:

\frac{\hat{Q}_2(\hat{X}^*, V) - \hat{Q}_2(V)}{|V|}
= \frac{1}{|V|} \sum_{i,j} \frac{Y(V_{i,j}) - \sum_{d \in V_{i,j}} \hat{p}(d)}{|V_{i,j}|} \hat{x}^*_{i,j} - \frac{\hat{Q}_2(V)}{|V|}
= \frac{1}{|V|} \sum_{i,j} \frac{Y(V_{i,j})}{|V_{i,j}|} \hat{x}^*_{i,j} - \frac{1}{|V|} \sum_{i,j} \frac{\sum_{d \in V_{i,j}} \hat{p}(d)}{|V_{i,j}|} \hat{x}^*_{i,j} - \frac{\hat{Q}_2(V)}{|V|}

Though technically the overperformance assumption does not allow us to motivate scaling the terms individually, the above expression can at least informally be thought of as a scaled success rate term, minus a scaled mean score term, minus a baseline, where the baseline in this case is the overperformance per call in V of the call allocation the data was collected under. Even though it is a questionable procedure, when applying the scaling method with the overperformance assumption we still perform this decomposition to see where our actual overperformance comes from (an increase in expected success rate or a decrease in mean score).

2.4.2 Simulation Method

We provide an alternative to the scaling method for estimating the performance of a call allocation X̂* which does not require scaling performance in the same way. For all agents and debtor groups, add the constraints x̂_{i,j} ≤ |V_{i,j}| to ILPE(k, m, n). Recall that n denotes the number of debtor groups, and modify the agent and debtor conditions AC and DC to:

\sum_j \hat{x}_{i,j} = \lfloor m_i / n \rfloor \quad (AC')
\sum_i \hat{x}_{i,j} = \lfloor n_j / n \rfloor \quad (DC')

The floor functions might cause AC' and DC' to not align properly (\sum_{i,j} \hat{x}_{i,j} \neq \sum_{j,i} \hat{x}_{i,j}), and in practice we therefore generally have to decrease some terms in the larger sum. We ignore this issue as it is easy to correct.

Having modified ILPE(k, m, n) defines the new optimization problem ILPE'(k, m, n):

\arg\max_{\hat{x}_{i,j}} \sum_{i,j} \hat{q}_{k,i,j} \hat{x}_{i,j} \quad \text{subject to AC' and DC' and } \hat{x}_{i,j} \in \mathbb{N} \text{ and } \hat{x}_{i,j} \le |V_{i,j}|

Now perform the following procedure:

1. Adopt one of the model assumptions k and solve ILPE'(k, m, n). The solution instructs agent A_i to call group D_j a number x̂*_{i,j} of times out of the |V_{i,j}| available.

2. For each agent A_i and debtor group D_j, uniformly at random choose x̂*_{i,j} of the |V_{i,j}| available observations, and let W denote the set of all chosen observations.

3. Define the performance Q̂(X̂*, V) on W as Q restricted to W. Since performance is not obtained by scaling using x̂*_{i,j}, the adopted model assumption is irrelevant for measuring performance in this case.

The results of this method are stochastic even given a labeled set of validation data V ⊆ Data; we therefore call this the simulation method. For a fixed V, we can thus repeat steps 2 and 3 above several times to obtain a distribution of the performance of the optimal call allocation X̂*. In contrast, the scaling method only provides a point estimate of performance.

The performance of the optimal call allocation can be compared with the performance of the whole validation set V (i.e. with the call allocation used to collect the data). However, for the simulation method we can also construct an alternative approach by defining the following uniformly random allocation on V (we will also refer to it simply as the random allocation): for each agent A_i and debtor group D_j, choose the share 1/n of the available data in V for this agent-debtor group combination (using floor functions if needed). This ensures that the agent and debtor conditions are approximately satisfied. As before, make the choice of precisely which observations to use from each agent-debtor group combination uniformly at random. Calculate the performance of the total chosen set. Repeat several times to obtain a distribution of the performance of this allocation.
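Steps 2 and 3, repeated over many rounds, can be sketched as follows (a hypothetical helper; for simplicity only k = 1, the number of successes, is computed):

```python
import random

def simulate_performance(x_star, outcomes_val, rounds=1000, seed=0):
    """Simulation-method sketch: in each round, draw x*_ij of the |V_ij|
    observed outcomes uniformly at random and sum Q over the chosen set W.
    `outcomes_val[(i, j)]` is the list of 0/1 outcomes in V_ij."""
    rng = random.Random(seed)
    draws = []
    for _ in range(rounds):
        w = []
        for (i, j), outcomes in outcomes_val.items():
            w += rng.sample(outcomes, x_star[(i, j)])  # without replacement
        draws.append(sum(w))  # Q1 restricted to W: number of successes
    return draws  # empirical distribution of performance
```

A histogram of the returned draws is then the estimated performance distribution of X̂*, and the same loop with the 1/n shares in place of `x_star` gives the distribution of the uniformly random allocation.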

2.4.3 Comparison of Estimation Methods

The following table describes the general differences between the two estimation methods. It is also worth noting that running many rounds of the simulation method to obtain distributions of performance is of course more computationally costly than running the scaling method to obtain a point estimate.

Method       Scales performance   Stochastic   Compare performance of X̂* with
Scaling      Yes                  No           Performance on all of V
Simulation   No                   Yes          Uniformly random allocation

Table 2.3: Comparison of estimation methods.

The following definition applies to both estimation methods:

Definition 4. The empirical gains from trade of a call allocation X̂* are:

G_k(\hat{X}^*, V) = \hat{Q}_k(\hat{X}^*, V) - \hat{Q}_k(V)

The empirical gains from trade are thus the estimated increase in performance using the estimated optimal call allocation instead of the original call allocation used to collect V.


2.5 Increasing Stability and Maximizing Performance

2.5.1 Agent Clustering

The following is our intuitive motivation for the need of clustering agents:

Consider a call center with a binary debtor partition and m > 3 agents. The uncertainty in the estimation of q̂_{k,i,j} is large, as the table below concerning success rate illustrates:

n      confidence interval
50     [0.36, 0.64]
100    [0.40, 0.60]
500    [0.46, 0.54]
1000   [0.47, 0.53]
2000   [0.48, 0.52]

Table 2.4: For each n: confidence interval for the binomial p in B(n, p) using the normal approximation when the obtained estimate (sample success rate) is p̂ = 0.50.

Moreover, estimation errors for a single agent propagate due to the interlinked nature of deciding the call allocation. These two considerations (uncertainty in performance estimation and propagation of errors) together imply that the variance of the estimated optimal call allocation X̂* is high as the training set changes.

To decrease this variance, we construct a way to cluster agents into two agent clusters intended to act as heterogeneous agents and a third agent cluster intended to act as a homogeneous agent. To perform the clustering, divide the training set T into four subsets T_1, T_2, T_3, T_4 by time:

Figure 2.2: Partition of T into T_1, T_2, T_3, T_4 along the time axis t.

Consider the three sets:

U_1 = T_1 \cup T_2, \qquad U_2 = T_2 \cup T_3, \qquad U_3 = T_3 \cup T_4

For each U_k, define the following sets of agents based on each agent's estimated value of c_1:

A^L_k = \text{the bottom } \lfloor m/2 \rfloor \text{ agents in } U_k \text{ by } c_1
A^H_k = \text{the agents not in } A^L_k

Figure 2.3: Definition of A^L_k and A^H_k; each point represents an agent placed along the c_1 axis.

Now define the agent clusters:

A^L = \bigcap_k A^L_k, \qquad A^H = \bigcap_k A^H_k, \qquad A^M = \text{the agents not in } A^L \cup A^H

Using these three clusters to allocate calls, we hope to have decreased the variance in the call allocation by increasing the certainty in the performance estimation (since each cluster uses observations from many agents) and decreased the risk of complex error propagation by decreasing the number of clusters from m to 3.
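The clustering rule can be sketched as follows (a hypothetical helper; `c1_by_window` stands for whatever per-agent c_1 estimates are computed on U_1, U_2, U_3):

```python
from math import floor

def cluster_agents(c1_by_window):
    """Agent-clustering sketch. `c1_by_window[k][agent]` is the estimated
    c1 value of each agent on window U_k (k = 0, 1, 2). Returns the
    heterogeneous clusters A_L, A_H and the homogeneous remainder A_M."""
    agents = set(c1_by_window[0])
    m = len(agents)
    low_sets, high_sets = [], []
    for window in c1_by_window:
        ranked = sorted(window, key=window.get)   # ascending by c1
        low = set(ranked[:floor(m / 2)])          # bottom floor(m/2) agents
        low_sets.append(low)
        high_sets.append(agents - low)
    a_low = set.intersection(*low_sets)    # consistently low across all U_k
    a_high = set.intersection(*high_sets)  # consistently high across all U_k
    a_mid = agents - a_low - a_high        # agents that switch sides
    return a_low, a_high, a_mid
```

Agents that land in the bottom half in some windows and the top half in others end up in the middle cluster A_M, which is what makes A_L and A_H behave like the heterogeneous agents of proposition 2.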

2.5.2 Optimal Debtor Partition

Let T = V ⊆ Data. Assume further that there are different features describing the debtors. Given a feature f, we define a binary debtor partition D based on a subset S ⊆ R(f) of the possible values R(f) the feature f attains: assign debtors d with f(d) ∈ S to one debtor group and all other debtors to the second debtor group. Given this debtor partition D, proceed as follows:

1. Adopt one of the model assumptions k.

2. Cluster the agents according to the previous subsection 2.5.1.

3. Calculate G_k(X̂*, V) for the given debtor partition and agent clusters (treating each cluster as an individual agent).

Our goal is to find a debtor partition D with as high G_k(X̂*, V) as possible. Note however that the in-sample empirical gains from trade do not inform us about the stability of the results. We consider our features to be either numerical or nominal (though technically incorrect, we treat ordinal variables as numerical) and calculate G_k(X̂*, V) for the following debtor partitions:

i. For each numerical debtor feature f, let the 0.50-sample quantile of f define a binary debtor partition (ignoring missing values). If f has missing values, also consider the debtor partition defined by whether a debtor's value of f is missing or not.


ii. For each nominal debtor feature f, find all possible partitions of size two of R(f) (treating missing values like any other category).
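For a nominal feature, fixing one value and choosing which of the remaining values join its group enumerates all partitions of size two, giving 2^{|R(f)|−1} − 1 of them. A minimal sketch:

```python
from itertools import combinations

def binary_partitions(values):
    """All partitions of a finite value set R(f) into two nonempty groups."""
    values = sorted(values)
    anchor, rest = values[0], values[1:]
    parts = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            S = {anchor, *combo}
            complement = set(values) - S
            if complement:               # both groups must be nonempty
                parts.append((S, complement))
    return parts

# A nominal feature with 4 categories yields 2**3 - 1 = 7 binary partitions.
print(len(binary_partitions(['A', 'B', 'C', 'D'])))  # → 7
```

Anchoring one fixed value avoids counting each partition twice (the two groups are unordered).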

2.6 Validation Methods

2.6.1 Monte Carlo Cross-validation

Consider a fixed training set T , adopt a model assumption k, and fix a number 0 < β < 1. One round of Monte Carlo cross-validation (MCCV) is performed as follows:

1. Uniformly at random choose ⌊β|T|⌋ observations from T to be our effective training set T_E. Let the remaining observations in T be our effective validation set V_E.

2. Use TE to cluster agents.

3. Obtain the m_j and n_j for the agent clusters from V_E.

4. Solve ILP_E(k, 3, 2) with the agent clusters.

5. Calculate Q̂_k(X̂*, V).

Running many rounds of MCCV together with the scaling method (which for a given effective validation set provides a point estimate of performance), we obtain an estimated distribution of out-of-sample performance Q̂_k(X̂*, V). Visualizing the distribution eases communication of our results. In [MSP05], Molinaro et al. compare MCCV with other cross-validation methods such as v-fold cross-validation, leave-one-out cross-validation, bootstrap procedures, and variations thereof. They conclude their article with: "As the sample size grows, the differences among the resampling methods decrease." In the article, sample sizes below one thousand are considered. The dataset we will consider has hundreds of thousands of observations and thus qualifies as being large in this context. The writers also state that MCCV does not decrease the MSE or bias enough to warrant its use over v-fold CV, i.e. MCCV performs better than v-fold cross-validation by a small margin but is vastly more computationally expensive. To us, the visualization possibilities of obtaining an estimate of the performance distribution outweigh the computational costs.
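The random split in step 1 of an MCCV round can be sketched as below; steps 2-5 (agent clustering, ILP_E, and the scaling method) are thesis-specific and therefore omitted:

```python
import random

def mccv_round(T, beta, rng):
    """One Monte Carlo cross-validation split of the training set T:
    floor(beta * |T|) observations drawn uniformly at random form the
    effective training set; the rest form the effective validation set."""
    n_train = int(beta * len(T))            # floor(beta * |T|) for beta in (0, 1)
    shuffled = rng.sample(T, len(T))        # uniform random order, no replacement
    T_E = shuffled[:n_train]                # effective training set
    V_E = shuffled[n_train:]                # effective validation set
    return T_E, V_E

rng = random.Random(0)                      # seeded for reproducibility
T = list(range(1000))
T_E, V_E = mccv_round(T, 0.8, rng)
```

Repeating the call with the same rng yields fresh, independent-looking splits whose training and validation sets nevertheless overlap across rounds, which is why the resulting performance estimates are not independent.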

Chapter 4 in [KJ16] also investigates differences between cross-validation methods and likewise concludes that the differences in performance are small for large sample sizes. Finally, results concerning the bias and variance of MCCV are derived in [Bur89].

2.6.2 Rolling Validation

We also present a second validation procedure based on a rolling window approach: Order the training set T by time and slide a rolling window over T . For a given location of the window, use all of the data coming before the window as effective training and let the data contained in the window be the effective validation set. The idea is illustrated below:


Figure 2.4: Illustration of rolling validation: a window of effective validation data slides forward in time over T, with all data before the window used as effective training.
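The rolling splits can be sketched as follows, assuming (since the text does not specify the stride) that the window advances by its own length:

```python
def rolling_splits(T_sorted, window):
    """Yield (effective_training, effective_validation) pairs by sliding a
    window over the time-ordered training set; all data before the window
    is used as effective training. Non-overlapping steps assumed."""
    for start in range(window, len(T_sorted) - window + 1, window):
        yield T_sorted[:start], T_sorted[start:start + window]

splits = list(rolling_splits(list(range(10)), window=2))
```

Unlike MCCV, every split is time-consistent: no observation in the effective training set comes after an observation in the effective validation set.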

2.7 Workflow


1. Fix a point in time t* and divide the full dataset into training T and test Test by whether observations are before or after t*.

2. Adopt a model assumption k. We recommend the overperformance assumption. Also fit a scorecard on T and investigate its performance on Test.

3. Find a set D of debtor partitions by performing the optimal debtor partition search on T.

4. Run MCCV on T with the scaling method for the debtor partitions D, clustering agents in each round using the effective training set (this applies throughout the workflow). Possibly remove debtor partitions from D to define the new smaller set of debtor partitions D′.

5. For the remaining debtor partitions D′, run rolling validation not including Test with the scaling method.

6. Choose the best debtor partition D* from D′ based on an overall assessment of the previous results. Use the simulation method with MCCV to further verify the behavior of the chosen debtor partition.

7. Test the performance of D* on Test using both the scaling and simulation methods. Apply statistical tests (described in section 5.3) to examine the results of the simulation method, and finally use estimators associated with the tests to construct confidence intervals for the performance gains compared to using a random allocation.

Figure 2.5: The trade algorithm.

We are aware that some overfitting may occur due to searching for debtor partitions on the whole of T. However, our view is that the debtor partition search gives us debtor partitions with high potential gains from trade, and the later validation steps inform us about the stability of these gains. Given a lot of data, it is probably preferable to perform the debtor partition search on a completely different dataset than the rest of the algorithm.


2.8 General Trade

The trade algorithm can be generalized to debtor partitions having l ∈ N debtor groups. The general procedure is:

1. Extend proposition 1 to provide the solution of ILP(l).

2. Show similarly as in proposition 2 that a "homogeneous" cluster of agents not having any special comparative advantages can be added to ILP(l) without affecting the active trading procedure. This defines the new optimization problem ILP*(l + 1, l).

3. As in proposition 3, derive the gains from trade G of ILP*(l + 1, l) and identify the relation between the costs of the agents that drives the gains.

4. Use the relation from above to construct a suitable clustering procedure yielding c(l) clusters.

5. Define debtor partitions using the 1/c(l)-sample quantiles for numerical features and the partitions of size c(l) for nominal features. Search among these for debtor partitions with high empirical gains from trade G_k(X̂*, V) in training T.

Step 1 is difficult because ILP(l) is not easy to solve when l > 2. Problems like ILP(l) are called assignment problems, see [BDM09] for a theoretical treatment of these.


Chapter 3

Intrum Application

We apply the trade algorithm to a real-world dataset; each section of this chapter follows the corresponding step in the workflow figure 2.5. The figures of this chapter are placed in the separate last chapter Figures. By baseline, we refer to the performance of the current validation set.

3.1 Dataset

The dataset we use is from one of the call centers of the credit management services company Intrum. The call center is located in a European country and the data covers part of the years 2017-2018. The dataset consists of 229 728 observations of the result of the first outgoing call to consumers holding debt which Intrum is trying to collect; each debtor thus only occurs once in the dataset. The debtors are characterized by numerical and nominal features, while the only available agent information is an identifier for which agent made the call. The target variable is a binary variable representing whether a payment has been received from the debtor during a fixed time period after the call was made.

Figure 5.1 describes the distribution of the number of calls per agent. We split the dataset into a training set T and a test set Test by the 0.80-sample time quantile t* = t̂_0.80 of the dataset.

3.2 Scorecard

We adopt the overperformance model assumption. We fit two scorecards on T: one for known debtors (having had debt at Intrum before) and one for unknown debtors (not previously known to Intrum). We construct these two scorecards because there are more features available for known debtors, allowing us to build a more precise scorecard for these debtors. However, when defining debtor partitions, we only use features available for both known and unknown debtors. We test the performance of the scorecards on Test; the results are shown in figure 5.2.


Finally, while we will search for debtor partitions using variables not necessarily occurring at all in any of the scorecards, we note that we prefer debtor partitions based on features occurring in both scorecards. Such partitions decrease the risk of exploiting scorecard deficiencies to seemingly improve performance.

3.3 Optimal Debtor Partition

Figure 5.3 shows the results of the search for optimal debtor partitions, using both success rate and overperformance (even though we technically only adopted the latter model assumption). We use the debtor partitions found in both searches. For commercial reasons, we index the found debtor partitions by numbers so as not to reveal exactly which variables are used at Intrum, and label the set of debtor partitions D. The indexing is not unique (e.g. 03 occurs several times but refers to different debtor partitions), but the debtor partitions that will be of special interest do not suffer from this ambiguity.

3.4 Results for D

Figure 5.4 shows the results for the scaling method over 50 rounds of MCCV for the debtor partitions D. The results are presented numerically in figure 5.5. We have also added the numbers for the gains from trade in training from the debtor partition search (figure 5.3) to the table and labeled them trainAOP. For all debtor partitions in D, actual overperformance is lower in validation than during the search process.

Out of the investigated D, debtor partition 31 has the highest actual overperformance during the scaling method MCCV (around 0.0074) and also had high actual overperformance from the search. We remove the debtor partitions with very low success rate and mean score from consideration, thus defining the restricted set of debtor partitions D′. Figure 5.6 is a zoomed-in version of figure 5.4, focusing on D′.

3.5 Results for D′

The results from rolling validation are presented in figure 5.7. Debtor partition 31 performs best also in this case, having high and stable actual overperformance around 0.009. Moreover, the feature f which debtor partition 31 is based on has no missing values in our dataset and is thus easy to work with. The variable f is also used in both scorecards (though in another binned form than in debtor partition 31) and is thus (at least partly) compensated for correctly by the scorecard. Since debtor partition 31 also performed best in the previous scaling method estimation, we choose D* = 31.

3.6 Further Results for D* = 31

We run three new rounds of MCCV using the simulation method to investigate debtor partition 31 further; the results are shown in figure 5.8. Similarly as for the previous estimation methods (the ordinary scaling method and the rolling validation scaling method), the trade algorithm for debtor partition 31 generally chooses somewhat higher scored cases than the random allocation but manages to compensate with a clearly higher success rate. Actual overperformance with the simulation method remains high but slightly lower than for the previous estimation methods.

We investigate the second MCCV round (corresponding to the middle column in figure 5.8) even further because it has the highest mean score distribution of the three rounds:

The feature f debtor partition 31 is based on is ordinal (integer-valued; we treat it as numerical) and the trade algorithm thus considers two debtor groups having either low or high values of f (defined by the 0.5-sample quantile of f). To ensure that the trade algorithm has not taken advantage of our binning and chosen observations giving a skewed distribution of f, we investigate its distribution in figure 5.9 and compare with the distribution of f on the whole validation set of the current MCCV round. Visually we consider them to be very similar. The Kolmogorov-Smirnov tests (to be described in section 5.3) whose results are presented in figure 5.10 do not strongly indicate that the distributions are unequal, but the p-values are still low. However, R also provides warnings for the p-values being approximate in the presence of ties (technically the Kolmogorov-Smirnov test should only be applied to continuous distributions, not to discrete ones).

We also investigate the performance stability of the individual clusters for this MCCV round. Specifically, we calculate how the difference c_{i,2} between overperformance in the two debtor groups develops over time for the different agent clusters by sliding rolling windows over the effective training set of the round and the observations chosen in effective validation. Figure 5.11 shows that the clusters behave as intended (we interpret the trend to be due to deviations in the scorecard over time).

Our overall assessment is that debtor partition 31 behaves well (with the distribution of f for the trade algorithm being a slight caveat) and we therefore apply it to the test set Test.

3.7 Performance on Test

Figure 5.12 shows the results on test for debtor partition 31 using the scaling method. The mean scores of the baseline and the trade algorithm are very close, and the actual overperformance of around 0.004 is therefore practically solely due to an increase of the same size in the success rate. Figure 5.13 shows the costs c_{i,2} of the three agent clusters for the scaling method in training and test. While the sizes of c_{i,2} change between training and test, the relation c_{1,2} < c_{3,2} < c_{2,2} is preserved. Figure 5.14 also shows the shares of the calls made by each cluster in training and test.

The results of the simulation method with 4 000 rounds, figure 5.15, show similar performance as from the scaling method. Three statistical tests investigating whether two sample distributions differ in a statistically significant sense are presented in section 5.3. The three tests are applied to the actual overperformance distributions of figure 5.15 and their results are presented in figures 5.16, 5.17, and 5.18. All tests reject the relevant null hypothesis with very low p-values. The Fisher sign test and the Wilcoxon-Mann-Whitney rank sum test give similar 95% confidence intervals for the translation in distribution from the random allocation to the call allocation decided by the trade algorithm, the intervals being approximately [0.004, ∞). We informally note that, just as for the scaling method, it is clear from figure 5.15 that most of the actual overperformance of the simulation method comes from increasing the success rate rather than decreasing the mean score.


Chapter 4

Conclusion

In this chapter we discuss some of the problems of the trade algorithm.

4.1 Joint Hypothesis Problem

We measure the performance of a call allocation in overperformance, using a scorecard to control for the effect of some debtors having a higher propensity to pay than others. Overperformance is thus always relative to a scorecard. It is therefore impossible to distinguish between real increases in performance due to improved agent-debtor matching and having abused weaknesses in the scorecard to seemingly gain performance. This problem is similar to the joint hypothesis problem in finance described in [Fam91]: market efficiency (i.e. that prices are correct in the sense that they reflect all available information) and asset pricing models (models describing what asset prices should be) cannot be tested separately, but are always tested in conjunction with each other. The easiest test of the trade algorithm would be to run it simultaneously with a random allocation, let both strategies call similar debtors, and compare the results.

4.2 Dynamic Effects

Our two different model assumptions (recall the table of figure 2.3) both assume that agent performance is constant and scalable. Clearly agents learn, and arguably their performance also depends on the call allocation itself: perhaps too much specialization (only calling one debtor group) is tiring and decreases performance. Modeling these effects is, however, beyond the scope of this thesis.

4.3 Obstacles for Implementation

The algorithm we have presented in this thesis tries to increase performance by emulating what would have happened if calls were reallocated. Our attempt is, however, rather an analysis of data than an actual system for allocating calls: in practice, calls have to be scheduled, client (companies buying Intrum's debt collection services) requests need to be taken into account, there are language barriers to consider, and so on. To implement the trade algorithm in practice, it needs to be integrated with an actual dialer system.


Chapter 5

Theoretical Background

5.1 Logistic Regression

The true scores p(d) are unknown and we thus use their logistic regression estimates p̂(d). For the sake of completeness we will give a short intuitive introduction to logistic regression following the presentation in [MPV12]:

Let d = (d_1, ..., d_n)^T be a vector of features describing debtor d. The features can be continuous, ordinal, or nominal (encoded as binary variables). We want to model p(d) as a function of d. It is easier to model a target variable having range (−∞, ∞) than 0 < p(d) < 1, and we therefore transform p(d) into −∞ < η(d) < ∞:

η(d) = ln( p(d) / (1 − p(d)) )

We assume η(d) can be described by a linear model:

η(d) = d^T β

Transforming η(d) back to p(d) gives:

p(d) = 1 / (1 + e^{−d^T β})

However, we are not observing outcomes of p(d) but of d paying or not, i.e. Y_d. Since p(d) = E[Y_d], we have:

E[Y_d] = 1 / (1 + e^{−d^T β})

We can thus model Y_d as:

Y_d = 1 / (1 + e^{−d^T β}) + ε_d

where ε_d = Y_d − E[Y_d] is the (Bernoulli) error term.

This completes the specification of the model. We estimate β through maximum likelihood. To ease readability we denote p(d) by p_d, and since Y_d is Bernoulli we thus have:

L(β | y_d) = P(Y_d = y_d) = p_d^{y_d} (1 − p_d)^{1 − y_d}

Letting y be the vector of the mutually independent observations y_d, we get:

L(β | y) = ∏_d p_d^{y_d} (1 − p_d)^{1 − y_d}

Using the relations between our variables, we obtain:

ln L(β | y) = Σ_d y_d d^T β − Σ_d ln[1 + e^{d^T β}]

We now introduce some further notation: several debtors may have the same debtor features d. Let t_d be the number of occurrences of debtor features d, and let z_d be the number of times these debtor features yielded the observed value y_d = 1. We then have:

ln L(β | y) = Σ_d z_d ln(p_d) + Σ_d (t_d − z_d) ln(1 − p_d)

We will not describe the details of how the β̂ maximizing ln L(β | y) is found but refer to appendix C.14.1 in [MPV12] for a complete description of the procedure. In short, we use the Newton-Raphson method applied to ln L(β | y) to find the β̂ which solves ∂ln L/∂β (β̂) = 0. We now describe the Newton-Raphson method (the following presentation is inspired by Ekaterina Kruglov's lecture slides from the KTH course SF2930):

Consider a twice continuously differentiable strictly concave function f : (a, b) → R for a < b ∈ R. We want to find the x* which maximizes f (since f is strictly concave, a unique maximum exists). First, guess a point x_0 and consider the Taylor expansion of f in a neighbourhood of x_0:

f(x) ≈ f(x_0) + (x − x_0) f′(x_0) + (1/2)(x − x_0)² f″(x_0) =: g(x)

The function f is maximized at x if and only if f′(x) = 0. Since f(x) ≈ g(x), we define our next guess x_1 of the optimum as the solution to g′(x_1) = 0:

g′(x_1) = f′(x_0) + (x_1 − x_0) f″(x_0) = 0  ⟺  x_1 = x_0 − f′(x_0)/f″(x_0)

Successively repeating the last update step, we obtain a sequence (x_0, x_1, x_2, ...) which can be shown to converge to the point x* which maximizes f. In practice, the Newton-Raphson method is terminated when |x_{n+1} − x_n| < ε for some sufficiently small ε > 0.


Finally, the Newton-Raphson method can be generalized to multivariable functions: let A be an open subset of R^k and let f : A → R be a twice differentiable strictly concave function. Then the update step of the Newton-Raphson algorithm is:

x_{n+1} = x_n − H^{−1}(x_n) ∇f(x_n)

where ∇f(x_n) = (∂f/∂x_1, ..., ∂f/∂x_k)^T is the gradient and H(x_n) is the k × k Hessian matrix with entries H_{ij} = ∂²f/∂x_i ∂x_j, both evaluated at x_n.
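As an illustration of the scalar update rule, the sketch below applies Newton-Raphson to the logistic log-likelihood of section 5.1 for a hypothetical one-parameter model p(d) = 1/(1 + e^{−βd}); this is a sketch under that simplifying assumption, not the thesis scorecard implementation:

```python
from math import exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def fit_logistic_1d(d, y, eps=1e-10, max_iter=100):
    """Newton-Raphson for the hypothetical one-parameter model
    p(d) = sigmoid(beta * d), maximizing
    ln L(beta | y) = sum_d y_d * d * beta - sum_d ln(1 + e^{beta * d})."""
    beta = 0.0
    for _ in range(max_iter):
        p = [sigmoid(beta * di) for di in d]
        grad = sum(di * (yi - pi) for di, yi, pi in zip(d, y, p))    # d lnL / d beta
        hess = -sum(di * di * pi * (1 - pi) for di, pi in zip(d, p))  # d2 lnL / d beta2 < 0
        step = grad / hess
        beta -= step            # x_{n+1} = x_n - f'(x_n) / f''(x_n)
        if abs(step) < eps:     # terminate when |x_{n+1} - x_n| < eps
            break
    return beta

# Non-separable toy data: the maximum likelihood estimate is finite and
# satisfies the first-order condition grad(beta_hat) = 0.
beta_hat = fit_logistic_1d([-2, -1, 0, 1, 2], [0, 1, 0, 1, 1])
```

Because the log-likelihood is concave, the iteration converges to the unique maximizer whenever the data are not perfectly separable (with separable data the MLE diverges).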

5.2 Properties of Estimators

Estimating p_{i,j}

Estimating p_{i,j} amounts to estimating a binomial proportion, which is a standard endeavour. Recall that we estimate p_{i,j} by:

p̂_{i,j} = Y(T_{i,j}) / |T_{i,j}|

Using the likelihood function

L(p_{i,j} | y_{i,j}) = P(Y_{i,j} = y_{i,j}) = (|T_{i,j}| choose y_{i,j}) · p_{i,j}^{y_{i,j}} (1 − p_{i,j})^{|T_{i,j}| − y_{i,j}}

to solve ∂L/∂p_{i,j} = 0 shows that p̂_{i,j} is the maximum likelihood estimator. Moreover, it is easily shown that p̂_{i,j} is unbiased and that its variance tends to zero as the sample size |T_{i,j}| increases.

Estimating δ_{i,j}

Recall that we estimate δ_{i,j} by:

δ̂_{i,j} = ( Y(T_{i,j}) − Σ_{d ∈ T_{i,j}} p̂(d) ) / |T_{i,j}|

We first show that estimating δ_{i,j} by this δ̂_{i,j} would be reasonable if the true scores p(d) were known:

Proposition 4. The following estimator of δ_{i,j} is unbiased and its variance goes to zero as |T_{i,j}| → ∞:

( Y_{i,j} − Σ_{d ∈ T_{i,j}} p(d) ) / |T_{i,j}|

Proof.

E[ ( Y_{i,j} − Σ_{d ∈ T_{i,j}} p(d) ) / |T_{i,j}| ] = ( Σ_{d ∈ T_{i,j}} (p(d) + δ_{i,j}) − Σ_{d ∈ T_{i,j}} p(d) ) / |T_{i,j}| = δ_{i,j}

The estimator is thus unbiased. Moreover, by mutual independence of the Y_d variables and since Var(Y_d) = (p(d) + δ_{i,j})(1 − p(d) − δ_{i,j}) for d ∈ T_{i,j}:

Var( ( Y_{i,j} − Σ_{d ∈ T_{i,j}} p(d) ) / |T_{i,j}| ) = Σ_{d ∈ T_{i,j}} Var(Y_d) / |T_{i,j}|²
= Σ_{d ∈ T_{i,j}} (p(d) + δ_{i,j})(1 − p(d) − δ_{i,j}) / |T_{i,j}|²
≤ |T_{i,j}| / |T_{i,j}|² = 1 / |T_{i,j}| → 0

where we also have used 0 ≤ p(d) + δ_{i,j} ≤ 1. □
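A small Monte Carlo check of proposition 4, with made-up scores and δ_{i,j} (all numbers below are hypothetical): sampling Y_d ~ Bernoulli(p(d) + δ_{i,j}) and averaging the estimator over repeated samples should recover δ_{i,j}.

```python
import random

def estimate_delta(scores, delta, rng):
    """One realization of the estimator (Y_ij - sum_d p(d)) / |T_ij|,
    with Y_d ~ Bernoulli(p(d) + delta) drawn independently."""
    successes = sum(rng.random() < p + delta for p in scores)  # Y_ij
    return (successes - sum(scores)) / len(scores)

rng = random.Random(1)                                  # seeded for reproducibility
scores = [rng.uniform(0.1, 0.6) for _ in range(5000)]   # hypothetical true scores p(d)
delta = 0.05
estimates = [estimate_delta(scores, delta, rng) for _ in range(200)]
mean_est = sum(estimates) / len(estimates)              # should be close to delta
```

The spread of the individual estimates also shrinks as the number of scores grows, in line with the 1/|T_{i,j}| variance bound.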

Where we also have used 0 ≤ p(d)+ δi, j≤ 1.  Logistic regression is biased for finite samples, see [Lan+03]. However as we will see, logistic regression is unbiased as the sample size grows to infinity. We will use this asymptotic result to show that also our estimator ˆδi, j is asymptotically unbiased. We first formalize these concepts slightly:

Definition 5. Let Θ be the set of possible values of a parameter θ. Let θ̂ be an estimator of θ, consider a countably infinite sample, and let θ̂_n be the θ̂-estimator of θ using the first n observations of the sample. The estimator θ̂ is asymptotically unbiased if lim_{n→∞} E[θ̂_n | θ = θ_0] = θ_0 for every θ_0 ∈ Θ.

To ease notation, we skip subscripts and denote asymptotic unbiasedness as lim_{n→∞} E[θ̂] = θ. Now formalize our probability space as (Ω, F, P) and define:

Definition 6. A sequence (X_n)_{n=1}^∞ of random variables converges almost surely or strongly to a random variable X if:

P({ω ∈ Ω | lim_{n→∞} X_n(ω) = X(ω)}) = 1

We denote strong convergence by X_n →^{a.s.} X.

Definition 7. An estimator θ̂ of θ is strongly consistent if θ̂ →^{a.s.} θ as the sample size increases, i.e. n → ∞.

See theorem 16.1 in [Das08] for a formal version and proof of the following statement:

Theorem 1 (Consistency of Maximum Likelihood). Given sufficient regularity conditions, the maximum likelihood estimator is strongly consistent.

The following theorem shows that almost sure convergence basically implies uniform convergence; see theorem 2.3.2 in [Fri82] for a more general statement and proof:

Theorem 2 (Egoroff's Theorem). Let (X_n) be a sequence of random variables converging almost surely to a random variable X. Then for any ε > 0 there exists a measurable set B ⊆ Ω such that P(B) < ε and X_n converges uniformly to X on Ω \ B.


We finally come to the goal of this section: to prove the asymptotic unbiasedness of δ̂_{i,j}. First we prove a lemma:

Lemma 1. If the product space of possible debtors and possible values for β is a compact subset of some R^k, then there exists a constant L such that for all d and all α, β in the parameter space:

| 1/(1 + e^{−d^T α}) − 1/(1 + e^{−d^T β}) | ≤ L‖α − β‖

We call L a Lipschitz constant.

Proof. Since the left-hand side of the inequality consists only of continuously differentiable functions (and no division by zero or similar degenerate behavior occurs on the whole of R^k), the lemma follows from the following statement: if f is continuously differentiable on an open Ω ⊆ R^k, then f is Lipschitz continuous on every compact K ⊆ Ω. We refer to [Fis] for a proof. □

Proposition 5. Under the compactness assumption in lemma 1, the estimator δ̂_{i,j} is asymptotically unbiased.

Proof.

lim_{|T_{i,j}|→∞} E[ ( Y_{i,j} − Σ_{d ∈ T_{i,j}} p̂(d) ) / |T_{i,j}| ]
= lim_{|T_{i,j}|→∞} ( E[Y_{i,j}] − Σ_{d ∈ T_{i,j}} E[p̂(d)] ) / |T_{i,j}|
= lim_{|T_{i,j}|→∞} ( Σ_{d ∈ T_{i,j}} p(d) + |T_{i,j}| δ_{i,j} − Σ_{d ∈ T_{i,j}} E[p̂(d)] ) / |T_{i,j}|
= δ_{i,j} + lim_{|T_{i,j}|→∞} Σ_{d ∈ T_{i,j}} E[p(d) − p̂(d)] / |T_{i,j}|

We want to show that the second term vanishes:

| Σ_{d ∈ T_{i,j}} E[p(d) − p̂(d)] / |T_{i,j}| |
≤ Σ_{d ∈ T_{i,j}} E[|p(d) − p̂(d)|] / |T_{i,j}|
≤ |T_{i,j}| · max_{d ∈ T_{i,j}} E[|p(d) − p̂(d)|] / |T_{i,j}|
= max_{d ∈ T_{i,j}} E[ | 1/(1 + e^{−d^T β}) − 1/(1 + e^{−d^T β̂}) | ] =: max_{d ∈ T_{i,j}} E[|Δ(β, β̂)|]

Let ε > 0. By the strong consistency of the maximum likelihood estimator, β̂ converges almost surely to β. Let L be a Lipschitz constant from lemma 1. By Egoroff's theorem, there exists a measurable subset B ⊆ Ω such that β̂ converges uniformly to β on Ω \ B and P(B) < ε/2. Let |T_{i,j}| be sufficiently large so that ‖β − β̂‖ < ε/(2L) on Ω \ B. Then for all d ∈ T_{i,j}, splitting the expectation over Ω \ B (where |Δ(β, β̂)| ≤ L‖β − β̂‖ ≤ ε/2) and over B (where |Δ(β, β̂)| ≤ 1):

E[|Δ(β, β̂)|] ≤ (ε/2) · P(Ω \ B) + 1 · P(B) < ε/2 + ε/2 = ε. □

5.3 Distribution Tests

We follow the presentation in [HWC15] very closely and mostly keep their notation. We will derive the test statistics of the tests we consider; however, we do not comment on their asymptotic behavior. In practice, the R implementations of these tests generally use results concerning the asymptotic normality of the test statistics. For the Fisher sign test and the Wilcoxon-Mann-Whitney rank sum test, we can also construct confidence intervals for the effect sizes θ and Δ occurring in the tests; we simply refer to [HWC15] page 81 (comment 49) and page 143 (comment 22) respectively for these constructions. We mainly consider one-sided tests, because we mainly use the tests to investigate whether the trade algorithm produces better results than a random allocation.

5.3.1 Fisher Sign Test

When the simulation method is used to estimate performance, one round of MCCV yields one performance estimate X_1 for the random allocation and one performance estimate Y_1 for the trade algorithm. Running n MCCV rounds, we thus obtain n pairs (X_1, Y_1), ..., (X_n, Y_n) of performance estimates. Since the training and validation sets of different MCCV rounds overlap, different pairs are not independent, but we treat them as such to be able to apply our tests. The Fisher sign test assumes:

B1. The differences Z_i = Y_i − X_i, i = 1, ..., n, are mutually independent.

B2. Each Z_i is a continuous random variable and all Z_i have the same median θ:

P(Z_i ≤ θ) = 1/2 = P(Z_i > θ)

Assumption B2 means we assume that the trade algorithm constantly performs θ better than the random allocation for any training and validation set. We set up the following hypotheses:

H_0: θ ≤ 0    H_A: θ > 0


We define the test statistic:

B = Σ_{i=1}^n 1_{Z_i > 0}

High values of B indicate that we should reject H_0. Since Z_i is continuous, Z_i = 0 occurs with probability zero. When θ = 0, we have P(Z_i < 0) = 1/2 = P(Z_i > 0). Each outcome (z_1, ..., z_n) of (Z_1, ..., Z_n) is thus equally likely. There are 2^n such outcomes, thus:

P(B ≥ b) = (number of outcomes whose sum is greater than or equal to b) / 2^n

Given the outcome b, the p-value of our test is P(B ≥ b). Choose a significance level α and let b_{1−α} = inf_b {b : 1 − α ≤ P(B ≤ b)}. The rejection region of our test is then B ∈ [b_{1−α}, ∞).
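Under θ = 0, B is Binomial(n, 1/2), so the exact p-value P(B ≥ b) can be computed directly; a minimal sketch:

```python
from math import comb

def sign_test_p_value(z):
    """One-sided Fisher sign test p-value P(B >= b) under theta = 0,
    where B counts positive differences among n continuous differences z."""
    n = len(z)
    b = sum(1 for zi in z if zi > 0)
    # Under H0, each of the 2^n sign outcomes is equally likely, so
    # P(B >= b) = sum_{k >= b} C(n, k) / 2^n.
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
```

For example, with 9 positive differences out of 10, the p-value is (C(10, 9) + C(10, 10)) / 2^10 = 11/1024 ≈ 0.011.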

5.3.2 Wilcoxon-Mann-Whitney Rank Sum Test

In the Wilcoxon-Mann-Whitney rank sum test we do not treat the random allocation performance and the trade performance as a pair, but as independent observations of two different random variables describing performance. Since our performance estimates for the random allocation and the trade algorithm are obtained from the same effective validation set, treating them as independent is highly questionable. We therefore find the Fisher sign test to be more suitable for our situation, but we present the Wilcoxon-Mann-Whitney rank sum test and the Kolmogorov-Smirnov test as alternative tests anyway. We assume:

A1. X_1, ..., X_m are independent and identically distributed (i.i.d.) continuous random variables, and Y_1, ..., Y_n are also i.i.d. continuous random variables. The former variables are thus a random sample from a population distributed according to a random variable X, and similarly the latter are a random sample from Y.

A2. The random samples from X and Y are mutually independent.

A3. X + Δ =_d Y (equality in distribution) for some Δ ∈ R.

We are interested in testing if Δ ≤ 0, and therefore define the following null and alternative hypotheses:

H_0: Δ ≤ 0    H_A: Δ > 0

The test statistic W is defined as follows. Since X and Y are continuous, all values of the two random samples are distinct with probability one. Rank all of the m + n = N values of the two random samples from X and Y in ascending order. Let S_j denote the position of Y_j in this ranking. When Δ = 0, all possible rankings are equally likely and thus P(S_1 = s_1, ..., S_n = s_n) = P(S_1 = σ(s_1), ..., S_n = σ(s_n)) where σ is a permutation of s_1, ..., s_n. Disregarding the mutual order of the S_j variables, there are (N choose n) possible ways of assigning values to the S_j variables, each assignment being equally likely when Δ = 0 (since each ranking is equally likely). Define:

W = Σ_{j=1}^n S_j

High values of W correspond to high ranks for the Y_j variables. The rejection region for our test is thus where W is large:

P(W ≥ w) = (number of assignments whose sum is greater than or equal to w) / (N choose n)

Similarly as for the Fisher sign test, we can now construct the rejection region for the significance level α as W ∈ [w_{1−α}, ∞).
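A sketch of the test using the normal approximation of W (under Δ = 0, E[W] = n(N + 1)/2 and Var(W) = mn(N + 1)/12; these standard moments are not derived above); ties are assumed absent:

```python
from math import erfc, sqrt

def rank_sum_test(x, y):
    """One-sided WMW test of H0: Delta <= 0 vs HA: Delta > 0 using the
    normal approximation of W. Distinct values (no ties) assumed."""
    m, n = len(x), len(y)
    N = m + n
    pooled = sorted(list(x) + list(y))
    w = sum(pooled.index(v) + 1 for v in y)      # ranks of the y-sample, starting at 1
    mean_w = n * (N + 1) / 2                     # E[W] under Delta = 0
    var_w = m * n * (N + 1) / 12                 # Var(W) under Delta = 0
    z = (w - mean_w) / sqrt(var_w)
    p_value = 0.5 * erfc(z / sqrt(2))            # P(Z >= z), Z standard normal
    return w, p_value
```

The exact null distribution enumerates all (N choose n) rank assignments; the normal approximation is what R's wilcox.test falls back on for larger samples.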

5.3.3 Kolmogorov-Smirnov Test

The one-sided Kolmogorov-Smirnov test needs to be used with care; see [Fil15] for the risks of using the test.

Assume that A1 and A2 of the previous subsection hold. Let F and G be the distribution functions of X and Y respectively. The hypotheses of the test are:

H_0: F(t) ≤ G(t) for all t ∈ R    H_A: F(t_0) > G(t_0) for some t_0 ∈ R

For the sake of illustration, assume temporarily that m = n = 2 and that we have obtained an outcome such that x_1 < y_2 < y_1 < x_2 (the order of the indices is irrelevant). Such an outcome is represented by xyyx, called a meshing.

Define the empirical distribution functions as follows:

F_m(t) = (number of X-variables ≤ t) / m
G_n(t) = (number of Y-variables ≤ t) / n

Define:

J+ = max_{Z ∈ {X_1, ..., X_m, Y_1, ..., Y_n}} ( F_m(Z) − G_n(Z) )

When F = G on R, all X- and Y-variables are distinct with probability one, and all meshings are equally likely. In particular, a meshing is identified by the placement of its n characters y. There are thus (N choose n) meshings. The value of J+ is a direct consequence of the meshing. High values of J+ are in favour of rejecting H_0. Calculate:

P(J+ ≥ j) = (number of meshings with J+ greater than or equal to j) / (N choose n)

The rejection region [j_{1−α}, ∞) is constructed as before.
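The statistic J+ can be computed directly from the empirical distribution functions; a minimal sketch assuming distinct values:

```python
def ks_j_plus(x, y):
    """One-sided Kolmogorov-Smirnov statistic
    J+ = max_Z (F_m(Z) - G_n(Z)) over all pooled sample points Z.
    Distinct values assumed."""
    m, n = len(x), len(y)

    def F(t):
        return sum(v <= t for v in x) / m   # empirical CDF of the x-sample

    def G(t):
        return sum(v <= t for v in y) / n   # empirical CDF of the y-sample

    return max(F(z) - G(z) for z in list(x) + list(y))
```

When the x-sample lies entirely below the y-sample, J+ reaches its maximal value 1; when it lies entirely above, the signed difference never exceeds 0.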

As a final remark, note that the statistic J+ is usually defined as follows in the literature:

J+ = gcd(m, n) · max_{Z ∈ {X_1, ..., X_m, Y_1, ..., Y_n}} ( F_m(Z) − G_n(Z) )

the constant factor causing desirable asymptotic behavior. Moreover, the two-sided test, i.e. the test having hypotheses

H_0: F(t) = G(t) for all t ∈ R    H_A: F(t_0) ≠ G(t_0) for some t_0 ∈ R

is constructed in the same way as the one-sided test, but instead defines

J = gcd(m, n) · max_{Z ∈ {X_1, ..., X_m, Y_1, ..., Y_n}} | F_m(Z) − G_n(Z) |


Figure 5.1: Distribution of number of calls per agent. The mean number of calls is 1 376 (rounded) and the median is 394.


Figure 5.2: Performance of unknown (left) and known (right) scorecards.


Figure 5.8: Estimation of performance (rows) for debtor partition 31 using the simulation method (100 simulation steps) for three different rounds of MCCV (columns).


Figure 5.9: Distribution of f for the middle round of MCCV in figure 5.8 (observations chosen by trade compared to the whole validation set of the round).


Figure 5.11: Cluster stability over time for the middle round of MCCV in figure 5.8. Training (left) and chosen observations in validation (right) with varying window lengths (rows).


Figure 5.13: Cluster performance from the scaling method.


Figure 5.16: Results from Fisher sign test for the distributions in figure 5.15.

Figure 5.17: Results from Wilcoxon-Mann-Whitney rank sum test for the distributions in figure 5.15.



