A Distributed Fair Random Forest

(1)

A Distributed Fair Random Forest

James Fantin∗ Advised by Dr. Chao Lan

May 2, 2020

Abstract

Machine learning algorithms are increasingly responsible for making critical decisions which have broad societal impact. Questions are arising about the fairness of the algorithms which make these decisions. While existing models have been proposed, many require direct access to private data which may be impossible given new privacy regulations. We propose a distributed fair random forest algorithm which does not require direct access to private demographic data. Our approach uses randomly generated decisions trees which are added to our forest if they are fair with a weighted voting mechanism for accuracy. In building on existing literature, we assume a third party holds private demographic data which can communicate with a data center that builds a model without compromising the privacy of individuals demographic data. We compare our algorithm against existing fair random forest and decision tree algorithms and show that our method can outperform existing methods.

1 Introduction

Issues with the fairness of machine learning models are becoming wildly apparent. In 2018 Amazon shut down an internal AI model used to give recommendations for hiring decisions because it discriminated against female applicants [1]. Recently Apple was sued for discrimination because a machine learning model gave female applicants for the Apple credit card a lower credit limit than male applicants [11]. In response to fairness concerns, many fair algorithms have been produced.

Unfortunately, existing methods often require direct access to private demographic information. This poses a challenge as governments are beginning to regulate access to private information. The European Union’s General Data Protection Regulation (GDPR) and California’s California Con-sumer Privacy Act (CCPA) give strict requirements for holding sensitive demographic data. This has wide-ranging implications for the privacy of protected features which may even be impossible to directly access. With direct access to private data becoming impossible, techniques are needed which allow fair algorithms to be developed without direct access to private data. Recently, there has been work developing distributed fair frameworks which assume a third party has access to private features and is responsible for building models which are given to a data center containing the non-private data [4].

(2)

party responsible for holding private demographic data and a data center holding the rest of the data. The third party is responsible for auditing the algorithms produced by the data center to ensure they are fair.

In our paper, we explore extending the technique to a random forest. Random forest has many benefits which make it one of the most widely used algorithms for machine learning. One benefit is it is quite interpretable based on feature contribution scores which score how important features are at predicting the label [9]. Fairness research has already seen the benefits of interpretable random forest in regard to which features are selected in fair models versus unfair models [10]. This allows one to easily compare the importance of features in varying models and discover which features are the most important to fair models.

We apply our algorithm on three data sets used in fairness research and compare with existing models to demonstrate that our algorithm achieves similar fairness and accurate results often outperforming existing models in fairness. This trade-off can be controlled by a hyper-parameter ρ.

2 Related Work

New discussions about access to protected data are appearing. Encryption based techniques which hide demographic information have been proposed [7], but these come at the price of increasing time complexity of the algorithm. It also may be difficult to convince stakeholders and policymakers that these encryption techniques are secure and would not cause any data privacy leaks if they were ever compromised [3].

Fair decision tree techniques have also been developed. Most consist of a modification of the information gain metric when selecting attributes to split leaf nodes. The general idea is to add a new fairness information gain that measures how closely the selected attribute correlates with the protected features. Often this criteria is subtracted or multiplied by the standard information gain criteria to split based on accurate but fair attributes [6]. In their paper, the entropy score is used for the fair information gain criteria and measures the entropy in regards to the protected feature. Another decision tree technique called FAHT uses a Hoeffding tree with a similarly modified information gain which measures the fairness gain of a specific attribute [12]. Here the fairness information gain is combined using multiplication as opposed to a subtraction approach to account for the varying scales between the two metrics. This paper develops the algorithm for an on-line setting.

The existing fair random forest algorithm technique is to modify the information gain similarly to the previous papers and modify the Gini impurity score to measure both the class label and the protected feature [10]. In this paper, the fair information gain is subtracted from the standard information gain criteria. Intuitively, the model chooses features at nodes in the decision trees which are correlated with the label but not with the protected attribute. One item of note is that if a feature is highly correlated with the label and the protected attribute, but no better feature exists to split the node on then this feature will be chosen. So in certain data sets, this algorithm can still lead to biased results. In the paper the fair decision trees are then combined using the standard majority voting routine into a fair random forest.

(3)

3 Fair Random Forest Framework

We propose a new technique using randomly generated decision trees as opposed to a regularization approach for fairness. This has the benefit by allowing our method generate trees quickly which are easily adapted into a distributed framework. We also describe the added benefit that we can generate more fair trees and threshold the fairness of each tree in order to optimize the trade-off between fairness and accuracy.

In line with existing fairness research, we will use the common statistical parity [8] metric and the normed disparate impact [8] metric to analyze the fairness of each algorithm.

We use the extra random decision tree algorithm developed in [2] for the base trees in our random forest technique. We add the constraint that only one feature is randomly selected at each split. This added constraint amounts to building a completely random decision tree. For improved performance, we use the implementation of this algorithm developed in the popular SciKit learn python package.

An added benefit of using random decision trees is that with a few added constraints our random decision trees can be differentially private. Jagannathan et al. uses a slightly modified version of the extra random decision tree algorithm and show that by removing a pruning step in the tree generation process and adding noise to each leaf node that the algorithm is -differentially private [5]. Their work considers the random forest framework with random decision trees in a similar fashion to our technique. Thus, by following their steps, our algorithm could be made differentially private, though future research is needed to prove that by only selecting the fairest random decision trees that differential privacy is maintained.

As opposed to existing fair random forest algorithms, our algorithm maintains individual privacy using a third party which holds private demographic information. A data center is then responsible for generation of the decision tree and storing the non-private data. The data center builds a random decision tree with a training set D and sends decision tree to the third party. The third party audits the tree on D and returns D if the statistical parity is below ρ, the fairness constraint set by the data center. A smaller ρ will build a fairer model compared to a larger ρ. The hyper-parameter ρ can be used to balance the trade-off between increased accuracy and increased fairness.

The linear expression to learn fair models in [4] cannot easily be applied to decision trees. Instead, we propose a weighted voting technique to create a linear combination of decision trees. The standard random forest algorithm uses a majority voting technique to predict the label for each instance. Majority voting with our fair model is not optimal as it will often result in trivial decision trees. We define a trivial decision tree as a decision tree DT which for every instance x, DT predicts x has label 0 (or label 1 for all instances). This occurs because trivial decision trees are always fair and have a statistical parity value of 0. Thus, they can often be chosen by our technique when ρ is small enough.

While a random forest built with many trivial decision trees would be fair, it would not lead to any interesting predictions, which we desire for real world use cases. To solve this issue, we propose a weighted voting mechanism for each tree to learn an accurate tree. We collect the predictions for all decision trees and then use a ridge regression model to use learn weights for each decision tree. This has the added benefit that most trivial decision trees will receive a weight of 0 since their predictions do not help predict the true label. Thus, the ridge model eliminates trivial decision trees

(4)

3.1 Algorithm Design

In this section we present the pseudocode for our algorithm. ExtraDecisionTree is a function which produces a random decision tree [2]. This object has two functions which are fit which fits and builds the tree and predict which predicts the values from some data. Statistical parity is a function which calculates the statistical parity. Ridge is a function which produces a standard ridge regression model with two standard functions fit and predict. For our fair random forest R, we assume it contains a list R.dT rees which we will use to store decision trees. R.ridgeM odel is where we will store the ridge regression model.

Algorithm 1: Training A Fair Random Forest Input:

D, training set L, true label

M , number of fair models F , fair attribute

α, ridge regression alpha value ρ, fairness constraint

Output: A fair random forest R

i = 0; predAr = []; while i < M do decisionTree = ExtraDecisionTree(D); decisionTree.fit(D); prediction = decisionTree.predict(D); sp = StatisticalParity(D, F , prediction); if sp < ρ then i++; R.addTree(decisionTree); predAr.append(prediction); end end

ridgeModel = new Ridge(α); ridgeModel.fit(predAr, L); R.ridgeModel = ridgeModel;

In the distributed model, a third parity with the private demographic data would be responsible for measuring the statistical parity of each model. This would allow the third party to control all access to the private demographic data. Then the third party would only return the tree if statistical parity was below ρ.

(5)

Algorithm 2: Prediction of Fair Random Forest Input:

R, A Fair Random Forest D, data set

Output: A prediction pred for D

predAr = [];

for dTree in R.dTrees do

predAr.append(dtree.predict(D)); end

pred = ridgeMode.predict(predAr);

4 Experiment

In this section, we evaluate our proposed algorithm on three real-world data sets and compare against two other tree based methods. All code for our algorithm is written in python and available at our website1along with the data used in each of the model comparisons. We reproduce the FAHT code, as this was not made available by the authors, in Java which is also available on our website2.

4.1 Data

To allow comparison against existing fair distributed research [4], we compare against the Com-munity Crime3, COMPAS4, and Credit Card5 data sets. The Community data set contains 1993 instances with 101 features. The label is the crime rate of the community. The protected feature is the fraction of African-American residents which we threshold to a binary value. The value is 0 if the percentage is less than 0.5 and 1 otherwise. The COMPAS dataset contains 18317 instances with 40 features. We removed data with incomplete records and reduced the dataset to 1600 in-stances and 15 features. The binary label is risk of recidivism with race as the protected feature. The Credit Card data set contains 30000 instances with 23 features. The label is if a person defaults on their credit payment and education degree is the sensitive attribute.

4.2 Design

For each algorithm, we randomly chose 75% of the instances for training data and 25% for testing data. The models are evaluated on 50 trials where we randomly shuffle the data sets in each trial. Since the FAHT algorithm uses an on-line setting, there is no testing / training split but random shuffles are used to account for concept drift which may occur overtime in the data.

We optimize each algorithm using grid search for hyper-parameters or values chosen in the corresponding paper. FAHT has no hyper-parameters in the model so no tuning is needed. For our

1_{https://github.com/pjlake98/Distributed-Fair-Random-Forest} 2

https://github.com/pjlake98/Fair-Forest

(6)

Table 1: Performance on the Community Crime Data Set

Method Statistical Parity Normed Disparate Classification Error FAHT 0.0943 ± 0.4405 0.7590 ± 0.4533 0.1482 ± 0.0117

FRF 0.0991 ± 0.3755 0.7754 ± 0.6879 0.1263 ± 0.0741 FDT 0.1730 ± 0.1914 0.7177 ± 0.3171 0.1395 ± 0.0392 DFRF 0.0153 ± 0.0430 0.7252 ± 0.6741 0.1413 ± 0.0770

Table 2: Performance on the COMPAS Data Set

FRF 0.0000 ± 0.0000 NA 0.2281 ± 0.0176 FDT 0.0019 ± 0.0019 0.3581 ± 0.6419 0.2302 ± 0.0164 DFRF 0.0037 ± 0.0023 0.6612 ± 0.6307 0.2296 ± 0.0104

Table 3: Performance on the Credit Card Data Set

FRF 0.0626 ± 0.0436 0.3632 ± 0.1450 0.1882 ± 0.0124 FDT 0.0461 ± 0.0289 0.2757 ± 0.1508 0.1924 ± 0.0138 DFRF 0.0123 ± 0.0540 0.1285 ± 0.2904 0.2055 ± 0.0183

DFRF algorithm, we select 100 as the number of base models and 1 for the minimum number of instances in each leaf. In the community crime data set, we chose 50 for the alpha value in ridge regression with a max depth of 20. For the COMPAS data set, we chose 5 for alpha and 4 for the max depth. For the credit card data set, we chose 225 for alpha and 12 for the max depth.

For the FRF (fair random forest) algorithm we choose 1 for the minimum number of instances in each leaf in each trial. We select 4 as the max depth on the community crime set, 6 for the max depth on the COMPAS set, and 6 for the credit card set. For the FDT (fair decision tree) we choose 1 for the minimum number of instances in each leaf in each trial. We select 4 for the max depth on the community crime set, 6 for the depth of the COMPAS set, and 6 for the credit card set.

4.3 Results

Our results are summarized in Tables 1, 2 and 3. We notice that on the Community Crime data set our DFRF algorithm achieves the lowest statistical parity by far compared to the other algorithms with a slight cost to accuracy of the model. In the other two data sets it still maintains small statistical parity in line with existing models. This suggests our DFRF algorithm can remove bias while maintaining low classification error.

4.4 Sensitive Analysis

(7)

Figure 1: Classification Performance versus ρ

the Community Crime Data Set in Figure 1. The performance is averaged over 50 random trials with the same hyper-parameters as chosen as were used in Table 1. By choosing a lower ρ we may have a more fair model at the cost of higher classification error.

5 Conclusion

We propose a distributed fair random forest algorithm which protects the privacy of private de-mographic data. We assume a third party maintains access to private demographic data and communicates with a data center audit fair trees without compromising individual demographic data. Our algorithm is compared against existing fair tree algorithms and we demonstrate that it maintains similar accuracy and fairness while sometimes achieving the best fairness on three real world data sets.

References

[1] Dastin, J. Amazon scraps secret ai recruiting tool that showed bias against women. Reuters (2018).

[2] Geurts, P., Ernst, D., and Wehenkel, L. Extremely randomized trees. Machine learning 63, 1 (2006), 3–42.

[3] Holstein, K., Wortman Vaughan, J., Daum´e III, H., Dudik, M., and Wallach, H. Improving fairness in machine learning systems: What do industry practitioners need? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (2019), pp. 1–16.

[4] Hu, H., Liu, Y., Wang, Z., and Lan, C. A distributed fair machine learning framework with private demographic data protection. 2019 IEEE International Conference on Data Mining (2019).

(8)

[6] Kamiran, F., Calders, T., and Pechenizkiy, M. Discrimination aware decision tree learning. In 2010 IEEE International Conference on Data Mining (2010), IEEE, pp. 869–874.

[7] Kilbertus, N., Gasc´on, A., Kusner, M. J., Veale, M., Gummadi, K. P., and Weller, A. Blind justice: Fairness with encrypted sensitive attributes. Proceedings of the 35th Inter-national Conference on Machine Learnin (2018).

[8] McNamara, D., Ong, C. S., and Williamson, R. C. Provably fair representations. CoRR abs/1710.04394 (2017).

[9] Palczewska, A., Palczewski, J., Robinson, R. M., and Neagu, D. Interpreting random forest classification models using a feature contribution method. In Integration of reusable systems. Springer, 2014, pp. 193–218.

[10] Raff, E., Sylvester, J., and Mills, S. Fair forests: Regularized tree induction to minimize model bias. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (2018), pp. 243–250.

[11] Vigdor, N. Apple card investigated after gender discrimination complaints. The New York Times (2019).

[12] Zhang, W., and Ntoutsi, E. Faht: an adaptive fairness-aware decision tree classifier. Twenty-Eighth International Joint Conference on Artificial Intelligence (2019).