<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Random Forest on XAI Today</title>
    <link>https://xai.today/tags/random-forest/</link>
    <description>Recent content in Random Forest on XAI Today</description>
    <generator>Hugo</generator>
    <language>en-US</language>
    <lastBuildDate>Tue, 09 Jul 2024 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://xai.today/tags/random-forest/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Explainable AI for Improved Heart Disease Prediction</title>
      <link>https://xai.today/posts/optimized-ensemble-heart-disease-prediction/</link>
      <pubDate>Tue, 09 Jul 2024 00:00:00 +0000</pubDate>
      <guid>https://xai.today/posts/optimized-ensemble-heart-disease-prediction/</guid>
      <description>&lt;p&gt;The paper &amp;ldquo;&lt;a href=&#34;https://www.mdpi.com/2078-2489/15/7/394&#34;&gt;Optimized Ensemble Learning Approach with Explainable AI for Improved Heart Disease Prediction&lt;/a&gt;&amp;rdquo; focuses on explaining machine learning models in healthcare, similar to my original work in &amp;ldquo;&lt;a href=&#34;https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-01201-2&#34;&gt;Ada-WHIPS: explaining AdaBoost classification with applications in the health sciences&lt;/a&gt;&amp;rdquo;. The newer paper combines a novel Bayesian method to optimally tune the hyperparameters of ensemble models such as AdaBoost, XGBoost and Random Forest, and then applies the now well-established SHAP method to assign Shapley values to each feature. The authors use their method to analyse three heart disease prediction datasets, including the well-known Cleveland dataset used as a benchmark in many ML research papers.&lt;/p&gt;&#xA;&lt;p&gt;SHAP (&lt;a href=&#34;https://arxiv.org/abs/1705.07874&#34;&gt;Lundberg and Lee&lt;/a&gt;) came hot on the heels of the revolutionary LIME method (&lt;a href=&#34;https://arxiv.org/abs/1602.04938&#34;&gt;Ribeiro, Singh and Guestrin&lt;/a&gt;), which together delivered a paradigm shift in the usefulness and feasibility of eXplainable Artificial Intelligence (XAI). In fact, LIME was published at exactly the time I was becoming interested in the topic of XAI and served as inspiration for my own Ph.D. journey. Both methods fall into the category of Additive Feature Attribution Methods (AFAM) and work by assigning a unitless value to each of the input features. The main benefits of AFAM become clear when viewing a beeswarm plot of their responses across a larger dataset, such as the whole training data. Patterns emerge showing which input variables affect the response variable most strongly, and in which direction. 
This usage is much more sophisticated than classic variable importance plots, which lack the direction and mathematical guarantees offered by SHAP.&lt;/p&gt;&#xA;&lt;p&gt;In the clinical setting, these mathematical guarantees mean that the resulting variable sensitivity information could be used to create a broader diagnostic tool. However, while this approach can provide a general understanding of which variables drive a model&amp;rsquo;s predictions, it lacks the fine-grained, instance-specific clarity offered by perfect-fidelity, decompositional methods.&lt;/p&gt;&#xA;&lt;p&gt;On the other hand, my original method Ada-WHIPS (firmly within the decompositional methods category) enhances interpretability in clinical settings by providing direct, case-specific explanations, making it a powerful tool for clinicians needing detailed transparency for patient-specific decision-making. Given the choice of an AdaBoost model (or a Gradient Boosted Model, or a Random Forest), it makes sense to use an XAI method that is highly targeted to these decomposable ensembles. Ada-WHIPS digs deep into the internal structure of AdaBoost models, redistributing the adaptive classifier weights generated during model training (and therefore a function of the training data distribution) to extract interpretable rules at the decision node level.&lt;/p&gt;&#xA;&lt;p&gt;One area where Ada-WHIPS could benefit from the techniques in the new paper is the use of Bayesian methods to tune hyperparameters. Their approach potentially leads to improved model accuracy, a crucial factor in high-stakes environments like healthcare, and to &amp;ldquo;juicing up&amp;rdquo; the model internals for greater accuracy in the generated decision nodes. However, the paper appears to omit any detail about how this approach is deployed. 
This omission is indeed a great pity because, from what I understood, the Bayesian parameter selection was actually the authors&amp;rsquo; novel contribution (the use of ensembles and SHAP on these particular datasets being nothing particularly new).&lt;/p&gt;&#xA;&lt;p&gt;In conclusion, the SHAP-based approach offers valuable insights at a macro level, the new paper boasts improvements in model accuracy through Bayesian tuning, and my Ada-WHIPS method&amp;rsquo;s per-instance clarity and actionable insights should prove practical in scenarios where clinicians require detailed explanations of specific cases. I would be delighted to see some confluence of the three ideas, so that the benefits from each can combine and reinforce the use of highly targeted explainability in clinical applications.&lt;/p&gt;&#xA;</description>
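To make the additive idea concrete, here is a minimal sketch, using only the Python standard library, of what an Additive Feature Attribution Method computes. The scoring function is a hypothetical stand-in for a trained ensemble, not any model from the paper; the sketch computes exact Shapley values by brute force over feature orderings, which is only practical for a handful of features.

```python
# Exact Shapley values by brute force: average each feature's marginal
# contribution over every ordering in which features can be revealed.
from itertools import permutations

def shapley_values(f, x, baseline):
    n = len(x)
    contrib = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        z = list(baseline)              # start from the reference input
        prev = f(z)
        for i in order:
            z[i] = x[i]                 # reveal feature i at its true value
            cur = f(z)
            contrib[i] += cur - prev    # marginal contribution of feature i
            prev = cur
    return [c / len(orderings) for c in contrib]

# Hypothetical scoring function standing in for a trained ensemble.
def model(z):
    return 2.0 * z[0] + z[1] * z[2]

x = [1.0, 1.0, 1.0]
base = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, base)
print(phi)                               # one attribution per feature
print(sum(phi), model(x) - model(base))  # additivity: the two values match
```

The additivity shown on the last line is what makes beeswarm plots over a whole dataset meaningful: every point is a per-instance, per-feature contribution expressed on the scale of the model output.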
    </item>
    <item>
      <title>Algebraic Aggregation of Random Forests</title>
      <link>https://xai.today/posts/algebraic-aggregation-forests/</link>
      <pubDate>Thu, 10 Aug 2023 00:00:00 +0000</pubDate>
      <guid>https://xai.today/posts/algebraic-aggregation-forests/</guid>
      <description>&lt;p&gt;In my paper, &amp;ldquo;&lt;a href=&#34;https://link.springer.com/article/10.1007/s10462-020-09833-6&#34;&gt;CHIRPS: Explaining random forest classification&lt;/a&gt;&amp;rdquo;, I took an empirical approach to addressing model transparency by extracting rules that make Random Forest (RF) models more interpretable. Importantly, this was done without sacrificing the high levels of accuracy achieved by these models.&lt;/p&gt;&#xA;&lt;p&gt;The recently published &amp;ldquo;&lt;a href=&#34;https://link.springer.com/article/10.1007/s10009-021-00635-x?fromPaywallRec=false&#34;&gt;Algebraic aggregation of random forests: towards explainability and rapid evaluation&lt;/a&gt;&amp;rdquo; by Gossen and Steffen provides a theoretical counterpart, offering essential proofs and a mathematical framework for achieving explainability with RF models.&lt;/p&gt;&#xA;&lt;p&gt;While my paper focused on simplifying complex models by rule extraction on a per-instance basis, this subsequent work introduces Algebraic Decision Diagrams (ADDs) to aggregate Random Forests, optimizing their structure and enhancing interpretability at the model level. Both papers aim to improve model transparency, though by different means: my approach is empirical, leveraging rule extraction to clarify black-box models, whereas the latter introduces algebraic methods to combine decision trees into efficient, understandable diagrams.&lt;/p&gt;&#xA;&lt;p&gt;The mathematical concepts in Gossen and Steffen&amp;rsquo;s paper, such as path reduction and algebraic operations, support model simplification. Importantly, the authors provide formal proofs that this aggregation retains the original model&amp;rsquo;s accuracy. 
This complements the practical focus in my paper, where the goal was also to maintain accuracy while increasing explainability.&lt;/p&gt;&#xA;&lt;p&gt;Ultimately, the two papers reach the same destination—improving transparency of RF models—but by different routes. While my paper uses rule extraction to bring clarity to complex models, the subsequent work constructs a theoretical basis using algebraic tools, providing formal assurances to the outcomes I demonstrated empirically. Together, they offer complementary perspectives on making RF models more understandable and efficient.&lt;/p&gt;&#xA;</description>
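As a rough intuition for what aggregation buys (this is a sketch, not the authors' ADD construction), the snippet below folds a hypothetical forest of one-level trees over binary features into a single lookup structure, so a prediction costs one lookup rather than one traversal per tree.

```python
# Fold a tiny, hypothetical forest of one-level trees over binary features
# into a single table keyed by the full input assignment.
from itertools import product

stumps = [
    lambda z: 1 if z[0] else 0,   # votes with feature 0
    lambda z: 1 if z[1] else 0,   # votes with feature 1
    lambda z: 0 if z[0] else 1,   # votes against feature 0
]

def aggregate(trees, n_features):
    table = {}
    for bits in product([0, 1], repeat=n_features):
        votes = sum(t(bits) for t in trees)
        table[bits] = 1 if 2 * votes > len(trees) else 0  # majority vote
    return table

forest = aggregate(stumps, 2)
print(forest[(1, 1)], forest[(1, 0)])   # prints: 1 0
```

The exhaustive table built here is exponential in the number of features; the point of the ADD machinery in the paper is to share common substructure so that the single-pass evaluation and the proven equivalence to the original forest come without that blow-up.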
    </item>
    <item>
      <title>Explaining Random Forests with Representative Trees</title>
      <link>https://xai.today/posts/forest-for-trees/</link>
      <pubDate>Thu, 15 Jun 2023 00:00:00 +0000</pubDate>
      <guid>https://xai.today/posts/forest-for-trees/</guid>
      <description>&lt;p&gt;The paper &amp;ldquo;&lt;a href=&#34;https://link.springer.com/article/10.1007/s41237-023-00205-2?fromPaywallRec=false&#34;&gt;Can’t see the forest for the trees: Analyzing groves to explain random forests&lt;/a&gt;&amp;rdquo; explores a novel take on model-specific explanations, as outlined in my own research (see, e.g., &amp;ldquo;&lt;a href=&#34;https://link.springer.com/article/10.1007/s10462-020-09833-6&#34;&gt;CHIRPS: Explaining random forest classification&lt;/a&gt;&amp;rdquo;). This new paper by Szepannek and von Holt seeks to make Random Forests (RF) more interpretable. RF models are notoriously hard to explain due to their complexity, and these novel methods work well for both classification and regression, which is a very useful extension to the field.&lt;/p&gt;&#xA;&lt;p&gt;The authors introduce most representative trees (MRT) and surrogate trees, essentially distilling a simpler model to run side by side with the black box RF. MRTs focus on highlighting individual trees within a random forest that best explain the overall model behavior, while surrogate trees mimic the forest with simpler, more digestible versions. I have some reservations about the latter approach, because my own research showed that any surrogate model comes with a failure rate, which is the proportion of examples that the surrogate classifies differently from the black box model under scrutiny. I also question the assertion that a model of 10 or 24 decision trees really is so interpretable. 
Even a model of this reduced size still likely contains far too many components for a human-in-the-loop to consider and understand.&lt;/p&gt;&#xA;&lt;p&gt;In any case, to give the authors their due credit, they navigate the trade-offs between accuracy and interpretability of both MRT and surrogate tree methods, and propose a novel concept called &lt;em&gt;groves&lt;/em&gt;: small collections of decision trees that balance the need for interpretability with predictive accuracy. Groves provide a middle ground by combining the benefits of MRTs and surrogate models, reducing the overall complexity while still offering meaningful insights into how the model operates. This approach aligns with the goal of making models more transparent and trustworthy.&lt;/p&gt;&#xA;&lt;p&gt;Through various case studies, the paper shows how groves and surrogate trees can be effectively applied to real-world datasets. The trade-off between model accuracy and explainability remains a central challenge. Yet, in these studies, groves provide a workable compromise by making it easier for humans to understand what is driving the model’s predictions without overwhelming them with unnecessary detail.&lt;/p&gt;&#xA;&lt;p&gt;The discussion also highlights a key challenge in using groves: deciding on the right number of trees to use for explanation. Using too many trees risks overwhelming the user with information (as I have already pointed out), while too few might fail to capture the complexity of the underlying model and run with an untenable failure rate. I discuss ways to achieve a zero failure rate in my thesis. Keeping explanations concise and accessible is just a part of the complete picture.&lt;/p&gt;&#xA;&lt;p&gt;In conclusion, this paper underscores the crucial need for enhancing the interpretability of machine learning models, particularly in high-stakes fields like healthcare and finance, where decision transparency is essential. 
By extending the work in interpretability through methods like groves and surrogate trees, it addresses the challenge of making powerful models like random forests more understandable.&lt;/p&gt;&#xA;</description>
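The failure rate mentioned above is simple to make precise: it is the proportion of examples on which the surrogate disagrees with the black box it is meant to explain. Here is a minimal sketch with hypothetical stand-in models (not the paper's groves or any trained forest).

```python
# Failure rate of a surrogate: the proportion of examples on which the
# simple model disagrees with the black box it is meant to explain.
# Both models are hypothetical stand-ins, not the paper's groves.

def black_box(z):
    # Pretend ensemble: a mildly non-linear decision boundary.
    return 1 if z[0] + z[1] * z[2] > 1.0 else 0

def surrogate(z):
    # Pretend single-tree surrogate: one threshold on the dominant feature.
    return 1 if z[0] > 0.5 else 0

def failure_rate(f, g, examples):
    disagreements = sum(1 for z in examples if f(z) != g(z))
    return disagreements / len(examples)

grid = [(a / 4, b / 4, c / 4)
        for a in range(5) for b in range(5) for c in range(5)]
rate = failure_rate(black_box, surrogate, grid)
print(rate)   # nonzero: the surrogate cannot copy the black box everywhere
```

A nonzero rate on even this toy example illustrates the reservation above: wherever the surrogate simplifies away interactions, there is a region of the feature space where its explanation is an explanation of the wrong prediction.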
    </item>
    <item>
      <title>Explaining Random Forests with Boolean Satisfiability</title>
      <link>https://xai.today/posts/explaining-rf-with-sat/</link>
      <pubDate>Mon, 21 Jun 2021 00:00:00 +0000</pubDate>
      <guid>https://xai.today/posts/explaining-rf-with-sat/</guid>
      <description>&lt;p&gt;The paper &amp;ldquo;&lt;a href=&#34;https://arxiv.org/pdf/2105.10278#page=7&amp;zoom=100,72,821&#34;&gt;On Explaining Random Forests with SAT&lt;/a&gt;&amp;rdquo; uses Boolean satisfiability (SAT) methods to provide a formal framework for generating explanations of Random Forest (RF) predictions. A key result in the paper is that abductive explanations (AXp) and contrastive explanations (CXp) can be derived by encoding the RF’s decision paths into propositional logic.&lt;/p&gt;&#xA;&lt;p&gt;Encoding a decision path as propositional logic is an entirely reasonable and quite straightforward approach, as I showed in my paper &lt;a href=&#34;https://link.springer.com/article/10.1007/s10462-020-09833-6&#34;&gt;CHIRPS: Explaining random forest classification&lt;/a&gt;. The decision paths of an RF model can be transformed into a Boolean formula in Conjunctive Normal Form (CNF). For example, each decision tree in the forest is represented as a set of clauses. Following the paths for a single example prediction essentially carves out a region of the feature space with a set of step functions, resulting in a sub-region that must return the target response. When the clauses of this set of step functions involve only a subset of the features, a change in the remaining feature inputs has no effect on the model prediction. This subset is a prime implicant (PI) explanation.&lt;/p&gt;&#xA;&lt;p&gt;A PI-explanation is a minimal subset of features that are sufficient to guarantee a particular prediction made by a machine learning model. It represents the smallest set of conditions that, if held constant, would lead to the same classification result. Essentially, it&amp;rsquo;s the most concise explanation of why the model arrived at its decision, highlighting the critical features responsible for that prediction. 
In fact, my own research centred on finding soft PI-explanations, and on revealing the limits where they no longer hold true for extreme outliers and unusual examples.&lt;/p&gt;&#xA;&lt;p&gt;The authors of this paper show that finding AXp and CXp by this PI-explanation method reduces to solving a SAT problem and is therefore NP-hard in general, but can be polynomial under specific conditions. This insight into the problem complexity is significant because it establishes that generating explanations is feasible when those conditions are met and opens up the method to practical applications with real-world data.&lt;/p&gt;&#xA;&lt;p&gt;Overall, the SAT-based methodology enables a structured, efficient way to uncover the decision-making process of Random Forests, ensuring that their predictions are not just accurate but also explainable, which is crucial for domains requiring transparency like healthcare and finance.&lt;/p&gt;&#xA;</description>
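To make the PI idea tangible, here is a brute-force sketch over binary features with a hypothetical classifier. It is emphatically not the paper's SAT encoding (brute force is exponential in the number of features, which is exactly why the authors reach for SAT solvers), but it computes the same object: a smallest feature subset that is sufficient to guarantee the prediction.

```python
# Brute-force PI-explanation over binary features: the smallest feature
# subset that, held at its current values, forces the model's prediction
# no matter how the remaining features are set.
from itertools import combinations, product

def model(z):
    # Hypothetical classifier: feature 0 AND (feature 1 OR feature 2).
    return z[0] and (z[1] or z[2])

def is_sufficient(fixed, x):
    target = model(x)
    free = [i for i in range(len(x)) if i not in fixed]
    for bits in product([0, 1], repeat=len(free)):
        z = list(x)
        for i, b in zip(free, bits):
            z[i] = b                    # perturb every non-fixed feature
        if model(z) != target:
            return False                # a perturbation flips the prediction
    return True

def pi_explanation(x):
    for size in range(len(x) + 1):      # smallest subsets first
        for subset in combinations(range(len(x)), size):
            if is_sufficient(set(subset), x):
                return subset

print(pi_explanation((1, 1, 0)))        # prints: (0, 1)
```

For the instance (1, 1, 0), fixing features 0 and 1 at value 1 pins the prediction to class 1 whatever feature 2 does, and no single feature suffices, so (0, 1) is a prime implicant. The SAT formulation answers the same sufficiency queries without enumerating the whole feature space.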
    </item>
  </channel>
</rss>
