<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Theory on XAI Today</title>
    <link>https://xai.today/tags/theory/</link>
    <description>Recent content in Theory on XAI Today</description>
    <generator>Hugo</generator>
    <language>en-US</language>
    <lastBuildDate>Wed, 26 Feb 2025 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://xai.today/tags/theory/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Revisiting the Rashomon Set Argument</title>
      <link>https://xai.today/posts/rashomon-set/</link>
      <pubDate>Wed, 26 Feb 2025 00:00:00 +0000</pubDate>
      <guid>https://xai.today/posts/rashomon-set/</guid>
      <description>&lt;p&gt;About eighteen months ago, I posted about &lt;a href=&#34;https://arxiv.org/abs/2307.14239&#34;&gt;this paper&lt;/a&gt; discussing the Accuracy-Interpretability Trade-Off (AITO), also known as the Performance-Explainability Trade-Off (PET). The paper revisited the sometimes overlooked debate over the validity of this trade-off. That is to say, is it even necessary to accept that such a trade-off or dichotomy exists? Are we really forced to choose between an accurate model and an interpretable one, or must we always compromise our target metrics? You can read my previous blog post &lt;a href=&#34;https://xai.today/posts/revisiting-pet/&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;One of the arguments against accepting the trade-off is the so-called Rashomon Set (RS) argument. The RS argument suggests that, for many real-world tasks, multiple models from a single function class can achieve nearly the same level of performance, and that some models within this set will be inherently interpretable. The idea, named after Breiman’s Rashomon Effect, has been discussed extensively but never settled. The debate persists because finding an optimal model is, in general, an NP-hard problem, a fact that sits at the foundation of machine learning. We instead approximate a near-optimal solution through risk minimization, a paradigm that encourages us to treat our single, finally selected model as the best we can do. Decades of research into ensemble models haven&amp;rsquo;t changed that, because the ensemble simply takes the seat of a single, risk-minimized model. Rashomon Set research discards this limiting paradigm in favour of exploring the many near-optimal models within a single function class.&lt;/p&gt;&#xA;&lt;p&gt;In their paper &lt;a href=&#34;http://arxiv.org/pdf/2209.08040&#34;&gt;Exploring the Whole Rashomon Set of Sparse Decision Trees&lt;/a&gt;, Xin et al. 
develop a dynamic-programming-based method to generate and sample from the RS of sparse decision trees on several benchmark datasets. Three novel applications are presented, including a fascinating take on RS-derived variable importance. Most strikingly, the authors show that traditional tree-ensemble methods recover only a fraction of the RS, several orders of magnitude smaller than its theoretical maximum size.&lt;/p&gt;&#xA;&lt;p&gt;The question of RS-derived feature importance is explored in fine detail in &lt;a href=&#34;https://arxiv.org/pdf/2110.13369&#34;&gt;Partial Order in Chaos: Consensus on Feature Attributions in the Rashomon Set&lt;/a&gt;, in which the authors address the variability of feature-attribution explanations. Those of us who have worked with workhorse model classes, such as Random Forests and boosting methods, are familiar with their built-in feature importance measures. We are, however, inevitably frustrated by the inconsistency of feature importance across model classes that differ in only trivial ways, and even across multiple runs of the same model that differ only in the random seed. In this paper, Laberge et al. show that the same is true even for theoretically stable methods such as &lt;a href=&#34;https://arxiv.org/pdf/1705.07874&#34;&gt;SHAP&lt;/a&gt; (Lundberg and Lee, 2017), demonstrated on a simple simulated data set for which the ground-truth explanation is known. They go on to propose a framework, based on consensus within the Rashomon Set, for deriving much more consistent measures of variable importance.&lt;/p&gt;&#xA;&lt;p&gt;In the paper &lt;a href=&#34;https://doi.org/10.1145/3531146.3533232&#34;&gt;On the Existence of Simpler Machine Learning Models&lt;/a&gt;, Semenova et al. propose the Rashomon Ratio as a means of estimating the likelihood of finding a highly interpretable model for any given problem. 
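The empirical intuition behind these papers can be sketched in a few lines of code. This is a toy illustration only, not the dynamic-programming enumeration of Xin et al.: the data, the stump-shaped model class, and the choice of epsilon are all hypothetical, and "importance" is reduced to simply counting which feature each near-optimal model relies on.

```python
# Toy sketch of a Rashomon Set: fit many simple models from one function
# class, keep those within epsilon of the best score, and inspect how
# feature usage varies across that near-optimal set.
import random
from collections import Counter

random.seed(0)

# Toy data: the label depends on features 0 and 1; feature 2 is pure noise.
data = []
for _ in range(400):
    x = [random.random() for _ in range(3)]
    y = 1 if x[0] + x[1] > 1.0 else 0
    data.append((x, y))

def fit_stump(seed):
    """Pick a random feature and threshold; orient the split for accuracy."""
    rng = random.Random(seed)
    f, t = rng.randrange(3), rng.random()
    acc = sum((x[f] > t) == y for x, y in data) / len(data)
    return (f, t, max(acc, 1 - acc))

# Many candidate models differing only in their random seed.
models = [fit_stump(s) for s in range(200)]
best = max(acc for _, _, acc in models)

# The (empirical) Rashomon Set: stumps within epsilon of the best accuracy.
epsilon = 0.05
rashomon = [(f, t, acc) for f, t, acc in models if acc >= best - epsilon]

# A crude consensus view of importance: how often each feature is used
# across the near-optimal set (the noise feature should rarely appear).
usage = Counter(f for f, _, _ in rashomon)
print(best, len(rashomon), usage)
```

Any single model drawn from this set would report one feature as "important"; looking at usage across the whole set is the kind of consensus the papers above formalise properly.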
However, these methods are limited to the sparse decision tree model class and cannot be adapted to model classes that aren’t based on recursive partitioning of binary features. A linear model with continuous features, for example, cannot have its RS enumerated combinatorially.&lt;/p&gt;&#xA;&lt;p&gt;The Rashomon Set argument is compelling. So far, however, the strongest evidence is empirical rather than axiomatic; a formal theoretical proof remains elusive. Nevertheless, ongoing research continues to explore the conditions and methods that make it feasible to identify interpretable models within the Rashomon Set, and so the RS argument remains a fascinating open question in theoretical machine learning research.&lt;/p&gt;&#xA;</description>
    </item>
    <item>
      <title>Revisiting the Performance-Explainability Trade-Off</title>
      <link>https://xai.today/posts/revisiting-pet/</link>
      <pubDate>Fri, 01 Sep 2023 00:00:00 +0000</pubDate>
      <guid>https://xai.today/posts/revisiting-pet/</guid>
      <description>&lt;p&gt;I was very excited to read and review the paper &lt;a href=&#34;https://arxiv.org/abs/2307.14239&#34;&gt;Revisiting the Performance-Explainability Trade-Off in Explainable Artificial Intelligence (XAI)&lt;/a&gt; last month. I wrote an extensive section on this topic for my Ph.D. thesis (although I coined the name Accuracy-Interpretability Trade-Off, or AITO). I have always felt that the subject is too rarely discussed, and never with enough depth and scientific rigour. The Performance-Explainability Trade-off (PET) is the notion that improving model performance (by which the authors must mean accuracy or related measures such as true positive rate or ROC AUC) comes at the cost of explainability.&lt;/p&gt;&#xA;&lt;p&gt;The authors state their goal as refining the discussion of PET in the field of Requirements Engineering for AI systems. Frankly, once the text gets going, the paper is quite generic with respect to this self-stated niche, although that in no way detracts from their position on the topic itself. For the most part, the paper focuses on Cynthia Rudin’s influential critique of the performance-explainability trade-off. The authors describe Rudin as particularly critical of post-hoc explainability techniques, arguing that they can produce misleading or incomplete explanations that fail to remain faithful to the model’s decision-making process. This was also a foundational point in my thesis on XAI: what good is an explanation that describes something other than what the model actually computed? Proxy (simplified) explanatory models are particularly prone to this failure.&lt;/p&gt;&#xA;&lt;p&gt;Rudin also contends that interpretable models can often match the performance of black-box models, provided that sufficient effort is invested in knowledge discovery and feature engineering. 
This claim is closely tied to the Rashomon Set argument, which posits that for many real-world tasks there exist multiple high-performing models, including some that are inherently explainable. The authors argue that while this is an intriguing theoretical claim, it lacks strong empirical backing and does not guarantee that such explainable models will be easy to identify or practical to develop in every domain. On this point, I find myself in total agreement with the authors. The Rashomon Set argument is, as far as I can tell, still conjecture, and it is something I would like to revisit in a future blog post.&lt;/p&gt;&#xA;&lt;p&gt;The authors’ strongest argument is that analyst- and researcher-led feature engineering has been vastly superseded by deep learning, which builds feature learning into the model itself. Building a deep neural network is now so much faster that researcher and analyst time can be freed up and spent on making post-hoc explainability far more feasible. The authors argue that the real issue is not just whether performance and explainability are in tension, but how much effort is required to achieve both. They suggest that model development should be viewed as a multi-objective optimization problem, in which teams must balance the trade-offs between performance, explainability, and available resources, while also considering domain-specific risks such as ethical concerns or financial constraints. From this more nuanced position, they derive an extended framework called PET+ (Performance-Explainability-Time trade-off), which incorporates time and resource constraints into the equation.&lt;/p&gt;&#xA;&lt;p&gt;I appreciate and commend this reflection on a chronically overlooked and misunderstood topic, and I hope that the paper contributes towards frameworks for evaluating modelling approaches in the future.&lt;/p&gt;&#xA;</description>
    </item>
  </channel>
</rss>
