Is it possible to pass a sklearn Pipeline as model to powershap? #33

Comments
Hey @eduardokapp,

That is an excellent question! With the exact same train of thought (feature selection should be included in the pipeline for cross-validation), I've implemented powershap to be compatible with the scikit-learn API, so it can be used as a step in a Pipeline. You should be aware, however, that powershap performs a transformation (i.e., selecting features) and thus cannot be the final step in your scikit-learn pipeline.

Dummy code of how this would look ⬇️

pipe = Pipeline(
    [
        # ... (some more preprocessing / transformation steps)
        ("feature selection", PowerShap()),
        ("model", CatBoostClassifier()),
    ]
)

Hope this helps! If not, feel free to provide a minimal reproducible example :)

Cheers, Jeroen
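[A minimal, runnable sketch of the snippet above on a synthetic dataset; the CatBoost settings and the generated data are arbitrary placeholders, not part of the original comment.]

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from powershap import PowerShap

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline(
    [
        ("feature selection", PowerShap()),
        ("model", CatBoostClassifier(verbose=0)),
    ]
)

pipe.fit(X_train, y_train)          # PowerShap selects features, CatBoost is then fit on them
predictions = pipe.predict(X_test)  # the same feature selection is applied before predicting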
I'm not sure I understand what you're saying. I agree that powershap is a sort of sklearn-transformer-like object and, yes, it should be inside the pipeline! However, what I don't really get is: if powershap has a "cv" parameter to pass a cross-validator and, as I understand it, powershap fits a model many times in its processing, wouldn't it be necessary for the model parameter in powershap to accept a pipeline and not just a model? Hope I clarified my question! Thank you for your quick response.
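[To make the request concrete: a hypothetical sketch of what passing a pipeline as the model, together with the cv parameter mentioned above, could look like. powershap does not support this today, so the snippet is illustrative only.]

# Hypothetical usage - powershap's model parameter does not accept a Pipeline at the moment.
from catboost import CatBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from powershap import PowerShap

inner_model = Pipeline(
    [
        # The imputer would be refit on the train split of every internal iteration,
        # so the validation split never leaks into the imputation statistics.
        ("impute", SimpleImputer(strategy="mean")),
        ("clf", CatBoostClassifier(verbose=0)),
    ]
)

selector = PowerShap(model=inner_model, cv=StratifiedKFold(n_splits=5))
# selector.fit(X, y) would then refit the whole inner pipeline on each internal train split.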
Oh, I see! My apologies for misinterpreting your question - looking back at the title I acknowledge you formulated it quite clearly 🙃
Indeed, this would make sense! I see two options:

1. powershap is used as a (feature selection) step inside a scikit-learn pipeline, with any preprocessing done in the steps before it (as in the dummy code above).
2. powershap's model parameter itself accepts a scikit-learn pipeline, so that the preprocessing steps are refit on the train split of every internal iteration.

We currently comply with the 1st option. To some extent, supporting the 2nd option as well further limits data leakage. However, I am not 100% confident whether complying with this makes sense from an algorithmic standpoint, as we would then possibly not be comparing apples to apples - the data will change (slightly) over the folds when performing internal cross-validation. I do suspect this effect - if measurable at all - to be very minimal 🤔

Interested in hearing your opinion about this @eduardokapp!
Putting a scikit-learn pipeline in powershap is, from an algorithmic standpoint, a plausible option. Because we refit the preprocessors every powershap iteration, every feature within an iteration can be compared to the others. Furthermore, the label should be comparable across iterations, which in turn enables comparing the Shapley values because they will be of the same magnitude. The main concern is the case where the resulting distributions after preprocessing are completely different across iterations. However, for preprocessing such as normalization or a min-max scaler, the label, the features, and the Shapley values will all have the same magnitudes across iterations, and therefore the algorithm will still perform adequately. I hope this answers your question a bit @eduardokapp?
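[A small numerical illustration of that argument, using standardization as the assumed preprocessing step: scalers refit on different resamples of the same data transform a common validation set to nearly identical distributions, which is why the resulting Shapley values stay on a comparable scale across iterations.]

# Refit a StandardScaler on two different resamples and transform the same held-out data:
# the transformed distributions barely differ across the refits.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

for _ in range(2):
    idx = rng.choice(len(X), size=600, replace=False)
    scaler = StandardScaler().fit(X[idx])
    transformed = scaler.transform(X_val)
    print(transformed.mean(axis=0).round(2), transformed.std(axis=0).round(2))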
Thank you so much for taking the time to answer my question. So, given that this makes sense, what should be done (code-wise) to make it happen? I'd be happy to implement this feature.
Hey @jvdd, @JarneVerhaeghe! I've been thinking about this issue and maybe it could be solved by creating a new explainer subclass that uses the ones you already defined but also applies the pipeline transformation steps. Excuse me for the over-simplified ideas here, but something along the lines of:

from typing import Callable

import numpy as np
import shap
from sklearn.pipeline import Pipeline

# The supported model classes (CatBoostRegressor, LGBMClassifier, XGBClassifier,
# ForestRegressor, LinearModel, tf.keras.Model, ...) are assumed to be imported
# in the same way as in powershap's existing explainer module.


class PipelineExplainer(ShapExplainer):
    @staticmethod
    def supports_model(model) -> bool:
        # Only handle scikit-learn Pipelines
        if not isinstance(model, Pipeline):
            return False
        # Get the final step (estimator) of the pipeline
        estimator = model.steps[-1][1]
        # Check if the final step is an instance of one of the supported models
        supported_models = [
            CatBoostRegressor, CatBoostClassifier,
            LGBMClassifier, LGBMRegressor,
            XGBClassifier, XGBRegressor,
            ForestRegressor, ForestClassifier, BaseGradientBoosting,
            LinearClassifierMixin, LinearModel, BaseSGD,
            tf.keras.Model,
        ]
        return isinstance(estimator, tuple(supported_models))

    def _fit_get_shap(self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs) -> np.ndarray:
        # Get the final estimator from the pipeline
        estimator = self.model.steps[-1][1]
        # Fit the whole pipeline (preprocessing + estimator) on the train split
        self.model.fit(X_train, Y_train, **kwargs)
        # Push the validation data through all preceding (preprocessing) steps
        transformed_X_val = X_val
        for _, step in self.model.steps[:-1]:
            transformed_X_val = step.transform(transformed_X_val)
        # Calculate the SHAP values using the final estimator
        # (maybe here there would be some way of just inheriting or modifying
        #  the behavior of the other explainer classes)
        explainer = shap.Explainer(estimator)
        shap_values = explainer.shap_values(transformed_X_val)
        return shap_values

    def _validate_data(self, validate_data: Callable, X, y, **kwargs):
        # Validate the data for each preprocessing step in the pipeline
        for _, step in self.model.steps[:-1]:
            X = step._validate_data(validate_data, X, **kwargs)
        return super()._validate_data(validate_data, X, y, **kwargs)
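[To make the proposal concrete, a hypothetical usage of the sketched class; PipelineExplainer is not part of powershap, and this only illustrates the supports_model dispatch from the sketch above.]

# Hypothetical dispatch check for the sketch above.
from catboost import CatBoostClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

candidate = Pipeline(
    [
        ("scale", StandardScaler()),
        ("clf", CatBoostClassifier(verbose=0)),
    ]
)

assert PipelineExplainer.supports_model(candidate)                   # final step is a supported model
assert not PipelineExplainer.supports_model(CatBoostClassifier())    # not a Pipeline, handled by the existing explainers

[powershap could then try this explainer first and fall back to the existing explainers when the model is not a Pipeline.]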
I have a similar concern; a PipelineExplainer like the one sketched above looks like it would be a good solution to this.
Is there currently a way to pass a sklearn.pipeline.Pipeline object as the model parameter? I can't seem to do it, and I think being able to would be better for the internal cross-validation.
For example, imagine that you have a model defined as a pipeline that starts with one or two preprocessing steps, which may include operations that should not be fit on the validation or test set, e.g. filling missing values with the mean.
Right now, I'm preprocessing my data and then passing just the pipeline's final step to the powershap object.
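[For reference, a sketch of the workaround described in the last sentence; X_train and y_train are placeholder variable names. The imputer is fit once on all the training data, which is exactly the leakage into powershap's internal validation splits that this issue would like to avoid.]

# Current workaround: preprocess outside powershap, pass only the final estimator as the model.
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.impute import SimpleImputer
from powershap import PowerShap

# X_train (a DataFrame) and y_train are placeholders for the user's training data.
imputer = SimpleImputer(strategy="mean")
X_train_prep = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)

selector = PowerShap(model=CatBoostClassifier(verbose=0))
selector.fit(X_train_prep, y_train)                  # internal splits all reuse the same imputation
X_train_selected = selector.transform(X_train_prep)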