
Is it possible to pass a sklearn Pipeline as model to powershap? #33

Open
eduardokapp opened this issue Feb 28, 2023 · 7 comments

@eduardokapp

Is there currently a way to pass a sklearn.pipeline.Pipeline object as the model parameter? I can't seem to do it, and I think that being able to do it would be better for the internal cross-validation.

For example, imagine a model defined as a pipeline whose first one or two steps are preprocessing operations that should not be fitted on the validation or test set, e.g. filling missing values with the mean.

Right now, I'm preprocessing my data and then passing just the pipeline's final step to the powershap object.
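
A minimal sketch of that workaround (X_train / y_train are assumed to exist; the mean imputer and CatBoost model are just illustrative stand-ins):

from sklearn.impute import SimpleImputer
from catboost import CatBoostClassifier
from powershap import PowerShap

# Preprocessing happens up front, outside of powershap ...
X_train_imputed = SimpleImputer(strategy="mean").fit_transform(X_train)

# ... and only the final estimator is handed to powershap.
selector = PowerShap(model=CatBoostClassifier(verbose=0))
selector.fit(X_train_imputed, y_train)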

@jvdd
Member

jvdd commented Mar 1, 2023

Hey @eduardokapp,

That is an excellent question! With the exact same train of thought (feature selection should be included in the pipeline for cross-validation), I've implemented powershap to be scikit-learn compatible.

You should be aware, however, that powershap performs a transformation (i.e., selecting features) and thus cannot be the final step in your scikit-learn pipeline.
=> your final step should be an estimator (some sort of model)

Dummy code of what this would look like ⬇️

from sklearn.pipeline import Pipeline
from catboost import CatBoostClassifier
from powershap import PowerShap

pipe = Pipeline(
    [
        # ... (some more preprocessing / transformation steps)
        ("feature selection", PowerShap()),
        ("model", CatBoostClassifier()),
    ]
)
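
Fitting and using such a pipeline then works like any other scikit-learn estimator (assuming X_train, y_train, and X_test exist):

pipe.fit(X_train, y_train)     # powershap selects features, then CatBoost is trained on them
y_pred = pipe.predict(X_test)  # the same feature selection is applied at predict time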

Hope this helps! If not, feel free to provide a minimal reproducible code :)

Cheers, Jeroen

@eduardokapp
Author

I'm not sure I understand what you're saying. I agree that powershap is essentially a scikit-learn transformer and, yes, it should be inside the pipeline!

However, what I don't really get is this: if powershap has a "cv" parameter for passing a cross-validator and, as I understand it, powershap fits a model many times during its processing, wouldn't it be necessary for powershap's model parameter to accept a pipeline and not just a model?

Hope I clarified my question! Thank you for your quick response.

@jvdd
Member

jvdd commented Mar 1, 2023

Oh, I see! My apologies for misinterpreting your question - looking back at the title I acknowledge you formulated it quite clearly 🙃

However, what I don't really get is this: if powershap has a "cv" parameter for passing a cross-validator and, as I understand it, powershap fits a model many times during its processing, wouldn't it be necessary for powershap's model parameter to accept a pipeline and not just a model?

Indeed, this would make sense! I see two options:

  • if we put powershap in a scikit-learn pipeline, all transformations are fitted once, and the transformed data is then passed to powershap (this is what I described in my previous comment)
  • if we put a scikit-learn pipeline in powershap, all transformations are refitted in every fold of the internal cross-validation in the powershap iterations.

We currently support the 1st option. Supporting the 2nd option as well would further limit data leakage. However, I am not 100% confident whether this makes sense from an algorithmic standpoint, as we would then possibly not be comparing apples to apples - the data will change (slightly) over the folds when performing internal cross-validation. I do suspect this effect - if measurable at all - to be very minimal 🤔
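
For concreteness, a rough sketch of both options (the 2nd one is hypothetical, since powershap's model parameter does not currently accept a Pipeline; the StandardScaler is just an example preprocessing step):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier
from powershap import PowerShap

# Option 1 (currently supported): powershap inside a scikit-learn pipeline.
# The scaler is fitted once, so powershap's internal CV folds all see the same transformed data.
option_1 = Pipeline(
    [
        ("scale", StandardScaler()),
        ("feature selection", PowerShap()),
        ("model", CatBoostClassifier(verbose=0)),
    ]
)

# Option 2 (hypothetical): a scikit-learn pipeline as powershap's model.
# The scaler would then be refitted in every fold of the internal cross-validation.
option_2 = PowerShap(
    model=Pipeline(
        [
            ("scale", StandardScaler()),
            ("model", CatBoostClassifier(verbose=0)),
        ]
    )
)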

Interested in hearing your opinion about this @eduardokapp!
Also @JarneVerhaeghe can you weigh in on this as well?

@JarneVerhaeghe
Contributor

Putting a scikit-learn pipeline in powershap is, from an algorithmic standpoint, a plausible option. Because we refit the preprocessors in every powershap iteration, every feature within an iteration can be compared to the others. Furthermore, the label should be comparable across iterations, which in turn enables comparing the Shapley values because they will be of the same magnitude. The main concern is cases where the resulting distributions after preprocessing are completely different across iterations. However, for operations such as normalization or min-max scaling, the label, the features, and the Shapley values all keep the same magnitude across iterations, so the algorithm will still perform adequately.
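
A small illustration of that last point (a sketch, not from powershap itself): refitting a StandardScaler on two different folds of the same data still yields features of the same magnitude, so Shapley values computed on them remain comparable.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))

# Pretend these are two folds of the internal cross-validation
fold_a, fold_b = X[:500], X[500:]

scaled_a = StandardScaler().fit_transform(fold_a)
scaled_b = StandardScaler().fit_transform(fold_b)

# Both folds end up with (roughly) zero mean and unit variance,
# i.e. the same magnitude across powershap iterations.
print(scaled_a.mean(axis=0), scaled_a.std(axis=0))
print(scaled_b.mean(axis=0), scaled_b.std(axis=0))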

I hope this answers your question a bit, @eduardokapp?

@eduardokapp
Author

Thank you so much for taking the time to answer my question. So, given that this makes sense, what should be done (code-wise) to make it happen? I'd be happy to implement this feature.

@eduardokapp
Author

Hey @jvdd, @JarneVerhaeghe! I've been thinking about this issue and maybe it could be solved by creating a new explainer subclass that uses the ones you already defined but also applies the pipeline transformation steps.

Excuse me for the over-simplified ideas here, but something along the lines of:

class PipelineExplainer(ShapExplainer):
    @staticmethod
    def supports_model(model) -> bool:
        from sklearn.pipeline import Pipeline
        
        # Check if model is a Pipeline
        if not isinstance(model, Pipeline):
            return False
        
        # Get the final step (estimator) of the pipeline
        estimator = model.steps[-1][1]
        
        # Check if the final step is an instance of one of the supported models
        supported_models = [
            CatBoostRegressor, CatBoostClassifier,
            LGBMClassifier, LGBMRegressor,
            XGBClassifier, XGBRegressor,
            ForestRegressor, ForestClassifier, BaseGradientBoosting,
            LinearClassifierMixin, LinearModel, BaseSGD,
            tf.keras.Model
        ]
        return isinstance(estimator, tuple(supported_models))

    def _fit_get_shap(self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs) -> np.array:
        # Get the final estimator from the pipeline
        estimator = self.model.steps[-1][1]
        
        # Fit the pipeline
        self.model.fit(X_train, Y_train, **kwargs)
        
        # Get the transformed data from all the preceding steps
        transformed_X_val = X_val
        for name, step in self.model.steps[:-1]:
            transformed_X_val = step.transform(transformed_X_val)

        # Calculate the shap values using the final estimator
        # maybe here that would be some way of just inheriting or modifying the behavior of the other classes
        explainer = shap.Explainer(estimator) 

        shap_values = explainer.shap_values(transformed_X_val)

        return shap_values

    def _validate_data(self, validate_data: Callable, X, y, **kwargs):
        # Validate the data for each step in the pipeline
        for name, step in self.model.steps[:-1]:
            X = step._validate_data(validate_data, X, **kwargs)
        return super()._validate_data(validate_data, X, y, **kwargs)

@ppawlo97

I have a similar concern; it looks like PipelineExplainer would be a good solution to that.
