
[Feature request] Add a possibility to persist artifacts besides the model itself #46

Open
benjamin-work opened this issue Aug 29, 2017 · 6 comments


@benjamin-work
Collaborator

At the moment, only the model can be persisted and loaded. However, there are scenarios that necessitate saving and loading additional data.

For example, assume we have a regression problem. We want to normalize the targets to a certain range during training, but when the predict service is called, the predictions should be mapped back to the original range. Transforming the targets is not part of an sklearn pipeline, so we may do it during data loading. However, when we start the prediction service, we need access to that mapping. Currently, we would either have to load the data again to regenerate the mapping, or try to save the mapping as an attribute of the model.

Ideally, we would be able to save and load the mapping using Palladium's own tools. The solution should not be too specific to the example above, but rather provide a general way to persist additional artifacts.
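For concreteness, here is a minimal sketch of the scenario (the loader and file format are made up): the target mapping is computed during data loading, outside any sklearn pipeline, so the prediction service has no obvious way to get hold of it later.

```python
import numpy as np

def load_data(path):
    # Hypothetical loader: the target mapping is derived here, outside any
    # sklearn pipeline, so only the training run ever sees it.
    data = np.loadtxt(path, delimiter=',')
    X, y = data[:, :-1], data[:, -1]
    y_min, y_max = y.min(), y.max()
    y_scaled = (y - y_min) / (y_max - y_min)
    return X, y_scaled, (y_min, y_max)  # where should (y_min, y_max) live?
```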

@dnouri
Collaborator

dnouri commented Aug 29, 2017

Another way to deal with this is to move the normalization into a model wrapper (a "meta-estimator" in scikit-learn terms). A NormalizeTarget wrapper would normalize the targets on the way in and map predictions back on the way out. The model becomes somewhat more self-contained this way, which may be a good thing regardless.
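A minimal sketch of such a wrapper, assuming a simple min-max scaling of the targets (the class is hypothetical, not part of Palladium or scikit-learn):

```python
from sklearn.base import BaseEstimator, RegressorMixin


class NormalizeTarget(BaseEstimator, RegressorMixin):
    """Wrap a regressor: scale targets to [0, 1] for fitting and map
    predictions back to the original range."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Remember the original target range; it travels with the model.
        self.y_min_ = float(y.min())
        self.y_max_ = float(y.max())
        y_scaled = (y - self.y_min_) / (self.y_max_ - self.y_min_)
        self.estimator.fit(X, y_scaled)
        return self

    def predict(self, X):
        # De-normalize on the way out.
        y_scaled = self.estimator.predict(X)
        return y_scaled * (self.y_max_ - self.y_min_) + self.y_min_
```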

@benjamin-work
Collaborator Author

Yes, for this specific case, that would work. For other cases, it could be an awkward solution. I could imagine a more general solution having a "cache" that is simply stored together with the model, so that there is no need to handle separate files.

@dnouri
Collaborator

dnouri commented Aug 29, 2017

There's a utility called palladium.interfaces.annotate, which Palladium uses to store the model version along with the model pickle. It's essentially a glorified way of sticking an attribute onto the object before it's pickled.

To stick something in, you would call annotate(model, {'useful': 'stuffs'}); to get it out again (say in production, after loading): stuffs = annotate(model)['useful'].
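A sketch of how that could cover the target-range example, based only on the annotate calls described above (the surrounding function names and the 'target_range' key are made up):

```python
from palladium.interfaces import annotate

def train_model(model, X, y):
    lo, hi = float(y.min()), float(y.max())
    model.fit(X, (y - lo) / (hi - lo))
    # Attach the extra artifact to the model before it is persisted.
    annotate(model, {'target_range': [lo, hi]})
    return model

def predict_original_scale(model, X):
    # In production, after the model has been loaded again.
    lo, hi = annotate(model)['target_range']
    return model.predict(X) * (hi - lo) + lo
```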

@benjamin-work
Collaborator Author

Okay, so you would suggest using this if extra data needs to be saved?

@dnouri
Collaborator

dnouri commented Aug 29, 2017

> Okay, so you would suggest using this if extra data needs to be saved?

Hmm, I just had another look, and it seems that at least palladium.persistence.Database assumes it can call json.dumps on the annotations. (It then stores the annotations in a separate column.) So this won't work for all types of data.

That leaves us with what I assume you already did, which is sticking attributes on the model object. Not too nice, but probably nicer than having to worry about storing extra data somewhere else and supporting that in all persisters.
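The attribute-on-the-model workaround as a tiny, self-contained sketch (the attribute name is arbitrary; plain pickle stands in for a Palladium persister):

```python
import pickle
import numpy as np
from sklearn.linear_model import LinearRegression

X, y = np.random.rand(20, 3), np.random.rand(20) * 100.0
lo, hi = float(y.min()), float(y.max())

model = LinearRegression().fit(X, (y - lo) / (hi - lo))
model.target_range_ = (lo, hi)                # stick the artifact on the model

restored = pickle.loads(pickle.dumps(model))  # persistence round trip
lo, hi = restored.target_range_               # still there after loading
print(restored.predict(X[:2]) * (hi - lo) + lo)
```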

If you prefer to use something like annotate, then we could make a trivial change and allow other keys besides __metadata__ in annotate. (__metadata__ is the key it's trying to be clever about when persisting.)

@benjamin-work
Collaborator Author

But isn't the model just a blob to the persister anyway? Instead of persisting the model alone, could we not persist something like {'model': model, 'cache': cache}? That way, we wouldn't need to store anything extra or worry about keeping the model and the extra data in sync.
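A minimal sketch of that idea, with plain pickle standing in for a Palladium persister (the persisters and the prediction service would of course have to understand such a bundle for this to work end to end):

```python
import pickle

def save_bundle(path, model, cache):
    # Persist model and extra artifacts as one object, so they stay in sync.
    with open(path, 'wb') as f:
        pickle.dump({'model': model, 'cache': cache}, f)

def load_bundle(path):
    with open(path, 'rb') as f:
        bundle = pickle.load(f)
    return bundle['model'], bundle['cache']
```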
