tune-sklearn-xgboost

This is to prepare for PyBay talk 2021.

The problem is based on Kaggle safe driver contest. The goal is to predict if a driver will file an insurance claim in the next year.

The original dataset has 57 columns. For the sake of time, I removed about half of the less important features.

The demo will go like this:

run a default xgboost classifier and see the result.
incentivize that there are so many hyperparameters to tune and GridSearchCV and RandomizedSearchCV can be improved by more advanced search and scheduling algorithms.
show TuneSearchCV has very similar API as GridSearchCV.
run a hpo with a bunch of parameters (this currently takes about 40 minutes, as would be with any serious hpo).
plan to just issue the run and direct the audience to notice a few things from the logs
- resource allocations, indicated by lines like "Resources requested: 2.0/40 CPUs, 0/0 GPUs, 0.0/57.2 GiB heap, 0.0/4.66 GiB objects"
- hyperband algorithm, trials are paused and restored all the time
- Current best trial
plan to show the pre-captured result
plan to run the best_model_ through test set and show the new result. Should improve the gini_score by around 5%.

Before the demo, a cluster will be started beforehand. Port forwarding is set up through ray attach cluster.yaml -p 9999. Jupyter notebook is run on head node of the cluster through jupyter lab --port=9999

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

tune-sklearn-xgboost

Files

README.md

Latest commit

History

README.md

File metadata and controls

tune-sklearn-xgboost