Skip to content

Latest commit

 

History

History
24 lines (18 loc) · 1.53 KB

README.md

File metadata and controls

24 lines (18 loc) · 1.53 KB

tune-sklearn-xgboost

This is to prepare for PyBay talk 2021.

The problem is based on Kaggle safe driver contest. The goal is to predict if a driver will file an insurance claim in the next year.

The original dataset has 57 columns. For the sake of time, I removed about half of the less important features.

The demo will go like this:

  • run a default xgboost classifier and see the result.

  • incentivize that there are so many hyperparameters to tune and GridSearchCV and RandomizedSearchCV can be improved by more advanced search and scheduling algorithms.

  • show TuneSearchCV has very similar API as GridSearchCV.

  • run a hpo with a bunch of parameters (this currently takes about 40 minutes, as would be with any serious hpo).

  • plan to just issue the run and direct the audience to notice a few things from the logs

    • resource allocations, indicated by lines like "Resources requested: 2.0/40 CPUs, 0/0 GPUs, 0.0/57.2 GiB heap, 0.0/4.66 GiB objects"
    • hyperband algorithm, trials are paused and restored all the time
    • Current best trial
  • plan to show the pre-captured result

  • plan to run the best_model_ through test set and show the new result. Should improve the gini_score by around 5%.

Before the demo, a cluster will be started beforehand. Port forwarding is set up through ray attach cluster.yaml -p 9999. Jupyter notebook is run on head node of the cluster through jupyter lab --port=9999