tune-sklearn-xgboost

This repository was prepared for a PyBay 2021 talk.

The problem is based on the Kaggle safe-driver contest: the goal is to predict whether a driver will file an insurance claim in the next year.

The original dataset has 57 columns. For the sake of time, I removed about half of them, dropping the less important features.

The demo will go like this:

  • Run a default XGBoost classifier and look at the result.

  • Motivate the problem: there are many hyperparameters to tune, and GridSearchCV and RandomizedSearchCV can be improved upon by more advanced search and scheduling algorithms.

  • Show that TuneSearchCV has an API very similar to GridSearchCV's.

  • Run an HPO sweep over a set of parameters (this currently takes about 40 minutes, as would any serious HPO run).

  • Plan to just kick off the run and direct the audience to a few things in the logs:

    • Resource allocation, indicated by lines like "Resources requested: 2.0/40 CPUs, 0/0 GPUs, 0.0/57.2 GiB heap, 0.0/4.66 GiB objects".
    • The HyperBand algorithm pausing and restoring trials throughout the run.
    • The current best trial.
  • Plan to show the pre-captured result.

  • Plan to run best_model_ on the test set and show the new result. This should improve the Gini score by around 5%.
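The Gini score mentioned above is the normalized Gini coefficient used for this contest, which relates to ROC AUC as Gini = 2 * AUC - 1. A minimal sketch (the helper name `gini_score` is mine, not from the demo code):

```python
from sklearn.metrics import roc_auc_score

def gini_score(y_true, y_prob):
    """Normalized Gini coefficient: Gini = 2 * AUC - 1."""
    return 2 * roc_auc_score(y_true, y_prob) - 1

# A perfect ranking gives Gini = 1.0; a random ranking gives ~0.0.
print(gini_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # → 1.0
```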

Before the demo, a cluster is started. Port forwarding is set up with `ray attach cluster.yaml -p 9999`, and Jupyter is run on the head node of the cluster with `jupyter lab --port=9999`.
