[WIP] Drop dask-xgboost #834

Draft · wants to merge 2 commits into base: main
1 change: 0 additions & 1 deletion ci/environment-docs.yaml
@@ -44,7 +44,6 @@ dependencies:
# to allow CI to pass
- dask !=2021.3.0
- dask-glm
- dask-xgboost
- pip:
- dask_sphinx_theme >=1.1.0
- graphviz
7 changes: 0 additions & 7 deletions dask_ml/xgboost.py

This file was deleted.

2 changes: 0 additions & 2 deletions docs/source/history.rst
@@ -6,7 +6,6 @@ focused around particular sub-domains of machine learning.

- dask-searchcv_: Scalable model selection
- dask-glm_: Generalized Linear Model solvers
- dask-xgboost_: Connection to the XGBoost library
- dask-tensorflow_: Connection to the Tensorflow library

While these special-purpose libraries were convenient for development, they
@@ -20,5 +19,4 @@ future development.

.. _dask-searchcv: https://github.com/dask/dask-searchcv
.. _dask-glm: https://github.com/dask/dask-glm
.. _dask-xgboost: https://github.com/dask/dask-xgboost
.. _dask-tensorflow: https://github.com/dask/dask-tensorflow
9 changes: 1 addition & 8 deletions docs/source/index.rst
@@ -79,13 +79,6 @@ re-implement these systems. Instead, Dask-ML makes it easy to use normal Dask
workflows to prepare and set up data, then it deploys XGBoost
*alongside* Dask, and hands the data over.

.. code-block:: python

from dask_ml.xgboost import XGBRegressor

est = XGBRegressor(...)
est.fit(train, train_labels)

See :doc:`Dask-ML + XGBoost <xgboost>` for more information.


@@ -132,4 +125,4 @@

.. _Dask: https://dask.org/
.. _Scikit-Learn: http://scikit-learn.org/
.. _XGBoost: https://ml.dask.org/xgboost.html
.. _XGBoost: https://ml.dask.org/xgboost.html
20 changes: 0 additions & 20 deletions docs/source/modules/api.rst
@@ -263,26 +263,6 @@ Classification Metrics
metrics.log_loss


:mod:`dask_ml.xgboost`: XGBoost
===============================

.. automodule:: dask_ml.xgboost

.. currentmodule:: dask_ml.xgboost

.. autosummary::
:toctree: generated/
:template: class.rst

XGBClassifier
XGBRegressor

.. autosummary::
:toctree: generated/

train
predict

:mod:`dask_ml.datasets`: Datasets
======================================================

67 changes: 6 additions & 61 deletions docs/source/xgboost.rst
@@ -1,76 +1,21 @@
XGBoost & LightGBM
==================

.. currentmodule:: dask_ml.xgboost

XGBoost_ is a powerful and popular library for gradient boosted trees. For
larger datasets or faster training XGBoost also provides a distributed
computing solution. LightGBM_ is another library similar to XGBoost; it also
provides native distributed training for decision trees.

Dask-ML can set up distributed XGBoost or LightGBM for you and hand off data
from distributed dask.dataframes. This automates much of the hassle of
preprocessing and setup while still letting XGBoost/LightGBM do what they do
well.

Below, we'll refer to an example with XGBoost. Here are the relevant XGBoost
classes/functions:
Both XGBoost and LightGBM provide Dask implementations for distributed
training. These can take Dask objects such as Arrays and DataFrames as input,
which allows any initial loading and processing of data to be done with Dask
before handing over to XGBoost/LightGBM to do what they do well.

.. autosummary::
train
predict
XGBClassifier
XGBRegressor
The XGBoost implementation can be found at https://github.com/dmlc/xgboost, with documentation at
https://xgboost.readthedocs.io/en/latest/tutorials/dask.html.

The LightGBM implementation can be found at https://github.com/microsoft/LightGBM, with documentation at
https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#dask.

Example
-------

.. code-block:: python

from dask.distributed import Client
client = Client('scheduler-address:8786')

import dask.dataframe as dd
df = dd.read_parquet('s3://...')

# Split into training and testing data
train, test = df.random_split([0.8, 0.2])

# Separate labels from data
train_labels = train.x > 0
test_labels = test.x > 0

del train['x'] # remove informative column from data
del test['x'] # remove informative column from data

# from xgboost import XGBRegressor # change import
from dask_ml.xgboost import XGBRegressor

est = XGBRegressor(...)
est.fit(train, train_labels)

prediction = est.predict(test)

How this works
--------------

Dask sets up XGBoost's master process on the Dask scheduler and XGBoost's worker
processes on Dask's worker processes. Then it moves all of the Dask
dataframes' constituent Pandas dataframes to XGBoost and lets XGBoost train.
Fortunately, because XGBoost has an excellent Python interface, all of this can
happen in the same process without any data transfer. The two distributed
services can operate together on the same data.

When XGBoost is finished training Dask cleans up the XGBoost infrastructure and
continues on as normal.

This work was a collaboration with XGBoost and SKLearn maintainers. See
relevant GitHub issue here: `dmlc/xgboost #2032 <https://github.com/dmlc/xgboost/issues/2032>`_

See the ":doc:`Dask-ML examples <examples>`" for an example usage.

.. _XGBoost: https://xgboost.readthedocs.io/
.. _LightGBM: https://lightgbm.readthedocs.io/
2 changes: 1 addition & 1 deletion setup.py
@@ -35,7 +35,7 @@
"pytest-mock",
]
dev_requires = doc_requires + test_requires
xgboost_requires = ["dask-xgboost", "xgboost"]
xgboost_requires = ["xgboost"]
complete_requires = xgboost_requires

extras_require = {