I'm quite sure that memory issues are not a problem during development as long as we stick to sampled & optimized versions of the full dataset. The problem arises during training for the final leaderboard, where we should use as much data as possible to train our model.

So, as a first solution, we can try to use the whole 750-million-row dataset plus our custom features and optimize the hell out of it with pandas, using the smallest data representation possible for each feature we choose to keep (see the sketch below). This should yield a pandas dataframe that hopefully fits in memory. If that does not work and we still run out of memory on the cluster, I see two further options here.

For now, let's not worry about this: we can use sampled datasets to keep memory requirements down and fit our models directly on pandas dataframes. This applies only to model training; feature engineering should still be done through PySpark/Koalas to keep things fast.

Data -> Load as Spark dataframes -> Feature engineering in PySpark/Koalas -> Transform Spark DF to Pandas DF and hope memory does not run out -> Fit model on Pandas DF -> profit
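To make the "smallest data representation possible" part concrete, here is a minimal sketch of per-feature downcasting in pandas. The column names and dtypes are hypothetical placeholders, not our final feature list:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from feature name to the smallest dtype that still holds it.
DTYPES = {
    "follower_count": np.uint32,     # counts comfortably fit in 32 bits
    "is_verified": np.bool_,         # boolean flags instead of int64/object
    "tweet_type": "category",        # low-cardinality strings as categoricals
    "engagement_delay": np.float32,  # downcast floats where precision allows
}

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Cast each known column to the smallest dtype we agreed on."""
    for col, dtype in DTYPES.items():
        if col in df.columns:
            df[col] = df[col].astype(dtype)
    return df

# Usage: df = shrink(df); then df.info(memory_usage="deep") to check the actual footprint.
```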
### The issue
While trying to reimplement the native xgboost baseline using the new framework, I have encountered some compatibility issues between Koalas and xgboost. In particular, xgboost's `DMatrix` API accepts pandas dataframes but not Koalas ones. Since NVIDIA used cuDF, another "compatible" dataframe implementation, I wondered how they solved this compatibility issue.

Answer: they didn't. Apparently the processed data, with `shape = (60714530, 69)`, is small enough to be kept in a single pandas dataframe. See below for the code.
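For reference, a minimal reproduction of the mismatch (my own snippet, not code from the baseline); I'd expect the Koalas call to be rejected, though the exact exception may vary by xgboost version:

```python
import databricks.koalas as ks
import pandas as pd
import xgboost as xgb

pdf = pd.DataFrame({"x": [0.1, 0.2, 0.3], "y": [0, 1, 0]})
kdf = ks.from_pandas(pdf)

# Works: DMatrix understands pandas inputs.
dtrain = xgb.DMatrix(pdf[["x"]], label=pdf["y"])

# Fails: a Koalas dataframe is not a supported DMatrix input type.
try:
    xgb.DMatrix(kdf[["x"]], label=kdf["y"])
except (TypeError, ValueError) as err:
    print(f"Koalas input rejected: {err}")
```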
### Proposed approach
My proposal is to do as they did, but with Koalas and CPUs rather than cuDF and a GPU cluster: use Koalas for the heavy stuff, i.e. feature engineering, and convert to pandas once the data is small enough to be handled on a single machine.
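A rough sketch of that flow; the path, feature names, target column, and xgboost params below are placeholders for illustration, not our actual config:

```python
import databricks.koalas as ks
import xgboost as xgb

# Heavy lifting stays distributed: load and engineer features with Koalas.
kdf = ks.read_parquet("/data/train.parquet")  # hypothetical path
kdf = kdf.assign(follower_ratio=kdf["follower_count"] / (kdf["following_count"] + 1))

# Once the engineered frame is small enough, collect it on the driver as pandas.
pdf = kdf.to_pandas()

features = ["follower_ratio", "follower_count"]          # hypothetical feature list
dtrain = xgb.DMatrix(pdf[features], label=pdf["reply"])  # "reply" as an example target

params = {"objective": "binary:logistic", "eval_metric": "logloss", "tree_method": "hist"}
booster = xgb.train(params, dtrain, num_boost_round=200)
```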
Today we talked about how `.to_pandas()` in `H2OXGBoostBaseline` might explode when running on the full dataset. What if it doesn't?

### NVIDIA's xgboost training code
Source: https://nbviewer.jupyter.org/github/rapidsai/deeplearning/blob/main/RecSys2020/02_ModelsCompetition/XGBoost1/5-xgb-train-REPLY.ipynb
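The notebook itself isn't reproduced here; purely as an illustration of the pattern described above (not NVIDIA's actual code), the GPU-side equivalent would look roughly like this:

```python
import cudf
import xgboost as xgb

# Feature engineering happens on the GPU with cuDF; the processed frame
# (~60.7M rows x 69 columns) then lives in a single dataframe on one node.
gdf = cudf.read_parquet("train_features.parquet")  # hypothetical path
pdf = gdf.to_pandas()                              # collected into a single pandas dataframe

dtrain = xgb.DMatrix(pdf.drop(columns=["reply"]), label=pdf["reply"])  # "reply" as example target
params = {"objective": "binary:logistic", "tree_method": "gpu_hist", "eval_metric": "logloss"}
booster = xgb.train(params, dtrain, num_boost_round=500)
```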
### Additional resources
https://databricks.com/blog/2020/11/16/how-to-train-xgboost-with-spark.html