I'm quite sure that memory issues are not a problem during development as long as we stick to sampled & optimized versions of the full dataset. The problem arises during training for the final leaderboard, where we should use as much data as possible to train our model.

So, as a first solution, we can try to use the whole 750-million-row dataset plus our custom features and optimize the hell out of it with pandas, using the smallest data representation possible for each feature we choose to keep (see the sketch below). This should yield a pandas dataframe that hopefully fits in memory. If that does not work and we still run out of memory on the cluster, I see two further options here.

For now, let's not worry about this: we can use sampled datasets to keep memory requirements down and fit our models directly on pandas dataframes. This applies only to model training; feature engineering should still be done through PySpark/Koalas to keep things fast.

Data -> Load as Spark dataframes -> Feature engineering in PySpark/Koalas -> Transform Spark DF to Pandas DF and hope memory does not run out -> Fit model on Pandas DF -> profit
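To make the "smallest data representation possible" part concrete, here is a minimal sketch of per-feature downcasting in pandas. The column names and dtypes are hypothetical placeholders, not our final feature list:

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from feature name to the smallest dtype that still holds it.
DTYPES = {
    "follower_count": np.uint32,     # counts comfortably fit in 32 bits
    "is_verified": np.bool_,         # boolean flags instead of int64/object
    "tweet_type": "category",        # low-cardinality strings as categoricals
    "engagement_delay": np.float32,  # downcast floats where precision allows
}

def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Cast each known column to the smallest dtype we agreed on."""
    for col, dtype in DTYPES.items():
        if col in df.columns:
            df[col] = df[col].astype(dtype)
    return df

# Usage: df = shrink(df); then df.info(memory_usage="deep") to check the actual footprint.
```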
### The issue
While trying to reimplement the native xgboost baseline using the new framework, I have encountered some compatibility issues between Koalas and xgboost. In particular, xgboost's `DMatrix` API accepts pandas dataframes but not Koalas ones. Since NVIDIA used cuDF, another "compatible" dataframe implementation, I wondered how they solved this compatibility issue.

Answer: they didn't. Apparently the processed data, with `shape = (60714530, 69)`, is small enough to be kept in a single pandas dataframe. See below for the code.
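For reference, a minimal reproduction of the mismatch (my own snippet, not code from the baseline); I'd expect the Koalas call to be rejected, though the exact exception may vary by xgboost version:

```python
import databricks.koalas as ks
import pandas as pd
import xgboost as xgb

pdf = pd.DataFrame({"x": [0.1, 0.2, 0.3], "y": [0, 1, 0]})
kdf = ks.from_pandas(pdf)

# Works: DMatrix understands pandas inputs.
dtrain = xgb.DMatrix(pdf[["x"]], label=pdf["y"])

# Fails: a Koalas dataframe is not a supported DMatrix input type.
try:
    xgb.DMatrix(kdf[["x"]], label=kdf["y"])
except (TypeError, ValueError) as err:
    print(f"Koalas input rejected: {err}")
```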
### Proposed approach
My proposal is to do as they did, but with Koalas and CPUs rather than cuDF and a GPU cluster: use Koalas for the heavy stuff, i.e. feature engineering, and convert to pandas once the data is small enough to be handled on a single machine.
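A rough sketch of that flow; the path, feature names, target column, and xgboost params below are placeholders for illustration, not our actual config:

```python
import databricks.koalas as ks
import xgboost as xgb

# Heavy lifting stays distributed: load and engineer features with Koalas.
kdf = ks.read_parquet("/data/train.parquet")  # hypothetical path
kdf = kdf.assign(follower_ratio=kdf["follower_count"] / (kdf["following_count"] + 1))

# Once the engineered frame is small enough, collect it on the driver as pandas.
pdf = kdf.to_pandas()

features = ["follower_ratio", "follower_count"]          # hypothetical feature list
dtrain = xgb.DMatrix(pdf[features], label=pdf["reply"])  # "reply" as an example target

params = {"objective": "binary:logistic", "eval_metric": "logloss", "tree_method": "hist"}
booster = xgb.train(params, dtrain, num_boost_round=200)
```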
Today we talked about how `.to_pandas()` in `H2OXGBoostBaseline` might explode when running on the full dataset. What if it doesn't?

### NVIDIA's xgboost training code
Source: https://nbviewer.jupyter.org/github/rapidsai/deeplearning/blob/main/RecSys2020/02_ModelsCompetition/XGBoost1/5-xgb-train-REPLY.ipynb
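The notebook itself isn't reproduced here; purely as an illustration of the pattern described above (not NVIDIA's actual code), the GPU-side equivalent would look roughly like this:

```python
import cudf
import xgboost as xgb

# Feature engineering happens on the GPU with cuDF; the processed frame
# (~60.7M rows x 69 columns) then lives in a single dataframe on one node.
gdf = cudf.read_parquet("train_features.parquet")  # hypothetical path
pdf = gdf.to_pandas()                              # collected into a single pandas dataframe

dtrain = xgb.DMatrix(pdf.drop(columns=["reply"]), label=pdf["reply"])  # "reply" as example target
params = {"objective": "binary:logistic", "tree_method": "gpu_hist", "eval_metric": "logloss"}
booster = xgb.train(params, dtrain, num_boost_round=500)
```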
### Additional resources
https://databricks.com/blog/2020/11/16/how-to-train-xgboost-with-spark.html