- The complete workflow is shared at Competition | 玉山人工智慧公開挑戰賽2021冬季賽 — 信用卡消費類別推薦 8th Solution.
- The reproduced result is logged at Wandb | EsunReproduce.
Given raw features of customers (e.g., customer properties, credit card transaction information), the objective is to predict and recommend, for each customer, the top-3 consumption categories (i.e., `shop_tag`) ranked by total transaction amount (i.e., `txn_amt`). An illustration is as follows:
| chid | top1 | top2 | top3 |
| --- | --- | --- | --- |
| 10128239 | 18 | 10 | 6 |
| 10077943 | 48 | 22 | 6 |
First, 49 consumption categories (i.e., `shop_tag`) are given, but participants are asked to predict only 16 of them; that is, only these 16 categories can appear in the final submission (recommendation). Second, all 500,000 customers (i.e., `chid`) are prediction targets; in other words, every customer must be included in the final submission.
The following is a step-by-step guide for generating the final result. For quicker inference of the best result, please skip this section and go directly to Quick Inference for The Best Result.
The very first step is to generate the raw data (e.g., raw DataFrames, feature maps) for the subsequent EDA and feature engineering processes. Note that this step has high memory consumption. Raw data is generated as follows (the following argument setting is just an example):
a. Put the raw data `tbrain_cc_training_48tags_hash_final.csv` in folder `data/raw/`.
b. Run command
```
python -m data_preparation.convert_type
```
Output partitioned files are dumped under path `data/partitioned/`.
c. Run command
```
python -m data_preparation.gen_raw_df
```
Output raw DataFrames `raw_data.parquet` and `raw_txn_amts.parquet` are dumped under path `data/raw/`.
d. Run command
```
python -m data_preparation.gen_feat_map --feat-type <feat-type>
```
Output feature maps are dumped under either `data/processed/feat_map/` or `data/processed/feat_map_txn_amt/`.
e. Run command
```
python -m data_preparation.gen_purch_map
```
Output purchasing map `purch_maps.pkl` is dumped under path `data/processed/`.
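For intuition, here is a minimal sketch of the kind of type conversion and partitioning step b performs; the dtype mapping and chunk size are assumptions, not the repo's actual settings.

```python
import pandas as pd

# Hypothetical dtype mapping; the actual convert_type settings may differ.
dtypes = {"chid": "int32", "shop_tag": "category", "txn_amt": "float32"}

# Read the large CSV in chunks and dump each chunk as one partition,
# mirroring the data/partitioned/ layout described above.
reader = pd.read_csv(
    "data/raw/tbrain_cc_training_48tags_hash_final.csv",
    dtype=dtypes, chunksize=1_000_000,
)
for i, chunk in enumerate(reader):
    chunk.to_parquet(f"data/partitioned/part_{i:03d}.parquet")
```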
The complete training process is configured by three configuration files: `config/data_gen.yaml`, `config/data_samp.yaml`, and `config/lgbm.yaml`.
`data_gen.yaml` controls data constraints, feature engineering, and final dataset generation. `data_samp.yaml` is related to sample weight generation, and `lgbm.yaml` contains the hyperparameter settings for the LightGBM classifier.
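As a quick illustration (assuming PyYAML; the key names inside the files are not shown here), the three configuration files could be loaded like this:

```python
import yaml  # PyYAML

with open("config/data_gen.yaml") as f:
    data_gen_cfg = yaml.safe_load(f)   # data constraints, feature engineering
with open("config/data_samp.yaml") as f:
    data_samp_cfg = yaml.safe_load(f)  # sample weight generation
with open("config/lgbm.yaml") as f:
    lgbm_cfg = yaml.safe_load(f)       # LightGBM hyperparameters

print(lgbm_cfg)  # e.g., inspect the 'device' option (gpu vs. cpu) noted below
```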
To better manage experimental trials, I use Wandb to record the training process, log debugging messages, and store output objects.
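For readers unfamiliar with Wandb, a minimal sketch of this bookkeeping pattern follows; the artifact name and logged value are hypothetical (`Esun` is the project name referenced later in `blend.py`).

```python
import wandb

run = wandb.init(project="Esun", job_type="train")
run.log({"ndcg@3": 0.7})  # record training metrics (value is illustrative)

artifact = wandb.Artifact("lgbm", type="model")  # hypothetical artifact name
artifact.add_dir("output/models/")               # store output objects
run.log_artifact(artifact)
run.finish()
```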
The base model is trained as follows (the following argument setting is just an example):

1. Configure `config/data_gen.yaml`. For more detailed information, please refer to `data_gen_template.yaml`.
2. Configure `config/data_samp.yaml`. The default setting can obtain the best performance, but please feel free to play around with it.
3. Configure `config/lgbm.yaml`. The default setting can obtain relatively stable performance, and this is the hyperparameter set I use to train all the base models. If there's no GPU support, please set the `device` option to `cpu`.
4. Run command
```
python -m tools.train_tree --model-name lgbm --n-folds 1 --eval-metrics ndcg@3 --train-leg True --train-like-production True --val-like-production True --mcls True --eval-train-set True
```
For more detailed information about the arguments, please run command `python -m tools.train_tree -h`.
Output structure is as follows:
```
output/
├── config/
├── models/
├── pred_reports/
```
All dumped objects are pushed to the Wandb remote.
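For reference, the `ndcg@3` evaluation metric used above rewards placing the true category near the top of the ranked list. A simplified single-label sketch (the actual evaluation may aggregate multiple relevant categories per customer):

```python
import numpy as np

def ndcg_at_3(y_true, ranked_preds):
    """NDCG@3 for a single relevant category: 1/log2(rank + 1) if the true
    category appears in the top-3 ranked predictions, else 0."""
    for rank, pred in enumerate(ranked_preds[:3], start=1):
        if pred == y_true:
            return 1.0 / np.log2(rank + 1)
    return 0.0

print(ndcg_at_3(10, [18, 10, 6]))  # hit at rank 2 -> 1/log2(3) ≈ 0.631
```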
For single base model inference, the pre-trained LightGBM classifier is first pulled from the Wandb remote, and then the probability distribution is predicted. Single base model inference is run as follows (the following argument setting is just an example):
Run command
```
python -m tools.pred_tree --model-name lgbm --model-version 0 --val-month 24 --pred-month 25 --mcls True
```
For more detailed information about the arguments, please run command `python -m tools.pred_tree -h`.
Output structure is as follows:
```
output/
├── pred_results/
├── submission.csv   # For quick submission
```
All dumped objects, excluding `output/submission.csv`, are pushed to the Wandb remote.
Because a single model faces a performance bottleneck, a stacking mechanism is implemented to boost the performance.
The stacker training process is controlled by `config/lgbm_meta.yaml` or `config/xgb_meta.yaml`, depending on the stacker choice. Furthermore, if restacking (i.e., stacking with other raw features) is enabled, then setting `config/data_gen.yaml` is necessary.
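A minimal sketch of the stacking idea described above (illustrative, not the repo's implementation): out-of-fold probability distributions from several base models become the meta-model's input features.

```python
import numpy as np

# Hypothetical OOF predictions from 3 base models, each of shape
# (n_samples, 16 classes); real versions come from --oof-versions.
oof_preds = [np.random.rand(1000, 16) for _ in range(3)]
X_meta = np.hstack(oof_preds)  # (1000, 48) meta features for the stacker

# With restacking enabled, raw features (built via config/data_gen.yaml)
# are concatenated as additional meta features.
raw_feats = np.random.rand(1000, 20)  # hypothetical raw feature block
X_meta_restack = np.hstack([X_meta, raw_feats])
```

A stacker (LightGBM or XGBoost classifier) is then trained on these meta features with the same multiclass objective.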
The stacker is trained as follows (the following argument setting is just an example). For more detailed information, please refer to `data_gen_template.yaml`.
Run command
```
python -m tools.train_stacker --meta-model-name xgb --n-folds 5 --eval-metrics ndcg@3 --objective mcls --oof-versions l184 l186 l187 l190 l192 l194 l195 b1 b2 b3
```
For more detailed information about the arguments, please run command `python -m tools.train_stacker -h`.
Output structure is as follows:
```
output/
├── cv_hist.pkl
├── meta_models/
├── pred_reports/
├── config/
```
All dumped objects are pushed to the Wandb remote.
For meta model inference, the pre-trained LightGBM or XGBoost stacker (i.e., classifier) is first pulled from the Wandb remote, and then the probability distribution is predicted. Meta model inference is run as follows (the following argument setting is just an example):
Run command
```
python -m tools.pred_stacker --meta-model-name xgb --meta-model-version 0 --pred-month 25 --objective mcls --oof-versions l184 l186 l187 l190 l192 l194 l195 b1 b2 b3 --unseen-versions l48 l50 l51 l54 l58 l60 l61 b1 b2 b3
```
For more detailed information about the arguments, please run command `python -m tools.pred_stacker -h`.
Output structure is as follows:
```
output/
├── pred_results/
├── submission.csv   # For quick submission
```
All dumped objects, excluding `output/submission.csv`, are pushed to the Wandb remote.
To better combine the merits of different models (either base models or meta models), blending with coefficients optimized by Bayesian optimization is implemented. Blending is run as follows (the following argument setting is just an example):
Run Bayesian optimization in `ensemble.ipynb` to obtain the blending coefficients (a sketch of this step follows the output structure below).
Run command
```
python -m tools.blend --oof-versions l16 l18 x8 x10 --unseen-versions l10 l12 x7 x9 --weights 0.144372 0.856641 0.307942 0.19094 --meta True
```
For more detailed information about the arguments, please run command `python -m tools.blend -h`.
Output structure is as follows:

1. For blending oof predictions:
```
output/
├── pred_reports/
```
2. For blending unseen predictions:
```
output/
├── pred_results/
├── submission.csv
```
All dumped objects are pushed to the Wandb remote.
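As a sketch of the `ensemble.ipynb` step above (assuming the `bayesian-optimization` package; the objective here is a toy placeholder for the real OOF blending score):

```python
from bayes_opt import BayesianOptimization

def blend_score(w1, w2, w3, w4):
    # Toy placeholder objective; in practice this would return the NDCG@3
    # of the OOF predictions blended with weights [w1, w2, w3, w4].
    return -((w1 - 0.144) ** 2 + (w2 - 0.857) ** 2
             + (w3 - 0.308) ** 2 + (w4 - 0.191) ** 2)

optimizer = BayesianOptimization(
    f=blend_score,
    pbounds={"w1": (0, 1), "w2": (0, 1), "w3": (0, 1), "w4": (0, 1)},
    random_state=42,
)
optimizer.maximize(init_points=10, n_iter=30)
print(optimizer.max)  # best coefficients, passed to tools.blend via --weights
```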
This section provides a shortcut to obtain the performance on the leaderboard. The best result can be generated as follows (the following argument setting is just an example):
Modify the `project` parameter in `wandb.init()` to `Esun` in script `blend.py`.
Run command
```
python -m tools.blend --oof-versions l16 l18 x8 x10 --unseen-versions l10 l12 x7 x9 --weights 0.144372 0.856641 0.307942 0.19094 --meta True
```