Dlrm benchmark test #375

Open: wants to merge 37 commits into base: main
Changes from 22 commits

Commits (37)
90acec9
add dlrm benchnark files
ShawnXuan Jul 28, 2022
1ae2602
modify sh
ShawnXuan Jul 28, 2022
5e0b69a
add dcn benchmark files
ShawnXuan Jul 28, 2022
803cc82
add deepfm benchmark files
ShawnXuan Jul 28, 2022
a86382c
modify dlrm scripts
ShawnXuan Aug 3, 2022
e00436a
script ready
ShawnXuan Aug 4, 2022
9c015dd
make raw dataset
ShawnXuan Aug 4, 2022
fbcbc6e
update dlrm readme
ShawnXuan Aug 5, 2022
8200172
update scala files
ShawnXuan Aug 5, 2022
e1afb77
Merge branch 'main' of github.com:Oneflow-Inc/models into ctr_benchma…
ShawnXuan Aug 5, 2022
ab019bc
update
ShawnXuan Aug 5, 2022
7f3a231
update
ShawnXuan Aug 5, 2022
62d3047
update dcn benchmark scripts
ShawnXuan Aug 5, 2022
1a65042
update deepfm benchmark scripts
ShawnXuan Aug 5, 2022
0eed35c
deepfm working
ShawnXuan Aug 5, 2022
4e567c3
deepfm update
ShawnXuan Aug 5, 2022
4fcd5c2
keep dlrm benchmark only
ShawnXuan Aug 5, 2022
6a263a3
update dlrm benchmark
ShawnXuan Aug 5, 2022
ff3dce3
update
ShawnXuan Aug 5, 2022
eb0ef34
update
ShawnXuan Aug 5, 2022
6bb1a94
update
ShawnXuan Aug 5, 2022
06a2952
update
ShawnXuan Aug 5, 2022
9f20f87
rm some envs
ShawnXuan Aug 5, 2022
9768c93
disable split all reduce
ShawnXuan Aug 5, 2022
016e78d
path configurable
ShawnXuan Aug 5, 2022
2855f16
add args for shell
ShawnXuan Aug 5, 2022
cbce12e
update README.md
ShawnXuan Aug 5, 2022
85d0d81
rm persistent
ShawnXuan Aug 5, 2022
c7bf466
update
ShawnXuan Aug 5, 2022
0a50f3c
update README.md
ShawnXuan Aug 5, 2022
4ddac00
update README.md
ShawnXuan Aug 5, 2022
bfdb867
val->test
ShawnXuan Aug 8, 2022
6476417
update
ShawnXuan Aug 8, 2022
6f5f963
update
ShawnXuan Aug 8, 2022
14283f4
add docker files
ShawnXuan Nov 30, 2022
6c4654f
update
ShawnXuan Nov 30, 2022
159a4d6
update
ShawnXuan Nov 30, 2022
65 changes: 60 additions & 5 deletions RecommenderSystems/dlrm/README.md
@@ -5,11 +5,19 @@
## Directory description
```
.
|-- tools
|-- criteo1t_parquet.py # Read Criteo1T data and export it as parquet data format
|-- dlrm_train_eval.py # OneFlow DLRM training and evaluation scripts with OneEmbedding module
|-- requirements.txt # python package configuration file
└── README.md # Documentation
├── tools
│   ├── criteo1t_parquet.py # build the Criteo Terabyte dataset for OneFlow DLRM with Python
│   ├── criteo1t_parquet.scala # build the Criteo Terabyte dataset for OneFlow DLRM with Spark
│   ├── criteo1t_parquet_int32.scala # build the Criteo Terabyte dataset for OneFlow DLRM with Spark, using int32 for sparse features
│   ├── launch_spark.sh # launch Spark
│   ├── parquet_to_raw.py # convert the dataset from parquet to raw format
│   └── split_day_23.sh # split day_23 into test and validation sets
├── dlrm_train_eval.py # OneFlow DLRM training and evaluation scripts with OneEmbedding module
├── dlrm_benchmark_a100.py # OneFlow DLRM benchmark training and evaluation scripts
├── train_dlrm_benchmark.sh # OneFlow DLRM benchmark AMP training command
├── train_dlrm_benchmark_fp32.sh # OneFlow DLRM benchmark FP32 training command
├── requirements.txt # python package configuration file
└── README.md # Documentation
```

## Arguments description
@@ -101,3 +109,50 @@ python3 -m oneflow.distributed.launch \
--data_dir /path/to/dlrm_parquet \
--persistent_path /path/to/persistent
```

## Run OneFlow DLRM benchmark
1. Make the DLRM raw-format dataset (sparse feature dtype = int32)
- Split `day_23` into `test.csv` and `val.csv`. Run this in the Criteo Terabyte dataset directory that holds the extracted `day_0` to `day_23` files, such as `/RAID0/criteo1t_raw`:
```
head -n 89137319 day_23 > test.csv
tail -n +89137320 day_23 > val.csv
```
- Launch the Spark shell from the `RecommenderSystems/dlrm/tools` directory:
```
export SPARK_LOCAL_DIRS=/RAID0/tmp_spark
spark-shell \
--master "local[*]" \
--conf spark.driver.maxResultSize=0 \
--driver-memory 360G
```
- Load the Scala file in spark-shell and run `makeDlrmDatasetInt32`:
```
:load criteo1t_parquet_int32.scala
makeDlrmDatasetInt32("/RAID0/criteo1t_raw", "/RAID0/dlrm_parquet_int32")
```
- Convert the parquet dataset to OneFlow raw format:
```
# create folders manually
mkdir -p /RAID0/criteo1t_oneflow_raw/test
mkdir -p /RAID0/criteo1t_oneflow_raw/val
mkdir -p /RAID0/criteo1t_oneflow_raw/train

python parquet_to_raw.py
```

Note: this assumes the root folder of the target raw dataset is `/RAID0/criteo1t_oneflow_raw`.
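
The `head`/`tail` commands in step 1 send the first 89137319 lines of `day_23` to `test.csv` and the remaining lines to `val.csv`. As a rough illustration of that partition, here is a minimal Python sketch. `split_by_line_count` is a hypothetical helper, not part of this repository, and the shell commands remain the documented way to do the split.

```python
def split_by_line_count(src, head_path, tail_path, n_head):
    """Write the first n_head lines of src to head_path and the rest to tail_path.

    Hypothetical helper mirroring `head -n N` followed by `tail -n +(N+1)`.
    Files are streamed line by line in binary mode, so nothing is decoded
    and the whole file never has to fit in memory.
    Returns (head_line_count, tail_line_count).
    """
    head_lines = 0
    tail_lines = 0
    with open(src, "rb") as fin, \
            open(head_path, "wb") as fhead, \
            open(tail_path, "wb") as ftail:
        for line in fin:
            if head_lines < n_head:
                fhead.write(line)
                head_lines += 1
            else:
                ftail.write(line)
                tail_lines += 1
    return head_lines, tail_lines
```

Called as `split_by_line_count("day_23", "test.csv", "val.csv", 89137319)`, it would produce the same two files as the shell commands. The two returned counts always add up to the number of lines in the source file, which gives a quick sanity check on the split.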

2. Train the OneFlow DLRM benchmark in AMP mode

```
./train_dlrm_benchmark.sh
```

3. Or train the OneFlow DLRM benchmark in FP32 mode

```
./train_dlrm_benchmark_fp32.sh
```
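
Before launching either training script, it can help to confirm that the raw dataset root has the folder layout created in step 1. The sketch below is a convenience check under stated assumptions, not part of the benchmark. `check_raw_dataset_root` is a hypothetical helper; it only checks that the `train`, `val`, and `test` folders exist and are non-empty, and it does not validate file contents.

```python
import os

# Folder names taken from the mkdir commands in step 1.
SPLITS = ("train", "val", "test")


def check_raw_dataset_root(root):
    """Return a list of problems found under root.

    Hypothetical pre-flight check: an empty list means every expected
    split folder exists and contains at least one entry.
    """
    problems = []
    for split in SPLITS:
        path = os.path.join(root, split)
        if not os.path.isdir(path):
            problems.append(f"missing folder: {path}")
        elif not os.listdir(path):
            problems.append(f"empty folder: {path}")
    return problems
```

For example, `check_raw_dataset_root("/RAID0/criteo1t_oneflow_raw")` returns an empty list once `parquet_to_raw.py` has populated all three folders.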
