Dlrm profile #344

Draft
wants to merge 38 commits into
base: main
e21226b
MMoe parquet script;
Liuxinman Jun 2, 2022
aa5d356
dlrm profile
ShawnXuan Jun 2, 2022
a628a82
update
ShawnXuan Jun 2, 2022
c430e5f
Add Mmoe dataloader;
Liuxinman Jun 8, 2022
9e2c614
Add mmoe eval part;
Liuxinman Jun 8, 2022
21df60f
Update args
Liuxinman Jun 8, 2022
e955c08
Add sh script
Liuxinman Jun 8, 2022
966b03d
Fix bugs in parallel
Liuxinman Jun 9, 2022
503780a
Replace table size array;
Liuxinman Jun 9, 2022
0875be5
Update readme;
Liuxinman Jun 9, 2022
ac2cd7e
Update README.md
Liuxinman Jun 9, 2022
315fe7d
Change gate and tower to dnn
Liuxinman Jun 15, 2022
4c82e90
add oneembedding key_type
ShawnXuan Jun 16, 2022
f44b50f
pad dense input
ShawnXuan Jun 16, 2022
9b523cb
Merge branch 'dlrm_key_type' of github.com:Oneflow-Inc/models into dl…
ShawnXuan Jun 16, 2022
f2dbdf1
add padding in prefetch
ShawnXuan Jun 16, 2022
cc59c3d
add sh
ShawnXuan Jun 16, 2022
a5e25c9
udpate
ShawnXuan Jun 17, 2022
7126d88
update
ShawnXuan Jun 23, 2022
d7d3479
fix typo in mmoe_parquet.py;
Liuxinman Jun 23, 2022
4567c2b
Merge branch 'main' of github.com:Oneflow-Inc/models into dlrm_profile
ShawnXuan Jun 24, 2022
cbbcea1
eval steps
ShawnXuan Jun 29, 2022
b8896dc
nsys 4gpus
ShawnXuan Jun 29, 2022
49fae88
update default file name
ShawnXuan Jun 29, 2022
9e0deee
Update README.md (dataset);
Liuxinman Jun 29, 2022
3290080
Merge branch 'main' of https://github.com/Oneflow-Inc/models into dev…
Liuxinman Jun 29, 2022
cb4e51a
Remove sklearn and pandas dependency in mmoe_parquet.py
Liuxinman Jun 30, 2022
30e65b8
Fix bugs in mmoe_parquet.py
Liuxinman Jun 30, 2022
13b0318
Simplify mmoe_parquet
Liuxinman Jun 30, 2022
39f84fa
Update readme
Liuxinman Jul 4, 2022
aa7d714
format mmoe_train_eval.py
Liuxinman Jul 4, 2022
48ab765
Format mmoe_parquet.py
Liuxinman Jul 4, 2022
6ca5f11
Remove num_sparse_features and num_dense_features
Liuxinman Jul 6, 2022
50d6235
Merge branch 'dev_mmoe_spark' of github.com:Oneflow-Inc/models into d…
ShawnXuan Jul 11, 2022
4ced5dc
env tests
ShawnXuan Jul 12, 2022
e5fca31
update
ShawnXuan Jul 12, 2022
39f6cac
update
ShawnXuan Jul 12, 2022
68cd3d0
update
ShawnXuan Jul 14, 2022
32 changes: 32 additions & 0 deletions RecommenderSystems/dlrm/criteo1t_nsys_4gpu.sh
@@ -0,0 +1,32 @@
prefix=${1:-4gpu_bsz27648}

persistent=./persistent
rm -rf ${prefix}.* $persistent/*

#export CUDA_VISIBLE_DEVICES=1
export ONEFLOW_FUSE_MODEL_UPDATE_CAST=1
export ONEFLOW_ENABLE_MULTI_TENSOR_MODEL_UPDATE=1
export ONEFLOW_KERNEL_ENABLE_CUDA_GRAPH=1
export ONEFLOW_EAGER_LOCAL_TO_GLOBAL_BALANCED_OVERRIDE=1
#export ONEFLOW_ONE_EMBEDDING_ENABLE_QUANTIZED_COMM=1
export ONEFLOW_ONE_EMBEDDING_USE_SYSTEM_GATHER=0
#export ONEFLOW_ONE_EMBEDDING_EMBEDDING_SHUFFLE_INDEPENTENT_STREAM=1
export ONEFLOW_PROFILER_KERNEL_PROFILE_KERNEL_FORWARD_RANGE=1


/usr/local/cuda-11.6/bin/nsys profile --stats=true -o $prefix \
python3 -m oneflow.distributed.launch \
--nproc_per_node 4 \
--nnodes 1 \
--node_rank 0 \
--master_addr 127.0.0.1 \
dlrm_prefetch_train.py \
--data_dir /RAID0/xiexuan/dlrm_parquet_int32 \
--persistent_path $persistent \
--store_type device_mem \
--train_batches 300 \
--train_batch_size 27648 \
--learning_rate 3 \
--one_embedding_key_type int32 \
--amp
#--train_batches 300 \
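
The cleanup line near the top of the script (`rm -rf ${prefix}.* $persistent/*`) expands both variables unquoted. With the hard-coded `persistent=./persistent` this works, but if the variable were ever empty the second glob would become `/*`. A minimal sketch of a guarded variant follows; `clean_outputs` is a hypothetical helper name that is not part of this PR, and it assumes bash.

```shell
#!/usr/bin/env bash
# Sketch only: clean_outputs is a hypothetical helper, not part of the PR.
# It removes earlier profiler outputs (<prefix>.*) and the contents of the
# persistent embedding directory, as the original rm line does.
clean_outputs() {
    local prefix=$1 persistent=$2
    # ${var:?} aborts the shell when the variable is empty or unset,
    # so an empty path can never expand to ".*" or "/*".
    rm -rf -- "${prefix:?}".* "${persistent:?}"/*
    # Recreate the directory so a later run always finds it.
    mkdir -p -- "$persistent"
}
```

In the script this would be called as `clean_outputs "$prefix" "$persistent"` in place of the bare `rm -rf` line.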