
batch_input and elapsed time per iteration suddenly slow down during model training #1248

Open
Yuhanleeee opened this issue Jun 29, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@Yuhanleeee

Batch_input and elapsed time per iteration slow down during model training

[screenshot of the training log showing batch_input and elapsed-time-per-iteration spikes, 2024-06-29]

Arguments

data_impl ....................... mmap........................updated
deepspeed_extra_args ............ {'bf16': {'enabled': True}}.updated
dynamic_loss_scale .............. True........................updated
eval_interval ................... 40000.......................updated
eval_iters ...................... 10..........................updated
fp32_allreduce .................. True........................updated
global_num_gpus ................. 4...........................updated
gpt_j_residual .................. True........................updated
hidden_size ..................... 768.........................updated
init_method ..................... small_init..................updated
is_pipe_parallel ................ True........................updated
launcher ........................ slurm.......................updated
log_interval .................... 10..........................updated
lr .............................. 0.0006......................updated
lr_decay_iters .................. 143000......................updated
lr_decay_style .................. cosine......................updated
max_position_embeddings ......... 2048........................updated
min_lr .......................... 6e-05.......................updated
no_weight_tying ................. True........................updated
num_attention_heads ............. 12..........................updated
num_layers ...................... 12..........................updated
num_workers ..................... 32..........................updated
optimizer ....................... {'type': 'Adam', 'params': {'lr': 0.0006, 'betas': [0.9, 0.95], 'eps': 1e-08}}updated
optimizer_type .................. Adam........................updated
output_layer_init_method ........ wang_init...................updated
partition_activations ........... True........................updated
pipe_parallel_size .............. 1...........................updated
pos_emb ......................... rotary......................updated
precision ....................... bfloat16....................updated
rotary_pct ...................... 0.25........................updated
save ............................ /pythia/checkpoints/test_1updated
save_iters ...................... [10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000]updated
seq_length ...................... 2048........................updated
sparsity_config ................. {}..........................updated
synchronize_each_layer .......... True........................updated
test_data_paths ................. ['/pile_0.87_deduped_text_document/pile_0.87_deduped_text_document']updated
test_data_weights ............... [1.0].......................updated
text_gen_type ................... unconditional...............updated
tokenizer_type .................. HFTokenizer.................updated
train_batch_size ................ 128.........................updated
train_data_paths ................ ['/pile_0.87_deduped_text_document/pile_0.87_deduped_text_document']updated
train_data_weights .............. [1.0].......................updated
train_iters ..................... 143000......................updated
train_micro_batch_size_per_gpu .. 32..........................updated
user_script ..................... train.py....................updated
valid_data_paths ................ ['pile_0.87_deduped_text_document/pile_0.87_deduped_text_document']updated
valid_data_weights .............. [1.0].......................updated
vocab_file ...................... /pythia/utils/20B_tokenizer.jsonupdated
wall_clock_breakdown ............ True........................updated
zero_allgather_bucket_size ...... 500000000...................updated
zero_contiguous_gradients ....... True........................updated
zero_optimization ............... {'stage': 0, 'allgather_partitions': True, 'allgather_bucket_size': 500000000, 'overlap_comm': True, 'reduce_scatter': True, 'reduce_bucket_size': 500000000, 'contiguous_gradients': True, 'cpu_offload': False, 'load_from_fp32_weights': False}updated
zero_reduce_bucket_size ......... 500000000...................updated
zero_reduce_scatter ............. True........................updated
zero_stage ...................... 0...........................updated
account ......................... None........................default
activation ...................... gelu........................default
activation_checkpointing ........ None........................default
adlr_autoresume ................. False.......................default
adlr_autoresume_interval ........ 1000........................default
amp ............................. None........................default
apply_query_key_layer_scaling ... False.......................default
attention_dropout ............... 0...........................default
attention_softmax_in_fp32 ....... False.......................default
autotuning ...................... None........................default
autotuning_run .................. None........................default
base_shapes_file ................ None........................default
bf16 ............................ None........................default
bias_dropout_fusion ............. False.......................default
bias_gelu_fusion ................ False.......................default
char_level_ppl .................. False.......................default
checkpoint ...................... None........................default
checkpoint_in_cpu ............... False.......................default
checkpoint_num_layers ........... 1...........................default
checkpoint_scale ................ linear......................default
checkpoint_validation_with_forward_pass False................default
clip_grad ....................... 1.0.........................default
comment ......................... None........................default
comms_logger .................... None........................default
communication_data_type ......... None........................default
compression_training ............ None........................default
contiguous_checkpointing ........ False.......................default
coord_check ..................... False.......................default
create_moe_param_group .......... True........................default
csv_monitor ..................... None........................default
curriculum_learning ............. None........................default
curriculum_seqlen ............... 0...........................default
data_efficiency ................. None........................default
data_path ....................... None........................default
data_types ...................... None........................default
deepscale ....................... False.......................default
deepscale_config ................ None........................default
deepspeed ....................... True........................default
deepspeed_activation_checkpointing True......................default
deepspeed_mpi ................... False.......................default
deepspeed_slurm ................. False.......................default
detect_nvlink_pairs ............. False.......................default
distributed_backend ............. nccl........................default
do_test ......................... None........................default
do_train ........................ None........................default
do_valid ........................ None........................default
dump_state ...................... False.......................default
elasticity ...................... None........................default
enable_expert_tensor_parallelism False.......................default
eod_mask_loss ................... False.......................default
eval_results_prefix ............. ............................default
eval_tasks ...................... None........................default
exclude ......................... None........................default
exit_interval ................... None........................default
expert_interval ................. 2...........................default
extra_save_iters ................ None........................default
finetune ........................ False.......................default
flops_profiler .................. None........................default
force_multi ..................... False.......................default
fp16 ............................ None........................default
fp16_lm_cross_entropy ........... False.......................default
git_hash ........................ 4c426da.....................default
gmlp_attn_dim ................... 64..........................default
gpt_j_tied ...................... False.......................default
gradient_accumulation_steps ..... 1...........................default
gradient_clipping ............... 1.0.........................default
gradient_noise_scale_cpu_offload False.......................default
gradient_noise_scale_n_batches .. 5...........................default
gradient_predivide_factor ....... 1.0.........................default
hidden_dropout .................. 0...........................default
hostfile ........................ None........................default
hysteresis ...................... 2...........................default
include ......................... None........................default
init_method_std ................. 0.02........................default
intermediate_size ............... None........................default
iteration ....................... None........................default
keep_last_n_checkpoints ......... None........................default
label_data_paths ................ None........................default
layernorm_epsilon ............... 1e-05.......................default
layernorm_fusion ................ False.......................default
lazy_mpu_init ................... False.......................default
load ............................ None........................default
local_rank ...................... None........................default
log_dir ......................... None........................default
log_grad_norm ................... False.......................default
log_grad_pct_zeros .............. False.......................default
log_gradient_noise_scale ........ False.......................default
log_optimizer_states ............ False.......................default
log_param_norm .................. False.......................default
loss_scale ...................... None........................default
loss_scale_window ............... 1000.0......................default
make_vocab_size_divisible_by .... 128.........................default
mamba_causal_conv_fusion ........ False.......................default
mamba_inner_func_fusion ......... False.......................default
mamba_selective_fp32_params ..... True........................default
mamba_selective_scan_fusion ..... False.......................default
mamba_use_bias_in_conv .......... True........................default
mamba_use_bias_in_linears ....... False.......................default
master_addr ..................... None........................default
master_port ..................... 29500.......................default
maximum_tokens .................. 64..........................default
memory_profiling ................ False.......................default
memory_profiling_path ........... None........................default
merge_file ...................... None........................default
min_scale ....................... 1.0.........................default
mlp_type ........................ regular.....................default
mmap_warmup ..................... False.......................default
model_parallel_size ............. 1...........................default
moe_eval_capacity_factor ........ 1.0.........................default
moe_expert_parallel_size ........ 1...........................default
moe_glu ......................... False.......................default
moe_jitter_eps .................. None........................default
moe_lbl_in_fp32 ................. False.......................default
moe_loss_coeff .................. 0.1.........................default
moe_min_capacity ................ 4...........................default
moe_num_experts ................. 1...........................default
moe_token_dropping .............. False.......................default
moe_top_k ....................... 1...........................default
moe_train_capacity_factor ....... 1.0.........................default
moe_type ........................ megablocks..................default
moe_use_residual ................ True........................default
mup_attn_temp ................... 1.0.........................default
mup_embedding_mult .............. 1.0.........................default
mup_init_scale .................. 1.0.........................default
mup_output_temp ................. 1.0.........................default
mup_rp_embedding_mult ........... 1.0.........................default
mup_width_scale ................. 2...........................default
no_load_optim ................... False.......................default
no_load_rng ..................... False.......................default
no_save_optim ................... False.......................default
no_save_rng ..................... False.......................default
no_ssh_check .................... False.......................default
norm ............................ layernorm...................default
num_gpus ........................ None........................default
num_kv_heads .................... None........................default
num_nodes ....................... -1..........................default
num_samples ..................... 1...........................default
num_unique_layers ............... None........................default
onnx_safe ....................... False.......................default
opt_pos_emb_offset .............. 0...........................default
output_layer_parallelism ........ column......................default
override_lr_scheduler ........... False.......................default
padded_vocab_size ............... None........................default
param_sharing_style ............. grouped.....................default
pipe_partition_method ........... type:transformer|mlp........default
prescale_gradients .............. False.......................default
profile ......................... False.......................default
profile_backward ................ False.......................default
profile_step_start .............. 10..........................default
profile_step_stop ............... 12..........................default
prompt_end ...................... '\n'........................default
rank ............................ None........................default
recompute ....................... False.......................default
return_logits ................... False.......................default
rms_norm_epsilon ................ 1e-08.......................default
rope_fusion ..................... False.......................default
rotary_emb_base ................. 10000.......................default
rotary_save_freqs_buffer ........ False.......................default
rpe_max_distance ................ 128.........................default
rpe_num_buckets ................. 32..........................default
s3_chunk_size ................... 104857600...................default
s3_path ......................... None........................default
sample_input_file ............... None........................default
sample_output_file .............. samples.txt.................default
save_base_shapes ................ False.......................default
scaled_masked_softmax_fusion .... False.......................default
scaled_upper_triang_masked_softmax_fusion False..............default
scalenorm_epsilon ............... 1e-08.......................default
scheduler ....................... None........................default
seed ............................ 1234........................default
short_seq_prob .................. 0.1.........................default
sliding_window_width ............ None........................default
soft_prompt_tuning .............. None........................default
sparse_attention ................ None........................default
sparse_gradients ................ False.......................default
split ........................... 969, 30, 1..................default
steps_per_print ................. 10..........................default
temperature ..................... 0.0.........................default
tensorboard ..................... None........................default
tensorboard_dir ................. None........................default
top_k ........................... 0...........................default
top_p ........................... 0.0.........................default
use_bias_in_attn_linear ......... True........................default
use_bias_in_norms ............... True........................default
use_bnb_optimizer ............... False.......................default
use_checkpoint_lr_scheduler ..... False.......................default
use_cpu_initialization .......... False.......................default
use_mup ......................... False.......................default
use_qk_layernorm ................ False.......................default
use_shared_fs ................... True........................default
use_tutel ....................... False.......................default
use_wandb ....................... None........................default
wandb ........................... None........................default
wandb_group ..................... None........................default
wandb_host ...................... https://api.wandb.ai........default
wandb_init_all_ranks ............ False.......................default
wandb_project ................... neox........................default
wandb_team ...................... None........................default
warmup .......................... 0.01........................default
weight_by_num_documents ......... False.......................default
weight_decay .................... 0.1.........................default
weighted_sampler_alpha .......... 1.0.........................default
world_size ...................... None........................default

Environment:

  • PyTorch version: 2.3.1
  • CUDA version: 12.2
  • NCCL version: 2.20.5

Hardware:

  • GPU: A100-SXM4-40GB
  • CPU: AMD EPYC 7543 32-Core Processor
  • Memory: 263793632 kB (total), 195607748 kB (free)
@Yuhanleeee added the bug label on Jun 29, 2024
@Quentin-Anthony
Member

I can't seem to reproduce this. My guess is that you have a systems issue (someone else sharing the GPU? GPU throttling? Network congestion?)

If you'd like to discuss further, please send your exact training config (not just the raw arguments) in this issue and I'll take a look.
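The failure modes Quentin lists (a shared GPU, throttling) can be checked with a quick script. This is a standalone sketch, not part of gpt-neox: it parses the CSV output of `nvidia-smi --query-gpu` (standard query fields), and `parse_gpu_rows` is a helper written here for illustration. Sampling these fields every few seconds during training and looking for SM-clock or utilization dips around the slow iterations is one way to rule throttling in or out.

```python
import subprocess

# Fields supported by `nvidia-smi --query-gpu`; dips in utilization or SM
# clock across samples often point at throttling or a shared GPU.
FIELDS = "index,utilization.gpu,clocks.sm,temperature.gpu,power.draw"

def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        idx, util, clock, temp, power = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(idx),
            "util_pct": float(util),
            "sm_clock_mhz": float(clock),
            "temp_c": float(temp),
            "power_w": float(power),
        })
    return rows

def sample_gpus():
    """Run nvidia-smi once and return one dict per GPU (requires a GPU node)."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_rows(out)
```

To check whether someone else's process is on the same GPUs, `nvidia-smi --query-compute-apps=pid,used_memory --format=csv` lists all compute processes per device.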

@StellaAthena
Member

StellaAthena commented Sep 9, 2024

Building off of what Quentin said, I have seen behavior like this when two users (on different GPUs) were saving checkpoints to the same drive. Since it's the input that's experiencing the inconsistency, maybe you're having a similar issue but with data loading?
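One way to test the data-loading hypothesis independently of the model is to iterate the dataloader by itself and flag iterations that are much slower than the running median. A minimal sketch (not part of gpt-neox; the `slow_factor` and `warmup` values are arbitrary choices for illustration):

```python
import statistics
import time

def find_slow_iterations(loader, slow_factor=5.0, warmup=5):
    """Time each next() on `loader` (any iterable, e.g. a torch DataLoader).

    Returns a list of (iteration_index, elapsed_seconds) for iterations that
    took more than `slow_factor` times the running median, skipping the first
    `warmup` iterations so a baseline can accumulate.
    """
    times, flagged = [], []
    it = iter(loader)
    i = 0
    while True:
        start = time.perf_counter()
        try:
            next(it)
        except StopIteration:
            break
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        if i >= warmup and elapsed > slow_factor * statistics.median(times):
            flagged.append((i, elapsed))
        i += 1
    return flagged
```

If this flags the same iterations where the training log shows `batch_input` spikes while the GPUs are otherwise idle, the stall is in the input pipeline (storage or dataloader workers), not in compute or communication.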

@Yuhanleeee
Author

Thanks for your reply. Yes, you are right: I am hitting the same issue during data loading. May I ask if there are any known solutions? Your help means a lot to me.

@StellaAthena
Member

> Thanks for your reply. Yes, you are right: I am hitting the same issue during data loading. May I ask if there are any known solutions? Your help means a lot to me.

This is not an issue with our library; it is an issue with your computing cluster (and "issue" isn't really the right word, this is just how shared storage systems work). If you can move the data to a storage device that is private to you, or to one accessible only from your compute nodes, that would be the best solution as far as I am aware.
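If the compute nodes have local scratch (e.g. an NVMe mount), staging the mmap dataset files there before launch removes the shared-filesystem contention entirely. A hedged sketch, not a gpt-neox feature: the paths are placeholders, and on Slurm you would typically run this once per node before training starts, then point `train_data_paths` at the returned prefix. Note that plain `Path.with_suffix` would mangle prefixes containing dots (like `pile_0.87_deduped_text_document`), so the suffix is appended as a string:

```python
import shutil
from pathlib import Path

def stage_dataset(src_prefix, scratch_dir):
    """Copy the .bin/.idx files of a Megatron-style mmap dataset prefix to
    node-local scratch, skipping files already present with the same size.

    Returns the new prefix (a string) to use in place of the original.
    """
    src_prefix = Path(src_prefix)
    scratch = Path(scratch_dir)
    scratch.mkdir(parents=True, exist_ok=True)
    for suffix in (".bin", ".idx"):
        # Append the suffix as a string: the prefix itself may contain dots.
        src = Path(str(src_prefix) + suffix)
        dst = scratch / src.name
        if not dst.exists() or dst.stat().st_size != src.stat().st_size:
            shutil.copy2(src, dst)
    return str(scratch / src_prefix.name)
```

The size check makes re-runs cheap when the job restarts on the same node; for multi-node jobs each node stages its own copy, which is exactly what avoids the contention described above.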
