Error when using multi-GPU training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered #254

JunZhan2000 opened this issue Apr 12, 2022 · 6 comments

JunZhan2000 commented Apr 12, 2022

I am trying to train a Chinese Conformer model. When I train with 4× 2080 Ti GPUs, an error occurs partway through an epoch: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered. The point at which it occurs is not fixed. The problem does not occur when I train with only one GPU. Please help me.

This is my environment:

tensorflow-gpu==2.7
tensorflow-text==2.7
tensorflow-io==0.23
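
Not part of the original report, but a quick sanity check for an environment like this is to confirm that TensorFlow actually sees all four GPUs and was built with CUDA support; a minimal sketch using standard TensorFlow APIs (the expected count of four devices is an assumption based on the setup described above):

import tensorflow as tf

# With 4x 2080 Ti this should list four physical GPU devices.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# Confirms the installed wheel is a CUDA build at all.
print("Built with CUDA:", tf.test.is_built_with_cuda())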

Below is my config.yml configuration

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /remote-home/jzhan/TensorFlowASR/vocabularies/AISHELL-1/AISHELL-1_10000.subwords
  target_vocab_size: 10000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /remote-home/jzhan/Datasets/AISHELL-1_test/train/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/train/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/train/tfrecords
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 50
    checkpoint:
      filepath: /remote-home/jzhan/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}.h5
      save_best_only: False
      save_weights_only: True
      save_freq: epoch
    states_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
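
Not from the issue itself, but a minimal sketch of how a config like the one above can be loaded and spot-checked before launching training; the path is a placeholder and PyYAML is assumed to be installed, with the keys taken from the config shown above:

import yaml

# Hypothetical path; substitute the actual location of the config above.
CONFIG_PATH = "config.yml"

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Spot-check the values most relevant to multi-GPU memory pressure.
running = config["learning_config"]["running_config"]
print("batch_size:", running["batch_size"])
print("num_epochs:", running["num_epochs"])
print("encoder_dmodel:", config["model_config"]["encoder_dmodel"])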

nglehuy (Collaborator) commented Apr 16, 2022

@Guokr233 this might be a problem with TensorFlow itself, or the environment wasn't set up correctly. Did you use anaconda3 (or miniconda)? And make sure the CUDA driver is installed correctly on your machine.
Anaconda ensures that your environment has the libraries/packages needed to run correctly on GPU.
I don't have experience solving CUDA errors, so the only suggestion I can offer is to first make sure the environment is set up correctly; once the environment is ruled out, we can test a newer version of TensorFlow (2.8, for example).
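
Not from the original comment, but one concrete way to act on this advice is to compare the CUDA/cuDNN versions the installed TensorFlow wheel was built against with what the conda environment and driver actually provide; a minimal sketch using standard TensorFlow APIs:

import tensorflow as tf

# Build metadata of the installed wheel; on GPU builds the dict includes
# "cuda_version" and "cudnn_version".
build = tf.sysconfig.get_build_info()
print("TF version:", tf.__version__)
print("Built for CUDA:", build.get("cuda_version"))
print("Built for cuDNN:", build.get("cudnn_version"))

# If these differ noticeably from the CUDA toolkit / cuDNN installed in the
# environment, that mismatch is the first thing to rule out.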

JunZhan2000 (Author) commented:

I created the environment via conda; it's a really weird bug.

JunZhan2000 (Author) commented:

Replying to @nglehuy's suggestion above:

I created the environment through conda, and I upgraded to tensorflow-gpu 2.8, CUDA 11.2, and cuDNN 8.4.0, but I still get this error. It seems I can only train slowly with a single GPU.

nglehuy (Collaborator) commented Apr 17, 2022

@Guokr233 Let me recheck TensorFlow's mirrored strategy to see if anything has changed.
I currently don't have multiple GPUs, so it's hard to reproduce the issue. I've moved to TPUs on Colab and it still works fine 😄
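
For anyone trying to narrow this down, here is a minimal sketch of the mirrored-strategy setup being discussed. This is generic TensorFlow code, not the TensorFlowASR training script; the toy model and data are placeholders. Restricting the run to a subset of the GPUs (e.g. two of the four) can help isolate whether a particular card or the cross-device communication is at fault:

import tensorflow as tf

# Limit the run to two of the four GPUs to narrow down the failure.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model standing in for the Conformer; variables created
    # inside the scope are mirrored across the selected GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Toy data just to exercise the multi-GPU path.
x = tf.random.normal([256, 32])
y = tf.random.uniform([256], maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=32, epochs=1)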

nglehuy added the bug and more info needed labels Apr 17, 2022
NusratNB commented Oct 14, 2022

I've also encountered the same problem. I followed the solutions given in this issue, but they didn't work:
tensorflow/tensorflow#44281

I also followed this solution, but again it didn't work:
tensorflow/tensorflow#40814 (comment)

NusratNB commented:
@Guokr233 can you try this solution:
tensorflow/tensorflow#50735 (comment)
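
The linked comment isn't reproduced here, but two commonly suggested mitigations for illegal-memory-access crashes in multi-GPU TensorFlow runs are enabling per-GPU memory growth and opting into the asynchronous CUDA allocator. A minimal sketch (both settings must take effect before any GPU work starts; whether they fix this particular crash is not guaranteed):

import os

# Opt into the asynchronous CUDA allocator (available in TF 2.5+); must be
# set before TensorFlow initializes the GPUs.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf

# Enable memory growth so each replica allocates GPU memory on demand
# instead of reserving the whole card up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)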
