Error when using multi-GPU training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered #254

JunZhan2000 opened this issue Apr 12, 2022 · 6 comments

JunZhan2000 commented Apr 12, 2022

I am trying to train a Chinese Conformer model. When I train with 4× 2080 Ti GPUs, an error occurs partway through an epoch: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered. The point at which it occurs is not fixed. The problem does not occur when I train with only one GPU. Please help me.

This is my environment:

tensorflow-gpu==2.7
tensorflow-text==2.7
tensorflow-io==0.23
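
Not part of the original report, but a quick sanity check for an environment like this is to confirm that TensorFlow actually sees all four GPUs and was built with CUDA support; a minimal sketch using standard TensorFlow APIs (the expected count of four devices is an assumption based on the setup described above):

import tensorflow as tf

# With 4x 2080 Ti this should list four physical GPU devices.
gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# Confirms the installed wheel is a CUDA build at all.
print("Built with CUDA:", tf.test.is_built_with_cuda())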

Below is my config.yml configuration

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_frame: False

decoder_config:
  vocabulary: /remote-home/jzhan/TensorFlowASR/vocabularies/AISHELL-1/AISHELL-1_10000.subwords
  target_vocab_size: 10000
  max_subword_length: 10
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /remote-home/jzhan/Datasets/AISHELL-1_test/train/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  prejoint_linear: True
  joint_activation: tanh
  joint_mode: add

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      feature_augment:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/train/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/train/tfrecords
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: train

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: eval

  test_dataset_config:
    use_tf: True
    data_paths:
      - /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
    tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
    stage: test

  optimizer_config:
    warmup_steps: 40000
    beta_1: 0.9
    beta_2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 8
    num_epochs: 50
    checkpoint:
      filepath: /remote-home/jzhan/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}.h5
      save_best_only: False
      save_weights_only: True
      save_freq: epoch
    states_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/states
    tensorboard:
      log_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: epoch
      profile_batch: 2
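
Not from the issue itself, but a minimal sketch of how a config like the one above can be loaded and spot-checked before launching training; the path is a placeholder and PyYAML is assumed to be installed, with the keys taken from the config shown above:

import yaml

# Hypothetical path; substitute the actual location of the config above.
CONFIG_PATH = "config.yml"

with open(CONFIG_PATH, "r", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Spot-check the values most relevant to multi-GPU memory pressure.
running = config["learning_config"]["running_config"]
print("batch_size:", running["batch_size"])
print("num_epochs:", running["num_epochs"])
print("encoder_dmodel:", config["model_config"]["encoder_dmodel"])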

nglehuy (Collaborator) commented Apr 16, 2022

@Guokr233 this might be a problem with TensorFlow itself, or the environment wasn't set up correctly. Did you use anaconda3 (or miniconda)? And make sure the CUDA driver is installed correctly on your machine.
Anaconda ensures that your environment has the libraries/packages needed to run correctly on GPU.
I don't have experience solving CUDA errors, so the only suggestion I can offer is to first make sure the environment is set up correctly; once the environment is ruled out, we can test a newer version of TensorFlow (2.8, for example).
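
Not from the original comment, but one concrete way to act on this advice is to compare the CUDA/cuDNN versions the installed TensorFlow wheel was built against with what the conda environment and driver actually provide; a minimal sketch using standard TensorFlow APIs:

import tensorflow as tf

# Build metadata of the installed wheel; on GPU builds the dict includes
# "cuda_version" and "cudnn_version".
build = tf.sysconfig.get_build_info()
print("TF version:", tf.__version__)
print("Built for CUDA:", build.get("cuda_version"))
print("Built for cuDNN:", build.get("cudnn_version"))

# If these differ noticeably from the CUDA toolkit / cuDNN installed in the
# environment, that mismatch is the first thing to rule out.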

JunZhan2000 (Author) commented:

I created the environment via conda; it's a really weird bug.

JunZhan2000 (Author) commented:

Replying to @nglehuy's suggestion above:

I created the environment through conda, and I upgraded to tensorflow-gpu 2.8, CUDA 11.2, and cuDNN 8.4.0, but I still get this error. It seems I can only train slowly with a single GPU.

nglehuy (Collaborator) commented Apr 17, 2022

@Guokr233 Let me recheck TensorFlow's mirrored strategy to see if anything has changed.
I currently don't have multiple GPUs, so it's hard to reproduce the issue. I've moved to TPUs on Colab and it still works fine 😄
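
For anyone trying to narrow this down, here is a minimal sketch of the mirrored-strategy setup being discussed. This is generic TensorFlow code, not the TensorFlowASR training script; the toy model and data are placeholders. Restricting the run to a subset of the GPUs (e.g. two of the four) can help isolate whether a particular card or the cross-device communication is at fault:

import tensorflow as tf

# Limit the run to two of the four GPUs to narrow down the failure.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model standing in for the Conformer; variables created
    # inside the scope are mirrored across the selected GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Toy data just to exercise the multi-GPU path.
x = tf.random.normal([256, 32])
y = tf.random.uniform([256], maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=32, epochs=1)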

nglehuy added the bug and more info needed labels Apr 17, 2022
NusratNB commented Oct 14, 2022

I've also encountered the same problem. I followed the solutions given in this issue, but they didn't work:
tensorflow/tensorflow#44281

I also followed this solution, but again it didn't work:
tensorflow/tensorflow#40814 (comment)

NusratNB commented:
@Guokr233 can you try this solution:
tensorflow/tensorflow#50735 (comment)
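
The linked comment isn't reproduced here, but two commonly suggested mitigations for illegal-memory-access crashes in multi-GPU TensorFlow runs are enabling per-GPU memory growth and opting into the asynchronous CUDA allocator. A minimal sketch (both settings must take effect before any GPU work starts; whether they fix this particular crash is not guaranteed):

import os

# Opt into the asynchronous CUDA allocator (available in TF 2.5+); must be
# set before TensorFlow initializes the GPUs.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf

# Enable memory growth so each replica allocates GPU memory on demand
# instead of reserving the whole card up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)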
