[torchbench] hf_T5_large training fails to run on dynamo. #6901

Open
ysiraichi opened this issue Apr 8, 2024 · 1 comment

🐛 Bug

hf_T5_large training fails to run on dynamo. See the error below:

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda --repeat 8 --iterations-per-run 1 \
    --xla PJRT --dynamo openxla --test train \
    -k hf_T5_large
Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 945, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 941, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 61, in run
    self.run_single_config()
  File "xla/benchmarks/experiment_runner.py", line 256, in run_single_config
    metrics, last_output = self.run_once_and_gather_metrics(
  File "xla/benchmarks/experiment_runner.py", line 345, in run_once_and_gather_metrics
    output, _ = loop(iter_fn=self._default_iter_fn)
  File "xla/benchmarks/experiment_runner.py", line 302, in loop
    output, timing, trace = iter_fn(benchmark_experiment, benchmark_model,
  File "xla/benchmarks/experiment_runner.py", line 218, in _default_iter_fn
    output = benchmark_model.model_iter_fn(
  File "torch/_dynamo/eval_frame.py", line 410, in _fn
    return fn(*args, **kwargs)
  File "xla/benchmarks/torchbench_model.py", line 400, in train
    super().train(inputs, collect_full_output=collect_full_output)
  File "xla/benchmarks/benchmark_model.py", line 156, in train
    self._optimizer_zero_grad()
  File "xla/benchmarks/benchmark_model.py", line 159, in torch_dynamo_resume_in_train_at_156
    loss = self.compute_loss(pred)
  File "xla/benchmarks/benchmark_model.py", line 160, in torch_dynamo_resume_in_train_at_159
    loss.backward()
  File "xla/benchmarks/benchmark_model.py", line 161, in torch_dynamo_resume_in_train_at_160
    self._optimizer_step()
  File "xla/benchmarks/benchmark_model.py", line 150, in _optimizer_step
    self.optimizer.step()
  File "torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "torch/optim/adam.py", line 135, in step
    @_use_grad_for_differentiable
  File "torch/_dynamo/eval_frame.py", line 410, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 36, in inner
    return fn(*args, **kwargs)
  File "torch/_functorch/aot_autograd.py", line 917, in forward
    return compiled_fn(full_args)
  File "torch/_functorch/_aot_autograd/utils.py", line 89, in g
    return f(*args)
  File "torch/_functorch/_aot_autograd/runtime_wrappers.py", line 107, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
  File "torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 181, in rng_functionalization_wrapper
    return compiled_fw(args)
  File "torch/_functorch/_aot_autograd/utils.py", line 89, in g
    return f(*args)
  File "torch/_dynamo/backends/torchxla.py", line 36, in fwd
    compiled_graph = bridge.extract_compiled_graph(model, args)
  File "xla/torch_xla/core/dynamo_bridge.py", line 618, in extract_compiled_graph
    xm.mark_step()
  File "xla/torch_xla/core/xla_model.py", line 1056, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: Bad StatusOr access: INTERNAL: ptxas exited with non-zero error code 65280, output: ptxas /tmp/tempfile-benchmarking-group-a100-40g-q60p-3afc5b57-185461-6157c733d1fd3, line 4045; error   : Entry function 'loop_broadcast_fusion_7' uses too much parameter space (0x1200 bytes, 0x1100 max).
ptxas fatal   : Ptx assembly aborted due to errors
: If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
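
For reference, the failing fused kernel requests 0x1200 = 4608 bytes of parameter space against the 0x1100 = 4352 byte limit reported by ptxas. The sketch below approximates the path the benchmark harness takes: a T5-large training step compiled with the openxla dynamo backend, which routes through bridge.extract_compiled_graph and xm.mark_step as in the traceback above. It is only an illustration, not the harness itself; the batch/sequence shapes and learning rate are assumptions (Adam itself matches the torch/optim/adam.py frame).

import torch
import torch_xla.core.xla_model as xm
from transformers import T5ForConditionalGeneration

# Rough standalone approximation of the failing path; not the benchmark harness.
device = xm.xla_device()
model = T5ForConditionalGeneration.from_pretrained("t5-large").to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an arbitrary choice

@torch.compile(backend="openxla")
def train_step(input_ids, labels):
    # Mirrors benchmark_model.train: zero_grad -> forward/loss -> backward -> step.
    optimizer.zero_grad()
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    return loss

# T5 vocabulary size is 32128; batch and sequence sizes here are guesses.
input_ids = torch.randint(0, 32128, (4, 512), device=device)
labels = torch.randint(0, 32128, (4, 512), device=device)
loss = train_step(input_ids, labels)
xm.mark_step()  # compilation/execution happens here, where the ptxas error surfaced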

Affected Configurations

  • Training+Dynamo

Environment

  • Reproducible on XLA backend [CPU/CUDA/TPU]: CUDA
  • torch_xla version: 5c48be1

cc @miladm @JackCaoG @vanbasten23 @cota @golechwierowicz @frgossen @zpcore

@ysiraichi (Collaborator, Author) commented:

Since the last report, this benchmark has started failing due to an out-of-memory (OOM) error.
