[torchbench] hf_T5_large training fails to run on dynamo. #6901

Open
ysiraichi opened this issue Apr 8, 2024 · 1 comment

🐛 Bug

hf_T5_large training fails to run on dynamo. See the error below:

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda --repeat 8 --iterations-per-run 1 \
    --xla PJRT --dynamo openxla --test train \
    -k hf_T5_large
Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 945, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 941, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 61, in run
    self.run_single_config()
  File "xla/benchmarks/experiment_runner.py", line 256, in run_single_config
    metrics, last_output = self.run_once_and_gather_metrics(
  File "xla/benchmarks/experiment_runner.py", line 345, in run_once_and_gather_metrics
    output, _ = loop(iter_fn=self._default_iter_fn)
  File "xla/benchmarks/experiment_runner.py", line 302, in loop
    output, timing, trace = iter_fn(benchmark_experiment, benchmark_model,
  File "xla/benchmarks/experiment_runner.py", line 218, in _default_iter_fn
    output = benchmark_model.model_iter_fn(
  File "torch/_dynamo/eval_frame.py", line 410, in _fn
    return fn(*args, **kwargs)
  File "xla/benchmarks/torchbench_model.py", line 400, in train
    super().train(inputs, collect_full_output=collect_full_output)
  File "xla/benchmarks/benchmark_model.py", line 156, in train
    self._optimizer_zero_grad()
  File "xla/benchmarks/benchmark_model.py", line 159, in torch_dynamo_resume_in_train_at_156
    loss = self.compute_loss(pred)
  File "xla/benchmarks/benchmark_model.py", line 160, in torch_dynamo_resume_in_train_at_159
    loss.backward()
  File "xla/benchmarks/benchmark_model.py", line 161, in torch_dynamo_resume_in_train_at_160
    self._optimizer_step()
  File "xla/benchmarks/benchmark_model.py", line 150, in _optimizer_step
    self.optimizer.step()
  File "torch/optim/optimizer.py", line 391, in wrapper
    out = func(*args, **kwargs)
  File "torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "torch/optim/adam.py", line 135, in step
    @_use_grad_for_differentiable
  File "torch/_dynamo/eval_frame.py", line 410, in _fn
    return fn(*args, **kwargs)
  File "torch/_dynamo/external_utils.py", line 36, in inner
    return fn(*args, **kwargs)
  File "torch/_functorch/aot_autograd.py", line 917, in forward
    return compiled_fn(full_args)
  File "torch/_functorch/_aot_autograd/utils.py", line 89, in g
    return f(*args)
  File "torch/_functorch/_aot_autograd/runtime_wrappers.py", line 107, in runtime_wrapper
    all_outs = call_func_at_runtime_with_args(
  File "torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
    out = normalize_as_list(f(args))
  File "torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 181, in rng_functionalization_wrapper
    return compiled_fw(args)
  File "torch/_functorch/_aot_autograd/utils.py", line 89, in g
    return f(*args)
  File "torch/_dynamo/backends/torchxla.py", line 36, in fwd
    compiled_graph = bridge.extract_compiled_graph(model, args)
  File "xla/torch_xla/core/dynamo_bridge.py", line 618, in extract_compiled_graph
    xm.mark_step()
  File "xla/torch_xla/core/xla_model.py", line 1056, in mark_step
    torch_xla._XLAC._xla_step_marker(
RuntimeError: Bad StatusOr access: INTERNAL: ptxas exited with non-zero error code 65280, output: ptxas /tmp/tempfile-benchmarking-group-a100-40g-q60p-3afc5b57-185461-6157c733d1fd3, line 4045; error   : Entry function 'loop_broadcast_fusion_7' uses too much parameter space (0x1200 bytes, 0x1100 max).
ptxas fatal   : Ptx assembly aborted due to errors
: If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
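
For reference, the failing fused kernel requests 0x1200 = 4608 bytes of parameter space against the 0x1100 = 4352 byte limit reported by ptxas. The sketch below approximates the path the benchmark harness takes: a T5-large training step compiled with the openxla dynamo backend, which routes through bridge.extract_compiled_graph and xm.mark_step as in the traceback above. It is only an illustration, not the harness itself; the batch/sequence shapes and learning rate are assumptions (Adam itself matches the torch/optim/adam.py frame).

import torch
import torch_xla.core.xla_model as xm
from transformers import T5ForConditionalGeneration

# Rough standalone approximation of the failing path; not the benchmark harness.
device = xm.xla_device()
model = T5ForConditionalGeneration.from_pretrained("t5-large").to(device)
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an arbitrary choice

@torch.compile(backend="openxla")
def train_step(input_ids, labels):
    # Mirrors benchmark_model.train: zero_grad -> forward/loss -> backward -> step.
    optimizer.zero_grad()
    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    return loss

# T5 vocabulary size is 32128; batch and sequence sizes here are guesses.
input_ids = torch.randint(0, 32128, (4, 512), device=device)
labels = torch.randint(0, 32128, (4, 512), device=device)
loss = train_step(input_ids, labels)
xm.mark_step()  # compilation/execution happens here, where the ptxas error surfaced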

Affected Configurations

  • Training+Dynamo

Environment

  • Reproducible on XLA backend [CPU/CUDA/TPU]: CUDA
  • torch_xla version: 5c48be1

cc @miladm @JackCaoG @vanbasten23 @cota @golechwierowicz @frgossen @zpcore

@ysiraichi (Collaborator, Author) commented:

Since the last report, this benchmark has started failing due to an out-of-memory (OOM) error.
