Issues: pytorch/torchtitan
- add H100 in CI [better_engineering: Repo code quality improvements; integration test: Adding integration tests]
- create a note on torchtitan official release [documentation: Improvements or additions to documentation; release_blocking: Issues that are blocking the milestone / release completion]
- Non-DP runs default to float32 precision [enhancement: New feature or request] (#630, opened Oct 18, 2024 by carmocca)
- [Triton] Implement Liger Kernels [enhancement] (#623, opened Oct 17, 2024 by casper-hansen)
- Is there a way to offload training memory to DRAM (using FSDP2?) for training Llama3-8B with torchtitan? [question: Further information is requested] (#620, opened Oct 15, 2024 by 0781532)
- Question: why does torch.compile have better throughput with 128 GPUs than with 8 GPUs? [question] (#619, opened Oct 15, 2024 by dz1iang)
- redundant checks in checkpoint.py [good first issue: Good for newcomers; better_engineering]
- Ability to train based on epoch [enhancement; good first issue] (#613, opened Oct 13, 2024 by abatilo)
- [Compile] Understand why FSDP2 saves both SDPA out and wo in for bwd [question] (#610, opened Oct 11, 2024 by awgu)
- Why is xformers not used for attention computation? [question] (#608, opened Oct 9, 2024 by jason718)
- Granular layer selection during Pipeline Parallelism [question] (#598, opened Oct 3, 2024 by bhuvan777)
- Gradient norm clipping with pipeline parallelism (PP) [bug: Something isn't working; release_blocking]
- Support Gemma2 in torchtitan [enhancement] (#594, opened Oct 1, 2024 by pansershrek)
- Reproducible numerics for loss, weights, and gradients on a single node (8 GPUs) [enhancement] (#593, opened Oct 1, 2024 by weifengpy)
- Inference with the checkpoint [enhancement] (#586, opened Sep 23, 2024 by mathmax12)
- Support INT8 mixed-precision training from torchao? [enhancement] (#578, opened Sep 14, 2024 by gau-nernst)
- Wrong train_state.step when resuming from checkpoint for the second time [bug] (#571, opened Sep 8, 2024 by LeoXinhaoLee)
- Pipeline Parallelism + FSDP [question] (#562, opened Aug 29, 2024 by jeromeku)
- Fail-safe and partial redundancy for HSDP on unreliable compute [enhancement] (#561, opened Aug 27, 2024 by evkogs)
- PP UX/training confusion re: loss = -1 (need to better document this, or add automatic logging of the last rank's loss?) (#550, opened Aug 21, 2024 by lessw2020)
- 2D whole model compile fails at embedding layer [bug] (#534, opened Aug 20, 2024 by tianyu-l)
- [rfc] getting rid of seed-checkpoint for Pipeline Parallelism [enhancement; release_blocking]