H100/H200/B200 FlashAttention3 for Flux + TorchAO improvements #1033

bghira · 2024-10-06T12:56:20Z

Unlocks FP8 training on H100.
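As a rough illustration of how the H100-only paths might be gated, here is a minimal sketch. It is not the code from this PR. The function names, the `flash_attn_interface` module name, and the `major >= 9` threshold are assumptions: H100/H200 report CUDA compute capability 9.0 and B200 reports 10.0, so the threshold simply mirrors the GPUs named in the PR title.

```python
import importlib.util

# Assumption: gate on CUDA compute capability major >= 9, which covers
# H100/H200 (9.0) and B200 (10.0) as listed in the PR title.
MIN_MAJOR = 9


def fa3_installed(module_name="flash_attn_interface"):
    """Return True if a FlashAttention 3 module is importable.

    The default module name is an assumption, not taken from this PR.
    """
    return importlib.util.find_spec(module_name) is not None


def select_fast_paths(compute_capability, have_fa3):
    """Pick FP8 eligibility and an attention backend for one GPU.

    compute_capability is a (major, minor) tuple.
    """
    major, _minor = compute_capability
    eligible = major >= MIN_MAJOR
    return {
        "fp8": eligible,
        # Fall back to PyTorch's scaled-dot-product attention otherwise.
        "attention": "flash_attn_3" if eligible and have_fa3 else "sdpa",
    }
```

With PyTorch installed, the tuple would typically come from `torch.cuda.get_device_capability()`, for example `select_fast_paths(torch.cuda.get_device_capability(), fa3_installed())`.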

With torch.compile in bf16 mode and a rank-16 LoRA, we see 2 iterations per second.

At that rate, 1000 steps train in 500 seconds, or just over 8.3 minutes.
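The timing figure follows directly from the step count and the iteration rate. A small sketch of the arithmetic, using a hypothetical helper name:

```python
def training_time(steps, iterations_per_second):
    """Return (seconds, minutes) needed to run `steps` at a steady rate."""
    seconds = steps / iterations_per_second
    return seconds, seconds / 60.0


seconds, minutes = training_time(1000, 2.0)
print(f"{seconds:.0f} s = {minutes:.2f} min")  # prints "500 s = 8.33 min"
```

This assumes the rate stays constant and ignores compile warm-up, checkpointing, and validation time.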


bghira and others added 17 commits October 5, 2024 15:55

move models into models module

0f47f67

h100: use flash attention 3 when available

89b2964

boost max pythonvers to 3.12

cb0772e

triton library update

3f5e3d0

add nccl latest

6ca55f6

ddp: disable optimizeddp when gradient checkpoint used

424f033

h100 should get fp8-torchao and flash attention 3 on vanilla diffusers transformer model

8e2911e

Merge branch 'main' into feature/flash-attention-3-h100

3a95492

update fp8-torchao docs

11e36fc

attn fix for h100

2f5ba25

update triton

74f3287

vaecache should not report on bunk cache files

9b81e17

support for utf8 prompts

a3a3b47

s3: refactor torch compressed file load/save

7f75340

s3: refactor torch compressed file load/save (fix)

f7f514d

s3: tested backwards compat loading w/ broken cache

bf34644

quantisation should provide error during OOM to use --quantize_via=cpu

ee758c4
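The last commit above concerns surfacing a helpful error when quantisation runs out of memory. A minimal sketch of that pattern, not the PR's actual implementation: `quantize_with_hint` and its parameters are hypothetical, and only the `--quantize_via=cpu` flag comes from the commit title.

```python
def quantize_with_hint(quantize_fn, model, oom_errors=(MemoryError,)):
    """Run quantize_fn(model); on OOM, re-raise with a pointer to the CPU path.

    oom_errors is the tuple of exception types treated as out-of-memory.
    """
    try:
        return quantize_fn(model)
    except oom_errors as exc:
        raise RuntimeError(
            "Out of memory while quantising the model on the accelerator. "
            "Retry with --quantize_via=cpu to quantise on the CPU instead."
        ) from exc
```

With PyTorch, one would likely pass `oom_errors=(torch.cuda.OutOfMemoryError,)` so that only GPU out-of-memory failures are rewrapped and other errors propagate unchanged.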

bghira merged commit 8bd9132 into main Oct 8, 2024
1 check passed

bghira deleted the feature/flash-attention-3-h100 branch October 14, 2024 01:49
