Benchmark

Environment

  • Hardware: NVIDIA RTX 4090 with Intel i9-12900KF
  • Packages: Torch 2.1.0 with CUDA 12.1 and cuDNN 8.9
  • Params: model=SD15 | batch-size=4 | batch-count=4 | steps=50 | resolution=512px | sampler=Euler A
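
For reference, the setup above roughly corresponds to the following diffusers call. This is a minimal sketch, not the actual harness used for these numbers; the model id, prompt, and the it/s accounting are assumptions:

```python
import time
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

# SD15 at FP16; model id is an assumption for illustration
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Euler A sampler
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

steps, batch_size, batch_count = 50, 4, 4
pipe(prompt=["warmup"] * batch_size, num_inference_steps=10)  # exclude one-time costs

start = time.perf_counter()
for _ in range(batch_count):
    pipe(prompt=["benchmark"] * batch_size,
         num_inference_steps=steps, width=512, height=512)
# assumption: reported it/s counts steps x batch size (per-image iterations)
print(f"{batch_count * steps * batch_size / (time.perf_counter() - start):.1f} it/s")
```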

Results

| Precision | Params                       | Diffusers SDP | Diffusers xFormers | Original SDP | Original xFormers | Original None |
|-----------|------------------------------|---------------|--------------------|--------------|-------------------|---------------|
| FP32      | Default                      | 33.0          |                    | 20.0         |                   |               |
| BF16      | Default                      | 73.0          |                    | 45.5         |                   |               |
| FP16      | Default                      | 73.0          | 75.0               | 48.0         | 48.6              | 17.3          |
| FP16      | NHWC (channels last)         | 72.0          |                    |              |                   |               |
| FP16      | HyperTile (256)              | 79.0          |                    |              |                   |               |
| FP16      | ToMe (0.5)                   | 77.0          |                    |              |                   |               |
| FP16      | Model no-move (medvram)      | 85.0          |                    |              |                   |               |
| FP16      | VAE no-slicing, no-tiling    | 73.8          |                    |              |                   |               |
| FP16      | Sequential offload (lowvram) | 27.0          |                    |              |                   |               |

Notes: Options

  • All numbers are in it/s and higher is better
  • The test matrix is not exhaustive, as some options can be combined (e.g. cuDNN + HyperTile)
    while others cannot (e.g. HyperTile + ToMe)
  • Results may differ on other GPU/CPU combinations
    For example, pairing a better CPU with an older GPU may benefit from more processing done on the CPU, leaving the GPU to do only core ML tasks, while pairing a high-end GPU with an older CPU may yield lower results since the CPU cannot feed the GPU fast enough
  • The Diffusers backend performs significantly better than the original backend on modern hardware since tasks remain on the GPU for longer
    Equally, the original backend may perform better on older hardware
  • Quick tasks such as a single image generated at low steps may not be sufficient to fully saturate a high-end GPU, so results will be lower
  • xFormers has a slight performance advantage over SDP
    However, SDP is built into Torch and "just works", while xFormers needs a manual install and is highly version-dependent
  • Some extensions can add significant overhead to pre/post processing even if they are not used
  • Not worth consideration: cuDNN, NHWC, inference mode, eval
    • A full cuDNN benchmark finds the best math algorithm for a specific GPU, but the default is nearly identical
    • channels-last should better trigger utilization of tensor cores, but in practice the result is nearly identical
    • inference mode should enable more optimizations than the default no_grad, but in practice the result is nearly identical
    • eval mode should allow removal of some params in the model, but in practice the result is nearly identical
  • The benefit of BF16 over FP16 is not so much performance as its wider numerical range, which lets it perform calculations where FP16 may result in NaN
  • Running in FP32 results in a ~60% performance drop; if you need FP32, you're leaving a lot of performance on the table
  • The cost of using lowvram is very high as it needs to swap parts of the model in and out of memory; even medvram comes at a noticeable cost
  • Best combination: xFormers, FP16, HyperTile, no-model-move, no VAE slicing/tiling (see the sketch after this list)
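
As a rough illustration (a sketch, assuming the diffusers backend; SD.Next exposes these as settings rather than direct calls), several of the options benchmarked above map onto standard diffusers/torch calls:

```python
import torch
from diffusers import StableDiffusionPipeline

# FP16; model id is an assumption for illustration
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()  # xFormers (needs the package installed)
pipe.unet.to(memory_format=torch.channels_last)    # NHWC / channels-last
pipe.disable_vae_slicing()                         # VAE no-slicing
pipe.disable_vae_tiling()                          # VAE no-tiling

# The memory-saving options and their cost, per the table above:
# pipe.enable_model_cpu_offload()       # roughly what medvram does
# pipe.enable_sequential_cpu_offload()  # roughly what lowvram does (very slow)
```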

Compile

| Compile type               | Performance (it/s) | Overhead (s) |
|----------------------------|--------------------|--------------|
| cudnn/default              | 73.5               | 4            |
| inductor/default           | 89.0               | 40           |
| inductor/reduce-overhead   | 92.0               | 40           |
| inductor/max-autotune      | 91.0               | 220          |
| nvfuser/default            | 84.0               | 5            |
| cudagraphs/reduce-overhead | 85.0               | 14           |
| stable-fast/sdp            | 96.0               | 76           |
| stable-fast/xformers       | 96.0               | 101          |
| stable-fast/full-graph     | 94.0               | 96           |

Notes: Compile

  • Performance numbers are in it/s and higher is better
  • Overhead is the time in seconds needed to optimize a model with specific params, and lower is better
    A model needs to compile on the initial generate, but it may also need a recompile if params such as resolution or batch size change (see the sketch below)
  • Model compile may not be compatible with any method that modifies the underlying model,
    including loading Lora weights on top of a model
  • The stable-fast compile backend requires that its package be manually installed on the system
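
As a rough sketch of what the inductor rows above correspond to (assuming a diffusers pipeline; SD.Next wires this up through its own compile settings):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # model id is an assumption
).to("cuda")

# Compile the UNet, which does the bulk of per-step work. The first generate
# pays the one-time overhead listed above; changing resolution or batch size
# may trigger a recompile.
pipe.unet = torch.compile(pipe.unet, backend="inductor", mode="reduce-overhead")
```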