Benchmark

Environment

  • Hardware: NVIDIA RTX 4090 with Intel i9-12900KF
  • Packages: Torch 2.1.0 with CUDA 12.1 and cuDNN 8.9
  • Params: model=SD15 | batch-size=4 | batch-count=4 | steps=50 | resolution=512px | sampler=Euler A
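
For reference, the setup above roughly corresponds to the following diffusers call. This is a minimal sketch, not the actual harness used for these numbers; the model id, prompt, and the it/s accounting are assumptions:

```python
import time
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

# SD15 at FP16; model id is an assumption for illustration
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# Euler A sampler
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

steps, batch_size, batch_count = 50, 4, 4
pipe(prompt=["warmup"] * batch_size, num_inference_steps=10)  # exclude one-time costs

start = time.perf_counter()
for _ in range(batch_count):
    pipe(prompt=["benchmark"] * batch_size,
         num_inference_steps=steps, width=512, height=512)
# assumption: reported it/s counts steps x batch size (per-image iterations)
print(f"{batch_count * steps * batch_size / (time.perf_counter() - start):.1f} it/s")
```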

Results

| Precision | Params                       | Diffusers SDP | Diffusers xFormers | Original SDP | Original xFormers | Original None |
|-----------|------------------------------|---------------|--------------------|--------------|-------------------|---------------|
| FP32      | Default                      | 33.0          |                    | 20.0         |                   |               |
| BF16      | Default                      | 73.0          |                    | 45.5         |                   |               |
| FP16      | Default                      | 73.0          | 75.0               | 48.0         | 48.6              | 17.3          |
| FP16      | NHWC (channels last)         | 72.0          |                    |              |                   |               |
| FP16      | HyperTile (256)              | 79.0          |                    |              |                   |               |
| FP16      | ToMe (0.5)                   | 77.0          |                    |              |                   |               |
| FP16      | Model no-move (medvram)      | 85.0          |                    |              |                   |               |
| FP16      | VAE no-slicing, no-tiling    | 73.8          |                    |              |                   |               |
| FP16      | Sequential offload (lowvram) | 27.0          |                    |              |                   |               |

Notes: Options

  • All numbers are in it/s and higher is better
  • The test matrix is not exhaustive, as some options can be combined (e.g. cuDNN + HyperTile)
    while others cannot (e.g. HyperTile + ToMe)
  • Results may differ on other GPU/CPU combinations
    For example, pairing a better CPU with an older GPU may benefit from more processing done on the CPU, leaving the GPU to do only core ML tasks, while pairing a high-end GPU with an older CPU may yield lower results since the CPU cannot feed the GPU fast enough
  • The Diffusers backend performs significantly better than the original backend on modern hardware since tasks remain on the GPU for longer
    Equally, the original backend may perform better on older hardware
  • Quick tasks such as a single image generated at low steps may not be sufficient to fully saturate a high-end GPU, so results will be lower
  • xFormers has a slight performance advantage over SDP
    However, SDP is built into Torch and "just works", while xFormers needs a manual install and is highly version-dependent
  • Some extensions can add significant overhead to pre/post processing even if they are not used
  • Not worth consideration: cuDNN, NHWC, inference mode, eval
    • A full cuDNN benchmark finds the best math algorithm for a specific GPU, but the default is nearly identical
    • channels-last should better trigger utilization of tensor cores, but in practice the result is nearly identical
    • inference mode should enable more optimizations than the default no_grad, but in practice the result is nearly identical
    • eval mode should allow removal of some params in the model, but in practice the result is nearly identical
  • The benefit of BF16 over FP16 is not so much performance as its wider numerical range, which lets it perform calculations where FP16 may result in NaN
  • Running in FP32 results in a ~60% performance drop; if you need FP32, you're leaving a lot of performance on the table
  • The cost of using lowvram is very high as it needs to swap parts of the model in and out of memory; even medvram comes at a noticeable cost
  • Best combination: xFormers, FP16, HyperTile, no-model-move, no VAE slicing/tiling (see the sketch after this list)
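
As a rough illustration (a sketch, assuming the diffusers backend; SD.Next exposes these as settings rather than direct calls), several of the options benchmarked above map onto standard diffusers/torch calls:

```python
import torch
from diffusers import StableDiffusionPipeline

# FP16; model id is an assumption for illustration
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()  # xFormers (needs the package installed)
pipe.unet.to(memory_format=torch.channels_last)    # NHWC / channels-last
pipe.disable_vae_slicing()                         # VAE no-slicing
pipe.disable_vae_tiling()                          # VAE no-tiling

# The memory-saving options and their cost, per the table above:
# pipe.enable_model_cpu_offload()       # roughly what medvram does
# pipe.enable_sequential_cpu_offload()  # roughly what lowvram does (very slow)
```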

Compile

| Compile type               | Performance (it/s) | Overhead (s) |
|----------------------------|--------------------|--------------|
| cudnn/default              | 73.5               | 4            |
| inductor/default           | 89.0               | 40           |
| inductor/reduce-overhead   | 92.0               | 40           |
| inductor/max-autotune      | 91.0               | 220          |
| nvfuser/default            | 84.0               | 5            |
| cudagraphs/reduce-overhead | 85.0               | 14           |
| stable-fast/sdp            | 96.0               | 76           |
| stable-fast/xformers       | 96.0               | 101          |
| stable-fast/full-graph     | 94.0               | 96           |

Notes: Compile

  • Performance numbers are in it/s and higher is better
  • Overhead is the time in seconds needed to optimize a model with specific params, and lower is better
    A model needs to compile on the initial generate, but it may also need a recompile if params such as resolution or batch size change (see the sketch below)
  • Model compile may not be compatible with any method that modifies the underlying model,
    including loading Lora weights on top of a model
  • The stable-fast compile backend requires that its package be manually installed on the system
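
As a rough sketch of what the inductor rows above correspond to (assuming a diffusers pipeline; SD.Next wires this up through its own compile settings):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # model id is an assumption
).to("cuda")

# Compile the UNet, which does the bulk of per-step work. The first generate
# pays the one-time overhead listed above; changing resolution or batch size
# may trigger a recompile.
pipe.unet = torch.compile(pipe.unet, backend="inductor", mode="reduce-overhead")
```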