Benchmark

To run the standardized benchmark, use either the UI (System -> Benchmark) or the CLI script cli/run-benchmark.py.

Both run identical tests, but the CLI is often faster due to lower overhead.

Environment

Hardware: nVidia RTX 4090 with i9-12900KF
Packages: Torch 2.1.0 with CUDA 12.1 and cuDNN 8.9
Params: model=SD15 | batch-size=4 | batch-count=4 | steps=50 | resolution=512px | sampler=Euler A

Results

|||Diffusers||Original|||
|---|---|---|---|---|---|---|
|Precision|Params|SDP|xFormers|SDP|xFormers|None|
|FP32|Default|33.0|20.0|
|BF16|Default|73.0|45.5|
|FP16|Default|73.0|75.0|48.0|48.6|17.3|
||NHWC (channels last)|72.0|
||HyperTile (256)|79.0|
||ToMe (0.5)|77.0|
||Model no-move (medvram)|85.0|
||VAE no-slicing, no-tiling|73.8|
||Sequential offload (lowvram)|27.0|

Notes: Options

- All numbers are in it/s; higher is better
- The test matrix is not complete: some options can be combined (e.g. cuDNN + HyperTile) while others cannot (e.g. HyperTile + ToMe)
- Results may differ on different GPU/CPU combinations
  - For example, pairing a better CPU with an older GPU may benefit from more processing being done on the CPU, leaving the GPU to do only core ML tasks, while pairing a high-end GPU with an older CPU may produce lower results since the CPU cannot feed tasks to the GPU fast enough
- Diffusers performs significantly better than the original backend on modern hardware since tasks remain on the GPU for longer
  - Equally, the original backend may perform better on older hardware
- Running quick tasks, such as generating a single image at low steps, may not be enough to fully saturate a high-end GPU, so results will be lower
- xFormers has a slight performance advantage over SDP
  - However, SDP is built into Torch and "just works", while xFormers needs a manual install and is highly version dependent
- Some extensions can add significant overhead to pre/post processing even if they are not used
- Not worth consideration: cuDNN, NHWC, inference mode, eval
  - cuDNN full bench finds the best math algorithm for the specific GPU, but the default is nearly identical
  - channels-last should trigger better utilization of tensor cores, but in practice the result is nearly identical
  - inference-mode should allow more optimizations than the default no_grad, but in practice the result is nearly identical
  - eval mode should allow removal of some params in the model, but in practice the result is nearly identical
- The benefit of BF16 over FP16 is not so much performance as its wider numerical range, so it can perform calculations where FP16 may result in NaN
- Running in FP32 results in a performance drop of roughly 55% (33.0 vs 73.0 it/s in the table above); if you need FP32, you're leaving a lot on the table
- The cost of using lowvram is very high as it needs to swap parts of the model in and out of memory. Even medvram comes at a noticeable cost
- Best: xFormers, FP16, HyperTile, no-model-move, no-slicing/tiling
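
The BF16 vs FP16 range note can be illustrated without a GPU. The following is a minimal sketch using only the Python standard library, not how Torch implements these types: FP16 is round-tripped through the `struct` half-precision format, and BF16 is emulated by truncating the FP32 encoding to its top 16 bits (real hardware typically rounds instead of truncating). The helper names `to_fp16` and `to_bf16` are hypothetical.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (FP16)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses finite values beyond the FP16 range;
        # hardware arithmetic would produce +/-inf here instead.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping the top 16 bits of the FP32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


value = 70000.0  # just above the largest finite FP16 value (65504)
fp16 = to_fp16(value)
bf16 = to_bf16(value)

print("FP16:", fp16, "| fp16 - fp16 =", fp16 - fp16)  # overflow, then inf - inf gives NaN
print("BF16:", bf16)  # stays finite, but with fewer significant digits
```

BF16 keeps the 8-bit exponent of FP32 and gives up mantissa bits, which is why the value survives but loses precision.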

Compile

|Compile type|Performance (it/s)|Overhead (s)|
|---|---|---|
|cudnn/default|73.5|4|
|inductor/default|89.0|40|
|inductor/reduce-overhead|92.0|40|
|inductor/max-autotune|91.0|220|
|nvfuser/default|84.0|5|
|cudagraphs/reduce-overhead|85.0|14|
|stable-fast/sdp|96.0|76|
|stable-fast/xformers|96.0|101|
|stable-fast/full-graph|94.0|96|

Notes: Compile

- Performance numbers are in it/s; higher is better
- Overhead is the time in seconds needed to optimize a model with specific params; lower is better
- The model needs to compile on the initial generate, but it may also need a recompile if params such as resolution or batch size change
- Model compile may not be compatible with any method that modifies the underlying model, including loading Lora weights on top of a model
- The stable-fast compile backend requires that the package is manually installed on the system
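
Since compile overhead is paid up front (and again on every recompile), it only pays off after enough iterations. A rough break-even estimate can be sketched as below. This is a back-of-the-envelope calculation, assuming cudnn/default is the uncompiled baseline and that total time is simply overhead plus iterations divided by it/s; the function names are hypothetical.

```python
def total_seconds(iterations: float, its: float, overhead_s: float) -> float:
    """Total time: one-time compile overhead plus steady-state generation."""
    return overhead_s + iterations / its


def break_even_iterations(base_its, base_overhead_s, new_its, new_overhead_s):
    """Iterations after which the compiled variant becomes faster overall.

    Returns None if the compiled variant is not faster in steady state.
    """
    if new_its <= base_its:
        return None
    extra_overhead = new_overhead_s - base_overhead_s
    if extra_overhead <= 0:
        return 0.0
    # Solve: base_overhead + n / base_its == new_overhead + n / new_its
    return extra_overhead / (1.0 / base_its - 1.0 / new_its)


# (it/s, overhead in seconds) taken from the Compile table
baseline = (73.5, 4)  # assumption: cudnn/default is the uncompiled baseline
variants = {
    "inductor/default": (89.0, 40),
    "inductor/max-autotune": (91.0, 220),
    "stable-fast/sdp": (96.0, 76),
}

for name, (its, overhead) in variants.items():
    n = break_even_iterations(*baseline, its, overhead)
    print(f"{name}: break-even after ~{n:,.0f} iterations")
```

If one iteration corresponds to one step for one image, the inductor/default figure works out to roughly 300 images at 50 steps before the compile time is recovered, so compile is mostly worthwhile for long sessions with fixed resolution and batch size.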

Intel ARC

Environment

Hardware: Intel ARC 770 LE 16GB with R7 5800X3D & MSI B350M Mortar (PCI-E 3.0) & 48 GB 3200 MHz CL18 RAM
OS: Arch Linux with this Docker environment: https://github.com/Disty0/docker-sdnext-ipex
Packages: Torch 2.1.0a0+cxx11.abi with IPEX 2.1.10+xpu and MKL / DPCPP 2024.0.0
Params: model=SD15 | batch-size=1 | batch-count=1 | steps=40 | resolution=512px | sampler=Euler a | CFG 6

Results

|Precision|Params|Diffusers (it/s)|Original (it/s)|
|---|---|---|---|
|BF16|Default|8.54|7.75|
|FP16|Default|6.92|7.23|
|FP32|Default|3.73|3.74|
|BF16|HyperTile (256)|10.03|9.32|
|BF16|ToMe (0.5)|9.24|8.61|
|BF16|No IPEX Optimize|8.23|7.82|
|BF16|Model no-move (medvram)|9.04|
|BF16|VAE no-slicing, no-tiling|8.67|
|BF16|Sequential offload (lowvram)|1.60|0.67|

OpenVINO

Environment

Hardware: Intel ARC 770 LE 16GB with R7 5800X3D & MSI B350M Mortar (PCI-E 3.0) & 48 GB 3200 MHz CL18 RAM
OS: Arch Linux
Packages: Torch 2.1.2+cpu and OpenVINO 2023.2.0
Params: model=SD15 | batch-size=1 | batch-count=1 | steps=20 | resolution=512px | sampler=Euler a | CFG 6

GPU Results

|Precision|Params|Diffusers (it/s)|
|---|---|---|
|Default|Default|9.21|

CPU Results

|Precision|Params|Diffusers (s/it)|
|---|---|---|
|Default|Default|3.00|
|Default|LCM & CFG 0|1.60|
|INT8|Default|3.30|
|INT4_SYM|Default|4.00|
|INT4_ASYM|Default|4.30|
|NF4|Default|5.25|
|FP32|Diffusers & No OpenVINO|4.20|
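
Note that the CPU results are in seconds per iteration (lower is better), while the GPU results are in iterations per second (higher is better). A small sketch for putting both on the same scale, using the numbers from the two OpenVINO tables; the variable and function names are illustrative only:

```python
def to_its(seconds_per_it: float) -> float:
    """Convert s/it to it/s."""
    return 1.0 / seconds_per_it


# s/it values from the CPU results table, keyed by (precision, params)
cpu_s_per_it = {
    ("Default", "Default"): 3.00,
    ("Default", "LCM & CFG 0"): 1.60,
    ("INT8", "Default"): 3.30,
    ("INT4_SYM", "Default"): 4.00,
    ("INT4_ASYM", "Default"): 4.30,
    ("NF4", "Default"): 5.25,
    ("FP32", "Diffusers & No OpenVINO"): 4.20,
}
gpu_its = 9.21  # it/s from the GPU results table

for (precision, params), s_per_it in cpu_s_per_it.items():
    its = to_its(s_per_it)
    print(f"{precision:>9} | {params:<24} | {its:.2f} it/s | GPU is {gpu_its / its:.1f}x faster")
```

On this scale the default CPU run is about 0.33 it/s, so the ARC GPU run in the same environment is roughly 28x faster.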

API Benchmarks

Using SD.Next (dev branch as of 2024-02-07) with Torch 2.2.0 and CUDA 12.1
Note: Usage of SD.Next via API is faster than via UI due to lower overhead.

Environment: Intel i9-13900KF platform with nVidia RTX 4090 GPU

With simple settings, peak performance reaches ~110 it/s:

vlado@wsl:~/dev/sdnext-dev $ python cli/run-benchmark.py --maxbatch 32
2024-02-07 11:19:53,026 INFO: {'run-benchmark'}
2024-02-07 11:19:53,027 INFO: {'options': {'prompt': 'photo of two dice on a table', 'negative_prompt': 'foggy, blurry', 'steps': 50, 'sampler_name': 'Euler a', 'width': 512, 'height': 512, 'full_quality': True, 'cfg_scale': 0, 'batch_size': 1, 'n_iter': 1, 'seed': -1}}
2024-02-07 11:19:53,046 INFO: {'version': {'app': 'sd.next', 'updated': '2024-02-07', 'hash': 'd967bd03', 'url': 'https://github.com/vladmandic/automatic/tree/dev'}}
2024-02-07 11:19:53,048 INFO: {'platform': {'arch': 'x86_64', 'cpu': 'x86_64', 'system': 'Linux', 'release': '5.15.146.1-microsoft-standard-WSL2', 'python': '3.11.1', 'torch': '2.2.0+cu121', 'diffusers': '0.26.2', 'gradio': '3.43.2'}}
2024-02-07 11:19:53,051 INFO: {'model': 'sd15/lyriel-v16 [ec6f68ea63]'}
2024-02-07 11:19:53,054 INFO: {'system': {'cpu': {'free': 49020043264.0, 'used': 1495736320, 'total': 50515779584.0}, 'gpu': {'system': {'free': 24110956544, 'used': 1645740032, 'total': 25756696576}, 'session': {'current': 0, 'peak': 0}}}}
2024-02-07 11:19:53,054 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16, 24, 32]}
2024-02-07 11:19:59,394 INFO: {'warmup': 6.34}
2024-02-07 11:20:02,354 INFO: {'batch': 1, 'its': 33.63, 'img': 1.49, 'wall': 1.49, 'peak': 7.05, 'oom': False}
2024-02-07 11:20:06,213 INFO: {'batch': 2, 'its': 64.3, 'img': 0.78, 'wall': 1.56, 'peak': 7.1, 'oom': False}
2024-02-07 11:20:11,293 INFO: {'batch': 4, 'its': 90.87, 'img': 0.55, 'wall': 2.2, 'peak': 7.18, 'oom': False}
2024-02-07 11:20:19,416 INFO: {'batch': 8, 'its': 104.6, 'img': 0.48, 'wall': 3.82, 'peak': 7.18, 'oom': False}
2024-02-07 11:20:30,850 INFO: {'batch': 12, 'its': 111.96, 'img': 0.45, 'wall': 5.36, 'peak': 7.18, 'oom': False}
2024-02-07 11:20:46,236 INFO: {'batch': 16, 'its': 110.37, 'img': 0.45, 'wall': 7.25, 'peak': 7.18, 'oom': False}
2024-02-07 11:21:09,338 INFO: {'batch': 24, 'its': 109.75, 'img': 0.46, 'wall': 10.93, 'peak': 7.18, 'oom': False}
2024-02-07 11:21:39,623 INFO: {'batch': 32, 'its': 111.38, 'img': 0.45, 'wall': 14.37, 'peak': 7.18, 'oom': False}
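
The per-batch result lines in this log are Python dict literals, so they can be post-processed with the standard library. The following is a minimal sketch, assuming the log format shown above; `parse_results` is a hypothetical helper, not part of cli/run-benchmark.py. It also checks an inferred relationship: the logged numbers appear consistent with it/s ≈ steps × batch / wall, though that is read off the figures rather than confirmed from the script.

```python
import ast


def parse_results(log_text: str) -> list:
    """Extract per-batch result dicts from benchmark log output."""
    results = []
    for line in log_text.splitlines():
        _, sep, payload = line.partition(" INFO: ")
        if not sep:
            continue  # not a log line (e.g. the shell prompt)
        try:
            record = ast.literal_eval(payload)
        except (ValueError, SyntaxError):
            continue
        # Keep only result records; other lines are sets or dicts without 'its'
        if isinstance(record, dict) and "its" in record:
            results.append(record)
    return results


# A few lines copied from the log above
SAMPLE = """\
2024-02-07 11:19:53,054 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16, 24, 32]}
2024-02-07 11:20:02,354 INFO: {'batch': 1, 'its': 33.63, 'img': 1.49, 'wall': 1.49, 'peak': 7.05, 'oom': False}
2024-02-07 11:20:19,416 INFO: {'batch': 8, 'its': 104.6, 'img': 0.48, 'wall': 3.82, 'peak': 7.18, 'oom': False}
2024-02-07 11:20:30,850 INFO: {'batch': 12, 'its': 111.96, 'img': 0.45, 'wall': 5.36, 'peak': 7.18, 'oom': False}
"""

STEPS = 50  # from the 'options' line of the log

results = parse_results(SAMPLE)
best = max(results, key=lambda r: r["its"])
print(f"peak: {best['its']} it/s at batch size {best['batch']}")

for r in results:
    estimate = STEPS * r["batch"] / r["wall"]
    print(f"batch {r['batch']:>2}: logged {r['its']} it/s, steps*batch/wall = {estimate:.1f}")
```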

With full optimizations and a custom-compiled Stable-Fast, peak performance reaches ~150 it/s (and ~165 it/s using TAESD instead of the full VAE):

vlado@wsl:~/dev/sdnext-dev $ python cli/run-benchmark.py --maxbatch 32
2024-02-07 11:29:23,431 INFO: {'run-benchmark'}
2024-02-07 11:29:23,432 INFO: {'options': {'prompt': 'photo of two dice on a table', 'negative_prompt': 'foggy, blurry', 'steps': 50, 'sampler_name': 'Euler a', 'width': 512, 'height': 512, 'full_quality': True, 'cfg_scale': 0, 'batch_size': 1, 'n_iter': 1, 'seed': -1}}
2024-02-07 11:29:23,451 INFO: {'version': {'app': 'sd.next', 'updated': '2024-02-07', 'hash': 'd967bd03', 'url': 'https://github.com/vladmandic/automatic/tree/dev'}}
2024-02-07 11:29:23,453 INFO: {'platform': {'arch': 'x86_64', 'cpu': 'x86_64', 'system': 'Linux', 'release': '5.15.146.1-microsoft-standard-WSL2', 'python': '3.11.1', 'torch': '2.2.0+cu121', 'diffusers': '0.26.2', 'gradio': '3.43.2'}}
2024-02-07 11:29:23,456 INFO: {'model': 'sd15/lyriel-v16 [ec6f68ea63]'}
2024-02-07 11:29:23,459 INFO: {'system': {'cpu': {'free': 49373564927.99999, 'used': 1142214656, 'total': 50515779583.99999}, 'gpu': {'system': {'free': 24110956544, 'used': 1645740032, 'total': 25756696576}, 'session': {'current': 0, 'peak': 0}}}}
2024-02-07 11:29:23,459 INFO: {'batch-sizes': [1, 1, 2, 4, 8, 12, 16, 24, 32]}
2024-02-07 11:29:38,504 INFO: {'warmup': 15.04}
2024-02-07 11:29:38,965 INFO: {'batch': 1, 'its': 78.16, 'img': 0.67, 'wall': 0.23, 'peak': 7.11, 'oom': False}
2024-02-07 11:29:42,630 INFO: {'batch': 2, 'its': 98.91, 'img': 0.51, 'wall': 1.01, 'peak': 7.11, 'oom': False}
2024-02-07 11:29:47,192 INFO: {'batch': 4, 'its': 117.92, 'img': 0.42, 'wall': 1.7, 'peak': 7.11, 'oom': False}
2024-02-07 11:29:54,028 INFO: {'batch': 8, 'its': 142.42, 'img': 0.35, 'wall': 2.81, 'peak': 7.11, 'oom': False}
2024-02-07 11:30:03,161 INFO: {'batch': 12, 'its': 153.29, 'img': 0.33, 'wall': 3.91, 'peak': 7.11, 'oom': False}
2024-02-07 11:30:14,921 INFO: {'batch': 16, 'its': 153.41, 'img': 0.33, 'wall': 5.21, 'peak': 7.11, 'oom': False}
2024-02-07 11:30:33,534 INFO: {'batch': 24, 'its': 144.65, 'img': 0.35, 'wall': 8.3, 'peak': 7.11, 'oom': False}
2024-02-07 11:30:56,914 INFO: {'batch': 32, 'its': 150.59, 'img': 0.33, 'wall': 10.63, 'peak': 7.11, 'oom': False}
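
To see where the optimized run gains the most, the it/s values of the two logs can be compared per batch size. A short sketch with the values copied from the two runs above (the dictionary and function names are illustrative):

```python
# it/s by batch size, copied from the two benchmark logs
baseline = {1: 33.63, 2: 64.3, 4: 90.87, 8: 104.6, 12: 111.96, 16: 110.37, 24: 109.75, 32: 111.38}
stable_fast = {1: 78.16, 2: 98.91, 4: 117.92, 8: 142.42, 12: 153.29, 16: 153.41, 24: 144.65, 32: 150.59}


def speedups(before: dict, after: dict) -> dict:
    """Ratio after/before for every batch size present in both runs."""
    return {batch: after[batch] / before[batch] for batch in before if batch in after}


ratios = speedups(baseline, stable_fast)
for batch, ratio in ratios.items():
    print(f"batch {batch:>2}: {ratio:.2f}x")

peak_ratio = max(stable_fast.values()) / max(baseline.values())
print(f"peak vs peak: {peak_ratio:.2f}x")
```

In these two runs the relative gain is largest at batch size 1 and settles at around 1.3-1.4x once the GPU is saturated at larger batch sizes.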

Additional performance may be gained by experimenting with other settings, but combining them may lead to unstable results.
For example: channels-last, hyper-tile, tomesd, fused-projections
