
Device::create_buffer is sometimes slow (4ms) and slows down rendering. #5984

Open · John-Nagle opened this issue Jul 18, 2024 · 8 comments
Labels: area: performance (How fast things go)

Comments

@John-Nagle

Description
A performance bottleneck found with Tracy: too much time is spent in the "Device::create_buffer" scope, and that seems to delay work on other threads.
This should be a fast operation, but it sometimes takes about 4 ms.

Repro steps

  1. Get render-bench and build the "hp" branch in release mode.
  2. Run it under the Tracy profiler 0.10.
  3. Zoom in on the slowest frames in the profiler.

The test creates a large number of visible objects on screen, waits 10 seconds, deletes them, waits 10 seconds, and repeats. Capture one full create/delete cycle.

Expected vs observed behavior
The code in the profiling scope "Device::create_buffer" is 1) taking as long as 4 ms, and 2) locking out other operations on the render thread. As far as I can tell, that ought to be a fast operation.
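
For context, a minimal sketch of how such a scope can be instrumented with the `profiling` crate (which wgpu uses for its own internal scopes). The function name and the descriptor values are placeholders for illustration, not what render-bench actually passes:

```rust
// Sketch only: wrap a buffer allocation in a named profiling scope so it
// shows up as a zone in Tracy. Label, size, and usage flags are placeholders.
fn allocate_buffer(device: &wgpu::Device, size: u64) -> wgpu::Buffer {
    profiling::scope!("Device::create_buffer");
    device.create_buffer(&wgpu::BufferDescriptor {
        label: Some("bench buffer"),
        size,
        usage: wgpu::BufferUsages::VERTEX | wgpu::BufferUsages::COPY_DST,
        mapped_at_creation: false,
    })
}
```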

Extra materials
Screenshot of the trace: renderbenchcreatebuffer1 (image attachment)

Full Tracy trace file: renderbenchcreatebuffer.zip

Platform
WGPU 0.20 from crates.io
Linux 22.04 LTS.
NVIDIA RTX 3070. Driver 535 (proprietary, tested)

@partisani

What do you mean by Linux 22.04 LTS? The latest version of the Linux kernel is 6.10.

@John-Nagle
Author

Ubuntu 22.04 LTS

@Wumpf
Member

Wumpf commented Jul 18, 2024

> As far as I can tell, that ought to be a fast operation.

I think wgpu should do a better job documenting that it is in fact known to be a very slow operation.

That said, it still needs to be looked at whether it really has to lock out render pass recording (or vice versa, not that it matters :)).
wgpu is making considerable progress in this area, so it might be worth checking whether the just-released 22.0.0 got better in that regard, but I'd be a bit surprised if it's fundamentally different (but who knows! I've personally lost track of all the refactors that went in 😅).

Ideally, it would only be an "occasionally very slow" operation, i.e. whenever it actually happens to bottom out to an allocation in the driver (which shouldn't happen all that often)!
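
A minimal sketch of the mitigation that distinction implies, assuming a simple bump-allocated pool (not wgpu's or Rend3's actual allocator): create one large buffer up front and hand out ranges from it, so per-object allocations rarely bottom out in the driver. Capacity, alignment, and usage flags below are illustrative assumptions.

```rust
// Sketch only: a bump allocator over one pre-created wgpu buffer.
struct VertexPool {
    backing: wgpu::Buffer,
    cursor: u64,
    capacity: u64,
}

impl VertexPool {
    fn new(device: &wgpu::Device, capacity: u64) -> Self {
        // One driver-level allocation, paid once instead of per mesh.
        let backing = device.create_buffer(&wgpu::BufferDescriptor {
            label: Some("pooled vertex storage"),
            size: capacity,
            usage: wgpu::BufferUsages::VERTEX | wgpu::BufferUsages::COPY_DST,
            mapped_at_creation: false,
        });
        Self { backing, cursor: 0, capacity }
    }

    /// Reserve `len` bytes; returns the byte offset into the backing buffer,
    /// or None when the pool is exhausted and a real (slow) allocation is needed.
    fn allocate(&mut self, len: u64) -> Option<u64> {
        let offset = (self.cursor + 3) & !3; // 4-byte alignment for buffer copies
        if offset + len > self.capacity {
            return None;
        }
        self.cursor = offset + len;
        Some(offset)
    }
}
```

As the following comments suggest, the buffer at issue here already appears to be a large shared vertex buffer managed by the mesh manager, so the slow path is likely the occasional creation or growth of that backing buffer itself.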

@Wumpf added the "area: performance (How fast things go)" label on Jul 18, 2024
@John-Nagle
Author

> just released (WGPU) 22.0.0

OK, I will upgrade all my code, and Rend3, and re-test. More tomorrow.

> very slow operation.

Indeed. 4 ms is slow for something in the main render loop.

@cwfitzgerald
Member

The buffer in question is the vertex buffer (you can tell by it being accessed by the mesh manager). This buffer can get very large, and large allocations can take a while for us to create, since the underlying memory allocation itself takes some time. It shouldn't block the main thread, however.
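
One way to read "it shouldn't block the main thread": on native backends `wgpu::Device` is Send + Sync, so a large allocation can at least be issued from a worker thread. A minimal sketch under that assumption; the `Arc` sharing, function name, and 512 MiB size are illustrative, not how Rend3 actually does it:

```rust
use std::{sync::Arc, thread};

// Sketch only: issue a large, slow allocation off the render thread.
// The size is a placeholder; the mesh manager decides the real one.
fn create_vertex_buffer_async(device: Arc<wgpu::Device>) -> thread::JoinHandle<wgpu::Buffer> {
    thread::spawn(move || {
        device.create_buffer(&wgpu::BufferDescriptor {
            label: Some("large vertex buffer"),
            size: 512 * 1024 * 1024,
            usage: wgpu::BufferUsages::VERTEX | wgpu::BufferUsages::COPY_DST,
            mapped_at_creation: false,
        })
    })
}
```

The caller would `join()` the handle once the buffer is actually needed. Whether this removes the stall is exactly the open question in this thread: the trace suggests the allocation still contends with render pass recording inside wgpu.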

@John-Nagle
Author

Waiting for wgpu-egui and wgpu-profiler to catch up to wgpu 22.0.0. Both have the appropriate pull requests.

@John-Nagle
Author

> The buffer in question is the vertex buffer (you can tell by it being accessed by the mesh manager). This buffer can get very large, and large allocations can take a while for us to create, since the underlying memory allocation itself takes some time. It shouldn't block the main thread, however.

Right. Profiling can show this happening, but extracting cross-thread cause and effect from profiling data is hard.

@John-Nagle
Author

The pull request to update wgpu-profiler failed. See Wumpf/wgpu-profiler#75

Apparently a new WGPU release is needed to fix that.
