
Add indirect dispatch and CPU shaders #360

Merged: 4 commits merged into main on Sep 19, 2023

Conversation

raphlinus (Contributor):

A first cut at adding indirect dispatch and CPU shaders. This version has just one (very simple) shader, and allows its selection on the command line.
@DJMcNab (Member) left a comment:


I don't want to approve, as I haven't run the code, and some of the finer points are likely lost on me. But this looks really good.

I appreciate the simplifying choice to not support reading results back from the GPU to the CPU dynamically (even ignoring the extreme performance costs). I hadn't appreciated that this choice had been made before, but it makes a lot of sense. It would be good to document that somewhere, possibly in cpu_shader/mod.rs?

I have added some thoughts on developer experience.

Resolved threads: src/engine.rs (outdated), src/engine.rs, src/lib.rs
Comment on lines +98 to +100
if use_cpu {
    engine.set_cpu_shader(pathtag_reduce, cpu_shader::pathtag_reduce);
}
Member:


This feels quite noisy, but I don't know if avoiding that is worth the hassle.

raphlinus (Contributor, Author):


Yeah, I'm not sure how best to address it. There's a lot of boilerplate, and I suppose it might be possible to address it with macros, but I don't see a clear win. I'm also purposefully choosing not to do fine-grained selection of some shaders but not others; that would increase the repetition further. It would look something like: if use_cpu.pathtag_reduce { engine.set_shader(pathtag_reduce, cpu_shader::pathtag_reduce); }.
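
For illustration only, a minimal sketch of the macro idea floated above, assuming the engine, use_cpu, and pathtag_reduce identifiers from the snippet earlier in this thread; the macro name and the extra shader name in the usage comment are placeholders, not anything in this PR.

macro_rules! cpu_shaders {
    ($engine:expr, $use_cpu:expr, $($shader:ident),* $(,)?) => {
        if $use_cpu {
            $( $engine.set_cpu_shader($shader, cpu_shader::$shader); )*
        }
    };
}

// Hypothetical usage; any shader name other than pathtag_reduce is a placeholder.
// cpu_shaders!(engine, use_cpu, pathtag_reduce, pathtag_scan);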

Resolved thread: src/cpu_shader/pathtag_reduce.rs
let mut m = PathMonoid::default();
for j in 0..WG_SIZE {
    let tag = scene[(pathtag_base + i * WG_SIZE as u32) as usize + j];
    m = m.combine(&PathMonoid::new(tag));
Member:


Musings on bad and premature optimisations: I wonder if these operations could be reordered to improve pipelining. That is, since operation 2 depends on operation 1 (according to the CPU), could we start operations 3/4 whilst 1/2 are ongoing? My (likely unfounded) intuition around CPU implementations is that loop steps depending on the previous iteration are slow, because each operation must fully complete before the next can begin.

Is there some way to signal to LLVM that these operations are order-invariant?

raphlinus (Contributor, Author):


Interesting question, but out of scope for this work. The specific goal is a reference implementation that matches the GPU (especially in interface and memory layout) while also being clear enough to serve as a check for correctness. If you're micro-optimizing, there's quite a lot that can potentially be done. For example, you might compute the monoid in SIMD lanes, then do a reduction afterwards. I believe there's a whole research agenda, possibly a PhD, in how best to implement parallelizable primitives like scan. Ideally you'd just express your high-level intent, "I want to scan this monoid," and the compiler + library would work together to get you the best implementation tuned for the target, exploiting normal scalar optimizations like you propose, SIMD, ispc-like techniques, multithreading (with work-stealing queues on CPU), and both single-pass and multi-dispatch approaches on GPU.

For now, we just grind out "good enough" implementations.
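
As a purely illustrative sketch of the kind of reordering being mused about (not something in this PR): because combine is associative, the loop in the excerpt above could accumulate into a few independent lanes, each covering a contiguous block of the workgroup, and stitch the blocks together at the end. This assumes WG_SIZE is a multiple of LANES and that PathMonoid is Copy + Default.

const LANES: usize = 4;
let base = (pathtag_base + i * WG_SIZE as u32) as usize;
let chunk = WG_SIZE / LANES;
let mut acc = [PathMonoid::default(); LANES];
for j in 0..chunk {
    for k in 0..LANES {
        // The LANES combine chains are independent, so the CPU gets more
        // instruction-level parallelism than with one loop-carried chain.
        let tag = scene[base + k * chunk + j];
        acc[k] = acc[k].combine(&PathMonoid::new(tag));
    }
}
// Blocks are recombined in their original order, so associativity alone
// (no commutativity) is needed for this to match the sequential loop.
let m = acc[0].combine(&acc[1]).combine(&acc[2]).combine(&acc[3]);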

Resolved thread: src/cpu_dispatch.rs
}

pub fn pathtag_reduce(n_wg: u32, resources: &[CpuBinding]) {
    let r0 = resources[0].as_buf();
Collaborator:


Not for this PR, but we should consider some generic functions on CpuBinding to reduce this boilerplate in the future, so you can do something like let config = resources[0].as::<ConfigUniform>();

raphlinus (Contributor, Author):


Heh, I started out saying "sadly, no," but then saw a way forward and it seems to work. Having that method return &ConfigUniform would not work because it would drop the borrow from the RefCell too early, but it is possible to make a typed guard, and have that do bytemuck in its deref impl. Inference also works, so you don't need to turbofish the type.

Probably best as a followup PR so we can land this, but now I'm inclined to do it. One question is whether it should just implement deref_mut (which can panic if the resource is read-only) or make the client write out a method call to make that fallibility explicit. (It's also the case that bytemuck can panic, for example if alignment is not satisfied, so maybe that's not a real problem.)

Followup: I think it's possible to solve the panic problem by having all checks in the method that generates the guard, so the deref can't fail (the docs say "this trait should never fail" in boldface). I'll queue this up as a followup.
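
For concreteness, a rough sketch of what such a typed guard could look like; the names (TypedGuard, as_typed) are illustrative, the backing RefCell<Vec<u8>> is a guess at the wrapper type rather than the actual CpuBinding internals, and the point is simply that every fallible check happens in the constructor so Deref itself cannot panic.

use std::cell::{Ref, RefCell};
use std::marker::PhantomData;
use std::mem::{align_of, size_of};
use std::ops::Deref;

use bytemuck::Pod;

// Illustrative guard: it owns the RefCell borrow for as long as the typed
// view is alive, so the borrow is not dropped too early.
pub struct TypedGuard<'a, T> {
    buf: Ref<'a, Vec<u8>>,
    _marker: PhantomData<T>,
}

impl<'a, T: Pod> Deref for TypedGuard<'a, T> {
    type Target = T;

    fn deref(&self) -> &T {
        // Cannot panic: size and alignment were checked when the guard was
        // constructed, and the buffer cannot move while it stays borrowed.
        bytemuck::from_bytes(&self.buf[..size_of::<T>()])
    }
}

// All fallible checks live here, so Deref never fails afterwards.
pub fn as_typed<T: Pod>(cell: &RefCell<Vec<u8>>) -> TypedGuard<'_, T> {
    let buf = cell.borrow();
    assert!(buf.len() >= size_of::<T>(), "buffer too small for type");
    assert_eq!(buf.as_ptr() as usize % align_of::<T>(), 0, "misaligned buffer");
    TypedGuard { buf, _marker: PhantomData }
}

A mutable variant would raise the fallibility question discussed above; keeping those checks in the constructor is what would let a DerefMut stay infallible as well.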

Resolved threads: src/engine.rs (3 threads)
Follow-up commits:
Mostly minor changes in response to review. The main behavior change is to preserve use_cpu when hot-reloading (just the GPU) shaders.
Mostly involves more wgpu feature gates.
This is in line with the recommendation in https://docs.rs/getrandom/latest/getrandom/#webassembly-support. We don't know exactly what broke, but this seems like a reasonable fix.
raphlinus merged commit 56753e0 into main on Sep 19, 2023.
4 checks passed.
raphlinus deleted the indirect branch on September 19, 2023 at 18:48.