
Add indirect dispatch and CPU shaders #360

Merged: 4 commits merged into main on Sep 19, 2023

Conversation

raphlinus (Contributor):

A first cut at adding indirect dispatch and CPU shaders. This version has just one (very simple) shader, and allows its selection on the command line.
@DJMcNab (Member) left a comment:


I don't want to approve, as I haven't run the code, and some of the finer points are likely lost on me. But this looks really good.

I appreciate the simplifying choice to not support reading results back from the GPU to the CPU dynamically (even ignoring the extreme performance costs). I hadn't appreciated that this choice had been made before, but it makes a lot of sense. It would be good to document that somewhere, possibly in cpu_shader/mod.rs?

I have added some thoughts on developer experience.

Resolved threads: src/engine.rs (outdated), src/engine.rs, src/lib.rs
Comment on lines +98 to +100
if use_cpu {
    engine.set_cpu_shader(pathtag_reduce, cpu_shader::pathtag_reduce);
}
Member:


This feels quite noisy, but I don't know if avoiding that is worth the hassle.

raphlinus (Contributor, Author):


Yeah, I'm not sure how best to address it. There's a lot of boilerplate, and I suppose it might be possible to address it with macros, but I don't see a clear win. I'm also purposefully choosing not to do fine-grained selection of some shaders but not others; that would increase the repetition further. It would look something like: if use_cpu.pathtag_reduce { engine.set_shader(pathtag_reduce, cpu_shader::pathtag_reduce); }.
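
For illustration only, a minimal sketch of the macro idea floated above, assuming the engine, use_cpu, and pathtag_reduce identifiers from the snippet earlier in this thread; the macro name and the extra shader name in the usage comment are placeholders, not anything in this PR.

macro_rules! cpu_shaders {
    ($engine:expr, $use_cpu:expr, $($shader:ident),* $(,)?) => {
        if $use_cpu {
            $( $engine.set_cpu_shader($shader, cpu_shader::$shader); )*
        }
    };
}

// Hypothetical usage; any shader name other than pathtag_reduce is a placeholder.
// cpu_shaders!(engine, use_cpu, pathtag_reduce, pathtag_scan);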

Resolved thread: src/cpu_shader/pathtag_reduce.rs
let mut m = PathMonoid::default();
for j in 0..WG_SIZE {
    let tag = scene[(pathtag_base + i * WG_SIZE as u32) as usize + j];
    m = m.combine(&PathMonoid::new(tag));
Member:


Musings on bad and premature optimisations: I wonder if these operations could be reordered to improve pipelining. That is, since operation 2 depends on operation 1 (according to the CPU), could we start operations 3/4 whilst 1/2 are ongoing? My (likely unfounded) intuition around CPU implementations is that loop steps depending on the previous iteration are slow, because each operation must fully complete before the next can begin.

Is there some way to signal to LLVM that these operations are order-invariant?

raphlinus (Contributor, Author):


Interesting question, but out of scope for this work. The specific goal is a reference implementation that matches the GPU (especially in interface and memory layout) while also being clear enough to serve as a check for correctness. If you're micro-optimizing, there's quite a lot that can potentially be done. For example, you might compute the monoid in SIMD lanes, then do a reduction afterwards. I believe there's a whole research agenda, possibly a PhD, in how best to implement parallelizable primitives like scan. Ideally you'd just express your high-level intent, "I want to scan this monoid," and the compiler + library would work together to get you the best implementation tuned for the target, exploiting normal scalar optimizations like you propose, SIMD, ispc-like techniques, multithreading (with work-stealing queues on CPU), and both single-pass and multi-dispatch approaches on GPU.

For now, we just grind out "good enough" implementations.
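
As a purely illustrative sketch of the kind of reordering being mused about (not something in this PR): because combine is associative, the loop in the excerpt above could accumulate into a few independent lanes, each covering a contiguous block of the workgroup, and stitch the blocks together at the end. This assumes WG_SIZE is a multiple of LANES and that PathMonoid is Copy + Default.

const LANES: usize = 4;
let base = (pathtag_base + i * WG_SIZE as u32) as usize;
let chunk = WG_SIZE / LANES;
let mut acc = [PathMonoid::default(); LANES];
for j in 0..chunk {
    for k in 0..LANES {
        // The LANES combine chains are independent, so the CPU gets more
        // instruction-level parallelism than with one loop-carried chain.
        let tag = scene[base + k * chunk + j];
        acc[k] = acc[k].combine(&PathMonoid::new(tag));
    }
}
// Blocks are recombined in their original order, so associativity alone
// (no commutativity) is needed for this to match the sequential loop.
let m = acc[0].combine(&acc[1]).combine(&acc[2]).combine(&acc[3]);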

Resolved thread: src/cpu_dispatch.rs
}

pub fn pathtag_reduce(n_wg: u32, resources: &[CpuBinding]) {
    let r0 = resources[0].as_buf();
Collaborator:


Not for this PR, but we should consider some generic functions on CpuBinding to reduce this boilerplate in the future, so you can do something like let config = resources[0].as::<ConfigUniform>();

raphlinus (Contributor, Author):


Heh, I started out saying "sadly, no," but then saw a way forward and it seems to work. Having that method return &ConfigUniform would not work because it would drop the borrow from the RefCell too early, but it is possible to make a typed guard, and have that do bytemuck in its deref impl. Inference also works, so you don't need to turbofish the type.

Probably best as a followup PR so we can land this, but now I'm inclined to do it. One question is whether it should just implement deref_mut (which can panic if the resource is read-only) or make the client write out a method call to make that fallibility explicit. (It's also the case that bytemuck can panic, for example if alignment is not satisfied, so maybe that's not a real problem.)

Followup: I think it's possible to solve the panic problem by having all checks in the method that generates the guard, so the deref can't fail (the docs say "this trait should never fail" in boldface). I'll queue this up as a followup.
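
For concreteness, a rough sketch of what such a typed guard could look like; the names (TypedGuard, as_typed) are illustrative, the backing RefCell<Vec<u8>> is a guess at the wrapper type rather than the actual CpuBinding internals, and the point is simply that every fallible check happens in the constructor so Deref itself cannot panic.

use std::cell::{Ref, RefCell};
use std::marker::PhantomData;
use std::mem::{align_of, size_of};
use std::ops::Deref;

use bytemuck::Pod;

// Illustrative guard: it owns the RefCell borrow for as long as the typed
// view is alive, so the borrow is not dropped too early.
pub struct TypedGuard<'a, T> {
    buf: Ref<'a, Vec<u8>>,
    _marker: PhantomData<T>,
}

impl<'a, T: Pod> Deref for TypedGuard<'a, T> {
    type Target = T;

    fn deref(&self) -> &T {
        // Cannot panic: size and alignment were checked when the guard was
        // constructed, and the buffer cannot move while it stays borrowed.
        bytemuck::from_bytes(&self.buf[..size_of::<T>()])
    }
}

// All fallible checks live here, so Deref never fails afterwards.
pub fn as_typed<T: Pod>(cell: &RefCell<Vec<u8>>) -> TypedGuard<'_, T> {
    let buf = cell.borrow();
    assert!(buf.len() >= size_of::<T>(), "buffer too small for type");
    assert_eq!(buf.as_ptr() as usize % align_of::<T>(), 0, "misaligned buffer");
    TypedGuard { buf, _marker: PhantomData }
}

A mutable variant would raise the fallibility question discussed above; keeping those checks in the constructor is what would let a DerefMut stay infallible as well.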

Resolved threads: src/engine.rs (3 threads)
Follow-up commits:
Mostly minor changes in response to review. The main behavior change is to preserve use_cpu when hot-reloading (just the GPU) shaders.
Mostly involves more wgpu feature gates.
This is in line with the recommendation in https://docs.rs/getrandom/latest/getrandom/#webassembly-support. We don't know exactly what broke, but this seems like a reasonable fix.
raphlinus merged commit 56753e0 into main on Sep 19, 2023.
4 checks passed.
raphlinus deleted the indirect branch on September 19, 2023 at 18:48.