-
Notifications
You must be signed in to change notification settings - Fork 137
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add indirect dispatch and CPU shaders #360
Conversation
A first cut at adding indirect dispatch and CPU shaders. This version just has one (very simple shader), and allows its selection on the command line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to approve, as I haven't run the code, and some of the finer points are likely lost on me. But this looks really good.
I appreciate the simplifying choice to not support doing GPU->CPU
results dynamically (even ignoring the extreme performance costs). I hadn't appreciated that choice had been made before, but it makes a lot of sense. It would be good to document that somewhere, possibly in cpu_shader/mod.rs
?
I have added some thoughts on developer experience
if use_cpu { | ||
engine.set_cpu_shader(pathtag_reduce, cpu_shader::pathtag_reduce); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels quite noisy, but I don't know if avoiding that is worth the hassle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'm not sure how to best address it. There's a lot of boilerplate, and I suppose it might be possible to address it with macros, but I don't see a clear win. I'm also purposefully choosing not to do fine-grained selection of some shaders but not others, but that would increase the repetition further - it would loos something like if use_cpu.pathtag_reduce { engine.set_shader(pathtag_reduce, cpu_shader::pathtag_reduce); }
.
let mut m = PathMonoid::default(); | ||
for j in 0..WG_SIZE { | ||
let tag = scene[(pathtag_base + i * WG_SIZE as u32) as usize + j]; | ||
m = m.combine(&PathMonoid::new(tag)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Musings on bad and premature optimisations
I wonder if these operations could be reordered to improve pipelining. That is, since operation 2 depends on operation 1 (according to the CPU), could we start operation 3/4 whilst 1/2 are ongoing. My (likely unfounded) intuition around CPU implementations is that loop steps depending on the previous iteration is slow, because the operation must fully complete before the next can begin.Is there some way to signal to LLVM that these operations are order-invariant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting question but out of scope for this work. The specific goal is to be a reference that both matches the GPU (especially in interface and memory layout) but is also clear and can serve as a reference for correctness. If you're micro-optimizing, there's quite a lot that can potentially be done. For example, you might compute the monoid in SIMD lanes, then do a reduction afterwards. I believe there's a whole research agenda, possibly a PhD, in how to best implement parallelizable primitives like scan. Ideally you'd just express your high level intent, "I want to scan this monoid," and the compiler + library would work together to get you the best implementation tuned for the target, exploiting normal scalar optimizations like you propose, SIMD, ispc-like techniques, multithreading (with work-stealing queues on CPU), and both single-pass and multi-dispatch approaches on GPU.
For now, we just grind out "good enough" implementations.
} | ||
|
||
pub fn pathtag_reduce(n_wg: u32, resources: &[CpuBinding]) { | ||
let r0 = resources[0].as_buf(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR, but we should consider some generic functions on CpuBinding
to reduce this boilerplate in the future, so you can do something like let config = resources[0].as::<ConfigUniform>();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, I started out saying "sadly, no," but then saw a way forward and it seems to work. Having that method return &ConfigUniform
would not work because it would drop the borrow from the RefCell too early, but it is possible to make a typed guard, and have that do bytemuck in its deref impl. Inference also works, so you don't need to turbofish the type.
Probably best as a followup PR so we can land this, but now I'm inclined to do it. One question is whether it should just implement deref_mut (which can panic if the resource is read-only) or make the client write out a method call to make that fallibility explicit. (It's also the case that bytemuck can panic, for example if alignment is not satisfied, so maybe that's not a real problem)
Followup: I think it's possible to solve the panic problem by having all checks in the method that generates the guard, so the deref can't fail (the docs say "this trait should never fail" in boldface). I'll queue this up as a followup.
Mostly minor changes in response to review. The main behavior change is to preserve use_cpu when hot-reloading (just the GPU) shaders.
Mostly involves more wgpu feature gates.
This is in line with the recommendation in https://docs.rs/getrandom/latest/getrandom/#webassembly-support. We don't know exactly what broke, but this seems like a reasonable fix.
A first cut at adding indirect dispatch and CPU shaders. This version just has one (very simple shader), and allows its selection on the command line.