Add indirect dispatch and CPU shaders #360
Musings on bad and premature optimisations
I wonder if these operations could be reordered to improve pipelining. That is, since operation 2 depends on operation 1 (according to the CPU), could we start operations 3/4 whilst 1/2 are ongoing? My (likely unfounded) intuition about CPU implementations is that loop steps that depend on the previous iteration are slow, because each operation must fully complete before the next can begin. Is there some way to signal to LLVM that these operations are order-invariant?
Interesting question, but out of scope for this work. The specific goal is a reference implementation that matches the GPU (especially in interface and memory layout) and is also clear enough to serve as a reference for correctness. If you're micro-optimizing, there's quite a lot that can potentially be done; for example, you might compute the monoid in SIMD lanes, then do a reduction afterwards. I believe there's a whole research agenda, possibly a PhD, in how best to implement parallelizable primitives like scan. Ideally you'd just express your high-level intent, "I want to scan this monoid," and the compiler + library would work together to get you the best implementation tuned for the target, exploiting normal scalar optimizations like the one you propose, SIMD, ispc-like techniques, multithreading (with work-stealing queues on CPU), and both single-pass and multi-dispatch approaches on GPU.
For now, we just grind out "good enough" implementations.
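To make the loop-carried dependency from the original question concrete, here is a minimal sequential inclusive scan in the "good enough" scalar reference style described above. This is an illustrative sketch with plain `u32` addition as the monoid, not the PR's actual code:

```rust
// A minimal sequential inclusive scan (prefix sum). Each iteration reads
// `acc` from the previous iteration, so the loop carries a data dependency:
// one addition must retire before the next can begin, which limits
// instruction-level parallelism. SIMD or tree-based formulations break
// this chain at the cost of a less obvious implementation.
fn inclusive_scan(data: &mut [u32]) {
    let mut acc = 0u32;
    for x in data.iter_mut() {
        acc = acc.wrapping_add(*x); // the monoid operation; addition here
        *x = acc;
    }
}
```

The clarity of this form is exactly what makes it useful as a correctness reference, even though a tuned implementation would restructure it.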
Not for this PR, but we should consider some generic functions on `CpuBinding` to reduce this boilerplate in the future, so you can do something like `let config = resources[0].as::<ConfigUniform>();`
Heh, I started out saying "sadly, no," but then saw a way forward, and it seems to work. Having that method return `&ConfigUniform` would not work, because it would drop the borrow from the RefCell too early, but it is possible to make a typed guard and have it do the bytemuck cast in its Deref impl. Inference also works, so you don't need to turbofish the type. Probably best as a followup PR so we can land this, but now I'm inclined to do it. One question is whether it should just implement DerefMut (which can panic if the resource is read-only) or make the client write out a method call to make that fallibility explicit. (It's also the case that bytemuck can panic, for example if alignment is not satisfied, so maybe that's not a real problem.)
Followup: I think it's possible to solve the panic problem by doing all the checks in the method that creates the guard, so the deref itself can't fail (the Deref docs say "this trait should never fail" in boldface). I'll queue this up as a followup.
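A self-contained sketch of the typed-guard idea, under the assumptions above: all size and alignment checks run in the method that creates the guard, so Deref itself cannot fail. The names (`TypedRef`, `Resource`, `typed`) are hypothetical, and the real version would use bytemuck's `Pod` bound instead of a raw pointer cast:

```rust
use std::cell::{Ref, RefCell};
use std::marker::PhantomData;
use std::ops::Deref;

/// Hypothetical typed guard over a RefCell-backed byte buffer. Holding the
/// `Ref` inside the guard keeps the RefCell borrow alive for as long as the
/// typed view is in use, avoiding the too-early drop a plain `&T` would have.
struct TypedRef<'a, T> {
    bytes: Ref<'a, Vec<u8>>,
    _marker: PhantomData<T>,
}

impl<'a, T> Deref for TypedRef<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        // Safety: length and alignment were validated in `typed`, so this
        // cast cannot fail here. T must be plain old data (the real version
        // would enforce this with a bytemuck `Pod` bound).
        unsafe { &*(self.bytes.as_ptr() as *const T) }
    }
}

/// Stand-in for a resource slot; the real type would be on `CpuBinding`.
struct Resource(RefCell<Vec<u8>>);

impl Resource {
    /// All fallible checks happen here, not in `deref`.
    fn typed<T>(&self) -> TypedRef<'_, T> {
        let bytes = self.0.borrow();
        assert!(bytes.len() >= std::mem::size_of::<T>());
        assert_eq!(bytes.as_ptr() as usize % std::mem::align_of::<T>(), 0);
        TypedRef { bytes, _marker: PhantomData }
    }
}
```

Usage is then `let config = resource.typed();` with the type inferred from context, matching the no-turbofish point above.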