Add indirect dispatch and CPU shaders #360

Merged · 4 commits · Sep 19, 2023
1 change: 1 addition & 0 deletions examples/headless/src/main.rs
@@ -90,6 +90,7 @@ async fn render(mut scenes: SceneSet, index: usize, args: &Args) -> Result<()> {
        &RendererOptions {
            surface_format: None,
            timestamp_period: queue.get_timestamp_period(),
+           use_cpu: false,
        },
    )
    .or_else(|_| bail!("Got non-Send/Sync error from creating renderer"))?;
1 change: 1 addition & 0 deletions examples/with_winit/Cargo.toml
@@ -47,3 +47,4 @@ console_error_panic_hook = "0.1.7"
console_log = "1"
wasm-bindgen-futures = "0.4.33"
web-sys = { version = "0.3.60", features = [ "HtmlCollection", "Text" ] }
+getrandom = { version = "0.2.10", features = ["js"] }
6 changes: 6 additions & 0 deletions examples/with_winit/src/lib.rs
@@ -48,6 +48,9 @@ struct Args {
    scene: Option<i32>,
    #[command(flatten)]
    args: scenes::Arguments,
+   #[arg(long)]
+   /// Whether to use CPU shaders
+   use_cpu: bool,
}

struct RenderState {
@@ -70,6 +73,7 @@ fn run(
    let mut render_cx = render_cx;
    #[cfg(not(target_arch = "wasm32"))]
    let mut render_state = None::<RenderState>;
+   let use_cpu = args.use_cpu;
    // The design of `RenderContext` forces delayed renderer initialisation to
    // not work on wasm, as WASM futures effectively must be 'static.
    // Otherwise, this could work by sending the result to event_loop.proxy
@@ -84,6 +88,7 @@
            &RendererOptions {
                surface_format: Some(render_state.surface.format),
                timestamp_period: render_cx.devices[id].queue.get_timestamp_period(),
+               use_cpu: use_cpu,
            },
        )
        .expect("Could create renderer"),
@@ -492,6 +497,7 @@ fn run(
                    timestamp_period: render_cx.devices[id]
                        .queue
                        .get_timestamp_period(),
+                   use_cpu,
                },
            )
            .expect("Could create renderer")
72 changes: 72 additions & 0 deletions src/cpu_dispatch.rs
@@ -0,0 +1,72 @@
// Copyright 2023 The Vello authors
// SPDX-License-Identifier: Apache-2.0 OR MIT

//! Support for CPU implementations of compute shaders.

use std::{
    cell::{RefCell, RefMut},
    ops::Deref,
};

#[derive(Clone, Copy)]
pub enum CpuBinding<'a> {
    Buffer(&'a [u8]),
    BufferRW(&'a RefCell<Vec<u8>>),
    #[allow(unused)]
    Texture(&'a CpuTexture),
}

pub enum CpuBufGuard<'a> {
    Slice(&'a [u8]),
    Interior(RefMut<'a, Vec<u8>>),
}

impl<'a> Deref for CpuBufGuard<'a> {
    type Target = [u8];

    fn deref(&self) -> &Self::Target {
        match self {
            CpuBufGuard::Slice(s) => s,
            CpuBufGuard::Interior(r) => r,
        }
    }
}

impl<'a> CpuBufGuard<'a> {
    /// Get a mutable reference to the buffer.
    ///
    /// Panics if the underlying resource is read-only.
    pub fn as_mut(&mut self) -> &mut [u8] {
        match self {
            CpuBufGuard::Interior(r) => &mut *r,
            _ => panic!("tried to borrow immutable buffer as mutable"),
        }
    }
}

impl<'a> CpuBinding<'a> {
    pub fn as_buf(&self) -> CpuBufGuard {
        match self {
            CpuBinding::Buffer(b) => CpuBufGuard::Slice(b),
            CpuBinding::BufferRW(b) => CpuBufGuard::Interior(b.borrow_mut()),
            _ => panic!("resource type mismatch"),
        }
    }

    // TODO: same guard as buf to make mutable
    #[allow(unused)]
    pub fn as_tex(&self) -> &CpuTexture {
        match self {
            CpuBinding::Texture(t) => t,
            _ => panic!("resource type mismatch"),
        }
    }
}

/// Structure used for binding textures to CPU shaders.
pub struct CpuTexture {
    pub width: usize,
    pub height: usize,
    // In RGBA format. May expand in the future.
    pub pixels: Vec<u32>,
}
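For readers following along, here is a sketch of how a dispatcher might hand these bindings to a CPU shader. The types are a trimmed copy of the ones in the file above (the texture variant is omitted) so the snippet is self-contained, and `double_shader` is a made-up stage, not part of the PR:

```rust
use std::cell::{RefCell, RefMut};
use std::ops::Deref;

// Trimmed mirror of the PR's CpuBinding/CpuBufGuard, buffer variants only.
#[derive(Clone, Copy)]
enum CpuBinding<'a> {
    Buffer(&'a [u8]),
    BufferRW(&'a RefCell<Vec<u8>>),
}

enum CpuBufGuard<'a> {
    Slice(&'a [u8]),
    Interior(RefMut<'a, Vec<u8>>),
}

impl<'a> Deref for CpuBufGuard<'a> {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        match self {
            CpuBufGuard::Slice(s) => s,
            CpuBufGuard::Interior(r) => r,
        }
    }
}

impl<'a> CpuBufGuard<'a> {
    fn as_mut(&mut self) -> &mut [u8] {
        match self {
            CpuBufGuard::Interior(r) => &mut *r,
            _ => panic!("tried to borrow immutable buffer as mutable"),
        }
    }
}

impl<'a> CpuBinding<'a> {
    fn as_buf(&self) -> CpuBufGuard {
        match self {
            CpuBinding::Buffer(b) => CpuBufGuard::Slice(b),
            CpuBinding::BufferRW(b) => CpuBufGuard::Interior(b.borrow_mut()),
        }
    }
}

// A toy "shader" stage: write double each input byte into the output buffer.
fn double_shader(resources: &[CpuBinding]) {
    let input = resources[0].as_buf();
    let mut output = resources[1].as_buf();
    for (o, i) in output.as_mut().iter_mut().zip(input.iter()) {
        *o = i.wrapping_mul(2);
    }
}

fn main() {
    let input = [1u8, 2, 3, 4];
    let output = RefCell::new(vec![0u8; 4]);
    double_shader(&[CpuBinding::Buffer(&input), CpuBinding::BufferRW(&output)]);
    assert_eq!(*output.borrow(), vec![2, 4, 6, 8]);
}
```

The `RefCell` gives read-write buffers runtime borrow checking, so a shader that accidentally binds the same buffer as two `BufferRW` resources panics instead of silently aliasing.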
8 changes: 8 additions & 0 deletions src/cpu_shader/mod.rs
@@ -0,0 +1,8 @@
// Copyright 2023 The Vello authors
// SPDX-License-Identifier: Apache-2.0 OR MIT

//! CPU implementations of shader stages.

mod pathtag_reduce;

pub use pathtag_reduce::pathtag_reduce;
35 changes: 35 additions & 0 deletions src/cpu_shader/pathtag_reduce.rs
@@ -0,0 +1,35 @@
// Copyright 2023 The Vello authors
// SPDX-License-Identifier: Apache-2.0 OR MIT

use vello_encoding::{ConfigUniform, Monoid, PathMonoid};

use crate::cpu_dispatch::CpuBinding;

const WG_SIZE: usize = 256;

fn pathtag_reduce_main(
    n_wg: u32,
    config: &ConfigUniform,
    scene: &[u32],
    reduced: &mut [PathMonoid],
) {
    let pathtag_base = config.layout.path_tag_base;
    for i in 0..n_wg {
        let mut m = PathMonoid::default();
        for j in 0..WG_SIZE {
            let tag = scene[(pathtag_base + i * WG_SIZE as u32) as usize + j];
            m = m.combine(&PathMonoid::new(tag));
Review comment (Member), "Musings on bad and premature optimisations": I wonder whether these operations could be reordered to improve pipelining. That is, since operation 2 depends on operation 1 (according to the CPU), could we start operations 3/4 while 1/2 are ongoing? My (likely unfounded) intuition around CPU implementations is that loop steps depending on the previous iteration are slow, because each operation must fully complete before the next can begin.

Is there some way to signal to LLVM that these operations are order-invariant?

Reply (Contributor, PR author): Interesting question, but out of scope for this work. The specific goal is to be a reference that matches the GPU (especially in interface and memory layout) and is also clear enough to serve as a reference for correctness. If you're micro-optimizing, there's quite a lot that can potentially be done. For example, you might compute the monoid in SIMD lanes, then do a reduction afterwards. I believe there's a whole research agenda, possibly a PhD, in how best to implement parallelizable primitives like scan. Ideally you'd just express your high-level intent, "I want to scan this monoid," and the compiler and library would work together to get you the best implementation tuned for the target, exploiting normal scalar optimizations like you propose, SIMD, ispc-like techniques, multithreading (with work-stealing queues on CPU), and both single-pass and multi-dispatch approaches on GPU.

For now, we just grind out "good enough" implementations.
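To illustrate the reviewer's pipelining question: because a monoid's combine is associative, the loop-carried dependency can be broken by keeping several independent accumulators and merging them at the end. A sketch with `u32` addition standing in for `PathMonoid::combine`; this is illustrative only, not part of the PR:

```rust
// Serial reduction: one dependency chain, each step waits on the previous.
fn reduce_serial(xs: &[u32]) -> u32 {
    xs.iter().fold(0, |m, &x| m.wrapping_add(x))
}

// Four independent accumulators: the CPU can overlap the four chains,
// then a final merge combines them. Valid because addition is associative.
fn reduce_4way(xs: &[u32]) -> u32 {
    let mut acc = [0u32; 4];
    let chunks = xs.chunks_exact(4);
    let rem = chunks.remainder();
    for c in chunks {
        for lane in 0..4 {
            acc[lane] = acc[lane].wrapping_add(c[lane]);
        }
    }
    let mut m = acc.iter().fold(0u32, |a, &b| a.wrapping_add(b));
    for &x in rem {
        m = m.wrapping_add(x);
    }
    m
}

fn main() {
    let xs: Vec<u32> = (1..=1000).collect();
    assert_eq!(reduce_serial(&xs), reduce_4way(&xs));
    println!("{}", reduce_4way(&xs));
}
```

Note that the result is only bit-identical to the serial loop because the combine here is fully associative; the same reordering applies to `PathMonoid`, but as the author says above, the PR deliberately keeps the clear serial form as a correctness reference.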

        }
        reduced[i as usize] = m;
    }
}

pub fn pathtag_reduce(n_wg: u32, resources: &[CpuBinding]) {
    let r0 = resources[0].as_buf();
Review comment (Collaborator): Not for this PR, but we should consider some generic functions on CpuBinding to reduce this boilerplate in the future, so you can do something like `let config = resources[0].as::<ConfigUniform>();`

Reply (Contributor, PR author): Heh, I started out saying "sadly, no," but then saw a way forward, and it seems to work. Having that method return &ConfigUniform would not work, because it would drop the borrow from the RefCell too early, but it is possible to make a typed guard and have it do the bytemuck conversion in its Deref impl. Inference also works, so you don't need to turbofish the type.

Probably best as a follow-up PR so we can land this, but now I'm inclined to do it. One question is whether it should just implement DerefMut (which can panic if the resource is read-only) or make the client write out a method call to make that fallibility explicit. (It's also the case that bytemuck can panic, for example if alignment is not satisfied, so maybe that's not a real problem.)

Follow-up: I think it's possible to solve the panic problem by having all checks in the method that generates the guard, so the deref can't fail (the docs say "this trait should never fail" in boldface). I'll queue this up as a follow-up.

    let r1 = resources[1].as_buf();
    let mut r2 = resources[2].as_buf();
    let config = bytemuck::from_bytes(&r0);
    let scene = bytemuck::cast_slice(&r1);
    let reduced = bytemuck::cast_slice_mut(r2.as_mut());
    pathtag_reduce_main(n_wg, config, scene, reduced);
}
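The typed guard the author describes in the thread above (every fallible check in the constructor, so Deref itself can never fail) might look roughly like this. The names `TypedGuard` and `Config` are hypothetical; the real follow-up would use bytemuck to reinterpret the bytes in place, whereas this dependency-free sketch validates and decodes a copy in the constructor:

```rust
use std::cell::{RefCell, RefMut};
use std::ops::Deref;

// Hypothetical stand-in for a POD uniform such as ConfigUniform.
#[derive(Debug, Clone, Copy)]
struct Config {
    width: u32,
    height: u32,
}

// Typed read guard: holds the RefMut so the runtime borrow stays alive for
// the guard's lifetime, and does all validation up front so Deref is
// infallible, as the Deref docs require.
struct TypedGuard<'a> {
    _buf: RefMut<'a, Vec<u8>>, // keeps the RefCell borrow alive
    value: Config,             // decoded copy (bytemuck would borrow in place)
}

impl<'a> TypedGuard<'a> {
    fn new(buf: &'a RefCell<Vec<u8>>) -> TypedGuard<'a> {
        let buf = buf.borrow_mut();
        // Every fallible check lives here, not in Deref.
        assert!(buf.len() >= 8, "buffer too small for Config");
        let width = u32::from_le_bytes(buf[0..4].try_into().unwrap());
        let height = u32::from_le_bytes(buf[4..8].try_into().unwrap());
        TypedGuard {
            _buf: buf,
            value: Config { width, height },
        }
    }
}

impl<'a> Deref for TypedGuard<'a> {
    type Target = Config;
    fn deref(&self) -> &Config {
        &self.value // cannot fail
    }
}

fn main() {
    let bytes = RefCell::new(vec![64, 0, 0, 0, 32, 0, 0, 0]);
    let config = TypedGuard::new(&bytes);
    assert_eq!(config.width, 64);
    assert_eq!(config.height, 32);
}
```

With a generic version of `new`, call sites could shrink to something like the `resources[0].as::<ConfigUniform>()` shape the Collaborator suggests, with inference supplying the type.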