alignas proposal #154

283 changes: 283 additions & 0 deletions proposals/0013-alignas.md
@@ -0,0 +1,283 @@
<!-- {% raw %} -->

# HLSL alignas Specifier

* Proposal: [0013](0013-alignas.md)

> Review comment (Collaborator): Use NNNN instead of 0013 until just before this merges, then pick the next available number.

* Author(s): [Mike Apodaca (NVIDIA)](https://github.com/mapodaca-nv)
* Sponsor: TBD
* Status: **Under Consideration**
* Impacted Project(s): (DXC, Clang, etc)

*During the review process, add the following fields as needed:*

* PRs: [#NNNN](https://github.com/microsoft/DirectXShaderCompiler/pull/NNNN)
* Issues:
[#2193](https://github.com/microsoft/DirectXShaderCompiler/issues/2193)

## Introduction

This proposal adds HLSL support for the `alignas` specifier on structure
declarations and structure member declarations, for structures used by
`[RW]StructuredBuffer` declarations and by templated `ByteAddressBuffer` loads
and stores.

As additional benefits, this proposal would:
(a) eliminate the need for applications to add dummy elements to structures
to force specific alignments, and
(b) further converge HLSL and C++11 syntax.
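
Benefit (a) can be illustrated on the C++ side of a shared header. The sketch below is not taken from the proposal: `Float3`, `PaddedFoo`, and `AlignedFoo` are hypothetical names, and `Float3` is an assumed 12-byte stand-in for HLSL `float3`. It compares hand-written dummy members with the padding the compiler inserts for `alignas`.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for HLSL float3: 12 bytes, 4-byte aligned.
struct Float3 { float x, y, z; };

// Without alignas: dummy members force the layout by hand.
struct PaddedFoo {
    Float3   bar;      // offset 0, 12 bytes
    uint32_t pad0;     // dummy element: pushes baz to offset 16
    uint32_t baz;      // offset 16
    uint32_t pad1[3];  // dummy elements: round the size up to 32
};

// With alignas: the compiler inserts the same padding implicitly.
struct alignas(16) AlignedFoo {
    Float3 bar;
    alignas(16) uint32_t baz;
};

// Both layouts place baz at offset 16 and occupy 32 bytes.
static_assert(sizeof(PaddedFoo) == 32, "manual padding yields 32 bytes");
static_assert(sizeof(AlignedFoo) == sizeof(PaddedFoo), "same size");
static_assert(offsetof(AlignedFoo, baz) == offsetof(PaddedFoo, baz), "same offset");

// Only the alignas version records the 16-byte requirement in the type.
static_assert(alignof(PaddedFoo) == 4, "dummy members do not raise alignment");
static_assert(alignof(AlignedFoo) == 16, "alignas raises alignment");
```

The dummy members reproduce the byte layout, but the type itself still reports 4-byte alignment, so a compiler has nothing to act on. That missing information is what the specifier is meant to carry.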

## Motivation

Some GPUs can optimize 16-byte memory accesses for buffer loads and stores.
Unfortunately, in many instances, IHV compilers must assume 4-byte alignments
for structured buffer element accesses.

In the current specification, UAV buffer root views may be aligned to 4 bytes,
whereas descriptors in descriptor tables must be aligned to 256 bytes.
Therefore, when an application chooses to use root views, the IHV compiler must
assume the worst-case alignment.
For some GPUs, this will disable vectorized loads and stores from memory that
require 16-byte aligned addresses.

Future specifications under consideration may reduce placed resource alignment
requirements to as little as 1 byte.
Such a change would further justify maintaining minimum alignment requirements
and adding explicit application hints.

Currently, these limitations can only be worked around with runtime monitoring
and bookkeeping of root view addresses, followed by background-thread
recompilation and re-caching of shaders.

## Proposed solution

Describe your solution to the problem. Provide examples and describe how they
work. Show how your solution is better than current workarounds: is it cleaner,
safer, or more efficient?

## Detailed design

### Syntax

The following excerpts from the
[C++ reference manual](https://en.cppreference.com/w/cpp/language/alignas)
would apply to HLSL.

> **alignas**( _expression_ )
> **alignas**( _type-id_ )
> 1. _expression_ must be an integral constant expression that evaluates to
> zero, or to a valid value for an alignment or extended alignment.
> 2. Equivalent to `alignas(alignof(type-id))`.

> The `alignas` specifier may be applied to:
> - the declaration or definition of a class;
> - the declaration of a non-bitfield class data member;

> The object or the type declared by such a declaration will have its
> alignment requirement equal to the strictest (largest) non-zero expression
> of all `alignas` specifiers used in the declaration, unless it would weaken
> the natural alignment of the type.

> If the strictest (largest) `alignas` on a declaration is weaker than the
> alignment it would have without any `alignas` specifiers (that is, weaker
> than its natural alignment or weaker than `alignas` on another declaration
> of the same object or type), the program is ill-formed.

> Invalid non-zero alignments, such as `alignas(3)` are ill-formed.

> Valid non-zero alignments that are weaker than another `alignas` on the same
> declaration are ignored.

> `alignas(0)` is always ignored.
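
As a rough illustration of the quoted rules, the following C++11 sketch checks them with `static_assert`. It shows how a C++ compiler treats these declarations and assumes HLSL would follow the same rules; the type names are hypothetical.

```cpp
#include <cstddef>

// Several alignas specifiers on one declaration: the strictest (largest) wins.
struct alignas(8) alignas(16) Strictest { int value; };
static_assert(alignof(Strictest) == 16, "largest alignas applies");
static_assert(sizeof(Strictest) == 16, "size is rounded up to the alignment");

// alignas(type-id) is equivalent to alignas(alignof(type-id)).
struct alignas(double) LikeDouble { float value; };
static_assert(alignof(LikeDouble) == alignof(double), "type-id form");

// alignas on a non-bitfield data member moves that member and raises the
// alignment of the enclosing structure.
struct Member {
    char tag;                 // offset 0
    alignas(16) int payload;  // offset 16
};
static_assert(offsetof(Member, payload) == 16, "member is 16-byte aligned");
static_assert(alignof(Member) == 16, "struct inherits the member alignment");
static_assert(sizeof(Member) == 32, "tail padding keeps array elements aligned");

// Left as comments because they are ill-formed or have no effect:
//   struct alignas(3) Bad { int value; };   // 3 is not a valid alignment
//   struct alignas(1) Weak { int value; };  // weaker than the natural alignment
//   struct alignas(0) Same { int value; };  // alignas(0) is ignored
```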

### Example Usage

```c++
///////////////////////////////////////////////////////////////////////////////
// common.h
// every object of type Foo will be aligned to 16 bytes
struct alignas(16) Foo
{
    float3 bar;
    alignas(16) uint baz; // 16-byte aligned member
};
static_assert(sizeof(Foo) == 32, "Foo must occupy 32 bytes");

// Review comment: RawBufferLoad is a SPIR-V-only construct, and this proposal
// also targets it. It would be nice if static_assert were implemented in the
// SPIR-V backend; it currently works only for DXIL.

///////////////////////////////////////////////////////////////////////////////
// compute.hlsl
#include "common.h"
ByteAddressBuffer InBuf : register(t0);
RWStructuredBuffer<Foo> OutBuf : register(u0);

[numthreads(1, 1, 1)]
void main(uint gid : SV_GroupID)
{
    Foo tmp = InBuf.Load<Foo>(gid * sizeof(Foo)); // 16-byte aligned reads
    OutBuf[gid].bar = tmp.bar;                    // 16-byte aligned write
    OutBuf[gid].baz = tmp.baz + 1;                // 16-byte aligned write
    ...
}

///////////////////////////////////////////////////////////////////////////////
// app.cpp
#include "common.h"
int main()
{
    // every element of the vector is aligned to 16 bytes
    std::vector<Foo> vecFoo(1024);
    ...

    ComPtr<ID3D12Resource> uavBuffer;
    ThrowIfFailed(device->CreateCommittedResource(
        &CD3DX12_HEAP_PROPERTIES(D3D12_HEAP_TYPE_DEFAULT),
        D3D12_HEAP_FLAG_NONE,
        &CD3DX12_RESOURCE_DESC::Buffer(vecFoo.size() * sizeof(Foo),
            D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS),
        D3D12_RESOURCE_STATE_COMMON,
        nullptr,
        IID_PPV_ARGS(&uavBuffer)));
    ...

    // SRV GPUVA guaranteed to be 16-byte aligned
    commandList->SetComputeRootShaderResourceView(0,
        uavBuffer->GetGPUVirtualAddress() + 1 * sizeof(Foo));

    // UAV GPUVA guaranteed to be 16-byte aligned
    commandList->SetComputeRootUnorderedAccessView(0,
        uavBuffer->GetGPUVirtualAddress() + 3 * sizeof(Foo));

    commandList->Dispatch(32, 1, 1);
}
```

### DXIL

The structure base alignment hint needs to be passed down to the IHV compiler
via DXIL metadata.

```diff
target datalayout = "e-m:e-p:32:32-i1:32-i8:32-i16:32-i32:32-i64:64-f16:32-f32:32-f64:64-n8:16:32:64"
target triple = "dxil-ms-dx"

%dx.types.Handle = type { i8* }
%dx.types.ResRet.f32 = type { float, float, float, float, i32 }
%dx.types.ResRet.i32 = type { i32, i32, i32, i32, i32 }
%struct.ByteAddressBuffer = type { i32 }
%"class.RWStructuredBuffer<Foo>" = type { %struct.Foo }
- %struct.Foo = type { <3 x float>, i32 }
+ %struct.Foo = type { <3 x float>, i32, i32, <3 x i32> } ; implicit padding (tbd - may not be needed)

define void @main() {
%1 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 1, i32 0, i32 0, i1 false) ; CreateHandle(resourceClass,rangeId,index,nonUniformIndex)
%2 = call %dx.types.Handle @dx.op.createHandle(i32 57, i8 0, i32 0, i32 0, i1 false) ; CreateHandle(resourceClass,rangeId,index,nonUniformIndex)
%3 = call i32 @dx.op.groupId.i32(i32 94, i32 0) ; GroupId(component)
- %4 = shl i32 %3, 4
+ %4 = shl i32 %3, 5
%5 = call %dx.types.ResRet.f32 @dx.op.bufferLoad.f32(i32 68, %dx.types.Handle %2, i32 %4, i32 undef) ; BufferLoad(srv,index,wot)
%6 = extractvalue %dx.types.ResRet.f32 %5, 0
%7 = extractvalue %dx.types.ResRet.f32 %5, 1
%8 = extractvalue %dx.types.ResRet.f32 %5, 2
- %9 = or i32 %4, 12
+ %9 = or i32 %4, 16
%10 = call %dx.types.ResRet.i32 @dx.op.bufferLoad.i32(i32 68, %dx.types.Handle %2, i32 %9, i32 undef) ; BufferLoad(srv,index,wot)
%11 = extractvalue %dx.types.ResRet.i32 %10, 0
- call void @dx.op.bufferStore.f32(i32 69, %dx.types.Handle %1, i32 %3, i32 0, float %6, float %7, float %8, float undef, i8 7) ; BufferStore(uav,coord0,coord1,value0,value1,value2,value3,mask)
+ call void @dx.op.bufferStore.f32(i32 69, %dx.types.Handle %1, i32 %3, i32 0, float %6, float %7, float %8, float undef, i8 7, i32 16) ; +alignment
%12 = add i32 %11, 1
- call void @dx.op.bufferStore.i32(i32 69, %dx.types.Handle %1, i32 %3, i32 12, i32 %12, i32 undef, i32 undef, i32 undef, i8 1) ; BufferStore(uav,coord0,coord1,value0,value1,value2,value3,mask)
+ call void @dx.op.bufferStore.i32(i32 69, %dx.types.Handle %1, i32 %3, i32 16, i32 %12, i32 undef, i32 undef, i32 undef, i8 1, i32 16) ; +alignment
ret void
}

; Function Attrs: nounwind readnone
declare i32 @dx.op.groupId.i32(i32, i32) #0

; Function Attrs: nounwind readonly
declare %dx.types.Handle @dx.op.createHandle(i32, i8, i32, i32, i1) #1

; Function Attrs: nounwind readonly
declare %dx.types.ResRet.i32 @dx.op.bufferLoad.i32(i32, %dx.types.Handle, i32, i32) #1

; Function Attrs: nounwind readonly
declare %dx.types.ResRet.f32 @dx.op.bufferLoad.f32(i32, %dx.types.Handle, i32, i32) #1

; Function Attrs: nounwind
declare void @dx.op.bufferStore.f32(i32, %dx.types.Handle, i32, i32, float, float, float, float, i8) #2

; Function Attrs: nounwind
declare void @dx.op.bufferStore.i32(i32, %dx.types.Handle, i32, i32, i32, i32, i32, i32, i8) #2

attributes #0 = { nounwind readnone }
attributes #1 = { nounwind readonly }
attributes #2 = { nounwind }

!llvm.ident = !{!0}
!dx.version = !{!1}
!dx.valver = !{!2}
!dx.shaderModel = !{!3}
!dx.resources = !{!4}
!dx.entryPoints = !{!10}

!0 = !{!"dxc(private) 1.7.0.4219 (staging-sm-6.8, c468d525d)"}
!1 = !{i32 1, i32 0}
!2 = !{i32 1, i32 8}
!3 = !{!"cs", i32 6, i32 0}
!4 = !{!5, !7, null, null}
!5 = !{!6}
!6 = !{i32 0, %struct.ByteAddressBuffer* undef, !"", i32 0, i32 0, i32 1, i32 11, i32 0, null}
- !7 = !{!8}
+ !7 = !{i32 1, i32 16} ; alignment
!8 = !{i32 0, %"class.RWStructuredBuffer<Foo>"* undef, !"", i32 0, i32 0, i32 1, i32 12, i1 false, i1 false, i1 false, !9}
- !9 = !{i32 1, i32 16} ; stride
+ !9 = !{i32 1, i32 32, i32 16} ; stride and alignment
!10 = !{void ()* @main, !"main", null, !4, !11}
!11 = !{i32 0, i64 16, i32 4, !12}
!12 = !{i32 1, i32 1, i32 1}
```
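
The offset changes in the diff above follow from the padded layout of `Foo`: the stride grows from 16 to 32 bytes (`shl 4` becomes `shl 5`) and `baz` moves from offset 12 to offset 16. The sketch below reproduces that arithmetic in C++; the constant and function names are hypothetical and appear in neither DXIL nor DXC.

```cpp
#include <cstdint>

constexpr uint32_t kFooStride = 32;  // sizeof(Foo) once alignas(16) is applied
constexpr uint32_t kBazOffset = 16;  // offsetof(Foo, baz) once alignas(16) is applied

// Byte offset of element `index`, written as the shift from the diff (shl 5).
constexpr uint32_t ElementOffset(uint32_t index) { return index << 5; }

// Byte offset of `baz` inside element `index`. The diff uses `or` instead of
// `add`; the two agree because the low five bits of the element offset are zero.
constexpr uint32_t BazByteOffset(uint32_t index) {
    return ElementOffset(index) | kBazOffset;
}

static_assert(ElementOffset(3) == 3 * kFooStride, "shift matches the stride");
static_assert(BazByteOffset(3) == 3 * kFooStride + kBazOffset, "or equals add here");
static_assert(BazByteOffset(3) % 16 == 0, "baz stays 16-byte aligned in every element");
```

Because both the stride and the member offset are multiples of 16, every access to `bar` or `baz` is 16-byte aligned whenever the buffer base address is.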

### Device Compatibility

Any device that supports SM6.XX+ is expected to support this alignment hint.

### Device Behavior

To avoid undefined behavior, or inconsistent behavior across IHV devices, the
device **must** ignore (e.g., mask out) any address bits below the specified
alignment when accessing memory.
The device must not ignore the alignment, even if it is larger than what the
device supports or optimizes for the operation.

> **Remark**: it is expected that an IHV driver implementation could perform
> this address mask during root signature binding rather than during shader
> execution. As such, this feature requires a driver update and cannot be
> retroactively supported by existing, shipped drivers that support an older
> shader model.
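
The masking rule described in this section can be sketched as a single bitwise operation. The helper below is a hypothetical illustration, not a driver or D3D12 API, and it assumes the alignment is a non-zero power of two, as the syntax rules require.

```cpp
#include <cassert>
#include <cstdint>

// True when `value` is a non-zero power of two.
constexpr bool IsPowerOfTwo(uint64_t value) {
    return value != 0 && (value & (value - 1)) == 0;
}

// Clears every address bit below `alignment`, rounding the address down to
// the nearest multiple of the alignment.
inline uint64_t MaskToAlignment(uint64_t address, uint64_t alignment) {
    assert(IsPowerOfTwo(alignment));
    return address & ~(alignment - 1);
}
```

For example, with a 16-byte alignment the address `0x1234` is treated as `0x1230`, while an address that is already aligned is left unchanged.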

### Validation

The DXIL compiler may issue errors or warnings for ill-formed or ignored
alignment specifiers, respectively, in accordance with the syntax rules above.
Since the compiler is not aware of the calculation of the GPUVA itself,
it cannot issue errors or warnings for the value not being properly aligned.

Runtime validation may check if the GPUVA value provided to
`Set[Graphics|Compute]RootUnorderedAccessView` meets the base alignment
requirements, as specified in the shader, when the `Draw` or `Dispatch` call
is added to the command list.
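
A runtime check of this kind could look like the sketch below. `IsRootViewAligned` is a hypothetical helper, not part of the D3D12 runtime or debug layer, and the way the required alignment would be read back from the shader is left open here.

```cpp
#include <cstdint>

// Returns true when `gpuVirtualAddress` is a multiple of `requiredAlignment`.
// An alignment that is zero or not a power of two is rejected.
inline bool IsRootViewAligned(uint64_t gpuVirtualAddress,
                              uint64_t requiredAlignment) {
    const bool validAlignment =
        requiredAlignment != 0 &&
        (requiredAlignment & (requiredAlignment - 1)) == 0;
    return validAlignment &&
           (gpuVirtualAddress & (requiredAlignment - 1)) == 0;
}
```

With a 16-byte aligned base address and a 32-byte `Foo`, an address of `base + 3 * sizeof(Foo)` passes the check, whereas `base + 4` would be reported.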

Tests may be authored to validate that the device properly ignores address
value bits smaller than the alignment specified.

## Alternatives considered (Optional)

If alternative solutions were considered, please provide a brief overview.
This section can also be populated based on conversations that occur during
reviewing.

## Acknowledgments (Optional)

* Contributor(s):
+ [Anupama Chandrasekhar (NVIDIA)](https://github.com/anupamachandra)
+ [Justin Holewinski (NVIDIA)](https://github.com/jholewinski)

<!-- {% endraw %} -->