The Language SIG hosts discussions and presentations focused on how to integrate a broad range of languages to support oneAPI. The languages at the heart of oneAPI are C++ and SYCL. This group discusses how work done within the oneAPI open source implementations can contribute to the ISO C++ and SYCL specifications managed by ISO and Khronos.
The Language SIG is led by Ruyman Reyes Castro <[email protected]>
To find out how to join the Language SIG, get in touch.
- Error handling
- Function pointers revisited
- oneDPL C++ standard library support
- Ruyman Reyes
- Rod Burns
- Cohn, Robert S
- Tom Deakin
- Victor Lomuller
- Khaldi, Dounia
- Spruit, Neil R
- Andrew Richards
- Reinders, James R
- Yates, Brandon
- Slavova, Gergana S
- Voss, Michael J
- Brodman, James
- Xiong, Jianxin
- Mehdi Goli
- Keryell, Ronan (XILINX LABS)
- Tu, Peng
- Benie
- Andrew Lumsdaine
- Lueck, Gregory M
- Richards, Alison L
- Arteaga Molina, Jaime A
- Lowney, Geoff
- Mcguire, Russell W
- Alastair Murray
- Kukanov, Alexey
- Videau, Brice
- Wells, Alex M
- Melonakos, John
Dounia Khaldi, Slides
- Great community reception, with contributions to MLIR dialects upstream
- Different levels of abstraction are exposed to different users. Joint matrix aims for the middle: it breaks GEMM down into primitives, so it is low-level, but it is portable across targets.
- This presentation will cover both the SPIR-V and the SYCL extension, both are needed for different targets
- Joint matrix relies on various abstractions for code generation (PTX ISA, SPIR-V, GPU intrinsics, ...)
- Joint matrix is not a replacement for frameworks and libraries; it is useful when implementing new operations or optimizing unexpected combinations of operations
- This is also useful for library developers, who need to write portable code
- Intel PVC has two stacks with slices; each slice has 16 Xe-cores, each with 8 vector engines and 8 XMX engines (GEMM accelerators)
- Intel Xeon processors codenamed Sapphire Rapids have AMX extensions, which are GEMM accelerators
- NVIDIA and other hardware vendors have their own GEMM accelerators
- You would need a lot of different intrinsics to target all of them
- The SYCL matrix extension is an experimental SYCL API at this point, so it may change from one release to the next
- joint_matrix takes a group type (only sub-group is supported), a use (the A, B, or accumulator matrix for GEMM), a shape (rows, columns), and a layout.
- There are various operations supported, fill, load, store
- (Slide shows an example of using the extension; see the code sketch after these notes)
- The example above is a multiply in SYCL; the load and mad happen in the K loop.
- You can do an element-wise operation with data that is on the join_matrix
- Q (Ronan): can you do negative strides, or is it just unsigned?
- A: Stride is a positive number.
- Same example and source can run across Intel CPU, GPU and NVIDIA GPU.
- Additional functions to pass row/col. This is Intel-specific; NVIDIA cannot support this on Tensor Cores
- Q(Ruyman): Any restrictions on element wise operations supported?
- A(Dounia): No restriction, any SYCL kernel code is valid
- Size combinations are different between AMX and XMX, and even between generations of XMX. NVIDIA has different numbers.
- How do we write portable code? There is a query interface, static and dynamic
- Static queries require hardware architecture checks. Basic code is similar between SYCL joint matrix and CUDA Fragments
- CUDA code migration to SYCL is simple as it is very close to the wmma operations
- Joint matrix extension in MLIR generates SPIR-V code for multiple backends
- Currently: Full support of SYCL joint matrix extension on AMX, XMX and NVIDIA Tensor Cores
- Next steps: standardization of joint matrix in SYCL and SPIR-V
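A minimal sketch of the pattern described above, based on the experimental sycl_ext_oneapi_matrix extension. Exact namespaces, the joint_matrix_mad signature, and the required B-matrix layout have varied between DPC++ releases; the tile sizes, the sub-group size, and the USM pointers are illustrative assumptions, not from the presentation.

```cpp
#include <sycl/sycl.hpp>

namespace jm = sycl::ext::oneapi::experimental::matrix;
using bf16 = sycl::ext::oneapi::bfloat16;

constexpr size_t TM = 8, TN = 16, TK = 16;  // tile shape (device-specific)
constexpr size_t SG = 16;                   // assumed device sub-group size

// C (MxN) += A (MxK) * B (KxN); all row-major, USM pointers, sizes divisible by tiles.
void tiled_gemm(sycl::queue &q, bf16 *A, bf16 *B, float *C,
                size_t M, size_t N, size_t K) {
  sycl::nd_range<2> r{{M / TM, (N / TN) * SG}, {1, SG}};
  q.parallel_for(r, [=](sycl::nd_item<2> it) {
    auto sg = it.get_sub_group();
    const size_t tm = it.get_group(0);  // tile row index
    const size_t tn = it.get_group(1);  // tile column index

    auto pA = sycl::address_space_cast<sycl::access::address_space::global_space,
                                       sycl::access::decorated::no>(A);
    auto pB = sycl::address_space_cast<sycl::access::address_space::global_space,
                                       sycl::access::decorated::no>(B);
    auto pC = sycl::address_space_cast<sycl::access::address_space::global_space,
                                       sycl::access::decorated::no>(C);

    // One sub-group cooperates on one TM x TN tile of C.
    jm::joint_matrix<sycl::sub_group, bf16, jm::use::a, TM, TK,
                     jm::layout::row_major> a;
    jm::joint_matrix<sycl::sub_group, bf16, jm::use::b, TK, TN,
                     jm::layout::row_major> b;
    jm::joint_matrix<sycl::sub_group, float, jm::use::accumulator, TM, TN> c;

    jm::joint_matrix_fill(sg, c, 0.0f);
    for (size_t k = 0; k < K; k += TK) {   // load and mad happen in the K loop
      jm::joint_matrix_load(sg, a, pA + tm * TM * K + k, K);
      jm::joint_matrix_load(sg, b, pB + k * N + tn * TN, N);
      jm::joint_matrix_mad(sg, c, a, b, c);  // older revisions: c = joint_matrix_mad(sg, a, b, c)
    }
    jm::joint_matrix_store(sg, c, pC + tm * TM * N + tn * TN, N,
                           jm::layout::row_major);
  });
}
```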
Mehdi Goli, Slides <presentations/2023-06-07_JointMatrix_NVIDIA.pdf.pdf>
- GEMMs are used everywhere and it is very important that we optimize them
- Presentation about Joint Matrix Performance analysis, showing support for SM72 and SM80 (Jetson and Ampere)
- We use the joint matrix extension on both; we can achieve 97% of cuDNN performance on Jetson
- On SM80 / A100 we use different sizes and see mixed results (very good on small sizes, really bad on large sizes) in the performance comparison with CUTLASS and cuDNN.
- SYCL-BLAS Half and TF32 performance is slightly better for small sizes but gets much worse for bigger sizes in the performance comparison with CUTLASS and cuDNN
- NVIDIA uses ldmatrix and cp.async (Shared Store From Global Load) to get higher performance. These instructions allow bypassing the cache and prefetching
- Tensorcore support has evolved across different NVIDIA architectures, and they have added new instructions that support some advanced features using a different part of the PTX ISA (wmma vs mma).
- WMMA is a higher-level instruction that maps to multiple HMMA instructions in the SASS.
- MMA instructions map to a single HMMA wherever possible, or, for backwards compatibility, break down into multiple HMMA instructions on previous generations
- WMMA is exposed in CUDA and is what we use today for the joint_matrix extension, whereas MMA is what CUTLASS and others use via hard-coded assembly.
- Results from NVIDIA suggest WMMA is slower than MMA.
- The performance gap from our joint matrix numbers is due to the lack of cp.async, which needs to be added to DPC++.
- We need to somehow expose the mma instruction to DPC++ so that we can close the performance gap.
- Q(Ruyman) you mean supporting it within joint_matrix extension?
- A(Mehdi): Yes should be possible
- Q(Jianxin): This would be an implementation detail?
- A(Mehdi): Yes
- Q(Geoff): Why don't we load this on local memory?
- A(Mehdi): It is not supported in our backend
- Q(Geoff): If we preload the data into SLM, wouldn't we get more performance?
- A(Mehdi): Our backend does not support it; this is one of the key factors in the performance problems we see.
- Q(Dounia): Are there technical challenges in the implementation?
- A(Mehdi): It's a lot of different configurations and maintenance in the backend. Individual mapping of builtins is difficult.
- A(Dounia): ATS and PVC sizes are different, that's why we have the query. The implementation is bigger but it's transparent; the user has to specify which hardware they have.
- Q(Geoff): Any matrix multiplication should tile itself onto SLM, but it seems that's not the case? Why should joint matrix be 3 times slower? They have a nice feature to do it in the ISA, but you can do that yourself, right?
- A(Mehdi): They use a different instruction to implement the loading that gives better performance; we cannot use that instruction in the DPC++ backend yet
(Meeting adjourned, out of time)
- Rod Burns (Codeplay)
- Ruyman Reyes (Codeplay/Intel)
- Juan Fumero (Manchester)
- Brodman, James (Intel)
- Harms, Kevin (ANL)
- Kotselidis Christos (Pierer-Innovation)
- Spruit, Neil R (Intel)
- Slavova, Gergana S (Intel)
- Cohn, Robert S (Intel)
- Arteaga Molina, Jaime A (Intel)
- Xiong, Jianxin (Intel)
- Gordon Brown (Codeplay)
- Ronan Keryell (AMD)
- Richards, Alison L (Intel)
- Alex Pim (Imagination Technologies)
- Ganapathi, Ravindra Babu (Intel)
- Frank Brill (Cadence)
- Romain Dolbeau (SiPearl)
- Wells, Alex M (Intel)
- Mrozek, Michal (Intel)
Ruyman Reyes Castro (RRC)
- oneAPI Language SIG Introduction
- The oneAPI Language SIG is an evolution of the oneAPI Language TAB
- Main focus is feedback into SYCL language and C++ ISO interaction
- Looking into ideas from other languages and how they apply to SYCL/C++
- Please send topics, suggested agendas
Juan Jose Fumero (JJF): "TornadoVM: Multi Backend Hardware Acceleration Framework for Java"
- Slides
- TornadoVM is a programming model and API for accelerator offloading
- Supports multiple backends, including Level Zero, OpenCL and CUDA
- Relies on Graal and Graal IR for compiler and bytecode optimization
- Defines a common IR format that then is lowered to target IR (PTX, SPIR-V, OpenCL C)
- Long list of feedback for Level Zero API design based on experience interacting with level zero via JNI
- RRC: That feedback is better directed to the HW SIG, where the right Level Zero and Unified Runtime people are, but many thanks.
- Q&A:
- GS: Do you support other targets beyond SPIRV?
- JJF: We support SPIR-V on the Level Zero API. We use a third-party Java library to generate the SPIR-V. We support OpenCL but not SPIR-V; OpenCL SPIR-V should work.
- RRC: Your matrix multiplication slide shows you only add an annotation but you perform a loop-swap optimization; do you have enough semantic information to trigger that optimization from the Java IR?
- JJF: It is not highlighted on the slides, but we use a custom type for the matrix so we know it is 2D and we control the layout
- RRC: Your Task Graph API is similar to what is being designed for SYCL-Graphs, so it would be good to have your feedback there; I will connect you to the right people
- JJF: I am not aware of the SYCL-Graph approach but am happy to collaborate
Post-meeting actions:
- RRC connected JJF with oneAPI HW SIG to pass on the feedback
- RRC connected JJF with team working on SYCL Graphs to discuss ideas for the proposal
- Robert Cohn (Intel)
- Aksel Simon Alpay (Heidelberg University)
- Andrew Lumsdaine (University of Washington, Pacific Northwest)
- Antonio J. Peña (Barcelona Supercomputing Center)
- Benjamin Brock (Intel)
- James Brodman (Intel)
- Gordon Brown (Codeplay)
- Michael Kinsner (Intel)
- Alexey Kukanov (Intel)
- Geoff Lowney (Intel)
- John Melonakos (Intel)
- Nevin Liber (Argonne National Laboratory)
- John Pennycook (Intel)
- Pablo Reble (Intel)
- James Reinders (Intel)
- Romain Dolbeau (SiPearl)
- Timmie Smith (Intel)
- Xinmin Tian (Intel)
- Tom Deakin (University of Bristol)
- Vasanth Tovinkere (Intel)
John Pennycook
- Slides
- Proposal for SYCL bug fixes and clarifications about concurrency: PR300
- Expressing concurrency/parallelism requirements in ISO C++
- threads and async
- execution policies
- P2300 schedulers
- Expressing requirements in SYCL
- parallel_for(sycl::range, ...)
- all work items have weakly parallel forward progress guarantees
- parallel_for(sycl::nd_range, ...)
- above, with barriers
- no way to ask for stronger guarantees
- Use cases: global synchronization via atomics (see the sketch after these notes)
- arrive and wait
- barrier, one worker, barrier
- assumes
- non-leader does not starve leader
- use case, sub-group specialization
- sync within sub-group of work-group
- Towards an extension, hierarchy
- Host: host progress guarantee
- work-group: work-group progress guarantee
- sub-group
- work-item
- each thread blocks with forward progress delegation on its children
- OpenCL 1.x
- at least one makes progress
- no guarantee for individuals; can strengthen an individual, but not permanently
- OpenCL 2.x
- sub-groups are concurrent instead of weakly parallel
- at least one work item per subgroup must make progress
- Backends
- precise mapping is device-specific
- eager vs lazy submission
- cooperative kernels
- mapping of hardware threads
- Extension sketch
- Questions
- Guarantees within a scope
- for all
- forward progress requirements of a specific kernel
- run-time queries and compile-time properties
- Querying scoped guarantees
- runtime: device.get_info
- compile-time: launch kernel differently or fail to launch
- today it will fail if you need something that cannot be provided
- Q & A
- Does number of subgroups/workgroups affect the query?
- Device queries return strongest requirement
- when there is a limit on number
- programmer attaches requirements to property
- existing queries e.g. largest number of work-items will reflect requirement
- what about cooperative launch?
- programmer expresses need for barrier
- triggers cooperative launch
- lockstep (e.g. vector) seems stronger than weakly parallel
- providing a stronger guarantee cannot break
- programmer cannot assume lockstep
- people do assume lockstep, needs to be considered
- how will async be integrated with sycl queue & submit?
- tie forward progress to async semantics
- question about how it would affect offload
- events vs futures
- observability of parallelism vs weakly parallel
- important question on definition: can I acquire a lock without deadlock?
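A minimal sketch of the "global synchronization via atomics" use case above: a spin lock built from sycl::atomic_ref. The kernel and names are illustrative; the point is that acquiring the lock only terminates if the current lock holder is guaranteed to keep making progress, which is exactly the kind of guarantee the proposed extension would make explicit and queryable.

```cpp
#include <sycl/sycl.hpp>

using lock_ref = sycl::atomic_ref<int, sycl::memory_order::acq_rel,
                                  sycl::memory_scope::device,
                                  sycl::access::address_space::global_space>;

void increment_with_lock(sycl::queue &q, int *lock, int *counter, size_t n) {
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1>) {
    lock_ref l{*lock};
    int expected = 0;
    // Acquire: only safe if other work-items are guaranteed to make progress,
    // so that the current holder can eventually release the lock. With only a
    // weakly parallel guarantee this loop may never terminate.
    while (!l.compare_exchange_strong(expected, 1))
      expected = 0;
    *counter += 1;  // critical section
    l.store(0);     // release
  });
}
```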
- Robert Cohn (Intel)
- Alastair Murray (Codeplay)
- Aksel Simon Alpay (Heidelberg University)
- Andrew Lumsdaine (University of Washington, Pacific Northwest)
- Ben Tracy (Codeplay)
- Christian Trott (Sandia National Laboratory)
- Ewan Crawford (Codeplay)
- Michael Kinsner (Intel)
- Geoff Lowney (Intel)
- Mehdi Goli (Codeplay)
- John Melonakos (Intel)
- John Pennycook (Intel)
- Pablo Reble (Intel)
- Rod Burns (Codeplay)
- Roland Schulz (Intel)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Mike Voss (Intel)
- Ronan Keryell (Xilinx, AMD)
- Romain Dolbeau (SiPearl)
- Ruyman Reyes (Codeplay)
- What’s the change? oneAPI is shifting to open governance,
establishing a steering committee. Rod Burns from Codeplay to lead
the Steering Committee.
- Looking for feedback on what changes will make this new forum more useful. Will be reaching out to individuals.
- Could it include OpenACC and other things, or just SYCL?
- SYCL plus any standard and domain-specific library interfaces
- How does it fit with Khronos?
- Separate from Khronos. SYCL is still managed by Khronos.
- Recommend not to separate the SYCL ecosystem from Khronos – agreed. Basically need to work out how the oneAPI forum works in relation to the DPC++ implementation of SYCL and the Khronos forum.
- Khronos does not manage anything beyond the language while this forum will encompass libraries and other domain-specific interfaces.
- What would the language working group be if it is not SYCL? Do we need it?
- Need to define the working model – the working group should define it
- Language specification would take years. Need to decide: are we building on top of SYCL or as an extension of SYCL. For example, some SYCL 2020 features do not work for the broader use cases (e.g. client codes) because they were not considered in the initial proposals that got accepted.
- Comes down to 2 options:
- Option 1: oneAPI programming model is close to SYCL but not exactly SYCL
- Option 2: Nothing in oneAPI will require extensions that are not standardized
- What about the other working groups (e.g. math libs)?
- The majority of libs are dispatch libs or cover a small subset of SYCL, so we do not expect them to introduce non-compatible extensions.
- Exception is oneDPL today. But long-term goal for oneDPL is to be fully SYCL 2020 compliant – team is working towards it.
- What is the mandate/scope? Must work everywhere? What about hardware
differences?
- Kokkos makes everything work everywhere, and tries to take advantage of hardware where possible.
- OpenMP excludes certain HW-specific features from the main standard – those will be enabled through vendor-specific optimizations
- For the oneAPI forum, want it to work everywhere. Extensions should allow for specific HW. Ultimately, we want processor vendors to adopt these interfaces & open-source implementations.
- What if Intel develops new features and wants to expose it?
- In that case, Intel will propose an extension to oneAPI and implement it ahead of spec in the oneAPI products
- Will there be implementations before specification is defined?
- For example, OpenMP 5.0 defined the spec first with no implementations and now subsequent versions are fixing bugs because of it
- For Intel, we've decided not to release spec without implementation for the various oneAPI elements. Seems like that’s the direction of both Kokkos and Khronos.
- For the MPI forum, it alternates between spec meetings & implementations developed in between, which has worked well
- Ultimately, working groups will have to decide but extremely
likely we’ll have to do implementations as spec is developed.
- The TAB was supportive of this direction
- Rod will look to setup a smaller group discussion to gather more feedback offline
- Slides
- What is it? Reusable task graph to reduce host overhead,
good for small repetitive kernels
- Break submit into 2 parts: definition & execution
- Reduction in overhead - even in simple examples (1D heat), GPU is kept busy
- 2 modes: Explicit API vs Record & Replay
- Spec for both is public: Explicit API PR and Record & Replay PR
- Both approaches are compatible in a single extension
- Explicit API
- Issues with edges
- You can create invalid graphs. Why not align it with senders/receivers? That prevents you from creating invalid graphs.
- You can always build that on top of node/edge.
- SYCL queue already builds a task graph, why introduce a different API?
- Limitations in what can be expressed. This direct programming model is good if you want to keep memory local. Will follow-up with more examples.
- Record & replay
- A node is a command-group submission; an edge is a dependency (see the sketch after these notes)
- Whole graph update - update graph with buffers
- Do you error out if someone waits on queue during record? No.
- TAB recommends some notification or timer
- Graph extends the lifetime of buffer objects created during record
- Compared to CUDA, we like having an object that can be queried. It's the safer choice.
- Want to move towards having a single vendor extension that supports both functionalities. Will work on resolving differences between Intel and Codeplay implementations.
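A minimal sketch of the record & replay mode discussed above, using the command_graph interface from the sycl_ext_oneapi_graph proposal. The extension is experimental, so exact namespaces and member names may differ from what eventually ships; the kernels are placeholders standing in for the "small repetitive kernels" use case.

```cpp
#include <sycl/sycl.hpp>

namespace sycl_exp = sycl::ext::oneapi::experimental;

void run_repetitive_workload(sycl::queue &q, float *data, size_t n) {
  sycl_exp::command_graph graph{q.get_context(), q.get_device()};

  // Definition: record the command groups submitted to the queue.
  graph.begin_recording(q);
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] *= 2.0f; });
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] += 1.0f; });
  graph.end_recording(q);

  // Execution: finalize once, then replay many times with low host overhead.
  auto exec = graph.finalize();
  for (int iter = 0; iter < 1000; ++iter)
    q.ext_oneapi_graph(exec);
  q.wait();
}
```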
- Robert Cohn (Intel)
- Aksel Simon Alpay (Heidelberg University)
- Andrew Lumsdaine (University of Washington, Pacific Northwest)
- Antonio J. Peña (Barcelona Supercomputing Center)
- James Brodman (Intel)
- John Melonakos (ArrayFire)
- Ronan Keryell (AMD)
- Michael Kinsner (Intel)
- Leping
- Nevin Liber (Argonne National Laboratory)
- Geoff Lowney (Intel)
- Lukas Sommer (Codeplay)
- Mehdi Goli (Codeplay)
- Pablo Reble (Intel)
- Gergana Slavova (Intel)
- Xinmin Tian (Intel)
- Christian Trott (Sandia National Laboratory)
- Umar Arshad (ArrayFire)
- Victor Lomuller (Codeplay)
- Victor Perez
- Slides
- Why Fusion?
- short-running kernels
- manual fusion is too much work, not composable, error prone, domain-specific
- extend sycl api for user-driven, automatic
- Extension requirements
- fused kernel must be equivalent
- improves performance
- minimal changes in code bases
- sycl runtime makes final decision
- Extension
- queue methods
- start_fusion
- cancel_fusion
- complete_fusion
- some properties
- Use
- start, submit, submit, complete (see the sketch after these notes)
- fused everything between start/complete
- promote local, private for work group/work item
- Questions about what is legal for local/private
- Implementation in ComputeCpp
- Evaluation
- SYCL-DNN
- Porting to dpc++
- step 1, intel cpu/gpu
- step 2, amd/nvidia gpu
- future work
- dpc++ support
- single internalize property
- support fusing different nd ranges
- explore fusing more arithmetic heavy networks
- Q/A
- Discussion on interfaces
- start/stop separates fusion from the place where the function is called
- Can lead to errors when the programmer is not aware it is part of a fusion
- what about MPI/host calls, explicit copies, host operations, ...
- requires compiler analysis, debugging tools to find problems
- fusion object would be more explicit
- liked using queue object because you could call libraries and have it fused
- cuda has default device/etc, which is also dangerous
- Do you intercept copies on other queues?
- Is there a way to enable fusion and then launch kernel that you do
not want fused?
- No, but you could add to another queue
- How do you handle device limitations, for example limited arguments?
- fall back to running kernels unfused
- How would you do static compilation? Could be a C++ type/hint, and at runtime it would check if already compiled. Could do AOT, use MLIR.
- Is allocation performed when temp buffer is created?
- yes. You may not use it, but it will be allocated.
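A minimal sketch of the start / submit / submit / complete pattern described above. The queue methods are named as in the proposal (start_fusion, complete_fusion, cancel_fusion); a shipped extension may expose fusion differently (for example through a separate wrapper object), so treat these method names as placeholders for the proposed interface rather than an existing API.

```cpp
#include <sycl/sycl.hpp>

void fused_saxpy_scale(sycl::queue &q, float *x, float *y, float a, size_t n) {
  q.start_fusion();  // begin collecting kernels for fusion (proposed API)

  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { y[i] += a * x[i]; });
  q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { y[i] *= 0.5f; });

  // The SYCL runtime makes the final decision; if fusion is not possible or
  // not profitable, the kernels run unfused.
  q.complete_fusion();
  q.wait();
}
```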
- Robert Cohn (Intel)
- multi-backend
- NVIDIA GPU, AMD GPU, Intel GPU, CPU
- extensions
- async buffers
- lightweight accessors
- hierarchical programming model
- low-latency interop
- concurrent use of same buffer
- interop between buffers and usm pointers
- embed native CUDA or HIP
- oneMKL
- upstream BLAS support for use with hipsycl
- rocRand support
- competitive with DPC++, CUDA, rocBLAS
- Q: Why is it slower than cuBLAS for small problems? That is the problem size we commonly see.
- issues in the run-time; we are working on it
- oneAPI and hipSYCL
- many components work: oneMKL, Level Zero
- opportunities to test with different compilers
- sycl::queue q1; should not work with the <CL/sycl.hpp> header file
- how do you handle dependencies expressed with accessors? e.g. with
out of order queue
- runtime caches until flush, build taskgraph, backend decides
- how do you handle the overhead of providing out-of-order semantics on CUDA?
- handled in runtime layer on top of queue
- hipsycl implementation
- library-only
- single source, multiple pass
- and combinations
- sscp: single source single pass
- single source, multi pass
- library-only for host
- advantage: pure c++, portable, behaves like regular C++ application for debugging/profiling
- specification bugs mean some things do not work / do not work well with library-only
- may need to write code differently for performance
- sscp: single source single pass
- nvc++ uses this design
- cannot use macros to distinguish host/device code
- fast compile time
- library-only for device
- only limitation compared to library-only host is 1.2.1 hierarchical parallelism
- NVC++ support: sscp
- can use this to write SYCL programs for nvidia
- provide functionality to specialize based on host/device without relying on macro
- but single compilation pass nvc++ is not faster than multi-pass clang
- issues with barrier for CPU
- barrier in middle of workgroup difficult in library-only
- with a compiler you can split and vectorize loops
- scoped parallelism instead of ndrange
- separate code before/after barrier
- add compiler support, but retain library advantage
- Q: If you are adding compiler pass anyway, can you use that to
resolve other SYCL issues?
- we could go full-blown compiler, but it does not resolve spec issues which allows library-only.
- we still offer library-only, and believe there is value, there are ways to resolve spec issues.
- Robert Cohn (Intel)
- Christian Trott (Sandia National Laboratory)
- Aksel Simon Alpay (Heidelberg University)
- Andrew Lumsdaine (University of Washington, Pacific Northwest)
- James Brodman (Intel)
- John Melonakos (ArrayFire)
- Michael Kinsner (Intel)
- Nevin Liber (Argonne National Laboratory)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- John Pennycook (Intel)
- Pablo Reble (Intel)
- James Reinders (Intel)
- Alison Richards (Intel)
- Romain Dolbeau (SiPearl)
- Ronan Keryell (AMD)
- Ruyman Reyes (Codeplay)
- Roland Schulz (Intel)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Stefan Yurkevitch (ArrayFire)
- Xinmin Tian (Intel)
- Vasanth Tovinkere (Intel)
- Umar Arshad (ArrayFire)
- Speed of upstreaming SYCL to LLVM
- Discuss at next meeting
- Bottleneck is reviewing
- Difficulty of using C++ from SYCL: trivially copyable and other requirements
- This year, more non-Intel presentations on SYCL and related technologies
- DPC++ is an open source project to implement SYCL in LLVM
- Intel has not consistently said this, emphasizing DPC++ over SYCL
- Early hesitance was about the lack of maturity of SYCL 1.2, but SYCL 2020 is much more complete.
- Redoing messaging to emphasize SYCL
- marketing materials
- documentation
- DPC++ book will have SYCL logo, ensure examples work with other compilers
- If you find issues, send them to James
- Discussion
- industry-wide standards have innovation problem
- SYCL is still early enough to move fast
- Want to know that hardware features will be exposed right away, and not wait for standardization.
- SYCL has vendor extensions, issue will be the time for vendor extension to be part of standard
- Comparison to OpenMP, which is slow
- Tools and processes are better for collaboration now
- openmp slowness comes from discussion
- Robert Cohn (Intel)
- Slides
- PGAS and OpenSHMEM
- symmetric heap is remote accessible
- put/get/atomics (see the sketch after these notes)
- collectives
- memory ordering
- OpenSHMEM specification
- active vendor/government/academic
- new features
- thread safety
- communication contexts
- teams
- upcoming
- memory spaces
- GPU
- Q: Are there reference implementations?
- Sandia OpenSHMEM is closest to reference implementation, most likely to support new features. There are others.
- Feedback:
- Kokkos has implemented a communication library for PGAS and GPU. It does caching and aggregation, and performs better than lower-level vendor libraries.
- OpenSHMEM integration with oneAPI
- Host OpenSHMEM can work with SYCL/USM
- Can do GPU initiated memory
- issues with resource sharing with MPI
- need subset & extensions, to execute on device
- feedback:
- aggregation useful for put/get, does not require API change, put per thread
- coherency only on kernel boundary, don't need to synchronize frequently; otherwise not doable for PGAS
- SYCL cannot support fence from kernels
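A small host-side example of the symmetric heap and put/get model summarized above, using the standard OpenSHMEM C API from C++. Every PE allocates the same remotely accessible object from the symmetric heap, and any PE can put data directly into another PE's copy.

```cpp
#include <shmem.h>
#include <cstdio>

int main() {
  shmem_init();
  int me = shmem_my_pe();
  int npes = shmem_n_pes();

  // Symmetric allocation: the same remotely accessible buffer exists on every PE.
  long *dest = static_cast<long *>(shmem_malloc(sizeof(long)));
  *dest = -1;
  shmem_barrier_all();

  // One-sided put: write my rank into my right neighbour's symmetric buffer.
  long value = me;
  shmem_long_put(dest, &value, 1, (me + 1) % npes);
  shmem_barrier_all();

  std::printf("PE %d received %ld\n", me, *dest);
  shmem_free(dest);
  shmem_finalize();
  return 0;
}
```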
- Robert Cohn (Intel)
- Aksel Simon Alpay (Heidelberg University)
- Andrew Lumsdaine (University of Washington, Pacific Northwest)
- Ben Ashbaugh (Intel)
- David Beckingsale (Lawrence Livermore National Laboratory)
- James Brodman (Intel)
- Madhura Chatterjee (Intel)
- Christian Trott (Sandia National Laboratory)
- Romain Dolbeau (SiPearl)
- En Shao (Institute of Computing Technology, Chinese Academy of Sciences)
- Elvis Fefey (Intel)
- Joseph Koston (Intel)
- Alexey Kukanov (Intel)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- Javier Martinez (Intel)
- Nevin Liber (Argonne National Laboratory)
- John Pennycook (Intel)
- Pablo Reble (Intel)
- James Reinders (Intel)
- Alison Richards (Intel)
- Ronan Keryell (Xilinx)
- Ruyman Reyes (Codeplay)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Stefan Yurkevitch (ArrayFire)
- Xinmin Tian (Intel)
- Tom Deakin (University of Bristol)
- Umar Arshad (ArrayFire)
- Mike Voss (Intel)
- Anuya Welling (Intel)
- Slides
- Why not SYCL Queue?
- More abstract/higher level than level zero or SYCL
- User manages memory allocation, data transfer, etc.
- SYCL queue does not support explicit graphs, which is needed by users and will be supported in the dynamic selection proposal
- SYCL can offer queue with same functionality
- Mixing of direct use of SYCL queue and higher level functions (e.g. oneDNN) may lead to inter-operability issues
- hipSYCL has similar functionality: multi-device scheduler, but they cannot fully implement all SYCL queue features (e.g. get_device)
- What if my kernel requires certain conditions (e.g. GPU only)?
- You can create a policy that is limited to a universe of devices
- You can have different kernels for different devices
- You can extend the execution policy: create one that's application-specific, or provide a fallback policy
- Future-proofing: having to check the queue sounds fragile for
long-lived code
- In our initial proof-of-concept, we have a fallback to the CPU, or you can limit the device universe you define
- How clever will your selector be? Will it look at instruction mix,
bytes to flops, others?
- Specification will provide some simple-to-reason-about policies: round-robin, static ranking, auto-tuning, etc.
- We do expect for more complicated policies to exist, might allow custom policies to be defined.
- Who makes the choices will impact how this is designed. Example: auto-tuning will require certain user inputs.
- How do you express data management?
- Putting it in higher level abstraction doesn't allow consideration of data transfers.
- How will custom scoring policies be defined? Should the spec provide
a way to attach characteristics to the task submitted?
- Yes, that's a possibility. Want to avoid putting anything but the simplest policy in the spec.
- Alternatively, could pass properties at submission time
- Should we pursue the queue or execution policy path for dynamic
selection?
- Doesn't have to be either/or, might need both. Certain kernels will take better advantage of queue-like APIs, while others will be better suited to an execution policy.
- The problem with C++ execution policies is that they're basically just permissions for the compiler, e.g. std::execution::par.
- This is more like an executor/scheduler. Kokkos makes policy and resources orthogonal. May not be better to mash them together.
- Robert Cohn (Intel)
- Aksel Simon Alpay (Heidelberg University)
- Andrew Richards (Codeplay)
- Ben Ashbaugh (Intel)
- James Brodman (Intel)
- Romain Dolbeau (SiPearl)
- Alexey Kukanov (Intel)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- Nevin Liber (Argonne National Laboratory)
- Paul Petersen (Intel)
- Pablo Reble (Intel)
- James Reinders (Intel)
- Alison Richards (Intel)
- Ronan Keryell (Xilinx)
- Ruyman Reyes (Codeplay)
- Roland Schulz (Intel)
- Sanjiv Shah (Intel)
- Timmie Smith (Intel)
- Stefan Yurkevitch (ArrayFire)
- Xinmin Tian (Intel)
- Tom Deakin (University of Bristol)
- Mike Voss (Intel)
- Slides
- Level Zero Runtime API: Ben Ashbaugh
- Add a layer above today's level zero for utility functions that are shared by language runtimes. Make it part of level zero with a spec.
- Discussion
- Does this mean PI goes away? We are writing plugins.
- Support in principle, but it needs to be done with the community. Can't be a PR that shows up. We need a spec, discussion.
- Resource manager: Mike Voss
- Discussion
- What happens when a constraint cannot be satisfied? e.g. NUMA
- Considering options:
- If required constraint cannot be met, then reject request and let requestor decide what to do.
- no strong opinion on the behavior, but it needs to be specified so application can respond
- Level Zero Runtime for CPU: Xinmin Tian
- Refactor and formalize plugin interface to support CPU
- Discussion
- General discussion
- Codeplay experience: functionality they are creating in PI plugins could be re-used
- Robert Cohn (Intel)
- Aksel Simon Alpay (Heidelberg University)
- Andrew Richards (Codeplay)
- Antonio J. Peña (Barcelona Supercomputing Center)
- James Brodman (Intel)
- Jessica Davies (Intel)
- Joseph Garvey (Intel)
- Michael Kinsner (Intel)
- Nevin Liber (Argonne National Laboratory)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- John Pennycook (Intel)
- Pablo Reble (Intel)
- Alison Richards (Intel)
- Ronan Keryell (Xilinx)
- Ruyman Reyes (Codeplay)
- Roland Schulz (Intel)
- Jason Sewall (Intel)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Stefan Yurkevitch (ArrayFire)
- Tom Deakin (University of Bristol)
- Umar Arshad (ArrayFire)
- SC21: Anyone travelling to it? Potentially, if it actually happens.
- Want to hear from TAB on priorities for SYCL & DPC++: what's a must fix, what's nice to have, etc.
- Slides
- New property list
- Enables property to be known at compile-time by compiler, preference for runtime values
- Migrate attributes (which should not be semantic) to properties (which can be semantic)
- Example: memory semantics
- Status
- PR is in GitHub intel/llvm (link)
- Currently implemented as oneapi extension
- Eventually want to fold it back into property_list as part of SYCL spec
- Proposed for next major SYCL version - send your feedback now so proposal can be holistic
- Styles
- Proposal will extend SYCL properties mechanism to include
values known at compile-time:
- no associated value: no_init
- runtime only: context_bound
- [NEW] compile-time: work_group_size (see the sketch after these notes)
- Q: How can it be attached to a lambda?
- Not directly possible. Example properties are passed to parallel_for.
- Could potentially wrap your lambda into a function object and add the property there. hipSYCL uses this method.
- Could also have new unique name for new kernel properties: that might make it easier to identify each property
- How to distinguish where some properties apply to launch, some apply to lambda?
- How would a library vendor consume the properties if in a lambda?
- Needs more discussion. Study this offline & bring a separate proposal for launch mechanism (e.g. parallel_for).
- Definition of properties
- No longer nested namespaces. Convention is to collapse object names to reduce verbosity.
- property_list is invariant to property ordering
- Have a proof-of-concept, will be made public shortly
- Interaction with existing SYCL runtimes classes
- User doesn't have to know when they set the property
- Currently planning to add for accessor, can decide on others on case-by-case basis
- Feedback
- Good to see progress, have seen it before in SYCL committee
- Ensure any device specific properties are portable
- Don't break semantics, implementation should be able to ignore hints
- Want to be able to mix-and-match between vendor-specific & generic
extensions
- Yes, should be able to do that
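A minimal sketch of the direction described above: a property list whose values are known at compile time, passed alongside the kernel. Names follow the experimental oneAPI properties / kernel-properties extensions (sycl::ext::oneapi::experimental); the proposal discussed here predates the final extension, so the spelling of the properties and the parallel_for overload are assumptions. n is assumed to be a multiple of the work-group size.

```cpp
#include <sycl/sycl.hpp>

namespace syclex = sycl::ext::oneapi::experimental;

void launch_with_properties(sycl::queue &q, float *data, size_t n) {
  // Compile-time-valued properties; the list is invariant to property ordering.
  syclex::properties props{syclex::work_group_size<64>,
                           syclex::sub_group_size<16>};

  q.parallel_for(sycl::nd_range<1>{n, 64}, props, [=](sycl::nd_item<1> it) {
    data[it.get_global_id(0)] += 1.0f;
  });
}
```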
- Robert Cohn (Intel)
- Aksel Simon Alpay (Heidelberg University)
- David Beckingsale (Lawrence Livermore National Laboratory)
- James Brodman (Intel)
- Christian Trott (Sandia National Laboratory)
- Erik Lindahl (Stockholm University)
- Michael Kinsner (Intel)
- Alexey Kukanov (Intel)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- Nevin Liber (Argonne National Laboratory)
- John Pennycook (Intel)
- Pablo Reble (Intel)
- James Reinders (Intel)
- Alison Richards (Intel)
- Romain Dolbeau (SiPearl)
- Ronan Keryell (Xilinx)
- Ruyman Reyes (Codeplay)
- Roland Schulz (Intel)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Umar Arshad (ArrayFire)
- June TAB meeting is cancelled - overlaps with ISC'21
- Welcome to Romain Dolbeau, who joins us from SiPearl!
- Slides
- Published slides have been updated based on discussion
- Motivation
- Design Goals
- uniform<T>
- Compiler can mark variables as:
- Varying: different value for each work item
- Uniform: proven the same for each work item
- uniform<T> overrides above compiler analysis, undefined if values are not the same
- Storage is implementation-defined. Can be scalar or vector.
- Discussion
- Statement that it is an optimization hint and can be ignored is not accurate; it is user-facing and can lead to bugs [Slides have been updated accordingly]
- Need debug options, when 1) assigned, and 2) modified
- Cannot modify since it's a constant
- Knowing it is constant changes viewpoint because it eliminates a
class of bugs
- Do we need to augment the name to make it clear it's a constant?
- Name is common with OpenMP uniform, with some exceptions
- invoke_simd
- Explicit SIMD
- Can invoke a function that takes/returns SIMD/uniform arguments and a SIMD mask (see the sketch after these notes)
- bool -> SIMD mask
- arithmetic -> SIMD
- uniform -> scalar
- Discussion
- Does reqd_sub_group_size have to be known at compile time to use
invoke?
- Yes. In current proposal, only possible to know this via an attribute that will be defined at compile-time.
- Does it follow normal rules with templates/overloads?
- Yes
- Sub-group size
- Taking an argument by reference is not allowed. It becomes hard to understand whether it is a reference to a vector, or a vector of references. OpenMP solves this by having linear references, but that is not available here.
- How does it work on CPU? Can you set subgroup size to 8?
- It is allowed. Same as GPU, changes SIMD width.
- This is only available in DPC++, SYCL does not guarantee this.
- Discussion of SIMD-agnostic code: determining sub-group size
- How do you reconcile this if you don't know the vector
lengths? Those can vary by architecture, how can we be more
arch-agnostic? Variability includes changing the sub-group
size even during runtime.
- That's really a C++ semantics concern, outside the scope of SYCL/DPC++
- Implementation could potentially still support through a kernel dispatched at launch time by first understanding the machine arch. Would still need to know the set of possible sizes.
- Realistically, hardware vector lengths are limited. But, theoretically, a developer can optimize for any vector length.
- Seems like an appropriate topic for a change proposal in an upcoming C++ standard meeting.
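A minimal sketch of invoke_simd with a uniform argument, following the mapping rules listed above (arithmetic arguments become SIMD values in the callee, uniform arguments become scalars). It is based on the experimental sycl_ext_oneapi_invoke_simd extension; the simd type is assumed to come from the Parallelism TS (std::experimental), and the linkage/calling-convention annotations the real extension requires on the callee are omitted, so this is a sketch rather than a complete program.

```cpp
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/experimental/invoke_simd.hpp>

namespace syclex = sycl::ext::oneapi::experimental;
using namespace std::experimental;  // simd<T, N> in the extension's examples

constexpr int VL = 16;  // assumed required sub-group size

// Callee: one simd argument per varying (arithmetic) argument, a plain scalar
// for each uniform argument. (Real extension also requires linkage/calling
// convention annotations here, omitted for brevity.)
simd<float, VL> scale(simd<float, VL> x, float factor) {
  return x * factor;
}

void run(sycl::queue &q, float *data, float factor, size_t n) {
  q.parallel_for(sycl::nd_range<1>{n, VL},
                 [=](sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(VL)]] {
    sycl::sub_group sg = it.get_sub_group();
    size_t i = it.get_global_id(0);
    // Per work-item scalar in, per work-item scalar out; uniform{} marks
    // 'factor' as the same value across the sub-group.
    data[i] = syclex::invoke_simd(sg, scale, data[i], syclex::uniform{factor});
  });
}
```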
- Robert Cohn (Intel)
- Romain Dolbeau (SiPearl)
- David Beckingsale (Lawrence Livermore National Laboratory)
- Christian Trott (Sandia National Laboratory)
- En Shao (Institute of Computing Technology, Chinese Academy of Sciences)
- Erik Lindahl (Stockholm University)
- Guangming Tan (Institute of Computing Technology, Chinese Academy of Sciences)
- Simon P Garcia de Gonzalo (Barcelona Supercomputing Center)
- Michael Kinsner (Intel)
- Alexey Kukanov (Intel)
- Nevin Liber (Argonne National Laboratory)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- Pablo Reble (Intel)
- James Reinders (Intel)
- Alison Richards (Intel)
- Ronan Keryell (Xilinx)
- Timmie Smith (Intel)
- Stefan Yurkevitch (ArrayFire)
- Xinmin Tian (Intel)
- Tom Deakin (University of Bristol)
- Umar Arshad (ArrayFire)
- Ruyman Reyes (Codeplay)
- Pradeep Garigipati (ArrayFire)
- Andrew Richards (Codeplay)
- James Brodman (Intel)
- Slides
- oneDPL recap
- Notable changes
- Namespace oneapi::dpl, ::dpl, dropped oneapi::std because of usability
- Algorithms are blocking by default
- Execution policy
- device_policy, fpga_policy
- Implicit conversion to sycl::queue
- Notable implementation-specific additions,
not yet part of the spec:
- <random>
- range-based API
- async API
- <random>
- Subset of C++ random
- Generate several RNs at once into sycl::vec
- Seed + offset lets you generate the same values as the one-at-a-time API
- Feedback
- for_each should not be part of std:
- Have it for convenience, types prevent confusion with standard
- Range-based API
- Ranges are new for C++20
- Used in algorithms, not yet for execution policy
- Not fully standard-compliant, not based on concepts, no projections
- Examples:
- Fancy iterators allow combining into a single kernel, but are clumsy
- Ranges allow one kernel, more concise
- Expressed as pipeline of transformations
- Using with execution policies
- Range over:
- Sequence of indexes
- USM data
- Buffer
- With variants for all_read, all_write
- Looking for feedback on how to make it device copyable
- oneDPL v2021.3 has 34 algorithms with range-based API
- Feedback: happy to see modern C++
- Async api
- Blocking is default
- Deferred waiting mode enabled by macro
- Only for functions with no return value
- Non-standard, will not be part of spec
- Experimental async
- Never wait, return future-like object
- Supports multi-device
- API
- Add an _async suffix; alternatives: namespace, policy class (see the sketch after these notes)
- Takes an arbitrary number of dependencies as arguments
- Returns an unspecified future-like type
- Not specified because it is an extension and we did not want to limit it
- Inter-operable with sycl::event
- Holds internal buffers, so keep track of lifetime. Attached to return value.
- Feedback
- Do you have control over launching policy?
- We use queue submit, so no control
- Looks fine
- Not sure adding dependencies is right, does not like argument number creep
- _async is ok since return value is different
- Could look like CUDA graph. Add .then.
- Is this allowed to be eager?
- Could start submitting at get
- Probably best to allow it to be eager without requiring it.
- Can you re-submit the same graph?
- You can create separate function, which addresses convenience but not performance
- We are interested in looking at static graph
- .then allows more explicit graph building
- Looking at C++ executors, schedules, but proposals are not
settled
- It might address the issue of building/executing graphs
- Minimum C++
- oneDPL supports C++11
- SYCL 2020 requires C++17
- Strong desire to move to c++17
- Feedback
- Kokkos moved to 14 in Jan and will move to 17 by end of year, stakeholders are ok
- Surprises not good for users, should have very clear policy
- e.g. support for latest-5 years
- Established cadence
- Is oneDPL usable without C++17? It relies on SYCL features which need it.
- We have a different set of execution policies
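A minimal sketch of the experimental async API described above: algorithms take an _async suffix, accept additional dependencies as trailing arguments, and return an unspecified future-like object that interoperates with sycl::event. Names follow oneDPL's oneapi::dpl::experimental interface at the time; since the API is explicitly non-standard and experimental, details may have changed. The data pointer is assumed to be device-accessible USM.

```cpp
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/async>
#include <sycl/sycl.hpp>

void pipeline(sycl::queue &q, int *data, size_t n) {
  auto policy = oneapi::dpl::execution::make_device_policy(q);

  // First stage: fill the buffer; returns a future-like object immediately.
  auto f1 = oneapi::dpl::experimental::fill_async(policy, data, data + n, 1);

  // Second stage: depends on the first; the dependency is passed as a trailing argument.
  auto f2 = oneapi::dpl::experimental::for_each_async(
      policy, data, data + n, [](int &x) { x *= 2; }, f1);

  f2.wait();  // or: oneapi::dpl::experimental::wait_for_all(f1, f2);
}
```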
Attendees:
- Aksel Simon Alpay (Heidelberg University)
- James Brodman (Intel)
- John Melonakos (ArrayFire)
- Michael Kinsner (Intel)
- Alexey Kukanov (Intel)
- Nevin Liber (Argonne National Laboratory)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- John Pennycook (Intel)
- Pradeep Garigipati (ArrayFire)
- Pablo Reble (Intel)
- James Reinders (Intel)
- Alison Richards (Intel)
- Ronan Keryell (Xilinx)
- Roland Schulz (Intel)
- Gergana Slavova (Intel)
- Kevin Smith (Intel)
- Timmie Smith (Intel)
- Stefan Yurkevitch (ArrayFire)
- Xinmin Tian (Intel)
- Tom Deakin (University of Bristol)
- Umar Arshad (ArrayFire)
- Robert Cohn (Intel)
- IWOCL and SYCLcon 2021 registration is open
- Our next TAB meeting (on April 28) will coincide with an IWOCL live event. Will shift our TAB meeting to 1 week earlier (to April 21).
- What other topics should we discuss here? Give us your suggestions.
Continued from SYCL 2020 implementation priorities
No discussion on the following topics, please see slides for details. Special request to group: provide feedback on images as it hasn't gotten much attention in the community.
- Kernel bundles
- Specialization constants
- Device copyable
- Sampled_image, unsampled_image
- Accessor to const T is read-only
- sycl::exception error codes, not class hierarchy
Implemented features
- Kernels must be immutable
- Change is due to high probability of bugs & allowing more freedom of implementation
- A few folks have seen problems during implementation (when kernels could be mutable). Lots of discussion on how to define the right behavior so ultimately decided to restrict mutability. If this group has use cases where restrictions need to be loosened, let the team know.
- Do we need to add a note/block article to describe the issue? Yes, documentation is a good idea.
- marray
- vec used for SPMD code, but designed for SIMD (want to move in that direction in the future)
- SIMD support via ESIMD, sycl::vec, std::simd
- marray recommended for vectors in SPMD code
- Size does not contain padding
- No swizzle and write to element allowed
- sycl::exception derives from std::exception
- No discussion
- Async errors no longer silently ignored
- No discussion
- sycl::bit_cast is c++20 bit_cast
- No discussion
- Queue
- Without this, folks were missing a certain degree of control
- Basically, a missing constructor: explicit context & device
- Namespace from cl::sycl to sycl
- Still accepts cl::sycl
Looking forward to further input from this group on prioritization for LLVM open source project. Want to know:
- What should be implemented next? What are you dependent on?
- What's missing in DPC++ that's critical for your workloads?
Request for additional features
- Virtual function support
- May not be possible on all devices, e.g. FPGA
- FPGA has some workarounds when virtual functions are needed through std::variant
- Is variant something we can use in the general case as well? No. Requires developer to know all possible types & code is not easy to re-write until you get pattern-matching.
- Inheritance rules: single vs. multiple, restrictions
- Could we use vtable size when conflicts arise?
- OpenMP committee is considering limiting to single inheritance to make implementation easier
Attendees:
- Aksel Simon Alpay (Heidelberg University)
- David Beckingsale (Lawrence Livermore National Laboratory)
- Robert Cohn (Intel)
- James Brodman (Intel)
- Michael Kinsner (Intel)
- Alexey Kukanov (Intel)
- Nevin Liber (Argonne National Laboratory)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- John Pennycook (Intel)
- Pablo Reble (Intel)
- James Reinders (Intel)
- Roland Schulz (Intel)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Xinmin Tian (Intel)
- Tom Deakin (University of Bristol)
- Ronan Keryell (Xilinx)
- Alison Richards (Intel)
- Christian Trott (Sandia National Laboratory)
- John Melonakos (ArrayFire)
- Stefan Yurkevitch (ArrayFire)
- Umar Arshad (ArrayFire)
- Ruyman Reyes (Codeplay)
- Simon P Garcia de Gonzalo
- Pradeep Garigipati (ArrayFire)
- Andrew Richards (Codeplay)
- Slides
- Need your feedback on prioritizing implementation of SYCL 2020 features for upstream LLVM
- Atomics
- Could AddressSpace argument be generated at runtime? Other implementations
have not used it.
- Perhaps can consider a basic version of atomic_ref without it
- Limitations on arbitrary size atomics? Do we need to go beyond 64?
- Yes, need it to support complex double.
- Subgroups
- How do we handle namespace changes and existing code?
- We will have both for a period of time. Eventually DPC++ extension will be deprecated.
- Group Algorithms
- What are the restrictions on where you call the APIs, especially
nested loops?
- Designed to be called from nd-range parallel_for. Cannot be called in hierarchical parallelism (parallel_for_work_group, parallel_for_work_item).
- Could it work at work-group scope? We have it in hipSYCL.
- Pennycook to follow-up offline
- Sub-group Algorithms: no discussion, check slides for details
- Reductions (see the sketch after these notes)
- Do you support multiple reductions? Limited support only. For example, no more than one reducer per kernel is allowed.
- What happens if ndspan gets into C++23 but we are still on C++17?
- Like span (C++20), we pre-adopt, eventually it becomes std::span
- Why is parallel_for without explicit work-group size challenging?
- Implementations have heuristics for work-group size. Can't use same heuristics because of other limitations: constraints on shared memory, etc.
- Reduction code is 2/3 of the CUDA backend in Kokkos. It's important to have reductions in the standard - same code has failed by simply moving to a different version of the same hardware platform in the past.
- Any performance testing with span reductions? Past experience has shown that performance falls apart when going beyond 8; you are better off doing scalar.
- Reductions aren't guaranteed to be deterministic? Right.
- Group Mask: no discussion, check slides for details
- Accessor Changes: no discussion, check slides for details
- Work-group local memory
- What is the rationale for using a function instead of wrapper
type? Similar feature in hipSYCL but implemented with wrapper.
- Thread local was closest. Did not want keyword. Thought wrapper type was confusing for scope & visibility and has restrictions on where you can put it. Can't use as temporary. Looks like it is per work-item, but isn't.
- We want to align on function vs. wrapper for next spec version (Roland will follow-up with Aksel)
- Multi_ptr: no discussion, check slides for details
- Heterogenous device
- Is this a const expr function?
- No. Only known at runtime.
- Still looking at dispatching on the device, this is host dispatch.
- Did not finish the remainder - will bring this discussion back in March
- Focused on describing items that are not fully implemented yet. Looking for prioritization from this group on what to do first.
- How should feedback be submitted?
- Opening issues on llvm github is best. Ok to also use email to TAB members.
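A minimal example of the SYCL 2020 reduction interface discussed above, assuming the data pointer is device-accessible USM. As noted in the discussion, the result is not guaranteed to be deterministic for floating point, and the runtime owns the reduction strategy and work-group size choice.

```cpp
#include <sycl/sycl.hpp>

float sum(sycl::queue &q, const float *data, size_t n) {
  float result = 0.0f;
  {
    sycl::buffer<float> result_buf{&result, 1};
    q.submit([&](sycl::handler &h) {
      // The reduction combines into the buffer's existing value (0.0f here).
      auto red = sycl::reduction(result_buf, h, sycl::plus<float>());
      h.parallel_for(sycl::range<1>{n}, red,
                     [=](sycl::id<1> i, auto &acc) { acc += data[i]; });
    });
  }  // buffer destruction copies the result back into 'result'
  return result;
}
```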
Attendees:
- Alexey Kukanov (Intel)
- Gergana Slavova (Intel)
- Xinmin Tian (Intel)
- Sanjiv Shah (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- James Reinders (Intel)
- Mark Hoemmen (Stellar Science)
- Piotr Luszczek (University of Tennessee, Knoxville)
- Christian Trott (Sandia National Laboratory)
- Nevin Liber (Argonne National Laboratory)
- Marius Cornea (Intel)
- Michael Kinsner (Intel)
- Edward Smyth (Numerical Algorithms Group (NAG))
- Sarah Knepper (Intel)
- James Brodman (Intel)
- Geoff Lowney (Intel)
- Pablo Reble (Intel)
- Mehdi Goli (Codeplay)
- John Pennycook (Intel)
- Roland Schulz (Intel)
- Timmie Smith (Intel)
- Shane Story (Intel)
- Maria Kraynyuk (Intel)
- Jeff Hammond (Intel)
- Nichols Romero (Argonne National Laboratory)
- Penporn Koanantakool (Google)
- Alison Richards (Intel)
- Robert Cohn (Intel)
Small group discussions on 3 major themes identified in Geoff's presentation
- Irregular Parallelism: led by Mike Kinsner & James Brodman
- Can we look to OpenMP? Mark up the work and later decide who does it.
- Dynamic dispatch but need to consider:
- Chicken and egg
- Is this the right abstraction or is there a better option?
- Is a kernel too much?
- Do we need a smaller "task"?
- Consider cross lane operations to help dynamically remap/move work. Do we need better ways to detect this?
- Can cooperative groups help here? Is converged control flow restriction too limiting?
- Tasking has been one approach
- Granularity/complexity important - if it's too hard, an application might not use it
- NUMA: led by Xinmin Tian
- Slides
- Places is a reasonable abstraction for NUMA affinity control
- The C++ standard committee executor WG is investigating NUMA support as well
- Ease-of-use considerations:
- How to present NUMA control / usage model to users is very important for ease of use
- A big customer prefers a simpler method for applications w.r.t NUMA domains usage. User expects implicit NUMA-aware support for applications cross-tile.
- We may need high abstractions such as “spread” and “close” for programmers
- Also need to support fine-level control for ninja programmers with a good mirror to architectural hierarchy
- GPU (HW and driver) may support a “fixed mode” for programmers on NUMA thread-affinity control
- Performance:
- TensorFlow uses and supports a high-level control of NUMA domains for TF performance
- Kokkos primarily uses OpenMP environment variables to get ~10x performance for some Kokkos users
- Good thread-affinity control is tied to implementation specifics
- Scheduling:
- How to support NUMA control has impact on portability and scheduling. Explicit NUMA control is served better in applications. Use the subdevice (tile) as a GPU (a NUMA domain), then, the scheduling happens in the tile, which minimizes NUMA impact but is a bit more work for users.
- DPC++ (Gold) started with a high level control DPCPP_CPU_CU_AFFINITY={master | close | spread} for CPU. There are scheduling implications as well for thread-data affinity.
- Need to give people an easy mode that works. Tying data to tasks is key: if we can design something where programmers say "Here are my data dependencies, please schedule this in a way that gets good performance" we'll have more luck than if we ask nonexperts to reason about things like whether pages should be interleaved and the granularity of thread scheduling.
- Distributed computing: led by Jeff Hammond
- Preference for send-recv, particularly in stencil codes
- TensorFlow doesn’t use MPI but we've reimplemented all of the MPI collective algorithms in MeshTensorFlow
- What is the memory consistency model? Assume memory consistency only at kernel boundaries. We did distributed GPU in Kokkos already and it works great on DGX but may not apply in other cases.
- Higher level abstractions are important but hard. It’s nice to not have to implement the entire STL and start small.
- Still upset at MPI standard dropping C++ bindings.
- Getting things into ISO C++ is a huge pain.
- MPI-3 RMA is amazing. Should we consider a similar model in DPC++?
Attendees:
- James Brodman (Intel)
- Robert Cohn (Intel)
- Tom Deakin (University of Bristol)
- Jeff Hammond (Intel)
- Ronan Keryell (Xilinx)
- Alexey Kukanov (Intel)
- Mike Kinsner (Intel)
- Jinpil Lee (RIKEN)
- Nevin Liber (Argonne National Laboratory)
- Geoff Lowney (Intel)
- Greg Lueck (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- Heidi Poxon (HPE)
- Pablo Reble (Intel)
- James Reinders (James Reinders Consulting LLC)
- Alison Richards (Intel)
- Andrew Richards (Codeplay)
- Ruyman Reyes (Codeplay)
- Roland Schulz (Intel)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Christian Trott (Sandia National Laboratory)
- Slides
- oneAPI spec 1.0 released on 2020-09-28; SYCL 2020 provisional released
- Thanks to TAB for their ongoing engagement
- Feedback provided has influenced both the DPC++ spec as well being fed into SYCL
- Specifically looking for directional feedback: items that are missing, that need more focus, or are going in the wrong direction
- Extensions table in DPC++ spec section does not look up to date
- oneAPI team to follow-up: e.g. SYCL provisional has parallel reduce but missing here
- The more we can say: "this is just SYCL", the better
- Want to know occupancy of kernels
- Need to add the ability to set the global and local range in parallel_for range not nd_range, and perhaps also to assert no barriers in nd_range parallel_for. Would this be harder for CPU?
- SYCL has mechanism for query, but what it queries is back-end specific - need to add something at the user level
- Better solution for trivially copyable issues
- Everything you capture needs to be trivially copyable, which implies the destructor does not do anything special (see the sketch after these notes)
- Unified shared memory (USM) is one way to deal with it but it comes with penalties - need memcopyable solution
- Example: a tuple is unlikely to be trivially copyable
- Want the ability to have non-trivial destructors with byte-copyable objects
- Need follow-up meeting: this time next week
- Static way to specify graphs of computations
- After data movement is optimized, only thing left is latencies
- Up to 40% latencies, in some cases
- What about streams/events? They're not as effective as CUDA graphs.
- Construct up front vs record/replay?
- In Kokkos, it needs to be explicitly constructed
- Having an explicit interface feels safer
- Vulkan/cl have been looking at command lists
- Level 0 has support for command lists
- Some benefit for parameterizability
- Would like to have timing of previous executions guide allocation/placement
- Auto-tuning for tiling/nd-range/work group size
- Do I have to write heuristics for every platform when using oneAPI across GPUs/CPUs?
- Kokkos has moved from heuristics to auto-tuning, including an auto feature where users let Kokkos choose parameters
- Kernels can be called millions of times, so auto-tuning in the same run is not a big deal
- Not just work group, also want to control occupancy: run at lower occupancy to use less cache. Could achieve 2.5x speedup by reducing occupancy.
- Need a hint for parallel_for and query to know what happened
- Want hints from the user about whether auto-tuning might be worthwhile
- Building a graph is one hint
- Hint about tuning parameter, does not change semantics, versus statements about barrier
- Using property list
- Lots of places where you hint
- Cooperative groups/barriers
- Considering device barriers vs mpi-style
- Kokkos is not using this because can't be sure it can be supported everywhere, and might not be faster than forcing a kernel stop/start. Latencies are also a problem and the device runs at lower frequency.
- Going back to host is very expensive. Could we use wavefront algorithm?
- Tried it for solvers, did not work
- Prefer coarse-grain barriers because it is easier to support and barriers are just one among many sources of overhead
- How can we get more feedback on oneDPL, oneTBB?
- Should we continue to discuss in this meeting or a separate forum?
Attendees:
- Robert Cohn (Intel)
- Gergana Slavova (Intel)
- Christian Trott (Sandia National Laboratory)
- Ruyman Reyes (Codeplay)
- Geoff Lowney (Intel)
- Heidi Poxon (HPE)
- James Brodman (Intel)
- James Reinders (James Reinders Consulting LLC)
- Mike Kinsner (Intel)
- Pablo Reble (Intel)
- Sergey Kozhukhov (Intel)
- Jinpil Lee (RIKEN)
- Timmie Smith (Intel)
- Ted Barragy (NAG Lead Computational Scientist supporting BP)
- Ronan Keryell (Xilinx)
- Roland Schulz (Intel)
- John Pennycook (Intel)
- David Beckingsale (Lawrence Livermore National Laboratory)
- Andrew Richards (Codeplay)
- Greg Lueck (Intel)
- Tom Deakin (University of Bristol)
- Welcome to Jinpil Lee who joins us from RIKEN! Jinpil is participating on the recommendation of Mitsuhisa Sato, RIKEN's deputy director.
- oneAPI spec v1.0 will be live next week. Thank you all in helping us achieve this tremendous milestone!
- Slides
- Purpose of this proposal is to prevent name conflicts between vendors
extending the SYCL spec, and make the extension apparent in user code
- Expect that new SYCL features will initially appear as extensions
- 3 options presented
- Covers methods for macros, free functions, and members
- Options took into account:
- Verbosity
- Similarity with past practice
- Similarity to macro name when all caps is used
- Option 1: All capitals
- Option 2: Initial capital
- Option 3: EXT prefix (all three options are sketched below with hypothetical names)
- Discussion
- Option 3 preferred by multiple people. Reasons why:
- Most consistent
- Makes it clear this is an extension even if it's not obvious based on the extension string
- Any worry about additional verbosity?
- Only 4 additional characters. Generally developers should be ok exchanging the extra characters for clarity.
- More verbosity might be good here as it forces people to be deliberate when using extensions
- For the vast majority, expect vendor-specific extensions to be temporary as they will be rolled into the standard. It is understood some may remain extensions forever because they are not suitable for standardization but those will be mostly exceptions.
- Would like offline feedback on bad experiences with any of the options.
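Purely hypothetical spellings to make the three options concrete; none of these names is real or adopted, they only illustrate the shapes discussed.

```cpp
// Hypothetical vendor-extension identifiers under each naming option.
namespace sycl::vendor {
    void QUERY_OCCUPANCY();       // Option 1: all capitals (reads like a macro)
    void Query_occupancy();       // Option 2: initial capital
    void ext_query_occupancy();   // Option 3: "ext" prefix, preferred above
}
```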
- Slides
- Function pointers are important, we want to enable them in Intel implementation and SYCL spec
- The options shown are high-level summary of many detailed discussions - mostly looking for feedback on the overall direction
- Today, function pointers are not allowed in device code, want to relax this restriction
- How are function pointers represented in source code? 2 options (see the sketch after this discussion):
- (Option 1) Implicit: typical C/C++ function pointers
- (Option 2) Explicit: wrapper around pointer
- Many options exist for language and implementation:
- Attributes vs wrappers
- Part of function type
- Based on past experience with Intel compiler implementation:
- OpenMP: attributes were enabled but not part of type system
- Encountered difficulties in passing function pointers with different vector variants
- Option 1: use C/C++ function pointers
- Every pointer is created with default set of variants: e.g. linear, uniform
- Discussion
- Concerned about generating multiple variants. A lot of code
generation. Is this really necessary, safe, clear how to implement
with compilers?
- Need it for virtual functions. Might need multiple variants for device.
- CUDA has bare-bones function pointers. Only usable in the context where they are created (device, host).
- We would still need translation functions for passing function pointers between host and device
- This is for SIMD. Need to know: vectorization factor (subgroup size), mask/unmask. When writing SPMD code and wanting to use SIMD, the called function needs to match the vector factor/mask.
- Compiler must create these variants and make choices as it
compiles/builds binary, how portable is this between different
compilers, different hardware?
- Not an easy answer, also need to take ease of debugging into account - does it crash when it fails?
- Each use case should be considered, including trade-offs for performance
- Are attributes part of overload resolution? No.
- Option 2 is really for non-virtual functions but overall direction might be to do a hybrid approach
- Need more discussion on this topic. Bring back to October meeting.
- Include more examples, clearer use case descriptions
Attendees:
- Robert Cohn (Intel)
- Gergana Slavova (Intel)
- Alison Richards (Intel)
- Andrew Richards (Codeplay)
- Ruyman Reyes (Codeplay)
- David Beckingsale (Lawrence Livermore National Laboratory)
- Geoff Lowney (Intel)
- Hal Finkel (Argonne National Laboratory)
- James Brodman (Intel)
- John Pennycook (Intel)
- Jeff Hammond (Intel)
- Roland Schulz (Intel)
- Ronan Keryell (Xilinx)
- Ted Barragy (NAG Lead Computational Scientist supporting BP)
- Timmie Smith (Intel)
- Tom Deakin (University of Bristol)
- Xinmin Tian (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- Christian Trott (Sandia National Laboratory)
- Greg Lueck (Intel)
- Spec: Robert Cohn
- Looking for feedback on usefulness of the PDF version of oneAPI spec
- Slides
- Extension mechanism
- Discussion
- Extension of existing classes breaks binary compatibility?
- When moving between vendors, you have to recompile, even without extensions
- It's the job of the implementor to ensure vendor-specific code runs on targeted hardware
- Needs more verbose guidance on how to make changes that are source compatible: conversions, constructors, overload sets. Further discussion to happen offline.
- Compile-time warnings would be useful
- Optional features of devices
- Similar to extension, because it may not be there
- has() is passed an aspect enum. Use if/template to handle absence of the feature (see the sketch below)
- Error handling
- Throw runtime exception when using a feature that is not supported by device
- devconstexpr: constant when compiling for device
- Discussion
- If this uses a keyword, it's no longer C++
- Could hide it in macro but that has other downsides
- Issues about lambda capture, device compiler, types not being present when feature is not supported.
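A minimal sketch of the aspect query described above, written with SYCL 2020-style spellings; the exact names at the time of this meeting may have differed.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    sycl::device dev = q.get_device();

    if (dev.has(sycl::aspect::fp64)) {
        // Safe to submit kernels that use double precision on this device.
    } else {
        // Take a single-precision fallback path; using the feature anyway
        // would result in a runtime error, as noted above.
    }
}
```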
- Slides
- Ability to declare local memory with a static size, instead of only via an accessor
- group_local_memory allocates, returning a multi_ptr (see the sketch below)
- Discussion
- Dynamically sized arrays
- Only static, use accessor for dynamic
- Support for arrays (std::array) and type requirements
(e.g. trivially destructible)
- Arrays supported, only requirement is trivially destructible
- Not enough time for full discussion, looking forward to further feedback here
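A sketch of the proposed allocation, following the shape the DPC++ extension later took; the namespace and exact spelling are assumptions relative to the version presented here.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    q.submit([&](sycl::handler &h) {
        h.parallel_for(sycl::nd_range<1>{128, 64}, [=](sycl::nd_item<1> it) {
            // Statically sized work-group-local allocation, no accessor needed;
            // returns a multi_ptr to memory shared by the whole work-group.
            auto tile = sycl::ext::oneapi::group_local_memory<int[64]>(it.get_group());
            (*tile)[it.get_local_id(0)] = static_cast<int>(it.get_global_id(0));
        });
    });
    q.wait();
}
```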
Attendees:
- Robert Cohn (Intel)
- Gergana Slavova (Intel)
- Ilya Burylov (Intel)
- Alison Richards (Intel)
- Andrew Richards (Codeplay)
- Christian Trott (Sandia National Laboratory)
- David Beckingsale (Lawrence Livermore National Laboratory)
- Geoff Lowney (Intel)
- Hal Finkel (Argonne National Laboratory)
- James Brodman (Intel)
- John Pennycook (Intel)
- Mike Kinsner (Intel)
- James Reinders (James Reinders Consulting LLC)
- Jeff Hammond (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- Roland Schulz (Intel)
- Ronan Keryell (Xilinx)
- Ruyman Reyes (Codeplay)
- Timmie Smith (Intel)
- Xinmin Tian (Intel)
- Slides
- Changes in accessors for SYCL 2020 provisional
- Device and host accessors have different behavior, not obvious from the call name
- Absence of a handler is interpreted differently for host (blocking) and non-host (non-blocking) accessors
- Placeholder host accessors are not supported
- Considering making 2 new types of host accessor, blocking and non-blocking (see the sketch at the end of this discussion)
- Discussion
- Concerns about excessive overloading and implicit behavior
- Just call it non-blocking vs calling it a task
- Names based on semantics vs use case
- Recommend to make the code be self-descriptive
- Creating more dedicated types/alias
- Is this level of granularity enough?
- Removed operator[](size_t index)
- Allowed passing item instead
- Need implicit conversions from size_t and other types to id
- Should check spec that it works that way
- Feedback from Argonne
- Highly desirable to have uniform set of rules for naming things
- Define a consistent prefix
- E.g. image_accessor vs host_image_accessor, should "image" always be first?
- Deduction guides are useful, but don't solve the problem of strict
argument order
- Default arguments must be in order. Might be better to have specialized/more general.
- Kokkos experience: helper classes take variadic arguments to
make typedef
- Host accessor does not help, because it needs to be stored and must be generic
- Christian can provide an example to share with the group
- Confusion around how local memory works; irregularity around usage
- Local memory allocated by accessor, different from all other accessors. Normally allocated somewhere else.
- Difference between view & allocation
- Working on a proposal, expect to bring it to this body for review soon
- Are 0 dimensional data structures used?
- Yes, common in Kokkos
- Atomic counters, error flags, ..
- Would also like to see 0 dimensional buffer (no range, 1 element)
- Need subspan mechanism to get view vs 1-off solutions
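A sketch of the two accessor flavors under discussion, in SYCL 2020-style spellings; the naming of a non-blocking host accessor was still an open question at this meeting.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<int> data(16, 1);
    sycl::queue q;
    {
        sycl::buffer<int> buf{data.data(), sycl::range<1>{data.size()}};
        q.submit([&](sycl::handler &h) {
            sycl::accessor acc{buf, h};            // device accessor, non-blocking
            h.parallel_for(sycl::range<1>{data.size()},
                           [=](sycl::id<1> i) { acc[i] += 1; });
        });
        sycl::host_accessor host{buf};             // blocks until the kernel is done
        int first = host[0];                       // host[0] == 2 here
        (void)first;
    }
}
```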
Attendees:
- Robert Cohn (Intel)
- Gergana Slavova (Intel)
- Alexey Kukanov (Intel)
- Antonio J. Peña (Barcelona Supercomputing Center)
- David Beckingsale (Lawrence Livermore National Laboratory)
- Geoff Lowney (Intel)
- Hal Finkel (Argonne National Laboratory)
- Heidi Poxon (HPE)
- James Brodman (Intel)
- John Pennycook (Intel)
- Roland Schulz (Intel)
- Ronan Keryell (Xilinx)
- Ruyman Reyes (Codeplay)
- Sandip Mandera (Intel)
- Timmie Smith (Intel)
- Tom Deakin (University of Bristol)
- Xinmin Tian (Intel)
- Alison Richards (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- Andrew Richards (Codeplay)
- SYCL 2020 provisional spec is now public: James Brodman
- Fairly major change vs. SYCL 1.2.1 including USM, quality-of-life improvements, new way of doing images
- A lot of the changes included were prototyped in DPC++ first
- Call for action: provide input on the spec either via the SYCL github (to be available soon) or through this group
- DPC++ vs SYCL
- With SYCL 2020, differences between DPC++ and SYCL are smaller, would be good to see a technical list of differences
- Would like to see a closer connection being made between DPC++ & SYCL
- DPC++ messaging has explicitly shifted to highlight the fact that DPC++ = ISO C++ + SYCL + extensions
- What is the need for a separate name, why not call it SYCL + vendor
extensions, similar to OpenMP?
- DPC++ is a short-hand way to refer to the collection of extensions. While the difference between DPC++ & SYCL 2020 is fairly small now due to the recent release, expectation is to continue to prototype new extensions through DPC++ before upstreaming to SYCL.
- This feedback will be rolled up to ensure it reaches the right people
- Slides
- Deprecate cl::sycl::atomic, replace with intel::atomic_ref (see the sketch at the end of this discussion)
- Mostly aligned with C++20 std::atomic_ref
- Which address spaces?
- local, global, or generic
- What about constant?
- Atomic does not seem relevant
- Issue about LLVM optimization, synchronization edges, etc. For more information, see comment 6 in LLVM PR37716
- memory orderings and scopes
- single happens-before relation
- questions about hardware implications, need for fences
- By specifying memory order/scope, you can tune performance
- Situations where fences are required dominate the performance. Need to do the exercise of identifying where fences are required for common patterns and look at other architectures, if it will be part of SYCL
- changes to fences and barriers
- changes memory consistency model
- makes SYCL default behavior close to C++
- a difference still exists because of private memory
- Questions
- should we support std::atomic_ref in device code?
- Yes as a migration solution, with expectation that eventually code uses SYCL native
- Do not want to support name, but give it different meaning
- Interesting to see if this supports different-sized <T>s
- Do we need a std::atomic-like interface as well as atomic_ref?
- Is the issue performance?
- What are the semantics of std::atomic on host being accessed on device
- Argonne has code that uses std::atomic. Would it make sense to compile code that uses it in device code?
- what is code usage of std::atomic?
- arrays, data structures
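A sketch of the proposed atomic_ref usage, written with SYCL 2020 spellings; the extension presented here used intel::atomic_ref.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    int *counter = sycl::malloc_shared<int>(1, q);
    *counter = 0;
    q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1>) {
        sycl::atomic_ref<int, sycl::memory_order::relaxed,
                         sycl::memory_scope::device,
                         sycl::access::address_space::global_space>
            c{*counter};
        c.fetch_add(1);  // memory order and scope are the tuning knobs noted above
    }).wait();
    // *counter == 1024 here
    sycl::free(counter, q);
}
```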
Attendees:
- Ted Barragy (NAG Lead Computational Scientist supporting BP)
- David Beckingsale (Lawrence Livermore National Laboratory)
- James Brodman (Intel)
- Robert Cohn (Intel)
- Tom Deakin (University of Bristol)
- Hal Finkel (Argonne National Laboratory)
- Ronan Keryell (Xilinx)
- Mike Kinsner (Intel)
- Alexey Kukanov (Intel)
- Geoff Lowney (Intel)
- Andrew Lumsdaine (University of Washington, Pacific Northwest National Laboratory)
- Antonio J. Peña (Barcelona Supercomputing Center)
- John Pennycook (Intel)
- Heidi Poxon (HPE)
- Pablo Reble (Intel)
- James Reinders (James Reinders Consulting LLC)
- Alison Richards (Intel)
- Andrew Richards (Codeplay)
- Roland Schulz (Intel)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Christian Trott (Sandia National Laboratory)
- Slides
- Namespaces
- oneapi:: vs one::
- Don't like 'one': too much chance for collision
- People will make jokes about 'one'
- 'one' has poor searchability
- People can make alias if they want something shorter
- Board recommends 'oneapi'
- Top level include directory
- one/dpl/ vs oneapi/dpl vs onedpl vs dpl
- Board recommends to follow the namespace structure: oneapi/dpl
- Can use symlinks/header that includes header for support old code
- oneDPL execution policy
- predefined execution policy
- Verbose: default_policy, cpu_policy, ...
- Concise: cpu, gpu, default. Namespace will make it unique. (A predefined policy is shown in the sketch at the end of this section.)
- Don't like 'pol'; prefer spelling it out; 'default' preferred over 'deflt'
- Generally concise is not preferred. Code is read more than written so it's better to be verbose.
- Like to distinguish between type and variable. Using C++17 std way with _v will make the distinction clear.
- What about policy_gpu?
- Not a big difference
- Short names are not that short because you would normally have namespace
- Sync vs Async
- Currently some algorithms block, some do not block
- Board would prefer option 'c'
- Standard API should be blocking
- Add an explicit async API for those implementations that need it
- For current implementation, move into namespace?
- No code out there now. Making it synchronous is a performance issue, not a correctness issue. Like async, but if the goal is to follow C++, then require all calls to block
- Range-based API for algorithms
- Allows concise expression of pipelines
- Did we miss algorithms? Please review list and provide feedback.
- Add ranges now, or as extension/experimental?
- Would be useful for graph library
- No disagreement about delaying making it part of spec
- Ok to have it implemented even though it's not part of spec. No experience in HPC community with using ranges so having it available would give people a chance to experiment.
- Extension APIs
- No discussion, see details in slide 14
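An illustrative sketch of the board's recommendations (oneapi::dpl namespace, oneapi/dpl include path, blocking algorithms), using spellings close to what oneDPL later shipped; the predefined policy name shown is an assumption relative to this discussion.

```cpp
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>
#include <sycl/sycl.hpp>

int main() {
    sycl::buffer<int> buf{sycl::range<1>{1000}};
    // Predefined device policy; the call blocks until the algorithm completes,
    // matching the preference for a blocking default API (option 'c').
    oneapi::dpl::fill(oneapi::dpl::execution::dpcpp_default,
                      oneapi::dpl::begin(buf), oneapi::dpl::end(buf), 42);
}
```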
Attendees:
- Bharat Agrawal (Ansys)
- David Beckingsale (Lawrence Livermore National Laboratory)
- James Brodman (Intel)
- Robert Cohn (Intel)
- Tom Deakin (University of Bristol)
- Hal Finkel (Argonne National Laboratory)
- Jeff Hammond (Intel)
- Mike Kinsner (Intel)
- Alexey Kukanov (Intel)
- Geoff Lowney (Intel)
- Antonio J. Peña (Barcelona Supercomputing Center)
- John Pennycook (Intel)
- Pablo Reble (Intel)
- James Reinders (James Reinders Consulting LLC)
- Ruyman Reyes (Codeplay)
- Andrew Richards
- Alison Richards (Intel)
- Gergana Slavova (Intel)
- Timmie Smith (Intel)
- Xinmin Tian (Intel)
- Phuong Vu (BP)
- Rules of the road
- Notes published immediately after the meeting on Github
- Email [email protected] or submit a github PR to add/remove name, add affiliation to attendees list
- Slides
- Recap
- STL API
- Parallel STL
- non-standard API extensions
- Required C++ version
- Minimum DPC++ version will be C++17
- Is it ok for oneDPL?
- Will limit host-side environment. Default is C++14 for latest host compilers
- Discussion:
- Where are Livermore compilers?
- C++11 is fine, RAJA is C++11-based, some customers not ready for C++14
- What is the issue?
- People running on systems where supported gcc version is old
- But not about the code
- Why is host compiler different?
- If we require only 14, can we still make deduction work smoothly? Yes.
- At Argonne, there is a range of conservatism, we should not
impose artificial barriers
- Provide C++17 features and ease of use when available, but there is value in being more conservative
- On the other hand, we don't want to create 2 dialects
- Top-level namespace
- DPC++ has multiple namespaces: sycl::, sycl::intel
- oneDPL adds a namespace
- Discussion
- Strictly standard could be nested, new things own namespace
- Requires change to sycl spec
- The standard allows use of the sycl::intel extension
- Recommend top-level oneapi namespace
- Can use C++ using to bring it into sycl::intel if desired
- Example: oneapi::mkl
- Standard library classes
- Issues
- Some classes cannot be fully supported
- 3 different implementations
- Options
- White-listed
- Freestanding implementation
- Duplicate, bring standard library into SYCL
- Spec says whether to require an implementation or to defer to the host
- Analysis of pro/cons, see slide
- Propose to go the combined route:
- Whitelist the things that 'just work'
- APIs that need substantial adjustments are defined in SYCL spec
- Freestanding for the rest
- Analysis, see slide
- Discussion
- Seems like a practical solution
- For freestanding, would there be conversions for standard types? Yes.
- Slide shows mapping, whitelisted, custom, SYCL
- Discussion
- Functional can't be whitelisted
- Not enough time for remaining topics, moved to next meeting
Attendees: David Beckingsale, James Brodman, Robert Cohn, Tom Deakin, Hal Finkel, Mike Kinsner, Alexey Kukanov, Erik Lindahl, Geoff Lowney, Antonio J. Peña, John Pennycook, Pablo Reble, James Reinders, Ruyman Reyes, Alison Richards, Roland Schulz, Timmie Smith, Xinmin Tian
- We will be publishing TAB presentations materials & notes with names on github. Please contact Robert.S.Cohn@intel if you have concerns. If you are a watcher on the repo, you will get email notification for meeting notes. Follow-up discussions can be in the form of github issues.
- Specification is available on oneapi.com. DPC++ spec contains the list of SYCL extensions with links to github docs describing them.
- oneAPI open source projects are moving to oneapi-src organization on github.
- Repo for oneAPI Specification sources is in same org. File issues if you have detailed feedback about the specifications.
- Slides
- Pointer-based memory management, complementary to SYCL buffers (see the sketch at the end of this discussion)
- What is the latency for pointer queries?
- Have not measured, but it requires calls into driver and is not lightweight
- Can it be accelerated with bit masks?
- Could it be made fast enough so free() could check?
- Are there any issues when using multiple GPUs?
- All pointers must be in same context
- Not likely to work if devices are not all from same vendor
- Peer-to-peer, GPU's directly accessing each other's memory, is being considered for inclusion in Level Zero spec, and might be added to DPC++ spec
- Non-restricted shared allocations should work fine
- What about atomics?
- We are trying to flesh out general details of atomics first, and will define USM characteristics after.
- OpenMP also uses the name USM; we need a document that compares/contrasts the capabilities
- Are operations that prefetch (ensure data is resident on a
specific device) placed in queues? What does 'done' mean?
- Investigating
- Are hints suggestions or hard rules?
- Device is free to define the behavior. Devices vary in their capability.
- Can you change the flavor of allocation? (shared, device, ..)
- No. What is the use case?
- Example: When we are limited by memory capacity, a library may want to change the allocation.
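A short sketch of the pointer-based (USM) model and the pointer query mentioned above, in SYCL 2020 / DPC++ spellings.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    float *shared = sycl::malloc_shared<float>(1024, q);  // host- and device-accessible
    float *device = sycl::malloc_device<float>(1024, q);  // device-only

    // The pointer query discussed above; it goes through the driver, which is
    // why its latency was raised as a concern.
    sycl::usm::alloc kind = sycl::get_pointer_type(shared, q.get_context());
    (void)kind;  // sycl::usm::alloc::shared

    q.parallel_for(sycl::range<1>{1024}, [=](sycl::id<1> i) {
        shared[i] = 2.0f * static_cast<float>(i[0]);
    }).wait();

    sycl::free(shared, q);
    sycl::free(device, q);
}
```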
- Follow-up from last meeting: John Pennycook
- Prototype implementation published as PR on github
- Addressed feedback on types for reductions: an assertion checks if you are accumulating in a type different from the initial type
- Minimum version of C++: James Brodman Slides
- Currently C++11, want to move to C++17
- Considered C++14 + key features
- Clang default is 14
- Follow-up from last meeting
- Review of group collectives
- Simplifying language for common patterns
Slides:
- Overview
- DPC++
- oneDPL
- What is oneAPI?
- oneAPI is a programming model for accelerators. It contains nine
elements, in four distinct groups:
- Language & its library
- oneAPI Data Parallel C++ (DPC++)
- oneAPI Data Parallel C++ Library (oneDPL)
- Deep Learning Libraries
- oneAPI Deep Neural Network Library (oneDNN)
- oneAPI Collective Communications Library (oneCCL)
- Domain-focused Libraries
- oneAPI Math Kernel Library (oneMKL)
- oneAPI Data Analytics Library (oneDAL)
- oneAPI Threading Building Blocks (oneTBB)
- oneAPI Video Processing Library (oneVPL)
- Hardware Interface Layer
- oneAPI Level Zero (Level Zero)
- What is the minimum base language for DPC++? Are newer standards
supported? Have you talked about changing the DPC++ baseline to C++
14?
- C++11 is the base language for DPC++; more modern versions of C++ can be used. Our goal is to carefully define interoperability with features from newer C++ standards so that implementations of DPC++ are consistent. (The Intel open source toolchain is based on trunk clang, so it is very modern.)
- For SYCL the minimum base language is ISO C++11 (in SYCL 1.2.1). C++11 features are used in the definition of language features. This allows tools to compile SYCL even if they only support C++11. Tools supporting newer C++ will compile code using newer C++ features, without issue.
- Changing the baseline to C++14 is something that will happen shortly as part of the SYCL specification. We expect to see a formal process and timeline defined that allows developers and implementers to reason about what the minimum version will be in future SYCL specifications. And again, be aware that this is the minimum version which a compiler must support because mandatory language features use aspects of that C++ version. Newer C++ can always be used if a toolchain supports it; all that you lose is guaranteed compatibility with other implementations that don’t support as modern a C++ version.
- Why is the base OpenCL version 1.2 instead of 2.0?
- OpenCL doesn’t have significant adoption beyond 1.2. The Khronos OpenCL working group is moving to a more flexible model, where only desired features beyond 1.x must be supported. We’re aligning with that direction and want DPC++ to be deployable on a wide base of OpenCL implementations (which is 1.2 today). DPC++ features such as USM have OpenCL extensions to enable key features from DPC++ to be available on top of all OpenCL versions, such as 1.2.
- The 0.5 specification has a table specifying which language features
are required on a device and which are optional, e.g.,
pipes/channels are required on FPGA and subgroups not required on
FPGA. How did you make this decision?
- Most features should be supported on all devices for functional portability, even if not performant. However, some language features are naturally IP specific.
- Pipes are an easy example. Pipes are designed for spatial architectures and require independent forward progress across kernels for many uses, a forward progress guarantee that we don’t want to impose on all devices. OpenCL 2.0 tried to make pipes usable on GPUs as well as FPGAs and ended up with a bloated feature that nobody uses because it can’t achieve performance anywhere, even on FPGA.
- Implementation effort is also a consideration. We don’t want to create large additional effort in DPC++ implementations for a feature on an IP where it is expensive to implement and will rarely be used. We see a balance between requiring implementation effort vs portability of a feature across all devices.
- Subgroups are not required on FPGA, because implementations typically do not vectorize across work-items. However, subgroups can be easily implemented with a subgroup size of 1. Would this be a useful change to the specification?
- Unified Shared Memory (USM): how does this work with OpenCL?
- We have published the appropriate extensions for OpenCL to enable USM. USM should be considered an alternative to (or a replacement for) the SVM features added to OpenCL 2.0, with USM being designed to be much more usable. Note our proposed OpenCL USM extension builds on top of even older OpenCL versions.
- Directed Acyclic Graphs (DAGs) buffers/accessors allow creation of
implicit DAG edges. However, this feature does not interact well
with C++ classes. Will DAGs independent of buffers be added, for
better C++ support/integration?
- The USM extension adds an explicit “depends on” mechanism, for DAG edge creation without buffers/accessors (see the sketch below). Please give us feedback if you want tweaks or different interfaces for specific use cases.
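A sketch of explicit DAG edges over USM using depends_on, the SYCL 2020 / DPC++ spelling of the "depends on" mechanism mentioned above.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    float *data = sycl::malloc_device<float>(1024, q);

    sycl::event produce = q.parallel_for(sycl::range<1>{1024},
                                         [=](sycl::id<1> i) { data[i] = 1.0f; });

    q.submit([&](sycl::handler &h) {
        h.depends_on(produce);  // explicit edge; no buffers/accessors required
        h.parallel_for(sycl::range<1>{1024},
                       [=](sycl::id<1> i) { data[i] *= 2.0f; });
    }).wait();

    sycl::free(data, q);
}
```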
- Will USM replace OpenSHMEM?
- No. USM is currently defined within a single node, whereas OpenSHMEM is a scale-out model for distributed memory. We believe OpenSHMEM and USM are independent and expect both to work together.
- In terms of the mental model for USM vs SYCL buffers, it is a bit like a PGAS language (e.g. UPC) vs MPI because USM supports load-store between different physical address spaces, whereas SYCL buffers are opaque objects; but one does not need to understand MPI or PGAS to program in SYCL.
- Do the USM allocator functions permit the definition of new allocators?
- Yes, it is possible to define your own memory allocation model. That is hidden in “…” in the slides - there is a C++ allocator interface. The USM extension defines a variety of mechanisms for allocation.
- Reductions
- Motivation. Reductions are foundational for parallel processing; users should not need to write out the details of their implementation. The compiler team should do a very good job of optimizing the reduction call based on target architecture. A bunch of physicists and chemists should not have to do this to run molecular dynamics. It needs to be provided in the language; most programmers will call SYCL reduce and be happy. The proposed DPC++ extension will be proposed to Khronos as an extension to the SYCL standard.
- Determinism. With floating point arithmetic, deterministic
reductions can be very expensive. We chose not to define
determinism or ordering in this version, but we would like to know
what specific requirements you have. We believe that both
non-deterministic and deterministic reductions have a place and
need to be enabled. We’ve started with non-deterministic because
they cover many uses and are much more performant on some
hardware.
- It is OK for default to be non-deterministic but also want the ability to set a runtime flag and have determinism if required. This should be set on a per reduction/per kernel-level, not globally.
- The specification shouldn’t over specify. In specific (not all) cases I want to have determinism.
- Hardware issues. On the Intel GPU, we have 3 levels of reduction:
EU level reduction, SLM level reduction, global reduction. We need
to be careful and think about how the language level reduction
will map to HW for both non-deterministic and deterministic
reduction.
- If you want this to be an industry specification you must be very careful: DON’T THINK OF INTEL HW; think of any possible hardware available.
- Compiler issues. How can the compiler support multiple devices
efficiently? You can have only one SYCL application. How can you
know it’s going to run on a FPGA or on what HW? How do you get it
to run best on the HW?
- Some flows create outputs for multiple targets, known at compile time. These implementations will be specialized. SPIR-V for generic targets requires a generic implementation, unless these primitives are defined through SPIR-V. The fat binary direct specialization flow is expected for performance. Should library calls for reduction be defined at the SPIR-V level?
- Parallel reduce or parallel_for. Don't like that you are doing parallel_for with a reduction clause… There is a reason that TBB has reduce. Why are you making a different choice? (The parallel_for-with-reduction form is sketched at the end of this discussion.)
- We are treating this in the same way as collectives: there are several collectives that operate on multiple work items that are running. Treat reduction as across the iteration space.
- Can we make a language distinction between loops with completely independent iterations and ones with some type of dependencies? How can we distinguish between the two? That would be useful. Then the reduction question becomes more salient: never call a synchronization across work groups.
- We should have a broadcast primitive. You want reduction plus broadcast.
- Standardization efforts work well when there is enough experience
and the effort can be focused on standardizing best practices. Are
we at this point or are there fundamental unresolved issues?
- Consider the MPI forum work. Everyone knew how to do proper message passing; it was just an issue of settling on an API.
- MPI2 RMA is not so good… don’t want that. I started doing an industry-wide study of data parallelism and went through TBB, Kokkos, RAJA, and then stumbled on SYCL. There are important questions, but with DPC++ we are at a similar level of experience to MPI1 message passing systems on supercomputers. This is meant to be iterative, not converge on one true solution immediately. These are mostly syntax debates, Kokkos vs RAJA syntax debates.
- This discussion is a core reason to have iteration with respect to DPC++ extensions.
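A sketch of the parallel_for-with-reduction form being debated, written with SYCL 2020 spellings; the DPC++ extension discussed here preceded them.

```cpp
#include <sycl/sycl.hpp>

int main() {
    sycl::queue q;
    int *sum = sycl::malloc_shared<int>(1, q);
    *sum = 0;
    q.parallel_for(sycl::range<1>{1024},
                   sycl::reduction(sum, sycl::plus<int>()),
                   [=](sycl::id<1> i, auto &acc) { acc += static_cast<int>(i[0]); })
     .wait();
    // *sum == 0 + 1 + ... + 1023; the combination order is unspecified, which
    // is exactly the determinism question raised above.
    sycl::free(sum, q);
}
```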
- Is the kernel argument restrict for USM pointers or buffers?
- Both.
- Optional Lambda naming
- Required lambda naming causes a variety of problems, particularly for libraries. The Intel open source DPC++ implementation has had optional lambda naming for a while now (see the sketch below).
- Lambda names are very useful for debugging and profiling. Give it a string as a profiling name. Names are optional, but still a type. Request for:
- Need to have a string-based name AND
- We should add the option to have string names on buffers - look at Kokkos as example
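A sketch of named vs unnamed kernel lambdas; the unnamed form relies on the optional-naming behavior described above (available in the open-source DPC++ compiler, historically behind -fsycl-unnamed-lambda).

```cpp
#include <sycl/sycl.hpp>

class scale_kernel;  // explicit kernel name, SYCL 1.2.1 style

int main() {
    sycl::queue q;
    q.submit([&](sycl::handler &h) {
        h.parallel_for<scale_kernel>(sycl::range<1>{64}, [=](sycl::id<1>) {});
    });
    q.submit([&](sycl::handler &h) {
        // With optional lambda naming, the template argument can simply be omitted.
        h.parallel_for(sycl::range<1>{64}, [=](sycl::id<1>) {});
    });
    q.wait();
}
```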
- Other implementations - How can you make this more attractive for your competitors to adopt? Some of us have spent years developing OpenCL code due to vendor-independence and portability. Will look to see if DPC++ gets adopted by other vendors.
- Codeplay has announced they will support DPC++ on top of Nvidia hardware. See article here.
- What is oneAPI? What is DPC++? What is SYCL?
- oneAPI is the programming model, consisting of a language, a set of libraries and an HW interface layer.
- DPC++ is the language, built on ISO C++ and Khronos SYCL and extensions.
- Some think of oneAPI as the platform, and DPC++ as the language built on C++ and SYCL. Most of the extensions that form DPC++ are being fed back into SYCL for consideration and hopefully inclusion in future standards.
- Really like what you are saying, however DPC++ could be perceived as
“pulling an OpenACC”. Why not just call it SYCL?
- We are aware of that possible misperception. We want to be very explicit about how this differs from the OpenACC-versus-OpenMP situation:
- We are not forking from SYCL, we are building on top of it.
- We are very explicit that DPC++ == ISO C++ and Khronos SYCL and Extensions
- We are discussing all our extensions openly with the SYCL committee.
- We are not forming another standards foundation/group.
- We are being very open, using permissive licensing and an open implementation
- The collective set of extensions does need a name.
- We are working with both Khronos SYCL and ISO C++ to put as many of these extensions into those standards as possible. That will take time and we will continue to work on it.
- We intend to make the codesign process with our customers much faster than is possible otherwise
- What does STL vector container mean in the context of accelerator?
- Ideally, we want to get the full STL working; however, as you note, we know there are challenges. For example, a parallel push on vector is problematic. We may allow some operations but not all.
- Need to worry about pointer, shared pointer, and container semantics.
- Push in a parallel context? A lot of C++ was not made for parallelism.
- Simple acts: pointers, iterators on top of that…
- Two high-level things:
- What do we expect to support for device-side memory allocation?
- Can I free it on the host or on the device? A lot of uses where we have code paths to do that (particle codes, etc.) But you don’t want to build something like vector push-back. You want to allocate in chunks. How do you build that in? What primitive do you want to provide in a parallel construct? Don’t pick the convenient thing to do… you are making a standard, so think about how you want this; be careful and offer what will work over time.
- Capturing objects in a lambda: does USM guarantee that you have a coherent connection between host/accelerator?
- No
- What about Python, Java, C#? Will those be part of the oneAPI effort
in the future?
- Our thought process is to focus on the lower levels of the stack and allow others to build on it. We do not want to push into higher levels of the stack; it is a large space. Instead, we want to offer an open specification, in open source, and provide infrastructure that others can build upon. Some examples:
- with our LLVM work, we hope to allow anybody to build additional languages that can easily be powered by oneAPI and add accelerator support. An LLVM-based language like Julia could easily leverage this work to support any oneAPI platform
- The hardware interface layer, Level Zero, could be used by any language if so desired.
- Level Zero could also be implemented by any HW vendor to leverage the entire oneAPI SW stack.
- We will plug oneDNN and oneCCL into deep learning frameworks. This could then enable any HW vendor to implement oneDNN and oneCCL to plug into all frameworks instead of building framework-specific interfaces
- We will plumb the oneAPI libraries into the Python ecosystem via numpy, scipy, scikit-learn, pandas interfaces.
- The Python numba compiler could leverage the LLVM infrastructure to enable accelerator support.
- USM vs buffers
- There are a few other reasons why buffers allow you to work out the memory model. Note OpenCL only gives you buffers. Buffers allow the accelerators to know what they need to work on. You may be able to create an accelerator that doesn’t use pointers but may use a DMA system.
- I can see why people want USM but mixing USM w/ buffers may not make sense. It may be better to use buffers with indices into arrays to build data structures.
- Data migration with USM
- Is there an interface that will allow you to do on-demand paging? Will you be able to adapt to where the data is? If it’s on the GPU, run on the GPU; if it is on the CPU, run on the CPU.
- C++ had no notion of this without NUMA.
- Other general comments
- Like that you are getting feedback on github.