
Add more compute backends #8

Open · 9 of 13 tasks

HadrienG2 opened this issue Jul 7, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

HadrienG2 (Owner) commented on Jul 7, 2023

Due to lack of time, I did not get the chance to explore more advanced optimization strategies:

  • CPU
    • Specialized: Exploit the fact that some parameters are not tunable via the CLI to make them known at compile time (sketched after these steps). Implementation strategy:
      1. Add a Parameters::new() inline fn that takes only the parameters that are tunable and fills in defaults for the rest. Use it in the Default impl. Clarify in the docs that if new parameters are made tunable, the specialized CPU backend must change.
      2. Use this Parameters::new() function instead of struct update syntax in simulation binaries.
      3. Create a specialized backend that only stores the tunable parameters, checks in new() that the input parameters equal the result of Parameters::new() called with those parameters, calls Parameters::new() in compute_step so that the compiler knows the other parameters, then forwards to autovec.
      4. Measure the difference in execution performance. If it is significant, make this the new sequential backend used by the parallel one.
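      A minimal sketch of steps 1-3, with hypothetical field names and defaults (the real Parameters struct differs; feed_rate/kill_rate/time_step stand in for the tunable subset here):

      ```rust
      // Hypothetical sketch: field names and default values are illustrative,
      // not the crate's actual Parameters API.
      #[derive(Clone, PartialEq)]
      pub struct Parameters {
          pub diffusion_rate_u: f32, // not CLI-tunable
          pub diffusion_rate_v: f32, // not CLI-tunable
          pub feed_rate: f32,        // CLI-tunable
          pub kill_rate: f32,        // CLI-tunable
          pub time_step: f32,        // CLI-tunable
      }

      impl Parameters {
          /// Builds a full parameter set from the CLI-tunable subset only.
          /// NOTE: if more parameters become tunable, the specialized CPU
          /// backend below must be updated accordingly.
          #[inline]
          pub fn new(feed_rate: f32, kill_rate: f32, time_step: f32) -> Self {
              Self {
                  diffusion_rate_u: 0.1,
                  diffusion_rate_v: 0.05,
                  feed_rate,
                  kill_rate,
                  time_step,
              }
          }
      }

      impl Default for Parameters {
          fn default() -> Self {
              Self::new(0.014, 0.054, 1.0)
          }
      }

      /// Specialized backend: only stores the CLI-tunable subset.
      pub struct Specialized {
          feed_rate: f32,
          kill_rate: f32,
          time_step: f32,
      }

      impl Specialized {
          pub fn new(params: &Parameters) -> Option<Self> {
              let result = Self {
                  feed_rate: params.feed_rate,
                  kill_rate: params.kill_rate,
                  time_step: params.time_step,
              };
              // Reject inputs where a supposedly fixed parameter was overridden
              let expected =
                  Parameters::new(result.feed_rate, result.kill_rate, result.time_step);
              (*params == expected).then_some(result)
          }

          pub fn compute_step(&self /*, ... */) {
              // Rebuild Parameters here so the compiler sees the non-tunable
              // fields as compile-time constants, then forward to autovec.
              let _params = Parameters::new(self.feed_rate, self.kill_rate, self.time_step);
              // autovec::compute_step(&_params, ...);
          }
      }
      ```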
    • Manual instruction scheduling to improve ILP (unroll-and-jam and the like; toy sketch below) => Document that this is not something I would normally do or recommend due to the big impact on code readability, but sometimes you gotta do what you gotta do to ace the benchmark... => Done in grayscott-with-rust
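      A toy illustration of the trick (not the simulation kernel): jamming four independent accumulators into the loop body shortens the serial dependency chain:

      ```rust
      // Toy unroll-and-jam example: four independent accumulators expose more
      // instruction-level parallelism than a single serial dependency chain.
      fn sum_ilp(data: &[f32]) -> f32 {
          let mut acc = [0.0f32; 4];
          let chunks = data.chunks_exact(4);
          let tail = chunks.remainder();
          for chunk in chunks {
              // Each acc[i] forms its own dependency chain, so the four adds
              // of one iteration can execute in parallel on the FP units.
              for (a, &x) in acc.iter_mut().zip(chunk) {
                  *a += x;
              }
          }
          acc.into_iter().sum::<f32>() + tail.iter().sum::<f32>()
      }
      ```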
    • Try std::simd on nightly, which based on other experiments should be as good as intrinsics yet as easy as slipstream (sketch below). => Done in grayscott-with-rust
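      For reference, the general shape of a nightly std::simd kernel (toy scaling loop, not the Gray-Scott update):

      ```rust
      #![feature(portable_simd)] // nightly-only
      use std::simd::Simd;

      /// Toy kernel: multiply every element by a factor, 8 lanes at a time.
      fn scale(data: &mut [f32], factor: f32) {
          let factor_v = Simd::<f32, 8>::splat(factor);
          let mut chunks = data.chunks_exact_mut(8);
          for chunk in &mut chunks {
              let v = Simd::<f32, 8>::from_slice(chunk) * factor_v;
              chunk.copy_from_slice(&v.to_array());
          }
          // Scalar tail for lengths that are not a multiple of 8
          for x in chunks.into_remainder() {
              *x *= factor;
          }
      }
      ```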
    • simd_naive: Make SIMD work on the naive data layout, with unaligned loads and edge handling (sketch below). Compare to autovec. => Done in grayscott-with-rust
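      The core idea, sketched on a 1D 3-point stencil (the real case is 2D with more careful border handling):

      ```rust
      #![feature(portable_simd)] // nightly-only
      use std::simd::Simd;

      type V = Simd<f32, 8>;

      /// 1D 3-point stencil on the naive layout: the left/right loads are
      /// unaligned, which modern CPUs handle at little to no cost.
      fn stencil_1d(input: &[f32], output: &mut [f32]) {
          assert_eq!(input.len(), output.len());
          let n = input.len();
          let mut i = 1;
          // SIMD body: three overlapping (hence mostly unaligned) loads per step
          while i + 9 <= n {
              let left = V::from_slice(&input[i - 1..]);
              let center = V::from_slice(&input[i..]);
              let right = V::from_slice(&input[i + 1..]);
              let laplacian = left + right - center * V::splat(2.0);
              output[i..i + 8].copy_from_slice(&laplacian.to_array());
              i += 8;
          }
          // Scalar edge handling for the remainder (borders stay untouched here)
          while i + 1 < n {
              output[i] = input[i - 1] + input[i + 1] - 2.0 * input[i];
              i += 1;
          }
      }
      ```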
    • NUCA parallel executor optimizations for modern client chips (Zen 2, Alder Lake...) => Split the data following the hardware cache hierarchy and hand each chunk over to a pinned thread pool (pinning sketch below). If that static scheduling is not enough (it will probably be OK for Zen 2 but not for Alder Lake), implement dynamic load balancing where threads steal from their local work queue first, then from the work queues of increasingly remote NUCA domains as needed.
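      A bare-bones sketch of the pinning half, assuming the core_affinity crate; querying the actual cache topology (e.g. via hwloc) is hand-waved into a cores_per_domain parameter:

      ```rust
      use std::thread;

      /// Run one worker per NUCA domain, each pinned to a core of that domain.
      /// Hand-waved: `cores_per_domain` would really come from the cache
      /// topology (e.g. one domain per Zen 2 CCX).
      fn run_per_domain(cores_per_domain: usize, work: impl Fn(usize) + Sync) {
          let core_ids = core_affinity::get_core_ids().expect("cannot query cores");
          thread::scope(|s| {
              for (domain, domain_cores) in
                  core_ids.chunks(cores_per_domain).enumerate()
              {
                  let first_core = domain_cores[0];
                  let work = &work;
                  s.spawn(move || {
                      // Pin so this worker's data chunk stays in the local cache
                      core_affinity::set_for_current(first_core);
                      work(domain);
                  });
              }
          });
      }
      ```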
    • NUMA memory allocation and parallel execution optimizations for modern multiprocessor machines.
  • GPU
    • Compare and contrast wgpu + WGSL, krnl + rust-gpu, CubeCL.
    • Storage image without input sampling (try and compare a bounds-checked texelFetch versus zero padding) => Done in grayscott-with-rust
    • Naive use of a storage buffer (pick between bounds checking and padding depending on what worked best for storage images, and share as much code as possible between ImageConcentration and the new BufferConcentration) => Should perform worse if images do their caching right => Done in grayscott-with-rust
    • Buffer with a manual shared memory cache. Implementation idea (indexing sketch after this list):
      • Allocate a shared memory location that's sized like a workgroup + stencil-sized edge all around.
      • Each thread loads the concentration corresponding to its position and stores it in shared memory.
      • Threads on the border of the workgroup additionally load/store edge values of the concentration that are closest to them.
      • Then we compute quantities that don't depend on the stencil, like uvSquared, to feed ILP a little.
      • Then we do a workgroup barrier to wait for stencil data to be in.
      • Then we do the stencil and dependent computations, and store the result to main memory.
      • Overall, done in grayscott-with-rust.
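      A CPU-side sketch of the tile indexing only (the real thing is a shader; also, instead of giving border threads the extra loads as described above, this variant uses a plain strided cooperative load, which covers the halo with simpler code):

      ```rust
      const GROUP: usize = 8;               // assumed 8x8 workgroup
      const HALO: usize = 1;                // 3x3 stencil => 1-cell halo
      const TILE: usize = GROUP + 2 * HALO; // 10x10 shared tile

      /// All (tile, global) coordinate pairs that the thread at local position
      /// (lx, ly), in the workgroup whose halo origin is (gx0, gy0), must load.
      /// Each thread covers its own cell plus strided extras for the halo.
      fn loads_for_thread(
          lx: usize,
          ly: usize,
          gx0: i64,
          gy0: i64,
      ) -> Vec<((usize, usize), (i64, i64))> {
          let mut loads = Vec::new();
          let mut ty = ly;
          while ty < TILE {
              let mut tx = lx;
              while tx < TILE {
                  // Global coordinates may fall outside the grid near the
                  // simulation borders; the shader would bounds-check or pad.
                  loads.push(((tx, ty), (gx0 + tx as i64, gy0 + ty as i64)));
                  tx += GROUP;
              }
              ty += GROUP;
          }
          loads
      }
      ```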
    • Subgroups as a way to reduce shared memory accesses
      • Check GLSL extensions with "subgroup" in their name.
      • Map subgroups to 2D tiles (mapping sketch after this list): 4x4 for 16 threads, 8x4 for 32 threads, 8x8 for 64 threads.
      • Replace shared memory with subgroup shuffles where possible, keep shared memory for data interchange between subgroup edges.
      • Should test on various hardware, the subgroup/shared tradeoff seems very HW-dependent.
      • Done in grayscott-with-rust.
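      Host-side sketch of the lane-to-tile mapping from the list above:

      ```rust
      /// Map a subgroup lane index to a 2D tile position: 4x4 tiles for
      /// 16-lane subgroups, 8x4 for 32 lanes, 8x8 for 64 lanes.
      fn lane_to_tile(lane: u32, subgroup_size: u32) -> (u32, u32) {
          let width = match subgroup_size {
              16 => 4,
              32 | 64 => 8,
              // Other subgroup sizes exist in the wild; fall back to a 1D row
              _ => subgroup_size,
          };
          (lane % width, lane / width)
      }
      ```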
    • Try to process N simulation steps in a single compute shader dispatch (this will be a lot harder, but has the potential to bring the greatest speedups): see MonoDispatch.pdf/MonoDispatch.odt => Tried for a while, but the amount of work is just enormous and I did not finish.
      • Make debug stats cover all big branches
      • Before going pseudorandom, try a simple min of (distance, local index) u32-packed tuples (packing sketch after this list) and check the debug stats to see how good/bad they are.
      • If pseudorandom is truly needed, it can use, among other inputs: the local workitem id, the global workgroup id, shader_clock if available, simulated data from the previous step...
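      The packed-tuple min from the second bullet, as a host-side sketch: the distance goes into the high bits, so a plain u32 min() orders by distance first and breaks ties by the lowest local index (assumes both values fit in 16 bits):

      ```rust
      /// Pack (distance, local_index) so that u32::min on packed values yields
      /// the lexicographic minimum: smallest distance first, lowest index on ties.
      fn pack(distance: u32, local_index: u32) -> u32 {
          debug_assert!(distance < (1 << 16) && local_index < (1 << 16));
          (distance << 16) | local_index
      }

      fn unpack(packed: u32) -> (u32, u32) {
          (packed >> 16, packed & 0xFFFF)
      }
      ```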
    • Try to transpose the optimized SIMD layout from the CPU version to the GPU version.
HadrienG2 added the enhancement label on Jan 25, 2024
HadrienG2 (Owner) commented:

ILP, std::simd and simd_naive were tried in https://gitlab.in2p3.fr/grasland/grayscott-with-rust, which leaves only NUMA/NUCA management as a future area of exploration on the CPU side. On the GPU side, I have tried unsampled images and buffers, with and without a local memory cache.
