Due to lack of time, I did not get the chance to explore more advanced optimization strategies:
CPU
Specialized: Exploit the fact that some parameters are not tunable via the CLI to make them known at compile time. Implementation strategy:
- Add an inline `Parameters::new()` fn that takes only the tunable parameters and adds defaults for the rest. Use it in the `Default` impl. Clarify in the docs that if new parameters are made tunable, the specialized CPU backend must change.
- Use this `Parameters::new()` function instead of struct update syntax in the simulation binaries.
- Create a specialized backend which only stores the tunable parameters, checks in `new()` that the input parameters are equal to the result of `Parameters::new()` called with those parameters, then calls `Parameters::new()` in `compute_step()` so the compiler knows about the other parameters, then forwards to autovec.
- Measure the difference in execution performance. If good, make this the new sequential backend of the parallel one.
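The strategy above can be sketched as follows. Note that the field names (`f`, `k`, diffusion rates, time step) and default values are assumptions for illustration, not the actual `Parameters` struct of the project:

```rust
/// Hypothetical parameter struct: field names and defaults are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Parameters {
    pub f: f32,           // feed rate, CLI-tunable
    pub k: f32,           // kill rate, CLI-tunable
    pub diffusion_u: f32, // not CLI-tunable => compile-time constant candidate
    pub diffusion_v: f32, // not CLI-tunable
    pub dt: f32,          // not CLI-tunable
}

impl Parameters {
    /// Takes only the CLI-tunable parameters, fills in defaults for the rest.
    /// If new parameters become tunable, the specialized backend must change.
    #[inline]
    pub fn new(f: f32, k: f32) -> Self {
        Self { f, k, diffusion_u: 0.1, diffusion_v: 0.05, dt: 1.0 }
    }
}

impl Default for Parameters {
    fn default() -> Self {
        Self::new(0.014, 0.054)
    }
}

/// Specialized backend which only stores the tunable subset.
pub struct SpecializedBackend { f: f32, k: f32 }

impl SpecializedBackend {
    /// Check that the input parameters match Parameters::new(f, k), i.e.
    /// that every non-tunable field still has its default value.
    pub fn new(params: &Parameters) -> Option<Self> {
        if *params == Parameters::new(params.f, params.k) {
            Some(Self { f: params.f, k: params.k })
        } else {
            None
        }
    }

    pub fn compute_step(&self) {
        // Rebuild the parameters here so the compiler can see the constant
        // fields and fold them into the stencil computation before
        // forwarding to the autovectorized implementation.
        let params = Parameters::new(self.f, self.k);
        let _ = params; // ...forward to autovec with `params`...
    }
}
```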
Manual instruction scheduling to improve ILP (unroll-and-jam and the like) => Document that this is not something I would normally do or recommend doing due to the big impact on code readability, but sometimes you gotta do what you gotta do to ace the benchmark... => Done in grayscott-with-rust
Try std::simd on nightly, which from other experiments should be as good as intrinsics yet as easy as slipstream. => Done in grayscott-with-rust
simd_naive : Make SIMD work on the naive data layout, with unaligned loads and edge handling. Compare to autovec. => Done in grayscott-with-rust
NUCA parallel executor optimizations for modern client chips (Zen 2, Alder Lake...) => Split the data following the hardware cache hierarchy and hand each chunk over to a pinned thread pool. If that static scheduling is not enough (it will probably be OK for Zen 2 but not for Alder Lake), implement dynamic load balancing where threads steal from their local work queue first, and from the work queues of increasingly remote NUCA domains as needed.
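The static-scheduling half of this idea can be sketched with scoped threads: split the grid into one chunk per cache domain and process each on its own thread. `NUM_DOMAINS` is an assumed topology (e.g. one chunk per Zen 2 CCX), and thread pinning is platform-specific so it is only noted in a comment here (a crate like `core_affinity` could provide it):

```rust
use std::thread;

/// Assumed number of last-level cache domains (e.g. CCXs on Zen 2).
const NUM_DOMAINS: usize = 4;

/// Stand-in for the real simulation update on one cache-domain chunk.
fn process_chunk(chunk: &mut [f32]) {
    for x in chunk.iter_mut() {
        *x *= 2.0;
    }
}

/// Split the data into one chunk per cache domain and process each on a
/// dedicated thread. In the real executor, each thread would be pinned to
/// the cores sharing that domain's cache (omitted: platform-specific).
fn parallel_step(data: &mut [f32]) {
    let chunk_len = (data.len() + NUM_DOMAINS - 1) / NUM_DOMAINS;
    thread::scope(|s| {
        for chunk in data.chunks_mut(chunk_len) {
            s.spawn(move || process_chunk(chunk));
        }
    });
}
```

The dynamic load-balancing variant would replace the one-chunk-per-thread split with per-domain work queues and hierarchy-aware stealing, which is beyond this sketch.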
NUMA memory allocation and parallel execution optimizations for modern multiprocessor machines.
GPU
Compare and contrast wgpu + WGSL, krnl + rust-gpu, CubeCL.
Storage image without input sampling (try and compare a bound-checked texelFetch or zero padding) => Done in grayscott-with-rust
Naive use of a storage buffer (pick between bound checking and padding depending on what worked best for storage images, share as much code as possible between ImageConcentration and the new BufferConcentration) => Should perform worse if images do their caching right => Done in grayscott-with-rust
Buffer with a manual shared memory cache. Implementation idea:
- Allocate a shared memory location that is sized like a workgroup plus a stencil-sized edge all around.
- Each thread loads the concentration corresponding to its position and stores it in shared memory.
- Threads on the border of the workgroup additionally load/store the edge values of the concentration that are closest to them.
- Then we compute quantities that don't depend on the stencil, like uvSquared, to feed ILP a little.
- Then we do a workgroup barrier to wait for the stencil data to be in.
- Then we do the stencil and dependent computations, and store the result to main memory.

Overall, done in grayscott-with-rust.
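The sizing logic behind this cache can be sketched as host-side Rust (the real thing lives in the compute shader); the workgroup size and stencil radius are assumptions:

```rust
/// Assumed workgroup shape and stencil radius (3x3 stencil => radius 1).
const WORKGROUP_X: usize = 8;
const WORKGROUP_Y: usize = 8;
const STENCIL_RADIUS: usize = 1;

/// The shared tile is the workgroup plus a stencil-sized halo all around.
const TILE_X: usize = WORKGROUP_X + 2 * STENCIL_RADIUS;
const TILE_Y: usize = WORKGROUP_Y + 2 * STENCIL_RADIUS;

/// Halo cells that border threads must load in addition to the one cell
/// each thread loads for its own position.
fn halo_cells() -> usize {
    TILE_X * TILE_Y - WORKGROUP_X * WORKGROUP_Y
}

/// Whether a thread at (x, y) in the workgroup sits on its border, and
/// therefore owes extra edge loads.
fn is_border_thread(x: usize, y: usize) -> bool {
    x == 0 || y == 0 || x == WORKGROUP_X - 1 || y == WORKGROUP_Y - 1
}
```

With an 8x8 workgroup and radius-1 stencil, the tile is 10x10, so the 28 border threads must cover 36 halo cells between them.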
Subgroups as a way to reduce shared memory accesses:
- Check GLSL extensions with "subgroup" in their name.
- Map subgroups to 2D tiles: 4x4 for 16 threads, 8x4 for 32 threads, 8x8 for 64 threads.
- Replace shared memory with subgroup shuffles where possible, keeping shared memory for data interchange between subgroup edges.
- Should test on various hardware, as the subgroup/shared memory tradeoff seems very HW-dependent.

Done in grayscott-with-rust.
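The lane-to-tile mapping above can be sketched as host-side Rust (in a shader this would be driven by the subgroup size and lane ID builtins):

```rust
/// 2D tile shape for a given subgroup size, matching the mapping above:
/// 4x4 for 16 threads, 8x4 for 32 threads, 8x8 for 64 threads.
fn tile_shape(subgroup_size: u32) -> (u32, u32) {
    match subgroup_size {
        16 => (4, 4),
        32 => (8, 4),
        64 => (8, 8),
        _ => panic!("unsupported subgroup size"),
    }
}

/// Map a lane index to its (x, y) position in the subgroup's 2D tile,
/// laying lanes out row-major so horizontally adjacent cells sit in
/// adjacent lanes (good for shuffle-based stencil exchange).
fn lane_to_tile_pos(lane: u32, subgroup_size: u32) -> (u32, u32) {
    let (width, _height) = tile_shape(subgroup_size);
    (lane % width, lane / width)
}
```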
Try to process N simulation steps in a single compute shader dispatch (this will be a lot harder, but has the potential to bring the greatest speedups): see MonoDispatch.pdf/MonoDispatch.odt => Tried for a while, but the amount of work is just enormous, didn't finish.
- Make debug stats cover all big branches.
- Before going pseudorandom, try a simple min of (distance, local index) u32-packed tuples and check debug stats to see how good/bad they are.
- If pseudorandom is truly needed, it can use among other inputs: the local workitem ID, the global workgroup ID, shader_clock if available, simulated data from the previous step...
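The u32-packed tuple min can be sketched as follows: putting the distance in the high bits makes an ordinary u32 min (e.g. an atomicMin or subgroupMin in the shader) pick the smallest distance, breaking ties by the smallest local index. The 16/16 bit split is an assumption:

```rust
/// Pack a (distance, local index) tuple into one u32, distance in the
/// high 16 bits, so that u32 min == lexicographic min of the tuple.
/// The 16/16 field split is an assumed width budget.
fn pack(distance: u32, local_index: u32) -> u32 {
    debug_assert!(distance < (1 << 16) && local_index < (1 << 16));
    (distance << 16) | local_index
}

/// Recover the (distance, local index) tuple from its packed form.
fn unpack(packed: u32) -> (u32, u32) {
    (packed >> 16, packed & 0xFFFF)
}
```

For example, `pack(3, 7).min(pack(3, 2))` unpacks to `(3, 2)`: equal distances, so the tie is broken by the smaller local index.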
Try to transpose the optimized SIMD layout from the CPU version to the GPU version.
ILP, std::simd and simd_naive were tried in https://gitlab.in2p3.fr/grasland/grayscott-with-rust, which only leaves NUMA/NUCA management as a future area of exploration. On the GPU side, I have tried unsampled images and buffers with and without a local memory cache.