Prefix sum implementation WIP #14

raphlinus · 2021-11-04T00:33:55Z

This is a draft that I'm still working on. I'll likely create another subdirectory for tests and move this there, since overwriting the main hello example is not good form; I'm only doing it here for expedience.

The version at tip of tree as I write this (87e5b20) works well on AMD 5700 XT. In fact, it works very well - I'm seeing 36.4 billion elements/s, which is excellent. It's within a sliver of a compute shader that just copies input to output, and looking at GPU counters suggests that memory bandwidth is pretty well saturated.

This version also makes some progress on each spin, so it does not depend on strong forward progress guarantees from the GPU.
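To illustrate why making progress while waiting removes the dependence on forward progress, here is a minimal single-threaded sketch in Rust of a look-back that never blocks. It is not the shader from this PR: `Flag`, `PART_SIZE`, and `exclusive_prefix` are hypothetical names, and the fallback recomputes a whole predecessor aggregate in one step, whereas a real shader would presumably spread that work over several spins.

```rust
/// Hypothetical partition size, chosen only for illustration.
const PART_SIZE: usize = 4;

/// State a partition has published so far (assumed three-state scheme).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Flag {
    /// Nothing published yet.
    NotReady,
    /// Sum of this partition's own elements.
    Aggregate(u32),
    /// Inclusive prefix: sum of everything up to and including this partition.
    Prefix(u32),
}

/// Exclusive prefix for partition `part`, walking back over its predecessors.
/// When a predecessor has published nothing, the sum is recomputed from the
/// input instead of waiting, so the walk always terminates.
fn exclusive_prefix(part: usize, flags: &[Flag], input: &[u32]) -> u32 {
    let mut acc = 0u32;
    for i in (0..part).rev() {
        match flags[i] {
            // A full prefix ends the look-back.
            Flag::Prefix(p) => return acc.wrapping_add(p),
            // An aggregate is added and the walk continues.
            Flag::Aggregate(a) => acc = acc.wrapping_add(a),
            // Fallback: do the predecessor's reduction ourselves.
            Flag::NotReady => {
                let start = i * PART_SIZE;
                let agg = input[start..start + PART_SIZE]
                    .iter()
                    .fold(0u32, |s, &x| s.wrapping_add(x));
                acc = acc.wrapping_add(agg);
            }
        }
    }
    acc
}
```

Whatever mix of flags the predecessors have published, the result equals the sum of all input elements before the partition, which is the property the look-back has to preserve.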

That said, I am employing the atomicOr workaround for the atomic bugs I'm seeing; without it I get both incorrect results and hangs (try N_DATA = 1 << 17 for a nice mix of the two). I will probably work on a simplified test that exercises the atomic problems without bringing in all of the complexity of a full prefix sum.

Do a small sequential scan at the leaf of the hierarchy. That amortizes both the workgroup-scope tree reduction and the (still sequential) decoupled look-back to a larger number of inputs. Note: this falls short of a real performance evaluation because there's no attempt to warm up the GPU clock. But it's valid as a very rough swag.
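The structure described above can be sketched on the CPU as follows. This is an illustration in Rust, not the WGSL from this PR: `N_SEQ` and `chunked_inclusive_scan` are hypothetical names, and the middle step is a plain loop standing in for the workgroup tree reduction and look-back.

```rust
/// Hypothetical number of elements each simulated thread scans sequentially.
const N_SEQ: usize = 4;

/// Inclusive prefix sum in which every "thread" first scans a small chunk.
fn chunked_inclusive_scan(input: &[u32]) -> Vec<u32> {
    let mut out = input.to_vec();

    // Step 1: small sequential scan inside each chunk (the leaf of the hierarchy).
    for chunk in out.chunks_mut(N_SEQ) {
        for i in 1..chunk.len() {
            chunk[i] = chunk[i].wrapping_add(chunk[i - 1]);
        }
    }

    // Step 2: exclusive scan over the per-chunk totals. The more expensive
    // cross-thread machinery only has to handle one value per chunk here,
    // which is where the amortization comes from.
    let mut offsets = Vec::with_capacity(out.len() / N_SEQ + 1);
    let mut running = 0u32;
    for chunk in out.chunks(N_SEQ) {
        offsets.push(running);
        running = running.wrapping_add(*chunk.last().unwrap());
    }

    // Step 3: add each chunk's offset to its locally scanned values.
    for (chunk, &offset) in out.chunks_mut(N_SEQ).zip(&offsets) {
        for x in chunk.iter_mut() {
            *x = x.wrapping_add(offset);
        }
    }
    out
}
```

With `N_SEQ = 4`, step 2 touches a quarter as many values as the input holds; a larger `N_SEQ` shrinks that further at the cost of more sequential work per thread.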

raphlinus added 8 commits October 26, 2021 20:30

First try at prefix sum

274d151

Sorta works but deadlocks on larger inputs.

Make storage barriers uniform control flow

2800b73

Still doesn't fix deadlocks tho :/

Verify results

5fc557c

Still WIP

Larger workgroup

17094ab

Fastest results on AMD at workgroup = 1024. Note, this has atomicOr workaround for correctness. Also note, not all targets will support a workgroup of this size; on shipping, we'd need to query and select at runtime.
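As a rough sketch of the runtime selection mentioned above, the helper below picks the largest power-of-two workgroup size that fits under a device limit. `PREFERRED_WG_SIZE` and `pick_workgroup_size` are hypothetical names, and `max_invocations` stands for whatever limit the API reports (in WebGPU terms, something like `maxComputeInvocationsPerWorkgroup`); how the chosen size is fed back into the shader is left out.

```rust
/// Preferred size, taken from the AMD measurement in this commit.
const PREFERRED_WG_SIZE: u32 = 1024;

/// Largest power of two that exceeds neither the preferred size nor the
/// device limit. `max_invocations` is an assumed input, queried elsewhere.
fn pick_workgroup_size(max_invocations: u32) -> u32 {
    // Clamp to at least 1 so the rounding below is always defined.
    let cap = PREFERRED_WG_SIZE.min(max_invocations).max(1);
    // Round down to a power of two by keeping only the highest set bit.
    1 << (31 - cap.leading_zeros())
}
```

For example, `pick_workgroup_size(256)` yields 256, while a device reporting 768 would get 512.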

Iterate runs

93ad8ee

Better for performance analysis

Go fast

697ea4e

Performance measurement requires keeping the GPU busy. That means not copying results back to CPU and doing verification there.

Use explicit atomic stores

87e5b20

Naga will accept ordinary loads and stores to atomic types, but tint will not.
