Prefix sum implementation WIP #14

Draft
wants to merge 8 commits into
base: main

raphlinus
Contributor

This is a draft, I'm still working on it. I'll likely create another subdirectory for tests and add this to that, as overwriting the main hello example is not very good form. But I'm doing it for expedience.

The version at tip of tree as I write this (87e5b20) works well on AMD 5700 XT. In fact, it works very well - I'm seeing 36.4 billion elements/s, which is excellent. It's within a sliver of a compute shader that just copies input to output, and looking at GPU counters suggests that memory bandwidth is pretty well saturated.
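To put the throughput figure in bandwidth terms, here is a back-of-the-envelope sketch. It assumes 4-byte elements and that each element is read once and written once; neither is stated above, and the function name is just for illustration.

```python
def effective_bandwidth_gb_s(elements_per_s, bytes_per_element=4):
    """Approximate memory traffic in GB/s for a single-pass scan.

    Assumption (not from the PR): every element is read once and written
    once, and elements are bytes_per_element bytes wide.
    """
    bytes_read = elements_per_s * bytes_per_element
    bytes_written = elements_per_s * bytes_per_element
    return (bytes_read + bytes_written) / 1e9


# 36.4 billion elements/s under the assumptions above
bandwidth = effective_bandwidth_gb_s(36.4e9)
```

Under those assumptions the figure works out to roughly 291 GB/s of combined read and write traffic, which is the number to compare against the plain copy shader.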

This version also makes some progress on each spin, so does not depend on strong forward progress guarantees from the GPU.
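A CPU-side sketch of the idea, not the actual shader: partitions are processed one at a time in an arbitrary order, and when the look-back reaches a predecessor that has published nothing, it sums that predecessor's input itself instead of waiting. The flag names, the function name, and the choice to recompute a whole partition in one step are assumptions for illustration; the real code only needs to do a bounded amount of such work per spin.

```python
NOT_READY, AGGREGATE_READY, PREFIX_READY = 0, 1, 2


def scan_with_lookback(data, part_size, order):
    """Inclusive prefix sum where partitions may run in any order.

    `order` is a permutation of the partition indices, standing in for
    whatever order the GPU happens to schedule workgroups.
    """
    n_parts = (len(data) + part_size - 1) // part_size
    flag = [NOT_READY] * n_parts
    aggregate = [0] * n_parts          # sum of one partition's inputs
    inclusive_prefix = [0] * n_parts   # sum of everything up to and including it
    out = [0] * len(data)

    def part_slice(p):
        return data[p * part_size:(p + 1) * part_size]

    for p in order:
        local = part_slice(p)
        aggregate[p] = sum(local)
        flag[p] = AGGREGATE_READY

        # Look back over predecessors, nearest first.
        exclusive = 0
        look = p - 1
        while look >= 0:
            if flag[look] == PREFIX_READY:
                exclusive += inclusive_prefix[look]
                break
            if flag[look] == AGGREGATE_READY:
                # Never observed in this sequential simulation; kept to
                # mirror the concurrent protocol.
                exclusive += aggregate[look]
            else:
                # Fallback: do the predecessor's reduction ourselves
                # rather than spinning until it publishes.
                exclusive += sum(part_slice(look))
            look -= 1

        inclusive_prefix[p] = exclusive + aggregate[p]
        flag[p] = PREFIX_READY

        running = exclusive
        for i, x in enumerate(local):
            running += x
            out[p * part_size + i] = running
    return out
```

Because no partition ever waits on another, the result is the same for any `order`, including fully reversed, which is the property that removes the dependence on forward progress guarantees.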

That said, I am employing the atomicOr workaround for the atomic bugs I'm seeing, otherwise I get both incorrect results and hangs (try N_DATA = 1 << 17 for a nice mix of the two). I will probably work on a simplified version of the test to exercise the atomic problems without bringing in all of the complexity of full prefix sum.

Sorta works but deadlocks on larger inputs.
Still doesn't fix deadlocks tho :/
Still WIP
Fastest results on AMD at workgroup = 1024. Note, this has atomicOr
workaround for correctness.

Also note, not all targets will support a workgroup of this size; on
shipping, we'd need to query and select at runtime.
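A minimal sketch of that runtime selection, assuming the application has already read the device's per-workgroup invocation limit (for instance WebGPU's `maxComputeInvocationsPerWorkgroup`). The function name, the `preferred` default of 1024, and the `minimum` floor of 64 are hypothetical choices, not part of this PR.

```python
def select_workgroup_size(max_invocations, preferred=1024, minimum=64):
    """Pick the largest power-of-two workgroup size the device allows.

    Starts at `preferred` (assumed to be a power of two) and halves until
    the size fits under `max_invocations`, stopping at `minimum`.
    """
    size = preferred
    while size > max_invocations and size > minimum:
        size //= 2
    if size > max_invocations:
        raise ValueError(
            f"device limit {max_invocations} is below the minimum size {minimum}"
        )
    return size
```

The chosen size would then have to be fed into the shader, for example through a specialization constant or by generating the shader source, since the workgroup size is fixed at pipeline creation.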
Do a small sequential scan at the leaf of the hierarchy. That amortizes
both the workgroup-scope tree reduction and the (still sequential)
decoupled look-back to a larger number of inputs.

Note: this falls short of a real performance evaluation because there's
no attempt to warm up the GPU clock. But it's valid as a very rough
swag.
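The leaf-level sequential scan from this commit can be sketched on the CPU as follows. Each simulated thread scans its own few elements, the per-thread totals are scanned across the workgroup, and each thread then adds the total of the threads before it. The cross-thread step here is a simple doubling (Hillis-Steele style) scan used as a stand-in; the shader's actual tree reduction may be structured differently, and all names are illustrative.

```python
from itertools import accumulate


def workgroup_scan(data, n_threads, per_thread):
    """Inclusive scan of one workgroup tile of n_threads * per_thread elements."""
    assert len(data) == n_threads * per_thread

    # Step 1: each thread sequentially scans its own per_thread elements.
    chunks = [
        list(accumulate(data[t * per_thread:(t + 1) * per_thread]))
        for t in range(n_threads)
    ]
    totals = [chunk[-1] for chunk in chunks]

    # Step 2: inclusive scan of the per-thread totals across the workgroup,
    # taking about log2(n_threads) rounds instead of log2 of the element count.
    offset = 1
    while offset < n_threads:
        totals = [
            totals[t] + (totals[t - offset] if t >= offset else 0)
            for t in range(n_threads)
        ]
        offset *= 2

    # Step 3: each thread adds the inclusive total of all preceding threads.
    out = []
    for t, chunk in enumerate(chunks):
        base = totals[t - 1] if t > 0 else 0
        out.extend(base + x for x in chunk)
    return out
```

With `per_thread` elements per thread, the cross-thread scan and the look-back each run once per `n_threads * per_thread` inputs rather than once per `n_threads`, which is the amortization the commit message describes.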
Better for performance analysis
Performance measurement requires keeping the GPU busy. That means not
copying results back to CPU and doing verification there.
Naga will accept ordinary loads and stores to atomic types, but tint
will not.