This is a draft; I'm still working on it. I'll likely create another subdirectory for tests and move this there, since overwriting the main hello example is not very good form, but I'm doing it for expedience.
The version at tip of tree as I write this (87e5b20) works well on an AMD 5700 XT. In fact, it works very well: I'm seeing 36.4 billion elements/s, which is excellent. That's within a sliver of a compute shader that just copies input to output, and the GPU counters suggest that memory bandwidth is pretty well saturated.
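To put the throughput figure in bandwidth terms, here is a rough back-of-the-envelope sketch. It assumes 4-byte elements and exactly one read plus one write per element (both are assumptions, not something stated above), and the function name is made up for illustration.

```rust
/// Rough effective memory traffic for a pass that reads each element once
/// and writes it once. Assumption: no other memory traffic is counted.
/// Returns gigabytes per second (1 GB = 1e9 bytes).
pub fn effective_bandwidth_gb_s(elements_per_sec: f64, bytes_per_element: u32) -> f64 {
    // One read and one write per element, hence the factor of 2.
    elements_per_sec * bytes_per_element as f64 * 2.0 / 1e9
}
```

Under those assumptions, `effective_bandwidth_gb_s(36.4e9, 4)` comes out to roughly 291 GB/s of combined read and write traffic.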
This version also makes some progress on each spin, so it does not depend on strong forward progress guarantees from the GPU.
That said, I am employing the atomicOr workaround for the atomic bugs I'm seeing; without it I get both incorrect results and hangs (try N_DATA = 1 << 17 for a nice mix of the two). I will probably work on a simplified version of the test to exercise the atomic problems without bringing in all of the complexity of the full prefix sum.
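For readers unfamiliar with the idea, one common shape of this kind of workaround is to read a flag with an atomic OR against zero instead of a plain load, so the read goes through the atomic read-modify-write path. Whether that is exactly what this PR does is an assumption on my part. Below is a CPU-side analogue in Rust, not the shader code; the flag constants and the function name are hypothetical.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical flag states for a partition in a look-back style prefix sum.
pub const FLAG_NOT_READY: u32 = 0;
pub const FLAG_AGGREGATE_READY: u32 = 1;
pub const FLAG_PREFIX_READY: u32 = 2;

/// Read the flag by OR-ing it with zero. The stored value is unchanged,
/// and the previous value is returned, so this behaves like a load that
/// is performed as an atomic read-modify-write.
pub fn load_flag_via_or(flag: &AtomicU32) -> u32 {
    flag.fetch_or(0, Ordering::Acquire)
}
```

A caller polling a predecessor partition would compare the result against `FLAG_AGGREGATE_READY` or `FLAG_PREFIX_READY` rather than reading the flag with an ordinary load.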