RFC for histogram CPU implementation #1930

Open
wants to merge 34 commits into
base: main
Changes from 12 commits
Commits
a03577e
initial rough commit
danhoeflinger Oct 30, 2024
ce117f5
minor improvements
danhoeflinger Oct 30, 2024
ccc001e
revision
danhoeflinger Nov 1, 2024
d518a14
Formatting, minor
danhoeflinger Nov 1, 2024
6e03468
spelling and grammar
danhoeflinger Nov 1, 2024
10c4e50
Minor improvements
danhoeflinger Nov 1, 2024
efa7c9b
subsection
danhoeflinger Nov 1, 2024
1ac82fd
Adding some alternative approaches
danhoeflinger Nov 1, 2024
02523c4
minor improvements
danhoeflinger Nov 1, 2024
ac7b654
line widths
danhoeflinger Nov 4, 2024
506fb62
fixing numbering.
danhoeflinger Nov 6, 2024
1c6cb47
putting in specifics for TBB / OpenMP
danhoeflinger Nov 6, 2024
ceee3e3
Update Atomic strategy
danhoeflinger Nov 12, 2024
0711090
more clarity about serial backend and policy
danhoeflinger Nov 12, 2024
3c5ad12
minor corrections
danhoeflinger Nov 12, 2024
06a734f
c++17 -> c++20 fix
danhoeflinger Nov 13, 2024
b858a0e
Updates after some experimentation and thought
danhoeflinger Dec 16, 2024
53f4643
improvements from feedback
danhoeflinger Dec 20, 2024
d718e0e
thread enumerable storage +
danhoeflinger Dec 20, 2024
bb9e6f9
remove general language keep specifics to histogram
danhoeflinger Dec 20, 2024
17e0510
SIMD naming
danhoeflinger Dec 20, 2024
9614209
spelling
danhoeflinger Dec 20, 2024
2964a9e
clarifying thread enumerable storage
danhoeflinger Dec 20, 2024
9287fd2
minor improvements
danhoeflinger Dec 30, 2024
cdf5092
spelling
danhoeflinger Dec 30, 2024
215c2b7
adding link to implementation
danhoeflinger Dec 30, 2024
04d5127
rename to __enumerable_thread_local_storage
danhoeflinger Jan 15, 2025
fe1efa2
Added sections on complexity
danhoeflinger Jan 15, 2025
60ec0e5
spelling
danhoeflinger Jan 15, 2025
54e16b6
wording adjustments
danhoeflinger Jan 15, 2025
77435a3
minor formatting
danhoeflinger Jan 15, 2025
52bab0d
describe fall back to serial implementation
danhoeflinger Jan 16, 2025
b25411b
rename rfc directory
danhoeflinger Jan 16, 2025
5d23f2a
adding discussion of input sizes
danhoeflinger Jan 16, 2025
150 changes: 150 additions & 0 deletions rfcs/proposed/host_backend_histogram/README.md
@@ -0,0 +1,150 @@
# Host Backends Support for the Histogram APIs

## Introduction
In version 2022.6.0, two `histogram` APIs were added to oneDPL, but implementations were only provided for device
policies with the dpcpp backend. `Histogram` was added to the oneAPI specification 1.4 provisional release and should
be present in the 1.4 specification. Please see the
[oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms)
for a full definition of the semantics of the histogram APIs. In short, they take elements from an input sequence and
classify them into either evenly distributed or user-defined bins via a list of separating values and count the number
of values in each bin, writing to a user-provided output histogram sequence. Currently, `histogram` is not supported
with serial, tbb, or openmp backends in our oneDPL implementation. This RFC aims to propose the implementation of
`histogram` for these host-side backends. The serial implementation is straightforward, so we will add it without
discussing it at length in this RFC.
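
As a rough illustration of the evenly divided case described above, here is a minimal serial sketch. The function name, the `[lo, hi)` range parameters, and the choice to skip out-of-range values are assumptions made for this example; this is not the oneDPL API.

```cpp
#include <cstddef>
#include <vector>

// Serial sketch: count values into `num_bins` equal-width bins covering [lo, hi).
// Values outside [lo, hi) are skipped (an assumption of this sketch).
std::vector<std::size_t> histogram_even_serial(const std::vector<double>& input,
                                               std::size_t num_bins, double lo, double hi) {
    std::vector<std::size_t> bins(num_bins, 0);
    if (num_bins == 0) return bins;
    const double width = (hi - lo) / static_cast<double>(num_bins);
    for (double x : input) {
        if (x < lo || x >= hi) continue;
        std::size_t b = static_cast<std::size_t>((x - lo) / width);
        if (b >= num_bins) b = num_bins - 1;  // guard against floating-point rounding
        ++bins[b];
    }
    return bins;
}
```

Each input element is read once and causes one increment, which is why the algorithm tends to be limited by memory traffic rather than computation.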

Contributor @MikeDvorskiy, Dec 19, 2024:

let me share my thoughts:
In my understanding, an RFC is not a book... So, I would prefer to have a short, concise, and precise description of what is offered, without frills, like a mathematical theorem. For example:

"The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These APIs are defined in the oneAPI Specification 1.4. Please see the
oneAPI Specification for the details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram support to these backends."

Contributor Author:

Yes, I've accepted your language here. Thanks.

## Motivations
Users don't always want to use device policies and accelerators to run their code. It may make more sense in many cases
to use a serial implementation or a host-side parallel implementation of `histogram`. It's natural for a user to expect
that oneDPL supports these other backends for all APIs. Another motivation for adding the support is simply to be spec
compliant with the oneAPI specification.

Contributor:

Since this is not storytelling, I would suggest omitting introductory expressions like "It may make more sense" or "It's natural for a user to expect"... Only short and exact information.

For example,
"There are many cases for using a host-side serial or a host-side parallel implementation of histogram. Another motivation for adding the support is simply to be spec compliant with the oneAPI specification."

Contributor Author:

taken suggestion. Thanks

## Design Considerations

### Key Requirements
Provide support for the `histogram` APIs with the following policies and backends:
- Policies: `seq`, `unseq`, `par`, `par_unseq`
- Backends: `serial`, `tbb`, `openmp`

Users have a choice of execution policies when calling oneDPL APIs. They also have a number of options of backends
which they can select from when using oneDPL. It is important that all combinations of these options have support for
the `histogram` APIs.

### Performance
As with all algorithms in oneDPL, our goal is to make them as performant as possible. By definition, `histogram` is a
low computation algorithm which will likely be limited by memory bandwidth, especially for the evenly-divided case.
Minimizing and optimizing memory accesses, as well as limiting unnecessary memory traffic of temporaries, will likely
have a high impact on overall performance.

Contributor:

Taking into account the thoughts I shared above, I would propose rephrasing it to keep the main point shorter:

"A histogram algorithm is a memory-bound algorithm. So, the implementation should take care to reduce memory accesses and minimize temporary memory traffic."

Contributor Author:

Taken mostly. Thanks

### Memory Footprint
There are no guidelines here from the standard library as this is an extension API. However, we should always try to
minimize memory footprint whenever possible. Minimizing memory footprint may also help us improve performance here
because, as mentioned above, this will very likely be a memory bandwidth-bound API. In general, the normal case for
histogram is for the number of elements in the input sequence to be far greater than the number of output histogram
bins. We may be able to use that to our advantage.

### Code Reuse

Contributor:

I guess we can omit this topic altogether. It says nothing about `histogram`; it is just general wording which could be applied to any new feature in oneDPL...

Contributor Author:

I've removed some of the general language and added something which is important for histogram in an attempt to answer feedback from @akukanov to clarify where the implementation of the algorithm will live.

Our goal here is to make something maintainable and to reuse as much as we can which already exists and has been
reviewed within oneDPL. As with everything else, this must be balanced with performance considerations.

### unseq Backend

Contributor @MikeDvorskiy, Dec 20, 2024:

"unseq Backend"
Basically, we don't officially have such a backend. Yes, sometimes we use this term in internal communication as a "name" for a set of functions with a "pragma simd" implementation. But we did not specify and publish an API for that. So, I suggest renaming this topic to "SIMD/OpenMP SIMD Implementation", for example.

Contributor @akukanov, Dec 20, 2024:

I think this part is about developing (or not) an implementation for unsequenced policies.
I do not mind calling it "unseq backend" in the design docs, but Mikhail is correct that it's rather informal (while parallel backend is somewhat more formal).

Contributor Author:

I like the proposed name for the section better anyway for what is discussed. Thanks.

As mentioned above, histogram looks to be a memory bandwidth-dependent algorithm. This may limit the benefit achievable
from vector instructions as they provide assistance mostly in speeding up computation. Vector operations in this case
also compound our issue of race conditions, multiplying the number of concurrent lines of execution by the vector
length. The advantage we get from vectorization of the increment operation or the lookup into the output histogram may
not provide much benefit, especially when we account for the extra memory footprint required or synchronization
required to overcome the race conditions which we add from the additional concurrent streams of execution. It may make
sense to decline to add vectorized operations within histogram depending on the implementation used, and based on
performance results.

## Existing Patterns

Contributor:

If we intend to give some information about the oneDPL parallel backend patterns on which histogram can be based, I would note that there is no "count_if" pattern; there is a "reduce" ("transform_reduce") pattern.
When one says "reduce", it becomes more or less obvious that a histogram calculation based on reduce is not effective at all.

Contributor Author:

I clarified the language a little here to make it clearer that `count_if` uses reduce internally. I still think it deserves some text describing it, as it may not be immediately obvious to everyone that reduce is not well matched.

### count_if
`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a
sequence. `count_if` returns a scalar-typed value and doesn't provide any function to modify the variable being
incremented. Using `count_if` without significant modification would require us to loop through the entire sequence for
each output bin in the histogram. From a memory bandwidth perspective, this is untenable. Similarly, using a
`histogram` pattern to implement `count_if` is unlikely to provide a well-performing result in the end, as contention
should be far higher, and `reduce` is a very well-matched pattern performance-wise.
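
To make the memory-traffic argument concrete, the following sketch contrasts the two shapes for user-defined bins. It uses standard library calls only; the function names and the convention that bin `b` covers `[boundaries[b], boundaries[b + 1])` are assumptions for illustration, not oneDPL internals.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Multi-pass: one std::count_if traversal of the whole input per bin.
// `boundaries` must be sorted; bin b covers [boundaries[b], boundaries[b + 1]).
std::vector<std::size_t> histogram_via_count_if(const std::vector<int>& input,
                                                const std::vector<int>& boundaries) {
    const std::size_t num_bins = boundaries.size() < 2 ? 0 : boundaries.size() - 1;
    std::vector<std::size_t> bins(num_bins, 0);
    for (std::size_t b = 0; b < num_bins; ++b) {
        const int lo = boundaries[b];
        const int hi = boundaries[b + 1];
        bins[b] = static_cast<std::size_t>(std::count_if(
            input.begin(), input.end(), [lo, hi](int x) { return lo <= x && x < hi; }));
    }
    return bins;
}

// Single pass: each element is read once and its bin is found by binary search.
std::vector<std::size_t> histogram_single_pass(const std::vector<int>& input,
                                               const std::vector<int>& boundaries) {
    const std::size_t num_bins = boundaries.size() < 2 ? 0 : boundaries.size() - 1;
    std::vector<std::size_t> bins(num_bins, 0);
    if (num_bins == 0) return bins;
    for (int x : input) {
        if (x < boundaries.front() || x >= boundaries.back()) continue;
        // upper_bound finds the first boundary greater than x; the bin is one before it.
        auto it = std::upper_bound(boundaries.begin(), boundaries.end(), x);
        ++bins[static_cast<std::size_t>(it - boundaries.begin()) - 1];
    }
    return bins;
}
```

Both functions produce the same counts, but the first reads the input once per bin while the second reads it once in total.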

### parallel_for
`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what
we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use
`parallel_for` alone, there would be a race condition between threads when incrementing the values in the output
histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way
to synchronize and accumulate between threads.

## Proposal
I believe there are two competing options for `histogram`, which may both have utility in the final implementation
depending on the use case.

### Implementation One (Embarrassingly Parallel)
This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the
`histogram`.

#### OpenMP
1) Determine the number of threads that we will use locally
2) Allocate temporary storage for (number of threads - 1) copies of the histogram output sequence. Thread zero can
use the user-provided output data.
3) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into
its own copy of the output sequence using the temporary storage to remove any race conditions.
4) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the
histogram into the output histogram sequence. This step is also embarrassingly parallel.
5) Deallocate temporary storage.
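
The five steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed oneDPL code: the function name is hypothetical, the input is assumed to already hold a valid bin index in `[0, num_bins)` for each element, and the `_OPENMP` guards let the same code compile and run serially when OpenMP is not enabled.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// `bin_ids` holds a precomputed bin index in [0, num_bins) for every input element.
std::vector<std::size_t> histogram_private_copies(const std::vector<std::size_t>& bin_ids,
                                                  std::size_t num_bins) {
    // Step 1: number of threads to use (1 when compiled without OpenMP).
    int num_threads = 1;
#ifdef _OPENMP
    num_threads = omp_get_max_threads();
#endif
    std::vector<std::size_t> output(num_bins, 0);
    // Step 2: (num_threads - 1) temporary copies; thread zero writes to `output`.
    std::vector<std::vector<std::size_t>> temps(static_cast<std::size_t>(num_threads - 1),
                                                std::vector<std::size_t>(num_bins, 0));
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(bin_ids.size());
    // Step 3: each thread accumulates into its own copy, so no races occur.
#pragma omp parallel num_threads(num_threads)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        std::vector<std::size_t>& local =
            (tid == 0) ? output : temps[static_cast<std::size_t>(tid - 1)];
#pragma omp for
        for (std::ptrdiff_t i = 0; i < n; ++i)
            ++local[bin_ids[static_cast<std::size_t>(i)]];
    }
    // Step 4: parallel over bins; each bin sums all temporary copies into the output.
    const std::ptrdiff_t bins = static_cast<std::ptrdiff_t>(num_bins);
#pragma omp parallel for
    for (std::ptrdiff_t b = 0; b < bins; ++b)
        for (const auto& t : temps)
            output[static_cast<std::size_t>(b)] += t[static_cast<std::size_t>(b)];
    // Step 5: the temporary vectors are released when they go out of scope.
    return output;
}
```

The temporary storage grows with both the thread count and the number of bins, which is the cost the atomic alternative avoids.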

#### TBB
For TBB, we can do something similar, but we can use `enumerable_thread_specific` and its member function `local()` to
provide lazily allocated thread-local storage, which does not require querying the number of threads or getting
the thread index. This allows us to operate in a composable manner while keeping the same conceptual implementation.
1) Embarrassingly parallel accumulation to thread-local storage
2) Embarrassingly parallel aggregation to output data
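
To illustrate the idea of lazily created per-thread storage without depending on TBB, here is a small standard-library stand-in. The class `LazyThreadHistograms` and its members are hypothetical names for this sketch. It mimics only the behavior of handing each calling thread its own histogram on first use; it is not TBB's `enumerable_thread_specific`, and its mutex-protected lookup is chosen for simplicity rather than speed.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for lazily created per-thread histogram storage.
class LazyThreadHistograms {
public:
    explicit LazyThreadHistograms(std::size_t num_bins) : num_bins_(num_bins) {}

    // Returns the calling thread's private histogram, creating it on first use.
    // Intended to be called once per chunk of work, not once per element.
    std::vector<std::size_t>& local() {
        std::lock_guard<std::mutex> lock(mutex_);
        const std::thread::id id = std::this_thread::get_id();
        auto it = storage_.find(id);
        if (it == storage_.end())
            it = storage_.emplace(id, std::vector<std::size_t>(num_bins_, 0)).first;
        return it->second;  // references into unordered_map stay valid after rehashing
    }

    // Adds every private histogram that was actually created into `output`.
    void combine_into(std::vector<std::size_t>& output) {
        std::lock_guard<std::mutex> lock(mutex_);
        for (const auto& entry : storage_)
            for (std::size_t b = 0; b < num_bins_; ++b)
                output[b] += entry.second[b];
    }

    // Number of private histograms created so far.
    std::size_t created() {
        std::lock_guard<std::mutex> lock(mutex_);
        return storage_.size();
    }

private:
    std::size_t num_bins_;
    std::mutex mutex_;
    std::unordered_map<std::thread::id, std::vector<std::size_t>> storage_;
};
```

Because storage is created only for threads that actually call `local()`, the number of temporary copies follows the threads that participated rather than the maximum possible thread count.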

I believe the challenge here may be to properly provide the heuristics to choose between this implementation and the
other implementation. However, we should be able to have some reasonable division.

### Implementation Two (Atomics)
This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the
output histogram data, we can merely run a `parallel_for` pattern.

To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics
specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`; however, this can
only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then
copying it to the output data. `C++20` provides `std::atomic_ref<T>`, which would allow us to wrap user-provided output
data in an atomic wrapper, but we cannot assume `C++20` for all users. We could look to implement our own
`atomic_ref<T>` for C++17, but that would require specialization for individual compilers. OpenMP provides atomic
operations, but those are only available for the OpenMP backend.

It remains to be seen if atomics are worth their overhead and contention from a performance perspective and may depend
on the different approaches available.
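
As one possible shape for the OpenMP-specific variant mentioned above, the sketch below increments the caller-provided output in place using `#pragma omp atomic`. The function name is hypothetical and the input is assumed to hold precomputed, valid bin indices. When compiled without OpenMP the pragmas are ignored and the loop simply runs serially.

```cpp
#include <cstddef>
#include <vector>

// Accumulates into the caller-provided `output`, which must already be sized so
// that every index in `bin_ids` is valid. No temporary histogram copies are used.
void histogram_atomic(const std::vector<std::size_t>& bin_ids,
                      std::vector<std::size_t>& output) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(bin_ids.size());
    std::size_t* out = output.data();
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        const std::size_t b = bin_ids[static_cast<std::size_t>(i)];
        // Atomic read-modify-write on the shared counter removes the race.
#pragma omp atomic
        out[b] += 1;
    }
}
```

Contention on a counter rises when many elements map to the same bin, so the relative cost of this approach depends on both the number of bins and the input distribution.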

### Selecting Between Algorithms
Several factors may favor either implementation one or implementation two. Which `histogram` API
has been called, `n`, the number of output bins, and the backend/atomic provider may all impact the performance trade-offs
between these two approaches. My intention is to experiment with these and be open to a heuristic to choose one or the
other based upon the circumstances if that is what the data suggests is best. The larger the number of output bins, the
better atomics should do vs redundant copies of the output.
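
One way to reason about the bin-count trade-off is to compute the temporary storage that implementation one would need. The sketch below is purely illustrative: the names are hypothetical, and the selection rule (compare against a caller-supplied byte budget) is an assumption made for this example, not a measured or proposed heuristic.

```cpp
#include <cstddef>

enum class HistogramStrategy { PrivateCopies, Atomics };

// Temporary bytes for implementation one: (threads - 1) extra copies of the output.
constexpr std::size_t private_copy_bytes(std::size_t num_threads, std::size_t num_bins,
                                         std::size_t bytes_per_count) {
    return (num_threads > 0 ? num_threads - 1 : 0) * num_bins * bytes_per_count;
}

// Hypothetical rule: use private copies while their footprint fits in the budget,
// otherwise fall back to atomics. The budget is supplied by the caller.
constexpr HistogramStrategy choose_strategy(std::size_t num_threads, std::size_t num_bins,
                                            std::size_t bytes_per_count,
                                            std::size_t temp_budget_bytes) {
    return private_copy_bytes(num_threads, num_bins, bytes_per_count) <= temp_budget_bytes
               ? HistogramStrategy::PrivateCopies
               : HistogramStrategy::Atomics;
}
```

For example, 8 threads with 256 four-byte bins need 7 * 256 * 4 = 7168 bytes of temporaries, while 8 threads with 1,000,000 such bins need 28,000,000 bytes.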

## Alternative Approaches
* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to
modifying them. It's possible such an approach could behave similarly to atomics, but with different
overhead tradeoffs. It seems quite likely that this would result in more overhead, but it could be worth exploring.

Contributor:

BTW, I have a curiosity question. Which approach does NVidia use?

Contributor Author:

NVidia has a similar API within CUB but not within Thrust, and therefore does not have a CPU implementation that I am aware of, only one specifically for a GPU device.

* Another possible approach could be to do something like the proposed implementation one, but with some sparse
representation of output data. However, I think the general assumptions we can make about the normal case make this
less likely to be beneficial. It is quite likely that `n` is much larger than the number of output histogram bins, and
that a large percentage of the output histogram may be occupied, even when considering dividing the input amongst
multiple threads. This could be explored if we find temporary storage is too large for some cases and the atomic
approach does not provide a good fallback.

## Open Questions
* Would it be worthwhile to add our own implementation of `atomic_ref` for C++17? I believe this would require
specializations for each of our supported compilers.

* What is the overhead of atomics in general in this case and does the overhead there make them inherently worse than
merely having extra copies of the histogram and accumulating?

* Is it worthwhile to have separate implementations for TBB and OpenMP because they may differ in the best-performing
implementation? What is the best heuristic for selecting between algorithms (if one is not the clear winner)?

* How will vectorized bricks perform, and in what situations will it be advantageous to use or not use vector
instructions?