RFC for histogram CPU implementation #1930

Open: 34 commits to be merged into base: main

danhoeflinger
Contributor

Adds an RFC for histogram CPU implementation.

Signed-off-by: Dan Hoeflinger <[email protected]>
more formatting fixes
Contributor

akukanov commented Nov 6, 2024

Overall, this all sounds good enough for the "proposed" stage, where it's expected that some details are unknown and need to be determined. I am happy to approve it but will wait for a few days in case @danhoeflinger wants to update the document with some follow-up thoughts on the discussion.

time.

### Other Unexplored Approaches
* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to
Contributor

BTW, a question out of curiosity: which approach does NVIDIA use?

Contributor Author

NVIDIA has a similar API within CUB but not within Thrust, so as far as I am aware it has no CPU implementation, only one specifically for a GPU device.

cases which are important, and provides reasonable performance for most cases.

### Embarrassingly Parallel Via Temporary Histograms
This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the
Contributor

What does the term "embarrassingly parallel" mean?

Contributor

@MikeDvorskiy MikeDvorskiy Dec 27, 2024

  1. Update: got it. https://en.wikipedia.org/wiki/Embarrassingly_parallel

  2. Of course, if you are solving a concrete task, are allowed to use all of the machine's resources, and there are no other workloads on the node, the best way to calculate a histogram is to divide the work statically: each thread calculates a local histogram, and afterwards the local histograms are reduced into one.

  3. But when talking about parallelism in a general-purpose library, we have to keep in mind that the final user's application can run in "different circumstances", depending on the application type, the task, real-time data, other workloads on the same host, and many other things.
    When we were developing the TBB backend we kept those things in mind and preferred to use the TBB auto partitioner (instead of a static one, for example).
    Composability is also a reason here.

  4. BTW, have you considered the common "parallel reduce" pattern in general (and the tbb::parallel_reduce pattern in particular) for histogram calculation? Parallel histogram calculation seems to match the common reduce (with a certain "big" grainsize): each Body calculates a local histogram (bins), and the Combiner sums all the local bins into the final ones.
    Additionally, if the number of bins is "big", we can apply a second level of parallelism within the Combiner code: SIMD, or even "parallel_for" plus SIMD if the number of bins is "too big".

Contributor Author

  1. Yes, although I think there is no reason here to divide the work statically; we can instead rely upon our existing parallel_for implementation to provide a good, composable implementation.

  2. Agreed, which is why the intent is to use the existing parallel_for structure (including partitioners) to implement the parallelism. If we were to do it from scratch, we would do it in a similarly composable way, but it is better to rely upon existing infrastructure.

  3. Yes, I thought about this. For TBB, and even more so for OpenMP, the built-in reduction functionality is geared toward very simple, lightweight types as the reduction variable, whereas we may have an arbitrarily large array. Especially since we want a unified implementation, these backends do not seem to be set up to handle such large reduction variables. It seems we should take more control to ensure that no unnecessary copies are made and that the final combination is done performantly, based on the knowledge we have of the task. The implementation remains quite simple and unified.
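For readers following this exchange, here is a minimal sketch of the "parallel reduce" shape being discussed, written with plain `std::async` rather than `tbb::parallel_reduce` or an OpenMP reduction (both manage their reduction objects differently). The names `reduce_histogram`, `histogram_t`, and `grain` are hypothetical, and the input is assumed to already consist of valid bin indices. It only illustrates that the reduction variable is a whole `num_bins`-sized array: every leaf allocates one and every join walks all bins.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

using histogram_t = std::vector<std::size_t>;

// Divide-and-conquer reduction over [first, last); inputs are assumed to be valid bin indices.
histogram_t reduce_histogram(const std::size_t* first, const std::size_t* last,
                             std::size_t num_bins, std::size_t grain)
{
    assert(grain > 0);
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (n <= grain)
    {
        // "Body": a leaf builds its own local histogram (one allocation per leaf).
        histogram_t local(num_bins, 0);
        for (; first != last; ++first)
            ++local[*first];
        return local;
    }
    const std::size_t* mid = first + n / 2;
    // The left half runs on another thread, the right half on the current one.
    auto left = std::async(std::launch::async, reduce_histogram, first, mid, num_bins, grain);
    histogram_t right = reduce_histogram(mid, last, num_bins, grain);
    histogram_t acc = left.get();
    // "Combiner": every join costs O(num_bins), however few elements the halves held.
    for (std::size_t b = 0; b < num_bins; ++b)
        acc[b] += right[b];
    return acc;
}
```

With a large `num_bins`, the per-leaf allocation and per-join summation in this sketch are the kind of overhead the reply above wants to keep under explicit control.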

This method uses temporary storage and a pair of embarrassingly parallel `parallel_for` loops to accomplish the
`histogram`.

For this algorithm, each parallel backend will add a `__thread_enumerable_storage<_StoredType>` struct which provides
Contributor

  1. Why is TLS used rather than global memory with the "thread id" as a key, e.g. bins[thread_id]?
  2. Does TBB guarantee that the same threads finalize the work? "The same threads" means the threads that started the work.

Contributor Author

  1. There are a number of reasons:
    a) My understanding is that in TBB, while it may be technically possible to get the local thread id within an arena, it is an undocumented API, generally discouraged, and against the TBB mindset. Using TLS seems to be the preferred method specifically with TBB.
    b) While what you suggest perhaps fits better within OpenMP, we want to create a single implementation and not require a __parallel_histogram within every current and future backend, but rather depend upon existing functionality within the backend as much as we can (in this case __parallel_for).
    c) With smaller values of n and num_bins and a larger number of threads, not all threads should be used, because of the overhead associated with allocating and initializing each temporary bin copy. We can let the partitioner decide how many blocks to employ, but we want to avoid unnecessary allocation and initialization overheads wherever possible.

I will mention a downside for completeness, but it is outweighed here in my opinion:
It requires implementation of a thread local storage class for each backend. This is only non-trivial for OpenMP. It has been written generically to serve future patterns, though, so it is nice to have.

Contributor Author

  2. I'm not exactly sure what you mean here by "finalize the work". If you mean the second parallel_for, then no: we are explicitly parallelizing over a different dimension (num_bins) and accumulating across the temporary histograms that were filled by different threads. TBB does guarantee, though, that each thread will always use its own TLS for each grain of work when it is retrieved through local().

Contributor

> 2. I'm not exactly sure what you mean here by "finalize the work"

I will try to explain with an example:
Using the general TBB pattern tbb::parallel_for does not involve using system threads directly. There is only a "Body", which is called (with a part of the data, a TBB range) by the executing thread. Imagine the input range is split into 4 parts. Two threads process 2 parts simultaneously. The Body stores the local bin results in the TLS associated with those threads.
Afterwards, to "finalize the work", TBB must call the Body two more times to process the final 2 parts of the input range. These final two calls may be made by other threads, which have their own associated TLS instances. So it is impossible to make the final reduction of the local bins located in the TLS instances.

Contributor Author

There are 2 parallel_for calls, each of which is "embarrassingly parallel" where no thread body depends on previous thread bodies. The first parallel_for must complete before the second one starts though. The first parallel_for uses the TLS as normal, and just accumulates sections of the input data into each thread's individual TLS.

The second parallel_for call does not use the TLS as normal, but rather has every thread visit a section of every TLS which was created one by one, processing a section of the histogram bins in parallel, combining the work of different threads from the first loop into the final global histogram.

The TLS we propose here (which is also implemented in the PR) supports this, and we obtain the correct result. Cache behavior will not be perfect when a TLS instance is accessed from a thread other than the one it was created on, but that is just something we have to deal with.
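The two passes described in this reply can be sketched roughly as follows. This is not the PR's code: it uses raw `std::thread` with static chunking as a stand-in for `parallel_for` and its partitioner, a plain vector of per-worker histograms as a stand-in for the TLS, and it assumes the input already consists of valid bin indices. The name `two_pass_histogram` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

std::vector<std::size_t> two_pass_histogram(const std::vector<std::size_t>& bin_ids,
                                            std::size_t num_bins, std::size_t num_threads)
{
    assert(num_threads > 0);
    // One private histogram per worker (stand-in for thread-local storage).
    std::vector<std::vector<std::size_t>> temps(num_threads,
                                                std::vector<std::size_t>(num_bins, 0));
    std::vector<std::size_t> result(num_bins, 0);

    // Runs body(t) for t in [0, num_threads) on separate threads and joins them all,
    // so the first pass is complete before the second one starts.
    auto run_all = [num_threads](auto body) {
        std::vector<std::thread> workers;
        for (std::size_t t = 0; t < num_threads; ++t)
            workers.emplace_back(body, t);
        for (auto& w : workers)
            w.join();
    };

    // Pass 1: parallel over the input; each worker writes only to its own temporary.
    const std::size_t n = bin_ids.size();
    run_all([&](std::size_t t) {
        const std::size_t begin = n * t / num_threads;
        const std::size_t end = n * (t + 1) / num_threads;
        for (std::size_t i = begin; i < end; ++i)
            ++temps[t][bin_ids[i]];
    });

    // Pass 2: parallel over the bins; each worker sums its slice of bins
    // across every temporary, so no two workers write the same output bin.
    run_all([&](std::size_t t) {
        const std::size_t begin = num_bins * t / num_threads;
        const std::size_t end = num_bins * (t + 1) / num_threads;
        for (const auto& temp : temps)
            for (std::size_t b = begin; b < end; ++b)
                result[b] += temp[b];
    });
    return result;
}
```

Neither pass needs atomics or locks, because the set of locations each worker writes to is disjoint from every other worker's.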

Contributor

@MikeDvorskiy MikeDvorskiy Jan 14, 2025

I don't understand your answer, Dan...
It seems you didn't catch my question/concerns...

I will try to explain my concern again:
tbb::parallel_for produces several calls of the Body that is passed to it. You don't know how many calls there are, because the TBB auto partitioner is applied by default.
Each call of this Body may be made by a different thread. Moreover, the first calls of the Body may be made by threads with "ids" 0-3, and the last calls by other threads, with "ids" 4-7 for example. Each TLS is associated with its own thread, and you don't know the IDs of the threads... I don't understand how you can get the calculated local bins from all the TLSs...

Contributor Author

@danhoeflinger danhoeflinger Jan 14, 2025

We are basically using TBB's enumerable_thread_specific as a model here, and implementing a stripped-down version for OpenMP and a trivial version for the serial backend.

enumerable_thread_specific has two ways of accessing the data.
a) local() which gets the TLS for the current thread,
b) begin() and end() which provide iterators to the sequence of all local storage from all threads.

This allows us to use (a) in the first parallel loop and (b) in the second parallel loop. The second parallel loop does not use the enumerable_thread_specific as a "Thread Local Storage" but rather a 2-D array space which it iterates over summing across columns (corresponding to individual histogram bins from different threads). This allows us to accumulate the data from all threads into the global space histogram copy no matter which threads are used and when.
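To make the two access modes concrete, here is a small sketch of a container offering both. It is a hypothetical stand-in written for this illustration (`enumerable_storage_sketch`, `count_with_storage`), keyed by `std::thread::id` behind a mutex for brevity. It is neither TBB's `enumerable_thread_specific` nor the PR's `__thread_enumerable_storage`, and the input is assumed to already consist of valid bin indices.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

template <typename T>
class enumerable_storage_sketch
{
    std::map<std::thread::id, T> slots_; // one slot per thread that asked for one
    std::mutex mutex_;
    T prototype_; // every new slot starts as a copy of this value

  public:
    explicit enumerable_storage_sketch(T prototype) : prototype_(std::move(prototype)) {}

    // (a) Returns the calling thread's slot, creating it lazily on first use.
    T& local()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return slots_.emplace(std::this_thread::get_id(), prototype_).first->second;
    }

    // (b) Iterates over every slot created so far, whichever thread owns it.
    auto begin() { return slots_.begin(); }
    auto end() { return slots_.end(); }
};

// Pass 1 fills per-thread slots through local(); pass 2 sums across all slots.
std::vector<int> count_with_storage(const std::vector<std::vector<std::size_t>>& chunks,
                                    std::size_t num_bins)
{
    enumerable_storage_sketch<std::vector<int>> storage(std::vector<int>(num_bins, 0));
    std::vector<std::thread> workers;
    for (const auto& chunk : chunks)
        workers.emplace_back([&storage, &chunk, num_bins] {
            std::vector<int>& mine = storage.local();
            for (std::size_t bin : chunk)
            {
                assert(bin < num_bins);
                ++mine[bin];
            }
        });
    for (auto& w : workers)
        w.join();

    // The caller never needs to know which thread ids took part in pass 1.
    std::vector<int> result(num_bins, 0);
    for (auto it = storage.begin(); it != storage.end(); ++it)
        for (std::size_t b = 0; b < num_bins; ++b)
            result[b] += it->second[b];
    return result;
}
```

Because slots are created only when `local()` is first called on a thread, threads that never receive work never pay for a temporary histogram, which matches the allocation concern raised earlier in the thread.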

I'm not sure how else I can explain it. The code in the implementation is tested, working, and pretty concise; if you want to see the details, you can look at the PR.

Contributor

@MikeDvorskiy MikeDvorskiy Jan 15, 2025

You mean "TBB TLS", not system TLS...
I clarified that question with Alexey.
TBB TLS is a kind of container and allows iterating over all the local bins... I was not aware of that.
Now I get it.

@danhoeflinger
Contributor Author

One question I have for the group...
If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?

From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

Contributor

mmichel11 commented Jan 13, 2025

> One question I have for the group... If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?
>
> From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

I agree that we should honor the user's request for a specific policy rather than using the serial implementation up to some empirically determined cutoff point. I also imagine that the exact cutoff point where the parallel version performs better can vary widely depending on a user's hardware setup, and giving users the freedom to choose manually when to switch from the serial to the parallel version may result in better performance than any generic decision we could make.

Contributor

@mmichel11 mmichel11 left a comment

I've taken another pass through the document. A single question regarding how technical we want to get when explaining the algorithm.

The RFC looks ready to me.

rfcs/proposed/host_backend_histogram/README.md (outdated, resolved)

Contributor

akukanov commented Jan 15, 2025

> One question I have for the group...
> If we know that a serial implementation will provide better performance up to some threshold (perhaps dependent on num bins, num threads, num input elements), can / should we dispatch instead to a serial implementation?
>
> From my reading, it seems the answer is probably no. Execution policies have semantic meaning, and par / par_unseq do not simply mean "provide the fastest version" even if that is what the users probably want.

I believe yes, we can. Generally, a serial implementation is correct for parallel execution policies, so it's more of a QoI question whether to parallelize or not. While the policies do not mean "provide the fastest version", they do not mean "always use multiple threads" either. It's a permission to use multiple threads, but not an obligation.
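As an illustration of the quality-of-implementation choice described here, the following sketch dispatches to a serial loop below a size cutoff and to a multi-threaded path above it. Everything in it is an assumption made for the example: the names, the `std::thread`-based parallel path, and especially the value of `serial_cutoff`, which a real implementation would have to determine empirically (possibly from the number of bins and threads as well). The input is assumed to already consist of valid bin indices.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical cutoff, chosen arbitrarily for this sketch.
constexpr std::size_t serial_cutoff = 1 << 14;

std::vector<std::size_t> serial_histogram(const std::vector<std::size_t>& bin_ids,
                                          std::size_t num_bins)
{
    std::vector<std::size_t> result(num_bins, 0);
    for (std::size_t id : bin_ids)
        ++result[id];
    return result;
}

std::vector<std::size_t> parallel_histogram(const std::vector<std::size_t>& bin_ids,
                                            std::size_t num_bins, std::size_t num_threads)
{
    assert(num_threads > 0);
    std::vector<std::vector<std::size_t>> temps(num_threads,
                                                std::vector<std::size_t>(num_bins, 0));
    std::vector<std::thread> workers;
    const std::size_t n = bin_ids.size();
    for (std::size_t t = 0; t < num_threads; ++t)
        workers.emplace_back([&, t] {
            // Each worker counts its own slice of the input into its own temporary.
            for (std::size_t i = n * t / num_threads; i < n * (t + 1) / num_threads; ++i)
                ++temps[t][bin_ids[i]];
        });
    for (auto& w : workers)
        w.join();

    // The combination is kept serial here for brevity.
    std::vector<std::size_t> result(num_bins, 0);
    for (const auto& temp : temps)
        for (std::size_t b = 0; b < num_bins; ++b)
            result[b] += temp[b];
    return result;
}

bool should_run_serially(std::size_t n) { return n < serial_cutoff; }

// A parallel policy permits, but does not oblige, the use of multiple threads.
std::vector<std::size_t> histogram(const std::vector<std::size_t>& bin_ids,
                                   std::size_t num_bins, std::size_t num_threads)
{
    if (should_run_serially(bin_ids.size()))
        return serial_histogram(bin_ids, num_bins);
    return parallel_histogram(bin_ids, num_bins, num_threads);
}
```

Both paths return the same counts, so the choice between them is invisible to the caller apart from performance.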


### SIMD/openMP SIMD Implementation
Currently oneDPL relies upon openMP SIMD to provide its vectorization, which is designed to provide vectorization across
loop iterations. OneDPL does not directly use any intrinsics which may offer more complex functionality than what is
Contributor

The second sentence may be omitted.
Based on the first sentence we can conclude that "OneDPL does not directly use any intrinsics..."

Contributor Author

applied.

increment. For the even bin API, the calculations to determine selected bin have some opportunity for vectorization as
each input has the same mathematical operations applied to each. However, for the custom range API, each input element
uses a binary search through a list of bin boundaries to determine the selected bin. This operation will have a
different length and control flow based upon each input element and will be very difficult to vectorize.
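The difference between the two bin-selection schemes can be seen in a short sketch. The function names, the `no_bin` sentinel, and the half-open `[lower, upper)` bin convention are assumptions made for this illustration, not definitions taken from the RFC.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sentinel for values that fall outside the histogram's range (hypothetical convention).
constexpr std::ptrdiff_t no_bin = -1;

// Even bins over [lo, hi): the same few arithmetic operations for every input element.
std::ptrdiff_t even_bin(double x, double lo, double hi, std::size_t num_bins)
{
    assert(hi > lo && num_bins > 0);
    if (x < lo || x >= hi)
        return no_bin;
    const auto idx = static_cast<std::size_t>((x - lo) / (hi - lo) * static_cast<double>(num_bins));
    // Guard against rounding up to num_bins for values just below hi.
    return static_cast<std::ptrdiff_t>(std::min(idx, num_bins - 1));
}

// Custom bins: boundaries is sorted and holds num_bins + 1 values.
// The binary search takes a data-dependent path through the boundary list.
std::ptrdiff_t custom_bin(double x, const std::vector<double>& boundaries)
{
    if (boundaries.size() < 2 || x < boundaries.front() || x >= boundaries.back())
        return no_bin;
    const auto it = std::upper_bound(boundaries.begin(), boundaries.end(), x);
    return (it - boundaries.begin()) - 1;
}
```

In `even_bin` every element follows the same instruction sequence, while in `custom_bin` the comparisons made by `std::upper_bound` depend on the value being searched for.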
Contributor

@MikeDvorskiy MikeDvorskiy Jan 15, 2025

But we can calculate the bin indexes for the input data in SIMD manner.
After that we can process the result in a serial loop.
No?

Contributor Author

@danhoeflinger danhoeflinger Jan 15, 2025

This is applicable only for the even binned case. Without using intrinsic operations, we must do this with omp simd and the ordered structured block. Initial investigation seemed to indicate that this was unsuccessful for generating vectorized code, and my suspicion is that it will not really help anyway. I can revisit this and attempt it, but the intention for now was to omit vectorizations from this first phase.
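The two-step idea under discussion (compute bin indices in one loop, then increment counts in a separate serial loop) might look like the sketch below for the even-bin case. It is written as plain C++ without `omp simd` or intrinsics, the function name is hypothetical, and whether a given compiler actually vectorizes the first loop is not verified here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<std::size_t> even_histogram_two_step(const std::vector<double>& input,
                                                 double lo, double hi, std::size_t num_bins)
{
    assert(hi > lo && num_bins > 0);
    const double scale = static_cast<double>(num_bins) / (hi - lo);
    std::vector<std::size_t> bin_of(input.size());

    // Step 1: identical arithmetic for every element, with no dependence between
    // iterations. Out-of-range values are mapped to the extra index num_bins.
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const double x = input[i];
        const bool in_range = (x >= lo) && (x < hi);
        const double safe_x = in_range ? x : lo; // keeps the conversion below well defined
        const std::size_t idx =
            std::min(static_cast<std::size_t>((safe_x - lo) * scale), num_bins - 1);
        bin_of[i] = in_range ? idx : num_bins;
    }

    // Step 2: the increments stay serial, since two elements may hit the same bin.
    std::vector<std::size_t> counts(num_bins + 1, 0);
    for (std::size_t b : bin_of)
        ++counts[b];
    counts.pop_back(); // discard the out-of-range slot
    return counts;
}
```

The sketch stores an index for every input element; processing the input in fixed-size blocks would bound that temporary storage.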

Contributor Author

For now I'll ask that we leave it as described in the RFC, which gives some understanding of how this can be improved in the future, but starts without vectorization for this phase.
We can add an issue to explore using simd ordered to get some improvement for histogram even, and leave it out for this RFC and the initial PR implementation.

Contributor

I see. OK, let's leave it as described.

Contributor

MikeDvorskiy commented Jan 16, 2025

> I believe yes, we can. Generally, a serial implementation is correct for parallel execution policies, so it's more of a QoI question whether to parallelize or not. While the policies do not mean "provide the fastest version", they do not mean "always use multiple threads" either. It's a permission to use multiple threads, but not an obligation.

Agree with Alexey. Especially since at least the host patterns "sort" and "merge" use thresholds for switching to a serial implementation, depending on the number of input elements.
So it probably makes sense to add a description saying that the implementation may fall back to a serial implementation for a "small" number of input elements for performance reasons.

Contributor

@MikeDvorskiy MikeDvorskiy left a comment

Overall, looks good to me.
Just one comment about adding a description that the implementation may do a fallback to a serial implementation.

akukanov previously approved these changes Jan 16, 2025

Contributor

@akukanov akukanov left a comment

Re-approved.

The questions in my earlier approval are now addressed. The last couple of comments I made do not hold the proposal from landing.

@danhoeflinger
Contributor Author

Waiting for a second approval to merge, so I added the small asks you made in the meantime, @akukanov.

Contributor

@MikeDvorskiy MikeDvorskiy left a comment

LGTM
