GPU support on ppc64le #2937
Currently there is no way to run LightGBM with GPU acceleration on ppc64le, since it only has OpenCL support. What would be the correct approach to add CUDA code alongside the existing OpenCL code in LightGBM? If I understand correctly, adding CUDA kernels to treelearner would be enough, right?
There are other people interested in this, see #1905 and #1129 for example.

Comments
cc @ceseo
Our plan is to contribute the CUDA kernels (histogram16/64/256) to treelearner and create a common API for OpenCL and CUDA (a common template class), so that the amount of change needed in the current code base is minimized. At first, CUDA would be restricted to ppc64le/Linux only. Does that look reasonable?
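For illustration only, the compile-time gating could be done with predefined compiler macros along these lines; `USE_CUDA` and `LGBM_CUDA_ENABLED` are hypothetical names, and the actual mechanism (e.g., a CMake option) could differ:

```cpp
// Sketch of a compile-time gate restricting the CUDA path to ppc64le/Linux.
// __powerpc64__ and __linux__ are predefined by GCC/Clang on that platform;
// USE_CUDA is a hypothetical flag the build system would set.
#if defined(USE_CUDA) && defined(__powerpc64__) && defined(__linux__)
  #define LGBM_CUDA_ENABLED 1
#else
  #define LGBM_CUDA_ENABLED 0
#endif
```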
Thanks @ceseo and @everton1984, the contribution is very welcome!
@huanzhang12 As @ceseo mentioned, we want to add a new set of histograms written in CUDA, use the current gputreelearner as a base class, and add specialisations with the non-common code for OpenCL and CUDA. I think this would require minimal changes to code outside gputreelearner. Would that be acceptable? Are you aware of any GPU code outside that part that we might have missed?
@everton1984 It will be a great contribution if you can add CUDA-equivalent kernels to LightGBM! Yes, you can inherit from the current gputreelearner as the base class, but currently many functions in the GPU tree learner involve memory management in OpenCL (e.g., enqueue_write_buffer, enqueue_unmap_buffer, etc.), so most of the methods cannot be reused directly. I suggest making an abstraction layer for memory management (e.g., a wrapper that handles memory-related function calls and forwards them to the OpenCL or CUDA backend accordingly), so that many parts of the existing gputreelearner can be reused.
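A minimal sketch of what such a memory-management wrapper could look like, assuming a small interface that the tree learner calls and that each backend implements; the names (`DeviceMemory`, `CudaMemory`, `USE_CUDA`) are illustrative, not existing LightGBM API:

```cpp
// Hypothetical abstraction over device memory so the tree learner code
// does not call OpenCL or CUDA APIs directly.
#include <cstddef>

class DeviceMemory {
 public:
  virtual ~DeviceMemory() {}
  virtual void* Allocate(std::size_t size) = 0;
  virtual void Free(void* ptr) = 0;
  virtual void CopyToDevice(void* dst, const void* src, std::size_t size) = 0;
  virtual void CopyFromDevice(void* dst, const void* src, std::size_t size) = 0;
};

#ifdef USE_CUDA  // placeholder build flag
#include <cuda_runtime.h>

class CudaMemory : public DeviceMemory {
 public:
  void* Allocate(std::size_t size) override {
    void* ptr = nullptr;
    cudaMalloc(&ptr, size);  // error handling omitted in this sketch
    return ptr;
  }
  void Free(void* ptr) override { cudaFree(ptr); }
  void CopyToDevice(void* dst, const void* src, std::size_t size) override {
    cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
  }
  void CopyFromDevice(void* dst, const void* src, std::size_t size) override {
    cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost);
  }
};
#endif  // USE_CUDA
```

An OpenCL counterpart would implement the same interface on top of the Boost.Compute buffers and queues that the existing learner already uses.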
@huanzhang12 My idea was to implement a class hierarchy roughly like the sketch below, so GPUTreeLearner would keep the common code. I think this approach is easily maintainable and changes very little code outside GPUTreeLearner.
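A rough sketch of that kind of hierarchy; the class names and virtual hooks here are purely illustrative, and the real split would follow the existing gputreelearner structure:

```cpp
// Hypothetical split: common logic stays in GPUTreeLearner, backend-specific
// buffer handling and kernel launches live in the derived classes.
class GPUTreeLearner {          // in LightGBM this would derive from SerialTreeLearner
 public:
  virtual ~GPUTreeLearner() {}
  void ConstructHistograms() {  // shared control flow
    AllocateDeviceBuffers();
    LaunchHistogramKernel();
  }
 protected:
  virtual void AllocateDeviceBuffers() = 0;  // backend specific
  virtual void LaunchHistogramKernel() = 0;  // backend specific
};

class OCLTreeLearner : public GPUTreeLearner {
 protected:
  void AllocateDeviceBuffers() override { /* boost::compute buffers */ }
  void LaunchHistogramKernel() override { /* enqueue OpenCL histogram kernel */ }
};

class CUDATreeLearner : public GPUTreeLearner {
 protected:
  void AllocateDeviceBuffers() override { /* cudaMalloc / cudaMemcpy */ }
  void LaunchHistogramKernel() override { /* launch CUDA histogram kernel */ }
};
```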
One current limitation of our CUDA code is that it can NOT coexist (i.e., be compiled at the same time) with the GPU/OCL code, similar to the existing constraint that CPU and GPU/OCL cannot coexist. Do you see a need for both simultaneously in the same compilation? If so, hopefully we can overcome this. We also assume that parity between GPU/OCL and CUDA is a given; to us that means single-precision 16-, 64-, and 256-bin histogram support. Our initial CUDA version will "probably" also have double-precision histograms, so it would be nice for someone with GPU/OCL expertise to add that functionality to the project.
@everton1984 The implementation skeleton looks great! I suggested an abstraction layer for memory management as it may help reduce code duplication, but you can determine the best approach for this. @ChipKerchner The OpenCL implementation does support double precision; it can be enabled via a configuration flag.
@huanzhang12 The original motivation of this port was precisely to run on Tesla V100s in POWER9 systems. Do you have any objections to starting the port with double-precision support only at first? Most ppc64le systems support only server-grade GPUs anyway.
@huanzhang12 We have also encountered the need for histograms with bin sizes larger than 256 and have plans to support them. I think the OpenCL implementation would benefit from this capability as well.
@bordaw This is not relevant to the current issue; we're discussing enabling GPU support on ppc64le. Please create another issue to discuss new features.
@ceseo @everton1984 Don't worry about single/double precision too much at this time. As long as we have the double-precision version working, it is not difficult to add single-precision support. If you look at the current GPU code, there are only a few places with different code paths for single and double precision, and the differences between the two are not that big. But it is better to keep single precision in mind when designing the code workflow, to ease future extensions. @bordaw It is possible to support larger bin sizes, but performance can be much worse: when the bin size is at most 256, the histograms fit in the GPU's local memory, which has much higher bandwidth. If you would like to work on this, you are very welcome to open a new issue.
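To illustrate both points (a single templated kernel can cover the float and double accumulation paths, and histograms of up to 256 bins fit comfortably in on-chip shared/local memory), here is a simplified CUDA sketch; it is not LightGBM's actual kernel, which accumulates gradient/hessian pairs per feature group rather than a single value per bin:

```cuda
#include <cuda_runtime.h>

// Simplified per-block histogram: each block accumulates into shared memory
// (small enough for <= 256 bins), then flushes to global memory with atomics.
// ACC_T can be float or double, so both precision paths share one kernel.
template <typename ACC_T, int NUM_BINS = 256>
__global__ void HistogramKernel(const unsigned char* bin_idx,  // per-row bin index for one feature
                                const ACC_T* gradients,        // per-row gradient values
                                int num_rows,
                                ACC_T* global_hist) {           // NUM_BINS entries, zero-initialized
  __shared__ ACC_T local_hist[NUM_BINS];
  for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) local_hist[b] = ACC_T(0);
  __syncthreads();

  // Grid-stride loop over rows; shared-memory atomics keep contention on chip.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_rows;
       i += gridDim.x * blockDim.x) {
    atomicAdd(&local_hist[bin_idx[i]], gradients[i]);
  }
  __syncthreads();

  // Merge this block's partial histogram into the global result.
  for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
    atomicAdd(&global_hist[b], local_hist[b]);
  }
}
```

With much larger bin counts the per-block histogram starts to crowd out shared memory (or has to spill to global memory entirely), which is essentially the performance concern described above. Note that atomicAdd on double requires compute capability 6.0 or newer, which the targeted V100s satisfy.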
@huanzhang12 Yes, I will create another issue. Also, I agree that the code changes required for single/double-precision histograms are not significant.
We are almost ready to push the initial candidate for the CUDA branch.
The pull request has been open for almost 3 weeks and we're not getting much feedback. I'd like some assistance with the continuous integration systems: CI shows problems, but my testing on an AMD64 system with a GPU (OCL) shows no issues with the unit tests. I am unable to reproduce an environment that allows me to run the automated testing locally. Please help!
Sorry @ChipKerchner, I will look at this today.
@huanzhang12 Our resources are starting to be moved off this project. We would like to get this feature approved and integrated into master as soon as possible. Please let us know of any outstanding issues.
Implemented in #3160. Many thanks to everyone involved for the amazing work! We are looking forward to further CUDA improvements, e.g. Windows support.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.