[python-package] SegFault on MacOS when pytorch is installed #6595
Comments
Thanks for the excellent report @connortann! Since #6391, (assuming that was a typo in your original report and you really mean "OpenMP", not "OpenML") Since you have To narrow it down further, could you try 2 other tests?
I'm sorry to possibly involve yet a THIRD project in your investigation. I'm familiar with these topics and happy to help us all reach a resolution. You may also find these relevant:
Thanks for the response! Yes I think you're right about sklearn being relevant: the bug seems not to occur if sklearn is not imported. Here's what I tried: the tests pass in all these situations
import time
import sklearn
import torch
from sklearn.datasets import fetch_california_housing

def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

import time
import torch
import sklearn
from sklearn.datasets import fetch_california_housing

def test_something():
    X, y = fetch_california_housing(return_X_y=True)
    torch.tensor(X)
    time.sleep(3)

import time
import lightgbm
import torch
import numpy as np
# from sklearn.datasets import fetch_california_housing

def test_something():
    # X, y = fetch_california_housing(return_X_y=True)
    X = np.ones(shape=(200, 20))
    torch.tensor(X)
    time.sleep(3)

# ruff: noqa
# fmt: off
import time
import torch
import lightgbm
import numpy as np
# from sklearn.datasets import fetch_california_housing

def test_something():
    # X, y = fetch_california_housing(return_X_y=True)
    X = np.ones(shape=(200, 20))
    torch.tensor(X)
    time.sleep(3)

So, I think the example above is the minimal reproducer:
Adding my two cents to this issue. I managed to reproduce the bug following the setting given by @connortann. Running the following command raises the segfault but if prepending the command with
@lesteve ping as scikit-learn is involved in the minimal reproducer (OpenMP related).
Honestly @jeremiedbb may be a better person on this on the scikit-learn side.
To be honest this is quite a tricky topic at the interface of different projects, which make different choices about how to tackle OpenMP with wheels, and OpenMP in itself is already tricky. The root cause is generally using multiple OpenMP and using
One known work-around is to use conda-forge, which will use a single OpenMP and avoid most of these issues. I wanted to mention it, even if I understand using conda rather than pip is a non-starter in some use cases.
In this particular case, I played a bit with the code and can reproduce without scikit-learn, i.e. only with LightGBM and PyTorch. To be honest, I have heard of cases that go wrong with PyTorch and scikit-learn for similar reasons, but it's generally a bit hard to get a reproducer ...
I put together a quick repo: https://github.com/lesteve/lightgbm-pytorch-macos-segfault. In particular, see build log which shows a segfault, python file, workflow YAML file. Importing pytorch before lightgbm works fine, see build log.
Python file:
import pprint
import sys
import platform
import lightgbm
import torch
import threadpoolctl
print('version: ', sys.version, flush=True)
print('platform: ', platform.platform(), flush=True)
pprint.pprint(threadpoolctl.threadpool_info())
print('before torch tensor', flush=True)
t = torch.ones(200_000)
print('after torch tensor', flush=True)
Output:
From the threadpoolctl info, you can tell that there are multiple OpenMP runtimes in use: the brew one (from LightGBM) and the PyTorch one bundled in the wheel.
(Edit: sorry pinged the wrong Jérémie originally ...)
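The check described in this comment can be scripted. The following is a minimal sketch, assuming the list-of-dicts shape returned by `threadpoolctl.threadpool_info()` (entries with `user_api` and `filepath` keys). The helper name `distinct_openmp_runtimes` and the sample paths are hypothetical and only for illustration; they are not taken from the build log.

```python
def distinct_openmp_runtimes(info):
    """Return the sorted, de-duplicated file paths of OpenMP runtimes in `info`.

    `info` is expected to look like the output of
    threadpoolctl.threadpool_info(): a list of dicts that carry at least
    the "user_api" and "filepath" keys.
    """
    paths = {
        entry["filepath"]
        for entry in info
        if entry.get("user_api") == "openmp"
    }
    return sorted(paths)


# Hypothetical sample, NOT real output from the build log in this thread.
sample_info = [
    {"user_api": "openmp", "filepath": "/hypothetical/brew/libomp.dylib"},
    {"user_api": "openmp", "filepath": "/hypothetical/torch/lib/libomp.dylib"},
    {"user_api": "blas", "filepath": "/hypothetical/libopenblas.dylib"},
]

runtimes = distinct_openmp_runtimes(sample_info)
if len(runtimes) > 1:
    print("multiple OpenMP runtimes loaded:")
    for path in runtimes:
        print("  ", path)
```

In a real session, one would pass `threadpoolctl.threadpool_info()` to the helper after importing both libraries and before touching any tensor, since the crash reported here happens at first use rather than at import.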
Thanks very much for that! Your example has helped to clarify the picture for me a lot.
Short Summary
As a result, if you've installed both these libraries via wheels on macOS, loading both will result in 2 copies of
Even if all copies of
Longer Summary
I investigated this by running the following on my M2 Mac, with Python 3.11. Note that the versions are identical to those from the previous comment.
mkdir ./delete-me
cd ./delete-me
pip download \
--no-deps \
'lightgbm==4.5.0' \
'torch==2.4.1'
unzip ./lightgbm*.whl
unzip ./torch*.whl
otool -l ./lightgbm/lib/lib_lightgbm.dylib
otool -l ./torch/lib/libtorch_cpu.dylib
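The `otool -l` listings are long, so it can help to filter them down to the two load commands of interest. The sketch below assumes the usual `cmd` / `name` / `path` line layout that `otool -l` prints. The function name `parse_load_commands` is hypothetical, and the sample text is hand-written for illustration, not copied from either wheel.

```python
def parse_load_commands(otool_output):
    """Collect LC_LOAD_DYLIB names and LC_RPATH paths from `otool -l` text.

    Assumes each load command has a "cmd <NAME>" line followed by its
    fields, and that library names and rpaths contain no spaces.
    """
    dylibs, rpaths = [], []
    current_cmd = None
    for raw_line in otool_output.splitlines():
        fields = raw_line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "cmd":
            # Start of a new load command; remember its type.
            current_cmd = fields[1]
        elif fields[0] == "name" and current_cmd == "LC_LOAD_DYLIB":
            dylibs.append(fields[1])
        elif fields[0] == "path" and current_cmd == "LC_RPATH":
            rpaths.append(fields[1])
    return dylibs, rpaths


# Hand-written sample in the assumed format (hypothetical library name).
SAMPLE = """\
Load command 10
          cmd LC_LOAD_DYLIB
      cmdsize 56
         name @rpath/libexample.dylib (offset 24)
Load command 11
          cmd LC_RPATH
      cmdsize 32
         path @loader_path (offset 12)
"""

dylibs, rpaths = parse_load_commands(SAMPLE)
print("LC_LOAD_DYLIB:", dylibs)
print("LC_RPATH:", rpaths)
```

To apply it to the real libraries, capture the output of each `otool -l` command above (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass the text to the function.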
And the following LC_LOAD_DYLIB / LC_RPATH entries
And has the following LC_LOAD_DYLIB / LC_RPATH entries:
So
💥 2 copies of OpenMP loaded at the same time, and all the issues that come with that.
Why didn't @connortann observe this same behavior?
Not sure why @connortann was not able to reproduce this in #6595 (comment). That comment shows:
Probably because that example uses different codepaths in
How do we fix this?
I think some mix of the following would make this better for users.
Option 1:
I see Option 4 as the "proper solution", but I see the following barriers:
On a related note, NumPy & SciPy are trying this approach with openblas: https://pypi.org/project/scipy-openblas64/. In their case, they only need to coordinate with each other to make sure the user experience is good.
Description
A segmentation fault occurs on MacOS when lightgbm and pytorch are both installed, depending on the order of imports.
Possibly related: #4229
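Because a segfault kills the interpreter, import-order experiments like this are easier to compare when each order runs in its own child process. The following is a minimal sketch under that assumption. The helper names `run_imports_in_subprocess` and `compare_import_orders` are hypothetical, and the `torch.ones(200_000)` call is borrowed from the reproducer posted in the comments.

```python
import subprocess
import sys


def run_imports_in_subprocess(modules, body=""):
    """Import `modules` in order in a fresh interpreter, run `body`, return the exit code.

    On POSIX, a negative return code -N means the child was killed by
    signal N (for example -11 for SIGSEGV).
    """
    lines = [f"import {name}" for name in modules]
    if body:
        lines.append(body)
    result = subprocess.run([sys.executable, "-c", "\n".join(lines)])
    return result.returncode


def compare_import_orders():
    """Try both import orders and print the exit code of each child process."""
    body = "torch.ones(200_000)"
    for order in (["lightgbm", "torch"], ["torch", "lightgbm"]):
        code = run_imports_in_subprocess(order, body)
        print(" then ".join(order), "-> exit code", code)
```

Calling `compare_import_orders()` in an environment with both packages installed prints one exit code per order; an exit code of 0 only says that the child process did not crash.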
Reproducible example
To reproduce the issue on GH actions:
Leads to Fatal Python error: Segmentation fault. Full output:
Environment info
LightGBM version or commit hash: 4.5.0
Result of pip list:
Additional Comments
We came across this issue over at the shap repo, trying to run tests with the latest versions of both pytorch and lightgbm. We initially raised this issue on the pytorch issue tracker: pytorch/pytorch#121101.
However, the underlying issue doesn't seem to be specific just to pytorch or lightgbm, but rather it relates to the mutual compatibility of pytorch and lightgbm. The issue seems to relate to multiple OpenMP runtimes being loaded.
So, I thought it would be worth raising the issue here too in the hope that it helps us collectively find a fix.