Run tests and build Python wheels for aarch64 architecture #3948
Conversation
Also, I tried to run tests on ..., but then the tests were failing with segfaults:
```python
# On aarch64, tiny floating-point residues (~1e-16) around exact zeros trip the
# default tolerances, so relax them there.
if platform.machine() == 'aarch64':
    np.testing.assert_allclose(contribs_csr.toarray(), contribs_dense, rtol=1, atol=1e-12)
else:
    np.testing.assert_allclose(contribs_csr.toarray(), contribs_dense)
```
Sometimes the `contribs_sparse` tests fail, so we need to relax the assertion tolerance:
```
>       np.testing.assert_allclose(contribs_csr_arr_re, contribs_dense)
E       AssertionError:
E       Not equal to tolerance rtol=1e-07, atol=0
E
E       Mismatched elements: 20 / 840 (2.38%)
E       Max absolute difference: 2.22044605e-16
E       Max relative difference: 1.
E       x: array([[ 0.000000e+00,  0.000000e+00,  0.000000e+00,  0.000000e+00,
E               -3.840704e-02, -1.261461e-01, -3.519505e-02,  1.271551e-04,
E                0.000000e+00, -2.690382e-04, -1.218832e-01,  1.221108e-04,...
E       y: array([[ 0.000000e+00,  0.000000e+00, -4.847735e-17,  0.000000e+00,
E               -3.840704e-02, -1.261461e-01, -3.519505e-02,  1.271551e-04,
E                0.000000e+00, -2.690382e-04, -1.218832e-01,  1.221108e-04,...
```
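For reference, the failure mode is easy to reproduce outside the test suite. Below is a minimal sketch (the values are lifted from the traceback above; the variable names are illustrative, not from the PR) showing why the default tolerances reject a ~1e-16 residue sitting next to an exact zero, while the relaxed `rtol=1, atol=1e-12` check accepts it:

```python
import numpy as np

# One slice of values from the traceback above: the CSR-based contributions have an
# exact 0.0 where the dense contributions carry a tiny floating-point residue.
contribs_from_csr = np.array([0.0, -3.840704e-02, -1.261461e-01])
contribs_dense = np.array([-4.847735e-17, -3.840704e-02, -1.261461e-01])

# Default check: |actual - desired| <= atol + rtol * |desired| with rtol=1e-7, atol=0.
# For the first element that requires 4.8e-17 <= 1e-7 * 4.8e-17, which is false;
# the reported "max relative difference" is 1 because the reference value is itself ~1e-16.
try:
    np.testing.assert_allclose(contribs_from_csr, contribs_dense)
except AssertionError:
    print("default tolerances fail")

# Relaxed check from this PR: rtol=1 effectively neutralises the relative test and
# atol=1e-12 absorbs the ~1e-16 noise, so the assertion passes.
np.testing.assert_allclose(contribs_from_csr, contribs_dense, rtol=1, atol=1e-12)
print("relaxed tolerances pass")
```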
Prefetch and malloc tests are failing on ...
@guolinke Maybe you would like to investigate this. FYI, ...
```diff
@@ -42,7 +43,8 @@
 pytestmark = [
     pytest.mark.skipif(getenv('TASK', '') == 'mpi', reason='Fails to run with MPI interface'),
-    pytest.mark.skipif(getenv('TASK', '') == 'gpu', reason='Fails to run with GPU interface')
+    pytest.mark.skipif(getenv('TASK', '') == 'gpu', reason='Fails to run with GPU interface'),
+    pytest.mark.skipif(machine() != 'x86_64', reason='Fails to run with non-x86_64 architecture')
```
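For context: `machine()` here is presumably `platform.machine()`, imported at the top of the test module; it returns 'x86_64' on the regular runners and 'aarch64' under the QEMU job. A self-contained sketch of the same module-level skip pattern (the file and test names below are illustrative, not the actual LightGBM tests):

```python
# skip_example.py -- illustrative only, not the LightGBM test file.
from os import getenv
from platform import machine

import pytest

# A module-level `pytestmark` list applies every mark to all tests in this module,
# so a single skipif entry is enough to skip the whole file on non-x86_64 hosts.
pytestmark = [
    pytest.mark.skipif(getenv('TASK', '') == 'mpi', reason='Fails to run with MPI interface'),
    pytest.mark.skipif(getenv('TASK', '') == 'gpu', reason='Fails to run with GPU interface'),
    pytest.mark.skipif(machine() != 'x86_64', reason='Fails to run with non-x86_64 architecture'),
]


def test_placeholder():
    # Collected everywhere, but reported as skipped on aarch64 (e.g. in the QEMU job).
    assert True
```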
In what way did the Dask tests fail?
I don't remember, sorry! I marked the Dask tests to be skipped at the very beginning of enabling QEMU and cannot find logs for runs where they were enabled. I think we can investigate Dask support for aarch64 in a follow-up PR. WDYT?
yep it's ok, can be a follow-up
awesome work! Thanks for figuring this out. I'm ok with adding the new possibly-longer-running CI job for now. We can always change it to optional in the future if it becomes problematic.
@jameslamb Thanks a lot for your review! I'd like to ask @guolinke to give it another look before merging, as this PR is quite major.
LGTM
@guolinke Just want to make sure you have not missed my comment: #3948 (comment).
Contributes to #1129.
Fixed #3456.
Fixed #3517.
(Superseded) Closes #3421.
I ran 10x10 consecutive runs and all of them were successful, so I'm quite confident in the stability of the new QEMU job.
As can be seen from the picture above, the QEMU job sometimes takes ~30 min and sometimes ~1 h. I thought there might be a single bottlenecking test that freezes the whole job, but unfortunately the long runs are caused by QEMU itself (it is common practice to set `timeout=360` for QEMU builds): in those jobs all tests contribute to the slowness. Here are the timings from `pytest`.

~30 min job:

~1 h job:
Although this job can be quite long and slow down our CI checks, I believe it should be run for every PR and commit: ARM64 support is becoming more and more important nowadays, and as you can see there have been a lot of requests for ARM wheels.