[tests][dask] Increase number of partitions in data #4149

Closed
jmoralez wants to merge 7 commits

Conversation

@jmoralez (Collaborator) commented Apr 1, 2021

This increases the default number of partitions of the collections returned by _create_data to 20. The purpose is to make it less likely that a single worker gets all the partitions, so we can be more confident that distributed training is actually being performed across all tests.
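For context, the idea looks roughly like this; a minimal sketch with made-up sizes and variable names, not the actual `_create_data` code:

```python
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

n_samples = 1_000
n_partitions = 20  # the new default proposed in this PR

X = np.random.default_rng(42).normal(size=(n_samples, 4))
y = (X[:, 0] > 0).astype(int)

# array output: one chunk per partition along the rows
dX = da.from_array(X, chunks=(n_samples // n_partitions, X.shape[1]))
dy = da.from_array(y, chunks=n_samples // n_partitions)

# dataframe output: same idea via npartitions
df = pd.DataFrame(X).assign(label=y)
ddf = dd.from_pandas(df, npartitions=n_partitions)
```

With 20 partitions spread over the two test workers, each worker should almost always hold some of the data, which is what forces genuinely distributed training.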

@jameslamb (Collaborator):

Linking the discussion this came from: #3829 (comment)

@@ -255,7 +255,7 @@ def test_classifier(output, task, boosting_type, tree_learner, client):
             'bagging_fraction': 0.9,
         })
     elif boosting_type == 'goss':
-        params['top_rate'] = 0.5
+        params['top_rate'] = 0.7
Collaborator:

it looks like this was added since I last reviewed (981084f). Can you please explain why it's necessary?

Collaborator Author:

test_classifier became flaky in this PR. I assume it's because previously we weren't performing distributed training, or at least not every time, so adding this generated some failures in multiclass classification for data_parallel-dart, voting_parallel-rf (this one is very surprising, given that the atol is 0.8), voting_parallel-gbdt, voting_parallel-dart, and voting_parallel-goss. Most of them are for dataframes with categoricals, but there are a couple with sparse matrices. I have to debug them to see what's actually happening; this is a very simple classification problem and I'd expect to get a perfect score with little effort. I'll ping you here once I'm done, but it could take a bit haha.

Collaborator:

got it, thanks! Let me know if you need any help

Collaborator:

> I'd expect to get a perfect score with little effort

Given the small dataset sizes we use in tests, I think it would be useful to set min_data_in_leaf: 0 everywhere. That might improve the predictability of the results.
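For illustration, the suggested tweak would look something like the following; this is a hypothetical change, not part of this PR:

```python
# With only a handful of rows per worker, the default min_data_in_leaf=20 can
# forbid most splits, so relaxing it should make the fitted trees (and
# therefore the test results) more predictable.
params = {
    # ... existing per-test params ...
    "min_data_in_leaf": 0,   # alias: min_child_samples
}
```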

Collaborator Author:

Sorry this is taking so long; I haven't had much time and I'm really confused by this. The same data point makes the test fail even for data_parallel with gbdt. I'm trying to figure out exactly what's going on: I have the test in a while loop and it eventually fails because of that one data point, and I'm not sure what's wrong with it haha.

Collaborator Author:

Btw, setting min_data_in_leaf=0 gives this error: LightGBMError: Check failed: (best_split_info.right_count) > (0) at /hdd/github/LightGBM/src/treelearner/serial_tree_learner.cpp, line 663. Do you think this could be related to #4026? This data is shuffled, but I think forcing few samples in a leaf gives a better chance of getting an empty split on one of the workers.

Collaborator Author:

Here's abs(local_probas - dask_probas) per iteration for data_parallel gbdt for just that one sample (index 377):
[plot: per-iteration absolute difference between local and Dask probabilities for sample 377]
From the 7th iteration onwards the probabilities increasingly diverge; I think there's definitely something strange going on here.
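For anyone trying to reproduce this, here is a rough sketch of how such a per-iteration comparison could be computed. The names (local_clf, dask_clf, X, the sample index) are placeholders rather than the notebook's actual code, and it assumes the Dask model is turned into a local one with to_local():

```python
import numpy as np

# local_clf: lgb.LGBMClassifier fit on the full data
# dask_clf:  lgb.DaskLGBMClassifier fit on the Dask collections
sample_idx = 377
x_row = X[[sample_idx]]            # keep a 2-D shape for predict_proba

local_model = local_clf
dask_model = dask_clf.to_local()   # plain LGBMClassifier holding the distributed model

diffs = [
    np.abs(
        local_model.predict_proba(x_row, num_iteration=i)
        - dask_model.predict_proba(x_row, num_iteration=i)
    ).max()
    for i in range(1, local_model.n_estimators + 1)
]
# diffs[i - 1] is the largest per-class probability gap after i iterations
```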

@jameslamb (Collaborator):

@jmoralez we fixed a few more CI issues today (#4168, #4158, #4167). Whenever you return to this, please update to the latest master. Sorry for the inconvenience.

@jmoralez (Collaborator Author) commented Apr 10, 2021

Will do. I'm actually looking into this right now; it seems to be related to the amount of data each worker gets. With more partitions both workers get data, but I believe it may not always be balanced, and in those cases the tests fail: adding a client.rebalance() and bumping the threshold for the probas from 0.03 to 0.05 makes the tests pass. What are your thoughts about using rebalance here? (There's a sketch of both options below the edit.)

Edit: Adding a rebalance kind of defeats the purpose of this PR haha. Let me just keep adjusting the chunksize a bit; 100 looks promising.
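Illustrative only, two ways to influence how the data ends up on the workers; dX and ddf stand in for the collections returned by _create_data:

```python
# 1) explicitly even out data across workers after the collections exist
client.rebalance()

# 2) or build the collections with smaller chunks (e.g. ~100 rows per chunk)
#    so it's very unlikely that one worker ends up holding everything
dX_small = dX.rechunk({0: 100})                # dask.array: 100 rows per chunk
ddf_small = ddf.repartition(npartitions=10)    # dask.dataframe equivalent
```

Option 2 keeps the point of the PR (distribution happens naturally from the chunking) rather than forcing it after the fact.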

@jameslamb (Collaborator):

> Adding a rebalance kind of defeats the purpose of this PR haha. Let me just keep adjusting the chunksize a bit; 100 looks promising.

Yeah, exactly haha. You could also try increasing n_samples from 1e3 to 1e4 or something! It's possible that you're running into some problems that are more severe with tiny amounts of data. pytest will report the total timing so you should be able to see (and share with reviewers) the impact of using more data on runtime.

@jameslamb (Collaborator):

And the same goes for increasing n_estimators or num_leaves. If training a slightly larger model improves the stability of the tests, that's totally fine.

@jmoralez (Collaborator Author):

It's very strange because it's always the same point that gets a lower score, and it doesn't seem logical: the local model gives 99.8% and the distributed one 93.4%.
[scatter plot of the generated classification data, zoomed in on one cluster]
The red dot is the one that causes the test to fail.

@jameslamb (Collaborator):

Sorry, I don't understand the axes in that plot or what you mean by "the local model gives 99.8%".

@jmoralez (Collaborator Author):

Haha, sorry. This is a zoom of the lower-left section of the data that gets generated for classification. These points all correspond to the same class (centered at [-4, -4]); the axes are the continuous features. The percentages are the probabilities that each model gives to the class (class 1 in this case). It seems strange that the red dot gets a lower probability given that it's not that far from the center and there are other points further away.
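For context, the data being described is roughly of this form; an illustrative sketch, not the exact _create_data code:

```python
from sklearn.datasets import make_blobs

# well-separated Gaussian blobs in 2-D; one cluster sits around (-4, -4),
# which is the region zoomed in on in the plot above
X, y = make_blobs(n_samples=1_000, centers=[[-4, -4], [4, 4]], random_state=42)
```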

@jmoralez (Collaborator Author):

@jameslamb I just tried this again today and that single point still makes the test fail. Should I close this? Or I can try to make a notebook for you to debug; maybe you can find something else. I'm not sure if #4220 is the reason or if it's something else.

@jameslamb (Collaborator):

> @jameslamb I just tried this again today and that single point still makes the test fail. Should I close this? Or I can try to make a notebook for you to debug; maybe you can find something else. I'm not sure if #4220 is the reason or if it's something else.

so weird! Thanks for all your investigation so far.

Could you merge the latest master into this branch? I can pick it up from here and see if I find anything else. Sometimes when you've been staring at the same problem for this long, it just requires a second set of eyes. I want to see if we can figure this out, because I think there's a chance we'll uncover a bug in distributed training similar to #4026.

@jmoralez (Collaborator Author):

I have a notebook that I've been using for this; I can maybe upload it here. Do you think that'd help you?

@jmoralez (Collaborator Author):

I uploaded my notebook here. I forgot to specify the CPUs, so it only has two, but changing threads_per_worker to 1 in the Client allows you to replicate the issue there.
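For reference, a minimal sketch of the cluster setup being described (the exact notebook may differ):

```python
from distributed import Client, LocalCluster

# two workers with one thread each: enough to reproduce the discrepancy locally
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)
```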

@jameslamb (Collaborator):

Perfect, thanks!

@jmoralez closed this on Jun 1, 2022
@github-actions (bot):

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators on Aug 19, 2023