[Bug?] [CLI] Inconsistent Results When Resuming Training from a Saved Model #6610
Comments
Thanks for using LightGBM. I need some help understanding this report. Your config has … so my read is:
In that case, I wouldn't expect Runs 2 and 3 to produce the same output as Run 1: they're starting boosting from a different place. If I've misunderstood, can you please clarify how your setup works? It would also help if you could minimize this example to the smallest possible set of non-default parameter values needed to reproduce the behavior you're seeing. For example, if you use …
I stop the iterations manually earlier. I save every iteration (snapshot_freq = 1), so cpu\train_model_run1.txt.snapshot_iter_4 is the model after the 4th iteration of Run 1. Steps I did:
Here I'm expecting "Run 1, iteration 5" to have the same model and result as "Run 2, iteration 1", but they are different. Let me know if this is still unclear. TY
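For anyone following along, a minimal sketch of this snapshot-and-resume setup (parameter names from the LightGBM docs; paths and values illustrative, not the exact config from this issue):

```
# Run 1: train and snapshot every iteration
task            = train
data            = train.bin
valid           = valid.bin
num_iterations  = 6
snapshot_freq   = 1              # writes <output_model>.snapshot_iter_N each iteration
output_model    = train_model_run1.txt

# Run 2: same config, but resume from the snapshot after iteration 4
# input_model   = train_model_run1.txt.snapshot_iter_4
```

The expectation being tested is that iteration 1 of Run 2 reproduces iteration 5 of Run 1.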
Ohhhhhh I see. Ok, I see at least one reason this would be different... you are using random sampling of rows and columns:
Could you try setting …
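As a sketch of what pinning the sampling behaviour could look like (these are real LightGBM parameter names, but the values are illustrative, not a recommendation from this thread):

```
# Either disable sampling entirely...
bagging_fraction       = 1.0     # no row subsampling
feature_fraction       = 1.0     # no column subsampling

# ...or pin the seeds that drive it
seed                   = 42      # master seed
bagging_seed           = 3       # row-sampling seed
feature_fraction_seed  = 2       # column-sampling seed
deterministic          = true    # reproducible behaviour (may cost speed;
                                 # the docs pair it with force_row_wise/force_col_wise)
```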
I will try with a fixed seed and as many parameters as possible set to default; hopefully this is it. Stay tuned. Though I do not understand Run 3:
I'm going to try with only …
Hello, I ran the test and it seems there is something fishy with big files. Run 1:
For Run 2 I'm expecting the same result as Run 1, iteration 7.
As you can see, this is very different. I hope this helps.
Can the ticket be tagged as 'bug' instead of 'question'?
It isn't obviously a bug yet. Sorry, but I'm finding it very difficult to understand exactly how you're getting these logs and what precisely you're saying the problem is. For example, I don't understand the pattern you're using for the model file names, but that pattern seems to contain relevant information. I will try to put together an example; then maybe you can tell me how what I've done differs from what you've done.
For info, I'm using (huge) bin files, but I do not think this matters. The idea is:
- run a training of 6 iterations (Run 1), saving each iteration's model
- then compare model 6 from Run 1 and Run 2
For me they are different.
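In CLI terms, the comparison described above would look roughly like this (file names follow the snapshot pattern from earlier in the thread; these are not the exact commands from the issue):

```
# Run 1: 6 iterations, snapshot every iteration
lightgbm config=train.conf num_iterations=6 snapshot_freq=1 output_model=run1.txt

# Run 2: resume from the iteration-4 snapshot for 2 more iterations
lightgbm config=train.conf num_iterations=2 input_model=run1.txt.snapshot_iter_4 output_model=run2.txt

# The final models can then be compared directly
diff run1.txt run2.txt
```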
Description
Hello,
I get the same results if I start the same training twice on my big dataset (a bin file).
I get different results if I start a new training from a saved model.
Details
Run 1; note that iteration 5 is:
[LightGBM] [Info] Iteration:5, training multi_error : 0.256301
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006
Run 2:
If I set input_model to model 4 as a starting point, then I get:
[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
Which is very different from iteration 5 above
Run 3:
If I run again, I get the same result as in the second run:
[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
Question: Shouldn't Runs 2 and 3 (iteration 1) have the same result as Run 1 (iteration 5)?
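One plausible explanation (assuming the config uses random row/column sampling, as discussed later in the thread): a resumed run re-seeds its RNG from scratch, so iteration 1 of Run 2 draws the same random sample as iteration 1 of Run 1, not iteration 5. A toy numpy sketch of just the seeding behaviour, nothing LightGBM-specific:

```python
import numpy as np

def sampled_rows(seed, n_iters, n_rows=1000, k=100):
    # One RNG per run, seeded once at startup, as a fresh CLI run would be.
    rng = np.random.default_rng(seed)
    return [tuple(rng.choice(n_rows, size=k, replace=False))
            for _ in range(n_iters)]

run1 = sampled_rows(seed=42, n_iters=6)  # uninterrupted run of 6 iterations
run2 = sampled_rows(seed=42, n_iters=2)  # "resumed" run re-seeds from scratch

print(run2[0] == run1[0])  # True: Run 2 iter 1 repeats Run 1 iter 1's sample
print(run2[0] == run1[4])  # False: it does not see Run 1 iter 5's sample
```

If LightGBM behaves like this, the resumed booster trains its first tree on a different random subsample than the original run's fifth tree did, which would explain different metrics even from an identical starting model.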
Reproducible example
Resuming seems to work for small sample training files; I only see this issue with big training files.
The model is 111MB
The training and validation bin files are 42GB
The config file looks like this:
Environment info
Win 10 Pro + LightGBM CPU mode
LightGBM version or commit hash: SHA-1: 9a76aae (from 08/09/24)
Command(s) you used to install LightGBM: I compiled it in VS 2022 and used the command line to start LightGBM.
Thanks!