[Bug?] [CLI] Inconsistent Results When Resuming Training from a Saved Model #6610

wil70 opened this issue Aug 14, 2024 · 8 comments

wil70 commented Aug 14, 2024

Description

Hello,
I get the same results if I start the same training twice on my big dataset (a bin file).
I get different results if I start a new training from a saved model.

Details

Run 1 (note iteration 5):
[LightGBM] [Info] Iteration:5, training multi_error : 0.256301
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 338.806307 seconds
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -1.568826
[LightGBM] [Info] Start training from score -0.548796
[LightGBM] [Info] Start training from score -1.541474
[LightGBM] [Info] Iteration:1, training multi_error : 0.422355
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.428234
[LightGBM] [Info] 24182.741181 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training multi_error : 0.422355
[LightGBM] [Info] Iteration:2, valid_1 multi_error : 0.428234
[LightGBM] [Info] 40458.526595 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training multi_error : 0.367912
[LightGBM] [Info] Iteration:3, valid_1 multi_error : 0.427742
[LightGBM] [Info] 56931.164557 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:4, training multi_error : 0.299521
[LightGBM] [Info] Iteration:4, valid_1 multi_error : 0.426856
[LightGBM] [Info] 73788.227205 seconds elapsed, finished iteration 4
**[LightGBM] [Info] Iteration:5, training multi_error : 0.256301
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006**
[LightGBM] [Info] 90687.526692 seconds elapsed, finished iteration 5

Run 2:
If I set input_model to the model from iteration 4 as a starting point, then I get
[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
This is very different from iteration 5 above.

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 331.288340 seconds
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
**[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233**
[LightGBM] [Info] 17028.539284 seconds elapsed, finished iteration 1

Run 3:
If I run again, I get the same result as in the second run:
[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233

> [LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
> [LightGBM] [Info] Finished initializing training
> [LightGBM] [Info] Started training...
> [LightGBM] [Info] Iteration:1, training multi_error : 0.257515
> [LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
> [LightGBM] [Info] 28231.119183 seconds elapsed, finished iteration 1

Question: Shouldn't Runs 2 and 3 (iteration 1) have the same result as Run 1 (iteration 5)?

Reproducible example

This seems to work fine for small sample training files; I see this issue with big training files.
The model is 111 MB.
The training and validation bin files are 42 GB.
The config file looks like this:

task = train
num_threads = 12
force_col_wise=true
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
max_bin = 255
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
num_trees = 10000
learning_rate = 0.1
output_model = "cpu\train_model_run1.txt"
# input_model = "cpu\train_model_run1.txt.snapshot_iter_4"
two_round = true
snapshot_freq = 1
feature_pre_filter=False
lambda_l1=2.183399274499516e-06
lambda_l2=1.3637583114059118e-08
num_leaves=117
feature_fraction=0.7
bagging_fraction=0.9852124460751488
bagging_freq=5
min_child_samples=20

Environment info

Win 10 Pro + LightGBM CPU mode

LightGBM version or commit hash: SHA-1 9a76aae, from 08/09/24

Command(s) you used to install LightGBM: I compiled it in VS 2022 and use the command line to start LightGBM

Thanks!

w
jameslamb (Collaborator) commented:

Thanks for using LightGBM.

I need some help understanding this report. Your config has num_trees = 10000, so training will proceed for 10,000 boosting rounds (because you're not performing early stopping).

So my read is:

  • Run 1: trains for 10,000 iterations, model saved as cpu/train_model_run5.txt
  • Runs 2 and 3: load model from cpu/train_model_run5.txt and then train for an additional 10,000 iterations

In that case, I wouldn't expect runs 2 and 3 to produce the same output as Run 1... they're starting boosting from a different place.

If I've misunderstood, can you please clarify how your setup works? It would also be helpful if you could minimize this example to the smallest possible set of non-default parameter values necessary to reproduce the behavior you're seeing. For example, if you use num_trees = 10 and num_leaves = 7 throughout, do you still see this behavior?


wil70 commented Aug 14, 2024

> In that case, I wouldn't expect runs 2 and 3 to produce the same output as Run 1... they're starting boosting from a different place.

I stop the iterations manually earlier. I save every iteration (snapshot_freq = 1), so cpu\train_model_run1.txt.snapshot_iter_4 is the model from the 4th iteration of Run 1.

Steps I did:

  1. I ran Run1 with 5 iterations; each iteration saves a model.
output_model = "cpu\train_model_run1.txt"
# input_model = "cpu\train_model_run1.txt.snapshot_iter_4"
  2. Then I stopped Run1 manually.
  3. I started Run2 pointing input_model to the snapshot from the 4th iteration of Run1.
output_model = "cpu\train_model_run2.txt"
input_model = "cpu\train_model_run1.txt.snapshot_iter_4"

Here I'm expecting "Run1 Iteration 5" to have the same model and result as "Run2 Iteration 1", but they are different.

Let me know if this is still not clear?

TY

jameslamb (Collaborator) commented:

Ohhhhhh I see.

Ok, I see at least one reason this would be different... you are using random sampling of rows and columns:

feature_fraction=0.7
bagging_fraction=0.9852124460751488
bagging_freq=5

Could you try setting seed to a value other than 0 in your config and repeating your investigation? e.g. seed = 708.
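
To make that concrete, here is a rough sketch with the Python package (my own toy illustration, not from this thread; the parameter names match the CLI ones) of checking that a fixed seed makes two fresh, otherwise-identical runs reproducible:

import numpy as np
import lightgbm as lgb

# hypothetical toy data, only to illustrate the seed suggestion
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = rng.integers(0, 5, size=1000)

params = {
    "objective": "multiclass",
    "num_class": 5,
    "feature_fraction": 0.7,                 # random column subsampling each iteration
    "bagging_fraction": 0.9852124460751488,  # random row subsampling...
    "bagging_freq": 5,                       # ...re-drawn every 5 iterations
    "seed": 708,                             # master seed used to derive the sampling seeds
    "deterministic": True,
    "num_threads": 1,
    "verbose": -1,
}

# two fresh runs with the same fixed seed are expected to produce identical models
b1 = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
b2 = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
print(b1.model_to_string() == b2.model_to_string())

If two fresh runs only match once the seed is fixed, the sampling randomness explains part of the difference; if they match but a resumed run still differs, the problem is specific to continued training.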


wil70 commented Aug 14, 2024

I will try with a fixed seed and with as many parameters as possible set to their defaults; hopefully this is it. Stay tuned.

Though I do not understand:

  1. Why there would be such a difference in the results
  2. Why Run2 and Run3 have the same results but not Run1

Run3:

output_model = "cpu\train_model_run3.txt"
input_model = "cpu\train_model_run1.txt.snapshot_iter_4"

I'm going to try with only:

task = train
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
output_model = "cpu_test\train_model_run1_.txt"
#input_model = "cpu_test\train_model_run1_.txt"
snapshot_freq = 1
seed = 708


wil70 commented Aug 15, 2024

Hello, I ran the test and it seems there is something fishy with big files.
Let me know.

Run 1

task = train
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
output_model = "cpu_test\train_model_run1_.txt"
#input_model = "cpu_test\train_model_run1_.txt.snapshot_iter_6"
snapshot_freq = 1
seed = 708
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -1.568826
[LightGBM] [Info] Start training from score -0.548796
[LightGBM] [Info] Start training from score -1.541474
[LightGBM] [Info] Iteration:1, training multi_error : 0.422355
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.428234
[LightGBM] [Info] 6802.219645 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training multi_error : 0.422355
[LightGBM] [Info] Iteration:2, valid_1 multi_error : 0.428234
[LightGBM] [Info] 13876.323698 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training multi_error : 0.422289
[LightGBM] [Info] Iteration:3, valid_1 multi_error : 0.428234
[LightGBM] [Info] 22276.966162 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:4, training multi_error : 0.418515
[LightGBM] [Info] Iteration:4, valid_1 multi_error : 0.428825
[LightGBM] [Info] 30843.596281 seconds elapsed, finished iteration 4
[LightGBM] [Info] Iteration:5, training multi_error : 0.409983
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.429514
[LightGBM] [Info] 39310.763918 seconds elapsed, finished iteration 5
[LightGBM] [Info] Iteration:6, training multi_error : 0.394493
[LightGBM] [Info] Iteration:6, valid_1 multi_error : 0.430006
[LightGBM] [Info] 47973.947956 seconds elapsed, finished iteration 6
[LightGBM] [Info] Iteration:7, training multi_error : 0.375952
[LightGBM] [Info] Iteration:7, valid_1 multi_error : 0.431187
[LightGBM] [Info] 56383.376031 seconds elapsed, finished iteration 7

For Run2, I'm expecting the same result as Run1 iteration 7.

task = train
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
output_model = "cpu_test\train_model_run2_.txt"
input_model = "cpu_test\train_model_run1_.txt.snapshot_iter_6"
snapshot_freq = 1
seed = 708
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Info] Finished loading data in 341.768607 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 6711.857059 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Iteration:1, training multi_error : 0.346154
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.485824
[LightGBM] [Info] 7051.144236 seconds elapsed, finished iteration 1

As you can see, this is very different.
I noticed this in the past with different huge datasets, but never found the time to investigate and report it.

I hope this helps.
Thanks,
w


wil70 commented Aug 15, 2024

Can the ticket be tagged as 'bug' instead of 'question'?

jameslamb (Collaborator) commented:

It isn't obviously a bug yet. Sorry, but I'm finding it very difficult to understand exactly how you're getting these logs and what precisely you're saying is the problem.

For example, I don't understand the pattern you're using for the model file names, but that pattern seems to contain relevant information.

I will try to put together an example, then maybe you could tell me how what I've done differs from what you've done.

@wil70 wil70 changed the title [Bug?] I get different results if I start a training from a saved model [Bug?] I get different results if I start a training from a saved model on huge input files Aug 16, 2024
@wil70 wil70 changed the title [Bug?] I get different results if I start a training from a saved model on huge input files [Bug?] I get different models when I start a training from a saved model from a previous iteration Aug 16, 2024
@wil70 wil70 changed the title [Bug?] I get different models when I start a training from a saved model from a previous iteration [Bug?] Inconsistent Results When Resuming Training from a Saved Model vs letting it go on huge input file Aug 16, 2024
@wil70 wil70 changed the title [Bug?] Inconsistent Results When Resuming Training from a Saved Model vs letting it go on huge input file [Bug?] Inconsistent Results When Resuming Training from a Saved Model Aug 16, 2024

wil70 commented Aug 16, 2024

For info, I'm using (huge) bin files, but I do not think this matters.

The idea is: run a training of 6 iterations (Run 1), saving each iteration's model.
Then run a second training (Run 2) for 1 iteration starting from iteration 5 (model 5 from Run 1), so it produces another model 6 (Run 2 model 6).

Then compare model 6 from Run 1 with model 6 from Run 2. For me they are different.
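
For the comparison step, here is a minimal sketch of how one might diff the two saved models with the Python package (the file paths are hypothetical stand-ins for the snapshot/output files discussed in this thread):

import difflib
import lightgbm as lgb

# hypothetical paths: "model 6" from Run 1 vs the model written by Run 2 after its single resumed iteration
path_run1 = "cpu_test/train_model_run1_.txt.snapshot_iter_6"
path_run2 = "cpu_test/train_model_run2_.txt"

# load both model files and render them back to text
m1 = lgb.Booster(model_file=path_run1).model_to_string()
m2 = lgb.Booster(model_file=path_run2).model_to_string()

if m1 == m2:
    print("models are identical")
else:
    # the trailing "parameters" block may legitimately differ (e.g. output_model paths),
    # so the interesting differences are in the Tree=... sections
    diff = difflib.unified_diff(m1.splitlines(), m2.splitlines(), lineterm="")
    for line in list(diff)[:40]:
        print(line)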

@jameslamb jameslamb changed the title [Bug?] Inconsistent Results When Resuming Training from a Saved Model [Bug?] [CLI] Inconsistent Results When Resuming Training from a Saved Model Aug 16, 2024