[Bug?] [CLI] Inconsistent Results When Resuming Training from a Saved Model #6610

wil70 opened this issue Aug 14, 2024 · 8 comments

wil70 commented Aug 14, 2024

Description

Hello,
I get the same results if I start the same training twice on my big dataset (a bin file).
I get different results if I start a new training from a saved model.

Details

Run 1 (note iteration 5):
[LightGBM] [Info] Iteration:5, training multi_error : 0.256301
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 338.806307 seconds
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -1.568826
[LightGBM] [Info] Start training from score -0.548796
[LightGBM] [Info] Start training from score -1.541474
[LightGBM] [Info] Iteration:1, training multi_error : 0.422355
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.428234
[LightGBM] [Info] 24182.741181 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training multi_error : 0.422355
[LightGBM] [Info] Iteration:2, valid_1 multi_error : 0.428234
[LightGBM] [Info] 40458.526595 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training multi_error : 0.367912
[LightGBM] [Info] Iteration:3, valid_1 multi_error : 0.427742
[LightGBM] [Info] 56931.164557 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:4, training multi_error : 0.299521
[LightGBM] [Info] Iteration:4, valid_1 multi_error : 0.426856
[LightGBM] [Info] 73788.227205 seconds elapsed, finished iteration 4
**[LightGBM] [Info] Iteration:5, training multi_error : 0.256301
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.430006**
[LightGBM] [Info] 90687.526692 seconds elapsed, finished iteration 5

Run 2:
If I set input_model to the model from iteration 4 as a starting point, then I get
[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
This is very different from iteration 5 above.

[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Warning] Parameter two_round works only in case of loading data directly from text file. It will be ignored when loading from binary file.
[LightGBM] [Info] Finished loading data in 331.288340 seconds
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
**[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233**
[LightGBM] [Info] 17028.539284 seconds elapsed, finished iteration 1

Run 3:
If I run again, I get the same result as in the second run:
[LightGBM] [Info] Iteration:1, training multi_error : 0.257515
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233

> [LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
> [LightGBM] [Info] Finished initializing training
> [LightGBM] [Info] Started training...
> [LightGBM] [Info] Iteration:1, training multi_error : 0.257515
> [LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.542233
> [LightGBM] [Info] 28231.119183 seconds elapsed, finished iteration 1

Question: Shouldn't Runs 2 and 3 (iteration 1) have the same result as Run 1 (iteration 5)?

Reproducible example

This seems to work fine for small sample training files; I see this issue with big training files.
The model is 111 MB.
The training and validation bin files are 42 GB.
The config file looks like this:

task = train
num_threads = 12
force_col_wise=true
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
max_bin = 255
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
num_trees = 10000
learning_rate = 0.1
output_model = "cpu\train_model_run1.txt"
# input_model = "cpu\train_model_run1.txt.snapshot_iter_4"
two_round = true
snapshot_freq = 1
feature_pre_filter=False
lambda_l1=2.183399274499516e-06
lambda_l2=1.3637583114059118e-08
num_leaves=117
feature_fraction=0.7
bagging_fraction=0.9852124460751488
bagging_freq=5
min_child_samples=20

Environment info

Win 10 Pro + LightGBM CPU mode

LightGBM version or commit hash: SHA-1 9a76aae, from 08/09/24

Command(s) you used to install LightGBM: I compiled it in VS 2022 and use the command line to start LightGBM

Thanks!

w
jameslamb (Collaborator) commented:

Thanks for using LightGBM.

I need some help understanding this report. Your config has num_trees = 10000, so training will proceed for 10,000 boosting rounds (because you're not performing early stopping).

So my read is:

  • Run 1: trains for 10,000 iterations, model saved as cpu/train_model_run5.txt
  • Runs 2 and 3: load model from cpu/train_model_run5.txt and then train for an additional 10,000 iterations

In that case, I wouldn't expect runs 2 and 3 to produce the same output as Run 1... they're starting boosting from a different place.

If I've misunderstood, can you please clarify how your setup works? It would also be helpful if you could minimize this example to the smallest possible set of non-default parameter values necessary to reproduce the behavior you're seeing. For example, if you use num_trees = 10 and num_leaves = 7 throughout, do you still see this behavior?


wil70 commented Aug 14, 2024

> In that case, I wouldn't expect runs 2 and 3 to produce the same output as Run 1... they're starting boosting from a different place.

I stop the iterations manually earlier. I save every iteration (snapshot_freq = 1), so cpu\train_model_run1.txt.snapshot_iter_4 is the model from the 4th iteration of Run 1.

Steps I did:

  1. I ran Run1 with 5 iterations; each iteration saves a model.
output_model = "cpu\train_model_run1.txt"
# input_model = "cpu\train_model_run1.txt.snapshot_iter_4"
  2. Then I stopped Run1 manually.
  3. I started Run2 pointing input_model to the snapshot from the 4th iteration of Run1.
output_model = "cpu\train_model_run2.txt"
input_model = "cpu\train_model_run1.txt.snapshot_iter_4"

Here I'm expecting "Run1 Iteration 5" to have the same model and result as "Run2 Iteration 1", but they are different.

Let me know if this is still not clear?

TY

jameslamb (Collaborator) commented:

Ohhhhhh I see.

Ok, I see at least one reason this would be different... you are using random sampling of rows and columns:

feature_fraction=0.7
bagging_fraction=0.9852124460751488
bagging_freq=5

Could you try setting seed to a value other than 0 in your config and repeating your investigation? e.g. seed = 708.
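
To make that concrete, here is a rough sketch with the Python package (my own toy illustration, not from this thread; the parameter names match the CLI ones) of checking that a fixed seed makes two fresh, otherwise-identical runs reproducible:

import numpy as np
import lightgbm as lgb

# hypothetical toy data, only to illustrate the seed suggestion
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = rng.integers(0, 5, size=1000)

params = {
    "objective": "multiclass",
    "num_class": 5,
    "feature_fraction": 0.7,                 # random column subsampling each iteration
    "bagging_fraction": 0.9852124460751488,  # random row subsampling...
    "bagging_freq": 5,                       # ...re-drawn every 5 iterations
    "seed": 708,                             # master seed used to derive the sampling seeds
    "deterministic": True,
    "num_threads": 1,
    "verbose": -1,
}

# two fresh runs with the same fixed seed are expected to produce identical models
b1 = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
b2 = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=5)
print(b1.model_to_string() == b2.model_to_string())

If two fresh runs only match once the seed is fixed, the sampling randomness explains part of the difference; if they match but a resumed run still differs, the problem is specific to continued training.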


wil70 commented Aug 14, 2024

I will try with a fixed seed and with as many parameters as possible set to their defaults; hopefully this is it. Stay tuned.

Though I do not understand:

  1. Why there would be such a difference in the results
  2. Why Run2 and Run3 have the same results but not Run1

Run3:

output_model = "cpu\train_model_run3.txt"
input_model = "cpu\train_model_run1.txt.snapshot_iter_4"

I'm going to try with only:

task = train
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
output_model = "cpu_test\train_model_run1_.txt"
#input_model = "cpu_test\train_model_run1_.txt"
snapshot_freq = 1
seed = 708


wil70 commented Aug 15, 2024

Hello, I ran the test and it seems there is something fishy with big files.
Let me know.

Run 1

task = train
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
output_model = "cpu_test\train_model_run1_.txt"
#input_model = "cpu_test\train_model_run1_.txt.snapshot_iter_6"
snapshot_freq = 1
seed = 708
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -34.538776
[LightGBM] [Info] Start training from score -1.568826
[LightGBM] [Info] Start training from score -0.548796
[LightGBM] [Info] Start training from score -1.541474
[LightGBM] [Info] Iteration:1, training multi_error : 0.422355
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.428234
[LightGBM] [Info] 6802.219645 seconds elapsed, finished iteration 1
[LightGBM] [Info] Iteration:2, training multi_error : 0.422355
[LightGBM] [Info] Iteration:2, valid_1 multi_error : 0.428234
[LightGBM] [Info] 13876.323698 seconds elapsed, finished iteration 2
[LightGBM] [Info] Iteration:3, training multi_error : 0.422289
[LightGBM] [Info] Iteration:3, valid_1 multi_error : 0.428234
[LightGBM] [Info] 22276.966162 seconds elapsed, finished iteration 3
[LightGBM] [Info] Iteration:4, training multi_error : 0.418515
[LightGBM] [Info] Iteration:4, valid_1 multi_error : 0.428825
[LightGBM] [Info] 30843.596281 seconds elapsed, finished iteration 4
[LightGBM] [Info] Iteration:5, training multi_error : 0.409983
[LightGBM] [Info] Iteration:5, valid_1 multi_error : 0.429514
[LightGBM] [Info] 39310.763918 seconds elapsed, finished iteration 5
[LightGBM] [Info] Iteration:6, training multi_error : 0.394493
[LightGBM] [Info] Iteration:6, valid_1 multi_error : 0.430006
[LightGBM] [Info] 47973.947956 seconds elapsed, finished iteration 6
[LightGBM] [Info] Iteration:7, training multi_error : 0.375952
[LightGBM] [Info] Iteration:7, valid_1 multi_error : 0.431187
[LightGBM] [Info] 56383.376031 seconds elapsed, finished iteration 7

For Run2, I'm expecting the same result as Run1 iteration 7.

task = train
device_type=cpu
boosting_type = gbdt
objective = multiclass
metric = multi_error
num_class = 5
metric_freq = 1
is_training_metric = true
header = false
data = wil10_8_data_2004_2006_split_train.csv.bin
valid_data = wil10_8_data_2004_2006_split_validate.csv.bin
output_model = "cpu_test\train_model_run2_.txt"
input_model = "cpu_test\train_model_run1_.txt.snapshot_iter_6"
snapshot_freq = 1
seed = 708
[LightGBM] [Info] Finished loading parameters
[LightGBM] [Info] Load from binary file wil10_8_data_2004_2006_split_train.csv.bin
[LightGBM] [Info] Finished loading data in 341.768607 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 6711.857059 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 278556290
[LightGBM] [Info] Number of data points in the train set: 30472, number of used features: 2398793
[LightGBM] [Info] Finished initializing training
[LightGBM] [Info] Started training...
[LightGBM] [Info] Iteration:1, training multi_error : 0.346154
[LightGBM] [Info] Iteration:1, valid_1 multi_error : 0.485824
[LightGBM] [Info] 7051.144236 seconds elapsed, finished iteration 1

As you can see, this is very different.
I noticed this in the past with different huge datasets, but never found the time to investigate and report it.

I hope this helps.
Thanks,
w


wil70 commented Aug 15, 2024

Can the ticket be tagged as 'bug' instead of 'question'?

jameslamb (Collaborator) commented:

It isn't obviously a bug yet. Sorry, but I'm finding it very difficult to understand exactly how you're getting these logs and what precisely you're saying is the problem.

For example, I don't understand the pattern you're using for the model file names, but that pattern seems to contain relevant information.

I will try to put together an example, then maybe you could tell me how what I've done differs from what you've done.

@wil70 wil70 changed the title [Bug?] I get different results if I start a training from a saved model [Bug?] I get different results if I start a training from a saved model on huge input files Aug 16, 2024
@wil70 wil70 changed the title [Bug?] I get different results if I start a training from a saved model on huge input files [Bug?] I get different models when I start a training from a saved model from a previous iteration Aug 16, 2024
@wil70 wil70 changed the title [Bug?] I get different models when I start a training from a saved model from a previous iteration [Bug?] Inconsistent Results When Resuming Training from a Saved Model vs letting it go on huge input file Aug 16, 2024
@wil70 wil70 changed the title [Bug?] Inconsistent Results When Resuming Training from a Saved Model vs letting it go on huge input file [Bug?] Inconsistent Results When Resuming Training from a Saved Model Aug 16, 2024

wil70 commented Aug 16, 2024

For info, I'm using (huge) bin files, but I do not think this matters.

The idea is: run a training of 6 iterations (Run 1), saving each iteration's model.
Then run a second training (Run 2) for 1 iteration starting from iteration 5 (model 5 from Run 1), so it produces another model 6 (Run 2 model 6).

Then compare model 6 from Run 1 with model 6 from Run 2. For me they are different.
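
For the comparison step, here is a minimal sketch of how one might diff the two saved models with the Python package (the file paths are hypothetical stand-ins for the snapshot/output files discussed in this thread):

import difflib
import lightgbm as lgb

# hypothetical paths: "model 6" from Run 1 vs the model written by Run 2 after its single resumed iteration
path_run1 = "cpu_test/train_model_run1_.txt.snapshot_iter_6"
path_run2 = "cpu_test/train_model_run2_.txt"

# load both model files and render them back to text
m1 = lgb.Booster(model_file=path_run1).model_to_string()
m2 = lgb.Booster(model_file=path_run2).model_to_string()

if m1 == m2:
    print("models are identical")
else:
    # the trailing "parameters" block may legitimately differ (e.g. output_model paths),
    # so the interesting differences are in the Tree=... sections
    diff = difflib.unified_diff(m1.splitlines(), m2.splitlines(), lineterm="")
    for line in list(diff)[:40]:
        print(line)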

@jameslamb jameslamb changed the title [Bug?] Inconsistent Results When Resuming Training from a Saved Model [Bug?] [CLI] Inconsistent Results When Resuming Training from a Saved Model Aug 16, 2024