
The more training rounds, the lower the GPU usage rate #6605

HonLZL opened this issue Aug 13, 2024 · 5 comments

@HonLZL

HonLZL commented Aug 13, 2024

I wanted to test training speed in Python.
I tried to replicate the GPU (L40) and CPU (28 cores) experiment with the HIGGS dataset. The following are the experimental results.

num_iterations=500: the CUDA version (28 s) was faster than the CPU version (71 s).
num_iterations=5000: the CUDA version (570 s) was slower than the CPU version (403 s).

Within ten minutes, Volatile GPU-Util gradually decreased from 80% to under 10%.

Dataset and parameter settings from: https://github.com/microsoft/LightGBM/blob/master/docs/GPU-Tutorial.rst
Dataset preparation from: https://github.com/guolinke/boosting_tree_benchmarks/blob/master/data/higgs2libsvm.py

  • operating system: Ubuntu 20.04
  • LightGBM version: tag==4.5.0
  • no other significant processes competing for CPU and memory were running on the machine at the same time

Code:

CPU test:

import lightgbm as lgb
import time
params = {
    "max_bin": 63,
    "num_leaves": 255,
    "num_iterations": 500,
    "learning_rate": 0.1,
    "tree_learner": "serial",
    "task": "train",
    "is_training_metric": "false",
    "min_data_in_leaf": 1,
    "min_sum_hessian_in_leaf": 100,
    "ndcg_eval_at": [1, 3, 5, 10],
    "device": "cpu"
}
dtrain = lgb.Dataset("higgs.train")
t0 = time.time()
gbm = lgb.train(
    params,
    train_set=dtrain
)
t1 = time.time()
print("cpu version elapse time: {}".format(t1 - t0))

GPU test:

import lightgbm as lgb
import time
params = {
    "max_bin": 63,
    "num_leaves": 255,
    "num_iterations": 500,
    "learning_rate": 0.1,
    "tree_learner": "serial",
    "task": "train",
    "is_training_metric": "false",
    "min_data_in_leaf": 1,
    "min_sum_hessian_in_leaf": 100,
    "ndcg_eval_at": [1, 3, 5, 10],
    "device": "cuda",
    "gpu_platform_id": 0,
    "gpu_device_id": 0
}
dtrain = lgb.Dataset("higgs.train")
t0 = time.time()
gbm = lgb.train(
    params,
    train_set=dtrain
)
t1 = time.time()
print("gpu version elapse time: {}".format(t1 - t0))
@jameslamb
Collaborator

Excellent report, thanks very much!

Could you try installing pandas and inspecting the results of gbm.trees_to_dataframe(), to see if maybe the later trees are very shallow?

That could be one reason for lower GPU utilization... the split-finding part of training can benefit from parallelization, but there's a sync-up after each search where the model has to be updated. I wonder if maybe in the later iterations, LightGBM is training much shallower trees (and therefore spending proportionally more time in those non-parallelized code paths).

num_leaves=255 does not guarantee that every tree will have 255 leaves.

LightGBM will stop growing a particular tree under a few conditions:

  • no remaining splits which provide gain >= min_gain_to_split
  • no remaining splits which satisfy min_data_in_leaf or min_sum_hessian_in_leaf
  • no remaining splits which also satisfy interaction_constraints or monotone_constraints
  • any of the above, limited by max_depth

Unrelated, some notes on those parameters:

# this is the default, you can omit this
"tree_learner": "serial"

# these are only relevant for the CLI, omit them when using the Python package
"task": "train"
"is_training_metric": "false"

@HonLZL
Author

HonLZL commented Aug 15, 2024

Could you try installing pandas and inspecting the results of gbm.trees_to_dataframe(), to see if maybe the later trees are very shallow?

Glad to receive your reply! I ran 5,000 rounds using CUDA; here is a sample of the trees_to_dataframe() output.

,tree_index,node_depth,node_index,left_child,right_child,parent_index,split_feature,split_gain,threshold,decision_type,missing_direction,missing_type,value,weight,count
2475998,4864,16,4864-L121,,,4864-S194,,,,,,,0.006992752334214716,149.0,149
2476998,4866,16,4866-S101,4866-L98,4866-L102,4866-S100,Column_6,1.705680012702942,-0.2550096362829208,<=,left,None,-0.00819252,531.0,531
2477998,4868,13,4868-L171,,,4868-S175,,,,,,,-0.011479730841377055,134.0,134
2478998,4870,12,4870-S18,4870-S48,4870-L19,4870-S17,Column_21,0.6902909874916077,0.9225429296493531,<=,left,None,-0.00471234,529.0,529
2479998,4872,11,4872-S171,4872-S172,4872-S176,4872-S170,Column_14,1.0827800035476685,0.16720353066921237,<=,left,None,-0.0045852,1004.0,1004
2480998,4874,21,4874-L71,,,4874-S75,,,,,,,0.010258754315588665,368.0,368
2481998,4876,11,4876-L47,,,4876-S46,,,,,,,0.0013771386017987898,479.0,479
2482998,4878,7,4878-S24,4878-L13,4878-S25,4878-S23,Column_26,1.3878200054168701,0.758811503648758,<=,left,None,0.00464105,648.0,648
2483998,4880,22,4880-S212,4880-L209,4880-L213,4880-S208,Column_25,1.2470799684524536,0.9166806042194368,<=,left,None,-0.00344305,506.0,506
2484998,4882,12,4882-S212,4882-S213,4882-S219,4882-S206,Column_9,1.0044300556182861,0.9594475924968721,<=,left,None,0.00119075,6424.0,6424
2485998,4884,14,4884-L216,,,4884-S222,,,,,,,0.003906279219997473,252.0,252
2486998,4886,11,4886-S128,4886-S129,4886-S131,4886-S127,Column_1,1.3676400184631348,-1.3494878411293028,<=,left,None,0.0037433,1885.0,1885
2487998,4888,6,4888-S106,4888-L2,4888-L107,4888-S105,Column_10,1.7062599658966064,-2.23167073726654,<=,left,None,-0.00077607,7684.0,7684
2488998,4889,13,4889-S50,4889-L50,4889-L51,4889-S49,Column_19,1.262369990348816,0.22485540062189105,<=,left,None,-0.00118126,378.0,378
2489998,4891,14,4891-S231,4891-S232,4891-L232,4891-S216,Column_27,0.9564549922943115,0.9873551428318025,<=,left,None,-0.00330718,810.0,810
2490998,4893,13,4893-L237,,,4893-S236,,,,,,,0.0020112752702087164,250.0,250
2491998,4895,13,4895-S56,4895-S57,4895-L57,4895-S55,Column_3,1.305359959602356,0.5652261972427369,<=,left,None,0.00609067,383.0,383
2492998,4897,13,4897-S218,4897-L218,4897-S226,4897-S217,Column_9,1.62663996219635,1.2169202566146853,<=,left,None,-0.00209578,11215.0,11215
2493998,4899,19,4899-L229,,,4899-S228,,,,,,,-0.014312340053603859,111.0,111
2494998,4901,14,4901-S125,4901-L117,4901-S134,4901-S119,Column_0,1.1668599843978882,0.5672366917133332,<=,left,None,0.00105788,609.0,609
2495998,4903,15,4903-S144,4903-L132,4903-L145,4903-S131,Column_3,0.6262369751930237,1.4257535934448244,<=,left,None,9.80367e-05,296.0,296
2496998,4905,11,4905-S50,4905-L50,4905-L51,4905-S49,Column_18,1.2204300165176392,0.6936975121498109,<=,left,None,0.00784372,407.0,407
2497998,4907,10,4907-L72,,,4907-S72,,,,,,,-0.00047845793306461624,16472.0,16472
2498998,4909,13,4909-S64,4909-S126,4909-L65,4909-S63,Column_26,1.1328099966049194,1.1948313713073733,<=,left,None,-0.00659668,485.0,485
2499998,4911,15,4911-S252,4911-L185,4911-S253,4911-S251,Column_0,1.3360899686813354,0.5170921981334687,<=,left,None,-0.00218027,956.0,956
2500998,4913,11,4913-S149,4913-L149,4913-S241,4913-S148,Column_4,1.2409199476242065,-0.9571563005447387,<=,left,None,-0.00313243,2587.0,2587
2501998,4915,17,4915-L183,,,4915-S182,,,,,,,-0.01039329694198946,104.0,104
2502998,4917,14,4917-L218,,,4917-S217,,,,,,,0.0029182628467818027,134.0,134
2503998,4919,20,4919-L131,,,4919-S130,,,,,,,0.003046603372175777,267.0,267
2504998,4921,17,4921-S125,4921-S126,4921-S129,4921-S84,Column_23,0.8258450031280518,0.9891601204872132,<=,left,None,-0.00102175,1806.0,1806
2505998,4923,10,4923-S40,4923-S41,4923-S42,4923-S39,Column_14,1.3672300577163696,0.1262423396110535,<=,left,None,0.00479201,770.0,770
2506998,4925,16,4925-L144,,,4925-S143,,,,,,,-0.01314376931544688,158.0,158
2507998,4927,14,4927-L149,,,4927-S148,,,,,,,-0.012864420435356875,108.0,108
2508998,4929,19,4929-L141,,,4929-S149,,,,,,,-0.01445356372371316,100.0,100
2509998,4931,16,4931-S127,4931-L127,4931-L128,4931-S126,Column_8,1.6426000595092773,1.6298071146011355,<=,left,None,-0.00694278,274.0,274
2510998,4933,18,4933-S242,4933-L140,4933-L243,4933-S174,Column_10,1.2257100343704224,0.6037689745426179,<=,left,None,-0.00464622,682.0,682
2511998,4935,15,4935-L71,,,4935-S70,,,,,,,-0.006048560484989801,110.0,110
2512998,4937,14,4937-S190,4937-S191,4937-L191,4937-S189,Column_9,1.0775500535964966,0.9248241186141969,<=,left,None,0.00346585,343.0,343
2513998,4939,10,4939-L14,,,4939-S13,,,,,,,0.008209223070969949,104.0,104
2514998,4941,11,4941-S12,4941-S15,4941-S13,4941-S9,Column_2,0.9452850222587585,0.3386045694351197,<=,left,None,0.00152447,2355.0,2355
2515998,4943,8,4943-S147,4943-S150,4943-S148,4943-S132,Column_7,0.6244350075721741,1.3059919476509096,<=,left,None,-0.000941328,5570.0,5570
2516998,4944,14,4944-L53,,,4944-S53,,,,,,,-0.010817443513866131,153.0,153
2517998,4946,12,4946-L120,,,4946-S119,,,,,,,-0.00032419846042009684,106351.0,106351
2518998,4948,16,4948-L224,,,4948-S224,,,,,,,-0.012146889258367129,117.0,117
2519998,4950,13,4950-S172,4950-S173,4950-S174,4950-S168,Column_13,0.7691389918327332,0.8014479875564576,<=,left,None,-0.00386439,729.0,729
2520998,4952,16,4952-S237,4952-L237,4952-L238,4952-S236,Column_13,0.8237029910087585,1.211699426174164,<=,left,None,-0.00339503,301.0,301
2521998,4954,18,4954-S170,4954-L168,4954-L171,4954-S167,Column_19,0.9019380211830139,-1.0000000180025095e-35,<=,left,None,0.00725262,244.0,244
2522998,4956,13,4956-L102,,,4956-S101,,,,,,,0.004279216547811973,1058.0,1058
2523998,4958,18,4958-L86,,,4958-S95,,,,,,,-0.009275815569726684,116.0,116
2524998,4960,17,4960-L245,,,4960-S251,,,,,,,0.0021101277049967123,188.0,188
2525998,4962,19,4962-L77,,,4962-S76,,,,,,,-0.015216302921784657,106.0,106
2526998,4964,14,4964-S190,4964-L178,4964-S191,4964-S177,Column_24,0.970412015914917,0.6755685508251191,<=,left,None,-0.00333743,620.0,620
2527998,4966,14,4966-L73,,,4966-S76,,,,,,,0.011329781752841099,137.0,137
2528998,4968,14,4968-S32,4968-S174,4968-S33,4968-S29,Column_19,1.5902700424194336,-0.6742075979709624,<=,left,None,0.000892957,5444.0,5444
2529998,4970,11,4970-L200,,,4970-S199,,,,,,,0.006320520037044324,336.0,336
2530998,4972,13,4972-S225,4972-L131,4972-L226,4972-S133,Column_15,0.7520939707756042,-0.3910080790519714,<=,left,None,-0.00766963,295.0,295
2531998,4974,15,4974-L139,,,4974-S143,,,,,,,0.009719555713627415,102.0,102
2532998,4976,13,4976-L131,,,4976-S130,,,,,,,-0.008448866840260916,195.0,195
2533998,4978,15,4978-L239,,,4978-S238,,,,,,,0.01019938246213964,112.0,112
2534998,4980,9,4980-L42,,,4980-S42,,,,,,,0.012193971863459975,126.0,126
2535998,4982,15,4982-L34,,,4982-S33,,,,,,,-0.009220713326855428,192.0,192
2536998,4984,9,4984-S26,4984-L25,4984-L27,4984-S25,Column_1,1.1161600351333618,-0.04436973668634891,<=,left,None,-0.00566565,339.0,339
2537998,4986,13,4986-L236,,,4986-S236,,,,,,,-0.010996394506930211,221.0,221
2538998,4988,18,4988-L222,,,4988-S233,,,,,,,-0.001222929739662567,1392.0,1392
2539998,4990,13,4990-L167,,,4990-S166,,,,,,,0.0017196273834156187,19678.0,19678
2540998,4992,15,4992-L47,,,4992-S46,,,,,,,0.019410366345196963,100.0,100
2541998,4994,11,4994-S136,4994-L109,4994-L137,4994-S118,Column_8,0.7672929763793945,1.0000000180025095e-35,<=,left,None,-0.00867073,459.0,459
2542998,4996,14,4996-S138,4996-L112,4996-S139,4996-S111,Column_23,0.5022619962692261,0.986467868089676,<=,left,None,0.000172803,445.0,445
2543998,4998,9,4998-S76,4998-S81,4998-S77,4998-S75,Column_11,1.4823499917984009,-0.8421601653099059,<=,left,None,8.01628e-06,93124.0,93124
2544998,4999,11,4999-L179,,,4999-S178,,,,,,,-0.00022046581974725736,1660.0,1660

When training with device=cuda, as the number of training iterations increases, GPU utilization decreases and the time spent on each iteration increases. This does not happen with device=cpu.
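
To quantify that slowdown, here is a rough sketch of timing each boosting iteration with a callback (assuming params and dtrain are defined as in the scripts above; time_iterations is just an illustrative name):

import time
import lightgbm as lgb

iteration_times = []
_last = [time.perf_counter()]

def time_iterations(env):
    # record the wall-clock time spent on the iteration that just finished
    now = time.perf_counter()
    iteration_times.append(now - _last[0])
    _last[0] = now

gbm = lgb.train(params, train_set=dtrain, callbacks=[time_iterations])
print("mean time, first 100 iterations:", sum(iteration_times[:100]) / 100)
print("mean time, last 100 iterations: ", sum(iteration_times[-100:]) / 100)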

Unrelated, some notes on those parameters:

# this is the default, you can omit this
"tree_learner": "serial"

# these are only relevant for the CLI, omit them when using the Python package
"task": "train"
"is_training_metric": "false"

Thank you very much for your suggestions! Looking forward to your reply!

@jameslamb
Collaborator

Sorry, my request was unclear.

I'm not looking for a random sample of that dataframe. Could you use that output to see if there is a difference in the number of leaves in each tree?

A finding like "the trees in later iterations have fewer leaves" would be very informative here.

I'm looking for output similar to this:

tree 0: 255 leaves
...
tree 100: 75 leaves
...
tree 200: 25 leaves
...
tree 300: 3 leaves
...
tree 400: 3 leaves
...
tree 499: 3 leaves
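
Something like this sketch would produce that kind of summary (assuming pandas is installed and gbm is the Booster returned by lgb.train(); leaf rows are the ones with no children in trees_to_dataframe()):

df = gbm.trees_to_dataframe()
leaves_per_tree = df[df["left_child"].isna()].groupby("tree_index").size()

for tree_index, n_leaves in leaves_per_tree.items():
    # print every 100th tree plus the last one
    if tree_index % 100 == 0 or tree_index == leaves_per_tree.index.max():
        print(f"tree {tree_index}: {n_leaves} leaves")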

@HonLZL
Author

HonLZL commented Aug 18, 2024

Sorry, my request was unclear.

I'm not looking for a random sample of that dataframe. Could you use that output to see if there is a difference in the number of leaves in each tree?

A finding like "the trees in later iterations have fewer leaves" would be very informative here.

I'm looking for output similar to this:

tree 0: 255 leaves
...
tree 100: 75 leaves
...
tree 200: 25 leaves
...
tree 300: 3 leaves
...
tree 400: 3 leaves
...
tree 499: 3 leaves

Hi, sorry for my late reply. I used "cat model.txt | grep -A 1 Tree=" to check the saved model, and got:

Tree=4975
num_leaves=255
--
Tree=4976
num_leaves=255
--
Tree=4977
num_leaves=255
--
Tree=4978
num_leaves=255
--
Tree=4979
num_leaves=255
--
Tree=4980
num_leaves=255
--
Tree=4981
num_leaves=255
--
Tree=4982
num_leaves=255
--
Tree=4983
num_leaves=255
--
Tree=4984
num_leaves=255
--
Tree=4985
num_leaves=255
--
Tree=4986
num_leaves=255
--
Tree=4987
num_leaves=255
--
Tree=4988
num_leaves=255
--
Tree=4989
num_leaves=255
--
Tree=4990
num_leaves=255
--
Tree=4991
num_leaves=255
--
Tree=4992
num_leaves=255
--
Tree=4993
num_leaves=255
--
Tree=4994
num_leaves=255
--
Tree=4995
num_leaves=255
--
Tree=4996
num_leaves=255
--
Tree=4997
num_leaves=255
--
Tree=4998
num_leaves=255
--
Tree=4999
num_leaves=255

In fact, every tree has num_leaves=255.

@jameslamb
Collaborator

hmmmm ok thank you for that!

Sorry, but I'm out of ideas. I'm not that familiar with the performance characteristics of the CUDA build here. I hope @shiyu1994 will be able to help.
