
Custom median objective function in lightgbm.cv() #6620

Open
arumds opened this issue Aug 22, 2024 · 10 comments

arumds commented Aug 22, 2024

LightGBM version 4.0.0

The objective='regression' setting trains the model to predict the mean of the data, but I am interested in training it to predict the median of the actual values. In fact, a quantile model with alpha=0.5 would solve the problem. However, the quantile objective does not work with the monotone_constraints parameter, which is essential in our case. Therefore, a custom median_loss is passed as the objective in the params.
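
For reference, the gradient used in the snippet below is the (sub)gradient of the pinball loss at alpha = 0.5; its second derivative is 0, approximated here by a constant hessian of 1:

$$
L_{0.5}(y, \hat{y}) = \tfrac{1}{2}\,\lvert \hat{y} - y \rvert,
\qquad
\frac{\partial L_{0.5}}{\partial \hat{y}} = \tfrac{1}{2}\,\operatorname{sign}(\hat{y} - y)
$$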

import lightgbm as lgb
import numpy as np

def median_loss(preds, train_data: lgb.Dataset):
    y_true = train_data.get_label()
    residual = preds - y_true
    grad = np.where(residual > 0, 0.5, -0.5)
    hess = np.ones_like(grad)  # Hessian is constant for median pinball loss
    return grad, hess

params = {"objective": median_loss}

cv_result = lgb.cv(params, dtrain, nfold=n_folds, stratified=False, return_cvbooster=True)
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

Debugging shows that all predictions during the lgb.cv step are 0s, so the gradients are uniform across samples. This might not give LightGBM enough gradient information to make meaningful splits.

Does anyone have a suggestion on how to train the model effectively with the median_loss custom objective, or with a quantile objective while preserving the monotonic constraints? @jameslamb @vladv14

@jmoralez (Collaborator)

Hey. Thanks for using LightGBM. Can you try setting the condition to greater-or-equal? i.e.

grad = np.where(residual >= 0, 0.5, -0.5)

arumds (Author) commented Aug 22, 2024

@jmoralez I tried setting grad = np.where(residual >= 0, 0.5, -0.5):

params = {"objective": median_loss}

cv_result = lgb.cv(params, dtrain, nfold=n_folds, metrics='rmse', stratified=False, return_cvbooster=True)

Log:

[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] Using self-defined objective function
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[1]	cv_agg's train rmse: 4.66734 + 0.00107263	cv_agg's valid rmse: 4.66734 + 0.00428721
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

When debugging the median_loss objective during the lgb.cv() execution, the preds are all zero, as seen in the screenshot:
[Screenshot 2024-08-22 at 23 09 04: preds is an array of zeros inside median_loss]

With objective='regression' the model trains normally. Logs are below:

[1]	cv_agg's train rmse: 0.730986 + 0.000761274	cv_agg's valid rmse: 0.730999 + 0.00305853
[2]	cv_agg's train rmse: 0.724106 + 0.000747364	cv_agg's valid rmse: 0.724126 + 0.00305247
[3]	cv_agg's train rmse: 0.717755 + 0.000743182	cv_agg's valid rmse: 0.717786 + 0.00304095
[4]	cv_agg's train rmse: 0.711056 + 0.000728518	cv_agg's valid rmse: 0.711092 + 0.00303802
[5]	cv_agg's train rmse: 0.704382 + 0.000716823	cv_agg's valid rmse: 0.704426 + 0.00302899
[6]	cv_agg's train rmse: 0.69778 + 0.00070809	cv_agg's valid rmse: 0.697832 + 0.00301913
[7]	cv_agg's train rmse: 0.691297 + 0.000700247	cv_agg's valid rmse: 0.691353 + 0.00301123
[8]	cv_agg's train rmse: 0.685269 + 0.000683244	cv_agg's valid rmse: 0.685337 + 0.00301251
[9]	cv_agg's train rmse: 0.678915 + 0.000665435	cv_agg's valid rmse: 0.678987 + 0.00301451
[10]	cv_agg's train rmse: 0.672621 + 0.000661577	cv_agg's valid rmse: 0.672699 + 0.00300223
[11]	cv_agg's train rmse: 0.666394 + 0.000655792	cv_agg's valid rmse: 0.666477 + 0.00299132

@jmoralez (Collaborator)

When using a custom objective, LightGBM sets the init score to 0, and if it doesn't find a positive gain for any split you may be left with a single tree containing only the root; you can verify this with the trees_to_dataframe method.
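
For example, a minimal sketch of that check, assuming the cv call was made with return_cvbooster=True and its result stored in cv_result:

# Inspect the trees of the first CV booster.
bst = cv_result["cvbooster"].boosters[0]
tree_df = bst.trees_to_dataframe()
print(tree_df[["tree_index", "node_depth", "node_index", "value", "count"]])
# In the degenerate case described above, the output contains only root leaves
# (node_index "0-L0") with value 0, i.e. the init score used for custom objectives.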

If you're able to provide a reproducible example we can assist further. The following seems to train normally:

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_regression

def median_loss(preds, train_data: lgb.Dataset):
    y_true = train_data.get_label()
    residual = preds - y_true
    grad = np.where(residual >= 0, 0.5, -0.5)
    hess = np.ones_like(grad)  # Hessian is constant for median pinball loss
    return grad, hess

X, y = make_regression(n_samples=1000, n_features=2)
dtrain = lgb.Dataset(X, y)
params={"objective": median_loss, 'num_leaves': 32, 'verbosity': -1, 'metrics': 'l2'}
cv_hist = lgb.cv(
    params,
    dtrain,
    num_boost_round=10,
    nfold=2,
    stratified=False,
    callbacks=[lgb.log_evaluation(1)],
)
# [1]	cv_agg's valid l2: 15698.8 + 269.489
# [2]	cv_agg's valid l2: 15689.7 + 269.239
# [3]	cv_agg's valid l2: 15680.5 + 268.99
# [4]	cv_agg's valid l2: 15671.4 + 268.741
# [5]	cv_agg's valid l2: 15662.2 + 268.491
# [6]	cv_agg's valid l2: 15653.1 + 268.242
# [7]	cv_agg's valid l2: 15644 + 267.993
# [8]	cv_agg's valid l2: 15634.8 + 267.744
# [9]	cv_agg's valid l2: 15625.7 + 267.495
# [10]	cv_agg's valid l2: 15616.6 + 267.246

arumds (Author) commented Aug 22, 2024

@jmoralez Attached is a test dtrain binary file, which can be used to reproduce the issue as below:

dataset_from_file = lgb.Dataset(data="test.bin")

params={"objective": median_loss, 'num_leaves': 32, 'verbosity': -1, 'metrics': 'l2'}
cv_hist = lgb.cv(
    params,
    dataset_from_file,
    num_boost_round=10,
    nfold=2,
    stratified=False,
    callbacks=[lgb.log_evaluation(1)],
    seed=0,
    metrics='rmse',
    eval_train_metric=True,
    return_cvbooster=True)

test.bin.zip

Unzip the file to test.bin

@jmoralez (Collaborator)

Did you inspect the produced trees?

arumds (Author) commented Aug 23, 2024

Do you mean getting the model from lgb.train after lgb.cv and inspecting the trees? If so, yes, there seems to be only the root.

The hyperparameters returned from lgb.cv() with BayesianOptimization are:

`{'num_iterations': 500, 'early_stopping_rounds': 50, 'bagging_freq': 1, 'learning_rate': 0.01, 'verbosity': -1, 'monotone_constraints': [0, 0, 0, -1, 0, 1], 'objective': <function median_loss at 0x3126261f0>, 'bagging_fraction': 0.8646440511781974, 'feature_fraction': 0.9145568099117258, 'lambda_l1': 0.006027633760716439, 'lambda_l2': 0.005448831829968969, 'max_depth': 14, 'min_child_weight': 0.6394705825246829, 'min_data_in_leaf': 16, 'min_gain_to_split': 0.045670920031283195, 'num_leaves': 292}`

The model trained with these hyperparameters yields:

lgb.Booster.trees_to_dataframe(model)
Out[5]: 
   tree_index  node_depth node_index left_child right_child parent_index  \
0           0           1       0-L0       None        None         None   
  split_feature split_gain threshold decision_type missing_direction  \
0          None       None      None          None              None   
  missing_type  value weight count  
0         None      0   None  None  

Does this indicate that the median_loss objective is not good for the dataset?

jmoralez (Collaborator) commented Aug 23, 2024

That means LightGBM isn't able to find a split that satisfies the constraints you've set (min_gain_to_split, min_data_in_leaf, min_child_weight, etc).

This doesn't seem to be an issue within LightGBM or your custom loss; I'm pretty sure you'd get the same result if you used the built-in loss (a single tree with only the root, which predicts the init score).

If you have very few samples, you could try getting more data or reducing the constraints (in case 16 is your minimum min_data_in_leaf, for example).
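
As a hypothetical sanity check (the values below are illustrative, reusing the attached test.bin dataset and the tuned parameters shown earlier), one could rerun the cv with the split constraints loosened to confirm that they are what blocks the splits:

# Hypothetical sanity check: loosen the split constraints so that splits can be
# found despite the flat +/-0.5 gradients of the median objective.
relaxed_params = {
    "objective": median_loss,
    "monotone_constraints": [0, 0, 0, -1, 0, 1],
    "num_leaves": 292,
    "min_gain_to_split": 0.0,   # was ~0.046
    "min_data_in_leaf": 1,      # was 16
    "min_child_weight": 1e-3,   # alias of min_sum_hessian_in_leaf; was ~0.64
    "metric": "l1",
    "verbosity": -1,
}
cv_relaxed = lgb.cv(relaxed_params, dataset_from_file, nfold=2, stratified=False)

If splits appear with the relaxed settings, the constraints (rather than the custom loss itself) are the limiting factor.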

arumds (Author) commented Aug 23, 2024

@jmoralez The hyperparameter boundaries for tuning are shown below:

hyperparam_boundaries = {
    'num_leaves': (100, 300),
    'max_depth': (10, 20),
    'feature_fraction': (0.7, 1),
    'bagging_fraction': (0.7, 1),
    'min_data_in_leaf': (10, 25),
    'min_gain_to_split': (0.01, 0.05),
    'lambda_l1': (0, 0.01),
    'lambda_l2': (0, 0.01),
}

And the built-in regression objective gives the following best hyperparameters from Bayesian hyperparameter tuning with lgb.cv() cross-validation:

{'num_iterations': 500, 'early_stopping_rounds': 50, 'bagging_freq': 1, 'learning_rate': 0.01, 'verbosity': -1, 'monotone_constraints': [0, 0, 0, -1, 0, 1], 'objective': 'regression', 'bagging_fraction': 0.8150324556477333, 'feature_fraction': 0.9375175114247993, 'lambda_l1': 0.005288949197529045, 'lambda_l2': 0.0056804456109393235, 'max_depth': 19, 'min_child_weight': 0.07041859401008829, 'min_data_in_leaf': 11, 'min_gain_to_split': 0.010808735897613029, 'num_leaves': 266}

And there is more than one tree:

lgb.Booster.trees_to_dataframe(model)
Out[2]: 
        tree_index  node_depth node_index  ...     value   weight  count
0                0           1       0-S0  ...  4.607710      0.0  66367
1                0           2       0-S2  ...  4.615160  29156.0  29156
2                0           3       0-S7  ...  4.616940  17398.0  17398
3                0           4      0-S18  ...  4.618880   2726.0   2726
4                0           5      0-S53  ...  4.621150    455.0    455
...            ...         ...        ...  ...       ...      ...    ...
265495         499          10   499-L241  ... -0.000076     20.0     20
265496         499          10   499-L256  ...  0.000423     11.0     11
265497         499           7   499-S254  ... -0.000418     25.0     25
265498         499           8    499-L38  ... -0.000174     12.0     12
265499         499           8   499-L255  ... -0.000677     13.0     13

The issue occurs only when using the custom loss function, where it cannot find a split and only predicts the init score of 0.

arumds (Author) commented Aug 26, 2024

@jmoralez Is there anything I am missing here?

@jmoralez (Collaborator)

What are you returning as the trial's score? As I said, when using a custom objective, LightGBM starts boosting from zero, which might hurt the convergence.

Can you try the approach in #5114 (comment) by setting the init score in your dataset (to the target's median in this case), adding it back to your predictions and then computing your metric on that? If you're using a built-in metric it won't work because it won't take into account the init scores.
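
A minimal sketch of that approach, reusing the toy data and median_loss from the earlier example and taking the target's median as the init score:

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_regression

def median_loss(preds, train_data: lgb.Dataset):
    y_true = train_data.get_label()
    residual = preds - y_true
    grad = np.where(residual >= 0, 0.5, -0.5)
    hess = np.ones_like(grad)  # constant hessian for the median pinball loss
    return grad, hess

X, y = make_regression(n_samples=1000, n_features=2)
base_score = np.median(y)

# Boosting starts from the target's median instead of 0.
dtrain = lgb.Dataset(X, y, init_score=np.full(len(y), base_score))

params = {
    "objective": median_loss,
    "num_leaves": 32,
    "verbosity": -1,
    # Built-in metrics reported during cv do not account for the init score,
    # so the trial score is computed manually below.
    "metric": "l1",
}
cv_result = lgb.cv(
    params,
    dtrain,
    num_boost_round=100,
    nfold=2,
    stratified=False,
    return_cvbooster=True,
)

# predict() does not add the init score back, so add it manually before
# computing the score handed to the tuner. (In a real tuning loop you would
# score each fold on its own validation split rather than the full X.)
fold_preds = [b.predict(X) for b in cv_result["cvbooster"].boosters]
preds = np.mean(fold_preds, axis=0) + base_score
trial_score = np.mean(np.abs(preds - y))  # MAE, the metric matching a median objective
print(trial_score)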
