Description
When dealing with unbalanced data, LightGBM provides two parameters, pos_bagging_fraction and neg_bagging_fraction. During bagging, these parameters are used to sample pos_bagging_fraction * #num_positives positive samples and neg_bagging_fraction * #num_negatives negative samples. A tree is then supposed to be trained on this subset only and added to the list of trees in the model.
However, that does not appear to be what happens. Even though the positive and negative samples are correctly sampled according to the provided parameters, the tree does not seem to be trained on the sampled data. I provide a reproducible example below to show why that's the case.
Reproducible example
Below is a toy example with an unbalanced dataset containing 90% positive observations and 10% negative observations. I then train a very simple model with only two leaves and one tree, and I set the bagging parameters to pos_bagging_fraction = 0.0001 and neg_bagging_fraction = 0.99. The goal is to make the bagging process select a sample that contains as many negative observations as possible and very few positive observations. If bagging works correctly, the tree learned on this sample should predict only the negative class, since the "bagged" sample contains only negative observations. However, the learned tree does the opposite: it continues to heavily predict the positive class, which suggests that the sampled subset was not used.
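To make the expected behaviour concrete, here is a minimal sketch of the subset sizes the documented behaviour implies for this setup (the helper expected_bagged_counts is just illustrative arithmetic, not a LightGBM function):
# Illustrative arithmetic only: the documented behaviour implies roughly
# pos_bagging_fraction * #positives positive rows and
# neg_bagging_fraction * #negatives negative rows per bagging round.
def expected_bagged_counts(n_pos, n_neg, pos_fraction, neg_fraction):
    return int(pos_fraction * n_pos), int(neg_fraction * n_neg)

# 900 positive and 100 negative rows with the fractions used below:
print(expected_bagged_counts(900, 100, 0.0001, 0.99))  # -> (0, 99)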
Below is the code:
import numpy as np
import lightgbm as lgb
# Generate random data for binary classification
np.random.seed(42) # For reproducibility
n_samples = 1000
n_features = 2
X = np.random.randn(n_samples, n_features)
y = np.zeros((n_samples,))
# 10% negative, 90% positive
y[(int(0.1 * n_samples)):] = 1
# Shuffle the dataset
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y = y[indices]
# Create LightGBM datasets
train_data = lgb.Dataset(X, label=y)
# Set parameters for LightGBM
params = {
    'boosting_type': 'gbdt',          # Traditional Gradient Boosting Decision Tree
    'objective': 'binary',            # Binary classification
    'metric': 'binary_logloss',       # Evaluation metric
    'num_leaves': 2,                  # Number of leaves in full tree
    'learning_rate': 0.05,            # Learning rate
    'pos_bagging_fraction': 0.0001,
    'neg_bagging_fraction': 0.99,
    'bagging_freq': 1,
    'verbose': 0
}
# Train the model
num_round = 1
lgbm_model = lgb.train(params, train_data, num_round, valid_sets=[train_data])
# get the output probabilities
y_pred = lgbm_model.predict(X, num_iteration=lgbm_model.best_iteration)
print(f'min probability: {min(y_pred)}, max probability: {max(y_pred)}')
# visualize the tree
lgb.plot_tree(lgbm_model)
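As a follow-up sanity check (a minimal sketch reusing X, np, and lgbm_model from the script above), one can look at the share of rows predicted as positive and at the raw, pre-sigmoid scores; if the tree had really been fit on the almost entirely negative bagged subset, the raw scores should be strongly negative and the positive share close to 0:
# Sanity check on the one-tree model trained above.
y_pred = lgbm_model.predict(X)                       # predicted probabilities
raw_scores = lgbm_model.predict(X, raw_score=True)   # raw (pre-sigmoid) scores
print(f'share predicted positive: {np.mean(y_pred > 0.5):.3f}')
print(f'raw score range: [{raw_scores.min():.3f}, {raw_scores.max():.3f}]')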
Environment info
Reproduced both in Google Colab and on macOS Sonoma 14.6.1.
LightGBM version or commit hash: 4.5.0
Command(s) you used to install LightGBM
pip install lightgbm
Additional Comments
Below is the tree learned from a sample consisting mostly of negative observations. The leaf values should be strongly negative, implying a predicted probability close to 0. Instead, they are positive, implying a probability of around 0.85.
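For reference, the binary objective maps a raw leaf value to a probability through the logistic sigmoid (assuming LightGBM's default sigmoid parameter of 1.0), so a probability of roughly 0.85 corresponds to a raw value around +1.7, while a tree fit on a negative-only sample should produce clearly negative values. A minimal sketch of that conversion:
import numpy as np

# Raw leaf value -> predicted probability for the binary objective,
# assuming the default sigmoid parameter of 1.0.
def leaf_to_prob(leaf_value):
    return 1.0 / (1.0 + np.exp(-leaf_value))

print(leaf_to_prob(1.7))   # ~0.85, roughly what the plotted tree produces
print(leaf_to_prob(-5.0))  # ~0.007, what a negative-only sample should give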