
pos_bagging_fraction and neg_bagging_fraction are not working as they should #6639

zakariaelh opened this issue Sep 4, 2024
Description

When dealing with unbalanced data, LightGBM provides two parameters, pos_bagging_fraction and neg_bagging_fraction. During bagging, they are used to sample pos_bagging_fraction * #num_positives positive samples and neg_bagging_fraction * #num_negatives negative samples; each tree should then be trained on this bagged subset only and added to the model's list of trees.
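
As a quick arithmetic check of what those fractions imply for the example below (900 positive and 100 negative rows; plain arithmetic, not LightGBM internals):

num_pos, num_neg = 900, 100
pos_bagging_fraction, neg_bagging_fraction = 0.0001, 0.99

# Expected bagged-sample size per bagging iteration
expected_pos = int(pos_bagging_fraction * num_pos)  # 0 positive rows
expected_neg = int(neg_bagging_fraction * num_neg)  # 99 negative rows
print(expected_pos, expected_neg)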

However, it looks like that is not what happens. Even though the positive and negative samples are correctly drawn according to the provided parameters, the tree does not appear to be trained on the sampled data. The reproducible example below shows why.

Reproducible example

Below is a toy setup with unbalanced data: 90% positive observations and 10% negative observations. I then train a very simple model with a single tree of only two leaves, and set the bagging parameters to pos_bagging_fraction = 0.0001 and neg_bagging_fraction = 0.99. The goal is to make the bagging process select a sample that contains as many negative observations as possible and almost no positive ones. If bagging worked correctly, the tree learnt on this sample should predict only the negative class, since the bagged sample contains (almost) only negative observations. Instead, the learnt tree does the opposite: it still heavily predicts the positive class, which suggests the bagged sample was not actually used (see the manual comparison after the code).

Below is the code:

import numpy as np
import lightgbm as lgb

# Generate random data for binary classification
np.random.seed(42)  # For reproducibility
n_samples = 1000
n_features = 2

X = np.random.randn(n_samples, n_features)
y = np.zeros((n_samples,))
# 10% negative, 90% positive 
y[(int(0.1 * n_samples)):] = 1

# Shuffle the dataset
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y = y[indices]

# Create LightGBM datasets
train_data = lgb.Dataset(X, label=y)

# Set parameters for LightGBM
params = {
    'boosting_type': 'gbdt',       # Traditional Gradient Boosting Decision Tree
    'objective': 'binary',         # Binary classification
    'metric': 'binary_logloss',    # Evaluation metric
    'num_leaves': 2,              # Number of leaves in full tree
    'learning_rate': 0.05,         # Learning rate
    'pos_bagging_fraction': 0.0001,
    'neg_bagging_fraction': 0.99,
    'bagging_freq': 1,            
    'verbose': 0                  
}

# Train the model
num_round = 1
lgbm_model = lgb.train(params, train_data, num_round, valid_sets=[train_data])
# get the output probabilities 
y_pred = lgbm_model.predict(X, num_iteration=lgbm_model.best_iteration)
print(f'min probability: {min(y_pred)}, max probability: {max(y_pred)}')
# visualize the tree 
lgb.plot_tree(lgbm_model)
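
For comparison, here is a rough sketch that manually draws a subsample with (roughly) the same fractions and trains the same one-tree model on it. The hand-rolled sampling below is only an approximation for illustration, not LightGBM's internal bagging implementation:

# Manually build an (almost) all-negative subsample and train on it,
# to show what the bagged tree should look like.
pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

rng = np.random.default_rng(42)
sub_pos = rng.choice(pos_idx, size=max(1, int(0.0001 * len(pos_idx))), replace=False)
sub_neg = rng.choice(neg_idx, size=int(0.99 * len(neg_idx)), replace=False)
sub_idx = np.concatenate([sub_pos, sub_neg])

# Drop the bagging parameters since the subsampling is done by hand here
manual_params = {k: v for k, v in params.items()
                 if k not in ('pos_bagging_fraction', 'neg_bagging_fraction', 'bagging_freq')}
manual_data = lgb.Dataset(X[sub_idx], label=y[sub_idx])
manual_model = lgb.train(manual_params, manual_data, num_round)

manual_pred = manual_model.predict(X)
print(f'manual subsample -> min probability: {min(manual_pred)}, max probability: {max(manual_pred)}')
# With an almost entirely negative training subsample, both numbers should
# sit well below 0.5 -- unlike the predictions from the bagged model above.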

Environment info

Reproduced both in Google Colab and on macOS Sonoma 14.6.1.

LightGBM version or commit hash:
lightgbm version 4.5.0

Command(s) you used to install LightGBM

pip install lightgbm 

Additional Comments

Below is the tree learnt when the bagged sample should contain mostly negative observations. The leaf values should be strongly negative, implying predicted probabilities close to 0. Instead, they are positive, implying a probability of around 0.85.

[plot_tree output: the single learnt tree, with positive leaf values]
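
For reference, LightGBM's binary objective converts the raw score (the boost_from_average init score plus the leaf value, assuming the default sigmoid parameter) into a probability with a sigmoid, so a raw score around +1.7 matches the ~0.85 probability seen above, while a strongly negative raw score would give a probability near 0:

# Uses numpy as imported in the example above
def sigmoid(s):
    # probability = 1 / (1 + exp(-raw_score)) for the binary objective
    return 1.0 / (1.0 + np.exp(-s))

print(sigmoid(1.7))   # ~0.85 -- roughly the probability the model actually predicts
print(sigmoid(-4.6))  # ~0.01 -- roughly what a negative-dominated sample should imply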
