Description
When dealing with unbalanced data, LightGBM provides two parameters, pos_bagging_fraction and neg_bagging_fraction. During bagging, these parameters are used to sample pos_bagging_fraction * #num_positives positive samples and neg_bagging_fraction * #num_negatives negative samples. A tree is then supposed to be trained on this subset only and added to the list of trees in the model.
However, that does not appear to be what happens. Even though the positive and negative samples are correctly sampled according to the provided parameters, the tree does not seem to be trained on the sampled data. I provide a reproducible example below to show why that's the case.
Reproducible example
Below is a toy example with an unbalanced dataset containing 90% positive observations and 10% negative observations. I then train a very simple model with only two leaves and one tree, and I set the bagging parameters to pos_bagging_fraction = 0.0001 and neg_bagging_fraction = 0.99. The goal is to make the bagging process select a sample that contains as many negative observations as possible and very few positive observations. If bagging works correctly, the tree learned on this sample should predict only the negative class, since the "bagged" sample contains only negative observations. However, the learned tree does the opposite: it continues to heavily predict the positive class, which suggests that the sampled subset was not used.
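To make the expected behaviour concrete, here is a minimal sketch of the subset sizes the documented behaviour implies for this setup (the helper expected_bagged_counts is just illustrative arithmetic, not a LightGBM function):
# Illustrative arithmetic only: the documented behaviour implies roughly
# pos_bagging_fraction * #positives positive rows and
# neg_bagging_fraction * #negatives negative rows per bagging round.
def expected_bagged_counts(n_pos, n_neg, pos_fraction, neg_fraction):
    return int(pos_fraction * n_pos), int(neg_fraction * n_neg)

# 900 positive and 100 negative rows with the fractions used below:
print(expected_bagged_counts(900, 100, 0.0001, 0.99))  # -> (0, 99)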
Below is the code:
import numpy as np
import lightgbm as lgb
# Generate random data for binary classification
np.random.seed(42) # For reproducibility
n_samples = 1000
n_features = 2
X = np.random.randn(n_samples, n_features)
y = np.zeros((n_samples,))
# 10% negative, 90% positive
y[(int(0.1 * n_samples)):] = 1
# Shuffle the dataset
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
y = y[indices]
# Create LightGBM datasets
train_data = lgb.Dataset(X, label=y)
# Set parameters for LightGBM
params = {
    'boosting_type': 'gbdt',          # Traditional Gradient Boosting Decision Tree
    'objective': 'binary',            # Binary classification
    'metric': 'binary_logloss',       # Evaluation metric
    'num_leaves': 2,                  # Number of leaves in full tree
    'learning_rate': 0.05,            # Learning rate
    'pos_bagging_fraction': 0.0001,
    'neg_bagging_fraction': 0.99,
    'bagging_freq': 1,
    'verbose': 0
}
# Train the model
num_round = 1
lgbm_model = lgb.train(params, train_data, num_round, valid_sets=[train_data])
# get the output probabilities
y_pred = lgbm_model.predict(X, num_iteration=lgbm_model.best_iteration)
print(f'min probability: {min(y_pred)}, max probability: {max(y_pred)}')
# visualize the tree
lgb.plot_tree(lgbm_model)
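As a follow-up sanity check (a minimal sketch reusing X, np, and lgbm_model from the script above), one can look at the share of rows predicted as positive and at the raw, pre-sigmoid scores; if the tree had really been fit on the almost entirely negative bagged subset, the raw scores should be strongly negative and the positive share close to 0:
# Sanity check on the one-tree model trained above.
y_pred = lgbm_model.predict(X)                       # predicted probabilities
raw_scores = lgbm_model.predict(X, raw_score=True)   # raw (pre-sigmoid) scores
print(f'share predicted positive: {np.mean(y_pred > 0.5):.3f}')
print(f'raw score range: [{raw_scores.min():.3f}, {raw_scores.max():.3f}]')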
Environment info
Reproduced both in Google Colab and on macOS Sonoma 14.6.1.
LightGBM version or commit hash: 4.5.0
Command(s) you used to install LightGBM
pip install lightgbm
Additional Comments
Below is the tree learned from a sample consisting mostly of negative observations. The leaf values should be strongly negative, implying a predicted probability close to 0. Instead, they are positive, implying a probability of around 0.85.
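For reference, the binary objective maps a raw leaf value to a probability through the logistic sigmoid (assuming LightGBM's default sigmoid parameter of 1.0), so a probability of roughly 0.85 corresponds to a raw value around +1.7, while a tree fit on a negative-only sample should produce clearly negative values. A minimal sketch of that conversion:
import numpy as np

# Raw leaf value -> predicted probability for the binary objective,
# assuming the default sigmoid parameter of 1.0.
def leaf_to_prob(leaf_value):
    return 1.0 / (1.0 + np.exp(-leaf_value))

print(leaf_to_prob(1.7))   # ~0.85, roughly what the plotted tree produces
print(leaf_to_prob(-5.0))  # ~0.007, what a negative-only sample should give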