
How to set delta_metric to identity in pairwise objective #11261

Open
jaguerrerod opened this issue Feb 17, 2025 · 8 comments · May be fixed by #11272

@jaguerrerod commented Feb 17, 2025

delta_metric was introduced in the version 2.0 refactoring:

// Use double whenever possible as we are working on the exp space.
double delta_score = std::abs(s_high - s_low);
double const sigmoid = common::Sigmoid(s_high - s_low);
// Change in metric score like \delta NDCG or \delta MAP
double delta_metric = std::abs(delta(y_high, y_low, rank_high, rank_low));
if (best_score != worst_score) {
  delta_metric /= (delta_score + 0.01);
}
if (unbiased) {
  *p_cost = std::log(1.0 / (1.0 - sigmoid)) * delta_metric;
}
auto lambda_ij = (sigmoid - 1.0) * delta_metric;
auto hessian_ij = std::max(sigmoid * (1.0 - sigmoid), Eps64()) * delta_metric * 2.0;
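Reading this snippet: with $\sigma = \mathrm{Sigmoid}(s_{high} - s_{low})$ and $\delta$ the change in the ranking metric for the pair, the per-pair weight, gradient, and Hessian are

$$\Delta = \frac{|\delta(y_{high}, y_{low})|}{|s_{high} - s_{low}| + 0.01}, \qquad \lambda_{ij} = (\sigma - 1)\,\Delta, \qquad h_{ij} = 2\max(\sigma(1 - \sigma), \epsilon)\,\Delta.$$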

Compare this with the way gradients and Hessians were computed in the 1.7 versions:

LambdaWeightComputerT::GetLambdaWeight(lst, &pairs);
// rescale each gradient and hessian so that the list has a constant weight
float scale = 1.0f / param_.num_pairsample;
if (param_.fix_list_weight != 0.0f) {
  scale *= param_.fix_list_weight / (gptr[k + 1] - gptr[k]);
}
for (auto &pair : pairs) {
  const ListEntry &pos = lst[pair.pos_index];
  const ListEntry &neg = lst[pair.neg_index];
  const bst_float w = pair.weight * scale;
  const float eps = 1e-16f;
  bst_float p = common::Sigmoid(pos.pred - neg.pred);
  bst_float g = p - 1.0f;
  bst_float h = std::max(p * (1.0f - p), eps);
  // accumulate gradient and hessian in both pid and nid
  gpair[pos.rindex] += GradientPair(g * w, 2.0f * w * h);
  gpair[neg.rindex] += GradientPair(-g * w, 2.0f * w * h);
}
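For comparison, the 1.7 update carries no metric- or score-based scaling at all; per pair, with $\sigma = \mathrm{Sigmoid}(s_{pos} - s_{neg})$, it is simply

$$g = \sigma - 1, \qquad h = 2\max(\sigma(1 - \sigma), \epsilon),$$

each multiplied by the pair weight $w$ before being accumulated into both ends of the pair.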

Where is delta defined, and what is its default value in the master version?

My use cases don't need a delta function to overweight the top elements of each query, as I'm optimizing the Spearman correlation of the whole query.
I would like to not use a delta function to weight pairs.
Is it possible to disable it or set it to the identity function?

@jaguerrerod jaguerrerod changed the title How to set delta_metric to identity in pairwise objetive How to set delta_metric to identity in pairwise objective Feb 17, 2025
@trivialfis (Member)

If you are referring to the RankNet loss, then plain rank:pairwise should suffice.

@jaguerrerod (Author)

If that is the case, then I can't explain the performance drop I'm seeing in version 3.0.0 compared to 1.7.8.
My dataset has very little signal (predictions reach a correlation of 0.03). The queries are large (5,000 observations).
Something significant changed in the refactoring introduced in version 2.0 that consistently reduces performance.
With 3.0.0, correlation reaches 0.022 after just a few iterations and quickly starts overfitting, dropping below 0.02.
With 1.7.8, the model's performance on the test dataset improves continuously up to 0.03, requiring many thousands of trees to reach that level.
Is there any change in sampling, weighting, or the calculation of gradients/hessians introduced in the refactoring that could explain this?

I'm using these parameters to try to reproduce the behaviour of 1.7.8 in 3.0.0 (full-query optimization by rank, without normalization, and pairing method 'mean', since the metric is Spearman correlation):

booster = 'gbtree',
objective = 'rank:pairwise',
tree_method = 'hist',
device = 'cuda',
lambdarank_pair_method = 'mean',
lambdarank_num_pair_per_sample = 200,
lambdarank_normalization = FALSE,

I'll try to upload a portion of the data as a dataset on Kaggle along with code to reproduce the issue.
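For reference, a minimal R sketch of the 3.0.0 setup described above (train_x, train_y, and query_sizes are hypothetical placeholders, not the actual data):

library(xgboost)

# Hypothetical placeholders: train_x (feature matrix), train_y (labels),
# query_sizes (rows per query group, e.g. rep(5000L, n_queries)).
dtrain <- xgb.DMatrix(train_x, label = train_y)
setinfo(dtrain, "group", query_sizes)

params <- list(
  booster = 'gbtree',
  objective = 'rank:pairwise',
  tree_method = 'hist',
  device = 'cuda',
  lambdarank_pair_method = 'mean',
  lambdarank_num_pair_per_sample = 200,
  lambdarank_normalization = FALSE
)

bst <- xgb.train(params, dtrain, nrounds = 500)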

@trivialfis (Member)

lambdarank_num_pair_per_sample is too large. Could you please experiment with 1?

@jaguerrerod (Author) commented Feb 21, 2025

I have trained the same model with the same data using versions 1.7.8 and 3.0.0.
I used xgb.DMatrix in both cases to avoid introducing a difference by using xgb.QuantileDMatrix.

The only changes in the code are the parameters for version 3.0.0:

  • lambdarank_pair_method = 'mean',
  • lambdarank_num_pair_per_sample = 200,
  • lambdarank_normalization = FALSE.

The learning with version 1.7.8 is stable and reaches a correlation of 0.0317 at 13K trees:

1	0.007926573
10	0.01521309
20	0.01781933
30	0.01781363
40	0.01825633
50	0.01907869
100	0.02007266
150	0.02040283
200	0.02058564
250	0.02081901
300	0.02116498
350	0.02146332
400	0.02172211
450	0.0219907
500	0.02210538
13000   0.03174277

The result with version 3.0.0 does not improve after a few trees:

1	0.004441884
10	0.01096987
20	0.01326301
30	0.01420671
40	0.01612788
50	0.01708117
100	0.01741559
150	0.01584577
200	0.01570146
250	0.01508331
300	0.01510909
350	0.01517024
400	0.01534667
450	0.01440822
500	0.01440194

This happens with different dataset subsets, different sets of predictive variables, and different general parameters (depth, colsample_bytree, lambda, ...).
When I train version 3.0.0 with the number of pairs set to 1, the results are worse:

1	0.005942605
10	0.01068313
20	0.01301092
30	0.01300241
40	0.01367269
50	0.01433194
100	0.01474676
150	0.01513208
200	0.0140801
250	0.01407706
300	0.01390105
350	0.01460263
400	0.01426433
450	0.01336728
500	0.01359152

Training with the number of pairs set to 10, the behavior with 3.0.0 is similar: it quickly starts to degrade:

1	0.003311274
10	0.01169847
20	0.01278498
30	0.01404126
40	0.01554281
50	0.01665777
100	0.01611878
150	0.0167202
200	0.01597453
250	0.01562164
300	0.0157668
350	0.01574301
400	0.01558456
450	0.01564038
500	0.01515673

I'm not a C++ coder, and following the differences between 1.7.8 and 3.0.0 is hard for me since it is a whole new refactoring, but I think something is different that causes this behavior.
I suspected it was related to delta_metric, which is the new part I noticed in the gradient and Hessian computation.

@trivialfis (Member)

Ok, I get it now. The RankNet loss has no delta metric (it is 1.0), but that 1.0 is still normalized by the ranking score difference, which is undesirable.
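Concretely: with $\delta = 1$, the snippet quoted at the top reduces the pair weight to

$$\Delta = \frac{1}{|s_{high} - s_{low}| + 0.01},$$

so pairs the model already separates well are down-weighted, and pairs with similar predicted scores dominate the update.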

@trivialfis trivialfis linked a pull request Feb 21, 2025 that will close this issue
@jaguerrerod (Author) commented Feb 21, 2025

I've seen that you've disabled pair normalization in pairwise. This is the behavior of version 1.7.
I'll test it when it's available to see if I get similar results between both versions.
In datasets with a lot of noise and little signal, this is the best option. However, in datasets with a strong signal, normalizing a pair based on the difference in their labels might make sense.
Perhaps for future versions, including a parameter to choose whether to normalize pairs by the difference in label ranks or not could make the approach more versatile.
What intuitively makes the most sense to me is to use the label rank calculated considering the frequency of each label (like percentiles).
Determining whether pairwise works better with or without pair normalization in more predictable datasets is something worth investigating.
If this option is included in a future release, I commit to running the comparison.
EDIT:
I'll explain a bit how I think about normalization by labels in pairwise.
In this line
delta_metric /= (delta_score + 0.01);
You are normalizing inversely by delta_score, meaning you give more relevance to pairs with similar predictions.
What is the reasoning behind this?
I believe the logic in pairwise should be to give more relevance to pairs that are more different, based not on the predictions but on the true labels.
For example:
delta_metric = std::abs(y_high_rank - y_low_rank)
where y_high_rank and y_low_rank are the percentiles of the true labels, considering their frequency distribution.
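For illustration, a hypothetical R sketch of that percentile-rank delta (label_rank_delta and its arguments are made up; y is the vector of true labels within one query):

# ecdf() assigns each label its empirical percentile, so ties among
# frequent labels share the same rank, reflecting the frequency distribution.
label_rank_delta <- function(y, i_high, i_low) {
  pct <- ecdf(y)
  abs(pct(y[i_high]) - pct(y[i_low]))
}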

@trivialfis (Member)

You are normalizing inversely by delta_score, meaning you give more relevance to pairs with similar predictions.

Thank you for pointing it out. It's made to prevent overfitting by "smoothing" things out. Sometimes it can make training stagnate as in your example. There are some other similar operations like the one employed by lambdarank_normalization.

At the time of reworking the ranking, I did some tuning on popular benchmarking datasets like MSLR and found the normalization useful. On the other hand, as noted in https://xgboost.readthedocs.io/en/latest/tutorials/learning_to_rank.html#obtaining-good-result, I also find it quite difficult to get good results. Your experiments are welcome, and feel free to make suggestions!

I will try to make it an option instead.

@jaguerrerod (Author)

@trivialfis This seems fixed with the PR, thank you!

1	0.00424417
10	0.01324821
20	0.01554005
30	0.01645896
40	0.01771046
50	0.01804327
100	0.01979055
150	0.0208095
200	0.02083933
250	0.02113857
300	0.02120267
350	0.02139126
400	0.02197463
450	0.02219081
500	0.02237842

About normalization, I'm thinking of something like:
delta_metric = 1 + std::pow(std::abs(y_high_rank - y_low_rank), parameter)
If parameter = 0, then delta_metric is constant and there is no normalization.
If parameter > 1, delta_metric overweights pairs with a larger difference in ranked labels, and the parameter itself controls the intensity of this overweighting.
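In R, a sketch of that proposal could look like (pair_delta and its argument names are made up):

# parameter = 0 gives a constant delta (no normalization);
# parameter > 1 increasingly overweights pairs whose ranked labels
# are further apart.
pair_delta <- function(r_high, r_low, parameter) {
  1 + abs(r_high - r_low)^parameter
}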
I'll review the code for NDCG and MAP and the general weighting of queries by size, and will propose a parametrization scheme for your consideration as a feature request for future versions.

Thank you again!
