Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Predefined bin thresholds #2325

Merged
merged 57 commits into from
Sep 28, 2019
Merged

Predefined bin thresholds #2325

merged 57 commits into from
Sep 28, 2019

Conversation

btrotta
Copy link
Collaborator

@btrotta btrotta commented Aug 13, 2019

Implement the request in #1829 (ability to specify binning thresholds). The thresholds can be specified in a json file using the parameter forcedbins_filename.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@btrotta Big thanks for implementing this feature! As usual, I left some minor style comments 😃

include/LightGBM/config.h Outdated Show resolved Hide resolved
src/io/bin.cpp Outdated Show resolved Hide resolved
src/io/bin.cpp Outdated Show resolved Hide resolved
src/io/bin.cpp Outdated Show resolved Hide resolved
src/io/dataset_loader.cpp Show resolved Hide resolved
tests/data/forced_bins.json Outdated Show resolved Hide resolved
tests/data/forced_bins.json Outdated Show resolved Hide resolved
src/io/dataset.cpp Outdated Show resolved Hide resolved
@btrotta
Copy link
Collaborator Author

btrotta commented Aug 14, 2019

@StrikerRUS thanks, fixed! One of the checks is still failing because the docs contain a reference to the URL https://github.com/microsoft/LightGBM/tree/master/examples/regression/forced_bins.json which does not exist yet. Do I need to do anything about that?

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@btrotta Thanks a lot for prompt fixes!

Speaking about the 404 error for non-existent file, I think it's not a big issue. Someone more powerful in terms of GitHub repo administrative rights can merge even with failed CI test. Or we can put fake URL, merge this PR, and then replace the URL with the needed one. But I think it's too much...

include/LightGBM/config.h Outdated Show resolved Hide resolved
src/io/dataset_loader.cpp Show resolved Hide resolved
@guolinke
Copy link
Collaborator

Thanks very much! I am on vacation, so don't have much time to check this carefully.
Some quick questions regards to the changes in the bin.cpp:

  1. Are the bin boundaries the same as before, when without force bin upper bounds?
  2. Why need to update bin finding code? it seems you can just load bin upper bounds from the file.

@btrotta
Copy link
Collaborator Author

btrotta commented Aug 15, 2019

@guolinke

  1. If the user doesn't specify forced bins, then the binning behavior should be the same as my previous PR Bug fix: small values of max_bin cause program to crash #2299. That last PR corrects a smaller bug in the binning algorithm and therefore changes the binning behavior slightly from the original. I have written a more detailed description of the new behavior in the comments for PR Bug fix: small values of max_bin cause program to crash #2299.
  2. The idea is to allow the user to specify some or all of the bin thresholds. If the number of thresholds specified in the json file is less than max_bin, then the binning algorithm will find the remaining ones.

@guolinke
Copy link
Collaborator

#2299 is merged. Maybe need to rebase to the master?

@guolinke
Copy link
Collaborator

Thanks @btrotta .
One slight improvement: As categorical features don't need the pre-defined bin, maybe we could throw a warning when user set predefined bin for categorical features?

src/io/bin.cpp Outdated
@@ -207,8 +306,19 @@ namespace LightGBM {
return bin_upper_bound;
}

std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts, int num_distinct_values,
int max_bin, size_t total_sample_cnt, int min_data_in_bin, std::vector<double> forced_upper_bounds) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use const std::vector<double>& forced_upper_bounds anywhere.

*/
void FindBin(double* values, int num_values, size_t total_sample_cnt, int max_bin, int min_data_in_bin, int min_split_data, BinType bin_type,
bool use_missing, bool zero_as_missing);
bool use_missing, bool zero_as_missing, std::vector<double> forced_upper_bounds);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use const T& for container object anywhere.

@@ -596,6 +596,9 @@ class Dataset {

void addFeaturesFrom(Dataset* other);

static std::vector<std::vector<double>> GetForcedBins(std::string forced_bins_path, int num_total_features,
std::unordered_set<int> categorical_features);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use const T& for container object anywhere.

src/io/bin.cpp Outdated
std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts,
int num_distinct_values, int max_bin, size_t total_sample_cnt, int min_data_in_bin) {
std::vector<double> FindBinWithPredefinedBin(const double* distinct_values, const int* counts,
int num_distinct_values, int max_bin, size_t total_sample_cnt, int min_data_in_bin, std::vector<double>& forced_upper_bounds) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use const T& for container object anywhere.

src/io/bin.cpp Outdated
@@ -207,8 +306,19 @@ namespace LightGBM {
return bin_upper_bound;
}

std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts, int num_distinct_values,
int max_bin, size_t total_sample_cnt, int min_data_in_bin, std::vector<double>& forced_upper_bounds) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use const T& for container object anywhere.

num_features_ += other->num_features_;
num_total_features_ += other->num_total_features_;
num_groups_ += other->num_groups_;
}


std::vector<std::vector<double>> Dataset::GetForcedBins(std::string forced_bins_path, int num_total_features,
std::unordered_set<int> categorical_features) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use const T& for container object anywhere.

@guolinke
Copy link
Collaborator

Just to confirm: the categorical feature index is read from the bin mapper, but it seems the bin mapper is not constructed at that time?

categorical_features.insert(i);
}
}
forced_bin_bounds_ = Dataset::GetForcedBins(io_config.forcedbins_filename, num_total_features_, categorical_features);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guolinke Do you mean this part? My understanding is that in this method bin_mappers is passed in as an argument and the bin mappers are constructed already. It seems that there are 2 uses of this method (DatasetLoader::ConstructBinMappersFromTextData and DatasetLoader::CostrcutFromSampleData). In both cases it looks like the bins are found before calling Dataset::Construct and passed in to the Construct method. Please let me know if I'm misunderstanding.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, you are right!

/* Since the dataset is already constructed we don't know which bins are categorical.
Therefore read forced bins assuming no categorical features, and warn if not the same as original. */
std::vector<std::vector<double>> config_bounds = Dataset::GetForcedBins(io_config.forcedbins_filename,
num_total_features_, std::unordered_set<int>());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

categorical feature is needed here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case the dataset is already constructed, and I think there is no way to know which features are categorical (because this information is only stored in the DatasetLoader object, not in Dataset). Therefore I think the best we can do is get the forced bins assuming no categorical features, and warn the user if these are different from the existing bins.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, you can call this to get the bin mapper:

inline const BinMapper* FeatureBinMapper(int i) const {
const int group = feature2group_[i];
const int sub_feature = feature2subfeature_[i];
return feature_groups_[group]->bin_mappers_[sub_feature].get();
}
. And I think config_bounds will always no equal to forced_bin_bounds_ when categorical feature is set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I changed it to get the categorical features from FeatureBinMapper. But you're right: when the json file contains forced bins for categorical features it always gives the warning, even if forced bins are unchanged.

@guolinke
Copy link
Collaborator

@btrotta could you resolve the conflict?
BTW, as the json file is missing in the master branch, the test failed.
Maybe you can create a new PR to submit that josn file first?

@btrotta btrotta mentioned this pull request Sep 26, 2019

num_features_ += other->num_features_;
num_total_features_ += other->num_total_features_;
num_groups_ += other->num_groups_;
}


std::vector<std::vector<double>> Dataset::GetForcedBins(std::string forced_bins_path, int num_total_features,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think that moving the GetForcedBins to the dataset_loader is better? It will be much easier to access the categorical_features.

Copy link
Collaborator

@guolinke guolinke Sep 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And as the pre-defined bin cannot be updated, we could don't check it in resetconfig.

@guolinke
Copy link
Collaborator

guolinke commented Sep 27, 2019

@btrotta I am a little bit confused now. Could the forced_bin is only processed in dataset_loader?
It seems the dataset class only needs to save/load the forced_bin, not other purposes.
you can save it to the dataset object in dataset_loader methods directly, similar to set_feature_name: https://github.com/microsoft/LightGBM/blob/master/src/io/dataset_loader.cpp#L696

@btrotta
Copy link
Collaborator Author

btrotta commented Sep 28, 2019

@guolinke Ok, I think I see what you mean now, I will try that.

}
}
forced_bin_bounds_ = DatasetLoader::GetForcedBins(io_config.forcedbins_filename, num_total_features_, categorical_features);
forced_bin_bounds_ = forced_bins;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyFeatureMapperFrom, CreateValid also need to copy forced_bin_bounds_.

@@ -290,6 +290,7 @@ class Dataset {

void Construct(
std::vector<std::unique_ptr<BinMapper>>* bin_mappers,
std::vector<std::vector<double>>& forced_bins,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const T&.

@guolinke guolinke merged commit cc7a1e2 into microsoft:master Sep 28, 2019
@StrikerRUS
Copy link
Collaborator

@btrotta Thank you very much!

@btrotta
Copy link
Collaborator Author

btrotta commented Sep 29, 2019

Thanks for your help @guolinke and @StrikerRUS

@lock lock bot locked as resolved and limited conversation to collaborators Mar 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants