Predefined bin thresholds #2325
@btrotta Big thanks for implementing this feature! As usual, I left some minor style comments 😃
@StrikerRUS thanks, fixed! One of the checks is still failing because the docs contain a reference to the URL https://github.com/microsoft/LightGBM/tree/master/examples/regression/forced_bins.json which does not exist yet. Do I need to do anything about that?
@btrotta Thanks a lot for prompt fixes!
Speaking about the 404 error for the non-existent file, I think it's not a big issue. Someone with sufficient administrative rights on the GitHub repo can merge even with a failed CI test. Or we can put in a fake URL, merge this PR, and then replace the URL with the needed one. But I think that's too much...
Thanks very much! I am on vacation, so don't have much time to check this carefully.
#2299 is merged. Maybe this needs to be rebased onto master?
Thanks @btrotta.
src/io/bin.cpp
Outdated
@@ -207,8 +306,19 @@ namespace LightGBM {
return bin_upper_bound;
}

std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts, int num_distinct_values,
int max_bin, size_t total_sample_cnt, int min_data_in_bin, std::vector<double> forced_upper_bounds) {
Use const std::vector<double>& forced_upper_bounds everywhere.
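The request above can be illustrated with a minimal sketch. CountBoundsInRange is a hypothetical helper written only for this illustration and is not part of LightGBM; it shows a container parameter taken by const reference, so the caller's vector is neither copied nor modified.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a LightGBM function): counts the forced upper
// bounds that lie strictly inside the open interval (lo, hi).
// Taking the vector by const reference avoids copying the caller's data
// and prevents the function from modifying it.
std::size_t CountBoundsInRange(const std::vector<double>& forced_upper_bounds,
                               double lo, double hi) {
  assert(lo <= hi);  // precondition of this sketch
  std::size_t count = 0;
  for (double bound : forced_upper_bounds) {
    if (bound > lo && bound < hi) {
      ++count;
    }
  }
  return count;
}
```

With a by-value parameter (std::vector<double> forced_upper_bounds) every call would copy the whole vector; the const reference keeps the call site unchanged while dropping that copy.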
include/LightGBM/bin.h
Outdated
*/
void FindBin(double* values, int num_values, size_t total_sample_cnt, int max_bin, int min_data_in_bin, int min_split_data, BinType bin_type,
bool use_missing, bool zero_as_missing);
bool use_missing, bool zero_as_missing, std::vector<double> forced_upper_bounds);
Please use const T& for container objects everywhere.
include/LightGBM/dataset.h
Outdated
@@ -596,6 +596,9 @@ class Dataset {

void addFeaturesFrom(Dataset* other);

static std::vector<std::vector<double>> GetForcedBins(std::string forced_bins_path, int num_total_features,
std::unordered_set<int> categorical_features);
Please use const T& for container objects everywhere.
src/io/bin.cpp
Outdated
std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts,
int num_distinct_values, int max_bin, size_t total_sample_cnt, int min_data_in_bin) {
std::vector<double> FindBinWithPredefinedBin(const double* distinct_values, const int* counts,
int num_distinct_values, int max_bin, size_t total_sample_cnt, int min_data_in_bin, std::vector<double>& forced_upper_bounds) {
Please use const T& for container objects everywhere.
src/io/bin.cpp
Outdated
@@ -207,8 +306,19 @@ namespace LightGBM {
return bin_upper_bound;
}

std::vector<double> FindBinWithZeroAsOneBin(const double* distinct_values, const int* counts, int num_distinct_values,
int max_bin, size_t total_sample_cnt, int min_data_in_bin, std::vector<double>& forced_upper_bounds) {
Please use const T& for container objects everywhere.
src/io/dataset.cpp
Outdated
num_features_ += other->num_features_;
num_total_features_ += other->num_total_features_;
num_groups_ += other->num_groups_;
}

std::vector<std::vector<double>> Dataset::GetForcedBins(std::string forced_bins_path, int num_total_features,
std::unordered_set<int> categorical_features) {
Please use const T& for container objects everywhere.
Just to confirm: the categorical feature index is read from the bin mapper, but it seems the bin mapper is not constructed at that time?
src/io/dataset.cpp
Outdated
categorical_features.insert(i);
}
}
forced_bin_bounds_ = Dataset::GetForcedBins(io_config.forcedbins_filename, num_total_features_, categorical_features);
@guolinke Do you mean this part? My understanding is that in this method bin_mappers is passed in as an argument and the bin mappers are constructed already. It seems that there are 2 uses of this method (DatasetLoader::ConstructBinMappersFromTextData and DatasetLoader::CostrcutFromSampleData). In both cases it looks like the bins are found before calling Dataset::Construct and passed in to the Construct method. Please let me know if I'm misunderstanding.
okay, you are right!
src/io/dataset.cpp
Outdated
/* Since the dataset is already constructed we don't know which bins are categorical.
Therefore read forced bins assuming no categorical features, and warn if not the same as original. */
std::vector<std::vector<double>> config_bounds = Dataset::GetForcedBins(io_config.forcedbins_filename,
num_total_features_, std::unordered_set<int>());
Are the categorical features needed here?
In this case the dataset is already constructed, and I think there is no way to know which features are categorical (because this information is only stored in the DatasetLoader object, not in Dataset). Therefore I think the best we can do is get the forced bins assuming no categorical features, and warn the user if these are different from the existing bins.
Actually, you can call this to get the bin mapper:
LightGBM/include/LightGBM/dataset.h
Lines 476 to 480 in f1a1486
inline const BinMapper* FeatureBinMapper(int i) const {
  const int group = feature2group_[i];
  const int sub_feature = feature2subfeature_[i];
  return feature_groups_[group]->bin_mappers_[sub_feature].get();
}
Thanks, I changed it to get the categorical features from FeatureBinMapper. But you're right: when the json file contains forced bins for categorical features it always gives the warning, even if forced bins are unchanged.
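As a rough sketch of the idea discussed here, the categorical feature indices can be recovered by asking each bin mapper for its type. The BinType and BinMapper definitions below are simplified, hypothetical stand-ins written for this illustration, and CategoricalFeatureIndices is an assumed helper name; none of this is the code from the PR.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for the LightGBM types, reduced to what this
// sketch needs. They are assumptions, not the real class definitions.
enum class BinType { kNumerical, kCategorical };

struct BinMapper {
  BinType type;
  BinType bin_type() const { return type; }
};

// Collects the indices of all features whose bin mapper is categorical.
// A null mapper is skipped, on the assumption that it marks an unused feature.
std::unordered_set<int> CategoricalFeatureIndices(
    const std::vector<std::unique_ptr<BinMapper>>& bin_mappers) {
  std::unordered_set<int> categorical;
  for (std::size_t i = 0; i < bin_mappers.size(); ++i) {
    if (bin_mappers[i] != nullptr &&
        bin_mappers[i]->bin_type() == BinType::kCategorical) {
      categorical.insert(static_cast<int>(i));
    }
  }
  return categorical;
}
```

The resulting set could then be passed to a function like GetForcedBins so that forced bounds for categorical features are ignored consistently.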
@btrotta could you resolve the conflict?
src/io/dataset.cpp
Outdated
num_features_ += other->num_features_;
num_total_features_ += other->num_total_features_;
num_groups_ += other->num_groups_;
}

std::vector<std::vector<double>> Dataset::GetForcedBins(std::string forced_bins_path, int num_total_features,
Do you think that moving GetForcedBins to the dataset_loader is better? It will be much easier to access the categorical_features.
And since the predefined bins cannot be updated, we don't need to check them in resetconfig.
@btrotta I am a little bit confused now. Could the |
@guolinke Ok, I think I see what you mean now, I will try that.
}
}
forced_bin_bounds_ = DatasetLoader::GetForcedBins(io_config.forcedbins_filename, num_total_features_, categorical_features);
forced_bin_bounds_ = forced_bins;
CopyFeatureMapperFrom and CreateValid also need to copy forced_bin_bounds_.
include/LightGBM/dataset.h
Outdated
@@ -290,6 +290,7 @@ class Dataset {

void Construct(
std::vector<std::unique_ptr<BinMapper>>* bin_mappers,
std::vector<std::vector<double>>& forced_bins,
const T&.
@btrotta Thank you very much!
Thanks for your help @guolinke and @StrikerRUS
Implement the request in #1829 (ability to specify binning thresholds). The thresholds can be specified in a json file using the parameter forcedbins_filename.
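To make the notion of a binning threshold concrete, here is a minimal sketch of how a list of sorted upper bounds could map a raw value to a bin index. ValueToBin is a hypothetical helper, and the rule it uses (a value goes into the first bin whose upper bound is at least the value) is an assumption made for illustration, not a description of how this PR combines forced bounds with automatically found ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a LightGBM function): returns the index of the
// first bin whose upper bound is >= value. Values larger than every bound
// fall into one extra final bin with index bin_upper_bounds.size().
std::size_t ValueToBin(const std::vector<double>& bin_upper_bounds,
                       double value) {
  // The bounds must be sorted ascending for the binary search to be valid.
  assert(std::is_sorted(bin_upper_bounds.begin(), bin_upper_bounds.end()));
  auto it = std::lower_bound(bin_upper_bounds.begin(),
                             bin_upper_bounds.end(), value);
  return static_cast<std::size_t>(it - bin_upper_bounds.begin());
}
```

For example, with the bounds {0.3, 0.35, 0.4} this sketch places 0.32 in bin 1 and anything above 0.4 in the overflow bin 3.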