You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Share LightGBM categorical variable representation across multiple columns.
Motivation
Very useful for sequences and several use cases.
Description
Encode the same category level along multiple columns.
Very similar to using an Embedding Layer in a Neural Network to encode a categorical variable and applying the layer across a sequence of that same variable (example of a word embedding).
The same can be achieved with Target Encoding or other representation applied before modeling by stacking the category columns, into a key-value structure, and then applying said encoding.
Example train dataset
Var 1
Var 2
Var 3
Target
Goose
Cat
Dog
103
Cat
Cat
Dog
4
Goose
Goose
Goose
300
Stack shared category levels
Category Level
Target
Goose
103
Cat
103
Dog
103
Cat
4
Cat
4
Dog
4
Goose
300
Goose
300
Goose
300
Naive target encoding
Category Level
Encode
Goose
250
Dog
53
Cat
37
The text was updated successfully, but these errors were encountered:
This project is maintained mostly by volunteers, one of whom may respond here as we have time and interest.
If you can share more details (relevant research, a prototype implementation, your own research doing such feature engineering manually and demonstrating the benefits), it may make it more likely that maintainers here would invest time into this.
Summary
Share LightGBM categorical variable representation across multiple columns.
Motivation
Very useful for sequences and several use cases.
Description
Encode the same category level along multiple columns.
Very similar to using an Embedding Layer in a Neural Network to encode a categorical variable and applying the layer across a sequence of that same variable (example of a word embedding).
The same can be achieved with Target Encoding or other representation applied before modeling by stacking the category columns, into a key-value structure, and then applying said encoding.
Example train dataset
Stack shared category levels
Naive target encoding
The text was updated successfully, but these errors were encountered: