MondrianCP can't handle Pandas dataframe #526

Open
lennartvandeguchte opened this issue Oct 30, 2024 · 3 comments
Labels
Backlog: This is in the MAPIE team development backlog, yet to be prioritised.
Regression: Related to regression (excluding time series)

Comments

lennartvandeguchte commented Oct 30, 2024

Describe the bug

When using the new MondrianCP class, I'm unable to fit my estimator with a pandas DataFrame, whereas the standard MapieRegressor handles it fine. I'm using a sklearn pipeline with column transformers that select columns by their pandas names, so I can't convert my data to a numpy array first: sklearn then raises an error when fitting the estimator.

To Reproduce
Below is the code to reproduce my problem.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import TransformedTargetRegressor, ColumnTransformer
from sklearn.preprocessing import  RobustScaler, OneHotEncoder
from mapie.regression import MapieRegressor
from mapie.mondrian import MondrianCP
from lightgbm import LGBMRegressor
import pandas as pd
from sklearn.model_selection import train_test_split

# Create some dummy data
data = pd.DataFrame(np.random.rand(100, 5), columns=['feature1', 'feature2', 'feature3', 'feature4', 'feature5'])
data['categorical_feature'] = np.random.choice(['A', 'B', 'C'], size=100)
y = pd.Series(np.random.rand(100))

# Create bins for the partition
data['BIN'] = pd.cut(y, bins=3, labels=[1, 2, 3])

# Split the data into a train and calibration set
data_train, data_calib, y_train, y_calib = train_test_split(data, y, test_size=0.2, random_state=42)

model = LGBMRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_child_samples=10,
    num_leaves=31,
    random_state=42
)

ct = ColumnTransformer([
    ("site", OneHotEncoder(), ['categorical_feature']),
    ("features", RobustScaler(), ['feature1', 'feature2', 'feature3', 'feature4', 'feature5']),
    ])
estimators = [('transformers',ct), ('model',  model)]
pre_pipe = Pipeline(estimators)
pipe = TransformedTargetRegressor(regressor=pre_pipe, transformer=RobustScaler())
pipe.fit(data_train, y_train)

# Pick one of: "mondrian", "mondrian_numpy", or anything else for plain MapieRegressor
strategy = "mondrian"
if strategy == "mondrian":
    mapie_regressor = MondrianCP(MapieRegressor(pipe, cv='prefit'))
    mapie_regressor.fit(data_calib, y_calib, partition=data_calib['BIN'])
elif strategy == "mondrian_numpy":
    mapie_regressor = MondrianCP(MapieRegressor(pipe, cv='prefit'))
    mapie_regressor.fit(data_calib.to_numpy(), y_calib, partition=data_calib['BIN'])
else:
    mapie_regressor = MapieRegressor(estimator=pipe, cv='prefit')
    mapie_regressor = mapie_regressor.fit(data_calib, y_calib)

By changing the strategy to mondrian_numpy you can also reproduce the sklearn error I receive.
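
For reference, the sklearn side of the problem can be isolated without MAPIE or LightGBM. The snippet below is only a minimal sketch, not the original traceback: it assumes the failure comes from a ColumnTransformer that selects columns by name being given a NumPy array. The toy data and the `make_ct` helper are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (illustrative only): two numeric columns and one categorical column
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((30, 2)), columns=["feature1", "feature2"])
df["categorical_feature"] = np.tile(["A", "B", "C"], 10)


def make_ct():
    """Hypothetical helper: a transformer that selects columns by name."""
    return ColumnTransformer([
        ("site", OneHotEncoder(), ["categorical_feature"]),
        ("features", RobustScaler(), ["feature1", "feature2"]),
    ])


# Fitting on the DataFrame works because the names can be resolved.
make_ct().fit(df)

# Fitting on the bare array fails: there are no column names to look up.
try:
    make_ct().fit(df.to_numpy())
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message depends on the sklearn version, so it is printed rather than hard-coded here.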

Expected behavior
Be able to use a Pandas dataframe as input data for MondrianCP class.

lennartvandeguchte (Author) commented

I managed to resolve the sklearn issue when using the 'mondrian_numpy' strategy in the example above by using indices in the ColumnTransformer instead of column names:

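# Feature lists matching the reproduction example above
numeric_features = ['feature1', 'feature2', 'feature3', 'feature4', 'feature5']
categorical_features = ['categorical_feature']

# Look up integer positions so the transformer no longer depends on column names
numerical_indices = [data.columns.get_loc(col) for col in numeric_features]
categorical_indices = [data.columns.get_loc(col) for col in categorical_features]

ct = ColumnTransformer([
    ("site", OneHotEncoder(), categorical_indices),
    ("features", RobustScaler(), numerical_indices),
])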

I don't know if the package maintainers still want the MondrianCP class to handle Pandas dataframes? Otherwise this issue can be closed.
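
The index-based workaround can be checked on its own, without MAPIE or LightGBM. This is a minimal sketch under the assumption that the array passed to the transformer keeps the same column order as the DataFrame used to compute the indices; the toy data and variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data (illustrative only)
rng = np.random.default_rng(0)
numeric_features = ["feature1", "feature2"]
categorical_features = ["categorical_feature"]
df = pd.DataFrame(rng.random((30, 2)), columns=numeric_features)
df["categorical_feature"] = np.tile(["A", "B", "C"], 10)

# Translate column names into integer positions once, while names are still available
numerical_indices = [df.columns.get_loc(col) for col in numeric_features]
categorical_indices = [df.columns.get_loc(col) for col in categorical_features]

ct = ColumnTransformer([
    ("site", OneHotEncoder(), categorical_indices),
    ("features", RobustScaler(), numerical_indices),
])

# The transformer now accepts a plain NumPy array (object dtype here, since columns are mixed)
X_array = df.to_numpy()
transformed = ct.fit_transform(X_array)
print(transformed.shape)
```

One caveat with this design: the indices silently point at the wrong columns if the column order changes between fitting and prediction, so the order has to be kept fixed.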

Valentin-Laurent added the Bug, Backlog, Regression, Discussion in progress, and Needs decision labels and removed the Backlog and Discussion in progress labels on Oct 31, 2024
Valentin-Laurent (Collaborator) commented

Hi @lennartvandeguchte, thank you for reporting this. Good to know you found a workaround.

We need further internal discussion to decide what to do about this. We'll let you know.

Best,

Valentin-Laurent (Collaborator) commented

Following our discussion: support for pandas DataFrames is something we'd like to have, but it is not a quick win. In a prefit setting it is easy to address. In a split or cross setting, however, we call .fit on the provided estimator (which can be a pipeline), so we need to avoid casting X, y to NDArray. Otherwise we lose pd.DataFrame functionality, such as column names, that the pipeline may require.
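
To make concrete what such a cast discards, here is a small sketch using only pandas and NumPy. It is not MAPIE's internal code and makes no claim about where MAPIE performs the cast; it only shows the generic effect of converting a mixed-type DataFrame to an array. The toy frame is made up for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative only)
df = pd.DataFrame({
    "feature1": [0.1, 0.2, 0.3],
    "categorical_feature": ["A", "B", "C"],
})

# Generic cast to a NumPy array
X_cast = np.asarray(df)

# Per-column dtypes collapse into a single dtype for the whole array
print(df.dtypes.to_dict())
print(X_cast.dtype)

# Column names are gone, so name-based column selection can no longer work
print(hasattr(df, "columns"), hasattr(X_cast, "columns"))
```

Any pipeline step that selects columns by name, like the ColumnTransformer in the report, has nothing to look up once the data is in this form.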

We're adding this to our backlog.

Valentin-Laurent added the Backlog and Enhancement labels and removed the Needs decision and Bug labels on Oct 31, 2024