TabPFN fails on text with NA #138

Open
LeoGrin opened this issue Jan 15, 2025 · 2 comments
Labels
bug (Something isn't working), good first issue (Good for newcomers)

Comments

LeoGrin (Collaborator) commented Jan 15, 2025

import pandas as pd
import numpy as np
from tabpfn import TabPFNClassifier
from sklearn.model_selection import train_test_split

# Create a sample dataset with text and NA values
data = {
    'text_feature': ['good product', 'bad service', pd.NA, 'excellent', 'average', pd.NA],
    'numeric_feature': [10, 5, 8, 15, 7, 12],
    'target': [1, 0, 1, 1, 0, 0]
}

# Create DataFrame
df = pd.DataFrame(data)

# Try to use the data directly without handling NA values
X = df[['text_feature', 'numeric_feature']]
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and try to train TabPFN
classifier = TabPFNClassifier(device='cpu')

# This fails because the text column mixes str values with pd.NA
try:
    classifier.fit(X_train, y_train)
except Exception as e:
    print("TabPFN failed with error:\n", str(e))

gives

TabPFN failed with error:
 Encoders require their input to be uniformly strings or numbers. Got ['NAType', 'str']
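
For anyone who wants to see the type mix without installing TabPFN, here is a minimal pandas-only sketch. The helper name `python_types` is hypothetical, and the column is built with `dtype=object` on the assumption that this matches what the snippet above produced. It only lists the Python types present in the column, which is the same list the error message reports.

```python
import pandas as pd

# Assumption: the text column is stored as object dtype, so pd.NA is kept as-is.
text = pd.Series(
    ["good product", "bad service", pd.NA, "excellent", "average", pd.NA],
    dtype=object,
)


def python_types(values):
    """Return the sorted names of the Python types found in `values`."""
    return sorted({type(v).__qualname__ for v in values})


print(python_types(text))  # ['NAType', 'str']
```

The column holds both `str` and `NAType` objects, so any encoder that sorts the unique values has to compare a string with `pd.NA`.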
LeoGrin added the bug and good first issue labels on Jan 15, 2025
kgovind0001 commented

@LeoGrin Thank you very much for sharing the code to reproduce the bug. There is a mismatch in the data types required by the function _unique_python.

The pull request should fix this: we also preprocess X before validating it.
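
To illustrate the general idea of preprocessing X before validation, here is a rough caller-side sketch. It is not the code from the pull request: `normalize_missing` is a hypothetical helper, and the choice to map `pd.NA` to `np.nan` in object columns is an assumption about what a uniform missing marker could look like.

```python
import numpy as np
import pandas as pd


def normalize_missing(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace pd.NA/None with np.nan in object columns."""
    X = X.copy()
    for name in X.columns[X.dtypes == object]:
        col = X[name]
        # Keep non-missing entries, write np.nan wherever pandas sees a missing value.
        X[name] = col.where(col.notna(), np.nan)
    return X


X = pd.DataFrame(
    {
        # Assumption: object dtype, so pd.NA is stored unchanged.
        "text_feature": pd.Series(["good product", pd.NA, "average"], dtype=object),
        "numeric_feature": [10, 8, 7],
    }
)
X_clean = normalize_missing(X)
types = sorted({type(v).__qualname__ for v in X_clean["text_feature"]})
print(types)
```

After this step the text column no longer contains `NAType` objects. Whether TabPFN then treats the remaining NaNs the way the maintainers intend is a separate question that this sketch does not answer.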

piri-p commented Feb 3, 2025

Dear @LeoGrin - could you say what the expected behavior is here (if anything specific)? I see that in the pull request by @kgovind0001 you mentioned:

In addition to the failing tests, I think the NaNs should be kept or at least transformed in the same way as NaNs in int categorical columns, as TabPFN was trained with this logic.

and

(...) make sure that missing values end up being encoded in the same way, just before the actual TabPFN forward pass. With your current solution I think missing values would be encoded as just another category when the type is string, which is different from what is happening if the category has type int.
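
The difference between the two encodings described in that quote can be sketched with plain pandas. This is only an illustration of the distinction, not TabPFN's actual preprocessing; the `"__missing__"` placeholder and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Assumption: object dtype, so pd.NA is stored unchanged.
text = pd.Series(
    ["good product", "bad service", pd.NA, "excellent", "average", pd.NA],
    dtype=object,
)

# Option A: missing values become just another category with its own code.
filled = text.where(text.notna(), "__missing__")
codes_as_category = pd.Categorical(filled).codes.astype(float)

# Option B: missing values stay missing (NaN) after encoding.
codes_keep_nan = pd.Categorical(text).codes.astype(float)
codes_keep_nan[codes_keep_nan == -1] = np.nan  # pandas marks missing with code -1

print(codes_as_category)
print(codes_keep_nan)
```

In option A the model sees five ordinary categories, while in option B it sees four categories plus NaN, which is how a missing entry in an int categorical column would also appear.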
