TabPFN fails on text with NA #138

LeoGrin · 2025-01-15T17:23:34Z

import pandas as pd
import numpy as np
from tabpfn import TabPFNClassifier
from sklearn.model_selection import train_test_split

# Create a sample dataset with text and NA values
data = {
    'text_feature': ['good product', 'bad service', pd.NA, 'excellent', 'average', pd.NA],
    'numeric_feature': [10, 5, 8, 15, 7, 12],
    'target': [1, 0, 1, 1, 0, 0]
}

# Create DataFrame
df = pd.DataFrame(data)

# Try to use the data directly without handling NA values
X = df[['text_feature', 'numeric_feature']]
y = df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and try to train TabPFN
classifier = TabPFNClassifier(device='cpu')

# This will fail because TabPFN cannot handle pd.NA or string values directly
try:
    classifier.fit(X_train, y_train)
except Exception as e:
    print("TabPFN failed with error:\n", str(e))

gives

TabPFN failed with error:
 Encoders require their input to be uniformly strings or numbers. Got ['NAType', 'str']

kgovind0001 · 2025-01-22T12:38:19Z

@LeoGrin Thank you very much for sharing the code to reproduce the bug. There is a mismatch in the data types required by the function _unique_python: it expects the values to be uniformly strings or numbers, but here it receives a mix of str and NAType.

The pull request should fix this by also preprocessing X before validating it.
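
To illustrate the idea of preprocessing X before validation, here is a minimal sketch that replaces every missing marker in non-numeric columns with np.nan, so that each such column holds only strings plus a single NaN marker. The helper name normalize_missing is hypothetical and this is not the code from the pull request; whether downstream validation then accepts the column depends on the library versions in use.

```python
import numpy as np
import pandas as pd


def normalize_missing(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: map pd.NA / None / NaN to np.nan in non-numeric columns."""
    out = X.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_numeric_dtype(col):
            continue  # numeric columns are left untouched
        # pd.isna is True for pd.NA, None and float NaN alike
        cleaned = [np.nan if pd.isna(v) else v for v in col]
        out[name] = pd.Series(cleaned, index=col.index, dtype=object)
    return out


X_example = pd.DataFrame(
    {
        "text_feature": ["good product", pd.NA, "average"],
        "numeric_feature": [10, 8, 7],
    }
)
X_clean = normalize_missing(X_example)
text_values = X_clean["text_feature"].tolist()
```

After this step the text column no longer contains any NAType objects, only str values and float NaN.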

piri-p · 2025-02-03T19:23:53Z

Dear @LeoGrin - could you say what the expected behavior is here (if anything specific)? I see that you wrote in the pull request by @kgovind0001:

In addition to the failing tests, I think the NaNs should be kept or at least transformed in the same way as NaNs in int categorical columns, as TabPFN was trained with this logic.

and

(...) make sure that missing values end up being encoded in the same way, just before the actual TabPFN forward pass. With your current solution I think missing values would be encoded as just another category when the type is string, which is different from what is happening if the category has type int.
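
The distinction in the quote above can be sketched as follows: categories are mapped to ordinal codes, while missing values stay NaN instead of receiving a code of their own, and the same rule applies to string and int columns. The function encode_keep_missing is a hypothetical illustration under that assumption, not TabPFN's actual preprocessing.

```python
import numpy as np
import pandas as pd


def encode_keep_missing(col: pd.Series) -> pd.Series:
    """Hypothetical sketch: ordinal-encode a column, keeping missing values as NaN."""
    # Collect the observed (non-missing) categories in sorted order
    present = sorted({v for v in col if not pd.isna(v)})
    codes = {value: float(i) for i, value in enumerate(present)}
    # Missing entries are never given a category code
    encoded = [np.nan if pd.isna(v) else codes[v] for v in col]
    return pd.Series(encoded, index=col.index, dtype=float)


text_col = pd.Series(["good", pd.NA, "bad"], dtype=object)
int_col = pd.Series([2, pd.NA, 1], dtype="Int64")

text_codes = encode_keep_missing(text_col)
int_codes = encode_keep_missing(int_col)
```

Both columns end up with NaN in the missing position, whereas filling the string column with a placeholder such as "missing" first would turn the gap into one more category.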

LeoGrin added the "bug" and "good first issue" labels on Jan 15, 2025

kgovind0001 mentioned this issue Feb 14, 2025

updated to accept NAN values in string columns: Fix for issue #138 #185

Open
