import pandas as pd
import numpy as np
from tabpfn import TabPFNClassifier
from sklearn.model_selection import train_test_split
# Create a sample dataset with text and NA values
data = {
    'text_feature': ['good product', 'bad service', pd.NA, 'excellent', 'average', pd.NA],
    'numeric_feature': [10, 5, 8, 15, 7, 12],
    'target': [1, 0, 1, 1, 0, 0]
}
# Create DataFrame
df = pd.DataFrame(data)
# Try to use the data directly without handling NA values
X = df[['text_feature', 'numeric_feature']]
y = df['target']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize and try to train TabPFN
classifier = TabPFNClassifier(device='cpu')
# This will fail because TabPFN cannot handle pd.NA or string values directly
try:
    classifier.fit(X_train, y_train)
except Exception as e:
    print("TabPFN failed with error:\n", str(e))
gives
TabPFN failed with error:
Encoders require their input to be uniformly strings or numbers. Got ['NAType', 'str']
@LeoGrin Thank you very much for sharing the file to reproduce the bug. There is a mismatch between the data types in the column and what the function _unique_python requires: it expects values that are uniformly strings or numbers, but the column mixes str with NAType.
The pull request should fix this by also preprocessing X before validating it.
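For illustration only, here is a minimal sketch of what "preprocessing X before validating it" could look like. It is not the code from the pull request: the helper name `replace_pd_na` is hypothetical, and the assumption is that replacing `pd.NA` with `np.nan` in non-numeric columns is enough to avoid the `['NAType', 'str']` mix reported above.

```python
import numpy as np
import pandas as pd


def replace_pd_na(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not part of TabPFN): return a copy of X in which
    missing values in non-numeric columns are represented as np.nan."""
    X = X.copy()
    for col in X.columns:
        if pd.api.types.is_numeric_dtype(X[col]):
            continue  # numeric columns are left untouched
        # Rebuild the column value by value so every missing entry
        # (pd.NA, None, NaN) becomes a plain float NaN.
        cleaned = [np.nan if pd.isna(v) else v for v in X[col]]
        X[col] = pd.Series(cleaned, index=X.index, dtype=object)
    return X
```

With the reproduction script above, this would be called as `X_train = replace_pd_na(X_train)` before `classifier.fit(...)`. Whether that alone makes `fit` succeed depends on how TabPFN validates string columns, which this sketch does not cover.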
Dear @LeoGrin - could you mention what's the expected behavior here (if anything specific)? I see you mentioned in the pull request by @kgovind0001 that
In addition to the failing tests, I think the NaNs should be kept or at least transformed in the same way as NaNs in int categorical columns, as TabPFN was trained with this logic.
and
(...) make sure that missing values end up being encoded in the same way, just before the actual TabPFN forward pass. With your current solution I think missing values would be encoded as just another category when the type is string, which is different from what is happening if the category has type int.
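The concern quoted above can be sketched in a few lines. This is an assumption-laden illustration rather than TabPFN's actual preprocessing: the helper `encode_keep_missing` is hypothetical, and it uses `pd.factorize` to turn categories into codes while keeping missing values as NaN instead of giving them their own category. The same function treats string and int columns identically, which is the behavior being asked for.

```python
import numpy as np
import pandas as pd


def encode_keep_missing(col: pd.Series) -> pd.Series:
    """Hypothetical helper: ordinal-encode a categorical column and keep
    missing values as NaN rather than as an extra category."""
    # pd.factorize assigns the sentinel -1 to missing values (pd.NA, None, NaN).
    codes, _ = pd.factorize(col, sort=True)
    codes = codes.astype(float)
    codes[codes == -1] = np.nan  # restore missing values as NaN
    return pd.Series(codes, index=col.index, name=col.name)


# A string column and an int-like column are encoded the same way.
text_col = pd.Series(['good', pd.NA, 'bad', 'good'])
int_col = pd.Series([3, np.nan, 1, 3])
encoded_text = encode_keep_missing(text_col)
encoded_int = encode_keep_missing(int_col)
```

In both results the second entry stays NaN, and the remaining entries receive codes in sorted order of the observed categories.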