-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add narwhals materializer for dataframe agnosticism. #189
Conversation
As discussed in #187 , here's an initial PR for your review @MarcoGorelli. It's definitely not ready to replace the default backend, but it isn't toooo far either. |
@MarcoGorelli Sorry, just in case it wasn't entirely clear from our conversation in #187 and my tagging you here, I was hoping you could take a look and suggest how to fill the gaps and/or improve things. No worries if you do not have time for now. |
Thanks for the ping! Yup, definitely taking a look, just had some holiday recently 🌴 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for trying this out! left some comments, make some todos in Narwhals itself, but I really like the look of this and I'd like to invest more time into making sense of it and bringing it forwards! I'm away this weekend but I'll get back to this soon after
formulaic/materializers/narwhals.py
Outdated
if not isinstance(values.dtype, nwd.NumericType): | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you also consider datetimes to be categorical?
formulaic/utils/null_handling.py
Outdated
@drop_rows.register | ||
def _(values: narwhals.Series, indices: Sequence[int]) -> narwhals.Series: | ||
# TODO: Is there a better narwhals native way to do this? | ||
selector = numpy.zeros(len(values), dtype=bool) | ||
selector[list(indices)] = True | ||
return values.filter(~selector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! I'd have thought there would be an easier way to do this, but it seems there's not in Polars either
I was also expecting that passing a numpy array to Series.filter
would work, but that raises, as Polars expects a Series
or a list
:
In [15]: s = pl.Series([5, 2, 4, 1, 2])
In [16]: mask = np.array([True, True, True, False, False])
In [17]: s.filter(mask.tolist())
Out[17]:
shape: (3,)
Series: '' [i64]
[
5
2
4
]
In [18]: s.filter(mask)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[18], line 1
----> 1 s.filter(mask)
File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/polars/series/series.py:2993, in Series.filter(self, predicate)
2991 if isinstance(predicate, list):
2992 predicate = Series("", predicate)
-> 2993 return self._from_pyseries(self._s.filter(predicate._s))
AttributeError: 'numpy.ndarray' object has no attribute '_s'
I'll raise an issue about this in Polars and see what they say there.
Possible solutions could be:
- for now, to do
return values.filter((~selector).tolist())
. This should be enough to address the error you're seeing whenna_action
isn't'ignore'
- in Narwhals, we make it easy to create a narwhals Series from a numpy array. Issue: Easy way to create nw.Series from numpy array narwhals-dev/narwhals#754
- we support passing numpy arrays to
filter
(I'll report this to Polars and see what they think there)
We'll get back to you on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not sure there is an easier way to do this with the Polars API at the moment
formulaic/materializers/narwhals.py
Outdated
if isinstance(values, nw.Series): | ||
values = values.to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be possible to keep this Narwhals-native, as we have:
cat.get_categories
: https://narwhals-dev.github.io/narwhals/api-reference/expr_cat/#narwhals.expr.ExprCatNamespace.get_categoriesSeries.to_dummies
: https://narwhals-dev.github.io/narwhals/api-reference/series/#narwhals.series.Series.to_dummies
will try this out and let you know
formulaic/utils/null_handling.py
Outdated
@find_nulls.register | ||
def _(values: narwhals.Series) -> Set[int]: | ||
return set(values.is_null().arg_true().to_list()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice 🥳
formulaic/materializers/narwhals.py
Outdated
# TODO: Can we do better than this? Having to reconstitute raw data | ||
# does not seem ideal. | ||
combined = nw.from_dict( | ||
{ | ||
name: (nw.to_native(col) if isinstance(col, nw.Series) else col) | ||
for name, col in cols | ||
}, | ||
native_namespace=nw.get_native_namespace(self.__narwhals_data), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, this should be supported (and easier!) - I've opened an issue about it here narwhals-dev/narwhals#756
hey - quick update, I've got most of it working in a branch, but am a bit busy with some conferences now - will get back to it soon. realistically I hope to have something review-ready in October 🤞 this has been quite interesting to work on! |
Hi @MarcoGorelli ! We are nearing a 1.1 release, and I'm thinking to include this on something close to its current state as an experimental feature. I'll try to patch any obvious bugs. If you have a branch I'm also happy to cherry pick any fixes you might have come up with :). |
hey, sounds great thanks! sorry for the delay, I got a bit stuck into a PR we made to Plotly recently, and here I got a bit lost in all the dispatches 😳 |
I would definitely try and volunteer testing the experimental feature on a pyfixest branch if this would make it into the next formulaic release! |
Same with tabmat, would be happy to test drive narwhals support. |
thanks both! i've given this a go in #226 if you fancy pulling it and trying it out |
dd375ec
to
e7250bc
Compare
f96f2bf
to
740ee4e
Compare
740ee4e
to
9e10f51
Compare
d7ba572
to
fb10d57
Compare
@MarcoGorelli @s3alfisc @stanmart Thanks for you interest and support here. I'm merging this in now, and it will be part of the formulaic 1.2.0 release. It is still considered experimental at this point. I will try to increase testing coverage before the 1.2.0 release, but if you would like to put it through its paces ahead of that time with your libraries that would be greatly appreciated :). |
amazing, thank you for taking this forwards! we'll add this to our downstream tests |
Will spend some time to test it this weekend. Thanks @matthewwardrop & @MarcoGorelli ! |
This patch adds initial support for
narwhals
. It's... patchy.You can try it out using:
Note: The polars backend panics when
na_action
is notignore
.There's a lot of hacks here, including fallbacks to pandas objects in places, and of course we still want sparse materialisation to work (which I don't think other backends support sufficiently to replace scipy sparse matrices).