Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add narwhals materializer for dataframe agnosticism. #189

Merged
merged 5 commits into from
Jan 9, 2025

Conversation

matthewwardrop
Copy link
Owner

@matthewwardrop matthewwardrop commented Aug 6, 2024

This patch adds initial support for narwhals. It's... patchy.

You can try it out using:

import narwhals as nw
import pandas as pd
from formulaic import model_matrix

df = pd.DataFrame({"a": [1,2,None], "b": list("abc"), "c": pd.Categorical(list("abc"))})
model_matrix("a + b + c", df, materializer='narwhals', na_action="ignore")

image

import polars
model_matrix("a + b + c", polars.DataFrame._from_pandas(df), materializer='narwhals', na_action="ignore")

image

Note: The polars backend panics when na_action is not ignore.

There's a lot of hacks here, including fallbacks to pandas objects in places, and of course we still want sparse materialisation to work (which I don't think other backends support sufficiently to replace scipy sparse matrices).

@matthewwardrop
Copy link
Owner Author

As discussed in #187 , here's an initial PR for your review @MarcoGorelli. It's definitely not ready to replace the default backend, but it isn't toooo far either.

@matthewwardrop
Copy link
Owner Author

@MarcoGorelli Sorry, just in case it wasn't entirely clear from our conversation in #187 and my tagging you here, I was hoping you could take a look and suggest how to fill the gaps and/or improve things. No worries if you do not have time for now.

@MarcoGorelli
Copy link
Contributor

Thanks for the ping! Yup, definitely taking a look, just had some holiday recently 🌴

Copy link
Contributor

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for trying this out! left some comments, make some todos in Narwhals itself, but I really like the look of this and I'd like to invest more time into making sense of it and bringing it forwards! I'm away this weekend but I'll get back to this soon after

Comment on lines 41 to 45
if not isinstance(values.dtype, nwd.NumericType):
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you also consider datetimes to be categorical?

Comment on lines 124 to 129
@drop_rows.register
def _(values: narwhals.Series, indices: Sequence[int]) -> narwhals.Series:
# TODO: Is there a better narwhals native way to do this?
selector = numpy.zeros(len(values), dtype=bool)
selector[list(indices)] = True
return values.filter(~selector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! I'd have thought there would be an easier way to do this, but it seems there's not in Polars either

I was also expecting that passing a numpy array to Series.filter would work, but that raises, as Polars expects a Series or a list:

In [15]: s = pl.Series([5, 2, 4, 1, 2])

In [16]: mask = np.array([True, True, True, False, False])

In [17]: s.filter(mask.tolist())
Out[17]:
shape: (3,)
Series: '' [i64]
[
        5
        2
        4
]

In [18]: s.filter(mask)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[18], line 1
----> 1 s.filter(mask)

File ~/polars-api-compat-dev/.venv/lib/python3.11/site-packages/polars/series/series.py:2993, in Series.filter(self, predicate)
   2991 if isinstance(predicate, list):
   2992     predicate = Series("", predicate)
-> 2993 return self._from_pyseries(self._s.filter(predicate._s))

AttributeError: 'numpy.ndarray' object has no attribute '_s'

I'll raise an issue about this in Polars and see what they say there.

Possible solutions could be:

  • for now, to do return values.filter((~selector).tolist()). This should be enough to address the error you're seeing when na_action isn't 'ignore'
  • in Narwhals, we make it easy to create a narwhals Series from a numpy array. Issue: Easy way to create nw.Series from numpy array narwhals-dev/narwhals#754
  • we support passing numpy arrays to filter (I'll report this to Polars and see what they think there)

We'll get back to you on this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm not sure there is an easier way to do this with the Polars API at the moment

Comment on lines 94 to 98
if isinstance(values, nw.Series):
values = values.to_pandas()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be possible to keep this Narwhals-native, as we have:

will try this out and let you know

Comment on lines 70 to 74
@find_nulls.register
def _(values: narwhals.Series) -> Set[int]:
return set(values.is_null().arg_true().to_list())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 🥳

Comment on lines 196 to 190
# TODO: Can we do better than this? Having to reconstitute raw data
# does not seem ideal.
combined = nw.from_dict(
{
name: (nw.to_native(col) if isinstance(col, nw.Series) else col)
for name, col in cols
},
native_namespace=nw.get_native_namespace(self.__narwhals_data),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this should be supported (and easier!) - I've opened an issue about it here narwhals-dev/narwhals#756

@MarcoGorelli
Copy link
Contributor

hey - quick update, I've got most of it working in a branch, but am a bit busy with some conferences now - will get back to it soon. realistically I hope to have something review-ready in October 🤞 this has been quite interesting to work on!

@matthewwardrop
Copy link
Owner Author

Hi @MarcoGorelli ! We are nearing a 1.1 release, and I'm thinking to include this on something close to its current state as an experimental feature. I'll try to patch any obvious bugs. If you have a branch I'm also happy to cherry pick any fixes you might have come up with :).

@MarcoGorelli
Copy link
Contributor

Hi @MarcoGorelli ! We are nearing a 1.1 release, and I'm thinking to include this on something close to its current state as an experimental feature. I'll try to patch any obvious bugs. If you have a branch I'm also happy to cherry pick any fixes you might have come up with :).

hey, sounds great thanks! sorry for the delay, I got a bit stuck into a PR we made to Plotly recently, and here I got a bit lost in all the dispatches 😳

@s3alfisc
Copy link

s3alfisc commented Dec 2, 2024

I would definitely try and volunteer testing the experimental feature on a pyfixest branch if this would make it into the next formulaic release!

@stanmart
Copy link

stanmart commented Dec 4, 2024

Same with tabmat, would be happy to test drive narwhals support.

@MarcoGorelli
Copy link
Contributor

thanks both! i've given this a go in #226 if you fancy pulling it and trying it out

@matthewwardrop matthewwardrop force-pushed the main branch 4 times, most recently from dd375ec to e7250bc Compare December 16, 2024 08:08
@matthewwardrop matthewwardrop force-pushed the explore_narwhals branch 2 times, most recently from f96f2bf to 740ee4e Compare December 17, 2024 06:49
@matthewwardrop matthewwardrop added this to the 1.2.0 milestone Dec 17, 2024
@matthewwardrop matthewwardrop marked this pull request as ready for review January 9, 2025 18:00
@matthewwardrop
Copy link
Owner Author

@MarcoGorelli @s3alfisc @stanmart Thanks for you interest and support here. I'm merging this in now, and it will be part of the formulaic 1.2.0 release. It is still considered experimental at this point. I will try to increase testing coverage before the 1.2.0 release, but if you would like to put it through its paces ahead of that time with your libraries that would be greatly appreciated :).

@matthewwardrop matthewwardrop merged commit 8552610 into main Jan 9, 2025
5 checks passed
@matthewwardrop matthewwardrop deleted the explore_narwhals branch January 9, 2025 18:03
@MarcoGorelli
Copy link
Contributor

amazing, thank you for taking this forwards! we'll add this to our downstream tests

@s3alfisc
Copy link

s3alfisc commented Jan 9, 2025

Will spend some time to test it this weekend. Thanks @matthewwardrop & @MarcoGorelli !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants