Support of bit vs byte-packed boolean #227
Comments
Hi @AlenkaF, thanks for bringing this up. It should be supported by the design. From https://data-apis.org/dataframe-protocol/latest/design_requirements.html: _Must allow the consumer to inspect the representation for missing values that the producer uses for each column or data type. Sentinel values, bit masks, and boolean masks must be supported._ For the def describe_null(self) -> Tuple[ColumnNullType, Any]:
"""
...
Value : if kind is "sentinel value", the actual value. If kind is a bit
mask or a byte mask, the value (0 or 1) indicating a missing value. None
otherwise. The def get_buffers(self) -> ColumnBuffers:
"""
...
- "validity": a two-element tuple whose first element is a buffer
containing mask values indicating missing data and
whose second element is the mask value buffer's
associated dtype. None if the null representation is
not a bit or byte mask. That buffer has a Better answer: now that I've written all the above, there's an easier answer. The first return value of
so that's the answer right there. The |
@rgommers I believe the question was more about the column data type being a boolean with a memory layout of a single bit per value (Arrow's boolean type layout), not how nulls are represented. @AlenkaF I believe we should be able to represent bit-sized boolean types already using the bitpacked_bool_type = (DtypeKind.BOOL, 1, 'b', '=') I'm not sure if it's currently possible to correctly represent a byte-width boolean, because the protocol uses Arrow C Data Interface format strings, which I don't believe support that. See https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L245-L269 for more details.
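To make the dtype tuples concrete, here is a minimal sketch. It assumes a local stand-in for the protocol's `DtypeKind` enum (only `BOOL` is reproduced). The byte-width tuple uses the Arrow `uint8` format string `'C'`, which is one option discussed in this thread and not something the specification confirms. `buffer_nbytes` is a hypothetical helper for illustration, not part of the protocol.

```python
from enum import IntEnum
from typing import Tuple


class DtypeKind(IntEnum):
    # Stand-in for the protocol's enum; only the member needed here.
    BOOL = 20


# (kind, bit width, Arrow C Data Interface format string, endianness)
Dtype = Tuple[DtypeKind, int, str, str]

# Bit-packed boolean: one bit per value, Arrow format string 'b'.
bitpacked_bool_type: Dtype = (DtypeKind.BOOL, 1, "b", "=")

# ASSUMPTION: byte-width boolean described through its uint8 storage ('C').
bytepacked_bool_type: Dtype = (DtypeKind.BOOL, 8, "C", "=")


def buffer_nbytes(dtype: Dtype, length: int, offset: int = 0) -> int:
    """Minimum number of bytes holding `length` values after `offset` values."""
    _, bitwidth, _, _ = dtype
    return (bitwidth * (offset + length) + 7) // 8
```

With these tuples a consumer can tell the two layouts apart from the bit width alone: ten bit-packed values fit in two bytes, while ten byte-width values need ten.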
While the dtype tuple specification indeed allows you to specify any bitwidth together with a But that's already an issue right now: pandas uses bools of 8 bits, but then specifies EDIT: I see Keith edited his post above to raise the exact same question ;) |
In Arrow's context, you can say that a byte-width boolean array can be represented as an extension type using uint8 as the "storage type". The Arrow C Data Interface then indicates that the format string describes this storage type. In that logic one could say to do:
But I'm not sure that was necessarily our intention for how to use the Arrow format string.
Thanks for all your quick and detailed replies! I agree with Joris on the last comment. Maybe this was not the intention for how to use the Arrow format string, but I feel it describes the use case well, and I'm not sure there are any better options.
When implementing this for Polars, I had some trouble wrapping my head around the You can figure out how to read the bitmask buffer from the Column offset and length. But doesn't it make more sense for |
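As a rough illustration of reading a bit-packed buffer from an offset and length, here is a minimal sketch. It assumes Arrow-style least-significant-bit-first ordering within each byte and an offset counted in values (bits), not bytes. `unpack_bits` is a hypothetical helper, not a function from Polars, PyArrow, or the protocol.

```python
from typing import List


def unpack_bits(data: bytes, offset: int, length: int) -> List[bool]:
    """Read `length` booleans starting at bit `offset` (LSB-first per byte)."""
    return [
        # i >> 3 selects the byte, i & 7 selects the bit within that byte.
        bool((data[i >> 3] >> (i & 7)) & 1)
        for i in range(offset, offset + length)
    ]


# Two bytes holding the bits 1,0,1,1,0,0,0,0 and 0,1,0,0,0,0,0,0.
packed = bytes([0b00001101, 0b00000010])
first_four = unpack_bits(packed, offset=0, length=4)
sliced = unpack_bits(packed, offset=2, length=8)
```

The `sliced` call shows why the offset matters: a sliced column can start in the middle of a byte, so the consumer cannot assume the first value sits at bit 0 of the buffer.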
Indeed, I think we chose the Arrow format strings because they were more comprehensive than the alternative (NumPy-style format strings). But there's a gap here. I think we wanted to go with Ideally we'd get Arrow to add a new type string for a byte-sized logical boolean type. Otherwise we could use
I would be in favour of using an empty string for byte-packed booleans. What do others think?
…aframe (#37975)

### Rationale for this change

Bit-packed booleans are currently not supported in the `from_dataframe` of the Dataframe Interchange Protocol.

Note: We currently represent booleans in the pyarrow implementation as `uint8` which will also need to be changed in a follow-up PR (see data-apis/dataframe-api#227).

### What changes are included in this PR?

This PR adds the support for bit-packed booleans when consuming a dataframe interchange object.

### Are these changes tested?

Only locally, currently!

* Closes: #37145

Lead-authored-by: AlenkaF <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
This topic came up on a PyArrow issue by Polars developers working on their native Dataframe Protocol implementation. To note, in the PyArrow implementation of the protocol we decided to cast bit-packed boolean values to `uint8` when producing the interchange object, and we cast `uint8` to bit-packed boolean when consuming an interchange object.

As this topic came up again and pandas has added support for bitmask conversion in pandas-dev/pandas#52824, it would make sense to try to support bit-packed boolean dtypes in the pyarrow implementation as well (without converting to `uint8`), but I haven't found any information in the specification of the protocol about bit- vs byte-packed boolean values.

Are both bit-packed and byte-packed booleans supported by the Dataframe Interchange Protocol?
cc @stinodego @jorisvandenbossche
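For readers unfamiliar with the two layouts, the conversion described above can be sketched in plain Python. This is not the PyArrow implementation. It assumes Arrow-style least-significant-bit-first packing, and `bytes_to_bits` and `bits_to_bytes` are hypothetical names used only for illustration.

```python
def bytes_to_bits(values: bytes) -> bytes:
    """Pack one-byte-per-value booleans (0 or non-zero) into LSB-first bits."""
    out = bytearray((len(values) + 7) // 8)
    for i, value in enumerate(values):
        if value:
            out[i >> 3] |= 1 << (i & 7)
    return bytes(out)


def bits_to_bytes(packed: bytes, length: int) -> bytes:
    """Expand the first `length` bits into one byte (0 or 1) per value."""
    return bytes((packed[i >> 3] >> (i & 7)) & 1 for i in range(length))


# Ten byte-width booleans shrink to two bytes when bit-packed.
byte_packed = bytes([1, 0, 1, 1, 0, 0, 0, 0, 0, 1])
bit_packed = bytes_to_bits(byte_packed)
round_trip = bits_to_bytes(bit_packed, len(byte_packed))
```

The length has to travel alongside the bit-packed buffer, because the trailing padding bits in the last byte are otherwise indistinguishable from `False` values.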