Allow JSON deserialization of StructArray from JSON List #6558

Open
jagill opened this issue Oct 14, 2024 · 2 comments · May be fixed by #6643
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments


jagill commented Oct 14, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, to deserialize a StructArray from JSON, you need to use a JSON Object. E.g., deserializing a Struct<a: i32, b: string> would need something like {"a": 1, "b": "c"}. This is also true of top-level RecordBatches. Some services, such as Presto and Trino, serialize ROW fields as lists, so the example above would be serialized as [1, "c"]. If you already know the schema, this is a more compact representation that reduces the data on the wire.

I would like the ability for arrow_json to deserialize these list-encoded structs and record batches, perhaps under an option flag.
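To make the size difference concrete, here is a minimal, standard-library-only sketch of the two encodings for one row of the Struct<a: i32, b: string> example. The `Row` type and the helper names are hypothetical and exist only for illustration; they are not part of arrow_json, Presto, or Trino.

```rust
/// One row of a hypothetical Struct<a: i32, b: string>.
struct Row {
    a: i32,
    b: String,
}

/// Quote a string as JSON, escaping only `"` and `\` (enough for this sketch;
/// a real encoder must also handle control characters).
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        if c == '"' || c == '\\' {
            out.push('\\');
        }
        out.push(c);
    }
    out.push('"');
    out
}

/// Object encoding: every row repeats the field names.
fn to_object_json(row: &Row) -> String {
    format!("{{\"a\": {}, \"b\": {}}}", row.a, json_string(&row.b))
}

/// List encoding: values only, in schema order.
fn to_list_json(row: &Row) -> String {
    format!("[{}, {}]", row.a, json_string(&row.b))
}
```

The list form drops the field names from every row, which is where the savings come from, but it is only decodable by a reader that already knows the field order.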

Describe the solution you'd like

When a StructArrayDecoder encounters a [, it should switch to a parsing mode that does not look for field names and that requires a closing ] for completion. It should return a parsing error if the number of entries in the list differs from the number of fields in the struct, or if any of the sub-parsers encounters the wrong type. This requires the JSON List to give its values in the same order as the fields of the struct, whereas the current object parsing can handle fields that appear in a different order.
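The rules above (positional matching, an arity check, and per-field type checks) can be sketched roughly as follows. This is a standalone illustration over an already-parsed value, not arrow_json's decoder: the `Json`, `FieldType`, and `Scalar` types and the `decode_list_struct` function are all hypothetical, and only two field types are modeled.

```rust
/// Minimal stand-in for a parsed JSON value (not arrow_json's internal representation).
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Number(i64),
    String(String),
    List(Vec<Json>),
}

/// Hypothetical field types; only the two from the example are modeled.
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Int32,
    Utf8,
}

/// A decoded field value.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int32(i32),
    Utf8(String),
}

/// Decode a list-encoded struct by position. Field names are used only in
/// error messages, because the list carries no names.
fn decode_list_struct(
    fields: &[(&str, FieldType)],
    value: &Json,
) -> Result<Vec<Scalar>, String> {
    let Json::List(items) = value else {
        return Err(format!("expected a JSON list for a struct, got {value:?}"));
    };
    // The list must have exactly one entry per struct field.
    if items.len() != fields.len() {
        return Err(format!(
            "expected {} entries, got {}",
            fields.len(),
            items.len()
        ));
    }
    fields
        .iter()
        .zip(items)
        .map(|((name, ty), item)| match (ty, item) {
            (FieldType::Int32, Json::Number(n)) => i32::try_from(*n)
                .map(Scalar::Int32)
                .map_err(|_| format!("field '{name}': {n} does not fit in i32")),
            (FieldType::Utf8, Json::String(s)) => Ok(Scalar::Utf8(s.clone())),
            // Any other pairing is a type mismatch for that position.
            (ty, other) => Err(format!("field '{name}': expected {ty:?}, got {other:?}")),
        })
        .collect()
}
```

Because matching is purely positional, a list whose values arrive in a different order than the schema fails the type check (or, if the types happen to agree, silently lands in the wrong fields), which is the ordering requirement noted above.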

Describe alternatives you've considered

I currently parse the results with serde_json, then recursively run down the JSON to convert Lists to Objects using the schema. Then I re-serialize the top-level JSON, then read it using arrow_json. This is not very efficient.
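For reference, the recursive list-to-object conversion step in this workaround looks roughly like the sketch below. To keep it standard-library-only, it uses a small hypothetical `Value` enum in place of serde_json::Value and a hypothetical `Schema` enum in place of the Arrow schema; the parse and re-serialize steps on either side are omitted.

```rust
/// Stand-in for a parsed JSON value (the real workaround uses serde_json::Value).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Number(i64),
    String(String),
    List(Vec<Value>),
    Object(Vec<(String, Value)>),
}

/// Hypothetical, simplified schema: scalars, lists, and structs only.
enum Schema {
    Scalar,
    List(Box<Schema>),
    Struct(Vec<(String, Schema)>),
}

/// Walk the value alongside the schema, turning each list-encoded struct
/// into an object keyed by the schema's field names.
fn lists_to_objects(value: Value, schema: &Schema) -> Result<Value, String> {
    match (schema, value) {
        // Scalars pass through unchanged; this sketch does not validate them.
        (Schema::Scalar, v) => Ok(v),
        // A genuine list: convert each element with the element schema.
        (Schema::List(item), Value::List(items)) => items
            .into_iter()
            .map(|v| lists_to_objects(v, item))
            .collect::<Result<Vec<_>, _>>()
            .map(Value::List),
        // A list-encoded struct: pair each entry with its field name by position.
        (Schema::Struct(fields), Value::List(items)) => {
            if fields.len() != items.len() {
                return Err(format!(
                    "expected {} entries, got {}",
                    fields.len(),
                    items.len()
                ));
            }
            fields
                .iter()
                .zip(items)
                .map(|((name, field_schema), v)| {
                    lists_to_objects(v, field_schema).map(|converted| (name.clone(), converted))
                })
                .collect::<Result<Vec<_>, String>>()
                .map(Value::Object)
        }
        (_, other) => Err(format!("value {other:?} does not match the schema")),
    }
}
```

Every value is visited and rebuilt once here, on top of one extra parse and one extra serialization, which is why this route is inefficient compared with decoding the list form directly.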

Alternatively, I could reproduce a lower-quality copy of the version of arrow_json that deserialized using serde_json, either by writing serde_json decoders directly or by taking a serde_json::Value and populating the ArrayBuilders myself. This is a lot of code duplication.

Additional context

I don't personally need the ability to serialize a StructArray/RecordBatch into a List, although that would seem symmetrical.

I am happy to make an RFC PR implementing this functionality.

jagill added the enhancement label Oct 14, 2024
tustvold (Contributor) commented

Provided this functionality is gated by an option on the reader that is disabled by default, and avoids regressing any of the existing benchmarks, this seems reasonable to support.

jagill added a commit to jagill/arrow-rs that referenced this issue Oct 29, 2024
Currently, a StructArray can only be deserialized from a JSON object
(e.g. `{a: 1, b: "c"}`), but some services (e.g. Presto and Trino)
encode ROW types as JSON lists (e.g. `[1, "c"]`) because this is more
compact, and the schema is known.

This PR adds the ability to parse JSON lists into StructArrays, if the
StructParseMode is set to ListOnly.  In ListOnly mode, object-encoded
structs raise an error.  Setting to ObjectOnly (the default) has the
original parsing behavior.

Some notes/questions/points for discussion:
1. I've made a JsonParseMode struct instead of a bool flag for two
   reasons.  One is that it's self-descriptive (what would `true` be?),
   and the other is that it allows a future Mixed mode that could
   deserialize either.  The latter isn't currently requested by anyone.
2. I kept the error messages as similar to the old messages as possible.
   I considered having more specific error messages (like "Encountered a
   '[' when parsing a Struct, but the StructParseMode is ObjectOnly" or
   similar), but wanted to hear opinions before I went that route.
3. I'm not attached to any name/code-style/etc, so happy to modify to
   fit local conventions.

Fixes apache#6558
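
As an illustration of points 1 and 2 in the commit message, a mode enum with an ObjectOnly default and a more specific mismatch error could look like the sketch below. The enum name and variants are taken from the commit message; the `check_struct_start` helper and the exact error wording are hypothetical and are not claimed to match the PR's code.

```rust
/// How struct values are expected to be encoded in the JSON input.
/// ObjectOnly is the default, which preserves the original behavior.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum StructParseMode {
    #[default]
    ObjectOnly,
    ListOnly,
}

/// Hypothetical check on the first character of a struct value:
/// `{` is accepted only in ObjectOnly mode, `[` only in ListOnly mode.
fn check_struct_start(mode: StructParseMode, first: char) -> Result<(), String> {
    match (mode, first) {
        (StructParseMode::ObjectOnly, '{') | (StructParseMode::ListOnly, '[') => Ok(()),
        (mode, other) => Err(format!(
            "Encountered a '{other}' when parsing a Struct, but the StructParseMode is {mode:?}"
        )),
    }
}
```

Compared with a bool flag, the enum names each behavior at the call site and leaves room for a later Mixed variant without changing the option's type.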

jagill commented Oct 30, 2024

Added PR #6643 to implement this (or start a discussion).
