
feat(duckdb): add to_json() to Table and duckdb backend #10681

Open · wants to merge 1 commit into main
Conversation

@NickCrews (Contributor) commented Jan 17, 2025

Fixes #10413

I didn't bother adding this to any of the other backends. There is no handy pyarrow.to_json() function that can save the day here, so I think we would need backend-specific implementations. (A rough usage sketch follows the todo list below.)

Open todos that could come later, but that I don't think should block this PR:

  • Consider taking some of the duckdb-specific config options and hoisting them to be "official" options across all backends. I could see a format=Literal["array", "jsonl"] = "array" being the most likely choice, but it's not needed until we survey the other backends.
  • For dates/datetimes/decimals, where there is no corresponding JSON type, we need to decide what to do.
  • Once we decide, implement that.
  • Maybe add more tests to ensure that backends produce the same JSON strings? E.g. the current test, which just reads the file back with pandas, may be too lenient: multiple different on-disk values could read back to the same result. Maybe we actually want to look at the on-disk representation.
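For concreteness, roughly the usage this enables (a hypothetical sketch based on the description above; the exact signature and the generated SQL may differ from the final implementation):

```python
import ibis

t = ibis.memtable({"a": [1, 2], "b": ["x", "y"]})

# with the DuckDB backend, writes one JSON array of row objects:
# [{"a":1,"b":"x"},{"a":2,"b":"y"}]
t.to_json("out.json")

# under the hood this maps onto DuckDB's JSON writer, approximately:
#   COPY (SELECT a, b FROM ...) TO 'out.json' (FORMAT JSON, ARRAY true)
```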

github-actions bot added the tests (Issues or PRs related to tests) and duckdb (The DuckDB backend) labels on Jan 17, 2025
@cpcloud (Member) commented Jan 17, 2025

> I could see a format=Literal["array", "jsonl"] = "array" being the most likely choice, but not needed until we survey the other backends.

I think I'd rather have to_json and to_jsonl methods. Then there's never any ambiguity about what the word "JSON" means when you're reading the code.
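That is, roughly (hypothetical methods; only to_json is in this PR):

```python
t.to_json("out.json")    # a single JSON array of row objects
t.to_jsonl("out.jsonl")  # newline-delimited JSON, one object per line
```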

> Maybe add more tests to ensure that backends produce the same JSON strings?

I think we should avoid this and instead test that round-tripping produces semantically equivalent pyarrow tables or dataframes.

For example, two JSONL files with the same rows in a different order are equivalent, but comparing their on-disk representation would indicate they were different.
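For example, a round-trip test could look roughly like this (a sketch only; the awards_players fixture and playerID column are stand-ins borrowed from the existing backend test suite, and the exact comparison may differ):

```python
import pandas as pd
import pandas.testing as tm


def test_to_json_roundtrip(tmp_path, awards_players):
    path = tmp_path / "out.json"
    awards_players.to_json(path)

    # read the file back and compare contents, not on-disk bytes
    result = pd.read_json(path, orient="records")
    expected = awards_players.to_pandas()

    # sort both sides so row order doesn't matter; dtype inference from
    # JSON may not round-trip exactly, so compare values only
    tm.assert_frame_equal(
        result.sort_values("playerID").reset_index(drop=True),
        expected.sort_values("playerID").reset_index(drop=True),
        check_dtype=False,
    )
```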

@NickCrews (Contributor, Author) commented Jan 18, 2025

> I think I'd rather have to_json and to_jsonl methods

Sounds good to me. I'll keep this as to_json for now and NOT expose any format options; we can add to_jsonl later if requested.

> I think we should avoid this and instead test that round-tripping produces semantically equivalent pyarrow tables or dataframes.

Want me to switch from pandas (as I do here) to pyarrow?

@NickCrews (Contributor, Author) commented

I added trino as an xfail. Now tests should be passing.

I also implemented all the duckdb-specific options: compression, dateformat, and timestampformat. I didn't add any tests for them, though, since that felt like overkill.

I explicitly did NOT add the array option, which means users cannot get JSONL through this backdoor. I want that to be implemented properly via the to_jsonl solution when the time comes.
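For reference, a rough sketch of how those options pass through (the keyword names mirror DuckDB's JSON COPY options; the generated SQL shown is approximate, not the literal code in this PR):

```python
t.to_json(
    "out.json.gz",
    compression="gzip",
    dateformat="%Y-%m-%d",
    timestampformat="%Y-%m-%dT%H:%M:%S.%f",
)

# approximately the DuckDB statement this produces:
#   COPY (SELECT ...) TO 'out.json.gz' (
#       FORMAT JSON, ARRAY true, COMPRESSION gzip,
#       DATEFORMAT '%Y-%m-%d', TIMESTAMPFORMAT '%Y-%m-%dT%H:%M:%S.%f'
#   )
```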

@NickCrews (Contributor, Author) commented

The failing CI job looks like an unrelated flake; it didn't fail in the previous CI run.
