
[FEA] Investigate the chunked parquet reader for Polars GPU engine #16818

Open

beckernick opened this issue Sep 17, 2024 · 0 comments
beckernick commented Sep 17, 2024

Some users hit out-of-memory errors during IO when loading datasets that they expect to fit comfortably on their GPU. This is currently less significant for cudf.pandas, as we now enable prefetch-optimized unified memory by default.

Because we don't yet have a similar UVM setup for the Polars GPU engine, this is an acute pain point that blocks many workflows. The chunked readers we've developed for Parquet and ORC files may help here, as sketched below.
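
For reference, here is a minimal sketch of driving the chunked Parquet reader directly from Python via pylibcudf. The ChunkedParquetReader name, its keyword arguments, and the .tbl attribute are assumptions based on the low-level bindings and may differ across versions:

import pylibcudf as plc

# Assumed interface: ChunkedParquetReader reads a Parquet file in bounded
# passes instead of materializing the whole file on the GPU at once.
reader = plc.io.parquet.ChunkedParquetReader(
    plc.io.SourceInfo(["lineitem.parquet"]),  # hypothetical input file
    chunk_read_limit=0,            # 0 = no limit on the size of each output chunk
    pass_read_limit=16 * 1024**3,  # cap peak memory per decompression pass
)

chunks = []
while reader.has_next():
    # Each read_chunk() call materializes only a bounded slice of the file.
    chunks.append(reader.read_chunk().tbl)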

Initial testing suggests that a properly configured chunked parquet reader may be effective at reducing peak memory requirements without significantly impacting performance.

For example, running PDS-H q7 at SF200 immediately hits an OOM with the default Parquet reader. With a pass_read_limit of 16GB for the chunked reader, the query completes smoothly and delivers a speedup on an H100 vs. the CPU engine on a high-end CPU (a configuration sketch follows the logs below).

Default CPU engine on a dual socket Intel 8480CL:

Running experiments...
Running scale factor 200.0 on GPU setting false...
Code block 'Run polars query 7' took: 18.65865 s
Experiments completed. ...
Run complete!

Default GPU engine behavior with cuda-async memory resource:

Running experiments...
Running scale factor 200.0 on GPU setting true...
Code block 'Run polars-gpu-cuda-async query 7' took: 3.28927 s
q7 FAILED
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /tmp/pip-build-env-7ci1mf7i/normal/lib/python3.11/site-packages/librmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:111: cudaErrorMemoryAllocation out of memory
Experiments completed. ...
Run complete!

GPU engine behavior with cuda-async memory resource and a pass_read_limit of 16024000000:

Running experiments...
Running scale factor 200.0 on GPU setting true...
Code block 'Run polars-gpu-cuda-async query 7' took: 10.78470 s
Experiments completed. ...
Run complete!
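
For context, here is one plausible way the chunked reader could be surfaced through the Polars GPU engine configuration. The parquet_options mapping and its keys ("chunked", "pass_read_limit") are assumptions about an eventual interface, not a confirmed API:

import polars as pl

# Hypothetical engine configuration: parquet_options and its keys are
# assumed, not confirmed; the limit mirrors the ~16 GB pass_read_limit
# used in the runs above.
engine = pl.GPUEngine(
    parquet_options={
        "chunked": True,
        "pass_read_limit": 16 * 1024**3,
    }
)

result = (
    pl.scan_parquet("lineitem.parquet")  # hypothetical input file
    .group_by("l_returnflag")
    .agg(pl.col("l_extendedprice").sum())
    .collect(engine=engine)
)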

We should do a full evaluation of the chunked Parquet reader on the PDS-H benchmarks to empirically assess the opportunity and tradeoffs of chunked IO. Starting with Parquet makes sense, as it's the more common file format in the PyData world. We can expand from there as needed.

cc @quasiben @brandon-b-miller (offline discussion)

@beckernick beckernick added feature request New feature or request cudf.polars Issues specific to cudf.polars labels Sep 17, 2024
@beckernick beckernick changed the title [FEA] Evaluate and enable the chunked parquet reader for Polars GPU engine [FEA] Investigate the chunked parquet reader for Polars GPU engine Sep 17, 2024
@brandon-b-miller brandon-b-miller self-assigned this Sep 25, 2024