Some users hit out-of-memory errors during IO when loading datasets they expect to fit on their GPU. This is currently less of a problem for cudf.pandas, as we now enable prefetch-optimized unified memory by default.
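For context, a minimal sketch of roughly what that default amounts to in RMM terms; treat the exact resource composition here as an assumption rather than the documented cudf.pandas behavior:

```python
import rmm

# Assumed approximation of the cudf.pandas default: a managed (unified)
# memory resource wrapped in a prefetch adaptor, so oversubscribed data
# migrates between host and device instead of failing outright.
mr = rmm.mr.PrefetchResourceAdaptor(rmm.mr.ManagedMemoryResource())
rmm.mr.set_current_device_resource(mr)
```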
Because we don't currently have a similar UVM setup for the Polars GPU engine, this is an acute pain point that blocks many workflows. We've developed chunked readers for Parquet and ORC files that may help in this situation.
Initial testing suggests that a properly configured chunked Parquet reader may be effective at reducing peak memory requirements without significantly impacting performance.
For example, running PDS-H q7 at SF200 immediately hits an OOM with the default Parquet reader. With a `pass_read_limit` of 16 GB for the chunked reader, the query finishes smoothly and an H100 still delivers a speedup over the CPU engine on a high-end CPU.
Default CPU engine on a dual-socket Intel 8480CL:

```
Running experiments...
Running scale factor 200.0 on GPU setting false...
Code block 'Run polars query 7' took: 18.65865 s
Experiments completed. ...
Run complete!
```
Default GPU engine behavior with cuda-async memory resource:

```
Running experiments...
Running scale factor 200.0 on GPU setting true...
Code block 'Run polars-gpu-cuda-async query 7' took: 3.28927 s
q7 FAILED
MemoryError: std::bad_alloc: out_of_memory: CUDA error at: /tmp/pip-build-env-7ci1mf7i/normal/lib/python3.11/site-packages/librmm/include/rmm/mr/device/cuda_async_view_memory_resource.hpp:111: cudaErrorMemoryAllocation out of memory
Experiments completed. ...
Run complete!
```
GPU engine behavior with cuda-async memory resource and `"pass_read_limit": 16024000000`:

```
Running experiments...
Running scale factor 200.0 on GPU setting true...
Code block 'Run polars-gpu-cuda-async query 7' took: 10.78470 s
Experiments completed. ...
Run complete!
```
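For reference, a minimal sketch of how such a run could be wired up from Python. `pl.GPUEngine` and `rmm.mr.CudaAsyncMemoryResource` are real APIs, but the `parquet_options` mapping and the `"chunked"` key are assumptions here; only `"pass_read_limit"` is taken from the runs above, and the query is a placeholder rather than the actual q7.

```python
import polars as pl
import rmm

# cuda-async memory resource, matching the GPU runs above
mr = rmm.mr.CudaAsyncMemoryResource()

# Hypothetical engine configuration: the "parquet_options"/"chunked" keys
# are assumptions; only "pass_read_limit" comes from the experiments above.
engine = pl.GPUEngine(
    memory_resource=mr,
    parquet_options={
        "chunked": True,
        "pass_read_limit": 16_024_000_000,  # ~16 GB per pass
    },
)

# Placeholder query; the real PDS-H q7 joins several tables.
q = pl.scan_parquet("lineitem.parquet").select(pl.len())
print(q.collect(engine=engine))
```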
We should do a full evaluation of the chunked Parquet reader on the PDS-H benchmarks to empirically assess the opportunity and tradeoffs of chunked IO. Starting with Parquet makes sense, as it's the more common of the two formats in the PyData world. We can expand from there as needed.
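To make that evaluation concrete, here is a rough sketch of the kind of sweep it might run, timing each PDS-H query under the default reader and a few `pass_read_limit` settings. `build_query` and the data are placeholders, and the `parquet_options` keys are the same assumptions as in the sketch above.

```python
import time
import polars as pl

# Hypothetical sweep over PDS-H queries and pass_read_limit settings.
LIMITS = [None, 8_000_000_000, 16_000_000_000, 32_000_000_000]

def build_query(n: int) -> pl.LazyFrame:
    # Placeholder: return a trivial query; real runs would build PDS-H q<n>
    # against the SF200 Parquet files.
    return pl.LazyFrame({"x": [1, 2, 3]}).select(pl.len())

def time_query(query: pl.LazyFrame, pass_read_limit: int | None) -> float:
    opts = {}
    if pass_read_limit is not None:
        # Assumed option names; only pass_read_limit appears in the issue.
        opts["parquet_options"] = {"chunked": True, "pass_read_limit": pass_read_limit}
    engine = pl.GPUEngine(**opts)
    start = time.perf_counter()
    query.collect(engine=engine)
    return time.perf_counter() - start

for qnum in range(1, 23):
    query = build_query(qnum)
    for limit in LIMITS:
        try:
            print(f"q{qnum} limit={limit}: {time_query(query, limit):.2f}s")
        except MemoryError:
            print(f"q{qnum} limit={limit}: OOM")
```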
beckernick changed the title from "[FEA] Evaluate and enable the chunked parquet reader for Polars GPU engine" to "[FEA] Investigate the chunked parquet reader for Polars GPU engine" on Sep 17, 2024
cc @quasiben @brandon-b-miller (offline discussion)